Cascading - Data Processing Workflows on Hadoop


Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster. It is a thin Java library and API that sits on top of Hadoop's MapReduce layer and is executed from the command line like any other Hadoop application.
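As an illustration of the API, here is a minimal sketch of the classic word-count flow, assuming the Cascading 2.x Hadoop planner; the input and output paths are hypothetical command-line arguments:

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    // Source and sink taps bind the flow to HDFS paths (hypothetical args).
    Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
    Tap sink = new Hfs(new TextLine(), args[1]);

    // Split each line into words, group equal words, count each group.
    Pipe pipe = new Pipe("wordcount");
    pipe = new Each(pipe, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    pipe = new GroupBy(pipe, new Fields("word"));
    pipe = new Every(pipe, new Count(new Fields("count")));

    // The planner turns the pipe assembly into one or more MapReduce jobs.
    Flow flow = new HadoopFlowConnector(new Properties()).connect(source, sink, pipe);
    flow.complete();
  }
}
```

Packaged into a jar, this is run with `hadoop jar` from the command line, just as the description above notes.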




Related Projects

Hadoop Common

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. Hadoop Common supports the other Hadoop subprojects.

Schedulix - Enterprise Job Scheduling System

Schedulix is an open source enterprise job scheduling system that meets the complex requirements of modern IT process automation. It supports complex, hierarchical workflow modelling; dynamic submission and parallelization of workflows; automatic reruns of sub-workflows; load balancing; sticky allocations; time scheduling; and more.

Hue - The open source Apache Hadoop UI

Hue is a web application for interacting with Apache Hadoop. It includes a File Browser for accessing HDFS, a Job Browser for inspecting MapReduce jobs (MR1/MR2-YARN), a Job Designer for creating MapReduce/Streaming/Java jobs, an HBase Browser for exploring and modifying HBase tables and data, an Oozie app for submitting and scheduling workflows and bundles, Pig/HBase/Sqoop2 shells, a Beeswax application for executing Hive queries, and a Search app for querying Solr and Solr Cloud.


Quartz.NET - Enterprise Job Scheduler for .NET

Quartz.NET is a full-featured, open source job scheduling system that can be used from the smallest apps to large-scale enterprise systems. It is a port of the very popular open source Java job scheduling framework, Quartz.
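Since Quartz.NET closely mirrors the API of the Java framework it was ported from, a minimal sketch of the original Java Quartz gives the flavor of both; the job and trigger names here are hypothetical:

```java
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class HelloJob implements Job {
  @Override
  public void execute(JobExecutionContext context) {
    System.out.println("Hello from Quartz");
  }

  public static void main(String[] args) throws Exception {
    Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
    scheduler.start();

    // Run HelloJob every ten seconds until the scheduler shuts down.
    JobDetail job = JobBuilder.newJob(HelloJob.class)
        .withIdentity("helloJob", "examples").build();
    Trigger trigger = TriggerBuilder.newTrigger()
        .withIdentity("everyTenSeconds", "examples")
        .startNow()
        .withSchedule(SimpleScheduleBuilder.simpleSchedule()
            .withIntervalInSeconds(10)
            .repeatForever())
        .build();
    scheduler.scheduleJob(job, trigger);
  }
}
```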

Katta - Lucene and more in the cloud.

Katta is a scalable, fault-tolerant, distributed data store for real-time access. It serves large, replicated indices as shards to sustain high loads on very large data sets. These indices can be of different types; implementations are currently available for Lucene indices and Hadoop mapfiles.

ANTLR - ANother Tool for Language Recognition

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It's widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build and walk parse trees. Twitter search uses ANTLR for query parsing, with over 2 billion queries a day.
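To give a feel for the workflow, here is a hedged sketch: a tiny expression grammar (shown in comments) and the Java driver one would write against the lexer and parser ANTLR generates from it. `ExprLexer` and `ExprParser` are hypothetical generated class names:

```java
// Expr.g4 (hypothetical grammar the classes below are generated from):
//   grammar Expr;
//   expr : expr ('*'|'/') expr
//        | expr ('+'|'-') expr
//        | INT ;
//   INT  : [0-9]+ ;
//   WS   : [ \t\r\n]+ -> skip ;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class ParseDemo {
  public static void main(String[] args) {
    // Lex and parse an expression using the ANTLR-generated classes.
    ExprLexer lexer = new ExprLexer(CharStreams.fromString("1 + 2 * 3"));
    ExprParser parser = new ExprParser(new CommonTokenStream(lexer));
    ParseTree tree = parser.expr();                 // start at rule `expr`
    System.out.println(tree.toStringTree(parser));  // LISP-style parse tree
  }
}
```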


Nutch - Open source web-search software

Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics such as a crawler, a link-graph database, and parsers for HTML and other document formats.

HBase - Hadoop database

HBase is designed to handle Bigtable-scale data: billions of rows by millions of columns. It is a scalable, distributed, versioned, column-oriented store modeled after Google's Bigtable that runs on top of HDFS (the Hadoop Distributed File System). It features compression, in-memory operation, and Bloom filters on a per-column basis. Data can be replicated between nodes, and HBase is used at Facebook and Twitter.
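A minimal sketch of the HBase Java client illustrates the row / column family / qualifier data model; the table name `users` and column family `info` are hypothetical, and the code assumes the HBase 1.x+ client API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // Write one cell: row key, column family, qualifier, value.
      Put put = new Put(Bytes.toBytes("row-1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
          Bytes.toBytes("Ada"));
      table.put(put);

      // Read the cell back by row key.
      Result result = table.get(new Get(Bytes.toBytes("row-1")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```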


Apache Avro - Data serialization system

Avro is a data serialization system. It is a subproject of Apache Hadoop.

Apache Mahout - Scalable machine learning library

Apache Mahout has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.
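As one example, a user-based collaborative-filtering recommender can be assembled from the classes in Mahout's Taste packages in a few lines; the input file `ratings.csv` (rows of userID,itemID,preference) is hypothetical:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
  public static void main(String[] args) throws Exception {
    // Load user-item preferences from a CSV file (hypothetical path).
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Find the 10 users most similar to each user by Pearson correlation.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 item recommendations for user 1.
    List<RecommendedItem> recs = recommender.recommend(1L, 3);
    for (RecommendedItem rec : recs)
      System.out.println(rec.getItemID() + " " + rec.getValue());
  }
}
```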
