Cascading - Data Processing Workflows on Hadoop


Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster. It is a thin Java library and API that sits on top of Hadoop's MapReduce layer and is executed from the command line like any other Hadoop application.

http://www.cascading.org/
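To give a feel for the API, here is a minimal sketch of a flow that copies lines from one HDFS path to another, assuming the Cascading 2.x Hadoop planner; the input and output paths are placeholders:

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class CopyFlow {
      public static void main(String[] args) {
        // taps bind the flow to concrete HDFS paths (placeholders here)
        Tap source = new Hfs(new TextLine(new Fields("line")), "input/path");
        Tap sink = new Hfs(new TextLine(), "output/path", SinkMode.REPLACE);

        // a pipe with no operations simply moves tuples from source to sink;
        // functions, filters, and aggregators would be layered on with Each/Every
        Pipe pipe = new Pipe("copy");

        // the planner compiles the pipe assembly into one or more MapReduce jobs
        Flow flow = new HadoopFlowConnector(new Properties()).connect(source, sink, pipe);
        flow.complete();  // runs the jobs and blocks until done
      }
    }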



Related Projects

Hadoop Common


Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. Hadoop Common provides the shared utilities that support the other Hadoop subprojects.
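As a reminder of the programming model Hadoop exposes (and that projects like Cascading build on), here is a sketch of the mapper half of the canonical word-count example, written against the newer org.apache.hadoop.mapreduce API:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper half of the canonical word-count example
    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);  // emit (word, 1) for the reducer to sum
        }
      }
    }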

Schedulix - Enterprise Job Scheduling System


Schedulix is an open source enterprise job scheduling system that meets the complex requirements of modern IT process automation. It supports complex, hierarchical workflow modelling; workflows that can be submitted dynamically or run in parallel; automatic reruns of sub-workflows; load balancing; sticky allocations; time scheduling; and more.

Hue - The open source Apache Hadoop UI


Hue is a web application for interacting with Apache Hadoop. It includes a FileBrowser for accessing HDFS, a JobBrowser for accessing MapReduce jobs (MR1/MR2-YARN), a Job Designer for creating MapReduce/Streaming/Java jobs, an HBase Browser for exploring and modifying HBase tables and data, an Oozie app for submitting and scheduling workflows and bundles, a Pig/HBase/Sqoop2 shell, a Beeswax application for executing Hive queries, and a Search app for querying Solr and Solr Cloud.

Quartz.NET


Quartz.NET is a full-featured, open source job scheduling system that can be used from the smallest apps to large-scale enterprise systems. It is a port of the very popular open source Java job scheduling framework, Quartz.
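Quartz.NET mirrors the Java Quartz API closely; as a rough sketch in the original Java form (to match the other examples here), scheduling a job that fires every ten seconds looks like this, assuming Quartz 2.x. The job and trigger names are illustrative:

    import org.quartz.Job;
    import org.quartz.JobBuilder;
    import org.quartz.JobDetail;
    import org.quartz.JobExecutionContext;
    import org.quartz.Scheduler;
    import org.quartz.SimpleScheduleBuilder;
    import org.quartz.Trigger;
    import org.quartz.TriggerBuilder;
    import org.quartz.impl.StdSchedulerFactory;

    public class HelloQuartz {
      // a job is just a class implementing the Job interface
      public static class HelloJob implements Job {
        public void execute(JobExecutionContext context) {
          System.out.println("Hello from Quartz");
        }
      }

      public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();

        JobDetail job = JobBuilder.newJob(HelloJob.class)
            .withIdentity("helloJob", "demo").build();

        // fire immediately, then every 10 seconds
        Trigger trigger = TriggerBuilder.newTrigger()
            .withIdentity("helloTrigger", "demo")
            .startNow()
            .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                .withIntervalInSeconds(10)
                .repeatForever())
            .build();

        scheduler.scheduleJob(job, trigger);
      }
    }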

Katta - Lucene and more in the cloud.


Katta is a scalable, failure-tolerant, distributed data store for real-time access. It serves large, replicated indices as shards in order to handle high loads and very large data sets. These indices can be of different types; implementations are currently available for Lucene indices and Hadoop mapfiles.

ANTLR - ANother Tool for Language Recognition


ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It's widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build and walk parse trees. Twitter search uses ANTLR for query parsing, with over 2 billion queries a day.
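As a sketch of how a generated parser is driven from Java, suppose ANTLR 4 has been run on a hypothetical Expr.g4 grammar with a start rule named expr, producing ExprLexer and ExprParser (the grammar, rule, and class names are illustrative, not from the source):

    import org.antlr.v4.runtime.CharStreams;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.tree.ParseTree;

    public class ParseDemo {
      public static void main(String[] args) {
        // ExprLexer/ExprParser are classes ANTLR generates from a hypothetical Expr.g4
        ExprLexer lexer = new ExprLexer(CharStreams.fromString("1 + 2 * 3"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        ExprParser parser = new ExprParser(tokens);

        // 'expr' is assumed to be the grammar's start rule
        ParseTree tree = parser.expr();
        System.out.println(tree.toStringTree(parser));  // LISP-style parse tree
      }
    }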

Nutch


Nutch is open source web-search software. It builds on Lucene Java, adding web-specific features such as a crawler, a link-graph database, and parsers for HTML and other document formats.

HBase - Hadoop database


HBase is designed to handle BigTable-scale data: billions of rows by millions of columns. It is a scalable, distributed, versioned, column-oriented store modeled after Google's Bigtable, and it runs on top of HDFS (the Hadoop Distributed Filesystem). It features compression and in-memory operation, configurable per column family, and data can be replicated between nodes. HBase is used at Facebook and Twitter.
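A minimal sketch of the HBase Java client API (the 1.0+ Connection/Table style); the table name, column family, and values here are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

          // write one cell: row "row1", column family "info", qualifier "name"
          Put put = new Put(Bytes.toBytes("row1"));
          put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
          table.put(put);

          // read it back
          Result result = table.get(new Get(Bytes.toBytes("row1")));
          byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
          System.out.println(Bytes.toString(value));
        }
      }
    }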

Avro


Avro is a data serialization system. It is a subproject of Apache Hadoop.
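A minimal sketch of Avro's Java API, serializing one record of a hypothetical inline schema to Avro's compact binary encoding:

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroDemo {
      public static void main(String[] args) throws Exception {
        // a hypothetical record schema, defined inline as JSON
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Ada");
        user.put("age", 36);

        // serialize the record to Avro's binary encoding
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        System.out.println(out.size() + " bytes");
      }
    }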

Apache Mahout - Scalable machine learning library


Apache Mahout has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.
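As a sketch, here is a user-based recommender built with the classic Taste collaborative-filtering API from earlier Mahout releases; ratings.csv is a hypothetical file of userID,itemID,rating lines:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderDemo {
      public static void main(String[] args) throws Exception {
        // ratings.csv is a hypothetical userID,itemID,rating file
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // top 3 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items)
          System.out.println(item.getItemID() + " " + item.getValue());
      }
    }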