Cascading - Data Processing Workflows on Hadoop


Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster. It is a thin Java library and API that sits on top of Hadoop's MapReduce layer and is executed from the command line like any other Hadoop application.

http://www.cascading.org/
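
As a rough illustration of the API, the sketch below wires a single source tap, pipe, and sink tap into a flow and lets Cascading's process planner turn it into MapReduce jobs. It assumes Cascading 2.x class names (HadoopFlowConnector, Hfs, TextLine); the `CopyFlow` class and the HDFS paths taken from the command line are hypothetical.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class CopyFlow {
  public static void main(String[] args) {
    // Source and sink taps over HDFS paths, reading and writing plain text lines
    Tap inTap = new Hfs(new TextLine(), args[0]);
    Tap outTap = new Hfs(new TextLine(), args[1]);

    // A single named pipe that simply moves tuples from source to sink
    Pipe copy = new Pipe("copy");

    FlowDef flowDef = FlowDef.flowDef()
        .addSource(copy, inTap)
        .addTailSink(copy, outTap);

    // The process planner translates the pipe assembly into one or more MapReduce jobs
    Flow flow = new HadoopFlowConnector(new Properties()).connect(flowDef);
    flow.complete();
  }
}
```

Packaged into a jar with its dependencies, a flow like this runs with `hadoop jar` like any other MapReduce application.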


Related Projects

jumbune - Jumbune is an open-source project to optimize both YARN (v2) and older (v1) Hadoop-based solutions


Jumbune is an open-source product built for analyzing Hadoop clusters and MapReduce jobs. It provides development and administrative insights into Hadoop-based analytical solutions, and enables users to debug, profile, monitor, and validate analytical solutions hosted on decoupled clusters.

Ambari - Monitor Hadoop Cluster


The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. The set of Hadoop components currently supported by Ambari includes HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.

geoprocessing-tools-for-hadoop


Geoprocessing tools that let ArcGIS users move data to and from a Hadoop system and run Hadoop workflow jobs. See these tools in action as part of the [samples](https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples) in [GIS Tools for Hadoop](https://github.com/Esri/gis-tools-for-hadoop).

Cascalog - Data processing on Hadoop


Cascalog is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.

Sahale - A Cascading Workflow Visualizer


A tool to record and visualize metrics captured from Cascading (Scalding) workflows at runtime. Designed to target the pain points of analysts and end users of Cascading, Sahale provides insight into a workflow's runtime resource usage and makes debugging jobs and locating the relevant Hadoop logs easy. The tool reveals optimization opportunities by exposing inefficient MapReduce jobs within a larger workflow, and enables users to track the execution history of their workflows.



mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services


mrjob is a Python 2.7/3.3+ package that helps you write and run Hadoop Streaming jobs. It fully supports Amazon's Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. mrjob has basic support for Google Cloud Dataproc (Dataproc) which allows you to buy time on a Hadoop cluster on a minute-by-minute basis. It also works with your own Hadoop cluster.

hadoop-ops-tools - Hadoop cluster operation tools


Hadoop cluster operation tools

sahara - Sahara aims to provide users with simple means to provision a Hadoop cluster by specifying several parameters like Hadoop version, cluster topology, node hardware details, and a few more


Sahara aims to provide users with simple means to provision a Hadoop cluster by specifying several parameters like Hadoop version, cluster topology, node hardware details, and a few more.

Hadoop Common


Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. Hadoop Common supports the other Hadoop subprojects.

cascading.plumber - Switching between Hadoop Cascading and InMemory Cascading made simple.


Switching between Hadoop Cascading and InMemory Cascading made simple.
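
A minimal sketch of the underlying idea, not plumber's own API: plain Cascading already allows the same FlowDef to run either on a Hadoop cluster or entirely in local memory by swapping the FlowConnector, and plumber packages this switch behind one interface. The class names below are standard Cascading 2.x; the `FlowRunner` helper is hypothetical.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.flow.local.LocalFlowConnector;

// Hypothetical helper illustrating the switch plumber automates.
public class FlowRunner {
  public static void run(FlowDef flowDef, boolean onHadoop) {
    FlowConnector connector = onHadoop
        ? new HadoopFlowConnector(new Properties())   // plans MapReduce jobs on the cluster
        : new LocalFlowConnector(new Properties());   // runs the same assembly in local memory
    Flow flow = connector.connect(flowDef);
    flow.complete();
  }
}
```

The remaining wrinkle is that the taps inside the FlowDef must match the execution fabric (Hfs for Hadoop, FileTap for local), which is exactly the kind of plumbing this project abstracts away.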

TellApart-Hadoop-Utils - Utilities for working with Hadoop and Cascading


Utilities for working with Hadoop and Cascading

gis-tools-for-hadoop


* [Tutorial: An Introduction for Beginners](https://github.com/Esri/gis-tools-for-hadoop/wiki/GIS-Tools-for-Hadoop-for-Beginners)
* [Tutorial: Aggregating Data Into Bins](https://github.com/Esri/gis-tools-for-hadoop/wiki/Aggregating-CSV-Data-%28Spatial-Binning%29)
* [Tutorial: Correcting your ArcGIS Projection](https://github.com/Esri/gis-tools-for-hadoop/wiki/Correcting-Projection-in-ArcGIS)
* [Updated Wiki page for the Spatial-Framework-for-Hadoop](https://github.com/Esri/spatial-framework-for-h

handson-hadoop


This tutorial aims to present Hadoop in a pragmatic (and hopefully amusing) manner, through a series of exercises with minimal theoretical input. It is intended for Hadoop beginners and is focused on 4 distinct components / libraries commonly used to query Hadoop clusters: "Vanilla" Hadoop, Cascading, Pig and Hive.

spatial-framework-for-hadoop


The __Spatial Framework for Hadoop__ allows developers and data scientists to use the Hadoop data processing system for spatial data analysis. For tools, [samples](https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples), and [tutorials](https://github.com/Esri/gis-tools-for-hadoop/wiki) that use this framework, head over to [GIS Tools for Hadoop](https://github.com/Esri/gis-tools-for-hadoop).

riak-cascading - Cascading wrapper on top of riak-hadoop


Cascading wrapper on top of riak-hadoop

cascading-batch-query - Optimized joins using bloom filters on Hadoop via Cascading.


Optimized joins using bloom filters on Hadoop via Cascading.

cascading-simhash - simple simhashing in hadoop with cascading


simple simhashing in hadoop with cascading

cwensel-cascading


Cascading is a feature-rich API for defining and executing complex and fault-tolerant data processing workflows on a Hadoop cluster.

vagrant-hadoop-cluster - A mini Hadoop cluster configuration in Vagrant.


A mini Hadoop cluster configuration in Vagrant.

vagrant-hadoop-cluster - Deploying hadoop in a virtualized cluster in simple steps


Deploying hadoop in a virtualized cluster in simple steps