Cascalog - Data processing on Hadoop


Cascalog is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.
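
To give a flavor of the library, here is a minimal query in Cascalog's datalog-inspired syntax, adapted from the style of the project's introductory examples (`age` is assumed to be a generator of `[person age]` tuples; treat this as a sketch, not canonical API usage):

```clojure
(use 'cascalog.api)

;; Emit every ?person whose ?age is under 30, writing results to stdout.
;; The query planner compiles this declarative form into MapReduce jobs.
(?<- (stdout)
     [?person]
     (age ?person ?age)
     (< ?age 30))
```

The same query runs unchanged on a Hadoop cluster or in local mode, which is what makes Cascalog usable both for "Big Data" processing and for analysis on a single machine.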

http://cascalog.org/
https://github.com/nathanmarz/cascalog

Related Projects

Apache Tajo - A big data warehouse system on Hadoop


Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load processing) on large data sets stored in HDFS (Hadoop Distributed File System) and other data sources.

Apache Hive - The Apache Hive (TM) data warehouse software facilitates querying and managing large datasets


The Apache Hive (TM) data warehouse software facilitates querying and managing large datasets residing in distributed storage.

HPCC System - Hadoop alternative


HPCC is a proven and battle-tested platform for manipulating, transforming, querying, and warehousing Big Data. It supports two types of configuration. Thor is responsible for consuming vast amounts of data and for transforming, linking, and indexing that data; it functions as a distributed file system with parallel processing power spread across the nodes. Roxie, the Data Delivery Engine, provides separate high-performance online query processing and data warehouse capabilities.

Kylin - Extreme OLAP Engine for Big Data


Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets; it was originally contributed by eBay Inc. It is designed to reduce query latency on Hadoop for datasets of 10+ billion rows. It offers ANSI SQL on Hadoop and supports most ANSI SQL query functions.

jumbune - An open-source project to optimize both YARN (v2) and older (v1) Hadoop-based solutions


Jumbune is an open-source product built for analyzing Hadoop clusters and MapReduce jobs. It provides development and administrative insights into Hadoop-based analytical solutions, enabling users to debug, profile, monitor, and validate analytical solutions hosted on decoupled clusters.

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop


Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce's ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.

Shark - Hive on Spark


Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users, running Hive queries up to 100x faster in memory, or 10x on disk. It is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

AsterixDB - Big Data Management System (BDMS)


AsterixDB is a BDMS (Big Data Management System) with a rich feature set that sets it apart from other Big Data platforms. Its feature set makes it well-suited to modern needs such as web data warehousing and social data storage and analysis. It is a highly scalable data management system that can store, index, and manage semi-structured data, but it also supports a full-power query language with the expressiveness of SQL (and more).

ANTLR - ANother Tool for Language Recognition


ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It's widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build and walk parse trees. Twitter search uses ANTLR for query parsing, with over 2 billion queries a day.
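
The grammar-to-parser workflow can be seen in the canonical "hello world" grammar from the ANTLR 4 getting-started documentation; running the ANTLR tool on this file generates a lexer and parser that recognize inputs like `hello world`:

```antlr
// Hello.g4 — the standard ANTLR 4 starter grammar
grammar Hello;
r  : 'hello' ID ;          // match "hello" followed by an identifier
ID : [a-z]+ ;              // lowercase identifiers
WS : [ \t\r\n]+ -> skip ;  // skip whitespace between tokens
```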

hadoop-20-warehouse - Facebook Data Warehouse based on Apache Hadoop 0.20


Facebook Data Warehouse based on Apache Hadoop 0.20

spatial-framework-for-hadoop


The __Spatial Framework for Hadoop__ allows developers and data scientists to use the Hadoop data processing system for spatial data analysis. For tools, [samples](https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples), and [tutorials](https://github.com/Esri/gis-tools-for-hadoop/wiki) that use this framework, head over to [GIS Tools for Hadoop](https://github.com/Esri/gis-tools-for-hadoop).

Luigi - Python module that helps you build complex pipelines of batch jobs


The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else.
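
The chaining described above is expressed by subclassing `luigi.Task`: each task declares its dependencies via `requires()` and its completion marker via `output()`. A minimal two-task pipeline sketch (hypothetical task names; assumes the `luigi` package is installed):

```python
import luigi

class ExtractLogs(luigi.Task):
    """Produce a raw log file for a given date (stand-in for a real extract)."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"logs-{self.date}.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("raw log lines\n")

class CountLines(luigi.Task):
    """Count lines in the extracted file; runs only after ExtractLogs succeeds."""
    date = luigi.DateParameter()

    def requires(self):
        return ExtractLogs(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"counts-{self.date}.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as out:
            out.write(f"{sum(1 for _ in src)}\n")

if __name__ == "__main__":
    luigi.run()
```

Because completion is determined by whether `output()` exists, a failed run can simply be re-invoked and only the missing pieces are recomputed.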

Sqoop - Transfers data between Hadoop and Datastores


Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.
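
A typical round trip looks like the following (a sketch: the JDBC URL, table names, and HDFS paths are placeholders):

```shell
# Import a relational table into HDFS
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --target-dir /warehouse/orders

# Export processed results back to the database
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --table order_summaries \
  --export-dir /warehouse/order_summaries
```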

Hue - The open source Apache Hadoop UI


Hue is a web application for interacting with Apache Hadoop. It provides a File Browser for accessing HDFS, a Job Browser for MapReduce jobs (MR1/MR2-YARN), a Job Designer for creating MapReduce/Streaming/Java jobs, an HBase Browser for exploring and modifying HBase tables and data, an Oozie app for submitting and scheduling workflows and bundles, a Pig/HBase/Sqoop2 shell, a Beeswax application for executing Hive queries, and a Search app for querying Solr and Solr Cloud.

Apache Gora - Provides persistence to column stores, key value stores, document stores and RDBMS


The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

gis-tools-for-hadoop


* [Tutorial: An Introduction for Beginners](https://github.com/Esri/gis-tools-for-hadoop/wiki/GIS-Tools-for-Hadoop-for-Beginners)
* [Tutorial: Aggregating Data Into Bins](https://github.com/Esri/gis-tools-for-hadoop/wiki/Aggregating-CSV-Data-%28Spatial-Binning%29)
* [Tutorial: Correcting your ArcGIS Projection](https://github.com/Esri/gis-tools-for-hadoop/wiki/Correcting-Projection-in-ArcGIS)
* [Updated Wiki page for the Spatial-Framework-for-Hadoop](https://github.com/Esri/spatial-framework-for-h

r-hadoop-for-big-data


Free associated R open-source script files for big data analysis with Hadoop and R, available for download.

GoldenOrb - Scalable Graph Analysis


GoldenOrb is a cloud-based project for massive-scale graph analysis, built upon Apache Hadoop and modeled after Google's Pregel architecture. It provides solutions to complex data problems, removes limits to innovation, and contributes to the emerging ecosystem that spans all aspects of big data analysis. It enables users to run analytics on entire data sets instead of samples.

HadoopIlluminated - Source code for "Hadoop Illuminated", an open source book on Hadoop and Big Data


Source code for "Hadoop Illuminated", an open source book on Hadoop and Big Data

Pinot - A realtime distributed OLAP datastore


Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally, so that it can scale to larger data sets and higher query rates as needed.