

Avro is a data serialization system. It originated as a subproject of Apache Hadoop and is now a top-level Apache project.
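Avro schemas are written in JSON. A minimal record schema (the record and field names below are hypothetical) might look like:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

The union type `["null", "string"]` is Avro's idiom for an optional field.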

Luigi - Python module that helps you build complex pipelines of batch jobs

The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes: you want to chain many tasks and automate them, and failures will happen. These tasks can be anything, but are typically long-running jobs such as Hadoop jobs, dumping data to or from databases, or running machine learning algorithms.
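The task-and-dependency model Luigi provides can be illustrated with a stdlib-only sketch. This is not Luigi's actual API (a real Luigi task subclasses `luigi.Task` and declares `requires`/`output`/`run`); it only shows the idea: each task declares its dependencies and knows whether it is already complete, and a runner executes tasks in dependency order, skipping finished ones so failed pipelines can resume.

```python
# Stdlib-only sketch of the Luigi task model (not Luigi's real API).
class Task:
    def requires(self):   # upstream tasks this task depends on
        return []
    def complete(self):   # has this task already produced its output?
        return False
    def run(self):        # do the actual work
        ...

class Extract(Task):
    def __init__(self):
        self.done = False
    def complete(self):
        return self.done
    def run(self):
        self.data = [1, 2, 3]   # pretend this was dumped from a database
        self.done = True

class Load(Task):
    def __init__(self, upstream):
        self.upstream = upstream
        self.done = False
    def requires(self):
        return [self.upstream]
    def complete(self):
        return self.done
    def run(self):
        # Consume the upstream task's output.
        self.result = sum(self.upstream.data)
        self.done = True

def build(task):
    # Run dependencies first, then the task itself, skipping completed work.
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

load = Load(Extract())
build(load)
assert load.result == 6
```

Because `build` skips tasks whose `complete()` is true, re-running the pipeline after a failure only redoes unfinished work, which is the core of what Luigi automates.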

Scalding - A Scala API for Cascading

Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing the advantages of Scala to your MapReduce jobs.

XLearning - AI on Hadoop

XLearning is a convenient and efficient scheduling platform that combines big data with artificial intelligence and supports a variety of machine learning and deep learning frameworks. XLearning runs on Hadoop YARN and integrates deep learning frameworks such as TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, and XGBoost. XLearning offers good scalability and compatibility. Besides the distributed modes of the TensorFlow and MXNet frameworks, XLearning supports the standalone mode of all integrated deep learning frameworks, such as Caffe, Theano, and PyTorch. Moreover, XLearning flexibly supports custom and multiple versions of each framework.

Cascalog - Data processing on Hadoop

Cascalog is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.

Apache Trafodion - Webscale SQL-on-Hadoop solution enabling transactional or operational workloads on Apache Hadoop.

Apache Trafodion is a webscale SQL-on-Hadoop solution enabling transactional or operational workloads on Apache Hadoop. Trafodion builds on the scalability, elasticity, and flexibility of Hadoop. Trafodion extends Hadoop to provide guaranteed transactional integrity, enabling new kinds of big data applications to run on Hadoop.

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop

Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce's ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.

Cascading - Data Processing Workflows on Hadoop

Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. It is a thin Java library and API that sits on top of Hadoop's MapReduce layer and is executed from the command line like any other Hadoop application.

Big Data Twitter Demo

This demo analyzes tweets in real time and includes a dashboard. The tweets are also archived in Azure DB/Blob storage and Hadoop, where Excel can be used for BI.

floating-elephants - Docker containers for Hadoop.

An easy way to reproduce a multi-node Hadoop cluster on a local machine.

camus - Mirror of Linkedin's Camus

Camus is LinkedIn's Kafka-to-HDFS pipeline. It is a MapReduce job that performs distributed data loads out of Kafka.

kafka-connect-hdfs - Kafka Connect HDFS connector

kafka-connect-hdfs is a Kafka connector for copying data between Kafka and Hadoop HDFS.
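A sink connector of this kind is driven by a small properties file. The sketch below follows the property names used by the Confluent HDFS connector, but the URL, topic name, and values are placeholders; check the connector's documentation for the exact settings your version expects.

```properties
# Hypothetical HDFS sink configuration (placeholder values).
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test_hdfs
hdfs.url=hdfs://localhost:9000
flush.size=3
```

`flush.size` controls how many records are written to a file before it is committed to HDFS, which is the main lever against the small-files problem described below for Camus.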

camus-compressor - Camus Compressor merges files created by Camus and saves them in a compressed format

Camus Compressor merges files created by Camus and saves them in a compressed format. Camus is used heavily at Allegro to dump more than 200 Kafka topics onto HDFS. The script runs every 15 minutes and creates one file per Kafka partition, which results in about 76,800 small files per day. Most of these files do not exceed the Hadoop block size. This is a classic Hadoop antipattern that leads to performance issues, for example an excessive number of mappers during SQL query execution.
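The file-count arithmetic above can be checked quickly. The total partition count is not stated, but roughly 800 partitions across the 200+ topics is what the cited figure implies:

```python
# One Camus run every 15 minutes, one output file per Kafka partition.
runs_per_day = 24 * 60 // 15       # -> 96 runs per day
partitions = 800                   # assumed total partitions across ~200 topics

files_per_day = runs_per_day * partitions
assert runs_per_day == 96
assert files_per_day == 76800      # matches the ~76,800 small files cited above
```

Each of those files is typically smaller than one HDFS block, so compaction (merging plus compression) is what restores sane mapper counts.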

hoodie - Spark Library for Hadoop Upserts And Incrementals

Hoodie is an Apache Spark library that provides the ability to efficiently perform incremental processing on datasets in HDFS.

hadoop-crypto - Library for per-file client-side encryption in Hadoop FileSystems such as HDFS or S3.

Seekable Crypto is a Java library that provides the ability to seek within SeekableInputs while decrypting the underlying contents, along with utilities for storing and generating the keys used to encrypt and decrypt the data streams. An implementation of the Hadoop FileSystem is also included that uses the Seekable Crypto library to provide efficient and transparent client-side encryption for Hadoop filesystems. Currently, AES/CTR/NoPadding and AES/CBC/PKCS5Padding are supported.
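What makes CTR mode seekable is that each 16-byte block's keystream depends only on the key and a counter, so decryption can start at any block boundary without processing earlier bytes. The sketch below demonstrates just that seek arithmetic; it uses a hash as a stand-in for the AES block cipher and is not the hadoop-crypto implementation.

```python
import hashlib

BLOCK = 16  # AES block size in bytes

def keystream_block(key: bytes, iv: int, block_index: int) -> bytes:
    # Stand-in for AES-CTR: keystream block = PRF(key, iv + block_index).
    # Real CTR mode encrypts the counter with AES; the seek math is identical.
    counter = (iv + block_index).to_bytes(16, "big")
    return hashlib.sha256(key + counter).digest()[:BLOCK]

def xor_keystream(key: bytes, iv: int, data: bytes, start_offset: int = 0) -> bytes:
    # Encrypts or decrypts (XOR is its own inverse), starting at any byte offset.
    out = bytearray()
    for i, b in enumerate(data):
        pos = start_offset + i
        ks = keystream_block(key, iv, pos // BLOCK)
        out.append(b ^ ks[pos % BLOCK])
    return bytes(out)

key, iv = b"demo-key", 42
plaintext = bytes(range(200))
ciphertext = xor_keystream(key, iv, plaintext)

# Seek to byte 100 and decrypt the tail without touching bytes 0..99.
tail = xor_keystream(key, iv, ciphertext[100:], start_offset=100)
assert tail == plaintext[100:]
```

CBC, by contrast, chains each block to the previous ciphertext block, so a seek there must still read one extra block before the target offset.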

jumbune - Jumbune is an open-source project to optimize Hadoop-based solutions on both YARN (v2) and older (v1) clusters

Jumbune is an open-source product built for analyzing Hadoop clusters and MapReduce jobs. It provides development and administrative insights into Hadoop-based analytical solutions, enabling users to debug, profile, monitor, and validate analytical solutions hosted on decoupled clusters.