Gaffer - A large-scale entity and relation database supporting aggregation of properties

  •    Java

Gaffer is a graph database framework. It allows the storage of very large graphs containing rich properties on the nodes and edges. Several storage options are available, including Accumulo, HBase and Parquet. It is designed to be as flexible, scalable and extensible as possible, allowing for rapid prototyping and transition to production systems.

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop

  •    Java

Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.

Cascalog - Data processing on Hadoop

  •    Clojure

Cascalog is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.

Big Data Twitter Demo


This demo analyzes tweets in real time and includes a dashboard. The tweets are also archived in Azure DB/Blob and Hadoop, where Excel can be used for BI!

kafka-connect-hdfs - Kafka Connect HDFS connector

  •    Java

kafka-connect-hdfs is a Kafka Connector for copying data between Kafka and Hadoop HDFS.

cloudbreak - A tool for provisioning and managing Apache Hadoop clusters in the cloud

  •    Java

The simplest way to prepare a working environment is to start Cloudbreak on your local machine using the Cloudbreak Deployer. You'll need a hypervisor too. Cloudbreak Deployer has a built-in xhyve setup option, but some of us use VirtualBox instead (as do the Docker docs). Cloudbreak Deployer works with both; it's up to you.

clusterdock - clusterdock is a framework for creating Docker-based container clusters

  •    Python

clusterdock is a Python 3 project that enables users to build, start, and manage Docker container-based clusters. It uses a pluggable system for defining new types of clusters using folders called topologies and is a swell project, if I may say so myself.

hadoop-for-geoevent - ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.

  •    Java

ArcGIS 10.4 GeoEvent Extension for Server sample Hadoop Output Connector for storing GeoEvents in HDFS. Find a bug or want to request a new feature? Please let us know by submitting an issue.

webhdfs - Node.js WebHDFS REST API client

  •    JavaScript

Hadoop WebHDFS REST API (2.2.0) client library for Node.js with an fs-module-like (asynchronous) interface.

ethz-web-scale-data-mining-project - ETH Zurich - Web Scale Data Processing and Mining Project

  •    HTML

This is the main repository for the web scale data mining project, which took place in summer 2014 as a research project. One of the results is a set of visualized topics, which were learned autonomously from terabytes of raw HTML data.

big-data-lite - Samples to the Oracle Big Data Lite VM

  •    Java

The samples contained in this repo are used in the Oracle Big Data Lite VM. Each branch is associated with a Big Data Lite version; version 4.3.0 is the first release to use GitHub. This repository includes scripts to quickly install third-party software that is useful for playing with some of the demos. Please see the README in the thirdparty directory.

eel-sdk - Big Data Toolkit for the JVM

  •    Scala

Eel is a toolkit for manipulating data in the Hadoop ecosystem. By Hadoop ecosystem we mean file formats common to the big-data world, such as Parquet, ORC, and CSV, in locations such as HDFS or Hive tables. In contrast to distributed batch or streaming engines such as Spark or Flink, Eel is an SDK intended to be used directly in process. Eel is a lower-level API than engines like Spark and is aimed at use cases where you want something like a file API.

euphoria - Euphoria is an open source Java API for creating unified big-data processing flows

  •    Java

A Java API for creating unified big-data processing flows, providing an engine-independent programming model that can express both batch and stream transformations.