HPCC System - Hadoop alternative

  •        2511

HPCC is a proven and battle-tested platform for manipulating, transforming, querying and data warehousing Big Data. It supports two type of configuration. Thor is responsible for consuming vast amounts of data, transforming, linking and indexing that data. It functions as a distributed file system with parallel processing power spread across the nodes. Roxie, the Data Delivery Engine, provides separate high-performance online query processing and data warehouse capabilities.




Related Projects

Hadoop Common

  •    Java

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. Hadoop common supports other Hadoop subprojects

mahout - Mirror of Apache Mahout

  •    Java

Mahout's goal is to build scalable machine learning libraries. With scalable we mean: Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms. Scalable to support your business case. Mahout is distributed under a commercially friendly Apache Software license. Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more. Currently Mahout supports mainly four use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies, which individual items usually appear together.

Alluxio - Data orchestration for analytics and machine learning in the cloud

  •    Java

Alluxio (formerly known as Tachyon) is a virtual distributed storage system. It bridges the gap between computation frameworks and storage systems, enabling computation applications to connect to numerous storage systems through a common interface.

Apache Trafodion - Webscale SQL-on-Hadoop solution enabling transactional or operational workloads on Apache Hadoop.

  •    C++

Apache Trafodion is a webscale SQL-on-Hadoop solution enabling transactional or operational workloads on Apache Hadoop. Trafodion builds on the scalability, elasticity, and flexibility of Hadoop. Trafodion extends Hadoop to provide guaranteed transactional integrity, enabling new kinds of big data applications to run on Hadoop.

glow - Glow is an easy-to-use distributed computation system written in Go, similar to Hadoop Map Reduce, Spark, Flink, Storm, etc

  •    Go

Glow is providing a library to easily compute in parallel threads or distributed to clusters of machines. This is written in pure Go.I am also working on a Go+Luajit system, https://github.com/chrislusf/gleam , which is more flexible and more performant.

RethinkDB - Distributed JSON database

  •    C++

RethinkDB is built to store JSON documents, and scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and is easy to setup and learn. It supports JSON data model, Distributed joins, subqueries, aggregation, atomic updates, Hadoop-style map/reduce.

gleam - Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributedly

  •    Go

Gleam is a high performance and efficient distributed execution system, and also simple, generic, flexible and easy to customize.Gleam is built in Go, and the user defined computation can be written in Go, Unix pipe tools, or any streaming programs.

TensorFlowOnSpark - TensorFlowOnSpark brings TensorFlow programs onto Apache Spark clusters

  •    Python

TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from deep learning framework TensorFlow and big-data frameworks Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on our Hadoop clusters in Yahoo's private cloud.

dist-keras - Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark

  •    Python

Distributed Deep Learning with Apache Spark and Keras. Distributed Keras is a distributed deep learning framework built op top of Apache Spark and Keras, with a focus on "state-of-the-art" distributed optimization algorithms. We designed the framework in such a way that a new distributed optimizer could be implemented with ease, thus enabling a person to focus on research. Several distributed methods are supported, such as, but not restricted to, the training of ensembles and models using data parallel methods.

Apache Tajo - A big data warehouse system on Hadoop

  •    Java

Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources.

XLearning - AI on Hadoop

  •    Java

XLearning is a convenient and efficient scheduling platform combined with the big data and artificial intelligence, support for a variety of machine learning, deep learning frameworks. XLearning is running on the Hadoop Yarn and has integrated deep learning frameworks such as TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, XGBoost. XLearning has the satisfactory scalability and compatibility.Besides the distributed mode of TensorFlow and MXNet frameworks, XLearning supports the standalone mode of all deep learning frameworks such as Caffe, Theano, PyTorch. Moreover, XLearning allows the custom versions and multi-version of frameworks flexibly.

Apache Storm - Distributed and fault-tolerant realtime computation

  •    Java

Storm is a distributed real time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real time processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop

  •    Java

Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.

barefoot - Java library for integrating the map into software and services with state-of-the-art online and offline map matching that can be used stand-alone and in the cloud

  •    Java

An open source Java library for online and offline map matching with OpenStreetMap. Together with its extensive set of geometric and spatial functions, an in-memory map data structure and basic machine learning functions, it is a versatile basis for scalable location-based services and spatio-temporal data analysis on the map. It is designed for use in parallel and distributed systems and, hence, includes a stand-alone map matching server and can be used in distributed systems for map matching services in the cloud. Barefoot consists of a software library and a (Docker-based) map server that provides access to street map data from OpenStreetMap and is flexible to be used in distributed cloud infrastructures as map data server or side-by-side with Barefoot's stand-alone servers for offline (matcher server) and online map matching (tracker server), or other applications built with Barefoot library. Access to map data is provided with a fast and flexible in-memory map data structure. Together with GeographicLib [1] and ESRI's geometry API [2], it provides an extensive set of geographic and geometric operations for spatial data analysis on the map.

mapreduce - C++ MapReduce Library for efficient multi-threading on single-machine

  •    C++

The MapReduce C++ Library implements a single-machine platform for programming using the the Google MapReduce idiom. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the Google paper. The developer is required to write two classes; MapTask implements a mapping function to process key/value pairs generate a set of intermediate key/value pairs and ReduceTask that implements a reduce function to merges all intermediate values associated with the same intermediate key. In addition, there are three optional template parameters that can be used to modify the default implementation behavior; Datasource that implements a mechanism to feed data to the Map Tasks - on request of the MapReduce library, Combiner that can be used to partially consolidate results of the Map Task before they are passed to the Reduce Tasks, and IntermediateStore that handles storage, merging and sorting of intermediate results between the Map and Reduce phases. The MapTask class must define four data types; the key/value types for the inputs to the Map Tasks and the intermediate types.

Samza - Distributed Stream Processing Framework

  •    Java

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It provides a very simple call-back based process message API that should be familiar to anyone who's used Map/Reduce. Samza was originally developed at LinkedIn. It's currently used to process tracking data, service log data, and for data ingestion pipelines for realtime services.

multiverso - Parameter server framework for distributed machine learning

  •    C++

Multiverso is a parameter server based framework for training machine learning models on big data with numbers of machines. It is currently a standard C++ library and provides a series of friendly programming interfaces. With such easy-to-use APIs, machine learning researchers and practitioners do not need to worry about the system routine issues such as distributed model storage and operation, inter-process and inter-thread communication, multi-threading management, and so on. Instead, they are