Gaffer is a graph database framework. It allows the storage of very large graphs containing rich properties on the nodes and edges. Several storage options are available, including Accumulo, Hbase and Parquet. It is designed to be as flexible, scalable and extensible as possible, allowing for rapid prototyping and transition to production systems.

Gaffer - A large-scale entity and relation database supporting aggregation of properties

Alluxio (formerly known as Tachyon) is a virtual distributed storage system. It bridges the gap between computation frameworks and storage systems, enabling computation applications to connect to numerous storage systems through a common interface.
 
Alluxio enables data orchestration for compute in any cloud. It unifies data silos on-premise and across any cloud to give you the data locality, accessibility, and elasticity needed to reduce the complexities associated with orchestrating data for today’s big data and AI/ML workloads.

Alluxio (formerly known as Tachyon) is a virtual distributed storage system. It bridges the gap between computation frameworks and storage systems, enabling computation applications to connect to numerous storage systems through a common interface.

Alluxio - Data orchestration for analytics and machine learning in the cloud

Vespa is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time. Vespa is serving platform for Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr.
 Queries can use both structured filters and unstructured text search to select data. All the matching data is then ranked according to a ranking function - typically machine learned - to implement such use cases as search relevance, recommendation, targeting and personalization.
 
Vespa is scalable. System sizes up to hundreds of nodes handling tens of billions of documents are not uncommon, and no harder to set up and modify than single node systems. Since all system components, as well as stored data is redundant and self-correcting, hardware failures are not operational emergencies and can be handled by re-adding capacity when convenient.
Its feature include:
<ul>
<li>Text search - Combine structured query and text search to select data</li>
<li>Advanced Ranking - Machine learned ranking</li>
<li>Aggregation</li>
<li>Elastic, Scalable</li>
<li>High Availability</li>
<li>Auto repair data corruption</li>
<li>Simple HTTP API interface&nbsp;</li>
<li>lot more...</li>
</ul>

Vespa is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time. Vespa is serving platform for Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr.

Vespa - Yahoo's big data serving engine

ClickHouse is an open source column-oriented database management system capable of real time generation of analytical data reports using SQL queries. It is Linearly Scalable, Blazing Fast, Highly Reliable, Fault Tolerant, Data compression, Real time query processing, Web analytics, Vectorized query execution, Local and distributed joins. It can process hundreds of millions to more than a billion rows and tens of gigabytes of data per single server per second.

ClickHouse - Columnar DBMS and Real Time Analytics

ClickHouse is a fast open-source OLAP database management system. It is column-oriented and allows to generate analytical reports using SQL queries in real-time. Large queries are parallelized using multiple cores, taking all the necessary resources available on the current server. It supports distributed processing, data can reside on different shards. Each shard can be a group of replicas used for fault tolerance. All shards are used to run a query in parallel, transparently for the user.
 
ClickHouse implements user account management using SQL queries and allows for role-based access control configuration similar to what can be found in ANSI SQL standard and popular relational database management systems. It uses asynchronous multi-master replication. After being written to any available replica, all the remaining replicas retrieve their copy in the background. It adaptively chooses how to JOIN multiple tables, by preferring hash-join algorithm and falling back to the merge-join algorithm if there’s more than one large table.

ClickHouse is a fast open-source OLAP database management system. It is column-oriented and allows to generate analytical reports using SQL queries in real-time. Large queries are parallelized using multiple cores, taking all the necessary resources available on the current server. It supports distributed processing, data can reside on different shards. Each shard can be a group of replicas used for fault tolerance. All shards are used to run a query in parallel, transparently for the user.

ClickHouse - Column-oriented database management system that allows generating analytical data reports in real time

Trino is a highly parallel and distributed query engine, that is built from the ground up for efficient, low latency analytics. It is an ANSI SQL compliant query engine, that works with BI tools such as R, Tableau, Power BI, Superset and many others. It helps to natively query data in Hadoop, S3, Cassandra, MySQL, and many others, without the need for complex, slow, and error-prone processes for copying the data. 
 
Access data from multiple systems within a single query. For example, join historic log data stored in an S3 object storage with customer data stored in a MySQL relational database.

Trino is a highly parallel and distributed query engine, that is built from the ground up for efficient, low latency analytics. It is an ANSI SQL compliant query engine, that works with BI tools such as R, Tableau, Power BI, Superset and many others. It helps to natively query data in Hadoop, S3, Cassandra, MySQL, and many others, without the need for complex, slow, and error-prone processes for copying the data. 

Trino - A query engine that runs at ludicrous speed

Kafka UI is a UI for Apache Kafka to monitor and manage Apache Kafka clusters. It is a simple tool that makes your data flows observable, helps find and troubleshoot issues faster and deliver optimal performance. Its lightweight dashboard makes it easy to track key metrics of your Kafka clusters - Brokers, Topics, Partitions, Production, and Consumption. 
Its features include: 
<ul>
<li>Multi-Cluster Management — monitor and manage all your clusters in one place</li>
<li>Performance Monitoring with Metrics Dashboard — track key Kafka metrics with a lightweight dashboard</li>
<li>View Kafka Brokers — view topic and partition assignments, controller status</li>
<li>View Kafka Topics — view partition count, replication status, and custom configuration</li>
<li>View Consumer Groups — view per-partition parked offsets, combined and per-partition lag</li>
<li>Browse Messages — browse messages with JSON, plain text, and Avro encoding</li>
<li>Dynamic Topic Configuration — create and configure new topics with dynamic configuration</li>
<li>Configurable Authentification — secure your installation with optional Github/Gitlab/Google OAuth 2.0</li>
</ul>

Kafka UI is a UI for Apache Kafka to monitor and manage Apache Kafka clusters. It is a simple tool that makes your data flows observable, helps find and troubleshoot issues faster and deliver optimal performance. Its lightweight dashboard makes it easy to track key metrics of your Kafka clusters - Brokers, Topics, Partitions, Production, and Consumption.

Kafka UI - Open-Source Web UI for Apache Kafka Management

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark - Fast Cluster Computing

Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.
 
Tez can execute complex directed acyclic graphs of general data processing tasks. In many ways it can be thought of as a more flexible and powerful successor of the map-reduce framework. By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can be used to process data, that earlier took multiple MR jobs, now it can be done in a single Tez job.

Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop

Cascalog is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.

Cascalog - Data processing on Hadoop

Metron integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis. Metron provides capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, while applying the most current threat intelligence information to security telemetry within a single platform.
 
Metron will provide next generation defense techniques that consists of using a class of anomaly detection and machine learning algorithms that can be applied in real-time as events are streaming in.

Metron integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis. Metron provides capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, while applying the most current threat intelligence information to security telemetry within a single platform.

Apache Metron - Real-time Big Data Security 

Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In-Memory Data Grid (IMDG) to provide a lightweight, simple-to-deploy package that includes scalable in-memory storage. Hazelcast Jet performs parallel execution to enable data-intensive applications to operate in near real-time.
 
Hazelcast Jet gives you all the infrastructure you need to build a distributed data processing pipeline within one 10Mb Java JAR: processing, storage and clustering. Hazelcast Jet comes with in-memory operational storage that’s available out-of-the box. This storage is partitioned, distributed and replicated across the Hazelcast Jet cluster for capacity and resiliency. It can be used as an input data buffer, to publish the results of a Hazelcast Jet computation, to connect multiple Hazelcast Jet jobs or as a lookup cache for data enrichment.

Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In-Memory Data Grid (IMDG) to provide a lightweight, simple-to-deploy package that includes scalable in-memory storage. Hazelcast Jet performs parallel execution to enable data-intensive applications to operate in near real-time.

Hazelcast Jet - A general purpose distributed data processing engine, built on top of Hazelcast.

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink and Hive using a high-performance table format that works just like a SQL table. Users don’t need to know about partitioning to get fast queries. Iceberg was built for huge tables. Iceberg is used in production where a single table can contain tens of petabytes of data and even these huge tables can be read without a distributed SQL engine.

Apache Iceberg - Table format for storing large, slow-moving tabular data

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It allows querying data from relational / nosql databases. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. It is developed by Facebook.

Presto - Distributed SQL query engine for big data

Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.
 
Apache REEF turns the chaos of a distributed application into events in a single machine, the Job Driver. Events include container allocation, Task launch, completion and failure. For failures, Apache REEF makes every effort of making the actual Exception thrown by the Task available to the Driver.
 
Apache REEF provides a Task runtime called Evaluator. Evaluators are instantiated in every container of a REEF application. Evaluators can keep data in memory in between Tasks, which enables efficient pipelines on REEF. Apache REEF is the only API to write YARN or Mesos applications in .NET. Further, a single REEF application is free to mix and match Tasks written for .NET or Java.

Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.

Apache REEF - a stdlib for Big Data

Discover open source projects across all platforms

Projects