Java based data integration framework can be used to transform/map/manipulate data in various formats (CSV,FIXLEN,XML,XBASE,COBOL,LOTUS, etc.); can be used standalone or embedded(as a library). Connects to RDBMS/JMS/SOAP/LDAP/S3/HTTP/FTP/ZIP/TAR.

CloverETL - Rapid Data Integration

Alibaba JStorm is an enterprise fast and stable streaming process engine. It runs program up to 4x faster than Apache Storm. It is easy to switch from record mode to mini-batch mode. It is not only a streaming process engine. It means one solution for real time requirement, whole realtime ecosystem.


jstorm - Enterprise Stream Process Engine

Vespa is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time. Vespa is serving platform for Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr.
 Queries can use both structured filters and unstructured text search to select data. All the matching data is then ranked according to a ranking function - typically machine learned - to implement such use cases as search relevance, recommendation, targeting and personalization.
 
Vespa is scalable. System sizes up to hundreds of nodes handling tens of billions of documents are not uncommon, and no harder to set up and modify than single node systems. Since all system components, as well as stored data is redundant and self-correcting, hardware failures are not operational emergencies and can be handled by re-adding capacity when convenient.
Its feature include:
<ul>
<li>Text search - Combine structured query and text search to select data</li>
<li>Advanced Ranking - Machine learned ranking</li>
<li>Aggregation</li>
<li>Elastic, Scalable</li>
<li>High Availability</li>
<li>Auto repair data corruption</li>
<li>Simple HTTP API interface&nbsp;</li>
<li>lot more...</li>
</ul>

Vespa is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time. Vespa is serving platform for Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr.

Vespa - Yahoo's big data serving engine

Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.
 
Tez can execute complex directed acyclic graphs of general data processing tasks. In many ways it can be thought of as a more flexible and powerful successor of the map-reduce framework. By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can be used to process data, that earlier took multiple MR jobs, now it can be done in a single Tez job.

Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop

Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In-Memory Data Grid (IMDG) to provide a lightweight, simple-to-deploy package that includes scalable in-memory storage. Hazelcast Jet performs parallel execution to enable data-intensive applications to operate in near real-time.
 
Hazelcast Jet gives you all the infrastructure you need to build a distributed data processing pipeline within one 10Mb Java JAR: processing, storage and clustering. Hazelcast Jet comes with in-memory operational storage that’s available out-of-the box. This storage is partitioned, distributed and replicated across the Hazelcast Jet cluster for capacity and resiliency. It can be used as an input data buffer, to publish the results of a Hazelcast Jet computation, to connect multiple Hazelcast Jet jobs or as a lookup cache for data enrichment.

Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In-Memory Data Grid (IMDG) to provide a lightweight, simple-to-deploy package that includes scalable in-memory storage. Hazelcast Jet performs parallel execution to enable data-intensive applications to operate in near real-time.

Hazelcast Jet - A general purpose distributed data processing engine, built on top of Hazelcast.

Build concurrent and multi-stage data ingestion and data processing pipelines with Elixir. It allows developers to consume data efficiently from different sources, known as producers, such as Amazon SQS, Apache Kafka, Google Cloud PubSub, RabbitMQ, and others. Broadway takes the burden of defining concurrent GenStage topologies and provide a simple configuration API that automatically defines concurrent producers, concurrent processing, batch handling, and more, leading to both time and cost efficient ingestion and processing of data.

Broadway - Concurrent and multi-stage data ingestion and data processing with Elixir 

Apache Flink is an open source platform for scalable batch and stream data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.


Apache Flink - Platform for Scalable Batch and Stream Data Processing

Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.
 
Apache REEF turns the chaos of a distributed application into events in a single machine, the Job Driver. Events include container allocation, Task launch, completion and failure. For failures, Apache REEF makes every effort of making the actual Exception thrown by the Task available to the Driver.
 
Apache REEF provides a Task runtime called Evaluator. Evaluators are instantiated in every container of a REEF application. Evaluators can keep data in memory in between Tasks, which enables efficient pipelines on REEF. Apache REEF is the only API to write YARN or Mesos applications in .NET. Further, a single REEF application is free to mix and match Tasks written for .NET or Java.

Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.

Apache REEF - a stdlib for Big Data

Storm is a distributed real time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real time processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. 
Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.

Storm is a distributed real time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real time processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. 

Apache Storm - Distributed and fault-tolerant realtime computation

Kapacitor is a open source framework for processing, monitoring, and alerting on time series data. Kapacitor imports (stream or batch) time series data, and then transform, analyze, and act on the data. It uses Telegraf to collect system metrics on your local machine and store them in InfluxDB.
 
Kapacitor’s alerting system follows a publish-subscribe design pattern. Alerts are published to topics and handlers subscribe to a topic. This pub/sub model and the ability for these to call User Defined Functions make Kapacitor very flexible to act as the control plane in your environment, performing tasks like auto-scaling, stock reordering, and IoT device control.

Kapacitor is a open source framework for processing, monitoring, and alerting on time series data. Kapacitor imports (stream or batch) time series data, and then transform, analyze, and act on the data. It uses Telegraf to collect system metrics on your local machine and store them in InfluxDB.

Kapacitor - Open source framework for processing, monitoring, and alerting on time series data

Apache StreamPipes is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. It can Integrate data sets and data streams using the built-in StreamPipes Connect library with support for generic protocols such as HTTP, Kafka, MQTT, OPC-UA, Files or specific adapters for open data sources.
 
It enables flexible modeling of stream processing pipelines by providing a graphical modeling editor on top of existing stream processing frameworks. It empowers non-technical users to quickly define and execute processing pipelines based on an easily extensible toolbox of data sources, data processors and data sinks. StreamPipes has an exchangeable runtime execution layer and executes pipelines using one of the provided wrappers, e.g., standalone or distributed in Apache Flink.

Apache StreamPipes is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. It can Integrate data sets and data streams using the built-in StreamPipes Connect library with support for generic protocols such as HTTP, Kafka, MQTT, OPC-UA, Files or specific adapters for open data sources.

Apache StreamPipes - A self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams

Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In Memory Data Grid (IMDG) to provide a lightweight package of a processor and a scalable in-memory storage. It supports distributed java.util.stream API support for Hazelcast data structures such as IMap and IList, Distributed implementations of java.util.{Queue, Set, List, Map} data structures highly optimized to be used for the processing

Hazelcast Jet - Distributed data processing engine, built on top of Hazelcast

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
 
Beam is particularly useful for Embarrassingly Parallel data processing tasks, in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel. You can also use Beam for Extract, Transform, and Load (ETL) tasks and pure data integration. These tasks are useful for moving data between different storage media and data sources, transforming data into a more desirable format, or loading data onto a new system.

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines

Teiid is a data virtualization system that allows applications to use data from multiple, heterogenous data stores. Teiid is comprised of tools, components and services for creating and executing bi-directional data access services.

Teiid - Data virtualization system that allows applications to use data from multiple, heterogeneous data stores.

Discover open source projects across all platforms

Projects