Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines

  •        1792

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Beam is particularly useful for Embarrassingly Parallel data processing tasks, in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel. You can also use Beam for Extract, Transform, and Load (ETL) tasks and pure data integration. These tasks are useful for moving data between different storage media and data sources, transforming data into a more desirable format, or loading data onto a new system.




Related Projects

Hazelcast Jet - A general purpose distributed data processing engine, built on top of Hazelcast.

  •    Java

Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In-Memory Data Grid (IMDG) to provide a lightweight, simple-to-deploy package that includes scalable in-memory storage. Hazelcast Jet performs parallel execution to enable data-intensive applications to operate in near real-time.

Hazelcast Jet - Distributed data processing engine, built on top of Hazelcast

  •    Java

Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In Memory Data Grid (IMDG) to provide a lightweight package of a processor and a scalable in-memory storage. It supports distributed java.util.stream API support for Hazelcast data structures such as IMap and IList, Distributed implementations of java.util.{Queue, Set, List, Map} data structures highly optimized to be used for the processing

Apache Flink - Platform for Scalable Batch and Stream Data Processing

  •    Java

Apache Flink is an open source platform for scalable batch and stream data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

DataflowJavaSDK - Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines

  •    Java

Google Cloud Dataflow SDK for Java is a distribution of Apache Beam designed to simplify usage of Apache Beam on Google Cloud Dataflow service. This artifact includes the parent POM for other Dataflow SDK artifacts.

jstorm - Enterprise Stream Process Engine

  •    Java

Alibaba JStorm is an enterprise fast and stable streaming process engine. It runs program up to 4x faster than Apache Storm. It is easy to switch from record mode to mini-batch mode. It is not only a streaming process engine. It means one solution for real time requirement, whole realtime ecosystem.

snappydata - Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™

  •    Scala

SnappyData (aka TIBCO ComputeDB) is a distributed, in-memory optimized analytics database. SnappyData delivers high throughput, low latency, and high concurrency for unified analytics workload. By fusing an in-memory hybrid database inside Apache Spark, it provides analytic query processing, mutability/transactions, access to virtually all big data sources and stream processing all in one unified cluster. One common use case for SnappyData is to provide analytics at interactive speeds over large volumes of data with minimal or no pre-processing of the dataset. For instance, there is no need to often pre-aggregate/reduce or generate cubes over your large data sets for ad-hoc visual analytics. This is made possible by smartly managing data in-memory, dynamically generating code using vectorization optimizations and maximizing the potential of modern multi-core CPUs. SnappyData enables complex processing on large data sets in sub-second timeframes.

klio - Smarter data pipelines for audio.

  •    Python

Klio is an ecosystem that allows you to process audio files – or any binary files – easily and at scale. Klio jobs are opinionated data pipelines in Python (streaming or batch) built upon Apache Beam and tuned for audio and binary file processing.

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop

  •    Java

Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.

Apache Storm - Distributed and fault-tolerant realtime computation

  •    Java

Storm is a distributed real time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real time processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.

Apache StreamPipes - A self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams

  •    Java

Apache StreamPipes is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. It can Integrate data sets and data streams using the built-in StreamPipes Connect library with support for generic protocols such as HTTP, Kafka, MQTT, OPC-UA, Files or specific adapters for open data sources.

bistro - A general-purpose data analysis engine radically changing the way batch and stream data is processed

  •    Java

The main general goal of Bistro is data processing. By data processing we mean deriving new data from existing data. Bistro assumes that data is represented as a number of sets of elements. Each element is a tuple which is a combination of column values. A value can be any (Java) object.

Quickwit - Fast and cost-efficient distributed search engine for immutable data

  •    Rust

Quickwit is a distributed search engine built from the ground up to offer cost-efficiency and high reliability. By mere mortals for mere mortals, Quickwit's architecture is as simple as possible. Quickwit is written in Rust and built on top of the mighty tantivy library. It designed to index big datasets straight from object storage like AWS S3 in a stateless manner.

Kapacitor - Open source framework for processing, monitoring, and alerting on time series data

  •    Go

Kapacitor is a open source framework for processing, monitoring, and alerting on time series data. Kapacitor imports (stream or batch) time series data, and then transform, analyze, and act on the data. It uses Telegraf to collect system metrics on your local machine and store them in InfluxDB.

Samza - Distributed Stream Processing Framework

  •    Java

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It provides a very simple call-back based process message API that should be familiar to anyone who's used Map/Reduce. Samza was originally developed at LinkedIn. It's currently used to process tracking data, service log data, and for data ingestion pipelines for realtime services.

automi - A stream API for Go (alpha)

  •    Go

Automi abstracts away (not too far away) the gnarly details of using Go channels to create pipelined and staged processes. It exposes higher-level API to compose and integrate stream of data over Go channels for processing. This is still alpha work. The API is still evolving and changing rapidly with each commit (beware). Nevertheless, the core concepts are have been bolted onto the API. The following example shows how Automi could be used to compose a multi-stage pipeline to process stream of data from a csv file. The code implements stream processing based on the pipeline patterns. What is clearly absent, however, is the low level channel communication code to coordinate and synchronize goroutines. The programmer is provided a clean surface to express business code without the noisy channel infrastructure code. Underneath the cover however, Automi is using patterns similar to the pipeline patterns to create safe and concurrent structures to execute the processing of the data stream.

spring-cloud-dataflow - Spring Cloud Data Flow is a toolkit for building data integration and real-time data processing pipelines

  •    Java

Spring Cloud Data Flow is a toolkit for building data integration and real-time data processing pipelines.Pipelines consist of Spring Boot apps, built using the Spring Cloud Stream or Spring Cloud Task microservice frameworks.

hstream - The database built for IoT streaming data storage and real-time stream processing.

  •    Haskell

The database built for IoT streaming data storage and real-time stream processing. By subscribing to streams in HStreamDB, any update of the data stream will be pushed to your apps in real-time, and this promotes your apps to be more responsive.

Broadway - Concurrent and multi-stage data ingestion and data processing with Elixir

  •    Elixir

Build concurrent and multi-stage data ingestion and data processing pipelines with Elixir. It allows developers to consume data efficiently from different sources, known as producers, such as Amazon SQS, Apache Kafka, Google Cloud PubSub, RabbitMQ, and others. Broadway takes the burden of defining concurrent GenStage topologies and provide a simple configuration API that automatically defines concurrent producers, concurrent processing, batch handling, and more, leading to both time and cost efficient ingestion and processing of data.

Fluo - Make incremental updates to large data sets stored in Apache Accumulo

  •    Java

Apache Fluo (incubating) is an open source implementation of Percolator (which populates Google's search index) for Apache Accumulo. Fluo makes it possible to update the results of a large-scale computation, index, or analytic as new data is discovered. When combining new data with existing data, Fluo offers reduced latency when compared to batch processing frameworks (e.g Spark, MapReduce).

Sensorbee - Lightweight stream processing engine for IoT

  •    Go

Sensorbee is designed for low-latency processing of streaming data at the edge of the network. IoT devices frequently generate large volumes of unstructured streaming data, such as video and audio streams. Even if the data streams are structured, they may be meaningless if their temporal characteristics are not considered. Cloud-based services are generally not good at processing these kinds of data. Preprocessing data streams before they are sent to the cloud makes large scale data processing in the cloud more efficient and reduces the usage of network bandwidth.

We have large collection of open source products. Follow the tags from Tag Cloud >>

Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.