Samza - Distributed Stream Processing Framework

  •        0

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It provides a very simple call-back based process message API that should be familiar to anyone who's used Map/Reduce. Samza was originally developed at LinkedIn. It's currently used to process tracking data, service log data, and for data ingestion pipelines for realtime services.



Related Projects

incubator-samza-hello-samza - Mirror of Apache Samza

[Hello Samza]( is developed as part of the [Apache Samza]( project. Please direct questions, improvements and bug fixes there. Questions about [Hello Samza]( are welcome on the [dev list]( and the [Samza JIRA]( has a hello-samza compone

NSQ - A realtime distributed messaging platform in Go

NSQ is a realtime distributed messaging platform designed to operate at scale, handling billions of messages per day. It promotes distributed and decentralized topologies without single points of failure, enabling fault tolerance and high availability coupled with a reliable message delivery guarantee. It scales horizontally, without any centralized brokers. Built-in discovery simplifies the addition of nodes to the cluster.

SenseiDB - Distributed, Realtime, Semi-Structured Database from LinkedIn

Sensei is a distributed data system that was built to support many product initiatives at LinkedIn, including the real-time faceted search in LinkedIn Signal and the news feed and tabs on the Homepage. Sensei is both a search engine and a database. It is designed to query and navigate through documents that consist of unstructured text and well-formed and structured metadata. Sensei is both a search engine and a database.


SAMOA is a platform for mining on big data streams.It is a distributed streaming machine learning (ML) framework that contains a programing abstraction for distributed streaming ML algorithms.SAMOA enables development of new ML algorithms without dealing with the complexity of underlying streaming processing engines (SPE, such as Apache Storm and Apache S4). SAMOA also provides extensibility in integratingnew SPEs into the framework. These features allow SAMOA users to develop distributed stream


Apache ActiveMQ is the most popular and powerful open source messaging and Integration Patterns provider. Apache ActiveMQ is fast, supports many Cross Language Clients and Protocols, comes with easy to use Enterprise Integration Patterns and many advanced features while fully supporting JMS 1.1 and J2EE 1.4.

Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Project-voldemort - A distributed database, Clone of Amazon's Dynamo

Voldemort is a distributed key-value storage system. Data is automatically replicated over multiple servers. Data is automatically partitioned so each server contains only a subset of the total data. Server failure is handled transparently. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient.

Pulsar - Distributed pub-sub Messaging System from Yahoo

Pulsar is a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API. It is horizontally scalable (Millions of independent topics and millions of messages published per second), Strong ordering and consistency guarantees, Low latency , REST API, Geo Replication and lot more.

Openmeetings - Open Source Web Conferencing

Openmeetings provides video conferencing, instant messaging, white board, collaborative document editing and other groupware tools using API functions of the Red5 Streaming Server for Remoting and Streaming.

Apache Flink - Platform for Scalable Batch and Stream Data Processing

Apache Flink is an open source platform for scalable batch and stream data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

GeoMesa - Suite of tools for working with big geo-spatial data in a distributed fashion

GeoMesa is an open-source, distributed, spatio-temporal database built on a number of distributed cloud data storage systems, including Accumulo, HBase, Cassandra, and Kafka. Leveraging a highly parallelized indexing strategy, GeoMesa aims to provide as much of the spatial querying and data manipulation to Accumulo as PostGIS does to Postgres.

Zopkio - A Functional and Performance Test Framework for Distributed Systems

.. image:: :target:

Bobo - Faceted search library based on Lucene

Bobo Browse is an information retrieval technology that provides navigational browsing into a semi-structured dataset. Beyond the result set from queries and selections, Bobo Browse also provides the facets from this point of browsing. It provides support to sort documents on fields that have multiple values. It is stable and used by LinkedIn.

IndexTank - Search Engine powers Reddit

IndexTank search engine powers search in Reddit, Social bookmarking site. IndexTank is acquired by LinkedIn and released the project as open source. It includes features like Variables boosts, Facets, Faceted search, Snippeting, Custom scoring functions, Suggest, and Autocomplete.

Rambox - Messaging and Emailing app that combines common web applications into one

Rambox is a messaging and emailing app that combines common web applications into one. It gives you the possibility to add common services many times you need, all in one place. It's perfect for people who work with many services for business and private accounts.

Kafka - A high-throughput distributed messaging system

Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) are a key ingredient in many of the social feature on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements. This kind of ad hoc solution is a viable solution to providing logging data to Hadoop.

Hazelcast Jet - Distributed data processing engine, built on top of Hazelcast

Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In Memory Data Grid (IMDG) to provide a lightweight package of a processor and a scalable in-memory storage. It supports distributed API support for Hazelcast data structures such as IMap and IList, Distributed implementations of java.util.{Queue, Set, List, Map} data structures highly optimized to be used for the processing

s4 - Distributed Stream Computing Platform

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. S4 has been deployed in production systems at Yahoo to process thousands of search queries per second.

nsq - A realtime distributed messaging platform

* **Docs**: [][docs] * **Twitter**: [@nsqio][nsqio_twitter][![Build Status](](**NSQ** is a realtime distributed messaging platform designed to operate at scale, handlingbillions of messages per day.It promotes *distributed* and *decentralized* topologies without single points of failure,enabling fault tolerance and high availability coupled with a reliable message deliveryguarantee. See [featur