Samza - Distributed Stream Processing Framework


Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It provides a simple callback-based "process message" API that should be familiar to anyone who has used Map/Reduce. Samza was originally developed at LinkedIn. It is currently used to process tracking data and service log data, and to run data ingestion pipelines for realtime services.

http://samza.incubator.apache.org/
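
As a rough illustration of what a callback-based "process message" API looks like, here is a minimal Python sketch. It is not Samza's actual API: the names `PageViewCounter`, `run_task`, and `emit` are hypothetical, and the `run_task` loop merely stands in for the framework that would deliver messages from a stream.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Hypothetical types for illustration only; this is not Samza's API.
Message = Tuple[str, str]          # (key, value)
Emit = Callable[[str, int], None]  # callback the task uses to send output


class PageViewCounter:
    """A task that counts messages per key, one callback call per message."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def process(self, message: Message, emit: Emit) -> None:
        # Called once for every incoming message.
        key, _value = message
        self.counts[key] += 1
        emit(key, self.counts[key])


def run_task(task: PageViewCounter, stream: Iterable[Message]) -> List[Tuple[str, int]]:
    """Stand-in for the framework: feeds each message to the task's callback."""
    output: List[Tuple[str, int]] = []
    for message in stream:
        task.process(message, lambda key, count: output.append((key, count)))
    return output


events = [("home", "u1"), ("about", "u2"), ("home", "u3")]
result = run_task(PageViewCounter(), events)
```

The task only implements `process`; reading the stream, delivering messages, and collecting output are left to the surrounding loop, which is the part a real framework would provide.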

Related Projects

RocketMQ - Distributed messaging and streaming data platform


Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.

incubator-samza-hello-samza - Mirror of Apache Samza


[Hello Samza](http://samza.incubator.apache.org/startup/hello-samza/0.8.0/) is developed as part of the [Apache Samza](http://samza.incubator.apache.org) project. Please direct questions, improvements and bug fixes there. Questions about [Hello Samza](http://samza.incubator.apache.org/startup/hello-samza/0.8.0/) are welcome on the [dev list](http://samza.incubator.apache.org/community/mailing-lists.html) and the [Samza JIRA](https://issues.apache.org/jira/browse/SAMZA) has a hello-samza compone

NSQ - A realtime distributed messaging platform in Go


NSQ is a realtime distributed messaging platform designed to operate at scale, handling billions of messages per day. It promotes distributed and decentralized topologies without single points of failure, enabling fault tolerance and high availability coupled with a reliable message delivery guarantee. It scales horizontally, without any centralized brokers. Built-in discovery simplifies the addition of nodes to the cluster.

SenseiDB - Distributed, Realtime, Semi-Structured Database from LinkedIn


Sensei is a distributed data system that was built to support many product initiatives at LinkedIn, including the real-time faceted search in LinkedIn Signal and the news feed and tabs on the Homepage. Sensei is both a search engine and a database. It is designed to query and navigate through documents that consist of unstructured text and well-formed, structured metadata.



samoa


SAMOA is a platform for mining big data streams. It is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. SAMOA enables development of new ML algorithms without dealing with the complexity of underlying stream processing engines (SPEs, such as Apache Storm and Apache S4). SAMOA also provides extensibility in integrating new SPEs into the framework. These features allow SAMOA users to develop distributed stream

ActiveMQ


Apache ActiveMQ is the most popular and powerful open source messaging and Integration Patterns provider. Apache ActiveMQ is fast, supports many Cross Language Clients and Protocols, comes with easy to use Enterprise Integration Patterns and many advanced features while fully supporting JMS 1.1 and J2EE 1.4.

node-nats-streaming - Node.js client for NATS Streaming


Node NATS Streaming is an extremely performant, lightweight, reliable streaming platform powered by NATS for Node.js. NATS Streaming subscriptions are similar to NATS subscriptions, but clients may start their subscription at an earlier point in the message stream, allowing them to receive messages that were published before the client registered interest.
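
The idea of starting a subscription at an earlier point in the stream can be sketched with a small in-memory model. This is a Python illustration under stated assumptions, not the node-nats-streaming client API: `MessageLog`, `subscribe`, and `start_sequence` are hypothetical names, and a real server would persist the log rather than hold it in a list.

```python
from typing import Callable, List, Optional, Tuple

Handler = Callable[[int, str], None]


class MessageLog:
    """Hypothetical in-memory channel that retains every published message."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        self._subscribers: List[Handler] = []

    def publish(self, payload: str) -> int:
        self._messages.append(payload)
        sequence = len(self._messages)  # sequence numbers start at 1
        for handler in self._subscribers:
            handler(sequence, payload)
        return sequence

    def subscribe(self, handler: Handler, start_sequence: Optional[int] = None) -> None:
        # Replay retained messages from start_sequence before going live.
        if start_sequence is not None:
            first = max(start_sequence, 1)
            for sequence in range(first, len(self._messages) + 1):
                handler(sequence, self._messages[sequence - 1])
        self._subscribers.append(handler)


log = MessageLog()
for payload in ["a", "b", "c"]:
    log.publish(payload)

received: List[Tuple[int, str]] = []
# This subscriber arrives late but asks to start from sequence 2.
log.subscribe(lambda seq, msg: received.append((seq, msg)), start_sequence=2)
log.publish("d")
```

A subscriber that omits `start_sequence` in this sketch sees only messages published after it subscribes, which mirrors the plain subscription behaviour the description contrasts against.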

Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines


Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

li-apache-kafka-clients - li-apache-kafka-clients is a wrapper library for the Apache Kafka vanilla clients


li-apache-kafka-clients is a wrapper library built on top of the vanilla Apache Kafka clients. Apache Kafka has become a very popular messaging system and is well known for its low latency, high throughput, and durable messaging. At LinkedIn, we have built an ecosystem around Kafka to power our infrastructure. In our ecosystem, the li-apache-kafka-clients library is a fundamental component for many functions such as auditing, data format standardization, large message handling, and so on.

Project-voldemort - A distributed database, Clone of Amazon's Dynamo


Voldemort is a distributed key-value storage system. Data is automatically replicated over multiple servers. Data is automatically partitioned so each server contains only a subset of the total data. Server failure is handled transparently. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient.
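
To make the partitioning and replication ideas concrete, here is a minimal Python sketch. It is an assumption-laden toy, not Voldemort's routing or storage code: `replicas_for`, `Cluster`, the MD5-based placement, and the `down` set are all hypothetical choices made for illustration.

```python
import hashlib
from typing import Dict, List, Set


def replicas_for(key: str, servers: List[str], replication_factor: int) -> List[str]:
    """Pick the servers that hold a key: a hashed start point plus its successors."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:4], "big") % len(servers)
    count = min(replication_factor, len(servers))
    return [servers[(start + i) % len(servers)] for i in range(count)]


class Cluster:
    """Toy cluster: each server stores only the keys routed to it."""

    def __init__(self, servers: List[str], replication_factor: int = 2) -> None:
        self.servers = list(servers)
        self.replication_factor = replication_factor
        self.stores: Dict[str, Dict[str, str]] = {name: {} for name in self.servers}
        self.down: Set[str] = set()  # servers currently treated as failed

    def put(self, key: str, value: str) -> None:
        for name in replicas_for(key, self.servers, self.replication_factor):
            if name not in self.down:
                self.stores[name][key] = value

    def get(self, key: str) -> str:
        # Try each replica in turn, skipping failed servers.
        for name in replicas_for(key, self.servers, self.replication_factor):
            if name not in self.down and key in self.stores[name]:
                return self.stores[name][key]
        raise KeyError(key)


cluster = Cluster(["server-a", "server-b", "server-c"], replication_factor=2)
for i in range(20):
    cluster.put(f"member:{i}", f"profile-{i}")

# Simulate losing the first replica of one key; the read falls back to the second.
first_replica = replicas_for("member:7", cluster.servers, 2)[0]
cluster.down.add(first_replica)
value = cluster.get("member:7")
```

Each key lands on two of the three servers, so no single server is required to serve a read, and the caller of `get` does not need to know which server failed.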

Pulsar - Distributed pub-sub Messaging System from Yahoo


Pulsar is a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API. It offers horizontal scalability (millions of independent topics and millions of messages published per second), strong ordering and consistency guarantees, low latency, a REST API, geo-replication, and more.

Openmeetings - Open Source Web Conferencing


Openmeetings provides video conferencing, instant messaging, whiteboard, collaborative document editing, and other groupware tools using API functions of the Red5 Streaming Server for remoting and streaming.

Luxun - A high-throughput, persistent, distributed, publish-subscribe messaging system based on memo


A high-throughput, persistent, distributed, publish-subscribe messaging system based on memory mapped file and Thrift RPC.

Apache Flink - Platform for Scalable Batch and Stream Data Processing


Apache Flink is an open source platform for scalable batch and stream data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

GeoMesa - Suite of tools for working with big geo-spatial data in a distributed fashion


GeoMesa is an open-source, distributed, spatio-temporal database built on a number of distributed cloud data storage systems, including Accumulo, HBase, Cassandra, and Kafka. Leveraging a highly parallelized indexing strategy, GeoMesa aims to provide as much of the spatial querying and data manipulation to Accumulo as PostGIS does to Postgres.

Pinot - A realtime distributed OLAP datastore


Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally, so that it can scale to larger data sets and higher query rates as needed.

Bobo - Faceted search library based on Lucene


Bobo Browse is an information retrieval technology that provides navigational browsing into a semi-structured dataset. Beyond the result set from queries and selections, Bobo Browse also provides the facets from this point of browsing. It provides support to sort documents on fields that have multiple values. It is stable and used by LinkedIn.

IndexTank - Search Engine powers Reddit


The IndexTank search engine powers search on Reddit, the social bookmarking site. IndexTank was acquired by LinkedIn, which released the project as open source. It includes features such as variable boosts, facets, faceted search, snippeting, custom scoring functions, suggest, and autocomplete.