Samza - Distributed Stream Processing Framework

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance, processor isolation, security, and resource management. It provides a simple callback-based "process message" API that should be familiar to anyone who has used MapReduce. Samza was originally developed at LinkedIn, where it is used to process tracking data and service log data and to run data ingestion pipelines for realtime services.

http://samza.incubator.apache.org/
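
The callback-based "process message" style described above can be illustrated with a minimal, self-contained sketch. This is not Samza's API (Samza itself is a Java framework); the names `PageViewCounter`, `process`, `emit`, and `run` are hypothetical and chosen only for illustration. The idea is that the framework owns the loop and calls user code once per incoming message.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Hypothetical types for illustration only; this is not Samza's API.
Message = Tuple[str, str]          # (key, value)
Emit = Callable[[str, int], None]  # callback used to send output downstream


class PageViewCounter:
    """A task whose process() callback is invoked once per incoming message."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def process(self, message: Message, emit: Emit) -> None:
        # Count views per page and emit the running total downstream.
        _key, page = message
        self.counts[page] += 1
        emit(page, self.counts[page])


def run(task: PageViewCounter, stream: Iterable[Message]) -> List[Tuple[str, int]]:
    """Stand-in for the framework loop: feed each message to the callback."""
    output: List[Tuple[str, int]] = []
    for message in stream:
        task.process(message, lambda k, v: output.append((k, v)))
    return output


if __name__ == "__main__":
    events = [("u1", "/home"), ("u2", "/jobs"), ("u1", "/home")]
    for page, count in run(PageViewCounter(), events):
        print(page, count)
```

In a real deployment the loop in `run` would be replaced by the framework, which reads from a message stream and handles restarts and partitioning; the task only supplies the per-message callback.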

Related Projects

RocketMQ - Distributed messaging and streaming data platform


Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.

Kafka-Message-Server - Example application based on Apache Kafka showing its usage as a distributed message server


Apache Kafka is yet another precious gem from the Apache Software Foundation. Kafka was originally developed at LinkedIn and later became an Apache project. Apache Kafka is a distributed publish-subscribe messaging system. It differs from traditional messaging systems in that it is designed as a distributed system, persists messages on disk, and supports multiple subscribers. Kafka-Message-Server is a sample application demonstrating the use of Kafka as a message server. See the project's instructions for using the sample application.

NSQ - A realtime distributed messaging platform in Go


NSQ is a realtime distributed messaging platform designed to operate at scale, handling billions of messages per day. It promotes distributed and decentralized topologies without single points of failure, enabling fault tolerance and high availability coupled with a reliable message delivery guarantee. It scales horizontally, without any centralized brokers. Built-in discovery simplifies the addition of nodes to the cluster.

SenseiDB - Distributed, Realtime, Semi-Structured Database from LinkedIn


Sensei is a distributed data system that was built to support many product initiatives at LinkedIn, including the real-time faceted search in LinkedIn Signal and the news feed and tabs on the homepage. Sensei is both a search engine and a database. It is designed to query and navigate through documents that consist of unstructured text and well-formed, structured metadata.


Apache Pulsar - Distributed pub-sub messaging platform


Pulsar is a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API. It has run in production at Yahoo scale for over 3 years, with millions of messages per second across millions of topics. Its features include horizontal scalability, low latency, high throughput, multi-tenancy, geo-replication, transparent batching of messages, transparent handling of partitioned topics, a REST API for provisioning, admin, and stats, and more.

samoa


SAMOA is a platform for mining big data streams. It is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. SAMOA enables development of new ML algorithms without dealing with the complexity of underlying stream processing engines (SPEs, such as Apache Storm and Apache S4). SAMOA also provides extensibility for integrating new SPEs into the framework. These features allow SAMOA users to develop distributed stream

ActiveMQ


Apache ActiveMQ is the most popular and powerful open source messaging and Integration Patterns provider. Apache ActiveMQ is fast, supports many cross-language clients and protocols, and comes with easy-to-use Enterprise Integration Patterns and many advanced features while fully supporting JMS 1.1 and J2EE 1.4.

Project-voldemort - A distributed database, clone of Amazon's Dynamo


Voldemort is a distributed key-value storage system. Data is automatically replicated over multiple servers. Data is automatically partitioned so each server contains only a subset of the total data. Server failure is handled transparently. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient.

Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines


Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Pulsar - Distributed pub-sub Messaging System from Yahoo


Pulsar is a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API. It is horizontally scalable (millions of independent topics and millions of messages published per second) and offers strong ordering and consistency guarantees, low latency, a REST API, geo-replication, and more.

Openmeetings - Open Source Web Conferencing


Openmeetings provides video conferencing, instant messaging, a whiteboard, collaborative document editing, and other groupware tools using API functions of the Red5 Streaming Server for remoting and streaming. Meetings can be recorded, and screen sharing is also available.

Luxun - A high-throughput, persistent, distributed, publish-subscribe messaging system based on memo


A high-throughput, persistent, distributed, publish-subscribe messaging system based on memory-mapped files and Thrift RPC.

Pravega - Streaming as a new software defined storage primitive


Pravega is an open source distributed storage service implementing Streams. It offers Stream as the main primitive for the foundation of reliable storage systems: a high-performance, durable, elastic, and unlimited append-only byte stream with strict ordering and consistency.

FiloDB - Distributed. Columnar. Versioned. Streaming. SQL.


High-performance distributed analytical database + Spark SQL queries + built for streaming. Columnar, versioned layers of data wrapped in a yummy high-performance analytical database engine.

Apache Flink - Platform for Scalable Batch and Stream Data Processing


Apache Flink is an open source platform for scalable batch and stream data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

Rambox - Messaging and Emailing app that combines common web applications into one


Rambox is a messaging and emailing app that combines common web applications into one. It lets you add common services as many times as you need, all in one place. It's perfect for people who work with many services for business and private accounts.

IndexTank - Search engine that powers Reddit


The IndexTank search engine powers search on Reddit, the social bookmarking site. IndexTank was acquired by LinkedIn, which released the project as open source. It includes features like Variables boosts, Facets, Faceted search, Snippeting, Custom scoring functions, Suggest, and Autocomplete.