Samza - Distributed Stream Processing Framework

  •        0

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It provides a very simple call-back based process message API that should be familiar to anyone who's used Map/Reduce. Samza was originally developed at LinkedIn. It's currently used to process tracking data, service log data, and for data ingestion pipelines for realtime services.

http://samza.incubator.apache.org/

Tags
Implementation
License
Platform

   




Related Projects

incubator-samza-hello-samza - Mirror of Apache Samza


[Hello Samza](http://samza.incubator.apache.org/startup/hello-samza/0.8.0/) is developed as part of the [Apache Samza](http://samza.incubator.apache.org) project. Please direct questions, improvements and bug fixes there. Questions about [Hello Samza](http://samza.incubator.apache.org/startup/hello-samza/0.8.0/) are welcome on the [dev list](http://samza.incubator.apache.org/community/mailing-lists.html) and the [Samza JIRA](https://issues.apache.org/jira/browse/SAMZA) has a hello-samza compone

NSQ - A realtime distributed messaging platform in Go


NSQ is a realtime distributed messaging platform designed to operate at scale, handling billions of messages per day. It promotes distributed and decentralized topologies without single points of failure, enabling fault tolerance and high availability coupled with a reliable message delivery guarantee. It scales horizontally, without any centralized brokers. Built-in discovery simplifies the addition of nodes to the cluster.

SenseiDB - Distributed, Realtime, Semi-Structured Database from LinkedIn


Sensei is a distributed data system that was built to support many product initiatives at LinkedIn, including the real-time faceted search in LinkedIn Signal and the news feed and tabs on the Homepage. Sensei is both a search engine and a database. It is designed to query and navigate through documents that consist of unstructured text and well-formed and structured metadata. Sensei is both a search engine and a database.

samoa


SAMOA is a platform for mining on big data streams.It is a distributed streaming machine learning (ML) framework that contains a programing abstraction for distributed streaming ML algorithms.SAMOA enables development of new ML algorithms without dealing with the complexity of underlying streaming processing engines (SPE, such as Apache Storm and Apache S4). SAMOA also provides extensibility in integratingnew SPEs into the framework. These features allow SAMOA users to develop distributed stream

ActiveMQ


Apache ActiveMQ is the most popular and powerful open source messaging and Integration Patterns provider. Apache ActiveMQ is fast, supports many Cross Language Clients and Protocols, comes with easy to use Enterprise Integration Patterns and many advanced features while fully supporting JMS 1.1 and J2EE 1.4.

Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines


Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Project-voldemort - A distributed database, Clone of Amazon's Dynamo


Voldemort is a distributed key-value storage system. Data is automatically replicated over multiple servers. Data is automatically partitioned so each server contains only a subset of the total data. Server failure is handled transparently. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient.

Pulsar - Distributed pub-sub Messaging System from Yahoo


Pulsar is a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API. It is horizontally scalable (Millions of independent topics and millions of messages published per second), Strong ordering and consistency guarantees, Low latency , REST API, Geo Replication and lot more.

Openmeetings - Open Source Web Conferencing


Openmeetings provides video conferencing, instant messaging, white board, collaborative document editing and other groupware tools using API functions of the Red5 Streaming Server for Remoting and Streaming.

Apache Flink - Platform for Scalable Batch and Stream Data Processing


Apache Flink is an open source platform for scalable batch and stream data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

GeoMesa - Suite of tools for working with big geo-spatial data in a distributed fashion


GeoMesa is an open-source, distributed, spatio-temporal database built on a number of distributed cloud data storage systems, including Accumulo, HBase, Cassandra, and Kafka. Leveraging a highly parallelized indexing strategy, GeoMesa aims to provide as much of the spatial querying and data manipulation to Accumulo as PostGIS does to Postgres.

Zopkio - A Functional and Performance Test Framework for Distributed Systems


.. image:: https://travis-ci.org/linkedin/Zopkio.svg?branch=master :target: https://travis-ci.org/linkedin/Zopkio

Bobo - Faceted search library based on Lucene


Bobo Browse is an information retrieval technology that provides navigational browsing into a semi-structured dataset. Beyond the result set from queries and selections, Bobo Browse also provides the facets from this point of browsing. It provides support to sort documents on fields that have multiple values. It is stable and used by LinkedIn.

IndexTank - Search Engine powers Reddit


IndexTank search engine powers search in Reddit, Social bookmarking site. IndexTank is acquired by LinkedIn and released the project as open source. It includes features like Variables boosts, Facets, Faceted search, Snippeting, Custom scoring functions, Suggest, and Autocomplete.

Rambox - Messaging and Emailing app that combines common web applications into one


Rambox is a messaging and emailing app that combines common web applications into one. It gives you the possibility to add common services many times you need, all in one place. It's perfect for people who work with many services for business and private accounts.

Kafka - A high-throughput distributed messaging system


Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) are a key ingredient in many of the social feature on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements. This kind of ad hoc solution is a viable solution to providing logging data to Hadoop.

Hazelcast Jet - Distributed data processing engine, built on top of Hazelcast


Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In Memory Data Grid (IMDG) to provide a lightweight package of a processor and a scalable in-memory storage. It supports distributed java.util.stream API support for Hazelcast data structures such as IMap and IList, Distributed implementations of java.util.{Queue, Set, List, Map} data structures highly optimized to be used for the processing

s4 - Distributed Stream Computing Platform


S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. S4 has been deployed in production systems at Yahoo to process thousands of search queries per second.

nsq - A realtime distributed messaging platform


* **Docs**: [http://nsq.io][docs] * **Twitter**: [@nsqio][nsqio_twitter][![Build Status](https://secure.travis-ci.org/bitly/nsq.svg?branch=master)](http://travis-ci.org/bitly/nsq)**NSQ** is a realtime distributed messaging platform designed to operate at scale, handlingbillions of messages per day.It promotes *distributed* and *decentralized* topologies without single points of failure,enabling fault tolerance and high availability coupled with a reliable message deliveryguarantee. See [featur