Apache Storm - Distributed and fault-tolerant realtime computation

Storm is a distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.

Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.
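
Topologies are written natively in Java, but as a hedged illustration, here is a minimal word-count topology sketched with the Python streamparse DSL (one of the related projects listed below); the spout, bolt, and module names are hypothetical.

```python
# Hedged sketch of a Storm topology via the streamparse DSL (see the
# streamparse entry below). WordSpout and WordCountBolt are
# hypothetical classes, not taken from any project listed here.
from streamparse import Grouping, Topology

from spouts.words import WordSpout          # hypothetical spout module
from bolts.wordcount import WordCountBolt   # hypothetical bolt module


class WordCount(Topology):
    # Stage 1: a spout emitting a stream of words.
    word_spout = WordSpout.spec()
    # Stage 2: a counting bolt; the fields grouping repartitions the
    # stream so the same word always reaches the same bolt instance.
    count_bolt = WordCountBolt.spec(
        inputs={word_spout: Grouping.fields("word")},
        par=2,  # run two parallel executors for this stage
    )
```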

http://storm.apache.org/
https://github.com/apache/storm
http://storm-project.net/
https://github.com/nathanmarz/storm/

Related Projects

jstorm - Enterprise Stream Process Engine


Alibaba JStorm is a fast and stable enterprise stream processing engine. It runs programs up to 4x faster than Apache Storm and makes it easy to switch from record-at-a-time mode to mini-batch mode. It is more than a stream processing engine: it aims to be a single solution for realtime requirements, a whole realtime ecosystem.

cortana-intelligence-personalized-offers - Generate real-time personalized offers on a retail website to engage more closely with customers


In today’s highly competitive and connected environment, modern businesses can no longer survive with generic, static online content. Marketing strategies built on traditional tools are often expensive, hard to implement, and fail to produce the desired return on investment, because they do not take full advantage of the data collected to create a more personalized experience for the user. Surfacing offers that are customized for the user has become essential to building customer loyalty and remaining profitable. On a retail website, customers want intelligent systems that provide offers and content based on their unique interests and preferences. By analyzing the massive amounts of data generated from all types of user interactions, digital marketing teams have a unique opportunity to deliver highly relevant and personalized offers to each user. However, building a reliable and scalable big data infrastructure, and developing sophisticated machine learning models that personalize to each user, is not trivial.

Cortana Intelligence provides advanced analytics tools through Microsoft Azure (data ingestion, data storage, data processing and advanced analytics components), all of the essential elements for building a personalized-offers solution. This solution combines several Azure services: Event Hubs collects real-time consumption data; Stream Analytics aggregates the streaming data and updates the data used to make personalized offers to the customer; Azure DocumentDB stores the customer, product and offer information; Azure Storage manages the queues that simulate user interaction; Azure Functions coordinate the user simulation and generate the personalized offers; Azure Machine Learning implements and executes the product recommendations, with Azure Redis Cache serving pre-computed recommendations when no user history is available; and Power BI visualizes the activity of the system with the data from DocumentDB.

snappydata - SnappyData: OLTP + OLAP Database built on Apache Spark


SnappyData is a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) in a single integrated cluster. We realize this platform through a seamless integration of Apache Spark (as a big data computational engine) with GemFire XD (as an in-memory transactional store with scale-out SQL semantics).

demos - Stream processing demo


Stream processing demo showing how Kafka can be used to create a real-time dashboard at scale. The scenario: a Spotify-esque service that lets users play songs, with a dashboard presenting big data in real time for its 10 million subscribers. The dashboard shows how many people are using the free or paid service, which pages they're hitting, and where in the United States each user is located.
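
As a hedged sketch of the ingestion side of such a demo, a producer might publish play events to Kafka with the kafka-python client; the broker address, topic name, and event fields below are illustrative assumptions, not taken from the demo project.

```python
# Hypothetical sketch of publishing play events to Kafka using the
# kafka-python client. Topic name, broker address and event fields
# are illustrative assumptions.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "user_id": 42,
    "plan": "free",        # free or paid service
    "page": "/player",     # page the user is hitting
    "state": "CA",         # US location
    "ts": int(time.time()),
}
# Dashboard consumers would aggregate this topic in real time.
producer.send("play-events", event)
producer.flush()
```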

storm-example-projects


Storm is a realtime computing system that supports fault tolerance, horizontal scalability, and guaranteed message processing with excellent performance. This is a library of sample projects, essentially exposing reusable bolts for realtime computation.

streamparse - Run Python in Apache Storm topologies. Pythonic API, CLI tooling, and a topology DSL.


Streamparse lets you run Python code against real-time streams of data via Apache Storm. With streamparse you can create Storm bolts and spouts in Python without having to write a single line of Java. It also provides handy CLI utilities for managing Storm clusters and projects. The Storm/streamparse combo can be viewed as a more robust alternative to Python worker-and-queue systems, as might be built atop frameworks like Celery and RQ. It offers a way to do "real-time map/reduce style computation" against live streams of data. It can also be a powerful way to scale long-running, highly parallel Python processes in production.
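
A bolt in streamparse is an ordinary Python class; the word-count bolt below follows the pattern from the streamparse documentation and would plug into a topology like the sketch near the top of this page.

```python
# A minimal streamparse bolt, following the word-count pattern from
# the streamparse documentation; the counting logic is illustrative.
from collections import Counter

from streamparse import Bolt


class WordCountBolt(Bolt):
    outputs = ["word", "count"]

    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] += 1
        # Emit the running count downstream, with no Java required.
        self.emit([word, self.counts[word]])
```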

spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.


Spindle is Brandon Amos' 2014 summer internship project with Adobe Research and is not under active development. Analytics platforms such as Adobe Analytics are growing to process petabytes of data in real time. Delivering responsive interfaces that query this amount of data is difficult, and there are many distributed data processing technologies, such as Hadoop MapReduce, Apache Spark, Apache Drill, and Cloudera Impala, for building low-latency query systems.

LogEventsProcessing - real time log event processing using storm, kafka, logstash & cassandra


Real-time log event processing using Storm, Kafka, Logstash & Cassandra.

Bagri - XML/Document DB on top of distributed cache


Bagri is a document database built on top of a distributed cache solution such as Hazelcast or Coherence. The system lets you process semi-structured, schema-less documents and run distributed queries on them in real time. It scales horizontally very well through data sharding, with all documents distributed evenly across the distributed cache partitions.

birding - Stream processing in Python of twitter searches using public APIs.


birding is an open source project that produces a stream of recent Twitter activity based on a sequence of search terms, using only Twitter's public APIs. It serves as both a standalone project and a demo of distributed real-time computing with Python using Storm/streamparse and Kafka/pykafka. See the docs for a full discussion of birding, including how to use birding in an existing streamparse project.
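
For a sense of the Kafka side, here is a hedged sketch of consuming such a stream with pykafka; the topic name and message format are assumptions for illustration, not birding's actual wire format.

```python
# Hedged sketch of consuming a stream of tweet search results with
# pykafka, in the spirit of birding's Kafka integration. Topic name
# and message format are assumptions.
import json

from pykafka import KafkaClient

client = KafkaClient(hosts="127.0.0.1:9092")
topic = client.topics[b"tweet-search-results"]
consumer = topic.get_simple_consumer()

for message in consumer:
    if message is not None:
        tweet = json.loads(message.value.decode("utf-8"))
        print(tweet.get("text", ""))
```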

Pinot - A realtime distributed OLAP datastore


Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally, so that it can scale to larger data sets and higher query rates as needed.

Vespa - Yahoo's big data serving engine


Vespa is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time. Vespa is the serving platform for Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, and Flickr.

Druid IO - Real Time Exploratory Analytics on Large Datasets


Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte-sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations. Druid can load both streaming and batch data.
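
Queries are typically sent to a Druid broker as JSON over HTTP; here is a hedged sketch of a native timeseries query, where the datasource, interval, and metric names are illustrative assumptions.

```python
# Hedged sketch of a Druid native timeseries query over HTTP. The
# broker address, datasource and metric names are assumptions.
import json

import requests

query = {
    "queryType": "timeseries",
    "dataSource": "page_events",
    "granularity": "minute",
    "intervals": ["2024-01-01T00:00:00Z/2024-01-02T00:00:00Z"],
    "aggregations": [{"type": "count", "name": "events"}],
}

resp = requests.post(
    "http://localhost:8082/druid/v2/",   # default broker endpoint
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
for row in resp.json():
    print(row["timestamp"], row["result"]["events"])
```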

VoltDB - Fast Scalable SQL DBMS with ACID


VoltDB was specifically designed for contemporary software applications that are pushed beyond their limits by high-volume data sources. VoltDB provides the ability to capture, store and process incoming data at millions of read/write operations per second. And VoltDB’s relational model opens that data to real-time analysis, using familiar business intelligence tools, to identify data patterns and trends, spot anomalies, or perform tracking and alerting.

AthenaX - SQL-based streaming analytics platform at scale


AthenaX is a streaming analytics platform that enables users to run production-quality, large-scale streaming analytics using Structured Query Language (SQL). AthenaX was released and open sourced by Uber Technologies. It is capable of scaling across hundreds of machines and processing hundreds of billions of real-time events daily. Released under the Apache 2.0 License.

storm


Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more

oryx - Simple real-time large-scale machine learning infrastructure.


The Oryx open source project provides simple, real-time, large-scale machine learning / predictive analytics infrastructure. It implements a few classes of algorithm commonly used in business applications: collaborative filtering / recommendation, classification / regression, and clustering. It can continuously build models from a stream of data at large scale using Apache Hadoop (http://hadoop.apache.org/).

storm-on-dotcloud - Easily deploy Storm, the real-time computation system, on dotCloud


Easily deploy Storm, the real-time computation system, on dotCloud

sllb - Sliding-LogLog-Beta


An implementation of an algorithm for estimating the number of active flows in a data stream. The algorithm adapts the HyperLogLog algorithm of Flajolet et al. to data stream processing by adding a sliding window mechanism, which gives it the advantage of estimating, at any time, the number of flows seen over any duration bounded by the length of the sliding window. The estimate is very accurate, with a standard error of about 1.04/sqrt(m) (the same as the HyperLogLog algorithm). Because the new algorithm answers more flexible queries, it needs additional memory compared to the HyperLogLog algorithm; this additional memory is proved to be at most 5m * ln(n/m) bytes, where n is the real number of flows in the sliding window. For instance, with an additional memory of only 35 kB, a standard error of about 3% can be achieved for a data stream of several million flows. Theoretical results are validated on both real and synthetic traffic.
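
The quoted accuracy/memory trade-off is easy to check numerically from the two formulas above; a small sketch, where the choice of m = 1024 registers is an assumption:

```python
# Numeric check of the bounds quoted above: standard error of about
# 1.04/sqrt(m), and at most 5*m*ln(n/m) bytes of extra sliding-window
# memory, for m registers and n flows in the window.
import math


def std_error(m: int) -> float:
    return 1.04 / math.sqrt(m)


def extra_memory_bytes(m: int, n: int) -> float:
    return 5 * m * math.log(n / m)


m = 1024         # registers (an assumed value, not from the project)
n = 2_000_000    # flows in the sliding window ("several million")

print(f"standard error: {std_error(m):.2%}")                         # ~3.25%
print(f"extra memory:   {extra_memory_bytes(m, n) / 1024:.0f} kB")   # ~38 kB
```

Both outputs are in line with the 35 kB / 3% example quoted in the description.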

memcached_mon - A processing hack for visualizing memcached data in real(ish) time.


A processing hack for visualizing memcached data in real(ish) time.