Distrib(uted) Processing Grid

  •        89

Distrib is a simple yet powerful distributed processing system.

http://distrib.codeplex.com/

Tags
Implementation
License
Platform

   




Related Projects

Hazelcast Jet - Distributed data processing engine, built on top of Hazelcast


Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In Memory Data Grid (IMDG) to provide a lightweight package of a processor and a scalable in-memory storage. It supports distributed java.util.stream API support for Hazelcast data structures such as IMap and IList, Distributed implementations of java.util.{Queue, Set, List, Map} data structures highly optimized to be used for the processing

Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines


Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Bagri - XML/Document DB on top of distributed cache


Bagri is a Document Database built on top of distributed cache solution like Hazelcast or Coherence. The system allows to process semi-structured schema-less documents and perform distributed queries on them in real-time. It scales horizontally very well with use of data sharding, when all documents are distributed evenly between distributed cache partitions.

Apache Storm - Distributed and fault-tolerant realtime computation


Storm is a distributed real time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real time processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.

Hypertable - A high performance, scalable, distributed storage and processing system for structured


Hypertable is based on Google's Bigtable Design, which is a proven scalable design that powers hundreds of Google services. Many of the current scalable NoSQL database offerings are based on a hash table design which means that the data they manage is not kept physically ordered. Hypertable keeps data physically sorted by a primary key and it is well suited for Analytics.



databus - Source-agnostic distributed change data capture system


In Internet architectures, data systems are typically categorized into source-of-truth systems that serve as primary stores for the user-generated writes, and derived data stores or indexes which serve reads and other complex queries. The data in these secondary stores is often derived from the primary data through custom transformations, sometimes involving complex processing driven by business logic. Similarly, data in caching tiers is derived from reads against the primary data store, but needs to get invalidated or refreshed when the primary data gets mutated. A fundamental requirement emerging from these kinds of data architectures is the need to reliably capture, flow and process primary data changes.We have built Databus, a source-agnostic distributed change data capture system, which is an integral part of LinkedIn's data processing pipeline. The Databus transport layer provides latencies in the low milliseconds and handles throughput of thousands of events per second per server while supporting infinite look back capabilities and rich subscription functionality.

Samza - Distributed Stream Processing Framework


Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It provides a very simple call-back based process message API that should be familiar to anyone who's used Map/Reduce. Samza was originally developed at LinkedIn. It's currently used to process tracking data, service log data, and for data ingestion pipelines for realtime services.

HPCC System - Hadoop alternative


HPCC is a proven and battle-tested platform for manipulating, transforming, querying and data warehousing Big Data. It supports two type of configuration. Thor is responsible for consuming vast amounts of data, transforming, linking and indexing that data. It functions as a distributed file system with parallel processing power spread across the nodes. Roxie, the Data Delivery Engine, provides separate high-performance online query processing and data warehouse capabilities.

Apache Flink - Platform for Scalable Batch and Stream Data Processing


Apache Flink is an open source platform for scalable batch and stream data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

Performance Co-Pilot - System Performance and Analysis Framework.


Performance Co-Pilot (PCP) provides a framework and services to support system-level performance monitoring and management. It presents a unifying abstraction for all of the performance data in a system, and many tools for interrogating, retrieving and processing that data. The distributed PCP architecture makes it especially useful for those seeking centralized monitoring of distributed processing.

tangle - A python framework for distributed data processing


A python framework for distributed data processing

Modular toolkit for Data Processing MDP


The Modular toolkit for Data Processing (MDP) is a Python data processing framework. From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures. From the scientific developer's perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The new i

jstorm - Enterprise Stream Process Engine


Alibaba JStorm is an enterprise fast and stable streaming process engine. It runs program up to 4x faster than Apache Storm. It is easy to switch from record mode to mini-batch mode. It is not only a streaming process engine. It means one solution for real time requirement, whole realtime ecosystem.

spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.


Spindle is Brandon Amos' 2014 summer internship project with Adobe Research and is not under active development.Analytics platforms such as Adobe Analytics are growing to process petabytes of data in real-time. Delivering responsive interfaces querying this amount of data is difficult, and there are many distributed data processing technologies such as Hadoop MapReduce, Apache Spark, Apache Drill, and Cloudera Impala to build low-latency query systems.

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop


Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.

snappydata - SnappyData: OLTP + OLAP Database built on Apache Spark


SnappyData is a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) in a single integrated cluster. We realize this platform through a seamless integration of Apache Spark (as a big data computational engine) with GemFire XD (as an in-memory transactional store with scale-out SQL semantics).

genie - Distributed Big Data Orchestration Service


Genie is a federated job orchestration engine developed by Netflix. Genie provides REST-ful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Spark, Presto, Sqoop and more. It also provides APIs for managing the metadata of many distributed processing clusters and the commands and applications which run on them.See the official website to find documentation about Genie and specific documentation for various releases.

Sensorbee - Lightweight stream processing engine for IoT


Sensorbee is designed for low-latency processing of streaming data at the edge of the network. IoT devices frequently generate large volumes of unstructured streaming data, such as video and audio streams. Even if the data streams are structured, they may be meaningless if their temporal characteristics are not considered. Cloud-based services are generally not good at processing these kinds of data. Preprocessing data streams before they are sent to the cloud makes large scale data processing in the cloud more efficient and reduces the usage of network bandwidth.

BoomFilters - Probabilistic data structures for processing continuous, unbounded streams.


Boom Filters are probabilistic data structures for processing continuous, unbounded streams. This includes Stable Bloom Filters, Scalable Bloom Filters, Counting Bloom Filters, Inverse Bloom Filters, Cuckoo Filters, several variants of traditional Bloom filters, HyperLogLog, Count-Min Sketch, and MinHash.Classic Bloom filters generally require a priori knowledge of the data set in order to allocate an appropriately sized bit array. This works well for offline processing, but online processing typically involves unbounded data streams. With enough data, a traditional Bloom filter "fills up", after which it has a false-positive probability of 1.

zimg - A lightweight and high performance image storage and processing system.


Project zimg is a lightweight image storage and processing system. It's written in C and it has high performance in image field. The zimg is designed for high concurrency image server. It supports many features for storing and processing images.The concurrent I/O, distributed storage and in time processing ability of zimg is excellent. You needn't nginx in your image server any more. In the benchmark test, zimg can deal with 3000+ image downloading task per second and 90000+ HTTP echo request per second on a high concurrency level. The performance is higher than PHP or other image processing server. More infomation of zimg is in the documents below.