Druid IO - Real Time Exploratory Analytics on Large Datasets

  •        616

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations. Druid can load both streaming and batch data.

http://druid.io/
https://github.com/druid-io/druid/

Tags
Implementation
License
Platform

   




Related Projects

Akumuli - Time-series database


Akumuli is a time-series database for modern hardware. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".

Timescaledb - An open-source time-series database optimized for fast ingest and complex queries


TimescaleDB is an open-source database designed to make SQL scalable for time-series data. It is engineered up from PostgreSQL, providing automatic partitioning across time and space (partitioning key), as well as full SQL support. TimescaleDB is packaged as a PostgreSQL extension and released under the Apache 2 open-source license.

Kairosdb - Fast distributed scalable time series database written on top of Cassandra


KairosDB is a fast distributed scalable time series database written on top of Cassandra. Data can be pushed in KairosDB via multiple protocols : Telnet, Rest, Graphite. KairosDB stores time series in Cassandra, the popular and performant NoSQL datastore. It supports aggregators which can perform an operation on data points and down samples. Standard functions like min, max, sum, count, mean etc.

Gnocchi - Time series database


Gnocchi is an open-source |time series| database. The problem that Gnocchi solves is the storage and indexing of |time series| data and resources at a large scale. This is useful in modern cloud platforms which are not only huge but also are dynamic and potentially multi-tenant. Gnocchi takes all of that into account. Gnocchi has been designed to handle large amounts of aggregates being stored while being performant, scalable and fault-tolerant. While doing this, the goal was to be sure to not build any hard dependency on any complex storage system.

Pinot - A realtime distributed OLAP datastore


Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally, so that it can scale to larger data sets and higher query rates as needed.


DalmatinerDB - Fast distributed metrics database in Erlang


DalmatinerDB is a metric database written in pure Erlang. It takes advantage of some special properties of metrics to make some tradeoffs. Its goal is to make a store for metric data (time, value of a metric) that is fast, has a low overhead, and is easy to query and manage. DalmatinerDB allows for metric input in second or even sub-second precision. It will interpolate the missing values to the best of it’s abilities. This is usually acceptable for aggregated data.

InfluxDB - Distributed Time Series Database


InfluxDB is an open-source, distributed, time series database with no external dependencies. It's useful for recording metrics, events, and performing analytics. Everything in InfluxDB is a time series that you can perform standard functions on like min, max, sum, count, mean, median, percentiles, and more. Collect your data on any interval and compute rollups on the fly later.

SiriDB - Highly-scalable, robust and super fast time series database


SiriDB is a highly-scalable, robust and super fast time series database. Build from the ground up SiriDB uses a unique mechanism to operate without indexes and allows server resources to be added on the fly. SiriDB's unique query language includes dynamic grouping of time series for easy and super fast analysis over large amount's of time series.

TrailDB - Efficient tool for storing and querying series of events


TrailDB is a library, implemented in C, which allows you to query series of events at blazing speed. TrailDB is also optimized for speed of development: Use its simple API with your favorite language, in your favorite environment. TrailDB's secret sauce is data compression. It leverages predictability of time-based data to compress your data to a fraction of its original size. In contrast to traditional compression, you can query the encoded data directly, decompressing only the parts you need.

influxdb - Scalable datastore for metrics, events, and real-time analytics


InfluxDB is an open source time series database with no external dependencies. It's useful for recording metrics, events, and performing analytics. If you're feeling adventurous and want to contribute to InfluxDB, see our contributing doc for info on how to make feature requests, build from source, and run tests.

OpenTSDB - A scalable, distributed Time Series Database.


OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.

ClickHouse - Columnar DBMS and Real Time Analytics


ClickHouse is an open source column-oriented database management system capable of real time generation of analytical data reports using SQL queries. It is Linearly Scalable, Blazing Fast, Highly Reliable, Fault Tolerant, Data compression, Real time query processing, Web analytics, Vectorized query execution, Local and distributed joins. It can process hundreds of millions to more than a billion rows and tens of gigabytes of data per single server per second.

ElasticSearch - Distributed, RESTful search and analytics engine


Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.

InfiniDB - Scale-up analytics database engine for data warehousing and business intelligence


InfiniDB Community Edition is a scale-up, column-oriented database for data warehousing, analytics, business intelligence and read-intensive applications. InfiniDB's data warehouse columnar engine is multi-terabyte capable and accessed via MySQL.

Cubism.js - Time Series Visualization


Cubism.js is a D3 plugin for visualizing time series. Use Cubism to construct better realtime dashboards, pulling data from Graphite, Cube and other sources. Cubism fetches time series data incrementally: after the initial display, Cubism reduces server load by polling only the most recent values. Cubism renders incrementally, too, using Canvas to shift charts one pixel to the left.

ceres - Distributable time-series database (not actively maintained)


Ceres is not actively maintained.Ceres is a time-series database format intended to replace Whisper as the default storage format for Graphite. In contrast with Whisper, Ceres is not a fixed-size database and is designed to better support sparse data of arbitrary fixed-size resolutions. This allows Graphite to distribute individual time-series across multiple servers or mounts.

EventQL - The database for large-scale event analytics


EventQL is a distributed, column-oriented database built for large-scale event collection and analytics. It runs super-fast SQL and MapReduce queries. Its features include Automatic partitioning, Columnar storage, Standard SQL support, Scales to petabytes, Timeseries and relational data, Fast range scans and lot more.

Timelion - Time series composer for Elasticsearch and beyond


Timelion, pronounced "Timeline", brings together totally independent data sources into a single interface, driven by a simple, one-line expression language combining data retrieval, time series combination and transformation, plus visualization. Every Timelion expression starts with a data source function.

snappydata - SnappyData - The Spark Database. Stream, Transact, Analyze, Predict in one cluster


Apache Spark is a general purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, this can also be quite inefficient and expensive. Analytic processing requires massive data sets to be repeatedly copied and data to be reformatted to suit Spark. In many cases, it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, it necessitates streaming the entire table into Spark to do the aggregation. Caching within Spark is immutable and results in stale insight. At SnappyData, we take a very different approach. SnappyData fuses a low latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten). All query engine operators are significantly more optimized through better vectorization and code generation. The net effect is, an order of magnitude performance improvement when compared to native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.

Graphite - A highly scalable real-time graphing system


Graphite is an enterprise-scale monitoring tool that runs well on cheap hardware. It is a highly scalable real-time graphing system. It stores numeric time-series data and renders graphs of this data on demand.