rainbow - A data layout optimization framework for wide tables stored on HDFS

  •        15

Rainbow is a tool that helps improve the I/O performance of wide tables stored in columnar formats on HDFS. More information in our project main page.

https://dbiir.github.io/rainbow/
https://github.com/dbiir/rainbow


Tags
Implementation
License
Platform

   




Related Projects

InfiniDB - Scale-up analytics database engine for data warehousing and business intelligence

  •    C++

InfiniDB Community Edition is a scale-up, column-oriented database for data warehousing, analytics, business intelligence and read-intensive applications. InfiniDB's data warehouse columnar engine is multi-terabyte capable and accessed via MySQL.

Hypertable - A high performance, scalable, distributed storage and processing system for structured

  •    C++

Hypertable is based on Google's Bigtable Design, which is a proven scalable design that powers hundreds of Google services. Many of the current scalable NoSQL database offerings are based on a hash table design which means that the data they manage is not kept physically ordered. Hypertable keeps data physically sorted by a primary key and it is well suited for Analytics.

rdb - Javascript ORM

  •    Javascript

ORM for nodejs. Supports postgres, mySql and sqlite. 1.7.1 Support for schemas (postgres only). 1.7.0 sqlite3 is now a peer dependency. Add it to your own package.json if you intend to use it. 1.6.9 Bugfix: one-to-many relation returns empty if strategy is included. 1.6.8 Bugfix: one-to-many relation returns empty if insert/update is done earlier in transaction. 1.6.7 Bugfix in relations. 1.6.6 Bugfix. 1.6.5 Improved performance on relations. 1.6.4 Bugfix. 1.6.3 Bugfix: potential incorrect timeZoneOffset when serializing date to JSON. Got timeZoneOffset from now() instead of on actual date. 1.6.2 Removed es6 syntax to ensure backwards compatability. Fixed global var leak. 1.6.1 Now supporting sqlite. 1.6.0 Bugfix: potential ambigous column error when using limit and relating to other tables. 1.5.9 Bugfix: using multipleStatements in mySql could sometimes cause an error when updates are run right before a select. Improved performance on limit when relating to other tables. Using uuid instead of node-uuid Updated all dependencies but generic-pool to latest. (Generic-pool has some breaking changes in latest. I will update it in next release.) 1.5.8 Cleanup line breaks in documentation. 1.5.7 Bugfix: getById.exclusive and tryGetById.exclusive did not lock if row was cached. Improved performance on tryGetFirst. 1.5.6 Raw sql filters can accept sql both as string and as function. E.g. var filter = {sql: function() {return 'foo > 1';}}. 1.5.5 Optional locks for getMany, tryGetFirst and tryGetById. Instead of calling getMany(params) just call getMany.exclusive(params). Same syntax goes for tryGetFirst and tryGetById. This will result in SELECT FOR UPDATE. Bugfix: bulk deletes now accepts raw sql filters too. 1.5.4 Transaction locks. Postgres only. 1.5.3 Upgraded to pg 6.0.3 1.5.2 Improved performance and reduced memory footprint. 1.5.1 Documented JSON column type. Bug fix: Insert and foreign key violation. 1.5.0 JSON column type. Postgres json type does not support rdb filters. 1.4.1 Empty filter would sometimes cause invalid filter. 1.4.0 Raw SQL query. 1.3.0 getMany() now supports limit and orderBy - same syntax as in streaming. 1.2.3 Bugfix: iEqual gave incorrect sql when parameterized. 1.2.2 Exlusive no longer returns a clone of table. It has changes current table to exclusive locking. 1.2.1 Bugfix: Exclusive row locks 1.2.0 Exclusive row locks 1.1.0 Now supporting streaming. Requires postgres or MySQL >=5.7.7 1.0.8 README fixup. 1.0.7 Better performance on insert and update. 1.0.6 Bugfix: Transaction domain should not forward rdb singleton from old domain. 1.0.5 Documentation cleanup. 1.0.4 orderBy in toDto(). 1.0.3 toDto() using next tick on every thousandth row to avoid maximum call stack size exceeded. 1.0.2 Reduced number of simultaneous promises in order to avoid maximum call stack size exceeded. 1.0.1 Bugfix: Incorrect insert/updates on timestamp without timezone. The time was converted utc instead of stripping the timezone. 1.0.0 Transaction domain forwards properties from old domain. Semantic versioning from now on. 0.5.1 Improved performance 0.5.0 Logging: rdb.log(someFunc) logs sql and parameters. Raw sql filters. 0.4.9 New method: tryGetById. New filter: iEqual, postgres only. Bugfix: rows.toJSON() without strategy did not include any children. 0.4.8 Explicit pooling with size and end(). Bugfix: mySql did not release client to pool. 0.4.7 Upgraded to pg 4.3.0 Upgraded to mysql 2.5.5 0.4.6 Upgraded pg 4.2.0. 0.4.5 Oops. Forgot to use pg.js instead of pg. 0.4.4 Upgraded all dependencies to latest. Using pg.js instead of pg. 0.4.3 Can ignore columns when serializing to dto. 0.4.2 Bugfix: update on a row crashes when a delete occurs earlier in same transaction. 0.4.1 Bugfix: more global leaks. 0.4.0 Bugfix: global leak. 0.3.9 Bugfix: eager loading joins/hasOne with non unique column names was not handled correctly. 0.3.8 Supports mySql. Bulk deletes. 0.3.7 Bugfix: eager loading manyRelation on a join/hasOne returned empty array #11 0.3.6 Fixed sql injection vulnerability. 0.3.5 Built-in fetching strategies for lazy loading. Works best in readonly scenarios. 0.3.4 Docs and examples split moved to separate file. 0.3.3 Fixed documentation layout again. 0.3.2 Fixed documentation layout. 0.3.1 Case insensitive filters: iStartsWith, iEndsWith and iContains. 0.3.0 Fix broken links in docs. 0.2.9 Support for row.delete(). Rollback only throws when error is present. 0.2.8 Guid accepts uppercase letters. Bugfix: null inserts on guid columns yielded wrong sql. 0.2.7 New method, toDto(), converts row to data transfer object. Bugfix: toJSON returned incorrect string on hasMany relations. 0.2.6 Fixed incorrect links in README. 0.2.5 Bugfix: caching on composite keys could give a crash #7. Improved sql compression on insert/update. 0.2.4 Bugfix: getMany with many-strategy and shallowFilter yields incorrect query #6. 0.2.3 Reformatted documentation. No code changes.

EventQL - The database for large-scale event analytics

  •    C++

EventQL is a distributed, column-oriented database built for large-scale event collection and analytics. It runs super-fast SQL and MapReduce queries. Its features include Automatic partitioning, Columnar storage, Standard SQL support, Scales to petabytes, Timeseries and relational data, Fast range scans and lot more.

Druid IO - Real Time Exploratory Analytics on Large Datasets

  •    Java

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations. Druid can load both streaming and batch data.


Pinot - A realtime distributed OLAP datastore

  •    Java

Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally, so that it can scale to larger data sets and higher query rates as needed.

Apache Arrow - Powering Columnar In-Memory Analytics

  •    Java

Apache arrow is an open source, low latency SQL query engine for Hadoop and NoSQL. Arrow enables execution engines to take advantage of the latest SIMD (Single input multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing. Columnar layout of data also allows for a better use of CPU caches by placing all data relevant to a column operation in as compact of a format as possible.

ToroDB Stampede - Provides better analytics on top of MongoDB and make it easier to migrate from MongoDB to SQL

  •    Java

ToroDB helps to transform your NoSQL data from a MongoDB replica set into a relational database in PostgreSQL. There are other solutions that are able to store the JSON document in a relational table using PostgreSQL JSON support, but it doesn't solve the real problem of 'how to really use that data'. ToroDB Stampede replicates the document structure in different relational tables and stores the document data in different tuples using those tables.

MapD - The MapD Core database

  •    C++

MapD Core is an in-memory, column store, SQL relational database that was designed from the ground up to run on GPUs. MapD Core is the foundational element of a larger data exploration platform that emphasizes speed at scale. By taking advantage of the parallel processing power of the hardware, MapD Core can query billions of rows in milliseconds. Furthermore, by using the graphics pipelines of GPUs, MapD Core can render graphics directly from the server.

HBase - Hadoop database

  •    Java

HBase provides support to handle BigTable - billions of rows X millions of columns. It is a scalable, distributed, versioned, column-oriented store modeled after Google's Bigtable and runs on top of HDFS (Hadoop Distributed Filesystem). It features compression, in-memory operation per-column. Data could be replicated between the nodes. HBase is used in Facebook and Twitter.

ClickHouse - Columnar DBMS and Real Time Analytics

  •    C++

ClickHouse is an open source column-oriented database management system capable of real time generation of analytical data reports using SQL queries. It is Linearly Scalable, Blazing Fast, Highly Reliable, Fault Tolerant, Data compression, Real time query processing, Web analytics, Vectorized query execution, Local and distributed joins. It can process hundreds of millions to more than a billion rows and tens of gigabytes of data per single server per second.

PHPMyAdmin

  •    PHP

phpMyAdmin is a free software tool written in PHP intended to handle the administration of MySQL over the World Wide Web. phpMyAdmin supports a wide range of operations with MySQL. The most frequently used operations are supported by the user interface (managing databases, tables, fields, relations, indexes, users, permissions, etc), while you still have the ability to directly execute any SQL statement.

Apache Tajo - A big data warehouse system on Hadoop

  •    Java

Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources.

Dremio - The missing link in modern data

  •    Java

Dremio is a self-service data platform that empowers users to discover, curate, accelerate, and share any data at any time, regardless of location, volume, or structure. Modern data is managed by a wide range of technologies, including relational databases, NoSQL datastores, file systems, Hadoop, and others. Many of the newer datastores are often more agile and provide improved scalability, but at a cost to speed and ease of access via traditional SQL-based analysis tools. Additionally, raw data found in these stores is often too complex or inconsistent for analysis to use with business intelligence tools.

Apache Hive - The Apache Hive (TM) data warehouse software facilitates querying and managing large d

  •    Java

The Apache Hive (TM) data warehouse software facilitates querying and managing large datasets residing in distributed storage.

snappydata - SnappyData - The Spark Database. Stream, Transact, Analyze, Predict in one cluster

  •    Scala

Apache Spark is a general purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, this can also be quite inefficient and expensive. Analytic processing requires massive data sets to be repeatedly copied and data to be reformatted to suit Spark. In many cases, it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, it necessitates streaming the entire table into Spark to do the aggregation. Caching within Spark is immutable and results in stale insight. At SnappyData, we take a very different approach. SnappyData fuses a low latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten). All query engine operators are significantly more optimized through better vectorization and code generation. The net effect is, an order of magnitude performance improvement when compared to native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.

magellan - Geo Spatial Data Analytics on Spark

  •    Scala

Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries. The application developer writes standard sql or data frame queries to evaluate geometric expressions while the execution engine takes care of efficiently laying data out in memory during query processing, picking the right query plan, optimizing the query execution with cheap and efficient spatial indices while presenting a declarative abstraction to the developer.

Kudu - Hadoop storage layer to enable fast analytics on fast data

  •    C++

Kudu is a storage system for tables of structured data. Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the flexibility to address a wider variety of use cases without exotic workarounds.

Infobright - The Database for Analytics

  •    C++

Infobright combines a columnar database with our Knowledge Grid architecture to deliver a self-managing, self-tuning database optimized for analytics. Infobright eliminates the need to create indexes, partition data, or do any manual tuning to achieve fast response for queries and reports.