incubator-doris - Palo,an MPP data warehouse

  •        27

Palo is an MPP-based interactive SQL data warehousing for reporting and analysis. Palo mainly integrates the technology of Google Mesa and Apache Impala. Unlike other popular SQL-on-Hadoop systems, Palo is designed to be a simple and single tightly coupled system, not depending on other systems. Palo not only provides high concurrent low latency point query performance, but also provides high throughput queries of ad-hoc analysis. Palo not only provides batch data loading, but also provides near real-time mini-batch data loading. Palo also provides high availability, reliability, fault tolerance, and scalability. The simplicity (of developing, deploying and using) and meeting many data serving requirements in single system are the main features of Palo. In Baidu, the largest Chinese search engine, we run a two-tiered data warehousing system for data processing, reporting and analysis. Similar to lambda architecture, the whole data warehouse comprises data processing and data serving. Data processing does the heavy lifting of big data: cleaning data, merging and transforming it, analyzing it and preparing it for use by end user queries; data serving is designed to serve queries against that data for different use cases. Currently data processing includes batch data processing and stream data processing technology, like Hadoop, Spark and Storm; Palo is a SQL data warehouse for serving online and interactive data reporting and analysis querying.



Related Projects

Apache Tajo - A big data warehouse system on Hadoop

  •    Java

Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources.

Apache Hive - The Apache Hive (TM) data warehouse software facilitates querying and managing large d

  •    Java

The Apache Hive (TM) data warehouse software facilitates querying and managing large datasets residing in distributed storage.

flume - WE HAVE MOVED to Apache Incubator

  •    Java

WE HAVE MOVED to Apache Incubator. . Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.

incubator-gobblin - Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems

  •    Java

Apache Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volume of data from a variety of data sources, e.g., databases, rest APIs, FTP/SFTP servers, filers, etc., onto Hadoop. Apache Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, etc. Gobblin ingests data from different data sources in the same execution framework, and manages metadata of different sources all in one place. This, combined with other features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability of handling data model evolution, makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.

Data Warehouse


data warehouse

gpdb - Greenplum Database

  •    C

The Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes. Uniquely geared toward big data analytics, Greenplum Database is powered by the world’s most advanced cost-based query optimizer delivering high analytical query performance on large data volumes. The Greenplum project is released under the Apache 2 license. We want to thank all our current community contributors and are really interested in all new potential contributions. For the Greenplum Database community no contribution is too small, we encourage all types of contributions.

AsterixDB - Big Data Management System (BDMS)

  •    Java

AsterixDB is a BDMS (Big Data Management System) with a rich feature set that sets it apart from other Big Data platforms. Its feature set makes it well-suited to modern needs such as web data warehousing and social data storage and analysis. It is a highly scalable data management system that can store, index, and manage semi-structured data, but it also supports a full-power query language with the expressiveness of SQL (and more).

archived_madlib - MADlib has moved to Apache MADlib (incubating)

  •    C

================================================= MADlib® is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. Please refer to the Apache repository ( for the latest code and to create pull requests.

Shark - Hive on Spark

  •    Scala

Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users. It runs Hive queries up to 100x faster in memory, or 10x on disk. it is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

VXQuery - Query XML Data

  •    Java

Apache VXQuer will be a standards compliant XML Query processor implemented in Java. The focus is on the evaluation of queries on large amounts of XML data. Specifically the goal is to evaluate queries on large collections of relatively small XML documents. To achieve this queries will be evaluated on a cluster of shared nothing machines.

blinkdb - BlinkDB: Sub-Second Approximate Queries on Very Large Data.

  •    Scala

BlinkDB is a large-scale data warehouse system built on Shark and Spark and is designed to be compatible with Apache Hive. It can answer HiveQL queries up to 200-300 times faster than Hive by executing them on user-specified samples of data and providing approximate answers that are augmented with meaningful error bars. BlinkDB 0.1.0 is an alpha developer release that supports creating/deleting samples on any input table and/or materialized view and executing approximate HiveQL queries with those aggregates that have statistical closed forms (i.e., AVG, SUM, COUNT, VAR and STDEV).

Cascalog - Data processing on Hadoop

  •    Clojure

Cascalog is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.

dbt - dbt (data build tool) is a command line tool that enables data analysts and engineers to transform data in their warehouse more effectively

  •    HTML

dbt (data build tool) helps analysts write reliable, modular code using a workflow that closely mirrors software development. A dbt project primarily consists of "models". These models are SQL select statements that filter, aggregate, and otherwise transform data to facilitate analytics. Analysts use dbt to aggregate pageviews into sessions, calculate ad spend ROI, or report on email campaign performance.

BioDWH: Bioinformatics Data Warehouse

  •    Java

BioDWH is a bioinformatics data warehouse software kit that integrates biological information from multiple public life science data sources into a local RDBMS. It provides up-to-date integrated knowledge, platform and database independence.

Doris - OpenGL scripting

  •    Lua

Doris is a Lua script driven OpenGL viewer with GUI widget extensions. Lua is a fast, powerful, portable scripting language. Lua bindings are provided to OpenGL, GLUT, GLUI (a GL widget library) and Luasocket (networking).

42 Sight Holistic Data Warehousing on Microsoft SQL Server 2008


This project is the Holistic Data Warehouse Template. Included are the 3 MS XL spreadsheet reporting models as described in the book Holistic Data Warehousing. This Template is multi-purpose and requires no modification other than you populating it according to the book examples

CommonDataModel - Definition and DDLs for the OMOP Common Data Model (CDM)

  •    Batchfile

This repo contains the definition of the OMOP Common Data Model. It supports the SQL technologies: BigQuery, Impala, Netezza, Oracle, Parallel Data Warehouse, Postgres, Redshift, and SQL Server. For each, the DDL, constraints and indexes (if appropriate) are defined. Versions are defined using tagging and versioning. Full versions (V6, 7 etc.) are usually released each year (1-Jan) and are not backwards compatible. Minor versions (V5.1, 5.2 etc.) are not guaranteed to be backwards compatible though an effort is made to make sure that current queries will not break. Micro versions (V5.1.1, V5.1.2 etc.) are released irregularly and often, and contain small hot fixes or backward compatible changes to the last minor version.

incubator-usergrid - Mirror of Apache Usergrid

  •    Java

We accept all contributions via our GitHub, so you can fork our repo (apache/incubator-usergrid) and then submit a PR back to us for approval. For larger PRs you'll need to have an ICLA form on file with Apache. For more information see our [Contribution Workflow Policy](, and specifically our [External Contributors Guide](

incubator-openwhisk - Apache OpenWhisk is a serverless event-based programming service and an Apache Incubator project

  •    Scala

OpenWhisk is a cloud-first distributed event-based programming service. It provides a programming model to upload event handlers to a cloud service, and register the handlers to respond to various events. Learn more at Vagrant machine is the easiest way to run OpenWhisk on Mac, Windows PC or GNU/Linux. Download and install VirtualBox and Vagrant for your operating system and architecture.