incubator-hudi - Upserts And Incremental Processing on Big Data


Hoodie is an Apache Spark library that provides the ability to efficiently perform upserts and incremental processing on datasets in HDFS.

https://hudi.apache.org
https://github.com/apache/incubator-hudi
https://github.com/uber/hudi
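
As a rough illustration of the upsert and incremental-pull workflow, the sketch below uses Hudi's Spark DataSource integration; the exact option keys vary between releases, and the paths, table name, and column names are assumptions, not part of the project documentation.

// Sketch only: written against the Hudi Spark DataSource; option keys may
// differ slightly between releases, and all paths/columns are illustrative.
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiUpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-upsert-sketch").getOrCreate()

    val basePath = "hdfs:///tmp/hudi/trips"                     // hypothetical dataset path
    val updates  = spark.read.json("hdfs:///tmp/trips_updates.json")

    // Upsert: rows whose record key already exists are updated, new keys are inserted.
    updates.write
      .format("org.apache.hudi")
      .option("hoodie.table.name", "trips")
      .option("hoodie.datasource.write.recordkey.field", "trip_id")
      .option("hoodie.datasource.write.precombine.field", "event_ts")
      .option("hoodie.datasource.write.partitionpath.field", "event_date")
      .mode(SaveMode.Append)
      .save(basePath)

    // Incremental pull: read only the rows committed after a given instant.
    val changes = spark.read
      .format("org.apache.hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20190101000000")
      .load(basePath)
    changes.show()

    spark.stop()
  }
}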

Dependencies:

com.beust:jcommander:1.72
log4j:log4j:1.2.17
joda-time:joda-time:2.9.9
org.apache.hadoop:hadoop-client:2.7.3
org.apache.parquet:parquet-avro:1.8.1
org.apache.parquet:parquet-hadoop:1.8.1
org.apache.avro:avro-mapred:1.7.7
com.google.guava:guava:15.0
org.apache.hadoop:hadoop-common:2.7.3
org.apache.hadoop:hadoop-hdfs:2.7.3
org.apache.hadoop:hadoop-auth:2.7.3
org.apache.hadoop:hadoop-mapreduce-client-core:2.7.3
org.apache.hadoop:hadoop-mapreduce-client-common:2.7.3
commons-logging:commons-logging:1.2
commons-io:commons-io:2.6
com.twitter:parquet-hadoop-bundle:1.6.0
com.twitter:parquet-hive-bundle:1.6.0
com.twitter:parquet-avro:1.6.0
org.apache.parquet:parquet-hive-bundle:1.8.1
org.apache.spark:spark-core_2.11:2.1.0
org.apache.spark:spark-sql_2.11:2.1.0
org.apache.hbase:hbase-client:1.0.0
org.apache.avro:avro:1.7.7
io.dropwizard.metrics:metrics-graphite:3.1.1
io.dropwizard.metrics:metrics-core:3.1.1
xerces:xercesImpl:2.9.1
xalan:xalan:2.7.1
commons-dbcp:commons-dbcp:1.4
commons-pool:commons-pool:1.4
org.apache.httpcomponents:httpcore:4.3.2
org.apache.httpcomponents:httpclient:4.3.6
org.slf4j:slf4j-api:1.7.5
org.slf4j:slf4j-log4j12:1.7.5
org.apache.commons:commons-configuration2:2.1.1
com.fasterxml.jackson.core:jackson-annotations:2.8.11
com.fasterxml.jackson.core:jackson-core:2.8.11
com.fasterxml.jackson.core:jackson-databind:2.8.11
com.fasterxml.jackson.module:jackson-module-scala_2.11:2.8.11
org.codehaus.jackson:jackson-core-asl:1.9.13
org.codehaus.jackson:jackson-mapper-asl:1.9.13
${hive.groupid}:hive-service:1.2.1
${hive.groupid}:hive-shims:1.2.1
${hive.groupid}:hive-jdbc:1.2.1
${hive.groupid}:hive-serde:1.2.1
${hive.groupid}:hive-metastore:1.2.1
${hive.groupid}:hive-common:1.2.1
${hive.groupid}:hive-exec:1.2.1

Related Projects

Pinot - A realtime distributed OLAP datastore

  •    Java

Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally, so that it can scale to larger data sets and higher query rates as needed.

Kudu - Hadoop storage layer to enable fast analytics on fast data

  •    C++

Kudu is a storage system for tables of structured data. Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the flexibility to address a wider variety of use cases without exotic workarounds.
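
A minimal sketch of the insert/update path using the Kudu Java client from Scala; the master address, table name, and schema are assumptions for illustration.

// Sketch: upserting a row through the Kudu client API (org.apache.kudu:kudu-client).
import org.apache.kudu.client.KuduClient

object KuduUpsertSketch {
  def main(args: Array[String]): Unit = {
    val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
    try {
      val table   = client.openTable("metrics")      // assumed pre-created table
      val session = client.newSession()

      val upsert = table.newUpsert()                 // insert-or-update in one operation
      val row    = upsert.getRow
      row.addString("host", "web-01")
      row.addLong("ts", System.currentTimeMillis())
      row.addDouble("cpu", 0.42)
      session.apply(upsert)

      session.close()                                // flushes any pending operations
    } finally {
      client.close()
    }
  }
}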

Gaffer - A large-scale entity and relation database supporting aggregation of properties

  •    Java

Gaffer is a graph database framework. It allows the storage of very large graphs containing rich properties on the nodes and edges. Several storage options are available, including Accumulo, HBase and Parquet. It is designed to be as flexible, scalable and extensible as possible, allowing for rapid prototyping and transition to production systems.

snappydata - SnappyData - The Spark Database. Stream, Transact, Analyze, Predict in one cluster

  •    Scala

Apache Spark is a general purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, it can also be quite inefficient and expensive: analytic processing requires massive data sets to be repeatedly copied and reformatted to suit Spark, and in many cases it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, the entire table must be streamed into Spark to do the aggregation, and caching within Spark is immutable, which results in stale insights. SnappyData takes a very different approach: it fuses a low-latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten), and all query engine operators are significantly more optimized through better vectorization and code generation. The net effect is an order of magnitude performance improvement compared to native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.
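
A minimal sketch of the fused Spark-plus-store model described above, assuming SnappyData's SnappySession API; the table name, schema, and query are illustrative.

// Sketch: creating and querying an in-memory column table with SnappyData.
import org.apache.spark.sql.{SnappySession, SparkSession}

object SnappySketch {
  def main(args: Array[String]): Unit = {
    val spark  = SparkSession.builder().appName("snappy-sketch").getOrCreate()
    val snappy = new SnappySession(spark.sparkContext)

    // Column table managed by the embedded in-memory store (no copy into Spark's cache).
    snappy.sql("CREATE TABLE IF NOT EXISTS trades (sym STRING, px DOUBLE, ts LONG) USING column")

    // Queries execute against the shared columnar store.
    snappy.sql("SELECT sym, avg(px) FROM trades GROUP BY sym").show()

    spark.stop()
  }
}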

EventQL - The database for large-scale event analytics

  •    C++

EventQL is a distributed, column-oriented database built for large-scale event collection and analytics. It runs super-fast SQL and MapReduce queries. Its features include automatic partitioning, columnar storage, standard SQL support, scaling to petabytes, support for timeseries and relational data, fast range scans, and more.


spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.

  •    Javascript

Spindle is Brandon Amos' 2014 summer internship project with Adobe Research and is not under active development. Analytics platforms such as Adobe Analytics are growing to process petabytes of data in real time. Delivering responsive interfaces that query this amount of data is difficult, and there are many distributed data processing technologies, such as Hadoop MapReduce, Apache Spark, Apache Drill, and Cloudera Impala, for building low-latency query systems.

Apache Mnemonic - Non-volatile hybrid memory storage oriented library

  •    Java

Apache Mnemonic is a non-volatile hybrid memory storage oriented library. It proposes a non-volatile/durable Java object model and durable computing services that bring several advantages to significantly improve the performance of massive real-time data processing/analytics. Developers are able to use this library to design their cache-less and SerDe-less high-performance applications.

incubator-gobblin - Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems

  •    Java

Apache Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources, e.g., databases, REST APIs, FTP/SFTP servers, filers, etc., onto Hadoop. Apache Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, etc. Gobblin ingests data from different data sources in the same execution framework, and manages metadata of different sources all in one place. This, combined with other features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability to handle data model evolution, makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.

Akumuli - Time-series database

  •    C++

Akumuli is a time-series database for modern hardware. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".

parquet-format - Mirror of Apache Parquet

  •    Java

Parquet is a columnar storage format that supports nested data. This repository contains the format specification and the generated metadata code.

parquet-mr - Mirror of Apache Parquet

  •    Java

Parquet is a columnar storage format that supports nested data. This repository provides the Java implementation.
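
A small sketch of writing a Parquet file through parquet-mr's Avro binding (parquet-avro, which also appears in the dependency list above); the schema and output path are made up for illustration.

// Sketch: writing GenericRecords to a Parquet file with AvroParquetWriter.
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

object ParquetWriteSketch {
  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"Event","fields":[
        |  {"name":"id","type":"long"},
        |  {"name":"msg","type":"string"}]}""".stripMargin)

    val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/events.parquet"))
      .withSchema(schema)
      .build()

    val rec = new GenericData.Record(schema)
    rec.put("id", 1L)
    rec.put("msg", "hello")
    writer.write(rec)   // rows are buffered and flushed as columnar row groups
    writer.close()
  }
}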

ClickHouse - Columnar DBMS and Real Time Analytics

  •    C++

ClickHouse is an open source column-oriented database management system capable of real-time generation of analytical data reports using SQL queries. It is linearly scalable, blazing fast, highly reliable, and fault tolerant, with data compression, real-time query processing, vectorized query execution, and local and distributed joins, and it is widely used for web analytics. It can process hundreds of millions to more than a billion rows and tens of gigabytes of data per single server per second.
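
As a sketch of how an application might run such a query, the snippet below uses the classic ClickHouse JDBC driver from Scala; the driver class, URL, and table are assumptions for illustration.

// Sketch: an aggregation over JDBC (ru.yandex.clickhouse:clickhouse-jdbc, HTTP port 8123).
import java.sql.DriverManager

object ClickHouseQuerySketch {
  def main(args: Array[String]): Unit = {
    Class.forName("ru.yandex.clickhouse.ClickHouseDriver")
    val conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default")
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT toDate(event_time) AS d, count() AS hits FROM page_views GROUP BY d ORDER BY d")
      while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getLong(2)}")
    } finally {
      conn.close()
    }
  }
}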

carbondata - Mirror of Apache CarbonData

  •    Scala

Apache CarbonData is an indexed columnar data format for fast analytics on big data platforms, e.g. Apache Hadoop, Apache Spark, etc.

Apache Arrow - Powering Columnar In-Memory Analytics

  •    Java

Apache Arrow is a cross-language development platform for in-memory columnar data. Arrow enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing. The columnar layout of data also allows for better use of CPU caches by placing all data relevant to a column operation in as compact a format as possible.
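
A minimal sketch of the columnar in-memory layout using Arrow's Java vector API (assuming a reasonably recent Arrow release); the vector name and values are illustrative.

// Sketch: one Arrow column (IntVector) backed by contiguous off-heap memory.
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector

object ArrowVectorSketch {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val clicks    = new IntVector("clicks", allocator)
    clicks.allocateNew(4)
    (0 until 4).foreach(i => clicks.setSafe(i, i * 10))
    clicks.setValueCount(4)
    println((0 until 4).map(clicks.get).mkString(", "))   // 0, 10, 20, 30
    clicks.close()
    allocator.close()
  }
}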

Infobright - The Database for Analytics

  •    C++

Infobright combines a columnar database with its Knowledge Grid architecture to deliver a self-managing, self-tuning database optimized for analytics. Infobright eliminates the need to create indexes, partition data, or do any manual tuning to achieve fast response for queries and reports.

cstore_fdw - Columnar store for analytics with Postgres, developed by Citus Data

  •    C

Cstore_fdw is an open source columnar store extension for PostgreSQL. Columnar stores provide notable benefits for analytics use cases where data is loaded in batches. Cstore_fdw's columnar nature delivers performance by only reading relevant data from disk, and it may compress data 6x-10x to reduce space requirements for data archival. Cstore_fdw is developed by Citus Data and can be used in combination with Citus, a PostgreSQL extension that intelligently distributes data and queries across many nodes so the database can scale and queries stay fast.
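
A minimal sketch of setting up a cstore_fdw foreign table, issued over JDBC from Scala; the connection details, table, and columns are assumptions for illustration.

// Sketch: create the extension, a foreign server, and a compressed columnar table.
import java.sql.DriverManager

object CstoreSetupSketch {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/analytics", "postgres", "postgres")
    val st = conn.createStatement()
    try {
      st.execute("CREATE EXTENSION IF NOT EXISTS cstore_fdw")
      st.execute("CREATE SERVER IF NOT EXISTS cstore_server FOREIGN DATA WRAPPER cstore_fdw")
      st.execute(
        """CREATE FOREIGN TABLE IF NOT EXISTS events (ts timestamptz, user_id bigint, action text)
          |SERVER cstore_server OPTIONS (compression 'pglz')""".stripMargin)
      // Batch loading (e.g. via COPY) is the intended ingestion path for columnar stores.
      st.execute("COPY events FROM '/tmp/events.csv' WITH CSV")
    } finally {
      conn.close()
    }
  }
}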

GeoMesa - Suite of tools for working with big geo-spatial data in a distributed fashion

  •    Scala

GeoMesa is an open-source, distributed, spatio-temporal database built on a number of distributed cloud data storage systems, including Accumulo, HBase, Cassandra, and Kafka. Leveraging a highly parallelized indexing strategy, GeoMesa aims to provide as much of the spatial querying and data manipulation to Accumulo as PostGIS does to Postgres.

Trino - A query engine that runs at ludicrous speed

  •    Java

Trino is a highly parallel and distributed query engine built from the ground up for efficient, low-latency analytics. It is an ANSI SQL compliant query engine that works with BI tools such as R, Tableau, Power BI, Superset and many others. It can natively query data in Hadoop, S3, Cassandra, MySQL, and many other sources, without the need for complex, slow, and error-prone processes for copying the data.
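
A minimal sketch of querying Trino over its JDBC driver from Scala; the host, catalog, schema, and table are assumptions for illustration.

// Sketch: federated SQL over JDBC (io.trino:trino-jdbc; a user name is required).
import java.sql.DriverManager
import java.util.Properties

object TrinoQuerySketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.setProperty("user", "analyst")
    val conn = DriverManager.getConnection("jdbc:trino://localhost:8080/hive/default", props)
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT region, count(*) AS cnt FROM orders GROUP BY region")
      while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getLong(2)}")
    } finally {
      conn.close()
    }
  }
}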

FiloDB - Distributed. Columnar. Versioned. Streaming. SQL.

  •    Scala

High-performance distributed analytical database + Spark SQL queries + built for streaming. Columnar, versioned layers of data wrapped in a yummy high-performance analytical database engine.

Alluxio - Data orchestration for analytics and machine learning in the cloud

  •    Java

Alluxio (formerly known as Tachyon) is a virtual distributed storage system. It bridges the gap between computation frameworks and storage systems, enabling computation applications to connect to numerous storage systems through a common interface.
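
A minimal sketch of the common-interface idea, reading through Alluxio via its Hadoop-compatible FileSystem implementation from Scala; the master address and path are assumptions for illustration.

// Sketch: reading an alluxio:// path through the Hadoop FileSystem API
// (requires the Alluxio client jar on the classpath).
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object AlluxioReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")
    val fs = FileSystem.get(new URI("alluxio://alluxio-master:19998/"), conf)
    val in = fs.open(new Path("/data/events/part-00000"))
    try {
      val buf = new Array[Byte](4096)
      val n   = in.read(buf)
      println(s"read $n bytes through Alluxio")
    } finally {
      in.close()
      fs.close()
    }
  }
}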