The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks and automate them, and failures will happen. These tasks can be anything, but are typically long-running jobs such as Hadoop jobs, dumping data to or from databases, or running machine learning algorithms.
orchestration-framework scheduling hadoop
Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.
hadoop map-reduce cascading
A distributed deep learning library for Apache Spark.
deep-learning spark neural-network big-data hadoop keras ai
Gaffer is a graph database framework. It allows the storage of very large graphs containing rich properties on the nodes and edges. Several storage options are available, including Accumulo, HBase and Parquet. It is designed to be as flexible, scalable and extensible as possible, allowing for rapid prototyping and transition to production systems.
accumulo graph graph-database hadoop big-data aggregation hbase parquet spark
Alluxio (formerly known as Tachyon) is a virtual distributed storage system. It bridges the gap between computation frameworks and storage systems, enabling computation applications to connect to numerous storage systems through a common interface.
distributed-storage big-data memory-speed hadoop spark virtual-file-system presto tensorflow storage object-store
Apache Trafodion is a webscale SQL-on-Hadoop solution enabling transactional or operational workloads on Apache Hadoop. Trafodion builds on the scalability, elasticity, and flexibility of Hadoop. Trafodion extends Hadoop to provide guaranteed transactional integrity, enabling new kinds of big data applications to run on Hadoop.
database distributed-database newsql oltp hbase hadoop map-reduce
Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez dramatically improves the speed of the MapReduce paradigm while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.
map-reduce batch-processing data-processing big-data hadoop yarn directed-acyclic-graph
Cascalog is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.
big-data data-analysis data-warehouse query hadoop hadoop-tools
Hoodie is an Apache Spark library that provides the ability to efficiently do incremental processing on datasets in HDFS.
hadoop spark parquet analytics-database ingestion hoodie hudi columnar storage
The GIS Tools for Hadoop are a collection of GIS tools that leverage the Spatial Framework for Hadoop for spatial analysis of big data. The tools make use of the Geoprocessing Tools for Hadoop toolbox to provide access to the Hadoop system from the ArcGIS Geoprocessing environment. Start out by navigating to the samples and following the instructions provided with each sample. There are also tutorials for using the GP tools and aggregation methods.
spatial-analysis hadoop
Docker containers for Hadoop. An easy way to reproduce a multi-node Hadoop cluster on a local machine.
hadoop hadoop-distributions docker dns
kafka-connect-hdfs is a Kafka Connector for copying data between Kafka and Hadoop HDFS.
confluent kafka apache-kafka kafka-connect-hdfs kafka-connector hadoop hdfs big-data streaming