Displaying 1 to 20 from 38 results

Vespa - Yahoo's big data serving engine

  •    Java

Vespa is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time. Vespa is serving platform for Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr.

genie - Distributed Big Data Orchestration Service

  •    Java

Genie is a federated job orchestration engine developed by Netflix. Genie provides REST-ful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Spark, Presto, Sqoop and more. It also provides APIs for managing the metadata of many distributed processing clusters and the commands and applications which run on them.See the official website to find documentation about Genie and specific documentation for various releases.

DataflowJavaSDK - Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines

  •    Java

Google Cloud Dataflow SDK for Java is a distribution of Apache Beam designed to simplify usage of Apache Beam on Google Cloud Dataflow service. This artifact includes the parent POM for other Dataflow SDK artifacts.




datumbox-framework - Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications

  •    Java

Datumbox is an open-source Machine Learning Framework written in Java which allows the rapid development of Machine Learning and Statistical applications.

Gaffer - A large-scale entity and relation database supporting aggregation of properties

  •    Java

Gaffer is a graph database framework. It allows the storage of very large graphs containing rich properties on the nodes and edges. Several storage options are available, including Accumulo, Hbase and Parquet. It is designed to be as flexible, scalable and extensible as possible, allowing for rapid prototyping and transition to production systems.

Presto - Distributed SQL query engine for big data

  •    Java

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It allows querying data from relational / nosql databases. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. It is developed by Facebook.


Apache REEF - a stdlib for Big Data

  •    Java

Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop

  •    Java

Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.

Apache Metron - Real-time Big Data Security

  •    Java

Metron integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis. Metron provides capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, while applying the most current threat intelligence information to security telemetry within a single platform.

Hue - The open source Apache Hadoop UI

  •    Java

Hue is a Web application for interacting with Apache Hadoop. It supports a FileBrowser for accessing HDFS, JobBrowser for accessing MapReduce jobs (MR1/MR2-YARN), Job Designer for creating MapReduce/Streaming/Java jobs, HBase Browser for exploring and modifying HBase tables and data, Oozie App for submitting and scheduling workflows and bundles, A Pig/HBase/Sqoop2 shell, Beeswax application for executing Hive queries, Search app for querying Solr and Solr Cloud.

Hazelcast Jet - A general purpose distributed data processing engine, built on top of Hazelcast.

  •    Java

Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In-Memory Data Grid (IMDG) to provide a lightweight, simple-to-deploy package that includes scalable in-memory storage. Hazelcast Jet performs parallel execution to enable data-intensive applications to operate in near real-time.

Fluo - Make incremental updates to large data sets stored in Apache Accumulo

  •    Java

Apache Fluo (incubating) is an open source implementation of Percolator (which populates Google's search index) for Apache Accumulo. Fluo makes it possible to update the results of a large-scale computation, index, or analytic as new data is discovered. When combining new data with existing data, Fluo offers reduced latency when compared to batch processing frameworks (e.g Spark, MapReduce).

ConcourseDB - Self-tuning database designed for both transactions and ad hoc analytics across time

  •    Java

ConcourseDB is a distributed self-tuning database with automatic indexing, version control and ACID transactions. ConcourseDB provides a more intuitive approach to data management that is easy to deploy, access and scale while maintaining the strong consistency of traditional database systems.

AsterixDB - Big Data Management System (BDMS)

  •    Java

AsterixDB is a BDMS (Big Data Management System) with a rich feature set that sets it apart from other Big Data platforms. Its feature set makes it well-suited to modern needs such as web data warehousing and social data storage and analysis. It is a highly scalable data management system that can store, index, and manage semi-structured data, but it also supports a full-power query language with the expressiveness of SQL (and more).

Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines

  •    Java

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

StratoSphere - Cloud Computing Framework for Big Data Analytics

  •    Java

The Stratosphere System is an open-source cluster/cloud computing framework for Big Data analytics. It comprises of An extensible higher level language (Meteor) to quickly compose queries for common and recurring use cases, A parallel programming model (PACT, an extension of MapReduce) to run user-defined operations, An efficient massively parallel runtime (Nephele) for fault tolerant execution of acyclic data flows.

Dremio - The missing link in modern data

  •    Java

Dremio is a self-service data platform that empowers users to discover, curate, accelerate, and share any data at any time, regardless of location, volume, or structure. Modern data is managed by a wide range of technologies, including relational databases, NoSQL datastores, file systems, Hadoop, and others. Many of the newer datastores are often more agile and provide improved scalability, but at a cost to speed and ease of access via traditional SQL-based analysis tools. Additionally, raw data found in these stores is often too complex or inconsistent for analysis to use with business intelligence tools.





We have large collection of open source products. Follow the tags from Tag Cloud >>


Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.