A Java-based data integration framework that can be used to transform, map, and manipulate data in various formats (CSV, FIXLEN, XML, XBASE, COBOL, LOTUS, etc.). It can be used standalone or embedded (as a library), and connects to RDBMS/JMS/SOAP/LDAP/S3/HTTP/FTP/ZIP/TAR.
etl data-processing data-integration data-extraction
Vespa is an engine for low-latency computation over large data sets. It stores and indexes your data so that queries, selection, and processing over the data can be performed at serving time. Vespa is the serving platform for Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, and Flickr.
searchengine search-engine big-data data-processing machine-learning real-time
Alibaba JStorm is a fast and stable enterprise stream processing engine. It runs programs up to 4x faster than Apache Storm, and it is easy to switch from record mode to mini-batch mode. JStorm is not only a stream processing engine: it aims to be a single solution for real-time requirements, a whole real-time ecosystem.
stream-processing batch-processing real-time data-processing distributed
A curated list of awesome curated lists of many topics.
curated-lists science machine-learning database awesome awesome-list data data-processing editor web-browser jquery jquery-plugin
Google Cloud Dataflow SDK for Java is a distribution of Apache Beam designed to simplify usage of Apache Beam on the Google Cloud Dataflow service. This artifact includes the parent POM for other Dataflow SDK artifacts.
google-cloud-dataflow data-science data-analysis data-mining big-data data-processing
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. With Miller, you get to use named fields without needing to count positional indices, using familiar formats such as CSV, TSV, JSON, and positionally indexed data.
data-processing data-cleaning csv csv-files csv-format csv-reader streaming-data streaming-algorithms tsv json json-data data-reduction data-regression statistics statistical-analysis devops devops-tools tabular-data command-line command-line-tools
A list of tools, programming libraries, and APIs used in web scraping.
awesome awesome-list web-scraping data-processing
Software 2.0 needs Data 2.0, and Hub delivers it. Most of the time, data scientists and ML researchers work on data management and preprocessing instead of training models. With Hub, we are fixing this. We store your (even petabyte-scale) datasets as a single numpy-like array on the cloud, so you can seamlessly access and work with them from any machine. Hub makes any data type (images, text files, audio, or video) stored in the cloud usable as fast as if it were stored on premise. With the same dataset view, your team can always be in sync.
training data-science machine-learning cloud ai computer-vision deep-learning tensorflow cv ml collaboration pytorch cloud-computing datasets dataset-generation data-processing data-version-control data-pipelines mlops
Apache Flink is an open source platform for scalable batch and stream data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
data-streaming data-processing streaming
Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.
cluster-management resource-manager big-data data-processing
Storm is a distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.
real-time-computation analytics real-time stream-processing distributed-rpc data-processing
Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.
map-reduce batch-processing data-processing big-data hadoop yarn directed-acyclic-graph
Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In-Memory Data Grid (IMDG) to provide a lightweight, simple-to-deploy package that includes scalable in-memory storage. Hazelcast Jet performs parallel execution to enable data-intensive applications to operate in near real-time.
in-memory data-grid big-data stream-processing data-processing real-time streams batch-processing
Kapacitor is an open source framework for processing, monitoring, and alerting on time series data. Kapacitor imports (stream or batch) time series data, and then transforms, analyzes, and acts on the data. It uses Telegraf to collect system metrics on your local machine and store them in InfluxDB.
monitoring data-processing data-analysis alerts streaming data-streaming streaming-analytics
Build concurrent and multi-stage data ingestion and data processing pipelines with Elixir. Broadway allows developers to consume data efficiently from different sources, known as producers, such as Amazon SQS, Apache Kafka, Google Cloud PubSub, RabbitMQ, and others. Broadway takes on the burden of defining concurrent GenStage topologies and provides a simple configuration API that automatically defines concurrent producers, concurrent processing, batch handling, and more, leading to both time- and cost-efficient ingestion and processing of data.
data-processing concurrent data-pipeline batch-processing
Hazelcast Jet is a distributed computing platform built for high-performance stream processing and fast batch processing. It embeds Hazelcast In-Memory Data Grid (IMDG) to provide a lightweight package of a processor and scalable in-memory storage. It provides distributed java.util.stream API support for Hazelcast data structures such as IMap and IList, and distributed implementations of java.util.{Queue, Set, List, Map} data structures highly optimized to be used for the processing
data-grid data-processing data-streaming in-memory batch-processing stream-processing
Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
data-processing data-streaming batch-processing stream-processing distributed big-data
Teiid is a data virtualization system that allows applications to use data from multiple, heterogeneous data stores. Teiid comprises tools, components, and services for creating and executing bi-directional data access services.
data-source data-processing data-connector
Apache StreamPipes is a self-service (Industrial) IoT toolbox that enables non-technical users to connect, analyze, and explore IoT data streams. It can integrate data sets and data streams using the built-in StreamPipes Connect library, with support for generic protocols such as HTTP, Kafka, MQTT, OPC-UA, and files, or specific adapters for open data sources.
iot analytics edge stream-processing iiot industrial-iot pipeline data-processing data-streams
NIPO is a general-purpose component framework for data processing applications (that follow the IPO principle). Its plugin-based architecture makes it scalable and flexible, and enables a broad range of usage scenarios.
data-processing framework ipo plugin plugin-framework