Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink and Hive using a high-performance table format that works just like a SQL table. Users don’t need to know about partitioning to get fast queries. Iceberg was built for huge tables. Iceberg is used in production where a single table can contain tens of petabytes of data and even these huge tables can be read without a distributed SQL engine.
http://iceberg.apache.org/Tags | big-data big-table analytics table |
Implementation | Java |
License | Apache |
Platform | OS-Independent |
Metron integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis. Metron provides capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, while applying the most current threat intelligence information to security telemetry within a single platform.
security-framework cyber-crime anomoly-detection monitoring security big-data threat-intelligence security-analytics opensoc siem threat-analyticsClickHouse is a fast open-source OLAP database management system. It is column-oriented and allows to generate analytical reports using SQL queries in real-time. Large queries are parallelized using multiple cores, taking all the necessary resources available on the current server. It supports distributed processing, data can reside on different shards. Each shard can be a group of replicas used for fault tolerance. All shards are used to run a query in parallel, transparently for the user.
database columnar-database column-oriented analytics dbms olap distributed-database mpp hacktoberfest sql big-dataMagellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries. The application developer writes standard sql or data frame queries to evaluate geometric expressions while the execution engine takes care of efficiently laying data out in memory during query processing, picking the right query plan, optimizing the query execution with cheap and efficient spatial indices while presenting a declarative abstraction to the developer.
geospatial-analytics sparksql spark geometric-algorithms geojson shapefile geospatial geospatial-processing geospatial-analysis big-data magellanAsterixDB is a BDMS (Big Data Management System) with a rich feature set that sets it apart from other Big Data platforms. Its feature set makes it well-suited to modern needs such as web data warehousing and social data storage and analysis. It is a highly scalable data management system that can store, index, and manage semi-structured data, but it also supports a full-power query language with the expressiveness of SQL (and more).
database parallel-database big-data data-warehouse nosql no-sql analyticsThe Stratosphere System is an open-source cluster/cloud computing framework for Big Data analytics. It comprises of An extensible higher level language (Meteor) to quickly compose queries for common and recurring use cases, A parallel programming model (PACT, an extension of MapReduce) to run user-defined operations, An efficient massively parallel runtime (Nephele) for fault tolerant execution of acyclic data flows.
cloud-framework cloud big-data parallel information-managementDelta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines. See the Delta Lake Documentation for details.
big-data spark analytics acidApache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.
cluster-management resource-manager big-data data-processingTDengine is an open-source big data platform designed and optimized for Internet of Things (IoT), Connected Vehicles, and Industrial IoT. Besides the 10x faster time-series database, it provides caching, stream computing, message queuing and other functionalities to reduce the complexity and costs of development and operations.
iot database monitoring time-series bigdata full-stack connected-vehicles industrial-iot time-series-database analytics real-time-analytics column-store columnar-databasePresto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It allows querying data from relational / nosql databases. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. It is developed by Facebook.
query-engine database-tool analytics big-data distributedSnappyData (aka TIBCO ComputeDB) is a distributed, in-memory optimized analytics database. SnappyData delivers high throughput, low latency, and high concurrency for unified analytics workload. By fusing an in-memory hybrid database inside Apache Spark, it provides analytic query processing, mutability/transactions, access to virtually all big data sources and stream processing all in one unified cluster. One common use case for SnappyData is to provide analytics at interactive speeds over large volumes of data with minimal or no pre-processing of the dataset. For instance, there is no need to often pre-aggregate/reduce or generate cubes over your large data sets for ad-hoc visual analytics. This is made possible by smartly managing data in-memory, dynamically generating code using vectorization optimizations and maximizing the potential of modern multi-core CPUs. SnappyData enables complex processing on large data sets in sub-second timeframes.
spark stream scale analytics transaction memory-database snappydataApache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources.
data-warehouse etl aggregation analytics sql-on-hadoopApache Spark is a general purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, this can also be quite inefficient and expensive. Analytic processing requires massive data sets to be repeatedly copied and data to be reformatted to suit Spark. In many cases, it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, it necessitates streaming the entire table into Spark to do the aggregation. Caching within Spark is immutable and results in stale insight. At SnappyData, we take a very different approach. SnappyData fuses a low latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten). All query engine operators are significantly more optimized through better vectorization and code generation. The net effect is, an order of magnitude performance improvement when compared to native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.
snappydata spark memory-database analytics stream transaction scaleApache Doris is a modern MPP analytical database product. It can provide sub-second queries and efficient real-time data analysis. With it's distributed architecture, up to 10PB level datasets will be well supported and easy to operate. Doris provides batch data loading and real-time mini-batch data loading. It provides high availability, reliability, fault tolerance, and scalability. Its original name was Palo, developed in Baidu.
data-warehouse database bigdata mysql-protocol analytical-database realtime-query scalable paloThe Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.
key-value key-value-store big-table distributed database aggregationHigh Performance Analytics Toolkit (HPAT) scales analytics/ML codes in Python to bare-metal cluster/cloud performance automatically. It compiles a subset of Python (Pandas/Numpy) to efficient parallel binaries with MPI, requiring only minimal code changes. HPAT is orders of magnitude faster than alternatives like Apache Spark. HPAT's documentation can be found here.
big-data parallel-computing compilers machine-learning numpy pandasAnimated Investment Management Research at Sov.ai — Sponsoring open source AI, Machine learning, and Data Science initiatives. I don't know the other areas that well, send my your thought leaders by pull request.
data-science machine-learning big-data analytics resources career business-intelligence business-analyticsDev Lake brings all your DevOps data into one practical, personalized, extensible view. Ingest, analyze, and visualize data from an ever-growing list of developer tools, with our free and open source product. Dev Lake is most exciting for leaders and managers looking to make better sense of their development data, though it's useful for any developer looking to bring a more data-driven approach to their own practices. With Dev Lake you can ask your process any question, just connect and query.
devops big-data data-warehouse big-code data-source data-connectors jira github gitlab jenkins performance-metrics analyticsFuzzy matching strings. Find out how similar two string is, and find the best fuzzy matching string from a string table. Given a string (strA) and a big string table. Find the likeness or similarity of the string in the string table. Using C# and LINQ
Alluxio (formerly known as Tachyon) is a virtual distributed storage system. It bridges the gap between computation frameworks and storage systems, enabling computation applications to connect to numerous storage systems through a common interface.
distributed-storage big-data memory-speed hadoop spark virtual-file-system presto tensorflow storage object-storeHypertable is based on Google's Bigtable Design, which is a proven scalable design that powers hundreds of Google services. Many of the current scalable NoSQL database offerings are based on a hash table design which means that the data they manage is not kept physically ordered. Hypertable keeps data physically sorted by a primary key and it is well suited for Analytics.
no-sql distributed-database column-store analytics database distributed scalable cloud-database
We have large collection of open source products. Follow the tags from
Tag Cloud >>
Open source products are scattered around the web. Please provide information
about the open source projects you own / you use.
Add Projects.