Vespa - Yahoo's big data serving engine

  •    Java

Vespa is an engine for low-latency computation over large data sets. It stores and indexes your data so that queries, selection, and processing over the data can be performed at serving time. Vespa is the serving platform for Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, and Flickr.

Spark - Fast Cluster Computing

  •    Scala

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Shark - Hive on Spark

  •    Scala

Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users. It runs Hive queries up to 100x faster in memory, or 10x faster on disk. It is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

Calliope - Bridge between Cassandra and Spark framework

  •    Scala

Calliope provides a bridge between Cassandra and the Spark framework, allowing you to create those magical, real-time big data apps with ease. It is a library providing an interface to consume data from Cassandra into Spark and to store RDDs from Spark into Cassandra.

magellan - Geo Spatial Data Analytics on Spark

  •    Scala

Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and relies on modern database techniques such as efficient data layout, code generation, and query optimization to optimize geospatial queries. The application developer writes standard SQL or DataFrame queries to evaluate geometric expressions. The execution engine takes care of laying data out efficiently in memory during query processing, picking the right query plan, and optimizing query execution with cheap and efficient spatial indices, all while presenting a declarative abstraction to the developer.

Gimel - PayPal's Big Data Processing Framework

  •    Scala

Gimel provides a unified Data API to access data from any storage, such as HDFS, GS, Alluxio, HBase, Aerospike, BigQuery, Druid, Elastic, Teradata, Oracle, MySQL, etc.

maha - A framework for rapid reporting API development, with out-of-the-box support for high-cardinality dimension lookups with Druid

  •    Scala

A centralised library for building reporting APIs on top of multiple data stores to exploit them for what they do best. We run millions of analytics queries on multiple data sources every day. They run on Hive, Oracle, Druid, etc. We needed a way to use each data store in our architecture for what it does best, which meant we had to easily identify and tune the sets of use cases each data store fits. Our goal became to build a centralized system that makes these decisions on the fly at query time and also takes care of end-to-end query execution. The system needed to take in all the available heuristics, apply any constraints already defined in the system, and select the best data store to run the query. It would then need to generate the underlying queries and pass all available information to the query execution layer to facilitate further optimization at that layer.
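
The query-time selection described above can be illustrated with a minimal sketch. This is not Maha's API: the `Store` and `Query` types, the `cost_per_day` heuristic, and the sample stores are all hypothetical, chosen only to show the idea of filtering stores by constraints and then picking the cheapest one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    # All fields are illustrative assumptions, not Maha configuration.
    name: str
    dimensions: frozenset  # columns this store can serve
    max_days: int          # longest date range the store may answer
    cost_per_day: float    # made-up heuristic: relative cost per day scanned


@dataclass(frozen=True)
class Query:
    dimensions: frozenset  # columns the report asks for
    days: int              # length of the requested date range


def choose_store(query, stores):
    """Return the cheapest store that satisfies every constraint, or None."""
    eligible = [
        s for s in stores
        if query.dimensions <= s.dimensions and query.days <= s.max_days
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda s: s.cost_per_day * query.days)


# Hypothetical catalogue: fast but narrow, up to slow but complete.
STORES = [
    Store("druid", frozenset({"campaign", "device"}), 30, 1.0),
    Store("oracle", frozenset({"campaign", "device", "advertiser"}), 400, 5.0),
    Store("hive", frozenset({"campaign", "device", "advertiser", "keyword"}), 3650, 20.0),
]
```

With this catalogue, a 7-day query on `campaign` goes to the `druid` entry, while a 90-day query on `advertiser` falls through to `oracle` because the first store fails both constraints. A real system would then generate the store-specific query, which this sketch leaves out.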


eel-sdk - Big Data Toolkit for the JVM

  •    Scala

Eel is a toolkit for manipulating data in the Hadoop ecosystem. By Hadoop ecosystem we mean file formats common to the big data world, such as Parquet, ORC, and CSV, in locations such as HDFS or Hive tables. In contrast to distributed batch or streaming engines such as Spark or Flink, Eel is an SDK intended to be used directly in process. Eel is a lower-level API than engines like Spark and is aimed at use cases where you want something like a file API.

mist - Serverless proxy for Spark cluster

  •    Scala

Hydrosphere Mist is a serverless proxy for Spark clusters. Mist provides a new functional programming framework and deployment model for Spark applications. It creates a unified API layer for building enterprise solutions and microservices on top of Spark functions.

spark-on-lambda - Apache Spark on AWS Lambda

  •    Scala

AWS Lambda is a Function as a Service offering that is serverless, scales up quickly, and bills usage at 100 ms granularity. We thought it would be interesting to see whether we could get Apache Spark to run on Lambda. To validate the idea, we hacked together a prototype to see if it works. We were able to make it work by changing Spark's scheduler and shuffle code. Since AWS Lambda has a 5-minute maximum run time, we have to shuffle over external storage, so we modified the shuffle parts of the Spark code to shuffle over external storage such as S3. This is a prototype; it is not battle-tested and may have bugs. The changes are made against the open source Apache Spark 2.1.0 release. We also have a fork of Spark 2.2.0, which has a few bugs and will be pushed here soon. We welcome contributions from developers.
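
The idea of shuffling over external storage can be sketched in a few lines. This is not the project's code and not Spark's shuffle API: `ExternalStore` is a hypothetical in-memory stand-in for an object store such as S3, and the key layout is invented for illustration. Map tasks write one object per reducer and exit; reduce tasks later read only the objects under their own prefix, so no task has to stay alive to serve shuffle data.

```python
import zlib
from collections import defaultdict


class ExternalStore:
    """In-memory stand-in for an object store such as S3 (hypothetical API)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]

    def list(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


def map_task(map_id, records, num_reducers, store):
    """Partition (key, value) records by key and write one object per reducer."""
    buckets = defaultdict(list)
    for key, value in records:
        # crc32 gives a partition that is stable across processes.
        reduce_id = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[reduce_id].append((key, value))
    for reduce_id, bucket in buckets.items():
        store.put(f"shuffle/reduce={reduce_id}/map={map_id}", bucket)


def reduce_task(reduce_id, store):
    """Read every map output addressed to this reducer and sum values per key."""
    totals = defaultdict(int)
    for object_key in store.list(f"shuffle/reduce={reduce_id}/"):
        for key, value in store.get(object_key):
            totals[key] += value
    return dict(totals)


def word_count(splits, num_reducers=2):
    """Run the map phase, then the reduce phase, sharing only the store."""
    store = ExternalStore()
    for map_id, split in enumerate(splits):
        map_task(map_id, [(word, 1) for word in split], num_reducers, store)
    result = {}
    for reduce_id in range(num_reducers):
        result.update(reduce_task(reduce_id, store))
    return result
```

Calling `word_count([["a", "b", "a"], ["b", "c"]])` counts words across both splits. Because every key maps to exactly one reducer, the per-reducer results never overlap and can simply be merged.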

cloudberry - Big Data Visualization

  •    Scala

Option 1: Follow the official documentation to set up a fully functional cluster. Option 2: Use the prebuilt AsterixDB Docker image to run a small test cluster locally; this approach is intended for debugging.

cypher-for-apache-spark - Cypher for Apache Spark brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark

  •    Scala

Okapi is a compiler pipeline for Cypher queries, including a consumer API. It translates Cypher query strings into a declarative intermediate representation, then into a logical execution plan, and then into an execution plan in relational algebra.

sparkling-graph - SparklingGraph provides an easy-to-use set of features that let you process large-scale graphs using Spark and GraphX

  •    Scala

SparklingGraph provides an easy-to-use set of features that let you process large-scale graphs using Spark and GraphX. Bartusiak et al. (2017). SparklingGraph: large scale, distributed graph processing made easy. Manuscript in preparation.

metorikku - A simplified, lightweight ETL Framework based on Apache Spark

  •    Scala

Metorikku is a library that simplifies writing and executing ETLs on top of Apache Spark. A user writes a simple YAML configuration file that includes SQL queries and runs Metorikku on a Spark cluster. The platform also includes a way to write tests for metrics using MetorikkuTester. To run Metorikku you must first define two files.