Displaying 20 to 40 from 126 results

Presto - Distributed SQL query engine for big data

  •    Java

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It allows querying data from relational / nosql databases. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. It is developed by Facebook.

Apache REEF - a stdlib for Big Data

  •    Java

Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop

  •    Java

Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.




spark-movie-lens - An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

  •    Jupyter

This Apache Spark tutorial will guide you step-by-step into how to use the MovieLens dataset to build a movie recommender using collaborative filtering with Spark's Alternating Least Saqures implementation. It is organised in two parts. The first one is about getting and parsing movies and ratings data into Spark RDDs. The second is about building and using the recommender and persisting it for later use in our on-line recommender system. This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in the CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX, that is also publicly available since 2014 at Spark Summit. Starting from there, I've added with minor modifications to use a larger dataset, then code about how to store and reload the model for later use, and finally a web service using Flask.

Apache Metron - Real-time Big Data Security

  •    Java

Metron integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis. Metron provides capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, while applying the most current threat intelligence information to security telemetry within a single platform.

Hue - The open source Apache Hadoop UI

  •    Java

Hue is a Web application for interacting with Apache Hadoop. It supports a FileBrowser for accessing HDFS, JobBrowser for accessing MapReduce jobs (MR1/MR2-YARN), Job Designer for creating MapReduce/Streaming/Java jobs, HBase Browser for exploring and modifying HBase tables and data, Oozie App for submitting and scheduling workflows and bundles, A Pig/HBase/Sqoop2 shell, Beeswax application for executing Hive queries, Search app for querying Solr and Solr Cloud.

Cascalog - Data processing on Hadoop

  •    Clojure

Cascalog is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.


Facets - Visualizations for machine learning datasets

  •    Typescript

The facets project contains two visualizations for understanding and analyzing machine learning datasets: Facets Overview and Facets Dive. The visualizations are implemented as Polymer web components, backed by Typescript code and can be easily embedded into Jupyter notebooks or webpages.

sciblog_support - Support content for my blog

  •    Jupyter

This repo contains the projects, additional information and code to support my blog: sciblog. You can find a list of all the post I made in this file.

vue-virtual-scroll-list - A vue component that support big data list with high scroll performance.

  •    Javascript

If you are looking for a vue component which support big data list and high scroll performance, you are in the right place. Tiny and very very easy to use.

Shark - Hive on Spark

  •    Scala

Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users. It runs Hive queries up to 100x faster in memory, or 10x on disk. it is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster

  •    C

Postgres-XL is a horizontally scalable open source SQL database cluster, flexible enough to handle varying database workloads like OLTP, Business Intelligence requiring MPP parallelism, Key value store, GIS Geospatial and lot more.

Fluo - Make incremental updates to large data sets stored in Apache Accumulo

  •    Java

Apache Fluo (incubating) is an open source implementation of Percolator (which populates Google's search index) for Apache Accumulo. Fluo makes it possible to update the results of a large-scale computation, index, or analytic as new data is discovered. When combining new data with existing data, Fluo offers reduced latency when compared to batch processing frameworks (e.g Spark, MapReduce).

ConcourseDB - Self-tuning database designed for both transactions and ad hoc analytics across time

  •    Java

ConcourseDB is a distributed self-tuning database with automatic indexing, version control and ACID transactions. ConcourseDB provides a more intuitive approach to data management that is easy to deploy, access and scale while maintaining the strong consistency of traditional database systems.

AsterixDB - Big Data Management System (BDMS)

  •    Java

AsterixDB is a BDMS (Big Data Management System) with a rich feature set that sets it apart from other Big Data platforms. Its feature set makes it well-suited to modern needs such as web data warehousing and social data storage and analysis. It is a highly scalable data management system that can store, index, and manage semi-structured data, but it also supports a full-power query language with the expressiveness of SQL (and more).

Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines

  •    Java

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

TrailDB - Efficient tool for storing and querying series of events

  •    C

TrailDB is a library, implemented in C, which allows you to query series of events at blazing speed. TrailDB is also optimized for speed of development: Use its simple API with your favorite language, in your favorite environment. TrailDB's secret sauce is data compression. It leverages predictability of time-based data to compress your data to a fraction of its original size. In contrast to traditional compression, you can query the encoded data directly, decompressing only the parts you need.

StratoSphere - Cloud Computing Framework for Big Data Analytics

  •    Java

The Stratosphere System is an open-source cluster/cloud computing framework for Big Data analytics. It comprises of An extensible higher level language (Meteor) to quickly compose queries for common and recurring use cases, A parallel programming model (PACT, an extension of MapReduce) to run user-defined operations, An efficient massively parallel runtime (Nephele) for fault tolerant execution of acyclic data flows.