
pachyderm - Reproducible Data Science at Scale!

  •    Go

Pachyderm is a tool for production data pipelines. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis in a sane way, then Pachyderm is for you. If you have an existing set of scripts which do this in an ad-hoc fashion and you're looking for a way to "productionize" them, Pachyderm can make this easy for you. Install Pachyderm locally or deploy on AWS/GCE/Azure in about 5 minutes.

ClickHouse - Columnar DBMS and Real Time Analytics

  •    C++

ClickHouse is an open source column-oriented database management system capable of generating analytical data reports in real time using SQL queries. It is linearly scalable, fast, highly reliable, and fault tolerant. Its features include data compression, real-time query processing, web analytics, vectorized query execution, and local and distributed joins. It can process hundreds of millions to more than a billion rows and tens of gigabytes of data per single server per second.

Spark - Fast Cluster Computing

  •    Scala

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Presto - Distributed SQL query engine for big data

  •    Java

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. It allows querying data from relational and NoSQL databases. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. It is developed by Facebook.

sciblog_support - Support content for my blog

  •    Jupyter

This repo contains the projects, additional information, and code to support my blog: sciblog. You can find a list of all the posts I have made in this file.

ConcourseDB - Self-tuning database designed for both transactions and ad hoc analytics across time

  •    Java

ConcourseDB is a distributed self-tuning database with automatic indexing, version control and ACID transactions. ConcourseDB provides a more intuitive approach to data management that is easy to deploy, access and scale while maintaining the strong consistency of traditional database systems.

AsterixDB - Big Data Management System (BDMS)

  •    Java

AsterixDB is a BDMS (Big Data Management System) with a rich feature set that sets it apart from other Big Data platforms. Its feature set makes it well-suited to modern needs such as web data warehousing and social data storage and analysis. It is a highly scalable data management system that can store, index, and manage semi-structured data, but it also supports a full-power query language with the expressiveness of SQL (and more).

Dremio - The missing link in modern data

  •    Java

Dremio is a self-service data platform that empowers users to discover, curate, accelerate, and share any data at any time, regardless of location, volume, or structure. Modern data is managed by a wide range of technologies, including relational databases, NoSQL datastores, file systems, Hadoop, and others. Many of the newer datastores are more agile and provide improved scalability, but at a cost to speed and ease of access via traditional SQL-based analysis tools. Additionally, raw data found in these stores is often too complex or inconsistent for analysts to use with business intelligence tools.

maha - A framework for rapid reporting API development, with out-of-the-box support for high cardinality dimension lookups with Druid

  •    Scala

A centralised library for building reporting APIs on top of multiple data stores to exploit them for what they do best. We run millions of queries on multiple data sources for analytics every day. They run on Hive, Oracle, Druid, etc. We needed a way to utilize the data stores in our architecture to exploit them for what they do best. This meant we needed to easily tune and identify sets of use cases where each data store fits best. Our goal became to build a centralized system that could make these decisions on the fly at query time and also take care of end-to-end query execution. The system needed to take in all the heuristics available, apply any constraints already defined in the system, and select the best data store to run the query. It would then need to generate the underlying queries and pass on all available information to the query execution layer to facilitate further optimization at that layer.

countly-sdk-js - Countly Product Analytics SDK for Icenium and Phonegap

  •    Java

Questions? Visit http://community.count.ly. Countly is an innovative, real-time, open source mobile analytics and push notifications platform. It collects data from mobile devices and visualizes this information to analyze mobile application usage and end-user behavior. There are two parts of Countly: the server that collects and analyzes data, and the mobile SDK that sends this data. Both parts are open source with different licensing terms.

data-science-live-book - An open source book to learn data science, data analysis and machine learning, suitable for all ages!

  •    TeX

This book is now available at Amazon in Kindle, Black & White and color (http://a.co/d/dIj1XwD) 📗 🚀. Most of the written R code can be used in real scenarios! I worked on the funModeling R package at the same time, so it is used many times in the book.

cloudberry - Big Data Visualization

  •    Scala

Option 1: Follow the official documentation to set up a fully functional cluster. Option 2: Use the prebuilt AsterixDB docker image to run a small test cluster locally. This approach is intended for debugging.