
genie - Distributed Big Data Orchestration Service

  •    Java

Genie is a federated job orchestration engine developed by Netflix. Genie provides RESTful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Spark, Presto, Sqoop and more. It also provides APIs for managing the metadata of many distributed processing clusters and the commands and applications which run on them. See the official website to find documentation about Genie and specific documentation for various releases.

spark-py-notebooks - Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

  •    Jupyter

This is a collection of IPython/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the Python language. If your language is R rather than Python, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in being introduced to some basic Data Science Engineering, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R.

Mobius - C# and F# language binding and extensions to Apache Spark

  •    CSharp

Mobius provides a C# language binding to Apache Spark, enabling the implementation of Spark driver programs and data processing operations in the languages supported by the .NET framework, like C# or F#. For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.

VoltDB - Fast Scalable SQL DBMS with ACID

  •    Java

VoltDB was specifically designed for contemporary software applications that are pushed beyond their limits by high volume data sources. VoltDB provides the ability to capture, store and process incoming data at millions of read/write operations per second. And VoltDB’s relational model opens that data to be analyzed in real-time, using familiar Business Intelligence tools, to identify data patterns and trends, spot anomalies, or perform tracking and alerting.




spark-movie-lens - An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

  •    Jupyter

This Apache Spark tutorial will guide you step-by-step through using the MovieLens dataset to build a movie recommender using collaborative filtering with Spark's Alternating Least Squares implementation. It is organised in two parts. The first one is about getting and parsing movies and ratings data into Spark RDDs. The second is about building and using the recommender and persisting it for later use in our on-line recommender system. This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in the CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX, that is also publicly available since 2014 at Spark Summit. Starting from there, I've added minor modifications to use a larger dataset, then code about how to store and reload the model for later use, and finally a web service using Flask.
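As a rough illustration of the alternating least squares idea mentioned above, here is a minimal pure-Python sketch. It is not the tutorial's code and does not use Spark: it assumes a single latent factor per user and per movie, and the function names (`als_rank1`, `rmse`) and the toy ratings are made up for this example.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=20):
    """Fit one latent factor per user and per item by alternating least squares.

    `ratings` is a list of (user, item, rating) triples. Each half-step fixes
    one side and solves the regularised least-squares problem for the other.
    """
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Item factors fixed: closed-form update for every user.
        for u in range(n_users):
            rated = [(i, r) for (uu, i, r) in ratings if uu == u]
            num = sum(r * item_f[i] for i, r in rated)
            den = reg + sum(item_f[i] ** 2 for i, r in rated)
            user_f[u] = num / den
        # User factors fixed: closed-form update for every item.
        for i in range(n_items):
            raters = [(u, r) for (u, ii, r) in ratings if ii == i]
            num = sum(r * user_f[u] for u, r in raters)
            den = reg + sum(user_f[u] ** 2 for u, r in raters)
            item_f[i] = num / den
    return user_f, item_f


def rmse(ratings, user_f, item_f):
    """Root-mean-square error of the predictions on the observed ratings."""
    total = sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


# Toy data (hypothetical): 3 users, 2 movies, user 1 has not rated movie 1.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 0, 2.0), (2, 1, 1.0)]
baseline = rmse(ratings, [1.0] * 3, [1.0] * 2)
user_f, item_f = als_rank1(ratings, n_users=3, n_items=2)
trained = rmse(ratings, user_f, item_f)
predicted = user_f[1] * item_f[1]  # estimate for the missing rating
print(baseline, trained, predicted)
```

A real recommender would use more than one factor and a distributed solver; the sketch only shows why the method alternates between the two sides.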

Optimus - Agile Data Science Workflows made easy with Python and Spark.

  •    Python

Optimus is the missing framework to profile, clean, process and do ML in a distributed fashion using Apache Spark (PySpark). You can go to the "10 minutes to Optimus" notebook, where you can find the basics to start working.

vaex - Lazy Out-of-Core DataFrames for Python, visualize and explore big tabular data at a billion rows per second

  •    Python

Vaex is a Python library for Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc., on an N-dimensional grid at up to a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy and lazy computations for best performance (no memory wasted).
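To make "statistics on a grid" concrete, the following is a small pure-Python sketch of a mean computed per cell of a one-dimensional grid. It does not use Vaex or its API; the function name `binned_mean` and the sample data are assumptions made for illustration only.

```python
def binned_mean(xs, values, lo, hi, bins):
    """Mean of `values` per grid cell, where the cell is chosen by `xs`.

    The interval [lo, hi) is split into `bins` equal-width cells. Points
    outside the interval are ignored; empty cells yield None.
    """
    sums = [0.0] * bins
    counts = [0] * bins
    width = (hi - lo) / bins
    for x, v in zip(xs, values):
        if lo <= x < hi:
            # Clamp the index in case of floating-point rounding at the edge.
            cell = min(int((x - lo) / width), bins - 1)
            sums[cell] += v
            counts[cell] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


xs = [0.5, 1.5, 1.7, 3.2]
vals = [10.0, 20.0, 40.0, 5.0]
means = binned_mean(xs, vals, lo=0.0, hi=4.0, bins=4)
print(means)  # [10.0, 30.0, None, 5.0]
```

Only one pass over the data and two small accumulator arrays are needed, which is why this kind of aggregation suits data that does not fit in memory.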


telemetry-batch-view - A Scala framework to build derived datasets, aka batch views, of Telemetry data

  •    Scala

This is a Scala application to build derived datasets, also known as batch views, of Telemetry data. Raw JSON pings are stored on S3 within files containing framed Heka records. Reading the raw data in through, e.g., Spark can be slow, because a given analysis typically uses only a few fields, not to mention the cost of parsing the JSON blobs. Furthermore, Heka files might contain only a handful of records under certain circumstances.

empujar - When you need to push data around, you push it. A node.js ETL tool.

  •    Javascript

When you need to push data around, you push it. Push it real good. An ETL and Operations tool. Empujar's top level object is a "book", which contains "chapters" and then "pages". Chapters are executed 1-by-1 in order, and then each page in a chapter can be run in parallel (up to a threading limit you specify).
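The book/chapter/page execution model described above can be sketched in a few lines of Python. This is not Empujar's API (Empujar is a node.js tool); `run_book`, `make_page` and the chapter names are hypothetical and only illustrate "chapters in order, pages in parallel up to a limit".

```python
from concurrent.futures import ThreadPoolExecutor


def run_book(chapters, max_workers=2):
    """Run chapters one by one; run the pages of each chapter in parallel.

    `chapters` is a list of (name, pages) pairs, where every page is a
    zero-argument callable. Returns {chapter name: [page results]}.
    """
    results = {}
    for name, pages in chapters:
        # Leaving the `with` block waits for all pages of this chapter,
        # so the next chapter never starts early.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(page) for page in pages]
            results[name] = [f.result() for f in futures]
    return results


log = []  # records the order in which pages actually ran


def make_page(label):
    def page():
        log.append(label)
        return label
    return page


book = [
    ("extract", [make_page("extract:users"), make_page("extract:orders")]),
    ("load", [make_page("load:warehouse")]),
]
results = run_book(book, max_workers=2)
print(results)
```

Here `max_workers` plays the role of the threading limit: it caps how many pages of one chapter run at the same time.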

big-data-rosetta-code - Code snippets for solving common big data problems in various platforms

  •    Scala

Code snippets for solving common big data problems on various platforms. Inspired by Rosetta Code. Copyright 2016 Spotify AB.

hadoop-for-geoevent - ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.

  •    Java

ArcGIS 10.4 GeoEvent Extension for Server sample Hadoop Output Connector for storing GeoEvents in HDFS. Find a bug or want to request a new feature? Please let us know by submitting an issue.

mongodb-for-geoevent - ArcGIS GeoEvent Server sample MongoDB Connector for storing GeoEvents.

  •    Java

ArcGIS 10.4 GeoEvent Extension for Server sample MongoDB Output Connector for sending GeoEvents to MongoDB. Find a bug or want to request a new feature? Please let us know by submitting an issue.

leaflet-echarts - A plugin for Leaflet to load ECharts maps and make big data visualization easier.

  •    Javascript

A plugin for Leaflet to load ECharts maps and make big data visualization easier. This is a beta version, so it may have some bugs; it works best in Chrome. When you want to drag the map, drag on the basemap where there is no ECharts data. It seems that I have solved this problem.

all-your-github-are-belong-to-us - Save all your GitHub data to one place, private & public

  •    Javascript

GitHub is a fantastic tool, and I use it constantly. I'd like to understand more of how I work on GitHub, but I don't have all my data. I can set up an RSS hook on my public feed, but I do a lot of work on private repos as well. Simple, just run script/bootstrap from your clone.

spark-r-notebooks - R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks

  •    Jupyter

This is a collection of Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the R language. If you are interested in being introduced to some basic Data Science Engineering concepts and applications, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R. Additionally, if you are interested in using Python with Spark, you can have a look at our pySpark notebooks.

lectures-hse-spark - Scalable machine learning and big data analysis with Apache Spark

  •    Jupyter

Scalable machine learning and big data analysis with Apache Spark

pyspark-notebook - Pyspark Notebook With Docker

  •    Python

Run your Docker container with docker-compose. It helps to keep your arguments and settings in a single file and run everything together in an isolated environment.

SparkTwitterAnalysis - An Apache Spark standalone application using the Spark API in Scala

  •    Scala

A standalone application using the Spark API in Scala. The application uses Simple Build Tool (SBT) to build the project. Use the sbt-assembly plugin to create a fat JAR of your project with all of its dependencies.

bigdata-docker - Docker images for Open Source bigdata/hadoop projects

  •    Shell

This is the umbrella project for all of my Docker images related to Apache Hadoop and other big data projects, both Apache and non-Apache. All of the Docker images contain the component extracted from the open source distribution and an advanced configuration loading mechanism.