Genie is a federated job orchestration engine developed by Netflix. Genie provides REST-ful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Spark, Presto, Sqoop and more. It also provides APIs for managing the metadata of many distributed processing clusters and the commands and applications which run on them.See the official website to find documentation about Genie and specific documentation for various releases.
big-data bigdata orchestration configuration configuration-management spring-boot distributed-systems netflixossThis is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the Python language. If Python is not your language, and it is R, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if your are interested in being introduced to some basic Data Science Engineering, you might find these series of tutorials interesting. There we explain different concepts and applications using Python and R.
spark pyspark data-analysis mllib ipython-notebook notebook ipython data-science machine-learning big-data bigdata.NET for Apache Spark provides high performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular Dataframe and SparkSQL aspects of Apache Spark, for working with structured data, and Spark Structured Streaming, for working with streaming data. .NET for Apache Spark is compliant with .NET Standard - a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.
spark analytics bigdata spark-streaming spark-sql machine-learning fsharp dotnet-core dotnet-standard streaming apache-spark tpcds tpch azure hdinsight databricks emr microsoftMobius provides C# language binding to Apache Spark enabling the implementation of Spark driver program and data processing operations in the languages supported in the .NET framework like C# or F#.For more code samples, refer to Mobius\examples directory or Mobius\csharp\Samples directory.
spark apache-spark rdd dataframe dstream dataset streaming mobius kafka-streaming spark-streaming fsharp bigdata mapreduce eventhubs near-real-timeVoltDB was specifically designed for contemporary software applications that are pushed beyond their limits by high volume data sources. VoltDB provides the ability to capture, store and process incoming data at millions of read/write operations per second. And VoltDB’s relational model opens that data to be analyzed in real-time, using familiar Business Intelligence tools, to identify data patterns and trends, spot anomalies, or perform tracking and alerting.
database relational acid bigdata scale-out distributed distributed-database oltp analyticsThis Apache Spark tutorial will guide you step-by-step into how to use the MovieLens dataset to build a movie recommender using collaborative filtering with Spark's Alternating Least Saqures implementation. It is organised in two parts. The first one is about getting and parsing movies and ratings data into Spark RDDs. The second is about building and using the recommender and persisting it for later use in our on-line recommender system. This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in the CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX, that is also publicly available since 2014 at Spark Summit. Starting from there, I've added with minor modifications to use a larger dataset, then code about how to store and reload the model for later use, and finally a web service using Flask.
movie-recommendation movielens-dataset spark flask big-data bigdataOptimus is the missing framework to profile, clean, process and do ML in a distributed fashion using Apache Spark(PySpark). You can go to the 10 minutes to Optimus notebook where you can find the basic to start working.
spark pyspark data-wrangling bigdata big-data-cleaning data-science cleansing data-cleansing data-cleaner apache-spark data-transformationsidekick is a high-performance sidecar load-balancer. By attaching a tiny load balancer as a sidecar to each of the client application processes, you can eliminate the centralized loadbalancer bottleneck and DNS failover management. sidekick automatically avoids sending traffic to the failed servers by checking their health via the readiness API and HTTP error returns.
minio-servers sidecar load-balancer proxy spark bigdata kubernetes minioVaex is a python library for Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (109) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).
dataframe bigdata tabular-data visualization memory-mapped-file hdf5This is a Scala application to build derived datasets, also known as batch views, of Telemetry data.Raw JSON pings are stored on S3 within files containing framed Heka records. Reading the raw data in through e.g. Spark can be slow as for a given analysis only a few fields are typically used; not to mention the cost of parsing the JSON blobs. Furthermore, Heka files might contain only a handful of records under certain circumstances.
telemetry spark bigdata mozilla datasetWhen you need to push data around, you push it. Push it real good. An ETL and Operations tool.Empujar's top level object is a "book", which contains "chapters" and then "pages". Chapters are excecuted 1-by-1 in order, and then each page in a chapter can be run in parallel (up to a threading limit you specify).
data big bigdata redshift amazon mysql aurora ftp elasticsearch etl elt job transform parallelCode snippets for solving common big data problems on various platforms. Inspired by Rosetta Code.Copyright 2016 Spotify AB.
scio spark scalding bigdataArcGIS 10.4 GeoEvent Extension for Server sample Hadoop Output Connector for storing GeoEvents in HDFS. Find a bug or want to request a new feature? Please let us know by submitting an issue.
arcgis geoevent hadoop connector hdfs transport arcgis-geoevent-server big-data bigdata serverArcGIS 10.4 GeoEvent Extension for Server sample MongoDB Ouptut Connector for sending GeoEvents to MongoDB. Find a bug or want to request a new feature? Please let us know by submitting an issue.
geoevent arcgis mongodb connector transport arcgis-geoevent-server big-data bigdata serverA plugin for leaflet to load echarts map and make BigData Visualization. This is a beta version,so it would have some bugs,visit it by chrome will be better. When you want to drag the map,drag on zhe basemap without echarts data. It seems that i have solved this problem.
leaflet echarts bigdata visualization leaflet-plugin mapsGitHub is a fantastic tool — I use it constantly. I'd like to understand more of how I work on GitHub, but I don't have all my data. I can setup an RSS hook on my public feed, but I do a lot of work on private repos as well. Simple, just run script/bootstrap from your clone.
github aggregate collect data bigdata webhook hookThis is a collection of Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the R language. If your are interested in being introduced to some basic Data Science Engineering concepts and applications, you might find these series of tutorials interesting. There we explain different concepts and applications using Python and R. Additionally, if you are interested in using Python with Spark, you can have a look at our pySpark notebooks.
sparkr notebook jupyter-notebook exploratory-data-analysis jupyter r data-science data-analysis big-data bigdataМасштабируемое машинное обучение и анализ больших данных с Apache Spark
spark mapreduce lectures machine-learning bigdataRun your docker with docker-compose. It helps to keep your arguments/settings in a single file and run together in an isolated environment.
docker docker-image pyspark-notebook spark pyspark docker-compose notebook apache-spark bigdata python-notebook jupyter-notebook
We have large collection of open source products. Follow the tags from
Tag Cloud >>
Open source products are scattered around the web. Please provide information
about the open source projects you own / you use.
Add Projects.