telemetry-batch-view - A Scala framework to build derived datasets, aka batch views, of Telemetry data

  •        35

This is a Scala application to build derived datasets, also known as batch views, of Telemetry data.Raw JSON pings are stored on S3 within files containing framed Heka records. Reading the raw data in through e.g. Spark can be slow as for a given analysis only a few fields are typically used; not to mention the cost of parsing the JSON blobs. Furthermore, Heka files might contain only a handful of records under certain circumstances.



Related Projects

spark-movie-lens - An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

  •    Jupyter

This Apache Spark tutorial will guide you step-by-step into how to use the MovieLens dataset to build a movie recommender using collaborative filtering with Spark's Alternating Least Saqures implementation. It is organised in two parts. The first one is about getting and parsing movies and ratings data into Spark RDDs. The second is about building and using the recommender and persisting it for later use in our on-line recommender system. This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in the CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX, that is also publicly available since 2014 at Spark Summit. Starting from there, I've added with minor modifications to use a larger dataset, then code about how to store and reload the model for later use, and finally a web service using Flask.

Snap - A powerful open telemetry framework

  •    Go

Snap is an open telemetry framework designed to simplify the collection, processing and publishing of system data through a single API. The goals of this project are to Empower systems to expose a consistent set of telemetry data, Simplify telemetry ingestion across ubiquitous storage systems, Provide powerful clustered control of telemetry workflows across small or large clusters and lot more.

cernan - telemetry aggregation and shipping, last up the ladder

  •    Rust

Cernan is a telemetry and logging aggregation server. It exposes multiple interfaces for ingestion and can emit to multiple aggregation sources while doing in-flight manipulation of data. Cernan has minimal CPU and memory requirements and is intended to service bursty telemetry without load shedding. Cernan aims to be reliable and convenient to use, both for application engineers and operations staff. If you'd like to learn more, please do have a look in our wiki.

NR2003 season view

  •    DotNet

NR2003 season view application is program which consumes telemetry events produced by "Nascar racing 2003" game and represents them in race or season statistics data.

Mobius - C# and F# language binding and extensions to Apache Spark

  •    CSharp

Mobius provides C# language binding to Apache Spark enabling the implementation of Spark driver program and data processing operations in the languages supported in the .NET framework like C# or F#.For more code samples, refer to Mobius\examples directory or Mobius\csharp\Samples directory.

Disable-Nvidia-Telemetry - Windows utility to disable Nvidia's telemetry services

  •    CSharp

Disable Nvidia Telemetry is a utility that allows you to disable the telemetry services Nvidia bundles with their drivers.

opensoc - OpenSOC Apache Hadoop Code


OpenSOC integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis. OpenSOC provides capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, while applying the most current threat intelligence information to security telemetry within a single platform. A mechanism to capture, store, and normalize any type of security telemetry at extremely high rates. Because security telemetry is constantly being generated, it requires a method for ingesting the data at high speeds and pushing it to various processing units for advanced computation and analytics.

Mosquitto - An Open Source MQTT v3.1 Broker

  •    C

Mosquitto is an open source message broker that implements the MQ Telemetry Transport protocol version 3.1. MQTT provides a lightweight method of carrying out messaging using a publish/subscribe model. This makes it suitable for machine to machine messaging such as with low power sensors or mobile devices such as phones, embedded computers or microcontrollers like the Arduino.

nodejs-dashboard - Telemetry dashboard for node.js apps from the terminal!

  •    Javascript

Determine in realtime what's happening inside your node process from the terminal. No need to instrument code to get the deets. Also splits stderr/stdout to help spot errors sooner.NOTE: This module isn't designed for production use and should be limited to development environments.

Apache Spot - A Community Approach to Fighting Cyber Threats

  •    Java

Apache Spot is a community-driven cybersecurity project, built from the ground up, to bring advanced analytics to all IT Telemetry data on an open, scalable platform. pot expedites threat detection, investigation, and remediation via machine learning and consolidates all enterprise security data into a comprehensive IT telemetry hub based on open data models.

bigdata-ecosystem - BigData Ecosystem Dataset

  •    HTML

Incomplete-but-useful list of big-data related projects packed into a JSON dataset. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Calliope - Bridge between Cassandra and Spark framework

  •    Scala

Calliope provides a bridge between Cassandra and Spark framework allowing you to create those magical, realtime bigdata apps with ease. It is a library providing an interface to consume data from Cassandra to spark and store RDDs from Spark to Cassandra.

spark-py-notebooks - Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

  •    Jupyter

This is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the Python language. If Python is not your language, and it is R, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if your are interested in being introduced to some basic Data Science Engineering, you might find these series of tutorials interesting. There we explain different concepts and applications using Python and R.

shc - The Apache Spark - Apache HBase Connector is a library to support Spark accessing HBase table as external data source or sink

  •    Scala

The Apache Spark - Apache HBase Connector is a library to support Spark accessing HBase table as external data source or sink. With it, user can operate HBase with Spark-SQL on DataFrame and DataSet level. With the DataFrame and DataSet support, the library leverages all the optimization techniques in catalyst, and achieves data locality, partition pruning, predicate pushdown, Scanning and BulkGet, etc.

Optimus - :truck: Agile Data Science Workflows made easy with Python and Spark.

  •    Python

Optimus is the missing framework to profile, clean, process and do ML in a distributed fashion using Apache Spark(PySpark). You can go to the 10 minutes to Optimus notebook where you can find the basic to start working.

snappydata - SnappyData - The Spark Database. Stream, Transact, Analyze, Predict in one cluster

  •    Scala

Apache Spark is a general purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, this can also be quite inefficient and expensive. Analytic processing requires massive data sets to be repeatedly copied and data to be reformatted to suit Spark. In many cases, it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, it necessitates streaming the entire table into Spark to do the aggregation. Caching within Spark is immutable and results in stale insight. At SnappyData, we take a very different approach. SnappyData fuses a low latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten). All query engine operators are significantly more optimized through better vectorization and code generation. The net effect is, an order of magnitude performance improvement when compared to native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.

flint - A Time Series Library for Apache Spark

  •    Scala

The ability to analyze time series data at scale is critical for the success of finance and IoT applications based on Spark. Flint is Two Sigma's implementation of highly optimized time series operations in Spark. It performs truly parallel and rich analyses on time series data by taking advantage of the natural ordering in time series data to provide locality-based optimizations. Flint is an open source library for Spark based around the TimeSeriesRDD, a time series aware data structure, and a collection of time series utility and analysis functions that use TimeSeriesRDDs. Unlike DataFrame and Dataset, Flint's TimeSeriesRDDs can leverage the existing ordering properties of datasets at rest and the fact that almost all data manipulations and analysis over these datasets respect their temporal ordering properties. It differs from other time series efforts in Spark in its ability to efficiently compute across panel data or on large scale high frequency data.

Hebrew Mozilla

  •    Javascript

We provide a Hebrew version of Mozilla Firefox, Mozilla Thunderbird, SeaMonkey and Mozilla Application Suite. SeaMonkey/Mozilla Suite are also available for download from here, while the latest Firefox amp; Thunderbird are available only from

Checky for Firefox, Mozilla, Netscape

  •    Javascript

Checky is an easy to use interface to many online validation and analysis services. Validate, analyze and view documents containing HTML, XHTML, CSS, RDF, RSS, XML, P3P, hyperlinks and metadata. Check Section 508 and WAI compliance of your documents.

Belarusian Mozilla


Belarusian localization of Mozilla Application Suite, Mozilla Firefox, Mozilla Thunderbird, SeaMonkey.