flint - A Time Series Library for Apache Spark

  •        166

The ability to analyze time series data at scale is critical for the success of finance and IoT applications based on Spark. Flint is Two Sigma's implementation of highly optimized time series operations in Spark. It performs truly parallel and rich analyses on time series data by taking advantage of the natural ordering in time series data to provide locality-based optimizations. Flint is an open source library for Spark based around the TimeSeriesRDD, a time series aware data structure, and a collection of time series utility and analysis functions that use TimeSeriesRDDs. Unlike DataFrame and Dataset, Flint's TimeSeriesRDDs can leverage the existing ordering properties of datasets at rest and the fact that almost all data manipulations and analysis over these datasets respect their temporal ordering properties. It differs from other time series efforts in Spark in its ability to efficiently compute across panel data or on large scale high frequency data.

https://github.com/twosigma/flint

Tags
Implementation
License
Platform

   




Related Projects

spark-timeseries - A library for time series analysis on Apache Spark

  •    Scala

A library for analyzing large-scale time series data.

flint - An open-source lint program for C++ developed by, and formerly used at Facebook.

  •    D

flint is an open-source lint program for C++ developed and used at Facebook. flint is published on Github at https://github.com/facebook/flint; for discussions, there is a Google group at https://groups.google.com/d/forum/facebook-flint.

flint - Flint is a powerful Sass grid framework designed for rapid responsive development.

  •    CSS

Flint is a responsive grid framework written in Sass, created to allow developers to rapidly produce responsive layouts that are built on a sophisticated foundation. Most notably? The syntax. Being a designer myself, a large amount of time was spent on this aspect. Flint is very unique in the fact that it uses only a single mixin for all output: _(...).

Flint Particle System for Silverlight

  •    Silverlight

Flint Particle System (Particle Engine) for Silverlight,based on original Flint Particles. XNA framework is not required.

Flint - The Flint framework for building apps on Apple platforms using Feature Driven Development

  •    Swift

We made Flint because we want people to build great apps for Apple platforms that make the most of native platform capabilities. We want to remove barriers to that, which means making it as simple as possible to get things running in a modern way.


Marketstore - DataFrame Server for Financial Timeseries Data

  •    Go

MarketStore is a database server optimized for financial timeseries data. You can think of it as an extensible DataFrame service that is accessible from anywhere in your system, at higher scalability. It is designed from the ground up to address scalability issues around handling large amounts of financial market data used in algorithmic trading backtesting, charting, and analyzing price history with data spanning many years, including tick-level for the all US equities or the exploding crypto currencies space. If you are struggling with managing lots of HDF5 files, this is perfect solution to your problem.

seasonal - Robustly estimate trend and periodicity in a timeseries.

  •    Python

Robustly estimate trend and periodicity in a timeseries. Seasonal can recover sharp trend and period estimates from noisy timeseries data with only a few periods. It is intended for estimating season, trend, and level when initializing structural timeseries models like Holt-Winters [Hyndman], and its defaults are biased towards the kinds of training data that arise in that setting. Input samples are assumed evenly-spaced from a continuous real-valued signal with additive noise but no anomalies.

EventQL - The database for large-scale event analytics

  •    C++

EventQL is a distributed, column-oriented database built for large-scale event collection and analytics. It runs super-fast SQL and MapReduce queries. Its features include Automatic partitioning, Columnar storage, Standard SQL support, Scales to petabytes, Timeseries and relational data, Fast range scans and lot more.

react-timeseries-charts - Declarative and modular timeseries charting components for React

  •    Javascript

This library contains a set of modular charting components used for building flexible interactive charts. It was built for React from the ground up, specifically to visualize timeseries data and network traffic data in particular. Low level elements are constructed using d3, while higher level composability is provided by React. Charts can be stacked as rows, overlaid on top of each other, or any combination, all in a highly declarative manner. The library is used throughout the public facing ESnet Portal.

mtail - extract whitebox monitoring data from application logs for collection in a timeseries database

  •    Go

mtail is a tool for extracting metrics from application logs to be exported into a timeseries database or timeseries calculator for alerting and dashboarding.It aims to fill a niche between applications that do not export their own internal state, and existing monitoring systems, without patching those applications or rewriting the same framework for custom extraction glue code.

Griddb - High performance, High scalability and High reliability database for big data

  •    C++

GridDB is an In-Memory NoSQL Database for highly scalable IoT applications . It has a KVS (Key-Value Store)-type data model that is suitable for sensor data stored in a timeseries. It is a database that can be easily scaled-out according to the number of sensors. High Reliability It is equipped with a structure to spread out the replication of key value data among fellow nodes so that in the event of a node failure, automatic failover can be carried out in a matter of seconds by using the replication function of other nodes.

flint - Check your project for common sources of contributor friction.

  •    Go

Check your project for common sources of contributor friction. Flint checks if your project's folder contains the proper files and structure to allow potential contributors to understand: 1) the project's goals, 2) how to contribute, 3) usage guidelines, and 4) how to install the project.

spark - .NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

  •    CSharp

.NET for Apache Spark provides high performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular Dataframe and SparkSQL aspects of Apache Spark, for working with structured data, and Spark Structured Streaming, for working with streaming data. .NET for Apache Spark is compliant with .NET Standard - a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.

snappydata - SnappyData - The Spark Database. Stream, Transact, Analyze, Predict in one cluster

  •    Scala

Apache Spark is a general purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, this can also be quite inefficient and expensive. Analytic processing requires massive data sets to be repeatedly copied and data to be reformatted to suit Spark. In many cases, it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, it necessitates streaming the entire table into Spark to do the aggregation. Caching within Spark is immutable and results in stale insight. At SnappyData, we take a very different approach. SnappyData fuses a low latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten). All query engine operators are significantly more optimized through better vectorization and code generation. The net effect is, an order of magnitude performance improvement when compared to native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.

spark-nlp - Natural Language Understanding Library for Apache Spark.

  •    Jupyter

John Snow Labs Spark-NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment. This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .

spark-cassandra-connector - DataStax Spark Cassandra Connector

  •    Scala

Lightning-fast cluster computing with Apache Spark™ and Apache Cassandra®.This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications.

spark-movie-lens - An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

  •    Jupyter

This Apache Spark tutorial will guide you step-by-step into how to use the MovieLens dataset to build a movie recommender using collaborative filtering with Spark's Alternating Least Saqures implementation. It is organised in two parts. The first one is about getting and parsing movies and ratings data into Spark RDDs. The second is about building and using the recommender and persisting it for later use in our on-line recommender system. This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in the CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX, that is also publicly available since 2014 at Spark Summit. Starting from there, I've added with minor modifications to use a larger dataset, then code about how to store and reload the model for later use, and finally a web service using Flask.

spark-ec2 - Scripts used to setup a Spark cluster on EC2

  •    Python

Please note: spark-ec2 is no longer under active development and the project has been archived. All the existing code, PRs and issues are still accessible but are now read-only. If you're looking for a similar tool that is under active development, we recommend you take a look at Flintrock. spark-ec2 allows you to launch, manage and shut down Apache Spark [1] clusters on Amazon EC2. It automatically sets up Apache Spark and HDFS on the cluster for you. This guide describes how to use spark-ec2 to launch clusters, how to run jobs on them, and how to shut them down. It assumes you've already signed up for an EC2 account on the Amazon Web Services site.

spark-jobserver - REST job server for Apache Spark

  •    Scala

spark-jobserver provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts. This repo contains the complete Spark job server project, including unit tests and deploy scripts. It was originally started at Ooyala, but this is now the main development repo. Other useful links: Troubleshooting, cluster, YARN client, YARN on EMR, Mesos, JMX tips.