Mobius - C# and F# language binding and extensions to Apache Spark

  •    CSharp

Mobius provides C# language bindings for Apache Spark, enabling the implementation of Spark driver programs and data processing operations in languages supported by the .NET framework, such as C# or F#. For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.

modin - Modin: Speed up your Pandas workflows by changing a single line of code

  •    Python

Modin uses Ray to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical. To use Modin, you do not need to know how many cores your system has and you do not need to specify how to distribute the data. In fact, you can continue using your previous pandas notebooks while experiencing a considerable speedup from Modin, even on a single machine. Once you’ve changed your import statement, you’re ready to use Modin just like you would pandas.
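
A minimal sketch of the single-line change described above; everything after the import is ordinary pandas code:

    # Replace `import pandas as pd` with Modin's drop-in equivalent.
    import modin.pandas as pd

    # Unchanged pandas code from here on; Modin distributes the work via Ray.
    df = pd.DataFrame({"a": range(1000), "b": range(1000)})
    print(df.groupby("a").sum())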

datafusion - SQL Query Execution against Apache Arrow, in Rust

  •    Rust

DataFusion is a SQL parser, planner, and query execution library for Rust. A DataFrame API is also provided. DataFusion can be used as a crate dependency in your project to add SQL support for custom data sources.

datasheets - Read data from, write data to, and modify the formatting of Google Sheets

  •    Python

datasheets is a library for interfacing with Google Sheets, including reading data from, writing data to, and modifying the formatting of Google Sheets. It is built on top of Google's google-api-python-client and oauth2client libraries using the Google Drive v3 and Google Sheets v4 REST APIs. It can be installed with pip via pip install datasheets.
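
A hedged sketch of typical usage; the client and method names below follow the project's README as best recalled here and should be verified against it, and the workbook and tab names are placeholders:

    import datasheets

    # OAuth credentials must be configured beforehand, per the project docs.
    client = datasheets.Client()
    workbook = client.fetch_workbook("my_workbook")   # placeholder name
    tab = workbook.fetch_tab("my_tab")                # placeholder name

    df = tab.fetch_data()      # read the sheet into a pandas DataFrame
    tab.insert_data(df)        # write a DataFrame back to the sheet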

pdpipe - Easy pipelines for pandas DataFrames.

  •    Python

Easy pipelines for pandas DataFrames. Some pipeline stages require scikit-learn; if scikit-learn is not found on the system, those stages are simply not loaded and pdpipe issues a warning. To use them, you must also install scikit-learn.
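
A brief sketch of the pipeline style; the stage names follow the pdpipe README (ColDrop, OneHotEncode), and the sample data is invented for illustration:

    import pandas as pd
    import pdpipe as pdp

    df = pd.DataFrame(
        data=[[23, "USA", 180], [31, "UK", 170]],
        columns=["Age", "Born", "Height"],
    )

    # Stages compose with +; calling the pipeline applies them in order.
    pipeline = pdp.ColDrop("Height") + pdp.OneHotEncode("Born")
    result = pipeline(df)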

vaex - Lazy Out-of-Core DataFrames for Python, visualize and explore big tabular data at a billion rows per second

  •    Python

Vaex is a Python library for out-of-core DataFrames (similar to pandas), used to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at up to a billion (10⁹) objects/rows per second. Visualization is done using histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory wasted).
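
A small sketch of the lazy, out-of-core style described above, using the bundled example dataset; the calls follow the vaex README:

    import vaex

    df = vaex.example()            # built-in demo dataset, memory-mapped

    # Statistics run in chunks over the mapped data, so memory stays flat.
    print(df.mean(df.x), df.std(df.x))

    # Counts on a regular grid: the building block for histograms and
    # density plots.
    counts = df.count(binby=df.x, limits=[-10, 10], shape=64)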

jardin - A pandas.DataFrame-based ORM.

  •    Python

jardin (noun, French) – garden, yard, grove. Jardin is a pandas.DataFrame-based ORM for Python applications.
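
An illustrative sketch only: the model class and query below are hypothetical, and the API details should be checked against the jardin README (database configuration is omitted):

    import jardin

    # Hypothetical model mapping a `users` table; jardin models behave like
    # pandas DataFrames.
    class Users(jardin.Model):
        pass

    # select() issues SQL and returns the result set as a DataFrame.
    df = Users.select(where={"name": "John"})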

drill-sergeant-rstats - 📗 A Little Book About Using Apache Drill and R

  •    R

A little book about using Apache Drill with R (created in response to a couple of tweets by spiffy #rstats folks).

qframe - Immutable data frame for Go

  •    Go

QFrame is an immutable data frame that supports filtering, aggregation, and data manipulation. Any operation on a QFrame results in a new QFrame; the original QFrame remains unchanged. This can be done fairly efficiently, since much of the underlying data is shared between the two frames. The design of QFrame has mainly been driven by the requirements of qocache, but it is in many respects a general-purpose data frame. Suggestions for added or improved functionality to support a wider scope are always of interest, as long as they don't conflict with the requirements of qocache! See Contribute.

Spark-Example - Spark 1.6 and Spark 2.2 examples

  •    Scala

Examples for Spark 1.6 and Spark 2.2, covering Kafka, Flume, Structured Streaming, Jedis, Elasticsearch, MySQL, and DataFrames.

NimData - DataFrame API written in Nim, enabling fast out-of-core data processing

  •    Nim

DataFrame API written in Nim, enabling fast out-of-core data processing. NimData is inspired by frameworks like Pandas, Spark, Flink, and Thrill, and sits between Pandas on one side and Spark/Flink/Thrill on the other: like Pandas, NimData is currently non-distributed, but it shares the type-safe, lazy API of Spark/Flink/Thrill. Thanks to Nim, it enables elegant out-of-core processing at native speed.

sparkflow - Easy to use library to bring Tensorflow on Apache Spark

  •    Python

This is an implementation of TensorFlow on Spark. The goal of this library is to provide a simple, understandable interface for using TensorFlow on Spark. With SparkFlow, you can easily integrate your deep learning model with a Spark ML Pipeline. Underneath, SparkFlow uses a parameter server to train the TensorFlow network in a distributed manner. Through the API, the user can specify the style of training, whether that is Hogwild or asynchronous with locking. While there are other libraries that use TensorFlow on Apache Spark, SparkFlow's objective is to work seamlessly with ML Pipelines, provide a simple interface for training TensorFlow graphs, and give basic abstractions for faster development. For training, SparkFlow uses a parameter server that lives on the driver and allows for asynchronous training. This tool provides faster training time when using big data.
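
A hedged sketch of the pipeline integration described above; build_graph, SparkAsyncDL, and their parameter names follow the SparkFlow README as best recalled here and should be verified against it:

    import tensorflow as tf
    from sparkflow.graph_utils import build_graph
    from sparkflow.tensorflow_async import SparkAsyncDL
    from pyspark.ml.pipeline import Pipeline

    def small_model():
        # Plain TensorFlow 1.x graph; SparkFlow serializes it to executors.
        x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
        y = tf.placeholder(tf.float32, shape=[None, 10], name="y")
        out = tf.layers.dense(x, 10)
        tf.argmax(out, 1, name="out")
        return tf.losses.softmax_cross_entropy(y, out)

    spark_model = SparkAsyncDL(
        inputCol="features",              # vector column built upstream
        tensorflowGraph=build_graph(small_model),
        tfInput="x:0", tfLabel="y:0", tfOutput="out:0",
        tfLearningRate=0.001, iters=20,
    )

    # Drops into a standard Spark ML Pipeline alongside other stages.
    pipeline = Pipeline(stages=[spark_model])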

spark-daria - Essential Spark extensions and helper methods ✨😲

  •    Scala

Spark helper methods to maximize developer productivity. Fetch the JAR file from Maven.

PySpark_Basics - Fundamentals of PySpark, code examples

  •    Jupyter

Apache Spark is one of the hottest new trends in the technology domain. It is the framework with probably the highest potential to realize the fruit of the marriage between Big Data and Machine Learning. It runs fast (up to 100x faster than traditional Hadoop MapReduce) due to in-memory operation, offers robust, distributed, fault-tolerant data objects (called RDDs), and integrates beautifully with the world of machine learning and graph analytics through supplementary packages like MLlib and GraphX. Unlike most Python libraries, getting PySpark to work properly is not as straightforward as pip install ... and import ... Most of us with a Python-based data science and Jupyter/IPython background take this workflow for granted for all popular Python packages: we head over to our CMD or Bash shell, type the pip install command, launch a Jupyter notebook, and import the library to start practicing.
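
Once a Spark installation is actually on the path (the hard part the paragraph describes), a minimal PySpark session is plain Python:

    from pyspark.sql import SparkSession

    # Assumes a working local Spark install; at the time this repo was
    # written, `pip install pyspark` alone was not the standard route.
    spark = (SparkSession.builder
             .appName("basics")
             .master("local[*]")
             .getOrCreate())

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
    df.show()

    spark.stop()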