
vaex - Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualize and explore big tabular data at a billion rows per second 🚀

  •    Python

Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It calculates statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at more than a billion (10^9) samples/rows per second. Visualization is done using histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory is wasted). HDF5 and Apache Arrow are supported.
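As a rough sketch of that lazy, out-of-core workflow (the file name and the column x below are placeholders, not from the project's docs), assuming vaex is installed:

    import vaex

    # Memory-maps the file; no data is loaded into RAM up front
    df = vaex.open("big_data.hdf5")

    # Statistics are evaluated lazily over the full dataset
    mean_x = df.mean(df.x)

    # Binned statistic on a 1-D grid: counts of x in 64 bins between 0 and 100
    counts = df.count(binby=df.x, limits=[0, 100], shape=64)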

PandasGUI - A GUI for Pandas DataFrames

  •    Python

PandasGUI is a GUI for viewing, plotting and analyzing Pandas DataFrames. Issues, feedback and pull requests are welcome.
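Typical usage is a single call to show; the DataFrame below is just a toy example:

    import pandas as pd
    from pandasgui import show

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    # Opens the GUI with the DataFrame loaded for viewing, filtering, and plotting
    show(df)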

polars - Fast multi-threaded DataFrame library in Rust and Python

  •    Rust

Polars is a blazingly fast DataFrames library implemented in Rust, using Apache Arrow as its memory model. To learn more, read the User Guide.
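A minimal sketch of the Python bindings (the column names are invented, and expression syntax may differ slightly between Polars versions):

    import polars as pl

    df = pl.DataFrame({"city": ["A", "B", "A"], "sales": [10, 20, 30]})

    # Expressions are evaluated by the multi-threaded Rust/Arrow engine
    result = df.filter(pl.col("sales") > 10).select(pl.col("sales").sum())
    print(result)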

Mobius - C# and F# language binding and extensions to Apache Spark

  •    CSharp

Mobius provides C# language bindings to Apache Spark, enabling the implementation of Spark driver programs and data processing operations in languages supported by the .NET framework, such as C# and F#. For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.

modin - Modin: Speed up your Pandas workflows by changing a single line of code

  •    Python

Modin uses Ray to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical. To use Modin, you do not need to know how many cores your system has and you do not need to specify how to distribute the data. In fact, you can continue using your previous pandas notebooks while experiencing a considerable speedup from Modin, even on a single machine. Once you’ve changed your import statement, you’re ready to use Modin just like you would pandas.
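The single-line change looks like this; the CSV path and column name are placeholders:

    # Before: import pandas as pd
    import modin.pandas as pd

    # The rest stays plain pandas syntax, but work is distributed across all cores
    df = pd.read_csv("large_file.csv")
    print(df.groupby("some_column").count())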

datafusion - SQL Query Execution against Apache Arrow, in Rust

  •    Rust

DataFusion is a SQL parser, planner, and query execution library for Rust. A DataFrame API is also provided. DataFusion can be used as a crate dependency in your project to add SQL support for custom data sources.

tech.ml.dataset - A Clojure high performance data processing system

  •    Clojure

tech.ml.dataset is a Clojure library for data processing and machine learning. Datasets are currently in-memory columnwise databases, and parsing from a file or input stream is supported. Supported input formats include raw or gzipped csv/tsv, xls, xlsx, json, and sequences of maps. SQL bindings are provided as a separate library. Data size in memory is minimized (primitive arrays), datetime types are often converted to an integer representation, and strings are loaded into string tables. Together these features dramatically decrease the working-set size in memory. Because data is stored in a columnar fashion, columnwise operations on the dataset are very fast.

pandasvault - Advanced Pandas Vault — Utilities, Functions and Snippets (by @firmai).

  •    Python

The only Pandas utility package you would ever need. It has no exotic external dependencies. All functions have been compared and tested against alternatives; only the fastest equivalent functions have been included in this package. The package has more than 20 wrapped functions and 100 snippets. Animated Investment Management Research at Sov.ai sponsors open-source AI, machine learning, and data science initiatives.


datasheets - Read data from, write data to, and modify the formatting of Google Sheets

  •    Python

datasheets is a library for interfacing with Google Sheets, including reading data from, writing data to, and modifying the formatting of Google Sheets. It is built on top of Google's google-api-python-client and oauth2client libraries using the Google Drive v3 and Google Sheets v4 REST APIs. It can be installed with pip via pip install datasheets.
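A short usage sketch, assuming OAuth credentials are already configured; the workbook and tab names are placeholders, and the method names should be checked against the datasheets documentation:

    import datasheets

    client = datasheets.Client()
    workbook = client.fetch_workbook("example_workbook")
    tab = workbook.fetch_tab("Sheet1")

    # Read the tab into a pandas DataFrame
    df = tab.fetch_data()

    # Write a (possibly modified) DataFrame back to the tab
    tab.insert_data(df)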

pdpipe - Easy pipelines for pandas DataFrames.

  •    Python

Easy pipelines for pandas DataFrames. Some pipeline stages require scikit-learn; they will simply not be loaded if scikit-learn is not found on the system, and pdpipe will issue a warning. To use them you must also install scikit-learn.
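Stages compose with + into a pipeline that is then applied to a DataFrame; the columns below are invented for illustration, and the stage names should be verified against the pdpipe docs:

    import pandas as pd
    import pdpipe as pdp

    df = pd.DataFrame({"name": ["Alice", "Bob"], "gender": ["F", "M"], "age": [30, 40]})

    # Drop one column, then one-hot encode another
    pipeline = pdp.ColDrop("name") + pdp.OneHotEncode("gender")

    print(pipeline(df))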

vaex - Lazy Out-of-Core DataFrames for Python, visualize and explore big tabular data at a billion rows per second

  •    Python

Vaex is a Python library for out-of-core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at up to a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory is wasted).

eland - Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

  •    Python

Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar Pandas-compatible API. Where possible, the package uses existing Python APIs and data structures to make it easy to switch from numpy, pandas, and scikit-learn to their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows Eland to access large datasets stored in Elasticsearch.
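A rough sketch of the pandas-style access pattern, assuming a local Elasticsearch cluster with a flights index; the constructor arguments and column name are examples and vary between eland versions:

    import eland as ed

    # DataFrame backed by an Elasticsearch index; the data stays in the cluster
    df = ed.DataFrame("http://localhost:9200", es_index_pattern="flights")

    # pandas-style calls are translated into Elasticsearch queries and aggregations
    print(df.head())
    print(df["FlightDelayMin"].mean())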

jardin - A pandas.DataFrame-based ORM.

  •    Python

jardin (noun, french) – garden, yard, grove. Jardin is a pandas.DataFrame-based ORM for Python applications.

drill-sergeant-rstats - 📗 A Little Book About Using Apache Drill and R

  •    R

A little book about using Apache Drill with R (created in response to a couple of tweets by spiffy #rstats folks).

qframe - Immutable data frame for Go

  •    Go

QFrame is an immutable data frame that supports filtering, aggregation, and data manipulation. Any operation on a QFrame results in a new QFrame; the original QFrame remains unchanged. This can be done fairly efficiently since much of the underlying data is shared between the two frames. The design of QFrame has mainly been driven by the requirements of qocache, but it is in many respects a general-purpose data frame. Suggestions for added or improved functionality to support a wider scope are always of interest, as long as they don't conflict with the requirements of qocache! See Contribute.

Spark-Example - Spark1

  •    Scala

Examples for Spark 1.6 and Spark 2.2, covering Kafka, Flume, Structured Streaming, Jedis, Elasticsearch, MySQL, and DataFrames.

NimData - DataFrame API written in Nim, enabling fast out-of-core data processing

  •    Nim

DataFrame API written in Nim, enabling fast out-of-core data processing. NimData is inspired by frameworks like Pandas, Spark, Flink, and Thrill, and sits between Pandas on one side and Spark/Flink/Thrill on the other. Like Pandas, NimData is currently non-distributed, but it shares the type-safe, lazy API of Spark/Flink/Thrill. Thanks to Nim, it enables elegant out-of-core processing at native speed.

sparkflow - Easy to use library to bring Tensorflow on Apache Spark

  •    Python

This is an implementation of TensorFlow on Spark. The goal of this library is to provide a simple, understandable interface for using TensorFlow on Spark. With SparkFlow, you can easily integrate your deep learning model with an ML Spark Pipeline. Underneath, SparkFlow uses a parameter server to train the TensorFlow network in a distributed manner. Through the API, the user can specify the style of training, whether that is Hogwild or asynchronous with locking. While there are other libraries that use TensorFlow on Apache Spark, SparkFlow's objective is to work seamlessly with ML Pipelines, provide a simple interface for training TensorFlow graphs, and give basic abstractions for faster development. For training, SparkFlow uses a parameter server which lives on the driver and allows for asynchronous training. This tool provides faster training time when using big data.
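A condensed sketch of that pattern, adapted from memory of the project's README; the names build_graph, SparkAsyncDL, and the tensor names x:0/y:0/out:0 are assumptions that should be checked against the sparkflow documentation:

    import tensorflow as tf
    from sparkflow.graph_utils import build_graph
    from sparkflow.tensorflow_async import SparkAsyncDL
    from pyspark.ml.pipeline import Pipeline

    def small_model():
        # TF 1.x-style graph with named placeholders and a named output op
        x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
        y = tf.placeholder(tf.float32, shape=[None, 10], name='y')
        out = tf.layers.dense(x, 10)
        tf.argmax(out, 1, name='out')
        return tf.losses.softmax_cross_entropy(y, out)

    # Serialize the graph so the executors and parameter server can share it
    mg = build_graph(small_model)

    # Spark ML Estimator that trains via the driver-side parameter server
    spark_model = SparkAsyncDL(
        inputCol='features', tensorflowGraph=mg,
        tfInput='x:0', tfLabel='y:0', tfOutput='out:0',
        tfLearningRate=0.001, iters=20, predictionCol='predicted')

    # pipeline = Pipeline(stages=[spark_model]).fit(training_df)  # training_df assumed to exist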

spark-daria - Essential Spark extensions and helper methods ✨😲

  •    Scala

Spark helper methods to maximize developer productivity. Fetch the JAR file from Maven.
