vaex - Lazy Out-of-Core DataFrames for Python, visualize and explore big tabular data at a billion rows per second

Vaex is a Python library for Out-of-Core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at up to a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory wasted).

https://vaex.io
https://github.com/maartenbreddels/vaex
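
To make the idea of statistics on a grid over memory-mapped data more concrete, here is a minimal sketch using only the Python standard library. It is not Vaex's API or implementation: the file layout (all x values followed by all y values, stored as float64) and the names `binned_mean` and `write_demo_file` are assumptions made for illustration.

```python
import mmap
import os
import tempfile
from array import array


def write_demo_file(path, xs, ys):
    """Write two float64 columns back to back: all x values, then all y values."""
    with open(path, "wb") as f:
        array("d", xs).tofile(f)
        array("d", ys).tofile(f)


def binned_mean(path, n_rows, lo, hi, n_bins):
    """Mean of y per x-bin on a 1-D grid, read through a memory map.

    Hypothetical helper: the file is mapped read-only and never copied
    into a Python list; the OS pages data in as it is touched.
    """
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    width = (hi - lo) / n_bins
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Views must be released before the map is closed, hence the nested "with".
        with memoryview(mm) as raw, raw.cast("d") as values:
            for i in range(n_rows):
                x = values[i]
                y = values[n_rows + i]
                if lo <= x < hi:
                    # Clamp to guard against floating-point rounding at the upper edge.
                    b = min(int((x - lo) / width), n_bins - 1)
                    counts[b] += 1
                    sums[b] += y
    # Empty bins have no mean; report them as None.
    return [s / c if c else None for s, c in zip(sums, counts)]


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "columns.bin")
    xs = [0.5, 1.5, 1.7, 3.2]
    ys = [10.0, 20.0, 40.0, 5.0]
    write_demo_file(demo_path, xs, ys)
    means = binned_mean(demo_path, len(xs), lo=0.0, hi=4.0, n_bins=4)
    print(means)
```

Only the per-bin counts and sums are held in memory, so the working set depends on the grid size rather than on the number of rows.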

Related Projects

Marketstore - DataFrame Server for Financial Timeseries Data

  •    Go

MarketStore is a database server optimized for financial timeseries data. You can think of it as an extensible DataFrame service that is accessible from anywhere in your system, at higher scalability. It is designed from the ground up to address scalability issues around handling large amounts of financial market data used in algorithmic trading backtesting, charting, and analyzing price history with data spanning many years, including tick-level data for all US equities or the exploding cryptocurrency space. If you are struggling with managing lots of HDF5 files, this is the perfect solution to your problem.

statistical-analysis-python-tutorial - Statistical Data Analysis in Python

  •    HTML

Chris Fonnesbeck is an Assistant Professor in the Department of Biostatistics at the Vanderbilt University School of Medicine. He specializes in computational statistics, Bayesian methods, meta-analysis, and applied decision analysis. He originally hails from Vancouver, BC and received his Ph.D. from the University of Georgia. This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course is comprised of a 2-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data, while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in Numpy, Scipy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to bootstrapping methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

gota - Gota: DataFrames and data wrangling in Go (Golang)

  •    Go

This is an implementation of DataFrames, Series and data wrangling methods for the Go programming language. The API is still in flux, so use at your own risk. The term DataFrame typically refers to a tabular dataset that can be viewed as a two-dimensional table. Often the columns of this dataset refer to a list of features, while the rows represent a number of measurements. As data in the real world is not perfect, DataFrame supports missing measurements, or NaN elements.
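
The DataFrame idea described above (columns as features, rows as measurements, NaN for missing values) can be sketched in a few lines. This is a language-neutral illustration written in Python, not Gota's Go API; the `Table` class and its methods are hypothetical names.

```python
import math


class Table:
    """Hypothetical column-oriented table: one list of floats per named column."""

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same number of rows")
        self.columns = {name: list(values) for name, values in columns.items()}

    def row(self, index):
        """Return one measurement (row) as a dict of feature name to value."""
        return {name: values[index] for name, values in self.columns.items()}

    def mean(self, name):
        """Mean of a column, skipping NaN entries that mark missing measurements."""
        present = [v for v in self.columns[name] if not math.isnan(v)]
        return sum(present) / len(present) if present else math.nan


table = Table({
    "height_cm": [170.0, math.nan, 182.0],  # second measurement is missing
    "weight_kg": [65.0, 80.0, 77.0],
})
height_mean = table.mean("height_cm")
second_row = table.row(1)
print(height_mean, second_row)
```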

LArray - Large off-heap arrays for Java/Scala

  •    Scala

A library for managing large off-heap arrays that can hold more than 2G (2^31) entries in Java and Scala. Notably, an LArray can be released explicitly by calling LArray.free, or you can let the GC release the memory automatically. LArray can also be used to create an mmap (memory-mapped file) larger than 2GB.


gramm - Gramm is a complete data visualization toolbox for Matlab

  •    Matlab

Gramm is a powerful plotting toolbox for quickly creating complex, publication-quality figures in Matlab, and is inspired by R's ggplot2 library by Hadley Wickham. As a reference to this inspiration, gramm stands for GRAMmar of graphics for Matlab. Gramm is a data visualization toolbox for Matlab that makes it easy and flexible to produce publication-quality plots from grouped data. Matlab can be used for complex data analysis using a high-level interface: it supports mixed-type tabular data via tables, provides statistical functions that accept these tables as arguments, and allows users to adopt a split-apply-combine approach (Wickham 2011) with rowfun(). However, the standard plotting functionality in Matlab is mostly low-level, allowing users to create axes in figure windows and draw geometric primitives (lines, points, patches) or simple statistical visualizations (histograms, boxplots) from numerical array data. Producing complex plots from grouped data thus requires iterating over the various groups in order to make successive statistical computations and low-level draw calls, all the while handling axis and color generation in order to visually separate data by groups. The corresponding code is often long, not easily reusable, and makes exploring alternative plot designs tedious.

php-export-data - PHP class to export data in CSV, TSV, or Excel XML (aka SpreadsheetML) format to a file or directly to the browser

  •    PHP

A simple library for exporting tabular data to Excel-friendly XML, CSV, or TSV. It supports streaming exported data to a file or directly to the browser as a download so it is suitable for exporting large datasets (you won't run out of memory). See the test/ directory for more examples.

xarray - N-D labeled arrays and datasets in Python

  •    Python

xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures. Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data for which pandas excels. Our approach adopts the Common Data Model for self- describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file.

Luxun - A high-throughput, persistent, distributed, publish-subscribe messaging system based on memo

  •    Java

A high-throughput, persistent, distributed, publish-subscribe messaging system based on memory mapped file and Thrift RPC.

bcolz - A columnar data container that can be compressed.

  •    C

bcolz provides columnar, chunked data containers that can be compressed either in-memory or on-disk. Column storage allows for efficient querying of tables, as well as for cheap column addition and removal. It is based on NumPy, and uses it as the standard data container to communicate with bcolz objects, but it also comes with import/export facilities to/from HDF5/PyTables tables and pandas dataframes. bcolz objects are compressed by default, not only to reduce memory/disk storage but also to improve I/O speed. The compression process is carried out internally by Blosc, a high-performance, multithreaded meta-compressor that is optimized for binary data (although it works with text data just fine too).
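
The following sketch illustrates the general idea of a chunked, compressed column. It is not bcolz's API, and it substitutes the standard library's `zlib` for Blosc; the `CompressedColumn` class and its `chunk_len` parameter are hypothetical.

```python
import zlib
from array import array


class CompressedColumn:
    """Hypothetical chunked column of float64 values, each full chunk zlib-compressed."""

    def __init__(self, chunk_len=4):
        self.chunk_len = chunk_len
        self.chunks = []        # compressed bytes, one entry per full chunk
        self.tail = array("d")  # uncompressed values that do not yet fill a chunk

    def append(self, value):
        self.tail.append(value)
        if len(self.tail) == self.chunk_len:
            # Seal the chunk: compress it and start a fresh tail.
            self.chunks.append(zlib.compress(self.tail.tobytes()))
            self.tail = array("d")

    def __len__(self):
        return len(self.chunks) * self.chunk_len + len(self.tail)

    def __iter__(self):
        # Decompress one chunk at a time, so only one chunk is expanded in memory.
        for blob in self.chunks:
            chunk = array("d")
            chunk.frombytes(zlib.decompress(blob))
            yield from chunk
        yield from self.tail


column = CompressedColumn(chunk_len=4)
for i in range(10):
    column.append(float(i))
total = sum(column)
print(len(column), len(column.chunks), total)
```

Because each chunk is compressed independently, a scan can decompress chunks one by one instead of expanding the whole column at once.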

Mobius - C# and F# language binding and extensions to Apache Spark

  •    CSharp

Mobius provides a C# language binding to Apache Spark, enabling the implementation of Spark driver programs and data processing operations in the languages supported by the .NET framework, like C# or F#. For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.

meza - A Python toolkit for processing tabular data

  •    Python

meza is a Python library for reading and processing tabular data. It has a functional programming style API, excels at reading/writing large files, and can process 10+ file types. meza has been tested and is known to work on Python 2.7, 3.5, and 3.6; PyPy2 5.8.0, and PyPy3 5.8.0.

LMDB - Lightning Memory-mapped Database

  •    C

LMDB is an extraordinarily fast, memory-efficient database developed for the Symas OpenLDAP Project. With memory-mapped files, it has the read performance of a pure in-memory database while retaining the persistence of standard disk-based databases. With only 32KB of object code, LMDB may seem tiny. But it’s the right 32KB. Compact and efficient are two sides of a coin; that’s part of what makes LMDB so powerful.

pmem - persistent memory programming

  •    C

Persistent memory (or pmem for short) is accessed like volatile memory, using processor load and store instructions, but it retains its contents across power loss like storage. This project focuses specifically on how persistent memory is exposed to server-class applications which will explicitly manage the placement of data among the three tiers (volatile memory, persistent memory, and storage).

sharedhashfile - Share Hash Tables Stored In Memory Mapped Files Between Arbitrary Processes & Threads

  •    C

SharedHashFile is a lightweight NoSQL key value store / hash table, a zero-copy IPC queue, & a multiplexed IPC logging library written in C for Linux. There is no server process. Data is read and written directly from/to shared memory or SSD; no sockets are used between SharedHashFile and the application program. APIs for C, C++, & nodejs. Data is kept in shared memory by default, making all the data accessible to separate processes and/or threads. Up to 4 billion keys can be stored in a single SharedHashFile hash table which is limited in size only by available RAM.

Data Frame Loader

A simple C# API for loading tabular dataframes into a Microsoft SQL Server database using only a small number of tables to represent any kind of dataframe.

Capsule - The Capsule Hash Trie Collections Library

  •    Java

Capsule aims to become a full-fledged (immutable) collections library for Java 8+ that is solely built around persistent tries. The library is designed for standalone use and for being embedded in domain-specific languages. Capsule still has to undergo some incubation before it can ship as a well-rounded collection library. Nevertheless, the code is stable and performance is solid.

daru - Data Analysis in RUby

  •    Ruby

daru (Data Analysis in RUby) is a library for storage, analysis, manipulation and visualization of data in Ruby. daru makes it easy and intuitive to process data, predominantly through 2 data structures: Daru::DataFrame and Daru::Vector. Written in pure Ruby, it works with all Ruby implementations. Tested with MRI 2.0, 2.1, 2.2, 2.3, and 2.4.

space-radar - Disk And Memory Space Visualization App built with Electron & d3.js

  •    Javascript

SpaceRadar allows interactive visualization of disk space and memory. It currently supports Sunburst, Treemap, and Flamegraph charts. Compressed files can be read directly. To detect them, the file name has to end with .gz.

librgr

  •    C

API to access rgr (regular grid road) data files for 3D road surface descriptions used in advanced tire simulation models. Provides memory mapped data file access to keep RAM footprint low. Tracking of active surface area for visualisation etc.