Pandas - Powerful Python Data Analysis Toolkit

  •        1110

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions. It supports aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets, High performance merging and joining of data sets, Time series-functionality, Hierarchical axis indexing and lot more.



Related Projects

statistical-analysis-python-tutorial - Statistical Data Analysis in Python

  •    HTML

Chris Fonnesbeck is an Assistant Professor in the Department of Biostatistics at the Vanderbilt University School of Medicine. He specializes in computational statistics, Bayesian methods, meta-analysis, and applied decision analysis. He originally hails from Vancouver, BC and received his Ph.D. from the University of Georgia. This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course is comprised of a 2-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data, while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in Numpy, Scipy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to bootstrapping methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

xarray - N-D labeled arrays and datasets in Python

  •    Python

xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures. Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data for which pandas excels. Our approach adopts the Common Data Model for self- describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file.

100-pandas-puzzles - 100 data puzzles for pandas, ranging from short and simple to super tricky (60% complete)

  •    Jupyter

Inspired by 100 Numpy exerises, here are 100* short puzzles for testing your knowledge of pandas' power. Since pandas is a large library with many different specialist features and functions, these excercises focus mainly on the fundamentals of manipulating data (indexing, grouping, aggregating, cleaning), making use of the core DataFrame and Series objects. Many of the excerises here are straightforward in that the solutions require no more than a few lines of code (in pandas or NumPy - don't go using pure Python!). Choosing the right methods and following best practices is the underlying goal.

Marketstore - DataFrame Server for Financial Timeseries Data

  •    Go

MarketStore is a database server optimized for financial timeseries data. You can think of it as an extensible DataFrame service that is accessible from anywhere in your system, at higher scalability. It is designed from the ground up to address scalability issues around handling large amounts of financial market data used in algorithmic trading backtesting, charting, and analyzing price history with data spanning many years, including tick-level for the all US equities or the exploding crypto currencies space. If you are struggling with managing lots of HDF5 files, this is perfect solution to your problem.

gramm - Gramm is a complete data visualization toolbox for Matlab

  •    Matlab

Gramm is a powerful plotting toolbox which allows to quickly create complex, publication-quality figures in Matlab, and is inspired by R's ggplot2 library by Hadley Wickham. As a reference to this inspiration, gramm stands for GRAMmar of graphics for Matlab. Gramm is a data visualization toolbox for Matlab that allows to produce publication-quality plots from grouped data easily and flexibly. Matlab can be used for complex data analysis using a high-level interface: it supports mixed-type tabular data via tables, provides statistical functions that accept these tables as arguments, and allows users to adopt a split-apply-combine approach (Wickham 2011) with rowfun(). However, the standard plotting functionality in Matlab is mostly low-level, allowing to create axes in figure windows and draw geometric primitives (lines, points, patches) or simple statistical visualizations (histograms, boxplots) from numerical array data. Producing complex plots from grouped data thus requires iterating over the various groups in order to make successive statistical computations and low-level draw calls, all the while handling axis and color generation in order to visually separate data by groups. The corresponding code is often long, not easily reusable, and makes exploring alternative plot designs tedious.

vaex - Lazy Out-of-Core DataFrames for Python, visualize and explore big tabular data at a billion rows per second

  •    Python

Vaex is a python library for Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (109) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).

EventQL - The database for large-scale event analytics

  •    C++

EventQL is a distributed, column-oriented database built for large-scale event collection and analytics. It runs super-fast SQL and MapReduce queries. Its features include Automatic partitioning, Columnar storage, Standard SQL support, Scales to petabytes, Timeseries and relational data, Fast range scans and lot more.

bcolz - A columnar data container that can be compressed.

  •    C

bcolz provides columnar, chunked data containers that can be compressed either in-memory and on-disk. Column storage allows for efficiently querying tables, as well as for cheap column addition and removal. It is based on NumPy, and uses it as the standard data container to communicate with bcolz objects, but it also comes with support for import/export facilities to/from HDF5/PyTables tables and pandas dataframes. bcolz objects are compressed by default not only for reducing memory/disk storage, but also to improve I/O speed. The compression process is carried out internally by Blosc, a high-performance, multithreaded meta-compressor that is optimized for binary data (although it works with text data just fine too).

tad - A desktop application for viewing and analyzing tabular data

  •    Javascript

Tad is a desktop application for viewing and analyzing tabular data such as CSV files. The easiest way to install Tad is to use a pre-packaged binary release. See The Tad Landing Page for information on the latest release and a download link.

fecon235 - Computational tools for financial economics

  •    Jupyter

This is a free open source project for software tools in financial economics. We develop code for research notebooks which are executable scripts capable of statistical computations, as well as, collection of raw data in real-time. This serves to verify theoretical ideas and practical methods interactively. Economic and financial data, both historical and the most current.

pandas - Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data

  •    Python

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal. Binary installers for the latest released version are available at the Python package index and on conda.

Open Door Logistics Studio - Intelligient software for Vechile routing and Territory mangement

  •    Java

Open Door Logistics Studio is an easy-to-use standalone open source desktop application for performing (a) analysis of your customer locations, (b) sales territory design and mapping and (c) vehicle fleet routing & scheduling - all using an Excel spreadsheet. It supports Territory design, Territory mapping, Vehicle routing & scheduling.

blaze - NumPy and Pandas interface to Big Data

  •    Python

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar interface to query data living in other data storage systems. We point blaze to a simple dataset in a foreign database (PostgreSQL). Instantly we see results as we would see them in a Pandas DataFrame.

meza - A Python toolkit for processing tabular data

  •    Python

meza is a Python library for reading and processing tabular data. It has a functional programming style API, excels at reading/writing large files, and can process 10+ file types. meza has been tested and is known to work on Python 2.7, 3.5, and 3.6; PyPy2 5.8.0, and PyPy3 5.8.0.

pycm - Multi-class confusion matrix library in Python

  •    Python

PyCM is a multi-class confusion matrix library written in Python that supports both input data vectors and direct matrix, and a proper tool for post-classification model evaluation that supports most classes and overall statistics parameters. PyCM is the swiss-army knife of confusion matrices, targeted mainly at data scientists that need a broad array of metrics for predictive models and an accurate evaluation of large variety of classifiers. threshold is added in version 0.9 for real value prediction.

pandas-videos - Jupyter notebook and datasets from the pandas Q&A video series

  •    Jupyter

Read about the series, and view all of the videos on one page: Easier data analysis in Python with pandas.