Pandas - Powerful Python Data Analysis Toolkit

  •        1149

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions. It supports aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets, High performance merging and joining of data sets, Time series-functionality, Hierarchical axis indexing and lot more.

http://pandas.pydata.org/
https://github.com/pydata/pandas
http://code.google.com/p/pandas

Tags
Implementation
License
Platform

   




Related Projects

statistical-analysis-python-tutorial - Statistical Data Analysis in Python

  •    HTML

Chris Fonnesbeck is an Assistant Professor in the Department of Biostatistics at the Vanderbilt University School of Medicine. He specializes in computational statistics, Bayesian methods, meta-analysis, and applied decision analysis. He originally hails from Vancouver, BC and received his Ph.D. from the University of Georgia. This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course is comprised of a 2-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data, while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in Numpy, Scipy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to bootstrapping methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

xarray - N-D labeled arrays and datasets in Python

  •    Python

xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures. Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data for which pandas excels. Our approach adopts the Common Data Model for self- describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file.

vaex - Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualize and explore big tabular data at a billion rows per second 🚀

  •    Python

Vaex is a high performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It calculates statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid for more than a billion (10^9) samples/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted). HDF5 and Apache Arrow supported.

100-pandas-puzzles - 100 data puzzles for pandas, ranging from short and simple to super tricky (60% complete)

  •    Jupyter

Inspired by 100 Numpy exerises, here are 100* short puzzles for testing your knowledge of pandas' power. Since pandas is a large library with many different specialist features and functions, these excercises focus mainly on the fundamentals of manipulating data (indexing, grouping, aggregating, cleaning), making use of the core DataFrame and Series objects. Many of the excerises here are straightforward in that the solutions require no more than a few lines of code (in pandas or NumPy - don't go using pure Python!). Choosing the right methods and following best practices is the underlying goal.

Marketstore - DataFrame Server for Financial Timeseries Data

  •    Go

MarketStore is a database server optimized for financial timeseries data. You can think of it as an extensible DataFrame service that is accessible from anywhere in your system, at higher scalability. It is designed from the ground up to address scalability issues around handling large amounts of financial market data used in algorithmic trading backtesting, charting, and analyzing price history with data spanning many years, including tick-level for the all US equities or the exploding crypto currencies space. If you are struggling with managing lots of HDF5 files, this is perfect solution to your problem.


gramm - Gramm is a complete data visualization toolbox for Matlab

  •    Matlab

Gramm is a powerful plotting toolbox which allows to quickly create complex, publication-quality figures in Matlab, and is inspired by R's ggplot2 library by Hadley Wickham. As a reference to this inspiration, gramm stands for GRAMmar of graphics for Matlab. Gramm is a data visualization toolbox for Matlab that allows to produce publication-quality plots from grouped data easily and flexibly. Matlab can be used for complex data analysis using a high-level interface: it supports mixed-type tabular data via tables, provides statistical functions that accept these tables as arguments, and allows users to adopt a split-apply-combine approach (Wickham 2011) with rowfun(). However, the standard plotting functionality in Matlab is mostly low-level, allowing to create axes in figure windows and draw geometric primitives (lines, points, patches) or simple statistical visualizations (histograms, boxplots) from numerical array data. Producing complex plots from grouped data thus requires iterating over the various groups in order to make successive statistical computations and low-level draw calls, all the while handling axis and color generation in order to visually separate data by groups. The corresponding code is often long, not easily reusable, and makes exploring alternative plot designs tedious.

vaex - Lazy Out-of-Core DataFrames for Python, visualize and explore big tabular data at a billion rows per second

  •    Python

Vaex is a python library for Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (109) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).

eland - Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

  •    Python

Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar Pandas-compatible API. Where possible the package uses existing Python APIs and data structures to make it easy to switch between numpy, pandas, scikit-learn to their Elasticsearch powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows Eland to access large datasets stored in Elasticsearch.

EventQL - The database for large-scale event analytics

  •    C++

EventQL is a distributed, column-oriented database built for large-scale event collection and analytics. It runs super-fast SQL and MapReduce queries. Its features include Automatic partitioning, Columnar storage, Standard SQL support, Scales to petabytes, Timeseries and relational data, Fast range scans and lot more.

bcolz - A columnar data container that can be compressed.

  •    C

bcolz provides columnar, chunked data containers that can be compressed either in-memory and on-disk. Column storage allows for efficiently querying tables, as well as for cheap column addition and removal. It is based on NumPy, and uses it as the standard data container to communicate with bcolz objects, but it also comes with support for import/export facilities to/from HDF5/PyTables tables and pandas dataframes. bcolz objects are compressed by default not only for reducing memory/disk storage, but also to improve I/O speed. The compression process is carried out internally by Blosc, a high-performance, multithreaded meta-compressor that is optimized for binary data (although it works with text data just fine too).

tad - A desktop application for viewing and analyzing tabular data

  •    Javascript

Tad is a desktop application for viewing and analyzing tabular data such as CSV files. The easiest way to install Tad is to use a pre-packaged binary release. See The Tad Landing Page for information on the latest release and a download link.

pandapy - PandaPy has the speed of NumPy and the usability of Pandas 10x to 50x faster (by @firmai)

  •    Python

Animated investment research at Sov.ai, sponsoring open source initiatives. PandaPy software, similar to the original Pandas project, is developed to improve the usability of python for finance. Structured datatypes are designed to be able to mimic ‘structs’ in the C language, and share a similar memory layout. PandaPy currently houses more than 30 functions. Structured NumPy are meant for interfacing with C code and for low-level manipulation of structured buffers, for example for interpreting binary blobs. For these purposes they support specialized features such as subarrays, nested datatypes, and unions, and allow control over the memory layout of the structure.

fecon235 - Computational tools for financial economics

  •    Jupyter

This is a free open source project for software tools in financial economics. We develop code for research notebooks which are executable scripts capable of statistical computations, as well as, collection of raw data in real-time. This serves to verify theoretical ideas and practical methods interactively. Economic and financial data, both historical and the most current.

pandapower - Convenient Power System Modelling and Analysis based on PYPOWER and pandas

  •    Python

pandapower is an easy to use network calculation program aimed to automate the analysis and optimization of power systems. It uses the data analysis library pandas and is compatible with the commonly used MATPOWER / PYPOWER case format. pandapower allows using different solvers including an improved Newton-Raphson power flow implementation, all PYPOWER solvers, and the PowerModels.jl library. To get realistic load profile data and grid models across all voltage levels that are ready to be used in pandapower, have a look at the SimBench project website or on GitHub.

pandas - Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data

  •    Python

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal. Binary installers for the latest released version are available at the Python package index and on conda.

Open Door Logistics Studio - Intelligient software for Vechile routing and Territory mangement

  •    Java

Open Door Logistics Studio is an easy-to-use standalone open source desktop application for performing (a) analysis of your customer locations, (b) sales territory design and mapping and (c) vehicle fleet routing & scheduling - all using an Excel spreadsheet. It supports Territory design, Territory mapping, Vehicle routing & scheduling.

mars - Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions

  •    Python

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and many other libraries. More details about installing Mars can be found at installation section in Mars document.

evidently - Interactive reports to analyze machine learning models during validation or production monitoring

  •    Jupyter

Interactive reports and JSON profiles to analyze, monitor and debug machine learning models. Evidently helps evaluate machine learning models during validation and monitor them in production. The tool generates interactive visual reports and JSON profiles from pandas DataFrame or csv files. You can use visual reports for ad hoc analysis, debugging and team sharing, and JSON profiles to integrate Evidently in prediction pipelines or with other visualization tools.






We have large collection of open source products. Follow the tags from Tag Cloud >>


Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.