Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, along with statistical functions. It supports aggregating or transforming data with a powerful group-by engine that allows split-apply-combine operations on data sets, high-performance merging and joining of data sets, time series functionality, hierarchical axis indexing, and a lot more.
http://pandas.pydata.org/
Tags | data-analysis data econometrics models numpy statistics tables tabular timeseries |
Implementation | Python |
License | BSD |
Platform | Windows Linux |
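The split-apply-combine and merge features described for pandas above can be sketched roughly as follows. This is a minimal illustration, not taken from the project's documentation; the `sales` and `managers` tables and all of their values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "amount": [10.0, 20.0, 5.0, 15.0],
})
managers = pd.DataFrame({
    "region": ["north", "south"],
    "manager": ["Ana", "Ben"],
})

# Split by region, apply a sum to each group, combine into one result.
totals = sales.groupby("region", as_index=False)["amount"].sum()

# Transform: each row's share of its region total, aligned to the original rows.
sales["share"] = sales["amount"] / sales.groupby("region")["amount"].transform("sum")

# Join the aggregated totals with a second table on the shared key.
report = totals.merge(managers, on="region", how="left")
print(report)
```

Aggregation (`sum`) returns one row per group, while `transform` returns a result with the same length as the input, which is why it can be assigned back as a new column.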
Chris Fonnesbeck is an Assistant Professor in the Department of Biostatistics at the Vanderbilt University School of Medicine. He specializes in computational statistics, Bayesian methods, meta-analysis, and applied decision analysis. He originally hails from Vancouver, BC and received his Ph.D. from the University of Georgia.
This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course consists of a two-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in NumPy, SciPy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to bootstrapping methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.
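As a rough idea of what the bootstrapping portion of such a tutorial might cover, here is a minimal sketch of a percentile bootstrap for the mean using NumPy. It is not taken from the tutorial materials; the `sample` values, the resample count, and the 95% level are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, so the sketch is repeatable

# Hypothetical sample, invented for illustration.
sample = np.array([2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7])

n_resamples = 2000
# Draw indices with replacement: one row of indices per bootstrap resample.
idx = rng.integers(0, sample.size, size=(n_resamples, sample.size))
boot_means = sample[idx].mean(axis=1)

# Percentile bootstrap: the middle 95% of the resampled means.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={sample.mean():.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Resampling with replacement approximates the sampling distribution of the statistic without assuming a particular probability distribution for the data.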
xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures. Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data at which pandas excels. Our approach adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file.
scientific-computing netcdf numpy data-science pandas dataframes data-analysis pydata
Vaex is a high performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It calculates statistics such as mean, sum, count, and standard deviation on an N-dimensional grid for more than a billion (10^9) samples/rows per second. Visualization is done using histograms, density plots and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy and lazy computations for best performance (no memory wasted). HDF5 and Apache Arrow are supported.
visualization machine-learning bigdata tabular-data hdf5 machinelearning dataframe memory-mapped-file
Inspired by 100 Numpy exercises, here are 100* short puzzles for testing your knowledge of pandas' power. Since pandas is a large library with many different specialist features and functions, these exercises focus mainly on the fundamentals of manipulating data (indexing, grouping, aggregating, cleaning), making use of the core DataFrame and Series objects. Many of the exercises here are straightforward in that the solutions require no more than a few lines of code (in pandas or NumPy - don't go using pure Python!). Choosing the right methods and following best practices is the underlying goal.
pandas numpy data-analysis
MarketStore is a database server optimized for financial timeseries data. You can think of it as an extensible DataFrame service that is accessible from anywhere in your system, at higher scalability. It is designed from the ground up to address scalability issues around handling large amounts of financial market data used in algorithmic trading backtesting, charting, and analyzing price history with data spanning many years, including tick-level data for all US equities or the exploding cryptocurrency space. If you are struggling with managing lots of HDF5 files, this is the perfect solution to your problem.
marketstore financial-analysis pandas-dataframe trading database timeseries timeseries-database cryptocurrency gdax
Gramm is a powerful plotting toolbox for quickly creating complex, publication-quality figures in Matlab, inspired by R's ggplot2 library by Hadley Wickham. As a reference to this inspiration, gramm stands for GRAMmar of graphics for Matlab. Gramm is a data visualization toolbox for Matlab that makes it easy and flexible to produce publication-quality plots from grouped data. Matlab can be used for complex data analysis using a high-level interface: it supports mixed-type tabular data via tables, provides statistical functions that accept these tables as arguments, and allows users to adopt a split-apply-combine approach (Wickham 2011) with rowfun(). However, the standard plotting functionality in Matlab is mostly low-level, letting users create axes in figure windows and draw geometric primitives (lines, points, patches) or simple statistical visualizations (histograms, boxplots) from numerical array data. Producing complex plots from grouped data thus requires iterating over the various groups in order to make successive statistical computations and low-level draw calls, all the while handling axis and color generation in order to visually separate data by groups. The corresponding code is often long, not easily reusable, and makes exploring alternative plot designs tedious.
matlab visualization stats plot data-visualization statistics
Vaex is a Python library for Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at up to a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy and lazy computations for best performance (no memory wasted).
dataframe bigdata tabular-data visualization memory-mapped-file hdf5
Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar Pandas-compatible API. Where possible the package uses existing Python APIs and data structures to make it easy to switch from numpy, pandas, and scikit-learn to their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows Eland to access large datasets stored in Elasticsearch.
elasticsearch machine-learning big-data etl scikit-learn pandas lightgbm data-analysis dataframe dataframes time-series-forecasting eland
EventQL is a distributed, column-oriented database built for large-scale event collection and analytics. It runs super-fast SQL and MapReduce queries. Its features include automatic partitioning, columnar storage, standard SQL support, scaling to petabytes, timeseries and relational data, fast range scans, and a lot more.
database columnar-database columnar-storage timeseries streaming distributed-database distributed analytics column-store
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. With Miller, you get to use named fields without needing to count positional indices, using familiar formats such as CSV, TSV, JSON, and positionally-indexed.
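The difference between named fields and positional indices can be illustrated with a small sketch. Note that this uses Python's standard `csv` module as an analogy and is not Miller's own syntax; the CSV content is hypothetical.

```python
import csv
import io

# Hypothetical CSV content, invented for illustration.
text = "name,city,score\nana,Lima,7\nben,Oslo,9\n"

# Positional access: the reader yields lists, so columns are counted by hand.
rows = list(csv.reader(io.StringIO(text)))
scores_by_position = [int(r[2]) for r in rows[1:]]

# Named access: each record is a dict keyed by the header row.
records = list(csv.DictReader(io.StringIO(text)))
scores_by_name = [int(rec["score"]) for rec in records]

print(scores_by_position, scores_by_name)
```

The named version keeps working if columns are reordered or new ones are added, whereas the positional version silently reads the wrong column.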
data-processing data-cleaning csv csv-files csv-format csv-reader streaming-data streaming-algorithms tsv json json-data data-reduction data-regression statistics statistical-analysis devops devops-tools tabular-data command-line command-line-tools
bcolz provides columnar, chunked data containers that can be compressed either in memory or on disk. Column storage allows for efficiently querying tables, as well as for cheap column addition and removal. It is based on NumPy, and uses it as the standard data container to communicate with bcolz objects, but it also comes with support for import/export facilities to/from HDF5/PyTables tables and pandas dataframes. bcolz objects are compressed by default not only for reducing memory/disk storage, but also to improve I/O speed. The compression process is carried out internally by Blosc, a high-performance, multithreaded meta-compressor that is optimized for binary data (although it works with text data just fine too).
column-store compressed-data
Tad is a desktop application for viewing and analyzing tabular data such as CSV files. The easiest way to install Tad is to use a pre-packaged binary release. See The Tad Landing Page for information on the latest release and a download link.
desktop-application pivots tabular-data csv pivot-tables data-science data-analysis
Animated investment research at Sov.ai, sponsoring open source initiatives. PandaPy software, similar to the original Pandas project, is developed to improve the usability of Python for finance. Structured datatypes are designed to be able to mimic ‘structs’ in the C language, and share a similar memory layout. PandaPy currently houses more than 30 functions. Structured NumPy arrays are meant for interfacing with C code and for low-level manipulation of structured buffers, for example for interpreting binary blobs. For these purposes they support specialized features such as subarrays, nested datatypes, and unions, and allow control over the memory layout of the structure.
finance data-science machine-learning numpy pandas data-structures arrays structured-data algorithmic-trading
This is a free open source project for software tools in financial economics. We develop code for research notebooks, which are executable scripts capable of statistical computations as well as the collection of raw data in real time. This serves to verify theoretical ideas and practical methods interactively. Economic and financial data, both historical and the most current.
jupyter-notebook pandas federal-reserve gdp inflation income housing equities bonds fx gold time-series econometrics statistics asset-pricing finance interest-rates economics employment
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ. Implemented in TypeScript, used in JavaScript ES5+ or TypeScript.
data-wrangling data-forge data data-analysis nodejs pandas visualization data-visualization data-management data-manipulation data-munging data-cleaning data-cleansing csv json data-science data-clensing
pandapower is an easy-to-use network calculation program that aims to automate the analysis and optimization of power systems. It uses the data analysis library pandas and is compatible with the commonly used MATPOWER / PYPOWER case format. pandapower allows using different solvers including an improved Newton-Raphson power flow implementation, all PYPOWER solvers, and the PowerModels.jl library. To get realistic load profile data and grid models across all voltage levels that are ready to be used in pandapower, have a look at the SimBench project website or at the project on GitHub.
system analysis optimization power state-estimation powerflow short-circuit loadflow
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal. Binary installers for the latest released version are available at the Python package index and on conda.
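One way labeled data makes analysis more intuitive is that arithmetic matches values by label and not by position. The following is a minimal sketch of that behavior; the Series `a` and `b` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical values, invented for illustration.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "x", "w"])

# Arithmetic matches values by label, not by position.
total = a + b
print(total)
```

Labels present in only one of the two Series ("y" and "w" here) produce missing values in the result; they are not dropped and not paired with an unrelated entry.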
data-analysis pandas flexible alignment
Open Door Logistics Studio is an easy-to-use standalone open source desktop application for performing (a) analysis of your customer locations, (b) sales territory design and mapping and (c) vehicle fleet routing & scheduling - all using an Excel spreadsheet. It supports Territory design, Territory mapping, Vehicle routing & scheduling.
logistics territory-design territory-management territory-mapping
Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and many other libraries. More details about installing Mars can be found in the installation section of the Mars documentation.
machine-learning tensorflow numpy scikit-learn pandas pytorch xgboost lightgbm tensor dask ray dataframe statsmodels joblib
Interactive reports and JSON profiles to analyze, monitor and debug machine learning models. Evidently helps evaluate machine learning models during validation and monitor them in production. The tool generates interactive visual reports and JSON profiles from pandas DataFrame or csv files. You can use visual reports for ad hoc analysis, debugging and team sharing, and JSON profiles to integrate Evidently in prediction pipelines or with other visualization tools.
data-science machine-learning pandas-dataframe jupyter-notebook html-report production-machine-learning mlops model-monitoring machine-learning-operations data-drift