prince - :crown: Python factor analysis library (PCA, CA, MCA, FAMD)


Prince uses pandas to manipulate dataframes, so it expects an initial dataframe to work with. In the following example, a Principal Component Analysis (PCA) is applied to the iris dataset. Under the hood, Prince decomposes the dataframe into two eigenvector matrices and one eigenvalue array by means of a Singular Value Decomposition (SVD). The eigenvectors can then be used to project the initial dataset onto lower dimensions. The first plot displays the rows of the initial dataset projected onto the first two right eigenvectors (the resulting projections are called principal coordinates). The ellipses are 90% confidence intervals.
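The decomposition described above can be sketched with plain NumPy. This is only an illustration of the SVD-based idea, not Prince's own API or implementation; the synthetic data stands in for a numeric dataframe (it is not the iris dataset), and all variable names are assumptions.

```python
import numpy as np

# Synthetic stand-in for a numeric dataframe: 150 rows, 4 correlated columns.
# (Assumption for illustration only; this is not the iris dataset.)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))

# Centre and scale each column so that every variable has unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD yields two eigenvector matrices (U, Vt) and one array of singular values (s).
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Eigenvalues of the correlation matrix and the share of variance each explains.
eigenvalues = s**2 / Z.shape[0]
explained = eigenvalues / eigenvalues.sum()

# Principal coordinates: rows projected onto the first two right eigenvectors.
row_coords = Z @ Vt[:2].T

print(explained.round(3))
print(row_coords[:5])
```

Projecting onto the right eigenvectors is equivalent to scaling the left eigenvectors by the singular values, which is why a single SVD is enough to obtain the row coordinates.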



Related Projects

statistical-analysis-python-tutorial - Statistical Data Analysis in Python

Chris Fonnesbeck is an Assistant Professor in the Department of Biostatistics at the Vanderbilt University School of Medicine. He specializes in computational statistics, Bayesian methods, meta-analysis, and applied decision analysis. He originally hails from Vancouver, BC and received his Ph.D. from the University of Georgia. This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course is comprised of a 2-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data, while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in Numpy, Scipy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to bootstrapping methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

alphalens - Performance analysis of predictive (alpha) stock factors

Alphalens is a Python library for performance analysis of predictive (alpha) stock factors. Alphalens works great with the Zipline open source backtesting library, and Pyfolio which provides performance and risk analysis of financial portfolios. Check out the example notebooks for more on how to read and use the factor tear sheet.

xarray - N-D labeled arrays and datasets in Python

xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures. Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data at which pandas excels. Our approach adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file.

pandas-datareader - Extract data from a wide range of Internet sources into a pandas DataFrame.

Up to date remote data access for pandas, works for multiple versions of pandas. As of v0.6.0 Yahoo!, Google Options, Google Quotes and EDGAR have been immediately deprecated due to large changes in their API and no stable replacement.

pandas-cookbook - Recipes for using Python's pandas library

pandas is a Python library for doing data analysis. It's really fast and lets you do exploratory work incredibly quickly. The goal of this cookbook is to give you some concrete examples for getting started with pandas. The docs are really comprehensive. However, I've often had people tell me that they have some trouble getting started, so these are examples with real-world data, and all the bugs and weirdness that entails.

pca-magic - PCA that iteratively replaces missing data

An implementation of probabilistic principal components analysis, which is a variant of vanilla PCA that can be used to

holoviews - Stop plotting your data - annotate your data and let it visualize itself.

Stop plotting your data - annotate your data and let it visualize itself. HoloViews is an open-source Python library designed to make data analysis and visualization seamless and simple. With HoloViews, you can usually express what you want to do in very few lines of code, letting you focus on what you are trying to explore and convey, not on the process of plotting.

Modular toolkit for Data Processing MDP

The Modular toolkit for Data Processing (MDP) is a Python data processing framework. From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures. From the scientific developer's perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The new i

Zipline - A Pythonic Algorithmic Trading Library

Zipline is a Pythonic algorithmic trading library. It is an event-driven system that supports both backtesting and live-trading. Zipline is currently used in production as the backtesting and live-trading engine powering Quantopian -- a free, community-centered, hosted platform for building and executing trading strategies. Note: Installing Zipline via pip is slightly more involved than the average Python package. Simply running pip install zipline will likely fail if you've never installed any scientific Python packages before.

django-rest-pandas - 📊📈 Serves up Pandas dataframes via the Django REST Framework for use in client-side (i

Django REST Pandas (DRP) provides a simple way to generate and serve pandas DataFrames via the Django REST Framework. The resulting API can serve up CSV (and a number of other formats) for consumption by a client-side visualization tool like d3.js. The design philosophy of DRP enforces a strict separation between data and presentation. This keeps the implementation simple, but also has the nice side effect of making it trivial to provide the source data for your visualizations. This capability can often be leveraged by sending users to the same URL that your visualization code uses internally to load the data.

Face Recognition

Dear friends, this project deals with the application of AI to face recognition by applying the concepts of Independent Component Analysis (a generalised form of PCA), based upon the algorithm developed in 2001.


Simple console program intended for factor or principal components analysis. It calculates the optimal number of factors using Horn's parallel analysis, and computes the Kaiser-Meyer-Olkin and a few other measures of sampling adequacy.
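Horn's parallel analysis keeps the leading components whose eigenvalues exceed those obtained from random data of the same shape. The following is a minimal NumPy sketch of that idea, not the console program's own code; the function name `parallel_analysis`, its parameters, the percentile threshold, and the synthetic two-factor data are all assumptions made for illustration.

```python
import numpy as np

def parallel_analysis(data, n_iter=200, percentile=95, seed=0):
    """Hypothetical helper: estimate the number of factors to retain."""
    rng = np.random.default_rng(seed)
    n, p = data.shape

    # Eigenvalues of the observed correlation matrix, largest first.
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

    # Eigenvalues of correlation matrices built from pure noise of the same shape.
    simulated = np.empty((n_iter, p))
    for i in range(n_iter):
        noise = rng.standard_normal((n, p))
        simulated[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]

    # Per-position threshold taken from the simulated eigenvalue distribution.
    threshold = np.percentile(simulated, percentile, axis=0)

    # Count the leading components whose eigenvalue beats the noise threshold.
    exceeds = observed > threshold
    n_factors = p if exceeds.all() else int(np.argmin(exceeds))
    return n_factors, observed, threshold

# Synthetic data with two latent factors, each driving three of six variables.
rng = np.random.default_rng(42)
factors = rng.standard_normal((500, 2))
loadings = np.zeros((2, 6))
loadings[0, :3] = 0.9
loadings[1, 3:] = 0.9
data = factors @ loadings + 0.3 * rng.standard_normal((500, 6))

n_factors, observed, threshold = parallel_analysis(data)
print(n_factors)
print(observed.round(2))
print(threshold.round(2))
```

Some descriptions of the method compare against the mean of the simulated eigenvalues rather than an upper percentile; the `percentile` argument here is one common variant.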


Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

datacleaner - A Python tool that automatically cleans data sets and readies them for analysis.

A Python tool that automatically cleans data sets and readies them for analysis. datacleaner works with data in pandas DataFrames.

Marketstore - DataFrame Server for Financial Timeseries Data

MarketStore is a database server optimized for financial timeseries data. You can think of it as an extensible DataFrame service that is accessible from anywhere in your system, at higher scalability. It is designed from the ground up to address scalability issues around handling large amounts of financial market data used in algorithmic trading backtesting, charting, and analyzing price history with data spanning many years, including tick-level data for all US equities or the exploding cryptocurrency space. If you are struggling with managing lots of HDF5 files, this is the perfect solution to your problem.

538model - 538 Election Forecasting Model

This is a Python script that replicates some features of Nate Silver's 538 Election Forecasting Model. It was constructed from reading the methodology posts on the old site and the new one at the New York Times. This is my interpretation of these posts. Any and all errors are, of course, mine. Furthermore, this code should be considered as more of an example of how to conduct data analysis in Python using pandas and statsmodels rather than a "real" model. You can consider it a starting point for doing more complex analyses with Python rather than a real forecasting model. Or better yet, consider a fun way to learn some Python data tricks. The polling data is up to date as of 10/2/2012. It is all publicly available from Real Clear Politics. For some reason Real Clear Politics stopped allowing directory access to their servers, so if you want to update the polling data, you'll have to update the script to walk the links on their site or do it by hand. This should be trivial, I just don't have the time. Historical polling data was obtained from Electoral Vote.

kitabu - A framework for creating e-books from Markdown using Ruby

Kitabu is a framework for creating e-books from Markdown using Ruby. Using Prince PDF generator, you'll be able to get high quality PDFs. Also supports EPUB, Mobi, Text and HTML generation. While Prince is too expensive (495 USD for a single user license), the free version available at generates a PDF with a small logo on the first page, which is removed when sent to a printer; you can use it locally for viewing the results immediately. When you're done writing your e-book, you can use DocRaptor, which has plans starting at $15/mo.

ca-bundle - Lets you find a path to the system CA bundle, and includes a fallback to the Mozilla CA bundle

Small utility library that lets you find a path to the system CA bundle, and includes a fallback to the Mozilla CA bundle. Originally written as part of composer/composer, now extracted and made available as a stand-alone library.

ERP PCA Toolkit

A Matlab toolkit for analyzing ERP datasets, especially PCA. If you run into a problem, please send me a note and I'll fix it. The tutorial is in the documentation folder and the tutorial data is a separate download (tutorial

boulder - An ACME-based CA, written in Go.

This is an implementation of an ACME-based CA. The ACME protocol allows the CA to automatically verify that an applicant for a certificate actually controls an identifier, and allows domain holders to issue and revoke certificates for their domains. Boulder has a Dockerfile to make it easy to install and set up all its dependencies. This is how the maintainers work on Boulder, and is our main recommended way to run it.