This project aims at a minimal benchmark for scalability, speed and accuracy of commonly used implementations of a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of limited cardinality, i.e. not very sparse) and no missing data, perhaps the most common problem in business applications (e.g. credit scoring, fraud detection or churn prediction). If the input matrix is of size n x p, n is varied as 10K, 100K, 1M, 10M, while p is ~1K (after expanding the categoricals into dummy variables/one-hot encoding). This particular type of data structure/size (the largest) stems from this author's interest in some particular business applications.

Note: While a large part of this benchmark was done in Spring 2015, reflecting the state of ML implementations at that time, this repo is updated when implementations change significantly or when new implementations become widely available (e.g. lightgbm). Also, please find a summary of the progress and learnings from this benchmark at the end of this repo.
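To make the setup above concrete, here is a minimal sketch (not taken from the benchmark code) of how a few categorical columns of limited cardinality expand into many dummy columns, which is how p grows to ~1K. The data frame and column names below are made up for illustration.

```r
## Illustrative sketch: categorical inputs of limited cardinality expand into
## dummy variables / one-hot columns, inflating p well beyond the raw column count.
set.seed(123)
n <- 1000
d <- data.frame(
  num1 = rnorm(n),                                              # numeric input
  cat1 = factor(sample(paste0("a", 1:20), n, replace = TRUE)),  # categorical, 20 levels
  cat2 = factor(sample(paste0("b", 1:50), n, replace = TRUE)),  # categorical, 50 levels
  y    = sample(0:1, n, replace = TRUE)                         # binary target
)

## model.matrix expands the factor columns into dummy variables (one-hot encoding)
X <- model.matrix(y ~ . - 1, data = d)
dim(X)   # p is now roughly the sum of the categorical cardinalities
```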
Instructions for how to install the necessary software for this tutorial are available here. Data for the tutorial can be downloaded by running ./data/get-data.sh (requires wget). Certain algorithms don't scale well when there are millions of features. For example, decision trees need to compute some metric (to determine the splits) on all the feature values (or on some fraction of the values, as in Random Forest and Stochastic GBM), so computation time is linear in the number of features. Other algorithms, such as GLM, scale much better to high-dimensional (n << p) and wide data with appropriate regularization (e.g. Lasso, Elastic Net, Ridge).
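As a sketch of the regularized GLM approach mentioned above, the snippet below fits an L1-penalized logistic regression with glmnet on a sparse one-hot encoded design matrix. The file name and the target column name are assumptions made for illustration and depend on what ./data/get-data.sh actually produces; adjust them to the real data.

```r
## Minimal sketch, assuming train.csv exists and contains a binary target column
## named dep_delayed_15min with values "Y"/"N" (names assumed here for illustration).
library(readr)
library(Matrix)
library(glmnet)

d_train <- read_csv("train.csv")

## sparse one-hot encoding keeps the design matrix memory-friendly even when p is large
X_train <- sparse.model.matrix(dep_delayed_15min ~ . - 1, data = d_train)
y_train <- as.numeric(d_train$dep_delayed_15min == "Y")

## alpha = 1 is the Lasso, alpha = 0 is Ridge, values in between give the Elastic Net
md <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 1, nfolds = 5)
plot(md)   # cross-validated loss across the regularization path
```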