tpot - A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming

  •    Python

Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
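TPOT's real API wraps all of this behind a scikit-learn-style estimator; the stdlib-only toy below (every name is hypothetical, and the scoring function is a stand-in for cross-validated accuracy) only sketches the underlying idea of evolving a population of candidate pipelines by selection and mutation.

```python
import random

# Toy search space: each "pipeline" is (scaler, model, max_depth).
SCALERS = ["none", "standard", "minmax"]
MODELS = ["logistic", "tree", "knn"]

def score(pipeline):
    # Stand-in for cross-validated accuracy; a real tool would fit and
    # evaluate each candidate pipeline on the data.
    scaler, model, depth = pipeline
    return (scaler == "standard") + (model == "tree") + 1.0 / (1 + abs(depth - 5))

def mutate(pipeline):
    # Change exactly one component of the pipeline at random.
    scaler, model, depth = pipeline
    choice = random.randrange(3)
    if choice == 0:
        scaler = random.choice(SCALERS)
    elif choice == 1:
        model = random.choice(MODELS)
    else:
        depth = max(1, depth + random.choice([-1, 1]))
    return (scaler, model, depth)

def evolve(generations=50, pop_size=20, seed=0):
    random.seed(seed)
    pop = [(random.choice(SCALERS), random.choice(MODELS), random.randrange(1, 10))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[:pop_size // 2]                 # selection (elitist)
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=score)

best = evolve()
```

The same loop, with real cross-validation as the fitness function plus crossover between pipelines, is the essence of the genetic-programming search TPOT performs.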

xgboost - Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more

  •    C++

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples. XGBoost has been developed and used by a group of active community members, and your help is very valuable to make the package better for everyone.
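The core GBDT idea that XGBoost implements at scale can be sketched in a few lines of plain Python: fit a sequence of weak learners, each to the residuals left by the ensemble so far, and add a shrunken copy of each to the model. This toy handles one numeric feature with squared error; it is not XGBoost's actual algorithm (no regularization, no second-order gradients, no parallelism):

```python
# Toy gradient boosting for squared error: each round fits a depth-1
# regression "stump" to the current residuals, then adds a shrunken
# copy of it to the ensemble.

def fit_stump(xs, residuals):
    # Try every midpoint between consecutive (sorted) points as a split;
    # predict the mean residual on each side; keep the lowest-SSE split.
    best = None
    for i in range(1, len(xs)):
        thr = (xs[i - 1] + xs[i]) / 2
        left = [r for x, r in zip(xs, residuals) if x < thr]
        right = [r for x, r in zip(xs, residuals) if x >= thr]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x < thr else rmean

def boost(xs, ys, rounds=50, lr=0.3):
    stumps = []
    predict = lambda x: sum(lr * s(x) for s in stumps)
    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]   # a step function
model = boost(xs, ys)      # model(5) converges toward 1, model(0) stays 0
```

Each boosting round shrinks the remaining residual by the learning rate, which is why the ensemble converges to the step function geometrically.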

benchm-ml - A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.)

  •    R

This project aims at a minimal benchmark for scalability, speed and accuracy of commonly used implementations of a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of limited cardinality, i.e. not very sparse) and no missing data, perhaps the most common problem in business applications (e.g. credit scoring, fraud detection or churn prediction). If the input matrix is n x p, n is varied as 10K, 100K, 1M, 10M, while p is ~1K (after expanding the categoricals into dummy variables/one-hot encoding). This particular type of data structure/size (the largest) stems from the author's interest in some particular business applications. Note: while a large part of this benchmark was done in Spring 2015, reflecting the state of ML implementations at that time, this repo is updated when there are significant changes in implementations or when new implementations become widely available (e.g. lightgbm). Also, please find a summary of the progress and learnings from this benchmark at the end of the repo.
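The "expanding the categoricals into dummy variables/one-hot encoding" step mentioned above is what pushes p toward ~1K: a categorical column of cardinality k becomes k indicator columns. A minimal stdlib sketch:

```python
# One-hot ("dummy variable") expansion of a categorical column:
# each distinct category becomes its own 0/1 indicator column.

def one_hot(values):
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1       # exactly one indicator is set per row
        rows.append(row)
    return categories, rows

cats, matrix = one_hot(["red", "green", "red", "blue"])
# cats   -> ['blue', 'green', 'red']
# matrix -> [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

In practice pandas' `get_dummies` or scikit-learn's `OneHotEncoder` do this (usually producing sparse output for high-cardinality columns), but the column-count arithmetic is the same.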

auto_ml - Automated machine learning for analytics & production

  •    Python

auto_ml is designed for production. The repo includes an example that serializes and loads the trained model, then gets predictions on single dictionaries, roughly the process you'd likely follow to deploy a trained model. These projects are production-ready: they have prediction times in the 1-millisecond range for a single prediction, and can be serialized to disk and loaded into a new environment after training.
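auto_ml's own API differs; the stdlib sketch below (all names hypothetical, with a deliberately trivial stand-in for a trained model) only illustrates the deployment flow the paragraph describes: train, serialize, load in the serving environment, then predict on a single dictionary.

```python
import pickle

# A deliberately trivial stand-in for a trained model: its "parameters"
# are just a feature name and a threshold, kept in a plain dict so they
# serialize cleanly.
model = {"feature": "income", "threshold": 50000}

def predict(model, row):
    # row is a single dictionary, e.g. {"age": 42, "income": 55000}
    return 1 if row[model["feature"]] >= model["threshold"] else 0

blob = pickle.dumps(model)     # serialize after training
loaded = pickle.loads(blob)    # later, in the serving environment

pred = predict(loaded, {"age": 42, "income": 55000})   # -> 1
```

With a real estimator you would typically pickle (or joblib-dump) the fitted object to disk instead of an in-memory blob, but the train/serialize/load/predict-per-dict shape is the same.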

automl-gs - Provide an input CSV and a target field to predict, generate a model + code to run it.

  •    Python

Give automl-gs an input CSV file and a target field you want to predict, and get back a trained, high-performing machine learning or deep learning model plus native Python code pipelines that let you integrate the model into any prediction workflow. No black box: you can see exactly how the data is processed and how the model is constructed, and you can make tweaks as necessary. Unlike Microsoft's NNI, Uber's Ludwig, and TPOT, automl-gs offers a zero-code/model-definition interface for getting an optimized model and data-transformation pipeline in multiple popular ML/DL frameworks, with minimal Python dependencies (pandas + scikit-learn + your framework of choice). automl-gs is designed for citizen data scientists and engineers without a deep statistical background, under the philosophy that you don't need to know modern data preprocessing and machine learning engineering techniques to create a powerful prediction workflow.

mars - Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions

  •    Python

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and many other libraries. More details about installing Mars can be found in the installation section of the Mars documentation.

open-solution-home-credit - Open solution to the Home Credit Default Risk challenge :house_with_garden:

  •    Python

This is an open solution to the Home Credit Default Risk challenge 🏡. In this open-source solution you will find references to neptune.ml, a free platform for community users, which we use daily to keep track of our experiments. Please note that using neptune.ml is not necessary to proceed with this solution; you may run it as a plain Python script 🐍.

tgboost - Tiny Gradient Boosting Tree

  •    Java

TGBoost is a tiny implementation of a gradient boosting tree, based on XGBoost's scoring function and SLIQ's efficient tree-building algorithm. TGBoost builds the tree level-wise as in SLIQ (by constructing an attribute list and a class list). Currently, TGBoost supports parallel learning on a single machine, with speed and memory consumption comparable to XGBoost. The two libraries handle missing values differently: XGBoost learns a default direction (left or right) for rows with a missing value, while TGBoost enumerates sending them to the left child, the right child, or a dedicated missing-value child, then chooses the best option. TGBoost therefore uses a ternary tree.
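TGBoost itself is written in Java; the toy Python sketch below (all names hypothetical) illustrates the ternary idea: at a fixed split threshold, score sending the rows with missing values to the left child, the right child, or a dedicated missing-value child, and keep the cheapest option (here scored by squared error around each child's mean).

```python
# Toy ternary split: rows whose feature is missing (None) can go to the
# left child, the right child, or a dedicated "missing" child; we score
# each routing by total squared error and keep the best.

def sse(ys):
    # Squared error of predicting the mean for a group of targets.
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_missing_route(rows, threshold):
    # rows: list of (feature_value_or_None, target)
    left = [y for x, y in rows if x is not None and x < threshold]
    right = [y for x, y in rows if x is not None and x >= threshold]
    missing = [y for x, y in rows if x is None]
    options = {
        "left": sse(left + missing) + sse(right),
        "right": sse(left) + sse(right + missing),
        "own_child": sse(left) + sse(right) + sse(missing),
    }
    return min(options, key=options.get)

rows = [(1.0, 0.0), (2.0, 0.0), (8.0, 1.0), (9.0, 1.0), (None, 0.5)]
route = best_missing_route(rows, threshold=5.0)   # -> "own_child"
```

Here the missing row's target (0.5) fits neither child's mean, so the dedicated missing-value child wins; XGBoost's learned default direction corresponds to comparing only the first two options.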

fast_retraining - Show how to perform fast retraining with LightGBM in different business cases

  •    Jupyter

In this repo we compare two of the fastest boosted decision tree libraries: XGBoost and LightGBM. We evaluate them across datasets from several domains and of different sizes. On July 25, 2017, we published a blog post, Lessons Learned From Benchmarking Fast Machine Learning Algorithms, evaluating both libraries and discussing the benchmark results.

fashion - The Fashion-MNIST dataset and machine learning models.

  •    R

Training machine learning models on the Fashion-MNIST dataset. Fashion-MNIST is a dataset of 70,000 images (60k training and 10k test) of clothing items, such as shirts, pants and shoes. Each example is a 28x28 grayscale image, associated with a label from one of 10 classes.

dextra-mindef-2015 - My solution for Dextra Data Science Challenge #44 (Singapore Ministry of Defense) https://challenges

  •    Python

Note: a few people have asked me for the challenge's data source. Unfortunately, I am not authorized to release it publicly; if you need it, please send the request to Mindef or Dextra.sg rather than to me. Only native XGBoost was recorded, since it simply dominated everything.

kaggle-for-fun - All my submissions for Kaggle contests that I have participated in or plan to participate in

  •    Python

All my submissions for Kaggle contests that I have participated in or plan to participate in. I will probably write everything in Python (using scikit-learn or similar libraries), but occasionally I might also use R or Haskell where I can.

minimal-datascience - This repository contains all the code and dataset used in my blog series: Minimal Data Science

  •    Python

My goal for this minimal data science blog series is not only to share and tutorialize, but also to keep personal notes while learning and working as a Data Scientist. I look forward to receiving any feedback from you. Chapter 1: Classify StarCraft 2 players with Python, Pandas and Scikit-learn.

IntroCSExperiencePrediction - A supervised learning project using eXtreme Gradient Boosting Trees

  •    Jupyter

This project adds a predictive model for understanding the dynamics of gender in intro CS at Berkeley for the years 2014 through 2015. This work builds on previous research done in fulfillment of a Computer Science Education Ph.D., HipHopathy, A Socio-Curricular Study of Introductory Computer Science. The dataset used in this project is not available for mass consumption, as it contains sensitive, personally identifiable student data. To generate this dataset, I created a survey that includes the following attributes for each data point, rated on a Likert scale of 1 to 5, with 1 for strongly disagree and 5 for strongly agree. Some items have yes/no answers.

zoltar - Common library for serving TensorFlow, XGBoost and scikit-learn models in production.

  •    Java

Zoltar is a common library for serving TensorFlow, XGBoost and scikit-learn models in production. See Zoltar docs for details. Copyright 2018 Spotify AB.

Apartment-Interest-Prediction - Predict people's interest in renting specific NYC apartments

  •    Jupyter

Predict people's interest in renting specific apartments. The challenge combines structured data, geolocation, time data, free text and images. This solution features gradient boosted trees (XGBoost and LightGBM) and does not use stacking, due to lack of time.

Arch-Data-Science - Archlinux PKGBUILDs for Data Science, Machine Learning, Deep Learning, NLP and Computer Vision

  •    Shell

Welcome to my repo for building Data Science, Machine Learning, Computer Vision, Natural Language Processing and Deep Learning packages from source. My data science environment runs in an LXC container, so TensorFlow's build system, Bazel, must be built with its auto-sandboxing disabled.

mli-resources - Machine Learning Interpretability Resources

  •    Jupyter

Machine learning algorithms can create more accurate models than linear models, but any increase in accuracy over more traditional, better-understood, and more easily explainable techniques is of little practical value for those who must explain their models to regulators or customers. For many decades, the models created by machine learning algorithms were generally taken to be black boxes. However, a recent flurry of research has introduced credible techniques for interpreting complex, machine-learned models. The materials presented here illustrate applications or adaptations of these techniques for practicing data scientists. Want to contribute your own examples? Just make a pull request.
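One simple, credible technique of this kind is permutation importance: shuffle one feature column at a time and measure how much the model's accuracy drops. A stdlib-only sketch with a hypothetical stand-in model:

```python
import random

# Permutation importance: shuffle one column at a time and measure the
# drop in accuracy; features the model relies on show a drop, features
# it ignores show none.

def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, col, seed=0):
    rng = random.Random(seed)
    shuffled_col = [r[col] for r in rows]
    rng.shuffle(shuffled_col)
    shuffled_rows = [list(r) for r in rows]
    for r, v in zip(shuffled_rows, shuffled_col):
        r[col] = v
    return accuracy(model, rows, labels) - accuracy(model, shuffled_rows, labels)

# A stand-in "black box" that only ever reads column 0.
model = lambda row: 1 if row[0] > 0.5 else 0
rows = [[0.9, 5], [0.1, 5], [0.8, 2], [0.2, 9]]
labels = [1, 0, 1, 0]

drop0 = permutation_importance(model, rows, labels, col=0)
drop1 = permutation_importance(model, rows, labels, col=1)  # 0.0: column 1 is never read
```

In practice you would average the drop over many shuffles (scikit-learn's `permutation_importance` does this), but even a single shuffle exposes which inputs a black-box model actually uses.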