CatBoost is a machine learning method based on gradient boosting over decision trees. All CatBoost documentation is available here.

https://catboost.yandex
https://github.com/catboost/catboost

Tags | machine-learning decision-trees gradient-boosting gbm gbdt r kaggle gpu-computing catboost tutorial categorical-features distributed gpu coreml opensource data-science big-data |

Implementation | C++ |

License | Apache |

Platform |

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples. XGBoost has been developed and used by a group of active community members. Your help is very valuable to make the package better for everyone.

gbdt gbrt gbm distributed-systems xgboost gradient-boosting histogram

Instructions for how to install the necessary software for this tutorial are available here. Data for the tutorial can be downloaded by running ./data/get-data.sh (requires wget). Certain algorithms don't scale well when there are millions of features. For example, decision trees require computing some sort of metric (to determine the splits) on all the feature values (or some fraction of the values as in Random Forest and Stochastic GBM). Therefore, computation time is linear in the number of features. Other algorithms, such as GLM, scale much better to high-dimensional (n << p) and wide data with appropriate regularization (e.g. Lasso, Elastic Net, Ridge).
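The linear cost in the number of features can be seen in a naive split search. The sketch below is plain Python and is not taken from the tutorial: the function names `gini` and `best_split`, the Gini impurity criterion, and the toy data are all assumptions chosen for illustration. The outer loop runs once per feature, which is where the linear dependence comes from.

```python
from typing import List, Optional, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(
    rows: List[List[float]], labels: List[int]
) -> Tuple[Optional[int], Optional[float], float]:
    """Return (feature index, threshold, weighted impurity) of the best split.

    Hypothetical helper: every feature is scanned once, so the work grows
    linearly with the number of columns.
    """
    n = len(rows)
    best: Tuple[Optional[int], Optional[float], float] = (None, None, gini(labels))
    for j in range(len(rows[0])):  # one pass per feature
        for t in sorted({row[j] for row in rows}):  # candidate thresholds
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            if not left or not right:
                continue  # a split must leave rows on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 5.0], [4.0, 3.0]]
labels = [0, 0, 1, 1]
feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

Real tree learners sort or bin feature values rather than rescanning the rows for each threshold, but the per-feature loop remains.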

machine-learning deep-learning random-forest gradient-boosting-machine tutorial data-science ensemble-learning r

For more details, please refer to Features. Experiments on public datasets show that LightGBM can outperform existing boosting frameworks on both efficiency and accuracy, with significantly lower memory consumption. What's more, the experiments show that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.

gbdt gbm machine-learning data-mining kaggle efficiency distributed lightgbm gbrt

Ytk-learn is a distributed machine learning library which implements most of the popular machine learning algorithms.

machine-learning distributed gbm gbdt logistic-regression factorization-machines spark hadoop

This project aims at a minimal benchmark for scalability, speed and accuracy of commonly used implementations of a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of limited cardinality, i.e. not very sparse) and no missing data, perhaps the most common problem in business applications (e.g. credit scoring, fraud detection or churn prediction). If the input matrix is n x p, n is varied as 10K, 100K, 1M, 10M, while p is ~1K (after expanding the categoricals into dummy variables/one-hot encoding). This particular type of data structure/size (the largest) stems from this author's interest in some particular business applications. Note: While a large part of this benchmark was done in Spring 2015, reflecting the state of ML implementations at that time, this repo is updated when I see significant changes in implementations or when new implementations become widely available (e.g. lightgbm). Also, please find a summary of the progress and learnings from this benchmark at the end of this repo.
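To make the "expanding the categoricals into dummy variables" step concrete, here is a minimal sketch in plain Python. It is not the benchmark's code: the helper `one_hot_encode`, the column names `amount` and `channel`, and the sample rows are all hypothetical. It shows how one categorical column with k levels becomes k 0/1 columns, which is how p grows to ~1K.

```python
from typing import Any, Dict, List, Tuple


def one_hot_encode(
    rows: List[Dict[str, Any]], categorical: List[str]
) -> Tuple[List[str], List[List[float]]]:
    """Expand each categorical column into 0/1 dummy columns.

    Hypothetical helper: numeric columns are passed through as floats.
    """
    # Collect the sorted set of levels seen for every categorical column.
    levels = {c: sorted({str(r[c]) for r in rows}) for c in categorical}

    columns: List[str] = []
    for name in rows[0]:
        if name in levels:
            columns.extend(f"{name}={v}" for v in levels[name])
        else:
            columns.append(name)

    matrix: List[List[float]] = []
    for r in rows:
        out: List[float] = []
        for name in rows[0]:
            if name in levels:
                out.extend(1.0 if str(r[name]) == v else 0.0 for v in levels[name])
            else:
                out.append(float(r[name]))
        matrix.append(out)
    return columns, matrix


# Made-up sample: one numeric column and one categorical column with 2 levels.
sample = [
    {"amount": 10.0, "channel": "web"},
    {"amount": 25.0, "channel": "store"},
    {"amount": 7.5, "channel": "web"},
]
columns, matrix = one_hot_encode(sample, categorical=["channel"])
print(columns)
print(matrix)
```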

machine-learning data-science r gradient-boosting-machine random-forest deep-learning xgboost h2o spark

Tel-Aviv Deep Learning Bootcamp is an intensive (and free!) 5-day program intended to teach you all about deep learning. It is a nonprofit focused on advancing data science education and fostering entrepreneurship. The Bootcamp is a prominent venue for graduate students, researchers, and data science professionals. It offers a chance to study the essential and innovative aspects of deep learning. Participation is via a donation to the A.L.S ASSOCIATION for promoting research of the Amyotrophic Lateral Sclerosis (ALS) disease.

gpu nvidia docker-image machine-learning deep-learning data-science cuda-kernels kaggle-competition cuda pytorch pytorch-tutorials pytorch-tutorial bootcamp meetup kaggle kaggle-scripts pycuda

A Clojure Library for Bayesian Data Analysis and Machine Learning on the GPU. Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.

bayesian-inference bayesian-data-analysis gpu-computing gpu-acceleration statistics machine-learning clojure-library bayesian opencl cuda high-performance-computing gpu mcmc markov-chain-monte-carlo

PyGDF implements the Python interface to access and manipulate the GPU Dataframe of the GPU Open Analytics Initiative (GOAI). We aim to provide a simple interface that is similar to the Pandas dataframe and hides the details of GPU programming.

gpu gpu-data-frame h2o python-api machine-learning-algorithms mapd anaconda analytics

Edward is a Python library for probabilistic modeling, inference, and criticism. It is a testbed for fast experimentation and research with probabilistic models, ranging from classical hierarchical models on small data sets to complex deep probabilistic models on large data sets. Edward fuses three fields: Bayesian statistics and machine learning, deep learning, and probabilistic programming. Edward is built on top of TensorFlow. It enables features such as computational graphs, distributed training, CPU/GPU integration, automatic differentiation, and visualization with TensorBoard.

bayesian-methods deep-learning machine-learning data-science tensorflow neural-networks statistics probabilistic-programming

Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.

machine-learning data-science automl automation scikit-learn hyperparameter-optimization model-selection parameter-tuning automated-machine-learning random-forest gradient-boosting feature-engineering xgboost genetic-programming

IPython Notebook(s) demonstrating deep learning functionality. IPython Notebook(s) demonstrating scikit-learn functionality.

machine-learning deep-learning data-science big-data aws tensorflow theano caffe scikit-learn kaggle spark mapreduce hadoop matplotlib pandas numpy scipy keras

This is a bare bones example of TensorFlow, a machine learning package published by Google. You will not find a simpler introduction to it. In each example, a straight line is fit to some data. Values for the slope and y-intercept of the line that best fit the data are determined using gradient descent. If you do not know about gradient descent, check out the Wikipedia page.
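For readers who want to see the idea without any framework, the following is a minimal sketch of fitting a straight line by gradient descent in plain Python. It does not use TensorFlow and is not the project's code: the function name `fit_line`, the learning rate, the step count, and the data points are assumptions picked for illustration. It minimises the mean squared error over the slope and y-intercept.

```python
from typing import List, Tuple


def fit_line(
    xs: List[float],
    ys: List[float],
    learning_rate: float = 0.05,
    steps: int = 2000,
) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point under the current parameters.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2.0 / n * sum(errors)
        # Step against the gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(round(slope, 4), round(intercept, 4))
```

A learning rate that is too large for the scale of the data makes the updates diverge, so the default here is only suitable for small inputs like these.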

tensorflow tensorflow-tutorials distributed-computing simple big-data linear-regression tensorflow-examples tensorflow-exercises

MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly-scalable predictive and analytical models for large image and text datasets. MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.

machine-learning spark cntk pyspark azure microsoft-machine-learning microsoft ml

I just built out v2 of this project that now gives you analytics info from your models, and is production-ready. machineJS is an amazing research project that clearly proved there's a hunger for automated machine learning. auto_ml tackles this exact same goal, but with more features, cleaner code, and the ability to be copy/pasted into production.

machine-learning data-science machine-learning-library machine-learning-algorithms ml data-scientists javascript-library scikit-learn kaggle numerai automated-machine-learning automl auto-ml neuralnet neural-network algorithms random-forest svm naive-bayes bagging optimization brainjs date-night sklearn ensemble data-formatting js xgboost scikit-neuralnetwork knn k-nearest-neighbors gridsearch gridsearchcv grid-search randomizedsearchcv preprocessing data-formatter kaggle-competition

TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from deep learning framework TensorFlow and big-data frameworks Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on our Hadoop clusters in Yahoo's private cloud.

tensorflow spark yahoo machine-learning cluster featured

PyTorch is a flexible deep learning framework that allows automatic differentiation through dynamic neural networks (i.e., networks that utilise dynamic control flow like if statements and while loops). It supports GPU acceleration, distributed training, various optimisations, and plenty more neat features. These are some notes on how I think about using PyTorch, and don't encompass all parts of the library or every best practice, but may be helpful to others. Neural networks are a subclass of computation graphs. Computation graphs receive input data, and data is routed to and possibly transformed by nodes which perform processing on the data. In deep learning, the neurons (nodes) in neural networks typically transform data with parameters and differentiable functions, such that the parameters can be optimised to minimise a loss via gradient descent. More broadly, the functions can be stochastic, and the structure of the graph can be dynamic. So while neural networks may be a good fit for dataflow programming, PyTorch's API has instead centred around imperative programming, which is a more common way for thinking about programs. This makes it easier to read code and reason about complex programs, without necessarily sacrificing much performance; PyTorch is actually pretty fast, with plenty of optimisations that you can safely forget about as an end user (but you can dig in if you really want to).

deep-learning

CaffeOnSpark brings deep learning to Hadoop and Spark clusters. By combining salient features from deep learning framework Caffe and big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. As a distributed extension of Caffe, CaffeOnSpark supports neural network model training, testing, and feature extraction. Caffe users can now perform distributed learning using their existing LMDB data files and minorly adjusted network configuration (as illustrated).

This is a collection of IPython/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the Python language. If Python is not your language, and it is R, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in being introduced to some basic Data Science Engineering, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R.

spark pyspark data-analysis mllib ipython-notebook notebook ipython data-science machine-learning big-data bigdata

Fast, flexible, multi-threaded ensembles of decision trees for machine learning in pure Go (golang). It can achieve quicker training times than many other popular implementations on some datasets. This is the result of CPU-cache-friendly memory utilization well suited to modern processors and separate, optimized paths to learn splits from binary, numerical and categorical data.

machine-learning go-library

This series of tutorials on Data Science engineering tries to compare how different concepts in the discipline can be implemented in the two dominant ecosystems nowadays: R and Python. We will do this from a neutral point of view. Our opinion is that each environment has good and bad things, and any data scientist should know how to use both in order to be as prepared as possible for the job market or to start a personal project.

data-science data-science-engineering tutorial data-frame exploratory-data-analysis r jupyter notebook machine-learning