MLIB - Apache Spark's scalable machine learning library

  •        4997

MLlib is a Spark implementation of some common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction and lot more.



Related Projects

Scikit Learn - Machine Learning in Python

  •    Python

scikit-learn is a Python module for machine learning built on top of SciPy. It is simple and efficient tools for data mining and data analysis. It supports automatic classification, clustering, model selection, pre processing and lot more.

pycm - Multi-class confusion matrix library in Python

  •    Python

PyCM is a multi-class confusion matrix library written in Python that supports both input data vectors and direct matrix, and a proper tool for post-classification model evaluation that supports most classes and overall statistics parameters. PyCM is the swiss-army knife of confusion matrices, targeted mainly at data scientists that need a broad array of metrics for predictive models and an accurate evaluation of large variety of classifiers. threshold is added in version 0.9 for real value prediction.

Apache Mahout - Scalable machine learning library

  •    Java

Apache Mahout has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

NFStream - A Flexible Network Data Analysis Framework

  •    Python

NFStream is a Python package providing fast, flexible, and expressive data structures designed to make working with online or offline network data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world network data analysis in Python. Additionally, it has the broader goal of becoming a common network data processing framework for researchers providing data reproducibility across experiments. NFStream extracts +90 flow features and can convert it directly to a pandas Dataframe or a CSV file.

Support Vector Machines Data Mining Plug-in in Analysis Services


The datamining Support Vector Machine (SVM) plug-in in MS SQL Server Analysis Services 2008. This plug-in is the SVM classification algorithm in addition to the shipped data mining algorithms with SQL Server.

practical-machine-learning-with-python - Master the essential skills needed to recognize and solve complex real-world problems with Machine Learning and Deep Learning by leveraging the highly popular Python Machine Learning Eco-system

  •    Jupyter

"Data is the new oil" is a saying which you must have heard by now along with the huge interest building up around Big Data and Machine Learning in the recent past along with Artificial Intelligence and Deep Learning. Besides this, data scientists have been termed as having "The sexiest job in the 21st Century" which makes it all the more worthwhile to build up some valuable expertise in these areas. Getting started with machine learning in the real world can be overwhelming with the vast amount of resources out there on the web. "Practical Machine Learning with Python" follows a structured and comprehensive three-tiered approach packed with concepts, methodologies, hands-on examples, and code. This book is packed with over 500 pages of useful information which helps its readers master the essential skills needed to recognize and solve complex problems with Machine Learning and Deep Learning by following a data-driven mindset. By using real-world case studies that leverage the popular Python Machine Learning ecosystem, this book is your perfect companion for learning the art and science of Machine Learning to become a successful practitioner. The concepts, techniques, tools, frameworks, and methodologies used in this book will teach you how to think, design, build, and execute Machine Learning systems and projects successfully.

Orange - Data Mining Suite

  •    Python

Orange is a component-based data mining software. It includes a range of data visualization, exploration, preprocessing and modeling techniques. It supports . interactive data analysis workflows with a large toolbox.

DataScienceR - a curated list of R tutorials for Data Science, NLP and Machine Learning

  •    R

This repo contains a curated list of R tutorials and packages for Data Science, NLP and Machine Learning. This also serves as a reference guide for several common data analysis tasks. Curated list of Python tutorials for Data Science, NLP and Machine Learning.

astroML - Machine learning, statistics, and data mining for astronomy and astrophysics

  •    Python

AstroML is a Python module for machine learning and data mining built on numpy, scipy, scikit-learn, and matplotlib, and distributed under the BSD license. It contains a growing library of statistical and machine learning routines for analyzing astronomical data in python, loaders for several open astronomical datasets, and a large suite of examples of analyzing and visualizing astronomical datasets. This project was started in 2012 by Jake VanderPlas to accompany the book Statistics, Data Mining, and Machine Learning in Astronomy by Zeljko Ivezic, Andrew Connolly, Jacob VanderPlas, and Alex Gray.

Jubatus - Framework and Library for Distributed Online Machine Learning

  •    C++

Jubatus is a distributed processing framework and streaming machine learning library. Jubatus includes these functionalities: Online Machine Learning Library: Classification, Regression, Recommendation (Nearest Neighbor Search), Graph Mining, Anomaly Detection, Clustering, Feature Vector Converter (fv_converter): Data Preprocess and Feature Extraction, Framework for Distributed Online Machine Learning with Fault Tolerance.

Smile - Statistical Machine Intelligence & Learning Engine

  •    Java

Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpolation, and visualization system in Java and Scala. With advanced data structures and algorithms, Smile delivers state-of-art performance.Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc.

data-science-with-ruby - Practical Data Science with Ruby based tools.

  •    Ruby

Data Science is a new "sexy" buzzword without specific meaning but often used to substitute Statistics, Scientific Computing, Text and Data Mining and Visualization, Machine Learning, Data Processing and Warehousing as well as Retrieval Algorithms of any kind. This curated list comprises awesome tutorials, libraries, information sources about various Data Science applications using the Ruby programming language.

Data-Analysis-and-Machine-Learning-Projects - Repository of teaching materials, code, and data for my data analysis and machine learning projects

  •    Jupyter

This is a repository of teaching materials, code, and data for my data analysis and machine learning projects.Each repository will (usually) correspond to one of the blog posts on my web site.

Java Data Mining Package

  •    Java

The Java Data Mining Package (JDMP) is a library that provides methods for analyzing data with the help of machine learning algorithms (e.g. clustering, classification, graphical models, neural networks, Bayesian networks, text processing, optimization).

MMLSpark - Microsoft Machine Learning for Apache Spark

  •    Scala

MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly-scalable predictive and analytical models for large image and text datasets.MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.

mahout - Mirror of Apache Mahout

  •    Java

Mahout's goal is to build scalable machine learning libraries. With scalable we mean: Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms. Scalable to support your business case. Mahout is distributed under a commercially friendly Apache Software license. Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more. Currently Mahout supports mainly four use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies, which individual items usually appear together.

mlr - mlr: Machine Learning in R

  •    R

Please cite our JMLR paper [bibtex]. Some parts of the package were created as part of other publications. If you use these parts, please cite the relevant work appropriately. An overview of all mlr related publications can be found here.

LightGBM - A fast, distributed, high performance gradient boosting (GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks

  •    C++

For more details, please refer to Features.Experiments on public datasets show that LightGBM can outperform existing boosting frameworks on both efficiency and accuracy, with significantly lower memory consumption. What's more, the experiments show that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.