fastText - Library for fast text representation and classification.


fastText is a library for efficient learning of word representations and sentence classification. You can find answers to frequently asked questions on our website.



Related Projects

klassify - Bayesian Text classification service based on Redis. Built on top of Tornado and React.js

Redis-based text classification service with a real-time web interface. Text classification, document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.
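As an illustration of the task described above, here is a minimal sketch of Bayesian text classification in Python. It is not klassify's code or API: the class name, the whitespace tokenizer, the add-one smoothing and the tiny spam/ham training set are all assumptions made for the example, and everything is kept in memory rather than in Redis.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    Hypothetical helper for illustration only; not part of any listed project.
    """

    def __init__(self):
        self.class_doc_counts = Counter()        # documents seen per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()                  # every word seen in training

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase + whitespace split is enough for the demo.
        return text.lower().split()

    def train(self, text, label):
        self.class_doc_counts[label] += 1
        for word in self.tokenize(text):
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_doc_counts.items():
            # log P(class) + sum of log P(word | class), in log space
            # to avoid underflow on longer documents.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data.
classifier = NaiveBayesTextClassifier()
classifier.train("cheap pills buy now", "spam")
classifier.train("limited offer buy cheap", "spam")
classifier.train("meeting agenda for monday", "ham")
classifier.train("lunch on monday with the team", "ham")

print(classifier.classify("buy cheap pills"))      # spam
print(classifier.classify("monday team meeting"))  # ham
```

Because training only increments counters, the same structure also supports updating the model one document at a time.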

limdu - Machine-learning for Node.js

Limdu is a machine-learning framework for Node.js. It supports multi-label classification, online learning, and real-time classification. Therefore, it is especially suited for natural language understanding in dialog systems and chat-bots. Limdu is in an "alpha" state - some parts are working (see this readme), but some parts are missing or not tested. Contributions are welcome.


NTextCat is a text classification utility whose primary target is language identification: it helps you recognize (identify) the language of a text (or binary) snippet. It is a pure .NET application (C#).


Provides a set of tools for processing text, such as text extraction and classification. Planned classification implementations include Bayesian and statistical (n-gram) classifiers.

snips-nlu - Snips Python library to extract meaning from text

Snips NLU (Natural Language Understanding) is a Python library that parses sentences written in natural language and extracts structured information. To find out how to use Snips NLU, please refer to our documentation, which provides a step-by-step guide on how to set up and use the library.

Word Vector Tool

The Word Vector Tool is a simple but flexible Java library to create word vector representations of text documents. Word vectors can be used for various text processing tasks, such as text classification, text clustering or information retrieval.
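To make the idea of a word vector concrete, the following Python sketch turns documents into sparse term-frequency vectors and ranks them against a query by cosine similarity. It does not use the Word Vector Tool's Java API; the function names, the whitespace tokenizer and the sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def word_vector(text):
    """Map a document to a sparse term-frequency vector (word -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up documents for the demo.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the cat lay on the rug",
    "d3": "stock prices fell sharply today",
}
vectors = {name: word_vector(text) for name, text in docs.items()}

# Simple retrieval: rank documents by similarity to the query vector.
query = word_vector("a cat on a mat")
ranked = sorted(
    vectors,
    key=lambda name: cosine_similarity(query, vectors[name]),
    reverse=True,
)
print(ranked)  # ['d1', 'd2', 'd3']
```

The same vectors could feed a classifier or a clustering algorithm; raw counts are used here for brevity, where a weighting such as TF-IDF is a common alternative.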

mahout - Mirror of Apache Mahout

Mahout's goal is to build scalable machine learning libraries. By scalable we mean:

Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

Scalable to support your business case. Mahout is distributed under a commercially friendly Apache Software license.

Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more.

Currently Mahout supports mainly four use cases:

Recommendation mining takes users' behavior and from that tries to find items users might like.

Clustering takes e.g. text documents and groups them into groups of topically related documents.

Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.

Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
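The frequent itemset use case can be sketched in a few lines of single-node Python. This is a brute-force pair count for illustration, not Mahout's distributed implementation; the function name, the `min_support` parameter and the shopping carts are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that co-occur in at least `min_support` transactions."""
    pair_counts = Counter()
    for items in transactions:
        # Deduplicate and sort so each pair is counted once per transaction
        # and always appears in the same order.
        for pair in combinations(sorted(set(items)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Made-up shopping carts.
carts = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
    ["bread", "butter"],
]
print(frequent_pairs(carts, min_support=3))  # {('bread', 'milk'): 3}
```

Counting pairs per transaction and then summing the counts is the kind of work that splits naturally into a map step and a reduce step, which is why this task suits the map/reduce paradigm mentioned above.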

MMLSpark - Microsoft Machine Learning for Apache Spark

MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly-scalable predictive and analytical models for large image and text datasets. MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.

food-101-keras - Food Classification with Deep Learning in Keras / Tensorflow

Convolutional Neural Networks (CNN), a technique within the broader Deep Learning field, have been a revolutionary force in Computer Vision applications, especially in the past half-decade or so. One main use-case is that of image classification, e.g. determining whether a picture is that of a dog or cat.

ResNeXt - Implementation of a classification framework from the paper Aggregated Residual Transformations for Deep Neural Networks

This repository contains a Torch implementation for the ResNeXt algorithm for image classification. The code is based on fb.resnet.torch. ResNeXt is a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call “cardinality” (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width.
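The building block described above can be sketched in plain Python as a sum over same-shaped branches plus an identity shortcut, y = x + sum_i T_i(x). This is a toy illustration and not the repository's Torch code: small dense layers stand in for the convolutions a real ResNeXt block would use, and the helper names, dimensions and random weights are assumptions.

```python
import random


def make_branch(dim, bottleneck, rng):
    """One transformation: project down to `bottleneck` units, then back up."""
    down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(bottleneck)]
    up = [[rng.uniform(-0.1, 0.1) for _ in range(bottleneck)] for _ in range(dim)]
    return down, up


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def apply_branch(branch, x):
    """Down-projection, ReLU, up-projection."""
    down, up = branch
    hidden = [max(0.0, h) for h in matvec(down, x)]
    return matvec(up, hidden)


def aggregated_block(x, branches):
    """Sum the outputs of all branches and add the identity shortcut."""
    out = list(x)
    for branch in branches:
        out = [o + t for o, t in zip(out, apply_branch(branch, x))]
    return out


rng = random.Random(0)
cardinality = 4  # the size of the set of transformations
branches = [make_branch(dim=8, bottleneck=2, rng=rng) for _ in range(cardinality)]

x = [1.0] * 8
y = aggregated_block(x, branches)
print(len(y))  # 8: the block preserves the input dimension
```

Every branch has the same topology, so cardinality is just the length of the `branches` list and can be varied independently of the block's depth and width.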

PCP (Pattern Classification Program)

PCP (Pattern Classification Program) is an open-source machine learning program for supervised classification of patterns. PCP is a binary executable running on Linux and Windows (under the Cygwin environment).


Classifier4J is a Java library that provides an API for automatic classification of text. The default (and only current) implementation of this API is a Bayesian classifier. This library can be used for multiple purposes - as a spam filter or a blog cl


Autofiler is an automatic server-side mail filing application based on Bayesian text classification. In combination with an IMAP server, autofiler can file messages in folders automatically and transparently.

Constellio - Enterprise Search engine

Constellio Open Source Enterprise Search is based on Apache Solr and uses the Google Search Appliance connector architecture. It lets you find all relevant content in your organization (Web, email, ECM, CRM, etc.) with a single click.


MALLET (A Machine Learning for Language Toolkit) is an integrated collection of Java code useful for statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text

Trainable Relation Extraction framework

T-Rex (Trainable Relation Extraction) is a highly configurable, machine learning-based framework for information extraction from text, which includes tools for document classification, entity extraction and relation extraction.

Java Data Mining Package

The Java Data Mining Package (JDMP) is a library that provides methods for analyzing data with the help of machine learning algorithms (e.g. clustering, classification, graphical models, neural networks, Bayesian networks, text processing, optimization).


Text classification and summarization library for .NET. A port of the Classifier4J Java library (see

gensim - Topic Modelling for Humans

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.

tc - A command-line Twitter client with smart filtering and statistical classification

A command-line Twitter client with smart filtering and statistical classification