text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

  •        38

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP). To learn how to use this package, see text2vec.org and the package vignettes. See also the text2vec articles on my blog.




Related Projects

gensim - Topic Modelling for Humans

  •    Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.

magnitude - A fast, efficient universal vector embedding utility package.

  •    Python

A feature-packed Python package and vector storage file format for utilizing vector embeddings in machine learning models in a fast, efficient, and simple manner developed by Plasticity. It is primarily intended to be a simpler / faster alternative to Gensim, but can be used as a generic key-vector store for domains outside NLP. Vector space embedding models have become increasingly common in machine learning and traditionally have been popular for natural language processing applications. A fast, lightweight tool to consume these large vector space embedding models efficiently is lacking.

Yahoo_LDA - Yahoo!'s topic modelling framework using Latent Dirichlet Allocation

  •    C++

Yahoo!'s topic modelling framework using Latent Dirichlet Allocation

practical-1 - Oxford Deep NLP 2017 course - Practical 1: word2vec

  •    Jupyter

For this practical, you'll be provided with a partially-complete IPython notebook, an interactive web-based Python computing environment that allows us to mix text, code, and interactive plots. We will be training word2vec models on TED Talk and Wikipedia data, using the word2vec implementation included in the Python package gensim. After training the models, we will analyze and visualize the learned embeddings.

Apache Mahout - Scalable machine learning library

  •    Java

Apache Mahout has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

PyTorch-NLP - Supporting Rapid Prototyping with a Toolkit (incl. Datasets and Neural Network Layers)

  •    Python

PyTorch-NLP, or torchnlp for short, is a library of neural network layers, text processing modules and datasets designed to accelerate Natural Language Processing (NLP) research. Join our community, add datasets and neural network layers! Chat with us on Gitter and join the Google Group, we're eager to collaborate with you.

keras-quora-question-pairs - A Keras model that addresses the Quora Question Pairs dyadic prediction task

  •    Jupyter

A Keras model that addresses the Quora Question Pairs [1] dyadic prediction task. The model architecture is based on the Stanford Natural Language Inference [2] benchmark model developed by Stephen Merity [3], specifically the version using a simple summation of GloVe word embeddings [4] to represent each question in the pair. A difference between this and the Merity SNLI benchmark is that our final layer is Dense with sigmoid activation, as opposed to softmax. Another key difference is that we are using the max operator as opposed to sum to combine word embeddings into a question representation. We use binary cross-entropy as a loss function and Adam for optimization.

GibbsLDA++: A C/C++ Gibbs Sampling LDA

  •    C++

GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling for parameter estimation and inference. GibbsLDA++ is fast and is designed to analyze hidden/latent topic structures of large-scale (text) data collections.

fastText_multilingual - Multilingual word vectors in 78 languages

  •    Jupyter

Facebook recently open-sourced word vectors in 89 languages. However these vectors are monolingual; meaning that while similar words within a language share similar vectors, translation words from different languages do not have similar vectors. In a recent paper at ICLR 2017, we showed how the SVD can be used to learn a linear transformation (a matrix), which aligns monolingual vectors from two languages in a single vector space. In this repository we provide 78 matrices, which can be used to align the majority of the fastText languages in a single space. Word embeddings define the similarity between two words by the normalised inner product of their vectors. The matrices in this repository place languages in a single space, without changing any of these monolingual similarity relationships. When you use the resulting multilingual vectors for monolingual tasks, they will perform exactly the same as the original vectors. To learn more about word embeddings, check out Colah's blog or Sam's introduction to vector representations.

models - Model configurations

  •    Python

PaddlePaddle provides a rich set of computational units to enable users to adopt a modular approach to solving various learning problems. In this repo, we demonstrate how to use PaddlePaddle to solve common machine learning tasks, providing several different neural network model that anyone can easily learn and use. The word embedding expresses words with a real vector. Each dimension of the vector represents some of the latent grammatical or semantic features of the text and is one of the most successful concepts in the field of natural language processing. The generalized word vector can also be applied to discrete features. The study of word vector is usually an unsupervised learning. Therefore, it is possible to take full advantage of massive unmarked data to capture the relationship between features and to solve the problem of sparse features, missing tag data, and data noise. However, in the common word vector learning method, the last layer of the model often encounters a large-scale classification problem, which is the bottleneck of computing performance.

lectures - Oxford Deep NLP 2017 course


This repository contains the lecture slides and course description for the Deep Natural Language Processing course offered in Hilary Term 2017 at the University of Oxford. This is an applied course focussing on recent advances in analysing and generating speech and text using recurrent neural networks. We introduce the mathematical definitions of the relevant machine learning models and derive their associated optimisation algorithms. The course covers a range of applications of neural networks in NLP including analysing latent dimensions in text, transcribing speech to text, translating between languages, and answering questions. These topics are organised into three high level themes forming a progression from understanding the use of neural networks for sequential language modelling, to understanding their use as conditional language models for transduction tasks, and finally to approaches employing these techniques in combination with other mechanisms for advanced applications. Throughout the course the practical implementation of such models on CPU and GPU hardware is also discussed.

word2vec-graph - Exploring word2vec embeddings as a graph of nearest neighbors

  •    Python

This visualization builds graphs of nearest neighbors from high-dimensional word2vec embeddings. The dataset used for this visualization comes from GloVe, and has 6B tokens, 400K vocabulary, 300-dimensional vectors.

sense2vec - 🦆 Use NLP to go beyond vanilla word2vec

  •    C++

sense2vec (Trask et. al, 2015) is a nice twist on word2vec that lets you learn more interesting, detailed and context-sensitive word vectors. For an interactive example of the technology, see our sense2vec demo that lets you explore semantic similarities across all Reddit comments of 2015. This library is a simple Python/Cython implementation for loading and querying sense2vec models. While it's best used in combination with spaCy, the sense2vec library itself is very lightweight and can also be used as a standalone module. See below for usage details.

pattern - Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization

  •    Python

It is well documented, thoroughly tested with 350+ unit tests and comes bundled with 50+ examples. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. This example trains a classifier on adjectives mined from Twitter using Python 3. First, tweets that contain hashtag #win or #fail are collected. For example: "$20 tip off a sweet little old lady today #win". The word part-of-speech tags are then parsed, keeping only adjectives. Each tweet is transformed to a vector, a dictionary of adjective → count items, labeled WIN or FAIL. The classifier uses the vectors to learn which other tweets look more like WIN or more like FAIL.

tidytext - Text mining using dplyr, ggplot2, and other tidy tools :sparkles::page_facing_up::sparkles::page_facing_up::sparkles:

  •    R

Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Much of the infrastructure needed for text mining with tidy data frames already exists in packages like dplyr, broom, tidyr and ggplot2. In this package, we provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. Check out our book to learn more about text mining using tidy data principles. This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.

nlp-with-ruby - Practical Natural Language Processing done in Ruby.

  •    Ruby

This curated list comprises awesome resources, libraries, information sources about computational processing of texts in human languages with the Ruby programming language. That field is often referred to as NLP, Computational Linguistics, HLT (Human Language Technology) and can be brought in conjunction with Artificial Intelligence, Machine Learning, Information Retrieval, Text Mining, Knowledge Extraction and other related disciplines. This list comes from our day to day work on Language Models and NLP Tools. Read why this list is awesome. Our FAQ describes the important decisions and useful answers you may be interested in.