gensim-data - Data repository for pretrained NLP models and NLP corpora.


Research datasets regularly disappear, change over time, become obsolete, or come without a sane implementation for reading and processing the data format. For this reason, Gensim launched its own dataset storage, committed to long-term support and a sane, standardized usage API, and focused on datasets for unstructured text processing (no images or audio). This Gensim-data repository serves as that storage.
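
A minimal sketch of the standardized usage API, assuming the gensim.downloader module and the dataset/model names registered in this repository:

```python
import gensim.downloader as api

print(api.info("text8"))                       # metadata for one registered corpus
corpus = api.load("text8")                     # download (if needed) and load the corpus
vectors = api.load("glove-wiki-gigaword-50")   # or load a pretrained model directly
print(vectors.most_similar("car", topn=3))
```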



Related Projects

gensim - Topic Modelling for Humans

  •    Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.
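
A minimal topic-modelling sketch with gensim's own API (the toy corpus and parameter values are illustrative only):

```python
from gensim import corpora, models

texts = [["human", "computer", "interaction"],
         ["graph", "trees", "minors"],
         ["graph", "minors", "survey"]]
dictionary = corpora.Dictionary(texts)                      # map tokens to ids
bow_corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words corpus
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```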

word2vec-sentiments - Tutorial for Sentiment Analysis using Doc2Vec in gensim (or "getting 87% accuracy in sentiment analysis in under 100 lines of code")

  •    Jupyter

However, the Word2Vec documentation is shit, and the C code is nigh unreadable (700 lines of highly, and sometimes weirdly, optimized code). I personally spent a lot of time untangling Doc2Vec and crashing into ~50% accuracies due to implementation mistakes. This tutorial aims to help other users get off the ground using Word2Vec for their own research. We use Word2Vec for sentiment analysis by attempting to classify the Cornell IMDB movie review corpus; the specific dataset used is available for download, and the code to just run Doc2Vec and save the model as imdb.d2v is included as well. It should be useful for running on computer clusters.
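
A minimal sketch of the Doc2Vec part in gensim (the tagged toy documents stand in for the IMDB reviews; parameter values are illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["great", "movie"], tags=["pos_0"]),
        TaggedDocument(words=["terrible", "plot"], tags=["neg_0"])]

model = Doc2Vec(vector_size=50, min_count=1, epochs=20)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
model.save("imdb.d2v")                          # same file name as in the tutorial

vec = model.infer_vector(["great", "movie"])    # document vector to feed a classifier
```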

models - The ONNX Model Zoo is a collection of pre-trained, state-of-the-art models in the ONNX format

  •    Jupyter

The ONNX Model Zoo is a collection of pre-trained, state-of-the-art deep learning models, available in the ONNX format. Accompanying each model are Jupyter notebooks for model training and running inference with the trained model. The notebooks are written in Python and include links to the training dataset as well as references to the original paper that describes the model architecture. The notebooks can be exported and run as Python (.py) files. The Open Neural Network eXchange (ONNX) is an open format to represent deep learning models. With ONNX, developers can move models between state-of-the-art tools and choose the combination that is best for them. ONNX is developed and supported by a community of partners.
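
A minimal sketch of loading and running a zoo model with onnx and onnxruntime (the file name and input shape are placeholders; both depend on the model you download):

```python
import numpy as np
import onnx
import onnxruntime as ort

model = onnx.load("model.onnx")          # any model downloaded from the zoo
onnx.checker.check_model(model)          # validate the graph

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # shape depends on the model
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```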

GloVe - GloVe model for distributed word representation

  •    C

We provide an implementation of the GloVe model for learning word representations, and describe how to download web-dataset vectors or train your own. See the project page or the paper for more information on GloVe vectors. The links below contain word vectors obtained from the respective corpora. If you want word vectors trained on massive web datasets, you need only download one of these text files! Pre-trained word vectors are made available under the Public Domain Dedication and License.
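
A minimal sketch of reading one of the released text files in Python (glove.6B.100d.txt is one of the downloadable files; each line holds a word followed by its vector):

```python
import numpy as np

embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

print(embeddings["king"][:5])
```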

magnitude - A fast, efficient universal vector embedding utility package.

  •    Python

A feature-packed Python package, developed by Plasticity, and a vector-storage file format for utilizing vector embeddings in machine learning models in a fast, efficient, and simple manner. It is primarily intended as a simpler and faster alternative to Gensim, but can be used as a generic key-vector store for domains outside NLP. Vector space embedding models have become increasingly common in machine learning and have traditionally been popular for natural language processing applications; a fast, lightweight tool to consume these large models efficiently has been lacking.
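
A minimal usage sketch, assuming the pymagnitude package and a pre-converted .magnitude file:

```python
from pymagnitude import Magnitude

vectors = Magnitude("glove.6B.100d.magnitude")   # memory-mapped, loads lazily
print(vectors.query("king")[:5])                 # vector lookup (also handles OOV words)
print(vectors.similarity("king", "queen"))
print(vectors.most_similar("king", topn=3))
```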

sense2vec - 🦆 Use NLP to go beyond vanilla word2vec

  •    C++

sense2vec (Trask et al., 2015) is a nice twist on word2vec that lets you learn more interesting, detailed and context-sensitive word vectors. For an interactive example of the technology, see our sense2vec demo that lets you explore semantic similarities across all Reddit comments of 2015. This library is a simple Python/Cython implementation for loading and querying sense2vec models. While it's best used in combination with spaCy, the sense2vec library itself is very lightweight and can also be used as a standalone module. See below for usage details.
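
A standalone querying sketch; this assumes the newer sense2vec 2.x Python API and a downloaded Reddit vectors package, so the exact calls may differ for the implementation described above:

```python
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
query = "natural_language_processing|NOUN"       # keys combine the phrase and its sense tag
if query in s2v:
    vector = s2v[query]
    print(s2v.most_similar(query, n=3))
```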

textgenrnn - Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code

  •    Python

Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code, or quickly train on a text using a pretrained model. The included model can easily be trained on new texts, and can generate appropriate text even after a single pass of the input data.
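
A minimal training sketch with the textgenrnn API (the corpus file name is a placeholder):

```python
from textgenrnn import textgenrnn

textgen = textgenrnn()                             # starts from the included pretrained model
textgen.train_from_file("my_corpus.txt", num_epochs=1)
textgen.generate(3)                                # print a few generated samples
```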

practical-1 - Oxford Deep NLP 2017 course - Practical 1: word2vec

  •    Jupyter

For this practical, you'll be provided with a partially-complete IPython notebook, an interactive web-based Python computing environment that allows us to mix text, code, and interactive plots. We will be training word2vec models on TED Talk and Wikipedia data, using the word2vec implementation included in the Python package gensim. After training the models, we will analyze and visualize the learned embeddings.
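
A minimal sketch of the kind of gensim word2vec training the practical walks through (toy sentences stand in for the TED Talk and Wikipedia data; parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

sentences = [["the", "talk", "was", "inspiring"],
             ["the", "speaker", "gave", "a", "talk"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=10)
print(model.wv.most_similar("talk", topn=2))       # inspect the learned embeddings
```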

functional-zoo - PyTorch and Tensorflow functional model definitions

  •    Jupyter

PyTorch, unlike Lua Torch, has autograd at its core, so the modular structure of torch.nn modules is not necessary; one can simply allocate the needed Variables and write a function that uses them, which is sometimes more convenient. This repo contains model definitions in this functional style, with pretrained weights for some models. Weights are serialized as a dict of arrays in hdf5, so they should be easily loadable in other frameworks. Thanks to @edgarriba we have a cpp_parser for loading the weights in C++.
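
A minimal sketch of such a functional definition (the tiny network and in-memory params dict are illustrative; in the repo the weights would come from the hdf5 files):

```python
import torch
import torch.nn.functional as F

params = {
    "conv1.weight": torch.randn(8, 3, 3, 3, requires_grad=True),
    "conv1.bias":   torch.zeros(8, requires_grad=True),
    "fc.weight":    torch.randn(10, 8, requires_grad=True),
    "fc.bias":      torch.zeros(10, requires_grad=True),
}

def model(x, params):
    # purely functional: no nn.Module, just ops over explicitly passed tensors
    x = F.relu(F.conv2d(x, params["conv1.weight"], params["conv1.bias"], padding=1))
    x = F.adaptive_avg_pool2d(x, 1).flatten(1)
    return F.linear(x, params["fc.weight"], params["fc.bias"])

out = model(torch.randn(2, 3, 32, 32), params)
print(out.shape)
```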

modeldb - A system to manage machine learning models

  •    Javascript

ModelDB is an end-to-end system to manage machine learning models. It ingests models and associated metadata as models are being trained, stores model data in a structured format, and surfaces it through a web frontend for rich querying. ModelDB can be used with any ML environment via the ModelDB Light API; native clients provide advanced support for spark.ml and scikit-learn. The ModelDB frontend provides rich summaries and graphs showing model data, along with functionality to slice and dice this data along various attributes (e.g. filter by hyperparameter, group by dataset) and to build custom charts showing model performance.

rwa - Machine Learning on Sequential Data Using a Recurrent Weighted Average

  •    Python

This repository holds the code for a new kind of RNN model for processing sequential data. The model computes a recurrent weighted average (RWA) over every previous processing step; with this approach, the model can form direct connections anywhere along a sequence. This stands in contrast to traditional RNN architectures, which only use the previous processing step. A detailed description of the RWA model has been published in an accompanying manuscript. Because the RWA can be computed as a running average, it does not need to be completely recomputed with each processing step: the numerator and denominator can be saved from the previous step. Consequently, the model scales like other RNN models such as the LSTM.
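
A minimal numerical sketch of that running-average trick (not the full model, which computes the values and attention scores from learned projections and shifts the exponents for numerical stability):

```python
import numpy as np

def rwa_step(num, den, z, a):
    """z: candidate value for this step, a: its (unbounded) attention score."""
    w = np.exp(a)               # weight for this step
    num = num + w * z           # running numerator
    den = den + w               # running denominator
    h = np.tanh(num / den)      # hidden state from the weighted average
    return h, num, den          # each step is O(1); no re-summing of the history

num, den = 0.0, 0.0
for z, a in [(0.5, 1.0), (-0.2, 0.3), (0.9, 2.0)]:
    h, num, den = rwa_step(num, den, z, a)
    print(h)
```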

youtube-8m - Starter code for working with the YouTube-8M dataset.

  •    Python

This repo contains starter code for training and evaluating machine learning models over the YouTube-8M dataset. The code gives an end-to-end working example for reading the dataset, training a TensorFlow model, and evaluating the performance of the model. Out of the box, you can train several model architectures over either frame-level or video-level features. The code can easily be extended to train your own custom-defined models. It is possible to train and evaluate on YouTube-8M in two ways: on Google Cloud or on your own machine. This README provides instructions for both.

cnn-models - ImageNet pre-trained models with batch normalization for the Caffe framework

  •    Python

This repository contains convolutional neural network (CNN) models trained on ImageNet by Marcel Simon at the Computer Vision Group Jena (CVGJ) using the Caffe framework, as published in the accompanying technical report. Each model is in a separate subfolder and contains everything needed to reproduce the results. The repository currently contains the batch-normalization variants of AlexNet and VGG19 as well as the training code for Residual Networks (ResNet). No mean subtraction is required for the pre-trained models: a batch-normalization layer basically does the same thing.
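
A minimal pycaffe loading sketch (file, blob, and layer names are placeholders; each model subfolder ships its own prototxt and caffemodel):

```python
import numpy as np
import caffe

net = caffe.Net("deploy.prototxt", "model.caffemodel", caffe.TEST)

blob = np.random.rand(1, 3, 224, 224).astype(np.float32)  # raw input, no mean subtraction
net.blobs["data"].reshape(*blob.shape)
net.blobs["data"].data[...] = blob
out = net.forward()
print(out["prob"].argmax())                                # predicted ImageNet class index
```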

hed - code for Holistically-Nested Edge Detection

  •    C++

We develop a new edge detection algorithm, holistically-nested edge detection (HED), which performs image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets. HED automatically learns rich hierarchical representations (guided by deep supervision on side responses) that are important for resolving the challenging ambiguity in edge and object boundary detection. We significantly advance the state of the art on the BSD500 dataset (ODS F-score of .790) and the NYU Depth dataset (ODS F-score of .746), and do so with an improved speed (0.4s per image). A detailed description of the system can be found in our paper. If you have downloaded the previous version (testing code) of HED, please note that we updated the code base to the new version of Caffe and uploaded a new pretrained model with better performance. We adopted the Python interface written for the FCN paper instead of our own implementation for training and testing. The evaluation protocol does not change.

tacotron - A TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model

  •    Python

We train the model on three different speech datasets. The LJ Speech Dataset has recently become widely used as a benchmark for the TTS task because it is publicly available; it has 24 hours of reasonable-quality samples. Nick's audiobooks (18 hours) are additionally used to see whether the model can learn even from less, more variable speech data. The World English Bible is a public domain update of the American Standard Version of 1901 into modern English, and its original audio recordings are freely available. Kyubyong split each chapter by verse manually and aligned the segmented audio clips to the text; they are 72 hours in total. You can download them at Kaggle Datasets.

pyLDAvis - Python library for interactive topic model visualization. Port of the R LDAvis package.

  •    Jupyter

Python library for interactive topic model visualization. This is a port of the fabulous R package by Carson Sievert and Kenny Shirley. pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.
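
A minimal sketch of preparing a visualization from a fitted gensim LDA model (the gensim helper module is named pyLDAvis.gensim in older releases and pyLDAvis.gensim_models in newer ones):

```python
import pyLDAvis
import pyLDAvis.gensim_models
from gensim import corpora, models

texts = [["human", "computer", "interaction"],
         ["graph", "trees", "minors"],
         ["graph", "minors", "survey"]]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2)

vis = pyLDAvis.gensim_models.prepare(lda, bow_corpus, dictionary)
pyLDAvis.save_html(vis, "lda_vis.html")   # or pyLDAvis.display(vis) inside a notebook
```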

LDAvis - R package for web-based interactive topic model visualization.

  •    Javascript

R package for interactive topic model visualization. LDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

Backbone.js - Models and Views for JavaScript

  •    Javascript

Backbone.js gives structure to web applications by providing models with key-value binding and custom events, collections with a rich API of enumerable functions, views with declarative event handling, and connects it all to your existing API over a RESTful JSON interface.

kinetics-i3d - Convolutional neural network model for video classification trained on the Kinetics dataset

  •    Python

This repository contains trained models reported in the paper "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset" by Joao Carreira and Andrew Zisserman. The paper was posted on arXiv in May 2017, and will be published as a CVPR 2017 conference paper. "Quo Vadis" introduced a new architecture for video classification, the Inflated 3D Convnet or I3D. Here we release Inception-v1 I3D models trained on the Kinetics dataset training split.

arangodb - 🥑 ArangoDB is a native multi-model database with flexible data models for documents, graphs, and key-values

  •    C++

ArangoDB is a multi-model, open-source database with flexible data models for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions. Use ACID transactions if you require them. Scale horizontally with a few mouse clicks. The supported data models can be mixed in queries and allow ArangoDB to be the aggregation point for your data.
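
A minimal sketch from Python, assuming the python-arango driver and a local server with default credentials; the AQL string shows the SQL-like query language mentioned above:

```python
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="")

if not db.has_collection("users"):
    db.create_collection("users")
db.collection("users").insert({"name": "Ada", "skills": ["math"]})

cursor = db.aql.execute("FOR u IN users FILTER 'math' IN u.skills RETURN u.name")
print(list(cursor))
```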