magnitude - A fast, efficient universal vector embedding utility package.

  •        94

A feature-packed Python package and vector storage file format for utilizing vector embeddings in machine learning models in a fast, efficient, and simple manner developed by Plasticity. It is primarily intended to be a simpler / faster alternative to Gensim, but can be used as a generic key-vector store for domains outside NLP. Vector space embedding models have become increasingly common in machine learning and traditionally have been popular for natural language processing applications. A fast, lightweight tool to consume these large vector space embedding models efficiently is lacking.

https://github.com/plasticityai/magnitude

Tags
Implementation
License
Platform

   




Related Projects

gensim - Topic Modelling for Humans

  •    Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.

fastText_multilingual - Multilingual word vectors in 78 languages

  •    Jupyter

Facebook recently open-sourced word vectors in 89 languages. However these vectors are monolingual; meaning that while similar words within a language share similar vectors, translation words from different languages do not have similar vectors. In a recent paper at ICLR 2017, we showed how the SVD can be used to learn a linear transformation (a matrix), which aligns monolingual vectors from two languages in a single vector space. In this repository we provide 78 matrices, which can be used to align the majority of the fastText languages in a single space. Word embeddings define the similarity between two words by the normalised inner product of their vectors. The matrices in this repository place languages in a single space, without changing any of these monolingual similarity relationships. When you use the resulting multilingual vectors for monolingual tasks, they will perform exactly the same as the original vectors. To learn more about word embeddings, check out Colah's blog or Sam's introduction to vector representations.

BioSentVec - BioWordVec & BioSentVec: pre-trained embeddings for biomedical words and sentences

  •    Jupyter

We created biomedical word and sentence embeddings using PubMed and the clinical notes from MIMIC-III Clinical Database. Both PubMed and MIMIC-III texts were split and tokenized using NLTK. We also lowercased all the words. The statistics of the two corpora are shown below. We applied fastText to compute 200-dimensional word embeddings. We set the window size to be 20, learning rate 0.05, sampling threshold 1e-4, and negative examples 10. Both the word vectors and the model with hyperparameters are available for download below. The model file can be used to compute word vectors that are not in the dictionary (i.e. out-of-vocabulary terms). This work extends the original BioWordVec which provides fastText word embeddings trained using PubMed and MeSH. We used the same parameters as the original BioWordVec which has been thoroughly evaluated in a range of applications.

text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

  •    R

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP). To learn how to use this package, see text2vec.org and the package vignettes. See also the text2vec articles on my blog.

sense2vec - 🦆 Use NLP to go beyond vanilla word2vec

  •    C++

sense2vec (Trask et. al, 2015) is a nice twist on word2vec that lets you learn more interesting, detailed and context-sensitive word vectors. For an interactive example of the technology, see our sense2vec demo that lets you explore semantic similarities across all Reddit comments of 2015. This library is a simple Python/Cython implementation for loading and querying sense2vec models. While it's best used in combination with spaCy, the sense2vec library itself is very lightweight and can also be used as a standalone module. See below for usage details.


PyTorch-NLP - Supporting Rapid Prototyping with a Toolkit (incl. Datasets and Neural Network Layers)

  •    Python

PyTorch-NLP, or torchnlp for short, is a library of neural network layers, text processing modules and datasets designed to accelerate Natural Language Processing (NLP) research. Join our community, add datasets and neural network layers! Chat with us on Gitter and join the Google Group, we're eager to collaborate with you.

practical-1 - Oxford Deep NLP 2017 course - Practical 1: word2vec

  •    Jupyter

For this practical, you'll be provided with a partially-complete IPython notebook, an interactive web-based Python computing environment that allows us to mix text, code, and interactive plots. We will be training word2vec models on TED Talk and Wikipedia data, using the word2vec implementation included in the Python package gensim. After training the models, we will analyze and visualize the learned embeddings.

spaCy - 💫 Industrial-strength Natural Language Processing (NLP) with Python and Cython

  •    Python

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pre-trained statistical models and word vectors, and currently supports tokenization for 20+ languages. It features the fastest syntactic parser in the world, convolutional neural network models for tagging, parsing and named entity recognition and easy deep learning integration. It's commercial open-source software, released under the MIT license. 💫 Version 2.0 out now! Check out the new features here.

OpenNLP - Machine learning based toolkit for the processing of natural language text

  •    Java

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.

thinc - 🔮 spaCy's Machine Learning library for NLP in Python

  •    Assembly

Thinc is the machine learning library powering spaCy. It features a battle-tested linear model designed for large sparse learning problems, and a flexible neural network model under development for spaCy v2.0. Thinc is a practical toolkit for implementing models that follow the "Embed, encode, attend, predict" architecture. It's designed to be easy to install, efficient for CPU usage and optimised for NLP and deep learning with text – in particular, hierarchically structured input and variable-length sequences.

Smile - Statistical Machine Intelligence & Learning Engine

  •    Java

Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpolation, and visualization system in Java and Scala. With advanced data structures and algorithms, Smile delivers state-of-art performance.Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc.

MITIE - MITIE: library and tools for information extraction

  •    C++

This project provides free (even for commercial use) state-of-the-art information extraction tools. The current release includes tools for performing named entity extraction and binary relation detection as well as tools for training custom extractors and relation detectors. MITIE is built on top of dlib, a high-performance machine-learning library[1], MITIE makes use of several state-of-the-art techniques including the use of distributional word embeddings[2] and Structural Support Vector Machines[3]. MITIE offers several pre-trained models providing varying levels of support for both English, Spanish, and German trained using a variety of linguistic resources (e.g., CoNLL 2003, ACE, Wikipedia, Freebase, and Gigaword). The core MITIE software is written in C++, but bindings for several other software languages including Python, R, Java, C, and MATLAB allow a user to quickly integrate MITIE into his/her own applications.

spark-nlp - Natural Language Understanding Library for Apache Spark.

  •    Jupyter

John Snow Labs Spark-NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment. This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .

dgl-ke - High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings

  •    Python

Knowledge graphs (KGs) are data structures that store information about different entities (nodes) and their relations (edges). A common approach of using KGs in various machine learning tasks is to compute knowledge graph embeddings. DGL-KE is a high performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings. The package is implemented on the top of Deep Graph Library (DGL) and developers can run DGL-KE on CPU machine, GPU machine, as well as clusters with a set of popular models, including TransE, TransR, RESCAL, DistMult, ComplEx, and RotatE. This command will download the FB15k dataset, train the transE model and save the trained embeddings into the file.

spago - Self-contained Machine Learning and Natural Language Processing library in Go

  •    Go

A Machine Learning library written in pure Go designed to support relevant neural architectures in Natural Language Processing. spaGO is self-contained, in that it uses its own lightweight computational graph framework for both training and inference, easy to understand from start to finish.

conceptnet-numberbatch

  •    Python

ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings.

pynlpl - PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing

  •    Python

PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language model. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotatation). The library is a divided into several packages and modules. It works on Python 2.7, as well as Python 3.

nlp_newsletter - 📰Natural language processing (NLP) newsletter

  •    

The NLP newsletter provides (weekly or biweekly) quick summaries of some of the latest natural language processing (NLP) and machine learning (ML) stories across a range of important categories such as fairness & ethics, society, educational resources, publications, etc. Want your story or project to be featured in the NLP Newsletter? Reach out via ellfae@gmail.com or Twitter. If you are interested in translating any of the NLP Newsletter issues, you can check out this GitHub issue.

nlp-with-ruby - Practical Natural Language Processing done in Ruby.

  •    Ruby

This curated list comprises awesome resources, libraries, information sources about computational processing of texts in human languages with the Ruby programming language. That field is often referred to as NLP, Computational Linguistics, HLT (Human Language Technology) and can be brought in conjunction with Artificial Intelligence, Machine Learning, Information Retrieval, Text Mining, Knowledge Extraction and other related disciplines. This list comes from our day to day work on Language Models and NLP Tools. Read why this list is awesome. Our FAQ describes the important decisions and useful answers you may be interested in.






We have large collection of open source products. Follow the tags from Tag Cloud >>


Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.