gensim - Topic Modelling for Humans

  •    Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.
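
The core idea gensim builds on, the Vector Space Model, can be sketched in a few lines of plain Python: represent each document as a bag-of-words count vector and rank documents against a query by cosine similarity. (This is only a toy illustration of the concept; gensim's own API and algorithms are far more capable.)

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term-frequency vector for a document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "topic modelling with large corpora",
    "document indexing and similarity retrieval",
    "cooking pasta with tomato sauce",
]
query = bow("similarity retrieval for documents")
ranked = sorted(docs, key=lambda d: cosine(query, bow(d)), reverse=True)
print(ranked[0])  # the indexing/retrieval document ranks first
```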

Chinese-Word-Vectors - 100+ Chinese Word Vectors 上百种预训练中文词向量

  •    Python

This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks. Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.

text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

  •    R

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP). To learn how to use this package, see text2vec.org and the package vignettes. See also the text2vec articles on my blog.

magnitude - A fast, efficient universal vector embedding utility package.

  •    Python

A feature-packed Python package and vector-storage file format, developed by Plasticity, for using vector embeddings in machine learning models in a fast, efficient, and simple manner. It is primarily intended as a simpler / faster alternative to Gensim, but can be used as a generic key-vector store for domains outside NLP. Vector space embedding models have become increasingly common in machine learning, traditionally in natural language processing applications, yet a fast, lightweight tool for consuming these large embedding models has been lacking.
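
The "generic key-vector store" idea can be illustrated with a toy in-memory version: map keys to vectors and answer nearest-neighbour queries by cosine similarity. (Magnitude itself reads a memory-mapped .magnitude file and supports much more; the store and functions below are hypothetical stand-ins.)

```python
import math

# Toy in-memory key-vector store (a hypothetical stand-in for a .magnitude file).
store = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def most_similar(key, k=1):
    """Return the k keys whose vectors are closest to `key` by cosine similarity."""
    v = store[key]
    others = [(w, cosine(v, u)) for w, u in store.items() if w != key]
    return [w for w, _ in sorted(others, key=lambda p: p[1], reverse=True)[:k]]

print(most_similar("cat"))  # ['dog']
```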

flair - A very simple framework for state-of-the-art NLP

  •    Python

Developed by Zalando Research, Flair is a powerful syntactic-semantic tagger / classifier that lets you apply state-of-the-art models for named entity recognition (NER), part-of-speech tagging (PoS), frame sense disambiguation, chunking, and classification to your text.

MachineLearningSamples-BiomedicalEntityExtraction

  •    Python

This real-world scenario shows how a large corpus of unstructured, unlabeled data, such as PubMed article abstracts, can be analyzed to train a domain-specific word embedding model. The resulting embeddings are then used as automatically generated features to train a neural entity-extraction model using Keras with the TensorFlow deep learning framework as backend and a small amount of labeled data. Detailed documentation for this scenario, including a step-by-step walk-through: https://review.docs.microsoft.com/en-us/azure/machine-learning/preview/scenario-tdsp-biomedical-recognition.

dna2vec - Consistent vector representations of variable-length k-mers

  •    Python

dna2vec is an open-source library for training distributed representations of variable-length k-mers. Note that this implementation has only been tested on Python 3.5.3, but we welcome any contributions or bug reports to make it more accessible.
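
The preprocessing idea behind variable-length k-mers is simple to sketch: slide windows of every length in a given range over a DNA sequence. (The function below is a hypothetical illustration of that decomposition step only; the embedding training itself uses word2vec-style models.)

```python
def kmers(seq, k_low, k_high):
    """All substrings of seq with length between k_low and k_high inclusive."""
    out = []
    for k in range(k_low, k_high + 1):
        for i in range(len(seq) - k + 1):
            out.append(seq[i:i + k])
    return out

print(kmers("ACGT", 3, 4))  # ['ACG', 'CGT', 'ACGT']
```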


language-detection-fastText

  •    Jupyter

Also, check out this link to download the final .bin model and the preprocessed dataset.

fastrtext - R wrapper for fastText

  •    C++

R wrapper for fastText C++ code from Facebook. fastText is a library for efficient learning of word representations and sentence classification.

projector - Project Dense Vector Text Representations on a 2D Plane

  •    R

Project dense vector representations of texts onto a 2D plane to better understand neural models applied to NLP. Since the famous word2vec, embeddings are everywhere in NLP (and nearby areas like IR). The main idea behind embeddings is to represent texts (made of characters, words, sentences, or even larger blocks) as numeric vectors. This works very well and provides abilities unreachable with the classic BoW approach. However, embeddings (i.e. vector representations) are difficult for humans to understand, analyze, and debug because they have far more than just 3 dimensions.
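
The general idea of mapping high-dimensional embeddings down to 2D for plotting can be sketched with a crude random projection. (projector's actual technique may well differ; this stdlib-only toy only illustrates the dimensionality-reduction step, and all names here are hypothetical.)

```python
import random

random.seed(0)

def random_projection(vectors, dim=2):
    """Project high-dimensional vectors to `dim` dimensions with a random
    Gaussian matrix, a crude stand-in for more principled methods (PCA, t-SNE)."""
    n = len(next(iter(vectors.values())))
    matrix = [[random.gauss(0, 1) for _ in range(n)] for _ in range(dim)]
    return {w: [sum(row[i] * v[i] for i in range(n)) for row in matrix]
            for w, v in vectors.items()}

embeddings = {"king": [0.5] * 50, "queen": [0.48] * 50}  # toy 50-d vectors
points = random_projection(embeddings)
print(points["king"])  # a 2D point ready for plotting
```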

Emoji2recipe - Recipe prediction model from emojis

  •    Python

For more info on Azure ML Workbench compute targets see documentation. Download the word2vec embeddings and emoji2vec embeddings and update respective paths in config.py.

wego - Word2Vec, GloVe in Go!

  •    Go

This is an implementation of word embedding (a.k.a. word representation) models in Go. As in this example, the models produce vectors whose arithmetic combinations capture word meaning.
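
The classic demonstration of "word meaning by arithmetic" is the analogy king - man + woman ≈ queen. A toy sketch with hand-crafted 2-d vectors (real models like wego learn these from corpora; the vectors below are invented for illustration):

```python
import math

# Hand-crafted toy vectors; a trained model would learn these from text.
vec = {
    "king":  [1.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [0.0, 0.2],
    "queen": [0.1, 1.1],
    "apple": [1.0, -0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman should land closest to queen.
target = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
best = max((w for w in vec if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vec[w]))
print(best)  # queen
```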

clustercat - Fast Word Clustering Software

  •    C

ClusterCat induces word classes from unannotated text. It is programmed in modern C, with no external libraries; a Python wrapper is also provided. Word classes are unsupervised part-of-speech tags, requiring no manually annotated corpus: words that share syntactic/semantic similarities are grouped together. They are used in many dozens of applications within natural language processing, machine translation, neural net training, and related fields.
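
The intuition behind inducing word classes from raw text is that words occurring in the same contexts belong together. A stdlib-only toy sketch of that signal (ClusterCat's actual algorithm is far more sophisticated; the corpus and feature scheme below are invented):

```python
import math
from collections import defaultdict

corpus = "the cat sat . the dog sat . a cat ran . a dog ran".split()

# Count left/right neighbours as context features for each word.
ctx = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    if i > 0:
        ctx[w]["L:" + corpus[i - 1]] += 1
    if i < len(corpus) - 1:
        ctx[w]["R:" + corpus[i + 1]] += 1

def cosine(a, b):
    dot = sum(a[f] * b[f] for f in a if f in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# "cat" and "dog" share all their contexts, so they fall into one class,
# while "cat" and "sat" share none.
print(cosine(ctx["cat"], ctx["dog"]) > cosine(ctx["cat"], ctx["sat"]))  # True
```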

S-WMD - Code for Supervised Word Mover's Distance (SWMD)

  •    Matlab

Demo code in Matlab for S-WMD (Supervised Word Mover's Distance, NIPS 2016) [oral presentation video recording by Matt Kusner]. The only difference from the above datasets is that, because there are pre-defined train-test splits, the variables BOW_xtr, BOW_xte, xtr, xte, ytr, and yte already exist.
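
For intuition, the (unsupervised) Word Mover's Distance measures how far the words of one document must "travel" in embedding space to reach the words of another. The full WMD solves an optimal-transport problem, and S-WMD additionally learns the metric from labels; the sketch below only computes the simple relaxed lower bound (each word moves to its nearest counterpart), with invented toy vectors:

```python
import math

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relaxed_wmd(doc_a, doc_b, vec):
    """Lower bound on Word Mover's Distance: each word in doc_a travels to its
    nearest word in doc_b (the exact WMD solves an optimal-transport problem)."""
    return sum(min(dist(vec[w], vec[u]) for u in doc_b) for w in doc_a) / len(doc_a)

# Toy embeddings: semantically close word pairs get nearby vectors.
vec = {"obama": [1.0, 0.0], "president": [0.9, 0.1],
       "speaks": [0.0, 1.0], "talks": [0.1, 0.9],
       "pasta": [5.0, 5.0]}
a = ["obama", "speaks"]
print(relaxed_wmd(a, ["president", "talks"], vec) <
      relaxed_wmd(a, ["pasta"], vec))  # True: the paraphrase is closer
```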

Mimick - Code for Mimicking Word Embeddings using Subword RNNs (EMNLP 2017)

  •    Python

Code for Mimicking Word Embeddings using Subword RNNs (EMNLP 2017) and subsequent experiments. I'm adding details to this documentation as I go. When I'm through, this comment will be gone.