
gensim - Topic Modelling for Humans

  •    Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.
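
As a rough illustration of the Vector Space Model mentioned above, the sketch below represents documents as bag-of-words count vectors and ranks them by cosine similarity to a query. It uses only the Python standard library and is not Gensim's API; the function names `bow_vector`, `cosine`, and `rank`, as well as the toy corpus, are assumptions made for this example.

```python
import math
from collections import Counter


def bow_vector(text):
    """Lowercase, split on whitespace, and count tokens (a sparse bag-of-words vector)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (index, score) pairs sorted by descending similarity to the query."""
    q = bow_vector(query)
    scores = [(i, cosine(q, bow_vector(doc))) for i, doc in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy corpus, invented for this example.
docs = [
    "topic modelling with large corpora",
    "cooking pasta at home",
    "similarity retrieval for large document corpora",
]
ranking = rank("large corpora retrieval", docs)
```

Here `ranking` puts the third document first, since it shares the most terms with the query relative to its length. A library such as Gensim typically adds weighting schemes and streaming over corpora that do not fit in memory, which this sketch leaves out.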

Chinese-Word-Vectors - 100+ Chinese Word Vectors 上百种预训练中文词向量

  •    Python

This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks. Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.

text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

  •    R

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP). To learn how to use this package, see text2vec.org and the package vignettes. See also the text2vec articles on my blog.

magnitude - A fast, efficient universal vector embedding utility package.

  •    Python

A feature-packed Python package and vector storage file format, developed by Plasticity, for utilizing vector embeddings in machine learning models in a fast, efficient, and simple manner. It is primarily intended to be a simpler / faster alternative to Gensim, but can be used as a generic key-vector store for domains outside NLP. Vector space embedding models have become increasingly common in machine learning and traditionally have been popular for natural language processing applications. A fast, lightweight tool to consume these large vector space embedding models efficiently is lacking.

flair - A very simple framework for state-of-the-art NLP

  •    Python

A very simple framework for state-of-the-art NLP. Developed by Zalando Research. A powerful syntactic-semantic tagger / classifier. Flair allows you to apply our state-of-the-art models for named entity recognition (NER), part-of-speech tagging (PoS), frame sense disambiguation, chunking and classification to your text.

BioSentVec - BioWordVec & BioSentVec: pre-trained embeddings for biomedical words and sentences

  •    Jupyter

We created biomedical word and sentence embeddings using PubMed and the clinical notes from MIMIC-III Clinical Database. Both PubMed and MIMIC-III texts were split and tokenized using NLTK. We also lowercased all the words. The statistics of the two corpora are shown below. We applied fastText to compute 200-dimensional word embeddings. We set the window size to be 20, learning rate 0.05, sampling threshold 1e-4, and negative examples 10. Both the word vectors and the model with hyperparameters are available for download below. The model file can be used to compute word vectors that are not in the dictionary (i.e. out-of-vocabulary terms). This work extends the original BioWordVec which provides fastText word embeddings trained using PubMed and MeSH. We used the same parameters as the original BioWordVec which has been thoroughly evaluated in a range of applications.
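
The hyperparameters listed above could be passed to fastText's Python bindings roughly as follows. This is a sketch under assumptions, not the authors' training script: the helper name `train_embeddings` and the corpus path are hypothetical, and the model type is left at the library default because the description does not state it.

```python
# Hyperparameters quoted in the description above, mapped onto the keyword
# names used by fastText's Python bindings.
PARAMS = {
    "dim": 200,  # vector dimensionality
    "ws": 20,    # context window size
    "lr": 0.05,  # learning rate
    "t": 1e-4,   # sampling threshold
    "neg": 10,   # number of negative examples
}


def train_embeddings(corpus_path, params=None):
    """Train unsupervised word vectors on a tokenized, lowercased text file.

    `corpus_path` is a hypothetical path to your own corpus. The third-party
    `fasttext` package is imported lazily, so this module still loads when
    the package is not installed.
    """
    import fasttext

    settings = dict(PARAMS if params is None else params)
    return fasttext.train_unsupervised(corpus_path, **settings)
```

With a trained model, `model.get_word_vector(word)` returns a vector even for an out-of-vocabulary word, because fastText composes it from character n-grams.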

Hands-On-Deep-Learning-Algorithms-with-Python - Master Deep Learning Algorithms with Extensive Math by Implementing them using TensorFlow

  •    Jupyter

Deep learning is one of the most popular domains in artificial intelligence (AI), allowing you to develop multi-layered models of varying complexity. This book takes you from basic deep learning algorithms to more advanced ones. For each algorithm, you first build an intuitive understanding, then work through the underlying math, and then learn to implement it step by step in TensorFlow. The book covers almost all of the state-of-the-art deep learning algorithms. First, you will get a good understanding of the fundamentals of neural networks and several variants of gradient descent. Later, you will explore RNN, bidirectional RNN, LSTM, GRU, seq2seq, CNN, capsule nets and more. Then you will cover GANs and their variants, along with several types of autoencoders.

MachineLearningSamples-BiomedicalEntityExtraction - MachineLearningSamples-BiomedicalEntityExtraction

  •    Python

This real-world scenario focuses on how a large corpus of unstructured, unlabeled data, such as PubMed article abstracts, can be analyzed to train a domain-specific word embedding model. The output embeddings are then used as automatically generated features to train a neural entity extraction model using Keras, with TensorFlow as the backend, and a small amount of labeled data. Detailed documentation for this scenario, including a step-by-step walk-through, is available at https://review.docs.microsoft.com/en-us/azure/machine-learning/preview/scenario-tdsp-biomedical-recognition.

dna2vec - dna2vec: Consistent vector representations of variable-length k-mers

  •    Python

Dna2vec is an open-source library to train distributed representations of variable-length k-mers. Note that this implementation has only been tested on Python 3.5.3, but we welcome any contributions or bug reporting to make it more accessible.


  •    Jupyter

Also, check out this link to download the final .bin model and the preprocessed dataset.

fastrtext - R wrapper for fastText

  •    C++

R wrapper for fastText C++ code from Facebook. fastText is a library for efficient learning of word representations and sentence classification.

projector - Project Dense Vectors Text Representation on 2D Plan

  •    R

Project dense vector representations of texts on a 2D plane to better understand neural models applied to NLP. Since the famous word2vec, embeddings are everywhere in NLP (and other close areas like IR). The main idea behind embeddings is to represent texts (made of characters, words, sentences, or even larger blocks) as numeric vectors. This works very well and provides some abilities unreachable with the classic BoW approach. However, embeddings (i.e. vector representations) are difficult for humans to understand, analyze, and debug because they have many more than 3 dimensions.

Emoji2recipe - Recipe prediction model from emojis

  •    Python

For more info on Azure ML Workbench compute targets see documentation. Download the word2vec embeddings and emoji2vec embeddings and update respective paths in config.py.

wego - Word2Vec, GloVe in Go!

  •    Go

This is an implementation of word embedding (a.k.a. word representation) models in Go. The models generate vectors that capture word meaning, so that relationships between words can be computed by arithmetic operations on their vectors.
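
The idea of computing word meaning by vector arithmetic can be sketched in a few lines of Python. This is not wego's API: the function names `cosine` and `analogy` are assumptions, and the 3-dimensional vectors are hand-made toy values chosen so the arithmetic is easy to follow.

```python
import math

# Toy 3-dimensional vectors, invented for illustration; real models learn
# vectors with far more dimensions from a corpus.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, table):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [word for word in table if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, table[word]))


result = analogy("king", "man", "woman", vectors)
```

With these toy values, `result` is `"queen"`, because the vector for king minus man plus woman points in the same direction as the vector for queen. Excluding the three input words from the candidates is a common convention, since the nearest neighbour of the target is otherwise often one of the inputs.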

clustercat - Fast Word Clustering Software

  •    C

ClusterCat induces word classes from unannotated text. It is programmed in modern C, with no external libraries. A Python wrapper is also provided. Word classes are unsupervised part-of-speech tags, requiring no manually-annotated corpus. Words are grouped together that share syntactic/semantic similarities. They are used in many dozens of applications within natural language processing, machine translation, neural net training, and related fields.

S-WMD - Code for Supervised Word Mover's Distance (SWMD)

  •    Matlab

Demo code in Matlab for S-WMD [Supervised Word Mover's Distance, NIPS 2016] [Oral presentation video recording by Matt Kusner]. The only difference from the above datasets is that, because there are pre-defined train-test splits, the variables BOW_xtr, BOW_xte, xtr, xte, ytr, yte already exist.
