nonce2vec - This is the repo accompanying the paper "High-risk learning: acquiring new word vectors from tiny data" (Herbelot & Baroni, 2017)

A. Herbelot and M. Baroni. 2017. High-risk learning: Acquiring new word vectors from tiny data. Proceedings of EMNLP 2017 (Conference on Empirical Methods in Natural Language Processing).

Distributional semantics models are known to struggle with small data. It is generally accepted that in order to learn 'a good vector' for a word, a model must have sufficient examples of its usage. This contradicts the fact that humans can guess the meaning of a word from a few occurrences only. In this paper, we show that a neural language model such as Word2Vec only necessitates minor modifications to its standard architecture to learn new terms from tiny data, using background knowledge from a previously learnt semantic space. We test our model on word definitions and on a nonce task involving 2-6 sentences' worth of context, showing a large increase in performance over state-of-the-art models on the definitional task.

https://github.com/minimalparts/nonce2vec

Related Projects

sense2vec - 🦆 Use NLP to go beyond vanilla word2vec

  •    C++

sense2vec (Trask et al., 2015) is a nice twist on word2vec that lets you learn more interesting, detailed and context-sensitive word vectors. For an interactive example of the technology, see our sense2vec demo that lets you explore semantic similarities across all Reddit comments of 2015. This library is a simple Python/Cython implementation for loading and querying sense2vec models. While it's best used in combination with spaCy, the sense2vec library itself is very lightweight and can also be used as a standalone module.

practical-1 - Oxford Deep NLP 2017 course - Practical 1: word2vec

  •    Jupyter

For this practical, you'll be provided with a partially-complete IPython notebook, an interactive web-based Python computing environment that allows us to mix text, code, and interactive plots. We will be training word2vec models on TED Talk and Wikipedia data, using the word2vec implementation included in the Python package gensim. After training the models, we will analyze and visualize the learned embeddings.

word2vec-sentiments - Tutorial for Sentiment Analysis using Doc2Vec in gensim (or "getting 87% accuracy in sentiment analysis in under 100 lines of code")

  •    Jupyter

Word2Vec documentation is poor, and the C code is nigh unreadable (700 lines of highly optimized, and sometimes weirdly optimized, code). I personally spent a lot of time untangling Doc2Vec and crashing into ~50% accuracies due to implementation mistakes. This tutorial aims to help other users get off the ground using Word2Vec for their own research. We use Word2Vec for sentiment analysis by attempting to classify the Cornell IMDB movie review corpus (http://www.cs.cornell.edu/people/pabo/movie-review-data/). The specific data set used is available for download at http://ai.stanford.edu/~amaas/data/sentiment/. The code to just run Doc2Vec and save the model as imdb.d2v can be found in run.py, which should be useful for running on computer clusters.

magnitude - A fast, efficient universal vector embedding utility package.

  •    Python

A feature-packed Python package and vector storage file format for utilizing vector embeddings in machine learning models in a fast, efficient, and simple manner developed by Plasticity. It is primarily intended to be a simpler / faster alternative to Gensim, but can be used as a generic key-vector store for domains outside NLP. Vector space embedding models have become increasingly common in machine learning and traditionally have been popular for natural language processing applications. A fast, lightweight tool to consume these large vector space embedding models efficiently is lacking.

gensim - Topic Modelling for Humans

  •    Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.


word2vec - Python interface to Google word2vec

  •    C

Python interface to Google word2vec. Training is done using the original C code, other functionality is pure Python with numpy.

word2vec-graph - Exploring word2vec embeddings as a graph of nearest neighbors

  •    Python

This visualization builds graphs of nearest neighbors from high-dimensional word2vec embeddings. The dataset used for this visualization comes from GloVe, and has 6B tokens, 400K vocabulary, 300-dimensional vectors.
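The idea of a nearest-neighbor graph over embeddings can be illustrated with a minimal sketch. This is not the word2vec-graph project's code: the function names `cosine` and `knn_graph`, the toy three-dimensional vectors, and the choice of cosine similarity with brute-force search are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_graph(vectors, k):
    """Map each word to its k most similar other words (brute force)."""
    graph = {}
    for word, vec in vectors.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != word
        ]
        # Highest similarity first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        graph[word] = [other for _, other in scored[:k]]
    return graph


# Hypothetical toy embeddings, not real word2vec or GloVe vectors.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.9],
    "pear": [0.15, 0.1, 0.95],
}

graph = knn_graph(toy_vectors, k=1)
print(graph)
```

Brute-force search compares every pair of words, so it only suits small vocabularies; a 400K-word vocabulary would likely call for an approximate nearest-neighbor index instead.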

wiki2vec - Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps

  •    Java

Utilities for creating Word2Vec vectors for DBpedia entities via a Wikipedia dump. With the release of Word2Vec, the Google team also released vectors for Freebase entities trained on Wikipedia. These vectors are useful for a variety of tasks.

Word2VEC_java - An implementation of word2vec in Java

  •    Java

An implementation of word2vec in Java

word2vec_commented - Commented (but unaltered) version of original word2vec C implementation.

  •    C

This project is a functionally unaltered version of Google's published word2vec implementation in C, with source comments added. If you're new to word2vec, I recommend reading my tutorial first.

word2vec - Word2Vec in C++ 11

  •    C++

Word2Vec in C++ 11

text-analytics-with-python - Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer

  •    Python

Derive useful insights from your data using Python. Learn the techniques related to natural language processing and text analytics, and gain the skills to know which technique is best suited to solve a particular problem. A structured and comprehensive approach is followed in this book so that readers with little or no experience do not find themselves overwhelmed. You will start with the basics of natural language and Python and move on to advanced analytical and machine learning concepts. You will look at each technique and algorithm with both a bird's eye view to understand how it can be used as well as with a microscopic view to understand the mathematical concepts and to implement them to solve your own problems.

word2vec - GitHub clone of SVN repo http://word2vec

  •    C

GitHub clone of SVN repo http://word2vec.googlecode.com/svn/trunk/ (cloned by http://svn2github.com/)

DeepLearningMovies - Kaggle's competition for using Google's word2vec package for sentiment analysis

  •    Python

Kaggle's competition for using Google's word2vec package for sentiment analysis

Word2Bits - Quantized word vectors that take 8x-16x less space than regular word vectors

  •    C++

Word2Bits extends the Word2Vec algorithm to output high quality quantized word vectors that take 8x-16x less storage than regular word vectors. Read the details at https://arxiv.org/abs/1803.05651. Quantized word vectors are word vectors where each parameter is one of 2^bitlevel values.
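To make "each parameter is one of 2^bitlevel values" concrete, here is a minimal sketch of a uniform quantizer. It is an assumption for illustration only: the function name `quantize`, the clipping range [-1, 1], and the use of bin midpoints are hypothetical choices, and the actual Word2Bits quantization function (described in the linked paper) may differ.

```python
def quantize(vector, bitlevel, low=-1.0, high=1.0):
    """Snap each value to the midpoint of one of 2**bitlevel equal bins.

    Assumed scheme for illustration; not the Word2Bits quantizer itself.
    """
    levels = 2 ** bitlevel
    step = (high - low) / levels
    quantized = []
    for x in vector:
        # Clip into [low, high] so every value falls inside some bin.
        clipped = min(max(x, low), high)
        # Bin index in 0..levels-1 (the top edge goes into the last bin).
        index = min(int((clipped - low) / step), levels - 1)
        quantized.append(low + (index + 0.5) * step)
    return quantized


vec = [0.9, -0.2, 0.05, -0.7]
q1 = quantize(vec, 1)  # at most 2 distinct values
q2 = quantize(vec, 2)  # at most 4 distinct values
print(q1)
print(q2)
```

Because every parameter takes one of only 2^bitlevel values, it can be stored as a bitlevel-bit index rather than a full floating-point number, which is where the storage saving comes from.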

flashtext - Extract Keywords from sentence or Replace keywords in sentences.

  •    Python

This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm. Documentation can be found at FlashText Read the Docs.