gsdmm - GSDMM: Short text clustering

  •        1683

This project is an easy to read reference implementation of GSDMM -- I don't plan to maintain it unless there is demand. I am however actively maintaining the much faster Rust version of GSDMM here. In their paper, the authors introduce a simple conceptual model for explaining the GSDMM called the Movie Group Process.



Related Projects

GibbsLDA++: A C/C++ Gibbs Sampling LDA

  •    C++

GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling for parameter estimation and inference. GibbsLDA++ is fast and is designed to analyze hidden/latent topic structures of large-scale (text) data collections.

PyTorch-NLP - Supporting Rapid Prototyping with a Toolkit (incl. Datasets and Neural Network Layers)

  •    Python

PyTorch-NLP, or torchnlp for short, is a library of neural network layers, text processing modules and datasets designed to accelerate Natural Language Processing (NLP) research. Join our community, add datasets and neural network layers! Chat with us on Gitter and join the Google Group, we're eager to collaborate with you.

text-analytics-with-python - Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer

  •    Python

Derive useful insights from your data using Python. Learn the techniques related to natural language processing and text analytics, and gain the skills to know which technique is best suited to solve a particular problem. A structured and comprehensive approach is followed in this book so that readers with little or no experience do not find themselves overwhelmed. You will start with the basics of natural language and Python and move on to advanced analytical and machine learning concepts. You will look at each technique and algorithm with both a bird's eye view to understand how it can be used as well as with a microscopic view to understand the mathematical concepts and to implement them to solve your own problems.

headlines - Automatically generate headlines to short articles

  •    Jupyter

It is assumed that you already have training and test data. The data is made from many examples (I'm using 684K examples), each example is made from the text from the start of the article, which I call description (or desc), and the text of the original headline (or head). The texts should be already tokenized and the tokens separated by spaces. Once you have the data ready save it in a python pickle file as a tuple: (heads, descs, keywords) were heads is a list of all the head strings, descs is a list of all the article strings in the same order and length as heads. I ignore the keywrods information so you can place None.

NeuronBlocks - NLP DNN Toolkit - Building Your NLP DNN Models Like Playing Lego

  •    Python

NeuronBlocks is a NLP deep learning modeling toolkit that helps engineers/researchers to build end-to-end pipelines for neural network model training for NLP tasks. The main goal of this toolkit is to minimize developing cost for NLP deep neural network model building, including both training and inference stages. NeuronBlocks consists of two major components: Block Zoo and Model Zoo.

CoreNLP - Stanford CoreNLP: A Java suite of core NLP tools.

  •    Java

Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. It provides the foundational building blocks for higher level text understanding applications.

lectures - Oxford Deep NLP 2017 course


This repository contains the lecture slides and course description for the Deep Natural Language Processing course offered in Hilary Term 2017 at the University of Oxford. This is an applied course focussing on recent advances in analysing and generating speech and text using recurrent neural networks. We introduce the mathematical definitions of the relevant machine learning models and derive their associated optimisation algorithms. The course covers a range of applications of neural networks in NLP including analysing latent dimensions in text, transcribing speech to text, translating between languages, and answering questions. These topics are organised into three high level themes forming a progression from understanding the use of neural networks for sequential language modelling, to understanding their use as conditional language models for transduction tasks, and finally to approaches employing these techniques in combination with other mechanisms for advanced applications. Throughout the course the practical implementation of such models on CPU and GPU hardware is also discussed.

detext - DeText: A Deep Neural Text Understanding Framework for Ranking and Classification Tasks

  •    Python

DeText is a Deep Text understanding framework for NLP related ranking, classification, and language generation tasks. It leverages semantic matching using deep neural networks to understand member intents in search and recommender systems. As a general NLP framework, DeText can be applied to many tasks, including search & recommendation ranking, multi-class classification and query understanding tasks.

MCCCS Towhee

  •    C

Towhee is a Monte Carlo molecular simulation code originally designed for the prediction of fluid phase equilibria using atom-based force fields and the Gibbs ensemble with particular attention paid to algorithms addressing molecule conformation sampling.


  •    Java

A Java Implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling for Parameter Estimation and Inference



Mocapy++ is a Dynamic Bayesian Network toolkit, implemented in C++. It supports discrete, multinomial, Gaussian, Kent, Von Mises and Poisson nodes. Inference and learning is done by Gibbs sampling/Stochastic-EM.

Smile - Statistical Machine Intelligence & Learning Engine

  •    Java

Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpolation, and visualization system in Java and Scala. With advanced data structures and algorithms, Smile delivers state-of-art performance.Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc.

jina - Cloud-native neural search framework for 𝙖𝙣𝙮 kind of data

  •    Python

Jina🔊 is a neural search framework that empowers anyone to build SOTA & scalable deep learning search applications in minutes. 🌌 All data types - Large-scale indexing and querying of any kind of unstructured data: video, image, long/short text, music, source code, PDF, etc.

nlp - Extract values from strings and fill your structs with nlp.

  •    Go

You will always begin by creating a NL type calling nlp.New(), the NL type is a Natural Language Processor that owns 3 funcs, RegisterModel(), Learn() and P().RegisterModel takes 3 parameters, an empty struct, a set of samples and some options for the model.

pytextrank - Python implementation of TextRank for text document NLP parsing and summarization

  •    Jupyter

Python implementation of TextRank, based on the Mihalcea 2004 paper. The results produced by this implementation are intended more for use as feature vectors in machine learning, not as academic paper summaries.

transformers - 🤗Transformers: State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX

  •    Python

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting-edge NLP easier to use for everyone. 🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments.

gpt-3 - GPT-3: Language Models are Few-Shot Learners


Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

We have large collection of open source products. Follow the tags from Tag Cloud >>

Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.