GuidedLDA - semi-supervised guided topic model with custom guidedLDA


GuidedLDA (also called SeededLDA) implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. GuidedLDA can be guided by setting some seed words per topic, which makes the topics converge in that direction. You can read more about GuidedLDA in the documentation.

https://github.com/vi3k6i5/GuidedLDA
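
The idea of seeding a collapsed Gibbs sampler can be sketched in a few lines of plain Python. This is a toy illustration, not GuidedLDA's actual code or API: the function name `seeded_lda`, its parameters (`seed_words`, `seed_confidence`, `alpha`, `eta`) and the choice to apply the seeds only at initialisation are assumptions made for this example.

```python
import random


def seeded_lda(docs, n_topics, seed_words, seed_confidence=0.9,
               n_iter=200, alpha=0.1, eta=0.01, rng_seed=0):
    """Toy collapsed Gibbs sampler for LDA with seeded initialisation.

    docs is a list of token lists; seed_words maps topic id -> list of words.
    All names here are hypothetical and chosen for illustration only.
    """
    rng = random.Random(rng_seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    # Map each seed word that occurs in the corpus to its preferred topic.
    seed_topic = {word_id[w]: t for t, words in seed_words.items()
                  for w in words if w in word_id}
    n_words = len(vocab)

    ndk = [[0] * n_topics for _ in docs]            # document-topic counts
    nkw = [[0] * n_words for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                             # tokens per topic
    z = []                                          # topic of every token

    def bump(d, k, wid, delta):
        """Add delta to every count touched by one token assignment."""
        ndk[d][k] += delta
        nkw[k][wid] += delta
        nk[k] += delta

    # Seeded initialisation: a seed word starts in its seed topic with
    # probability seed_confidence, otherwise in a uniformly random topic.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            wid = word_id[w]
            if wid in seed_topic and rng.random() < seed_confidence:
                k = seed_topic[wid]
            else:
                k = rng.randrange(n_topics)
            assignments.append(k)
            bump(d, k, wid, +1)
        z.append(assignments)

    # Standard collapsed Gibbs sweeps: remove a token, resample its topic
    # from the conditional distribution, and add it back.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                bump(d, z[d][i], wid, -1)
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wid] + eta)
                    / (nk[t] + n_words * eta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                bump(d, k, wid, +1)

    return vocab, nkw, ndk
```

Calling `seeded_lda(docs, 2, {0: ["goal"], 1: ["vote"]})` on a small tokenised corpus returns the vocabulary plus the topic-word and document-topic count tables; sorting a row of `nkw` gives the most frequent words of that topic. Because the seeds only bias the starting state in this sketch, the sampler is still free to move seed words to other topics in later sweeps.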

Related Projects

gensim - Topic Modelling for Humans

  •    Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.

lightlda - Scalable, fast, and lightweight system for large-scale topic modeling

  •    C++

LightLDA is a distributed system for large-scale topic modeling. It implements a distributed sampler that enables very large data sizes and models. LightLDA improves sampling throughput and convergence speed via a fast O(1) Metropolis-Hastings algorithm, and allows a small cluster to tackle very large data and model sizes through a model-scheduling and data-parallelism architecture. LightLDA is implemented in C++ for performance.

LDAvis - R package for web-based interactive topic model visualization.

  •    Javascript

R package for interactive topic model visualization. LDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

Snorkel - A system for quickly generating training data with weak supervision

  •    Jupyter

Snorkel is a system for rapidly creating, modeling, and managing training data, currently focused on accelerating the development of structured or "dark" data extraction applications for domains in which large labeled training sets are not available or easy to obtain. Today's state-of-the-art machine learning models require massive labeled training sets, which usually do not exist for real-world applications. Instead, Snorkel is based around the new data programming paradigm, in which the developer focuses on writing a set of labeling functions, which are just scripts that programmatically label data. The resulting labels are noisy, but Snorkel automatically models this process (learning, essentially, which labeling functions are more accurate than others) and then uses this to train an end model (for example, a deep neural network in TensorFlow).
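
To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use Snorkel's API: the function names, the spam/ham task and the label constants are all hypothetical, and the noisy votes are combined with a simple majority vote rather than the learned accuracy model that Snorkel is described as using.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam-detection task.
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text):
    """Vote SPAM when the message contains a link, otherwise abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    """Vote SPAM when the message mentions a prize, otherwise abstain."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    """Vote HAM for very short messages, otherwise abstain."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def majority_vote(text):
    """Combine the noisy votes; abstain on ties or when nothing fires."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]
```

Each function encodes one cheap heuristic and is allowed to be wrong or to abstain. Replacing `majority_vote` with a model that estimates how reliable each function is corresponds to the step the description above attributes to Snorkel.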


edward - A probabilistic programming language in TensorFlow

  •    Jupyter

Edward is a Python library for probabilistic modeling, inference, and criticism. It is a testbed for fast experimentation and research with probabilistic models, ranging from classical hierarchical models on small data sets to complex deep probabilistic models on large data sets. Edward fuses three fields: Bayesian statistics and machine learning, deep learning, and probabilistic programming. Edward is built on top of TensorFlow. It enables features such as computational graphs, distributed training, CPU/GPU integration, automatic differentiation, and visualization with TensorBoard.

owl - Owl is an OCaml library for scientific and engineering computing.

  •    OCaml

Owl is an emerging numerical library for scientific computing and engineering. The library is developed in the OCaml language and inherits all its powerful features, such as static type checking, a powerful module system, and superior runtime efficiency. Owl allows you to write succinct, type-safe numerical applications in a functional language without sacrificing performance, which significantly reduces the cost of moving from prototype to production use. Owl's documentation contains a lot of learning materials to help you start. The full documentation consists of two parts: Tutorial Book and API Reference. Both are perfectly synchronised with the code in the repository by the automatic building system. You can access both parts with the following link.

text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

  •    R

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP). To learn how to use this package, see text2vec.org and the package vignettes. See also the text2vec articles on my blog.

LSTM-Human-Activity-Recognition - Human Activity Recognition example using TensorFlow on smartphone sensors dataset and an LSTM RNN (Deep Learning algo)

  •    Jupyter

Compared to a classical approach, using a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells requires little or no feature engineering. Data can be fed directly into the neural network, which acts like a black box and models the problem correctly. Other research on the activity recognition dataset often relies on a large amount of feature engineering, which is more of a signal processing approach combined with classical data science techniques. The approach here is very simple in terms of how much the data was preprocessed. Let's use Google's neat Deep Learning library, TensorFlow, to demonstrate the usage of an LSTM, a type of Artificial Neural Network that can process sequential data / time series.

Bayesian-Modelling-in-Python - A python tutorial on bayesian modeling techniques (PyMC3)

  •    Jupyter

Welcome to "Bayesian Modelling in Python" - a tutorial for those interested in learning how to apply Bayesian modelling techniques in Python (PyMC3). This tutorial doesn't aim to be a Bayesian statistics tutorial, but rather a programming cookbook for those who understand the fundamentals of Bayesian statistics and want to learn how to build Bayesian models using Python. The tutorial sections and topics can be seen below. Statistics is a topic that never resonated with me throughout university. The frequentist techniques that we were taught (p-values etc.) felt contrived, and ultimately I turned my back on statistics as a topic that I wasn't interested in.

mlr - mlr: Machine Learning in R

  •    R

Please cite our JMLR paper [bibtex]. Some parts of the package were created as part of other publications. If you use these parts, please cite the relevant work appropriately. An overview of all mlr related publications can be found here.

Skater - Python Library for Model Interpretation/Explanations

  •    Python

Skater is a unified framework to enable Model Interpretation for all forms of models, to help one build the interpretable machine learning systems often needed for real-world use cases (we are actively working towards enabling faithful interpretability for all forms of models). It is an open source Python library designed to demystify the learned structures of a black box model both globally (inference on the basis of a complete data set) and locally (inference about an individual prediction). The project was started as a research idea to find ways to enable better interpretability (preferably human interpretability) of predictive "black boxes" for both researchers and practitioners. The project is still in beta phase.

Orange - Data Mining Suite

  •    Python

Orange is a component-based data mining software. It includes a range of data visualization, exploration, preprocessing and modeling techniques. It supports interactive data analysis workflows with a large toolbox.

Jupyter - Web-based notebook environment for interactive computing

  •    Python

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more. It supports over 40 programming languages.

Pyro - Deep universal probabilistic programming with Python and PyTorch

  •    Python

Pyro is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the backend. Pyro enables flexible and expressive deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling.

practical-machine-learning-with-python - Master the essential skills needed to recognize and solve complex real-world problems with Machine Learning and Deep Learning by leveraging the highly popular Python Machine Learning Eco-system

  •    Jupyter

"Data is the new oil" is a saying you have probably heard by now, given the huge recent interest in Big Data, Machine Learning, Artificial Intelligence, and Deep Learning. Besides this, data scientists have been described as having "The sexiest job in the 21st Century", which makes it all the more worthwhile to build up some valuable expertise in these areas. Getting started with machine learning in the real world can be overwhelming with the vast amount of resources out there on the web. "Practical Machine Learning with Python" follows a structured and comprehensive three-tiered approach packed with concepts, methodologies, hands-on examples, and code. This book is packed with over 500 pages of useful information which helps its readers master the essential skills needed to recognize and solve complex problems with Machine Learning and Deep Learning by following a data-driven mindset. By using real-world case studies that leverage the popular Python Machine Learning ecosystem, this book is your perfect companion for learning the art and science of Machine Learning to become a successful practitioner. The concepts, techniques, tools, frameworks, and methodologies used in this book will teach you how to think, design, build, and execute Machine Learning systems and projects successfully.

hypertools - A Python toolbox for gaining geometric insights into high-dimensional data

  •    Python

HyperTools is designed to facilitate dimensionality reduction-based visual explorations of high-dimensional data. The basic pipeline is to feed in a high-dimensional dataset (or a series of high-dimensional datasets) and, in a single function call, reduce the dimensionality of the dataset(s) and create a plot. The package is built atop many familiar friends, including matplotlib, scikit-learn and seaborn. Our package was recently featured on Kaggle's No Free Hunch blog. For a general overview, you may find this talk useful (given as part of the MIND Summer School at Dartmouth). Check the repo of Jupyter notebooks from the HyperTools paper.

OpenML - Open Machine Learning

  •    CSS

We are a group of people who are excited about open science, open data and machine learning. We want to make machine learning and data analysis simple, accessible, collaborative and open with an optimal division of labour between computers and humans. OpenML is an online machine learning platform for sharing and organizing data, machine learning algorithms and experiments. It is designed to create a frictionless, networked ecosystem, that you can readily integrate into your existing processes/code/environments, allowing people all over the world to collaborate and build directly on each other’s latest ideas, data and results, irrespective of the tools and infrastructure they happen to use.

Math-of-Machine-Learning-Course-by-Siraj - Implements common data science methods and machine learning algorithms from scratch in python

  •    Jupyter

This repository was initially created to submit machine learning assignments for Siraj Raval's online machine learning course. The purpose of the course was to learn how to implement the most common machine learning algorithms from scratch (without using machine learning libraries such as TensorFlow, PyTorch, scikit-learn, etc.). Although that course has ended now, I am continuing to learn data science and machine learning from other sources such as Coursera, online blogs, and machine learning lectures at the University of Toronto. Sticking to the theme of implementing machine learning algorithms from scratch, I will continue to post detailed notebooks in Python here as I learn more.

PyMC3 - Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Theano

  •    Python

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning which focuses on advanced Markov chain Monte Carlo and variational fitting algorithms. Its flexibility and extensibility make it applicable to a large suite of problems. Note: Running pip install pymc will install PyMC 2.3, not PyMC3, from PyPI.