gensim - Topic Modelling for Humans

  •    Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.
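As a rough illustration of the Vector Space Model idea mentioned above, the sketch below ranks documents against a query by cosine similarity of raw bag-of-words counts. It uses only the Python standard library and is not Gensim's API; the function names `cosine` and `most_similar` are hypothetical, and real systems would typically add weighting such as TF-IDF.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query, corpus):
    """Return corpus indices ordered from most to least similar to the query."""
    q = Counter(query.lower().split())
    vectors = [Counter(doc.lower().split()) for doc in corpus]
    scores = [cosine(q, v) for v in vectors]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)


corpus = [
    "human computer interaction",
    "graph of trees",
    "computer system response time",
]
ranking = most_similar("computer interaction", corpus)
print(ranking)
```

The first index in `ranking` points at the document sharing the most weighted terms with the query; documents with no terms in common score zero.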

hypertools - A Python toolbox for gaining geometric insights into high-dimensional data

  •    Python

HyperTools is designed to facilitate dimensionality reduction-based visual explorations of high-dimensional data. The basic pipeline is to feed in a high-dimensional dataset (or a series of high-dimensional datasets) and, in a single function call, reduce the dimensionality of the dataset(s) and create a plot. The package is built atop many familiar friends, including matplotlib, scikit-learn and seaborn. Our package was recently featured on Kaggle's No Free Hunch blog. For a general overview, you may find this talk useful (given as part of the MIND Summer School at Dartmouth). Check the repo of Jupyter notebooks from the HyperTools paper.

text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

  •    R

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP). To learn how to use this package, see text2vec.org and the package vignettes. See also the text2vec articles on my blog.




owl - Owl is an OCaml library for scientific and engineering computing.

  •    OCaml

Owl is an emerging numerical library for scientific computing and engineering. The library is developed in the OCaml language and inherits all its powerful features such as static type checking, powerful module system, and superior runtime efficiency. Owl allows you to write succinct type-safe numerical applications in functional language without sacrificing performance, significantly reduces the cost from prototype to production use. Owl's documentation contains a lot of learning materials to help you start. The full documentation consists of two parts: Tutorial Book and API Reference. Both are perfectly synchronised with the code in the repository by the automatic building system. You can access both parts with the following link.

LDAvis - R package for web-based interactive topic model visualization.

  •    Javascript

R package for interactive topic model visualization. LDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

lda - LDA topic modeling for node.js

  •    Javascript

Latent Dirichlet allocation (LDA) topic modeling in JavaScript for Node.js. LDA is a machine learning algorithm that extracts topics and their related keywords from a collection of documents. In LDA, a document may contain several different topics, each with its own related terms. The algorithm uses a probabilistic model to detect the specified number of topics and extract their related keywords. For example, a document may contain topics that could be classified as beach-related and weather-related. The beach topic may contain related words, such as sand, ocean, and water. Similarly, the weather topic may contain related words, such as sun, temperature, and clouds.
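To make the description concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python rather than with the Node.js package listed here. The function name `lda_gibbs`, the hyperparameter defaults, and the toy beach/weather documents are assumptions for illustration; it is not the implementation used by this or any other listed project.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of words)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]            # topic counts per document
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                          # total words per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Report up to three highest-count words per topic.
    top_words = [
        [w for w in sorted(tw, key=tw.get, reverse=True) if tw[w] > 0][:3]
        for tw in topic_word
    ]
    return doc_topic, top_words


docs = [
    ["sand", "ocean", "water", "sand"],
    ["sun", "clouds", "temperature"],
    ["ocean", "sun", "water"],
]
doc_topic, top_words = lda_gibbs(docs, n_topics=2, n_iter=50)
print(doc_topic, top_words)
```

Each row of `doc_topic` counts how many tokens of a document were assigned to each topic, which is how one document can mix several topics. On a corpus this small the topics found depend heavily on the random seed.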

trlda - Implementations of various online inference algorithms for LDA, with Python interface.

  •    C++

Additional features include adaptive learning rates (Ranganath et al., 2013) and automatic tuning of hyperparameters via empirical Bayes. I have tested the code with the versions above, but older versions might also work.


ethz-web-scale-data-mining-project - ETH Zurich - Web Scale Data Processing and Mining Project

  •    HTML

This is the main repository for the web scale data mining project, which took place in summer 2014 as a research project. One of the results is the visualized topics, which have been learned autonomously from terabytes of raw HTML data.

hlda - Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model

  •    Jupyter

Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non-parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and readily accommodates growing data collections. The hLDA model combines this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation.
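The nested Chinese restaurant process can be sketched as follows: each document walks down a tree, at every node choosing an existing child with probability proportional to how many earlier documents chose it, or opening a new child with probability proportional to a concentration parameter. This is a minimal standard-library sketch of the prior only, not the Gibbs sampler from this repository; `Node`, `sample_path`, and `gamma` are illustrative names.

```python
import random


class Node:
    """A tree node that tracks how many documents passed through it."""

    def __init__(self):
        self.customers = 0
        self.children = []


def sample_path(root, depth, gamma, rng):
    """Draw one root-to-leaf path of `depth` nodes from a nested CRP."""
    node = root
    node.customers += 1
    path = [node]
    for _ in range(depth - 1):
        # Existing children are weighted by popularity; the last weight
        # (gamma) is the chance of opening a brand-new branch.
        weights = [child.customers for child in node.children] + [gamma]
        idx = rng.choices(range(len(weights)), weights=weights)[0]
        if idx == len(node.children):
            node.children.append(Node())
        node = node.children[idx]
        node.customers += 1
        path.append(node)
    return path


rng = random.Random(0)
root = Node()
paths = [sample_path(root, depth=3, gamma=1.0, rng=rng) for _ in range(20)]
print(len(root.children))
```

Because a new branch is always possible, the branching factor is unbounded, and the tree keeps growing as more documents are added.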

topicModels - Topic models extension for Mallet & scikit-learn

  •    Java

The Mallet package contains only two topic models: LDA and Hierarchical LDA. So I tried to implement some useful topic modeling methods on top of it. This extension was merged into scikit-learn version 0.17.

GuidedLDA - semi supervised guided topic model with custom guidedLDA

  •    Python

GuidedLDA (or SeededLDA) implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. GuidedLDA can be guided by setting some seed words per topic, which makes the topics converge in that direction. You can read more about guidedlda in the documentation.
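One way seed words can steer a Gibbs sampler is by biasing the initial topic assignments, so that seeded words start out in their intended topic. The sketch below shows only that initialization step in plain Python. The function `seeded_init`, its `confidence` parameter, and the toy data are assumptions for illustration and are not taken from the guidedlda package.

```python
import random


def seeded_init(docs, n_topics, seed_topics, confidence=0.9, seed=0):
    """Assign an initial topic to every token, nudging seed words.

    seed_topics maps a word to the topic id it should lean towards.
    With probability `confidence` a seed word gets that topic; every
    other token gets a uniformly random topic.
    """
    rng = random.Random(seed)
    assignments = []
    for doc in docs:
        z_doc = []
        for w in doc:
            if w in seed_topics and rng.random() < confidence:
                z_doc.append(seed_topics[w])
            else:
                z_doc.append(rng.randrange(n_topics))
        assignments.append(z_doc)
    return assignments


docs = [["sand", "ocean", "sun"], ["clouds", "sun", "water"]]
seed_topics = {"sand": 0, "ocean": 0, "sun": 1, "clouds": 1}
init = seeded_init(docs, n_topics=2, seed_topics=seed_topics, confidence=1.0)
print(init)
```

A regular collapsed Gibbs sampler would then continue from these assignments; lower `confidence` values leave the sampler freer to move seed words elsewhere.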

ISLE - This repository provides code for SVD and Importance sampling-based algorithms for large scale topic modeling

  •    C++

We built this project on Ubuntu 16.04 LTS with gcc 5.4. Other Linux versions with gcc 5+ could also work. This should generate two executables, ISLETrain and ISLEInfer, in the <ISLE_ROOT> directory.

sem - Structural Equation Modeling from a broader context

  •    R

The first few chapters also serve as the basis of a workshop, and include a brief introduction to R that will be enough for one to follow along with the tools used (e.g. psych, lavaan, and mediation packages). The actual document can be found at https://m-clark.github.io/sem.

text-summarization-and-visualization-using-watson-studio - Can we quickly summarize & visualize text to get the details about the unstructured data? Yes we can! Please review this code pattern for all the steps involved to quickly summarize & visualize the data

  •    Jupyter

We will demonstrate a methodology to summarize & visualize text using Watson Studio. Text summarization is the process of creating a short and coherent version of a longer document. There are two methods to summarize text: extractive and abstractive summarization. We will focus on extractive summarization, which involves selecting phrases and sentences from the source document to make up the new summary. Techniques involve ranking the relevance of phrases in order to choose only those most relevant to the meaning of the source. We will also demonstrate different methods to visualize the data, which can aid in providing a quick peek at the data. Some of the advantages of text summarization are below. Summaries reduce reading time. When researching documents, summaries make the selection process easier. Text summarization improves the effectiveness of indexing. Text summarization algorithms are less biased than human summarizers. Personalized summaries are useful in question-answering systems as they provide personalized information. Using automatic or semi-automatic summarization systems enables commercial abstract services to increase the number of texts they are able to process.
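Extractive summarization by relevance ranking can be sketched with a simple word-frequency score: sentences whose words occur often across the whole text are treated as most relevant and kept in their original order. This is a standard-library illustration under that assumption, not the method used in the Watson Studio code pattern; `summarize` and the small `STOPWORDS` set are hypothetical.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "in", "of", "to", "and", "is", "was", "into"}


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


text = (
    "Topic models find topics in text. "
    "The weather was nice. "
    "Topic models group words into topics."
)
summary = summarize(text, n_sentences=2)
print(summary)
```

Averaging the score per token is one design choice among many; graph-based rankers or TF-IDF weights are common alternatives.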

learning-social-media-analytics-with-r - This repository contains code and bonus content which will be added from time to time for the book "Learning Social Media Analytics with R" by Packt

  •    R

The book will also cover several practical real-world use cases on social media using R and its advanced packages to utilize data science methodologies such as sentiment analysis, topic modeling, text summarization, recommendation systems, social network analysis, classification, and clustering. This will enable readers to learn different hands-on approaches to obtain data from diverse social media sources such as Twitter and Facebook. It will also show readers how to establish detailed workflows to process, visualize, and analyze data to transform social data into actionable insights.

bigram-anchor-words - An Implementation of Bigram Anchor Words algorithm

  •    Python

Implementation for the Bigram Anchor Words Topic Model paper. Bag of words is a very poor text representation, so traditional topic models lose a lot of information. The project goal is to combine linguistic and statistical topic models. We propose a new Anchor Words Topic Model [1] in which bigrams can also be anchor words. Here is an example of anchor words. Metrics are also good and can be found in the paper.

2018-MachineLearning-Lectures-ESA - Machine Learning Lectures at the European Space Agency (ESA) in 2018

  •    Jupyter

In 2018, the European Space Agency (ESA) organized a series of 6 lectures on Machine Learning at the European Space Operations Centre (ESOC). This repository contains the lecture resources: presentations, notebooks and links to the videos (presentation and hands-on).