textract - extract text from any document. no muss. no fuss.

  •    HTML

Extract text from any document. No muss. No fuss. Full documentation.

DataScienceR - a curated list of R tutorials for Data Science, NLP and Machine Learning

  •    R

This repo contains a curated list of R tutorials and packages for Data Science, NLP and Machine Learning. It also serves as a reference guide for several common data analysis tasks. See also the curated list of Python tutorials for Data Science, NLP and Machine Learning.




tidy-text-mining - Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson

  •    TeX

This is a draft of the book Text Mining with R: A Tidy Approach, by Julia Silge and David Robinson. Please note that this work is being written under a Contributor Code of Conduct and released under a CC-BY-NC-SA license. By participating in this project (for example, by submitting a pull request with suggestions or edits) you agree to abide by its terms.

text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

  •    R

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP). To learn how to use this package, see text2vec.org and the package vignettes. See also the text2vec articles on my blog.

tidytext - Text mining using dplyr, ggplot2, and other tidy tools

  •    R

Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Much of the infrastructure needed for text mining with tidy data frames already exists in packages like dplyr, broom, tidyr and ggplot2. In this package, we provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. Check out our book to learn more about text mining using tidy data principles. The unnest_tokens function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
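
The one-token-per-row idea described above can be sketched in a few lines. This is a rough Python analogue, not tidytext's R API: the names `tokenize` and `unnest` are hypothetical, the word pattern is a simplifying assumption, and only some of the listed units are covered.

```python
import re

def tokenize(text, unit="words", n=2, pattern=r"\s+"):
    """Split text into tokens by the requested unit (hypothetical helper)."""
    if unit == "words":
        # Lowercase and keep runs of letters, digits and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())
    if unit == "characters":
        # Lowercase and keep alphanumeric characters only.
        return [c for c in text.lower() if c.isalnum()]
    if unit == "ngrams":
        # Sliding window of n consecutive words.
        words = tokenize(text, "words")
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if unit == "lines":
        # Keep non-empty lines as they are.
        return [line for line in text.splitlines() if line.strip()]
    if unit == "regex":
        # Separate around a caller-supplied pattern.
        return [piece for piece in re.split(pattern, text) if piece]
    raise ValueError(f"unsupported unit: {unit}")

def unnest(rows, unit="words", **kwargs):
    """Turn (doc_id, text) rows into one (doc_id, token) row per token."""
    return [(doc_id, tok)
            for doc_id, text in rows
            for tok in tokenize(text, unit, **kwargs)]

if __name__ == "__main__":
    for row in unnest([(1, "The quick brown fox"), (2, "jumps over")]):
        print(row)
```

Keeping the document identifier on every row is what lets the result be grouped, counted and joined with ordinary data frame tools.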

rake-nltk - Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.

  •    Python

RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm that tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text. If you see a stopwords error, it means that you do not have the stopwords corpus downloaded from NLTK. You can download it with NLTK's downloader, nltk.download('stopwords').
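
The frequency and co-occurrence scoring described above can be illustrated with a simplified sketch. This is not rake-nltk's implementation: the function names are hypothetical, the stopword list is a tiny stand-in for NLTK's corpus, and phrases are scored as the sum of each word's degree divided by its frequency.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; rake-nltk uses NLTK's corpus).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of consecutive non-stopwords, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text, stopwords=STOPWORDS):
    """Return (score, phrase) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)    # how often each word appears
    degree = defaultdict(int)  # co-occurrences within phrases, counting the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {p: sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(((score, " ".join(p)) for p, score in scored.items()), reverse=True)

if __name__ == "__main__":
    sample = ("Keyword extraction is the task of finding key phrases in text. "
              "Keyword extraction with RAKE is fast.")
    for score, phrase in rake_scores(sample):
        print(f"{score:4.1f}  {phrase}")
```

Longer phrases made of words that rarely appear alone score highest, which is why "finding key phrases" outranks the repeated "keyword extraction" in the sample.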


Orange - Data Mining Suite

  •    Python

Orange is component-based data mining software. It includes a range of data visualization, exploration, preprocessing and modeling techniques. It supports interactive data analysis workflows with a large toolbox.

LDAvis - R package for web-based interactive topic model visualization.

  •    JavaScript

R package for interactive topic model visualization. LDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

Query Term Analyzer

Query Term Analyzer is used to analyse the terms in a query.

rplos - R client for the PLoS Journals API

  •    R

rplos is a package for accessing full text articles from the Public Library of Science journals using their API. You used to need a key to use rplos; as of 2015-01-13 (v0.4.5.999) you no longer do.

crminer - Crossref Text Mining Client

  •    R

Crossref is a not-for-profit membership organization for scholarly publishing. For our purposes here, they provide a nice search API for metadata for scholarly works. See https://github.com/ropensci/rcrossref for a full-fledged R client for working with the Crossref search API.

Guten-gutter - Strips boilerplate from Project Gutenberg text files [Public domain]

  •    Python

Guten-gutter is a command-line filter for stripping the boilerplate off of text files from Project Gutenberg. I was using gutenizer for this purpose, but it has some shortcomings and there were several Project Gutenberg texts which it failed to properly strip, so I wrote this as a more robust replacement. It's also (like Project Gutenberg texts themselves) in the public domain. Our basic tests will be on Peter Rabbit.
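
As a rough illustration of the task described above, the sketch below keeps only the text between the start and end marker lines. It is not Guten-gutter's logic: it assumes the file uses marker lines of the form `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` (or the `THIS` variant), and the function name `strip_boilerplate` is hypothetical. Files without such markers are returned unchanged apart from trimmed whitespace.

```python
import re

# Assumed marker formats; real files vary, which is why robust tools do more.
START = re.compile(r"^\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.IGNORECASE)
END = re.compile(r"^\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.IGNORECASE)

def strip_boilerplate(text):
    """Return the text between the start and end marker lines, if present."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        stripped = line.strip()
        if START.match(stripped):
            start = i + 1   # body begins after the start marker
        elif END.match(stripped):
            end = i         # body stops before the end marker
            break
    return "\n".join(lines[start:end]).strip()

if __name__ == "__main__":
    sample = "\n".join([
        "Header and licence notice",
        "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE TITLE ***",
        "",
        "Body text of the book.",
        "",
        "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE TITLE ***",
        "Trailing licence text",
    ])
    print(strip_boilerplate(sample))
```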

ChemDataExtractor - Automatically extract chemical information from scientific documents

  •    Python

ChemDataExtractor is a toolkit for extracting chemical information from the scientific literature. See the project's documentation for the available installation options.

readability - Fast readability scores for text data

  •    R

readability utilizes the syllable package for fast calculation of readability scores by grouping variables.

textreadr - Tools to uniformly read in text data including semi-structured transcripts

  •    R

textreadr is a small collection of convenience tools for reading text documents into R. This is not meant to be an exhaustive collection; for more see the tm package. These packages are already specialized to handle these very specific data formats. textreadr provides the basic reading tools that work with the five basic file formats in which text data is stored.




