corpora - A collection of small corpuses of interesting data for the creation of bots and similar stuff

  •    Javascript

This project is a collection of static corpora (plural of "corpus") that are potentially useful in the creation of weird internet stuff. I've found that, as a creator, sometimes I am making something that needs access to a lot of adjectives, but not necessarily every adjective in the English language. So for the last year I've been copy/pasting an adjs.json file from project to project. This is kind of awful, so I'm hoping that this project will at least help me keep everything in one place. I would like this to help with rapid prototyping of projects. For example: you might use nouns.json to start with, just to see if an idea you had was any good. Once you've built the project quickly around the nouns collection, you can then rip it out and replace it with a more complex or exhaustive data source.
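The prototyping workflow described above is trivial to sketch. A minimal example, assuming a corpus file shaped like `{"nouns": [...]}` (the single-key-with-a-list layout is illustrative; check the actual file in the repo before relying on it):

```python
import json
import random

# A stand-in for a small corpus file such as nouns.json; the real file in
# the corpora repo is a JSON object with a list under a descriptive key.
nouns_json = '{"nouns": ["badger", "lantern", "comet", "violin"]}'

nouns = json.loads(nouns_json)["nouns"]
print(random.choice(nouns))
```

Once the idea proves itself, the `nouns` list can be swapped for a richer data source without touching the rest of the code.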

weibo_terminater - Final Weibo crawler: scrape anything from Weibo, including comments, post contents, and followers

  •    Python

The final Weibo crawler: scrape anything from Weibo, including comments, post contents, and followers. The Terminator.

chatterbot-corpus - A multilingual dialog corpus

  •    Python

These modules are used to quickly train ChatterBot to respond to various inputs in different languages. Although much of ChatterBot is designed to be language independent, it is still useful to have these training sets available to prime a fresh database and make the variety of responses that a bot can yield much more diverse. For instructions on how to use these data sets, please refer to the project documentation.
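The training files are YAML documents grouping example conversations; a rough sketch of how such a file reduces to statement/response training pairs (the `categories`/`conversations` keys reflect the corpus layout, but the dict below is a hand-written stand-in, not parsed from a real file):

```python
# A stand-in for one chatterbot-corpus training file (the real files are
# YAML; structure shown here is a simplified approximation).
corpus = {
    "categories": ["greetings"],
    "conversations": [
        ["Hi", "Hello", "How are you?", "I'm fine, thanks."],
    ],
}

# Each adjacent pair within a conversation acts as a (statement, response)
# training example.
pairs = [
    (conv[i], conv[i + 1])
    for conv in corpus["conversations"]
    for i in range(len(conv) - 1)
]
print(pairs[0])  # ('Hi', 'Hello')
```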

quanteda - An R package for the Quantitative Analysis of Textual Data

  •    R

An R package for managing and analyzing text, created by Kenneth Benoit. Supported by the European Research Council grant ERC-2011-StG 283794-QUANTESS. For more details, see https://docs.quanteda.io/index.html.

tif - Text Interchange Formats

  •    R

This package describes and validates formats for storing common objects arising in text analysis as native R objects. Representations of a text corpus, a document-term matrix, and tokenized text are included. The tokenized text format is extensible to include other annotations. There are two versions of the corpus and tokens objects; packages should accept both and return or coerce to at least one of them. corpus (data frame) - A valid corpus data frame object is a data frame with at least two columns. The first column is called doc_id and is a character vector with UTF-8 encoding; document ids must be unique. The second column is called text and must also be a character vector in UTF-8 encoding. Each individual document is represented by a single row in the data frame. Additional document-level metadata columns and corpus-level attributes are allowed but not required.
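Although tif itself is an R package, the corpus rules above are simple enough to restate as a language-agnostic check. A sketch in Python, using a plain dict of column-name to column-values in place of an R data frame:

```python
def validate_corpus(columns):
    """Check the tif corpus data-frame rules on {column_name: values}.

    Rules sketched from the spec: the first two columns are doc_id and
    text, both character (str), and doc_id values are unique.
    """
    names = list(columns)
    assert names[:2] == ["doc_id", "text"], "first two columns must be doc_id, text"
    doc_ids = columns["doc_id"]
    assert all(isinstance(d, str) for d in doc_ids), "doc_id must be character"
    assert len(set(doc_ids)) == len(doc_ids), "doc_id values must be unique"
    assert all(isinstance(t, str) for t in columns["text"]), "text must be character"
    return True

print(validate_corpus({"doc_id": ["d1", "d2"], "text": ["Hello.", "World."]}))
```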

machine-translator - Translate words using a statistical model

  •    Javascript

machine-translator is a Node.js module that uses statistical machine translation to translate between two different languages. The module is loosely based on the IBM Model 1 algorithm and has been tested using English.
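IBM Model 1 learns word-translation probabilities t(target | source) from sentence-aligned text by expectation-maximization. A compact sketch on a toy parallel corpus (the corpus and iteration count are illustrative, not taken from the module):

```python
from collections import defaultdict

# Toy parallel corpus of (source, target) sentence pairs.
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

# Uniform initialisation of t(target | source).
src_vocab = {w for s, _ in pairs for w in s}
t = defaultdict(lambda: 1.0 / len(src_vocab))

for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    # E-step: collect expected alignment counts under the current t.
    for src, tgt in pairs:
        for tw in tgt:
            norm = sum(t[(tw, sw)] for sw in src)
            for sw in src:
                c = t[(tw, sw)] / norm
                count[(tw, sw)] += c
                total[sw] += c
    # M-step: re-estimate t from the expected counts.
    for (tw, sw), c in count.items():
        t[(tw, sw)] = c / total[sw]

print(t[("house", "haus")])
```

After a few iterations the model correctly prefers "house" as the translation of "haus", since "the" is better explained by "das".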

colibri-core - Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams

  •    C++

Skipgram and flexgram extraction are computationally more demanding but have been implemented with similar optimisations. Skipgrams are computed by abstracting over n-grams, and flexgrams in turn are computed either by abstracting over skipgrams, or directly from n-grams on the basis of co-occurrence information (pointwise mutual information). At the heart of the software is the notion of pattern models. The core tool, to be used from the command line, is colibri-patternmodeller, which enables you to build pattern models, generate statistical reports, query for specific patterns and relations, and manipulate models.
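The "abstracting over n-grams" step can be illustrated in a few lines. A simplified sketch (the `{*}` gap marker and the single-gap restriction are illustrative; Colibri Core supports multiple gaps and far more efficient encodings):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, gap_index):
    """Abstract over n-grams by replacing one inner position with a gap.

    '{*}' is a purely illustrative wildcard marker standing in for the
    skipped token.
    """
    return [g[:gap_index] + ("{*}",) + g[gap_index + 1:]
            for g in ngrams(tokens, n)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 3)[0])        # ('to', 'be', 'or')
print(skipgrams(tokens, 3, 1)[0])  # ('to', '{*}', 'or')
```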

folia - FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations

  •    Python

FoLiA is an XML-based annotation format, suitable for the representation of linguistically annotated language resources. FoLiA's intended use is as a format for storing and/or exchanging language resources, including corpora. Our aim is to introduce a single rich format that can accommodate a wide variety of linguistic annotation types through a single generalised paradigm. We do not commit to any label set, language or linguistic theory; this is always left to the developer of the language resource, and provides maximum flexibility. XML is an inherently hierarchical format, and FoLiA does justice to this by making maximal use of a hierarchical, inline setup. We inherit from the D-Coi format, which is loosely based on a minimal subset of TEI. Because FoLiA introduces a new and much broader paradigm, it is not backwards-compatible with D-Coi, i.e. validators for D-Coi will not accept FoLiA XML. It is, however, easy to convert FoLiA to less complex or verbose formats such as the D-Coi format or plain text; converters are provided.
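The hierarchical, inline style means that a word's text lives inside the word element, which lives inside its sentence, and so on. A heavily simplified, schema-invalid sketch of that shape (element names loosely follow FoLiA's `<s>`, `<w>`, `<t>` conventions, but this snippet is illustration only, not valid FoLiA):

```python
import xml.etree.ElementTree as ET

# Simplified FoLiA-style fragment: sentences contain words, and each
# word (<w>) carries its text in a <t> child.
doc = ET.fromstring("""
<text>
  <s xml:id="s.1">
    <w xml:id="w.1"><t>Hello</t></w>
    <w xml:id="w.2"><t>world</t></w>
  </s>
</text>
""")

words = [w.find("t").text for w in doc.iter("w")]
print(words)  # ['Hello', 'world']
```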

aspen - 🔎 📖 ✨ Custom, private search engine for text documents built with NextJS/React/ES6/ES7

  •    Javascript

Put all your files in one place. This directory will be served via /static/data/ on the web server. Sometimes plaintext documents act weird. Maybe bin/import can't extract a title or maybe the search highlights are off. The file might have the wrong line endings or one of those annoying UTF-8 BOM headers. Try running dos2unix on your text files to fix them.
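If `dos2unix` is not at hand, the same clean-up (dropping a UTF-8 BOM and converting CRLF line endings to LF) is a few lines of Python; a small sketch:

```python
import codecs

def normalise(raw: bytes) -> bytes:
    """Strip a leading UTF-8 BOM and convert CRLF to LF (dos2unix-style)."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.replace(b"\r\n", b"\n")

print(normalise(codecs.BOM_UTF8 + b"Title\r\nBody\r\n"))  # b'Title\nBody\n'
```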

PoetryCorpus - A poetry corpus of the Russian language

  •    Python

A poetry corpus of the Russian language.


  •    Javascript

An ongoing attempt at tying together various ML techniques for trending topic and sentiment analysis, coupled with some experimental Python async coding, a distributed architecture, EventSource and lots of Docker goodness. I needed a readily available corpus for doing text analytics and sentiment analysis, so I decided to make one from my RSS feeds.

voxceleb - mirror of VoxCeleb dataset - a large-scale speaker identification dataset

  •    Shell

This repo contains the download links to the VoxCeleb dataset, described in [1]. VoxCeleb contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube. The dataset is gender balanced, with 55% of the speakers male. The speakers span a wide range of different ethnicities, accents, professions and ages. There are no overlapping identities between development and test sets.

cljs-corpus - A greppable archive of ClojureScript code


In linguistics, a text corpus is a set of texts written in a language; its purpose is to be analyzed to test hypotheses about the actual usage of that language. Similarly, the aim of cljs-corpus is to provide a searchable local archive of ClojureScript as it is used in the wild.

egret-wenda-corpus - A Public Corpus for Machine Learning

  •    Javascript

A QA corpus based on the Egret BBS. To make it more suitable for training, I have personally reviewed the raw data and modified some utterances, for example by deleting code from utterances.