Hydra - Distributed processing framework for search solutions

  •    Java

Hydra is designed to give a search solution the tools it needs to modify the data to be indexed in an efficient and flexible way. It does this by providing a scalable and efficient pipeline through which documents pass before being indexed into the search engine. Architecturally, Hydra sits between the search engine and the source integration.

Aperture - Java framework for getting data and metadata

  •    Java

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports various file formats such as Office, PDF, ZIP, and many more, and extracts metadata from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.

Gate - General Architecture for Text Engineering

  •    Java

GATE excels at text analysis of all shapes and sizes. It provides components for diverse language processing tasks, such as parsers, morphological analysis, tagging, information retrieval tools, and information extraction components for various languages, among many others. It also provides support for measuring, evaluating, modeling, and persisting data structures. It can analyze text or speech, has built-in support for machine learning, and can add different machine learning implementations via plugins.

Apache Tika - A content analysis toolkit

  •    Java

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

OpenPipe - Document Pipeline

  •    Java

OpenPipe is an open source, scalable platform for manipulating a stream of documents. A pipeline is an ordered set of steps (operations) performed on a document to convert it from its raw form into something ready to be put into the index. The operations performed on documents include language detection, field manipulation, POS tagging, entity extraction, and submitting the document to a search engine.

TextTeaser - Automatic Summarization Algorithm

  •    Scala

TextTeaser is an automatic summarization algorithm that combines the power of natural language processing and machine learning to produce good results. It can provide the gist of an article or better previews in news readers.

jekyll - Jekyll-based static site for The Programming Historian

  •    HTML

This is the main repository for the Programming Historian (http://programminghistorian.org), where we keep the files for the live website. For tutorials in submission, please see: Programming Historian Submissions.

whatlang-rs - Natural language detection library for Rust

  •    Rust

Natural language detection for Rust with a focus on simplicity and performance. For more details (e.g. how to blacklist some languages), please check the documentation.

ore - An R interface to the Onigmo regular expression library

  •    C

Oniguruma (or rather, the Onigmo fork of it) is the regular expression library used by the Ruby programming language, and ore is somewhat inspired by Ruby's regular expression features, although it is implemented in a way that aims to be natural for R users, including full vectorisation. This README covers the package's R interface only and assumes that the reader is already familiar with regular expressions. Please see the official reference document for details of supported regular expression syntax.

YoastSEO.js - Analyze content on a page and give SEO feedback as well as render a snippet preview.

  •    JavaScript

Text analysis and assessment library in JavaScript. This library can generate interesting metrics about a text and assess them to give feedback that can be used to improve the text. Also included is a preview of the Google search result snippet, which can itself be assessed using the library.

ciseau - Tokenize and clean strings in Python

  •    Python

Word and sentence tokenization in Python. sent_tokenize can keep the whitespace as-is with the flags keep_whitespace=True and normalize_ascii=False.
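
A minimal usage sketch (assuming ciseau is installed from PyPI; the sample text and variable names are illustrative):

    import ciseau

    text = "Joey was a great sailor. He sailed around the world!"

    # Word-level tokenization: returns a flat list of token strings.
    words = ciseau.tokenize(text)

    # Sentence-level tokenization: returns one list of tokens per sentence.
    # keep_whitespace=True keeps the original whitespace attached to tokens;
    # normalize_ascii=False leaves Unicode punctuation untouched.
    sentences = ciseau.sent_tokenize(text, keep_whitespace=True, normalize_ascii=False)

    print(words)
    print(sentences)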

textclean - Tools for cleaning and normalizing text data

  •    R

textclean is a collection of tools to clean and normalize text. Many of these tools have been taken from the qdap package and revamped to be more intuitive, better named, and faster. Tools are geared at checking for substrings that are not optimal for analysis and replacing or removing them (normalizing) with more analysis-friendly substrings (see Sproat, Black, Chen, Kumar, Ostendorf, & Richards, 2001, doi:10.1006/csla.2001.0169) or extracting them into new variables. For example, emoticons are often used in text but not always easily handled by analysis algorithms; the replace_emoticon() function replaces emoticons with word equivalents.

Other R packages provide some of the same functionality (e.g., english, gsubfn, mgsub, stringi, stringr, qdapRegex). textclean differs from these packages in that it is designed to handle all of the common cleaning and normalization tasks with a single, consistent, pre-configured toolset (note that textclean uses many of these terrific packages as a backend). This means that the researcher spends less time on munging, leading to quicker analysis.

This package is meant to be used jointly with the textshape package, which provides text extraction and reshaping functionality. textclean also works well with the qdapRegex package, which provides tooling for substring substitution and extraction of pre-canned regular expressions. In addition, the functions of textclean are designed to work within the piping of the tidyverse framework by consistently using the first argument of functions as the data source; the subbing and replacement tools are particularly effective within a dplyr::mutate statement.

afterwriting-labs - Post-processing for Fountain screenplays

  •    JavaScript

afterwriting-labs provides post-processing tools for screenplays written in the Fountain format.

aylien_textapi_go - AYLIEN's officially supported Go client library for accessing Text API

  •    Go

This is the Go client library for AYLIEN's APIs. If you haven't already done so, you will need to sign up. See the Developers Guide for additional documentation.

aylien_textapi_python - AYLIEN's officially supported Python client library for accessing Text API

  •    Python

This is the Python client library for AYLIEN's APIs. If you haven't already done so, you will need to sign up. See the Developers Guide for additional documentation.
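
A minimal usage sketch, assuming the aylien-apiclient package is installed; the application ID, key, and sample text below are placeholders, and the Sentiment call follows the client's documented pattern:

    from aylienapiclient import textapi

    # Placeholder credentials; substitute the ID and key from your AYLIEN account.
    client = textapi.Client("YourApplicationID", "YourApplicationKey")

    # Run sentiment analysis on a short piece of text.
    sentiment = client.Sentiment({'text': 'John is a very good football player!'})
    print(sentiment)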

aylien_textapi_ruby - AYLIEN's officially supported Ruby client library for accessing Text API

  •    Ruby

This is the Ruby client library for AYLIEN's APIs. If you haven't already done so, you will need to sign up. See the Developers Guide for additional documentation.