Displaying 1 to 20 from 74 results

chevrotain - Parser Building Toolkit for JavaScript

  •    TypeScript

Chevrotain is a blazing fast and feature rich Parser Building Toolkit for JavaScript. It can be used to build parsers/compilers/interpreters for various use cases ranging from simple configuration files, to full fledged programing languages. A more in depth description of Chevrotain can be found in this great article on: Parsing in JavaScript: Tools and Libraries.

SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation

  •    C++

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) and unigram language model with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre postprocessing.

parse5 - HTML parsing / serialization toolset for Node.js

  •    Javascript

parse5 provides nearly everything you may need when dealing with HTML. It's the fastest spec-compliant HTML parser for Node to date. It parses HTML the way the latest version of your browser does. It has proven itself reliable in such projects as jsdom, Angular2, Polymer and many more.

Mustard - 🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it

  •    Swift

Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it. If you want more than just the substrings, you can use the tokens(matchedWith: CharacterSet...) method which will return an array of TokenType.

react-typeahead - Pure react-based typeahead and typeahead-tokenizer

  •    Javascript

react-typeahead is a javascript library that provides a react-based typeahead, or autocomplete text entry, as well as a "typeahead tokenizer", a typeahead that allows you to select multiple results. Basic typeahead input and results list.

react-bootstrap-typeahead - React typeahead with Bootstrap styling

  •    Javascript

A React-based typeahead that relies on Bootstrap for styling and was originally inspired by Twitter's typeahead.js. It supports both single- and multi-selection and is compliant with WAI-ARIA authoring practices. Try the live examples. Please note that documentation and examples apply to the most recent release and may no longer be applicable if you're using an outdated version.

csstree - A tool set for working with CSS including fast detailed parser, walker, generator and lexer based on W3C specs and browser implementations

  •    Javascript

CSSTree is a tool set to work with CSS, including fast detailed parser (string->AST), walker (AST traversal), generator (AST->string) and lexer (validation and matching) based on knowledge of spec and browser implementations. The main goal is to be efficient and W3C spec compliant, with focus on CSS analyzing and source-to-source transforming tasks. NOTE: The project is in alpha stage since some parts need further improvements, AST format and API are subjects to change. However it's stable enough and used by packages like CSSO (CSS minifier) and SVGO (SVG optimizer) in production.

language - A fast PEG parser written in JavaScript with first class errors

  •    Objective-J

Language.js is an experimental new parser based on PEG (Parsing Expression Grammar), with the special addition of the "naughty OR" operator to handle errors in a unique new way. It makes use of memoization to achieve linear time parsing speed, and support for automatic cut placement is coming to maintain mostly constant space as well (for a discussion of cut operators see: www.ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf). The most unique addition Language.js makes to PEG is how it handles errors. No parse ever fails in Language.js, instead SyntaxErrorNodes are placed into the resultant tree. This makes it trivial to do things like write syntax highlighters that have live error reporting. This also means that Language.js is very competent at handling multiple errors (as opposed to aborting on the first one that is reached).

spark-nlp - Natural Language Understanding Library for Apache Spark.

  •    Jupyter

John Snow Labs Spark-NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment. This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .

kagome - Self-contained Japanese Morphological Analyzer written in pure Go

  •    Go

Kagome is an open source Japanese morphological analyzer written in pure golang. The MeCab-IPADIC and UniDic (unidic-mecab) dictionary/statiscal models are packaged in Kagome binary. Kagome has segmentation mode for search such as Kuromoji.

literalizer - Specialized heuristic lexer for JS to identify complex literals

  •    Javascript

Specialized heuristic lexer for JS to identify complex literals

simple-html-tokenizer - A lightweight JavaScript library for tokenizing non-`<script>` HTML expected to be found in the `<body>` of a document

  •    TypeScript

Simple HTML Tokenizer is a lightweight JavaScript library that can be used to tokenize the kind of HTML normally found in templates.

sentences - A multilingual command line sentence tokenizer in Golang

  •    Go

This command line utility will convert a blob of text into a list of sentences.This package attempts to fix some problems I noticed for english.

tif - Text Interchange Formats

  •    R

This package describes and validates formats for storing common object arising in text analysis as native R objects. Representations of a text corpus, document term matrix, and tokenized text are included. The tokenized text format is extensible to include other annotations. There are two versions of the corpus and tokens objects; packages should accept both and return or coerce to at least one of these.corpus (data frame) - A valid corpus data frame object is a data frame with at least two columns. The first column is called doc_id and is a character vector with UTF-8 encoding. Document ids must be unique. The second column is called text and must also be a character vector in UTF-8 encoding. Each individual document is represented by a single row in the data frame. Addition document-level metadata columns and corpus level attributes are allowed but not required.

megamark - :heart_eyes_cat: Markdown with easy tokenization, a fast highlighter, and a lean HTML sanitizer

  •    Javascript

Megamark is markdown-it plus a few reasonable factory defaults.The markdown input will be parsed via markdown-it. Megamark configures markdown-it for syntax highlighting, prefixing classes with md-. Output is sanitized via insane, and you can configure the whitelisting process too.

wirb - Don't use an IRB without WIRB!

  •    Ruby

WIRB syntax highlights Ruby objects. Supported Rubies: 2.4, 2.3, 2.2, 2.1, 2.0, jruby, rubinius.

unicode-tokenizer - Unicode Tokenizer following the Unicode Line Breaking algorithm

  •    Javascript

This is a tokenizer that tokenizes text according to the line breaking classes defined by the Unicode Line Breaking algorithm (tr14). It also annotates each token with its line breaking action. This is useful when performing Natural Language Processing or doing manual line breaking. The full range of Unicode code points are supported by this tokenizer. If you however only want to tokenize selected portions of the Unicode standard, such as the Basic Multilingual Plane, you can subset the supported Unicode range. To generate a subsetted tokenizer, modify the included-ranges.txt and excluded-classes.txt files, and use the --include-ranges and --exclude-classes command line options on the generate-tokens script.

We have large collection of open source products. Follow the tags from Tag Cloud >>

Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.