parse5 - HTML parsing/serialization toolset for Node.js

  •    JavaScript

HTML parsing/serialization toolset for Node.js, compliant with the WHATWG HTML Living Standard (aka HTML5). parse5 provides nearly everything you may need when dealing with HTML. It is the fastest spec-compliant HTML parser for Node.js to date, and it parses HTML the way the latest version of your browser does. It has proven itself reliable in projects such as jsdom, Angular2, Polymer and many more.
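
A minimal round trip looks like the following sketch, using parse5's documented parse and serialize entry points (check the docs of your installed version for the exact tree format):

    const parse5 = require('parse5');

    // Parse a full document the way a browser would; implied tags such as
    // <html>, <head> and <body> are inserted per the HTML5 tree construction rules.
    const document = parse5.parse('<ul><li>one<li>two</ul>');

    // Serialize the tree back to HTML; the output reflects the normalized tree.
    console.log(parse5.serialize(document));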

chevrotain - Parser Building Toolkit for JavaScript

  •    TypeScript

Chevrotain is a blazing-fast, feature-rich Parser Building Toolkit for JavaScript. It can be used to build parsers/compilers/interpreters for various use cases, ranging from simple configuration files to full-fledged programming languages. A more in-depth description of Chevrotain can be found in the article Parsing in JavaScript: Tools and Libraries.
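
To give a flavor of the API, here is a small lexer-only sketch; the token names are illustrative, not part of the library, while createToken, Lexer and Lexer.SKIPPED are documented Chevrotain APIs (exact token shapes vary by version):

    const { createToken, Lexer } = require('chevrotain');

    const Identifier = createToken({ name: 'Identifier', pattern: /[a-zA-Z]\w*/ });
    const Integer = createToken({ name: 'Integer', pattern: /\d+/ });
    const WhiteSpace = createToken({
      name: 'WhiteSpace',
      pattern: /\s+/,
      group: Lexer.SKIPPED // matched but dropped from the token stream
    });

    const lexer = new Lexer([WhiteSpace, Integer, Identifier]);
    const { tokens, errors } = lexer.tokenize('answer 42');
    console.log(tokens.map(t => `${t.tokenType.name}:${t.image}`));
    // e.g. [ 'Identifier:answer', 'Integer:42' ]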

Mustard - 🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it

  •    Swift

Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it. If you want more than just the substrings, you can use the tokens(matchedWith: CharacterSet...) method, which returns an array of TokenType.

react-typeahead - Pure react-based typeahead and typeahead-tokenizer

  •    JavaScript

react-typeahead is a JavaScript library that provides a React-based typeahead, or autocomplete text entry, as well as a "typeahead tokenizer": a typeahead that allows you to select multiple results. At its simplest it renders a basic typeahead input and results list.
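
A usage sketch of the two components; prop names such as options, maxVisible and onTokenAdd follow the project README, but treat the exact API as version-dependent:

    const React = require('react');
    const { Typeahead, Tokenizer } = require('react-typeahead');

    const Example = () => (
      <div>
        {/* Plain typeahead: a text input plus a filtered results list */}
        <Typeahead options={['John', 'Paul', 'George', 'Ringo']} maxVisible={2} />
        {/* Tokenizer: each selected result becomes a removable token */}
        <Tokenizer
          options={['John', 'Paul', 'George', 'Ringo']}
          onTokenAdd={token => console.log('token added:', token)}
        />
      </div>
    );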

csstree - A tool set for working with CSS including fast detailed parser, walker, generator and lexer based on W3C specs and browser implementations

  •    JavaScript

CSSTree is a tool set to work with CSS, including a fast detailed parser (string -> AST), a walker (AST traversal), a generator (AST -> string) and a lexer (validation and matching), based on knowledge of the spec and of browser implementations. The main goal is to be efficient and W3C spec-compliant, with a focus on CSS analysis and source-to-source transformation tasks. NOTE: The project is in an alpha stage: some parts need further improvement, and the AST format and API are subject to change. However, it is stable enough to be used in production by packages like CSSO (a CSS minifier) and SVGO (an SVG optimizer).
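
The parse/walk/generate trio composes naturally into source-to-source transforms. A minimal sketch (the class names are illustrative):

    const csstree = require('css-tree');

    // string -> AST
    const ast = csstree.parse('.example { color: red }');

    // AST traversal: rename a class selector
    csstree.walk(ast, node => {
      if (node.type === 'ClassSelector' && node.name === 'example') {
        node.name = 'renamed';
      }
    });

    // AST -> string (generate() produces compact output)
    console.log(csstree.generate(ast)); // .renamed{color:red}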

language - A fast PEG parser written in JavaScript with first class errors

  •    Objective-J

Language.js is an experimental new parser based on PEG (Parsing Expression Grammar), with the special addition of the "naughty OR" operator to handle errors in a unique new way. It makes use of memoization to achieve linear-time parsing, and support for automatic cut placement is coming, to maintain mostly constant space as well (for a discussion of cut operators, see: www.ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf). The most distinctive addition Language.js makes to PEG is how it handles errors. No parse ever fails in Language.js; instead, SyntaxErrorNodes are placed into the resulting tree. This makes it trivial to do things like write syntax highlighters with live error reporting, and it also means that Language.js is very good at handling multiple errors (as opposed to aborting on the first one it reaches).

spark-nlp - Natural Language Understanding Library for Apache Spark.

  •    Jupyter

John Snow Labs Spark-NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. The library is published in the spark-packages repository at https://spark-packages.org/package/JohnSnowLabs/spark-nlp.

literalizer - Specialized heuristic lexer for JS to identify complex literals

  •    JavaScript

A specialized heuristic lexer for JavaScript that identifies complex literals.

simple-html-tokenizer - A lightweight JavaScript library for tokenizing non-`<script>` HTML expected to be found in the `<body>` of a document

  •    TypeScript

Simple HTML Tokenizer is a lightweight JavaScript library that can be used to tokenize the kind of HTML normally found in templates.
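
Typical usage is a single tokenize call; the token shapes in the comments follow the project README:

    const { tokenize } = require('simple-html-tokenizer');

    const tokens = tokenize('Hello <b>world</b>');
    // Each token is a plain object, e.g. { type: 'StartTag', tagName: 'b', ... }
    console.log(tokens.map(t => t.type));
    // => [ 'Chars', 'StartTag', 'Chars', 'EndTag' ]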

sentences - A multilingual command line sentence tokenizer in Golang

  •    Go

This command-line utility converts a blob of text into a list of sentences. The package attempts to fix some problems I noticed for English.

tif - Text Interchange Formats

  •    R

This package describes and validates formats for storing common objects arising in text analysis as native R objects. Representations of a text corpus, a document-term matrix, and tokenized text are included. The tokenized text format is extensible to include other annotations. There are two versions of the corpus and tokens objects; packages should accept both, and return or coerce to at least one of these. corpus (data frame): a valid corpus data frame object is a data frame with at least two columns. The first column is called doc_id and is a character vector with UTF-8 encoding; document ids must be unique. The second column is called text and must also be a character vector in UTF-8 encoding. Each individual document is represented by a single row in the data frame. Additional document-level metadata columns and corpus-level attributes are allowed but not required.

megamark - :heart_eyes_cat: Markdown with easy tokenization, a fast highlighter, and a lean HTML sanitizer

  •    JavaScript

Megamark is markdown-it plus a few reasonable factory defaults. The markdown input is parsed via markdown-it. Megamark configures markdown-it for syntax highlighting, prefixing classes with md-. Output is sanitized via insane, and you can configure the whitelisting process too.
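
In its simplest form it is a single markdown-to-HTML call; the output shown is indicative, since the exact markup depends on the configured defaults:

    const megamark = require('megamark');

    console.log(megamark('_Markdown_ is **nice**'));
    // e.g. '<p><em>Markdown</em> is <strong>nice</strong></p>\n'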

wirb - Don't use an IRB without WIRB!

  •    Ruby

WIRB syntax highlights Ruby objects. Supported Rubies: 2.4, 2.3, 2.2, 2.1, 2.0, jruby, rubinius.

unicode-tokenizer - Unicode Tokenizer following the Unicode Line Breaking algorithm

  •    JavaScript

This is a tokenizer that tokenizes text according to the line breaking classes defined by the Unicode Line Breaking algorithm (tr14). It also annotates each token with its line breaking action. This is useful when performing Natural Language Processing or doing manual line breaking. The full range of Unicode code points is supported by this tokenizer. If, however, you only want to tokenize selected portions of the Unicode standard, such as the Basic Multilingual Plane, you can subset the supported Unicode range. To generate a subsetted tokenizer, modify the included-ranges.txt and excluded-classes.txt files, and use the --include-ranges and --exclude-classes command-line options on the generate-tokens script.

tokenizer-array - general purpose regex tokenizer that returns an array of tokens

  •    JavaScript

This module is based on Floby's node-tokenizer, but returns an array instead of a stream: it returns an array of tokens by parsing a string src with an array of rules.

lexmachine - Lex machinery for Go.

  •    Go

lexmachine is a full lexical analysis framework for the Go programming language, made available under the terms of a BSD 3-Clause license. It supports a restricted but usable set of regular expressions appropriate for writing lexers for complex programming languages. The framework also supports sub-lexers and non-regular lexing through an "escape hatch" which allows the user to consume any number of further bytes after a match. So if you want to support nested C-style comments or other paired structures, you can do so at the lexical analysis stage.

high5 - HTML5 tokenizer

  •    Javascript

My previous HTML parser, htmlparser2, reached a point where a clean cut was needed. high5 is this cut, even though it's based on htmlparser2 and will try to be backwards compatible (I even tried to preserve the git history, so all previous committers are still credited). The tokenizer takes several shortcuts, which greatly increase the speed of a JavaScript implementation but disobey the spec implementation-wise. The output should be spec-compliant, though.