HTML parsing/serialization toolset for Node.js, compliant with the WHATWG HTML Living Standard (aka HTML5). parse5 provides nearly everything you may need when dealing with HTML. It's the fastest spec-compliant HTML parser for Node to date, and it parses HTML the way the latest version of your browser does. It has proven itself reliable in projects such as jsdom, Angular2, Polymer and many more.
html-parsing html html5 serialization serializer parser whatwg specification fast html-parser html5-parser htmlparser parse5 html-serializer htmlserializer sax simple-api parse tokenize serialize tokenizer
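
A quick sketch of the parse/serialize round trip, assuming the parse and serialize exports of recent parse5 releases; the input markup is invented for illustration:

```typescript
// Round trip: HTML string -> document tree -> HTML string.
import { parse, serialize } from "parse5";

// Parses the way a spec-compliant browser would.
const document = parse("<!DOCTYPE html><html><body><p>Hello</p></body></html>");

// Serialize the tree back to HTML.
console.log(serialize(document));
```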

A small library for converting tokenized PHP source code into XML.
tokenizer xml

Chevrotain is a blazing fast and feature-rich Parser Building Toolkit for JavaScript. It can be used to build parsers/compilers/interpreters for various use cases, ranging from simple configuration files to full-fledged programming languages. A more in-depth description of Chevrotain can be found in the article Parsing in JavaScript: Tools and Libraries.
typescript parser-library parsing grammars tokenizer open-source parser syntax lexical analysis grammar lexer generator compiler fault tolerant
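
As a sketch of the token-definition side of the toolkit, using Chevrotain's documented createToken/Lexer API (the token names and input below are invented for illustration):

```typescript
import { createToken, Lexer } from "chevrotain";

// Each token type is defined by a name and a regular expression.
const Identifier = createToken({ name: "Identifier", pattern: /[a-zA-Z]\w*/ });
const Integer = createToken({ name: "Integer", pattern: /\d+/ });
const WhiteSpace = createToken({
  name: "WhiteSpace",
  pattern: /\s+/,
  group: Lexer.SKIPPED, // skipped tokens never reach the parser
});

// Order matters: earlier token types win when several could match.
const lexer = new Lexer([WhiteSpace, Integer, Identifier]);

const { tokens } = lexer.tokenize("limit 42");
console.log(tokens.map((t) => `${t.tokenType.name}:${t.image}`));
// -> [ "Identifier:limit", "Integer:42" ]
```

The same token types are then fed to Chevrotain's parser classes to build a full grammar.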

Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it. If you want more than just the substrings, you can use the tokens(matchedWith: CharacterSet...) method, which will return an array of TokenType.
substrings tokenizer

react-typeahead is a JavaScript library that provides a React-based typeahead, or autocomplete text entry, as well as a "typeahead tokenizer": a typeahead that allows you to select multiple results. At its simplest it renders a basic typeahead input and results list.
react typeahead tokenizer autocomplete react-component
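
A usage sketch assuming the Typeahead and Tokenizer components the README describes; the option lists and handlers are invented:

```tsx
import React from "react";
import { Typeahead, Tokenizer } from "react-typeahead";

// Basic typeahead: a text input plus a filtered results list.
const Search = () => (
  <Typeahead options={["John", "Paul", "George", "Ringo"]} maxVisible={2} />
);

// Tokenizer variant: each selected result becomes a removable token.
const MultiSearch = () => (
  <Tokenizer
    options={["JavaScript", "Ruby", "Haskell", "Python"]}
    onTokenAdd={(token) => console.log("added", token)}
  />
);
```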

CSSTree is a toolset for working with CSS, including a fast, detailed parser (string → AST), a walker (AST traversal), a generator (AST → string) and a lexer (validation and matching), based on knowledge of the spec and browser implementations. The main goal is to be efficient and W3C spec compliant, with a focus on CSS analysis and source-to-source transformation tasks. NOTE: the project is in an alpha stage since some parts need further improvement; the AST format and API are subject to change. However, it's stable enough to be used in production by packages like CSSO (CSS minifier) and SVGO (SVG optimizer).
css-parser ast css fast parser lexer walker generator w3c tokenizer utils syntax validation
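
The parse → walk → generate pipeline in a minimal sketch, using css-tree's documented top-level functions:

```typescript
import * as csstree from "css-tree";

// string -> AST
const ast = csstree.parse(".example { color: red; }");

// AST traversal: rename a class selector in place.
csstree.walk(ast, (node) => {
  if (node.type === "ClassSelector" && node.name === "example") {
    node.name = "renamed";
  }
});

// AST -> string (output is compact by default)
console.log(csstree.generate(ast)); // ".renamed{color:red}"
```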

A React-based typeahead that relies on Bootstrap for styling and was originally inspired by Twitter's typeahead.js. It supports both single- and multi-selection and is compliant with WAI-ARIA authoring practices. Try the live examples. Please note that documentation and examples apply to the most recent release and may no longer be applicable if you're using an outdated version.
react bootstrap typeahead auto-complete auto-suggest autocomplete autosuggest bootstrap-tokenizer bootstrap-typeahead react-autocomplete react-autosuggest react-tokenizer react-typeahead react-bootstrap react-bootstrap-tokenizer react-bootstrap-typeahead tokenizer
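
A sketch of single- and multi-selection with the package's Typeahead component; the labelKey, multiple, and onChange props follow the documented API, while the option data is invented:

```tsx
import React from "react";
import { Typeahead } from "react-bootstrap-typeahead";

const states = [{ name: "Alabama" }, { name: "Alaska" }, { name: "Arizona" }];

// Single selection.
const Single = () => (
  <Typeahead
    id="state-picker"
    labelKey="name"
    options={states}
    placeholder="Choose a state..."
  />
);

// Multi-selection: chosen items render as removable tokens.
const Multi = () => (
  <Typeahead
    id="states-picker"
    labelKey="name"
    multiple
    options={states}
    onChange={(selected) => console.log(selected)}
  />
);
```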

Language.js is an experimental parser based on PEG (Parsing Expression Grammar), with the special addition of the "naughty OR" operator to handle errors in a unique way. It uses memoization to achieve linear-time parsing, and support for automatic cut placement is coming to maintain mostly constant space as well (for a discussion of cut operators see: www.ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf). The most distinctive addition Language.js makes to PEG is how it handles errors. No parse ever fails in Language.js; instead, SyntaxErrorNodes are placed into the resulting tree. This makes it trivial to do things like write syntax highlighters with live error reporting, and it means Language.js is very competent at handling multiple errors (as opposed to aborting on the first one reached).
parser peg packrat generator compiler lexer tokenizer lex yacc bison antlr
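
Language.js's actual API isn't shown in this description, so the following is a purely hypothetical sketch of what an error-tolerant result tree could look like: the malformed span becomes a SyntaxErrorNode while the rest of the parse proceeds. Every name except SyntaxErrorNode is invented:

```typescript
// Hypothetical shape of a parse result for: "if (x > ) { y(); }"
// The bad condition is recorded as a node instead of aborting the parse,
// so a highlighter keeps working and later errors can still be collected.
type Node =
  | { type: "If"; condition: Node; body: Node[] }
  | { type: "Call"; callee: string }
  | { type: "SyntaxErrorNode"; expected: string; found: string };

const tree: Node = {
  type: "If",
  condition: { type: "SyntaxErrorNode", expected: "expression", found: ")" },
  body: [{ type: "Call", callee: "y" }],
};
```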

The library's full documentation can be found here. Be sure to lint & pass the unit tests before submitting your pull request.
natural-language-processing machine-learning fuzzy-matching clustering record-linkage bayes bloom-filter canberra caverphone chebyshev cologne cosine classifier daitch-mokotoff dice fingerprint fuzzy hamming k-means jaccard jaro lancaster levenshtein lig metaphone mra ngrams nlp nysiis perceptron phonetic porter punkt schinke sorensen soundex stats tfidf tokenizer tversky vectorizer winkler

John Snow Labs Spark-NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. The library has been uploaded to the spark-packages repository: https://spark-packages.org/package/JohnSnowLabs/spark-nlp.
nlp nlu natural-language-processing natural-language-understanding spark spark-ml pyspark machine-learning named-entity-recognition sentiment-analysis lemmatizer spell-checker tokenizer entity-extraction stemmer part-of-speech-tagger annotation-framework

Kagome is an open source Japanese morphological analyzer written in pure Go. The MeCab-IPADIC and UniDic (unidic-mecab) dictionary/statistical models are packaged into the Kagome binary. Kagome has a segmentation mode for search, similar to Kuromoji.
japanese tokenizer nlp-library japanese-language pos-tagging segmentation morphological-analysis

A Scala/Java library for processing Korean text.
korean korean-text-processing natural-language-processing text-processing tokenizer korean-tokenizer

A specialized heuristic lexer for JavaScript that identifies complex literals.
tokenizer lexer parser

Simple HTML Tokenizer is a lightweight JavaScript library that can be used to tokenize the kind of HTML normally found in templates.
html tokenizer
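
A minimal sketch assuming the library's tokenize export; the token shapes in the comment are approximate:

```typescript
import { tokenize } from "simple-html-tokenizer";

const tokens = tokenize('<div class="greeting">Hello</div>');
console.log(tokens);
// Roughly:
// [ { type: "StartTag", tagName: "div", attributes: [["class", "greeting", true]] },
//   { type: "Chars", chars: "Hello" },
//   { type: "EndTag", tagName: "div" } ]
```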

This command-line utility converts a blob of text into a list of sentences. The package attempts to fix some problems I noticed for English.
sentence-tokenizer tokenizer cli sentences natural-language-processing nlp

This package describes and validates formats for storing common objects arising in text analysis as native R objects. Representations of a text corpus, document-term matrix, and tokenized text are included. The tokenized text format is extensible to include other annotations. There are two versions of the corpus and tokens objects; packages should accept both and return or coerce to at least one of these. corpus (data frame): a valid corpus data frame object is a data frame with at least two columns. The first column is called doc_id and is a character vector with UTF-8 encoding; document ids must be unique. The second column is called text and must also be a character vector in UTF-8 encoding. Each individual document is represented by a single row in the data frame. Additional document-level metadata columns and corpus-level attributes are allowed but not required.
text-processing corpus term-frequency tokenizer natural-language-processing r-package

Megamark is markdown-it plus a few reasonable factory defaults. The markdown input is parsed via markdown-it, which Megamark configures for syntax highlighting, prefixing classes with md-. Output is sanitized via insane, and you can configure the whitelisting process too.
markdown tokenizer
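
A minimal sketch of the conversion call, assuming megamark's default export takes a markdown string and returns sanitized HTML:

```typescript
import megamark from "megamark";

// Markdown in, sanitized HTML out.
const html = megamark("_Markdown_ is **amazing**!");
console.log(html);
// -> "<p><em>Markdown</em> is <strong>amazing</strong>!</p>\n"
```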

WIRB syntax-highlights Ruby objects. Supported Rubies: 2.4, 2.3, 2.2, 2.1, 2.0, JRuby, Rubinius.
irb syntax-highlighting terminal tokenizer stdlib

This is a tokenizer that tokenizes text according to the line-breaking classes defined by the Unicode Line Breaking Algorithm (tr14). It also annotates each token with its line-breaking action. This is useful when performing natural language processing or doing manual line breaking. The full range of Unicode code points is supported by this tokenizer. If, however, you only want to tokenize selected portions of the Unicode standard, such as the Basic Multilingual Plane, you can subset the supported Unicode range. To generate a subsetted tokenizer, modify the included-ranges.txt and excluded-classes.txt files, and use the --include-ranges and --exclude-classes command-line options of the generate-tokens script.
tokenizer tokens unicode line-breaking tr14 natural-language-processing nlp

This module is based on Floby's node-tokenizer, but returns an array instead of a stream: it returns an array of tokens produced by parsing a string src against an array of rules.
tokenizer
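
The description implies a tokenize(src, rules) → tokens contract. Below is a self-contained illustration of that contract under assumed rule and token shapes; it is not the module's actual code:

```typescript
// Illustrative re-implementation of the described contract: scan src with
// an ordered array of rules and return the matched tokens as a flat array.
interface Rule { type: string; regex: RegExp; }
interface Token { type: string; value: string; }

function tokenize(src: string, rules: Rule[]): Token[] {
  const tokens: Token[] = [];
  let rest = src;
  outer: while (rest.length > 0) {
    for (const rule of rules) {
      const m = rest.match(rule.regex);
      if (m !== null && m.index === 0) {
        tokens.push({ type: rule.type, value: m[0] });
        rest = rest.slice(m[0].length);
        continue outer;
      }
    }
    rest = rest.slice(1); // no rule matched: drop one character and move on
  }
  return tokens;
}

console.log(tokenize("foo 42", [
  { type: "word", regex: /^[a-z]+/ },
  { type: "number", regex: /^\d+/ },
]));
// -> [ { type: "word", value: "foo" }, { type: "number", value: "42" } ]
```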