tokenizer - A small library for converting tokenized PHP source code into XML (and potentially other formats)


A small library for converting tokenized PHP source code into XML.

https://github.com/theseer/tokenizer





Related Projects

react-bootstrap-typeahead - React typeahead with Bootstrap styling

  •    Javascript

A React-based typeahead that relies on Bootstrap for styling and was originally inspired by Twitter's typeahead.js. It supports both single- and multi-selection and is compliant with WAI-ARIA authoring practices. Please note that documentation and examples apply to the most recent release and may no longer be applicable if you're using an outdated version.

react-typeahead - Pure react-based typeahead and typeahead-tokenizer

  •    Javascript

react-typeahead is a JavaScript library that provides a React-based typeahead, or autocomplete text entry, as well as a "typeahead tokenizer", a typeahead that allows you to select multiple results. The basic typeahead consists of an input and a results list.

NQXML

  •    Ruby

NQXML is a pure Ruby implementation of a non-validating XML processor. It includes an XML tokenizer, a SAX-style streaming XML parser, a DOM-style tree parser, an XML writer, and a context-sensitive callback mechanism.

parsekit - Objective-C Tokenizer and Parser Generator. Supports Grammars.

  •    Objective-C

Objective-C Tokenizer and Parser Generator. Supports Grammars.

sentencepiece - Unsupervised text tokenizer for Neural Network-based text generation.

  •    C++

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements sub-word units (also known as wordpieces [Wu et al.] [Schuster et al.] and byte-pair-encoding (BPE) [Sennrich et al.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing. This is not an official Google product.


php-token-stream - Wrapper around PHP's tokenizer extension.

  •    PHP

Wrapper around PHP's tokenizer extension.

php-short-array-syntax-converter - Command-line script to convert PHP's array() syntax to PHP 5

  •    PHP

Command-line script to convert PHP's array() syntax to PHP 5.4's short array syntax ([]), and to revert it, using PHP's built-in tokenizer. By relying on the PHP tokenizer, nothing but the array syntax itself will be altered. The script was successfully tested against code bases with more than 5,000 PHP files.

elasticsearch-analysis-jieba - The plugin includes the `jieba` analyzer, `jieba` tokenizer, and `jieba` token filter, and has two modes you can choose from

  •    Java

The plugin includes the `jieba` analyzer, `jieba` tokenizer, and `jieba` token filter, and has two modes you can choose from. One is `index` mode, which is used when you want to index a document. The other is `search` mode, which is used when you want to search for something.

BlingFire - A lightning fast Finite State machine and REgular expression manipulation library.

  •    C++

Hi, we are a team at Microsoft called Bling (Beyond Language Understanding); we help Bing be smarter. Here we wanted to share with all of you our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing, such as Tokenization, Multi-word expression matching, Unknown word-guessing, and Stemming / Lemmatization, just to mention a few. Bling Fire Tokenizer is a tokenizer designed for fast, high-quality tokenization of Natural Language text. It mostly follows the tokenization logic of NLTK, except hyphenated words are split and a few errors are fixed.

Java tokenizer and parser tools

  •    Java

A Java suite for parsing arbitrary text data. Not just HTML or XML or Java, but all of them. Use it when the JDK tokenizers are too limited, when JavaCC, JTB etc. are too complicated, or when you need dynamic parser configuration.

pygold

  •    Python

Pure Python implementation of the GOLD Parser Engine. The GOLD Parser Engine is a LALR(1) parser with a DFA tokenizer. It uses a compiled grammar table generated by GOLD Parser Builder (not included; available at http://www.devincook.com/goldparser).

Indian Speech Synthesis System (festival)

  •    

festival-in will have different speech synthesis systems for the respective Indian languages, based on the "festival" TTS (Text-To-Speech) engine, under its umbrella. It will have modules (tokenizer and lexical) for the respective Indian languages.

friso

  •    

Friso is a Chinese tokenizer developed in C. It uses the popular mmseg algorithm to tokenize the Chinese characters.

twitter-korean-text - Korean tokenizer

  •    Scala

Scala library to process Korean text

parse5 - HTML parsing/serialization toolset for Node

  •    Javascript

HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant. parse5 provides nearly everything you may need when dealing with HTML. It's the fastest spec-compliant HTML parser for Node to date. It parses HTML the way the latest version of your browser does. It has proven itself reliable in such projects as jsdom, Angular2, Polymer and many more.

parsekit - Objective-C Tokenizer and Parser Generator. Supports Grammars.

  •    Objective-C

I've forked ParseKit into a new faster/cleaner/smaller library called PEGKit. ParseKit should be considered deprecated, and PEGKit should be used for all new development.

GPoSTTL: Enhanced Brill's Tagger

  •    C

Enhanced version of Brill's Parts-of-Speech Tagger with built-in Tokenizer and Lemmatizer.

language - A fast PEG parser written in JavaScript with first class errors

  •    Objective-J

Language.js is an experimental new parser based on PEG (Parsing Expression Grammar), with the special addition of the "naughty OR" operator to handle errors in a unique new way. It makes use of memoization to achieve linear time parsing speed, and support for automatic cut placement is coming to maintain mostly constant space as well (for a discussion of cut operators see: www.ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf). The most distinctive addition Language.js makes to PEG is how it handles errors. No parse ever fails in Language.js; instead, SyntaxErrorNodes are placed into the resultant tree. This makes it trivial to do things like write syntax highlighters that have live error reporting. This also means that Language.js is very competent at handling multiple errors (as opposed to aborting on the first one that is reached).