This is a tokenizer that tokenizes text according to the line breaking classes defined by the Unicode Line Breaking algorithm (tr14). It also annotates each token with its line breaking action. This is useful when performing Natural Language Processing or doing manual line breaking. The full range of Unicode code points are supported by this tokenizer. If you however only want to tokenize selected portions of the Unicode standard, such as the Basic Multilingual Plane, you can subset the supported Unicode range. To generate a subsetted tokenizer, modify the included-ranges.txt and excluded-classes.txt files, and use the --include-ranges and --exclude-classes command line options on the generate-tokens script.