uax29 - A tokenizer based on Unicode text segmentation (UAX 29), for Go

  •    Go

This package tokenizes words, sentences and graphemes, based on Unicode text segmentation (UAX 29), for Unicode version 13.0.0. Any time our code operates on individual words, we are tokenizing. Often, we do it ad hoc, such as splitting on spaces, which gives inconsistent results. Best to do it consistently.

https://github.com/clipperhouse/uax29
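
The inconsistency of ad hoc splitting mentioned in the description can be seen with the Go standard library alone. The sketch below is illustrative and does not use the uax29 package; the helper names `splitOnSpaces` and `splitOnWhitespace` are hypothetical. Two common ad hoc approaches give different "words" for the same input, and neither separates punctuation from words.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOnSpaces is the naive approach: cut at every single space character.
// Consecutive spaces produce empty tokens.
func splitOnSpaces(s string) []string {
	return strings.Split(s, " ")
}

// splitOnWhitespace uses strings.Fields, which collapses runs of whitespace
// but still leaves punctuation attached to the neighboring word.
func splitOnWhitespace(s string) []string {
	return strings.Fields(s)
}

func main() {
	text := "Hello,  world! It's 3.14."

	fmt.Printf("%q\n", splitOnSpaces(text))
	// ["Hello," "" "world!" "It's" "3.14."]

	fmt.Printf("%q\n", splitOnWhitespace(text))
	// ["Hello," "world!" "It's" "3.14."]
}
```

A tokenizer that follows the UAX 29 word boundary rules is meant to replace this kind of guesswork with one consistent definition of a word.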

Related Projects

prose - :book: A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction

  •    Go

prose is a Go library for text processing (primarily English at the moment) that supports tokenization, part-of-speech tagging, named-entity extraction, and more. The library's functionality is split into subpackages designed for modular use. See the GoDoc documentation for more information.

BlingFire - A lightning fast Finite State machine and REgular expression manipulation library.

  •    C++

Hi, we are a team at Microsoft called Bling (Beyond Language Understanding); we help Bing be smarter. Here we wanted to share with all of you our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing, such as tokenization, multi-word expression matching, unknown word-guessing, and stemming / lemmatization, just to mention a few. Bling Fire Tokenizer is a tokenizer designed for fast, high-quality tokenization of natural language text. It mostly follows the tokenization logic of NLTK, except that hyphenated words are split and a few errors are fixed.

pikkr - JSON parser which picks up values directly without performing tokenization in Rust

  •    Rust

Pikkr is a JSON parser, written in Rust, which picks up values directly without performing tokenization. It is implemented based on Y. Li, N. R. Katsipoulakis, B. Chandramouli, J. Goldstein, and D. Kossmann. Mison: a fast JSON parser for data analytics. In VLDB, 2017. This parser performs well when there are a limited number of different JSON structural variants in a JSON data stream or JSON collection, which is a common case in the data analytics field.

argos-translate - Open source neural machine translation in Python

  •    Python

Open-source offline translation library written in Python. Uses OpenNMT for translations, SentencePiece for tokenization, Stanza for sentence boundary detection, and PyQt for GUI. Designed to be used as either a Python library, command-line, or GUI application. LibreTranslate is an API and web-app built on top of Argos Translate. Argos Translate supports installing model files which are a zip archive with an ".argosmodel" extension that contains an OpenNMT CTranslate2 model, a SentencePiece tokenization model, a Stanza tokenizer model for sentence boundary detection, and metadata about the model. Pretrained models can be downloaded here.

Awesome-Unicode - :joy: :ok_hand: A curated list of delightful Unicode tidbits, packages and resources

  •    Javascript

A curated list of delightful Unicode tidbits, packages and resources. Please read the contribution guidelines before contributing. Key Unicode terminology is defined in the glossary.


Unicode Converter

Unicode Converter is free, open-source software for converting to and from Unicode and for getting information about a character. It is developed in C# 3.5 and provides two user interfaces: a WPF one for Windows and an ASP.NET one for the web.

Unicode-GLib

Unicode-GLib for PalmOS. Unicode-GLib provides Unicode rendering and display capabilities for PalmOS applications. Supports proper display of languages like Arabic, Chinese, Hebrew, Korean, Tamil, Thai, etc. NOW WITH TEXT ENTRY AND UNICODE KEYBOARDS!

unicode-slugify - A slugifier that works in unicode

  •    Python

Unicode Slugify is a slugifier that generates unicode slugs. It was originally used in the Firefox Add-ons web site to generate slugs for add-ons and add-on collections. Many of these add-ons and collections had unicode characters and required more than simple transliteration.

emoji-regex - A regular expression to match all Emoji-only symbols as per the Unicode Standard.

  •    Javascript

emoji-regex offers a regular expression to match all emoji symbols (including textual representations of emoji) as per the Unicode Standard. This repository contains a script that generates this regular expression based on the data from Unicode Technical Report #51. Because of this, the regular expression can easily be updated whenever new emoji are added to the Unicode standard.

utf8proc - a clean C library for processing UTF-8 Unicode data

  •    C

utf8proc is a small, clean C library that provides Unicode normalization, case-folding, and other operations for data in the UTF-8 encoding. It was initially developed by Jan Behrens and the rest of the Public Software Group, who deserve nearly all of the credit for this package. With the blessing of the Public Software Group, the Julia developers have taken over development of utf8proc, since the original developers have moved to other projects. The utf8proc package is licensed under the free/open-source MIT "expat" license (plus certain Unicode data governed by the similarly permissive Unicode data license); please see the included LICENSE.md file for more detailed information.

Tantivy - Full-text search engine library inspired by Lucene and written in Rust

  •    Rust

Tantivy is a full-text search engine library written in Rust. It is closer to Lucene than to Elasticsearch and Solr in the sense that it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

uni - Query the Unicode database from the commandline, with good support for emojis

  •    Go

uni queries the Unicode database from the command line. It supports Unicode 14.0 (September 2021) and has good support for emojis. There are four commands: identify, to identify codepoints in a string; search, to search for codepoints; print, to print codepoints by class, block, or range; and emoji, to find emojis.

Unicode IVS Add-in for Microsoft Office

"Unicode IVS Add-in for Microsoft Office" enables Microsoft Office 2007 and 2010 to load, save, and edit documents that contain Unicode IVS (Ideographic Variation Sequences).

Khmer Unicode Converter

  •    CSharp

Khmer Unicode Converter is a .NET library that converts Khmer text from legacy fonts to Unicode fonts and vice versa. This library was developed based on Khmer Converter from KhmerOS (http://www.khmeros.info). All the code in this library is converted from the Python version of Khmer Co...

Image to Text Art (HTML Art, Unicode Art, Ascii Art)

Image to Text Art is a class library, WinForms project, and example ASP.NET site that turns images supported by the Bitmap class into HTML art, Unicode art, and ASCII art.

Unicode Rewriter

  •    Java

Unicode Rewriter is a Java tool which converts ID3 tags of MP3 files into Unicode. The reconverted MP3 files can be processed by iTunes and Rhythmbox.

ansiweather - Weather in your terminal, with ANSI colors and Unicode symbols

  •    Shell

AnsiWeather is a Shell script for displaying the current weather conditions in your terminal, with support for ANSI colors and Unicode symbols. Weather data comes from the OpenWeatherMap free weather API.

Open Layer for Unicode

  •    C++

An open-source friendly replacement library for the Microsoft Layer for Unicode. This library allows a unicode Windows application to run unchanged on all versions of Windows, including Windows 95, 98 and ME.

python-unicodecsv - Python2's stdlib csv module is nice, but it doesn't support unicode

  •    Python

unicodecsv is a drop-in replacement for Python 2.7's csv module that supports unicode strings without hassle. Supported versions are Python 2.6, 2.7, 3.3, 3.4, 3.5, and PyPy 2.4.0. Python 2's csv module doesn't easily deal with unicode strings, leading to the dreaded "'ascii' codec can't encode characters in position ..." exception.





