
jieba - Jieba (结巴) Chinese word segmentation


"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.

Language Detection - Language Detection Library in Java


This is a language detection library implemented in plain Java. It detects the language of a text using a naive Bayesian filter, with over 99% precision for 53 languages.
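To illustrate the naive Bayesian idea, here is a minimal sketch in Go (not the library's Java code). It assumes character trigrams as features, add-one smoothing, a uniform prior, and two tiny made-up training samples; the names `samples`, `train`, and `detect` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// samples is toy training text per language (an assumption for illustration).
var samples = map[string]string{
	"en": "the quick brown fox jumps over the lazy dog and the cat",
	"es": "el rápido zorro marrón salta sobre el perro perezoso y el gato",
}

// trigrams returns the overlapping 3-rune windows of the lowercased text.
func trigrams(text string) []string {
	r := []rune(strings.ToLower(text))
	var out []string
	for i := 0; i+3 <= len(r); i++ {
		out = append(out, string(r[i:i+3]))
	}
	return out
}

// train counts trigram occurrences for each language.
func train(data map[string]string) map[string]map[string]int {
	models := make(map[string]map[string]int)
	for lang, text := range data {
		counts := make(map[string]int)
		for _, g := range trigrams(text) {
			counts[g]++
		}
		models[lang] = counts
	}
	return models
}

// detect returns the language with the highest log-likelihood for text,
// using add-one smoothing over the shared trigram vocabulary.
func detect(models map[string]map[string]int, text string) string {
	vocab := make(map[string]bool)
	for _, counts := range models {
		for g := range counts {
			vocab[g] = true
		}
	}
	best, bestScore := "", math.Inf(-1)
	for lang, counts := range models {
		total := 0
		for _, c := range counts {
			total += c
		}
		score := 0.0
		for _, g := range trigrams(text) {
			// The extra +1 in the denominator reserves mass for unseen trigrams.
			score += math.Log(float64(counts[g]+1) / float64(total+len(vocab)+1))
		}
		if score > bestScore {
			best, bestScore = lang, score
		}
	}
	return best
}

func main() {
	models := train(samples)
	fmt.Println(detect(models, "the dog"))
	fmt.Println(detect(models, "el perro"))
}
```

A real detector would train on far larger corpora and would typically mix several n-gram lengths; the sketch only shows how per-language likelihoods are compared.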

go-i18n - Translate your Go program into multiple languages with templates and CLDR plural support.


go-i18n is a Go package and a command that helps you translate Go programs into multiple languages. The i18n package provides runtime APIs for fetching translated strings.

budou - Budou is an automatic organizer tool for beautiful line breaking in CJK (Chinese, Japanese, and Korean)


English uses spacing and hyphenation as cues to allow for beautiful and legible line breaks. Certain CJK languages have none of these, and are notoriously more difficult. Breaks occur randomly, usually in the middle of a word. This is a long-standing issue in typography on the web, and it degrades readability. Budou automatically translates CJK sentences into organized HTML code with lexical chunks wrapped in non-breaking markup so as to semantically control line breaks. Budou uses the Google Cloud Natural Language API (NL API) to analyze the input sentence, and it concatenates proper words to produce meaningful chunks using part-of-speech (POS) tagging and syntactic information. Processed chunks are wrapped in SPAN tags, so semantic units are no longer split at the end of a line once their display property is set to inline-block in CSS.
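The final wrapping step can be sketched in Go as follows. This is only an illustration of the output shape, not Budou's code: the chunks are supplied by hand (Budou derives them from the NL API), and the function name `wrapChunks` and the inline `style` attribute are assumptions.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// wrapChunks wraps each pre-segmented chunk in a span displayed as
// inline-block, so a browser can break lines only between chunks.
func wrapChunks(chunks []string) string {
	var b strings.Builder
	for _, c := range chunks {
		b.WriteString(`<span style="display:inline-block">`)
		b.WriteString(html.EscapeString(c)) // escape markup characters in the text
		b.WriteString("</span>")
	}
	return b.String()
}

func main() {
	// Hypothetical chunks for a short Japanese sentence.
	fmt.Println(wrapChunks([]string{"今日は", "良い", "天気です。"}))
}
```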




sentencepiece - Unsupervised text tokenizer for Neural Network-based text generation.


SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements sub-word units (also known as wordpieces [Wu et al.] [Schuster et al.] and byte-pair-encoding (BPE) [Sennrich et al.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing. This is not an official Google product.

prose - :book: A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction


prose is a Go library for text processing (primarily English at the moment) that supports tokenization, part-of-speech tagging, named-entity extraction, and more. The library's functionality is split into subpackages designed for modular use. See the GoDoc documentation for more information.

nlp - Extract values from strings and fill your structs with nlp.


You always begin by creating an NL type by calling nlp.New(). The NL type is a Natural Language Processor that owns 3 funcs: RegisterModel(), Learn() and P(). RegisterModel takes 3 parameters: an empty struct, a set of samples, and some options for the model.
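To make the "fill your structs from strings" idea concrete, here is a rough standard-library sketch. It does not use the nlp package or its learning step; it swaps in a plain regular expression built from a single sample. The `{Field}` placeholder syntax, the `Song` struct, and the helpers `compile` and `fill` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"reflect"
	"regexp"
)

// Song is a hypothetical target struct.
type Song struct {
	Name   string
	Artist string
}

var placeholder = regexp.MustCompile(`\{(\w+)\}`)

// compile turns a sample such as "play {Name} by {Artist}" into a regexp
// with one named capture group per placeholder.
func compile(sample string) *regexp.Regexp {
	pattern, last := "", 0
	for _, m := range placeholder.FindAllStringSubmatchIndex(sample, -1) {
		pattern += regexp.QuoteMeta(sample[last:m[0]])
		pattern += "(?P<" + sample[m[2]:m[3]] + ">.+?)"
		last = m[1]
	}
	pattern += regexp.QuoteMeta(sample[last:])
	return regexp.MustCompile("^" + pattern + "$")
}

// fill copies each named group into the string field of the same name in
// the struct that dst points to. It reports whether the input matched.
func fill(re *regexp.Regexp, input string, dst interface{}) bool {
	m := re.FindStringSubmatch(input)
	if m == nil {
		return false
	}
	v := reflect.ValueOf(dst).Elem()
	for i, name := range re.SubexpNames() {
		if name == "" {
			continue
		}
		f := v.FieldByName(name)
		if f.IsValid() && f.Kind() == reflect.String && f.CanSet() {
			f.SetString(m[i])
		}
	}
	return true
}

func main() {
	re := compile("play {Name} by {Artist}")
	var s Song
	if fill(re, "play Hello by Adele", &s) {
		fmt.Printf("%+v\n", s)
	}
}
```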



whatlanggo - Natural language detection library for Go


Natural language detection for Go. Thanks to greyblake (Potapov Sergey) for creating whatlang-rs, from which I got the idea and logic.

MMSEGO - Chinese word splitting algorithm MMSEG in Go


This is a Go implementation of MMSEG, which is a Chinese word splitting algorithm.
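As background, the sketch below shows simple forward maximum matching, the baseline that MMSEG builds on. It is not MMSEGO's code and omits MMSEG's chunk-based disambiguation rules; the dictionary, the `maxLen` parameter, and the function name `maxMatch` are assumptions.

```go
package main

import "fmt"

// maxMatch segments text by forward maximum matching: at each position it
// takes the longest dictionary word of up to maxLen runes, falling back to
// a single rune when nothing matches.
func maxMatch(text string, dict map[string]bool, maxLen int) []string {
	r := []rune(text)
	var out []string
	for i := 0; i < len(r); {
		n := maxLen
		if i+n > len(r) {
			n = len(r) - i
		}
		// Shrink the window until a dictionary word is found or one rune is left.
		for ; n > 1; n-- {
			if dict[string(r[i:i+n])] {
				break
			}
		}
		out = append(out, string(r[i:i+n]))
		i += n
	}
	return out
}

func main() {
	// Toy dictionary for illustration.
	dict := map[string]bool{"研究": true, "研究生": true, "生命": true, "起源": true}
	fmt.Println(maxMatch("研究生命起源", dict, 3))
}
```

With this toy dictionary the greedy matcher picks 研究生 first and strands 命, even though 研究 / 生命 / 起源 is the intended reading. Ambiguities of this kind are what MMSEG's additional rules are designed to resolve.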

segment - A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29


You can use a bufio.Scanner with the SplitWords implementation of SplitFunc. The SplitWords function will identify the appropriate word boundaries in the input text and the Scanner will return tokens at the appropriate place. Sometimes you would also like information returned about the type of token. To do this we have introduced a new type named Segmenter. It works just like Scanner, but it additionally returns a token type.

mystem - CGo bindings to Yandex.Mystem


CGo bindings to Yandex.Mystem, a Russian morphology analyzer. The source code of go-mystem is licensed under the MIT license, but Yandex.Mystem has its own EULA (which allows commercial use) that you must accept.

icu - Cgo binding for icu4c library


Cgo binding for the icu4c C library's detection and conversion functions. Guaranteed compatibility with version 50.1. Installation consists of several simple steps. They may differ slightly on your target system (e.g. require more permissions), so adapt them to your system.

libtextcat - Cgo binding for libtextcat C library


Cgo binding for the libtextcat C library. Guaranteed compatibility with version 2.2. Installation consists of several simple steps. They may differ slightly on your target system (e.g. require more permissions), so adapt them to your system.

snowball - Cgo binding for Snowball C library


The file modules.txt contains all the main algorithms for each language, in UTF-8, and also with the most commonly used encoding. Thus this Go wrapper uses sync.Mutex for each stem operation, so it is thread safe.
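Guarding each stem operation with a sync.Mutex can be sketched as follows. The `unsafeStemmer` type is a hypothetical stand-in for a C stemmer handle that must not be used concurrently; its suffix-stripping logic is a placeholder, not a Snowball algorithm, and `SafeStemmer` is not the wrapper's real type.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// unsafeStemmer stands in for a stemmer handle with shared internal state
// (hypothetical; it reuses one buffer across calls).
type unsafeStemmer struct{ buf []byte }

// stem strips a trailing "s" as a placeholder for real stemming.
func (s *unsafeStemmer) stem(word string) string {
	s.buf = append(s.buf[:0], strings.TrimSuffix(word, "s")...)
	return string(s.buf)
}

// SafeStemmer serializes access to the underlying stemmer with a mutex.
type SafeStemmer struct {
	mu sync.Mutex
	s  unsafeStemmer
}

// Stem holds the lock for the duration of one stem operation.
func (t *SafeStemmer) Stem(word string) string {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.s.stem(word)
}

func main() {
	var st SafeStemmer
	var wg sync.WaitGroup
	for _, w := range []string{"cats", "dogs", "birds"} {
		wg.Add(1)
		go func(w string) {
			defer wg.Done()
			fmt.Println(w, "->", st.Stem(w))
		}(w)
	}
	wg.Wait()
}
```

The trade-off of this design is that concurrent callers queue behind one lock; a pool of independent stemmer handles would allow parallel stemming at the cost of more memory.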

nlp - Selected Machine Learning algorithms for basic natural language processing in Golang


An implementation of selected machine learning algorithms for basic natural language processing in Golang. The initial focus of this project is Latent Semantic Analysis, to allow retrieval/searching, clustering, and classification of text documents based upon semantic content. Built upon the gonum/gonum matrix library, with some inspiration taken from Python's scikit-learn.

sentences - A multilingual command line sentence tokenizer in Golang


This command line utility will convert a blob of text into a list of sentences. This package attempts to fix some problems I noticed for English.

golibstemmer - Go bindings for the snowball libstemmer library including porter 2


This simple library provides Go (golang) bindings for the snowball libstemmer library, including the popular porter and porter2 algorithms. ... or you might need to install it from source.