Spelling correction & fuzzy search: 1 million times faster through the Symmetric Delete spelling correction algorithm. The Symmetric Delete algorithm reduces the complexity of edit-candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster than the standard approach (deletes + transposes + replaces + inserts) and language independent.
https://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f
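The key trick is that delete-only candidate sets are symmetric: if a dictionary term and a query are within a small edit distance, they share a variant reachable by deletes alone, so only deletes need to be precomputed. A minimal Python sketch of that idea (not SymSpell's actual implementation, which adds length pruning and a Damerau-Levenshtein verification pass over the candidates):

```python
from itertools import combinations

def delete_variants(term, max_deletes=2):
    """All strings reachable from `term` by deleting up to `max_deletes` chars."""
    variants = {term}
    for k in range(1, max_deletes + 1):
        if k >= len(term):
            break
        for idx in combinations(range(len(term)), k):
            variants.add("".join(c for i, c in enumerate(term) if i not in idx))
    return variants

def build_index(dictionary, max_deletes=2):
    """Map each delete-variant to the dictionary words that produce it (precomputed once)."""
    index = {}
    for word in dictionary:
        for v in delete_variants(word, max_deletes):
            index.setdefault(v, set()).add(word)
    return index

def lookup(query, index, max_deletes=2):
    """Candidate corrections: dictionary words sharing a delete-variant with the query.
    A real implementation would verify each candidate with Damerau-Levenshtein."""
    candidates = set()
    for v in delete_variants(query, max_deletes):
        candidates |= index.get(v, set())
    return candidates

index = build_index({"hello", "help", "spell", "spelling"})
print(lookup("helo", index))  # {'hello', 'help'}
```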
Golang string comparison and edit distance algorithms library featuring: Levenshtein, LCS, Hamming, Damerau-Levenshtein (OSA and adjacent-transpositions variants), Jaro-Winkler, Cosine, and more.
unicode algorithms edit-distance levenshtein jaro-winkler levenshtein-distance similarity-measures string-distance cosine string-matching damerau-levenshtein lcs lcs-distance hamming string-comparison golang-string-comparison edit-distance-algorithms
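For reference, the Wagner-Fischer dynamic program underlying most of the Levenshtein implementations in this list, sketched in Python rather than Go (this is the shared textbook algorithm, not this particular library's code):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the Wagner-Fischer DP, one row at a time (O(len(b)) memory)."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```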
Measure the difference between two strings using the fastest JS implementation of the Levenshtein distance algorithm.
leven levenshtein distance algorithm algo string difference diff fast fuzzy similar similarity compare comparison edit text match matching
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best PHP Chinese word segmentation module. Scroll down for English documentation.
nlp natural-language-processing chinese-text-segmentation machine-learning
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity...
levenshtein-distance cosine-similarity string-distance damerau-levenshtein distance distance-measure jaro-winkler similarity-measures shingles algorithm jvm
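To illustrate the n-gram/shingle family on that list, here is a minimal Jaccard index over character bigrams; the library's own definitions may differ in tokenization and normalization details:

```python
def ngrams(s: str, n: int = 2) -> set:
    """Set of character n-grams (shingles) of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard index of the two strings' n-gram sets: |A & B| / |A | B|."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

print(round(jaccard("night", "nacht"), 3))  # only 'ht' is shared -> 0.143
```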
Inspired by bevacqua/fuzzysearch, a fuzzy matching library written in JavaScript, but with some extras such as ranking matches by Levenshtein distance (see RankMatch()) and finding matches in a list of words (see Find()). Fuzzy searching allows flexible matching of a string against partial input, useful for quickly filtering data based on lightweight user input.
fuzzy-search algorithm
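The core fuzzy-match test in this family of libraries checks that the needle's characters appear in the haystack in order, gaps allowed; a hedged Python equivalent of the idea (ranking the survivors would then use an edit distance as sketched above):

```python
def fuzzy_match(needle: str, haystack: str) -> bool:
    """True if all chars of needle appear in haystack in the same order (gaps allowed)."""
    it = iter(haystack)
    # `ch in it` consumes the iterator up to and including the match,
    # so later needle chars can only match later haystack positions.
    return all(ch in it for ch in needle)

print(fuzzy_match("twl", "cartwheel"))  # True: t..w....l appear in order
print(fuzzy_match("wlt", "cartwheel"))  # False: order matters
```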
Baidu's open-source lexical analysis tool for Chinese, including word segmentation, part-of-speech tagging & named entity recognition.
lexical-analysis word-segmentation part-of-speech-tagger named-entity-recognition chinese-word-segmentation chinese-nlp
Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
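The headline scorer turns a sequence comparison into a 0-100 similarity. A rough standard-library approximation of that behavior (fuzzywuzzy's actual scorers use difflib or python-Levenshtein internals plus extra preprocessing, so treat this as a sketch):

```python
from difflib import SequenceMatcher

def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity score in the style of a fuzz.ratio-type scorer."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

print(simple_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # ~91
```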
Go efficient text segmentation; supports English, Chinese, Japanese and more. The dictionary is implemented with a double-array trie, and the segmenter finds the shortest path based on word frequency plus dynamic programming.
segment nlp gse chinese english japanese trie
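That frequency-plus-dynamic-programming idea is a shortest-path search over cut positions: score each position by the best segmentation ending there, with word log-frequencies as edge weights. A toy Python sketch with a hypothetical mini-dictionary (gse itself is written in Go and backs the dictionary lookup with a double-array trie):

```python
import math

# Hypothetical toy dictionary: word -> frequency (illustrative numbers only).
FREQ = {"研究": 20, "研究生": 10, "生命": 15, "命": 5, "的": 50, "起源": 12, "生": 8}
TOTAL = sum(FREQ.values())

def segment(text: str, max_len: int = 4) -> list:
    """Max-probability segmentation: DP over cut positions (shortest path by -log p)."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)   # best[i] = (score, start of last word) for text[:i]
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            logp = math.log(FREQ.get(word, 0.5) / TOTAL)  # tiny smoothing for unknowns
            score = best[j][0] + logp
            if score > best[i][0]:
                best[i] = (score, j)
    words, i = [], n
    while i > 0:                         # backtrack through the best cuts
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]

print(segment("研究生命起源"))  # ['研究', '生命', '起源'] with these toy frequencies
```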
TextDistance -- a Python library for comparing the distance between two or more sequences using many algorithms. Work in progress. Currently all algorithms compare two strings as arrays of bits.
distance algorithm textdistance hamming-distance levenshtein-distance damerau-levenshtein damerau-levenshtein-distance algorithms distance-calculation jellyfish
A lightweight open source Chinese tokenizer with support for keyword, key-sentence and summary extraction, offering up-to-date Lucene, Solr and Elasticsearch APIs.
jcseg mmseg chinese-word-segmentation natural-language-processing pos-tagging nlp nlp-keywords-extraction lucene-analyzer lucene-tokenizer solr-plugin elasticsearch-analyzer chinese-text-segmentation chinese-nlp keywords-extraction jcseg-analyzer
This is a package for Chinese text segmentation, keyword extraction and part-of-speech tagging.
cppjieba jieba chinese-text-segmentation lexical-analysis chinese nlp
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
natural-language-processing nlp
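Basic usage, per jieba's own documentation, is a one-liner; cut() returns a generator of tokens:

```python
import jieba  # pip install jieba

# Default (accurate) mode: a generator of segmented tokens.
print("/".join(jieba.cut("我来到北京清华大学")))
# Expected output along the lines of: 我/来到/北京/清华大学

# Full mode: every word the dictionary can find, useful for recall-oriented search.
print("/".join(jieba.cut("我来到北京清华大学", cut_all=True)))
```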
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for neural-network-based text generation systems where the vocabulary size is predetermined prior to neural model training. SentencePiece implements subword units (e.g., byte-pair encoding (BPE) and the unigram language model) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
neural-machine-translation natural-language-processing word-segmentation translation text tokenizer neural-network
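The BPE half of that is easy to sketch: start from characters and repeatedly merge the most frequent adjacent symbol pair. A minimal illustration on a toy word-frequency table (SentencePiece's real trainer works on raw sentences, treats whitespace as an ordinary symbol, and also offers the unigram LM model):

```python
from collections import Counter

def bpe_merges(words, num_merges=10):
    """Learn BPE merges from a word-frequency dict; words are stored as symbol tuples."""
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():          # count adjacent symbol pairs
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]          # most frequent pair wins
        merges.append((a, b))
        new_vocab = {}
        for symbols, freq in vocab.items():          # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 5)
print(merges)  # early merges tend to be ('e','s'), ('es','t'), ...
```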
This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks. Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.
chinese chinese-word-segmentation embeddings word-embeddings vectors-trained embedding
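Downstream use mostly reduces to vector arithmetic: nearest neighbors by cosine similarity, and a : b :: c : ? analogies as in the CA8 dataset. A sketch with made-up 3-d vectors standing in for the pre-trained embeddings:

```python
import numpy as np

# Hypothetical tiny embedding table standing in for the pre-trained vectors.
emb = {
    "国王": np.array([0.9, 0.8, 0.1]),  # king
    "男人": np.array([0.8, 0.1, 0.1]),  # man
    "女人": np.array([0.1, 0.1, 0.8]),  # woman
    "王后": np.array([0.2, 0.8, 0.8]),  # queen
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Solve a : b :: c : ? by nearest cosine neighbor of b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    return max((w for w in emb if w not in (a, b, c)),
               key=lambda w: cosine(emb[w], target))

print(analogy("男人", "国王", "女人"))  # 王后 (queen), with these toy vectors
```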
Jellyfish is a Python library for doing approximate and phonetic matching of strings. Written by James Turk <james.p.turk@gmail.com> and Michael Stephens.
levenshtein soundex hamming metaphone jaro-winkler fuzzy-search
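A quick taste of both matching styles (function names as in recent jellyfish releases; older versions spell a few differently):

```python
import jellyfish  # pip install jellyfish

# Approximate matching: edit distance between two strings.
print(jellyfish.levenshtein_distance("jellyfish", "smellyfish"))  # 2

# Phonetic matching: words that sound alike get the same code.
print(jellyfish.soundex("Jellyfish"))  # 'J412'
print(jellyfish.soundex("jellyfihs") == jellyfish.soundex("jellyfish"))  # True
```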
A re-implementation of rmmseg (a Chinese word segmentation library for Ruby) in C++.
A collection of open source libraries and tools that provide solutions for common problems in processing Arabic text, especially in web applications: text normalization, phrase segmentation, text indexing, stop word lists, and common spelling mistakes.
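Text normalization, the first item on that list, typically means stripping diacritics (tashkeel) and folding letter variants. A minimal sketch of the commonly used rules; the exact folding set varies by project:

```python
import re

# Arabic diacritics (tashkeel, U+064B-U+0652) plus the tatweel stretching char (U+0640).
TASHKEEL = re.compile("[\u064b-\u0652\u0640]")

# Common letter foldings: alef variants -> bare alef, ta marbuta -> ha, alef maqsura -> ya.
FOLD = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ة": "ه", "ى": "ي"})

def normalize_arabic(text: str) -> str:
    """Strip diacritics/tatweel and fold common letter variants."""
    return TASHKEEL.sub("", text).translate(FOLD)

print(normalize_arabic("مَدْرَسَةٌ"))  # -> مدرسه
```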
The library's full documentation can be found here. Be sure to lint & pass the unit tests before submitting your pull request.
natural-language-processing machine-learning fuzzy-matching clustering record-linkage bayes bloom-filter canberra caverphone chebyshev cologne cosine classifier daitch-mokotoff dice fingerprint fuzzy hamming k-means jaccard jaro lancaster levenshtein lig metaphone mra ngrams nlp nysiis perceptron phonetic porter punkt schinke sorensen soundex stats tfidf tokenizer tversky vectorizer winkler
Chinese language processing package (HanLP).
nlp natural-language-processing hanlp crf hmm trie textrank doublearraytrie neural-network chinese-word-segmentation text-mining pos-tagging dependency-parser text-classification word2vec perceptron named-entity-recognition text-clustering