SymSpell - 1 million times faster through Symmetric Delete spelling correction algorithm

  •        1019

Spelling correction & Fuzzy search: 1 million times faster through Symmetric Delete spelling correction algorithm The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

https://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f
https://github.com/wolfgarbe/SymSpell

Tags
Implementation
License
Platform

   




Related Projects

fuzzysearch - :pig: Tiny and fast fuzzy search in Go

  •    Go

Inspired by bevacqua/fuzzysearch, a fuzzy matching library written in JavaScript. But contains some extras like ranking using Levenshtein distance (see RankMatch()) and finding matches in a list of words (see Find()). Fuzzy searching allows for flexibly matching a string with partial input, useful for filtering data very quickly based on lightweight user input.


fuzzysearch - :pig: Tiny and fast fuzzy search in Go

  •    Go

Inspired by bevacqua/fuzzysearch, a fuzzy matching library written in JavaScript. But contains some extras like ranking using Levenshtein distance (see RankMatch()) and finding matches in a list of words (see Find()). Fuzzy searching allows for flexibly matching a string with partial input, useful for filtering data very quickly based on lightweight user input.

fuzzywuzzy - Fuzzy String Matching in Python

  •    Python

Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

gse - Go efficient text segmentation; support english, chinese, japanese and other. Go 语言高性能分词

  •    Go

Go efficient text segmentation; support english, chinese, japanese and other. Dictionary with double array trie (Double-Array Trie) to achieve, Sender algorithm is the shortest path based on word frequency plus dynamic programming.

textdistance - Compute distance between sequences

  •    Python

TextDistance -- python library for comparing distance between two or more sequences by many algorithms. Work in progress. Now all algorithms compare two strings as array of bits.

jieba - 结巴中文分词

  •    Python

"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.

SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation

  •    C++

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) and unigram language model with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre postprocessing.

Chinese-Word-Vectors - 100+ Chinese Word Vectors 上百种预训练中文词向量

  •    Python

This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks. Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.

jellyfish - 🎐 a python library for doing approximate and phonetic matching of strings.

  •    Python

Jellyfish is a python library for doing approximate and phonetic matching of strings. Written by James Turk <james.p.turk@gmail.com> and Michael Stephens.

rmmseg-cpp - an re-implementation of rmmseg (Chinese word segmentation library for Ruby) in C++

  •    Ruby

an re-implementation of rmmseg (Chinese word segmentation library for Ruby) in C++

Arab Techies

  •    Javascript

A collection of open source libraries and tools that provide solutions for common problems in processing Arabic text, especially in web applications. text normalization, phrase segmentation, text indexing, stop word lists, common spelling mistakes.






We have large collection of open source products. Follow the tags from Tag Cloud >>


Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.