Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally. The core library is written in pure C. Language bindings for Python, Ruby, Go, Java, PHP, and NodeJS are officially supported and it's easy to write bindings in other languages.
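To make the idea of "clean normalized forms suitable for machine comparison" concrete, here is a minimal Python sketch of address normalization. It is not libpostal's algorithm or API: the `ABBREVIATIONS` table and the `normalize_address` function are hypothetical, illustration-only names, and the table covers only a handful of English abbreviations.

```python
import re

# Hypothetical, deliberately tiny abbreviation table (English only).
# A real normalizer needs per-language dictionaries and must handle
# ambiguity, e.g. "st" can mean "street" or "saint".
ABBREVIATIONS = {
    "st": "street",
    "ave": "avenue",
    "rd": "road",
    "blvd": "boulevard",
    "apt": "apartment",
    "n": "north",
    "s": "south",
    "e": "east",
    "w": "west",
}


def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    # Keep only runs of ASCII letters and digits; punctuation is dropped.
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


if __name__ == "__main__":
    a = normalize_address("123 Main St., Apt 4")
    b = normalize_address("123 MAIN STREET APARTMENT 4")
    print(a)
    print(a == b)
```

Two differently written addresses compare equal after normalization, which is the property that makes such forms useful for indexing and matching. The naive table lookup breaks down quickly on ambiguous tokens and non-English conventions, which is the gap a dedicated library is meant to fill.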
address-parser machine-learning nlp address international deduplication record-linkage deduping natural-language-processing
dedupe is a Python library that uses machine learning to perform fuzzy matching, deduplication, and entity resolution quickly on structured data. dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.
dedupe record-linkage python-library entity-resolution
The library's full documentation can be found here. Be sure to lint & pass the unit tests before submitting your pull request.
natural-language-processing machine-learning fuzzy-matching clustering record-linkage bayes bloom-filter canberra caverphone chebyshev cologne cosine classifier daitch-mokotoff dice fingerprint fuzzy hamming k-means jaccard jaro lancaster levenshtein lig metaphone mra ngrams nlp nysiis perceptron phonetic porter punkt schinke sorensen soundex stats tfidf tokenizer tversky vectorizer winkler
Command line tools for using the dedupe Python library to deduplicate CSV files. csvdedupe takes a messy input file or STDIN pipe and identifies duplicates.
dedupe cli record-linkage entity-resolution csv-files
This R package supports phonetic spelling algorithms. Several packages provide the Soundex algorithm. However, other algorithms have been developed since Soundex that can also provide phonetic spelling and test phonetic similarity. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. In particular, it used the Comet system at the San Diego Supercomputer Center (SDSC) through allocations TG-DBS170012 and TG-ASC150024.
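As an illustration of the kind of phonetic encoding described above, here is a minimal Python sketch of classic American Soundex. It is not code from the R package; the `soundex` function and `_CODES` table are names chosen for this example, and the sketch assumes ASCII input (other characters are ignored).

```python
# Letter-to-digit table for American Soundex.
_CODES = {}
for digit, letters in {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}.items():
    for ch in letters:
        _CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code, or "" for empty input."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    # Remember the first letter's code so an immediate repeat is skipped.
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are ignored and do not separate equal codes.
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
            if len(encoded) == 4:
                break
        # Vowels have no code, so they reset prev and allow a repeat.
        prev = code
    return encoded.ljust(4, "0")


if __name__ == "__main__":
    for word in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
        print(word, soundex(word))
```

Names that sound alike, such as "Robert" and "Rupert", map to the same code, which is what makes phonetic keys useful for record linkage. Later algorithms such as NYSIIS and Metaphone refine the same idea with different rules.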
phonetic-spelling-algorithms soundex phonics nysiis metaphone text-processing linguistics record-linkage
Example scripts for dedupe, a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. We recommend using virtualenv and virtualenvwrapper for working in a virtualized development environment. Read how to set up virtualenv.
dedupe record-linkage entity-resolution
Spark RDD with Apache Lucene's query capabilities. Using the query parser, you can perform prefix queries, fuzzy queries, etc., and any combination of those. For more information on using Lucene's query parser, see Query Parser.
spark lucene rdd spatial-search record-linkage
A work-in-progress to provide a standard interface for deduplication of large databases with custom pre-processing and post-processing steps. In addition to running a database-level deduplication with dedupe, this script adds custom pre- and post-processing steps to improve the run-time and results, making this a hybrid between fuzzy matching and record linkage.
deduplication dedupe data-cleaning record-linkage postgresql database
The Record Linkage ToolKit (RLTK) is a general-purpose open-source record linkage platform that allows users to build powerful Python programs that link records referring to the same underlying entity. Record linkage is an extremely important problem that shows up in domains extending from social networks to bibliographic data and biomedicine. Current open platforms for record linkage have problems scaling even to moderately sized datasets, or are just not easy to use (even by experts). RLTK attempts to address all of these issues. RLTK supports a full, scalable record linkage pipeline, including multi-core algorithms for blocking, profiling data, computing a wide variety of features, and training and applying machine learning classifiers based on Python’s sklearn library. An end-to-end RLTK pipeline can be jump-started with only a few lines of code. However, RLTK is also designed to be extensible and customizable, allowing users arbitrary degrees of control over many of the individual components. You can add new features to RLTK (e.g. a custom string similarity) very easily.
linkage similarity similarity-metric string-similarity record-linkage entity-resolution
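The blocking and string-similarity steps mentioned in the RLTK description can be sketched with the Python standard library alone. This is not RLTK's API: `block_key`, `link_records`, the record layout, and the 0.85 threshold are all assumptions made for this example, and `difflib.SequenceMatcher` stands in for whatever similarity measure a real pipeline would use.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record: dict) -> str:
    """Hypothetical blocking key: first letter of the name, lowercased."""
    return record["name"][:1].lower()


def link_records(records: list, threshold: float = 0.85) -> list:
    """Return (id_a, id_b, score) for likely matches within each block."""
    # Blocking: only records sharing a key are compared, which avoids
    # scoring every possible pair in the dataset.
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)

    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = SequenceMatcher(
                None, a["name"].lower(), b["name"].lower()
            ).ratio()
            if score >= threshold:
                matches.append((a["id"], b["id"], round(score, 2)))
    return matches


if __name__ == "__main__":
    records = [
        {"id": 1, "name": "Jonathan Smith"},
        {"id": 2, "name": "Jonathon Smith"},
        {"id": 3, "name": "Maria Garcia"},
    ]
    print(link_records(records))
```

A fixed threshold on one similarity score is the simplest possible decision rule. Tools such as dedupe and RLTK instead learn from labeled examples how to weight several features, and a first-letter blocking key would miss matches whose first character differs.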