
libpostal - A C library for parsing/normalizing street addresses around the world

  •    C

Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally. The core library is written in pure C. Language bindings for Python, Ruby, Go, Java, PHP, and NodeJS are officially supported and it's easy to write bindings in other languages.
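To make the idea of a "clean normalized form" concrete, here is a minimal Python sketch of address normalization. It is not libpostal's method or API: the `ABBREVIATIONS` table and the `normalize_address` function are hypothetical names invented for this illustration, and the table covers only a handful of English abbreviations.

```python
import re

# Hypothetical, deliberately tiny abbreviation table (illustration only).
ABBREVIATIONS = {
    "st": "street",
    "ave": "avenue",
    "rd": "road",
    "apt": "apartment",
    "n": "north",
    "s": "south",
}


def normalize_address(raw: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    # Keep only runs of ASCII letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    # Replace a token when it is a known abbreviation, else keep it.
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


print(normalize_address("123 N. Main St., Apt 4"))
```

A lookup table like this breaks down quickly: "St" can mean "street" or "saint", and conventions differ by country and language, which is the kind of context a dedicated library is meant to handle.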

dedupe - A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution

  •    Python

dedupe is a Python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on structured data. dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.

csvdedupe - Command line tool for deduplicating CSV files

  •    Python

Command line tools for using the dedupe Python library to deduplicate CSV files. csvdedupe takes a messy input file or STDIN pipe and identifies duplicates.

phonics - Phonetic Spelling Algorithms in R

  •    R

This is the R package to support phonetic spelling algorithms in R. Several packages provide the Soundex algorithm. However, other algorithms have been developed since Soundex that can also provide phonetic spelling and test phonetic similarity. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. In particular, it used the Comet system at the San Diego Supercomputer Center (SDSC) through allocations TG-DBS170012 and TG-ASC150024.
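For readers unfamiliar with Soundex, the following is a minimal Python sketch of the classic American Soundex code, which maps similar-sounding names to the same four-character key. It is an independent illustration, not code from the phonics package; the function name `soundex` is chosen here, and the sketch assumes plain ASCII names.

```python
def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    # Consonant groups that share a digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # its code still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = codes.get(ch, "")        # vowels (and Y) have no code
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets prev, allowing repeats
    return (result + "000")[:4]         # pad with zeros, truncate to 4


print(soundex("Robert"), soundex("Rupert"))
```

"Robert" and "Rupert" receive the same code, which is how Soundex supports tests of phonetic similarity.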

dedupe-examples - Examples for using the dedupe library

  •    Python

Example scripts for dedupe, a library that uses machine learning to perform de-duplication and entity resolution quickly on structured data. We recommend using virtualenv and virtualenvwrapper for working in a virtualized development environment. Read how to set up virtualenv.

spark-lucenerdd - Spark RDD with Lucene's query capabilities

  •    Scala

Spark RDD with Apache Lucene's query capabilities. Using the query parser, you can perform prefix queries, fuzzy queries, etc., and any combination of those. For more information on using Lucene's query parser, see Query Parser.

pgdedupe - A simple command line interface to the datamade/dedupe library.

  •    Jupyter

A work-in-progress to provide a standard interface for deduplication of large databases with custom pre-processing and post-processing steps. In addition to running a database-level deduplication with dedupe, this script adds custom pre- and post-processing steps to improve the run-time and results, making this a hybrid between fuzzy matching and record linkage.

rltk - Record Linkage ToolKit (Find and link entities)

  •    Python

The Record Linkage ToolKit (RLTK) is a general-purpose open-source record linkage platform that allows users to build powerful Python programs that link records referring to the same underlying entity. Record linkage is an extremely important problem that shows up in domains extending from social networks to bibliographic data and biomedicine. Current open platforms for record linkage have problems scaling even to moderately sized datasets, or are just not easy to use (even by experts). RLTK attempts to address all of these issues. RLTK supports a full, scalable record linkage pipeline, including multi-core algorithms for blocking, profiling data, computing a wide variety of features, and training and applying machine learning classifiers based on Python’s sklearn library. An end-to-end RLTK pipeline can be jump-started with only a few lines of code. However, RLTK is also designed to be extensible and customizable, allowing users arbitrary degrees of control over many of the individual components. You can add new features to RLTK (e.g. a custom string similarity) very easily.
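The pipeline stages mentioned above (blocking, feature computation, classification) can be illustrated with a small standard-library Python sketch. This is not RLTK's API: `block_key`, `candidate_pairs`, `similarity` and `link` are hypothetical names, the blocking key (first letter of the name) and the 0.8 threshold are arbitrary assumptions, and a fixed threshold stands in for a trained classifier.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record):
    # Hypothetical blocking key: first letter of the lowercased name.
    return record["name"].strip().lower()[:1]


def candidate_pairs(records):
    """Blocking: only records sharing a key are ever compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)


def similarity(a, b):
    """A single string-similarity feature in the range [0, 1]."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def link(records, threshold=0.8):
    """Return id pairs whose similarity meets the (assumed) threshold."""
    return [(a["id"], b["id"])
            for a, b in candidate_pairs(records)
            if similarity(a, b) >= threshold]


records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jonathon Smith"},
    {"id": 3, "name": "Maria Garcia"},
]
print(link(records))
```

Blocking is what keeps the comparison count manageable: without it every record is compared with every other, which grows quadratically with the dataset size.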