dedupe - A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution

dedupe is a Python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on structured data. dedupe learns from human-labeled training data and derives the best matching rules for your dataset, so it can find similar records quickly and automatically, even in very large databases.

https://github.com/dedupeio/dedupe
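
To make the idea of fuzzy matching concrete, here is a minimal sketch using only the Python standard library. It is not dedupe's API and does no machine learning: the `similarity` and `cluster_duplicates` helpers, the sample records, and the fixed 0.85 threshold are all assumptions chosen for illustration, whereas dedupe learns its comparison rules from labeled examples.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: dict, b: dict, fields: list) -> float:
    """Average string similarity (0.0 to 1.0) across the given fields."""
    scores = [
        SequenceMatcher(None, a[f].lower().strip(), b[f].lower().strip()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def cluster_duplicates(records: dict, fields: list, threshold: float = 0.85) -> list:
    """Group record ids whose pairwise similarity meets the threshold."""
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the cluster representative (union-find).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Hypothetical brute-force comparison of every pair; fine for a demo,
    # far too slow for the large datasets dedupe targets.
    for a, b in combinations(records, 2):
        if similarity(records[a], records[b], fields) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for rid in records:
        clusters.setdefault(find(rid), set()).add(rid)
    return list(clusters.values())


# Made-up sample data: records 1 and 2 differ by a typo, case and whitespace.
records = {
    1: {"name": "Acme Corporation", "city": "Chicago"},
    2: {"name": "ACME Corporaton ", "city": "chicago"},
    3: {"name": "Globex Inc", "city": "Springfield"},
}
clusters = cluster_duplicates(records, ["name", "city"])
```

With these sample records, the first two end up in one cluster and the third stays on its own. Comparing every pair costs quadratic time, which is why a library aimed at large databases needs smarter candidate selection than this sketch.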

Related Projects

csvdedupe - Command line tool for deduplicating CSV files

  •    Python

Command line tools that use the dedupe Python library to deduplicate CSV files. csvdedupe takes a messy input file or STDIN pipe and identifies duplicates.

Borg - Deduplicating archiver with compression and authenticated encryption

  •    C

BorgBackup (short: Borg) is a deduplicating backup program. Optionally, it supports compression and authenticated encryption. The main goal of Borg is to provide an efficient and secure way to back up data. The data deduplication technique used makes Borg suitable for daily backups, since only changes are stored. The authenticated encryption technique makes it suitable for backups to targets that are not fully trusted.
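
The idea that only changes are stored can be illustrated with a small content-addressed chunk store. This is a sketch under simplifying assumptions, not Borg's implementation: the `ChunkStore` class, the fixed 4-byte chunk size, and the sample data are hypothetical, and Borg's actual chunking and repository format differ.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny so the effect is visible; an assumption


class ChunkStore:
    """Stores each distinct chunk once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.chunks = {}

    def backup(self, data: bytes) -> list:
        """Split data into fixed-size chunks and return the list of digests."""
        manifest = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # A chunk seen in an earlier backup is not stored again.
            self.chunks.setdefault(digest, chunk)
            manifest.append(digest)
        return manifest

    def restore(self, manifest: list) -> bytes:
        """Reassemble the original data from its list of digests."""
        return b"".join(self.chunks[d] for d in manifest)


store = ChunkStore()
monday = store.backup(b"aaaabbbbcccc")
stored_after_monday = len(store.chunks)
tuesday = store.backup(b"aaaabbbbdddd")  # only the last chunk changed
stored_after_tuesday = len(store.chunks)
```

The second backup adds a single new chunk to the store, yet both backups can still be restored in full from their manifests.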

libpostal - A C library for parsing/normalizing street addresses around the world

  •    C

Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally. The core library is written in pure C. Language bindings for Python, Ruby, Go, Java, PHP, and NodeJS are officially supported and it's easy to write bindings in other languages.

bibnet.org - data management

  •    Java

Web-based cataloging and dedupe application. Highly optimized for processing journal articles. Reads MarcXML and dedupes records using field 773 combined with a fuzzy search on the title. Written for bibnet.org.

free-style - Make CSS easier and more maintainable by using JavaScript

  •    TypeScript

Free-Style is designed to make CSS easier and more maintainable by using JavaScript. There's a great presentation by Christopher Chedeau you should check out.


CRM 2011 AttributeMapping

Entity mapping facilitates data entry when creating new records that are related to a parent record. The following limitations are associated with field Mapping: • Mapping only works when a new record is created in the context of a parent record. Mapping does not apply if y...

Telecine - Record full-resolution video on your Android devices.

  •    Java

Record full-resolution video on your Android devices.

OTB

  •    C++

The Orfeo Toolbox is a C++ library for high resolution remote sensing image processing. It is developed by CNES in the frame of the ORFEO program. More information is available at www.orfeo-toolbox.org. It is based on the medical image processing library ITK and offers particular functionalities for remote sensing image processing in general and for high spatial resolution images in particular. Targeted algorithms for high resolution optical images (SPOT, Quickbird, Worldview, Landsat, Iko

ecst - [WIP] Experimental C++14 multithreaded compile-time entity-component-system library.

  •    C++

Experimental & work-in-progress C++14 multithreaded compile-time Entity-Component-System header-only library. Successful development of complex real-time applications and games requires a flexible and efficient entity management system. As a project becomes more intricate, it’s critical to find an elegant way to compose objects in order to prevent code repetition, improve modularity and open up powerful optimization possibilities.

FTASync - Allows you to sync CoreData entities with a Parse backend.

  •    Objective-C

Allows you to sync CoreData entities with a Parse backend. FTASync supports relationships (many-to-many have not been tested), conflict resolution (last in wins), custom data class names, and multiple levels of inheritance. For conflict resolution, each relationship is in its own conflict domain, but all entity attributes are currently in a single conflict domain. As with any open source code, do your own due diligence before putting this in a production app! There are a few known issues that still need to be addressed. They are listed below.

RecLink

  •    C++

A software package that implements the probabilistic record linkage technique (PRL). This is a new, improved, open-source, multi-platform version of the previously available program, by the same authors.

Duke - Duke is a fast and flexible deduplication engine written in Java

  •    Java

Duke is a configurable record linkage engine.

Associate Many to Many Relationship Entities Tool for Dynamics CRM 2011

The Associate Many to Many Relationship tool is used with Dynamics CRM 2011 to associate or disassociate N:N relationship entities. This tool is a Dynamics CRM 2011 solution, which consists of one entity and one plugin. Entity "Many to Many Relationship" record is used by Many...

snips-nlu - Snips Python library to extract meaning from text

  •    Python

Snips NLU (Natural Language Understanding) is a Python library that parses sentences written in natural language and extracts structured information. To find out how to use Snips NLU, please refer to our documentation, which provides a step-by-step guide on how to set up and use the library.

PyAudio - Python bindings for PortAudio, the cross-platform audio I/O library

  •    Python

PyAudio provides Python bindings for PortAudio, the cross-platform audio I/O library. With PyAudio, you can easily use Python to play and record audio on a variety of platforms.

Image-Super-Resolution - Implementation of Super Resolution CNN in Keras.

  •    Python

Implementation of Image Super Resolution CNN in Keras from the paper Image Super-Resolution Using Deep Convolutional Networks. Also contains models that outperform the above-mentioned model: Expanded Super Resolution; Denoiseing Auto Encoder SRCNN, which outperforms both of the above models; and Deep Denoise SR, which, with certain limitations, outperforms all of the above.

catalyst - An Algorithmic Trading Library for Crypto-Assets in Python

  •    Python

Catalyst is an algorithmic trading library for crypto-assets written in Python. It allows trading strategies to be easily expressed and backtested against historical data (with daily and minute resolution), providing analytics and insights regarding a particular strategy's performance. Catalyst also supports live-trading of crypto-assets starting with four exchanges (Binance, Bitfinex, Bittrex, and Poloniex) with more being added over time. Catalyst empowers users to share and curate data and build profitable, data-driven investment strategies. Please visit catalystcrypto.io to learn more about Catalyst. Catalyst builds on top of the well-established Zipline project. We did our best to minimize structural changes to the general API to maximize compatibility with existing trading algorithms, developer knowledge, and tutorials. Join us on the Catalyst Forum for questions around Catalyst, algorithmic trading and technical support. We also have a Discord group with the #catalyst_dev and #catalyst_setup dedicated channels.

Entity Framework CTP5 Extensions Library

The ADO.NET Entity Framework Extensions library contains a set of utility classes that add functionality to Entity Framework CTP5.