DeDuplicator (Heritrix add-on)


The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
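
To illustrate the general idea of avoiding duplicates across snapshot crawls, here is a minimal sketch in Python. It is not the DeDuplicator's code or Heritrix's API: the function names, the in-memory index and the example URLs are all assumptions made for illustration. It assumes that a page counts as a duplicate when its URL was seen in an earlier snapshot with the same content digest.

```python
import hashlib
from typing import Dict, List


def content_digest(body: bytes) -> str:
    """Return a SHA-1 hex digest of the page body (hypothetical helper)."""
    return hashlib.sha1(body).hexdigest()


def crawl_snapshot(pages: Dict[str, bytes], index: Dict[str, str]) -> List[str]:
    """Return the URLs whose content should be stored for this snapshot.

    `index` maps URL -> digest from earlier snapshots and is updated in place.
    A page is skipped when its digest matches the one already recorded.
    """
    stored = []
    for url, body in pages.items():
        digest = content_digest(body)
        if index.get(url) == digest:
            continue  # unchanged since the previous snapshot: do not store again
        index[url] = digest
        stored.append(url)
    return stored


# Two snapshot crawls of the same (made-up) site share one index.
index: Dict[str, str] = {}
first = crawl_snapshot(
    {"http://example.org/a": b"hello", "http://example.org/b": b"world"}, index
)
second = crawl_snapshot(
    {"http://example.org/a": b"hello", "http://example.org/b": b"changed"}, index
)
```

In this sketch the first snapshot stores both pages, while the second stores only the page whose content changed.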



Related Projects

lessfs - data deduplication for less

Lessfs is a userspace (FUSE) inline data-deduplicating filesystem for Linux that includes support for LZO or QuickLZ compression and encryption.


Efficient client-server backup system for Linux and Windows. A client for Windows lets you back up open files and complete partition images. Backups are stored on disk in an efficient way (deduplication) on either Windows or Linux servers.


DataCleaner is a data quality analysis application and a solution platform for DQ solutions. Its core is a strong data profiling engine that is extensible and can thereby add data cleansing, transformations, enrichment, deduplication, matching and merging. Website:

Files-depot - Experiments on file deduplication

Personal experiment on programming using file deduplication as a theme.

Dedupfileapi - Provide a file storage API using Block Deduplication technology

Provides a file storage API using block deduplication technology. Written in Python. Windows and Linux compatible.

Cumpare - A Python-based deduplication command-line tool and library.

Cumpare is a Python-based deduplication command-line tool and library. It aims for simplicity, extensibility and ease of use. The project's name comes from a variation of the English word "compare", which is the basic action for finding duplicates. The variation itself, "cumpare", is a Sicilian word meaning "best friend". In short, cumpare wants to be the good fella who finds dupes.