DeDuplicator (Heritrix add-on)


The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.




Related Projects

lessfs - data deduplication for less

Lessfs is a userspace (FUSE) inline data-deduplicating filesystem for Linux that supports LZO or QuickLZ compression and encryption.


Efficient client-server backup system for Linux and Windows. A Windows client lets you back up open files and complete partition images. Backups are stored on disk in an efficient way (deduplication) on either Windows or Linux servers.


DataCleaner is a data quality analysis application and a platform for data quality (DQ) solutions. Its core is a strong data profiling engine that is extensible, adding data cleansing, transformation, enrichment, deduplication, matching, and merging. Website:

Files-depot - Experiments on file deduplication

Personal experiment on programming using file deduplication as a theme.

Dedupfileapi - Provide a file storage API using Block Deduplication technology

Provides a file storage API using block deduplication technology. Written in Python; compatible with Windows and Linux.

Cumpare - A Python-based deduplication command-line tool and library.

Cumpare is a Python-based deduplication command-line tool and library. It aims for simplicity, extensibility, and ease of use. The project's name comes from a variation of the English word "compare", which is the basic action for finding duplicates. The variation itself, "cumpare", is a Sicilian word meaning "best friend". In short, cumpare wants to be the good fella who finds dupes.
