DeDuplicator (Heritrix add-on)


The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.

http://deduplicator.sourceforge.net




Related Projects

lessfs - data deduplication for less


Lessfs is a userspace (FUSE) inline data-deduplicating filesystem for Linux that supports LZO or QuickLZ compression and encryption.

UrBackup


An efficient client-server backup system for Linux and Windows. A client for Windows lets you back up open files and complete partition images. Backups are stored to disk efficiently (using deduplication) on either Windows or Linux servers.

DataCleaner


DataCleaner is a data quality (DQ) analysis application and a platform for DQ solutions. Its core is a strong data profiling engine, which is extensible and thereby adds data cleansing, transformations, enrichment, deduplication, matching and merging. Website: http://datacleaner.org

Files-depot - Experiments on file deduplication


A personal programming experiment that uses file deduplication as its theme.

Dedupfileapi - Provide a file storage API using Block Deduplication technology


Provides a file storage API using block deduplication technology. Written in Python; compatible with Windows and Linux.

Cumpare - A python-based deduplication commandline tool and library.


Cumpare is a Python-based deduplication command-line tool and library. It aims for simplicity, extensibility and ease of use. The project's name comes from a variation of the English word "compare", which is the basic action for finding duplicates. The variation itself, "cumpare", is a Sicilian word meaning "best friend". In short, cumpare wants to be the good fella who finds dupes.






