rdedup - Data deduplication engine, supporting optional compression and public key encryption.

rdedup is a data deduplication engine and backup software. See the wiki for current project status.

https://github.com/dpc/rdedup

Related Projects

Borg - Deduplicating archiver with compression and authenticated encryption

  •    C

BorgBackup (short: Borg) is a deduplicating backup program. Optionally, it supports compression and authenticated encryption. The main goal of Borg is to provide an efficient and secure way to back up data. The data deduplication technique used makes Borg suitable for daily backups since only changes are stored. The authenticated encryption technique makes it suitable for backups to targets that are not fully trusted.

restic - Fast, secure, efficient backup program

  •    Go

restic is a backup program that is fast, efficient and secure. Restic should be easy to configure and use, so that in the unlikely event of a data loss you can just restore it. It uses cryptography to guarantee confidentiality and integrity of your data.

attic - Deduplicating backup program

  •    Python

Attic is a deduplicating backup program. The main goal of Attic is to provide an efficient and secure way to back up data. The data deduplication technique used makes Attic suitable for daily backups since only changes are stored. Attic requires Python 3.2 or above to work. Besides Python, Attic also requires msgpack-python and a sufficiently recent OpenSSL (>= 1.0.0). In order to mount archives as filesystems, llfuse is required.

lbackup Java Backup

  •    Java

Java backup tool providing file level data deduplication: If a file is stored, it is never stored a second time unless the file's content changes. Instead, a reference to the stored data is created. This holds true even if the file is moved or renamed.
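The file-level scheme described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not lbackup's actual implementation: the class name FileLevelStore, its methods, and the choice of SHA-256 as the content fingerprint are all assumptions made for the example.

```python
import hashlib


class FileLevelStore:
    """Hypothetical in-memory store illustrating file-level deduplication."""

    def __init__(self):
        self.contents = {}    # content digest -> file bytes (stored once)
        self.references = {}  # file path -> content digest

    def backup(self, path, content):
        """Record a file; return True only if its content had to be stored."""
        # Assumption: SHA-256 of the whole file identifies its content.
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.contents
        if is_new:
            self.contents[digest] = content
        # A moved or renamed file only adds a reference to existing content.
        self.references[path] = digest
        return is_new

    def restore(self, path):
        """Return the bytes last backed up under the given path."""
        return self.contents[self.references[path]]
```

Backing up the same bytes under a second path adds a reference but no new stored content; only a change to the content itself causes a new copy to be kept.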

dedupe - A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution

  •    Python

dedupe is a Python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on structured data. dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.


lessfs - data deduplication for less

  •    C

Lessfs is a userspace (FUSE) inline data de-duplicating filesystem for Linux that includes support for lzo or QuickLZ compression and encryption.

UrBackup

  •    Javascript

Efficient client-server backup system for Linux and Windows. A client for Windows lets you back up open files and complete partition images. Backups are stored to disk in an efficient way (deduplication) on either Windows or Linux servers.

duplicacy - A new generation cloud backup tool

  •    Go

Duplicacy is a new generation cross-platform cloud backup tool based on the idea of Lock-Free Deduplication. This repository hosts source code, design documents, and binary releases of the command line version of Duplicacy. There is also a Duplicacy GUI frontend built for Windows and Mac OS X available from https://duplicacy.com.

bedup - Btrfs deduplication

  •    Python

Deduplication for Btrfs. bedup looks for new and changed files, making sure that multiple copies of identical files share space on disk. It integrates deeply with btrfs so that scans are incremental and low-impact.

dropship - Instantly transfer files between Dropbox accounts using only their hashes.

  •    Python

These utilities make use of the deduplication scheme of Dropbox to allow for "teleporting" files into your Dropbox account given only a list of hashes, provided of course that the files already exist on their servers. This enables arbitrary, anonymous transfers of files between Dropbox accounts. The deduplication scheme used by Dropbox works by breaking files into blocks. Each of these blocks is hashed with the SHA256 algorithm and represented by the digest. Only blocks that are not yet known are uploaded to the server when syncing.
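The block-level scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Dropbox's or dropship's actual code: the BlockStore class, its methods, and the tiny 4-byte block size are hypothetical and chosen only to keep the example readable.

```python
import hashlib

# Assumption: a tiny block size for illustration; the source does not
# state the block size the real service uses.
BLOCK_SIZE = 4


def split_blocks(data, block_size=BLOCK_SIZE):
    """Break data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class BlockStore:
    """Hypothetical server-side store keyed by the SHA-256 digest of each block."""

    def __init__(self):
        self.blocks = {}  # hex digest -> block bytes

    def upload(self, data):
        """Sync data; return (list of block digests, number of blocks uploaded)."""
        digests = []
        uploaded = 0
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # Only blocks the server has not seen yet are transferred.
                self.blocks[digest] = block
                uploaded += 1
            digests.append(digest)
        return digests, uploaded

    def restore(self, digests):
        """Rebuild a file from its digest list alone."""
        return b"".join(self.blocks[d] for d in digests)
```

In this sketch, restore needs nothing but the digest list, which mirrors why a list of hashes is enough to reference a file whose blocks the server already holds.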

DataCleaner

  •    Java

DataCleaner is a data quality analysis application and a solution platform for DQ solutions. Its core is a strong data profiling engine, which is extensible and thereby adds data cleansing, transformations, enrichment, deduplication, matching and merging. Website: http://datacleaner.org

CRM 2011 Duplicate Detection Toolkit

The way CRM 2011 handles duplicates has always been a bit of a mystery to me. Detection of duplicates through matchcodes is quite straightforward, but reporting on them and understanding the effectiveness of deduplication rules is difficult, and this is compounded by the CRM 2...

alertmanager - Prometheus Alertmanager

  •    Go

The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integrations such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts. There are various ways of installing Alertmanager.

Bulk File Manager - Bulk File Renamer/Deduplicator in .NET

  •    CSharp

This is a file deduplication utility that also offers bulk name management options. It makes cleanup easier when a large number of duplicate files, or a small number of very large duplicate files, would make manual cleaning difficult or tedious. It also provides name-based sorting for large batches of files across directories.

nwb - A toolkit for React, Preact, Inferno & vanilla JS apps, React libraries and other npm modules for the web, with no configuration (until you need it)

  •    Javascript

Installing globally provides an nwb command for quick development and working with projects. Using npm >= 3 is recommended, as Babel takes significantly more time and disk space to install with npm 2 due to its lack of deduplication.

MarcXimiL

  •    Python

MarcXimiL is a flexible multi-platform bibliographic similarity analysis framework. Features: deduplication, information monitoring, visual analysis, plagiarism detection. Supported: MARCXML, OAI-PMH2 harvesting, and importation of text MARC.

libpostal - A C library for parsing/normalizing street addresses around the world

  •    C

Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally. The core library is written in pure C. Language bindings for Python, Ruby, Go, Java, PHP, and NodeJS are officially supported and it's easy to write bindings in other languages.

Duke - Duke is a fast and flexible deduplication engine written in Java

  •    Java

Duke is a configurable record linkage engine.

seqkit - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang

  •    Go

FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and protein sequences. Common manipulations of FASTA/Q files include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools implement only some of these manipulations, and not particularly efficiently, and some are available only for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user-friendly. This project describes a cross-platform ultrafast comprehensive toolkit for FASTA/Q processing. SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OS X, and can be used directly without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations.