file-dedupe - Fast duplicate file detection library

findup is quite fast: it is within 2x of the fastest duplicate finders written in C/C++. Based on the V8 profiler output, about 40% of the time is spent on I/O, 13% on crypto and 11% on file traversal, so any further performance gains will need to come from I/O optimizations rather than code optimizations. You may notice that file-dedupe defaults to sync I/O. This is because async I/O seems to have significant overhead for typical FS tasks. You can test this on your system by passing the --async flag.
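
To make the I/O, crypto and traversal split concrete, here is a minimal sketch of duplicate detection with Node's synchronous fs calls. It is not file-dedupe's implementation or API: the function names are hypothetical, and both the size-first grouping and the choice of SHA-1 are assumptions made for illustration.

```javascript
const fs = require('fs');
const path = require('path');
const crypto = require('crypto');

// Traversal: recursively collect regular files under `dir` (sync calls only).
function listFiles(dir) {
  const out = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) out.push(...listFiles(full));
    else if (entry.isFile()) out.push(full);
  }
  return out;
}

// Crypto: SHA-1 of a file's contents, read synchronously in one call.
function hashFile(file) {
  return crypto.createHash('sha1').update(fs.readFileSync(file)).digest('hex');
}

// Hypothetical helper: returns arrays of paths whose contents hash identically.
function findDuplicates(dir) {
  // Group by size first, so files with a unique size are never read.
  const bySize = new Map();
  for (const file of listFiles(dir)) {
    const size = fs.statSync(file).size;
    if (!bySize.has(size)) bySize.set(size, []);
    bySize.get(size).push(file);
  }
  const groups = [];
  for (const sameSize of bySize.values()) {
    if (sameSize.length < 2) continue; // unique size: cannot be a duplicate
    const byHash = new Map();
    for (const file of sameSize) {
      const digest = hashFile(file);
      if (!byHash.has(digest)) byHash.set(digest, []);
      byHash.get(digest).push(file);
    }
    for (const group of byHash.values()) {
      if (group.length > 1) groups.push(group.sort());
    }
  }
  return groups;
}
```

Calling `findDuplicates('/some/dir')` returns one sorted array of paths per set of identical files. Reading whole files into memory keeps the sketch short; a real tool would likely hash in chunks to bound memory use.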

https://github.com/mixu/file-dedupe

Dependencies:

bytes : 1.0.0
javascript-natural-sort : ^0.7.1
microee : 0.0.5
miniq : ~0.1.2
trash : ^4.0.0
yargs : ~1.2.6

Related Projects

dedupe - A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution

  •    Python

dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on structured data. dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.

csvdedupe - Command line tool for deduplicating CSV files

  •    Python

Command line tools for using the dedupe python library for deduplicating CSV files. csvdedupe takes a messy input file or STDIN pipe and identifies duplicates.

Borg - Deduplicating archiver with compression and authenticated encryption

  •    C

BorgBackup (short: Borg) is a deduplicating backup program. Optionally, it supports compression and authenticated encryption. The main goal of Borg is to provide an efficient and secure way to back up data. The data deduplication technique used makes Borg suitable for daily backups since only changes are stored. The authenticated encryption technique makes it suitable for backups to targets that are not fully trusted.

Bulk File Manager - Bulk File Renamer/Deduplicator in .NET

  •    CSharp

This is a file deduplication utility that is also equipped with bulk name management options. A large volume of duplicate files, or a small volume of really big duplicate files, which would make manual cleaning difficult or tedious, is easier to handle with this tool. It also provides name-based sorting for a large batch of files across directories.

DFK: Duplicate File Killer

  •    VB

DFK: Duplicate File Killer will search a user-specified directory/drive for duplicate files of any kind. It will eventually utilize image recognition for image comparison, audio matching algorithms for sound file matching and CRC/MD5 hash testing.


File Duplicate Finder

A small application built in C# .Net 3.5 with a set of Microsoft SQL Server Reporting Services reports. It inventories files on a specific path on a file system or in SharePoint and stores each file's hash and location in a table. With this you can identify duplicates.

hasha - Hashing made simple. Get the hash of a buffer/string/stream/file.

  •    Javascript

Hashing made simple. Get the hash of a buffer/string/stream/file. Convenience wrapper around the core crypto Hash class with a simpler API and better defaults.

free-style - Make CSS easier and more maintainable by using JavaScript

  •    TypeScript

Free-Style is designed to make CSS easier and more maintainable by using JavaScript. There's a great presentation by Christopher Chedeau you should check out.

node-fs-extra - Node.js: extra methods for the fs object like copy(), remove(), mkdirs()

  •    Javascript

fs-extra adds file system methods that aren't included in the native fs module and adds promise support to the fs methods. It should be a drop-in replacement for fs. I got tired of including mkdirp, rimraf, and ncp in most of my projects.

mock-fs - Configurable mock for the fs module

  •    Javascript

The mock-fs module allows Node's built-in fs module to be backed temporarily by an in-memory, mock file system. This lets you run tests against a set of mock files and directories instead of lugging around a bunch of test fixtures. The code below makes it so the fs module is temporarily backed by a mock file system with a few files and directories.

fs-jetpack - Better file system API for Node.js

  •    Javascript

Node's fs library is very low level and, because of that, often painful to use. fs-jetpack wants to fix that by giving you a completely rethought, much more convenient API to work with the file system. Check out EXAMPLES to see a few snippets of what it can do.

vinyl-fs - Vinyl adapter for the file system.

  •    Javascript

Vinyl adapter for the file system. Vinyl is a very simple metadata object that describes a file. When you think of a file, two attributes come to mind: path and contents. These are the main attributes on a Vinyl object. A file does not necessarily represent something on your computer’s file system. You have files on S3, FTP, Dropbox, Box, CloudThingly.io and other services. Vinyl can be used to describe files from all of these sources.

Duplicate File Explorer

File search utility that also shows which files are duplicates by name. Supports searching up to 3 different folders at one time, excluding folders or extensions, and multiple search patterns.

Find Duplicate file

This application is developed in WPF. You can find duplicate files by file impression, not by file size or file name, although this process is very time consuming. You can search for and delete similar files from the application itself. You can also open file loc...

DUFF: DUplicate File Finder

  •    C++

DUFF is (will be) a tool for Windows used to find and process duplicate files on a computer file system and/or network. It will have numerous built-in and plugin-based file comparison layers, duplicate markers, and fileset processors.

casync - Content-Addressable Data Synchronization Tool

  •    C

Encoding: Let's take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk's contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used as a key for retrieving the full chunk data. Let's call this directory a "chunk store". At the same time, generate a "chunk index" file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story. Decoding: Let's take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.
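
The encode and decode steps described above can be sketched in a few lines of JavaScript. This is a toy illustration, not casync's code or on-disk format: an in-memory Map stands in for the chunk store directory, the window hash is recomputed at every position for clarity (real tools typically use a rolling hash), and the names `chunk`, `encode`, `decode`, `WINDOW` and `MASK` are all hypothetical.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

const WINDOW = 16;  // bytes hashed to decide on a boundary; also the minimum chunk size
const MASK = 0x3f;  // toy value: a boundary on roughly 1 in 64 eligible positions

// Split a Buffer into variable-sized chunks. A boundary is placed after byte i
// when a hash of the WINDOW bytes ending at i has its low bits all zero, so the
// cut points are derived from the data itself rather than from fixed offsets.
function chunk(data) {
  const chunks = [];
  let start = 0;
  for (let i = 0; i < data.length; i++) {
    if (i + 1 - start < WINDOW) continue; // current chunk is still too short
    let h = 0;
    for (let j = i - WINDOW + 1; j <= i; j++) {
      h = (Math.imul(h, 31) + data[j]) >>> 0;
    }
    if ((h & MASK) === 0) {
      chunks.push(data.subarray(start, i + 1));
      start = i + 1;
    }
  }
  if (start < data.length) chunks.push(data.subarray(start));
  return chunks;
}

// Encoding: store each compressed chunk under the SHA-256 of its contents and
// return the chunk index (hash plus size, in stream order).
function encode(data, store) {
  const index = [];
  for (const c of chunk(data)) {
    const id = crypto.createHash('sha256').update(c).digest('hex');
    if (!store.has(id)) store.set(id, zlib.deflateSync(c));
    index.push({ id, size: c.length });
  }
  return index;
}

// Decoding: look up each listed chunk, decompress it, and concatenate.
function decode(index, store) {
  return Buffer.concat(index.map(({ id }) => zlib.inflateSync(store.get(id))));
}
```

With `const store = new Map()`, calling `encode(buf, store)` twice on the same buffer adds no new entries the second time, which is the deduplication effect the chunk store is meant to provide.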

Chitragupta File System

  •    C

ChitraguptaFS is a simple file system based on FUSE, written in C, for logging FS events. ChitraguptaFS comprises two parts: one is the FS itself and the other is a simple utility to retrieve FS logs.

phashion - Ruby wrapper around pHash, the perceptual hash library for detecting duplicate multimedia files

  •    Ruby

Phashion is a Ruby wrapper around the pHash library, "perceptual hash", which detects duplicate and near-duplicate multimedia files (e.g. images, audio, video), though Phashion currently only supports images. "Near-duplicates" are images that come from the same source and show essentially the same thing, but may have differences in such features as dimensions, byte sizes, lossy-compression artifacts, and color levels. See an overview of Phashion on Mike's blog.

bibnet.org - data management

  •    Java

Web-based cataloging and dedupe application. Highly optimized for processing journal articles. Reads MarcXML and dedupes records using the field 773 combined with a fuzzy search on the title. Written for bibnet.org.

BD File Hash

BD File Hash is a convenient file hashing and hash compare tool for Windows which currently works with MD5, SHA-1, SHA-256, and SHA-512 algorithms.