BoomFilters - Probabilistic data structures for processing continuous, unbounded streams.

  •        49

Boom Filters are probabilistic data structures for processing continuous, unbounded streams. This includes Stable Bloom Filters, Scalable Bloom Filters, Counting Bloom Filters, Inverse Bloom Filters, Cuckoo Filters, several variants of traditional Bloom filters, HyperLogLog, Count-Min Sketch, and MinHash.Classic Bloom filters generally require a priori knowledge of the data set in order to allocate an appropriately sized bit array. This works well for offline processing, but online processing typically involves unbounded data streams. With enough data, a traditional Bloom filter "fills up", after which it has a false-positive probability of 1.

https://github.com/tylertreat/BoomFilters

Tags
Implementation
License
Platform

   




Related Projects

cfilter - Cuckoo Filter implementation in Go, better than Bloom Filters (unmaintained, unfortunately)

  •    Go

Cuckoo filter is a Bloom filter replacement for approximated set-membership queries. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Some possible use-cases that depend on approximated set-membership queries would be databases, caches, routers, and storage systems where it is used to decide if a given item is in a (usually large) set, with some small false positive probability. Alternatively, given it is designed to be a viable replacement to Bloom filters, it can also be used to reduce the space required in probabilistic routing tables, speed longest-prefix matching for IP addresses, improve network state management and monitoring, and encode multicast forwarding information in packets, among many other applications. Cuckoo filters provide the flexibility to add and remove items dynamically. A cuckoo filter is based on cuckoo hashing (and therefore named as cuckoo filter). It is essentially a cuckoo hash table storing each key's fingerprint. Cuckoo hash tables can be highly compact, thus a cuckoo filter could use less space than conventional Bloom filters, for applications that require low false positive rates (< 3%).

bloom - Go package implementing Bloom filters

  •    Go

A Bloom filter is a representation of a set of n items, where the main requirement is to make membership queries; i.e., whether an item is a member of a set.A Bloom filter has two parameters: m, a maximum size (typically a reasonably large multiple of the cardinality of the set to represent) and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.

cuckoofilter

  •    C++

Cuckoo filter is a Bloom filter replacement for approximated set-membership queries. While Bloom filters are well-known space-efficient data structures to serve queries like "if item x is in a set?", they do not support deletion. Their variances to enable deletion (like counting Bloom filters) usually require much more space. Cuckoo filters provide the flexibility to add and remove items dynamically. A cuckoo filter is based on cuckoo hashing (and therefore named as cuckoo filter). It is essentially a cuckoo hash table storing each key's fingerprint. Cuckoo hash tables can be highly compact, thus a cuckoo filter could use less space than conventional Bloom filters, for applications that require low false positive rates (< 3%).

python-bloomfilter - Scalable Bloom Filter implemented in Python

  •    Python

P. Almeida, C.Baquero, N. Preguiça, D. Hutchison, Scalable Bloom Filters, (GLOBECOM 2007), IEEE, 2007. Bloom filters are great if you understand what amount of bits you need to set aside early to store your entire set. Scalable Bloom Filters allow your bloom filter bits to grow as a function of false positive probability and size.

bloom-filter-scala - Bloom filter for Scala, the fastest for JVM

  •    Scala

"A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not. In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed," says Wikipedia. Warning: These are synthetic benchmarks in isolated environment. Usually the difference in throughput and latency is bigger in production system because it will stress the GC, lead to slow allocation paths and higher latencies, trigger the GC, etc.


pybloomfiltermmap - Fast Python Bloom Filter using Mmap

  •    Python

The goal of pybloomfiltermmap is simple: to provide a fast, simple, scalable, correct library for Bloom Filters in Python.

opposite_of_a_bloom_filter - Implementations of a data structure with false negatives but no false positives

  •    Java

A Bloom filter is a data structure that may report it contains an item that it does not (a false positive), but is guaranteed to report correctly if it contains the item ("no false negatives"). The opposite of a Bloom filter is a data structure that may report a false negative, but can never report a false positive. That is, it may claim that it has not seen an item when it has, but will never claim to have seen an item it has not.This repository contains thread-safe implementations of "the opposite of a Bloom filter" in Java and Go.

bloomd - C network daemon for bloom filters

  •    C

Bloomd is a high-performance C server which is used to expose bloom filters and operations over them to networked clients. It uses a simple ASCII protocol which is human readable, and similar to memcached. Bloom filters are a type of sketching or approximate data structure. They trade exactness for efficiency of representation. Bloom filters provide a set-like abstraction, with the important caveat that they may contain false-positives, meaning they may claim an element is part of the set when it was never in fact added. The rate of false positives can be tuned to meet application demands, but reducing the error rate rapidly increases the amount of memory required for the representation.

Redisson - Redis based In-Memory Data Grid for Java

  •    Java

Redisson - distributed Java objects and services (Set, Multimap, SortedSet, Map, List, Queue, BlockingQueue, Deque, BlockingDeque, Semaphore, Lock, AtomicLong, Map Reduce, Publish / Subscribe, Bloom filter, Spring Cache, Executor service, Tomcat Session Manager, Scheduler service, JCache API) on top of Redis server. Rich Redis client.

inbloom - Cross language bloom filter implementation

  •    Java

inbloom - a cross language Bloom filter implementation (https://en.wikipedia.org/wiki/Bloom_filter). A Bloom filter is a probabalistic data structure which provides an extremely space-efficient method of representing large sets. It can have false positives but never false negatives which means a query returns either "possibly in set" or "definitely not in set".

khmer - In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more

  •    Python

The official source code repository is at https://github.com/dib-lab/khmer and project documentation is available online at http://khmer.readthedocs.io. See http://khmer.readthedocs.io/en/stable/introduction.html for an overview of the khmer project. khmer is research software, so you should cite us when you use it in scientific publications! Please see the CITATION file for citation information.

dablooms - scaling, counting, bloom filter library

  •    C

Note: this project has been mostly unmaintained for a while. This project aims to demonstrate a novel Bloom filter implementation that can scale, and provide not only the addition of new members, but reliable removal of existing members.

bloomd - A high performance C server for bloom filters

  •    C

A high performance C server for bloom filters

whatlanguage - A language detection library for Ruby that uses bloom filters for speed.

  •    Ruby

Text language detection. Quick, fast, memory efficient, and all in pure Ruby. Uses Bloom filters for aforementioned speed and memory benefits. It works well on texts of over 10 words in length (e.g. blog posts or comments) and very poorly on short or Twitter-esque text, so be aware. Works with Dutch, English, Farsi, French, German, Italian, Pinyin, Swedish, Portuguese, Russian, Arabic, Finnish, Greek, Hebrew, Hungarian, Korean, Norwegian, Polish and Spanish out of the box.

rdb - Javascript ORM

  •    Javascript

ORM for nodejs. Supports postgres, mySql and sqlite. 1.7.1 Support for schemas (postgres only). 1.7.0 sqlite3 is now a peer dependency. Add it to your own package.json if you intend to use it. 1.6.9 Bugfix: one-to-many relation returns empty if strategy is included. 1.6.8 Bugfix: one-to-many relation returns empty if insert/update is done earlier in transaction. 1.6.7 Bugfix in relations. 1.6.6 Bugfix. 1.6.5 Improved performance on relations. 1.6.4 Bugfix. 1.6.3 Bugfix: potential incorrect timeZoneOffset when serializing date to JSON. Got timeZoneOffset from now() instead of on actual date. 1.6.2 Removed es6 syntax to ensure backwards compatability. Fixed global var leak. 1.6.1 Now supporting sqlite. 1.6.0 Bugfix: potential ambigous column error when using limit and relating to other tables. 1.5.9 Bugfix: using multipleStatements in mySql could sometimes cause an error when updates are run right before a select. Improved performance on limit when relating to other tables. Using uuid instead of node-uuid Updated all dependencies but generic-pool to latest. (Generic-pool has some breaking changes in latest. I will update it in next release.) 1.5.8 Cleanup line breaks in documentation. 1.5.7 Bugfix: getById.exclusive and tryGetById.exclusive did not lock if row was cached. Improved performance on tryGetFirst. 1.5.6 Raw sql filters can accept sql both as string and as function. E.g. var filter = {sql: function() {return 'foo > 1';}}. 1.5.5 Optional locks for getMany, tryGetFirst and tryGetById. Instead of calling getMany(params) just call getMany.exclusive(params). Same syntax goes for tryGetFirst and tryGetById. This will result in SELECT FOR UPDATE. Bugfix: bulk deletes now accepts raw sql filters too. 1.5.4 Transaction locks. Postgres only. 1.5.3 Upgraded to pg 6.0.3 1.5.2 Improved performance and reduced memory footprint. 1.5.1 Documented JSON column type. Bug fix: Insert and foreign key violation. 1.5.0 JSON column type. Postgres json type does not support rdb filters. 1.4.1 Empty filter would sometimes cause invalid filter. 1.4.0 Raw SQL query. 1.3.0 getMany() now supports limit and orderBy - same syntax as in streaming. 1.2.3 Bugfix: iEqual gave incorrect sql when parameterized. 1.2.2 Exlusive no longer returns a clone of table. It has changes current table to exclusive locking. 1.2.1 Bugfix: Exclusive row locks 1.2.0 Exclusive row locks 1.1.0 Now supporting streaming. Requires postgres or MySQL >=5.7.7 1.0.8 README fixup. 1.0.7 Better performance on insert and update. 1.0.6 Bugfix: Transaction domain should not forward rdb singleton from old domain. 1.0.5 Documentation cleanup. 1.0.4 orderBy in toDto(). 1.0.3 toDto() using next tick on every thousandth row to avoid maximum call stack size exceeded. 1.0.2 Reduced number of simultaneous promises in order to avoid maximum call stack size exceeded. 1.0.1 Bugfix: Incorrect insert/updates on timestamp without timezone. The time was converted utc instead of stripping the timezone. 1.0.0 Transaction domain forwards properties from old domain. Semantic versioning from now on. 0.5.1 Improved performance 0.5.0 Logging: rdb.log(someFunc) logs sql and parameters. Raw sql filters. 0.4.9 New method: tryGetById. New filter: iEqual, postgres only. Bugfix: rows.toJSON() without strategy did not include any children. 0.4.8 Explicit pooling with size and end(). Bugfix: mySql did not release client to pool. 0.4.7 Upgraded to pg 4.3.0 Upgraded to mysql 2.5.5 0.4.6 Upgraded pg 4.2.0. 0.4.5 Oops. Forgot to use pg.js instead of pg. 0.4.4 Upgraded all dependencies to latest. Using pg.js instead of pg. 0.4.3 Can ignore columns when serializing to dto. 0.4.2 Bugfix: update on a row crashes when a delete occurs earlier in same transaction. 0.4.1 Bugfix: more global leaks. 0.4.0 Bugfix: global leak. 0.3.9 Bugfix: eager loading joins/hasOne with non unique column names was not handled correctly. 0.3.8 Supports mySql. Bulk deletes. 0.3.7 Bugfix: eager loading manyRelation on a join/hasOne returned empty array #11 0.3.6 Fixed sql injection vulnerability. 0.3.5 Built-in fetching strategies for lazy loading. Works best in readonly scenarios. 0.3.4 Docs and examples split moved to separate file. 0.3.3 Fixed documentation layout again. 0.3.2 Fixed documentation layout. 0.3.1 Case insensitive filters: iStartsWith, iEndsWith and iContains. 0.3.0 Fix broken links in docs. 0.2.9 Support for row.delete(). Rollback only throws when error is present. 0.2.8 Guid accepts uppercase letters. Bugfix: null inserts on guid columns yielded wrong sql. 0.2.7 New method, toDto(), converts row to data transfer object. Bugfix: toJSON returned incorrect string on hasMany relations. 0.2.6 Fixed incorrect links in README. 0.2.5 Bugfix: caching on composite keys could give a crash #7. Improved sql compression on insert/update. 0.2.4 Bugfix: getMany with many-strategy and shallowFilter yields incorrect query #6. 0.2.3 Reformatted documentation. No code changes.

Java-BloomFilter - A stand-alone Bloom filter implementation written in Java

  •    Java

A stand-alone Bloom filter implementation written in Java

bloomfilter.js - JavaScript bloom filter using FNV for fast hashing

  •    Javascript

JavaScript bloom filter using FNV for fast hashing