consimilo is a library that utilizes locality sensitive hashing (implemented as lsh-forest) and minhashing, to support top-k similar item queries. Finding similar items across expansive data-sets is a common problem that presents itself in many real world applications (e.g. finding articles from the same source, plagiarism detection, collaborative filtering, context filtering, document similarity, etc...). Searching a corpus for top-k similar items quickly grows to an unwieldy complexity at relatively small corpus sizes (n choose 2). LSH reduces the search space by "hashing" items in such a way that collisions occur as a result of similarity. Once the items are hashed and indexed the lsh-forest supports a top-k most similar items query of ~O(log n). There is an accuracy trade-off that comes with the enormous increase in query speed. More information can be found in chapter 3 of Mining Massive Datasets. You can continue to add to this forest by passing it as the first argument to add-all-to-forest. The forest data structure is stored in an atom, so the existing forest is modified in place.