Displaying 1 to 6 from 6 results

datasketch - MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++

  •    Python

datasketch gives you probabilistic data structures that can process and search very large amount of data super fast, with little loss of accuracy. datasketch must be used with Python 2.7 or above and NumPy 1.11 or above. Scipy is optional, but with it the LSH initialization can be much faster.

html-similarity - Compare html similarity using structural and style metrics

  •    Python

This package provides a set of functions to measure the similarity between web pages. Uses sequence comparison of the html tags to compute the similarity.

tika-similarity - Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features

  •    Python

This project demonstrates using the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features. The script can iterate over all files in the current directory or given files by command line and derives their metadata features, then computes the union of all features. The union of all features become the "golden feature set" that all document features are compared to via intersect. The length of that intersect per file divided by the length of the unioned set becomes the similarity score.

consimilo - A Clojure library for querying large data-sets on similarity

  •    Clojure

consimilo is a library that utilizes locality sensitive hashing (implemented as lsh-forest) and minhashing, to support top-k similar item queries. Finding similar items across expansive data-sets is a common problem that presents itself in many real world applications (e.g. finding articles from the same source, plagiarism detection, collaborative filtering, context filtering, document similarity, etc...). Searching a corpus for top-k similar items quickly grows to an unwieldy complexity at relatively small corpus sizes (n choose 2). LSH reduces the search space by "hashing" items in such a way that collisions occur as a result of similarity. Once the items are hashed and indexed the lsh-forest supports a top-k most similar items query of ~O(log n). There is an accuracy trade-off that comes with the enormous increase in query speed. More information can be found in chapter 3 of Mining Massive Datasets. You can continue to add to this forest by passing it as the first argument to add-all-to-forest. The forest data structure is stored in an atom, so the existing forest is modified in place.




text-metrics - Calculate various string metrics efficiently in Haskell

  •    Haskell

The library provides efficient implementations of various strings metric algorithms. It works with strict Text values. edit-distance allows to specify costs for every operation when calculating Levenshtein distance (insertion, deletion, substitution, and transposition). This is rarely needed though in real-world applications, IMO.

spark-stringmetric - Spark functions to run popular phonetic and string matching algorithms

  •    Scala

Making similarity functions and phonetic algorithms readily available for fuzzy matching analyses in Spark. Update your build.sbt file to import the libraries.





We have large collection of open source products. Follow the tags from Tag Cloud >>


Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.