tika-similarity - Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features

  •    Python

This project demonstrates using the Tika-Python package (a Python port of Apache Tika) to compute file similarity based on metadata features. The script iterates over all files in the current directory, or over the files given on the command line, and derives the metadata features of each file. It then computes the union of all features. This union becomes the "golden feature set" against which each document's features are compared by set intersection. A file's similarity score is the size of that intersection divided by the size of the golden feature set.
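The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual script: the function name `similarity_scores` and the sample data are hypothetical, and it assumes that a file's "features" are the key names of its metadata dictionary. In practice such a dictionary could be obtained with Tika-Python, for example from the `"metadata"` entry of the result of `tika.parser.from_file(path)`. Hard-coded dictionaries are used here so the sketch runs without a Tika server.

```python
def similarity_scores(metadata_by_file):
    """Map each file name to |features ∩ golden| / |golden|."""
    # Assumption: a file's features are the key names of its metadata dict.
    features = {name: set(meta) for name, meta in metadata_by_file.items()}

    # The "golden feature set" is the union of every file's features.
    golden = set().union(*features.values())
    if not golden:
        # No metadata at all: define every score as 0.0 to avoid dividing by zero.
        return {name: 0.0 for name in features}

    return {
        name: len(feats & golden) / len(golden)
        for name, feats in features.items()
    }


# Hypothetical metadata for two files, standing in for real Tika output.
sample = {
    "a.pdf": {"Content-Type": "application/pdf", "Author": "x", "title": "t"},
    "b.txt": {"Content-Type": "text/plain"},
}
scores = similarity_scores(sample)
for name, score in sorted(scores.items()):
    print(f"{name}: {score:.3f}")
```

Because every file's features are a subset of the golden set, the score reduces to the fraction of all observed metadata keys that a given file carries. A file with rich metadata scores near 1, and a file with sparse metadata scores near 0.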