- 182

datasketch gives you probabilistic data structures that can process and search very large amount of data super fast, with little loss of accuracy. datasketch must be used with Python 2.7 or above and NumPy 1.11 or above. Scipy is optional, but with it the LSH initialization can be much faster.

https://ekzhu.github.io/datasketchhttps://github.com/ekzhu/datasketch

Tags | bbit-minhash lsh-forest jaccard-similarity hyperloglog lsh minhash weighted-quantiles top-k search data-sketches data-summary |

Implementation | Python |

License | MIT |

Platform | Windows Linux |

Efficient set similarity search algorithms in Python. For even better performance see the Go Implementation. A popular way to measure the similarity between two sets is Jaccard similarity, which gives a fractional score between 0 and 1.0.

similarity-search set-similarity-search all-pairsA new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications. The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.

An easy-to-use collaborative filtering based recommendation engine and NPM module built on top of Node.js and Redis. The engine uses the Jaccard coefficient to determine the similarity between users and k-nearest-neighbors to create recommendations. This module is useful for anyone with users, a store of products/movies/items, and the desire to give their users the ability to like/dislike and receive recommendations based on similar users. Raccoon takes care of all the recommendation and rating logic. It can be paired with any database as it does not keep track of any user/item information besides a unique ID. Updated for ES6.

recommend recommended recommendation engine collaborative filtering middleware redisI just built out v2 of this project that now gives you analytics info from your models, and is production-ready. machineJS is an amazing research project that clearly proved there's a hunger for automated machine learning. auto_ml tackles this exact same goal, but with more features, cleaner code, and the ability to be copy/pasted into production.

machine-learning data-science machine-learning-library machine-learning-algorithms ml data-scientists javascript-library scikit-learn kaggle numerai automated-machine-learning automl auto-ml neuralnet neural-network algorithms random-forest svm naive-bayes bagging optimization brainjs date-night sklearn ensemble data-formatting js xgboost scikit-neuralnetwork knn k-nearest-neighbors gridsearch gridsearchcv grid-search randomizedsearchcv preprocessing data-formatter kaggle-competitionImplementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity...

levenshtein-distance cosine-similarity string-distance damerau-levenshtein distance distance-measure jaro-winkler similarity-measures shingles algorithm jvmRumale (Ruby machine learning) is a machine learning library in Ruby. Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python. Rumale supports Linear / Kernel Support Vector Machine, Logistic Regression, Linear Regression, Ridge, Lasso, Kernel Ridge, Factorization Machine, Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor classifier, K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, SNN, Power Iteration Clustering, Mutidimensional Scaling, t-SNE, Principal Component Analysis, Kernel PCA and Non-negative Matrix Factorization. This project was formerly known as "SVMKit". If you are using SVMKit, please install Rumale and replace SVMKit constants with Rumale.

machine-learning data-science data-analysis artificial-intelligenceThis library implements several locality sensitive hashing(LSH) based algorithms, including indexing data structure for high dimensional spaces and metric spaces, sketch constructions and set embedding algorithms.

SwiftGraph is a pure Swift (no Cocoa) implementation of a graph data structure, appropriate for use on all platforms Swift supports (iOS, macOS, Linux, etc.). It includes support for weighted, unweighted, directed, and undirected graphs. It uses generics to abstract away both the type of the vertices, and the type of the weights. It includes copious in-source documentation, unit tests, as well as search functions for doing things like breadth-first search, depth-first search, and Dijkstra's algorithm. Further, it includes utility functions for topological sort, Jarnik's algorithm to find a minimum-spanning tree, detecting a DAG (directed-acyclic-graph), and enumerating all cycles.

graph data-structure graph-algorithms dijkstra-algorithm topological-sort breadth-first-search depth-first-search prims-algorithmThis Postgres module introduces a new data type hll which is a HyperLogLog data structure. HyperLogLog is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes hll can estimate the count of tens of billions of distinct values with only a few percent error. In addition to the algorithm proposed in the original paper, this implementation is augmented to improve its accuracy and memory use without sacrificing much speed. See below for more details.

An efficient data structure for rank-sensitive autocomplete features. It stores a set of suggestion strings with associated weights and returns for any given prefix a rank-ordered list of the top k highest weighted suggestions that start with that prefix. In addition, it provides methods to insert a new suggestion, modify the weight of a suggestion, or remove a suggestion.

Apache Mahout has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

machine-learning classification data-mining fuzzyAn improved version of HyperLogLog for the count-distinct problem, approximating the number of distinct elements in a multiset using 20-50% less space than other usual HyperLogLog implementations.This work is based on "Better with fewer bits: Improving the performance of cardinality estimation of large data streams - Qingjun Xiao, You Zhou, Shigang Chen".

data-structure collections go-collectionRedisson - Distributed and Scalable Java data structures (Set, SortedSet, Map, ConcurrentMap, List, Queue, Deque, Lock, AtomicLong, CountDownLatch, Publish / Subscribe, HyperLogLog) on top of Redis server. Advanced redis java client. It supports over 28+ data structures and services, Synchronous / asynchronous / reactive interfaces and lot more.

distributed collections locks distributed-locks cacheThis project aims at a minimal benchmark for scalability, speed and accuracy of commonly used implementations of a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of limited cardinality i.e. not very sparse) and no missing data, perhaps the most common problem in business applications (e.g. credit scoring, fraud detection or churn prediction). If the input matrix is of n x p, n is varied as 10K, 100K, 1M, 10M, while p is ~1K (after expanding the categoricals into dummy variables/one-hot encoding). This particular type of data structure/size (the largest) stems from this author's interest in some particular business applications. Note: While a large part of this benchmark was done in Spring 2015 reflecting the state of ML implementations at that time, this repo is being updated if I see significant changes in implementations or new implementations have become widely available (e.g. lightgbm). Also, please find a summary of the progress and learnings from this benchmark at the end of this repo.

machine-learning data-science r gradient-boosting-machine random-forest deep-learning xgboost h2o sparkThis repository holds the code to a new kind of RNN model for processing sequential data. The model computes a recurrent weighted average (RWA) over every previous processing step. With this approach, the model can form direct connections anywhere along a sequence. This stands in contrast to traditional RNN architectures that only use the previous processing step. A detailed description of the RWA model has been published in a manuscript at https://arxiv.org/pdf/1703.01253.pdf. Because the RWA can be computed as a running average, it does not need to be completely recomputed with each processing step. The numerator and denominator can be saved from the previous step. Consequently, the model scales like that of other RNN models such as the LSTM model.

recurrent-neural-networks sequential-data time-series research rwa-model recurrent-weighted-average deep-memoryLogswan is a fast Web log analyzer using probabilistic data structures. It is targeted at very large log files, typically APIs logs. It has constant memory usage regardless of the log file size, and takes approximatively 4MB of RAM.Unique visitors counting is performed using two HyperLogLog counters (one for IPv4, and another one for IPv6), providing a relative accuracy of 0.10%. String representations of IP addresses are used and preferred as they offer better precision.

web log analyzer hyperloglogInstructions for how to install the necessary software for this tutorial is available here. Data for the tutorial can be downloaded by running ./data/get-data.sh (requires wget). Certain algorithms don't scale well when there are millions of features. For example, decision trees require computing some sort of metric (to determine the splits) on all the feature values (or some fraction of the values as in Random Forest and Stochastic GBM). Therefore, computation time is linear in the number of features. Other algorithms, such as GLM, scale much better to high-dimensional (n << p) and wide data with appropriate regularization (e.g. Lasso, Elastic Net, Ridge).

machine-learning deep-learning random-forest gradient-boosting-machine tutorial data-science ensemble-learning rRedisearch implements a search engine on top of Redis, but unlike other redis search libraries, it does not use internal data structures like sorted sets. Inverted indexes are stored as a special compressed data type that allows for fast indexing and search speed, and low memory footprint.

redis redis-module search search-engine inverted-index fulltextLightweight Python utilities for working with Redis. The purpose of walrus is to make working with Redis in Python a little easier by wrapping rich objects in Pythonic containers. It consist of wrappers for the Redis object types like Hash, List, Set, Sorted Set, HyperLogLog, Array.

orm redis-utilities database-utilitiesBoom Filters are probabilistic data structures for processing continuous, unbounded streams. This includes Stable Bloom Filters, Scalable Bloom Filters, Counting Bloom Filters, Inverse Bloom Filters, Cuckoo Filters, several variants of traditional Bloom filters, HyperLogLog, Count-Min Sketch, and MinHash.Classic Bloom filters generally require a priori knowledge of the data set in order to allocate an appropriately sized bit array. This works well for offline processing, but online processing typically involves unbounded data streams. With enough data, a traditional Bloom filter "fills up", after which it has a false-positive probability of 1.

bloom-filter stable-bloom-filters cuckoo-filter probabilistic-programming counting-bloom-filters scalable-bloom-filters count-min-sketch data-stream filter data-structure collections go-collection
We have large collection of open source products. Follow the tags from
Tag Cloud >>

Open source products are scattered around the web. Please provide information
about the open source projects you own / you use.
**Add Projects.**