gensim - Topic Modelling for Humans

  •    Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.

Lemur - Search Engine

  •    Java

The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software. The project is best known for its Indri search engine, Lemur Toolbar, and ClueWeb09 dataset.

Terrier - Information Retrieval Platform

  •    Java

Terrier is a highly flexible, efficient, and effective open source search engine, readily deployable on large-scale collections of documents. Terrier implements state-of-the-art indexing and retrieval functionalities, and provides an ideal platform for the rapid development and evaluation of large-scale retrieval applications. Terrier can index large corpora of documents, and provides multiple indexing strategies, such as multi-pass, single-pass and large-scale MapReduce indexing.

resin - 32-bit vector space search engine

  •    CSharp

A full-text search engine with an HTTP API and programmable read/write pipelines. To provide full-text search, words and phrases are extracted from documents and mapped to a 2-billion-dimensional vector space, where they form clusters of syntactically similar "bag-of-chars". In this language model, each character (glyph) is encoded as a 32-bit word (an int), and each word or phrase is likewise encoded as a 32-bit wide (but sparse) array.
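
A rough illustration of the bag-of-chars idea described above, written in Python rather than the project's C#. It is not resin's actual code: the function names are hypothetical, and comparing vectors by cosine similarity is an assumption made here for the sketch.

```python
import math
from collections import Counter


def bag_of_chars(text):
    """Sparse vector: index = Unicode code point (an int), value = count."""
    return Counter(ord(ch) for ch in text)


def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed comparison measure)."""
    dot = sum(count * v.get(index, 0) for index, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Words built from the same characters land on the same point in the space.
same_chars = cosine(bag_of_chars("listen"), bag_of_chars("silent"))
no_overlap = cosine(bag_of_chars("abc"), bag_of_chars("xyz"))
```

Because only character counts are kept, character order is lost, which is why the clusters are described as syntactically rather than semantically similar.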

RankyMcRankFace - Hardened Fork of Ranklib learning to rank library

  •    Java

This project is OpenSource Connections' API-compatible fork of Ranklib, deployed on Maven, with various improvements that make it easier to integrate with the Elasticsearch Learning to Rank Plugin. It is under the com.o19s:RankyMcRankFace Maven namespace.

BM25Transformer - (Python) transform a document-term matrix to an Okapi/BM25 representation

  •    Python

This library transforms a document-term matrix to an Okapi/BM25 representation. The library's API inherits from sklearn.feature_extraction.text.TfidfTransformer.
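
To make the transformation concrete, here is a minimal pure-Python sketch of one common Okapi BM25 weighting applied to a small count matrix. It does not use the library itself; the function name `bm25_transform`, the defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are assumptions, and the library's exact formula may differ.

```python
import math


def bm25_transform(counts, k1=1.2, b=0.75):
    """Turn a document-term count matrix (list of rows) into BM25 weights."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    doc_len = [sum(row) for row in counts]
    avg_len = sum(doc_len) / n_docs if n_docs else 0.0

    # Document frequency per term, then a smoothed, non-negative IDF (assumed variant).
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(1 + (n_docs - d + 0.5) / (d + 0.5)) for d in df]

    weights = []
    for row, length in zip(counts, doc_len):
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1 - b + b * length / avg_len) if avg_len else k1
        weights.append([
            idf[j] * tf * (k1 + 1) / (tf + norm) if tf else 0.0
            for j, tf in enumerate(row)
        ])
    return weights


# Two documents, three terms (raw term counts).
counts = [[2, 0, 1],
          [0, 1, 1]]
weights = bm25_transform(counts)
```

Terms that do not occur keep a weight of zero, so the result stays as sparse as the input. A term found in fewer documents (column 0) ends up weighted higher than one found in every document (column 2).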

tika-similarity - Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features

  •    Python

This project demonstrates using the Tika-Python package (Python port of Apache Tika) to compute file similarity based on metadata features. The script iterates over all files in the current directory, or over files given on the command line, derives their metadata features, and then computes the union of all features. This union becomes the "golden feature set" that each document's features are compared to via intersection. The size of that intersection per file, divided by the size of the union, is the similarity score.
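
The scoring step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's script: the function name is hypothetical, the metadata dictionaries are made up to stand in for what a Tika parse might return, and metadata keys are assumed to serve as the features.

```python
def metadata_similarity(metadata_by_file):
    """Score each file as |its features ∩ golden set| / |golden set|."""
    # Assumption: a file's features are simply its metadata keys.
    features = {name: set(meta) for name, meta in metadata_by_file.items()}

    # The "golden feature set" is the union of every file's features.
    golden = set().union(*features.values())
    if not golden:
        return {name: 0.0 for name in features}

    return {name: len(feats & golden) / len(golden)
            for name, feats in features.items()}


# Hypothetical metadata for two files.
sample = {
    "a.pdf": {"Content-Type": "application/pdf", "Author": "x", "Title": "t"},
    "b.txt": {"Content-Type": "text/plain"},
}
scores = metadata_similarity(sample)
```

Here the golden set has three features, so `a.pdf` (which carries all three) scores higher than `b.txt` (which carries one).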

cuNVSM - Neural Vector Space Models

  •    Cuda

⚠️ You need a CUDA-compatible GPU (compute capability 5.2+) to use this software. cuNVSM is a C++/CUDA implementation of state-of-the-art NVSM and LSE representation learning algorithms.

pyndri - pyndri is a Python interface to the Indri search engine.

  •    Python

pyndri is a Python interface to the Indri search engine (http://www.lemurproject.org/indri/). During development, we use Python 3.5. Some of the examples require numpy.

pytrec_eval - pytrec_eval is an Information Retrieval evaluation tool for Python, based on the popular trec_eval

  •    C++

pytrec_eval is a Python interface to TREC's evaluation tool, trec_eval. It is an attempt to stop the cultivation of custom implementations of Information Retrieval evaluation measures for the Python programming language. The module was developed using Python 3.5. You need a Python distribution that comes with development headers. In addition to the default Python modules, numpy and scipy are required.

SERT - Semantic Entity Retrieval Toolkit

  •    Python

The Semantic Entity Retrieval Toolkit (SERT) is a collection of neural entity retrieval algorithms. SERT requires Python 3.5 and assorted modules. The trec_eval utility is required for evaluation and the end-to-end scripts. If you wish to train your models on GPGPUs, you will need a GPU compatible with Theano.

pke - Python Keyphrase Extraction module

  •    Python

pke is an open source Python-based keyphrase extraction toolkit. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new approaches. pke also allows for easy benchmarking of state-of-the-art keyphrase extraction approaches, and ships with supervised models trained on the SemEval-2010 dataset. pke works only for Python 2.x at the moment.

indonesian-nlp-playground - Personal repository for Indonesian linguistics research

  •    Python

As the name suggests, this is a personal repository for Indonesian linguistics research. Everything in this repository is experimental and may change at any time, following the guidance of the swaying grass or the lunch menu at the Mbah Jingkrak restaurant.

Mimir - OSINT Threat Intel Interface

  •    Python

OSINT Threat Intel Interface, named after the old Norse god of knowledge. Mimir functions as a CLI to HoneyDB, which in short is an aggregated OSINT threat intel pool. Starting the program brings you to a menu of options.

horus-ner - HORUS: A framework to boost NLP tasks

  •    Python

HORUS is a meta and multi-level framework designed to provide a set of word-level features to boost natural language frameworks. Its architecture is based on image processing and text classification clustering algorithms, and it has been shown to be especially helpful for noisy data, such as microblogs. We are currently investigating Named Entity Recognition (NER) as a use case. This version supports the identification of classical named-entity types (LOC, PER, ORG).

errorlookup - Simple tool for retrieving information about Windows error codes.

  •    C++

A portable open source tool that translates error codes into more meaningful text descriptions. The interface is simple: one box to type your code and another that displays the details, so there doesn’t seem to be much to learn. The program also supports a wide range of codes: regular Windows errors, DirectX, NTSTATUS errors, Windows Internet errors, and STOP codes. We think it can probably be configured to read more (Settings > System modules), although there’s no documentation to confirm that and we didn’t test it.