Ephyra - Question Answering System

Ephyra is a modular and extensible framework for open domain question answering (QA). The system retrieves accurate answers to natural language questions from the Web and other sources. The goal is to give researchers the opportunity to develop new QA techniques without worrying about the end-to-end system.




GATE - General Architecture for Text Engineering

GATE handles text analysis of all shapes and sizes. It supports diverse language processing tasks and components, such as parsers, morphology, tagging, information retrieval tools, and information extraction components for various languages. It also supports measuring, evaluating, modeling, and persisting the resulting data structures. It can analyze text or speech. It has built-in support for machine learning and supports additional machine learning implementations via plugins.

statistical-analysis-python-tutorial - Statistical Data Analysis in Python

Chris Fonnesbeck is an Assistant Professor in the Department of Biostatistics at the Vanderbilt University School of Medicine. He specializes in computational statistics, Bayesian methods, meta-analysis, and applied decision analysis. He originally hails from Vancouver, BC and received his Ph.D. from the University of Georgia. This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course consists of a two-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in NumPy, SciPy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to bootstrapping methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.
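
As a rough illustration of the bootstrapping idea mentioned above, the following sketch computes a percentile bootstrap confidence interval for a sample mean. It is not taken from the tutorial: it uses only the Python standard library rather than NumPy or Pandas, and the function name `bootstrap_mean_ci`, its parameters, and the sample values are assumptions made for this example.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative only)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(sample)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Made-up sample values, purely for demonstration.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
low, high = bootstrap_mean_ci(data)
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

The percentile method is the simplest bootstrap interval; with real data in a DataFrame, the same resampling could be done with the sampling facilities of Pandas or NumPy instead.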

Thoth - Real-time Solr Monitor and Search Analysis Engine

Thoth is a real-time Solr monitor and search analysis engine. It is a set of tools that helps you collect, visualize and leverage data coming from your Solr search infrastructure.

AIL-framework - Analysis Information Leak framework

AIL is a modular framework to analyse potential information leaks from unstructured data sources, such as pastes from Pastebin or similar services, or from unstructured data streams. The AIL framework is flexible and can be extended to support other functionalities to mine or process sensitive information. The default installing_deps.sh is for Debian and Ubuntu based distributions. For Arch Linux based distributions, you can replace it with installing_deps_archlinux.sh.

Biological Pathway Exchange Language

A Data Exchange Format for Biological Pathway Information

knowledge-repo - A next-generation curated knowledge sharing platform for data scientists and other technical professions

The Knowledge Repository project is focused on facilitating the sharing of knowledge between data scientists and other technical roles using data formats and tools that make sense in these professions. It provides various data stores (and utilities to manage them) for "knowledge posts", with a particular focus on notebooks (R Markdown and Jupyter / IPython Notebook) to better promote reproducible research. Check out this Medium post for the inspiration for the project.

Lemur - Search Engine

The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software. The project is best known for its Indri search engine, Lemur Toolbar, and ClueWeb09 dataset.

GoldenOrb - Scalable Graph Analysis

GoldenOrb is a cloud-based project for massive-scale graph analysis, built upon Apache Hadoop and modeled after Google's Pregel architecture. It aims to provide solutions to complex data problems, remove limits to innovation, and contribute to the emerging ecosystem that spans all aspects of big data analysis. It enables users to run analytics on entire data sets instead of samples.


vyasa is a digital library application that incorporates the functions of digital asset and document management systems. It facilitates information retrieval and knowledge discovery by providing comprehensive metadata generation and semantic analysis.

Data-Analysis-and-Machine-Learning-Projects - Repository of teaching materials, code, and data for my data analysis and machine learning projects

This is a repository of teaching materials, code, and data for my data analysis and machine learning projects. Each repository will (usually) correspond to one of the blog posts on my web site.

Hydra - Distributed processing framework for search solutions

Hydra is designed to give a search solution the tools necessary to modify the data that is to be indexed in an efficient and flexible way. It does this by providing a scalable and efficient pipeline that documents pass through before being indexed into the search engine. Architecturally, Hydra sits between the search engine and the source integration.
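
The idea of a document pipeline sitting between the source and the search engine can be sketched as follows. This is a generic illustration and not Hydra's actual API: the `Document` and `Stage` types, the stage functions, and `run_pipeline` are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical types: a document is a flat dict of string fields, and a stage
# either returns a (possibly modified) document or None to discard it.
Document = Dict[str, str]
Stage = Callable[[Document], Optional[Document]]


def strip_whitespace(doc: Document) -> Optional[Document]:
    """Trim leading and trailing whitespace from every field."""
    return {key: value.strip() for key, value in doc.items()}


def drop_empty_body(doc: Document) -> Optional[Document]:
    """Discard documents that have no body text."""
    return doc if doc.get("body") else None


def add_title_lower(doc: Document) -> Optional[Document]:
    """Add a lower-cased copy of the title as an extra field."""
    return {**doc, "title_lower": doc.get("title", "").lower()}


def run_pipeline(docs: Iterable[Document], stages: List[Stage]) -> List[Document]:
    """Pass each document through every stage in order before 'indexing'."""
    ready: List[Document] = []
    for doc in docs:
        current: Optional[Document] = doc
        for stage in stages:
            current = stage(current)
            if current is None:  # a stage rejected the document
                break
        if current is not None:
            ready.append(current)
    return ready


# Made-up source documents for demonstration.
source_docs = [
    {"title": "  Solr Tips ", "body": " Use filters. "},
    {"title": "Empty", "body": "   "},
]
indexed = run_pipeline(source_docs, [strip_whitespace, drop_empty_body, add_title_lower])
print(indexed)
```

Keeping each stage a small independent function is one way to make such a pipeline easy to extend; a distributed framework would additionally spread the stages across processes or machines, which this sketch does not attempt.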

OSQA - Stack Overflow-like Q&A system in Python

OSQA is the open source Q&A system. It is more than just an FAQ page; it is a full-featured Q&A community. Users earn points and badges for useful participation, and everyone in the community wins. OSQA is built and maintained by a team of developers who share an interest in making a great, free, open source Q&A system available to everyone. The OSQA project is hosted and financially supported by DZone, Inc.

Modular toolkit for Data Processing (MDP)

The Modular toolkit for Data Processing (MDP) is a Python data processing framework. From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures. From the scientific developer's perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The new i

perfview - PerfView is a CPU and memory performance-analysis tool

PerfView is a free performance-analysis tool that helps isolate CPU and memory-related performance issues. It is a Windows tool, but it also has some support for analyzing data collected on Linux machines. It works for a wide variety of scenarios, but has a number of special features for investigating performance issues in code written for the .NET runtime. If you are unfamiliar with PerfView, there are PerfView video tutorials. Also, Vance Morrison's blog gives an overview and getting-started information.

Kaldin - Online Examination Software

Open source examination software for conducting any type of exam, including online exams and pre-screening for colleges, universities and companies. Its key features include question categories, question papers, open exams, results, user management, exam scheduling, real-time notifications, and analysis by user, exam and category, with the ability to download certificates.


Plone database management and analysis system. Curare supports semantic analysis of data, allowing databases to be classified based on their informational content.

GRASS GIS - Geographic Resources Analysis Support System

Geographic Resources Analysis Support System, commonly referred to as GRASS GIS, is a Geographic Information System (GIS) used for data management, image processing, graphics production, spatial modelling, and visualization of many types of data. GRASS supports raster and vector data in two and three dimensions. The vector data model is topological, meaning that areas are defined by boundaries and centroids; boundaries cannot overlap within a single layer.


Thea, Tools for High-throughput Experiment Analysis, is an integrated information processing system dedicated to annotating data produced by classification systems with biological information drawn from a knowledge base.

Insight Segmentation and Registration Toolkit

ITK is an open-source, cross-platform system that provides developers with an extensive suite of software tools for image analysis. Developed through extreme programming methodologies, ITK employs leading-edge algorithms for registering and segmenting multidimensional data.

X-Itools: Enterprise Collaboration

Enterprise Collaboration modules and strong Log Analysis modules