Carrot2 - Search Results Clustering Engine


Carrot2 is an open-source search results clustering engine. It can cluster search results from various sources and organize them into small thematic groups of documents. Carrot2 offers ready-to-use components for fetching search results from many sources, including the Yahoo API, Google API, Bing API, eTools Meta Search, Lucene, Solr, Google Desktop, and more.

Carrot2 is implemented in Java and also has a native C# API, which requires no Java runtime and offers comparable performance. It additionally supports a REST interface, which can be called from languages such as PHP and Ruby.

If you have search instances running on multiple nodes and need to search across all of them, you need a way to combine, filter, and sort the results. Carrot2 does this job efficiently and is well suited to working with Lucene, Solr, and Nutch.

Carrot2 can even be called a meta search engine: it has built-in functionality to fetch results from all popular search engines and combine them. It also offers supporting tools, including a command-line application and a GUI, for experimenting with the product. Search plug-ins for Firefox and Internet Explorer are also available.
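The core idea of results clustering can be illustrated with a toy sketch. This is not Carrot2's actual algorithm (Carrot2 uses techniques such as Lingo and STC that build labels from phrase and term co-occurrence); it simply groups result snippets under their most frequent non-stopword term to show the shape of the input and output:

```python
from collections import Counter, defaultdict

# Minimal stopword list for the toy example only.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "in", "to", "is", "on"}

def cluster_by_dominant_term(snippets):
    """Group search-result snippets under their most frequent term.

    A toy stand-in for a clustering engine: each cluster is labeled by
    one word, and every snippet lands in exactly one cluster.
    """
    clusters = defaultdict(list)
    for snippet in snippets:
        words = [w for w in snippet.lower().split() if w not in STOPWORDS]
        label = Counter(words).most_common(1)[0][0] if words else "other"
        clusters[label].append(snippet)
    return dict(clusters)

results = [
    "Apache Lucene is a search library",
    "Solr is a search server built on Lucene",
    "Clustering groups similar documents",
]
groups = cluster_by_dominant_term(results)
```

Each snippet appears in exactly one labeled group; a real engine would additionally score cluster labels and allow overlapping clusters.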




Related Projects

beowulf_ssh_cluster - Skeleton program for a simple Beowulf cluster that uses ssh to communicate

This program is an example of a Beowulf cluster, or so-called Stone SouperComputer, built with Python and SSH. In this example, the server connects to any number of client computers via SSH and asks them to help compute some problem. Once a client finishes, it sends the result back to the server, which stores the result on its own disk. The server then sends that client a new set of computations to finish, and this repeats until all the computations are finished. The client never has to store any information. The server keeps track of the client threads, the overall productivity and the productivity of each client, and the entirety of the finished results. Originally, a Beowulf cluster was used for computation. There are many better ways of getting speed out of multiple computers (like the clusters that carry out the great prime search, or solve the protein folding problem), so this is not the optimal use of a Beowulf cluster. This is a cluster of antiquated computers, after all, so most of them will be slow.
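The dispatch loop described above can be sketched locally. This is a minimal, hypothetical simulation: worker threads stand in for remote clients, and queues stand in for the SSH channels (a real implementation would use something like paramiko or `ssh` subprocesses instead):

```python
import queue
import threading

def worker(tasks, results):
    # Plays the role of a remote client: receive work, compute, send back.
    while True:
        n = tasks.get()
        if n is None:                 # sentinel: no more work for this client
            tasks.task_done()
            return
        results.put((n, n * n))       # the "computation" here is just squaring
        tasks.task_done()

def dispatch(numbers, n_clients=4):
    # Plays the role of the server: hand out work, collect results,
    # and repeat until all the computations are finished.
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_clients)]
    for t in threads:
        t.start()
    for n in numbers:
        tasks.put(n)
    for _ in threads:                 # one shutdown sentinel per client
        tasks.put(None)
    tasks.join()
    for t in threads:
        t.join()
    return dict(results.queue)        # drain the collected results

squares = dispatch(range(5))
```

The server-side bookkeeping (per-client productivity, persistence to disk) is omitted; the point is the hand-out/collect/repeat loop.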

incubator-slider - Mirror of Apache Slider

Slider is a YARN application that deploys existing distributed applications on YARN, monitors them, and makes them larger or smaller as desired, even while the cluster is running. Clusters can be stopped and restarted later; the distribution of the deployed application across the YARN cluster is persisted, enabling a best-effort placement close to the previous locations on a cluster restart. Applications which remember the previous placement of data (such as HBase) can exhibit fast start-up times as a result.

solr-scale-tk - Fabric-based framework for deploying and managing SolrCloud clusters in the cloud.

Setup
=====

Make sure you're running Python 2.7 and have installed the Fabric and boto dependencies. On the Mac, you can do:

```
sudo easy_install fabric
sudo easy_install boto
```

For more information about Fabric, see its documentation. You also need the pysolr project from GitHub; clone it and set it up as well:

```
git clone pysolr
sudo python install
```

Note, you do not need to know any Python in order to use this framework.

cluster - A simple API for managing a network cluster with smart peer discovery.

Package cluster provides a small and simple API to manage a set of remote peers. It falls short of a distributed hash table in that the only communication allowed between two nodes is direct communication. The central contribution of this package is to keep the set of remote peers updated and accurate: whenever a remote is added, that remote will share all of the remotes that it knows about. The result is a very simple form of peer discovery. This also includes handling both graceful and ungraceful disconnections. In particular, if a node is disconnected ungracefully, other nodes will periodically try to reconnect with it.
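The "share everything you know on add" rule can be sketched in a few lines. This is an in-memory illustration of the idea, not the package's actual API; real peers would exchange these sets over the network, and would also push later updates to already-connected peers:

```python
class Peer:
    """Toy model of a cluster member that tracks known remote peers."""

    def __init__(self, name):
        self.name = name
        self.remotes = set()      # names of remote peers this node knows

    def connect(self, other):
        # Direct connection: each side shares every remote it knows about,
        # so both end up with the union (minus themselves).
        merged = self.remotes | other.remotes | {self.name, other.name}
        self.remotes = merged - {self.name}
        other.remotes = merged - {other.name}

a, b, c = Peer("a"), Peer("b"), Peer("c")
a.connect(b)      # a and b learn about each other
b.connect(c)      # c learns about a transitively through b
```

Note that in this toy version `a` does not hear about `c` until its next exchange; a full implementation would propagate membership changes to existing peers as well.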

Helix - Cluster Management Framework

Helix is a generic cluster management framework used for the automatic management of partitioned, replicated, and distributed resources hosted on a cluster of nodes. It helps schedule maintenance tasks such as backups, garbage collection, file consolidation, and index rebuilds; repartition data or resources across the cluster; inform dependent systems of changes so they can react appropriately; throttle system tasks and changes; and so on.

coreos-cluster - An example of how to provision a CoreOS cluster on AWS using Terraform and ansible-vault

An example of how to provision a CoreOS cluster on AWS using Terraform. This example sets up a VPC, private and public networks, a NAT server, an RDS database, a CoreOS cluster, and a private Docker registry, and properly configures tight security groups. The cluster is configured via cloud-config user data and runs etcd2.service and fleet.service. All peer and client traffic is encrypted using self-signed certificates.

spark-cluster-deployment - Automates Spark standalone cluster tasks with Puppet and Fabric.

Apache Spark is a distributed computing research project which interacts with HDFS and heavily utilizes in-memory caching. Spark 1.0.0 can be deployed to traditional cloud and job-management services such as EC2, Mesos, or YARN. Further, Spark's standalone cluster mode enables Spark to run on other servers without installing other job-management services. However, configuring and submitting applications to a Spark 1.0.0 standalone cluster currently requires files to be synchronized across the entire cluster, including the Spark installation directory. This project utilizes Fabric and Puppet to further automate the Spark standalone cluster. The Puppet scripts are MIT-licensed and come from stefanvanwouw/puppet-spark and wikimedia/puppet-cdh4.

prombench - Prometheus E2E benchmarking tool

Make sure you have AWS credentials configured in a CLI profile. If you don't, use aws configure --profile <some-name> to configure them. For convenience, if these are the only credentials you use, you can leave out the --profile argument to configure them as the default. If you do set a profile name, make sure you export AWS_PROFILE=<profile-name> in the shell session where you want to use them. Run make to create a cluster. This will create all the necessary resources in AWS using Terraform and kops. After the make command finishes, your cluster will take a little while to completely build and become available. A kubectl context will be automatically configured for you with the credentials to access the cluster. You can use it to check whether the cluster is done building: repeat kubectl cluster-info until you no longer get an error. Now your cluster is ready.

dataproc-initialization-actions - Run in all nodes of your cluster before the cluster starts - lets you customize your cluster

When creating a Google Cloud Dataproc cluster, you can specify initialization actions, in the form of executables and/or scripts, that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up. The folder structure of this Cloud Storage bucket mirrors this repository, so you should be able to use this Cloud Storage bucket (and the initialization scripts within it) for your clusters.


This container arrangement makes popular tasks much less painful:

* Never hard-code IP addresses in source code. Instead, connect to localhost:7101, where haproxy routes traffic to a "random" PostgreSQL server in the cluster (selected by the port number). Consul tracks cluster membership as nodes join and leave.
* Registering a consul service? Just drop a consul JSON file into /etc/consul/conf.d/ (the directory is not present in this repository because it's empty).
* Registering a new service for supervi


OpenSSI webView is a simple and easy-to-use OpenSSI cluster monitoring system. Its goal is to provide a quick overview of the cluster state by graphing vital functions and graphically representing key figures. It allows the cluster administrator to keep an eye on the cluster's health and usage rate, to quickly view each node's state and load, and to watch, and even migrate, users' processes all across the cluster.

mariadb-ansible-galera-cluster - Automated installation of MariaDB Galera Cluster using Ansible

These roles allow you to automatically set up a MariaDB Galera cluster with sane default settings. They are currently tested only on RHEL/CentOS 7, but most tasks can be reused for Debian- or SUSE-based distributions.


Cluster Insight is a Kubernetes service that collects runtime metadata about the resources in a Kubernetes cluster and infers relationships between them to create a context graph. A context graph is a point-in-time snapshot of the cluster's state. Clients of the Cluster Insight service, such as user interfaces, can retrieve context graphs through the service's REST API. Each call may produce a different context graph, reflecting the inherently dynamic nature of the Kubernetes cluster.
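A context graph of the kind described above is just typed nodes plus typed relationship edges. The following is a hypothetical sketch of building one snapshot from a pod-to-node assignment; the real service infers many more relation types and serves the result over REST:

```python
def build_context_graph(pods):
    """Toy point-in-time snapshot: one node per resource, one edge per
    'runs on' relationship between a Pod and its cluster Node."""
    graph = {"nodes": [], "edges": []}
    for pod in pods:
        graph["nodes"].append({"id": pod["name"], "type": "Pod"})
        graph["edges"].append({"type": "runsOn",
                               "source": pod["name"],
                               "target": pod["node"]})
    for node_name in sorted({p["node"] for p in pods}):
        graph["nodes"].append({"id": node_name, "type": "Node"})
    return graph

# Hypothetical cluster state at one instant; a later call could differ.
snapshot = build_context_graph([
    {"name": "web-1", "node": "node-a"},
    {"name": "db-1", "node": "node-b"},
])
```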

khealth - basic kubernetes health monitoring

khealth is a Kubernetes cluster monitoring suite. Its routines exercise Kubernetes subsystems and send events to collectors, which collate these events to compute the current cluster state. Cluster status is available from the collectors over a simple HTTP API, which is served on a cluster nodeport in the example below. If you have a Kubernetes cluster, you can deploy khealth.
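The routine/collector split described above is a general pattern. Here is a minimal, hypothetical sketch of the collector side (not khealth's actual code): routines push pass/fail events, and the collector collates the latest event per subsystem into an overall status:

```python
from collections import deque

class Collector:
    """Collates subsystem events into a current cluster status."""

    def __init__(self, window=10):
        # Keep only a bounded window of recent events.
        self.events = deque(maxlen=window)

    def record(self, subsystem, ok):
        # Called by routines after exercising a subsystem.
        self.events.append((subsystem, ok))

    def status(self):
        # Healthy only if every subsystem's most recent event passed.
        latest = {}
        for subsystem, ok in self.events:
            latest[subsystem] = ok
        return "healthy" if all(latest.values()) else "degraded"

c = Collector()
c.record("dns", True)
c.record("scheduler", False)
c.record("scheduler", True)   # a later success supersedes the failure
state = c.status()
```

In the real suite this status would be served over HTTP; here it is just returned as a string.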

charm-percona-cluster - Juju Charm - Percona XtraDB Cluster

Percona XtraDB Cluster is a high-availability and high-scalability solution for MySQL clustering. It integrates Percona Server with the Galera library of MySQL high-availability solutions in a single product package, enabling you to create a cost-effective MySQL cluster. This charm deploys Percona XtraDB Cluster onto Ubuntu.

kubernetes-cluster-federation - Kubernetes cluster federation tutorial

This tutorial will walk you through setting up a Kubernetes cluster federation composed of four Kubernetes clusters across multiple GCP regions. This guide is not for people looking for a fully automated command to bring up a Kubernetes cluster federation; if that's you, check out Setting up Cluster Federation with Kubefed.

corvus - A fast and lightweight Redis Cluster Proxy for Redis 3.0

Corvus is a fast and lightweight Redis cluster proxy for Redis 3.0 with cluster mode enabled. Most Redis client implementations don't support Redis Cluster. We have a lot of services relying on Redis, written in Python, Java, Go, Node.js, etc. It's hard to provide Redis client libraries for multiple languages without breaking compatibility. We used twemproxy before, but it relies on sentinel for high availability, and it requires a restart to add or remove backend Redis instances, which causes service interruption. Twemproxy is also single-threaded, so we had to deploy multiple twemproxy instances for large numbers of clients, which causes the same headaches.

Spark - Fast Cluster Computing

Apache Spark is an open-source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

csync2 - cluster synchronization tool

Csync2 is a cluster synchronization tool. It can be used to keep files on multiple hosts in a cluster in sync. Csync2 can handle complex setups with many more than just two hosts, handles file deletions, and can detect conflicts. It is well suited to HA clusters, HPC clusters, COWs (clusters of workstations), and server farms.

ClusterAware Manager for .NET

ClusterAware.NET makes it easier for .NET developers to manage the Windows Failover Cluster provider through a powerful class library. It is developed in C# with .NET 3.5 and .NET 4.0.