Lucene-on-cassandra - A Column-Oriented Cassandra-Based Lucene Directory


Introduction

This project aims to deliver a type of Lucene directory that stores its files in a Cassandra server, which makes for a scalable and robust store for Lucene indices.

Architecture

In brief, the CassandraDirectory maps the concept of a Lucene directory to a column family that belongs to a certain keyspace located in a given Cassandra server. Further, it stores each file under this directory as a row in that column family. Specifically, each file is broken down into blocks (whose sizes are capped), and each block (see FileBlock) is stored as the value of a column in the corresponding row. As per http://wiki.apache.org/cassandra/CassandraLimitations, this is the recommended approach for dealing with large objects, which Lucene files tend to be. In addition, a descriptor of the file (see FileDescriptor) that outlines a map of the blocks therein is stored as one of the columns in that row as well. Think of this descriptor as an inode for Cassandra-based files.

The exhaustive mapping of a Lucene directory (file) to a Cassandra column family (row) is captured in the ColumnOrientedDirectory (ColumnOrientedFile) inner class. Specifically, these classes interpret Cassandra's data model in terms of Lucene's, and vice versa. More importantly, they are the only two inner classes that have a foot in both the Lucene and Cassandra camps.

All writes to a file in this directory occur through a CassandraIndexOutput, which puts the data flushed from a write-behind buffer into the fitting set of blocks. By the same token, all reads from a file in this directory occur through a CassandraIndexInput, which gets the data needed by a read-ahead buffer from the right set of blocks.

The last (but not least) inner class, CassandraClient, acts as a facade over a Thrift-based Cassandra client. In short, it provides operations to get/put rows/columns in the column family and keyspace associated with this directory.
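
The block layout described above can be illustrated with a minimal, self-contained sketch. It is not the project's code: the class name BlockedFileSketch, the "block-N" and "descriptor" column names, and the comma-separated descriptor format are all assumptions made for illustration, and an in-memory map stands in for a Cassandra row. It only shows the idea of capping block sizes and using a descriptor to locate the blocks that a read needs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one "row" per file, one column per capped-size block. */
final class BlockedFileSketch {
    // Assumed column name for the inode-like descriptor (not taken from the project).
    static final String DESCRIPTOR = "descriptor";

    /** Splits data into blocks of at most blockSize bytes and returns the row. */
    static Map<String, byte[]> writeRow(byte[] data, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        Map<String, byte[]> row = new LinkedHashMap<>();
        int blockCount = 0;
        for (int offset = 0; offset < data.length; offset += blockSize) {
            int end = Math.min(offset + blockSize, data.length);
            row.put("block-" + blockCount++, Arrays.copyOfRange(data, offset, end));
        }
        // The descriptor records file length, block size, and block count.
        String descriptor = data.length + "," + blockSize + "," + blockCount;
        row.put(DESCRIPTOR, descriptor.getBytes(StandardCharsets.UTF_8));
        return row;
    }

    /** Reads length bytes starting at position, touching only the blocks needed. */
    static byte[] read(Map<String, byte[]> row, long position, int length) {
        String[] fields = new String(row.get(DESCRIPTOR), StandardCharsets.UTF_8).split(",");
        long fileLength = Long.parseLong(fields[0]);
        int blockSize = Integer.parseInt(fields[1]);
        if (position < 0 || length < 0 || position + length > fileLength) {
            throw new IndexOutOfBoundsException("read outside of file bounds");
        }
        byte[] out = new byte[length];
        int copied = 0;
        while (copied < length) {
            long absolute = position + copied;
            int blockIndex = (int) (absolute / blockSize);   // which column holds this byte
            int blockOffset = (int) (absolute % blockSize);  // where in that block it sits
            byte[] block = row.get("block-" + blockIndex);
            int n = Math.min(length - copied, block.length - blockOffset);
            System.arraycopy(block, blockOffset, out, copied, n);
            copied += n;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "hello, blocked lucene file".getBytes(StandardCharsets.UTF_8);
        Map<String, byte[]> row = writeRow(data, 8);
        System.out.println(row.keySet());
        // This read starts in one block and finishes in the next.
        System.out.println(new String(read(row, 7, 7), StandardCharsets.UTF_8));
    }
}
```

In this sketch the descriptor is small enough to fetch on every read. A real implementation would presumably cache it and fetch blocks over the network, which is where the buffers of CassandraIndexInput and CassandraIndexOutput come in.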
Related Work

Unlike Lucandra, which attempts to bridge the gap between Lucene and Cassandra at the document level, the CassandraDirectory is self-sufficient in the sense that it does not require a rewrite of any other component in the Lucene stack. In other words, one may use the CassandraDirectory in conjunction with the Lucene IndexWriter and IndexReader, as one would any other kind of Lucene Directory. Moreover, given that the data unit transferred to and from Cassandra is a large block, one may expect fewer round trips, and hence better throughput, from the CassandraDirectory.

Conclusion

In conclusion, this directory attempts to marry the rich search-based query language of Lucene with the distributed, fault-tolerant database that is Cassandra. By delegating the responsibilities of replication, durability and elasticity to the directory, we free the layers above from such non-functional concerns. Our hope is that users will choose to make their large-scale indices instantly scalable by seamlessly migrating them to this type of directory (using Directory#copy(Directory source, Directory target)).

http://code.google.com/p/lucene-on-cassandra





Related Projects

raspBerry+


raspBerry+ is a web-based administration platform for Blackberry Enterprise Server for MS Exchange (BES). You can activate, kill, delete, add, and get the status of users, their handhelds and services on a per-group basis. With a little download-area and a comment-system

RASP


RASP's A Sneakernet Proxy; download using a thumbdrive.

RasmusDSP


RasmusDSP is an embeddable Audio/MIDI processor. It contains various filters and generators (including a SoundFont 2.0 compatible synthesizer). It has a script interpreter that is used to describe instruments and to route Audio/MIDI signals between processor units.

Rasea


An acronym for cRoss-plAtform accesS control for Enterprise Applications. Rasea aims to become a reference in access control as a service based on the RBAC model.

Rascal


Rascal, the Advanced Scientific CALculator, is a platform-independent modular calculator. Based on modules for integers, doubles, strings, vectors and matrices, it can be easily extended with existing C or C++ code.

Rars


RARS is the Robot Auto Racing Simulation, in which the drivers are robot programs. It is intended as a competition among programmers. It consists of a simulation of the physics of cars, a graphic display of the race, and a robot driver for each car.

RARPlayer


This small program allows you to play a video directly from a RAR file and do so in real-time. Both VLC and MPlayer are supported video players.

RAReXtract


RAReXtract is a Front-End for the UnRAR command line utility for Mac OS X 10.5 (Leopard). Its purpose is the rapid and convenient extraction of RAR archives with a double click.

RAR Expander


Rar Expander is a MacOSX program which extracts the files contained in single or multi-volume RAR archives. It uses the official unRAR library internally so it is fully compatible with archives produced by WinRAR.

rarcrack


This program uses a brute force algorithm to guess your encrypted compressed file's password. If you forget your encrypted file password, this program is the solution. This program can crack zip, 7z and rar file passwords.

RArcInfo


RArcInfo is a package for R (http://www.r-project.org) to import data from binary Arc/Info V7.X coverages and E00 files. This will allow R users to use it as a primary GIS tool.

rar brute force shell script - rarbrute


This is rarbrute, a shell script to brute force encrypted rar files under unix and linux. A long wordlist and a paper about security in internet cafes is included.

Raquel Database System


The system will: 1. use RAQUEL (= Relational Algebra Query, Update and Executive Language) for programming, implementing Third Manifesto principles. 2. have a 'Lego-like' architecture of building blocks and plug-ins, for wider applicability.

RAPv4


RAPv4 is an engine for building web application with only a business description (in XML format). NEW 04/2006 : Stable 2006 release. Add new functions like mail, sms, web services, graph, map engine (GIS), Excel output, QBE... and also a beta release of

Rafkill


2D scroller. Clone of Raptor: Call of the Shadows and Tyrian. Fun game written in C++ using Allegro.

rapple


Lightweight XML-based transformation tool written in C that builds upon expat, tidylib and XSLT to transform authored web content (incl. word processor generated HTML) into styled web content suitable for publication.

RapidSMS


RapidSMS is an open-source internet and communications platform

RapidSmith


RapidSmith is a research-based FPGA CAD tool framework written in Java for modern Xilinx FPGAs. Based on XDL, its objective is to serve as a rapid prototyping platform for research ideas and algorithms relating to low level FPGA CAD tools.

Rapidshare Mass Downloader


This program takes the human interaction out of downloading files from Rapidshare (without a premium account). It downloads all the Rapidshare links sequentially to the specified location.

rapido visual profiler


rapido is a visual profiler for linux-x86. It traces function calls using the ptrace interface and displays the information collected in a visual flow chart. rapido does not require recompilation of the application.