Icsiboost - Open-source implementation of Boostexter (Adaboost based classifier)


Boosting is a meta-learning approach that combines an ensemble of weak classifiers into a strong classifier. Adaptive Boosting (Adaboost) is a greedy search for a linear combination of classifiers that overweights the examples misclassified by each classifier. icsiboost implements Adaboost over stumps (one-level decision trees) on discrete and continuous attributes (words and real values). See http://en.wikipedia.org/wiki/AdaBoost and the papers by Y. Freund and R. Schapire for more details. This approach is one of the simplest and most efficient ways to combine continuous and nominal values. Our implementation aims to allow training from millions of examples with hundreds of features (or millions of sparse features) in reasonable time and memory. It includes classification-time code for C, Python and Java.

Here is an excellent tutorial on Boosting: http://nips.cc/Conferences/2007/Program/event.php?ID=575

WARNING: we are planning to switch to git for revision control.

NEWS:
2012-05-21: Switched to dual license: GPL and BSD. Choose the license that best fits your project. Old revisions are GPL only.
2011-02-15: icsiboost is now in archlinux (aur package icsiboost-svn).
2010-12-23: QualityTesting tracks svn releases for time, memory and error rate on a sample dataset.
2010-12-10: Added a script to convert example files from icsiboost to svm_light/mlcomp format, including ngram/cutoff management (icsiboost_to_svm.py).
2010-10-20: Fixed a bug with continuous features (r159) and removed the need for the --display-maxclass option: it is now the default when examples have a single label (r160). The old way of computing the error rate is still used in the multilabel scenario.
2010-10-03: Added support for Solaris (r154); also added win32 downloads (requires cygwin1.dll).
2010-05-05: Maintenance release: better error handling of discrete features declared in the names file (r130).
2010-01-24: Added Stanislas' patch to display error rates based on argmax instead of sign decisions.
2009-10-19: Added a rudimentary Java implementation of the classifier.
2009-10-10: Released the optimal_threshold.pl script to get a better decision threshold on unbalanced data (for binary problems only).
2009-07-29: There is now a decoder in pure Python. It is quite slow (and could be optimized), but is useful for small Python projects and for educational purposes.
2009-04-08: WARNING: On multiclass problems, icsiboost does not compute the error rates the same way boostexter does. This does not result in lower performing models, and an option for getting compatible values will be implemented in the future.
2009-03-30: You can now specify the type of text expert and its length on a per-column basis in the names file (previously set globally with -N ngram -W 3...). Example: "words:text." becomes "words:text:expert_type=ngram expert_length=5 cutoff=3." which is equivalent to -N ngram -W 5 --cutoff 3, but only for the words column.

You should use the svn version to get the latest fixes (change log). Do not use r96: a bug made training fail (all users should upgrade to r102, which fixes major bugs).

WARNING: if you trained a model with -N ngram -W length, you must pass the same options at test time, otherwise the related weak classifiers will be ignored (unless you specify them in the names file).
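To make the description of Adaboost over stumps more concrete, here is a minimal sketch in Python. It assumes binary labels in {-1, +1} and continuous features only, and it uses plain discrete AdaBoost. It is not icsiboost's code, which follows Boostexter and also handles text features and multiple labels. All function names (stump_predict, train_stump, adaboost, score, classify) are illustrative and do not come from the project.

```python
import math


def stump_predict(stump, row):
    """A stump tests one feature against one threshold."""
    feature, threshold, polarity = stump
    return polarity if row[feature] > threshold else -polarity


def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best_err, best_stump = None, None
    for feature in range(len(X[0])):
        values = sorted(set(row[feature] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        thresholds = [(a + b) / 2.0 for a, b in zip(values, values[1:])] or values
        for threshold in thresholds:
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if best_err is None or err < best_err:
                    best_err, best_stump = err, stump
    return best_err, best_stump


def adaboost(X, y, rounds):
    """Return a list of (alpha, stump) pairs: a weighted vote of stumps."""
    n = len(X)
    w = [1.0 / n] * n  # start from uniform example weights
    model = []
    for _ in range(rounds):
        err, stump = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, stump))
        # Overweight misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def score(model, row):
    """Boosting score: the linear combination of stump votes."""
    return sum(alpha * stump_predict(stump, row) for alpha, stump in model)


def classify(model, row):
    """Sign decision on the boosting score."""
    return 1 if score(model, row) >= 0 else -1


# Toy data (made up for illustration): two continuous features per example.
toy_X = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [6.0, 1.0], [7.0, 3.0], [8.0, 2.0]]
toy_y = [-1, -1, -1, 1, 1, 1]
toy_model = adaboost(toy_X, toy_y, rounds=3)
print([classify(toy_model, row) for row in toy_X])
```

The exhaustive threshold search is quadratic in the number of examples per feature, which is fine for a toy dataset but would not scale to the millions of examples the project targets.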
Get and Compile (you need PCRE >= 01-December-2003):

  svn checkout http://icsiboost.googlecode.com/svn/trunk/ .
  cd icsiboost
  autoreconf
  automake -a
  ./configure CFLAGS=-O3
  make

Program usage (revision r124):

USAGE: ./icsiboost [options] -S
  --version              print version info
  -S                     defines model/data/names stem
  -n                     number of boosting iterations (also limits test time classifiers, if model is not packed)
  -E                     set smoothing value (default=0.5)
  -V                     verbose mode
  -C                     classification mode -- reads examples from
  -o                     long output in classification mode
  -N                     choose a text expert between fgram, ngram and sgram (also ":text:expert_type=" in the .names)
  -W                     specify window length of text expert (also ":text:expert_length=" in .names)
  --dryrun               only parse the names file and the data file to check for errors
  --cutoff               ignore nominal features occuring unfrequently (also ":text:cutoff=" in .names)
  --drop                 drop text features that match a regular expression (also ":text:drop=" in .names)
  --no-unk-ngrams        ignore ngrams that contain the "unk" token
  --jobs                 number of threaded weak learners
  --do-not-pack-model    do not pack model (this is the default behavior)
  --pack-model           pack model (for boostexter compatibility)
  --output-weights       output training examples weights at each iteration
  --posteriors           output posterior probabilities instead of boosting scores
  --model                save/load the model to/from this file instead of .shyp
  --resume               resume training from a previous model (can use another dataset for adaptation)
  --train                bypass the .data filename to specify
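Since text-expert options given at training time must be repeated at test time, one way to keep the two invocations consistent is to build both command lines from a single shared option list. The sketch below only assembles argument lists from flags named in the usage text (-S, -n, -C, -N, -W). The argument forms are an assumption, because the placeholders in the quoted usage text were lost. The helper names and the "spam" stem are hypothetical, and nothing here runs the binary.

```python
def text_expert_args(expert_type, length):
    """Options that must be identical for training and classification."""
    return ["-N", expert_type, "-W", str(length)]


def train_command(stem, iterations, expert_args, binary="./icsiboost"):
    """Argument list for training with a given number of boosting iterations."""
    return [binary, "-S", stem, "-n", str(iterations)] + list(expert_args)


def classify_command(stem, expert_args, binary="./icsiboost"):
    """Argument list for classification mode (-C) with the same expert options."""
    return [binary, "-S", stem, "-C"] + list(expert_args)


# Hypothetical stem "spam": both commands share one definition of the expert.
shared = text_expert_args("ngram", 3)
print(" ".join(train_command("spam", 100, shared)))
print(" ".join(classify_command("spam", shared)))
```

The lists could be handed to subprocess.run once the binary is built. Declaring the expert per column in the names file, as described in the news entry above, avoids the need to repeat the options at all.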

http://code.google.com/p/icsiboost


Related Projects

raspBerry+


raspBerry+ is a web-based administration platform for Blackberry Enterprise Server for MS Exchange (BES). You can activate, kill, delete and add users by group, and get the status of users, their handhelds and services. With a little download-area and a comment-system

RASP


RASP's A Sneakernet Proxy; download using a thumbdrive.

RasmusDSP


RasmusDSP is an embeddable Audio/MIDI processor. It contains various filters and generators (including a SoundFont 2.0 compatible synthesizer). It has a script interpreter that is used to describe instruments and to route Audio/MIDI signals between processor units.

Rasea


An acronym for cRoss-plAtform accesS control for Enterprise Applications. Rasea aims to become a reference in access control as a service based on the RBAC model.

Rascal


Rascal, the Advanced Scientific CALculator, is a platform-independent modular calculator. Based on modules for integers, doubles, strings, vectors and matrices, it can be easily extended with existing C or C++ code.



Rars


RARS is the Robot Auto Racing Simulation, in which the drivers are robot programs. It is intended as a competition among programmers. It consists of a simulation of the physics of cars, a graphic display of the race, and a robot driver for each car.

RARPlayer


This small program allows you to play a video directly from a RAR file and do so in real-time. Both VLC and MPlayer are supported video players.

RAReXtract


RAReXtract is a Front-End for the UnRAR command line utility for Mac OS X 10.5 (Leopard). Its purpose is the rapid and convenient extraction of RAR archives with a double click.

RAR Expander


Rar Expander is a MacOSX program which extracts the files contained in single or multi-volume RAR archives. It uses the official unRAR library internally so it is fully compatible with archives produced by WinRAR.

rarcrack


This program uses a brute force algorithm to guess your encrypted compressed file's password. If you forget your encrypted file password, this program is the solution. This program can crack zip, 7z and rar file passwords.

RArcInfo


RArcInfo is a package for R (http://www.r-project.org) to import data from binary Arc/Info V7.X coverages and E00 files. This will allow R users to use it as a primary GIS tool.

rar brute force shell script - rarbrute


This is rarbrute, a shell script to brute force encrypted rar files under Unix and Linux. A long wordlist and a paper about security in internet cafes are included.

Raquel Database System


The system will: 1. use RAQUEL (= Relational Algebra Query, Update and Executive Language) for programming, implementing Third Manifesto principles; 2. have a 'Lego-like' architecture of building blocks and plug-ins, for wider applicability.

RAPv4


RAPv4 is an engine for building web applications from only a business description (in XML format). NEW 04/2006: Stable 2006 release. Adds new functions like mail, sms, web services, graph, map engine (GIS), Excel output, QBE... and also a beta release of

Rafkill


2D scroller. Clone of Raptor: Call of the Shadows and Tyrian. Fun game written in C++ using Allegro.

rapple


Lightweight XML-based transformation tool written in C that builds upon expat, tidylib and XSLT to transform authored web content (incl. word-processor-generated HTML) into styled web content suitable for publication.

RapidSMS


RapidSMS is an open-source internet and communications platform

RapidSmith


RapidSmith is a research-based FPGA CAD tool framework written in Java for modern Xilinx FPGAs. Based on XDL, its objective is to serve as a rapid prototyping platform for research ideas and algorithms relating to low level FPGA CAD tools.

Rapidshare Mass Downloader


This program removes the need for human interaction while downloading files from rapidshare (without a premium account). It downloads all the rapidshare links sequentially to the specified location.

rapido visual profiler


rapido is a visual profiler for linux-x86. It traces function calls using the ptrace interface and displays the collected information in a visual flow chart. rapido does not require recompiling the application.