PARF - Parallel Random Forest Algorithm
The Random Forests algorithm is one of the best classification algorithms known, able to classify large quantities of data with great accuracy. The algorithm is also inherently parallelisable (a toy sketch illustrating this follows at the end of this section).

Originally, the algorithm was written in Fortran 77, a language that is obsolete and lacks many of the capabilities of modern programming languages; the original code is also not an example of "clear" programming, which makes it very hard to employ in education. Within this project the program was adapted to Fortran 90. In contrast to Fortran 77, Fortran 90 is a structured programming language, legible to researchers as well as to students.

The creator of the algorithm, Berkeley professor emeritus Leo Breiman, expressed great interest in this idea in our correspondence. He confirmed that no one had yet worked on a parallel implementation of his algorithm, and promised his support and help. Leo Breiman is one of the pioneers of machine learning and data mining, and a co-author of the first significant programs in that field (CART, Classification and Regression Trees).

The most up-to-date version of PARF's source code can be checked out via SVN (see the Source tab). Snapshots of the SVN repository are created occasionally and made available on the Downloads tab for convenience. A Fortran 90 compiler is required to build an executable; the currently supported compilers are Intel Fortran (free on Linux for academic use), Portland Group Fortran (commercial) and GNU g95 (free).

For an efficient Java implementation of the Random Forest algorithm that integrates into the Weka environment, see FastRandomForest.

RF and Random Forests are registered trademarks of Leo Breiman and Adele Cutler. PARF was developed in the Centre for Informatics and Computing and the Division of Electronics of the Rudjer Boskovic Institute, with the financial support of the Ministry of Science, Technology and Sports of Croatia, i-Project 2004-111. Authors: Goran Topić and Tomislav Šmuc.
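The toy Java sketch below (hypothetical illustration only, not PARF's Fortran code; the class ForestSketch, the stump representation and the mean-based split are all inventions for brevity) shows why the algorithm parallelises so naturally: every tree is grown on its own bootstrap sample, independently of all other trees, so the training loop needs no cross-tree communication at all.

import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ForestSketch {
    // A toy "tree": a decision stump thresholding a single feature.
    static class Stump {
        final int feature; final double threshold; final int leftLabel, rightLabel;
        Stump(int f, double t, int l, int r) { feature = f; threshold = t; leftLabel = l; rightLabel = r; }
        int predict(double[] x) { return x[feature] <= threshold ? leftLabel : rightLabel; }
    }

    static Stump trainStump(double[][] xs, int[] ys, long seed) {
        Random rng = new Random(seed);
        int n = xs.length;
        double[][] bx = new double[n][]; int[] by = new int[n];
        // Bootstrap sample: draw n rows with replacement.
        for (int i = 0; i < n; i++) { int j = rng.nextInt(n); bx[i] = xs[j]; by[i] = ys[j]; }
        int f = rng.nextInt(bx[0].length);              // random feature, as in Random Forests
        double t = 0; for (double[] row : bx) t += row[f]; t /= n; // crude split point: the mean
        int l0 = 0, l1 = 0, r0 = 0, r1 = 0;             // majority vote on each side of the split
        for (int i = 0; i < n; i++) {
            if (bx[i][f] <= t) { if (by[i] == 1) l1++; else l0++; }
            else               { if (by[i] == 1) r1++; else r0++; }
        }
        return new Stump(f, t, l1 > l0 ? 1 : 0, r1 > r0 ? 1 : 0);
    }

    public static void main(String[] args) {
        double[][] xs = { {0, 0}, {0, 1}, {1, 0}, {1, 1} };
        int[] ys = { 0, 0, 1, 1 };                      // class equals the first feature
        int numTrees = 100;
        // The embarrassingly parallel step: .parallel() trains all trees concurrently.
        List<Stump> forest = IntStream.range(0, numTrees).parallel()
                .mapToObj(i -> trainStump(xs, ys, i))
                .collect(Collectors.toList());
        double[] query = { 1, 0 };
        long votes = forest.stream().filter(s -> s.predict(query) == 1).count();
        System.out.println("Fraction of trees voting class 1: " + votes / (double) numTrees);
    }
}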
Apache Mahout has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.
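As a flavour of the collaborative-filtering side, here is a minimal sketch using Mahout's classic Taste API (0.x-era classes); the input file name ratings.csv (lines of userID,itemID,rating) and the neighbourhood size are assumptions for illustration.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteSketch {
    public static void main(String[] args) throws Exception {
        // ratings.csv: userID,itemID,rating per line (assumed file, not shipped with Mahout).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> top = recommender.recommend(1, 3); // top 3 items for user 1
        for (RecommendedItem item : top)
            System.out.println(item.getItemID() + " (score " + item.getValue() + ")");
    }
}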
HPCC is a proven and battle-tested platform for manipulating, transforming, querying and warehousing Big Data. It supports two types of configurations. Thor is responsible for consuming vast amounts of data, and for transforming, linking and indexing that data; it functions as a distributed file system with parallel processing power spread across the nodes. Roxie, the Data Delivery Engine, provides separate high-performance online query processing and data warehouse capabilities.
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
Hypertable is based on Google's Bigtable design, a proven scalable design that powers hundreds of Google services. Many of the current scalable NoSQL database offerings are based on a hash-table design, which means that the data they manage is not kept physically ordered. Hypertable keeps data physically sorted by a primary key, which makes it well suited for analytics.
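The difference is easy to see with a plain-Java analogy (this is not Hypertable's API; TreeMap merely stands in for the sorted, Bigtable-style layout and HashMap for the hash-table layout): physically ordered keys make a contiguous range scan cheap, while a hash layout forces a full filter pass.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class SortedVsHashed {
    public static void main(String[] args) {
        Map<String, String> hashed = new HashMap<>();      // hash-table design: keys scattered
        TreeMap<String, String> sorted = new TreeMap<>();  // Bigtable-style: keys kept in order
        for (String key : new String[] {"2023-01-05", "2023-01-02", "2023-03-09", "2023-01-17"}) {
            hashed.put(key, "row for " + key);
            sorted.put(key, "row for " + key);
        }
        // Range scan over January: trivial on the sorted map...
        System.out.println(sorted.subMap("2023-01-01", "2023-02-01").keySet());
        // ...but on the hashed map every entry must be examined and filtered.
        hashed.keySet().stream()
              .filter(k -> k.compareTo("2023-01-01") >= 0 && k.compareTo("2023-02-01") < 0)
              .forEach(System.out::println);
    }
}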
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization.
MPJ Express is an open source Java message passing library that lets application developers write and execute parallel applications for multicore processors and compute clusters/clouds, using an MPI-like API.
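A minimal "hello world" in that MPI-like style looks roughly as follows (a sketch: it assumes the mpj.jar library on the classpath and the standard mpjrun.sh launcher).

import mpi.MPI;

public class HelloMPJ {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);                    // start the MPJ Express runtime
        int rank = MPI.COMM_WORLD.Rank();  // this process's id within the world communicator
        int size = MPI.COMM_WORLD.Size();  // total number of cooperating processes
        System.out.println("Hello from process " + rank + " of " + size);
        MPI.Finalize();                    // shut the runtime down cleanly
    }
}

Launched, for example, with mpjrun.sh -np 4 HelloMPJ, each of the four processes prints its own rank.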
jBNC is a Java toolkit for training, testing, and applying Bayesian Network Classifiers. Implemented classifiers have been shown to perform well in a variety of artificial intelligence, machine learning, and data mining applications.
The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. Cassandra is suitable for applications that can't afford to lose data. Data is automatically replicated to multiple nodes for fault-tolerance.
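A minimal sketch of using Cassandra from Java follows, written against the (separate, later) DataStax Java driver 3.x, which is an assumption here, as are the 127.0.0.1 contact point and the demo keyspace; the replication factor of 3 illustrates the automatic multi-node replication described above.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraSketch {
    public static void main(String[] args) {
        // Connect to a node running locally; the driver discovers the rest of the cluster.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Replication factor 3: every row is stored on three nodes for fault-tolerance.
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                    + "(id int PRIMARY KEY, name text)");
            session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Ada')");
            ResultSet rs = session.execute("SELECT name FROM demo.users WHERE id = 1");
            for (Row row : rs) System.out.println(row.getString("name"));
        }
    }
}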