Parf - Parallel Random Forest Algorithm


The Random Forests algorithm is one of the best known classification algorithms, able to classify large quantities of data with great accuracy. It is also inherently parallelisable.

The algorithm was originally written in Fortran 77, which is obsolete and lacks many of the capabilities of modern programming languages. The original code is also not an example of "clear" programming, so it is very hard to use in education. Within this project the program is adapted to Fortran 90. In contrast to Fortran 77, Fortran 90 is a structured programming language, legible to researchers as well as to students.

The creator of the algorithm, Berkeley professor emeritus Leo Breiman, expressed great interest in this idea in our correspondence. He confirmed that no one had yet worked on a parallel implementation of his algorithm, and promised his support and help. Leo Breiman is one of the pioneers in the fields of machine learning and data mining, and a co-author of the first significant programs in that field (CART, Classification and Regression Trees).

The most up-to-date version of PARF's source code can be checked out by SVN (go to the Source tab). Snapshots of the SVN repository are created occasionally and are available from the Downloads tab for convenience. To make an executable, a Fortran 90 compiler is required. The currently supported compilers are Intel Fortran (free on Linux for academic use), Portland Group Fortran (commercial) and GNU g95 (free).

For an efficient Java implementation of the Random Forest algorithm that integrates into the Weka environment, see FastRandomForest.

RF and Random Forests are registered trademarks of Leo Breiman and Adele Cutler.

PARF was developed in the Centre for Informatics and Computing and the Division of Electronics of the Rudjer Boskovic Institute, with the financial support of the Ministry of Science, Technology and Sports of Croatia, i-Project 2004-111.
Authors: Goran Topić and Tomislav Šmuc.
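
Why the algorithm is inherently parallelisable can be illustrated with a short sketch: every tree in the forest is trained on its own bootstrap sample, independently of the others, so the trees can be built concurrently and combined only at prediction time by majority vote. The following is not PARF's code. It is a minimal Python illustration that uses one-split trees (stumps) instead of full decision trees, and the names train_stump, train_forest and predict are hypothetical.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(data, seed):
    """Fit a one-split tree (a stump) on a bootstrap sample of `data`.

    `data` is a list of (features, label) pairs. A stump stands in for a
    full decision tree here purely to keep the sketch short.
    """
    rng = random.Random(seed)                    # per-tree RNG, so trees are independent
    sample = [rng.choice(data) for _ in data]    # bootstrap: draw with replacement
    feature = rng.randrange(len(data[0][0]))     # random feature choice
    best = None
    for x, _ in sample:
        threshold = x[feature]
        left = [y for xs, y in sample if xs[feature] <= threshold]
        right = [y for xs, y in sample if xs[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def train_forest(data, n_trees=25, workers=4):
    """Train `n_trees` stumps concurrently; no tree depends on another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda seed: train_stump(data, seed), range(n_trees)))


def predict(forest, x):
    """Classify `x` by majority vote over all trees in the forest."""
    votes = Counter(left if x[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data set (an assumption for illustration): label 1 when the value is >= 10.
    toy = [((float(i), float(i)), 0 if i < 10 else 1) for i in range(20)]
    forest = train_forest(toy)
    print(predict(forest, (3.0, 3.0)), predict(forest, (15.0, 15.0)))
```

A thread pool is used only to keep the sketch self-contained. In CPython, threads do not speed up CPU-bound work, so a real implementation would distribute the trees across processes or cluster nodes instead.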




Related Projects

Apache Mahout - Scalable machine learning library

Apache Mahout has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

Scikit Learn - Machine Learning in Python

scikit-learn is a Python module for machine learning built on top of SciPy. It provides simple and efficient tools for data mining and data analysis, and supports classification, clustering, model selection, preprocessing and more.

MLlib - Apache Spark's scalable machine learning library

MLlib is a Spark implementation of some common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction and more.

HPCC System - Hadoop alternative

HPCC is a proven and battle-tested platform for manipulating, transforming, querying and warehousing Big Data. It supports two types of configuration. Thor is responsible for consuming vast amounts of data and for transforming, linking and indexing that data. It functions as a distributed file system with parallel processing power spread across the nodes. Roxie, the Data Delivery Engine, provides separate high-performance online query processing and data warehouse capabilities.

Hadoop Common

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. Hadoop Common supports the other Hadoop subprojects.

R Language - Project for Statistical Computing

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.

Hypertable - A high performance, scalable, distributed storage and processing system for structured

Hypertable is based on Google's Bigtable design, a proven scalable design that powers hundreds of Google services. Many current scalable NoSQL databases are based on a hash table design, which means that the data they manage is not kept physically ordered. Hypertable keeps data physically sorted by primary key, which makes it well suited for analytics.


Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

Ganglia - scalable distributed monitoring system

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization.

MPJ Express - Parallel Programming in Java

MPJ Express is an open source Java message passing library that allows application developers to write and execute parallel applications for multicore processors and compute clusters/clouds. It allows writing parallel Java applications using an MPI-like API.