RGIFE User Guide

Overview

RGIFE is a feature reduction heuristic for the identification of small panels of highly predictive biomarkers. The heuristic is based on an iterative reduction paradigm: first it trains a classifier, then it ranks the attributes based on their importance and finally it removes attributes in blocks. RGIFE is designed to work with large-scale datasets and identifies reduced sets of attributes with high classification power.
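
Below is a minimal sketch of that iterative idea, assuming a scikit-learn random forest as the classifier and a fixed block ratio. It is only an illustration of the paradigm, not the actual RGIFE implementation (which, among other things, applies a tolerance on the performance and more sophisticated block handling); the function and variable names are hypothetical.

# Minimal sketch of an iterative "train -> rank -> remove a block" loop.
# Illustration only: it does not reproduce RGIFE's tolerance handling
# or block management.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def iterative_reduction(X, y, block_ratio=0.25):
    kept = np.arange(X.shape[1])                     # surviving attribute indices
    while True:
        rf = RandomForestClassifier(n_estimators=500, random_state=0)
        rf.fit(X[:, kept], y)
        order = np.argsort(rf.feature_importances_)  # least important first
        block = max(1, int(len(kept) * block_ratio))
        if block >= len(kept):
            break                                    # nothing sensible left to remove
        candidate = kept[order[block:]]              # drop the bottom block
        before = cross_val_score(rf, X[:, kept], y, cv=10).mean()
        after = cross_val_score(rf, X[:, candidate], y, cv=10).mean()
        if after >= before:                          # keep the reduction only if
            kept = candidate                         # performance does not degrade
        else:
            break
    return kept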

Installation

System requirements

RGIFE is written in Python. To run RGIFE the following libraries need to be installed:

Download

Before you start, you need to download the RGIFE source code.

Basic configuration

In the RGIFE folder there is an example configuration file - configuration.conf:

[parameters]
block_type = RBS
validation = 10CV
cv_schema = DB_SCV
repetitions = 1
different_folds = no
tolerance_samples = 1
metric = accuracy
trees = 3000
max_depth = 5
cs_rf = yes
misclassification_cost = 1,1
missing_values = no
categorical_attributes = no
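
The trees, max_depth, cs_rf and misclassification_cost entries configure the random forests built during the reduction. As a rough, non-authoritative illustration of what these settings correspond to in scikit-learn terms (RGIFE is written in Python, but the exact construction inside RGIFE may differ; the class labels 0 and 1 are an assumption):

# Illustrative mapping of the configuration above onto a scikit-learn
# random forest; the real RGIFE code may construct its classifier differently.
from sklearn.ensemble import RandomForestClassifier

trees = 3000                      # trees = 3000
max_depth = 5                     # max_depth = 5
misclassification_cost = (1, 1)   # misclassification_cost = 1,1

rf = RandomForestClassifier(
    n_estimators=trees,
    max_depth=max_depth,
    # cs_rf = yes: weight the classes by their misclassification costs
    # (assuming the two classes are labelled 0 and 1)
    class_weight={0: misclassification_cost[0], 1: misclassification_cost[1]},
)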

RGIFE parameters

Random forest parameters

Pre-processing parameters

Running RGIFE

To apply the RGIFE heuristic run the following script:

./rgife.py <configuration> <dataset>

The script requires only two parameters:

- <configuration>: the configuration file (e.g. configuration.conf)
- <dataset>: the dataset to analyse (the example below uses an ARFF file)

Example

The data directory contains a diffuse large B-cell lymphoma dataset in ARFF format (from Shipp et al., 2002). To identify a reduced set of biomarkers from this dataset, run:

./rgife.py configuration.conf lymphoma.arff

The initial output lines show a summary of the analysed dataset along with configuration values:

Random Seed: 38606434
Configuration:
Dataset: dlbcl
Num Atts: 7129
Num Samples: 58
Tolerance value: 0.0172414
Missing values: no
Categorical attributes: no
Classification cost: no
Cost of class 0 : 1
Cost of class 1 : 1
Block type: RBS
Validation: 10CV
Repetitions of CV: 1
Different folds: no
Performance metric: accuracy

The initial (reference) performance is then calculated. For each iteration the software shows all the performance metrics (only the specified one is used to determine the success of an iteration). The average specificity/sensitivity and the confusion matrix entries are calculated across repetitions.

RGIFE keeps track of two performance values during the reduction process:

- the reference accuracy: the performance of the iteration currently used as the reference for the reduction
- the best accuracy: the highest performance observed across the iterations so far

The two values differ by at most the tolerance value; in this example the tolerance (0.0172414) corresponds to tolerance_samples divided by the number of samples (1/58).
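
As an illustration only (the names and the exact update rule below are assumptions, not the actual RGIFE bookkeeping), the relation between the two values and the tolerance can be sketched as:

# Sketch of the best/reference/tolerance relation; illustration only,
# the names and the exact rule used inside RGIFE may differ.
tolerance = 1.0 / 58              # tolerance_samples = 1, 58 samples -> 0.0172414

def update_reference(current, best, reference, tol=tolerance):
    """Hypothetical update: a performance within `tol` of the best
    can still serve as the new reference."""
    best = max(best, current)
    if current >= best - tol:
        reference = current
    return best, reference

# Initial iteration of the example below: reference = best = 0.5766667.
print(update_reference(0.6133333, 0.5766667, 0.5766667))  # (0.6133333, 0.6133333)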

=== Initial Iteration ===
Confusion Matrix  Row real class, Column predicted class (avg. across repetitions)
24.00 8.00
17.00 9.00
Avg. Specificity: 0.750
Avg. Sensitivity: 0.346
== Metrics ==
auc of iteration 0 is 0.5666667
overall_auc of iteration 0 is 0.5132212
robust_accuracy of iteration 0 is 0.5689655
gmean of iteration 0 is 0.4431217
fscore of iteration 0 is 0.5274603
accuracy of iteration 0 is 0.5766667
=============
Initial reference/best accuracy is 0.5766667
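
The averaged specificity and sensitivity follow directly from the confusion matrix entries printed above; a quick check in Python:

# Recomputing the reported values from the averaged confusion matrix
# (row = real class, column = predicted class).
tn, fp = 24.0, 8.0    # class 0: 24 correctly predicted, 8 misclassified
fn, tp = 17.0, 9.0    # class 1: 17 misclassified, 9 correctly predicted

specificity = tn / (tn + fp)                              # 24 / 32 = 0.750
sensitivity = tp / (tp + fn)                              # 9 / 26  = 0.346
accuracy_from_matrix = (tn + tp) / (tn + fp + fn + tp)    # 33 / 58 = 0.5689655

print(specificity, sensitivity, accuracy_from_matrix)

The last value coincides with the reported robust_accuracy, which suggests that this metric is computed from the averaged confusion matrix rather than averaged across folds.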

Afterwards RGIFE starts its iterative reduction process. The information for each iteration is shown as output along with the action taken by RGIFE. The variable Starting index indicates which attributes were removed from the attribute ranking.

============================
Actual Iteration 1
Reference Iteration 0
Best Iteration 0
The best accuracy is 0.5766667
The reference accuracy is 0.5766667
The block_size is 1782
The block_ratio is 0.25
Starting index 0
Atts of reference dataset 7129
Confusion Matrix  Row real class, Column predicted class (avg. across repetitions)
25.00 7.00
16.00 10.00
Avg. Specificity: 0.781
Avg. Sensitivity: 0.385
== Metrics ==
auc of iteration 1 is 0.6083333
overall_auc of iteration 1 is 0.5709135
robust_accuracy of iteration 1 is 0.6034483
gmean of iteration 1 is 0.4839465
fscore of iteration 1 is 0.5661508
accuracy of iteration 1 is 0.6133333
=============================
On iteration 1: The accuracy is better than the BEST! The accuracy is better than the reference accuracy!
The iteration 1 took 0.553522400061 minutes
=============================
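
The block_size reported above is consistent with block_ratio multiplied by the number of attributes of the reference dataset (0.25 x 7129 ≈ 1782). A minimal sketch of how a block of attributes could be removed from the ranking, assuming the ranking goes from least to most important and using hypothetical names:

# Illustration of removing a block of attributes from an importance ranking;
# names and details are assumptions, not the actual RGIFE code.
num_atts = 7129
block_ratio = 0.25
block_size = int(num_atts * block_ratio)            # 1782, as reported above

# ranking: attribute indices ordered from least to most important (assumed)
ranking = list(range(num_atts))                     # placeholder ranking
starting_index = 0                                  # "Starting index" in the output
removed = ranking[starting_index:starting_index + block_size]
kept = ranking[:starting_index] + ranking[starting_index + block_size:]
print(len(removed), len(kept))                      # 1782 5347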

When the block_size becomes lower than 1, the stopping condition is met. Once the iterative process stops, RGIFE prints a summary of its execution.

=============================
The accuracy is worse than the reference accuracy!
Consecutive Failures 6
Checking previous iterations . . .
Finish Condition
Summary:
The initial accuracy was: 0.6185714
The best accuracy was on iteration 47: 0.9490476 with 7 attributes
The final reference accuracy was on iteration 47: 0.9490476 with 7 attributes
Final Confusion Matrix. Row real class, Column predicted class
30.00 2.00
1.00 25.00
Best Spec: 0.938
Best Sens: 0.962
Final Reference Confusion Matrix. Row real class, Column predicted class
30.00 2.00
1.00 25.00
Reference Spec: 0.938
Reference Sens: 0.962
RGIFE took 16.3746276816 minutes

RGIFE generates two folders as output, BestIteration and ReferenceIteration, which contain the reduced datasets with the attributes selected by the best and reference iterations. The iterations.tar.xz archive instead contains all the intermediate data used during the reduction process.
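
Assuming the reduced dataset inside BestIteration is stored in the same ARFF format as the input (the file name below is hypothetical), it can be loaded for further analysis with standard Python tools:

# Loading the reduced dataset produced by RGIFE; the file name is hypothetical,
# adapt it to the ARFF file found inside the output folder.
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("BestIteration/reduced_dataset.arff")
df = pd.DataFrame(data)
print(meta.names())     # the selected attributes plus the class attribute
print(df.shape)         # samples x (selected attributes + class)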

RGIFE models

Multiple runs of RGIFE might identify different models with similar performance due to the stochastic nature of the heuristic. Three different policies can be used to select the final model from the output of multiple executions:

In the RGIFE folder there is a script called policies.py that generates, from multiple executions of RGIFE, the models using the three different policies.

./policies.py <path_to_results> <RGIFE_executions>

The script requires two parameters:

- <path_to_results>: the path of the folder that contains the RGIFE results
- <RGIFE_executions>: the number of RGIFE executions (runs) to consider

If we run RGIFE 5 times and we collect the results in the RGIFE_experiment folder:

./policies.py RGIFE_experiment 5

Where the RGIFE_experiment folder is structured as:

/RGIFE_experiment
    /Run1
        /BestIteration
        /ReferenceIteration
    /Run2
    /Run3
    /Run4
    /Run5

Contact

If you have any further questions or comments about RGIFE or this tutorial in particular, please contact us at jaume.bacardit.