RGIFE User Guide
Overview
RGIFE is a feature reduction heuristic for the identification of small panels of highly predictive biomarkers. The heuristic is based on an iterative reduction paradigm: first it trains a classifier, then it ranks the attributes based on their importance and finally it removes attributes in blocks. RGIFE is designed to work with large-scale datasets and identifies reduced sets of attributes with high classification power.
Installation
System requirements
RGIFE is written in Python. To run RGIFE the following libraries need to be installed:
- NumPy (BSD Licence)
- SciPy (BSD Licence)
- Scikit-learn (BSD Licence)
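All three are available from PyPI; for example, they can typically be installed with pip:
pip install numpy scipy scikit-learn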
Download
Before you start, you need to download the RGIFE source code.
Basic configuration
In the RGIFE folder there is an example configuration file, configuration.conf:
[parameters]
block_type = RBS
validation = 10CV
cv_schema = DB_SCV
repetitions = 1
different_folds = no
tolerance_samples = 1
metric = accuracy
trees = 3000
max_depth = 5
cs_rf = yes
misclassification_cost = 1,1
missing_values = no
categorical_attributes = no
RGIFE parameters
- block_type: (RBS, ABS). RBS indicates a relative block size, where the number of removed attributes is proportional to the current size (number of attributes) of the dataset being analysed. If ABS is used, the block size is computed from the original number of attributes. Default value: RBS.
- validation: (10CV, LOOCV). The validation schema: either 10-fold cross-validation or leave-one-out. Default value: 10CV.
- cv_schema: (SCV, DB_SCV). SCV indicates a standard stratified cross-validation. DB_SCV implements the distribution-balanced SCV presented in Zeng2000. Default value: DB_SCV.
- repetitions: (int value). The number of repetitions of the validation schema. Default value: 1.
- different_folds: (yes, no). Re-generate the folds of the validation schema at every repetition. Default value: no.
- tolerance_samples: (int value). The number of additional misclassified samples (with respect to the reference iteration) used to identify a soft fail. Default value: 1.
- metric: (accuracy, robust_accuracy, gmean, fscore, auc, overall_auc). Metric used to assess the performance of the classifier. accuracy averages the fraction of correctly classified samples across the folds, while robust_accuracy divides the overall number of correctly classified samples by the total number of samples (see the sketch after this list). auc computes the AUC for each fold and returns the average value, while overall_auc computes a single AUC considering the predicted probabilities of the whole set of test samples. Default value: accuracy.
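To illustrate the difference between accuracy and robust_accuracy, here is a minimal Python sketch with hypothetical fold counts (not RGIFE's internal code):

import numpy as np

# Hypothetical per-fold results: with unevenly sized folds the two
# accuracy variants can disagree.
fold_correct = np.array([5, 5, 4])  # correctly classified samples per fold
fold_sizes = np.array([6, 6, 4])    # total samples per fold

accuracy = np.mean(fold_correct / fold_sizes)            # mean of per-fold accuracies: 0.889
robust_accuracy = fold_correct.sum() / fold_sizes.sum()  # overall fraction correct: 14/16 = 0.875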
Random forest parameters
- trees: (int value). The number of trees to be used by the random forest. Default value: 3000.
- max_depth: (int value). The maximum depth of the trees used in the random forest. If None, then nodes are expanded until all leaves are pure. Default value: None.
- cs_rf: (yes, no). Cost sensitive learning for the random forest. Default value: no.
- misclassification_cost: (comma-separated int values). Cost for the misclassification of each class; a value for each class is required. This parameter is considered only if cs_rf = yes (see the sketch after this list). Default value: 1,1.
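Since RGIFE depends on scikit-learn, a plausible (but assumed, not confirmed) mapping of these parameters onto scikit-learn's RandomForestClassifier looks like this, with the misclassification costs expressed as class weights:

from sklearn.ensemble import RandomForestClassifier

# Hedged sketch: the exact mapping used internally by RGIFE may differ.
costs = {0: 1, 1: 2}  # e.g. misclassification_cost = 1,2 (hypothetical values)
clf = RandomForestClassifier(
    n_estimators=3000,   # trees = 3000
    max_depth=5,         # max_depth = 5
    class_weight=costs,  # cs_rf = yes: errors on class 1 weigh twice as much
)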
Pre-processing parameters
- categorical_attributes: (yes, no). The dataset contains categorical values. If yes, all the categorical attributes are binarised. Default value: no.
- ordinal_attributes: (text file). The categorical attributes to be considered as ordinal and not binarised by RGIFE.
- missing_values: (yes, no). The dataset contains missing values. If yes, the missing values are imputed using the mean value across samples (see the sketch after this list). Default value: no.
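For reference, the two transformations described above can be reproduced with standard scikit-learn tools; this is an illustrative sketch, not the code RGIFE runs internally:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# missing_values = yes: impute each missing entry with the mean of its
# attribute computed across samples.
X = np.array([[1.0, np.nan], [2.0, 4.0], [3.0, 8.0]])
X_imputed = SimpleImputer(strategy='mean').fit_transform(X)

# categorical_attributes = yes: binarise categorical attributes.
cats = np.array([['red'], ['green'], ['red']])
cats_binarised = OneHotEncoder().fit_transform(cats).toarray()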
Running RGIFE
To apply the RGIFE heuristic run the following script:
./rgife.py <configuration> <dataset>
The script requires only two parameters:
- configuration - configuration file, see Configuration
- dataset - biological data in ARFF format (a minimal example is shown after this list)
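A minimal, hypothetical ARFF file (two numeric attributes plus a class label) looks like this:

@relation example
@attribute gene1 numeric
@attribute gene2 numeric
@attribute class {positive,negative}
@data
0.52,1.31,positive
0.11,0.87,negative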
Example
The data directory contains a diffuse large B-cell lymphoma dataset in ARFF format (from Shipp2002). To identify a reduced set of biomarkers from this dataset run:
./rgife.py configuration.conf lymphoma.arff
The initial output lines show a summary of the analysed dataset along with configuration values:
Random Seed: 38606434
Configuration:
Dataset: dlbcl
Num Atts: 7129
Num Samples: 58
Tolerance value: 0.0172414
Missing values: no
Categorical attributes: no
Classification cost: no
Cost of class 0 : 1
Cost of class 1 : 1
Block type: RBS
Validation: 10CV
Repetitions of CV: 1
Different folds: no
Performance metric: accuracy
The initial (reference) performance is then calculated. For each iteration the software shows all the performance metrics (only the specified one is used to determine the success of an iteration). The average specificity/sensitivity and the confusion matrix entries are calculated across repetitions.
RGIFE uses two different metrics:
- Best: the absolute best performance obtained so far. It is used to accept/reject a soft fail.
- Reference: it is used to accept/reject the iterations. If a soft fail is accepted, the reference metric becomes the same as in the soft-fail iteration (i.e. best metric - tolerance value).
The two metrics differ by at most the tolerance value.
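A simplified sketch of the accept/reject logic, as we read it from the description above (not RGIFE's actual source):

def evaluate_iteration(metric, reference, best, tolerance):
    # tolerance = tolerance_samples / num_samples (e.g. 1/58 = 0.0172414)
    if metric >= reference:
        return 'success'    # accepted: the reference is updated
    if metric >= best - tolerance:
        return 'soft fail'  # accepted: reference becomes best - tolerance
    return 'fail'           # rejected: revert to the reference iteration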
=== Initial Iteration ===
Confusion Matrix Row real class, Column predicted class (avg. across repetitions)
24.00 8.00
17.00 9.00
Avg. Specificity: 0.750
Avg. Sensitivity: 0.346
== Metrics ==
auc of iteration 0 is 0.5666667
overall_auc of iteration 0 is 0.5132212
robust_accuracy of iteration 0 is 0.5689655
gmean of iteration 0 is 0.4431217
fscore of iteration 0 is 0.5274603
accuracy of iteration 0 is 0.5766667
=============
Initial reference/best accuracy is 0.5766667
Afterwards RGIFE starts its iterative reduction process. The information for each iteration is shown as output along with the action taken by RGIFE. The variable Starting index indicates the position in the attribute ranking from which attributes were removed (see the sketch after the example trace below).
============================
Actual Iteration 1
Reference Iteration 0
Best Iteration 0
The best accuracy is 0.5766667
The reference accuracy is 0.5766667
The block_size is 1782
The block_ratio is 0.25
Starting index 0
Atts of reference dataset 7129
Confusion Matrix Row real class, Column predicted class (avg. across repetitions)
25.00 7.00
16.00 10.00
Avg. Specificity: 0.781
Avg. Sensitivity: 0.385
== Metrics ==
auc of iteration 1 is 0.6083333
overall_auc of iteration 1 is 0.5709135
robust_accuracy of iteration 1 is 0.6034483
gmean of iteration 1 is 0.4839465
fscore of iteration 1 is 0.5661508
accuracy of iteration 1 is 0.6133333
=============================
On iteration 1: The accuracy is better than the BEST! The accuracy is better than the reference accuracy!
The iteration 1 took 0.553522400061 minutes
=============================
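To make the trace fields concrete, here is a hedged, toy reading of them (hypothetical values and names; RGIFE's actual source may differ):

block_ratio = 0.25
num_atts = 7129
block_size = int(block_ratio * num_atts)  # 1782, matching the trace above

# Starting index: assumed offset into the importance ranking from which
# the block of attributes is removed (0 = the first candidate block).
ranking = list(range(num_atts))           # toy ranking of attribute ids
starting_index = 0
removed = ranking[starting_index:starting_index + block_size]
kept = ranking[:starting_index] + ranking[starting_index + block_size:]  # 5347 attributes remain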
When the block_size is lower than 1, the stopping condition is met. Once the iterative process stops, RGIFE prints a summary of its execution.
=============================
The accuracy is worse than the reference accuracy!
Consecutive Failures 6
Checking previous iterations . . .
Finish Condition
Summary:
The initial accuracy was: 0.6185714
The best accuracy was on iteration 47: 0.9490476 with 7 attributes
The final reference accuracy was on iteration 47: 0.9490476 with 7 attributes
Final Confusion Matrix. Row real class, Column predicted class
30.00 2.00
1.00 25.00
Best Spec: 0.938
Best Sens: 0.962
Final Reference Confusion Matrix. Row real class, Column predicted class
30.00 2.00
1.00 25.00
Reference Spec: 0.938
Reference Sens: 0.962
RGIFE took 16.3746276816 minutes
RGIFE generates two folders as output, BestIteration and ReferenceIteration, containing the reduced datasets with the attributes selected by the two iterations. The iterations.tar.xz archive instead contains all the intermediate data used during the reduction process.
RGIFE models
Multiple runs of RGIFE might identify different models with similar performance due to the stochastic nature of the heuristic. Three different policies can be used to select the final model from the output of multiple executions (illustrated in the sketch after this list):
- Min: select the model having the smallest number of attributes
- Max: select the model having the largest number of attributes
- Union: the final model is the union of the models generated across different executions
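For intuition, a minimal Python sketch of the three policies applied to hypothetical attribute sets (policies.py, described below, implements them over RGIFE's actual output folders):

# Attribute sets selected by three hypothetical RGIFE runs.
models = [{'g1', 'g2'}, {'g2', 'g3'}, {'g2'}]

min_model = min(models, key=len)    # Min policy: {'g2'}
max_model = max(models, key=len)    # Max policy: {'g1', 'g2'}
union_model = set().union(*models)  # Union policy: {'g1', 'g2', 'g3'}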
In the RGIFE folder there is a script called policies.py that generates, from multiple executions of RGIFE, the models using the three different policies:
./policies.py <path_to_results> <RGIFE_executions>
The script requires two parameters:
- path_to_results - the folder in which the results from multiple executions of RGIFE are stored. The subfolders containing the single RGIFE results need to be named Run1, Run2, ... etc.
- RGIFE_executions - the number of RGIFE executions (the same value used for the last Run folder)
If we run RGIFE 5 times and collect the results in the RGIFE_experiment folder:
./policies.py RGIFE_experiment 5
Where the RGIFE_experiment folder is structured as:
/RGIFE_experiment
/Run1
/BestIteration
/ReferenceIteration
/Run2
/Run3
/Run4
/Run5
Contact
If you have any further questions or comments about RGIFE or this tutorial in particular, please contact us at jaume.bacardit.