RGIFE User Guide
RGIFE is a feature reduction heuristic for the identification of small panels of highly predictive biomarkers. The heuristic is based on an iterative reduction paradigm: first it trains a classifier, then it ranks the attributes by their importance, and finally it removes a block of attributes. RGIFE is designed to work with large-scale datasets and identifies reduced sets of attributes with high classification power.
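The iterative reduction paradigm can be sketched as follows. This is a hypothetical simplification, not RGIFE's actual implementation: `rank_attributes` stands in for the random forest importance ranking, and the accept/reject evaluation performed at every real RGIFE iteration is omitted.

```python
# Hypothetical sketch of the train/rank/remove loop (not RGIFE's actual code).

def rank_attributes(dataset, attributes):
    """Return the attributes sorted from least to most important.
    Placeholder: RGIFE derives this ranking from random forest importances."""
    return sorted(attributes)

def reduce_attributes(dataset, attributes, block_ratio=0.25):
    """Iteratively drop a block of the lowest-ranked attributes."""
    while True:
        block_size = int(len(attributes) * block_ratio)
        if block_size < 1:  # stopping condition: the block is smaller than one attribute
            break
        ranked = rank_attributes(dataset, attributes)
        attributes = ranked[block_size:]  # remove the block of least important attributes
    return attributes

print(len(reduce_attributes(None, list(range(100)))))  # a handful of attributes remain
```

In the real heuristic each removal is validated against the reference performance and can be reverted; this sketch only illustrates the shrinking block sizes and the stopping condition.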
RGIFE is written in Python. To run RGIFE the following libraries need to be installed:
Before you start, you need to download RGIFE source code.
In the RGIFE folder there is an example configuration file:

```
[parameters]
block_type = RBS
validation = 10CV
cv_schema = DB_SCV
repetitions = 1
different_folds = no
tolerance_samples = 1
metric = accuracy
trees = 3000
max_depth = 5
cs_rf = yes
misclassification_cost = 1,1
missing_values = no
categorical_attributes = no
```
- block_type: (RBS, ABS). RBS (relative block size) removes a number of attributes relative to the size (number of attributes) of the dataset being analysed at the current iteration. With ABS (absolute block size) the block size is relative to the original number of attributes. Default value: RBS.
- validation: (10CV, LOOCV). The validation schema: 10-fold cross-validation or leave-one-out. Default value: 10CV.
- cv_schema: (SCV, DB_SCV). SCV indicates a standard stratified cross-validation. DB_SCV implements the distribution-balanced SCV presented in Zeng2000. Default value: DB_SCV.
- repetitions: (int value). The number of repetitions of the validation schema. Default value: 1.
- different_folds: (yes, no). Re-generate the folds for the validation schema every repetition. Default value: no.
- tolerance_samples: (int value). Number of misclassified samples (more than the reference iteration) to identify a soft fail. Default value: 1.
- metric: (accuracy, robust_accuracy, gmean, fscore, auc, overall_auc). Metric to assess the performance of the classifier. accuracy averages the fraction of correctly classified samples across the folds, while robust_accuracy divides the overall number of correctly classified samples by the total number of samples. auc computes the AUC for each fold and returns the average value, while overall_auc computes a single AUC considering the predicted probabilities of the whole set of test samples. Default value: accuracy.
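To make the difference between accuracy and robust_accuracy concrete, here is a minimal sketch with hypothetical fold results (the fold data and variable names are illustrative, not taken from RGIFE):

```python
# Hypothetical fold results: (correctly classified samples, fold size).
folds = [(9, 10), (9, 10), (2, 5)]

# accuracy: average of the per-fold accuracies.
accuracy = sum(c / n for c, n in folds) / len(folds)

# robust_accuracy: overall correctly classified samples / total samples.
robust_accuracy = sum(c for c, _ in folds) / sum(n for _, n in folds)

print(round(accuracy, 4))         # 0.7333
print(round(robust_accuracy, 4))  # 0.8
```

The two values diverge when folds have unequal sizes, as in leave-one-out or unstratified splits.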
Random forest parameters
- trees: (int value). The number of trees to be used by the random forest. Default value: 3000.
- max_depth: (int value). The maximum depth of the trees used in the random forest. If None, then nodes are expanded until all leaves are pure. Default value: None.
- cs_rf: (yes, no). Cost sensitive learning for the random forest. Default value: no.
- misclassification_cost: (int values comma separated). Cost for the misclassification of each class. A value for each class is required. This parameter is considered only if cs_rf = yes. Default value: 1,1.
- categorical_attributes: (yes, no). The dataset contains categorical values. If yes, all the categorical attributes are binarised. Default value: no.
- ordinal_attributes: (text file). The categorical attributes to be considered as ordinal and not binarised by RGIFE.
- missing_values: (yes, no). The dataset contains missing values. If yes, the missing values are imputed using the mean values across samples. Default value: no.
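As a rough illustration of the mean imputation described above (a hypothetical sketch, not RGIFE's code), each missing value is replaced by the mean of its attribute computed across the samples where that attribute is observed:

```python
# Hypothetical sketch of mean imputation (not RGIFE's code).
# A missing value (None) is replaced by the mean of its attribute,
# computed across the samples where the attribute is observed.

data = [
    [1.0, 4.0],
    [None, 6.0],
    [3.0, None],
]

for j in range(len(data[0])):
    observed = [row[j] for row in data if row[j] is not None]
    mean = sum(observed) / len(observed)
    for row in data:
        if row[j] is None:
            row[j] = mean

print(data)  # [[1.0, 4.0], [2.0, 6.0], [3.0, 5.0]]
```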
To apply the RGIFE heuristic run the following script:
```
./rgife.py <configuration> <dataset>
```
The script requires only two parameters:

- configuration - the configuration file (e.g. configuration.conf shown above)
- dataset - the dataset to be analysed, in ARFF format

The data directory contains a diffuse large B-cell lymphoma dataset in ARFF format (from Shipp2002). To identify a set of reduced biomarkers from this dataset run:
```
./rgife.py configuration.conf lymphoma.arff
```
The initial output lines show a summary of the analysed dataset along with configuration values:
```
Random Seed: 38606434
Configuration:
Dataset: dlbcl
Num Atts: 7129
Num Samples: 58
Tolerance value: 0.0172414
Missing values: no
Categorical attributes: no
Classification cost: no
Cost of class 0 : 1
Cost of class 1 : 1
Block type: RBS
Validation: 10CV
Repetitions of CV: 1
Different folds: no
Performance metric: accuracy
```
The initial (reference) performance is then calculated. For each iteration the software shows all the performance metrics (only the specified one is used to determine the success of an iteration). The average specificity/sensitivity and the confusion matrix entries are calculated across repetitions.
RGIFE uses two different metrics:
- Best: the absolute best performance obtained so far. It's used to accept/reject a soft fail.
- Reference: it's used to accept/reject the iterations. If a soft fail is accepted, the reference metric will be the same as in the soft-fail iteration (i.e. best metric - tolerance value).

The two metrics differ by at most the tolerance value.
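The interplay between the best and reference metrics can be sketched as follows (hypothetical logic reconstructed from the description above; the function and outcome labels are illustrative, not RGIFE's actual code):

```python
def classify_iteration(performance, reference, best, tolerance):
    """Hypothetical sketch of the accept/reject logic described above."""
    if performance >= reference:
        return "success"      # the attribute removal is accepted
    if performance >= best - tolerance:
        return "soft fail"    # may be accepted; reference drops to best - tolerance
    return "fail"             # the removal is reverted

print(classify_iteration(0.60, 0.58, 0.60, 0.02))  # success
print(classify_iteration(0.59, 0.60, 0.60, 0.02))  # soft fail
print(classify_iteration(0.50, 0.60, 0.60, 0.02))  # fail
```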
```
=== Initial Iteration ===
Confusion Matrix
Row real class, Column predicted class (avg. across repetitions)
24.00 8.00
17.00 9.00
Avg. Specificity: 0.750
Avg. Sensitivity: 0.346
== Metrics ==
auc of iteration 0 is 0.5666667
overall_auc of iteration 0 is 0.5132212
robust_accuracy of iteration 0 is 0.5689655
gmean of iteration 0 is 0.4431217
fscore of iteration 0 is 0.5274603
accuracy of iteration 0 is 0.5766667
=============
Initial reference/best accuracy is 0.5766667
```
Afterwards RGIFE starts its iterative reduction process. The information for each iteration is shown as output along with the action taken by RGIFE. The variable Starting index indicates from which position of the attribute ranking the attributes were removed.
```
============================
Actual Iteration 1
Reference Iteration 0
Best Iteration 0
The best accuracy is 0.5766667
The reference accuracy is 0.5766667
The block_size is 1782
The block_ratio is 0.25
Starting index 0
Atts of reference dataset 7129
Confusion Matrix
Row real class, Column predicted class (avg. across repetitions)
25.00 7.00
16.00 10.00
Avg. Specificity: 0.781
Avg. Sensitivity: 0.385
== Metrics ==
auc of iteration 1 is 0.6083333
overall_auc of iteration 1 is 0.5709135
robust_accuracy of iteration 1 is 0.6034483
gmean of iteration 1 is 0.4839465
fscore of iteration 1 is 0.5661508
accuracy of iteration 1 is 0.6133333
=============================
On iteration 1: The accuracy is better than the BEST!
The accuracy is better than the reference accuracy!
The iteration 1 took 0.553522400061 minutes
=============================
```
When the block_size is lower than 1, the stopping condition is met. Once the iterative process stops, RGIFE prints a summary of its execution.
```
=============================
The accuracy is worse than the reference accuracy!
Consecutive Failures 6
Checking previous iterations . . .
Finish Condition
Summary:
The initial accuracy was: 0.6185714
The best accuracy was on iteration 47: 0.9490476 with 7 attributes
The final reference accuracy was on iteration 47: 0.9490476 with 7 attributes
Final Confusion Matrix. Row real class, Column predicted class
30.00 2.00
1.00 25.00
Best Spec: 0.938
Best Sens: 0.962
Final Reference Confusion Matrix. Row real class, Column predicted class
30.00 2.00
1.00 25.00
Reference Spec: 0.938
Reference Sens: 0.962
RGIFE took 16.3746276816 minutes
```
RGIFE generates two folders as output: BestIteration and ReferenceIteration, which contain the reduced datasets with the attributes selected by the two iterations. The iterations.tar.xz archive instead contains all the intermediate data used during the reduction process.
Multiple runs of RGIFE might identify different models with similar performances due to the stochastic nature of the heuristic. Three different policies can be used to select the final model from the output of multiple executions:
- Min: select the model having the smallest number of attributes
- Max: select the model having the largest number of attributes
- Union: the final model is the union of the models generated across different executions
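The three policies are straightforward set operations over the selected attribute panels. A minimal sketch with hypothetical gene names (not the actual policies.py implementation):

```python
# Hypothetical models from three RGIFE executions (illustrative gene names).
models = [
    {"gene_a", "gene_b"},
    {"gene_a", "gene_c", "gene_d"},
    {"gene_b"},
]

min_model = min(models, key=len)    # Min policy: smallest panel
max_model = max(models, key=len)    # Max policy: largest panel
union_model = set().union(*models)  # Union policy: union across executions

print(sorted(min_model))    # ['gene_b']
print(sorted(max_model))    # ['gene_a', 'gene_c', 'gene_d']
print(sorted(union_model))  # ['gene_a', 'gene_b', 'gene_c', 'gene_d']
```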
In the RGIFE folder there is a script called policies.py that generates, from multiple executions of RGIFE, the models using the three different policies.
```
./policies.py <path_to_results> <RGIFE_executions>
```
The script requires two parameters:
- path_to_results - the folders in which the results from multiple executions of RGIFE are stored. The folders containing the single results of RGIFE need to be named Run1, Run2, ... etc.
- RGIFE_executions - number of RGIFE executions (same value used for the last Run folder)
If we run RGIFE 5 times and collect the results in the RGIFE_experiment folder:

```
./policies.py RGIFE_experiment 5
```
Where the RGIFE_experiment folder is structured as:
```
/RGIFE_experiment
    /Run1
        /BestIteration
        /ReferenceIteration
    /Run2
    /Run3
    /Run4
    /Run5
```
If you have any further questions or comments about RGIFE or this tutorial in particular, please contact us at jaume.bacardit.