BioHEL User Guide

  1. Installation
  2. Configuration
  3. Running BioHEL

1. Installation

System requirements

To compile the BioHEL code you need g++ 4.4.x or newer. To run the CUDA version you also need a CUDA-enabled graphics card and the CUDA SDK 4.x installed. It is also possible to build BioHEL in serial mode by switching off a compiler flag (see the next section).

Download

Before you start, you need to download the BioHEL source code.

How to compile BioHEL?

To uncompress the files execute:

$ tar -zxvf BioHEL.tar.gz

Now you should see a folder called BioHEL-cuda in the directory where you decompressed the files. Enter this folder and run the following commands to compile the code.

$ cd BioHEL-cuda
$ touch .depend
$ make clean
$ make cuda

Now BioHEL is built in the directory where the files were decompressed. To make the biohel executable callable from anywhere in your current shell session, add that directory to your PATH:

$ export PATH=$PATH:$PWD

How to compile the serial version of BioHEL?

To compile the serial version of BioHEL, which runs on a standard CPU, remove the flag -DCUDA_COMPILED=1 from the CFLAGS variable in the Makefile. To do this, open the Makefile and modify the variable as follows:

CFLAGS=-O3 -march=nocona

Feel free to modify the -march flag to compile specifically for your architecture.
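If you prefer not to edit the Makefile by hand, the flag can also be stripped with a one-line sed. This is only a sketch: it assumes the flag appears in your Makefile exactly as -DCUDA_COMPILED=1, and it keeps a backup copy so the change can be undone.

```shell
# Remove the CUDA flag from CFLAGS in place, keeping Makefile.bak as a backup
sed -i.bak 's/ -DCUDA_COMPILED=1//' Makefile
```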

Afterwards, compile the code using make instead of make cuda as follows:

$ touch .depend
$ make clean
$ make

2. Configuration

Basic configuration

In the BioHEL-cuda folder there is an example configuration file called test1.conf

$ vi test1.conf

The content of the file is the following:

crossover operator 1px
default class major
fitness function mdl
initialization min classifiers 20
initialization max classifiers 20
iterations 50
mdl initial tl ratio 0.25
mdl iteration 10
mdl weight relax factor 0.90
pop size 500
prob crossover 0.6
prob individual mutation 0.6
prob one 0.75
selection algorithm tournamentwor
tournament size 4
windowing ilas 1
dump evolution stats
smart init
class wise init
coverage breakpoint 0.01
repetitions of rule learning 2
coverage ratio 0.90

kr hyperrect
num expressed attributes init 15
hyperrectangle uses list of attributes
prob generalize list 0.10
prob specialize list 0.10

expected number of attributes 10

device selected 0
device memory used 1.0
cuda enabled

random seed 0

To change the population size modify the following line (we suggest a population size of 500):

pop size 500

To adjust the probability of crossover modify the following line:

prob crossover 0.6

To use a specific random seed modify the following line:

random seed 367364

Or, to avoid using a predefined seed, erase or comment out the line as follows:

#random seed 0

To change the number of iterations of each GA run modify the following line (we suggest between 20 and 100 iterations):

iterations 50
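Since the configuration file is plain text, variants of it are easy to generate with standard tools. As an illustration (the file names and seed values here are hypothetical), a sweep over several random seeds could be scripted as:

```shell
# Generate one configuration file per seed from the base file test1.conf
for seed in 101 202 303; do
  sed "s/^random seed .*/random seed $seed/" test1.conf > "test1_seed${seed}.conf"
done
```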

CUDA related configuration

If there is more than one CUDA-enabled device in your computer, you can select which one to use by modifying the following line:

device selected 0

If you wish the system to automatically pick the GPU with the most global memory, use:

device selected -1

It is also possible to set the fraction of the GPU's global memory to use in your experiments through the following line:

device memory used 1.0

For a device that is also being used as the primary device for graphical purposes we recommend using at most 90% of the memory (a value of 0.9). If you are using the serial version of the algorithm, these parameters are simply ignored by the system.

Default class

BioHEL has an explicit default class mechanism that allows the system to generate a final rule covering all the examples not covered by the rules already learnt. This mechanism reduces the complexity of the generated rule sets, because the system only has to learn rules for n - 1 classes. To change the default class settings modify the following line:

default class major

There are three possible options for the default class:

If the fixed option is used, it is necessary to add one line to the configuration file specifying the index of the selected class, as follows:

default class fixed
fixed default class 0

ILAS windowing scheme

BioHEL provides a windowing mechanism called Incremental Learning with Alternating Strata (ILAS). This mechanism partitions the training set into non-overlapping strata, which are then used one at a time in each iteration of the GA, following a simple round-robin policy. Windowing speeds up the learning process because each iteration uses only a sample of the training set instead of all the instances. However, with a very large number of windows the sample in each window becomes very small and may no longer be a good representation of the problem we want to solve.

To set up the number of windows to use modify the following line:

windowing ilas 20

Windowing is not advised for small problems (fewer than 10,000 instances). For more information about the advantages and disadvantages of the ILAS windowing scheme please see Bacardit2005.
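A quick sanity check before choosing the number of windows is the resulting sample size per stratum. For instance, for a hypothetical training set of 10,000 instances split into 20 windows:

```shell
# Each ILAS stratum holds roughly instances/windows examples
awk 'BEGIN { instances = 10000; windows = 20; printf "%d instances per window\n", instances / windows }'
# prints: 500 instances per window
```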

Fitness function

BioHEL's fitness function is based on the Minimum Description Length (MDL) principle and tries to evolve rules that are accurate but at the same time have high coverage. The fitness function has several parameters that dramatically change its behaviour.

The fitness function is divided into two terms: the theory length (TL) and the exceptions length (EL). The exceptions length term weighs together the accuracy and the coverage of the rules.

`F = TL times W + EL`

`EL = 2 - text(ACC)(R) - text(COV)(R)`

`text(ACC)(R) = (text(correctly_classified)(R)) / (text(matched)(R))`

`text(COV)(R) = {(CR times (RC(R))/(CB), if RC(R) < CB), (CR + (1 - CR) times (RC(R) - CB)/(1 - CB), if RC(R) >= CB):}`

`RC(R) = (text(matched)(R)) / (|T|)`

ACC(R) corresponds to the accuracy of the rule and COV(R) to a measure of the goodness of a rule according to the minimum fraction of examples a rule should cover. The value of COV(R) varies drastically depending on RC(R) (the fraction of training examples the rule covers) and CB (the fraction of examples any rule should cover to be considered a "good rule"). This last value, CB, corresponds to the parameter also known as the coverage breakpoint. Rules that do not cover this minimal amount of examples are penalised.
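As a worked example of the coverage term, take CR = 0.90 and CB = 0.01 from the test1.conf file above, and a rule covering RC = 0.0625 of the examples (the value seen in the sample run later in this guide). Since RC >= CB, the second branch of the formula applies:

```shell
# Evaluate COV(R) for CR=0.90, CB=0.01, RC=0.0625 (second branch: RC >= CB)
awk 'BEGIN {
  CR = 0.90; CB = 0.01; RC = 0.0625
  if (RC < CB) cov = CR * RC / CB
  else         cov = CR + (1 - CR) * (RC - CB) / (1 - CB)
  printf "COV = %f\n", cov
}'
# prints: COV = 0.905303
```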

To modify the coverage breakpoint in BioHEL modify the following line:

coverage breakpoint 0.01

Different problems call for different coverage breakpoints. Problems with many instances tend to use a CB in the range [0.001, 0.1], while small problems benefit from a CB in the range [0.01, 0.25]. Preliminary experimentation is advised to select the best value of the coverage breakpoint for a given problem.

Moreover, the coverage ratio (CR) parameter indicates how much fitness should be given to a rule that covers the optimal percentage of examples. For synthetic problems or problems without noise this parameter should be set to 1.0, while for noisy problems it should be adjusted to the level of permitted noise. To modify this parameter alter the following line in the configuration file:

coverage ratio 0.90

You can find more information on setting the coverage parameters in Franco2010.

3. Running BioHEL

The syntax to execute BioHEL is very simple: just specify the configuration file to use, followed by the train and test files in WEKA format.

$ biohel <conf file> <train file> <test file>

Input format

The datasets BioHEL accepts are in WEKA format. BioHEL requires both the train set and the test set to be passed as parameters.
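The WEKA (ARFF) format is plain text: a header declaring the relation and its attributes, followed by the data rows. A minimal sketch (the relation and attribute names here are illustrative, not tied to any particular benchmark):

```
@relation example
@attribute att0 {0,1}
@attribute att1 {0,1}
@attribute class {0,1}
@data
0,1,1
1,0,0
```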

Examples of datasets that BioHEL can use can be found in PSP Benchmarks.

Output format

BioHEL's output is separated into three stages: initialization, learning and results.

Initialization stage

The initialization stage shows the parameter configuration used for the run followed by the problem characteristics. The initialization stage looks as follows:

One Point Crossover
Majoritarian class will be default
MDL fitness function
Minumum number of classifiers per individual in initialization:20.000000
Maximum number of classifiers per individual in initialization:20.000000
GA Iterations:5.000000
Initial theory length proportion in MDL formula: 0.250000
Iteracio activacio MDL 10
MDL Weight relax factor 0.900000
Popsize: 500.000000
Crossover probability: 0.600000
Individual-wise mutation probability:0.600000
Probability of value ONE for GABIL and ADI KR:0.750000
Tournament Selection without replacement Algorithm
Tournament size:4.000000
ILAS Windowing of degree 1
Dump learning process statistics at each iteration
Initialization uses examples to create the initial rules
Instances used in initialization are sampled with uniform class distribution
Coverage breakpoint for MDL fitness : 0.010000
Number of times we will try to learn a rule from the current training set: 2
Coverage ratio for MDL fitness : 0.900000
Using HYPERRECT Knowledge Representation
Number of expressed attributes in initialization : 15.000000
Hyperrectangle attribute list knowledge representation
Probability of generalizing the hyperrect list KR: 0.100000
Probability of specializing the hyperrect list KR: 0.100000
Random seed specified:0
Random seed 0
Dataset name: MX11
Attribute 0:Name att0 Def:{0,1}
Attribute 0 nominal
Value 0 of attribute 0: 0
Value 1 of attribute 0: 1
Attribute 1:Name att1 Def:{0,1}
Attribute 1 nominal
Value 0 of attribute 1: 0
Value 1 of attribute 1: 1
...
Least frequent class is 0
Most frequent class is 0
Coverage break for class 0 : 0.020000
Coverage break for class 1 : 0.020000
Probability of irrelevant attribute set to 0.000000

Learning process

During the learning stage the system will print the status of the learning in each iteration as follows:

...
It 1,Best ac:0.062500 1.000000 fi:0.091837. Ave ac:0.041133,0.045550
It 2,Best ac:0.062500 1.000000 fi:0.091837. Ave ac:0.028002,0.030243
It 3,Best ac:0.062500 1.000000 fi:0.091837. Ave ac:0.024124,0.026622
...

This output indicates that in iteration 1 the best individual found had a coverage of 0.0625, an accuracy of 1.0 and a total fitness of 0.091837. The output also shows the average accuracy of the whole population.
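If the run is captured in a log file (run.log is an assumed name, e.g. from redirecting standard output), the per-iteration best fitness values can be pulled out for plotting, for instance:

```shell
# Print the best fitness (the fi: field) of each GA iteration in the log
sed -n 's/^It .*fi:\([0-9.]*\)\. .*/\1/p' run.log
```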

Each time the system learns one rule we will come across the following output:

Best acc 0.062500 1.000000
Rule:Att att0 is 0|Att att1 is 0|Att att3 is 1|Att att4 is 1|1
Removed 128 instances. 1920 Instances left. Acc of rule 0.062500-1.000000
Accuracy : 0.562500
0   1
1024    0
896 128

This output indicates the coverage and accuracy of the rule learnt, the phenotype of the rule, the number of examples that were removed from the training set, and the confusion matrix. Also the line:

Accuracy : 0.562500

shows the global test accuracy we would achieve if we stopped the learning at this point.

Learning Statistics

When the system cannot learn any more rules it prints the complete set of rules learnt.

Phenotype:
0:Att att0 is 0|Att att1 is 0|Att att3 is 1|Att att4 is 1|1
1:Att att1 is 1|Att att2 is 1|Att att6 is 1|Att att10 is 1|1
2:Att att0 is 1|Att att1 is 1|Att att2 is 0|Att att9 is 1|1
3:Att att0 is 1|Att att1 is 0|Att att2 is 0|Att att7 is 1|1
4:Att att0 is 1|Att att1 is 0|Att att2 is 1|Att att8 is 1|1
5:Att att0 is 0|Att att1 is 1|Att att2 is 0|Att att5 is 1|1
6:Att att0 is 0|Att att1 is 1|Att att2 is 1|Att att6 is 1|1
7:Att att0 is 0|Att att1 is 0|Att att2 is 1|Att att3 is 0|Att att4 is 1|1
8:Att att0 is 1|Att att1 is 1|Att att2 is 1|Att att6 is 0|Att att10 is 1|1
9:Att att0 is 0|Att att1 is 0|Att att2 is 0|Att att3 is 1|Att att4 is 0|1
10:Default rule -> 0

Afterwards the system calculates learning statistics over both the train and test sets:

Train accuracy : 1.000000
Train error : 0.000000
Train not classified : 0.000000
Train For each class:
0: accuracy : 1024
0: error : 0
0: not classified : 0
1: accuracy : 1024
1: error : 0
1: not classified : 0
Train Confusion Matrix. Row real class, Column predicted class
0   1
1024    0
0   1024
Performance of each classifier:
Classifier 0: 128/128=100.000000%
Classifier 1: 128/128=100.000000%
Classifier 2: 128/128=100.000000%
Classifier 3: 128/128=100.000000%
Classifier 4: 128/128=100.000000%
Classifier 5: 128/128=100.000000%
Classifier 6: 64/64=100.000000%
Classifier 7: 64/64=100.000000%
Classifier 8: 64/64=100.000000%
Classifier 9: 64/64=100.000000%
Classifier 10: 1024/1024=100.000000%
Test accuracy : 1.000000
Test error : 0.000000
Test not classified : 0.000000
Test For each class:
0: accuracy : 1024
0: error : 0
0: not classified : 0
1: accuracy : 1024
1: error : 0
1: not classified : 0
Test Confusion Matrix. Row real class, Column predicted class
0   1
1024    0
0   1024
Performance of each classifier:
Classifier 0: 128/128=100.000000%
Classifier 1: 128/128=100.000000%
Classifier 2: 128/128=100.000000%
Classifier 3: 128/128=100.000000%
Classifier 4: 128/128=100.000000%
Classifier 5: 128/128=100.000000%
Classifier 6: 64/64=100.000000%
Classifier 7: 64/64=100.000000%
Classifier 8: 64/64=100.000000%
Classifier 9: 64/64=100.000000%
Classifier 10: 1024/1024=100.000000%
Total time: 3.01 3.13001
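All of these statistics go to standard output, so it is common to capture a run in a log file (again, run.log is an assumed name) and extract the figures of interest afterwards, e.g.:

```shell
# Recover the final test accuracy from a captured log
grep '^Test accuracy' run.log | awk '{ print $NF }'
```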

Contact

If you have any further questions or comments about the BioHEL system or this tutorial in particular, please contact us at jaume.bacardit.