GAssist User Guide

  1. Installation
  2. Configuration
  3. Running GAssist

1. Installation

System requirements

GAssist runs on Unix-based systems and requires g++ 4.4.x or a newer version.

Download

Start by downloading the GAssist source code.

Compiling GAssist

To uncompress the files execute:

$ tar -zxvf GAssist+MPLCS.tar.gz

You should now see a folder called GAssist+MPLCS in the folder where you decompressed the files. Enter this folder and run the following commands to compile the code.

$ cd GAssist+MPLCS
$ touch .depend
$ make clean
$ make install

GAssist should now be accessible across the system.
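
To verify the installation you can check that the genetic binary (the command used in section 3 of this guide) is reachable from your PATH, for example:

$ which genetic

If this prints a path, the installation worked; otherwise check where make install placed the binary.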

2. Configuration

Basic configuration

In the GAssist+MPLCS folder there is an example configuration file called test0.conf.

$ vi test0.conf

The content of the file is the following:

crossover operator 1px
default class auto
discretizer uniform 10
discretizer uniform 15
discretizer uniform 20
discretizer uniform 25
discretizer uniform 4
discretizer uniform 5
discretizer uniform 6
discretizer uniform 7
discretizer uniform 8
fitness function mdl
hierarchical selection iteration 24
hierarchical selection threshold 0
initialization max classifiers 20
initialization min classifiers 20
iterations 1000
kr adi
max intervals 5
mdl initial tl ratio 0.075
mdl iteration 25
mdl weight relax factor 0.90
penalize individuals with less classifiers than 4
pop size 400
prob crossover 0.6
prob individual mutation 0.6
prob merge 0.05
prob one 0.90
prob reinitialize 0.03
prob reinitialize at end 0
prob split 0.05
pruning iteration 5
pruning min classifiers 12
selection algorithm tournamentwor
tournament size 3
windowing ilas 2
dump evolution stats
class wise init
smart init

random seed 0

To change the population size modify the following line (we suggest a population size of 400):

pop size 400

To adjust the probability of crossover modify the following line:

prob crossover 0.6

To use a specific random seed modify the following line:

random seed 367364

Or, to avoid using a predefined seed, erase the line or comment it out as follows:

#random seed 0

To change the number of iterations of the GA, modify the following line (we suggest between 1000 and 1500 iterations):

iterations 1000
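
If you prefer to edit these values from the command line instead of a text editor, the following is a minimal sketch using GNU sed (the new values are only illustrative):

$ sed -i 's/^pop size 400/pop size 500/' test0.conf
$ sed -i 's/^iterations 1000/iterations 1500/' test0.conf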

Fitness penalisations

There are two parameters that help the system find more refined solutions when we have some prior information about the problem we want to solve. These parameters penalise rule sets that have either too many or too few rules.

To modify the minimum number of classifiers a rule set must have in order not to be penalised, modify the following line:

penalize individuals with less classifiers than 4

Periodically, the system also prunes rule sets whose number of rules is above a certain threshold. To update this threshold, modify the following line:

pruning min classifiers 12

It is also possible to determine from which iteration onwards the pruning mechanism is applied by modifying this line:

pruning iteration 5
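
Taken together, the three penalisation and pruning lines discussed in this subsection appear in the sample test0.conf as:

penalize individuals with less classifiers than 4
pruning iteration 5
pruning min classifiers 12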

Default class

GAssist has an explicit default class mechanism that allows the system to generate a final rule covering all the remaining examples in the training set. This mechanism reduces the complexity of the generated rule sets, because the system only has to generate rules for n - 1 classes. To change the default class settings, modify the following line:

default class auto
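
For instance, in the adult example at the end of this guide the dataset has two classes, so the evolved rule set only contains rules that predict class 1 plus a final "Default rule -> 0" that covers all the remaining examples.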

There are three possible options for the default class:

ILAS windowing scheme

GAssist provides a windowing mechanism called Incremental Learning with Alternative Strata (ILAS). This mechanism partitions the training set into non-overlapping strata, and the strata are then used one at a time, one per GA iteration, following a simple round-robin policy. Using windows speeds up the learning process because each iteration uses a sample of the training set instead of all the instances. However, using a very large number of windows might make the sample in each window so small that it is no longer a good representation of the problem we want to solve.

To set up the number of windows to use modify the following line:

windowing ilas 20

Windowing is not advised for small problems (fewer than 10000 instances). For more information about the advantages and disadvantages of the ILAS windowing scheme, please see Bacardit2005.
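
As a rough, purely illustrative sizing sketch (the instance counts below are hypothetical, not taken from any bundled dataset):

# 20000 training instances, windowing ilas 2  -> about 10000 instances per stratum
# 20000 training instances, windowing ilas 20 -> about 1000 instances per stratum, probably too small
windowing ilas 2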

3. Running GAssist

The syntax to execute GAssist is the following: specify the configuration file to use, followed by the train and test files in WEKA format.

$ genetic <conf file> <train file> <test file>
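
For example, using the configuration file above and a pair of hypothetical WEKA files adult.train.arff and adult.test.arff in the current folder (the file names are only illustrative):

$ genetic test0.conf adult.train.arff adult.test.arff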

Input format

The datasets GAssist accepts are in WEKA format. GAssist requires both the train set and the test set to be passed as parameters.
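
As a minimal sketch, a WEKA (ARFF) file starts with a header that declares the relation name and each attribute, followed by a @data section with one comma-separated instance per line. The attributes below are taken from the adult example shown later in this guide, and the data row is purely illustrative:

@relation adult
@attribute age real
@attribute workclass {Private,Self-emp-not-inc,Self-emp-inc,Federal-gov,Local-gov,State-gov,Without-pay,Never-worked}
@attribute class {>50K,<=50K}
@data
39,State-gov,<=50K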

Examples of datasets that GAssist can use can be found in PSP Benchmarks.

Output format

GAssist output is separated in three stages: initialization, learning and results.

Initialization stage

The initialization stage shows the parameter configuration used for the run, followed by the problem characteristics. Its output looks as follows:

One Point Crossover
Automatical determination of default class
New discretizer for LCS/GABIL/ADI KRs uniform 10
New discretizer for LCS/GABIL/ADI KRs uniform 15
New discretizer for LCS/GABIL/ADI KRs uniform 20
New discretizer for LCS/GABIL/ADI KRs uniform 25
New discretizer for LCS/GABIL/ADI KRs uniform 4
New discretizer for LCS/GABIL/ADI KRs uniform 5
New discretizer for LCS/GABIL/ADI KRs uniform 6
New discretizer for LCS/GABIL/ADI KRs uniform 7
New discretizer for LCS/GABIL/ADI KRs uniform 8
MDL fitness function
Hierarchical selection activated, starting at iteration 24
Hierarchical selection threshold :0.000000
Maximum number of classifiers per individual in initialization:20.000000
Minumum number of classifiers per individual in initialization:20.000000
GA Iterations:500.000000
Using Adaptive Discretization Intervals Knowledge Representation
Maximum number of intervals per attribute in ADI KR:5
Initial theory length proportion in MDL formula: 0.075000
Iteracio activacio MDL 25
MDL Weight relax factor 0.900000
Penalize the individuals that have a size less than 4
Popsize: 400.000000
Crossover probability: 0.600000
Individual-wise mutation probability:0.600000
Probability of merge operator in ADI KR: 0.050000
Probability of value ONE in initialization:0.900000
Probability of reinitialize operator in ADI KR: 0.030000
Probability of reinitialize operator at final iteration in ADI KR:0.000000
Probability of split operator in ADI KR: 0.050000
Pruning operator activated at iteration:5.000000
Pruning stops if #classifiers is less that 12.000000
Tournament Selection without replacement Algorithm
Tournament size:3.000000
ILAS Windowing of degree 2.000000
Dump learning process statistics at each iteration
Instances used in initialization are sampled with uniform class distribution
Initialization uses examples to create the initial rules
Random seed specified:0
Random seed 0
Dataset name: adult
Attribute 0:Name age Def:real
Attribute 0 real valued
Attribute 1:Name workclass Def:{Private,Self-emp-not-inc,Self-emp-inc,Federal-gov,Local-gov,State-gov,Without-pay,Never-worked}
Attribute 1 nominal
Value 0 of attribute 1: Private
Value 1 of attribute 1: Self-emp-not-inc
Value 2 of attribute 1: Self-emp-inc
Value 3 of attribute 1: Federal-gov
Value 4 of attribute 1: Local-gov
Value 5 of attribute 1: State-gov
Value 6 of attribute 1: Without-pay
Value 7 of attribute 1: Never-worked
Attribute 2:Name fnlwgt Def:real
Attribute 2 real valued
Attribute 3:Name education Def:{Bachelors,Some-college,11th,HS-grad,Prof-school,Assoc-acdm,Assoc-voc,9th,7th-8th,12th,Masters,1st-4th,10th,Doctorate,5th-6th,Preschool}
Attribute 3 nominal
...
Attribute 14:Name class Def:{>50K,<=50K}
Attribute 14 nominal
Value 0 of attribute 14: >50K
Value 1 of attribute 14: <=50K
Least frequent class is 0
Most frequent class is 1

Learning process

During the learning stage the system will print the status of the learning in each iteration as follows:

...
It 1,Best ac:0.772930 fi:27.707006 #cl:6(6). Ave ac:0.502460,0.259306 #cl:20.507500(10.245000)
It 2,Best ac:0.772727 fi:27.727273 #cl:6(6). Ave ac:0.508485,0.256193 #cl:18.440000(9.640000)
It 3,Best ac:0.783030 fi:26.696997 #cl:4(4). Ave ac:0.525191,0.244834 #cl:15.787500(8.740000)
...

This output shows that in iteration 1 the accuracy of the best individual is 0.77, the fitness assigned to this individual is 27.70, the number of classifiers in its rule set is 6 and the number of alive classifiers (in parentheses) is also 6. The output also reports the average and standard deviation of the accuracy of the whole population, and the average number of classifiers and alive classifiers in the rule sets.

Learning Statistics

When the system cannot learn any more rules, it prints the complete set of rules learnt.

Phenotype:
0:Att att0 is 0|Att att1 is 0|Att att3 is 1|Att att4 is 1|1
1:Att att1 is 1|Att att2 is 1|Att att6 is 1|Att att10 is 1|1
2:Att att0 is 1|Att att1 is 1|Att att2 is 0|Att att9 is 1|1
3:Att att0 is 1|Att att1 is 0|Att att2 is 0|Att att7 is 1|1
4:Att att0 is 1|Att att1 is 0|Att att2 is 1|Att att8 is 1|1
5:Att att0 is 0|Att att1 is 1|Att att2 is 0|Att att5 is 1|1
6:Att att0 is 0|Att att1 is 1|Att att2 is 1|Att att6 is 1|1
7:Att att0 is 0|Att att1 is 0|Att att2 is 1|Att att3 is 0|Att att4 is 1|1
8:Att att0 is 1|Att att1 is 1|Att att2 is 1|Att att6 is 0|Att att10 is 1|1
9:Att att0 is 0|Att att1 is 0|Att att2 is 0|Att att3 is 1|Att att4 is 0|1
10:Default rule -> 0

Afterwards, the system calculates learning statistics for both the train and test sets:

Test accuracy : 0.838862
Test error : 0.161138
Test not classified : 0.000000
Test For each class:
0: accuracy : 596
0: error : 572
0: not classified : 0
1: accuracy : 3501
1: error : 215
1: not classified : 0
Test Confusion Matrix. Row real class, Column predicted class
0   1
596 572
215 3501
Performance of each classifier:
Classifier 0: 485/672=72.172619%
Classifier 1: 4/11=36.363636%
Classifier 2: 0/0=nan%
Classifier 3: 0/0=nan%
Classifier 4: 0/0=nan%
Classifier 5: 0/0=nan%
Classifier 6: 107/128=83.593750%
Classifier 7: 3501/4073=85.956298%
Total time: 219.08 219.928
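
As a sanity check, the reported test accuracy can be recovered from the confusion matrix above: the correctly classified instances on the diagonal (596 of class 0 and 3501 of class 1) divided by the total number of test instances (596 + 572 + 215 + 3501 = 4884) give 4097 / 4884 = 0.838862, matching the Test accuracy line.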

Contact

If you have any further doubts or comments about the GAssist system, or about this tutorial in particular, please contact jaume.bacardit.