GP challenge dataset

Below you will find the energy terms and distances to the native structure for protein models (decoys) used as an input to the genetic programming algorithm described in "GP challenge: Evolving the energy function for protein structure prediction". We encourage everyone to test their algorithms on this dataset.


The dataset is based on candidate protein structures (decoys) generated by the I-TASSER ab initio predictor. For each of 56 non-homologous small protein chains, I-TASSER generated between 12.5k and 20k decoys. We used 54 chains (excluding 1ogwA and 1cy5A) and, for each chain, kept every 10th decoy in order of generation time.

Energy terms

We implemented 8 of the I-TASSER energy terms and calculated their values for each decoy.

We left out the energy terms that use data from the threading process (e.g. distance map or contact order) and the hydrophobic potential, as they depend on external feature predictors.

Download energy terms: energy_terms.tar.gz [2.8 MiB]
The archive contains 54 files, one for each protein. Each line in a file contains a space-separated list of energy terms for a single decoy.
The decoys (lines) in the file are sorted in increasing order of original I-TASSER energy.

Line format: T1 T2 T3 T4 T5 T6 T7 T8
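
As an illustration of the line format above, here is a minimal Python sketch that reads energy-term lines into tuples of 8 floats. The function name parse_energy_terms and the sample values are made up for this example; the file names inside the archive are not stated here, so the sketch works on any iterable of lines (for instance an open file).

```python
import io


def parse_energy_terms(lines):
    """Parse lines of the form 'T1 T2 ... T8' into a list of 8-float tuples.

    One tuple per decoy, in the order the lines appear.
    """
    rows = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != 8:
            raise ValueError(
                f"line {line_no}: expected 8 energy terms, got {len(fields)}"
            )
        rows.append(tuple(float(f) for f in fields))
    return rows


# Made-up values, only to show the shape of the data.
sample = io.StringIO(
    "-1.5 0.2 3.0 4.1 -0.7 2.2 0.0 1.9\n"
    "-1.2 0.3 2.8 4.0 -0.6 2.1 0.1 2.0\n"
)
terms = parse_energy_terms(sample)
```

Passing an open file object instead of the StringIO works the same way, since both iterate line by line.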

Distance to the native

For each decoy we measured its similarity to the known native structure. As the measure we used the root mean square deviation (RMSD) between the 3D coordinates of the Cα atoms of the two structures, minimised with respect to rotation.
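
For readers who want to reproduce a comparable measure, the following is a sketch of a rotation-minimised RMSD using the Kabsch algorithm with NumPy. It is not necessarily the procedure used to produce the distributed values: the choice of Kabsch, the centring of both structures on their centroids before rotation, and the function name kabsch_rmsd are all assumptions of this example.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition.

    Assumption: both sets are centred on their centroids first, then P is
    rotated onto Q with the Kabsch algorithm.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)

    # Covariance matrix and its singular value decomposition.
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1), not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    P_rot = P0 @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q0) ** 2, axis=1))))


# Made-up coordinates: a rotated and shifted copy should give RMSD ~ 0.
native = np.array(
    [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0], [0.0, 1.5, 1.0]]
)
theta = 0.7
Rz = np.array(
    [
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta), np.cos(theta), 0.0],
        [0.0, 0.0, 1.0],
    ]
)
decoy = native @ Rz.T + np.array([3.0, -2.0, 5.0])
rmsd_same = kabsch_rmsd(decoy, native)
```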

To each decoy we have assigned a rank based on the increasing order of RMSD, averaging the ranks in case of ties. A tie between decoys was called when RMSD values were the same up to the first two decimal places.
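
The ranking rule can be sketched as follows. This example assumes that "the same up to the first two decimal places" means equality after rounding to two decimals (truncation is the other possible reading) and that ranks start at 1; the function name average_ranks is hypothetical.

```python
def average_ranks(rmsds, decimals=2):
    """Rank values in increasing order, averaging the ranks of tied values.

    Assumption: two values tie when they are equal after rounding to
    `decimals` places. Ranks are 1-based.
    """
    keys = [round(v, decimals) for v in rmsds]
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    ranks = [0.0] * len(keys)

    pos = 0
    while pos < len(order):
        # Extend `end` over the run of decoys tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and keys[order[end + 1]] == keys[order[pos]]:
            end += 1
        # Average of the 1-based positions pos+1 .. end+1.
        avg = (pos + 1 + end + 1) / 2.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


# Made-up RMSD values: the first and third tie at 2.31.
example_ranks = average_ranks([2.314, 1.05, 2.312, 3.0])
# example_ranks == [2.5, 1.0, 2.5, 4.0]
```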

Download distances/ranks: distances.tar.gz [384 KiB]
The archive contains 54 files, one for each protein. Each line in a file contains a space-separated list of 3 values: rank, RMSD, and the original I-TASSER energy. The order of decoys (lines) in each file is the same as for the energy terms.

Line format: rank RMSD energy
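
Because both archives list decoys in the same order, the two files for a protein can be joined line by line. The sketch below shows one way to do this; the function name pair_decoys, the dictionary layout and the sample values are assumptions of this example, and the rank is read as a float because averaged ties can produce fractional ranks.

```python
def pair_decoys(term_lines, distance_lines):
    """Join energy-term lines with distance lines for the same protein.

    Returns one dict per decoy with keys 'terms', 'rank', 'rmsd', 'energy'.
    """
    term_rows = [line.split() for line in term_lines if line.strip()]
    dist_rows = [line.split() for line in distance_lines if line.strip()]
    if len(term_rows) != len(dist_rows):
        raise ValueError(
            f"decoy count mismatch: {len(term_rows)} vs {len(dist_rows)}"
        )

    paired = []
    for terms, dist in zip(term_rows, dist_rows):
        if len(terms) != 8 or len(dist) != 3:
            raise ValueError("expected 8 energy terms and 3 distance values")
        rank, rmsd, energy = (float(v) for v in dist)
        paired.append(
            {
                "terms": tuple(float(v) for v in terms),
                "rank": rank,
                "rmsd": rmsd,
                "energy": energy,
            }
        )
    return paired


# Made-up lines in the two formats, sorted by increasing energy.
term_lines = [
    "-1.5 0.2 3.0 4.1 -0.7 2.2 0.0 1.9",
    "-1.2 0.3 2.8 4.0 -0.6 2.1 0.1 2.0",
    "-1.0 0.4 2.5 3.9 -0.5 2.0 0.2 2.1",
]
distance_lines = [
    "2.5 3.41 -310.2",
    "1.0 2.87 -305.9",
    "2.5 3.41 -298.4",
]
decoys = pair_decoys(term_lines, distance_lines)
best = min(decoys, key=lambda d: d["rank"])  # decoy closest to the native
```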

Plots gallery

The gallery contains high-resolution versions (A4, 300 dpi) of the correlation plots from our article, along with additional population diversity plots for the different GP configurations tested in the last round of our experiments.