GP challenge dataset
Below you will find the energy terms and distances to the native structure for the protein models (decoys) used as input to the genetic programming algorithm described in "GP challenge: evolving energy function for protein structure prediction". We encourage everyone to test their algorithms on this dataset.
Decoys
The dataset is based on candidate protein structures (decoys) generated by the I-TASSER ab initio predictor. For each of 56 non-homologous small protein chains, I-TASSER generated between 12.5k and 20k decoys. We used 54 chains (excluding 1ogwA and 1cy5A) and, for each chain, sampled every 10th decoy in order of generation time.
Energy terms
We implemented eight of the I-TASSER energy terms and calculated their values for each decoy:
- three short-range potentials between Calpha atoms E13, E14 and E15 – [T1-T3]
- local stiffness potential Estiff – [T4]
- hydrogen bond potential EHB – [T5]
- long-range pairwise potential between side chain centres of mass Epair – [T6]
- electrostatic interactions potential Eelectro – [T7]
- environment profile potential Eenv – [T8]
We left out the energy terms that use data from the threading process (e.g. distance map or contact order), as well as the hydrophobic potential, because they depend on external feature predictors.
Download energy terms: energy_terms.tar.gz [2.8 MiB]
The archive contains 54 files, one for each protein. Each line of a file contains a space-separated list of energy terms for a single decoy.
The decoys (lines) in each file are sorted in increasing order of the original I-TASSER energy.
Line format: T1 T2 T3 T4 T5 T6 T7 T8
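As an illustration of the line format above, the following is a minimal sketch of reading one energy terms file in Python. The names `EnergyTerms`, `parse_energy_line` and `read_energy_terms` are hypothetical, and the file path is whatever name the file has after unpacking the archive. The sketch assumes every non-empty line holds exactly eight numeric fields.

```python
from typing import List, NamedTuple


class EnergyTerms(NamedTuple):
    """The eight energy terms of one decoy, in file column order."""
    t1: float  # E13
    t2: float  # E14
    t3: float  # E15
    t4: float  # Estiff
    t5: float  # EHB
    t6: float  # Epair
    t7: float  # Eelectro
    t8: float  # Eenv


def parse_energy_line(line: str) -> EnergyTerms:
    """Parse one 'T1 T2 T3 T4 T5 T6 T7 T8' line."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 energy terms, got {len(fields)}")
    return EnergyTerms(*(float(field) for field in fields))


def read_energy_terms(path: str) -> List[EnergyTerms]:
    """Read all decoys of one protein, keeping the order of lines in the file."""
    with open(path) as handle:
        return [parse_energy_line(line) for line in handle if line.strip()]
```

Keeping the order of the lines matters, because the distance files list the decoys in the same order.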
Distance to the native
For each decoy we have measured its similarity to the known native structure. As a measure we used the root mean square deviation (RMSD) between the 3D coordinates of the Calpha atoms of the two structures, minimised with respect to rotation.
To each decoy we have assigned a rank based on the increasing order of RMSD, averaging the ranks in case of ties. A tie between decoys was called when RMSD values were the same up to the first two decimal places.
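The ranking rule above can be sketched as follows. This is an illustration and not the code used to produce the dataset. It makes two assumptions: "the same up to the first two decimal places" is read as equality after rounding to two decimals (truncation is the other possible reading), and ranks start at 1. The function name `rmsd_ranks` is hypothetical.

```python
from typing import List


def rmsd_ranks(rmsds: List[float]) -> List[float]:
    """Rank decoys by increasing RMSD, averaging the ranks of tied decoys."""
    # Assumption: two RMSD values tie when they agree after rounding to 2 decimals.
    keys = [round(value, 2) for value in rmsds]
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    ranks = [0.0] * len(keys)
    start = 0
    while start < len(order):
        # Extend the group while the next decoy has the same rounded RMSD.
        end = start
        while end + 1 < len(order) and keys[order[end + 1]] == keys[order[start]]:
            end += 1
        # Average of the 1-based positions start+1 .. end+1 (assumed 1-based ranks).
        average = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average
        start = end + 1
    return ranks
```

For example, RMSD values of 2.501 and 2.499 both round to 2.50, so the two decoys share the average of the two ranks they would otherwise occupy.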
Download distances/ranks: distances.tar.gz [384 KiB]
The archive contains 54 files, one for each protein. Each line of a file contains a space-separated list of 3 values: rank, RMSD, and the original I-TASSER energy. The order of decoys (lines) in each file is the same as for the energy terms.
Line format: rank RMSD energy
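Because both archives list the decoys of a protein in the same order, the two files can be joined line by line. The sketch below shows one way to do this in Python. The names `Distance`, `parse_distance_line` and `pair_decoys` are hypothetical. The rank is read as a float on the assumption that averaged ties can give fractional ranks.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Distance(NamedTuple):
    """One line of a distances file: 'rank RMSD energy'."""
    rank: float    # may be fractional when tied ranks were averaged
    rmsd: float
    energy: float  # original I-TASSER energy


def parse_distance_line(line: str) -> Distance:
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 values, got {len(fields)}")
    return Distance(*(float(field) for field in fields))


def pair_decoys(
    term_lines: Iterable[str], distance_lines: Iterable[str]
) -> List[Tuple[List[float], Distance]]:
    """Join energy terms and distances of the same protein by line position."""
    terms = [[float(f) for f in line.split()] for line in term_lines if line.strip()]
    distances = [parse_distance_line(line) for line in distance_lines if line.strip()]
    if len(terms) != len(distances):
        raise ValueError("the two files describe different numbers of decoys")
    return list(zip(terms, distances))
```

Both arguments can be open file objects, for example `pair_decoys(open(terms_path), open(distances_path))`, where the two paths point to the files of the same protein.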
Plots gallery
The gallery contains high-resolution versions (A4, 300 dpi) of the correlation plots from our article, as well as additional population diversity plots for the different GP configurations tested in the last round of our experiments.
Publications
GP challenge: evolving energy function for protein structure prediction, in Genetic Programming and Evolvable Machines, 11(1):61-88, March 2010
@ARTICLE{Widera2010,
  title   = {GP challenge: evolving energy function for protein structure prediction},
  author  = {Widera, Paweł and Garibaldi, Jonathan M. and Krasnogor, Natalio},
  year    = 2010,
  doi     = {10.1007/s10710-009-9087-0},
  month   = mar,
  journal = {Genetic Programming and Evolvable Machines},
  volume  = {11},
  number  = {1},
  pages   = {61--88}
}