Motivation and Background


In the Machine Learning (ML) field we need benchmarks to evaluate the behaviour and performance of our systems. A test suite is chosen or designed with many characteristics in mind.

Real-world datasets usually contain noise and inconsistencies, which makes them useful for evaluating the robustness of a learning system. However, if our objective is to evaluate the scalability of a system in terms of the number of attributes, number of classes, etc. it can cope with, then we need a very broad range of datasets. To achieve this, we can either use synthetic datasets whose dimensions we can adjust arbitrarily, or inflate real datasets with irrelevant data, which introduces a bias into the evaluation procedure.

The datasets in this repository are an alternative family of problems based on real data, whose characteristics vary in regular steps.

The evaluation process can therefore (potentially) be fairer, more reliable and more robust, since we avoid artificially inflating real datasets while still using problems far more complicated than toy problems.

What can Protein Structure Prediction do for ML benchmarking?

In short, Protein Structure Prediction (PSP) aims to predict the 3D structure of a protein from its primary sequence, that is, a chain of amino acids (a string over a 20-letter alphabet).

Illustration of mapping between a protein sequence and its 3D structure

PSP is, overall, an optimization problem. However, each amino acid can be characterized by several structural features, and good predictions of these features contribute greatly to obtaining better models for the 3D PSP problem. These features can be predicted as classification/regression problems.

These features are predicted from the local context of the target residue in the chain (a window of amino acids). In the example below, a feature called Coordination Number (CN) for residue i is predicted using information from the residue itself and its two nearest neighbours in the chain sequence, i-1 and i+1.

Example of (-1,+1) windows for the CN feature

By generating different versions of this problem with different window sizes, we can construct a family of datasets with an arbitrarily increasing number of attributes, which is useful for evaluating how a learning system copes with datasets of different sizes.
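The windowing scheme described above can be sketched as follows. This is an illustrative example, not the repository's actual generation code; the function name, padding value and toy data are assumptions.

```python
# Sketch: build a windowed dataset from per-residue features.
# Names and padding convention are illustrative, not from the repository.

def windowed_instances(features, targets, half_window, pad=0.0):
    """For each residue i, collect the features of residues
    i-half_window .. i+half_window (padding past the chain ends)
    and pair them with the target value of residue i."""
    n = len(features)
    instances = []
    for i in range(n):
        row = []
        for j in range(i - half_window, i + half_window + 1):
            row.append(features[j] if 0 <= j < n else pad)
        instances.append((row, targets[i]))
    return instances

# A (-1,+1) window as in the CN example: 3 attributes per instance.
feats = [0.1, 0.5, 0.9, 0.3]   # one (toy) feature per residue
cn = [2, 4, 5, 3]              # toy CN targets
data = windowed_instances(feats, cn, half_window=1)
# data[0] == ([0.0, 0.1, 0.5], 2)  -- left end is padded
```

Growing `half_window` widens every instance by two attributes per step, which is exactly how the family of datasets scales in the attribute dimension.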

Moreover, most of these structural features are defined as continuous variables, so it is natural to treat them as regression problems. However, it is also common to discretize them and treat them as classification problems. Since we can choose the number of bins used for discretization, we can also construct a family of datasets with an arbitrarily increasing number of classes. The discretization criterion determines the class distribution: uniform-frequency (UF) discretization creates datasets with well-balanced classes, while uniform-length (UL) discretization creates datasets with uneven class distributions.
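The two discretization schemes can be sketched as below. This is a minimal illustration of the UF/UL idea, not the scripts used to generate the repository's datasets; function names and the toy values are assumptions.

```python
# Sketch of uniform-length (UL) vs uniform-frequency (UF) discretization.
# Illustrative only; not the repository's actual generation code.

def uniform_length_bins(values, n_bins):
    """UL: cut the value range into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value falls in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def uniform_frequency_bins(values, n_bins):
    """UF: assign bins so each bin receives (roughly) the same
    number of values, regardless of how the values are spread."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels

vals = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0]
# UL: the outlier 10.0 stretches the range, so most values share bin 0
#     -> uneven class distribution.
# UF: each bin gets half of the values -> balanced classes.
```

With skewed data (as in `vals`), UL yields the uneven class distribution described above, while UF yields balanced classes by construction.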

Finally, two basic types of input information are usually available for these datasets:

In summary, PSP can provide us with a large variety of ML datasets, derived from predicting the same protein structural feature with different formulations of inputs and outputs. Thus, we have an adjustable, real-world family of benchmarks suitable for testing the scalability of prediction methods on several fronts.

How were these datasets generated?

More specific details about the generation of these datasets can be found in: