This tutorial describes the Partek® model selection tool, shows how to use it, and points out some common mistakes to avoid. The dataset used in the tutorial consists of simulated human microarray intensity values in log space. The data is not intended for any diagnostic procedure; it serves only to show how to use the tool.

Select a Classification Model

Download the zip file from here. The download contains the following files:

A classification model has two parts: the variables and the classifier. The model selection tool in Partek Genomics Suite® uses cross-validation to choose the best classification model and gives an accuracy estimate for the best model.

1-level cross-validation is used to select the best model to deploy. There are two ways to report an unbiased accuracy estimate (or correct rate): run 2-level cross-validation on the same data set, or deploy the model on an independent test set. We will show both in this tutorial.

 

 


 

 

The more models are configured, the longer the run takes. In this example, to save time, we specified only 12 models and chose 5-fold cross-validation. You can also click the Load Spec button to load the above configuration from the file tutorial.pcms.

When you click Run, a dialog like the one in Figure 4 is displayed, notifying you that some classifiers, such as discriminant analysis, are not recommended for datasets with more variables than samples.

 

 

Since we are doing 5-fold cross-validation on 28 samples, 5 or 6 samples are held out as the test set in each iteration, and the models are built on the remaining 22 or 23 training samples. When the run is done, all 12 models have been tested on the 28 samples and their correct rates are reported. The models are displayed on the summary page in descending order of the normalized correct rate (a biased estimate; see Common Mistakes), so the top one is the best of the 12 models (Figure 5).
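The fold sizes follow from simple division: 28 samples cannot be split into five equal folds. The sketch below assumes the samples are spread over the folds as evenly as possible; how Partek assigns the remainder is not stated in this tutorial, and `fold_sizes` is a hypothetical helper, not a Partek function.

```python
def fold_sizes(n_samples, n_folds):
    """Test-set size of each fold when samples are split as evenly as possible."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds receive one additional sample (an assumption).
    return [base + 1 if i < extra else base for i in range(n_folds)]


test_sizes = fold_sizes(28, 5)
train_sizes = [28 - size for size in test_sizes]
print(test_sizes)   # [6, 6, 6, 5, 5]
print(train_sizes)  # [22, 22, 22, 23, 23]
```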

 

Deploying a Model

To get an unbiased correct rate, the test samples must be independent of the training set. We will now load another dataset; it has 8 samples with logged intensity values on the same set of genes as the training data set. Using a completely independent test set to get the correct rate is called hold-out validation.

 

 

Hold-out validation requires splitting the whole data into two parts: a training set and a test set. However, genomic data (such as microarray or NGS data) typically does not contain a large number of samples, so with the hold-out method the training and test sets become even smaller. When the sample size is small (the example data here only illustrates the function), the result is not precise. As a rule of thumb, you should have at least 100 test samples to measure the correct rate with useful precision. In other words, the larger the training set, the better the fitted predictive models; the larger the test set, the greater the power of the validation.
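To see why a small test set gives an imprecise estimate, the correct rate can be treated as a binomial proportion with standard error sqrt(p(1-p)/n). This is a textbook approximation rather than something the Partek tool reports, and the correct rate of 0.9 used below is purely an illustrative assumption.

```python
import math


def correct_rate_std_error(correct_rate, n_test):
    """Binomial standard error of a correct rate measured on n_test samples."""
    return math.sqrt(correct_rate * (1.0 - correct_rate) / n_test)


# Compare the 8-sample test set of this tutorial with the 100-sample rule of thumb.
for n_test in (8, 100):
    se = correct_rate_std_error(0.9, n_test)  # 0.9 is an assumed correct rate
    print(f"{n_test} test samples: standard error {se:.3f}")
# 8 test samples: standard error 0.106
# 100 test samples: standard error 0.030
```

Under these assumptions, 8 test samples leave an uncertainty of roughly 10 percentage points, while 100 test samples reduce it to about 3.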

Another way to get an unbiased accuracy estimate is to perform 2-level cross-validation on all the available samples (here, the full set of 36 samples), so that you don't have to split the data. The following steps show how to use all 36 samples to select the best model and get the accuracy estimate.

The best model uses 30 variables and a 1-Nearest Neighbor classifier with the Euclidean distance measure (Figure 8).

 

Since there is no separate data for testing the correct rate of the best model in the 12-model space, we will run a 2-level cross-validation to get the accuracy estimate.

 

When it is done, you will get a report like the one in Figure 10. The highlighted number is the unbiased accuracy estimate of the best model in the 12-model space.

 

 

Cross-validation

Cross-validation is used to estimate the accuracy of a predictive model, and it guards against being misled by overfitting. A typical example is testing the model on the same data set on which it was trained. It is like giving students sample questions to practice before an exam and then giving them the exact same questions during the exam, so the result is biased.

In cross-validation, the data is partitioned into a training set and a test set; a model is built on the training set and validated on the test set, so the test set is completely independent of model training. In K-fold cross-validation, the whole data set is randomly divided into K subsets of (nearly) equal size; one subset is taken as the test set and the remaining K-1 subsets are used to train a model. This is repeated K times, so every sample is used for both training and testing, and each sample is tested exactly once. The following figure shows 5-fold cross-validation:

 

 

 
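The K-fold partition described in this section can also be sketched in a few lines of code. This is a minimal illustration, not Partek's implementation; `k_fold_splits` and its `seed` parameter are hypothetical names introduced here.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Randomly partition sample indices into k folds of (nearly) equal size.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices out round-robin so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# 5-fold cross-validation on 28 samples: each sample is tested exactly once.
for train, test in k_fold_splits(28, 5):
    print(len(train), "training samples,", len(test), "test samples")
```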

Common Mistakes

In the Partek model selection tool, the cross-validation partition is performed first. In each iteration of cross-validation, variable selection and classification are performed on the training set only, and the test set stays completely independent for validating the model. One common mistake is to select variables beforehand, e.g. performing ANOVA on the whole dataset, using the ANOVA result to select the top genes, and then running cross-validation to get the correct rate. In this case, the test sets in cross-validation were already used in variable selection, so they are not independent of the training sets and the result will be biased.
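The effect of selecting variables beforehand can be illustrated on pure noise, where the class labels carry no information and an honest correct rate should be close to 0.5. The sketch below is an assumption-laden stand-in, not the Partek workflow: variables are ranked by the difference of class means instead of ANOVA, the classifier is a simple nearest-centroid rule, and all function names are hypothetical.

```python
import random


def top_features(rows, labels, n_keep):
    """Rank features by the absolute difference of the two class means."""
    def class_mean(j, cls):
        vals = [r[j] for r, y in zip(rows, labels) if y == cls]
        return sum(vals) / len(vals)
    scores = [abs(class_mean(j, 0) - class_mean(j, 1)) for j in range(len(rows[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:n_keep]


def predict(train_rows, train_labels, row, feats):
    """Nearest-centroid prediction using only the selected features."""
    dists = {}
    for cls in (0, 1):
        members = [r for r, y in zip(train_rows, train_labels) if y == cls]
        dists[cls] = sum(
            (row[j] - sum(m[j] for m in members) / len(members)) ** 2 for j in feats
        )
    return min(dists, key=dists.get)


def cv_correct_rate(rows, labels, k, n_keep, select_inside_fold):
    n = len(rows)
    # Selection on ALL samples: this is the mistake described above.
    preselected = top_features(rows, labels, n_keep)
    correct = 0
    for fold in range(k):
        test_idx = list(range(fold, n, k))
        train_idx = [i for i in range(n) if i % k != fold]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        if select_inside_fold:
            # Correct procedure: the held-out samples never influence selection.
            feats = top_features(train_rows, train_labels, n_keep)
        else:
            feats = preselected
        correct += sum(
            predict(train_rows, train_labels, rows[i], feats) == labels[i]
            for i in test_idx
        )
    return correct / n


rng = random.Random(0)
labels = [i % 2 for i in range(28)]                                      # two classes
rows = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(28)]   # noise only
biased = cv_correct_rate(rows, labels, 5, 10, select_inside_fold=False)
proper = cv_correct_rate(rows, labels, 5, 10, select_inside_fold=True)
print(f"selection on all samples:   {biased:.2f}")
print(f"selection inside each fold: {proper:.2f}")
```

With many noise variables and few samples, the first number is typically well above 0.5 even though there is nothing to learn, while the second stays near chance level.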

Another common mistake is to run 1-level cross-validation with multiple models and report the correct rate of the best model as the estimate of the generalization correct rate. This correct rate is optimistically biased. The reason is that in 1-level cross-validation the test set is used to select the best model, so it is no longer independent for estimating the correct rate on an unseen dataset. So either use the 2-level cross-validation option or use another independent set to get the accuracy estimate. The idea is to partition the data into 3 sets: a training set, a validation set and a test set. The models are trained on the training set, the validation set is used to select the best model, and the test set is used to generate an unbiased accuracy estimate.
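The three-set idea can be sketched as a nested loop: an outer cross-validation holds out the test set, and an inner cross-validation on the remaining samples selects the best model. This is a generic illustration of nested cross-validation and may differ in detail from the 2-level option in Partek Genomics Suite. The k-nearest-neighbor candidates, the simulated data, and all function names are assumptions made for the example.

```python
import random


def make_knn(k):
    """Return a fit function for a k-nearest-neighbor classifier (Euclidean distance)."""
    def fit(train_X, train_y):
        def predict(x):
            order = sorted(
                range(len(train_X)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
            )
            votes = [train_y[i] for i in order[:k]]
            return max(set(votes), key=votes.count)
        return predict
    return fit


def folds(indices, k):
    return [indices[i::k] for i in range(k)]


def cv_correct_rate(fit, X, y, indices, k):
    """1-level cross-validation restricted to the given sample indices."""
    correct = 0
    for test in folds(indices, k):
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(X[i]) == y[i] for i in test)
    return correct / len(indices)


def two_level_cv(models, X, y, k_outer, k_inner, seed=0):
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    correct = 0
    for test in folds(indices, k_outer):
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        # Inner level: pick the best model using only the outer training samples.
        best = max(
            models,
            key=lambda name: cv_correct_rate(models[name], X, y, train, k_inner),
        )
        # Outer level: score the selected model on samples it has never seen.
        predict = models[best]([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(X[i]) == y[i] for i in test)
    return correct / len(X)


rng = random.Random(1)
y = [i % 2 for i in range(36)]  # 36 simulated samples, two classes
X = [[rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)] for label in y]
models = {"1-NN": make_knn(1), "3-NN": make_knn(3), "5-NN": make_knn(5)}
estimate = two_level_cv(models, X, y, k_outer=5, k_inner=5)
print(f"2-level cross-validation correct rate: {estimate:.2f}")
```

Because model selection is repeated inside every outer fold, the reported rate estimates the performance of the whole selection procedure rather than of one hand-picked model.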