Introduction

This tutorial provides information about Partek model selection tool, how to use this function and some common mistakes which we should avoid to do. The dataset used in the tutorial is a simulated human microarray intensity values in log space. The data is not used for diagnostic procedure, only to show how to use the function.

Select a classification model

Download the zip file from the link below. The download contains the following files:

A classification model has two parts:  variables and classifier. The model selection tool in Partek Genomics Suite uses cross-validation to choose the best classification model and gives the accuracy estimate of the best model.

1-level cross-validation is used to select the best model to deploy. There are two ways to report the unbiased accuracy estimate (or correct rate): 2-level cross validation on the same data set, or deploy the model on a independent test set. We will show both in this tutorial.

 

 

 


 

 

The more models configured, the long time it takes to run, in this example, in order to save time, we only specified 15 models and choose 5-fold cross-validation. You can also click on Load Spec button to load the above configuration from file tuturial.pcms

When click on Run, a dialog as (Figure 4) will display, some classifiers like discriminant analysis are not recommended to perform on dataset with more number of variables than that of samples.

 

 

Since we are doing 5-fold cross validation, there will be 6 samples held out as test set in each iteration, and the models are built on the rest 22 samples training set. After it is done, all the 12 models have been tested on the 28 samples, and the correct rate will reported, they are displayed in the summary page in descending order of the normalized correct rate, the top one is the best model among the 12 models (Figure 5).

 

Deploy a model

To get unbiased correct rate, the test set sample must be independent from the training set. Now we are going to load another dataset, it has 8 samples with logged intensity values on the set of genes as that of the training data set

Cross validation

 

Common mistakes