PGS Documentation

This tutorial describes the Partek® model selection tool: how to use it and some common mistakes to avoid. The dataset used in the tutorial consists of simulated human microarray intensity values in log space. The data is not intended for diagnostic procedures; it is used only to show how the function works.

...

A classification model has two parts: variables and a classifier. The model selection tool in Partek Genomics Suite® uses cross-validation to choose the best classification model and gives an accuracy estimate for the best model.

...

  • Open Partek Genomics Suite and choose File > Open... from the main menu to open Training.fmt
  • Select Tools > Predict > Model Selection from the Partek main menu
  • In the Cross-Validation tab, set Predict on to Type, Positive Outcome to Disease, and Selection Criterion to Normalized Correct Rate (Figure 1)
  • Choose the 1-Level Cross-Validation option and set Manually specify partition to 5. The purpose of 1-level cross-validation is to select the best model to deploy on the test data set.

...


...

Figure: Model selection dialog: 1-level cross validation configuration

...


  • Choose the Variable Selection tab and use ANOVA to select variables. Genes are selected based on the p-value from a 1-way ANOVA model whose factor is Type. In each iteration of cross-validation, ANOVA is performed on the training set and the top N genes with the most significant p-values are used to build the classifier. The Configure button allows you to specify the ANOVA model if you want to include multiple factors (Figure 2).
  • Since we don't know how many genes should be used to build the model, we will try 10, 20, 30, 40, and 50 genes. The more options you try, the longer the run takes. Under How many groups of variables do you want to try, select Multiple groups with sizes from 10 to 50 step 10
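
The per-fold variable selection described above can be sketched in a few lines of Python. This is only an illustration, not Partek's implementation: the function names and the toy values are hypothetical. It ranks genes by the one-way ANOVA F statistic; when every gene has the same number of samples per class, a larger F corresponds to a smaller p-value, so the ranking is the same as ranking by p-value.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F statistic for one gene across the classes in `labels`."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand = mean(values)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)

def top_n_genes(train_rows, train_labels, n):
    """Indices of the n genes with the largest F (most significant p-value)."""
    n_genes = len(train_rows[0])
    scores = [anova_f([row[j] for row in train_rows], train_labels)
              for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:n]

# Toy training set (hypothetical values): 4 samples (rows) x 3 genes (columns).
rows = [[1.0, 5.0, 2.0],
        [1.2, 5.1, 2.9],
        [3.0, 5.2, 2.1],
        [3.1, 4.9, 3.0]]
labels = ["Disease", "Disease", "Normal", "Normal"]
selected = top_n_genes(rows, labels, 1)  # only gene 0 separates the two classes
```

In a cross-validation run, `top_n_genes` would be called once per iteration on that iteration's training samples only.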


Figure: Model selection dialog: Variable selection configuration


  • Click the Classification tab, select K-Nearest Neighbor, and choose 1 and 3 neighbors using the default Euclidean distance measure (Figure 3)
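
For readers unfamiliar with the classifier, here is a minimal sketch of K-nearest-neighbor prediction with Euclidean distance. It is a generic illustration with hypothetical names and toy values, not Partek's code. With two classes and an odd number of neighbors, the majority vote cannot tie.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k):
    """Predict the class of `query` by majority vote of its k nearest training samples."""
    # Sort training samples by Euclidean distance to the query.
    order = sorted(range(len(train_rows)),
                   key=lambda i: math.dist(train_rows[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): the last training sample sits close to the query.
train = [[1.0, 1.0], [1.2, 0.9], [3.0, 3.1], [3.2, 2.9], [2.0, 2.1]]
classes = ["Disease", "Disease", "Normal", "Normal", "Normal"]
query = [1.9, 2.0]

one_nn = knn_predict(train, classes, query, 1)    # decided by the single closest sample
three_nn = knn_predict(train, classes, query, 3)  # decided by a vote among three samples
```

Here the 1-neighbor and 3-neighbor models disagree on the same query, which is why the number of neighbors is treated as part of the model to be selected.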

...


...

Figure: Model selection dialog: K-nearest neighbor configuration


  • Select the Discriminant Analysis option and use the default setting, which has the Linear with equal prior probabilities option checked
  • Click the Summary tab; 15 models have now been configured (Figure 3)

...


...

Figure: Model selection dialog: Summary page

The more models are configured, the longer the run takes. In this example, to save time, we specified only 15 models and chose 5-fold cross-validation. You can also click the Load Spec button to load the above configuration from the file tutorial.pcms.

...

When you click Run, a dialog like the one in Figure 4 is displayed, notifying you that some classifiers, such as discriminant analysis, are not recommended on a dataset with more variables than samples.


Figure: A notification that discriminant analysis model is not recommended on data with more variables than samples

...


  • Click the Run without those models button to dismiss the dialog, leaving 12 models in this model space
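
The model counts in this tutorial (15 configured, 12 remaining) can be reproduced with a short sketch. The filtering rule below is an assumption inferred from the notification text (discriminant analysis is dropped when the number of variables exceeds the number of samples); the exact rule Partek applies is not stated here, and the names are illustrative.

```python
from itertools import product

variable_counts = range(10, 51, 10)  # 10, 20, 30, 40, 50 genes
classifiers = ["1-NN", "3-NN", "Linear discriminant"]

# Every combination of a variable count and a classifier is one candidate model.
model_space = list(product(variable_counts, classifiers))

n_samples = 28  # size of the training data set in this tutorial

# Assumed rule: drop discriminant analysis when variables outnumber samples.
kept = [(n, c) for n, c in model_space
        if not (c == "Linear discriminant" and n > n_samples)]

n_configured = len(model_space)
n_remaining = len(kept)
```

Under this assumption, the three discriminant analysis models with 30, 40, and 50 variables are removed, which matches the 12 models reported by the tool.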

Since we are doing 5-fold cross-validation, about 6 samples are held out as the test set in each iteration, and the models are built on the remaining training samples (about 22). After the run is done, all 12 models have been tested on the 28 samples and their correct rates are reported. They are displayed on the summary page in descending order of the normalized biased correct rate; the top one is the best of the 12 models (Figure 5).


Figure: One-level cross-validation result: 20 variables 3 nearest neighbor with Euclidean distance measure is the best model among the 12 models on the 28 sample data


  • Click the Deploy button to launch the model using the whole dataset, but first save the file as 20var-3NN-Euclidean.ppb. It will run ANOVA on the 28 samples to select the top 20 genes and build a 3-Nearest Neighbor model based on the Euclidean distance measure.
  • Since the deployed model was built from all 28 samples, we need a separate test set to run the model on in order to know its correct rate.

...

  • Choose File > Open...  to browse and open testSet.fmt
  • Choose Tools > Predict > Run Deployed Model... from the menu
  • Select 20var-3NN-Euclidean.ppb to open it and click the Test button to run. The correct rate (= accuracy) is reported at the top of the dialog (Figure 6)

...


...

Figure: Report on deploying a model on a test data set


  • Click Add Prediction to New Spreadsheet to generate a new spreadsheet with the predicted class name in the first column. The samples (rows) whose predicted and true class names differ are highlighted (Figure 7)
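
The two quantities reported at this step, the correct rate and the list of misclassified samples, are simple to compute. The sketch below uses hypothetical function names and toy labels; it only illustrates the definitions and is not tied to Partek's output format.

```python
def correct_rate(true_labels, predicted_labels):
    """Fraction of samples whose predicted class equals the true class."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

def misclassified(true_labels, predicted_labels):
    """Row indices where the predicted and true class names differ."""
    return [i for i, (t, p) in enumerate(zip(true_labels, predicted_labels))
            if t != p]

# Toy example (hypothetical labels): one of four samples is predicted wrongly.
truth = ["Disease", "Disease", "Normal", "Normal"]
predicted = ["Disease", "Normal", "Normal", "Normal"]

rate = correct_rate(truth, predicted)
wrong_rows = misclassified(truth, predicted)
```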

...


...

Figure: Test deployed model on test set report on spreadsheet


  • Click Test Report to generate a report in HTML format
  • Click Close to dismiss the dialog

...

The best model is 30 variables using 1-Nearest Neighbor with Euclidean distance measure (Figure 8). 


Figure: One-level cross-validation result: 30 variables 1 nearest neighbor with Euclidean distance measure is the best model among the 12 models on the 36 sample data


  • Click on the model with best correct rate and deploy the model

...

  • Click the Cross-Validation tab, choose 2-Level Nested Cross-Validation, specify the number of partitions as 5 for both levels, leave everything else the same, and click Run (Figure 9)

...


Figure: Two-level cross-validation configuration setup

After the run is done, you will get a report like the one in Figure 10. The highlighted number is the unbiased accuracy estimate of the best model in the 12-model space.



Figure: Two level cross-validation report. The highlighted model had the highest accuracy

...


Cross-validation

Cross-validation is used to estimate the accuracy of a predictive model; it addresses the overfitting problem. One example of overfitting is testing the model on the same data set on which it was trained. It is like giving students sample questions to practice before an exam and then giving them the exact same questions during the exam, so the result is biased.

In cross-validation, the data is partitioned into a training set and a test set. A model is built on the training set and validated on the test set, so the test set is completely independent of model training. In K-fold cross-validation, for example, the whole data set is randomly divided into K equal-sized subsets; one subset is taken as the test set and the remaining K-1 subsets are used to train a model. This is repeated K times, so every sample is used for both training and testing, and each sample is tested exactly once. The following figure shows 5-fold cross-validation:


Figure: 5-fold cross-validation
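
The K-fold partitioning described above can be sketched as follows. This is a generic illustration with hypothetical function names, not Partek's partitioning code; the seed and the round-robin assignment of shuffled samples to folds are assumptions made for the example. When the sample count is not divisible by K, the folds differ in size by at most one.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Randomly split sample indices 0..n_samples-1 into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 28 samples and 5 folds, as in this tutorial.
splits = list(train_test_splits(28, 5))
test_sizes = [len(test) for _, test in splits]
tested = sorted(idx for _, test in splits for idx in test)
```

With 28 samples and 5 folds, three folds hold 6 samples and two hold 5, and every sample appears in exactly one test set.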

...



 

Common Mistakes

In the Partek model selection tool, cross-validation is performed first. In each iteration of cross-validation, variable selection and classification are performed on the training set, and the test set remains completely independent for validating the model. One common mistake is to select variables beforehand, e.g. to perform ANOVA on the whole dataset, use the ANOVA result to select the top genes, and then perform cross-validation to get the correct rate. In this case, the test sets in cross-validation were already used in variable selection and are not independent of the training sets, so the result will be biased.
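
The correct order of operations can be made concrete with a short sketch in which variable selection happens inside the cross-validation loop. To keep it short, two simple stand-ins are used in place of the tutorial's methods: genes are ranked by the absolute difference of the two class means (instead of ANOVA p-values), and a nearest-centroid rule is used as the classifier (instead of K-Nearest Neighbor). All names and values are hypothetical, and each training set is assumed to contain both classes.

```python
from statistics import mean

def select_top(rows, labels, n):
    """Stand-in for ANOVA: rank genes by absolute difference of the class means."""
    a_name, b_name = sorted(set(labels))
    def score(j):
        a = mean(r[j] for r, lab in zip(rows, labels) if lab == a_name)
        b = mean(r[j] for r, lab in zip(rows, labels) if lab == b_name)
        return abs(a - b)
    return sorted(range(len(rows[0])), key=score, reverse=True)[:n]

def nearest_centroid(train_rows, train_labels, query):
    """Stand-in classifier: assign the class whose mean profile is closest."""
    best, best_d = None, float("inf")
    for c in sorted(set(train_labels)):
        members = [r for r, lab in zip(train_rows, train_labels) if lab == c]
        centroid = [mean(col) for col in zip(*members)]
        d = sum((q - m) ** 2 for q, m in zip(query, centroid))
        if d < best_d:
            best, best_d = c, d
    return best

def cross_validate(rows, labels, folds, n_genes):
    """Correct rate with variable selection repeated inside every fold."""
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # Selection sees only this fold's training samples, never the test set.
        genes = select_top(train_rows, train_labels, n_genes)
        reduced = [[r[j] for j in genes] for r in train_rows]
        for i in test_idx:
            query = [rows[i][j] for j in genes]
            correct += nearest_centroid(reduced, train_labels, query) == labels[i]
    return correct / len(rows)

# Toy data (hypothetical): 6 samples x 3 genes; only gene 0 separates the classes.
rows = [[1.0, 5.0, 2.0], [1.2, 5.3, 2.4], [0.9, 4.8, 2.2],
        [3.0, 5.1, 2.1], [3.2, 4.9, 2.3], [2.9, 5.2, 2.5]]
labels = ["Disease"] * 3 + ["Normal"] * 3
rate = cross_validate(rows, labels, folds=[[0, 3], [1, 4], [2, 5]], n_genes=1)
```

The mistake described above corresponds to calling `select_top` once on all rows before the loop; the held-out samples would then have influenced which genes the classifier sees.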

Another common mistake is to run 1-level cross-validation with multiple models and report the correct rate of the best model as the estimate of the generalization correct rate. This correct rate is optimistically biased. The reason is that in 1-level cross-validation the test set is used to select the best model, so it is no longer independent for the purpose of estimating the correct rate on an unseen dataset. Either use the 2-level cross-validation option or use another independent set to get the accuracy estimate. The idea is to partition the data into 3 sets: a training set, a validation set, and a test set. The models are trained on the training set, the validation set is used to select the best model, and the test set is used to generate an unbiased accuracy estimate.
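
The training/validation/test idea behind 2-level cross-validation can be sketched generically. This is not Partek's implementation. The callback `evaluate(model, train_idx, test_idx)` is a hypothetical function that fits `model` on the training indices and returns the number of correct predictions on the test indices; folds are taken without shuffling to keep the example short.

```python
def folds_of(indices, k):
    """Split a list of indices into k folds (no shuffling, for brevity)."""
    return [indices[i::k] for i in range(k)]

def cv_score(model, indices, k, evaluate):
    """Correct rate of `model` from k-fold cross-validation over `indices`."""
    total = 0
    for test in folds_of(indices, k):
        held = set(test)
        train = [i for i in indices if i not in held]
        total += evaluate(model, train, test)
    return total / len(indices)

def nested_cv(models, n_samples, outer_k, inner_k, evaluate):
    """Accuracy estimate of the whole 'select the best model' procedure."""
    indices = list(range(n_samples))
    correct = 0
    for outer_test in folds_of(indices, outer_k):
        held = set(outer_test)
        outer_train = [i for i in indices if i not in held]
        # Inner level: pick the best model using only the outer training set.
        best = max(models,
                   key=lambda m: cv_score(m, outer_train, inner_k, evaluate))
        # Outer level: score that choice on samples never used to make it.
        correct += evaluate(best, outer_train, outer_test)
    return correct / n_samples

# Toy setup (hypothetical): each "model" is a fixed threshold on one value.
x = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.15, 0.85]
y = [v > 0.5 for v in x]

def evaluate(threshold, train_idx, test_idx):
    # No fitting is needed for a fixed threshold, so train_idx is unused here.
    return sum((x[i] > threshold) == y[i] for i in test_idx)

estimate = nested_cv([0.0, 0.5, 1.0], len(x), outer_k=5, inner_k=4,
                     evaluate=evaluate)
```

Because the outer test samples play no part in choosing the model, the returned rate is not inflated by the selection step, which is the bias that 1-level cross-validation suffers from.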


Additional assistance


 


...