PGS Documentation

...

  • Open Partek Genomics Suite, then choose File > Open... from the main menu to open trainingSet.fmt
  • Select Tools > Predict > Model Selection from the Partek main menu
  • In the Cross-Validation tab, set Predict on to Type, Positive Outcome to Disease, and Selection Criterion to Normalized Correct Rate (Figure 1)
  • Choose the 1-Level Cross-Validation option and set Manually specify partition to 5. 1-level cross-validation is used to select the best model to deploy

...

  • Choose the Variable Selection tab and use ANOVA to select variables. Genes are selected based on the p-values generated from a 1-way ANOVA model whose factor is Type. In each iteration of cross-validation, ANOVA is performed on the training set only, and the top N genes with the most significant p-values are used to build the classifier. The Configure button allows you to specify the ANOVA model if you want to include multiple factors (Figure 2).
  • Since we don't know how many genes should be used to build the model, we will try 10, 20, 30, 40, and 50 genes. The more options you try, the longer the run takes. Under How many groups of variables do you want to try, select Multiple groups with size from 10 to 50 step 10
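
The variable selection step above can be sketched outside of Partek as follows. This is only an illustration under stated assumptions, not Partek's implementation: the function names (anova_f, top_n_genes), the toy expression values, and the "Normal" class label are hypothetical. Because every gene is tested with the same groups, ranking by the largest F statistic gives the same order as ranking by the smallest p-value, so the sketch ranks by F directly.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic for one gene, grouped by the class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = mean(values)
    n, k = len(values), len(groups)
    ss_between = sum(len(vs) * (mean(vs) - grand_mean) ** 2 for vs in groups.values())
    ss_within = sum((v - mean(vs)) ** 2 for vs in groups.values() for v in vs)
    if ss_within == 0:
        # Perfect separation (or a constant gene): avoid dividing by zero.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_n_genes(train_rows, train_labels, n):
    """Rank genes on the training samples only and return the top n column indices."""
    n_genes = len(train_rows[0])
    scores = [
        anova_f([row[j] for row in train_rows], train_labels) for j in range(n_genes)
    ]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:n]


# Hypothetical toy training set: 4 samples (rows) x 3 genes (columns).
train_rows = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 9.0],
    [3.0, 5.0, 2.5],
    [3.1, 4.9, 8.5],
]
train_labels = ["Normal", "Normal", "Disease", "Disease"]

selected = top_n_genes(train_rows, train_labels, 2)
print(selected)
```

The important design point is that the ranking uses only the training samples of the current iteration, so the held-out samples never influence which genes are chosen.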

...

  • Click on the Classification tab, select K-Nearest Neighbor, and choose 1 and 3 neighbors with the default Euclidean distance measure (Figure 3)

 

Figure: Model selection dialog (KNN configuration)
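
For readers unfamiliar with the classifier, a K-Nearest Neighbor prediction with the Euclidean distance measure can be sketched as below. This is a minimal illustration, not Partek's code: knn_predict, the toy samples, and the "Normal" label are hypothetical, and how Partek breaks ties is not stated in this tutorial.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(train_rows, train_labels, query, k):
    """Predict the class of `query` by majority vote among its k nearest training samples."""
    nearest = sorted(
        range(len(train_rows)), key=lambda i: dist(train_rows[i], query)
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 4 training samples measured on 2 selected genes.
train_rows = [[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.5, 4.8]]
train_labels = ["Normal", "Normal", "Disease", "Disease"]
query = [4.9, 5.1]

prediction_1nn = knn_predict(train_rows, train_labels, query, 1)
prediction_3nn = knn_predict(train_rows, train_labels, query, 3)
print(prediction_1nn, prediction_3nn)
```

Using odd values of k, such as 1 and 3, avoids tied votes when there are two classes.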

  • Select the Discriminant Analysis option and use the default setting, which has the Linear with equal prior probabilities option checked
  • Click on the Summary tab. We have configured 15 models to choose from (Figure 3)

 

Figure: Model selection dialog (Summary page)

The more models are configured, the longer the run takes. In this example, to save time, we specified only 15 models and chose 5-fold cross-validation. You can also click the Load Spec button to load the above configuration from the file tuturial.pcms
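
The count of 15 models follows from the settings chosen in the steps above: five variable group sizes combined with three classifier configurations. A short sketch of that arithmetic is shown here; the classifier description strings are only illustrative labels, not names used by Partek.

```python
from itertools import product

# Group sizes from 10 to 50 in steps of 10, as set in the Variable Selection tab.
variable_counts = list(range(10, 51, 10))

# Classifier configurations chosen in the Classification tab (labels are illustrative).
classifiers = [
    "KNN, 1 neighbor, Euclidean",
    "KNN, 3 neighbors, Euclidean",
    "Linear discriminant analysis, equal priors",
]

models = list(product(variable_counts, classifiers))
print(len(models))  # 5 group sizes x 3 classifiers = 15
```

Each additional group size or classifier setting multiplies the number of models, which is why the run time grows quickly as more options are added.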

When you click Run, a dialog like the one in Figure 4 is displayed. Some classifiers, such as discriminant analysis, are not recommended for datasets that have more variables than samples.

...

Since we are doing 5-fold cross-validation, about 6 of the 28 samples are held out as the test set in each iteration, and the models are built on the remaining samples (about 22) as the training set. When the run is done, all 12 models have been tested on the 28 samples and their correct rates are reported. They are displayed on the summary page in descending order of normalized correct rate, so the top one is the best of the 12 models (Figure 5).
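
The held-out and training set sizes can be checked with a little arithmetic. The sketch below assumes the 5 folds are disjoint and as equal in size as possible; how Partek actually assigns samples to partitions is not stated here, and fold_sizes is a hypothetical helper.

```python
def fold_sizes(n_samples, n_folds):
    """Sizes of n_folds disjoint folds that differ by at most one sample."""
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


test_sizes = fold_sizes(28, 5)
train_sizes = [28 - size for size in test_sizes]
print(test_sizes)
print(train_sizes)
```

Under this assumption, 28 samples do not divide evenly by 5, so three iterations hold out 6 samples (training on 22) and two hold out 5 (training on 23). Every sample is held out exactly once, which is why all 28 samples end up being tested.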

 

Figure: 1-level cross-validation result. The model with 20 variables and 3 nearest neighbors with the Euclidean distance measure is the best of the 12 models in this dataset

...

  • Choose File > Open... to browse to and open testSet.fmt
  • Choose Tools > Predict > Run Deployed Model... from the menu
  • Select 20var-3NN-Euclidean.ppb to open, then click the Test button to run it. The correct rate is reported at the top of the dialog (Figure 6)

 

Figure: Test deployed model on test set report


 

Cross validation

 

Common mistakes 

...