Quantification generates raw read counts for each feature in every sample. These read counts need to be normalized prior to differential expression detection to ensure that samples are comparable.

This chapter covers the implementation of each normalization method. The Normalize counts option appears in the context-sensitive menu (Figure 1) when you select any quantified output data node or an imported feature count data node.


The output has the same format as the input data, and the resulting node is called Normalized counts. This data node can be selected and normalized further using the same task.

Selecting Methods

Select whether you want your data normalized on a per-sample or per-feature basis (Figure 2). Some transformations, such as log transformation, are applied to each value independently of the others, so you will get an identical result regardless of your choice.
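As a small illustration of why the choice does not matter for value-wise transformations, the sketch below applies a log transformation once per sample and once per feature and compares the results. The count matrix, the offset of 1, and the helper names are hypothetical and chosen only for this example. For contrast, it also applies a transformation that depends on the other values in the row or column (dividing by the total), where the two choices disagree.

```python
import math

# Hypothetical count matrix: rows are samples, columns are features.
counts = [
    [10, 0, 250],
    [40, 5, 100],
]

def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]

def log_transform(vector, base=2.0, offset=1.0):
    """Value-wise transform: each value is handled independently of the others."""
    return [math.log(v + offset, base) for v in vector]

def divide_by_total(vector):
    """Context-dependent transform: each value is divided by the vector total."""
    total = sum(vector)
    return [v / total for v in vector]

def per_sample(matrix, method):
    return [method(row) for row in matrix]

def per_feature(matrix, method):
    return transpose([method(col) for col in transpose(matrix)])

# The log transform gives the same matrix either way.
log_is_identical = per_sample(counts, log_transform) == per_feature(counts, log_transform)

# Dividing by the total does not.
total_is_identical = per_sample(counts, divide_by_total) == per_feature(counts, divide_by_total)

print(log_is_identical, total_is_identical)
```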

 

The following normalization methods will generate different results depending on whether the transformation was performed on samples or on features:

Note that each task can perform normalization on either samples or features, not both. If you wish to perform both transformations, run two normalization tasks successively.

To normalize the data, click a method in the left panel, then drag and drop it onto the right panel. Alternatively, click the green plus button on a method to add it. Add all the normalization methods you wish to perform. Multiple methods can be added to the right panel, and they will be processed in the order they are listed. You can change the order by dragging a method up or down. To remove a method from the Normalization order panel, click the minus button to the right of the method. Click Finish when you are done choosing the normalization methods.
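Because methods are processed in the order they are listed, reordering them can change the result. The following minimal sketch shows this with two hypothetical value-wise steps (adding a constant and taking a base-2 log). The function names are illustrative only and are not the task's own identifiers.

```python
import math

def add_constant(c):
    """Return a step that adds the constant c to a value."""
    return lambda x: x + c

def log2_step():
    """Return a step that takes the base-2 logarithm of a value."""
    return lambda x: math.log2(x)

def apply_in_order(value, steps):
    """Apply each step to the value, first to last, as listed."""
    for step in steps:
        value = step(value)
    return value

raw = 7

# Order 1: add 1, then take log2  -> log2(7 + 1)
add_then_log = apply_in_order(raw, [add_constant(1), log2_step()])

# Order 2: take log2, then add 1  -> log2(7) + 1
log_then_add = apply_in_order(raw, [log2_step(), add_constant(1)])

print(add_then_log, log_then_add)
```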

Recommended Methods

For some data nodes, recommended methods are available:

If recommended methods are available, the Recommended button will appear. Clicking the button will populate the right panel with those methods (Figure 3).


Normalization Methods

Below is the notation that will be used to explain each method:

Symbol    Meaning
S         Sample
F         Feature
Xsf       Value of sample S from feature F (if normalization is performed on a quantification data node, this would be the raw read counts)
TXsf      Transformed value of Xsf
C         Constant value
b         Base of log

 

    TPM (transcripts per million)
    1. Normalize the reads by the length of the feature to generate reads per kilobase (RPK):
      RPKsf = Xsf / Lf,
      where Lf is the length of feature F in kilobases
    2. Sum all the RPKsf values in a sample:
      PRKs = ∑_{f=1}^{F} RPKsf
    3. Generate a scaling factor for each sample by normalizing the PRK of the sample to the sum of the PRK of all the samples
      ,
      where TR is the total reads across all samples
    4. Divide raw reads by the scaling factor to get TPM
      TXsf = Xsf/Ks
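For orientation, here is a minimal sketch of a TPM calculation for a single sample. The scaling-factor formula of step 3 is not reproduced in the text above, so this sketch uses the commonly used textbook definition of TPM instead (RPK values rescaled so that they sum to one million within the sample); the task's own scaling may differ. The function name and the example numbers are hypothetical.

```python
def tpm_for_sample(counts, lengths_kb):
    """Textbook TPM for one sample (an assumption, not necessarily this task's formula).

    counts     -- raw read counts, one per feature
    lengths_kb -- feature lengths in kilobases, same order as counts
    """
    # Reads per kilobase for each feature.
    rpk = [x / length for x, length in zip(counts, lengths_kb)]

    # Sum of the RPK values in the sample.
    rpk_sum = sum(rpk)
    if rpk_sum == 0:
        raise ValueError("sample has no reads; TPM is undefined")

    # Rescale so the values in the sample sum to one million.
    return [r / rpk_sum * 1_000_000 for r in rpk]

# Three features whose counts are proportional to their lengths
# end up with equal TPM values.
example = tpm_for_sample([100, 200, 300], [1.0, 2.0, 3.0])
print(example)
```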
    Upper quartile
    1. Remove all the features that have zero reads in all samples.
    2. Get the upper quartile of each sample, and divide the upper quartile by the total count of the sample.
    3. Scale the upper quartile of each sample by the geometric mean of the upper quartiles across all the samples.
    4. Divide the raw reads of a sample on each feature by the scaled upper quartile of that sample.
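The four upper quartile steps above can be sketched as follows. This is one literal reading of those steps, not the task's actual implementation: it assumes the upper quartile is the 75th percentile computed with linear interpolation, that "scaled by" means "divided by", and that every sample has a non-zero upper quartile (the geometric mean is undefined otherwise). The function name and the example matrix are hypothetical.

```python
from statistics import geometric_mean, quantiles

def upper_quartile_normalize(counts):
    """counts[s][f] is the raw read count of sample s on feature f.

    Returns (normalized matrix, per-sample scaled upper quartiles).
    """
    n_features = len(counts[0])

    # Step 1: drop features that are zero in every sample.
    keep = [f for f in range(n_features) if any(row[f] != 0 for row in counts)]
    filtered = [[row[f] for f in keep] for row in counts]

    # Step 2: upper quartile of each sample divided by the sample total.
    ratios = [
        quantiles(row, n=4, method="inclusive")[2] / sum(row)
        for row in filtered
    ]

    # Step 3: scale each ratio by the geometric mean of all ratios.
    center = geometric_mean(ratios)
    scaled = [r / center for r in ratios]

    # Step 4: divide the raw reads of each sample by its scaled upper quartile.
    normalized = [[x / k for x in row] for row, k in zip(filtered, scaled)]
    return normalized, scaled

# Two samples, five features; the first feature is zero everywhere.
matrix = [
    [0, 10, 20, 30, 40],
    [0, 10, 20, 30, 140],
]
normalized, scaled = upper_quartile_normalize(matrix)
print(scaled)
```

Because the scaled values are ratios to their own geometric mean, their product is 1, so the normalization rebalances samples against each other without changing the overall scale of the data.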

Normalization Report

The Normalization report includes the Normalization methods used, a Feature distribution table, Box-whisker plots of the Expression signal before and after normalization, and Sample histogram charts before and after normalization. Note that all visualizations are disabled for results with more than 30 samples.

Normalization methods

A summary of the normalization methods performed, listed in the order in which they were performed.

Feature distribution table  

A table that presents descriptive statistics for each sample; the last row gives the grand statistics across all samples (Figure 4).
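A table of this shape could be built along the lines of the sketch below. The specific statistics shown (minimum, maximum, mean, median) and the pooling of all values for the last row are assumptions made for illustration; the report's actual columns are those shown in Figure 4.

```python
from statistics import mean, median

def describe(values):
    """Descriptive statistics for one list of values (assumed set of columns)."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }

def distribution_table(samples):
    """samples maps a sample name to its list of values.

    Returns one row per sample plus a final row computed over all values pooled.
    """
    table = {name: describe(values) for name, values in samples.items()}
    pooled = [v for values in samples.values() for v in values]
    table["All samples"] = describe(pooled)
    return table

# Hypothetical normalized values for two samples.
table = distribution_table({"S1": [1, 2, 3], "S2": [4, 5, 6]})
for name, row in table.items():
    print(name, row)
```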

 

Expression signal

These box-whisker plots show the expression signal distribution for each sample before and after normalization. When you mouse over a box in the plot, a balloon shows detailed percentile information (Figure 5).
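The percentile information behind a box-whisker plot can be approximated with a five-number summary, as in the sketch below. Which percentiles the balloon actually reports is not stated here, so the choice of minimum, quartiles, and maximum is an assumption, as are the example values and the log2(x + 1) transformation used for the "after" data.

```python
import math
from statistics import quantiles

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, and maximum of a sample."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}

# Hypothetical signal of one sample before normalization ...
before = [0, 3, 7, 15, 31]
# ... and after a log2(x + 1) transformation.
after = [math.log2(v + 1) for v in before]

summary_before = five_number_summary(before)
summary_after = five_number_summary(after)
print(summary_before)
print(summary_after)
```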


Sample histogram

A histogram is displayed for the data before and after normalization. Each line is a sample, where the X-axis is the range of the data in the node and the Y-axis is the frequency of the values within the range. When you mouse over a circle, which represents the center of an interval, detailed information will appear in a balloon (Figure 6). It includes:

 

References