
Raw read counts are generated after quantification for each feature on all samples. These read counts need to be normalized prior to differential expression detection to ensure that samples are comparable.

This chapter covers the implementation of each normalization method. The Normalize counts option is available on the context-sensitive menu (Figure 1) upon selection of any quantified output data node or an imported feature count:

Gene counts
Transcript counts
MicroRNA counts
Cufflinks quantification
Quantification

[Figure 1: quant_task_menu.png]

The output has the same format as the input data, and the output node is called Normalized counts. This data node can be selected and normalized further using the same task.

Selecting Methods

Select whether you want your data normalized on a per-sample or per-feature basis (Figure 2). Some transformations, such as log transformation, are performed on each value independently of the others, so the result is identical regardless of your choice.

[Figure 2: transformation_orientation.png]

The following normalization methods will generate different results depending on whether the transformation was performed on samples or on features:

Divided by mean, median, Q1, Q3, std dev, sum

Subtract mean, median, Q1, Q3, std dev, sum

Quantile normalization

Note that each task can normalize on either samples or features, not both. If you wish to perform both transformations, run two normalization tasks in succession.

To normalize the data, click a method in the left panel and drag it to the right panel, or click the green plus button on the method to add it. Add every normalization method you wish to perform; methods in the right panel are processed in the order they are listed. To change the order, drag a method up or down. To remove a method from the Normalization order panel, click the red minus button to the right of the method. Click Finish when you are done choosing normalization methods.

Recommended Methods

For some data nodes, recommended methods are available:

Data nodes resulting from Quantify to annotation model (Partek E/M) or Quantify to reference (Partek E/M) contain raw read counts; the recommendation is Total Count, Add 0.0001
Cufflinks quantification data nodes output FPKM-normalized read counts; the recommendation is Add 0.0001

If recommended methods are available for the selected data node, the Recommended button appears. Clicking the button populates the right panel (Figure 3).

[Figure 3: recommend.png]
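
Because methods are processed in the order listed, the order changes the result. The following is a minimal sketch of that behavior using the recommended chain for raw counts (Total Count, then Add 0.0001). The helper names (apply_in_order, total_count, add_offset) are hypothetical, the matrix is laid out as samples in rows and features in columns, and total mapped reads is assumed to be the sum of feature counts in the sample.

```python
import numpy as np

def apply_in_order(X, steps):
    """Apply each normalization step to the matrix, in list order."""
    out = np.asarray(X, dtype=float)
    for step in steps:
        out = step(out)
    return out

def total_count(X):
    # Reads per million; total mapped reads assumed to be the row sum.
    return 1e6 * X / X.sum(axis=1, keepdims=True)

def add_offset(X):
    # Add a small constant so that zero counts survive a later log step.
    return X + 0.0001

counts = np.array([[0.0, 25.0, 75.0]])  # one sample, three features
recommended = apply_in_order(counts, [total_count, add_offset])
reversed_order = apply_in_order(counts, [add_offset, total_count])
```

With the recommended order, a zero count becomes exactly 0.0001. With the order reversed, the offset itself is rescaled to roughly one read per million, so the two results differ.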

Normalization Methods

Below is the notation that will be used to explain each method:

Symbol	Meaning
S	Sample
F	Feature
X_sf	Value of sample S from feature F (if normalization is performed on a quantification data node, this would be the raw read counts)
TX_sf	Transformed value of X_sf
C	Constant value
b	Base of log

Absolute value
TX_sf = | X_sf |
Add
TX_sf = X_sf + C
A constant value C needs to be specified
Antilog
TX_sf = b^(X_sf)
A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen
Divided by
When mean, median, Q1, Q3, std dev, or sum is selected, the corresponding statistic is calculated per sample or per feature, according to the transform on samples or features option
Example: If transform on Samples is selected, Divide by mean is calculated as:
TX_sf = X_sf/M_s
where M_s is the mean of the sample.
Example: If transform on Features is selected, Divide by mean is calculated as:
TX_sf = X_sf/M_f
where M_f is the mean of the feature.
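
The two orientations can be illustrated with a short sketch. It assumes a NumPy matrix with samples in rows and features in columns; the function name divide_by_mean and its on argument are hypothetical.

```python
import numpy as np

def divide_by_mean(X, on="samples"):
    """Divide each value by the mean of its sample (row) or feature (column)."""
    X = np.asarray(X, dtype=float)
    if on == "samples":
        return X / X.mean(axis=1, keepdims=True)  # M_s: one mean per row
    if on == "features":
        return X / X.mean(axis=0, keepdims=True)  # M_f: one mean per column
    raise ValueError("on must be 'samples' or 'features'")

X = np.array([[2.0, 4.0, 6.0],
              [1.0, 1.0, 4.0]])
by_sample = divide_by_mean(X, on="samples")
by_feature = divide_by_mean(X, on="features")
```

After dividing on samples, every row has a mean of 1; after dividing on features, every column does. The same pattern applies to the other statistics by swapping the mean for the median, a quartile, the standard deviation, or the sum.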
Log
TX_sf = log_b(X_sf)
A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen
Logit
TX_sf=log_b(X_sf/(1-X_sf))
A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen
Lower bound
A constant value C needs to be specified.
If X_sf is smaller than C, then TX_sf = C; otherwise, TX_sf = X_sf
Multiply by
TX_sf = X_sf x C
A constant value C needs to be specified
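
The value-by-value methods above (Add, Antilog, Log, Logit, Lower bound) can be sketched as small NumPy functions. The function names are hypothetical, and the base-b logarithm is computed by change of base; this illustrates the formulas and is not the product's implementation.

```python
import numpy as np

def add(X, c):
    return np.asarray(X, dtype=float) + c

def log_b(X, b):
    # log_b(x) = ln(x) / ln(b)
    return np.log(np.asarray(X, dtype=float)) / np.log(b)

def antilog(X, b):
    return np.power(b, np.asarray(X, dtype=float))

def logit(X, b):
    X = np.asarray(X, dtype=float)
    return np.log(X / (1.0 - X)) / np.log(b)

def lower_bound(X, c):
    # Values smaller than c are raised to c; all others are unchanged.
    return np.maximum(np.asarray(X, dtype=float), c)

counts = np.array([0.0, 1.0, 7.0])
logged = log_b(add(counts, 1.0), 2)        # log2(count + 1)
restored = add(antilog(logged, 2), -1.0)   # undo the two steps
floored = lower_bound(counts, 1.0)
```

Since each value is transformed independently, these functions give the same result whether the task is set to transform on samples or on features.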
Quantile normalization
A rank-based normalization method. For instance, if the transformation is performed on samples, it first ranks all the features in each sample. Let V_s be the vector of feature values of sample S sorted in ascending order. The method calculates V_m, the average of the sorted vectors across all samples, and then replaces each value in V_s with the value of the same rank in V_m. Detailed information can be found in [1].
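
A minimal sketch of quantile normalization on samples, assuming a NumPy matrix with samples in rows and features in columns. The function name is hypothetical, and tied values are ranked in the order they appear rather than averaged, which is a simplification.

```python
import numpy as np

def quantile_normalize_samples(X):
    """Replace each value with the cross-sample average at the same rank."""
    X = np.asarray(X, dtype=float)
    # Column indices that sort each sample in ascending order.
    order = np.argsort(X, axis=1, kind="stable")
    sorted_vals = np.take_along_axis(X, order, axis=1)  # V_s for each sample
    v_m = sorted_vals.mean(axis=0)                      # average sorted vector
    out = np.empty_like(X)
    for s in range(X.shape[0]):
        # The feature holding rank r in sample s receives v_m[r].
        out[s, order[s]] = v_m
    return out

X = np.array([[5.0, 2.0, 3.0],
              [4.0, 1.0, 6.0]])
normalized = quantile_normalize_samples(X)
```

After normalization every sample holds the same set of values (here 1.5, 3.5, 5.5), arranged according to the original ranks of its features.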
RPKM (Reads per kilobase of transcript per million mapped reads [2])
TX_sf = (10⁹ * X_sf)/(TMR_s*L_f)
where X_sf is the raw read count of sample S on feature F,
TMR_s is the total mapped reads of sample S, and
L_f is the length of feature F.

If quantification is performed on an aligned reads data node, the total mapped reads is the number of aligned reads. If quantification is generated from an imported read count text file, the total mapped reads is the sum of all feature reads in the sample.
If the feature is a transcript, transcript length L_f is the sum of the lengths of all the exons. If the feature is a gene, gene length is the distance between the start position of the most upstream exon and the stop position of the most downstream exon. See Bullard et al. [3] for additional comparisons with other normalization packages.

For paired reads, the Normalization option will show up as FPKM (Fragments per kilobase per million mapped reads).
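
The RPKM formula above can be sketched as follows, assuming samples in rows, features in columns, and feature lengths in bases. The function and argument names are hypothetical.

```python
import numpy as np

def rpkm(X, total_mapped_reads, feature_lengths):
    """TX_sf = 1e9 * X_sf / (TMR_s * L_f)."""
    X = np.asarray(X, dtype=float)
    tmr = np.asarray(total_mapped_reads, dtype=float)[:, np.newaxis]  # per sample
    lengths = np.asarray(feature_lengths, dtype=float)[np.newaxis, :]  # per feature
    return 1e9 * X / (tmr * lengths)

counts = np.array([[100.0, 200.0]])  # one sample, two features
values = rpkm(counts, total_mapped_reads=[1_000_000], feature_lengths=[1000, 2000])
```

The second feature has twice the reads but is twice as long, so both features receive the same RPKM of 100.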

Subtract
When mean, median, Q1, Q3, std dev, or sum is selected, the corresponding statistic is calculated per sample or per feature, according to the transform on samples or features option
Example: If transform on Samples is selected, Subtract mean is calculated as:
TX_sf = X_sf - M_s
where M_s is the mean of the sample
Example: If transform on Features is selected, Subtract mean is calculated as:
TX_sf = X_sf - M_f
where M_f is the mean of the feature
TMM (Trimmed mean of M-values)
The scaling factors are produced according to the algorithm described in Robinson and Oshlack [4]. The paper by Dillies et al. [5] contains evidence that TMM has an edge over other normalization methods.
TPM (Transcripts per million as described in Wagner et al. [6])
The following steps are performed:

1. Normalize the reads by the length of the feature to generate reads per kilobase
  RPK_sf = X_sf / L_f
2. Sum up all the RPK_sf values in a sample
  PRK_s = ∑_{f=1}^{F} RPK_sf
3. Generate a scaling factor for each sample by normalizing the PRK of the sample to the sum PRK of all the samples
  ,
  where TR is the total reads across all samples
4. Divide raw reads by the scaling factor to get TPM
  TX_sf = X_sf/K_s
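
For orientation, here is a minimal sketch of the commonly used TPM definition: each feature's reads per kilobase divided by the sample's summed reads per kilobase, scaled to one million. It is an assumption that this matches the per-sample scaling factor K_s used in the steps above, so treat it as an illustration of TPM in general. The function name is hypothetical and feature lengths are assumed to be in bases.

```python
import numpy as np

def tpm(X, feature_lengths):
    """Standard TPM: 1e6 * RPK_sf / sum over features of RPK_sf."""
    X = np.asarray(X, dtype=float)
    kilobases = np.asarray(feature_lengths, dtype=float)[np.newaxis, :] / 1000.0
    rpk = X / kilobases                            # reads per kilobase
    per_sample = rpk.sum(axis=1, keepdims=True)    # summed RPK of each sample
    return 1e6 * rpk / per_sample

counts = np.array([[10.0, 20.0, 30.0]])
values = tpm(counts, feature_lengths=[1000, 2000, 3000])
```

Under this definition the TPM values of every sample sum to one million, which makes proportions comparable across samples.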

Total count (Reads per million)
TX_sf = (10⁶ x X_sf)/TMR_s
where X_sf is the raw read count of sample S on feature F, and
TMR_s is the total mapped reads of sample S.
If quantification is performed on an aligned reads data node, the total mapped reads is the number of aligned reads. If quantification is generated from an imported read count text file, the total mapped reads is the sum of all feature reads in the sample.
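
A minimal sketch of Total count covering both sources of total mapped reads: a supplied number of aligned reads, or the per-sample sum of feature reads when none is supplied. The function and argument names are hypothetical; samples are in rows.

```python
import numpy as np

def total_count(X, total_mapped_reads=None):
    """TX_sf = 1e6 * X_sf / TMR_s."""
    X = np.asarray(X, dtype=float)
    if total_mapped_reads is None:
        # Imported count table: use the sum of all feature reads in the sample.
        tmr = X.sum(axis=1)
    else:
        # Aligned reads data node: use the supplied aligned read totals.
        tmr = np.asarray(total_mapped_reads, dtype=float)
    return 1e6 * X / tmr[:, np.newaxis]

counts = np.array([[10.0, 30.0, 60.0],
                   [5.0, 5.0, 10.0]])
from_sums = total_count(counts)
from_aligned = total_count(counts, total_mapped_reads=[200, 40])
```

When the aligned read total exceeds the sum of feature reads (for example, because some reads fall outside annotated features), the normalized values are proportionally smaller.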
Upper quartile
The method is the same as that in the LIMMA package [7].
The following is a brief summary of the calculation:

1. Remove all features that have 0 reads in all samples.
2. Get the upper quartile of each sample, and divide it by the total count of the sample.
3. Scale the upper quartile of each sample by the geometric mean of the upper quartiles across all samples.
4. Divide the raw reads of each sample on each feature by the scaled upper quartile of the sample.
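
The four steps above can be sketched as follows. This is a literal reading of the summary, not a reproduction of the referenced package: the function name is hypothetical, samples are in rows, and the upper quartile is taken with NumPy's default linear interpolation, which is an assumption.

```python
import numpy as np

def upper_quartile_normalize(X):
    """Return (normalized values, scaled factors, mask of kept features)."""
    X = np.asarray(X, dtype=float)
    keep = (X != 0).any(axis=0)              # step 1: drop all-zero features
    kept = X[:, keep]
    # Step 2: upper quartile of each sample divided by its total count.
    uq = np.percentile(kept, 75, axis=1) / kept.sum(axis=1)
    # Step 3: scale by the geometric mean across samples.
    scaled = uq / np.exp(np.mean(np.log(uq)))
    # Step 4: divide raw reads by the scaled upper quartile of the sample.
    return kept / scaled[:, np.newaxis], scaled, keep

counts = np.array([[0.0, 1.0, 2.0, 3.0, 4.0],
                   [0.0, 1.0, 1.0, 1.0, 7.0]])
normalized, factors, kept_mask = upper_quartile_normalize(counts)
```

The scaled factors always have a geometric mean of 1, so the normalization redistributes scale between samples without changing the overall magnitude of the data. The sketch assumes every sample has a non-zero upper quartile.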

Normalization Report

The Normalization report includes the Normalization methods used, a Feature distribution table, Box-whisker plots of the Expression signal before and after normalization, and Sample histogram charts before and after normalization. Note that all visualizations are disabled for results with more than 30 samples.

Normalization methods

A summary of the normalization methods performed, listed in the order they were performed.

Feature distribution table

A table that presents descriptive statistics for each sample; the last row gives the grand statistics across all samples (Figure 4).

[Figure 4: feature_distribution_norm.png]

Expression signal

These box-whisker plots show the expression signal distribution for each sample before and after normalization. When you mouse over a box in the plot, a balloon shows detailed percentile information (Figure 5).

[Figure 5: boxplot_norm.png]

Sample histogram

A histogram is displayed for the data before and after normalization. Each line is a sample; the X-axis is the range of the data in the node and the Y-axis is the frequency of values within each interval. When you mouse over a circle, which represents the center of an interval, detailed information appears in a balloon (Figure 6). It includes:

The sample name
The range of the interval, where "[" denotes an inclusive bound and ")" denotes an exclusive bound
The frequency of values within the interval

[Figure 6: histogram_normalized.png]

References

1. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics. 2003; 19(2): 185-193.
2. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008; 5(7): 621-628.
3. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010; 11: 94.
4. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11: R25.
5. Dillies MA, Rau A, Aubert J et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2013; 14(6): 671-683.
6. Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data. Theory Biosci. 2012; 131(4): 281-285.
7. Ritchie ME, Phipson B, Wu D et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43(7): e47.