Page History

...

Each combination of sequencing instrument and sequencing protocol varies in terms of how precisely it is able to estimate the transcript abundance, and/or what data biases it propagates downstream. A good quantification algorithm is expected to extract the maximal amount of information from the reads, and, importantly, to adjust for the biases of the sequencer and protocol. However, the practical impact of biases depends on the purpose of the study. Likewise, if the study agenda includes some particular items such as novel transcript discovery, this alone can determine the choice of a combination of aligner and quantification package.

Challenges and Solutions

The most recognized quantification issues common to RNA-Seq include the following:

...

Sequence (composition) bias

Position bias

GC bias

Multireads

Multireads (or multimappers) are reads aligning to multiple locations in the genome. Three different approaches to handling those reads have been described.

...

For paired-end reads with multiple alignments, it is possible to improve quantification by extracting information from the fragment (insert) size distribution. This feature is available in Cufflinks.

Sequence Bias

Sequence bias refers to the observation that the reads that have particular subsequences of nucleotides (typically, at 5’ or 3’ end) may be over- or underrepresented due to some artifacts of sequencing technology (8, 9) (for an illustration see Figure 2 in Roberts et al. (4)). The corresponding bias correction is implemented in Cufflinks.

Position Bias

Position bias (10 - 12) is manifested in unequal distribution of reads along the length of a transcript (see Figure 4 in Jiang et al. (12)). The corresponding correction is implemented in Cufflinks.

GC Bias

GC bias means that the estimated abundance of a transcript (gene) is dependent on its GC content. While Dohm et al. reported that GC-rich regions attract reads (see Figure 2 in (13)), another study argued that the dependence can go either way (see Figure 2 in Zheng et al. (14)). Although there is no available package that contains the GC bias correction Cufflinks team shown that in some cases the GC bias is highly correlated with the sequence bias and then a separate GC correction is unnecessary.

Assessing the Practical Impact of Biases

An accepted way to measure the impact of a certain bias is as follows: take a number of transcripts (genes) and perform the quantification with and without bias correction and then observe the change in concordance between the expected (known) and the estimated abundance. Initially, the expected abundances were estimated based on qPCR (4). More recently, the usage of presumably more accurate External RNA Controls Consortium (ERCC) controls took hold (12, 15). In the latter approach, the known abundance is measured as concentration of transcripts in attomoles/μL.

...

On the other hand, suppose we want to compare the expression of two isoforms that differ by a short subsequence, such as a 50 bp exon. Then, the GC bias can be practically significant because the difference in expression may be attributed to the exon’s GC content. Likewise, it may be attributed to the nucleotide composition of the exon that causes it to attract an unfair number of reads due to the sequence bias. In that case, the GC and sequence biases need to be adjusted for during quantification.

Partek EM Algorithm: Validation of Quantification Results

Data from the Sequencing Quality Control Project (SEQC) was obtained from the Mayo Clinic site, which used the Illumina HiSeq2000. The dataset consists of about one billion of 2×100 paired-end reads, which amounts to about 512 GB of uncompressed fastq files, with about 4 GB for each of the 128 samples.

...

Numbered figure captions

SubtitleText	Comparison of expected number of molecules of ERCC Mix A and the estimated number of transcript molecules (log2 of RPKM values), obtained by Partek’s modification of expectation-maximization algorithm. Each dot is an ERCC control. r^2 = 0.97, regression y = 0.98 * x - 74.41
AnchorName	comp-expected

Image RemovedImage Added

Numbered figure captions

SubtitleText	Comparison of expected number of molecules of ERCC Mix B and the estimated number of transcript molecules (log2 of RPKM values), obtained by Partek’s modification of expectation-maximization algorithm. Each dot is an ERCC control. r^2 = 0.97, regression y = 0.98 * x - 74.06
AnchorName	comp-expected2

Image RemovedImage Added

Numbered figure captions

SubtitleText	Comparison of expected fold change (log2) vs. observed fold change (log2) in ERCC Mix A vs. Mix B, for the four groups of transcripts with the known fold change values. Each dot is an ERCC control. r^2 = 0.87, regression y = 0.96 * x + 0.09
AnchorName	comp-expected3

Image RemovedImage Added

To test for possible biases in abundance estimates, we combined the approaches of Li et al. (9) and Jiang et al. (12). That amounted to regressing the estimated abundance not only on the expected abundance, but also on the transcript length, GC content, and the expected fold change, subjecting all the variables to log₂ transformation.

...

Numbered figure captions

Prefix	Table
SubtitleText	Regression of the observed transcript abundance on the expected transcript abundance, transcript length, GC content, and the expected fold change, subjecting all the variables to log2 transformation. We started with the full model containing four covariates and performed model selection based on two criteria: adjusted r^2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The assessment was performed on the Mix A of ERCC, using Partek’s modified expectation-maximization (EM) algorithm for transcript quantification.
AnchorName	table1

Image RemovedImage Added

Numbered figure captions

Prefix	Table
SubtitleText	Regression of the observed transcript abundance on the expected transcript abundance and transcript length (the best model). We started with the full model containing four covariates and performed model selection based on two criteria: adjusted r^2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The assessment was performed on the Mix A of ERCC, using Partek’s modified expectation-maximization (EM) algorithm for transcript quantification.
AnchorName	table2

Image RemovedImage Added

Numbered figure captions

Prefix	Table
SubtitleText	Regression of the observed transcript abundance on the expected transcript abundance (the benchmark model). We started with the full model containing four covariates and performed model selection based on two criteria: adjusted r^2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The assessment was performed on the Mix A of ERCC, using Partek’s modified expectation-maximization (EM) algorithm for transcript quantification.
AnchorName	table3

Image RemovedImage Added

The full and the best models for the Mix B are shown in Tables 4 and 5 (respectively). Apparently, the regression failed to find evidence of any kind of bias.

Numbered figure captions

Prefix	Table
SubtitleText	Regression of the observed transcript abundance on the expected transcript abundance, transcript length, GC content, and the expected fold change, subjecting all the variables to log2 transformation. We started with the full model containing four covariates and performed model selection based on two criteria: adjusted r^2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The assessment was performed on the Mix B of ERCC, using Partek’s modified expectation-maximization (EM) algorithm for transcript quantification.
AnchorName	table4

Image RemovedImage Added

Numbered figure captions

Prefix	Table
SubtitleText	Regression of the observed transcript abundance on the expected transcript abundance (the best model). We started with the full model containing four covariates and performed model selection based on two criteria: adjusted R2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The assessment was performed on the Mix A of ERCC, using Partek’s modified expectation-maximization (EM) algorithm for transcript quantification.
AnchorName	table5

Image RemovedImage Added

Conclusions

Although we did not report the results obtained by Cufflinks quantification algorithm, we believe that no extra analysis is necessary. First, the nature of ERCC approach does not make it a tool sensitive enough to detect the subtle effects that are estimated by Cufflinks quantification algorithm. Second, we showed that, even if those effects were taken into account, there would be very little room for improvement that could be detected by ERCC tools.

Jiang and colleagues came to a similar conclusion (12). Although they did not include the dose response and fold change plots in their manuscript, in personal correspondence they acknowledged that they had actually constructed the plots but failed to see a significant impact of the bias corrections implemented in Cufflinks. The minimal ERCC transcript length is about 250 bp, and, unless one interested in comparing isoforms that differ by a much shorter sequence (50 bp is a good estimate), the bias corrections are of little use.

References

Langmead B, Hansen KD, Leek JT. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 2010;11(8):R83.
Li J, Jiang H, Wong WH. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010;11:R50.
Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006;34:3150-3160.
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12:R22.
Trapnell C, Williams BA, Pertea G, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511-515.
Trapnell C, Roberts A, Goff L, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562-578.
Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011;27:2325-2329.
Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010;38:e131.
Li J, Jiang H, Wong WH. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010;11:R50.
Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010;26:493-500.
Leng N, Dawson JA, Thomson JA, et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013;29:1035-1043.
Jiang L, Schlesinger F, Davis CA, et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 2011;21:1543-1551.
Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008;36:e105.
Zheng W, Chung LM, Zhao H. Bias detection and correction in RNA-Sequencing data. BMC Bioinformatics. 2011;12:290.
ERCC RNA spike-in control mixes. http://tools.thermofisher.com/content/sfs/manuals/cms_086340.pdf Accessed: March 25, 2016
Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009;4:14.

...

Partek Flow Documentation

Page tree

Versions Compared

Old Version 21

New Version Current

Key

Challenges and Solutions

Multireads

Sequence Bias

Position Bias

GC Bias

Assessing the Practical Impact of Biases

Partek EM Algorithm: Validation of Quantification Results

Conclusions

References