One of the goals of RNA-seq is to detect differential gene/transcript expression (for the sake of simplicity, we use the term "gene" to represent gene or transcript in this document). Our recommendation is to perform differential expression analysis on normalized rather than raw data.

Tools for differential expression are available on Gene/Transcript counts and on Normalized counts data nodes, under the RNA-Seq analysis section of the toolbox. The possibilities are as follows:

Differential Gene Expression (GSA)

GSA stands for gene-specific analysis. Its goal is to identify, among all the selected models, the statistical model that best fits a specific gene, and then to use that best model to calculate the p-value and fold change.

GSA dialog

The first step of GSA is to choose which attributes to include in the test (Figure 1). All sample attributes, both numeric and categorical, are displayed in the dialog; use the check buttons to select the ones to include. An experiment with two attributes, Cell type (with groups A and B) and Time (time points 0, 5, 10), is used as an example in this section.

Figure 1. Choosing attributes to include in the statistical test by selecting the corresponding check button

Click Next to display the levels of each attribute to be selected for sub-group comparisons (contrasts).

...

Figure 2. Specifying attribute levels for sub-group comparisons (contrast): Select A for Cell type on the top, B for Cell type on the bottom, and click Add comparison to compare A vs B

To compare Time point 5 vs. 0, select 5 for Time on the top, 0 for Time on the bottom, and click Add comparison (Figure 3).

Figure 3. Specifying attribute levels for sub-group comparisons (contrast): Select 5 for Time on the top, 0 for Time on the bottom, and click Add comparison to compare 5 vs 0

To compare cell types at a certain time point, e.g. time point 5, select A and 5 on the top, and B and 5 on the bottom. Thereafter click Add comparison (Figure 4).

Figure 4. Specifying attribute levels for sub-group comparisons (contrast): Select A and 5 on the top, B and 5 on the bottom, and click Add comparison to compare A5 vs B5

Multiple comparisons can be computed in one GSA run; Figure 5 shows the three comparisons above added to the computation.

Figure 5. Three comparisons included in GSA computation: A vs B; 5 vs 0; and A5 vs B5

In terms of the design pool, i.e. the set of model designs to select from, the two factors in this example data lead to seven possible designs:

Cell type
Time
Cell type, Time
Cell type, Cell type * Time
Time, Cell type * Time
Cell type * Time
Cell type, Time, Cell type * Time

In GSA, if a second-order interaction term is present in the design, then all first-order terms must be present as well. This means that if the Cell type * Time interaction is present, both factors must be included in the model. In other words, the following designs are not considered:

Cell type, Cell type * Time
Time, Cell type * Time
Cell type * Time

Adding a comparison also eliminates the models that lack the factor being compared. For example, if a comparison on Cell type A vs. B is added, only designs that include the Cell type factor will be in the computation. These are:

Cell type
Cell type, Time
Cell type, Time, Cell type * Time

The more comparisons on different terms are added, the fewer models will be included in the computation. If the following comparisons are added in one GSA run:

A vs B (Cell type)
5 vs 0 (Time)

only the following two models will be computed:

Cell type, Time
Cell type, Time, Cell type * Time

If comparisons on all the three terms are added in one GSA run:

A vs B (Cell type)
5 vs 0 (Time)
A*5 vs B*5 (Cell type * Time)

then only one model will be computed:

Cell type, Time, Cell type * Time
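The pruning rules above can be reproduced with a few lines of Python. This is only an illustrative sketch of the logic described in this section, not Partek Flow's implementation; the function names (all_designs, is_hierarchical, design_pool, label) are hypothetical.

```python
from itertools import combinations

FACTORS = ["Cell type", "Time"]
INTERACTION = "Cell type * Time"
TERMS = FACTORS + [INTERACTION]


def all_designs(terms):
    """Every non-empty subset of the candidate terms (7 for three terms)."""
    return [set(c) for r in range(1, len(terms) + 1)
            for c in combinations(terms, r)]


def is_hierarchical(design):
    """The interaction is allowed only when both main effects are present."""
    return INTERACTION not in design or set(FACTORS) <= design


def design_pool(compared_terms=()):
    """Hierarchical designs that contain every term used in a comparison."""
    return [d for d in all_designs(TERMS)
            if is_hierarchical(d) and set(compared_terms) <= d]


def label(design):
    """Render a design in the same term order used in the text."""
    return ", ".join(t for t in TERMS if t in design)


for compared in ([], ["Cell type"], ["Cell type", "Time"], TERMS):
    print(compared, "->", [label(d) for d in design_pool(compared)])
```

With no comparison the pool holds the four hierarchical designs; adding comparisons on Cell type, then Time, then the interaction narrows it to three, two and one design respectively, matching the lists above.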

If GSA is invoked from a quantification output data node directly, you will have the option to use the default normalization methods before performing differential expression detection (Figure 6).

If invoked from a Partek E/M method output, the data node contains raw read counts and the default normalization is:
- Normalize to total count (RPM)
- Add 0.0001 (offset)
If invoked from a Cufflinks method output, the data node contains FPKM and the default normalization is:
- Add 0.0001 (offset)
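As a rough illustration of the two default recipes, the sketch below applies them to a single sample. It assumes that "Normalize to total count (RPM)" means dividing each count by the sample's total and multiplying by one million, which is the usual definition of reads per million; the function names are hypothetical and this is not Partek Flow's code.

```python
def rpm_with_offset(raw_counts, offset=0.0001):
    """Raw read counts of one sample -> reads per million, then add the offset."""
    total = sum(raw_counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [count / total * 1_000_000 + offset for count in raw_counts]


def fpkm_with_offset(fpkm_values, offset=0.0001):
    """FPKM values are already length- and depth-normalized: only add the offset."""
    return [value + offset for value in fpkm_values]


print(rpm_with_offset([10, 30, 60]))
print(fpkm_with_offset([0.0, 2.5]))
```

The small offset keeps zero values strictly positive, so they remain usable if the data are later log transformed.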

Figure 6. Applying default normalization if differential gene detection dialog is invoked from a quantification output data node (see text for details)

If advanced normalization needs to be applied, perform the Normalize counts task on a quantification data node before doing differential expression detection (GSA or ANOVA).

GSA advanced options

Click on Configure to customize Advanced options (Figure 7).

Figure 7. Configuring advanced GSA options

Low-expression feature

The Low-expression feature section allows you to specify criteria to exclude features that do not meet requirements for the calculation.

Lowest average coverage: the computation will exclude a feature if its geometric mean across all samples is below the specified value
Lowest maximum coverage: the computation will exclude a feature if its maximum across all samples is below the specified value
Minimum coverage: the computation will exclude a feature if its sum across all samples is below the specified value
None: include all features in the computation
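The four filter options can be expressed as a small sketch. The rule names and the keep_feature function are hypothetical, and treating a feature that has any zero value as having a geometric mean of 0 is an assumption made for illustration.

```python
import math


def geometric_mean(values):
    """Geometric mean of non-negative values; any zero makes the mean 0."""
    if min(values) <= 0:
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# One summary statistic per filter option described above.
STATISTICS = {
    "lowest_average": geometric_mean,
    "lowest_maximum": max,
    "minimum_coverage": sum,
}


def keep_feature(values, rule, threshold):
    """Return False when the feature falls below the threshold for the rule."""
    if rule == "none":
        return True
    return STATISTICS[rule](values) >= threshold


print(keep_feature([2, 8, 4], "lowest_average", 1.0))
print(keep_feature([0, 0, 3], "lowest_maximum", 5.0))
```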

Multiple test correction

Multiple test correction can be performed on the p-values of each comparison, with FDR step-up being the default (1). If you check the Storey q-value (2), an extra column with q-values will be added to the report.
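For readers unfamiliar with the FDR step-up correction, the following is a minimal sketch of the Benjamini-Hochberg adjustment described in reference (1). The function name fdr_step_up is hypothetical, and the code is a textbook version rather than the routine used by Partek Flow.

```python
def fdr_step_up(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


print(fdr_step_up([0.01, 0.04, 0.03, 0.005]))
```

Adjusted values are never smaller than the raw p-values and preserve their ordering.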

Report option

This section configures how the best model is selected for a feature. There are two options for Model selection criterion: AICc (Akaike Information Criterion corrected) and AIC (Akaike Information Criterion). AICc is recommended for small sample sizes, while AIC is recommended for medium and large sample sizes (3). Note that as the sample size grows from small to medium, AICc converges to AIC. Whichever criterion is used, GSA considers the model with the lowest value to be the best choice.

The results also include the best model's Akaike weight. A model's weight is interpreted as the probability that the model would be picked as the best if the study were reproduced. The Akaike weight ranges from 0 to 1, where a value of 1 means the best model is far superior to the other candidates in the model pool. If, on the other hand, the best model's Akaike weight is close to 0.5, the best model is likely to be replaced by another candidate if the study were reproduced. The best model is still used in that case, but its accuracy is fairly low.
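The standard textbook formulas behind these quantities can be sketched as follows. The function names are hypothetical and the formulas are the usual definitions (AIC = 2k - 2 ln L, AICc = AIC + 2k(k+1)/(n-k-1), weights proportional to exp(-delta/2)); how Partek Flow counts parameters for each model type is not covered here.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def aicc(log_likelihood, n_params, n_samples):
    """AIC with the small-sample correction term 2k(k+1)/(n-k-1)."""
    denominator = n_samples - n_params - 1
    if denominator <= 0:
        raise ValueError("AICc needs n_samples > n_params + 1")
    correction = 2 * n_params * (n_params + 1) / denominator
    return aic(log_likelihood, n_params) + correction


def akaike_weights(criteria):
    """Convert information-criterion values of a model pool into weights."""
    best = min(criteria)
    relative = [math.exp(-(value - best) / 2) for value in criteria]
    total = sum(relative)
    return [value / total for value in relative]


print(aicc(-50.0, 3, 10) - aic(-50.0, 3))    # correction for a small sample
print(aicc(-50.0, 3, 1000) - aic(-50.0, 3))  # correction nearly vanishes
print(akaike_weights([100.0, 102.0, 110.0]))
```

The first two lines of output illustrate the convergence of AICc to AIC as the sample size grows; the weights always sum to 1, and two models with equal criterion values each receive the same weight.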

The default value for Enable multimodel approach is Yes. It means that the estimation will utilize all models in the pool by assigning weights to them based on AIC or AICc. If No is selected instead, the estimation is based on only one best model which has the smallest AIC or AICc. The output p-value will be different depending on the selected option for multimodel, but the fold change is the same. Multimodel approach is recommended when the best model's Akaike weight is not close to 1, meaning that the best model is not compelling.

There are situations when a model estimation procedure does not fail outright, but still encounters some difficulties. In such cases it may still generate a p-value and fold change for the comparisons, but those values are not reliable and can be misleading. It is recommended to use only reliable estimation results, so the default for Use only reliable estimation results is set to Yes.

Model types configuration

Partek Flow provides five response distribution types for each design model in the pool, namely:

Normal
Lognormal (the same as ANOVA task)
Lognormal with shrinkage (the same as the limma-trend method (4))
Negative binomial
Poisson

We recommend using the lognormal with shrinkage distribution (the default). Experienced users may want to click on Custom to configure the model type and p-value type (Figure 8).

Figure 8. Five response distribution types for each design model

If multiple distribution types are selected, the total number of models evaluated for each feature is the product of the number of design models and the number of distribution types. In the above example, suppose we have only compared A vs B in Cell type as in Figure 2; the design model pool will then have the following three models:

Cell type
Cell type, Time
Cell type, Time, Cell type * Time

If we select Lognormal with shrinkage and Negative binomial, i.e. two distribution types, the best model fit for each feature will be selected from 3 * 2 = 6 models using AIC or AICc.
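The count of candidate models is simply the Cartesian product of the two lists, as the short sketch below shows. The variable names are illustrative only.

```python
from itertools import product

designs = [
    "Cell type",
    "Cell type, Time",
    "Cell type, Time, Cell type * Time",
]
distributions = ["Lognormal with shrinkage", "Negative binomial"]

# Every (design, distribution) pair is one candidate model for a feature.
candidates = list(product(designs, distributions))

print(len(candidates))
for design, distribution in candidates:
    print(f"{distribution}: {design}")
```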

The design pool can also be restricted by Min error degrees of freedom. The minimal error degrees of freedom is set to the largest k, with 0 <= k <= 6, for which admissible models exist, where k is the error degrees of freedom of a model in the design pool. An admissible model is one that can be estimated given the specified contrasts. In the above example, when we compare A vs B in Cell type, there are three possible design models. The error degrees of freedom of the model Cell type is the largest, and that of the model Cell type, Time, Cell type * Time is the smallest:

k(Cell type) > k(Cell type, Time) > k (Cell type, Time, Cell type*Time)

If the sample size is large enough that k >= 6 in all three models, all the models will be evaluated and the best model will be selected for each feature. However, if the sample size is so small that none of the models has k >= 6, only the model with the maximal k will be used in the calculation.
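The effect of this restriction can be illustrated with a small sketch. It rests on several assumptions that are not stated in the text: the design is full rank, both factors are treated as categorical (Cell type with 2 levels, Time with 3), the error degrees of freedom equal the number of samples minus the number of estimated mean parameters, and a model is kept when its k reaches the threshold. All names are hypothetical.

```python
LEVELS = {"Cell type": 2, "Time": 3}
POOL = [
    ["Cell type"],
    ["Cell type", "Time"],
    ["Cell type", "Time", "Cell type * Time"],
]


def n_parameters(design):
    """Intercept, plus (levels - 1) per factor, multiplied within an interaction."""
    count = 1
    for term in design:
        term_df = 1
        for factor in term.split(" * "):
            term_df *= LEVELS[factor] - 1
        count += term_df
    return count


def error_df(design, n_samples):
    """Error degrees of freedom k of a full-rank model."""
    return n_samples - n_parameters(design)


def restrict_pool(designs, n_samples, cap=6):
    """Keep designs whose k reaches the largest attainable threshold up to cap."""
    dfs = [error_df(d, n_samples) for d in designs]
    threshold = max(0, min(cap, max(dfs)))
    return [d for d, k in zip(designs, dfs) if k >= threshold]


print([error_df(d, 12) for d in POOL], restrict_pool(POOL, 12))
print([error_df(d, 7) for d in POOL], restrict_pool(POOL, 7))
```

With 12 samples every model has k >= 6 and all three are kept; with 7 samples no model reaches 6, so only the model with the largest k (Cell type) remains.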

There are two types of p-value: F and Wald. Poisson, negative binomial and normal models can generate p-values using either Wald or F statistics. Lognormal models always employ the F statistic. The more replicates in the study, the smaller the difference between the two options. When there are no replicates, only the Poisson model can be used to generate p-values, using the Wald statistic.

Partek Flow keeps track of the log status of the data; regardless of whether GSA is performed on logged data, the fold change is always calculated in linear scale.
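To make the linear-scale convention concrete, the sketch below converts a pair of group means into a linear ratio and a signed fold change. The signed convention (ratios below 1 reported as the negative reciprocal) is an assumption based on the fold-change cutoffs of -2 and 2 mentioned for the volcano plot; the function names are hypothetical.

```python
def linear_ratio(numerator_mean, denominator_mean, log_base=None):
    """Ratio of two group means in linear scale.

    If the means were computed on log-transformed data, pass the base so the
    difference of logs is converted back to a linear ratio.
    """
    if log_base is None:
        return numerator_mean / denominator_mean
    return log_base ** (numerator_mean - denominator_mean)


def fold_change(ratio):
    """Signed fold change: ratio if >= 1, otherwise the negative reciprocal."""
    return ratio if ratio >= 1 else -1 / ratio


print(fold_change(linear_ratio(3.0, 1.0, log_base=2)))  # log2 means 3 and 1
print(fold_change(linear_ratio(2.0, 8.0)))              # linear means 2 and 8
```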

GSA report

If there are multiple design models and multiple distribution types included in the calculation, the fraction of genes using each model and type will be displayed as pie charts in the task result (Figure 9).

Figure 9. Pie charts of proportion of genes using each model and distribution in gene-specific analysis calculation

The feature list, with the p-value and fold change generated from the selected best model, is displayed in a table together with other statistical information (Figure 10).

Figure 10. Feature list on the gene-specific analysis result. Clicking on the column header sorts the table. Panel on the left filters the table

The following information is included in the table by default:

Feature ID information: if transcript level analysis was performed, and the annotation file has both transcript and gene level information, both gene ID and transcript ID are displayed. Otherwise, the table shows only the available information.
Total reads: total number of raw reads across all the samples. Raw reads are retrieved from the quantification data node
Each contrast outputs p-value, FDR step up p-value, ratio and fold change in linear scale, LSmean of each group comparison in linear scale

By clicking on Optional columns, you can retrieve additional annotation, such as genomic location or strand information, if it is available in the annotation model you specified for quantification.

...

Figure 11. Volcano plot in comparison A vs B. X-axis represents fold change (linear scale), Y-axis represents negative logged p-value (unadjusted), each dot is a feature. The horizontal line represents p-value of 0.05, two vertical lines represent fold change of -2 and 2. Lower left corner displays number of features passing the fold-change and p-value criteria
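The quantities shown on the volcano plot can be sketched as below. The log base (10), the inclusive cutoffs and the example feature values are assumptions made for illustration; they are not taken from Partek Flow.

```python
import math


def volcano_point(fold_change, p_value):
    """Coordinates of one feature: (fold change, -log10 of the p-value)."""
    return fold_change, -math.log10(p_value)


def passes(fold_change, p_value, fc_cutoff=2.0, p_cutoff=0.05):
    """True when the feature lies outside the vertical lines and above the horizontal one."""
    return abs(fold_change) >= fc_cutoff and p_value <= p_cutoff


# Hypothetical features: (name, signed fold change, unadjusted p-value)
features = [("gene1", -2.5, 0.01), ("gene2", 1.5, 0.001), ("gene3", 3.0, 0.2)]
n_passing = sum(passes(fc, p) for _, fc, p in features)
print(n_passing)
```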

The feature list filter panel is on the left of the table (Figure 12). Click on the black triangle to collapse and expand the panel.

Select the check box of the field and specify the cutoff, and press Enter to apply. After the filter has been applied, the total number of included features will be updated on the top of the panel (Result).

...

Figure 12. Feature list filter panel

...

If lognormal with shrinkage method was selected for GSA, a shrinkage plot is generated in the report (Figure 13). X-axis shows the log2 value of average coverage. The plot helps to determine the threshold of low expression features. If there is an increase before a monotone decrease trend on the left side of the plot, you need to set a higher threshold on the low expression filter. Detailed information on how to set the threshold can be found in the GSA white paper.

Figure 13. Shrinkage plot generated on the lognormal with shrinkage model. X-axis represents average coverage in log2 scale; Y-axis represents log2 standard deviation of the error term. Each green dot represents the standard deviation of the residual error obtained from the lognormal linear model for a gene; the black line represents the trend of how the errors change with average gene expression; each red dot represents the adjusted (shrunk) standard deviation of the error for a gene

Differential gene expression (ANOVA)

The ANOVA method applies a specified lognormal model to all the features.

ANOVA dialog

To set up an ANOVA model, select factors from the sample attributes. The factors can be categorical or numeric attributes. Click on a check button to select a factor, then click the Add factors button to add it to the model (Figure 14).

Figure 14. ANOVA dialog: selecting factors and/or interactions to add to the model

When more than one factor is selected, the Add interaction button is enabled so that you can specify an interaction.

Once a factor is added to the model (Figure 14), you can specify whether the factor is a random effect (check Random check box) or not.

Most factors in an analysis of variance are fixed factors, i.e. the levels of that factor represent all the levels of interest. Examples of fixed factors include gender, race, strain, etc. However, in experiments that are more complex, a factor can be a random effect, meaning the levels of the factor only represent a random sample of all of the levels of interest. Examples of random effects include subject and batch.

Consider the example where one factor is type (with levels normal and diseased), and another factor is subject (the subjects selected for the experiment). In this example, “type” is a fixed factor since the levels normal and diseased represent all conditions of interest. “Subject”, on the other hand, is a random effect since the subjects are only a random sample of all the levels of that factor. When a model has both fixed and random effects, it is called a mixed model.

When more than one factor is added to the model, click on the Cross tabulation link at the bottom to view the relationship between the factors (Figure 15).

Figure 15. Cross tabulation table showing breakdown of samples across groups (the model contains one factor with three and one factor with two levels)
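A cross tabulation of two factors is just a count of samples for every combination of levels. The sketch below builds one from a hypothetical list of sample attribute records; the sample data and the cross_tabulation function are illustrative only.

```python
from collections import Counter


def cross_tabulation(samples, row_attr, col_attr):
    """Count samples for every (row level, column level) combination."""
    counts = Counter((s[row_attr], s[col_attr]) for s in samples)
    rows = sorted({s[row_attr] for s in samples})
    cols = sorted({s[col_attr] for s in samples})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


# Hypothetical sample attributes
samples = [
    {"Cell type": "A", "Time": 0},
    {"Cell type": "A", "Time": 5},
    {"Cell type": "B", "Time": 0},
    {"Cell type": "B", "Time": 0},
]

table = cross_tabulation(samples, "Cell type", "Time")
for level, row in table.items():
    print(level, row)
```

A cell with a count of 0 (here B at time 5) indicates a combination of levels with no samples, which is worth checking before adding an interaction term.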

...

Figure 16. ANOVA comparisons setup dialog: The example in the figure shows a single factor (Cell type) with two levels (A and B). A contrast A vs. B has been set

Start by choosing a factor or interaction from the Factor drop-down list. The subgroups of the factor or interaction will be displayed in the left panel; click to select a subgroup name and move it to one of the panels on the right. The fold change calculation on the comparison will use the group in the top panel as numerator, and the group in the bottom panel as the denominator. Click on Add comparison button to add one comparison to the comparisons table. Note that multiple comparisons can be added to the specified model.

ANOVA advanced options

Click on Configure to customize Advanced options (Figure 17).

Figure 17. Configuring advanced options when running ANOVA

...

Report option

- Use only reliable estimation results: There are situations when a model estimation procedure does not fail outright, but still encounters some difficulties. In such cases it may still generate a p-value and fold change for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is set to Yes.
- Display p-value for effects: If set to No, only the p-values of the comparisons are displayed in the report; the p-values of the factors and interaction terms are not shown in the report table. If set to Yes, type III p-values are displayed for all the non-random terms in the model, in addition to the comparisons' p-values.
- Report partial correlations: If the model has a numeric factor(s), when choosing Yes, partial correlation coefficient(s) of the numeric factor(s) will be displayed in the result table. When choosing No, partial correlation coefficients are not shown.
- Data has been log transformed with base: showing the current scale of the input data on this task.

ANOVA report

Since there is only one model for all features, the report contains no pie charts of design models or response distributions. The gene list table format is the same as in the GSA report.

Transcript expression analysis (Cuffdiff)

This option is only available when a Cufflinks quantification node is selected. Detailed implementation information can be found in the Cuffdiff manual [5].

When the task is selected, the dialog will display all the categorical attributes with more than one subgroup (Figure 18).

Figure 18. Cuffdiff setup dialog. “Select attributes(s) to groups samples” lists the categorical attributes which have at least two levels (e.g. “Cell type” and “Time”)

When an attribute is selected, pairwise comparisons of all the levels will be performed independently.
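For an attribute with several levels, the set of pairwise comparisons can be enumerated as in the sketch below. The function name is hypothetical, and the direction of each comparison (which level is the numerator) is not specified in the text, so the order of each pair here carries no meaning.

```python
from itertools import combinations


def pairwise_comparisons(levels):
    """All unordered pairs of attribute levels."""
    return list(combinations(levels, 2))


print(pairwise_comparisons([0, 5, 10]))   # Time: three levels, three pairs
print(pairwise_comparisons(["A", "B"]))   # Cell type: two levels, one pair
```

An attribute with n levels yields n(n-1)/2 comparisons, so the number of tests grows quickly with the number of levels.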

Click on Configure button in the Advanced options to configure normalization method and library types (Figure 19).

Figure 19. Advanced options of Cuffdiff

There are three library normalization methods:

Classic-fpkm: library size factor is set to 1, so no scaling is applied to FPKM values
Geometric: FPKMs are scaled via the median of the geometric means of the fragment counts across all libraries [6]. This is the default option (and is identical to the one used by DESeq)
Quartile: FPKMs are scaled via the ratio of the 75th percentile (upper quartile) fragment count to the average 75th percentile value across all libraries
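The three scaling schemes can be illustrated with simplified per-library scale factors. This sketch follows the median-of-ratios idea from Anders and Huber [6] and a plain upper-quartile ratio; it is not Cuffdiff's implementation, and details such as the handling of zero counts and the exact percentile definition are assumptions. All function names are hypothetical.

```python
import math
from statistics import median, quantiles


def classic_factors(libraries):
    """Classic-fpkm: every library keeps a size factor of 1."""
    return [1.0 for _ in libraries]


def geometric_factors(libraries):
    """Median, per library, of count / per-gene geometric mean across libraries."""
    n_genes = len(libraries[0])
    # Genes with a zero count in any library are skipped (assumption).
    usable = [g for g in range(n_genes) if all(lib[g] > 0 for lib in libraries)]
    log_geo_mean = {
        g: sum(math.log(lib[g]) for lib in libraries) / len(libraries)
        for g in usable
    }
    return [
        math.exp(median(math.log(lib[g]) - log_geo_mean[g] for g in usable))
        for lib in libraries
    ]


def quartile_factors(libraries):
    """Upper quartile of each library divided by the average upper quartile."""
    upper = [quantiles(lib, n=4, method="inclusive")[2] for lib in libraries]
    average = sum(upper) / len(upper)
    return [q / average for q in upper]


# Two hypothetical libraries; the second is sequenced twice as deeply.
libraries = [[10, 20, 30], [20, 40, 60]]
print(classic_factors(libraries))
print(geometric_factors(libraries))
print(quartile_factors(libraries))
```

In this toy example both data-driven methods assign the second library a factor twice that of the first, reflecting its doubled depth.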

...

Fr-unstranded: reads from the left-most end of the fragment in transcript coordinates map to the transcript strand, and the right-most end maps to the opposite strand. E.g. standard Illumina
Fr-firststrand: reads from the left-most end of the fragment in transcript coordinates map to the transcript strand, and the right-most end maps to the opposite strand. The right-most end of the fragment is the first sequenced or only sequenced for single-end reads. It is assumed that only the strand generated during first strand synthesis is sequenced. E.g. dUTP, NSR, NNSR
Fr-secondstrand: reads from the left-most end of the fragment in transcript coordinates map to the transcript strand, and the right-most end maps to the opposite strand. The left-most end of the fragment is the first sequenced or only sequenced for single-end reads. It is assumed that only the strand generated during second strand synthesis is sequenced. E.g. Directional Illumina, standard SOLiD.

...

Figure 20. Cuffdiff task report. Each row is a feature; p-value, q-value and log2 fold change columns are displayed for each comparison

In the p-value column, besides an actual p-value, which means the test was performed successfully, there are also the following flags, which indicate that the test was not successful:

NOTEST: not enough alignments for testing
LOWDATA: too complex or shallowly sequenced
HIGHDATA: too many fragments in locus
FAIL: when an ill-conditioned covariance matrix or other numerical exception prevents testing
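When processing an exported report, the p-value column therefore holds either a number or one of these flags. The sketch below separates the two cases; it assumes the column has been read as text, and the function name parse_p_value is hypothetical.

```python
FLAGS = {
    "NOTEST": "not enough alignments for testing",
    "LOWDATA": "too complex or shallowly sequenced",
    "HIGHDATA": "too many fragments in locus",
    "FAIL": "ill-conditioned covariance matrix or other numerical exception",
}


def parse_p_value(cell):
    """Return (p_value, flag); exactly one of the two is None."""
    text = cell.strip()
    if text in FLAGS:
        return None, text
    return float(text), None


for cell in ["0.003", "LOWDATA", " NOTEST "]:
    p_value, flag = parse_p_value(cell)
    print(p_value if flag is None else f"{flag}: {FLAGS[flag]}")
```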

...

References

Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, JRSS, B, 57, 289-300.
Storey JD. (2003) The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31: 2013-2035.
Auer, 2011, A two-stage Poisson model for testing RNA-Seq
Burnham, Anderson, 2010, Model selection and multimodel inference
Law C, Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 2014 15:R29.
http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html#cuffdiff-output-files
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biology, 2010
