This tutorial will illustrate how to:

 

Note: the workflow described below is enabled in Partek Genomics Suite (PGS) version 7.0. Please contact the Partek Licensing Team at licensing@partek.com to request this version or update the software release via Help > Check for Updates from the main command line. The screenshots shown below may vary across platforms and across different versions of PGS.

 

Description of the Data Set

Down syndrome is caused by an extra copy of all or part of chromosome 21; it is the most common non-lethal trisomy in humans. The study used in this tutorial revealed a significant upregulation of chromosome 21 genes at the gene expression level in individuals with Down syndrome; this dysregulation was largely specific to chromosome 21 only and not to any other chromosomes. This experiment was performed using the Affymetrix GeneChip™ Human U133A arrays. It includes 25 samples taken from 10 human subjects and 4 different tissues.

The raw data for this study is available as experiment number GSE1397 in the Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/.

Data and associated files for this tutorial can be downloaded by going to Help > On-line Tutorials from the PGS main menu. The data can also be downloaded directly from: http://www.partek.com/Tutorials/microarray/Gene_Expression/Down_Syndrome/Down_Syndrome-GE.zip.

Importing Affymetrix CEL Files

Download the data from the Partek® site to your local disk. The zip file contains both data and annotation files.

Figure 1: Selecting the gene expression workflow

Figure 2: Selecting the folder and CEL files for the experiment

 

Figure 3: Import Affymetrix CEL Files dialog

Figure 4: Advanced Import Options

 

PGS will automatically assign the annotation files according to the chip type stored in the .CEL files. If the annotation files are not available in the library directory, PGS will automatically download them and store them in the Default Library File Folder.

Figure 5: 

Figure 6: 

After the .CEL files have been imported, the result file will open in PGS as a spreadsheet named 1 (Down_Syndrome-GE). The spreadsheet should contain 25 rows representing the microarray chips (samples) and over 22,000 columns representing the probe sets (genes) (Figure 7).

Figure 7: Viewing the main or top-level spreadsheet with imported .CEL files

For additional information on importing data into PGS, see Chapter 4 Importing and Exporting Data in the Partek User’s Manual. The User’s Manual is available from the Partek Genomics Suite menu under Help > User’s Manual. The FAQ (Help > On-line Tutorials > FAQ) may also be helpful. As this tutorial only addresses some topics, you may need to consult the User’s Manual for additional information about other useful features.

We recommend becoming familiar with Chapter 6 The Pattern Visualization System of the User’s Manual before going through the next section of the tutorial.

Adding Sample Information

Twenty-five CEL files (samples) have been imported into PGS as shown in Figure 7. Sample information must be added to define the grouping and the goals of the experiment.

 

In this tutorial, the file name (e.g., Down Syndrome-Astrocyte-748-Male-1-U133A.CEL) contains information about the sample, with the fields separated by hyphens (-). Choosing to split the file name by delimiters will separate the categories into different columns (Figure 8).
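The idea of splitting a file name by a delimiter can be sketched outside PGS in a few lines of Python. This is only an illustration, not how PGS implements it. The column names Type, Tissue, Subject and Gender follow the factor names used in this tutorial; Field5 and Chip are hypothetical labels for the last two fields, whose meaning is assumed here.

```python
from pathlib import PurePath

# Type, Tissue, Subject and Gender are factor names from the tutorial.
# "Field5" and "Chip" are hypothetical labels (assumptions) for the last two fields.
COLUMNS = ["Type", "Tissue", "Subject", "Gender", "Field5", "Chip"]


def split_sample_name(file_name, delimiter="-"):
    """Split a CEL file name into one value per category column."""
    stem = PurePath(file_name).stem  # drop the .CEL extension
    parts = stem.split(delimiter)
    if len(parts) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} fields, got {len(parts)}: {file_name}"
        )
    return dict(zip(COLUMNS, parts))


info = split_sample_name("Down Syndrome-Astrocyte-748-Male-1-U133A.CEL")
print(info)
```

Each key in the resulting dictionary corresponds to one column that would be added to the spreadsheet.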



Figure 8: Configuring the Sample Information Creation dialog

Figure 9: Changing column properties

Note: More details on Random vs. Fixed Effects can be found later in this tutorial under the section Identifying Differentially Expressed Genes using the ANOVA.

Exploratory Data Analysis

At this point in analysis, you would explore the data preliminarily. Do the genes you expected to be differentially regulated appear to have larger or smaller intensity values?  Do similar samples resemble each other? 

The latter question can be explored using Principal Components Analysis (PCA), an excellent method for reducing and visualizing high-dimensional data.

Figure 10: PCA Scatter Plot tab

In the scatter plot, each point represents a chip (sample) and corresponds to a row on the top-level spreadsheet. The color of the dot represents the type of the sample; red represents a normal sample and blue represents a Down syndrome sample. Points that are close together in the plot have similar intensity values across the probe sets on the whole chip (genome), and points that are far apart in the plot are dissimilar.

As you can see from rotating the plot, there is no clear separation between Down syndrome and normal samples in this data since the red and blue samples are not separated in space. However, there are other factors that may separate the data.

Figure 11: Configuring the PCA scatter plot: Color by Tissue, size by Type

Notice now that the data are clustered by different tissues (Figure 12). 

Figure 12: Configured PCA scatter plot

By rotating this PCA plot, you can see that the data is separated by tissues, and within some of the tissues, the Down syndrome samples and normal samples are separated. For example, in the Astrocyte and Heart tissues, the Down syndrome samples (small dots) are on the left, and the normal samples (large dots) are on the right (Figure 13).

Figure 13: PCA scatter plot with ellipses, rotated to show separation by Type 

PCA is an example of exploratory data analysis and is useful for identifying outliers and major effects in the data. From the scatter plot, you can see that tissue is the biggest source of variation. Many genes are expressed differently among the 4 tissues, but fewer genes are expressed differently between types (Down syndrome and normal) across the whole chip (genome).

The next step is to draw a histogram to examine the samples. Select Plot Sample Histogram in the QA/QC section of the Gene Expression workflow to generate the Histogram tab (Figure 14).

Figure 14: Histogram tab 

The histogram plots one line for each of the samples with the intensity of the probes graphed on the X-axis and the frequency of the probe intensity on the Y-axis. This allows you to view the distribution of the intensities to identify any outliers. In this dataset, all the samples follow the same distribution pattern indicating that there are no obvious outliers in the data. As demonstrated with the PCA plot, if you click on any of the lines in the histogram, the corresponding row will be highlighted in the spreadsheet 1 (Down_Syndrome-GE). You can also change the way the histogram displays the data by clicking on the Plot Properties button. Explore these options on your own.

The decision to discard any samples would be based on information from the PCA plot, sample histogram plot, and QC metrics. To discard a sample and renormalize the data (without the effects of the outlier), start over with importing samples and omit the outlier sample(s) during the .CEL file import.

 

Identifying Differentially Expressed Genes using the ANOVA

Analysis of variance (ANOVA) is a very powerful technique for identifying differentially expressed genes in a multi-factor experiment such as this one. In this data set, the ANOVA will be used to generate a list of genes that are significantly different between Down syndrome and normal samples with an absolute fold change greater than 1.3.

The ANOVA model should include Type since it is the primary factor of interest. From the exploratory analysis using the PCA plot, we observed that tissue is a large source of variation; therefore, Tissue should be included in the model. In the experiment, multiple samples were taken from the same subject, so Subject must be included in the model. If Subject were excluded from the model, the ANOVA assumption that samples within groups are independent would be violated. Additionally, the PCA scatter plot showed that the Down syndrome and normal samples separated within tissue type, so the Type*Tissue interaction should be included in the model.



Figure 15: ANOVA configuration 


Random vs. Fixed Effects – Mixed Model ANOVA

Most factors in ANOVA are fixed effects, whose levels in a data set represent all the levels of interest. In this study, Type and Tissue are fixed effects. If the levels of a factor in a data set only represent a random sample of all the levels of interest (for example, Subject), the factor is a random effect. The ten subjects in this study represent only a random sample of the global population about which inferences are being made. Random effects are colored red on the spreadsheet and in the ANOVA dialog. When the ANOVA model includes both random and fixed factors, it is a mixed-model ANOVA.

Another way to determine if a factor is random or fixed is to imagine repeating the experiment. Would the same levels of each factor be used again?

You can specify which factors are random and which are fixed when you import your data or after importing by right-clicking on the column corresponding to a categorical variable, selecting Properties, and checking Random effect. By doing that, the ANOVA will automatically know which factors to treat as random and which factors to treat as fixed.

 

Nested/Nesting Relationships

The Subject factor in the ANOVA model is listed as “8. Subject (6. Type)”; this means that Subject is nested in Type. PGS can automatically detect this sort of hierarchical design and will adjust the ANOVA calculation accordingly.
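To make the nesting relationship concrete, here is a minimal sketch of one way such a design could be checked: a factor is nested in another if each of its levels occurs with exactly one level of the other. This is not PGS's own detection logic, which the tutorial does not describe, and the sample annotations below are made up for illustration.

```python
from collections import defaultdict


def is_nested(samples, inner, outer):
    """Return True if every level of `inner` occurs with exactly one level of `outer`."""
    seen = defaultdict(set)
    for row in samples:
        seen[row[inner]].add(row[outer])
    return all(len(levels) == 1 for levels in seen.values())


# Made-up sample annotations, for illustration only.
samples = [
    {"Subject": "S1", "Type": "Down Syndrome", "Tissue": "Astrocyte"},
    {"Subject": "S1", "Type": "Down Syndrome", "Tissue": "Heart"},
    {"Subject": "S2", "Type": "Normal", "Tissue": "Astrocyte"},
    {"Subject": "S2", "Type": "Normal", "Tissue": "Heart"},
]

subject_in_type = is_nested(samples, "Subject", "Type")
subject_in_tissue = is_nested(samples, "Subject", "Tissue")
print(subject_in_type, subject_in_tissue)
```

In this toy data each subject belongs to a single Type but contributes several tissues, so Subject is nested in Type and not in Tissue.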

Linear Contrasts

By default, an ANOVA only outputs a p-value for each factor/interaction. To get the fold change and ratio between Down syndrome and normal samples, a contrast must be set up.

Figure 16: Configuring contrasts for ANOVA

Because the data is log2 transformed, PGS will detect this and automatically select Yes for Data is already log transformed? at the top right-hand corner. PGS will use the geometric mean of the samples in each group to calculate the fold change and mean ratio for the contrast between the Down syndrome and Normal samples.
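The arithmetic behind a ratio and fold change on log2 data can be sketched as follows. Averaging log2 values and then back-transforming is equivalent to taking the ratio of geometric means on the original scale. The sketch uses plain group means and made-up values; PGS derives its estimates from the fitted ANOVA model, so its numbers may differ. The signed fold-change convention (ratios below 1 reported as a negative reciprocal) is an assumption based on the negative fold-change cutoffs used in this tutorial.

```python
from statistics import mean


def ratio_and_fold_change(log2_group_a, log2_group_b):
    """Compare two groups of log2 intensities.

    Returns (ratio, fold_change), where ratio is the geometric-mean ratio A/B
    and fold_change is the signed version of that ratio (assumed convention).
    """
    diff = mean(log2_group_a) - mean(log2_group_b)  # difference in log2 space
    ratio = 2.0 ** diff                             # back to the linear scale
    fold_change = ratio if ratio >= 1.0 else -1.0 / ratio
    return ratio, fold_change


# Made-up log2 intensities for one gene.
up = ratio_and_fold_change([7.0, 7.0], [6.0, 6.0])
down = ratio_and_fold_change([6.0, 6.0], [7.0, 7.0])
print(up, down)
```

With these values the first comparison gives a ratio of 2 and a fold change of 2, and the reversed comparison gives a ratio of 0.5 and a fold change of -2.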

The result will be displayed in a child spreadsheet, ANOVA-3way (ANOVAResults). In the child result spreadsheet, each row represents a gene, and the columns represent the computation results for that gene (Figure 17). By default, the genes are sorted in ascending order by the p-value of the first categorical factor. In this tutorial, Type is the first categorical factor, which means the gene most significantly differentially expressed between Down syndrome and normal samples is at the top of the spreadsheet in row 1.

Figure 17: ANOVA spreadsheet

For additional information about ANOVA in PGS, see Chapter 11 Inferential Statistics in the User’s Manual (Help > User’s Manual).

 

Viewing the Sources of Variation

Deciding which factors to include in the ANOVA may be an iterative process, since not all factors and interactions have to be included in the model. For example, in this tutorial, Gender and Scan date were not included. The Sources of Variation plot is a way to quantify the relative contribution of each factor in the model towards explaining the variability of the data.

Figure 18: Sources of Variation tab showing a bar chart

 

This plot presents the mean signal-to-noise ratio of all the genes on the microarray. All the factors in the ANOVA model are listed on the X-axis (including random error). The Y-axis represents the mean, across all the genes, of the ratio of each factor's mean square to the mean square error. Mean square is ANOVA’s measure of variance. Compare each signal bar to the error bar; if a factor bar is higher than the error bar, that factor contributed significant variation to the data across all the variables. Notice that this plot is very consistent with the results in the PCA scatter plot. In this data, on average, Tissue is the largest source of variation.
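The bar heights described above can be sketched as follows, assuming each bar is the average over genes of the per-gene ratio of a factor's mean square to the mean square error. That reading of the description is an assumption, and the mean squares below are made up for three hypothetical genes.

```python
from statistics import mean


def mean_f_ratio(factor_mean_squares, error_mean_squares):
    """Average, over genes, of (factor mean square / error mean square)."""
    return mean(
        ms / mse for ms, mse in zip(factor_mean_squares, error_mean_squares)
    )


# Made-up mean squares for three genes, for illustration only.
tissue_ms = [8.0, 6.0, 10.0]
type_ms = [3.0, 3.0, 3.0]
error_ms = [2.0, 2.0, 2.0]

bars = {
    "Tissue": mean_f_ratio(tissue_ms, error_ms),
    "Type": mean_f_ratio(type_ms, error_ms),
    "Error": mean_f_ratio(error_ms, error_ms),
}
print(bars)
```

By construction the Error bar sits at 1, which is why it serves as the reference when judging the other bars.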

To view the source of variation for each individual gene, right click on a row header in the ANOVA-3way (ANOVAResults) spreadsheet and select the Sources of Variation item from the pop-up menu. This generates a Sources of Variation tab for the individual gene. View a few Sources of Variation plots from rows at the top of the ANOVA table and a few from the bottom of the table.

Another useful graph is the ANOVA Interaction Plot, which is also accessed by right-clicking on a row header in the ANOVA spreadsheet. Select ANOVA Interaction Plot from the options to generate an Interaction Plot tab for that individual gene. Generate these plots for rows 3 (DSCR3) and 8 (CSTB). If the lines in this plot are not parallel, there may be an interaction between Tissue and Type. DSCR3 is a good example of this. We can look at the p-values in column 9, p-value(Type * Tissue), to check if this apparent interaction is statistically significant.

 

Create Gene List

Now that you have obtained statistical results from the microarray experiment, you can take the result of 22,283 genes and create a new spreadsheet of just those genes that pass certain criteria. This will streamline data management by focusing on just those genes with the most significant differential expression or substantial fold change. In PGS, the List Manager can be used to specify numerous conditions for generating a list of genes of interest. In this tutorial, we are going to create a list of genes with a fold change greater than 1.3 or less than -1.3 and a significance FDR of 20%. The following section will illustrate how to use the List Manager to create this gene list.
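The filtering logic can be sketched as below. The sketch uses the Benjamini-Hochberg step-up procedure as one common way to control the FDR; the tutorial does not state which FDR method the List Manager applies, so treat that choice as an assumption. The gene names, p-values, and fold changes are made up.

```python
def bh_significant(p_values, fdr=0.20):
    """Benjamini-Hochberg step-up: return the indices declared significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its step-up threshold.
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return set(order[:cutoff_rank])


def gene_list(genes, fold_cutoff=1.3, fdr=0.20):
    """Keep genes that pass the FDR and have |fold change| above the cutoff."""
    keep = bh_significant([g["p"] for g in genes], fdr)
    return [
        g["name"]
        for i, g in enumerate(genes)
        if i in keep and abs(g["fold_change"]) > fold_cutoff
    ]


# Made-up ANOVA results, for illustration only.
genes = [
    {"name": "gene_A", "p": 0.001, "fold_change": 1.8},
    {"name": "gene_B", "p": 0.03, "fold_change": -1.5},
    {"name": "gene_C", "p": 0.04, "fold_change": 1.1},
    {"name": "gene_D", "p": 0.60, "fold_change": 2.0},
]

selected = gene_list(genes)
print(selected)
```

Here gene_C passes the FDR criterion but fails the fold-change cutoff, and gene_D has a large fold change but is not significant, so only gene_A and gene_B are kept.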

Figure 19: Creating a gene list from ANOVA results

The spreadsheet Down_Syndrome_vs_Normal (A) will be created as a child spreadsheet under the Down_Syndrome-GE spreadsheet.

This gene list spreadsheet can now be used for further analysis such as hierarchical clustering, gene ontology, integration of copy number data, or export into other data analysis tools such as pathway analysis.

You should take some time to create gene list criteria of your own to become familiar with the List Manager tool in PGS. For more information, you can always click on the help buttons in the dialog.

 

Hierarchical Clustering

The gene list in spreadsheet Down_Syndrome_vs_Normal (A) can now be used for hierarchical clustering to visualize patterns in the data.

Figure 20: Hierarchical Clustering results

The graph (Figure 20) illustrates the standardized gene expression level of each gene in each sample. Each gene is represented in one column, and each sample is represented in one row. Genes that are unchanged have a value of zero and are colored black. Genes with increased expression have positive values and are colored red. Genes with reduced expression have negative values and are colored green. Down syndrome samples are colored red and normal samples are colored orange. On the left-hand side of the graph, we can see that the Down syndrome samples cluster together.
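Standardization of this kind is commonly done by shifting each gene to a mean of zero and scaling it to a standard deviation of one across samples. The sketch below shows that calculation for a single gene with made-up values. Whether PGS uses the population or the sample standard deviation is not stated in this tutorial; the population form is assumed here.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1 (population SD assumed)."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # A gene with no variation maps to zero everywhere.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


# Made-up log2 intensities of one gene in four samples.
z = standardize([6.0, 6.0, 8.0, 8.0])
print(z)
```

Values near zero correspond to the black cells in the heat map, positive values to red, and negative values to green.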

For more information on the methods used for clustering, you can refer to Chapter 8: Hierarchical & Partitioning Clustering in Help > User’s Manual. For a tutorial on configuring the clustering plot, please refer to the user guide available from Help > On-line Tutorials > User Guides.


Adding Gene Annotation

During data importation, the GeneChip annotation file was linked to the imported data. This linked annotation information can be added as new columns to the ANOVA or gene list spreadsheets. For example, we can add additional annotation to the gene list we created from the ANOVA results as follows:


Figure 21: Adding a gene annotation 

Interestingly, of the 23 genes of the Down_Syndrome_vs_Normal (A) spreadsheet, 20 genes are located on chromosome 21.

Figure 22: Dot plot results for gene Down syndrome critical region 3

In the plot, each dot is a sample of the original data. The Y-axis represents the log2 normalized intensity of the gene and the X-axis represents the different types of samples. The median expression differs between the two groups in this example: the median of the Down syndrome samples is ~6.3, while the median of the normal samples is ~6.0. The line inside the Box & Whiskers represents the median of the samples in a group. Placing the mouse cursor over a Box & Whiskers plot will show its median and range.

 

Generating Gene Lists from a Volcano Plot

Next, we will use a volcano plot to generate a list of genes with a p-value below 0.05 and a fold change greater than 1.3.



Figure 23: Volcano Plot results for Down syndrome vs. Normal contrast

In the plot, each dot represents a gene. The X-axis represents the fold change of the contrast, and the Y-axis represents the p-value. The genes with increased expression in Down syndrome samples are on the right side of the N/C line; genes with reduced expression in Down syndrome samples are on the left. The genes become more statistically significant with increasing Y-axis position. The genes that have larger and more significant changes between the Down syndrome and normal groups are in the upper-right and upper-left corners (Figure 23).

In order to select the genes by fold-change and p-value, we will draw a horizontal line to represent the p-value 0.05 and two vertical lines indicating the –1.3 and 1.3-fold changes (cutoff lines).

Figure 24: Setting cutoff lines for -1.3 to 1.3 fold changes and p value of 0.05

The plot will be divided into six sections. By clicking on the upper-right section, all genes in that section will be selected (Figure 25).
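The six sections follow directly from the three cutoff lines: the horizontal p-value line splits the plot into an upper and a lower band, and the two vertical fold-change lines split each band into left, middle, and right. A minimal sketch of that classification, with made-up values and hypothetical section labels, looks like this:

```python
def volcano_section(fold_change, p_value, fold_cutoff=1.3, p_cutoff=0.05):
    """Name the volcano plot section a gene falls into (labels are illustrative)."""
    # Smaller p-values plot higher on the Y-axis.
    row = "upper" if p_value < p_cutoff else "lower"
    if fold_change > fold_cutoff:
        col = "right"
    elif fold_change < -fold_cutoff:
        col = "left"
    else:
        col = "middle"
    return f"{row}-{col}"


# Made-up (fold change, p-value) pairs.
sections = [
    volcano_section(1.8, 0.001),
    volcano_section(-1.5, 0.01),
    volcano_section(1.1, 0.2),
]
print(sections)
```

Genes in the upper-right section are those that are both significant at the chosen p-value and increased by more than the fold-change cutoff in Down syndrome samples.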

 

Figure 25: Creating a gene list from a Volcano Plot

Note: If no column is selected in the parent (ANOVA) spreadsheet, all of the columns will be included in the gene list; if some columns are selected, only the selected columns will be included in the list.

 The list can be saved as a text file (File > Save As Text File) for use in reports or by downstream analysis software.

 

End of Tutorial

This is the end of the tutorial. If you need additional assistance with this data set, email us at support@partek.com or contact the Partek Technical Support staff at:

North America

(9:00 a.m. - 5:00 p.m. CST)

+1-314-884-6172

Europe

(9:00 a.m. - 5:00 p.m. GMT)

+44 2071 930426 or +1.314.884.6173

Asia/Australasia

(9:00 a.m. - 6:00 p.m. SGT)

+65 6808 8706