
This tutorial will illustrate how to:

Import Affymetrix CEL files and check quality
Add attributes describing the sample groups
Perform exploratory analysis using the PCA scatter plot
Find differentially expressed genes using ANOVA
Generate a list of genes of interest
Add annotations to the gene list

Note: the workflow described below is enabled in Partek Genomics Suite (PGS) version 7.0. Please contact the Partek Licensing Team at licensing@partek.com to request this version or update the software release via Help > Check for Updates from the main command line. The screenshots shown below may vary across platforms and across different versions of PGS.

Description of the Data Set

Down syndrome is caused by an extra copy of all or part of chromosome 21; it is the most common non-lethal trisomy in humans. The study used in this tutorial revealed a significant upregulation of chromosome 21 genes at the gene expression level in individuals with Down syndrome; this dysregulation was largely specific to chromosome 21 and did not extend to the other chromosomes. The experiment was performed using Affymetrix GeneChip™ Human U133A arrays. It includes 25 samples taken from 10 human subjects and 4 different tissues.

The raw data for this study is available as experiment number GSE1397 in the Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/.

Data and associated files for this tutorial can be downloaded by going to Help > On-line Tutorials from the PGS main menu. The data can also be downloaded directly from: http://www.partek.com/Tutorials/microarray/Gene_Expression/Down_Syndrome/Down_Syndrome-GE.zip.

Importing Affymetrix CEL Files

Download the data from the Partek® site to your local disk. The zip file contains both data and annotation files.

For this tutorial, unzip the files to C:\Partek Training Data\Down_Syndrome-GE or to a directory of your choosing. Be sure to create a directory or folder to hold the contents of the zip file
Copy or move the annotation files (HG-U133A.cdf, HG-U133A.na36.annot, HG-U133A.na36.annot.idx) to C:\Microarray Libraries. (Copying the annotation files to the default library location is done because newer annotation files that are released after the publication of this tutorial may cause the results to be different than what is shown in the published tutorial. If, however, you prefer to download the latest version, you may omit copying the HG-U133A files to C:\Microarray Libraries)
Start PGS and select Gene Expression from the Workflows panel on the right side of the tool bar in the PGS main window (Figure 1)

Figure 1: Selecting the gene expression workflow

Select Import Samples under the Import section of the workflow
Select Import from Affymetrix CEL Files and then click OK
Select the Browse button to select the C:\Partek Training Data\Down_Syndrome-GE folder. By default, all the files with a .CEL extension are selected (Figure 2)

Figure 2: Selecting the folder and CEL files for the experiment

Select the Add File(s) > button to move all the .CEL files to the right panel. Twenty-five CEL files will be processed
Select the Next > button to open the Import Affymetrix CEL Files dialog (Figure 3)

Select Customize… to configure the import options (Figure 4)

Select Library Files… to specify the location of the library folder to be used and to specify the annotation files to use (Figure 5)

PGS will automatically assign the annotation files according to the chip type stored in the .CEL files. If the annotation files are not available in the library directory, PGS will automatically download them and store them in the Default Library File Folder.

The default library location can be modified by selecting the Change button in the Default Library File Folder panel. By default, the library directory is C:\Microarray Libraries. This directory is used to store all the external libraries and annotation files needed for analysis and visualization. The library directory can also be modified from Tools > File Manager
Select OK (Figure 5) to close the Specify File Locations dialog
Select the Outputs tab from the Advanced Import Options dialog (Figure 6)
In the Extract Time Stamp and Date from CEL File panel, make sure the Date button is selected to extract the chip scan date. This information can help you to detect if there are batch effects caused by the process time
In the Quality Assess of Gene Expression panel, leave the QC report button unselected. A user guide for the microarray data quality assessment and quality control features is available in the user manual.
Select OK to exit the Advanced Import Options dialog

Select Import. The progress bar on the lower left of the Import Affymetrix CEL files dialog will update as CEL files are imported. Once all files have been imported, the Import Affymetrix CEL Files dialog will close
You may see a dialog box asking if you’d like to overwrite the existing image files. This happens because the tutorial zip file already contains some of the chip image files. Select Yes

After the CEL files have been imported, the result file will open in PGS as a spreadsheet named 1 (Down_Syndrome-GE). The spreadsheet should contain 25 rows representing the microarray chips (samples) and 22,296 columns representing the probe sets (genes) (Figure 8).
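The scan date extracted during import can be cross-tabulated against the sample groups to see whether processing batches coincide with a biological factor. The sketch below is a minimal illustration in Python, outside of PGS; the sample names, dates, and group labels are made up for the example and are not taken from the tutorial data set.

```python
from collections import Counter

# Hypothetical sample records: (sample name, scan date, type).
# None of these values come from the tutorial data set.
samples = [
    ("s1", "2003-01-10", "Down syndrome"),
    ("s2", "2003-01-10", "Down syndrome"),
    ("s3", "2003-01-10", "Normal"),
    ("s4", "2003-02-21", "Normal"),
    ("s5", "2003-02-21", "Normal"),
    ("s6", "2003-02-21", "Down syndrome"),
]

def crosstab(records):
    """Count samples for each (scan date, type) pair."""
    return Counter((date, group) for _, date, group in records)

def confounded_dates(records):
    """Return scan dates on which only one sample type was processed."""
    groups_by_date = {}
    for _, date, group in records:
        groups_by_date.setdefault(date, set()).add(group)
    return sorted(d for d, g in groups_by_date.items() if len(g) == 1)

table = crosstab(samples)
for (date, group), n in sorted(table.items()):
    print(f"{date}  {group:<14} {n}")
print("Dates with a single type:", confounded_dates(samples))
```

If a scan date contains only one sample type, a difference between the groups cannot be cleanly separated from a difference between processing batches.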

For additional information on importing data into PGS, see Chapter 4 Importing and Exporting Data in the Partek User’s Manual. The User’s Manual is available from the Partek Genomics Suite menu under Help > User’s Manual. The FAQ (Help > On-line Tutorials > FAQ) may also be helpful. Because this tutorial addresses only some topics, you may need to consult the User’s Manual for information about other useful features.

It is recommended that you be familiar with Chapter 6 The Pattern Visualization System of the user manual before going through the next section of the tutorial.

Exploratory Data Analysis

At this point in the analysis, take a preliminary look at the data. Do the genes you expected to be differentially regulated appear to have larger or smaller intensity values? Do similar samples resemble each other?

The latter question can be explored using Principal Components Analysis (PCA), an excellent method for reducing and visualizing high-dimensional data.

Select Plot PCA Scatter Plot from the QA/QC section of the Gene Expression workflow. The Scatter Plot tab with your PCA plot will appear as shown in Figure 13

In the scatter plot, each point represents a chip (sample) and corresponds to a row on the top-level spreadsheet. The color of the dot represents the type of the sample; red represents a normal sample and blue represents a Down syndrome sample. Points that are close together in the plot have similar intensity values across the probe sets on the whole chip (genome), and points that are far apart in the plot are dissimilar
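To make the idea behind the plot concrete, the following is a minimal sketch of PCA computed with a singular value decomposition in Python. It is not how PGS computes its plot (PGS may scale or weight the data differently); the function name pca_scores and the synthetic 25 x 200 matrix with two made-up groups are assumptions for illustration only.

```python
import numpy as np

def pca_scores(data, n_components=3):
    """Project samples (rows) onto the first principal components.

    data: 2-D array, samples x probe sets.
    Returns (scores, explained_variance_ratio).
    """
    centered = data - data.mean(axis=0)            # center each probe set
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)          # share of total variance
    return scores, explained[:n_components]

# Synthetic stand-in for an intensity matrix: 25 samples x 200 probe sets.
rng = np.random.default_rng(0)
group = np.array([0] * 12 + [1] * 13)              # two hypothetical groups
data = rng.normal(size=(25, 200))
data[group == 1, :50] += 3.0                       # shift 50 probe sets in group 1

scores, explained = pca_scores(data)
print(scores.shape)
print(np.round(explained, 3))
```

Each row of scores gives the coordinates of one sample on the first three PCs, which is what a 3-D PCA scatter plot displays. Samples with similar values across the probe sets end up with similar coordinates.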

Left-clicking on any point in the scatter plot selects that point. A dash with an identifying row number will appear on the selected PCA plot point. The spreadsheet in the Analysis tab will also jump to the corresponding row
While pressing the mouse wheel down, drag the mouse to rotate the plot. Alternatively, select the Rotate Mode icon on the left side of the Scatter Plot tab, then press the left mouse button and drag to rotate the plot. Rotating the plot allows you to examine the grouping pattern or outliers of the data on the first 3 principal components (PCs)
Scroll the mouse wheel up or down while the cursor is on the PCA plot to zoom in and out, or select the Zoom Mode icon on the left side of the Scatter Plot tab
Select the Reset icon on the left side of the Scatter Plot tab to return the PCA plot to its original orientation and zoom

As you can see from rotating the plot, there is no clear separation between Down syndrome and normal samples in this data since the red and blue samples are not separated in space. However, there are other factors that may separate the data.

In the Scatter Plot tab, select the Rendering Properties icon and configure the plot as shown in Figure 14
Color the points by column 7. Tissue and Size the points by column 6. Type
Select Apply

Notice now that the data are clustered by different tissues (Figure 15).

Another way to see the cluster pattern is to put an ellipse around the Tissue groups. Select the Ellipsoids tab on the Plot Rendering Properties dialog
Select Add Ellipse/Ellipsoid
Select the Ellipse radio button
Double click on Tissue to move it to the Grouping Variable(s) panel
Select OK (Figure 16) to close the Add Ellipse/Ellipsoid dialog and OK again to exit the Plot Rendering Properties dialog

By rotating this plot, you can see that the data are separated by tissue and that, within some of the tissues, the Down syndrome samples and normal samples are also separated. For example, in the Astrocyte and Heart tissues, the Down syndrome samples (small dots) are on the left, and the normal samples (large dots) are on the right (Figure 17).
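A common way to draw an ellipse around a group of points is to derive it from the group's covariance matrix. The sketch below shows that approach in Python for 2-D points; it is an assumption for illustration and may not match how PGS sizes or orients its ellipses. The function name group_ellipse, the n_std parameter, and the four example points are hypothetical.

```python
import numpy as np

def group_ellipse(points, n_std=2.0):
    """Return (center, semi_axes, angle_deg) of a covariance ellipse.

    points: array of shape (n, 2), e.g. PC1/PC2 coordinates of one group.
    semi_axes are n_std standard deviations along the principal directions.
    """
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]              # put the longest axis first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    semi_axes = n_std * np.sqrt(eigvals)
    # Orientation of the longest axis relative to the x-axis.
    angle_deg = np.degrees(np.arctan2(eigvecs[1, 0], eigvecs[0, 0]))
    return center, semi_axes, angle_deg

# Four made-up points, more spread out along x than along y.
points = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -0.5], [0.0, 0.5]])
center, semi_axes, angle_deg = group_ellipse(points)
print(center, semi_axes, angle_deg)
```

Groups whose ellipses do not overlap on the plotted components are well separated on those components; overlapping ellipses suggest the grouping variable explains little of the variation shown.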

PCA is an example of exploratory data analysis and is useful for identifying outliers and major effects in the data. From the scatter plot, you can see that tissue is the biggest source of variation. Many genes are expressed differently between the 4 tissues, but fewer genes are expressed differently between the types (Down syndrome and normal) across the whole chip (genome).
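One way to put a number on the size of an effect is to ask what fraction of the variance along a principal component is explained by the group means of each factor (often called eta squared). The following Python sketch illustrates the calculation on synthetic values; the function name eta_squared, the tissue labels A to D, and the effect sizes are assumptions and do not come from the tutorial data set or from PGS.

```python
import numpy as np

def eta_squared(values, labels):
    """Fraction of the variance in `values` explained by group means."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand_mean = values.mean()
    ss_total = np.sum((values - grand_mean) ** 2)
    ss_between = sum(
        np.sum(labels == g) * (values[labels == g].mean() - grand_mean) ** 2
        for g in np.unique(labels)
    )
    return ss_between / ss_total

# Synthetic PC1 coordinates: a large tissue effect and a small type effect.
rng = np.random.default_rng(1)
tissue = np.repeat(["A", "B", "C", "D"], 6)
sample_type = np.tile(["Down syndrome", "Normal"], 12)
tissue_effect = {"A": 0.0, "B": 5.0, "C": 10.0, "D": 15.0}
pc1 = (
    np.array([tissue_effect[t] for t in tissue])
    + (sample_type == "Normal") * 1.0
    + rng.normal(size=24)
)

print("tissue:", round(eta_squared(pc1, tissue), 3))
print("type:  ", round(eta_squared(pc1, sample_type), 3))
```

A factor with a value near 1 dominates that component, while a value near 0 means the factor contributes little to it.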

The next step is to draw a histogram to examine the samples. Select Plot Sample Histogram in the QA/QC section of the Gene Expression workflow to generate the Histogram tab as shown in Figure 18.

The histogram plots one line for each of the samples, with the intensity of the probes on the X-axis and the frequency of the probe intensity on the Y-axis. This allows you to view the distribution of the intensities and identify any outliers. In this dataset, all the samples follow the same distribution pattern, indicating that there are no obvious outliers in the data. As with the PCA plot, if you click on any of the lines in the histogram, the corresponding row will be highlighted in the spreadsheet 1 (Down_Syndrome-GE). You can also change the way the histogram displays the data by clicking on the Plot Properties button. Explore these options on your own.
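The same comparison can be sketched numerically: bin every sample on a shared set of intensity bins, then measure how far each sample's frequency profile lies from the median profile. This Python example is an illustration only and is not the PGS implementation; the function names, the bin count, and the synthetic intensities (including the deliberately shifted sample at index 3) are assumptions.

```python
import numpy as np

def sample_histograms(data, bins=20):
    """Histogram each row (sample) of `data` on shared bin edges.

    Returns (edges, freqs); freqs has shape (n_samples, bins) and each
    row sums to 1.
    """
    edges = np.linspace(data.min(), data.max(), bins + 1)
    counts = np.array([np.histogram(row, bins=edges)[0] for row in data])
    return edges, counts / counts.sum(axis=1, keepdims=True)

def distance_from_median(freqs):
    """L1 distance of each sample's histogram from the median histogram."""
    median_profile = np.median(freqs, axis=0)
    return np.abs(freqs - median_profile).sum(axis=1)

# Synthetic intensities: 10 samples x 5000 probes, one shifted sample.
rng = np.random.default_rng(2)
data = rng.normal(loc=8.0, scale=2.0, size=(10, 5000))
data[3] += 2.0                                     # hypothetical outlier sample

edges, freqs = sample_histograms(data)
dist = distance_from_median(freqs)
print(np.round(dist, 3))
print("most atypical sample index:", int(np.argmax(dist)))
```

A sample whose distance is much larger than the rest corresponds to a line that visibly departs from the others in the histogram plot.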

The other option in the QA/QC section of the Gene Expression workflow is Plot Sample Box & Whiskers Chart which is discussed elsewhere.

The decision to discard any samples would be based on information from the PCA plot, sample histogram plot, and QC metrics. To discard a sample and renormalize the data (without the effects of the outlier), start over with importing samples and omit the outlier sample(s) during the CEL file import.
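When re-importing without an outlier, it can help to prepare the list of CEL files to keep before opening the import dialog. The sketch below is a small Python helper for that purpose, run outside of PGS; the function name cel_files_to_import and the file names are hypothetical, and the example works on a temporary folder of empty placeholder files rather than real CEL files.

```python
import tempfile
from pathlib import Path

def cel_files_to_import(folder, exclude=()):
    """List .CEL files in `folder`, skipping any file named in `exclude`.

    Matching of extensions and file names is case-insensitive.
    """
    excluded = {name.lower() for name in exclude}
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() == ".cel"
        and p.name.lower() not in excluded
    )

# Demonstration with empty placeholder files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_01.CEL", "sample_02.CEL", "sample_03.CEL", "notes.txt"):
        (Path(tmp) / name).touch()
    kept = [p.name for p in cel_files_to_import(tmp, exclude=["sample_02.CEL"])]

print(kept)
```

The resulting list contains only the files to select in the CEL file import step, so the remaining samples are normalized without the discarded one.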