Scientists often develop lists of genes, probes, transcripts, SNPs, and genomic regions of interest from analysis tools, research papers, and databases. Using Partek® Genomics Suite®, these lists can be integrated with genomics data sets, analyzed with powerful statistics, and visualized for new insights.

This tutorial will illustrate:

Importing a text file list

The preferred method for importing a generic list of data into Partek Genomics Suite is as a text file.

The File Type section of the Import dialog includes a preview of the text file and import options (Figure 1). 

 

The columns in the import file can be separate by a tab, comma, or any other character. 

For most applications, the items on the list should be in rows while attributes or values should be in columns. If a list is oriented with items on columns, select Transpose the file to to import a transposed spreadsheet. 

Selecting Genomic Data will result in a dialog prompt to configure genomic properties including selecting the type of genomic data, the location of genomic features in the spreadsheet, the annotation column with gene symbols, the chip or reference source and annotation file, and the species and genome build. This option should be selected if the text file contains genomic position data or other array/sequencing results. 

The Identify Column Labels, Start of Data section (Figure 3) 

The next step is to identify where the data starts and where the optional header is found. The line that contains the header (if present) must precede the data. If there are lines to be skipped in the file (like comments), they may only appear at the top of the file, before the header line or data begin.

If there are many comment lines at the start of the file, you may need to select View Next 5 Records to get to the row that contains the column header. If you accidentally move past the screen that contains the header or data rows, select View Previous 5 Records

If there are missing numerical values or empty cells in your input list, insert a special character or symbol (?, N/A, NA, etc.) in the missing cells; you will specify the character in the Missing Data Representation section of the dialog

This is important if the missing value is a number in a column that you plan to use for statistical analysis. The default missing value indicator is ?

The Preview text encoding section (Figure 4) previews the first five lines of the file, allowing you to check if the text encoding is correct. 

 

The final section of the Import .txt, .tsv, or .csv File dialog is Verify Type & Attribute of Data Columns (Figure 5). While data column type and attribute can be modified after import, it is easier and faster to select the proper options during import as multiple columns may be selected during this dialog.  

 

If there is an identifier like gene symbol or SNP, the Type field for that column should be set to text and Attribute should be set to label. Numeric values (intensities, p-values, fold-changes, etc.) should have Type set to double and Attribute set to response. The other possible value for Attribute is factor and describes sample data. The user interface is this dialog allows you to select multiple columns at once. The interface controls are detailed in the dialog (Figure 5). 

If Genomic Data was selected in the Data Type section, the Configure Genomic Properties dialog will open (Figure 6). These options will be discussed in the next section when we add an annotation file, but we will make a few selections now. 

 

The imported spreadsheet will open (Figure 7). 

 

Tasks available for a list of genes

There are many useful visualizations, annotations, and biological interpretations that can operate on a gene list. In order for these features operate on an imported list, an annotation file must first be associated with the gene-list. Additionally, many operations that work with a list of significant genes (like GO- or Pathway-Enrichment) require comparison against a background of “non-significant” genes.

Adding an annotation file

The quickest way to accomplish both is to use the background of “all genes” for that organism provided by an annotation source like RefSeq, Ensembl, etc. in .pannot (Partek® annotation), .gff, .gtf, .bed, tab- or comma-delimited format. If the file is not already in a tab-separated or comma delimited format, you may import, modify, and save the file in the proper file format.

Now we can add the annotation file to our imported gene list. 

This is the Configure Genomic Properties dialog we saw earlier (Figure 6). 

Adding annotations

Inserting annotations from an annotation file

If a genomic annotation file has been added, annotations from the file can be added as columns in the spreadsheet. 

Annotating with cytobands

A column with cytoband locations will be added to the spreadsheet. Adding a cytoband is possible if genomic coordinates are associated with the gene list spreadsheet during import or by association with an annotation file. 

Annotating with known SNPs

A column of SNPs associated the listed genes and a column indicating the number of SNPs known to be associated with the genes will be added to the spreadsheet. If a SNP database has not been previously downloaded, it will need to be downloaded through the SNP database dialog (Figure 9). 

Alternatively, to generate a list of SNP IDs per row, right-click on a row header and select Create list of dbSNP.

In addition to SNPs, this feature can associate any data with a list of genes or genomic coordinates; the dbSNP database, any miRNA database, data from the Database of Genomic Variants (dgv), any mRNA transcriptome database, or any custom annotation source can be associated with your list. In each case, this feature will add columns to the imported gene list spreadsheet that match the genes with features from those databases.

GO Enrichment

The Gene Ontology (GO) Enrichment  p-value calculation uses either a Chi-Square or Fisher’s Exact test to compare the genes included in the significant gene list to all possible genes present in the experiment or the background genes. For a microarray experiment, background genes consists of all genes on the chip/array; for a next generation sequencing experiment, all genes in the species transcriptome are considered background genes. 

Because the calculation is essentially comparing overlapping sets of genes and does not use intensity values, GO Enrichment can be performed on an imported gene list. GO Enrichment is available through the Gene Expression workflow. 

If no annotation file has been specified for the gene list, GO Enrichment will use the full species transcriptome as the background genes. While suitable for next generation sequencing experiments, for microarray experiments, only the genes on the chip/array are appropriate. Please contact our technical support department for assistance with this step if needed. 

Pathway Enrichment

Like GO Enrichment, Pathway Enrichment does not require numerical values, but instead operates on lists of genes - a list of significant genes vs. background genes. Consequently, Pathway Enrichment may be used with an imported list of genes. The list of background genes is set to the species transcriptome by default, but can be set to a specific set of genes if the gene list has been associated with an annotation file.  

Tasks available for a gene list with numeric data

All the operations available for a gene list are available; you may also use the numeric data associated with the genes for visualization, clustering, and statistical operations. 

Descriptive Statistics 

There are numerous descriptive statistics available in Partek Genomics Suite.

Principal Component Analysis is located in a different menu.

Applying Multiple Test Correction

If your imported data contains a list of p-values, you can use any of the available multiple test corrections.

Plotting numeric data associated with a gene list

A variety of profile plots can be used to visualize the numerical data associated with your imported gene list.

Genome Browser

If you have imported numerical data associated with genes (like p-values or fold-changes), you can visualize these values in the Genome Browser once an annotation file has been added. 

If the annotations have been configured properly, you should see a track for the first column of numerical data, a cytoband track, and an annotation track. You can also add another track to display a second column of numerical data. 

A new track titled Regions will be added. 

Clustering

If the data is suitable for clustering, access the clustering function through the toolbar, not form a workflow. The workflow implementation assumes the data to be clustered are found on a parent spreadsheet and the list of genes is in a child spreadsheet. Because the data to be clustered is all on one spreadsheet, access hierarchical clustering by selecting Tools from the main toolbar then Discover then Hierarchical Clustering. Consider transposing the spreadsheet if samples are on columns and genes are on rows as Hierarchical Clustering will assume samples are rows and genes are columns. If you only have one column or one row of data, cluster only on the dimension with multiple entries by deselecting either Rows or Columns from What to Cluster or consider using an intensity plot instead. 

Starting with a list of SNPs

A list of SNPs using dbSNP IDs can be imported as a text file and associated with an annotation file as described for a list of genes. The annotation file (genomic database) you use to annotate the SNPs should minimally contain the genomic coordinates (chromosome number and physical position) of the locus. 

Novel SNPs, or SNPs that are not found in your annotation source, must be imported as a region list. The only difference would be to use the SNP name in place of a region name.

Annotating SNPs with genes

Starting with a list of SNPs that have been associated with genomic loci using an annotation file and assigned a species with genome build, you may use Find Overlapping Genes to annotate these SNPs with the closest genes. 

The Report Regions from the specified database dialog will open.

This will add 3 columns to the list of SNPs spreadsheet including Nearest Feature, which will indicate the nearest gene and strand. To allow gene list operations such as GO Enrichment or Pathway Enrichment to be performed on the SNP list, we can set the Nearest Feature column as the Gene Symbol column.  

Annotating a Partek Genomics Suite-generated SNP list with SNVs

If you have a SNP spreadsheet that was generated using Partek Genomics Suite, you can annotate the SNP list with gene, transcript, exon, and information about the predicted effect of the SNPs. 

Starting with a list of genomic regions

Importing a region list

A region list in PGS must contain the chromosome, start location, and stop locations as the first three columns, respectively. The chromosome name (or number) in the region list must be compatible with the genomic annotation for the species if you plan to use any feature (like motif detection) that requires reference sequence information. 

The spreadsheet properties will now include region. If you would like to do any operation that requires looking up hte reference genomic sequence information for the regions based on genomic location, you will need to specify the species for this region list. 

With a few additional options, the region list can be made viewable in the genome browser. 

Motif detection

Starting with a region list, you may detect either known or de novo motifs using the ChIP-Seq
workflow if your spreadsheet has been associated with a species and a reference genome as
described in the import section.

Both options (Discover de novo motifs and Search for known motifs) can now be performed. Motifs (de novo) may be displayed in the Genome Browser and known detected motifs may be viewed in web-log format by right-clicking on a header row of the motif spreadsheet.

Determining the average values for a region list

If you have a region list or a bed file and you have a microarray experiment with data, you can summarize the data according to the genomic coordinates contained in the region list. For instance, the region list contains a list of CpG islands, the experiment contains methylation percentage values for probes (β values), and you would like to summarize the methylation values for individual probes for the CpG islands. Or you have a list of copy number amplifications, microarray gene expression data, and you are interested in determining if the average intensities of the probes in those regions is higher than expected.

Find region overlaps

You have a list of regions from another analysis program (perhaps you detected peaks using an R program) and you’d like to compare that region list with a region list that Genomics Suite calculated. Perhaps you have two lists created by Genomics Suite (one generated from peak detection with one set of parameters and the other created with different parameters) and you’d like to see what the two lists have in common. You may use the Tools > Find Region Overlaps command to compare two or more region lists as shown.

There are two separate modes of operation for this command: Report all regions and Only report regions present in all list. The first option, Report all regions, will report all regions in both lists. If there is any region overlap between the lists, the intersection of the regions will be reported along with the start and stop coordinates of the intersection, the percent overlap between the intersected region with each of the regions in the input lists. If a region is found in only one list, it will be reported as well.

In contrast, the second option, Only report regions present in all lists, will intersect both lists and only reports regions found in all the lists. 

Importing genomic locations to be used with annotating SNVs

The Tools > Annotate SNVs feature requires four columns of data per genomic location: the position of the SNP (chr.basePosition), the SampleName, a reference base, and the SNP call (single nucleotide or genotype) as shown in Figure 20. 

Now that the properties have been set appropriately, Tools > Annotate SNVs may be invoked on this
spreadsheet.

Importing a BED file

A BED (Browser Extensible Data) file is a special case of a region list: it is a tab-delimited text file and the first three columns of BED files contain the chromosome, start, and stop locations. To import a bed file to be used as a data region list, follow the import instructions for region lists. A BED File might also be visualized as an annotation file containing regions in the Genome Browser. 

Using a BED file as an annotation source for the genome browser

BED files do not contain individual sequences nor do the regions have names. For instance, the UCSC table browser has a BED file that contains reads from a long non-coding RNA-Seq experiment and you might like to view this information in the context of your dataset. Before you could visualize a BED file in the chromosome viewer, you would have to create a Partek annotation file from the BED file.

Visualizing a BED file as an annotation track in the genome browser

In order to use a BED file as an Annotation track in the Genome Browser, first create the annotation file as described above, being careful to have specified the species and genome build appropriately.

GO ANOVA, GSEA/GeneSet ANOVA, and Pathway ANOVA

As these features require intensity (or count) data as well as experimental groups, these features cannot be performed on an imported lists.

Integrating Imported Data

If the data from imported spreadsheets has been associated with annotations, several integration approaches may be used to integrate multiple kinds of imported data. For instance, the Genome Browser may be used to display data from multiple spreadsheets/experiments regardless of the type of spreadsheets (imported data or microarray or NGS experiments). The Venn Diagram tool may be used to find overlaps based on a feature name. Tools > Find Overlapping Regions can use an imported gene list and a list of regions from a copy number or ChIP-Seq experiment to identifygenomic regions in common.