Hierarchical Clustering (new version)

Hierarchical clustering is a statistical method used to assign similar objects into groups called clusters. It is typically performed on results of statistical analyses, such as a list of significant genes / transcripts, but can also be invoked on the full data set, as a part of exploratory analysis.

Hierarchical clustering is an unsupervised technique, meaning that the number of clusters is not specified up front. In the beginning, each row and/or column is considered a cluster. The two most similar clusters are then combined, and this merging continues until all objects are in the same cluster. Hierarchical clustering produces a tree (called a dendrogram) that shows the hierarchy of the clusters.
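The merging procedure described above can be sketched in a few lines of Python. This is only an illustration, not the implementation used by the software: the function name agglomerate, the one-dimensional example values, and the choice of the closest pair of members as the similarity measure are all assumptions made for the sketch.

```python
from itertools import combinations

def agglomerate(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    # Start with every object in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]
    history = []

    def closest_members(a, b):
        # Assumed similarity measure: smallest absolute difference between members.
        return min(abs(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the most similar (closest) pair of clusters and merge it.
        a, b = min(combinations(clusters, 2), key=lambda pair: closest_members(*pair))
        history.append((sorted(a), sorted(b), closest_members(a, b)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return history

# Hypothetical one-dimensional data: five objects, so four merges are recorded.
merges = agglomerate([1.0, 1.5, 5.0, 5.2, 9.0])
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 2))
```

The recorded merge history is the information a dendrogram draws: each entry is one branch point, and the distance gives the height at which the two legs join.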

This tutorial illustrates how to invoke hierarchical clustering and how to configure the resulting heat map.

Invoking Hierarchical Clustering

To invoke hierarchical clustering, select a Quantification data node (to cluster samples) or a Feature list data node (to cluster significant genes/transcripts) and then click on the Hierarchical clustering option in the context sensitive menu (Figure 1).

Figure 1. Hierarchical clustering as a part of visualisation tools

The hierarchical clustering setup dialog (Figure 2) enables you to control the clustering algorithm. Starting from the top, you can choose to Cluster samples, Cluster features (genes/transcripts) or both. By default, if there are fewer than 3000 samples, the Cluster samples check button is selected. Otherwise, the check button is de-selected.

Figure 2. Setup dialog of hierarchical clustering (default settings)

Cluster distance metric is used to determine how the distance between two clusters will be calculated.

Single Linkage: the distance between two clusters is determined by the distance of the closest objects in the two clusters.

Complete Linkage: the distance between two clusters is equal to the distance between the two furthest members of those clusters.

Average Linkage: the average distance between all the pairs of objects in the two different clusters is used as the measure of distance between the two clusters.

Centroid method: the distance between two clusters is equal to the distance between the centroids of those clusters.

Ward's method: the distance between two clusters is the increase in an error measure, the within-cluster sum of squares, that merging them would cause. At each step, the merge that minimizes this increase is chosen.
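To make the five definitions above concrete, the following sketch computes each cluster distance for two small clusters. It assumes Euclidean distance as the point distance metric and uses hypothetical function names and example coordinates; it is not taken from the software itself.

```python
import math

def centroid(cluster):
    # Mean of each coordinate across the cluster's members.
    return [sum(coord) / len(cluster) for coord in zip(*cluster)]

def pair_distances(a, b):
    # Euclidean point distance is an assumption made for this sketch.
    return [math.dist(p, q) for p in a for q in b]

def single_linkage(a, b):
    return min(pair_distances(a, b))      # closest pair of objects

def complete_linkage(a, b):
    return max(pair_distances(a, b))      # furthest pair of objects

def average_linkage(a, b):
    d = pair_distances(a, b)
    return sum(d) / len(d)                # mean over all cross-cluster pairs

def centroid_distance(a, b):
    return math.dist(centroid(a), centroid(b))

def ward_increase(a, b):
    # Increase in the within-cluster sum of squares if a and b were merged.
    n_a, n_b = len(a), len(b)
    return n_a * n_b / (n_a + n_b) * centroid_distance(a, b) ** 2

# Hypothetical clusters of two-dimensional points.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

for name, fn in [("single", single_linkage), ("complete", complete_linkage),
                 ("average", average_linkage), ("centroid", centroid_distance),
                 ("ward", ward_increase)]:
    print(name, round(fn(cluster_a, cluster_b), 3))
```

The same pair of clusters gets a different distance under each method, which is why the choice of cluster distance metric can change the shape of the resulting dendrogram.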

Point distance metric is used to determine the distance between two rows or columns. For more detailed information about the equations, we refer you to the distance metrics chapter.

If you do not want to cluster all the samples, but only a subset selected by specific sample attributes (e.g. group membership), use the Filtering drop-down list (Figure 3). The default value of the Filtering option is All samples.

Figure 3. Specifying a subset of data for clustering, based on sample attributes. In the example in the figure, only Control samples will be clustered, while the Treatment samples will be omitted.

You can choose how the data is normalized. Under the Normalization mode dropdown, Standardize (default) gives each column a mean of zero and a standard deviation of 1, so that all the features (e.g., genes) have equal weight. The Shift mode gives each column a mean of zero. Choose None to perform clustering on the values as they are in the quantified data node.
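The three normalization modes can be illustrated with a short sketch. The function name normalize_columns and the example matrix are hypothetical, and the use of the population standard deviation is an assumption; the documentation does not state which estimator the software uses.

```python
from statistics import mean, pstdev

def normalize_columns(matrix, mode="standardize"):
    """Normalize each column (feature) of a samples-by-features matrix."""
    if mode not in ("standardize", "shift", "none"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "none":
        return [row[:] for row in matrix]
    result_columns = []
    for col in zip(*matrix):
        mu = mean(col)
        values = [v - mu for v in col]          # Shift: column mean becomes zero
        if mode == "standardize":
            sd = pstdev(col)                    # assumption: population std. dev.
            # Constant columns are left at zero to avoid dividing by zero.
            values = [v / sd if sd else 0.0 for v in values]
        result_columns.append(values)
    # Transpose back to samples-by-features.
    return [list(row) for row in zip(*result_columns)]

# Hypothetical data: the second feature is on a 100x larger scale than the first.
data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
standardized = normalize_columns(data, "standardize")
shifted = normalize_columns(data, "shift")
print(standardized)
print(shifted)
```

After standardizing, both features take the same values despite their different original scales, which is what gives them equal weight in the distance calculation. After shifting, the second feature still dominates.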

Heat Map

The output of a Hierarchical clustering task is a heat map (Figure 4), with or without dendrograms depending on whether you clustered samples/cells, features, or both. By default, samples are on rows (sample labels are displayed as seen in the Data tab) and features (genes or transcripts, depending on the input data) are on columns. Colors are based on standardized expression values (default selection; performed on the fly). Dendrograms show the clustering of rows (samples) and columns (variables).

Figure 4. Heat map. Samples are on rows, variables (in this example: genes) on columns, and the heat map is based on standardised gene expression values

A heat map can also be drawn without performing clustering, from the data viewer. When you select the Heatmap icon in the available plots, any data node that contains a two-dimensional matrix can be used to draw this type of plot.

Depending on the resolution of your screen and the number of samples and variables that need to be displayed, some binning may be involved. If there are more samples/genes than pixels, the values of neighboring components will be averaged together. When you zoom in to a certain level, each cell represents one sample/gene. To zoom in or out, use the mouse wheel. To move the map around when zoomed in, press down the right mouse button and drag the map.
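The binning idea can be sketched as follows. This is a simplified illustration for a single row of values; the function name bin_average and the way bin boundaries are chosen are assumptions, not the viewer's actual algorithm.

```python
def bin_average(values, n_bins):
    """Average neighboring values so that len(values) items fit into n_bins cells."""
    if n_bins >= len(values):
        return list(values)  # enough pixels: one cell per value, no binning
    binned = []
    for b in range(n_bins):
        # Integer boundaries split the values into n_bins contiguous chunks.
        start = b * len(values) // n_bins
        end = (b + 1) * len(values) // n_bins
        chunk = values[start:end]
        binned.append(sum(chunk) / len(chunk))
    return binned

# Hypothetical row of six values displayed in only three cells.
row = [1.0, 3.0, 5.0, 7.0, 10.0, 20.0]
print(bin_average(row, 3))   # each displayed cell averages two neighbors
print(bin_average(row, 10))  # more cells than values: nothing is averaged
```

Zooming in corresponds to raising the number of available cells per value until each cell shows exactly one sample/gene.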

There are 5 sections in the Configuration panel (Figure 5): Content, Heatmap, Dendrograms, Annotations and Layout. Click on the triangle next to a section to expand it for configuration.

Figure 5. Heat map controls

Content:

The Content section specifies which matrix is used to draw the heat map in the plot. The heat map is a color representation of the values in the selected matrix. Most data nodes contain only one matrix; in that case there is no need to use the Size by configuration, since it would represent the same values as Color by (Figure 6).

Figure 6. When a data node contains only one matrix, there is no need to use the Size by configuration

However, a data node can contain multiple matrices. For example, if you perform descriptive statistics on cluster groups for every gene (mean, std. dev, percent of detected cells, etc.), each statistic will be in a separate matrix in the output data node. You might then want to use the color of a component in the heat map to represent one statistic (like the mean of the groups) and the size of the component to represent a different statistic (like std. dev) (Figure 7).

Figure 7. When a data node contains more than one matrix, different matrices can be used to color and size the components

Heatmap:

The Heatmap section is used to configure the color and shape of the components in the heat map (Figure 8).

Figure 8. Configure heatmap color and shape

In the horizontal color palette bar, the color on the left represents low values and the color on the right represents high values in the matrix. By default, there are 3 tabs representing the min, middle and max color values of the default range calculated on the matrix. Left-click the middle tab and drag it left or right to change the value this tab represents. If you left-click the middle tab and release the mouse, you can change both the color and the value this tab represents (Figure 9). Click on the remove icon to remove this tab.

Figure 9. Left click on the tab to change the color and value represented

Click on the little triangle next to the color square to pick the color that represents the value, either by clicking on a color or by typing in its RGB value, then click OK (Figure 10).

Figure 10. Select a color from the color palette

The min and max tabs cannot be dragged or removed; however, left-clicking one of them lets you choose a different color. To change the min and max values represented, use the Range section. Clicking on the Palette bar adds a new color tab between min and max (Figure 11). Adding a tab can be useful when the data contain outlier values, since different colors can then represent different value ranges.

Figure 11. Add a new color tab between min and max

To change the min and max values represented, click on the toggle switch in the Range section so that it turns blue, and specify the value in the text box.
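One way to picture how the palette tabs and the range interact is as a list of color stops, with values mapped to colors between neighboring stops. The sketch below assumes linear interpolation in RGB and clamping outside the range; both are assumptions for illustration, as are the function name color_for and the example palette.

```python
def color_for(value, stops):
    """Map a value to an RGB color by interpolating between color stops.

    stops: list of (value, (r, g, b)) sorted by strictly increasing value;
    the first and last entries play the role of the min and max tabs.
    """
    # Values outside the range are clamped to the min or max color.
    if value <= stops[0][0]:
        return stops[0][1]
    if value >= stops[-1][0]:
        return stops[-1][1]
    for (v0, c0), (v1, c1) in zip(stops, stops[1:]):
        if v0 <= value <= v1:
            t = (value - v0) / (v1 - v0)  # position between the two tabs, 0..1
            return tuple(round(a + t * (b - a)) for a, b in zip(c0, c1))

# Hypothetical three-tab palette: blue (min), white (middle), red (max).
palette = [(-2.0, (0, 0, 255)), (0.0, (255, 255, 255)), (2.0, (255, 0, 0))]

print(color_for(-1.0, palette))  # halfway between blue and white
print(color_for(5.0, palette))   # above the max value: clamped to red
```

Under this model, narrowing the range makes more values saturate at the min or max color, and adding a tab gives an outlier range its own color segment.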

The shape of the heat map cell (component) can be set to either a rectangle or a circle by selecting the corresponding radio button in the Shape section.

Dendrograms:

Dendrograms are displayed only if cluster analysis was performed. By default, the dendrograms are all colored in black (Figure 4).

The color of the dendrograms can be configured (Figure 12).

Figure 12. Configure the color of dendrograms

Click on the triangle next to the color square to choose a different color for the dendrogram.

When By cluster is selected in the Row/Column color drop-down list, the number of clusters needs to be specified (Figure 13). The top N clusters will be shown in N different colors.

Figure 13. Configure the dendrogram color based on the number of clusters specified
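Coloring the top N clusters amounts to cutting the dendrogram at the level where exactly N branches remain. The sketch below replays a list of merges until N clusters are left and assigns one color index per cluster. The merge-list format, the function name top_clusters and the example dendrogram are hypothetical and chosen only for illustration.

```python
def top_clusters(n_leaves, merges, n_clusters):
    """Replay merges (ordered from most to least similar) until n_clusters remain.

    merges: list of (id_a, id_b); leaves have ids 0..n_leaves-1 and the cluster
    created by merge k gets id n_leaves + k (this convention is an assumption).
    Returns a dict mapping each leaf to a color index in 0..n_clusters-1.
    """
    members = {i: [i] for i in range(n_leaves)}
    for k, (a, b) in enumerate(merges):
        if len(members) <= n_clusters:
            break  # the requested number of top-level clusters is reached
        members[n_leaves + k] = members.pop(a) + members.pop(b)
    coloring = {}
    for color, leaves in enumerate(members.values()):
        for leaf in leaves:
            coloring[leaf] = color
    return coloring

# Hypothetical dendrogram over 5 samples, as a list of merges.
merges = [(2, 3), (0, 1), (5, 6), (4, 7)]
colors = top_clusters(5, merges, 2)
print(colors)  # samples 0-3 share one color index, sample 4 gets the other
```

Asking for 3 clusters instead of 2 stops the replay one merge earlier, so samples 0 and 1 would get a different color from samples 2 and 3.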

Annotations:

Row annotation allows you to add sample-level or cell-level annotation to the viewer. First, make sure to choose the data node that contains the annotation information you would like to use by clicking the circle. All project-level annotations are available on every data node in the pipeline (Figure 14).

Figure 14. Configure sample or cell annotation

Each attribute is represented as an annotation bar next to the heat map, with different colors representing the different groups in the attribute. The width of the bar can be adjusted with the Block size slider. When the Show labels toggle switch is on, the font size of the label text can be changed by specifying a pixel size.

Layout:

To change the orientation of the plot, click on the Data Transpose toggle switch in the Layout section (Figure 15).

Figure 15. Configure the layout of the plot

Axes labels, dendrograms and the legend can be turned on or off by clicking the toggle switches. The font size of the axes labels can be changed by specifying the number of pixels.

In the mode bar of the viewer, besides the selection buttons, there is a special button just for hierarchical clustering: the Flip mode button. In this mode, clicking on a line that represents a cluster branch in the dendrogram swaps the location of the two legs of that branch.
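Flipping a branch changes only the display order of the leaves, not the clustering itself. A minimal sketch, assuming the dendrogram is stored as nested (left, right) pairs and a branch is addressed by a path of 'L'/'R' steps from the root (both are representations invented for this example):

```python
def leaf_order(node):
    """Return leaves left to right; a node is a leaf label or a (left, right) pair."""
    if isinstance(node, tuple):
        left, right = node
        return leaf_order(left) + leaf_order(right)
    return [node]

def flip(node, path):
    """Swap the two legs of the branch reached by following path ('L'/'R' steps)."""
    left, right = node
    if not path:
        return (right, left)  # this is the clicked branch: swap its legs
    if path[0] == "L":
        return (flip(left, path[1:]), right)
    return (left, flip(right, path[1:]))

# Hypothetical dendrogram with five leaves.
tree = (("A", "B"), ("C", ("D", "E")))
flipped = flip(tree, "R")  # flip the branch joining C with the (D, E) cluster

print(leaf_order(tree))     # ['A', 'B', 'C', 'D', 'E']
print(leaf_order(flipped))  # ['A', 'B', 'D', 'E', 'C']
```

The same leaves stay grouped together after the flip; only their left-to-right (or top-to-bottom) position in the heat map changes.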

Additional Assistance

If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
