Hierarchical clustering is a statistical method used to assign similar objects into groups called clusters. It is typically performed on results of statistical analyses, such as a list of significant genes/transcripts, but can also be invoked on the full data set, as a part of exploratory analysis.
Hierarchical clustering is an unsupervised technique, meaning that the number of clusters is not specified upfront. In the beginning, each row and/or column is considered a cluster. The two most similar clusters are then combined, and merging continues until all objects are in the same cluster. Hierarchical clustering produces a tree (called a dendrogram) that shows the hierarchy of the clusters.
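The merging procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not Partek Flow's implementation: the function names, the toy data, and the choice of Euclidean distance between points are assumptions made for the example, and the brute-force search is only practical for small inputs.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean Euclidean distance over all cross-cluster pairs of points."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(i,) for i in range(len(points))]  # each object starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the most similar (closest) pair of clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], points))
        height = average_linkage(a, b, points)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)  # the merged cluster replaces its two parts
        merges.append((a, b, height))
    return merges


# Hypothetical toy data: four samples measured on two features.
samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
merges = agglomerate(samples)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

Each entry in the merge history corresponds to one branch point of the dendrogram, and the recorded distance is the height at which the two legs join.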
This tutorial will illustrate how to:
To invoke hierarchical clustering, select a data node containing count data (e.g., Gene counts, Normalized counts, Single cell counts) or a Feature list data node (to cluster significant genes/transcripts), and then click on the Hierarchical clustering / heat map option in the context-sensitive menu (Figure 1).
The hierarchical clustering setup dialog (Figure 2) enables you to control the clustering algorithm. Starting from the top, you can choose to plot a Heatmap or a Bubble map (clustering can be performed on both plot types). Next, perform Ordering by selecting Cluster for either feature order (genes/transcripts/proteins) or cell/sample/group order, or both. Note the context-sensitive image, which helps you orient yourself: it shows a dendrogram when hierarchical clustering is selected and an arrow when an order is assigned for the rows and columns (in Figure 2, Cluster is selected for both options, so a dendrogram is shown in the image).
If you do not want to cluster all the samples, but only a subset based on a specific sample or cell attribute (e.g., group membership), check Filter cells under Filtering and set a filtering rule using the drop-down lists (Figure 3). Notice that the drop-down lists allow more than one factor (when available) to be selected at a time. When configuring the filtering rule, use AND to include only the samples that meet all conditions, and use OR to include the samples that meet any one of the conditions.
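The difference between AND and OR in a filtering rule can be illustrated with a short sketch. The attribute names, sample records, and the `passes` helper are hypothetical and only mimic the logic of the dialog; they are not part of Partek Flow.

```python
def passes(sample, conditions, combine="AND"):
    """Return True if the sample satisfies the rule.

    conditions: list of (attribute, allowed_values) pairs.
    combine: "AND" requires every condition to hold, "OR" requires at least one.
    """
    results = [sample.get(attr) in allowed for attr, allowed in conditions]
    return all(results) if combine == "AND" else any(results)


# Hypothetical sample attributes.
samples = [
    {"name": "S1", "treatment": "drug", "tissue": "liver"},
    {"name": "S2", "treatment": "drug", "tissue": "kidney"},
    {"name": "S3", "treatment": "control", "tissue": "liver"},
]
rule = [("treatment", {"drug"}), ("tissue", {"liver"})]

kept_and = [s["name"] for s in samples if passes(s, rule, "AND")]
kept_or = [s["name"] for s in samples if passes(s, rule, "OR")]
print(kept_and)  # ['S1']
print(kept_or)   # ['S1', 'S2', 'S3']
```

With AND, only the sample that is both drug-treated and from liver is kept; with OR, any sample matching at least one condition is kept.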
Hierarchical clustering uses distance metrics to group rows and columns by similarity; the cluster distance metric is set to Average Linkage by default. This can be adjusted by clicking Configure under Advanced options (Figure 4).
Cluster distance metric for cells/samples and features is used to determine how the distance between two clusters will be calculated:
Point distance metric is used to determine the distance between two rows or columns. For more detailed information about the equations, we refer you to the distance metrics chapter.
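To make the distinction between the two kinds of metric concrete, the sketch below computes a point distance between individual rows and then combines those values into a cluster distance in three common ways. Only Average Linkage is named in this section; single and complete linkage, the Euclidean point distance, and all function names are assumptions chosen for illustration and may not match the options offered in the dialog.

```python
from math import dist


def pairwise(cluster_a, cluster_b):
    """Point distances (Euclidean here) between every cross-cluster pair."""
    return [dist(p, q) for p in cluster_a for q in cluster_b]


def single_link(a, b):
    """Cluster distance = distance between the closest pair of points."""
    return min(pairwise(a, b))


def complete_link(a, b):
    """Cluster distance = distance between the farthest pair of points."""
    return max(pairwise(a, b))


def average_link(a, b):
    """Cluster distance = mean of all cross-cluster point distances."""
    d = pairwise(a, b)
    return sum(d) / len(d)


# Hypothetical clusters of one-dimensional points.
left = [(0.0,), (1.0,)]
right = [(4.0,), (6.0,)]
print(single_link(left, right))    # 3.0
print(complete_link(left, right))  # 6.0
print(average_link(left, right))   # 4.5
```

The same two clusters are 3.0, 6.0, or 4.5 units apart depending on the cluster distance metric, which is why the choice can change the shape of the dendrogram.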
If the Cluster option is unchecked for Cells/Sample/Group order or Feature order, the Ordering option will be Assign order (Figure 5).
The Default order of cells/samples/groups (rows) is based upon the labels as displayed in the Data tab and features (columns) are dependent on the input data of the data node.
Feature order can be assigned by selecting a managed list in the drop-down (e.g., a saved feature list generated from a report node, or a list added under list management in the settings). This limits the features to those in the list, ordered as they are listed. A feature from the list that is not present in the input of the data node will not be shown in the plot. Note that if none of the listed features are available in the data node, the task cannot run and an error message will be shown.
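The behavior of a managed list can be summarized by the following sketch. The feature names and the `order_features` helper are hypothetical; the code only mirrors the rules stated above (keep list order, skip missing features, fail when nothing matches).

```python
def order_features(available, managed_list):
    """Keep only listed features that exist in the data, in list order."""
    present = set(available)
    ordered = [f for f in managed_list if f in present]
    if not ordered:
        # Mirrors the error case: no listed feature exists in the data node.
        raise ValueError("None of the listed features are in the data node")
    return ordered


# Hypothetical feature names.
data_features = ["TP53", "GAPDH", "MYC", "ACTB"]
saved_list = ["MYC", "BRCA1", "TP53"]
plotted = order_features(data_features, saved_list)
print(plotted)  # ['MYC', 'TP53']
```

Here BRCA1 is silently dropped because it is not in the data, while MYC and TP53 appear in the order given by the list rather than the order in the data node.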
Cell/Sample/Group order can also be assigned by choosing an attribute from the drop down list. Click and drag to rearrange categorical attributes; numeric attributes can be sorted in ascending or descending order (note the arrows in the image which are different from the dendrogram for Cluster).
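As a rough illustration of assigning order by attribute, the sketch below sorts hypothetical samples by a categorical attribute using a user-defined category order (the equivalent of dragging categories into place) and by a numeric attribute in descending order. The records and variable names are assumptions made for the example.

```python
# Hypothetical samples with one categorical and one numeric attribute.
samples = [
    {"name": "S1", "group": "treated", "age": 52},
    {"name": "S2", "group": "control", "age": 47},
    {"name": "S3", "group": "treated", "age": 39},
]

# Categorical attribute: the user-chosen category order drives the sort.
category_order = ["control", "treated"]
by_group = sorted(samples, key=lambda s: category_order.index(s["group"]))

# Numeric attribute: ascending by default, descending with reverse=True.
by_age_desc = sorted(samples, key=lambda s: s["age"], reverse=True)

group_names = [s["name"] for s in by_group]
age_names = [s["name"] for s in by_age_desc]
print(group_names)  # ['S2', 'S1', 'S3']
print(age_names)    # ['S1', 'S2', 'S3']
```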
You can choose how the data is scaled (sometimes referred to as normalized) under Advanced options → Configure → Feature scaling. Standardize (the default for a heatmap) sets the mean of each column to zero and its standard deviation to one, which gives all of the features (e.g., genes or proteins) equal weight; standardized values are also known as Z-scores. Shift sets the mean of each column to zero without changing its spread. Choose None to perform clustering on the unscaled values in the quantified data node (this is the default for a bubble map). If a bubble map is scaled, scaling will be performed on the group summary method (color).
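The three scaling modes can be expressed as a small function applied to one feature (column) at a time. This is a sketch under stated assumptions: the function name and mode strings are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since this section does not say which is used.

```python
from statistics import mean, stdev


def scale_column(values, mode="standardize"):
    """Scale one feature (column) across all samples."""
    if mode == "none":
        return list(values)
    center = mean(values)
    shifted = [v - center for v in values]  # column mean becomes zero
    if mode == "shift":
        return shifted
    if mode == "standardize":
        sd = stdev(values)  # assumption: sample standard deviation
        if sd == 0:
            return shifted  # constant column: nothing to rescale
        return [v / sd for v in shifted]  # Z-scores
    raise ValueError(f"unknown mode: {mode}")


expression = [2.0, 4.0, 6.0]  # hypothetical values of one gene in three samples
print(scale_column(expression, "none"))         # [2.0, 4.0, 6.0]
print(scale_column(expression, "shift"))        # [-2.0, 0.0, 2.0]
print(scale_column(expression, "standardize"))  # [-1.0, 0.0, 1.0]
```

After standardization, a highly expressed gene and a weakly expressed gene contribute on the same scale, which is what gives the features equal weight in the clustering.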
Another way to invoke a heatmap without performing clustering is via the data viewer. When you select the Heatmap icon in the available plots list, data nodes that contain two-dimensional matrices can be used to draw this type of plot. A bubble map can be plotted similarly (use the arrow on the heatmap icon to select a Bubble map) for descriptive statistics that have been generated in the data analysis pipeline.
The output of a Hierarchical clustering task can be a heatmap (Figure 6) or a bubble map with or without dendrograms depending on whether you performed clustering on cells/samples/groups or features. By default, samples are on rows (sample labels are displayed as seen in the Data tab) and features (depending on the input data) are on columns. Colors are based on standardized expression values (default selection; performed on the fly). Dendrograms show clustering of rows (samples) and columns (variables).
Depending on the resolution of your screen and the number of samples and variables (features) that need to be displayed, some binning may be involved. If there are more samples/genes than pixels, values of neighboring rows/columns will be averaged together. Use the mouse wheel to zoom in and out. When you zoom in to a certain level on the heatmap, each cell will represent one sample/gene. When you mouse over the row dendrogram or label area and zoom, only the rows are zoomed in/out; the binning on the columns remains the same. Similarly, when you mouse over the column dendrogram or label area and zoom, only the columns are zoomed in/out; the binning on the rows remains the same. To move the map around when zoomed in, press down the left mouse button and drag the map. The plot can be saved as a full-size image or as a current view; when Save image is clicked, a prompt will ask how you would like to save the image.
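The binning idea can be sketched as follows: when there are more rows than available pixel rows, contiguous blocks of neighboring rows are averaged into one displayed row. The block boundaries and the `bin_rows` helper are assumptions made for illustration; the exact binning scheme used by the viewer is not described here.

```python
def bin_rows(matrix, max_rows):
    """Average neighboring rows so that at most max_rows rows remain."""
    n = len(matrix)
    if n <= max_rows:
        return [list(row) for row in matrix]  # enough pixels: no binning needed
    binned = []
    for b in range(max_rows):
        # Contiguous block of rows that share one pixel row.
        start = b * n // max_rows
        stop = (b + 1) * n // max_rows
        block = matrix[start:stop]
        binned.append([sum(col) / len(block) for col in zip(*block)])
    return binned


# Hypothetical 4 x 2 matrix drawn into 2 pixel rows.
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
display = bin_rows(values, 2)
print(display)  # [[2.0, 3.0], [6.0, 7.0]]
```

Zooming in corresponds to raising the number of available pixel rows for the visible region until each displayed row maps to exactly one sample or gene.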
The Hierarchical clustering task can also be used to plot a bubble map. Let's go through the steps to make a bubble map (Figure 7):
There are plot Configuration/Action options for the Hierarchical clustering / heat map task which apply to both the heatmap and bubble map in the Data viewer (below): Axes, Heatmap, Dendrograms, Annotations, and Descriptions. Click on the icon to open these configuration options.
Heatmap
This section is used to configure the color, range, size, and shape of the components in the heatmap.
If cluster analysis is performed on samples and/or features, the result will be displayed as dendrograms. By default, the dendrograms are all colored in black.
The color of the dendrograms can be configured.
This section allows you to add sample or cell level annotations to the viewer. First, make sure to choose the correct data node which contains the annotation information you would like to use by clicking the circle. All project level annotations will be available on all data nodes in the pipeline.
Description is used to modify the Title and toggle on or off the Legend.
The heatmap has several different mouse modes which modify the way the plot responds to the mouse buttons. The mode buttons are in the upper right corner of the heatmap. Clicking one of these buttons puts the heatmap into that mode.
In point mode, you can left-click and drag to move around the heatmap (if you are not fully zoomed out). Left-clicking once on the heatmap or on a dendrogram branch will select the associated rows/columns.
In selection mode, you can click and drag to select a range of rows, columns, or components.
In flip mode, you can click on a line in the dendrogram (which represents a cluster branch) and the location of the two legs of the branch will be swapped. If no clustering is performed (no dendrogram is generated), you can click on the label of an item (observation or feature) in this mode, then drag and drop it to manually change the order of the rows or columns on the heatmap.
Click on reset view to reset the plot to the default view.
The Save Image icon enables you to download the heat map to your local computer. If the heat map contains up to 2.5M cells (features * observations), you can choose between saving the current appearance of the heat map window (Current view) and saving the entire heat map (All data) (Figure 19). Depending on the number of features / observations, Partek Flow may not be able to fit all the labels on the screen, due to the limit imposed by the screen resolution. The All data option provides an image file of sufficient size so that all the labels are readable (in turn, that image may not fit the computer screen and the image file may be quite large). If the heat map exceeds 2.5M cells, the Current view option will not be shown, and you will see only a dialog like the one in Figure 20.
After selecting either the Current view (if applicable) or the All data button, the next dialog (Figure 20) will allow you to specify the image format, size, and resolution.