This guide illustrates how to process FASTQ files produced using the 10x Genomics Chromium Single Cell ATAC assay to obtain a Single cell counts data node, which is the starting point for analysis of single-cell ATAC experiments.

If you are new to Partek Flow, please see Getting Started with Your Partek Flow Hosted Trial for information about data transfer and import and Creating and Analyzing a Project for information about the Partek Flow user interface.  

Transfer files and create a new project

We recommend uploading your FASTQ files (fastq.gz) to a folder on the Partek® Flow® server before importing them into a project. Data files can be transferred into Flow from the Home page by clicking the Transfer file button (Figure 1). Following the instruction In Figure 1 to complete the data transfer. Users have the option to change the Upload directory by clicking the Browse button and either select another existing directory or create a new directory. 

To create a new project, from the Home page click the New Project button; enter a project name and then click Create project (Figure 1). Once a new project has been created, the user is automatically directed to the Data tab of the Project View. 

Import the FASTQ files

To proceed, click the Import button in the Data tab (Figure 2). Click the Automatically create samples from files button. The file browser interface will open (Figure 3). Select the FASTQ files using the file browser interface and push the Create sample button to complete the task. Paired end reads will be automatically detected and multiple lanes for the same sample will be automatically combined into a single sample. We encourage users to include all the FASTQ files including the index files although they are optional. 

When the FASTQ files have finished importing, the Unaligned reads data node will turn from transparent to opaque.


Convert FASTQ to count 

To deal with the single cell ATAC-seq FASTQ data, Flow® has wrapped the 'cellranger-atac count' pipeline from Cell Ranger ATAC v2.0[1]. It takes FASTQ files and performs multiple analysis simultaneously including reads filtering and alignment, barcode counting, identification of transposase cut sites, peak and cell calling, and generates the count matrix.

To run Cell Ranger - ATAC task:  

To learn more about how to run Cell Ranger - ATAC task in Flow, please refer to our online documentation.

The output of the count matrix then becomes the starting point for downstream analysis for scATAC-seq data in Flow (Figure 5). 

QA/QC

An important step in analyzing single cell ATAC data is to filter out low quality cells. A few examples of low-quality cells are doublets, cells with a low TSS enrichment score, cells with a high proportion of reads mapping to the genomic blacklist regions, or cells with too few reads to be analyzed. Users are able to do this in Partek Flow using the Single cell QA/QC task. 

A task node, Single cell QA/QC, is produced. Initially, the node will be semi-transparent to indicate that it has been queued, but not completed. A progress bar will appear on the Single cell QA/QC task node to indicate that the task is running (Figure 5).

The Single cell QA/QC report includes interactive violin plots showing the value of every cell in the project on several quality measures (Figure 6).

There are five plots: Nucleosome signal, TSS enrichment, % reads in peaks, Blacklist ratio, and Peak region fragments. Each point on the plots is a cell and the violins illustrate the distribution of values for the y-axis metric. Cells can be filtered either by clicking and dragging to select a region on one of the plots or by setting thresholds using the filters below the plots. Here, we will apply a filter for the number of read counts. The plot will be shaded to reflect the filter. Cells that are excluded will be shown as black dots on both plots.

Descriptions of QC metrics:

Nucleosome signal: calculated per single cell, and quantify the approximate ratio of mononucleosomal to nucleosome-free fragments. Nucleosome banding pattern: The histogram of DNA fragment sizes (determined from the paired-end sequencing reads) should exhibit a strong nucleosome banding pattern corresponding to the length of DNA wrapped around a single nucleosome.

TSS enrichment: Transcriptional start site (TSS) enrichment score. The ENCODE project has defined an ATAC-seq targeting score based on the ratio of fragments centered at the TSS to fragments in TSS-flanking regions (see https://www.encodeproject.org/data-standards/terms/). Poor ATAC-seq experiments typically will have a low TSS enrichment score.

Peak region fragments: total number of fragments in peaks which is a measure of cellular sequencing depth/complexity. Cells with very few reads may need to be excluded due to low sequencing depth. Cells with extremely high levels may represent doublets, nuclei clumps, or other artifacts.

% reads in peaks: Represents the fraction of all fragments that fall within ATAC-seq peaks. Cells with low values (i.e. <15-20%) often represent low-quality cells or technical artifacts that should be removed. Note that this value can be sensitive to the set of peaks used.

Blacklist ratio: The ENCODE project has provided a list of blacklist regions, representing reads which are often associated with artifactual signals. Cells with a high proportion of reads mapping to these areas (compared to reads mapping to peaks) often represent technical artifacts and should be removed.  

Filter cells

To filter out low quality cells (Figure 7), 

Filter features

Another common task is to filter the data to include only informative features. Partek Flow has a wide variety of flexible filtering options. 

Filter features task can be invoked from any counts or single cell data node. Noise Reduction and Statistics Based filters take each feature and perform the specified calculation across all the cells. The filter is applied to the values in the selected data node and the output is a filtered version of the input data node. 

In the task dialog, click the check box to activate one or more of the filter types, configure the filter(s), and click Finish to run (Figure 8). 

Annotate regions

To understand the importance of enriched regions in regulating gene expression, Flow uses Annotate regions task to add information about overlapping or nearby genomic features. That gives regulatory context for enriched regions.

The input for Annotate peaks is a Peaks type data node. 

The Genomics overlaps parameter lets you choose one of two options (Figure 9).

Users are able to define the transcription start site (TSS) and transcription termination site (TTS) limit in the unit of bp.

TF-IDF (frequency-inverse document frequency) normalization

Latent semantic indexing (LSI)  was first introduced for the analysis of scATAC-seq data by Cusanovich et al. 2018[2]. LSI combines steps of frequency-inverse document frequency (TF-IDF) normalization followed by singular value decomposition (SVD). Partek® Flow® wrapped Signac's TF-IDF normalization for single cell ATAC-seq dataset. It is a two-step normalization procedure that both normalizes across cells to correct for differences in cellular sequencing depth, and across peaks to give higher values to more rare peaks[3].

TF-IDF normalization in Flow can be invoked in Normalization and scaling section by clicking any single cell counts data node (Figure 10).

To run TF-IDF normalization

The output of TF-IDF normalization is a new data node that has been normalized by log(TF x IDF).

SVD (singular value decomposition )

Singular value decomposition (SVD) will be applied to TF-IDF output in scATAC-Seq data. It returns a reduced dimension representation of a matrix. Although SVD and Principal components analysis (PCA) are two different techniques, the SVD has a close connection to PCA. Because PCA is simply an application of the SVD.  For users who are more familiar with scRNA-Seq, you can think of SVD as analogous to the output of PCA. And similarly, the statistical interpretation of singular values is in the form of variance in the data explained by the various components.

To run SVD task, 

The GUI is simple and easy to understand. The SVD dialog is only asking to select the number of singular values to compute (Figure 11).  By default 100 singular values will be computed if users don't want to compute all of them.  However, the number could be adjusted manually or typed in directly.  Simply click the Finish button if you want to run the task as default.

The task report for SVD is similar to PCA. Its output will be used for downstream analysis and visualization, including Harmony and WNN.

Graph-based clustering

Graph-based clustering (Figure 12) identifies groups of similar cells using SVD values as the input. By including the informative SVDs, noise in the data set is excluded, improving the results of clustering. 

A new Graph-based clusters data and a Biomarkers data node will be generated. 

The Graph-based clustering result (Figure 13) lists the Total number of clusters and what proportion of cells fall into each cluster as well as Maximum modularity which is a measurement of the quality of the clustering result where optimal modularity is 1. The Biomarkers report (Figure 14) includes the top features for each graph-based cluster. It displays the top-10 genes that distinguish each cluster from the others. Download at the bottom right of the table can be used to view and save more features. These are calculated using an ANOVA test comparing the cells in each group to all the other cells, filtering to genes that are 1.5 fold upregulated, and sorting by ascending p-value. This ensures that the top-10 genes of each cluster are highly and disproportionately expressed in that cluster. 


UMAP

Similar to t-SNE, Uniform Manifold Approximation and Projection (UMAP) is a dimensional reduction technique. UMAP aims to preserve the essential high-dimensional structure and present it in a low-dimensional representation. UMAP is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets. 

To run UMAP (Figure 15):

UMAP produces a UMAP task node. Opening the task report launches a scatter plot showing the UMAP results. Each point on the plot is a cell for single cell data. The plot will open in 2D or 3D depending on the user preference.  

Promoter sum matrix

The Annotate regions task in Flow labels individual peaks as promoters for a particular gene if the peak falls 1000 bases upstream from a gene's transcription start site, or 100 bases downstream from a gene's transcription start site by default (Figure 9). A promoter sum for a given gene is the number of cut sites per cell that fall within all the peaks labeled as promoters (-1000bp ~ 100bp by default or user defined through Annotate regions) for that gene. Higher promoter sum values indicate higher chromatin accessibility in the promoter region [4]. 

Flow task Promoter sum matrix summarizes each promoter sum and outputs a cell x gene matrix. In the matrix, only genes that have peaks within its promoter region have been included. In Flow Promoter sum matrix can be invoked in the Peak analysis section by clicking the Annotated regions data node (Figure 16).

To run Promoter sum matrix in Flow, 

Once the task has been finished, a new data node will be produced where the promoter sum value for each feature can be used to color UMAP/t-SNE and to determine cell type with raw data. We recommend users normalize its output prior to color the UMAP just like the scRNA-seq data.

Classifying cells

Double-clicking the UMAP task node will open the task report in the Data Viewer. 

To classify a cell, just select it then click Classify selection in the Classify tool. 

For example, we can classify a cluster of cells expressing high levels of MS4A1 as B cells. 

Repeat the above steps to finish the other cell type classifications. To be able to use the classifications in downstream tasks and visualizations, you must first apply them.

Once the classifications have been added  to the project, one can color the UMAP/t-SNE plot by the Classification or compare the differentially expressed genes between different cell types.


Differential analysis

To identify genes that distinguish a cell type, one can use the differential analysis tools in Partek Flow.

Hurdle model produces a Feature list task node. The results table and options are the same as the GSA task report except the last two columns. The percentage of cells where the feature is detected (value is above the background threshold) in different groups (Pct(group1), Pct(group2)) are calculated and included in the Hurdle model report. 

A filtered Feature list data node can be produced by running the Differential analysis filter in the Hurdle model task report (Figure 21) . 

 

By default, selected cells are shown in bold while unselected cells are dimmed (Figure 21). This can be changed to gray selected cells using the Select & Filter tool in the left panel as shown in Figure 21.


Alternatively, you can select cells using any criteria available for the data node that is selected in the Select & Filter tool. To change the data selection click the circle (node) and select the data. 


This adds check boxes for each level of the attribute (i.e., clusters). Click a check box to select the cells with that attribute level. 

This selects cells from Graph-based clusters 2 and 3 (Figure 23). The number of selected cells is listed in the Legend on the plot. 


Cells can also be selected based on their gene expression values in the Select & Filter section. 

Very specific selections can be configured by adding criteria in this way. In the example below, Clusters 2 and 3 and high CD3D expression is selected (Figure 24). 


Filtering cells on the t-SNE scatter plot

Once a cell has been selected on the plot, it can be filtered. The filter controls can exclude or include (only) any selected cell. Filtering can be particularly useful when you want to use a gene expression threshold to classify a group of cells, but the gene in question is not exclusively expressed by your cell type of interest. 

In this example we can filter to include just cells from the selection we have already made.

The plot will update to show only the included cells as seen in Figure 25. 

Cells that are not shown on the plot cannot be selected, allowing you to focus on the visible cells. The number of cells shown on the plot out of the total number of original cells is shown in the Legend. You can adjust the view to focus on only the included cells.

To revert to the original scaling, click the  button again or turn off Fit visible with the toggle. 

Additional inclusion or exclusion filters can be added to focus on a smaller subset of cells. 

The plot will update to show all cells and return to the original scaling. 


Classifying cells

Classifying cells allows to you assign cells to groups that can be used in downstream analysis and visualizations. Commonly, this is used to describe cell types, such as B cells and T cells, but can be used to describe any group of cells that you want to consider together in your analysis, such as cycling cells or CD14 high expressing cells. Each cell can only belong to one class at a time so you cannot create overlapping classes. 

To classify a cell, just select it then click Classify selection in the Classify tool. 

For example, we can classify a cluster of cells expressing high levels of CD79A as B cells. 


Because most of these cells express CD79A, a B cell marker, and because they cluster together on the t-SNE, suggesting they have similar overall gene expression, we believe that all these cells are B cells.

The classification, B cells, is added to the Classifications section of the control panel and the number of cells in that classification is listed next to the name (Figure 29).


You can edit the name of a classification or delete it. The classifications you have made are saved as a working draft so if you close the plot and return to it, the classifications will still be there and can be visualized on the plot as "New classification". However, classifications are not available for downstream tasks until you apply them.  Continue classifying the clusters and save the Data viewer session until you are ready to apply the classification to the data project. 


To use the classifications in downstream tasks and visualizations, you must first apply them.

Once you have added a classification to the project, you can color the t-SNE plot by the Classification.

Here, I classified a few additional cell types using a combination of known marker genes and the clustering results then applied the classification (Figure 31). 


Summarize Classifications with the number and percentage of cells from each sample that belong to each classification using an Attribute table under New plot. This is particularly useful when you are classifying cells from multiple samples.




Comparing gene expression between cell types

A common goal in single cell analysis is to identify genes that distinguish a cell type. To do this, you can use the differential analysis tools in Partek Flow. I will show how to use the Gene Specific Analysis (GSA) test in Partek Flow, which on its default settings is equivalent to limma-trend, a statistical test shown to be highly effective for differential analysis of single cell RNA-Seq data (Soneson and Robinson 2018). 

The first page of the configuration dialog asks what attributes you want to include in the statistical test. Here, we only want to consider the Classifications, but in a more complex experiment, you could also include experimental conditions or other sample attributes. 

We will make a comparison between NK cells and all the other cell types to identify genes that distinguish NK cells. You can also use this tool to identify genes that differ between two cell types or genes that differ in the same cell type between experimental conditions. 

The top panel is the numerator for fold-change calculations so the experimental or test groups should be selected in the top panel.

The bottom panel is the denominator for fold-change calculations so the control group should be selected in the bottom panel.

This adds the comparison to the statistical test. 



The GSA task report lists genes on rows and the results of the statistical test (p-value, fold change, etc.) on columns (Figure 37). For more information, please see our documentation page on the GSA task report


Genes are listed in ascending order by the p-value of the first comparison so the most significant gene is listed first. To view a volcano plot for any comparison, click .  To view a violin plot for a gene, click  next to the Gene ID. 

 The Feature plot viewer will open showing a violin plot for KLRD1 (Figure 38). The violins are density plots with the width corresponding to frequency. 


You can switch the grouping of cells using the Group by drop-down menu. The order of groups can be adjusted by dragging groups up and down in the Group order panel. To navigate between genes in the table, click the Next > and Previous > buttons. 

The table lists all of genes in the data set; using the filter control panel on the left, we can filter to just the genes that are significantly different for the comparison.

Here, we are using a very stringent cutoff to focus only on genes that are specific to NK cells, but other applications may require a less stringent cutoff. 

The number of genes at the top of the filter control panel updates to indicate how many genes are left after the filters are applied. 


The GSA report will close and a new task, the Differential analysis filter, will run and generate a filtered Feature list data node. 

For more information about the GSA task, please see the Differential Gene Expression - GSA section of our user manual. 

Generating a heatmap

Once we have filtered to a list of significantly different genes, we can visualize these genes by generating a heatmap. 

The hierarchical clustering task will generate the heatmap; choose Heatmap as the plot type. You can choose to Cluster features (genes) and cells (samples) under Feature order and Cell order in the Ordering section. You will almost always want to cluster features as this generates the clear blocks of color that make heatmaps comprehensible. For single cell data sets, you may choose to forgo clustering the cells in favor of ordering them by the attribute of interest. Here, we will not filter the cells, but instead order them by their classification. 

You can filter samples using the Filtering section of the configuration dialog. Here, we will not filter out any samples or cells. 

It may initially be hard to distinguish striking differences in the heatmap. This is common in single cell RNA-Seq data because outlier cells will skew the high and low ends. We can adjust the minimum and maximum of the color scheme to improve the appearance of the heatmap.

Distinct blocks of red and blue are now more pronounced on the plot. Cells are on rows and genes are on columns. Because of the limited number of pixels on the screen, genes are grouped. You can zoom in using the zoom controls or your mouse wheel if you want to view individual gene rows. We can annotate the plot with cell attributes. 

The plot now includes blocks of color along the left edge indicating the classification of the cells. We can transpose the plot to give the cell labels a bit more space.

We can also customize the colors of the plot. Do this by clicking the Legend or Heatmap

The heatmap now shows a teal to yellow gradient with a black midpoint (Figure 41). 


As with any visualization in Partek Flow, the image can be saved as a publication-quality image to your local machine by clicking  or sent to a page in the project notebook by clicking . For more information about Hierarchical clustering, please see the Hierarchical Clustering section of the user manual. 

Performing enrichment analysis

While a long list of significantly different genes is important information about a cell type, it can be difficult to identify what the biological consequences of these changes might be just by looking at the genes one at a time. Using enrichment analysis, you can identify gene sets and pathways that are over-represented in a list of significant genes, providing clues to the biological meaning of your results.

We distribute the gene sets from the Gene Ontology Consortium, but Gene set enrichment can work with any custom or public gene set database. 

The Gene set enrichment task report lists gene sets on rows with an enrichment score and p-value for each. It also lists how many genes in the gene set were in the input gene list and how many were not (Figure 43). Clicking the Gene set ID links to the geneontology.org page for the gene set. 


In Partek Flow, you can also check for enrichment of KEGG pathways using the Pathway enrichment task. The task is quite similar to the Gene set enrichment task, but uses KEGG pathways as the gene sets. 

The task report is similar to the Gene set enrichment task report with enrichment scores, p-values, and the number of genes in and not in the list (Figure 44). 


Clicking the KEGG pathway ID in the Pathway enrichment task report opens a KEGG pathway map (Figure 45). The KEGG pathway maps have fold-change and p-value information from the input gene list overlaid on the map, adding a layer of additional information about whether the pathway was upregulated or downregulated in the comparison.


Color are customizable using the control panel on the left and the plot is interactive. Mousing over gene boxes gives the genes accounted for by the box, with genes present in the input list shown in bold, and the coloring gene shown in red (Figure 46).


Clicking a pathway box opens the map of that pathway, providing an easy way to explore related gene networks. 

Pipeline


For information about automating steps in this analysis workflow, please see our documentation page on Making a Pipeline

References

Soneson C and Robinson MD. Bias, robustness and scalability in single-cell differential expression analysis. Nature Methods 2018 Apr;15(4):255-261.