...

The UMI deduplication task in Partek Flow identifies reads with matching UMIs and consolidates them into a single aligned read for use in quantification.

Default behavior

An important consideration when analyzing UMI data is that errors are introduced into the UMIs themselves during PCR amplification of the original molecule. If these errors are not accounted for and each sequenced UMI is treated as representative of an original UMI, the number of unique molecules can be significantly overestimated. To account for this, the UMI deduplication task uses an implementation of the UMI-tools algorithm described in Smith et al. 2017.

The task works by first partitioning reads into groups. Reads are placed in the same group if they align to the same genomic position, align to the same strand, and any barcodes present match within an edit distance of two.
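The grouping step can be illustrated with a minimal Python sketch. This is not Partek Flow's implementation: the Read tuple, the group_reads function, and the greedy rule of comparing each read's barcode to the first barcode seen in a group are all assumptions made for illustration, since the documentation does not say how barcodes are matched.

```python
from typing import NamedTuple


class Read(NamedTuple):  # hypothetical record, for illustration only
    chrom: str
    pos: int
    strand: str
    barcode: str
    umi: str


def levenshtein(a, b):
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def group_reads(reads, max_barcode_distance=2):
    """Group reads by position, strand and approximately matching barcode.

    Assumption: a read joins the first group whose representative barcode
    (the barcode of the group's first read) is within the allowed distance.
    """
    groups = []  # list of ((chrom, pos, strand, barcode), [reads])
    for read in reads:
        for (chrom, pos, strand, barcode), members in groups:
            same_site = (read.chrom, read.pos, read.strand) == (chrom, pos, strand)
            if same_site and levenshtein(read.barcode, barcode) <= max_barcode_distance:
                members.append(read)
                break
        else:
            key = (read.chrom, read.pos, read.strand, read.barcode)
            groups.append((key, [read]))
    return groups


reads = [
    Read("chr1", 100, "+", "AAAA", "ACGT"),
    Read("chr1", 100, "+", "AAAT", "ACGA"),  # barcode 1 edit away: same group
    Read("chr1", 100, "-", "AAAA", "ACGT"),  # other strand: new group
    Read("chr1", 100, "+", "TTTT", "ACGT"),  # barcode 4 edits away: new group
]
print([len(members) for _, members in group_reads(reads)])  # [2, 1, 1]
```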

Within each group, sequenced UMIs are analyzed to determine whether they originated from the same UMI. To do this, UMIs are clustered. The UMI that has the most reads is used as the seed for the first cluster. The seed UMI is connected to all UMIs within an edit distance of one that have fewer reads than it to form a cluster. Every UMI within the cluster then serves as the seed for a subsequent round of connection, again connecting each seed UMI to all UMIs within an edit distance of one that have fewer reads than the seed UMI. Additional rounds of connection are performed until no more UMIs can be incorporated into the cluster. The unclustered UMI with the highest number of reads is then chosen as the seed for a second cluster and the same clustering procedure is repeated. This process of clustering continues until all UMIs in the group have been assigned to a cluster.
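The clustering procedure described above can be sketched in Python as follows. This is only an illustration of the description on this page, not the UMI-tools or Partek Flow code. The function names are hypothetical, UMIs are assumed to be of equal length (so an edit distance of one reduces to a single mismatch), and ties between seeds with equal read counts are broken alphabetically, which the documentation does not specify.

```python
def within_one_edit(a, b):
    """True if two equal-length UMIs differ at no more than one position.

    Assumption: UMIs of different lengths are never treated as neighbors.
    """
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


def cluster_umis(umi_counts):
    """Cluster UMIs given a mapping of UMI sequence to read count."""
    # Highest read count first; alphabetical tie-break is an assumption.
    order = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    assigned = set()
    clusters = []
    for seed in order:
        if seed in assigned:
            continue
        assigned.add(seed)
        cluster = [seed]
        frontier = [seed]  # UMIs that still act as seeds for another round
        while frontier:
            current = frontier.pop()
            for other in order:
                if other in assigned:
                    continue
                # Connect only to neighbors with strictly fewer reads.
                if umi_counts[other] < umi_counts[current] and within_one_edit(current, other):
                    assigned.add(other)
                    cluster.append(other)
                    frontier.append(other)
        clusters.append(cluster)
    return clusters


counts = {"AAAA": 10, "AAAT": 4, "AATT": 2, "CCCC": 5, "CCCG": 5}
print(cluster_umis(counts))
# [['AAAA', 'AAAT', 'AATT'], ['CCCC'], ['CCCG']]
```

In this example AATT is two edits away from AAAA but still joins its cluster through AAAT, while CCCC and CCCG stay separate because neither has fewer reads than the other.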

...

Once the clusters have been identified, a consensus read is generated for each cluster. To begin, any reads that do not match the common CIGAR string for their cluster are discarded. From the remaining reads, the percentage of each base at each position is determined. If a base is present in more than 60% of reads, it is used in the consensus read. Otherwise, N is used. The base quality score for each position in the consensus read is the maximum score at that position among the contributing reads.
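A minimal Python sketch of the consensus step is shown below. It is an illustration under stated assumptions rather than the actual implementation: the function name and the (sequence, qualities, CIGAR) tuple layout are hypothetical, the "common CIGAR string" is taken to be the most frequent one in the cluster, and the quality maximum is taken over all retained reads.

```python
from collections import Counter


def consensus_read(reads, threshold=0.6):
    """Build a consensus from (sequence, qualities, cigar) tuples."""
    # Assumption: the "common" CIGAR is the most frequent one in the cluster.
    common_cigar = Counter(cigar for _, _, cigar in reads).most_common(1)[0][0]
    kept = [(seq, quals) for seq, quals, cigar in reads if cigar == common_cigar]

    bases, qualities = [], []
    # Reads sharing a CIGAR string have the same length.
    for i in range(len(kept[0][0])):
        base, count = Counter(seq[i] for seq, _ in kept).most_common(1)[0]
        # Use the base only if it is present in more than 60% of reads.
        bases.append(base if count / len(kept) > threshold else "N")
        # Assumption: maximum quality over all retained reads at this position.
        qualities.append(max(quals[i] for _, quals in kept))
    return "".join(bases), qualities


cluster = [
    ("ACGT", [30, 30, 30, 30], "4M"),
    ("ACGT", [20, 35, 20, 20], "4M"),
    ("ACTT", [10, 10, 40, 10], "4M"),
    ("ACG", [40, 40, 40], "3M"),  # discarded: CIGAR differs from the common one
]
print(consensus_read(cluster))  # ('ACGT', [30, 35, 40, 30])
```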

Retain only one alignment per UMI

Deduplicate UMIs has an alternative setting, Retain only one alignment per UMI, that more closely matches the results provided by CellRanger. Selecting this option changes how the task functions and requires that you specify the genome assembly and gene/feature annotation.

The algorithm checks whether each aligned read is compatible with a transcript in the annotation file. Here, a read is considered compatible if 50% or more of its aligned sequence overlaps the transcript. Strand is not considered.

The occurrence of each barcode and UMI combination is counted.

UMIs within a Levenshtein distance of 1 of each other are grouped.

Within each UMI group, the UMI with the highest number of reads is reported and the other UMIs in the group are filtered out. If two UMIs within the group have the same number of reads, the UMI with the lowest ASCII value is used.
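The steps above can be sketched in Python as follows. This is a hedged illustration, not the Partek Flow or CellRanger code. All function names and input layouts are hypothetical. The overlap is measured against a transcript's exons given as half-open intervals, and UMIs are grouped greedily around the most abundant UMI without chaining; both choices are assumptions, because the documentation does not spell out these details.

```python
from collections import Counter, defaultdict


def levenshtein(a, b):
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def is_compatible(read_blocks, exons, min_fraction=0.5):
    """True if at least half of the aligned read bases overlap the exons.

    Both arguments are lists of half-open (start, end) intervals; exons of
    one transcript are assumed not to overlap each other.
    """
    aligned = sum(end - start for start, end in read_blocks)
    overlap = sum(max(0, min(r_end, e_end) - max(r_start, e_start))
                  for r_start, r_end in read_blocks
                  for e_start, e_end in exons)
    return overlap / aligned >= min_fraction


def retain_one_per_umi(pairs):
    """Given (barcode, UMI) pairs, one per read, return the UMIs kept per barcode."""
    per_barcode = defaultdict(Counter)
    for barcode, umi in pairs:
        per_barcode[barcode][umi] += 1  # count each barcode and UMI combination

    kept = {}
    for barcode, counts in per_barcode.items():
        # Most reads first; string order breaks ties by lowest ASCII value.
        order = sorted(counts, key=lambda u: (-counts[u], u))
        assigned, kept[barcode] = set(), []
        for umi in order:
            if umi in assigned:
                continue
            assigned.add(umi)
            kept[barcode].append(umi)
            # Absorb remaining UMIs within a Levenshtein distance of 1.
            assigned.update(o for o in order
                            if o not in assigned and levenshtein(umi, o) <= 1)
    return kept


pairs = ([("BC1", "AAAA")] * 3 + [("BC1", "AAAT")] * 3
         + [("BC1", "AATT")] + [("BC2", "AAAT")] * 2)
print(retain_one_per_umi(pairs))  # {'BC1': ['AAAA', 'AATT'], 'BC2': ['AAAT']}
print(is_compatible([(100, 200)], [(150, 300)]))  # True: exactly 50% overlaps
```

In the example, AAAA and AAAT have three reads each, so AAAA is reported because it has the lower ASCII value. AATT is two edits away from AAAA and is therefore reported separately.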

This method is similar to the default method in the Drop-seq cookbook, which collapses UMI barcodes with a Hamming distance of 1.
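The two distance measures differ only in which edits they count: Hamming distance counts substitutions between equal-length sequences, while Levenshtein distance also counts insertions and deletions. For UMIs of equal length, a threshold of 1 selects the same pairs under either measure, because a single insertion or deletion would change the length. The short Python sketch below illustrates this; the helper functions are written for this example and are not taken from either tool.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# A single substitution: both measures agree.
print(hamming("ACGT", "ACGA"), levenshtein("ACGT", "ACGA"))  # 1 1

# A one-base shift: every position mismatches, but only two edits are needed.
print(hamming("ACGT", "CGTA"), levenshtein("ACGT", "CGTA"))  # 4 2
```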

This method outputs more UMIs than the default behavior because only UMIs within an edit distance of 1 are collapsed, whereas the UMI-tools method can link UMIs separated by a greater distance through intermediate UMIs. For a comparison of the two approaches, please see the Adjacency (CellRanger) and Directional (UMI-tools) methods in Smith et al. 2017.

Additional assistance

Partek Flow Documentation