Partek Flow Documentation


Identifying reads with matching UMIs and consolidating them into a single aligned read for use in quantification is handled by the UMI deduplication task in Partek Flow. 

An important consideration when analyzing UMI data is that errors can be introduced into the UMIs themselves during PCR amplification of the original molecule. If these errors are not accounted for and each sequenced UMI is treated as representative of an original UMI, the number of unique molecules can be significantly overestimated. To account for this, the UMI deduplication task uses an implementation of the UMI-tools algorithm described in Smith et al. 2017.

The task works by first partitioning reads into groups. Reads are placed in the same group if they align to the same genomic position, map to the same strand, and have matching barcodes (if any barcodes are present).
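The grouping step can be illustrated with a minimal Python sketch. This is not the Partek Flow implementation; the Read record and its field names (chrom, pos, strand, barcode, umi) and the group_reads function are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal record for an aligned read (illustration only)."""
    chrom: str
    pos: int
    strand: str
    barcode: str
    umi: str


def group_reads(reads):
    """Partition reads by genomic position, strand, and barcode."""
    groups = defaultdict(list)
    for read in reads:
        # Reads share a group only if every part of the key matches.
        key = (read.chrom, read.pos, read.strand, read.barcode)
        groups[key].append(read)
    return dict(groups)


reads = [
    Read("chr1", 100, "+", "CELL1", "ACGT"),
    Read("chr1", 100, "+", "CELL1", "ACGA"),  # same group as the first read
    Read("chr1", 100, "-", "CELL1", "ACGT"),  # different strand
    Read("chr1", 100, "+", "CELL2", "ACGT"),  # different barcode
]
groups = group_reads(reads)
```

In this example the four reads fall into three groups, and only the first group contains more than one UMI to compare.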

Within each group, sequenced UMIs are analyzed to determine whether they originated from the same UMI. To do this, UMIs are clustered. The UMI that has the most reads is used as the seed for the first cluster. The seed UMI is connected to all UMIs within an edit distance of one that have fewer reads than it to form a cluster. Every UMI within the cluster then serves as the seed for a subsequent round of connection, again connecting each seed UMI to all UMIs within an edit distance of one that have fewer reads than the seed UMI. This clustering process continues until no more UMIs can be incorporated into the cluster. The unclustered UMI with the highest number of reads is then chosen as the seed for a second cluster and the same clustering procedure is repeated. This continues until all UMIs in the group have been assigned to a cluster.
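The clustering procedure described above can be sketched in Python as follows. This is a simplified illustration, not the Partek Flow or UMI-tools code. It assumes all UMIs in a group have the same length, so that edit distance can be computed as the number of mismatched positions, and it breaks ties between equally abundant UMIs alphabetically, which is an arbitrary choice made here for reproducibility. The names hamming and cluster_umis are hypothetical.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatched positions; assumes equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(umi_counts):
    """Cluster UMIs directionally, from more abundant to less abundant.

    umi_counts maps each sequenced UMI to its read count within one group.
    Returns a list of clusters, each beginning with its seed UMI.
    """
    unassigned = set(umi_counts)
    clusters = []
    # Visit candidate seeds from most to fewest reads (ties: alphabetical).
    for seed in sorted(umi_counts, key=lambda u: (-umi_counts[u], u)):
        if seed not in unassigned:
            continue  # already absorbed into an earlier cluster
        unassigned.remove(seed)
        cluster = [seed]
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            # Connect to UMIs one edit away that have strictly fewer reads.
            neighbors = sorted(
                u for u in unassigned
                if hamming(current, u) == 1
                and umi_counts[u] < umi_counts[current]
            )
            for u in neighbors:
                unassigned.remove(u)
                cluster.append(u)
                queue.append(u)  # each member seeds a further round
        clusters.append(cluster)
    return clusters


counts = {"ACGT": 100, "ACGA": 10, "ACCA": 2, "TTTT": 50, "TTTA": 50}
clusters = cluster_umis(counts)
# clusters == [["ACGT", "ACGA", "ACCA"], ["TTTA"], ["TTTT"]]
```

Here ACCA is two edits away from ACGT but is still pulled into the first cluster through ACGA. TTTT and TTTA differ by one base but have equal read counts, so neither has fewer reads than the other and they remain separate clusters.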

This process of directional clustering has two important benefits. First, it corrects for PCR errors by grouping sequenced UMIs with highly similar sequences so that they can be counted as a single UMI. Second, it recognizes that PCR errors that arise in later cycles will be present in lower quantities. This is why clustering proceeds directionally, connecting UMIs with more reads to UMIs with fewer reads.

Once the clusters have been identified, a consensus read for the cluster is generated. To begin, any reads that do not match the common CIGAR string for their cluster are discarded. From the remaining reads, the percentage of each base at each position is determined. If a base is present in over 60% of reads, it is used in the consensus read. Otherwise, N is used. The base quality score for each position in the consensus read is the maximum at each position from the contributing reads.
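The consensus step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual implementation: each read is represented as a hypothetical (sequence, qualities, cigar) tuple, the "common" CIGAR string is taken to be the most frequent one in the cluster, and "contributing reads" is taken to mean all reads that remain after the CIGAR filter. The function name consensus_read and the threshold parameter are also hypothetical.

```python
from collections import Counter


def consensus_read(reads, threshold=0.6):
    """Build a consensus from (sequence, qualities, cigar) tuples.

    Returns (consensus_sequence, consensus_qualities, cigar).
    """
    # Keep only reads whose CIGAR matches the most frequent CIGAR string.
    common_cigar = Counter(cigar for _, _, cigar in reads).most_common(1)[0][0]
    kept = [(seq, quals) for seq, quals, cigar in reads if cigar == common_cigar]

    bases = []
    quals_out = []
    # Reads with identical CIGAR strings have identical sequence lengths.
    for i in range(len(kept[0][0])):
        column = Counter(seq[i] for seq, _ in kept)
        base, count = column.most_common(1)[0]
        # Use the base only if it is present in over 60% of reads.
        bases.append(base if count / len(kept) > threshold else "N")
        # Quality is the maximum at this position across the kept reads.
        quals_out.append(max(quals[i] for _, quals in kept))
    return "".join(bases), quals_out, common_cigar


reads = [
    ("ACGT", [30, 30, 30, 30], "4M"),
    ("ACGT", [20, 35, 30, 10], "4M"),
    ("ACTA", [30, 30, 40, 30], "4M"),
    ("AGGA", [30, 30, 30, 30], "4M"),
    ("ACG", [30, 30, 30], "3M"),  # discarded: CIGAR differs from the common one
]
result = consensus_read(reads)
# result == ("ACGN", [30, 35, 40, 30], "4M")
```

In this example the first three positions each have a base present in 75% of the retained reads, while the last position is split evenly between T and A and is therefore reported as N.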

The Deduplicate UMIs task has an alternative setting, Retain only one alignment per UMI, that more closely matches the results provided by CellRanger.

Selecting this option changes.



