Partek Flow Documentation

...

Most single cell RNA-seq library prep kits compensate for the small quantity of starting material by PCR amplifying the reverse transcribed cDNA. Because some sequences amplify preferentially, the relative abundances of the final PCR-amplified molecules will diverge from those of the original molecules. To correct for these PCR artifacts, reverse transcribed molecules are tagged with unique nucleotide sequences, termed unique molecular identifiers (UMIs). These UMIs are retained through PCR amplification, allowing PCR products that were amplified from the same original molecule to be identified. Counting UMIs for each gene instead of reads allows the original number of molecules corresponding to each gene to be more faithfully represented.
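As an illustration of why UMI counts differ from read counts, the following minimal sketch tallies both for a handful of made-up reads from one cell. The gene names, UMI sequences, and the function name `count_reads_and_umis` are hypothetical and are not part of Partek Flow.

```python
from collections import defaultdict

# Hypothetical reads from a single cell, as (gene, UMI) pairs.
# Repeated pairs stand for PCR copies of the same original molecule.
reads = [
    ("GeneA", "AACGT"), ("GeneA", "AACGT"), ("GeneA", "AACGT"),
    ("GeneA", "TTGCA"),
    ("GeneB", "GGATC"), ("GeneB", "CCTAG"),
]


def count_reads_and_umis(reads):
    """Return {gene: (read_count, unique_umi_count)}."""
    read_counts = defaultdict(int)
    umis = defaultdict(set)
    for gene, umi in reads:
        read_counts[gene] += 1
        umis[gene].add(umi)
    return {gene: (read_counts[gene], len(umis[gene])) for gene in read_counts}


counts = count_reads_and_umis(reads)
print(counts)
```

By read count, GeneA appears twice as abundant as GeneB (4 reads versus 2). By UMI count, both genes are represented by two original molecules.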

...

An important consideration when analyzing UMI data is that errors can be introduced into the UMIs themselves during PCR amplification of the original molecule. If these errors are not accounted for and each sequenced UMI is treated as an original UMI, the number of unique molecules can be significantly overestimated. To account for this, the UMI deduplication task uses an implementation of the UMI-tools algorithm described in Smith et al. 2017.

The task works by first partitioning reads into groups. Reads are grouped if they align to the same genomic position, are on the same strand, and any barcodes present match within an edit distance of two.
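The grouping step can be sketched as follows. This is only an illustration under stated assumptions, not the Partek Flow implementation: reads are plain dictionaries with hypothetical keys (`chrom`, `pos`, `strand`, `barcode`), and a read joins the first existing group at its position whose representative barcode is within the allowed edit distance.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def group_reads(reads, max_barcode_distance=2):
    """Group reads by (chrom, pos, strand), then by near-matching barcode.

    Assumption: greedy assignment to the first group whose representative
    barcode is within max_barcode_distance; otherwise a new group is started.
    Returns {(chrom, pos, strand): [(representative_barcode, [reads])]}.
    """
    by_position = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        for barcode, members in by_position[key]:
            if edit_distance(barcode, read["barcode"]) <= max_barcode_distance:
                members.append(read)
                break
        else:
            by_position[key].append((read["barcode"], [read]))
    return by_position


# Hypothetical example reads.
reads = [
    {"chrom": "chr1", "pos": 100, "strand": "+", "barcode": "ACGTACGT"},
    {"chrom": "chr1", "pos": 100, "strand": "+", "barcode": "ACGTACGA"},
    {"chrom": "chr1", "pos": 100, "strand": "+", "barcode": "TTTTCCCC"},
    {"chrom": "chr1", "pos": 100, "strand": "-", "barcode": "ACGTACGT"},
    {"chrom": "chr1", "pos": 250, "strand": "+", "barcode": "ACGTACGT"},
]
groups = group_reads(reads)
for key, barcode_groups in groups.items():
    print(key, [(bc, len(members)) for bc, members in barcode_groups])
```

In this example the first two reads share a group because their barcodes differ by a single base, while the third read at the same position starts its own group.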

...

Deduplicate UMIs has an alternative setting to more closely match the methods used by CellRanger: Retain only one alignment per UMI. Selecting this option changes how the task functions and requires that you specify the genome assembly and gene/feature annotation.

The algorithm first checks whether each aligned read is compatible with a transcript in the annotation file. Here, compatible is defined as 50% or more of the aligned read sequence overlapping the transcript; strand is not considered. Aligned reads that are not compatible with a transcript are discarded.
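The compatibility check could look roughly like the sketch below. Several details are assumptions made for illustration: coordinates are half-open intervals, each read is a single contiguous aligned block, each transcript is a list of non-overlapping exon intervals, and overlap is measured against exons. The function names and the `transcripts` structure are hypothetical.

```python
def overlap_fraction(read_start, read_end, exons):
    """Fraction of the read's aligned bases covered by the given exons.

    Coordinates are half-open [start, end); exons are assumed non-overlapping.
    """
    read_length = read_end - read_start
    if read_length <= 0:
        raise ValueError("read must have positive length")
    covered = 0
    for exon_start, exon_end in exons:
        covered += max(0, min(read_end, exon_end) - max(read_start, exon_start))
    return covered / read_length


def is_compatible(read_start, read_end, transcripts, min_fraction=0.5):
    """True if the read overlaps any transcript by at least min_fraction.

    Strand is deliberately not an argument, since it is not considered.
    """
    return any(
        overlap_fraction(read_start, read_end, exons) >= min_fraction
        for exons in transcripts.values()
    )


# Hypothetical annotation: one transcript with two exons.
transcripts = {"tx1": [(100, 200), (300, 400)]}

print(is_compatible(150, 250, transcripts))  # 50 of 100 bases overlap: kept
print(is_compatible(180, 280, transcripts))  # 20 of 100 bases overlap: discarded
```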

The occurrence of each barcode and UMI combination is counted. 

UMIs within a Levenshtein distance of 1 are grouped.

The UMI within each UMI group with the highest number of reads is reported and other UMIs within the group are filtered out. If two UMIs within the group have the same number of reads, the UMI whose sequence has the lowest ASCII value is used.
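The counting, grouping, and selection steps above can be sketched as follows. This is a simplified illustration, not the Partek Flow or CellRanger implementation. It assumes a greedy grouping in which UMIs are visited from highest to lowest read count (ties in ascending ASCII order) and each UMI joins the first representative within a Levenshtein distance of 1. The names `within_one_edit` and `collapse_umis` are hypothetical, and per-gene handling is not modeled.

```python
from collections import Counter


def within_one_edit(a, b):
    """True if the Levenshtein distance between a and b is at most 1."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the shared prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # exactly one substitution
    return a[i:] == b[i + 1:]          # exactly one inserted character in b


def collapse_umis(observations):
    """Pick one representative UMI per group for each cell barcode.

    observations: iterable of (barcode, umi) pairs, one per retained read.
    Returns {barcode: {representative_umi: total_reads_in_group}}.
    """
    counts = Counter(observations)  # occurrences of each barcode/UMI pair
    by_barcode = {}
    for (barcode, umi), n in counts.items():
        by_barcode.setdefault(barcode, {})[umi] = n

    result = {}
    for barcode, umi_counts in by_barcode.items():
        # Highest read count first; ties broken by lowest ASCII order.
        ranked = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
        groups = {}
        for umi in ranked:
            for rep in groups:
                if within_one_edit(rep, umi):
                    groups[rep] += umi_counts[umi]
                    break
            else:
                groups[umi] = umi_counts[umi]  # umi becomes a representative
        result[barcode] = groups
    return result


# Hypothetical observations: (cell barcode, UMI) for each retained read.
observations = (
    [("CELL1", "AAAA")] * 3 + [("CELL1", "AAAT")]
    + [("CELL1", "GGGG")] * 2 + [("CELL1", "GGGC")] * 2
    + [("CELL2", "AAAA")]
)
collapsed = collapse_umis(observations)
print(collapsed)
```

In this example, `AAAT` is folded into `AAAA`, which has more reads. `GGGG` and `GGGC` are tied at two reads each, so `GGGC` is reported because it sorts first in ASCII order.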

This method is also similar to the default method in the Drop-seq cookbook, which collapses UMI barcodes with a Hamming distance of 1. 

...