Pre-alignment Tools

Partek® Flow® provides Pre-alignment tools that let the user process next-generation sequencing data before proceeding to alignment. These tools are useful not only for controlling the quality of the data but also for subsampling before the full dataset is analyzed. Three functions are available in Pre-alignment tools:

Trim bases
Trim adapters
Subsample FASTQ

Users are expected to have a preliminary understanding of:

File formats for next generation sequencing data
Phred-quality score

Showing Pre-alignment tools

To show the Pre-alignment tools, select an Unaligned reads or Trimmed reads data node. The tools appear in the context-sensitive menu on the right of the screen (Figure 1).

Figure 1. Showing Pre-alignment Tools from an unaligned reads node

Different Pre-alignment tools are available for different formats of unaligned reads. For example, if the reads are in FASTQ format, all three tools are available. If the unaligned reads are in FASTA or SFF format, the Subsample FASTQ option is not available.

Trim Bases

The Trim bases task trims bases from the 5'-end or 3'-end of the reads. The most common reason to use Trim bases is to remove poor-quality bases from the reads before alignment, because these bases can reduce the alignment rate.

The task allows the user to trim reads in different ways, including:

Trim bases from 3'-end
Trim bases from 5'-end
Trim bases from both ends
Trim bases based on quality score

Trim bases from 5'-end (Figure 2) or 3'-end (Figure 3) removes a fixed number of bases from the 5'- or 3'-end of the reads. These two functions are useful when the read length is constant. They are not recommended if the read length varies, since good-quality bases from shorter reads are likely to be trimmed away.

Figure 2. Trim bases from 5'-end

Figure 3. Trim bases from 3'-end

Trim bases from both ends (Figure 4) allows the user to keep only the bases between a fixed start and end position of the reads. This is particularly useful if poor-quality bases are observed on both ends of the reads. Instead of running Trim bases twice, once from the 5'-end and once from the 3'-end, the trimming is performed in a single pass on both ends.

Figure 4. Trim bases from both ends
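
The fixed-position trimming modes described above can be illustrated with a minimal Python sketch. The function names trim_fixed and keep_window are hypothetical and are not part of Partek Flow; the sketch only mirrors the behavior described in the text and assumes that positions in the "both ends" mode are 1-based and inclusive.

```python
def trim_fixed(seq, qual, n5=0, n3=0):
    """Remove n5 bases from the 5'-end and n3 bases from the 3'-end.

    seq and qual are the sequence and quality strings of one read.
    (Hypothetical helper, not a Partek Flow function.)
    """
    if n5 < 0 or n3 < 0:
        raise ValueError("trim counts must be non-negative")
    # If the read is shorter than n5 + n3, nothing is left.
    end = max(len(seq) - n3, n5)
    return seq[n5:end], qual[n5:end]


def keep_window(seq, qual, start, end):
    """Keep only bases from position start through end.

    Assumption: positions are 1-based and inclusive.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return seq[start - 1:end], qual[start - 1:end]


if __name__ == "__main__":
    print(trim_fixed("ACGTACGTAC", "ABCDEFGHIJ", n5=2, n3=3))
    print(keep_window("ACGTACGTAC", "ABCDEFGHIJ", 3, 8))
```

With a fixed 3'-trim of 5 bases, a 30-base read keeps 25 bases but a 12-base read keeps only 7, which illustrates why fixed trimming is not recommended when the read length varies.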

Trim bases based on quality score (Figure 5) is probably the most useful function for trimming poor-quality bases from the 5'- or 3'-ends of reads. This function trims bases dynamically, depending on their quality scores. The trimming can be done from the 5'-end, the 3'-end or both ends of the reads. Starting from the end of the read, the function evaluates each base and trims it away until it reaches a base with a quality score greater than the specified threshold. For an extensive evaluation of read trimming effects on Illumina NGS data analysis, see Del Fabbro et al. [1].

Figure 5. Trim bases based on quality score
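
The quality-based trimming rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name quality_trim is hypothetical, quality scores are assumed to be already decoded to integers, and the exact comparison used by Partek Flow may differ in detail.

```python
def quality_trim(seq, quals, threshold, trim5=True, trim3=True):
    """Trim low-quality bases from the read ends.

    seq       -- read sequence (string)
    quals     -- list of integer quality scores, one per base
    threshold -- trimming stops at the first base with score > threshold
    (Hypothetical helper, not a Partek Flow function.)
    """
    start, end = 0, len(seq)
    if trim3:
        # Walk inwards from the 3'-end while the score is not above threshold.
        while end > start and quals[end - 1] <= threshold:
            end -= 1
    if trim5:
        # Walk inwards from the 5'-end while the score is not above threshold.
        while start < end and quals[start] <= threshold:
            start += 1
    return seq[start:end], quals[start:end]


if __name__ == "__main__":
    read = "ACGTACGT"
    scores = [2, 10, 30, 35, 38, 30, 12, 5]
    print(quality_trim(read, scores, threshold=20))
```

Note that in this sketch only the ends are trimmed: a low-quality base in the middle of a read is kept, because trimming stops at the first base whose score exceeds the threshold.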

Advanced options

In some cases, base trimming produces very short reads, which are not recommended for alignment. Partek Flow therefore provides the option to set a Min read length after base trimming. Reads that are shorter than the set length are discarded.

Reads can also contain a high percentage of N's (ambiguous bases). The Max N setting discards reads whose percentage of N's is higher than the set threshold.
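
The two filters above can be expressed as a short Python sketch. The function name passes_filters and its parameter names are hypothetical; the sketch assumes the Max N threshold is a percentage of the read length after trimming.

```python
def passes_filters(seq, min_read_length, max_n_percent):
    """Return True if a trimmed read should be kept.

    A read is discarded if it is shorter than min_read_length or if its
    percentage of N bases is higher than max_n_percent.
    (Hypothetical helper, not a Partek Flow function.)
    """
    if len(seq) < min_read_length:
        return False
    if not seq:
        # An empty read has no bases to evaluate; keep it only if allowed above.
        return True
    n_percent = 100.0 * seq.upper().count("N") / len(seq)
    return n_percent <= max_n_percent


if __name__ == "__main__":
    reads = ["ACGT" * 7, "ACGT" * 5, "N" * 5 + "A" * 25]
    kept = [r for r in reads if passes_filters(r, min_read_length=25, max_n_percent=10)]
    print(len(kept), "of", len(reads), "reads kept")
```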

The Quality encoding option refers to how the Phred quality scores are encoded in the FASTQ input file. The available options are Phred+33, Phred+64, Solexa+64 and Integers. Selecting Auto-detect will determine whether the quality encoding is Phred+33 or Phred+64. For Solexa data, you will need to select Solexa+64. For most datasets, the Auto-detect option works very well; the few exceptions are cases where the base quality scores fall into the grey (ambiguous) zone between Phred+33 and Phred+64. If the quality-encoding scheme is known, we recommend selecting the encoding format directly from the Quality encoding list.
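
To illustrate why an ambiguous zone exists, the following Python sketch guesses the encoding from the range of ASCII codes seen in the quality strings. It is a simplified heuristic and not Partek Flow's detection method. The function names are hypothetical, and the upper cutoff of 74 (the character 'J', which is Q41 in Phred+33) is an assumption that holds for typical Illumina data but not for every platform.

```python
def guess_quality_encoding(quality_strings):
    """Guess Phred+33 vs Phred+64 from the ASCII range of quality characters.

    Returns "Phred+33", "Phred+64" or "ambiguous".
    (Hypothetical heuristic for illustration only.)
    """
    codes = [ord(c) for q in quality_strings for c in q]
    if not codes:
        return "ambiguous"
    lo, hi = min(codes), max(codes)
    if lo < 59:
        # Characters below ';' (59) cannot occur in Phred+64 or Solexa+64.
        return "Phred+33"
    if lo >= 64 and hi > 74:
        # Assumption: Phred+33 scores rarely exceed 'J' (74) in Illumina data.
        return "Phred+64"
    # Everything else, including a possible Solexa+64 range (59-63), is left
    # undecided; in that case the encoding should be selected manually.
    return "ambiguous"


def decode_quality(quality_string, offset):
    """Convert a quality string to integer scores using the given offset."""
    return [ord(c) - offset for c in quality_string]


if __name__ == "__main__":
    print(guess_quality_encoding(["!!##IIII"]))
    print(guess_quality_encoding(["hhhhgfdB"]))
    print(guess_quality_encoding(["FFFFGGHH"]))
```

A file whose quality characters all lie between '@' (64) and 'J' (74) could be high-quality Phred+33 data or low-quality Phred+64 data, which is the ambiguous case in which selecting the encoding directly is the safer choice.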

Figure 6 shows the options available for each of the Trim bases functions. Note that the default Min read length is 25 bp. For microRNA sequencing data, Min read length needs to be set to a smaller value (we recommend 15) to account for mature microRNAs.

Figure 6. Trim bases options. A) Trim from 3'-end; B) Trim from 5'-end; C) Trim from both ends; D) Trim based on base quality score

Trim Bases Task Details Page

The Task Details page for Trim bases can be accessed by selecting the Trim bases task node and then selecting Task Details from the Task results section. The Task Details page contains several sections:

General task information
- contains information such as the task name, owner, status, submitted time, start, end and duration of the task
Output Files
- contains the description of each output file. If you hover the mouse cursor over a file name, you will see the exact location of the file on the server. If you click a file name, you will have the option to view up to 999 lines of the raw data. You can also download the file from the server
Input Files
- contains the information of input files. This section lists all the input files used in the Trim bases task
Input Parameters
- contains the parameters used for running the Trim bases function. This section shows which option was selected for the Trim bases task and includes all the parameters used, such as minimum read length, maximum percentage of N's, quality encoding, quality score threshold (if applicable) and how trimming was performed
Command Lines
- shows the commands that Partek Flow used to run the Trim bases function

Trim Bases Task Report Page

The Trim bases Task Report page can be accessed by selecting either the Trim bases task node or Trimmed reads data node and then selecting the Task Report from the Task results section of the task pane. There is a link at the bottom of the page to directly go to the Task Details page. The page displays the following components:

Summary table
- gives the total number of reads in each sample, the total number of reads trimmed (i.e. with at least one base trimmed from the read), total number of reads removed (due to Min read length and Max N parameters), the average number of bases trimmed per read, the average read quality before trim bases and finally the average read quality after trim bases
Stacked bar-chart
- shows the percentage of untrimmed reads, trimmed reads and removed reads for each sample, so that all the samples can be compared
Average base quality score per position of trimmed reads
- shows the average base quality score at each position of the trimmed reads for all samples in the project

Trim Bases Output Files

The Trim bases function produces trimmed unaligned reads in a Trimmed Reads data node. The files in the Trimmed Reads node have the word "trimmed" appended to the filename. The Trimmed Reads data can be downloaded by selecting the Trimmed Reads node and then selecting Download data from the task pane. Alternatively, if you have access to the Partek Flow server, you can go to the Task Details page and find the location of the output files in the Output Files section, as described in the Trim Bases Task Details Page section above. The Trimmed Reads data node has the same format as the raw data.

Trim Adapters

The presence of adapter sequences at the 5'-end or 3'-end of the reads has been shown to be one of the major problems during alignment, causing reads to remain unaligned. Removing adapter sequences is therefore essential if the sequenced read length is longer than the molecule of interest, such as microRNA. Because mature microRNAs are short, it is almost certain that the adapter sequence will be sequenced at the 3'-end of the miRNA.

To check whether microRNA data has already been adapter-trimmed, look at the pre-alignment QA/QC of the raw data, specifically the read length distribution. If the read length distribution peaks at approximately 22-23 bases, this usually means the data has been adapter-trimmed. However, if all reads have the same length, the data is very likely not adapter-trimmed, and you will need to get the adapter sequence from your vendor or service provider and use the Trim adapters function to trim away the adapter sequence.
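
Outside of the pre-alignment QA/QC report, the same check can be approximated with a few lines of Python. This is a minimal sketch, assuming an uncompressed FASTQ file with exactly four lines per record; the function names are hypothetical.

```python
from collections import Counter


def read_length_distribution(fastq_lines):
    """Count read lengths from an iterable of FASTQ lines.

    Assumption: each record has exactly four lines and the sequence is
    the second line of each record.
    """
    lengths = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            lengths[len(line.strip())] += 1
    return lengths


def is_fixed_length(lengths):
    """True if every read has the same length (likely not adapter-trimmed)."""
    return len(lengths) == 1


if __name__ == "__main__":
    example = [
        "@r1", "TAGCTTATCAGACTGATGTTGA", "+", "I" * 22,
        "@r2", "TGAGGTAGTAGGTTGTATAGTT", "+", "I" * 22,
        "@r3", "TAGCTTATCAGACTGATGTTGAC", "+", "I" * 23,
    ]
    dist = read_length_distribution(example)
    for length, count in sorted(dist.items()):
        print(length, count)
    print("fixed length:", is_fixed_length(dist))
```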

Partek Flow software wraps Cutadapt [2], a widely used tool for adapter trimming. It can be used to trim adapter sequences in nucleotide-space data as well as color-space data.

To use the Trim adapters function, you will need to know the adapter sequences. To start the Trim adapters function, select the Unaligned Reads node and then select Trim adapters from the Pre-alignment tools section of the task pane. In the Trim adapters page (Figure 7), paste the adapter sequences into the textbox and select the button.

There are three options when it comes to trimming the adapter sequence:

Trimming for adapter ligated to 3'-end
- the adapter sequence and anything that follows it will be trimmed away from the 3'-end
Trimming for adapter ligated to 5'-end or 3'-end
- if the adapter sequence is identified within the read or overlapping the 3'-end, the adapter sequence and anything that follows it will be trimmed away. However, if the adapter sequence partially overlaps the 5'-end of the read, the initial portion of the read matching the adapter sequence is trimmed and anything that follows it is kept
Trimming for adapter ligated to 5'-end
- if the adapter sequence appears partially at the 5'-end or within the read, the adapter sequence and anything that precedes it are trimmed. The user has the option to put the special character '^' at the beginning of the adapter sequence, meaning the adapter is 'anchored'. An anchored adapter must appear in its entirety at the 5'-end of the read (i.e. it is a prefix of the read).
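
The 3'-end and anchored 5'-end cases described above can be illustrated with a simplified Python sketch. It uses exact string matching only, whereas Cutadapt uses error-tolerant alignment, so this is not Cutadapt's algorithm. The function names and the min_overlap default are hypothetical.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter and everything that follows it (exact match only).

    Handles a full adapter anywhere in the read and a partial adapter
    (a prefix of the adapter) at the very 3'-end of the read.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Look for the longest adapter prefix that is a suffix of the read.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read


def trim_anchored_5prime_adapter(read, adapter):
    """Remove an anchored 5' adapter, written with a leading '^'.

    The adapter is removed only if it appears in full as a prefix of the read.
    """
    if adapter.startswith("^"):
        adapter = adapter[1:]
    if read.startswith(adapter):
        return read[len(adapter):]
    return read


if __name__ == "__main__":
    mirna = "TAGCTTATCAGACTGATGTTGA"
    adapter = "TGGAATTCTCGG"  # example sequence chosen for illustration
    print(trim_3prime_adapter(mirna + adapter + "GTGC", adapter))
    print(trim_3prime_adapter(mirna + adapter[:6], adapter))
    print(trim_anchored_5prime_adapter("ACGTTTTT", "^ACG"))
```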

Figure 7. Trim adapters setup page

For Trim adapters, more than one adapter sequence can be specified at once. When multiple adapters are provided, each adapter is evaluated based on how many bases overlap the read and on the error rate. Adapters with a low number of overlapping nucleotides or a high error rate are removed from consideration.

After that, the best adapter will be chosen based on the number of matching bases to the read. If there is a tie, adapters of the same type will be chosen in the order they are provided and adapters of different types will be chosen by type in the following order: first 3', then 5' or 3', and lastly 5' adapters.
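
The selection rule above amounts to a sort with two tie-breakers, which can be sketched in Python as follows. The function name, the tuple layout and the type labels are hypothetical, and the candidates are assumed to have already passed the overlap and error-rate checks.

```python
# Tie-break order between adapter types, as described in the text:
# first 3', then 5' or 3', and lastly 5' adapters.
TYPE_PRIORITY = {"3'": 0, "5' or 3'": 1, "5'": 2}


def choose_best_adapter(candidates):
    """Pick the best adapter for one read.

    candidates -- list of (name, adapter_type, matching_bases) tuples in the
                  order the adapters were provided by the user.
    Returns the name of the chosen adapter, or None if there are no candidates.
    """
    if not candidates:
        return None

    def sort_key(item):
        index, (_name, adapter_type, matching_bases) = item
        # More matching bases wins; ties fall back to type, then input order.
        return (-matching_bases, TYPE_PRIORITY[adapter_type], index)

    _index, best = min(enumerate(candidates), key=sort_key)
    return best[0]


if __name__ == "__main__":
    candidates = [
        ("A", "5'", 10),
        ("B", "3'", 10),
        ("C", "3'", 10),
        ("D", "5' or 3'", 8),
    ]
    print(choose_best_adapter(candidates))
```

In this example A, B and C tie on matching bases; the 3' adapters B and C take priority over the 5' adapter A, and B is chosen over C because it was provided first.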

Advanced Options for Trim Adapters

There are cases in which the Trim adapters function does not work properly with the default settings, for example when the reads contain N bases. The advanced options therefore allow the user to configure how adapter sequences are matched to the reads. The advanced options dialog box is shown in Figure 8.

Figure 8. Advanced options dialog box for Trim adapters function

The first section of advanced options is the Adapter options. These configure how the adapter sequence is matched to the read. They include the maximum error rate allowed, the number of times matching is performed, the minimum length of overlapping bases, whether Ns (ambiguous bases) are allowed in the adapter and whether N is treated as a wildcard. Hover the mouse cursor over the info button to get more information about each parameter.

The second section of advanced options is the Filtering options. These filter out adapter-trimmed reads that are shorter than the minimum read length, because very short reads tend to give non-unique alignments.

The third section of advanced options is Additional modification to reads. The Quality cutoff is used to trim poor-quality bases from the reads before the adapter is trimmed. Quality encoding specifies the quality score encoding of the raw data. The Reads names prefix and suffix options add a prefix and a suffix to the read ID. Lastly, Negative quality zero, if checked, converts all negative base quality scores to zero.

Subsample FASTQ

Next generation sequencing (NGS) data files are typically very large. Working with NGS data is not only time consuming but also puts constraints on hard disk space. This is especially true if analysis parameters need to be optimized. The Subsample FASTQ function is a useful tool for getting a subset of the raw data on which the optimization can be performed. The optimized parameters can then be saved and applied to the whole dataset.

Subsample FASTQ is only available for unaligned reads in FASTQ format. To start this function, select the Unaligned Reads data node and select Subsample FASTQ from the Pre-alignment tools section of the menu. Then specify how many reads to keep out of every n reads. For example, if the user specifies "Keep one read for every 10 reads" (Figure 9), the program keeps only 1 read out of every 10. This is equivalent to keeping 10% of the data.

Figure 9. Subsample FASTQ page. This option shows getting a subset of raw data by keeping one read for every 10 reads.
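
The "keep one read for every n reads" rule can be sketched in Python as follows. This is an illustration under stated assumptions: the function name subsample_fastq is hypothetical, records are assumed to have exactly four lines, and the sketch keeps the first read of every group of n, which may not be the read that Partek Flow keeps.

```python
from itertools import islice


def subsample_fastq(lines, keep_every=10):
    """Yield one 4-line FASTQ record out of every keep_every records.

    lines -- iterable of FASTQ lines (for example an open file handle)
    """
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    it = iter(lines)
    index = 0
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (or an incomplete trailing record)
        if index % keep_every == 0:
            yield record
        index += 1


if __name__ == "__main__":
    fastq = []
    for i in range(25):
        fastq += ["@r%d" % i, "ACGT", "+", "IIII"]
    kept = list(subsample_fastq(fastq, keep_every=10))
    print([record[0] for record in kept])
```

Because the input is consumed as a stream, the sketch never holds more than one record in memory, which matters for files as large as typical NGS datasets.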

References

[1] Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. An extensive evaluation of read trimming effects on Illumina NGS data analysis. PLoS ONE. 2013; 8(12): e85024.
[2] Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011; 17: 10-12.

Additional Assistance

If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
