Quality Control

What are the quality control standards for 16S rRNA and genome sequences?

16S rRNA Quality Control

The following criteria are used to filter out sequencing reads with low quality:

Sequences with a length of <100 bp or >2,000 bp.
Sequences with an average Q value of <25.
Sequences not predicted to be a 16S gene by the Hidden Markov Model (HMM)-based search.
Singleton sequences. Sequences are first matched against the reference 16S database. All sequences that do not match any reference sequence at the 97% similarity cutoff are then clustered with the UCLUST method, again at a 97% cutoff. A sequence that remains a singleton after clustering is assumed to be erroneous and is excluded from subsequent analyses. This approach is widely used, especially for Illumina short-read sequencing [See QIIME manual’s step 5].
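
As a rough illustration of the length, Q value, and singleton criteria above, the following Python sketch shows one way such filters could be written. It is not the platform's implementation: the function names are hypothetical, the Q value is assumed to be the arithmetic mean of Phred+33 scores, and the cluster assignments are assumed to come from a separate clustering step such as UCLUST. The HMM-based 16S prediction is not covered.

```python
from collections import Counter

# Thresholds taken from the criteria listed above.
MIN_LEN, MAX_LEN, MIN_MEAN_Q = 100, 2000, 25


def mean_phred(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of Phred scores decoded from a FASTQ quality string.

    Assumption: Phred+33 encoding and a simple arithmetic mean.
    """
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_basic_filters(seq: str, quality: str) -> bool:
    """Keep reads of 100-2,000 bp whose mean Q value is at least 25."""
    if not MIN_LEN <= len(seq) <= MAX_LEN:
        return False
    return mean_phred(quality) >= MIN_MEAN_Q


def drop_singletons(read_to_cluster: dict[str, str]) -> dict[str, str]:
    """Remove reads whose cluster contains only that one read.

    `read_to_cluster` maps read IDs to cluster IDs (hypothetical input
    produced by an external clustering step).
    """
    sizes = Counter(read_to_cluster.values())
    return {read: cl for read, cl in read_to_cluster.items() if sizes[cl] > 1}
```

For example, `drop_singletons({"r1": "c1", "r2": "c1", "r3": "c2"})` keeps only `r1` and `r2`, because cluster `c2` contains a single read.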

Genome Quality Control

The QC process employed by TrueBacID Genome is a crucial step in ensuring the accuracy and reliability of downstream analysis. The platform uses Fastp to trim and filter raw NGS reads from Illumina, Nanopore, and PacBio platforms.

The trimming and filtering process carried out by Fastp is two-fold. Firstly, Fastp removes NGS library adapter sequences from the raw reads, a process known as adapter trimming. Adapter trimming is essential for removing any adapter contamination present in the raw reads, which can interfere with downstream analysis.

Secondly, Fastp removes reads whose basecall quality or length does not meet the specified thresholds. Low-quality or short reads can result from a range of factors, such as sequencing errors or PCR amplification bias. Removing them improves the overall quality of the remaining reads and, consequently, the accuracy of the downstream analysis.

For Illumina reads, TrueBacID Genome uses the default parameters of Fastp, which are designed to balance retaining sufficient reads against removing low-quality and short reads.

For Nanopore reads, the platform uses the parameters “-q 10 -l 1000” in Fastp. Combined with Fastp's default limit of 40% unqualified bases, this means that a read is removed if more than 40% of its bases have a quality score below 10, or if it is shorter than 1000 bp. The lower quality threshold is used because Nanopore sequencing typically produces long reads at some cost in per-base accuracy.

For PacBio reads, the platform uses the parameters “-q 20 -l 1000” in Fastp. This means that a read is removed if more than 40% of its bases have a quality score below 20, or if it is shorter than 1000 bp. PacBio sequencing typically generates long reads with high accuracy, but at a higher cost. By setting the quality score threshold higher for PacBio reads, TrueBacID Genome ensures that only high-quality reads are used in downstream analysis.
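
The read-level rule described for the Nanopore and PacBio settings can be sketched in Python as follows. This only mimics the stated rule and does not call Fastp. The function name and the `PRESETS` table are hypothetical. The 40% limit is passed as an explicit argument, on the assumption that it corresponds to Fastp's default unqualified-base percentage.

```python
# Platform presets as (quality threshold, minimum length), taken from
# the "-q" and "-l" values quoted above. The dict itself is illustrative.
PRESETS = {
    "nanopore": (10, 1000),
    "pacbio": (20, 1000),
}


def read_is_removed(quals, min_q, min_len, unqualified_pct_limit=40.0):
    """Return True if a read would be discarded under the stated rule.

    quals: per-base Phred quality scores (integers) for one read.
    A read is removed when it is shorter than `min_len`, or when more
    than `unqualified_pct_limit` percent of its bases score below `min_q`.
    """
    if len(quals) == 0 or len(quals) < min_len:
        return True
    low = sum(1 for q in quals if q < min_q)
    return 100.0 * low / len(quals) > unqualified_pct_limit


# Usage: a 1,000 bp read with uniform Q12 passes the Nanopore preset
# but fails the stricter PacBio preset.
uniform_q12 = [12] * 1000
nanopore_removed = read_is_removed(uniform_q12, *PRESETS["nanopore"])
pacbio_removed = read_is_removed(uniform_q12, *PRESETS["pacbio"])
```

Note that a read with exactly 40% of its bases below the threshold is kept in this sketch, since the rule as stated removes reads only when the fraction exceeds 40%.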

Overall, the QC process employed by TrueBacID Genome is designed to optimize the quality and accuracy of the input data, which is crucial for obtaining reliable and meaningful results in downstream analysis.

Genome Quality Control Step

The genome QC process is an important quality control step that ensures that the input genome sequences are of sufficient quality and completeness for downstream analysis. The QC process employed by TrueBacID Genome consists of several measures that are designed to assess the quality and completeness of the input genome sequences.

The basic statistics measured from the input genome sequences include the number of contigs, genome size, G+C content, and assembly N50 length. These statistics provide a broad overview of the quality and completeness of the input genome sequences, and can be used to identify potential issues such as contamination, fragmentation, or misassemblies.
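
For readers unfamiliar with these statistics, the sketch below computes them from a list of contig sequences in Python. It is an illustration rather than the platform's code: the function name is hypothetical, G+C content is assumed to be calculated over all bases in the assembly, and N50 is taken as the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the genome size.

```python
def assembly_stats(contigs: list[str]) -> dict:
    """Compute contig count, genome size, G+C content (%), and N50."""
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)

    # G+C content over all bases (assumption: ambiguous bases such as N
    # are counted in the denominator).
    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    gc_percent = 100.0 * gc / total if total else 0.0

    # N50: walk contigs from longest to shortest until half of the
    # total assembly length is covered.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "num_contigs": len(lengths),
        "genome_size": total,
        "gc_percent": gc_percent,
        "n50": n50,
    }


# Usage with three toy contigs of 80, 70, and 50 bp.
stats = assembly_stats(["G" * 80, "A" * 70, "C" * 50])
```

In this toy case the genome size is 200 bp. The longest contig (80 bp) covers less than half of it, and adding the second contig brings the cumulative length to 150 bp, so N50 is 70 bp.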

To obtain a more detailed assessment of the completeness of the input genome sequences and the possibility of contamination, the platform recovers 92 universal bacterial and archaeal single-copy marker genes (UBCGs) from the predicted protein sequences. This is done using hmmsearch and the 92 profile HMMs of Pfam/TIGRfam entries, with the “--cut_tc” threshold parameter. UBCGs are conserved among most bacterial and archaeal genomes and are considered reliable indicators of genome completeness and contamination.
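
The marker-gene search can be sketched in Python as below. The file names are placeholders, and the way hits are summarized is an assumption: this page does not describe how TrueBacID Genome turns UBCG hits into completeness or contamination values. The sketch simply treats the fraction of markers found as a completeness indicator and markers hit by more than one protein as a possible contamination signal. It builds an hmmsearch command line using the `--cut_tc` and `--tblout` options and parses the resulting tabular output, in which the first column is the target (protein) name and the third column is the query (profile) name.

```python
from collections import Counter


def hmmsearch_command(hmm_file: str, protein_fasta: str, tblout: str) -> list[str]:
    """Build an hmmsearch command that uses trusted cutoffs (--cut_tc).

    All three paths are placeholders supplied by the caller.
    """
    return ["hmmsearch", "--cut_tc", "--tblout", tblout, hmm_file, protein_fasta]


def count_marker_hits(tblout_text: str) -> Counter:
    """Count hits per profile HMM from hmmsearch --tblout text."""
    hits = Counter()
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        query_name = fields[2]  # third column: name of the profile HMM
        hits[query_name] += 1
    return hits


def summarize_markers(hits: Counter, expected_markers: int = 92) -> dict:
    """Summarize marker recovery (illustrative scoring, not the platform's)."""
    found = sum(1 for n in hits.values() if n >= 1)
    multi_copy = sum(1 for n in hits.values() if n > 1)
    return {
        "markers_found": found,
        "multi_copy_markers": multi_copy,
        "percent_found": 100.0 * found / expected_markers,
    }


# Usage with a tiny made-up table and 4 expected markers.
example = (
    "# target name  accession  query name  accession  E-value  score  bias\n"
    "prot_1 - geneA PF00001 1e-50 170.0 0.1\n"
    "prot_2 - geneB PF00002 1e-40 150.0 0.0\n"
    "prot_3 - geneB PF00002 1e-30 120.0 0.0\n"
)
summary = summarize_markers(count_marker_hits(example), expected_markers=4)
```

Because the markers are expected to be single-copy, a profile matched by two different proteins (as with `geneB` in the example) is the kind of pattern that would warrant a closer look for contamination.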

References

Fastp

Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17): i884-i890, 2018.

Skesa

Alexandre Souvorov, Richa Agarwala, David J. Lipman. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biology, 19: 153, 2018.

Flye

Kolmogorov, M., Yuan, J., Lin, Y. et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37, 540–546 (2019).

Medaka

https://nanoporetech.com/software/medaka

