EzBioCloud© 2024. All Rights Reserved


Quality Control

What are the quality control standards for 16S rRNA and genome sequences?


Last updated 1 year ago

16S rRNA Quality Control

Sequencing reads that meet any of the following criteria are removed as low quality:

  • The sequence is shorter than 100 bp or longer than 2,000 bp.

  • The average Q value is below 25.

  • The sequence is not predicted to be a 16S gene by the Hidden Markov Model (HMM) based search.

  • The sequence is a singleton. Sequences are first assigned to the reference 16S database. All sequences that do not match any reference sequence at the 97% similarity cutoff are then clustered with UCLUST, again at the 97% cutoff. A sequence that remains a singleton after clustering is assumed to be erroneous and is excluded from subsequent analyses. This approach is widely used, especially for Illumina short-read sequencing [see step 5 of the QIIME manual].
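The length, average-quality, and singleton criteria above can be sketched in a few lines of Python. This is only an illustration, not the platform's implementation: it assumes Phred+33 FASTQ quality strings, takes the "average Q" to be the arithmetic mean of per-base Phred scores, and uses a hypothetical read-to-cluster mapping as input for the singleton step. The HMM search and the clustering itself are not reproduced here.

```python
from collections import Counter

MIN_LEN = 100     # bp, lower length bound from the criteria above
MAX_LEN = 2000    # bp, upper length bound from the criteria above
MIN_MEAN_Q = 25   # average Q threshold from the criteria above


def mean_phred(quality_string, offset=33):
    """Arithmetic mean of Phred scores (Phred+33 encoding is an assumption)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def passes_read_filters(sequence, quality_string):
    """Return True if a read survives the length and average-Q criteria."""
    if not MIN_LEN <= len(sequence) <= MAX_LEN:
        return False
    return mean_phred(quality_string) >= MIN_MEAN_Q


def drop_singletons(cluster_of_read):
    """Drop reads whose cluster contains only that one read.

    cluster_of_read maps read ID -> cluster ID for reads that did not match
    the reference database (a hypothetical input format).
    """
    sizes = Counter(cluster_of_read.values())
    return {read: c for read, c in cluster_of_read.items() if sizes[c] > 1}
```

For example, `passes_read_filters("A" * 150, "I" * 150)` accepts a 150 bp read whose bases all have Q40, while the same read with a quality string of `"5" * 150` (Q20) is rejected.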

Genome Quality Control

The QC process employed by TrueBacID Genome is a crucial step in ensuring the accuracy and reliability of downstream analysis. The platform uses Fastp to trim and filter raw NGS reads from a range of platforms, including Illumina, Nanopore, and PacBio.

The trimming and filtering carried out by Fastp is two-fold. First, Fastp removes NGS library adapter sequences from the raw reads, a process known as adapter trimming. Adapter contamination left in the raw reads can interfere with downstream analysis.

Second, Fastp removes reads whose base-call quality or length does not meet the specified thresholds. Low-quality or short reads can result from a range of factors, such as sequencing errors or PCR amplification bias. Removing them improves the quality of the remaining reads and, in turn, the accuracy of downstream analysis.

For Illumina reads, TrueBacID Genome uses the default parameters of Fastp, which are designed to balance retaining sufficient reads against removing low-quality and short reads.

For Nanopore reads, the platform uses the parameters “-q 10 -l 1000” in Fastp. This means that a read is removed if more than 40% of its bases have a quality score below 10, or if it is shorter than 1000 bp. These settings are necessary because Nanopore sequencing typically produces long reads at some cost in per-base accuracy, so the raw data contain more sequencing errors and some short fragments.

For PacBio reads, the platform uses the parameters “-q 20 -l 1000” in Fastp. This means that a read is removed if more than 40% of its bases have a quality score below 20, or if it is shorter than 1000 bp. PacBio sequencing typically generates long reads with high accuracy, but at a higher cost. By setting the quality score threshold higher for PacBio reads, TrueBacID Genome ensures that only high-quality reads are used in downstream analysis.
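The read-removal rule described for Nanopore and PacBio can be written as a small Python function. This is a simplified sketch of the rule as stated in the text, not Fastp's own code: the names `PLATFORM_PARAMS` and `keep_read` are hypothetical, per-base qualities are passed in as a list of integers, and adapter trimming is ignored.

```python
# Thresholds per platform as given in the text (-q and -l values).
PLATFORM_PARAMS = {
    "nanopore": {"min_quality": 10, "min_length": 1000},
    "pacbio": {"min_quality": 20, "min_length": 1000},
}

# Maximum share of bases allowed below the quality threshold (the 40% rule).
UNQUALIFIED_PERCENT_LIMIT = 40


def keep_read(qualities, platform):
    """Return True if a read passes the length and quality rule.

    qualities is a list of per-base Phred scores for one read.
    """
    params = PLATFORM_PARAMS[platform]
    if len(qualities) < params["min_length"]:
        return False  # shorter than the required length
    low = sum(1 for q in qualities if q < params["min_quality"])
    # The read is removed only if MORE than 40% of bases are below threshold.
    return 100.0 * low / len(qualities) <= UNQUALIFIED_PERCENT_LIMIT
```

A 1,200 bp read with a uniform quality of 12 is kept under the Nanopore settings but removed under the PacBio settings, because every base falls below the PacBio threshold of 20.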

Overall, the QC process employed by TrueBacID Genome is designed to optimize the quality and accuracy of the input data, which is crucial for obtaining reliable and meaningful results in downstream analysis.

Genome Quality Control Step

The genome QC step checks that the input genome sequences are of sufficient quality and completeness for downstream analysis. TrueBacID Genome uses several measures to make this assessment.

The basic statistics measured from the input genome sequences are the number of contigs, genome size, G+C content, and assembly N50 length. These statistics provide a broad overview of the quality and completeness of the input genome sequences, and can be used to identify potential issues such as contamination, fragmentation, or misassemblies.
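The four basic statistics can be computed directly from the contig sequences. The sketch below is one possible way to do so, assuming the contigs are already loaded as plain strings; the function name `assembly_stats` is hypothetical, and G+C content is computed over all bases, including any ambiguous ones.

```python
def assembly_stats(contigs):
    """Compute contig count, genome size, G+C percent, and N50.

    contigs is a list of nucleotide strings, one per contig.
    """
    lengths = sorted((len(c) for c in contigs), reverse=True)
    genome_size = sum(lengths)
    gc_bases = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)

    # N50: length of the contig at which the running total of lengths,
    # taken from longest to shortest, first reaches half the genome size.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= genome_size:
            n50 = length
            break

    gc_percent = 100.0 * gc_bases / genome_size if genome_size else 0.0
    return {
        "num_contigs": len(lengths),
        "genome_size": genome_size,
        "gc_percent": gc_percent,
        "n50": n50,
    }
```

A low N50 together with a high contig count points to a fragmented assembly, while a genome size or G+C content far from the expected range for the taxon can hint at contamination.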

To obtain a more detailed assessment of the completeness of the input genome sequences and the possibility of contamination, the platform recovers 92 universal bacterial and archaeal single-copy marker genes (UBCGs) from the predicted protein sequences. This is done using hmmsearch and the 92 profile HMMs of Pfam/TIGRfam entries, with the “--cut_tc” threshold parameter. UBCGs are conserved among most bacterial and archaeal genomes and are considered reliable indicators of genome completeness and contamination.
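The marker-gene search can be sketched as follows. The text does not give the platform's formulas for completeness and contamination, so the summary here is an assumption made for illustration: completeness is the share of the 92 profiles with at least one hit, and profiles with more than one hit are reported as a possible contamination signal. File names and function names are hypothetical; the hmmsearch options `--cut_tc` and `--tblout` are standard HMMER options.

```python
from collections import Counter

EXPECTED_MARKERS = 92  # number of profile HMMs stated in the text


def hmmsearch_command(hmm_file, protein_fasta, table_out):
    """Build (but do not run) an hmmsearch call using trusted cutoffs."""
    return ["hmmsearch", "--cut_tc", "--tblout", table_out,
            hmm_file, protein_fasta]


def marker_counts_from_tblout(lines):
    """Count hits per profile from the lines of an hmmsearch --tblout file."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        counts[fields[2]] += 1  # third column is the query (profile) name
    return counts


def summarize_markers(counts, expected=EXPECTED_MARKERS):
    """Summarize marker recovery (assumed definitions, see lead-in)."""
    found = sum(1 for n in counts.values() if n >= 1)
    multi_copy = sum(1 for n in counts.values() if n > 1)
    return {
        "markers_found": found,
        "completeness_percent": 100.0 * found / expected,
        "multi_copy_markers": multi_copy,
    }
```

Because the markers are expected to be single-copy, a genome in which many profiles hit two or more proteins is a candidate for contamination, while many missing profiles suggest an incomplete assembly.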

References

Fastp

Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17): i884-i890, 2018.

Skesa

Alexandre Souvorov, Richa Agarwala, David J. Lipman. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biology, 19: 153, 2018.

Flye

Kolmogorov, M., Yuan, J., Lin, Y. et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37, 540–546 (2019).

Medaka

https://nanoporetech.com/software/medaka

UCLUST

See step 5 of the QIIME manual.