Identification with 16S rRNA
How can the 16S rRNA gene be used to identify bacterial species?
Algorithm for 16S-based identification of a bacterium
The most critical measurement for 16S-based species identification is pairwise sequence similarity. However, different sequence alignment algorithms may produce different similarity values, so it is important to use a taxonomically valid algorithm for alignment and similarity calculation. Ideally, we would calculate the similarity between the isolate and the type strains of all known species. This is feasible but inefficient: computing all pairs (>70,000) takes a long time, whereas only the values for sufficiently close species (i.e., species with >98.7% similarity) are needed. For this reason, the EzBioCloud 16S Identification service uses a two-step approach. It is the same as the one used on our public 16S identification service (www.ezbiocloud.net), except that the reference database used in EzBioCloud 16S Identification is more stringently curated.
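To illustrate why the choice of similarity calculation matters, here is a minimal sketch that computes percent similarity from an already aligned pair of sequences under two different gap conventions. The function name `similarity`, the `count_gaps` switch, and the toy sequences are assumptions made for illustration; this is not the calculation used by the EzBioCloud service, only a demonstration that the same alignment can yield different values.

```python
def similarity(aligned_query: str, aligned_ref: str, count_gaps: bool = False) -> float:
    """Percent identity over a pre-computed pairwise alignment.

    count_gaps=False: only columns where both sequences have a base are compared.
    count_gaps=True:  columns with a gap in exactly one sequence count as mismatches.
    (Both conventions are illustrative assumptions, not the service's definition.)
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for q, r in zip(aligned_query.upper(), aligned_ref.upper()):
        if q == "-" and r == "-":
            continue  # a column of two gaps carries no information
        if (q == "-" or r == "-") and not count_gaps:
            continue  # skip single-gap columns under the first convention
        compared += 1
        if q == r:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Toy aligned pair: one gap column and one mismatch.
query = "ACGTACGT-ACGT"
ref = "ACGTACGTTACGA"

ignoring_gaps = similarity(query, ref)                   # 11 matches / 12 columns
counting_gaps = similarity(query, ref, count_gaps=True)  # 11 matches / 13 columns
print(round(ignoring_gaps, 2), round(counting_gaps, 2))
```

On real 16S sequences the difference between conventions is much smaller than in this toy case, but it can still move a value across the 98.7% cutoff, which is why a single, consistent algorithm is needed.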
The EzBioCloud 16S Identification engine works in the following steps:
1. The query sequence is chopped into three fragments of equal length. If the query sequence is shorter than 1000 bp, it is chopped into two fragments. If it is shorter than 500 bp, it is not chopped.
2. The original full-length query and the fragments (four sequences in total when the query is chopped into three) are each used as a query for a BLASTn-based search against the EzBioCloud 16S Identification Database. Searching with different parts of the query sequence helps ensure that all potentially similar reference sequences are found.
3. Fifty hits are collected from each of the BLASTn searches and combined. Because there are always duplicated hits, the final hit list contains far fewer than 200 hits.
4. A robust pairwise sequence alignment (Myers and Miller, 1988) is carried out between the query sequence and each of the BLASTn hit species identified in the previous step. The alignment algorithm used in the EzBioCloud 16S Identification service is the same as the one used in defining the 16S cutoff (98.7%) for species definition (Kim et al., 2014) and in the highly cited EzBioCloud (formerly EzTaxon) service. For more details about 16S similarity calculation, please read this article. Please note that BLASTn identity values are not used for taxonomic purposes.
5. The completeness (%) of the query sequence is calculated. For example, 50% completeness means that the query sequence covers only half of the full-length 16S gene. The taxonomically meaningful 16S sequence similarity was proposed on the basis of full-length sequences, so similarity values based on partial sequences should be interpreted carefully.
6. Finally, the hit species are sorted by 16S similarity, displayed as a table, and stored. Interpretation of 16S similarity values should be made carefully. For example, Bacillus cereus shows >99.8% 16S similarity to about ten species, which means that a very similar 16S sequence does not always imply that the isolate belongs to the hit species.
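The steps above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the service's implementation: all function names are hypothetical, the length thresholds are read as "shorter than 500 bp: no chopping; shorter than 1000 bp: two fragments; otherwise three", the BLASTn searches and pairwise alignments are not performed (their results are passed in as plain lists and dictionaries), and completeness is assumed to be the covered length divided by the length of a full-length 16S reference.

```python
from typing import Dict, List, Tuple


def fragment_query(seq: str) -> List[str]:
    """Return the full-length query followed by its fragments (if any)."""
    n = len(seq)
    if n < 500:
        parts = 1          # short queries are searched whole
    elif n < 1000:
        parts = 2
    else:
        parts = 3
    queries = [seq]
    if parts > 1:
        size = n // parts
        for i in range(parts):
            # The last fragment absorbs any remainder so no base is dropped.
            end = n if i == parts - 1 else (i + 1) * size
            queries.append(seq[i * size:end])
    return queries


def merge_hits(hit_lists: List[List[str]], per_search: int = 50) -> List[str]:
    """Combine the top hits of each search, dropping duplicates, keeping order."""
    merged: List[str] = []
    seen = set()
    for hits in hit_lists:
        for ref_id in hits[:per_search]:
            if ref_id not in seen:
                seen.add(ref_id)
                merged.append(ref_id)
    return merged


def completeness(covered_length: int, full_length: int) -> float:
    """Percent of a full-length 16S gene covered by the query (assumed definition)."""
    return min(100.0, 100.0 * covered_length / full_length)


def rank_hits(similarities: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort hit species by pairwise similarity, highest first."""
    return sorted(similarities.items(), key=lambda item: item[1], reverse=True)


# Usage with placeholder data (IDs and similarity values are made up).
queries = fragment_query("A" * 1500)
candidates = merge_hits([["ref1", "ref2"], ["ref2", "ref3"]])
table = rank_hits({"ref1": 98.1, "ref2": 99.9, "ref3": 99.2})
```

Keeping the candidate search and the similarity calculation as separate stages mirrors the two-step idea: a fast search narrows tens of thousands of references to a short list, and only that short list goes through the slower, taxonomically valid alignment.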