Genome Identification Process
EzBioCloud© 2024. All Rights Reserved
Genome-ID uses a combination of Mash and OrthoANI to perform species-level identification of input genome sequences. The first step is to screen candidates using Mash, which selects a subset of reference genomes that are closely related to the input genome based on a pairwise distance calculation. The number of reference genomes selected is determined by a user-defined threshold, which is set to be less than the total number of reference genomes in the database.
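The screening step can be illustrated with a minimal sketch of a Mash-style distance, assuming toy parameters (k-mer size 5, sketch size 50) and an MD5 hash chosen only for reproducibility. The function names, sequences, and the `top_n` parameter are hypothetical and are not Genome-ID's actual settings; Mash itself uses a different hash function and larger defaults.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=5, size=50):
    """Bottom-`size` MinHash sketch: the smallest hash values over all k-mers."""
    hashes = {int(hashlib.md5(km.encode()).hexdigest(), 16) for km in kmers(seq, k)}
    return sorted(hashes)[:size]


def mash_distance(sk_a, sk_b, k=5, size=50):
    """Estimate the Jaccard index from two sketches and convert it to a Mash distance."""
    set_a, set_b = set(sk_a), set(sk_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 1.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    j = shared / len(merged)
    if j == 0:
        return 1.0
    if j == 1:
        return 0.0
    # Mash distance: D = -(1/k) * ln(2j / (1 + j)), capped at 1.
    return min(1.0, -math.log(2 * j / (1 + j)) / k)


def screen_candidates(query_seq, references, top_n=2, k=5, size=50):
    """Return the names of the `top_n` references closest to the query."""
    q = sketch(query_seq, k, size)
    dists = {
        name: mash_distance(q, sketch(seq, k, size), k, size)
        for name, seq in references.items()
    }
    return sorted(dists, key=dists.get)[:top_n]


# Toy usage with made-up sequences (hypothetical reference names).
toy_query = "ACGTTGCAAGGCTTAACCGGTATCGATCGGATCCTAGGCTAACG"
toy_refs = {"ref_1": toy_query, "ref_2": "T" * 40}
print(screen_candidates(toy_query, toy_refs, top_n=1))
```

Only the references returned by the screen would then go on to the more expensive ANI calculation.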
Once the reference genomes have been selected, OrthoANI is used to calculate the pairwise average nucleotide identity (ANI) between the input genome and the reference genomes. ANI is a measure of genomic similarity between two genomes, and values greater than 95% are typically used as a threshold for species-level identification. If the input genome displays a high degree of ANI with a particular reference genome, it is identified as belonging to the same species.
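As a rough illustration of this decision rule, the sketch below picks the reference with the highest ANI and accepts it only if it reaches the cutoff. The function name, the input dictionary, and the choice of "at or above" rather than "strictly above" 95% are assumptions made for the example, not a description of Genome-ID's internals.

```python
def call_species(ani_by_reference, cutoff=95.0):
    """Return (reference_species, ani) for the best hit at or above the cutoff, else None.

    `ani_by_reference` maps a reference species name to its ANI (in percent)
    against the input genome.
    """
    if not ani_by_reference:
        return None
    best = max(ani_by_reference, key=ani_by_reference.get)
    ani = ani_by_reference[best]
    return (best, ani) if ani >= cutoff else None


# Toy usage with made-up ANI values.
print(call_species({"Species A": 97.2, "Species B": 88.0}))
print(call_species({"Species A": 91.0}))
```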
If the input genome cannot be identified with the reference genomes in the database using ANI analysis, Genome-ID will try identification using 16S rRNA gene sequences. This is because there are some species that are not covered by the genome database, but that can still be identified using their 16S rRNA gene sequences. If a match is found in the 16S database, the input genome can be identified as belonging to a particular species.
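The fallback order described above might be sketched as follows. The 16S similarity cutoff of 98.7% is an assumption borrowed from common practice, since this page does not state the value Genome-ID uses; the function and field names are likewise hypothetical.

```python
ANI_CUTOFF = 95.0    # percent, as stated for Genome-ID
RRNA_CUTOFF = 98.7   # percent; ASSUMPTION, not a documented Genome-ID value


def best_hit(scores, cutoff):
    """Return the top-scoring name if it reaches the cutoff, else None."""
    if not scores:
        return None
    name = max(scores, key=scores.get)
    return name if scores[name] >= cutoff else None


def identify(ani_hits, rrna_hits):
    """Try ANI first; fall back to 16S rRNA similarity only if ANI gives no match."""
    species = best_hit(ani_hits, ANI_CUTOFF)
    if species is not None:
        return {"species": species, "method": "ANI"}
    species = best_hit(rrna_hits, RRNA_CUTOFF)
    if species is not None:
        return {"species": species, "method": "16S"}
    return {"species": None, "method": "unidentified"}


# Toy usage: no genome reference is close enough, but a 16S reference is.
print(identify({"Species A": 82.4}, {"Species C": 99.5}))
```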
In addition to identifying individual species, Genome-ID also defines genome groups, which are revised species-level entities. For example, the “Escherichia coli group” is a genome group that includes several related species, such as Escherichia coli, Shigella dysenteriae, Shigella flexneri, Shigella boydii, and Shigella sonnei. If the ANI values indicate that the input genome should belong to one of the member species of a genome group, Genome-ID will identify the input genome as belonging to that group, rather than any individual member species. This is because ANI analysis (or 16S analysis) alone cannot reliably differentiate or delineate one species from another within a genome group.
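One simple way to picture this reporting rule is a lookup from member species to genome group. The table below contains only the example group named above, and the data structure and function name are hypothetical.

```python
# Genome groups keyed by group name; members taken from the example in the text.
GENOME_GROUPS = {
    "Escherichia coli group": {
        "Escherichia coli",
        "Shigella dysenteriae",
        "Shigella flexneri",
        "Shigella boydii",
        "Shigella sonnei",
    },
}


def report_name(species):
    """Return the genome group if the species belongs to one, else the species itself."""
    for group, members in GENOME_GROUPS.items():
        if species in members:
            return group
    return species


# Toy usage: a member species is reported at the group level.
print(report_name("Shigella sonnei"))
print(report_name("Bacillus subtilis"))
```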
ANI (average nucleotide identity) is a widely used measure of genomic similarity between the nucleotide sequences of two genomes. It is calculated by aligning the orthologous regions shared by the two genomes and averaging the nucleotide identity of these aligned regions. A cutoff of 95-96% ANI is typically used for species-level identification, based on the observation that genomes from the same species typically display ANI values above this range, while genomes from different species typically display values below it. Genome-ID uses an ANI cutoff of 95%. If the input genome's ANI with a particular reference genome meets this cutoff, it is identified as belonging to the same species as that reference genome. However, if the ANI values do not indicate a clear match with any of the reference genomes in the database, Genome-ID will attempt identification using 16S rRNA gene sequences.
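The averaging step can be shown with a toy calculation, assuming the orthologous regions have already been found and aligned as equal-length fragment pairs. Real tools such as OrthoANI locate and align those regions with a sequence search; the sketch below skips that part entirely, and its function names are hypothetical.

```python
def fragment_identity(a, b):
    """Percent identity of two pre-aligned fragments of equal length."""
    if not a or len(a) != len(b):
        raise ValueError("fragments must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def average_identity(aligned_pairs):
    """Mean percent identity over all aligned fragment pairs."""
    if not aligned_pairs:
        raise ValueError("at least one aligned fragment pair is required")
    identities = [fragment_identity(a, b) for a, b in aligned_pairs]
    return sum(identities) / len(identities)


# Toy usage: one identical pair and one pair with a single mismatch in four bases.
pairs = [("ACGT", "ACGT"), ("ACGT", "ACGA")]
print(average_identity(pairs))
```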
Genome-ID uses genome sequence data, whereas other bacterial identification systems rely on various aspects of phenotype and genotype. Because modern bacterial taxonomy defines species by directly comparing whole genome sequences, identification based on genome sequence data is, in theory, always correct and definitive. All other methods detect partial phenotypic or genotypic patterns in bacterial cells, which can yield an incorrect identification or fail to recognize a novel species. When the genome sequence and other data give conflicting identification results, one should trust the former, since formal bacterial classification is based on genome data.
The science behind genome-based bacterial identification is simple. Two conditions are required:
(a) Each known bacterial species has a type strain whose genome sequence has been determined for comparison to other genomes.
(b) If the genome sequence of a bacterial isolate is sufficiently similar to that of the type strain of a known species, then the isolate is identified as belonging to that species.
For requirement (a), a genome sequence database of type strains should be established. Such a database should contain quality-assured genome sequences that are taxonomically correct. Ideally, it should also cover most, if not all, species.
The scientific background of (b) is well established: if the average nucleotide identity (ANI) value between the type strain of a known species and an isolate is at or above the 95-96% cutoff, the isolate is identified as a strain of that species.