Genome Identification Report
Genome Identification reports contain a range of terminology and acronyms. Here we will cover what's in an example report.
This is the name of the sample that was uploaded to EzBioCloud.
The analytic engine and genome reference databases (genes and genomes) are provided for consistency and referencing purposes.
The input type of the uploaded sample: FA (FASTA), PAIRED_FQ (paired-end FASTQ), or SINGLE_FQ (single-end FASTQ).
This category indicates the specific species identified from your submitted sample.
This is the specific subspecies or the closest match to the query sequence found in the reference database.
This indicates the type of reference data used for the identification, such as the whole genome, partial genome, or specific gene sequences such as 16S rRNA.
This section lists the hierarchical classification of the identified species down to the species level.
This value represents the percentage identity of the query sequence to the reference sequence. It shows how similar the sample sequence is to the reference sequence.
Coverage is divided into two percentages:
The percentage of the query sequence that aligns with the reference.
The percentage of the reference sequence that is covered by the query sequence.
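As a rough illustration of the two coverage percentages, the sketch below divides the aligned length by the query length and by the reference length. The function name, parameter names, and numbers are hypothetical, and the way EzBioCloud actually measures aligned length is not described here.

```python
def coverage_percentages(aligned_query_bp, query_len, aligned_ref_bp, ref_len):
    """Return (query coverage %, reference coverage %).

    aligned_query_bp: bases of the query that align to the reference
    aligned_ref_bp:   bases of the reference covered by the query
    """
    query_cov = 100.0 * aligned_query_bp / query_len
    ref_cov = 100.0 * aligned_ref_bp / ref_len
    return query_cov, ref_cov


# Hypothetical example: a 2.8 Mbp query against a 3.0 Mbp reference,
# with 2.7 Mbp aligned on both sides.
q_cov, r_cov = coverage_percentages(2_700_000, 2_800_000, 2_700_000, 3_000_000)
print(f"query coverage: {q_cov:.1f}%, reference coverage: {r_cov:.1f}%")
```

The two values differ whenever the query and reference genomes differ in size, which is why the report lists them separately.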
This is the threshold value used for species identification which determines whether the sample can be classified as a particular species based on sequence similarity.
This stands for the number of reads. It indicates the total count of sequencing reads obtained.
This refers to the total number of base pairs in all the reads. It is a measure of the total amount of sequence data.
This stands for the mean length of the reads, measured in base pairs. It indicates the average length of the sequencing reads.
The Q20 rate represents the percentage of bases with a quality score of 20 or higher. A Q20 score corresponds to a 1% error rate, meaning there is a 99% probability that the base is called correctly.
The Q30 rate represents the percentage of bases with a quality score of 30 or higher. A Q30 score corresponds to a 0.1% error rate, meaning there is a 99.9% probability that the base is called correctly.
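The read statistics above can be sketched in a few lines. This is a minimal illustration, assuming reads are given as (sequence, quality string) pairs with Phred+33 encoded qualities and that at least one read is present; the function names are hypothetical and not part of EzBioCloud.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def read_stats(reads):
    """Summarize reads given as (sequence, quality_string) pairs.

    Assumes Phred+33 encoding: quality score = ord(character) - 33.
    """
    n_reads = len(reads)
    total_bases = sum(len(seq) for seq, _ in reads)
    scores = [ord(ch) - 33 for _, qual in reads for ch in qual]
    return {
        "n_reads": n_reads,
        "total_bases": total_bases,
        "mean_length": total_bases / n_reads,
        "q20_rate": 100.0 * sum(s >= 20 for s in scores) / len(scores),
        "q30_rate": 100.0 * sum(s >= 30 for s in scores) / len(scores),
    }


# Toy input: 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10.
example_reads = [("ACGT", "II5+"), ("ACGTAC", "IIIIII")]
stats = read_stats(example_reads)
print(stats)
print(phred_to_error_probability(20), phred_to_error_probability(30))
```

With this toy input, 9 of the 10 bases reach Q20 and 8 of the 10 reach Q30, so the Q20 rate is 90% and the Q30 rate is 80%.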
This refers to short single-end reads, where each DNA fragment is sequenced from one end only.
This refers to short paired-end reads, where each DNA fragment is sequenced from both ends, providing two reads per fragment.
This refers to long reads, which are typically generated by long-read sequencing technologies such as Oxford Nanopore and PacBio.
These are the reads obtained directly from the sequencing process before any quality control filtering is applied.
These are the reads that have passed quality control checks, ensuring that they meet our standards of accuracy and reliability.
This refers to the software or method used to assemble the sequence reads into contiguous sequences (contigs). In this case, it indicates a user-uploaded assembly.
This is the total size of the assembled genome, measured in base pairs (bp). It provides an estimate of the total length of the genome sequence assembled. The range in parentheses indicates the possible variation in genome size.
This indicates the number of contigs in the assembly. Contigs are continuous sequences of DNA that have been assembled from overlapping reads. A lower number of contigs generally indicates a more complete and contiguous assembly.
This represents the percentage of guanine (G) and cytosine (C) bases in the DNA sequence. It is a measure of the composition of the genome. The range in parentheses shows the possible variation in GC content.
This refers to the average number of times each base in the genome is covered by the reads. Higher coverage depth usually indicates higher confidence in the accuracy of the assembly.
The N50 length is the length of the shortest contig in the smallest set of the longest contigs that together contain at least 50% of the total assembly length. Equivalently, contigs of the N50 length or longer make up at least half of the assembly. It is a measure of assembly contiguity, with longer N50 lengths indicating more contiguous assemblies.
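The assembly statistics described above (genome size, number of contigs, GC content, coverage depth, and N50) can be computed from the contig sequences as in the sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, ambiguous bases such as N count toward the total length, and depth is approximated as total read bases divided by genome size, which may differ from how the report derives it.

```python
def n50(lengths):
    """Length of the contig at which the running total of the longest
    contigs first reaches half of the total assembly length."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


def assembly_stats(contigs, total_read_bases):
    """Summarize an assembly given as a list of contig sequences."""
    lengths = [len(c) for c in contigs]
    genome_size = sum(lengths)
    gc_bases = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    return {
        "genome_size": genome_size,
        "n_contigs": len(contigs),
        "gc_content": 100.0 * gc_bases / genome_size,
        "n50": n50(lengths),
        # Approximation: average depth = sequenced bases / assembled bases.
        "coverage_depth": total_read_bases / genome_size,
    }


# Toy assembly with contigs of 80, 40, and 20 bp.
toy_contigs = ["ATGC" * 20, "GGCC" * 10, "AATT" * 5]
summary = assembly_stats(toy_contigs, total_read_bases=14_000)
print(summary)
```

In the toy assembly the 80 bp contig alone already holds more than half of the 140 bp total, so the N50 is 80.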
This indicates the percentage of paralogous genes (genes that have evolved by duplication) detected out of a defined set of Universal Bacterial Core Genes (UBCG). Paralogous genes can complicate genome assembly and annotation.
This is the percentage of the defined set of Universal Bacterial Core Genes (UBCG) that have been successfully recovered in the assembly. A higher percentage indicates a more complete and representative assembly of the bacterial genome.
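One plausible way to read the two UBCG percentages is sketched below: count how many core genes were found at least once (recovered) and how many were found more than once (paralogous). The gene names and function name are hypothetical placeholders, and the exact rules EzBioCloud applies are an assumption here.

```python
def core_gene_summary(copy_counts):
    """copy_counts maps each core gene to the number of copies found.

    Returns (recovered %, paralog %), both relative to the full gene set.
    """
    n_genes = len(copy_counts)
    recovered = sum(1 for copies in copy_counts.values() if copies >= 1)
    paralogous = sum(1 for copies in copy_counts.values() if copies > 1)
    return 100.0 * recovered / n_genes, 100.0 * paralogous / n_genes


# Toy set of four placeholder core genes: one missing, one duplicated.
toy_counts = {"geneA": 1, "geneB": 1, "geneC": 2, "geneD": 0}
recovered_pct, paralog_pct = core_gene_summary(toy_counts)
print(f"recovered: {recovered_pct:.1f}%, paralogs: {paralog_pct:.1f}%")
```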
This section checks and categorizes the sequence reads and genome assembly into different domains of life, including Bacteria, Archaea, Eukarya, and Viruses. It provides an overview of the distribution of the sequences across these domains.
This refers to the percentage of the initial sequence reads that align with each domain. The columns under this heading show the distribution of reads among Bacteria, Archaea, Eukarya, and Viruses.
This refers to the percentage of the assembled genome that aligns with each domain. The columns under this heading show the distribution of the assembled genome sequences among Bacteria, Archaea, Eukarya, and Viruses.
MLST scheme refers to a specific set of genes used for MLST analysis. There are several available MLST schemes for different bacterial taxa.
Sequence type (ST) is the identifier assigned to a specific combination of alleles across all loci in an MLST scheme. It can be used to define a unique subtype of a bacterial species.
Allele in each locus refers to the specific variant of a gene used in an MLST scheme. The combination of alleles at each locus in an MLST scheme can define a unique ST. Each locus corresponds to a different housekeeping gene, and the alleles are typically represented by numbers indicating the specific variant present. For example:
arcC [3]: Allele 3 of the arcC gene.
aroE [3]: Allele 3 of the aroE gene.
glpF [1]: Allele 1 of the glpF gene.
gmk [1]: Allele 1 of the gmk gene.
pta [1]: Allele 1 of the pta gene.
tpi [1]: Allele 1 of the tpi gene.
yqiL [10]: Allele 10 of the yqiL gene.
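To make the relationship between allele numbers and a sequence type concrete, the sketch below parses an allele profile written in the `locus [allele]` notation shown above and looks it up in an ST table. The table contents and the label "ST-example" are placeholders made up for illustration; real ST assignments come from the curated MLST database for the scheme.

```python
import re


def parse_allele_profile(text):
    """Parse entries such as 'arcC [3]' into a {locus: allele_number} dict."""
    return {locus: int(allele)
            for locus, allele in re.findall(r"(\w+)\s*\[(\d+)\]", text)}


def lookup_sequence_type(profile, scheme_loci, st_table):
    """Return the ST whose allele combination matches the profile, else None."""
    key = tuple(profile[locus] for locus in scheme_loci)
    return st_table.get(key)


# Loci in the order used by the example profile above.
scheme_loci = ["arcC", "aroE", "glpF", "gmk", "pta", "tpi", "yqiL"]

# Placeholder table: the ST label is hypothetical, not a real assignment.
st_table = {(3, 3, 1, 1, 1, 1, 10): "ST-example"}

report_text = "arcC [3] aroE [3] glpF [1] gmk [1] pta [1] tpi [1] yqiL [10]"
profile = parse_allele_profile(report_text)
sequence_type = lookup_sequence_type(profile, scheme_loci, st_table)
print(profile, sequence_type)
```

A profile that is not present in the table returns None, which corresponds to an allele combination with no assigned ST in that table.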
Antibiotic classes are the different categories of antibiotics, such as beta-lactams, aminoglycosides, and macrolides.
Antibiotic subclasses refer to specific subtypes of antibiotics within a larger class, such as penicillins and cephalosporins.
Resistance gene families are groups of genes that encode resistance to antibiotics.
This section would list any specific mutations identified in the sample that are known to confer antibiotic resistance. In this case, no specific resistance mutations are listed (indicated by the "-").
Pathogenicity markers are genes or markers that are associated with pathogenicity or virulence in bacteria.
This specifies the particular pathogen or species for which the pathogenicity markers are being analyzed. In this case, it is Staphylococcus aureus.
These are markers that have been detected in the sample and are known to be associated with pathogenicity. These markers indicate the presence of specific genes or sequences that contribute to the organism's ability to cause disease.
These are markers that were tested for but not detected in the sample. The absence of these markers suggests that the corresponding pathogenicity factors are not present in the sample.
This section lists the top matches for the query sequence based on Average Nucleotide Identity (ANI), which is a measure of sequence similarity.
This is the rank or position of the match based on the similarity score, with #1 being the highest similarity.
This column lists the species names of the top hits. These are the species whose genome sequences show the highest similarity to the query sequence.
This indicates the specific group or subtype within the species, if available. It provides more detailed classification within the species.
This column lists the hierarchical classification of the species.
This stands for Identity percentage, which represents the percentage of nucleotides that are identical between the query sequence and the reference sequence across their aligned regions. Higher percentages indicate higher similarity.
This stands for Query coverage percentage, which represents the percentage of the query sequence that aligns with the reference sequence. Higher percentages indicate more comprehensive alignment.
This stands for Reference coverage percentage, which represents the percentage of the reference genome that is covered by the query sequence. Higher percentages indicate more extensive alignment with the reference genome.
The UBCG (Universal Bacterial Core Gene) tree is a phylogenetic tree that represents the evolutionary relationships among different bacterial species based on the alignment and comparison of core genes that are universally present across bacterial genomes.
Query sequence:
Marked with a red asterisk (*) and in red text, this indicates the sequence you are analyzing.
Species:
The tree lists various species (e.g., Staphylococcus aureus subsp. aureus, Staphylococcus schweitzeri, etc.), showing their phylogenetic relationships. Each species is a leaf on the tree, representing a distinct organism.
Branches:
The branches connect different species or nodes, illustrating the evolutionary path from common ancestors. Red branches indicate the paths connecting the query sequence to its closest relatives.
Nodes:
Points where branches split, representing common ancestors shared by the species or sequences that branch out from them.
Scale bar:
The scale bar (e.g., 2.1893) provides a reference for the genetic distance. The length of the branches corresponds to the amount of genetic divergence between the species.
Close relationships:
Species that are closely related to the query sequence are grouped together near the top of the tree.
For instance, Staphylococcus aureus subsp. aureus is the closest relative to the query sequence, followed by other Staphylococcus species.
Distant relationships:
Species further down the tree, like Xanthomonas cucurbitae, are more distantly related to the query sequence.
Phylogenetic structure:
The tree's structure shows how different species diverged from common ancestors, providing insights into their evolutionary history.
This visualization shows the distribution of the species genome sequences identified in your sample based on their source, providing insight into where these sequences were collected from.
The diagram is a Sankey chart, illustrating the flow of data from broader categories on the left to more specific subcategories on the right. The width of the bands represents the proportion of sequences from each category.
Host
Human
Subcategories may include:
Skin
Oral
Respiratory
Gastrointestinal
Vaginal
Urinary
Stool
Blood
Brain
Other
Unknown
Animal
Subcategories may include:
Mouse
Cow
Pig
Chicken
Livestock (general category for farm animals)
Other
Unknown
Food
Sequences collected from food sources.
Environment
Subcategories may include:
Soil
Water
Other
Other
A general category for sequences that do not fit into the above categories.
Unknown
Sequences for which the source is not specified.
Is clinical isolate: This category indicates sequences identified as clinical isolates, meaning they were obtained from clinical settings, possibly related to infections or other clinical conditions.
Is not clinical isolate: This category indicates sequences not classified as clinical isolates.
DNA sequences in a genome that confer resistance to antibiotics.
The antibiotic class to which the resistance gene provides resistance.
A more specific categorization within the antibiotic class, detailing the particular type of antibiotic.
The specific gene that provides antibiotic resistance.
Indicates the type of antimicrobial resistance (AMR) element, typically listed as a gene.
The unique identifier for the gene sequence in a database.
The contig number in the genome assembly where the gene is located.
The specific genomic coordinates where the gene is located within the contig.
The direction of the gene on the contig, often indicated with arrows or symbols to show forward or reverse orientation.
The identity percentage, representing how similar the gene sequence is to a reference sequence. Higher percentages indicate higher similarity.
The reference coverage percentage, representing the proportion of the reference sequence that is covered by the gene sequence. Higher percentages indicate more complete coverage.
Applied scheme
The specific scheme used to analyze the pathogenicity markers. In this context, it refers to the scheme designed for a particular species, such as Staphylococcus aureus.
Description
This provides details on what the scheme is designed to classify or identify. It may include various resistance factors and toxins.
Marker
The specific gene or genetic element that is being checked for its presence.
Detection
Indicates whether the marker was detected in the sample. Possible values are:
Positive: The marker is present.
Negative: The marker is absent.
Identity (%)
Represents the percentage identity of the detected marker sequence compared to a reference sequence. High percentages indicate high similarity. NA is used when the marker is not detected.
Coverage (%)
Indicates the percentage of the reference marker sequence that is covered by the query sequence. High percentages indicate more complete coverage. NA is used when the marker is not detected.
Indication
Provides information on the significance of the marker's presence or absence, typically supporting specific classifications. For example:
Support classification as MRSA: Indicates methicillin-resistant Staphylococcus aureus if mecA is positive.
Support classification as MSSA: Indicates methicillin-susceptible Staphylococcus aureus if mecA is negative.
Support classification as PVL+: Indicates the presence of Panton-Valentine leukocidin if lukS-PV is positive.
Support classification as TSST+: Indicates the presence of toxic shock syndrome toxin if tst is positive.
Support classification as van+: Indicates the presence of vancomycin resistance genes if vanA, vanS, vanR, etc., are positive.
Support classification as CoN: Indicates coagulase-negative if coa is negative.
Support classification as et+: Indicates the presence of exfoliative toxin if et is positive.
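The indication rules listed above can be read as a simple lookup from marker detection results to supported classifications. The sketch below is one hedged interpretation, not the scheme's actual implementation: the function name is hypothetical, markers absent from the input are treated as not tested, and a single positive van gene is assumed to be enough to support van+.

```python
def supported_classifications(detected):
    """detected maps marker name -> True (positive) or False (negative).

    Markers missing from the dict are treated as not tested and skipped.
    """
    result = []
    if "mecA" in detected:
        result.append("MRSA" if detected["mecA"] else "MSSA")
    if detected.get("lukS-PV"):
        result.append("PVL+")
    if detected.get("tst"):
        result.append("TSST+")
    # Assumption: any one positive van gene supports the van+ classification.
    if any(detected.get(gene) for gene in ("vanA", "vanS", "vanR")):
        result.append("van+")
    if detected.get("coa") is False:
        result.append("CoN")
    if detected.get("et"):
        result.append("et+")
    return result


# Hypothetical detection results for one sample.
sample_markers = {"mecA": True, "lukS-PV": True, "tst": False,
                  "vanA": False, "coa": True, "et": False}
print(supported_classifications(sample_markers))
```

For these hypothetical results, the supported classifications are MRSA and PVL+.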
Proteins or other molecules produced by pathogens that contribute to their ability to cause disease.
The type of virulence factor.
The specific protein or factor associated with virulence.
The family of genes to which the virulence factor belongs.
The reference accession number for the gene sequence in a database, providing a unique identifier for the reference sequence.
The contig number in the genome assembly where the gene is located.
The specific genomic coordinates where the gene is located within the contig.
The direction of the gene on the contig, often indicated with arrows or symbols to show forward (green arrow) or reverse (red arrow) orientation.
The identity percentage, representing how similar the gene sequence is to a reference sequence. Higher percentages indicate higher similarity.
The reference coverage percentage, representing the proportion of the reference sequence that is covered by the gene sequence. Higher percentages indicate more complete coverage.
Lee, I., Kim, Y. O., Park, S. C., & Chun, J. (2016). OrthoANI: an improved algorithm and software for calculating average nucleotide identity. International Journal of Systematic and Evolutionary Microbiology, 66(2), 1100-1103.
Orakov, A., Fullam, A., Coelho, L. P., Khedkar, S., Szklarczyk, D., Mende, D. R., ... & Bork, P. (2021). GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biology, 22, 1-19.