# Genome Identification Report

## **Sample Information**

<figure><img src="/files/YZafyoZD4VhehfFGfp70" alt=""><figcaption></figcaption></figure>

### Name

This is the name of the sample that was uploaded to EzBioCloud.

### Pipeline & Database

The analytic engine and genome reference databases (genes and genomes) are provided for consistency and referencing purposes.

### Input

The input type of the uploaded sample: FA (FASTA), PAIRED\_FQ (paired FASTQ), or SINGLE\_FQ (single FASTQ).

***

## Identification summary

<figure><img src="/files/2Z1H9yGcliSzbGoi8ooF" alt=""><figcaption><p>These categories provide a comprehensive overview of the detected species, including taxonomic classification, sequence similarity, and the extent of sequence alignment between the sample and the reference.</p></figcaption></figure>

### Identified species

This category indicates the specific species identified from your submitted sample.

### Top hit species

This is the specific subspecies or the closest match to the query sequence found in the reference database.

### Reference type

This indicates the type of reference data used for the identification, such as the whole genome, partial genome, or specific gene sequences such as 16S rRNA.

### Taxonomy

This section lists the hierarchical classification of the identified species down to the species level.

### Identity

This value represents the percentage identity of the query sequence to the reference sequence. It shows how similar the sample sequence is to the reference sequence.

### Coverage

Coverage is divided into two percentages:

* The percentage of the query sequence that aligns with the reference.
* The percentage of the reference sequence that is covered by the query sequence.
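As a rough illustration of the two coverage figures, the sketch below computes them from aligned lengths. The function name, its parameters, and the example numbers are hypothetical; how the report derives aligned lengths internally is not described on this page.

```python
def coverage_percentages(aligned_query_bp, query_len_bp, aligned_ref_bp, ref_len_bp):
    """Return (query coverage %, reference coverage %) from aligned lengths."""
    # Share of the query sequence that aligns with the reference.
    query_cov = 100.0 * aligned_query_bp / query_len_bp
    # Share of the reference sequence that is covered by the query.
    ref_cov = 100.0 * aligned_ref_bp / ref_len_bp
    return query_cov, ref_cov


# Made-up lengths in base pairs, for illustration only.
query_cov, ref_cov = coverage_percentages(2_700_000, 2_800_000, 2_700_000, 3_000_000)
print(f"query coverage: {query_cov:.1f}%, reference coverage: {ref_cov:.1f}%")
```

The two values can differ considerably, for example when a small query genome aligns almost completely to a much larger reference.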

### Identification threshold

This is the threshold value used for species identification which determines whether the sample can be classified as a particular species based on sequence similarity.

***

## Read statistics and quality (for FASTQ)

<figure><img src="/files/wDmgfDlbIquenNNxZhHv" alt=""><figcaption><p>This table compares these metrics between the original reads and the quality-controlled (QC-passed) reads for different types of sequencing data (short single-end, short paired-end, and long reads).</p></figcaption></figure>

### N. reads

This stands for the number of reads. It indicates the total count of sequencing reads obtained.

### Bases (bp)

This refers to the total number of base pairs in all the reads. It is a measure of the total amount of sequence data.

### Mean len. (bp)

This stands for the mean length of the reads, measured in base pairs. It indicates the average length of the sequencing reads.
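The three read metrics above can be sketched as follows. This is a minimal example that assumes plain four-line FASTQ records; the function name is hypothetical and it is not the pipeline's own implementation.

```python
import io


def read_stats(fastq_handle):
    """Return (number of reads, total bases, mean read length) for 4-line FASTQ."""
    n_reads = 0
    total_bases = 0
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the second line of each record holds the sequence
            n_reads += 1
            total_bases += len(line.strip())
    mean_len = total_bases / n_reads if n_reads else 0.0
    return n_reads, total_bases, mean_len


# Two toy reads of 4 bp and 6 bp.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
stats = read_stats(example)
print(stats)
```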

### Q20 rate

The Q20 rate represents the percentage of bases with a quality score of 20 or higher. A Q20 score corresponds to a 1% error rate, meaning there is a 99% probability that the base is called correctly.

### Q30 rate

The Q30 rate represents the percentage of bases with a quality score of 30 or higher. A Q30 score corresponds to a 0.1% error rate, meaning there is a 99.9% probability that the base is called correctly.
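The relationship between a quality score and its error rate, and the way a Q20 or Q30 rate is counted, can be sketched as below. The example assumes Phred+33 encoded quality strings (the common FASTQ encoding); the function names are hypothetical.

```python
def error_probability(q):
    """Phred relationship: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def q_rate(quality_strings, min_q, offset=33):
    """Percentage of bases whose quality score is at least min_q."""
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            # Decode the ASCII character into a Phred score (assumed Phred+33).
            if ord(ch) - offset >= min_q:
                passing += 1
    return 100.0 * passing / total if total else 0.0


# Toy quality strings: 'I' = Q40, '?' = Q30, '5' = Q20, '+' = Q10, '#' = Q2.
quals = ["II5+", "I?5#"]
print(error_probability(20), error_probability(30))
print(q_rate(quals, 20), q_rate(quals, 30))
```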

### Short, SE

This refers to short single-end reads, where each DNA fragment is sequenced from one end only.

### Short, PE

This refers to short paired-end reads, where each DNA fragment is sequenced from both ends, providing two reads per fragment.

### Long

This refers to long reads, which are typically generated by long-read sequencing technologies such as Oxford Nanopore and PacBio.

### Original reads

These are the reads obtained directly from the sequencing process before any quality control filtering is applied.

### QC-passed reads

These are the reads that have passed quality control checks, ensuring that they meet our standards of accuracy and reliability.

***

## Assembly statistics and quality

<figure><img src="/files/fE5lInCdyBhDPVgeJUvb" alt=""><figcaption><p>These statistics provide an overview of the quality and completeness of the genome assembly, indicating how well the sequence reads have been assembled into contiguous and accurate genome sequences.</p></figcaption></figure>

### Assembler

This refers to the software or method used to assemble the sequence reads into contiguous sequences (contigs). In the example shown, it indicates a user-uploaded assembly.

### Genome size

This is the total size of the assembled genome, measured in base pairs (bp). It provides an estimate of the total length of the genome sequence assembled. The range in parentheses indicates the possible variation in genome size.

### Number of contigs

This indicates the number of contigs in the assembly. Contigs are continuous sequences of DNA that have been assembled from overlapping reads. A lower number of contigs generally indicates a more complete and contiguous assembly.

### GC content

This represents the percentage of guanine (G) and cytosine (C) bases in the DNA sequence. It is a measure of the composition of the genome. The range in parentheses shows the possible variation in GC content.
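A minimal sketch of the GC content calculation over a set of contigs is shown below. The function name is hypothetical, and counting every character (including any ambiguous bases such as N) in the denominator is an assumption of this example.

```python
def gc_content(contigs):
    """Percentage of G and C bases across all contig sequences."""
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in contigs)
    total = sum(len(seq) for seq in contigs)  # assumption: all characters count
    return 100.0 * gc / total if total else 0.0


# Toy contigs: 6 of the 12 bases are G or C.
gc_percent = gc_content(["ATGC", "GGCC", "ATAT"])
print(gc_percent)
```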

### Coverage depth

This refers to the average number of times each base in the genome is covered by the reads. Higher coverage depth usually indicates higher confidence in the accuracy of the assembly.
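One common way to estimate average depth is to divide the total number of sequenced bases by the genome size. The sketch below uses that estimate; whether the report computes depth this way or from read mapping is not stated on this page, so treat it as an assumption.

```python
def estimated_depth(total_read_bases, genome_size_bp):
    """Average depth estimated as total sequenced bases / genome size."""
    return total_read_bases / genome_size_bp


# Made-up numbers: 280 Mbp of reads over a 2.8 Mbp genome.
depth = estimated_depth(280_000_000, 2_800_000)
print(f"{depth:.0f}x")
```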

### N50 length

The N50 length is the contig length such that contigs of that length or longer together contain at least 50% of the total assembly length. It is a measure of assembly contiguity, with longer N50 lengths indicating a more contiguous assembly.
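The N50 definition can be sketched as follows: sort the contigs from longest to shortest and accumulate lengths until half of the total assembly is reached. The function name is hypothetical.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches 50% of the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Total is 300 bp; 100 + 80 = 180 >= 150, so the N50 is 80.
example_n50 = n50([100, 80, 60, 40, 20])
print(example_n50)
```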

### UBCG paralog

This indicates the percentage of paralogous genes (genes that have evolved by duplication) detected out of a defined set of Universal Bacterial Core Genes (UBCG). Paralogous genes can complicate genome assembly and annotation.

### UBCG recovery

This is the percentage of the defined set of Universal Bacterial Core Genes (UBCG) that have been successfully recovered in the assembly. A higher percentage indicates a more complete and representative assembly of the bacterial genome.
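The two UBCG percentages can be illustrated with a per-gene copy count. This is only a sketch: the function name is hypothetical, the four gene names stand in for the full core gene set, and the exact counting rules used by the report are assumed rather than documented here.

```python
def ubcg_summary(copies_per_gene):
    """Return (recovery %, paralog %) from a mapping of core gene -> copies found."""
    total = len(copies_per_gene)
    # A gene counts as recovered when at least one copy was found.
    recovered = sum(1 for n in copies_per_gene.values() if n >= 1)
    # A gene counts as paralogous when more than one copy was found.
    paralogous = sum(1 for n in copies_per_gene.values() if n > 1)
    return 100.0 * recovered / total, 100.0 * paralogous / total


# Illustrative counts only; the real core gene set is much larger.
recovery, paralog = ubcg_summary({"rpoB": 1, "recA": 1, "gyrB": 2, "dnaK": 0})
print(recovery, paralog)
```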

***

## Domain affiliation check

<figure><img src="/files/vTH378HJsea1kF7lW2Ii" alt=""><figcaption><p>These categories help in understanding the composition of the sample in terms of different domains of life, showing which domains are present and how dominant each is in both the original sequencing reads and the assembled genome.</p></figcaption></figure>

### Domain affiliation check

This section checks and categorizes the sequence reads and genome assembly into different domains of life, including Bacteria, Archaea, Eukarya, and Viruses. It provides an overview of the distribution of the sequences across these domains.

### Original reads

This refers to the percentage of the initial sequence reads that align with each domain. The columns under this heading show the distribution of reads among Bacteria, Archaea, Eukarya, and Viruses.

### Genome assembly

This refers to the percentage of the assembled genome that aligns with each domain. The columns under this heading show the distribution of the assembled genome sequences among Bacteria, Archaea, Eukarya, and Viruses.

***

## **MLST**

<figure><img src="/files/BPwVexB7rlrkfeaw8tki" alt=""><figcaption><p>Multi-Locus Sequence Typing (MLST) is a method for subtyping bacteria based on the sequence of several housekeeping genes.</p></figcaption></figure>

### **MLST scheme**

MLST scheme refers to a specific set of genes used for MLST analysis. There are several available MLST schemes for different bacterial taxa.

### **Sequence Type**

Sequence type (ST) refers to the specific allele combination at each locus used in an MLST scheme, which can be used to define a unique subtype of a bacterial species.

### **Allele in each locus**

Allele in each locus refers to the specific variant of a gene used in an MLST scheme. The combination of alleles at each locus in an MLST scheme can define a unique ST. Each locus corresponds to a different housekeeping gene, and the alleles are typically represented by numbers indicating the specific variant present. For example:

* **arcC \[3]**: The allele 3 of the arcC gene.
* **aroE \[3]**: The allele 3 of the aroE gene.
* **glpF \[1]**: The allele 1 of the glpF gene.
* **gmk \[1]**: The allele 1 of the gmk gene.
* **pta \[1]**: The allele 1 of the pta gene.
* **tpi \[1]**: The allele 1 of the tpi gene.
* **yqiL \[10]**: The allele 10 of the yqiL gene.
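The link between an allele profile and a sequence type can be sketched as a simple lookup. The table below is hypothetical: the labels "ST-A" and "ST-B" are placeholders, and real ST numbers come from the database of the applied MLST scheme.

```python
# Loci in the order used by the example above.
LOCI = ("arcC", "aroE", "glpF", "gmk", "pta", "tpi", "yqiL")

# Hypothetical lookup table; not real ST assignments.
PROFILE_TO_ST = {
    (3, 3, 1, 1, 1, 1, 10): "ST-A",
    (1, 1, 1, 1, 1, 1, 1): "ST-B",
}


def sequence_type(alleles):
    """Map a dict of locus -> allele number to a sequence type label."""
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILE_TO_ST.get(profile, "unknown")


sample = {"arcC": 3, "aroE": 3, "glpF": 1, "gmk": 1, "pta": 1, "tpi": 1, "yqiL": 10}
print(sequence_type(sample))
```

A profile that differs at even one locus maps to a different ST, or to no known ST at all.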

***

## **Antibiotic resistance determinants summary**

<figure><img src="/files/ogIdZuPKR9HKdqHcuEXY" alt=""><figcaption><p>Antibiotic resistance determinants summary covers the presence and abundance of genes encoding resistance to antibiotics in a genome.</p></figcaption></figure>

### **Antibiotic classes**

Antibiotic classes are the different categories of antibiotics, such as beta-lactams, aminoglycosides, macrolides, etc.

### **Antibiotic subclasses**

Antibiotic subclasses refer to specific subtypes of antibiotics within a larger class, such as penicillins, cephalosporins, etc.

### **Resistance gene families**

Resistance gene families are groups of genes that encode resistance to antibiotics.

### Resistance mutations

This section lists any mutations identified in the sample that are known to confer antibiotic resistance. In the example shown, no resistance mutations are listed (indicated by the "-").

***

## Pathogenicity marker summary

<figure><img src="/files/0bLQIXM5kNPzhKDTbcQl" alt=""><figcaption><p>This section provides an overview of the markers associated with pathogenicity identified in the sample. The presence or absence of these markers helps in assessing the pathogenic potential of the microbial sample, providing insights into its virulence factors and the likelihood of it causing disease.</p></figcaption></figure>

### **Pathogenicity markers**

Pathogenicity markers are genes or markers that are associated with pathogenicity or virulence in bacteria.

### Scheme

This specifies the particular pathogen or species for which the pathogenicity markers are being analyzed. In this case, it is **Staphylococcus aureus**.

### Positive markers

These are markers that have been detected in the sample and are known to be associated with pathogenicity. These markers indicate the presence of specific genes or sequences that contribute to the organism's ability to cause disease.

### Negative markers

These are markers that were tested for but not detected in the sample. The absence of these markers suggests that the corresponding pathogenicity factors are not present in the sample.

***

## Top ANI hits

<figure><img src="/files/1NOec6y2DEU3190Qo5dm" alt=""><figcaption><p>These categories help in identifying and comparing the query sequence with the most similar known sequences, providing insights into its potential identity and classification based on genetic similarity.</p></figcaption></figure>

### Top ANI hits

This section lists the top matches for the query sequence based on Average Nucleotide Identity (ANI), which is a measure of sequence similarity.

Each hit is listed with its rank, which is based on the similarity score, with #1 being the most similar match.

### Species

This column lists the species names of the top hits. These are the species whose genome sequences show the highest similarity to the query sequence.

### Genome group

This indicates the specific group or subtype within the species, if available. It provides more detailed classification within the species.

### Taxonomy

This column lists the hierarchical classification of the species.

### Iden. (%)

This stands for Identity percentage, which represents how similar the aligned regions of the query sequence are to the reference sequence. Higher percentages indicate higher similarity.

### Query cov. (%)

This stands for Query coverage percentage, which represents the percentage of the query sequence that aligns with the reference sequence. Higher percentages indicate more comprehensive alignment.

### Ref. cov. (%)

This stands for Reference coverage percentage, which represents the percentage of the reference genome that is covered by the query sequence. Higher percentages indicate more extensive alignment with the reference genome.
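The ranking of hits can be sketched as a sort on the identity column. All numbers below are made up for illustration, and the field names are hypothetical; only the species names are taken from the example report.

```python
# Made-up hit values for illustration; not real results.
hits = [
    {"species": "Staphylococcus schweitzeri", "identity": 88.9, "query_cov": 71.2, "ref_cov": 70.5},
    {"species": "Staphylococcus aureus subsp. aureus", "identity": 98.7, "query_cov": 93.4, "ref_cov": 91.8},
]

# Highest identity first, then number the hits starting at 1.
ranked = sorted(hits, key=lambda h: h["identity"], reverse=True)
for rank, hit in enumerate(ranked, start=1):
    hit["rank"] = rank
    print(f"#{rank} {hit['species']} {hit['identity']}%")
```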

***

## UBCG tree

<figure><img src="/files/UUn7hJuXI8JDeZd3c1FO" alt=""><figcaption><p>This phylogenetic tree helps visualize the genetic relationships and evolutionary distances between the query sequence and other bacterial species based on core gene comparisons.</p></figcaption></figure>

### UBCG tree

The UBCG (Universal Bacterial Core Gene) tree is a phylogenetic tree that represents the evolutionary relationships among different bacterial species based on the alignment and comparison of core genes that are universally present across bacterial genomes.

### Elements of the UBCG tree:

**Query sequence**:

* Marked with a red asterisk (\*) and in red text, this indicates the sequence you are analyzing.

**Species**:

* The tree lists various species (e.g., Staphylococcus aureus subsp. aureus, Staphylococcus schweitzeri, etc.), showing their phylogenetic relationships. Each species is a leaf on the tree, representing a distinct organism.

**Branches**:

* The branches connect different species or nodes, illustrating the evolutionary path from common ancestors. Red branches indicate the paths connecting the query sequence to its closest relatives.

**Nodes**:

* Points where branches split, representing common ancestors shared by the species or sequences that branch out from them.

**Scale bar**:

* The scale bar (e.g., 2.1893) provides a reference for the genetic distance. The length of the branches corresponds to the amount of genetic divergence between the species.

### Interpretation:

**Close relationships**:

* Species that are closely related to the query sequence are grouped together near the top of the tree.
* For instance, Staphylococcus aureus subsp. aureus is the closest relative to the query sequence, followed by other Staphylococcus species.

**Distant relationships**:

* Species further down the tree, like Xanthomonas cucurbitae, are more distantly related to the query sequence.

**Phylogenetic structure**:

* The tree's structure shows how different species diverged from common ancestors, providing insights into their evolutionary history.

***

## **Source profile of microorganism genome sequences**

<figure><img src="/files/C54pL5xDVar2DHmmW9lZ" alt=""><figcaption><p>Source profile charts help in understanding the origins and distribution of your species' genome sequences, providing valuable information for epidemiological and clinical studies.</p></figcaption></figure>

### Source profile of microorganism genome sequences

This visualization shows the distribution of the species genome sequences identified in your sample based on their source, providing insight into where these sequences were collected from.

### Categories and Flow

The diagram is a Sankey chart, illustrating the flow of data from broader categories on the left to more specific subcategories on the right. The width of the bands represents the proportion of sequences from each category.

### Main Source Categories

1. **Host**
   * **Human**
     * Subcategories may include:
       * Skin
       * Oral
       * Respiratory
       * Gastrointestinal
       * Vaginal
       * Urinary
       * Stool
       * Blood
       * Brain
       * Other
       * Unknown
   * **Animal**
     * Subcategories may include:
       * Mouse
       * Cow
       * Pig
       * Chicken
       * Livestock (general category for farm animals)
       * Other
       * Unknown
2. **Food**
   * Sequences collected from food sources.
3. **Environment**
   * Subcategories may include:
     * Soil
     * Water
     * Other
4. **Other**
   * A general category for sequences that do not fit into the above categories.
5. **Unknown**
   * Sequences for which the source is not specified.

### Clinical Isolates

* **Is clinical isolate**: This category indicates sequences identified as clinical isolates, meaning they were obtained from clinical settings, possibly related to infections or other clinical conditions.
* **Is not clinical isolate**: This category indicates sequences not classified as clinical isolates.
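The band widths in such a chart correspond to the proportion of genome records in each category pair. A minimal sketch of that aggregation, using made-up source records and hypothetical variable names, could look like this:

```python
from collections import Counter

# Made-up (main category, subcategory) records for illustration.
records = [
    ("Host", "Human"),
    ("Host", "Human"),
    ("Host", "Animal"),
    ("Food", None),
    ("Environment", "Soil"),
]

# Count the records per flow, then convert counts to proportions.
flows = Counter(records)
total = sum(flows.values())
proportions = {flow: count / total for flow, count in flows.items()}

for (category, subcategory), share in proportions.items():
    print(category, subcategory, f"{share:.0%}")
```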

***

## **Antibiotic resistance determinants**

<figure><img src="/files/SCkKQHCIhlIyWsqjwUam" alt=""><figcaption><p>This section provides detailed information on the antibiotic resistance genes present in the sample, including their classification, genetic details, and alignment statistics. This information is crucial for understanding the resistance mechanisms and their potential impact on antibiotic treatment efficacy.</p></figcaption></figure>

### **Antibiotic resistance determinants**

DNA sequences in a genome that confer resistance to antibiotics.

#### Categories in the table:

### **Class**

The antibiotic class to which the resistance gene provides resistance.

### **Subclass**

A more specific categorization within the antibiotic class, detailing the particular type of antibiotic.

### **Gene**

The specific gene that provides antibiotic resistance.

### **Form of AMR**

Indicates the form of antibiotic resistance (AMR), typically listed as a gene.

### **Accession**

The unique identifier for the gene sequence in a database.

### **Contig**

The contig number in the genome assembly where the gene is located.

### **Location**

The specific genomic coordinates where the gene is located within the contig.

### **Dir.**

The direction of the gene on the contig, often indicated with arrows or symbols to show forward or reverse orientation.
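Together, the contig, location, and direction are enough to pull the gene sequence out of the assembly. The sketch below assumes 1-based, inclusive coordinates and takes the reverse complement for genes in the reverse direction; both the coordinate convention and the function name are assumptions of this example.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_gene(contig_seq, start, end, forward=True):
    """Return the gene sequence for 1-based inclusive coordinates on a contig."""
    region = contig_seq[start - 1:end]
    if forward:
        return region
    # Reverse direction: complement each base, then reverse the string.
    return region.translate(_COMPLEMENT)[::-1]


contig = "AAATGCCCGGG"
print(extract_gene(contig, 3, 6))                  # forward direction
print(extract_gene(contig, 3, 6, forward=False))   # reverse direction
```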

### **Iden. (%)**

The identity percentage, representing how similar the gene sequence is to a reference sequence. Higher percentages indicate higher similarity.

### **Ref. cov. (%)**

The reference coverage percentage, representing the proportion of the reference sequence that is covered by the gene sequence. Higher percentages indicate more complete coverage.

***

## Pathogenicity scheme

<figure><img src="/files/B0UU8VJ9RfHL5CKhau3I" alt=""><figcaption><p>Pathogenicity markers are DNA sequences in a genome that are associated with the ability of an organism to cause disease. This table helps in understanding the pathogenic potential of the sample, identifying resistance mechanisms, and providing a basis for clinical and epidemiological classification.</p></figcaption></figure>

#### Pathogenicity Scheme

**Applied scheme**

The specific scheme used to analyze the pathogenicity markers. In this context, it refers to the scheme designed for a particular species, such as Staphylococcus aureus.

**Description**

This provides details on what the scheme is designed to classify or identify. It may include various resistance factors and toxins.

#### Pathogenicity markers

**Marker**

* **Marker**: The specific gene or genetic element that is being checked for its presence.

**Detection**

* **Detection**: Indicates whether the marker was detected in the sample. Possible values are:
  * Positive: The marker is present.
  * Negative: The marker is absent.

**Identity (%)**

* **Identity (%)**: Represents the percentage identity of the detected marker sequence compared to a reference sequence. High percentages indicate high similarity. NA is used when the marker is not detected.

**Coverage (%)**

* **Coverage (%)**: Indicates the percentage of the reference marker sequence that is covered by the query sequence. High percentages indicate more complete coverage. NA is used when the marker is not detected.

**Indication**

* **Indication**: Provides information on the significance of the marker's presence or absence, typically supporting specific classifications. For example:
  * **Support classification as MRSA**: Indicates methicillin-resistant Staphylococcus aureus if mecA is positive.
  * **Support classification as MSSA**: Indicates methicillin-susceptible Staphylococcus aureus if mecA is negative.
  * **Support classification as PVL+**: Indicates the presence of Panton-Valentine leukocidin if lukS-PV is positive.
  * **Support classification as TSST+**: Indicates the presence of toxic shock syndrome toxin if tst is positive.
  * **Support classification as van+**: Indicates the presence of vancomycin resistance genes if vanA, vanS, vanR, etc., are positive.
  * **Support classification as CoN**: Indicates coagulase-negative if coa is negative.
  * **Support classification as et+**: Indicates the presence of exfoliative toxin if et is positive.
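The examples above can be read as simple rules on marker detection. The sketch below encodes them as written in this list; the function name and the input format (a mapping of marker name to a detected flag) are hypothetical, and markers missing from the input are treated as not tested.

```python
def indications(detected):
    """Map marker -> detected (bool) to the classifications it supports."""
    labels = []
    if "mecA" in detected:
        labels.append("MRSA" if detected["mecA"] else "MSSA")
    if detected.get("lukS-PV"):
        labels.append("PVL+")
    if detected.get("tst"):
        labels.append("TSST+")
    if any(detected.get(marker) for marker in ("vanA", "vanS", "vanR")):
        labels.append("van+")
    if detected.get("coa") is False:  # tested and not detected
        labels.append("CoN")
    if detected.get("et"):
        labels.append("et+")
    return labels


# Made-up detection results for illustration.
sample_markers = {"mecA": False, "lukS-PV": True, "tst": False, "coa": True}
print(indications(sample_markers))
```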

***

## **Virulence factor hits**

<figure><img src="/files/O8meNFJhrIovv1cMpm3R" alt=""><figcaption><p>These categories provide detailed information on the virulence genes present in the sample, including their classification, genetic details, and alignment statistics. This information is crucial for understanding the pathogenic mechanisms and potential impact of the virulence factors on host interactions.</p></figcaption></figure>

### **Full list of virulence factor hits**

Proteins or other molecules produced by pathogens that contribute to their ability to cause disease.

### **Category**

The type of virulence factor.

### **Virulence factor**

The specific protein or factor associated with virulence.

### **Gene family**

The family of genes to which the virulence factor belongs.

### **REF. accession**

The reference accession number for the gene sequence in a database, providing a unique identifier for the reference sequence.

### **Contig**

The contig number in the genome assembly where the gene is located.

### **Location**

The specific genomic coordinates where the gene is located within the contig.

### **Dir.**

The direction of the gene on the contig, often indicated with arrows or symbols to show forward (green arrow) or reverse (red arrow) orientation.

### **Iden. (%)**

The identity percentage, representing how similar the gene sequence is to a reference sequence. Higher percentages indicate higher similarity.

### **Ref. cov. (%)**

The reference coverage percentage, representing the proportion of the reference sequence that is covered by the gene sequence. Higher percentages indicate more complete coverage.

***

## References

Lee, I., Ouk Kim, Y., Park, S. C., & Chun, J. (2016). OrthoANI: an improved algorithm and software for calculating average nucleotide identity. *International journal of systematic and evolutionary microbiology*, *66*(2), 1100-1103.

Orakov, A., Fullam, A., Coelho, L. P., Khedkar, S., Szklarczyk, D., Mende, D. R., ... & Bork, P. (2021). GUNC: detection of chimerism and contamination in prokaryotic genomes. *Genome biology*, *22*, 1-19.

