# UBCG Technical Guide

## What is UBCG?

UBCG stands for Up-to-date Bacterial Core Gene. It is a method and software tool for inferring phylogenetic relationships from a bacterial core gene set that is defined using an up-to-date bacterial genome database.

*This document is for **version 3**. If you have an older version, please download and install the latest version.*

## How to cite the UBCG pipeline

If you use this tool, please cite the following:

Na, S. I., Kim, Y. O., Yoon, S. H., Ha, S. M., Baek, I. & Chun, J. (2018). UBCG: Up-to-date bacterial core gene set and pipeline for phylogenomic tree reconstruction. *J Microbiol* 56, (in press). \[[View only version](http://rdcu.be/H3ia)]

## Gene set used in the UBCG pipeline

The most widely used approach to genome-based phylogenetic tree reconstruction relies on a core gene set. The core gene set can be defined as:

* Genes that are present in the majority of species, if not all
* Genes that are present in a single copy (likely orthologous but not paralogous)

The number of core genes varies depending on the scope of a target taxon. If you generate a phylogenetic tree for a species, the core gene set may consist of up to thousands of genes. However, to cover any taxa in the Bacteria domain, the core gene set should be restricted to the highly conserved ones (Bacterial Core Gene \[BCG]).

Because the number and taxonomic coverage of complete genome sequences in public databases keep changing, the composition of the BCG set also changes over time.

Here, we compiled the latest bacterial core gene set using the largest dataset (1,429 complete genome sequences, a single genome per species, covering 28 phyla). The current UBCG set consists of 92 genes whose details are given [here](https://kb.ezbiocloud.net/home/science-blogs/profile/ubcg-set).

## Concept of the UBCG pipeline

We designed the pipeline so that users can handle hundreds, if not thousands, of genomes. The design concept is outlined here to help you understand the pipeline and get the most out of it.

* All UBCG sequences extracted from each genome sequence are stored in a single file (\*.***bcg***). This file also contains a label with full information about the strain (e.g. *Escherichia coli* K-12 MG1655) and other details (e.g. database accession). Once a *bcg* file is generated, it can be reused in different analyses, and its label can be edited to change how the genome is named in the phylogenetic trees.
* A run is carried out on a set of *bcg* files of the user’s choice. The selected *bcg* files are saved in a single directory; the UBCG pipeline then aligns each core gene, concatenates the alignments, filters aligned positions, and calculates phylogenetic trees and gene support indices (GSIs).
* To run the pipeline on another set of *bcg* files, store the desired *bcg* files in a separate directory and re-run the pipeline. In other words, the set of *bcg* files to be analyzed together is controlled by the content of the directory holding them.
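The idea that a directory defines the genome set of a run can be sketched in shell. This is a minimal illustration and not part of UBCG: the function name `make_run_dir` and all directory and file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with UBCG): collect chosen *.bcg files into
# one directory so that the directory content defines the genome set of a run.
# Usage: make_run_dir <target_dir> <file1.bcg> [<file2.bcg> ...]
make_run_dir() {
    target=$1
    shift
    mkdir -p "$target" || return 1
    for f in "$@"; do
        # Copy rather than move, so the original bcg collection stays intact.
        cp "$f" "$target"/ || return 1
    done
    # Print how many bcg files the run directory now holds.
    ls "$target"/*.bcg 2>/dev/null | wc -l | tr -d ' '
}
```

For example, `make_run_dir run1_bcg bcg/strainA.bcg bcg/strainB.bcg` (hypothetical file names) would prepare a directory that can then be passed to the pipeline with `-bcg_dir run1_bcg`.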

## Installation

* The latest version is available [here](https://kb.ezbiocloud.net/kb/tools#ubcg-phylogenomic-tree-inference-using-bacterial-core-genes).
* It has been tested on Linux and Mac OS X 10 or higher. MS Windows is not supported because of the external programs used; Windows users should run UBCG in a virtual Linux machine, for example with [VirtualBox](https://www.virtualbox.org/).
* Unzip the UBCG.zip file in the desired directory.

{% embed url="<https://kb.ezbiocloud.net/kb/tools#ubcg-phylogenomic-tree-inference-using-bacterial-core-genes>" %}

## File formats used in the UBCG tool

| File extension | Function |
| -------------- | -------- |
| \*.bcg    | Files with the \*.bcg extension are in [JSON](http://json.org/) format and contain all extracted UBCG gene sequences with metadata (data about data). The file is plain text and readable by any text editor, so you can extract sequence information and edit metadata if necessary. The bcg format is designed to hold all necessary information about the genome and strain. |
| \*.fasta  | [FASTA](https://en.wikipedia.org/wiki/FASTA_format) is a standard file format for holding genome sequences. In the UBCG tool, all fasta files containing genome sequences should be converted to bcg files before generating multiple alignments and inferring phylogenetic trees. The results of multiple alignments are also written as fasta format files. |
| \*.nwk    | Newick is a standard format for phylogenetic trees. |
| \*.trm    | A JSON format file containing Newick-format trees and metadata of individual core gene trees and a UBCG tree. |
| \*.log    | A text file that contains detailed information about the pipeline run. |

## A typical structure of directories

* The program’s root directory should contain the “**UBCG.jar**” file and the “programPath” file, which holds the locations of the external software tools.
* The “fasta” directory contains the FASTA format files holding example genome/contig sequences.
* The “bcg” directory contains JSON format files (\*.*bcg*) holding UBCG gene sequences with metadata.
* The “output” directory contains all output files generated by the UBCG tool. Within the “output” directory, the results of each run are stored in a separate directory (named by “-prefix”).
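The layout above can be checked with a few lines of shell. The sketch below is an illustration only: the function name `check_ubcg_layout` is hypothetical, and it assumes the directory names listed above. The “output” directory may not exist before the first run, so a missing “output” entry should be read as informational.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with UBCG): report which parts of the
# layout described above are missing under a given UBCG root directory.
# Prints one line per missing item and nothing when everything is present.
check_ubcg_layout() {
    root=${1:-.}
    # Files expected in the program's root directory.
    for f in UBCG.jar programPath; do
        [ -f "$root/$f" ] || echo "missing file: $f"
    done
    # Directories used for input, intermediate, and output data.
    for d in fasta bcg output; do
        [ -d "$root/$d" ] || echo "missing directory: $d"
    done
}
```

Calling `check_ubcg_layout .` from the directory where UBCG.zip was unpacked lists anything that still has to be created.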

## Installing external programs

* The following programs should be installed in advance. Their locations should be written in the “programPath” file.
* PRODIGAL
  * For gene-finding
  * <https://github.com/hyattpd/prodigal/releases/>
* HMMER3
  * For identifying UBCG genes
  * <http://hmmer.org/>
  * Only the “hmmsearch” program is needed.
* FastTree
  * For inferring approximately-maximum-likelihood trees
  * <http://www.microbesonline.org/fasttree/>
* RAxML
  * For inferring maximum-likelihood trees
  * <https://sco.h-its.org/exelixis/web/software/raxml/index.html>
* You may also install and use other tools for phylogenetic inference. Since the pipeline provides multiple-alignment files, any phylogenetic inference program can be used to generate phylogenetic trees from UBCGs.
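Before filling in the “programPath” file, it can help to see which tools are already reachable from the shell. The sketch below is a hedged illustration: the function name `missing_programs` is hypothetical, and the executable names in the usage note are assumptions that depend on how each tool was built or packaged.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with UBCG): print the names of programs
# that cannot be found on PATH, one per line.
# The executable names passed in are assumptions; adjust them to your install.
missing_programs() {
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || echo "$prog"
    done
}
```

A call such as `missing_programs prodigal hmmsearch FastTree raxmlHPC` prints only the names that were not found. For the tools that are found, `command -v <name>` shows the full path, which is the kind of location the “programPath” file is meant to record.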

## Running UBCG pipeline

### Step 1: Converting genome assemblies or contigs (fasta) to bcg files

* Command: **java -jar UBCG.jar extract**
* This command converts a fasta file to a bcg file using the prodigal and hmmsearch tools.
* You are required to designate the following parameters:
  * -i : path of an input FASTA file containing a genome assembly.
  * -bcg\_dir : directory for all bcg files. The *bcg* file is given the same name as the *fasta* file (e.g. -bcg\_dir bcg).
  * -label : full label of the strain/genome. It should be enclosed in quotes (e.g. -label "Escherichia coli O157 876").
* The following are optional but useful metadata:
  * -taxon : name of the species (e.g. -taxon "Escherichia coli")
  * -strain : name of the strain (e.g. -strain "JC 126")
  * -type : add this if the strain is the type strain of a species or subspecies (e.g. -type)
  * -acc : accession of the genome sequence. Usually, NCBI’s assembly accession is used for public domain data.
  * -uid : a unique integer id. If you do not designate one, it will be generated automatically. Ignore this option if you are not sure about it.
* The content of bcg files (for example, gene sequences) can be viewed in CSV format, which is readable by Microsoft Excel or Google Sheets, using the following commands:
  * **java -jar UBCG.jar view** -i \<a bcg file name>
  * **java -jar UBCG.jar view** -d \<directory containing *bcg* files>
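When many genomes have to be converted, the extract command can be generated per file. The sketch below only prints the commands so that they can be reviewed first. It is an illustration under stated assumptions: the function name `print_extract_commands` is hypothetical, input files are assumed to end in `.fasta`, and the label is simply derived from the file name, which may not be the label you want.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with UBCG): print one "extract" command per
# FASTA file found in a directory. Nothing is executed here.
# Assumptions: files end in .fasta, and the file name (without extension, with
# underscores turned into spaces) is an acceptable -label value.
print_extract_commands() {
    fasta_dir=$1
    bcg_dir=$2
    for f in "$fasta_dir"/*.fasta; do
        [ -e "$f" ] || continue   # no matching files: skip the literal pattern
        label=$(basename "$f" .fasta | tr '_' ' ')
        printf 'java -jar UBCG.jar extract -i "%s" -bcg_dir "%s" -label "%s"\n' \
            "$f" "$bcg_dir" "$label"
    done
}
```

Running `print_extract_commands fasta bcg` lists the commands; once they look right, the same output can be piped to `sh`. Optional metadata such as -taxon, -strain, and -acc is not derived here and would have to be added by hand.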

### Step 2: Generating multiple alignments from *bcg* files

* Place all bcg files that you want to include in the analysis into a single directory by copying desired bcg files.
* Command: **java -jar UBCG.jar align**
* You are required to designate the following parameters:
  * -bcg\_dir : directory containing the bcg files that you want to include in the alignment.
* Optional parameters:
  * -out\_dir : directory where all output files will be written.
  * -a \<string>:  alignment method (default : codon).
    * nt             : nucleotide sequence alignment
    * aa             : amino acid sequence alignment
    * codon      : codon-based alignment (output is nucleotide sequences, but alignment is carried out using amino acid sequences).
    * codon12  : same as the “codon” option, but only the 1st and 2nd nucleotides of each codon are selected. The 3rd position is usually highly variable.
  * -t \<integer> : number of threads to be used (default: 1)
  * -f \<integer> : filtering cutoff for gap-containing positions (default: 50)
    * Enter a value from 0 to 100
    * 0 to select all alignment positions
    * 100 to select only positions that are present in all genomes
    * 50 to select positions that are present in half of the genomes
  * -prefix \<string>: a prefix added to the names of all output files so that each run can be recognized. If you do not designate one, it will be generated automatically.
    * e.g. john\_115, mycoplasma\_1
  * -gsi\_threshold: Threshold for Gene Support Index (GSI). 95 means 95%. (default = 95)
  * -raxml : Use RAxML for phylogeny reconstruction (Default: FastTree). Be aware that RAxML is much slower than FastTree.
  * -zZ : Make zZ-formatted files. This additionally creates fasta/nwk files with zZ+uid+zZ format for the names of each genome
* Examples of typical runs
  * java -jar UBCG.jar align -bcg\_dir bcg -prefix mytest1 (aligns and draws trees with the bcg files in the “bcg” directory and saves all results in the “output/mytest1” directory)
* Output files are written to the “output” directory (the default) or to the directory that you designated. Assuming that the prefix is mytest1, the files are named as follows:
  * mytest1.log = a text file containing logs (what happened during execution of the program)
  * mytest1.UBCG\_concat.codon.label.nwk = a Newick file based on the UBCG gene set, codon aligned, 50% filtered, with full label
  * mytest1.UBCG\_gsi(92).codon.50.label.nwk = a Newick file based on the UBCG gene set with Gene Support Index (GSI) values and full label
  * mytest1.concat.codon.50.label.fasta = a FASTA file containing multiple alignments of UBCG genes, codon aligned, 50% filtered, with full label
  * mytest1.concat.codon.50.zZ.fasta = a FASTA file containing multiple alignments of UBCG genes, codon aligned, 50% filtered, with zZ+Unique id+zZ
  * mytest1.secY.codon.50.label.nwk = a Newick file based on a single gene (secY), codon aligned, 50% filtered, with full label
  * mytest1.secY.codon.50.zZ.nwk = a Newick file based on a single gene (secY), codon aligned, 50% filtered, with zZ+Unique id+zZ
  * mytest1.align.secY.codon.50.label.fasta = a FASTA file containing the multiple alignment of a single gene (secY), codon aligned, 50% filtered, with full label
  * mytest1.align.secY.codon.50.zZ.fasta = a FASTA file containing the multiple alignment of a single gene (secY), codon aligned, 50% filtered, with zZ+Unique id+zZ
  * The nwk files can be viewed with [MEGA](http://www.megasoftware.net/), [FigTree](http://tree.bio.ed.ac.uk/software/figtree/) and other tree viewers. MEGA was tested for displaying Gene Support Index (GSI) values on the branches of phylogenetic trees.
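To make the effect of the `-f` cutoff concrete, the sketch below applies the same idea to a toy alignment. It is an illustration only and not UBCG's own filtering code: the function name `filter_gap_columns` is hypothetical, the input is assumed to be aligned FASTA with each sequence on a single line, and treating the cutoff as "at least this percentage of sequences have a residue at the position" is an assumption about the exact boundary behaviour.

```shell
#!/bin/sh
# Illustrative sketch only; not UBCG's own filtering code.
# Keeps alignment columns in which at least <cutoff> percent of the sequences
# have a residue (any character other than a gap, "-").
# Assumption: aligned FASTA on stdin, each sequence on a single line.
# Usage: filter_gap_columns <cutoff 0-100> < aligned.fasta
filter_gap_columns() {
    awk -v cutoff="$1" '
        /^>/ { name[++n] = $0; next }      # header line starts a new record
        {
            seq[n] = $0
            len = length($0)
            # Count, per column, how many sequences have a non-gap character.
            for (i = 1; i <= len; i++)
                if (substr($0, i, 1) != "-") present[i]++
        }
        END {
            for (s = 1; s <= n; s++) {
                out = ""
                for (i = 1; i <= len; i++)
                    if (present[i] * 100 >= cutoff * n)
                        out = out substr(seq[s], i, 1)
                print name[s]
                print out
            }
        }'
}
```

With a cutoff of 0 every column is kept, and with 100 only columns without any gap survive, which mirrors the option description above.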

### Test run using example data

An example set is provided with the UBCG package. Please follow the instructions below:

1. Unzip the UBCG package. The example fasta files should be in the 'fasta' directory. Execute the commands in the code block below to convert the *fasta* files to *bcg* files.
2. Check the 'bcg' directory with 'ls bcg/\*'; you should find eight \*.bcg files that contain UBCG gene sequences with metadata.
3. To align and generate the UBCG tree, execute the following line:
   * java -jar UBCG.jar align -bcg\_dir bcg -prefix my\_example
4. Outputs will be saved in the 'output/my\_example' directory.
5. Open the '\*.UBCG\_gsi(92).codon.50.label.nwk' file with MEGA or another tree-viewing program. (A screenshot from MEGA is shown below.)

```
java -jar UBCG.jar extract -bcg_dir bcg -i fasta/CP012646_s_GCA_001281025.1_KCOM_1350.fasta -label "CP012646_s KCOM 1350" -acc "GCA_001281025.1" -taxon "CP012646_s" -strain "KCOM 1350" 

java -jar UBCG.jar extract -bcg_dir bcg -i fasta/Streptococcus_constellatus_subsp._constellatus_GCA_000474075.1_SK53_TYPE.fasta -label "Streptococcus constellatus subsp. constellatus SK53(T)" -acc "GCA_000474075.1" -taxon "Streptococcus constellatus subsp. constellatus" -strain "SK53" -type 

java -jar UBCG.jar extract -bcg_dir bcg -i fasta/Streptococcus_constellatus_subsp._pharyngis_GCA_000223295.2_SK1060_TYPE.fasta -label "Streptococcus constellatus subsp. pharyngis SK1060(T)" -acc "GCA_000223295.2" -taxon "Streptococcus constellatus subsp. pharyngis" -strain "SK1060" -type 

java -jar UBCG.jar extract -bcg_dir bcg -i fasta/CP014326_s_GCA_001560895.1_SVGS_061.fasta -label "CP014326_s SVGS_061" -acc "GCA_001560895.1" -taxon "CP014326_s" -strain "SVGS_061" 

java -jar UBCG.jar extract -bcg_dir bcg -i fasta/Streptococcus_mitis_GCA_002014755.1_CCUG_31611_TYPE.fasta -label "Streptococcus mitis CCUG 31611(T)" -acc "GCA_002014755.1" -taxon "Streptococcus mitis" -strain "CCUG 31611" -type 

java -jar UBCG.jar extract -bcg_dir bcg -i fasta/Streptococcus_pneumoniae_GCA_001457635.1_NCTC_7465_TYPE.fasta -label "Streptococcus pneumoniae NCTC 7465(T)" -acc "GCA_001457635.1" -taxon "Streptococcus pneumoniae" -strain "NCTC 7465" -type 

java -jar UBCG.jar extract -bcg_dir bcg -i fasta/Streptococcus_pseudopneumoniae_GCA_002087075.1_CCUG_49455_TYPE.fasta -label "Streptococcus pseudopneumoniae CCUG 49455(T)" -acc "GCA_002087075.1" -taxon "Streptococcus pseudopneumoniae" -strain "CCUG 49455" -type 

java -jar UBCG.jar extract -bcg_dir bcg -i fasta/Streptococcus_pneumoniae_GCA_000006885.1_TIGR4.fasta -label "Streptococcus pneumoniae TIGR4" -acc "GCA_000006885.1" -taxon "Streptococcus pneumoniae" -strain "TIGR4" 

```
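After the eight extract commands have finished, the number of generated files can be verified before the alignment step is started. This is a small sketch, not part of UBCG; the function name `count_bcg_files` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with UBCG): count the *.bcg files in a
# directory (default: "bcg") and print the number.
count_bcg_files() {
    dir=${1:-bcg}
    n=0
    for f in "$dir"/*.bcg; do
        # The test guards against the unexpanded pattern when nothing matches.
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}
```

For the example data, `[ "$(count_bcg_files bcg)" -eq 8 ]` should succeed before the align command is run.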

<figure><img src="https://820779907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FDWKOAVP0eaMhg1acSkor%2Fuploads%2F53tpbqsqHH3N04QVaCbF%2Fimage.png?alt=media&#x26;token=b31bf221-6b37-4573-86ab-b4dcfe389e62" alt=""><figcaption><p>An example UBCG tree. (T) denotes the type strain.</p></figcaption></figure>

## UBCG Tree

This UBCG tree shows that

* Two *Streptococcus constellatus* subspecies are closely related; all 92 UBCGs supported this.
* *Streptococcus pneumoniae* TIGR4 is closely related to the type strain of *Streptococcus pneumoniae*; 72 out of 92 UBCGs supported this.
* Two tentatively new species named CP012646\_s and CP014326\_s formed a monophyletic clade with *S. pneumoniae*, *Streptococcus pseudopneumoniae*, and *Streptococcus mitis*.
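The support values discussed above can also be pulled out of a tree file on the command line. The sketch below is an illustration under an explicit assumption: it presumes that GSI values are written as integer labels on internal nodes of the Newick string, that is, digits directly following a closing parenthesis. The function name `list_node_support` is hypothetical, and a strain label that itself contains ")" followed by digits would produce false matches.

```shell
#!/bin/sh
# Illustrative sketch: list the integer labels attached to internal nodes of a
# Newick tree read from stdin, one per line.
# Assumption: GSI values are stored as internal-node labels, i.e. digits that
# directly follow a closing parenthesis.
list_node_support() {
    grep -o ')[0-9][0-9]*' | tr -d ')'
}
```

On a toy tree such as `((A:0.1,B:0.1)92:0.05,(C:0.2,D:0.2)72:0.05);` this prints 92 and 72 on separate lines.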

Below is the dendrogram showing OrthoANI-based clustering of the same genomes. Please note that CP012646\_s and CP014326\_s represent novel species when a 95\~96% average nucleotide identity (ANI) cutoff is applied \[[Learn more](http://ijs.microbiologyresearch.org/content/journal/ijsem/10.1099/ijsem.0.002516)].

<figure><img src="https://820779907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FDWKOAVP0eaMhg1acSkor%2Fuploads%2FHnGItGRHa0kxdaVASVf8%2Fimage.png?alt=media&#x26;token=901432dc-da43-42d3-8440-abbdfd44b695" alt=""><figcaption><p>OrthoANI UPGMA clustering of the example set.</p></figcaption></figure>

## Frequently Asked Question

How can I access the UBCG sequences?

* Use our [UBCG viewer](https://kb.ezbiocloud.net/kb/tools#ubcg-phylogenomic-tree-inference-using-bacterial-core-genes) to open any \*.*bcg* file. Click the right mouse button to copy DNA or protein sequences of any UBCG into the clipboard.

{% embed url="<https://kb.ezbiocloud.net/kb/tools#ubcg-phylogenomic-tree-inference-using-bacterial-core-genes>" %}
