Synorth
Synorth Help Center

Software

Synorth makes use of MySQL, Apache, BioPerl, and the Ensembl API. To display the orthologous data in an efficient and flexible manner, we use the basic Bio::Graphics module.

Detection of ortholog genes

We construct a new dataset of orthologous genes between various vertebrate genomes. The data sources are: 1) all Ensembl ortholog genes, 2) all Ensembl “between species paralogs” with a lower duplication score, and 3) predicted ortholog genes obtained by comparing the alignment of exon sequences. Ensembl has a clear-cut definition of the different kinds of homologs. Unlike the earlier definition of paralogs (two or more homologous genes within one species), Ensembl defines two genes as paralogs if their common ancestral node in the protein family tree is a duplication event (Sdup>0). We have relaxed the criteria slightly so as not to miss cases such as pax2a and pax2b. In addition to the homologs from the Ensembl database, we also check the alignment of the exons of each gene in the other genome, using the pairwise BLASTZ whole-genome chain alignments (chains) downloaded from the UCSC Genome Browser database. For example, in set2, we define a predicted ortholog pair if the target gene has a) >=10 exons aligned, b) 50% of all exons aligned well, or c) 50% of total nucleotides aligned in the query genome. An exon is classed as ‘aligned’ if, in total, at least 70% of its nucleotides are aligned in the chain alignment, and as ‘unaligned’ if less than 15% of its nucleotides are aligned; otherwise an alignment score is returned.
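The exon-based rule above can be sketched in a few lines of Python. This is only an illustration, not the Synorth code: the function names are hypothetical, the three criteria are read as "at least" thresholds, and the "alignment score" returned for intermediate exons is assumed here to be the aligned fraction itself, since the text does not define it.

```python
def exon_status(aligned_nt, exon_length):
    """Classify one exon from the number of its nucleotides covered by the chain."""
    fraction = aligned_nt / exon_length
    if fraction >= 0.70:
        return "aligned"
    if fraction < 0.15:
        return "unaligned"
    # Intermediate case: the text only says "an alignment score" is returned;
    # using the aligned fraction as that score is an assumption.
    return fraction


def is_predicted_ortholog(exons, min_exons=10, min_exon_frac=0.5, min_nt_frac=0.5):
    """Apply criteria a), b) and c) to a gene.

    exons: list of (aligned_nt, exon_length) tuples for the target gene.
    """
    if not exons:
        return False
    n_aligned = sum(1 for a, n in exons if exon_status(a, n) == "aligned")
    total_nt = sum(n for _, n in exons)
    aligned_nt = sum(a for a, _ in exons)
    return (
        n_aligned >= min_exons                      # a) >=10 exons aligned
        or n_aligned / len(exons) >= min_exon_frac  # b) 50% of exons aligned
        or aligned_nt / total_nt >= min_nt_frac     # c) 50% of nucleotides aligned
    )
```

For a three-exon gene with 80, 90 and 0 of 100 nucleotides aligned per exon, two of three exons are ‘aligned’, so criterion b) is met and the pair would be predicted as orthologous under these assumptions.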

Distinguishing out-paralogs from orthologs

An out-paralog is defined as a paralog arising from a duplication that predates a species split, and it can easily be confused with a true ortholog. To resolve this issue, we built a phylogenetic tree for each human gene that had more than two orthologs in a teleost fish (e.g. zebrafish), based on a multiple tBLASTn alignment of all sequences in the tree and using the neighbor-joining method. Starting from the node of the human reference gene, we first identified the closest zebrafish ortholog, and then found the minimal branch that contained both the human gene node and that closest zebrafish ortholog. Any orthologs outside this minimal branch were counted as out-paralogs of the closest zebrafish ortholog. This method proved successful in distinguishing between in-paralogs and out-paralogs, as confirmed by careful examination of some complicated cases (for example, human SOX2 and zebrafish sox2/19a/19b, data not shown).
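The tree-walking step can be illustrated with a small sketch. It assumes the neighbor-joining tree has already been built and rooted, and stores it as a hypothetical `parent` dictionary mapping each node to its parent and branch length; the gene names are placeholders, not real data. The "minimal branch" is taken to be the clade rooted at the last common ancestor of the human gene and its closest fish ortholog.

```python
def path_to_root(parent, node):
    """Return the list of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node][0]
        path.append(node)
    return path


def lca(parent, a, b):
    """Last common ancestor of nodes a and b."""
    ancestors_a = set(path_to_root(parent, a))
    for node in path_to_root(parent, b):
        if node in ancestors_a:
            return node
    raise ValueError("nodes are not in the same tree")


def distance(parent, a, b):
    """Sum of branch lengths on the path between a and b."""
    anc = lca(parent, a, b)
    total = 0.0
    for node in (a, b):
        while node != anc:
            node, length = parent[node][0], parent[node][1]
            total += length
    return total


def out_paralogs(parent, human, fish_genes):
    """Return (closest fish ortholog, fish genes outside the minimal branch)."""
    closest = min(fish_genes, key=lambda g: distance(parent, human, g))
    clade_root = lca(parent, human, closest)
    outside = [g for g in fish_genes if clade_root not in path_to_root(parent, g)]
    return closest, outside


# Toy tree (hypothetical): ((human_gene, (fish_a, fish_b)n2)n1, fish_c)root
parent = {
    "human_gene": ("n1", 0.1),
    "n2": ("n1", 0.1),
    "fish_a": ("n2", 0.1),
    "fish_b": ("n2", 0.2),
    "n1": ("root", 0.2),
    "fish_c": ("root", 0.5),
}
closest, outside = out_paralogs(parent, "human_gene", ["fish_a", "fish_b", "fish_c"])
```

In this toy tree, fish_a is the closest ortholog, fish_b falls inside the minimal branch, and fish_c lies outside it and would be reported as an out-paralog.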

Detection of synteny blocks

We identify human-zebrafish synteny blocks as described in Kikuta et al. 2007[ref]. In brief, we base the synteny blocks on net alignments (see above) from the zebrafish genome to the human genome. Since neutrally evolving sequence typically cannot be aligned between the human and zebrafish genomes, many syntenic regions are divided over several alignments separated by large regions of unaligned sequence. The net alignment procedure bridges gaps to some degree. To also allow for inversions and other local rearrangements, so that synteny blocks are separated by macro-rearrangements rather than by smaller insertions and alignment gaps, we construct a graph based on the highest-scoring (level 1) net alignments, in which two alignments (nodes) are connected if 1) they are separated by <=150 kb in the zebrafish genome and <=450 kb in the human genome (note: Ancora uses 100 kb/300 kb as the joined_Net gap threshold), and 2) they are on the same strand. We then consider each connected component in the graph to be one synteny block. Where blocks overlap in the zebrafish genome, we keep the synteny block with the largest total amount of sequence aligned to the human genome.
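The graph construction can be sketched as follows. This is a minimal illustration rather than the pipeline itself: the `Net` record and its field names are hypothetical, the requirement that linked alignments share the same chromosomes is an added assumption, and the final step that resolves overlapping blocks is not shown. Connected components are found with a simple union-find.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Net:
    """One level 1 net alignment (hypothetical record layout)."""
    zf_chrom: str
    zf_start: int
    zf_end: int
    hs_chrom: str
    hs_start: int
    hs_end: int
    strand: str


def gap(s1, e1, s2, e2):
    """Distance between two intervals; 0 if they touch or overlap."""
    return max(0, max(s1, s2) - min(e1, e2))


def linked(a, b, max_zf_gap=150_000, max_hs_gap=450_000):
    """Edge rule: close enough in both genomes and on the same strand."""
    return (
        a.zf_chrom == b.zf_chrom      # assumption: same zebrafish chromosome
        and a.hs_chrom == b.hs_chrom  # assumption: same human chromosome
        and a.strand == b.strand
        and gap(a.zf_start, a.zf_end, b.zf_start, b.zf_end) <= max_zf_gap
        and gap(a.hs_start, a.hs_end, b.hs_start, b.hs_end) <= max_hs_gap
    )


def synteny_blocks(nets):
    """Group alignments into connected components of the 'linked' graph."""
    parent = list(range(len(nets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(nets)), 2):
        if linked(nets[i], nets[j]):
            parent[find(i)] = find(j)

    blocks = {}
    for i, net in enumerate(nets):
        blocks.setdefault(find(i), []).append(net)
    return list(blocks.values())
```

The pairwise comparison is quadratic in the number of alignments, which keeps the sketch short; sorting the alignments by position and comparing only nearby ones would be the obvious refinement for whole-genome input.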

Extraction of the minimal subtree from the Ensembl protein family tree

Gene trees containing human, mouse, chicken, frog and the five fish species were extracted from the Ensembl protein family trees. The extraction method consists of stepping iteratively through the Ensembl family tree from the query gene (i.e. human) towards the root, and stopping when the first set of Euteleostomi (bony vertebrate) species is included.

This strategy works well for most cases, since the teleost (bony fish) speciation occurred earlier in evolution than the mammalian speciation. However, because of the way Ensembl builds its protein family trees, a gene tree may have a topology that is inconsistent with the species tree if the alignment strongly supports it [ref]. As a result, the extracted subtree may not include all teleost fish genes if some of the fish genes are closer to the mammalian genes (by sequence distance). The protein family tree for Pax2 in Ensembl v46 was an example of such a case [ref]. Fortunately, Ensembl has fixed most such cases in later releases.
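The upward walk, and the caveat just described, can be illustrated with a short sketch. It does not use the Ensembl API: the `Node` class is hypothetical, and it assumes that every internal node carries a taxon label and that the walk stops at the first ancestor labelled Euteleostomi, which is one possible reading of the stopping rule.

```python
class Node:
    """Minimal gene-tree node with a taxon label (hypothetical structure)."""

    def __init__(self, name, taxon, children=()):
        self.name = name
        self.taxon = taxon
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def minimal_subtree(query, stop_taxon="Euteleostomi"):
    """Walk from the query leaf towards the root; stop at the first stop_taxon node."""
    node = query
    while node.parent is not None:
        node = node.parent
        if node.taxon == stop_taxon:
            return node
    return node  # root reached without meeting the stop taxon


def leaves(node):
    """Names of all leaves below a node, left to right."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaves(child)]


# Toy tree in which one fish gene groups with the mammals and another does not.
human = Node("human_gene", "Homo sapiens")
mouse = Node("mouse_gene", "Mus musculus")
mammals = Node("n1", "Eutheria", [human, mouse])
fish_a = Node("zebrafish_gene_a", "Danio rerio")
bony = Node("n2", "Euteleostomi", [mammals, fish_a])
fish_b = Node("zebrafish_gene_b", "Danio rerio")
root = Node("n3", "Euteleostomi", [bony, fish_b])

subtree = minimal_subtree(human)
```

Here the walk stops at n2, so the extracted subtree holds human_gene, mouse_gene and zebrafish_gene_a, while zebrafish_gene_b, attached one level higher, is left out. This mirrors the kind of inconsistency described for the Pax2 tree.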

Construction of the GRB evolutionary tree

First, the gene tree for the target gene (extracted from the Ensembl protein family tree using the method described above) is projected onto the ideal gene tree, which is based on the perfect whole-genome duplication model. From this projection we obtain the synteny regions containing the target gene in each fish genome (termed FSR_target) and the branches where the two duplicate orthologous copies should be located. If a bystander gene in the GRB has an ortholog in the FSR_target, it is placed at the first level on the same branch as the target ortholog; otherwise, it is placed at the second level on the branch where the closer ortholog (which is in FSR_target) is located.