Algorithms for genome comparison applied to bacterial genomes

Raluca Uricaru 1, 2
2 MAB - Méthodes et Algorithmes pour la Bioinformatique
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier
Abstract : With more than 1000 complete genomes available (among which, the vast majority come from bacteria), comparative genomic analysis become essential for the functional annotation of genomes, the understanding of their structure and evolution and have applications in phylogenomics or vaccine design. One of the main approaches for comparing genomes is by aligning their DNA sequences, i.e. whole genome alignment (WGA), which means identifying the similarity regions without any prior annotation knowledge. Despite the significant improvements during the last years, reliable tools for WGA and methodology for estimating its quality, in particular for bacterial genomes, still need to be designed. Besides their extremely large lengths that make classical dynamic programming alignment methods unsuitable, aligning whole genomes involves several additional difficulties, due to the mechanisms through which genomes evolve: the divergence, which let sequence similarity vanish over time, the re- ordering of genomic segments (rearrangements), or the acquisition of external genetic material generating regions that are unalignable between sequences, e.g. horizontal gene transfer, phages. Therefore, whole genome alignment tools implement heuristics, among which the most common is the anchor based strategy. It starts by detecting an initial set of similarity regions (phase 1), and, through a chaining phase (phase 2), selects a non-overlapping maximum-weighted, usually collinear, subset of those similarities, called anchors. Phases 1 and 2 are recursively applied on yet unaligned regions (phase 3). The last phase (phase 4) consists in systematically applying classical alignment tools to all short regions still left unaligned.This thesis addresses several problems related to whole genome alignment: the evaluation of the quality of results given by WGA tools and the improvement of the classical anchor based strategy. We first designed a protocol for evaluating the quality of alignment results, based on both computational and biological measures. An evaluation of the results given by two state of the art WGA tools on pairs of intra-species bacterial genomes revealed their shortcomings: the failure of detecting some of the similarities between sequences and the misalignment of some regions. Based on these results, which imply a lack in both sensitivity and specificity, we propose a novel, pair- wise whole genome alignment tool, YOC, implementing a simplified two-phase version of the anchor strategy. In phase 1, YOC improves sensitivity by using as anchors, for the first time, local similarities based on spaced seeds that are capable of detecting larger similarity regions in divergent sequences. This phase is followed by a chaining method adapted to local similarities, a novel type of collinear chaining, allowing for proportional overlaps. We give a formulation for this novel problem and provide the first algorithm for it. The algorithm, implementing a dynamic programming approach based on the sweep-line paradigm, is exact and runs in quadratic time. We show that, compared to classical collinear chaining, chaining with overlaps improves on real bacterial data, while remaining almost as efficient in practice. Our novel tool, YOC, is evaluated together with other four WGA tools on a dataset composed of 694 pairs of intra-species bacterial genomes. The results show that YOC improves on divergent cases by detecting more distant similarities and by avoiding misaligned regions. In conclusion, YOC should be easier to apply automatically and systematically to in- coming genomes, for it does not require a post-filtering step to detect misalignment and is less complex to calibrate.
Document type :
Theses
Complete list of metadatas

https://tel.archives-ouvertes.fr/tel-01084551
Contributor : Raluca Uricaru <>
Submitted on : Wednesday, November 19, 2014 - 2:59:11 PM
Last modification on : Friday, May 17, 2019 - 11:41:26 AM
Long-term archiving on : Friday, April 14, 2017 - 8:55:10 PM

Identifiers

  • HAL Id : tel-01084551, version 1

Collections

Citation

Raluca Uricaru. Algorithms for genome comparison applied to bacterial genomes. Bioinformatics [q-bio.QM]. Université Montpellier 2, 2010. English. ⟨tel-01084551⟩

Share

Metrics

Record views

252

Files downloads

148