Bioinformatics analysis and consensus ranking for biological high throughput data

Abstract : In the post-genomic era, bioinformatics approaches are increasingly important for answering biological questions. This thesis addresses two problems related to high-throughput data: large-scale bioinformatics analysis and the development of consensus ranking algorithms.

In molecular biology and genetics, RNA splicing is a modification of the nascent pre-messenger RNA (pre-mRNA) transcript in which introns are removed and exons are joined. The U2AF heterodimer has been well studied for its role in defining functional 3’ splice sites in pre-mRNA splicing, but several critical questions remain open, including the functional impact of its cancer-associated mutations. Through genome-wide analysis of U2AF-RNA interactions, we report that U2AF has the capacity to define ~88% of functional 3’ splice sites in the human genome. Numerous U2AF binding events also occur at other genomic locations, and metagene and minigene analyses suggest that upstream intronic binding events interfere with the immediate downstream 3’ splice site, which is associated either with the alternative exon, causing exon skipping, or with the competing constitutive exon, inducing inclusion of the alternative exon. We further build a U2AF65 scoring scheme for predicting its target sites from the high-throughput sequencing data using a Maximum Entropy machine learning method; the scores for the up- and down-regulated cases are consistent with our regulation model. These findings reveal the genomic function and regulatory mechanism of U2AF, which helps us understand the associated diseases.

Ranking biological data is a crucial need. Instead of developing new ranking methods, Cohen-Boulakia and her colleagues proposed generating a consensus ranking that highlights the common points of a set of rankings while minimizing their disagreements, in order to counter the noise and errors in biological data. However, under the Kendall-tau distance this is an NP-hard problem even for only four rankings. In this thesis, we propose a new variant of pivot algorithms named Consistent-Pivot. It uses a new strategy for selecting the pivot and assigning the other elements, and it outperforms previous pivot algorithms in both computation time and accuracy.
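To make the consensus ranking problem concrete, the following minimal sketch computes the Kendall-tau distance between two rankings and builds a consensus with a generic randomized pivot scheme. It is not the Consistent-Pivot algorithm of the thesis, whose pivot selection and assignment rules are not described in this abstract. The sketch assumes strict rankings without ties over the same set of elements; all function names and the example data are hypothetical.

```python
import random
from itertools import combinations


def kendall_tau_distance(r1, r2):
    """Count the element pairs that the two rankings order differently."""
    pos1 = {e: i for i, e in enumerate(r1)}
    pos2 = {e: i for i, e in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
    )


def total_distance(candidate, rankings):
    """Sum of Kendall-tau distances from a candidate to every input ranking."""
    return sum(kendall_tau_distance(candidate, r) for r in rankings)


def _pivot_sort(elements, positions, rng):
    # Generic pivot step (assumption: random pivot, majority vote placement).
    if len(elements) <= 1:
        return list(elements)
    pivot = rng.choice(elements)
    before, after = [], []
    for e in elements:
        if e == pivot:
            continue
        # Number of input rankings that place e ahead of the pivot.
        votes = sum(1 for pos in positions if pos[e] < pos[pivot])
        # A strict majority puts e before the pivot; even splits go after.
        (before if 2 * votes > len(positions) else after).append(e)
    return (
        _pivot_sort(before, positions, rng)
        + [pivot]
        + _pivot_sort(after, positions, rng)
    )


def pivot_consensus(rankings, seed=0):
    """Build a consensus ranking by recursive pivoting on majority votes."""
    positions = [{e: i for i, e in enumerate(r)} for r in rankings]
    return _pivot_sort(list(rankings[0]), positions, random.Random(seed))


# Hypothetical example: three rankings of four genes.
rankings = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["B", "A", "C", "D"],
]
consensus = pivot_consensus(rankings)
print(consensus, total_distance(consensus, rankings))
```

A pivot scheme of this kind is a heuristic: it avoids enumerating all permutations, which is what makes it attractive for an NP-hard problem, but it does not guarantee a minimum total distance in general.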
Contributor : Abes Star
Submitted on : Thursday, October 1, 2015 - 1:04:13 AM
Last modification on : Tuesday, April 24, 2018 - 1:39:07 PM
Long-term archiving on : Saturday, January 2, 2016 - 10:22:53 AM


  • HAL Id : tel-01207489, version 1



Bo Yang. Bioinformatics analysis and consensus ranking for biological high throughput data. Bioinformatics [q-bio.QM]. Université Paris Sud - Paris XI, 2014. English. ⟨NNT : 2014PA112250⟩. ⟨tel-01207489⟩


