Identifier les variations conduisant au cancer dans le génome non codant et du transcriptome

Abstract : Functional annotation of somatic mutations have been a consistent hotspot of cancer genomics studies. In the past, researchers preferentially focused on mutations in the coding fraction of the genome, for which ample bioinformatics tools were developed to distinguish cancer-driver mutations from neutral ones. In recent years, as an increasing number of variants were being identified as disease-associated in the non-coding genome, interpreting non-coding cancer mutations has become an urgent task. The completion of large scale projects such as ENCODE, has made functional interpretation of cancer variants achievable, and several programs were produced based on this functional information. However, there still exists some limitations as to these prediction tools, such as low prediction accuracy, lack of cancer mutation information and significant ascertainment bias. In chapter 2 of this thesis, in order to functionally interpret non-coding mutations in cancer, we developed two independent random forest models, referred to as SNP and SOM. Given a combination of features at a given genome positions, the SNP model predicts the expected fraction of rare SNPs (a measure of negative selection), and the SOM model predicts the expected mutation density at this position. We applied our two models to score these non-coding disease-associated clinvariant and HGMD variants and a set of random control SNPs. Results showed that disease-associated variants were scored higher than control SNPs with the SNP model and lower than control SNPs with the SOM model, supporting our hypothesis that purifying selection as measured by fraction of rare SNPs and mutation density is informative for the evaluation of the functional impact of cancer mutations in the non-coding genome. In the past, researchers have preferentially considered protein-coding genes as critical to the initiation and progression of cancers. However, recent evidences have shown that ncRNAs, in particular lncRNAs, are actively implicated in various cancer processes. A chapter of this thesis is devoted to this class of non-coding transcripts. Similar to protein coding genes, there might be a large number of lncRNAs with cancer-driving functions. The development of bioinformatics tools to prioritize them has become a new focus of research for computational oncologists.The last part of this thesis is devoted to the implementation of methods for discovering potential cancer-driving non-coding elements in lncRNA and protein-coding genes. We applied three scoring tools, CADD, funSeq2, GWAVA, together with our SNP and SOM scoring systems to prioritize cancer-associated elements using a permutation-based algorithm. For each locus, we compute the average score of all observed variants using one of the models, and we randomly take the same number of variants and compute their average score 1 million times to form a null distribution and obtain a P value for this locus. To validate our hypothesis and permutation model, we tested this system on 61 cancer-related lncRNA and 452 cancer genes using somatic mutation data from liver cancer, lung cancer, CLL and melanoma. We observed that both cancer lncRNAs and protein-coding genes had significantly lower average P values than total lncRNAs and protein-coding genes in all cases. Applying the permutation test to lncRNAs with five different scoring systems enabled us to prioritize hundreds to thousands of cancer-related lncRNA candidates. These candidates can be used for future experimental validation.
Document type :
Theses
Complete list of metadatas

https://tel.archives-ouvertes.fr/tel-01280751
Contributor : Abes Star <>
Submitted on : Tuesday, March 1, 2016 - 10:03:17 AM
Last modification on : Thursday, July 18, 2019 - 2:54:09 PM
Long-term archiving on : Thursday, June 2, 2016 - 10:43:18 AM

Identifiers

  • HAL Id : tel-01280751, version 1

Citation

Jia Li. Identifier les variations conduisant au cancer dans le génome non codant et du transcriptome. Bio-informatique [q-bio.QM]. Université Paris-Saclay, 2015. Français. ⟨NNT : 2015SACLS161⟩. ⟨tel-01280751⟩

Share

Metrics

Record views

714

Files downloads

544