Spatial Clustering of Linkage Disequilibrium blocks for Genome-Wide Association Studies

Abstract : With the recent development of high-throughput genotyping technologies, Genome-Wide Association Studies (GWAS) have become widespread in genetic research. By screening large portions of the genome, these studies aim to characterize the genetic factors involved in the development of complex genetic diseases. GWAS rely on the existence of statistical dependencies, called Linkage Disequilibrium (LD), which are usually observed between nearby loci on DNA. LD is defined as the non-random association of alleles at different loci, on the same chromosome or on different chromosomes, in a population. This biological feature is of fundamental importance in association studies, as it allows unobserved causal mutations to be finely located using adjacent genetic markers. Nevertheless, the complex block structure induced by LD and the large volume of genetic data are key issues that have arisen with GWAS. The contributions presented in this manuscript are twofold: methodological and algorithmic. On the methodological side, we propose a three-step approach that explicitly takes advantage of the grouping structure induced by LD in order to identify common variants that may have been missed by single-marker analyses. In the first step, we perform a hierarchical clustering of SNPs with an adjacency constraint, using LD as a similarity measure. In the second step, we apply a model selection approach to the resulting hierarchy in order to define LD blocks. Finally, we perform Group Lasso regression on the inferred LD blocks. The efficiency of the proposed approach is compared with that of state-of-the-art regression methods on simulated, semi-simulated and real GWAS data. On the algorithmic side, we focus on the spatially-constrained hierarchical clustering algorithm, whose quadratic time complexity is ill-suited to the high dimensionality of GWAS data. We therefore present an efficient implementation of this algorithm that works with any similarity measure. By introducing a user parameter h and using a min-heap structure, we obtain a sub-quadratic time complexity for the adjacency-constrained hierarchical clustering algorithm, as well as a space complexity that is linear in the number of items to be clustered. The interest of this novel algorithm is illustrated in GWAS applications.
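
The idea of adjacency-constrained hierarchical clustering driven by a min-heap can be illustrated with a minimal Python sketch. It is not the implementation described in the thesis: the function name `adjacent_clustering`, the average-dissimilarity linkage and the toy similarity matrix are assumptions made for illustration. The sketch also omits the parameter h, so it does not reach the sub-quadratic time or linear space bounds stated in the abstract.

```python
import heapq


def adjacent_clustering(sim):
    """Agglomerative clustering in which only neighbouring clusters may merge.

    sim: symmetric n x n similarity matrix (list of lists), values in [0, 1].
    Returns the merges in order, as (left_members, right_members, cost).
    """
    n = len(sim)
    members = {i: [i] for i in range(n)}  # cluster id -> ordered item indices
    left = {i: (i - 1 if i > 0 else None) for i in range(n)}
    right = {i: (i + 1 if i < n - 1 else None) for i in range(n)}

    def cost(a, b):
        # Assumed linkage: average dissimilarity (1 - similarity) between clusters.
        pairs = [(i, j) for i in members[a] for j in members[b]]
        return sum(1.0 - sim[i][j] for i, j in pairs) / len(pairs)

    # The heap holds one candidate merge per pair of adjacent clusters.
    heap = [(cost(i, i + 1), i, i + 1) for i in range(n - 1)]
    heapq.heapify(heap)

    merges = []
    next_id = n
    while heap:
        c, a, b = heapq.heappop(heap)
        if a not in members or b not in members:
            continue  # stale entry: one of the clusters was merged already
        merges.append((members[a], members[b], c))
        new = next_id
        next_id += 1
        members[new] = members.pop(a) + members.pop(b)
        l, r = left.pop(a), right.pop(b)
        left.pop(b)
        right.pop(a)
        left[new], right[new] = l, r
        # Only the two new neighbouring pairs need to be pushed on the heap.
        if l is not None:
            right[l] = new
            heapq.heappush(heap, (cost(l, new), l, new))
        if r is not None:
            left[r] = new
            heapq.heappush(heap, (cost(new, r), new, r))
    return merges


# Toy similarity matrix with two blocks: items {0, 1} and items {2, 3}.
toy_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
for left_part, right_part, merge_cost in adjacent_clustering(toy_sim):
    print(left_part, right_part, round(merge_cost, 3))
```

Each merge creates a new cluster identifier, so a heap entry is valid exactly when both of its clusters still exist. Outdated entries are skipped when popped rather than removed from the heap.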
Document type :
Theses
Cited literature [163 references]

https://tel.archives-ouvertes.fr/tel-01288568
Contributor : Alia Dehman
Submitted on : Tuesday, March 15, 2016 - 12:21:30 PM
Last modification on : Friday, July 20, 2018 - 11:12:58 AM

Identifiers

  • HAL Id : tel-01288568, version 1

Citation

Alia Dehman. Spatial Clustering of Linkage Disequilibrium blocks for Genome-Wide Association Studies. Statistics [stat]. Université d'Evry Val d'Essonne; Université Paris-Saclay; Laboratoire de Mathématiques et Modélisation d'Evry, 2015. English. ⟨NNT : 2015SACLE013⟩. ⟨tel-01288568⟩

Metrics

Record views : 985
File downloads : 1036