
Scaling, properties and quality of consensus ranking algorithms for massive biological data

Abstract : Biologists and physicians regularly query public biological databases, for example when looking for the genes most strongly associated with a given disease. The chosen keywords are particularly important: synonymous reformulations of the same disease (for example "breast cancer" and "breast carcinoma") may lead to very different rankings of (thousands of) genes. The genes, sorted by relevance, can be tied (equal importance with respect to the disease). Additionally, some genes returned for one synonym may be absent for another. Such rankings are called "incomplete rankings with ties". The challenge is to combine the information provided by these different rankings of genes. The problem of taking a list of rankings as input and returning a so-called consensus ranking, as close as possible to the input rankings, is called the "rank aggregation problem". This problem is known to be NP-hard. Whereas most works focus on complete rankings without ties, we consider incomplete rankings with ties. Our contributions are divided into three parts. First, we have designed a graph-based heuristic able to divide the initial problem into independent sub-problems in the context of incomplete rankings with ties. Second, we have designed an algorithm able to identify what all the optimal consensus rankings have in common, which provides information about the robustness of the returned consensus ranking. An experimental study on a large number of massive biological datasets has highlighted the biological relevance of these approaches. Third, we have designed a parameterized model able to consider various interpretations of missing data. We have also designed several algorithms for this model and carried out an axiomatic study of it, based on social choice theory.
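As an illustration of the rank aggregation problem described in the abstract, the following minimal sketch computes a consensus by exhaustive search. It makes several simplifying assumptions that do not come from the thesis: the rankings are complete and without ties, closeness is measured by the classical Kendall-tau distance (number of pairwise disagreements), and the element labels "A" to "D" and all function names are hypothetical. The exhaustive search is only feasible for a handful of elements, which is consistent with the problem being NP-hard.

```python
from itertools import combinations, permutations


def kendall_tau(r1, r2):
    """Count the pairs of elements that two complete, tie-free rankings order differently."""
    pos1 = {e: i for i, e in enumerate(r1)}
    pos2 = {e: i for i, e in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def kemeny_score(candidate, rankings):
    """Sum of Kendall-tau distances from a candidate ranking to every input ranking."""
    return sum(kendall_tau(candidate, r) for r in rankings)


def brute_force_consensus(rankings):
    """Try every permutation and keep one with minimal score (illustration only)."""
    elements = sorted(rankings[0])
    return min(permutations(elements), key=lambda c: kemeny_score(c, rankings))


# Hypothetical input: three rankings of the same four elements.
rankings = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["B", "A", "C", "D"],
]
consensus = brute_force_consensus(rankings)
print(consensus, kemeny_score(consensus, rankings))
```

The setting studied in the thesis (incomplete rankings with ties) needs a distance that also accounts for tied and missing elements, which this sketch deliberately leaves out.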
Contributor : Abes Star
Submitted on : Monday, September 6, 2021 - 10:12:10 AM
Last modification on : Wednesday, September 8, 2021 - 3:32:49 AM


Version validated by the jury (STAR)


  • HAL Id : tel-03335281, version 1


Pierre Andrieu. Passage à l'échelle, propriétés et qualité des algorithmes de classements consensuels pour les données biologiques massives. Bio-informatique [q-bio.QM]. Université Paris-Saclay, 2021. Français. ⟨NNT : 2021UPASG041⟩. ⟨tel-03335281⟩


