Data Integration in the Life Sciences: Scientific Workflows, Provenance, and Ranking

Sarah Cohen-Boulakia 1, 2, 3
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
2 VIRTUAL PLANTS - Modeling plant morphogenesis at different scales, from genes to phenotype
CRISAM - Inria Sophia Antipolis - Méditerranée , INRA - Institut National de la Recherche Agronomique, Centre de coopération internationale en recherche agronomique pour le développement [CIRAD] : UMR51
Abstract: Biological research is a science which derives its findings from the proper analysis of experiments. Today, a large variety of experiments are carried out in hundreds of labs around the world, and their results are reported in a myriad of different databases, websites, publications, etc., using different formats, conventions, and schemas. Providing uniform access to these diverse and distributed databases is the aim of data integration solutions, which have been designed and implemented within the bioinformatics community for more than 20 years. However, the perception of the data integration problem in the life sciences has changed: while early approaches concentrated on handling schema-dependent queries over heterogeneous and distributed databases, current research emphasizes instances rather than schemas, tries to place the human back into the loop, and intertwines data integration and data analysis. Transparency, that is, providing users with the illusion that they are using a centralized database and thus completely hiding the original databases, was one of the main goals of federated databases. It is no longer a target. Instead, users want to know exactly which data from which source was used in which way in studies (provenance). The old model of "first integrate, then analyze" is replaced by a new, process-oriented paradigm: "integration is analysis - and analysis is integration". This paradigm change gives rise to some important research trends. First, the process of integration itself, i.e., the integration workflow, is becoming a research topic in its own right. Scientific workflows actually implement the paradigm "integration is analysis". A second trend is the growing importance of sensible ranking: as data sets keep growing, it becomes increasingly difficult for the biologist user to distinguish relevant data within large and noisy data sets.
This HDR thesis outlines my contributions to the field of data integration in the life sciences. More precisely, my work takes place in the first two contexts mentioned above, namely, scientific workflows and biological data ranking. The reported results were obtained from 2005 to late 2014, first as a postdoctoral fellow at the University of Pennsylvania (Dec 2005 to Aug 2007) and then as an Associate Professor at Université Paris-Sud (LRI, UMR CNRS 8623, Bioinformatics team) and Inria (Saclay-Ile-de-France, AMIB team 2009-2014).

Cited literature: 68 references
Contributor: Sarah Cohen-Boulakia
Submitted on: Wednesday, 16 December 2015 - 21:38:01
Last modified on: Tuesday, 10 October 2017 - 13:48:08
Document(s) archived on: Thursday, 17 March 2016 - 17:00:35


Distributed under a Creative Commons Attribution 4.0 International License


  • HAL Id : tel-01245229, version 1


Sarah Cohen-Boulakia. Data Integration in the Life Sciences: Scientific Workflows, Provenance, and Ranking. Bioinformatics [q-bio.QM]. Université Paris-Sud, 2015. 〈tel-01245229〉


