Data Integration in the Life Sciences: Scientific Workflows, Provenance, and Ranking - TEL - Thèses en ligne
HDR, Year: 2015

Data Integration in the Life Sciences: Scientific Workflows, Provenance, and Ranking

Abstract

Biological research derives its findings from the proper analysis of experiments. Today, a large variety of experiments are carried out in hundreds of labs around the world, and their results are reported in a myriad of databases, websites, and publications, using different formats, conventions, and schemas. Providing uniform access to these diverse and distributed databases is the aim of data integration solutions, which have been designed and implemented within the bioinformatics community for more than 20 years. However, the perception of the data integration problem in the life sciences has changed: while early approaches concentrated on handling schema-dependent queries over heterogeneous and distributed databases, current research emphasizes instances rather than schemas, tries to place the human back into the loop, and intertwines data integration and data analysis. Transparency, that is, providing users with the illusion that they are using a centralized database and thus completely hiding the original databases, was one of the main goals of federated databases. It is no longer a target. Instead, users want to know exactly which data from which source was used in which way in studies (Provenance). The old model of "first integrate, then analyze" is replaced by a new, process-oriented paradigm: "integration is analysis - and analysis is integration". This paradigm change gives rise to some important research trends. First, the process of integration itself, i.e., the integration workflow, is becoming a research topic in its own right. Scientific workflows actually implement the paradigm "integration is analysis". A second trend is the growing importance of sensible ranking: as data sets keep growing, it becomes increasingly difficult for the biologist user to distinguish relevant data within large and noisy data sets. This HDR thesis outlines my contributions to the field of data integration in the life sciences.
More precisely, my work takes place in the two contexts mentioned above, namely scientific workflows and biological data ranking. The reported results were obtained from 2005 to late 2014, first as a postdoctoral fellow at the University of Pennsylvania (Dec 2005 to Aug 2007) and then as an Associate Professor at Université Paris-Sud (LRI, UMR CNRS 8623, Bioinformatics team) and Inria (Saclay-Ile-de-France, AMIB team, 2009-2014).
Main file
cohenboulakiaHDR.pdf (4.69 MB)

Dates and versions

tel-01245229, version 1 (16-12-2015)

License

Attribution

Identifiers

  • HAL Id: tel-01245229, version 1

Cite

Sarah Cohen-Boulakia. Data Integration in the Life Sciences: Scientific Workflows, Provenance, and Ranking. Bioinformatics [q-bio.QM]. Université Paris-Sud, 2015. ⟨tel-01245229⟩
991 Views
254 Downloads