
Issues of textual data warehouses: Dr Warehouse and translational research on rare diseases

Abstract : The reuse of clinical data for research has become widespread with the development of clinical data warehouses. These warehouses are designed to integrate and explore structured data linked to thesauri. Such data come mainly from automated sources (biology, genetics, cardiology, etc.) but also from manually completed data entry forms. Care delivery also produces large amounts of textual data: hospital reports (hospitalization, surgery, imaging, pathology, etc.) and free-text fields in electronic forms. This mass of data, little used by conventional warehouses, is an indispensable source of information in the context of rare diseases. Free text makes it possible to describe a patient's clinical picture more precisely, and to express the absence of signs as well as uncertainty. For patients who are still undiagnosed in particular, the physician describes the medical history outside any nosological framework. This wealth of information makes clinical text a valuable source for translational research. Exploiting it, however, requires appropriate algorithms and tools that enable optimized reuse by physicians and researchers. In this thesis we present a data warehouse centered on the clinical document, which we modeled, implemented and evaluated. Through three use cases for translational research in the context of rare diseases, we attempted to address the problems inherent in textual data: (i) patient recruitment through a search engine adapted to textual data (negation and family history detection), (ii) automated phenotyping from textual data, and (iii) diagnosis by similarity between patients based on phenotyping. We evaluated these methods on the Necker-Enfants Malades data warehouse, created and populated during this thesis, which integrates about 490,000 patients and 4 million reports. These methods and algorithms were integrated into the Dr Warehouse software, developed during the thesis and distributed as open source since September 2017.

Cited literature : 375 references
Contributor : Abes Star
Submitted on : Monday, May 20, 2019 - 4:17:08 PM
Last modification on : Tuesday, May 11, 2021 - 8:21:23 PM


Version validated by the jury (STAR)


  • HAL Id : tel-02134609, version 1


Nicolas Garcelon. Problématique des entrepôts de données textuelles : dr Warehouse et la recherche translationnelle sur les maladies rares. Base de données [cs.DB]. Université Sorbonne Paris Cité, 2017. Français. ⟨NNT : 2017USPCB257⟩. ⟨tel-02134609⟩


