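Anonymisation de documents cliniques : performances et limites des méthodes symboliques et par apprentissage statistique (De-identification of clinical documents: performance and limits of symbolic and statistical learning methods)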

Abstract: This work addresses the automatic de-identification of clinical records. De-identification consists of concealing personal information in documents while preserving clinical data. It is a prerequisite for using clinical records outside the patient care process, whether for case study publications or for scientific research (building automatic systems to process the documents, searching for similar cases, etc.). We defined 12 categories of information to de-identify, covering nominative data (last names, first names, etc.) and numerical data (ages, dates, zip codes, etc.). Two approaches were used to de-identify the documents: an expert-knowledge-based method using regular expressions and lexical mapping, and a machine-learning method based on conditional random fields (CRF). We ran several experiments, using each approach separately or in combination. We achieved our best results (overall F-measure = 0.922) by combining both approaches and merging the last-name and first-name categories into a single one (recall = 0.953 and F-measure = 0.931 on this category). This work also produced several resources: annotation guidelines, a gold standard corpus of 562 documents, 100 of which were double-annotated with adjudication and inter-annotator agreement computation (K = 0.807 before merging), and a de-identified corpus of 17,000 clinical records.
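
To make the rule-based side of the abstract concrete, here is a minimal sketch of de-identification with regular expressions and a lexicon lookup. It is not the thesis's implementation: the `FIRST_NAMES` lexicon, the two patterns (dd/mm/yyyy dates, five-digit zip codes), the placeholder labels, and the function name `deidentify` are all assumptions chosen for illustration, and only three of the 12 categories are touched.

```python
import re

# Hypothetical first-name lexicon; the actual lexical resources are not given in the abstract.
FIRST_NAMES = {"marie", "pierre", "jean"}

# Illustrative patterns only (assumed formats, not the thesis's rules).
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),  # e.g. 12/03/2011
    ("ZIP", re.compile(r"\b\d{5}\b")),                   # e.g. 75013
]


def deidentify(text):
    """Replace personal information with category placeholders such as <DATE>."""

    def mask_name(match):
        # Lexical mapping: mask a word only if it appears in the lexicon.
        word = match.group(0)
        return "<FIRSTNAME>" if word.lower() in FIRST_NAMES else word

    # Step 1: look up every run of letters in the first-name lexicon.
    text = re.sub(r"[^\W\d_]+", mask_name, text)
    # Step 2: apply each numerical pattern in turn.
    for label, pattern in PATTERNS:
        text = pattern.sub(f"<{label}>", text)
    return text


print(deidentify("Marie Dupont, seen on 12/03/2011, lives in 75013 Paris."))
```

In this sketch the last name "Dupont" is left untouched because it is absent from the toy lexicon. A missed entity of this kind lowers recall, the measure the abstract reports alongside F-measure, and it is the sort of case where a statistical model such as a CRF could complement hand-written rules.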

Cited literature: 113 references

https://tel.archives-ouvertes.fr/tel-00848672
Contributor : Cyril Grouin
Submitted on : Saturday, July 27, 2013 - 10:34:39 AM
Last modification on : Monday, September 16, 2019 - 11:45:19 AM
Long-term archiving on: Monday, October 28, 2013 - 2:40:12 AM

Identifiers

  • HAL Id : tel-00848672, version 1

Citation

Cyril Grouin. Anonymisation de documents cliniques : performances et limites des méthodes symboliques et par apprentissage statistique. Bioinformatics [q-bio.QM]. Université Pierre et Marie Curie - Paris VI, 2013. In French. ⟨tel-00848672⟩

Metrics

Record views: 1059
File downloads: 3759