
Extraction d'informations sur la régulation transcriptionnelle de gènes à partir d'articles biomédicaux 2008

Abstract : Charting transcriptionally regulated gene networks and gathering the related molecular mechanisms are important issues for biologists. The molecular biology literature is a very rich mine of experimental information that encompasses the current state of knowledge in the gene expression domain. However, due to its tremendous size, automated methods must be devised in order to explore these data in a systematic way. In this thesis, we propose a set of methods for mining the molecular biology literature and extracting relevant facts about the regulation of human gene expression. We first present a generic methodology to extract potential named entities from texts. It combines rule-based identification of noun phrases as candidate named entities with matching against manually cleaned dictionaries from public sources. Domain-specific disambiguation techniques are also reported, which help classify the true nature of an identified named entity. We then detail a procedure for retrieving both the relevant relationships between named entities and their associated features, using deep syntactic analysis and predicate-argument structures. We show that the acquisition of semantics from syntax can be split into several distinct phases so as to lessen the labour usually associated with the design of domain-specific extraction rules. Finally, the performance of the system is evaluated using an annotated corpus of specialized full-text publications. The results are promising: despite the heterogeneous nature of the information to retrieve from the data set, the system exhibits homogeneous and highly scalable performance.

Cited literature : 113 references
Contributor : Pascale Kuntz
Submitted on : Monday, May 10, 2010 - 11:14:46 AM
Last modification on : Friday, October 23, 2020 - 4:51:47 PM
Long-term archiving on : Thursday, September 16, 2010 - 1:39:26 PM


  • HAL Id : tel-00481403, version 1



Julien Lorec. Extraction d'informations sur la régulation transcriptionnelle de gènes à partir d'articles biomédicaux 2008. Computer Science [cs]. Université de Nantes, 2008. French. ⟨tel-00481403⟩


