Exploration d'approches statistiques pour le résumé automatique de texte (Exploration of statistical approaches to automatic text summarization)

Abstract : A summary is a condensed restatement of a text. It should express the essential content of a document in as few words as possible. Its purpose is to help readers locate information that may be of interest without having to read the entire document. Why do we need so many summaries? Simply because we do not have enough time and energy to read everything. The mass of textual information in electronic format keeps increasing, whether on the Internet or in private networks. This growing volume of available documents makes it difficult to access the desired information without specific tools. Producing a summary is a very complex task because it requires linguistic knowledge as well as world knowledge, both of which remain very difficult to build into an automated system. In this Ph.D. thesis, we explore the issue of automatic text summarization through three statistical approaches, each designed to handle a different task.

We first propose an efficient strategy for summarizing documents in a specialized domain, organic chemistry. We present its implementation, named YACHS (Yet Another Chemistry Summarizer), which combines specific document pre-processing with a sentence scoring method relying on the statistical properties of documents. Next, we propose an approach to topic-oriented multi-document text summarization. We detail the adjustments made to the generic text summarization system Cortex and evaluate our method on the DUC evaluation data. Results obtained by the LIA during the DUC 2006 and DUC 2007 campaigns are discussed. Finally, two approaches for the update summarization task are introduced. We evaluate the first, named maximisation-minimisation, by participating in the pilot task of the DUC 2007 campaign. The second approach is based on Maximal Marginal Relevance (MMR) and is assessed through two submissions to the TAC 2008 summarization task.
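
To make the MMR criterion mentioned above concrete, here is a minimal sketch of generic MMR sentence selection. It is an illustration only and not the system described in the thesis: the whitespace tokenizer, the bag-of-words cosine similarity, the function names (`bag_of_words`, `cosine`, `mmr_select`) and the trade-off parameter `lam` are all assumptions made for this example. At each step the sentence maximizing `lam * sim(sentence, query) - (1 - lam) * max sim(sentence, already selected)` is added to the summary.

```python
import math
from collections import Counter


def bag_of_words(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(sentences, query, k=2, lam=0.7):
    # Greedy MMR: trade off relevance to the query against
    # redundancy with the sentences already selected.
    q = bag_of_words(query)
    vecs = [bag_of_words(s) for s in sentences]
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], q)
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the cat sat on the mat",
        "dogs chase the cat",
    ]
    # The exact duplicate is penalized for redundancy and skipped.
    print(mmr_select(docs, "cat mat", k=2, lam=0.5))
```

With `lam` close to 1 the selection behaves like plain relevance ranking; lowering it penalizes sentences that repeat what has already been selected, which is the property that makes this family of criteria attractive for update summarization.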
Document type :
Theses

https://tel.archives-ouvertes.fr/tel-00419469
Contributor : Florian Boudin
Submitted on : Thursday, September 24, 2009 - 12:01:08 AM
Last modification on : Wednesday, April 17, 2019 - 12:15:34 PM
Long-term archiving on : Tuesday, October 16, 2012 - 11:15:50 AM

Identifiers

  • HAL Id : tel-00419469, version 1

Citation

Florian Boudin. Exploration d'approches statistiques pour le résumé automatique de texte. Interface homme-machine [cs.HC]. Université d'Avignon, 2008. Français. ⟨tel-00419469⟩
