Approches textuelles pour la catégorisation et la recherche de documents manuscrits en-ligne

Abstract : With recent technical advances, pen-based input devices have become very popular, and large amounts of on-line handwritten data are being created as a result. Algorithms for the efficient storage and retrieval of on-line data, represented as a temporal sequence of (x,y) coordinates, are therefore increasingly in demand. This thesis addresses the problem of accessing textual information in on-line handwritten documents. The overall goal of the work is the design of a system for text categorization and retrieval. To validate the proposed methods, we collected a benchmark collection of handwritten documents. Because an on-line handwriting recognition engine is the common component of all our approaches, part of the work focuses on the impact of handwriting recognition errors. We address document categorization by feeding the output of a handwriting recognition system into a text categorization engine based on machine learning algorithms. We also develop two retrieval algorithms. The first combines different approaches for retrieving handwritten documents. Our hypothesis is that different retrieval algorithms retrieve different sets of documents for the same query, so combining them can be expected to improve retrieval performance. The second algorithm is based on the topical relationships between documents: if closely associated documents tend to be relevant to the same queries, then topically related documents should be assigned similar retrieval scores.
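The first retrieval idea in the abstract, combining the results of several retrieval approaches for the same query, can be illustrated with a small sketch. The abstract does not say which combination rule the thesis uses, so the example below assumes one common choice, summing min-max normalized scores (often called CombSUM). The function names, document identifiers, and scores are all hypothetical.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale one system's scores to [0, 1] so systems are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All documents tie; give them the same normalized score.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs):
    """Fuse several {doc_id: score} runs by summing normalized scores.

    Assumption: a simple sum is used here for illustration only; it is
    not claimed to be the combination method of the thesis.
    """
    fused = defaultdict(float)
    for run in runs:
        for doc, s in min_max_normalize(run).items():
            fused[doc] += s
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from two retrieval systems for the same query.
run_a = {"doc1": 2.0, "doc2": 1.0, "doc3": 0.0}
run_b = {"doc2": 10.0, "doc4": 5.0, "doc3": 0.0}
ranking = comb_sum([run_a, run_b])
```

In this toy case, a document retrieved by both systems (doc2) rises above one that only a single system ranks first (doc1), which is the kind of effect the hypothesis in the abstract relies on.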
Document type :
Theses

Cited literature [146 references]

https://tel.archives-ouvertes.fr/tel-00483684
Contributor : Sebastián Peña Saldarriaga
Submitted on : Monday, May 17, 2010 - 11:05:15 AM
Last modification on : Monday, October 19, 2020 - 11:03:31 AM
Long-term archiving on : Thursday, September 16, 2010 - 2:12:00 PM

Identifiers

  • HAL Id : tel-00483684, version 1

Citation

Sebastián Peña Saldarriaga. Approches textuelles pour la catégorisation et la recherche de documents manuscrits en-ligne. Informatique [cs]. Université de Nantes, 2010. Français. ⟨tel-00483684⟩

Metrics

Record views : 268

File downloads : 1757