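Extraction d'information dans des documents manuscrits non contraints : application au traitement automatique des courriers entrants manuscrits [Information extraction from unconstrained handwritten documents: application to the automatic processing of incoming handwritten mail]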

Abstract: Despite the advent of the digital era, large numbers of handwritten documents continue to be exchanged, forcing companies and administrations to process them in bulk. Processing these documents automatically requires access to a part of their content that is relevant but unknown in advance, and involves three key tasks: segmenting the document into relevant entities, recognizing those entities, and rejecting irrelevant ones. In contrast to traditional approaches (full document reading or keyword detection), these three tasks are carried out in parallel, leading to an information extraction approach. The first contribution of this work is the design of a generic text line model for information extraction and the implementation of a complete system based on Hidden Markov Models (HMMs) constrained by this model. In a single pass, the recognition module discriminates relevant information, characterized by a set of alphabetic, numeric or alphanumeric queries, from irrelevant information, characterized by a filler model. The second contribution is an improvement of local frame discrimination using a deep neural network, which infers high-level representations of the frames and thus automates feature extraction. The result is a complete, generic system suited to industrial use, which addresses emerging needs in automatic handwritten document reading: the extraction of complex information from unconstrained documents.
Cited literature: 215 references

https://tel.archives-ouvertes.fr/tel-00863502
Contributor: Clément Chatelain
Submitted on: Wednesday, March 14, 2018 - 3:17:12 PM
Last modification on: Monday, October 19, 2020 - 10:59:30 AM
Long-term archiving on: Tuesday, September 4, 2018 - 3:36:10 PM

File

fichier-final.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : tel-00863502, version 2

Citation

Simon Thomas. Extraction d'information dans des documents manuscrits non contraints : application au traitement automatique des courriers entrants manuscrits. Traitement du signal et de l'image [eess.SP]. Université de Rouen, 2012. Français. ⟨tel-00863502v2⟩

Metrics

Record views: 129
File downloads: 131