Historical document image analysis : a structural approach based on texture

Abstract : Over the last few years, the digitization of cultural heritage document collections has grown tremendously. This growth has raised many challenges and open issues, such as information retrieval in digital libraries and the analysis of page content in historical books. A need has recently emerged for a computer-aided characterization and categorization tool able to index or group digitized historical book pages according to several criteria, mainly the layout structure and/or the typographic/graphical characteristics of the document image content. The work conducted in this thesis therefore presents an automatic approach for the characterization and categorization of historical book pages. The proposed approach is applicable to a large variety of ancient books and assumes no a priori knowledge of document image layout or content. It uses texture and graph algorithms to provide a rich and holistic description of the layout and content of the analyzed book pages. The categorization is based on the characterization of the digitized page content by texture, shape, geometric and topological descriptors. This characterization is represented by a structural signature. More precisely, the signature-based characterization approach consists of two main stages. The first stage extracts homogeneous regions. The second builds a graph-based page signature from the extracted homogeneous regions, reflecting the layout and content of the page. Afterwards, by comparing the resulting graph-based signatures using a graph-matching paradigm, the similarities of digitized historical book page layout and/or content can be deduced.
Subsequently, book pages with similar layout and/or content can be categorized and grouped, and a table of contents/summary of the analyzed digitized historical book can be provided automatically. As a consequence, numerous signature-based applications (e.g. information retrieval in digital libraries according to several criteria, page categorization) can be implemented to manage a corpus or collection of books effectively. To illustrate the effectiveness of the proposed page signature, a detailed experimental evaluation has been conducted to assess two possible categorization applications: unsupervised page classification and page stream segmentation. In addition, the different steps of the proposed approach have been evaluated on a large variety of historical document images.
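
The two-stage idea described in the abstract (regions first, then a graph-based signature compared across pages) can be illustrated with a minimal sketch. It is not the method of the thesis: the `Region` fields, the `radius` neighbourhood rule, the `missing_penalty` value and the greedy node matching are all assumptions chosen for brevity, and the greedy matching is only a crude stand-in for a real graph-matching algorithm. Regions are assumed to be already extracted and described by a centroid and a fixed-length texture vector.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical homogeneous region: centroid in normalized page
    coordinates plus a fixed-length texture descriptor."""
    centroid: Tuple[float, float]
    texture: Tuple[float, ...]


Signature = Tuple[List[Region], Set[Tuple[int, int]]]


def build_signature(regions: List[Region], radius: float = 0.6) -> Signature:
    """Nodes are regions; an edge links two regions whose centroids
    lie within `radius` of each other (an assumed adjacency rule)."""
    edges = {
        (i, j)
        for i in range(len(regions))
        for j in range(i + 1, len(regions))
        if dist(regions[i].centroid, regions[j].centroid) <= radius
    }
    return list(regions), edges


def node_cost(a: Region, b: Region) -> float:
    """Dissimilarity of two regions: position gap plus texture gap."""
    return dist(a.centroid, b.centroid) + dist(a.texture, b.texture)


def signature_distance(sig_a: Signature, sig_b: Signature,
                       missing_penalty: float = 1.0) -> float:
    """Greedy node matching followed by an edge-disagreement penalty."""
    nodes_a, edges_a = sig_a
    nodes_b, edges_b = sig_b
    # Consider candidate node pairs from cheapest to most expensive.
    pairs = sorted(
        (node_cost(a, b), i, j)
        for i, a in enumerate(nodes_a)
        for j, b in enumerate(nodes_b)
    )
    mapping = {}
    used_b = set()
    total = 0.0
    for cost, i, j in pairs:
        if i not in mapping and j not in used_b:
            mapping[i] = j
            used_b.add(j)
            total += cost
    # Penalize regions left without a counterpart on either page.
    unmatched = (len(nodes_a) - len(mapping)) + (len(nodes_b) - len(used_b))
    total += missing_penalty * unmatched
    # Carry the edges of A over to B's indices and count disagreements.
    mapped_edges = set()
    dangling = 0
    for i, j in edges_a:
        if i in mapping and j in mapping:
            mapped_edges.add(tuple(sorted((mapping[i], mapping[j]))))
        else:
            dangling += 1
    total += missing_penalty * (len(mapped_edges ^ edges_b) + dangling)
    return total


# Toy pages: two text-like regions versus one text-like and one
# illustration-like region (texture values are made up).
text_page = [Region((0.5, 0.2), (0.1, 0.9)), Region((0.5, 0.7), (0.1, 0.9))]
same_layout = [Region((0.5, 0.2), (0.1, 0.9)), Region((0.5, 0.7), (0.1, 0.9))]
illustrated = [Region((0.5, 0.2), (0.1, 0.9)), Region((0.5, 0.7), (0.8, 0.2))]

d_same = signature_distance(build_signature(text_page), build_signature(same_layout))
d_diff = signature_distance(build_signature(text_page), build_signature(illustrated))
```

In this sketch, pages whose signatures are close under `signature_distance` would be grouped together, which is the role the abstract assigns to signature comparison before page categorization.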
Document type :
Theses

https://tel.archives-ouvertes.fr/tel-01280118
Contributor : Abes Star
Submitted on : Monday, February 29, 2016 - 9:58:05 AM
Last modification on : Tuesday, April 30, 2019 - 11:19:29 AM
Long-term archiving on : Monday, May 30, 2016 - 3:55:48 PM

File

2015Mehri67586.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01280118, version 1

Citation

Maroua Mehri. Historical document image analysis : a structural approach based on texture. Image Processing [eess.IV]. Université de La Rochelle, 2015. English. ⟨NNT : 2015LAROS005⟩. ⟨tel-01280118⟩
