Segmentation of heterogeneous document images : an approach based on machine learning, connected components analysis, and texture analysis

Abstract : Document page segmentation is one of the most crucial steps in document image analysis. It ideally aims to explain the full structure of any document page, distinguishing text zones, graphics, photographs, halftones, figures, tables, etc. Although to date, there have been made several attempts of achieving correct page segmentation results, there are still many difficulties. The leader of the project in the framework of which this PhD work has been funded (*) uses a complete processing chain in which page segmentation mistakes are manually corrected by human operators. Aside of the costs it represents, this demands tuning of a large number of parameters; moreover, some segmentation mistakes sometimes escape the vigilance of the operators. Current automated page segmentation methods are well accepted for clean printed documents; but, they often fail to separate regions in handwritten documents when the document layout structure is loosely defined or when side notes are present inside the page. Moreover, tables and advertisements bring additional challenges for region segmentation algorithms. Our method addresses these problems. The method is divided into four parts:1. Unlike most of popular page segmentation methods, we first separate text and graphics components of the page using a boosted decision tree classifier.2. The separated text and graphics components are used among other features to separate columns of text in a two-dimensional conditional random fields framework.3. A text line detection method, based on piecewise projection profiles is then applied to detect text lines with respect to text region boundaries.4. Finally, a new paragraph detection method, which is trained on the common models of paragraphs, is applied on text lines to find paragraphs based on geometric appearance of text lines and their indentations. Our contribution over existing work lies in essence in the use, or adaptation, of algorithms borrowed from machine learning literature, to solve difficult cases. Indeed, we demonstrate a number of improvements : on separating text columns when one is situated very close to the other; on preventing the contents of a cell in a table to be merged with the contents of other adjacent cells; on preventing regions inside a frame to be merged with other text regions around, especially side notes, even when the latter are written using a font similar to that the text body. Quantitative assessment, and comparison of the performances of our method with competitive algorithms using widely acknowledged metrics and evaluation methodologies, is also provided to a large extend.(*) This PhD thesis has been funded by Conseil Général de Seine-Saint-Denis, through the FUI6 project Demat-Factory, lead by Safig SA
Document type :
Theses
Complete list of metadatas

Cited literature [105 references]  Display  Hide  Download

https://tel.archives-ouvertes.fr/tel-00912566
Contributor : Abes Star <>
Submitted on : Monday, December 2, 2013 - 12:37:45 PM
Last modification on : Thursday, July 5, 2018 - 2:26:41 PM
Long-term archiving on : Monday, March 3, 2014 - 9:25:15 AM

File

TH2012PEST1063_complete.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-00912566, version 1

Citation

Omid Bonakdar Sakhi. Segmentation of heterogeneous document images : an approach based on machine learning, connected components analysis, and texture analysis. Other [cs.OH]. Université Paris-Est, 2012. English. ⟨NNT : 2012PEST1063⟩. ⟨tel-00912566⟩

Share

Metrics

Record views

1072

Files downloads

5717