
Deep learning for information extraction from business documents

Abstract : Because of the massive and growing number of documents received each day and the number of steps needed to process them, the largest companies have turned to document automation software to lower processing costs. One crucial step in such software is the automatic extraction of information from the documents, particularly retrieving fields that repeatedly appear in the incoming documents. To deal with the variability in the structure of the information contained in such documents, industrial and academic practitioners have progressively moved from rule-based methods to machine learning and deep learning models to perform the extraction task. The goal of this thesis is to provide methods for learning to extract information from business documents. In the first part of this manuscript, we embrace the sequence labeling approach by training deep neural networks to classify the information type carried by each token in the documents. When provided with perfect token labels for training, we show that these token classifiers can extract complex tabular information from documents whose issuers and layouts were unknown at model training time. However, when the token-level supervision must be deduced from the high-level ground truth naturally produced by the extraction task, we demonstrate that the token classifiers extract information from real-world documents with significantly lower accuracy due to the noise introduced in the labels. In the second part of this thesis, we explore methods that learn to extract information directly from the high-level ground truth at our disposal, thus bypassing the need for costly token-level supervision. We adapt an attention-based sequence-to-sequence model to alternately copy the document tokens carrying relevant information and generate the XML tags structuring the output extraction schema. 
Unlike prior work in end-to-end information extraction, our approach can retrieve arbitrarily structured information schemas. By comparing its extraction performance with that of the previous token classifiers, we show that end-to-end methods are competitive with sequence labeling approaches and can greatly outperform them when their token labels are not immediately accessible. Finally, in the third part, we confirm that using pre-trained models to extract information greatly reduces the need for annotated documents. We leverage an existing Transformer-based language model which has been pre-trained on a large collection of business documents. When adapted for an information extraction task through sequence labeling, the language model requires very few training documents to attain close to maximal extraction performance. This underlines that pre-trained models are significantly more data-efficient than models learning the extraction task from scratch. We also show that this language model has valuable knowledge transfer abilities: its few-shot performance improves when it first learns to extract information on another dataset, even if that dataset's targeted fields differ from those of the initial task.
Contributor : Abes Star
Submitted on : Tuesday, January 11, 2022 - 4:39:09 PM
Last modification on : Wednesday, January 12, 2022 - 3:46:13 AM


Version validated by the jury (STAR)


  • HAL Id : tel-03521607, version 1


Clément Sage. Deep learning for information extraction from business documents. Machine Learning [stat.ML]. Université de Lyon, 2021. English. ⟨NNT : 2021LYSE1172⟩. ⟨tel-03521607⟩


