Recurrent Neural Network Approach for Table Field Extraction in Business Documents

Clément Sage 1, 2, Alex Aussem 1, Haytham Elghazel 1, Véronique Eglin 2, Jérémy Espinas
1 DM2L - Data Mining and Machine Learning
LIRIS - Laboratoire d'InfoRmatique en Image et Systèmes d'information
2 imagine - Extraction de Caractéristiques et Identification
LIRIS - Laboratoire d'InfoRmatique en Image et Systèmes d'information
Abstract : Efficiently extracting information from documents issued by their partners is crucial for companies that face huge daily document flows. In particular, tables contain the most valuable information in business documents. However, their contents are challenging to parse automatically, as tables from industrial contexts may have a complex and ambiguous physical structure. Bypassing structure recognition, we propose a generic method for end-to-end table field extraction that starts with the sequence of document tokens segmented by an OCR engine and directly tags each token with one of the possible field types. Similar to state-of-the-art methods for non-tabular field extraction, our approach relies on a token-level recurrent neural network combining spatial and textual features. We empirically assess the effectiveness of recurrent connections for our task by comparing our method with a baseline feedforward network whose inputs are augmented with local context knowledge. We train and evaluate both approaches on a dataset of 28,570 purchase orders to retrieve the ID numbers and quantities of the ordered products. Our method outperforms the baseline, with a micro F1 score on unknown document layouts of 0.821 compared to 0.764.
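As an illustration of the metric quoted in the abstract, the following is a minimal sketch of a micro-averaged F1 score computed over per-token field tags. It is not the authors' evaluation code: the label names `ID`, `QTY` and `O` (a background tag for tokens that belong to no field), the function name `micro_f1`, and the choice to exclude the background tag from the counts are all assumptions made for this example, since the abstract does not describe the evaluation protocol in detail.

```python
def micro_f1(gold, pred, background="O"):
    """Micro-averaged F1 over field labels, pooling counts across all classes.

    `gold` and `pred` are equal-length lists of per-token labels.
    Tokens tagged with `background` (hypothetical "no field" label) are not
    counted as positives for any class.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != background:
            if p == g:
                tp += 1   # predicted a field and it is the right one
            else:
                fp += 1   # predicted a field that is wrong for this token
        if g != background and p != g:
            fn += 1       # a true field token was missed or mislabeled

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up labels (not data from the paper).
gold = ["O", "ID", "QTY", "O", "ID", "QTY"]
pred = ["O", "ID", "O", "QTY", "ID", "ID"]
score = micro_f1(gold, pred)
```

Because micro-averaging pools true positives, false positives and false negatives over all field types before computing precision and recall, frequent field types weigh more heavily in the final score than rare ones.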

https://hal.archives-ouvertes.fr/hal-02156269
Contributor : Véronique Eglin
Submitted on : Monday, July 15, 2019 - 3:48:26 PM
Last modification on : Tuesday, July 16, 2019 - 2:30:23 PM

Identifiers

  • HAL Id : hal-02156269, version 1

Citation

Clément Sage, Alex Aussem, Haytham Elghazel, Véronique Eglin, Jérémy Espinas. Recurrent Neural Network Approach for Table Field Extraction in Business Documents. International Conference on Document Analysis and Recognition, ICDAR 2019, Sep 2019, Sydney, Australia. ⟨hal-02156269⟩
