Toward Robust Information Extraction Models for Multimedia Documents

Ali-Reza Ebadat 1
1 TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : Over the last decade, huge amounts of multimedia documents have been generated, so it is important to find ways to manage this data. Any approach that facilitates this process requires a deep understanding of the content of the documents. There are two ways to obtain such insight: extracting information from the document itself (e.g. audio, image) or using related data from external sources (such as the Web). We chose the latter. The extracted information can then be used in a global framework as annotations for multimedia documents, in order to facilitate their management. One of the main objectives of this thesis was robustness to noisy and small data. To reach this objective, we used simple, knowledge-light techniques (i.e. shallow linguistic analysis) as a guarantee of robustness, which we assume to be mandatory for processing multimedia documents. Specifically, we used statistical analysis of text and techniques inspired by Information Retrieval. In addition, we introduced a data representation scheme that is new to text processing and has been used successfully in image Information Retrieval. In this thesis, we focused on three tasks: Relation Extraction, Relation Discovery and Proper Noun Clustering. For the first task, Relation Extraction, we proposed a supervised model based on language modeling and an instance-based learning algorithm, k-nearest neighbors (kNN). Experimental results showed the effectiveness of our models, which use shallow linguistic information, compared to state-of-the-art systems that use deep linguistic analysis. For the second task, we moved to an unsupervised model to discover relations instead of extracting predefined ones. We modeled this problem as a clustering task and defined a similarity function based on language modeling and average probability. The performance of this model was evaluated on textual football reports and showed improvements over a classical model with a cosine similarity function. Moreover, we studied the importance of some domain-independent filters in this task. Since each relation holds between two entities, we defined the last task as clustering entities (more precisely, proper nouns) in order to discover and bring out semantic classes without a priori knowledge. For this task, we proposed a new data representation that keeps each instance of a proper noun separate. We then introduced a discriminative similarity function to take into account the importance of each occurrence of the proper nouns in the corpus. In conclusion, we showed experimentally that simple techniques, requiring little a priori knowledge and using shallow linguistic information, can be useful for effectively extracting information from text. In our case, these results were achieved by choosing suitable representations for the data, based on statistical analysis or Information Retrieval models. There is still a long road ahead before raw multimedia documents can be processed, but we hope that these good results may serve as a springboard for future research in this field.
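
As a point of reference for the cosine similarity baseline mentioned in the abstract, the following is a minimal sketch and not the thesis implementation. It assumes that each relation instance is represented as a bag of the words found in its context; the function names and the sample contexts are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercased whitespace tokens (a deliberately shallow representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty context is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical contexts linking two entity mentions in football reports.
first = bag_of_words("scores a goal for")
second = bag_of_words("scores the winning goal for")
third = bag_of_words("is replaced by")

print(cosine_similarity(first, second))  # shared words give a positive score
print(cosine_similarity(first, third))   # no shared words
```

A clustering algorithm would then group instances whose pairwise similarity is high. According to the abstract, the thesis replaces this function with one based on language modeling and average probability, which is not reproduced here.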

https://tel.archives-ouvertes.fr/tel-00760383
Contributor : Vincent Claveau
Submitted on : Monday, December 3, 2012 - 6:37:39 PM
Last modification on : Friday, November 16, 2018 - 1:22:04 AM
Long-term archiving on : Monday, March 4, 2013 - 3:51:47 AM

Identifiers

  • HAL Id : tel-00760383, version 1

Citation

Ali-Reza Ebadat. Toward Robust Information Extraction Models for Multimedia Documents. Computation and Language [cs.CL]. INSA de Rennes, 2012. English. ⟨tel-00760383⟩
