From lexical towards contextualized meaning representation

Abstract: Continuous word representations (word type embeddings) underlie most modern natural language processing systems, providing competitive results particularly when used as input to deep learning models. However, important questions arise concerning the challenges they face in dealing with complex natural language phenomena, and concerning their ability to capture natural language variability.

To better handle complex language phenomena, much work has investigated fine-tuning generic word type embeddings or creating specialized embeddings that satisfy particular linguistic constraints. While this can help distinguish semantic similarity from other types of semantic relatedness, it may not suffice to model certain relations between texts, such as the logical relations of entailment or contradiction.

The first part of the thesis investigates encoding the notion of entailment within a vector space by enforcing information inclusion, using an approximation to logical entailment of binary vectors. We further develop entailment operators and show how the proposed framework can be used to reinterpret an existing distributional semantic model. Evaluations are provided on hyponymy detection as an instance of lexical entailment.

Another challenge concerns the variability of natural language and the need to disambiguate the meaning of lexical units depending on the context in which they appear. Here, generic word type embeddings fall short on their own, and different architectures are typically employed on top of them to help with disambiguation. Since type embeddings are constructed from, and reflect, co-occurrence statistics over large corpora, they provide a single representation for a given word, regardless of its potentially numerous meanings.
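As a toy illustration of the information-inclusion view of lexical entailment (a sketch only; the `entails` function and the feature dimensions below are invented for illustration and are not the thesis model):

```python
# Toy illustration: lexical entailment as information inclusion over
# binary feature vectors. A more specific term ("dog") entails a more
# general one ("animal") when every active feature of the general term
# is also active in the specific term.

def entails(x, y):
    """True if binary vector x entails y (y's information is included in x)."""
    return all(xf >= yf for xf, yf in zip(x, y))

# Hypothetical feature dimensions: [is_entity, is_animate, is_canine]
animal = [1, 1, 0]
dog    = [1, 1, 1]

print(entails(dog, animal))  # True: "dog" entails "animal"
print(entails(animal, dog))  # False: the reverse does not hold
```

The thesis works with a soft, probabilistic approximation to this hard subset test, but the asymmetry shown here is the core intuition.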
Furthermore, even for monosemous words, type embeddings do not distinguish between the different usages of a word depending on its context. One could therefore ask whether it is possible to directly leverage the linguistic information provided by a word's context to adjust its representation. Would such information help create an enriched representation of the word in its context? And if so, can information of a syntactic nature aid in the process, or is local context sufficient? One could thus investigate whether looking at the representations of the words within a sentence, and the way they combine with each other, suffices to build more accurate token representations for that sentence and thereby facilitate performance gains on natural language understanding tasks.

In the second part of the thesis, we investigate one possible way to incorporate contextual knowledge into the word representations themselves, leveraging information from the sentence's dependency parse along with local vicinity information. We propose syntax-aware token embeddings (SATokE) that capture specific linguistic information, encoding the structure of the sentence, from a dependency point of view, in their representations. This enables moving from generic type embeddings (context-invariant) to specific token embeddings (context-aware). While syntax has previously been considered for building type representations, its benefits may not have been fully assessed beyond models that harvest such syntactic information from large corpora. The obtained token representations are evaluated on natural language understanding tasks typically considered in the literature: sentiment classification, paraphrase detection, textual entailment and discourse analysis.
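The type-versus-token distinction can be caricatured in a few lines (a minimal sketch only, not the SATokE model: the interpolation scheme, `alpha` parameter, and random vectors are assumptions made for illustration):

```python
import numpy as np

# Toy caricature of context-aware token embeddings: mix a word's generic
# type embedding with the mean embedding of its context words, so the
# same word type yields different token vectors in different sentences.
rng = np.random.default_rng(0)
type_emb = {w: rng.normal(size=4)
            for w in ["bank", "river", "shore", "money", "deposit"]}

def token_embedding(word, context, alpha=0.5):
    """Interpolate the type embedding with the mean of the context embeddings."""
    ctx = np.mean([type_emb[c] for c in context], axis=0)
    return alpha * type_emb[word] + (1 - alpha) * ctx

t1 = token_embedding("bank", ["river", "shore"])    # geographic sense
t2 = token_embedding("bank", ["money", "deposit"])  # financial sense

# Same word type, two different token vectors depending on context:
print(np.allclose(t1, t2))  # False
```

SATokE goes well beyond this bag-of-context averaging by encoding the dependency structure of the sentence, but the sketch shows what "moving from type to token embeddings" means operationally.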
We empirically demonstrate the superiority of these token representations over popular distributional representations of words and over other token embeddings proposed in the literature. The work presented in this thesis aims to contribute to research on modelling complex phenomena such as entailment, as well as on tackling language variability, through the proposal of contextualized token embeddings.

Cited literature [202 references]

https://tel.archives-ouvertes.fr/tel-02478383
Contributor: Abes Star
Submitted on: Thursday, February 13, 2020 - 7:07:08 PM
Last modification on: Friday, February 14, 2020 - 1:36:54 AM

File

POPA_2019_archivage.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02478383, version 1

Citation

Diana-Nicoleta Popa. From lexical towards contextualized meaning representation. Computers and Society [cs.CY]. Université Grenoble Alpes, 2019. English. ⟨NNT : 2019GREAM037⟩. ⟨tel-02478383⟩

Metrics

Record views: 113
File downloads: 17