
From Text to Trust : A Priori Interpretability Versus Post Hoc Explainability in Natural Language Processing

Tom Bourgeade 1 
1 IRIT-MELODI - MEthodes et ingénierie des Langues, des Ontologies et du DIscours
IRIT - Institut de recherche en informatique de Toulouse
Abstract : With the advent of Transformer architectures in Natural Language Processing a few years ago, we have observed unprecedented progress in various text classification and generation tasks. However, the explosion in the number of parameters and the complexity of these state-of-the-art black-box models make the need for transparency in machine learning approaches ever more urgent. The ability to explain, interpret, and understand algorithmic decisions will become paramount as computational models become increasingly present in our everyday lives. Using eXplainable AI (XAI) methods, we can, for example, diagnose dataset biases and spurious correlations, which can taint the training process of models and lead them to learn undesirable shortcuts, resulting in unfair, incomprehensible, or even risky algorithmic decisions. These failure modes may ultimately erode the trust humans would otherwise place in beneficial applications of AI. In this work, we explore two major aspects of XAI in the context of Natural Language Processing tasks and models. In the first part, we approach the subject of intrinsic interpretability, which encompasses all methods for which explanations are inherently easy to produce. In particular, we focus on word embedding representations, an essential component of practically all NLP architectures that allows these mathematical models to process human language in a more semantically rich way. Unfortunately, many of the models that generate these representations produce them in a way that is not interpretable by humans. To address this problem, we experiment with the construction and use of Interpretable Word Embedding models, which attempt to correct this issue by imposing constraints that enforce interpretability on the representations.
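The abstract does not specify which interpretability constraints are used; one classic family of approaches factorizes embedding matrices under non-negativity constraints, so that each latent dimension becomes an additive, human-inspectable "topic". A minimal sketch of that idea, using non-negative matrix factorization with multiplicative updates on hypothetical toy data (real dense embeddings can be negative and would need shifting or a different factorization):

```python
import numpy as np

def nmf(X, k, iters=200, seed=0):
    """Factor non-negative X (n x d) into W (n x k) @ H (k x d),
    both non-negative, via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.random((n, k))
    H = rng.random((k, d))
    eps = 1e-9  # avoid division by zero
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy "embedding" matrix: 6 words x 4 dense dimensions.
X = np.abs(np.random.default_rng(1).random((6, 4)))
W, H = nmf(X, k=2)
# Each row of W gives a word's (non-negative) weights over 2 latent
# dimensions; each row of H can be read off as an interpretable axis.
err = np.linalg.norm(X - W @ H)
```

The non-negativity is what buys interpretability here: a word's representation is a sum of contributions, never a cancellation, so the top-weighted words per dimension can be inspected directly.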
We then make use of these embeddings in a simple but effective novel setup to detect lexical correlations, spurious or otherwise, in some popular NLP datasets. In the second part, we explore post-hoc explainability methods, which target already-trained models and attempt to extract various forms of explanations of their decisions. These range from diagnosing which parts of an input were most relevant to a particular decision, to generating adversarial examples carefully crafted to reveal weaknesses in a model. We explore a novel type of approach, enabled in part by the highly performant but opaque recent Transformer architectures: instead of using a separate method to produce explanations of a model's decisions, we design and fine-tune an architecture that jointly learns to perform its task while also producing free-form Natural Language Explanations of its own outputs. We evaluate our approach on a large-scale dataset annotated with human explanations, and qualitatively judge some of the machine-generated explanations our approach produces.
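To make the first kind of post-hoc method concrete, a minimal occlusion (leave-one-out) attribution sketch: each token's relevance is the drop in model score when that token is removed. The scorer below is a hypothetical toy lexicon model, not the architecture studied in the thesis:

```python
def occlusion_attributions(tokens, score_fn):
    """Attribute the model's score to each token as the score drop
    observed when that token is occluded (leave-one-out)."""
    base = score_fn(tokens)
    return {
        (i, tok): base - score_fn(tokens[:i] + tokens[i + 1:])
        for i, tok in enumerate(tokens)
    }

# Hypothetical toy sentiment scorer: sums word weights from a lexicon.
LEXICON = {"great": 2.0, "boring": -1.5, "not": -0.5}

def toy_score(tokens):
    return sum(LEXICON.get(t, 0.0) for t in tokens)

attrs = occlusion_attributions(["the", "movie", "was", "great"], toy_score)
# "great" receives attribution 2.0; neutral words receive 0.0
```

Because occlusion only queries the model's inputs and outputs, it applies to any already-trained black-box classifier, which is precisely what distinguishes post-hoc methods from the a priori interpretable models of the first part.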
Submitted on : Tuesday, September 6, 2022 - 11:24:07 AM
Last modification on : Wednesday, September 7, 2022 - 3:52:19 AM


Version validated by the jury (STAR)


  • HAL Id : tel-03770191, version 1


Tom Bourgeade. From Text to Trust : A Priori Interpretability Versus Post Hoc Explainability in Natural Language Processing. Artificial Intelligence [cs.AI]. Université Paul Sabatier - Toulouse III, 2022. English. ⟨NNT : 2022TOU30063⟩. ⟨tel-03770191⟩


