A visual analytics approach for multi-resolution and multi-model analysis of text corpora : application to investigative journalism

Abstract : As the production of digital texts grows exponentially, a greater need to analyze text corpora arises in various domains of application, insofar as they constitute inexhaustible sources of shared information and knowledge. We therefore propose in this thesis a novel visual analytics approach for the analysis of text corpora, implemented for the real and concrete needs of investigative journalism. Motivated by the problems and tasks identified with a professional investigative journalist, visualizations and interactions are designed through a user-centered methodology involving the user during the whole development process. Specifically, investigative journalists formulate hypotheses and explore exhaustively the field under investigation in order to multiply sources showing pieces of evidence related to their working hypothesis. Carrying out such tasks in a large corpus is however a daunting endeavor and requires visual analytics software addressing several challenging research issues covered in this thesis. First, the difficulty to make sense of a large text corpus lies in its unstructured nature. We resort to the Vector Space Model (VSM) and its strong relationship with the distributional hypothesis, leveraged by multiple text mining algorithms, to discover the latent semantic structure of the corpus. Topic models and biclustering methods are recognized to be well suited to the extraction of coarse-grained topics, i.e. groups of documents concerning similar topics, each one represented by a set of terms extracted from textual contents. We provide a new Weighted Topic Map visualization that conveys a broad overview of coarse-grained topics by allowing quick interpretation of contents through multiple tag clouds while depicting the topical structure such as the relative importance of topics and their semantic similarity. Although the exploration of the coarse-grained topics helps locate topic of interest and its neighborhood, the identification of specific facts, viewpoints or angles related to events or stories requires finer level of structuration to represent topic variants. This nested structure, revealed by Bimax, a pattern-based overlapping biclustering algorithm, captures in biclusters the co-occurrences of terms shared by multiple documents and can disclose facts, viewpoints or angles related to events or stories. This thesis tackles issues related to the visualization of a large amount of overlapping biclusters by organizing term-document biclusters in a hierarchy that limits term redundancy and conveys their commonality and specificities. We evaluated the utility of our software through a usage scenario and a qualitative evaluation with an investigative journalist. In addition, the co-occurrence patterns of topic variants revealed by Bima. are determined by the enclosing topical structure supplied by the coarse-grained topic extraction method which is run beforehand. Nonetheless, little guidance is found regarding the choice of the latter method and its impact on the exploration and comprehension of topics and topic variants. Therefore we conducted both a numerical experiment and a controlled user experiment to compare two topic extraction methods, namely Coclus, a disjoint biclustering method, and hierarchical Latent Dirichlet Allocation (hLDA), an overlapping probabilistic topic model. The theoretical foundation of both methods is systematically analyzed by relating them to the distributional hypothesis. The numerical experiment provides statistical evidence of the difference between the resulting topical structure of both methods. The controlled experiment shows their impact on the comprehension of topic and topic variants, from analyst perspective. (...)
Complete list of metadatas

Cited literature [165 references]  Display  Hide  Download

https://tel.archives-ouvertes.fr/tel-02121464
Contributor : Abes Star <>
Submitted on : Monday, May 6, 2019 - 3:40:09 PM
Last modification on : Saturday, June 15, 2019 - 4:20:03 AM
Long-term archiving on: Tuesday, October 1, 2019 - 10:06:45 PM

File

va_Medoc_Nicolas.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02121464, version 1

Collections

Citation

Nicolas Médoc. A visual analytics approach for multi-resolution and multi-model analysis of text corpora : application to investigative journalism. Data Structures and Algorithms [cs.DS]. Université Sorbonne Paris Cité, 2017. English. ⟨NNT : 2017USPCB042⟩. ⟨tel-02121464⟩

Share

Metrics

Record views

70

Files downloads

16