Skip to Main content Skip to Navigation

Concept-based and relation-based corpus navigation : applications of natural language processing in digital humanities

Abstract : Social sciences and Humanities research is often based on large textual corpora, that it would be unfeasible to read in detail. Natural Language Processing (NLP) can identify important concepts and actors mentioned in a corpus, as well as the relations between them. Such information can provide an overview of the corpus useful for domain-experts, and help identify corpus areas relevant for a given research question. To automatically annotate corpora relevant for Digital Humanities (DH), the NLP technologies we applied are, first, Entity Linking, to identify corpus actors and concepts. Second, the relations between actors and concepts were determined based on an NLP pipeline which provides semantic role labeling and syntactic dependencies among other information. Part I outlines the state of the art, paying attention to how the technologies have been applied in DH.Generic NLP tools were used. As the efficacy of NLP methods depends on the corpus, some technological development was undertaken, described in Part II, in order to better adapt to the corpora in our case studies. Part II also shows an intrinsic evaluation of the technology developed, with satisfactory results. The technologies were applied to three very different corpora, as described in Part III. First, the manuscripts of Jeremy Bentham. This is a 18th-19th century corpus in political philosophy. Second, the PoliInformatics corpus, with heterogeneous materials about the American financial crisis of 2007-2008. Finally, the Earth Negotiations Bulletin (ENB), which covers international climate summits since 1995, where treaties like the Kyoto Protocol or the Paris Agreements get negotiated.For each corpus, navigation interfaces were developed. These user interfaces (UI) combine networks, full-text search and structured search based on NLP annotations. As an example, in the ENB corpus interface, which covers climate policy negotiations, searches can be performed based on relational information identified in the corpus: the negotiation actors having discussed a given issue using verbs indicating support or opposition can be searched, as well as all statements where a given actor has expressed support or opposition. Relation information is employed, beyond simple co-occurrence between corpus terms.The UIs were evaluated qualitatively with domain-experts, to assess their potential usefulness for research in the experts' domains. First, we payed attention to whether the corpus representations we created correspond to experts' knowledge of the corpus, as an indication of the sanity of the outputs we produced. Second, we tried to determine whether experts could gain new insight on the corpus by using the applications, e.g. if they found evidence unknown to them or new research ideas. Examples of insight gain were attested with the ENB interface; this constitutes a good validation of the work carried out in the thesis. Overall, the applications' strengths and weaknesses were pointed out, outlining possible improvements as future work.
Complete list of metadata

Cited literature [292 references]  Display  Hide  Download
Contributor : Abes Star :  Contact
Submitted on : Monday, July 2, 2018 - 12:00:07 PM
Last modification on : Tuesday, July 6, 2021 - 3:25:17 PM
Long-term archiving on: : Monday, October 1, 2018 - 8:42:30 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01575167, version 2



Pablo Ruiz Fabo. Concept-based and relation-based corpus navigation : applications of natural language processing in digital humanities. Linguistics. Université Paris sciences et lettres, 2017. English. ⟨NNT : 2017PSLEE053⟩. ⟨tel-01575167v2⟩



Record views


Files downloads