Detection of automatically generated texts

Abstract : Automatically generated text has been used on numerous occasions and with distinct intentions. Uses range from generating comments in an online discussion to far more mischievous tasks, such as manipulating bibliographic information. This thesis first introduces different methods of generating free text that resembles a given topic and describes how such text can be used. It then addresses several research questions. The first is how best to detect a fully generated document. The second takes this one step further and asks whether a few sentences or a short paragraph of automatically generated text can be detected; to this end, we propose a new method that computes sentence similarity from grammatical structure. The last question is how to detect an automatically generated document without any samples, which covers the case of a new generator or a generator from which samples cannot be collected. This thesis also deals with the industrial aspect of development. A brief overview of the publishing workflow of a high-profile publisher is presented, followed by an analysis of how best to incorporate our detection method into the production workflow. In conclusion, this thesis sheds light on several important research questions about the possibility of detecting automatically generated texts in different settings. Beyond the research aspect, substantial engineering work was carried out in a real-life industrial environment to demonstrate the importance of pairing theoretical research with real applications.
Document type : Theses

Cited literature : 71 references

https://tel.archives-ouvertes.fr/tel-01919207
Contributor : Abes Star
Submitted on : Monday, November 12, 2018 - 11:52:06 AM
Last modification on : Tuesday, May 21, 2019 - 6:33:24 PM
Long-term archiving on : Wednesday, February 13, 2019 - 1:59:19 PM

File

NGUYEN_MINH_TIEN_2018_diffusio...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01919207, version 1

Citation

Minh Tien Nguyen. Detection of automatically generated texts. Document and Text Processing. Université Grenoble Alpes, 2018. English. ⟨NNT : 2018GREAM025⟩. ⟨tel-01919207⟩
