
Cross-lingual semantic textual similarities: towards the automatic detection of plagiarism by translation

Abstract : The massive number of documents available on the Internet (e.g. web pages, data warehouses, and digital or transcribed texts) makes it easier to recycle ideas. Unfortunately, this phenomenon is accompanied by an increase in plagiarism cases. Indeed, claiming ownership of content without the consent of its author and without crediting its source, and presenting it as new and original, is considered plagiarism. In addition, the expansion of the Internet, which facilitates access to documents throughout the world (written in foreign languages), together with increasingly efficient (and freely available) machine translation tools, contributes to the spread of a new kind of plagiarism: cross-language plagiarism. Cross-language plagiarism means plagiarism by translation, i.e. a text has been plagiarized while being translated (manually or automatically) from its original language into the language of the document in which the plagiarist wishes to include it. While plagiarism prevention is an active field of research and development, it mostly covers monolingual comparison techniques. This thesis is a joint work between an academic laboratory (LIG) and Compilatio (a software company that publishes plagiarism detection solutions), and proposes cross-lingual semantic textual similarity measures, an important sub-task of cross-language plagiarism detection.
After defining plagiarism and the different concepts discussed in this thesis, we present the state of the art in cross-language plagiarism detection approaches. We also present the preexisting corpora for cross-language plagiarism detection and show their limits. We then describe how we gathered and built a new dataset that avoids most of the limits of the preexisting corpora. Using this new dataset, we conduct a rigorous evaluation of several state-of-the-art methods and find that they behave differently according to certain characteristics of the texts on which they operate. We next present new methods for measuring cross-lingual semantic textual similarity based on word embeddings. We also propose a notion of morphosyntactic and frequency weighting of words, which can be used both within a vector and within a bag-of-words, and we show that introducing it into the new methods increases their respective performance. We then test different fusion systems (mostly based on linear regression). Our experiments show that we obtain better results than the state of the art on all the sub-corpora studied. We conclude by presenting and discussing the results these methods obtained during our participation in the cross-lingual Semantic Textual Similarity (STS) task of SemEval-2017, where we ranked 1st on the sub-task that best corresponds to Compilatio’s use-case scenario.
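To make the idea of an embedding-based cross-lingual similarity with word weighting more concrete, here is a minimal sketch. It is not the method developed in the thesis: the three-dimensional vectors, the `EMBEDDINGS` and `WEIGHTS` tables, and the function names are all hypothetical, and the weights merely stand in for a morphosyntactic and frequency weighting. It assumes both languages already share one embedding space, builds a weighted average vector per sentence, and compares the two with cosine similarity.

```python
import math

# Toy vectors in a hypothetical shared English/French embedding space.
# A real system would load pretrained cross-lingual embeddings instead.
EMBEDDINGS = {
    "en": {"the": [0.3, 0.3, 0.3], "cat": [0.9, 0.1, 0.0], "sleeps": [0.1, 0.9, 0.0]},
    "fr": {"le": [0.3, 0.3, 0.3], "chat": [0.88, 0.12, 0.0], "dort": [0.12, 0.88, 0.0]},
}

# Hypothetical per-word weights: content words count more than function words.
WEIGHTS = {"the": 0.1, "le": 0.1, "cat": 1.0, "chat": 1.0, "sleeps": 0.8, "dort": 0.8}

DIM = 3  # dimensionality of the toy vectors above


def sentence_vector(tokens, lang):
    """Return the weighted average of the embeddings of the known tokens."""
    total = [0.0] * DIM
    weight_sum = 0.0
    for token in tokens:
        vector = EMBEDDINGS[lang].get(token)
        if vector is None:
            continue  # skip out-of-vocabulary words
        weight = WEIGHTS.get(token, 1.0)
        for i in range(DIM):
            total[i] += weight * vector[i]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # all-zero vector when no token is known
    return [x / weight_sum for x in total]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cross_lingual_similarity(src_tokens, src_lang, tgt_tokens, tgt_lang):
    """Compare two tokenized sentences written in different languages."""
    return cosine(sentence_vector(src_tokens, src_lang),
                  sentence_vector(tgt_tokens, tgt_lang))


if __name__ == "__main__":
    english = ["the", "cat", "sleeps"]
    print(cross_lingual_similarity(english, "en", ["le", "chat", "dort"], "fr"))
    print(cross_lingual_similarity(english, "en", ["le", "chat"], "fr"))
```

In this toy setup the full French translation scores higher than the truncated one. Scores from several such measures could then be combined by a fusion model such as linear regression, which the abstract mentions but this sketch does not implement.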

Cited literature : 273 references
Contributor : Abes Star
Submitted on : Thursday, September 20, 2018 - 11:39:06 AM
Last modification on : Wednesday, October 14, 2020 - 4:19:02 AM
Long-term archiving on : Friday, December 21, 2018 - 3:39:11 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01877727, version 1




Jérémy Ferrero. Similarités textuelles sémantiques translingues : vers la détection automatique du plagiat par traduction. Informatique et langage [cs.CL]. Université Grenoble Alpes, 2017. Français. ⟨NNT : 2017GREAM088⟩. ⟨tel-01877727⟩


