Theses

SUFT-1, a system for helping understand spontaneous multilingual and code-switching tweets in foreign languages : experimentation and evaluation on Indian and Japanese tweets

Abstract : As Twitter evolves into a ubiquitous information dissemination tool, understanding tweets in foreign languages becomes an important and difficult problem. Because of the inherent code-mixed, disfluent and noisy nature of tweets, state-of-the-art Machine Translation (MT) is not a viable option (Farzindar & Inkpen, 2015). Indeed, at least for Hindi and Japanese, we observe that the percentage of "understandable" tweets falls from 80% for natives to below 30% for target (English or French) monolingual readers using Google Translate. Our starting hypothesis is that it should be possible to build generic tools, which would enable foreigners to make sense of at least 70% of "native tweets", using a versatile "active reading" (AR) interface, while simultaneously determining the percentage of understandable tweets under which such a system would be deemed useless by intended users.

We have thus specified a generic "SUFT" (System for Helping Understand Tweets), and implemented SUFT-1, an interactive multi-layout system based on AR, and easily configurable by adding dictionaries, morphological modules, and MT plugins. It is capable of accessing multiple dictionaries for each source language and provides an evaluation interface. For evaluations, we introduce a task-related measure inducing a negligible cost, and a methodology aimed at enabling a "continuous evaluation on open data", as opposed to classical measures based on test sets related to closed learning sets. We propose to combine understandability ratio and understandability decision time as a two-pronged quality measure, one subjective and the other objective, and experimentally ascertain that a dictionary-based active reading presentation can indeed help understand tweets better than available MT systems.

In addition to gathering various lexical resources, we constructed a large resource of "word-forms" appearing in Indian tweets with their morphological analyses (viz. 163,221 Hindi word-forms from 68,788 lemmas and 72,312 Marathi word-forms from 6,026 lemmas) for creating a multilingual morphological analyzer specialized to tweets, which can handle code-mixed tweets, compute unified features, and present a tweet with an attached AR graph from which foreign readers can intuitively extract a plausible meaning, if any.
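The two-pronged measure named in the abstract (understandability ratio plus understandability decision time) can be illustrated with a minimal sketch. This is not the thesis's implementation: the `Judgment` record, its field names, and the use of a plain arithmetic mean for decision time are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Judgment:
    """One reader's verdict on one tweet (hypothetical record layout)."""
    tweet_id: str
    understandable: bool      # subjective yes/no decision by the reader
    decision_seconds: float   # objective time taken to reach that decision


def understandability_ratio(judgments):
    """Fraction of judged tweets that the reader marked as understandable."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(j.understandable for j in judgments) / len(judgments)


def mean_decision_time(judgments):
    """Average time, in seconds, spent deciding (assumed aggregation)."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    return mean(j.decision_seconds for j in judgments)


def summarize(judgments):
    """Report both prongs side by side rather than merging them into one score."""
    return {
        "ratio": understandability_ratio(judgments),
        "mean_decision_seconds": mean_decision_time(judgments),
    }


# Illustrative values only, not data from the thesis.
sample = [
    Judgment("t1", True, 2.0),
    Judgment("t2", True, 4.0),
    Judgment("t3", False, 6.0),
    Judgment("t4", True, 8.0),
]
summary = summarize(sample)
```

Keeping the two values separate mirrors the abstract's description of one subjective and one objective component; how the thesis actually aggregates or compares them is not stated in this record.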
Document type :
Theses

Cited literature : 244 references

https://tel.archives-ouvertes.fr/tel-01865400
Contributor : Abes Star
Submitted on : Friday, August 31, 2018 - 2:37:08 PM
Last modification on : Thursday, November 19, 2020 - 1:02:02 PM
Long-term archiving on : Saturday, December 1, 2018 - 1:30:38 PM

File

SHAH_2017_diffusion.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01865400, version 1

Citation

Ritesh Shah. SUFT-1, a system for helping understand spontaneous multilingual and code-switching tweets in foreign languages : experimentation and evaluation on Indian and Japanese tweets. Computation and Language [cs.CL]. Université Grenoble Alpes, 2017. English. ⟨NNT : 2017GREAM062⟩. ⟨tel-01865400⟩

Metrics

Record views : 331
File downloads : 184