Interactive learning methods for short message classification (Méthodes d’apprentissage interactif pour la classification des messages courts)

Abstract : Automatic short text classification is increasingly used in applications such as sentiment analysis and spam detection. Short texts such as tweets or SMS messages are harder to classify than traditional texts because they are short, sparse and lack contextual information. We present two new approaches to improve short text classification. The first, "Semantic Forest", has two steps. The first step is a new enrichment method that relies on an external enrichment source built in advance: a short text of a few words is expanded into a larger text carrying more information, which improves its quality before the classification model is built. Unlike the methods proposed in the literature, the second step does not use a traditional learning algorithm; it proposes a new one that exploits the semantic links among words within the Random Forest classifier. The second contribution, "IGLM" (Interactive Generic Learning Method), is a new interactive approach that recursively updates the classification model by taking into account new data arriving over time and by relying on user intervention to correct misclassified data. An abstraction method is then combined with this update mechanism to improve short text quality. Experiments on both methods show that they are effective and outperform traditional algorithms in short text classification. The last part of the thesis is a complete, reasoned comparative study of the two methods against criteria such as accuracy and speed.
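The enrichment idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the `RELATED_WORDS` dictionary, the `enrich` function and the toy data are all hypothetical, and the external source is assumed to be a simple word-to-related-words mapping built in advance.

```python
from typing import Dict, List

# Hypothetical external enrichment source, assumed to be built in advance.
# It maps a word to semantically related words (toy data, not from the thesis).
RELATED_WORDS: Dict[str, List[str]] = {
    "goal": ["football", "match", "score"],
    "vote": ["election", "politics", "ballot"],
}


def enrich(text: str, source: Dict[str, List[str]]) -> str:
    """Expand a short text by appending words related to its known words."""
    tokens = text.lower().split()
    enriched = list(tokens)
    for token in tokens:
        # Words absent from the source contribute nothing.
        for related in source.get(token, []):
            if related not in enriched:
                enriched.append(related)
    return " ".join(enriched)


print(enrich("Late goal tonight", RELATED_WORDS))
# prints "late goal tonight football match score"
```

The enriched text would then be passed to whatever classifier is trained afterwards; the sketch only covers the expansion step.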
Document type :
Theses

Cited literature : 88 references

https://tel.archives-ouvertes.fr/tel-01590468
Contributor : Abes Star
Submitted on : Tuesday, September 19, 2017 - 3:56:05 PM
Last modification on : Monday, November 5, 2018 - 3:52:10 PM

File

2017AZUR4039.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01590468, version 1

Citation

Ameni Bouaziz. Méthodes d’apprentissage interactif pour la classification des messages courts. Autre [cs.OH]. Université Côte d'Azur, 2017. Français. ⟨NNT : 2017AZUR4039⟩. ⟨tel-01590468⟩
