Apprentissage à base de Noyaux Sémantiques pour le Traitement de Données Textuelles (Semantic Kernel-Based Learning for Textual Data Processing)

Abstract : Since the early 1980s, statistical methods, and machine learning in particular, have attracted rapidly growing interest for textual data processing. This is mainly because the number of documents to process is growing exponentially. Expert-based methods have consequently become too costly, and research attention has shifted toward machine learning-based methods.
In this thesis, we focus on two main issues. The first is the processing of semi-structured textual data with kernel-based methods. In this context, we present a semantic kernel for documents that are structured into sections in XML format. This kernel captures semantic information by using an external source of knowledge, e.g., a thesaurus. Our kernel was evaluated on a corpus of medical documents with the UMLS thesaurus. It ranked in the top ten, according to the F1-score, among 44 algorithms at the 2007 CMC Medical NLP International Challenge.
The second issue is the use of latent concepts extracted by statistical methods such as Latent Semantic Analysis (LSA). In a first part, we present kernels based on linguistic concepts from external sources and on latent concepts from LSA. We show that a kernel integrating both kinds of concepts improves text categorization performance. Then, in a second part, we present a kernel that uses local LSAs to extract latent concepts. These local latent concepts provide a finer representation of the documents.
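As a rough illustration of the idea of combining linguistic and latent concepts, the sketch below builds a kernel as a weighted sum of two linear kernels: one over LSA latent concepts obtained from a truncated SVD, and one over concepts from an external source. It is not the kernel defined in the thesis. The toy matrices `X` and `C`, the function names, and the `weight` parameter are all assumptions made for this example.

```python
import numpy as np

def lsa_projection(X, k):
    """Return a (terms x k) matrix spanning the top-k LSA latent concepts.

    X is a documents x terms matrix. The rows of Vt from the SVD are
    directions in term space, ordered by decreasing singular value.
    """
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T

def concept_kernel(X, M):
    """Linear kernel between documents after mapping terms to concepts via M."""
    Z = X @ M
    return Z @ Z.T

def combined_kernel(X, P, C, weight=0.5):
    """Weighted sum of a latent-concept kernel and a linguistic-concept kernel.

    P: terms x k LSA projection.
    C: terms x concepts matrix (hypothetical thesaurus mapping).
    A non-negative weighted sum of two valid kernels is itself a valid kernel.
    """
    return weight * concept_kernel(X, P) + (1.0 - weight) * concept_kernel(X, C)

# Toy data (assumed): 4 documents over 5 terms.
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
])

# Hypothetical term-to-concept mapping: terms 0-1 map to concept A,
# terms 2-3 map to concept B, and term 4 maps to neither.
C = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [0.0, 0.0],
])

P = lsa_projection(X, k=2)
K = combined_kernel(X, P, C, weight=0.5)
```

The resulting Gram matrix `K` could be passed to any kernel method that accepts a precomputed kernel. A "local" variant would compute a separate projection from a subset of documents (for example, one per category) rather than from the whole corpus.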
Document type :
Theses
Cited literature : 163 references

https://tel.archives-ouvertes.fr/tel-00274627
Contributor : Sujeevan Aseervatham
Submitted on : Saturday, April 19, 2008 - 9:29:14 PM
Last modification on : Monday, October 19, 2020 - 8:15:20 AM
Long-term archiving on : Friday, May 21, 2010 - 1:53:33 AM

Identifiers

  • HAL Id : tel-00274627, version 1

Citation

Sujeevan Aseervatham. Apprentissage à base de Noyaux Sémantiques pour le Traitement de Données Textuelles. Informatique [cs]. Université Paris-Nord - Paris XIII, 2007. Français. ⟨tel-00274627⟩
