Language modeling with structured penalties (Modélisation du langage à l'aide de pénalités structurées)

Anil Kumar Nelakanti 1
1 SIERRA - Statistical Machine Learning and Parsimony
DI-ENS - Département d'informatique de l'École normale supérieure, CNRS - Centre National de la Recherche Scientifique, Inria de Paris
Abstract : Modeling natural language is among the fundamental challenges of artificial intelligence and the design of interactive machines, with applications in domains such as dialogue systems, text generation and machine translation. We propose a discriminatively trained log-linear model to learn the distribution of words following a given context. Because the data are sparse, the model must be regularized with a penalty term. We design a penalty term that encodes the structure of the feature space to avoid overfitting and improve generalization while capturing long-range dependencies. Properties of specific structured penalties can be used to reduce the number of parameters required to encode the model. The outcome is an efficient model that captures long dependencies in language without a significant increase in time or space requirements. In a log-linear model, both training and testing become increasingly expensive as the number of classes grows. The number of classes in a language model is the size of the vocabulary, which is typically very large. A common trick is to cluster classes and apply the model in two steps: the first step picks the most probable cluster and the second picks the most probable word from the chosen cluster. This idea can be generalized to a hierarchy of larger depth with multiple levels of clustering. However, the performance of the resulting hierarchical classifier depends on the suitability of the clustering to the problem. We study different strategies to build the hierarchy of categories from their observations.
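The two-step trick described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the fixed clustering, the random weights, the dimensions and every name (softmax, two_step_probs, two_step_predict, W_cluster, W_word) are hypothetical and chosen only for illustration. The word probability is factored as P(word | context) = P(cluster | context) * P(word | cluster, context), with each factor given by a small log-linear (softmax) model.

```python
import numpy as np

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def two_step_probs(x, W_cluster, W_word, clusters):
    """Return P(word | x) for every word as P(cluster | x) * P(word | cluster, x)."""
    p_cluster = softmax(W_cluster @ x)
    p_word = np.zeros(W_word.shape[0])
    for k, members in enumerate(clusters):
        # Normalize only over the words that belong to cluster k.
        p_word[members] = p_cluster[k] * softmax(W_word[members] @ x)
    return p_word

def two_step_predict(x, W_cluster, W_word, clusters):
    """Pick the most probable cluster, then the most probable word inside it."""
    k = int(np.argmax(W_cluster @ x))
    members = clusters[k]
    return int(members[int(np.argmax(W_word[members] @ x))])

# Hypothetical toy setup: 6 words split into 3 clusters, 5 context features.
rng = np.random.default_rng(0)
n_features, vocab_size = 5, 6
clusters = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]
W_cluster = rng.normal(size=(len(clusters), n_features))
W_word = rng.normal(size=(vocab_size, n_features))
x = rng.normal(size=n_features)  # feature vector of the context

probs = two_step_probs(x, W_cluster, W_word, clusters)
predicted = two_step_predict(x, W_cluster, W_word, clusters)
```

In this sketch each prediction scores the clusters and then only the words of one cluster, instead of the whole vocabulary. The greedy two-step choice is not guaranteed to return the word with the highest overall probability, which is one reason the quality of the clustering matters.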
Document type :
Theses

Cited literature : 247 references

https://tel.archives-ouvertes.fr/tel-01001634
Contributor : Abes Star
Submitted on : Wednesday, June 4, 2014 - 4:03:09 PM
Last modification on : Thursday, February 7, 2019 - 1:33:21 AM
Document(s) archived on : Thursday, September 4, 2014 - 12:35:14 PM

File

Kumartheseretourducinesoptimis...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01001634, version 1

Citation

Anil Kumar Nelakanti. Modélisation du langage à l'aide de pénalités structurées. Other [cs.OH]. Université Pierre et Marie Curie - Paris VI, 2014. English. ⟨NNT : 2014PA066033⟩. ⟨tel-01001634⟩
