
Modèles probabilistes pour les fréquences de mots et la recherche d'information (Probabilistic Models for Word Frequencies and Information Retrieval)

Abstract: This study deals with word frequency distributions and their relation to probabilistic Information Retrieval (IR) models. We examine the burstiness phenomenon of word frequencies in textual collections. We propose to model this phenomenon as a property of probability distributions, and we study the Beta Negative Binomial and Log-Logistic distributions as models of word frequencies. We then focus on probabilistic IR models and their fundamental properties. Our analysis reveals that the probability distributions underlying most state-of-the-art models do not take burstiness into account, even though fundamental properties of IR models such as concavity implicitly make it possible to account for it. We then introduce a novel family of probabilistic IR models based on Shannon information. These new models bridge the gap between significant properties of IR models and the burstiness phenomenon of word frequencies. Lastly, we study pseudo-relevance feedback models empirically and theoretically. We propose a theoretical framework that explains well the empirical behaviour and performance of pseudo-relevance feedback models. Overall, this framework highlights interesting properties of pseudo-relevance feedback and shows that some state-of-the-art models are inadequate.

Cited literature: 96 references
Contributor: Abes Star
Submitted on : Thursday, March 1, 2012 - 1:01:59 AM
Last modification on : Thursday, November 19, 2020 - 12:59:57 PM
Long-term archiving on : Thursday, May 31, 2012 - 2:21:06 AM


Version validated by the jury (STAR)


  • HAL Id : tel-00675390, version 1



Stéphane Clinchant. Modèles probabilistes pour les fréquences de mots et la recherche d'information. Other [cs.OH]. Université de Grenoble, 2011. French. ⟨NNT : 2011GRENT046⟩. ⟨tel-00675390⟩


