Detection and analysis of a rare topic in large sets of queries: paedophile activity in P2P

Raphaël Fournier-S'Niehotta 1 
1 ComplexNetworks
LIP6 - Laboratoire d'Informatique de Paris 6
Abstract: The goal of this thesis is to study paedophile activity in P2P networks, using very large sets of queries collected from search engines. To detect paedophile-related queries, we design an automatic tool that labels each query as paedophile or not, based on the keywords it contains. We then have experts label a sample of queries in order to estimate the precision and recall of our tool, both of which are good. We find that the fraction of paedophile queries is close to 0.25% (in 2009). We then quantify the users who enter such queries, which is difficult in this context because only the IP address (and sometimes a connection port) is known. We study different methods to avoid conflating distinct users and to account for the detection errors of our tool at the user level. We estimate that the fraction of paedophile users is close to 0.22%. We also analyse the evolution of paedophile activity: the fraction of paedophile queries increases significantly between 2009 and 2012. We further observe that paedophile queries are submitted mostly around 6 am, which sheds light on the social integration of such users, who differ significantly from users of traditional pornography. Finally, we compare the eDonkey and KAD networks. We design a methodology suited to obtaining relevant data on KAD and observe that, although more decentralized and allegedly more anonymous, KAD hosts less paedophile activity than eDonkey (approximately 0.1%).
