Scan statistics: theory and application to epidemiology

Mickaël Genin

Abstract

A cluster is an aggregation of events in time and/or space. In many fields, experts observe aggregations of events, and the question arises whether these can be considered normal (due to chance) or not. From a probabilistic point of view, normality can be described by a null hypothesis of random distribution of events. The detection of clusters of events is an area of statistics that has grown considerably over the past decades. The scientific community first focused on developing methods for the one-dimensional setting (e.g., time) and subsequently extended them to the multidimensional case, especially the two-dimensional one (space). Among the methods for detecting clusters of events, three major types of tests can be distinguished. The first type comprises global tests, which detect an overall tendency toward aggregation without locating any clusters. The second type comprises focused tests, which are used when a priori knowledge makes it possible to define a point source (a date or a spatial location) and to test for aggregation around it. The third type comprises cluster detection tests, which locate clusters of events without a priori knowledge and test their statistical significance. In this thesis, we focused on the last category, and in particular on methods based on scan statistics. These methods emerged in the early 1960s; they detect clusters of events and determine whether their appearance is "normal" (due to chance) or "abnormal". The detection step is performed by scanning the studied domain (discrete or continuous; time, space, etc.), in which the events are observed, with a window called the scanning window. This detection step yields a set of windows, each defining a potential cluster.
A scan statistic is a random variable defined as the maximum number of events observed in any scanning window. Scan statistics are used as test statistics to test the null hypothesis that the observations are independent and follow a given distribution, against an alternative hypothesis favoring the existence of a cluster within the studied region. The main difficulty lies in determining the distribution of the scan statistic under the null hypothesis. Because it is defined as the maximum of a sequence of dependent random variables, the dependence arising from the overlap of the scanning windows, explicit solutions exist only in very rare cases. Part of the literature is therefore devoted to developing methods (exact formulas and approximations) for determining the distribution of scan statistics. Moreover, in the two-dimensional setting, the scanning window can take various geometric shapes (rectangular, circular, ...) that could influence the approximation of the distribution of the scan statistic. However, to our knowledge, no study has evaluated this influence. In the spatial context, the spatial scan statistics developed by M. Kulldorff are the most commonly used methods for spatial cluster detection. These methods scan the studied area with circular windows and select as the most likely cluster the one that maximizes a likelihood ratio test statistic. Statistical inference for this statistic is carried out through Monte Carlo simulations. However, for very large databases and/or when the p-value associated with the detected cluster must be estimated with high accuracy, Monte Carlo simulations are extremely time-consuming. First, we evaluated the influence of the shape of the scanning window on the distribution of two-dimensional discrete scan statistics.
A simulation study performed with square, rectangular and discrete circular scanning windows showed that the distributions of the associated scan statistics are very close to one another but significantly different. The power of the scan statistics was also shown, through a simulation study, to depend on the shape of the scanning window and on that of the cluster present under the alternative hypothesis. [...]
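
To make the notions of scanning window, scan statistic and Monte Carlo inference more concrete, here is a minimal sketch in Python. It is not the method developed in the thesis: it assumes a one-dimensional discrete domain, a fixed window length, and a null hypothesis under which, conditionally on the total number of events, each event falls uniformly at random in one of the cells. The function names `scan_statistic` and `monte_carlo_p_value` are hypothetical and chosen for illustration only.

```python
import random


def scan_statistic(counts, window):
    """Largest number of events in any run of `window` consecutive cells."""
    if not 1 <= window <= len(counts):
        raise ValueError("window must be between 1 and len(counts)")
    current = sum(counts[:window])
    best = current
    for i in range(window, len(counts)):
        # Slide the window by one cell: add the entering cell, drop the leaving one.
        current += counts[i] - counts[i - window]
        best = max(best, current)
    return best


def monte_carlo_p_value(counts, window, n_sim=999, seed=0):
    """Rank-based Monte Carlo p-value of the observed scan statistic.

    Assumed null hypothesis (illustrative): the total number of events is
    fixed and each event falls uniformly at random in one of the cells.
    """
    rng = random.Random(seed)
    observed = scan_statistic(counts, window)
    n_cells, n_events = len(counts), sum(counts)
    at_least_as_large = 0
    for _sim in range(n_sim):
        simulated = [0] * n_cells
        for _event in range(n_events):
            simulated[rng.randrange(n_cells)] += 1
        if scan_statistic(simulated, window) >= observed:
            at_least_as_large += 1
    # The observed data set counts as one of the n_sim + 1 replicates.
    return (at_least_as_large + 1) / (n_sim + 1)
```

For example, `scan_statistic([0, 1, 0, 3, 2, 0, 1], 2)` returns 5, the count in the window covering the cells holding 3 and 2 events. Because consecutive windows share cells, their counts are dependent, which is why the sketch falls back on simulation rather than on a closed-form null distribution.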

The notion of a cluster refers to the aggregation of events in time and/or space. In many fields, experts observe certain aggregations of events, and the question arises whether these aggregations can be considered normal (the result of chance) or not. From a probabilistic point of view, normality can be described by a null hypothesis of random distribution of events. The detection of clusters of events is a field of statistics that has expanded considerably over recent decades. The scientific community first set out to develop methods for the one-dimensional setting (e.g., time) and later extended them to the multidimensional case, in particular the two-dimensional one (space). Among all the methods for detecting clusters of events, three main types of tests can be distinguished. The first comprises global tests, which detect an overall tendency toward aggregation without locating any clusters. The second comprises focused tests, which are used when a priori knowledge makes it possible to define a point source (a date or a spatial location) and to test for aggregation around it. The third comprises cluster detection tests (tests without a defined point source), which locate clusters of events without a priori knowledge and test their statistical significance. In this thesis, we focused on this last category, and more specifically on methods based on scan statistics. These methods appeared in the early 1960s; they detect clusters of events and determine whether they are "normal" (the result of chance) or "abnormal".
The detection step is carried out by sweeping (scanning) the study domain (discrete or continuous) in which the events are observed (e.g., time, space, ...) with a window called the scanning window. This detection phase yields a set of windows, each defining a potential cluster. A scan statistic is a random variable defined as the maximum number of events observed in any such window. Scan statistics are used as test statistics to test that the observations are independent and follow a given distribution, against an alternative hypothesis favoring the existence of a cluster within the studied region. The main difficulty lies in determining the distribution of the scan statistic under the null hypothesis. Because it is defined as the maximum of a sequence of dependent random variables, the dependence being due to the overlap of the scanning windows, explicit solutions exist only in very rare cases. Part of the literature is therefore devoted to developing methods (exact formulas and, above all, approximations) for determining the distribution of scan statistics. Moreover, in the two-dimensional setting, the scanning window can take different geometric shapes (rectangular, circular, ...) that could influence the approximation of the distribution of the scan statistic. However, to our knowledge, no study has evaluated this influence. In the spatial setting, the spatial scan statistics developed by M. Kulldorff stand out as by far the methods most widely used by the scientific community. These methods scan the study domain with circular windows and select as the most likely cluster the one that maximizes a likelihood ratio test statistic. [...]
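
The likelihood ratio principle described above can be written out as follows. This is a sketch under the assumption of a Poisson model, which is one common form of Kulldorff's spatial scan statistic; the abstract does not say which model the thesis uses, and all symbols below are notation introduced here for illustration.

```latex
% Assumed notation (not from the source):
%   C      total number of observed cases in the study domain
%   c_Z    observed number of cases inside the circular window Z
%   \mu_Z  expected number of cases inside Z under the null hypothesis,
%          with expectations scaled so that they sum to C
\[
\lambda(Z) =
\begin{cases}
\left(\dfrac{c_Z}{\mu_Z}\right)^{c_Z}
\left(\dfrac{C - c_Z}{C - \mu_Z}\right)^{C - c_Z}
  & \text{if } c_Z > \mu_Z, \\[1.5ex]
1 & \text{otherwise,}
\end{cases}
\qquad
\Lambda = \max_{Z} \lambda(Z).
\]
% Rank-based Monte Carlo p-value from B data sets simulated under the null:
\[
\hat{p} = \frac{1 + \#\{\, b : \Lambda^{(b)} \ge \Lambda_{\mathrm{obs}} \,\}}{B + 1}.
\]
```

The most likely cluster is the window $Z$ that attains $\Lambda$. With $B$ simulated data sets the smallest attainable p-value is $1/(B+1)$, for example 0.001 for $B = 999$, so a more precise p-value requires proportionally more simulations.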

