
On Some Unsupervised Learning Problems for Highly Dependent Time Series

Azadeh Khaleghi 1
1 SEQUEL - Sequential Learning
LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal
Abstract : This thesis is devoted to the theoretical analysis of unsupervised learning problems involving highly dependent time-series. Two fundamental problems are considered, namely change point estimation and time-series clustering. The problems are considered in an extremely general framework, where the data are assumed to be generated by arbitrary, unknown stationary ergodic process distributions. This is one of the weakest assumptions in statistics: it is more general than the parametric and model-based settings, and it subsumes most of the non-parametric frameworks considered for this class of problems. Those frameworks typically assume that each time-series consists of independent and identically distributed observations or that it satisfies certain mixing conditions. For each of the considered problems, novel nonparametric methods are proposed and are shown to be asymptotically consistent in this general framework. For change point estimation, asymptotic consistency refers to the algorithm's ability to produce change point estimates that are asymptotically arbitrarily close to the true change points. A clustering algorithm, in turn, is asymptotically consistent if the output clustering, restricted to each fixed batch of sequences, coincides with the target clustering from some time on. The proposed algorithms are shown to be efficiently implementable, and the theoretical results are complemented with experimental evaluations. Statistical analysis in the stationary ergodic framework is extremely challenging. In general, rates of convergence (even of frequencies to respective probabilities) are provably impossible to obtain for this class of processes.
As a result, given a pair of samples generated independently by stationary ergodic process distributions, it is provably impossible to distinguish between the case where they are generated by the same process and the case where they are generated by two different ones. This, in turn, implies that such problems as time-series clustering with an unknown number of clusters, or change point detection, cannot possibly admit consistent solutions. Thus, a challenging task is to discover problem formulations that admit consistent solutions in this general framework. The main contribution of this thesis is to demonstrate constructively that, despite these theoretical impossibility results, natural formulations of the considered problems exist that admit consistent solutions in this general framework. Specifically, natural formulations of change point estimation and time-series clustering are proposed, and efficient algorithms are provided, which are shown to be asymptotically consistent under the assumption that the process distributions are stationary ergodic. This includes a demonstration that the correct number of change points can be found without imposing stronger assumptions on the process distributions. It turns out that in this formulation the change point estimation problem can be reduced to time-series clustering. The results presented in this work lay down the theoretical foundations for the analysis of sequential data in a broad range of real-world applications.

Cited literature : 79 references
Contributor : Azadeh Khaleghi
Submitted on : Tuesday, December 17, 2013 - 11:19:42 PM
Last modification on : Thursday, January 20, 2022 - 4:16:31 PM
Long-term archiving on : Saturday, April 8, 2017 - 7:40:08 AM


  • HAL Id : tel-00920184, version 1


Azadeh Khaleghi. On Some Unsupervised Learning Problems for Highly Dependent Time Series. Statistics [math.ST]. Institut national de recherche en informatique et en automatique (INRIA), 2013. English. ⟨tel-00920184⟩


