
Efficient end-to-end monitoring for fault management in distributed systems

Dawei Feng 1, 2 
2 TAO - Machine Learning and Optimisation
LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract : In this dissertation, we present our work on fault management in distributed systems, motivated by the monitoring of faults and abrupt changes in large computing systems such as grids and clouds. Instead of building complete a priori knowledge of the software and hardware infrastructures, as conventional detection and diagnosis methods do, we propose end-to-end monitoring for such large-scale systems, leaving the inaccessible details of the components involved in a black box.

For fault monitoring of a distributed system, we first model this probe-based application as a static collaborative prediction (CP) task, and demonstrate experimentally the effectiveness of CP methods using max-margin matrix factorization. We then introduce active learning into the CP framework and show its critical advantage in dealing with highly imbalanced data, which is especially useful for identifying the minority fault class.

We further extend static fault monitoring to the sequential case by proposing the sequential matrix factorization (SMF) method. SMF takes a sequence of partially observed matrices as input and produces predictions using information from both the current and past time windows. Active learning is also applied to SMF so that highly imbalanced data can be handled properly. In addition to the sequential methods, smoothing the estimation sequence has been shown to be a practically useful trick for improving sequential prediction performance.

Since the stationarity assumption used in static and sequential fault monitoring becomes unrealistic in the presence of abrupt changes, we propose a semi-supervised online change detection (SSOCD) framework to detect intended changes in time series data. In this way, the static model of the system can be recomputed once an abrupt change is detected. In SSOCD, an unsupervised offline method is proposed to analyze a sample data series. The change points thus detected are used to train a supervised online model, which decides online whether a change is present in the arriving data sequence. State-of-the-art change detection methods are employed to demonstrate the usefulness of the framework.

All presented work is validated on real-world datasets. Specifically, the fault monitoring experiments are conducted on a dataset collected from the Biomed grid infrastructure within the European Grid Initiative, and the abrupt change detection framework is validated on a dataset concerning performance changes of an online site with a large amount of traffic.

Cited literature [5 references]
Contributor : ABES STAR
Submitted on : Tuesday, July 1, 2014 - 5:14:09 PM
Last modification on : Sunday, June 26, 2022 - 12:01:36 PM
Long-term archiving on : Tuesday, October 13, 2015 - 2:50:57 PM


  • HAL Id : tel-01017083, version 1



Dawei Feng. Efficient end-to-end monitoring for fault management in distributed systems. Machine Learning [cs.LG]. Université Paris Sud - Paris XI, 2014. English. ⟨NNT : 2014PA112044⟩. ⟨tel-01017083⟩


