
Understanding and improving HPC performance using Machine Learning and Statistical analysis

Salah Zrigui 1, 2 
2 DATAMOVE - Data Aware Large Scale Computing
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract: The infrastructure of High Performance Computing (HPC) systems is rapidly increasing in complexity and scale, with new components and innovations added at a fast pace. This motivates further efforts toward understanding such systems and designing new, better-adapted optimization schemes. This thesis is a series of data-driven analytical and experimental campaigns with two goals in mind: (i) to improve the performance of HPC systems, with a focus on scheduler performance; and (ii) to better understand the inner workings of HPC systems, including scheduling evaluation methods and the energy behavior of submitted jobs.

We start with a comparative study focused on the evaluation methods of schedulers. We study two well-established metrics (waiting time and slowdown) and one less popular metric (per-processor slowdown). We also evaluate other effects, such as the relationship between job size and slowdown, the distribution of slowdown values, and the number of backfilled jobs. We focus on the popular First-Come-First-Served (FCFS) policy and compare it to other simple scheduling policies. We show that relinquishing FCFS is not as risky as it is perceived to be, and we argue that using other ordering policies in combination with a simple thresholding mechanism can offer similar guarantees with significantly better performance.

We then show the limits of simple scheduling policies, and we design and test two machine-learning-based paradigms to improve performance beyond what these basic policies can offer. First, we propose a method to dynamically generate new scheduling policies that adapt to the changing nature of the data on any given platform.
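For reference, the evaluation metrics and the thresholding idea mentioned above can be sketched as follows (a minimal illustration in Python; the field names, the threshold value, and the shortest-job-first ordering are illustrative assumptions, not details taken from the thesis):

```python
# Common scheduler-evaluation metrics (times in seconds, illustrative).
def waiting_time(submit, start):
    return start - submit

def slowdown(submit, start, runtime):
    # (wait + run) / run; bounded-slowdown variants clamp tiny runtimes.
    return (start - submit + runtime) / runtime

def per_processor_slowdown(submit, start, runtime, processors):
    # Slowdown normalized by the job's degree of parallelism.
    return slowdown(submit, start, runtime) / processors

# A simple thresholding mechanism: order the queue by some policy
# (here, shortest estimated runtime), but move any job that has
# waited longer than `threshold` to the front in FCFS order, which
# bounds starvation.
def order_queue(queue, now, threshold):
    starved = [j for j in queue if now - j["submit"] > threshold]
    fresh = [j for j in queue if now - j["submit"] <= threshold]
    return (sorted(starved, key=lambda j: j["submit"])
            + sorted(fresh, key=lambda j: j["estimate"]))
```

The promotion rule is what lets a non-FCFS ordering keep FCFS-like worst-case guarantees: no job can wait more than the threshold before regaining priority.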
We also study the possibility of applying online learning to scheduling data, and we detail the difficulties one might encounter in such an endeavor. For the second approach, we improve the performance of already established scheduling policies by reducing the inherent uncertainty in the scheduling data, namely the inaccuracy of user runtime estimates. We propose a simple classification of jobs into small and large, and we show that this classification is sufficient to harness most of the improvement that can be gained from accurate runtime estimates. We use machine learning to predict the classes and improve performance across all studied platforms.

Finally, we analyze the energy consumption of HPC platforms. We study the energy profiles of individual jobs, observe the similarities and differences between them, and propose a series of statistical tests through which we classify the jobs as periodic, constant, or non-stationary. We believe that this classification can be used to predict the energy consumption of future jobs and to build energy-aware schedulers.
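The three-way classification of energy profiles could be approximated along these lines (a toy sketch based on variance and lag autocorrelation; the tolerances and the tests themselves are assumptions for illustration, not the thesis's actual test battery):

```python
def autocorr(xs, lag):
    """Sample autocorrelation of the series xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    c0 = sum((x - mean) ** 2 for x in xs)
    ck = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return ck / c0

def classify_profile(xs, var_tol=1e-6, acf_tol=0.8):
    """Label a job's power trace as constant, periodic, or non-stationary."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    if variance < var_tol:
        return "constant"  # essentially flat power draw
    # Strong positive autocorrelation at some lag suggests periodicity.
    best = max(autocorr(xs, lag) for lag in range(1, n // 2))
    return "periodic" if best > acf_tol else "non-stationary"
```

In practice one would rely on proper stationarity and periodicity tests from a statistics library rather than these ad-hoc thresholds, but the sketch conveys how a power trace can be routed into one of the three classes.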
Submitted on : Friday, August 27, 2021 - 12:16:11 PM


Version validated by the jury (STAR)


  • HAL Id : tel-03327540, version 1


Salah Zrigui. Understanding and improving HPC performance using Machine Learning and Statistical analysis. Symbolic Computation [cs.SC]. Université Grenoble Alpes [2020-..], 2021. English. ⟨NNT : 2021GRALM012⟩. ⟨tel-03327540⟩


