Advanced Simulation for Resource Management

Adrien Faure 1, 2
2 DATAMOVE - Data Aware Large Scale Computing
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract : High-Performance Computing (HPC) provides the computational power dedicated to solving the complex problems of our society. HPC computers are large-scale, distributed infrastructures composed of several thousand computing cores. The management of these systems is left to a single piece of software: the Resources and Jobs Management System (RJMS). The RJMS has several objectives: managing the physical infrastructure and handling user requests for access to the computing power. The scheduling algorithm is the cornerstone of the RJMS; it decides where and when users' jobs will be executed. Scheduling is a difficult problem: to manage large-scale platforms, an RJMS needs efficient yet scalable scheduling heuristics. Evaluating and testing new scheduling algorithms is crucial before releasing them into production. Any failure can have a dramatic impact on the HPC platform, leading to wasted time, energy, and resources. The lack of a platform dedicated to experiments and tests compels RJMS designers and HPC center administrators to use different tools and methodologies to evaluate new algorithms. In the first part of this dissertation, we present and evaluate a new scheduling heuristic with job redirection. The evaluation is done through a large simulation campaign, which shows that redirecting jobs can improve the efficiency of the scheduling. In the second part, we focus on and extend the tools and methodologies available to experiment with RJMSs. This part is twofold. First, we propose to extend scheduling simulations with job models that simulate network contention between jobs. Second, we propose new tools that enable experiments with production RJMSs without the need for an HPC platform. This dissertation aims to broaden the landscape of tools and methodologies for experimenting with RJMSs and thereby help bring new scheduling algorithms into production.

https://tel.archives-ouvertes.fr/tel-03155702
Contributor : Abes Star
Submitted on : Tuesday, March 2, 2021 - 8:37:08 AM
Last modification on : Monday, April 12, 2021 - 6:37:48 PM
Long-term archiving on : Monday, May 31, 2021 - 6:15:59 PM

File

FAURE_2020_archivage.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-03155702, version 1

Citation

Adrien Faure. Advanced Simulation for Resource Management. Computer Arithmetic. Université Grenoble Alpes [2020-..], 2020. English. ⟨NNT : 2020GRALM056⟩. ⟨tel-03155702⟩

Metrics

Record views : 112

Files downloads : 69