Skip to Main content Skip to Navigation
Theses

Cooperative Resource Management for Parallel and Distributed Systems

Abstract : High-Performance Computing (HPC) resources, such as Supercomputers, Clusters, Grids and HPC Clouds, are managed by Resource Management Systems (RMSs) that multiple resources among multiple users and decide how computing nodes are allocated to user applications. As more and more petascale computing resources are built and exascale is to be achieved by 2020, optimizing resource allocation to applications is critical to ensure their efficient execution. However, current RMSs, such as batch schedulers, only offer a limited interface. In most cases, the application has to blindly choose resources at submittal without being able to adapt its choice to the state of the target resources, neither before it started nor during execution. The goal of this Thesis is to improve resource management, so as to allow applications to efficiently allocate resources. We achieve this by proposing software architectures that promote collaboration between the applications and the RMS, thus, allowing applications to negotiate the resources they run on. To this end, we start by analysing the various types of applications and their unique resource requirements, categorizing them into rigid, moldable, malleable and evolving. For each case, we highlight the opportunities they open up for improving resource management.The first contribution deals with moldable applications, for which resources are only negotiated before they start. We propose CooRMv1, a centralized RMS architecture, which delegates resource selection to the application launchers. Simulations show that the solution is both scalable and fair. The results are validated through a prototype implementation deployed on Grid’5000. Second, we focus on negotiating allocations on geographically-distributed resources, managed by multiple institutions. We build upon CooRMv1 and propose distCooRM, a distributed RMS architecture, which allows moldable applications to efficiently co-allocate resources managed by multiple independent agents. Simulation results show that distCooRM is well-behaved and scales well for a reasonable number of applications. Next, attention is shifted to run-time negotiation of resources, so as to improve support for malleable and evolving applications. We propose CooRMv2, a centralized RMS architecture, that enables efficient scheduling of evolving applications, especially non-predictable ones. It allows applications to inform the RMS about their maximum expected resource usage, through pre-allocations. Resources which are pre-allocated but unused can be filled by malleable applications. Simulation results show that considerable gains can be achieved. Last, production-ready software are used as a starting point, to illustrate the interest as well as the difficulty of improving cooperation between existing systems. GridTLSE is used as an application and DIET as an RMS to study a previously unsupported use-case. We identify the underlying problem of scheduling optional computations and propose an architecture to solve it. Real-life experiments done on the Grid’5000 platform show that several metrics are improved, such as user satisfaction, fairness and the number of completed requests. Moreover, it is shown that the solution is scalable.
Document type :
Theses
Complete list of metadatas

https://tel.archives-ouvertes.fr/tel-00780719
Contributor : Abes Star :  Contact
Submitted on : Thursday, January 24, 2013 - 4:27:36 PM
Last modification on : Monday, May 4, 2020 - 11:38:50 AM
Long-term archiving on: : Thursday, April 25, 2013 - 3:55:20 AM

File

KLEIN_HALMAGHI_Cristian_2012_T...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-00780719, version 1

Citation

Cristian Klein-Halmaghi. Cooperative Resource Management for Parallel and Distributed Systems. Other [cs.OH]. Ecole normale supérieure de lyon - ENS LYON, 2012. English. ⟨NNT : 2012ENSL0773⟩. ⟨tel-00780719⟩

Share

Metrics

Record views

967

Files downloads

421