On numerical resilience in linear algebra

Mawussi Zounon 1, 2
Abstract : As the computational power of high performance computing (HPC) systems continues to increase by using huge number of cores or specialized processing units, HPC applications are increasingly prone to faults. This study covers a new class of numerical fault tolerance algorithms at application level that does not require extra resources, i.e., computational unit or computing time, when no fault occurs. Assuming that a separate mechanism ensures fault detection, we propose numerical algorithms to extract relevant information from available data after a fault. After data extraction, well chosen part of missing data is regenerated through interpolation strategies to constitute meaningful inputs to numerically restart the algorithm. We have designed these methods called Interpolation-restart techniques for numerical linear algebra problems such as the solution of linear systems or eigen-problems that are the inner most numerical kernels in many scientific and engineering applications and also often ones of the most time consuming parts. In the framework of Krylov subspace linear solvers the lost entries of the iterate are interpolated using the available entries on the still alive nodes to define a new initial guess before restarting the Krylov method. In particular, we consider two interpolation policies that preserve key numerical properties of well-known linear solvers, namely the monotony decrease of the A-norm of the error of the conjugate gradient or the residual norm decrease of GMRES. We assess the impact of the fault rate and the amount of lost data on the robustness of the resulting linear solvers.For eigensolvers, we revisited state-of-the-art methods for solving large sparse eigenvalue problems namely the Arnoldi methods, subspace iteration methods and the Jacobi-Davidson method, in the light of Interpolation-restart strategies. For each considered eigensolver, we adapted the Interpolation-restart strategies to regenerate as much spectral information as possible. Through intensive experiments, we illustrate the qualitative numerical behavior of the resulting schemes when the number of faults and the amount of lost data are varied; and we demonstrate that they exhibit a numerical robustness close to that of fault-free calculations. In order to assess the efficiency of our numerical strategies, we have consideredan actual fully-featured parallel sparse hybrid (direct/iterative) linear solver, MaPHyS, and we proposed numerical remedies to design a resilient version of the solver. The solver being hybrid, we focus in this study on the iterative solution step, which is often the dominant step in practice. The numerical remedies we propose are twofold. Whenever possible, we exploit the natural data redundancy between processes from the solver toperform an exact recovery through clever copies over processes. Otherwise, data that has been lost and is not available anymore on any process is recovered through Interpolationrestart strategies. These numerical remedies have been implemented in the MaPHyS parallel solver so that we can assess their efficiency on a large number of processing units (up to 12; 288 CPU cores) for solving large-scale real-life problems.
Document type :
Complete list of metadatas

Cited literature [160 references]  Display  Hide  Download

Contributor : Abes Star <>
Submitted on : Friday, December 4, 2015 - 9:49:07 AM
Last modification on : Wednesday, April 4, 2018 - 1:24:20 AM
Long-term archiving on : Saturday, April 29, 2017 - 7:13:11 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01231838, version 2


Mawussi Zounon. On numerical resilience in linear algebra. Numerical Analysis [cs.NA]. Université de Bordeaux, 2015. English. ⟨NNT : 2015BORD0038⟩. ⟨tel-01231838v2⟩



Record views


Files downloads