Optimization of data transfer on many-core processors, applied to dense linear algebra and stencil computations

Abstract: The upcoming Exascale target in High Performance Computing (HPC) and disruptive achievements in artificial intelligence have given rise to alternative, non-conventional many-core architectures that combine the energy efficiency typical of embedded systems with the software ecosystem of classic HPC platforms. A key enabler of energy-efficient computing on many-core architectures is the exploitation of data locality, specifically the use of scratchpad memories in combination with DMA engines to overlap computation and communication. Such a software paradigm raises considerable programming challenges for both the vendor and the application developer.

In this thesis, we tackle the memory-transfer and performance issues, as well as the programming challenges, of memory- and compute-intensive HPC applications on the Kalray MPPA many-core architecture. With the first, memory-bound use case of the lattice Boltzmann method (LBM), we provide generic and fundamental techniques for decomposing three-dimensional iterative stencil problems onto clustered many-core processors fitted with scratchpad memories and DMA engines. The resulting DMA-based streaming and overlapping algorithm delivers a 33% performance gain over the default cache-based implementation. High-dimensional stencil computation suffers from a serious I/O bottleneck and limited on-chip memory space. We therefore developed a new in-place LBM propagation algorithm, which halves the memory footprint and yields 1.5 times higher performance-per-byte efficiency than the state-of-the-art out-of-place algorithm. On the compute-intensive side, with dense linear algebra computations, we build an optimized matrix-multiplication benchmark based on the exploitation of scratchpad memory and efficient asynchronous DMA communication.
These techniques are then extended to a DMA module for the BLIS framework, which lets us instantiate an optimized and portable level-3 BLAS numerical library on any DMA-based architecture in fewer than 100 lines of code. We achieve 75% of peak performance on the MPPA processor with the matrix-multiplication operation (GEMM) from the standard BLAS library, without having to write thousands of lines of laboriously optimized code for the same result.
Document type :
Theses
Cited literature: 112 references

https://tel.archives-ouvertes.fr/tel-02426014
Contributor: Abes Star
Submitted on: Wednesday, January 1, 2020 - 1:13:42 AM
Last modification on: Friday, October 23, 2020 - 5:00:18 PM

File

HO_2018_diffusion.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02426014, version 1

Collections

STAR | CNRS | LIG | UGA

Citation

Minh Quan Ho. Optimization of data transfer on many-core processors, applied to dense linear algebra and stencil computations. Performance [cs.PF]. Université Grenoble Alpes, 2018. English. ⟨NNT : 2018GREAM042⟩. ⟨tel-02426014⟩

Metrics: 103 record views · 284 file downloads