Adapting a HPC runtime system to FPGAs

Georgios Christodoulis

Abstract

Alongside traditional CPU cores, the HPC community has adopted processing units of other architectures to improve efficiency and performance. A Field Programmable Gate Array (FPGA) is a hardware fabric composed of interconnected reprogrammable logic and memory blocks. This type of processing unit is a promising candidate for increasing the computational power of heterogeneous HPC platforms: because few abstraction layers separate the programming level from the actual hardware, FPGAs can deliver both efficiency and performance. However, exploiting them requires in-depth knowledge of low-level hardware design and considerable expertise with vendor-provided tools, which does not match the skills of HPC application programmers. In this thesis, we designed a framework that allows straightforward development of scientific applications on heterogeneous platforms that include FPGAs. The work aims at a programming environment that requires minimal knowledge of the underlying architecture and in which an FPGA can be used in the same way as any other accelerator.
At the core of the environment is the StarPU heterogeneous runtime system, which we extended to support FPGAs. It hides from the programmer the complex operations that stem from the underlying architecture, while still allowing fine control of performance through different scheduling strategies. For communication with the FPGA device, we created Conor, a communication library based on RIFFA that keeps the accelerator consistent when several software threads interact with it concurrently. Our approach is evaluated along two dimensions: the programmability of the framework, and the performance overhead imposed by the additional components attached to the FPGA. Programmability was evaluated using a basic blocked version of matrix multiplication, which is also used to demonstrate that our extensions did not impose any additional overhead on the rest of the platform. Building on this first matrix multiplication example, we created an efficient hardware design of gemm, which will allow the execution of more complex and interesting applications such as the Cholesky decomposition.
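As an illustration of the blocked matrix multiplication mentioned in the abstract, the following is a minimal sketch in plain Python. It is not the thesis code and does not use the StarPU or Conor APIs; the function name `blocked_matmul` and the `block` parameter are hypothetical. It only shows how the product decomposes into independent tile updates, the kind of work unit a task-based runtime could hand to a CPU core or an accelerator.

```python
def blocked_matmul(a, b, block=2):
    """Multiply a (n x k) by b (k x m) tile by tile; returns a new n x m matrix.

    Hypothetical helper for illustration only, not part of StarPU or Conor.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # tile row of C
        for j0 in range(0, m, block):      # tile column of C
            for k0 in range(0, k, block):  # tiles along the shared dimension
                # One tile update: C[i0.., j0..] += A[i0.., k0..] * B[k0.., j0..]
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + block, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(blocked_matmul(a, b, block=2))
```

Tile updates that write to different tiles of C are independent of each other, which is what makes the blocked formulation a convenient test case for a scheduler; updates to the same tile of C must be applied one after another.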

In addition to traditional CPU cores, other processing units are used by the High Performance Computing (HPC) community to obtain improved efficiency and performance. A Field Programmable Gate Array (FPGA) is a processing unit composed of interconnected reprogrammable logic and memory blocks. This type of processing unit is a promising candidate for improving the computing power of HPC platforms because it reduces the number of abstraction layers between the programming level and the actual hardware. In return, exploiting FPGAs requires in-depth knowledge of low-level hardware design and considerable expertise with vendor-provided tools, which does not match that of HPC programmers. In this thesis, we designed a framework that allows simple development of scientific applications on heterogeneous platforms that integrate FPGAs. At the core of our framework is the StarPU heterogeneous runtime system, which was extended to support FPGAs; it hides from programmers the complex operations that stem from the complexity of the underlying architecture and allows fine control of performance through different scheduling strategies. For communication with the FPGA, we created Conor, a communication library based on RIFFA, which ensures the consistency of the accelerator in scenarios where software threads interact simultaneously with the computation performed on the FPGA. Our approach is evaluated along two axes, one corresponding to programmability and the other to the overheads imposed by the additional components attached to the FPGA. The programmability of the framework was evaluated using a blocked version of matrix multiplication. This matrix multiplication is also used to demonstrate that our extensions to StarPU did not impose any additional overhead. In addition to the first matrix multiplication example, we created an efficient hardware design of gemm, which will allow the execution of more complex and interesting applications such as the Cholesky decomposition.

Adapting an HPC system to integrate FPGAs
