
Transforming TLP into DLP with the dynamic inter-thread vectorization architecture

Sajith Kalathingal 1, 2
1 ALF - Amdahl's Law is Forever
Inria Rennes – Bretagne Atlantique, IRISA-D3 - ARCHITECTURE
2 PACAP - Pushing Architecture and Compilation for Application Performance
Inria Rennes – Bretagne Atlantique, IRISA-D3 - ARCHITECTURE
Abstract : Many modern microprocessors implement Simultaneous Multi-Threading (SMT) to improve the overall efficiency of superscalar CPUs. SMT hides long-latency operations by executing instructions from multiple threads simultaneously. SMT may execute threads of different processes, threads of the same process, or any combination of them. When the threads are from the same process, they often execute the same instructions on different data most of the time, especially in the case of Single-Program Multiple-Data (SPMD) applications.

Traditional SMT architectures exploit thread-level parallelism and, with the use of SIMD execution units, also support explicit data-level parallelism. SIMD execution is power-efficient because the total number of instructions required to execute a complete program is significantly reduced. This instruction reduction depends on the width of the SIMD execution units and on the vectorization efficiency. Static vectorization efficiency depends on the programmer's skill and on the compiler. Programs are often not optimized for vectorization, which results in inefficient static vectorization by the compiler.

In this thesis, we propose the Dynamic Inter-Thread Vectorization Architecture (DITVA) to leverage the implicit data-level parallelism in SPMD applications by assembling dynamic vector instructions at runtime. DITVA optimizes a SIMD-enabled in-order SMT processor with an inter-thread vectorization execution mode. When the threads are running in lockstep, similar instructions across threads are dynamically vectorized to form a SIMD instruction. The threads in the convergent paths share an instruction stream. When all the threads are in the convergent path, there is only a single stream of instructions. To optimize the performance in such cases, DITVA statically groups threads into fixed-size, independently scheduled warps. DITVA leverages existing SIMD units and maintains binary compatibility with existing CPU architectures.
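The idea of forming a SIMD instruction from threads running in lockstep can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism of the thesis: the function name `form_dynamic_vectors`, the default warp size of 4, and the use of an equal program counter as the test for "same instruction" are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def form_dynamic_vectors(thread_pcs, warp_size=4):
    """Toy model of inter-thread vectorization (illustrative assumption).

    thread_pcs: current program counter of each thread, one entry per thread.
    Threads are statically grouped into fixed-size warps. Inside a warp,
    threads that sit at the same program counter are merged into a single
    vector instruction with a per-lane activity mask.

    Returns a list of (warp_id, pc, mask) tuples, one per issued instruction.
    """
    issued = []
    for warp_start in range(0, len(thread_pcs), warp_size):
        warp = thread_pcs[warp_start:warp_start + warp_size]
        lanes_by_pc = defaultdict(list)
        for lane, pc in enumerate(warp):
            lanes_by_pc[pc].append(lane)
        # One instruction per distinct program counter in the warp.
        for pc in sorted(lanes_by_pc):
            active = set(lanes_by_pc[pc])
            mask = [lane in active for lane in range(len(warp))]
            issued.append((warp_start // warp_size, pc, mask))
    return issued


if __name__ == "__main__":
    # Warp 0 is fully convergent; warp 1 has diverged onto two paths.
    pcs = [0x10, 0x10, 0x10, 0x10, 0x20, 0x24, 0x20, 0x24]
    for warp_id, pc, mask in form_dynamic_vectors(pcs):
        print(warp_id, hex(pc), mask)
```

In this model the eight threads issue three instructions instead of eight scalar ones: one full-width instruction for the convergent warp and two masked instructions for the divergent warp, each covering the threads that share a path.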

Contributor : Abes Star
Submitted on : Monday, August 28, 2017 - 4:30:21 PM
Last modification on : Tuesday, February 25, 2020 - 8:08:10 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01426915, version 3


Sajith Kalathingal. Transforming TLP into DLP with the dynamic inter-thread vectorization architecture. Hardware Architecture [cs.AR]. Université Rennes 1, 2016. English. ⟨NNT : 2016REN1S133⟩. ⟨tel-01426915v3⟩


