Architecture and Programming Model Support for Reconfigurable Accelerators in Multi-Core Embedded Systems
Satyajit Das

HAL Id: tel-01989827
https://tel.archives-ouvertes.fr/tel-01989827
Submitted on 22 Jan 2019
Doctoral thesis of

UNIVERSITÉ BRETAGNE SUD
COMUE UNIVERSITE BRETAGNE LOIRE

Doctoral School No. 601
Mathematics and Information and Communication Sciences and Technologies
Speciality: Electronics
By
Satyajit DAS

Architecture and Programming Model Support For Reconfigurable Accelerators in Multi-Core Embedded Systems

Thesis presented and defended in Lorient, on 4 June 2018
Research unit: Lab-STICC
Thesis No.: 491

Reviewers before the defence:

Michael HÜBNER  Professor, Ruhr-Universität Bochum
Jari NURMI  Professor, Tampere University of Technology

Composition of the jury:

François PÊCHEUX  Professor, Sorbonne Université
President (to be specified after the defence)
Angeliki KRITIKAKOU  Associate Professor, Université Rennes 1
Davide ROSSI  Assistant Professor, Université de Bologna
Kevin MARTIN  Associate Professor, Université Bretagne Sud
Thesis supervisor
Philippe COUSSY  Professor, Université Bretagne Sud
Thesis co-supervisor
Luca BENINI  Professor, Université de Bologna
Emerging trends in embedded systems and applications demand high throughput and low power consumption. Due to the increasing demand for low-power computing and diminishing returns from technology scaling, industry and academia are turning with renewed interest toward energy-efficient hardware accelerators. The main drawback of hardware accelerators is that they are not programmable. Their utilization can therefore be low, as each performs one specific function, and increasing the number of accelerators in a system on chip (SoC) causes scalability issues. Programmable accelerators provide flexibility and solve the scalability issues.

The Coarse-Grained Reconfigurable Array (CGRA) architecture, consisting of several processing elements with word-level granularity, is a promising choice for a programmable accelerator. Inspired by the promising characteristics of programmable accelerators, this thesis studies the potential of CGRAs in near-threshold computing platforms and develops an end-to-end CGRA research framework.

The major contributions of this framework are CGRA design, implementation, integration in a computing system, and compilation for the CGRA. First, the design and implementation of a CGRA named Integrated Programmable Array (IPA) is presented. Next, the problem of mapping applications with control and data flow onto the CGRA is formulated. From this formulation, several efficient algorithms are developed using the internal resources of a CGRA, with a vision for low-power acceleration. The algorithms are integrated into an automated compilation flow. Finally, the IPA accelerator is integrated into PULP, a Parallel Ultra-Low-Power Processing-Platform, to explore heterogeneous computing.
# Table of contents

List of figures
List of tables
Introduction

1 Background and Related Work

1.1 Design Space
  1.1.1 Computational Resources
  1.1.2 Interconnection Network
  1.1.3 Reconfigurability
  1.1.4 Register Files
  1.1.5 Memory Management
1.2 Compiler Support
  1.2.1 Data Level Parallelism (DLP)
  1.2.2 Instruction Level Parallelism (ILP)
  1.2.3 Thread Level Parallelism
1.3 Mapping
1.4 Representative CGRAs
  1.4.1 MorphoSys
  1.4.2 ADRES
  1.4.3 RAW
  1.4.4 TCPA
  1.4.5 PACT XPP
1.5 Conclusion

2 Design of The Reconfigurable Accelerator

2.1 Design Choices
2.2 Integrated Programmable Array Architecture
  2.2.1 IPA components
  2.2.2 Computation Model
2.3 Conclusion

3 Compilation flow for the Integrated Programmable Array Architecture

3.1 Background
  3.1.1 Architecture model
  3.1.2 Application model
  3.1.3 Homomorphism
  3.1.4 Supporting Control Flow
3.2 Compilation flow
  3.2.1 DFG mapping
  3.2.2 CDFG mapping
  3.2.3 Assembler
3.3 Conclusion

4 IPA performance evaluation

4.1 Implementation of the IPA
  4.1.1 Area Results
  4.1.2 Memory Access Optimization
  4.1.3 Comparison with low-power CGRA architectures
4.2 Compilation
  4.2.1 Performance evaluation of the compilation flow
  4.2.2 Comparison of the register allocation approach with state of the art predication techniques
  4.2.3 Compiling smart visual trigger application
4.3 Conclusion

5 The Heterogeneous Parallel Ultra-Low-Power Processing-Platform (PULP) Cluster

5.1 PULP heterogeneous architecture
  5.1.1 PULP SoC overview
  5.1.2 Heterogeneous Cluster
5.2 Software infrastructure
5.3 Implementation and Benchmarking
  5.3.1 Implementation Results
  5.3.2 Performance and Energy Consumption Results
5.4 Conclusion

Summary and Future work
References
List of figures

1  Block diagram of CGRA
2  Wide spectrum of Accelerators
3  Energy efficiency vs Flexibility and Performance of CGRA vs CPU, DSP, MC, GPU, FPGA and ASIC
4  Thesis overview

1.1  Different interconnect topologies
1.2  Reconfigurability
1.3  MorphoSys Architecture
1.4  ADRES Architecture
1.5  RAW Architecture
1.6  TCPA Architecture
1.7  PAE architecture of PACT XPP

2.1  Latency comparison for $3 \times 3$ CGRA
2.2  Latency comparison for $4 \times 4$ CGRA
2.3  Latency comparison for $5 \times 5$ CGRA
2.4  Execution strategies in multi-core cluster and IPA cluster
2.5  System level description of the Integrated Programmable Array Architecture
2.6  Components of PE
2.7  The configuration network for load-context
2.8  Segments of the GCM
2.9  Format of the data and address bus in the configuration network
2.10 Sample execution in CPU and IPA

3.1  Architecture and application model used for mapping
3.2  CDFG of a sample program
3.3  Classification of control flow present in an application
3.4  Mapping using full and partial predication
3.5  Compilation flow
3.6  DFG scheduling and transformation
3.7  Graph transformation
3.8  Performance comparison for different threshold functions
3.9  Mapping latency comparison for 3x3 CGRA
3.10 Compilation time comparison for 3x3 CGRA
3.11 Architectural coverage between methods
3.12 Assignment routing graph transformation
3.13 Different location constraints in CDFG mapping

4.1  Synthesized area of IPA for different number of TCDM banks
4.2  Performance of IPA executing matrix multiplication of different size
4.3  Latency performance in different configurations ([#LSUs][#TCDM Banks])
4.4  Average power breakdown in different configurations ([#LSUs][#TCDM Banks])
4.5  Average energy efficiency for different configurations ([#LSUs][#TCDM Banks])
4.6  Energy efficiency/area trade-off between several configurations ([#LSUs][#TCDM Banks])

5.1  PULP SoC. Source [89]
5.2  Block diagram of the heterogeneous cluster
5.3  Block diagram of the IPA subsystem
5.4  Synchronous interface for reliable data transfer between the two clock domains
5.5  Power consumption breakdown while executing compute intensive kernel in PULP heterogeneous cluster
5.6  Power consumption breakdown while executing control intensive kernel in PULP heterogeneous cluster
## List of tables

1.1 Qualitative comparison between different architectural approaches
1.2 CGRA design space and compiler support: CR - Computational Resources; IN - Interconnect Network; RC - Reconfigurability; MM - Memory management; CS - Compilation Support; MP - Mapping; PR - Parallelism

2.1 Characteristics of the benchmarks used for design space exploration
2.2 Energy gain in IPA cluster compared to the multi-core execution
2.3 Structure of a segment
2.4 Instruction format
2.5 Summary of the opcodes (R = Result, C = Condition bit)

3.1 Comparison between different approaches to manage control flow in CGRA
3.2 Comparison of RLC and TLC numbers between different CDFG traversal

4.1 Specifications of memories used in TCDM and each PE of the IPA
4.2 Code size and the maximum depth of loop nests for the different kernels in the IPA
4.3 Overall instructions executed and energy consumption in IPA vs CPU
4.4 Comparison with the state of the art low power targets
4.5 Performance (cycles) comparison between the register allocation approach and the state of the art approaches
4.6 Energy consumption (µJ) comparison between the register allocation approach and the state of the art approaches
4.7 Performance comparison
4.8 Performance comparison of smart visual surveillance application [cycles/pixel]

5.1 List of APIs for controlling IPA
5.2 Cluster Parameters and memories used
5.3 Synthesized area information for the PULP heterogeneous cluster
5.4 Performance evaluation in execution time (ns) for different configuration in the heterogeneous platform
5.5 Performance comparison between iso-frequency and $2 \times$ frequency execution in IPA
5.6 Energy consumption evaluation in $\mu$J for different configuration in the heterogeneous platform
5.7 Comparison between total number of memory operations executed
Introduction

As transistor density increases, power dissipation improves very little with each generation of Moore’s law. As a result, for fixed chip-level power budgets, the fraction of transistors that can be active at full frequency is decreasing exponentially. The empirical studies in [33] show that the strategy of enhancing performance by increasing the number of cores will probably fail, since voltage scaling has slowed or almost stopped and the power consumption of individual cores is not decreasing enough to allow an increase in the number of active computing units. Hence, as technology scales, an increasing fraction of the silicon will have to be dark, i.e., powered off or under-clocked. The same study estimated that at 8 nm, more than 50% of the chip will have to be dark. The most popular approach to improving energy efficiency is a heterogeneous multi-core populated with a collection of specialized or custom hardware accelerators (HWACCs), each optimized for a specific task such as graphics processing, signal processing or cryptographic computation.

Depending on the specific application domain, the trend is to have a few general-purpose processors accelerated by highly optimized application-specific hardware accelerators (ASIC-HWACCs) or General-Purpose Graphics Processing Units (GPGPUs). Although ASIC-HWACCs provide the best performance/power/area figures, their lack of flexibility drastically limits their applicability to a few domains (i.e., those where the same device can be used to cover large volumes, or where the cost of silicon is not an issue). Graphics Processing Units (GPUs) are very popular in high performance computing (HPC). Although they are programmable, their energy efficiency and performance advantages are limited to parallel loops [92]. Moreover, GPUs require significant programming effort using specialized languages (e.g., CUDA). GPUs have rapidly evolved beyond graphics processing to also perform general purpose computing, in which case they are referred to as GPGPUs. As GPGPUs consist of thousands of cores, they are excellent computing platforms for workloads that can be partitioned into a large number of threads with minimal interaction between the threads. The effectiveness of GPGPUs decreases significantly as the number of workload partitions decreases or the interaction between them increases [104]. In addition, memory access contention across threads should be minimized to reduce the performance penalty [104]. In such an acceleration model, programmers are responsible for finding an effective way to partition the workload.

Between the two extremes of ASIC-HWACCs and GPUs, Field Programmable Gate Arrays (FPGAs) provide a high degree of reconfigurability. FPGAs offer significant advantages in terms of sharing hardware between distinct isolated tasks under tight time constraints. Historically, the reconfigurable resources available in FPGA fabrics have been used to build high performance accelerators in specific domains. The HPC domain has driven the development of reconfigurable resources, from relatively simple modules to highly parametrizable and configurable subsystems. While FPGAs started out as a matrix of programmable processing elements, called configurable logic blocks (CLBs), connected by programmable interconnect to configurable I/O, they have evolved to also include a variety of processing macros, such as reconfigurable embedded memories and DSP blocks, to improve the efficiency of FPGA-based accelerators. This flexibility, however, comes at the cost of programming difficulty and high static power consumption [41]. The high energy overhead due to fine-grained reconfigurability and long interconnects limits their use in ultra-low-power environments such as the internet-of-things (IoT) and wireless sensor networks. In addition, the limitations and overhead of reconfiguring FPGAs at run-time significantly restrict their use in a wider set of energy-constrained applications. Kuon et al. [57] show that, compared to an equivalent ASIC, FPGAs incur on average a $40 \times$ area overhead, a $12 \times$ power overhead and a $3 \times$ execution time overhead.

A promising alternative to ASIC-HWACCs, GPUs and FPGAs is the Coarse-Grained Reconfigurable Array (CGRA) [102]. As opposed to FPGAs, CGRAs are programmable at instruction-level granularity. As a result, CGRAs require significantly less silicon area than FPGAs. In addition, static power consumption is much lower in CGRAs than in FPGAs. CGRAs consist of multi-bit functional units connected through a rich interconnect network, and have been shown to achieve high energy efficiency [8] while retaining the advantages of a programmable accelerator.

This dissertation capitalizes on the promising features of CGRAs as ultra-low-power accelerators in four parts. The first part studies the hardware and compiler support of state-of-the-art CGRAs. The second part is devoted to implementing a novel CGRA architecture referred to as the Integrated Programmable Array (IPA). The third part is dedicated to the compilation problems associated with accelerating applications on CGRAs and to a novel compilation flow. The final part explores heterogeneous computing by integrating the IPA into a state-of-the-art multi-core platform.
Motivation for CGRAs

CGRAs (Figure 1) are built around an array of processing elements (PEs) with word-level granularity. A PE is a single-pipeline-stage functional unit (FU) (e.g., ALU, multiplier) with a small local register file and a simple control unit, which fetches instructions from a small instruction memory. Additionally, some PEs, usually referred to as load-store units (LSUs), can perform load-store operations on a shared data memory. Rather than a compiler mapping a C program onto a single core, a CGRA tool flow maps a high-level program over multiple processing elements.
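
As a rough illustration of the structure described above, the following Python sketch models a small PE array. It is only a sketch under stated assumptions: the names `ProcessingElement`, `is_lsu` and `make_array`, the register-file size, and the choice of which PEs act as LSUs are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingElement:
    """One word-level PE: a functional unit plus small local storage."""
    row: int
    col: int
    # True if this PE can also load/store on the shared data memory (an LSU).
    is_lsu: bool = False
    # Small local register file; the size of 4 is an assumption.
    registers: List[int] = field(default_factory=lambda: [0] * 4)
    # Small local instruction memory, filled by the tool flow.
    instructions: List[str] = field(default_factory=list)


def make_array(rows: int, cols: int, lsu_cols: int = 1) -> List[List[ProcessingElement]]:
    """Build a rows x cols array; PEs in the first `lsu_cols` columns
    act as LSUs (a hypothetical placement)."""
    return [
        [ProcessingElement(r, c, is_lsu=(c < lsu_cols)) for c in range(cols)]
        for r in range(rows)
    ]


array = make_array(4, 4)
lsu_count = sum(pe.is_lsu for row in array for pe in row)
print(f"{len(array) * len(array[0])} PEs, {lsu_count} of them LSUs")
# prints "16 PEs, 4 of them LSUs"
```

Each PE owns its register file and instruction list, which reflects the point made in the text that the program is spread over many small processing elements rather than compiled for a single core.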

In Figure 2, we present a wide range of accelerator solutions. Figure 3 compares the CGRA architecture against the instruction processor solutions such as CPU, DSP, multi-core (MC), GPU and FPGA. Energy efficiency is shown against flexibility and performance.
Fig. 3 Energy efficiency vs Flexibility and Performance of CGRA vs CPU, DSP, MC, GPU, FPGA and ASIC

The charts are summarized as follows:

- DSPs have superior energy efficiency to both CPUs and GPUs but lack scalable performance.
- MCs provide greater performance than CPUs and have simpler cores, and therefore greater energy efficiency than CPUs, though lower than GPUs.
- GPU energy efficiency has surpassed that of CPUs and multi-cores over recent years (however, GPUs still require multi-core CPUs to monitor them).
- Modern FPGAs with DSP slices offer superior energy efficiency to CPUs, MCs, DSPs and GPUs, and provide strong performance.
- Nothing beats an ASIC for performance, but everything beats an ASIC for flexibility.
- CGRAs offer greater flexibility than FPGAs as they can be programmed efficiently in high-level languages. They also offer greater performance due to the coarse-grained nature of the computation.

**Contribution of the Thesis**

The thesis contributes to the following aspects of employing CGRAs as accelerators in computing platforms.

1. CGRA design and implementation: The dissertation presents a novel CGRA architecture design referred to as the Integrated Programmable Array (IPA) architecture. The design targets ultra-low-power execution of kernels. It also includes an extensive architectural exploration to find the best trade-off between cost and performance.

2. Mapping of applications onto the CGRA: For better performance, instruction-level parallelism is exploited at compile time, when operations and data are mapped onto the internal resources of the CGRA. Efficient resource binding is the most important task of a mapping flow. Efficient utilization of the resources available on the CGRA plays a crucial role in enhancing performance and reducing the complexity of its control unit. As register files (RFs) are one of the key components, efficient utilization of registers helps to reduce unwanted traffic between the CGRA and the data memory. In this dissertation, we present an energy-efficient register allocation approach to satisfy data dependencies throughout the program.

Also, an accelerator without an automated compilation flow is unproductive. In this regard, we present a full compilation flow integrating the register allocation mapping approach to accelerate control and data flow of applications.

3. Support for control flow: Since control flow in an application limits performance, it must be handled carefully in the hardware-software co-design of the CGRA accelerator. On the one hand, handling control flow in the compilation flow adds several operations, which increases the chance of higher power consumption. On the other hand, implementing a bulky centralized controller for the whole CGRA is not an option for energy-efficient execution. In this thesis, we implement a lightweight control unit and synchronization mechanism to handle the control flow in applications.

4. System level integration in a system on chip (SoC): Integration in a computing system is necessary to properly envision the CGRA as an accelerator. The challenge is interfacing with data memory due to scalability and performance issues. In this dissertation, we present our strategy to integrate the accelerator in the PULP [90] multi-core platform and study the best trade-off between the number of load-store units present in the CGRA and the number of banks present in the data memory.

Organization

The dissertation is organized into five chapters. In Figure 4, we show the positioning of the chapters based on the major features of the reconfigurable accelerator and compilation flow. Chapter 1 discusses the background, with major emphasis on state-of-the-art work on architecture and compilation. In Chapter 2, the design and implementation of an ultra-low-power reconfigurable accelerator with support for control flow management in applications is presented. Chapter 3 covers the compilation framework for the accelerator. In Chapter 4, we evaluate the performance of the accelerator and compilation flow. Chapter 5 addresses the integration of the accelerator in a multi-core SoC along with the software infrastructure.

Finally, the thesis concludes with an overview of the presented work and suggestions for future research directions.
Chapter 1

Background and Related Work

As a consequence of the power wall, in combination with the ever-increasing number of transistors available due to Moore’s Law, it is impossible to use all the available transistors at the same time. This problem is known as the utilization wall or dark silicon [33]. As a result, energy efficiency has become the first-order design constraint in all computing systems, ranging from portable embedded systems to large-scale data centers and supercomputers.

Research in computer architecture that tackles dark silicon can be categorized into three disciplines.

• **Architectural heterogeneity**: To improve energy efficiency, modern chips tend to bundle different specialized processors or accelerators along with general purpose processors. The result is a system-on-chip (SoC) capable of general purpose computation, which can also achieve high performance and energy efficiency when executing specialized operations such as graphics and signal processing. Research in this domain identifies the most suitable platforms, such as FPGA, CGRA, ASIC or hybrid, to serve as accelerators. Prior work [6] [25] has already shown that accelerators in SoCs are useful for combating the utilization wall.

• **Near threshold computing**: Lowering the supply voltage is an attractive way to increase energy efficiency, as the supply voltage strongly influences both static and dynamic energy. Scaling the supply voltage down close to the threshold voltage (Vth) of the transistor has proven to be highly power efficient. The work in [31] shows that near-threshold operation achieves an energy efficiency improvement of up to 5-10×. Since decreasing the supply voltage slows down the transistor, near-threshold operation aims for a favorable trade-off between performance and energy efficiency.

• **Power Management**: This category of research concentrates on architectures and algorithms that optimize the power budget. This involves introducing sleep modes [2], dynamic voltage and frequency scaling (DVFS) [73], and power and clock gating techniques.
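
As a first-order illustration of why the supply voltage matters so much in the near-threshold item above, the dynamic (switching) energy per clock cycle is commonly approximated as shown below. This is a standard textbook model rather than a formula from this thesis; the symbols $\alpha$ (switching activity), $C$ (switched capacitance) and $V_{dd}$ (supply voltage) are introduced here for illustration only.

$$
E_{\mathrm{dyn}} \approx \alpha \, C \, V_{dd}^{2}
\qquad\Longrightarrow\qquad
\frac{E_{\mathrm{dyn}}(V_{dd}/2)}{E_{\mathrm{dyn}}(V_{dd})} = \frac{(V_{dd}/2)^{2}}{V_{dd}^{2}} = \frac{1}{4}
$$

Under this approximation, halving the supply voltage cuts the dynamic energy of the same computation to one quarter. Leakage energy and the longer execution time at low voltage are not captured by this expression, which is why the net gain has to be evaluated as a trade-off.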

The use of application-specific accelerators [35] designed for specific workloads can improve performance under a fixed power budget. The wide use of such specialized accelerators can be seen in SoCs such as Nvidia’s Tegra 4, Samsung’s Exynos 5 Octa and Tilera’s Gx72. However, due to the long design cycle, increasing design cost and limited reusability of such an application-specific approach, reconfigurable accelerators have become an attractive solution. FPGAs acting as reconfigurable accelerators [1] [21] [16] are widely used in heterogeneous computing platforms [5] [105] [59]. FPGAs have evolved from a plain matrix of configurable logic blocks connected by programmable interconnect to a collection of processing macros, such as reconfigurable embedded memories and DSP blocks, that improve efficiency. FPGA accelerators are typically designed at the register transfer level (RTL) of abstraction for best efficiency. Designing at this level takes more time and makes reuse more difficult than a comparable software design. High-level synthesis (HLS) tools [76] [18] [12] [64] have helped to simplify accelerator design by raising the level of programming abstraction from RTL to high-level languages such as C or C++. These tools allow the functionality of an accelerator to be described at a higher level, which reduces developer effort, enables design portability and rapid design space exploration, and thereby improves productivity and flexibility.

Even though efforts such as Xilinx SDSoC, RIFFA, LEAP and ReconOS have abstracted the communication interfaces and memory management, allowing designers to focus on high-level functionality instead of low-level implementation details, the compilation times due to place and route in the back-end flow that generates the FPGA implementation of the accelerator have largely been ignored. Place and route time is now a major productivity bottleneck that prevents designers from adopting a mainstream design style based on rapid compilation. As a result, most of the existing techniques are generally limited to statically reconfigurable systems [98]. Key features such as energy efficiency, ease of programming, and fast compilation and reconfiguration therefore motivate the use of CGRAs to address signal processing and high performance computing problems.

In this thesis, we explore the CGRA as an accelerator in a heterogeneous computing platform and create a research framework to fit the CGRA within an ultra-low-power (mW-range) envelope. In this chapter, we investigate the design space and compiler support of state-of-the-art CGRAs. In the following chapters, we build on the significant architectural features explored here to design and implement a novel near-threshold CGRA and a compilation flow whose primary target is high energy efficiency.
1.1 Design Space

CGRAs have been investigated for applications with power consumption profiles ranging from mobile (hundreds of mW) [27] to high performance (hundreds of W) [70]. Design philosophies differ across CGRAs depending on the goals being optimized. In the following, we present a set of micro-architectural aspects of designing CGRAs.

1.1.1 Computational Resources

The computational resources (CRs) or Processing Elements (PEs) in CGRAs are typically categorized as Functional Units (FUs) and Arithmetic Logic Units (ALUs), with input bit-widths ranging from 8 to 32 bits. FUs have limited functionality, specifically a few operations for a specific application domain, as used in the ADRES [8] architectural templates. ALUs feature more complex functionality and require more reconfiguration than FUs. Their versatility makes instruction mapping easier. Depending on the application domain, a full-custom processing element can be designed, which gives better area and performance, but extensive specialization can also have negative effects on energy consumption [103]. The PEs can be homogeneous, which makes instruction placement simple, whereas heterogeneous PEs may be a more pragmatic choice.

1.1.2 Interconnection Network

There are several options for the interconnection network, such as a programmable interconnect network, point-to-point links, a bus or a Network-on-Chip (NoC).

The programmable interconnection network consists of switches that must be programmed to obtain the desired network behaviour. This comes at the cost of configuration overhead. The CREMA [37] CGRA uses such a programmable interconnect network.

A point-to-point (P2P) network directly connects the PEs and allows data to travel from one PE only to its immediate neighbours. Multi-hop communication requires "move" instructions at each of the intermediate PEs. These instructions are part of the configuration. In contrast with programmable interconnects, the P2P network does not impose any additional cost for programming the interconnect network. The most popular topologies in P2P networks are 2-D mesh and bus based. Designs like MorphoSys [96], ADRES [8] and Silicon Hive [10] feature a densely connected mesh network. Other designs like RaPiD [19] feature a dense network of segmented buses. Typically, the use of crossbars is limited to very small instances because large ones are too power-hungry. The SmartCell [63] CGRA uses a crossbar to connect the PEs in a cell. Since only 4 PEs are employed in each cell, the complexity of the crossbar is paid off by the performance in this design.
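The cost of multi-hop communication on a P2P mesh can be illustrated with a minimal sketch. It assumes a 2-D mesh without wrap-around links, shortest-path routing, and one "move" instruction per intermediate PE; the function names are hypothetical and not taken from any of the cited designs.

```python
def mesh_hops(src, dst):
    """Number of links on a shortest path between two PEs of a 2-D mesh.

    PEs are identified by (row, column); only immediate neighbours are linked.
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def move_instructions(src, dst):
    """Number of intermediate PEs that must execute a 'move' instruction.

    The source produces the value and the destination consumes it, so only
    the PEs strictly between them have to forward the data.
    """
    return max(mesh_hops(src, dst) - 1, 0)


if __name__ == "__main__":
    print(move_instructions((0, 0), (0, 1)))  # direct neighbours
    print(move_instructions((0, 0), (2, 3)))  # distant PEs
```

Under these assumptions, each forwarding PE spends one configuration slot on the move, which is why placement quality directly affects the utilization of the array.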

Fig. 1.1 Different interconnect topologies (a) Mesh 2D; (b) Bus based; (c) Next hop

A NoC provides a packet-switched network in which each router checks the destination field of an incoming packet and determines the neighbouring router to which the packet must be forwarded. A NoC does not require the interconnect network to be programmed, but the hardware must implement a routing algorithm. Generally, cluster-based CGRAs like REDEFINE [3] and SmartCell use a NoC for communication between the clusters. HyCube [51] implements a crossbar-switched network. Unlike a NoC, this network uses clockless repeaters to achieve single-clock multi-hop communication.
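The routing decision taken by a router can be sketched as follows. The example uses XY (dimension-order) routing, a common deterministic algorithm chosen here purely for illustration; the text does not state which algorithm REDEFINE or SmartCell implement, and the port names and coordinate convention are assumptions.

```python
# Offsets applied to the (x, y) coordinate when leaving through each port.
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}


def next_port(current, dest):
    """Output port chosen by the router at `current` for a packet to `dest`.

    XY routing: correct the x coordinate first, then the y coordinate.
    """
    cx, cy = current
    dx, dy = dest
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"  # the packet has arrived


def route(src, dest):
    """List of routers visited by a packet travelling from src to dest."""
    path = [src]
    current = src
    while True:
        port = next_port(current, dest)
        if port == "LOCAL":
            return path
        sx, sy = STEP[port]
        current = (current[0] + sx, current[1] + sy)
        path.append(current)


if __name__ == "__main__":
    print(route((0, 0), (2, 1)))
```

Each router only needs the destination field and its own coordinates, so no per-application programming of the network is required.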

Apart from these basic types, one may choose a hierarchical interconnection where different network types are used at different levels of the fabric. As an example, MATRIX [74] CGRA consists of three levels of interconnection network which can be dynamically switched. The SYSCORE [80] CGRA uses mesh and cross interconnect for low power implementations. In this architecture, cross interconnections are only introduced at odd numbered columns in the array of PEs, to avoid dense interconnect. The cross interconnections are useful to perform non-systolic functions.

1.1.3 Reconfigurability

Two types of reconfigurability can be realized in CGRAs: static and dynamic.

In statically reconfigured CGRAs, each PE performs a single task for the whole duration of the execution. The term "execution" here refers to the total running period between two configurations. In this case, mapping an application onto the CGRA concerns only space, as illustrated in Figure 1.2. In other words, the PE performs the same operation in every cycle. The mapping solution assigns a single operation to each PE depending on the data flow. The most important advantage of static reconfigurability is the lack of reconfiguration overhead, which helps to reduce power consumption. Because the PEs cannot be reconfigured during execution, the CGRA must be large to accommodate even small programs, or a large program must be broken into several smaller ones.

In dynamically reconfigured CGRAs, PEs perform different tasks during the execution. Usually, the PEs are reconfigured in each cycle by simple instructions, referred to as context words. Dynamic reconfigurability can overcome the resource constraints of static reconfigurability by expanding loop iterations through multiple configurations. Clearly, this comes at the cost of added power consumption due to continuous instruction fetching. Designs like ADRES and MorphoSys tackle this by not allowing control flow in the loop bodies. Furthermore, if conditionals are present inside the loop body, the control flow is converted into data flow using predicates. This mechanism usually introduces overhead in the code. Liu et al. [66] perform affine transformations on loops based on the polyhedral model and are able to execute loop nests up to two levels deep.
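The difference between the two schemes can be made concrete with a small sketch. The data flow graph used here, out = (a + b) - (c * d), is a hypothetical example and not the DFG of Figure 1.2; the PE names and the context word format are likewise assumptions made for illustration.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul, "sub": operator.sub}


def run_static(a, b, c, d):
    """Static reconfigurability: one fixed operation per PE (three PEs)."""
    config = {"PE0": "add", "PE1": "mul", "PE2": "sub"}  # never changes
    t0 = OPS[config["PE0"]](a, b)
    t1 = OPS[config["PE1"]](c, d)
    return OPS[config["PE2"]](t0, t1)


def run_dynamic(a, b, c, d):
    """Dynamic reconfigurability: two PEs, one context word per PE per cycle.

    Each context word is (operation, destination, source1, source2);
    None means the PE is idle in that cycle.
    """
    contexts = [
        {"PE0": ("add", "t0", "a", "b"), "PE1": ("mul", "t1", "c", "d")},
        {"PE0": ("sub", "out", "t0", "t1"), "PE1": None},
    ]
    regs = {"a": a, "b": b, "c": c, "d": d}
    for context in contexts:
        results = {}
        # All PEs read their operands before any result is written back,
        # which models parallel execution within one cycle.
        for word in context.values():
            if word is None:
                continue
            op, dest, src1, src2 = word
            results[dest] = OPS[op](regs[src1], regs[src2])
        regs.update(results)
    return regs["out"]


if __name__ == "__main__":
    print(run_static(1, 2, 3, 4), run_dynamic(1, 2, 3, 4))
```

The static version needs one PE per operation, while the dynamic version reuses PE0 for two operations at the price of fetching a new context word in the second cycle.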


Fig. 1.2 Static and Dynamic Reconfigurability: (a) Mapping of the DFG in (d) for static reconfigurability onto a 2x2 CGRA; (b) Mapping of the DFG in (d) for Dynamic reconfigurability onto a 2x1 CGRA; (c) Execution of the statically reconfigurable CGRA in 3 cycles; (d) Data flow graph; (e) Execution of the dynamically reconfigurable CGRA in 3 cycles.
The MorphoSys design further reduces the cost of instruction fetching by limiting the code to the Single Instruction Multiple Data (SIMD) mode of operation. In this case, all the PEs in a row or a column execute the same operation during the whole execution. A similar approach is realized in SIMD-CGRA [32], where bio-medical applications are executed in an ultra-low-power environment. The RaPiD architecture limits the number of configuration bits to be fetched by making only a small part of the configuration dynamically reconfigurable. Kim et al. [54] proposed to reduce the power consumption in the configuration memories by compressing the configurations.

Generally, limited reconfigurability imposes more constraints on the types and sizes of loops that can be mapped. The compiler also bears the extra burden of generating mappings that satisfy these constraints. Silicon Hive [10], in contrast, is a design that does not impose any restrictions on the code to be executed and allows execution of the full control flow of an application. Unfortunately, no numbers on the power consumption are publicly available for this design.

The CGRA design in this thesis adopts the philosophy of unlimited reconfigurability, which allows any kind of application consisting of complex control and data flow to be mapped in an energy-constrained environment.

1.1.4 Register Files

CGRA compilers schedule and place operations on the computational resources (CRs) and route the data flow over the interconnect network between the CRs. The data also travel through the Register Files (RFs). Hence, the RFs in a CGRA are treated as interconnects that can be extended over multiple cycles. Because the RFs are used for routing, the compiler must know the location of the RFs, their size and the topology of their interconnection with the CRs. Both power and performance depend on these parameters. Hence, while designing the CGRA, it is important to pay special attention to the size, number of ports and location of the RFs.

1.1.5 Memory Management

When targeting low-power execution, data and context management is of utmost importance. Over the past years, several solutions [27] have been proposed to integrate CGRAs as accelerators with the data and instruction memory.

In many low-power targeted CGRAs [8][78][91][52], memory operations are managed by the host processor. Among these architectures, the Ultra-Low-Power Samsung Reconfigurable Processor (ULP-SRP) and the Cool Mega Array (CMA) operate in the ultra-low-power (up to 3 mW) range. In these architectures, PEs can only access the data once it has been prearranged in the shared register file by the processor. For an energy-efficient implementation, the main challenge for these designs is to balance the performance of the data distribution managed by the CPU against the computation in the PE array. However, in several cases, the computational performance of the PE array is compromised by the CPU, due to large synchronization overheads. For example, in ADRES [8] the power overhead of the VLIW processor used to handle the data memory access is up to 20%. In CMA [78] the host CPU feeds the data into the PEs through a shared fetch register (FR) file, which is very inefficient in terms of flexibility. The key feature of this architecture is the possibility to apply independent DVFS [99] or body biasing [71] to the array and the controlling processor, in order to adjust them to the performance and bandwidth requirements of the applications. The highest reported energy efficiency for CMA is 743 MOPS/mW on 8-bit kernels, not considering the overhead of the controlling processor, which is not reported. In contrast to that work, which only deals with DFGs described in a customized language, we target 32-bit data and application kernels described in the C language, which are mapped onto the array using an end-to-end C-to-CGRA compilation flow.

In a few works [96] [53], load-store operations are managed explicitly by the PEs. Data elements in these architectures are stored in a shared memory with one memory port per PE row. The main disadvantages of such a data access architecture are: (a) heavy contention between the PEs on the same row to access the memory banks, and (b) expensive data exchange between rows through complex interconnect networks within the array. With respect to these architectures, our approach minimizes contention by exploiting a multi-banked shared memory with word-level interleaving. In this way, data exchange among tiles can be performed either through the much simpler point-to-point communication infrastructure or through the fully flexible shared tightly coupled data memory (TCDM).
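Word-level interleaving can be sketched in a few lines. The sketch assumes 32-bit (4-byte) words, single-ported banks that serve one access per cycle, and a hypothetical bank count of four; none of these parameters is taken from the architecture described in this thesis.

```python
from collections import Counter

WORD_BYTES = 4  # 32-bit words (assumption for this sketch)


def bank_of(byte_address, n_banks):
    """Bank holding the word at `byte_address` under word-level interleaving.

    Consecutive words are placed in consecutive banks, wrapping around.
    """
    return (byte_address // WORD_BYTES) % n_banks


def conflicts(byte_addresses, n_banks):
    """Accesses that must be delayed when all requests arrive in one cycle.

    Each bank serves one request per cycle, so every additional request to
    the same bank counts as one conflict.
    """
    per_bank = Counter(bank_of(a, n_banks) for a in byte_addresses)
    return sum(count - 1 for count in per_bank.values())


if __name__ == "__main__":
    print(conflicts([0, 4, 8, 12], n_banks=4))    # consecutive words
    print(conflicts([0, 16, 32, 48], n_banks=4))  # stride of four words
```

Consecutive words fall into different banks and can be served in parallel, whereas a stride equal to the number of banks maps every access to the same bank.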

Solutions targeting high programmability and performance, executing full control and data flows, are reported for the weakly programmable processor array (WPPA) [55], the Asynchronous Array of Simple Processors (AsAP) [106], RAW [101], ReMAP [20] and XPP [11]. The WPPA array consists of VLIW processors. For low-power targets, the instruction set of a single PE is minimized according to domain-specific computational needs. In AsAP, each processor contains local data and instruction memory, FIFOs for tile-to-tile communication and a local oscillator for local clock generation. Both ReMAP and XPP consist of an array of PEs, each with a DSP extension. These architectures are mainly intended for the exploitation of task-level parallelism. Hence, each processor of the array must be programmed independently, which is much closer to many-core architectures. Each RAW PE contains a 96 KB instruction cache, a 32 KB data cache and router-based communication. These large-scale "array of processors" CGRAs are out of scope for ultra-low power, mW-level acceleration (a single tile would take more than the full power budget).

NASA's Reconfigurable Data-Path Processor (RDPP) [29] and Field Programmable Processor Array (FPPA) [30] target low-power stream data processing for spacecraft. These architectures rely on control switching [29] of data streams and a synchronous data-flow computational model, avoiding investment in memories and control. In contrast, the IPA is tailored to achieve energy-efficient near-sensor processing of data, with workloads very different from stream data processing.

Table 1.1 summarizes the jobs managed by the CGRA and the host processor for different architectural approaches. Acceleration of a kernel involves memory operations, innermost loop computation, outer loop computation, offload and synchronization with the CPU. As shown in the table, the IPA executes both the innermost and outer loops as well as the memory operations of a kernel, imposing the least communication and memory operation overhead while synchronizing with the CPU execution.

With respect to these state of the art reconfigurable arrays and arrays of processors, this thesis introduces a highly energy-efficient, general-purpose IPA accelerator where PEs have random access to the local memory and execute the full control and data flow of kernels on the array, starting from a generic ANSI C representation of applications [23]. This work also focuses on the architectural exploration of the proposed IPA accelerator [24], with the goal of determining the best configuration of the number of LSUs and the number of banks for the shared L1 memory. Moreover, we employ a fine-grained power management architecture to eliminate the dynamic power consumption of idle tiles during kernel execution, which provides a $2\times$ improvement in energy efficiency, on average. The globally synchronized execution model, low-cost but fully flexible programmability, tightly coupled data memory organization and fine-grained power management architecture define the suitability of the proposed architecture as an accelerator for ultra-low power embedded computing platforms.

Table 1.1 Qualitative comparison between different architectural approaches

<table>
<thead>
<tr>
<th>Architectures</th>
<th>ADRES, CMA, MUCCRA, FPPA, RDPP, MATRIX, CHESS</th>
<th>Liu et al</th>
<th>MorphoSys, RSPA, PipeRench, CHARM</th>
<th>IPA</th>
</tr>
</thead>
<tbody>
<tr>
<td>Memory operations</td>
<td>CPU</td>
<td>CGRA</td>
<td>CPU</td>
<td>CGRA</td>
</tr>
<tr>
<td>Innermost loop</td>
<td>CGRA</td>
<td>CGRA</td>
<td>CGRA</td>
<td>CGRA</td>
</tr>
<tr>
<td>Outer loop</td>
<td>CPU</td>
<td>CPU</td>
<td>CGRA</td>
<td>CGRA</td>
</tr>
<tr>
<td>Offload + Synchronization</td>
<td>CPU</td>
<td>CPU</td>
<td>CPU</td>
<td>CPU</td>
</tr>
<tr>
<td>Communication overhead</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
1.2 Compiler Support

The compiler produces an executable for the reconfigurable fabric. As opposed to compilers for general-purpose processors, where only the instruction set is visible to the compiler, the micro-architectural details of a CGRA are exposed to a CGRA compiler. This enables the compiler to optimize applications for the underlying CGRA and to take advantage of the interconnection network and register files to maximize performance. Like other compilers, CGRA compilers generate an intermediate representation from the high-level language. The intermediate representation is usually convenient for parallelism extraction, as most CGRAs have many parallel functional units. Different architectures exploit different levels of parallelism through their compilers.

1.2.1 Data Level Parallelism (DLP)

In this approach, the computational resources operate in parallel on each element of regular data structures such as one- and two-dimensional arrays. The compilers target the acceleration of DLP loops, vector processing or SIMD mode of operation. CGRAs like MorphoSys, Remarc and PADDI leverage a SIMD architecture, and their compilers aim to exploit the DLP in the applications. However, DLP-only accelerators face performance issues when the accelerated region does not have any DLP, i.e. when there are inter-iteration data dependencies.
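The distinction between a loop with DLP and a loop with an inter-iteration dependency can be illustrated with two small, hypothetical kernels. They are written in Python for brevity and are not taken from the benchmarks of this thesis.

```python
def scale(values, factor):
    """Every iteration is independent, so all elements could be processed
    in parallel by different PEs executing the same operation (DLP)."""
    return [factor * v for v in values]


def running_sum(values):
    """As written, iteration i needs the accumulator produced by iteration
    i - 1, so the iterations cannot simply be spread over SIMD lanes."""
    acc = 0
    out = []
    for v in values:
        acc = acc + v  # inter-iteration data dependency on `acc`
        out.append(acc)
    return out


if __name__ == "__main__":
    print(scale([1, 2, 3], 2))
    print(running_sum([1, 2, 3]))
```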

1.2.2 Instruction Level Parallelism (ILP)

In compute-intensive applications, nested loops perform computations on arrays of data, which can provide a lot of ILP. For this reason, most compilers tend to exploit ILP for the underlying CGRA architecture.

State of the art compilers that exploit ILP, like REGIMap [44], DRESC [72] and Edge Centric Modulo Scheduling (EMS) [79], mostly rely on software pipelining. This approach can map the innermost loop body in a pipelined manner. For the outer loops, on the other hand, the CPU must initiate each iteration in the CGRA, which causes significant overhead in the synchronization between the CGRA and CPU execution. Liu et al. [66] pinpointed this issue and proposed to map a maximum of two levels of loops using polyhedral transformations on the loops. However, this approach is not generic, as it cannot scale to an arbitrary number of loops. Some approaches [65] [61] use loop unrolling for the kernels. The basic assumption of these implementations is that the innermost loop's trip count is not large. Hence, the solutions end up partially unrolling the innermost loops, and the outer loops remain to be executed by the host processor. As most of the proposed compilers handle only the innermost loop of the kernels, they mostly rely on partial predication [47] [13] and full predication [4] techniques to map the conditionals inside the loop body.

Partial predication maps the instructions of both the if-part and the else-part onto different PEs. If both the if-part and the else-part update the same variable, the result is computed by selecting the output from the path that should have been executed, based on the evaluation of the branch condition. This technique increases the utilization of the PEs, at the cost of higher energy consumption due to the execution of both paths of a conditional. Unlike partial predication, in full predication all instructions are predicated. Instructions on each path of a control flow, which are sequentially configured onto PEs, are executed only if the predicate value of the instruction matches the flag in the PE. Hence, the instructions in the false path do not get executed. The sequential arrangement of the paths degrades the latency and energy efficiency of this technique.
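The two techniques can be contrasted on a hypothetical conditional, `if cond: r = a + b else: r = a - b`. The following sketch only models which operations are executed; the operation counts are an illustrative simplification, not measurements from the cited works.

```python
def partial_predication(cond, a, b):
    """Both paths are computed (on different PEs) and a select keeps one.

    Returns (result, number of operations executed).
    """
    then_value = a + b  # if-part, always executed
    else_value = a - b  # else-part, always executed
    result = then_value if cond else else_value  # select operation
    return result, 3


def full_predication(cond, a, b):
    """Every instruction carries a predicate and the paths are issued one
    after the other; an instruction executes only if its predicate matches
    the flag held in the PE.

    Returns (result, number of operations executed).
    """
    flag = bool(cond)
    program = [
        (True, lambda: a + b),   # if-part, predicated on True
        (False, lambda: a - b),  # else-part, predicated on False
    ]
    result = None
    executed = 0
    for predicate, operation in program:  # sequential issue of both paths
        if predicate == flag:
            result = operation()
            executed += 1
    return result, executed


if __name__ == "__main__":
    print(partial_predication(True, 5, 3))
    print(full_predication(False, 5, 3))
```

In this simplified model, partial predication spends energy on three operations, while full predication executes a single operation but still occupies issue slots for both paths.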

Full predication is improved upon by state-based full predication [46]. This scheme prevents wasted instruction issues from the false conditional path by introducing sleep and awake mechanisms, but fails to improve performance. The dual-issue scheme [45] targets energy efficiency by issuing two instructions to a PE simultaneously, one from the if-path and another from the else-path. In this mechanism, the latency remains similar to that of partial predication, with improved energy efficiency. However, this approach is too restrictive as far as imbalanced and nested conditionals are concerned. To map nested, imbalanced conditionals and single loops onto a CGRA, the triggered long instruction set architecture (TLIA) is presented in [67]. This approach merges all the conditions present in a kernel into triggered instructions and creates an instruction pool for each triggered instruction. As the depth of the nested conditionals increases, the performance of this approach decreases. As far as loop nests are concerned, the TLIA approach reaches a bottleneck in accommodating the large set of triggered instructions in the limited set of PEs.

1.2.3 Thread Level Parallelism

To exploit thread level parallelism (TLP), compilers partition the program into multiple parallel threads, each of which is then mapped onto a set of PEs. Compilers for RAW, PACT and KressArray exploit TLP. To support parallel execution modes, the controller must be extended to support the call stack and to synchronize the threads. As a result, power consumption increases.

The TRIPS controller supports four modes of operation to cover all the types of parallelism [94]. In the first mode, all the PEs are configured to execute a single thread, exploiting ILP. In the second mode, the four rows execute four independent threads, exploiting TLP. In the third mode, fine-grained multi-threading is supported by time-multiplexing all PEs over multiple threads. In the fourth mode, each PE of a row executes the same operation, thus implementing SIMD and exploiting DLP. Thus, the TRIPS compiler can exploit the most suitable form of parallelism.

The compiler for REDEFINE exploits TLP and DLP to accelerate a set of HPC applications. Table 1.2 presents an overview of several architectural and compilation aspects of the state of the art CGRA designs.

Table 1.2 CGRA design space and compiler support: CR - Computational Resources; IN - Interconnect Network; RC - Reconfigurability; MM - Memory management; CS - Compilation Support; MP - Mapping; PR - Parallelism

<table>
<thead>
<tr>
<th rowspan="2">Architecture</th>
<th rowspan="2">CR</th>
<th rowspan="2">IN</th>
<th rowspan="2">RC</th>
<th rowspan="2">MM</th>
<th colspan="2">CS</th>
</tr>
<tr>
<th>MP</th>
<th>PR</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7">High performance targets</td>
</tr>
<tr>
<td>RAW</td>
<td>RISC core</td>
<td>Hybrid</td>
<td>Static and dynamic</td>
<td>FU</td>
<td>CDFG</td>
<td>TLP</td>
</tr>
<tr>
<td>TRIPS</td>
<td>ALU</td>
<td>NoC</td>
<td>Dynamic</td>
<td>FU</td>
<td>CDFG</td>
<td>ILP, DLP, TLP</td>
</tr>
<tr>
<td>REDEFINE</td>
<td>ALU</td>
<td>Hybrid</td>
<td>Dynamic</td>
<td>FU</td>
<td>CDFG</td>
<td>DLP, TLP</td>
</tr>
<tr>
<td>ReMAP</td>
<td>DSP core</td>
<td>Programmable</td>
<td>Dynamic</td>
<td>FU</td>
<td>CDFG</td>
<td>TLP</td>
</tr>
<tr>
<td>MorphoSys</td>
<td>ALU</td>
<td>Hybrid P2P</td>
<td>Dynamic</td>
<td>FU</td>
<td>DFG</td>
<td>DLP, ILP</td>
</tr>
<tr>
<td colspan="7">Low-power targets</td>
</tr>
<tr>
<td>SmartCell</td>
<td>ALU</td>
<td>Hybrid</td>
<td>Dynamic</td>
<td>PE</td>
<td>CDFG</td>
<td>TLP, DLP</td>
</tr>
<tr>
<td>PACT XPP</td>
<td>ALU</td>
<td>Hybrid</td>
<td>Dynamic</td>
<td>PE</td>
<td>CDFG</td>
<td>TLP</td>
</tr>
<tr>
<td>TCPA</td>
<td>ALU</td>
<td>Hybrid</td>
<td>Dynamic</td>
<td>PE</td>
<td>CDFG</td>
<td>TLP</td>
</tr>
<tr>
<td>AsAP</td>
<td>ALU</td>
<td>Mesh 2D</td>
<td>Dynamic</td>
<td>PE</td>
<td>CDFG</td>
<td>TLP</td>
</tr>
<tr>
<td>MUCCRA-3</td>
<td>FU</td>
<td>Hybrid P2P</td>
<td>Dynamic</td>
<td>VLIW host</td>
<td>DFG</td>
<td>ILP</td>
</tr>
<tr>
<td>RaPiD</td>
<td>ALU</td>
<td>Bus based</td>
<td>Static and dynamic</td>
<td>PE</td>
<td>DFG</td>
<td>DLP</td>
</tr>
<tr>
<td colspan="7">Ultra-low-power targets</td>
</tr>
<tr>
<td>CMA</td>
<td>ALU</td>
<td>Hybrid P2P</td>
<td>Dynamic</td>
<td>Host micro controller</td>
<td>Data flow</td>
<td>ILP</td>
</tr>
<tr>
<td>ULP-SRP</td>
<td>FU</td>
<td>Mesh-x</td>
<td>Dynamic</td>
<td>VLIW host</td>
<td>Data flow</td>
<td>ILP</td>
</tr>
<tr>
<td>SYSCORE</td>
<td>ALU</td>
<td>Hybrid</td>
<td>Dynamic</td>
<td>DSP host</td>
<td>Data flow</td>
<td>DLP</td>
</tr>
</tbody>
</table>
1.3 Mapping

The mapping process assigns data and operations to the CGRA resources. It comprises the scheduling and binding of the operations and data on the functional units and registers respectively. Depending on how these two steps are realized, mapping approaches can be divided into two classes.

The first class solves scheduling and binding sequentially, using heuristics and/or meta-heuristics [60] [79] [36] or exact methods [43] [44]. The approaches in [79] and [36] implement heuristic-based iterative modulo scheduling [85]. In [79], an edge-based binding heuristic is used (instead of classical node-based approaches) to reduce the number of failures. In [36], the binding problem is addressed by combining a routing heuristic from FPGA synthesis with a Simulated Annealing (SA) algorithm for placement. Schedulable operation nodes are moved randomly according to a decreasing temperature and a cost function. To explore the solution space efficiently, the temperature needs to be high; however, the resulting slow decrease in the probability of accepting worse solutions adversely affects the computation time. The approach in [60] proposes to improve the computation time by first finding a solution to a simplified problem with heuristic-based methods for both scheduling and binding, and then trying to improve this initial solution with a genetic algorithm. However, the use of only one seed limits the ability to explore the whole solution space. The mappings proposed in EPIMap [43] and REGIMap [44] solve the scheduling and binding problems sequentially, using a heuristic and an exact method respectively. Scheduling is made implicit by integrating both the architectural constraints (i.e. the number of operations simultaneously executable on the CGRA and the maximum out-degree of each operation due to the communication network) and the timing aspect into the DFG by statically transforming it. Binding is addressed by finding the common sub-graph between the transformed DFG and a time-extended CGRA with Levi's algorithm [62]. However, since the graph transformations are done statically, it is difficult to know which transformation is relevant at a given time. This over-constrains the problem and reduces the ability of the method to explore the solution space efficiently. The graph-minor approach proposed in [15] also uses graph transformation and sub-graph matching to find the placement.
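The role of the temperature in SA-based placement can be sketched as follows. This is a generic illustration, not the algorithm of [36]: the Metropolis acceptance rule, the geometric cooling schedule, the wire-length cost function and all parameter values are assumptions, and routing is ignored.

```python
import math
import random


def wire_length(placement, edges):
    """Cost function: total Manhattan distance over all DFG edges."""
    return sum(
        abs(placement[u][0] - placement[v][0])
        + abs(placement[u][1] - placement[v][1])
        for u, v in edges
    )


def accept_probability(delta, temperature):
    """Metropolis rule: improvements are always accepted, degradations are
    accepted with a probability that shrinks as the temperature drops."""
    if delta <= 0:
        return 1.0
    return math.exp(-delta / temperature)


def anneal(placement, edges, rows, cols,
           t_start=5.0, t_end=0.01, alpha=0.95, seed=0):
    """Move one randomly chosen operation per step, swapping with the
    occupant of the target PE if there is one. Returns (best, best_cost)."""
    rng = random.Random(seed)
    current = dict(placement)
    cost = wire_length(current, edges)
    best, best_cost = dict(current), cost
    temperature = t_start
    while temperature > t_end:
        node = rng.choice(sorted(current))
        target = (rng.randrange(rows), rng.randrange(cols))
        candidate = dict(current)
        for other, position in current.items():
            if position == target:
                candidate[other] = current[node]  # swap keeps PEs exclusive
        candidate[node] = target
        new_cost = wire_length(candidate, edges)
        if rng.random() < accept_probability(new_cost - cost, temperature):
            current, cost = candidate, new_cost
            if cost < best_cost:
                best, best_cost = dict(current), cost
        temperature *= alpha  # geometric cooling (assumed schedule)
    return best, best_cost


if __name__ == "__main__":
    dfg_edges = [("a", "b"), ("b", "c")]
    initial = {"a": (0, 0), "b": (1, 1), "c": (0, 1)}
    print(anneal(initial, dfg_edges, rows=2, cols=2))
```

A cooling factor `alpha` close to 1 keeps the acceptance probability high for longer and explores more of the solution space, but it also increases the number of iterations, which is the trade-off discussed above.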

The second class solves the scheduling and binding problems concurrently. The mappings proposed in [60], [9], [82] use exact methods, e.g. algorithms based on integer linear programming, to find optimal results. Because they are exact, these methods suffer from scalability issues. DRESC [72] and its extension [26], which can cope with RFs, rely on meta-heuristics. These are based on a Simulated Annealing (SA) framework that includes a guided stochastic algorithm. The classical placement and routing problem, which can be solved with SA, is extended to three dimensions to include scheduling. Thus, schedulable operation nodes are moved randomly through time and space. Therefore, the convergence is even slower than for other SA-based methods, as it includes scheduling.

The key idea of the mapping approach in this work is to combine the advantages of exact, heuristic and meta-heuristic methods while offsetting their respective drawbacks as much as possible. Hence, as detailed in chapter 4, the scheduling and binding problems are solved simultaneously, using a heuristic-based algorithm and a randomly degenerated exact method respectively, and the formal model of the application is transformed dynamically when necessary.

1.4 Representative CGRAs

In this section, we present five well-known and representative CGRA architectures chosen due to their unique network model, functional units and memory hierarchy.

1.4.1 MorphoSys

The MorphoSys reconfigurable array (Figure 1.3) consists of an $8 \times 8$ grid of reconfigurable cells (RCs), each of which contains an ALU, an RF and a multiplier. The interconnect network consists of four quadrants that are connected in columns and rows. Inside each quadrant, a dense mesh network (Figure 1.3 (a)) is implemented. At the global level, buses support inter-quadrant connectivity (Figure 1.3 (b)). The context memory stores multiple contexts, which are broadcast row-wise or column-wise, providing SIMD functionality. Each unit is configured by a 32-bit configuration word. A compiler based on an extended C language has been developed, but partitioning is performed manually.

1.4.2 ADRES

The ADRES architecture (Figure 1.4) comprises an array of PEs tightly coupled with a VLIW processor. The reconfigurable cells (RCs) consist of an ALU, a register file and a tiny instruction memory. RCs are connected through a mesh interconnection network. A predication network is implemented to execute conditionals. The register file of the VLIW processor is shared with the RC array. This reduces the communication between the reconfigurable matrix and the memory subsystem. ADRES features a C compiler for both the VLIW and the CGRA.

Fig. 1.3 MorphoSys Architecture

Fig. 1.4 ADRES Architecture

1.4.3 RAW

The RAW architecture (Figure 1.5) is an array of RISC-based pipelined FUs. Each FU contains instruction and data memory. The FUs communicate via a programmable interconnect network. Each FU is connected to a switch that controls the destination addresses used by the network interface; hence, the routing can be statically scheduled. When no data transfer is scheduled on the network, instructions scheduled dynamically by the RAW control unit can use the network as a dynamic one. A compiler based on a high-level language, implementing TIERS-based [95] place and route, is available.

Fig. 1.5 RAW Architecture

1.4.4 TCPA

TCPA (Figure 1.6) consists of an array of heterogeneous, VLIW-style FUs, connected via a programmable network. The heterogeneity of the FUs is a design-time decision. The interconnect between the FUs is statically configured and forms direct connections between the FUs. Each FU has a (horizontal and vertical) mask that allows individual reconfiguration of FUs. In this way, SIMD-type behaviour can also be implemented. Unlike in conventional VLIW processors, the register files in these FUs are explicitly controlled. Compilation for the architecture is introduced in [100], where algorithms are described in the PAULA language [48], designed for multi-dimensional data-intensive applications.

1.4.5 PACT XPP

PACT XPP defines two types of processing array elements (PAEs): one for computation and another with local RAM. The PAEs are connected with a packet-based network and computation is event driven. Figure 1.7 presents the architecture of a PAE. The typical PAE contains a back register (BREG) object and a forward register (FREG) object, which are used for vertical routing, as well as an ALU object, which performs the actual computations. Both the operation of the PAEs and the communication are reconfigurable, resulting in a large number of configuration bits. The event-driven compute model means that the control flow is handled in a decentralized fashion, such that a configuration can be kept static as long as possible. To support irregular computations that do require the configuration to be updated, PACT XPP uses two techniques. Firstly, configurations are cached locally to enable fast configuration; secondly, partial configurations are supported. Partial configurations only update selected bits, which can keep them small in many cases, optimizing the use of the local configuration cache.

Fig. 1.6 TCPA Architecture

Fig. 1.7 PAE architecture of PACT XPP
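The idea of a partial configuration that only updates selected bits can be sketched with a bit mask. The 32-bit word width and the field layout below are hypothetical and do not describe the actual PACT XPP configuration format.

```python
WORD_MASK = 0xFFFFFFFF  # assumed 32-bit configuration word


def apply_partial(config, mask, value):
    """Return the configuration word after a partial update.

    Bits set in `mask` are taken from `value`; all other bits of the
    current configuration are left untouched.
    """
    return ((config & ~mask) | (value & mask)) & WORD_MASK


if __name__ == "__main__":
    current = 0xAABBCCDD
    # Update only bits 8..15 (a hypothetical 'operation select' field).
    print(hex(apply_partial(current, 0x0000FF00, 0x00001100)))
```

Only the mask and the new field value need to be stored and transferred, which is why such updates can stay small compared with a full configuration.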

1.5 Conclusion

This chapter presented an overview of different CGRAs and their execution models. Different architectural and compilation approaches have been presented to give a comprehensive view of the wide spectrum of design and compilation choices. In the next chapter, we make design choices and focus on implementing a CGRA operating in the ultra-low power domain.
Chapter 2

Design of The Reconfigurable Accelerator

Due to the increasing complexity of near-sensor data analytics algorithms, low power embedded applications such as Wireless Sensor Networks (WSN), Internet of Things (IoT) and wearable sensors combine the requirement of high performance and extreme energy efficiency in a mW power envelope [7]. While traditional ultra-low power sensor processing circuits rely on hardwired Application Specific Integrated Circuit (ASIC) architectures [28], near-threshold parallel computing is emerging as a promising solution to exploit the energy boost given by low-voltage operation while recovering the related performance degradation through execution over multiple programmable processors [89].

Even though exploitation of parallel ultra-low power computing provides maximum flexibility, a dominating majority of the power consumed during processing is linked to the typical overheads of instruction processors [38], such as complex fetching and decoding of instructions, control and data-path pipeline overheads (up to 40%), and the load and store overhead needed for processors to work with their L1 memory (up to 30%).

In this chapter, we make a significant step forward in parallel near-threshold computing toward the goal of achieving the energy efficiency of application-specific data-paths, by exploiting the Coarse Grain Reconfigurable Array (CGRA) architectural template and revisiting it to fit within an ultra-low power (mW) envelope. Some of the primary objectives that motivate a highly flexible ULP CGRA design are discussed in the following.

- **Flexibility**: Flexibility is the key benefit of relying on a reconfigurable fabric. However, along the design path, several compromises are made to satisfy design constraints. As an example, the RaPiD [19] architecture limits the number of configuration bits to be fetched by making only a small part of the configuration reconfigurable per cycle. The MorphoSys design reduces the reconfiguration overhead by limiting the supported code to SIMD. To achieve better energy efficiency, other CGRA designs which support MIMD, like ADRES [8], CMA [99] and ULP-SRP [52], rely on executing only the innermost loop to avoid control flow hazards. If conditionals exist in the innermost loop, they are tackled in software by flattening them using several predication techniques.

All these restrictions are mostly addressed at the design entry point, where a high-level language is used to program the CGRAs. As a result, restrictions in reconfigurability eventually lead to programmability issues. As previously mentioned, most CGRAs use the C language as the entry point. Ideally, the designs and their compilers should be able to find a mapping for any valid C program. In practice, this is not the case: only the loops, particularly the innermost loops, are mapped onto the CGRAs. In addition, the designs use a subset of the C language; in other words, they do not support pointer-based accesses, recursion, etc. Well-structured loops can be written without these constructs, but the burden of re-engineering the source code remains.

As flexibility is our primary design philosophy, we prioritize executing any C program, not just the loop kernels. This is achieved by implementing low cost control flow support in the hardware, and an efficient control flow mapping support in the compiler (see chapter 3).

- **Utilization**: One of the most critical design choices is the processing element. Since high performance is achieved by exploiting parallelism, the choice of the computing unit (FU or ALU) is of paramount importance. FUs have limited functionality, so the reduced area of each unit allows a larger number of them in a fabric. However, for a given interconnect network, the interaction between the FUs is limited. For illustration, let us assume a CGRA that comprises FUs interconnected through a mesh-torus topology. Each FU interacts with its four neighbours. For better utilization, the types of these four FUs must be chosen carefully according to the instruction sequences found in the application domain. If an instruction sequence does not match the types of the neighbouring FUs, multi-hop communication to other FUs is required. This eventually lowers the utilization of the FUs and creates a bottleneck for exploiting parallelism. A full-fledged ALU supports a wide range of functionality, which increases utilization and the chance of exploiting parallelism.

Another design goal is to achieve high energy efficiency, which is obtained through data locality. In other words, data must stay as close as possible to the computing unit. As registers are the closest possible storage to the processing part, high utilization of the register files increases the possibility of achieving high energy efficiency. Apart from the regular array inputs and outputs, there are two kinds of data in an application: the recurring variables (repeatedly written and read) and the constants. In this thesis, we show that register files can be used efficiently to store the recurring variables present not only in the loops but anywhere in an application. The existing designs use the shared memory to store the constants. Indeed, storing constants in memory helps to reduce the instruction width or configuration overhead, but accessing a shared/central register file [8] or memory [96] results in higher latency and a larger number of load-store operations, degrading performance and energy efficiency. In this work, we handle constants by introducing the concept of a constant register file (CRF), which provides local access to constants at execution time.

- **Interconnect Network**: Since energy efficiency is a first-order design constraint in this thesis, the focus is on low-power choices for the interconnect networks.

For both statically and dynamically reconfigurable CGRAs, there are two phases in the execution process. The first is the *configuration phase*, when the fabric is reconfigured partially or fully. The second is the *compute phase*, when computations are performed on the data. To support these phases efficiently, two interconnect networks are needed: (a) a network to distribute the instructions and (b) a network to support the data flow. To choose the ideal interconnect networks, the frequency of the configuration and compute phases and the ratio between their effective times must be analysed for the given computation model.

To give a clear perspective, we first consider the case of a statically configurable CGRA, where each PE executes a single instruction throughout the execution. In this case, the ratio between the computation time and the configuration time is usually high. In other words, much more time is spent on computation than on configuration. Hence, for computation, low-cost interconnect networks (e.g. 2D mesh) provide better energy efficiency. Since the configuration or instructions are supplied to the PEs at once, a high-cost interconnect network can be afforded for better performance. If the size of the CGRA is small, then it may require frequent configuration. In this case, the better choice for the instruction delivery network may be a bus-based network [49].

In dynamically reconfigurable CGRAs, there may be two types of arrangements. The first uses a centralized configuration memory. The configurations or instructions are accessed by the PEs from the centralized memory in each cycle. Because of this centralized access, the PEs must fetch the configurations frequently. Since the configuration phase is frequent, it is convenient to merge the instruction and data distribution networks. In other words, the same interconnect network can be used for both instruction and data distribution. However, if the two are performed simultaneously, the loading of instructions affects the data movement because of the shared resources. Some of these conflicts can be avoided through an appropriate placement and routing of data and instructions onto the CGRA. Many designs, such as SmartCell and TRIPS, employ a NoC as a unified network, which gives great flexibility in routing data and instructions. In a NoC, the destination is specified as part of the packet, which is then routed based on a hard-wired routing algorithm. However, this flexibility comes at the cost of added power consumption for hardware routing and for composing and decomposing packets.

If the PEs contain local configuration or instruction memories, the solution can be arranged differently for efficiency. Although configuration happens in every cycle, no interconnect network is involved in delivering the configurations to the PEs. Instead, the configuration memories are filled before execution starts. Filling the configuration or instruction memories can thus be viewed as streaming the instructions before the compute stage starts in a statically reconfigurable CGRA, and the arrangement of the network may be similar.

- **Instruction set architecture**: It is essential to keep each processing element small to maximize the number of processing elements that can fit on a chip. Employing a simple instruction set architecture helps to minimize the cost of instruction fetching and decoding.

- **Control flow support**: Acceleration of applications generally depends on efficient computation of the innermost loop kernel. Usually, the host processor is in charge of initiating the outer loops. This scheme requires regular communication with the host processor, which in turn increases the synchronization overhead. For a low-power target, it becomes essential for the accelerator to support control flow, in order to minimize the communication with the host and the synchronization overhead.

- **Compilation**: Automated compilation tools that map applications onto the target architectures are required alongside the hardware designs. A good compilation tool must exploit data locality (see Chapter 4) for better energy efficiency.
2.1 Design Choices

Based on the discussion presented above, our reconfigurable fabric is designed as an interconnection of typically $4 \times 4$ processing elements, each containing an ALU. We employ a mesh-torus network for the data flow and a bus-based interconnection network for configuration. As the size of the CGRA, the interconnection topology and the RF size are important dimensions of the architecture, we performed experiments to support these design choices.
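
As a rough illustration of the mesh-torus data network, the sketch below computes the four neighbours of a PE in an $n \times n$ array with wrap-around links. The row-major PE numbering, the function name `torus_neighbours` and the north/east/south/west ordering are assumptions made for this example; they are not taken from the IPA implementation.

```c
#include <assert.h>

/* Illustrative only: PEs are assumed to be numbered row-major, 0 .. n*n-1. */
enum { NORTH = 0, EAST = 1, SOUTH = 2, WEST = 3 };

/* Fill out[4] with the indices of the four torus neighbours of PE `pe`
 * in an n x n array. Links wrap around at the array borders. */
void torus_neighbours(int n, int pe, int out[4])
{
    assert(n > 0 && pe >= 0 && pe < n * n);
    int row = pe / n;
    int col = pe % n;
    out[NORTH] = ((row + n - 1) % n) * n + col;
    out[EAST]  = row * n + (col + 1) % n;
    out[SOUTH] = ((row + 1) % n) * n + col;
    out[WEST]  = row * n + (col + n - 1) % n;
}
```

With this numbering, PE 0 of a $4 \times 4$ array has the neighbours 12, 1, 4 and 3, so every PE keeps exactly four single-hop neighbours, including those on the border.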

Nine applications from the signal processing domain have been used for our experiments (Table 2.1). We have used fully unrolled versions of these applications. The increased code size after full unrolling helps to understand how the limits on the size of the CGRA, the local RFs and the interconnect network impact performance. As the data flow graphs (DFGs) of the fully unrolled applications are mapped onto the CGRA, we consider the ASAP (as-soon-as-possible) schedule length of the DFGs as the best achievable performance. The number of cycles taken by the CGRA to compute a DFG approaches the ASAP length of the DFG if the particular configuration can exploit all the parallelism (maximum number of operations present in a cycle) available in the application.

In the experiments, we consider CGRAs with different dimensions ($3 \times 3$, $4 \times 4$, $5 \times 5$), different RF sizes (6, 8, 16, 24), and different point-to-point (P2P) topologies (mesh-torus, mesh-x, fully connected). Figure 2.1 presents the performance of a $3 \times 3$ CGRA with different RF sizes and topologies, normalized to the ASAP length of the application DFGs. Similarly, Figures 2.2 and 2.3 present the performance analysis for the $4 \times 4$ and $5 \times 5$ CGRAs. Latencies closer to the ASAP value represent better performance. A normalized latency of 0.5 means that the specific configuration is unable to find a mapping solution.

The performance trend is similar across the CGRA dimensions, except that the graphs for larger arrays contain fewer bars with 0.5 normalized latency, which implies that a mapping solution is more likely to be found on larger CGRAs.

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>nodes</th>
<th>ASAP</th>
<th>Parallelism</th>
</tr>
</thead>
<tbody>
<tr>
<td>2D Discrete Cosine Transform (DCT-2D)</td>
<td>711</td>
<td>81</td>
<td>32</td>
</tr>
<tr>
<td>matrix product</td>
<td>504</td>
<td>98</td>
<td>32</td>
</tr>
<tr>
<td>Fast Fourier Transform (FFT)</td>
<td>1348</td>
<td>37</td>
<td>64</td>
</tr>
<tr>
<td>Trapezoidal (Trapez) filter</td>
<td>332</td>
<td>59</td>
<td>32</td>
</tr>
<tr>
<td>Exponential Moving Average Filter (EMA)</td>
<td>412</td>
<td>99</td>
<td>38</td>
</tr>
<tr>
<td>Moving Window De-convolution (MWD)</td>
<td>440</td>
<td>112</td>
<td>32</td>
</tr>
<tr>
<td>Unsharp Mask</td>
<td>91</td>
<td>27</td>
<td>16</td>
</tr>
<tr>
<td>Elliptic Filter</td>
<td>130</td>
<td>31</td>
<td>16</td>
</tr>
<tr>
<td>DC Filter</td>
<td>507</td>
<td>96</td>
<td>32</td>
</tr>
</tbody>
</table>
This is as expected. However, for this extensive set of experiments, the $4 \times 4$ CGRA with an RF size of 8 is able to find solutions for all the applications with the minimum overhead among all the combinations of CGRA configurations. Fig. 2.2 shows that increasing the RF size does not yield a significant performance gain: beyond a certain RF size (which depends on the application), the performance no longer increases. Performance improves with increased interconnection complexity, but the gains are too small to justify a more complex solution.

With the choice of a $4 \times 4$ CGRA with RF size of 8 and mesh-torus topology we move forward to design the novel CGRA architecture referred to as Integrated Programmable Array (IPA) [24].

Fig. 2.3 Latency comparison for 5×5 CGRA

To cope with the ultra-low power profile and memory sharing challenges, IPA involves a multi-bank Tightly Coupled Data Memory (TCDM) coupled with a flexible and configurable memory hierarchy for data storage. As shown in Figure 2.4, from an architectural viewpoint, point-to-point data communication between processing elements (PEs) during kernel execution (Figure 2.4(b)) represents a key advantage over the energy-hungry data sharing over shared memory that is required when using a traditional processor-cluster architecture (Figure 2.4(a)) for parallel processing. Table 2.2 shows that the IPA cluster performs fewer memory operations on the sample program presented in Listing 2.1, which in turn gives an energy improvement of 1.3× over the clustered multi-core architecture, which performs data sharing through the TCDM. In this comparison, we even ignore the barrier synchronization overheads in the multi-core cluster for the sake of simplicity.

The IPA approach significantly reduces the pressure on L1 memory. Hence, it requires a smaller number of banks to achieve low contention [89]. As opposed to clustered multi-core architectures, where data exchange among cores is managed through shared data structures and OpenMP parallel processing constructs, in CGRAs the compiler must take care of data exchange among PEs by exploiting point-to-point connections among the PEs as much as possible to minimize shared memory accesses.

```
/* First loop: element-wise product stored to A. */
for (i = 0; i < 1; i++)
{
    A[i] = B[i] * C[i];
}
/* Second loop: accumulate A into sum. */
for (i = 0; i < 1; i++)
{
    sum = sum + A[i];
}
```

Listing 2.1 Sample program to execute in the multi-core and IPA cluster

Fig. 2.4 (a) multi-core Cluster and (b) IPA cluster executing the sample program in Listing 2.1.

Table 2.2 Energy consumption comparison between multi-core and IPA while executing the sample program in Figure 2.4(a)

<table>
<thead>
<tr>
<th></th>
<th>Load-Store operation</th>
<th>Arithmetic operation</th>
<th>MOV operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Average energy consumption (pJ/operation)</td>
<td>4.2</td>
<td>3.4</td>
<td>3.1</td>
</tr>
</tbody>
</table>

Total energy consumption

<table>
<thead>
<tr>
<th></th>
<th>Total #Load-Store</th>
<th>Total #Arithmetic</th>
<th>Total #MOV</th>
<th>Energy (pJ)</th>
<th>Gain</th>
</tr>
</thead>
<tbody>
<tr>
<td>Multi-Core</td>
<td>8</td>
<td>3</td>
<td>0</td>
<td>43.8</td>
<td>-</td>
</tr>
<tr>
<td>IPA</td>
<td>4</td>
<td>3</td>
<td>2</td>
<td>33.2</td>
<td>1.3x</td>
</tr>
</tbody>
</table>
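
The totals in Table 2.2 follow directly from the per-operation energies and the operation counts. The sketch below reproduces that arithmetic; the function and macro names are chosen for this example only.

```c
#include <assert.h>

/* Average energy per operation in pJ, taken from Table 2.2. */
#define E_LOAD_STORE_PJ 4.2
#define E_ARITH_PJ      3.4
#define E_MOV_PJ        3.1

/* Total energy in pJ for a given operation mix. */
double total_energy_pj(int n_load_store, int n_arith, int n_mov)
{
    assert(n_load_store >= 0 && n_arith >= 0 && n_mov >= 0);
    return n_load_store * E_LOAD_STORE_PJ
         + n_arith * E_ARITH_PJ
         + n_mov * E_MOV_PJ;
}

/* Energy gain of the IPA over the multi-core cluster for Listing 2.1. */
double ipa_gain(void)
{
    double multi_core = total_energy_pj(8, 3, 0); /* 43.8 pJ */
    double ipa        = total_energy_pj(4, 3, 2); /* 33.2 pJ */
    return multi_core / ipa;                      /* about 1.32 */
}
```

The two extra MOV operations on the IPA cost 6.2 pJ, which is less than the 16.8 pJ saved by removing four load-store operations; this is where the 1.3× gain comes from.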

2.2 Integrated Programmable Array Architecture

The architecture comprises a PE array, a global context memory, a controller, a tightly coupled data memory (TCDM) with multiple banks and a logarithmic interconnect. Figure 2.5 shows the organization of the IPA. In the following we discuss the components of the IPA fabric.

2.2.1 IPA components

Global Context Memory (GCM)

The configurations for the PEs are stored in the Global Context Memory. Before computation starts in the PE array, the configurations are loaded into the PEs through the bus-based network. The configurations contain the instructions and the non-recurring variables (constants), which are stored in the instruction register file and the constant register file of each PE, respectively.

IPA Controller (IPAC)

The IPA controller identifies the configuration data for each PE and transfers it in the load-context stage. It also initiates the execution phase after all the contexts are loaded. In addition, the IPAC handles the important task of synchronizing with the host processor, which is discussed in the following chapter, where we integrate the IPA in a multi-core platform.

PE Array (PEA)

The PE Array follows the multiple instruction, multiple data (MIMD) model of computation: each PE can execute a different set of instructions. A bus-based interconnect network is used to load instructions and constants (i.e. the context) from the GCM into the PEs, whereas the torus network is used during the execution phase for low-power data communication between the PEs. The details of the load-context protocol are discussed later in this chapter. To achieve low-power execution, the instruction set architecture was designed from scratch, resulting in a 20-bit instruction. We took advantage of the visibility of the micro-architecture to the compiler and moved the immediate data to the constant register file in the PEs (discussed later), which eases the compression of the instruction and puts little pressure on the decoder. The components of the PEs are detailed in the following.

The PE array consists of a parametric number of PEs (the optimal number of PEs is studied in Section 2.1), connected by a mesh-torus network for the data flow and a bus-based network for instruction distribution. Figure 2.6 describes the components of a PE. Two multiplexers (IN0 and IN1) select the inputs of each PE. The input sources are the neighbouring PEs and the register file. A 32-bit ALU and a 16-bit × 16-bit multiplier with a 32-bit result are employed in this block. The Load Store Unit (LSU) is optional for the PEs (the optimal number of LSUs is a parameter studied later in this chapter). The Control unit is responsible for fetching the instruction from the corresponding address of the instruction memory and for managing the program flow. The Constant Register File (CRF) stores the non-recurring values or constants, while the Regular Register File (RRF) and the Output Register (OPR) store the recurring variables.

Control flow support: In order to minimize the synchronization overhead with the host processor, the PEs support branch instructions for executing loops and conditionals. The Controller in the PE fetches the instructions from the Instruction Register File (IRF). If the decoded instruction is a *jump*, the target address of the *jump* is stored in the Jump Register (JR). The *cjump* (conditional jump) instruction contains two target addresses. The target to follow is selected in the JR according to the Boolean OR of the Condition Register (CR) bits of the PEs.
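
A minimal sketch of the *cjump* resolution described above is given below. It assumes that the CR bits of all PEs are gathered in one bit vector and that a result of 1 from the OR selects the true-path address; the helper name `cjmp_target` and the 8-bit address type are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: cr_bits holds one condition bit per PE
 * (bit i = Condition Register of PE i, 16 PEs assumed). */
uint8_t cjmp_target(uint16_t cr_bits, uint8_t true_addr, uint8_t false_addr)
{
    int taken = (cr_bits != 0);   /* Boolean OR over all CR bits */
    return taken ? true_addr : false_addr;
}
```

Because every PE evaluates the same OR over the same CR bits, all PEs load the same target and therefore jump to the same basic block.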

Power Management Unit (PMU)

To reduce dynamic power consumption in idle mode, each PE contains a tiny Power Management Unit (PMU), which clock gates the PE when idle. An idle condition for a PE arises in three situations: (i) Unused PE: the PE is not used by the mapping; (ii) Load-store stall: in case of a TCDM bank conflict, the PMU generates a *global stall*, which is broadcast to all the PEs. Until the global stall is resolved, all the PEs are clock gated by their corresponding PMUs. LSUs are placed in the global clock region (Figure 2.6) to avoid deadlocks; (iii) Multiple NOP operations: a NOP instruction contains the number of successive NOPs. When a NOP instruction is fetched, the decoder loads this number into a counter within the PMU. The *clockgate_en* signal remains low until the count reaches zero. The counter is halted when a global stall occurs and resumes counting after the stall is resolved, which keeps the execution flow synchronized among the PEs.

Due to the fine-grained nature of the power management, more aggressive power gating is not implemented, since it would impose a large area penalty without significant benefit: the leakage power of each tile is so small that it does not significantly change the energy efficiency when the rest of the system is active.
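
The NOP counter behaviour of the PMU can be sketched as a small cycle-level model. This is a simplification written for illustration, not the RTL: the structure and function names are hypothetical, and the exact cycle in which *clockgate_en* rises is an assumption.

```c
#include <assert.h>

/* Simplified PMU state for one PE (illustrative, not the RTL). */
struct pmu {
    unsigned nop_count;   /* remaining NOP cycles */
};

/* Called when the decoder fetches a NOP carrying `n` successive NOPs. */
void pmu_load_nops(struct pmu *p, unsigned n)
{
    assert(p != 0);
    p->nop_count = n;
}

/* Advance one clock cycle. Returns the value of clockgate_en:
 * 0 while the PE is gated, 1 when it may execute. */
int pmu_tick(struct pmu *p, int global_stall)
{
    assert(p != 0);
    if (global_stall)
        return 0;             /* all PEs gated; counter is frozen */
    if (p->nop_count > 0) {
        p->nop_count--;       /* count down one NOP cycle */
        return 0;
    }
    return 1;
}
```

Freezing the counter during a global stall means that a PE waiting on NOPs stays idle for the same number of effective cycles as the other PEs, which is what keeps the array in lockstep.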

TCDM and logarithmic interconnect

The TCDM acts as L1 memory for the IPA. Featuring a number of ports equal to the number of memory banks, it provides concurrent access to different memory locations. The TCDM is interfaced with the LSUs of the PE array through a low latency, logarithmic interconnect [83], implementing a word level interleaving scheme to minimize access contention. To optimize the performance and energy efficiency, we explore the IPA architecture with special focus on shared memory access in the next section.
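
Word-level interleaving can be illustrated with the usual address-to-bank mapping, in which consecutive words fall into consecutive banks. The sketch assumes byte addressing and 32-bit words; the function names and the bank count used in the example are illustrative, not taken from the IPA design.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BYTES 4u   /* 32-bit words (assumption) */

/* Bank that holds the word at byte address `addr`. */
unsigned tcdm_bank(uint32_t addr, unsigned n_banks)
{
    assert(n_banks > 0);
    return (addr / WORD_BYTES) % n_banks;
}

/* Row (word offset) inside that bank. */
uint32_t tcdm_row(uint32_t addr, unsigned n_banks)
{
    assert(n_banks > 0);
    return (addr / WORD_BYTES) / n_banks;
}
```

With 8 banks, byte addresses 0, 4, ..., 28 map to banks 0 to 7, so LSUs that access consecutive array elements in the same cycle hit different banks and do not conflict.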

2.2.2 Computation Model

After compiling a kernel (see Chapter 5), the compiler generates the assembly and the addresses of the input and output data in the local shared memory. The assembler takes the assembly and the Instruction Set Architecture (ISA) of the IPA to generate the context (i.e. the program to be stored in the IRF) for each PE, which is pre-loaded in the GCM. The context contains the instructions and constants for each PE in the array. Before execution starts, the context is loaded into the corresponding IRF and CRF of the PEs. We assume that the code fits in the local memory. Larger execution contexts can be handled using the IPA controller and overlays.

Fig. 2.6 Components of PE

**Load context**

Figure 2.7 shows the configuration network used to load the context into each PE. In each cycle of this stage, the IPAC receives a context word from the GCM and broadcasts it to the PEs. For PEs that share the same instructions, broadcast mode is used to distribute the instructions to the whole set of PEs. To load PEs with non-identical sets of instructions and constants, normal addressing mode is used. The organization of the context word in these two modes is described in the following.

![Configuration Network](image)

**Fig. 2.7 The configuration network for load-context**

The GCM (Figure 2.8) contains the context of the PEs. Each address in the GCM contains a 64-bit context word. To distinguish between several sets of instructions and constants, the GCM is divided into several segments (Table 2.3), where each segment contains a set of instructions and constants to be broadcast or normally addressed. The first bit in each segment indicates whether the next set of instructions uses broadcast (0) or normal addressing mode (1). In broadcast mode, the following 16 bits represent the mask, where the positions of the high bits give the addresses of the PEs targeted by the broadcast. In normal addressing mode, only 4 bits are used to address the target PE.

Table 2.3 Structure of a segment

<table>
<thead>
<tr>
<th>Number of bits</th>
<th>Encoded information</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Addressing modes</td>
</tr>
<tr>
<td>4/16</td>
<td>Normal Address/Mask</td>
</tr>
<tr>
<td>6</td>
<td>Total number of instructions (N)</td>
</tr>
<tr>
<td>4</td>
<td>Total number of constants (M)</td>
</tr>
<tr>
<td>20xN</td>
<td>Instructions</td>
</tr>
<tr>
<td>32xM</td>
<td>Constants</td>
</tr>
</tbody>
</table>
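
As a worked example of Table 2.3, the sketch below adds up the field widths of one segment. It assumes that the fields are packed back to back, with no padding to 64-bit word boundaries; the macro and function names are chosen for this example.

```c
#include <assert.h>

/* Field widths in bits, from Table 2.3. */
#define MODE_BITS        1
#define NORMAL_ADDR_BITS 4
#define MASK_BITS        16
#define N_INSTR_BITS     6
#define N_CONST_BITS     4
#define INSTR_BITS       20
#define CONST_BITS       32

/* Size in bits of one GCM segment holding n_instr instructions and
 * n_const constants. `broadcast` selects the 16-bit mask instead of
 * the 4-bit PE address. */
unsigned segment_bits(int broadcast, unsigned n_instr, unsigned n_const)
{
    assert(n_instr < (1u << N_INSTR_BITS));  /* count must fit in 6 bits */
    assert(n_const < (1u << N_CONST_BITS));  /* count must fit in 4 bits */
    unsigned header = MODE_BITS
                    + (broadcast ? MASK_BITS : NORMAL_ADDR_BITS)
                    + N_INSTR_BITS + N_CONST_BITS;
    return header + INSTR_BITS * n_instr + CONST_BITS * n_const;
}
```

Under this packing assumption, a segment with 12 instructions and 2 constants takes 319 bits in normal addressing mode and 331 bits in broadcast mode; the 12 extra bits are the difference between the mask and the PE address.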

Fig. 2.8 Segments of the GCM

The format of the address and data buses in the configuration network is presented in Figure 2.9. The address bus encodes 22 bits of information: the 16-bit mask or address of the target PE, 1 bit to select the IRF or the CRF, and a 5-bit address. The 64-bit data bus carries either three 20-bit instructions ($3 \times 20$ bits) or two 32-bit constants ($2 \times 32$ bits).
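
The bus formats can be sketched as simple packing routines. The field widths come from the text above, but the bit ordering within each bus is an assumption made for this example (Figure 2.9 defines the actual layout), and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the 22-bit address bus (MSB to LSB):
 * [21:6] PE mask or address, [5] IRF/CRF select, [4:0] register address. */
uint32_t pack_cfg_addr(uint16_t pe_mask, int select_crf, uint8_t reg_addr)
{
    assert(reg_addr < 32);                      /* 5-bit address */
    return ((uint32_t)pe_mask << 6)
         | ((uint32_t)(select_crf ? 1 : 0) << 5)
         | reg_addr;
}

/* Pack three 20-bit instructions into one 64-bit data word.
 * Assumed order: instr[0] in the lowest bits; the top 4 bits stay 0. */
uint64_t pack_instr_word(const uint32_t instr[3])
{
    uint64_t word = 0;
    for (int i = 0; i < 3; i++) {
        assert(instr[i] < (1u << 20));
        word |= (uint64_t)instr[i] << (20 * i);
    }
    return word;
}

/* Pack two 32-bit constants into one 64-bit data word (c0 in the low half). */
uint64_t pack_const_word(uint32_t c0, uint32_t c1)
{
    return ((uint64_t)c1 << 32) | c0;
}
```

Three instructions occupy 60 of the 64 data bits, whereas two constants fill the word completely, so one bus transfer loads either three IRF entries or two CRF entries.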

Execution

In every cycle, each PE fetches a 20-bit instruction from its local IRF. Table 2.4 describes the instruction format. The first field of the instruction holds the opcode, which is 5 bits wide and supports a maximum of 32 different operations. Details of the supported operations are given in Table 2.5.
Table 2.4 Instruction format

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Output Reg type</th>
<th>Dest Reg Addr</th>
<th>IN0 Type</th>
<th>IN0 Addr</th>
<th>IN1 Type</th>
<th>IN1 Addr</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jmp</td>
<td>Address</td>
<td>unused</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Cjmp</td>
<td>Address of the true path</td>
<td>Address of the false path</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>NOP</td>
<td>Number of consecutive NOPs</td>
<td>unused</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Fig. 2.10 (a) Sample program (b) Execution in CPU (c) Example PEA (d) Execution in IPA (e) Execution metrics in CPU and IPA
Figure 2.10 shows the execution of a sample program on a traditional CPU and on the IPA. The total number of instructions for the sample program is 31 on the CPU and 12 on the IPA. Also, the IPA achieves a $28 \times$ performance gain over the CPU while executing the sample program. The decrease in the number of instructions on the IPA in this specific example is mainly due to the much lower number of memory operations and to the fact that the small loop can be completely unrolled without code size blow-up.

### 2.3 Conclusion

In this chapter, we presented the design of a CGRA targeting ultra-low power computing. The proposed *Integrated Programmable-Array* (IPA) is a 2-D array of $N \times N$ processing elements involving two layers of interconnect network. The context distribution network uses a bus-based solution for better performance, while the data distribution network uses a mesh-torus based solution for better energy efficiency. The proposed design leverages a multi-banked tightly coupled data memory for data storage to ease the integration in clustered multi-core architectures. The compilation flow for the IPA is presented in the next chapter. The succeeding chapter evaluates the performance and energy efficiency along with the implementation of the IPA.
Table 2.5 Summary of the opcodes (R = Result, C = Condition bit)

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Opcode</th>
<th>Instruction</th>
<th>Operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>NOP</td>
<td>0x00</td>
<td>No operation</td>
<td>-</td>
</tr>
<tr>
<td>UADD</td>
<td>0x01</td>
<td>Unsigned addition</td>
<td>$R = (U)\,\text{Op1} + \text{Op2}$, $C = 0$</td>
</tr>
<tr>
<td>SADD</td>
<td>0x02</td>
<td>Signed addition</td>
<td>$R = \text{Op1} + \text{Op2}$, $C = 0$</td>
</tr>
<tr>
<td>SSUB</td>
<td>0x03</td>
<td>Signed subtraction</td>
<td>$R = \text{Op1} - \text{Op2}$, $C = 0$</td>
</tr>
<tr>
<td>SMUL</td>
<td>0x04</td>
<td>Signed multiplication</td>
<td>$R = \text{Op1} \times \text{Op2}$, $C = 0$</td>
</tr>
<tr>
<td>LS</td>
<td>0x06</td>
<td>Shift left</td>
<td>$R = \text{Op1} \ll \text{Op2}$, $C = 0$</td>
</tr>
<tr>
<td>RS</td>
<td>0x06</td>
<td>Shift right</td>
<td>$R = \text{Op1} \gg \text{Op2}$, $C = 0$</td>
</tr>
<tr>
<td>LD</td>
<td>0x07</td>
<td>Load</td>
<td>-</td>
</tr>
<tr>
<td>STR</td>
<td>0x09</td>
<td>Store</td>
<td>-</td>
</tr>
<tr>
<td>AND</td>
<td>0x0b</td>
<td>Bit-wise AND</td>
<td>$R = \text{Op1} \mathbin{\&amp;} \text{Op2}$, $C = 0$</td>
</tr>
<tr>
<td>OR</td>
<td>0x0c</td>
<td>Bit-wise OR</td>
<td>$R = \text{Op1} \mid \text{Op2}$, $C = 0$</td>
</tr>
<tr>
<td>NOT</td>
<td>0x0d</td>
<td>Bit-wise NOT</td>
<td>$R = \overline{\text{Op1}}$, $C = 0$</td>
</tr>
<tr>
<td>XOR</td>
<td>0x0e</td>
<td>Bit-wise XOR</td>
<td>$R = \text{Op1} \oplus \text{Op2}$, $C = 0$</td>
</tr>
<tr>
<td>MOV</td>
<td>0x0f</td>
<td>Copy input to output</td>
<td>$R = \text{Op1}$, $C = 0$</td>
</tr>
<tr>
<td>LTE</td>
<td>0x10</td>
<td>Conditional less than equal</td>
<td>$C = 1$ if $\text{Op1} \leq \text{Op2}$, else $C = 0$</td>
</tr>
<tr>
<td>GTE</td>
<td>0x11</td>
<td>Conditional greater than equal</td>
<td>$C = 1$ if $\text{Op1} \geq \text{Op2}$, else $C = 0$</td>
</tr>
<tr>
<td>NE</td>
<td>0x12</td>
<td>Conditional not equal</td>
<td>$C = 1$ if $\text{Op1} \neq \text{Op2}$, else $C = 0$</td>
</tr>
<tr>
<td>EOC</td>
<td>0x1f</td>
<td>End of computation</td>
<td>-</td>
</tr>
</tbody>
</table>
Chapter 3

Compilation flow for the Integrated Programmable Array Architecture

Over the last twenty-five years, CGRAs have been an active field of research. However, the lack of an efficient, automated compiler prevents the widespread use of CGRAs. As opposed to general-purpose computing platforms, the micro-architecture of a CGRA must be visible to the compiler, so that it can improve performance by exploiting the underlying interconnect network and the distribution of register files.

In this chapter, we discuss the design of a compiler that maps programs onto a CGRA, specifically the IPA. First, we present the background and the problems of mapping applications onto CGRAs. Then, we study the design of the compiler based on a CGRA model, which can be varied to accommodate a wide range of CGRA designs.

3.1 Background

As discussed earlier, the compiler must know the underlying architecture of the CGRA. It therefore takes two inputs. The first is the architecture model (the PE array (PEA) of the IPA), and the second is the application described in a high-level language, in our case ANSI C.

3.1.1 Architecture model

The PEA is modelled by a bipartite directed graph with two types of nodes: operators and registers. Timing is implicitly represented by the connections between registers and operators, which is referred to as the time-extended model of the PEA [44]. Two types of operator nodes are defined for the PEAs. The first type is the computing operator (functional unit (FU) nodes in Figure 3.1(a)), which represents the physical implementation of an arithmetic or logical operation ($+$, $\times$, $-$, OR, AND) and/or a memory access (e.g. load/store). The second type is the memorization operator (circular nodes in Figure 3.1(b)). It is associated with the output register and represents the operation of explicitly keeping a value in a local register.

Figure 3.1(a) shows a sample PEA with two PEs connected by a torus network. Each PE has 3 registers in the distributed register file, and a single output register. Figure 3.1(b) represents the time extended model of the PEA shown in Figure 3.1(a).

In this model, one can vary the interconnect network, the distribution and size of the register file, and the type of the PE, to explore different PEA designs.

### 3.1.2 Application model

The application is modelled as a control and data flow graph (CDFG). Supporting control flow gives the opportunity to accelerate a kernel without any intervention of the host processor. A CDFG is depicted as $G = (V, E)$ where $V$ is the set of basic blocks and $E \subseteq V \times V$ is the set of directed edges representing control flow. A Basic Block (BB) is represented as a data flow graph (DFG) or $BB = (D, O, A)$ where $D$ is the set of data nodes, $O$ is the set of operation nodes and $A$ is the set of arcs representing dependencies. The control flow from one basic block to another is supported with jump ($jmp$) and conditional jump ($cjmp$) instructions.
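
One possible in-memory representation of this model is sketched below. The type names, the fixed limit of two successors per block (one for $jmp$, two for $cjmp$), and the choice to store only the sizes of $D$, $O$ and $A$ are simplifying assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SUCC 2   /* jmp has one successor, cjmp has two */

/* One basic block BB = (D, O, A). Only the sizes are kept here; a real
 * compiler would store the node and arc lists themselves. */
struct basic_block {
    size_t n_data;            /* |D|: data nodes */
    size_t n_ops;             /* |O|: operation nodes */
    size_t n_arcs;            /* |A|: dependency arcs */
    int    succ[MAX_SUCC];    /* control flow edges; -1 if unused */
};

/* CDFG G = (V, E): V is the block array, E is stored in succ[]. */
struct cdfg {
    struct basic_block *blocks;
    size_t n_blocks;
};

/* Number of outgoing control flow edges of block b. */
int out_degree(const struct cdfg *g, size_t b)
{
    assert(g != NULL && b < g->n_blocks);
    int n = 0;
    for (int i = 0; i < MAX_SUCC; i++)
        if (g->blocks[b].succ[i] >= 0)
            n++;
    return n;
}
```

In this representation, a block ending in $cjmp$ has an out-degree of 2, a block ending in $jmp$ has an out-degree of 1, and the final block of the program has none.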

Figure 3.2 shows the CDFG representation of the sample program presented in Listing 3.1. In the figure, basic blocks are represented as blue rectangles. The flow from one basic block to another is represented by black arrows and managed by a simple branch ($jmp$) operation. The true and false paths of a conditional, managed by $cjmp$, are shown by solid and dashed arrows respectively. The execution flow of the CDFG is: $BB_1 \rightarrow BB_2 \rightarrow (\text{either } BB_3 \text{ or } BB_8)$; if $BB_3$ is taken, $BB_3 \rightarrow BB_4 \rightarrow (\text{either } BB_5 \text{ or } BB_6) \rightarrow BB_7 \rightarrow BB_2 \cdots$.

In order to maintain the execution flow, all the PEs in the array must be synchronized to the execution of the same basic block. When the execution flow jumps from one basic block to another, all the PEs in the PEA must move to the new basic block together. Since only one basic block is executed at a time, all the PEs can be used, concurrently or sequentially, while executing a single basic block. Dually, several basic blocks can use the same PE. The synchronized execution allows the compiler to map several operations and data onto the same PE. Next, we present the homomorphism between the application model and the architecture model, which supports the different stages of the compilation flow.

Fig. 3.1 (a) A $2 \times 1$ PEA with 3 registers in the RF and one output register (b) A possible mapping of (c) onto the PEA over 4 cycles using the register allocation based approach (c) CDFG model (d) The transformed CDFG of (c) for the systematic load-store based approach (e) A possible mapping of (d) onto the PEA over 7 cycles using the systematic load-store based approach

```
// Sample program to demonstrate CDFG model
X1 = 10;
X2 = 20;
X3 = 500;
X4 = 30;
X5 = 50;
for (i = 0; i < q; i++)
{
    a = m[i] * X1;
    b = n[i] * X2;
    c = b + a;
    if (c < X3)
        p[i] = c + X4;
    else
        p[i] = c - X5;
}
```

Listing 3.1 Sample program with control flow

Fig. 3.2 CDFG representation of the sample program in Listing 3.1
3.1.3 Homomorphism

The basic blocks in the CDFG, presented in Figure 3.1(c), are composed of data nodes, operation nodes, and data dependencies. Three equivalences between the basic block DFGs and PEA model nodes are defined: (1) data and registers; (2) computation and computing operators; (3) data dependences and connection between the time extended PE components. As the two models are homomorphic, the mapping of a DFG onto the PEA is therefore a problem equivalent to finding a DFG in the PEA graph.

Figure 3.1(b) represents a possible mapping of the sample CDFG in Figure 3.1(c) onto the PEA in Figure 3.1(a) over 4 cycles.

3.1.4 Supporting Control Flow

One of the major challenges associated with all accelerators is handling the control flow of applications effectively. Since the goal of the compiler presented in this chapter is to execute a complete program efficiently on a CGRA, control flow here means not only the conditionals inside a loop body, but any conditional or unconditional branch in general. For better understanding, we classify the control flow into three categories, as presented in Figure 3.3. Unconditional branches can be optimized by merging basic blocks, or straightening, which applies to pairs of basic blocks such that the first has no successors other than the second and the second has no predecessors other than the first. If more than one basic block remains in a program after optimization, which is often the case, the underlying accelerator must support unconditional branches to avoid host interference.
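The straightening condition stated above can be checked mechanically on a control flow graph. The following is a minimal sketch in C; the `BasicBlock` record, the `MAX_EDGES` limit and the function name `can_straighten` are hypothetical and only serve the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EDGES 4   /* illustrative limit on edges per block */

/* Hypothetical basic-block record holding CFG edges as block indices. */
typedef struct {
    size_t nb_succ, nb_pred;
    size_t succ[MAX_EDGES];
    size_t pred[MAX_EDGES];
} BasicBlock;

/* Straightening applies to the pair (first, second) when `first` has no
   successor other than `second` and `second` has no predecessor other
   than `first`. */
bool can_straighten(const BasicBlock *cfg, size_t first, size_t second)
{
    return cfg[first].nb_succ == 1 && cfg[first].succ[0] == second &&
           cfg[second].nb_pred == 1 && cfg[second].pred[0] == first;
}
```

A block that is also the target of a loop back edge has two predecessors, so it cannot be merged with its predecessor; such blocks are the reason an unconditional branch must still be supported after optimization.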

Fig. 3.3 Classification of control flow present in an application

The fundamental problem with conditionals is that the outcome of the branch is known only at runtime, which makes effective resource allocation difficult. Hardware accelerators and FPGAs execute both paths of a conditional branch in parallel, and then choose the results of the true path. This wastes resources and power. GP-GPUs also schedule the instructions and allocate resources for both paths of a conditional, but at runtime, instructions from the false path are not issued. This saves power, but the cycles and resources allocated for the not-taken path are still wasted. In the graphics processing community, this is referred to as the problem of branch divergence. CGRAs widely use predication techniques to deal with conditionals. Compilers fundamentally adopt partial and full predication, which are now discussed along with some other notable schemes.

Partial predication

Since conditionals are constructed by if-then-else (ITE), in partial predication, the operations of both the if-part and the else-part are mapped on different PEs. If the same variable needs to be updated in both the if-part and the else-part, the final result is computed by selecting the output from the true path, which is decided at runtime. This is achieved through a special operation, named select, which takes in the result of the branch condition from predicates\(^1\), and two updated values of the variable to select the correct one. If a variable is to be updated in only one path, a select operation is still necessary to maintain the validity of the variable for the upcoming cycles.

Figure 3.4 (a) shows the partial predication transformation of the CDFG presented in Figure 3.1(c), and mapping of the transformed DFG onto the CGRA (Figure 3.1(a)) in Figure 3.4(b). To map a conditional that has \(n\) operations on each path, the number of operations for partial predication transformation is, in the worst-case, \(3n\). This is because all the operations from both the paths must be mapped (2n), as well as the select operations (n), assuming the worst-case produces outputs in each operation, which are used outside of the conditionals.
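As a rough illustration, the conditional of Listing 3.1 can be written in partially predicated form in C. The names `select_op` and `ite_partial` are hypothetical, and this sequential sketch only mimics the dataflow: on a CGRA the two arms would execute on different PEs.

```c
#include <stdbool.h>

/* Hypothetical model of the select operation: picks the value of the
   path chosen by the predicate. */
static int select_op(bool pred, int true_val, int false_val)
{
    return pred ? true_val : false_val;
}

/* Partially predicated form of the conditional in Listing 3.1:
   both arms are computed unconditionally, then one result is selected. */
int ite_partial(int c, int x3, int x4, int x5)
{
    bool pred    = c < x3;   /* branch condition, known only at runtime */
    int  p_true  = c + x4;   /* if-part   (would run on one PE)         */
    int  p_false = c - x5;   /* else-part (would run on another PE)     */
    return select_op(pred, p_true, p_false);
}
```

Here each path has one operation ($n = 1$), and the transformed code contains three: the two arms plus one select, which matches the $3n$ worst case.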

Full predication

Full predication executes the two paths sequentially. Unlike partial predication, full predication does not need the select operation. Instead, the operations that update the same variable are mapped to the same PE but in different cycles. Since only one of the operations will be executed at runtime (and the other will be squashed), the correct value of the output is present in the register file of that PE by the end of that iteration. If the paths have different variables to update, then they can be mapped to different PEs. This is done so that after executing an ITE, for each variable there is a unique PE that holds its value, and therefore no select operation is required.

\(^1\)A predicated network in hardware is necessary to support the execution

Figure 3.4 (a) shows the full predication transformation of the CDFG presented in Figure 3.1(c), and mapping of the transformed DFG onto the CGRA (Figure 3.1(a)) in Figure 3.4(b). Since both operations update the same variable in this case, they are mapped onto the same PE, and the output is validated at the end of the ITE execution. For a conditional with \( n \) operations in each path, the full predication DFG transformation costs \( 2n \) operations in the worst case. Since the execution of one path gets squashed, this technique incurs a performance penalty.
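The same conditional of Listing 3.1 can be sketched in fully predicated form. The helpers `pred_add` and `pred_sub` are hypothetical models of guarded operations, and the cycle labels in the comments are illustrative rather than taken from an actual schedule.

```c
#include <stdbool.h>

/* Hypothetical model of one predicated operation: the register is
   updated only when the guard holds, otherwise the operation is
   squashed. */
static void pred_add(bool guard, int *reg, int a, int b)
{
    if (guard)
        *reg = a + b;
}

static void pred_sub(bool guard, int *reg, int a, int b)
{
    if (guard)
        *reg = a - b;
}

/* Fully predicated form of the conditional in Listing 3.1: the two
   operations target the same register in consecutive cycles. */
int ite_full(int c, int x3, int x4, int x5)
{
    bool pred = c < x3;
    int  p = 0;                   /* register holding p[i]  */
    pred_add(pred,  &p, c, x4);   /* cycle t:   if-part     */
    pred_sub(!pred, &p, c, x5);   /* cycle t+1: else-part   */
    return p;
}
```

No select is needed because both operations write the same register, but the two slots are issued one after the other, which is where the performance penalty comes from.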

**Others**

The dual issue scheme [45] targets energy efficiency by issuing two instructions to a PE simultaneously, one from the if-path and another from the else-path. In this mechanism, the latency remains similar to that of partial predication, with improved energy efficiency. However, this approach is too restrictive as far as imbalanced and nested conditionals are concerned. To map nested, imbalanced conditionals and single loops onto a CGRA, the triggered long instruction set architecture (TLIA) is presented in [67]. This approach merges all the conditionals present in kernels into triggered instructions, and creates an instruction pool for each triggered instruction. As the depth of the nested conditionals increases, the performance of this approach decreases. As far as loop nests are concerned, the TLIA approach reaches a bottleneck when accommodating the large set of triggered instructions in the limited set of PEs.

In this chapter, we address this problem by introducing a register allocation mapping approach where both the true and the false path can reuse the resources, preventing the waste of additional resources and power. This makes it possible to map both loops and conditionals of any depth. In our case, the only limitation in the mapping of kernels onto the CGRA is given by the size of the instruction memory of the PEs, and not by the structure of the application (i.e. number of loops and branches). Also, the size of the code segment executed on the CGRA can be increased as much as possible, minimizing the control and synchronization overheads with the core, which are not negligible in the other approaches.

Traditional CGRAs manage to execute only the innermost loop, since they lack support for branches. Traditional software pipelining is an excellent choice for accelerating the innermost loop only. The compilation flows proposed in [79], [72], [44], [42], [34], [14] use modulo scheduling [85] for innermost loop pipelining. For the outer loops, the CPU or the host initiates each iteration. As the depth of the loop nest increases, the communication overhead grows, with a penalty in both performance and power. Moreover, software pipelining faces several limitations, such as in-loop function calls\(^2\) and multiple exits inside the loop. Loops with uncertain exits, such as the loop computing the greatest common divisor in Listing 3.2, do not qualify for software pipelining either.

```
// greatest common divisor (gcd)

int res;  // result, written once the loop exits

// Assumes n1 and n2 are positive; the trip count depends on the data.
void gcd(int n1, int n2)
{
    while (n1 != n2)
    {
        if (n1 > n2)
            n1 -= n2;
        else
            n2 -= n1;
    }
    res = n1;
}
```

Listing 3.2 Loop with uncertain exits

On the other hand, loop unrolling has its own limits: it increases code size immensely, which prevents optimizing all the loop levels of a nested loop structure. Hence, for flexible application acceleration, support for branches in CGRA accelerators is unavoidable.

\(^2\)this can be sorted out using intrinsics

The compilation flow discussed in the later sections uses partial unrolling of the innermost loop.

Table 3.1 presents a comprehensive comparison between several techniques to manage control flow in the kernels. The table clearly shows that the register allocation approach can deal with any kind of conditionals and loops.

Table 3.1 Comparison between different approaches to manage control flow in CGRA

<table>
<thead>
<tr>
<th rowspan="2">Techniques</th>
<th colspan="2">Conditionals</th>
<th colspan="2">Loops</th>
</tr>
<tr>
<th>Balanced</th>
<th>Imbalanced</th>
<th>Single</th>
<th>Nested</th>
</tr>
</thead>
<tbody>
<tr>
<td>Partial predication [13]</td>
<td>✓</td>
<td>✓</td>
<td>×</td>
<td>×</td>
</tr>
<tr>
<td>Full predication [4]</td>
<td>✓</td>
<td>✓</td>
<td>×</td>
<td>×</td>
</tr>
<tr>
<td>State based full predication [46]</td>
<td>✓</td>
<td>✓</td>
<td>×</td>
<td>×</td>
</tr>
<tr>
<td>Dual issue single execution [45]</td>
<td>✓</td>
<td>×</td>
<td>×</td>
<td>×</td>
</tr>
<tr>
<td>TLIA [67]</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>×</td>
</tr>
<tr>
<td>Software pipelining [72]</td>
<td>×</td>
<td>×</td>
<td>✓</td>
<td>×</td>
</tr>
<tr>
<td>Loop unrolling [61]</td>
<td>×</td>
<td>×</td>
<td>✓</td>
<td>NA</td>
</tr>
<tr>
<td>Register allocation [23]</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>

Next, in this chapter, we discuss the problem of supporting branches in CGRAs, and formulate a register allocation approach for supporting control flow efficiently. The compilation flow, described later in this chapter, is developed using this approach. The results in the final part of this chapter demonstrate the flexibility and efficiency of the compilation approach, both in terms of performance and energy gain.

3.2 Compilation flow

Figure 3.5 shows a schematic representation of the compilation flow for mapping CDFGs onto the PEA. A CDFG mapping is a set of DFG mappings that are compatible with each other. To be compatible, the DFGs must access the data that remain in the PEs (the symbol variables, see Definition 3.2.1) in the same location. This is ensured by the register allocation approach.

First, the flow orders the basic blocks, and for each basic block it finds a set of DFG mappings that are compatible with the DFGs already mapped by setting the constraints. When no solution for scheduling and binding is found, the flow tries to transform the application graph to ease the mapping. When no transformation can be applied, it means that a mapping for the current basic block cannot be found given the constraints of the selected mappings of the other basic blocks. A backtrack mechanism is then used to select another consistent set of already mapped DFGs to map the current DFG. The set of valid mappings found for the current basic block is saved into a mapping bank.

To map the basic blocks, we rely on the highly scalable and efficient mapping approach for DFGs described in [22]. The compilation flow proposed here extends the DFG mapping to accommodate the register allocation approach, so that a full CDFG can be mapped onto the PE array. As presented in Figure 3.5, the full compilation flow is composed of six interdependent stages: BB selection, backtracking, update constraints, scheduling and placement, graph transformation, and stochastic pruning. First, we discuss the steps involving the mapping of DFGs; then, we introduce the problems encountered while mapping the control flow graph and discuss the solutions.

Fig. 3.5 Compilation flow

3.2.1 DFG mapping

As shown in Figure 3.5, the mapping of DFGs involves three steps: scheduling and placement, graph transformation, and stochastic pruning.

Scheduling and placement

The scheduling step uses a backward traversal [81] list scheduling algorithm to schedule the nodes of the DFG. It relies on a heuristic in which the schedulable operations are listed in priority order. In backward traversal, a node is schedulable if and only if all its children are already scheduled (e.g. node 2 in Fig. 3.6(b) is not schedulable since node 3 is not yet scheduled, so it must be routed to keep the data dependency, resulting in Fig. 3.6(c)). The priority of nodes depends on their mobility and number of successors (fan-out). Memorization nodes and conventional nodes can be processed differently. When several nodes have the same mobility, their respective number of successors is used as a second priority criterion: the higher the number of successors, the higher the priority. Indeed, a node with a higher number of successors is more difficult to map because of the routing constraints imposed by the limited number of connections between tiles. Thus, scheduling these nodes first usually reduces the application’s latency (e.g. node 2 in Fig. 3.6(d) has a higher priority than node 1).
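The two-level priority described above can be sketched as a sort comparator in C. The `SchedNode` record and the function names are hypothetical. The text does not state the direction of the mobility criterion, so this sketch assumes the conventional list scheduling choice that lower mobility means higher priority; the fan-out tie-break follows the text.

```c
#include <stdlib.h>

/* Hypothetical node record; field names are illustrative. */
typedef struct {
    int id;
    int mobility;   /* scheduling freedom of the node */
    int fanout;     /* number of successors           */
} SchedNode;

/* Assumption: lower mobility means higher priority. Ties are broken by
   fan-out, higher first, as described in the text. */
static int by_priority(const void *lhs, const void *rhs)
{
    const SchedNode *a = lhs;
    const SchedNode *b = rhs;
    if (a->mobility != b->mobility)
        return (a->mobility > b->mobility) - (a->mobility < b->mobility);
    return (b->fanout > a->fanout) - (b->fanout < a->fanout);
}

/* Sort the schedulable nodes so that the highest priority comes first. */
void order_schedulable(SchedNode *nodes, size_t count)
{
    qsort(nodes, count, sizeof nodes[0], by_priority);
}
```

Nodes that compare equal on both criteria end up in an order that `qsort` does not specify, which is the situation the stochastic ordering discussed later in this section addresses explicitly.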

As soon as the nodes are prioritized and ordered, our approach tries to find a binding solution. The first node is selected from the ordered list and the algorithm searches for a binding solution. If no binding solution exists, the graph is transformed (see Section 3.2.1). The proposed placement uses an incremental version of Levi’s algorithm [62], i.e. a fully exhaustive search of the whole DFG. The algorithm we propose adds the newly scheduled operation node and its associated data node to the sub-graph composed of the already scheduled and bound nodes. Only the previous set of solutions that have been kept is used to find every possibility to add this couple of nodes, without considering the not-yet scheduled nodes. If no solution is found, there is no possibility to bind this couple in any of the previous partial solutions, because Levi’s algorithm provides a complete exploration of the available solution space.

Fig. 3.6 Example of scheduled and transformed DFG on a CGRA with one PE. (a) Initial DFG, (b) after scheduling node 4, (c) after adding node 2’, (d) after scheduling node 3 and 2’, (e) after scheduling node 2, (f) Scheduled DFG after routing and scheduling node 1. Horizontal line shows the limit between scheduled and non scheduled nodes. Memorization nodes are dotted circles.

**Introducing stochasticity in the scheduling:** The scheduling discussed above is a list ordering using a backward traversal. This heuristic approach schedules the nodes according to a priority function. The priority is derived from two criteria: i) the mobility of the nodes, ii) the number of outgoing arcs for the nodes having the same mobility. Despite these two criteria, several nodes may have the same priority (typically, those with the same mobility and only one outgoing arc). Nodes with the same mobility and number of successors are ordered randomly. **Stochasticity** is introduced in the scheduling process to get better coverage of the underlying micro-architecture. The ability of this random selection among nodes of equal priority to improve architectural exploration is examined at the end of this section.

**Graph transformation**

The DFG is transformed dynamically when no binding solution is found. The following two graph transformations (Figure 3.7) are used in our compilation flow.

1. **Operation splitting**: duplicates an operation node by keeping its same inputs and distributing output edges to reduce the number of successors of the original operation node (see Fig. 3.7(b)).

2. **Memorization routing**: adds a memorization node and its associated data node to delay one operation and to keep data dependencies (see Fig. 3.7(c)).

Fig. 3.7 Graph transformation

**Stochastic pruning**

The exhaustive enumeration of Levi’s algorithm usually leads to a very large number (up to tens of thousands) of partial mappings (depending on the data dependencies and the architectural constraints), which prevents its use with large DFGs and/or complex CGRAs. In [81], the idea to reduce this number was to remove redundant partial mappings. A partial mapping is redundant when it uses the same operators to perform the same operations as another partial mapping at the current scheduling cycle. This step keeps all the distinct partial solutions and preserves an exhaustive search. However, this pruning technique does not scale well. The problem is so complex that it is difficult to define a smart and efficient pruning function. To keep both computation time and memory usage at a reasonable level in the mapping tool, we propose to use a stochastic selection instead of removing redundant partial mappings. This pruning step is performed after the binding step and before scheduling the next node. Let the result of the binding be a list \( nb\text{Mappings} (nbM) \) of partial solutions. The stochastic pruning step selects \( nb\text{CurrentMappings} (nbCM) \) partial solutions from \( nb\text{Mappings} \).

For each partial mapping, a random number between 0 and 1 is generated and compared to a threshold. This threshold must be chosen carefully: it should be low enough to scale up and high enough to keep enough partial solutions, among which at least one can lead to a complete mapping. Thus, the threshold should adapt itself to nbMappings. For that purpose, nbMappings is normalized by a reference number λ, set by the user. This number is used by the threshold function. Many functions can be considered (e.g. exponential, inverse, hyperbolic, etc.). To select a suitable threshold function, we present a performance graph (Fig. 3.8) which shows the average number of selected partial mappings (nbCurrentMappings) over ten runs against the average number of original partial mappings (nbMappings) for λ = 3000 (the same trend is observed with several other values of λ).

We observe an exponential decay in the number of selected mappings for the exponential and hyperbolic functions, as opposed to the inverse function. Hence, the inverse function has been chosen as the threshold function (see Eq. 3.1) in our approach.

\[
\text{Threshold}(nbM, \lambda) = \begin{cases} 
\lambda / nbM & \text{if } nbM > \lambda \\
1 & \text{if } nbM \leq \lambda 
\end{cases} \quad (3.1)
\]
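Eq. 3.1 translates directly into a few lines of C. This is a minimal sketch; the function and parameter names are chosen here for illustration, and λ is assumed to be strictly positive.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold of Eq. 3.1: probability of keeping one partial mapping.
   nb_m   : number of partial mappings produced by the binding step
   lambda : user-defined reference number (assumed positive)          */
double threshold(size_t nb_m, size_t lambda)
{
    assert(lambda > 0);
    if (nb_m > lambda)
        return (double)lambda / (double)nb_m;
    return 1.0;
}
```

When $nbM > \lambda$, each mapping is kept with probability $\lambda / nbM$, so the expected number of kept mappings is $nbM \cdot \lambda / nbM = \lambda$. For example, with $\lambda = 3000$ and $nbM = 6000$ the threshold is 0.5.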

After choosing the right threshold function it becomes very important to have control over the number of selected partial mappings as this leads the approach to find a valid solution. We propose to introduce bounds as control mechanisms: LB (Lower Bound) and UB (Upper Bound). We propose two variants based on bounds.

**LB & UB:** This variant sets an upper bound and a lower bound on \(nbCurrentMappings\), as presented in Equations 3.2 and 3.3. In this method too, a random number between 0 and 1 is generated and compared to the threshold value. If the random number is less than or equal to the threshold, or the lower bound is not yet satisfied, the partial solution is selected from \(nbMappings\) and stored into \(nbCurrentMappings\); otherwise the solution is discarded. If \(nbCurrentMappings\) exceeds the upper bound, the selection of partial mappings stops.

\[
\max nbCM = \lceil nbM / 3 \rceil \quad \text{(3.2)}
\]

\[
\min nbCM = \begin{cases} 
\lceil nbM/\lambda \rceil & \text{if } nbM > \lambda \\
\lfloor nbM/3 \rfloor & \text{if } nbM \leq \lambda 
\end{cases} \quad \text{(3.3)}
\]

**LB only:** This variant generates a random number between 0 and 1, which is compared to the threshold. If the random number is less than or equal to the threshold, the partial solution is selected from \(nbMappings\) and stored into \(nbCurrentMappings\); otherwise the solution is discarded. The solution space \(nbMappings\) is traversed repeatedly until \(nbCurrentMappings\) reaches the minimum bound given in Equation 3.4.

\[
\min nbCM = \lceil nbM / \lambda \rceil \quad \text{(3.4)}
\]
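The LB only variant can be sketched in C as follows, combining the threshold of Eq. 3.1 with the minimum bound of Eq. 3.4. Several points are assumptions of this sketch rather than details given in the text: `rand()` is used as the random source, every pass over the list is completed before the bound is checked, mappings already selected are skipped in later passes, and λ is strictly positive. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Minimum number of kept mappings, Eq. 3.4: ceil(nbM / lambda). */
static size_t lower_bound(size_t nb_m, size_t lambda)
{
    return (nb_m + lambda - 1) / lambda;
}

/* Threshold of Eq. 3.1. */
static double threshold(size_t nb_m, size_t lambda)
{
    return nb_m > lambda ? (double)lambda / (double)nb_m : 1.0;
}

/* LB-only pruning sketch. keep[i] is set to true for every selected
   partial mapping; the return value plays the role of nbCurrentMappings.
   Assumptions: rand() is the random source, each pass over the list is
   completed, and already selected mappings are skipped in later passes. */
size_t prune_lb_only(bool *keep, size_t nb_m, size_t lambda)
{
    size_t kept = 0;
    const size_t min_kept = lower_bound(nb_m, lambda);
    const double thr = threshold(nb_m, lambda);

    for (size_t i = 0; i < nb_m; i++)
        keep[i] = false;

    do {
        for (size_t i = 0; i < nb_m; i++) {
            double r = (double)rand() / (double)RAND_MAX;  /* in [0, 1] */
            if (!keep[i] && r <= thr) {
                keep[i] = true;
                kept++;
            }
        }
    } while (kept < min_kept);

    return kept;
}
```

Because the threshold is strictly positive and the minimum bound never exceeds $nbM$, the outer loop terminates with probability one; a production implementation would likely seed the generator and cap the number of passes.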

**Efficiency of stochasticity in mapping:** To demonstrate the efficiency of the stochastic approach in the mapping flow, we perform experiments involving the kernels presented in Table 3.1 in chapter 3. Since latency and compilation time are the most critical parameters affected by the introduction of stochasticity in the mapping, we analyse the effect of different combinations of stochastic behaviour, and of a non-stochastic approach, on both latency and compilation time. For the different combinations of stochastic behaviour in the pruning stage, we consider (a) mapping with stochastic pruning with no bounds, or SNoB, (b) mapping with stochastic pruning with lower and upper bounds (LB & UB), or SLUB, and (c) mapping with stochastic pruning with a lower bound only (LB), or SLoB. For the non-stochastic approach, we select a mapping approach based on redundant-elimination pruning, proposed in [81], which is referred to as RED in the comparisons. Since the mappings are based on different stochastic solutions, in the experiments we take the best outcome out of ten runs to ensure the best performance of the corresponding method.

- **Latency and computation time:** In Figure 3.9, we compare the latency obtained by the different mapping approaches, normalized to the ASAP length of the corresponding DFGs. As the trends are the same for different sizes of CGRA and RF, we present results only for a 3x3 CGRA with RF 24. Fig. 3.10 presents the compilation time comparison. To show the gain in compilation time of stochastic pruning over the state-of-the-art pruning based on redundant elimination (RED), we have normalized the compilation time to that of RED.

Fig. 3.9 Mapping latency comparison for 3x3 CGRA

Fig. 3.9 shows the ability of the different methods to find mappings with the best latency. A latency value closer to the ASAP line indicates a better mapping. The latency comparison shows that SLoB generates the best mappings, whereas SLUB produces the worst latencies. The compilation time comparison in Fig. 3.10 shows that SLUB and SLoB achieve the best scaling. Considering both performance metrics, SLoB is the clear winner.

- **Architectural coverage:** After comparing the performance metrics, we analyse the ability of stochasticity in mapping to explore the underlying micro-architecture. Since better architectural coverage ensures better resource utilization, we consider the best performing candidate, SLoB, from the above set of experiments. As discussed in the previous section, we introduce SLoBS, which integrates stochastic scheduling in SLoB. For the experiment, we consider the FFT kernel, as it possesses the highest degree of parallelism (see Table 3.1 in chapter 3).
Fig. 3.10 Compilation time comparison for 3x3 CGRA

Fig. 3.11 exhibits the architectural coverage for different mapping approaches. In this figure, we present FFT benchmark running for different CGRA configurations using three different methods, RED, SLoB and SLoBS.

Since the trend observed in this figure is similar for other benchmarks, results for only one kernel are presented for clarity. The CGRA configurations are described by their dimension and RF size (e.g. 4×4 RF 16 denotes a 4×4 CGRA where each PE has an RF of size 16). Each point in Fig. 3.11 corresponds to the outcome of a single run of a method on the corresponding CGRA configuration.

The x axis of the graph represents latency normalized to ASAP length and the y axis represents the number of transformed nodes normalized to the number of operation nodes in the original graph. In other words, each point in the graph gives the latency and the number of transformed nodes obtained in one run of a given method. The points corresponding to the methods RED and SLoB show that they find similar latencies with an almost similar number of transformations. The wide range of latencies and transformations of method SLoBS shows that this method explores the solution space better. Not surprisingly, SLoBS finds the best latency with the least number of transformations.
3.2.2 CDFG mapping

First, we formulate the problem of CDFG mapping and propose a register allocation based solution accordingly. Subsequently, we discuss the steps involving the mapping of CDFG.

**Definition and problem formulation**

Data in an application is separated into two categories.

1. The standard input and output data (mostly the array inputs and outputs) are mapped as memory operands. The inputs and outputs are handled by load-store operations. In our sample program in Figure 3.2, \(m\) and \(n\) are the input arrays and \(p\) is the output array, which are managed by load and store operations.

2. The internal variables of a program are mapped onto the registers of the processing elements, and managed by the register allocation based approach [23].

In the following, we introduce several definitions concerning the register allocation approach:

**Definition 3.2.1.** Symbol Variables and location constraints: In compilation, the recurring variables (repeatedly written and read) are managed in the local register files of the PEs to avoid multiple accesses to local memory. The recurring variables which have occurrences in multiple basic blocks need special attention, since the integrity of these variables must be kept intact throughout the mapping process for the different basic blocks. These variables are defined as symbol variables. The register locations for the symbol variables are referred to as location constraints. For instance, variable $c$ in the CDFG (Fig. 3.2) is written in $BB_3$, and read in $BB_4$, $BB_5$ and $BB_6$. In mapping all these basic blocks, the register location for $c$ must be the same. Similarly, $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, $i$, $a$ and $b$ must be location constrained. The locations of such symbol variables are denoted with an overline, as $\overline{variable\_name}$.
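Finding the symbol variables amounts to finding the variables that occur in more than one basic block. The sketch below does this with bit sets in C; the `VarSet` encoding (one bit per variable) and the function name are hypothetical, and the sketch treats every variable occurring in at least two blocks as a symbol variable.

```c
#include <stddef.h>
#include <stdint.h>

/* One bit per program variable; the numbering is arbitrary. */
typedef uint32_t VarSet;

/* Returns the set of variables that occur (read or written) in at least
   two basic blocks. occurs[b] is the set of variables used by block b. */
VarSet symbol_variables(const VarSet *occurs, size_t nb_blocks)
{
    VarSet seen_once  = 0;   /* variables met in at least one block  */
    VarSet seen_twice = 0;   /* variables met in at least two blocks */

    for (size_t b = 0; b < nb_blocks; b++) {
        seen_twice |= seen_once & occurs[b];
        seen_once  |= occurs[b];
    }
    return seen_twice;
}
```

For instance, if $c$ occurs in three blocks and a temporary occurs in only one of them, only the bit of $c$ is set in the result, so only $c$ receives a location constraint.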

Depending on the order of the basic blocks mapped (i.e. traversing the CDFG), some location constraints may be reused in the mapping process or may be kept reserved for later use. These two types of location constraints are discussed in the following.

**Definition 3.2.2.** *Target Location Constraints (TLC):* We consider a scenario, *scenario\_1*, where $BB_6$ is mapped first, $BB_3$ is mapped next, and so on. While mapping $BB_6$, variables $c$ and $X_5$ are placed at $\overline{c}$ and $\overline{X_5}$. While mapping $BB_3$, $\overline{c}$ and $\overline{X_5}$, which are already mapped in $BB_6$, must be considered because $\overline{c}$ will be used to map $c$ in $BB_3$. In other words, the placement of the variables in the registers must be respected. Also, $\overline{a}$, $\overline{b}$, $\overline{X_1}$ and $\overline{X_2}$ must not reuse $\overline{X_5}$. Otherwise, $X_5$ will have a wrong value when executing $BB_6$. Let us consider *scenario\_2*, with another order of basic block mapping: first $BB_3$, then $BB_6$, and so on. In this order of mapping, it is necessary to pass $\overline{c}$ and $\overline{X_5}$ from the $BB_3$ mapping to the $BB_6$ mapping. To keep $c$ and $X_5$ alive in $BB_6$, both $\overline{c}$ and $\overline{X_5}$ must be used in the mapping of $BB_6$. The placement or binding information passed from the mapping of one basic block to the mapping of another is referred to as a constraint (e.g. *scenario\_1*: $\overline{c}$ and $\overline{X_5}$ passed from $BB_6$ to $BB_3$). The location constraints related to the data that are used within a basic block mapping phase (e.g. *scenario\_1*: $\overline{c}$ in $BB_3$ mapping) are referred to as target location constraints (TLC).

**Definition 3.2.3.** *Reserved Location Constraints (RLC):* As we have seen in the previous examples, some of the location constraints must be reserved in the mapping of basic blocks for the sake of data integrity. To keep the symbol variables alive, it is necessary to exclude the corresponding memory elements from placement. Accordingly, these resources will not be overwritten while mapping the basic block (e.g. *scenario\_1*: $\overline{X_5}$ in $BB_3$ mapping). These are referred to as reserved location constraints (RLC).

If the number of RLCs and TLCs is high, mapping becomes complex, as TLCs force the use of some resources and RLCs force the exclusion of resources from placement. Hence, the primary goal of our compiler is to minimize the number of TLCs and RLCs by choosing an efficient traversal of the CDFG.

**Register allocation approach:** The basic solution to deal with the symbol variables is to introduce memory operations. The symbol variables are stored in memory where they are generated and loaded from memory when used as operands. This basic solution is referred to as the systematic load-store based approach, and is presented in Figure 3.1(d). For the symbol variable \( c \) in the CDFG shown in Figure 3.1(c), it stores \( c \) in memory in \( BB_3 \), and loads it in \( BB_4 \), \( BB_5 \) and \( BB_6 \). Figure 3.1(e) shows the mapping of the transformed CDFG in this approach. This basic solution reduces the complexity of the mapping, as there are no constraints in the basic block mapping. On the other hand, it requires a huge memory bandwidth, significantly reducing the energy efficiency of the system. As an alternative, we propose the register allocation approach, where the symbol variables are stored in the register files when they are written and retrieved from the registers when used as operands. In doing so, the effects of the constraints on the mapping are unavoidable. RLCs restrict the use of some resources, and TLCs force the reuse of some resources. If there is only a single TLC in a basic block mapping, it becomes easier to start mapping from the known place. However, several TLCs and RLCs complicate the mapping. Forced and blocked placements caused by these constraints induce extra routing effort (dynamically transforming the graph during compilation).
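Following Definitions 3.2.2 and 3.2.3, the constraints seen by a basic block can be sketched as two set operations on the symbol variables that are already placed. The bit-set encoding, the `Constraints` record and the function name below are hypothetical.

```c
#include <stdint.h>

typedef uint32_t VarSet;   /* one bit per symbol variable */

typedef struct {
    VarSet tlc;   /* placed variables the current block accesses   */
    VarSet rlc;   /* placed variables the current block must avoid */
} Constraints;

/* placed  : symbol variables already bound by previously mapped blocks
   used_bb : symbol variables read or written by the block being mapped */
Constraints split_constraints(VarSet placed, VarSet used_bb)
{
    Constraints k;
    k.tlc = placed & used_bb;    /* locations that must be reused     */
    k.rlc = placed & ~used_bb;   /* locations excluded from placement */
    return k;
}
```

In *scenario\_1*, the placed set is $\{c, X_5\}$ after mapping $BB_6$, and $BB_3$ uses $a$, $b$, $c$, $X_1$ and $X_2$, so the sketch yields $\overline{c}$ as TLC and $\overline{X_5}$ as RLC, in line with the two definitions.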

**Impact of the constraints on DFG mapping:** Since the location constraints in the register allocation approach force the use of some locations and block others while mapping the variables, we have tailored the placement algorithm of the DFG mapping. The modified binding approach uses a database of the RLCs and TLCs to find placements for the current data nodes. If no solution is found due to the constraints, the DFG is dynamically transformed, emulating additional routing of the targeted data node. We introduce assignment routing (Figure 3.12), which adds an assignment node (mov operation node) to increase the physical distance between the source and the sink of a symbol variable by one. When, due to TLCs or RLCs, the physical distance between the source and sink of a symbol variable becomes more than one, the compiler dynamically adds one mov operation node to the DFG.

To illustrate the excess data routing due to RLCs and TLCs, we consider a scenario where \( BB_1 \) and \( BB_4 \) in Figure 3.2 are already mapped (variables \( X_1, X_2, X_3, X_4, X_5, c, i \) already mapped). The mapping of \( BB_3 \) must be done considering the TLCs \( \overline{c}, \overline{X_1}, \overline{X_2} \) and the RLCs \( \overline{X_5}, \overline{X_3}, \overline{X_4}, \overline{i} \). The variables \( a, b \) in \( BB_3 \) must be mapped satisfying all these constraints. Consequently, additional data moves might be necessary. A graphical view of this situation is presented in Fig. 3.13, where \( BB_3 \) is being mapped onto a \( 3 \times 1 \) PEA with 4 registers in the RF of each PE (R0 is the output register). In this PEA, we assume that the register files are local to the PEs.
Fig. 3.12 Assignment routing graph transformation

\( a \) and \( b \) will be mapped in the respective PEs where \( X_1 \) and \( X_2 \) are allocated. Extra routing effort may be necessary to bring \( a \) and \( b \) to the PE where \( \overline{c} \) is allocated. Hence, the graph must be transformed dynamically, adding extra mov operations, when such a situation arises. The mapping can be done because the addition operation must generate \( c \) in \( \overline{c} \), which is a location in the register file (RF) of the corresponding PE.

Fig. 3.13 (a) DFG \( BB_3 \). (b) mapping of \( BB_3 \) onto a 3x1 PEA starting with TLCs \( \bar{\tau}, \bar{X_1} \) and \( \bar{X_2} \) and RLCs \( \bar{X_5}, \bar{X_3}, \bar{X_4} \) and \( \bar{i} \). (c) transformed \( BB_3 \) after mapping.
As we can see in Figure 3.13 (b), the execution starts with the TLCs τ (PE3 – R2), X1 (PE1 – R1) and X2 (PE2 – R3), and the RLCs X3 (PE2 – R2), X4 (PE1 – R4), X5 (PE3 – R4) and i (PE3 – R1). Due to the constraints, the original BB_3 is transformed into the basic block presented in Figure 3.13 (c), and the mapping takes 3 cycles. The TLCs force a and b to be mapped in PE1 and PE2 in cycle 1. Let us assume they are mapped in PE1 – R2 and PE2 – R1, respectively. In cycle 2, a and b cannot be accessed to produce c in PE3 – R2. Hence, a graph transformation is necessary to route a and b from the register files to the output registers, which is done in cycle 2. In cycle 3, c is generated in τ, which is PE3 – R2. Hence, the operation attached to c experiences a longer schedule in this case because of the several TLCs and RLCs. An increased number of constraints during basic block mapping affects both the complexity and the quality of the mapping. It is therefore necessary to select the basic blocks wisely to reduce the impact of the constraints on the mapping. In the next section, we present a suitable traversal of the CDFG to minimize the number of RLCs and TLCs.

In the following, we discuss the steps of the compilation flow that implement the register allocation approach.

**Basic block selection**

Once all the nodes of the BB have been scheduled and bound, the compiler selects one partial mapping among the several mappings generated and selects the next basic block to be mapped. As discussed previously, it is necessary to maintain data integrity over several basic block mappings. The data mapping problem for CDFG mapping is now described before going into the detailed basic block selection step.

As the selection of the basic blocks during the mapping is important, we compare the number of TLCs and RLCs for several CDFG traversal strategies in this section. Table 3.2 compares the number of constraints in forward and backward CDFG traversal for the Breadth First Search (BFS) and Depth First Search (DFS) strategies. As the trend is similar for other kernels, we present the results for the sobel and separable 2D filters only. The numbers show that the DFS strategy generates fewer RLCs than BFS in both forward and backward traversal. The number of RLCs for the sobel filter is much higher in BFS due to the several sequential loops present in the kernel. The numbers of TLCs are similar in both strategies for the different traversal mechanisms. Also, for both search strategies, forward and backward traversal perform similarly. The DFS strategy is thus used.
### Table 3.2 Comparison of RLC and TLC numbers between different CDFG traversal

<table>
<thead>
<tr>
<th>Kernels</th>
<th>Forward Traversal</th>
<th>Backward Traversal</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>BFS</td>
<td>DFS</td>
</tr>
<tr>
<td></td>
<td># RLC</td>
<td># TLC</td>
</tr>
<tr>
<td>Sep 2D Filter</td>
<td>22</td>
<td>35</td>
</tr>
<tr>
<td>sobel Filter</td>
<td>64</td>
<td>85</td>
</tr>
</tbody>
</table>
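To make the four traversal variants compared above concrete, the following sketch lists the order in which basic blocks would be selected by DFS and BFS, in forward and backward direction. The control flow graph (a simple diamond with blocks A to D) and all function names are hypothetical; the sketch only shows the visiting orders and does not count RLCs or TLCs.

```python
from collections import deque


def dfs_order(succ, entry):
    """Depth-first visiting order starting from `entry`."""
    order, seen, stack = [], set(), [entry]
    while stack:
        bb = stack.pop()
        if bb in seen:
            continue
        seen.add(bb)
        order.append(bb)
        # Push successors in reverse so the first successor is visited first.
        stack.extend(reversed(succ.get(bb, [])))
    return order


def bfs_order(succ, entry):
    """Breadth-first visiting order starting from `entry`."""
    order, seen, queue = [entry], {entry}, deque([entry])
    while queue:
        bb = queue.popleft()
        for nxt in succ.get(bb, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def reverse_graph(succ):
    """Swap edge directions, used for backward traversal from the exit block."""
    pred = {bb: [] for bb in succ}
    for bb, targets in succ.items():
        for t in targets:
            pred.setdefault(t, []).append(bb)
    return pred


# Hypothetical diamond-shaped CFG: A branches to B or C, both join in D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
forward_dfs = dfs_order(cfg, "A")
forward_bfs = bfs_order(cfg, "A")
backward_dfs = dfs_order(reverse_graph(cfg), "D")
backward_bfs = bfs_order(reverse_graph(cfg), "D")
```

In this toy graph, DFS follows one branch down to the join block before returning to the other branch, whereas BFS visits both branches before the join.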

#### Backtracking

For every basic block to be mapped except the first one, this stage selects the first of the several mappings generated for the previously mapped basic block. The selected mapping updates the constraints for the current basic block mapping. If a basic block cannot be mapped due to the constraints, this stage selects the second mapping of the previous basic block to update the constraints and restarts the mapping of the current basic block. The process continues back to the first mapped basic block until a valid mapping is found for the current basic block.
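The backtracking stage can be sketched as a depth-first search over the candidate mappings of each basic block. This is a minimal illustration, not the compiler's implementation: `generate_mappings` is a hypothetical callback that yields candidate mappings (dictionaries from variable to location) for a block under the current constraints, and the toy blocks and locations are invented for the example.

```python
def map_blocks(blocks, generate_mappings, constraints=None):
    """Return one mapping per block, or None if no consistent choice exists."""
    constraints = dict(constraints or {})
    if not blocks:
        return []
    head, tail = blocks[0], blocks[1:]
    for mapping in generate_mappings(head, constraints):
        # The chosen mapping constrains all blocks mapped afterwards.
        rest = map_blocks(tail, generate_mappings, {**constraints, **mapping})
        if rest is not None:
            return [mapping] + rest
        # Otherwise fall through and try the next mapping of `head`.
    return None


def toy_mappings(block, constraints):
    """Hypothetical mapper: BB_next only maps if x was placed in PE2-R1."""
    if block == "BB_first":
        yield {"x": "PE1-R1"}
        yield {"x": "PE2-R1"}
    elif block == "BB_next":
        if constraints.get("x") == "PE2-R1":
            yield {"y": "PE2-R2"}


result = map_blocks(["BB_first", "BB_next"], toy_mappings)
```

Here the first mapping of `BB_first` leaves `BB_next` without a solution, so the search falls back to the second mapping of `BB_first`, as in the stage described above.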

#### Update Constraints

In this stage, the compiler creates and updates a constraint database. This database is used by the placement algorithm to place the data nodes and the corresponding operation nodes according to the TLCs and RLCs. When mapping the current basic block, new variables cannot be placed in RLC locations, while TLCs are used to map the symbol variables. If a symbol variable of the current basic block is not present in the constraint database, the variable is mapped using the available resources, and the resulting placement is used to update the constraint database prior to the next basic block mapping.
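A constraint database of this kind can be sketched with two dictionaries, filled here with the TLC and RLC locations quoted for Figure 3.13 (b). The function `place`, the first-free-register policy and the choice to keep new variables out of TLC locations as well as RLC locations are assumptions made for the illustration.

```python
# Locations taken from the Figure 3.13 (b) discussion: (PE, register).
TLC = {"tau": ("PE3", "R2"), "X1": ("PE1", "R1"), "X2": ("PE2", "R3")}
RLC = {"X3": ("PE2", "R2"), "X4": ("PE1", "R4"),
       "X5": ("PE3", "R4"), "i": ("PE3", "R1")}
REGISTERS = ["R1", "R2", "R3", "R4"]   # R0 is the output register


def place(var, pe, tlc, rlc, placed):
    """Return the location of `var`, placing it on `pe` if it is new.

    Symbol variables with a TLC go to their constrained location; other
    variables take the first register of `pe` that is not constrained
    or already used.
    """
    if var in tlc:
        return tlc[var]
    taken = set(tlc.values()) | set(rlc.values()) | set(placed.values())
    for reg in REGISTERS:
        if (pe, reg) not in taken:
            placed[var] = (pe, reg)   # recorded for later basic blocks
            return placed[var]
    return None                        # no free register on this PE


placed = {}
loc_a = place("a", "PE1", TLC, RLC, placed)
loc_b = place("b", "PE2", TLC, RLC, placed)
loc_tau = place("tau", "PE3", TLC, RLC, placed)
```

Under these assumptions, `a` lands in PE1 – R2 and `b` in PE2 – R1, which are the placements assumed in the Figure 3.13 walk-through.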

Once all the basic blocks are mapped the compiler generates the assembly file containing a single map for the whole CDFG.

### 3.2.3 Assembler

The assembler holds the key to separating the PEA model used in the compiler from the actual hardware implementation. The assembler takes the ASCII text assembly generated by the compiler and the instruction set architecture (ISA) and produces machine code, which can then be used to configure the PEs in the hardware. The ISA adds hardware information to the PEA model used in the compiler. As an example, the PEs in the IPA use an additional constant register file (CRF) for storing constants. The introduction of the CRF in the PEA model minimizes the instruction length by storing the immediates of the instructions in internal registers, giving a low-power solution. That is how the assembler separates the model used in the compiler from the actual implementation of the hardware. One can define a custom PEA model and derive an architecture from it for actual implementation. Thus, the compiler can be used for a wide range of PEA variations.
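The role of the CRF can be illustrated with a small constant-pooling pass: immediates are moved into a constant table and instructions refer to them by a short index. The instruction tuples, the `C<n>` operand naming and the function name are hypothetical; only the 16-entry CRF size comes from the IPA configuration described in this thesis.

```python
def pool_constants(instructions, crf_size=16):
    """Replace integer immediates by CRF references.

    `instructions` is a list of tuples (opcode, operand, ...), where an int
    operand is an immediate and a str operand is a register name.
    Returns (rewritten_instructions, crf_contents).
    """
    crf = []          # constants, in CRF slot order
    rewritten = []
    for opcode, *operands in instructions:
        new_operands = []
        for operand in operands:
            if isinstance(operand, int):
                if operand not in crf:
                    if len(crf) == crf_size:
                        raise ValueError("constant register file is full")
                    crf.append(operand)
                new_operands.append(f"C{crf.index(operand)}")
            else:
                new_operands.append(operand)
        rewritten.append((opcode, *new_operands))
    return rewritten, crf


program = [("add", "R1", "R2", 1000),
           ("mul", "R3", "R1", 1000),
           ("sub", "R0", "R3", 7)]
code, crf = pool_constants(program)
```

Each instruction then only needs enough bits to index the CRF instead of carrying a full 32-bit immediate, and a constant used twice occupies a single slot.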

### 3.3 Conclusion

In this chapter, we presented a compilation flow targeting the mapping of both the control and data flow portions of kernels onto the IPA. The proposed approach maps a complete CDFG with the least number of memory operations. A register allocation approach was introduced to maintain data locality throughout the CDFG mapping. We also showed the effect of the constraints raised by the register allocation approach on different traversals of the CDFG. For mapping the basic blocks of the CDFG, the proposed approach leverages simultaneous scheduling and binding steps, based respectively on a heuristic and an exact method. Stochastic pruning was introduced to reduce the impact of the exact binding approach. The formal graph model of the basic blocks, obtained after compilation, is traversed backward and dynamically transformed to allow for a better exploration of the design space. In the next chapter, we present the efficiency of the compilation flow.
Chapter 4

IPA performance evaluation

In this chapter, we first analyse the implementation of the IPA, reporting performance, area, and energy consumption on several signal processing kernels. We perform an architectural exploration to find the optimal configuration in terms of the number of load-store units and the number of TCDM banks for an IPA with a $4 \times 4$ PE array. Performance, area and energy efficiency are compared with those of the or1k CPU [58].

Second, we carry out experiments to show the efficiency of the register allocation approach compared to the state of the art predication techniques, considering a wide range of control dominated kernels. The proposed mapping flow has been fully automated through a software tool implemented by using Java and Eclipse Modeling Framework (EMF). GCC 4.8 is used to generate CDFGs from applications described in C language. Finally, we present the efficiency of the compilation flow, executing a smart visual trigger application enriched with data-flow and control-flow intensive kernels on the IPA, compared with the state of the art architectures.

4.1 Implementation of the IPA

This section presents the implementation results of the IPA using STMicroelectronics 28nm UTBB FD-SOI technology libraries. For area reference, we consider the low-power or1k CPU [58]. Both designs were synthesized with Synopsys Design Compiler 2014.09-SP4. The IPA consists of a $4 \times 4$ array with 16 PEs, each one consisting of a $64 \times 20$-bit instruction register file, an $8 \times 32$-bit regular register file and a $16 \times 32$-bit constant register file, as shown in Table 4.1. For area comparison, the CPU includes 32kB $^1$ of data memory, 4kB of instruction memory, and 1 kB of instruction cache, which is equivalent to the design parameters of the IPA.

$^1$The size is considered both in size and power
For the memory access optimization, we compare the performance and energy efficiency of different configurations of the IPA with the CPU. The configurations differ in the number of LSUs present in the PEA and the number of TCDM banks present in the data memory. Table 4.2 presents the code size (instructions and constants) and the maximum depth of the loops present in the kernels used for the following experiments. Thanks to the simpler architecture and tiny processing elements, at the target operating voltage of 0.6V, the IPA runs at 100 MHz while the or1k can only reach 45 MHz at the same operating point. Synopsys PrimePower 2013.12-SP3 was used for timing and power analysis at the supply voltage of 0.6V, 25°C temperature, in typical process conditions. The cycle information was obtained by simulating the RTL with Mentor Questa Sim-64 10.5c.

Table 4.1 Specifications of memories used in TCDM and each PE of the IPA

<table>
<thead>
<tr>
<th>Name</th>
<th>Type</th>
<th>Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>Global context memory</td>
<td>SRAM</td>
<td>8KB</td>
</tr>
<tr>
<td>TCDM</td>
<td>SRAM</td>
<td>32KB</td>
</tr>
<tr>
<td>Instruction Register File (IRF)</td>
<td>Registers</td>
<td>0.16KB</td>
</tr>
<tr>
<td>Regular register file (RRF)</td>
<td>Registers</td>
<td>0.032KB</td>
</tr>
<tr>
<td>Constant register file(CRF)</td>
<td>Registers</td>
<td>0.128KB</td>
</tr>
</tbody>
</table>

4.1.1 Area Results

Figure 4.1 shows the area of the whole array and memory with different numbers of TCDM banks, where the total amount of memory is kept constant at 32kB. As the area of LSUs is negligible if compared to the overall system area, we show the area results for the worst-case scenario with maximum number of LSUs present in the PE array (i.e. 16). As shown in Figure 4.1, in the minimal configuration with 4 TCDM banks, the IPA area is dominated by the array of PEs (60%) and by the local data storage (35%), while the remaining 5% is consumed by the interconnect. Increasing the number of TCDM banks imposes a significant area overhead on the size of the interconnect. Also, the area of the TCDM increases as well due to the higher area/bit of small SRAM cuts necessary to implement 32kB of memory with several banks. Hence, it is fundamental to properly balance the number of LSUs and TCDM banks with the bandwidth requirements of applications.

4.1.2 Memory Access Optimization

This section provides an extensive comparison with respect to the CPU computational model and an evaluation of the performance of the IPA while varying the number of LSUs and
TCDM banks, a critical parameter for data-hungry accelerators. To carry out the exploration, we selected seven compute intensive signal processing kernels featuring a high bandwidth towards the TCDM. Table 4.2 presents the code-size (instructions and constants) of all the kernels used in the following experiments. The cost of the IRF is considered both in size and power.

Table 4.2 Code size and the maximum depth of loop nests for the different kernels in the IPA

<table>
<thead>
<tr>
<th>Kernel</th>
<th>FIR</th>
<th>MatMul</th>
<th>Conv</th>
<th>Sep filter</th>
<th>Non-Sep filter</th>
<th>FFT</th>
<th>DC filter</th>
<th>cordic</th>
<th>sobel</th>
<th>gcd</th>
<th>sad</th>
<th>deblocking</th>
<th>manh-dist</th>
</tr>
</thead>
<tbody>
<tr>
<td>Code size (KB)</td>
<td>0.568</td>
<td>0.704</td>
<td>0.704</td>
<td>0.720</td>
<td>0.784</td>
<td>0.696</td>
<td>1.16</td>
<td>0.496</td>
<td>0.336</td>
<td>1.448</td>
<td>0.600</td>
<td>2.016</td>
<td>0.624</td>
</tr>
<tr>
<td>Max depth loop nests</td>
<td>2</td>
<td>3</td>
<td>3</td>
<td>3</td>
<td>4</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>2</td>
</tr>
</tbody>
</table>

Performance

Generally speaking, the IPA performs well when significant parallelism can be extracted from a kernel. This concept is well shown in Figure 4.2, which compares the performance of the IPA with that of the or1k processor on a matrix multiplication when growing the size of the matrices from $2 \times 2$ to $32 \times 32$. It is possible to note that the increase of the kernel size increases the average utilization of the PEs as well, which in turn helps to enhance performance. It also demonstrates that the initial configuration time, which is dominant for small kernel size is well amortized for larger kernels, further contributing to improve performance.

Fig. 4.1 Synthesized area of IPA for different number of TCDM banks
Figure 4.3 presents the total execution time (clock cycles) of seven compute-intensive kernels. The execution time is normalized with respect to that of the or1k processor, where the kernels are compiled with the -O3 optimization flag. The IPA outperforms the CPU by up to $20.3\times$, with an average speed-up of $9.7\times$. A quantitative performance comparison with respect to the CPU is presented in Table 4.3. The table presents the configuration and execution cycles in the IPA for the different kernels. It also presents the average utilization of the PEs over the total execution period and the total number of instructions executed in the IPA. The instruction count includes the instructions that are replicated on all the active PEs to keep the PEs in sync across conditionals and jumps. It also includes the NOPs that are used when some PEs are stalled due to the manipulation of index variables. However, during NOP execution, PEs are clock gated and do not consume dynamic power. The IPA achieves a maximum of $18\times$ and an average of $9.23\times$ energy gain over the CPU.

Table 4.3 Overall instructions executed and energy consumption in IPA vs CPU

<table>
<thead>
<tr>
<th>Kernels</th>
<th>FIR</th>
<th>MatMul (16×16)</th>
<th>Convolution</th>
<th>SepFilter</th>
<th>NonSepFilter</th>
<th>FFT</th>
<th>DC Filter</th>
</tr>
</thead>
<tbody>
<tr>
<td>IPA</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Configuration cycles</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Execution cycles</td>
<td>6071</td>
<td>11940</td>
<td>56241</td>
<td>827685</td>
<td>1852382</td>
<td>8076</td>
<td>4748</td>
</tr>
<tr>
<td>Total number of instructions executed</td>
<td>44294</td>
<td>110946</td>
<td>531815</td>
<td>7349843</td>
<td>17486486</td>
<td>76310</td>
<td>28868</td>
</tr>
<tr>
<td>Active PEs/cycle(%)</td>
<td>46.1</td>
<td>58.5</td>
<td>59.2</td>
<td>55.5</td>
<td>59</td>
<td>59.7</td>
<td>39.5</td>
</tr>
<tr>
<td>Energy ($\mu$J)</td>
<td>0.022</td>
<td>0.043</td>
<td>0.202</td>
<td>2.98</td>
<td>6.669</td>
<td>0.032</td>
<td>0.017</td>
</tr>
<tr>
<td>Energy ($\mu$J) in non-clock-gated IPA</td>
<td>0.047</td>
<td>0.077</td>
<td>0.479</td>
<td>7.152</td>
<td>11.704</td>
<td>0.063</td>
<td>0.045</td>
</tr>
<tr>
<td>CPU</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Execution cycles</td>
<td>37677</td>
<td>96256</td>
<td>616805</td>
<td>5982730</td>
<td>9084101</td>
<td>164480</td>
<td>50085</td>
</tr>
<tr>
<td>Energy ($\mu$J)</td>
<td>0.132</td>
<td>0.337</td>
<td>2.159</td>
<td>20.94</td>
<td>31.794</td>
<td>0.576</td>
<td>0.175</td>
</tr>
<tr>
<td>Speed-up</td>
<td>6.21x</td>
<td>8.06x</td>
<td>10.97x</td>
<td>7.23x</td>
<td>4.9x</td>
<td>20.3x</td>
<td>10.55x</td>
</tr>
<tr>
<td>Energy-gain</td>
<td>6x</td>
<td>7.84x</td>
<td>10.69x</td>
<td>7.03x</td>
<td>4.77x</td>
<td>18x</td>
<td>10.29x</td>
</tr>
</tbody>
</table>
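The speed-up and energy-gain rows of Table 4.3 follow directly from the cycle counts and energy figures in the same table. The short sketch below recomputes them; the dictionary layout and variable names are chosen for the illustration, while the numbers are copied from the table.

```python
kernels = ["FIR", "MatMul", "Convolution", "SepFilter",
           "NonSepFilter", "FFT", "DC Filter"]

# Values copied from Table 4.3.
ipa_cycles = dict(zip(kernels, [6071, 11940, 56241, 827685, 1852382, 8076, 4748]))
cpu_cycles = dict(zip(kernels, [37677, 96256, 616805, 5982730, 9084101, 164480, 50085]))
ipa_energy_uj = dict(zip(kernels, [0.022, 0.043, 0.202, 2.98, 6.669, 0.032, 0.017]))
cpu_energy_uj = dict(zip(kernels, [0.132, 0.337, 2.159, 20.94, 31.794, 0.576, 0.175]))

# Speed-up is the ratio of execution cycles, energy gain the ratio of energies.
speedup = {k: cpu_cycles[k] / ipa_cycles[k] for k in kernels}
energy_gain = {k: cpu_energy_uj[k] / ipa_energy_uj[k] for k in kernels}

mean_speedup = sum(speedup.values()) / len(kernels)
mean_energy_gain = sum(energy_gain.values()) / len(kernels)
best_kernel = max(speedup, key=speedup.get)

for k in kernels:
    print(f"{k:13s} speed-up {speedup[k]:5.2f}x  energy gain {energy_gain[k]:5.2f}x")
```

The recomputed ratios agree with the table rows up to rounding: the largest speed-up is obtained for the FFT, and the averages are close to $9.7\times$ for speed-up and $9.23\times$ for energy gain.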

To establish the impact of the memory bandwidth over performance and energy efficiency, we vary the number of LSUs in the PE array from 4 to 16 and the number of TCDM banks from 4 to 32. The number of LSUs defines the available bandwidth from the TCDM to the
array, while increasing the number of TCDM banks reduces the banking conflict probability, improving performance. To perform the exploration without any bias towards configurations, the innermost loops of the kernels are unrolled to get a maximum of 16 load-store operations in one cycle (as the highest number of LSUs considered is 16, in the exploration). In Figure 4.3, each configuration is represented as a 2-dimensional number, where the first one represents the number of LSUs, and the second one represents the number of TCDM banks.
Results show that, as opposed to tightly coupled clusters of processors, which require a banking factor of 2 (i.e. the number of TCDM banks is twice the number of cores) [89], IPA performance is almost insensitive to the number of TCDM banks, and a configuration with a banking factor of 0.5 is sufficient to minimize the impact of contention on the shared memory banks for most applications. Indeed, while typical processor execution requires several load/store operations for variables exceeding the size of the register file, direct CDFG mapping on the IPA does not add extra memory operations beyond those for primary inputs and outputs (e.g. arrays), since all the temporary variables are stored in the register files of the PEs. Moreover, flexible point-to-point connections within the array allow data to be exchanged efficiently among PEs, further reducing the pressure on the TCDM. This concept is well explained in Figure 2.10 and Figure 2.4, which show the typical mapping of an application on the IPA.
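The notions of banking factor and bank conflict used above can be made concrete with a toy model. It assumes 32-bit word-level interleaving of addresses across banks and one request served per bank per cycle; the actual TCDM interconnect arbitration is not modeled, and all names are hypothetical.

```python
from collections import Counter

WORD_BYTES = 4  # assumption: 32-bit words, interleaved word by word


def bank_of(address, n_banks):
    """Bank index of a byte address under word-level interleaving."""
    return (address // WORD_BYTES) % n_banks


def cycles_to_serve(addresses, n_banks):
    """Cycles needed to serve simultaneous requests, one per bank per cycle."""
    if not addresses:
        return 0
    per_bank = Counter(bank_of(a, n_banks) for a in addresses)
    return max(per_bank.values())


def banking_factor(n_banks, n_requesters):
    """Number of banks divided by the number of requesters (LSUs or cores)."""
    return n_banks / n_requesters


# 8 LSUs reading 8 consecutive words.
consecutive = [WORD_BYTES * i for i in range(8)]
factor = banking_factor(4, 8)
cycles_4_banks = cycles_to_serve(consecutive, 4)
cycles_8_banks = cycles_to_serve(consecutive, 8)
# Worst case: a stride that maps every request to the same bank.
cycles_same_bank = cycles_to_serve([0, 16, 32, 48], 4)
```

In this model, a banking factor of 0.5 serializes eight simultaneous consecutive accesses into two cycles. The cost of such conflicts stays small in practice only if, as argued above, few memory operations are issued in the first place.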

Energy Efficiency

Figure 4.4 shows the average breakdown of power consumption for different configurations of the IPA. As expected, the PE array is the most dominant power consumer for all the configurations. The configurations with 4 TCDM banks achieve the best power advantages in each group, since increasing the number of TCDM banks increases the complexity of the interconnect, causing timing pressure on the array, which increases the sizing of the cells, hence power consumption.

![Power Consumption Graph](image)

Fig. 4.4 Average power breakdown in different configurations ([#LSUs][#TCDM Banks])

Figure 4.5 shows the average energy efficiency (MOPS/mW) for different configurations. Million Operations Per Second (MOPS) only considers the active PEs during execution, since a PE may be idle due to TCDM bank access conflicts, consecutive NOPs, or not mapped (not used in the application execution). Executions with high number of active PEs/cycle
achieve large MOPS. As depicted in Figure 4.5, for each number of LSUs in the PE array, the configuration with 4 TCDM banks achieves the best energy efficiency: it has the smallest number of banks in its group and therefore the lowest power consumption. At the same time, the number of active PEs/cycle is not significantly impacted, thanks to the least-memory-access policy of the compilation flow. As a result, the best efficiency, 2306 MOPS/mW, is achieved for matrix multiplication in a configuration with 8 LSUs and 4 TCDM banks. The minimum energy efficiency, 1112 MOPS/mW, is obtained for the separable filter in a configuration with 4 LSUs and 16 TCDM banks.

To investigate the power gain of fine-grained clock gating, we present the energy consumption of the clock-gated IPA and the non-clock-gated IPA in Table 4.3. On average, the clock-gated design consumes $2 \times$ less power than the non-clock-gated design. Due to the regular architecture of the PE array, fine-grained power management is well suited to the IPA. Moreover, thanks to the efficient execution of the CDFG on the array, the smaller energy required to execute an instruction in the IPA with respect to a CPU (5.6E-07 $\mu$J vs 3.49E-06 $\mu$J), and the effectiveness of the fine-grained power management, the IPA outperforms the or1k CPU’s energy efficiency by up to $18 \times$ (Table 4.3). The energy per instruction in the IPA is much lower than that of the CPU due to its simple instruction set architecture. Also, the lower number of memory operations executed in the IPA helps reduce the average energy consumption.
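The roughly $2 \times$ benefit of clock gating can be checked against the two IPA energy rows of Table 4.3. The sketch below recomputes the per-kernel ratio and its mean; the variable names are illustrative, and the values are copied from the table.

```python
kernels = ["FIR", "MatMul", "Convolution", "SepFilter",
           "NonSepFilter", "FFT", "DC Filter"]

# Energy in microjoules, copied from Table 4.3.
gated_uj = dict(zip(kernels, [0.022, 0.043, 0.202, 2.98, 6.669, 0.032, 0.017]))
non_gated_uj = dict(zip(kernels, [0.047, 0.077, 0.479, 7.152, 11.704, 0.063, 0.045]))

# Ratio > 1 means the clock-gated design consumes less energy.
gating_gain = {k: non_gated_uj[k] / gated_uj[k] for k in kernels}
mean_gating_gain = sum(gating_gain.values()) / len(kernels)

for k in kernels:
    print(f"{k:13s} {gating_gain[k]:.2f}x")
print(f"mean          {mean_gating_gain:.2f}x")
```

The per-kernel ratios range from about $1.75 \times$ to about $2.65 \times$, with a mean slightly above $2 \times$.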

![Fig. 4.5 Average energy efficiency for different configurations ([#LSUs][#TCDM Banks])](image)

### 4.1.3 Comparison with low-power CGRA architectures

Table 4.4 shows a comparison with existing CGRAs. For some papers, energy efficiency figures could not be extracted, so ‘NA’ is put in the corresponding cell. The energy efficiency results for Morphosys, Imagine and ReMAP presented in the table are studied in [20]. The energy efficiency figures of the other architectures are provided both in the original manufacturing technology node and scaled to the 28nm technology, according to the power scaling factor $C \times V^2$. $C$ and $V$ represent the effective capacitance (approximated with the channel length of the technology) and the supply voltage of the designs, normalized to the nominal parameters of the 28nm technology node. It should be noted that this simplified scaling factor penalizes our design, since deep-submicron technologies such as 28nm, where the load capacitance of gates is typically dominated by wires, require much more buffering than mature technology nodes, which degrades energy efficiency. Nevertheless, the IPA provides leading-edge energy efficiency, surpassing by more than one order of magnitude other architectures (ADRES, Morphosys, XPP, AsAP) featuring a C-based mapping flow. The driving factors for this gain are (a) architectural simplicity with a less complex interconnect network, (b) low-power instruction processing, (c) the lowest possible number of memory operations in application execution, and (d) the fine-grained power management architecture described in previous sections. One distinguishing characteristic of the proposed accelerator is the flexible execution model, capable of implementing a CDFG on the array without the need for a host processor, coupled with a fully automated mapping flow that starts from a plain ANSI C description of the application. Moreover, the memory architecture, based on a shared multi-banked TCDM, enables easy integration within ultra-low-power tightly coupled clusters of processors, while fine-grained power management improves energy efficiency by up to $2 \times$. The average power consumption of the IPA is 0.49mW, which is compatible with the ultra-low power target.

Fig. 4.6 Energy efficiency/area trade-off between several configurations ([#LSUs][#TCDM Banks])
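The $C \times V^2$ scaling rule can be written as a one-line function. The sketch rests on two assumptions that are not stated explicitly in the text: the nominal supply voltage of the 28nm node is taken as 1.0 V, and the twelfth and thirteenth columns of Table 4.4 are read as the native and the scaled energy efficiency in MOPS/mW. The function and constant names are hypothetical.

```python
TARGET_NODE_NM = 28.0
NOMINAL_VDD_V = 1.0   # assumption: nominal supply voltage of the 28nm node


def scale_energy_efficiency(mops_per_mw, node_nm, vdd_v):
    """Scale an energy efficiency figure to 28nm with the C x V^2 factor.

    C is approximated by the channel length ratio and V by the supply
    voltage ratio, both normalized to the 28nm reference.
    """
    c = node_nm / TARGET_NODE_NM
    v = vdd_v / NOMINAL_VDD_V
    return mops_per_mw * c * v ** 2


# AsAP row: 180nm, 1.8V, 11.00 MOPS/mW in its native technology.
asap_scaled = scale_energy_efficiency(11.00, 180, 1.8)
# ADRES row: 90nm, 1.0V, 17.51 MOPS/mW in its native technology.
adres_scaled = scale_energy_efficiency(17.51, 90, 1.0)
```

Under these assumptions the function gives about 229 MOPS/mW for AsAP and about 56 MOPS/mW for ADRES, which matches the scaled values listed for these two rows.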

#### Table 4.4 Comparison with the state of the art low power targets

<table>
<thead>
<tr>
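<th>Ref.</th>
<th>Architecture</th>
<th></th>
<th></th>
<th></th>
<th>Technology (nm)</th>
<th>Supply voltage</th>
<th>Area (mm²)</th>
<th>Power (mW)</th>
<th>Frequency (MHz)</th>
<th>Area efficiency (MOPS/mm²)</th>
<th>Energy efficiency (MOPS/mW)</th>
<th>Energy efficiency scaled to 28nm (MOPS/mW)</th>
<th>Performance (MOPS)</th>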
</tr>
</thead>
<tbody>
<tr>
<td>[20]</td>
<td>Morphosys</td>
<td>DFG</td>
<td>ANSI C</td>
<td>PE</td>
<td>150</td>
<td>1.8V</td>
<td>256</td>
<td>4000</td>
<td>450</td>
<td>113</td>
<td>7.20</td>
<td>150</td>
<td>28800</td>
</tr>
<tr>
<td>[20]</td>
<td>Imagine</td>
<td>DFG</td>
<td>NA</td>
<td>PE</td>
<td>150</td>
<td>1.5V</td>
<td>144</td>
<td>4000</td>
<td>296</td>
<td>165</td>
<td>12.40</td>
<td>150</td>
<td>23700</td>
</tr>
<tr>
<td>[101]</td>
<td>RAW</td>
<td>CDFG</td>
<td>ANSI C</td>
<td>PE</td>
<td>150</td>
<td>1.8V</td>
<td>256</td>
<td>2288</td>
<td>100</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>[93]</td>
<td>TRIPS</td>
<td>DFG</td>
<td>NA</td>
<td>PE</td>
<td>130</td>
<td>1.0V</td>
<td>336</td>
<td>35868</td>
<td>366</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>[20]</td>
<td>ReMAP</td>
<td>CDFG</td>
<td>NA</td>
<td>PE</td>
<td>180</td>
<td>1.62V</td>
<td>8.28</td>
<td>312</td>
<td>200</td>
<td>386</td>
<td>10.30</td>
<td>173</td>
<td>3200</td>
</tr>
<tr>
<td>[55]</td>
<td>TCPA</td>
<td>CDFG</td>
<td>Customized</td>
<td>VLIW</td>
<td>90</td>
<td>1.0V</td>
<td>15</td>
<td>12.48</td>
<td>200</td>
<td>106</td>
<td>112.00</td>
<td>360</td>
<td>1587</td>
</tr>
<tr>
<td>[84]</td>
<td>Layers</td>
<td>CDFG</td>
<td>NA</td>
<td>PE</td>
<td>65</td>
<td>1.0</td>
<td>0.35</td>
<td>44.45</td>
<td>488</td>
<td>2786</td>
<td>21.94</td>
<td>72</td>
<td>975</td>
</tr>
<tr>
<td>[63]</td>
<td>SmartCell</td>
<td>CDFG</td>
<td>Customized</td>
<td>PE</td>
<td>130</td>
<td>1.0V</td>
<td>8.2</td>
<td>160</td>
<td>100</td>
<td>13.04</td>
<td>37.80</td>
<td>176</td>
<td>6048</td>
</tr>
<tr>
<td>[40]</td>
<td>PipeRench</td>
<td>DFG</td>
<td>Customized</td>
<td>PE</td>
<td>180</td>
<td>1.8V</td>
<td>55.5</td>
<td>675</td>
<td>120</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>[80]</td>
<td>SYSCORE</td>
<td>CDFG</td>
<td>NA</td>
<td>PE</td>
<td>90</td>
<td>1.0V</td>
<td>5.73</td>
<td>18.5</td>
<td>100</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>[8]</td>
<td>ADRES</td>
<td>DFG</td>
<td>ANSI C</td>
<td>VLIW</td>
<td>90</td>
<td>1.0V</td>
<td>15</td>
<td>80</td>
<td>100</td>
<td>94</td>
<td>17.51</td>
<td>56</td>
<td>1409</td>
</tr>
<tr>
<td>[11]</td>
<td>XPP</td>
<td>CDFG</td>
<td>ANSI C</td>
<td>PE</td>
<td>90</td>
<td>1.0V</td>
<td>42</td>
<td>93</td>
<td>150</td>
<td>310</td>
<td>10.00</td>
<td>32</td>
<td>13000</td>
</tr>
<tr>
<td>[106]</td>
<td>AsAP</td>
<td>CDFG</td>
<td>ANSI C</td>
<td>PE</td>
<td>180</td>
<td>1.8V</td>
<td>23.76</td>
<td>84</td>
<td>116</td>
<td>40</td>
<td>11.00</td>
<td>229</td>
<td>942</td>
</tr>
<tr>
<td>[91]</td>
<td>Muccra-3</td>
<td>DFG</td>
<td>Customized</td>
<td>VLIW</td>
<td>65</td>
<td>1.2V</td>
<td>8.82</td>
<td>11</td>
<td>41.4</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>[69]</td>
<td>Lopes et al</td>
<td>DFG</td>
<td>NA</td>
<td>PE</td>
<td>90</td>
<td>1.0V</td>
<td>0.45</td>
<td>3.47</td>
<td>100</td>
<td>222</td>
<td>28.8</td>
<td>92.6</td>
<td>100</td>
</tr>
<tr>
<td>[71]</td>
<td>CMA</td>
<td>DFG</td>
<td>Customized</td>
<td>µC</td>
<td>65</td>
<td>0.5V</td>
<td>25</td>
<td>1.6</td>
<td>85</td>
<td>23</td>
<td>2186</td>
<td>2430</td>
<td>274</td>
</tr>
<tr>
<td>[32]</td>
<td>SIMD-CGRA</td>
<td>DFG</td>
<td>ANSI C</td>
<td>PE</td>
<td>65</td>
<td>0.9</td>
<td>0.59</td>
<td>NA</td>
<td>1</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>[52]</td>
<td>ULP-SRP</td>
<td>DFG</td>
<td>ANSI C</td>
<td>VLIW</td>
<td>40</td>
<td>0.5V</td>
<td>0.21</td>
<td>7</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>This work</td>
<td>IPA</td>
<td>CDFG</td>
<td>ANSI C</td>
<td>PE</td>
<td>28</td>
<td>0.6V</td>
<td>0.25</td>
<td>0.36</td>
<td>100</td>
<td>3036</td>
<td>1617.00</td>
<td>1617.00</td>
<td>759</td>
</tr>
</tbody>
</table>

**High performance targets**

**Low power targets**

**Ultra-low power targets**

4.2 Compilation

The results of the exploration show that a configuration of the IPA with 8 load-store units and 4 TCDM banks achieves the optimal performance/energy trade-off featuring an average speed-up of $9.7 \times$ (max $20.3 \times$, min $4.9 \times$) compared to a general-purpose processor. Thanks to the optimized architecture and mapping flow, the proposed accelerator achieves an average energy efficiency of 1617 MOPS/mW over a wide range of sensor signal processing kernels, surpassing other CGRA architectures featuring a C based mapping flow by more than one order of magnitude.

4.2.1 Performance evaluation of the compilation flow

This section analyses the performance and energy consumption results obtained by compiling kernels with the compilation flow described in the previous chapter. First, we compare the register allocation approach with different predication techniques for handling control in applications. Next, we run several kernels involved in a smart visual trigger application. In the context of smart visual applications, the shortcoming of traditional CGRAs is quite severe, since after brute-force morphological filtering (e.g. erosion, dilatation, Sobel convolution), these algorithms usually require the execution of highly control-intensive code for high-level feature extraction. In this experiment, we perform trigger-based feature extraction on the IPA, compiling the kernels with the proposed flow.

All the designs used in the following experiments were synthesized with Synopsys Design Compiler 2014.09-SP4 in STMicroelectronics 28nm UTBB FD-SOI technology. Synopsys PrimePower 2013.12-SP3 was used for timing and power analysis at the supply voltage of 0.6V, 25°C temperature, in typical process conditions.

4.2.2 Comparison of the register allocation approach with state of the art predication techniques

To evaluate the efficiency of the register allocation approach in handling control flow, we compare the execution of six control-intensive kernels against the state-of-the-art partial and full predication techniques. The results, presented in Table 4.5, show that the register-based approach achieves a maximum performance gain of $1.33 \times$ (minimum $1.04 \times$, average $1.13 \times$) over partial predication and of $1.8 \times$ (minimum $1.37 \times$, average $1.59 \times$) over full predication. The maximum gains achieved over existing methods are highlighted in bold in the table. The smaller number of executed instructions allows the register allocation approach to outperform the partial and full predication techniques in terms of energy efficiency by an average of $1.54 \times$ (min $1.35 \times$, max $2 \times$) and $1.71 \times$ (min $1.44 \times$, max $2 \times$), respectively. The table also presents a comparison with the or1k CPU and the C64 DSP processor [50] from TI. The register allocation approach achieves maximum performance gains of $3.94 \times$ and $15.8 \times$, and maximum energy gains of $7.52 \times$ and $32.77 \times$, over the or1k and C64 processors, respectively. Due to the abundance of branches in these kernels, the DSP processor performs worst. Finally, we compare with the basic systematic load-store (SLS) based approach for control mapping. Tables 4.5 and 4.6 show that the register allocation approach performs on average $1.16 \times$ (max $1.46 \times$, min $1.05 \times$) better than the SLS-based approach, while gaining on average $1.31 \times$ in energy efficiency, with a maximum gain of $2 \times$ and a minimum gain of $1.07 \times$.

---

\[^2\] PEs perform 8-bit operations; hence, energy efficiency is normalized to equivalent 32-bit operations and does not include the power of the controlling processor.

Table 4.5 Performance (cycles) comparison between the register allocation approach and the state of the art approaches

<table>
<thead>
<tr>
<th>Kernels</th>
<th># loops</th>
<th># conditionals</th>
<th>IPA reg based</th>
<th>IPA SLS based</th>
<th>IPA partial pred</th>
<th>IPA full pred</th>
<th>CPU</th>
<th>C64 DSP</th>
</tr>
</thead>
<tbody>
<tr>
<td>cordic</td>
<td>1</td>
<td>2</td>
<td>328</td>
<td>408</td>
<td>396</td>
<td>542</td>
<td>513</td>
<td>286</td>
</tr>
<tr>
<td>sobel</td>
<td>4</td>
<td>11</td>
<td>179617</td>
<td>262282</td>
<td>188253</td>
<td>245583</td>
<td>454028</td>
<td>669794</td>
</tr>
<tr>
<td>gcd</td>
<td>1</td>
<td>1</td>
<td>55312</td>
<td>58596</td>
<td>73747</td>
<td>92852</td>
<td>67545</td>
<td>92184</td>
</tr>
<tr>
<td>sad</td>
<td>2</td>
<td>1</td>
<td>15962</td>
<td>16824</td>
<td>16573</td>
<td>28776</td>
<td>62932</td>
<td>252193</td>
</tr>
<tr>
<td>deblocking</td>
<td>5</td>
<td>7</td>
<td>472258</td>
<td>495081</td>
<td>518722</td>
<td>727243</td>
<td>834683</td>
<td>1310220</td>
</tr>
<tr>
<td>manh-dist</td>
<td>1</td>
<td>1</td>
<td>6288</td>
<td>6826</td>
<td>6738</td>
<td>9522</td>
<td>15394</td>
<td>55317</td>
</tr>
<tr>
<td>max gain</td>
<td></td>
<td></td>
<td></td>
<td>1.46x</td>
<td>1.33x</td>
<td>1.8x</td>
<td>3.94x</td>
<td>15.8x</td>
</tr>
</tbody>
</table>
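
The gains quoted for the SLS baseline can be re-derived from the cycle counts in Table 4.5. The short Python sketch below does this; the dictionary layout and the function name `gains` are illustrative choices made here, not part of the thesis tooling.

```python
# Cycle counts copied from Table 4.5 (register allocation vs. SLS based mapping).
CYCLES = {
    "cordic":     {"reg": 328,    "sls": 408},
    "sobel":      {"reg": 179617, "sls": 262282},
    "gcd":        {"reg": 55312,  "sls": 58596},
    "sad":        {"reg": 15962,  "sls": 16824},
    "deblocking": {"reg": 472258, "sls": 495081},
    "manh-dist":  {"reg": 6288,   "sls": 6826},
}


def gains(table, baseline, proposed="reg"):
    """Return (min, average, max) of baseline cycles / proposed cycles."""
    ratios = [row[baseline] / row[proposed] for row in table.values()]
    return min(ratios), sum(ratios) / len(ratios), max(ratios)


lo, avg, hi = gains(CYCLES, "sls")
print(f"min {lo:.2f}x, avg {avg:.2f}x, max {hi:.2f}x")
```

The same function can be applied to the other columns of the table by adding their cycle counts under a new key and passing that key as `baseline`.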

### 4.2.3 Compiling smart visual trigger application

**Performance and energy consumption:** This section compares the performance of the IPA running at 100 MHz with that of an or10n CPU [39] running at 45 MHz, these being the operating frequencies of the two architectures at an operating voltage of 0.6V. The experiment is carried out on a smart visual surveillance application [75] operating on 160x120 images. It consists of 9 different motion detection kernels, including morphological filters (e.g. finding the minimum and maximum pixel, erosion, dilation, Sobel convolution), and a smart trigger kernel that asserts an alarm if the size of the detected objects surpasses a defined threshold; this latter kernel is composed of highly control intensive code. To compile the application for the IPA, we use the compilation flow presented earlier. Table 4.7 shows the
Table 4.6 Energy consumption (µJ) comparison between the register allocation approach and the state of the art approaches

<table>
<thead>
<tr>
<th>Kernels</th>
<th># loops</th>
<th># conditionals</th>
<th>IPA reg based</th>
<th>IPA SLS based</th>
<th>IPA partial pred</th>
</tr>
</thead>
<tbody>
<tr>
<td>cordic</td>
<td>1</td>
<td>2</td>
<td>0.001</td>
<td>0.002</td>
<td>0.002</td>
</tr>
<tr>
<td>sobel</td>
<td>4</td>
<td>11</td>
<td>0.736</td>
<td>1.102</td>
<td>1</td>
</tr>
<tr>
<td>gcd</td>
<td>1</td>
<td>1</td>
<td>0.227</td>
<td>0.246</td>
<td>0.392</td>
</tr>
<tr>
<td>sad</td>
<td>2</td>
<td>1</td>
<td>0.065</td>
<td>0.071</td>
<td>0.088</td>
</tr>
<tr>
<td>deblocking</td>
<td>5</td>
<td>7</td>
<td>1.936</td>
<td>2.079</td>
<td>2.754</td>
</tr>
<tr>
<td>manh-dist</td>
<td>1</td>
<td>1</td>
<td>0.026</td>
<td>0.029</td>
<td>0.036</td>
</tr>
<tr>
<td>max gain</td>
<td></td>
<td></td>
<td>2x</td>
<td>2x</td>
<td>2x</td>
</tr>
</tbody>
</table>

performance comparison between executing the application on the IPA (programmed in plain ANSI C) and on a highly optimized core with support for vectorization and DSP extensions, which can only be exploited by optimizing the source code with intrinsics [39]. The IPA surpasses the CPU by 6x in performance and by 10x in energy consumption. It is interesting to note that, while DSP instructions do not improve the performance of the core during execution of the smart trigger kernel, its implementation on the IPA provides even more benefit than the data-flow part of the application (motion detection), improving performance by 10x with respect to execution on the processor.

Table 4.7 Performance comparison

<table>
<thead>
<tr>
<th>Applications</th>
<th>CPU cycles</th>
<th>CPU energy[µJ]</th>
<th>CPU (optimized) cycles</th>
<th>CPU (optimized) energy[µJ]</th>
<th>IPA perf gain</th>
<th>IPA energy gain</th>
</tr>
</thead>
<tbody>
<tr>
<td>Motion Detection</td>
<td>2 237 124</td>
<td>10.179</td>
<td>2 237 124</td>
<td>10.179</td>
<td>9x</td>
<td>9x</td>
</tr>
<tr>
<td>Smart Trigger</td>
<td>480 000</td>
<td>2.184</td>
<td>480 000</td>
<td>2.184</td>
<td>10x</td>
<td>10x</td>
</tr>
<tr>
<td>Overall</td>
<td>2 707 200</td>
<td>12.318</td>
<td>2 707 200</td>
<td>12.318</td>
<td>6x</td>
<td>6x</td>
</tr>
</tbody>
</table>

**Comparison with the state of the art architectures:** Table 4.8 presents the performance comparison of the smart visual trigger application running on a CPU and on two state of the art reconfigurable array architectures [75] [87], chosen because the target application is available for them and because their features are similar to those of other state of the art CGRAs. The results show that the two state of the art CGRAs deliver very high performance on the data-flow portion of the application, thanks to a highly optimized and pipelined datapath that implements operations on binary images as Boolean operations [75]. However, they cannot implement the control dominated kernel, which runs on the CPU and becomes a major performance bottleneck when considering the whole application. On the other hand, the superior flexibility of the IPA makes it possible to implement the whole application on the accelerator, surpassing the performance of the other CPU + CGRA systems by 1.6x. It is important to note that in more complex smart vision applications, such as context understanding and scene labelling, control intensive kernels commonly dominate the overall execution time, which would further improve performance with respect to CGRA accelerators that can only map DFGs.

<table>
<thead>
<tr>
<th>Reference</th>
<th>Motion Detection</th>
<th>Smart Trigger</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU</td>
<td>116</td>
<td>25</td>
<td>141</td>
</tr>
<tr>
<td>CPU (Optimized)</td>
<td>68</td>
<td>25</td>
<td>93</td>
</tr>
<tr>
<td>Mucci et al [75]</td>
<td>2.09</td>
<td>25</td>
<td>27.09</td>
</tr>
<tr>
<td>Rossi et al [87]</td>
<td>1.27</td>
<td>25</td>
<td>26.27</td>
</tr>
<tr>
<td>IPA</td>
<td>13.6</td>
<td>2.5</td>
<td>16.1</td>
</tr>
</tbody>
</table>
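
The bottleneck argument above can be checked directly against the rows of Table 4.8. The following Python sketch is only an illustration: the helper names `total` and `trigger_share` are introduced here, and since the table does not state its time unit, only ratios are computed.

```python
# (motion detection, smart trigger) times copied from Table 4.8.
# The table does not state a unit, so only ratios are computed here.
TIMES = {
    "CPU": (116, 25),
    "CPU (Optimized)": (68, 25),
    "Mucci et al [75]": (2.09, 25),
    "Rossi et al [87]": (1.27, 25),
    "IPA": (13.6, 2.5),
}


def total(platform):
    """Total application time: data-flow part plus control-intensive part."""
    motion, trigger = TIMES[platform]
    return motion + trigger


def trigger_share(platform):
    """Fraction of the total time spent in the smart trigger kernel."""
    return TIMES[platform][1] / total(platform)


# Fastest of the two CPU + CGRA systems on the whole application.
best_cgra = min(("Mucci et al [75]", "Rossi et al [87]"), key=total)
speedup = total(best_cgra) / total("IPA")

print(f"{best_cgra}: trigger share {trigger_share(best_cgra):.0%}")
print(f"IPA speed-up over it: {speedup:.2f}x")
```

For the CPU + CGRA systems the smart trigger kernel accounts for almost all of the total time, which is why accelerating only the data-flow part yields little overall benefit.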

4.3 Conclusion

With respect to the state of the art partial and full predication techniques, the proposed compilation flow improves energy efficiency by $1.54 \times$ on average (min $1.35 \times$, max $2 \times$) and by $1.71 \times$ on average (min $1.44 \times$, max $2 \times$), respectively. The experiment on the target smart visual trigger application shows that the IPA achieves an average performance of 507 MOPS with an average energy efficiency of 142 MOPS/mW at 0.6V, surpassing a general purpose processor by 6x in performance and 10x in energy efficiency. The proposed IPA also surpasses the performance of the state of the art CGRA architectures by 1.6x, thanks to its capability of efficiently implementing control intensive code. In the next chapter, we integrate the IPA in a multi-core cluster and present the energy efficiency aspects of heterogeneous computing.
Chapter 5

The Heterogeneous Parallel Ultra-Low-Power Processing-Platform (PULP) Cluster

High performance and extreme energy efficiency are strict requirements for many deeply embedded near-sensor processing applications such as wireless sensor networks, end-nodes of the Internet of Things (IoT) and wearables. One of the most traditional approaches to improving the energy efficiency of deeply embedded computing systems is to exploit architectural heterogeneity by coupling general-purpose processors with application- or domain-specific accelerators in a single computing fabric. On the other hand, most recent ultra-low power designs exploit multiple homogeneous programmable processors operating near threshold [86]. Such an approach, which joins parallelism with low-voltage computing, is emerging as an attractive way to combine performance scalability with high energy efficiency.

The concepts of parallelism and heterogeneity in ultra-low power designs inherit from traditional high-end embedded platforms such as NVIDIA Tegra [77], IBM PowerEN processor [56], Qualcomm Snapdragon S4 Pro [97], STMicroelectronics P2012 [6], Kalray MPPA [25].

In this chapter, we present a heterogeneous architecture which integrates a near-threshold tightly-coupled cluster of processors [86] augmented with the Integrated Programmable Array (IPA) presented in [23]. We synthesized the architecture in a 28nm FD-SOI technology, and we carried out a quantitative exploration combining physical synthesis results (i.e. frequency, area, power) and benchmarking on a set of signal processing kernels typical of end-node applications. Two interesting findings of our exploration are that (1) the performance of the IPA is much less sensitive to memory bandwidth than that of parallel processor clusters [23] and (2) the simpler nature of its architecture allows it to run $2\times$ faster than the rest of the system. Experimental results show that the heterogeneous architecture achieves significant performance improvement for both compute and control intensive benchmarks with respect to the software cluster.

5.1 PULP heterogeneous architecture

The PULP platform project is a collaborative effort of several academic and industrial institutions\(^1\), whose goal is to design an ultra-low-power platform achieving high levels of energy efficiency by combining near-threshold computing and parallel computing and by exposing the low power features of the technology up the stack, at the architecture and software levels.

Ultra-low power operation and extreme energy efficiency are the key features of the implementation of PULP, which exploits near threshold computing. The PULP SoC utilizes multi-core parallelism with explicitly-managed shared L1 memory to overcome performance degradation at low voltage, while keeping the flexibility typical of instruction processors. Moreover, enabling the cores to operate on-demand over wide supply voltage and body bias ranges makes it possible to achieve high energy efficiency over a wide spectrum of computational demands.

5.1.1 PULP SoC overview

Figure 5.1 shows the main building blocks of a single-cluster PULP SoC. The PULP cluster features eight 32-bit RISC-V cores based on a four-stage pipeline micro-architecture optimized for energy-efficient operation [38], sharing a 64KB multi-banked scratchpad memory through a low-latency interconnect [83]. The ISA of the cores is extended with instructions targeting energy efficient digital signal processing, such as hardware loops, load/store with pre/post increment and SIMD operations. The cores share a 4KB instruction cache, which boosts performance and energy efficiency for tightly coupled clusters of processors that typically rely on data parallel computational models [68]. Off-cluster data transfers are managed by a lightweight multi-channel DMA optimized for energy-efficient operation [88]. Both the instruction cache (I$) and the DMA connect to an AXI4 cluster bus. A peripheral interconnect is used to communicate with on-cluster peripherals such as a timer, an event-unit used to accelerate synchronization among the cores, and other memory mapped peripherals such as application-specific accelerators. To operate at the best operating point for a given workload, the cluster can be integrated in an independent voltage and frequency domain, featuring dual-clock FIFOs and level-shifters at its boundary.

\(^1\)Includes the University of Bologna, ETH Zurich, STMicroelectronics, EPFL Lausanne, Politecnico di Milano and others.

![Fig. 5.1 PULP SoC. Source [89]](image)

### 5.1.2 Heterogeneous Cluster

The PULP cluster is augmented with the Integrated Programmable Array accelerator, as shown in Figure 5.2. Figure 5.3 shows a detailed block diagram of the subsystem embedding the IPA array. The IPA array is configured through a global context memory (GCM), which locally stores the configuration bitstream of the PEs. The GCM is connected through a DMA-capable AXI-4 port to the cluster bus, enabling pre-fetching of IPA contexts from L2 memory. The GCM is sized at twice the worst-case configuration bitstream of the IPA. This makes it possible to employ a double-buffering mechanism that loads a new bitstream from the L2 to the GCM while the current one is being loaded on the array, completely hiding the reconfiguration time. More details on the structure of the IPA array bitstream can be found in [24]. A set of memory mapped control registers allows the cores to load a new context to the IPA array, trigger the execution of a kernel and synchronize with the other processors in the cluster.
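
The benefit of sizing the GCM for two bitstreams can be illustrated with a simple timing model. The Python sketch below is an assumption-laden illustration rather than a description of the actual hardware controller: the function `total_time`, its parameters and the numbers in the usage lines are all hypothetical, and each kernel is modelled only by a fetch time (L2 to GCM) and a busy time (GCM to array plus execution).

```python
def total_time(fetch, run, double_buffered):
    """Time to run a sequence of kernels on the array.

    fetch[i]: time to bring context i from L2 into the GCM.
    run[i]:   time the array is busy with context i.
    With double buffering, the fetch of context i+1 overlaps with run[i].
    """
    if len(fetch) != len(run):
        raise ValueError("fetch and run must have the same length")
    if not run:
        return 0
    if not double_buffered:
        # Every fetch is exposed: the array idles while the GCM is refilled.
        return sum(fetch) + sum(run)
    elapsed = fetch[0]  # the very first fetch cannot be hidden
    for i, busy in enumerate(run):
        next_fetch = fetch[i + 1] if i + 1 < len(fetch) else 0
        # The next context is fetched into the free half of the GCM.
        elapsed += max(busy, next_fetch)
    return elapsed


# Hypothetical times in arbitrary units.
print(total_time([4, 4, 4], [10, 10, 10], double_buffered=False))
print(total_time([4, 4, 4], [10, 10, 10], double_buffered=True))
```

In this model the reconfiguration time is fully hidden whenever each fetch is shorter than the kernel running before it; only the first fetch remains visible.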

As opposed to many CGRA architectures, the IPA can access a multi-banked shared memory through 8 master ports connected to the low-latency interconnect. This eases data sharing with the other processors of the cluster, following the computational model described in [17]. The number of ports has been chosen to optimize the trade-off between the size of the interconnect and the bandwidth requirements of the IPA. Following the analysis conducted in [23], which shows that the IPA can operate $2 \times$ faster than the processors, we have extended the architecture of the cluster so that the IPA can work at twice the frequency of the rest of the cluster. This approach allows each component in the cluster to operate at its optimal frequency without paying the overhead of dual-clock FIFOs, which require a significant amount of logic and synchronization. Instead, the hardware support for the dual-frequency mode consists of a clock divider that generates the two edge aligned clocks, and two modules that adapt the request-grant protocol of the low-latency interconnect [83] to the frequency domain crossing, as shown in Figure 5.4.

5.2 Software infrastructure

To offload jobs to the IPA and synchronize the execution, the cores access the control registers of the IPA through memory mapped operations. The control registers consist of a command register and a status register. We designed a simple Application Programming Interface (API) to offload tasks to the IPA and synchronize with it. The main functions are described in Table 5.1. Before execution starts in the IPA accelerator, the cores load the corresponding context and data from the L2 memory to the GCM and L1 memory, respectively.
5.3 Implementation and Benchmarking

In this section we present the implementation results of the heterogeneous PULP cluster. The three execution modes considered in these comparisons are: (a) single-core: running applications on a single core, (b) ipa: running applications on the IPA, where the core takes part in offloading only, (c) multi-core: running applications on parallel cores. All the benchmarks are coded in fully portable C, using the OpenMP programming model to express parallelism for PULP.

Fig. 5.4 Synchronous interface for reliable data transfer between the two clock domains.

Table 5.1 List of APIs for controlling IPA

<table>
<thead>
<tr>
<th>Function</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>void load_data_l2totcdm(int DMA_CORE_ID, int size, unsigned int l2_addr, unsigned int tcdm_addr)</td>
<td>Writes data from L2 memory to the TCDM banks through DMA_CORE</td>
</tr>
<tr>
<td>void load_context_l2togcm(int DMA_IPA_ID, int size, unsigned int l2_addr, unsigned int gcm_addr)</td>
<td>Writes context from L2 memory to the GCM through DMA_IPA</td>
</tr>
<tr>
<td>int ipa_start_execution()</td>
<td>Initiates IPA execution by writing in the command register</td>
</tr>
<tr>
<td>void ipa_check_status(int id)</td>
<td>Core synchronization</td>
</tr>
<tr>
<td>void free_ipa(int id)</td>
<td>Releases the IPA</td>
</tr>
</tbody>
</table>
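
The order in which a core would call the functions of Table 5.1 can be sketched with a toy model. The real API is in C; the Python class below is only a stand-in that records the call order. The method names mirror Table 5.1, while the class `IpaModel`, the function `offload`, the return value of `ipa_start_execution` and all addresses and sizes are hypothetical placeholders introduced for illustration.

```python
class IpaModel:
    """Toy stand-in for the IPA subsystem; it only records the order of calls."""

    def __init__(self):
        self.log = []
        self.busy = False

    def load_context_l2togcm(self, dma_ipa_id, size, l2_addr, gcm_addr):
        self.log.append("context: L2 -> GCM")

    def load_data_l2totcdm(self, dma_core_id, size, l2_addr, tcdm_addr):
        self.log.append("data: L2 -> TCDM")

    def ipa_start_execution(self):
        self.busy = True  # models a write to the command register
        self.log.append("start")
        return 0  # assumed success code, not documented in the source

    def ipa_check_status(self, ipa_id):
        # Models waiting on the status register; the toy kernel ends at once.
        self.busy = False
        self.log.append("synchronize")

    def free_ipa(self, ipa_id):
        self.log.append("release")


def offload(ipa):
    """Offload one kernel: stage context and data, run, wait, release."""
    ipa.load_context_l2togcm(0, 1024, 0x1C000000, 0x0)       # hypothetical values
    ipa.load_data_l2totcdm(0, 4096, 0x1C010000, 0x10000000)  # hypothetical values
    ipa.ipa_start_execution()
    ipa.ipa_check_status(0)
    ipa.free_ipa(0)
    return ipa.log


print(offload(IpaModel()))
```

Context and data are staged before the start command, matching the description in Section 5.2; the relative order of the two loads is an arbitrary choice in this sketch.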

5.3.1 Implementation Results

Table 5.2 presents the details of the memories used in the cluster. The cluster consists of 8 cores featuring 4 kB of shared I$ and one IPA with 16 PEs and a GCM of 8 kB, while the TCDM is composed of 16 banks of 4 kB each, leading to an overall TCDM size of 64 kB. These architectural parameters were chosen to fit the constraints of a wide range of signal processing benchmarks. The SoC was synthesized with Synopsys Design Compiler 2013.12-SP3 on a STMicroelectronics 28nm UTBB FD-SOI technology library. Since the achievable frequency of the PEs in the IPA is higher than that of the RI5CY cores used in the cluster, the IPA is clocked at 100 MHz, while the rest of the cluster runs at 50 MHz (in the SS, 0.6V, −40°C corner). Synopsys PrimePower 2013.12-SP3 was used for timing and power analysis at a supply voltage of 0.6V and a temperature of 25°C, in typical process conditions. Table 5.3 presents the area of the components in the cluster. Although the total area of the IPA with 16 PEs is similar to that of the 8 cores combined, the area occupied by the GCM is much smaller than that of the total cache memory, which in turn provides better area efficiency when running applications on the IPA.

Table 5.2 Cluster Parameters and memories used

<table>
<thead>
<tr>
<th>Name</th>
<th>Type</th>
<th>Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>L1 Memory (16 banks)</td>
<td></td>
<td></td>
</tr>
<tr>
<td>TCDM</td>
<td>SRAM</td>
<td>64 KB</td>
</tr>
<tr>
<td>Cores (8)</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Instruction Cache</td>
<td>SRAM</td>
<td>4KB</td>
</tr>
<tr>
<td>IPA (16 PEs)</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Global context memory</td>
<td>SRAM</td>
<td>8KB</td>
</tr>
<tr>
<td>Instruction Register File (IRF)</td>
<td>Registers</td>
<td>0.16KB</td>
</tr>
<tr>
<td>Regular register file (RRF)</td>
<td>Registers</td>
<td>0.032KB</td>
</tr>
<tr>
<td>Constant register file (CRF)</td>
<td>Registers</td>
<td>0.128KB</td>
</tr>
</tbody>
</table>

5.3.2 Performance and Energy Consumption Results

Table 5.4 reports the execution time in nanoseconds for different benchmarks running on a single core, on 8 cores and on the IPA. The IPA execution time includes the time taken to load the context into the PEs. Compared to single-core execution, the accelerator achieves a maximum speed-up of 8× (with a minimum of 2.49× and an average of 5.4×). A control intensive kernel like GCD does not exhibit parallelism, hence parallel software execution does not improve performance on the homogeneous cluster. On the other hand, execution on the IPA improves the performance by almost 5×, since it exploits instruction-level parallelism rather than data-level parallelism only. The performance gain of the accelerator for compute intensive kernels like matrix multiplication, convolution, FIR and separable filters is limited when compared to the performance of the parallel cores. However,

Fig. 5.5 Power consumption breakdown in percentage: Executing Matrix-Multiplication in (a) Multi-Core; (b) Single-Core; (c) IPA. (OTHERS contain peripherals, interconnect, clk-gate, bbmux)

Fig. 5.6 Power consumption breakdown in percentage: Executing GCD in (d) Single-Core; (e) IPA. (OTHERS contain peripherals, interconnect, clk-gate, bbmux)
Table 5.3 Synthesized area information for the PULP heterogeneous cluster

<table>
<thead>
<tr>
<th>Components</th>
<th>Area ($\mu m^2$)</th>
<th>% of cluster area</th>
</tr>
</thead>
<tbody>
<tr>
<td>CORES</td>
<td>160,352</td>
<td>18</td>
</tr>
<tr>
<td>ICACHE</td>
<td>190,089</td>
<td>22</td>
</tr>
<tr>
<td>DMA_CORE</td>
<td>41,406</td>
<td>5</td>
</tr>
<tr>
<td>IPA</td>
<td>156,323</td>
<td>18</td>
</tr>
<tr>
<td>DMA_IPA</td>
<td>32,636</td>
<td>4</td>
</tr>
<tr>
<td>GCM</td>
<td>18,704</td>
<td>2</td>
</tr>
<tr>
<td>TCDM</td>
<td>149,638</td>
<td>17</td>
</tr>
<tr>
<td>CLUSTER_INTCNCT</td>
<td>63,126</td>
<td>7</td>
</tr>
<tr>
<td>CLUSTER_PERIPHERALS</td>
<td>21,610</td>
<td>2</td>
</tr>
<tr>
<td>OTHERS</td>
<td>37,932</td>
<td>4</td>
</tr>
<tr>
<td>Total</td>
<td>871,816</td>
<td>100</td>
</tr>
</tbody>
</table>

the relatively small performance gain compared to the parallel cluster is compensated by the gain in energy consumption (Table 5.6), which is due to the simpler nature of the compute units of the IPA with respect to full processors, to the smaller number of power-hungry load/store operations (Table 5.7), and to the fine-grained power management architecture that allows clock gating of the inactive PEs during execution (Table 5.6).

Table 5.4 Performance evaluation in execution time (ns) for different configurations in the heterogeneous platform

<table>
<thead>
<tr>
<th>Kernels</th>
<th>Single-core (ns)</th>
<th>Multi-core (ns)</th>
<th>Speed-up in multi-core (x)</th>
<th>IPA (ns)</th>
<th>Speed-up in IPA (x)</th>
</tr>
</thead>
<tbody>
<tr>
<td>MatMul</td>
<td>3,358,740</td>
<td>435,180</td>
<td>7.72</td>
<td>432,630</td>
<td>7.76</td>
</tr>
<tr>
<td>Convolution</td>
<td>9,733,380</td>
<td>1,520,840</td>
<td>6.4</td>
<td>1,494,860</td>
<td>6.51</td>
</tr>
<tr>
<td>FFT</td>
<td>767,640</td>
<td>142,720</td>
<td>5.38</td>
<td>94,510</td>
<td>8.12</td>
</tr>
<tr>
<td>FIR</td>
<td>182,500</td>
<td>33,460</td>
<td>5.45</td>
<td>33,410</td>
<td>5.46</td>
</tr>
<tr>
<td>Separable Filter</td>
<td>39,870,420</td>
<td>6,404,160</td>
<td>6.23</td>
<td>6,334,700</td>
<td>6.29</td>
</tr>
<tr>
<td>Sobel Filter</td>
<td>117,024,880</td>
<td>40,894,260</td>
<td>2.86</td>
<td>28,865,890</td>
<td>4.05</td>
</tr>
<tr>
<td>GCD</td>
<td>2,951,160</td>
<td>2,951,160</td>
<td>1</td>
<td>611,300</td>
<td>4.83</td>
</tr>
<tr>
<td>Cordic</td>
<td>9,000</td>
<td>7,000</td>
<td>1.29</td>
<td>3,610</td>
<td>2.49</td>
</tr>
<tr>
<td>Manh Dist</td>
<td>244,640</td>
<td>164,640</td>
<td>1.49</td>
<td>70,300</td>
<td>3.48</td>
</tr>
</tbody>
</table>
Table 5.5 Performance comparison between iso-frequency and 2 × frequency execution in IPA

<table>
<thead>
<tr>
<th>Benchmarks</th>
<th>#cycles in iso frequency</th>
<th>#cycles in 2x frequency</th>
<th>Loss due to stalls</th>
<th>overall execution speed-up</th>
</tr>
</thead>
<tbody>
<tr>
<td>MatMul</td>
<td>39,330</td>
<td>43,263</td>
<td>3,933</td>
<td>1.82</td>
</tr>
<tr>
<td>Convolution</td>
<td>130,896</td>
<td>149,486</td>
<td>18,590</td>
<td>1.75</td>
</tr>
<tr>
<td>FFT</td>
<td>8,182</td>
<td>9,451</td>
<td>1,269</td>
<td>1.73</td>
</tr>
<tr>
<td>FIR</td>
<td>3,122</td>
<td>3,341</td>
<td>219</td>
<td>1.87</td>
</tr>
<tr>
<td>Separable filter</td>
<td>575,882</td>
<td>633,470</td>
<td>57,588</td>
<td>1.82</td>
</tr>
<tr>
<td>Sobel Filter</td>
<td>2,634,172</td>
<td>2,886,589</td>
<td>252,417</td>
<td>1.83</td>
</tr>
<tr>
<td>GCD</td>
<td>58,573</td>
<td>61,130</td>
<td>2,557</td>
<td>1.92</td>
</tr>
<tr>
<td>Cordic</td>
<td>328</td>
<td>361</td>
<td>33</td>
<td>1.82</td>
</tr>
<tr>
<td>ManhDistance</td>
<td>6,391</td>
<td>7,030</td>
<td>639</td>
<td>1.82</td>
</tr>
<tr>
<td>Average</td>
<td></td>
<td></td>
<td></td>
<td>1.82</td>
</tr>
</tbody>
</table>

Table 5.5 presents the performance improvement of the IPA when moving from iso-frequency to the 2 × frequency domain execution in the IPA. This shows that, although there is a reduction of memory bandwidth (see loss due to additional stalls column in Table 5.5), since the TCDM operates at the same frequency as the rest of the cluster (i.e. half frequency w.r.t. the IPA array), an average of 1.82 × speed-up (with maximum of 1.92 × and a minimum of 1.73 ×) can be achieved with this dual-frequency cluster architecture.
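
The speed-up column of Table 5.5 follows from the cycle counts: the IPA clock is doubled, but the extra stalls add cycles. The Python sketch below reproduces this arithmetic and also converts the cycle counts at 2× frequency into the IPA execution times of Table 5.4, using the 100 MHz IPA clock given in Section 5.3.1. The function names are illustrative choices made here.

```python
IPA_CLOCK_MHZ = 100  # IPA clock frequency from Section 5.3.1

# (cycles at iso-frequency, cycles at 2x frequency), copied from Table 5.5.
CYCLES = {
    "MatMul": (39330, 43263),
    "Convolution": (130896, 149486),
    "FFT": (8182, 9451),
    "FIR": (3122, 3341),
    "Separable filter": (575882, 633470),
    "Sobel Filter": (2634172, 2886589),
    "GCD": (58573, 61130),
    "Cordic": (328, 361),
    "ManhDistance": (6391, 7030),
}


def speedup(iso_cycles, fast_cycles, clock_ratio=2):
    """Execution speed-up when the clock is clock_ratio times faster."""
    return clock_ratio * iso_cycles / fast_cycles


def exec_time_ns(cycles, clock_mhz=IPA_CLOCK_MHZ):
    """Execution time in nanoseconds for a given cycle count."""
    return cycles * 1000 / clock_mhz


per_kernel = {name: speedup(*c) for name, c in CYCLES.items()}
average = sum(per_kernel.values()) / len(per_kernel)

for name, s in per_kernel.items():
    print(f"{name:17s} {s:.2f}x  {exec_time_ns(CYCLES[name][1]):>12,.0f} ns")
print(f"average speed-up: {average:.2f}x")
```

A speed-up below 2× therefore directly measures the cost of the stalls caused by the TCDM running at half the IPA frequency.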

The power consumption profiles for the different modes of execution are presented in Figures 5.5 and 5.6, which show the percentage contribution of the components in the cluster. Figure 5.5 (a), (b), (c) represents the power breakdown while executing matrix multiplication in multi-core, single-core and IPA mode respectively, which is representative of the other compute intensive benchmarks. Similarly, Figure 5.6 (a) and (b) present the profiles for executing GCD, a control intensive benchmark, in single-core and IPA mode respectively. In Figure 5.5 (a), (b), (c), the TCDM contributes 14.7%, 15% and 7.2% of the total power in the multi-core, single-core and IPA configurations, respectively. The reduced number of memory accesses in IPA execution helps to achieve better energy efficiency. While executing GCD on a single core and on the IPA, the TCDM consumes around 15.9% and 2.5% of the total power, respectively. Also, the IPA consumes around 33.9% of the total power while executing the GCD kernel, due to heavy usage of internal registers to support control flow dependencies. The simpler nature of the compute units, the low burden on the TCDM and the data exchange through PEs explain the energy gain of 7 × in the IPA execution.
Table 5.6 Energy consumption evaluation in µJ for different configurations in the heterogeneous platform

<table>
<thead>
<tr>
<th>Kernels</th>
<th>Single-core</th>
<th>Multi-core</th>
<th>IPA</th>
<th>IPA: % of active PEs/cycle</th>
</tr>
</thead>
<tbody>
<tr>
<td>MatMul</td>
<td>1.247</td>
<td>0.313</td>
<td>0.208</td>
<td>58.5</td>
</tr>
<tr>
<td>Convolution</td>
<td>2.876</td>
<td>1.095</td>
<td>0.658</td>
<td>59.2</td>
</tr>
<tr>
<td>FFT</td>
<td>0.292</td>
<td>0.087</td>
<td>0.042</td>
<td>59.7</td>
</tr>
<tr>
<td>FIR</td>
<td>0.08</td>
<td>0.026</td>
<td>0.026</td>
<td>46.1</td>
</tr>
<tr>
<td>Separable filter</td>
<td>16.663</td>
<td>4.611</td>
<td>4.28</td>
<td>55.5</td>
</tr>
<tr>
<td>Sobel Filter</td>
<td>51.491</td>
<td>29.444</td>
<td>12.701</td>
<td>51.2</td>
</tr>
<tr>
<td>GCD</td>
<td>1.151</td>
<td>1.151</td>
<td>0.257</td>
<td>6.25</td>
</tr>
<tr>
<td>Cordic</td>
<td>0.004</td>
<td>0.003</td>
<td>0.001</td>
<td>50</td>
</tr>
<tr>
<td>ManhDistance</td>
<td>0.1</td>
<td>0.095</td>
<td>0.03</td>
<td>48.5</td>
</tr>
</tbody>
</table>

Table 5.7 Comparison between total number of memory operations executed

<table>
<thead>
<tr>
<th>Benchmarks</th>
<th>multi-core</th>
<th>single-core</th>
<th>IPA</th>
</tr>
</thead>
<tbody>
<tr>
<td>MatMul</td>
<td>66,584</td>
<td>66,561</td>
<td>35,032</td>
</tr>
<tr>
<td>Convolution</td>
<td>135,280</td>
<td>135,114</td>
<td>75,600</td>
</tr>
<tr>
<td>FFT</td>
<td>12,528</td>
<td>11,733</td>
<td>6,528</td>
</tr>
<tr>
<td>FIR</td>
<td>5,904</td>
<td>5,893</td>
<td>3,990</td>
</tr>
<tr>
<td>Separable filter</td>
<td>142,840</td>
<td>142,800</td>
<td>95,200</td>
</tr>
<tr>
<td>Sobel Filter</td>
<td>148,240</td>
<td>148,224</td>
<td>120,000</td>
</tr>
<tr>
<td>GCD</td>
<td>64,531</td>
<td>64,531</td>
<td>2</td>
</tr>
<tr>
<td>Cordic</td>
<td>32</td>
<td>28</td>
<td>15</td>
</tr>
<tr>
<td>ManhDistance</td>
<td>2,158</td>
<td>2,049</td>
<td>2,048</td>
</tr>
</tbody>
</table>

5.4 Conclusion

In this chapter, we presented a novel approach towards heterogeneous computing, integrating an ultra-low power reconfigurable accelerator into the PULP multi-core cluster. The experiments integrating the IPA in the PULP platform suggest that architectural heterogeneity is a powerful approach to improve the energy profile of computing systems. We have presented three possible executions of the benchmarks in the IPA integrated PULP platform. The heterogeneous cluster achieves up to $4.8 \times$ speed-up and up to $4.4 \times$ better energy efficiency with respect to an 8-core homogeneous cluster.
Coarse-Grained Reconfigurable Architectures (CGRAs) are an appealing choice of reconfigurable accelerator platform for exploring both performance and energy efficiency, the two most critical metrics in the embedded computing domain. Since improving both energy and performance requires better exploitation of the underlying architecture, the design of the compiler is of great importance. On the one hand, the design of the computation unit and the interconnect network strategy, along with the computation model, determine the compiler complexity and the flexibility of computing. On the other hand, the compiler's ability to truly exploit the micro-architecture determines the final performance. Hence, a combined design flow is necessary to satisfy performance and power constraints.

In this dissertation, we addressed ultra-low-power acceleration through the CGRA approach. In this regard, we have explored several architectural aspects, such as the computation unit, interconnect network, synchronization mechanism and power management, to design an Integrated Programmable Array (IPA) accelerator operating in a 0 to 3 mW power envelope and achieving significant performance improvement over ultra-low power processor cores. We also discussed the compilation approach to accelerate kernels, with minimizing memory accesses as a pressing concern in an ultra-low-power execution environment. In addition, the compilation approach, together with the hardware synchronization, makes the framework compatible with applications containing several nested loops and conditionals.

The key aspects of the thesis are listed below:

• **Data and control dependent execution:** In this dissertation, we have pointed out that a CGRA acceleration framework must be able to handle both the data and control flow of the application, to eliminate the communication overhead with the host and achieve increased flexibility of execution. We have introduced a *register allocation* approach for supporting the execution of control and data dependences. This approach works independently of program optimizations, giving freedom to explore several innermost loop optimizations (unrolling, software pipelining, pattern oriented optimizations) without involving the host for the initiation of outer loops.
• **Data locality:** One of the key approaches to energy efficient execution is to keep the data as close as possible to the compute units. This helps to reduce latency as well as to save energy. In this dissertation, only the array inputs and outputs of the application are processed by load-store operations. All the variables and constants are accessed through the internal registers of the processing elements. Since the register files of the PEs are distributed, we have formulated the mapping problem on CGRAs so that registers are used efficiently: we present a unified and precise formulation of the problem of variable placement and register allocation, and an effective and efficient placement augmenting the exact binding approach.

• **Two way synchronization:** While mapping and executing applications consisting of several basic blocks, it is of utmost importance to synchronize the PEs. During compilation, the PEs are synchronized by the register allocation augmented placement algorithm. During execution, the PEs are synchronized to the same basic block in a single cycle by a lightweight synchronization mechanism, reducing the performance and energy penalty.

• **Constant management:** Managing constants in an application is one of the major challenges for energy efficient acceleration. Signal processing applications usually use 16-bit constants. To support a wide range of application domains, in this thesis we considered constants of up to 32 bits. On the one hand, accommodating the constants in the instructions increases their length; on the other hand, memory based access to constants increases the number of memory operations, which in turn increases the energy consumption. In this dissertation, we have introduced the concept of a distributed constant register file, where the constants are loaded as part of the context load. They are accessed by the PEs at execution time as register operands.

• **Two fold interconnect network:** The computation model in this dissertation arranges context load (instructions and constants) and execution sequentially. Since the ratio between the context load time and the execution time is very small, we deployed a bus-based network to distribute the context efficiently to the PEs. Execution uses a separate 2D torus-based network. Since the execution of the application is mapped by the compiler, only the 2D torus network is exposed to the compiler. In this way, we keep the configuration time as low as possible while keeping the energy consumption during execution in check with a low-cost interconnect network.

• **Coupling in a heterogeneous platform:** The experiments integrating the designed CGRA into the PULP platform suggest that architectural heterogeneity is a powerful approach to improving the energy profile of computing systems. We presented three possible executions of the benchmarks on the IPA-integrated PULP platform. The accelerator execution achieves a maximum of $4.5\times$ (with a minimum of $2\times$ and an average of $3\times$) improvement in energy consumption over execution on a single core and on parallel cores, respectively.
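
To make the 2D torus execution network mentioned above more concrete, the following minimal sketch computes the nearest neighbours of a PE when the array edges wrap around. The function name `torus_neighbors`, the four-neighbour connectivity, and the 4x4 array size are assumptions chosen for illustration; they are not taken from the IPA implementation.

```python
def torus_neighbors(row, col, rows, cols):
    """Return the coordinates of the four nearest neighbours of the PE at
    (row, col) in a rows x cols 2D torus. Edges wrap around, so every PE
    has exactly four neighbours regardless of its position."""
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }


# Hypothetical 4x4 array: the corner PE (0, 0) wraps to row 3 and column 3.
for direction, coords in torus_neighbors(0, 0, 4, 4).items():
    print(direction, coords)
```

Because of the wrap-around links, a corner PE reaches the opposite edge in one hop, which is the property that lets a compiler treat all PEs uniformly during placement.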

**Directions for Future Research**

We believe that there are several directions of research that can be pursued based on the framework presented in this dissertation.

First, latency could be improved by upgrading the binding algorithm in the compilation flow. The binding algorithm manages the placement of operations and data, taking into account the location constraints derived from previous operation and data bindings. If the algorithm finds no placement, the underlying graph is transformed, and each transformation caused by location constraints increases the latency. In this situation, a *guided placement* could reduce the number of graph transformations and hence improve the latency.

Second, performance could be improved by exploring different loop optimizations. In this thesis, we explored only the performance gain from loop unrolling applied to the innermost loop. Another research direction is to explore the performance improvement offered by other loop optimizations, such as the pattern-based *polyhedral model* [66], software pipelining, or combinations of these optimizations.
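
As a reminder of what unrolling the innermost loop means, the sketch below shows a simple accumulation loop and a version unrolled by a factor of four. The function names and the unroll factor are assumptions for illustration only. The sketch shows the source-level transformation, not the thesis compilation flow, and it is not expected to run faster in Python.

```python
def accumulate(values):
    """Reference loop: one addition and one loop-control step per element."""
    total = 0
    for v in values:
        total += v
    return total


def accumulate_unrolled4(values):
    """Same computation with the loop body unrolled by a factor of four.
    Each iteration now exposes four element reads that a mapper could
    schedule on different PEs, with one loop-control step per four elements."""
    total = 0
    n = len(values)
    limit = n - n % 4  # largest multiple of 4 not exceeding n
    for i in range(0, limit, 4):
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
    for i in range(limit, n):  # remainder loop for the leftover elements
        total += values[i]
    return total


data = list(range(10))
print(accumulate(data) == accumulate_unrolled4(data))
```

The remainder loop handles input lengths that are not a multiple of the unroll factor, so both functions return the same result for any list of integers.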

Third, application domains other than signal processing could be explored. Emerging domains such as approximate computing, cryptography, and machine learning may be interesting choices to explore for both performance and energy efficiency. Since the framework presented in this dissertation has the potential to execute a wide range of application domains, its hardware approach and compilation flow provide a productive basis for exploring these emerging domains.

Fourth, the potential of the IPA could be investigated with support for floating-point arithmetic. In this dissertation, we rely on integer arithmetic. Supporting floating-point arithmetic could be another research direction, taking the IPA architecture as the reference. Since the compiler has the flexibility to adapt to several architectural configurations, it would be fruitful to update the IPA architecture to support flexible floating-point computation.

Fifth, kernels could be compiled just in time (JIT) at run time. The IPA integrated in the PULP platform uses pre-compiled contexts of the application. Another research direction is to introduce JIT compilation at run time and offload the compiled kernels onto the IPA.

In general, the dissertation presents a heterogeneous approach that integrates reconfigurable accelerators into a state-of-the-art multi-core computing platform and addresses the rising concern of energy efficiency. Based on the framework presented in this thesis, there are several challenging research directions in the domain of ultra-low-power embedded computing that can be explored.

**References**

[71] Koichiro Masuyama, Yu Fujita, Hayate Okuhara, and Hideharu Amano. A 297MOPS/0.4mW ultra low power coarse-grained reconfigurable accelerator CMA-SOTB2. In 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), pages 1–6. IEEE, 2015.


French title: Architecture et modèle de programmation pour accélérateurs reconfigurables dans les systèmes embarqués multi-cœurs.

French keywords (translated): CGRA, hardware accelerator, low power, compilation

French summary (translated): The complexity of embedded systems and applications imposes growing demands on computing power and energy consumption. Coupled with the diminishing returns of technology scaling, academia and industry are continually searching for energy-efficient hardware accelerators. The drawback of a hardware accelerator is that it is not programmable, which makes it dedicated to one particular function. The proliferation of dedicated accelerators in systems on chip leads to low area efficiency and raises scalability and interconnection problems. Programmable accelerators provide the right trade-off between efficiency and flexibility. Coarse-grained reconfigurable architectures (CGRAs) are composed of word-level computing elements and are a promising choice of programmable accelerator.

This thesis proposes to exploit the potential of coarse-grained reconfigurable architectures and to push the hardware to its energy limits within a complete design flow. The contributions of this thesis are a CGRA-type architecture called IPA, for Integrated Programmable Array, its implementation in a system on chip, and the associated compilation flow, which makes it possible to exploit the unique characteristics of the new component, in particular its ability to support control flow. The effectiveness of the approach is demonstrated through the deployment of several compute-intensive applications. Finally, the proposed accelerator is integrated into PULP, a Parallel Ultra-Low-Power Processing-Platform, to explore the benefit of this kind of ultra-low-power heterogeneous platform.

Title : Architecture and Programming Model Support For Reconfigurable Accelerators in Multi-Core Embedded Systems

Keywords : CGRA, hardware accelerator, low-power, compilation

Abstract: Emerging trends in embedded systems and applications need high throughput and low power consumption. Due to the increasing demand for low-power computing and diminishing returns from technology scaling, industry and academia are turning with renewed interest toward energy-efficient hardware accelerators. The main drawback of hardware accelerators is that they are not programmable. Their utilization can therefore be low, as each performs one specific function, and increasing the number of accelerators in a system on chip (SoC) causes scalability issues. Programmable accelerators provide flexibility and solve the scalability issues. The Coarse-Grained Reconfigurable Array (CGRA) architecture, consisting of several processing elements with word-level granularity, is a promising choice for a programmable accelerator.

Inspired by the promising characteristics of programmable accelerators, this thesis studies the potential of CGRAs in near-threshold computing platforms and develops an end-to-end CGRA research framework. The major contributions of this framework are CGRA design, implementation, integration in a computing system, and compilation for CGRA. First, the design and implementation of a CGRA named Integrated Programmable Array (IPA) is presented. Next, the problem of mapping applications with control and data flow onto a CGRA is formulated. From this formulation, several efficient algorithms are developed using the internal resources of a CGRA, with a vision for low-power acceleration. The algorithms are integrated into an automated compilation flow. Finally, the IPA accelerator is integrated into PULP, a Parallel Ultra-Low-Power Processing-Platform, to explore heterogeneous computing.