Skip to Main content Skip to Navigation
Theses

Fault Tolerance and Reliability for Partially Connected 3D Networks-on-Chip

Abstract : Networks-on-Chip (NoC) have emerged as a viable solution for the communication challenges in highly complex Systems-on-Chip (SoC). The NoC architecture paradigm, based on a modular packet-switched mechanism, can address many of the on-chip communication challenges such as wiring complexity, communication latency, and bandwidth. Furthermore, the combined benefits of 3D IC and Networks-on-Chip (NoC) schemes provide the possibility of designing a high-performance system in a limited chip area. The major advantages of Three-Dimensional Networks-on-Chip (3D-NoCs) are a considerable reduction in the average wire length and wire delay, resulting in lower power consumption and higher performance. However, 3D-NoCs suffer from some reliability issues such as the process variability of 3D-IC manufacturing. In particular, the low yield of vertical connection significantly impacts the design of three-dimensional die stacks with a large number of Through Silicon Via (TSV). Equally concerning, advances in integrated circuit manufacturing technologies are resulting in a potential increase in their sensitivity to the effects of radiation present in the environment in which they will operate. In fact, the increasing number of transient faults has become, in recent years, a major concern in the design of critical SoC. As a result, the evaluation of the sensitivity of circuits and applications to events caused by energetic particles present in the real environment is a major concern that needs to be addressed. So, this thesis presents contributions in two important areas of reliability research: in the design and implementation of deadlock-free fault-tolerant routing schemes for the emerging three-dimensional Networks-on-Chips; and in the design of fault injection frameworks able to emulate single and multiple transient faults in the HDL-based circuits. The first part of this thesis addresses the issues of transient and permanent faults in the architecture of 3D-NoCs and introduces a new resilient routing computation unit as well as a new runtime fault-tolerant routing scheme. A novel resilient mechanism is introduced in order to tolerate transient faults occurring in the route computation unit (RCU), which is the most important logical element in NoC routers. Failures in the RCU can provoke misrouting, which may lead to severe effects such as deadlocks or packet loss, corrupting the operation of the entire chip. By combining a reliable fault detection circuit leveraging circuit-level double-sampling, with a cost-effective rerouting mechanism, we develop a full fault-tolerance solution that can efficiently detect and correct such fatal errors before the affected packets leave the router. Yet in the first part of this thesis, a novel fault-tolerant routing scheme for vertically-partially-connected 3D Networks-on-Chip called FL-RuNS is presented. Thanks to an asymmetric distribution of virtual channels, FL-RuNS can guarantee 100% packet delivery under an unconstrained set of runtime and permanent vertical link failures. With the aim to emulate the radiation effects on new SoCs designs, the second part of this thesis addresses the fault injection methodologies by introducing two frameworks named NETFI-2 (Netlist Fault Injection) and NoCFI (Networks-on-Chip Fault Injection). NETFI-2 is a fault injection methodology able to emulate transient faults such as Single Event Upsets (SEU) and Single Event Transient (SET) in a HDL-based (Hardware Description Language) design. Extensive experiments performed on two appealing case studies are presented to demonstrate NETFI-2 features and advantage. Finally, in the last part of this work, we present NoCFI as a novel methodology to inject multiple faults such as MBUs and SEMT in a Networks-on-Chip architecture. NoCFI combines ASIC-design-flow, in order to extract layout information, and FPGA-design-flow to emulate multiple transient faults.
Complete list of metadatas

Cited literature [150 references]  Display  Hide  Download

https://tel.archives-ouvertes.fr/tel-02523770
Contributor : Abes Star :  Contact
Submitted on : Sunday, March 29, 2020 - 5:25:10 PM
Last modification on : Wednesday, October 14, 2020 - 4:20:00 AM

File

DA_PENHA_COELHO_2019_archivage...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02523770, version 1

Collections

STAR | TIMA | UGA | CNRS

Citation

Alexandre Augusto da Penha Coelho. Fault Tolerance and Reliability for Partially Connected 3D Networks-on-Chip. Micro and nanotechnologies/Microelectronics. Université Grenoble Alpes, 2019. English. ⟨NNT : 2019GREAT054⟩. ⟨tel-02523770⟩

Share

Metrics

Record views

175

Files downloads

87