The 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

MICRO-46 Session 2B - Resilience

uDIREC: Unified Diagnosis and Reconfiguration for Frugal Bypass of NoC faults

Ritesh Parikh (University of Michigan Ann Arbor)
Valeria Bertacco (University of Michigan Ann Arbor)

Lightning session talk: PDF, Presentation: PDF, Poster: PDF, Full Paper: DOI 10.1145/2540708.2540722

Abstract:
As silicon continues to scale, transistor reliability is becoming a major concern. At the same time, increasing transistor counts are causing a rapid shift towards large chip multi-processors (CMP) and system-on-chip (SoC) designs, comprising several cores and IPs communicating via a network-on-chip (NoC). As the sole medium of on-chip communication, a NoC should gracefully tolerate many permanent faults.

We propose uDIREC, a unified framework for permanent fault diagnosis and subsequent reconfiguration in NoCs that provides graceful performance degradation with increasing number of faults. Upon in-field transistor failures, uDIREC leverages a fine-resolution diagnosis mechanism to disable faulty components very sparingly. At its core, uDIREC employs a novel routing algorithm to find reliable and deadlock-free routes that utilize the still-functional links in the NoC. uDIREC places no restriction on topology, router architecture and number and location of faults. Experimental results show that uDIREC, implemented in a 64-node NoC, drops 3× fewer nodes and provides 25% higher throughput (beyond 15 faults) when compared to other state-of-the-art fault-tolerance solutions. uDIREC's improvement over prior-art grows with more faults, making it a suitable NoC reliability solution for a wide range of fault rates.