The 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

MICRO-46 Session 2B - Resilience

Virtually Aged Sampling DMR: Unifying Circuit Failure Detection and Circuit Failure Prediction

Raghuraman Balasubramanian (University Of Wisconsin-Madison)
Karthikeyan Sankaralingam (University Of Wisconsin-Madison)

Lightning session talk: PDF, Presentation: PDF, Poster: PDF, Full Paper: DOI 10.1145/2540708.2540720

Abstract:
Hardware failure due to wearout is a growing concern. Circuit failure prediction is an approach that is effective if it meets the following requirements: low design complexity, low overheads, generality (supporting various types of wearout including soft and hard breakdown) and high accuracy. State-of-the-art techniques, which typically detect and measure low level circuit properties like gate delay cannot deliver on all four requirements. Moving away from the paradigm of measuring circuit delays is key to satisfying the four design requirements. Our insight is to virtually age the processor and thus manifest a wearout fault early � we convert the delay degradation into a logic fault; expose the fault and then detect the fault. To virtually age the processor, reducing supply voltage effectively mirrors wearout. For fault exposure, we observe that faults in critical paths are naturally exposed and we develop a technique to expose faults along the non-critical paths using clock phase shifting logic. Our system, Aged-SDMR, combines these two mechanisms to expose wearout faults early and detects them using Sampling DMR. We also develop principles to combine these two mechanisms with any detection technique. We implement a prototype system based on the OpenRISC processor on a Xilinx Zync FPGA. We demonstrate that Aged-SDMR is practical and delivers on all four requirements, has area and energy overheads of 9% and 0.7% respectively, takes at most 0.4 days to detect failure after onset and its early warning window is configurable. More generally, Aged-SDMR provides the capability for low-overhead DMR execution without any missed errors and 100% coverage. It is likely to find broad uses within reliability and elsewhere.