The 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

MICRO-46 Session 6B - Storage Optimizations

Aegis: Partitioning Data Block for Efficient Recovery of Stuck-At-Faults in Phase Change Memory

Jie Fan (Tsinghua University / Department of Computer Science and Technology)
Song Jiang (Wayne State University / The ECE Department)
Jiwu Shu (Tsinghua University / Department of Computer Science and Technology)
Youhui Zhang (Tsinghua University / Department of Computer Science and Technology)
Weimin Zheng (Tsinghua University / Department of Computer Science and Technology)

Full Paper: DOI 10.1145/2540708.2540745

Abstract:
While Phase Change Memory (PCM) holds great promise as a complement to, or even a replacement for, DRAM-based memory and flash-based storage, it must overcome its limited write endurance to remain reliable over an extended period of intensive use. Limited write endurance leads to permanent stuck-at faults after a certain number of writes, leaving some memory cells permanently stuck at either '0' or '1'. State-of-the-art solutions partition a data block into bit groups and then apply bit inversion to selected groups. The effectiveness of this approach hinges on how the block is partitioned. While all existing solutions can separate faults into different groups for error correction, they fall short on three fundamental capabilities desired of any partition scheme. First, a scheme should maximize the probability that re-partitioning a block succeeds in placing two faults currently in the same group into two different groups. Second, it should partition a block into a small number of groups for space efficiency. Third, it should spread faults across the groups as uniformly as possible, so that more faults can be accommodated within the same number of groups. A recovery solution with these capabilities can provide strong fault tolerance with minimal overhead.
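The group-wise bit inversion described above can be illustrated with a minimal sketch. It assumes one inversion flag per group and a partition supplied by the caller; the function names `write_with_inversion` and `read_with_inversion` are hypothetical and are not taken from the paper.

```python
def write_with_inversion(data, stuck, groups):
    """Store `data` in cells where some bits are stuck.

    data   : list of 0/1 bits to store
    stuck  : dict {bit index: stuck value} describing faulty cells
    groups : list of lists of bit indices forming a partition of the block
    Returns (stored_bits, flags), or None if some group cannot be repaired
    by inversion alone (a re-partition would then be needed).
    """
    stored = list(data)
    flags = []
    for members in groups:
        faults = [i for i in members if i in stuck]
        # A fault is "wrong" when the cell is stuck at the opposite of the data bit.
        wrong = [i for i in faults if stuck[i] != data[i]]
        if wrong and len(wrong) != len(faults):
            return None  # inverting would fix some faults and break others
        invert = bool(wrong)
        flags.append(invert)
        for i in members:
            bit = data[i] ^ 1 if invert else data[i]
            # A stuck cell keeps its stuck value regardless of what is written.
            stored[i] = stuck.get(i, bit)
    return stored, flags


def read_with_inversion(stored, flags, groups):
    """Undo the per-group inversion to recover the original data."""
    data = list(stored)
    for invert, members in zip(flags, groups):
        if invert:
            for i in members:
                data[i] = stored[i] ^ 1
    return data
```

A group holding a single fault can always be repaired this way, which is why separating faults into different groups matters: two faults in one group are recoverable only when both happen to agree (or both disagree) with the data being written.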

We propose Aegis, a recovery solution with a systematic partition scheme that uses fewer groups to accommodate more faults than state-of-the-art schemes. The uniqueness of Aegis's partition scheme lies in its guarantee that any two bits that share a group will no longer share a group after a re-partition. Empowered by this partition scheme, Aegis can recover significantly more faults with reduced space overhead relative to state-of-the-art solutions.
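The abstract does not spell out how the partition is built, so the following is only a sketch of one construction that has the stated property, not necessarily the authors' exact scheme. It assumes the bits are laid out row by row on a grid with a prime number of columns `B` and at most `B` rows, and that each partition groups bits along lines of a chosen slope. The names `group_of`, `partition`, and `share_group_at_most_once` are hypothetical.

```python
from itertools import combinations


def group_of(bit, slope, B):
    """Group index of `bit` under the partition with the given slope.

    The bit sits at row a, column b of a grid with B columns (assumed layout).
    Bits on the same line b = a * slope + y (mod B) share group y.
    """
    a, b = divmod(bit, B)
    return (b - a * slope) % B


def partition(n_bits, slope, B):
    """Split bits 0..n_bits-1 into B groups for one slope."""
    groups = [[] for _ in range(B)]
    for bit in range(n_bits):
        groups[group_of(bit, slope, B)].append(bit)
    return groups


def share_group_at_most_once(n_bits, B):
    """Check that no pair of bits shares a group under more than one slope."""
    for x, y in combinations(range(n_bits), 2):
        shared = [k for k in range(B) if group_of(x, k, B) == group_of(y, k, B)]
        if len(shared) > 1:
            return False
    return True
```

With `B` prime and no more than `B` rows, two bits in different rows lie on a common line for exactly one slope, and two bits in the same row never do, so switching to any other slope separates a colliding pair. For a 512-bit block, `B = 23` would satisfy these assumptions, since 23 rows of 23 columns cover 529 positions. A non-prime `B` can break the property, as `share_group_at_most_once(32, 8)` illustrates.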