



# Heterogeneous System Coherence for Integrated CPU-GPU Systems

Jason Power\*, Arkaprava Basu\*, Junli Gu<sup>+</sup>, Sooraj Puthoor<sup>+</sup>, Bradford M Beckmann<sup>+</sup>, Mark D Hill<sup>\*+</sup>, Steven K Reinhardt<sup>+</sup>, David A Wood<sup>\*+</sup>

## Methodology

### Simulation

- gem5 for the CPU and memory system
- Ruby for the caches
- GPU modeled on GCN
- Workloads
  - Subset of Rodinia
  - AMD APP SDK

| Parameter             | Value           |
|-----------------------|-----------------|
| CPU Clock             | 2 GHz           |
| CPU Cores             | 2               |
| CPU Shared L2 Cache   | 2 MB            |
| GPU Clock             | 1 GHz           |
| Compute Units         | 32              |
| GPU L1 Data Cache     | 32 KB           |
| GPU Shared L2 Cache   | 4 MB            |
| L3 Memory-side Cache  | 16 MB           |
| Peak Memory Bandwidth | 700 GB/s        |
| Baseline Directory    | 262,144 entries |
| Region Directory      | 32,768 entries  |
| MSHRs                 | 32 entries      |
| Region Buffer         | 16,384 entries  |
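
The directory sizes in the table can be compared with a little arithmetic. The following is a minimal sketch, not part of the original slides: the entry counts come from the table, while the 64-byte block size and 16 blocks per region are assumptions that this section does not state, and all names are illustrative.

```python
# Entry counts taken from the configuration table above.
BASELINE_DIRECTORY_ENTRIES = 262_144
REGION_DIRECTORY_ENTRIES = 32_768

# Assumptions (not stated in this section): 64-byte cache blocks and
# 16 blocks per region, i.e. 1 KB regions.
BLOCK_BYTES = 64
BLOCKS_PER_REGION = 16

MIB = 1024 * 1024


def tracked_mib(entries: int, bytes_per_entry: int) -> float:
    """Memory covered if each entry tracks `bytes_per_entry` bytes of data."""
    return entries * bytes_per_entry / MIB


entry_ratio = BASELINE_DIRECTORY_ENTRIES // REGION_DIRECTORY_ENTRIES
baseline_cover = tracked_mib(BASELINE_DIRECTORY_ENTRIES, BLOCK_BYTES)
region_cover = tracked_mib(
    REGION_DIRECTORY_ENTRIES, BLOCK_BYTES * BLOCKS_PER_REGION
)

print(f"region directory has {entry_ratio}x fewer entries")
print(f"baseline directory covers {baseline_cover:.0f} MiB")
print(f"region directory covers {region_cover:.0f} MiB")
```

Under these assumed sizes, the region directory would track more data than the baseline directory while using one eighth of the entries.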


### Results Summary

- Largest speedups for the workloads hurt most by constrained resources
- Massive bandwidth reduction
  - Due to offloading data onto the direct-access bus
  - More than the theoretical maximum of 94% in some cases
    - Region buffers can "prefetch" cache permissions
- HSC significantly improves performance over the baseline design
  - Decreases the bandwidth requirement of the directory
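
One reading of the 94% figure is a simple bound: if a single region-level request grants permission for every block in a region, at most all but one of the per-block directory requests can be eliminated. The sketch below illustrates that arithmetic under an assumed 16 blocks per region; the region size is not given in this section, and the function name is hypothetical.

```python
def max_directory_reduction(blocks_per_region: int) -> float:
    """Upper bound on the fraction of directory requests removed when one
    region request replaces `blocks_per_region` per-block requests."""
    if blocks_per_region < 1:
        raise ValueError("blocks_per_region must be at least 1")
    return 1.0 - 1.0 / blocks_per_region


# Assumption: 16 blocks per region (not stated in this section).
bound = max_directory_reduction(16)
print(f"{bound:.2%}")  # prints 93.75%, which rounds to 94%
```

This bound counts only demand requests, so it would not capture permissions that a region buffer acquires ahead of use, which is consistent with some workloads exceeding it.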




