#### Linearizing Irregular Memory Accesses for Improved Correlated Prefetching

Akanksha Jain, Calvin Lin

University of Texas at Austin





















**Spatio-Temporal Prefetching** 





# **Our Contribution**

Significantly advance the state-of-the-art in temporal prefetching

|          | Previous<br>Best | Our<br>Solution |
|----------|------------------|-----------------|
| Speedup  | 8.3%             | 23.1%           |
| Accuracy | 58.6%            | 93.7%           |

How do we achieve this improvement?
 Address Correlation [Grunwald 97]
 PC-localization [Nesbit 04, Somogyi 06]

#### **Address Correlation**



A and X are temporally correlated if the occurrence of A is very likely to be followed by X

□ Temporal streams: Sequences of correlated accesses

Highly variable lengths from 2 to several hundreds [Chilimbi 02, Wenisch 08]

#### **Address Correlation is Expensive**



#### Large storage requirements

- Weaker forms of Correlation: Learn correlated deltas between consecutive memory addresses
  - Correlate distance(A,X) with distance(X,B)
  - Sacrifice performance for lower storage requirements

# **PC-Localization**



#### **Design Space for Temporal Prefetching**



# **Our Contribution**

- Significantly advance the state-of-the-art
- How do we achieve this improvement?
  - First to combine address correlation and PC-localization
  - Two more benefits!



- Why is it hard to combine address correlation and PClocalization?
  - Prefetcher Implementation
    - Global History Buffer (GHB)

#### Address Correlation with Global History Buffer



History Buffer

#### Address Correlation with Global History Buffer

# **Global Access Stream:** AXBCDYE...AXBC... Input : A Predict X **Index Table** Α B . . . .

**History Buffer** Α Х В C D Y Ε . . . Α Х В С Index Table History Buffer

#### Address Correlation with Global History Buffer

#### Large history buffer

- 10-100 MB [Wenisch 09]
- Stored off-chip



#### PC-localization with Global History Buffer



#### Limitation of GHB-based Solutions



#### Limitation of GHB-based Solutions



#### Outline

Motivation

GHB-based solutions are limited

□ Our Solution

Replace the GHB with a better organization

- Evaluation
- Future Work
- □ Conclusions



#### **Our Solution**



#### **PC-Localization**



**Physical Address Space** 



#### **PC-Localization**







25



26







#### Prediction in the Structural Address Space

#### Trigger Address: X

| Physical to Str     | uctural Mappi         | ng |
|---------------------|-----------------------|----|
| Physical<br>Address | Structural<br>Address |    |
| A                   | 1                     |    |
| В                   |                       |    |
| С                   | 4                     |    |
| D                   |                       |    |
| E                   |                       |    |
| F                   | 3                     |    |
|                     |                       |    |
| X                   | 2                     |    |
| Y                   |                       |    |
| Z                   | 5                     |    |

#### Structural to Physical Mapping

| Structural<br>Address | Physcial<br>Address |
|-----------------------|---------------------|
| 1                     | А                   |
| 2                     | Х                   |
| 3                     | F                   |
| 4                     | С                   |
| 5                     | Z                   |
| 6                     | E                   |



#### Physical to Structural Mapping

| Physical<br>Address | Structural<br>Address |
|---------------------|-----------------------|
| А                   | 1                     |
| В                   |                       |
| С                   | 4                     |
| D                   |                       |
| E                   |                       |
| F                   | 3                     |
|                     |                       |
| Х                   | 2                     |
| Y                   |                       |
| Z                   | 5                     |

#### Structural to Physical Mapping

| Structural<br>Address | Physcial<br>Address |
|-----------------------|---------------------|
| 1                     | А                   |
| 2                     | Х                   |
| 3                     | F                   |
| 4                     | С                   |
| 5                     | Z                   |
| 6                     | F                   |

#### Structural to Physical Address Mapping Cache

| Physical to<br>Mapping                                             | Structural                                        |                    |
|--------------------------------------------------------------------|---------------------------------------------------|--------------------|
| Physical<br>Address                                                | Structural<br>Address                             | Physical to        |
| A                                                                  | 1                                                 | Structural Address |
| В                                                                  |                                                   | 011001010111001000 |
| С                                                                  | 4                                                 | Mapping Cache      |
| D                                                                  |                                                   |                    |
| E                                                                  |                                                   |                    |
| F                                                                  | 3                                                 |                    |
| ***                                                                |                                                   |                    |
| x                                                                  | 2                                                 |                    |
| Y                                                                  |                                                   |                    |
|                                                                    |                                                   |                    |
| Z                                                                  | 5                                                 |                    |
| Z                                                                  | 5<br>to Physical<br>Physcial<br>Address<br>A      |                    |
| Z<br>Structural<br>Mapping<br>Structural<br>Address                | to Physical<br>Physcial<br>Address                |                    |
| Z<br>Structural<br>Mapping<br>Structural<br>Address<br>1           | to Physical<br>Physcial<br>Address<br>A           | Structural to      |
| Z<br>Structural<br>Mapping<br>Structural<br>Address<br>1<br>2      | to Physical<br>Physcial<br>Address<br>A<br>X      | Structural to      |
| Z<br>Structural<br>Mapping<br>Structural<br>Address<br>1<br>2<br>3 | to Physical<br>Physcial<br>Address<br>A<br>X<br>F |                    |





# **Revisiting Our Contributions**

- Significantly advance the state-of-the-art in temporal prefetching
- How do we achieve this improvement?
  First to combine address correlation and PC-localization
  Two more benefits!





#### Meta-data Management

#### But the table is too big to be stored on-chip!



Physical to Structural Address Mapping Cache

Structural to Physical Mapping

| Structural<br>Address | Physcial<br>Address |
|-----------------------|---------------------|
| 1                     | А                   |
| 2                     | х                   |
| 3                     | F                   |
| 4                     | С                   |
| 5                     | Z                   |
| 6                     | E                   |

Structural to Physical Address Mapping Cache

#### **New Caching Scheme**



# New Caching Scheme

Mapping cached on-chip only for TLB resident data
 Small on-chip budget

Movement of off-chip meta-data synchronized with TLB misses

Hides latency of accessing off-chip meta-data hidden

Reduces memory traffic

□ This caching scheme is not possible with the GHB

# **Revisiting Our Contributions**

- Significantly advance the state-of-the-art in temporal prefetching
- How do we achieve this improvement?
  First to combine address correlation and PC-localization
  Enables a novel TLB-synchronized caching scheme



#### Training on the L2 Access Stream



#### Train on the L2 Access Stream



# **Revisiting Our Contributions**

- Significantly advance the state-of-the-art in temporal prefetching
- How do we achieve this improvement?
  First to combine address correlation and PC-localization
  Enables a novel TLB-synchronized caching scheme
  Allows the ISB to train on the L2 access stream



# Outline

- Motivation
  - GHB-based solutions are limited
- Our Solution
  - Replace the GHB with a better organization
- Evaluation
- Future Work
- □ Conclusions

# Methodology

- MARSSx86 simulator
  - Cycle-accurate out-of-order code
  - L1 Data Cache 64 KB
  - L2 Data Cache 2 MB
  - DTLB 128 entries (TLB miss latency modeled realistically)
  - Main memory 140 cycles (DRAM queue contention accurately modeled)
- Benchmarks Irregular subset of Spec2006
  - 20-25 SimPoints per benchmarks
  - Each SimPoint simulates 250M instruction





Speedup due to irregular prefetchers over a baseline with no prefetching 48





# **Recall Our Contributions**

- Significantly advance the state-of-the-art in temporal prefetching
- How do we achieve this improvement?
  First to combine address correlation and PC-localization
  Enables a novel TLB-synchronized caching scheme
  Train on the L2 access stream



# Why does the ISB win?



# Why does the ISB win?



## Why does the ISB win?



#### Accuracy comparison



## Hybrid Prefetchers



### Hardware Cost

- Address Mapping Caches for TLB-resident data
  - 32 KB on-chip storage for TLB with 128 entries
  - 8 KB sufficient in a hybrid setting
- Memory traffic Overhead
  - Average meta-data traffic: 8.4%
  - Average traffic due to useless prefetch requests: 6.3%

|                      | mcf  | soplex | omnetpp | astar | libq | xalan | gcc   | sphinx |
|----------------------|------|--------|---------|-------|------|-------|-------|--------|
| Meta-data<br>Traffic | 5.7% | 3.8%   | 5.7%    | 12%   | 1.6% | 12.6% | 11.3% | 3.9%   |

## Future Work

- Combining spatial and temporal prefetching
  STeMS Spatial-temporal Prefetching [Somogyi 2009]
  Temporal component can be enhanced using the ISB
- Evaluation on commercial workloads
  - Bigger memory footprints don't affect the ISB's on-chip storage
  - Optimizations to improve TLB reach, such as superpages, affect ISB's on-chip storage

### Conclusions

Replace the GHB with a new organization

Significantly improves the state-of-the-art

- First to combine address correlation and PC-localization
- Enables a novel TLB-synchronized caching scheme
- Train on the L2 access stream

