

# Allocating Rotating Registers by Scheduling

Hongbo Rong Cheng Wang Hyunchul Park Youfeng Wu

Intel Labs

MICRO-46, Dec. 2013



Memory disambiguation in dynamic compilers



- Dynamic languages are increasing popular
  - JavaScript 88.9% in client-side websites (W3Techs)
  - PhP 81.5% in sever side (W3Techs)
- Huge amount of legacy code
  - Dynamic binary translator
  - Optimization scope tends to be small: For a loop, usually a loop iteration
    - Software pipelining: enlarge the scope
  - Memory aliases are a bottleneck to performance

Memory disambiguation in dynamic compilers (Cont.)



- A fundamental component in compilers
  - Determines if two memory operations are aliased
- Approach 1: Alias analysis at compile time
  - Too expensive or conservative for dynamic compilers
  - -Only simple alias analysis can be done
- Approach 2: Alias detection at runtime with hardware support



- Enable data speculation
  - Compiler optimistically assumes the memory operations do not alias with each other, and schedules them out of order
  - Compiler guards every memory operation with an alias register to catch any alias when the schedule runs



- Enable data speculation
  - Compiler optimistically assumes the memory operations do not alias with each other, and schedules them out of order
  - Compiler guards every memory operation with an alias register to catch any alias when the schedule runs
- Example:



- Enable data speculation
  - Compiler optimistically assumes the memory operations do not alias with each other, and schedules them out of order
  - Compiler guards every memory operation with an alias register to catch any alias when the schedule runs
- Example:

Original order

LD [w] ST [y] ST [z]



- Enable data speculation
  - Compiler optimistically assumes the memory operations do not alias with each other, and schedules them out of order
  - Compiler guards every memory operation with an alias register to catch any alias when the schedule runs
- Example:



#### Static Alias Registers



#### Static Alias Registers













|   |   | Mem  |     |     |       |
|---|---|------|-----|-----|-------|
| _ |   | addr |     | Set | Check |
| S | T | x    | ••• | SR1 |       |

















#### Static alias registers are not enough

- 24 hot loops from SPEC2000, with static alias registers only, loop throughput is far below optimal
- Need more scalable hardware













|    | Mem<br>addr |     | Set | Check |
|----|-------------|-----|-----|-------|
| ST | x           | ••• | RR1 |       |

















• Encode 1 register, check many: scalable

















Avoid by a good allocation <--- check</li>



set

• Resort to static alias registers







# Rotating alias registers (Cont.)

- Comparison:
  - ALAT in Itanium does not detect aliases
     between stores
  - DeAliaser [Ahn, Duan, Torrellas 2013] can only check all speculative stores
- Effective for sequential code [Wang et al. 2012]
- A new problem: How to apply them to loops?

Software Pipelining



• A loop with 3 operations in order



Software Pipelining



• A loop with 3 operations in order



Sequential execution

Software Pipelining



• A loop with 3 operations in order

Sequential execution  $0 \quad 3 \quad 6 \quad 9$  $a \quad b \quad c$ 

Software Pipelining



• A loop with 3 operations in order

a b c

Sequential execution0369abcaabca

Software Pipelining



• A loop with 3 operations in order



Sequential execution

| 0 | 3   | 6     | 9 | →Time |
|---|-----|-------|---|-------|
| а | b c | a b c |   | a b c |

Software Pipelining



• A loop with 3 operations in order



Sequential execution

| 0 | 3   | 6 | 9   | <br>→Time |
|---|-----|---|-----|-----------|
| а | b c | а | b c | b c       |

- To speedup
  - -Schedule operations out of order

Overlap the execution of iterations

# Software Pipelining (Cont.) • Software-pipelined execution 0 3 6 9 Time a b c



| 0 | 3 | 6   | 9 | —→Time              |
|---|---|-----|---|---------------------|
| С |   | b a |   | 1 <sup>st</sup> ltr |



| 0 | 3 |   | 6 |   |   | 9 | —→Time              |
|---|---|---|---|---|---|---|---------------------|
| С |   | b | а |   |   |   | 1 <sup>st</sup> Itr |
|   | С |   |   | b | а |   | 2 <sup>nd</sup> Itr |



| 0 | 3 | 6   | 9   | —→Time              |
|---|---|-----|-----|---------------------|
| С |   | b a |     | 1 <sup>st</sup> Itr |
|   | С | b   | а   | 2 <sup>nd</sup> Itr |
|   |   | С   | b a | 3 <sup>rd</sup> Itr |



| 0             | 3   |   | 6 |   |   | 9 |   | →Time               |
|---------------|-----|---|---|---|---|---|---|---------------------|
| С             |     | b | а |   |   |   |   | 1 <sup>st</sup> Itr |
| < <u>  =3</u> | → C |   |   | b | а |   |   | 2 <sup>nd</sup> Itr |
|               |     |   | С |   |   | b | а | 3 <sup>rd</sup> Itr |



| 0             | 3                           | 6 | 9 |   |                     |
|---------------|-----------------------------|---|---|---|---------------------|
| _             | _                           |   |   |   | →Time               |
| С             | b                           | а |   |   | 1 <sup>st</sup> ltr |
| <b>↓  =</b> 3 | $3 \rightarrow C$           | b | a |   | 2 <sup>nd</sup> Itr |
|               |                             | С | b | а | 3 <sup>rd</sup> Itr |
|               | o schedulin<br>ulo property | g |   |   |                     |



| 0            | 3                                             | 6   | 9 |   | <b></b> .           |
|--------------|-----------------------------------------------|-----|---|---|---------------------|
|              |                                               |     |   |   | <b>→</b> Time       |
| С            | b                                             | a   |   |   | 1 <sup>st</sup> ltr |
| <b>→   =</b> | :3<br>→ C                                     | b a |   |   | 2 <sup>nd</sup> Itr |
|              |                                               | С   | b | а | 3 <sup>rd</sup> Itr |
| •Mo          | lo schedulir<br>dulo property<br>pendence con | /   |   |   |                     |



| 0                                                                                                                            | 3         | 6   | 9 |   | →Time               |
|------------------------------------------------------------------------------------------------------------------------------|-----------|-----|---|---|---------------------|
| C                                                                                                                            | b         | a   |   |   | 1 <sup>st</sup> ltr |
| <b>−   =</b>                                                                                                                 | =3<br>→ C | b a |   |   | 2 <sup>nd</sup> Itr |
|                                                                                                                              |           | С   | b | а | 3 <sup>rd</sup> Itr |
| <ul> <li>Modulo scheduling</li> <li>Modulo property</li> <li>Dependence constraints</li> <li>Resource constraints</li> </ul> |           |     |   |   |                     |



#### Objective

- Given a software-pipelined schedule of a loop, how to allocate rotating alias registers for the memory operations?
  - Detect ALL aliases
  - -NO false positive
  - Minimal usage of rotating & static alias registers



#### Rotating Register Allocation = Scheduling !

- It is a software pipelining problem
  - A register allocation is a modulo schedule of lifetimes
  - Any existing modulo scheduling algorithm can solve it
- Contributions
  - Framework
  - A simple algorithm LCP
  - LCP usually achieves the best allocations in the least time
  - Generalization: allocation of general-purpose rotating registers is also a software pipelining problem
    - It derives bin-packing of Rau et al. 1992



| 0 | 3 | 6   | 9   |                     |
|---|---|-----|-----|---------------------|
|   |   |     |     | →Time               |
| С | k | ) a |     | 1 <sup>st</sup> ltr |
|   | С | b a |     | 2 <sup>nd</sup> Itr |
|   |   | С   | b a | 3 <sup>rd</sup> Itr |

























## Problem formulation



View Lifetimes as "Operations" Registers as "Time" Time as "Resources"

Then the allocation is a modulo schedule:

• Modulo property: A constant initiation interval R

$$r(a, i+1) = r(a, i) + R, \quad \forall i$$

- Dependence constraints
  - The ordering requirement between lifetimes

$$r(a,i) \le r(b,i+d) \quad \forall i$$

- Resource constraints
  - Two lifetimes in the same register cannot overlap in time

# Unifying Dependence and Resource Constraints

$$DIST(a,b) = DIST_{dep}(a,b) \bigcap DIST_{res}(a,b)$$

$$DIST_{dep}(a,b) = \bigcap_{\forall dependence (a \rightarrow b, \delta, d)} [\delta - d * R, +\infty)$$

$$\bigcap_{\forall dependence (b \rightarrow a, \delta, d)} (-\infty, -\delta + d * R]$$

$$DIST_{res}(a,b) = \begin{cases} DIST_{res\_1}(a,b) & \text{if } a \text{ or } b \text{ is a pure checker} \\ DIST_{res\_2}(a,b) \cup DIST_{res\_3}(a,b) & \text{otherwise} \end{cases}$$

$$DIST_{res\_1}(a,b) = (-\infty, +\infty)$$

$$DIST_{res\_2}(a,b) = (-\infty, +\infty) \setminus R * (-\infty, +\infty)$$

$$DIST_{res\_3}(a,b) = R * (-\infty, +\infty) \bigcap_{\substack{\{[-\lfloor \frac{start(a) - end(b)}{II} \rfloor * R, +\infty) \bigcup \\ (-\infty, -\lceil \frac{end(a) - start(b)}{II} \rceil * R]\}}$$

# LCP (Local Compaction followed by Pacing) (intel)



# LCP (Local Compaction followed by Pacing) (intel)



# LCP (Local Compaction followed by Pacing) (intel)

















- Implemented in Transmeta Code Morphing Software (CMS)
- simulated with a variable-size rotating alias register file



#### % loops allocated successfully



• 11,825 software-pipelined loops from SPEC2000 dynamic traces



#false positives and allocation time

- LCP has 0.16 false positives per loop iteration
- LCP takes 2.46% translation time, roughly about 0.07% total time
- #false positives and allocation time relative to LCP

|       | #false positive | Allocation time |
|-------|-----------------|-----------------|
| RS2   | 19X             | 4X              |
| DESP  | 12X             | 1.6X            |
| JITSP | 14X             | 1.7X            |



#### Summary

- Rotating alias registers detect memory aliases at runtime
  - Relatively new hardware
- Converting the allocation problem to scheduling problem
  - Software pipelining scheduling has been studied for 3 decades
  - Benefit from reusing well-known/stable algorithms
  - Not limited to software pipelined loops: non-pipelined loops can be treated as pipelined ones with only one stage
- Proposed LCP algorithm
  - Simple and fast
- Extended to Itanium general rotating registers
  - RegII  $\equiv 1$
  - Resource constraints



#### BACKUP

#### Rotation





Demonstration: Rotation size 2



Demonstration: Rotation size 2



Demonstration: Rotation size 2







Rotate RegII registers every II time steps