# **Divergence-Aware Warp Scheduling**

## Programmability

Simple, But Divergent

\_\_\_global\_\_\_ void spmv\_csr\_scalar\_kernel(const float\* val, const int\* cols, const int\* rowDelimiters, const int dim, float\* out)

int myRow = blockIdx.x \* blockDim.x + threadIdx.x; texReader vecTexReader;

+ (t / warpSiz<u>e);</u> if (myRow < dim) texReader vecTexReader; Divergence float t = 0.0f;int start = rowDelimiters[myRow]; \_\_\_shared\_\_\_volatile int end = rowDelimiters[mvRow float partialSums[BLOCK\_SIZE]; // Divergent Branch

\_\_\_global\_\_\_ void spmv\_csr\_vector\_kernel(const float\* val, const int\* cols, const int\* rowDelimiters, const int dim, float \* out)

Less Divergence, But More Complicated

int t = threadIdx.x; int id = t & (warpSize-1); int warpsPerBlock = blockDim.x / warpSize; int myRow = (blockIdx.x \* warpsPerBlock) Explicit Scratchpad Use



for (int j = start; j < end; j+</pre> // Uncoalesced Loads int col = cols[j]; t += val[j] \* vecTexReader(col);

out[myRow] = t;

Divergent loads a major problem when programming GPUs

- Common in irregular applications
- Programmer encouraged to restructure program, make use of scratchpad
  - Our work asks: What if they didn't have to?

Hardware aware of code locality can take advantage of it without needing the programmer

```
if (myRow < dim) Dependent on Warp Size
int warpStart = rowDelimiters[myRow];
int warpEnd = rowDelimiters[myRow+1];
float mySum = 0;
for (int j = warpStart + id;
     j < warpEnd; j += warpSize)</pre>
    int col = cols[j];
    mySum += val[j] * vecTexReader(col);
partialSums[t] = mySum;
                            Parallel
// Reduce partial sums
                          Reduction
if (id < 16)
    partialSums[t] += partialSums[t+16];
if (id < 8)
    partialSums[t] += partialSums[t+ 8];
if (id < 4)
    partialSums[t] += partialSums[t+ 4];
if (id < 2)
    partialSums[t] += partialSums[t+ 2];
if (id < 1)
    partialSums[t] += partialSums[t+ 1];
// Write result
if (id == 0)
    out[myRow] = partialSums[t];
```

Cache

## Example Operation

Cache is 4 entries, 128B lines and fully associative. By Time<sub>0</sub>, warp 0 has entered loop and loaded 4 lines into cache. By Time<sub>1</sub>, warp 0 has captured spatial locality, DAWS measures footprint. Warp 1 is prevented from scheduling as DAWS predicts it will oversubscribe cache. By Time<sub>2</sub>, warp 0 has accessed 4 lines for 32 iterations and loaded 1 new line. 3 lanes have exited loop, decreasing footprint. Warp 1 and warp 0 are allowed to capture spatial locality together.



### **Programmability Case Study Results**

Diverged Code vs. Locality Managed Code



of Locality Managed Code with no **Programmer Input** 



Greedy-then-oldest (GTO)

• Cache-Conscious Wavefront Scheduling (CCWS)

• Profile based Static Wavefront Limiting (Best-SWL)

Divergence-Aware Warp Scheduling (DAWS)

**Branch Divergence Awareness:** 

#### Performance vs. Other Schedulers on Cache-Sensitive Applications



26% Speedup over Cache-Conscious Wavefront Scheduling



Timothy G. Rogers<sup>1</sup>, Mike O'Connor<sup>2</sup>, Tor M. Aamodt<sup>1</sup> <sup>1</sup>The University of British Columbia <sup>2</sup>NVIDIA Research



MICRO 2013. Davis, CA