#### **CAMEO**: A CACHE-LIKE MEMORY ORGANIZATION FOR 3D MEMORY SYSTEMS

Chiachen Chou, Georgia Tech, <u>cc.chou@ece.gatech.edu</u>

Aamer Jaleel, Intel, <u>aamer.jaleel@intel.com</u>

Moinuddin K. Qureshi, Georgia Tech, <u>moin@ece.gatech.edu</u>





#### 3D-memory can overcome the bandwidth wall



#### Hybrid Memory Cube



| Metric    | Stacked DRAM (vs. off-chip DRAM) |
|-----------|----------------------------------|
| Bandwidth | 2-8X                             |
| Latency   | 0.5-1X                           |
| Capacity  | 0.25X                            |



#### **Hybrid Memory System**



#### How to use Stacked DRAM: Cache or Memory?
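One way to see what is at stake in the cache-or-memory question is to compare the capacity the OS can see in each case. The sketch below assumes the 0.25X capacity figure is relative to off-chip DRAM; the 16 GB system size and the function name are illustrative assumptions, not values from the talk.

```python
def visible_capacity_gb(offchip_gb, stacked_ratio=0.25):
    """Return OS-visible capacity (GB) when stacked DRAM is a cache vs. part of memory."""
    stacked_gb = offchip_gb * stacked_ratio
    return {
        "stacked_gb": stacked_gb,
        # A cache only holds copies of off-chip data, so it adds no capacity.
        "as_cache": offchip_gb,
        # As part of memory, the stacked DRAM extends the address space.
        "as_memory": offchip_gb + stacked_gb,
    }


caps = visible_capacity_gb(16)  # hypothetical 16 GB of off-chip DRAM
print(caps)
```

Under these assumptions, treating the stacked DRAM as a cache forgoes the extra quarter of capacity, which is the trade-off the slide title asks about.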

#### **CAMEO FOR HYBRID MEMORY**





#### next paper

# Transparent Hardware Management of Stacked DRAM as Part of Memory

#### Jaewoong Sim<sup>1</sup> Alaa R. Alameldeen<sup>2</sup> Zeshan Chishti<sup>2</sup> Chris Wilkerson<sup>2</sup> Hyesoon Kim<sup>1</sup>

<sup>1</sup>Georgia Institute of Technology

<sup>2</sup>Intel Labs







## **Stacked DRAM as PoM**

#### Why Hardware?

Adapt & Remap data at a fine granularity!



## Hardware-Managed PoM

What challenges? Metadata for GBs of memory!

- Size & latency
- Remapping table
- Memory utilization tracking structure
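To make the two metadata structures concrete, here is a toy software model of a remapping table paired with per-segment access counters. It is a sketch under stated assumptions, not the paper's design: the class name, the swap threshold, and the "swap with the coldest fast-resident segment" policy are all hypothetical.

```python
class SegmentRemapTable:
    """Toy remapping table: fast (stacked) slots hold the hottest segments."""

    def __init__(self, num_fast_slots, swap_threshold=4):
        self.swap_threshold = swap_threshold  # hypothetical tuning knob
        # Initially segment i occupies fast slot i.
        self.fast_slot_of = {seg: seg for seg in range(num_fast_slots)}
        # Utilization tracking: access count per segment.
        self.counts = {}

    def access(self, segment):
        """Record an access and report where it was served from."""
        self.counts[segment] = self.counts.get(segment, 0) + 1
        if segment in self.fast_slot_of:
            return "fast"
        # Pick the coldest segment currently resident in fast memory.
        victim = min(self.fast_slot_of, key=lambda s: self.counts.get(s, 0))
        if self.counts[segment] - self.counts.get(victim, 0) >= self.swap_threshold:
            # Remap: the hot segment takes over the victim's fast slot.
            slot = self.fast_slot_of.pop(victim)
            self.fast_slot_of[segment] = slot
        return "slow"


table = SegmentRemapTable(num_fast_slots=2, swap_threshold=3)
served = [table.access(5) for _ in range(4)]
print(served)
```

Even in this toy form, both dictionaries grow with the number of segments, which is the size and latency challenge the slide points at.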

## **A Practical PoM Architecture**





#### next paper





## TODAY @2:15pm, Session 1A

## Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache

Djordje Jevdjic, Cansu Kaynak, Gabriel H. Loh, Babak Falsafi















## Server Trends

[Figure: a many-core server chip (grid of cores) attached to memory]

- Tens to hundreds of cores & accelerators!
- In-memory big data! 100s of GBs
- Many DIMMs per channel: capacity/BW tradeoff

#### **Servers Drive Into Bandwidth Wall**

Both consume bandwidth!


































































































- How to move tags to DRAM?
- And improve performance?!
- What is so special about servers?

Today at 2:15pm, Session 1A, Umney Theatre

#### next paper

### **Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth**

Nagendra Gulur\*

Collaborators: Mahesh Mehendale\*, R Manikantan§ and R Govindarajan+

\*Texas Instruments, Bangalore; §Intel Corporation, Bangalore; +Indian Institute of Science, Bangalore

#### Problems with Stacked DRAM Caches

[Figure: vertical stacking (3D). Picture courtesy Bryan Black.]

1. Large Tag Store
2. High Hit Latency
3. Wasted Off-Chip Bandwidth

#### Large Metadata

SRAM

- Expensive
- Fast access
- Mitigation:
  - Larger blocks?
  - High wasted bandwidth

DRAM

- Slow access
- Scales for very large cache sizes
- Mitigation:
  - Tags and data in same rows?
  - Give up associativity?

### Our Proposal

- Metadata: in DRAM
- Block size: two sizes (64B and 512B)
- Tag access: as good as SRAM
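The tag-store problem and the "larger blocks" mitigation come down to simple arithmetic. The sketch below assumes a 1 GiB cache and 4 bytes of metadata per block; both numbers and the function name are illustrative assumptions, and only the 64B and 512B block sizes come from the slides.

```python
def tag_store_bytes(cache_bytes, block_bytes, meta_bytes_per_block=4):
    """Metadata needed to track every block of a cache of the given size."""
    num_blocks = cache_bytes // block_bytes
    return num_blocks * meta_bytes_per_block


GiB = 1 << 30
MiB = 1 << 20

small_blocks = tag_store_bytes(1 * GiB, 64)    # hypothetical 1 GiB cache, 64B blocks
large_blocks = tag_store_bytes(1 * GiB, 512)   # same cache, 512B blocks

print(small_blocks // MiB, "MiB of metadata with 64B blocks")
print(large_blocks // MiB, "MiB of metadata with 512B blocks")
```

Under these assumptions the larger block cuts metadata by 8x, at the price of fetching 512B even when only 64B are used, which is the wasted bandwidth the slide mentions.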

## **Bi-Modal Cache Organization**



#### **Benefits and Results**



SRAM Overhead ~140KB

#### next paper

# Citadel: Efficiently Protecting Stacked Memory From Large Granularity Failures

#### Dec 15<sup>th</sup> 2014

Prashant Nair - Georgia Tech

David Roberts - AMD Research

Moinuddin Qureshi - Georgia Tech





#### 3D DRAM: TSV & LARGE FAULTS

- 3D DRAM: overcomes memory bandwidth wall
- Susceptible to new failure modes: TSV faults
- Causes large-granularity failures (e.g., faulty bank)
- Striping data across banks → high overheads

#### Goal: Tolerate TSV faults & Large faults at low cost

- Citadel protects against TSV and Large Faults, while retaining line in the same bank
- Citadel employs a three-pronged approach

Citadel has negligible overheads and still provides 700x higher resilience than the best ECC schemes

#### next paper





## Locality-Aware Mapping of Nested Parallel Patterns on GPUs

**HyoukJoong Lee**<sup>\*</sup>, Kevin Brown<sup>\*</sup>, Arvind Sujeeth<sup>\*</sup>, Tiark Rompf<sup>†‡</sup>, Kunle Olukotun<sup>\*</sup>

\*Pervasive Parallelism Laboratory, Stanford University †Purdue University, ‡Oracle Labs





- High-level languages for GPUs
  - Provide higher productivity and portable performance
  - Using parallel patterns (e.g., map, reduce, groupby) is becoming popular
- Parallel patterns are often nested, but difficult to map on GPUs
  - Many factors to consider together (e.g., coalescing, divergence, dynamic allocations)
  - Large space of possible mappings
  - Compilers often support only a fixed mapping strategy, which is not always efficient

```
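// PageRank algorithm written with nested parallel patterns
Nodes map { n =>
    // Inner map: each neighbor contributes its previous rank divided by its degree
    val nbrsWeights = n.nbrs map { w =>
        getPrevPageRank(w) / w.degree
    }
    // Inner reduce: sum the neighbor contributions
    val sumWeights = nbrsWeights reduce { (a, b) => a + b }
    (1 - damp) / numNodes + damp * sumWeights
}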
```
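For readers who want to run the nested pattern above, here is a minimal sequential Python rendering of one PageRank step. It is only a sketch: the adjacency-dictionary representation, the `prev_rank` dictionary, and the default damping factor of 0.85 are assumptions, and nothing here reflects how the compiler maps the patterns onto a GPU.

```python
from functools import reduce


def pagerank_step(nbrs, prev_rank, damp=0.85):
    """One PageRank iteration as an outer map over nodes with an inner map + reduce."""
    num_nodes = len(nbrs)
    degree = {n: len(ws) for n, ws in nbrs.items()}

    def node_rank(n):
        # Inner map: contribution of each neighbor w.
        nbrs_weights = map(lambda w: prev_rank[w] / degree[w], nbrs[n])
        # Inner reduce: sum of contributions.
        sum_weights = reduce(lambda a, b: a + b, nbrs_weights, 0.0)
        return (1 - damp) / num_nodes + damp * sum_weights

    # Outer map over all nodes.
    return {n: node_rank(n) for n in nbrs}


# Hypothetical 3-node triangle graph with uniform starting ranks.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
ranks = pagerank_step(triangle, {n: 1 / 3 for n in triangle})
print(ranks)
```

The outer map and the inner map/reduce are the two nesting levels whose assignment to GPU threads, warps, and blocks is the mapping decision the talk is about.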











#### **Define Mapping Parameters**

- Logical dimension: x, y, z, ...
- Block size: N
- Degree of parallelism (DOP): Span(n), Split(k)

| Pattern (I)               | Pattern (J)                 |
|---------------------------|-----------------------------|
| Dim(y), Size(16), Span(1) | Dim(x), Size(32), Span(all) |

Equivalent Parameters for Warp-Based Mapping
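The parameters can be read as a recipe for a launch shape along each logical dimension. The sketch below assumes one plausible reading: Span(n) means each thread covers n iterations, and Span(all) means a single block covers the whole pattern by looping. These semantics, the function name, and the example domain size of 1000 are assumptions for illustration; Split(k) is not modeled.

```python
import math


def launch_shape(domain_size, block_size, span):
    """Return (num_blocks, iterations_per_thread) along one logical dimension.

    span is an int n for Span(n), or the string "all" for Span(all).
    """
    if span == "all":
        # One block covers the whole pattern; each thread loops over its share.
        return 1, math.ceil(domain_size / block_size)
    # Each thread handles `span` iterations, so a block covers block_size * span.
    return math.ceil(domain_size / (block_size * span)), span


# The warp-based mapping from the table, applied to hypothetical pattern sizes.
outer = launch_shape(1000, 16, 1)      # Pattern (I): Dim(y), Size(16), Span(1)
inner = launch_shape(1000, 32, "all")  # Pattern (J): Dim(x), Size(32), Span(all)
print(outer, inner)
```

With Size(32) and Span(all) on dimension x, the inner pattern is handled by 32 cooperating threads, which matches the intuition of assigning one warp per outer iteration.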

#### Compiler Overview







#### Rodinia benchmark suite



- 28.6x speedup over 1D mappings
- 9.6x speedup over existing 2D mappings
- 24% slower than manually optimized CUDA code (7 out of 8)
- Today 1:25 PM (Session 1B, Room: Main Auditorium)

#### next paper

Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists

Ji Kim and Christopher Batten

**Cornell University** 

IEEE/ACM International Symposium on Microarchitecture 2014 (MICRO-47)

## **Amorphous Data Parallelism**

- Breadth-First Search
- Delaunay Mesh Refinement
- Survey Propagation
- Barnes-Hut N-Body
- Minimum Spanning Tree
- Single-Source Shortest-Path

[Figure: tasks (Task 1, Task 2, Task 3) operating on nodes of a graph]

Experiments on NVIDIA Tesla C2075 GPU using LonestarGPU benchmark suite

### Fine-Grain Hardware Worklist

- Memory contention
  - HWWL distributed banks
- Suboptimal load balancing
  - HWWL work redistribution
- SW overhead
  - HWWL distributed banks
- Seamless work spilling to and refilling from memory
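The three mechanisms named above (distributed banks, work redistribution, spilling and refilling) can be mimicked in a small software model. This is a behavioral sketch only: the class, the lane-to-bank mapping, the bank capacity, and the "take from the fullest bank" policy are assumptions, not the hardware design from the paper.

```python
from collections import deque


class BankedWorklist:
    """Toy model of a worklist split into per-lane banks with spill and redistribution."""

    def __init__(self, num_banks, bank_capacity):
        self.banks = [deque() for _ in range(num_banks)]
        self.bank_capacity = bank_capacity
        self.spilled = deque()  # stands in for the overflow region in memory

    def push(self, lane, task):
        bank = self.banks[lane % len(self.banks)]
        if len(bank) < self.bank_capacity:
            bank.append(task)          # local bank: no contention with other lanes
        else:
            self.spilled.append(task)  # bank full: spill to memory

    def pop(self, lane):
        bank = self.banks[lane % len(self.banks)]
        if not bank:
            self._refill_or_redistribute(bank)
        return bank.popleft() if bank else None

    def _refill_or_redistribute(self, bank):
        if self.spilled:
            bank.append(self.spilled.popleft())  # refill from memory first
            return
        donor = max(self.banks, key=len)
        if len(donor) > 1:
            bank.append(donor.pop())             # take work from the fullest bank


wl = BankedWorklist(num_banks=2, bank_capacity=2)
for task in ("a", "b", "c"):
    wl.push(0, task)                  # "c" does not fit and is spilled
popped = [wl.pop(1), wl.pop(1), wl.pop(1), wl.pop(0)]
print(popped)
```

Lane 1 never pushed anything, yet it receives the spilled task and then a redistributed one, which is the load-balancing effect the slide lists.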



### **Performance Results**



1.2-2.4X speedup over highly optimized SW implementation of challenging, irregular applications

#### next paper

# PORPLE: An Extensible Optimizer for Portable Data Placement on GPU

<u>Guoyang Chen</u>, Xipeng Shen (North Carolina State University)

Bo Wu (Colorado School of Mines)

Dong Li (Oak Ridge National Lab)

# Data Placement Problem on GPU

[Figure: data in a program (arrays A, B) can be placed in global, texture, shared, or constant memory, backed by the L1/L2 cache, the read-only cache, and the texture cache]

3X performance difference

Goal: To determine the best data placement strategy across different architectures and inputs at run time.

### Portability and Input-adaptive

- Cross architectures

Arrays A0 to A4 belong to spmv; arrays B0 to B5 belong to particlefilter.

| Placement    | A0 | A1 | A2 | A3 | A4 | B0 | B1  | B2 | B3 | B4 | B5 |
|--------------|----|----|----|----|----|----|-----|----|----|----|----|
| Rule-Based   | T  | T  | T  | T  | G  | G  | S&G | G  | G  | G  | G  |
| PORPLE-C1060 | C  | T  | T  | T  | G  | C  | S&G | G  | G  | G  | G  |
| PORPLE-M2075 | C  | T  | G  | T  | G  | C  | S&G | G  | G  | G  | G  |
| PORPLE-K20c  | C  | R  | T  | R  | G  | C  | S&R | G  | T  | G  | G  |

- Cross inputs



#### next paper





## Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures

Yunsup Lee, UC Berkeley Vinod Grover, NVIDIA Ronny Krashinsky, NVIDIA Mark Stephenson, NVIDIA Stephen W. Keckler, NVIDIA

Krste Asanovic, UC Berkeley

### **Executive Summary**



Performance with predication is **on par** with performance with a divergence stack.

The compiler should manage divergence, not the hardware.

GPUs do not need a divergence stack!
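As a toy illustration of what compiler-managed divergence means, the sketch below contrasts per-thread branching with a predicated form in which every lane steps through both paths and a per-lane predicate decides which results commit. It is a conceptual model in plain Python, not the authors' compiler algorithm; the function names and the example branch are assumptions.

```python
def branchy(xs):
    """Reference semantics: each thread follows its own side of the branch."""
    return [x * 2 if x % 2 == 0 else x + 1 for x in xs]


def predicated(xs):
    """If-converted form: all lanes run both paths, predicates gate the writes."""
    pred = [x % 2 == 0 for x in xs]
    out = list(xs)
    # Then-path, committed only where the predicate is true.
    out = [x * 2 if p else o for x, p, o in zip(xs, pred, out)]
    # Else-path, committed only where the predicate is false.
    out = [x + 1 if not p else o for x, p, o in zip(xs, pred, out)]
    return out


lanes = list(range(8))
print(branchy(lanes))
print(predicated(lanes))
```

Both functions produce the same results; the predicated version needs no hardware structure to remember where diverged threads reconverge, since control flow is uniform across lanes.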

#### next paper













### **Shared Resources**



# Our Proposal

## Warp Scheduler Controls GPU Thread-Level Parallelism

|                           | Improved GPU performance | Improved CPU performance |
|---------------------------|--------------------------|--------------------------|
| CPU-centric Strategy      | ×                        | ✓                        |
| CPU-GPU Balanced Strategy | ✓                        | ✓                        |

Control the trade-off

## **CPU-centric Strategy**

[Figure: memory congestion and CPU performance]

IF Memory Congestion ↑

**Results Summary:**

- CPU-centric strategy: +24% CPU & -11% GPU
- CPU-GPU balanced strategy: +7% both CPU & GPU

# Managing GPU Concurrency in Heterogeneous Architectures

Onur Kayıran<sup>1</sup>, Nachiappan CN<sup>1</sup>, Adwait Jog<sup>1</sup>, Rachata Ausavarungnirun<sup>2</sup>, Mahmut T. Kandemir<sup>1</sup>, Gabriel H. Loh<sup>3</sup>, Onur Mutlu<sup>2</sup>, Chita R. Das<sup>1</sup>

<sup>1</sup> Penn State, <sup>2</sup> Carnegie Mellon, <sup>3</sup> AMD Research

Today Session 1B – Main Auditorium @ 3 pm

#### next paper























[Figure: screenshot of the Wikipedia article "Microarchitecture"]

## "Wikipedia is the best thing ever..."

"...Anyone in the world can write anything they want about any subject. So you know you are getting the best possible information." – Michael Scott, Dunder Mifflin

### "Load Value Approximation is the best thing ever"

Joshua San Miguel, Mario Badr, Natalie Enright Jerger

University of Toronto, Faculty of Applied Science & Engineering

#### 3:55pm, Session 2A, Main Auditorium



#### next paper



## Arbitrary Modulus Indexing

### Jeffrey R. Diamond, Donald S. Fussell (University of Texas at Austin) Stephen W. Keckler (NVIDIA Corporation)







# Arbitrary Modulus Indexing (AMI)

## 1980s: Extensive NPO2 Indexing Research - Few moduli implemented efficiently

## AMI can implement ANY moduli efficiently - Found robust, novel moduli







# AMI Eliminates Bank Conflicts
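The effect of the modulus on bank conflicts is easy to reproduce in a few lines. The sketch below counts how many of 32 strided accesses land in the busiest bank for a power-of-two bank count versus a non-power-of-two one. The bank counts (16 and 17), the stride, and the function name are illustrative assumptions; the talk's actual moduli and hardware implementation are not shown here.

```python
from collections import Counter


def max_bank_load(num_banks, stride, num_accesses=32):
    """Largest number of accesses that map to any single bank (bank = address mod num_banks)."""
    banks = Counter((i * stride) % num_banks for i in range(num_accesses))
    return max(banks.values())


pow2_load = max_bank_load(num_banks=16, stride=16)   # power-of-two modulus
npo2_load = max_bank_load(num_banks=17, stride=16)   # non-power-of-two modulus
print(pow2_load, npo2_load)
```

With 16 banks every access in the stride-16 stream hits the same bank, while with 17 banks the same stream spreads almost evenly.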







## **AMI Improves Power And Performance**







#### next paper



Memory schedulers for reads and writes:

- FR-FCFS [ISCA'00]
- STFM [MICRO'07]
- PAR-BS [ISCA'08]
- ATLAS [HPCA'10]
- TCM [MICRO'10]
- SMS [ISCA'12]
- BLISS [ICCD'14]

Persistent memory combines:

- Fast load/store (memory attribute)
- **Data persistence** (storage attribute)


## **Why Another Memory Control Scheme?**










vs. the best of 5 previous scheduler designs

#### **Details about "FIRM"**

Fair and High-Performance Memory Control for Persistent Memory Systems

- Jishen Zhao (HP Labs/Penn State/CMU)
- Onur Mutlu (CMU)
- Yuan Xie (UCSB/Penn State/AMD Research)

Presentation: Mon (Dec. 15) 3:30 PM, Session 2A (Room: Dining Hall)

Poster Session: Tue (Dec. 16) 12:00 PM







#### next paper

# Short-Circuiting Memory Traffic in Handheld Platforms

#### Praveen Yedlapalli, Nachi Chidambaram. N., \*Niranjan Soundararajan,

#### Anand Sivasubramaniam, Mahmut Kandemir, Chita R. Das

The Pennsylvania State University

\*Intel Corp.







# "Sub-Frame"





# **Sub-frame** buffers



# Store load forwarding

## **33% Reduced DRAM energy**

## 35+% Reduced active cycles for IP

## 45% Reduced Cycles Per Frame

## ~15% Increase in FPS

#### Talk on Dec-15th - 15:55 - 18:00 Session 2A

See you all at the poster session!

#### next paper

## Efficient Memory Virtualization Reducing Dimensionality of Nested Page Walks

#### Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift





*TLB misses are very costly in virtual servers.* —Buell, et al. VMware Technical Journal 2013

#### Cost
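The cost of a TLB miss under virtualization can be estimated by counting memory references in a two-dimensional (nested) page walk. The sketch below uses the standard counting argument for hardware nested paging and assumes 4-level guest and host page tables, as on x86-64; the function names are illustrative, and the numbers are worst-case counts without any page-walk caching.

```python
def native_walk_refs(levels=4):
    """Memory references for a native page walk: one per page-table level."""
    return levels


def nested_walk_refs(guest_levels=4, host_levels=4):
    """Worst-case memory references for a 2D nested page walk.

    Each guest level needs a full host walk to locate the guest page-table
    page, plus one reference to read the guest entry. The final guest-physical
    address then needs one more host walk.
    """
    return guest_levels * (host_levels + 1) + host_levels


print(native_walk_refs(), nested_walk_refs())
```

Under these assumptions a nested walk takes up to 24 references versus 4 natively, which is why reducing the dimensionality of the walk is attractive.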



#### Problem















#### Optimization

Escape Filter: Permanent "hard" memory faults


## Please come to our talk

Today, Session 2A, Main Auditorium





#### next paper

# **Iso-X**: A Flexible Architecture for Hardware-Managed Isolated Execution

**Dmitry Evtyushkin<sup>1</sup>** Jesse Elwell<sup>1</sup> Meltem Ozsoy<sup>1</sup> Dmitry Ponomarev<sup>1</sup> Nael Abu-Ghazaleh<sup>2</sup> Ryan Riley<sup>3</sup>

<sup>1</sup>State University of New York at Binghamton, Department of Computer Science

<sup>2</sup>University of California at Riverside, Department of Computer Science & Engineering

<sup>3</sup>Qatar University, Department of Computer Science

#### The 47th Annual IEEE/ACM International Symposium on Microarchitecture

December 14<sup>th</sup>, 2014







# **Threat Model**

- Multi-layered TCB
- Large and complex software







- Iso-X offers isolated execution environment
- Software modules can run in full isolation
- Only Hardware in Trusted Computing Base (TCB)

# **Isolated Compartments**

- Compartments reside in process' address space
  - Allows efficient interaction
  - Simple compartment booting process
- All code outside is untrusted
  - Used by compartment to communicate with outside world



# Iso-X Highlights

- Full featured execution environment for isolated compartments
- Relies only on a few simple data structures maintained in protected memory
- Attestation mechanism to protect against emulation
- Hardware explicitly controls every access to compartment memory
- Low performance impact, only 0.97% on average

#### next paper

# Random Fill Cache Architecture

Fangfei Liu and Ruby B. Lee Princeton University

- Defends against challenging cache side-channel attacks, without impacting performance
- Unlike past work, which focused on cache contention-based attacks, we focus on "reuse-based attacks", which strike at the heart of the cache's function!

# Motivation

Reuse of data may leak secret information

$\ldots, T[x], T[x_i], T[x], T[x], T[x_j], T[x_j], \ldots$

- No resource contention (conflicting misses)
- Strikes at the cache's main purpose: cache hits



# Our discoveries and solutions

- Demand fetch policy is root cause of reuse-based attacks!!
- Random fill policy: cache fill is de-correlated with a demand access



# Random fill within configurable neighborhood window

- Still takes advantage of spatial locality
- No performance degradation for crypto programs
- Even improves performance of some streaming programs
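A minimal model of the random fill idea: on a miss, the demanded line is served but not cached, and a randomly chosen line from a window around it is filled instead. This is a sketch under assumptions; the window bounds `window_back` and `window_fwd`, the unbounded line set (no eviction), and the class name are all illustrative and not taken from the paper.

```python
import random


class RandomFillCache:
    """Toy cache whose fills are de-correlated from demand accesses."""

    def __init__(self, window_back, window_fwd, rng=None):
        self.window_back = window_back  # lines before the demand address
        self.window_fwd = window_fwd    # lines after the demand address
        self.lines = set()              # cached line addresses (no eviction modeled)
        self.rng = rng or random.Random()

    def access(self, line_addr):
        if line_addr in self.lines:
            return "hit"
        # Miss: data goes to the processor, but the fill targets a random
        # neighbor inside the window rather than the demanded line itself.
        offset = self.rng.randint(-self.window_back, self.window_fwd)
        self.lines.add(line_addr + offset)
        return "miss"


cache = RandomFillCache(window_back=2, window_fwd=2, rng=random.Random(0))
first = cache.access(100)
print(first, sorted(cache.lines))
```

Because the filled line is only nearby, an observer who later sees a hit cannot tell exactly which address was demanded, while accesses with spatial locality still benefit from the fill.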





#### next paper

#### CC-Hunter: Uncovering Covert Timing Channels on Shared Processor Hardware

Jie Chen and Guru Venkataramani Department of Electrical and Computer Engineering The George Washington University Washington, DC

Today, 3:55 pm, Session 2B: Security, Room: Umney Theatre

#### **Information Leakage**

#### Unauthorized exposure of sensitive data

Examples: Identity theft, credit card data breach

#### Software confinement mechanisms continue to improve

Attackers increasingly turn to hardware-based attacks



12/14/14

#### **Covert Timing Channel**





#### **Detecting Covert Timing Channels on Hardware**

#### We study conflicts on shared hardware resource and the associated events

#### We design algorithms that look for patterns of conflicts

- On combinational structures
  - Recurrent Burst Pattern
- On memory structures
  - Oscillation Pattern

#### Runtime detection capability

Low-cost and efficient hardware-software support




#### **Results**



#### More details in the talk!

Today, 3:55 pm, Session 2B: Security, Room: Umney Theatre

#### next paper

## Continuous, Low Overhead, Run-Time Validation of Program Executions

#### <u>The Problem and Existing Solutions</u>

- An execution is authenticated if its control flow path and the instructions along that path were as intended: adversaries can exploit vulnerabilities to alter either or both at run-time or before the run starts
- Validating the binaries prior to execution is thus not enough
- Most existing solutions that check for control flow integrity are geared towards specific types of attacks (Vtable modification, ROP, JOP etc.)
- Need an enduring and universal solution:
  - Can handle all attacks against code and instruction integrity
  - Can handle future attacks that can compromise these

Erdem Aktas<sup>\*</sup>, Furat Afram<sup>+</sup>, Kanad Ghose

{eaktas, fafram1, ghose}@cs.binghamton.edu

Computer Science Department, State University of New York at Binghamton

MICRO 2014

#### **Goal and Unique Aspects of this Work**

- Go beyond the current point-in-time solutions for dealing with specific types of attacks that are known today
- Validate code integrity and/or detect control flow violations at run-time, as the program executes
- Provide a more universal and enduring solution, irrespective of the specific cause of the violations/attacks
- This solution:
  - Does not require binary modifications
  - Does not require ISA extensions
  - Works with out-of-order processors
  - Is scalable to any binary size
  - Has a low execution overhead

#### Approach

- Authenticate crypto-hash signature of basic blocks and control flow path between consecutive basic blocks as program executes
- Store statically-derived reference information in an encrypted form in RAM
- Use a signature cache and overlap authentication and normal pipeline activity to reduce performance penalty
- Delay memory updates from a BB till its execution is authenticated
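The first bullet above can be illustrated with a minimal sketch. It assumes HMAC-SHA256 as the crypto-hash and a plain dictionary as the reference store; both are illustrative choices, not details from the slides. The names `block_signature`, `build_reference`, and `validate_trace` are hypothetical, and the signature cache and delayed memory updates are not modeled.

```python
import hashlib
import hmac


def block_signature(key, block_bytes, successor_addr):
    """Signature over a basic block's bytes and the address of the next block."""
    msg = block_bytes + successor_addr.to_bytes(8, "little")
    return hmac.new(key, msg, hashlib.sha256).digest()


def build_reference(key, blocks, edges):
    """Statically derive reference signatures for every legal control-flow edge.

    blocks: {address: instruction bytes}; edges: set of (src_addr, dst_addr).
    """
    return {(src, dst): block_signature(key, blocks[src], dst) for src, dst in edges}


def validate_trace(key, reference, memory, trace):
    """Check each executed block and the edge to its successor against the reference."""
    for src, dst in zip(trace, trace[1:]):
        expected = reference.get((src, dst))
        if expected is None:
            return False  # control flow took an edge that was never intended
        actual = block_signature(key, memory[src], dst)
        if not hmac.compare_digest(expected, actual):
            return False  # instructions in the block were altered
    return True
```

In this sketch an illegal edge and a modified block are both caught by the same lookup, which is the sense in which a single check can cover both code integrity and control flow.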




#### **Performance Overhead: Two SC Sizes**



# IPC (instructions per cycle) overhead in % for the benchmarks (average of 1.87%)

\*REV was evaluated through simulation using the cycle-accurate MARSS full system microarchitectural simulator for X86-64 ISA

#### next paper

SAVAT: **A Practical Methodology** for Measuring the Side-Channel Signal Available to the **Attacker** for **Instruction-Level Events** 







Rob Callan, Alenka Zajic, Milos Prvulovic Georgia Tech Comparch

## **Quantifying Side Channel Vulnerability**



Eve **EM** emanations Alice





## SAVAT

- Side-Channel Signal Available to Attackers
  - Entire-program
  - Circuit-Level
  - Instruction-Level
    - Useful for both HW and SW improvements





## **SAVAT Results**

7.9x10<sup>-21</sup> joules per instruction

|      | LDM  | STM  | LDL2 | STL2 | LDL1 | STL1 | NOI | ADD | SUB | MUL | DIV  |
|------|------|------|------|------|------|------|-----|-----|-----|-----|------|
| LDM  | 1.8  | 2.4  | 7.9  | 11.4 | 4.6  | 4.4  | 4.3 | 4.2 | 4.4 | 4.2 | 5.1  |
| STM  | 2.3  | 2.4  | 8.8  | 11.8 | 4.3  | 4.2  | 3.8 | 3.9 | 3.9 | 4.3 | 4.2  |
| LDL2 | 7.7  | 7.7  | 0.6  | 0.8  | 3.9  | 3.5  | 4.3 | 3.6 | 4.8 | 3.8 | 6.2  |
| STL2 | 11.5 | 10.6 | 0.8  | 0.7  | 5.1  | 6.1  | 6.1 | 6.1 | 6.1 | 6.2 | 10.1 |
| LDL1 | 4.4  | 4.2  | 3.3  | 5.8  | 0.7  | 0.6  | 0.7 | 0.7 | 0.7 | 0.7 | 1.3  |
| STL1 | 4.5  | 4.2  | 3.8  | 4.9  | 0.7  | 0.6  | 0.7 | 0.6 | 0.6 | 0.6 | 1.2  |
| NOI  | 4.1  | 3.8  | 4.1  | 6.4  | 0.7  | 0.7  | 0.6 | 0.6 | 0.7 | 0.6 | 1.0  |
| ADD  | 4.2  | 4.1  | 4.1  | 7.0  | 0.7  | 0.7  | 0.6 | 0.7 | 0.6 | 0.6 | 1.0  |
| SUB  | 4.4  | 4.0  | 3.8  | 7.3  | 0.7  | 0.6  | 0.7 | 0.6 | 0.6 | 0.6 | 1.1  |
| MUL  | 4.4  | 3.9  | 3.7  | 5.7  | 0.7  | 0.7  | 0.6 | 0.6 | 0.6 | 0.6 | 1.1  |
| DIV  | 5.0  | 4.6  | 6.9  | 9.3  | 1.3  | 1.2  | 1.0 | 1.1 | 1.1 | 1.1 | 0.8  |
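One way to read the table above is to ask which instruction pairs are easiest for an attacker to tell apart. The sketch below copies the table values into a matrix and looks up the largest off-diagonal entries. The helper names are hypothetical, and treating larger values as "more distinguishable" is an interpretation of the slide, not a statement taken from it.

```python
EVENTS = ["LDM", "STM", "LDL2", "STL2", "LDL1", "STL1", "NOI", "ADD", "SUB", "MUL", "DIV"]

# Rows and columns follow the order of EVENTS; values copied from the table.
SAVAT = [
    [1.8, 2.4, 7.9, 11.4, 4.6, 4.4, 4.3, 4.2, 4.4, 4.2, 5.1],
    [2.3, 2.4, 8.8, 11.8, 4.3, 4.2, 3.8, 3.9, 3.9, 4.3, 4.2],
    [7.7, 7.7, 0.6, 0.8, 3.9, 3.5, 4.3, 3.6, 4.8, 3.8, 6.2],
    [11.5, 10.6, 0.8, 0.7, 5.1, 6.1, 6.1, 6.1, 6.1, 6.2, 10.1],
    [4.4, 4.2, 3.3, 5.8, 0.7, 0.6, 0.7, 0.7, 0.7, 0.7, 1.3],
    [4.5, 4.2, 3.8, 4.9, 0.7, 0.6, 0.7, 0.6, 0.6, 0.6, 1.2],
    [4.1, 3.8, 4.1, 6.4, 0.7, 0.7, 0.6, 0.6, 0.7, 0.6, 1.0],
    [4.2, 4.1, 4.1, 7.0, 0.7, 0.7, 0.6, 0.7, 0.6, 0.6, 1.0],
    [4.4, 4.0, 3.8, 7.3, 0.7, 0.6, 0.7, 0.6, 0.6, 0.6, 1.1],
    [4.4, 3.9, 3.7, 5.7, 0.7, 0.7, 0.6, 0.6, 0.6, 0.6, 1.1],
    [5.0, 4.6, 6.9, 9.3, 1.3, 1.2, 1.0, 1.1, 1.1, 1.1, 0.8],
]


def off_diagonal_pairs(events, matrix):
    """Yield (value, row_event, column_event) for every pair of different events."""
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if i != j:
                yield value, events[i], events[j]


def most_distinguishable(events, matrix):
    """Return the pair of different events with the largest table entry."""
    return max(off_diagonal_pairs(events, matrix))


def pairs_at_least(events, matrix, threshold):
    """Return all pairs of different events whose entry meets the threshold."""
    return [p for p in off_diagonal_pairs(events, matrix) if p[0] >= threshold]
```

With these values the largest entries all involve L2 stores paired with off-chip memory accesses or division, while pairs among L1 accesses and simple arithmetic stay near the diagonal values.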





#### next paper



# **RpStacks:**

Fast and Accurate Processor Design Space Exploration Using Representative Stall-Event Stacks

Jaewon Lee, Hanhwi Jang, and Jangwoo Kim

POSTECH

## Slow and expensive








- div, R2, R3, Timing: 1, 5, 6, 7, ...
- sub, R1, R0, Timing: 1, 6, 7, 7, ...
- ld, R5, 0x02, L1Miss, Timing: ...
- beq, R7, R0, Ntaken, Timing: ...
- add, R6, R1, Timing: 1, 5, 6, 6, ...

Simulation Information (Baseline) 

Dependency & Stall cycles









#### **RpStacks**








## Fast and accurate

# Fast

# 26x speedup vs. simulator (testing 1,000 design points)

## Accurate

## vs. traditional simulation result analysis

# Main auditorium 16<sup>th</sup> / 9:15 AM



#### next paper

## **GPUMech: GPU Performance Modeling Technique based on Interval Analysis**

Jen-Cheng Huang, Joo Hwan Lee, Hyesoon Kim, Hsien-Hsin S. Lee Georgia Institute of Technology





**Register file size** 

Can we quickly find bottlenecks of hardware configurations without time-consuming detailed timing simulation?











## **Approach: GPUMech**

- Use analytical modeling to model performance
- Use functional simulation to identify stall events
- Visualize performance bottlenecks using CPI stack
- 97x Speed Up over detailed timing simulation

(Error: 13.2%)

Session 3A: Methodology, Modeling and Tools *Tue 09:15 – 10:30* 

#### next paper

## PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research

Derek Lockhart, Gary Zibrat, and Christopher Batten



Cornell University Computer Systems Laboratory













#### **Modeling Towards Layout**

#### Functional Level

- Behavior

#### Cycle Level

- Behavior
- Timing

#### **Register Transfer Level**

- Behavior
- Timing
- Physical Resources
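The difference between the first two levels above can be sketched in plain Python. This is not PyMTL's API; `mul_fl` and `MulCL` are hypothetical names, and the fixed latency is an assumed parameter. The functional-level model captures behavior only, while the cycle-level model adds timing. A register-transfer-level model would additionally describe the physical resources and is not shown.

```python
def mul_fl(a, b):
    """Functional level: behavior only, no notion of time."""
    return a * b


class MulCL:
    """Cycle level: same behavior, plus a latency measured in cycles."""

    def __init__(self, latency=4):
        self.latency = latency
        self.in_flight = []  # entries of [cycles_left, result]

    def issue(self, a, b):
        """Start a multiplication; its result appears `latency` ticks later."""
        self.in_flight.append([self.latency, a * b])

    def tick(self):
        """Advance one cycle and return the results that complete in it."""
        for entry in self.in_flight:
            entry[0] -= 1
        done = [result for cycles_left, result in self.in_flight if cycles_left == 0]
        self.in_flight = [e for e in self.in_flight if e[0] > 0]
        return done
```

Both models agree on the value produced; only the cycle-level one can answer when it is produced.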





















Session 3A: Methodology, Modeling, and Tools













#### next paper





#### CALCULATING ARCHITECTURAL VULNERABILITY FACTORS FOR SPATIAL MULTI-BIT TRANSIENT FAULTS

Mark Wilkening<sup>1</sup>, Vilas Sridharan<sup>2</sup>, Si Li<sup>3</sup>, Fritz Previlon<sup>1</sup>, Sudhanva Gurumurthi<sup>4</sup> and David R. Kaeli<sup>1</sup>

<sup>1</sup>ECE Department, Northeastern University, Boston, MA, USA
<sup>2</sup>RAS Architecture, Advanced Micro Devices, Inc., Boxborough, MA, USA
<sup>3</sup>ECE Department, Georgia Institute of Technology, Atlanta, GA, USA
<sup>4</sup>AMD Research, Advanced Micro Devices, Inc., Boxborough, MA, USA

#### **INTRODUCTION AND MOTIVATION**



- Particle-induced transient faults in SRAM are the dominant contributor to microprocessor faults [Baumann 2005]
  - High energy particles deposit charge in silicon
  - This can invert the state of logic devices and cause one or more bit flips

- Multi-bit transient faults have become more important as technology scales
  - Number and size of multi-bit faults are increasing [Ibe 2010]
  - Trend will continue despite FinFETs [Seifert 2012]

Methods to characterize the impact of multi-bit faults are lacking

#### WHAT IS THIS PAPER ABOUT?



- We introduce architectural vulnerability factors for multi-bit transient faults (MB-AVFs)
- We measure MB-AVFs for detected, uncorrected errors (DUE MB-AVF)
- We approximate MB-AVFs for silent data corruption (SDC MB-AVF)
- We show how MB-AVFs can be used to make design trade-offs



#### next paper

# Using ECC Feedback to Guide Voltage Speculation in Low-Voltage Processors

#### Anys Bacha and Radu Teodorescu Department of Computer Science and Engineering





# High Guardbands in Modern Processors














Find weakest cache line

Monitor weakest cache line



# Real System Results

# **33%** Energy Savings

#### next paper

# Harnessing Soft Computation for Low-Budget Fault Tolerance

## Daya S Khudia Scott Mahlke

Advanced Computer Architecture Laboratory University of Michigan, Ann Arbor





## Acceptable Vs. Unacceptable Outputs
















# **Traditional Duplication**










# **Selective Duplication and Value Checks**

- Duplication for critical variables
- Exploit value locality and check for deviations

-Produces 0 more than thr -Insert value comparison

# op2 op2 = op3 \* op4

# **1.2x** Performance Overhead

# **2.8x** Reduction in unacceptable outputs


#### next paper



#### What did we borrow?

From Decoupled Compressed Cache:

- Compression factor is a spatial property
- Leverage superblocks to limit tag overhead
- Complex look-up and replacement management
- Meta data overhead

From Skewed TLB:

- A multiple-grain structure in an N-way (skewed) cache
- The analogy:

At read time, page size (compression factor) is unknown







#### **Skewed Compressed Cache**

- Leverages spatial compression factor locality
- « x 2 » the LLC size
- Direct tag-data mapping
- Simple allocation/replacement automaton
- No meta-data, only 1.6 % storage overhead



Last step towards effective compressed caching?





#### next paper

### Adaptive Cache Management for Energy-efficient GPU Computing



Xuhao Chen<sup>1,2,3</sup>, Li-Wen Chang<sup>3</sup>, Chris Rodrigues<sup>3</sup>, Jie Lv<sup>3</sup>, Zhiying Wang<sup>1,2</sup>, Wen-Mei Hwu<sup>3</sup>
[1] State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, China
[2] School of Computer, National University of Defense Technology, Changsha, China
[3] Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, USA

- Many cache sensitive GPU applications have severe cache contention → low cache efficiency → poor performance
  - Smaller L1 cache capacity per thread
- Existing management schemes have limitations
- We propose Coordinated Bypassing and Warp Throttling (CBWT) to improve GPU cache efficiency

Reduce cache contention rate and NoC latency



ECE ILLINOIS

A 2.68x speedup on average (harmonic mean) for highly cache sensitive (HCS) benchmarks





#### **Observations – Understanding the Limitations**

- Cache bypassing retains useful cache lines instead of replacing upon miss
  - ✓ Retain useful data → fewer cache misses per thread
    - Average HCS speedup 1.57x
  - X High demand on NoC to serve misses, i.e. congestion
  - X Still cannot avoid locality loss
- Warp throttling temporarily deactivates some threads
  - ✓ Fewer threads → more cache per thread → fewer misses
  - X Few threads → cannot hide latency through multithreading
  - X Resource under-utilization



HCS speedups: 1.46x (SWL), 1.84x (SWL+bypass), 1.38x (bypass)








### **CBWT Architecture Overview**

- Extra sampling modules (yellow blocks) are added to monitor *contention* and *congestion*.
- Adjust the MAW to keep the network in a *busy* but *low-congestion* range.
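The adjustment described above can be pictured as a simple feedback step. The sketch below assumes that MAW is a cap on the number of active warps (the slide does not expand the acronym) and that the sampling modules report congestion as a number between 0 and 1. The function name, thresholds, and bounds are all hypothetical.

```python
def adjust_maw(maw, congestion, low, high, min_warps=1, max_warps=48):
    """Return the warp cap for the next sampling interval.

    congestion above `high` means the network is overloaded, so throttle;
    congestion below `low` means the network is underused, so allow more warps.
    """
    if congestion > high:
        maw -= 1
    elif congestion < low:
        maw += 1
    return max(min_warps, min(max_warps, maw))
```

Calling it once per interval keeps the cap unchanged while the measured congestion stays between the two thresholds, which corresponds to the "busy but low-congestion" range on the slide.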










### **Performance and Energy-efficiency**

- CBWT achieves an average of 74% (maximum 661%) IPC improvement on HCS benchmarks over baseline, which significantly outperforms PDP bypassing (42%) and Best-SWL (52%).
  - PDP bypassing: pure cache bypassing
  - Best-SWL: pure warp throttling
- CBWT outperforms the baseline with an average 58.6% Perf/Watt improvement
  - On average, PDP bypassing reduces DRAM traffic by 16.5%
  - CBWT reduces DRAM traffic by 54.9%
- Welcome to Session 4A in Main Auditorium on Dec. 16 (Tuesday) at 11:10 AM for more details


#### next paper

# Problem: How to efficiently divide a cache into hundreds of partitions?



Hundreds of partitions

**X** Few coarse-grain partitions

**X** Low associativity

Control the partition size by adjusting eviction rate at replacement

Able to support many fine-grain partitions

Basic idea: control the size of each partition by properly scaling the futility of its cache lines

Intuition:

Scaling up futility → cache lines are evaluated as less useful → lower probability of being kept in the cache → partition size is reduced

More details in tomorrow's talk:

- Futility estimation
- Scaling factor adjustment
- Feedback-based implementation
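The intuition can be sketched with two small functions: one picks the victim by scaled futility, the other nudges each partition's scaling factor toward its target size. How futility is estimated and how the real feedback loop works are left to the talk; the names, the multiplicative `step`, and the data layout here are assumptions for illustration only.

```python
def pick_victim(lines, scale):
    """Return the index of the candidate line with the largest scaled futility.

    lines: list of (partition, futility) replacement candidates.
    scale: {partition: scaling factor}.
    """
    return max(range(len(lines)), key=lambda i: lines[i][1] * scale[lines[i][0]])


def adjust_scale(scale, sizes, targets, step=1.1):
    """Feedback step: oversized partitions get a larger factor, undersized a smaller one."""
    new_scale = {}
    for part, factor in scale.items():
        if sizes[part] > targets[part]:
            new_scale[part] = factor * step  # lines look less useful, evicted sooner
        elif sizes[part] < targets[part]:
            new_scale[part] = factor / step  # lines look more useful, kept longer
        else:
            new_scale[part] = factor
    return new_scale
```

Because the policy only rescales a per-line score, it does not restrict where a partition's lines may live, which is consistent with the slide's claim of supporting many fine-grain partitions.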

Tomorrow at 10:45am in Session 4A "TLB and Cache Optimization"

# Futility Scaling: High-Associativity Cache Partitioning

<u>Ruisheng Wang</u> — University of Southern California Lizhong Chen — Oregon State University





**College of Engineering** 

#### next paper

15 December 2014



## Voltage Noise in Multi-core Processors: Empirical Characterization and Optimization Opportunities

**Ramon Bertran<sup>1</sup>**, Alper Buyuktosunoglu<sup>1</sup>, Pradip Bose<sup>1</sup>, Timothy J. Slegel<sup>2</sup>, Gerard Salem<sup>2</sup>, Sean Carey<sup>2</sup>, Richard F. Rizzolo<sup>2</sup>, Thomas Strach<sup>2</sup>

<sup>1</sup>IBM Research <sup>2</sup>IBM Systems & Technology Group

Full presentation: Session 4B Managing Voltage and Time Tomorrow at 10:45AM Umney Theatre & Lounge room

© 2014 IBM Corporation

#### Voltage noise: Transient variations in supply voltage

Goal: define and validate an operating voltage ensuring robust performance without being overly conservative

Pre-silicon characterization is inadequate

A systematic, direct measurement-based characterization is required

**Our work:** Systematic noise characterization and exploration of optimization opportunities on zEC12





### **Characterization of noise**





# Do you want to understand voltage noise?

## When:

## Tomorrow, Tuesday 16<sup>th</sup> Dec 2014, at 10:45AM

## Where:

## **Umney Theatre & Lounge**

Session 4B - Managing Voltage and Time

Please come to our poster session too!

#### next paper

#### Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks

Waclaw Godycki, Christopher Torng, Alyssa Apsel, Christopher Batten






#### Benefits of Integrated Voltage Regulation

- Reduced system cost
- Potential for fine-grain voltage scaling (per-core, µs-scale voltage transitions)

#### Architecture and Analog Circuit Co-Design Challenges

- Enabling per-core voltage regulation while mitigating large regulator area overhead
- Choosing useful voltage levels
- Providing fast voltage transition times

#### Simple On-Chip Power Distribution Network

#### Single Fixed Voltage Regulator

Figure labels: VR; Core, Core, Core, ...; On-Chip Interconnect; Cache Bank, Cache Bank, Cache Bank















Unit cells of shared energy storage are flexibly reconfigured to effectively create multiple differently-sized SC regulators "on-demand".

#### **Benefits of RPDN**

- 10-50% performance and 10-70% energy-efficiency improvement compared to no FGVS
- 10× faster voltage transition times compared with MAVR (μs to 100 ns)
- 40% less area compared to a more traditional per-core regulation scheme

#### To hear more, come to the talk on Tuesday!

#### next paper



#### Micro-sliced Virtual Processors to Hide the Effect of Discontinuous CPU Availability for Consolidated Systems

Jeongseob Ahn, Chang Hyun Park, and Jaehyuk Huh

#### Computer Science Department KAIST

Tomorrow, Session 4B, paper3 @ Umney Theatre

### Virtual Time Discontinuity



#### Virtual CPUs are not always running

### Interrupt with Virtualization



### Spinlock with Virtualization

#### vCPU0 holding a lock is preempted

#### vCPU1 starts spinning to acquire the lock





### Toward Virtual Time Continuity



#### Short, but more frequent runs

















#### **Pollution of architectural structures**

#### next paper

# SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers

Yunqi Zhang, Michael A. Laurenzano, Jason Mars, Lingjia Tang



Clarity-Lab Electrical Engineering and Computer Science University of Michigan

### Data centers are expensive












### Low utilization leads to inefficiency




### **Resource Sharing**






# < 2% Prediction Error, 42% Utilization Improvement

## Session 5A Wednesday at 9:50 A.M.


#### next paper

### A Front-end Execution Architecture for High Energy Efficiency

<u>Ryota Shioya</u>\*, Masahiro Goshima+, and Hideki Ando\* \* Nagoya University

+ National Institute of Informatics

### Front-end Execution Architecture (FXA)

#### Background:

 OoO superscalar processors are fast, but their energy efficiency is low

 Even in heterogeneous architecture, big cores still consume a large amount of energy

#### Goal:

Improving the energy efficiency of OoO SSPs

#### Approach:

Execute instructions in-order in a front-end

### FXA has two execution units: IXU and OXU


### FXA's merit: improving energy efficiency

#### Merits:

FUs can be added to the IXU with low overhead, and it improves performance

- The IXU can execute over 50% of instructions in-order
  - The energy-consuming OXU is shrunk

#### Compared to Cortex-A57 (big)

5.7% higher IPC and 17% lower energy consumption

25% higher performance/energy ratio (= the inverse of EDP)

# **Execution Drafting**

**Energy Efficiency Through Computation Deduplication** 

Michael McKeown, Jonathan Balkind, David Wentzlaff Princeton University

MICRO-47

Wednesday, December 17, 2014 – Session 5A – 9:50am



PRINCETON

**School of Engineering and Applied Science** 

## Motivation

Cloud Computing

• Data Center Energy Efficiency



Source: http://www.google.com/about/datacenters/gallery/images/\_2000/CBF\_009.jpg

Computation Deduplication

Commonality in Applications







# **Execution Drafting**



## Main Results

- Up to 20% performance/energy gain
- Minuscule performance degradation
- Small area overhead
- Potential in drafting more threads



Wednesday, December 17, 2014 – Session 5A – 9:50am

#### PPEP: ONLINE PERFORMANCE, POWER, AND ENERGY PREDICTION FRAMEWORK and DVFS Space Exploration

Bo Su<sup>†</sup> Junli Gu<sup>‡</sup> Li Shen<sup>†</sup> Wei Huang<sup>‡</sup> Joseph L. Greathouse<sup>‡</sup> Zhiying Wang<sup>†</sup> <sup>†</sup>National University of Defense Technology <sup>‡</sup>AMD Research

Dynamic Voltage & Frequency Scaling Challenge
How to predict performance & power across VF states

- Difficulties on modern processor
  - Multiple clock domains
  - Multiple power planes

Core Clock Domain & Power Plane

North Bridge Clock Domain & Power Plane

Scalable

Not scalable



AMD FX-8320



#### Q1: PERFORMANCE PREDICTION



Core Clock Domain Other Clock Domains 🛞
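The split between the core clock domain and the other clock domains suggests a simple way to think about performance prediction across VF states. The sketch below is a textbook two-component model, not PPEP's actual predictor: it assumes the time spent in the core domain scales inversely with core frequency while the rest stays fixed. The function name and parameters are hypothetical.

```python
def predict_time(core_time, other_time, f_now, f_new):
    """Predict execution time at core frequency f_new from a measurement at f_now.

    core_time:  time attributed to the core clock domain (scales with frequency)
    other_time: time attributed to other clock domains (assumed not to scale)
    """
    return core_time * (f_now / f_new) + other_time
```

For example, a run with 8 s in the core domain and 2 s elsewhere at 2 GHz would be predicted to take 6 s at 4 GHz under this model, showing why the non-scalable part limits the benefit of raising the core frequency.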



#### **Q2: POWER PREDICTION**



- Power model: CPU events + temperature
- Power prediction: LL-MAB + 2 observations of CPU events
- <u>4.2%</u> error across 5 VF states

Wednesday 11:05AM



**NoC Architectures for Silicon Interposer Systems**

Natalie Enright Jerger\*, Ajaykumar Kannan\*, Zimo Li\*, Gabriel H. Loh<sup>+</sup>

\* **University of Toronto** <sup>+</sup> AMD Research

Wed: Session 5B































## <u>Hi-Rise</u>: A High-Radix Switch for 3D Integration with Single-cycle Arbitration

### Supreet Jeloka, Reetuparna Das, Ronald G. Dreslinski, Trevor Mudge, David Blaauw University of Michigan, Ann Arbor



- Many-core systems
- Low-radix not scalable
- Goal: 3D High-radix switch



### 2D & 3D Switch Designs



### **3D Switch Design Challenges**



Efficient High Radix 3D switch requires

- Optimized datapath for <u>connection heterogeneity</u>
- Fair low-cost arbitration for <u>multi-stage switch</u>

## Proposed 3D-Switch: Hi-Rise

- True 3D Switch
- Hierarchical datapath
  - Reduced TSVs
- Class-based arbitration
  - Composable
  - Single cycle & Built-In
- 64-Radix 4-Layer Hi-Rise
  - 2.2 GHz, 10.65 Tbps
  - 44 pJ / 128-bit transaction
  - 13% System speedup over FBFly



### **Multi-GPU System Design with Memory Networks**

**Gwangsun Kim**, Minseok Lee, Jiyun Jeong, John Kim Department of Computer Science, KAIST





## **Memory Network**

### How to design the different networks?

























### next paper

Dodec: Random-Link, Low-Radix On-Chip Networks

Haofan Yang, Jyoti Tripathi, Natalie Enright Jerger, Dan Gibson

Session 5B: Wednesday @ 11:05



of Electrical & Computer Engineering

Google

## **On-Chip Network Routers**

© umnet.com

### 4-ported router

© muppet.wikia.com

### Simple 3-ported router

### Many-ported router

## **Topological Choices**

© wxs.ca

2D Mesh: regular, grid-like

- Grid cities: government designed
- Non-grid cities: grow organically based on how people use them
- Similarly, we consider irregular network structures that might be more **useful** on-chip

### Simple Routers + Irregular Networks!
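To make the combination of simple routers and irregular links concrete, the sketch below builds a random network in which every router has exactly three links and measures its average hop count. This is a generic random-regular-graph construction chosen for illustration, not the Dodec construction from the paper. All function names are hypothetical.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop count from `source` to every reachable router."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def random_low_radix_network(n, ports, rng, max_tries=10_000):
    """Randomly pair up port endpoints until a simple, connected network results."""
    if (n * ports) % 2 != 0 or ports >= n:
        raise ValueError("need an even number of port endpoints and ports < n")
    for _ in range(max_tries):
        stubs = [r for r in range(n) for _ in range(ports)]
        rng.shuffle(stubs)
        adj = {r: set() for r in range(n)}
        valid = True
        for a, b in zip(stubs[0::2], stubs[1::2]):
            if a == b or b in adj[a]:  # reject self-links and duplicate links
                valid = False
                break
            adj[a].add(b)
            adj[b].add(a)
        if valid and len(bfs_distances(adj, 0)) == n:
            return adj
    raise RuntimeError("no valid network found; try a different seed")


def average_hops(adj):
    """Mean shortest-path length over all ordered router pairs."""
    n = len(adj)
    total = sum(sum(bfs_distances(adj, s).values()) for s in adj)
    return total / (n * (n - 1))
```

Usage: `net = random_low_radix_network(16, 3, random.Random(0))` followed by `average_hops(net)` gives a hop-count figure that can be compared against a same-size 2D mesh.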



Dodec: Random-Link, Low-Radix On-Chip Networks Session 5B: Wednesday @ 11:05

### next paper





# LRLRRLRRLRR





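|           | 10:00 | 12:00 | 15:00 | 1:00 | 21:00 |
|-----------|-------|-------|-------|------|-------|
| Monday    | L     | R     | L     | R    | R     |
| Wednesday | L     | R     | L     | R    | R     |
| Friday    | L     | R     | L     | R    | R     |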
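The schedule above repeats across days, which hints at what a second dimension in branch history can buy. The sketch below treats each day as one outer-loop iteration and each time slot as one inner-loop iteration, and predicts an outcome from the same slot on the previous day. It is only an illustration of that idea with hypothetical names; the actual Wormhole design is described in the talk.

```python
def predict_from_previous_outer(history, outer, inner, default="L"):
    """Predict the outcome at (outer, inner) from the same inner position one outer iteration back."""
    if outer == 0:
        return default
    return history[outer - 1][inner]


def accuracy(history):
    """Fraction of correct predictions, skipping the first outer iteration."""
    correct = total = 0
    for outer in range(1, len(history)):
        for inner, actual in enumerate(history[outer]):
            total += 1
            if predict_from_previous_outer(history, outer, inner) == actual:
                correct += 1
    return correct / total


# Outcomes for Monday, Wednesday, and Friday, copied from the schedule.
SCHEDULE = [list("LRLRR"), list("LRLRR"), list("LRLRR")]
```

On this schedule every prediction after the first day is correct, whereas the flattened one-dimensional sequence alternates too often for a simple last-outcome guess to do well.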

# Wormhole: Wisely Predicting Multidimensional Branches

Jorge Albericio, Joshua San Miguel, Natalie Enright Jerger, and Andreas Moshovos



Wednesday, 13:00, Session 6A

### next paper



### **Bias-Free Branch Predictor**



### Dibakar Gope & Mikko H. Lipasti University of Wisconsin – Madison

**MICRO 2014** 

### **Why Another Branch Predictor?**



Correlations to ~256 branches

Access to several tables

Predictor consumes ~15% of core energy (ARM's Cortex A15)





WISCONSIN

Conventional Predictor Table

**Bias-Free Pred:** 

Filter Useless Info






**Expand the effective reach of fixed length GHR** 
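A minimal sketch of the filtering idea, assuming a branch counts as biased until it has been seen going both ways: only non-biased branches are shifted into the global history register (GHR), so a fixed-length history reaches further back in time. The class name and the bias test are illustrative assumptions, not the predictor's actual mechanism.

```python
class BiasFilteredHistory:
    """Global history that records only branches observed in both directions."""

    def __init__(self, length):
        self.length = length
        self.ghr = []    # most recent (pc, taken) pairs, oldest first
        self.seen = {}   # pc -> set of outcomes observed so far

    def update(self, pc, taken):
        outcomes = self.seen.setdefault(pc, set())
        outcomes.add(taken)
        if len(outcomes) > 1:  # branch is not biased, so it carries information
            self.ghr.append((pc, taken))
            self.ghr = self.ghr[-self.length:]
```

In this sketch an always-taken branch never consumes a history slot, so the same number of GHR bits covers more of the branches that actually correlate.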






### **Our Solution:** Bias-Free Branch Predictor




### next paper

## Loop-Aware Memory Prefetching Using Code-Block Working Sets

\*Prefetch in Bulk and Listen to Your Mother

Adi Fuchs, Shie Mannor, Uri Weiser, Yoav Etsion

Electrical Engineering, Computer Science

Technion – Israel Institute of Technology







### Prefetch Aggressively on Tight Loops (*i.e.*, *Prefetch in Bulk*)

#### • Observation:

Working sets of tight loop iterations (CBWS) are highly interdependent

#### • Means:

Vector arithmetic to compute CBWS differential vectors: $\Delta_{AB} = \{\delta_i \mid \delta_i = b_i - a_i \text{ for each } i\}$

#### • **Objective**: Prefetch complete CBWS when possible

- The compiler tells us when to use tight loop prefetcher
  - Otherwise, fall back to the high-performance SMS prefetcher
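The differential vector defined on this slide can be worked through in a few lines. The sketch assumes each working set is an ordered list of addresses with the same length in consecutive iterations, and that the next iteration is predicted by applying the same differences again. That prediction rule and the function names are assumptions for illustration.

```python
def differential(a, b):
    """Element-wise difference b_i - a_i between two iterations' working sets."""
    if len(a) != len(b):
        raise ValueError("working sets must have the same number of accesses")
    return [bi - ai for ai, bi in zip(a, b)]


def predict_next(b, delta):
    """Predict the next iteration's working set by applying the same differences."""
    return [bi + di for bi, di in zip(b, delta)]
```

For working sets A = [0x1000, 0x2000, 0x3000] and B = [0x1040, 0x2008, 0x3000], the differential is [0x40, 0x8, 0x0] and the predicted next working set is [0x1080, 0x2010, 0x3000], which could then be prefetched in bulk.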

BLOCK\_END(0);
}



## Speedup

Over 1.3x over SMS for memory intensive benchmarks


#### next paper



Stavros Volos, Javier Picorel, Babak Falsafi, Boris Grot

EPFL

### Datacenters: The Workhorses of Information Age

#### User Requests are Data-Intensive

DRAM serves many requests

Vast DRAM-resident datasets

DRAM: Major energy hog in datacenters

### **Datasets Have Different Flavors**

Figure labels: videos (YouTube); index, pages (Bing, Google)

## **BuMP**

### **Bulk Memory Access Prediction and Streaming**

#### Prediction: Identify bulk accesses

Streaming: Trigger bulk transfers to exploit locality

For a 16-core server running datacenter applications

## 23% lower DRAM energy, 11% higher throughput

### Session 6A, Wednesday at 2:15 PM

#### next paper

## Protean Code: Achieving Near-free Online Code Transformations for Warehouse Scale Computers

Michael A. Laurenzano Yunqi Zhang Lingjia Tang Jason Mars





Electrical Engineering and Computer Science University of Michigan

#### Datacenter

## Dynamism is everywhere

Apps begin and end

Program phases

User behavior varies

Unreliable hardware

Native code should change with the environment

Not possible today in production environments





## Protean code is a breakthrough

- Compilation is asynchronous with near-zero overhead
- Dynamic code optimization is <u>always available</u>
- Static compilation choices do not have to be permanent

## Come to my talk to hear about

- A new paradigm for thinking about compilation
- A fully functional, open source dynamic compiler infrastructure implemented on top of LLVM
- A novel dynamic optimization that reduces the # of servers in the datacenter by > 25%


## Session 6B, Wednesday 1:00pm

#### next paper

Compiler Support for Optimizing Memory Bank Level-Parallelism

Wei Ding, Diana Guttman, and Mahmut Kandemir The Pennsylvania State University

- Last-level cache misses (memory accesses) are important for good performance, but current compilers focus on cache locality only
- Bank-level parallelism of LLC misses is critical to performance

How can a compiler optimize for bank-level parallelism on a multicore processor?






Average bank-level parallelism improvement of 17.1% Average memory access latency reduction of 9.2%

Please come to the full-length presentation!

- Extensions to consider memory controller-level parallelism and row-buffer locality
- Algorithm details
- Detailed results

Session 6B: Compilation and Code Generation Wednesday, December 17 13:00 - 14:40 Room: Umney Theatre

#### next paper

#### Architectural Specialization for Inter-Iteration Loop Dependence Patterns

Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu Zhiru Zhang, Christopher Batten

> Computer Systems Laboratory School of Electrical and Computer Engineering Cornell University

47th Int'l Symp. on Microarchitecture, Dec 2014 Session 6B: Compilation and Code Generation


#### Loop Dependence Pattern Specialization



Key Challenge: Creating HW/SW abstractions that are flexible and enable performance-portable execution

#### **XLOOPS** ISA

| Label   | Instruction | Operands       |
|---------|-------------|----------------|
| `loop:` | `lw`        | `r4, 0(r3)`    |
|         | `lw`        | `r5, 0(rA)`    |
|         | `mul`       | `r6, r4, r5`   |
|         | `addu`      | `rX, r6, rX`   |
|         | `sw`        | `rX, 0(r3)`    |
|         | `addiu.xi`  | `r3, 4`        |
|         | `addiu.xi`  | `rA, 4`        |
|         | `addiu`     | `r1, r1, 1`    |
|         | `xloop.orm` | `r1, rN, loop` |

[Figure: execution timeline of the loop's instructions for Iteration 0, Iteration 1, and Iteration 2]


#### **XLOOPS** Compiler

```
#pragma xloop ordered
for ( X=0, i=K; i<N; i++ )
{
    A[i] = A[i] * A[i-K];
    X += A[i];
}
```
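The loop above carries two inter-iteration dependences: a memory dependence at distance `K` through `A[i-K]`, and a register dependence through the accumulator `X`. The following is a minimal Python sketch of that observation, not the XLOOPS compiler or hardware; the function names are hypothetical and `K >= 1` is assumed. It checks that `K` consecutive iterations never read each other's writes to `A`, so the array update can proceed in chunks of `K` while the accumulation stays ordered.

```python
def xloop_sequential(a, k):
    """Reference semantics: run the iterations strictly in order."""
    a = list(a)
    x = 0
    for i in range(k, len(a)):
        a[i] = a[i] * a[i - k]
        x += a[i]
    return a, x


def xloop_chunked(a, k):
    """Process K iterations at a time from a snapshot of the array.

    Iteration i only reads A[i-K], which was written by the previous
    chunk, so the iterations inside one chunk are independent with
    respect to A. Assumes k >= 1.
    """
    a = list(a)
    x = 0
    for start in range(k, len(a), k):
        stop = min(start + k, len(a))
        snapshot = list(a)  # values visible before this chunk starts
        for i in range(start, stop):
            a[i] = snapshot[i] * snapshot[i - k]
        # The accumulation into X is kept in iteration order.
        for i in range(start, stop):
            x += a[i]
    return a, x
```

Both functions return the same array and the same accumulator value; only the order in which the array elements are updated differs.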




#### **XLOOPS Microarchitecture**



Single-ISA heterogeneous architecture that transparently integrates traditional processors and specialized loop accelerators

Session 6B

[Figure: XLOOPS microarchitecture; visible labels: OoO GPP, L1 Data Cache]






Architectural Specialization for Inter-Iteration Loop Dependence Patterns



- Traditional Execution
  - Speedups close to 1×
- Specialized Execution
  - Speedups 1.25–2.5×
  - Energy efficiency 1.5–3×
- Adaptive Execution
  - Dynamically trade performance vs. energy efficiency



#### next paper

#### Specializing Compiler Optimizations Through Programmable Composition For Dense Matrix Computations

Qing Yi, Qian Wang, Huimin Cui

University of Colorado, Colorado Springs, USA

Institute of Software & Institute of Computing, Chinese Academy of Sciences



MICRO-2014

#### Motivation

#### What's wrong with general purpose compilers?

- Target all possible user applications
  - Try to attain the best average performance
  - Inferior to manual specialized optimizations
- Independent optimization passes
  - Unpredictable interferences among optimizations
  - Information loss from re-analyzing optimized code
  - Best optimization order is NP-complete

#### **Overcoming The Uncertainties**

- **Programmable composition of compiler optimizations** 
  - Eliminate optimization interferences
    - Analyze the original input source code only once
    - Enable fine-grained coordination among optimization passes
  - Pattern-based Specialization
    - Recognize known computational patterns
    - Specialize optimization customization and ordering
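
As a rough illustration of pattern-based specialization, the sketch below recognizes a computational pattern once and then applies a pattern-specific, ordered list of optimizations. It is a toy model under stated assumptions, not the authors' framework: the pass names, the `RECIPES` table, and the recognition rule are all hypothetical.

```python
# Hypothetical optimization passes: each one only records itself on a plan.
def loop_blocking(plan):
    return plan + ["loop_blocking"]


def unroll_and_jam(plan):
    return plan + ["unroll_and_jam"]


def scalar_replacement(plan):
    return plan + ["scalar_replacement"]


def vectorize(plan):
    return plan + ["vectorize"]


# Hypothetical table: recognized pattern -> ordered optimization recipe.
RECIPES = {
    "matrix_multiply": [loop_blocking, unroll_and_jam,
                        scalar_replacement, vectorize],
    "vector_update": [unroll_and_jam, vectorize],
}


def recognize_pattern(loop_depth, has_reduction):
    """Toy recognizer based on two facts gathered from one analysis pass."""
    if loop_depth == 3 and has_reduction:
        return "matrix_multiply"
    if loop_depth == 1:
        return "vector_update"
    return "unknown"


def specialize(loop_depth, has_reduction):
    """Analyze once, then apply the recipe chosen for the pattern."""
    pattern = recognize_pattern(loop_depth, has_reduction)
    plan = []
    for optimization in RECIPES.get(pattern, []):
        plan = optimization(plan)
    return pattern, plan
```

Because the ordering lives in the recipe rather than in a fixed pass pipeline, each pattern gets its own composition, and the input is analyzed only once before any transformation is applied.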



#### **Optimization Workflow**



#### Specialized optimization for dense-matrix kernels

- Applied to 15 BLAS kernels and 15 applications in SPLASH-2
- Kernel performance comparable to manual assembly programming

#### next paper

## **A Machine-Learning Supercomputer**

Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, Olivier Temam





## DNN: State-of-the-art Machine-Learning Algorithm















A Supercomputer for DNNs w/o memory access!

## **Experimental Results**






## See you in session 7

#### next paper



# **B-Fetch:** Branch Prediction Directed Prefetching for Chip-Multiprocessors

#### David Kadjo<sup>1</sup>, Jinchun Kim<sup>1</sup>, Prabal Sharma<sup>2</sup>, Reena Panda<sup>3</sup>, Paul V. Gratz<sup>1</sup>, Daniel Jiménez<sup>4</sup>

<sup>1</sup> Department of Electrical and Computer Engineering, Texas A&M University
<sup>2</sup> Samsung R&D, Austin
<sup>3</sup> Department of Electrical and Computer Engineering, University of Texas, Austin
<sup>4</sup> Department of Computer Science & Engineering, Texas A&M University




## Decisions while cooking change the flavor of the turkey


#### **B-Fetch:** Prefetching based on Branch Prediction



- Path Speculation and Effective Address Speculation
  - Path speculation based on branch lookahead
  - Effective address speculation based on architectural register files

- Lightweight prefetcher that leverages branch prediction
  - Provides a 28% speedup over the baseline without data prefetching
  - 9% speedup and 65% less storage than SMS [Somogyi 2006]
  - Join the discussion at the Best Paper Session: Dec. 17<sup>th</sup> (Wednesday), 3 PM, Main Auditorium
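
The two speculation steps named above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the B-Fetch hardware design: the `BLOCK_LOADS` table, the predictor callback, and the register names are hypothetical. Path speculation follows the branch predictor a few basic blocks ahead; address speculation adds each recorded load offset to the current value of its base register.

```python
# Hypothetical table: loads observed in each basic block,
# recorded as (base register, offset) pairs.
BLOCK_LOADS = {
    "B0": [("r3", 0)],
    "B1": [("r3", 64), ("r5", 8)],
    "B2": [("r7", 0)],
}


def predicted_path(start, predict_next, depth):
    """Path speculation: follow the branch predictor for up to `depth` blocks."""
    path, block = [], start
    for _ in range(depth):
        block = predict_next(block)
        if block is None:  # the predictor has no target for this block
            break
        path.append(block)
    return path


def prefetch_candidates(start, predict_next, regs, depth=2):
    """Address speculation: base register value plus recorded offset."""
    addresses = []
    for block in predicted_path(start, predict_next, depth):
        for base, offset in BLOCK_LOADS.get(block, []):
            addresses.append(regs[base] + offset)
    return addresses
```

For example, with a predictor that maps `B0` to `B1` and `B1` to `B2`, starting at `B0` yields one candidate address per load recorded for `B1` and `B2`, computed from the register values at the time of the lookahead.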

#### next paper

## PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models

Daniel Lustig, Michael Pellauer, Margaret Martonosi

Princeton University, Intel VSSAD

Session 7 (Best Paper Nominees), Paper 3, Wed @ 4pm




# Motivation: Verify correctness of memory consistency model **implementation**








- Heterogeneity → even harder to verify!



## Architecture-Level Analyses Cannot Distinguish Between Implementations



From-reads

[Alglave, FMSD '12; Owens et al., TPHOLs '09]

- Arch.-level models cannot analyze/verify the behavior of:
  - Out-of-Order Execution
  - Speculative load reordering
  - Other μarch. optimizations
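
For context, an architecture-level check in the style of the cited work can be sketched as a cycle test over a graph of memory events. The example below is a minimal sketch assuming sequential consistency and a hand-built message-passing litmus test; the event names and edge lists are illustrative. It works purely on program-order, reads-from, and from-reads edges, which is why such an analysis says nothing about how a particular pipeline enforces the ordering.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for succ in graph[node]:
            if color[succ] == GREY:  # back edge: cycle found
                return True
            if color[succ] == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Message-passing litmus test.
# Thread 0: a: store x = 1; b: store y = 1
# Thread 1: c: load y;      d: load x
program_order = [("a", "b"), ("c", "d")]

# Outcome c reads 1, d reads 0: c reads from b, and d reads the
# initial value of x, so d comes before a (a from-reads edge).
stale_read = program_order + [("b", "c"), ("d", "a")]

# Outcome c reads 1, d reads 1: c reads from b, d reads from a.
fresh_read = program_order + [("b", "c"), ("a", "d")]
```

Under sequential consistency the cyclic graph `stale_read` describes a forbidden outcome, while the acyclic `fresh_read` is allowed.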

## **PipeCheck**: Verifying **Microarchitectural** Enforcement of Consistency Models



- Verify a given µarch w.r.t. architectural spec.
- Successes: fast automated verification; bugs found

#### next paper

## Equalizer: Dynamically Tuning GPU Resources for Efficient Execution

Ankit Sethia\*, Scott Mahlke

University of Michigan















## **Compute Intensive**




## **Memory Intensive**



## A large number of threads causes early saturation of some resources and under-utilization of others








# Kernels saturate one resource much faster than others

**Opportunity 1:** 

Boost bottleneck resource for performance improvement

**Opportunity 2:** 

Throttle under-utilized resources for energy savings


- Observe kernel's hardware requirements
- Modulate hardware through:
  - Core frequency
  - Memory frequency
  - Number of threads



- Calculate state of warps over window of cycles
- Request new hardware parameters
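
The observe-then-request loop can be sketched as follows. This is a hedged illustration, not Equalizer's actual decision logic: the sample format, the 0.5 threshold, the mode names, and the frequency levels are assumptions, and the thread-count knob is left out for brevity. Each sample in the window counts how many warps were waiting on memory and how many were waiting on compute.

```python
def classify_window(samples, threshold=0.5):
    """Classify a kernel from (memory_waiting, compute_waiting) warp counts."""
    memory = sum(m for m, _ in samples)
    compute = sum(c for _, c in samples)
    total = memory + compute
    if total == 0:
        return "balanced"
    if memory / total > threshold:
        return "memory"
    if compute / total > threshold:
        return "compute"
    return "balanced"


def request_parameters(kind, mode):
    """Pick frequency levels for the next window.

    mode "performance": boost the bottleneck resource.
    mode "energy": throttle the under-utilized resource.
    """
    params = {"core_freq": "nominal", "mem_freq": "nominal"}
    if kind == "memory":
        if mode == "performance":
            params["mem_freq"] = "high"
        else:
            params["core_freq"] = "low"
    elif kind == "compute":
        if mode == "performance":
            params["core_freq"] = "high"
        else:
            params["mem_freq"] = "low"
    return params
```

A memory-intensive window therefore raises the memory frequency in performance mode and lowers the core frequency in energy mode, mirroring the two opportunities listed earlier.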




Boosting bottleneck resources: 22% speedup, 6% energy overhead

Throttling under-utilized resources: 15% energy savings, 5% speedup

#### next paper



## **COMP: Compiler Optimization for Manycore Processors**

Linhai Song<sup>1</sup>, Min Feng<sup>2</sup>, Nishkam Ravi<sup>3</sup>, Yi Yang<sup>2</sup>, and Srimat Chakradhar<sup>2</sup>

<sup>1</sup>University of Wisconsin-Madison, <sup>2</sup>NEC Laboratories America, <sup>3</sup>Cloudera Inc.


- We are entering the manycore era
  - Intel Xeon Phi, Tilera processors, etc.
  - Used as coprocessors







- 4 threads per core × 61 cores = 244 hardware threads



## **Performance Bottleneck: Data Transfer Overhead**





## **Our Compiler Optimizations**

#### Data streaming

- Automatically overlap data transfers and computations to reduce data transfer overhead
- Designed to minimize the device memory usage while maximizing the performance
- Avoid the overhead of launching the same kernel multiple times
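
To see why overlapping transfers with computation helps, consider an idealized two-stage pipeline in which the data is split into equal blocks and the next block is transferred while the current one is being processed. The sketch below is a back-of-the-envelope model, not the COMP implementation; the function names and the uniform per-block costs are assumptions.

```python
def time_without_streaming(n_blocks, transfer, compute):
    """Transfer everything first, then compute on everything."""
    return n_blocks * transfer + n_blocks * compute


def time_with_streaming(n_blocks, transfer, compute):
    """Block i+1 is transferred while block i is being computed.

    The first transfer and the last computation cannot be overlapped;
    in between, the slower of the two stages sets the pace.
    """
    if n_blocks == 0:
        return 0
    return transfer + (n_blocks - 1) * max(transfer, compute) + compute
```

With 4 blocks, a transfer cost of 2, and a compute cost of 3 per block (arbitrary time units), the model gives 20 without streaming and 14 with it. The benefit is largest when the two costs are close to each other.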

#### Regularization

- Enable data streaming in the presence of irregular accesses
- Eliminate unnecessary data transfer
- Improve vectorization and locality

#### New shared memory mechanism

- Designed to quickly transfer pointer-based data structures between host and device



## **Evaluation**



Our optimizations benefit 9 out of 12 benchmarks, with speedups ranging from 1.16× to 52.21×.

## end
