

#### QUALITY PROGRAMMABLE VECTOR PROCESSORS FOR APPROXIMATE COMPUTING

**Swagath Venkataramani<sup>1</sup>,** Vinay Chippa<sup>1</sup>, Srimat Chakradhar<sup>2</sup>, Kaushik Roy<sup>1</sup>, Anand Raghunathan<sup>1</sup>

<sup>1</sup>Integrated Systems Laboratory, School of ECE, Purdue University
<sup>2</sup>NEC Laboratories America



Computers viewed as precise calculators




Leads to inefficiency





#### Relaxed notion of correctness

- Results cannot be arbitrary either

Good enough answers !!!









# SAGE: Self-Tuning Approximation for Graphics Engines

Mehrzad Samadi<sup>1</sup>, Janghaeng Lee<sup>1</sup>, D. Anoushe Jamshidi<sup>1</sup>, Amir Hormati<sup>2</sup>, and Scott Mahlke<sup>1</sup>

University of Michigan<sup>1</sup>, Google Inc.<sup>2</sup>





# GPU Specific Approximation

## Goal: Hardware-aware approximation








Tuning Parameters




# We Can Control Output Quality.






- 2.5x speedup with 90% output quality
- 2.0x speedup with 95% output quality
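The trade-off above can be read as a selection problem: given a target output quality, pick the fastest configuration that still meets it. The following is only a minimal sketch of that idea, not SAGE's actual tuning procedure. The two approximate points come from the slide; the exact baseline entry and the function name `pick_config` are assumptions added for illustration.

```python
# (speedup, output quality in %) pairs. The last two come from the slide;
# the exact baseline (1.0x, 100%) is an assumption added for completeness.
CONFIGS = [(1.0, 100.0), (2.0, 95.0), (2.5, 90.0)]


def pick_config(configs, target_quality):
    """Return the fastest configuration whose quality meets the target."""
    acceptable = [c for c in configs if c[1] >= target_quality]
    # max() raises ValueError if no configuration is acceptable.
    return max(acceptable, key=lambda c: c[0])


if __name__ == "__main__":
    for target in (99.0, 95.0, 90.0):
        print(target, pick_config(CONFIGS, target))
```

A stricter quality target simply shrinks the set of acceptable configurations, so the achievable speedup can only stay the same or drop.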

# Approximate Storage in Solid-State Memories

Adrian Sampson Jacob Nelson Karin Strauss Luis Ceze

University of Washington & Microsoft Research

(Screenshot: smartphone storage usage screen, 34.6 GB used; per-app entries include TripAdvisor at 593 MB and Keynote at 389 MB)








# **70%** faster writes

# **23%** lifetime extension


# Approximate Storage in Solid-State Memories right here, after lunch

# MLP-Aware Dynamic Instruction Window Resizing for Adaptively Exploiting Both ILP and MLP

Yuya Kora Kyohei Yamaguchi Hideki Ando

Nagoya University

# Problem to Solve

- Difficult to improve single-thread performance in memory-intensive programs
  - Memory wall
- Very large instruction window can overcome this problem by exploiting MLP
  - This degrades the clock cycle time
  - While pipelining can solve this, it instead prevents ILP exploitation, degrading IPC in compute-intensive programs

# Dynamic Instruction Window Resizing

- Adapt window size to available parallelism – ILP or MLP
  - Based on prediction
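One way to picture prediction-based resizing is a small state machine that grows the window when memory-level parallelism looks likely and shrinks it when it does not. The sketch below is hypothetical: using a last-level cache miss as the MLP predictor, the number of window levels, and the `shrink_after` threshold are all assumptions for illustration, not details taken from the slide.

```python
def resize_window(level, llc_miss, quiet_cycles, max_level=3, shrink_after=1000):
    """Advance the (hypothetical) resizing policy by one cycle.

    level        -- current window size level (0 = smallest, fastest window)
    llc_miss     -- True if a last-level cache miss was observed this cycle
    quiet_cycles -- cycles elapsed since the last miss or resize
    Returns (new_level, new_quiet_cycles).
    """
    if llc_miss:
        # A miss suggests MLP is available: enlarge the window.
        return min(level + 1, max_level), 0
    quiet_cycles += 1
    if quiet_cycles >= shrink_after and level > 0:
        # No misses for a while: favor ILP with a smaller, faster window.
        return level - 1, 0
    return level, quiet_cycles


if __name__ == "__main__":
    print(resize_window(0, True, 5))     # grows to level 1
    print(resize_window(2, False, 999))  # shrinks to level 1
```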



**GM** memory-intensive



**GM** compute-intensive



#### 21% speedup on average

#### TLC: A Tag-Less Cache for Reducing Dynamic First Level Cache Energy

Andreas Sembrant, Erik Hagersten, David Black-Schaffer Uppsala University, Sweden

14:00 Session 1B - Energy Optimizations [Alpha Gamma Rho Room]



#### Problem: L1D consumes energy due to tags and ways






#### Solution: extend the TLB to eliminate tags and find the way



# Results

## Reduce total L1D dynamic energy by 78%

#### 1. Eliminate extra data-array reads

by determining the correct way from the TLB

#### **2.** Eliminate the tag-array

by avoiding tag comparisons

#### 3. Filter out cache misses

by checking in the eTLB

#### 4. Amortize the TLB lookup energy

by integrating it with way information


#### More cool stuff in the presentation:

μPages, synonyms, coherency, replacements, ...


# **Decoupled Compressed Cache:**

Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Somayeh Sardashti and David A. Wood University of Wisconsin-Madison





## **Cache as Energy Filters**



#### **Main Memory**



## Why not double the LLC?

2X LLC Area!





# State of the Art: Compressed Cache

Compacting compressed blocks in the same data space

- ✓ High Effective Cache Capacity
- ✓ Small Area Overhead









**Decoupled Super-Blocks** 





#### **Non-Contiguous Sub-Blocks**









# Today at 1:30pm!



#### Exploiting GPU Peak-power and Performance Tradeoffs through Reduced Effective Pipeline Latency

Syed Gilani (AMD) <u>Nam Sung Kim</u> (UW-Madison) Michael Schulte (AMD Research)

# Problem

- Thread-level parallelism (TLP) limited by
  - Thread synchronization patterns
  - Memory access patterns
  - Data dependencies
  - Limited hardware resources
- Low TLP exposes pipeline latencies
  - Data-forwarding networks are power hungry

### Contributions

- Limited forwarding for a few recently executed instructions
- Reduce impact of pipeline latency on performance
  - Low voltage pipelines with negligible impact on performance
- Mean speedups of 23% (SP/Int) and 33% (DP) within the same power-budget





#### What are we solving?

- GPUs leverage massive multi-threading
  - Core of their latency-tolerance
- Per-thread cache capacity of modern CPUs/GPUs

| Intel           | IBM            | Oracle         | NVIDIA            |
|-----------------|----------------|----------------|-------------------|
| Core i7-4960x   | Power7         | UltraSparc T3  | Kepler GK110      |
| 32KB L1         | 32KB L1        | 8KB L1         | 48KB L1           |
| 2 threads/core  | 4 threads/core | 8 threads/core | 2K threads/core   |
| **16KB/thread** | **8KB/thread** | **1KB/thread** | <u>24B</u>/thread |

- Efficient caching becomes extremely challenging
  - Low cache hit rates
  - Low cache block reuse
  - Waste in off-chip bandwidth utilization
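The per-thread figures in the table follow from dividing the L1 size by the number of hardware threads per core. A short worked check, assuming "2K threads" means 2048 and 1 KB means 1024 bytes:

```python
KB = 1024

# (L1 size in bytes, hardware threads per core), taken from the table above.
PROCESSORS = {
    "Intel Core i7-4960x": (32 * KB, 2),
    "IBM Power7": (32 * KB, 4),
    "Oracle UltraSparc T3": (8 * KB, 8),
    "NVIDIA Kepler GK110": (48 * KB, 2048),  # "2K threads" read as 2048
}


def bytes_per_thread(l1_bytes, threads_per_core):
    """L1 capacity available to each thread if the cache is shared evenly."""
    return l1_bytes // threads_per_core


PER_THREAD = {name: bytes_per_thread(*cfg) for name, cfg in PROCESSORS.items()}

if __name__ == "__main__":
    for name, capacity in PER_THREAD.items():
        print(f"{name}: {capacity} B/thread")
```

The GPU entry works out to 24 bytes per thread, less than a single 64-byte cache line, which is why block reuse is so hard to achieve.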








A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures (Session 2A – Power-Efficient GPUs)

#### Last level cache block reuse (temporal / spatial)

- Number of *repeated accesses* to cache blocks -- *temporal*
- Fraction of cache block data actually used -- *spatial*










BASE

LAMAR



#### How do we solve the problem?

- Predict optimal data fetching granularity
  - Coarse granularity (fetch *all* cache block-wide data)
  - Fine granularity (fetch just enough to service GPU core)
  - Reduce number of RD/WR commands to memory



(a) [Byte-traffic/num\_of\_instrs] (left axis) and its normalization to the baseline memory system (right axis). Legend: <u>Baseline</u> memory system vs. <u>Proposed</u> solution.



`while (x[tid]) { load ... }`







#### Scheduler





Divergence-Aware Warp Scheduling

#### Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt

# Speedup

# Within

# 4% 2.8x less instr

Today 4:30pm Conference Center Ballroom



#### Warped-GATES

Gating Aware Scheduling and Power Gating for GPGPUs

Mohammad Abdel-Majeed, Daniel Wong and Murali Annavaram

University of Southern California

- Scheduler greedily issues ready instructions
  - Agnostic to instruction type.












#### **Proposed Techniques**



- Gating Aware Scheduler (GATES)
  - Gives priority to same instruction type during scheduling.
  - Is able to increase the length of the idle periods.
  - Idle periods are not long enough to avoid negative savings!!
- Blackout technique
  - Eliminates negative savings by forcing the unit to stay in power gating state.
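To see why grouping instructions by type lengthens idle periods, consider a toy model with one issue slot per cycle and two unit types. This is a hedged sketch only: the function names, the single-issue assumption, and the static ready list are illustrative, and the Blackout technique is not modeled.

```python
def greedy_order(ready):
    """Baseline: issue ready instructions in arrival order, ignoring type."""
    return list(ready)


def gating_aware_order(ready):
    """Issue all ready instructions of one type before switching to the next.

    Types are served in order of first appearance; sorted() is stable, so
    instructions of the same type keep their relative order.
    """
    first_seen = {}
    for idx, kind in enumerate(ready):
        first_seen.setdefault(kind, idx)
    return sorted(ready, key=lambda kind: first_seen[kind])


def idle_runs(order, unit):
    """Lengths of consecutive cycles in which `unit` receives no instruction."""
    runs, current = [], 0
    for kind in order:
        if kind == unit:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


if __name__ == "__main__":
    ready = ["INT", "FP", "INT", "FP", "INT", "FP"]
    print(idle_runs(greedy_order(ready), "FP"))        # many short idle periods
    print(idle_runs(gating_aware_order(ready), "FP"))  # one longer idle period
```

In this toy case the FP unit goes from three one-cycle idle periods to a single three-cycle period, which is the kind of window a power-gating controller needs.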







#### Virtually Aged Sampling DMR: Unifying Circuit Failure Prediction and Detection

Raghuraman Balasubramanian Karthikeyan Sankaralingam



# Virtually Aged Sampling DMR









### µ-Processor Failure Prediction Technique

### **Comprehensive Coverage**

No Performance Overhead

Less than 0.7% Energy Overhead

Today @ 3:30PM Alpha Gamma Rho



## Use-it or Lose-it: Wearout and Lifetime in Future Chip-Multiprocessors



Hyungjun Kim,<sup>1</sup> Arseniy Vitkovsky,<sup>2</sup> Paul V. Gratz,<sup>1</sup> Vassos Soteriou<sup>2</sup>

 <sup>1</sup> Department of Electrical and Computer Engineering, Texas A&M University
 <sup>2</sup> Department of Electrical Engineering, Computer Engineering and Informatics, Cyprus University of Technology



### **Chip-multiprocessor Wearout**



## ITRS: Rates of wearout induced failure to increase 10X in 10 years

- HCI and NBTI: transistor slowdown with use

#### Wearout effects in CMPs:

#### Recoverable failures:

- 1) Core failure
  - Failure detection and remapping

#### Non-recoverable failures:

- 2) I/O device disconnection
  - Device unreachable
- 3) Network partition
  - Disruption of communication between cores
- 4) Individual link breakage
  - Deadlock potential

#### Interconnect critical point of failure



A 64-core Chip-Multiprocessor (CMP) with various peripherals interconnected via a 2-D Mesh, all failure scenarios illustrated

### Use it or Lose it



Analysis of real CMP workloads:

- Low loads in interconnect
- NBTI causes critical path slowdown
- *Lack* of load leads to interconnect breakdown and failure

The *Use it or Lose it,* wear-resistant router microarchitecture

- *Increases* utilization of router critical path
- 22x lifetime improvement!



Lifetime improvement of 8x8 CMP executing applications from the PARSEC benchmark suite

Session 2B (Alpha Gamma Rho Room) 4:00 PM today! *MICRO-46, 9<sup>th</sup> December 2013, Davis, California*



## uDIREC: Unified Diagnosis and Reconfiguration for Frugal Bypass of NoC Faults

#### **Ritesh Parikh and Valeria Bertacco**

**Electrical Engineering & Computer Science Department The University of Michigan, Ann Arbor** 

## Unified Diagnosis and Reconfiguration



X cannot resend

need to re-route around fault

Our contributions:

- Fault Diagnosis at fine granularity

- Integrated Reconfiguration to find new route:

#### **Diagnosis**

- End-to-end scheme in SW
- Based on analyzing faulty routes
- Passive and fine-grained



#### **Reconfiguration**

- Based on a novel routing algorithm
- Tightly integrated with the diagnosis scheme
- Unconstrained by number and location of faults



Faulty irregular network with deadlock-free routes

### **Reliability and Performance Benefits**

- Dedicated testing is not required  $\rightarrow$  no overhead in absence of errors
- Unified implementation in software  $\rightarrow$  low area overhead







## Implicit-Storing and Redundant-Encoding-of-Attribute Information in Error-Correction-Codes

Yiannakis Sazeides<sup>1</sup>, Emre Ozer<sup>2</sup>, Danny Kershaw<sup>3</sup>, <u>Panagiota Nikolaou</u><sup>1</sup>, Marios Kleanthous<sup>1</sup>, Jaume Abella<sup>4</sup>

<sup>1</sup>University of Cyprus, <sup>2</sup>ARM, <sup>3</sup>NXP, <sup>4</sup>Barcelona Supercomputing Center



MICRO 46, Davis, California, December 9th 2013





### **Implicit Storing (IS)**

Leverage error correction codes used for cache and memory data protection

- encode extra information
- without storing the information
- infer the information on reads

#### Based on error and erasure coding

Needs shortened codes: the number of protected data bits must be smaller than what the error correction code can protect

- Benefit: reduces area and energy with low performance overhead
- Drawback: hurts error correction code strength

### **Redundant Encoding of Attribute Information (REA)**

Exploit fine granularity of protection in caches and memory

- encode the same extra information in multiple codewords
- decode the extra information from multiple codewords

Needs the multiple codewords to be correlated

- Benefit: improves strength of the code that implicitly stores or tags extra info
- Drawback: not full strength recovery
- Benefit: minimal area, energy, and timing overheads

Several IS & REA uses: reliability, performance, security, energy




# Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

**Gennady Pekhimenko**, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry

## **Carnegie Mellon University**




- Main memory is a limited shared resource
- Observation: Significant data redundancy
- Old Idea: Compress data in main memory



- Problem: How to avoid inefficiency in address computation?
- <u>Solution</u>: Linearly Compressed Pages (LCP): fixed-size cache line granularity compression
  1. Increases capacity (62% on average)
  2. Decreases bandwidth consumption (24%)
  3. Improves overall performance (13.9%)
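The address-computation point can be illustrated with a small sketch. If every cache line in a page is compressed to the same size, locating line *i* is a single multiply-and-add, just as in an uncompressed page; with variable-size compressed lines it needs a sum over all earlier lines. The function names below are illustrative, and the sketch ignores lines that fail to compress to the target size.

```python
LINE_SIZE = 64  # uncompressed cache line size in bytes (from the slide)


def uncompressed_line_addr(page_base, line_index):
    """Address of a line in an ordinary, uncompressed page."""
    return page_base + line_index * LINE_SIZE


def lcp_line_addr(page_base, line_index, compressed_line_size):
    """Address of a line when all lines in the page share one compressed size."""
    return page_base + line_index * compressed_line_size


def variable_size_line_addr(page_base, line_index, line_sizes):
    """Contrast case: per-line sizes force a running sum over earlier lines."""
    return page_base + sum(line_sizes[:line_index])


if __name__ == "__main__":
    base = 0x1000
    print(hex(uncompressed_line_addr(base, 3)))
    print(hex(lcp_line_addr(base, 3, 16)))
    print(hex(variable_size_line_addr(base, 3, [10, 20, 30, 40])))
```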

| 64B | 64B | 64B | 64B | • • • | 64B |
|-----|-----|-----|-----|-------|-----|









## RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization



**Carnegie Mellon University** 

Intel Pittsburgh




#### Row Buffer



- Copy from source row to row buffer
- Copy from row buffer to destination row





- Very few changes to DRAM (0.01% increase in die area)
- End-to-end system design to exploit DRAM substrate
- Several applications that benefit from RowClone

## 8-Core System



# WHAT IS IR DROP?



# **KEY CONTRIBUTIONS**

1. A 3D memory package with few pins & TSVs
2. SPICE analysis to show voltage maps
3. Memory controller: what, when, where
4. Handle starvation
5. Place pages in favored regions



### **NOW PLAYING**

A PAPER ABOUT COST & VOLTAGE NOISE

UTAH, SAMSUNG, ARM

#### NOW PLAYING

WHERE ARE THE #&\$@ PERFORMANCE NUMBERS? TUESDAY 9:30am, SESSION 3A ?! I'LL BE THERE !



#### INTRIGUED TOM CONTE

# **Crank It Up or Dial It Down**

#### Coordinated Multiprocessor Frequency and Folding Control

<u>Augusto Vega</u>, Alper Buyuktosunoglu, Heather Hanson, Pradip Bose, Srinivasan Ramani

IBM T. J. Watson Research Center, IBM Systems & Technology Group

### **Executive Summary**

 Modern multi-core systems incorporate support for dynamic power management with multiple actuators



- Algorithms that control these actuators have evolved independently
  - Their independent operation can result in suboptimal decisions



We argue in favor of a coordinated control of these actuators to avoid potential conflicts in dynamic power management



#### Performance And Throughput Awareness



Operate power management *knobs* depending on whether an application's current execution phase is single-thread-performance bound or throughput bound

#### All turned-on cores are highly utilized



#### Some turned-on cores are highly utilized



#### All turned-on cores are lightly utilized or idle








PAMPA preserves

and PCPG



Funded, in part, by DARPA under HR0011-08-090001. The views, opinions, and/or findings contained in this presentation are those of the author/presenter and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

# Wavelength Stealing: An Opportunistic Approach to Channel Sharing in Multichip Photonic Interconnects

Arslan Zulfiqar (UW-Madison)

Pranay Koka (Oracle Labs)

Herb Schwetman (Oracle Labs)

Mikko Lipasti (UW-Madison)

Xuezhe Zheng (Oracle Labs)

Ashok Krishnamoorthy (Oracle Labs)



# **Problem:** What is the "best" topology design for photonic substrates?





# **Our Contributions**

- Analytical model to quantify the limits and gains of channel sharing
  - # of senders per channel $\leq 3$
  - Performance speedup $\leq 1.70x$
- "Wavelength Stealing" architecture
  - Arbitration-free accesses
  - Strong fairness guarantees
  - Up to 28% EDP improvement over baseline

# The High Cost of Data Movement

A significant and growing fraction of on-die energy is spent in data movement.

Long, capacitive interconnects consume most of the LLC access energy.



Shekhar Borkar, Journal of Lightwave Technology, 2013

**DESC: Energy-Efficient Data Exchange using Synchronized Counters** *Mahdi Nazm Bojnordi and Engin Ipek*

University of Rochester

# **Proposal: Time Based Data Transfer**

Key idea: represent information by the number of clock cycles between two consecutive pulses to reduce the interconnect activity factor.
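The key idea can be sketched as a pair of encode and decode functions over a one-wire trace, where `1` marks a pulse and `0` an idle cycle. This is a simplified illustration, not the DESC circuit: the 4-bit chunk width and the convention that the value equals the number of idle cycles between the two pulses are assumptions made here.

```python
def desc_encode(value, chunk_bits=4):
    """Encode `value` as a start pulse, `value` idle cycles, then an end pulse."""
    assert 0 <= value < (1 << chunk_bits)
    return [1] + [0] * value + [1]


def desc_decode(trace):
    """Recover the value by counting cycles between the first two pulses."""
    first = trace.index(1)
    second = trace.index(1, first + 1)
    return second - first - 1


if __name__ == "__main__":
    trace = desc_encode(13)
    print(trace, "->", desc_decode(trace))
    print("pulses on the wire:", sum(trace))
```

Every value costs exactly two pulses on the wire regardless of its bit pattern; the price is latency, since larger values take more cycles to transmit.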






# **Summary of Results**

DESC reduces LLC energy by 1.8x at the cost of a 2% increase in execution time.

DESC expands the Pareto frontier in energy-efficient cache design.


# Linearizing Irregular Memory Accesses for Improved Correlated Prefetching

Akanksha Jain, Calvin Lin

University of Texas at Austin















Irregular Prefetching

Regular Prefetching





|          | Previous Best | Our Solution |
|----------|---------------|--------------|
| Speedup  | 8.3%          | 23.1%        |
| Accuracy | 58.6%         | 93.7%        |





# RAS-Directed Instruction Prefetching (RDIP)

Aasheesh Kolli\* Ali Saidi<sup>+</sup> Thomas F. Wenisch\*

\* University of Michigan <sup>+</sup> ARM

MICRO-46 12/10/2013



### Why another instruction prefetcher?

- Poor I\$ behavior affects modern workloads
- Cache size constraints  $\rightarrow$  Prefetching



Our Goal: Low overhead, high accuracy instruction prefetcher






#### Contributions

- I\$ misses correlate to program context
- Program contexts are repetitive  $\rightarrow$  predictable
- RAS succinctly captures program context

### **RAS-Directed Instruction Prefetching (RDIP)**

### Performance improvement of 11.5%

### Overhead of 64kB (3X ↓)





# SHIFT: Shared History Instruction Fetch for Lean-Core Server Processors

### Cansu Kaynak, Boris Grot, Babak Falsafi



# Session 4A, Tuesday @ 11:30am







# Instruction fetch stalls: Major performance bottleneck

Up to 60% of server exec. time





















































## **Shared History Instruction Fetch**



- ✓ Preserves performance of per-core history
- ✓ At 1/14<sup>th</sup> of history size

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches

Daniel A. Jiménez, Texas A&M University

- LRU keeps blocks in a recency stack
  - *n*-way cache, 0 is MRU, *n*-1 is LRU
- When a block is inserted or promoted (used) it goes to the MRU position
  - Not always the best choice
- Instead, let's use the block's former position to indicate its new position
- We want to develop a new transition graph
- So we use a genetic algorithm to search the space of transition graphs
  - Fitness function is estimate of speedup
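The idea of letting a block's former position decide its new position can be sketched with an explicit recency stack and a position vector. This is a minimal illustration under stated assumptions: the vector layout (entry *i* gives the new position for a hit at position *i*, and the final entry gives the insertion position) and the class name `PositionVectorSet` are choices made here, and the sketch uses a full recency stack rather than tree-based PseudoLRU bits.

```python
class PositionVectorSet:
    """One cache set whose insertion/promotion is driven by a position vector."""

    def __init__(self, ways, vector):
        assert len(vector) == ways + 1
        assert all(0 <= p < ways for p in vector)
        self.ways = ways
        self.vector = vector
        self.stack = []  # index 0 is MRU, last index is LRU

    def access(self, tag):
        """Return True on a hit, False on a miss (after inserting the block)."""
        if tag in self.stack:
            old = self.stack.index(tag)
            self.stack.pop(old)
            # Promotion: the former position selects the new position.
            self.stack.insert(self.vector[old], tag)
            return True
        if len(self.stack) == self.ways:
            self.stack.pop()  # evict the block in the LRU position
        # Insertion: the last vector entry selects the insertion position.
        self.stack.insert(self.vector[self.ways], tag)
        return False


if __name__ == "__main__":
    lru = PositionVectorSet(4, [0, 0, 0, 0, 0])   # classic LRU
    lip = PositionVectorSet(4, [0, 0, 0, 0, 3])   # insert at the LRU position
    for tag in "abcde":
        lru.access(tag)
        lip.access(tag)
    print(lru.stack, lip.stack)
```

An all-zero vector reproduces classic LRU, while changing only the last entry yields LRU-position insertion; a search procedure such as the genetic algorithm mentioned above would explore the vectors in between.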

#### PseudoLRU instead of LRU

- This idea works just as well for tree-based PseudoLRU
- Use set-dueling to dynamically choose between policies
- Replacement policy consumes < 1 bit per block
- Performance comparable to state-of-the-art
  - 5.6% speedup over LRU on SPEC CPU 2006
  - 15.6% on a memory-intensive subset

Imbalanced Cache Partitioning for Balanced Data-Parallel Programs

Abhisek Pan & Vijay S. Pai, Purdue University

- Balanced data parallel programs need imbalance in allocation
- High imbalance helps both rewarded and penalized threads
- Prioritizing each thread in turn ensures balanced progress



## Two-Stage Partitioning Method

Evaluation Stage



- Divide cache sets into segments with different levels of imbalance
- Choose segment with lowest # of misses

#### Stable Stage

- Use chosen partition for the entire cache
- Choose preferred thread in round-robin
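The two stages can be sketched as follows. This is a hypothetical illustration: expressing the imbalance level as the number of ways given to the preferred thread, splitting the remaining ways evenly, and the function names are all assumptions made here rather than details from the slides.

```python
def evaluation_stage(misses_per_segment):
    """Pick the imbalance level whose sampled segment saw the fewest misses.

    misses_per_segment maps an imbalance level to the miss count observed
    in the cache segment that ran with that level.
    """
    return min(misses_per_segment, key=misses_per_segment.get)


def allocate_ways(total_ways, num_threads, preferred, favored_ways):
    """Give `favored_ways` to the preferred thread, split the rest evenly."""
    share, extra = divmod(total_ways - favored_ways, num_threads - 1)
    alloc = [share + (1 if i < extra else 0) for i in range(num_threads - 1)]
    alloc.insert(preferred, favored_ways)
    return alloc


def stable_stage(total_ways, num_threads, favored_ways, intervals):
    """Apply the chosen level to the whole cache, rotating the preferred thread."""
    return [
        allocate_ways(total_ways, num_threads, i % num_threads, favored_ways)
        for i in range(intervals)
    ]


if __name__ == "__main__":
    level = evaluation_stage({4: 900, 10: 700, 13: 750})
    for alloc in stable_stage(16, 4, level, 4):
        print(alloc)
```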

## Evaluation

- Partitioning is beneficial only when the per-thread working set lies between the default allocation and the cache capacity
- Improves upon the state-of-the-art runtime partitioning method in most such cases
  - 6% drop in execution time, 17% drop in misses for 8 MB cache with 4 cores
- Limited overheads in space (way-partitioning, phase detection) and time (evaluation stage)

# The Reuse Cache: Downsizing the Shared Last-Level Cache

Jorge Albericio<sup>1</sup>, Pablo Ibáñez<sup>2</sup>, Víctor Viñals<sup>2</sup>, and José M. Llabería<sup>3</sup>

#### Tuesday, Session 4B (10:30-12:00)







#### **Enabling Datacenter Servers to Scale Out Economically and Sustainably**

**Chao Li**, Yang Hu, Ruijin Zhou, Ming Liu, Longjun Liu, Jingling Yuan, Tao Li **University of Florida** 

As data sets become big, servers must scale out



Unfortunately, modern datacenter servers are both **power-constrained** and **carbon-constrained** 

**Distributed, incremental** green energy integration allows a datacenter to double its power capacity with zero emissions and 25% cost reduction



# **Oasis & Ozone**: A unified power provisioning framework for scale-out green datacenters



Welcome! (15:30-17:00 Session 5A - #1)





# Efficient Multiprogramming for Multicores with SCAF

Timothy Creech, Aparna Kotha, Rajeev Barua

University of Maryland, College Park, MD

- Scheduling and Allocation with Feedback
  - Runtime system for multiprogramming parallel processes
  - ~15% gains over equipartitioning
  - Targeting shared-memory systems


#### Problem?

- Parallel runtimes try to let the OS handle parallel multiprogramming
- **SCAF** runtime: automatic space-sharing
  - No porting, modification, recompilation
  - Policy: maximize sum of speedups



SCAF - Session 5A. Tues. 6pm


#### **Dynamic Allocation**

- Realtime feedback
- Reward efficient processes



#### Serial "Experiments"

 Estimate serial performance to reason about efficiency





## Allocating Rotating Registers by Scheduling

Hongbo Rong Cheng Wang Hyunchul Park Youfeng Wu

Intel Labs





### Rotating Registers for Alias Detection



- Given a software pipelined schedule of a loop, how to allocate registers?
  - Detect ALL aliases
  - No false positives
  - Minimal spilling
  - Minimal registers



#### Rotating Register Allocation = Scheduling!

- It is a software pipelining problem
  - A modulo schedule of lifetimes
- Contributions
  - Framework
  - A simple algorithm
  - Near-optimal results



# Multi-Grain Coherence Directories

#### **Dr. Jason Zebchuk** *Principal Engineer, Cavium Inc.*

Prof. Andreas Moshovos, *University of Toronto* Prof. Babak Falsafi, *EcoCloud, EPFL* 







### Multi-Grain Coherence Directory (MGD)

Conceptual MGD Directory:

- ✓ Dynamically refine granularity of entries
- ✓ 78% fewer directory entries (on average)

Practical MGD Directory:

- ✓ Limited number of fixed granularities
- ✓ 41% less area
- ✓ Robust performance
- ✓ No coherence protocol changes

## **BulkCommit: Scalable and Fast Commit of Atomic Blocks in a Lazy Multiprocessor Environment**

Xuehai Qian, Josep Torrellas (UIUC)

Benjamin Sahelices (Univ of Valladolid) and Depei Qian (Beihang Univ.)

- Problem:
  - Current atomic block (chunk) execution incurs unnecessary squashes
  - Atomic block commit operation in a lazy environment has sequential bottlenecks
- Our solution:
  - IntelliSquash: no squash on WAW-only conflict
  - IntelliCommit: parallel directory group formation


Thursday, December 5, 13


## IntelliSquash: No Squash on WAW-only Conflict

- Insight: WAW is a name dependence. It does not break semantic atomicity
- Similar to two conflicting stores from two processors
- If two chunks only have WAW conflicts, IntelliSquash serializes them without squash




## IntelliCommit: Parallel Directory Group Formation



- On chunk commit:
  - Processor sends commit requests to all the relevant directory modules
  - Directory module receives commit request:
    - Locks the memory lines
    - Responds with commit\_ack
  - Processor counts the number of commit\_acks received
  - Processor sends commit\_confirm when it receives the expected number of commit\_acks
- Challenge: resolving conflicts from two processors
- ChunkSort: ordering all the conflicting chunks in the same order in all relevant directories by preemption
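The conflict-free path of the commit handshake listed above can be sketched as a small simulation. It is illustrative only: the messages are processed sequentially rather than in parallel, the mapping from lines to home directories (`home_of`) is hypothetical, releasing the locks on `commit_confirm` is an assumption, and conflict resolution (ChunkSort) is not modeled.

```python
class DirectoryModule:
    """A directory module that locks lines on request and acknowledges."""

    def __init__(self):
        self.locked = set()

    def commit_request(self, lines):
        self.locked |= set(lines)  # lock the memory lines
        return "commit_ack"

    def commit_confirm(self, lines):
        # Assumption of this sketch: confirmation releases the locks.
        self.locked -= set(lines)


def commit_chunk(write_set, directories, home_of):
    """Commit a chunk; return True once all expected commit_acks arrived."""
    by_dir = {}
    for line in write_set:
        by_dir.setdefault(home_of(line), []).append(line)

    expected = len(by_dir)  # one commit_ack per relevant directory module
    acks = 0
    for dir_id, lines in by_dir.items():
        if directories[dir_id].commit_request(lines) == "commit_ack":
            acks += 1

    if acks != expected:
        return False
    for dir_id, lines in by_dir.items():
        directories[dir_id].commit_confirm(lines)
    return True


if __name__ == "__main__":
    dirs = [DirectoryModule() for _ in range(4)]
    print(commit_chunk({0, 1, 5, 9}, dirs, lambda line: line % 4))
```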






### Large-Reach Memory Management Unit Caches

Abhishek Bhattacharjee Rutgers University

International Symposium on Microarchitecture-46

December 2013



#### Address Translation and MMU Caches






#### Approaching an Ideal MMU Cache

- Intel i7: 8 cores, 8GB memory, 512-entry L2 TLB, and 8MB LLC




# GPU LLC Management for 3D Scene Rendering

Jayesh Gaur (Intel), Raghuram Srinivasan (Ohio State), Sreenivas Subramoney (Intel), Mainak Chaudhuri (IIT Kanpur)

## **GPU last-level cache interface**



Efficient management of the LLC shared between different 3D rendering streams

# Solution approach and results

- Inter- and intra-stream reuses in LLC
  - RT, TEX, Z are dominant in the LLC traffic
  - Significant reuse from RT production to TEX consumption (render to texture)
  - Intra-stream reuses vary across streams
- Learn intra- and inter-stream dynamic reuse probabilities from sample sets and modulate insertion/promotion in other sets



## **GPU Transactional Memory**





JORGE CHAM © 2012

## **Energy Efficient GPU Transactional Memory** via Space-Time Optimizations

Wilson Fung, Tor Aamodt (UBC)









#### Warp-Level Transaction Management







### Low conflict workloads: TM ≈ Fine-Grained Locks



# 65% 2X→1.3X Speedup Energy Usage

## Energy Efficient GPU Transactional Memory via Space-Time Optimizations Wednesday 10:15am, Section 6A










## Persistent Memory





#### Previous: With "Log"



#### Leave original data intact


#### Our Design: "Kiln"



Directly overwrite original data

#### speedup across 6 benchmarks


## **Details about "Kiln"**

Closing the Performance Gap Between Systems With and Without Persistence Support

| Author             | Affiliation             |
|--------------------|-------------------------|
| <u>Jishen Zhao</u> | Penn State              |
| Sheng Li           | HP Labs                 |
| Doe Hyun Yoon      | IBM Research            |
| Yuan Xie           | Penn State/AMD Research |
| Norm Jouppi        | Google                  |

### Poster Session: Tue (Dec. 10) 2–3 PM

### Presentation: Wed (Dec. 11) 9:45 AM, Session 6B (Alpha Gamma Rho Room)



MICRO 2013



#### Aegis: Partitioning Data Block for Efficient Recovery of Stuck-at-Faults in Phase Change Memory

#### Jie Fan, Jiwu Shu, Youhui Zhang, and Weimin Zheng

## Song Jiang









# **Stuck-at Faults in PCM**

- PCM has limited endurance.
- A stuck-at fault occurs when a memory cell can no longer change its value.
  - It is a major type of fault in PCM.
  - Such faults are permanent and accumulate.
  - Values in faulty cells can still be read.
- Inversion-based correction
  - Partition the data block into a number of groups and exploit the fact that stuck-at values are still readable (e.g., SAFER).
  - Each group can tolerate only one fault.
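Why one fault per group is the limit can be seen in a small sketch of inversion-based writing. The function names and the representation of faults as a `position -> stuck value` dictionary are assumptions for illustration, and the per-group inversion flag is assumed to live in a fault-free cell.

```python
def write_group(data, stuck):
    """Write one group of bits; return (stored_bits, inverted_flag).

    data:  list of 0/1 values the program wants to store.
    stuck: dict mapping a bit position to the value that cell is stuck at.
    """
    def store(bits):
        # A stuck cell keeps its stuck value regardless of what is written.
        return [stuck.get(i, b) for i, b in enumerate(bits)]

    stored = store(data)
    if stored == data:
        return stored, False          # stuck values happen to match the data
    flipped = [1 - b for b in data]
    stored = store(flipped)
    if stored == flipped:
        return stored, True           # inverting the group masks the fault
    raise ValueError("group has faults that inversion cannot mask")


def read_group(stored, inverted):
    """Undo the inversion, if any, to recover the original data."""
    return [1 - b for b in stored] if inverted else list(stored)
```

With a single stuck cell, either the data or its complement always agrees with the stuck value, so the write succeeds. With two stuck cells in one group the two may demand opposite choices, which is why separating faults into different groups matters.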

#### Proposal of an efficient partition scheme separating faults into different groups.

# **Illustration of Aegis Partition**



- PCM bits are laid out on an A×B Cartesian plane.
- Aegis considers all points on a line as a group.
- Any two bits that share a line will no longer share a line after Aegis changes the slope of the lines.
- Aegis distributes faults more evenly to tolerate more faults with lower overhead.
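The line-based grouping can be illustrated with modular arithmetic. The sketch below assumes that B is prime, that A ≤ B, and that a line with slope k is the set of points satisfying y = (k·x + c) mod B; these details and the function names are assumptions for the example, not taken from the slide.

```python
from itertools import combinations


def group_of(x, y, slope, B):
    """Group id (line intercept) of bit (x, y) for a given slope, modulo B."""
    return (y - slope * x) % B


def partition(A, B, slope):
    """Map each group id to the list of bit coordinates on that line."""
    groups = {}
    for x in range(A):
        for y in range(B):
            groups.setdefault(group_of(x, y, slope, B), []).append((x, y))
    return groups


def shared_slopes(p, q, B):
    """All slopes under which bits p and q fall into the same group."""
    return [k for k in range(B) if group_of(*p, k, B) == group_of(*q, k, B)]


# Brute-force check on a small 4x5 plane: no pair of bits shares a group
# under more than one slope, so changing the slope always separates them.
A, B = 4, 5
points = [(x, y) for x in range(A) for y in range(B)]
worst = max(len(shared_slopes(p, q, B)) for p, q in combinations(points, 2))
```

Under these assumptions each slope yields B groups of A bits, and two faulty bits that collide in one partition are guaranteed to be separated by any other slope.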

## **Fine-grained Heterogeneity**

Traditional big.LITTLE Architecture



Transfer Overhead: ~20K cycles


Transfer Overhead: ~35 cycles









**Code repeats** (loops, functions)

Behavior repeats in the same program context



# Trace Based Switching For A Tightly Coupled Heterogeneous Core

Reduces energy consumption by **43%** more than the state of the art

Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, Scott Mahlke

Micro-46 December 2013

Full Talk: Session 7, Wednesday, 11<sup>th</sup> December 2013, 11 am




University of Michigan Electrical Engineering and Computer Science



#### HETEROGENEOUS SYSTEM COHERENCE for Integrated CPU-GPU Systems

**Jason Power\***, Arkaprava Basu\*, Junli Gu<sup>+</sup>, Sooraj Puthoor<sup>+</sup>, Bradford M Beckmann<sup>+</sup>, Mark D Hill<sup>\*+</sup>, Steven K Reinhardt<sup>+</sup>, David A Wood<sup>\*+</sup>

\*University of Wisconsin-Madison

<sup>+</sup>Advanced Micro Devices, Inc.










#### **CPU-GPU COHERENCE**





#### 4x slowdown






### HETEROGENEOUS SYSTEM COHERENCE AMD














#### Meet the Walkers: Accelerating Index Traversals for In-Memory Databases

#### Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, Parthasarathy Ranganathan

























# Index Lookup







### Index Lookups on General-Purpose OoO















# There is parallelism!













Low throughput

# Walkers

- ✓ Simple
- ✓ Parallel
- ✓ Programmable

[Figure: Throughput vs. Energy Efficiency; annotated values 3X and 5.5X]

# LAST TALK of the conference Wednesday, 12pm


