

MICRO-46, December 11, 2013



# Kiln: Closing the Performance Gap Between Systems With and Without Persistence Support

<u>Jishen Zhao</u> Sheng Li Doe Hyun Yoon Yuan Xie Norm Jouppi Penn State Univ. HP Labs IBM Research Penn State Univ./AMD Research Google













• Allow in-memory data structures to become permanent immediately



- Allow in-memory data structures to become permanent immediately
- Demonstrated 32x speedup compared with using storage devices [Condit+ SOSP'09, Volos+ ASPLOS'11, Coburn+ ASPLOS'11, Venkataraman+ FAST'11]

# Why Can't We Use Existing Systems on Persistent Memory?

Adapted from Katelin Bailey's description at INFLOW'13

# Why Can't We Use Existing Systems on Persistent Memory?

- Can't just use an in-memory system *No persistence support* 
  - In-place updates to nonvolatile media may be interrupted without completion in the event of power loss or system crashes
  - Interrupted writes could leave partially overwritten data or missing references

# Why Can't We Use Existing Systems on Persistent Memory?

- Can't just use an in-memory system No persistence support
  - In-place updates to nonvolatile media may be interrupted without completion in the event of power loss or system crashes
  - Interrupted writes could leave partially overwritten data or missing references
- Cant' just use a storage system *Not optimized for memory* 
  - Overhead from database or file system interfaces, which assume and are optimized for slow, block-addressable devices
  - Not optimized for fast, byte-addressable memory with a load/store interface

Persistence

Performance



High

**Persistence** 

(0 or 1)









A database or a file system



A database or a file system







ACID properties -- Guarantee transactions are processed reliably



Can contain multiple write requests

**ACID properties** -- Guarantee transactions are processed reliably



Can contain multiple write requests

**ACID properties** -- Guarantee transactions are processed reliably



**ACID properties** -- Guarantee transactions are processed reliably



Can contain multiple write requests

- Atomicity: A transaction is "all or nothing"
- **Consistency:** Take data from one consistent state to another



Can contain multiple write requests

- Atomicity: A transaction is "all or nothing"
- **Consistency:** Take data from one consistent state to another
- Isolation: Concurrent transactions appears to be one after another



Can contain multiple write requests

- Atomicity: A transaction is "all or nothing"
- **Consistency:** Take data from one consistent state to another
- Isolation: Concurrent transactions appears to be one after another
- **Durability:** Changes to data will remain across system boots



Can contain multiple write requests

- Atomicity: A transaction is "all or nothing"
- Consistency: Take data from one consistent state to another
- **Isolation:** Concurrent transactions appears to be one after another
- Durability: Changes to data will remain across Nonvolatility



Can contain multiple write requests

- Atomicity: A transaction is "all or nothing"
- Consistency: Take data from one consistent state to another
- Isolation: Concurrent transactions appears to Concurrency Control
- Durability: Changes to data will remain acrost Nonvolatility



Can contain multiple write requests

- Atomicity: A transaction is "all or nothing"
- Consistency: Take data from one consistent s Ordering of Writes
- Isolation: Concurrent transactions appears to Concurrency Control
- Durability: Changes to data will remain acrost Nonvolatility



Can contain multiple write requests



- Consistency: Take data from one consistent s Ordering of Writes
- Isolation: Concurrent transactions appears to Concurrency Control
- Durability: Changes to data will remain acrost Nonvolatility



• At least two versions



- At least two versions
- Update one version at a time



- At least two versions
- Update one version at a time
- Leave the other version intact



- At least two versions
- Update one version at a time
- Leave the other version intact



- At least two versions
- Update one version at a time
- Leave the other version intact

**ACID** maintained in different manners

#### **ACID** maintained in different manners

• ACI in an integrated mechanism: e.g., transactional memory

#### **ACID** maintained in different manners

- ACI in an integrated mechanism: e.g., transactional memory
- Separate A and D from ACID: e.g., failure-atomic msync() [Park+ Eurosys'13]

#### **ACID** maintained in different manners

- ACI in an integrated mechanism: e.g., transactional memory
- Separate A and D from ACID: e.g., failure-atomic msync() [Park+ Eurosys'13]

Our design

#### **ACID** maintained in different manners

- ACI in an integrated mechanism: e.g., transactional memory
- Separate A and D from ACID: e.g., failure-atomic msync() [Park+ Eurosys'13]

#### Our design

- Hardware maintains A and D
- Hardware preserves the consistency semantics defined by software
   Ordering of writes defined by software

#### **ACID** maintained in different manners

- ACI in an integrated mechanism: e.g., transactional memory
- Separate A and D from ACID: e.g., failure-atomic msync() [Park+ Eurosys'13]

#### Our design

- Hardware maintains A and D
- Hardware preserves the consistency semantics defined by software
   Ordering of writes defined by software
- Software maintains isolation with concurrency control mechanisms

#### **ACID** maintained in different manners

- ACI in an integrated mechanism: e.g., transactional memory
- Separate A and D from ACID: e.g., failure-atomic msync() [Park+ Eurosys'13]

#### Our design

- Hardware maintains A and D
- Hardware preserves the consistency semantics defined by software
   Ordering of writes defined by software
- Software maintains isolation with concurrency control mechanisms

#### Hardware portion of persistent memory design effectively maintains multiversioning and preserves ordering

# **Overheads** of multiversioning and ordering by software methods

#### **Overheads** of multiversioning and ordering by software methods

#### How to optimize

persistent memory design with hardware support

**Design strategy:** take a storage system mechanism, optimize its performance

Processor

NVRAM Memory (Persistent)

**Design strategy:** take a storage system mechanism, optimize its performance

|                           |  | Proc | esso | r |  |  |  |
|---------------------------|--|------|------|---|--|--|--|
| NVRAM Memory (Persistent) |  |      |      |   |  |  |  |
|                           |  |      |      |   |  |  |  |

#### • Multiversioning Updates do not overwrite original data

- Logging [Volos+ ASPLOS'11, Coburn+ ASPLOS'11]
- Copy-on-write [Condit+ SOSP'09, Venkataraman+ FAST'11]

(Atomic 8-byte writes [Condit+ SOSP'09, Volos+ ASPLOS'11, Coburn+ ASPLOS'11])

**Design strategy:** take a storage system mechanism, optimize its performance



#### Multiversioning Updates do not overwrite original data

- Logging [Volos+ ASPLOS'11, Coburn+ ASPLOS'11]
- Copy-on-write [Condit+ SOSP'09, Venkataraman+ FAST'11]

(Atomic 8-byte writes [Condit+ SOSP'09, Volos+ ASPLOS'11, Coburn+ ASPLOS'11])

**Design strategy:** take a storage system mechanism, optimize its performance

Logs, temporary buffers, or newly created versions



#### Multiversioning Updates do not overwrite original data

- Logging [Volos+ ASPLOS'11, Coburn+ ASPLOS'11]
- Copy-on-write [Condit+ SOSP'09, Venkataraman+ FAST'11]

(Atomic 8-byte writes [Condit+ SOSP'09, Volos+ ASPLOS'11, Coburn+ ASPLOS'11])



#### Multiversioning Updates do not overwrite original data

- Logging [Volos+ ASPLOS'11, Coburn+ ASPLOS'11]
- Copy-on-write [Condit+ SOSP'09, Venkataraman+ FAST'11] (Atomic 8-byte writes [Condit+ SOSP'09, Volos+ ASPLOS'11, Coburn+ ASPLOS'11])
- Ordering Restrict the ordering of writes arriving at memory
  - Software employs cache flush, memory fence, or uncacheable writes [Volos+ ASPLOS'11, Venkataraman+ FAST'11]
  - Software sets epoch barriers, hardware manages epoch-based writes [Condit+ SOSP'09, Coburn+ ASPLOS'11]

#### **Persistent Memory**

- Updates do not overwrite original data
- Restrict the ordering of writes crossing the memory bus

#### **Persistent Memory**

- Updates do not overwrite original data
- Restrict the ordering of writes crossing the memory bus

Native (Use NVRAM, but no persistence)

#### **Persistent Memory**

- Updates do not overwrite original data
- Restrict the ordering of writes crossing the memory bus

#### Native (Use NVRAM, but no persistence)

- Updates directly overwrite original data
- Writes reordered by cache writebacks/memory controllers

#### **Persistent Memory**

- Updates do not overwrite original data
- Restrict the ordering of writes crossing the memory bus

#### Native (Use NVRAM, but no persistence)

- Updates directly overwrite original data
- Writes reordered by cache writebacks/memory controllers

Ideal Performance

#### **Persistent Memory**

- Updates do not overwrite original data
- Restrict the ordering of writes crossing the memory bus

#### Native (Use NVRAM, but no persistence)

- Updates directly overwrite original data
- Writes reordered by cache writebacks/memory controllers

Ideal Performance

Benchmarks: inserts/deletes to B+ tree, hash table, red-black tree, array, and scalable large graph





- Explicitly duplicate data in memory
  - By executing logging or copy-on-write instructions



#### • Explicitly duplicate data in memory

- By executing logging or copy-on-write instructions



#### • Explicitly duplicate data in memory

- By executing logging or copy-on-write instructions



- Explicitly duplicate data in memory
  - By executing logging or copy-on-write instructions
- Two memory stores for one data update
  - Increase memory activities
  - Saturate memory bandwidth



- Explicitly duplicate data in memory
  - By executing logging or copy-on-write instructions
- Two memory stores for one data update
  - Increase memory activities
  - Saturate memory bandwidth





- Explicitly duplicate data in memory
  - By executing logging or copy-on-write instructions
- Two memory stores for one data update
  - Increase memory activities
  - Saturate memory bandwidth

#### • In-order updates

- Forced all the way down to main memory
- Cancel out the effect of processor's reordering optimizations



## **Overheads** of multiversioning and ordering by software methods

#### How to optimize

persistent memory design with hardware support

#### Software's view of memory system

- Aflat address space
- Pages

|     | 0x00000000 |
|-----|------------|
|     |            |
|     |            |
|     |            |
|     |            |
|     |            |
|     |            |
|     |            |
|     |            |
|     |            |
| ••• |            |
|     |            |
|     |            |
|     |            |
|     |            |
|     | OxOOffffff |

Page belonging to process

#### Software's view of memory system

- Aflat address space
- Pages



#### Hardware's view of memory system

- -Not flat, a hierarchy
- Cache lines



Page belonging to process

#### Software's view of memory system

- Aflat address space
- Pages



#### Hardware's view of memory system

- -Not flat, a hierarchy
- Cache lines



Page belonging to process

#### Software's view of memory system

- A flat address space

...

- Pages

# 0x00000000

#### Hardware's view of memory system

- Not flat, a hierarchy
- Cache lines



## Multiversioning: Leveraging Caching for In-place Updates

## Multiversioning: Leveraging Caching for In-place Updates

• How does a write-back cache work?

### How does a write-back cache work?

- A processor writes a value
- Write to L1 cache only
- Old values remain in lower levels
- Until the new value gets evicted

### How does a write-back cache work?

- A processor writes a value
- Write to L1 cache only
- Old values remain in lower levels
- Until the new value gets evicted
- A multiversioned system by nature

- How does a write-back cache work?
  - A processor writes a value
  - Write to L1 cache only
  - Old values remain in lower levels
  - Until the new value gets evicted
- A multiversioned system by nature



- How does a write-back cache work?
  - A processor writes a value
  - Write to L1 cache only
  - Old values remain in lower levels
  - Until the new value gets evicted
- A multiversioned system by nature



- How does a write-back cache work?
  - A processor writes a value
  - Write to L1 cache only
  - Old values remain in lower levels
  - Until the new value gets evicted
- A multiversioned system by nature



### How does a write-back cache work?

- A processor writes a value
- Write to L1 cache only
- Old values remain in lower levels
- Until the new value gets evicted
- A multiversioned system by nature
  - Allow updates to directly overwrite original data
  - No need for logging or copy-on-write

#### Native

• Updates overwrite original data





### • Out-of-order writes to NVRAM LLC

- hardware remembers the committing state of each cache line in NVRAM LLC

### • Out-of-order writes to NVRAM LLC

- hardware remembers the committing state of each cache line in NVRAM LLC
- In-order commits of transactions

- Out-of-order writes to NVRAM LLC
  - hardware remembers the committing state of each cache line in NVRAM LLC
- In-order commits of transactions
  - Example:  $T_A$  before  $T_B$





- Out-of-order writes to NVRAM LLC
  - hardware remembers the committing state of each cache line in NVRAM LLC
- In-order commits of transactions
  - Example:  $T_A$  before  $T_B$
  - $T_B$  will not commit until A'<sub>2</sub> arrives in NVRAM LLC



- Out-of-order writes to NVRAM LLC
  - hardware remembers the committing state of each cache line in NVRAM LLC
- In-order commits of transactions
  - Example:  $T_A$  before  $T_B$
  - T<sub>B</sub> will not commit until A'<sub>2</sub> arrives in NVRAM LLC
- Commit a transaction
  - Flush higher-level caches -- very fast
  - Change cache line states in NVRAM LLC

 $T_A$  updates {A<sub>1</sub>, A<sub>2</sub>, A<sub>3</sub>}  $T_{B}$  updates {B<sub>1</sub>, B<sub>2</sub>} Higher-level Caches Cache Out-of-order Flush A'<sub>3</sub>, B'<sub>2</sub>, A'<sub>1</sub>, B'<sub>1</sub>, A'<sub>2</sub> **NVRAM LLC NVRAM Memory** 

## Code Examples Before and After Using Our Design

## Code Examples Before and After Using Our Design



# Code Examples

### **Before and After Using Our Design**



- Updates directly overwrite original data
- Reordered memory requests

- Updates directly overwrite original data
- Reordered memory requests



### **Conventional Design**

- Updates do not overwrite original data
- Restrict ordering of writes crossing the memory bus

- Updates directly overwrite original data
- Reordered memory requests

### **Our Design**

- Updates directly overwrite original data
- Restrict ordering of writes arriving at LLC

### **Conventional Design**

Persistence

- Updates do not overwrite original data
- Restrict ordering of writes crossing the memory bus

Leverage hardware support for performance optimizations

- Updates directly overwrite original data
- Reordered memory requests

### **Our Design**

- Updates directly overwrite original data
- Restrict ordering of writes arriving at LLC

### **Conventional Design**

- Updates do not overwrite original data
- Restrict ordering of writes crossing the memory bus

Leverage hardware support for performance optimizations

Address implementation challenges

Persistence

- Updates directly overwrite original data
- Reordered memory requests

### **Our Design**

- Updates directly overwrite original data
- Restrict ordering of writes arriving at LLC

Persistence

- Conventional DesignUpdates do not overwrite
  - original data
  - Restrict ordering of writes crossing the memory bus

Leverage hardware support for performance optimizations

Multiversioning, really?

Address implementation challenges

#### • Why

- Prevents early eviction of uncommitted transactions
- Avoid violating atomicity

$$T_{A} \text{ updates } \{A_{1}, A_{2}, A_{3}\}$$

$$NVRAM LLC$$

$$A' B' C'$$



#### • Why

- Prevents early eviction of uncommitted transactions
- Avoid violating atomicity



#### • Why

- Prevents early eviction of uncommitted transactions
- Avoid violating atomicity



#### • Why

- Prevents early eviction of uncommitted transactions
- Avoid violating atomicity
- How -- Rule of thumb
  - Keep uncommitted transactions in NVRAM LLC

### Implementation

- Selective replacement policy in NVRAM LLC
- Add transactions' committing information to traditional replacement policies



### • Why

- Prevents early eviction of uncommitted transactions
- Avoid violating atomicity
- How -- Rule of thumb
  - Keep uncommitted transactions in NVRAM LLC

### Implementation

- Selective replacement policy in NVRAM LLC
- Add transactions' committing information to traditional replacement policies
- What if
  - NVRAM LLC overflows

 $T_{A} updates \{A_{1}, A_{2}, A_{3}\}$  NVRAM LLC A' B' C'  $A'_{1}$  Crash BC NVRAM Memory

## Addressing NVRAM LLC Overflow

### • Detect overflows by hardware

- Continuously monitor transactions' forward progress
- Set time out limit to prevent deadlocks

### • Provide a fall-back path

(Copy-on-write with the overflowed transactions)

- Hardware notifies OS by interrupts
- OS allocates temporary data buffers in NVRAM memory





## Addressing NVRAM LLC Overflow

### • Detect overflows by hardware

- Continuously monitor transactions' forward progress
- Set time out limit to prevent deadlocks

### • Provide a fall-back path

(Copy-on-write with the overflowed transactions)

- Hardware notifies OS by interrupts
- OS allocates temporary data buffers in NVRAM memory





## Addressing NVRAM LLC Overflow

### • Detect overflows by hardware

- Continuously monitor transactions' forward progress
- Set time out limit to prevent deadlocks

### • Provide a fall-back path

(Copy-on-write with the overflowed transactions)

- Hardware notifies OS by interrupts
- OS allocates temporary data buffers in NVRAM memory



# **Experimental Setup**

- Simulator: McSimA+ [Ahn+, ISPASS'13] (modified)
- Configuration
  - Eight-core processor, 16 threads
  - Private caches: L1/L2, SRAM, 64KB/256KB per core
  - Shared last-level cache: L3, STT-MRAM, 64MB
  - Main memory: DRAM (2GB) + STT-MRAM (2GB)
- Benchmarks
  - Perform inserts/deletes to B+ tree, hash table, red-black tree, array, and scalable large graph data structures

No persistence

STT-MRAM L3: 64MB SRAM based L3: 16MB No persistence

















Log updates become persistent once they arrive at the NVRAM LLC Other results: performance, power, sensitivity studies.

### **Summary**

## NVRAM Persistent Memory (New design opportunity)



## **Summary**

### NVRAM Persistent Memory (New design opportunity)

- Proposed a hardware-based persistent memory design
- Leverage a multiversioned persistent memory hierarchy that directly overwrites original data



## **Summary**

### NVRAM Persistent Memory (New design opportunity)

- Proposed a hardware-based persistent memory design
- Leverage a multiversioned persistent memory hierarchy that directly overwrites original data
- Close the performance gap between persistent memory systems and the native system



# Why Name It "Kiln"?

"Kiln" was once used by ancient Mesopotamians to bake the clay tablets with temporary scripts and turn them into permanent records. We name our persistent memory design Kiln, because it is analogous to persistent memory which turns volatile data into permanent records.



# Thank You!

# Kiln: Closing the Performance Gap Between Systems With and Without Persistence Support

<u>Jishen Zhao</u> Sheng Li Doe Hyun Yoon Yuan Xie Norm Jouppi

PENNSTATE

Penn State Univ. HP Labs IBM Research Penn State Univ./AMD Research Google



MICRO-46, December 11, 2013