Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator

Jan van Lunteren, Christoph Hagleitner, Timothy Heil, Giora Biran, Uzi Shvadron, Kubilay Atasu
1. Objectives and Challenges
2. RegX Architecture Overview
3. Memory Hierarchy
4. Local Results Processor
5. Compiler
6. IBM PowerEN™
7. Hardware-Measured Performance
8. Conclusions
Objectives and Challenges

Design Objectives

- a regular-expression scanner and pattern compiler, supporting
  1. large sets of string and regular expression patterns (~10K)
  2. multiple active pattern contexts
  3. high scan rates (~20 Gbit/s)
  4. parallel scans, multi-session support (millions of active sessions)
  5. incremental and dynamic updates

Design Challenges

- efficient use of available memory capacity and bandwidth
  - compact data structure
  - minimize number of memory accesses to process each input character
  - optimize cache hit rate
- exploit limited amount of parallelism to obtain small session state
- fast compilation times
Pattern-scanner designs (SW, HW) are typically based on deterministic (DFA) or non-deterministic finite automata (NFA).

Pros/cons:

<table>
<thead>
<tr>
<th></th>
<th>DFA</th>
<th>NFA</th>
</tr>
</thead>
<tbody>
<tr>
<td>processing complexity</td>
<td>+</td>
<td>-</td>
</tr>
<tr>
<td>storage complexity</td>
<td>-</td>
<td>+</td>
</tr>
</tbody>
</table>

Many published designs/solutions involve a DFA-based scan operation in combination with techniques to optimize storage-efficiency, targeting:
- compact executable representations of given DFAs
- optimization of the DFAs themselves

Differentiating factor from related work: techniques applied in RegX enable a deterministic scan throughput that is independent of the input characteristics, when executing out of dedicated memory (caches)
- important for dealing with Denial of Service (DoS) attacks
RegX Architecture Overview

- Sample regular expression: \((a|b)*ab\)

- NFA: several possible next states can exist for given state and input
- DFA: at most one possible next state exists for each state and input
Certain combinations of regular-expression patterns can cause a "state-explosion" when mapped on a single DFA.

Example:

```
ab.*cd
ef[^\n]*gh
k.lm
```

DFA with 48 states, 242 transition rules.
Certain combinations of regular-expression patterns can cause a “state-explosion” when mapped on a single DFA.

Example:

```
ab.*cd
ef[^\n]*gh
k..lm
```

→ DFA with 96 states, 508 transitions
Certain combinations of regular-expression patterns can cause a "state-explosion" when mapped on a single DFA.

Example:

```
ab.*cd
ef[^\n]*gh
k...lm
```

DFA with 192 states, 1038 transitions.
Certain combinations of regular-expression patterns can cause a “state-explosion” when mapped on a single DFA.

**Example:**

<table>
<thead>
<tr>
<th>pattern</th>
<th>#states</th>
<th>#transitions</th>
</tr>
</thead>
<tbody>
<tr>
<td>k.lm</td>
<td>48</td>
<td>242</td>
</tr>
<tr>
<td>k..lm</td>
<td>96</td>
<td>508</td>
</tr>
<tr>
<td>k....lm</td>
<td>192</td>
<td>1038</td>
</tr>
<tr>
<td>k.....lm</td>
<td>384</td>
<td>2098</td>
</tr>
<tr>
<td>k.......lm</td>
<td>768</td>
<td>4218</td>
</tr>
<tr>
<td>k........lm</td>
<td>1536</td>
<td>8458</td>
</tr>
<tr>
<td>k...........lm</td>
<td>3072</td>
<td>16938</td>
</tr>
</tbody>
</table>
RegX Architecture Overview – *B-FSM*

**B-FSM**
- Programmable state machine in HW
- Deterministic rate of one transition per clock cycle @ > 2 GHz
- Storage grows approximately linearly with #transitions

⇒ *compact executable representation of given DFAs*
Parallel B-FSMs

- Enables “NFA-like” storage optimization: multiple parallel states/transitions
- Intelligent distribution of patterns over B-FSMs by compiler
  - separate combinations of patterns that cause “state explosions”
- Trade-offs: additional computation costs and memory accesses, larger session state
Example:

```
ab.*cd
ef[^\n]*gh
k.lm
```

→ DFA with 48 states, 242 transition rules
Example:

\[
\begin{align*}
ab.*cd \\
\text{ef[^\n]*gh} \\
\text{k.lm}
\end{align*}
\]

3 DFAs with 17 states, 34 transition rules

**DFA 1**

**DFA 2**

**DFA 3**
Local Result Processor (LRP)

- Storage reduction by off-loading problematic “NFA-states” from B-FSMs
  - split individual problematic patterns into simpler parts
  - LRP checks if parts are detected in the right order, distance, etc.
- LRP provides additional features

- Trade-offs: instructions consume storage, larger session state
Example:

```
ab.*cd
ef[^\n]*gh
k.lm
```

DFA with 48 states, 242 transition rules
Example:

\[ \text{ab} .* \text{cd} \]
\[ \text{ef}[^\n]* \text{gh} \]
\[ \text{k}\text{.lm} \]

DFA with 13 states, 12 transition rules, 7 instructions

DFA detects 7 simple patterns:
\[ \text{ab, cd, ef, \n, gh, k, lm} \]

LRP checks if patterns are found in the right order
- \[ \text{cd after ab} \] [bit b7]
- \[ \text{gh after ef} \] with no \n in between [bit b6]
Lane

- Minimum set of resources allocated for scanning an input stream
- Physical lane implements multiple logical lanes
  - time-interleaved processing of multiple input streams

Can sustain maximum scan rate of one input character per cycle when the B-FSMs process out of transition-rule caches without requiring back-pressure and independent of input characteristics
Data Engine

- Enqueues and schedules scan commands
- Fetches and transmits input data streams
- Controls storage and retrieval of scan state, writing of scan results

Algorithmic Engine

- Scan lanes
**Processor Bus Interface**

- Threads running on general-purpose core initiate scan by sending a scan command to RegX which includes pointers to
  - input data
  - compiled pattern set (context)
  - output buffer for storing scan results
Memory Hierarchy

Main Memory

PBus

Compiled data structures
Scanner session state

Data Fetch Unit
Search Unit
Command Queue
UM Handler

8 command units

PBus Interface

B-FSM 0
B-FSM 1
B-FSM 2
B-FSM 3

4 physical/8 logical lanes

input characters
instructions

Classifier
LRP
REG

B-FSM
SRAM
SRAM

L1 Rule Caches

MEM Hierarchy
Memory Hierarchy

Challenges

- High access latency to main memory (e.g., 400 cycles)
  - Performance targets require high L1 cache hit rate (>99%)
- Single-cycle access to L1 rule cache required to realize maximum lane scan rate of one input character per cycle
  - Limits number of tags and associativity that can be supported
  - Required to achieve high single-stream scan rates
- HW-managed caches do not provide good performance for large target workloads: e.g., two-way set associativity results in a large amount of cache trashing for multiple active pattern contexts
L1 Transition-Rule Cache

1. Locked area
   - managed in SW by Upload Manager (UM)
   - tagless: exposed to B-FSMs as addressable memory area
   - almost fully associative

2. Temporary area
   - 2-way set-associative cache managed in HW
   - caches non-locked transitions upon miss in locked area

Address translation

- Transitions to locked states are translated to refer directly to physical location in locked area

Upload Manager

- Selection and placement of the most frequently used memory lines (includes B-FSM hash function “adaptation”)
- Driven by a statistical profile of transition access patterns, produced by dedicated HW counters embedded within RegX
Local Result Processor

Features

- LRP handles eight instructions in parallel in each cycle
- B-FSM can dispatch two instructions per cycle
  - default instructions are triggered by selected character values
  - regular instructions are triggered by the detection of (sub)patterns in the input stream – these are attached to transitions in the DFA
- LRP can sustain peak scan rate when all B-FSMs execute out of the rule caches without requiring a back-pressure mechanism
Local Result Processor

**Instructions**

- Instructions operate on general-purpose and offset registers
  - examples: set, reset, load immediate, count, shift
  - conditions: selected byte is equal, has all/at least one set bit in common
- Match instructions report detected pattern information to application
  - efficient support for transfer of GPR content for post-processing in SW (SW Result Processor – SRP)
Local Result Processor

- Multiple instructions can operate in parallel on the same register
  - enables efficient allocation of bits to individual patterns
- Priority order when instructions operate on the same bits
  - set, reset, shift, increment (decreasing priority)
Autonomously Self-Running Instructions

- Self-running instructions enable the execution of selected operations (shift, count) for every input character in certain states
  - allows efficient measuring/testing of distance and length conditions
- Activated by setting self-running mode flag in GPR
  - instruction tag defines operation (normal GPR bits in regular mode)
- Single load operation can (de)activate and configure self-running instruction, and store initial (shift register/counter) contents
- Regular (conditional) instructions can manipulate and test GPR contents

➔ Concept allows flexible and dynamic allocation of register resources as variable-width counters and shift registers
Local Result Processor – *Self-Running Instructions*

```
+-----+-----+-----+-----+-----+-----+-----+-----+
| 15  | 14  | 12  | 11  | 0   |
+-----+-----+-----+-----+-----+-----+-----+-----+
```

**GPR**

- **self-running instruction tag**
- **self-running mode flag**

<table>
<thead>
<tr>
<th>bits 15-12</th>
<th>instruction</th>
<th>bits 15-12</th>
<th>instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xxxxb</td>
<td>nop</td>
<td></td>
<td></td>
</tr>
<tr>
<td>1000b</td>
<td>6-bit shift</td>
<td>1101b</td>
<td>30-bit shift</td>
</tr>
<tr>
<td>1001b</td>
<td>12-bit shift</td>
<td>1101b</td>
<td>36-bit shift</td>
</tr>
<tr>
<td>1010b</td>
<td>18-bit shift</td>
<td>1110b</td>
<td>8-bit counter</td>
</tr>
<tr>
<td>1011b</td>
<td>24-bit shift</td>
<td>1111b</td>
<td>12-bit counter</td>
</tr>
</tbody>
</table>
Local Result Processor – Self-Running Instructions

- Shift 6
- Shift 12
- Shift 18
- Shift 24
- Shift 30
- Shift 36

Insert '0'

GPR [n]  GPR [n+1]  GPR [n+2]
Local Result Processor – *Self-Running Instructions*

### Example

- **pattern** `k..lm`
- `k` triggers set
- `lm` triggers test

<table>
<thead>
<tr>
<th>Input Trace</th>
<th>Self-running Shift Register</th>
</tr>
</thead>
<tbody>
<tr>
<td>abckkalmlmdef</td>
<td>0</td>
</tr>
<tr>
<td>abckkalmlmdef</td>
<td>0</td>
</tr>
<tr>
<td>abckkalmlmdef</td>
<td>0</td>
</tr>
<tr>
<td>abckkalmlmdef</td>
<td>0</td>
</tr>
<tr>
<td>abckkalmlmdef</td>
<td>0</td>
</tr>
<tr>
<td>abckkalmlmdef</td>
<td>0</td>
</tr>
</tbody>
</table>

- **set**
- **test**
Pattern Compiler

- Converts pattern sets into DFAs
  - pattern splitting, LRP instruction generation and register allocation
- lane count selection
  - distribution of patterns over lanes and B-FSM engines
  - mapping of patterns on DFAs, attachment of LRP instructions

B-FSM Compiler

- Converts DFAs into executable B-FSM data structures
  - construction of linked hash table structures
  - state encoding and instruction integration
  - optimizations to obtain high compression

Incremental and dynamic pattern updates

- the architecture supports incremental updates of the pattern set
- internal scanner data structures can be updated dynamically without interrupting the ongoing scan operations
Example

- Storage reduction by exploiting lanes and LRP for publicly available
  “Application Layer Packet Classifier for Linux” (http://l7-filter.sourceforge.net/)
# IBM PowerEN™

<table>
<thead>
<tr>
<th>Technology</th>
<th>IBM 45nm SOI</th>
</tr>
</thead>
<tbody>
<tr>
<td>Core Frequency</td>
<td>2.3GHz @ 0.97V (Worst Case Process)</td>
</tr>
<tr>
<td>Chip size</td>
<td>428 mm² (including kerf)</td>
</tr>
<tr>
<td>Chip Power (4-AT node)</td>
<td>65W @ 2.0GHz, 0.85V Max Single Chip</td>
</tr>
<tr>
<td>Chip Power (1-AT node)</td>
<td>20W @ 1.4GHz, 0.77V Min Single Chip</td>
</tr>
<tr>
<td>Main Voltage (VDD)</td>
<td>0.7V to 1.1V</td>
</tr>
<tr>
<td>Metal Layers</td>
<td>11 Cu (3-1x, 2-1.3x, 3-2x, 1-4x, 2-10x)</td>
</tr>
<tr>
<td>Latch Count</td>
<td>3.2M</td>
</tr>
<tr>
<td>Transistor Count</td>
<td>1.43B</td>
</tr>
<tr>
<td>A2 Cores / Threads</td>
<td>16 / 64</td>
</tr>
<tr>
<td>L1 I &amp; D Cache</td>
<td>16 x (16KB + 16KB) SRAM</td>
</tr>
<tr>
<td>L2 Cache</td>
<td>4 x 2MB eDRAM</td>
</tr>
<tr>
<td>Hardware Accelerators</td>
<td>Crypto, Compression, RegX, XML</td>
</tr>
<tr>
<td>Intelligent Network</td>
<td>Host Ethernet Adapter/Packet Processor</td>
</tr>
<tr>
<td>Interfaces</td>
<td>2 Modes: Endpoint &amp; Network</td>
</tr>
<tr>
<td>Memory Bandwidth</td>
<td>2x DDR3 controllers</td>
</tr>
<tr>
<td></td>
<td>4 Channels @ 800-1600MHz</td>
</tr>
<tr>
<td>System I/O Bandwidth</td>
<td>4x 10G Ethernet, 2x PCI Gen2</td>
</tr>
<tr>
<td>Chip-to-Chip Bandwidth</td>
<td>3 Links, 20GB/s per link</td>
</tr>
<tr>
<td>Chip Scaling</td>
<td>4 Chip SMP</td>
</tr>
<tr>
<td>Package</td>
<td>50mm FCPBGA (4 or 6 layers)</td>
</tr>
</tbody>
</table>

Source: Johnson et al., “A wire-speed power™ processor: 2.3GHz 45nm SOI with 16 cores and 64 threads,” ISSCC 2010.
RegX accelerator

- Implementation
  - area: 15.4 mm²
  - clocked at 2.3 GHz (1.15 GHz)

- Lanes
  - 4 physical (16 B-FSMs)
  - 8 logical (32 logical B-FSMs)

- Memory
  - 32 KB rule cache per B-FSM
  - 512 KB rule cache in total
  - LRP register file: 8 x 16 bit, 128 bits total

- Peak scan rates (data structure fits entirely in rule caches)
  - lane: one byte/cycle = 18.4 Gbit/s
  - stream: one byte/2 cycles = 9.2 Gbit/s
  - theoretical peak rate for 4 lanes: 73.6 Gbit/s
HW Measured Performance – *Single Context / String Patterns*

![Graph showing Gbit/s and Rule Miss Rate vs. Patterns for different lane configurations.](image)

- **Gbit/s**
  - Blue diamonds: 1 Lane
  - Green squares: 2 Lanes
  - Red squares: 3 Lanes
  - Cyan triangles: 4 Lanes

- **Rule Miss Rate**
  - Blue diamonds: 1 Lane
  - Green squares: 2 Lanes
  - Red squares: 3 Lanes
  - Cyan triangles: 4 Lanes

Patterns range from 0 to 4000, and Gbit/s range from 0 to 50. Rule Miss Rate ranges from 0.0% to 2.0%.
HW Measured Performance – *Multiple Contexts*

- **String**
- **Regex**
Conclusions

RegX accelerator

- An architecture, implementation, compiler and upload manager were designed to realize scan rates in the range of 15-40 Gbit/s for typical intrusion detection workloads

- Novel (micro-)architectural features
  - Local Result Processor design supporting eight instructions fully in parallel and the concept of self-running instructions
  - transition-rule caches, including address translation and cache line placement by a SW application exploiting hardware based profiling
  - physical/logical lane concept involving a time-interleaved processing of multiple data streams by B-FSMs and LRPs, sustaining the maximum scan rate when processing out of the rule caches, without requiring a back-pressure mechanism and independently of input characteristics

Future work

- RegX is part of our research towards more general-purpose accelerators
  ➔ this project has shown us that basic building blocks of such an accelerator are feasible at clock frequencies beyond 2 GHz
For more information, please contact:

Jan van Lunteren (jvl@zurich.ibm.com)

IBM Research - Zurich
Säumerstrasse 4
CH-8803 Rüschlikon
Switzerland

Phone: +41 44 724 8111
Fax: +41 44 724 8911
Backup