Register Write Specialization Register Read Specialization:
A Path to Complexity-Effective Wide-Issue Superscalar Processors
Authors:
Abstract:
With the continuous shrinking of transistor size, processor
designers face new difficulties in achieving high
clock frequencies. The register file read time, the wake-up
and selection logic traversal delay, and the bypass network
transit delay, together with their respective power consumptions,
constitute major difficulties for the design of wide-issue
superscalar processors.
In this paper, we show that transgressing a rule that has
so far been applied in the design of all superscalar processors
reduces these difficulties. Currently used
general-purpose ISAs feature a single logical register file
(and generally a floating-point register file). Up to now, all
superscalar processors have allowed any general-purpose
functional unit to read and write any physical general-purpose
register.
First, we propose Register Write Specialization, i.e., forcing
distinct groups of functional units to write only to distinct
subsets of the physical register file, thus limiting the
number of write ports on each individual register. Register
Write Specialization significantly reduces the access time,
the power consumption, and the silicon area of the register
file without impairing performance.
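As an illustration only, the partitioning idea can be sketched in a few lines of Python. The class name, the contiguous mapping of registers to subsets, the sizes used (128 physical registers, 4 groups of 2 functional units), and the assumption that each functional unit needs one write port are all hypothetical choices made for this sketch; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSpecializedRegisterFile:
    """Toy model: each group of functional units writes only its own subset."""

    num_physical_regs: int  # hypothetical register file size
    num_groups: int         # groups of functional units (e.g. clusters)
    units_per_group: int    # result-producing units in each group

    def __post_init__(self) -> None:
        if self.num_groups <= 0 or self.units_per_group <= 0:
            raise ValueError("group and unit counts must be positive")
        if self.num_physical_regs <= 0 or self.num_physical_regs % self.num_groups:
            raise ValueError("registers must split evenly across groups")

    def owning_group(self, reg: int) -> int:
        """Return the only group allowed to write physical register `reg`.

        Assumption: subsets are contiguous, equally sized ranges.
        """
        if not 0 <= reg < self.num_physical_regs:
            raise ValueError(f"no such physical register: {reg}")
        return reg // (self.num_physical_regs // self.num_groups)

    def can_write(self, group: int, reg: int) -> bool:
        """True if a functional unit of `group` may write `reg`."""
        return self.owning_group(reg) == group

    def write_ports_per_register(self) -> int:
        """Assumption: one write port per unit that can reach the register."""
        return self.units_per_group


def conventional_write_ports(num_groups: int, units_per_group: int) -> int:
    """Without specialization, every unit can write every register."""
    return num_groups * units_per_group


rf = WriteSpecializedRegisterFile(num_physical_regs=128, num_groups=4, units_per_group=2)
print(rf.write_ports_per_register(), conventional_write_ports(4, 2))
```

Under these assumptions, each register needs as many write ports as there are units in its owning group, whereas an unrestricted design needs one per unit in the whole processor.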
Second, we propose to combine Register Write Specialization
with Register Read Specialization (WSRS) for clustered superscalar
processors. This limits the number of read ports
on each individual register and simplifies both the wake-up
logic and the bypass network. With an 8-way 4-cluster
WSRS architecture, the complexities of a wake-up logic
entry and of a bypass point are equivalent to those found in
a conventional 4-way issue processor. More physical registers
are needed in WSRS architectures. Nevertheless, using
a WSRS architecture allows a dramatic reduction of the total
silicon area devoted to the physical register file (by a factor
of four to six). Its power consumption is more than halved and
its read access time is shortened by one third. Some extra
hardware and/or a few extra pipeline stages are needed for
register renaming. The WSRS architecture induces constraints
on the policy for allocating instructions to clusters. However,
the performance of an 8-way 4-cluster WSRS architecture
is comparable to that of a conventional 8-way
4-cluster superscalar processor.