Register Write Specialization Register Read Specialization: A Path to Complexity-Effective Wide-Issue Superscalar Processors

Authors:

Abstract:

With the continuous shrinking of transistor size, processor designers are facing new difficulties in achieving high clock frequencies. The register file read time, the wake-up and selection logic traversal delay, and the bypass network transit delay, together with their respective power consumption, constitute major difficulties for the design of wide-issue superscalar processors.

In this paper, we show that transgressing a rule that has so far been applied in the design of all superscalar processors reduces these difficulties. Currently used general-purpose ISAs feature a single logical register file (and generally a floating-point register file). Up to now, all superscalar processors have allowed any general-purpose functional unit to read and write any physical general-purpose register.

First, we propose Register Write Specialization, i.e., forcing distinct groups of functional units to write only to distinct subsets of the physical register file, thus limiting the number of write ports on each individual register. Register Write Specialization significantly reduces the access time, the power consumption and the silicon area of the register file without impairing performance.
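As a rough illustration of the idea, the following Python sketch models a physical register file split into one subset per functional-unit group, with destination registers allocated only from the subset owned by the writing group. The class name, the method names, the even split of registers and units, and the assumption that each functional unit produces at most one result per cycle are all illustrative choices, not details taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class WriteSpecializedRegisterFile:
    """Toy model: physical registers split evenly, one subset per unit group."""
    num_registers: int
    num_groups: int
    free: list = field(init=False)

    def __post_init__(self):
        if self.num_registers % self.num_groups != 0:
            raise ValueError("registers must divide evenly among groups")
        size = self.num_registers // self.num_groups
        # free[g] holds the free physical registers that group g may write.
        self.free = [list(range(g * size, (g + 1) * size))
                     for g in range(self.num_groups)]

    def allocate(self, group):
        """Pick a destination register from the subset owned by `group`."""
        if not self.free[group]:
            return None  # subset exhausted (this toy model just reports it)
        return self.free[group].pop(0)

    def owner(self, reg):
        """Return the only group allowed to write physical register `reg`."""
        return reg // (self.num_registers // self.num_groups)


def write_ports_per_register(issue_width, num_groups):
    """Write ports needed on each register, assuming one result per unit per
    cycle and functional units split evenly among the groups."""
    return issue_width // num_groups


rf = WriteSpecializedRegisterFile(num_registers=128, num_groups=4)
dest = rf.allocate(group=2)
print(dest, rf.owner(dest))
print(write_ports_per_register(8, 1), write_ports_per_register(8, 4))
```

Under these assumptions, splitting an 8-wide machine into four groups lowers the write ports on each register from eight to two, which is the effect the paragraph above describes.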

Second, we propose to combine Register Write Specialization with Register Read Specialization for clustered superscalar processors. This limits the number of read ports on each individual register and simplifies both the wake-up logic and the bypass network. With an 8-way 4-cluster WSRS architecture, the complexities of the wake-up logic entry and bypass point are equivalent to those found in a conventional 4-way issue processor. More physical registers are needed in WSRS architectures. Nevertheless, using a WSRS architecture allows a dramatic reduction of the total silicon area devoted to the physical register file (by a factor of four to six). Its power consumption is more than halved and its read access time is shortened by one third. Some extra hardware and/or a few extra pipeline stages are needed for register renaming. The WSRS architecture induces constraints on the policy for allocating instructions to clusters. However, the performance of an 8-way 4-cluster WSRS architecture stands comparison with that of a conventional 8-way 4-cluster superscalar processor.
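To make the cluster-allocation constraint concrete, here is a minimal Python sketch of one possible read-specialization scheme for four clusters and four register subsets. The mapping tables and the function name are hypothetical and chosen only for illustration; the abstract does not specify the actual mapping. In this assumed scheme, the subset holding the first operand selects a cluster "row" and the subset holding the second operand selects a cluster "column", so each register subset is read by only some of the clusters.

```python
# Hypothetical read map (an assumption, not taken from the abstract):
# first-operand subset -> cluster row, second-operand subset -> cluster column.
FIRST_OPERAND_ROW = {0: 0, 1: 0, 2: 1, 3: 1}
SECOND_OPERAND_COL = {0: 0, 2: 0, 1: 1, 3: 1}


def eligible_clusters(first_subset, second_subset, commutative=False):
    """Return the clusters able to read both operands under the assumed map."""
    def cluster(a, b):
        return 2 * FIRST_OPERAND_ROW[a] + SECOND_OPERAND_COL[b]

    choices = {cluster(first_subset, second_subset)}
    if commutative:
        # Swapping the operands of a commutative operation may open
        # a second legal cluster.
        choices.add(cluster(second_subset, first_subset))
    return sorted(choices)


print(eligible_clusters(0, 3))
print(eligible_clusters(0, 3, commutative=True))
```

In this toy scheme the location of the source operands dictates where an instruction may execute, which illustrates why the allocation policy is constrained compared with a conventional clustered design.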