# Array-of-Arrays Architecture for Parallel Floating Point Multiplication

carry save additions in this architecture, as compared to 82% in a full array.

## 6: Conclusions

A new architectural style has been presented for the design of parallel floating point multipliers. The idea is to break the partial product addition into groups of sub-arrays and to combine the results of the sub-arrays using another array. This architecture requires a modest increase in area over a full array and can be implemented without requiring additional wire pitches over an interleaved array. The partitioning of the partial products depends on the size of the multiplier as well as on the wire loadings. It is shown that minimizing the number of CSAs in the critical path at the expense of wire loading does not improve performance. Dividing a 53-bit multiplication into two half arrays reduces the critical path for partial product addition by 42.1%. Dividing this further into four sub-arrays gives a further 18.2% improvement. Since the sizes of the two-array and four-array multipliers are nearly the same, the four-array architecture seems attractive for future multiplier designs.
comparing designs across different technologies. In our technology, an inverter in a fanout-of-4 inverter chain has a delay of 430 ps under worst-case operating conditions. Table 2 shows the initial and output delays that are common to all the array architectures. Tables 3 through 7 show the addition delays for the different array architectures discussed in this paper.

Interleaved arrays reduce the addition time of the full array by 38%. The carry save outputs within an interleaved array have almost twice as much wire capacitance as those in a full array. Two half arrays eliminate this internal wire loading, so each CSA stage is faster than in the interleaved array. This makes the two half arrays 6.6% faster, despite having more carry save adders in the critical path than the interleaved array. Table 7 shows that even though the number of carry save additions is minimized in array-of-arrays1, the large wire loading at three places in the critical path causes 31.4% of the total addition time in this architecture to be spent driving wires. This illustrates that minimizing the CSA count in the critical path is not the only criterion for designing an optimal architecture; wiring plays an important role as well. The best array architecture is array-of-arrays2 (Table 6)
which has no long running wires in the critical path. The array-of-arrays2 architecture gives a 26% improvement over the two half arrays (Table 4). The total critical path latency for this optimal architecture is 10.03 ns, which is 23.33 FO4 delays. As can be seen from Tables 2 and 6, only 66% of the total latency is spent in

Table 2. Common Delays

| Component | Delay (ns) | FO4 Delay |
|---|---|---|
| Multiplier latch | 0.77 | 1.79 |
| Booth Encoder | 1.74 | 4.04 |
| Booth Mux | 0.452 | 1.05 |
| Total Initial Delay | 2.96 | 6.88 |
| Output Latch | 0.43 | 1.00 |
| Initial + Output Delay | 3.39 | 7.87 |

Table 3. Full Array

| Component | Delay (ns) | FO4 Delay |
|---|---|---|
| CSA within the array | 0.576 | 1.34 |
| Output of 25-CSA array | 14.41 | 33.5 |
| Output of Compressor1 | 0.886 | 2.06 |
| Addition Delay | 15.29 | 35.56 |

Table 4. Two Half Arrays

| Component | Delay (ns) | FO4 Delay |
|---|---|---|
| CSA within the array | 0.576 | 1.34 |
| Output of 11-CSA array | 7.39 | 17.2 |
| Output of Compressor1 | 0.886 | 2.06 |
| Addition Delay | 8.86 | 20.6 |

Table 5. Interleaved Array

| Component | Delay (ns) | FO4 Delay |
|---|---|---|
| CSA within the array | 0.714 | 1.66 |
| Output of 12-CSA array | 8.6 | 20 |
| Output of Compressor1 | 0.886 | 2.06 |
| Addition Delay | 9.486 | 22.06 |

Table 6. Array-of-arrays 2

| Component | Delay (ns) | FO4 Delay |
|---|---|---|
| Output of 9-CSA Array4 | 5.19 | 12.07 |
| Output of Compressor3 | 0.886 | 2.06 |
| Addition Delay | 6.076 | 14.13 |

Table 7. Array-of-arrays 1

| Component | Delay (ns) | FO4 Delay |
|---|---|---|
| Output of 3-CSA Array1 | 2.61 | 6.07 |
| Output of Compressor1 | 1.75 | 4.07 |
| Output of Compressor2 | 1.815 | 4.22 |
| Output of Compressor3 | 0.886 | 2.06 |
| Addition Delay | 7.061 | 16.42 |

## 4: Implementation

The first three sub-array architectures were designed with 1 µm dual rail domino logic because it is a circuit technology very well suited to arithmetic functions. We fixed the vertical bit pitch at 80 lambda (40 µm), which limits the number of wiring tracks per bit pitch to 11. Of these, the following are always used in any architecture: 1 power/ground track, 4 tracks for the sum and carry outputs (the carry save output of the addition), and 1 track for a partial product bit. This leaves 5 extra tracks per bit pitch, which means we can run only one extra set of CSA outputs over another CSA stage.
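As a cross-check of the delay tables above, the FO4 column is simply the raw delay divided by the 430 ps worst-case FO4 inverter delay, and each Addition Delay row is the sum of its component rows. The short sketch below is our own; the variable names are ours, and the values come from Tables 3 and 6:

```python
# Sanity-check of the delay tables: FO4 delay = delay_ns / 0.43 ns,
# and the Addition Delay rows are sums of the component rows.

FO4_NS = 0.43  # one fanout-of-4 inverter delay under worst-case conditions

def to_fo4(delay_ns: float) -> float:
    """Normalize a delay in nanoseconds to FO4 units."""
    return delay_ns / FO4_NS

# Array-of-arrays2 (Table 6): 9-CSA array4 output plus compressor3.
aoa2_addition_ns = 5.19 + 0.886        # 6.076 ns, as in Table 6
aoa2_addition_fo4 = to_fo4(aoa2_addition_ns)  # ~14.13 FO4

# Full array (Table 3): 25-CSA array output plus the final compressor.
full_addition_ns = 14.41 + 0.886       # ~15.29 ns, as in Table 3

# Total critical path of the optimal architecture: 10.03 ns ~ 23.33 FO4.
total_fo4 = to_fo4(10.03)
```

Running these numbers confirms the table entries are internally consistent to rounding error.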
The sub-array architectures with arrays in the combining networks can thus be used with a dual rail circuit style. We also laid out the four-array array-of-arrays architecture in MAGIC to verify our area and capacitance estimates. Table 1 summarizes these areas. The total area is 3 mm high by 6 mm wide. As can be seen from the relative areas of wiring and cells, 44% of the total area is spent in wiring channels. This area cannot be reduced, since the number of wires that must be routed in an architecture as well as the wire pitch are fixed. Routing the additional set of carry save outputs diagonally across the array increases the size of the multiplier by about 30% in this technology. The difference in area between the split array organizations (two half arrays, interleaved arrays, four quarter arrays) is small. The area overhead for wires grows much larger if a tree organization is used.

## 5: Simulation Results

We broke the critical path of the multiplier into three basic pieces: setup, addition, and output. The setup phase is the time required to get the data from the input latches to drive the wires to the Booth Encoders located along the bottom edge of the array, through the encoders to drive the control lines to the Booth Muxes, and then to have the data arrive at the inputs of the CSAs. The addition phase includes all the time needed to combine the partial products into a single carry save result. The output phase is simply the time to latch the data and drive it to the carry propagation adder. The setup and output parts of the critical path are the same in every architecture. The addition time depends on how the array is designed, and we explored a number of sub-array organizations. We modelled the wiring capacitances using aggressive formulae, with known areas and process parameters for unit capacitance per unit length for a wire travelling over active and over metal.
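The area and wiring-track arithmetic above can be checked directly. The sketch below is our own; the figures come from Table 1 and the bit-pitch discussion in Section 4, and the variable names are ours:

```python
# Cross-check of the datapath area and the per-bit-pitch track budget.

# Area: cell area plus wiring-channel area gives the 6 mm datapath width.
cell_um = 3400                       # width of cells per bit slice (Table 1)
wiring_um = 2600                     # width of wiring channels (Table 1)
total_um = cell_um + wiring_um       # 6000 um = the quoted 6 mm width
wiring_fraction = wiring_um / total_um  # ~0.43, quoted as 44%

# Track budget: 11 tracks per 80-lambda bit pitch, 6 always occupied.
tracks_per_pitch = 11
fixed = 1 + 4 + 1   # power/ground + sum/carry outputs + partial product bit
spare = tracks_per_pitch - fixed     # 5 spare tracks
csa_output_set = 4  # one set of dual rail carry-save (sum, carry) outputs
extra_sets = spare // csa_output_set # only one extra set of CSA outputs fits
```

The last line reproduces the constraint stated in Section 4: with 5 spare tracks and 4 wires per carry-save output set, only one extra set can be routed over another CSA stage.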
The circuits were simulated with all fan-outs and wiring loads modelled in HSPICE under worst-case operating conditions: 4.3 V supply voltage and 120°C temperature. The delays were measured between 50% points. To normalize out the technology, the delay numbers are also reported relative to an inverter driving a load that is 4x its size. We have found that this number of "fanout of 4" (FO4) inverter delays is relatively insensitive to technology and operating conditions, and is a good metric for

Table 1. Area

| Description | # cells in a bit slice | Width (µm) |
|---|---|---|
| Input Latch | 2 | 101 |
| Booth Mux | 27 | 81 |
| CSA | 19 + 6 | 72 |
| Output Latch | 2 | 22 |
| Area of cells | - | 3400 |
| Area of wiring channels | - | 2600 |

To minimize the delay, we designed the arrays so that the two inputs to each compressor arrive at the same time. Since one of the inputs comes from another unit driving a long wire, the number of additions done in that unit should be less than in the local array. This implies that array1 should have fewer adders than array2, array3 fewer adders than array2 and compressor1 combined, and so on. Ignoring the wire loading, this gives a partition of 5,5,7,10 to add the 27 partial products (3,3,5,8 CSAs in arrays 1-4 respectively), with a critical path of 9 CSAs (3 in array1 and 6 in compressors 1, 2, and 3). We refer to this as the array-of-arrays1 architecture. When wire loading is taken into account, the situation is a little more complex. Our simulations show that the longest wire delay in this architecture is roughly equivalent to 1 CSA delay. The optimal partition is then 4,4,8,11 partial products (2,2,6,9 CSAs in arrays 1-4 respectively), and our simulations show that with this partition the arrays and wire loadings are balanced such that the critical path lies entirely in array4. The critical path is 11 CSA delays. This is referred to as the array-of-arrays2 architecture.

### 3.3: Four Quarter Arrays and a Tree

Figure 3.
Four Quarter Arrays and a Tree

We also explored the tree combining network, but the tree required many more wire tracks, longer wires, and more complex routing. The simplest layout is very similar to the previous design and is shown in Figure 3. Here the outputs of arrays 1, 2, 3, and 4 are combined by an 8-2 tree formed by compressors 1, 2, and 3 in the datapath and an 8-2 tree at the top. Note that the outputs of the first compressor and of array3 need to be routed on top of array4. Since each bit is represented by 2 wires (sum and carry) and each wire is dual rail encoded, this is 8 extra lines that must run across a cell. If wire loading is ignored, the arrays should all be the same size, so the partition would be 6,7,7,7 partial products. The critical path through this would be 9 CSAs (5 in the array, and then 4 in the combining tree). When the wire delays are considered, the latter arrays should again be slightly larger to equalize the delays. Since the long wire delays are roughly equivalent to one CSA, we get an optimal partition of 6,6,7,8 and a critical path of 10 CSAs. The height of the datapath increases both because of the increase in the wires routed over each cell and because the combining network at the top is now a tree. The increase in area, the additional wiring required, and the modest delay difference caused us to drop the tree combiner and focus on the array-of-arrays2 architecture. The next sections discuss our model of the critical path and give simulation results.

[Figure 3: latches feeding arrays 1-4 with the multiplier and multiplicand inputs, compressors in the datapath, and a CSA tree at the top; the datapath is 56 bits tall.]

a full array, but is still far from the 7-CSA critical path of a full tree. We further divided the 27 partial products into four smaller groups of adjacent partial products that are added in parallel by four sub-arrays. The outputs of the sub-arrays need to be combined by an 8-to-2 combining network.
This can be a tree, which has a critical path of 4 CSA delays, or an array, which has 6 CSA delays.

### 3.2: Four Quarter Arrays and Another Array

To reduce the required wiring, we explored the option of an 8-to-2 array as the combining network. We call this the array-of-arrays architecture; it is shown in Figure 2. Arrays 1, 2, 3, and 4 add p1, p2, p3, and p4 partial products respectively, where p1+p2+p3+p4 = 27. We divided the combining array into three 4-to-2 compressors placed closest to their inputs in the datapath, in order to preserve regularity in the wiring and to minimize the number of wires crossing over the arrays. Compressor1 combines the outputs of arrays 1 and 2. Compressor2 combines the combined output of arrays 1 and 2 coming from compressor1 with the output of array3. Similarly, compressor3 combines the combined output of arrays 1, 2, and 3 coming from compressor2 with the output of array4, to give the final result.

The least significant 2*(p1-1) bits of array1 fall off the datapath on top of array1. The 54 most significant bits of the outputs of arrays 1 and 2 (out1 and out2) are aligned at the inputs of compressor1. The 2*p2 lower significance bits of the two arrays are aligned and fall off the datapath above array2. Similarly, we get signals with aligned bit significance at the inputs of compressors 3 and 4 and above arrays 3 and 4. Since the carry save outputs of an array cross over only the next array, a 4-to-2 compressor is required at the top of the datapath to combine the least significant carry save outputs of two adjacent arrays. Compressor4 on the top combines these lower significance outputs from arrays 1 to 4.

Figure 2. Array-of-arrays architecture

This architecture has very regular wiring and layout properties. All the signals flow logically from left to right in the datapath. The bit-pitch shift between two CSA stages is always 0-2 bits. At most one pair of carry save outputs crosses over another array.
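The critical paths quoted for these combining networks (9 and 11 CSAs for the compressor chain of Section 3.2, 9 and 10 CSAs for the tree of Section 3.3) can be reproduced with a small arrival-time model. This sketch and its assumptions are ours, not code from the paper: we assume an array adding p partial products is a chain of p-2 CSAs, and that a 4-2 compressor is two chained CSAs, so the input pair feeding its first CSA sees 2 CSA delays while the later-arriving pair feeding its second CSA sees only 1.

```python
# Critical-path model, in CSA delays, for the two 8-to-2 combining networks.
# 'wire' is the long-wire delay between units, in CSA-equivalents.

def array_combiner(parts, wire=0):
    """Arrays 1-4 combined by the compressor chain of Section 3.2."""
    assert sum(parts) == 27          # 27 Booth partial products in total
    ready = [p - 2 for p in parts]   # time each array's carry-save output is ready
    chain = ready[0]                 # array1 feeds the first CSA of compressor1
    for r in ready[1:]:
        # Chain output crosses a long wire into the next compressor's first
        # CSA (2 levels); the local array enters the second CSA (1 level).
        chain = max(chain + wire + 2, r + 1)
    return chain

def tree_combiner(parts):
    """Arrays 1-4 combined by an 8-2 tree: two 4-2 levels (Section 3.3)."""
    assert sum(parts) == 27
    return max(p - 2 for p in parts) + 4
```

Under these assumptions, the 5,5,7,10 partition with wires ignored gives the 9-CSA path of array-of-arrays1, the 4,4,8,11 partition with one CSA-equivalent of wire delay gives the 11-CSA path of array-of-arrays2, and the 6,7,7,7 and 6,6,7,8 tree partitions give the quoted 9- and 10-CSA paths.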
This lack of dense wiring in a bit pitch allows us to use a fast dual rail circuit style in this architecture. The height of the datapath is increased by compressor4 and the latch on top, and by one Booth encoder at the bottom of the datapath. The architecture is very well suited to datapaths because of its regular wiring and layout.

[Figure 2: latches and compressor4 above arrays 1-4, the multiplier and multiplicand inputs, the 2*(p1-1), 2*p2, 2*p3, and 2*p4 low-order output groups, compressors 1-3, and the outputs to the CPA; the datapath is 56 bits tall.]

## 3: Sub-Array Architectures

This section looks at architectures that address the trade-offs in multiplier design. The idea is to combine the concepts of arrays and trees, aiming at low latency while keeping the wiring and area overheads small. This is achieved by breaking up a full array into smaller sub-arrays and then using combining arrays or trees to produce the final result. Having more sub-arrays gives more parallelism in adding the partial products, but also adds wiring and layout complexity. Architectures with two and four sub-arrays are discussed below.

### 3.1: Two Half Arrays

The simplest option is to divide the full array into two half arrays and combine the results of the two. Figure 1 shows this architecture.

Figure 1. Two Half Arrays Architecture

The 27 partial products are divided into two smaller groups of adjacent partial products. Arrays 1 and 2 are carry save arrays, adding p1 and p2