AP05_2.vp 1 Introduction Nowadays, arithmetic operations are among the most common operations in digital integrated circuits. Even the simplest circuit adds, subtracts or multiplies something. Therefore there are high requirements on arithmetic opera- tions. The computation should be fast and the area consumed by the arithmetic units should be small. These are two basic requirements, which are unfortunately contradictory. Numbers in digital integrated circuits can be repre- sented in various ways. The most widely known and most frequently used is fixed-point addition and multiplication representation. As mentioned above, the question of efficient fixed-point arithmetic operations is interesting when implementing a digital circuit. This is not a new question. Many studies have been done and many articles and books have been published on the topic of arithmetic. However, these have only rarely considered this question in conjunction with FPGAs (Field Programmable Integrated Circuits). In the design of an arith- metic unit in FPGA, it was common to use the corresponding operator of an HDL (Hardware Description Language) and rely on a synthesis tool. No thorough study and comparison of different structures of basic fixed-point arithmetic units has yet been published. This paper shows the results of our study [7], based on im- plementing and then comparing of fixed-point arithmetic units. Virtex-II FPGA was chosen for implementation since it possesses arithmetic-support features common for contem- porary FPGAs. The paper is divided into five sections. After the introduction, the second section describes the selected implementation platform. The third section outlines the im- plantation of arithmetic units. The fourth section consists of a discussion and a comparison of the results of the measure- ments. The last section concludes the text with a summary of the results. 2 Platform As shown in Table 1, contemporary FPGAs possess special features to support of arithmetic operations. The support features differ from one FPGA vendor to another. It was not the goal of the study to compare the different FPGAs. We de- cided to use Xilinx [8] Virtex-II FPGAs as the implementation platform. This is a very popular implementation platform in contemporary designs, and it has special support features for both addition and multiplication. Cad tools are another important part of the implementa- tion platform. The choice of tools affects the results of the implementation. Again, a comparison of CAD (Computer Aided Design) tools was not the goal of our study. The tools were chosen according to their availability in our department. The following tools were used in the relevant implementa- tion steps: � Synthesis: Leonardo Spectrum 2001, Mentor Graphics [4] � P&R, Timing Analysis: Xilinx ISE 5.2i Foundation, Xilinx [8] � Simulation: ModelSim 5.5f, Mentor Graphics [4] 3 Implementation This section lists the implemented adders and multipliers. They were chosen from different sources [1-3][5-6] to meet the objectives of the study: to explore the characteristics of arithmetic unit structures when implemented in FPGA, and to examine the FPGA architecture itself including the features that support arithmetic operations. © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 67 Czech Technical University in Prague Acta Polytechnica Vol. 45 No. 2/2005 Fixed-Point Arithmetic in FPGA M. Bečvář, P. Štukjunger Arithmetic operations are among the most frequently-used operations in contemporary digital integrated circuits. Various structures have been designed, utilizing different features of IC architectures. Nevertheless, there are very few studies that consider the design of arithmetic operations in Field Programmable Gate Arrays (FPGAs), a re-programmable type of digital integrated circuit. This text compares the results achieved when implementation of basic fixed-point arithmetic units in FPGA. Keywords: arithmetic, fixed-point, FPGA. Vendor Chip Features Altera Stratix II Adder logic, carry chain, DSP support. Atmel AT40KAL Vector Multipliers. The AT40KAL’s patented 8-sided core cell with direct horizontal, vertical and diagonal cell-to-cell connections implement ultra fast array multipliers without using any busing resources. Lattice IspXPGA Dedicated carry chain and Booth multiplication logic (extra two-input AND). QuickLogic Eclipse-II ECU blocks (up to 8-bit MAC functions) for DSP (8b multiplier, 16b adder), 12 ECUs in the largest device. Xilinx Virtex-II Dedicated fast carry chain and carry propagation logic, embed- ded 18x18 bit signed multipliers with associated RAM blocks. Table 1: Arithmetic support features in FPGAs Classes of Implemented Fixed-point Adders: � Carry-Ripple Adder (CRA): A basic adder structure based on full adders connected into a chain. This is the simplest structure for implementation and also the slowest for oper- ation. Nevertheless, it is important for comparison with the other structures. � Generated Adder (GA): An adder generated by a synthesis tool. In contrast to CRA it utilises the dedicated carry chains of Xilinx Virtex II FPGAs. � Carry-Lookahead Adder (CLA): The carry-Lookahead ad- der uses two special signals G(enerate) and P(ropagate) to predict the propagation of a carry signal without using a carry chain. Several CLAs were implemented with different building-block types and sizes. � Carry-Skip Adder: Based on CLA Carry-Skip, this adder uses only the P(ropagate) signal, and for this reason, it is much easier to implement. It can be seen as a compromise between the complexity of CLAs and the long delay of CRAs. � Carry-Select Adder: Two sets of adders are used, one with the carry-in signal driven to logic 0 and the second with the carry-in signal driven to 1. The result is selected by multiplexers controlled by the actual carry-in signal. This structure is suspected to have the largest consumption, which should however compensate its performance. Fixed-point multipliers can be divided into two groups by the type by their operation: serial mode multipliers and par- allel mode multipliers. A serial mode multiplier computes bits of the result one by one in each computation cycle. On the other hand, a parallel mode multiplier computes all bits of the result at once. Several multiplier structures from the two groups were selected for implementation. In the case of the parallel mode, not only conventional versions but also pipe- lined versions of multipliers were implemented. As they were implemented, the multipliers take n-bit operands and produce a 2n-bit result. No adjustments of the result, such as radix-point position or length truncation, were considered in the implementation. Classes of Implemented Fixed-point Multipliers: Serial mode multipliers: � Classical Multiplier: The most basic serial mode multi- plier structure that computes multiplication in a human- -like way, i.e., going through the bits of one operand and simultaneously accumulating the second operand and shifting the sum. � Booth Multiplier: A serial mode multiplier that uses Booth recoding. This enables it to compute signed number multi- plication and, for radices greater than two, also increases its speed by computing several bits of the result in one compu- tational step. � CSA Multiplier: A serial mode multiplier that uses Carry- -Save Adders, adders that have 3N inputs and 2N outputs. Carry-Save Adders have a shorter delay than a normal full adder, which shortens the length of a single step of the multiplier. � Booth + CSA Multiplier: Combining the concepts of Booth and CSA multipliers, this multiplier uses radix-4 Booth recoding to reduce the number of computational steps and CSA adders to shorten the duration of a single computational step. Parallel mode multipliers: � Array Multiplier: A parallel mode multipliers with a struc- ture based on the basic formula for computation of a multiplication. It consists of N2 cells connected into an array-like structure. � Wallace Multiplier: A parallel mode multiplier that uses a tree of CSA adders. Thanks to the CSAs at each level of the tree, the number of operands is reduced by a factor of 3:2. � Generated Multiplier (Generated HW): A parallel mode multiplier generated by a synthesis tool. It utilises 18-bit hardware multipliers embedded in Xilinx Virtex II FPGAs. � Combinative Generated Multiplier (Generated C): A par- allel mode multiplier also generated by a synthesis tool, but with disabled usage of the embedded multipliers. This multiplier was implemented for the purposes of compari- son with other structures and also for investigating the pipelining abilities of the synthesis tool. VHDL hardware description language was used for imple- mentation; however there is no reason it should have any impact on the results of the experiments. Different arithmetic unit structures were implemented with a common interface, so that they can be used interchangeably (i.e. an adder can be replaced by another adder, and a multiplier by another multi- plier). The units are parametrized, which allows the user to choose the width of the operands (and consequently of the result). However, this it can be done only statically before the start of the implementation process. For more details on implementation, see [7]. 4 Results The results of the experiments with implemented arith- metic structures are shown below. The target FPGA chip was Xilinx Virtex-II, specifically XC2V500 in the FG456-6 pack- age. There was no particular reason for choosing exactly this part rather than some other member of the family. It was discovered later that some of the 64-bit multiplier structures did not fit in this chip, so a larger (XC2V1500) chip had to be used instead. Place and route reports were used as a source for the evaluation. 4.1 Fixed-point addition Tab. 2 shows the results of measurements carried out on the implemented adder structures of 16, 32 and 64 bit-width. The results show that the best adder structure in terms of both time and area is the one that uses dedicated carry chains as depicted in figure Fig. 1. None of the techniques that try to speed up the addition by cutting the carry chain, applied to this structure, brought any improvement in speed, but led to an increase in the occupied area. Therefore, the results of such structures are not shown. As mentioned above, the speed-up techniques cripple the performance of the Generated Adder, and the following ob- servations are only valid for adders based on the CRA adder. 68 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 45 No. 2/2005 Czech Technical University in Prague CLA: CLA occupies the largest area among adders with a speed-up technique applied, without bringing better speed improvement than the other techniques. The regularity of CLA with blocks of size 2 proved to have an advantage over other CLAs in smaller area consumption. Carry-Skip Adders: No extra area was occupied by the addi- tional carry propagation logic – a very favourable result. With no need for extra area, the Carry-Skip Adder accelerates CRA to less than half delay at 64 bits. Carry-Select Adders: The carry-Select Adder requires extra area. In addition, it does not shorten the delay of computa- tion as much as the Carry-Skip Adder. 4.2 Fixed-point multiplication Table 3 shows the results of measurements carried out on different multiplier structures. 16, 32 and 64-bit multipliers were implemented. As mentioned above, the implemented multipliers can be divided according to their operation mode into two groups: serial and parallel mode multipliers. The multipliers within each group have similar characteristics, and the results can often be generalized to the whole group. As in the case of adders, two parameters were examined: area and delay. Before dealing with the delay of the multiplier, it is necessary to state clearly what it means. Two measures have to be distinguished: delay and response time. Response time refers to the total time required for computation of the result. On the other hand, the delay of a sequential circuit is defined as the reverse value of its clock frequency. Note that the delay of the parallel mode multipliers is equal to their re- sponse time, but they response times of serial mode multipli- ers and pipelined multipliers are multiples of their delays. These time measures are both important, and were therefore the object of our examination. 4.2.1 Area Fig. 2 depicts the area consumption of selected multiplier structures with 32-bit operand width. This figure shows, and it is also valid in general, that serial mode multipliers consume considerably a smaller area than parallel mode multipliers. The implementation of wide (larger than 64 bits) parallel mode multipliers approaches the limits of technology – not only the limits of FPGA, but also the capacity of the synthesis tool. The results are much better when embedded hardware multipliers are used, but the number on one chip is limited [8] and consumption grows exponentionally with operand width. Note that the area consumption depicted by the figure does not include the area of the embedded multipliers. 4.2.2 Delay Two time measures were observed on the multiplier struc- tures: delay and response time. The response time gives an overview of the overall performance of the unit. The delay limits the maximum clock frequency of the system on which the given multiplier structure will be used. Parallel mode structures have a shorter response time, while serial mode © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 69 Czech Technical University in Prague Acta Polytechnica Vol. 45 No. 2/2005 N � 16 N � 32 N � 64 Area (Slices) Delay (ns) Area (Slices) Delay (ns) Area (Slices) Delay (ns) CRA 23 13.52 47 23.99 95 40.27 Generated 8 4.05 16 5.09 32 8.38 CLA 29 7.97 61 19.06 127 24.84 Carry-Skip 23 11.54 47 15.67 95 18.02 Carry-Select 27 13.27 67 16.47 135 23.88 Table 2: Results of fixed-point adders Fig. 1: Fixed-point adders: comparison of the delay and area con- sumption of an adder with dedicated carry chains vs. an adder utilising classical routing paths (an example for 32-bit data width) multipliers have a better delay criterion. This is an expected result, thanks to the philosophy of the operation mode. In parallel mode the computation is done in a single step. The step is long, longer than one step of the serial mode multi- plier, but shorter than the sequence of steps that has to be car- ried out by a serial mode multiplier to compute the result. The pipelining technique was applied on parallel mode multipliers to see if it brings any improvements. This tech- nique could not be applied when embedded multipliers were used, since the multipliers cannot be divided to allow the in- sertion of registers for the pipeline. The advantages and dis- advantages of pipelining were observed on the implemented 70 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 45 No. 2/2005 Czech Technical University in Prague N � 16 N � 32 N � 64 Area (Slices) Delay (ns) Area (Slices) Delay (ns) Area (Slices) Delay (ns) Serial Mode Multipliers: Classical 40 4.60 71 4.94 136 8.75 Booth-2 (radix 2) 41 4.63 73 4.89 138 7.99 Booth-4 (radix 4) 55 4.93 104 5.37 202 9.09 CSA 62 3.46 119 3.67 234 4.54 Booth-4 + CSA 101 4.62 191 4.86 370 5.45 Parallel Mode Multipliers: Array * 364 28.96 1536 53.45 4187 99.97 Wallace * 390 19.81 1542 27.18 6164 37.00 Generated 0 14.87 49 18.54 295 28.94 Generated C 134 20.32 533 26.27 2109 29.97 Pipelined Multipliers: Array [P2] * 274 14.95 1059 29.98 4193 59.92 Wallace [P] 446 4.64 1654 4.95 6367 9.9 Generated C [P4] * 188 4.95 635 9.94 3317 11.61 Note1: [Pn] denotes a pipelined multiplier with a given number of stages, where appropriate. Note2: 64-bit versions of multipliers marked * did not fit in an XC2V500 chip. An XC2V1500 was used instead. Note3: Area consumption does not include the area of embedded multipliers. Generated multipliers of 16, 32 and 64 bit-width utilize 1, 4, and 16 embedded multipliers, respectively. Table 3: Results of fixed-point multipliers Fig. 2: Area of fixed-point multipliers: selected types of fixed-point multipliers are shown, two parallel mode multipliers on the left and two serial mode multipliers on the right (all with 32-bit data width) structures. However, pipelining did not show any consider- able improvements. When a parallel mode multiplier is to be implemented, embedded multipliers are the best option. 5 Conclusion Arithmetic operations in contemporary digital integrated circuits have an important role. Our study [7] provides a com- parison of the structures of fixed-point adders and multipliers implemented in FPGA. From a number of well-known [1–3], [5–6] implementation architectures, several structures were chosen to observe their characteristics when they are im- plemented in FPGA and, at the same time, to investigate the possibilities of the FPGA architecture itself. The Xilinx Virtex-II FPGA family was chosen as the implementation platform, because it provides special features to support arith- metic operations, a trend that seems to be common at the present time for FPGA chips. Nine fixed-point adder structures were implemented in the study. The results show that a structure using Virtex’s dedicated fast carry chains has the best results in terms of both area and delay criteria. None of structures utilizing a speed-up technique showed better results. If no dedicated carry chain is available, the Carry-Skip Adder is the best option, because it doubles the speed of an adder with a nor- mal carry chain without any extra area consumption. Ten fixed-point multipliers were implemented. They can be divided according to their operation mode into two groups: serial and parallel mode multipliers, respectively. The parallel mode structures provide a shorter response time, © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 71 Czech Technical University in Prague Acta Polytechnica Vol. 45 No. 2/2005 Fig. 3: Response time of fixed-point multipliers: selected types of fixed-point multipliers are shown, two parallel mode multipliers on the left and two serial mode multipliers on the right (all with 32-bit data width) Fig. 4: Delay of Fixed-point Multipliers: selected types of fixed-point multipliers are shown, two parallel mode multipliers on the left and two serial mode multipliers on the right (all with 32-bit data width) but they also require a significantly larger area in terms of CLB slices. A parallel multiplier structure that utilizes em- bedded hardware multipliers proved to have the shortest response time. However, the embedded hardware multipliers are consumed very rapidly for bit-widths larger than 18 and, furthermore, the number on a chip is limited [8]. Serial mode structures proved to have advantages for applications that require small area usage with a short clock cycle. They provide a reasonable response time even for relatively large bit-widths (32 bits and more), on a very small area. This gives the oppor- tunity to exploit the parallelism between a large number of serial-mode multipliers in a single FPGA. References [1] Ercegovac, M. D., Lang, T.: Digital Arithmetic, Morgan Kaufmann Publishers, San Francisco, 2003. [2] Hennessy, J. L., Patterson, D. A., Goldberg, D.: Computer Architecture: A Quantitative Approach: Appendix H Com- puter Arithmetic, Elsevier Science, 1995. [3] Koren, I.: Computer Arithmetic Algorithms, Prentice-Hall, Englewood Cliffs, New Jersey, 1993. [4] MENTOR GRAPHICS, web page, http://www.mentor.com/. [5] Omondi, A. R.: Computer Arithmetic Systems (Algorithms, Architectures and Implementations), Prentice-Hall In- ternational (UK) Limited, 1994. [6] Pluháček, A.: Projektování logiky počítačů (Designing Computer Logic), CTU Publishing House, Prague, 1992. [7] Štukjunger, P.: Arithmetic in FPGA, Diploma thesis, CTU in Prague, 2004. [8] XILINX, web page, http://www.xilinx.com. Ing. Miloš Bečvář e-mail: becvarm@fel.cvut.cz Ing. Petr Štukjunger e-mail: stukjup@fel.cvut.cz Department of Computer Science & Engineering, Czech Technical University in Prague Faculty of Electrical Engineering Karlovo nám. 13 121 35 Prague, Czech Republic 72 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 45 No. 2/2005 Czech Technical University in Prague