AP05_2.vp 1 Introduction Contemporary cryptographic schemata are frequently based on the Discrete Logarithm Problem (DLP): find integer k such that Q k P P k � � � � 1 (1) for given group elements P and Q. In elliptic curve cryptography (ECC), P and Q are points on a chosen elliptic curve over a finite field. We focus on curves over GF(2m), where point coordinates are expressed as m-bit vectors. The DLP in such a group is exponentially hard in compar- ison with DLP in a multiplicative group over a finite field. This means that a 173 bit key provides approximately the same security level as the 1024-bit RSA [7]. This fact is very important in applications such as chip cards, where the size of the hardware and energy consumption is crucial. In algorithms such as the Elliptic Curve Digital Signature Algorithm (ECDSA), k is an m-bit integer, P is a chosen point and Q is computed using Eq. 1. It requires addition, multipli- cation and inversion over GF(2m). 1.1 Finite field operations The implementation of these operations is determined by the representation of the field elements in GF(2m) [2]. In this work we focus on the normal basis representation. Addition over elements of GF(2m) is implemented as a bit-wise XOR. Squaring is realized by rotation (cyclic shift) one bit to the right. Because it is so simple (one clock cycle), it is regarded as a special case. Multiplication is based on matrix multiplication over GF(2m). In hardware, a special unit (multi- plier) is necessary. The best-known algorithm for inversion in normal basis is the algorithm of Itoh, Teechai and Tsujii (ITT) [3] based on multiplication and squaring. 1.2 Scalability We understand scalability as the ability to change scale or to be available in various versions. The term ‘scale’ can be paraphrased as ‘the important dimension’ and hence is a matter of viewpoint. The basic dimension of cryptography is the measure of security, which translates to key length. A unit is scalable if it can serve for computations with varying key length, provided that the internal memory is sufficient [5]. We suggest calling this kind of scalability scalability in precision. In the world of parallel processing and VLSI design, the important dimension is the number of processors or other hardware units. An algorithm scales well if we can obtain more performance by assigning more resources (processors, chip area). We shall speak about scalability in performance. As this paper is focused on this kind of scalability only, we will use just the term ‘scalability’. We need scalability for two reasons. The first (and more important) reason is practical: hardware implementations are needed for a variety of contexts, from smart-card devices to high-throughput servers. The second reason is a fair compari- son of design versions, where scaling the units to match in one dimension is preferable to artificial quality factors. To scale a unit means to employ parallelism at a certain level of abstraction. We can scale at algorithmic level by selecting a different algorithm, at the register-transfer level, e.g., by changing the data path width, or at the gate level by factoring combinational circuits, or even at the physical level. This work aims at the seam between algorithmic and register-transfer level. Units composed of sub-units are harder to scale. We might be lucky and find a unified scaling parameter, e.g., data path width. In the general case, however, each sub-unit may have a different scaling parameter. The sub-units interact. Firstly, if there is a common clock (as preferred in practice), it must suit the slowest sub-unit. Secondly, to achieve the best global area-performance ratio, local area-performance tradeoffs are not independent. The above sketched elliptic curve operations offer very limited parallelism and therefore cannot be a major source of scalability. Scalability must be sought for in finite field opera- tions, and therefore the most interesting area for scalability in an elliptic curve processor is the finite field unit. 1.3 Metrics The metrics used reflects the abstraction level of the parallelism employed. When the data path units are simple and their implementations are obvious, we can get at a lower © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 55 Czech Technical University in Prague Acta Polytechnica Vol. 45 No. 2/2005 Scalable Normal Basis Arithmetic Unit for Elliptic Curve Cryptography J. Schmidt, M. Novotný The design of a scalable arithmetic unit for operations over elements of GF(2m) represented in normal basis is presented. The unit is applicable in public-key cryptography. It comprises a pipelined Massey-Omura multiplier and a shifter. We equipped the multiplier with additional data paths to enable easy implementation of both multiplication and inversion in a single arithmetic unit. We discuss optimum design of the shifter with respect to the inversion algorithm and multiplier performance. The functionality of the multiplier/inverter has been tested by simulation and implemented in Xilinx Virtex FPGA. We present implementation data for various digit widths which exhibit a time minimum for digit width D � 15. Keywords: finite fields, normal base, multiplication, inversion, arithmetic unit. level. In this work, we use the classical metrics of area (which is connected to power consumption) and time. When a unit is scaled, its area A and time t vary in opposite directions. Time t depends on the number of clock cycles T spent in a given calculation and on the critical path length � in hardware. To compare differently scaled units, we use the quality measure Q At AT� � �. 2 Previous work Massey and Omura proposed a multiplier [6] that em- ploys the regularity of equations for all bits of a result. From the equation for one bit of a result (e.g. c0), equations for other bits can be derived by rotating bits of the arguments a and b [2]. In this multiplier, one bit of the result is computed completely in one clock cycle. Then, registers holding the ar- guments a and b are rotated right one bit between cycles. The computation of m bits of the result takes m clock cycles and hence this multiplier is also called bit-serial. Agnew, Mullin, Onyszchuk and Vanstone introduced a modification of the Massey-Omura multiplier [1] (in this paper, we call it the AMOV multiplier). They divided the equation for each bit ci into m products Pi,j: c P P P Pi i i i m i m� � � � �� �, , , ,0 1 2 1� (2) In the first clock cycle, the product Pi,i+0 of bit ci for all i � �0, m – 1 is evaluated. In the next cycle, the product Pi,i+1 (all subscripts are reduced mod m) of bit ci for all i � �0, m – 1 is evaluated and added to the intermediate result, and so on. All bits of the result are successively evaluated in parallel; the computation is pipelined. The block diagram of the AMOV multiplier is in Fig. 1. The multiplication is performed as follows: In the first step, both operands a and b are loaded from inputs IN1 and IN2 to registers A and B, respectively. Then, in each of the following m clock cycles, both the A and B registers are rotated right (this is represented by the blocks ROR 1) and the result c in register C is evaluated suc- cessively by the block COMB. LOGIC, which implements the products Pi,j from Eq. 2. After m clock cycles, the result c a b� is available at output OUT. All registers and data paths are m bits wide. The amount of hardware is the same as for the non- -pipelined Massey-Omura multiplier, but the critical path is short and constant (it does not depend on m) and so the maximum achievable frequency is higher. This multiplier is widely used. The computation of an inverse element (inversion) by the ITT algorithm [3] is usually controlled by a microprogram [4]. When implementing the ITT inversion using a classical AMOV multiplier, additional registers and data transfers out- side the multiplier are necessary. In this work we present a modification of the AMOV multiplier, which allows efficient implementation of both the multiplication and ITT inversion algorithms. In comparison with the microprogrammed inversion, no additional regis- ters or data transfers outside the multiplier are necessary. We also introduce several improvements to this multiplica- tion/inversion unit, which lead to increased performance and a better performance/area ratio. 3 Structure of the unit The data path of our arithmetic unit is an extension of the AMOV multiplier. By adding one more input to the multi- plexer preceding register A and by redirecting some data paths (see bold lines in Fig. 2), we can simply implement both multiplication and ITT inversion in the unit and thus save additional registers and data transfers outside of the multiplier. The modified AMOV multiplier has a dedicated control unit based on a finite state machine, two counters COUNT_INV and COUNT_K and the shift register M. It im- plements the commands load_op, multiply and invert. 3.1 Multiplication Multiplication is performed as follows: In the first two steps, both operands a and b are successively loaded from input IN to registers A and B. Then, in m clock cycles, the mul- tiplication is performed as in the standard AMOV multiplier. After its completion, the result c a b� is loaded from register C to register A and is available at output OUT. 56 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 45 No. 2/2005 Czech Technical University in Prague B COMB. LOGIC ROR 1 IN1 ROR 1 IN2 A C OUT Fig. 1: AMOV multiplier B COMB. LOGIC ROR 1 IN ROR 1 A C OUT Fig. 2: Modified AMOV multiplier 3.2 Inversion Our unit computes the ITT inversion [2], [3] by Algorithm 1. It comprises � �log ( ) ( )2 1 1 1m w m� � � � multi- plications, where w (m � 1) is the number of non-zero bits in the binary representation of m � 1. Furthermore, it needs (m � 1) � w (m � 1) squarings. The total number of clock cycles spent for one inversion is � � � � C m w m C m w m C co INV MUL SQR � � � � � � � � � � � � � log ( ) ( ) ( ) ( ) 2 1 1 1 1 1 nst, where CMUL is number of clock cycles of 1 multiplication (CMUL � m) and CSQR is number of clock cycles of one squar- ing (CSQR � 1). Note that the inverted value must be available at input IN during the whole process of inversion. The rotation capability of register B is used for the computation of squarings in steps 3.2.1 and 3.4.2. 4 Scaling the multiplication The basic ECC operation (Eq. 1) is performed by suc- cessive point additions and point doublings. Each of these operations needs 1 inversion, 2 multiplications and 1 squar- ing [2]. The number of clock cycles necessary for one point addition or doubling is then: � � � � C m w m C m w m C const PADD MUL SQR � � � � � � � � � � � � log ( ) ( ) ( ) 2 1 1 1 1 . (3) We can reduce the number of clock cycles CPADD in two ways: by reducing the number of clock cycles CMUL of the multiplications and the number of clock cycles CSQR of the iterative squaring.0 Both the Massey-Omura and the AMOV multipliers need m clock cycles for computing all m bits of the result. Some au- thors also call them bit-serial, because they compute one bit of a result in one clock cycle. There is a digit-serial variant of the Massey-Omura multi- plier (some authors call it sliced or parallel). In this multiplier, D bits (also called a digit) are evaluated in one clock cycle. In the case of a digit-serial AMOV multiplier, D products Pi,j are evaluated in one clock cycle. All m bits of the result are then evaluated in � �C m DMUL � cycles. Since more products are evaluated in one clock cycle, more combinational logic is necessary. The size of the block COMB. LOGIC in Fig. 2 is proportional to D � 1. The size of other blocks remains constant. As the combinational logic becomes more complex, the length of the critical path grows proportionally to log D. Since one multiplication needs m/D clock cycles, the total time necessary for one multiplication is �O m D D( ) log , and the total time of one inversion (or point addition on elliptic curve) is T O m m D m DPADD � � � � � � � � �(log ). log . (4) 5 Scaling the iterative squarings Another way to improve the performance of the ITT in- version (and consequently the point addition) is to reduce the number of clock cycles necessary for the iterative squarings in step 3.2.1 of Algorithm 1. Adding one or more blocks performing “long distance” rotations can reduce the number of clock cycles required for all iterative squarings (Fig. 3). We shall refer to the hardware realizing these rotations as the shifter. Let m be the degree of the finite field we work in, GF(2m). Let m b b b br r� � �1 1 1 0( )� be the binary representation of m-1 such that the most signifi- cant bit br=1. The rotations required by the Itoh-Teechaji- -Tsujii algorithm are � �K k i r k b bi i r i� � �, , ( )1� � . The binary representation of ki is br … bi. Each of the shifts is performed exactly once in an inversion operation. Let A, T, and � be the area, the total number of clock cycles spent in rotations, and the critical path of the shifter. Let A0, T0, and �0 be the area, total number of clock cycles spent in © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 57 Czech Technical University in Prague Acta Polytechnica Vol. 45 No. 2/2005 1. Mr...M0 � m – 1; 2. A � IN; 3. for COUNT_INV in r – 1 down to 0 do 3.1 B � A; 3.2 for COUNT_K in M2r�1 … Mr down to 1 do 3.2.1. B � B ror 1; 3.3 C � B A; 3.3a A � C; 3.3b M � M shl 1; 3.4 if Mr = ,1’ then 3.4.1 B � A; 3.4.2 B � B ror 1; A � IN; 3.4.3 C � B A; 3.4.4 A � C; 4. A � A ror 1; Algorithm 1: An implementation of the ITT inversion in the modified AMOV multiplier B COMB. LOGIC ROR 1 IN ROR 1 A C OUT ROR q ROR r Fig. 3: “Long distance” rotations form a shifter that saves clock cycles necessary for squarings in ITT inversion multiplications, and the critical path of the rest of the arith- metic unit. The quality measure of the entire unit, and hence our optimization criterion, is � �Q A A T T� � � �( )( ) max ,0 0 0� � (5) This equation also shows the two dependencies between the shifter and the multiplier. Firstly, the ratio of the shifter’s area and time must be ‘the right one’. For each A0, T0 and �0, the area and time of the shifter shall be adjusted to achieve minimum Q of the entire unit. Secondly, the shifter may slow down the AU clock. In this case, not only the time T� is longer, but also the multipli- cation time is T0� instead of T0�0. As the multiplier domi- nates, the penalty may become unacceptable. The multiplier is scaled by manipulating the digit size, which affects A0, T0 and �0, while changing the number of rotations in the shifter varies A, T, and �� Thus the optimiza- tion problem becomes multi-parametric. Because the multi- plier logic dominates in both area and time, we solve it in a Pareto-optimal way: we modify the shifter to achieve optimum total Q for a given multiplier digit width. 5.1 Decomposition in time and space To implement a set of rotations, one might use: � a multiplexer structure, such as the barrel shifter, spending a single clock cycle for each rotation, or � hardware providing the rotation by 1 only, requiring k clock cycles to rotate by k. These options can be seen as decompositions in space and in time, respectively. In a more general approach, we con- struct hardware providing a limited set of rotations, and we use it in multiple clock cycles to realize the given rotations. The solution of our problem can be decomposed as follows: � Find rotations s j nj � � �Z , 1� and factors Tij � �Z , i r j n� �1 1� �, such that s T k mj ij j n i � � � 1 mod (6) (the time domain problem). � Implement the rotations sj under the optimality criterion in Eq. 5 (the space domain problem). For such a sequential decomposition to work, the first step must estimate the quality of the result of the second step. In our case, we need the estimation of area, clock cycles, and critical path of a circuit realizing a given set of rotations. Thanks to the special nature of rotations ki, we can have made reasonable assumptions about them. The decomposition of the problem is summarized in Fig. 4. 5.2 Space domain problem Let NMUX be the number of two-input multiplexers in a circuit implementing n rotations. For a given n, NMUX has the following bounds: � NMUX is O(n). Any set of rotations can be implemented using an n-input multiplexer, which in turn is a tree of n 2-input multiplexers. � NMUX is � (log n). This is the minimum number of bits specifying one number out of n. Note that in both cases, the critical path length is logarithmic. A circuit intermediate between these two extremes can be represented as a network of 2-input multiplexers. We have proven that: � unless there are distinct indices a, b, c, d such that s s s sa b c d� � � , the circuit optimal in area and critical path is an n-input multiplexer; � the original set of rotations ki does not have the above property. These facts lead us to the tentative assumption that the so- lution of the space domain problem can be approximated as an n-input multiplexer. The overall quality measure is then �Q nA n A T T nij j n i r � � � � � � � � � � � � � � � �� ��MUX MUX( ) max (0 1 0 1 �� �), .�0 The area AMUX(n) and critical path �MUX(n) of an n-input multiplexer can be expressed in terms of primitive gates, if such a measure is used for the rest of the unit. Alternatively, these values can be measured in terms of implementation technology (transistors, programmable blocks) and as such obtained from the synthesis tools used. 5.3 Time domain problem Due to the modularity of Eq. 6, the value of all sj and Tij can be restricted to (0, m) without loss of generality. This still represents, however, a large solution space. To reduce it, fur- ther decomposition is used: 1. Choose a set of rotations sj. 2. Obtain a set of factors Tij giving optimum quality Q. Because no reliable estimate can be made in Step 1, both steps are performed iteratively. In other words, we use Step 2 as the evaluation function for the local search in Step 1. 5.4 Optimal factors by dynamic programming In Step 2 above, n is fixed, and so are the parameters of the multiplexer implementing sj. Furthermore, note that the equations for different values of j in Eq. 6 are independent, that is, we have r primitive problems of the following form. 58 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 45 No. 2/2005 Czech Technical University in Prague s k time domain problem space domain problem T Step 2 Step 1 estimation s muxes i ij i obtain Tij i by dynamic programming genetic algorithm choose s i Fig. 4: Decomposition For a given c c m� ��Z , and h � �Z , find T j hij � � �Z , 1� such that s T c mj j j h � � � 1 mod while minimizing q Tj j n � � � 1 . Let F (h, c) be a function giving minimum q for the above problem. Let F (0, 0) = 0 and F (0, c) = � for c > 0. Then F (h, c) is, for 0 < h � r, �� �F h c F h c ds m d d h( , ) min , ( ) mod� � � �1 , where d ranges between 0 and the order of sh in Zm, that is 0 � d � m / gcd (sh, m) The values of F (h, c) are computed for increasing h and are stored in a two-dimensional array with r columns and m rows. After that, the element (r, ki) contains the contribution of the i-th primitive problem to the optimization criterion, for 1< i � r. Eventually, all values Tij and therefore the solution of Step 2 can be reconstructed from the array. This means that one pass only of the above outlined dynamic programming proce- dure is required to solve the entire Step 2. As r is O(log m), the complexity of this algorithm is O(m2log m). 5.5 Optimal rotation set by a genetic algorithm The search process in Step 1 is performed by a genetic algorithm, which in turn uses the above procedure to evaluate the solution. Thanks to the decomposition in Subsection 5.1, a configu- ration of the search problem is determined by the value of n � r variables si � (0, m). This is the phenotype of the genetic algorithm. The genotype (chromosome) is a fixed structure of r vari- ables from �0,m). A zero value is omitted from the phenotype, and equal values of two or more variables in the genotype constitute a single shift in the phenotype. All permutations of the genotype are considered equivalent. This decoder avoids the necessity to store n as a separate variable and the need to work with a variable-length genotype. Constrained optimization was implemented using a pen- alty for each rotation ki, which the individual in question can- not realize. Due to the modularity of Eq. 6, the search space is well connected. The infeasible individuals were not needed to contribute to the connectedness, and therefore the value of the penalty was chosen well above any possible value of Q. The rest of the genetic algorithm was quite classical, with single-point crossover and linear scaling of fitness values. The adaptive nature of linear scaling caused the convergence to remain unchanged even in the presence of large A0 and T0, where the differences in evaluations are relatively small. It seems that the number of clock cycles spent in iterative squarings is approx. �O k mk � 1 , where k is the number of rotation blocks. The total time necessary for the point addi- tion is then T O m D m k m DkPADD � � � � � � � � � � � � � � � �log log1 . (7) 6 Implementation The proposed multiplier/inverter has been implemented in the Xilinx Virtex300 FPGA using the Synopsys FPGA Express synthesis tool and the Foundation 3.3i implementa- tion tool. The functionality has been verified in the ModelSim simu- lator. Note that point addition and doubling are completely oblivious, i.e. the sequence of steps depends on m only, not on the processed data. Therefore the simulation was also able to show that the numbers of clock cycles used in Equations 3, 4, and 7 were correct. 6.1 Digit-serial multiplier/inverter Because of limited space, Table 1 presents results for m � 180 and for several digit widths D only. No additional blocks performing “long distance” rotations have been used in these cases. As expected, the area of the multiplier/inverter grows with growing D, and can be expressed as approx. (2 � 0.5 D)× m slices or (2 � 0.5 D) slices per 1 bit. The computation time does not depend on the digit width D in such a straightforward manner. The results in the last column of Table 1 and in Fig. 5 correspond to Eq. 4. The minimum time is obtained for D � 6. The other local minima are caused by the granularity of the FPGA. Whenever the capacity of a look-up table is exhausted, the length of the criti- cal path increases. 6.2 Improving the iterative squarings The optimizer was implemented using the GAlib C�� li- brary [8]. A number of experiments have been performed, with m in the range interesting for elliptic curve cryptography, that is, from 160 to 250. The following facts were observed: � Where a brute force optimum solution was available, the al- gorithm gave an identical answer. © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 59 Czech Technical University in Prague Acta Polytechnica Vol. 45 No. 2/2005 D Freq (MHz) #Slices #Slices per 1 bit Point addition (clocks) (�s) 1 122.489 451 2.51 2582 21.08 2 102.533 544 3.02 1412 13.77 3 101.502 632 3.51 1022 10.07 4 100.492 724 4.02 827 8.23 5 97.069 815 4.53 710 7.31 6 94.153 903 5.02 632 6.71 9 64.008 1174 6.52 502 7.84 10 61.054 1265 7.03 476 7.80 12 54.174 1480 8.22 437 8.07 15 58.042 1798 9.99 398 6.86 18 44.607 2026 11.26 372 8.34 20 40.955 2248 12.49 359 8.77 Table 1: Implementation of a modified multiplier/inverter in Xilinx Virtex300 � Any realized rotation si in an optimum solution was iden- tical to some given rotation ki, although even slightly sub-optimum solutions did not have this property. � The left side of Equation 6 was less than m in all optimum and some sub-optimum solutions. � When the above observation was exploited to simplify the evaluation procedure, the search space became discon- nected, and more time was needed to achieve equivalent results. � Neither brute force nor the described algorithm gave any optimum or sub-optimum solution violating the tentative assumption in Subsection 5.2. � With a population size of 100, the algorithm required circa 3000 generations to converge at m � 160, rising to 4000 at m � 250. � Infeasible individuals were rare. � The running time was below 20 minutes on an of- fice-grade PC. Table 2 illustrates the influence of the multiplier size on the shifter. The results were obtained for m � 163, where the required set of rotations is {1, 2, 5, 10, 20, 40, 81}. The area of the multiplier was AMUX(n) � 1.5 n and the critical path of the unit was outside the shifter. The effect of adding one rotation block (i.e. k � 2) is il- lustrated in Fig. 5. The expansion of multiplexers did not influence the clock period, as they did not lie on the critical path. The new design is systematically faster. For D � 6, the speedup is over 20 %, while the area increased by 10 %. Recall that the speedup is based on minimization of the last term in Eq. 7. In our case, this mechanism caused the second local minimum on the original curve to prevail, where the opti- mum digit width is D � 15 for m � 180. In this case the actual speedup is 37 %. 7 Conclusions A pipelined version of the Massey-Omura multiplier mod- ified for easy implementation of ITT inversion algorithm has been presented. The performance of this multiplier/inverter can be improved by employing digit-serialization and by speeding up the iterative squarings. The multiplier/inverter has been implemented in Xilinx Virtex 300. Without speeding up the iterative squarings, the shortest computation time has been obtained for digit width D � 6. The use of “long distance” rotation blocks further speeded up the design and benefited higher digit widths. References [1] Agnew, G. B., Mullin, R. C., Onyszchuk, I. M., Vanstone, S. A.: “An Implementation for a Fast Public-Key Crypto- system,” Journal of Cryptology, Vol. 3 (1999), p. 63–79. [2] IEEE 1363. Standard for Public-key Cryptography, IEEE 2000. [3] Itoh, T., Teechai, O., Tsujii, S.: “A Fast Algorithm for Computing Multiplicative Inverses in GF (2t) Using Normal Bases,” J. Society for Electronic Communications (Japan), Vol. 44 (1986), p. 31–36. [4] Leong P. H. W., Leung, K. H.: “A Microcoded Elliptic Curve Processor Using FPGA Technology,” IEEE Trans- actions on VLSI Systems, Vol. 10, No. 5, Oct. 2002, p. 550–559. [5] Savas, E., Koc, C. K.: “Architectures for Unified Field Inversion with Applications“. In: Elliptic Curve Cryptog- raphy. The 9th IEEE International Conference on Elec- tronics, Circuits and Systems – ICECS 2002, Dubrovnik, Croatia, September 15–18, 2002, Vol. 3, p. 1155–1158. [6] Massey, J., Omura, J.: “Computational Method and Ap- paratus for Finite Field Arithmetic,” U.S. patent number 4,587,627, 1986. [7] Blake, I., Seroussi, G., Smart, N.: “Elliptic Curves in Cryptography”, Chapter 1. Cambridge University Press, Cambridge (UK), 1999. [8] Wall, M.: “Galib, A C�� Library of Genetic Algorithm Components” [Online] Available from http://sourceforge.net/projects/galib/ Ing. Jan Schmidt, Ph.D. phone: +420 224 357 473 e-mail: schmidt@fel.cvut.cz Ing. Martin Novotný phone: +420 224 357 261 e-mail: novotnym@fel.cvut.cz Department of Computer Science and Engineering Czech Technical University in Prague Faculty of Eletrical Engineering Karlovo nám. 13 121 35 Praha 2, Czech Republic 60 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 45 No. 2/2005 Czech Technical University in Prague Multiplier Shifter digit width A0 T0 A T rotations si – 0 0 976 10 1, 5, 20, 81 1 489 1956 244 159 1 6 2934 326 488 24 1, 10 Table 2: Shifters adjusted to different multipliers Time of point addition m = 180 0 5 10 15 20 25 0 5 10 15 20 25 Digit width D u s k=1 Fig. 5: Time of point addition for different digit widths. One rota- tion block (k � 1) and two rotation blocks (k � 2) used