Instruction FACTA UNIVERSITATIS Series: Electronics and Energetics Vol. 28, N o 4, December 2015, pp. 657 - 671 DOI: 10.2298/FUEE1504657B A LATENCY OPTIMIZED BIASED IMPLEMENTATION STYLE WEAK-INDICATION SELF-TIMED FULL ADDER  Padmanabhan Balasubramanian School of Computer Engineering, Nanyang Technological University, Singapore Abstract. This article presents a biased implementation style weak-indication self-timed full adder design that is latency optimized. The proposed full adder is constructed using the delay-insensitive dual-rail code and adheres to the 4-phase handshaking. Performance comparisons of the proposed full adder vis-à-vis other strong and weak- indication full adders are done on the basis of a 32-bit self-timed ripple carry adder architecture, with the full adders and ripple carry adders realized using a 32/28nm CMOS process. The results show that the proposed full adder leads to reduction in latency by 63.3% against the best of the strong-indication full adders whilst reporting decrease in area by 10.6% and featuring comparable power dissipation. On the other hand, when compared with the existing optimized weak-indication full adder, the proposed full adder is found to minimize the latency by 25.1% whilst causing an increase in area by just 1.6%, however, with no associated power penalty. Key words: self-timed design, full adder, RCA, indication, standard cells, CMOS 1. INTRODUCTION Self-timed design, which constitutes a robust flavor of asynchronous design, is considered to be a viable alternative and/or a necessary supplement to mainstream synchronous design by the Semiconductor Industry Association [1] due to several reliability and variability issues, which have become prominent in the nanoscale electronics regime. Random dopant fluctuations, sub-wavelength lithography, high heat flux, electro-migration, hot carrier effects, negative bias temperature instability, stress- induced variation, electrostatic discharge, process-induced defects, and metrology and other manufacturing defects [2] are complicated issues which have become more pronounced in the nanoelectronics are compared to the microelectronics era and are indeed difficult to deal with. To circumvent these issues, various material-level, device- level, process-level, circuit-level and system-level solutions have been developed and further developments are also underway [1]. Received April 06, 2015; received in revised form June 08, 2015 Corresponding author: Padmanabhan Balasubramanian School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798 (e-mail: balasubramanian@ntu.edu.sg) 658 P. BALASUBRAMANIAN At the circuit/system-level, self-timed design has been drawing sustained interest from the research community over the past decades due to several inherent advantages such as low noise [3] and almost nil electro-magnetic interference (EMI) [4], greater modularity [5], ability to cope with process, temperature, and parametric variations with ease thus being inherently adaptive [6] [7], consuming power only when and where active [5] [8], and being self-checking [9]. Low noise and EMI compatibility imply that self-timed circuits are innately resistant to side channel attacks [10] [11] and are therefore preferable for secure banking and financial and other sensitive applications. Modularity, also known as design reusability, and the capacity to tolerate process, temperature and parametric uncertainties imply that self-timed circuits are well positioned to deal with statistical timing analysis and reliability issues whilst delivering an average case performance. Due to the consumption of power only on-demand, depending on when and where required, self-timed circuits/systems form a natural choice for ultra low power VLSI designs where complimentary design strategies such as multiple supplies, multiple thresholds, and dynamic voltage and/or frequency scaling may be deployed to leverage the maximum benefits from a self-timed design. Being self-checking, self-timed circuits/systems conform to the design-for-testability paradigm although complexities may be involved in the testability aspect of certain asynchronous elements, for example the C-element which incorporates feedback; nevertheless some feasible approaches are reported [12] [13]. Asynchronous designs are primarily classified as bundled-mode and input-output mode, and here, we only consider the input-output mode, which is the robust among the two as it employs unbounded delay models for components (gates) and/or interconnect [14]. Asynchronous circuits/systems corresponding to input-output mode are commonly referred to as self-timed circuits/systems [15]. The fundamental architecture of a self- timed system is shown in Figure 1, which has a centrally located function block. The function block in a self-timed system is equivalent to the combinational logic of a synchronous system, with the exceptions of being realized using delay-insensitive codes, and being bestowed with the responsibility of not only having to produce the correct outputs subject to the applied inputs but also should signal the completion of internal data computation. Thus the function block forms the heart of a self-timed system that performs data processing. In this article, the term ‘function block’ may refer to an arithmetic element, say the full adder, or a sub-system, for example, a ripple carry adder (RCA). The self-timed system, portrayed in Figure 1, utilizes delay-insensitive codes (here, dual-rail code) for data representation, communication and processing, and the 4-phase return-to- zero (RTZ) handshaking. The dual-rail code is the simplest member of the generic family of delay-insensitive m-of-n codes [16], where m wires are asserted ‘high’ (i.e. binary 1) out of a total of n wires to represent binary data. In a dual-rail code, a data wire D is encoded using two wires viz. D0 and D1, where D = 1 is represented by D1 = 1 and D0 = 0, and D = 0 is represented by D0 = 1 and D1 = 0. When D1 and D0 signify a binary value of 0 or 1 according to the assignments mentioned, it is called ‘valid data’. The state of both D0 and D1 being equal to 0 is referred to as the ‘spacer’. It may be noted that both D0 and D1 cannot simultaneously transition to 1 as it is illegal and invalid since the coding scheme adopted is unordered [17], where no codeword should form a subset of another codeword. A Latency Optimized Biased Implementation Style Weak-indication Self-timed Full Adder 659 Referring to Figure 1, the 4-phase handshake protocol is explained as follows 1 . The dual-rail data bus that feeds the current stage register is initially in the spacer state, and the acknowledge input (ACKIN) for the current stage register is high (binary 1), since the acknowledge output (ACKOUT) provided by the next stage register is low (binary 0). The current stage register now transmits a codeword (i.e. valid data). This results in low to high transitions on the bus wires (i.e. any one of the rails of all the dual-rail signals is asserted as binary 1) feeding the function block. After the next stage register receives a codeword, subsequent to data processing in the function block, it drives the ACKOUT wire to binary 1, and the ACKIN wire assumes binary 0. The current stage register waits for the ACKIN signal to become 0 and then resets the data bus, i.e. the data bus feeding the function block is driven to the spacer state. After an unbounded but finite and positive amount of time taken for resetting the function block and the passage of spacer data to the following register stage, the next stage register drives the ACKOUT (ACKIN) to 0 (1). A data transaction is now said to be complete, and the system is ready to proceed with the next transaction. The application of data in the self-timed system depicted by Figure 1 follows the sequence: valid data – spacer – valid data – spacer, and so forth. Next stage register Completion detector Completion detector Current stage register ACKOUT ACKOUT A C K IN A C K IN Function block Data bus Data bus Fig. 1 Standard self-timed system architecture employing delay-insensitive data encoding and 4-phase handshaking 2. FUNCTION BLOCKS – CLASSIFICATION AND TIMING BEHAVIOR Self-timed function blocks are classified as strongly indicating and weakly indicating [18] depending on the manner in which they indicate (i.e. acknowledge) the arrival of the primary inputs. The differences between the properties of strong and weak-indication function blocks are explained using the illustrative timing diagram shown as Figure 2. 2.1. Strong-indication function block A strongly indicating function block [19] waits for all the valid/spacer primary inputs to arrive and then starts to compute and produce the desired valid/spacer primary outputs. The strong input-output conditions are given as follows: 1 This explanation remains valid for any delay-insensitive data encoding scheme. 660 P. BALASUBRAMANIAN  All the primary inputs must attain the valid/spacer state before any primary output attains the valid/spacer state  All the primary outputs must have attained the valid/spacer state before any primary input attains the spacer/valid state 2.2. Weak-indication function block A weakly indicating function block [20] is capable of producing valid/spacer primary outputs subsequent to the arrival of just a subset of the valid/spacer primary inputs. However, the production of at least one valid/spacer primary output is delayed until all the valid/spacer primary inputs have arrived. Reference [21] discusses two kinds of weak- indication function block implementations: i) distributed implementation [22], where the task of indicating the primary inputs is shared between the primary outputs, and ii) biased implementation [18] [23], where the responsibility of primary inputs indication is just delegated to a single primary output. The weak input-output conditions are given below:  Some valid/spacer primary outputs are produced subsequent to the arrival of a subset of the valid/spacer primary inputs  All the valid/spacer primary inputs should have arrived before all the respective valid/spacer primary outputs are produced  All the valid/spacer primary outputs should have been produced before any subsequent spacer/valid primary input(s) arrive Inputs arrived All None All None Outputs producedStrong-indication All None Outputs producedWeak-indication Fig. 2 Input-output behavior of strong and weak-indication function blocks 3. WEAK-INDICATION: BASIC, DISTRIBUTED, AND BIASED IMPLEMENTATIONS Three types of weakly indicating self-timed full adder implementations viz. basic, distributed, and biased implementations are discussed in this section. 3.1. Basic implementation The weakly indicating DIMS full adder [24], shown in Figure 3, is an example of the basic weak-indication implementation style. The circuit consists of two levels, with C- elements realizing the product terms in the first level and the OR gates summing up the A Latency Optimized Biased Implementation Style Weak-indication Self-timed Full Adder 661 product terms in the second level. Only one product term will be activated to produce the sum or carry outputs thus satisfying the monotonic cover constraint [14], which requires that the product terms comprising a Boolean function should be mutually disjoint [25] [26]. The C-element 2 is highlighted by the circle with the marking ‘C’ on its periphery. A1, A0, B1, B0, CIN1 and CIN0 represent the dual-rail primary inputs, while SUM1, SUM0, COUT1 and COUT0 represent the dual-rail primary outputs. A1 and A0 represent the augend, and B1 and B0 represent the addend inputs. The logic equations governing the full adder are, SUM1 = A0B0CIN1 + A0B1CIN0 + A1B0CIN0 + A1B1CIN1 (1) SUM0 = A0B0CIN0 + A0B1CIN1 + A1B0CIN1 + A1B1CIN0 (2) COUT1 = A0B1CIN1 + A1B0CIN1 + A1B1 (3) COUT0 = A0B1CIN0 + A1B0CIN0 + A0B0 (4) From (1) to (4), it is evident that the sum outputs depend upon all the primary inputs, while the carry outputs may not. When the carry-generation occurs (i.e. A1 = B1 = 1), COUT1 would output binary 1 irrespective of the value of the carry input CIN1 or CIN0. On the other hand, when the carry-kill condition occurs (i.e. A0 = B0 = 1), COUT0 would output binary 1 regardless of the value of the incoming carry signal CIN1 or CIN0. Thus for both valid data and spacer, while the sum outputs have to wait for the arrival of all the primary inputs, the carry outputs may not. However, since the product terms are realized using C-elements for both the sum and carry outputs, the carry-output logic provides an additional acknowledgement for some/all of the primary inputs besides the sum outputs. When the carry-propagate condition (i.e. A1 = B0 = 1 or A0 = B1 = 1) occurs, the arrival of the augend, addend and carry inputs are indicated by both the sum and carry outputs, i.e., the responsibility of indicating the primary full adder inputs is not shared between the sum and carry outputs, neither is the responsibility confined to a single primary output (i.e. the sum or carry output), but rather, multiple acknowledgments tend to manifest for both valid data and spacers. As a result, if the full adder shown in Figure 3 is cascaded to form an n-bit RCA and if the carry signal propagates through a maximum m out of n full adder stages in the RCA, the forward latency and the reverse latency would be specified by O(m). The forward latency signifies the maximum propagation delay incurred in processing the valid data inputs, and the reverse latency denotes the maximum propagation delay encountered for the passage of spacer data inputs (i.e. the time taken for the reset of the self-timed circuit/system) [14]. Hence, the cycle time, which is the time taken for a single data transaction, and computed as the sum of forward and reverse latencies, would be specified by O(2m). 2 The Muller C-element/C-gate basically governs the rendezvous of the input signals. Hence the C-element is also referred to as an ‘input-complete element’. It outputs a 1/0 only if all its inputs are 1s/0s respectively. It retains the existing steady-state in case the inputs are different. 662 P. BALASUBRAMANIAN A0 B0 CIN0 A0 B0 CIN1 A0 B1 CIN0 A0 B1 CIN1 A1 B0 CIN0 A1 B0 CIN1 A1 B1 CIN0 A1 B1 CIN1 A0 B0 A1 B1 COUT0 COUT1 SUM1 SUM0 C C C C C C C C C C Fig. 3 DIMS weak-indication full adder 3.2. Distributed Implementation Martin’s full adder [22], which forms an example for the distributed weak-indication implementation, is shown in Figure 4. Equations (5) to (8) are synthesized by Martin’s full adder using a full-custom static CMOS design involving 42 transistors. In fact, (5) to (8) represent the factorized forms of (1) to (4). The nMOS network realizes the full adder functionality and is activated when the valid data inputs are supplied, while the pMOS network is activated during the application of spacers and resets the full adder during the RTZ phase. SUM1 = (A0B0 + A1B1) CIN1 + (A0B1 + A1B0) CIN0 (5) SUM0 = (A0B0 + A1B1) CIN0 + (A0B1 + A1B0) CIN1 (6) COUT1 = (A0B1 + A1B0) CIN1 + A1B1 (7) COUT0 = (A0B1 + A1B0) CIN0 + A0B0 (8) A Latency Optimized Biased Implementation Style Weak-indication Self-timed Full Adder 663 B1 A1 CIN1 B0 A0 CIN0 B0 A1 CIN1 B1 A0 CIN0 CIN0 CIN0 CIN1 CIN1 Vdd Vdd SUM1 SUM0 A1 A0 B1 B0 B0 A0 A0 B0 CIN0 Vdd COUT0 A1 A0 B1 B0 CIN1 Vdd COUT1 A1 B1 A1 B1 (a) SUM logic (b) CARRY OUTPUT logic Fig. 4 Martin’s weak-indication full adder When valid data inputs are supplied, the operation of the Martin’s full adder is identical to the weakly indicating DIMS full adder discussed previously. Therefore the forward latency of an n-bit self-timed RCA employing the Martin’s full adder would be specified by O(m), where m signifies the maximum length of the carry chain activated in the RCA. However when spacer data are applied, the carry output is reset through the spacer states of augend and addend inputs, while the sum output is reset subsequent to the arrival of the carry input as well. Therefore, the responsibility of indicating the primary inputs is distributed between the primary outputs (viz. sum and carry outputs) in the Martin’s full adder, especially when spacer data are applied during the RTZ phase. Since the carry signal alone may be required to propagate through the full adders in an n-bit RCA, subject to the propagate mode becoming active, the Martin’s full adder paves 664 P. BALASUBRAMANIAN the way for a fast and simultaneous reset of all the intermediate carry outputs in the RCA by involving just one full adder delay, soon after the corresponding augend and addend inputs of the full adders have become spacers regardless of the carry input. Once the carry output of a k th stage full adder attains a spacer which in turn serves as the carry input for the (k + 1) th stage full adder, the sum output of the (k + 1) th stage full adder in the RCA would also become a spacer which entails another full adder delay. Therefore the reverse latency of the n-bit self-timed RCA employing a cascade of Martin’s full adders is not data-dependent unlike the previous case but is just a constant, which is approximately equal to two full adder delays. As a result, the cycle time of the n-bit RCA incorporating a cascade of Martin’s full adders is specified by O(m + 2). 3.3. Biased implementation Seitz’s weakly indicating full adder design [18], shown in Figure 5, constitutes a good example for the biased implementation style, and synthesizes (1) to (4). Note that the full adder shown in Figure 5 is similar in many aspects to that portrayed by Figure 3 with a few exceptions: i) the product terms are realized using AND gates in Figure 5 instead of the C-elements in Figure 3, and ii) the intermediate sum outputs viz. intsum1 and intsum0 are combined with the output of the 6-input OR gate (org) that logically sums up the dual- rail input signals to produce the primary sum outputs viz. SUM1 and SUM0. The carry output logic of Seitz’s weak-indication full adder may utilize the carry-generate or the carry-kill condition similar to that of the DIMS weak-indication full adder or the Martin’s full adder. To explain the biased approach prevalent in the design of the Seitz’s weak- indication full adder, let us two consider two example scenarios. When valid data inputs are applied to the full adder shown in Figure 5, and assuming that the carry propagates, one of the 3-input AND gates present in the first-level of the full adder would transition to 1, which will cause a similar transition on intsum1 or intsum0. Even with a single dual-rail primary input transitioning to 1, org would transition to 1. However with isochronic fork assumptions [27] imposed on the primary inputs, it is implied that the low to high transitions on the primary inputs of the AND gate are simultaneously accompanied by similar transitions on the inputs of the 6-input OR gate. The isochronic fork implies that when a transition arrives on one branch of a node and is acknowledged, the transitions on all other branches of the same node are also assumed to have arrived at the same time and hence they are considered to be acknowledged. Subsequently, the low to high transition on intsum1/intsum0 is combined with the transition on org, resulting in the production of a low to high transition on the primary sum output viz. SUM1/SUM0. Notice that the low to high transition on the output of an AND gate would also cause a similar transition on the carry output viz. COUT1/COUT0. In a subsequent RTZ phase, the AND gate which experienced a low to high transition earlier would now output the spacer and this can happen even with anyone of its inputs assuming the spacer state, which leads to the production of a spacer output on COUT1/COUT0, either of which was asserted high previously. A Latency Optimized Biased Implementation Style Weak-indication Self-timed Full Adder 665 A0 B0 CIN0 A0 B0 CIN1 A0 B1 CIN0 A0 B1 CIN1 A1 B0 CIN0 A1 B0 CIN1 A1 B1 CIN0 A1 B1 CIN1 A0 B0 A1 B1 COUT0 COUT1 SUM1 SUM0 C C A1 A0 B1 B0 CIN1 CIN0 intsum1 intsum0 org Fig. 5 Seitz’s weak-indication full adder Let us now consider that the carry-generate mode is active. Under this consideration, when valid data inputs are supplied to the full adder shown in Figure 5, a 3-input AND gate and a 2-input AND gate (which implements ‘A1B1’) would transition to 1. With org also transitioning to 1, either SUM1 or SUM0 would experience a low to high transition, and COUT1 would also experience a low to high transition. Notice here that COUT1 acknowledges the arrival of only the augend and addend inputs, i.e. A1 and B1, and not the carry input CIN1/CIN0. However, intsum1/intsum0 and subsequently SUM1/SUM0 indicates the arrival of the augend and addend inputs as well as the carry input. In the 666 P. BALASUBRAMANIAN following RTZ phase, even with either A1 or B1 becoming a spacer, COUT1/COUT0, which transitioned to 1 earlier, would now assume the spacer state. Thus COUT1/COUT0 may not specifically acknowledge the RTZ of A1 and/or B1, nor is the RTZ of the carry input acknowledged. However, org that indicates the RTZ of those dual-rail primary inputs which experienced a low to high transition previously, when coupled with intsum1/intsum0 results in the RTZ of SUM1/SUM0 respectively. Hence the primary sum output is found to assume the entire responsibility of duly indicating the RTZ of all the primary inputs. This deliberation would equally apply for the carry-kill condition. From the preceding discussions, it may be understood that in the case of Seitz’s weak- indication full adder, the sum output assumes the responsibility of indicating the complete arrival of all the primary inputs subsequent to the application of valid or spacer data inputs and that the carry output is freed from indication constraints; thus there is a bias towards the carry output. Nevertheless, this tends to benefit by paving the way for fast carry propagation between the full adder stages in an n-bit RCA. Also, note that the 6-input OR gate shown as part of Figure 5 cannot be decomposed arbitrarily due to the gate orphan problem [21] [29] that would arise for the application of valid data. The gate orphan implies an unacknowledged transition at a gate output. A1 A0 B1 B0 CIN1 CIN0 org A1 A0 B1 B0 CIN1 CIN0 org int (a) Safe realization (b) Unsafe realization Fig. 6 Naïve decomposition of the 6-input OR gate, potentially causing a gate orphan To explain the gate orphan problem associated with a naïve logic decomposition, consider Figure 6, wherein the 6-input OR gate is decomposed into a 4-input OR gate and a 3-input OR gate. Supposing during a valid data phase CIN1 transitions to 1, output org will transition to 1 in the case of both the realizations shown in Figure 6 without waiting for the low to high transitions to occur on the remainder of the dual-rail inputs viz. A1/A0 and B1/B0. Subsequently, if A1/A0 and B1/B0 also experience transitions, they will not be acknowledged by the OR gate in Figure 6a but they do not give rise to wire orphans since they are considered to be acknowledged by the AND gate(s) present in the first level of Figure 5 through the isochronic fork assumption. Let us revisit the similar scenario of CIN1 transitioning to 1 before A1/A0 and B1/B0 experience low to high transitions with reference to Figure 6b. It can be seen that after CIN1 experiences a low to high transition, the output org will also experience a low to high transition irrespective of any transition occurring on the intermediate output int. Subsequently if A1/A0 and B1/B0 also transition to 1, the internal output int will experience a transition to 1. However, the low to high transition on int will not be acknowledged by the output org, and the unacknowledged transition on the internal gate output (int) is referred to as a gate orphan, which may get eliminated only through sophisticated timing assumptions. Gate orphans are problematic and tend to affect the robustness of a self-timed circuit/system. Therefore, self-timed implementations should be devoid of gate orphans in order to be robust. A Latency Optimized Biased Implementation Style Weak-indication Self-timed Full Adder 667 It can be inferred from [28] that a strong-indication n-bit RCA constructed using strongly indicating full adder blocks has fixed forward and reverse latencies of O(n), and hence exhibits the worst-case cycle time of O(2n). On the other hand, the weak-indication n-bit RCA composed using basic weak-indication full adders has similar forward and reverse latencies of O(m), and hence features a cycle time of O(2m), where m denotes the maximum length of carry propagation in the n-bit RCA. The weak-indication n-bit RCA constructed using the distributed or biased implementation style weak-indication full adders have forward and reverse latencies of O(m) and O(2), and hence the least cycle time of O(m + 2) [28]. Given these, weak-indication realizations are preferable than their strong-indication counterparts for self-timed design of arithmetic circuits. 4. PROPOSED BIASED IMPLEMENTATION STYLE WEAK-INDICATION FULL ADDER The proposed full adder that corresponds to the biased implementation style of weak- indication is depicted in Figure 7, synthesizing (5) to (8) using 4 simple gates viz. 2-input OR gates and 10 complex gates. Among the 10 complex gates, 8 of them are 2-input Muller C-elements, where a 2-input C-element is realized using an AO222 cell with feedback, and the remaining are AO21 gates. Assuming X and Y are the inputs and Z is the output of a 2-input C-element, Z = XY + (X + Y) Z. Presuming that A and B are the inputs given to the AND logic part of an AO21 gate and with C as its other input, the output of the AO21 gate, say D, is given by D = AB + C. When AB and/or C are equal to 1, D equates to 1; hence the AO21 gate is said to be input-incomplete. In general, with the exception of the C-gate, all other logic gates tend to exhibit input-incomplete behavior. It can be seen in Figure 7 that the product terms corresponding to the sum logic are realized using input-complete C-elements. Hence the sum output would indicate the complete arrival of the entire primary inputs viz. augend, addend and carry inputs for both valid and spacer data. On the other hand, the carry output logic is realized using a mix of input-complete C-elements and input-incomplete complex gates. Since the sum output fully indicates the arrival of all the primary inputs during the valid and spacer data phases, the carry output at the best provides multiple acknowledgments for the arrival of valid or spacer data on the augend and addend inputs and/or the carry input. Hence the proposed full adder features a bias toward the carry output in terms of relaxing its indication constraints, and the advantages associated with such an implementation in terms of less forward and reverse latencies and cycle time have been articulated earlier. To elaborate on this, let us consider the following:  Carry-propagate mode: Once the primary inputs assume valid data states, internal outputs intm1 or intm2 and intm3, shown in Figure 7, would transition to 1. Depending on whether CIN1 or CIN0 experiences a low to high signal transition, a similar transition is reflected on the corresponding primary output, COUT1 or COUT0. When spacer data are applied subsequently, even with intm1 or intm2 and intm3 becoming a spacer, the carry output which transitioned to 1 earlier would now be reset regardless of the carry input becoming a spacer  Carry-generate or Carry-kill mode: When valid data are supplied through the primary inputs, an intermediate output intm4 or intm5, highlighted in Figure 7, makes a low to high signal transition, which is followed by a similar transition on 668 P. BALASUBRAMANIAN COUT1/COUT0 respectively. This could occur regardless of a transition on the carry input. Subsequently, when spacer data are applied on the primary inputs, intm4 or intm5 which transitioned to 1 earlier would now assume the spacer state, which is acknowledged by the respective carry output. This could also happen irrespective of the carry input becoming a spacer C C A0 B1 A1 B0 C C A1 B1 A0 B0 C C C C AO21 COUT1 CIN1 AO21 COUT0 CIN0 SUM1 SUM0 CIN0 CIN0 CIN1 CIN1 intm1 intm2 intm3 intm4 intm5 Fig. 7 Proposed weak-indication full adder 5. SIMULATION RESULTS AND DISCUSSION A number of 32-bit self-timed RCAs were constructed in a semi-custom design fashion at the gate-level by utilizing the various strong and weak-indication full adders separately. The structural integrity of the different gate-level self-timed full adders was preserved during the physical realization (technology mapping) to pave the way for a legitimate comparison, and they were implemented using the elements of a 32/28nm CMOS cell library [30]. The 2-input C-element was alone designed manually using the AO222 cell with feedback and was made available to realize the self-timed designs, and the 3-input C-elements were decomposed safely into 2-input C-elements using the method of [29]. The self-timed RCAs comprise the function block, the input registers, and the completion detection circuit. The input registers and the completion detector part of the various RCAs are identical, and only the function blocks differ. Hence the differences between the simulation results of the various RCAs can be attributed to the differences between their constituent full adders. More than 1000 random input vectors were supplied to the RCAs at time intervals of 20ns through test benches in order to capture the A Latency Optimized Biased Implementation Style Weak-indication Self-timed Full Adder 669 switching activities. The .vcd files generated were subsequently used for power estimation using Synopsys tools. Since the EDA tool estimates just critical path timing, only the worst-case forward latency was evaluated. Appropriate wire loads were included automatically whilst performing the simulations. As part of the advanced timing analysis, a virtual clock was used just to constrain the input and output ports of the RCAs, and it did not consume any power. The power, (forward) latency, and area results obtained for the various 32-bit RCAs are shown in Table 1. The indication type of each full adder is highlighted in the 1 st column of Table 1. The area of the RCAs and the respective full adders are given before and after the semicolon in the 4 th column. The gates present in the critical path of the different RCAs are mentioned in the 5 th column. The simulation results correspond to a typical case specification (1.05V, 25⁰ C) of the 32/28nm CMOS process [30]. The primary sum and carry outputs of the RCAs possess fanout-of-4 drive strength. Table 1 Power, latency and area parameters of different 32-bit self-timed RCAs incorporating distinct full adders Full adder and its indication type Power (µW) Latency (ns) RCA; Full adder area (µm 2 ) Critical path elements Singh [31] – Strong DIMS [24] – Strong DIMS [24] [14] – Weak Toms [32] – Strong Folco et al. [33] – Weak SSSC [23] – Weak Toms & Edwards [34] – Weak Proposed – Weak 2190 2181 2177 2172 2171 2174 2192 2171 14.61 9.26 8.24 9.04 7.00 4.43 9.66 3.32 2529; 54.64 2504.60; 53.88 2423.27; 51.34 2293.14; 47.27 2016.63; 38.63 2097.96; 41.17 2642.85; 58.20 2049.16; 39.65 2 CE2, 2 OR3 CE2, OR4 CE2, OR3 CE2, 2 OR2 CE2, OR2 AO222 AND2, CE2, OR3 AO21 CE2: 2-input C-element; AND2: 2-input AND gate; OR2/3/4: 2/3/4-input OR gate; AO222 and AO21 are complex gates The reason for the differences in the latency figures of the various RCAs is due to the different logical operators found in their critical paths, as mentioned in Table 1. It can be seen in Table 1 that the 32-bit RCA incorporating the proposed full adder features the least latency of 3.32ns among its counterparts – thanks to the AO21 cell used for implementing the carry output logic of the proposed full adder. With respect to power dissipation, the 32-bit RCA featuring the Folco et al.’s full adder is comparable with that incorporating the proposed full adder since both these dissipate a similar average power of 2171µW. It can be seen that the total power dissipation does not vary much across the different RCAs, although the variations in area are quite significant. This is because self- timed designs have a unique signal propagation path for each input pattern unlike synchronous designs as they adhere to the monotonic cover constraint [14]. In terms of area, the 32-bit RCA incorporating Folco et al.’s full adder occupies the least area while the 32-bit RCA constructed using the proposed full adder occupies more Silicon by just 1.6%. However, the latter enables considerably less latency by 52.6% compared to the former. Also, note that this latency reduction is achieved not at the expense of any extra power dissipation for the latter compared to the former. The SSSC full adder which is a gate-level design features a carry output logic that is similar to the carry output logic of 670 P. BALASUBRAMANIAN Martin’s full adder which is a transistor-level design. Hence the SSSC full adder was considered as a substitute for the Martin’s full adder in implementing the RCA as both these have similar latencies and cycle time metrics. The weak-indication SSSC full adder corresponds to the biased implementation style similar to that of the proposed full adder, and the proposed full adder leads to reduced latency by 25.1% than the SSSC full adder for a 32-bit RCA implementation with no penalty in terms of power or area parameters. 6. CONCLUSION A new full adder design that corresponds to the biased implementation style of weak- indication was presented. For an n-bit RCA realized using the proposed full adder, the forward and reverse latencies and cycle time are specified by O(m), O(2), and O(m + 2) respectively, where m denotes the maximum number of full adder stages in the n-bit RCA through which the carry propagates. Example 32-bit self-timed RCA implementations incorporating different strong and weak-indication full adders were analyzed. Against the best of the strong-indication full adders, the proposed full adder reports respective reductions in latency and area by 63.3% and 10.6% whilst dissipating similar power. On the other hand, in comparison with the existing optimized weak-indication full adder, the proposed full adder achieves reduced latency by 25.1% with no power penalty albeit at a small area expense of 1.6%. Overall, from a combined power-latency-area perspective, the proposed full adder is found to yield optimum quality-of-results. REFERENCES [1] Semiconductor Industry Association’s ITRS report. Available: http://www.itrs.net [2] S. Kundu and A. Sreedhar, Nanoscale CMOS VLSI Circuits: Design for Manufacturability, McGraw- Hill, USA, 2010. [3] N.C. Paver, P. Day, C. Farnsworth, D.L. Jackson, W.A. Lien and J. Liu, "A low-power, low noise, configurable self-timed DSP", In Proceedings of the 4 th International Symposium on Advanced Research in Asynchronous Circuits and Systems, 1998, pp. 32-42. [4] G.F. Bouesse, G. Sicard, A. Baixas and M. Renaudin, "Quasi delay insensitive asynchronous circuits for low EMI", In Proceedings of the 4 th International Workshop on Electromagnetic Compatibility of Integrated Circuits, 2004, pp. 27-31. [5] C.H. Van Kees Berkel, M.B. Josephs and S.M. Nowick, "Scanning the technology applications of asynchronous circuits", In Proceedings of the IEEE, vol. 87, pp. 223-233, 1999. [6] K.J. Kulikowski, V. Venkataraman, Z. Wang, A. Taubin and M. Karpovsky, "Asynchronous balanced gates tolerant to interconnect variability", In Proceedings of the IEEE International Symposium on Circuits and Systems, 2008, pp. 3190-3193. [7] I.J. Chang, S.P. Park and K. Roy, "Exploring asynchronous design techniques for process-tolerant and energy-efficient subthreshold operation", IEEE Journal of Solid-State Circuits, vol. 45, pp. 401-410, 2010. [8] O.C. Akgun, J. Rodrigues and J. Sparsø, "Minimum-energy sub-threshold self-timed circuits: design methodology and a case study", In Proceedings of the 16 th IEEE International Symposium on Asynchronous Circuits and Systems, 2010, pp. 41-51. [9] I. David, R. Ginosar and M. Yoeli, "Self-timed is self-checking", Journal of Electronic Testing: Theory and Applications, vol. 6, pp. 219-228, 1995. [10] Z.C. Yu, S.B. Furber and L.A. Plana, "An investigation into the security of self-timed circuits", In Proceedings of the 9 th International Symposium on Asynchronous Circuits and Systems, 2003, pp. 206-215. [11] D. Sokolov, J. Murphy, A. Bystrov and A. Yakovlev, "Design and analysis of dual-rail circuits for security applications", IEEE Transactions on Computers, vol. 54, pp. 449-460, 2005. http://www.itrs.net/ A Latency Optimized Biased Implementation Style Weak-indication Self-timed Full Adder 671 [12] D. Koppad and A. Efthymiou, "BIST for strongly-indicating asynchronous circuits", In Proceedings of the 17 th IFIP International Conference on Very Large Scale Integration, 2009, pp. 215-218. [13] A. Efthymiou, "Initialization-based test pattern generation for asynchronous circuits", IEEE Transactions on VLSI Systems, vol. 18, pp. 591-601, 2010. [14] J. Sparsø and S. Furber (Editors), Principles of Asynchronous Circuit Design: A Systems Perspective, Kluwer Academic Publishers, Netherlands, 2001. [15] Balasubramanian Padmanabhan, "Self-timed logic and the design of self-timed adders", PhD thesis, School of Computer Science, The University of Manchester, UK, 2010. [16] T. Verhoeff, "Delay-insensitive codes – an overview", Distributed Computing, vol. 3, pp. 1-8, 1998. [17] B. Bose, "On unordered codes", IEEE Transactions on Computers, vol. 40, pp. 1-8, 1988. [18] C.L. Seitz, "System Timing", Introduction to VLSI Systems, C. Mead and L. Conway (Editors), pp. 218- 262, Addison-Wesley, Reading, Massachusetts, USA, 1980. [19] P. Balasubramanian and D.A. Edwards, "Efficient realization of strongly indicating function blocks", In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2008, pp. 429-432. [20] P. Balasubramanian and D.A. Edwards, "A new design technique for weakly indicating function blocks", In Proceedings of the 11 th IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems, 2008, pp. 116-121. [21] C. Jeong and S.M. Nowick, "Block-level relaxation for timing-robust asynchronous circuits based on eager evaluation", In Proceedings of the 14 th IEEE International Symposium on Asynchronous Circuits and Systems, pp. 95-104, 2008. [22] A.J. Martin, "Asynchronous datapaths and the design of an asynchronous adder", Formal Methods in System Design, vol. 1, pp. 117-137, 1992. [23] P. Balasubramanian and D.A. Edwards, "A delay efficient robust self-timed full adder", In Proceedings of the IEEE 3 rd International Design and Test Workshop, 2008, pp. 129-134. [24] J. Sparsø and J. Staunstrup, "Delay-insensitive multi-ring structures", Integration, the VLSI Journal, vol. 15, pp. 313-340, 1993. [25] P. Balasubramanian and D.A. Edwards, "Self-timed realization of combinational logic", In Proceedings of the 19 th International Workshop on Logic and Synthesis, 2010, pp. 55-62. [26] P. Balasubramanian, R. Arisaka and H.R. Arabnia, "RB_DSOP: a rule based disjoint sum of products synthesis method", In Proceedings of the 12 th International Conference on Computer Design, 2012, pp. 39-43. [27] A.J. Martin, "The limitation to delay-insensitivity in asynchronous circuits", In Proceedings of the 6 th MIT Conference on Advanced Research in VLSI, 1990, pp. 263-278. [28] P. Balasubramanian and N.E. Mastorakis, "Timing analysis of quasi-delay-insensitive ripple carry adders – a mathematical study", In Proceedings of the 3 rd European Conference of Circuits Technology and Devices, 2012, pp. 233-240. [29] P. Balasubramanian and N.E. Mastorakis, "QDI decomposed DIMS method featuring homogeneous/ heterogeneous data encoding", In Proceedings of the International Conference on Computers, Digital Communications and Computing, 2011, pp. 93-101. [30] Synopsys Digital Standard Cell Library SAED_EDK32/28_CORE Databook, Revision 1.0.0, 2012. [31] N.P. Singh, "A design methodology for self-timed systems", M.Sc. thesis, MIT Laboratory for Computer Science Technical Report TR-258, 1981. [32] W.B. Toms, "Synthesis of quasi-delay-insensitive datapath circuits", PhD thesis, School of Computer Science, The University of Manchester, UK, 2006. [33] B. Folco, V. Bregier, L. Fesquet and M. Renaudin, "Technology mapping for area optimized quasi delay insensitive circuits", In Proceedings of the IFIP International Conference on Very Large Scale Integration, 2005, pp. 146-151. [34] W.B. Toms and D.A. Edwards, "A complete synthesis method for block-level relaxation in self-timed datapaths", In Proceedings of the 10 th International Conference on Application of Concurrency to System Design, 2010, pp. 24-34.