Advances in Technology Innovation, vol. 4, no. 3, 2019, pp. 197-209

Performance Evaluation for Stacked-Layer Data Bus Based on Isolated Unit-Size Repeater Insertion

Chia-Chun Tsai *
Department of Computer Science and Information Engineering, Nanhua University, Chiayi, Taiwan
Received 13 March 2019; received in revised form 11 April 2019; accepted 10 May 2019

Abstract
The data bus of a stacked-layer chip carries the data of a program at many different timing periods, and the average data access time of the bus over these timing periods dominates the program performance. In this paper, we propose an evaluation approach that reconstructs a 3D data bus with inserted unit-size repeaters to determine whether the average data access time of the bus over a complete timing period can be reduced by at least 10%. The approach inserts unit-size repeaters into bus wires along the path of each source-sink pair to isolate extra capacitive loadings at each timing period and thereby reduce the access time. This process is repeated until no access time can be improved further. Each inserted repeater has just one unit size because chip area is limited and the data bus should need only a minor reconstruction in practice. The approach has the advantages of uniform repeater insertion, less extra area occupation, and a simplified time-to-space tradeoff. Experimental results show that our approach evaluates a stacked-layer data bus within one millisecond, and that the saving in average access time is 50.81% on average with 70 inserted unit-size repeaters on average.

Keywords: stacked-layer chip, 3D data bus, unit-size repeater, average access time

1. Introduction
For a stacked-layer chip [1], each layer has its own local data bus, and a number of TSVs (Through Silicon Vias) vertically connect these local data buses to integrate them into a 3D global data bus.
The 3D global data bus consists of a number of 2D local data buses. When multiple programs are executed, data frequently travel on the 2D local data buses or on the 3D global data bus. The data access time at a timing period is defined as the propagation delay from a source to at least one sink. For a program with hundreds or thousands of timing periods, its average access time is defined as the total data access time divided by the number of timing periods. The average access time dominates the program performance.

[Fig. 1 Data access on the source-sink pair p1-p6 of a 2D local data bus: (a) extra loading capacitances C2 to C5; (b) extra capacitive loadings are reduced by inserted repeaters]

* Corresponding author. E-mail address: chun@nhu.edu.tw Tel.: +886-5-2721001#5030

In nanometer technologies, a long interconnection wire dominates the propagation delay more than a gate delay does, because of its growing wire resistance and capacitance. Fig. 1(a) shows a 2D local data bus with bidirectional data access between terminals p1 and p6 at two different timing periods. The extra loading capacitances C2, C3, C4, and C5 increase the data access time of the source-sink pair p1-p6. Each extra loading capacitance comes from the branch wire capacitance and the terminal capacitive loading hanging off the path p1-p6 (or p6-p1). As shown in Fig. 1(b), most of these extra capacitive loadings can be isolated by inserting a bidirectional repeater into each branch wire, which reconstructs the data bus. The extra loading capacitances are then dramatically reduced to C'2, C'3, C'4, and C'5, with C'2 < C2, C'3 < C3, C'4 < C4, and C'5 < C5. As a result, the data access time between terminals p1 and p6 in both directions is clearly reduced.
The above concept of reconstructing a data bus can be extended to other source-sink pairs: inserting repeaters into their branch wires isolates unnecessary capacitive loadings and reduces their access times. For a program run on a data bus over a number of timing periods, the average access time can thus be reduced and the program performance improved. However, all the inserted repeaters also occupy extra area. This is the time-to-space tradeoff of a data bus reconstruction, for example, requiring a saving in average access time of at least 10% in exchange for a number of inserted repeater sizes.

Repeater insertion has been widely applied to one-way signal paths to effectively reduce their propagation delay, but few papers have applied repeater insertion to a bidirectional data bus. Ismail et al. [2] proposed repeater insertion for tree-structured RLC interconnects to estimate the path delay and the inductive effects of on-chip interconnects. Lin et al. [3] presented buffer insertion to construct a clock tree on multimode multivoltage islands, using adjustable delay buffers (ADBs) to control the clock delay under a skew bound. Ghoneima [4] introduced the optimal positioning of interleaved repeaters in bidirectional buses, focusing on reducing noise interference between buses. Acton [5] summarized studies of signal repeater insertion in multi-source multi-sink data buses. Daneshtalab et al. [6] proposed a bus isolation strategy for 3D stacked-layer chips with a high-performance inter-layer bus structure (HIBS), which reduces the complexity of bus arbiters and saves propagation delay in data communication. Thakkar et al. [7] introduced an architecture called 3D-Wiz that reduces the interaction overloading between the data buses of DRAMs and thereby reduces the access times among DRAMs. Cho et al. [8] analyzed the system bus considering the TSV interconnections of a stacked-layer SoC (System-on-a-Chip) and found that the maximum throughput of the system bus of a 3D stacked-layer chip depends on the data bandwidth. Mohamed [9] introduced master-slave data access by adding NoCs (Networks-on-Chip) among multiple processors, where a number of data interchange rules limit the access time between processors. Khan et al. [10] analyzed the performance of current NoC simulation tools in terms of latency, throughput, and energy consumption, but the comparison covered only 2D NoCs. Tsai [11-12] first performed repeater insertion and repeater sizing to minimize the propagation delay of a 3D data bus based on the RC delay model, but did not consider the capacitive loading effect of un-accessed local data buses. Tsai [13] created an effective method that combines embedded isolated switches [14-15] with inserted repeaters to reduce the critical access time of a 3D data bus, but did not consider pre-evaluating the average access time for a data bus reconstruction.

Most of the above data bus reconstruction methods insert repeaters and size them as needed to maximally reduce the data access time. These approaches decrease the access time effectively, but the data bus then requires extra area for repeaters of different sizes, which makes it harder to reconstruct a data bus at the post-refinement step of physical design. Finding the optimal solution of this time-to-space tradeoff (data access time minimization vs. repeater locations and sizes) by inserting repeaters into a data bus has been shown to be intractable [16]. How can the data bus of a stacked-layer chip be evaluated in advance for reducing the average access time? Few papers address this topic, which makes it a valuable problem to investigate.
In this work, we propose an approach that evaluates the bus performance by reconstructing a 3D data bus. By inserting unit-size repeaters into a data bus at each timing period, we check whether the average data access time of the bus over a complete timing period can be reduced by at least 10% (here called the basic performance ratio). The approach inserts a number of unit-size repeaters to isolate most of the extra capacitive loadings and thereby reduce the access time of each source-sink pair at different timing periods. This process is repeated until no access time can be improved further. We then estimate the new average access time of the data bus over a complete timing period. If the saving in average access time with the inserted unit-size repeaters is larger than the basic performance ratio, the reconstructed data bus is accepted for reducing the average access time and can be applied to most programs run on the bus. We emphasize that each inserted repeater has just one unit size because chip area is limited and the data bus should need only a minor reconstruction. The evaluation approach has the following advantages: uniform repeaters, less extra occupied area, and a time-to-space tradeoff simplified by a basic performance ratio. The experimental results show that, for most 3D stacked-layer data buses with inserted unit-size repeaters, the average access time of any program can be dramatically reduced.

2. Problem Formulation

2.1. Symbols and definitions

Table 1 lists the symbols and their definitions used throughout this article.
Table 1 Symbols and their definitions

| Symbol | Definition |
|---|---|
| n | The total number of terminals of a 3D data bus |
| q | The total number of bus wires of a 3D data bus |
| n(n-1) | The complete timing period of a 3D data bus |
| pk | The kth terminal on a 3D data bus |
| Tij | The access time from source i to sink j without any inserted unit-size repeaters |
| T'ij | The access time from source i to sink j with inserted unit-size repeaters |
| Tav | The average access time of a 3D data bus without any inserted unit-size repeaters |
| U-Tav | The average access time of a 3D data bus with inserted unit-size repeaters |
| U-size | The number of inserted unit-size repeaters |
| RPk | The kth unit-size repeater |
| rw | The resistance of a unit-length wire |
| cw | The capacitance of a unit-length wire |
| rTSV | The resistance of a TSV |
| cTSV | The capacitance of a TSV |
| rB | The output resistance of a unit-size repeater |
| cB | The input capacitance of a unit-size repeater |
| tB | The intrinsic delay of a unit-size repeater |
| rfg | The resistance of a segment wire (f,g) |
| cfg | The capacitance of a segment wire (f,g) |
| Rdi | The output driving resistance of a source i |
| CLj | The input loading capacitance of a sink j |
| r1 | The resistance of a bus wire l1 |
| c1 | The capacitance of a bus wire l1 |
| c(Tg) | The lumped capacitance of the branch rooted at node g |
| CA | The total capacitance of a wiring area A |
| C'A | The reduced total capacitance of a wiring area A with inserted unit-size repeaters |
| CS | The extra capacitance at a node |
| C'S | The reduced extra capacitance at a node with inserted unit-size repeaters |
| CBUSj | The total capacitance of the jth-layer local bus, e.g., CBUS2 |
| m | The multiple of the wire capacitance c1, e.g., CS = m·c1, m ≥ 0 |

2.2. Problem definition

Fig. 2 extends Fig. 1 to a 3D stacked-layer data bus. A bus with n terminals has at most n(n-1) timing periods, and thus n(n-1) data access times, one for each ordered source-sink pair. These n(n-1) timing periods are called the complete timing period of a data bus.
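The two definitions used throughout the evaluation, the complete timing period n(n-1) and the average access time Tav (the total access time divided by the number of timing periods), can be sketched in C, the language in which the evaluation is reported to be implemented. This is a minimal illustration under our own assumptions, not the author's code: the access times are held in a hypothetical n × n row-major array with T[i*n + j] = Tij, and the names complete_timing_period() and average_access_time() are illustrative.

```c
#include <assert.h>

/* Number of timing periods in a complete timing period: every ordered
 * source-sink pair (i, j) with i != j of an n-terminal bus is accessed once. */
long complete_timing_period(int n)
{
    assert(n >= 2);
    return (long)n * (n - 1);
}

/* Average access time Tav over a complete timing period.
 * T is an n x n row-major matrix where T[i*n + j] is the access time
 * from source i to sink j; the diagonal (i == j) is ignored. */
double average_access_time(const double *T, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j)
                total += T[i * n + j];
    return total / (double)complete_timing_period(n);
}
```

For instance, an 18-terminal bus gives complete_timing_period(18) = 306, which matches the #Peri value reported for Test0r5-nxn in Table 3.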
In general, an executed program has hundreds or thousands of timing periods during which data frequently travel on the data bus, and these timing periods may cover a complete timing period. If most of the data access times of a program at different timing periods can be reduced even a little, its average access time decreases and the program performance improves. Fig. 2(a) shows the bidirectional data access between terminal p4 on layer1 and terminal p16 on layer3 at different timing periods of a 3D stacked-layer data bus. Their data access times, Tp4-p16 and Tp16-p4, include the extra loading capacitances CA, CB, CC, CD, CE, CF, and CBUS2. In particular, the total capacitance of the local bus on layer2, CBUS2, is a large capacitive loading for their data access. As shown in Fig. 2(b), if we insert a number of unit-size repeaters into some bus wires, most of the extra loading capacitances are reduced to C'A, C'B, C'C, C'D, C'E, C'F, and C'BUS2, respectively. Thus, the data access times Tp4-p16 and Tp16-p4 are effectively reduced.

[Fig. 2 Data access on the source-sink pair p4-p16 of a 3D stacked layer (terminals p1 to p18 on Layer1 to Layer3, connected by TSV1 and TSV2): (a) extra loading capacitances; (b) access time is reduced with inserted repeaters]

In Fig. 2(a), the access time Tij (Tji) from source i (j) to sink j (i) along the path of the source-sink pair p4-p16, based on the Elmore π-RC delay model [17], is

$$T_{ij} = R_{di}\,c(T_i) + \sum_{(f,g)\in path(i,j)} r_{fg}\left(\frac{c_{fg}}{2} + c(T_g)\right) \tag{1}$$

where Rdi is the output driving resistance of source i, rfg and cfg are the resistance and capacitance of a bus wire (f,g), respectively, c(Tg) is the lumped capacitance of the branch rooted at node g, and c(Ti) is the total capacitance of the bus rooted at source i. Note that c(Tg) contains the extra capacitive loadings CA, CB, CC, CD, CE, CF, and CBUS2.

Fig. 3 shows the equivalent π-RC circuit, based on the Elmore delay model, of Fig. 2(b) between terminal p4 on layer1 and terminal p16 on layer3, with two TSVs and a number of inserted unit-size repeaters that isolate the extra loading capacitances. In the figure, the extra loading capacitances C11, C12, and C13 on layer1 are isolated by the inserted unit-size repeaters RP11, RP12, and RP13; C21 and C22 on layer2 are isolated by RP21 and RP22; and C31, C32, and C33 on layer3 are isolated by RP31, RP32, and RP33. A unit-size bidirectional repeater consists of two identical buffers, each with input capacitance cB, intrinsic delay tB, and output resistance rB, connected in inverse parallel. The access time is the scaled 50% propagation delay based on the Elmore RC delay model. Like a bus wire, a TSV is modeled by an equivalent RC circuit [18] with resistance rTSV and two half capacitances cTSV/2. The access time T'ij (T'ji) from source i (j) to sink j (i) along the path of the source-sink pair p4-p16 with isolating unit-size repeaters is

$$T'_{ij} = R_{di}\,c(T_i) + \sum_{(f,g)\in path(i,j)} r_{fg}\left(\frac{c_{fg}}{2} + c(T_g)\right), \quad RP(k) \notin path(i,j) \tag{2}$$

where c(Tg) is the lumped capacitance of the branch rooted at node g, including the capacitances of the isolating unit-size repeaters RP(k) that are not on the path.

[Fig. 3 The equivalent π-RC circuit of the source-sink pair p4-p16 shown in Fig. 2(b)]

For a reconstructed data bus with inserted unit-size repeaters, the bus performance is represented by the average access time over a complete timing period. Since each unit-size repeater also contributes input capacitance, output resistance, and intrinsic delay, the access times of all the source-sink pairs with inserted repeaters affect one another. Thus, we need to estimate whether the average access time over a complete timing period decreases for a data bus with inserted unit-size repeaters. That is, we calculate the saving percentage in average access time over a complete timing period between the data bus without and with inserted unit-size repeaters. If the saving exceeds the basic performance ratio, the data bus can be reconstructed with the inserted unit-size repeaters, yielding a good improvement in average access time. Therefore, the problem of evaluating the performance in average access time over a complete timing period for reconstructing the 3D data bus of a stacked-layer chip is defined as follows.
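Eq. (1) can be evaluated on a bus tree by accumulating the branch capacitances c(Tg) bottom-up and then walking from the sink back to the source. The C sketch below is one possible implementation under our own assumptions: the bus is stored in a hypothetical parent array rooted at the source (node 0) with every parent index smaller than its child's index, and elmore_access_time() and MAX_NODES are illustrative names. A TSV can be represented as an ordinary edge with rTSV and cTSV, and an isolated branch in the sense of Eq. (2) can be approximated here by replacing that branch with a single node load equal to the repeater input capacitance cB.

```c
#include <assert.h>

#define MAX_NODES 64 /* illustrative capacity limit for this sketch */

/* csub[k] = c(T_k): lumped capacitance of the branch rooted at node k,
 * i.e., node loads plus wire capacitances at and below k. */
static void subtree_caps(int n, const int *parent, const double *c_edge,
                         const double *c_node, double *csub)
{
    for (int k = 0; k < n; k++)
        csub[k] = c_node[k];
    /* parent[k] < k, so a reverse sweep finishes each child before its parent. */
    for (int k = n - 1; k > 0; k--)
        csub[parent[k]] += csub[k] + c_edge[k];
}

/* Eq. (1): Rd * c(T_0) + sum over path edges (f,g) of r_fg * (c_fg/2 + c(T_g)).
 * Edge (parent[k], k) has resistance r_edge[k] and capacitance c_edge[k]. */
double elmore_access_time(int n, const int *parent, const double *r_edge,
                          const double *c_edge, const double *c_node,
                          double Rd, int j)
{
    double csub[MAX_NODES];
    assert(n >= 1 && n <= MAX_NODES && j >= 0 && j < n);
    subtree_caps(n, parent, c_edge, c_node, csub);

    double t = Rd * csub[0];                /* driver term Rdi * c(Ti) */
    for (int g = j; g != 0; g = parent[g])  /* walk sink -> source */
        t += r_edge[g] * (c_edge[g] / 2.0 + csub[g]);
    return t;
}
```

With a single wire from i to j the function reduces to Eq. (3); for example, Rd = 2, CLi = 1, r1 = 3, c1 = 4, and CS + CLj = 11 (arbitrary units) give 2·16 + 3·13 = 71.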
Given the topology of a stacked-layer data bus with n terminals and q bus wires over a complete timing period (i.e., n(n-1) timing periods), the objective is to evaluate a possible reconstruction of the data bus by inserting unit-size repeaters into the bus wires such that the saving in average access time with the inserted repeaters, relative to the bus without any inserted repeaters, is at least the basic performance ratio, where the basic performance ratio is user-defined, such as 10%.

3. Performance Evaluation of a Stacked-Layer Data Bus

3.1. The estimation of a unit-size repeater insertion

To understand the effect on the data access time of a source-sink pair, we estimate the data access time before and after inserting a unit-size bidirectional repeater into a bus wire. As shown in Fig. 4(a), the access time Tij from source i to sink j along the bus wire l1 can be obtained from the Elmore delay model. If the sink connects to the wire segments of a subtree, the sink carries the extra loading capacitance CS, which increases the access time Tij:

$$T_{ij} = R_{di}\left(C_{Li} + c_1 + C_S + C_{Lj}\right) + r_1\left(\frac{c_1}{2} + C_S + C_{Lj}\right) \tag{3}$$

where r1 and c1 are the resistance and capacitance of the wire l1, respectively, Rdi is the output driving resistance of source i, and CLi and CLj are the input loading capacitances of source i and sink j, respectively.

[Fig. 4 The bus wire l1 with an inserted bidirectional unit-size repeater: (a) extra capacitance CS; (b) CS is reduced to C'S by a unit-size repeater; (c) inserting the repeater into the middle of the bus wire l1]

To reduce the access time Tij, we can insert a unit-size repeater to isolate the subtree wires, which greatly decreases the extra loading capacitance from CS to C'S, C'S < CS, as shown in Fig. 4(b). Eq. (3) then becomes

$$T'_{ij} = R_{di}\left(C_{Li} + c_1 + C'_S + C_{Lj}\right) + r_1\left(\frac{c_1}{2} + C'_S + C_{Lj}\right) \tag{4}$$

As shown in Fig. 4(c), if the bus wire l1 is long enough, the access time T'ij from source i to sink j can be reduced further by inserting a unit-size bidirectional repeater into the middle of l1, and Eq. (4) becomes

$$T'_{ij} = R_{di}\left(C_{Li} + \frac{c_1}{2} + c_B\right) + \frac{r_1}{2}\left(\frac{c_1}{4} + c_B\right) + t_B + r_B\left(c_B + \frac{c_1}{2} + C'_S + C_{Lj}\right) + \frac{r_1}{2}\left(\frac{c_1}{4} + C'_S + C_{Lj}\right) \tag{5}$$

where rB, cB, and tB are the output resistance, input capacitance, and intrinsic delay of a unit-size repeater, respectively. For simplicity, we assume that CS is a multiple of the wire capacitance c1, that is, CS = m·c1, m ≥ 0, and that C'S is the sum of half of the capacitance c1 and the input capacitance cB, i.e., C'S = c1/2 + cB if m > 0 and C'S = 0 if m = 0. If the source and sink are also bidirectional unit-size repeaters, then Rdi = rB and CLi = CLj = cB. The access time Tij from source i to sink j without inserted repeaters is then derived as

$$T_{ij} = r_B\left(2c_B + c_1 + C_S\right) + r_1\left(\frac{c_1}{2} + C_S + c_B\right) = r_B\left(2c_B + c_1 + m c_1\right) + r_1\left(\frac{c_1}{2} + m c_1 + c_B\right) = \left(m + \tfrac{1}{2}\right) r_1 c_1 + r_1 c_B + 2 r_B c_B + (m+1)\, r_B c_1, \quad m \ge 0 \tag{6}$$

As m grows, the access time Tij increases, whereas for m > 0 the access time T'ij keeps a fixed value that is independent of m. If inserting a unit-size repeater into the wire l1 makes T'ij less than Tij, the reduction in access time (Tij - T'ij) is meaningful.
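Eqs. (3) to (6) can be checked numerically. The C sketch below encodes Eq. (3)/(4) and Eq. (5) directly and derives Tij - T'ij for the unit-size case (Rdi = rB, CLi = CLj = cB, CS = m·c1, and C'S = c1/2 + cB for m > 0). The function names are hypothetical, and the units (Ω, pF, ps, µm) follow the convention stated with Eq. (8); reading the Table 2 wire parameters as per-µm values is our assumption.

```c
#include <assert.h>

/* Eqs. (3)/(4): wire l1 (r1, c1) from source i to sink j; Cs is the extra
 * capacitance hanging at sink j (CS in Eq. (3), C'S in Eq. (4)). */
double t_wire(double Rd, double CLi, double CLj,
              double r1, double c1, double Cs)
{
    return Rd * (CLi + c1 + Cs + CLj) + r1 * (c1 / 2.0 + Cs + CLj);
}

/* Eq. (5): a unit-size bidirectional repeater (rB, cB, tB) in the middle
 * of l1, with the reduced extra capacitance Cs2 = C'S at sink j. */
double t_wire_mid_repeater(double Rd, double CLi, double CLj,
                           double r1, double c1, double Cs2,
                           double rB, double cB, double tB)
{
    /* Stage 1: the source drives half the wire and the repeater input cB. */
    double stage1 = Rd * (CLi + c1 / 2.0 + cB) + (r1 / 2.0) * (c1 / 4.0 + cB);
    /* Stage 2: the repeater drives the opposite buffer's cB, half the wire,
     * C'S, and CLj, after its intrinsic delay tB. */
    double stage2 = tB + rB * (cB + c1 / 2.0 + Cs2 + CLj)
                  + (r1 / 2.0) * (c1 / 4.0 + Cs2 + CLj);
    return stage1 + stage2;
}

/* Unit-size case: Rdi = rB, CLi = CLj = cB, CS = m*c1,
 * C'S = c1/2 + cB for m > 0 and 0 for m = 0. Returns Tij - T'ij. */
double unit_size_gain(double m, double r1, double c1,
                      double rB, double cB, double tB)
{
    assert(m >= 0.0);
    double cs2 = (m > 0.0) ? c1 / 2.0 + cB : 0.0;
    return t_wire(rB, cB, cB, r1, c1, m * c1)
         - t_wire_mid_repeater(rB, cB, cB, r1, c1, cs2, rB, cB, tB);
}
```

With rw = 0.1, cw = 0.0002 pF, rB = 122, cB = 0.024 pF, and tB = 17 ps, an unloaded (m = 0) 1000 µm wire becomes slower with a middle repeater (gain of about -17.9 ps), whereas a 3000 µm wire becomes faster (about +22.1 ps).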
Here, we want to know how long the wire l1 must be for an inserted unit-size repeater to effectively reduce the access time. Substituting Rdi = rB and CLi = CLj = cB into Eq. (5) gives

$$T'_{ij} = r_B\left(2c_B + \frac{c_1}{2}\right) + \frac{r_1}{2}\left(\frac{c_1}{4} + c_B\right) + t_B + r_B\left(2c_B + \frac{c_1}{2} + C'_S\right) + \frac{r_1}{2}\left(\frac{c_1}{4} + C'_S + c_B\right) \tag{7}$$

With r1 = rw·l1 and c1 = cw·l1, where the unit of rw and rB is Ω, the unit of cw and cB is pF, the unit of tB is ps, and the unit of l1 is µm, two cases follow.

Case 1: m = 0 (C'S = 0),

$$T_{ij} - T'_{ij} = \frac{r_1 c_1}{4} - 2 r_B c_B - t_B = \frac{r_w c_w l_1^2}{4} - \left(2 r_B c_B + t_B\right) > 0 \tag{8}$$

so the wire length l1 (µm) must satisfy

$$l_1 > 2\sqrt{\frac{2 r_B c_B + t_B}{r_w c_w}} \tag{9}$$

Case 2: m > 0 (C'S = c1/2 + cB),

$$T_{ij} - T'_{ij} = m r_1 c_1 + \left(m - \tfrac{1}{2}\right) r_B c_1 - \frac{r_1 c_B}{2} - 3 r_B c_B - t_B = m r_w c_w l_1^2 - \left(\frac{r_w c_B}{2} + \left(\tfrac{1}{2} - m\right) r_B c_w\right) l_1 - \left(3 r_B c_B + t_B\right) > 0 \tag{10}$$

so the wire length l1 (µm) must satisfy

$$l_1 > \frac{0.5\, r_w c_B + (0.5 - m)\, r_B c_w + \sqrt{\left(0.5\, r_w c_B + (0.5 - m)\, r_B c_w\right)^2 + 4 m\, r_w c_w \left(3 r_B c_B + t_B\right)}}{2 m\, r_w c_w} \tag{11}$$

3.2. The effects of data access time with inserted unit-size repeaters

Because the strategy isolates extra capacitive loadings by inserting unit-size repeaters, a source-sink pair with a shorter path gains a larger reduction in extra capacitance than one with a longer path. For a data bus over a complete timing period, almost all the bus wires receive unit-size repeaters, so the data access time of a source-sink pair with a longer path may increase. Fig. 5(a) extends the data access of the source-sink pair p4-p16 in Fig. 2(b): as many as six inserted unit-size repeaters, RP14, RP15, RP16, RP34, RP35, and RP36, lie along its longer path. Repeaters RP14 and RP16 are inserted to isolate extra capacitive loading for the path p2-p6, RP15 for the path p4-p6, RP34 and RP36 for the path p14-p18, and RP35 for the path p14-p16. Fig. 5(b) shows the equivalent circuit of Fig. 5(a), i.e., the updated bus structure with inserted unit-size repeaters. The access time T'ij from source i to sink j along the path of the source-sink pair p4-p16 with inserted unit-size repeaters is

$$T'_{ij} = R_{di}\,c(T_i) + \sum_{(f,g)\in path(i,j)} r_{fg}\left(\frac{c_{fg}}{2} + c(T_g)\right) + \sum_{RP(x)\in path(i,j)} \left(r_B C_x + t_B\right), \quad RP(k) \notin path(i,j) \tag{12}$$

where RP(k) denotes the isolating unit-size repeaters that are not located on the path p4-p16, RP(x) denotes the inserted repeaters that are located on the path p4-p16, Cx is the capacitance driven by RP(x), and c(Ti) and c(Tg) are measured up to the next repeater on the path.

3.3. The evaluation for reconstructing stacked-layer data bus with inserted unit-size repeaters

The evaluation algorithm Evaluate_Stacked-layer_DataBus_Reconstruction(), given in Fig. 6, evaluates the performance of a stacked-layer data bus reconstructed with inserted unit-size repeaters and solves the problem defined in Section 2. The first step reads a 3D data bus topology and constructs its data structure. Then, we calculate the average access time Tav of the original 3D data bus without any inserted repeaters over a complete timing period using the function Find_AverageAccessTime(), where Tav is defined as the total access time divided by the number of timing periods n(n-1). The for loop in step3 runs over each timing period of a complete timing period and inserts unit-size bidirectional repeaters into the middle of the branch bus wires along the path of each source-sink pair, as estimated by Eqs. (9) and (11), to isolate the branch capacitive loadings; at most one repeater is inserted into the middle of a bus wire. In step4, the new average access time U-Tav of the 3D data bus with inserted unit-size repeaters over a complete timing period is obtained using the same function Find_AverageAccessTime().
Finally, if the saving U-saving in average access time, defined as (Tav - U-Tav) / Tav × 100%, is larger than the basic performance ratio of 10%, the 3D data bus can be reconstructed by inserting the unit-size bidirectional repeaters into the space available within the limited chip area. Otherwise, the reconstruction of the 3D data bus topology is abandoned.

[Fig. 5 A source-sink pair p4-p16 has six inserted unit-size repeaters and its equivalent circuit: (a) six inserted unit-size repeaters (RP14, RP15, RP16, RP34, RP35, RP36) on the path of p4-p16; (b) the equivalent circuit of the bus structure]

Evaluate_Stacked-layer_DataBus_Reconstruction()
{
    /* A 3D data bus topology with n terminals and q bus wires on a complete timing period. */
    step1: Scan the 3D data bus topology and construct its data structure.
    step2: Compute the access time of each source-sink pair over the n(n-1) timing periods
           and get the overall average access time Tav by Find_AverageAccessTime().
    step3: for (each timing period of a complete timing period) {
               insert unit-size repeaters to isolate all the extra capacitive loadings
               from the source-sink pair, inserting at most one repeater into the
               middle of a bus wire.
           }
    step4: Estimate the access time of each source-sink pair over the n(n-1) timing periods
           with the inserted unit-size repeaters and calculate the overall average access time U-Tav.
    step5: if (U-saving = (Tav - U-Tav) / Tav * 100% > basic performance ratio 10%)
               then the 3D data bus can be reconstructed by inserting the unit-size bidirectional
               repeaters into the space available within the limited chip area;
           else give up the reconstruction of the 3D data bus topology.
}
Fig. 6 The algorithm used for evaluating the data bus performance

The proposed evaluation algorithm performs O(n^2) access-time evaluations, because each of the n(n-1) timing periods is evaluated, where n is the number of terminals.

4. Experimental Results

We implemented the proposed evaluation algorithm in the C language on an i7 CPU @ 2.7 GHz (dual core) with 8 GB RAM, running MS-Windows 10. Table 2 shows the parameters of the 45 nm technology [19] used with the Elmore RC delay model: rw and cw are the resistance and capacitance of a unit-length wire, respectively; rTSV and cTSV are the resistance and capacitance of a TSV, respectively; and rB, cB, and tB denote the output resistance, input capacitance, and intrinsic delay of a unit-size repeater, respectively.

Table 2 Parameters based on 45 nm technology

| Element | Parameter | Value |
|---|---|---|
| a unit-length wire | rw | 0.1 Ω |
| a unit-length wire | cw | 0.2 fF |
| a TSV | rTSV | 0.035 Ω |
| a TSV | cTSV | 15.48 fF |
| a unit-size repeater | rB | 122 Ω |
| a unit-size repeater | cB | 24 fF |
| a unit-size repeater | tB | 17 ps |

We take six 3D data bus topologies with three stacked layers from [11-12] and reduce their total wire length by a factor of five for testing our proposed algorithm. For each data bus, the driving resistances of all the sources and the loading capacitances of all the sinks are assumed to equal those of a unit-size repeater. The inserted repeaters are also fixed to a unit size because chip area is limited and the data bus should need only a minor reconstruction. Table 3 shows the evaluation in average access time for the six 3D data bus topologies whose total lengths are reduced by a factor of 5 (marked r5), over their complete timing periods (marked -nxn) and over 2000 timing periods (marked -2k), respectively.
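Steps 2, 4, and 5 of Fig. 6 reduce to averaging per-period access times and comparing the U-saving with the basic performance ratio. The C sketch below covers only that bookkeeping; the repeater placement of step3 and the per-pair delay computation (e.g., by Eq. (1) or Eq. (12)) are assumed to happen elsewhere, and mean_access_time(), u_saving(), and accept_reconstruction() are hypothetical names rather than the paper's Find_AverageAccessTime().

```c
#include <assert.h>
#include <stdbool.h>

/* Steps 2 and 4: mean of the access times of all evaluated timing periods. */
double mean_access_time(const double *t, long periods)
{
    assert(periods > 0);
    double sum = 0.0;
    for (long k = 0; k < periods; k++)
        sum += t[k];
    return sum / (double)periods;
}

/* U-saving = (Tav - U-Tav) / Tav * 100%. */
double u_saving(double tav, double u_tav)
{
    assert(tav > 0.0);
    return (tav - u_tav) / tav * 100.0;
}

/* Step 5: accept the reconstruction only if U-saving exceeds the
 * basic performance ratio, given in percent (10.0 in the paper). */
bool accept_reconstruction(double tav, double u_tav, double basic_ratio)
{
    return u_saving(tav, u_tav) > basic_ratio;
}
```

Fed with the averages reported in Tables 3 and 4, the sketch accepts Test0r5-nxn (0.3625 ns to 0.2281 ns, a saving of about 37.1%) and rejects Test0r10-nxn (0.1856 ns to 0.2015 ns) and CaseGr10-nxn (0.1952 ns to 0.1810 ns, about 7.3%).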
In the table, #Term, #Loc, Tlength, and #Peri are the number of terminals, the number of bus wires, the total wire length, and the number of timing periods of a 3D data bus, respectively. Tav and U-Tav are the average access times without and with the U-size inserted unit-size repeaters, respectively. U-saving is the saving ratio defined as (Tav - U-Tav) / Tav × 100%. Since we try to insert a bidirectional unit-size repeater into every bus wire over the complete timing period, U-size is twice the number of bus wires #Loc in every case. For all the cases, the U-Tav and U-saving values over the complete timing periods (marked $r5-nxn, where $ stands for the case name) and over 2000 timing periods (marked $r5-2k) are nearly identical; for example, the U-savings of Test0r5-nxn and Test0r5-2k are 37.09% and 37.27%. The U-savings range from 37.09% to 60.88% and are always larger than the basic performance ratio of 10%. The results show that all the cases are suitable for reconstructing their data bus by inserting unit-size repeaters to reduce the average access time of any program with hundreds or thousands of timing periods run on the bus.
Table 3 The evaluation in average access times Tav and U-Tav for data buses (total length reduced by 5, r5) without/with inserted unit-size repeaters on their complete timing periods and 2000 timing periods, respectively

The first group of #Peri to U-saving columns is for the complete timing period ($r5-nxn); the second group is for 2k timing periods ($r5-2k).

| Example | #Term | #Loc | Tlength | #Peri | Tav (ns) | U-Tav (ns) | U-size | U-saving | #Peri | Tav (ns) | U-Tav (ns) | U-size | U-saving |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test0r5-* | 18 | 38 | 7510 μm | 306 | 0.3625 | 0.2281 | 76 | 37.09% | 2000 | 0.3627 | 0.2275 | 76 | 37.27% |
| CaseFr5-* | 15 | 29 | 12069 μm | 210 | 0.6078 | 0.2378 | 58 | 60.88% | 2000 | 0.6099 | 0.2386 | 58 | 60.88% |
| CaseGr5-* | 10 | 21 | 8797 μm | 90 | 0.4419 | 0.2270 | 42 | 48.62% | 2000 | 0.4405 | 0.2240 | 42 | 49.16% |
| CaseHr5-* | 9 | 20 | 8538 μm | 72 | 0.4393 | 0.2380 | 40 | 45.73% | 2000 | 0.4392 | 0.2398 | 40 | 45.40% |
| CaseJr5-* | 21 | 44 | 11166 μm | 420 | 0.6117 | 0.2917 | 88 | 52.31% | 2000 | 0.6104 | 0.2903 | 88 | 52.43% |
| CaseKr5-* | 30 | 58 | 13776 μm | 870 | 0.7686 | 0.3059 | 116 | 60.20% | 2000 | 0.7556 | 0.3075 | 116 | 59.30% |

*: nxn or 2k

We extend the evaluation to all the cases with total lengths reduced by a factor of 10 (marked r10). Table 4 shows their U-savings over their complete timing periods (marked $r10-nxn) and over 2000 timing periods (marked $r10-2k). As in Table 3, the two U-savings of each case are close to each other. Note that three cases, Test0r10-nxn (Test0r10-2k), CaseGr10-nxn (CaseGr10-2k), and CaseHr10-nxn (CaseHr10-2k), fail over their complete timing periods (2000 timing periods), because their U-savings of -8.58% (-9.79%), 7.27% (7.24%), and 0.33% (-0.10%) are below the basic performance ratio of 10%. These three data buses are therefore not suitable for inserting unit-size repeaters.
Table 4 Average access times without (Tav) and with (U-Tav) inserted unit-size repeaters for data buses whose total length is reduced by a factor of 10 (r10), on their complete timing periods and on 2000 timing periods

| Example | #Term | #Loc | Tlength (μm) | nxn #Peri | nxn Tav (ns) | nxn U-Tav (ns) | nxn U-size | nxn U-saving | 2k #Peri | 2k Tav (ns) | 2k U-Tav (ns) | 2k U-size | 2k U-saving |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test0r10-* | 18 | 38 | 3736 | 306 | 0.1856 | 0.2015 | 76 | -8.58% | 2000 | 0.1857 | 0.2038 | 76 | -9.79% |
| CaseFr10-* | 15 | 29 | 6020 | 210 | 0.2703 | 0.1889 | 58 | 30.11% | 2000 | 0.2703 | 0.1884 | 58 | 30.28% |
| CaseGr10-* | 10 | 21 | 4388 | 90 | 0.1952 | 0.1810 | 42 | 7.27% | 2000 | 0.1942 | 0.1801 | 42 | 7.24% |
| CaseHr10-* | 9 | 20 | 4259 | 72 | 0.1908 | 0.1901 | 40 | 0.33% | 2000 | 0.1907 | 0.1909 | 40 | -0.10% |
| CaseJr10-* | 21 | 44 | 5561 | 420 | 0.2826 | 0.2479 | 88 | 12.27% | 2000 | 0.2830 | 0.2480 | 88 | 12.35% |
| CaseKr10-* | 30 | 58 | 6859 | 870 | 0.3622 | 0.2601 | 116 | 28.18% | 2000 | 0.3521 | 0.2610 | 116 | 25.89% |

nxn: complete timing period ($r10-nxn); 2k: 2000 timing periods ($r10-2k); *: nxn or 2k.

Table 5 Average access times without (Tav) and with (U-Tav) inserted unit-size repeaters for CaseK with its total length reduced by a factor of 5 or 10 (r5 or r10), on its complete timing period and on different numbers of timing periods

| Example | #Term | #Loc | #Peri | r5 Tlength (μm) | r5 Tav (ns) | r5 U-Tav (ns) | r5 U-size | r5 U-saving | r10 Tlength (μm) | r10 Tav (ns) | r10 U-Tav (ns) | r10 U-size | r10 U-saving |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CaseKr?-nxn | 30 | 58 | 870 | 13776 | 0.7686 | 0.3059 | 116 | 60.20% | 6859 | 0.3622 | 0.2601 | 116 | 28.18% |
| CaseKr?-.1k | 30 | 58 | 100 | 13776 | 0.6324 | 0.3008 | 116 | 52.44% | 6859 | 0.2581 | 0.2628 | 116 | -1.82% |
| CaseKr?-.2k | 30 | 58 | 200 | 13776 | 0.6446 | 0.3019 | 116 | 53.16% | 6859 | 0.2710 | 0.2588 | 116 | 4.53% |
| CaseKr?-.3k | 30 | 58 | 300 | 13776 | 0.6757 | 0.3103 | 116 | 54.07% | 6859 | 0.2798 | 0.2573 | 116 | 8.04% |
| CaseKr?-.5k | 30 | 58 | 500 | 13776 | 0.6875 | 0.3053 | 116 | 55.60% | 6859 | 0.2964 | 0.2594 | 116 | 12.48% |
| CaseKr?-.7k | 30 | 58 | 700 | 13776 | 0.6960 | 0.3067 | 116 | 55.94% | 6859 | 0.3100 | 0.2610 | 116 | 15.82% |
| CaseKr?-1k | 30 | 58 | 1000 | 13776 | 0.7288 | 0.3086 | 116 | 57.66% | 6859 | 0.3252 | 0.2599 | 116 | 20.08% |
| CaseKr?-2k | 30 | 58 | 2000 | 13776 | 0.7556 | 0.3075 | 116 | 59.30% | 6859 | 0.3521 | 0.2610 | 116 | 25.89% |
| CaseKr?-3k | 30 | 58 | 3000 | 13776 | 0.7619 | 0.3038 | 116 | 60.13% | 6859 | 0.3602 | 0.2613 | 116 | 27.45% |
| CaseKr?-4k | 30 | 58 | 4000 | 13776 | 0.7636 | 0.3022 | 116 | 60.42% | 6859 | 0.3603 | 0.2596 | 116 | 27.95% |
| CaseKr?-5k | 30 | 58 | 5000 | 13776 | 0.7670 | 0.3059 | 116 | 60.12% | 6859 | 0.3614 | 0.2588 | 116 | 28.38% |

r5: total length reduced by 5 (CaseKr5-*k); r10: total length reduced by 10 (CaseKr10-*k); ?: 5 or 10; -.3k: 300 timing periods.

Table 5 presents the evaluation of average access time for the data bus topology CaseK with its total length reduced by a factor of 5 or 10 (marked r5 or r10), on its complete timing period (marked -nxn) and on different numbers of timing periods (marked -.1k to -5k). From the table, the U-savings of CaseKr5-nxn and CaseKr10-nxn on their complete timing periods are 60.20% and 28.18%, respectively. For CaseKr5-.1k to CaseKr5-5k, covering 100 to 5000 timing periods, the U-savings range from 52.44% to 60.42%, so this 3D data bus is suitable for unit-size repeater insertion to reduce its access time. For CaseKr10-.1k to CaseKr10-5k over the same 100 to 5000 timing periods, the U-savings of CaseKr10-.1k, CaseKr10-.2k, and CaseKr10-.3k are -1.82%, 4.53%, and 8.04%, respectively; these are below the basic performance ratio of 10%, so these cases are not suitable for data bus reconstruction. In both settings, the U-saving moves toward its complete-timing-period value as the number of timing periods grows.

Fig. 7(a) shows the 3D data bus topology of CaseKr10-nxn with 116 inserted unit-size repeaters. In the figure, the two numbers in the middle of a bus wire give the sizes of the inserted unit-size bidirectional repeater. Fig. 7(b) presents the access times of every source-sink pair on a complete timing period, without and with inserted repeaters. The real access time (marked real_time) of each source-sink pair with inserted repeaters is always less than the required access time (marked ireq_time) without inserted repeaters.
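The averages in Tables 3 to 5 depend on which timing periods are counted. Judging from the tables, a complete timing period has #Peri = #Term × (#Term - 1) periods (306 = 18 × 17 for Test0, 870 = 30 × 29 for CaseK), that is, one access per ordered source-sink pair. The Python sketch below assumes exactly that, takes the average access time as the total access time divided by the number of timing periods, and, as a further assumption, builds partial sets such as .1k to 5k by drawing ordered pairs uniformly at random; the paper does not state how those periods were generated. All function names and the `toy_delay` model are hypothetical placeholders, not the paper's delay model.

```python
from __future__ import annotations

import itertools
import random
from typing import Callable, Iterable, Sequence

# (source, sink) -> access time in ns; stands in for a real delay model.
AccessTime = Callable[[int, int], float]


def complete_timing_period(terminals: Sequence[int]) -> list[tuple[int, int]]:
    """One timing period per ordered source-sink pair: n * (n - 1) periods."""
    return list(itertools.permutations(terminals, 2))


def sampled_timing_periods(terminals: Sequence[int], k: int,
                           seed: int = 0) -> list[tuple[int, int]]:
    """k timing periods, each a uniformly random ordered pair (assumption)."""
    rng = random.Random(seed)
    pool = list(terminals)
    return [tuple(rng.sample(pool, 2)) for _ in range(k)]


def average_access_time(periods: Iterable[tuple[int, int]],
                        access_time: AccessTime) -> float:
    """Total access time over all timing periods divided by their number."""
    periods = list(periods)
    return sum(access_time(src, sink) for src, sink in periods) / len(periods)


def toy_delay(src: int, sink: int) -> float:
    """Hypothetical delay: 0.05 ns per terminal index apart (illustration only)."""
    return 0.05 * abs(src - sink)


if __name__ == "__main__":
    terminals = range(30)                      # 30 terminals, as in CaseK
    full = complete_timing_period(terminals)
    print(len(full))                           # 870, the #Peri of CaseK-nxn
    print(average_access_time(full, toy_delay))
    for k in (100, 1000, 5000):
        sample = sampled_timing_periods(terminals, k)
        print(k, average_access_time(sample, toy_delay))
```

Under uniform sampling, the sampled average converges to the complete-period average as k grows, which mirrors the trend Table 5 reports for CaseK; evaluating the complete timing period avoids that sampling error altogether.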
For CaseKr10-nxn, the average access times without and with inserted repeaters are 0.3622 ns and 0.2601 ns, respectively, a saving of 28.18% in average access time.

(a) The 3D data bus topology with 116 inserted unit-size repeaters
(b) Real access times (real_times) with inserted repeaters are less than the required access times (ireq_times) without inserted repeaters
Fig. 7 The bus topology and all the required and real access times of case CaseKr10-nxn

5. Conclusions
The proposed approach evaluates bus performance by reconstructing a stacked-layer data bus with inserted unit-size repeaters over a complete timing period, and it has been successfully applied to estimate whether the average data access time is reduced enough to justify the reconstruction. Using only unit-size repeaters limits the area required by each repeater. Evaluating over a complete timing period covers all possible data accesses of any program executed on the bus at different timing periods, and the average access time of a data bus reflects the performance of an executed program in practice. Therefore, our evaluation approach is simple, fast, and effective. Future work will investigate other evaluation approaches suited to the various data bus topologies of emerging stacked-layer chips.

Acknowledgment
This work was partially supported by the NHU-107 research project subsidy of Nanhua University.

Conflicts of Interest
The authors declare no conflict of interest.

References
[1] EE Times, “The state of the art in 3D IC technologies,” November 27, 2013.
[2] Y. I. Ismail, E. G. Friedman, and J. L. Neves, “Repeater insertion in tree structured inductive interconnect,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 48, no. 5, pp. 471-481, May 2001.
[3] K. Y. Lin, H. T. Lin, T. Y. Ho, and C. C. Tsai, “Load-balanced clock tree synthesis with adjustable delay buffer insertion for clock skew reduction in multiple dynamic supply voltage designs,” ACM Trans. on Design Automation of Electronic Systems, vol. 17, no. 3, Article 34, 2012.
[4] M. Ghoneima and Y. Ismail, “Optimum positioning of interleaved repeaters in bidirectional buses,” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 24, no. 3, pp. 461-469, March 2005.
[5] Q. Ashton Acton, Issues in Electronic Circuits, Devices, and Materials: 2011 Edition, Scholarly Editions, January 2012.
[6] M. Daneshtalab, M. Ebrahimi, and J. Plosila, “HIBS - Novel inter-layer bus structure for stacked architectures,” Proc. IEEE International Conference on 3D System Integration (3DIC 12), February 2012, pp. 1-7.
[7] I. G. Thakkar and S. Pasricha, “3D-Wiz: A novel high bandwidth, optically interfaced 3D DRAM architecture with reduced random access time,” Proc. IEEE International Conference on Computer Design (ICCD 14), November 2014, pp. 1-7.
[8] K. Cho, H. S. Na, T. W. Cho, and Y. You, “Analysis of system bus on SoC platform using TSV interconnection,” Proc. IEEE Asia Symposium on Quality Electronic Design (ASQED 12), August 2012, pp. 255-259.
[9] K. S. Mohamed, IP Cores Design from Specifications to Production, Chap. 4: SoC buses and peripherals, 1st ed. Switzerland: Springer International Publishing, 2016.
[10] S. Khan, S. Anjum, U. A. Gulzari, and F. S. Torres, “Comparative analysis of network-on-chip simulation tools,” IET Computers & Digital Techniques, vol. 12, no. 1, pp. 30-38, January 2018.
[11] C. C. Tsai, “Repeater insertion for 3D data bus with TSVs for reducing critical propagation delay,” Proc. International Conference on Computer Science and Information Engineering (CSIE 15), June 2015, pp. 203-208.
[12] C. C. Tsai, “An effective algorithm for minimizing the critical access time of a 3D-chip data bus,” International Journal of Electronics Communication and Computer Engineering, vol. 9, no. 4, pp. 117-123, July 2018.
[13] C. C. Tsai, “Embedded bus switches on 3D data bus for critical access time reduction,” Proc. IEEE Latin American Symposium on Circuits and Systems (LASCAS 18), February 2018, pp. 1-4.
[14] IDTQS3245 data sheet, IDT Co., November 2014.
[15] 74VHCT126AFT data sheet, Toshiba Co., November 2014.
[16] C. C. Tsai, D. Y. Kao, and C. K. Cheng, “Performance driven bus buffer insertion,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 4, pp. 429-437, April 1996.
[17] W. C. Elmore, “The transient response of damped linear networks,” Journal of Applied Physics, vol. 19, no. 1, pp. 55-63, January 1948.
[18] T. Bandyopadhyay, K. J. Han, D. Chung, R. Chatterjee, M. Swaminathan, and R. Tummala, “Rigorous electrical modeling of through silicon vias with MOS capacitance effects,” IEEE Trans. Components, Packaging, and Manufacturing Technology, vol. 1, no. 6, pp. 893-903, June 2011.
[19] Y. Cao, W. Zhao, E. Wang, W. Wang, J. Velamala, A. Balijepalli, and S. Sinha, “Predictive Technology Model (PTM),” http://ptm.asu.edu, June 1, 2012.

Copyright © by the authors. Licensee TAETI, Taiwan. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY-NC) license (https://creativecommons.org/licenses/by-nc/4.0/).