Instruction FACTA UNIVERSITATIS Series: Electronics and Energetics Vol. 27, N o 3, September 2014, pp. 317 - 328 DOI: 10.2298/FUEE1403317S RAPID EXPLORATION OF COST-PERFORMANCE TRADEOFFS USING DOMINANCE EFFECT DURING DESIGN OF HARDWARE ACCELERATORS Reza Sedaghat 1 , Anirban Sengupta 2 1 Electrical and Computer Engineering, Ryerson University, Toronto, Canada 2 Computer Science and Engineering, Indian Institute of Technology, Indore, India Abstract. Modern Very Large Scale Integration (VLSI) designs require a tradeoff between cost efficiency and performance (circuit speed). Furthermore, the Design Space Exploration (DSE) of the cost-performance tradeoffs for the multi objective VLSI designs should also be fast and efficient in nature. This paper presents a novel accelerated DSE approach for the exploration of cost-performance tradeoffs of modular multi (trio parametric. viz. cost, execution time and power consumption) objective VLSI hardware accelerators using hierarchical criterion analysis. The selection of the final design point is made after the tradeoffs are explored using the proposed approach. Results of the proposed approach when applied to various benchmarks yielded significant acceleration in the exploration process compared to current existing approaches with multi parametric objective. Key words: Hardware accelerator, rapid, exploration, performance, cost 1. INTRODUCTION The design space exploration process generally takes into account two conflicting situations such as a) accurately searching the optimal design point from the huge design space b) time taken (or number of architectures analyzed) to evaluate the architecture design space in order to select the optimal design point. The second situation is more significant for modern multi-objective heterogeneous VLSI systems because exhaustive exploration of the architecture space is prohibitive due to the massive size of the design space. The architecture exploration process is therefore a battle between the optimal architecture determination and the speed of the exploration process. Furthermore, since present generation VLSI systems are multi-objective in nature, they demand efficient exploration approaches that can satisfy the multi-objective requisite by concurrently reducing the time spent in the architecture evaluation as well as maximizing the opportunity of automating the exploration methodology [1]-[7].  Received January 27, 2014 Corresponding author: Reza Sedaghat Electrical and Computer Engineering, Ryerson University, Toronto, Canada (e-mail: rsedagha@ee.ryerson.ca) 318 R. SEDAGHAT, A. SENGUPTA 2. RELATED WORKS Exploration has been a subject of research for almost two decades. Many approaches have been proposed in the recent past for fast and efficient evaluation of the design architecture space. The evaluation of the architecture design space has been performed by implementing an Architecture Configuration Graph (ACG) based on the hierarchical criterion factor [8], [9]. After the creation of the ACG, the Pareto optimal analysis is performed to find the optimal architecture. Although the approach seems promising, the major drawback of this approach is the excessive time taken for the framework to build the architecture design space in order to analyze the variants. On the other hand, authors in [10] use an evolutionary algorithm, such as Genetic Algorithm (GA), for efficiently searching the optimal solution. They propose a new encoding scheme to improve the efficiency of GA search for design space exploration. Using chromosome representation, the precedence relationships among the tasks in the input behavioral specification are encoded with a topological order-based representation to specify schedule priorities. Authors in [11] also use GA based on binary encoding of chromosome for efficient design space exploration. Additionally, authors in [12], [13] have developed a model that can assist designers at the system-level DSE stage to explore the utilization of the reconfigurable resources and evaluate the relative impact of certain design choices. All the above mentioned approaches mostly consider dual objective DSE (such as area and delay), but the proposed approach considers multi objective problems (such as cost, delay and power consumption).In addition to the above, a problem space genetic algorithm for design space exploration of data paths have been proposed in [14]. The authors have used the concept of heuristic/problem pair to convert a data flow graph into a valid schedule. The chromosome is encoded based on the „work remaining‟ value of each node. One of the problems with approach [14] is that the second special parent chromosome‟s built in correspondence with the minimum functional units (i.e. serial implementation) does not differ in the work remaining field of the first special chromosome. This may not always lead to the optimal solution. Furthermore, the cost function considers only latency and not total execution time. Authors in [15] describe an approach to solve the DSE problem which is based on GA and weighted sum particle swarm optimization (WSPSO). The authors use crossover between parent and local-best-solution, then parent and global- best-solution to implement particle swarm optimization (PSO) behavior. The authors do not consider velocity to update the position. Moreover in WSPSO, the authors also do not consider user constraints for area and execution time in cost function. In [16], authors describe another approach for DSE in high level systems based on binary encoding of the chromosomes. However, they consider only traditional latency and not execution time constraint for data pipelining. Authors in [17] suggest that identification of a few superior design points from the Pareto set is enough for an excellent design process. The work shown in [18] discusses the optimization of area, delay and power in behavioral synthesis but does not consider execution time during data pipelining. The problem of design space exploration is also addressed in [19] by suggesting order of efficiency, which assists in deciding preferences amongst the different Pareto optimal points. Authors in [20] introduce a tool called SystemCoDesigner that offers rapid design space exploration with rapid prototyping of behavioral systemC models. In [21] evolutionary algorithms such as the Genetic Algorithm (GA) have been suggested to yield better results for the design space exploration process. An automated tool was developed by integrating behavioral Rapid Exploration of Cost-Performance Tradeoffs Using Dominance Effect during Design.... 319 synthesis into their design flow, while authors in [22] describe current state-of-the-art high-level synthesis techniques for dynamically reconfigurable systems. Additionally, authors in [23]-[25] also use a genetic algorithm for scheduling and resource allocation for data path synthesis. Another class of scheduling methods employed previously was probabilistic in nature. For example the simulated annealing (SA) and simulated evolution (SE) based scheduling techniques have been used for the high level synthesis problem. Authors in [26], [27] have proposed a simulated annealing scheduling method called „SALSA‟, which uses many probabilistic search operators to enhance the performance of the SA-based technique for high level synthesis problems. In addition, authors have also proposed an extended binding model for handling the scheduling problem in high level synthesis. Simulated evolution has been proposed by authors in [28] to solve the combined problem of scheduling and resource allocation in high level synthesis. Unfortunately, approaches [23]-[28] do not consider execution time, chaining and data pipelining. Authors in [20],[29] proposed alternate approaches based on integer linear programming (ILP).Although they are capable of providing good results, the computational complexity is massive and therefore require and extensive amount of time. Furthermore, the concept of data pipelining based on execution time was not shown during system trade-off. Work shown in [30] for DSE suggests an evolutionary algorithm for successful evaluation of the design for an application specific SoC. Other well known tools for HLS exist, such as GAUT [31]. GAUT inputs a C/C++ behaviour description for automatically generating a RTL structure based on compulsory constraint of throughput (or initiation interval) and clock period. In addition, authors in [32] propose an open- source HLS tool called LegUp for FPGA-based processor/accelerator systems. LegUp is able to synthesize C language to hardware, thereby providing a nice platform for HLS. Different FPGA architectures are supported by this tool, which allows new scheduling algorithms and parallel accelerators. Moreover, ROCCC, proposed in [33], is an open- source HLS tool for generating RTL structure from C. It was designed for kernels that perform computation intensive tasks, such as most DSP applications. Therefore, ROCCC applies to a specific class of applications (streaming-oriented applications) and is not a general C-to-hardware compiler, unlike LegUp [32]. 3. THE PROPOSED FRAMEWORK BEHIND DESIGN SPACE EXPLORATION 3.1 The proposed framework for cost model The model for the cost of the resources is proposed in this section and is an extension of the authors‟ previous work [3]-[5] on the area model. Let the area of the resources be given as „A‟. Ri denotes the resources available for system designing; where 1<=i<=n. „Rclk‟ refers to the clock oscillator used as a resource providing the necessary clock frequency to the system. The total area can be represented as the sum of all the resources used for designing the system, such as adder, multiplier, divider etc, and clock frequency oscillator. Total area is shown in Equation (1). ( )A A Ri  (1) )()...( 2211 RclkAKNKNKNA RnRnRRRR  (2) 320 R. SEDAGHAT, A. SENGUPTA Where „NRi’ represents the number of resource „Ri‟,and „KRi‟ represent the area occupied per unit resource „Ri‟. Let the total cost of all resources in the system be „CR‟. Further, cost per area unit of the resource (such as adders, multipliers etc) is given as „CRi‟ and the cost per area unit of the clock oscillator is „CRclk‟. Therefore total cost of the resources is given as: 1 1 2 2 ( .. ) ( ) R R R R R Rn Rn Ri Rclk C N K N K N K C A Rclk C          (3) Applying partial derivative to Equation (3) NR1 ….NRn, NRM,and ARclkyields Equations (4) to (7) respectively as shown below: 1 111 1 ])()..[( R RclkRnRnRnRRR R R N CRclkACKNCKN N C     11 RR CK  (4) Rn RclkRnRnRnRRR Rn R N CRclkACKNCKN N C ])()..[( 111     RnRn CK  (5) Rclk Rclk R C A C    (6) According to the theory of approximation by differentials, the change in the total area can be approximated by the following equation: Rclk Rclk R Rn Rn R R R R R A A C N N C N N C dC           1 1 (7) Substituting Equations (4) to (6) into Equation (7) yields Equation (8): Equation (8) represents the change in total cost of resources with a change in the number of all resources and the clock period (clock frequency). The PF for cost of resources is defined as follows: 1 1 1 ( 1) R R Ri R N K C PF R N     (9) ( ) Rn Rn Ri Rn N K C PF Rn N     (10) ( ) ( ) Rclk Rclk A Rclk C PF Rclk N    (11) Equations (9) and (10) indicate the average deviation of cost with respect to change in resource R1,….Rn. Note: This average deviation of cost helps in finding the dominance The change of cost contributed by resource Rn The change of cost contributed by resource clock RclkRiRnRnR CRclkACKNdC  )( (8) Rapid Exploration of Cost-Performance Tradeoffs Using Dominance Effect during Design.... 321 effect of corresponding resource types on cost. Further, Equation (11) indicates the change of cost of the system with respect to change in resource „Rclk‟ (i.e. the dominance effect of Rclk). 3.2 The framework used for execution time This section introduces a new mathematical PF model for clock oscillator resource, thus extending the authors‟ previous work [3]-[5] on PF model of functional resources. The Priority factor of the resources R1, …Rn (such as adders, multipliers etc) for the execution time is derived from [3]-[5].From [3]-[5], the priority factor for the resources R1,...Rn for execution time, is defined as: max ( ) ( )Rn Rn p Rn N T PF Rn T N     (12) The PF model for the clock oscillator is defined as: Rclk Min Rclk Max Rclk N TT RclkPF  )( (13) In Equation (13), ‘TRclk Max ’ and ‘TRclk Min ’are the maximum and minimum values of „execution time‟ and all the available resources have the maximum value. The PF defined in Equations (12) and (13) indicates the average change in execution time with a change in number of a particular resource. This average deviation of execution time depends on various resources to find the dominance effect of corresponding resource types on execution time. 3.3 The framework used for power consumption PF for power consumption is defined as: max( ) ( )Rn Rn c Rn N K PF Rn p N     (14) max( ) ( )RM RM c RM N K PF RM p N     (15) 1 1 2 2 .. ( ) ( ) clk R R R R Rn Rn c R N T N T N T PF Rclk p N          (16) Similarly as explained above, the priority factors for power consumption defined in Equations (14), (15) and (16) indicate the average change in the total power consumption of the system with the change in number of resources at maximum clock frequency. Therefore, as discussed before, Equations (14),(15),(16) indicate the dominance effect of resource types Rn, RM and Rclk on power metric. 322 R. SEDAGHAT, A. SENGUPTA 4. PROPOSED DEMONSTRATION 4.1 System specifications The case study of a selected benchmark has been provided for demonstration of the proposed method based on multiple real system specifications (as shown in Table 1). The function of the selected second order digital IIR Chebyshev filter benchmark is given in (17). ( ) 0.041 ( ) 0.082 ( 1) 0.041 ( 2) 0.6743 ( 2) 1.4418 ( 1)y n x n x n x n y n y n         (17) x(n), x(n-1) and x(n-2) are the input vector variables for the function. The previous outputs are given by y (n-1) and y(n-2), while the present output is y(n). Table 1 System specifications and constraints 1) Maximum cost of resources: 1588 area units 2) Maximum time of execution: 200µs (for D =1000 sets of data) 3) Power consumption: Minimum 4) Maximum resources available for the system design: a) 3 Adder/subtractor units. b) 3 Multiplier units c) 3 clock frequency oscillators: : 24 MHz, 100 MHz and 400 MHz 5) No. of clock cycles needed for multiplier and adder/subtractor to finish each operation: 4 cc and 2cc 6) Area occupied by each adder/subtractor and multiplier: 12 area units (a.u), and 65a.u on the chip (e.g. 12 CLBs on FPGA for adder/subtractor) 7) Area occupied by the 24 MHz, 100 MHz and 400 MHz clock oscillator: 6 a.u., 10 a.u. and 14 a.u. 8) Power consumed at 24MHz, 100MHz and 400 MHz: 10mW/a.u., 32 mW/a.u. and 100mW/a.u. respectively. 9) Cost per area unit resource (CRi) = 10 units and Cost per area unit clock oscillator = 8 units 4.2 Arrangement of the design space (consisting of resources) in increasing orders of magnitude in the form of Architecture tree for cost model This paper proposes the use of a hierarchical tree topology for arrangement of design points in sorted orders and exploration of the optimal design point. Unlike the authors‟ previous works [3]-[5] using vector design space, this approach uses a more convenient topology for exploration. The tree structure is easy to construct and does not require a special algorithm to order the design space in increasing/decreasing order. The PF of the different resources for cost model is given in equations below: 1 1 1 (3 1) 12 10 ( 1) 80 3 R R Ri R N K C PF R N          (18) 2 2 2 (3 1) 65 10 ( 2) 433.33 3 R R Ri R N K C PF R N          (19) Rapid Exploration of Cost-Performance Tradeoffs Using Dominance Effect during Design.... 323 ( ) (14 6) 8 ( ) 21.36 3 Rclk Rclk A Rclk C PF Rclk N        (20) Based on the PF calculated for cost model, the architecture tree for cost can be constructed. The tree is constructed in such a way, so that the resource with the highest PF is assigned level (l1) in the tree, followed by level (L2) being assigned to the resources with next highest PF and finally the last level being assigned to the resource with the lowest PF. The resource with the highest PF influences the cost of the system the most compared to the resource with the least PF. After the assigning the levels, the architecture tree comprising of the design space is automatically arranged in increasing orders of magnitude for the cost model. The architecture tree for the cost model is shown in Fig. 1. After the design space is sorted in increasing order of magnitude, searching is applied on the design space. A mixed searching approach is proposed in this work by extracting the advantages of two different well known searching algorithms viz. interpolation search and binary search. Previous works [3]-[5] employed a mono binary searching procedure. However, as highlighted in Fig. 1, a mixed searching approach is proposed to further enhance the speed of the exploration process. Interpolation search is used with the cost model in order to search for the border variant for cost, while for the execution time model binary search is used to find the border variant. The interpolation search performs faster than binary search in cases of uniformly sorted models, such as design space for cost (cost is an increasingly linear function of the number of resources, i.e. cost of the system increases with increase in number of resources). On the other hand, binary search exploits the „divide and conquer‟ approach. Hence, it works faster on non- uniformly linear sorted models, such as execution time (execution time being a non- uniformly decreasing linear function of the number of resources i.e. increase in number of resources does not always decrease execution time, but remains same). Therefore applying interpolation search on the sorted design space for cost, shown in Fig.1 yields the border variant in just 2 comparisons (cost is calculated according to eqn.(3)). The border variant for cost is the last variant in the design space (in Fig.1) which satisfies the constraint for cost specified. The border variant obtained for cost is „V11‟. Fig. 1 Architecture tree representing the design space for cost arranged in increasing order 324 R. SEDAGHAT, A. SENGUPTA 4.3 Arrangement of the design space in decreasing orders of magnitude in the form of Architecture tree for execution time model The PF of the different resources used in system design for execution time model is given below equations: max1 1 1 (3 1) 2 ( 1 ) ( ) 0.0416 0.055 3 R R p R N T PF R T N          (21) max2 2 2 (3 1) 4 ( 2) ( ) 0.0416 0.111 3 R R p R N T PF R T N          (22) 333.5 20.01 ( ) 104.50 3 Max Min Rclk Rclk Rclk T T PF Rclk N      (23) Similarly, as described in section II.B, the architecture tree for execution time is constructed based on the PF calculated for execution time. Thus, the architecture tree obtained after construction is now also automatically arranged (sorted) in decreasing orders of magnitude. After arrangement, binary searching is applied in order to find the border variant for execution time (execution time is calculated according to the model of execution time shown in [4]). The border variant for execution time is the first variant in the design space, which satisfies the constraint for cost specified. The border variant obtained is variant „V5‟. After the border variants for both cost and execution time are found, the Pareto optimal set is derived as explained in [3]-[5]. The architecture tree for power consumption is constructed similarly in increasing orders of magnitude for power consumption. Among the variants of the Pareto set, the one which appears first in the ascending ordered sorted design space (in the tree), is the one with the minimum power consumption. It concurrently satisfies the constraints for cost, execution time and power consumption (specified in Table1) for the design problem. Therefore the optimal variant obtained, which satisfies all the specified constraints, is Variant „V5‟ (marked bold red in Fig.1). 5. ANALYSIS AND RESULTS The results of the proposed approach using PF and mixed searching scheme for rapid exploration of cost performance tradeoffs are verified for a number of benchmarks. Compared to the authors‟ previous works [3]-[5], the proposed approach is capable of further enhancing the speedup of the exploration process. The search of the border architecture in the case of execution time (using binary search) requires only log2  n i=1 vRi where „n‟ = number of type of resources and „vRi‟ is the number of variants of resource „Ri‟. The search of the border architecture (using interpolation search) for cost parameter requires log2 log2 log2  n i=1 vRi. In the design space exploration approach presented here, three objective parameters have been used; execution time and cost are the parametric constraints and power consumption is the optimization parameter. The total number of architecture evaluations performed during searching using the proposed method is given as: Rapid Exploration of Cost-Performance Tradeoffs Using Dominance Effect during Design.... 325 log2 log2  n i=1 vRi + log2  n i=1 vRi When applied on various benchmarks, the proposed approach indicated massive acceleration in the speedup compared to the exhaustive approach. The proposed method was also compared with a current approach in [8], [9]. The acceleration obtained, compared to the [8], [9], for both small and large size benchmarks is shown in Tables2 and 3 respectively. Moreover, the proposed approach has also been compared with a heuristic approach (WSPSO) [15]. As evident from Tables 4 and 5, the proposed approach performs lower architecture evaluations than [15] for both small and large benchmarks respectively. For example, in case of MPEG MMV (shown in Table 5) the proposed approach performs only 14 evaluations, while [15] perform 53 evaluations to search a final solution. Table 2 Experimental results of comparison between proposed DSE approach with the current approach [8], [9] for small benchmarks Benchmarks [2],[34],[35] Total possible architecture in the design space for one parameter Architecture evaluation using proposed approach (Number of variants analyzed) Architecture evaluation using approach [8],[9] (Number of variants analyzed) Percentage speed up using proposed approach compared to [8],[9] Average speedup using proposed approach compared to [8],[9] Cost Execution time Total IIR Chebyshev Filter 27 4 6 10 18 44.44 % 41. 85 % Mesa Horner 36 5 6 11 19 42.10 % Elliptic Wave Filter 78 5 7 12 19 36.84 % Differential Equation Solver (HAL) 90 5 7 12 19 47.82 % BPF 100 5 8 13 21 38.09 % Table 3 Experimental results of comparison between proposed DSE approach with the current approach [8], [9] for large benchmarks Benchmarks [2],[34],[35] Total possible architecture in the design space for two parameters Architecture evaluation using proposed approach (Number of variants analyzed) Architecture evaluation using approach [8],[9] (Number of variants analyzed) Percentage speed up using proposed approach compared to [8],[9] Average speedup using proposed approach compared to [8],[9] Cost Execution time Total Auto Regressive Filter 144 5 8 13 21 38.09 % 37.56 % MPEG MMV 200 5 9 14 23 39.13 % Matrix multiplication 400 6 10 16 25 36 % JPEG_IDCT 900 6 11 17 27 37.03 % 326 R. SEDAGHAT, A. SENGUPTA Table 4 Experimental results of comparison between proposed DSE approach and the current approach [15] for small benchmarks Benchmarks [2],[34],[35] Total possible architecture in the design space for one parameter Architecture evaluation using proposed approach (Number of variants analyzed) Architecture evaluation using approach [15] (Number of variants analyzed) Percentage speed up using proposed approach compared to [15] Average speedup using proposed approach compared to [15] Cost Execution time Total IIR Chebyshev Filter 27 4 6 10 17 41% 48.7% Mesa Horner 36 5 6 11 21 47% Elliptic Wave Filter 78 5 7 12 31 61% Differential Equation Solver (HAL) 90 5 7 12 32 62.5% BPF 100 5 8 13 35 62% Table 5 Experimental results of comparison between proposed DSE approach and the current approach [15] for large benchmarks Benchmarks [2][34][35] Total possible architecture in the design space for two parameters Architecture evaluation using proposed approach (Number of variants analyzed) Architecture evaluation using approach [15] (Number of variants analyzed) Percentage speed up using proposed approach compared to [15] Average speedup using proposed approach compared to [15] Cost Execution time Total Auto Regressive Filter 144 5 8 13 52 75% 75% MPEG MMV 200 5 9 14 53 73.5% Matrix multiplication 400 6 10 16 65 75.3% JPEG_IDCT 900 6 11 17 72 76.3% 6. CONCLUSIONS This paper presented a novel framework for rapid exploration of the cost-performance tradeoffs for modular multi-objective hardware accelerators. Once the design space for the cost-performance is explored, the final design point with minimum power consumption is searched from the obtained small Pareto optimal set. The proposed DSE approach for different benchmarks yielded superior results in terms of acceleration obtained compared to the current existing approaches. Acknowledgement: This work is supported by the Optimization and Algorithm Research Lab (OPRAL), Ryerson University, Canadian Microelectronics Corporation (CMC), Motorola, NSERC CRSNG, Ontario Innovation Trust and Sun Microsystems. Additionally, This work acknowledges the assistance provided by Science and Engineering Research Board (SERB), Department of Science and Technology, Govt. of India. Rapid Exploration of Cost-Performance Tradeoffs Using Dominance Effect during Design.... 327 REFERENCES [1] G. De Micheli, “Synthesis and Optimization of Digital Circuits”. McGraw-Hill: New York, 1994. [2] Saraju P. Mohanty, Nagarajan Ranganathan, Elias Kougianos and Priyadarsan Patra, “Low-Power High- Level Synthesis for Nanoscale CMOS Circuits” Chapter- High-Level Synthesis Fundamentals, Springer US, 2008 [3] Anirban Sengupta, Reza Sedaghat, Zhipeng Zeng, “A High Level Synthesis design flow with a novel approach for Efficient Design Space Exploration in case of multi parametric optimization objective”, Microelectronics Reliability, Science Direct, Elsevier, Volume 50, Issue 3, March 2010, pp. 424-437. [4] Zhipeng Zeng, Reza Sedaghat, Anirban Sengupta, “A Framework for Fast Design Space Exploration using Fuzzy search for VLSI Computing Architectures”, Accepted to Appear in the Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), June 2, 2010. [5] Anirban Sengupta, Reza Sedaghat, Zhipeng Zeng, “Rapid Design Space Exploration for multi parametric optimization of VLSI designs”, Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), June 2, 2010, Paris, France, Article # 2016 (Session: Logic & High-level Synthesis, C2L-F). [6] Anirban Sengupta, Reza Sedaghat, Zhipeng Zeng, “Hardware Efficient Design of speed optimized Power stringent Application Specific Processor”, Proceedings of IEEE 21st International Conference on Microelectronics (ICM), Morocco, December 22, 2009, pp. 167-170. [7] D. Gajski, N. Dutt, A.Wu, and S. Lin, “High Level Synthesis: Introduction to Chip and System Design”. Kluwer: Norwell, MA, 1992. [8] Kirischian, L;Geurkov, V., Kirischian, V. and Terterian, I. „Multi-parametric optimisation of the modular computer architecture‟, Int. J.Technology, Policy and Management, Vol. 6, No. 3,2006, pp.327–346. [9] Kirischian, L. „Optimization of parallel task execution on the adaptive reconfigurable group organized computing system‟, Proc. of International Conference PARELEC 2000, Canada, pp.150–154. [10] Vyas Krishnan and SrinivasKatkoori, “A Genetic Algorithm for the Design Space Exploration of Datapaths During High-Level Synthesis, IEEE Transactions on Evolutionary Computation, vol. 10, no. 3, June 2006, pp.229-313. [11] E. Torbey and J. Knight, “Performing scheduling and storage optimization simultaneously using genetic algorithms,” in Proc. IEEE Midwest Symp. Circuits Systems, 1998, pp. 284–287. [12] Giuseppe Ascia, Vincenzo Catania, Alessandro G. Di Nuovo, Maurizio Palesi, Davide Patti, “Efficient design space exploration for application specific systems-on-a-chip” Journal of Systems Architecture 53 (2007) pp. 733–750. [13] C. H. Gebotys and M. I. Elmasry, “Global optimization approach for architectural synthesis,” IEEE Trans. Comput.-Aided Des., vol. 12, 1993, pp. 1266–1278. [14] M. K. Dhodhi, F. H. Hielscher, R. H. Storer, and J. Bhasker, “Datapath synthesis using a problem-space genetic algorithm,” in IEEE Trans.Comput.-Aided Des., vol. 14, 1995, pp. 934–944. [15] Harish Ram D. S., M. C. Bhuvaneswari, and Shanthi S. Prabhu, (2012) A Novel Framework for Applying Multiobjective GA and PSO Based Approaches for Simultaneous Area, Delay, and Power Optimization in High Level Synthesis of Datapaths, VLSI Design Hindawi, Article ID 273276, 12 pages [16] E. Torbey and J. Knight, “High-level synthesis of digital circuits using genetic algorithms,” in Proc. Int. Conf. Evol. Comput, May 1998, pp.224–229. [17] Alessandro G. Di Nuovo, Maurizio Palesi, Davide Patti, Fuzzy Decision Making in Embedded System Design,” Proceedings of the 4th International Conference on Hardware/Software Codesign and System synthesis, October 2006,pp. 223-228. [18] A.C.Williams, A.D.Brown and M. Zwolinski,“Simultaneous optimisation of dynamic power, area and delay in behavioural synthesis”, IEE Proc.-Comput. Digit. Tech, Vol. 147, No. 6, 2000, pp. 383-390. [19] I. Das. A preference ordering among various Pareto optimal alternatives. Structural and Multidisciplinary Optimization, 18(1):Aug. 1999, pp.30–35. [20] Christian Haubelt, Thomas Schlichter, Joachim Keinert, Mike Meredith, “SystemCoDesigner: automatic design space exploration and rapid prototyping from behavioral models”, Proceedings of the 45th annual ACM IEEE Design Automation Conference, 2008, pp. 580-585. [21] J. C. Gallagher, S. Vigraham, and G. Kramer,“A family of compact genetic algorithms for intrinsic evolvable hardware,” IEEE Trans. Evolutionary Computation., vol. 8, no. 2 , Apr. 2004, pp. 111–126. [22] Xuejie Zhang and Kam W. Ng, “A review of high-level synthesis for dynamically reconfigurable FPGAs”, Microprocessors and Microsystems, Elsevier, Volume 24, Issue 4, 2000, pp. 199-211. [23] R. M. San and J. P. Knoght, “Genetic algorithms for optimization of integrated circuit synthesis,” in Proc. 5th Int. Conf. Genetic Algorithms, San Mateo, CA, 1993., pp. 432–438. 328 R. SEDAGHAT, A. SENGUPTA [24] R. J. Cloutier and D. E. Thomas, “The combination of scheduling, allocation and mapping in a single algorithm,” in Proc. 27th Design Automation Conf., Jun. 1990, pp. 71–76. [25] N. Wehn et al., “A novel scheduling and allocation approach to datapath synthesis based on genetic paradigms,” in Proc. IFIPWorking Conf. Logic Architecture Synthesis, 1991, pp. 47–56. [26] G. Krishnamoorthy and J. A. Nestor, “Data path allocation using extended binding model,” in Proc. 32nd ACM/IEEE Design Automation Conf.1992, pp. 279–284. [27] J. A. Nestor and G. Krishnamoorthy, “SALSA: A new approach to scheduling with timing constraints,” IEEE Trans. Comput.-Aided Des., vol. 12, 1993, pp. 1107–1122. [28] T. A. Ly and J. T. Mowchenko, “Applying simulated evolution to high level synthesis,” IEEE Trans. Comput.-Aided Des., vol. 12, no. 2, Feb. 1993, pp.389–409. [29] C. T. Hwang, J. H. Lee, Y. C. Hsu, and Y. L. Lin, “A formal approach to the scheduling problem in high- level synthesis,” IEEE Trans. Comput.- Aided Des., vol. 10, no. 2, Feb1991, pp. 464–475. [30] Giuseppe Ascia, Vincenzo Catania, Alessandro G. Di Nuovo, Maurizio Palesi, Davide Patti, “Efficient design space exploration for application specific systems-on-a-chip” Journal of Systems Architecture 53, 2007, pp. 733–750. [31] GAUT: A High-Level Synthesis Tool for DSP Applications”, P. Coussy, C. Chavet, P. Bomel et al., in High-Level Synthesis: From Algorithm to Digital Circuits, Springer, 2008, pp. 147-169. [32] Canis, A., Choi, J., Aldham, M., Zhang, V., Kammoona, A., Czajkowski, T., Brown, S. D., and Anderson, J. H. 2013. LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems. ACM Trans. Embedd. Comput. Syst. 13, 2, Article 24 (September 2013), 27 pages. [33] Villarreal, J., Park, A., Najjar, W., and Halstead, R. 2010. “Designing modular hardware accelerators in C with ROCCC 2.0”. In Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines. 2010, pp. 127–134. [34] http://www.cbl.ncsu.edu/benchmarks/. [35] http://express.ece.ucsb.edu/benchmark/ http://express.ece.ucsb.edu/benchmark/