Low Cost Hardware Back Propagation Algorithm

Ammar A. Hassan
Computer Engineering Department / College of Engineering
University of Baghdad

Al-Khwarizmi Engineering Journal, Vol. 3, No. 1, pp. 81-90 (2007)
(Received 12 September 2005; accepted 30 October 2006)

Abstract:-
The first successful implementations of Artificial Neural Networks (ANNs) were published a little over a decade ago, so it is time to review the progress that has been made in this research area. This paper provides a taxonomy for classifying Field Programmable Gate Array (FPGA) implementations of ANNs. Different implementation techniques and design issues are discussed, such as the trade-off between a suitable activation function and the numerical truncation technique, and the improvement of the learning algorithm to reduce the cost of each neuron and, as a result, the total cost and total speed of the complete ANN. Finally, the implementation of a complete, very fast circuit for recognizing the patterns of the English digit numbers, a neural network with four layers of 70 nodes (neurons) on a single chip using the Xilinx FPGA technique, is given. The main goal of this paper is to obtain the activation function and weights that give minimum hardware cost when all stages of this ANN algorithm are implemented on an FPGA.

1. Introduction
Artificial neural networks (ANNs) are a form of artificial intelligence that has proven useful in different areas of application, such as pattern recognition [1] and function approximation/prediction [2]. The most popular neural network is the multi-layer perceptron trained using the error back propagation algorithm [3]. However, an important obstacle to using this network in many applications is the slow training and the lack of a clear methodology for determining the network topology before training starts. It is therefore desirable to speed up the training and allow fast experimentation with various topologies. One possible solution is an implementation on a reconfigurable computing platform, e.g. Field Programmable Gate Arrays (FPGAs). Reconfigurable computing is a means of increasing the processing density (i.e. greater performance per unit of silicon area) above and beyond that provided by general-purpose computing platforms [4]. FPGAs are a medium that can be used for reconfigurable computing, since they allow custom design of fine-grain logic, compared with the coarse-grain logic found in general-purpose computing platforms. FPGAs are a form of programmable logic which offers flexibility in design like software, but with performance closer to Application Specific Integrated Circuits (ASICs). With the ability to be reconfigured an endless number of times after manufacture, FPGAs have traditionally been used as a prototyping tool for hardware designers. However, as the die capacities of FPGAs have grown over the years, so has their use in reconfigurable computing applications [2, 4].

2. Problem Formulation
The design problem of an ANN on an FPGA is a precision vs. area trade-off. One way to achieve the density advantage of reconfigurable computing over general-purpose computing is to make the most efficient use of the hardware area available.
In terms of an optimal precision vs. area trade-off, this can be achieved by determining the minimum allowable precision, whose criterion is to minimize hardware area usage without sacrificing quality of performance. Because a reduction in precision introduces more error into the system, finding the minimum allowable precision is really a question of determining the maximum amount of uncertainty (i.e. quantization error due to limited precision) that an application can withstand before its performance begins to degrade. Hence, the minimum allowable precision and the suitable numeric representation to use in hardware often depend on the application at hand and the algorithm used. Fortunately, suitable precision for backpropagation-based ANNs has already been determined empirically.

Selecting weight precision is one of the important choices when implementing ANNs on FPGAs. Weight precision is used to trade off the capabilities of the realized ANN against the implementation cost. A higher weight precision means fewer quantization errors in the final implementation, while a lower precision leads to simpler designs, greater speed and reductions in area requirements and power consumption. One way of resolving the trade-off is to determine the "minimum precision" required to solve a given problem. Traditionally, the minimum precision is found through trial and error, by simulating the solution in software before implementation. Holt and Baker [5] studied the minimum precision required for a class of benchmark classification problems and found that 16-bit fixed-point is the minimum allowable precision that does not diminish an ANN's capability to learn these benchmark problems.

3. ANN Example
3.1. Structure of the Circuit
The following example [6] presents how these two parameters affect the hardware resources. Note that in a neural network the resource capacity differs from one network to another according to the number of neurons in each layer. The existence of hidden units allows the network to develop complex feature detectors, or internal representations. Fig. (1) shows the application of a four-layer network to the problem of recognizing English digit numbers. The two-dimensional grid containing the numeral "7" forms the input layer (this two-dimensional array can be arranged as a one-dimensional one, giving the 25 input neurons shown in Fig. (1)). The first hidden layer is formed from 40 units, each of which might be strongly activated by a horizontal line in the input, while the second hidden layer is formed from 20 hidden units, each fully connected to every unit in the first hidden layer. The output layer has ten units that represent the value of the digit. The behavior of these hidden units is learned automatically, not preprogrammed; the weights of each layer of the network are computed using a C++ program and the result is implemented on the FPGA as shown in the next section.

Fig. (1): Multilayer network to learn to classify English digit numbers (Input: 25 nodes; Hidden1: 40 nodes; Hidden2: 20 nodes; Output: 10 nodes for the digits 0-9).

3.2. Implementation of ANN Components on FPGA
The digital ANN architecture proposed in the previous section is an example of a reconfigurable computing application, where all stages of the algorithm reside together on the FPGA at once. These components are based on the idea of implementing simple (or basic) arithmetic operations, such as addition and multiplication.
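To make this concrete, the sketch below (illustrative C++ only, not the paper's design files; the 16-bit Q8.8 fixed-point format, the function names and the hard-limit activation are assumptions) shows how a single node reduces to one multiplier and one adder per input plus an activation stage, which is the set of operations that has to be mapped onto FPGA resources.

```cpp
// Illustrative sketch of one node built from basic arithmetic operations.
// Assumptions (not from the paper): 16-bit Q8.8 fixed-point values, these names,
// and an arithmetic right shift for negative products.
#include <cstddef>
#include <cstdint>
#include <vector>

using fixed_t = int16_t;            // Q8.8: 8 integer bits, 8 fractional bits
constexpr int FRAC_BITS = 8;

// Fixed-point multiply: widen to 32 bits, multiply, shift back to Q8.8.
inline fixed_t fx_mul(fixed_t a, fixed_t b) {
    return static_cast<fixed_t>((static_cast<int32_t>(a) * b) >> FRAC_BITS);
}

// One node: a multiplier and an adder per input (the multiply-accumulate),
// a bias add, then an activation stage.
fixed_t node(const std::vector<fixed_t>& in,
             const std::vector<fixed_t>& w,
             fixed_t bias) {
    int32_t acc = bias;                              // accumulate in a wider register
    for (std::size_t j = 0; j < in.size(); ++j)
        acc += fx_mul(w[j], in[j]);                  // one multiply + one add per input
    // A LUT-based sigmoid would be read here; this sketch uses the bipolar hard limit.
    return (acc >= 0) ? static_cast<fixed_t>(1 << FRAC_BITS)
                      : static_cast<fixed_t>(-(1 << FRAC_BITS));
}
```

The per-input multiply and add in the loop are exactly the components whose FPGA cell counts are analyzed in the next sections.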
Floating-point precision is a problem the designer must deal with when testing and validating the circuits, since it governs the quantization error according to the number of bits in each node. Therefore, a truncation technique is used in the design to reduce the floating-point precision and employ the minimum hardware resources available on the FPGA. In addition, the activation function best suited to implementation together with the truncation technique must be chosen.

As stated previously, the ANN algorithm components (such as the number of hidden layers and the number of units in each layer) differ from one application to another. The English digit numbers NN is the application used in this work to implement the ANN on an FPGA, and it is composed of four layers. The quantization error after 5,000 iterations, using two types of activation function, is given in Table (1). Using the sigmoid function without truncation (30-bit floating point), the quantization error is 0.02 (i.e. 98% of the outputs are correct), while, by the same method but with the bipolar function instead of the sigmoid, the quantization error is 0.04.

Table (1): Quantization error with and without truncation
Activation Function | Without Truncation (30-bit) | Direct Truncation after Learning (10-bit)
Sigmoid             | 2%                          | 13-19%
Bipolar             | 4%                          | 18-26%

4. Arithmetic Architecture for FPGA-Based ANNs
To calculate the cost required for the English digit number NN on an FPGA, the way the activation function is selected against the truncation approach should be clarified as follows.

4.1. Sigmoid Function without Truncation (30-bit)
Before calculating the cost of the FPGA-based digit NN, note that this function produces floating-point values. It therefore requires a large amount of logic storage built from basic logic cells (LCs) called Look-Up Tables (LUTs). These LCs are formed as a RAM that stores the output value of the function for each unit of the NN algorithm. For 30-bit floating point, the RAM implemented on the FPGA has a 30-bit input and a 30-bit output for each unit of the NN algorithm. Thus, when the sigmoid function is applied in the algorithm design, the number of LCs increases, and so does the overall hardware resource usage of the NN algorithm. The average cost of each unit (node) is calculated according to equation (1):

$\text{Cost/unit} = n\alpha + n\beta + \delta$    (1)

Where,
n = number of bits in the data bus.
α = cost of each multiplier on the FPGA.
β = cost of each adder/subtractor on the FPGA.
δ = cost of each LUT on the FPGA for the sigmoid function.

i- For n = 30 bit with the sigmoid function, α is a (30×30)-bit multiplier and requires about 2,250 cells; β is a 30-bit adder/subtractor and requires about 225 cells; finally, δ requires about 125,000 cells using the CORDIC approach. The cost of the arithmetic components per unit (for the 30-bit sigmoid) is therefore about 200,000 cells. Since the English digit NN algorithm described previously has seventy units:
Total cost (for the English digit NN) = 200,000 * 70 = 14,000,000 cells.
This process is completed in approximately 17 gate propagation delays.
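The arithmetic behind this figure can be checked with a short worked example; the C++ fragment below (illustrative only, with the cell counts quoted above hard-coded as constants) simply evaluates equation (1) for the 30-bit sigmoid case and multiplies by the 70 units of the English digit NN.

```cpp
// Cost model of equation (1), Cost/unit = n*alpha + n*beta + delta,
// evaluated with the 30-bit figures quoted above (illustrative program only).
#include <iostream>

int main() {
    const long n     = 30;       // bits in the data bus
    const long alpha = 2250;     // cells per (30x30)-bit multiplier
    const long beta  = 225;      // cells per 30-bit adder/subtractor
    const long delta = 125000;   // cells for the CORDIC sigmoid LUT
    const long units = 70;       // active neurons in the English digit NN

    const long per_unit = n * alpha + n * beta + delta;                   // ~199,250, i.e. ~200,000 cells
    std::cout << "cost per unit : " << per_unit << " cells\n";
    std::cout << "total NN cost : " << per_unit * units << " cells\n";    // ~14,000,000 cells
    return 0;
}
```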
ii- When the truncation technique is applied through the learning process with the sigmoid function, the unit output precision is reduced to 10 bits. Using equation (1) with n = 10 bit, α = 300 cells, β = 75 cells and δ = 60 cells gives:
Total cost / unit ≈ 4,000 cells.
Total cost (for the English digit NN) = 4,000 * 70 = 280,000 cells, with the overall propagation delay reduced to 7 gate propagation delays.
This result shows that applying the truncation technique in the NN algorithm design reduces the number of LUTs on the FPGA and therefore the overall hardware resource capacity.

4.2. Bipolar Function
As the previous section demonstrated, a large number of cells is required to implement a NN based on the sigmoid function. The bipolar function can therefore be used instead of the sigmoid function in the NN algorithm. This function does not need storage cells (LUTs) for its output values, because its output is either "-1" or "1"; this is called the multiplierless technique. This approach has α = 1 cell and δ = 0 cells, so the cost of each unit in a NN algorithm based on the bipolar function is calculated using equation (2):

$\text{Cost/unit} = n + n\beta$    (2)

Where,
n = number of bits in the data bus.
β = cost of each adder/subtractor on the FPGA.

i- Without the truncation technique (i.e. n = 30 bit), β = 215 cells, so the cost per unit ≈ 6,500 cells. This process takes 17 gate propagation delays. To implement the English digit NN, which includes 70 active neurons, the total cost ≈ 455,000 cells.
ii- With the truncation technique (i.e. n = 10 bit), β = 64 cells, so the cost per unit = 650 cells. This process takes 7 gate propagation delays, and implementing the English digit NN with its 70 neurons gives a total cost ≈ 45,500 cells.
Thus the total LUT cost of the NN can be reduced by applying the bipolar function with the truncation technique on FPGAs.
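The saving comes from the inputs of such a unit being +1 or -1: every product w*(±1) collapses to +w or -w, so the multipliers and the activation LUT disappear and essentially only the adder/subtractor term of equation (2) remains. A minimal C++ sketch of this multiplierless unit follows (names and data types are illustrative assumptions, not taken from the paper).

```cpp
// Illustrative multiplierless bipolar unit: with inputs of +1 or -1, each
// product w*x collapses to +w or -w, so only adders/subtractors are needed
// and no activation LUT is stored. Names and types are assumptions.
#include <cstddef>
#include <cstdint>
#include <vector>

// Inputs are stored as signs: true = +1, false = -1.
int bipolar_unit(const std::vector<bool>& input_sign,
                 const std::vector<int32_t>& w,     // fixed-point weights
                 int32_t bias) {
    int32_t sum = bias;
    for (std::size_t j = 0; j < input_sign.size(); ++j)
        sum += input_sign[j] ? w[j] : -w[j];        // add or subtract, never multiply
    return (sum >= 0) ? +1 : -1;                    // bipolar activation, no LUT
}
```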
5. Proposed Backpropagation Algorithm [5]
An ANN using the classic backpropagation algorithm [6], as in Fig. (2), has three basic phases of execution, named (1), (2) and (3) below. The proposed backpropagation algorithm has the same structure as the classic algorithm, with three additional steps, named (1a), (2a) and (3a), as follows.

Fig. (2): Generic structure of an ANN.

5.1. Initialization
The following initial parameters have to be determined by the ANN trainer a priori:
(i) $W_{kj}^{(s)}(n)$ is defined as the synaptic weight that corresponds to the connection from neuron unit j in the (s-1)th layer to neuron unit k in the sth layer of the neural network.
(ii) $\eta$ is defined as the learning rate and is a constant scaling factor.
(iii) $\theta_k^{(s)}$ is defined as the bias of a neuron; it is similar to a synaptic weight in that it corresponds to a connection to neuron unit k in the sth layer of the ANN, but it is not connected to any neuron unit j in the (s-1)th layer.
(1a) Set counter = 0.

5.2. Forward Computation
During the forward computation, data from the neurons of a lower layer (i.e. the (s-1)th layer) are propagated forward to the neurons of the upper layer (i.e. the sth layer) via a feedforward connection network. The structure of such a neural network is shown in Fig. (2), where layers are numbered 0 to M and neurons are numbered 1 to N. The computation performed by each neuron (in the hidden layer) is as follows:

$H_k^{(s)} = \sum_{j=1}^{N_{(s-1)}} w_{kj}^{(s)}\, o_j^{(s-1)} + \theta_k^{(s)}$    (3)

where k = 1, ..., N and s = 1, ..., M, and
$N_{(s-1)}$ = number of neurons in the (s-1)th layer of the ANN.
$H_k^{(s)}$ = weighted sum of the kth neuron in the sth layer.
$w_{kj}^{(s)}$ = synaptic weight corresponding to the connection from neuron unit j in the (s-1)th layer to neuron unit k in the sth layer of the neural network.
$o_j^{(s-1)}$ = output of the jth neuron in the (s-1)th layer.

On the other hand, the output computation of the neurons in any layer is as follows:

$o_k^{(s)} = f(H_k^{(s)})$    (4)

where k = 1, ..., N and s = 1, ..., M, and
$o_k^{(s)}$ = output of the kth neuron in the sth layer.
$f(H_k^{(s)})$ = activation function computed on the weighted sum $H_k^{(s)}$.

Note that some sort of sigmoid function is often used as the nonlinear activation function, such as the logsig function in equation (5) or the bipolar function in equation (6):

$f(x) = \mathrm{logsig}(x) = \dfrac{1}{1 + \exp(-x)}$    (5)

$f(x) = \mathrm{bipolar}(x) = \begin{cases} 1 & \text{for } x \ge 0 \\ -1 & \text{for } x < 0 \end{cases}$    (6)

(2a) Increment the counter for each cycle: counter = counter + 1.

5.3. Backward Computation
In this step the weights of the network are updated. The criterion of the learning algorithm is to minimize the error between the expected (or teacher) value and the actual output value determined in the forward computation. The following steps are performed:
i- Starting with the output layer and moving back towards the input layer, calculate the error terms and local gradients as follows:

$\varepsilon_k^{(s)} = \begin{cases} t_k - o_k^{(M)} & s = M \\ \sum_{j=1}^{N_{(s+1)}} w_{jk}^{(s+1)}\, \delta_j^{(s+1)} & s = M-1, \dots, 1 \end{cases}$    (7)

where
$t_k$ = target output of the kth neuron in the Mth (output) layer.
$N_{(s+1)}$ = number of neurons in the (s+1)th layer of the ANN.
$\varepsilon_k^{(s)}$ = error term of the kth neuron in the sth layer; for the output layer it is the difference between the teaching signal $t_k$ and the neuron output $o_k^{(M)}$.
$\delta_j^{(s+1)}$ = local gradient of the jth neuron in the (s+1)th layer.

$\delta_k^{(s)} = \varepsilon_k^{(s)}\, f'(H_k^{(s)}), \quad s = 1, \dots, M$    (8)

ii- Using the local gradients calculated in step i, calculate the weight (and bias) changes for all the weights as follows:

$\Delta w_{kj}^{(s)} = \eta\, \delta_k^{(s)}\, o_j^{(s-1)}, \quad k = 1, \dots, N, \; j = 1, \dots, N, \; s = 1, \dots, M$    (9)

where $\Delta w_{kj}^{(s)}$ is the change in the synaptic weight (or bias), corresponding to the gradient of the error, for the connection from neuron unit j in the (s-1)th layer to neuron unit k in the sth layer.
iii- Once all weight (and bias) changes have been calculated in step ii, update all the weights (and biases) as follows:

$w_{kj}^{(s)}(n+1) = w_{kj}^{(s)}(n) + \Delta w_{kj}^{(s)}(n)$    (10)

where k = 1, ..., N neurons in the sth layer and j = 1, ..., N neurons in the (s-1)th layer, and
$W_{kj}^{(s)}(n+1)$ = updated synaptic weight (or bias) to be used in the (n+1)th iteration of the forward computation.
$\Delta W_{kj}^{(s)}(n)$ = change in the synaptic weight (or bias) calculated in the nth iteration of the backward computation, where n is the current iteration.
$W_{kj}^{(s)}(n)$ = synaptic weight (or bias) used in the nth iteration of the forward and backward computations, where n is the current iteration.

(3a) Truncation. Whenever counter > Q, truncate the values of the weights $W_{kj}^{(s)}(n)$ as in equation (11):

$W_{kj}^{(s)}(n) = \dfrac{\mathrm{Int}\!\left(W_{kj}^{(s)}(n) \cdot 2^{n}\right)}{2^{n} - 1}$    (11)

where Q is a small integer value that depends on the ANN, typically between 5 and 50, and n is the number of bits in the data bus. For truncation to 10-bit precision, this equation becomes equation (12):

$W_{kj}^{(s)}(n) = \dfrac{\mathrm{Int}\!\left(W_{kj}^{(s)}(n) \cdot 1024\right)}{1023}$    (12)
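The complete procedure can be summarized by the software model below (an illustrative C++ sketch, not the author's implementation; the layer sizes match the English digit NN, while the learning rate, the value Q = 20, the per-epoch placement of the counter and the truncation of biases alongside the weights are assumptions). It runs the classic forward and backward passes of equations (3)-(10) with the logsig activation and applies the 10-bit truncation of equation (12) once the counter exceeds Q.

```cpp
// Illustrative software model of the proposed algorithm (assumptions noted above).
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Layer {
    std::vector<std::vector<double>> w;    // w[k][j]: from neuron j (layer s-1) to k (layer s)
    std::vector<double> bias, out, delta;  // theta_k, o_k, delta_k
};

static double logsig(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Equation (12): W = Int(W * 1024) / 1023, i.e. 10-bit truncation of a weight.
static double truncate10(double w) { return std::trunc(w * 1024.0) / 1023.0; }

struct Network {
    std::vector<Layer> layers;             // layers 1..M (the inputs form layer 0)

    explicit Network(const std::vector<int>& sizes) {
        for (std::size_t s = 1; s < sizes.size(); ++s) {
            Layer L;
            L.w.assign(sizes[s], std::vector<double>(sizes[s - 1]));
            L.bias.assign(sizes[s], 0.0);
            L.out.assign(sizes[s], 0.0);
            L.delta.assign(sizes[s], 0.0);
            for (auto& row : L.w)          // small random initial weights
                for (double& v : row) v = (std::rand() / (double)RAND_MAX - 0.5) * 0.2;
            layers.push_back(L);
        }
    }

    // Forward computation, equations (3) and (4).
    void forward(const std::vector<double>& x) {
        const std::vector<double>* in = &x;
        for (auto& L : layers) {
            for (std::size_t k = 0; k < L.w.size(); ++k) {
                double h = L.bias[k];
                for (std::size_t j = 0; j < in->size(); ++j) h += L.w[k][j] * (*in)[j];
                L.out[k] = logsig(h);
            }
            in = &L.out;
        }
    }

    // Backward computation, equations (7)-(10), for one pattern.
    void backward(const std::vector<double>& x, const std::vector<double>& t, double eta) {
        forward(x);
        for (int s = (int)layers.size() - 1; s >= 0; --s) {
            Layer& L = layers[s];
            for (std::size_t k = 0; k < L.out.size(); ++k) {
                double eps;
                if (s == (int)layers.size() - 1) {
                    eps = t[k] - L.out[k];                      // output layer, eq (7) top case
                } else {
                    eps = 0.0;                                  // hidden layer, eq (7) bottom case
                    Layer& up = layers[s + 1];
                    for (std::size_t j = 0; j < up.out.size(); ++j)
                        eps += up.w[j][k] * up.delta[j];
                }
                L.delta[k] = eps * L.out[k] * (1.0 - L.out[k]); // eq (8), logsig derivative
            }
        }
        const std::vector<double>* in = &x;
        for (auto& L : layers) {                                // eqs (9) and (10)
            for (std::size_t k = 0; k < L.w.size(); ++k) {
                for (std::size_t j = 0; j < in->size(); ++j)
                    L.w[k][j] += eta * L.delta[k] * (*in)[j];
                L.bias[k] += eta * L.delta[k];
            }
            in = &L.out;
        }
    }

    // Step (3a): truncate all weights (biases included here as an assumption), eq (12).
    void truncate_weights() {
        for (auto& L : layers) {
            for (auto& row : L.w) for (double& v : row) v = truncate10(v);
            for (double& b : L.bias) b = truncate10(b);
        }
    }
};

int main() {
    Network net({25, 40, 20, 10});                 // English digit NN: 25-40-20-10
    std::vector<std::vector<double>> patterns;     // training patterns would go here
    std::vector<std::vector<double>> targets;
    const double eta = 0.1;                        // example learning rate
    const int Q = 20;                              // example value in the 5..50 range
    int counter = 0;                               // step (1a)
    for (int it = 0; it < 5000; ++it) {
        ++counter;                                 // step (2a)
        for (std::size_t p = 0; p < patterns.size(); ++p)
            net.backward(patterns[p], targets[p], eta);
        if (counter > Q) net.truncate_weights();   // step (3a)
    }
    return 0;
}
```

In hardware, the same operation corresponds roughly to keeping a 10-bit representation of each weight, which is what reduces the data bus width n in the cost equations (1) and (2).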
6. Comparative Study
The example considered is the four-layer network for the problem of recognizing English digit numbers, which has 70 nodes. The implementation of this NN using ISE 4.1i on a Xilinx platform [7, 8] shows the effect of the proposed algorithm on the total cost and the final error, as given in Table (2).

Table (2): Final error, total cost and maximum speed of the English digit NN for each activation function, precision and learning algorithm
Activation Function | Number of Bits | Algorithm               | Final Error % | Total NN Cost / cell | Max Speed of NN / MHz
Sigmoid             | 30             | Classic                 | 2             | 14,000,000           | 35
Bipolar             | 30             | Classic                 | 4             | 455,000              | 35
Sigmoid             | 10             | Classic with truncation | 16            | 280,000              | 74
Bipolar             | 10             | Classic with truncation | 22            | 45,500               | 74
Sigmoid             | 10             | Proposed                | 6             | 455,000              | 74
Bipolar             | 10             | Proposed                | 9             | 45,500               | 74

7. Conclusion
In general, any complete ANN circuit has a very high cost, because the required resolution of the weight values is about 30 bits or more. Truncating these data to a low precision such as 10 bits reduces the total cost to about 5% of the original cost, but increases the error by more than 10 times. In addition, the type of transfer function (sigmoid or bipolar) determines the cost of the classic NN: the sigmoid function costs about 30 times more than the bipolar transfer function, but the bipolar function has, on average, twice the error. This work has shown that choosing a transfer function suited to the NN algorithm, together with an optimal choice of precision, can reduce the hardware cost; the proposed truncation algorithm reduces the precision from 30 bits to 10 bits with the minimum possible error. Combining the proposed algorithm with the bipolar transfer function reduces the cost to about 0.3% of the total cost and doubles the speed, while increasing the error from 2% to about 10%. Direct truncation with the classic learning algorithm and the bipolar transfer function gives a very high error, reaching about 25%, whereas truncation using the proposed learning algorithm with the bipolar transfer function reduces the error to less than 10% at the same cost and the same speed.

8. References
[1] M. Skrbek, "Fast Neural Network Implementation", Neural Network World, Vol. 9, No. 5, pp. 375-391, 1999.
[2] J. G. Eldredge, "FPGA Density Enhancement of a Neural Network Through Run-Time Reconfiguration", Master's Thesis, Department of Electrical and Computer Engineering, Brigham Young University, May 1994.
[3] D. E. Rumelhart, J. L. McClelland and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, MIT Press, Cambridge, Massachusetts, 1986.
[4] A. DeHon, "The Density Advantage of Configurable Computing", IEEE Computer, Vol. 33, No. 5, pp. 41-49, April 2000.
[5] J. L. Holt and T. E. Baker, "Back Propagation Simulations Using Limited Precision Calculations", in Proceedings of the International Joint Conference on Neural Networks, Vol. 2, pp. 121-126, 1991.
[6] E. Rich and K. Knight, "Artificial Intelligence", McGraw-Hill, New York, Second Edition, 1991.
[7] Xilinx, "Virtex 2.5V FPGAs", Data Sheet, www.xilinx.com, May 1999.
[8] D. F. Wolf, R. A. F. Romero and E. Marques, "Using Embedded Processors in Hardware Models of Artificial Neural Networks", in Proceedings of SBAI - Simpósio Brasileiro de Automação Inteligente, pp. 78-83, 2001.
9. List of Symbols
ANN: Artificial Neural Network.
ASIC: Application Specific Integrated Circuit.
CORDIC: COordinate Rotation DIgital Computer.
FPGA: Field Programmable Gate Array.
LC: Logic Cell.
LUT: Look-Up Table.
RAM: Random Access Memory.

Building a Low-Cost Implementation of the Back Propagation Algorithm

Ammar Adel Hassan
Computer Engineering Department / College of Engineering
University of Baghdad

Abstract:-
The first successful implementations of artificial neural networks (ANNs) were published a little more than a decade ago, so the time has come to review the progress made in this field of research. This abstract provides the basic idea of applying the available classes of Field Programmable Gate Arrays (FPGAs) to build artificial neural networks. Different implementation techniques and design ideas are discussed, such as obtaining a suitable activation function and a numerical truncation technique, as well as improving the learning algorithm to reduce the cost of building a neuron and hence the total cost, while improving the performance of the neural network. Finally, a complete very high-speed circuit is built to recognize the shapes of the English digit numbers, through an artificial neural network with four layers of 70 nodes (neurons) on a single chip, using the Xilinx FPGA technique.