Ammar A. Hassan /Al-khwarizmi Engineering Journal ,Vol.3, No. 1 PP,  80-41 (2007)

١

Reduction of the error in the hardware neural network
  

Dr. Dhafer r. Zaghar
Computer and programming Engineering Department

College of EngineeringAL-Mustanserya
University / Baghdad/ Iraq

  
(Received 7 March 2006; accepted 24 April 2007)

Abstract:-
Specialized hardware implementations of Artificial Neural Networks (ANNs) can offer faster 

execution than general-purpose microprocessors by taking advantage of reusable modules, parallel 
processes and specialized computational components. Modern high-density Field Programmable Gate 
Arrays (FPGAs) offer the required flexibility and fast design-to-implementation time with the 
possibility of exploiting highly parallel computations like those required by ANNs in hardware. The 
bounded width of the data in FPGA ANNs will add an additional error to the result of the output. This 
paper derives the equations of the additional error value that generate from bounded width of the data 
and proposed a method to reduce the effect of the error to give an optimal result in the output with a 
low cost.

Key Words:   Neural, co-processor, DSP, FPGA, ISE 4.1i software, adder, multiplier.

1.Introduction

Artificial neural networks have for 
many years proved themselves effective tools in 
classification and even prediction of arrays of 
patterns. Today, artificial neural networks are 
used in diverse applications, ranging from voice 
recognition to weather prediction models [1,2]. 
As industry has developed the need for rapid 
prototyping and denser devices, FPGA have 
become excellent tools for the implementation 
of complex digital systems [3,4]. By taking 
advantage of the dynamic flexibility that 
FPGAs have to offer, a digital implementation 
of a general-purpose ANN in the form of a co-
processor is created and presented in this paper. 
This ANN co-processor has been designed to 
work in conjunction with a DSP or general-
purpose processor to create a fast and flexible 

intelligent ANN system with the potential to be 
used in a wide range of applications.

By realizing specific hardware entities 
commonly used in ANNs like multiplication 
and addition and exploiting any parallelism 
found in the net, the hardware implementation 
of an ANN can very well outperform its 
software counterpart when comparing the 
throughput of both methods. One, nevertheless, 
has to keep in mind that the use of FPGAs 
implies the sacrifice of flexibility for speed. 
While the FPGA we use has a relatively large 
quantity of resources, we make a conscious 
effort to improve speed while keeping the gate 
count to a minimum. For this reason, each layer 
in the ANN is composed of parallel neurons 
that individually compute a summation of 
products, and all the internal components of the 
network are designed to take advantage of the

Al-khwarizmi  
                                                                                                                                                                                             Engineering               
                                                                                                                                                                  Journal                                        

                                                Al-Khwarizmi Engineering Journal, Vol.3, No.2, pp1-7 (2007)

This page was created using Nitro PDF trial software.

To purchase, go to http://www.nitropdf.com/

http://www.nitropdf.com/


         Dr. Dhafer r. Zaghar /Al-khwarizmi Engineering Journal, Vol.3, No.2 PP1-7 (2007)

available hardware resources like fast-carry 
chains, lookup tables and distributed RAM 
blocks [5].

2. Hidden and Output Layer 
Design

The target device for the co-processor 
design is Xilinx’s Spartan II FPGA. This device 
is chosen because of its distributed memory 
RAM banks which allow implementation of 
fast register files where inputs and weights are 
stored, its versatile I/O interface that supports 
3.3V outputs and TTL voltage levels and its 
low market cost. The complete co-processor 
design is written in VHDL (VHSIC Hardware 
Description Language and the acronym 
VHSIC, in turn, refers to the Very High-Speed 
Integrated Circuit program) using a structural 
approach with the help of Xilinx Foundation 
Series ISE 4.1i software [6].

The hardware implementation of the 
neurons realizes the following two equations:

1210
1

0

,....M-,,    jxwa i
N

i

s

ij

s

j
 





...         (1)

       
)f( ao
s

j

s

j
 ….....               (2)

Where a
s

j
in (1) is the jth-accumulated 

sum for layer s with M neurons and N inputs, 

w
s

ij
is the weight between node i in layer s-1 

and node j in layer s, xi is the i
th  input  and o

s

j

in (2) is the output of the jth neuron of layer s 
given by the activation function [5].

3.Hardware neural node

The corresponding block diagram in 
fig.(1) is a single neuron that has N input data 
and N input weight each one has k-bit bus 
width. The calculation of the output pass in to 
two stages first is the multiplication then second 
is the accumulation of the result of the 
multiplication stage.

The multiplication stage has N 
multiplier each multiplier has two inputs with 
K-bit one for the data and the other for weight. 
Each multiplier give 2K-bit output but the next 
stage has K-bit input, therefore the output must 
be truncated to K-bit. This truncation represents 

a positive error in the range 20
-K

 to that has 

average value equal to 2
1-K
.

The accumulation stage has M layer that 
can be calculated from equation (3), each one ( 

ith layer)  will have 2
iM 

adder unit.

log
2

N*
M  …………………………..(3)

       
Where N* is the nearest maximum power of 2 
number to N and N is the number of input data 
buses in neuron.

Each adder unit in the layers of the 
accumulation stage has two K-bit inputs with 
single output that has (K+1)-bit but the next 
layer is design to K-bit input, therefore the 
output must be truncated to K-bit. This 
truncation represents a positive error in the 

range 2
1

0
-K

 to that has average value equal to 

2
2-K

[7, 8]. However, the average value of 

the error in each layer Ei calculated from 
equation (4) and the total average error in 
accumulation stage calculated Ea from equation 
(5) below. 

222
2 TM-i-k-

i
 * E  …               (4)

2M-K-i- where T         

2
2

1

1

-K-
M

i
M-i

i
a

 M*EE  


…...         (5)

       
The total error in the neuron En is the sum of 
the multiplication stage and the accumulation 
stage with the effect of truncation in the next 
stage to the error of the past stage as in equation 
(6). 

This page was created using Nitro PDF trial software.

To purchase, go to http://www.nitropdf.com/

http://www.nitropdf.com/


         Dr. Dhafer r. Zaghar /Al-khwarizmi Engineering Journal, Vol.3, No.2 PP1-7 (2007)

)  (M M* N   * * EE
-K--K--K-

aM

-K-

n 222
2
2 111

1

1 ……. (6)

The result of error in output in range 

from 210
-K

) to (M  with positive average 

equal to 2
1

1



-K

) (M and the mean square 

error (MSE) equal to 21
322  K-)(M   

However, the number of data inputs in ANN is 
varied from 5 to 1000, that is mean the value of 
M is varied generally between 3 to 10 and it can 
be writen as:                

            
4. Reduction of the Error

There are three suggestion steps to 
reduce the total error in the output of the 
hardware neuron as follows:

1- Expanded data bus in the internally 
layers of the accumulation stage and make all 
the truncation in the last layer.  That means if 
the first layer has K-bit the second layer will 
have (K+1)-bit and the ith layer will have (K+i-
1)-bit and so on until to the last (M) layer that 
will truncate the data to the original data bus K-

bit. This process will reduce the error in the 

accumulation stage to range 20
-K

 to that has 

average value equal to 2
1-K
. That means that 

the total error output is in range 20
-K

 to with 

positive average equal to 2
1-K

and MSE equal 

to 2
32 K-
.    

2- Add a round unit to the multipliers of 
the multiplication unit as shown in fig. (2).The 
round unit sums up the truncated values of each 
two-neighbor multiplier then approximate the 

KxK bit 
Multiplier 1

KxK bit 
Multiplier 2

KxK bit 
Multiplier N-1

KxK bit 
Multiplier N

K+K bit 
Adder 

N/2

K+K bit 
Adder 1

K+K 
bit 

Adder 
1

Multiplication stage first adder layer last M adder layer 

Fig. (1): Block diagram for implementing a single neuron.

.2
v

M 

This page was created using Nitro PDF trial software.

To purchase, go to http://www.nitropdf.com/

http://www.nitropdf.com/


         Dr. Dhafer r. Zaghar /Al-khwarizmi Engineering Journal, Vol.3, No.2 PP1-7 (2007)

result to the LSB value. Hence, if this value is 
greater than the half of the LSB it will add to 
the accumulator else it will neglect if it is 
smaller than the half of the LSB. This process 
will reducte the error in the multiplication stage 

to range 22
22 


-KK

 to that has average 

value equal to zero and MSE equal to 2
42 K-
.

3- Add a round unit to the last layer of 
the accumulation as shown in fig. (3). The 
round unit approximates the truncated value to 
the LSB value if this value is greater than half 
of the LSB and neglected the truncated value if 
it is smaller than half of the LSB. This process 
will reduce the error in the multiplication stage 

to range 22
22 


-KK

 to that has average 

value equal to zero and MSE equal to 2
42 K-
.

The total error of the neuron output after 
applying the three steps is in range 

22
11 


-KK

 to that has average value equal 

to zero and MSE equal to 2
22 K-
.

K MSBs

K LSBs
neglect

K MSBs

K LSBs
neglect

K+K 
bit 

Adder KxK 
bit 
Mult.

KxK 
bit 
Mult.

a
b

K MSBs

K LSBs

K MSBs

K LSBs

K+K 
bit 

Adder KxK 
bit 
Mult.

KxK 
bit 
Mult. Round 

unit
  

Carry input

Fig.(2): Block diagram for multiplier and the first layer in accumulation unit in neuron 
(a) Original and (b) after step 2.

K MSBs  
output

M+1 
LSBs

neglect

(K+M)+(K+M) 
bit Adder 

a b

Round 
unit

  
Carry input

K MSBs  
output

M+1 
LSBs

(K+M)+(K+M) 
bit Adder 

Fig. (3): Block diagram of the last layer in accumulation unit in neuron 
(a) Original and (b) after step 1 and step3

This page was created using Nitro PDF trial software.

To purchase, go to http://www.nitropdf.com/

http://www.nitropdf.com/


         Dr. Dhafer r. Zaghar /Al-khwarizmi Engineering Journal, Vol.3, No.2 PP1-7 (2007)

5. Software Implementation
The proposed steps to reduce the error is 

tested using software simulation to implement 
an ANN equivalent to hardware circuit of this 
network. 

At first the original ANN is 
implemented using back propagation algorithm 
[9] using C++ or matlab, then it will be 
modified to represent the hardware ANN by 
applying the following steps: 

a) Define all values in the program as 
integers fill in rang ± (2k-1 -1).

b) Replace each addition process c= a+b 
with c= (a+b)/2.

c) Replace each multiplication process z= 
x*y by the following steps:

1- Convert the second number (y) to binary 
number (yy).

2- Multiply the digits of the binary number 
(yy) by weights that are the powers of 
two to calculate the set {u}.

3- Use shift and add (under condition b) 
approach [10] between the number a 
and {u} to satisfy the multiplication 
process.

The simulation of step 1 in section (4) 
require to replace condition (b) 

{ 





1

0

)(b)condition under (
N

i
iac } in the original 

program to become 





1

0
2/

N

i

M

iac . The 
simulation of step 2 in section (4) require to 
replace condition (b in condition c3) to become 
c= (a+b+0.5)/2. While the simulation of step 3 
in section (4) require to replace condition (b) to 
become c= (a+b+0.5)/2.

6. Example
This example discus the effect of the 

reduction of the error in a single neuron from 
the hidden layer in a neural network that has 54 
nodes in input layer (M=6), 30 nodes in hidden
layer and 10 nodes in output layer and use 8-bit 
data bus (k=8) for all nodes. The error of the 
output in table (1) is calculated with respect to 
the maximum amplitude of the input and output 
(maximum value equal to 1). The original error 
is calculated from section (3) and the other 
values of the error is calculated from the steps 
of section (4) using M=6 and k=8.

The above ANN is simulated using C++ 
program to represent an ANN that is used to 
detect the ten digits. The learning of the 
original software simulation (without any 
truncation) gives an error 3.7%. The reuse of 
the original software simulation with additional 
function to represent the effect of the truncation 
as in section (5) will give an error 14.2%. The 
modification of the truncation function to 
represent the effect of the steps in section (4) 
will give the results in table (2). Table (2) 

shows the practical results of the simulation of 
the ANN in the example, this table shows that 
the truncation increases the total error from 
3.7% to about 14.2%, while the proposed 
method will reduce this to 5.9%. The cost of the 
neuron in table (2) is calculated by the classic 
design of adder and multiplier in [7], while the 
additional cost for the steps 1, 2 and 3 must be 
calculated from the specific design with routes 
of the reference [7].

Table (1): The error of the output calculates with respect to the maximum amplitude. 

Error Range *1000 Average *1000 MSE *10-6

Original 0 to 27.34 13.67 93.46 
Step1 0 to 7.81 3.9 30.52

Step1 with 
step 2

-3.9  to 5.86 1.953 24.79

Step1 with 
step 3

-3.9  to 5.86 1.95 24.79

All steps -3.9  to 3.9 0 15.25

This page was created using Nitro PDF trial software.

To purchase, go to http://www.nitropdf.com/

http://www.nitropdf.com/


         Dr. Dhafer r. Zaghar /Al-khwarizmi Engineering Journal, Vol.3, No.2 PP1-7 (2007)

7. Conclusion
The implementation of ANNs in 

hardware has been proven to be beneficial for 
intelligent systems requiring fast computation 
of data. But the bounded width of the data in 
the hardware implementation will generate an 
error in the output of the neuron, this error will 
add to the result of the output as additional 
error. The bounded width of the data will 
generate an additional error value. The 
additional error value will fall in the range 10-4

per single process (addition) but it will 
represent an accumulated additional error in the 
result reach to about 10%.   

The proposed method reduces the effect 
of the accumulated error depend on rounding 
the truncated values to give an optimal result in 
the output. This process will reduce the 
accumulated additional error value from about 
14% to about 6%. However, it will increase the 
efficiency of the output 8% with a small 
additional cost less than 6%.

8. References

[1] Christodoulou, C., S. Michaelides, C. 
Pattichis, and K. Kyriakou, “Classification of 
Satellite Clouds Imagery Based on Multi-
feature Texture Analysis and Neural 
Networks,” IEEE International Conference on 
Image Processing, V.1. 2001, p 497-500.
[2] Soren, K. R., “Hidden Neural Networks: 
Application to Speech Recognition,” Neural 
Computation V.11 1999, p. 54.
[3] Jayaraman, R., “Physical design for 
FPGAs,” Proceedings of the 2001International 

Symposium on Physical Design, Sonoma, CA, p 
214-221.
[4] Dimond, K., and K. Pang, “Mapping VHDL 
descriptions of digital systems to FPGAs,” IEE 
Colloquium Digest: Computing and Control 
Division Colloquium on Software Support and 
Cad Techniques for FPGAS (Field 
Programmable Gate Arrays), n 094, p 9/1-9/3, 
1994.
[5] Contreras, G, and Nava, P, “Design, 
Implementation and Testing of an FPGA-based 
Neuro-Coprocessor”, Dept. of Electrical and 
Computer Engineering, The University of 
Texas at El Paso 500 W, 2002. 
[6] Advance Product Specification, “Virtex-E 
1.8 V Field Programmable Gate Arrays”, 
DS022 (v1.0) December 7, 1999.
[7] Hikawa, H. “Implementation of Simplified 
Multilayer Neural Network with On-chip 
Learning,” Proc.of the IEEE International 
Conference on Neural Networks (Part 4), Vol. 
4, 1999, pp 1633-1637.
[8] Schelin, C.W., “Calculator Function 
Approximation”, Am. Math. Monthly, vol.90, 
1983.
[9] Max van Daalen, Peter Jevons, and John 
Shawe Taylor. A stochastic neural architecture 
that exploits dynamically recon_gurable 
FPGAs. Proceedings IEEE Workshop on 
FPGAs for Custom Computing Machines, 
pages 202{211, April 1993.
[10] Oudjida, A.K., “High Speed and Very 
Compact Two’s Complement Serial/Parallel 
Multipliers Using Xilinx’s FPGA ”, 
CDTA/Microelectronics laboratory 128 
Chemin Mohamed Gacem El-Madania 16075, 
Algiers,Algeria,1996.

Table (2): The simulation results of the error and the FPGA cost of the network

Method Average of the 
percentage error

FPGA Cost
  Cell / Neuron

software 3.7 ------
After truncation 14.2 14880

Step1 9.6 15036
Step1 with step 2 8.3 15792
Step1 with step 3 8.4 15048

All steps 5.9 15804

This page was created using Nitro PDF trial software.

To purchase, go to http://www.nitropdf.com/

http://www.nitropdf.com/


         Dr. Dhafer r. Zaghar /Al-khwarizmi Engineering Journal, Vol.3, No.2 PP1-7 (2007)

   
  الخطأ لبناء الشبكة العصبیة مستوىتقلیل 

ظافر رافع زغیر. د 

كــلیة الھــندسـة  /و البرامجیات  قـسم ھـندسة الحـاسبـات
الجامـعة المستنصریة 

  
  - : الخالصة
باستخدام المكونات المادیة یكسبھا سرعة عالیة مقارنھ ) ANNs(ان عملیة بناء الشبكات العصبیة الذكیة 

ات التي تنفذ على معالج احادي مایكروي و ذلك بسبب كون البناء باستخدام المكونات المادیة یعتمد على المعالجة بالبرامجی
و ) FPGA(ان واحدة من احدث طرق البناء المادي المستخدمھ ھي مصفوفة البوابات الواسعة القابلة للبرمجة . المتوازیة

محددات البناء باستخدام المكونات المادیة ھي كون ناقل البیانات محدد بسعة ان من . التي تتمیز بالمرونة و السرعة العالیة
سیقوم ھذا البحث باشتقاق المعادالت التي تمثل نسبة . اضافة نسبة خطاء الى النتائج النھائیة معینة ثابتھ و ھذا التقیید یسبب

قلیلة للحصول على نسبة خطاء قلیلة مع كلفة غیر  الخطاء االضافي و تقترح طریقة مناسبھ لتقلیل ھذا الخطاء و بزیادة كلفة
  . عالیة

This page was created using Nitro PDF trial software.

To purchase, go to http://www.nitropdf.com/

http://www.nitropdf.com/