Microcontroller-Based Function Generator


Al-Khwarizmi 

Engineering   

Journal 

Al-Khwarizmi Engineering Journal,  Vol. 5, No. 1, PP 83-93 (2009) 

 
Multidimensional Systolic Arrays of LMS Algorithm 

Adaptive (FIR) Digital Filters 
 

Riyadh A.H. AL-Helali *   and  Bakir A. R. Al-Hashemy** 
* Department of Electrical Engineering / AL-Mustansiriyah University 

** Department of Electrical Engineering / Baghdad University 

 
(Received 12 June 2006; accepted 6 April 2009) 
 

Abstract 
 

 A multidimensional systolic arrays realization of LMS algorithm by a method of mapping regular algorithm onto 

processor array, are designed. They are based on appropriately selected 1-D systolic array filter that depends on the 

inner product sum systolic implementation. Various arrays may be derived that exhibit a regular arrangement of the 
cells (processors) and local interconnection pattern, which are important for VLSI implementation. It reduces latency 

time and increases the throughput rate in comparison to classical 1-D systolic arrays. The 3-D multilayered array 

consists of 2-D layers, which are connected with each other only by edges. Such arrays for LMS-based adaptive (FIR) 

filter may be opposed the fundamental requirements of fast convergence rate in most adaptive filter applications.         

                                                                                                               
Keywords: Adaptive signal processing; LMS algorithm; VLSI signal processing.  
 

1. Introduction 
                                                                          

 High-speed multidimensional digital filtering 
is very useful for real-time video signal 

processing such as video image coding, 

bandwidth compression, sampling rate conversion 
and the enhancement of televisions signals [1].  

 As a first step toward the          

implementation of an algorithm onto a processor 
array, it is common practice to go through a 

number of refinements (regularization, single 

assignment form….) that make the algorithm 

more suitable to a simple and modular VLSI 
design. The usual assumption is that the algorithm 

can be expressed by a system of recurrence 

equation [2]. 
Many emerging applications of specific systems 

characterized by many features which are the 

problem size begins to grow so as to guarantee 

better quantity of solutions and the problems must 
be implemented in real-time. These features 

demand high-performance computers. The 

algorithms, used by these applications, are 
typically highly parallel consisting of a huge 

number of simple and regular computations. 

These applications emphasis different demands on 
the processor design than the classical numerical 

and necessitate finding the new architectural 

solutions for processors [3].  
 Systolic arrays have been one of the more 

interesting paradigms for reconfigurable 

computing in recent times because the design 
process creates an excellent paradigm for 

implementing algorithms via space and time 

transformations [4]. 
 Real-time systolic implementations of 

multidimensional recursive digital filters have 

been discussed in prior work [5]-[8]. However, 

prior work has concentrated either on minimizing 
the critical path (that is, the unlatched signal path 

that requires the longest processing time) [5], [6] 

or on achieving the modularity or regularity of 
filter structures [7]-[8]. Thus the previously 

proposed filter structures do not satisfy the 

practical requirement of simultaneously having a 

critical path containing no more than one 
multiplier and one adder and the desired 

modularity and regularity. Moreover, they do not 

provide the property of local interconnectivity, 
which is essential in deep submicron VLSI 

implementations. In [1], proposed 1-D ladder 

filter structure based on which introduced a 
multilevel approach to derive multidimensional 


Riyadh A.H. AL-Helali                                           Al-Khwarizmi Engineering Journal, Vol. 5, No. 1, PP 83-93 (2009) 
 

84 

ladder filter structures that are regular, modular 

and locally interconnected on the one hand, and 
yet have the shortest critical path of one multiplier 

plus one adder and the canonic number of storage 

registers in the higher dimensions (e.g., image line 
and frame registers).  

 Dedicated VLSI processors/chips are not 

always suitable for this purpose due to the need to 

provide flexible devices that can be adjusted to 
the new problems and problem sizes [9]. Thus, the 

processors must be reprogrammable or 

reconfigurable. Iso-plane method may be 
introduced [4]. In this case the problem is 

represented by iso-planes instead of a dependence 

graph in the classical case. Iso-planes describe 
shared variables and reduction operators. To 

extract more parallelism from a problem 

specification, the idea of increasing the 

dimensionality of problem representations is 
introduced. This is based on tilling and expanding 

of the index space [2].  

 Increasing the degree of parallelism demands 
an increase in the number of processor elements 

(PEs) to be employed. This is possible only either 

by increasing the dimensionality or the dimension 

of the array.  Increasing the dimensionality of an 
array can result in a more efficient parallel 

algorithm implementation (parallel computing).      

This paper demonstrates how to extend existing 
designing methods for systolic arrays [11, 12, 13] 

and benefits from the approach in [4] to 

synthesize multilayered structures. This approach 
is demonstrated on the basis of 1-D convolution 

(inner product sum) of the output, which gives 

flexibility in the design procedure for extracting 

more than one array. Systolic arrays for 1-D 
LMS-based adaptive (FIR) filter that decreases 

the latency has the area complexity O(n) while the 

latency time complexity is O(n
1/2

). Arrays with 
the increased throughput rate have the area 

complexity O(sn) , 1 ≤ s < N , and the time 

complexity is O(N/s), where n and N  are sizes of 
two sequences of the inner product sum output.  

 This decreasing latency time and increasing 

throughput rate limits the inherent limitations of 

LMS algorithm, which necessitates a compromise 
between the opposing fundamental requirements 

of fast convergence rate and small maladjustment 

for high area complexity.          
 In section 2, a mapping of regular algorithms 

onto multilayered 3-D reconfigurable processor 

array is described to design the systolic arrays, 

that it achieves the optimum local interconnection 
pattern and the highest speed. The complete 2-D 

and 3-D design of systolic LMS-based adaptive 

(FIR) digital filter is given in section 3. In section 

4, we discuss the throughput and processing rate 

of the system. A comparison is made between the 
LMS systolic array and the serial design 

employing fast convolution and FFT as regards 

throughput. Finally, we present conclusions in 
section 5. 

 
2. Mapping Procedure 
 

 Regular array design means that the order of 
computations of reduction operators and the order 

of spreading shared variables must be chosen [4]. 

Imposing a partial order onto iso-planes will do 

this. Different partial orders can develop to 
improve the array design that affects on the 

structure and time parameters, such as latency and 

throughput rate.  
 Imposing a partial order onto an iso-set 

produces dependence vectors   y = x +  , x,y  

  ISO
v
(x) . Therefore, the affine equations are 

transformed into uniform recurrences. To offer a 

regular array design, the number of dependence 
vectors must be as small as possible. 

 The corresponding regular array can be 

derived by an affine space-time mapping [2]. 

                                                                            
IS` = T(IS) +                                                …(1) 
 

Where IS  
rz and IS`  rz  are the source and 

the target polytopes, respectively, T   rrz    is 

a transformation matrix and =
rz is an offset 

vector. 

 Space-time mapping transforms the source 

polytope IS into a target polytyope, IS`, that 

contains the same points but in a new coordinate 

system in space and time: every x  IS is mapped 
to the possible time step: 
 

t(x) = x +                                                    …(2) 
 

 And to the possible PE: 
 

a(x) = x +                                                  …(3) 
  

Where   
rz  is a time schedule and   

r)1r(
z


 is allocation,   Z,   

1rz   .  
and  are components of the transformation [4] 
 
 













T                                                       … (4) 

 
Riyadh A.H. AL-Helali                                           Al-Khwarizmi Engineering Journal, Vol. 5, No. 1, PP 83-93 (2009) 
 

85 

To be a valid mapping, transformation T must 

satisfy certain constraints: 
 

1. The time schedule  must be consistent with 

the dependence vector , i.e.    0.  
2. The second constraint excludes the conflict 

that two distinct computations would be 
scheduled for the same processor at the same 

time, i.e. transformation T must be bijective. 

And thus  T   0. If the transformation is 

unimodular,   i.e.  T   1, then the resulting 
array uses each time step efficiently, and, 

thus, this design is usually preferred. In this 

case, when  T   k, k  1, each processor 

works only in every kth time steps, thus less 
efficiency.  

 The channels between processors can be 

computed from dependence vectors and 

transformation. Each dependence vector  

produces a channel l: 
  

l = T                                                             …(5)  
 

With the direction: 

 
ldir() =                                                        …(6) 
 

And the delay: 

  
ldel() =                                                   …(7)   
 
 
3. Systolic Design of LMS Algorithm 
Using ISO-plane Method 
 

 The systematic design of this method gives 
various (1-D) and (2-D) arrays with modified 

property comparison to classical method as 

follows:  

 
3.1 (1-D) Classical Systolic Array of LMS 

Algorithm 
   

 1-D convolution sum of the inner product of 

the tap-weight vector and the tap-input vector, as 
represented as [10];  

                                
(a) 

 
(b) 
 

…u2u1u0   

              … 
 
q 

m 

ŵ1 

 
 
q 

m 
ŵ2 

 
 
q 
m 

ŵ3 

 
 
q 

m 

ŵ9 

 
….d2 d1 d0 

. y2 y1y0 

e 

_ 

 
+ 

 
q 
m 

ŵk 

 
uh uh 
     

ei 

“1” 

ei 

yi 

begin 

  uh:= uh ; ei:= ei ; 

   q := uh;                                                                                                     

  m := qei; 
  ŵk:= ŵk + m   

  yi:= yi + ŵkuh 
end 

                           
Fig. 1.   (a) 1-D systolic array realization for LMS algorithm.  (b) Functional specification of a PE. 

 
Riyadh A.H. AL-Helali                                           Al-Khwarizmi Engineering Journal, Vol. 5, No. 1, PP 83-93 (2009) 
 

86 

1)k-u(n 
M

1k

*
k

w)n(y 



                    …(8) 

 
                  = ŵ

 H 
(n)u (n) 

 
When the superscript H signifies Hermitian 

transposition, and the inner product of the tap-
weight vector ŵ(n) and the tap-input vector  u (n) 

represents the output of the filter, and is specified 

as:    
     

 ( i: 0 ≤ i                                       …(9)                                                                                                                                                                                                                                                                                                                                  

 yi = (∑ j : 1 ≤ j    :   ŵ[ j ]u[i – j + 1])), 
                                                                       

Where ŵ[j] and u[i] are weight and input 

sequences respectively, and yi is an output 

sequence. In practical applications M « N. Two 
shared variables, ŵ[k] and u[i], are used by 

several operators (multiplications) for computing 

different instances of the variable yi by the 
reduction operator ∑. The index space is a 2-D 

polytope:                                       
 

IS = {[i,j]
T
 |  0 ≤ i ≤ N   ≤ j                          

                   … (10)  

 
Each instance, i, of a variable, say vi, is 

represented by an iso-plane ISO
v
 (x), x  IS [4]. 

The respective source polytope with iso-planes is 

depicted for Kernel size M = 9 and N = 15. 
The 2-D index space for 1-D convolution sum of 

the inner product ŵ
H
 (n)u (n) gives the channels 

applied to each point of allocation space, the 
respective allocation direction and the direction of 

channels for variables. Therefore; the required 1-

D LMS systolic array can be derived from the 
inner product algorithm as derived in [14] and 

represented as shown in figure (1).  

                                         
3.1.1 Decreasing the Latency 
                                                                       

It is possible to speed-up the computations using 

recursive doubling that guarantees the logarithmic 
time complexity in the case of associative and 

commutative reduction operators. The same 

approach can be used and decompose the 

reduction operator into several parts that will be 
evaluated in parallel. This gives an increase in the 

degree of parallelism and decrease in the latency 

[4]. Partitioning and expanding the index space 
can be used; such step increases the array 

dimensionality. 

 Partitioning of the range of index, i, is defined 

by a vector:  
                                                                                           

Pi = pj  ≤ j  r                                      ……(11)  
 

Where r is the dimensionality of the index space. 

If j = i, then Pj > 0, otherwise, pj = 0. The vector Pi 
defines the factorization of the upper bound, N, of 

index, i, by setting N = pi si for appropriate natural 

numbers pi and si. 
Depending on the inner product ŵ

H
(n)u(n) of 

LMS algorithm, the 2-D systolic array realization 

for LMS algorithm can be derived. The large 

boxes in Figure (2-a) represent the processor 
elements and its functional specification is in (b) 

of the same Figure. 

 
3.1.2 Increasing the Throughput Rate 
 

 Classical systolic arrays implement LMS 

algorithm sequentially, i.e., they input and output 
data sequentially. This put the limit onto 

throughput rate and time complexity – the linear  

A3 A9 A6 

A2 A8 A5 

A1 A7 A4 

 
F

i

g

u

r

e 
خ
ط
! أ

ال 
ي
و
ج
د 
ن

ص
 
م

ن 
ا

ل
من
ط
 
ا

مل
ع
ي

+         - 

di 

ui ui 

0 

“0” 

0 0 

yi 

u 
 

A 

u'  e  y 

u 
  

e 

Procedure 2-D LMS 

begin 

   uh:= uh ; ei:= ei ; 

    q := uh;                                                                                                                                     
    m:= qei; 

   ŵk:= ŵk + m  

   yi:= yi + ŵkuh 
end 

                                  
Fig. 2. (a) (2-D) systolic array realization of LMS 

algorithm.  (b) Functional specification of a PE. 

 
Riyadh A.H. AL-Helali                                           Al-Khwarizmi Engineering Journal, Vol. 5, No. 1, PP 83-93 (2009) 
 

87 

complexity is the best that can be achieved. To 

improve the time parameters of systolic arrays, 
several data items must be entered into the array 

simultaneously. This is achieved using 

partitioning of the range of input stream, forming  
thus a 2-D input stream instead of a 1-D stream in 

the classical case. 

 The data items must be entered into the array 

in parallel, for each pi data inputs and thus needs a 
new block which converts the serial (M-tap) data 

sequence into the required pi parallel data that 

named serial to parallel converter [15]. The 
systolic array representation of the convolution 

sum, which was derived in [4] and increased the 

throughput rate, is used to derive the 
corresponding LMS array in Figure (3). 
 

3.1.3 Improving the Latency and    
Throughput 

           
 To increase the throughput rate and decrease 

the latency, one can partition the ranges of 

indexes: the ranges of index i that describes the 
reduction operator and the range of index j, which 

described the input stream. The corresponding    

4-D polytope shape has index point: x = [i, j, l, 
k]

T
 and the allocation is taken along the axis k that 

provides regular structures as depicted in [14].    
 Depending on the inner product ŵ

H
(n)u(n) of 

the required adaptive transversal filter with the 

assumed coefficients order, the whole array 
structure of the LMS algorithm can be suggested 

by entering the error signal through the array and 

compute the updated values of the tap-weight 

vector, the filter output is shown in Figure (4).  
 In this improvement, also the data items must 

enter, simultaneously for each pi inputs, into the 

general array structure of the layers that prepare 
the incoming pi inputs data according to certain 

sequence to enter to its own (2-D) array layer and 

then get the outputs from the summation circuits 

assigned to each layer structure. Thus, the whole 
systolic array represents a 3-D systolic array of 

the size 333 for a 1-D computational operations 
as shown in Figure (4). 

 
4.  Throughput and Processing Rate 

 
 Now by using the systolic arrays and applying 

the simulation of data flow, we find that: 

 
4.1 Classical systolic arrays 
                 
 The designed systolic arrays of the inner 

product ŵ
H
(n)u (n) and the LMS algorithm by   

iso-plane method have the same performance of 
quantities of that designed by dependence graph 

method, but gives flexibility of increasing in the 

degree of parallelism and improving the 
algorithmic properties by partitioning and 

expanding the index space as in the previous 

section.  
   

4.2 Systolic array to decrease the latency 
  

 By taking the 2-D semi-systolic array 

realization for the LMS algorithm that designed in 
chapter three in Figure (2). Consider the data flow 

for several consecutive clock periods, applying 

the procedure to be executed by each cell, added 
the intermediate results and getting the output 

signal sample, and then the algorithm gives: 

 
 Number of cells: = M. 

 Running time: clock periods = M. 

 Commutative product-type measures: (M) M. 

 Efficiency and utilization: 100%. 

 Broadcasting of the error signal, this is       
propagated to all cells. 

 The throughput rate of the 2-D LMS array is 
the same as the 1-D or conventional LMS array 

(only for this case). The latency time is reduced 

because the data that passes through the pipe 
for processing, takes short path (less number of 

cells (M
1/2

) than the first pipe that has M cells), 

therefore, the amount of time the data stays in 
the pipe (of M

1/2
 cells) is less than the time that 

the data stays in the pipe (of M cells). The 

overall computation time is:  
 

 To = (2 T + 2 T+)M                                …(12) 
 

Where T & T+ are the necessary times to 
compute one multiply and one add operations. 

  
Riyadh A.H. AL-Helali                                           Al-Khwarizmi Engineering Journal, Vol. 5, No. 1, PP 83-93 (2009) 
 

88 

          
                                          …              …             …               …              …  

                                          d5               d9              d6              d8               d7 

                                          d0               d4              d1              d3               d2 

                                   +               +                +               +                +           

 
                                   -                 -                -                -                 -              

                                           y0               y4              y1              y3               y2      

                                           y5               y9              y6              y8               y7    

                                          …               …             …              …              …  

 
                                                        …     …       …       

                                                        …     …       …     

 
                                         “0”            ”0”            ”0”             ”0”            ”0” 

 
                              u0             u4              u1              u3              u2  

                              u5             u9              u6              u8             u7 

                             …             …              …             …            … 
                                                       (a)                                                                              (b) 
 

Fig. 3.  (a) (2-D) systolic array realization of LMS algorithm. (b) Functional specification of a PE. 

 
A8 
 

A8 
 

A8 
 

A8 
 

A8 

A1 

A3 A3 A3 

A1 A1 

A2 

A3 

A2 

A3 

A2 A2 

A0 

A2 

A1 A1 

A0 A0 A0 A0 

u 
A 

A 

u'  e  y 

u 
  

e 

Procedure 2-D LMS 
begin 

   uh:= uh ; ei:= ei ; 

    q := uh;                                                                                                                                     
    m:= qei; 

   ŵk:= ŵk + m  

   yi:= yi + ŵkuh 
end 

                                  
Riyadh A.H. AL-Helali                                           Al-Khwarizmi Engineering Journal, Vol. 5, No. 1, PP 83-93 (2009) 
 

89 

 
  F
ig

. 4
.   3

-D
 sy

sto
lic

 a
r
r
a
y
 o

f th
e
 siz

e
 3


3


3
 fo

r
 th

e
 L

M
S

 a
lg

o
r
ith

m
 (fo

r
 M

=
9

). (a
) th

e
 g

e
n

e
r
a

l str
u

c
tu

r
e
 a

n
d

 th
e
 in

te
r
c
o
n

n
e
c
tio

n
 p

a
tte

r
n

 o
f la

y
e
r
s.  

(b
) T

h
e
 la

y
e
r
 str

u
c
tu

r
e
 o

f th
e
 2

-D
 p

ie
c
e
w

ise
 r

e
g
u

la
r
 a

r
r
a
y
 o

f L
M

S
 a

lg
o

r
ith

m
. 

 
A
 2

 
A

 8
 

A
 5

 
A
 1

 
A 4 

A
 7

 
A
 0

 
A

 3
 

A
 6

 
+ + + 

d
8

d
5

  d
2

 
d

6

d
3

  d
0

 
d
6

 d
3

  d
0  

 
 +  _y0  
y

3
 y

6
 

d
7

d
4

  d
1

 
 + 
 + 

 + 

 - 
 - 

 - 
y

0  

y
3

 y
6

 
y
1  

y
4

 y
7

 
A
 2

 
A

 2
 

A
 2

 
A
 1

 
A

 1
 

A
 1

 A
 0

 
A

 0
 

A
 0

 
y
2  

y
5

 y
8

 
u
i

 u
i

 u
i

 
u
0  

u
3  

    
u
1  

u
4  

    
u
2  

u
5  

    
0 
0

 
(a
)

 
(b

)
 

Riyadh A.H. AL-Helali                                           Al-Khwarizmi Engineering Journal, Vol. 5, No. 1, PP 83-93 (2009) 
 

90 

4.3 Systolic array to increase the 
throughput rate 
              

 By simulating the data flow that enters the 
2-D semi-systolic array realization of LMS 

algorithm designed of Figure (3), to improve 

the time parameters (throughput rate), five data 
items were entered into the array 

simultaneously, broadcasting to the input of 

the five lower cells. Then by computational 

activities of the simulation, the array gives 
that: 
 

 Number of cells: = (pi)M, or the area 
complexity is O(piM), where pi, represents 

number of the parallel inputs to the array 

(here, pi = 5).  

 Running time: clock periods = M. 

 Commutative product-type measures: (piM) 
M. 

 Efficiency and utilization: 100%. 

 Broadcasting of the pi parallel inputs signal, 
error signal, which are propagated to all 

cells. 

 Throughput rate has been improved, from 
the operation of the array, for each number 

of data outputs (N), there are sets of data 

inputs, which entered simultaneously into 

the array.  
 

 Therefore, the time complexity is O(N/pi). 
The input signal enters the first cell of the 

array and exits from the last cell of the pipe of 

the array. After applying the computational 
activities in the (M) cells, the latency is the 

same as in the classical case: O(M). 

           
4.4  Systolic array for improving the 
latency and throughput 
 

 By combining the last two 2-D LMS 
systolic arrays, a 3-D multilayered LMS 

systolic array was obtained, which improves 

the two above features. The simulations of data 

flow and computational activities for the array 
of Figure (4) give: 
 

 Number of cells: = (pi)M, or the area 

complexity is O(sj
2

i
p ) = O(piM), where 

2

i
p , 

represents number of the parallel inputs to 

the array (here,   pi = 3), (here, sj =3) and (sj 
pi = M).  

 Running time: clock periods = M. 

 Commutative product-type measures:  (piM) M. 

 Efficiency and utilization: 100%. 

 Broadcasting of the pi parallel inputs signal, 
error signal, which are propagated to all cells of 

the 3-D systolic array with size 333 for 1-D 
LMS systolic array.  

 Throughput and latency are improved to 
achieve the requirements of the 3-D 

multilayered array design and equal to O(N/pi), 

O(M
1/2

) respectively.  
 

 From the obtained simulation results, an 

efficient 3-D systolic LMS array algorithm design 
can be selected to represent the preferable systolic 

design. 

 The throughput rate of the 3-D designed 
systolic array is greater than that of the 

architectures [16], [17], but smaller than that of 

the pipelined architecture [18]. In [18], drawbacks 

of broadcasting the input signal, the overall 
complexity are high and its output latency is equal 

to the FIR filter length (i.e., L). 

 Thus our algorithm has single assignment 
form, L-cells, high throughput rate, 100% 

utilization, L
1/2

 latency, L-clock periods for one 

sample block, and finally the flexibility in the 
design, which takes the inner product 

(convolution sum) )()( nunw
H

in the design 

consideration, which enables us to extract more 

than one algorithm for the same problem.  
 

4.1.3 Conclusions 
 

 From the Systolic arrays were designed and 

simulating the data flows entering into the arrays, 
we deduced the following conclusions: 

 Various 2-D and 3-D systolic arrays of 1-D 
LMS-based adaptive (FIR) transversal digital 
filters were designed which can effectively be 

implemented using only local communications 

on a parallel system comprised of 
combinatorial circuits, clocked delay elements 

and distributed memory. 

  The technique described exploits the 
recurrence inherent in the application and is 
based on the convolution sum (inner product           

ŵ
H
(n). u(n)) in the LMS algorithm, which 

gives flexibility of deriving more than one 
systolic algorithm from the basic systolic 

structure. 

  Proposed systolic arrays for 1-D LMS 
algorithm allow an increase in the number of 
processors to increase throughput rate and to 

reduce the latency that is not achievable using 


Riyadh A.H. AL-Helali                                           Al-Khwarizmi Engineering Journal, Vol. 5, No. 1, PP 83-93 (2009) 
 

91 

the conventional systolic array synthesis 

method. 
  The final systolic array of LMS-based 

adaptive transversal filter is simple and 

exhibit increased throughput rate and 
decreased the latency, which is 3-D regular 

systolic algorithm making it amenable to 

VLSI implementation and can be used for a 

large variety recurrence computations in 
signal and image processing. However, the 

output of our efficient adaptive transversal 

filter is delayed by M
1/2 

samples, where M 
is the filter length. 

  Unlike the other structures, the throughput 
is independent of the filter length, implying 
that LMS adaptive FIR filter systolic array 

with several hundreds of filter coefficients 

can be represented by a word-level systolic 

arrays (multibit numbers). A word-level 
systolic array considers the inner product 

step as a typical operation for various 

signal-processing operations. 
 

 The future work on a 2-D LMS adaptive 
nonrecursive (or FIR) digital filters with 

excitation u(n1, n2), response y(n1, n2) and 

order of the pair (M1, M2) to extract a systolic 

array for the algorithm. Similarly by taking the 
2-D inner product step of the weight vector 

and the input vector as a basic systolic 

representation of the algorithm, it can be 
designed it by the dependence graph method. 

The 2-D LMS algorithm can formulate as: 

 
  e(n1 , n2) = d(n1 , n2) – y(n1 , n2)            …(13)                                            

                                              
ŵ(n1+1,n2+2)=ŵ(n1,n2)+e(n1,n2)u(n1,n2) 
…(14)  

 
Where d(n1,n2) is a 2-D desired response, and   

e(n1 , n2) is a 2-D error signal. 
 By using iso-plane method considered, 

partition one of the four indices of the 2-D 

convolution sum to reduce the latency and, by 

the same method try to obtain an (3-D and 4-
D) LMS algorithm which has improved 

throughput and latency features.   

 Systolic representation of the  2-D 
adaptive digital filters serves in many 

important applications such as reduction of 

noise in images, enhancement of edges in 

images, processing of geophysical signals and 
other geological applications [19], since 

continuous signals are encountered that are 

inherently (2-D).      

Abbreviations 
 

Del         Delay 

dir Direction 

FIR Finite-impulse-response 

FFT Fast Fourier Transform 

IPS Inner Product Step 

IS Index Space 

ISO Iso-plane 

LMS   Least-mean square 

PEs Processor Elements 

VLSI     Very-large scale integration 

 
Symbols 
 

A* set of interconnection pattern 

d(n) desired response 

E expected value expected value 

e projection vector 

e(n) error signal vector 

H Hermitian transposition 

M number of taps 

N set of nonnegative integer numbers 

O(  ) the class of all functions 

pi parallel inputs 

T transposition 

t time of execution 

T+ time execution of the addition 

operation 

T time execution of the multiplication 
operation 

u(i) discrete-time series 

u(n) observation vector 

V* set of cells/processors 

w(n) weight-error vector 

ŵ(n) estimate value of the weight-error 

vector 

w* conjugate value of w 

w0 optimum tap-weight vector 

y(n) output signal vector 

Z set of integer numbers 

Z
n
 dependence graph space 

Z
n-1

 processor space 


Riyadh A.H. AL-Helali                                           Al-Khwarizmi Engineering Journal, Vol. 5, No. 1, PP 83-93 (2009) 
 

92 

z* processor location 

d, dependence vectors 

 Offset vector 

 Step-size parameter 

 allocation 

² variance 

 time schedule 

 basis vector basis vector 

               
Acknowledgements 

 
 The authors would like to thank the 
anonymous reviewers for their useful 

comments. 

 
6.  References 

 
[1] L. Xiaojian and T. B. Leonard,  “High-

Speed Systolic Ladder Structures for 
Multidimensional      Recursive Digital 

Filters,” IEEE Trans. Signal Processing, 

vol. 44, no. 4, pp. 1048-1055, Nov. 1996.   

[2] F. Lorenzelli and K. Yao, “A linear 
systolic array for recursive least 

squares,“IEEE Trans. Signal Processing, 

vol. 43, no. 12, pp.3014- 484, Apr. 1992. 
[3] N. Petkov, Systolic Parallel Processing. 

Elsevier Science Publishers, North 

Holland, 1993. 
[4] T. P. Plaks, “Mapping Regular Algorithms 

onto Multilayered 3-D Reconfigurable 

Processor Array,“ IEEE Proceedings of the 

32
nd

 Hawaii International Conference 0n 
System Sciences, pp.1-10, 1999. 

[5] K. M. Ty and A. N. Venestanopoules, “A 
fast filter for  real-time image processing,” 
IEEE Trans. Circuits Syst. , vol. CAS-33, 

pp. 948-957, Oct. 1986. 

[6] R. Gnanasekaran, “2-D filter 
implementation for real-time signal 

processing,” IEEE Trans. Circuits Syst., 

vol. 35, no. 5, pp. 587-590, May. 1988  

[7] W. Luk and G.jones, “Systolic Recursive   
Filters,” IEEE Trans. Circuits Syst., vol. 

35, no. 8 pp. 1067-1068, Aug. 1988.  

[8] S. Sunder and V. Ramachandran, “Systolic   
implementation of multidimensional 

nonrecursive digital filters,” IEEE Trans. 

Circuits Syst. Video Technol., vol. 3, no. 6, 

pp. 399-407, Dec.1993.     

[9] JM. Jover and T. Kailath, “A  parallel 
architecture for Kalman filter measurement 
update and parameter estimation,” Automatica 

vol. 22, no. 1, pp. 43-57, 1986 

[10]  S. Haykin, Adaptive Filter Theory, 
Englewood Cliffs, NJ: Prentice-Hall, 1986. 

[11]  P. Quinton and Y. Robert, Systolic 
Algorithms and Architectures. Prentice Hall, 

Masson, UK, 1991. 
[12]  S. Rao and T. Kailath, “Regular iterative 

algorithms and their implementations on 

processor arrays,” Proc, IEEE, VOL. 76, 
NO.3, PP. 259-282, Mar. 1988. 

[13]   J. Teich and L. Thiele, “Partitioning of 
processor arrays: A piecewise regular 
approach,” INTEGRATION, VOL. 14, 

NO.3, PP. 297-332, 1993.  

[14]   R. A. H. Al-Helali, “Systolic Algorithms of 
LMS, RLS Adaptive (FIR) Digital Filters for 
Adaptive Channel Equalization,” M.Sc. 

Thesis, Univ. of Baghdad, Baghdad, Oct. 

2005. 
[15]   A. V. Oppenheim, and R. W. Schafer, 

Digital Signal Processing. London: Prentice 

Hall, 1975. 

[16]   G. Long, F. Ling, and J .G. Proakis, “The 
LMS algorithm with delayed coefficient 

adaptation,”IEEE Trans. Acoustic., Speech, 

Signal Processing, vol. 37, pp. 1397-1405, 
Sept. 1989; vol. 40, pp. 230-232, Jan. 1992.  

[17]  D. L. Jones, “Learning characteristics of 
transpose-form LMS adaptive filters,” IEEE 
Trans. Circuits Syst. II, vol. 40, pp. 745-749, 

Oct.1992 

[18]  S. C. Douglas, Q. Zhu, and K. F. Smith, “A 
pipelined LMS Adaptive FIR Filter. 
Architecture Without Adaptation Delay,” 

IEEE Trans. Signal Processing, vol. 46, pp. 

775-778, Mar. 1998. 
[19]  A. Antonio, Digital Filters Analysis,   

Design, and Applications McGraw-Hill, 

1993. 
 

 93-83، صفحة 1، العدد 5مجلة الخوارزمي الهندسية المجلد رياض علي عبد الحسين                                                                                            
(2009)  
 

93 

 
انمصفىفاث انمتعذدة األبعاد من انمعانجاث انمتىازيت نخىارزميت أقم مربع انمعذل نهمرشح 

 انرقمي انمكيف راث االستجابت انمحذودة اإلخراج
                   

**باقر عبذانرسىل انهاشمي*              رياض عهي عبذ انحسين انهالني  
 اندايؼح انًستُصشٌح/ كهٍح انُٓذسح  / لسى انُٓذسح انكٓشتائٍح  * 

 خايؼح تغذاد/ كهٍح انُٓذسح / لسى انُٓذسح انكٓشتائٍح ** 

 
 انخالصت

نخٕاسصيٍح يشتغ أدَى يؼذل  (انخالٌا االَمثاضٍح فً خسى اإلَساٌ)تتضًٍ انًمانح تصًٍى يُظٕياخ تشكم يصفٕفاخ يٍ انًؼانداخ   

. إنى يُظٕيح يٍ انًؼانداخ انًتٕاصٌح (انًشتثح)انخٕاسصيٍاخ انشٌاضٍح انًُضًح  (تحٌٕم)نهًششح انشلًً ٔرنك تاستخذاو طشٌمح تتًثٍم 

تألػتًاد ػهى يصفٕفح انًؼانداخ األحادٌح األتؼاد نهًششح انشلًً ٔانزي تذٔسِ ٌؼتًذ فً ػًهّ ػهى يدًٕع حاصم ضشب يصفٕفح 

ٔطشٌمح االستثاط انذاخهً تٍٍ ْزِ  (انخالٌا)تى أشتماق يدًٕػح إَٔاع يٍ انًصفٕفاخ ٔانتً تشكم تشتٍة يُظى يٍ انًؼانداخ , اإلخشاج

إٌ انضيٍ انًستغشق نهحصٕل ػهى لًٍح ٔاحذج لذ . (VLSI)" انخالٌا ٔانتً تدؼهٓا يًٓح االستخذاو فً دٔائش انتكايم انمٍاسٍح انكثٍشج خذا

.  تًصفٕفاخ راخ األتؼاد األحادٌح" أَخفض ٔكزنك يؼذل اإلخشاج لذ اصداد يماسَح

, إٌ انًصفٕفح انثالثٍح األتؼاد ٔانًتؼذدج انطثماخ تتأنف يٍ طثماخ ثُائٍح األتؼاد ٔانتً تشتثظ يغ تؼضٓا انثؼض ػٍ طشٌك انحافاخ فمظ 

إٌ يثم ْزِ انًصفٕفاخ نخٕاسصيٍح يشتغ أدَى يؼذل نهًششح انشلًً ًٌكُٓا تهثٍح االحتٍاخاخ فً تسشٌغ يؼذل االلتشاب نهُاتح انُٓائً فً 

.  يؼظى تطثٍماخ انًششحاخ انشلٍح انًالئًح