Memory Hierarchy Behavior Study during the Execution of Recursive Linear Algebra Library

I. Šimeček

For good performance of every computer program, good cache and TLB utilization is crucial. In numerical linear algebra libraries (such as BLAS or LAPACK), good cache utilization is achieved by explicit loop restructuring (mainly loop blocking), but this requires a difficult analysis of the memory access pattern. In this paper, we present the recursive ("divide and conquer") implementation of some routines from numerical linear algebra libraries. This implementation leads to good cache and TLB utilization with no need to analyze the memory access pattern, owing to the "natural" partitioning of the data.

Keywords: numerical linear algebra, code restructuring, loop unrolling, recursive implementation, memory hierarchy utilization.

1 Introduction

In the following text, we assume that A, B and C are real matrices of order n. Every modern CPU architecture uses a complex memory hierarchy scheme (for details see [6]); we assume the scheme depicted in Fig. 1. In contrast to other authors, we also take the performance impact of the TLB into account.

Fig. 1: The memory hierarchy scheme

1.1 The cache model

The cache model we consider corresponds to the structure of the L1 and L2 (and optionally L3) caches in the Intel x86 architecture. An s-way set-associative cache consists of h sets, and one set consists of s independent blocks (called lines in the Intel terminology). Let C_S denote the size of the data part of the cache in bytes and B_S the cache block size in bytes. Then $C_S = s \cdot B_S \cdot h$. Let S_D denote the size of type double and S_I the size of type integer.

Fig. 2: Parameters of the cache model

We consider only two types of cache misses:

• Compulsory misses (sometimes called intrinsic or cold misses), which occur when new data is loaded into empty cache blocks.
• Thrashing misses (also called cross-interference, conflict, or capacity misses), which occur when useful data is replaced prematurely from a cache block for capacity reasons.
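To make the cache geometry concrete, the following C fragment (an illustrative sketch, not part of the measured codes) derives the number of sets from C_S, B_S and s and computes the set index of a byte address under the usual modulo mapping; the constants are those of the L1 data cache described later in Section 4.1.1.

#include <stdint.h>
#include <stdio.h>

/* Parameters of the measured L1 data cache (see Section 4.1.1):
 * an s-way set-associative cache with h sets of B_S-byte blocks,
 * so that C_S = s * B_S * h.                                      */
enum { BS = 64, CS = 32 * 1024, S_WAYS = 8, H = CS / (S_WAYS * BS) };  /* H = 64 sets */

/* Set index of a byte address under the usual modulo mapping. */
static unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / BS) % H);
}

int main(void)
{
    static double a[1024];
    /* a[0] and a[512] are h*B_S = 4096 bytes apart, so they map to the same
     * set; more than s such conflicting blocks cause thrashing misses.      */
    printf("number of sets h = %d\n", H);
    printf("set of a[0]   = %u\n", set_index((uintptr_t)&a[0]));
    printf("set of a[512] = %u\n", set_index((uintptr_t)&a[512]));
    return 0;
}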
1.2 The code restructuring

There are many compiler techniques for improving performance (for details see [8, 9]). These compiler techniques are used in programs to maximize the cache hit ratio and to improve ALU and FPU utilization.

2 Linear codes

2.1 Existing LA libraries and their possible drawbacks

There are many libraries for linear algebra (LA), such as BLAS [5], LAPACK [2], Intel MKL [1], etc. They are tuned for a given architecture, but they also have some drawbacks:

• The whole library must be included in the final code, which can significantly increase the code size.
• Licence problems can occur.
• Some formats for storing sparse matrices are not supported.
• Operations are "black boxes", so the user cannot manage the inner solution process.
• A combination of two or more operations with the same operands is solved separately (i.e. inefficiently).

2.2 Linear code optimization

There are many routines for linear algebra. For demonstration purposes, we choose the following:

1. matrix-matrix multiplication (GEMM in BLAS notation),
2. symmetric matrix-matrix multiplication (SYRK in BLAS notation),
3. backward substitution (TRSM in BLAS notation),
4. Cholesky factorization (POTRF in LAPACK notation).

The standard code of these routines achieves good performance, due to a high cache hit ratio, only for small matrix orders. When good performance is required for larger matrices, modifications must be made. In numerical algebra packages, this is achieved by explicit loop restructuring. This includes loop unroll-and-jam, which increases FPU pipeline utilization in the innermost loop, and loop blocking (which is why we refer to these as blocked codes) and loop interchange, which maximize the cache hit ratio. After applying these transformations, the codes are divided into two parts: the outer loops are "out-cache", and the inner loops are "in-cache". Such codes have almost the same performance irrespective of the amount of data.
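For illustration, a blocked GEMM of this kind can be sketched in C as follows (a simplified sketch with a single, assumed tile size T, not the exact code used for the measurements). The three outer loops walk over tiles ("out-cache"), while the three inner loops work on tiles that fit into the cache ("in-cache").

/* Blocked matrix-matrix multiplication C += A*B for square matrices of
 * order n stored row-major; for simplicity n is assumed to be a multiple
 * of the tile size T.  T = 32 is an assumed value chosen so that three
 * T x T tiles of doubles (3 * 32 * 32 * 8 = 24 kB) fit into a 32 kB L1.  */
#define T 32

void gemm_blocked(const double *A, const double *B, double *C, int n)
{
    for (int i0 = 0; i0 < n; i0 += T)            /* "out-cache" loops over tiles */
        for (int j0 = 0; j0 < n; j0 += T)
            for (int k0 = 0; k0 < n; k0 += T)
                for (int i = i0; i < i0 + T; i++)        /* "in-cache" loops on  */
                    for (int k = k0; k < k0 + T; k++) {  /* the three tiles      */
                        double a = A[i * n + k];
                        for (int j = j0; j < j0 + T; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

Unroll-and-jam would additionally unroll the i (or j) loop and fuse the resulting copies of the innermost loop, so that several independent multiply-add chains keep the FPU pipeline busy.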
Good cache utilization can also be achieved by the "divide-and-conquer" approach. These codes, which use recursive-style formulations, are referred to as recursive-based.

2.3 Comparison of two approaches for the design of numerical LA routines

2.3.1 Blocked code programming

Effective implementation in the blocked code programming style requires the following steps:

1. Straightforward implementation.
2. Design a coarse cache model of this code.
3. Apply source code transformations (according to the architecture parameters).
4. Refine the cache model.
5. Derive the optimal values of the program thresholds from the cache parameters.

2.3.2 Recursive code programming

The motivating idea of recursive codes is to divide the matrices into disjoint submatrices of the same size (see [3, 4]). The resulting code has much better spatial and temporal locality. Effective implementation in the recursive code programming style requires the following steps:

1. Straightforward implementation.
2. Apply source code transformations (according to the architecture parameters).
3. Express the routine(s) in the recursive style.
4. Measure the threshold value(s).

2.4 Discussion

The main differences between blocked and recursive-based codes are:

• Cache analysis of some LA routines is very difficult, much more difficult than the recursive formulation of these routines.
• For optimal performance of blocked codes, the program may need different in-L1-cache loops and in-L2-cache loops (and in-L3-cache loops, respectively). In recursive-based codes this is done automatically by the partitioning of the data.
• Blocked codes lead to even better cache utilization, because the data is optimally divided to fit into the cache. In recursive codes some part of the cache memory remains unused.
• The recursive style of expression is very easy to understand and results in less error-prone codes.
• Recursive-based codes have an additional stack management overhead in comparison to blocked codes.

3 Detailed overview of the matrix-matrix multiplication procedure

3.1 Matrix-matrix multiplication

We consider input real matrices A (of size i × k), B (k × j) and C (i × j). A standard sequential pseudocode for matrix-matrix multiplication C = A · B (abbreviated as GEMM, assuming the ordering of the loops: i, j, k) is as follows:

Pseudocode GEMM_STD
(* Standard matrix-matrix multiplication *)
for i ← 1 to n do
  for j ← 1 to n do
    sum ← 0
    for k ← 1 to n do
      sum ← sum + A[i][k] · B[k][j]
    C[i][j] ← sum

3.2 Recursion in matrix-matrix multiplication

Let us write C(i) if the final result (matrix C) is computed by recursion on the variable i (and analogously C(j) and C(k)). Consider the partitioning of the matrices A, B and C as follows; for better illustration, the partitioning is depicted in Fig. 3.

Fig. 3: Recursion in GEMM

\[
C(i):\quad
\begin{pmatrix} C_1 \\ C_2 \end{pmatrix}
= \begin{pmatrix} A_1 \\ A_2 \end{pmatrix} B
= \begin{pmatrix} A_1 B \\ A_2 B \end{pmatrix}
= \begin{pmatrix} \mathrm{gemm}(A_1, B) \\ \mathrm{gemm}(A_2, B) \end{pmatrix}
\]

\[
C(j):\quad
\begin{pmatrix} C_1 & C_2 \end{pmatrix}
= A \begin{pmatrix} B_1 & B_2 \end{pmatrix}
= \begin{pmatrix} A B_1 & A B_2 \end{pmatrix}
= \begin{pmatrix} \mathrm{gemm}(A, B_1) & \mathrm{gemm}(A, B_2) \end{pmatrix}
\]

\[
C(k):\quad
C = \begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}
= A_1 B_1 + A_2 B_2
= \mathrm{gemm}(A_1, B_1) + \mathrm{gemm}(A_2, B_2)
\]

In C(i) the matrices A and C are split into two row blocks, in C(j) the matrices B and C are split into two column blocks, and in C(k) the matrix A is split into column blocks and B into row blocks, so that the two partial products are added.

Pseudocode GEMM_REC (A, B, C, i, j, k)
(* Recursive matrix-matrix multiplication *)
(* limit is an architecture-dependent constant *)
n ← min(i, j, k)
if n ≤ limit then
  C ← GEMM_STD(A, B)
else
  switch which of i, j, k is the maximum
    case i: (* recursion by the variable i *)
      C1 ← GEMM_REC(A1, B); C2 ← GEMM_REC(A2, B)
    case j: (* recursion by the variable j *)
      C1 ← GEMM_REC(A, B1); C2 ← GEMM_REC(A, B2)
    case k: (* recursion by the variable k *)
      C ← GEMM_REC(A1, B1) + GEMM_REC(A2, B2)
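For concreteness, the recursion above can be realized in C roughly as follows. This is only an illustrative sketch: the function and parameter names, the row-major layout with a common leading dimension ld, and the value of LIMIT are our assumptions, not the paper's actual implementation. C must be zero-initialized by the caller, because the recursion by k accumulates the two partial products into it.

/* Illustrative recursive GEMM: C += A * B, where A is i x k, B is k x j and
 * C is i x j, all stored row-major with a common leading dimension ld.
 * LIMIT plays the role of the architecture-dependent threshold "limit".    */
#define LIMIT 64   /* assumed value */

static void gemm_std(const double *A, const double *B, double *C,
                     int i, int j, int k, int ld)
{
    for (int ii = 0; ii < i; ii++)
        for (int jj = 0; jj < j; jj++) {
            double sum = 0.0;
            for (int kk = 0; kk < k; kk++)
                sum += A[ii * ld + kk] * B[kk * ld + jj];
            C[ii * ld + jj] += sum;
        }
}

static void gemm_rec(const double *A, const double *B, double *C,
                     int i, int j, int k, int ld)
{
    int n = i < j ? (i < k ? i : k) : (j < k ? j : k);   /* n = min(i, j, k)   */
    if (n <= LIMIT) {
        gemm_std(A, B, C, i, j, k, ld);                  /* small: no recursion */
    } else if (i >= j && i >= k) {        /* recursion by i: split A, C by rows */
        gemm_rec(A,                B, C,                i / 2,     j, k, ld);
        gemm_rec(A + (i / 2) * ld, B, C + (i / 2) * ld, i - i / 2, j, k, ld);
    } else if (j >= i && j >= k) {     /* recursion by j: split B, C by columns */
        gemm_rec(A, B,         C,         i, j / 2,     k, ld);
        gemm_rec(A, B + j / 2, C + j / 2, i, j - j / 2, k, ld);
    } else {           /* recursion by k: split A by columns, B by rows, sum up */
        gemm_rec(A,         B,                C, i, j, k / 2,     ld);
        gemm_rec(A + k / 2, B + (k / 2) * ld, C, i, j, k - k / 2, ld);
    }
}

Since the largest dimension is halved at every level, for (nearly) square matrices the leaf calls operate on blocks of order at most roughly 2·LIMIT, which is how the recursion keeps the working set small without any explicit blocking.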
3.3 Compulsory misses

During the execution of the GEMM(A, B, C, i, j, k) routine, all elements of the arrays A, B and C must be loaded into or written to the cache. So, the number of compulsory misses is $N = \frac{S_D}{B_S}(ij + ik + jk)$; for square matrices of order n this equals $3 n^2 S_D / B_S$. A similar idea is also valid for TLB misses.

4 Evaluation of the results

4.1 Testing configuration

4.1.1 HW configuration

All results were measured on an Intel Celeron M 420 at 1.6 GHz with 2 GB of RAM at 333 MHz and the following cache parameters: the L1 cache is a data cache with B_S = 64, C_S = 32 kB, s = 8, h = 64 and an LRU replacement strategy; the L2 cache is a unified cache with B_S = 64, C_S = 1 MB, s = 8, h = 2048 and an LRU replacement strategy.

4.1.2 SW configuration

We used the following SW:
• OS Windows XP Professional SP3,
• Microsoft Visual Studio 2003,
• Intel compiler version 9.0 with the switches /O3 /Og /Oa /Oy /Ot /Qpc64 /QxB /Qipo /Qsfalign16 /Zp16,
• Intel MKL library 8.1.

4.1.3 Test data

We measured instances with square matrices of order n ranging from 10 to 1000 (up to 800 for the cache analysis) with a step of 10.

4.1.4 Methodology of the measurement

• All cache events were monitored by the Cache Analyzer (CA) [7].
• For measuring the TLB miss ratio, we also used the CA with the following parameters: h = 1, s = 128, B_S = 4096. This means that we model the TLB as a fully associative cache with a block size equal to the system page size.
• For performance measurements, the average of five measured values was taken as the result.
• The caches were flushed before each measurement. In real measurements they were flushed by reading a given amount of useless data; in measurements using the CA they were flushed by a CA command.
• The code size of the innermost loops is negligible in comparison to the size of the L2 cache, so we assume that the unified L2 cache is used only for data.

4.2 Experimental results

4.2.1 L1 cache miss ratio

A comparison of the L1 cache miss rate for three different variants of the GEMM procedure is illustrated in Fig. 4. The peaks in this graph are caused by a resonance between the matrix element mapping function and the cache memory mapping function. Both unrolled variants achieve better L1 utilization.

Fig. 4: L1 cache miss rate for different variants of the GEMM procedure

4.2.2 L2 cache miss ratio

A comparison of the L2 cache miss rate for three different variants of the GEMM procedure is illustrated in Fig. 5. We can conclude that when the size of the operands exceeds the data cache size (for n > 350), the L2 cache utilization drops significantly. Both unrolled variants achieve better L2 utilization.

Fig. 5: L2 cache miss rate for different variants of the GEMM procedure

4.2.3 Choosing the optimal value of the threshold

A comparison of cache utilization for three different threshold values is depicted in Figs. 6 and 7. If the threshold is too large, the L1 cache miss ratio increases rapidly. The L2 cache miss ratio remains the same across the whole measured set.

Fig. 6: Dependence of L1 cache utilization on the threshold value
Fig. 7: Dependence of L2 cache utilization on the threshold value

Comparisons of cache miss rates for three different variants of the GEMM procedure are illustrated in Figs. 8 and 9. The recursive variant of the GEMM procedure achieves much better cache utilization across the whole measured set. The peaks in these graphs are caused by a resonance between the matrix element mapping function and the cache memory mapping function.

Fig. 8: Comparison between unrolled and recursive variants
Fig. 9: Comparison between unrolled and recursive variants

4.2.4 TLB cache miss ratio

A comparison of TLB utilization for three different threshold values is depicted in Fig. 10. With the exception of one case (caused by a resonance of the matrix element mapping function for n = 510 with the TLB memory mapping function), the number of TLB misses is very low.

Fig. 10: Dependence of TLB utilization on the threshold value

A comparison of the TLB miss rates for three different variants of the GEMM procedure is illustrated in Fig. 11. The recursive variant of the GEMM procedure achieves much better TLB utilization for larger matrices (n > 250).

Fig. 11: Comparison between unrolled and recursive variants

4.2.5 Performance of GEMM

The performance is computed as
\[
\text{Performance [MFLOPS]} = \frac{\text{number of FPU operations}}{\text{execution time [s]} \cdot 10^6},
\]
and, since GEMM performs n^3 multiplications and n^3 additions,
\[
\text{Performance}_{\mathrm{GEMM}}\ \text{[MFLOPS]} = \frac{2 n^3}{\text{execution time [s]} \cdot 10^6}.
\]
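As a worked example of this formula (illustrative numbers, not measured values): for n = 1000, GEMM performs 2n^3 = 2·10^9 floating-point operations, so an execution time of 1 s corresponds to 2000 MFLOPS and an execution time of 0.5 s to 4000 MFLOPS.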
We compare the performance of five different variants:
• the standard variant (GEMM_STD),
• two unrolled variants,
• the recursive variant,
• the vendor implementation (Intel MKL library).

The resulting performance of the different versions of the GEMM procedure is illustrated in Fig. 12. We can conclude that:
• the standard variant has very low performance;
• the two unrolled variants have reasonable performance until the sizes of the operands exceed the cache size; then the performance suffers from low memory hierarchy utilization;
• the recursive variant achieves very good performance irrespective of the matrix size; it was only slightly slower than the vendor implementation.

Fig. 12: Performance of different versions of the GEMM procedure in double precision

5 Conclusions

In this paper, we have described the implementation of some important routines from LA using a recursive approach, concentrating on the matrix-matrix multiplication routine. All tested routines achieve about 80–90 % of the performance of the vendor MKL library. We can conclude that recursive codes achieve very good performance due to effective memory hierarchy utilization. Unlike other (more complicated) methods, no difficult analysis of the memory access pattern is needed.

Acknowledgement

This research has been supported by MŠMT under research program MSM6840770014.

References

[1] Intel Math Kernel Library 10.1 – Overview.
[2] LAPACK – Linear Algebra PACKage.
[3] Carr, S., Kennedy, K.: Compiler Blockability of Numerical Algorithms. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, IEEE Computer Society Press, 1992, p. 114–124.
[4] Carr, S., Lehoucq, R. B.: A Compiler-Blockable Algorithm for QR Decomposition. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February 1995.
[5] Dongarra, J. J., Du Croz, J., Hammarling, S., Duff, I.: A Set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, Vol. 16 (1990), No. 1, p. 1–17.
[6] Hennessy, J. L., Patterson, D. A.: Computer Architecture: A Quantitative Approach, Fourth Edition. Morgan Kaufmann, 2006.
[7] Tvrdík, P., Šimeček, I.: Software Cache Analyzer. In Proceedings of CTU Workshop, Vol. 9, p. 180–181, Prague, Czech Republic, Mar. 2005.
[8] Wadleigh, K. R., Crawford, I. L.: Software Optimization for High Performance Computing. Hewlett-Packard Professional Books, 2000.
[9] Wolfe, M.: High-Performance Compilers for Parallel Computing. Addison-Wesley, Reading, Massachusetts, USA, 1995.

Ing. Ivan Šimeček
phone: +420 224 357 268
e-mail: xsimecek@fel.cvut.cz
Department of Computer Science and Engineering
Czech Technical University in Prague
Karlovo nám. 13
121 25 Prague 2, Czech Republic