Performance Aspects of Sparse Matrix-Vector Multiplication

I. Šimeček

Sparse matrix-vector multiplication (SpM×V for short) is an important building block in algorithms solving sparse systems of linear equations, e.g., in FEM. Due to matrix sparsity, the memory access patterns are irregular and the utilization of the cache can suffer from low spatial or temporal locality. Approaches to improve the performance of SpM×V are based on matrix reordering and register blocking [1, 2], sometimes combined with software pipelining [3]. Due to its overhead, register blocking achieves good speedups only for a large number of executions of SpM×V with the same matrix A. We have investigated the impact of two simple SW transformation techniques (software pipelining and loop unrolling) on the performance of SpM×V, and have compared it with several implementation modifications aimed at reducing computational and memory complexity and improving spatial locality. We investigate the performance gains of these modifications on four CPU platforms.

Keywords: sparse matrix-vector multiplication, code restructuring, loop unrolling, software pipelining, cache hierarchy.

1 Terminology and notation

Consider a sparse n×n matrix A with elements A_ij, i, j = 1, …, n. The largest distance between nonzero elements in any row is the bandwidth of matrix A and is denoted by β_B, i.e.,

   l_i = min { j : A_ij ≠ 0 },
   r_i = max { j : A_ij ≠ 0 },
   β_B = max_i (r_i - l_i + 1).

1.1 Storage schemes for sparse matrices

1.1.1 Compressed sparse row (CSR) format

Matrix A is represented by three linear arrays A, adr, and ci (see Fig. 1). Array A stores the nonzero elements of the input matrix A, array adr[1, …, n+1] contains the indexes of the initial nonzero elements of the rows of A (adr[n+1] points just past the last stored nonzero element), and array ci contains the column indexes of the nonzero elements of A. Hence, the first nonzero element of row j is stored at index adr[j] in array A.

Fig. 1: The idea of the CSR format: a) a sparse matrix A in dense format, b) the CSR representation of A

1.1.2 Length-sorted CSR (L-CSR) storage format

The main idea is explained in [4]. The data is represented as in the CSR format, but the rows are sorted by length in increasing order. This means that the length of row i is less than or equal to the length of row i + 1. There are two variants:
1. Static: The rows are physically stored in the sorted order in the CSR format (see Fig. 2).
2. Dynamic: The original CSR format is extended with two additional arrays (see Fig. 3). Array member[1, …, n] contains the indexes of the rows after sorting. Array begin[1, …, β_B] contains indexes into the array member: begin[i] is the index of the first row of length i in the array member.

Fig. 2: The idea of the static L-CSR format: a) a sparse matrix A in dense format, b) the static L-CSR representation of A

1.2 Code restructuring

For demonstration purposes, we will use the following pseudocode:

An example code
   s = 0.0;
   for i = 1 to n do
      s = s + A[i];

1.2.1 Loop unrolling

Modern CPUs with deep instruction and arithmetic logic unit (ALU for short) pipelines achieve peak performance if they execute straight serial code without conditional branches. A non-optimizing compiler translates a for loop with n iterations into code with a single loop whose condition must be tested n times, even if the loop body is very small.
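For concreteness, the example code can be written in plain C as follows. This is a minimal sketch, not taken from the paper; the function name sum_array and the 0-based C indexing are illustrative. Compiled naively, the loop condition i < n is evaluated on each of the n iterations.

The example code in C (illustrative sketch)
   /* Plain C version of the example code.
      Without optimization, the branch "i < n" is tested n times. */
   double sum_array(const double *A, int n)
   {
       double s = 0.0;
       for (int i = 0; i < n; i++)
           s = s + A[i];
       return s;
   }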
Loop unrolling by an unrolling factor Uf consists in constructing another loop whose body consists of Uf instances of the original loop body in sequence, followed by a clean-up sequence if n is not a multiple of Uf. This makes the serial code longer, so that the instructions can be better scheduled, the internal pipeline can be better utilized, and the number of condition tests drops from n to ⌈n/Uf⌉.

Loop unrolling applied to the example code (Uf = 2)
   s = 0.0;
   for i = 1 to n step 2 do
      s = s + A[i];
      s = s + A[i + 1];

1.2.2 Loop unrolling-and-jam (loop jamming)

Since operations in the floating point unit (FPU for short) on modern CPUs are multi-stage pipelined operations, dependences between iterations can prevent the floating-point pipeline from filling, even if unrolling is used. To improve pipeline utilization in dense matrix codes that have a recurrence in the inner loop, the unroll-and-jam transformation is often used. The transformation consists in unrolling the outer loop and fusing the resulting inner loop bodies. This can increase floating-point pipeline utilization by interleaving the computation of multiple independent recurrences.

Loop unrolling-and-jam applied to the example code (Uf = 2)
   s1 = 0.0; s2 = 0.0;
   for i = 1 to n step 2 do
      s1 = s1 + A[i];
      s2 = s2 + A[i + 1];
   s = s1 + s2;

Fig. 3: The idea of the dynamic L-CSR format: a) a sparse matrix A in dense format, b) the dynamic L-CSR representation of A

1.2.3 Software pipelining

The initial instruction(s) of the first iteration is/are moved into the prologue phase, and the final instruction(s) of the last iteration is/are moved into the epilogue phase. This technique is usually combined with loop unrolling; it makes the loop code larger but more amenable to instruction scheduling.

Software pipelining applied to the example code
   s = 0.0;
   a1 = A[1];
   for i = 2 to n do
      a2 = A[i];
      s = s + a1;
      a1 = a2;
   s = s + a2;

1.2.4 Sparse matrix-vector multiplication

Consider a sparse n×n matrix A stored in the CSR format as defined in the previous section and an input dense array x[1, …, n] representing vector x. The goal is to compute the output dense array y[1, …, n] representing vector y = Ax. The following algorithm MVM_CSR is a straightforward implementation of SpM×V.

Algorithm MVM_CSR
   low = adr[1];
   for i = 1 to n do
      s = 0.0;
      up = adr[i + 1];
      for j = low to up - 1 do
         k = ci[j];
         s = s + A[j]*x[k];
      y[i] = s;
      low = up;

2 Improving the performance of sparse matrix-vector multiplication

The MVM_CSR code stated above performs poorly on modern processors. Due to matrix sparsity, the memory access patterns are irregular and the utilization of caches suffers from low spatial and temporal locality. The multiplication in the inner loop of MVM_CSR requires indirect addressing, which degrades performance because of a low cache-hit ratio. This problem is difficult to solve in general [5], and it is even more difficult if the matrix has only a few nonzero elements per row. In our case, we consider sparse matrices produced by the discretization of second-order 2D differential equations with typical stencils. These matrices contain only a few (typically between 4 and 20) nonzero elements per row.

The performance of SpM×V is influenced by SW transformations (SWT for short) and implementation decisions (ID for short).
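The individual modifications, listed below, can be read against a plain C rendition of the MVM_CSR baseline. This is a sketch, not taken from the paper; it assumes 0-based indexing and an (n+1)-entry adr array, and the function name spmv_csr is illustrative.

Algorithm MVM_CSR in C (illustrative sketch)
   /* Baseline SpM×V over CSR arrays A, adr, ci with 0-based indexing.
      Row i occupies positions adr[i] .. adr[i+1]-1 of A and ci. */
   void spmv_csr(int n, const double *A, const int *adr, const int *ci,
                 const double *x, double *y)
   {
       for (int i = 0; i < n; i++) {
           double s = 0.0;
           for (int j = adr[i]; j < adr[i + 1]; j++)
               s += A[j] * x[ci[j]];   /* indirect access to x via ci */
           y[i] = s;
       }
   }

Modifications a) and b) below restructure this loop nest, while c) to f) change the representation of the data it reads.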
a) Using explicit software preload (SWT, SpM×V_(a))
   The elements of arrays A and ci are loaded in advance by applying software pipelining to the inner loop of MVM_CSR. This should hide memory system latencies.
b) Interleaving of adjacent rows (SWT, SpM×V_(b))
   Two (or more) adjacent rows are computed simultaneously by applying loop unrolling to the row loop of MVM_CSR and jamming the resulting inner loops. This should improve FPU pipeline utilization (a sketch of this variant is given at the end of Section 3).
c) Condensed implementation of the CSR format (ID, SpM×V_(c))
   The matrix in the CSR format is not represented by two independent arrays ci and A, but by a single array of records, each record holding a nonzero value together with its column index. This should improve spatial locality, because these two items are always used together.
d) Use of pointers in array ci (ID, SpM×V_(d))
   The array ci contains pointers into array x instead of indexes into it. This should decrease the amount of work for the operand address decoder in the instruction pipeline.
e) Use of the static L-CSR format for storing A (ID, SpM×V_(e))
   Arrays A and ci are reordered to the static L-CSR format. This modification can save some FPU and ALU operations. The number of conditional branches is strongly reduced, and the operations for loading zero into floating point registers are removed completely. For rows with a very small number of nonzero elements, this can have a significant impact.
f) Using single precision (ID, SpM×V_(f))
   All elements of array A are stored in the single precision format (float). This floating point format requires half the space of the double format; since the column indexes in ci are unchanged, the memory requirements for storing A drop by about 33 %.

3 HW and SW configuration

The impact of these transformations on the performance of SpM×V was evaluated empirically by measurements on four different processors:
1. IBM Power 3, 200 MHz, 1 GB RAM, 32 KB instruction and 64 KB data L1 cache (128 B cache line size), and 1 MB L2 cache (32 B cache line size); OS AIX 4.3.2, IBM C compiler, switches: -O5.
2. AMD Opteron, 1.6 GHz, 1 GB RAM, 64 KB instruction and 64 KB data L1 cache, and 1 MB L2 cache; OS Linux Debian, kernel version 2.4.18, GNU C compiler, switches: -O3.
3. Intel Pentium III Coppermine, 1 GHz, 512 MB RAM, 16 KB instruction and 16 KB data L1 cache (32 B cache line size), and 256 KB L2 cache (32 B cache line size); OS Linux Debian, kernel version 2.4.18, Intel compiler version 6.0 build 020312Z, switches: icc -O3 -fno_alias -pc64 -tpp6 -xK -ipo -align -Zp16.
4. SUN UltraSparc IIIi (Sparc v9), 1 GHz, 1 GB RAM, 32 KB instruction and 64 KB data L1 cache, and 1 MB L2 cache; OS SunOS 5.9, GNU C compiler, switches: -O3.

The matrix order n starts at n0 = 5100 and grows as a geometric series with factor q = 1.32 up to n10 = n0·q^10 ≈ 80400. All measurements on all four processors were performed with the same set of matrices generated by an FEM generator. The graphs (Figs. 4 to 7) show the performance of SpM×V on these four processors, measured in MFlops, as a function of the order of matrix A:
- mv represents the performance of the standard SpM×V.
- mv_a represents the performance of SpM×V_(a) with a preload distance of 1 iteration.
- mv_b2 represents the performance of SpM×V_(b) with interleaving of 2 rows.
- mv_b4 represents the performance of SpM×V_(b) with interleaving of 4 rows.
- mv_c represents the performance of SpM×V_(c).
- mv_d represents the performance of SpM×V_(d).
- mv_e represents the performance of SpM×V_(e).
- mv_f represents the performance of SpM×V_(f).
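The sketch below illustrates what the mv_b2 curve measures: SpM×V_(b) with two adjacent rows interleaved. It is not taken from the paper; it assumes an even n, 0-based indexing, and the same CSR arrays A, adr, ci as in the baseline sketch above, and the function name spmv_csr_b2 is illustrative. Unequal row lengths are handled by letting the two rows advance together as long as both have elements left, followed by separate clean-up loops.

SpM×V_(b) with interleaving of 2 rows (illustrative sketch)
   /* Two adjacent rows are processed simultaneously (unroll-and-jam). */
   void spmv_csr_b2(int n, const double *A, const int *adr, const int *ci,
                    const double *x, double *y)
   {
       for (int i = 0; i < n; i += 2) {
           double s0 = 0.0, s1 = 0.0;
           int j0 = adr[i],     e0 = adr[i + 1];
           int j1 = adr[i + 1], e1 = adr[i + 2];
           /* jammed part: two independent sums advance together,
              which helps fill the FPU pipeline */
           while (j0 < e0 && j1 < e1) {
               s0 += A[j0] * x[ci[j0]]; j0++;
               s1 += A[j1] * x[ci[j1]]; j1++;
           }
           /* clean-up: the longer of the two rows finishes alone */
           for (; j0 < e0; j0++) s0 += A[j0] * x[ci[j0]];
           for (; j1 < e1; j1++) s1 += A[j1] * x[ci[j1]];
           y[i]     = s0;
           y[i + 1] = s1;
       }
   }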
4 Results

Evaluation of the results:
- Using structures (SpM×V_(c))
  Our assumptions were not fulfilled; this modification caused a slowdown on all architectures. One possible reason is that the loop code becomes less clear and the compilers are unable to optimize it.
- Using explicit preload (SpM×V_(a))
  This modification caused a slowdown on all architectures. The reason is that the explicit SW preload collides with the compiler's default preload transformation.
- Interleaving of 2 adjacent rows (SpM×V_(b))
  The effect of this modification differs across platforms (Pentium III: speedup of about 20 %; SUN: unchanged performance; Power 3: unchanged performance; Opteron: slowdown of about 40 %). One possible reason is that explicit loop unroll-and-jam collides with the compilers' default unroll-and-jam heuristics.
- Using pointers (SpM×V_(d))
  Our assumptions were not fulfilled; this modification caused a slowdown on all architectures. One possible reason is that the loop code becomes less clear and the compilers are unable to optimize it.
- Matrix A stored in the L-CSR format (SpM×V_(e))
  This modification achieves a speedup on all architectures due to a slight reduction in the number of FPU operations and conditional branches. The main drawback of this method is that the "typical" row lengths must be known at compile time.
- Using single precision (SpM×V_(f))
  This modification achieves a speedup on all architectures, caused by a 33 % smaller amount of data for the matrix and a 50 % smaller amount of data for the vectors. The main drawback of this method is the lower precision of the resulting vector.

Fig. 4: The performance of the algorithms on IBM Power 3
Fig. 5: The performance of the algorithms on AMD Opteron
Fig. 6: The performance of the algorithms on Intel Pentium III
Fig. 7: The performance of the algorithms on Sun UltraSparc IIIi

5 Conclusion

We have tried to increase the performance of SpM×V, one of the most common routines in linear algebra. To fulfill this goal, we have used either SW code transformation techniques or implementation decisions. We have measured the performance of several modifications of an SpM×V algorithm on four different HW platforms. The results differ due to the use of different CPU architectures and compilers, but we can conclude that three of the techniques improve the performance of the code and can be used to accelerate SpM×V.

6 Acknowledgment

This work was supported by MŠMT under research program MSM6840770014.

References

[1] Heras, D. B., Cabaleiro, J. C., Rivera, F. F.: Modeling Data Locality for the Sparse Matrix-Vector Product Using Distance Measures. Parallel Computing, Vol. 27 (2001), No. 7, p. 897–912, June 2001.
[2] Vuduc, R., Demmel, J. W., Yelick, K. A., Kamil, S., Nishtala, R., Lee, B.: Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply. In: Proceedings of Supercomputing 2002. Baltimore (MD, USA), November 2002.
[3] Rollin, S., Geus, R.: Towards a Fast Parallel Sparse Matrix-Vector Multiplication. In: Parallel Computing: Fundamentals and Applications (D'Hollander, E. H., Joubert, J. R., Peters, F. J., Sips, H., eds.), Proc. of PARCO'99, Imperial College Press, 2000, p. 308–315.
[4] White, J., Sadayappan, P.: On Improving the Performance of Sparse Matrix-Vector Multiplication. In: Proceedings of the 4th International Conference on High Performance Computing (HiPC '97), IEEE Computer Society, 1997, p. 578–587.
[5] Wolfe, M. J.: High-Performance Compilers for Parallel Computing. Reading (Massachusetts, USA): Addison-Wesley, 1995.

Ing. Ivan Šimeček
phone: +420 224 357 268
e-mail: xsimecek@fel.cvut.cz

Department of Computer Science
Czech Technical University in Prague
Faculty of Electrical Engineering
Technická 2
166 27 Praha 6, Czech Republic