Начиная с начала 2000 года осуществляется внедрение GHIS в здравоохранении, в рамках принятого проекта о реформирование информ


Mathematical Problems of Computer Science  47, 68--75, 2017. 

 
Performances of Factorizations of Complex Hermitian 
Matrices in the Architecture of GPU Accelerator 

 
Hrachya V.  Astsatryan and Edita E. Gichunts 

 
Institute for Informatics and Automation Problems of NASRA 

e-mail: hrach@sci.am, editagich@ipia.sci.am 
 

Abstract 
 

Linear algebra 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇  factorizations are very important and widely used in scientific 
and engineering calculations. The factorizations with Bunch-Kaufmann and Aasen’s 
algorithms, as well as the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization without pivoting also belong to this class of 
factorizations.The implementation of the mentioned three factorizations in hybrid 
architecture are presented in this work, and also performances are given on the NVIDIA 
Tesla K40c graphic processor using the MAGMA  library for complex Hermitian 
matrices. 

Keywords: MAGMA library, LDLT factorization, Aasen's algorithm, Bunch-
Kaufman algorithm, Hermitian matrix,Triangular matrix. 

 
1. Introduction 
 
Solutions of linear system of equations of Hermitian matrices have repeatedly been used in 
physics. To calculate the solutions of linear system of equations of 𝐴𝐴𝐴𝐴 = 𝑏𝑏 form, the classical 
method turns 𝐴𝐴  matrix into the following 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇   analysis form: 𝑃𝑃𝐴𝐴𝑃𝑃𝑇𝑇 = 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 ,where 𝐿𝐿  is the 
unit lower triangular, 𝐿𝐿  is the block diagonal with either 1-by-1 or 2-by-2 diagonal blocks, and 
𝑃𝑃 is a permutation matrix to ensure the numerical stability of the factorization.  

The factorizations with Bunch-Kaufmann and Aasen’s algorithms, as well as the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇  
factorization without pivoting are very popular. 

In this paper we present the mentioned factorizations for complex Hermitian matrices. 
Novelty of this work is the realization of 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization by the Aasen’s algorithm on GPU 
with the application of MAGMA library, as it has just been integrated into the MAGMA 2.0.1 
package. The realizations of 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorizations without pivoting and with Bunch-Kaufmann 
algorithm are also presented in this work, in order to have a complete understanding of 
performances of 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorizations of complex Hermitian matrices. Note also that this problem 
is presented in cases of complex single and complex double precisions.  
The work contains the following sections: the first sectionis introduction, the second one 
describes the steps of the above mentioned algorithms in CPU / GPU hybrid architecture. The 

 
68 
 

H.Astsatryan and E. Gichunts 69 

third section contains the experiment results, where the graphs of performance and time of those 
algorithms are shown, as well as the analysis of comparisons of factorizations are given. The 
fourth section is the conclusion based on the experiment results. 
 
 
2. LDLT Factorizationin CPU/GPU Hybrid Architecture 

Solutions of linear system of equations of the form 𝐴𝐴 ∗ 𝑍𝑍 = 𝐵𝐵 are being used in science on 
repeated occasions, where 𝐴𝐴 is a complex Hermitian matrix, and 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 matrix factorization is 
used. Describe three cases of these factorizations in hybrid architecture that are resolved with 
Bunch-Kaufmann and Aasen’s algorithms, as well as the case without pivoting. LAPACK library 
is used in the systems with shared memory, where as in hybrid systems its parallelized MAGMA 
library is used. 
 
2.1Aasen’s Algorithm 
  
Aasen’s algorithm [1] factorizes 𝐴𝐴 into an 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 decomposition of the form 𝑃𝑃𝐴𝐴𝑃𝑃𝑇𝑇 = 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 
,where 𝑃𝑃 is a permutation matrix, 𝐿𝐿 is a unit lower triangular matrix, and 𝐿𝐿 is a Hermitian 
matrix. To exploit the memory hierarchy on a modern computer,a partitioned-version of Aasen’s 
algorithm was recently proposed [2]. This algorithm first factorizes a panel in a left-looking 
fashion, and then uses BLAS-3 operations toupdate the trailing submatrix in a right-looking way. 
The blocked-version of Aasen’s algorithm computes an 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇  factorization of 𝐴𝐴, where 𝐿𝐿 is a 
banded matrix of theband width equal to the block size. 
Aasen’s algorithm is a chetrf_aasen subprogram of MAGMA library for complex Hermitian 
matrices. 

CHETRF_AASEN computes the factorization of a complex Hermitian matrix 𝐴𝐴 based on a 
communication-avoiding variant of the Aasen's algorithm.  The form of the factorization is as 
follows: 

 
𝐴𝐴 = 𝑈𝑈 ∗ 𝐿𝐿 ∗ 𝑈𝑈 ∗∗ 𝐻𝐻  if uplo = MagmaUpper, or 𝐴𝐴 = 𝐿𝐿 ∗ 𝐿𝐿 ∗ 𝐿𝐿 ∗∗ 𝐻𝐻 if uplo = MagmaLower, 
 

where 𝑈𝑈 (or 𝐿𝐿) is a product of permutation and unit upper (lower) triangular matrices, and 𝐿𝐿 is 
Hermitian and banded matrix of theband width equal to the block size. 

There are three cases of 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization computing by the Aasen’s algorithm which 
areright-and left-looking algorithms and a blocked left-looking version of the algorithm. 
Factorization by the Aasen’s algorithmis implemented with the left-looking algorithm in CPU / 
GPU hybrid architecture. 

The intermediate Heisenberg matrix 𝐻𝐻 is used to calculate the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization that is 
defined by 𝐻𝐻 = 𝑇𝑇𝐿𝐿𝑇𝑇 . In the initial step the matrix 𝐴𝐴 is completely transferred from the CPU to 
the GPU memory through magma_csetmatrix_async () function. 
The algorithm consists of the following sequence: 𝐻𝐻(1: 𝑗𝑗 − 1, 𝑗𝑗) is calculated and 𝑇𝑇(𝑗𝑗, 𝑗𝑗) is 
updated: 
Initially the Aasen’s algorithm calculates the 𝑗𝑗−𝑡𝑡ℎ  column of 𝐻𝐻, that is: 
 

𝐻𝐻(𝑖𝑖, 𝑗𝑗) = 𝑇𝑇(𝑖𝑖, 𝑖𝑖 − 1) ∗ 𝐿𝐿(𝑗𝑗, 𝑖𝑖 − 1)′ + 𝑇𝑇(𝑖𝑖, 𝑖𝑖) ∗ 𝐿𝐿(𝑗𝑗, 𝑖𝑖)′ + 𝑇𝑇(𝑖𝑖, 𝑖𝑖 + 1) ∗ 𝐿𝐿(𝑗𝑗, 𝑖𝑖 + 1)′ , 
 

  Performanses of  Fuctorizations of  Complex Hermitian Matrices in the Architecture of GPU Accelerator 70 

wherei is modified from 1 to 𝑗𝑗 − 1 . Using magma_cgemm () function for matrix multiplication 
the (𝑗𝑗; 𝑗𝑗)𝑡𝑡ℎ element of 𝐻𝐻 will be the (𝑗𝑗; 𝑗𝑗)𝑡𝑡ℎ element of the equation 𝐴𝐴 = 𝐿𝐿𝐻𝐻 . 
In the next step 𝑇𝑇(𝑗𝑗, 𝑗𝑗) will be the (𝑗𝑗; 𝑗𝑗)𝑡𝑡ℎ element of the equation 𝐻𝐻 = 𝑇𝑇𝐿𝐿𝑇𝑇, namely 

 
𝑇𝑇(𝑗𝑗, 𝑗𝑗) = 𝐴𝐴(𝑗𝑗, 𝑗𝑗) − 𝐿𝐿(𝑗𝑗, 1: 𝑗𝑗) ∗ 𝐻𝐻(1: 𝑗𝑗, 𝑗𝑗) 

 
is calculated using magma_cher2k () function, which performs the Hermitian rank-2k update, 
and magmablas_csymmetrize () function, which copies the lower triangle to upper triangle, or  
vice-versa, to make 𝑑𝑑𝐴𝐴 a general representation of a symmetric matrix. 
If  𝑗𝑗 > 1 , then  

𝑇𝑇(𝑗𝑗, 𝑗𝑗) = 𝑇𝑇(𝑗𝑗, 𝑗𝑗) − 𝐿𝐿(𝑗𝑗, 𝑗𝑗) ∗ 𝑇𝑇(𝑗𝑗, 𝑗𝑗 − 1) ∗ 𝐿𝐿(𝑗𝑗, 𝑗𝑗 − 1)′ 
is calculated. 
 

2.2Bunch-Kaufman Algorithm 
 
Bunch-Kaufmann algorithmhas a very wide application in the solution of linear system of 
equations of Hermitian matrices through 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization. The Bunch-Kaufman method 
performs the following decomposition: 𝑃𝑃𝐴𝐴𝑃𝑃𝑇𝑇 = 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 ,where 𝐿𝐿 is an n-by-n lower triangular 
matrix with a unit diagonal, and 𝐿𝐿 is a block diagonal matrix with either 1-by-1 or 2-by-2 
diagonal blocks [3]. The pivoting strategies to compute the permutation matrix 𝑃𝑃 for the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 
factorization include complete pivoting (Bunch-Parlett algorithm) [4], partial pivoting (Bunch-
Kaufman algorithm) [5], rook pivoting (bounded Bunch-Kaufman) [6, p. 523]. In particular, the 
Bunch-Kaufman and rook pivoting are implemented in LAPACK [7]. A diagonal pivoting 
algorithm, the subprogram of which has the name_hetrf() in LAPACK library, is generally 
available by NetLib [8]. A concise matrix-notation description of this algorithm can be found in 
[9]. 

Bunch-Kaufman algorithm is chetrf subprogram of MAGMA library for complex Hermitian 
matrices. 

CHETRF computes the factorization of a complex Hermitian matrix Ausing the Bunch-
Kaufman diagonal pivoting method.  The form of the factorization is 

 
𝐴𝐴 = 𝑈𝑈 ∗ 𝐿𝐿 ∗ 𝑈𝑈𝐻𝐻 if uplo = MagmaUpper, or 𝐴𝐴 = 𝐿𝐿 ∗ 𝐿𝐿 ∗ 𝐿𝐿𝐻𝐻 if uplo = MagmaLower, 

 
where 𝑈𝑈 (or 𝐿𝐿) is a product of permutation and unit upper (lower) triangular matrices, and 𝐿𝐿 is 
Hermitian and block diagonal with1-by-1 and 2-by-2 diagonal blocks. This is the blocked 
version of the algorithm, calling Level 3 BLAS. 
In the first step of hybrid CPU / GPU programming the triangular matrix moves from CPU to 
GPU through magma_csetmatrix_async () function. If uplo = MagmaUpper, then the upper 
triangular matrix 𝐴𝐴 moves to GPU, and if uplo = MagmaLower, then the lower triangular matrix 
𝐴𝐴 moves to GPU. Afterwards 𝑈𝑈 ∗ 𝐿𝐿 ∗ 𝑈𝑈′ or 𝐿𝐿 ∗ 𝐿𝐿 ∗ 𝐿𝐿′ factorization of triangular matrix 𝐴𝐴 is 
carried out in the following way. In case ofthe upper triangular matrix each 𝑘𝑘 − 𝑘𝑘𝑏𝑏 + 1 ∶ 𝑘𝑘 
column of the matrix is factorized, as well as uses a blocked code to update the columns 1 ∶ 𝑘𝑘 −
𝑘𝑘𝑏𝑏 through the following magma_clahef_gpu() function. Magma_clahef_gpu() computes a 
partial factorization of a complex Hermitian matrix 𝐴𝐴 using the Bunch-Kaufman diagonal 
pivoting method. 𝐾𝐾 is the main loop index, decreasing from 𝑁𝑁 to 1 in steps of 𝐾𝐾𝐵𝐵, where 𝐾𝐾𝐵𝐵 is 
the number of columns factorized by magma_clahef_gpu(). In case of the lower triangular matrix 
each 𝑘𝑘 − 𝑘𝑘𝑏𝑏 + 1: 𝑘𝑘 column of the matrix is factorized, as well as a blocked code to update the 


H.Astsatryan and E. Gichunts 71 

columns 𝑘𝑘 + 𝑘𝑘𝑏𝑏 ∶ 𝑛𝑛 . 𝐾𝐾 is the main loop index, increasing from 1 to 𝑁𝑁 in steps of 𝐾𝐾𝐵𝐵, where 𝐾𝐾𝐵𝐵 
is the number of columns factorized by magma_clahef_gpu( ). 
 
2.3 LDLT Factorizations with no Pivoting 
 
This method of factorization for complex Hermitian matrices is performed by chetrf_nopiv() 
function of MAGMA library. 
CHETRF_nopiv computes the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization of a complex Hermitian matrix 𝐴𝐴. The 
factorization has the form 
 

𝐴𝐴 = 𝑈𝑈𝐻𝐻 ∗ 𝐿𝐿 ∗ 𝑈𝑈 if uplo = MagmaUpper, or 𝐴𝐴 = 𝐿𝐿 ∗ 𝐿𝐿 ∗ 𝐿𝐿𝐻𝐻 if uplo = MagmaLower, 
 
where 𝑈𝑈 is an upper triangular matrix, 𝐿𝐿 is a lower triangular, and 𝐿𝐿 is a diagonal matrix.    This 
is the block version of the algorithm, calling Level 3 BLAS. 
In hybrid CPU/GPU programming 𝐴𝐴 = 𝑈𝑈′ ∗ 𝐿𝐿 ∗ 𝑈𝑈 factorization without pivoting will be 
performed in the following sequence. The matrix completely moves to GPU memory. After 
wards, all the 𝐴𝐴(𝑗𝑗, 𝑗𝑗) diagonal elements in the main cycle move to CPU memory. CPU is used to 
calculate the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization of the diagonal block. This diagonal block on CPU is factorized 
through magma_chetrf_nopiv_cpu () function. Once the resulting 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factors of the diagonal 
block are copied back to the GPU, the corresponding off-diagonal blocks of the 𝐿𝐿 -factor are 
computed by the triangular solve on the GPU. Then the copy ofj column of 𝑈𝑈 is sent to CPU. 
Compute the off-diagonal blocks of current block column by magma_ctrsm() function. After 
wards update the trailing submatrix with 𝐿𝐿 by magmablas_clascl_diag() function. Finally, 
update each block column of the trailing submatrix, calling a matrix-matrix multiply on the 
GPU. 𝐴𝐴 = 𝐿𝐿 ∗ 𝐿𝐿 ∗ 𝐿𝐿′  factorization in similar sequence will be performed for the lower triangular 
matrix. 
 
3.   Experimental Results  
 
The experiments were conducted on NVIDIA K40c GPU. The architecture of Tesla K40c 
consists of 2880 CUDA processor cores. It is endowed with much higher bandwidth 288 GB/s of 
message transfer between CPU and GPU, having 12 GB of global memory, GDDR5 memory 
interface, and CUDA C programming environment. 

The operation system of Tesla K40 is Ubuntu 14.04.2 LTS.cuda7 programming environment 
was used for the realization of programs. MAGMA 2.0.1 package was installed in accordance 
with cuda7 environment. For the compilation of MAGMA library the lapack-3.4.2, clapack-3.2.1 
and atlas-3.10.0 packages were installed. gcc-4.8, gfortran-4.8, g ++ - 4.8 and nvcc compilers 
were used. Such references were made in make.inc file on libf77blas.a, libcblas.a, libf2c.a, 
libcublas.so, libcudart.so, libm.a, libstdc ++.so, libpthread.so, libdl.so, libcusparse.so static and 
dynamic libraries. MAGMA 2.0.1 package contains libmagma.a and libmagma_sparse.a 
libraries. 

Figures 1 and 2 show the time and performance graphs of factorizations with the Aasen’s 
algorithm, Bunch-Kaufmann algorithm and 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization without pivoting for complex 
single precision. 


  Performanses of  Fuctorizations of  Complex Hermitian Matrices in the Architecture of GPU Accelerator 72 

 
Fig. 1. Complex Single Precision 
 
 
Fig. 2. Complex Single Precision 
 

In case of complex single precisionat most 𝑁𝑁 = 35840 -dimensional matrix can be sent to 
Tesla K40c graphical processor. The experiment results show that in case of up to 𝑁𝑁 = 16384  
dimensional incoming matrix the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization without pivoting, both in terms of time and 
performance, exceeds the Aasen’s factorization algorithm for 5 times and the Bunch-Kaufmann 
factorization algorithm – for 3 times. Where as the Bunch-Kaufmann algorithm exceeds the 
Aasen’s algorithm for 3 times. Increasing the incoming matrix up to 𝑁𝑁 = 35840 dimension the 
Bunch-Kaufmann algorithm is approaching the factorization without pivoting, and it turns out 
that they exceed the Aasen’s factorization algorithm for 3 times both in terms of time and 
performance.  

Figures 3 and 4 show the time and performance graphs of the mentioned three factorizations 
for complex double-precision. 

0

20

40

60

80

100

120

140

0 10000 20000 30000 40000

Aasen

Bunch-Kauf

trf_nopiv

N

Ti
m

e 
(s

ec
on

ds
)

0
200
400
600
800

1000
1200
1400
1600
1800

0 10000 20000 30000 40000

Aasen

Bunch-Kauf

trf_nopiv

N

G
flo

p/
s


H.Astsatryan and E. Gichunts 73 

 
Fig. 3. Complex Double Precision 
 
 
Fig. 4. Complex Double Precision 
 

And in the case of complex double precision at most 𝑁𝑁 = 23552 -dimensional matrix can 
be sent to Tesla K40c graphical processor. The experiment results show that in this casethe 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 
factorization without pivoting in terms of time and performance exceeds the Aasen’s 
factorization algorithm for 5 times and the Bunch-Kaufmann factorization algorithm – for 2 
times, whereas the Bunch-Kaufmann algorithm, in terms of both time and performance, exceeds 
the Aasen’s algorithm for 2.5 times. 

 
4.   Conclusion 
  
We presented the performances of complex Hermitian matrix 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorizations in CPU/GPU 
hybrid architecture. We also presented the performance results of factorizations with Aasen’s 
algorithm, Bunch-Kaufmann algorithm and 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization without pivoting using MAGMA 
library. Based on the obtained results we came to the conclusion that the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization 
without pivotingis in a leading position by its high performance. In case of complex single 

0
10
20
30
40
50
60
70
80
90

0 10000 20000 30000

Aasen

Bunch-Kauf

trf_nopiv

N

Ti
m

e 
(s

ec
on

ds
)

0

200

400

600

800

1000

1200

0 10000 20000 30000

Aasen

Bunch-Kauf

trf_nopiv

N

G
flo

p/
s


  Performanses of  Fuctorizations of  Complex Hermitian Matrices in the Architecture of GPU Accelerator 74 

precision it equalsthe Bunch-Kaufmann algorithm along with the increase of the incoming 
matrix, and in the case of complex double precision it exceeds for 2 times. It exceeds the Aasen’s 
factorization algorithm for 5 times in complex single and double precisions cases. The latter 
concedes the Bunch-Kaufmann algorithm in both cases for 2.5-3 times. 
 
References 
 

[1] J. Aasen, “On the reduction of a symmetric matrix to tridiagonal form”, BIT 11, pp. 233-
242, 1971. 

[2] M. Rozlo˘zn´ık, G. Shklarski and S. Toledo, “Partitioned triangular tridiagonalization”, 
ACM Trans. Math. Softw.,vol. 37, no. 4, pp. 1–16, 2011. 

[3] G. Golub and Ch. Van Loan, Matrix Computations, John Hopkins University Press, 
Baltimore, second edition, 1989. 

[4] J. R. Bunch and B. N. Parlett,“Direct methods for solving symmetric indefinite systems 
of linear equations”, SIAM J. Numerical Analysis, vol. 8,pp. 639-655, 1971. 

[5] J. R. Bunch and L. Kaufman,“Some stable methods for calculating inertia andsolving 
symmetric linear systems”, Mathematics of Computation, vol. 31,pp. 163-179, 1977. 

[6] C. Ashcraft, R. G. Grimes and J. G. Lewis,“Accurate symmetric indefinite linear 
equation solvers”, SIAM J. Matrix Anal. and Appl., vol. 20, no. 2, pp. 513-561, 1998. 

[7] E. Anderson, Z. Bai, J. J. Dongarra, A. Greenbaum, A. McKenney, J. DuCroz, S. 
Hammarling, J. W. Demmel, C. Bischof and D. Sorensen, “LAPACK:a portable linear 
algebra library for high-performance computers”, Proceedings of the1990 ACM/IEEE 
conference on Supercomputing. 

[8] C. Anderson et al.,LAPACK User's Guide, SIAM Press, Philadelphia,1992. 
[9] P. E. Strazdins. A dense complex symmetric indefinite solver for theFujitsu AP3000, 

Technical Report TR-CS-99-01, Computer ScienceDept, Australian National University, 
May 1999. 

 
Submitted 03.10.2016, accepted 27.01.2017. 
 
 
Կոմպլեքս Հերմիտյան մատրիցների ֆակտորիզացիաների 
արտադրողականությունները GPU արագագործչի 

ճարտարապետությունում 
 

Հ. Ասցատրյան և Է. Գիչունց 
 

Ամփոփում 
 
Գծային հանրահաշվի LDLT ֆակտորիզացիաները շատ կարևոր են և մեծ կիրառություն ունեն 
գիտական և ինժեներական հաշվարկներում: Այդ ֆակտորիզացիաների դասին են 
պատկանում Բանչ-Կաուֆմանի և Աասենի ալգորիթմներով ֆակտորիզացիաները, ինչպես 


H.Astsatryan and E. Gichunts 75 

նաև առանց պտույտի LDLT ֆակտոիզացիան: Այս աշխատանքում ներկայացված են նշված 
երեք ֆակտորիզացիաների իրականացումները հիբրիդային ճարտարապետությունում, և 
տրված են արտադրողականություններ NVIDIA Tesla K40c գրաֆիկական պրոցեսորի վրա 
MAGMA գրադարանի կիրառմամբ կոմպլեքս Հերմիտյան մատրիցների համար: 

 
Производительности факторизации комплексных Эрмитовых 
матриц в архитектуре ускорителя GPU 

 
Г. Асцатрян и Э. Гичунц 

 
Аннотация 

 
 LDLT факторизации линейной алгебры очень важны и широко используются в 
научных и инженерных расчетах. К этому классу относятся факторизации с алгоритмами 
Банча-Кауфмана и Аасена, а также факторизация LDLT без поворота. В этой работе 
представлены реализации упомянутых трех факторизаций в гибридной архитектуре, а 
также представлены производительности на графическом процессоре NVIDIA Tesla K40c 
с использованием библиотеки MAGMA для комплексных Эрмитовых матриц.