Начиная с начала 2000 года осуществляется внедрение GHIS в здравоохранении, в рамках принятого проекта о реформирование информ Mathematical Problems of Computer Science 47, 68--75, 2017. Performances of Factorizations of Complex Hermitian Matrices in the Architecture of GPU Accelerator Hrachya V. Astsatryan and Edita E. Gichunts Institute for Informatics and Automation Problems of NASRA e-mail: hrach@sci.am, editagich@ipia.sci.am Abstract Linear algebra 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorizations are very important and widely used in scientific and engineering calculations. The factorizations with Bunch-Kaufmann and Aasen’s algorithms, as well as the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization without pivoting also belong to this class of factorizations.The implementation of the mentioned three factorizations in hybrid architecture are presented in this work, and also performances are given on the NVIDIA Tesla K40c graphic processor using the MAGMA library for complex Hermitian matrices. Keywords: MAGMA library, LDLT factorization, Aasen's algorithm, Bunch- Kaufman algorithm, Hermitian matrix,Triangular matrix. 1. Introduction Solutions of linear system of equations of Hermitian matrices have repeatedly been used in physics. To calculate the solutions of linear system of equations of 𝐴𝐴𝐴𝐴 = 𝑏𝑏 form, the classical method turns 𝐴𝐴 matrix into the following 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 analysis form: 𝑃𝑃𝐴𝐴𝑃𝑃𝑇𝑇 = 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 ,where 𝐿𝐿 is the unit lower triangular, 𝐿𝐿 is the block diagonal with either 1-by-1 or 2-by-2 diagonal blocks, and 𝑃𝑃 is a permutation matrix to ensure the numerical stability of the factorization. The factorizations with Bunch-Kaufmann and Aasen’s algorithms, as well as the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization without pivoting are very popular. In this paper we present the mentioned factorizations for complex Hermitian matrices. Novelty of this work is the realization of 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization by the Aasen’s algorithm on GPU with the application of MAGMA library, as it has just been integrated into the MAGMA 2.0.1 package. The realizations of 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorizations without pivoting and with Bunch-Kaufmann algorithm are also presented in this work, in order to have a complete understanding of performances of 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorizations of complex Hermitian matrices. Note also that this problem is presented in cases of complex single and complex double precisions. The work contains the following sections: the first sectionis introduction, the second one describes the steps of the above mentioned algorithms in CPU / GPU hybrid architecture. The 68 H.Astsatryan and E. Gichunts 69 third section contains the experiment results, where the graphs of performance and time of those algorithms are shown, as well as the analysis of comparisons of factorizations are given. The fourth section is the conclusion based on the experiment results. 2. LDLT Factorizationin CPU/GPU Hybrid Architecture Solutions of linear system of equations of the form 𝐴𝐴 ∗ 𝑍𝑍 = 𝐵𝐵 are being used in science on repeated occasions, where 𝐴𝐴 is a complex Hermitian matrix, and 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 matrix factorization is used. Describe three cases of these factorizations in hybrid architecture that are resolved with Bunch-Kaufmann and Aasen’s algorithms, as well as the case without pivoting. LAPACK library is used in the systems with shared memory, where as in hybrid systems its parallelized MAGMA library is used. 2.1Aasen’s Algorithm Aasen’s algorithm [1] factorizes 𝐴𝐴 into an 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 decomposition of the form 𝑃𝑃𝐴𝐴𝑃𝑃𝑇𝑇 = 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 ,where 𝑃𝑃 is a permutation matrix, 𝐿𝐿 is a unit lower triangular matrix, and 𝐿𝐿 is a Hermitian matrix. To exploit the memory hierarchy on a modern computer,a partitioned-version of Aasen’s algorithm was recently proposed [2]. This algorithm first factorizes a panel in a left-looking fashion, and then uses BLAS-3 operations toupdate the trailing submatrix in a right-looking way. The blocked-version of Aasen’s algorithm computes an 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization of 𝐴𝐴, where 𝐿𝐿 is a banded matrix of theband width equal to the block size. Aasen’s algorithm is a chetrf_aasen subprogram of MAGMA library for complex Hermitian matrices. CHETRF_AASEN computes the factorization of a complex Hermitian matrix 𝐴𝐴 based on a communication-avoiding variant of the Aasen's algorithm. The form of the factorization is as follows: 𝐴𝐴 = 𝑈𝑈 ∗ 𝐿𝐿 ∗ 𝑈𝑈 ∗∗ 𝐻𝐻 if uplo = MagmaUpper, or 𝐴𝐴 = 𝐿𝐿 ∗ 𝐿𝐿 ∗ 𝐿𝐿 ∗∗ 𝐻𝐻 if uplo = MagmaLower, where 𝑈𝑈 (or 𝐿𝐿) is a product of permutation and unit upper (lower) triangular matrices, and 𝐿𝐿 is Hermitian and banded matrix of theband width equal to the block size. There are three cases of 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization computing by the Aasen’s algorithm which areright-and left-looking algorithms and a blocked left-looking version of the algorithm. Factorization by the Aasen’s algorithmis implemented with the left-looking algorithm in CPU / GPU hybrid architecture. The intermediate Heisenberg matrix 𝐻𝐻 is used to calculate the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization that is defined by 𝐻𝐻 = 𝑇𝑇𝐿𝐿𝑇𝑇 . In the initial step the matrix 𝐴𝐴 is completely transferred from the CPU to the GPU memory through magma_csetmatrix_async () function. The algorithm consists of the following sequence: 𝐻𝐻(1: 𝑗𝑗 − 1, 𝑗𝑗) is calculated and 𝑇𝑇(𝑗𝑗, 𝑗𝑗) is updated: Initially the Aasen’s algorithm calculates the 𝑗𝑗−𝑡𝑡ℎ column of 𝐻𝐻, that is: 𝐻𝐻(𝑖𝑖, 𝑗𝑗) = 𝑇𝑇(𝑖𝑖, 𝑖𝑖 − 1) ∗ 𝐿𝐿(𝑗𝑗, 𝑖𝑖 − 1)′ + 𝑇𝑇(𝑖𝑖, 𝑖𝑖) ∗ 𝐿𝐿(𝑗𝑗, 𝑖𝑖)′ + 𝑇𝑇(𝑖𝑖, 𝑖𝑖 + 1) ∗ 𝐿𝐿(𝑗𝑗, 𝑖𝑖 + 1)′ , Performanses of Fuctorizations of Complex Hermitian Matrices in the Architecture of GPU Accelerator 70 wherei is modified from 1 to 𝑗𝑗 − 1 . Using magma_cgemm () function for matrix multiplication the (𝑗𝑗; 𝑗𝑗)𝑡𝑡ℎ element of 𝐻𝐻 will be the (𝑗𝑗; 𝑗𝑗)𝑡𝑡ℎ element of the equation 𝐴𝐴 = 𝐿𝐿𝐻𝐻 . In the next step 𝑇𝑇(𝑗𝑗, 𝑗𝑗) will be the (𝑗𝑗; 𝑗𝑗)𝑡𝑡ℎ element of the equation 𝐻𝐻 = 𝑇𝑇𝐿𝐿𝑇𝑇, namely 𝑇𝑇(𝑗𝑗, 𝑗𝑗) = 𝐴𝐴(𝑗𝑗, 𝑗𝑗) − 𝐿𝐿(𝑗𝑗, 1: 𝑗𝑗) ∗ 𝐻𝐻(1: 𝑗𝑗, 𝑗𝑗) is calculated using magma_cher2k () function, which performs the Hermitian rank-2k update, and magmablas_csymmetrize () function, which copies the lower triangle to upper triangle, or vice-versa, to make 𝑑𝑑𝐴𝐴 a general representation of a symmetric matrix. If 𝑗𝑗 > 1 , then 𝑇𝑇(𝑗𝑗, 𝑗𝑗) = 𝑇𝑇(𝑗𝑗, 𝑗𝑗) − 𝐿𝐿(𝑗𝑗, 𝑗𝑗) ∗ 𝑇𝑇(𝑗𝑗, 𝑗𝑗 − 1) ∗ 𝐿𝐿(𝑗𝑗, 𝑗𝑗 − 1)′ is calculated. 2.2Bunch-Kaufman Algorithm Bunch-Kaufmann algorithmhas a very wide application in the solution of linear system of equations of Hermitian matrices through 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization. The Bunch-Kaufman method performs the following decomposition: 𝑃𝑃𝐴𝐴𝑃𝑃𝑇𝑇 = 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 ,where 𝐿𝐿 is an n-by-n lower triangular matrix with a unit diagonal, and 𝐿𝐿 is a block diagonal matrix with either 1-by-1 or 2-by-2 diagonal blocks [3]. The pivoting strategies to compute the permutation matrix 𝑃𝑃 for the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization include complete pivoting (Bunch-Parlett algorithm) [4], partial pivoting (Bunch- Kaufman algorithm) [5], rook pivoting (bounded Bunch-Kaufman) [6, p. 523]. In particular, the Bunch-Kaufman and rook pivoting are implemented in LAPACK [7]. A diagonal pivoting algorithm, the subprogram of which has the name_hetrf() in LAPACK library, is generally available by NetLib [8]. A concise matrix-notation description of this algorithm can be found in [9]. Bunch-Kaufman algorithm is chetrf subprogram of MAGMA library for complex Hermitian matrices. CHETRF computes the factorization of a complex Hermitian matrix Ausing the Bunch- Kaufman diagonal pivoting method. The form of the factorization is 𝐴𝐴 = 𝑈𝑈 ∗ 𝐿𝐿 ∗ 𝑈𝑈𝐻𝐻 if uplo = MagmaUpper, or 𝐴𝐴 = 𝐿𝐿 ∗ 𝐿𝐿 ∗ 𝐿𝐿𝐻𝐻 if uplo = MagmaLower, where 𝑈𝑈 (or 𝐿𝐿) is a product of permutation and unit upper (lower) triangular matrices, and 𝐿𝐿 is Hermitian and block diagonal with1-by-1 and 2-by-2 diagonal blocks. This is the blocked version of the algorithm, calling Level 3 BLAS. In the first step of hybrid CPU / GPU programming the triangular matrix moves from CPU to GPU through magma_csetmatrix_async () function. If uplo = MagmaUpper, then the upper triangular matrix 𝐴𝐴 moves to GPU, and if uplo = MagmaLower, then the lower triangular matrix 𝐴𝐴 moves to GPU. Afterwards 𝑈𝑈 ∗ 𝐿𝐿 ∗ 𝑈𝑈′ or 𝐿𝐿 ∗ 𝐿𝐿 ∗ 𝐿𝐿′ factorization of triangular matrix 𝐴𝐴 is carried out in the following way. In case ofthe upper triangular matrix each 𝑘𝑘 − 𝑘𝑘𝑏𝑏 + 1 ∶ 𝑘𝑘 column of the matrix is factorized, as well as uses a blocked code to update the columns 1 ∶ 𝑘𝑘 − 𝑘𝑘𝑏𝑏 through the following magma_clahef_gpu() function. Magma_clahef_gpu() computes a partial factorization of a complex Hermitian matrix 𝐴𝐴 using the Bunch-Kaufman diagonal pivoting method. 𝐾𝐾 is the main loop index, decreasing from 𝑁𝑁 to 1 in steps of 𝐾𝐾𝐵𝐵, where 𝐾𝐾𝐵𝐵 is the number of columns factorized by magma_clahef_gpu(). In case of the lower triangular matrix each 𝑘𝑘 − 𝑘𝑘𝑏𝑏 + 1: 𝑘𝑘 column of the matrix is factorized, as well as a blocked code to update the H.Astsatryan and E. Gichunts 71 columns 𝑘𝑘 + 𝑘𝑘𝑏𝑏 ∶ 𝑛𝑛 . 𝐾𝐾 is the main loop index, increasing from 1 to 𝑁𝑁 in steps of 𝐾𝐾𝐵𝐵, where 𝐾𝐾𝐵𝐵 is the number of columns factorized by magma_clahef_gpu( ). 2.3 LDLT Factorizations with no Pivoting This method of factorization for complex Hermitian matrices is performed by chetrf_nopiv() function of MAGMA library. CHETRF_nopiv computes the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization of a complex Hermitian matrix 𝐴𝐴. The factorization has the form 𝐴𝐴 = 𝑈𝑈𝐻𝐻 ∗ 𝐿𝐿 ∗ 𝑈𝑈 if uplo = MagmaUpper, or 𝐴𝐴 = 𝐿𝐿 ∗ 𝐿𝐿 ∗ 𝐿𝐿𝐻𝐻 if uplo = MagmaLower, where 𝑈𝑈 is an upper triangular matrix, 𝐿𝐿 is a lower triangular, and 𝐿𝐿 is a diagonal matrix. This is the block version of the algorithm, calling Level 3 BLAS. In hybrid CPU/GPU programming 𝐴𝐴 = 𝑈𝑈′ ∗ 𝐿𝐿 ∗ 𝑈𝑈 factorization without pivoting will be performed in the following sequence. The matrix completely moves to GPU memory. After wards, all the 𝐴𝐴(𝑗𝑗, 𝑗𝑗) diagonal elements in the main cycle move to CPU memory. CPU is used to calculate the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization of the diagonal block. This diagonal block on CPU is factorized through magma_chetrf_nopiv_cpu () function. Once the resulting 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factors of the diagonal block are copied back to the GPU, the corresponding off-diagonal blocks of the 𝐿𝐿 -factor are computed by the triangular solve on the GPU. Then the copy ofj column of 𝑈𝑈 is sent to CPU. Compute the off-diagonal blocks of current block column by magma_ctrsm() function. After wards update the trailing submatrix with 𝐿𝐿 by magmablas_clascl_diag() function. Finally, update each block column of the trailing submatrix, calling a matrix-matrix multiply on the GPU. 𝐴𝐴 = 𝐿𝐿 ∗ 𝐿𝐿 ∗ 𝐿𝐿′ factorization in similar sequence will be performed for the lower triangular matrix. 3. Experimental Results The experiments were conducted on NVIDIA K40c GPU. The architecture of Tesla K40c consists of 2880 CUDA processor cores. It is endowed with much higher bandwidth 288 GB/s of message transfer between CPU and GPU, having 12 GB of global memory, GDDR5 memory interface, and CUDA C programming environment. The operation system of Tesla K40 is Ubuntu 14.04.2 LTS.cuda7 programming environment was used for the realization of programs. MAGMA 2.0.1 package was installed in accordance with cuda7 environment. For the compilation of MAGMA library the lapack-3.4.2, clapack-3.2.1 and atlas-3.10.0 packages were installed. gcc-4.8, gfortran-4.8, g ++ - 4.8 and nvcc compilers were used. Such references were made in make.inc file on libf77blas.a, libcblas.a, libf2c.a, libcublas.so, libcudart.so, libm.a, libstdc ++.so, libpthread.so, libdl.so, libcusparse.so static and dynamic libraries. MAGMA 2.0.1 package contains libmagma.a and libmagma_sparse.a libraries. Figures 1 and 2 show the time and performance graphs of factorizations with the Aasen’s algorithm, Bunch-Kaufmann algorithm and 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization without pivoting for complex single precision. Performanses of Fuctorizations of Complex Hermitian Matrices in the Architecture of GPU Accelerator 72 Fig. 1. Complex Single Precision Fig. 2. Complex Single Precision In case of complex single precisionat most 𝑁𝑁 = 35840 -dimensional matrix can be sent to Tesla K40c graphical processor. The experiment results show that in case of up to 𝑁𝑁 = 16384 dimensional incoming matrix the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization without pivoting, both in terms of time and performance, exceeds the Aasen’s factorization algorithm for 5 times and the Bunch-Kaufmann factorization algorithm – for 3 times. Where as the Bunch-Kaufmann algorithm exceeds the Aasen’s algorithm for 3 times. Increasing the incoming matrix up to 𝑁𝑁 = 35840 dimension the Bunch-Kaufmann algorithm is approaching the factorization without pivoting, and it turns out that they exceed the Aasen’s factorization algorithm for 3 times both in terms of time and performance. Figures 3 and 4 show the time and performance graphs of the mentioned three factorizations for complex double-precision. 0 20 40 60 80 100 120 140 0 10000 20000 30000 40000 Aasen Bunch-Kauf trf_nopiv N Ti m e (s ec on ds ) 0 200 400 600 800 1000 1200 1400 1600 1800 0 10000 20000 30000 40000 Aasen Bunch-Kauf trf_nopiv N G flo p/ s H.Astsatryan and E. Gichunts 73 Fig. 3. Complex Double Precision Fig. 4. Complex Double Precision And in the case of complex double precision at most 𝑁𝑁 = 23552 -dimensional matrix can be sent to Tesla K40c graphical processor. The experiment results show that in this casethe 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization without pivoting in terms of time and performance exceeds the Aasen’s factorization algorithm for 5 times and the Bunch-Kaufmann factorization algorithm – for 2 times, whereas the Bunch-Kaufmann algorithm, in terms of both time and performance, exceeds the Aasen’s algorithm for 2.5 times. 4. Conclusion We presented the performances of complex Hermitian matrix 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorizations in CPU/GPU hybrid architecture. We also presented the performance results of factorizations with Aasen’s algorithm, Bunch-Kaufmann algorithm and 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization without pivoting using MAGMA library. Based on the obtained results we came to the conclusion that the 𝐿𝐿𝐿𝐿𝐿𝐿𝑇𝑇 factorization without pivotingis in a leading position by its high performance. In case of complex single 0 10 20 30 40 50 60 70 80 90 0 10000 20000 30000 Aasen Bunch-Kauf trf_nopiv N Ti m e (s ec on ds ) 0 200 400 600 800 1000 1200 0 10000 20000 30000 Aasen Bunch-Kauf trf_nopiv N G flo p/ s Performanses of Fuctorizations of Complex Hermitian Matrices in the Architecture of GPU Accelerator 74 precision it equalsthe Bunch-Kaufmann algorithm along with the increase of the incoming matrix, and in the case of complex double precision it exceeds for 2 times. It exceeds the Aasen’s factorization algorithm for 5 times in complex single and double precisions cases. The latter concedes the Bunch-Kaufmann algorithm in both cases for 2.5-3 times. References [1] J. Aasen, “On the reduction of a symmetric matrix to tridiagonal form”, BIT 11, pp. 233- 242, 1971. [2] M. Rozlo˘zn´ık, G. Shklarski and S. Toledo, “Partitioned triangular tridiagonalization”, ACM Trans. Math. Softw.,vol. 37, no. 4, pp. 1–16, 2011. [3] G. Golub and Ch. Van Loan, Matrix Computations, John Hopkins University Press, Baltimore, second edition, 1989. [4] J. R. Bunch and B. N. Parlett,“Direct methods for solving symmetric indefinite systems of linear equations”, SIAM J. Numerical Analysis, vol. 8,pp. 639-655, 1971. [5] J. R. Bunch and L. Kaufman,“Some stable methods for calculating inertia andsolving symmetric linear systems”, Mathematics of Computation, vol. 31,pp. 163-179, 1977. [6] C. Ashcraft, R. G. Grimes and J. G. Lewis,“Accurate symmetric indefinite linear equation solvers”, SIAM J. Matrix Anal. and Appl., vol. 20, no. 2, pp. 513-561, 1998. [7] E. Anderson, Z. Bai, J. J. Dongarra, A. Greenbaum, A. McKenney, J. DuCroz, S. Hammarling, J. W. Demmel, C. Bischof and D. Sorensen, “LAPACK:a portable linear algebra library for high-performance computers”, Proceedings of the1990 ACM/IEEE conference on Supercomputing. [8] C. Anderson et al.,LAPACK User's Guide, SIAM Press, Philadelphia,1992. [9] P. E. Strazdins. A dense complex symmetric indefinite solver for theFujitsu AP3000, Technical Report TR-CS-99-01, Computer ScienceDept, Australian National University, May 1999. Submitted 03.10.2016, accepted 27.01.2017. Կոմպլեքս Հերմիտյան մատրիցների ֆակտորիզացիաների արտադրողականությունները GPU արագագործչի ճարտարապետությունում Հ. Ասցատրյան և Է. Գիչունց Ամփոփում Գծային հանրահաշվի LDLT ֆակտորիզացիաները շատ կարևոր են և մեծ կիրառություն ունեն գիտական և ինժեներական հաշվարկներում: Այդ ֆակտորիզացիաների դասին են պատկանում Բանչ-Կաուֆմանի և Աասենի ալգորիթմներով ֆակտորիզացիաները, ինչպես H.Astsatryan and E. Gichunts 75 նաև առանց պտույտի LDLT ֆակտոիզացիան: Այս աշխատանքում ներկայացված են նշված երեք ֆակտորիզացիաների իրականացումները հիբրիդային ճարտարապետությունում, և տրված են արտադրողականություններ NVIDIA Tesla K40c գրաֆիկական պրոցեսորի վրա MAGMA գրադարանի կիրառմամբ կոմպլեքս Հերմիտյան մատրիցների համար: Производительности факторизации комплексных Эрмитовых матриц в архитектуре ускорителя GPU Г. Асцатрян и Э. Гичунц Аннотация LDLT факторизации линейной алгебры очень важны и широко используются в научных и инженерных расчетах. К этому классу относятся факторизации с алгоритмами Банча-Кауфмана и Аасена, а также факторизация LDLT без поворота. В этой работе представлены реализации упомянутых трех факторизаций в гибридной архитектуре, а также представлены производительности на графическом процессоре NVIDIA Tesla K40c с использованием библиотеки MAGMA для комплексных Эрмитовых матриц.