Начиная с начала 2000 года осуществляется внедрение GHIS в здравоохранении, в рамках принятого проекта о реформирование информ Mathematical Problems of Computer Science 44, 93--100, 2015. Performance Comparisons of Eigensolutions Algorithms of a Symmetric Tridiagonal Matrix on GPU Accelerators Hrachya V. Astsatryan and Edita E. Gichunts Institute for Informatics and Automation Problems of NAS RA e-mail: hrach@sci.am, editagich@ipia.sci.am Abstract While finding the solutions of Hermitian matrix eigenproblem it is a key issue to find an efficient version of algorithms of symmetric tridiagonal solutions. In this paper these algorithms are compared for complex Hermitian matrices in hybrid systems. The methods were carried out on the Tesla C1060 and Tesla K40 GPU accelerators and the performances are presented as between the methods, as well as between the accelerators. Keywords: GPU accelerator; hybrid architectures, tridiagonal matrices, MRRR, D&C, BI, algorithm. 1. Introduction Finding the matrix eigenvalues and eigenvectors is one of the central issues in linear algebra. In particular, the solutions to the eigenproblem are in LaPACK, ScaLAPACK and PLaPACK packages, in the systems with general and distributed memory, respectively. Matrix eigensolutions can achieve a higher performance through MAGMA library in hybrid architecture on GPU accelerators. The aim of MAGMA [1,2] library is the realization of LaPACK library sub-programs in the architecture of hybrid systems. Due to the development of productivity of ranking algorithms, this problem is overcome in hybrid architecture. Finding of eigenproblem solutions of Hermitian matrix, as a rule, takes place through the following three stages: 1. Through the Householder transformation the matrix is reduced to a tridiagonal form. 2. The tridiagonal matrix solutions are found through one of the following algorithms:  QR iteration [3,4] ,  Bisection for the eigenvalues and inverse iteration for the eigenvectors(BI) [5,6],  Divide & Conquer method (D&C) [7,8], 1. Back transformation to find the eigenvectors for the full problem from the eigenvectors of the tridiagonal problem. 93 Performance Comparisons of Eigensolutions Algorithms of a Symmetric Tridiagonal Matrix on GPU 94 In fact, the algorithm performance may depend on the matrix, platform, and on the underlying linear algebra libraries BLAS, LaPACK, ATLAS and MAGMA. MAGMA library is used to realize the projects on GPU accelerator. The reduction of matrix to tridiagonal form in all cases is carried out through the Householder transformation, as well as through magma_chetrd function of MAGMA library. It should also be noted that in all releases of MAGMA library for complex Hermitian matrices the eigenproblem solving function with QR iteration is missing. Therefore, only the remaining (D & C), (MRRR), (BI) three cases will be considered. In this work the solutions of Hermitian problem are presented by means of three methods on GPU accelerators. Moreover, the performance comparisons for three algorithms are presented as between the methods, as well as between the accelerators. Section 2 briefly presents the mentioned algorithms. Section 3 states about the resources required for the implementation of programs. Sections 4 and 5 give the results of the mentioned three algorithms on GPU accelerators for both standard and generalized forms of eigensolutions of complex Hermitian matrices, respectively. Section 6 covers the conclusion of the obtained results. 2. Description of Algorithms 2.1.Divide and conquer. The divide-and-conquer method can be described in terms of a binary tree where each node corresponds to a submatrix and its eigenpairs, obtained through recursively dividing the matrix in halves; see the exposition in [12]. The tree is processed bottom up, starting with submatrices of size 25 or smaller. DC uses QR to solve the small eigenproblems and then computes the eigenpairs of a parent using the already computed eigenpairs of the children. A parent’s eigenvalues can be computed as solutions of a secular equation. The eigenvector computation consists of two steps. The first one is a relatively inexpensive scaling step. The second one, which is most of the work, multiplies the eigenvectors of the current matrix by the eigenvector matrix accumulated so far. This step uses the level 3 BLAS (BLAS 3) routine GEMM (dense matrix-matrix multiply). In the worst case, DC is an O(n3) algorithm. 2.2.Multiple relatively robust representations. Since its introduction in the 1990s, much has been written about the MRRR algorithm, its theoretical foundation is discussed in several publications [ 13, 14, 15, 16] and practical aspects of efficient and robust implementations are discussed in [17, 18, 19, 20, 21, 22]. One could say, with an implementation of the algorithm in the widely used LAPACK library and the description of (parts of) the algorithm in textbooks such as [23], MRRR has become mainstream. MRRR is a sophisticated variant of inverse iteration that avoids Gram–Schmidt orthogonalization and thus becomes an O(n2) algorithm. The algorithm can be described in terms of a (generally irregular) representation tree. The root node describes the entire spectrum of T, and the children define gradually refined eigenvalue approximations. The overall complexity of the algorithm depends on the clustering of the eigenvalues. If some eigenvalues of T agree to d digits on average, then the algorithm has to do work proportional to dn2. The algorithm uses a random perturbation to ensure with high probability that eigenvalues cannot be too strongly clustered; see [24] for details. MRRR cannot make use of higher-level BLAS. 2.3.Bisection and inverse iteration. Bisection based on Sturm sequences requires O(nk) operations to compute k eigenvalues of T. If the distance between the eigenvalues is large enough (relative to ||T||), then computing the corresponding eigenvector by inverse iteration also is an O(nk) process. If, however, the eigenvalues are not well separated, Gram–Schmidt orthogonalization is employed to try to achieve numerically orthogonal eigenvectors. In this case H. Astsatryan and E. Gichunts 95 the complexity of the algorithm increases to O(nk2). In the worst case where almost all eigenvalues of T are “clustered,” the complexity can increase to O(n3). Furthermore, from the accuracy point of view this procedure is not guaranteed to be reliable; see [11, 25]. Neither bisection nor inverse iteration make use of higher-level BLAS. 3. Software Development The GPU equipment in hybrid systems, due to its unique architecture, is used as an accelerator to process the limited computational applications. The popularity of GPU-based hybrid systems started with the release of the NVIDIA Compute Unified Device Architecture (CUDA) and the extensions of industry-standard programming languages, such as C, C++ and Fortran, which made the GPUs easier to program, allowing the developers to exploit the computational power of modern GPU devices. To realize the programs, the required software is presented on Tesla C1060 and Tesla K40 GPU accelerators. The architecture of Tesla C1060 consists of 240 processor cores, using the maximum capacity of parallelization. It is endowed with a high bandwidth transmission of messages between CPU and GPU, and also has 4 GB of global memory, 512-bit GDDR3 memory interface and CUDA C programming environment. The operation system on Tesla C1060 is Ubuntu 12.04.5 LTS, and cuda4 programming environment was used for realization of programs. MAGMA 1.3.0 package was installed. lapack-3.4.2, clapack-3.2.1 and atlas-3.8.0 packages were installed for this library compilation. gcc-4.4, gfortran-4.4 and nvcc compilers were used. During the compilation a number of references of static and dynamic libraries were made in make.incfile, such as libf77blas.a, libcblas.a, libf2c.a, libcublas.so, libcudart.so, libstdc ++.so, libpthread.so, libdl.so. MAGMA 1.3.0 package contains libmagma.a and libmagmablas.a libraries. The architecture of Tesla K40 consists of 2880 CUDA processor cores. It is endowed with much higher bandwidth 288 GB/s of message transfer between CPU and GPU, having 12 GB of global memory, GDDR5 memory interface, and CUDA C programming environment. The operation system of Tesla K40is Ubuntu 14.04.2 LTS.cuda7 programming environment was used for the realization of programs. MAGMA 1.6.1 package was installed in accordance with cuda7 environment. For the compilation of MAGMA library the lapack-3.4.2, clapack-3.2.1 and atlas-3.10.0 packages were installed. gcc-4.8, gfortran-4.8, g ++ - 4.8 and nvcc compilers were used. Such references were made in make.inc file on libf77blas.a, libcblas.a, libf2c.a, libcublas.so, libcudart.so, libstdc ++. so, libpthread.so, libdl.so static and dynamic libraries. MAGMA 1.6.1 package contains libmagma.a and libmagma_sparse.a libraries. 4. Performance Results of Eigenproblem Solutions for a Standard Form Finding of eigenproblem solutions of complex Hermitian matrices for Az = λz standard form is carried out with the help of the following functions of MAGMA library:  magma_xheevdx (D&C),  magma_xheevr (MRRR),  magma_xheevx (BI). Moreover, x can be c or z in complex and double complex cases, respectively. Figures 1 and 2 show the time-dependent graphs for three methods in the case of standard form on Tesla C1060 and Tesla K40 GPU accelerators, respectively. Performance Comparisons of Eigensolutions Algorithms of a Symmetric Tridiagonal Matrix on GPU 96 Fig. 1. Tesla C1060, standard eingensolvers. Fig. 2. Tesla K40, standard eingensolvers. Since the global memory of Tesla C1060 is 4 GB, and the matrix A should be wholly moved to the global memory of GPU, hence, the maximum dimensionality of the input complex matrix can be 12288*12288. In the case of Tesla K40 GPU accelerator it can be 31744 * 31744 as its global memory is 12 GB. The results show that MRRR algorithm, in a standard form, for 2-2.5 times concedes the DC algorithm. For example, on Tesla C1060 accelerator in case of 12288*12288 maximum dimensional input matrix, the DC algorithm is carried out at gpu_time = 258sec., and in the case of MRRR algorithm it is fulfilled atgpu_time = 462sec. On Tesla K40 GPU accelerator in case of 31744*31744 maximum input-dimensional complex matrix, the DC algorithm is carried out atgpu_time = 603sec., and in case of MRRR algorithm at gpu_time = 1680sec. . Figures 3 and 4 show in a standard form case the comparisons of GPU accelerators of time- dependent DC and MRRR algorithms with equal amounts of input matrices. The maximum dimensionality of the input matrix will be equal to the possible maximum dimensionality of the matrix used on Tesla C1060. Fig. 3. DC algorithm. Fig. 4. MRRR algorithm. Obviously, Tesla C1060 much concedes the Tesla K40 by its architecture, but let’s present the results in the case of these two algorithms. For example, in case of the input matrix with 12288*12288 maximum dimensionality on Tesla C1060 accelerator, the DC algorithm is implemented atgpu_time = 258sec., whereas on Tesla K40 it is implemented atgpu_time = 50sec.. MRRR algorithm is implemented on Tesla C1060 accelerator atgpu_time = 462sec., whereas on Tesla K40 - at gpu_time = 150sec.. 0 200 400 600 800 1000 1200 1400 0 5000 10000 15000 evdx evr evx Ti m e (s ec on ds ) N 0 500 1000 1500 2000 0 20000 40000 evdx evr evx N Ti m e (s ec on ds ) 0 50 100 150 200 250 300 0 10000 20000 DC C1060 DC K40 Ti m e (s ec on ds ) N 0 100 200 300 400 500 0 10000 20000 MRRR C1060 MRRR K40 Ti m e (s ec on d) N H. Astsatryan and E. Gichunts 97 5. Performance Results of Eigenproblem Solutions for Generalized Form Finding of eigensolutions of complex Hermitian matrices for Az= λBz generalized form is carried out with the help of the following functions of MAGMA library:  magma_xhegvdx ( D&C),  magma_xhegvr (MRRR),  magma_xhegvx (BI). x can be c or z in complex and double complex cases, respectively. Figures 5 and 6 show the time-dependent graphs for three methods in the case of generalized form on Tesla C1060 and Tesla K40 GPU accelerators, respectively. Fig. 5.Tesla C1060, generalized eingensolvers Fig. 6. Tesla K40, generalized eingensolvers Since for a generalized form A and B matrices should be wholly moved to the global memory of GPU accelerators, therefore in case of Tesla C1060 the maximum dimensionalities of input matrices will be 9216*9216, and in case of Tesla K40 they will be 21504*21504. The results show that for a generalized form together with the increase in dimensionalities of input matrices the MRRR algorithm concedes the DC algorithm for 1.5-4times. For example, on Tesla C1060 accelerator in case of 9216*9216 maximum dimensional input matrices, the DC algorithm is carried out at gpu_time=150sec., and in the case of MRRR algorithm it is fulfilled at gpu_time = 265sec. On Tesla K40 GPU accelerator in case of 21504*21504 maximum input- dimensional complex matrices, the DC algorithm is carried out at gpu_time = 603sec., and in case of MRRR algorithm at gpu_time = 1005sec.. Figures 7 and 8 show in a generalized form case the comparisons of GPU accelerators of time-dependent DC and MRRR algorithms with equal amounts of input matrices. The maximum dimensionality of the input matrix will be equal to the possible maximum dimensionality of the matrix used on Tesla C1060. 0 200 400 600 800 1000 1200 1400 0 5000 10000 gvdx gvr gvx N Ti m e (s ec on ds ) 0 200 400 600 800 1000 1200 0 20000 40000 gvdx gvr gvx N Ti m e( se co nd s) Performance Comparisons of Eigensolutions Algorithms of a Symmetric Tridiagonal Matrix on GPU 98 Fig. 7. DC algorithm Fig. 8. MRRR algorithm For the generalized form the results of these two algorithms will be presented. For example, in the case of input matrices with 9216*9216 maximum dimensionality on Tesla C1060 accelerator, the DC algorithm is implemented at gpu_time = 150sec., whereas on Tesla K40 it is implemented at gpu_time = 29sec. MRRR algorithm is implemented on Tesla C1060 accelerator at gpu_time = 266sec., whereas on Tesla K40 - at gpu_time = 106sec.. 6. Conclusion The performance studies of solutions of a symmetric tridiagonal matrix were carried out on both accelerators Tesla C1060 and Tesla K40. Our assessment on this issue considers the speed of the bisection and inverse iteration (BI), the divide-and-conquer method (DC), and the method of multiple relatively robust representations of (MRRR) algorithms in complex Hermitian matrices. The conclusions are as follows:  DC and MRRR algorithms, in case of large matrices, are rather quickly than in case of BI. Test results show that the programs using BI algorithm, in both cases of standard and generalized forms, are 20 times slower than the programs using DC and MRRR algorithms.  The results showed that the MRRR algorithm, as in both cases of standard and generalized forms, as well as on both GPU accelerators is inferior to the DC algorithm. But for the implementation of DC algorithm much more workspace is required, than for the implementation of MRRR algorithm. Therefore, the choice of the algorithms DC and MRRR depends on the user, which algorithm is more efficient for the intended issue.  Comparing the work of two GPU accelerators, we see that on Tesla K40 in case of an input matrix with the same dimensionality, DC algorithm is implemented 5 times faster than on Tesla C1060, and the MRRR algorithm is implemented 3 times faster. References [1] “MAGMA Matrix Algebra on GPU and Multicore Architectures”, [Online]. Available:http://icl.cs.utk.edu/magma/, 2014. [2] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, S. Tomov, “Numerical linear algebra on emerging architectures: The 0 20 40 60 80 100 120 140 160 0 5000 10000 DC C1060 DC K40 N Ti m e (s ec on ds ) 0 50 100 150 200 250 300 0 5000 10000 MRRR C1060 MRRR K40 N Ti m e (s ec on ds ) H. Astsatryan and E. Gichunts 99 PLASMA and MAGMA projects ”, Journal of Physics: Conference Series, vol. 180, no. 1, 2009. [3] B. N. Parlett, The Symmetric Eigenvalue Problem. Englewood Cliffs, NJ: Prentice- Hall, 1980. [4] G. H.Golub, C. F. V.Loan, Matrix Computations, 3rd ed. Baltimore: The Johns Hopkins University Press, 1996. [5] I. C. Ipsen, “Computing an eigenvector with inverse iteration”, SIAM review, vol. 39, no. 2, pp. 254–291, 1997. [6] I. S. Dhillon, “Current inverse iteration software can fail”, BIT Numerical Mathematics, vol. 38, no. 4, pp. 685–704, 1998. [7] J. J. M.Cuppen, “A divide and conquer method for the symmetric tridiagonaleigenproblem”, NumerischeMathematik, vol. 36, no. 2, pp. 177–195, 1980. [8] M. Gu and S. C. Eisenstat, “A divide-and-conquer algorithm for the symmetric tridiagonaleigenproblem”, SIAM Journal on Matrix Analysis and Applications, vol. 16,no. 1, pp. 172–191,1995. [9] I. S. Dhillon,B. N. Parlett, C. Vomel, “The design and Implementation of the MRRRalgorithm”, ACM Trans. Math. Soft.,vol. 32, no. 4, pp. 533–560, 2006. [10] I. S. Dhillon and B. N.Parlett, “Multiple representations to compute orthogonaleigenvectors of symmetric tridiagonalmatrices”, Linear Algebra and its Applications,vol. 387, pp. 1–28, 2004. [11] B. N. Parlett and I. S. Dhillon, “Relatively robust representations of symmetrictridiagonals”, Linear Algebra and its Applications, vol. 309, no. 1–3, pp.121–151, 2000. [12] J. W. Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997. [13] I.Dhillon, and B. Parlett,“Multiple representations to compute orthogonal eigenvectors of symmetric tridiagonal matrices”,Linear Algebra Appl., vol. 387, pp. 1– 28. 2004. [14] I. Dhillonand B. Parlett,“Orthogonal eigenvectors and relative gaps”,SIAM J. Matrix Anal. Appl., vol.25,pp. 858–899, 2004. [15] P. Willems, On MR3-type Algorithms for the Tridiagonal Symmetric Eigenproblem and Bidiagonal SVD. PhD thesis, University of Wuppertal, 2010. [16] P. Willems and B. Lang, A Framework for the MR3 Algorithm: Theory and Implementation. Tech. Rep. 11/21, BergischeUniversit¨at Wuppertal, 2011. [17] I. Dhillon, B. Parlettand C. V¨omel,“Glued matrices and the MRRR algorithm”, SIAM J. Sci. Comput., vol. 27, no. 2, pp. 496–510, 2005. [18] I. Dhillon, B. Parlett and C. V¨omel, “The design and implementation of the MRRR algorithm”,ACM Trans. Math. Software, vol. 32, pp. 533–560, 2006. [19] O. Marques, B. Parlettand C. V¨omel, “Computations of eigenpairsubsets with the MRRR algorithm”,Numerical Linear Algebra with Applications, vol. 13, no. 8,pp. 643–653, 2006. [20] O. Marques, E. Riedy and C. V¨omel, “Benefits of IEEE-754 features in modern symmetric tridiagonaleigensolvers”,SIAM J. Sci. Comput., vol. 28, pp. 1613–1633, 2006. [21] M. Petschow and P. Bientinesi, “MR3-SMP: A symmetric tridiagonaleigensolver for multi-core architectures”,Parallel Computing, vol. 37, no. 12, pp. 795 – 805, 2011. [22] C. V¨omel, “ScaLAPACK’s MRRR Algorithm”, ACM Trans. Math. Software, vol. 37, pp 1 , 2010. [23] D. Watkins, Fundamentals of Matrix Computations. Pure and applied mathematics. John Wiley & Sons, 2010. http://iopscience.iop.org/1742-6596 Performance Comparisons of Eigensolutions Algorithms of a Symmetric Tridiagonal Matrix on GPU 100 [24] I. S. Dhillon, B. N. Parlett, and C. VЁomel, “Glued matrices and the MRRR algorithm”,SIAM J. Sci. Comput., vol27, pp. 496–510, 2005. [25] I. S. Dhillon, “Current inverse iteration software can fail”, BIT Numerical Mathematics, vol. 38 pp. 685–704, 1998. Submitted 31.08.2015, accepted 25.11.2015 Սիմետրիկ երեք անկյունագծային մատրիցի սեփական լուծումների ալգորիթմների արտադրողականությունների համեմատությունները GPU արագագործիչների վրա Հ. Ասցատրյան և Է. Գիչունց Ամփոփում Հերմիտյան մատրիցի սեփական լուծումները գտնելիս շատ կարևոր խնդիր է հանդիսանում սիմետրիկ երեքանկյունագծային մատրիցի լուծումների ալգորիթմներից արդյունավետ տարբերակի որոշումը: Աշխատանքում համեմատվում են այս ալգորիթմները կոմպլեքս Հերմիտյան մատրիցների դեպքում հիբրիդային համակարգերում: Մեթոդները կիրառվել են Tesla C1060 և Tesla K40 GPU արագագործիչների վրա և ներկայացված են արտադրողականությունները ինչպես մեթոդների միջև, այնպես էլ արագագործիչների միջև: Сравнение производительности алгоритмов вычисления собственных решений симметричных трехдиагональных матриц на графических процессорах GPU Г. Асцатрян и Э. Гичунц Аннотация При нахождении решений собственных значений и векторов эрмитовых матриц большую важность представляет проблема определения эффективного варианта из алгоритмов нахождения решений симметричных трехдиагональных матриц. В данной работе эти алгоритмы сравниваются для случая комплексных эрмитовых матриц в гибридных системах. Методы были применены на графических процессорах TeslaC1060 и TeslaK40 и представлены прозводительности как для методов, так и между графическими процессорами.