Начиная с начала 2000 года осуществляется внедрение GHIS в здравоохранении, в рамках принятого проекта о реформирование информ Mathematical Problems of Computer Science 45, 44--52, 2016. Performances of Methods for Solving a Linear System of Equations in the Architecture of GPU Accelerator Hrachya V. Astsatryan, Edita E. Gichunts Institute for Informatics and Automation Problems of NAS RA e-mail: hrach@sci.am, editagich@ipia.sci.am Abstract We consider some important issues related to the solution of linear system of equations that arise in multi-processor and graphics processing unit architecture. A more effective method for solving a linear system of equations is considered through the LU factorization. Investigations are conducted in case of general complex matrices, because for those matrices the random butterfly transformation is used. The paper presents performances of several ways of solving methods on the graphic processor NVIDIA K40c. Keywords: LU factorization, linear system of equations, Random Butterfly Transformation, GEPP, GENP, MAGMA, GPU accelerator. 1. Introduction Similar to LAPACK, MAGMA [1, 2, 3] is being built as a community effort, incorporating the newest developments in hybrid algorithms and scheduling, and aiming at minimizing synchronizations and communication in these algorithms. The goal of these efforts is to redesign the dense linear algebra algorithms in LAPACK to fully exploit the power of current heterogeneous systems of multi/manycore CPUs and accelerators, and deliver the shortest possible time to an accurate solution within the given energy constraints. Indeed, the algorithms included so far in MAGMA 1.6 manage to overcome bottlenecks associated with just multicore or GPUs, to significantly outperform the corresponding packages for any of these components taken separately. For the linear system solvers on current multicore or GPU architectures, a bottleneck in terms of communication cost and parallelism comes from the pivoting, a technique used to prevent divisions by too-small numbers in the Gaussian Elimination (GE) process. Current libraries like LAPACK implement GE using a block algorithm, which factors the input matrix by iterating over its blocks of columns (panels). Pivoting not only requires communication (or 44 H.Astsatryan, E.Gichunts 45 synchronization in a shared memory environment), but it also limits the exploitation of asynchronicity between the block operations. The solution of linear system of algebraic equations has the form AX = B, where A is a square matrix, B is either a right side vector or a matrix, consisting of columns of the right sides. X is the solutions of system equations and it is a vector if B is a vector and it is a matrix, if B is a matrix. Special block algorithms are used to solve the system equations. The system solution is divided into two parts: 1. System matrix factorization, 2. System solution through factorization. Depending on the matrix feature the case of factorization is different. The following two cases of LU factorization are used for the general matrix: 1. The LU factorization with partial pivoting and row interchanges is used to factor A as A=PLU, where P is a permutation matrix, L is a lower triangular matrix, the main diagonal elements of which are 1, and U is the upper triangular matrix 2. The LU factorization with no pivoting is used to factor A as A = LU, where L is the lower triangular unit, and U is the upper triangular one. Several ways of solutions of linear system of equations on NVIDIA K40c GPU accelerator are presented in the paper which are carried out through the two mentioned cases of LU factorization. They are presented only for general matrices, because to get high performance for them, Random Butterfly Transformation (RBT) was used which was repeatedly increasing the solution performance of the system of equations. Note that RBT has been studied in the systems with multicore [4] and distributed memory [5], but its performance has not been studied on GPU accelerator. Section 2 of the paper describes the cases of solving the linear system of equations. Section 3 presents the performances of solutions of the mentioned cases on GPU accelerator. Section 4 presents the conclusion. 2. Solution Methods 2.1 LU Factorization with Partial Pivoting The LU factorization (or decomposition) of a matrix A has the form A = PLU, where L is a unit lower triangular matrix, U is an upper triangular matrix and P is a permutation matrix. The block LU factorization algorithm [6] proceeds in the following steps: initially, a set of NB columns (the panel) is factored and a pivoting pattern is produced. Then the elementary transformations, resulting from the panel factorization, are applied in block fashion to the remaining part of the matrix (the trailing submatrix). First, NB rows of the trailing submatrix are swapped, according to the pivoting pattern. Then a triangular solve is applied to the top NB rows of the trailing submatrix. Finally, matrix multiplication of the form Aij←Aij −Aik ×Akj is performed, where Aik is the panel without the top NB rows, Akj is the top NB rows of the trailing submatrix and Aij is the trailing submatrix without the top NB rows. Then the procedure is applied repeatedly, descending down the diagonal of the matrix. The solution of linear system of equations, where the LU factorization is made with Partial Pivoting, is performed by MAGMA 1.6.1 library through cgesv and cgesv_gpu functions. The difference between these two functions is as follows: in the first case the function itself carries the matrices from the CPU to GPU and vice versa, while in the second case it is realized by the user. The solution sequence is as follows: Performances of Methods for Solving a Linear System of Equations in the Architecture of GPU Accelerator 46  A and B matrices are transferred from the CPU to the global memory of GPU.  LU factorization of the matrix A is performed through cgetrf_gpu function of MAGMA library using a partial pivoting with row interchanges.  A * X = B is solved through the function cgetrs_gpu of MAGMA library.  X derived solutions are transferred from the GPU to CPU. To implement cgetrs_gpu function, claswp subprojects of lapackf77 library and ctrsm subprojects of MAGMA library are used. The claswp routine swaps rows of the trailing submatrix according to the pivoting pattern, established in the panel factorization. This operation only performs data motion and the GPUs are very sensitive to the matrix layout in memory. In raw-major layout, threads in a warp can simultaneously access consecutive memory locations. The ctrsm routine uses the lower triangle of the NB × NB diagonal block to apply triangular solve to the block of right-hand-sides formed by the top NB rows of the trailing submatrix. An efficient implementation of this routine on a GPU is difficult due to the data-parallel nature of GPUs and small size of the solve (32 ≤ NB ≤ 288) [7]. The claswp routines are implemented in LAPACK [8], while the ctrsm routines are the part of the Basic Linear Algebra Subroutines (BLAS [9]) standard. LAPACK is an academic project and, therefore, the source code is freely distributed online. BLAS is a set of standardized routines, and it is available in commercial packages (e.g., MKL [10] from Intel, ACML [11] from AMD, ESSL [12] from IBM), in academic packages (e.g., ATLAS [13]) and also as a reference implementation in FORTRAN 77 from the Netlib software repository. 2.2 LU Factorization without Pivoting The LU factorization (or decomposition) of a matrix A consists of writing of that matrix as a matrix product A = LU, where L is the lower triangular and U is the upper triangular. It is a central kernel in linear algebra because it is commonly used in many important operations such as solving a nonsymmetric linear system, inverting a matrix, computing a determinant or an approximation of a condition number. LU decomposition is an algebraic process that transforms a matrix A into a product of a lower triangular matrix L the elements of which are only on the diagonal and below, and an upper triangular matrix U the elements of which are only on the diagonal and above determinant and the inverse of a matrix. The solution of linear system of equations, where the LU factorization is made without Pivoting, is performed by MAGMA 1.6.1 library through cgesv_rbt and cgesv_nopiv_gpu functions. Here also in the first case the function itself carries the matrices from the CPU to GPU and vice versa, while in the second case it is realized by the user. In case of xgesv_nopiv_gpu.cpp functions the solution sequence is as follows:  A and B matrices are transferred from the CPU to the global memory of GPU.  LU factorization of the matrix A is performed through cgetrf_nopiv_gpu function of MAGMA library without any pivoting.  A * X = B is solved through the function cgetrs_nopiv_gpu of MAGMA library.  X derived solutions are transferred from the GPU to CPU. To implement cgetrs_nopiv_gpu function, only the ctrsm subproject of MAGMA library is used. The cgesv_rbt function is the optimized version of the solution of linear system of equations. The GENP algorithm can be unstable due to a potentially large growth factor. This is why we systematically perform iterative refinement on the computed solution of the randomized system. Algorithm 1 describes how the iterative refinement is performed in our implementations. H.Astsatryan, E.Gichunts 47 We improve the computed solution until we reach the required accuracy or we reach a defined maximum number of iterations. Algorithm 1 Iterative refinement. Input: A the original matrix. Input:b the right hand side. Input:x the computed solution. Input:L and U the factorized form of A Input:N size of the matrix A Result:An improved solution x 1: EPS = Machine precision 2: ITERMAX = 30 3: ITER = 0 4: ANRM = ||A||∞ 5: XNRM = max|x| 6: Cte = ANRM * EPS * √N 7: r = b – Ax 8: RNRM = max|r| 9: while RNRM > XNRM and ITER < ITERMAX do 10: Solve: Ly = r 11: Solve:Uz = y 12: x = x+r 13: r = b – Ax 14: XNRM = max|x| 15: RNRM = max|r| 16: ITER = ITER + 1 17: end while The iterative refinement process is stopped if ITER > ITERMAX or for all the RHS we have: RNRM < SQRT(n)*XNRM*ANRM*EPS*BWDMAX where  ITER is the number of the current iteration in the iterative refinement process  RNRM is the infinity-norm of the residual  XNRM is the infinity-norm of the solution  ANRM is the infinity-operator-norm of the matrix A  EPS is the machine epsilon returned by SLAMCH('Epsilon')  The values ITERMAX and BWDMAX are fixed to 30 and 1.0D+00, respectively. Note that EPS is determined by xlamch("Epsilon") function of the lapackf77 library, and ANRM is determined by the xlange() function of the magmablas library. The following consecutive steps are made for the solution of this case:  A and B matrices are transferred from the CPU to the global memory of GPU.  LU factorization of the matrix A is performed through cgetrf_nopiv_gpu function of MAGMA library without any pivoting.  The cgesv_rbt function of the MAGMA library is called. Performances of Methods for Solving a Linear System of Equations in the Architecture of GPU Accelerator 48 Note that the cgesv_rbt function takes the factorized A matrix and by the cgetrs_nopiv_gpu function it gets the solutions of linear system of equations, afterwards for iterative refinement it improves the computed solution to a system of linear equations. Random Butterfly Transformation (RBT) is a randomization technique initially described by Parker and recently revisited for dense linear systems. The method for randomizing has been described in [14, 15]. It consists of a multiplicative preconditioning UTAV where the matrices U and V are chosen among a particular class of random matrices called recursive butterfly matrices. Then Gaussian Elimination with No Pivoting (GENP) is performed on the matrix UTAV and, to solve Ax = b, we instead solve (UTAV)y= UTb followed by x = Vy. The solution of linear system of equations, where the random butterfly transformation is applied on A and B matrices and the LU factorization is made without Pivoting, is implemented through the cgerbt_gpu function of the MAGMA 1.6.1 library. The implementation of this form of solution is realized in the following sequence:  We generate the random matrices U and V in packed storage on the CPU.  The matrix A and the packed representation of U and V are sent from the host memory to the device memory.  Randomization is performed on the GPU, updating A in the device memory.  Perform Partial Random Butterfly Transformation on the GPU with magmablas_cprbt() function.  We compute UTb on the GPU, Ary= UTb is solved on the GPU, followed by the solution x = Vy with magmablas_cprbt_mtv() function.  The solution is sent to the host memory. 3. Results of Experiments The experiments were conducted on NVIDIA K40c GPU. The architecture of NVIDIA K40c consists of 2880 CUDA processor cores. It is endowed with much higher bandwidth 288 GB/s of message transfer between CPU and GPU, having 12 GB of global memory, GDDR5 memory interface, and CUDA C programming environment. The operation system of K40c is Ubuntu 14.04.2 LTS. MAGMA 1.6.1 package is installed. The code is compiled using the GNU gcc version 4.8, gfortran-4.8, g ++ - 4.8 and the nvcc version 7.0 with the optimization flag -O3 and linked with the Atlas Library. Figures 1 and 2(a,b) show the time and performance schedules of methods for solving of linear system of equations. The obtained results show that the performance of solutions defined by LU factorization without pivoting is higher than that with pivoting, especially when the solutions of Random Butterfly Transformation are repeatedly endowed with high performance. The results show that the performance of solutions defined by cgesv_rbt function is the lowest of the mentioned cases but it is the optimized version of solutions because for iterative refinement it improves the computed solution to a system of linear equations. For the solutions defined by LU factorization with pivoting the cgesv_gpu function performance is higher than that of cgesv function. This is because the data have been loaded into the global memory of GPU before the function appeal. H.Astsatryan, E.Gichunts 49 Fig. 1. Fig. 2(a) 0 20 40 60 80 100 120 140 160 180 200 0 5000 10000 15000 20000 25000 cgesv cgesv_gpu cgesv_rbt cgesv_nopiv cgerbt_gpu N Ti m e (s ec on ds ) 0 20000 40000 60000 80000 100000 120000 140000 160000 180000 200000 0 10000 20000 30000 cgesv cgesv_gpu cgesv_rbt cgesv_nopiv cgerbt_gpu G flo p/ s N Performances of Methods for Solving a Linear System of Equations in the Architecture of GPU Accelerator 50 Fig. 2(b) without cgerbt_gpu. 3. Conclusion We presented methods for solving of linear system of equations on GPU accelerator where the LU factorization is performed with and without pivoting. The received performance results lead to the following conclusion that to achieve high performance in solutions of linear system of equations, Random Butterfly Transformation solution method is definitely in the first place. References [1] R. Nath, S. Tomov and J. Dongarra, “An improved MAGMA GEMM for Fermi GPUs”, International Journal of High Performance Computing Applications, vol. 24, no. 4, pp. 511–515, 2010. [2] S. Tomov, J. Dongarra and M. Baboulin, “Towards dense linear algebra for hybrid GPU accelerated manycore systems”, Parallel Computing, vol. 36(5&6), pp. 232– 240, 2010. [3] S. Tomov, R. Nath and J. Dongarra, “Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing”, Parallel Computing, vol. 36, no. 12, pp. 645–654, 2010. [4] M. Baboulin, D. Becker and J. J. Dongarra, “A parallel tiled solver for dense symmetric indefinite systems on multicore architectures”, Parallel & Distributed Processing Symposium (IPDPS), 2012. [5] M. Baboulin, D. Becker, G. Bosilca, A. Danalis and J. J. Dongarra, “An efficient distributed randomized algorithm for solving large dense symmetric indefinite linear systems”, Parallel Computing, vol. 40, no. 7 , pp. 212--223, 2014. [6] J. W. Demmel, Applied Numerical Linear Algebra, SIAM, 1997. ISBN: 0898713897 [7] J. Kurzak, P. Luszczek, M. Faverge, and J. Dongarra, “LU factorization with partial pivoting for a multicore system with accelerators”, IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1613—1621, 2013. 0 200 400 600 800 1000 1200 0 10000 20000 30000 cgesv cgesv_gpu cgesv_rbt cgesv_nopiv G flo p/ s N H.Astsatryan, E.Gichunts 51 [8] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney and D. Sorensen, LAPACK User’s Guide, SIAM, 1999, Third edition. [9] K. Goto, GotoBLAS. Texas Advanced Computing Center, University of Texas at Austin, USA. http: // www. otc. utexas. edu/ ATdisplay. jsp, 2007. [10] Intel. Math Kernel Library (MKL). http://www.intel.com/software/products/mkl/. [11] AMD. AMD Core Math Library (ACML). [Online]. Available: http://developer. amd. com/acml. jsp, 2012. [12] IBM Corporation. IBM Parallel Engineering and Scientific Subroutine Library. Guide and Reference. (GC23-3836), 1995. [13] R. C. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software. Technical Report UT-CS-97-366, University of Tennessee, December 1997. [Online]. Available: http://www.netlib.org/lapack/lawns/lawn131.ps. [14] D. S. Parker, “Random butterfly transformations with applications in computational linear algebra”, Technical Report CSD-950023, UCLA Computer Science Department, 1995. [15] D. S. Parker and B. Pierce, “The randomizing FFT: an alternative to pivoting in Gaussian elimination”, Technical Report CSD-950037, Computer Science Department, UCLA, 1995. Submitted 04.09.2015, accepted 12.01.2016 Գծային հավասարումների համակարգի լուծման մեթոդների արտադրողականությունները GPU արագագործչի ճարտարապետությունում Հ. Ասցատրյան, Է. Գիչունց Ամփոփում Մենք դիտարկում ենք գծային հավասարումների համակարգի լուծմանը վերաբերող որոշ կարևորագույն հարցեր, որոնք առաջանում են բազմապրոցեսորային և գրաֆիկական պրոցեսորային ճարտարապետությունում: Դիտարկվում է LU վերլուծության միջոցով գծային հավասարումների համակարգի լուծման մեթոդներից ավելի արդյունավետ եղանակը: Ուսումնասիրությունները կատարվում են կոմպլեքս ընդհանուր մատրիցների դեպքում, քանի որ այդ մատրիցների համար կիրառվում է թիթեռնիկի պատահական ձևափոխությունը: Աշխատանքում ներկայացվում են մի քանի մեթոդներով լուծման եղանակների արտադրողականությունները NVIDIA K40c գրաֆիկական պրոցեսորի վրա: Performances of Methods for Solving a Linear System of Equations in the Architecture of GPU Accelerator 52 Производительности методов решения систем линейных уравнений в архитектуре GPU ускорителя Г. Асцатрян, Э. Гичунц Аннотация Рассмотрeны некоторые важные вопросы, связанные с решением систем линейных уравнений, возникающих в многопроцессорных и графических процессорных архитектурах. С помощью LU факторизации рассматривается более эффективный метод для решения линейной системы уравнений. Исследования проводятся для случая общих комплексных матриц, потому что для этих матриц используется случайное преобразование бабочки. В статье представлены производительности методов нескольких способов решения на графическом процессоре NVIDIA K40c.