Mathematical Problems of Computer Science 46, 59--65, 2016.

Performances of Methods for Solving a Linear System of Equations in the Architecture of GPU Accelerator in Case of Small Matrices

Hrachya V. Astsatryan and Edita E. Gichunts
Institute for Informatics and Automation Problems of NAS RA
e-mail: hrach@sci.am, editagich@ipia.sci.am

Abstract

Algebraic operations on large numbers of small matrices are an important issue in science. Among these operations, a prominent case is the solution of linear systems of equations by LU factorization, which has numerous applications involving small matrices. In this work we consider the performance of methods for solving linear systems of equations with batched LU factorization for small complex matrices on the NVIDIA K40c graphics processor. The versions of batched LU factorization with Partial Pivoting, without Pivoting, and with Random Butterfly Transformations are presented for small matrices, and it is shown which of these versions is the effective one, i.e., in which case high performance is achieved.

Keywords: GPU accelerator, MAGMA, linear algebra operation, LU factorization, small matrix, performance, batched computation, Random Butterfly Transformation.

1. Introduction

Numerous scientific computations require the solution of a large number of small independent problems. Since each individual problem can be too small to provide the parallelism needed by the underlying hardware, particularly in the case of accelerators, these problems should be solved simultaneously so as to supply the device with a sufficient amount of work; hence the name batched computations. Batched subprograms are designed to solve many mutually independent problems in parallel. Such an approach is required in a number of fields, for example, in astrophysics [1], metabolic networks [2], quantum chemistry [3], high-order FEM schemes for hydrodynamics [4], direct-iterative preconditioned solvers [5], and some image [6] and signal processing [7] applications.

As heterogeneous systems with GPU accelerators proliferate, a large share of linear algebra software has to perform well on small matrix operations. Graphics processors (GPUs) are high-performance processors capable of executing hundreds of floating point operations in parallel. Their large register files and high-performance scratchpad memory are well suited to the streaming execution model.

In this paper we focus on solving linear equation systems on the graphics processor via the versions of batched LU factorization. It should be noted that the experiments were carried out for general complex matrices. We consider functions for the case of small matrices that are new in the MAGMA 1.6.1 package [8] and are named batched. In our previous article [9], the methods of solving linear systems via the versions of LU factorization were presented for large matrices, and it was shown that the highest performance is achieved by the Random Butterfly Transformation method. In this paper we address the same problem for small matrices, where the batched subprograms of the MAGMA package are applied, and examine whether high performance is achieved for small matrices as well.
2. Implementation of Batched Subprograms

In a batched problem, each matrix represents a separate problem which is solved independently. The batched subprograms in MAGMA are named in accordance with the corresponding LAPACK subprograms. The MAGMA 1.6.1 package contains batched subprograms in the four standard precision arithmetics: single real, double real, single complex and double complex. In this paper we present the implementation of batched solutions of linear systems of equations for small matrices in single complex and double complex precision. As in the case of large matrices, the batched solutions of linear systems of equations for small matrices are implemented with Pivoting, without Pivoting, and with Random Butterfly Transformations.

The batched solution of a linear system of equations in which the batched LU factorization is performed with Partial Pivoting is implemented through the xgesv_batched.cpp function of the MAGMA 1.6.1 library. The solution consists of the following steps (a hedged calling sketch is given at the end of this section):

• The A and B matrices are transferred from the CPU to the global memory of the GPU.
• Batched LU factorization of the matrix A is performed through the xgetrf_batched.cpp function of the MAGMA library using Partial Pivoting with row interchanges.
• A * X = B is solved through the xgetrs_batched.cpp function of the MAGMA library.
• The derived solutions X are transferred from the GPU to the CPU.

xgetrf_batched.cpp consists of the xgetf2_batched, xlaswp_rowparallel_batched, xtrsm_outofplace_batched and xgemm subprograms.

a) xgetf2: DGETF2 computes an LU factorization of a general m-by-n matrix A using Partial Pivoting with row interchanges. The factorization has the form A = P * L * U, where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n). In each step of the LU factorization, the dgetf2 subprogram is used to factorize the m x nb panel. It consists of the three subprograms idamax, dswap and dscal of Level 1 BLAS and the dger subprogram of Level 2 BLAS. The approach to the implementation of the dgetf2 subprogram is that each panel is loaded into the shared memory of the GPU and all computations are performed there before the results are written back to the main memory (a reference implementation of this kernel is sketched at the end of this section).

b) xlaswp: DLASWP performs a series of row interchanges on a general rectangular matrix A; one row interchange is initiated for each of rows K1 through K2 of A. To increase numerical stability, the transformation with Pivoting is required, so the dlaswp subprogram is used in the pivoted version despite the fact that it yields low performance. In any case, experiments show that the transformation with Pivoting is quite expensive for the batched problem.

c) xtrsm: DTRSM solves one of the matrix equations op( A )*X = alpha*B or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit or non-unit, upper or lower triangular matrix, and op( A ) is one of op( A ) = A or op( A ) = A**T. The matrix X is overwritten on B.

d) xgemm: DGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix. The aim of the batched LU factorization is to achieve high batched dgemm performance; a schematic showing how these four kernels compose into one blocked factorization step is given below.
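To make the four-step solution sequence above concrete, the following is a minimal sketch of how the pivoted batched solver can be driven from host code. It is written against the later magma_v2-style interface for concreteness; the exact spellings and argument lists in MAGMA 1.6.1 differ slightly (check magma_zbatched.h of the installed release), so treat this as an assumption-laden illustration rather than a verbatim recipe.

```c
/* Hedged sketch: solve batchCount independent systems A_i * X_i = B_i in
 * double complex precision with the pivoted batched solver.  Written against
 * the magma_v2-style API; names and argument order in MAGMA 1.6.1 may differ. */
#include <stdlib.h>
#include "magma_v2.h"

void solve_small_systems(magma_int_t n, magma_int_t nrhs, magma_int_t batchCount,
                         magmaDoubleComplex *hA,   /* n*n*batchCount, column-major */
                         magmaDoubleComplex *hB)   /* n*nrhs*batchCount            */
{
    magma_queue_t queue;
    magma_queue_create(0, &queue);

    magmaDoubleComplex *dA, *dB;               /* contiguous device storage   */
    magmaDoubleComplex **dA_array, **dB_array; /* per-matrix device pointers  */
    magma_int_t *dipiv, *dinfo, **dipiv_array;

    magma_zmalloc(&dA, (size_t)n * n * batchCount);
    magma_zmalloc(&dB, (size_t)n * nrhs * batchCount);
    magma_imalloc(&dipiv, (size_t)n * batchCount);
    magma_imalloc(&dinfo, batchCount);
    magma_malloc((void**)&dA_array,    batchCount * sizeof(magmaDoubleComplex*));
    magma_malloc((void**)&dB_array,    batchCount * sizeof(magmaDoubleComplex*));
    magma_malloc((void**)&dipiv_array, batchCount * sizeof(magma_int_t*));

    /* Step 1: A and B are transferred from the CPU to GPU global memory. */
    magma_zsetmatrix(n, n * batchCount, hA, n, dA, n, queue);
    magma_zsetmatrix(n, nrhs * batchCount, hB, n, dB, n, queue);

    /* Build the per-matrix pointer arrays on the host and copy them over. */
    magmaDoubleComplex **hA_ptr = malloc(batchCount * sizeof(*hA_ptr));
    magmaDoubleComplex **hB_ptr = malloc(batchCount * sizeof(*hB_ptr));
    magma_int_t **hP_ptr = malloc(batchCount * sizeof(*hP_ptr));
    for (magma_int_t i = 0; i < batchCount; ++i) {
        hA_ptr[i] = dA + (size_t)i * n * n;
        hB_ptr[i] = dB + (size_t)i * n * nrhs;
        hP_ptr[i] = dipiv + (size_t)i * n;
    }
    magma_setvector(batchCount, sizeof(void*), hA_ptr, 1, dA_array, 1, queue);
    magma_setvector(batchCount, sizeof(void*), hB_ptr, 1, dB_array, 1, queue);
    magma_setvector(batchCount, sizeof(void*), hP_ptr, 1, dipiv_array, 1, queue);

    /* Steps 2-3: batched LU with Partial Pivoting, then the triangular
     * solves; the no-pivoting variant (magma_zgesv_nopiv_batched) is driven
     * analogously, minus the pivot arrays.                                  */
    magma_zgesv_batched(n, nrhs, dA_array, n, dipiv_array,
                        dB_array, n, dinfo, batchCount, queue);

    /* Step 4: the solutions X overwrite B; copy them back to the CPU. */
    magma_zgetmatrix(n, nrhs * batchCount, dB, n, hB, n, queue);

    magma_queue_sync(queue);
    magma_queue_destroy(queue);
    free(hA_ptr); free(hB_ptr); free(hP_ptr);
    magma_free(dA); magma_free(dB); magma_free(dipiv); magma_free(dinfo);
    magma_free(dA_array); magma_free(dB_array); magma_free(dipiv_array);
}
```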
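The dgetf2 panel kernel described in (a) is small enough to spell out in full. The following is a plain-C reference version (double precision, column-major), not the MAGMA GPU code itself: it mirrors the idamax/dswap/dscal/dger structure that the batched kernel applies to every panel held in shared memory.

```c
/* Reference (CPU) version of the unblocked getf2 kernel: LU factorization
 * with partial pivoting of a column-major m-by-n panel A with leading
 * dimension lda.  ipiv[j] records the (0-based) pivot row chosen at step j. */
#include <math.h>

int getf2(int m, int n, double *A, int lda, int *ipiv)
{
    int k_end = (m < n) ? m : n;
    for (int j = 0; j < k_end; ++j) {
        /* idamax: find the pivot row in column j */
        int p = j;
        for (int i = j + 1; i < m; ++i)
            if (fabs(A[i + j*lda]) > fabs(A[p + j*lda])) p = i;
        ipiv[j] = p;
        if (A[p + j*lda] == 0.0) return j + 1;   /* singular panel */

        /* dswap: interchange rows j and p across the whole panel */
        if (p != j)
            for (int k = 0; k < n; ++k) {
                double t = A[j + k*lda];
                A[j + k*lda] = A[p + k*lda];
                A[p + k*lda] = t;
            }

        /* dscal: compute the multipliers (column of L below the diagonal) */
        for (int i = j + 1; i < m; ++i) A[i + j*lda] /= A[j + j*lda];

        /* dger: rank-1 update of the trailing submatrix */
        for (int k = j + 1; k < n; ++k)
            for (int i = j + 1; i < m; ++i)
                A[i + k*lda] -= A[i + j*lda] * A[j + k*lda];
    }
    return 0;
}
```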
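To see how the four kernels fit together and why dgemm dominates, here is one right-looking blocked step in the same plain-C reference style, reusing the getf2 routine from the previous sketch. The trsm and gemm loops are naive reference versions of the Level 3 BLAS calls, not the optimized batched MAGMA kernels.

```c
/* One right-looking step of blocked LU on an n-by-n column-major matrix A,
 * starting at diagonal position j with panel width nb (j + nb <= n).
 * The cubic-cost work lands in the final gemm-style update, which is why
 * the performance of batched dgemm is decisive for batched LU.            */
void blocked_lu_step(int n, int j, int nb, double *A, int lda, int *ipiv)
{
    int m2 = n - j - nb;                    /* trailing matrix dimension */

    /* getf2: factorize the (n-j) x nb panel, pivoting within the panel */
    getf2(n - j, nb, &A[j + j*lda], lda, &ipiv[j]);

    /* laswp: apply the panel's row interchanges to the columns outside it */
    for (int i = 0; i < nb; ++i) {
        int p = ipiv[j + i];                /* pivot row, panel-relative */
        if (p != i)
            for (int k = 0; k < n; ++k) {
                if (k >= j && k < j + nb) continue;  /* panel already swapped */
                double t = A[(j+i) + k*lda];
                A[(j+i) + k*lda] = A[(j+p) + k*lda];
                A[(j+p) + k*lda] = t;
            }
    }

    /* trsm: U12 = L11^{-1} * A12, forward substitution with the unit
     * lower triangular factor stored in the panel                        */
    for (int k = 0; k < m2; ++k)
        for (int i = 1; i < nb; ++i)
            for (int l = 0; l < i; ++l)
                A[(j+i) + (j+nb+k)*lda] -= A[(j+i) + (j+l)*lda]
                                         * A[(j+l) + (j+nb+k)*lda];

    /* gemm: A22 := A22 - L21 * U12, the rank-nb trailing update that
     * contains almost all of the floating point work                     */
    for (int k = 0; k < m2; ++k)
        for (int i = 0; i < m2; ++i)
            for (int l = 0; l < nb; ++l)
                A[(j+nb+i) + (j+nb+k)*lda] -= A[(j+nb+i) + (j+l)*lda]
                                            * A[(j+l) + (j+nb+k)*lda];
}
```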
A lot of effort has been concentrated on the optimization of the dgemm subprogram. In particular, in our case dgemm is used not only in the trailing matrix updates but also in the implementation of the triangular solver (xtrsm).

The batched solution of a linear system of equations in which the batched LU factorization is performed without Pivoting is implemented through the xgesv_nopiv_batched.cpp and xgerbt_batched.cpp functions of the MAGMA 1.6.1 library. In the case of the xgesv_nopiv_batched.cpp function the solution sequence is as follows:

• The A and B matrices are transferred from the CPU to the global memory of the GPU.
• Batched LU factorization of the matrix A is performed through the xgetrf_nopiv_batched.cpp function of the MAGMA library without any Pivoting.
• A * X = B is solved through the xgetrs_nopiv_batched.cpp function of the MAGMA library.
• The derived solutions X are transferred from the GPU to the CPU.

To implement the xgetrf_nopiv_batched.cpp function, the xgetrf_panel_nopiv_batched subprogram of the MAGMA library is called, as well as the xtrsm_work_batched and xgemm_batched subprograms of the BLAS library. To implement the xgetrs_nopiv_batched.cpp function, only the xtrsm_outofplace_batched subprogram of the BLAS library is called.

The solution of a linear system of equations in which the Random Butterfly Transformation is applied to the A and B matrices and the batched LU factorization is performed without Pivoting is implemented via the xgerbt_batched.cpp function of the MAGMA 1.6.1 library. This type of solution is realized in the following sequence (the underlying algebra is summarized after this list):

• We generate the random matrices U and V in packed storage on the CPU.
• The matrix A and the packed representations of U and V are sent from the host memory to the device memory.
• Randomization is performed on the GPU, updating A in the device memory: the Partial Random Butterfly Transformation is applied with the magmablas_cprbt_batched() function.
• We compute U^T * b on the GPU; A_r * y = U^T * b is solved on the GPU, followed by the computation of the solution x = V * y with the magmablas_cprbt_mtv_batched() function.
• The solution is sent to the host memory.
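For reference, the algebra behind this sequence, as described in the Random Butterfly Transformation literature on which the MAGMA solver is based, can be written compactly. Here A_r denotes the randomized matrix that is LU-factored without pivoting; the recursive butterflies U and V used by MAGMA are, to our understanding, of depth two.

```latex
% Two-sided randomization used by the RBT solver:
\begin{align*}
  A_r     &= U^{T} A V,  &&\text{randomized matrix, LU-factored without pivoting,}\\
  A_r\, y &= U^{T} b,    &&\text{solved by the no-pivoting batched LU,}\\
  x       &= V y,        &&\text{recovery of the solution of } A x = b.
\end{align*}
% A butterfly matrix of size $n$ has the form
% $B = \tfrac{1}{\sqrt{2}}\begin{pmatrix} R_0 & R_1 \\ R_0 & -R_1 \end{pmatrix}$,
% where $R_0$ and $R_1$ are random diagonal $n/2 \times n/2$ matrices; this is
% why $U$ and $V$ can be kept in packed storage and applied in $O(n^2)$ time.
```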
3. Performance Results

The experiments were conducted on an NVIDIA K40c GPU. The NVIDIA K40c architecture provides 2880 CUDA processor cores. It offers a high memory bandwidth of 288 GB/s, has 12 GB of global memory with a GDDR5 memory interface, and is programmed in the CUDA C environment. The system hosting the K40c runs Ubuntu 14.04.2 LTS, with the MAGMA 1.6.1 package installed. The code was compiled using GNU gcc 4.8, gfortran 4.8 and g++ 4.8, and nvcc 7.0 with the optimization flag -O3, and linked with the ATLAS library.

Figures 1 and 2 show the performance results of the batched linear system solution functions for small matrices in complex single precision and complex double precision, respectively.

[Fig. 1. Complex Single Precision: performance in Gflop/s versus matrix size N for cgesv, cgesv_nopiv and cgerbt.]
[Fig. 2. Complex Double Precision: performance in Gflop/s versus matrix size N for zgesv, zgesv_nopiv and zgerbt.]

In the case of single precision, the obtained results show that the performance of the solutions based on LU factorization without Pivoting is several times higher than that of the solutions with Pivoting. Thus, for example, the performance of the cgesv_nopiv_batched function without Pivoting is twice as high as the performance of the cgesv_batched function with Pivoting, and the performance of the cgerbt_batched function with Random Butterfly Transformation and without Pivoting is 4.5 times as high as the performance of the cgesv_batched function and 2 times as high as the performance of the cgesv_nopiv_batched function.

In the case of double precision, the obtained results show that the performance of the zgesv_nopiv_batched function without Pivoting is twice as high as the performance of the zgesv_batched function with Pivoting, and the performance of the zgerbt_batched function with Random Butterfly Transformation and without Pivoting is 6.5 times as high as the performance of the zgesv_batched function and 3 times as high as the performance of the zgesv_nopiv_batched function. Also note that in the case of single precision the performance is higher than in the case of double precision for the corresponding functions with and without Pivoting.

4. Conclusion

We presented the methods for solving linear systems of equations on a GPU accelerator for small matrices with and without Pivoting. For small matrices the batched functions of the MAGMA 1.6.1 package were applied, which are new; in particular, the batched function with Random Butterfly Transformation had not been applied in such tests before. The experimental results lead to the conclusion that for small matrices high performance in the solution of linear systems of equations is achieved by applying the Random Butterfly Transformation method.

References

[1] O. E. B. Messer, J. A. Harris, S. Parete-Koon and M. A. Chertkow, "Multicore and accelerator development for a leadership-class stellar astrophysics code", in Proceedings of PARA 2012: State-of-the-Art in Scientific and Parallel Computing, 2012.

[2] A. Khodayari, A. R. Zomorrodi, J. C. Liao and C. D. Maranas, "A kinetic model of Escherichia coli core metabolism satisfying multiple sets of mutant flux data", Metabolic Engineering, vol. 25, pp. 50--62, 2014.

[3] A. A. Auer, G. Baumgartner, D. E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao, R. Harrison, S. Krishnamoorthy, S. Krishnan, C.-C. Lam, Q. Lu, M. Nooijen, R. Pitzer, J. Ramanujam, P. Sadayappan and A. Sibiryakov, "Automatic code generation for many-body electronic structure methods: the tensor contraction engine", Molecular Physics, vol. 104, no. 2, pp. 211--228, 2006.

[4] T. Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov and J. Dongarra, "A step towards energy efficient computing: Redesigning a hydrodynamic application on CPU-GPU", in IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS), 2014.

[5] Eun-Jin Im, K. Yelick and R. Vuduc, "Sparsity: Optimization framework for sparse matrix kernels", Int. J. High Perform. Comput. Appl., vol. 18, no. 1, pp. 135--158, 2004.

[6] J. M. Molero, E. M. Garzón, I. García, E. S. Quintana-Ortí and A. Plaza, "Poster: A batched Cholesky solver for local RX anomaly detection on GPUs", PUMPS, 2013.

[7] M. J. Anderson, D. Sheffield and K. Keutzer, "A predictive model for solving small linear algebra problems in GPU registers", in IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS), 2012.
[8] Matrix Algebra on GPU and Multicore Architectures (MAGMA), MAGMA Release 1.6.1, 2015. [Online]. Available: http://icl.cs.utk.edu/magma/

[9] H. V. Astsatryan and E. E. Gichunts, "Performances of methods for solving a linear system of equations in the architecture of GPU accelerator", Transactions of IIAP NAS RA, Mathematical Problems of Computer Science, vol. 45, pp. 44--52, 2016. [Online]. Available: http://mpcs.sci.am/filesimages/volumes/volume_45/06.pdf

Submitted 02.08.2016, accepted 25.10.2016.