Acta Polytechnica CTU Proceedings 13:16–19, 2017
doi:10.14311/APP.2017.13.0016
© Czech Technical University in Prague, 2017, available online at http://ojs.cvut.cz/ojs/index.php/app

PARALLEL APPROACH TO THE DIRECT SOLUTION OF LARGE SPARSE SYSTEMS OF LINEAR EQUATIONS

Michal Bošanský∗, Bořek Patzák

Czech Technical University in Prague, Faculty of Civil Engineering, Thákurova 7, 166 29 Prague 6, Czech Republic
∗ corresponding author: michal.bosansky@fsv.cvut.cz

Abstract. The paper deals with a parallel approach to the numerical solution of large, sparse, non-symmetric systems of linear equations, which can be a part of any finite element software. In this contribution, the differences between the sequential and parallel solution are highlighted and an approach to efficiently interface the distributed-memory version of the SuperLU solver is described.

Keywords: linear equation solver, domain, parallelization, threads, distributed memory, communication.

1. Introduction

Many engineering applications lead to large-scale problems whose solution is limited by the available resources. In many cases, the parallel and distributed computing paradigms can be successfully used to enable the solution of large and complex problems that cannot be solved on a single machine with one processing unit [1]. Traditional serial computers permit running simulation codes only sequentially and often have limited memory. Parallelization can significantly reduce the computational time by making more efficient use of modern hardware. Any parallel algorithm is based on the idea of partitioning the work into a set of smaller tasks that can be solved concurrently by the simultaneous use of multiple computing resources.

Parallel computers can be classified, for example, by the type of memory architecture: shared, distributed, and hybrid memory systems exist [2]. In a shared memory system, the main memory with a global address space is shared between all processing units, which can directly address and access the same logical memory. The global memory significantly facilitates the design of a parallel program, but the memory bus performance is the limiting factor for scalability with an increasing number of processing units. In distributed memory systems, the memory is physically distributed to the individual processing units and there is no global address space. When a processor needs to access data on another processor, it is usually the task of the programmer to explicitly define how and when the data is communicated. The cost of communication compared to local memory access can be very high; on the other hand, the advantage is that the overall memory is scalable with an increasing number of processors. Hybrid systems combine the features of shared and distributed memory systems, providing global, shared memory for a reasonably small number of processing units that are combined into a distributed memory system.

The design of parallel algorithms requires the partitioning of the problem into a set of tasks that can be processed concurrently. The partitioning of the problem can be either fixed (static load balancing) or can change during the solution (dynamic load balancing) [3]. The latter option is often necessary to achieve a good load balance of the individual processors and, consequently, also reasonable scalability. The scalability of the computation is often regarded as one of the most important goals of parallel computing.
A parallel algorithm is considered scalable when, with an increasing number of processing threads, the execution time decreases monotonically, ideally in a linear trend. However, when the individual computing threads are mutually dependent, the ideal scalability is difficult to obtain due to the overhead of the parallel algorithm (the synchronization and communication between threads that is not present in serial code) and due to the fact that some parts of the overall algorithm are essentially serial. In addition to the achieved speedup, parallel computing allows the solution of large, complex problems that do not fit into a single, even well-equipped, machine.

The SuperLU library provides a parallel implementation of a direct solver [4]. It provides serial, multithreaded (shared-memory), and distributed-memory versions. The implementation and performance of the multithreaded version have been reported by the authors in [5, 6]. In this paper, the focus is on the distributed version of SuperLU, based on the Message Passing Interface (MPI) [7].

Numerical modeling using high performance computers brings new opportunities. Many engineering problems lead to the solution of a sparse linear system of equations. This is also the case of the Finite Element Method (FEM), which is used here as a representative example. In FEM, the differential equations are converted into an algebraic system of equations using variational methods, with the help of a decomposition of the problem domain into sub-domains called elements and a smart choice of interpolation functions. For the linear elasticity problem considered here, the resulting algebraic equations represent the discrete equilibrium equations at the nodes of the computational mesh,

K r = f,    (1)

where K is the global stiffness matrix, f is the global vector of external forces, and r is the vector of unknown displacements. One of the key features of the FEM is that the stiffness matrix is typically a positive definite, symmetric matrix with a sparse structure, which is a consequence of using interpolation and test functions with limited nonzero support. An efficient algorithm for solving the linear system must exploit the symmetry and the sparse structure, thereby saving considerable memory and CPU resources. A number of different storage schemes for sparse matrices exist. The most widely used are the skyline, compressed row, and compressed column formats. In this contribution, the compressed row format will be used. This choice comes from the SuperLU library, which requires the sparse matrix in this format on input.

In the compressed row format, only the nonzero entries of the sparse matrix are stored in a one-dimensional array. An additional integer array is needed to store the column indices of the stored values. In a practical implementation, the one-dimensional arrays storing the nonzero values of the individual rows are merged into a single array containing all nonzero entries, row by row. The same applies to the arrays of column indices. An additional array, with size equal to the number of rows, is needed to point to the beginning of each row record, see Fig. 1.

Figure 1. The compressed row format of a matrix: the arrays K_nzval (nonzero values), K_colind (column indices), and K_row_ptr (row start pointers).
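To make the storage scheme concrete, the following minimal C++ sketch stores a small sparse matrix in compressed row format and uses it in a matrix-vector product. The array roles mirror those of Fig. 1, but the matrix values and the 0-based indexing are illustrative assumptions and are not taken from the figure.

#include <cstdio>
#include <vector>

// Minimal compressed row (CSR) representation, 0-based indexing.
// The arrays play the roles of K_nzval, K_colind, and K_row_ptr in Fig. 1;
// the matrix values below are illustrative only.
struct CsrMatrix {
    int nrows;
    std::vector<double> nzval;   // nonzero values, stored row by row
    std::vector<int>    colind;  // column index of each stored value
    std::vector<int>    rowptr;  // start of each row in nzval/colind
};

// y = K * x
std::vector<double> multiply(const CsrMatrix& K, const std::vector<double>& x) {
    std::vector<double> y(K.nrows, 0.0);
    for (int i = 0; i < K.nrows; ++i)
        for (int j = K.rowptr[i]; j < K.rowptr[i + 1]; ++j)
            y[i] += K.nzval[j] * x[K.colind[j]];
    return y;
}

int main() {
    // 3x3 example:  [ 4 0 1 ]
    //               [ 0 2 0 ]
    //               [ 1 0 3 ]
    CsrMatrix K;
    K.nrows  = 3;
    K.nzval  = {4.0, 1.0, 2.0, 1.0, 3.0};
    K.colind = {0,   2,   1,   0,   2  };
    K.rowptr = {0, 2, 3, 5};             // rowptr has nrows + 1 entries

    std::vector<double> x = {1.0, 1.0, 1.0};
    std::vector<double> y = multiply(K, x);
    std::printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);  // expected: [5, 2, 4]
    return 0;
}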
2. Implementation of the SuperLU Interface

As noted in the previous section, the distributed memory programming model does not provide a global address space and a global memory accessible by all processing nodes; the memory is distributed among the processing units. For large problems, it is therefore essential to establish a distributed representation of the sparse matrix. This feature is supported by the SuperLU library, which allows the user to split the global sparse matrix stored in compressed row format into blocks, representing the compressed row storage of consecutive, non-overlapping groups of rows, which are distributed across the computing nodes.

We assume that the decomposition of the discretized problem domain has been established and that individual, non-overlapping sub-domains (partitions) are assigned to and stored locally on the individual computing nodes. The so-called node-cut strategy is assumed, where the cut dividing the problem domain runs through the nodes. Nodes on mutual partition boundaries are called shared nodes; the nodes inside the individual partitions are so-called local nodes. Each processing node is responsible for assembling its contribution to the global system matrix by summing up the contributions from the individual elements.

To minimize the communication, it is natural that each processing node assembles and maintains in its local memory the block of global stiffness matrix rows corresponding to the unknowns on its partition. This is uniquely defined for local nodes, which are referenced exclusively by the local elements of the individual partitions. For shared nodes, which are referenced by elements from multiple partitions, the ownership has to be defined. In this work, the partition with the lowest rank sharing a node is its owner and is thus responsible for maintaining the corresponding rows of the global stiffness matrix. It is clear that for shared nodes, the contributions to the corresponding rows have to be received from the partitions sharing the particular node. The assembly process requires establishing a global, unique numbering of the equations across the computing nodes, in addition to the local numbering on the individual partitions. The assembly of the global, distributed system matrix requires essentially two steps: the setup of the data structures required to store the blocks of compressed row records on the individual partitions, and the assembly process itself, based on the localization of the individual element contributions into the global equilibrium equations according to their connectivity.

2.1. Initialization of data structures for distributed matrix

This phase determines the required size of the arrays used to store all nonzero entries of the global matrix. The nonzero contributions of each element can be identified from the element code numbers. The code numbers simultaneously represent the numbering of the equilibrium equations and the numbering of the corresponding unknowns. The individual values in the stiffness matrix represent the nodal forces caused by the corresponding unit displacements. The entry of the element stiffness matrix representing the nodal force contributing to the k-th equilibrium equation and being multiplied by the displacement located at the l-th position of the global displacement vector should be added to position (k, l) in the global matrix. By assuming that the element contributions are full matrices, one can initialize the nonzero structure of the global stiffness matrix using the element code numbers, as illustrated by the sketch below.
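The following minimal sketch illustrates this pattern-initialization step under the assumption of full element contributions. The Element type with its codeNumbers array and the function name buildPattern are hypothetical names introduced here for illustration; they do not refer to any particular library.

#include <set>
#include <vector>

// Hypothetical element record: global code numbers (equation numbers)
// of its degrees of freedom; names are illustrative only.
struct Element {
    std::vector<int> codeNumbers;
};

// Build the compressed-row pattern (rowptr, colind) of the global matrix,
// assuming every element contribution is a full matrix, i.e. each pair
// (k, l) of element code numbers produces a nonzero at position (k, l).
void buildPattern(const std::vector<Element>& elements, int nrows,
                  std::vector<int>& rowptr, std::vector<int>& colind) {
    std::vector<std::set<int>> rowCols(nrows);  // unique column indices per row
    for (const Element& e : elements)
        for (int k : e.codeNumbers)
            for (int l : e.codeNumbers)
                rowCols[k].insert(l);

    rowptr.assign(nrows + 1, 0);
    colind.clear();
    for (int i = 0; i < nrows; ++i) {
        rowptr[i + 1] = rowptr[i] + static_cast<int>(rowCols[i].size());
        colind.insert(colind.end(), rowCols[i].begin(), rowCols[i].end());
    }
    // The value array (nzval) is then allocated with colind.size() entries
    // and zero-initialized before the actual assembly.
}

In the distributed setting described next, the same loops would be restricted to the locally maintained rows, while entries belonging to rows owned by other partitions are collected separately and later communicated to their owners.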
In the distributed case, each partition is responsible for initializing and preallocating its block structure of compressed row records. In the case of compressed row storage, the number of nonzero entries in each row has to be determined, together with the corresponding column positions of the individual nonzero entries.

Each processing node is responsible for the initialization and assembly of its assigned block of rows. This block is composed of local contributions as well as of shared-node contributions from neighboring partitions. Therefore, the overall initialization can be divided into (i) a local stage, where the local contributions to the nonzero pattern are identified (from the partition elements), and (ii) a communication stage, where contributions from neighboring partitions are received for the locally maintained rows corresponding to the equilibrium equations of shared nodes.

Each partition therefore has to keep its locally maintained block of compressed rows and also the block of compressed rows corresponding to shared nodes that are not locally maintained. These two blocks can be stored separately. In the initial stage, the block structure is initialized on every partition using only the contributions from local elements. The row entries that correspond to shared nodes which are not locally maintained are then sent to the partition maintaining the corresponding rows. After sending the data, the contributions from neighboring partitions are received, containing the contributions to the locally maintained shared equations. Non-blocking communication [8] is used to send and receive the contributions; it allows overlapping communication with computation, which is exploited in the actual implementation. After finishing the communication stage, each partition has a fully initialized data structure for its locally maintained block of rows.

2.2. Assembly of the global distributed matrix

After finishing the initialization step, in which memory has been allocated for every nonzero entry of the global distributed stiffness matrix, one can proceed with the actual assembly of the matrix from the element contributions. This is done in a similar fashion to the previous step. On each partition, the contributions from the local elements are assembled first. Then, the rows corresponding to shared nodes that are not locally maintained have to be sent to the partitions maintaining the corresponding rows. Finally, each partition has to receive the remote contributions to the locally maintained rows of shared nodes. This process is illustrated in Fig. 2.

Figure 2. Example of a simple problem consisting of three elements distributed to different computing nodes, illustrating the global sparse matrix structure and its distribution.

In this example, the problem is composed of three elements (each corresponding to a sub-domain) and eight nodes, of which five are local to the individual partitions and three are shared. A sketch of the non-blocking exchange of shared-row contributions used in both steps is given below.
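The communication stage of both the initialization and the assembly step can be realized with non-blocking MPI calls. The following C++ sketch shows one possible pattern, assuming that each partition already knows the ranks of its neighboring partitions and has packed the contributions to remotely maintained rows into per-neighbor send buffers of pre-agreed sizes; the function and buffer names are illustrative and do not come from SuperLU or the authors' implementation.

#include <mpi.h>
#include <vector>

// Exchange shared-row contributions with neighboring partitions using
// non-blocking communication, so that local work can overlap the transfer.
// sendBuf[i]/recvBuf[i] hold the packed contributions to/from neighbors[i];
// buffer sizes are assumed to have been agreed on beforehand (e.g. during
// the pattern-initialization stage). Names are illustrative only.
void exchangeSharedRows(const std::vector<int>& neighbors,
                        const std::vector<std::vector<double>>& sendBuf,
                        std::vector<std::vector<double>>& recvBuf,
                        MPI_Comm comm) {
    const int n = static_cast<int>(neighbors.size());
    std::vector<MPI_Request> requests(2 * n);

    // Post all receives first, then all sends.
    for (int i = 0; i < n; ++i)
        MPI_Irecv(recvBuf[i].data(), static_cast<int>(recvBuf[i].size()),
                  MPI_DOUBLE, neighbors[i], /*tag=*/0, comm, &requests[i]);
    for (int i = 0; i < n; ++i)
        MPI_Isend(const_cast<double*>(sendBuf[i].data()),
                  static_cast<int>(sendBuf[i].size()),
                  MPI_DOUBLE, neighbors[i], /*tag=*/0, comm, &requests[n + i]);

    // ... purely local contributions can be assembled here,
    //     overlapping computation with communication ...

    MPI_Waitall(2 * n, requests.data(), MPI_STATUSES_IGNORE);
    // recvBuf[i] now holds the remote contributions to locally
    // maintained rows and can be accumulated into the local block.
}

After MPI_Waitall returns, the received buffers are unpacked and accumulated into the locally maintained block of rows, which can then be passed to the distributed SuperLU routines.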
2.3. Assembly of the global right hand side

The process of assembling the global right-hand side vector is very similar to the process of assembling the stiffness matrix, but it is simpler in many respects, as only local vector contributions need to be assembled. The global vector is again distributed among the partitions; the distribution corresponds to the distribution of the matrix rows. The contributions from the local elements are assembled first, then those corresponding to rows maintained on remote partitions are communicated, and the contributions from the neighbors are received and the corresponding entries updated.

By finishing the steps described above, the distributed sparse matrix, represented by the locally maintained blocks of row entries on the individual partitions, can be directly passed into the SuperLU routines.

3. Conclusions

In this work, the algorithm for assembling the distributed sparse matrix in compressed row format is described, in order to develop an interface to the distributed SuperLU solver. The evaluation of the performance and scalability is the subject of ongoing work. The authors also aim to compare the efficiency of the distributed solver to the efficiency of the multithreaded SuperLU version.

Acknowledgements

This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS16/038/OHK1/1T/11 – Advanced algorithms for numerical modeling in mechanics of structures and materials.

References

[1] B. Patzak. Parallel computations in structural mechanics. České vysoké učení technické v Praze, 2010.
[2] C. Hughes, T. Hughes. Parallel and Distributed Programming Using C++. Pearson Education, 2003.
[3] B. Patzak, D. Rypl. Parallel adaptive finite element computations with dynamic load balancing. In Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering, 2009.
[4] X. S. Li, J. W. Demmel, J. R. Gilbert, et al. SuperLU users' guide program interface. http://crd.lbl.gov/~xiaoye/SuperLU/, September 1999.
[5] M. Bosansky, B. Patzak. Evaluation of different approaches to solution of the direct solution of large, sparse systems of linear equations. Advanced Materials Research 1144(1144):97–101, 2016.
[6] M. Bosansky, B. Patzak. On parallelization of linear system equation solver in finite element software. Engineering Mechanics 2017 1(1):206–209, 2017.
[7] P. S. Pacheco. A User's Guide to MPI. University of San Francisco, 1997.
[8] T. Saif, M. Parashar. Understanding the Behavior and Performance of Non-blocking Communications in MPI. Department of Electrical and Computer Engineering, Rutgers University, 2002.