Acta Polytechnica 61(SI):122–134, 2021. https://doi.org/10.14311/AP.2021.61.0122
© 2021 The Author(s). Licensed under a CC-BY 4.0 licence. Published by the Czech Technical University in Prague.

TNL: NUMERICAL LIBRARY FOR MODERN PARALLEL ARCHITECTURES

Tomáš Oberhuber∗, Jakub Klinkovský, Radek Fučík
Czech Technical University in Prague, Faculty of Nuclear Sciences and Physical Engineering, Department of Mathematics, Trojanova 13, 120 00 Praha, Czech Republic
∗ corresponding author: tomas.oberhuber@fjfi.cvut.cz

Abstract. We present the Template Numerical Library (TNL, www.tnl-project.org) with native support of modern parallel architectures like multi-core CPUs and GPUs. The library offers an abstract layer for accessing these architectures via a unified interface tailored for an easy and fast development of high-performance algorithms and numerical solvers. The library is written in C++ and it benefits from template meta-programming techniques. In this paper, we present the most important data structures and algorithms in TNL together with the scalability on multi-core CPUs and the speed-up on GPUs supporting CUDA.

Keywords: Parallel computing, GPU, explicit schemes, semi-implicit schemes, C++ templates.

1. Introduction
TNL aims to be an efficient and user-friendly library for numerical simulations. To fulfill this goal, it must support modern parallel architectures like GPUs (graphical processing units) and multi-core CPUs on one hand, and offer a simple and flexible interface for the implementation of complex numerical solvers on the other. If high computational efficiency is required, we cannot follow the typical rules of object-oriented programming, which usually lead to an inefficient organization of data in the computer memory; the use of virtual methods, for example, may lower the performance of the final code. The design of TNL profits from the advantages of C++ templates. In particular, template specialization is a natural tool for generating specialized, architecture-dependent code with no run-time overhead.

There is no doubt that modern numerical libraries must support accelerators like GPUs. They provide a high memory bandwidth as well as a great computational power, which is obtained not by high clock frequencies but by a massively parallel design, which is more power efficient. Programming GPUs is easier with the CUDA framework. Nevertheless, a deep knowledge of the GPU hardware is still necessary to produce efficient code, which makes the GPUs almost inaccessible to many experts in numerical mathematics. Adding support for GPUs to existing numerical libraries is nearly impossible. The GPU architecture is so different from the CPU that most of the numerical methods and algorithms must be completely rewritten. In CUDA, we deal with two different address spaces: one associated with the CPU and the other with the GPU. Communication between them is remarkably slow and it must be fully managed by the programmer. To get the maximum performance from the GPU, the programmer must take care of the organization of data in the memory, minimize the divergence of CUDA threads, deal with the limited shared memory, and handle many other details of the GPU design [1].

In some cases, one may use the GPU for numerical computing relatively easily. An implicit time discretization of linear problems allows the linear system to be constructed once on the CPU, transferred to the GPU and solved there repeatedly by some iterative solver like CG (conjugate gradients).
Such solvers require only common linear algebraic operations implemented in libraries like CUBLAS or CUSPARSE, which are part of the CUDA toolkit. Difficulties arise with non-linear problems, where the linear system matrix must be updated in each time step. Transferring the matrix from the CPU to the GPU makes any speed-up impossible. Therefore, the matrix must be assembled on the GPU. This process requires manipulation of the underlying sparse matrix format and also an efficient access to the numerical mesh. Neither is trivial on the GPU.

Table 1 presents a comparison of TNL with several other HPC libraries like Cublas [2], Cusparse [3], Thrust [4], Kokkos [5], ViennaCL [6], PETSc [7] and Eigen [8].

Feature | TNL | Cublas | Cusparse | Thrust | Kokkos | ViennaCL | PETSc | Eigen
GPU | Yes | Yes | Yes | Yes | Yes | Yes | Weak | No
MPI | Weak | No | No | No | No | No | Yes | No
C++ | Yes | No | No | Yes | Yes | Yes | No | Yes
Parallel for | Yes | No | No | Yes | Yes | No | No | No
Parallel reduction, scan | Yes | No | No | Yes | Yes | No | No | No
Sorting | No | Yes | No | Yes | Yes | No | Yes | No
Blas 1 | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes
Blas 2 | Weak | Yes | Yes | No | Yes | Yes | Yes | Yes
Blas 3 | No | Yes | Yes | No | No | Yes | Yes | Yes
Expression templates | Yes | No | No | No | No | Yes | No | Yes
Sparse matrices | Yes | No | Yes | No | Yes | Yes | Yes | Yes
Linear systems solvers | Yes | No | Weak | No | No | Yes | Yes | Yes
Preconditioners | Weak | No | No | No | No | Yes | Yes | Weak
Nonlinear sys. solvers | No | No | No | No | No | Yes | Yes | No
Eigenvalues comp. | No | No | No | No | No | Yes | No | No
ODE systems solvers | Yes | No | No | No | No | No | Yes | No
Stencil computations | Yes | No | No | No | No | No | Yes | No
Unstructured meshes | Yes | No | No | No | No | No | Yes | No
Python interface | Weak | No | No | No | No | Yes | Yes | Yes

Table 1. Comparison of TNL with other numerical libraries.

We chose primarily libraries with GPU support. The table shows which of them have a modern interface in C++ and which of them support distributed computation using MPI. Parallel for, parallel reduction and scan are emerging programming patterns in HPC which allow one code to be written for different parallel architectures. Sorting is another common operation defined on arrays or vectors. Blas operations are well known in the HPC community. Blas is a standard library which, however, does not profit from the modern features of C++; in particular, the Blas level 1 operations can be expressed better with expression templates. Sparse matrices belong, without doubt, among the most important data structures in HPC. GPUs are very sensitive to the sparse matrix pattern. Despite the great advances in the design of sparse matrix formats for GPUs, it seems that there is no format which would dominate the others on a large class of sparse matrices. It is therefore profitable for a numerical library to offer several different sparse matrix formats. Having implemented the sparse matrix-vector multiplication, it is relatively simple to incorporate the sparse matrices into iterative linear system solvers. Efficient preconditioning for GPUs is still quite rare, though very important. A good preconditioner on the CPU can easily outperform the high memory bandwidth and computational performance of a GPU. Nonlinear system solvers, eigenvalue computations or ODE system solvers are slightly less common compared to linear system solvers; nevertheless, these algorithms belong to the core of HPC as well. Stencil computations cover a large class of numerical algorithms performed on structured rectangular grids.
Unstructured meshes are necessary for finite volume or finite element methods. Some libraries also cooperate with Python to offer a more efficient user interface.

Similar to PETSc or ViennaCL, we believe that the main advantage of TNL is that it offers a wider class of data structures and algorithms with a unified interface. For decades, numerical libraries have been developed in a very modular way, and in the end the users must combine different libraries to solve the problem at hand. This is becoming more difficult on modern architectures like GPUs. Modern features of C++ can make it easier. Our aim is to develop a library with an STL-like, consistent, templated user interface and native support of GPUs, which would make the development of HPC algorithms more efficient.

The rest of the paper is organized as follows. First, we briefly explain the basics of GPU programming (Section 2). Then, we describe some basic data structures and solvers already implemented in TNL (Section 3). Finally, we demonstrate the performance of TNL by showing the scalability on multi-core CPUs and the speed-up on GPUs (Section 4).

2. Programming GPUs
The GPU is an accelerator connected to the CPU via the PCI Express interface. It is equipped with its own memory referred to as the global memory. Though its memory bandwidth is several times higher compared to the common DDR4 memory connected to the CPU, the communication between the GPU and the CPU is substantially slow. This limits the design of algorithms for GPUs, because frequent communication between the GPU and the CPU may negatively affect the overall performance. In the case of iterative numerical solvers for PDEs, it is often necessary to copy all the required data to the GPU before the start of the iterations. The result is then copied back to the CPU for post-processing. Data in the global memory of the GPU need to be accessed in large continuous blocks (this is referred to as coalesced memory access); random access is slower by an order of magnitude. Therefore, the data must often be organized in a completely different way than on the CPU. This is also the reason why porting an older code to the GPU is so difficult.

The GPU consists of several independent multiprocessors which cannot communicate with each other. They can access the global memory, and each multiprocessor has its own shared memory which is remarkably smaller (only tens of kilobytes) than the global memory, but much faster in comparison. The shared memory can work as a cache or it can be managed by the programmer. Each multiprocessor processes 32 CUDA threads simultaneously; this group is referred to as a warp. Threads in the same warp behave like a SIMD architecture, i.e., they should follow the same instruction at the same time. If they do not, we speak of divergent threads and the efficiency is diminished due to serialization. For more details about CUDA we refer to [1].

3. Data structures and algorithms in TNL
In this section, we describe the basic data structures and algorithms implemented in TNL. In the following text, we refer to the GPU as the device and to the CPU as the host. Methods declared as __cuda_callable__ can be executed on both the device and the host. If the CUDA framework is not installed, this attribute has no effect. TNL uses template parameters; a few of them appear in the majority of the code:
• Device - can be either TNL::Devices::Host or TNL::Devices::Cuda.
It defines where the data are stored and where the related algorithms are executed.
• Real - defines the precision of the floating point arithmetic (float or double).
• Index - defines the integer type used for indexing within a data structure or algorithm (int or long int).
• Allocator - controls the allocation of memory. It can be TNL::Allocators::Host for the allocation of memory on the host system, TNL::Allocators::Cuda for the allocation of memory on the CUDA device, TNL::Allocators::CudaHost for the allocation of page-locked memory (a part of the memory that cannot be swapped out) using CUDA, and TNL::Allocators::CudaManaged for the allocation of CUDA Unified Memory (a memory space shared between the CPU and the GPU).

3.1. Arrays and vectors
Arrays are the basic structures for memory management in TNL. An array is a template TNL::Containers::Array< Value, Device, Index, Allocator > with the template parameters Value (the array element type), Device (the device where the array elements are stored), Index (the indexing type) and Allocator (controlling the memory allocation). Array provides common methods such as allocation (setSize), comparison and assignment operators, and I/O methods (binary load and save). Array elements may be manipulated by setElement, getElement and the __cuda_callable__ operator[]. The first two methods can be called from the host even for arrays allocated on the device, but they cannot be called from CUDA kernels. The last one, on the other hand, is defined as __cuda_callable__ and can be called only from CUDA kernels if the array is allocated on the device, and only from the host system if the array is allocated on the host. The method bind takes a pointer to an array of a given size or an instance of another array. An array adopted in this way is not deallocated in the Array destructor. This mechanism serves for sharing data between several arrays or for wrapping data allocated outside of TNL. It also serves for partitioning large arrays into a set of smaller ones while keeping them allocated as one continuous block of memory.

Vectors in TNL (TNL::Containers::Vector< Real, Device, Index, Allocator >) extend arrays with vector operations from linear algebra like vector addition, scalar products and norms, but also the prefix-sum (scan) [9]. The Blas Level 1 operations are handled by expression templates, which are more general and user friendly with no loss of performance.

3.2. Matrices
TNL supports dense matrices as well as several formats for sparse ones (in the namespace TNL::Matrices). All formats are available for both the host system and the CUDA device. Optimized sparse matrix formats are crucial especially for a good efficiency of the GPU solvers. The user may choose between tridiagonal and multi-diagonal matrices, the Ellpack format [10], the Sliced Ellpack format [11, 12]¹, the Chunked Ellpack format [13], the BiEll format [14] and the CSR format (which, for the GPU, is currently in an experimental form). The sparse matrix formats are usually optimized for the matrix-vector multiplication. However, the matrix construction time is also important, especially for non-linear problems where the linear system must be recomputed in each time step. In general, the insertion of a matrix element can be very expensive for the majority of the sparse matrix formats, since it may trigger global changes of the data structure or even a reallocation. Fortunately, in many applications the sparse matrix pattern does not change during the computation.
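To give a flavour of the container interface from Section 3.1, which also underlies the row-capacity vectors used in the matrix setup described next, the following minimal sketch creates vectors on the host and on a CUDA device and evaluates a vector expression. It is a hedged illustration only: the header path, the setValue initializer and the cross-device assignment shown below are assumptions based on the description above, not a verbatim account of the TNL API.

    #include <TNL/Containers/Vector.h>   // assumed header location

    using HostVector = TNL::Containers::Vector< double, TNL::Devices::Host, int >;
    using CudaVector = TNL::Containers::Vector< double, TNL::Devices::Cuda, int >;

    int main()
    {
       // allocate and initialize a vector with 1000 elements on the host
       HostVector u;
       u.setSize( 1000 );
       u.setValue( 1.0 );            // assumed convenience initializer

       // element access from the host side
       u.setElement( 0, 3.14 );
       [[maybe_unused]] const double first = u.getElement( 0 );   // == 3.14

       // the same container type parametrized by the CUDA device;
       // the assignment is assumed to copy the data to the GPU
       CudaVector v;
       v = u;

       // expression templates: the whole right-hand side is evaluated
       // in a single kernel, without temporary vectors
       CudaVector w;
       w.setSize( 1000 );
       w = 2.0 * v + v;

       return 0;
    }

Because Device is only a template parameter, the same user code can be compiled for the host or for the GPU, which is the mechanism that lets TNL avoid maintaining two separate code bases.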
In TNL, the matrix assembling process proceeds in two steps:

(1.) Matrix format meta-data initiation is based on the information about the number of non-zero elements in each row (we refer to it as the compressed row lengths or matrix row capacities). The user calls the method setCompressedRowLengths, which accepts a vector having the same size as the number of matrix rows. The i-th element of the vector gives the number of non-zero elements in the i-th row. The matrix format is initialized based on this information. Most importantly, it means that the necessary memory for each matrix row is allocated. If the matrix pattern does not change, this method can be called only once.

(2.) Matrix elements insertion consists of the insertion of the non-zero matrix elements. Since the necessary memory for each row is already allocated, this can be done in parallel. Each row can only hold as many non-zero elements as its capacity allows.

¹ In [11], we referred to the Sliced Ellpack format as Row-grouped CSR. It is, however, almost identical to the Sliced Ellpack format from [12] and, since it uses padding zeros, the name Ellpack seems more appropriate.

3.3. Numerical meshes
Currently, TNL supports regular orthogonal structured grids and unstructured conforming homogeneous meshes. In this paper, we deal only with the regular grids. Computational results obtained with the multidimensional mixed-hybrid finite element method on both structured and unstructured meshes were presented in [15]. The regular orthogonal grids are one, two and three dimensional - TNL::Meshes::Grid< Dimension, Real, Device, Index >. The template parameter Dimension corresponds to the grid dimension; the others are described at the beginning of Section 3. The grid consists of mesh entities. A mesh entity with topological dimension 0 is a vertex; a mesh entity with the same topological dimension as the mesh itself is a cell. Each cell has a boundary made of faces, i.e., mesh entities having topological dimension Dimension-1. In 3D, each face has a boundary consisting of 1-dimensional edges. Within all mesh entities having the same dimension, each mesh entity has a unique index and coordinates, as depicted in Figure 1. In 2D, we have two kinds of faces - those parallel to the x-axis (marked with light gray) and those parallel to the y-axis (marked with dark gray). They can be distinguished by their index - the faces parallel to the x-axis are indexed first (indexes 0–11), followed by the faces parallel to the y-axis (indexes 12–23) - and by their orientation defined by the unit normal. The same holds similarly for edges in 3D, whose orientation is given by the axis they are parallel to. A mesh entity may be obtained using the method getEntity defined in TNL::Meshes::Grid. All information necessary for the implementation of a numerical scheme is accessible via the MeshEntity, especially the mesh entity center, proportions, measure and the indexes of its neighbor entities. There are several template specializations for each type of mesh entity, depending on what information is precomputed and stored for repetitive use in the numerical scheme; for cells, in particular, the indexes of the neighboring cells can be obtained directly from the corresponding MeshEntity specialization.

3.4. Solvers
TNL provides solvers for ODEs and systems of ODEs arising from the method of lines. Currently, the user may choose between the first order accurate Euler solver and the fourth order Runge-Kutta-Merson solver [16] with adaptive time stepping.
Both may run on the GPU. Systems of linear equations can be solved by several Krylov subspace methods like CG, BiCGStab, GMRES and TFQMR (which may be executed on the GPU as well [17]) or by the stationary SOR method (which does not run on the GPU yet).

Figure 1. A 2D grid consisting of 3 × 3 cells, faces with gray box labels (vertical faces in dark gray, horizontal faces in light gray) and vertexes with white rounded box labels. Coordinates are depicted on the top, indexing on the bottom.

4. Solution of parabolic PDEs
The performance of the presented data structures and algorithms is demonstrated on the solution of the heat equation. The reason for choosing this simple problem is that, especially with the explicit time discretization and a zero right-hand side f(x, t) = 0, we get a low computational intensity, which is the worst case for the GPU and thus gives the lower bound of the speed-up.

We consider the domain Ω ≡ [0, 1]^2, the initial condition u_ini ∈ C^2(Ω) and the Dirichlet boundary conditions g ∈ C^2(Ω × (0, T]) defined as:

∂u(x, t)/∂t − ∆u(x, t) = f(x, t)   in Ω × (0, T],      (1)
u(x, 0) = u_ini(x)   in Ω,                             (2)
u(x, t) = g(x, t)   on ∂Ω × (0, T].                    (3)

The solution u is approximated at the vertexes of the 2D grid. Let h be the space step and N the number of cells along the x and y axes such that h = 1/N. Let ω_h = {(ih, jh) | 1 ≤ i, j ≤ N − 1} denote the set of interior grid vertexes and let

ω̄_h = {(ih, jh) | 0 ≤ i, j ≤ N}      (4)

stand for the set of all grid vertexes. Then, by ∂ω_h = ω̄_h \ ω_h, we denote the boundary vertexes. For a function u : Ω → R, we define its projection on ω̄_h as u_{ij} = u(ih, jh). For the interior vertexes from ω_h, we define the approximation of the Laplace operator ∆_h as follows:

∆_h u((ih, jh), ·) ≈ (u_{i+1,j} + u_{i−1,j} + u_{i,j+1} + u_{i,j−1} − 4u_{ij}) / h^2 = ∆_h u_{ij}.

The explicit time discretization is done using the method of lines, which leads to the following system of ODEs

d/dt u_{ij}(t) = ∆_h u_{ij}(t) + f_{ij}(t),      (5)

at the interior vertexes, and u_{ij}(t) = g_{ij}(t) at the boundary vertexes. This system can be solved by the Runge-Kutta method. For the semi-implicit time discretization², we introduce a time step τ and denote u^k_{ij} := u(ih, jh, τk). With this notation, the semi-implicit scheme is given by a system of linear equations

(u^{k+1}_{ij} − u^k_{ij}) / τ − (u^{k+1}_{i+1,j} + u^{k+1}_{i−1,j} + u^{k+1}_{i,j+1} + u^{k+1}_{i,j−1} − 4u^{k+1}_{ij}) / h^2 = f^{k+1}_{ij},      (6)

at the interior vertexes and

u^{k+1}_{ij} = g^{k+1}_{ij}      (7)

at the boundary vertexes. Both (6) and (7) can be written in the form of a linear system

A x^{k+1} = b^k,      (8)

where x^{k+1}_{iN+j} ≡ u^{k+1}_{ij} and, for (i, j) ∈ ω_h,

A_{iN+j, (i−1)N+j} = −λ,
A_{iN+j, iN+j−1} = −λ,
A_{iN+j, iN+j+1} = −λ,
A_{iN+j, (i+1)N+j} = −λ,
A_{iN+j, iN+j} = 1 + 4λ,
b^k_{iN+j} = τ f^{k+1}_{ij} + u^k_{ij},

where we set λ = τ/h^2. The Dirichlet boundary conditions (3) are approximated at (i, j) ∈ ∂ω_h as A_{iN+j, iN+j} = 1 and b^k_{iN+j} = g^{k+1}_{ij}. The discretization of the 1D and 3D problems is done in the same way.

² The scheme given by (6)–(7) is, in fact, implicit. In general, implicit schemes for non-linear parabolic PDEs involve the Newton method, which is not implemented in TNL yet. Such problems must be discretized by semi-implicit schemes.
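To make the structure of the system (8) concrete, the following sketch assembles the matrix A and the right-hand side b^k for the 2D scheme (6)–(7) row by row, using the two-phase matrix setup from Section 3.2 (row capacities first, element insertion afterwards). It is written against a generic matrix type; the method names follow the description in the text, but the exact signatures should be treated as assumptions rather than the definitive TNL interface.

    #include <vector>

    // M is the number of grid vertexes per dimension (M = N + 1), so the system
    // has M*M unknowns indexed row-major as row = i*M + j. The method names
    // setCompressedRowLengths and setElement follow the text; their exact
    // signatures are assumptions.
    template< typename Matrix, typename Vector, typename Real, typename Index >
    void assembleHeatSystem2D( Matrix& A, Vector& b,
                               const Vector& uk,   // u^k at all vertexes
                               const Vector& fk1,  // f^{k+1} at all vertexes
                               const Vector& gk1,  // g^{k+1} at all vertexes
                               Index M, Real tau, Real h )
    {
       const Real lambda = tau / ( h * h );

       // Phase 1: row capacities - 5 non-zeros per interior row, 1 per boundary row.
       std::vector< Index > rowCapacities( M * M );
       for( Index i = 0; i < M; i++ )
          for( Index j = 0; j < M; j++ ) {
             const bool boundary = ( i == 0 || j == 0 || i == M - 1 || j == M - 1 );
             rowCapacities[ i * M + j ] = boundary ? 1 : 5;
          }
       A.setCompressedRowLengths( rowCapacities );  // allocates the storage per row

       // Phase 2: element insertion - no reallocation can occur any more, so in TNL
       // this loop could run in parallel on the device (written serially here).
       for( Index i = 0; i < M; i++ )
          for( Index j = 0; j < M; j++ ) {
             const Index row = i * M + j;
             if( i == 0 || j == 0 || i == M - 1 || j == M - 1 ) {
                A.setElement( row, row, 1.0 );       // Dirichlet boundary row
                b[ row ] = gk1[ row ];
             }
             else {
                A.setElement( row, ( i - 1 ) * M + j, -lambda );
                A.setElement( row, row - 1,           -lambda );
                A.setElement( row, row,                1.0 + 4.0 * lambda );
                A.setElement( row, row + 1,           -lambda );
                A.setElement( row, ( i + 1 ) * M + j, -lambda );
                b[ row ] = tau * fk1[ row ] + uk[ row ];
             }
          }
    }

In Algorithm 1 below, this is the assembly step of the semi-implicit branch.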
Algorithm 1: Algorithm for the time dependent PDE solver
 1: Initialize the numerical mesh (4)
 2: Allocate degrees of freedom for u^0_{ij}
 3: Set up the initial condition u_ini
 4: Set up the numerical solver (the Runge-Kutta solver or a linear system solver)
 5: t := 0, k := 0, τ := initial time step
 6: while t < T do
 7:    τ := min{τ, T − t}
 8:    if explicit time discretization then
 9:       Evaluate the right-hand side of (5)
10:       Update u^k_{ij} by the Runge-Kutta solver to get u^{k+1}_{ij}
11:    else   /* semi-implicit time discretization */
12:       Assemble the linear system (8)
13:       Solve (8) by the linear system solver to get u^{k+1}_{ij}
14:    end if
15:    t := t + τ, k := k + 1
16:    if t is a snapshot time then
17:       Make a snapshot of the solution u^k_{ij}
18:    end if
19: end while

A numerical solver for such a parabolic problem may be written as Algorithm 1. Each of the steps in this algorithm may be non-trivial for more complex problems or parallel computations. TNL comes with a framework based on what we call a problem-solver model. On one hand, there is the problem defined by the user in the form of a (templated) class. On the other hand, there is the solver provided by TNL. The solver interacts with the problem via predefined methods and C++ type definitions.

4.1. PDE solver design
The design of the solver is depicted in Figure 2. The problem to be solved is defined and managed by three template parameters of the solver TNL::Solvers::Solver, i.e., ProblemConfig, ProblemSetter and ProblemBuild. They serve for the definition of the problem configuration parameters, the resolution of the problem template parameters and the control of the problem build process (a complete build may take a lot of time), respectively.

Figure 2. Structure of the PDE framework. The green boxes on the right are part of a problem written by the user. The blue ones on the left are the solver part implemented in TNL. The yellow boxes represent the template parameters and the violet ones stand for data.

The static method TNL::Solvers::Solver::run takes the command line arguments argc and argv. The TNL solver first calls the static method ProblemConfig::configSetup to get the definition of the configuration keywords. In the next step, the command line arguments are parsed and, based on the definition of the configuration keywords, the configuration parameters are resolved and stored in a TNL::Config::ParameterContainer container. The control is then passed to the SolverInitiator, which reads the file with the numerical mesh given by the value of the configuration keyword --mesh. It is a binary file created by the TNL tool tnl-grid-setup and it contains a binary image of the templated class Grid. In TNL, most objects may be saved in a binary format using the method save. The header of such a file contains the object type written in the C++ style, e.g., TNL::Meshes::Grid< 2, double, TNL::Devices::Host, int >. The solver initiator parses the template arguments of this object type and can thus resolve the mesh type completely. The values of its template parameters Real and Index are used as the default values for the floating point arithmetic and the indexing type in the problem. This can be changed by the configuration keywords --real-type and --index-type. Together with the argument --device-type, the solver initiator can resolve the primary template arguments (RealType, DeviceType, and IndexType) and MeshType for the problem.
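Before turning to the remaining, problem-specific template arguments, the following hedged sketch shows what an entry point built on this machinery might look like. Apart from TNL::Solvers::Solver itself, all names are hypothetical, the signature of run and the order of the template parameters are assumptions, and the third parameter mentioned above (ProblemBuild) is assumed to have a default value.

    #include <cstdlib>
    #include <TNL/Solvers/Solver.h>   // assumed header location

    // Configuration keywords of the problem (see also Section 4.2) - hypothetical.
    struct HeatEquationConfig
    {
       template< typename Config >
       static void configSetup( Config& config )
       {
          // problem-specific command-line keywords would be registered here
       }
    };

    // Resolves the secondary template arguments (numerical scheme, boundary
    // conditions, ...) and instantiates the user's Problem class - hypothetical,
    // implementation omitted.
    template< typename Real, typename Device, typename Index,
              typename MeshType, typename ConfigTag, typename SolverStarter >
    class HeatEquationSetter;

    int main( int argc, char* argv[] )
    {
       // run() parses argc/argv, reads the mesh file given by --mesh and resolves
       // RealType, DeviceType, IndexType and MeshType as described above.
       // The boolean return value is an assumption.
       if( ! TNL::Solvers::Solver< HeatEquationSetter, HeatEquationConfig >::run( argc, argv ) )
          return EXIT_FAILURE;
       return EXIT_SUCCESS;
    }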
The solver may, however, depend on some other template types like the numerical scheme or the boundary conditions. We refer to them as secondary template arguments. To resolve them, the control is passed to the ProblemSetter.

The solver starter (based on the configuration parameters) sets up the templated types for the time discretization (TimeStepper) and the related solvers - the Runge-Kutta solver for the explicit time discretization or a linear system solver for the semi-implicit scheme. Finally, the solver starter creates an instance of the class Problem, passes it to the PDESolver and starts the solver. The PDESolver loads the numerical mesh from the file and subsequently calls the methods of the class Problem, which we describe in the following section.

4.2. PDE problem structure
The class Problem representing the heat equation or a similar parabolic problem can be parametrized by four template parameters - Mesh, BoundaryConditions, RightHandSide and DifferentialOperator - defining the mesh type, the boundary conditions (3), the right-hand side of equation (1) and the differential operator in the same equation, respectively. It inherits from a templated class TNL::Problems::PDEProblem which defines the following methods (we list only the most important ones):
• setup - this method serves for the setup of the problem configuration parameters which were parsed from the command-line arguments.
• getDofs - based on the numerical mesh, the problem type (a single PDE or a system of PDEs) and the type of the mesh entities linked with DOFs (cells, faces or vertexes), this method returns the number of DOFs of the unknown mesh function (it is u_{ij}(t) in (5) or u^k_{ij} in (6) in our example). The DOFs are then allocated by the solver.
• bindDofs - this method serves for binding of the DOFs into mesh functions.
• setInitialCondition - the initial condition u_ini from (2) may be set here.
• getExplicitUpdate - this method evaluates the right-hand side of the explicit numerical scheme (5). It is called for the explicit time discretization only.
• setupLinearSystem - in the case of the semi-implicit time discretization, this method serves for the setup of the (sparse) matrix format storing the linear system (8), as described in Section 3.2. It is called only for the semi-implicit time discretization.
• assemblyLinearSystem - this method is responsible for the construction of the linear system (8). It is called for the semi-implicit time discretization only.
• makeSnapshot - this method stores the state of the time dependent problem into a file.

5. TNL tools
TNL offers several helper tools (see Figure 3). tnl-grid-setup is a simple utility for creating a file with the numerical grid definition. tnl-init serves for an easier setup of initial conditions; it produces a file with a binary image of a mesh function. After the problem solver finishes the computation and saves the solution as a sequence of binary files, these may be post-processed by tnl-view to convert the binary data to the VTK (or Gnuplot) format, or by tnl-diff to evaluate differences between mesh functions or the experimental order of convergence (EOC). tnl-image-converter can convert images to binary TNL files and vice versa. Currently, TNL supports PGM (natively), PNG (via libpng), JPEG (via libjpeg) and Dicom (via dcmtk).

6. Performance tests
Computational benchmarks were performed on an Intel Xeon E5-2640 running at 2.4 GHz. This CPU is equipped with 8 computational cores and 25 MB of L3 cache and it was connected to DDR4 memory modules.
The Turbo-boost technology was turned off only for measuring the weak scalability on the CPU; for all other computations it was turned on. The GPU solvers were tested on an Nvidia Tesla V100 with 5120 CUDA cores running at 1455 MHz and equipped with 16 GB of global memory. The heat equation (1)–(3) was solved in the domain (−1, 1)^n, where n = 1, 2, 3 denotes the dimension of the domain. The initial condition was u_ini(x) = exp(−4‖x‖^2). The final time T was set to 0.1.

Figure 3. TNL tools for computational data pre-processing and post-processing.

6.1. Explicit numerical schemes
Firstly, we test the explicit numerical scheme. The integration in time is done by the Runge-Kutta-Merson solver with the adaptive choice of the time step. The results are presented in Tables 2–4. The number of DOFs is shown in the first column. The following columns show the times of the CPU solver on one, two, four and eight cores, together with the parallel efficiency. The last two columns belong to the GPU solver: they show the time and the speed-up compared to the best time obtained on the CPU in each row. The simulations in 1D (Table 2) have too few DOFs and thus we obtain neither a good parallel scalability nor a speed-up on the GPU. In 2D (Table 3), the CPU solver scales relatively well up to four cores on larger numerical meshes. The best time is obtained on eight cores, but the parallel efficiency is lower. The GPU speed-up is more than nine on large problems. In 3D, similar results can be seen in Table 4, reporting a GPU speed-up of more than eight. Note that Turbo-boost was active on the CPU, which can affect the parallel efficiency.

The heat equation (1) with f(x, t) = 0 exhibits a low computational intensity. Therefore, Tables 2–4 demonstrate the lower bound of the GPU speed-up that can be obtained for explicit solvers in TNL. Results with arithmetically more intensive computations are presented in Tables 5–7, where we set

f(x, t) = cos(t) ( −(2a/σ^2) e^{−x^2/σ^2} + (4ax^2/σ^4) e^{−x^2/σ^2} ),      (9)

f(x, t) = cos(t) ( −(2a/σ^2) e^{(−x^2−y^2)/σ^2} + (4ax^2/σ^4) e^{(−x^2−y^2)/σ^2} − (2a/σ^2) e^{(−x^2−y^2)/σ^2} + (4ay^2/σ^4) e^{(−x^2−y^2)/σ^2} ),      (10)

and

f(x, t) = cos(t) ( −(2a/σ^2) e^{(−x^2−y^2−z^2)/σ^2} + (4ax^2/σ^4) e^{(−x^2−y^2−z^2)/σ^2} − (2a/σ^2) e^{(−x^2−y^2−z^2)/σ^2} + (4ay^2/σ^4) e^{(−x^2−y^2−z^2)/σ^2} − (2a/σ^2) e^{(−x^2−y^2−z^2)/σ^2} + (4az^2/σ^4) e^{(−x^2−y^2−z^2)/σ^2} ),      (11)

for 1D, 2D and 3D, respectively. The setup is the same as for Tables 2–4, except that for the 2D and 3D simulations (Tables 6 and 7) the final time T was set to 0.01. In this situation, we get a certain speed-up even for the 1D problem. In 2D and 3D, the speed-up reaches almost one hundred compared to the eight-core CPU time. This is an estimate of the upper bound of the speed-up one can obtain with the explicit solver.

6.2. Semi-implicit numerical schemes
Tables 8–10 show the results obtained by the semi-implicit numerical scheme (6)–(7). In this case, the majority of the time is spent solving the linear system (8). The sparse matrix A in (8) is stored in the CSR format on the CPU and in the Sliced Ellpack format on the GPU. The linear system was solved by the GMRES method on the CPU and by CWYGMRES [18] on the GPU. The use of the GPU makes no sense in 1D, where the speed-up is smaller than one, just as for the explicit solver. In 2D and 3D, the speed-up is more than 12 and 10, respectively.
Figure 4 shows graphs of the efficiency of the CPU solvers and thus demonstrates the strong parallel scalability of the different solvers for different problem sizes. The left column shows the scalability in 1D, where the problem size is usually too small for an efficient parallelization. In 2D and 3D (the second and the third column), especially for larger grid sizes, the efficiency grows. The middle row shows the results of the arithmetically more intensive explicit solver with f given by (9)–(11). The efficiency is significantly higher. It indicates that the CPU solvers are probably limited by the memory bandwidth.

The weak scalability is reported in Table 11 and Figure 5. To study the weak scalability, we increase the problem size linearly with the number of threads by prolonging the domain Ω along the x axis. The initial condition is scaled in the same way. We set the domain Ω as Ω ≡ (0, p), Ω ≡ (0, p) × (0, 1) and Ω ≡ (0, p) × (0, 1) × (0, 1) in 1D, 2D and 3D, respectively, where p denotes the number of threads. The resolution of the discrete numerical mesh ω_h is set to 10000p, 200p × 100 and 100p × 50 × 50 in 1D, 2D and 3D, respectively.

DOFs | 1 core: time (s) | 2 cores: time (s) / eff. | 4 cores: time (s) / eff. | 8 cores: time (s) / eff. | GPU: time (s) | GPU speed-up
16 | 0.003 | 0.002 / 0.75 | 0.003 / 0.25 | 0.003 / 0.12 | 0.04 | 0.05
32 | 0.003 | 0.003 / 0.5 | 0.003 / 0.25 | 0.003 / 0.12 | 0.04 | 0.07
64 | 0.005 | 0.005 / 0.5 | 0.005 / 0.25 | 0.005 / 0.12 | 0.05 | 0.1
128 | 0.015 | 0.015 / 0.5 | 0.015 / 0.25 | 0.01 / 0.18 | 0.12 | 0.08
256 | 0.06 | 0.06 / 0.5 | 0.06 / 0.25 | 0.06 / 0.12 | 0.43 | 0.13
512 | 0.28 | 0.29 / 0.48 | 0.33 / 0.21 | 0.34 / 0.10 | 1.36 | 0.20
1024 | 1.45 | 1.8 / 0.40 | 1.85 / 0.19 | 2.2 / 0.08 | 5.32 | 0.27
2048 | 8.57 | 9.111 / 0.47 | 8.5 / 0.25 | 9.38 / 0.11 | 20.95 | 0.40
4096 | 56.37 | 50.94 / 0.55 | 40.9 / 0.34 | 41.9 / 0.16 | 84.66 | 0.48

Table 2. Performance of the explicit numerical solver of (5) for the heat equation in 1D. The GPU speed-up is computed with respect to the best CPU time in each row (measured with Turbo-boost turned on).

DOFs | 1 core: time (s) | 2 cores: time (s) / eff. | 4 cores: time (s) / eff. | 8 cores: time (s) / eff. | GPU: time (s) | GPU speed-up
16² | 0.003 | 0.003 / 0.5 | 0.003 / 0.25 | 0.008 / 0.04 | 0.003 | 0.03
32² | 0.005 | 0.006 / 0.41 | 0.006 / 0.20 | 0.006 / 0.10 | 0.07 | 0.07
64² | 0.03 | 0.027 / 0.55 | 0.022 / 0.34 | 0.028 / 0.13 | 0.09 | 0.33
128² | 0.41 | 0.27 / 0.75 | 0.17 / 0.60 | 0.16 / 0.32 | 0.22 | 0.72
256² | 6.17 | 3.67 / 0.84 | 2.12 / 0.72 | 1.46 / 0.52 | 0.79 | 1.84
512² | 96 | 55.47 / 0.86 | 30.84 / 0.77 | 18.7 / 0.64 | 5.73 | 3.26
1024² | 1743 | 990.7 / 0.87 | 556.9 / 0.78 | 381.6 / 0.57 | 64.82 | 5.88
2048² | 31226 | 17627.7 / 0.88 | 10403.4 / 0.75 | 8297.8 / 0.47 | 911.11 | 9.10

Table 3. Performance of the explicit numerical solver of (5) for the heat equation in 2D. The GPU speed-up is computed with respect to the best CPU time in each row (measured with Turbo-boost turned on).

DOFs | 1 core: time (s) | 2 cores: time (s) / eff. | 4 cores: time (s) / eff. | 8 cores: time (s) / eff. | GPU: time (s) | GPU speed-up
16³ | 0.005 | 0.005 / 0.5 | 0.004 / 0.31 | 0.004 / 0.15 | 0.08 | 0.05
32³ | 0.07 | 0.049 / 0.71 | 0.031 / 0.56 | 0.02 / 0.43 | 0.08 | 0.25
64³ | 2.46 | 1.47 / 0.83 | 0.8 / 0.76 | 0.49 / 0.62 | 0.23 | 2.13
128³ | 94.32 | 53.08 / 0.88 | 30.66 / 0.76 | 23.75 / 0.49 | 3.48 | 6.82
256³ | 3050.47 | 1720.04 / 0.88 | 1005.89 / 0.75 | 827.16 / 0.46 | 103.46 | 7.99

Table 4. Performance of the explicit numerical solver of (5) for the heat equation in 3D. The GPU speed-up is computed with respect to the best CPU time in each row (measured with Turbo-boost turned on).
DOFs | 1 core: time (s) | 2 cores: time (s) / eff. | 4 cores: time (s) / eff. | 8 cores: time (s) / eff. | GPU: time (s) | GPU speed-up
16 | 0.003 | 0.003 / 0.5 | 0.003 / 0.25 | 0.003 / 0.12 | 0.02 | 0.15
32 | 0.003 | 0.003 / 0.5 | 0.003 / 0.25 | 0.003 / 0.12 | 0.05 | 0.05
64 | 0.007 | 0.007 / 0.5 | 0.007 / 0.25 | 0.007 / 0.12 | 0.07 | 0.09
128 | 0.032 | 0.032 / 0.5 | 0.03 / 0.26 | 0.03 / 0.13 | 0.11 | 0.27
256 | 0.19 | 0.2 / 0.47 | 0.2 / 0.23 | 0.2 / 0.11 | 0.36 | 0.52
512 | 1.35 | 1.37 / 0.49 | 1.5 / 0.22 | 1.5 / 0.11 | 1.37 | 0.98
1024 | 10.58 | 6.08 / 0.87 | 4.21 / 0.62 | 3.64 / 0.36 | 5.25 | 0.69
2048 | 78.26 | 43.08 / 0.90 | 26.86 / 0.72 | 20.13 / 0.48 | 20.74 | 0.97
4096 | 609.17 | 321.73 / 0.94 | 190.37 / 0.79 | 128.69 / 0.59 | 83.55 | 1.54

Table 5. Performance of the explicit numerical solver of (5) for the heat equation in 1D with f given by (9) and T = 0.1. The GPU speed-up is computed with respect to the best CPU time in each row (measured with Turbo-boost turned on).

DOFs | 1 core: time (s) | 2 cores: time (s) / eff. | 4 cores: time (s) / eff. | 8 cores: time (s) / eff. | GPU: time (s) | GPU speed-up
16² | 0.004 | 0.004 / 0.5 | 0.004 / 0.25 | 0.006 / 0.08 | 0.04 | 0.1
32² | 0.01 | 0.007 / 0.71 | 0.006 / 0.41 | 0.007 / 0.17 | 0.05 | 0.12
64² | 0.03 | 0.02 / 0.75 | 0.01 / 0.75 | 0.01 / 0.37 | 0.08 | 0.12
128² | 0.69 | 0.34 / 1.01 | 0.19 / 0.90 | 0.14 / 0.61 | 0.06 | 2.33
256² | 11.03 | 5.65 / 0.97 | 3.13 / 0.88 | 1.83 / 0.75 | 0.12 | 15.2
512² | 181.6 | 93.57 / 0.97 | 53.65 / 0.84 | 29.29 / 0.77 | 0.63 | 46.4
1024² | 2996.2 | 1557.51 / 0.96 | 865.7 / 0.86 | 481.47 / 0.77 | 6.41 | 75.1
2048² | 48965 | 25065.9 / 0.97 | 13716.7 / 0.89 | 8002.77 / 0.76 | 87.6 | 91.3

Table 6. Performance of the explicit numerical solver of (5) for the heat equation in 2D with f given by (10) and T = 0.01. The GPU speed-up is computed with respect to the best CPU time in each row (measured with Turbo-boost turned on).

DOFs | 1 core: time (s) | 2 cores: time (s) / eff. | 4 cores: time (s) / eff. | 8 cores: time (s) / eff. | GPU: time (s) | GPU speed-up
16³ | 0.03 | 0.02 / 0.75 | 0.014 / 0.53 | 0.011 / 0.34 | 0.01 | 1.1
32³ | 0.33 | 0.16 / 1.03 | 0.1 / 0.82 | 0.09 / 0.45 | 0.01 | 9
64³ | 4.5 | 2.33 / 0.96 | 1.31 / 0.85 | 0.81 / 0.69 | 0.03 | 27
128³ | 174.2 | 88.33 / 0.98 | 50.11 / 0.86 | 28.65 / 0.76 | 0.37 | 77
256³ | 6047.66 | 3071.05 / 0.98 | 1696 / 0.89 | 982.2 / 0.76 | 9.87 | 99

Table 7. Performance of the explicit numerical solver of (5) for the heat equation in 3D with f given by (11) and T = 0.01. The GPU speed-up is computed with respect to the best CPU time in each row (measured with Turbo-boost turned on).

DOFs | 1 core: time (s) | 2 cores: time (s) / eff. | 4 cores: time (s) / eff. | 8 cores: time (s) / eff. | GPU: time (s) | GPU speed-up
16 | 0.003 | 0.003 / 0.5 | 0.003 / 0.25 | 0.003 / 0.12 | 0.03 | 0.1
32 | 0.003 | 0.003 / 0.5 | 0.004 / 0.18 | 0.004 / 0.09 | 0.13 | 0.02
64 | 0.006 | 0.006 / 0.5 | 0.007 / 0.21 | 0.011 / 0.06 | 0.26 | 0.02
128 | 0.016 | 0.018 / 0.44 | 0.019 / 0.21 | 0.025 / 0.08 | 0.9 | 0.02
256 | 0.081 | 0.1 / 0.40 | 0.09 / 0.22 | 0.11 / 0.09 | 1.8 | 0.04
512 | 0.48 | 0.51 / 0.47 | 0.54 / 0.22 | 0.55 / 0.1 | 6.05 | 0.08
1024 | 3.42 | 2.98 / 0.57 | 2.83 / 0.30 | 3.57 / 0.11 | 21.6 | 0.17
2048 | 24.88 | 17.29 / 0.71 | 14.05 / 0.44 | 15.5 / 0.20 | 85.8 | 0.16
4096 | 189.7 | 114.12 / 0.83 | 79.53 / 0.59 | 76.11 / 0.31 | 342.8 | 0.22

Table 8. Performance of the implicit numerical solver (6)–(7) for the heat equation in 1D. The GPU speed-up is computed with respect to the best CPU time in each row (measured with Turbo-boost turned on).
DOFs | 1 core: time (s) | 2 cores: time (s) / eff. | 4 cores: time (s) / eff. | 8 cores: time (s) / eff. | GPU: time (s) | GPU speed-up
16² | 0.004 | 0.004 / 0.5 | 0.01 / 0.1 | 0.03 / 0.02 | 0.04 | 0.1
32² | 0.015 | 0.012 / 0.62 | 0.011 / 0.34 | 0.01 / 0.18 | 0.12 | 0.08
64² | 0.12 | 0.07 / 0.85 | 0.047 / 0.63 | 0.05 / 0.3 | 0.3 | 0.15
128² | 1.2 | 0.65 / 0.92 | 0.37 / 0.81 | 0.26 / 0.57 | 0.83 | 0.31
256² | 17.51 | 9.17 / 0.95 | 4.79 / 0.91 | 3.03 / 0.72 | 2.86 | 1.05
512² | 230.9 | 120.83 / 0.95 | 66.49 / 0.86 | 39.62 / 0.72 | 12.07 | 3.28
1024² | 3634.24 | 1928.66 / 0.94 | 1081.04 / 0.84 | 752.73 / 0.60 | 86.9 | 8.66
2048² | 59185 | 32048 / 0.92 | 18080.1 / 0.81 | 13167.5 / 0.56 | 1057 | 12.4

Table 9. Performance of the implicit numerical solver (6)–(7) for the heat equation in 2D. The GPU speed-up is computed with respect to the best CPU time in each row (measured with Turbo-boost turned on).

DOFs | 1 core: time (s) | 2 cores: time (s) / eff. | 4 cores: time (s) / eff. | 8 cores: time (s) / eff. | GPU: time (s) | GPU speed-up
16³ | 0.021 | 0.013 / 0.80 | 0.01 / 0.52 | 0.011 / 0.23 | 0.07 | 0.14
32³ | 0.46 | 0.25 / 0.92 | 0.14 / 0.82 | 0.11 / 0.52 | 0.18 | 0.61
64³ | 10.04 | 5.21 / 0.96 | 3.02 / 0.83 | 1.83 / 0.68 | 0.6 | 3.05
128³ | 240.9 | 125.11 / 0.96 | 81.24 / 0.74 | 51.46 / 0.58 | 6.54 | 7.86
256³ | 6429.6 | 3815.9 / 0.84 | 2313.79 / 0.69 | 1536.4 / 0.52 | 140.1 | 10.96

Table 10. Performance of the implicit numerical solver (6)–(7) for the heat equation in 3D. The GPU speed-up is computed with respect to the best CPU time in each row (measured with Turbo-boost turned on).

Figure 4. Graphs of the CPU efficiency – the first row represents the explicit solver from Tables 2–4, the second row the explicit solver with f given by (9)–(11) from Tables 5–7 and the last row the implicit solver from Tables 8–10. The first column contains the results of computations in 1D, the second column in 2D and the third one in 3D. The red curve is the efficiency of the simulation with two threads, the green one with four threads and the blue one with eight threads. The horizontal axes represent the grid size of the simulation and the vertical axes show the efficiency.

Figure 5. Graphs of the weak parallel scalability on the CPU – the figure on the top shows the results of the explicit solver, the one on the bottom shows the implicit solver. The red curve represents computations in 1D, the green one in 2D and the blue one in 3D. The problem size grows linearly with the number of threads; therefore, the straighter the curve is, the better parallel scalability the solver exhibits.

Threads | Explicit 1D (s) | Explicit 2D (s) | Explicit 3D (s) | Implicit 1D (s) | Implicit 2D (s) | Implicit 3D (s)
1 | 0.56 | 1.06 | 1.34 | 1.25 | 2.84 | 5.59
2 | 0.58 | 1.31 | 1.36 | 1.29 | 3.54 | 5.82
3 | 0.61 | 1.42 | 1.35 | 1.30 | 3.61 | 6.09
4 | 0.59 | 1.51 | 1.49 | 1.31 | 3.80 | 6.30
5 | 0.60 | 1.54 | 1.80 | 1.31 | 3.80 | 6.59
6 | 0.61 | 1.51 | 1.94 | 1.31 | 3.68 | 7.07
7 | 0.60 | 1.42 | 1.91 | 1.32 | 3.60 | 7.59
8 | 0.66 | 1.41 | 1.97 | 1.36 | 3.54 | 8.31

Table 11. The weak scalability on the CPU – the columns represent the CPU time (in seconds) of a simulation on a domain growing linearly with the number of threads (the first column).
The less the times grow with the number of threads, the better the weak scalability is.

7. Future work
In the future, we would like to improve the support of MPI to make computations on clusters easier, implement support of AMD GPUs via the ROCm toolkit [19] and create an interface of TNL to Julia [20]. As we mentioned before, efficient preconditioners for linear system solvers are extremely important. Currently, we are working on a more flexible implementation of sparse matrices, on adaptive numerical grids and on distributed unstructured numerical meshes.

8. Conclusion
We have presented the Template Numerical Library, TNL, for an easy development of numerical solvers on modern parallel architectures. We have described details of the TNL design and we have presented a number of numerical experiments to demonstrate the scalability of TNL on multi-core CPUs together with the speed-up on GPUs. The explicit solver achieves a speed-up of 8 to 99 depending on the arithmetic intensity. The semi-implicit solver gives a speed-up of almost 11. Both results were obtained by comparing an 8-core Intel Xeon CPU with an Nvidia Tesla V100 GPU.

Acknowledgements
The work was supported by the Ministry of Education, Youth and Sports OPVVV project no. CZ.02.1.01/0.0/0.0/16_019/0000765: Research Center for Informatics, the Large Infrastructures for Research, Experimental Development and Innovations project IT4Innovations National Supercomputing Center – LM2015070 and Project No. 18-09539S of the Czech Science Foundation.

References
[1] J. Cheng, M. Grossman, T. McKercher. Professional CUDA C Programming. Wrox, 2014.
[2] Cublas. https://developer.nvidia.com/cublas.
[3] Cusparse. https://developer.nvidia.com/cusparse.
[4] Thrust. https://developer.nvidia.com/thrust.
[5] Kokkos. https://github.com/kokkos/kokkos.
[6] ViennaCL. http://viennacl.sourceforge.net.
[7] PETSc. https://www.mcs.anl.gov/petsc.
[8] Eigen. http://eigen.tuxfamily.org/index.php.
[9] M. Harris, S. Sengupta, J. D. Owens. GPU Gems 3, chap. Parallel prefix sum (scan) with CUDA, pp. 851–876. 2007.
[10] N. Bell, M. Garland. Efficient sparse matrix-vector multiplication on CUDA. Tech. Rep. NVR-2008-004, NVIDIA Corporation, 2008.
[11] T. Oberhuber, A. Suzuki, J. Vacata. New Row-grouped CSR format for storing the sparse matrices on GPU with implementation in CUDA. Acta Technica 56:447–466, 2011.
[12] A. Monakov, A. Lokhmotov, A. Avetisyan. Automatically tuning sparse matrix-vector multiplication for GPU architectures. In HiPEAC 2010, pp. 111–125. Springer-Verlag Berlin Heidelberg, 2010.
[13] M. Heller, T. Oberhuber. Improved Row-grouped CSR Format for Storing of Sparse Matrices on GPU. In A. H. and Z. Minarechová, D. Ševčovič (eds.), Proceedings of Algoritmy 2012, pp. 282–290. 2012.
[14] C. Zheng, S. Gu, T.-X. Gu, et al. BiELL: A bisection ELLPACK-based storage format for optimizing SpMV on GPUs. Journal of Parallel and Distributed Computing 74(7):2639–2647, 2014.
[15] R. Fučík, J. Klinkovský, J. Solovský, et al. Multidimensional mixed-hybrid finite element method for compositional two-phase flow in heterogeneous porous media and its parallel implementation on GPU. Computer Physics Communications 238:165–180, 2019.
[16] T. Oberhuber, A. Suzuki, V. Žabka. The CUDA implementation of the method of lines for the curvature dependent flows. Kybernetika 47:251–272, 2011.
[17] T. Oberhuber, A. Suzuki, J. Vacata, V. Žabka.
Image segmentation using CUDA implementations of the Runge-Kutta-Merson and GMRES methods. Journal of Math-for-Industry 3:73–79, 2011.
[18] Y. Yamamoto, Y. Hirota. A parallel algorithm for incremental orthogonalization based on compact WY representation. JSIAM Letters 3(0):89–92, 2011.
[19] ROCm. https://github.com/RadeonOpenCompute/ROCm.
[20] Julia. https://julialang.org.