Acta Polytechnica 53(2):70–74, 2013 © Czech Technical University in Prague, 2013, available online at http://ctn.cvut.cz/ap/

Clock Math — A System for Solving SLEs Exactly

Jakub Hladík, Róbert Lórencz, Ivan Šimeček∗
Department of Computer Systems, Faculty of Information Technology, Czech Technical University in Prague, Czech Republic
∗ corresponding author: xsimecek@fit.cvut.cz

Abstract. In this paper, we present a GPU-accelerated hybrid system that solves ill-conditioned systems of linear equations exactly. "Exactly" means without rounding errors, which is achieved by using integer arithmetic. First, we scale the floating-point numbers up to integers, then we solve dozens of SLEs within different modular arithmetics, and finally we assemble the sub-solutions back together using the Chinese remainder theorem. This approach effectively bypasses current CPU floating-point limitations. The system is capable of solving Hilbert's matrix without losing a single bit of precision, and with a significant speedup compared to existing CPU solvers.

Keywords: integer arithmetic, modular arithmetic, Hilbert's matrix, error-free, GPGPU, solver, OpenCL, optimizations.

1. Introduction
Solving systems of linear algebraic equations is quite a frequent task in numerical mathematics. Difficulties often arise when the matrix of the problem is ill-conditioned: the stability of the solution cannot be ensured for large dense sets of linear equations, and rounding errors during the numerical computation cannot be tolerated. Methods have therefore been developed that minimize the influence of rounding errors on the solution.

The method we use relies on modular arithmetic [2] in order to solve dense systems of linear equations exactly. The underlying idea is quite simple: bypass the rounding limitations of floating-point arithmetic by using integer arithmetic. It consists of three parts: converting the floating-point numbers into integers, solving multiple systems of linear equations within their modules, and finally converting the sub-solutions back using the Chinese remainder theorem.

Nowadays, GPGPU (General-Purpose computing on the GPU) is a trending topic in high-performance computing. Since GPGPU began in 2006, hundreds of articles have been published every year. GPGPU is usually used to accelerate data-parallel algorithms, which is also our case: dozens of systems of linear equations emerge during the second step of our computation, and each of them can be solved in parallel on the GPU, hence a speedup is achieved.

In this paper, we present a GPU-accelerated solver of ill-conditioned systems of linear equations. Section 3 gives a brief overview of the mathematics that is used. Section 4 describes the GPU architecture in general and presents the issues that we faced while optimizing the computation. Finally, Section 5 shows our measured results: the speedup and a comparison with existing systems solving a similar problem.

2. State of the Art
Nearly one half of the problems solved in numerical mathematics lead to problems in linear algebra. The primary objective of the numerical methods used in linear algebra is to solve sets of linear equations. They are solved in a chosen computer arithmetic, most often floating-point arithmetic. The well-known advantages of floating-point arithmetic have led to its widespread usage, yet it also has important disadvantages, sometimes resulting in severe problems.

The disadvantages of floating-point arithmetic include mainly the generation and accumulation of rounding errors during the calculation, and the non-uniform distribution of values in the represented subset of the real numbers. There are a variety of alternatives to floating-point representation and its associated arithmetic operations, including modular, logarithmic, p-adic, and continued fraction arithmetic.
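As a small illustration of rounding-error accumulation (not taken from the paper's experiments), consider the following C program, which repeatedly adds the value 0.1, a number that has no exact representation in IEEE 754 binary64:

```c
#include <stdio.h>

int main(void)
{
    /* Accumulate 0.1 one thousand times; the exact result is 100. */
    double sum = 0.0;
    for (int i = 0; i < 1000; ++i)
        sum += 0.1;

    /* The printed value differs from 100 in its last bits, because
       0.1 is not exactly representable in IEEE 754 binary64. */
    printf("sum   = %.17g\n", sum);
    printf("error = %.17g\n", sum - 100.0);
    return 0;
}
```

The computed sum differs from the exact value 100 in its last bits; in a long computation with an ill-conditioned matrix, such small errors accumulate and may destroy the solution entirely.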
Many numerical problems lead to SLEs with dense and large matrices. The rounding-error sensitivity of such systems can be so grave that they have to be considered ill-conditioned. To solve these problems, many direct and iterative numerical methods have been developed that tackle such systems with greater or lesser success.

Apart from the numerical methods, various programming and computing tools have been developed for solving rounding-error sensitive SLEs. There are many computer libraries that allow the user to choose an optimal virtual length of the computer word, which fully or partially eliminates the destructive influence of rounding errors. There are also arithmetic units that tackle this problem by grouping arithmetic operations, thus creating a longer computer word for operations sensitive to rounding. Software tools that eliminate rounding errors include libraries for multiple-precision computing, such as GMP (the GNU Multiple Precision Arithmetic Library). Unfortunately, these tools lead to such great time complexity when they are used to solve rounding-error sensitive SLEs that they may not be applicable to practical problems.

Another limiting factor when solving SLEs with a square matrix is the inherent algorithmic time complexity O(n^3), where n is the dimension of the matrix. To solve SLEs with a large number of equations, special computers and processors have been developed, namely vector computers. Until recently, the vector processors available on the market were the CRAY SV1ex (2001), the Fujitsu VPP5000 (1999) and VMIPS (2001). The IBM Virtual Vector Architecture brings vector processing features to its recent POWER6 mainframe processors (2007). Another suitable architecture for solving rounding-error sensitive SLEs is SIMD (Single Instruction, Multiple Data). Today, SIMD features can be found as extensions of the standard processor instruction sets, for example Intel SSE, AMD 3DNow! or Motorola AltiVec. These are used in multimedia applications such as video, 3D graphics, animation, audio, etc. The recent trend is GPGPU: today, at least three of the ten top supercomputers in the TOP500 list use GPUs for numerical acceleration, and both major GPU manufacturers, NVIDIA and AMD, offer the possibility to run general-purpose computations on their GPUs.

At present there are no single-purpose numerical accelerators for solving rounding-error sensitive SLEs. We use common NVIDIA Fermi graphics cards together with Intel Core processors to accelerate the solution.

3. Mathematical Background
Let us take a system of linear equations (see [1]):

Ax = b, (1)

where A ∈ R^{N×N} is the matrix of a system of N equations in N unknowns, b ∈ R^N is the right-hand-side vector and x ∈ R^N is the desired vector, the solution of our system of linear equations.

3.1. Matrix Scaling
Matrix scaling is the first step. Its goal is to convert all the floating-point numbers of the matrix to their corresponding integer versions. Basically, every matrix row is multiplied by its scaling constant (scalar multiplication). This has to be done without losing a single bit of precision, which is achieved only when the scaling constant is a power of two, 2^n with n ∈ N.

The question now is how to determine the scaling constant. First, we find the element with the smallest absolute value (closest to zero) in each row. Then we extract the absolute value of its exponent and multiply the constant 2^53 by the corresponding power of two, because the significand of the IEEE 754 [6] double-precision floating-point format is 53 bits long. The scaling constant s is computed as

s = 2^53 · 2^{|exp(min_row)|}, (2)

where exp is a function that extracts and returns the exponent (an integer power of 2) and min_row is the element of the row closest to zero. The approach being used is explained in greater detail (and with alternatives) in [2].
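To make Eq. (2) concrete, the following C sketch shows one way the per-row scaling exponent could be obtained with the standard frexp function. It is an illustration only, assuming that a row never consists solely of zeros; the helper name scale_exponent and the choice to return the exponent rather than the constant s itself are ours, not taken from the Clock Math sources.

```c
#include <math.h>
#include <stdlib.h>

/* Illustrative helper (name ours): compute k such that the row scaling
 * constant of Eq. (2) is s = 2^k = 2^53 * 2^|e|, where e is the binary
 * exponent of the row element closest to zero.  The row is assumed to
 * contain at least one non-zero element. */
static int scale_exponent(const double *row, size_t n)
{
    double min_abs = INFINITY;
    for (size_t j = 0; j < n; ++j) {
        double a = fabs(row[j]);
        if (a != 0.0 && a < min_abs)
            min_abs = a;
    }

    int e;
    frexp(min_abs, &e);   /* min_abs = f * 2^e with 0.5 <= f < 1 */
    return 53 + abs(e);   /* s = 2^(53 + |e|) */
}
```

The exponent k is returned instead of s itself so that the subsequent scaling can be performed as an exact shift in the big-integer representation; for rows with very small elements, 2^(53+|e|) can easily exceed the range of a double.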
3.2. Solving a System of Linear Equations
From the previous step we have a properly scaled-up system of linear equations:

Ax = b, (3)

where the elements a_ij of matrix A and the elements x_i and b_i of the vectors are now (large) integers. We solve it using multi-modulus arithmetic over the commutative ring (Z_β, ⊕, ⊗) with a base vector β, which is equivalent to single-modulus arithmetic over (Z_M, +, ·) with module M. M has to be a positive integer large enough for every value occurring during the computation to be represented exactly. Hadamard's estimate of the determinant D of matrix A can be used when choosing a sufficiently large M:

|D|^2 ≤ ∏_{i=1}^{n} (a_{i1}^2 + a_{i2}^2 + · · · + a_{in}^2). (4)

M must exceed twice the highest value that can appear during the computation,

M > 2 max{ n^{n/2} max(a_{ij})^n,  n(n−1)^{(n−1)/2} max(a_{ij})^{n−1} max(y_i) },  i, j = 1, 2, . . . , n, (5)

and

gcd(M, D) = 1. (6)

We also need the following conditions on the vector β = (m_1, m_2, . . . , m_r) and the module M to be satisfied:
• ∏_{i=1}^{r} m_i = M;
• m_1 < m_2 < · · · < m_r;
• m_1, m_2, . . . , m_r are prime numbers.

When the determinant of the SLE given by Eq. (3) satisfies

|D|_{m_i} ≠ 0, i = 1, 2, . . . , r, (7)

the SLE from Eq. (3) can be solved within (Z_β, ⊕, ⊗); with respect to the vector β it can be expressed as

|Ax|_β = |b|_β, (8)

or, for the individual modules m_i of the vector β,

|Ax|_{m_i} = |b|_{m_i}, i = 1, 2, . . . , r. (9)

The following expressions are also valid for Eq. (3) within (Z_β, ⊕, ⊗):

|AA^{−1}|_β = |A^{−1}A|_β = E (10)

and

|x|_β = |A^{−1}b|_β, (11)

where E is the identity matrix.

To solve the SLE from Eq. (9) within each specific modular arithmetic, we use the Gauss–Jordan elimination algorithm with non-zero pivoting. The difference from the original Gauss–Jordan elimination is the use of modular arithmetic in all steps of the algorithm. The pivoting is also simplified: we do not need to find the greatest element in the elimination step, a non-zero element is good enough. For the GJ elimination algorithm, let us form a matrix W of dimensions n × (n + 1) consisting of matrix A and vector b:

W = ( a_11 a_12 · · · a_1n b_1
      a_21 a_22 · · · a_2n b_2
      . . .
      a_n1 a_n2 · · · a_nn b_n ). (12)

The goal of the algorithm is to eliminate the elements of W one by one, so that the resulting vector x of Eq. (11) is obtained after ≈ n^3 elimination steps.
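For illustration, the following C sketch implements the modular Gauss–Jordan elimination with non-zero pivoting described above for a single module m. It is a simplified CPU-side sketch, not the paper's GPU kernel; the function names are ours, and it assumes a prime module m < 2^31 (so that the product of two residues fits into 64 bits) and matrix entries already reduced modulo m.

```c
#include <stdint.h>
#include <stddef.h>

/* Modular inverse of a (mod m), m prime, a not divisible by m,
 * computed with the extended Euclidean algorithm. */
static uint64_t mod_inverse(uint64_t a, uint64_t m)
{
    int64_t t = 0, new_t = 1;
    int64_t r = (int64_t)m, new_r = (int64_t)(a % m);
    while (new_r != 0) {
        int64_t q = r / new_r, tmp;
        tmp = t - q * new_t;  t = new_t;  new_t = tmp;
        tmp = r - q * new_r;  r = new_r;  new_r = tmp;
    }
    return (uint64_t)(t < 0 ? t + (int64_t)m : t);
}

/* Gauss-Jordan elimination of the augmented n x (n+1) matrix W over Z_m
 * (row-major storage, entries already reduced mod m).  Returns 0 on
 * success; -1 if the matrix is singular modulo m, i.e. |D|_m = 0 and the
 * condition of Eq. (7) is violated.  The solution ends up in column n. */
static int gauss_jordan_mod(uint64_t *W, size_t n, uint64_t m)
{
    const size_t cols = n + 1;

    for (size_t k = 0; k < n; ++k) {
        /* Non-zero pivoting: any row with a non-zero element in column k. */
        size_t p = k;
        while (p < n && W[p * cols + k] == 0)
            ++p;
        if (p == n)
            return -1;
        if (p != k) {
            for (size_t j = 0; j < cols; ++j) {
                uint64_t tmp = W[k * cols + j];
                W[k * cols + j] = W[p * cols + j];
                W[p * cols + j] = tmp;
            }
        }

        /* Normalize the pivot row by the modular inverse of the pivot. */
        uint64_t inv = mod_inverse(W[k * cols + k], m);
        for (size_t j = 0; j < cols; ++j)
            W[k * cols + j] = W[k * cols + j] * inv % m;

        /* Eliminate column k from every other row (multiply-add-modulo). */
        for (size_t i = 0; i < n; ++i) {
            uint64_t f = W[i * cols + k];
            if (i == k || f == 0)
                continue;
            for (size_t j = 0; j < cols; ++j) {
                uint64_t sub = f * W[k * cols + j] % m;
                W[i * cols + j] = (W[i * cols + j] + m - sub) % m;
            }
        }
    }
    return 0;   /* x_i = W[i * cols + n] is the solution |x|_m */
}
```

On the GPU, it is the inner multiply-add-modulo loop over j that is distributed across the processor cores of an SMP, as described in Section 4.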
3.3. Inverse Transformation
After completing the two previous steps, we have a set of vectors x, one for each module of β = (m_1, m_2, . . . , m_r). Each of them represents a sub-solution in the particular arithmetic (Z_{m_i}, +, ·). The necessary condition for each of these per-module solutions x is a non-zero determinant, Eq. (7). Now we advance to the inverse transformation from (Z_β, ⊕, ⊗) back to (Z_M, +, ·); the algorithm used for this transformation is the Chinese remainder theorem. The product of this last step is the vector x with its elements in the form of fractions, which contains the solution of the SLE (3).
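As an illustration of this step, the following sketch reconstructs one solution component modulo M from its residues using the classical CRT formula. The big-number operations use GMP, which the paper also employs for its large-integer arithmetic (Section 4.1); the function itself and its signature are ours, not the Clock Math API, and the subsequent conversion of the reconstructed values into fractions is not shown.

```c
#include <gmp.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative CRT reconstruction (names ours): recover x (mod M) from
 * its residues x_i = |x|_{m_i} for pairwise distinct primes m_i, using
 *   x = sum_i  x_i * M_i * (M_i^{-1} mod m_i)  (mod M),   M_i = M / m_i.
 * The caller must initialize x with mpz_init beforehand. */
void crt_reconstruct(mpz_t x, const uint32_t *residues,
                     const uint32_t *moduli, size_t r)
{
    mpz_t M, Mi, inv, term;
    mpz_inits(M, Mi, inv, term, NULL);

    /* M = m_1 * m_2 * ... * m_r */
    mpz_set_ui(M, 1);
    for (size_t i = 0; i < r; ++i)
        mpz_mul_ui(M, M, moduli[i]);

    mpz_set_ui(x, 0);
    for (size_t i = 0; i < r; ++i) {
        mpz_divexact_ui(Mi, M, moduli[i]);   /* M_i = M / m_i           */

        mpz_set_ui(term, moduli[i]);
        mpz_invert(inv, Mi, term);           /* inv = M_i^{-1} mod m_i  */

        mpz_mul_ui(term, Mi, residues[i]);   /* term = x_i * M_i        */
        mpz_mul(term, term, inv);            /* term = x_i * M_i * inv  */
        mpz_add(x, x, term);
    }
    mpz_mod(x, x, M);                        /* reduce into [0, M)      */

    mpz_clears(M, Mi, inv, term, NULL);
}
```

Calling such a routine once per component of the solution vector yields |x|_M, from which the fractional solution of Eq. (3) is then recovered.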
4. GPU and Optimizations
GPGPU is the area of our research, so we optimize for PC graphics hardware. There were several platforms to choose from:
• NVIDIA CUDA, the mainstream platform today, was rejected: it is proprietary, and its usage is limited to NVIDIA hardware;
• Microsoft DirectX 11 DirectCompute was rejected: it is bound to the Microsoft Windows platform only (and is not generally used in HPC);
• OpenCL is an open standard that works well across platforms (GNU/Linux, Microsoft Windows and Apple Mac) and on all the latest graphics hardware (NVIDIA, AMD and the latest Intel GPUs); we use OpenCL for the implementation.

Figure 1. Measured computation shares [%] of the individual phases (matrix scaling, β vector computation, SLE solution, inverse transformation) as a function of the matrix size (16–1024).

4.1. Profiling
The whole SLE-solving process is a rather time-consuming task (Fig. 1), so optimizations had to be made. First, all the big-number ALU operations (Eq. (5) and Section 3.3) are performed using the GMP library, which is well tuned for this platform. Solving dozens of systems of linear equations in modular arithmetic (Section 3.2), one for each module of the β vector, is the most computation-intensive part of our task, see Fig. 1. According to Amdahl's law, expressed in Eq. (13), the benefit (overall speedup) gained from optimizing a part of the computation is

Speedup = 1 / ((1 − P) + P/S), (13)

where P is the proportion of the computation that we are optimizing and S is its speedup. Because the modular-arithmetic SLEs take the greatest amount of time to solve, we optimize the matrix elimination.

4.2. GPU architecture
All modern common GPUs have a similar architecture (from the point of view of our type of optimization). They consist of a global memory (1–4 GB) and several (2–24) SMPs¹ (an example is shown in Fig. 2). An SMP features a scheduler of lightweight threads with zero overhead. Each SMP contains dozens (8–48) of processor cores that share the SMP memory (shared memory, ≈ 64 kB) and execute the same code. If one processor core branches differently, the performance goes down dramatically. Each processor core also has its own integer and floating-point unit. More details, with examples and a description of the CUDA platform, are given in [5].

The processor cores have access to the GPU global RAM, but the memory accesses have to be aligned, otherwise performance suffers. We use the SMP processor cores to load part of the matrix (the row that contains the current pivot), use it for the computation, and then store the results of the elimination step back to global memory.

¹SMP – streaming multiprocessor

Figure 2. NVIDIA Fermi SMP [5].

4.3. GPU Kernels, Optimizations
The OpenCL framework defines the term kernel. A kernel is a function, written in a slightly modified C99 language, that is executed on an OpenCL-capable device. The OpenCL runtime executes the same kernel in parallel on every processor core within an SMP. In the example of the SAXPY function, we are able to process 8–48 vector elements at once. However, our case is not so simple.

Our system is currently limited to a matrix size of 4096 × 4096. The row with the current pivot is used in n multiply-add-modulo vector operations, hence it is quite useful, performance-wise, to cache it in the SMP shared memory. Current-generation NVIDIA GPUs have a shared memory size of 48 kB, so one matrix row fits into the local memory easily (4096 · 8 = 32768 B).

The processor cores fetch a matrix row from global memory and compute the inverse of the pivot element. Then the processor cores multiply the cached row and multiply-add-modulo the other rows in global memory. More details, including the processor core synchronizations and source code samples, are available online².

²http://www.github.com/kubbing/Clock-Math
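For illustration, a simplified OpenCL C kernel for one such elimination step is sketched below. It is not the actual Clock Math kernel (that code is available in the repository referenced above); it only demonstrates the caching scheme described in the text. It assumes that residues are stored reduced modulo a prime m < 2^31, that the pivot row k has already been normalized so that its pivot element equals 1, and that one work-group updates one row of the augmented matrix W; all identifiers are ours.

```c
/* Sketch of one elimination step over Z_m (illustrative, not the actual
 * Clock Math kernel).  W is the n x (n+1) augmented matrix in row-major
 * order with all entries already reduced mod m; the pivot row k is
 * assumed to be normalized (W[k][k] == 1).  One work-group updates one
 * row; the pivot row is cached in __local (SMP shared) memory. */
__kernel void eliminate_column(__global uint *W,
                               __local  uint *pivot_row,
                               const uint n,   /* number of equations  */
                               const uint k,   /* current pivot column */
                               const uint m)   /* prime modulus        */
{
    const uint cols = n + 1;
    const uint row  = get_group_id(0);   /* row handled by this group */
    const uint lid  = get_local_id(0);
    const uint lsz  = get_local_size(0);

    if (row == k)
        return;                          /* the pivot row is left untouched */

    /* Stage the normalized pivot row into shared memory. */
    for (uint j = lid; j < cols; j += lsz)
        pivot_row[j] = W[k * cols + j];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Every work-item reads the elimination factor before any work-item
     * of this group overwrites column k in the loop below. */
    const ulong f = W[row * cols + k];
    barrier(CLK_GLOBAL_MEM_FENCE);

    /* Multiply-add-modulo: W[row][j] = (W[row][j] - f * pivot[j]) mod m. */
    for (uint j = lid; j < cols; j += lsz) {
        ulong sub = (f * pivot_row[j]) % m;
        ulong val = ((ulong)W[row * cols + j] + m - sub) % m;
        W[row * cols + j] = (uint)val;
    }
}
```

The host side would enqueue such a kernel once per pivot column k with n work-groups, after a small preparatory step has computed the modular inverse of the pivot and normalized row k.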
5. Results
After embedding the architecture-related optimizations into our Clock Math solver and verifying the correctness of its results, we were finally able to benchmark it. There are currently not many up-to-date solvers of this kind. We benchmarked primarily against linSolve-0.7.15 [7], a highly optimized solver with performance-critical parts written in x86 assembler. The results (Tab. 1) are very satisfying: the Clock Math solver outperforms linSolve-0.7.15 by almost a factor of two for larger matrices.

Matrix size [n]   Clock Math [s]   linSolve-0.7.15 [s]   Speedup [%]
16                  0.006            0.006425              16.82
32                  0.006            0.014921              42.10
64                  0.029            0.046871              61.97
128                 0.117            0.195461              67.06
256                 1.043            1.401498              34.40
512                11.252           16.215546              44.11
1024              124.375          212.557002              70.90

Table 1. Measured results: comparison of execution times for the Clock Math solver and for linSolve-0.7.15, with the corresponding speedups.

6. Conclusion
We have presented a working system for solving ill-conditioned systems of linear equations exactly. We have managed to effectively bypass floating-point rounding errors by using modular arithmetic. Most of the computation consists of solving the SLEs within their corresponding modules. This part of the computation can be (and is) performed in parallel on a common-grade GPU. The GPU is capable of solving several SLEs at once (8–16, depending on its memory size) and also of solving each system in parallel on its SMPs (8–40 cores). This double parallelism is exploited, so a significant speedup is achieved compared to other implementations. We are currently working on advanced kernel optimizations with support for larger matrices.

7. Future Work
As we now have a working system, we would like to proceed to:
• add support for both single and double precision for the matrix elimination, as the β vector module count and the GPU performance may differ;
• add support for matrices larger than 4096 × 4096 by optimizing the SMP shared memory usage;
• add automatic tuning of the kernel group size for larger matrices, as the group size lowers the memory access time on different GPUs/architectures;
• examine the performance of the modulus operation across different GPU architectures and further optimize the SAXPY and DAXPY (mod m) functions;
• use the Open MPI library to add cluster support;
• adjust and run the solver on our university STAR cluster to test AMD's OpenCL CPU implementation.

Acknowledgements
This research has been supported by CESNET Development Fund project 390/2010 and by grant SGS12/097/OHK3/1T/18.

References
[1] Lórencz, R.: Aplikovaná numerická matematika a kryptografie [Applied Numerical Mathematics and Cryptography]. Vydavatelství ČVUT, 2004.
[2] Gregory, R. T.: Error-Free Computation: Why It Is Needed and Methods for Doing It. R. E. Krieger Publishing Co., 1980.
[3] Lórencz, R., Morháč, M.: A modular system for solving linear equations exactly. Computers and Artificial Intelligence, Vol. 12, 1992.
[4] Zahradnický, T.: MOSFET Parameter Extraction Optimization. Ph.D. thesis, Department of Computer Systems, Faculty of Information Technology, Czech Technical University in Prague, 2010.
[5] Kirk, D. B., Hwu, W. W.: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers, 2010.
[6] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.
[7] Vondra, L., Lórencz, R.: System for solving linear equation systems. In: Seminar on Numerical Analysis, pages 171–174, Technical University Liberec, 2012.