Mathematical Problems of Computer Science 50, 111--118, 2018.

Performance Analysis of Matrix Multiplication Algorithms Using MPI and OpenMP

Tigran M. Galstyan
Institute for Informatics and Automation Problems of NAS RA
e-mail: tig.galstyan.96@gmail.com

Abstract

The combination of OpenMP and MPI in programming is called hybrid programming. Hybrid programming (through message passing and shared memory) has gained an important role since the appearance of cluster architectures. A hybrid programming method combines the MPI and OpenMP libraries to exploit this hierarchical multi-core architecture. The purpose of this work is to carry out a performance analysis of matrix multiplication algorithms in a cluster system. Each node of the cluster consists of a multi-core CPU, so that memory is distributed among the nodes and shared within each node. The algorithms use MPI as the message-passing mechanism and OpenMP for shared memory.

Keywords: Hybrid, OpenMP, MPI, Matrix Multiplication, Fox.

1. Introduction

Parallel algorithms play an important role in high-performance computing environments. Dividing a task into smaller tasks and assigning them to different processors for parallel execution are two key concepts in the performance of parallel algorithms. Multiprocessor machines allow simultaneous execution of different application programs on different processors. They also allow a single application program to execute faster if it can be rewritten to use multiple processors [1].

The most common way to write a parallel program is to use a sequential language together with a subroutine library. The process bodies are written in a sequential language such as C; process creation, communication and synchronization are then programmed by calling library functions. For the message-passing environment we use MPI, whose functions are declared in the header file mpi.h [2, 3, 4]. For shared memory we use OpenMP, whose functions are declared in the header file omp.h [2, 5].

Objective

The purpose of this work is to study hybrid programming, taking matrix multiplication with MPI and OpenMP as the example: to carry out a performance analysis of matrix multiplication algorithms in a cluster system; to implement matrix multiplication algorithms in C using OpenMP and MPI; and to create a hybrid algorithm and compare it with well-known matrix multiplication algorithms, such as Fox's algorithm, as the matrix order increases.

2. MPI

Message Passing Interface (MPI) is a standardized and portable message-passing standard designed by a group of researchers from academia and industry to function on a wide variety of parallel computing architectures. The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in C, C++ and Fortran. There are several well-tested and efficient implementations of MPI, many of which are open-source or in the public domain. They fostered the development of a parallel software industry and encouraged the development of portable and scalable large-scale parallel applications [6].

3. OpenMP

The OpenMP (Open Multi-Processing) API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. The OpenMP API defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer. An application built with the hybrid model of parallel programming can run on a computer cluster using both OpenMP and MPI, such that OpenMP is used for parallelism within a (multi-core) node while MPI is used for parallelism between nodes. Attempts have also been made to run OpenMP on software distributed shared memory systems, to translate OpenMP into MPI, and to extend OpenMP to non-shared memory systems [7]. A minimal example of this hybrid structure is sketched below.
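To make the hybrid model concrete, here is a minimal "hello world" sketch (an illustration written for this text, not code from the experiments): MPI_Init_thread starts the message-passing layer in every process, and the OpenMP parallel region spawns a team of threads inside each process.

/* Minimal hybrid MPI + OpenMP example (illustrative sketch).
 * Compile with, e.g.: mpicc -fopenmp hybrid_hello.c -o hybrid_hello
 * Run with, e.g.:     mpirun -np 4 ./hybrid_hello
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* Request thread support; MPI_THREAD_FUNNELED means only the master
       thread of each process makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI provides parallelism between nodes (processes);
       OpenMP provides parallelism within a node (threads). */
    #pragma omp parallel
    printf("process %d of %d, thread %d of %d\n",
           rank, size, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}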
4. Matrix Multiplication

In mathematics, matrix multiplication, or the matrix product, is a binary operation that produces a matrix from two matrices. In more detail, if $A$ is an $n \times m$ matrix and $B$ is an $m \times p$ matrix, their matrix product $AB$ is an $n \times p$ matrix, in which the $m$ entries across a row of $A$ are multiplied by the $m$ entries down a column of $B$ and summed to produce an entry of $AB$ (Figure 1).

Fig. 1. Matrix Multiplication.

If $A$ is an $n \times m$ matrix and $B$ is an $m \times p$ matrix, then $C = AB$ is an $n \times p$ matrix, and the value of each element of $C$ is defined as

$$C_{ij} = A_{i1}B_{1j} + \dots + A_{im}B_{mj} = \sum_{k=1}^{m} A_{ik}B_{kj}.$$

The computational complexity of this definition is $O(n^3)$ (for $n \times n$ matrices). To improve performance, we have added OpenMP and MPI.

5. Hybrid Experiment

Matrix $C$ is calculated in each node using OpenMP. To do so, each process receives a piece of matrix $A$, while matrix $B$ is sent to all processes; the order of matrix $C$ is assumed to be divisible by the number of processes. The algorithm uses a master/worker-type interaction, where the master acts both as coordinator and as worker: it divides matrix $A$ into pieces to be processed and then runs the processing phases (Figure 2). A minimal code sketch is given after the figure.

Fig. 2. Matrix multiplication by processes.
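The sketch below is a minimal reconstruction of this scheme (illustrative, not the original source of the experiments; the matrix order N and the fill values are assumptions). The MPI layer scatters the row blocks of $A$ and broadcasts all of $B$; the OpenMP layer lets the threads of each process share the work on its row block; finally the row blocks of $C$ are gathered on the master.

/* Hybrid matrix multiplication sketch (an illustrative reconstruction of
 * the scheme above, not the paper's original source). The matrix order N
 * must be divisible by the number of MPI processes.
 * Compile with, e.g.: mpicc -fopenmp hybrid_mm.c -o hybrid_mm */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000           /* matrix order (an assumed example value) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                 /* rows of A handled per process */
    double *A = NULL, *C = NULL;
    double *B    = malloc((size_t)N * N * sizeof *B);
    double *Aloc = malloc((size_t)rows * N * sizeof *Aloc);
    double *Cloc = malloc((size_t)rows * N * sizeof *Cloc);

    if (rank == 0) {                     /* the master is also a worker */
        A = malloc((size_t)N * N * sizeof *A);
        C = malloc((size_t)N * N * sizeof *C);
        for (long i = 0; i < (long)N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    /* MPI layer: scatter the row blocks of A; broadcast all of B. */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Aloc, rows * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* OpenMP layer: the threads of each process share its row block. */
    #pragma omp parallel for
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += Aloc[i * N + k] * B[k * N + j];
            Cloc[i * N + j] = s;
        }

    /* Collect the row blocks of C on the master. */
    MPI_Gather(Cloc, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("C[0] = %g (expected %g)\n", C[0], 2.0 * N);

    free(B); free(Aloc); free(Cloc);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}

With every entry of $A$ set to 1 and every entry of $B$ set to 2, each entry of $C$ must equal $2N$, which the final print uses as a sanity check.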
6. Fox's Algorithm

Fox's algorithm is one of the well-known solutions to this problem. Details of the algorithm are given below [8].

Let $A$ and $B$ be $n \times n$ matrices, and compute $C = AB$ in parallel. Let $q = \sqrt{p}$ be an integer that divides $n$, where $p$ is the number of processes. Create a Cartesian topology [9] with processes $(i, j)$, $i, j = 0, \dots, q-1$, and denote $m = n/q$. Distribute $A$ and $B$ by blocks over the $p$ processes so that $A_{ij}$ and $B_{ij}$ are $m \times m$ blocks stored on process $(i, j)$. On process $(i, j)$ we want to compute

$$C_{ij} = \sum_{k=0}^{q-1} A_{ik}B_{kj} = A_{i0}B_{0j} + A_{i1}B_{1j} + \dots + A_{i,i-1}B_{i-1,j} + A_{ii}B_{ij} + A_{i,i+1}B_{i+1,j} + \dots + A_{i,q-1}B_{q-1,j}.$$

Rewrite this as a sequence of stages (block indices taken modulo $q$):

Stage     Compute
0         $C_{ij} = A_{ii}B_{ij}$
1         $C_{ij} = C_{ij} + A_{i,i+1}B_{i+1,j}$
...       ...
          $C_{ij} = C_{ij} + A_{i,q-1}B_{q-1,j}$
          $C_{ij} = C_{ij} + A_{i,0}B_{0,j}$
          $C_{ij} = C_{ij} + A_{i,1}B_{1,j}$
...       ...
q - 1     $C_{ij} = C_{ij} + A_{i,i-1}B_{i-1,j}$

Each process computes in stages:

Stage 0:
- process $(i, j)$ has $A_{ij}$ and $B_{ij}$, but needs $A_{ii}$;
- process $(i, i)$ broadcasts $A_{ii}$ across processor row $i$;
- process $(i, j)$ computes $C_{ij} = A_{ii}B_{ij}$.

Stage 1:
- process $(i, j)$ needs $A_{i,i+1}$ and $B_{i+1,j}$;
- the $j$th block column of $B$ is shifted one block up (block 0 goes to block $q-1$);
- process $(i, i+1)$ broadcasts $A_{i,i+1}$ across processor row $i$;
- process $(i, j)$ computes $C_{ij} = C_{ij} + A_{i,i+1}B_{i+1,j}$.

The following stages proceed similarly; a code skeleton of this scheme is sketched below.
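The skeleton below sketches one possible MPI realization of these stages (an illustrative reconstruction, not the code behind the measurements in Section 7; matmul_add and the demo driver are hypothetical helpers). It creates the $q \times q$ Cartesian grid of [9], broadcasts the stage's $A$ block along each process row, and shifts the $B$ blocks one step up along each grid column with MPI_Sendrecv_replace.

/* Fox's algorithm skeleton (illustrative sketch; assumes p = q*q processes).
 * Compile with, e.g.: mpicc fox.c -o fox -lm
 * Run with, e.g.:     mpirun -np 4 ./fox                                  */
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: C += A*B for m x m row-major blocks. */
static void matmul_add(int m, const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int k = 0; k < m; k++)
            for (int j = 0; j < m; j++)
                C[i*m + j] += A[i*m + k] * B[k*m + j];
}

/* Each process holds one m x m block of A, B and C. */
void fox(int m, double *Ablk, double *Bblk, double *Cblk, MPI_Comm comm)
{
    int p, rank, coords[2];
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);
    int dims[2] = {q, q}, periods[2] = {1, 0}; /* periodic rows: B wraps */
    MPI_Comm grid, row_comm;

    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    /* Sub-communicator spanning each process row (dimension 1 kept). */
    int remain[2] = {0, 1};
    MPI_Cart_sub(grid, remain, &row_comm);

    double *Atmp = malloc((size_t)m * m * sizeof *Atmp);
    memset(Cblk, 0, (size_t)m * m * sizeof *Cblk);

    for (int stage = 0; stage < q; stage++) {
        /* Stage k: process (i, (i+k) mod q) broadcasts its A block
           across processor row i. */
        int root = (coords[0] + stage) % q;
        if (coords[1] == root)
            memcpy(Atmp, Ablk, (size_t)m * m * sizeof *Atmp);
        MPI_Bcast(Atmp, m * m, MPI_DOUBLE, root, row_comm);

        matmul_add(m, Atmp, Bblk, Cblk);   /* C_ij += A_{i,i+k} B_{i+k,j} */

        /* Shift the block column of B one block up (block 0 wraps). */
        int src, dst;
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(Bblk, m * m, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
    }
    free(Atmp);
}

/* Tiny demo driver (hypothetical): all-ones blocks, so every C entry = n. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int q = (int)(sqrt((double)p) + 0.5), m = 2, n = m * q;

    double *A = malloc(m*m*sizeof *A), *B = malloc(m*m*sizeof *B),
           *C = malloc(m*m*sizeof *C);
    for (int i = 0; i < m*m; i++) { A[i] = 1.0; B[i] = 1.0; }

    fox(m, A, B, C, MPI_COMM_WORLD);
    if (rank == 0)
        printf("C[0] = %g (expected %d)\n", C[0], n);

    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}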
7. Results

The following tables and charts show the execution times (in seconds) and the differences between the algorithms. Table 1 and Chart 1 compare the Hybrid experiment, the sequential algorithm and the sequential algorithm with OpenMP.

Table 1: Execution times of the sequential, sequential + OpenMP and Hybrid versions.

Size        Sequential (sec)   Sequential + OpenMP (sec)   Hybrid (sec)
200x200     0.06               0.059                       0.04
400x400     0.65               0.59                        0.3
500x500     2                  1.28                        0.36
1000x1000   23                 20                          10
1500x1500   102                90                          46
2000x2000   280                200                         128
2500x2500   513                439                         280
3000x3000   995                831                         500

Chart 1. Execution time (seconds) versus matrix size for the sequential, sequential + OpenMP and Hybrid versions.

Table 2 and Chart 2 compare the Hybrid experiment and the Fox algorithm on 4 processes.

Table 2: Execution times of the Hybrid experiment and the Fox algorithm on 4 processes.

Size        Hybrid (sec)   Fox (sec)
50x50       0.002          0.019
100x100     0.018          0.032
200x200     0.04           0.05
300x300     0.15           0.18
400x400     0.3            0.31
500x500     0.56           0.55
750x750     3.5            2.9
1000x1000   10             8
1500x1500   46             40
2000x2000   128            113
2500x2500   280            250

Chart 2. Execution time (seconds) versus matrix size for the Hybrid experiment and the Fox algorithm on 4 processes.

8. Conclusions and Future Work

As can be observed, the time difference between the Hybrid experiment and the sequential algorithm with OpenMP is large. There is also a difference between the Hybrid experiment and the Fox algorithm: the Hybrid experiment runs faster up to 500x500 matrices, while beyond 500x500 the Fox algorithm runs faster. The Hybrid experiment is faster on small matrices because one of the matrices is sent whole to all processes; while the matrices are small this transfer is quick, so the total time is shorter than that of the Fox algorithm. For the future, it is planned to improve the efficiency of the hybrid experiment and to make the code faster than the Fox algorithm for matrices over 500x500.

References

[1] J. Ali and R. Zaman Khan, "Performance analysis of matrix multiplication algorithms using MPI", Department of Computer Science, Aligarh Muslim University, Aligarh.
[2] M. J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill.
[3] Open MPI. [Online]. Available: https://www.open-mpi.org/
[4] MPICH. [Online]. Available: https://www.mpich.org/
[5] OpenMP. [Online]. Available: https://www.openmp.org/
[6] Message Passing Interface. [Online]. Available: https://en.wikipedia.org/wiki/Message_Passing_Interface
[7] OpenMP. [Online]. Available: https://en.wikipedia.org/wiki/OpenMP
[8] N. Nedialkov, "Communicators and Topologies: Matrix Multiplication Example", CS/SE 4F03, McMaster University, Canada, March 2016.
[9] [Online]. Available: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/mpi-topo.html

Submitted 22.05.2018, accepted 20.11.2018.