HUNGARIAN JOURNAL OF INDUSTRIAL CHEMISTRY VESZPRÉM Vol. 39(2) pp. 215-218 (2011)

USING GRAPHICAL PROCESSING UNITS IN SCHEDULING PROBLEMS

K. MIHÁLY, O. HORNYÁK

University of Miskolc, Department of Information Engineering, 3535 Miskolc-Egyetemváros, HUNGARY
E-mail: krisztian.mihaly@sap.com

Scheduling problems are everywhere in the so-called "real world": they appear in manufacturing, transportation and logistics alike. The main objective of these problems is to find an optimal sequence of tasks that fulfils predefined objectives. There are efficient methods for solving complex scheduling problems in science and industry, and these methods can be divided into several classes, such as heuristic algorithms, genetic algorithms, etc. Even though these methods significantly reduce the computational time of exploring the solution search space, the cost remains prohibitive when very large problem instances have to be solved. Many of these methods are not designed for parallel computing; they use the CPU as the sole arithmetic unit, so from this point of view the bottleneck is the number of commands that can be processed. Meanwhile, the capabilities of specialised Graphical Processing Units (GPUs) have increased enormously, and they provide an efficient platform for developing graphical algorithms. Nowadays there are new programming languages and platforms with which these GPUs can be used for more generic problems, exploiting their hardwired parallel processing resources. Our goal is to use this specialised graphical platform for solving scheduling problems. This paper is an initial survey of the existing platforms and solutions in general, and of the existing solutions to scheduling problems in particular.

Keywords: GPU; CUDA; Scheduling

Introduction

This paper gives a short overview of the history of parallel processing on graphical processing units and of the available hardware and development platforms. It investigates the main architecture and the feasibility of parallel algorithms, both in general and specifically in the area of scheduling problems. Finally, future research directions are outlined and some conclusions on the justification of this new approach are drawn.

Short history of Graphical Processing Units (GPU)

Before reviewing GPU performance and computing capacity, we have to point out the problems with improving the currently used Central Processing Units (CPUs). Over the last decades, the main route to improving the performance of a personal computer was to increase the speed at which the processor's clock operated. In the early 1980s a consumer CPU ran with an internal clock operating near 1 MHz; in 2010 most desktop processors had clock speeds between 1 GHz and 4 GHz, roughly a thousandfold improvement on essentially the same kind of chip. So what is the problem? This approach has run into power and heat restrictions, as the physical limit of transistor size is rapidly approaching, and it is no longer feasible to rely on upward-spiralling processor clock speeds. Hardware manufacturers therefore changed their approach to creating new CPUs, and in 2005 they started to ship multicore central processors. This change is sometimes referred to as the multicore revolution. "In the meantime the state of graphics processing underwent a dramatic revolution.
In the late 1980s and 1990s, the growth in popularity of graphically driven operating systems such as Microsoft Windows helped create a market for a new type of processor. In the early 1990s, users began purchasing 2D display accelerators for their personal computers. These display accelerators offered hardware-assisted bitmap operations to assist in the display and usability of graphical operating systems" [1].

Around the same time, another approach was developed for three-dimensional graphics. In 1992 Silicon Graphics opened the programming interface to its hardware by releasing the OpenGL library, a standardised, platform-independent method for developing 3D applications. At that time the rendering computations still ran on the CPU. Owing to the success of first-person shooter games such as Doom, Duke Nukem 3D and Quake, the market created hardware support for better, more realistic applications with 3D display. The main companies were NVIDIA, ATI Technologies and 3dfx Interactive. At first, the graphical hardware took over the transform and lighting computations, so these operations could run on the hard-wired graphics processors while the CPU could be used for other tasks. The next breakthrough from a parallel-computing point of view was the first graphics card that supported Microsoft's DirectX 8.0 standard. This standard introduced programmable vertex and pixel shaders: it was the first point at which developers could influence the programs run on the GPU.

Figure 1: The difference between CPU and GPU performance, measured in GFLOP/s [2]

Early GPU programming: pain points

The DirectX APIs were designed to make the GPU execute special graphical tasks. They were designed to take data in the form of "input colours", together with additional "colour" or "texture" information that the programmer could use during the computation; such a "colour" can in fact encode any number. A clever but simple trick was to use these containers to put data into the GPU, do some calculations there and get the data back. The pain points of this approach are easy to see. The APIs were designed to support graphics, so programmers needed very deep knowledge of the graphics platform. There were resource constraints: data could only be loaded as a picture or texture, and the retrieved data was another picture, so an algorithm that needed to write to an arbitrary memory area could not run on the GPU. It was nearly impossible to predict how a particular GPU would deal with floating-point data, and there was no good method for debugging implementations running on the GPU.

A new GPU programming approach: OpenCL, CUDA and Stream as platforms

Graphics hardware manufacturers brought out a new generation of GPUs in which the partitioned resources of vertex and pixel shaders were replaced by a unified shader pipeline. This new generation allows each arithmetic logic unit (ALU) to be controlled by a program intended to perform general-purpose computations. Usually, as in NVIDIA's solution, these ALUs were built to comply with IEEE floating-point requirements; beyond this, they are able to read and write memory in software-managed ways. However, each main supplier introduced its own architecture or standard: NVIDIA has the Compute Unified Device Architecture (CUDA), ATI has Stream. Besides these industry standards there are open standards as well, the main one being OpenCL [3].

CUDA architecture

The CUDA architecture has two main parts: the hardware that supports CUDA programming, which can be called the device, and the programming language that makes it possible to create programs using the device's capacity. The programming language is based on industry-standard C and adds a relatively small number of keywords in order to harness some of the special features of the CUDA architecture. There is a public compiler for this language, CUDA C. We give only a short overview of the architecture here; for further information, refer to the CUDA Programming Guide by NVidia [4]. As mentioned before, CUDA is an extension of the C language and hides the GPU-specific APIs and interfaces. Three main tasks remain the responsibility of the developer: the thread hierarchy, memory access and synchronisation [5].

Thread hierarchy

To perform a computation with CUDA, the programmer has to define a special C function, the so-called kernel. This function is run by a specified number of lightweight threads on the GPU. The kernel is loaded to the device from the host, where the "normal" code is running. Threads are grouped into blocks, and the blocks form a so-called grid. Threads can communicate through the memory area assigned to their block. The blocks run independently, and their scheduling cannot be affected by the programmer.

Figure 2: CUDA thread hierarchy [6]
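As an illustration (our own minimal sketch, not code from the cited sources; the kernel name vecAdd, the array names and the block size of 256 threads are arbitrary choices), the following complete CUDA C program adds two vectors. Each lightweight thread computes its global index from the built-in blockIdx, blockDim and threadIdx variables and handles one element.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Kernel: one lightweight thread adds one pair of elements. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      /* the grid may be slightly larger than n */
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *hA = (float*)malloc(bytes);
    float *hB = (float*)malloc(bytes);
    float *hC = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    /* Allocate device memory and copy the inputs from the host. */
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    /* Launch a one-dimensional grid of blocks, 256 threads per block. */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    /* Copying the result back implicitly waits for the kernel to finish. */
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);   /* expected: 3.000000 */

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

The host code around the <<<blocks, threads>>> launch is the "normal" code mentioned above; everything inside vecAdd runs on the device.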
Memory hierarchy

There are several types of memory areas: global memory, shared memory, constant memory, registers and local memory. The memory types differ in size, access time and caching properties; more details can be found in [4] or [5].

Figure 3: Basic organization of the GeForce 8800 [7]

Synchronisation

It may be necessary to enforce a central point in the kernel at which the threads within the same block wait for each other. For example, when the threads use shared memory, a consistent state may be needed after a thread has written to it. The CUDA language provides a mechanism for this, the __syncthreads() method, which defines a waiting point in the kernel code: each thread executes its code up to this statement and then stops until all other threads of the block have reached the same statement.
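As an illustration of shared memory and __syncthreads() (again our own sketch, not code taken from the cited sources; the kernel name blockSum and the block size BLOCK are assumed), the following kernel computes one partial sum per block with a tree reduction. Each step halves the number of active threads, and the barrier guarantees that all writes of one step are visible before the next step reads them.

#include <cuda_runtime.h>

#define BLOCK 256   /* threads per block; a power of two, and the kernel
                       must be launched with exactly BLOCK threads */

/* Each block sums BLOCK elements of the input into one partial result. */
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float buf[BLOCK];        /* memory shared by the whole block */
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;  /* pad the tail with zeros */
    __syncthreads();                    /* buf is now fully written */

    /* Tree reduction: 128 active threads, then 64, 32, ..., 1. */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();                /* wait before the next step reads */
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];       /* one partial sum per block */
}

The host launches this kernel exactly as in the previous sketch and adds up the per-block partial sums on the CPU. Without the __syncthreads() calls, a thread could read an element of buf that another thread of the same block has not yet written.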
Show-cases of CUDA applications

CUDA has many application areas, from astrophysics and the physical modelling of fluids through medical imaging to the representation of mathematical sets. It is not our goal to describe each of them; more can be read in [8]. As an example, one show-case from medical imaging is described here [1]. A great many people have been affected by breast cancer in the past 20 years. Ideally, every case of breast cancer would be caught early enough to prevent the ravaging side effects of chemotherapy and radiation, which requires an effective, fast and minimally invasive detection method. Mammography exists for early detection, but it has limitations: two or more images need to be taken, and the images have to be analysed by skilled doctors. Another technique, ultrasound imaging, is safer than the X-ray image made by a mammogram, but it has not been used in practice because of computational limitations. GPU capacity eliminates these limitations: a scan can generate 35 GB of data in practice, yet with a Tesla C1060 a doctor can manipulate a highly detailed, three-dimensional image of the patient's breast within 20 minutes.

Scheduling algorithms in general

Generally, the purpose of scheduling is to find an optimal processing order of tasks (also called activities, operations or jobs) on a set of processors (machines), subject to several constraints imposed on the tasks, the machines and their mutual relationships. The literature contains many types of scheduling models, such as shop scheduling models, real-time scheduling, monoprocessor/dedicated processor/parallel processor models, cyclic scheduling, etc. [9, 10, 11].

Show-cases of scheduling algorithms on the GPU

Parallel computing is considered capable of providing solutions for various complex optimization problems. [12] solves the travelling salesman problem by means of the GPU, also using genetic operators and a tabu list; in such a hybrid CPU-GPU environment memory handling is also important. [13] applied a two-level metaheuristic to solve the flexible job shop problem: the machine selection layer and the operation scheduling module are executed in parallel, implemented in C (CUDA) for the GPU, and a tabu list was used here as well. [16] suggested solving the scheduling problem in a heterogeneous CPU-GPU environment: the CPU and the GPU behave differently, but both can execute the same tasks, so the authors implemented an appropriate load-balancing algorithm between CPU and GPU and achieved a 30-40% speedup. [14] investigates a flowshop problem solved by GPU computing, focusing on an efficient memory-mapping model in order to implement a multiobjective local search task on the GPU; the authors describe their speedup as promising. [15] describes two approaches to parallel flowshop evaluation on the CUDA platform; according to their measurements, the GPU-based algorithm is faster by up to 5%.
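To connect the CUDA mechanics above with these scheduling show-cases, the following sketch (our own illustration, not code from [12-16]; evalFlowshop and all parameter names are assumed) shows the typical evaluation kernel of a GPU-based flowshop search: each thread computes the makespan of one candidate job permutation, so a metaheuristic running on the host can evaluate a whole neighbourhood of schedules in parallel.

#include <cuda_runtime.h>

#define MAX_MACHINES 20   /* assumed upper bound; nMachines must not exceed it */

/* proc  : processing times, proc[job * nMachines + m]
   perms : nPerms candidate permutations stored back to back, each nJobs long
   cmax  : resulting makespan of each permutation */
__global__ void evalFlowshop(const int *proc, const int *perms, int *cmax,
                             int nJobs, int nMachines, int nPerms)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPerms) return;            /* one thread per candidate schedule */

    const int *perm = perms + p * nJobs;
    int c[MAX_MACHINES] = {0};          /* completion time of the last
                                           scheduled job on each machine */
    for (int j = 0; j < nJobs; ++j) {
        int job = perm[j];
        int prev = 0;                   /* completion of this job on machine m-1 */
        for (int m = 0; m < nMachines; ++m) {
            /* the job starts when both the machine and the job are free */
            int start = (c[m] > prev) ? c[m] : prev;
            prev = start + proc[job * nMachines + m];
            c[m] = prev;
        }
    }
    cmax[p] = c[nMachines - 1];         /* makespan of this schedule */
}

A tabu search or genetic algorithm on the host would generate the permutations, copy them to the device, launch evalFlowshop and select the best makespan; this parallel evaluation step is the part that works such as [13] and [15] accelerate on the GPU.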
Conclusion

As the result of our first project task, we obtained an overview of the existing solutions and of the main power of parallel computing on the GPU device. The show-cases have a very wide focus, from medicine to the abstract mathematical world, and within this huge area scheduling in manufacturing systems is only a small stone. Following this literature survey, we are going to implement the existing solutions on our new hardware once it has been purchased and installed, and to carry out speed-up experiments on flow-shop and job-shop models.

ACKNOWLEDGEMENTS

This research was carried out as part of the TAMOP-4.2.1.B-10/2/KONV-2010-0001 project with support by the European Union, co-financed by the European Social Fund.

REFERENCES

1. J. SANDERS, E. KANDROT: CUDA by Example, Addison-Wesley, (2010), p. 3
2. J. MASOOD: Nvidia Cuda, http://www.hardwareinsight.com/nvidia-cuda/ (2011. 08. 25)
3. OpenCL standard, Khronos Group: http://www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf (2011. 08. 25)
4. NVidia CUDA C Programming Guide 4.0, NVidia Developer Zone: http://developer.download.nvidia.com/compute/cuda/4_0_rc2/toolkit/docs/CUDA_C_Programming_Guide.pdf (2011. 08. 25)
5. M. CZAPINSKI, S. BARNES: Tabu Search with two approaches to parallel flowshop evaluation on CUDA platform, Journal of Parallel and Distributed Computing, 71(6), (2011), 802–811
6. http://mohamedfahmed.wordpress.com/2010/05/03/cuda-computer-unified-device-architecture/ (2011. 08. 25)
7. S. RYOO, C. I. RODRIGUES, S. S. BAGHSORKHI, S. S. STONE: Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA
8. CUDA Showcases, NVidia webpage: http://www.nvidia.com/object/cuda_showcase_stage.html
9. J. BLAZEWICZ, K. H. ECKER, E. PESCH, G. SCHMIDT, J. WEGLARZ: Scheduling Computer and Manufacturing Processes, Springer, (2001), ISBN 3-540-41931-4
10. P. BRUCKER, S. KNUST: Complex Scheduling, Springer, (2006), ISBN 978-3-540-29545-7
11. B. CHEN, C. N. POTTS, G. J. WOEGINGER: A review of machine scheduling: Complexity, algorithms and approximability, Handbook of Combinatorial Optimization (Volume 3), (1998), 21–169
12. J. ZHAO, Q. LIU, W. WANG, Z. WEI, P. SHI: A parallel immune algorithm for traveling salesman problem and its application on cold rolling scheduling, Information Sciences, 181, (2011), 1212–1223
13. W. BOZEJKO, M. UCHRONSKI, M. WODECKI: Parallel hybrid metaheuristics for the flexible job shop problem, Computers and Industrial Engineering, 59, (2010), 323–333
14. T. V. LUONG, N. MELAB, E.-G. TALBI: GPU-Based Approaches for Multiobjective Local Search Algorithms. A Case Study: The Flowshop Scheduling Problem, EvoCOP 2011, 155–166
15. M. CZAPINSKI, S. BARNES: Tabu Search with two approaches to parallel flowshop evaluation on CUDA platform, Journal of Parallel and Distributed Computing, 71(6), (2011), 802–811
16. C. J. JIMENEZ, K. VILANOVA, I. GELADO, M. GIL: Predictive runtime code scheduling for heterogeneous architectures, HiPEAC 2009, 19–33