Plane Thermoelastic Waves in Infinite Half-Space Caused FACTA UNIVERSITATIS Series: Mechanical Engineering Vol. 12, N o 1, 2014, pp. 15 - 25 AN APPROACH TO EFFICIENT FEM SIMULATIONS ON GRAPHICS PROCESSING UNITS USING CUDA UDC 519.6+531 Björn Nutti 1 , Dragan Marinković 2,3 1 AlgoritmFabriken AB, Stockholm, Sweden 2 Department of Structural Analysis, TU Berlin, Germany 3 Faculty of Mechanical Engineering, University of Niš, Serbia Abstract: The paper presents a highly efficient way of simulating the dynamic behavior of deformable objects by means of the finite element method (FEM) with computations performed on Graphics Processing Units (GPU). The presented implementation reduces bottlenecks related to memory accesses by grouping the necessary data per node pairs, in contrast to the classical way done per element. This strategy reduces the memory access patterns that are not suitable for the GPU memory architecture. Furthermore, the presented implementation takes advantage of the underlying sparse-block-matrix structure, and it has been demonstrated how to avoid potential bottlenecks in the algorithm. To achieve plausible deformational behavior for large local rotations, the objects are modeled by means of a simplified co-rotational FEM formulation. Key Words: Co-rotational FEM, Graphics Processing Units, CUDA, Sparse Block- Matrix, Conjugate Gradient Solver, Geometrical Nonlinearity 1. INTRODUCTION Simulation of deformable objects plays an important role in different areas, ranging from engineering tasks in structural analysis, via virtual reality applications, such as surgery simulators, up to entertainment industry, e.g. games and animations, to name but a few. Many of these areas impose two conflicting goals: 1) highly efficient computations of objects' deformational behavior, whereby 2) the geometrically nonlinear behavior due to large local rotations is to be recovered with sufficient accuracy. In the entertainment industry applications, it is often sufficient to utilize very simple mass-spring models [1] for this purpose, while fields such as virtual reality [2] or multi- body system dynamics [3] demand more accurate methods for plausible results. With a Received December 03, 2013 / Accepted February 1, 2014  Corresponding author: Björn Nutti AlgoritmFabriken AB, Stockholm, Sweden E-mail: bjorn@algoritmfabriken.se Original scientific paper 16 B. NUTTI, D. MARINKOVIĆ rapid pace of the hardware development, the finite element method (FEM) has gained ground in the past decade. However, the application of nonlinear FEM formulations results in simulations that are computationally more intensive thus putting a higher demand on computational resources. There are two complementary ways of approaching the problem: 1) application of simplified geometrically nonlinear FEM formulations, and 2) shifting the computations to numerically more powerful hardware components. With the increasing computational power of Graphics Processing Units (GPU), shifting the computational tasks to the GPU has become an attractive option. This approach also allows the CPU to deal with other tasks such as collision detection or haptics. The linear FEM simulation of deformational structural behavior is rather efficient and stable but it also results in artifacts such as unrealistic growth in volume when the deformational behavior involves finite local rotations. A rigorous geometrically nonlinear FEM formulation resolves this problem effectively, but in many cases rather inefficiently and accompanied by numerical stability issues. One of the possible ways of handling this problem was presented by Mueller et al [4]. However, their method is based on rotations determined per element nodes and, as a consequence, the elastic forces are not guaranteed to sum up to zero (the authors refer to the unbalanced forces as "ghost forces"). In contrast to this approach, the improved method presented in [5] and [6] uses rotations on the element level. Marinkovic et al. [7] discussed different aspects of this simplified co- rotational FEM formulation and demonstrated its application in various fields. Solving the FE system of equations is another key-element for highly efficient simulations. The conjugate gradient solver [8, 9] offers many advantages, as will be discussed in this paper. The advantages come particularly to the fore when the solver is used in combination with a general purpose GPU as a modified form of stream processor that provides a massive floating-point computational power (for the state-of-the-art GPUs expressed in teraflops per second). This approach has already been a subject of interest of several authors in the recent years [10, 11, 12]. In this paper, an efficient simulation pipeline for deformable objects is presented. The FEM models of deformable objects employ the linear tetrahedral element in combination with the above mentioned simplified co-rotational FEM formulation that uses element- based rotations. The implementation is realized with the Nvidia's Compute Unified Device Architecture (CUDA) paradigm suitable for a vast majority of modern GPUs. To make the simulation pipeline feasible for simulating both soft and stiff objects at interactive frame rates, the implicit Euler time integration scheme is employed. The bottlenecks in the simulation pipeline and the optimization of memory access patterns will be particularly addressed in order to achieve high GPU utilization efficiency. 2. THE BASICS OF THE IMPLEMENTED CO-ROTATIONAL FEM FORMULATION The motivation behind the development of the simplified co-rotational FEM formulation is to derive a geometrically nonlinear FEM formulation that is stable and very efficient besides enabling sufficiently accurate simulation of the deformational behavior characterized by large local rotations. The idea arises from the principles of incorporating flexible bodies in Multi-Body-System (MBS) dynamics. In MBS dynamics, the overall motion of flexible objects is described as a superposition of large rigid-body motion and small An Approach to Efficient FEM Simulations on Graphics Processing Units Using CUDA 17 deformation with respect to a body-fixed reference frame that performs the same rigid- body motion as the object. Instead of assigning a single local reference frame to the object, a set of local reference frames can be assigned to different areas of the modeled object. In a rigorous co-rotational FEM formulation, a local reference frame would be typically defined at each element integration point as those points are used in the evaluation of element matrices and vectors. However, in the present formulation, the idea is to account for the rigid-body rotation with a somewhat lower 'spatial resolution'. Hence, an average rigid-body rotation is determined for each finite element. Since co-rotational approaches decouple rigid-body motion from deformational motion, they allow the usage of engineering strain and stress measures in the formulation as well as decoupling of geometrical from material nonlinearities. The element-based rigid-body rotation is not the only simplification of this formulation compared to the rigorous geometrically nonlinear FEM formulations. Another important simplification lies in the fact that the behavior of finite elements remains purely linear with respect to the co-rotational reference frame. Thus, the only geometrically nonlinear effect taken into account is the averaged rigid-body rotation of single finite elements. The very core of the formulation is given by the following equation that defines the vector of internal nodal forces on the element level: 1 ( )    inte e e e e 0e f R K R x x , (1) where Re is the rotation matrix describing the element rigid-body rotation, Ke is the linear element stiffness matrix, while x0e and xe are the initial and current element configurations (nodal coordinates), respectively. The first term in the parenthesis on the right-hand side of the equation yields the current element configuration rotated back to the original element orientation. Thus, the entire term in the parenthesis yields the rotation-free nodal displacements. By multiplying them with the linear element stiffness matrix, one obtains the element internal forces with respect to the original element configuration, which are, finally, rotated to the current element configuration, i.e. through Re. Very simple algebra transforms Eq. (1) into the following form: R uee R e0eeeeeeeinte fxKxKRxRKRf  1 , (2) where Ke R is the rotated element stiffness matrix and fue R a contribution to the internal elastic forces, a part of which can be pre-computed for more efficient computation. It should be now more obvious where the efficiency of this formulation rests. In a pre- simulation step, the linear stiffness matrix of each element is computed. Over the course of simulation, the information about the current and initial element configurations is used to extract the element rotation matrix describing the purely rotational part of the motion. The matrix is further used to obtain the rotated element stiffness matrix, re-assemble the current structural stiffness matrix and compute the internal elastic forces. 3. TIME INTEGRATION FOR DYNAMICS The properties of the presented co-rotational FEM formulation come primarily to the fore in dynamics. The FEM equations for dynamics can be given in the following form: 18 B. NUTTI, D. MARINKOVIĆ intext ffCvMa  , (3) where a and v are the vectors of nodal accelerations and velocities, M and C are the structural mass and damping matrices, while fext and fint are the vectors of nodal external and internal forces of the entire structure, respectively. The authors of the paper have used the lumped mass matrix. Bearing in mind that the primary objective of the development is an interactive real- time simulation, the implicit Euler time integration is chosen as a time-integration scheme suitable for both soft and stiff materials. This integration scheme uses the acceleration and velocity at the end of the time-step to update the current positions and velocities: t  Δttttt avv , (4) t  Δttttt vxx , (5) where Δt is the time-increment and the right superscript denotes the moment in time at which the quantity is defined. In each time-step, v t+Δt is obtained by solving the following system of linear algebraic equation: tttt bvA   , (6) with 2 ΔtΔt t T t KCMA  , (7) Δt( )    t t t t Rt t T u ext b Mv K x f f , (8) where KT t is the tangent stiffness matrix of the entire structure and fu Rt is assembled from vectors fue R introduced in Eq. (2) for an element, and both quantities are defined for time t. A detailed derivation of these equations can be found in [2]. For the implementation at hand, the damping is defined as Rayleigh damping [13], thus: KMC  , (9) where α and β are the coefficients of the mass proportional damping and stiffness proportional damping, respectively. In this work, β=0 was adopted. 4. IMPLEMENTATION CONSIDERATIONS The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors, designed to execute hundreds of threads simultaneously. During execution, threads are grouped into smaller units called warps. Threads within a warp are scheduled to execute simultaneously. The warp scheduler swaps warps in and out during execution to maximize hardware utilization. Resources, such as data registers, are shared among warps. Hence the higher resource requirements a warp has, the fewer warps can be scheduled on each multiprocessor simultaneously, potentially decreasing the computational efficiency. An Approach to Efficient FEM Simulations on Graphics Processing Units Using CUDA 19 In this paper, four categories of memory are considered: host memory, device memory, shared memory and textures. The host memory is accessible by processes running on the CPU. The device memory is the global memory accessible by the multiprocessor, while the shared memory is located on chip inside each multiprocessor unit and is, hence, much faster than the device memory. Although it is possible to perform random accesses to the device memory, it is crucial to follow a set of rules to maximize the bandwidth utilization. The heuristic is to use linear access patterns (coalescing) within warps, both when reading and writing the memory. Breaking this rule causes serialized memory accesses, which usually has a dramatic negative impact on the performance. Once the memory is loaded into the on- chip shared memory, data accesses are two orders of magnitude faster compared to the device memory. Although there is a cache for the device memory, it is sometimes beneficial to use the texture memory, as it uses a cache separate from the other ones. A warp performing random reads and writes can benefit from the use of textures, as one cache can be used for reading while the other for writing. 5. ASSEMBLING THE STRUCTURAL STIFFNESS MATRIX For each element, i, of the tetrahedral mesh, first the element stiffness and mass matrices are computed. Within the framework of geometrically nonlinear structural analysis, the structural mass matrix remains constant regardless of the structural deformation or motion in general. On the other hand, the structural stiffness is defined by the tangent stiffness matrix that is configuration dependant and has to be updated upon each time increment. The tetrahedral element stiffness matrix, KeR 1212 , has a form of 44 block matrices, KnmR 33 , that define the relation between nodes n and m. According to the presented simplified co-rotational FEM formulation, the global stiffness matrix is computed as a sum of the rotated element stiffness matrices:  i i R eT KK , (10) and since the rotation matrix is orthogonal, the rotated element stiffness matrix is computed as: T eee R e RKRK  . (11) When the global stiffness matrix is assembled on a single-threaded machine, the computation is usually implemented as an element-wise summation, where block matrices of Ke R are consecutively added to the corresponding positions in KT. An attempt to apply this approach on a multi-threaded machine would demand synchronization of all write accesses and serialization of all simultaneous accesses to a specific memory location. Hence, such an implementation would perform poorly. Assigning the task of computing a subset of Knm-matrices to each of the kernels would resolve the problem of synchronization, as each position in KT is written once after the corresponding thread is finished with the summation of Ke matrices. This approach would, however, break the rules for efficient coalesced memory accesses to the GPU's device memory, because a random access pattern is used when Ke matrices are summed up. 20 B. NUTTI, D. MARINKOVIĆ On the other hand, any off-diagonal block, Knm, nm, of the element stiffness matrix is attributed to the element edge with nodes n and m, while diagonal blocks Knn, are attributed to single element nodes. The same is valid for the global stiffness-matrix, in which off-diagonal and diagonal elements are denoted Kij and Kii, respectively. A possible approach to resolving the problem of memory accesses is to observe the mesh as a collection of edges and nodes rather than as a collection of elements. By gathering all Kij block-matrices for each edge ij and Kii for each node i in a pre-simulation step, the blocks in KT can be computed as a sum of piecewise linear memory throughout the simulation. This approach allows fast coalesced memory accesses when computing KT. The described pattern is visualized in Fig 1. Fig. 1 a) 'Per-element' assemblage of the global stiffness matrix; b) 'Per-edge and node' assemblage of the global stiffness matrix When assembling the diagonal part of the matrix, each block-matrix Kii, at position i in K diag is computed as a sum of Knn block-matrices belonging to the elements that share the node i. During the pre-simulation step, a two-dimensional array is created, where all block- matrices contributing to Kii are stored at row i. The maximum number of elements sharing the same node is determined in order to find the smallest block-width that will fulfill the requirements for coalesced memory reads. All rows are zero-padded to this width. Over the course of simulation, the elements in K diag are computed with a block-parallel-reduction algorithm (see [14]), that is applied row-wise. Within the framework of this research, it has been discovered that the use of thread-blocks of dimension [3block-width], with a grid dimension of [number-of-nodes1] is well suited for the computation. This layout puts an upper limit to the number of neighbors for a node, as there is an upper limit of the number of threads that is allowed in a thread-block. However, this should not introduce any practical limitation for conventional geometries and meshes. As previously discussed, the off-diagonal elements are attributed to the edges in the tetrahedral mesh. Generally speaking, the edges are shared by fewer elements than it is the case with the nodes. If each thread-block had computed a single Kij matrix, the benefits from coalesced memory accesses would have been poor. In order to improve this, each thread- block is given the task of computing multiple block-matrices, as shown in Fig. 2. a) b) An Approach to Efficient FEM Simulations on Graphics Processing Units Using CUDA 21 Fig. 2 The strategy of efficient coalesced memory accesses – grouping and summing up the block-matrices related to the same node pair The matrices computed within a thread-block are all related to the same node pair. In the presented implementation, the thread-block size is set to a value that assures coalesced memory accesses. Sets of sub-matrices are then stored in a memory-layout matching the thread-blocks. During the simulation, a thread-block reads a set of matrices into the shared memory, these matrices are then rotated, summed up and stored into the shared memory. When all computations are finished, the result is written back to the device memory. 6. IMPROVED EQUATION ASSEMBLY Assembling the system of linear equations (Eq. (6)) includes a range of arithmetic operations which are performed through a combination of kernel invocations. This section shows how to optimize the required amount of floating-points-operations, as well as the memory usage. To be more specific, the main issues addressed are:  When assembling A, each element in KT has to be multiplied by t 2 . This would not have had a noticeable impact if KT had been constant over the course of simulation. However, since KT depends on structural deformation (i.e. element rotations), this multiplication cannot be pre-computed. It is also not possible to perform this operation while assembling KT because KT is used again when assembling b.  The second issue is the summation that involves the sparse matrix KT and the diagonal matrix M. This requires attention because the block compressed row storage of the sparse matrix is not optimized for accessing diagonal elements. To address the first problem, the damping matrix in the form given in Eq. (9) is introduced into Eq. (6), which is further divided by t 2 . To improve the readability of the obtained equation, denotations 2 t A A   ~ , 2 t b b   ~ , 2 t M M   ~ and 2t t t1    are introduced, so that: t T t KMA  ~~ t , (12) 1 ( ) Δt t      t t t t t Rt t u ext b M(v x ) A x f f , (13) and the equation to solve reads: tttt bvA ~~   . (14) 22 B. NUTTI, D. MARINKOVIĆ Both M ~ and t can be pre-computed. The only additional computational effort in Eq. (14), compared to Eq. (6), is vector tx t , which however requires only a low complexity linear operation. However, the presented approach permits computation of t A ~ as the first step. This matrix is further directly used to compute t b ~ . To address the second issue mentioned above, matrix t A ~ can be observed as a sum of an off-diagonal matrix and a diagonal matrix, i.e. t off t diag t AAA ~~~  . Hence, the simple idea is to compute t offT K using one kernel, while computing t diagT KM  ~ t by means of another kernel. The presented implementation optimizes the number of arithmetic operations as well as random accesses to device memory. However, it still requires uncoalesced memory accessed for rotation matrices. To reduce the amount of data that is read in this manner, it is possible to represent a rotation either as an axis-angle pair or as a quaternion. This strategy allows for storing and accessing rotations in single 128-bits memory transactions, in contrast to the approach based on matrices, which requires up to three times as much. The results presented in this paper are based on the matrix representation. The axis-angle and quaternion representations are beyond the scope of this paper. 7. SOLVING THE FE SYSTEM OF EQUATIONS At each time-step, the FE system of equations (Eq. (14)) is solved to obtain the nodal velocities and, thus, update the configuration. Since the system matrix, Ã t , is a positive definite matrix, the authors' choice is a preconditioned conjugate gradient solver, optimized for working on block compressed row storage sparse matrices. The basic idea of an iterative solver is that it starts out with an initial guess, i.e. initial solution vector, and goes through an iterative process to update the solution vector in each iteration. A pre-conditioner matrix is used to provide a faster convergence to the solution. A convergence criterion is used to determine whether the accuracy of the solution is acceptable or more iteration steps are needed to improve the accuracy. The preconditioned conjugate gradient method involves three matrix-vector products, three vector updates and four inner products per iteration. The number of operations is somewhat greater compared to the conjugate gradient method without preconditioning, but the "pre-conditioned version" is more effective provided the pre-conditioner is well chosen and reduces the system condition number. In this manner, the pre-conditioner improves the convergence rate of the method enough to make up for the additional cost caused by the conditioning. The solver also provides a very easy way of performing a trade-off between the solution accuracy and computational effort by limiting the number of performed iterations. Furthermore, the efficiency of the iterative solver can be noticeably improved by a reasonable choice of starting vector of the iterative process. This is particularly interesting in dynamics with the system of equations solved for nodal velocities. Namely, the nodal velocities typically do not change dramatically within a time-step. Hence, a good choice would be to take the velocities from the previous time-step as a starting vector for the iterative solver in the next time-step. This improves the numerical efficiency of simulation as fewer iteration steps are needed to arrive at the solution. An Approach to Efficient FEM Simulations on Graphics Processing Units Using CUDA 23 8. RESULTS To provide a glimpse of the results achievable by the presented co-rotational FEM formulation, a liver model depicted in Fig. 3 is used. The model implements the coupled- meshed technique [7] to couple the triangulated surface mesh (2598 vertices and 5192 faces) to the FEM model (660 nodes and 2640 elements). This technique is adopted to enable modeling of rather complex geometries by means of computationally not too demanding FEM models. The approach represents a promising solution for various types of virtual reality applications, such as virtual surgery in this specific case. a) b) c) Fig. 3 Liver model: a) surface vertices; b) triangulated surface; c) FE-mesh Fig. 4 gives a few snapshots from an interactive simulation with the aforementioned liver model. Although large, the deformations appear realistic and they do not exhibit artificial enlargement typical for linear analysis. Fig. 4 Large realistic liver model deformations during a dynamic interactive simulation To evaluate the computational speed-up that the proposed GPU implementation offers, the performance of a single-threaded dual-core Intel Core2Duo T7200 processor running at 2 GHz is compared with the performance achieved by a GeForce GTX 8800 graphics card with 128 stream processors and a core clock-speed of 575 MHz. A synthetic mesh generator that produces tetrahedron meshes with evenly distributed elements is used for generating FEM models. The size of the generated FEM models ranges from 2500 to 20000 elements. As for the boundary conditions, a chosen set of nodes is fixed, while the structure is exposed to the influence of gravity. The preconditioned conjugate gradient solver discussed above is set to run exactly 20 iterations in each time-step so that a direct comparison between the CPU and GPU performances can be done. The diagram in Fig. 5, left, shows the number of frames per second for the considered models with the computations performed on the CPU and the GPU. One may notice that the single-threaded CPU implementation exhibits almost a linear dependence between the amount of data to be processed and the computational time. On the other hand, the GPU implementation demonstrates better utilization of available computational resources when a larger amount of data is to be processed. Consequently, Fig. 5, right, reveals greater simulation speed-up achieved by switching from the CPU to the GPU for larger models. 24 B. NUTTI, D. MARINKOVIĆ One of the reasons for that is the fact that the warp scheduler can overlap memory transactions in a more efficient way when more active threads are available. Fig. 5 Simulations results using GPU and CPU implementations 9. CONCLUSIONS The paper presents a very effective approach to performing real-time or nearly real- time interactive simulations with large, geometrically nonlinear deformations. The objective of real-time simulation is addressed in two complementary ways. The first consideration is related to the choice of an efficient FEM formulation. The authors use the simplified co-rotational FEM formulation that enables accounting for geometrically nonlinear effects to a large extent, whereby the efficiency and stability of numerical computation are kept over the course of simulation. This is achieved by extending the linear FEM by the element-based rigid-body rotation and neglecting all other geometrically nonlinear effects. The resulting deformational behavior is rather realistic for deformations characterized by arbitrarily large rotations but relatively small strains. The second consideration is related to the choice of hardware components used to perform the computations. The general idea is simple and consists in switching the computation from the conventionally used CPU to modern GPUs that offer ever greater computational power. However, typical FEM computations involve geometries of irregular topology. This fact further implies random memory access patterns unsuitable for the modern GPU memory management. This paper presents an original solution as to how to increase the memory bandwidth usage by utilizing efficient memory access patterns. The presented solutions are very suitable for various virtual reality applications. However, as the computational efficiency strongly gains in importance in engineering applications, the development can be of large interest in this field as well. Particularly the field of MBS dynamics with deformable bodies involved may benefit from the development. Acknowledgement: This work is partially supported by the project III41017 Virtual human osteoarticular system and its application in preclinical and clinical practice, funded by the Ministry of Education and Science of the Republic of Serbia. An Approach to Efficient FEM Simulations on Graphics Processing Units Using CUDA 25 REFERENCES 1. Yang, Y., Xiao, R., He Z., 2011, Real-time deformations simulation of soft tissue by combining mass-spring model with pressure based method, Proceedings of the 3rd IEEE International Conference on Advanced Computer Control (ICACC '11), Harbin, China, pp. 506–510. 2. Erleben, K., Sporring, J., Henrikssen K., Dohlman, H., 2005, Physics-Based animation, Charles river media, USA. 3. Zehn, M., 2005, MBS and FEM: A Marriage-of-Convenience or a Love Story?, BENCHmark Int. Magazine for Eng. Design&Analysis, pp. 12-15. 4. Mueller M., Dorsey J., McMillan L., Jagnow R., Cutler B., 2002, Stable Real-Time Deformations, Proceedings of 2002 ACM SIGGRAPH/Eurographic symposium on Computer animation, San Antonio, USA, pp. 49-54. 5. Hauth, M., Strasser W., 2004, Corotational simulation of deformable solids, Journal of WSCG, 12(1), pp. 137-144. 6. Mueller, M., Gross, M., 2004, Interactive Virtual Materials, Proceedings of Graphics Interface 2004, Waterloo, Canada, pp 239-246. 7. Marinković, D., Zehn, M., Marinković Z., 2012, Finite element formulations for effective computations of geometrically nonlinear deformations, Advances in Engineering Software, 50, pp. 3-11. 8. Hestenes, M. R., Stiefel, E., 1952, Methods of Conjugate Gradients for Solving Linear Systems, Journal of Research of the National Bureau of Standards, 49(6), pp. 409-436. 9. Benzi, M. 2002, Preconditioning Techniques for Large Linear Systems: A Survey, Journal of Computational Physics, 182(2), pp. 418-477. 10. Cecka C., Lew A. J., Darve, E., 2011, Assembly of Finite Element Methods on Graphics Processors, International Journal for Numerical Methods in Engineering, 85(5), pp. 640-669. 11. Mafi, R., Sirouspour, S., 2013, GPU-based acceleration of computations in nonlinear finite element deformation analysis, International Journal for Numerical Methods in Biomedical Engineering, doi: 10.1002/cnm.2607 12. Georgescu S., Chow, P., Okuda, H., 2013, GPU Acceleration for FEM-Based Structural Analysis, Archives of Computational Methods in Engineering 20(2), pp. 111-121 13. Bathe, K. J., 1982, Finite element procedures in engineering analysis, Prentice-Hall, Inc., Englewood Cliffs, New Jersey. 14. Harris, M., Optimizing parallel reduction in CUDA, available at: http://developer.download.nvidia.com/compute/cuda/1.1- Beta/x86_website/projects/reduction/doc/reduction.pdf (access date: 14.02.2014.) PRISTUP EFIKASNIM MKE SIMULACIJAMA POMOĆU GRAFIĈKIH KARTI PRIMENOM CUDA-PLATFORME Rad predstavlja izuzetno efikasan pristup simulaciji dinamičkog ponašanja deformabilnih objekata primenom metode konačnih elemenata (MKE) pri čemu se proračuni izvode primenom grafičkih karti (GK). Predstavljeno rešenje smanjuje efekat „uskog grla” uzrokovanog memorijskim pristupima tako što grupiše podatke po parovma čvorova MKE modela, nasuprot klasičnom pristupu kod koga se grupisanje vrši po elementima. Time ova strategija redukuje memorijske pristupe neadekvatne sa stanovišta arhitekture memorije grafičkih karti. Takođe, ovaj pristup koristi prednosti matrica zapisanih u kompaktnom obliku i pokazano je kako izbeći dalja potencijalna „uska grla” u algoritmu. Da bi se deformaciono ponašanje opisalo na realističan način i pri velikim lokalnim rotacijama, iskoršćena je uprošćena korotaciona FEM formulacija. Ključne reči: korotaciona MKE, grafička karta, CUDA, kompaktna matrica, metoda konjugovanih gradijenata, geometrijska nelinearnost. http://developer.download.nvidia.com/compute/cuda/1.1- http://developer.download.nvidia.com/compute/cuda/1.1-