Acta Polytechnica
DOI:10.14311/AP.2020.60.0025
Acta Polytechnica 60(1):25-37, 2020 © Czech Technical University in Prague, 2020
available online at https://ojs.cvut.cz/ojs/index.php/ap

PARALLELIZATION OF ASSEMBLY OPERATION IN FINITE ELEMENT METHOD

Michal Bošanský∗, Bořek Patzák

Czech Technical University in Prague, Faculty of Civil Engineering, Thákurova 7, 166 29 Prague 6, Czech Republic
∗ corresponding author: michal.bosansky@fsv.cvut.cz

Abstract. Efficient codes can take advantage of multiple threads and/or processing nodes to partition the work that can be processed concurrently. This can reduce the overall run-time or make the solution of a large problem feasible. This paper deals with the evaluation of different parallelization strategies for the assembly of global vectors and matrices, which is one of the critical operations in any finite element software. Different assembly strategies for systems with a shared memory model are proposed and evaluated, using Open Multi-Processing (OpenMP), Portable Operating System Interface (POSIX) Threads, and C++11 Threads. The considered strategies are based on simple synchronization directives, on various block locking algorithms and, finally, on lock-free processing based on a colouring algorithm. The different strategies were implemented in OOFEM [1], a free finite element code with an object-oriented architecture.

Keywords: Parallel computation, shared memory, finite element method, vector assembly, matrix assembly.

1. Introduction

The current development in computer hardware brings new opportunities in numerical modelling. The solutions of many engineering problems are extremely demanding, both in terms of time and computational resources. Traditional serial computers permit simulation codes to run only sequentially on a single processing unit, where only one instruction can be processed at any moment in time.

The current trend in technology is parallel processing, relying on the simultaneous use of multiple processing units to solve a given problem. Any parallel algorithm is based on the idea of partitioning the overall work into a set of smaller tasks, which can be solved concurrently by a simultaneous use of multiple computing resources. Parallel processing allows tasks to be performed concurrently on a single computer with multiple processing units, or even on multiple computers with multiple processing units. Parallelization can significantly reduce the computational time by a more efficient use of the available hardware, and it can also make feasible large problems that do not fit on a single machine. Parallel programming requires the development of new techniques and relies on programming language extensions and on the use of communication libraries. The principal issue in a parallel assembly is to prevent race conditions, where the same memory location is updated by multiple threads.

Parallel computers can be classified, for example, by the type of memory architecture [2]: shared, distributed, and hybrid memory systems exist. In a shared memory system, the main memory with a global address space is shared between all processing units, which can directly address and access the same global memory. The global memory significantly facilitates the design of a parallel program, as the individual tasks can communicate via the global memory, but the memory bus performance is the limiting factor for the scalability of these systems.
The systems with a distributed memory are built by connecting individual processing units equipped with a local memory. There is no global address space and the individual tasks running on different processing units have to communicate using messages. The design and implementation of parallel algorithms is typically more demanding; however, these systems do not suffer from the scalability limitations of shared memory systems. Finally, the hybrid systems try to combine the advantages of shared and distributed systems by combining multi-core processing units with a local memory into powerful systems. The most powerful computers today are built using this paradigm [3].

The Finite Element Method (FEM) is one of the most popular methods to solve various problems in engineering. It consists of a broad spectrum of methods for finding an approximate solution to boundary value problems described by a system of partial differential equations. The differential equations are converted to an algebraic system of equations using variational methods. Once the integral form is set up, the discretization of the problem domain into non-overlapping subdomains called elements is introduced to define the approximation and test functions. In structural mechanics, the resulting algebraic equations correspond to the discrete equilibrium equations at the element nodes. In matrix notation, the resulting equilibrium equations for a linear static problem have the following form

Kr = F, (1)

where K and F are the global stiffness matrix and load vector, respectively, and r is the vector of unknown nodal displacements. The global stiffness matrix K and the global load vector F are assembled from individual element and nodal contributions. This process relies on the global numbering of equations, where the contributions of individual elements are assembled (added) to the global matrix/vector values according to the global equation numbers of the nodal unknowns. The typical sequential vector assembly algorithm is outlined in Algorithm 1.

Algorithm 1: Prototype code - assembly of the load vector
    for elem = 1, nelem
        Fe = computeElementVector(elem)
        Loce = giveElementCodeNumbers(elem)
        for i = 1, Size(Loce)
            F(Loce(i)) += Fe(i)

The typical sequential algorithm for the matrix assembly is similar to the vector assembly, see Algorithm 2 for a reference.

Algorithm 2: Prototype code - assembly of the stiffness matrix
    for elem = 1, nelem
        Ke = computeElementMatrix(elem)
        Loce = giveElementCodeNumbers(elem)
        for i = 1, Size(Loce)
            for j = 1, Size(Loce)
                K(Loce(i), Loce(j)) += Ke(i, j)
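For illustration, the serial vector assembly of Algorithm 1 can be written as the following minimal C++ sketch. The Element structure and its fields are hypothetical placeholders for the element routines of an actual FE code (they are not OOFEM's API), and the local vectors are assumed to be precomputed.

    #include <cstddef>
    #include <vector>

    // Hypothetical element data: a precomputed local load vector Fe and the
    // global equation numbers (code numbers) of its entries.
    struct Element {
        std::vector<double> Fe;
        std::vector<int>    loc;
    };

    // Serial assembly of the global load vector F (Algorithm 1): every element
    // contribution is added to the global entries given by its code numbers.
    void assembleVector(std::vector<double>& F, const std::vector<Element>& elems)
    {
        for (const Element& el : elems)
            for (std::size_t i = 0; i < el.loc.size(); ++i)
                F[el.loc[i]] += el.Fe[i];
    }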
As already mentioned, the assembly operations are one of the typical steps in a finite element analysis. In order to obtain a scalable algorithm, all the steps have to be parallelized, including the assembly phase, which can be costly. Particularly when solving nonlinear problems, the evaluation of the individual element contributions (tangent stiffness matrix, internal force vector) can be computationally demanding and has to be performed for every load increment step and every iteration (the frequency of the stiffness matrix update depends on the actual solution algorithm).

In general, the parallelization of the assembly operation consists in splitting the assembly loop over elements into disjoint subsets, which are processed by the individual processing threads. Within each thread, the individual element contribution is evaluated for each element in the subset (this part can be evaluated concurrently), followed by the assembly of the local contribution into the target global matrix/vector, see the simple demonstration in Figure 1. The key problem is that this second step cannot be performed concurrently without care, as multiple threads may update the same global entry at the same time. It is therefore necessary to ensure that the same global entry is not being updated by multiple threads at the same time; this is the so-called race condition. To prevent the race condition on update, various techniques can be used. They typically consist in using locking primitives to make sure that (i) the code performing the update can be executed only by a single thread, (ii) the specific memory location can be updated only by a single thread, or (iii) the evaluation of the element contributions is ordered in such a way that a conflict cannot occur.

Figure 1. Parallel algorithm for the stiffness matrix assembly (schema).

One of the important characteristics of a parallel algorithm is its scalability. Unfortunately, the ideal, linear scalability is difficult to obtain. Almost every parallel algorithm has an overhead cost when compared to the sequential version: the individual tasks cannot be executed concurrently without a synchronization and communication with the other tasks and, finally, some parts of the algorithm are essentially serial and have to be executed by a single thread. Different parallel assembly approaches have been presented, for example, in [4], where approaches based on OpenMP critical sections and OpenMP atomic directives were proposed.

2. Shared memory frameworks

In this section, different strategies for the parallel vector/matrix assembly are presented. They use different techniques to prevent the race condition on the data update. These strategies are implemented using the different shared memory programming models available, including Open Multi-Processing (OpenMP), Portable Operating System Interface (POSIX) Threads, and the C++11 Threads programming interface.

2.1. OpenMP

OpenMP is a shared memory programming model that supports multi-platform shared memory multiprocessing programming in the C, C++, and Fortran programming languages. It is available for a wide variety of processor architectures and operating systems. It consists of a set of compiler directives, library routines, and environment variables that influence the run-time behaviour, see [5] for details.

Programming in OpenMP consists in using so-called parallel constructs (compiler directives), which are inserted into the source code, instructing the compiler to generate a specific parallel code. OpenMP defines various constructs allowing to parallelize a serial code and to synchronize the individual threads [6].

To keep the parallel tasks coarse-grained and to reduce the overhead connected to the thread creation and termination, it is usual to parallelize the outermost loops of the algorithm which, in our particular case, corresponds to the parallelization of the loop over elements in the assembly operation.
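To make the race condition concrete, the following sketch (reusing the hypothetical Element structure introduced above) shows a naive OpenMP parallelization of Algorithm 1. It is intentionally incorrect: two elements sharing a node map to the same entry of F, so the unsynchronized update is a data race.

    #include <cstddef>
    #include <vector>

    struct Element { std::vector<double> Fe; std::vector<int> loc; };

    // Naive parallel assembly (compile with -fopenmp): the element loop is
    // split among threads, but the global update is left unprotected, so the
    // concurrent "+=" below is a data race and the result is undefined.
    void assembleVectorUnsafe(std::vector<double>& F,
                              const std::vector<Element>& elems)
    {
        #pragma omp parallel for
        for (long e = 0; e < (long)elems.size(); ++e)
            for (std::size_t i = 0; i < elems[e].loc.size(); ++i)
                F[elems[e].loc[i]] += elems[e].Fe[i];   // unsynchronized update
    }

The strategies below differ in how they make this update safe.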
2.1.1. Synchronization using Critical Section, Atomic Update, Simple Lock, Nested Lock

In this subsection, we start with simple solutions preventing the occurrence of race conditions during the data update, based on the critical section, atomic update, simple lock, and nested lock constructs.

The synchronization using a critical section is implemented in three variants. In the first variant (marked as A1), the whole assembly operation, consisting of the loop over the element code numbers (row and column numbers of the global matrix/vector), is enclosed in a critical section. The second variant encloses only the actual update operation (A2). The last variant (A3) is similar to (A2), but uses the atomic update, see Algorithm 3.

Algorithm 3: Prototype code - matrix assembly with explicit synchronization
    # pragma omp parallel for private(Ke, Loce)
    for elem = 1, nelem
        Ke = computeElementMatrix(elem)
        Loce = giveElementCodeNumbers(elem)
        # pragma omp critical (A1)
        for i = 1, Size(Loce)
            for j = 1, Size(Loce)
                # pragma omp critical (A2)
                # pragma omp atomic (A3)
                K(Loce(i), Loce(j)) += Ke(i, j)

A similar approach is followed to synchronize the threads using locks. A lock is used to ensure that only a single thread can process either the loop over the element code numbers (A1.1, or A2.1 using the nested lock) or the actual update operation (A1.2), see Algorithm 4.

Algorithm 4: Prototype code - matrix assembly with explicit locks
    omp_init_lock(&my_lock) (A1.1) (A1.2)
    omp_init_nest_lock(&my_nest_lock) (A2.1)
    # pragma omp parallel for private(Ke, Loce)
    for elem = 1, nelem
        Ke = computeElementMatrix(elem)
        Loce = giveElementCodeNumbers(elem)
        omp_set_lock(&my_lock) (A1.1)
        omp_set_nest_lock(&my_nest_lock) (A2.1)
        for i = 1, Size(Loce)
            for j = 1, Size(Loce)
                omp_set_lock(&my_lock) (A1.2)
                K(Loce(i), Loce(j)) += Ke(i, j)
                omp_unset_lock(&my_lock) (A1.2)
        omp_unset_lock(&my_lock) (A1.1)
        omp_unset_nest_lock(&my_nest_lock) (A2.1)
    omp_destroy_lock(&my_lock) (A1.1) (A1.2)
    omp_destroy_nest_lock(&my_nest_lock) (A2.1)
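A compilable form of the A3 variant (shown for the vector assembly for brevity, again with the hypothetical Element structure) could look as follows; replacing the atomic directive with #pragma omp critical yields the A2 variant.

    #include <cstddef>
    #include <vector>

    struct Element { std::vector<double> Fe; std::vector<int> loc; };

    // Variant A3 (compile with -fopenmp): only the scalar update itself is
    // atomic, so updates to distinct global entries can proceed in parallel.
    void assembleVectorAtomic(std::vector<double>& F,
                              const std::vector<Element>& elems)
    {
        #pragma omp parallel for
        for (long e = 0; e < (long)elems.size(); ++e) {
            const Element& el = elems[e];
            for (std::size_t i = 0; i < el.loc.size(); ++i) {
                #pragma omp atomic
                F[el.loc[i]] += el.Fe[i];   // atomic read-modify-write
            }
        }
    }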
The approach followed in this section is rather conservative, ensuring that only a single thread can perform any update operation. In reality, the race condition on the data update can happen only if two or more threads attempt to update the same entry of the global vector/matrix, which essentially means that these threads are assembling the contributions of elements sharing the same node(s). The probability of this is relatively low so, in the next sections, we try to propose improved algorithms that allow performing the update operation in parallel, provided that different entries of the global vector/matrix are updated.

2.1.2. Synchronization using block locks

The algorithm presented in this section is based on the idea of having an array of locks, each corresponding to a consecutive block of values in the global vector/matrix. Once a specific global value is to be updated by a specific thread, the corresponding lock is acquired, preventing other threads from updating values in the same block, but allowing them to update values in other blocks. It may seem that the ideal situation is to have a unique lock for every global value but, as the global vectors/matrices are the dominant data structures (in terms of memory requirements) in a typical FE code, this approach is not feasible. In the presented approach, the individual groups correspond to blocks of rows of the global vector/matrix values, and the prototype implementation is presented in Algorithm 5.

Algorithm 5: Prototype code - matrix assembly with explicit block locks
    # define NBLOCKS
    omp_lock_t my_lock[NBLOCKS]
    for n = 1, NBLOCKS
        omp_init_lock(&my_lock[n])
    blocksize = Size(K.rows)/NBLOCKS
    # pragma omp parallel for private(Ke, Loce)
    for elem = 1, nelem
        Ke = computeElementMatrix(elem)
        Loce = giveElementCodeNumbers(elem)
        for i = 1, Size(Loce)
            bI = Loce(i)/blocksize // integer division
            for j = 1, Size(Loce)
                omp_set_lock(&my_lock[bI])
                K(Loce(i), Loce(j)) += Ke(i, j)
                omp_unset_lock(&my_lock[bI])
    for n = 1, NBLOCKS
        omp_destroy_lock(&my_lock[n])

2.2. POSIX Threads

The POSIX Threads (Pthreads) libraries are standardized thread programming interfaces for the C/C++ language, allowing one to spawn new concurrent threads. Pthreads consists of a set of C/C++ language types and procedure calls, see [7] for further details. The POSIX Threads algorithms below are based on distributing the element set into contiguous subsets assigned to the individual threads.

2.2.1. Synchronization using Simple Mutex and Recursive Mutex

The POSIX Threads synchronization routines allow protecting shared data when multiple threads update it. The concept is very similar to the locks in the OpenMP library. In POSIX Threads, only a single thread can lock a mutex variable at any given time; if several threads try to lock the mutex, only one of them will be successful and no other thread can own the mutex until the owning thread unlocks it. A simple mutex can be locked only once by a given thread: attempting to relock it (locking the mutex again after a previous lock) causes a deadlock, and attempting to unlock a simple mutex that the thread has not locked leads to an undefined behaviour. A recursive mutex, in contrast, maintains a lock count. When a thread successfully acquires the recursive mutex for the first time, the lock count is set to one, and each additional lock by the same thread increments it. Each time the thread unlocks the recursive mutex, the lock count is decremented by one; when the lock count reaches zero, the recursive mutex becomes available for other threads to acquire. In this section, prototype algorithms using simple and recursive mutexes to prevent the race condition on the data update are presented. Three variants are considered: the first, marked as B1.1, uses a simple mutex to protect the loop over the code numbers; the second, marked as B2.1, uses a recursive mutex to protect the same loop; and the last one, marked as B1.2, protects just the update operation using a simple mutex. All variants are illustrated in Algorithm 6.

Algorithm 6: Prototype code - matrix assembly with explicit synchronization using simple and recursive POSIX mutexes
    void AssemblyElementMatrix ( ... )
        for elem = 1, nelem
            Ke = computeElementMatrix(elem)
            Loce = giveElementCodeNumbers(elem)
            pthread_mutex_lock(&mutex_SIM) (B1.1)
            pthread_mutex_lock(&mutex_REC) (B2.1)
            for i = 1, Size(Loce)
                for j = 1, Size(Loce)
                    pthread_mutex_lock(&mutex_SIM) (B1.2)
                    K(Loce(i), Loce(j)) += Ke(i, j)
                    pthread_mutex_unlock(&mutex_SIM) (B1.2)
            pthread_mutex_unlock(&mutex_SIM) (B1.1)
            pthread_mutex_unlock(&mutex_REC) (B2.1)
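A self-contained sketch of the B1.1 variant (vector assembly for brevity) might read as follows; the Element structure, the Task bookkeeping, and the uniform partitioning of the element range are illustrative assumptions, not part of the paper's implementation.

    #include <pthread.h>
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Element { std::vector<double> Fe; std::vector<int> loc; };

    static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

    struct Task {                         // element range handled by one thread
        std::vector<double>* F;
        const std::vector<Element>* elems;
        std::size_t begin, end;
    };

    // Variant B1.1: the whole loop over code numbers is guarded by one mutex.
    static void* assembleRange(void* arg)
    {
        Task* t = static_cast<Task*>(arg);
        for (std::size_t e = t->begin; e < t->end; ++e) {
            const Element& el = (*t->elems)[e];
            pthread_mutex_lock(&g_mutex);
            for (std::size_t i = 0; i < el.loc.size(); ++i)
                (*t->F)[el.loc[i]] += el.Fe[i];
            pthread_mutex_unlock(&g_mutex);
        }
        return nullptr;
    }

    void assembleVectorPthreads(std::vector<double>& F,
                                const std::vector<Element>& elems, int nthreads)
    {
        std::vector<pthread_t> threads(nthreads);
        std::vector<Task> tasks(nthreads);
        std::size_t chunk = (elems.size() + nthreads - 1) / nthreads;
        for (int t = 0; t < nthreads; ++t) {
            std::size_t b = std::min(elems.size(), t * chunk);
            std::size_t e = std::min(elems.size(), b + chunk);
            tasks[t] = Task{ &F, &elems, b, e };
            pthread_create(&threads[t], nullptr, assembleRange, &tasks[t]);
        }
        for (int t = 0; t < nthreads; ++t)
            pthread_join(threads[t], nullptr);
    }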
2.3. Synchronization using Colouring Algorithm

As already discussed in the previous sections, the conservative strategy of always protecting the update operation may not lead to optimal results. It enforces a serial execution of the update operation for the selected values, regardless of whether a real conflict exists or not; the fact that it prevents a parallel execution can have a significant impact on the scalability. This problem has been partially addressed by the algorithm using an array of locks, which restricts the update only within a block of rows of values. In this section, we present an alternative approach based on the idea of assigning the individual elements into groups, where the elements in a group do not share any node. This essentially means that the elements in a group can be processed concurrently because, during the assembly operation, only distinct values of the global vector/matrix can be updated. After determining the element distribution into the groups, the algorithm loops over the groups and each group is processed in parallel. The objective is to keep the number of groups minimal. This is the so-called colouring problem known from graph theory [8].

In this paragraph, an introduction to the vertex colouring algorithm is given. Consider a graph whose n vertices represent the FE elements. The edges (connections) represent the element connectivity, i.e., an edge between two vertices represents the case when the two elements share the same node. The task is to assign a "colour" to each vertex under the condition that no two neighbours have the same colour, while keeping the number of "colours" minimal. The greedy colouring of a graph proceeds as follows (a C++ sketch is given after the listing):

(1.) Loop over the elements e = 1, ne
  (a) For the element e, find the set of colours C assigned to the neighbours of element e:
      - loop over the nodes of element e, i = 1, n
      - C = C + fn(i)
  (b) Find an unused, available colour not in C and assign it to element e:
      - loop over the available colours, c = 1, m
      - if c is not in C, then EC(e) = c

where EC is the array of element colours and fn(i) is the function returning the set of colours assigned to the elements sharing the node i.
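A minimal sketch of this greedy colouring in C++, assuming the mesh is available as an element-to-node connectivity table (names and data layout are illustrative):

    #include <cstddef>
    #include <vector>

    // Greedy element colouring: elemNodes[e] lists the nodes of element e.
    // Returns a colour per element such that no two elements sharing a node
    // receive the same colour.
    std::vector<int> colourElements(const std::vector<std::vector<int>>& elemNodes,
                                    int nnodes)
    {
        std::vector<int> colour(elemNodes.size(), -1);
        // colours already taken at each node by previously coloured elements
        std::vector<std::vector<char>> taken(nnodes);
        for (std::size_t e = 0; e < elemNodes.size(); ++e) {
            int c = 0;
            for (;; ++c) {    // smallest colour still free at all nodes of e
                bool available = true;
                for (int n : elemNodes[e])
                    if (c < (int)taken[n].size() && taken[n][c]) {
                        available = false;
                        break;
                    }
                if (available)
                    break;
            }
            colour[e] = c;
            for (int n : elemNodes[e]) {   // mark colour c as taken at e's nodes
                if ((int)taken[n].size() <= c)
                    taken[n].resize(c + 1, 0);
                taken[n][c] = 1;
            }
        }
        return colour;
    }

Elements with an equal colour then form the groups (colour.group in Algorithms 7 and 8 below) that can be assembled without any synchronization.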
The computational cost of the greedy colouring algorithm depends heavily on the vertex ordering; in the worst case, the behaviour is poor and the colouring can take a considerable computation time. However, the graph construction and the graph colouring have to be performed only once, during the FE code initialization, and the result can then be reused in any assembly operation. As already noted, the colouring algorithm splits the elements into groups marked with different colours and keeps the number of colours small (the greedy strategy does not, in general, guarantee the true minimum). Once the colouring is available, the assembly algorithm consists of an outer loop over the individual colour groups and an inner loop over the individual elements in a group. Inside the inner loop, the element contributions are evaluated and assembled into the global vector/matrix. The key point is that the inner loop can be parallelized (the individual members of a group can be processed by different threads) without any synchronization, as the colouring ensures that the race condition on update cannot occur.

Even though this is appealing, the algorithm has its overhead. This includes the already discussed need to establish the colouring; moreover, only the inner loop can be parallelized. All threads have to finish processing the inner loop over the elements of one colour before the elements of the next colour can be processed, which requires a synchronization. Finally, there is also an overhead connected to the creation and termination of the threads for each colour, as the individual threads are created and terminated inside the outer loop over the individual colours. This overhead is, however, typically very small considering the typical assembly times of real problems, so it has, at most, a minor impact on the overall cost.

The prototype code for the colouring-based assembly is presented in Algorithm 7 using OpenMP directives. The additional arguments of the parallel loop directive are used to declare some variables as shared (i.e., each thread accesses the same variable) or private (each thread has its own copy of that variable). The for-loop clause allows accumulating a shared variable without an explicit synchronization.

Algorithm 7: Prototype code - matrix assembly without explicit synchronization using Colouring Algorithm (OpenMP)
    for ii = 0, number.of.colours
        # pragma omp parallel for private(Ke, Loce)
        for ie = 1, Size(colour.group[ii])
            elem = colour.group[ii][ie]
            Ke = computeElementMatrix(elem)
            Loce = giveElementCodeNumbers(elem)
            for i = 1, Size(Loce)
                for j = 1, Size(Loce)
                    K(Loce(i), Loce(j)) += Ke(i, j)

The POSIX Threads variant of the colouring-based assembly algorithm is presented in Algorithm 8.

Algorithm 8: Prototype code - matrix assembly without explicit synchronization using Colouring Algorithm (POSIX Threads)
    threads = new pthread_t[num_threads]
    for ii = 0, number.of.colours
        for jj = 0, (num_threads - 1)
            psize = Size(colour.group[ii])/num_threads
            start.indx = jj * psize
            end.indx = start.indx + psize
            pthread_create(&threads[jj], NULL, AssemblyElementMatrix, ii, start.indx, end.indx)
        for jj = 0, (num_threads - 1)
            pthread_join(threads[jj], NULL)

    void AssemblyElementMatrix ( ... )
        for ie = start.indx, end.indx
            elem = colour.group[ii][ie]
            Ke = computeElementMatrix(elem)
            Loce = giveElementCodeNumbers(elem)
            for i = 1, Size(Loce)
                for j = 1, Size(Loce)
                    K(Loce(i), Loce(j)) += Ke(i, j)

2.4. C++11 Threads

Alongside the well-established OpenMP and POSIX Threads, the C++11 standard has introduced a native C++ thread support [9]. The C++11 thread library includes standardized utilities for creating and managing threads; the standard library contains classes for the thread manipulation and synchronization, for protecting shared data, and for low-level atomic operations.

A parallel program based on the C++11 standard library is constructed by defining a function to be executed by a thread and starting a new thread running it. The synchronization in the C++11 standard is achieved by classical synchronization mechanisms such as mutex objects and condition variables, and by other mechanisms such as locks or features controlling the transfer of data between threads.

2.4.1. C++11 Threads - Simple Lock

In this paragraph, the multitasking synchronization using the mutex class is presented. The mutex can be used to protect shared data from being simultaneously accessed by multiple threads. The synchronization is enforced on the loop over the code numbers, as illustrated in Algorithm 9.

Algorithm 9: Prototype code - matrix assembly with explicit synchronization using simple locks (C++11 Threads)
    std::mutex MTX
    threads = new std::thread[Number_Of_Threads]
    for i = 0, Number_Of_Threads
        threads[i] = std::thread(AssemblyElementMatrix, ...)
    for i = 0, Number_Of_Threads
        threads[i].join()

    void AssemblyElementMatrix ( ... )
        for elem = 1, nelem
            Ke = computeElementMatrix(elem)
            Loce = giveElementCodeNumbers(elem)
            MTX.lock()
            for i = 1, Size(Loce)
                for j = 1, Size(Loce)
                    K(Loce(i), Loce(j)) += Ke(i, j)
            MTX.unlock()
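A compilable counterpart of Algorithm 9 (vector assembly for brevity, using RAII locking via std::lock_guard instead of explicit lock()/unlock(); the data structures and partitioning are illustrative) could read:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Element { std::vector<double> Fe; std::vector<int> loc; };

    std::mutex g_mtx;

    // One thread assembles a contiguous range of elements; the loop over the
    // code numbers is guarded by a single global mutex, as in Algorithm 9.
    void assembleRangeLocked(std::vector<double>& F,
                             const std::vector<Element>& elems,
                             std::size_t begin, std::size_t end)
    {
        for (std::size_t e = begin; e < end; ++e) {
            const Element& el = elems[e];
            std::lock_guard<std::mutex> guard(g_mtx);   // unlocked at scope exit
            for (std::size_t i = 0; i < el.loc.size(); ++i)
                F[el.loc[i]] += el.Fe[i];
        }
    }

    void assembleVectorThreads(std::vector<double>& F,
                               const std::vector<Element>& elems,
                               unsigned nthreads)
    {
        std::vector<std::thread> pool;
        std::size_t chunk = (elems.size() + nthreads - 1) / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            std::size_t b = std::min(elems.size(), std::size_t(t) * chunk);
            std::size_t e = std::min(elems.size(), b + chunk);
            pool.emplace_back(assembleRangeLocked, std::ref(F), std::cref(elems), b, e);
        }
        for (std::thread& th : pool)
            th.join();
    }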
3. Performance evaluation

In this section, the performance and efficiency of the presented approaches are compared for both the matrix and vector assembly operations using two benchmark problems. The first benchmark problem is a 3D model of a nuclear containment, marked as Jete and shown in Figure 2. The second benchmark problem is a model of a 3D porous micro-structure, marked as Micro and shown in Figure 3.

Figure 2. The benchmark problem of a 3D finite element model of a nuclear containment (Jete).

name        nnodes  nelems  neqs
Jete250k    87k     67k     260k
Jete3M      899k    1M      3M
Micro250k   85k     80k     256k
Micro3M     1M      970k    3M

Table 1. Discretizations of the considered benchmark problems, where nnodes represents the number of nodes, nelems the number of elements, and neqs the number of equations.

Figure 3. The benchmark problem of a 3D finite element model of a porous micro-structure cube (Micro).

Different discretizations were generated for both benchmark problems with an increasing number of elements, see Table 1. Tetrahedral elements with a linear approximation were used in the case of the nuclear containment benchmark, while a structured grid of brick elements with a linear interpolation was used for the micro-structure benchmark. The porous micro-structure consists of two phases, representing the material and the voids, with different sets of elastic constants. In both cases, a linear structural analysis was performed and the structures were loaded by self-weight, which implies a nonzero contribution of every element to the external load vector. The benchmark problems are characterized by a different sparsity of the system matrix: the model of the porous micro-structure has significantly more nonzero entries than the model of the nuclear containment. For example, the number of nonzero entries in the stiffness matrix is 433M in the case of Jete3M, while the porous micro-structure model Micro3M has 1528M nonzero entries, the numbers of unknowns being similar.

All the presented parallelization approaches have been implemented in the OOFEM finite element code. The object-oriented C++ OOFEM code has been compiled using the g++ 4.5.3.1 compiler with the -O2 optimization flag. The tests have been executed on a Linux workstation (running Ubuntu 14.04 OS) with two Intel(R) Xeon(R) E5-2630 v3 CPUs @ 2.40GHz and 132GB RAM. Each CPU consists of eight physical and sixteen logical cores, allowing up to thirty-two threads to run simultaneously on the workstation. All the tests fit into the system memory.

For each benchmark problem, the individual strategies have been executed on an increasing number of processing cores, and the speedups (with respect to the serial single-CPU execution) have been evaluated as an average of three consecutive runs.
The performance of the individual strategies for the vector assembly (in terms of the achieved speedup versus the number of processors) is presented for the Jete250k test in Figure 4 and for the matrix assembly in Figure 8. Similarly, for the Jete3M test, the results are presented in Figures 5 and 9, for the Micro250k test in Figures 6 and 10 and, finally, for the Micro3M test in Figures 7 and 11. The following notation is used in the above-mentioned figures to distinguish the individual parallelization strategies implemented: OpenMP with critical sections (OMP - CS), OpenMP with simple locks (OMP - L), OpenMP with nested locks (OMP - NL), OpenMP with critical sections only in the update operation of the global matrix/vector (OMP - LCS), OpenMP with the atomic update directive only in the update operation (OMP - LATO), OpenMP with simple locks only in the update operation (OMP - LL), OpenMP using blocks of simple locks (OMP - B50, OMP - B500, OMP - B10^5), OpenMP based on colouring (OMP - CP), POSIX Threads based on colouring (PTH - CP), POSIX Threads with simple mutexes (PTH - M), POSIX Threads with recursive mutexes (PTH - RM) and, finally, C++11 Threads with simple mutexes (THR - M). The assembly process of the matrix/vector is composed of the evaluation of the local matrix/vector followed by its assembly (localization) into the global matrix/vector; the execution times of these two operations, together with the total execution time, are presented in Table 2 for a selected benchmark problem (Jete3M).

Note that the complexity of the greedy colouring algorithm is linearly proportional to the number of elements and thus it is the same as the complexity of the assembly algorithm itself. However, as already mentioned, the colouring is determined only once and can be reused in all subsequent assembly steps. Theoretically, the algorithms using colouring strategies should scale linearly up to the limit imposed by the memory bandwidth, but the implementations tested in this paper could not reach these performance levels. This requires further investigations that are beyond the scope of this paper.

The results for the vector assembly show a very similar trend: all the strategies yield approximately the same results in terms of scalability and speedup. From the results, it is evident that the speedups are far from ideal. For example, in the case of the Jete3M test, the speedups for 16 CPUs are in the range of 2.5-4. There are several possible reasons why only a sub-optimal scalability has been obtained. First, the individual tasks are not independent; the update of a global entry can be processed only by one thread at a time. Second, there is an additional overhead connected with the parallel algorithm (thread creation and management, synchronization) that is not present in the serial version. Third, the individual threads share common resources (the memory bus), whose performance does not scale accordingly. Moreover, one can observe a performance drop after exceeding the limit of 16 threads. This can be attributed to the hyper-threading technology specific to Intel processors [10], which shares some of the CPU resources (execution engine, caches, and bus interface) between the hyper-threaded cores. This trend is more pronounced on the larger tests (Jete3M and Micro3M). The POSIX Threads and C++11 Threads implementations show a better performance than the OpenMP versions, particularly for smaller numbers of threads.
The results for the sparse matrix assembly show different trends. Some strategies (OMP-LCS, OMP-LL, OMP-B50, OMP-B500, OMP-B10^5) were not even able to reach the performance of the serial algorithm and their speedups are less than 1. The colouring-based strategies (OMP-CP, PTH-CP) have a similar speedup trend for the Jete250k and Micro3M benchmark problems. However, PTH-CP clearly shows a better scalability than OMP-CP on the Jete3M benchmark, while the opposite trend can be observed on the Micro250k benchmark (OMP-CP clearly scales better than PTH-CP). Overall, the implementations based on the colouring algorithm do not perform well, which is a somewhat surprising observation. This can be (partially) explained by so-called false sharing. A typical SMP system has local data caches organized hierarchically in several levels [11], and the system guarantees cache coherence. False sharing occurs when one core modifies a memory location that also resides on a cache line of a different core. This invalidates the corresponding cache lines on the other cores, forcing them to be re-read every time any other thread has updated memory in the same cache line. False sharing seems to have a much bigger impact on the sparse matrix assembly performance than on the vector assembly performance. In our case, the false sharing effect was confirmed using the valgrind tool, monitoring the memory management and threading behaviour. The Jete250k benchmark problem run with OMP-CS or OMP-CP on 32 computational threads is used as a representative example for illustrating the false sharing effect, and the outputs from valgrind are presented in Table 3. The table reports the cache accesses for data: the parameter D refs represents the number of data fetches (reads and writes) in the system cache memory, D1 misses represents the number of data misses in the L1 cache, and LLd misses represents the number of data misses at the last-level cache (L3). From the reported results, one can see a slight increase in the number of cache misses in the case of the colouring algorithm compared to the approach using the synchronization based on OpenMP critical sections.

Number of threads                     1      2      4      8      12     16     32
Total time [s]                        36.7   23.81  13.83  11.70  13.51  12.34  16.59
Evaluation of local matrix [s]        13.83  7.07   3.61   1.73   1.23   0.93   0.465
Localization into global matrix [s]   22.87  16.74  10.22  9.97   12.28  11.41  16.225

Table 2. Matrix assembly total times, split into the evaluation of the local matrices and their localization into the global matrix, for the considered benchmark problem (Jete3M).

Figure 4. Speedups of the external force vector assembly for benchmark Jete250k.

Figure 5. Speedups of the external force vector assembly for benchmark Jete3M.
Figure 6. Speedups of the external force vector assembly for benchmark Micro250k.

Figure 7. Speedups of the external force vector assembly for benchmark Micro3M.

             OMP-CS           OMP-CP
D refs       547 165 746 467  558 299 751 110
D1 misses    42 487 418 875   43 071 933 527
LLd misses   16 625 383 973   16 650 025 976

Table 3. Numbers of cache data misses for the benchmark problem Jete250k.

To get a further insight into the problem, the Intel VTune Amplifier tool has been used to detect the false sharing [12]. Its memory access analysis uses hardware event sampling to collect the data for the metric. This monitoring reveals the growing importance of the false sharing effect with an increasing number of threads (particularly for 16 and more threads, when the Intel hyper-threading technology becomes active). This is demonstrated by the growing memory bound (the fraction of cycles spent waiting due to load or store operations on each cache level) and the growing average memory latency (the average load latency in cycles).

The individual speedups with and without synchronization for different scheduling options for the selected benchmark problem (Jete3M), using the critical section synchronization (A1), are presented in Figures 12 and 13. The results are quite interesting. From Fig. 12 (matrix assembly), one can clearly see that all the executions with synchronization outperform the ones without it. In our opinion, this indicates that the synchronization overhead is more than balanced by the better performance, which is most likely due to the single memory access at a specific time (ensured by the critical section). This can be attributed to false sharing.

Figure 8. Speedups of the sparse matrix assembly for benchmark Jete250k.

Figure 9. Speedups of the sparse matrix assembly for benchmark Jete3M.
At the same time, the results demonstrate that the effect of scheduling is negligible for the executions without synchronization (symbols without fill) and only partially relevant for the executions with synchronization (filled symbols); overall, the effect of scheduling is relatively small. This indicates that the limited memory bandwidth is the main reason for the sub-optimal scalability obtained.

The results for the vector assembly (Fig. 13) are different, as the differences between the speedups with and without synchronization are much smaller. This is most likely due to the less complex memory access pattern in the case of the vector assembly. Similarly to the matrix assembly, the results demonstrate that the effect of scheduling, as well as of the different scheduling options, is negligible. The results again confirm the significant role of the limited memory bandwidth. The POSIX Threads and C++11 Threads implementations performed best for lower numbers of threads but, overall, the OpenMP implementations (OMP-L, OMP-NL) and OMP-CS performed the best.

In [4], the authors reported a speedup of 11 for 12 threads. However, these results were achieved on a slightly different architecture, where the computational node consists of two Intel Xeon X5650 2.66GHz processors with 6 cores each and the results are given for 12 threads; therefore, they were not affected by hyper-threading.

Figure 10. Speedups of the sparse matrix assembly for benchmark Micro250k.

Figure 11. Speedups of the sparse matrix assembly for benchmark Micro3M.

4. Conclusions

The paper evaluates different parallelization strategies for the right-hand side vector and stiffness matrix assembly operations, which are among the critical operations in any finite element software. The strategies use different techniques to ensure consistency and are implemented using the OpenMP, POSIX Threads, and C++11 Threads libraries. The performance of the individual strategies and libraries is evaluated using two benchmark problems (a 3D structural analysis of a nuclear containment and of a 3D micro-structure), each with two different mesh sizes. For the particular benchmark cases considered, the performance of nearly all strategies has been much better than the performance of the serial algorithm, with a relatively good scalability; however, in the case of the matrix assembly, considerable differences exist and the presented work provides an insight on how to select the optimal strategy.

The achieved results clearly show a performance drop on systems with hyper-threading technology when the number of threads exceeds the number of physical cores.
Somewhat disappointing are the results of the assembly based on the colouring approach, which did not perform as expected, the performance most likely being affected by the false sharing phenomenon.

The main conclusion of this study is that the performances of the individual libraries are comparable, but the performances of the individual strategies differ, often significantly.

Figure 12. Speedups of the sparse matrix assembly with and without synchronization for different scheduling options for benchmark Jete3M using critical section synchronization (A1).

Figure 13. Speedups of the external force vector assembly with and without synchronization for different scheduling options for benchmark Jete3M using critical section synchronization (A1).

In general, the presented paper illustrates the potential of parallel assembly operations and the importance of benchmarking, allowing to identify an optimal strategy.

Acknowledgements

This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS16/038/OHK1/1T/11 - Advanced algorithms for numerical modeling in mechanics of structures and materials. The second author was supported by OP VVV, Research Center for Informatics, CZ.02.1.01/0.0/0.0/16019/0000765.

References
[1] B. Patzak. OOFEM. http://www.oofem.org, 2000.
[2] C. Hughes, T. Hughes. Parallel and Distributed Programming Using C++. Pearson Education, 2003.
[3] H. Meuer, E. Strohmaier, J. Dongarra, et al. Top 500 list home page. https://www.top500.org/.
[4] P. Jarzebski, K. Wisniewski, R. L. Taylor. On parallelization of the loop over elements in FEAP. Computational Mechanics 56(1):77-86, 2015. doi:10.1007/s00466-015-1156-z.
[5] B. Barney. OpenMP. https://computing.llnl.gov/tutorials/openMP/, 2014.
[6] A. Marowka. Book review of "Using OpenMP: Portable Shared Memory Parallel Programming" (Chapman, B. et al., 2007). IEEE Distributed Systems Online 9(1):3, 2008. doi:10.1109/mdso.2008.1.
[7] B. Nichols, D. Buttlar, J. P. Farrell. Pthreads Programming. O'Reilly and Associates, 1996.
[8] D. B. West. Introduction to Graph Theory, vol. 2. Prentice Hall, Upper Saddle River, 2001.
[9] A. Williams. C++ Concurrency in Action. Manning Publications Co., 2012.
[10] D. T. Marr, F. Binns, D. L. Hill, et al. Hyper-threading technology architecture and microarchitecture. Intel Technology Journal 6(1), 2002.
[11] J. Handy. The Cache Memory Book. Academic Press, Inc., 1998.
[12] Intel Corporation. Intel VTune Performance Analyzer. http://software.intel.com/en-us/intel-vtune/, 2019.