Acta Polytechnica CTU Proceedings


doi:10.14311/APP.2020.26.0007
Acta Polytechnica CTU Proceedings 26:7–12, 2020 © Czech Technical University in Prague, 2020

available online at http://ojs.cvut.cz/ojs/index.php/app

ON TUNING THE DYNAMIC LOAD BALANCING FEM
FRAMEWORK

Michal BOŠANSKÝ∗, Bořek PATZÁK

Czech Technical University in Prague, Faculty of Civil Engineering, Department of Mechanics, Thákurova 7,
166 29 Prague 6, Czech Republic

∗ corresponding author: michal.bosansky@fsv.cvut.cz

Abstract. Developments in computer hardware are currently bringing new opportunities for numerical
modelling. The current trend in technology is parallel processing making use of multiple processing
units simultaneously to solve a given problem. This paper deals with exploring the parallel dynamic
load balancing framework implemented in the finite element software. This framework is based on a
domain decomposition paradigm for distributed memory model. The paper describes the improved
technique to determine the actual processor weights related to performance of individual processing
units. The load recovery consisting in mesh (re)partitioning is based on actual processor weights. The
(re)partitioning process has to be performed during the simulation and whenever the load imbalance
is significant. The performance of the proposed technique is tested on the benchmark problem and
discussed.

Keywords: Distributed memory, domain decomposition, load balancing, message passing, processor.

1. Introduction
The main goal of parallel algorithms is partitioning
the problem into a set of smaller tasks that can be
solved simultaneously. The problem can be parti-
tioned only once before the solution process, which
represents static load balancing. When the problem is
partitioned during the solution, it represents dynamic
load balancing, see [1].

Ideal scalability is difficult to obtain because of the
overhead cost of the parallel algorithm (synchroniza-
tion and communication) and because some parts of
the problem are essentially sequential. An effective
load balancing process that (re)distributes the work
to individual computing units is needed in many cases
to obtain reasonable scalability.

One of the important characteristics of the parallel
algorithm is its computational scalability, which is
the most important goal in parallel computing. The
scalable parallel algorithm achieves a reduction in
execution time by using more processing units, ide-
ally in a linear trend. Ideal scalability is difficult to
obtain because of the overhead costs of the parallel
algorithm (synchronization and communication). Al-
most every parallel algorithm has an overhead cost
compared to a sequential version. Individual tasks
cannot be executed concurrently without synchroniz-
ing and communicating with other tasks. Some parts
of the algorithm are also essentially serial and can
only be executed by a single thread. In addition to
speedup, parallel computing allows large and complex
problems to be solved that could not be solved on a
single, well-equipped machine.
The Finite Element Method (FEM) has become a

widely used tool for solving problems described by
partial differential equations and been widely adopted

by engineering and scientific communities as a reli-
able numerical tool. In mathematics, the FEM is a
numerical method for finding a solution to boundary
value problems for an ordinary differential equation
(ODE) or a partial differential equation (PDE). In
the FEM, differential equations are converted into
an algebraic system of equations by using variational
methods aided by decomposition of the problem do-
main into sub-domains called elements with an ap-
propriate choice of interpolation functions. In many
cases the resulting system of equations is nonlinear.
In structural mechanics, the non-linearity can origi-
nate from nonlinear geometrical relationships (large
deformations), from constitutive relationships, and
from non-linear boundary condition (e.g. follower
type of loading). Nonlinear problems are solved in-
crementally, typically using the N ewton − Raphson
algorithm. This makes the nonlinear problem solution
more demanding than linear problems. The problem
is solved in a series of load or displacement increments
in which the equilibrium state is iteratively searched.

The paper examines the solutions of large scale engi-
neering problems by using parallel computation based
on a parallel load-balancing framework. This frame-
work can significantly reduce computational time by
using available hardware more efficiently. This paper
presents improved algorithm for determining proces-
sor weights in load-balancing FEM framework. When
solving the real problem on the parallel computer, the
workload of individual processors can change during
the solution. The first source of imbalance originates
from the character of the problem. For example, in
nonlinear problems, the transition from initial elastic
material response to nonlinear regime is often asso-
ciated with increased computational cost and this

7

http://dx.doi.org/10.14311/APP.2020.26.0007
http://ojs.cvut.cz/ojs/index.php/app


Michal BOŠANSKÝ, Bořek PATZÁK Acta Polytechnica CTU Proceedings

transition is often associated only to certain regions of
overall domain. The second source of load imbalance
includes external factors, which can change perfor-
mance of individual processing nodes or communica-
tion network. This typically happens in non-dedicated
cluster environments, where processing nodes and com-
munication infrastructure is shared between users. In
both cases, the gradual grow of imbalance can have
significant effect on performance and on scalability.
The load-balancing process is driven not only by

emerging applications but also by emerging parallel
architectures. These parallel architectures span many
scales. For example, clusters have become viable al-
ternatives to tightly coupled parallel computers in
small scale systems. On the medium scale, supercom-
puters are constructed as networks of shared memory
multiprocessors (SMPs) and have complex and non-
homogeneous interconnection topologies [2]. Finally,
on the largest scale, grid technologies have enabled
computations on widely distributed systems, combin-
ing distributed clusters and supercomputers into a sin-
gle computational resource. These grids are a source
of extreme computational power and high network
heterogeneity. In order to effectively distribute data
from any program on such systems, partitioning must
account for heterogeneity in the solution environment.
Load balancing requires evaluation of computing envi-
ronment parameters (e.g. computing speed, memory
and network availability) and determining how to
apply this information in the load-balancing process
(e.g. adjusting computational domain sizes, selecting
partitioning algorithms).
The capabilities and performance of the load-

balancing framework depends on many aspects, in-
cluding problem at hand and hardware configuration
and are illustrated on the solution of selected engi-
neering problem. The advantages of implementing
this approach are discussed in this paper.

2. Introduction and overall
design of Load Balancing
Framework

The models are often solved on complex domains lead-
ing to many degrees of freedom and often nonlinear
effects have to be taken into account. Parallel algo-
rithms should account for load imbalance between
particular sub-domains. The imbalance can occur due
to several reasons: (i) first load imbalance factor is
part of the character of the solution process, as if
switching from a linear to nonlinear response in some
regions, (ii) second load imbalance factor is an exter-
nal factor concerning resource relocation, typical in
cases when a solution process is run in cluster environ-
ments where the available individual computing units
are shared by other processes owned by the system
or users and lead to changes in allocating processing
power. The imbalance is detected in the solution pro-
cess by monitoring processes during the run time of

the process. The decision depends on the amount of
load imbalance and the cost of load redistribution. Re-
distributing the work leads to serializing the problem
data (elements, nodes, boundary conditions) into mes-
sages sent over the network and a receiving process
based on unpacking, followed by a topology update
reflecting the new work distribution.

The idea of parallelization strategy is based on the
domain decomposition paradigm. In general, two
dual partitioning techniques for the parallel distri-
bution of finite element codes exist, node-cut and
element-cut strategy [3]. In this paper, the node-cut
strategy, dividing cut dividing the problem mesh into
partitions is given in the node-cut technique. The
node-cut strategy is assumed when cut dividing the
problem domain runs through the nodes. Nodes on
mutual partition boundaries are called shared nodes,
and nodes inside individual partitions are called local
nodes. The node-cut partitioning scheme can be in-
terpreted as mesh decomposition using cuts passing
through shared nodes of the mesh without crossing
any element. The cut strategy ensures duplication of
the shared nodes. Each processing node is responsible
for assembling its contribution to the global system
matrix by summing the contributions from individual
elements. In order to minimize communication, it
is natural that each processing node will assemble
and maintain in its local memory the block of global
stiffness matrix rows corresponding to unknowns on
its partition. This is uniquely defined for local nodes,
which are exclusively shared by local elements on indi-
vidual partitions. For shared nodes, which are shared
by elements from multiple partitions, ownership must
be defined. In this work, the partition with the low-
est rank sharing the shared node is the owner of the
shared node and thus responsible for maintaining the
corresponding row entry of the global stiffness matrix.
It is clear that for shared nodes, the contributions to
the corresponding row have to be received from parti-
tions sharing a particular node. The assembly process
requires global, unique numbering of equations to be
established across the computing nodes in addition to
local numbering on individual partitions [4].
Partitioning based on the load balancing process

can be affected by many factors. Ideally, no processor
during the parallel solving process should be waiting
for another processor to complete the solution. In the
typical FEM code, the most of the work is proportional
to individual elements. The total computational work
is determined as the sum of the contribution of individ-
ual elements. The computational work of individual
element is expressed relatively to computational work
of reference element. The load balancing process is
based on distributing the overall computational work
to a group of available processors with respect to
their relative performance. Another significant fac-
tor concerns minimizing and reducing communication
between computational partitions. The assembly of
equilibrium equations for shared nodes requires to

8


vol. 26/2020 On Tuning The Dynamic Load Balancing FEM Framework

sum up contributions from local as well as remote par-
titions. The cost of accessing remote memory is much
higher than the cost of accessing the local memory.
It is therefore important to reduce communication
between computational partitions.

The present contribution examines the design of an
object-oriented framework for dynamic load balancing
implemented in OOFEM code [5].

3. Design of micro benchmarks
tests

Overall processor performance depends on many fac-
tors, notably on its frequency, the performance of
memory subsystem, and type of code executed. The
performance of two CPUs can be different for integer
and floating point operations, on some SMPs, some re-
sources are shared between processing cores, etc. The
adopted approach to evaluate the individual proces-
sor performances is based on a set of so-called micro
benchmark tests, that evaluate processor performance
for different typical tasks and computing overall per-
formance as the weighted average of performances of
individual micro-benchmarks. The processor weights
are updated each time, when rebalancing should oc-
cur. It is therefore important, that the individual
benchmarks have very low time demands and on the
other hand provide reliable estimation of individual
processors.

The first micro benchmark test proposed is a func-
tion to compute a numerical integral. In the test, the
integral of Cosine function is evaluated using numer-
ical integration using Newton-Cortes formula. The
parameter ni defines the number of integration points
used to divide the area. Therefore, with increasing pa-
rameter ni (increasing number of integration points),
we have a solution process closely approximating the
integral value. This test measures performance in
floating point operations.
The next type of micro benchmark problem was

based on solving a linear system of equations. A
FEM model of the cantilever beam, divided into neq
elements is used to define a linear system. The right
hand side of the force vector in our case is a simple
vector with its first member set to 1, while all the other
elements of the vector are set to zero. The number
of equations of the linear system directly depends on
parameter neq , which also represents the dimension
of the stiffness matrix and right hand side of the force
vector.

Another micro benchmark problem called Whet-
stones is a test that attempts to measure the speed
and efficiency of a computer performing floating-point
operations. The Whetstone benchmark was the first
designed for benchmarking [6]. This benchmark is
very simple, comprising several sub-tests with active
loops executed via procedure calls. This module rep-
resents a mix of operations typically performed in
scientific applications. The test involves integer arith-
metic, floating point arithmetic, "if" statements, calls

and so on. At the end of this benchmark, a statement
with the results is printed. Weights were attached
to the different modules (realized as loop bounds for
loops around the individual modules statements). The
weight distribution of Whetstone instructions for the
benchmark matched the distribution seen in the pro-
gram sample. The execution frequency of each module
was proportional to the input value of such that the
scaling factors for nl = 10 gave the modules a to-
tal weight corresponding to one million Whetstone
instructions for more details see [7].
The final load balancing processor weighs are ob-

tained as a weighted average of individual weights
from micro bench-mark tests. The final weights are
assembled in the ratio of 15% micro benchmark prob-
lem based on a function to compute numerical integral,
80% benchmark problem based on the solution of a
linear system of equations and 5% benchmark problem
based on measuring the speed and efficiency of a com-
puter performing floating-point operations. The time
requirements of individual micro benchmark problems
were controlled using parameter n as follows: bench-
mark to compute numerical integral with parameter
ni = 106, solution of a linear system of equations
with parameter neq = 103 and computer performing
floating-point operations with parameter nl = 6 ∗ 104.

4. Dynamic load balancing
framework

The software design process is an important part of
dynamic load-balancing research. The design of a dy-
namic load balancing framework based on a problem
partitioned during the solution process is discussed
in this section. Dynamic load balancing is typically
used for problems involving multiple solution or time
steps, where work redistribution is performed to reflect
potential work imbalance. Partitioning the problem
during solution is based on monitoring computation
time of individual processors during the solution step.
Once the solution step is finished, the load balance
process is activated by evaluating the measured imbal-
ance. If the imbalance is larger than the user defined
tolerance, the (re)partitioning is performed, taking
into account the actual performance of processing
nodes and processing cost of individual elements. As
already noted, when solving a real problem on a par-
allel computer, the load balance can change during
the solution.
The only way of reflecting the growing imbalance

is to adaptively redistribute work between processing
units in order to restore load balance and thereby
ensure optimal use of resources. The monitoring and
evaluation of imbalance is the task of Load Balance
Monitor. This monitor can detect load imbalance by
monitoring the time required to perform allocated
work on individual processing units. The differences
in processing time indicate an imbalance. After im-
balance is detected, the decision of whether to restore
the load balance or continue is made. This can be

9


Michal BOŠANSKÝ, Bořek PATZÁK Acta Polytechnica CTU Proceedings

a complex task, as load redistribution may in fact
be a very complex problem with non-negligible time
requirements. The cost of load (re)balancing may be
higher than the cost of continuing with a slight load-
imbalance. All these aspects have to be considered
and are, unfortunately, specific to the problem and
implementation.
When the imbalance is detected, the work (ex-

pressed in terms of individual elements) has to be
redistributed to reflect the performance of individual
processing nodes. Individual elements have assigned
weights that reflect the relative computational costs.
These depend on element type, material state (elastic,
plastic, etc.). These weights are typically obtained in
advance from benchmark measurements. The process-
ing power of individual processing units also has to
be determined. The load (re)balancing process in this
chapter is based on (re)distributing the computational
work proportionally to the performance of individual
processing units. In principle, additional factors, in-
cluding available communication bandwidth between
individual processing units or available memory, can
be taken into account. Any load balancing should not
only distribute the work according to processing pow-
ers but also attempt to minimize the communication
cost between individual processing units. In the FEM,
this means the cuts between partitions (number of
shared nodes) should be minimized. Failing to meet
the secondary criteria can significantly impact overall
performance, as the cost of communication (in terms
of the time required) is much higher than the cost of
computation. Load (re)balancing should also attempt
to minimize the reallocation of elements as much as
possible in order to minimize communication costs.
The ParMETIS library [8] was used in this work,

however, the ParMETIS implementation is not re-
stricted to one specific partitioning library. A parallel
partitioner can take advantage of the increased mem-
ory capacity of parallel machines (distributed memory
model) and improve overall performance. A general
load balancing algorithm is responsible for dynamic
repartitioning using the processor weights provided
by the load balance monitor during the micro bench-
mark problem and should provide the new partitions
with numbers for all local elements on each partition.
After the updated element partition assignment is de-
termined, the distribution and classification of nodes
also have to be determined. Each node is classified as
either a local node that remains local on an existing
partition or a shared node that is assigned to a re-
mote partition. Node classification can be determined
from element partitioning. The dynamic load bal-
ancing framework implemented in OOFEM is using
the weights based on measuring of individual time
involved in the computation. These times represent
how long each CPU work on the predefined number
of solution steps. The repartitioning is based on de-
fault weights set as the ratio of total solution time to
individual computational thread time.

5. Example using a dynamic load
balancing framework

The above-mentioned strategy is illustrated on nonlin-
ear 3D finite element analysis of an anchor pullout test.
The FE model involved nonlinear nonlocal anisotropic
damage model to describe fracture process and con-
sists of 1456 nodes and 16772 tetrahedral elements,
which was subsequently refined in 20 steps into a final
mesh with 125400 elements and 22441 nodes. The
number of solution steps was set to 80 steps. Due to
the character of the solution, the number of required
equilibrium iterations was increasing with progressing
solution steps. The number of required equlibrium
iterations gradually increased form 9 (first solution
step) to 263 (solution step number 80). The existing
and proposed processor weight evaluation methods
were tested on a workstation (running Ubuntu 16.04
OS) with an Intel ® Core ™ i7-4790 CPU @ 3.60 GHz
with four cores consisting of two logical processors per
core connected to a workstation with an Intel® Core™
i3-2370M CPU @ 2.40 GHz with two cores consisting
of two logical cores. Each workstation could simul-
taneously run a maximum of eight threads and four
threads on their CPUs and had 15 GB and 8 GB of
system memory, respectively.

The iterative linear equation solver from the PETSc
library was used with a block Jacobi preconditioner.
The dynamic load balancing framework parameters
were set up. Load (re)balancing was activated af-
ter each fifth solution step. The relative wall clock
imbalance parameter represents the relative imbal-
ance between wall clock solution time of individual
computational threads. When greater than the pro-
vided threshold the rebalancing procedure is activated.
The additional parameter called absolute wall clock
imbalance parameter allows to triggered rebalancing
procedure when an achieved absolute imbalance be-
tween solution times of individual processing threads
is greater than the threshold. The absolute wall clock
imbalance parameter was set to 10.0 [s]. The mini-
mum absolute wall clock imbalance parameter allows
setting minimum absolute wall clock imbalance thresh-
old for performing rebalancing and was set to 0.5 [s].
The obtained results are presented in Tables 1, 2

and 3. In the presented tables, the number of process-
ing units used in the parallel computations is marked
as "NP". The results obtained using the existing dy-
namic load balancing framework without the micro
benchmark tests to set up processor weights were com-
pared to the results of the computation process with-
out load balancing. These load balancing strategies
were finally compared to the presented method based
on evaluating the microbenchmark tests. Tables 1
and 2 show examples of computation done by pro-
cessors with hyper-threading [9] technology disabled
(Intel i3 max. 2 threads, Intel i7 max. 4 threads). The
computational process with hyper-threading technol-
ogy enabled on the i3 processor is shown in Table 3.

10


vol. 26/2020 On Tuning The Dynamic Load Balancing FEM Framework

NP No DLB [s] DLB [s] DLB W. [s]
2 1659 1473 1411
3 1408 1399 1230
4 1372 1215 1181
5 1277 1179 1131

Table 1. Execution times comparing solutions with
estimated processing weights (DLB W.), uniform pro-
cessing weights (DLB) and without a load balancing
(No DLB) framework (Intel i3 - 1 computing unit +
Intel i7 - 4 computing units).

NP No DLB [s] DLB [s] DLB W. [s]
4 1524 1479 1373
5 1365 1314 1154
6 1330 1302 1140

Table 2. Execution times comparing solutions with
estimated processing weights (DLB W.), uniform pro-
cessing weights (DLB) and without a load balancing
(No DLB) framework (Intel i3 - 2 computing units +
Intel i7 - 4 computing units).

NP No DLB [s] DLB [s] DLB W. [s]
2 1847 1662 1373
3 1606 1339 1303
4 1610 1340 1435
5 1853 1642 1578

Table 3. Execution times comparing solutions with
estimated processing weights (DLB W.), uniform pro-
cessing weights (DLB) and without a load balancing
(No DLB) framework (Intel i3 - 4 computing units
(HT) + Intel i7 - 4 computing units).

The results confirm that better performance is ob-
tained when appropriate weights are used. In the
case with hyper-threading technology enabled, the
scalability trend is not ideal in the solution based on
six and eight computational threads and is worse, for
example, than the solution based on two and four
computational threads. The significant decrease in
performance can be attributed to the hyper-threading
technology of Intel processors, which assembles and
shares some CPU resources between hyper-threaded
cores and only takes place with 4 threads (note that
the workstation with Intel® Core™ i3 had two physical
hyper-threaded cores). However, by using dynamic
load balancing and appropriate weights (processor
performance parameter), we can achieve better per-
formance than in the solving process without load
balancing or with the existing dynamic load balancing
framework not using micro benchmark tests.

6. Conclusions
The contribution based on an improved methodology
to determine processing weights parameters as a part

of the load balancing framework was presented in this
paper. The parallelization strategy was based on the
dynamic load balancing process using weights that rep-
resented the performance of computational units. The
performance of the upgraded dynamic load balancing
process (processor weights parameters) was compared
to the previously implemented dynamic load balancing
process. In this work a nonlinear benchmark problem
was considered to evaluate the performance of dy-
namic load balancing framework. The results showed
differences in the performance of the upgraded and pre-
viously implemented dynamic load balancing process
for the considered benchmark. The upgraded dynamic
load balancing process had better performance than
the previously implemented dynamic load balancing
process. Future studies will investigate other testing
examples (simple linear equation system solutions) to
more precisely assess variables as a representation of
computational unit performance. It is clearly a benefit
to using the appropriate measures of performance of
individual computational units performance of com-
putational units in the load balancing process as a
necessary part of any Finite Element parallel code.

The performance of the described load balancing
framework could potentially be significantly improved
by considering communication speeds between pro-
cessing nodes by using parameters that represent com-
munication speeds through different, heterogeneous
networks. The load balancing framework is based
on parallel computation using a distributed memory
model. This model uses a message passing interface.
For example, many distributed systems are now being
constructed using a variety of different communica-
tion networks, such as Ethernet and Asynchronous
Transfer Mode (ATM). In addition to this hardware
heterogeneity, there is a heterogeneity in the types of
messages produced by parallel programs. Short syn-
chronization messages require low latency. Conversely,
large data messages require high-bandwidth, though
they can tolerate high start-up latency. The different
types of networks have different performance charac-
teristics, while the different types of communication
messages may have different communication require-
ments. The performance of parallel computations
based on a distributed memory model is typically lim-
ited by communication overhead. High-performance
networks and the use of multiple heterogeneous net-
works can help reduce this overhead. Further research
could focus on designing the load balancing frame-
work’s input parameters, which represent different
types of networks providing a data path between the
same pair of network nodes. These load balancing
parameters (network weight parameters) may allow a
solution process to maximize efficient use of different
types of networks rather than passively accept the
given features of a single network.

11


Michal BOŠANSKÝ, Bořek PATZÁK Acta Polytechnica CTU Proceedings

Acknowledgements
This work was supported by the Grant Agency of
the Czech Technical University in Prague, grant No.
SGS16/038/OHK1/1T/11 - Advanced algorithms for nu-
merical modelling in mechanics of structures and materials.
The authors acknowledge the support of the OP VVV
MEYS funded project
CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for
Informatics".

References
[1] K. Schloegel, G. Karypis, V. Kumar. Parallel static
and dynamic multi-constraint graph partitioning.
Concurrency and Computation: Practice and Experience
14(3):219–240, 2002. doi:10.1002/cpe.605.

[2] K. D. Devine, E. G. Boman, R. T. Heaphy, et al. New
challenges in dynamic load balancing. Applied
Numerical Mathematics 52(2-3):133–152, 2005.
doi:10.1016/j.apnum.2004.08.028.

[3] P. Krysl, Z. Bittnar. Parallel explicit finite element
solid dynamics with domain decomposition and message
passing; deal programming scalability. Comput Struct
79(3):45–60, 2001.

[4] M. Bošanský, B. Patzák. Parallel Approach To Solve
Of The Direct Solution Of Large Sparse Systems Of
Linear Equations. Acta Polytechnica CTU Proceedings
13:16, 2017. doi:10.14311/app.2017.13.0016.

[5] B. Patzak. OOFEM. http://www.oofem.org, 2000.
[6] H. J. Curnow. A synthetic benchmark, vol. 19. Oxford
University Press (OUP), 1976.
doi:10.1093/comjnl/19.1.43.

[7] S. Harbaugh, J. A. Forakis. Timing studies using a
synthetic whetstone benchmark. ACM SIGAda Ada
Letters IV(2):23–34, 1984. doi:10.1145/998395.998396.

[8] V. K. G. Karypis. ParMETIS: Parallel graph
partitioning and sparse matrix ordering library.
Department of Computer Science, University of
Minnesota„ 1997.

[9] D. H. e. a. D. Marr, F. Binns. Hyper-threading
technology architecture and microarchitecture. Intel
Technology Journal 6(1), 2002.

12

https://doi.org/10.1002/cpe.605
https://doi.org/10.1016/j.apnum.2004.08.028
https://doi.org/10.14311/app.2017.13.0016
https://doi.org/10.1093/comjnl/19.1.43
https://doi.org/10.1145/998395.998396

	Acta Polytechnica CTU Proceedings 26:7–12, 2020
	1 Introduction
	2 Introduction and overall design of Load Balancing Framework
	3 Design of micro benchmarks tests
	4 Dynamic load balancing framework
	5 Example using a dynamic load balancing framework
	6 Conclusions
	Acknowledgements
	References