Original Research

A web-based feedback study on optimization-based
training and analysis of human decision making
Michael Engelhart1, Joachim Funke2, and Sebastian Sager1
1 Faculty of Mathematics, Otto-von-Guericke-Universität Magdeburg and 2 Ruprecht-Karls-Universität Heidelberg

The question “How can humans learn efficiently to make
decisions in a complex, dynamic, and uncertain envi-
ronment” is still a very open question. We investigate
what effects arise when feedback is given in a computer-
simulated microworld that is controlled by participants.
This has a direct impact on training simulators that are
already in standard use in many professions, e.g., flight
simulators for pilots, and a potential impact on a better
understanding of human decision making in general.
Our study is based on a benchmark microworld with an
economic framing, the IWR Tailorshop. N=94 partic-
ipants played four rounds of the microworld, each 10
months, via a web interface. We propose a new approach
to quantify performance and learning, which is based on
a mathematical model of the microworld and optimiza-
tion. Six participant groups receive different kinds of
feedback in a training phase, then results in a perfor-
mance phase without feedback are analyzed. As a main
result, feedback of optimal solutions in training rounds im-
proved model knowledge, early learning, and performance,
especially when this information is encoded in a graphical
representation (arrows).

Keywords: Complex problem solving, training, dynamic decision
making, feedback, mixed-integer nonlinear optimization, Tailorshop

Modern life imposes daily decision making, oftenwith important consequences. Illustrative exam-
ples are politicians who decide on actions to overcome
a financial crisis, medical doctors who decide on com-
plementary chemotherapy drug delivery strategies, or
entrepreneurs who decide on long-term strategies for
their company.
The process of human decision making is the sub-

ject of research in the field of Complex Problem Solving
(CPS), which deals with complex problems. The com-
plexity may result from one or several different charac-
teristics, such as a coupling of subsystems, nonlineari-
ties, dynamic changes, opaqueness, or others (Dörner,
1980). Such problems are considered to be similar
to problems we encounter and solve in everyday life.
Thus, investigation of CPS is claimed to yield more
insight into real-world human decision making than
simple problems with a well-defined problem space,
like the Tower of Hanoi. Apparently, our introduc-
tory examples are complex problems and as such, they
are ill-defined. More precisely, their problem space is
open and a problem solver has to deal with lots of vari-
ables, dependencies and dynamics making them com-

plex problems: Which information is relevant? How is
the data connected? What is the exact aim?

The main intention in CPS research is to under-
stand how certain exogenous variables influence a
solution process. In general, personal and situa-
tional variables are differentiated. The most typical
and frequently analyzed personal variable is intelli-
gence. It is an ongoing debate how intelligence in-
fluences complex problem solving (Wittmann & Hat-
trup, 2004). Other interesting personal variables
are working memory (Robbins et al., 1996), amount
of knowledge (Kluwe, 1993), and emotion regula-
tion (Otto & Lantermann, 2004). Situational vari-
ables like the impact of goal specificity and observa-
tion (Osman, 2008), feedback (Brehmer, 1995), and
time constraints (Gonzalez, 2004) attracted less atten-
tion. In a recent work (Selten, Pittnauer, & Hohnisch,
2012), an abstract computer-simulated monopoly mar-
ket is used to investigate dynamic decision making
based on the choice of goal systems. For investiga-
tions in the field of CPS, computer-based simulations
of small parts of the real world, microworlds, are fre-
quently used. These simulations present users with
situations similar to those encountered when attempt-
ing to solve real-world complex problems, but offer
researchers the possibility to conduct studies under
controlled conditions. In CPS, the performance of
participants in a clearly defined microworld is inves-
tigated, evaluated and correlated to certain character-
istics, such as the participant’s capacity to regulate
emotions.

Previous research with the microworld Tailorshop

One microworld that comprises a variety of properties
such as dynamics, complexity and interdependence,
discrete choices, lack of transparency, and polytely
in an economical framing is the Tailorshop. Partici-
pants have to make economic decisions to maximize
the overall balance of a small company, specialized
in the production and sales of shirts. The Tailor-
shop sometimes is referred to as the Drosophila for
CPS researchers (Funke, 2010) and thus is a promi-
nent example for a computer-based microworld. It has

Corresponding author: Sebastian Sager, Otto-von-Guericke-Universität
Magdeburg: sager@ovgu.de

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 1

mailto:sager@ovgu.de
http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

been used in a large number of studies, e.g., Putz-
Osterloh, Bott, and Köster (1990); Kluwe, Misiak,
and Haider (1991); Kleinmann and Strauß (1998);
Meyer and Scholl (2009); Barth (2010); Barth and
Funke (2010). Comprehensive reviews on studies with
Tailorshop have also been published, e.g., Frensch
and Funke (1995); Funke (2003); Funke and Frensch
(2007); Funke (2010).

The calculation of indicator functions to measure
performance of CPS participants is by no means triv-
ial. To measure performance within the Tailorshop
microworld, different indicator functions have been
proposed in the literature, see Danner, Hagemann,
Schankin, Hager, and Funke (2011) for a recent re-
view. Hörmann and Thomas (1989) proposed a com-
parison of the variable which the participants were re-
quested to maximize. Such a performance criterion
seems natural. However, it cannot yield insight into
the temporal process and is not objective in the sense
that the performance depends on what other partici-
pants achieved. Analyzing the temporal evolution of
other variables of this microworld has also been pro-
posed (see, e.g., Putz-Osterloh (1981); Süß, Oberauer,
and Kersting (1993); Funke (1983); Barth and Funke
(2010)). An obvious drawback of comparing the de-
velopment of variables which were not the actual ob-
jective for the participants is that a monotonic de-
velopment does not necessarily indicate good or even
optimal decision making.
The lacking availability of an objective performance

indicator is an obstacle for analysis and it has of-
ten been argued that inconsistent findings are due
to the fact that an objective indicator function yield-
ing detailed insight into the participants’ performance
is not available, e.g., in Wenke and Frensch (2003).
To overcome this problem, we propose to use indica-
tor functions based on optimal solutions. In Sager,
Barth, Diedam, Engelhart, and Funke (2010) as well
as ins Sager, Barth, Diedam, Engelhart, and Funke
(2011) the question of how to get a reliable perfor-
mance indicator for the Tailorshop microworld has
been addressed. Because all previously used indicators
have unknown reliability and validity, decisions are
compared to mathematically optimal solutions. For
the first time, a complex microworld such as Tailor-
shop has been described in terms of a mathematical
model. Thus, the assumption that the fruit fly of
complex problem solving is not mathematically acces-
sible has been disproved. This novel methodological
approach has also been combined with experimental
studies (Barth, 2010; Barth & Funke, 2010; Sager et
al., 2011) but beyond these works, has to our knowl-
edge not yet received much attention.

Training and relation to optimization

With tasks for humans becoming more complex in
the real-world, there is also an increasing need to
train and assist persons performing complex tasks. In
Hüfner, Tometzki, Kraja, and Engell (2011), a frame-
work for training engineering students in designing

controllers for complex systems like chemical reactors
is presented. In this approach, students can learn from
the results of simulations depending on their inputs.
In the context of CPS, an interesting approach would
be to determine optimal solutions and corresponding
controls for a microworld to compute a feedback for
participants to support and train them. However, as
Cronin, Gonzalez, and Sterman (2009) show, the pre-
sentation of information in a dynamic context is crucial
for the success of the participants. To the best of our
knowledge, there have been no studies investigating
the effects of an optimization-based feedback.
So far, CPS microworlds have been developed in a

purely disciplinary trial-and-error approach. A sys-
tematic development of CPS microworlds based on a
mathematical model, sensitivity analysis, and eventu-
ally optimization methods to choose parameters that
lead to a wanted behavior of the complex system has
not yet been applied. An example for this necessity is
the fact that the mathematical modeling of the Tai-
lorshop microworld in Sager et al. (2011) led to the
discovery of unwanted and unrealistic winning strate-
gies. Based on this experience with modeling odd-
ities, bugs, and other undesirable properties, a new
microworld has been built from scratch designed as
a mathematical model for CPS by Engelhart, Funke,
and Sager (2013), the IWR Tailorshop. The IWR Tai-
lorshop is the first CPS test-scenario with functional
relations and model parameters that have been formu-
lated based on optimization results yielding desirable
(mathematical) properties. Compared to the Tailor-
shop, the setting is slightly more general. For example,
machines have been replaced by production sites, and
vans by distribution sites.
The optimization problems that need to be solved in

the context of the IWR Tailorshop scenario are mixed-
integer nonlinear programs (MINLP) with non-convex
continuous relaxations. Whenever optimization prob-
lems involve variables of continuous and discrete na-
ture together, the term mixed-integer is used. In this
case they can be interpreted as discretized optimal con-
trol problems (dMIOCPs). We use the mathematical
approaches presented in Engelhart et al. (2013) and
Engelhart (2015) that are based on a tailored decom-
position technique to determine ε–optimal solutions
for IWR Tailorshop in (almost) real time.

About this study

In the interest of a compact presentation we focus on
the most important results of a study which has been
described in full detail in the PhD thesis of Engelhart
(2015).

Method

We describe the Tailorshop microworld, the feedback
study with the experimental groups, the hypotheses,
a prestudy, details of the data collection, and the sta-
tistical methods.

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 2

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

IWR Tailorshop: A new complex microworld

We work with a systematically built new microworld
with controlled properties, the IWR Tailorshop. It was
first described in Engelhart et al. (2013) and Engelhart
(2015) and is based on the economical framing of Tai-
lorshop. Table 1 lists all states and controls (interven-
tions for the participants) that the IWR Tailorshop
contains together with corresponding units. The final
mathematical model of the IWR Tailorshop consists of
14 state variables x (i.e., dependent variables) and 10
control variables u (i.e., independent variables) includ-
ing 5 integer controls. All equations and constraints,
the objective function, and the parameter and initial
values are specified in the Appendix.

States Variable Unit∗

employees xEM person(s)
production sites xPS site(s)
distribution sites xDS site(s)
shirts in stock xSH shirt(s)
resources in stock xRS shirt(s)
production xPR shirt(s)
sales xSA shirt(s)
demand xDE shirt(s)
reputation xRE —
shirts quality xSQ —
machine quality xMQ —
resources quality xRQ —
motivation of empl. xMO —
resources price∗∗ xRP MU/shirt
capital xCA MU

Controls Variable Unit∗

shirt price uSP MU/shirt
advertising uAD MU
wages uWA MU/person
working conditions∗∗ uWC MU
maintenance uMA MU
buy resources∗∗ uDRS shirt(s)
sell resources∗∗ udRS shirt(s)
resources quality uRQ —
recruit/dismiss empl. udEM/uDEM person(s)
create production site uDPS site(s)
close production site udPS site(s)
create distribution site uDDS site(s)
close distribution site udDS site(s)

Table 1. States and controls in the IWR Tailorshop microworld
(∗ MU means monetary units, ∗∗ not part of the final model for
the web-based study).

The equations describe how the different state and
control variables are connected. Some of these equa-
tions may be trivial, as, for example, the number of
production sites (xPS) in Equation (A.1b) in the Ap-
pendix, where the numbers of newly created (uDPS)
or closed distribution sites (udPS) are added to or sub-
tracted from the current value. They may also involve
more variables and include nonlinear expressions as,
e.g., in the demand which depends nonlinearly on shirt
price, advertisement, reputation, and others, compare
Equation (A.1d). These mathematical relations are
intransparent to the study participants, as it is a part
of the task to explore and understand the microworld.

The objective is the maximization of the capital at
the end of the discrete time-scale in this work, see
Equation (A.4) in the Appendix. The constraints are
basically bounds on the controls or non-negativity of
variables. The objective is communicated to partici-
pants, the constraints can be determined from admis-
sible values in the web interface.
IWR Tailorshop has been implemented including

different optimization-based feedback methods in a
web-based interface, compare Figure 1. For the analy-
sis of data collected with this interface, optimization-
based analysis methods have been implemented in the
analysis software Antils. Both the web front end and
the analysis back end are available as open-source soft-
ware under the GPL (GNU General Public License)
and thus can easily be used for further investigations.
Analysis and feedback based on optimal solutions en-
abled insights on human decision making which else
would not have been possible.

A web-based feedback study

From November to December 2013, we conducted a
feedback study with the described IWR Tailorshop
microworld. We collected data from 148 participants
(N = 94 after removal of incomplete datasets and out-
liers, see below) and applied our optimization-based
analysis and feedback approach. The participants were
asked to play four rounds with 10 "months" each of the
economic simulation via its web interface. Different
approaches for both feedback computation and feed-
back presentation have been applied in the first two
rounds (so-called training or feedback rounds). In the
last two rounds, however, no one received any feed-
back. These rounds will be referred to as performance
rounds.

Task. Participants had to play four rounds of the
IWR Tailorshop microworld of 10 months each via its
web interface. They were allowed to interrupt the pro-
cess at any time. For the four rounds, different initial
values were used, see Table A.3 in the Appendix, but
the same for all participants. Rounds 1 and 3 started
with the same values, whereas in rounds 2 and 4 pair-
wise different values were used. Control values for re-
cruitment and dismissal of employees and creation and
closing of sites were always reset to 0 in order to avoid
accidental execution.

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 3

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

Figure 1. The IWR Tailorshop web interface with arrows as feedback for the trend group (compare Figure 2) and a hint for maintenance
control.

F E E D B A C KO P T I M I Z A T I O N

4 Bar chart

1 Highlight variables

55

2 Show arrows

55

55

3 Toggle values

38

55

B Start optimization in xk, 
fix decisions uk with constraints
Artificial constraints for uk yield
sensitivities

A Start optimization in xk+1
Identical to the start values the
participant will have for next decisions uk+1

xk+1xk

uk uk+1uk-1

C Start optimization in xk
Identical to the start values, the
participant had for decisions uk

D Start optimization in xk+1,
fix single decision with constraint
Compute online when one variable is changed
and give feedback, which variables now
should be changed

Figure 2. Optimization-based feedback at month k + 1: on the left hand side, there are different methods to compute a feedback and on
the right hand side there are different types of feedback presentation. Optimization method A is used with feedback presentations 1, 2,
and 3 (corresponding to indicate group, trend group, and value group) and optimization method B is used with feedback presentation 4
(corresponding to chart group). Note that xk refers to the state variables and uk refers to the control variables of month k.

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 4

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

As an incentive, there was a competition with
chances weighted according to success in which partic-
ipants could win one of six 20 euro Amazon gift cards.
For this, only the results of performance rounds were
considered.

Procedure. For the main task, the control of the
IWR Tailorshop microworld, the participants received
guidance by the following introduction:

Thank you! Now you can start into the
IWR Tailorshop microworld. Please
note, that you need to finish 4 rounds of 10
"months" each to participate in the compe-
tition.
All in all it will take you about 30–45 min-
utes. You ideally play all 4 rounds at a
stretch, but you may interrupt after each
“month” and continue at a later date. The
first two rounds are training rounds, only
your points (not your rank) in the last two
rounds are considered for the drawing.
Now, please imagine you are the head of a
company, which produces shirts. Your
aim is to maximize the company’s capi-
tal at the end of each round, i.e., in month
10. For this there are several possibilities of
intervention available, which will be located
in the lower part. In the upper part you will
find important figures of your company.
However, your intervention possibilities are
subject to certain constraints, e.g., you are
not allowed to close all company sites. At the
end of each round, you will find a highscore
table and after the last round the table, which
is important for the competition. In the blue
hint box you can find assistance and useful
hints during your game. Good luck!

The hint box the introduction refers to was displayed
at the left side and contained hints corresponding to
the situation and the feedback group the participant
was in, compare Figure 2, e.g.,

During your first two rounds, you will receive
assistance to improve your performance. We
will show you arrows next to the inter-
ventions to indicate in which direction the
mathematically optimal decision for the next
month is, depending on the decision shown
at the beginning of the month. The arrows
will be thicker if the optimal decision is far
away, but will not change when you change
the values.

Hints on each state and control, e.g., “the wages for
each employee per month in money units” for control
wages, were available as a tooltip on mouse rollover.
After each round, participants were shown an anony-
mized highscore list with the top 20 participants in
their group.

Additional variables. Additional information on the
participants was collected via three questionnaires.
The first survey comprised gender, interest in eco-
nomics, interest in computer games, age, and a self-
assessment of systematic problem solving. This survey
had to be answered before participants could start the
main task, i.e., the four IWR Tailorshop rounds. The
other two surveys were carried out after the main task.
The second survey was targeted on participants’ model
knowledge. Participants were shown five claims about
the IWR Tailorshop microworld and had to decide if
they were right or wrong, compare Table A.8 in the
Appendix. Final survey was the 10-item short version
of the Big Five Inventory test proposed by Rammstedt
and John (2007) to measure the Big Five dimensions of
personality (Digman, 1990), i.e., agreeableness, consci-
entiousness, extraversion, neuroticism, and openness.

The experimental groups

Participants were divided randomly into six groups
based on pseudorandom numbers generated by a
Mersenne twister (Matsumoto & Nishimura, 1998).
They differ in the way they received additional infor-
mation in the first two (feedback) rounds. Compare
Figure 2 for an illustration of the optimization-based
feedback. The six groups were designed as follows.
The control group (co) did not receive any feedback.
The highscore group (hs) received a feedback based
on the results of previous participants during training
rounds, giving a ratio of participants who performed
better and worse of the kind “Until now x% of partici-
pants performed better and y% performed worse than
you.”
The indicate group (in) received optimization-based
feedback via highlighted control values. Variables are
highlighted if they differ from the optimal value more
than a given threshold, e.g., 30 % of the difference δ
between lower and upper bound of a variable.
The trend group (tr) received optimization-based
feedback via up and down arrows in different thick-
ness. Arrow thickness is also determined by thresholds
depending on δ. Arrows indicate the direction of the
optimal control: if the optimal control value is larger,
the arrow points up and vice versa.
The value group (va) received optimization-based
feedback via toggled values, showing the optimal solu-
tion. Note that participants of this group could theo-
retically obtain a 100% performance (in the two feed-
back rounds) by simply copying all values.
The chart group (ch) received optimization-based
feedback via bar charts. Lagrange multipliers are
displayed scaled according to δ. These dual variables
indicate the sensitivity of the objective function with
respect to the current value.

Hypotheses

Before the beginning of the study, specific hypotheses
were formulated. In the interest of a compact presen-

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 5

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

tation, we list a subset of them directly in the corre-
sponding result sections in Tables 2, 3, 4, 5, 6, 8.
The full set of hypotheses that have been formulated

and tested can be found in the PhD thesis of Engelhart
(2015). They concern correlations with the additional
variables mentioned above (computer games, economic
interest, gender, age, Big Five) and a detailed analy-
sis of processing times. No statistically significant ef-
fects were found (for age and gender possibly due to
low numbers of old/female participants). Therefore
in this paper we focus on the main result, namely the
impact of optimization-based feedback on performance
and learning.

Prestudy

In October 2013, 18 participants (recruited directedly
via e-mail) took part in a prestudy. The aim was two-
fold: on the one hand, this was a test under realistic
conditions for the main study and an opportunity to
eliminate bugs in the interface. On the other hand,
the data were used for highscore feedback in the main
study. This was particularly necessary to avoid a feed-
back like “0% performed better and 0% worse than
you” for the first participant in that group. However,
the data were considered neither in our statistical nor
in our optimization-based analysis.

Data collection

Starting from November 15, 2013, the study was an-
nounced in several first and third term lectures for
mathematics, physics, computer science, engineering,
and psychology students at Heidelberg University and
Otto von Guericke University Magdeburg in Germany.
These announcements were complemented by public
announcements in the social networks Google+ and
Facebook as well as selective announcements via e-mail.
Potential participants were informed that they

would have to play four rounds of the economic sim-
ulation IWR Tailorshop via a device of their choice
with a web browser (e.g., PC, tablet, or smartphone)
which in total would take approximately 30–45 min-
utes. It was advertised as an incentive that there will
be a competition with chances weighted according to
success where participants can win one of six 20 euro
Amazon gift cards. The deadline for participation was
December 15, 2013.
Participants had to create an account with an e-

mail address, which they needed to confirm in order
to avoid multiplicate participation. Creating multiple
accounts was also prohibited by terms of participation
leading to exclusion from the competition.

Until the end of data collection, 157 accounts were
registered for participation. Two accounts have not
been activated, maybe because of erroneous e-mail ad-
dresses or the like. Furthermore, seven participants
did not answer the first survey and therefore could not
start the main task, i.e., no data was recorded for them
at all. Thus, we received data from 149 participants, of
which 101 provided complete datasets, i.e., they played

four full rounds and answered all three surveys. One
account was identified as a duplicate participation and
was excluded from the analysis. The first account of
the corresponding participant was part of the analy-
sis, but was not considered in the competition. This
resulted in 100 complete datasets and 148 datasets in
total for our statistical analysis.

Model knowledge

A true/false questionnaire, Table A.8 in the Appendix,
was used at the end of the four rounds to determine
the participants’ knowledge about the IWR Tailorshop
microworld. The overall ratio of correct answers varies
a lot for the five claims. This shows that the questions
had a varying difficulty, which was intended.
Correct answers were identified as knowledge about

the model. Participants who chose don’t know were
considered to be uncertain about the corresponding
claim.

Statistical methods

Statistical analysis of the data was done using the open
source package R Version 3.0.1 (R Development Core
Team, 2008).

Statistical significance. We tested the statistical sig-
nificance of differences between means of scores and
other variables. To this end we applied Student’s t-
test and Welch’s t-test. Usually all tests have also
been confirmed qualitatively by Wilcoxon rank sum
tests.
For all tests, p-values of < 0.05 were considered sta-

tistically significant (i.e., α = 0.05). All such values
are printed in bold face in tables.

Normality of distributions. Statistical tests like Stu-
dent’s t-test and Welch’s t-test require normality of the
population—although these two are known to be rela-
tively robust against non-normality (e.g., Sawilowsky
& Blair, 1992).
We applied the implementation of the Kolmogorov-

Smirnov test for normality (Lilliefors, 1967) from the
R package nortest to the score variables. For this test,
the alternative hypothesis is that the data is not nor-
mally distributed.
Student’s t-test—in contrast to Welch’s t-test—also

requires homogeneity of variances between the groups.
This has been tested using Levene’s test (Levene,
1960), Brown-Forsythe test (Brown & Forsythe, 1974;
both as implemented in R package lawstat), and
Bartlett’s test (Bartlett, 1937).
For α = 0.05, the hypothesis of the data being nor-

mally distributed cannot be rejected for most groups
and rounds by a majority of the applied tests for nor-
mality.
However, we cannot assume homogeneous variances

between feedback groups. Thus, for the sake of com-
parability, Welch’s t-test will be used for comparison
of score means for all rounds.

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 6

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

0 2 4 6 8 10

0,0

0,2

0,4

0,6

0,8

1,0

1,2

(x10
5
)

State function 12

Month

M
.U

.

Participant

Optimal solutions

0 2 4 6 8 10

Analysis function 2

Month

Objective (participant)

How much is still possible

0

1

2

3

4

5

6

7

8

9
10

k=

Figure 3. Relation between optimal scores and the How Much is Still Possible-function, illustrated for a specific participant. Left:
development of score (capital) over time. An optimization starting at month k provides an optimal value that could have been achieved.
The specific shape of the optimal solutions (approximately constant, then linear increase of capital) is due to an investment that pays off
later. Therefore taking the score itself as an indicator is not a good performance measure. Right: The optimal objective values at the
final month 10 are plotted for different starting months k, resulting in the How Much is Still Possible-function. Participant decisions are
good (even optimal), whenever the values stay constant, and the worse, the more it decreases.

Dropouts and outliers. 148 datasets have been con-
sidered, 100 of which were complete. Our statisti-
cal analysis showed that incomplete datasets did not
show any systematic differences compared to complete
datasets. In particular, there were no significant effects
on the dropout concerning feedback group, gender, or
the performance until the dropout.
Grubbs’ test is a statistical test proposed by Frank

E. Grubbs (1950, 1969) which detects one outlier at
a time in a normally distributed population. We used
the implementation of Grubbs’ test available in the
R package outliers. Another approach are the outer
fences for boxplots, as described by John W. Tukey
(1977).
An analysis of the score variable with Grubbs’ test

and outer fences detected 6 severe outliers, which were
excluded from further analysis. The analysis in the
remainder, including the optimization-based analysis,
is therefore based on N=94 datasets.

Optimization-based analysis

As discussed in the Introduction, measuring perfor-
mance in a complex microworld is by no means triv-
ial. In previous work we suggested a completely novel
approach: to use mathematical optimization and the
so-called How Much is Still Possible-function and the
Use of Potential-function (Sager et al., 2011; Engel-
hart et al., 2013). We applied these techniques also in
the current study as follows.

Optimization. We computed optimal solutions for
each participant (1 to 94) and round (1 to 4) and
month (1 to 10). As illustrated in Figure 2, the start-
ing value is identical to the one of the participant in
the specific round and month, and hence pairwise dif-
ferent. Alltogether, we solved 94 ·4 ·10 = 3760 mixed-
integer nonlinear optimization problems for our anal-
ysis, using a specifically developed optimization algo-

rithm (Engelhart et al., 2013; Engelhart, 2015). Note
that this approach is very similar to the computation
of an optimization-based feedback, compare Figure 2.
The main difference is whether this is done a priori
(feedback for training) or a posteriori (analysis).

How much is still Possible-function. The optimal
solution starts in the identical state as the participant
in a specific round and month. Hence we know how
much could have been achieved if all of the partici-
pant’s future decisions would have been optimal. The
optimal objective function values are interpreted as
a monotonically decreasing function (because partici-
pants can’t do better than the optimal solution) over
rounds and months. An illustrating example is shown
in Figure 3.

Use of Potential-function. The Use of Potential-
function is derived from the How Much is Still Possi-
ble-function by taking the difference between two suc-
ceeding months. Doing this for each month one obtains
a function that indicates how much of the potential of
optimal decisions was used by a participant.

Learning

To enable conclusions on learning effects, we are going
to analyze the Use of Potential function. As this func-
tion indicates how close to optimality the decisions of a
participant (group) for each month were, the function
can be seen as a learning curve. We experimented
with different functional parameterizations, and de-
cided eventually to use a piecewise linear model for
our analysis.
We used R’s lm to fit the linear model for Use of

Potential for each participant and each round,

y = m ·x + c, (1)

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 7

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

based on given values yi of Use of Potential at months
xi = i. The regression parameters are m and c, and
estimate the gradient and the intercept of Use of Po-
tential.

The estimate m for the gradient characterizes how
much more potential the participant was able to use
over time, i.e., how much the participant learned. We
use statistical tests on the values of m for different par-
ticipant groups for our a priori hypotheses on learning.
The first months of the feedback rounds (i.e.,

months 1 and 11) were not considered for the linear
regression. No feedback is given before the first deci-
sion and thus Use of Potential may change drastically
from month 0 to month 1.

●

●

●
● ● ●

●
● ●

●

●

●

●
● ● ●

●

● ●
●

●

●

●

● ●

●
●

●
●

●

● ●
●

●
● ●

●
●

● ●

−6 × 10+4

−4 × 10+4

−2 × 10+4

0 × 10+0

0 10 20 30
Month

U
se

 o
f p

ot
en

tia
l

(a) based on all months

●

●

●

●
● ● ●

●
●

●

●

●

●
● ● ●

●

● ●
●

●

●

●

● ●

●
●

●

●

●

● ●
●

●

●
●

●

●

● ●

−6 × 10+4

−4 × 10+4

−2 × 10+4

0 × 10
+0

0 10 20 30

Month

U
s
e
 o

f 
p
o
te

n
ti
a
l

(b) without first month

Figure 4. Regression lines for Use of Potential for value group
over all rounds (one round consists of 10 months). (a) shows a
regression with all months of each round, for (b) the first month
of feedback rounds has been excluded.

The importance of this is shown in Figure 4, where
Figure 4a exhibits linear regressions based on all
months, and Figure 4b the corrected approach. In
performance rounds this effect does not occur, so all
months are considered.

Technical implementation

For data collection, the IWR Tailorshop web inter-
face was used, which is implemented using XHTML
and JavaScript with jQuery 1.10 and usage of AJAX
client-side, complemented by a server-side PHP code.
For the online optimization, AMPL Version 20131012

0

5

10

15

20

−4 × 10+5−2 × 10+5 0 × 10+0 2 × 10+5 4 × 10+5

N
u
m

b
e
r 

o
f 
p
a
rt

ic
ip

a
n
ts

Round 1

0

5

10

15

20

−4 × 10+5−2 × 10+5 0 × 10+0 2 × 10+5 4 × 10+5

Round 2

0

5

10

15

20

−4 × 10+5−2 × 10+5 0 × 10+0 2 × 10+5 4 × 10+5

Score

N
u
m

b
e
r 

o
f 
p
a
rt

ic
ip

a
n
ts

Round 3

0

5

10

15

20

−4 × 10+5−2 × 10+5 0 × 10+0 2 × 10+5 4 × 10+5

Score

Round 4

Figure 5. Score histogram for all four rounds for all complete da-
tasets without 6 outliers (N = 94).

together with Bonmin 1.5 and Ipopt 3.10 was used via
IWR Tailorshop’s AMPL interface. The web server
for the study was an Intel Core i7 920 machine with
12 GB RAM running PHP 5.5 and MySQL 5.5 with
an Apache 2.4 HTTP server on Ubuntu 13.10 64-bit.
The web interface implemented a so-called responsive
grid, which allowed participants to use both mobile de-
vices and desktop PCs conveniently. Usage statistics
based on user logins show that approximately 20% of
participants used mobile devices.

The methods for an optimization-based analysis are
implemented in the open-source software package An-
tils (Analysis Tool for IWR Tailorshop Results and
Solutions). All computations were carried out on an
Intel Core i7 920 machine with 12 GB RAM running
Ubuntu 14.04 64-bit. For the solution of the arising
optimization problems, AMPL Version 20140331 to-
gether with Bonmin 1.5 and Ipopt 3.10 was used via
IWR Tailorshop’s AMPL interface.

Results

We are going to test hypotheses related to the differ-
ent participant groups. First we will focus on perfor-
mance, second on learning, and third on model knowl-
edge. We will close by an illustrating investigation of
the strategies of exemplary participants.

We start with a look at the score and the Use of Po-
tential–functions of the study participants. Figure 5
shows how the performance (score) is distributed over
all participants in the four rounds. Note that rounds
1 and 3 had the same initial values, whereas rounds 2
and 4 had different initial values, and thus also differ-
ent optimal solutions and scores. Rounds 1 and 2 are
training rounds with feedback, rounds 3 and 4 perfor-
mance rounds without feedback.

Obviously, it is only meaningful to investigate the
impact of the different types of feedback, if the role
of the participants’ prerequisites is not a decisive fac-
tor (e.g., because one group simply consisted of better
problem solvers at the beginning of the study). This
would have biased the groups’ performance and is rel-
evant, given the low number of samples for some of the
participant groups.

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 8

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

Hypothesis Confirmed

(A) initial performance is not
important for final
performance

X

Table 2. Hypothesis about participant prerequisites.

The optimization-based analysis gives us the possi-
bility to check this by comparing the first Use of Poten-
tial value. At this point, all participants had received
the same information, as feedback only started after
the first decision, so there should be no significant dif-
ference in the performance. Table A.4c contains mean
values, Kolmogorov-Smirnov test results, and Welch’s
t-test results (in comparison to control group). The
Kolmogorov-Smirnov test shows that the first Use of
Potential values can be considered to be normally dis-
tributed for all groups. No significant differences to
control group can be observed by the Welch’s t-test
for all groups, so we can suppose that there were no
systematic differences among the participants of the
six groups. Correlation between first Use of Potential
and score in performance rounds is 0.067, confirming
Hypothesis (A), see Table 2.

Effects of optimization-based feedback on
performance

We investigate whether the optimization-based feed-
back in the first 2 training rounds had a significant ef-
fect on the performance in the rounds 3 and 4, where
no feedback was given (Table 3). We start by look-
ing at all optimization participants, i.e., the ones in
groups indicate (in), trend (tr), value (va), and chart
(ch). We assess Hypothesis (B) visually via Figure 6

Hypothesis Confirmed

(B) participants with
optimization-based
feedback perform better
overall, better in feedback
rounds, and better in
performance rounds
compared to control group

X

(C) control group performs
worst overall and performs
worse in performance
rounds than groups with
optimization-based
feedback

—

Table 3. Hypotheses related to performance of participants who
received optimization-based feedback (groups in, tr, va, ch) and to
performance of the control group.

and statistically via Table A.4a.
Figure 6 shows a boxplot of the different partici-

pant groups’ performance via the obtained score. The

four groups which received optimization-based feed-
back (in, tr, va, and ch, rightmost in Figure 6) show
different performance, which will be discussed later.
Relevant for Hypothesis (B) is that the mean scores
are above the ones of the control group (co). This
is true for training rounds 1 and 2, for performance
rounds 3 and 4, and thus also overall. The sta-
tistical significance based on a comparison between
optimization-based feedback groups and the other two
groups is shown in Table A.4a. Participants who re-
ceived optimization-based feedback performed signif-
icantly better than those without feedback, in each
round and in total, proving Hypothesis (B). Looking
closer at Table A.4a one observes that this significance
holds for both comparisons, the one to all participants
without feedback (highscore group and control group)
and only to those from control group. The value of
the statistical test is larger in the training rounds by
roughly one order of magnitude, which is not surpris-
ing given the direct benefit of the feedback on the per-
formance.
The performance of the four optimization-based

feedback groups is quite diverse, compare again Fig-
ure 6: value group was the best by far in all the rounds,
trend group comes second. The two other feedback
groups, indicate group and chart group, do not exhibit
such a good performance. As a result, the performance
of the control group is only significantly worse on av-
erage, but not compared to all of the single feedback
groups as tested in Table A.4b. Consequently, Hy-
pothesis (C) can be considered as disproved, both for
the performance rounds as overall.
The results of the group–specific Welch’s t-test in

Table A.4b are also helpful for an assessment of the hy-
potheses of Table 4. For α = 0.05, value group is sig-

Hypothesis Confirmed

(D) trend group performs
best overall and best in
performance rounds

—

(E) value group performs best
in training rounds and
worst in performance
rounds, compared to other
feedback groups

(X)

(F) indicate group and chart
group do not perform
significantly better than
control group in
performance rounds

X

Table 4. Hypotheses on specific feedback types (arrow feedback in
trend group and toggled values in value group).

nificantly better than control group in all the rounds.
Trend group misses significance only in round 3 by
narrow margin, but exhibits significant differences in
the other rounds. Indicate group is significantly bet-
ter only in round 1. The remaining groups are not

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 9

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

●

●
●

●
●

●

● ●

●

●
●

−4 × 10+5
−2 × 10+5

0 × 10+0
2 × 10+5
4 × 10+5

all co hs in tr va ch

S
co

re
Round 1

● ●
●

●●

−4 × 10+5
−2 × 10+5

0 × 10+0
2 × 10+5
4 × 10+5

all co hs in tr va ch

Round 2

●
●

●

●

●

●

−4 × 10+5
−2 × 10+5

0 × 10+0
2 × 10+5
4 × 10+5

all co hs in tr va ch

Group

S
co

re

Round 3

●
●●

●

●

●

−4 × 10+5
−2 × 10+5

0 × 10+0
2 × 10+5
4 × 10+5

all co hs in tr va ch

Group

Round 4

Figure 6. Score boxplot of all feedback groups (co: control, hs: highscore, in: indicate, tr: trend, va: value, ch: chart) for all rounds and
all complete datasets without 6 outliers (N = 94). The boxplot indicates that value group and—except for round 3—trend group are
better than the others.

significantly different than control group. The differ-
ence between value group and all other groups is also
significant in all rounds for α = 0.05 (not in the table).

As value group showed the best performance, and
trend group only second–best, Hypothesis (D) can be
considered as disproved.
It is true that value group performed best in training

rounds, but it did not perform worst in performance
rounds. So, the first statement of Hypothesis (E) is
likely to be true, the second to be false.
The two other feedback groups, indicate group and

chart group, do indeed not perform significantly better
than control group, confirming Hypothesis (F).
Figure 7 contains the average Use of Potential for

each feedback group over all rounds. This plot reveals
much more detail on the performance of the different
groups, as it contains also temporal information. This
will be helpful in the next section. Looking at the av-
erage values (remember: Use of Potential is the better,
the closer it is to 0), additional evidence is given for
the results for Hypotheses (B–F).

Effects of optimization-based feedback on learning

As described in section“Learning”, we use the gradient
m obtained from a linear regression as an indicator for
learning. As Use of Potential may hence be considered
as the learning curve, it is worthwhile to have a look
at Figure 7 to assess the first hypothesis on learning in
Table 5. The visual impression is that on average the
Use of Potential has a tendency to increase, at least

Hypothesis Confirmed

(G) participants learn how to
solve the complex problem

X

(H) learning function is
approximately logarithmic

—

Table 5. Hypotheses related to learning.

for rounds 1, 2, and 4. This is confirmed quantitatively
by looking at the average values and the p values in
Table A.6. On average, participants show significant
learning effects in all rounds except for round 3. This
supports the assumption that participants learn how
to control the microworld, i.e., Hypothesis (G). Ad-
ditional evidence for Hypothesis (G) comes from Fig-
ures 5 and 6. Comparing round 1 (with feedback) and
3 (without feedback) one can see that the distribution
is shifted slightly to the right, i.e., to higher scores,
hinting at an overall learning effect. That this learn-
ing effect is dependent on the feedback in the training
rounds 1 and 2, can already be guessed by looking at
round 4. Round 4, which is a performance round with
initial values the participants have not seen before in
the training rounds, exhibits a non–normal distribu-
tion of performance.

Trying to fit a logarithmic function to the Use of
Potential was not successful. A closer inspection of
Figure 7 indicates that although for certain partici-

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 10

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

0 10 20 30

-6

-4

-2

0

(x10
4
) Analysis function 7

Month

Control

Highscore

Indicate

Trend

Value

Chart

U
se

 o
f 

P
o
te

n
ti

a
l 
fo

r 
a
ll
 r

o
u
n
d
s

Figure 7. Use of Potential for all four complete datasets without 6 outliers (N = 94) over all rounds (one round consists of 10 months),
but averaged for the six different participant groups (see section “experimental groups”). value group is always on top and almost constant
in feedback rounds, but decreases slightly in performance rounds. All other groups show a more (control and highscore group) or less
(trend group) severe decline at the beginning of round 4.

pant groups and rounds (e.g., trend group in rounds
1, 2, and 4) there is a stronger increase at the begin-
ning that flattens toward the end of the round, the Hy-
pothesis (H) cannot be confirmed based on our data.
This is also the impression from investigating Use of
Potential of single participants, compare Figure 9.
We now have a closer look at the effect of opt-

imization-based feedback on learning. To test Hypo-

Hypothesis Confirmed

(I) optimization-based
feedback groups learn faster

(X)

(J) trend group learns fastest X

Table 6. Hypotheses related to learning, specific for participant
groups.

thesis (I), see Table 6, we look at the regression pa-
rameters m for the four optimization-based feedback
groups (of, consisting of in, tr, va, ch) and the two
other groups (nof, consisting of co and hs) in Table 7.
The mean for parameter m for of is higher in round
1 and lower in all other rounds. This suggests that,
given the performance of these groups, optimization-
based feedback groups learned faster, namely mainly
in the first round. However, Welch’s t-test only shows
significance for rounds 2–4. We see this as an indi-
cation that (I) might be true, but it cannot be fully
confirmed with our data.
To shed more light on the issue, we investigate the

learning curves of the single participant groups. As
above, Figure 6 hints at improved scores in round 3

Rnd nof of nof < of of < nof

1 651.2 1063.1 0.2384 0.7616
2 1086.6 550.3 0.9642 0.0358
3 670.9 -263.4 0.9997 0.0003
4 3445.1 817.0 1.0000 0.0000

Table 7. Columns 2 and 3: mean regression parameters m for non-
optimization based feedback groups (nof) and optimization-based
feedback groups (of). Columns 4 and 5: corresponding signifi-
cances from Welch’s t-test. Rnd means Round. One observes that
of learned more in round 1, however not significant, and co&hs
learned significantly more in rounds 2–4.

compared to round 1 (with identical initial values)
for all participant groups with the exception of value
group. Value group remained static (-4%) at a higher
level than the other groups. A reason for this may
be that participants profited so strong from the value
feedback during the feedback rounds that their perfor-
mance without feedback slightly decreased. However,
the group’s mean is on a high level, so there was not
much space for improvement anyhow. For the other
five groups performance improved drastically (20% at
least).

Again, more insight comes from our novel analysis
approach, the study of Use of Potential depicted in
Figure 7. Value group is always on top as expected
and almost constant in feedback rounds, but decreases
slightly in performance rounds. This means that the
performance of participants in this group is on a very
high level from the beginning and hardly improves, in
fact rather impairs. All other groups show a more or

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 11

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

0 10 20 30

-3,0

-2,5

-2,0

-1,5

-1,0

-0,5

(x10
4
) Analysis function 7

Month

U
se

 o
f 

P
o
te

n
ti

a
l 
fo

r 
a
ll
 r

o
u
n
d

s Average

(a) High

0 10 20 30

-4

-3

-2

-1

(x10
4
) Analysis function 7

Month

U
s
e
 o

f 
P
o
te

n
ti

a
l 
fo

r 
a
ll
 r

o
u
n
d
s

Average

(b) Mid

0 10 20 30

-6

-5

-4

-3

-2

-1

(x10
4
) Analysis function 7

Month

U
s
e
 o

f 
P
o
te

n
ti

a
l 
fo

r 
a
ll
 r

o
u
n
d
s

Average

(c) Low

Figure 8. Use of Potential according to high, mid, and low model knowledge for all complete datasets without 6 outliers (N = 94) over
all rounds (one round consists of 10 months). Participants with low knowledge show a severe decline in their score at the beginning of
round 4, whereas they stay on the same level in the rounds before. High and mid group show an increase in feedback rounds and high
group also stays almost on the same level later in round 4. Note that the start of round 4 is challenging due to new initial values.

less severe decline at the beginning of round 4 with
control and highscore group at the one end and trend
group at the other. However, all groups except value
group seem to improve their performance during the
first three rounds.
To quantify this, Table A.5 contains the mean val-

ues for the regression parameters m of the different
feedback groups. The Kolmogorov-Smirnov test re-
sults show that the mean values can be considered to
be normally distributed in all rounds, except for chart
group in round 1. The Welch’s t-test results show
whether the hypothesis that the mean value of m is
positive, and hence a positive learning effect occured,
is significant or not. Trend group is the only group
with a significant learning effect in both rounds 1 and
2. Therefore we see Hypothesis (J) as confirmed.
For control group, the learning effects get signifi-

cant from round 2 on, and for highscore group they
are significant in rounds 2 and 4. The mean values
in performance rounds for control and highscore group
are drastically higher than for the optimization-based
feedback groups. Value group is the only one with a
significantly decreasing performance in round 3 and
also the only one with an overall mean below 0. Note
that chart group performs even worse at least in the
feedback rounds. This changes in performance rounds,
so one can suppose that the feedback consternated the
participants. A possible reason could lie in a misin-
terpretation of the sensitivity information participants
were given by this feedback. All other optimization-
based feedback groups received direct information on
the optimal solution.

Effects of model knowledge

The focus of this section are the two variables knowl-
edge and uncertainty. We look at the hypotheses in
Table 8.
To investigate Hypothesis (K), quartiles have been

used to build groups of participants with high (best
25%), mid (those between first and third quartile), and
low (worst 25%) score for each round. Means of corre-
spondent model knowledge and uncertainty scores can
be found in Tables A.9a and A.9b. High groups have

Hypothesis Confirmed

(K) well-performers know more
about the model

X

(L) participants with high
model knowledge perform
well

X

(M) participants with high
model knowledge learn
more

(X)

(N) trend group has highest
model knowledge and
lowest uncertainty

X

Table 8. Model knowledge related hypotheses.

the highest means which increase over the rounds. Ex-
cept for round 1, mid groups are between low and high
groups. In performance rounds, all differences are sig-
nificant according to the Welch’s t-test. Significance
roughly increases over the rounds, which suggests that
model knowledge is a crucial factor for successful con-
trol of the IWR Tailorshop microworld.

Concerning Hypotheses (L) and (M), participants
have been merged in 3 (low (0/1), mid (2/3), and
high (4/5)) and 2 (low (0/1) and mid (2/3)) groups
respectively according to their knowledge and uncer-
tainty score, which both are between 0 and 5. No
participant achieved an uncertainty score of 4 or 5,
thus there are only two groups for uncertainty. Tables
A.10a and A.10b contain the mean score values of all
four rounds for these groups.

For knowledge, the high group has the highest score
means by far. Except for round 1, mid group lies be-
tween low and high group. Student’s t-test in Table
A.10c shows that high group was almost always signif-
icantly better than the two other groups. Significance
increases over the rounds, which means that model
knowledge becomes a better predictor for participants
success the more rounds the participants played. Com-
paring round 1 and 3, participants with low model

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 12

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

knowledge could barely improve their performance,
whereas the high group approximately doubled their
score. Indeed, correlation between score and model
knowledge increases from about 0.09 in round 1 to 0.48
in round 4. As a summary, we see Hypothesis (L) as
confirmed.
For uncertainty, the low group has higher means in

all rounds, but again the differences are much smaller
than for knowledge. Hence, the differences between
the groups are not significant. Correlation with score
is about -0.2 for all rounds except the first.
Concerning Hypothesis (M), the average Use of Po-

tential for the three model knowledge groups can be
found in Figure 8. Participants with low knowledge
show a severe decline at the beginning of round 4,
whereas they stay on the same level in the rounds be-
fore. High and mid group show an increase in feedback
rounds and high group also stays almost on the same
level in round 4.
The values in Table A.12 reveal that participants

with low model knowledge learned significantly less in
round 1 than those with high knowledge. Again, the
situation reverses in round 4. Hypothesis (M) could
thus be confirmed with a restriction to round 1. How-
ever, it seems also likely that model knowledge changes
from round to round and is an indicator of success in
learning, rather than a predictor. Therefore a high use
of potential in the training rounds could also be con-
sidered as a predictor for model knowledge at the end
of the experiment. In summary, Hypothesis (M) can
not be decided.

Concerning Hypothesis (N), for an analysis of differ-
ences between the groups, ratios of model knowledge
and uncertainty levels and mean values are given in
Table A.11. Trend and value group have the high-
est knowledge, but only highscore and trend group are
significantly better than control group. Indicate and
chart group have a much lower knowledge, which to-
gether with these groups’ performance suggests that
participants were rather confused by the optimization-
based feedback.
Trend group has by far the lowest uncertainty

among the groups and is the only one which has sig-
nificantly lower uncertainty than control group. All
other groups are on a similar level.

Exemplary participants

A more detailed look on single participants reveals dif-
ferent decision patterns. Figure 9 shows Use of Poten-
tial for participants 134, 164, 165, and 208 from value
group and of participant 115 from trend group. Par-
ticipants 134 and 164 seem to more or less copy the
optimal solution in the feedback rounds. Remember
that feedback for these participants consisted of the
numeric values of the optimal solution. Participant
208, in contrast, seems to pursue a different strategy
which is less solution-oriented.
The success in the performance rounds 3 and 4 also

varies a lot: participant 164 seems to remember the
solution, which is especially useful in round 3 as it

0 10 20 30

-4

-3

-2

-1

0

(x10
4
) Analysis function 7

Month

P
o
te

n
ti

a
l 
fo

r 
a
ll
 r

o
u
n
d

s

Participant

(a) Participant 134 (value)

0 10 20 30

-2,0

-1,5

-1,0

-0,5

0,0

(x10
4
) Analysis function 7

Month

P
o
te

n
ti

a
l 
fo

r 
a
ll
 r

o
u
n
d
s

Participant

(b) Participant 164 (value)

0 10 20 30

-4

-3

-2

-1

0

(x10
4
) Analysis function 7

Month

P
o
te

n
ti

a
l 
fo

r 
a
ll
 r

o
u
n
d
s

Participant

(c) Participant 165 (value)

0 10 20 30

-3

-2

-1

0

(x10
4
) Analysis function 7

Month

P
o
te

n
ti

a
l 
fo

r 
a
ll
 r

o
u
n
d
s

Participant

(d) Participant 208 (value)

0 10 20 30

-4

-2

0

(x10
4
) Analysis function 7

Month

P
o
te

n
ti

a
l 
fo

r 
a
ll
 r

o
u
n
d
s

Participant

(e) Participant 115 (trend)

0 2 4 6 8

50

52

54

Control function 1

Month

S
h
ir

t 
p
ri

c
e

(f) Shirt price decision of par-
ticipant 188 (chart group) in
round 3.

Figure 9. Use of Potential for single participants from value group
(a–d) and trend group (e), and exemplary shirt price decisions (f).

started with the same value as round 1, but participant
134 does not and lacks knowledge how to control the
model. Participant 165, who seems to change strategy
during feedback rounds from exploration to solution-
oriented, decreases in round 3, too. Participant 208,
who possibly has found an own strategy, stays on the
same level throughout all rounds.

Participants 115 from trend group reaches a compa-
rably high level of Use of Potential with monotonically
increasing curves during the first two rounds converg-
ing to 0, i.e., coming close to optimality at the end of
each round. Not surprisingly, a solution-oriented pat-
tern like among the participants from value group in
Figure 9 (a–d), cannot be observed due to the different
type of feedback.

Figure 9f shows the shirt price decision of partic-
ipant 188 from chart group. Although already in a
performance round, the participant seems quite un-
sure about the right strategy and changes the control
a lot. Such a pattern at that time point can particu-
larly be found among the datasets from chart group.

Conclusion and outlook

In this work, optimization methods were used in the
context of Complex Problem Solving (CPS) both as an

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 13

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

analysis tool and to provide feedback in real time for
learning purposes. While first works on optimization-
based analysis for CPS (Sager et al., 2010, 2011) had
a focus on understanding how external factors influ-
ence thinking, in the work at hand, we also investi-
gated learning effects. The use of optimization as an
analysis and feedback tool for psychological studies is
completely new to our knowledge.
We presented a variant of the IWR Tailorshop, a

new microworld for CPS. This turn-based test-scenario
yields a mixed-integer nonlinear program with non-
convex relaxation and consists of functional relations
based on optimization results. With the proof of feasi-
bility for the IWR Tailorshop in this article, we intend
to start a new era beyond trial-and-error in the def-
inition of microworlds for analyzing human decision
making.
In our web-based feedback study with 148 partici-

pants, we used the IWR Tailorshop microworld to in-
vestigate the effects of optimization-based feedback.
Optimization-based feedback could significantly im-
prove participants’ performance in the IWR Tailor-
shop microworld if the presentation was chosen appro-
priately. In our study, value group performed signifi-
cantly better than all other groups.
We could show that such a feedback can significantly

improve participants’ performance in a complex mi-
croworld and for some kinds of feedback, the difference
to control group was huge. However, it also became
apparent that the representation of feedback is impor-
tant. Feedback based on a kind of sensitivity infor-
mation seemed to rather confuse participants in this
study, which was also suggested by our optimization-
based analysis.
The best-performing group was the value group

which received the most precise information about the
optimal solution. Knowledge about the model was bet-
ter amongst another well-performing group, the trend
group. Since we could show that model knowledge
is a predictor for performance, perhaps these partici-
pants would have outperformed the others on a longer
timescale. More data is needed to verify this hypoth-
esis, though.
Optimization-based analysis could show that par-

ticipants learn to control the model over time by an
analysis of Use of Potential. Different aspects of the
analysis indicate that for a high performance, learning
during the first round is crucial. It turned out that
the best way to enforce learning at the beginning was
by trend feedback. Through the optimization-based
analysis, we were also able to show that there were no
systematic differences between the groups at the be-
ginning and that initial performance was not relevant
for performance at the end of the time scale. For some
of the hypotheses, however, significance could not or
only partly be shown. In these cases, more data and
investigation will be necessary.
The main intention of this paper is to present the

optimization-based feedback and to show their useful-
ness in a feedback situation. The test of (learning)
theories was not the focus. Our different hypotheses

are not drawing on specific literature but are kind of
“informed guesses” about what might happen. This is
also due to the fact that there exist no reference studies
with the Tailorshop in a feedback setting that could be
used as a baseline for expected effects. However, cou-
pling our approach to theoretically based hypotheses
on learning seems a promising line of future research.
Another interesting research direction could be if

the widely spread assumption that positive feedback
increases performance is true. In Barth and Funke
(2010) it has been shown that negative feedback im-
pairs performance. However, it is unclear if this is also
true in the long run. From former studies we know
that positive and negative feedback lead to different
processing styles. Therefore one could expect that a
quotient of positive and negative feedback (carrot and
stick) impairs performance the most. 40% positive
feedback and 60% negative feedback might lead to the
best performance, for instance.
Finally, the parameter set used for the computations

of the IWR Tailorshop microworld in this work has
been set up manually to achieve a reasonable model
behavior. Here we still see high potential for im-
provement. One could use derivative-free optimiza-
tion methods to optimize the parameter values such
that two (or even more) previously defined strategies
(e.g., a high and a low price strategy) yield a similar
objective value. By that, participants could follow dif-
ferent strategies and perform quite well in all of them
if decisions are made appropriate.

Acknowledgements: This project has received funding
from the European Research Council (ERC) under the
European Union’s Horizon 2020 research and innovation
programme (grant agreement No 647573) and from the
German BMBF under grant 05M2013 - GOSSIP. The
authors gratefully acknowledge this.

Declaration of conflicting interests: The authors de-
clare that the research was conducted in the absence of
any commercial or financial relationships that could be
constructed as a potential conflict of interest.

Author contributions: The main contribution is due to
the first author (ME) who performed the study as part
of his PhD thesis (Engelhart, 2015). SS and JF helped
in designing and analysing the study and did part of the
writing.

Supplementary material: Supplementary material
available online.

Handling editor: Andreas Fischer

Copyright: This work is licensed under a Creative Com-
mons Attribution-NonCommercial-NoDerivatives 4.0 In-
ternational License.

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 14

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

Citation: Engelhart, M., Funke, J. ,& Sager, S.
(2017). A web-based feedback study on optimization-
based training and analysis of human decision mak-
ing. Journal of Dynamic Decision Making, 3, 2.
doi:10.11588/jddm.2017.1.34608

Received: 04 January 2017
Accepted: 17 April 2017
Published: 26 May 2017

References

Barth, C. M. (2010). The impact of emotions on complex prob-
lem solving performance and ways of measuring this per-
formance (Unpublished doctoral dissertation). Ruprecht–
Karls–Universität Heidelberg.

Barth, C. M., & Funke, J. (2010). Negative affective envi-
ronments improve complex solving performance. Cog-
nition and Emotion, 24(7), 1259–1268. doi: 10.1080/
02699930903223766

Bartlett, M. S. (1937). Properties of sufficiency and statistical
tests. Proceedings of the Royal Statistical Society Series
A, 160(901), 268–282. doi: 10.1098/rspa.1937.0109

Brehmer, B. (1995). Feedback delays in dynamic decision mak-
ing. In P. A. Frensch & J. Funke (Eds.), Complex problem
solving: The European perspective (pp. 103–130). Hills-
dale, NJ: Erlbaum.

Brown, M., & Forsythe, A. B. (1974). Robust tests for the
equality of variances. Journal of the American Statistical
Association, 69(346), 364–367. doi: 10.1080/01621459
.1974.10482955

Cronin, M. A., Gonzalez, C., & Sterman, J. D. (2009). Why
don’t well-educated adults understand accumulation? A
challenge to researchers, educators, and citizens. Or-
ganizational Behavior and Human Decision Processes,
108(1), 116–130. doi: 10.1016/j.obhdp.2008.03.003

Danner, D., Hagemann, D., Schankin, A., Hager, M., & Funke,
J. (2011). Beyond IQ. a latent state-trait analysis of gen-
eral intelligence, dynamic decision making, and implicit
learning. Intelligence, 39(5), 323–334. doi: 10.1016/
j.intell.2011.06.004

Digman, J. M. (1990). Personality structure: Emergence of the
five-factor model. Annual Review of Psychology, 41(1),
471–440. doi: 10.1146/annurev.ps.41.020190.002221

Dörner, D. (1980). On the difficulties people have in dealing
with complexity. Simulation and Games, 11(1), 87–106.
doi: 10.1177/104687818001100108

Engelhart, M. (2015). Optimization-based analysis and training
of human decision making (Unpublished doctoral disser-
tation). Ruprecht-Karls-Universität Heidelberg.

Engelhart, M., Funke, J., & Sager, S. (2013). A decomposition
approach for a new test-scenario in complex problem solv-
ing. Journal of Computational Science, 4(4), 245–254.
doi: 10.1016/j.jocs.2012.06.005

Frensch, P. A., & Funke, J. (1995). Complex problem solving:
The european perspective. Taylor & Francis. doi: 10
.4324/9781315806723

Funke, J. (1983). Einige Bemerkungen zu Problemen der
Problemlöseforschung oder: Ist Testintelligenz doch ein
Prädiktor? Diagnostica, 29(4), 283–302. doi: 10.11588/
heidok.00008131

Funke, J. (2003). Problemlösendes Denken. Stuttgart, Ger-
many: Kohlhammer.

Funke, J. (2010). Complex problem solving: A case for complex
cognition? Cognitive Processing, 11(2), 133–142. doi:
10.1007/s10339-009-0345-0

Funke, J., & Frensch, P. A. (2007). Complex problem solv-
ing: The European perspective – 10 years after. In

D. H. Jonassen (Eds.), Learning to solve complex sci-
entific problems (pp. 25–47). New York: Erlbaum.

Gonzalez, C. (2004). Learning to make decisions in dynamic
environments: Effects of time constraints and cognitive
abilities. Human Factors, 46(3), 449–460. doi: 10.1518/
hfes.46.3.449.50395

Grubbs, F. E. (1950). Sample criteria for testing outlying obser-
vations. Annals of Mathematical Statistics, 21(1), 27–58.
doi: 10.1214/aoms/1177729885

Grubbs, F. E. (1969). Procedures for detecting outlying ob-
servations in samples. Technometrics, 11(1), 1–21. doi:
10.1080/00401706.1969.10490657

Hörmann, H. J., & Thomas, M. (1989). Zum Zusammen-
hang zwischen Intelligenz und komplexem Problemlösen.
Sprache & Kognition, 8(1), 23–31.

Hüfner, M., Tometzki, T., Kraja, T., & Engell, S. (2011).
Learn2Control: Eine webbasierte Lernumgebung im Bio-
und Chemieingenieurwesen. Journal Hochschuldidaktik,
22(1), 20–23.

Kleinmann, M., & Strauß, B. (1998). Validity and applica-
tions of computer simulated scenarios in personal assess-
ment. International Journal of Seclection and Assess-
ment, 6(2), 97–106. doi: 10.1111/1468-2389.00078

Kluwe, R. H. (1993). Knowledge and performance in com-
plex problem solving. Advances in Psychology, Volume
101, 401–423. Amsterdam, Netherland: Elsevier. doi:
10.1016/s0166-4115(08)62668-0

Kluwe, R. H., Misiak, C., & Haider, H. (1991). The control of
complex systems and performance in intelligence tests.
In H. Rowe (Ed.), Intelligence: Reconceptualization and
measurement (pp. 227–244). Hillsdale, NJ: Erlbaum.

Levene, H. (1960). Robust tests for equality of variances. In
I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow,
& H. B. Mann (Eds.), Contributions to probability and
statistics: Essays in honor of Harold Hotelling (pp. 278–
292). Stanford, CA: Stanford University Press.

Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for
normality with mean and variance unknown. Journal of
the American Statistical Association, 62(318), 399–402.
doi: 10.1080/01621459.1967.10482916

Matsumoto, M., & Nishimura, T. (1998). Mersenne twister:
A 623-dimensionally equidistributed uniform pseudo-
random number generator. ACM Transactions on Model-
ing and Computer Simulation, 8(1), 3–30. doi: 10.1145/
272991.272995

Meyer, B., & Scholl, W. (2009). Complex problem solving after
unstructured discussion. Effects of information distribu-
tion and experiece. Group Process and Intergroup Rela-
tions, 12(4), 495–515. doi: 10.1177/1368430209105045

Osman, M. (2008). Observation can be as effective as action in
problem solving. Cognitive Science, 32(1), 162–183. doi:
10.1080/03640210701703683

Otto, J. H., & Lantermann, E.-D. (2004). Wahrgenommene
Beeinflussbarkeit von negativen Emotionen, Stimmung
und komplexes Problemlösen. Zeitschrift für Differen-
tielle und Diagnostische Psychologie, 25(1), 31–46. doi:
10.1024/0170-1789.25.1.31

Putz-Osterloh, W. (1981). Über die Beziehung zwischen Testin-
telligenz und Problemlöseerfolg. Zeitschrift für Psycholo-
gie, 189(1), 79–100.

Putz-Osterloh, W., Bott, B., & Köster, K. (1990). Models
of learning in problem solving – are they transferable to
tutorial systems? Computers in Human Behavior, 6(1),
83–96. doi: 10.1016/0747-5632(90)90032-c

R Development Core Team. (2008). R: A language and en-
vironment for statistical computing [Computer software
manual]. Vienna, Austria. Retrieved from http://www.R
-project.org

Rammstedt, B., & John, O. P. (2007). Measuring personality
in one minute or less: A 10-item short version of the
Big Five Inventory in English and German. Journal of
Research in Personality, 41(1), 203–212. doi: 10.1016/
j.jrp.2006.02.001

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 15

http://www.R-project.org
http://www.R-project.org
http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

Robbins, T. W., Anderson, E. J., Barker, D. R., Bradley, A. C.,
Fearnyhough, C., Henson, R., Hudson, S. R., & Baddeley,
A. D. (1996). Working memory in chess. Memory &
Cognition, 24(1), 83–93. doi: 10.3758/bf03197274

Sager, S., Barth, C. M., Diedam, H., Engelhart, M., & Funke, J.
(2010). Optimization to measure performance in the Tai-
lorshop test scenario — structured MINLPs and beyond.
In Proceedings EWMINLP10 (pp. 261–269). CIRM, Mar-
seille, France.

Sager, S., Barth, C. M., Diedam, H., Engelhart, M., & Funke, J.
(2011). Optimization as an analysis tool for human com-
plex problem solving. SIAM Journal on Optimization,
21(3), 936–959. doi: 10.1137/11082018x

Sawilowsky, S. S., & Blair, C. R. (1992). A more realistic look at
the robustness and Type II error properties of the t test
to departures from population normality. Psychological
Bulletin, 111(2), 352–360. doi: 10.1037/0033-2909.111.2
.352

Selten, R., Pittnauer, S., & Hohnisch, M. (2012). Dealing with
dynamic decision problems when knowledge of the en-
vironment is limited: An approach based on goal sys-
tems. Journal of Behavioral Decision Making, 25, 443–
457. doi: 10.1002/bdm.738

Süß, H.-M., Oberauer, K., & Kersting, M. (1993). Intellek-
tuelle Fähigkeiten und die Steuerung komplexer Systeme.
Sprache & Kognition, 12, 83–97.

Tukey, J. W. (1977). Exploratory data analysis. Boston, MA:
Addison-Wesley.

Wenke, D., & Frensch, P. A. (2003). Is success or failure
at solving complex problems related to intellectual abil-
ity? In J. Davidson & R. Sternberg (Eds.), The psy-
chology of problem solving (pp. 87–126). Cambridge,
England: Cambridge University Press. doi: 10.1017/
cbo9780511615771.004

Wittmann, W. W., & Hattrup, K. (2004). The relationship be-
tween performance in dynamic systems and intelligence.
Systems Research and Behavioral Science, 21(4), 393–
409. doi: 10.1002/sres.653

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 16

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

Appendix

The mathematical model for the IWR Tailorshop con-
sists of the following set of equations, for k = t0, . . . , tf ,
shown in Equations (A.1a) to (A.1l).

xEMk+1 = x
EM
k + u

EM
k (A.1a)

xPSk+1 = x
PS
k −u

dPS
k + u

DPS
k (A.1b)

xDSk+1 = x
DS
k −u

dDS
k + u

DDS
k (A.1c)

xDEk+1 = p
DE,0 · exp

(
−pDE,1 ·uSPk

)
· log

(
pDE,2 ·uADk + 1

)
·
(
xREk + p

DE,3) (A.1d)
xREk+1 = p

RE,0 ·xREk
+ pRE,1 log

((
pRE,2 ·uADk

+ pRE,3 ·uSPk · (x
SQ
k )

2

+ pRE,4 ·uWAk
)
·pRE,5

) (A.1e)

xPRk+1 = p
PR,0 ·xPSk+1

· log
(

pPR,1 ·xEMk+1
xPSk+1 + x

DS
k+1 + pPR,2

+ 1
)

(A.1f)

xSAk+1 = min
{
pSA,0 ·xDSk+1

· log
(

pSA,1 ·xEMk+1
xPSk+1 + x

DS
k+1 + pSA,2

+ 1
)

;

xSHk + x
PR
k+1; p

SA,3 ·xDEk+1

}
(A.1g)

xSHk+1 = x
SH
k −x

SA
k+1 + x

PR
k+1 (A.1h)

x
SQ
k+1 = p

SQ,0 ·xMOk + p
SQ,1 ·xMQk +

pSQ,2 ·uRQk
(A.1i)

x
MQ
k+1 = x

MQ
k ·p

MQ,0

· exp
(
−pMQ,1

xPRk
xPSk + pMQ,2

)
+ pMQ,3 · log

(
uMAk ·p

MQ,4 + 1
) (A.1j)

xMOk+1 =
(
1 −pMO,0

)
·xMOk + p

MO,0

· log
(
pMO,1 · (uEMk + p

dEM )

+ pMO,2 ·uDPSk + p
MO,3 ·uDDSk

+ pMO,4 ·uWAk + p
MO,5 ·xREk

+ pMO,6
)
· exp

(
− (pMO,7 ·udPSk

+ pMO,8 ·udDSk ) + p
MO,9

)
·pMO,10

(A.1k)

xCAk+1 = p
CA,0 ·

(
xCAk +

(
xSAk+1 ·u

SP
k

)
+
(
udPSk ·p

CA,1) + (udDSk ·pCA,2)
−
(
xEMk+1 ·u

WA
k

)
−
(
xPRk+1 ·u

RQ
k ·p

CA,3
)

−
(
xPSk ·p

CA,4)−(xDSk ·pCA,5)
−uMAk −u

AD
k −

(
xSHk+1 ·p

CA,6)
−
(
uDPSk ·p

CA,7)
−
(
uDDSk ·p

CA,8))

(A.1l)

Additional constraints are given by the inequalities
shown in equations (A.2a) to (A.2e),

udPSk + u
dPS
k−1 ≤ p

dPS, (A.2a)

pDEM,0 ·xPSk + p
DEM,1 ·xDSk ≥ u

EM
k , (A.2b)

xEMk ,x
PS
k ,x

DS
k ≥ 1, (A.2c)

xSHk ,x
PR
k ,x

SA
k ,x

DE
k ≥ 0, (A.2d)

xREk ,x
SQ
k ,x

MQ
k ,x

MO
k ≥ 0, (A.2e)

and the simple bounds on the controls (A.3a) to (A.3j),

uSPk ∈ [35 M.U., 55 M.U.], (A.3a)

uADk ∈ [1000 M.U., 2000 M.U.], (A.3b)

uWAk ∈ [1000 M.U., 2000 M.U.], (A.3c)

uMAk ∈ [10 M.U., 5000 M.U.], (A.3d)

u
RQ
k ∈

{
pRQ,1 ,pRQ,2

}
, (A.3e)

uEMk ∈
[
−pdEM,∞

]
∩Z+, (A.3f)

uDPSk ∈
[
0,pDPS

]
∩Z+, (A.3g)

udPSk ∈
[
0,∞

]
∩Z+, (A.3h)

uDDSk ∈
[
0,pDDS

]
∩Z+, (A.3i)

udDSk ∈
[
0,pdDS

]
∩Z+. (A.3j)

We use the objective function

max
x,u,p

xCAtf , (A.4)

i.e., maximizing the capital at the end.
Of course, the set of parameters has a significant in-

fluence on the model behavior. One could, e.g., think
of applying derivative-free optimization methods with
a subset of the parameters to determine an appropri-
ate parameter set for a microworld like IWR Tailor-
shop. For this work, however, we set up a parameter
set manually such that the model fulfills a certain de-
sired behavior. The chosen parameters also yield a
model behavior that makes sense for the optimization,
i.e., there are feasible solutions and the optimization
problem is not unbounded. The parameter values used
throughout this work unless otherwise stated are listed
in Tables A.1 and A.2.

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 17

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

Parameter Value

pDE,0 2200.0 shirts
pDE,1 2 · 10−2 shirts/MU
pDE,2 2 · 10−2 1/MU
pDE,3 0.5
pRE,0 0.5
pRE,1 0.672
pRE,2 2.5 · 10−5 1/MU
pRE,3 10−4 shirts/MU
pRE,4 6 · 10−5 persons/MU
pRE,5 12.0
pPR,0 99.9 shirts/sites
pPR,1 2.0 sites/persons
pPR,2 10−6 sites
pSA,0 99.9 shirts/sites
pSA,1 2.0 sites/persons
pSA,2 10−6 sites
pSA,3 1.0
pSQ,0 0.2
pSQ,1 0.3
pSQ,2 0.5
pMQ,0 0.8
pMQ,1 6 · 10−3 sites/shirts
pMQ,2 10−6 sites

Parameter Value

pMQ,3 0.13
pMQ,4 0.2 MU−1
pMO,0 0.5
pMO,1 2 · 10−2 persons−1
pMO,2 0.5 sites−1
pMO,3 0.25 sites−1
pMO,4 2.0 · 10−4 persons/MU
pMO,5 0.3
pMO,6 1.0
pMO,7 2.5 sites−1
pMO,8 2.0 sites−1
pMO,9 1.0
pMO,10 0.5
pCA,0 1.03
pCA,1 5000 MU/site
pCA,2 3500 MU/site
pCA,3 5.0 MU/shirt
pCA,4 1000 MU/site
pCA,5 700 MU/site
pCA,6 1.5 MU/shirt
pCA,7 10000 MU/site
pCA,8 7000 MU/site

Table A.1. Parameter set for states used with IWR Tailorshop. MU means monetary units.

Parameter Value

nRQ 2
pRQ,1 0.5
pRQ,2 1.0
pDEM,0 5 persons/site
pDEM,1 10 persons/site
pdEM 10 persons
pDPS 1 site
pdPS 1 site
pDDS 2 sites
pdDS 1 site

Table A.2. Parameter set for controls used with IWR Tailorshop.

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 18

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

Variable Round 1 Round 2 Round 3 Round 4

Employees xEM0 14 3 14 42
Production sites xPS0 1 1 1 2
Distribution sites xDS0 1 5 1 7
Shirts in stock xSH0 319 0 319 0
Production xPR0 270 69 270 467
Sales xSA0 270 69 270 467
Demand xDE0 3877 2399 3877 3065
Reputation xRE0 0.7934 0.1805 0.7934 0.4711
Shirts quality xSQ0 0.7500 0.6558 0.7500 0.8136
Machine quality xMQ0 0.8125 0.9998 0.8125 0.7712
Motivation of employees xMO0 0.7403 0.4032 0.7403 0.5108
Capital xCA0 175226 28075 175226 323907
Shirt price uSP0 50 39 50 42
Advertising uAD0 2000 1599 2000 1337
Wages uWA0 1500 1750 1500 1451
Maintenance uMA0 500 3000 500 267
Resources quality uRQ0 2 1 2 2
Recruit employees uDEM0 0 0 0 0
Dismiss employees udEM0 0 0 0 0
Create production site uDPS0 0 0 0 0
Close production site udPS0 0 0 0 0
Create distribution site uDDS0 0 0 0 0
Close distribution site udDS0 0 0 0 0

Table A.3. Initial values for each round used in IWR Tailorshop feedback study. Note that values for controls (lower part) were only
preset values and could still be changed by the participant. The last six controls, starting from recruit employees, were always set to the
value in the table after each month to avoid accidental recruitment and dismissal as well as site creation and closing. Round 1 and 3 had
the same initial values.

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 19

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

Round
Means t test

co&hs control of co&hs < of control < of

1 25869.2 24274.1 112042.8 0.0009 0.0020
2 -58869.2 -57289.0 -1174.4 0.0000 0.0003
3 124185.3 128502.7 172860.8 0.0091 0.0182
4 170923.2 166039.1 293403.4 0.0029 0.0059

Sum 262108.6 261526.8 577132.7 0.0000 0.0002

(a) Welch’s t-test p-values of comparison of score means for each round between control and highscore groups (co,
hs) on the one side and groups with optimization-based feedback (of) on the other side with all complete datasets
without 6 outliers (N = 94). With α = 0.05, optimization-based feedback groups were significantly better than
those without (co&hs as well as co alone).

Round Highscore Indicate Trend Value Chart

1 0.4429 0.0001 0.0005 0.0000 0.8531
2 0.5891 0.3804 0.0002 0.0000 0.5414
3 0.6216 0.2168 0.0507 0.0000 0.3622
4 0.4200 0.4037 0.0133 0.0000 0.0577

Sum 0.4947 0.1539 0.0007 0.0000 0.3935

(b) Welch’s t-test p-values of comparison of score means for each round to control group with all complete
datasets without 6 outliers (N = 94). Alternative hypothesis was that mean of control group is lower. With
α = 0.05, only value group is significantly better than control group in all rounds. However, trend group misses
significance only in round 3 by narrow margin.

control highscore indicate trend value chart

Mean -31807.3 -32308.6 -27065.5 -31202.2 -32194.4 -29073.8
KS test 0.2192 0.6468 0.5051 1.0000 0.6880 0.9652
t-test — 0.8988 0.1455 0.8231 0.9335 0.4110

(c) Comparison of Use of Potential by feedback groups in first month for all complete datasets without
6 outliers (N = 94): no significant differences between groups. Values can be considered to be normally
distributed.

Table A.4. Different statistical tests. Bold valus of the test statistics indicate significance (α = 0.05).

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 20

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

Round control highscore indicate trend value chart

1 599.1 767.5 359.9 1286.5 -91.3 2366.5
2 1140.9 965.4 714.2 725.0 22.2 610.7
3 814.4 350.8 104.6 -616.5 -448.5 294.7
4 3717.4 2837.5 1304.5 847.9 78.0 1097.9
Feedback rounds sum 1740.0 1732.9 1074.1 2011.5 -69.1 2977.2
Performance rounds sum 4531.8 3188.3 1409.2 231.4 -370.5 1392.6
Total sum 6271.8 4921.2 2483.2 2242.9 -439.6 4369.9

(a) Means

Round control high-
score

indicate trend value chart

1 0.1551 0.2901 0.7662 0.4528 0.0748 0.0493
2 0.5016 0.9603 0.9348 0.4203 0.6070 0.6826
3 0.8186 0.9434 0.7300 0.7786 0.4601 0.9627
4 0.9961 0.8713 0.8615 0.9498 0.9832 0.6299

(b) Kolmogorov-Smirnov test

Round control high-
score

indicate trend value chart

1 0.1051 0.1820 0.2194 0.0036 0.6708 0.0960
2 0.0002 0.0248 0.1263 0.0045 0.4718 0.0787
3 0.0002 0.1528 0.3853 0.9399 0.9646 0.2284
4 0.0000 0.0016 0.0435 0.1053 0.4542 0.0858

(c) Welch’s t-test for µ>0

Round control high-
score

indicate trend value chart

1 0.8949 0.8180 0.7806 0.9999 0.3292 0.9040
2 0.9998 0.9752 0.8737 1.0000 0.5282 0.9213
3 0.9998 0.8472 0.6147 0.0601 0.0354 0.7716
4 1.0000 0.9984 0.9565 0.8947 0.5458 0.9142

(d) Welch’s t-test for µ<0

Table A.5. Parameter m by feedback groups for all complete datasets without 6 outliers (N = 94): means, Welch’s t-test, and Kolmo-
gorov-Smirnov test. The values of all groups can be considered to be normally distributed in all rounds except for chart group in round
1. trend group is the only group with a significant learning effect in the first two rounds, value group the only one with a significantly
decreasing performance in round 3. Bold valus of the test statistics indicate significance (α = 0.05).

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 21

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

Round Mean t-Test µ > 0

1 879.1 0.0016
2 789.9 0.0000
3 154.0 0.1365
4 1991.2 0.0000

Table A.6. Regression m for all complete datasets without 6 out-
liers (N = 94): means and Welch’s t-test results (α = 0.05).
Participants show significant learning effects in all rounds except
for round 3, in which especially value group is significantly < 0.

Round Low Mid High

1 305.2 924.5 1365.9
2 1439.3 637.2 433.2
3 549.4 148.2 -230.2
4 4409.5 1888.7 -230.4
Feedback 1744.4 1561.7 1799.2
Performance 4958.9 2036.9 -460.7
Sum 6703.3 3598.6 1338.5

Table A.7. Means for regression m according to performance in
performance rounds (low: below lower quartile, mid: between lower
and higher quartile, high: above higher quartile) for all complete
datasets without 6 outliers (N = 94): high performers have the
highest mean for m in the first round and the lowest in all other
rounds.

Claim Answer Correct Wrong Don’t know

Motivation of employees plays an important role. false 56% 28% 16%
Maintenance is an important intervention possibility. false 55% 26% 19%
The higher the shirt price is, the lower is the demand. true 41% 45% 14%
Opening and Closing sites are important intervention
possibilities.

true 90% 3% 6%

It is wise to dismiss employees at the end. true 31% 33% 36%

Table A.8. Survey on model properties at the end of task. The participants were told that “We would like to ask you a few questions once
again. Your answers will help us very much and it only takes two minutes. [. . . ] Please decide if the following propositions are correct
or wrong according to your experience from all four rounds.” Participants could always choose between true, false, and don’t know. The
content of the five items can be found in the claim column, the correct answer is shown in the corresponding column. The remaining
columns show the ratio of correct, wrong and don’t know answers among all participants. Differences to 100% are due to rounding.

Round High Score Mid Score Low Score High > Low High > Mid Mid > Low

1 3.17 2.50 2.79 0.1477 0.0205 0.8417
2 3.42 2.65 2.25 0.0004 0.0063 0.0770
3 3.46 2.74 2.04 0.0000 0.0061 0.0068
4 3.50 2.70 2.08 0.0000 0.0023 0.0142

Sum 3.33 2.80 2.04 0.0001 0.0384 0.0035

(a) Means of model knowledge for participants with high (i.e., best 25%), mid (between 1st and 3rd quartile), and low (i.e.,
worst 25%) score in the corresponding round with all complete datasets without 6 outliers (N = 94). Pairwise comparison
of means by Welch’s t-test with α = 0.05 shows, that high scorers know significantly more about the model than mid or low
scorers.

Round High Score Mid Score Low Score High > Low High > Mid Mid > Low

1 0.75 1.07 0.79 0.4376 0.0909 0.8716
2 0.58 1.07 0.96 0.0711 0.0181 0.6733
3 0.67 0.87 1.25 0.0097 0.1667 0.0633
4 0.71 0.93 1.08 0.0820 0.1612 0.2740

Sum 0.67 0.98 1.04 0.0696 0.0930 0.3924

(b) Means of model uncertainty for participants with high (i.e., best 25%), mid (between 1st and 3rd quartile), and low (i.e.,
worst 25%) score in the corresponding round with all complete datasets without 6 outliers (N = 94). Uncertainty means are
lower for high scorers. Pairwise comparison of means by Welch’s t-test with α = 0.05 barely shows significance, however.

Table A.9. Different tests for model uncertainty and model knowledge. Bold valus of the test statistics indicate significance (α = 0.05).

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 22

http://dx.doi.org/10.11588/jddm.2017.1.34608


Engelhart et al.: Optimization-based training

Round Low (0/1) Mid (2/3) High (4/5)

1 101085.2 42407.9 110944.2
2 -51748.1 -40915.9 10319.9
3 108448.1 135943.2 200281.3
4 80163.6 214925.1 366269.3

(a) Mean score values for different levels of model
knowledge

Round Low (0/1) Mid (2/3)

1 69516.5 86706.5
2 -22096.4 -42847.0
3 159269.1 124416.9
4 259346.6 171036.4

(b) Mean score values for different levels of
model uncertainty

R Low < High Low < Mid Mid < High

1 0.3740 0.9743 0.0188
2 0.0005 0.2657 0.0010
3 0.0004 0.1447 0.0007
4 0.0001 0.0223 0.0001

(c) Student’s t-test p-values for model knowledge

Round Mid < Low

1 0.7335
2 0.1221
3 0.1020
4 0.0626

(d) Student’s t-test p-values for model
uncertainty

Table A.10. Scores for different model knowledge and uncertainty levels (R: round) with all complete datasets without 6 outliers (N = 94).
With α = 0.05, participants with high model knowledge have achieved a significantly better score in almost all rounds. For model
uncertainty, no significant score differences have been observed.

Property co hi in tr va ch All

Knowledge

low 24% 8% 44% 5% 18% 9% 17%
mid 59% 54% 33% 48% 36% 73% 52%
high 17% 38% 22% 48% 45% 18% 31%
mean 2.38 3.00 2.22 3.19 3.09 2.64 2.74
t-test — 0.0451 0.6241 0.0113 0.0824 0.2377 —

Uncertainty

low 72% 69% 67% 95% 82% 64% 77%
high 28% 31% 33% 5% 18% 36% 23%
mean 1.03 1.15 1.22 0.38 0.91 1.09 0.91
t-test — 0.6525 0.6630 0.0017 0.3545 0.5545 —

Table A.11. Ratio of model knowledge and uncertainty levels for all feedback groups (co: control, hs: highscore, in: indicate, tr: trend,
va: value, ch: chart) with all complete datasets without 6 outliers (N = 94). Mean refers to mean uncertainty and knowledge per group.
Alternative hypothesis for Welch’s t-test was that mean of control group is lower (knowledge) or higher (uncertainty) respectively. For
α = 0.05, only trend group is significantly better in both knowledge and uncertainty. Differences to 100% are due to rounding.

Round Low Mid High

1 93.3 1013.0 1086.4
2 666.2 888.5 691.6
3 111.0 301.0 -70.5
4 3257.4 1979.4 1312.6

(a) Means for Regression m

Low < High Mid < High Low < Mid

0.0207 0.4518 0.0686
0.4788 0.7564 0.3262
0.6712 0.8680 0.3020
0.9828 0.8480 0.9291

(b) Welch’s t-test

Table A.12. Regression m according to model knowledge (low: below lower quartile, mid: between lower and higher quartile, high: above
higher quartile) for all complete datasets without 6 outliers (N = 94): those with low model knowledge learned less in the first round,
and more in the last round. In comparison with high group, this is significant.

10.11588/jddm.2017.1.34608 JDDM | 2017 | Volume 3 | Article 2 | 23

http://dx.doi.org/10.11588/jddm.2017.1.34608

	References