Int. J. of Computers, Communications & Control, ISSN 1841-9836, E-ISSN 1841-9844
Vol. IV (2009), No. 2, pp. 185-197

Quality Control of Statistical Learning Environments and Prediction of Learning Outcomes through Reproducible Computing

Patrick Wessa
K.U.Leuven Association Lessius, Dept. of Business Studies, Belgium
E-mail: patrick@wessa.net

Abstract: This article introduces a new approach to statistics education that allows us to accurately measure and control key aspects of the computation and communication processes that are involved in non-rote learning within the pedagogical paradigm of Constructivism. The solution that is presented relies on a newly developed technology (hosted at www.freestatistics.org) and computing framework (hosted at www.wessa.net) that support reproducibility and reusability of statistical research results that are presented in a so-called Compendium. Reproducible computing leads to responsible learning behaviour and to a stream of high-quality communications that emerges when students are engaged in peer review activities. More importantly, the proposed solution provides a series of objective measurements of actual learning processes that are otherwise unobservable. A comparison between actual and reported data demonstrates that reported learning process measurements are highly misleading in unexpected ways. Reproducible computing and objective measurements of actual learning behaviour, however, reveal important guidelines that allow us to improve the effectiveness of learning and of the e-learning system.
Keywords: Reproducible Computing, Learning Environment, Quality Control, Statistics Education, Psychometrics

1 Introduction

In education-related research it is common practice to investigate learning processes through measurements that are based on questionnaires.
Reported measures often reveal interesting information about a wide variety of aspects of computer-assisted learning such as: computer attitudes [22]; computer emotions and knowledge [17]; learner experiences and satisfaction [34]; etc. The importance of such measurements has been highlighted by many authors from various perspectives ([7], [15], [12]), especially from the perspective of the constructivist pedagogical paradigm ([35], [30], [11], [24]).
These reported measures, while intrinsically interesting, may not always provide us with the information we need to assess and improve systems that support e-learning. Moreover, the implementation of new learning technologies and data analysis tools opens up a wide array of measurement opportunities which lead to new areas of research. An excellent example is the use of data mining tools in the open source e-learning environment called Moodle [28].
Even though it seems to be very difficult to measure and empirically prove [25], there is no doubt in my mind that the introduction of computers in homes and classrooms has led to an improvement of overall learning productivity, educational communication mechanisms, social constructivism, and collaboration. However, the use of computers and software in statistics education may unintentionally result in several types of adverse effects, because the complex processes that are required to learn and (truly) understand statistical concepts are often mystified by technicalities and a variety of practical problems that have nothing to do with mathematics or statistics. It is within this context that I argue that a system for Quality Control should be embedded into the e-learning system, which is not limited to the Virtual Learning Environment but extends to the statistical software, databases, and learning repositories (Statistical Learning Environment).

Copyright © 2006-2009 by CCC Publications
There is an important, additional benefit to implementing such a monitoring and control system: it is directly related to the problem of irreproducible research, which has received a great deal of attention within the statistical computing community ([9], [26], [29], [14], [13], [18], [10]). The most prominent citation about the problem of irreproducible research is called Claerbout's principle ([9]):

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions that generated the figures...

Several solutions have been proposed ([5], [10], [19]) but have not been adopted in statistics education because they require students to understand the technicalities of scientific word processing (LaTeX) or statistical programming (R code). Based on a newly developed Statistical Learning Environment (SLE) I propose a solution that is feasible for educational purposes and allows us to monitor, research, and control the learning processes based on the dynamics of between-student communication and collaboration.

2 Reproducible Computing

2.1 R Framework

The R Framework allows educators and scientists to develop new, tailor-made statistical software (based on the R language) within the context of an open-access business model that allows us to create, disseminate, and maintain software modules efficiently and at a very low cost in terms of computing resources and maintenance efforts [36]. The so-called R modules empower students to perform statistical analyses through a web-based interface that does not require them to download or install anything on the client machine.
This permits students to focus primarily on the interpretation of the analysis. At the same time, the R Framework allows advanced students and scientists to inspect and change the R code that was written by the original author. This results in the creation of so-called derived R modules that may be better suited for particular purposes.
There are several important reasons why the R Framework helps in controlling the quality of the statistical learning processes that are supported by the computer:
• The R modules are web applications with advanced session management which covers all aspects of the computations that are executed. In addition, the session manager uses attributes that identify the student and the course in which (s)he is enrolled. Therefore all computations that are performed within the context of a statistics course can be associated with an individual student. To implement this feature, the educator only needs to use certain HTML tags in the hyperlink that is inserted in the virtual learning environment.
• Every R module is uniquely described by an expandable set of meta data (including the actual statistical code) which can be stored and transmitted. This implies that every computation that is executed can be uniquely defined by the R module's meta data and additional information about the data and the parameters that have been specified by the user. As a consequence, every computation can be uniquely described and archived with meta data.
• The R Framework allows other servers (under certain conditions) to send meta data through an ordinary HTTP request, which allows it to rebuild and execute the R module with the specified data and parameters in real time. Therefore it is possible to remotely store computational objects and send them back to the R Framework such that the original computation can be reproduced and reused.
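An archived computation is thus fully determined by the module's meta data plus the user's data and parameters. The sketch below illustrates what such a self-describing payload could look like; the field names and the JSON encoding are assumptions made for illustration only, since the article does not publish the actual schema used by the R Framework.

```python
import json

def build_computation_request(module_id, code, data, parameters):
    """Bundle the meta data that uniquely identifies one computation.

    All field names are hypothetical stand-ins; the real R Framework
    schema is not documented in this article.
    """
    return json.dumps({
        "module": module_id,      # which R module to rebuild
        "code": code,             # the statistical (R) code of the module
        "data": data,             # the user's data set
        "parameters": parameters, # user-specified options
    }, sort_keys=True)

# Because this payload fully determines the computation, any server that
# stores it can later send it back (in an ordinary HTTP request) to have
# the computation reproduced and reused.
payload = build_computation_request(
    "histogram", "hist(x)", [1.2, 3.4, 5.6], {"bins": 3})
```

The essential design point is that the payload, not the rendered result, is what gets archived: anyone holding it can replay the analysis exactly.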
• All the processes that are associated with the above items are automatically stored in a so-called process measurement database. This implies that all computer-assisted learning activities are objectively measured and stored for the purpose of analysis.

2.2 Compendium Platform

If a derived R module contains generic improvements, or if a computation needs to be communicated to other students/scientists, then it is necessary to have a simple, transparent mechanism that allows one to permanently store the computation in a repository of computational objects that can be easily retrieved, recomputed, and reused. Such a repository was recently created within the OOF 2007/13 project of the K.U.Leuven Association and is called the Compendium Platform.
The main reason for creating the R Framework and the Compendium Platform is that they allow anyone to create and use Compendia of reproducible research. A Compendium is defined as [37]: a research document where each computation is referenced by a unique URL that points to an object that contains all the information that is necessary to recompute it. Such documents can be easily created (even by students) and permit any reader to (exactly) recompute the statistical results that are presented therein. A few simple clicks are sufficient to have the R Framework reproduce the results and to reuse them in derived work [37]. The practical implications of this technology will become obvious in section 3 because the three figures that are presented there can be recomputed and reused through the Compendium Platform.

2.3 Communication, Feedback, and Learning

The concept of Reproducible Computing was implemented in several undergraduate statistics courses in order to thoroughly test the new system and to measure key aspects of the educational activities and experiences. Two different student populations were investigated in detail: a group of (academic) bachelor students, and a group of so-called switching students.
The second population is of particular interest because it consists of students who obtained a (professional) bachelor degree and decided to make the switch to an academic master, which requires them to complete a preparatory year.
On the one hand, switching students are highly motivated and more mature than the bachelor students. A priori, one would expect them to prefer practical activities (such as communication and computing) over theory and critical reflection. On the other hand, one might expect the bachelor students to have a more critical (scientific) attitude and a better mathematical background than the switching students.
Students from both populations took a similar statistics course which covered topics from introductory statistics, regression analysis, and introductory time series analysis. The main learning activities in both statistics courses were based on a weekly series of workshops where each student was required to investigate practical, empirical problems. At the end of each week, students submitted their papers electronically. During the lecture I proposed a series of solutions and illustrated commonly made mistakes. After the lecture, students had to work on the next assignment and complete a series of peer reviews (assessments) of the work that was submitted the week before. The assessment grades did not count towards the final score; however, each submitted peer review was accompanied by verbal feedback messages. I graded a (quasi-random) sample of these messages in order to provide students with a strong incentive to take the review process seriously. There is strong empirical evidence that this approach had beneficial effects on non-rote learning of statistical concepts [38].
3 Objective Measurements versus Reported Data

In a recent paper [37] it is illustrated how the Compendium Platform's repository supports "technical" quality control of the statistical software and accompanying documentation for students. On the one hand, reproducible computing allows students to accurately communicate computational problems and questions without the need to understand the underlying technicalities. On the other hand, it allows the educator (and creator of the computational software) to analyze the reported problem (based on the detailed, raw output of the R engine that executed the request) and to transparently communicate the solutions to the students.
Moreover, the measurement of learning activities and experiences is a conditio sine qua non for controlling the "overall" quality of the SLE. This will be illustrated based on the data that have been collected from both student groups. At the same time, the importance of objective (as opposed to reported) measurements is illustrated based on a simple, comparative diagnostic tool.
The reported measurements were obtained through questionnaires on a 5-point Likert scale and should consequently be treated as ordinal data. The questions were based on well-known psychological surveys ([12], [8]) and the IBM computer system usability survey [20] which was adapted and extended [27]. Useful data was obtained from a total of 111 bachelor students and 129 switching students; the response ratio was very high (between 82.9% and 92% depending on the questionnaire). All observations of actual learning activities were measured on a ratio scale (the number of archived computations and the number of submitted feedback messages). A total of 34438 meaningful, verbal feedback communications and 6587 archived computations were registered. In order to compare the actual and reported data, all measurements were converted to ordinal rank orders.
In addition, Pearson's rho correlations and Kendall's tau rank correlations ([1], [2], [16]), which represent the degree of association between the properties under investigation, were computed (these can be consulted in the archived computations about the Figures). In electronic versions of this paper, one can simply (ctrl-)click the hyperlinks below Figures 1, 2, and 3 to view the archived computation in the repository. Readers of the printed version of this document have to manually enter the respective URLs into their internet browser to view the statistical computations that have been stored (at www.freestatistics.org).
Figure 1 displays the bivariate kernel density [21] between the rank order of the number of feedback messages that have been submitted in peer reviews about the workshops (x-axis) and the rank order of the number of (reproducible) computations that have been archived in the repository (y-axis). The rank orders have been computed within the Bachelor population for the top panels, and within the Switching population for the bottom panels. This implies that the ranks that are attributed to female and male students are expressed on the same axes and can be compared.
Figure 1 clearly demonstrates that female bachelor students are much more involved in feedback and computing than their male colleagues. At the same time, female switching students are more computing-oriented whereas the male switching students seem to have a slight preference for feedback communication. This information has important repercussions for controlling the quality of the learning environment and it provides clear guidelines towards actions that should be taken (by me) to improve participatory incentives for male bachelor students in future courses.
Would I have been able to gain this insight based on reported measurements alone? The answer is clearly negative (as is illustrated in Figures 2 and 3).
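The conversion of ratio-scale counts to ordinal rank orders and the rank correlation can be sketched in a few lines. The paper's actual computations were performed with R modules on the Compendium Platform; the pure-Python version below (Kendall's tau-a, i.e. without tie corrections in the denominator) is only meant to make the procedure concrete, and uses made-up counts in place of the real measurements.

```python
def rank_order(xs):
    """Convert ratio-scale counts to ordinal rank orders (average ranks for ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                          # extend the block of tied values
        avg = (i + j) / 2.0 + 1.0           # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2.0)

# Made-up per-student counts: feedback messages and archived computations.
messages = [12, 40, 7, 55, 23, 31, 5, 48]
computations = [8, 35, 4, 60, 20, 25, 9, 41]
tau = kendall_tau(rank_order(messages), rank_order(computations))
```

A tau close to +1 indicates that students who rank high on feedback also rank high on computing, which is the kind of monotone association the archived computations report.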
[Figure 1: Submitted Feedback versus Reproducible Computations. Bivariate kernel density contour plots, one panel per group (Female/Male Bachelor Students, Female/Male Switching Students); x-axis: # submitted messages, y-axis: # reproducible computations.]
www.freestatistics.org/blog/date/2008/Jun/30/t1214840420q0fyankop4x9ebf.htm

It is quite obvious that male bachelor students highly over-estimate their performance in terms of feedback submissions (see Figure 2) because the rank orders of reported measures (x-axis) are higher than the ranks of actual feedback submissions (y-axis). Female bachelor students, however, underestimate their involvement (relative to their male colleagues) because they are concentrated above the diagonal line. In the male switching student population several clusters of high density can be detected, which leads us to conclude that we cannot treat them as one homogeneous group.
In Figure 3 the comparison between reported computing measures (x-axis) and actual computing (y-axis) leads to similar conclusions. Male bachelor students highly exaggerate their efforts, whereas female bachelor and switching students underestimate themselves. The group of male switching students is heterogeneous.
Overall, the testimony of students is extremely misleading and poorly correlated with actual observations. If we recomputed Figure 1 with reported measures, the conclusions would be the opposite of what is true. The reader can try out this experiment by simply reproducing the computation of Figure 1 with reported measures on both axes.

4 Quality Control

In order to be able to control (and improve) the quality of the SLE, it is necessary to estimate the impact of key aspects of the learning processes that are associated with the SLE. The methodology that allows us to do this is based on a mathematical model which is described in [40] and relates the learning outcomes to objectively measured activities and reported experiences.
Typically, models that predict learning outcomes based on exogenous variables that are related to the learning (and computing) environment have an extremely low percentage of variance explained. In a recent and extensive study [25], six models were discussed that predicted the Statistics subtest scores of the Massachusetts Comprehensive Assessment System; the variance explained ranged between 4% and 7%. It is obvious that any model that is used to control the quality of an SLE should perform much better. There are three important requirements for building high-quality models:
1. high-quality exogenous variables (preferably based on objective measurements) [39];
2. a high-quality endogenous variable (c.q. test scores) based on optimal weights of the individual items (section 4.1, [40]);
3. a homogeneous sample for which the model is computed.
The third condition refers to the fact that student populations may consist of different types of students with specific learning behaviors. In the aforementioned statistics course there were 4 groups with distinct characteristics. This is clearly illustrated in section 3 and in Figures 1, 2, and 3.
Instead of computing separate models (for each of the sub-populations), section 4.2 presents a comprehensive model with all combinations of interaction effects (male/female and Bachelor/Switching). This greatly improves the interpretation of the prediction model and allows us to perform differential quality control of the SLE.

[Figure 2: Reported versus Actually Submitted Feedback. Bivariate kernel density contour plots, one panel per group (Female/Male Bachelor Students, Female/Male Switching Students); x-axis: reported feedback submissions, y-axis: actual feedback submissions.]
http://www.freestatistics.org/blog/date/2008/Jun/30/t12148409608o0dnj2k4s04jil.htm

4.1 Model

First, a classical regression approach is used to predict the learning outcomes (c.q.
exam scores) as a linear function of $(K-1) \in \mathbb{N}_0$ exogenous variables of interest. Let $\vec{y}$ represent an $N \times 1$ vector for all $N \in \mathbb{N}$ students (with $N > K$), containing the weighted sum of $G$ item scores (c.q. scores on individual exam questions): $\vec{y} \equiv \sum_{j=1}^{G} \omega_j \vec{y}_j$ with initial unit weights $\omega_j \equiv 1$. In addition, define an $N \times K$ matrix $X$ that represents all exogenous variables (including a one-valued column which represents the constant), and a $K \times 1$ parameter vector $\vec{b}$ that represents the weights of the linear combination of all columns in $X$ that is used to describe $\vec{y}$. The complete model is denoted $M_1$ and is defined by $\vec{y} = X\vec{b} + \vec{e}$, where $\vec{e} \sim \text{iid } N(\vec{0}, \sigma_e^2)$ represents the prediction error.

[Figure 3: Reported versus Actual Reproducible Computing. Bivariate kernel density contour plots, one panel per group (Female/Male Bachelor Students, Female/Male Switching Students); x-axis: reported intention to use, y-axis: # reproducible computations.]
http://www.freestatistics.org/blog/date/2008/Jun/30/t1214841152sn6jlyhgseclgqm.htm

In the second model $M_2$, the prediction of the first model is specified by a linear combination of the individual items (questions) that made up the total exam score.
Let $Y$ represent the $N \times G$ matrix that contains all $G$ item scores; then it is possible to define the model $\hat{\vec{y}} = Y\vec{c} + \vec{a}$ where $\vec{a} \sim \text{iid } N(\vec{0}, \sigma_a^2)$. Note that there is no constant term in this model.
The third model ($M_3$) simply combines $M_1$ and $M_2$ by relating $\hat{\hat{\vec{y}}}$ to $X$ in the regression model $\hat{\hat{\vec{y}}} = X\vec{f} + \vec{u}$. The estimator for $\vec{f}$ can be shown to be $\hat{\vec{f}} = (X'X)^{-1} X' \hat{\hat{\vec{y}}} = (X'X)^{-1} X'Y (Y'Y)^{-1} Y'X (X'X)^{-1} X' \vec{y}$ ([40]). $M_3$ is likely to yield different results from $M_1$ unless the estimated parameters of $M_2$ are (nearly) equal to the original weights: $\hat{\vec{c}} = (\hat{c}_1, \hat{c}_2, \hat{c}_3, \ldots, \hat{c}_G)' \simeq (\omega_1, \omega_2, \omega_3, \ldots, \omega_G)'$.
From a statistical point of view it is not possible to test the improvement that is induced by the objective exam score transformations with the traditional F-test, because that test assumes that the endogenous variables in the two models to be compared ($M_1$ and $M_3$) are identical. Therefore it is necessary to use an auxiliary model ($M_3^*$) which is based on $M_3$ and includes $\vec{y}$ as an explanatory variable. This extended model, $\hat{\hat{\vec{y}}} = X\vec{f} + \vec{y}g + \vec{u}$, can be shown to be equivalent to $\left( Y (Y'Y)^{-1} Y'X (X'X)^{-1} X' - gI_N \right) \vec{y} = X\vec{f} + \vec{u}$, such that it can be concluded that $M_3^*$ is equal to $M_1$ with a transformed endogenous variable. The interesting aspect of this auxiliary regression is the limiting case when $g \to 0$ and $Y (Y'Y)^{-1} Y'X (X'X)^{-1} X' \to I_N$, because it leads to $M_1$ with $\vec{f} = \vec{b}$ and $\vec{u} = \vec{e}$.
This result is important because it is now easy to test whether it is necessary to apply the transformation to the endogenous variable. The null hypothesis is simply $H_0: g = 0$ versus $H_1: g \neq 0$, which can be tested with the conventional t-test. In other words, if the null hypothesis is rejected then the transformation is necessary and the estimated parameters $\hat{\vec{c}}$ and $\hat{\vec{f}}$ are interpretable. The usefulness of this modeling approach is illustrated in the next subsection.
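The closed-form estimator and the auxiliary t-test can be checked numerically. The NumPy sketch below uses randomly generated toy data (not the study's exam scores): it builds the doubly-fitted outcome $Y(Y'Y)^{-1}Y'X(X'X)^{-1}X'\vec{y}$, verifies that the stated formula for $\hat{\vec{f}}$ coincides with an ordinary least-squares fit of that outcome on $X$, and computes the t-statistic for $g$ in the auxiliary model $M_3^*$.

```python
import numpy as np

rng = np.random.default_rng(42)
N, K, G = 200, 4, 6   # students, regressors (incl. constant), exam items

# Toy data: X holds a constant plus K-1 exogenous measurements, Y holds
# the G individual exam-item scores (random stand-ins for the real data).
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
Y = rng.normal(size=(N, G)) + X[:, [1]]   # items loosely driven by one regressor
y = Y.sum(axis=1)                         # M1 outcome: unit item weights

def proj(A):
    """Projection (hat) matrix A (A'A)^{-1} A'."""
    return A @ np.linalg.solve(A.T @ A, A.T)

# Doubly-fitted outcome of M2/M3, and the closed-form estimator for f.
yhh = proj(Y) @ proj(X) @ y
f_hat = np.linalg.solve(X.T @ X, X.T @ yhh)

# Auxiliary model M3*: add y as an extra regressor and t-test H0: g = 0.
Z = np.column_stack([X, y])
coef = np.linalg.solve(Z.T @ Z, Z.T @ yhh)
resid = yhh - Z @ coef
s2 = resid @ resid / (N - Z.shape[1])                 # residual variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(Z.T @ Z)))    # standard errors
t_g = coef[-1] / se[-1]   # compare with t quantiles to accept/reject H0
```

The assertion-style check that `f_hat` equals a direct regression of the doubly-fitted outcome on `X` confirms the algebra; with real data, a large `t_g` would indicate that the exam score transformation is necessary.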
4.2 Empirical Evidence

The data that were collected from the implemented SLE (as described in section 2.3) contained the following exogenous variables:
• Bcount: actual computations
• Gender: 0 = female / 1 = male
• Future: intention to use
• Pop: 0 = Bachelor / 1 = Switching
• nnzfg: actually submitted feedback messages in peer review
• Reflection: reported feedback messages in peer review
Table 1 presents the empirical results of two models (M1 and M3). The endogenous variable in M1 is the sum of all exam questions with unit weights, whereas M3 is based on objective exam score transformations (optimal weights of individual questions).

Table 1: Empirical results

Variable                  Estimate M1  sig.   Estimate M3  sig.
(Intercept)                  6.935987  *         6.333557  ***
Bcount                       0.033281            0.035939  ***
Gender                      -2.166419           -1.465320
Pop                         -4.616769           -0.494553
nnzfg                        0.027379  *         0.030161  ***
Future                       0.625812  .         0.639711  ***
Reflection                  -0.167591           -0.167980  **
Bcount:Gender               -0.027786           -0.036090  **
Bcount:Pop                   0.018510           -0.007901
Gender:Pop                   3.699211            2.116220
Gender:nnzfg                -0.001449           -0.013074  **
Pop:nnzfg                   -0.020088           -0.024768  ***
Gender:Future               -0.274359           -0.354713  *
Pop:Future                  -0.038318           -0.143013
Gender:Reflection            0.161042            0.225622  *
Pop:Reflection               0.289011            0.160574  .
Bcount:Gender:Pop            0.019236            0.021735
Gender:Pop:nnzfg            -0.002538            0.009896  .
Gender:Pop:Future           -0.248325           -0.157158
Gender:Pop:Reflection       -0.128991           -0.160408
Residual standard error      3.446               0.9593
Degrees of freedom         179                 179
Adj. R-squared               0.1607              0.6626
F-statistic                  2.995     ***      21.47     ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results in Table 1 it is clear that, unlike M1, M3 provides a lot of interesting information about the relationship between optimally weighted exam scores and the exogenous variables which are under the control of the educator.
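An interaction term such as Bcount:Gender in Table 1 is simply the element-wise product of the two corresponding columns of the design matrix, so the estimated Bcount effect for a male student is the sum of the Bcount and Bcount:Gender coefficients. The sketch below illustrates this construction with invented records for three students (the R modeling formulas used in the study would generate these columns automatically).

```python
import numpy as np

# Invented records for three students: actual computations, gender dummy,
# and actually submitted feedback messages.
Bcount = np.array([12.0, 30.0, 7.0])
Gender = np.array([0.0, 1.0, 1.0])   # 0 = female, 1 = male
nnzfg = np.array([25.0, 40.0, 10.0])

# Interaction columns are element-wise products, so an effect like
# Gender:nnzfg only contributes for male students (Gender = 1).
design = np.column_stack([
    np.ones(3),         # (Intercept)
    Bcount, Gender, nnzfg,
    Bcount * Gender,    # Bcount:Gender
    Gender * nnzfg,     # Gender:nnzfg
])
```

This is why a significant negative Bcount:Gender coefficient can cancel a positive Bcount coefficient for male students while leaving the effect intact for female students.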
The percentage of variance explained (adjusted R-squared) in M3 is more than 66%, which allows us to make much better predictions than what is usually reported in otherwise excellent academic articles [25]. As explained before, the traditional F-test cannot be used to test the significance of the improvement. However, the auxiliary regression's null hypothesis H0: g = 0 is rejected even if an extremely low type I error is chosen (the p-value is 3.23 × 10^-11). This implies that M3 performs significantly better and that the objective exam score transformations are necessary. In addition, several diagnostic tests of the final model (M3) are shown in Figure 4; they indicate no statistical inadequacies.
The most interesting aspects of this analysis are the estimated parameters of M3. With regard to quality control of the SLE the following conclusions can be made:
• There is a positive effect of performing reproducible, statistical computations (Bcount). This effect is significant at the 0.1% type I error level and cannot be measured without optimal weights (M1). However, this effect is only relevant for female students because the parameter that is associated with Bcount:Gender is also significant and has a negative sign.
• Submitting feedback messages (in peer review) is very beneficial and improves exam scores (p-value < 0.01%). This effect is about twice as large for female students as for males (the Gender:nnzfg parameter partially offsets the effect for male students). In addition, students from the switching population benefit less from feedback submissions.
• The reported "intention to use" (as measured in the usability survey) positively affects exam scores. This effect is strongest for female students. Note that previous research has shown that intention is mainly related to students' perception of the comparative advantage (of the software system) for learning statistics as compared to other alternatives (such as textbooks) [27].
• Females who report a high number of submitted feedback messages have significantly lower exam scores. On the other hand, male students who exaggerate their efforts are not in danger of having lower exam scores. This implies that the female exaggeration bias is small but harmful, whereas the male exaggeration bias is big and harmless.
Based on these empirical results it is now possible to control (improve) the quality of the SLE:
• Female students should be encouraged to generate more reproducible computations.
• Peer review (based on Reproducible Computing) is highly beneficial for learning statistics, especially when it requires students to engage in submitting feedback messages to their peers. Male students need to (at least) double their efforts (compared to females) in order to obtain the same effect. Students from the switching population also need more feedback submissions than bachelor students.
• It is important to explain the SLE to students, emphasizing the comparative advantages of the system and the potentially improved exam scores. However, male students need more (or better) arguments before they accept the new technology and exhibit an increased degree of "intention to use."
• Female students who exaggerate their reported efforts should receive accurate feedback about their real performance, based on objective measurements. Self-assessment and reflection about students' actual efforts (as compared to perceived efforts) should be an integral part of the SLE.
[Figure 4: Diagnostics of M3. Four standard regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance); a few influential observations are flagged.]

5 Summary and Conclusions

The good news is that we now have a technology and methodology to assess actual and reported learning activities for any student population that makes use of the new compendium technology. Ultimately, this allows us to take control of and improve the SLE, which includes the e-learning environment, the statistical software, the course materials, and the overall learning experiences of all students.

Bibliography

[1] Arndt S., Turvey C., Andreasen N. (1999), Correlating and predicting psychiatric symptom ratings: Spearman's r versus Kendall's tau correlation, Journal of Psychiatric Research, 33, 97-104
[2] Arndt S., Magnotta V. (2001), Generating random series with known values of Kendall's tau, Computer Methods and Programs in Biomedicine, 65, 17-23
[3] Attitudes to Thinking and Learning Survey, (n.d.), Retrieved December 22, 2004, from www.moodle.org
[4] Benson J. (1989), Structural components of statistical test anxiety in adults: An exploratory study, Journal of Experimental Education, 57, 247-261
[5] Buckheit J., and Donoho D. L. (1995), Wavelets and Statistics, Springer-Verlag, Editor: Antoniadis, A.
[6] Chambers J. M., Cleveland W. S., Kleiner B., and Tukey P. A. (1983), Graphical Methods for Data Analysis, Wadsworth & Brooks/Cole
[7] Chen Z.
(2008), Learning about Learners: System Learning in Virtual Learning Environment, International Journal of Computers, Communications & Control, Vol. III, No. 1, pp. 33-40
[8] Constructivist On-Line Learning Environment Survey, (n.d.), Retrieved December 22, 2004, from www.moodle.org
[9] de Leeuw J. (2001), Reproducible Research: the Bottom Line, Department of Statistics Papers, 2001031101, Department of Statistics, UCLA, URL http://repositories.cdlib.org/uclastat/papers/2001031101
[10] Donoho D. L., and Huo X. (2005), BeamLab and Reproducible Research, International Journal of Wavelets, Multiresolution and Information Processing, 2(4), 391-414
[11] Eggen P., and Kauchak D. (2001), Educational Psychology: Windows on Classrooms (5th ed.), Upper Saddle River, NJ: Prentice Hall
[12] Galotti K. M., Clinchy B. M., Ainsworth K., Lavin B., and Mansfield A. F. (1999), A new way of assessing ways of knowing: the attitudes towards thinking and learning survey (ATTLS), Sex Roles, 40(9/10), 745-766
[13] Gentleman R. (2005), Applying Reproducible Research in Scientific Discovery, BioSilico, URL http://gentleman.fhcrc.org/Fld-talks/RGRepRes.pdf
[14] Green P. J. (2003), Diversities of gifts, but the same spirit, The Statistician, 52(4), 423-438
[15] Hilton S., Schau C., Olsen J. (2004), Survey of Attitudes Toward Statistics: Factor Structure Invariance by Gender and by Administration Time, Structural Equation Modeling, 11(1)
[16] Hollander M., and Wolfe D. A. (1973), Nonparametric Statistical Inference, New York: John Wiley & Sons, 185-194 (Kendall and Spearman tests)
[17] Kay R. H. (2008), Exploring the relationship between emotions and the acquisition of computer knowledge, Computers & Education, 50, 1269-1283
[18] Koenker R., and Zeileis A. (2007), Reproducible Econometric Research (A Critical Review of the State of the Art), Research Report Series, Department of Statistics and Mathematics, Wirtschaftsuniversität Wien
[19] Leisch F.
(2003), Sweave and beyond: Computations on text documents, Proceedings of the 3rd International Workshop on Distributed Statistical Computing

[20] Lewis J. R. (1993), IBM Computer Usability Satisfaction Questionnaires: Psychometric Evaluation and Instructions for Use, IBM Corporation, Technical Report 54.786

[21] Lucy D., Aykroyd R. G., and Pollard A. M. (2002), Non-parametric calibration for age estimation, Applied Statistics, 51(2), 183-196

[22] Meelissen M. R. M., Drent M. (2008), Gender differences in computer attitudes: Does the school matter?, Computers in Human Behavior, 24, 969-985

[23] Miller J. B. (n.d.), Examining the interplay between constructivism and different learning styles, Retrieved October 20, 2005, from http://www.stat.auckland.ac.nz/~iase/publications/1/8a4_mill.pdf

[24] Mvududu N. (2003), A Cross-Cultural Study of the Connection Between Students' Attitudes Toward Statistics and the Use of Constructivist Strategies in the Course, Journal of Statistics Education, 11(3)

[25] O'Dwyer L. M., Russell M., Bebell D., Seeley K. (2008), Examining the Relationship between Students' Mathematics Test Scores and Computer Use at Home and at School, Journal of Technology, Learning, and Assessment, 6(5)

[26] Peng R. D., Dominici F., and Zeger S. L. (2006), Reproducible Epidemiologic Research, American Journal of Epidemiology, 163(9), 783-789

[27] Poelmans S., Wessa P., Milis K., Bloemen E., and Doom C. (2008), Usability and Acceptance of E-Learning in Statistics Education, based on the Compendium Platform, Proceedings of the International Conference of Education, Research and Innovation, International Association of Technology, Education and Development

[28] Romero C., Ventura S., Garcia E. (2008), Data mining in course management systems: Moodle case study and tutorial, Computers & Education, 51, 368-384

[29] Schwab M., Karrenbach N., and Claerbout J.
(2000), Making scientific computations reproducible, Computing in Science & Engineering, 2(6), 61-67

[30] Smith E. (1999), Social Constructivism, Individual Constructivism and the Role of Computers in Mathematics Education, Journal of Mathematical Behavior, 17(4)

[31] Statistical Computations at FreeStatistics.org (2008a), Office for Research Development and Education, Retrieved Mon, 30 Jun 2008, URL http://www.freestatistics.org/blog/date/2008/Jun/30/t1214840420q0fyankop4x9ebf.htm

[32] Statistical Computations at FreeStatistics.org (2008b), Office for Research Development and Education, Retrieved Mon, 30 Jun 2008, URL http://www.freestatistics.org/blog/date/2008/Jun/30/t12148409608o0dnj2k4s04jil.htm

[33] Statistical Computations at FreeStatistics.org (2008c), Office for Research Development and Education, Retrieved Mon, 30 Jun 2008, URL http://www.freestatistics.org/blog/date/2008/Jun/30/t1214841152sn6jlyhgseclgqm.htm

[34] Sun P., Tsai R. J., Finger G., Chen Y., Yeh D. (2008), What drives a successful e-Learning? An empirical investigation of the critical factors influencing learner satisfaction, Computers & Education, 50, 1183-1202

[35] Von Glasersfeld E. (1987), Learning as a Constructive Activity, Problems of Representation in the Teaching and Learning of Mathematics, Hillsdale, NJ: Lawrence Erlbaum Associates, 3-17

[36] Wessa P. (2008a), A framework for statistical software development, maintenance, and publishing within an open-access business model, Computational Statistics, www.springerlink.com (DOI 10.1007/s00180-008-0107-y)

[37] Wessa P. (2008b), Learning Statistics based on the Compendium and Reproducible Computing, Proceedings of the International Conference on Education and Information Technology, Berkeley, San Francisco, USA

[38] Wessa P. (2008c), How Reproducible Research Leads to Non-Rote Learning Within a Socially Constructivist E-Learning Environment, Proceedings of the 7th European Conference on e-Learning, Cyprus

[39] Wessa P.
(2008d), Measurement and Control of Statistics Learning Processes based on Constructivist Feedback and Reproducible Computing, Proceedings of the 3rd International Conference on Virtual Learning, Constanta, Romania

[40] Wessa P. (2009a), Discovering Computer-Assisted Learning Processes based on Objective Exam Score Transformations, Proceedings of the World Congress on Educational Sciences, Cyprus