Frontline Learning Research 3 (2014) 78-82
ISSN 2295-3159

Corresponding authors: Tomi Silander, Xerox Research Centre Europe, www.xrce.xerox.com, tomi.silander@xrce.xerox.com; Petri Nokelainen, University of Tampere, www.uta.fi/edu, petri.nokelainen@uta.fi

http://dx.doi.org/10.14786/flr.v2i1.107

Using New Models to Analyze Complex Regularities of the World: Commentary on Musso et al. (2013)

Petri Nokelainen^a, Tomi Silander^b

^a University of Tampere, Finland
^b Xerox Research Centre Europe, France

Abstract

This commentary on the recent article by Musso et al. (2013) discusses issues related to model fitting, the comparison of the classification accuracy of generative and discriminative models, and the two (or more) cultures of data modeling. We start by questioning the extremely high classification accuracy reported for empirical data from a complex domain: there is a risk of modeling perfect nonsense perfectly. Our second concern relates to the relevance of comparing the classification accuracy indices of multilayer perceptron neural networks and linear discriminant analysis. We find this problematic, as it is like comparing apples and oranges. It would have been easier to interpret the model and the variable (group) importances if the authors had compared the MLP to some discriminative classifier, such as group lasso logistic regression. Finally, we conclude our commentary with a discussion of the predictive properties of the adopted data modeling approach.

Keywords: Artificial Neural Networks; Commentary; Model-fit; Generative and Discriminative Models; Algorithmic Data Modeling

1. Introduction 

Statistical methods are constantly being developed, not only within statistics but also in other disciplines such as physics, economics, bioinformatics, linguistics, and computer science. We are therefore very sympathetic to attempts to promote new methods for analyzing educational data. However, in this research field too, there has for years been an emphasis on predictive modeling, for example, to learn structures from data (Nokelainen, Silander, Ruohotie, & Tirri, 2007; Tirri, Nokelainen, & Komulainen, 2013) and to predict class membership (Nokelainen & Ruohotie, 2009; Nokelainen, Tirri, Campbell, & Walberg, 2007; Villaverde, Godoy, & Amandi, 2006). The recent boom of data analytics has further increased the efforts on this front. One methodological rationale behind this development is that predictiveness guards against over-fitting and serves as a natural criterion for the quality of a model. The classical statistical literature did not emphasize this aspect, since models were kept relatively simple to avoid over-fitting and to keep the calculations reasonable. In addition, much of the theory concerned asymptotic behavior, in which case over-fitting is usually not an issue.

Increased computing power now allows more complicated models, such as Bayesian, fuzzy, and neural networks, to be used. While the increased flexibility brings benefits, it also opens possibilities for new kinds of errors in the analysis. Since we share the enthusiasm for promoting new methods, we also feel that it is of the utmost importance to perform analyses with these new methods to extremely high methodological standards. In this respect, we find some of the procedures followed in the recent article by Musso, Kyndt, Cascallar and Dochy (2013) problematic. Before discussing these issues in detail, we wish to state that we agree with Schneider and Edelsbrunner’s (2013) previous commentary on this article, in which they note that there are other data analysis techniques with properties similar to those of ANNs, but without the drawbacks.

2.  Fitting to the test data 

Our first concern is the reported 100% classification accuracy in such a complex domain, and the lack of thorough discussion of this result. Multilayer perceptron (MLP) neural networks are universal function approximators (Lek & Guegan, 1999). With enough twisting of the parameters, one can use them to implement any classification rule (Schittenkopf, Deco, & Brauer, 1997). Consequently, the networks could in theory also be designed to explain a version of the dataset in which the GPA scores were randomly assigned to the students. What knowledge does such a model (one that can explain anything) extract from the real world?

One persuasive answer does indeed lie in prediction. Only the regularities help one to generalize beyond the training sample, that is, to predict. But here one needs to be very careful. To do this right, the data must first be split into two parts and the model must be built using only the first part. The testing should be done with the second part of the data – the part that was not used in the model-building process at all. The big question is: can we trust ourselves to refrain from “cheating” (using the test data)? To avoid this, it would be best to gather the test data only after building the model, or to separate it from the training data at the very beginning and give it to somebody else who will then, after the model has been built, test the accuracy of the model – once and for all!
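To make this discipline concrete, here is a minimal sketch (our illustration in Python with scikit-learn and simulated stand-in data, not the procedure or software used in the commented article): the test set is separated first and is consulted exactly once, after all model selection has been completed.

```python
# A minimal sketch of the holdout discipline described above, using
# simulated stand-in data (18 predictors, a binary outcome class).
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=800, n_features=18, random_state=0)

# The held-out test set is split off before any modeling begins.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# All tuning (architecture, learning parameters) must use the training
# part only, e.g. via an internal validation split or cross-validation.
model = MLPClassifier(hidden_layer_sizes=(15,), max_iter=2000,
                      random_state=0).fit(X_train, y_train)

# The test set is consulted exactly once, after the model is fixed.
print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```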

The paper by Musso and her colleagues (2013) practically acknowledges that such a discipline was not rigorously followed. The network structure and the learning parameters were adjusted to maximize the accuracy on the test data. Many models were tried in order to achieve this. Even the division of the data into training and test samples was manipulated in order to “… maximize the training sample while preserving the appearance of all detected patterns in the testing sample …” (Musso et al., 2013, p. 60).

Now, one cannot totally exclude the possibility that the authors actually promote this methodology as a sound one: take a maximally flexible model family, find the most parsimonious model that fits the data 100%, and then analyze the model. But if that were the case, why torture oneself with the tedious manual work of finding a 100% fit (yes, fit, not generalization) to the test data? It would be easier to just fit to the whole data set – but that would break the illusion of prediction.

3.  Comparison with linear discriminant analysis

We find that the authors’ decision to compare their model to other models sets a very good example that should be followed more often in educational research. Such comparisons are widely used in machine learning (e.g., Demšar, 2006). However, comparing the multilayer perceptron and the discriminant analysis raises some questions. Behind linear discriminant analysis is a linear discriminant model that defines a joint probability distribution for the whole 19-variate (18 independent variables + GPA class) data vector. Such joint probability distributions can be used for classification, since the conditional probability P(GPA-class | predictors) is proportional to the joint distribution P(GPA-class & predictors). Classifiers of this kind are usually called generative classifiers (e.g., Xue & Titterington, 2008), since they are based on models that can be used to sample whole (19-variate) data vectors.

MLPs are not generative classifiers but so-called discriminative classifiers. They are built to directly estimate the conditional distribution P(GPA-class | predictors) without modeling the relationships among the predictors. (Reading the paper sometimes gives the impression that the authors claim otherwise.) While the linear discriminant model DA1 used in the Musso et al. (2013) paper has about 2 × 18 + 18 × 18 = 360 parameters, the neural network model has 18 × 15 × 2 = 540 parameters. The difference in the number of parameters is not huge, but all the parameters of the MLP are used for modeling the conditional distribution, whereas the parameters of the linear discriminant model must also take care of modeling the relationships between the variables.
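The distinction can be made concrete with a small sketch of our own (again scikit-learn and simulated data, not the article’s dataset): linear discriminant analysis fits a generative model of the joint distribution of predictors and class, whereas logistic regression, like an MLP, models only the conditional distribution of the class given the predictors.

```python
# A minimal sketch of the generative vs. discriminative distinction,
# using simulated stand-in data rather than the Musso et al. dataset.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=18, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Generative: models the joint distribution P(class, predictors), from
# which P(class | predictors) is derived for prediction.
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

# Discriminative: models P(class | predictors) directly, as an MLP does.
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("LDA test accuracy:    ", lda.score(X_te, y_te))
print("LogReg test accuracy: ", logreg.score(X_te, y_te))
```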

Since the linear discriminant is also a predictive classifier, one cannot but wonder why the confusion matrices for the linear discriminants were not reported; those numbers surely would have fitted into the same space without any problem. It is also plausible that any differences found are simply due to one classifier being generative and the other discriminative. It would have been much more meaningful to compare the MLP to some discriminative classifier such as logistic regression or, better yet, some sparse version of it such as the group lasso with interaction terms (Meier, van de Geer, & Bühlmann, 2008), which would make interpreting the model and the variable (group) importances much easier. Furthermore, the Musso et al. (2013) paper is very unclear about how the variable importances were calculated: attempts to follow the references lead only to statements like “this has been implemented in software X” or to an unpublished technical report by one of the authors.
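For illustration, the following sketch shows the kind of sparse discriminative baseline we have in mind. Plain L1-penalized logistic regression from scikit-learn is used here as a stand-in; an actual group lasso with interaction terms (Meier et al., 2008) would require a dedicated solver or package, but the idea – letting the penalty decide which predictors (or predictor groups) the classifier relies on – is the same.

```python
# A minimal sketch of a sparse discriminative baseline. L1-penalized
# logistic regression is a stand-in here; a group lasso that keeps or
# drops whole variable groups (Meier et al., 2008) would need a
# dedicated implementation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=800, n_features=18, n_informative=6,
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

scaler = StandardScaler().fit(X_tr)
sparse_lr = LogisticRegression(penalty="l1", solver="saga", C=0.1,
                               max_iter=5000).fit(scaler.transform(X_tr), y_tr)

# The zero/non-zero pattern of the coefficients shows which predictors
# the classifier actually relies on, which aids interpretation.
print("Test accuracy:", sparse_lr.score(scaler.transform(X_te), y_te))
print("Non-zero coefficients:", int((sparse_lr.coef_ != 0).sum()))
```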

4.  Two cultures 

According to Breiman (2001b), there are two statistical modeling cultures. The Data Modeling Culture assumes that the data are generated by a given stochastic data model (such as linear or logistic regression). The Algorithmic Modeling Culture treats the data mechanism as unknown and uses, for example, decision trees and neural networks. Although the first of these two cultures, the one focusing on data models, still dominates, many fields outside statistics are rapidly adopting a wide variety of tools.

Neural networks are often considered black-box models that do not offer a good explanation and understanding of the domain (Correa, Bielza, & Pamies-Teixeira, 2009). Consequently, such models are sometimes hastily deemed unsuitable for much of science. We would like to take this opportunity to say a word for such black-box models, along the lines expressed by the statistician Leo Breiman (1928-2005). The world may not be a simple place. While among the simple theories there are those that most closely approximate the complex reality, it is a priori possible that none of those simple theories, even the best of them, approximates the situation well. If a model does not predict well, one can argue that it has not captured the regularities of the world, so what insight would understanding and interpreting such a model offer us? (Breiman, 2001b.) Most of the statistical community would agree that interpretation makes sense only if the model is reasonably good (and we mean generalization, not just fit to the sample).

Schneider and Edelsbrunner (2013) indicate in their commentary on the article that, whenever possible, more theory-driven data modeling techniques should be preferred. However, if we limit ourselves to models that can be easily interpreted, we may end up discarding models that truly capture important regularities of the domain.

There are, then, two different strategies for extracting true knowledge about the world. The first is the classical one, in which we try to find a well-predicting model among the easily interpretable ones. This path should always be attempted. Unfortunately, we suspect that it was not seriously pursued in the article by Musso et al. (2013). The second is to build a well-predicting model, even if it is not that easy to interpret, and then put more effort into squeezing the knowledge out of the model. One could argue that this is what happens when you ask a doctor why she made the diagnosis she did: the answer will (only) be some approximation of the real reason. Still, doctors are considered useful.



 

 

We have an educated guess that such a procedure lies behind the independent variable importance measures featured in the article. Naturally, such procedures should be carefully documented in order to understand what kind of information we have managed to extract from the model. The article leaves the impression that artificial neural networks are somehow especially good for inferring how different complex patterns of variables affect the outcome. However, the presented results list only the univariate importance of variables. How could that possibly tell us anything relevant about complex patterns?
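For comparison, the sketch below shows one importance procedure that is at least fully documented and reproducible: permutation importance applied to a hypothetical fitted MLP. This is our illustrative choice, not the procedure used in the article, and it too yields only per-variable scores rather than complex patterns.

```python
# A minimal sketch of a documented importance procedure: permutation
# importance on simulated stand-in data (not the article's method).
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=800, n_features=18, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4)

mlp = MLPClassifier(hidden_layer_sizes=(15,), max_iter=2000,
                    random_state=4).fit(X_tr, y_tr)

# Importance of a predictor = drop in held-out accuracy when that
# predictor's values are randomly shuffled.
result = permutation_importance(mlp, X_te, y_te, n_repeats=20, random_state=4)
print(result.importances_mean)
```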

Neural networks are by no means the only black-box models that can be successful in prediction. Many methods based on or motivated by ensemble learning, for instance random decision forests (Breiman, 2001a) and Bayesian additive regression trees (Chipman, George, & McCulloch, 2010), are among such models. Ensemble methods have reached very high classification accuracies by using several decision trees (or growing a forest of them) on the same data instead of a single-tree predictor.
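A minimal sketch of this idea follows (our illustration with scikit-learn and simulated data): a random forest grows hundreds of trees and aggregates their votes instead of relying on a single tree.

```python
# A minimal sketch of an ensemble classifier: a random forest
# (Breiman, 2001a) aggregates the votes of many decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=18, random_state=2)

forest = RandomForestClassifier(n_estimators=500, random_state=2)

# Cross-validated accuracy of the forest of 500 trees.
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```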

Ever-increasing data sizes (e.g., Massive Open Online Courses, MOOCs, may have 100 000 students, with all their data gathered automatically in digital form) and increasing computing power may well shift the focus from small, simple, and understandable models to big, complex black-box models. But hopefully some of that computing power can also be used to extract understandable (even if not always very close to the truth) approximations of the true complex regularities of the world.

Key points  

• Artificial neural networks (ANNs) certainly provide interesting modeling possibilities for educational scientists, but they also pose certain challenges for the design of a study and the interpretation of its results.

• The article sets a very good example by comparing the results of an ANN to a conventional data modeling approach, but the comparison should have been made between two discriminative classifiers.

• Ensemble methods provide a modern and powerful alternative to neural networks, as they use the predictions of several models built during the learning process instead of a single model.

References 

Breiman, L. (2001a). Random forests. Machine Learning, 45, 5–32. doi:10.1023/a:1010933404324

Breiman, L. (2001b). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–231. doi:10.1214/ss/1009213726

Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266–298. doi:10.1214/09-aoas285

Correa, M., Bielza, C., & Pamies-Teixeira, J. (2009). Comparison of Bayesian networks and artificial neural networks for quality detection in a machining process. Expert Systems with Applications, 36, 7270–7279. doi:10.1016/j.eswa.2008.09.024

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.

Lek, S., & Guegan, J. F. (1999). Artificial neural networks as a tool in ecological modelling, an introduction. Ecological Modelling, 120, 65–73. doi:10.1016/s0304-3800(99)00092-7

Meier, L., van de Geer, S., & Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B, 70(1), 53–71. doi:10.1111/j.1467-9868.2007.00627.x

Musso, M. F., Kyndt, E., Cascallar, E. C., & Dochy, F. (2013). Predicting general academic performance and identifying differential contribution of participating variables using artificial neural networks. Frontline Learning Research, 1(1), 42–71. doi:10.14786/flr.v1i1.13

Nokelainen, P., & Ruohotie, P. (2009). Non-linear modeling of growth prerequisites in a Finnish polytechnic institution of higher education. Journal of Workplace Learning, 21(1), 36–57. doi:10.1108/13665620910924907

Nokelainen, P., Silander, T., Ruohotie, P., & Tirri, H. (2007). Investigating the number of non-linear and multi-modal relationships between observed variables measuring a growth-oriented atmosphere. Quality & Quantity, 41(6), 869–890. doi:10.1007/s11135-006-9030-x

Nokelainen, P., Tirri, K., Campbell, J. R., & Walberg, H. (2007). Factors that contribute or hinder academic productivity: Comparing two groups of most and least successful Olympians. Educational Research and Evaluation, 13(6), 483–500. doi:10.1080/13803610701785931

Schittenkopf, C., Deco, G., & Brauer, W. (1997). Two strategies to avoid overfitting in feedforward networks. Neural Networks, 10(3), 505–516. doi:10.1016/s0893-6080(96)00086-x

Schneider, M., & Edelsbrunner, P. (2013). Modelling for prediction vs. modelling for understanding: Commentary on Musso et al. (2013). Frontline Learning Research, 1(2), 99–101. doi:10.14786/flr.v1i2.74

Tirri, K., Nokelainen, P., & Komulainen, E. (2013). Multiple intelligences: Can they be measured? Psychological Test and Assessment Modeling, 55(4), 438–461.

Villaverde, J. E., Godoy, D., & Amandi, A. (2006). Learning styles’ recognition in e-learning environments with feed-forward neural networks. Journal of Computer Assisted Learning, 22, 197–206. doi:10.1111/j.1365-2729.2006.00169.x

Xue, J.-H., & Titterington, D. M. (2008). Comment on “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes”. Neural Processing Letters, 28(3), 169–187. doi:10.1007/s11063-008-9088-7