INVESTIGATING POSSIBILITIES OF PREDICTIVE MATHEMATICAL MODELS TO IDENTIFY AT-RISK STUDENTS IN THE SOUTH AFRICAN HIGHER EDUCATION CONTEXT

Vaughan van Appel, Department of Statistics, University of Johannesburg, South Africa (vvanappel@uj.ac.za)
Rina Durandt*, Department of Mathematics and Applied Mathematics, University of Johannesburg, South Africa (rdurandt@uj.ac.za)

ABSTRACT

This article reports on the results of an investigation of the predictive accuracy of five different mathematical models to identify at-risk students in a Business Statistics course. Low levels of student success, especially in mathematics-related subjects such as statistics, are a salient problem in South Africa and other countries. Statistical knowledge is included in a variety of programmes offered by many faculties at tertiary level, and early prediction of at-risk students seems necessary to enhance academic success, especially when dealing with large class groups. In this study, we used 395 Business Statistics students' grades from an academic semester at an urban university in South Africa to build a predictive model to identify at-risk students. Grounded on Meyer's model evaluation criteria and striving for a balance between accuracy and simplicity, two out of five models are identified, by means of a cross-validation test, as viable predictive models for identifying at-risk students. The article shows the possibilities and limits in deriving information from a number of covariates. These results are interesting and have implications for educational practice in statistics courses.

Keywords: At-risk Students; Large Classes; Predictive Modelling; Mathematical Model; Statistics Courses; Tertiary Education.

1. INTRODUCTION

The transition from school to university seems exciting for many first-year students, but a daunting fact is the high drop-out rates, especially in mathematics-related courses. This phenomenon is locally and globally a matter of concern, and universities are making efforts to investigate the reasons and eliminate the occurrence of this phenomenon (compare Greefrath, Koepf & Neugebauer, 2017; Van Zyl, Gravett & De Bruin, 2012). Research results published by The Centre for Development and Enterprise in South Africa (SA) confirmed a significant underperformance in education at school level, particularly in mathematics teaching and learning (Bernstein, 2013). Rach and Heinze (2017) reported that the transition from school to university, particularly in mathematics, is often a substantial hurdle for students. Results from their research, conducted in Germany on 182 first-semester university students majoring in mathematics, indicated only a marginal influence of school-related mathematical resources on first-semester study success.
They explain that mathematics-related courses at university are very different from school mathematics and that the different learning cultures at the two institutions could be a reason for the transition problems between school and university. At tertiary level, a clear distinction exists between statistics education and mathematics education, although both subjects fall under the umbrella of the mathematical sciences. In the current SA context, statistics forms part of the mathematics school curriculum. Bernstein's report (2013) underlined three key factors, among others, that are noteworthy for the teaching and learning of mathematics at school level (which includes the teaching and learning of statistics at school level), namely (i) poor mathematics teachers' competencies (related to content and pedagogy); (ii) poor mathematics students' competencies; and (iii) a large gap in mathematics competencies between school students from the lowest income areas (approximately 66% of the population) and those from the richest areas. To appreciate the scale of mathematics schooling deficiencies in SA and the challenges that lie ahead, Bernstein's report (2013) puts learners' and teachers' competencies into an international context and further highlights significant disparity within SA, between quintile 5 schools from the richest areas and quintile 1, 2 and 3 schools reflecting the average rural area, small towns and most townships. (In South Africa, the term 'quintile' is used as the national poverty ranking of public schools and their learners. All public ordinary schools and their learners are placed into one of five groups, from the poorest in quintile 1 to the least poor in quintile 5.) For example, the competency of SA Grade 6 mathematics teachers is placed at the bottom end of the spectrum compared to a selection of other Eastern and Southern African countries. Mathematics is a key requirement not only for entry into higher education, but also for most modern, knowledge-intensive work (Bernstein, 2013). The report underlines that pass rates at universities are low, with an eventual graduation rate of roughly half of the students who start a bachelor's degree at contact education universities. One concludes from these results that a large number of SA students who enter formal tertiary education might be underprepared, in terms of fundamental mathematical content, for mathematics-related courses such as statistics. Successful transition between school and university depends on factors such as knowledge related to scientific mathematics and students' abilities to develop adequate learning strategies (Rach & Heinze, 2017). It seems logical that the latter factors are also important for learning statistics at tertiary level. Garfield and Ben-Zvi (2007: 380-381) argue that learning statistics involves the integration of three components: "statistical literacy", "statistical reasoning" and "statistical thinking". The first component, statistical literacy, is often the expected outcome of introductory courses in statistics and is generally described as an understanding of the basic terms, symbols and tools of statistics, and the recognition and interpretation of different representations of data. Garfield and Ben-Zvi (2007) emphasised that statistics students' understanding of the basic concepts of statistics can easily be underestimated or overestimated by the educator. The second component, statistical reasoning, refers to the way people reason with statistical ideas and make sense of statistical information. The third component, statistical thinking, involves a higher order of thinking than statistical reasoning; it is closer to the way professionals think.
Ideally, all three components should be integrated to increase proficiency and to enable educators to assess students' understanding of statistics. In this inquiry, students were expected to master the statistical literacy component and show some evidence of statistical reasoning, but they were not expected to act as professionals. We argue that every student can learn statistical literacy and develop basic reasoning skills to ultimately reach academic success; the focus should be on meeting students' needs, not on their incapacity. Almost three decades ago, an at-risk student was defined as "one who is in danger of failing to complete his or her education with an adequate level of skills" (Slavin & Madden, 1989: 4). Educators often view at-risk students as the ones who are more likely to fail than pass the course, and the term may be applied to students who face particular circumstances (e.g. low test scores or low class attendance) that could jeopardise their ability to be academically successful. A broad overview of the literature revealed numerous studies on students' lack of academic success, or delay, during formal tertiary education, particularly in numeracy-related fields, with a number of contributing factors (Cassidy, 2015; Greefrath et al., 2017; Onwuegbuzie, 2004; Rach & Heinze, 2017). Some of these contributing factors include a lack of self-efficacy, statistical anxiety, negative attitudes towards statistics courses, and students finding statistics content difficult (Coetzee & Van der Merwe, 2010; Onwuegbuzie, 2004; Talsma, Schüz, Schwarzer & Norris, 2018; Van Appel & Durandt, 2018). In addition, science and engineering courses generally obtain lower pass rates than many other courses at tertiary level, which makes it important to take part in the global discussion regarding academic success, the efforts to identify at-risk students as early as possible, and the variables or factors relevant to supporting these students on their academic path. Within the SA context, but also globally, educators (lecturers, faculties and universities) are under increasing strain to raise pass rates and at the same time present students with a quality course. It therefore seems essential to identify the needs of students as early as possible to improve instructional practice. However, the focus of this inquiry is merely to identify at-risk students in a statistics course at a public university in SA, and not to improve instructional practices as such, although the latter might be seen as a possible outcome of this investigation. The two research questions are: (1) What is a suitable predictive mathematical model for identifying at-risk students in a Business Statistics course at tertiary level? and (2) How effective is such a model in predicting students' academic success in this course? In answering these research questions, we attempt to broaden our knowledge about identifying at-risk students in a statistics course as early as possible in the academic semester, and to identify and improve a suitable mathematical model that can provide trustworthy results in an educational context.
These results have implications for educational practice; they could contribute towards promptly detecting the needs of at-risk students, ultimately resulting in enriched instructional practices and improved throughput rates in statistics.

2. PRIMARY THEORETICAL PERSPECTIVES

2.1 Theoretical guidelines for learning statistics

Learning statistics is grounded in the learning of mathematics, although it has developed as a research area in its own right with a growing network of researchers studying the development of students' statistical literacy, reasoning and thinking. According to Garfield and Ben-Zvi (2007), research studies focus not only on statistics instruction, but also on the development of conceptual understanding. Kilpatrick, Swafford and Findell (2001) defined five different strands of mathematical knowledge, which in combination indicate mathematical proficiency. These strands (supported by Samuelsson, 2010: 62) seem to connect particularly well with learning statistical knowledge (literacy, reasoning and thinking) and are therefore relevant to this study. They include:

I. Conceptual understanding – the ability to grasp mathematical and/or statistical concepts, operations and relationships.
II. Procedural fluency – the skill of performing flexible procedures accurately, efficiently and appropriately to support mathematics and/or statistics learning.
III. Strategic competence – the ability to formulate, represent and solve mathematical and/or statistical problems, contributing to the development of various mathematical and/or statistical competencies and appropriate attitudes.
IV. Adaptive reasoning – the capacity for logical thought, reflection, explanation and justification, contributing to an adequate picture of mathematics and/or statistics.
V. Productive disposition – the ability to view mathematics and/or statistics as sensible, useful and worthwhile, together with confidence in one's own ability.

Both Boaler (2000) and Samuelsson (2010) argued that situational context is a key aspect in producing mathematical knowledge, and it seems important for learning statistical literacy, which forms the foundation for reasoning and thinking. Garfield and Ben-Zvi (2007) claim statistical literacy is often seen as an expected outcome of schooling and a necessary component of adults' numeracy and literacy. In summary, we argue that the five strands of mathematical knowledge of Kilpatrick et al. (2001) provide the knowledge base for students to learn statistics. Apart from the knowledge base, a notion of supporting disposition is also present, with the belief that every student can attain the necessary skills for academic success when formal education meets students' needs rather than their incapacity.

2.2 Mathematical models and their relevance for teaching

Doerr, Ärlebäck and Misfeldt (2017: 71) underscored the substantial impact of mathematical models at all levels of society by claiming that "mathematical models are used to control processes, to design products, to monitor and influence economic systems, to enhance human agency, and to structure and understand the natural world in society and above all in the workplace". Earlier, Lesh and Doerr (2003) described models as conceptual systems that are used for some specific purpose. It seems reasonable to consider a mathematical model to monitor study success in an educational context, but if such a model is used to improve decision-making (e.g.
regarding study success), then the quality of the model is also important. In this study, we used a mathematical model in a teaching context to identify students who are at risk of failing a Business Statistics course. The idea was to implement the model as early as possible in the second academic semester of the academic year. Thus, the emphasis was more on the product and its efficiency, as it would play a role in decision-making in the education context, than on all the steps taken during a modelling cycle. Meyer (2012: 150-222) proposed six criteria for mathematical models that should be considered during the evaluation of a model: (i) accuracy, (ii) realism, (iii) precision, (iv) robustness, (v) generalisability, and (vi) fruitfulness. We followed the traditional steps in a modelling cycle: to identify the problem in the real world (e.g. low throughput rates in mathematics-related courses); to make assumptions and identify variables by selecting relevant information and finding relations (e.g. using continuous and formal assessment marks); to formulate a mathematical model and perform procedures to find results (e.g. multiple regression, or linear regression, or decision trees); to analyse and assess the solution by questioning the results and consequences (e.g. by checking the model's accuracy); and to iterate the modelling process to refine and extend the model (e.g. to use different data sets) (compare Blum & Leiß, 2007; COMAP-SIAM, 2016). Throughout this modelling process we purposefully considered the evaluation criteria of Meyer (2012): a model is (i) accurate if the output values are correct or near correct; (ii) realistic if it is based on correct assumptions; (iii) precise if its predictions are definite numbers, and imprecise if its predictions are ranges of numbers; (iv) robust if it is to some extent protected against errors in the input data; (v) general if it applies to a variety of educational contexts; and (vi) fruitful if it results in useful conclusions. In this study we used a predictive model for a first-year Business Statistics course, but such a model can easily be adapted and used in other courses, given that enough reliable data are available to 'train' the model. By predictive modelling we intend the use of mathematical and computational methods to predict students' probability of academic success based on changes in the model input values. A genuine evaluation of the accuracy of the model used in this study would require observations over a number of years, or perhaps observations in a variety of courses. A limitation in building accurate predictive models is that these models are largely dependent on the quality of the data available. Thus, in this study, the quality and reliability of the data (which were collected in only one academic semester) influence the accuracy (or predictive power) of the model. We attempted to increase the reliability and consistency of the data through a well-structured course plan with multiple assessment opportunities. Data from future research could refine the mathematical models and their implications in other educational contexts, for example to investigate the 'optimal' time to implement an at-risk model in an academic semester.
3. RESEARCH DESIGN

Model building seems important in identifying at-risk students, and in this study the accuracy of several models was studied to determine the most suitable predictive model for identifying at-risk students in a Business Statistics course offered at the University of Johannesburg during the second semester of 2017. Data were collected from 395 first-year (undergraduate) students. In particular, we used these data to build five predictive models to predict the possible outcome of a future student's success in the course. The data used in the building of the predictive models are often referred to as the training data. The aim of training a predictive model is for the model to learn to identify patterns or trends from the training data, which are then used in predicting the success of future students in the course. The ever-growing need to increase throughput rates and maintain high course standards makes predictive modelling especially useful. For example, in large classes, a predictive model can be used to identify students who are likely to fail the course, providing the educator with information about students who require much-needed assistance. Without a predictive model, this identification process would be tedious in a large class and would only be possible later in the academic semester. For the training data, we used the 2017 gradebook, which consisted of each student's online cumulative average quiz mark and two formal semester test marks, which accumulated to a final period mark (FPM). Thereafter, the students wrote an examination, the mark for which (EM) was weighted in equal parts with the FPM to compute the final mark (FM). To pass the course, a student must obtain an FPM and an EM greater than or equal to 40%, and an FM greater than or equal to 50% (a minimal sketch of this rule is given below). In addition to the gradebook, we also collected the grade students obtained in their prerequisite module, whether students were repeating the module (yes or no), the gender of the students, and each student's high school quintile ranking (between quintile 1 and 5). More specifically, in our model-building design we considered the school quintile ranking to be descriptive of a student's socio-economic status. Socio-economic status generally combines three measures, based on income, education and occupation. The school quintile ranking therefore represents a good predictor for the socio-economic variable in this study, as a high quintile ranking (levels 4 and 5) indicates a well-performing school (normally situated in or around major cities, where tuition fees are paid) and hence a higher socio-economic status. In contrast, low quintile rankings (levels 1, 2 and 3) reflect weaker-performing schools (normally situated in rural or township areas, where no tuition fees are paid) and hence a lower socio-economic status. In Table 1, we summarise the categorical covariates considered in this study. Most students are from a quintile 5 (well-performing) school, followed by students whose school quintile is unknown to the university. Students assigned to the unknown school quintile completed their schooling outside the borders of South Africa (international students), where schools may not follow the same rating scheme, or perhaps any rating scheme.
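Before turning to the categorical covariates in Table 1, the pass/fail rule described above can be made concrete. The following is a minimal Python sketch (our illustration, not course code; the function name is hypothetical), assuming all marks are expressed as percentages:

    # Minimal sketch of the pass/fail rule described above (illustrative only):
    # FPM and EM must each be at least 40%, and the final mark (FM), the equally
    # weighted average of FPM and EM, must be at least 50%.

    def final_outcome(fpm: float, em: float) -> str:
        """Classify a student from the final period mark (FPM) and the
        examination mark (EM), both given as percentages."""
        fm = 0.5 * fpm + 0.5 * em  # final mark: equal weighting of FPM and EM
        return "pass" if fpm >= 40 and em >= 40 and fm >= 50 else "fail"

    print(final_outcome(55, 48))  # pass: FM = 51.5% and both sub-minima are met
    print(final_outcome(62, 39))  # fail: EM is below the 40% sub-minimum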
Table 1. Descriptive statistics of the categorical covariates

    Factor                    Percentage
    Gender
      Male                    48%
      Female                  52%
    Repeating the module
      No                      87%
      Yes                     13%
    School quintile
      1                        4%
      2                        6%
      3                       10%
      4                        9%
      5                       39%
      Unknown                 32%

A good predictive model should be trustworthy and robust with as few covariates (also commonly referred to as independent or predictor variables) as possible. Bainbridge, Melitski, Zahradnik, Lauría, Jayaprakash and Baron (2015) studied demographic, educational and behavioural patterns in search of promising covariates to predict at-risk students in an online Masters of Public Administration programme. Among their promising covariates were the number of times a student logged into the online course portal and participation in an online forum for the course. In summary, they revealed that combining these covariates with more traditional covariates, such as the gradebook, class size and age, could enhance the predictive power of a model to identify at-risk students. Furthermore, special care must be taken when considering possible covariates for building a predictive model, as different courses might require different information to train the model. For example, some courses might require practical or laboratory work and others not, or have a face-to-face course component as opposed to an online course component. These unique differences require some strategic awareness from the educator in order to decide whether it is meaningful to incorporate these variables into the predictive model. Furthermore, it is good practice, in line with the traditional steps of a modelling cycle (compare COMAP-SIAM, 2016), to continuously update the training data and retrain the model to improve its robustness and predictive power.

4. STATISTICAL ANALYSIS AND RESULTS

4.1 Predictive model

A number of forces should be considered when building a predictive model, such as model simplicity, predictive power and the usability of the model in a course, and these factors often work against each other (Emmert-Streib & Dehmer, 2019). For example, implementing a predictive model early in an academic semester allows sufficient time to assist at-risk students; however, this comes at an accuracy cost, as less information is available to 'train' the model. Furthermore, increasing the predictive power of a model might come at a simplicity and usability cost, as more predictor variables may be required to improve the accuracy. In this study, we followed the approach of finding a suitable balance between these opposing forces. The course outline for the Business Statistics course is displayed in Table 2. After careful consideration of the course outline, we decided to implement the predictive model in week 7 of the semester. We reason that this strategy should allow a good balance between the forces: it gives educators enough time to assist the identified at-risk students without largely compromising accuracy, and it provides enough information (training data) to 'train' the model.
Table 2. Course outline of the Business Statistics course from week 1 to 14

    Week        Quiz    Semester test    Content
    1                                    Sampling & Sampling Distribution
    2 & 3       1                        Confidence Intervals
    4, 5 & 6    2                        Hypothesis Testing: One Population Mean & Proportion
    7                   1 (week 7)       Revision
    8           3                        Hypothesis Testing: Chi-Square Procedures
    9           4                        Hypothesis Testing: ANOVA
    10          5                        Multiple Regression
    11                                   Differentiation
    12                                   Maxima and Minima
    13                  2 (week 13)      Revision
    14                                   Linear Programming

The predictive power (trustworthiness) of the models used in this study was assessed by using the statistical calculations below (see Marbouti, Diefes-Dux & Madhavan, 2016):

    Accuracy      = (true positives + true negatives) / (total number of students)
    Accuracy-Pass = true negatives / (true negatives + false positives)
    Accuracy-Fail = true positives / (true positives + false negatives)
    Precision     = true positives / (true positives + false positives)
    Recall        = true positives / (true positives + false negatives)
    F1.5          = (1 + 1.5^2) x (Precision x Recall) / (1.5^2 x Precision + Recall)

Where:
• true positives denote the number of students who failed and were identified as at-risk,
• true negatives denote the number of students who passed the course and were not identified as at-risk,
• false negatives (also known as type II errors) denote the number of students who failed the course but were not identified as at-risk students,
• false positives (also known as type I errors) denote the number of students who passed the course but were identified as at-risk students, and
• F1.5 denotes the weighted harmonic mean of precision and recall. More specifically, it takes into account the accuracy for the students who passed and failed the course, weighting the accuracy for students who failed more heavily than that for students who passed (Van Rijsbergen, 1979).

The traditional technique often used to identify at-risk students in statistics courses in the higher education context in SA (and the current technique used in this Business Statistics course) is to identify the students who obtain a mark of less than 50% in their first formal assessment (semester test 1); these students are then categorised as at risk of failing the course. This traditional technique is referred to as the baseline model throughout this investigation, analysis and reporting of findings. In Section 4.2.1, we illustrate that this model is not trustworthy in predicting at-risk students. It therefore seems necessary to find a better-performing predictive model. Following this notion, we investigated four alternative predictive models in this study, namely:

• Multiple regression: Multiple linear regression extends simple linear regression by allowing for more than one independent variable.
• Logistic regression: Logistic regression is a widely used prediction method in statistics, particularly in predicting at-risk students (see e.g. Bainbridge et al., 2015; Marbouti et al., 2016). Logistic regression calculates the likelihood of observing a binary variable (e.g. 0 = fail and 1 = pass), using multiple covariates (independent variables).
• Decision trees: A decision tree is a tree-like model that predicts responses by following the decisions in the tree from the root node to a leaf node. In this study we considered two trees, described as (i) a classification tree (the response variable is nominal, i.e. pass or fail), and (ii) a regression tree (the response variable is numeric).

The advantage of the above models, compared to other predictive models found in the literature, is the ease with which they can be implemented in the most basic statistical software packages.
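As an illustration of these accuracy measures, the minimal Python sketch below (our own illustration; the function and variable names are hypothetical) computes the confusion counts and the measures defined above, assuming outcomes are coded 1 for fail (at-risk) and 0 for pass:

    # Minimal sketch (illustrative) of the accuracy measures defined above.
    # y_true and y_pred are sequences coded 1 = fail (at-risk), 0 = pass.
    def accuracy_measures(y_true, y_pred, beta=1.5):
        tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # failed, flagged
        tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # passed, not flagged
        fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # type I error
        fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # type II error
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_beta = 0.0
        if precision + recall:  # weighted harmonic mean of precision and recall
            f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        return {"accuracy": (tp + tn) / len(y_true),
                "accuracy_pass": tn / (tn + fp) if tn + fp else 0.0,
                "accuracy_fail": tp / (tp + fn) if tp + fn else 0.0,
                "F%.1f" % beta: f_beta}

    # Example: four students, with one at-risk student missed by the model
    # (one false negative): accuracy 0.75, accuracy_fail 0.5.
    print(accuracy_measures([1, 0, 1, 0], [1, 0, 0, 0]))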
Moreover, when building a predictive model, one should keep in mind the following good statistical practices: (i) a good predictive model should be as powerful as possible with as few covariates as possible, so that the model is not overtrained; overtraining occurs when the model maximises its performance on the training data by unknowingly modelling the residual noise as if it represented the underlying model structure, rather than learning to generalise from a trend; and (ii) the selected covariates for a predictive model should be reliable and easily accessible to the educator. In addition, when deciding on a prediction model, one should also take into consideration the type of data used to train the model. For example, categorical covariates may yield better results in a classification tree than in a regression-based model.

4.2 Statistical analysis and results

Before building our predictive models, we started by assessing the relationship between the covariates and the FM. In Table 3, we show Pearson's correlation coefficient, a measure of the strength of the relationship between students' FM and each numeric covariate. As in the study of Marbouti et al. (2016), a Pearson's correlation coefficient of at least 0.3 was regarded as qualifying a covariate for further analysis, a condition satisfied by all covariates in this study.

Table 3. Pearson correlation coefficients between the numeric covariates and the final mark

    Covariate                    Pearson correlation coefficient
    Semester test 1 mark         0.6197
    Average quiz mark            0.4339
    Mark for the prerequisite    0.3080

Table 4 shows the results of the chi-squared test for independence. This test determines whether the categorical covariates and the FM (pass/fail) are statistically related. The only categorical covariate statistically related to the FM, at a 5% level of significance, is whether the student is repeating the course or not.

Table 4. Chi-squared test for independence

    Covariate               p-value
    Repeating the module    0.0003
    Gender                  0.0751
    School quintile         0.7530

In addition, Figure 1 (a)-(d) displays the relationship of each student's covariates to their course outcome (i.e. pass or fail). More specifically, Figure 1a shows the relationship between the prerequisite course mark (on the x-axis) and the semester test 1 mark (on the y-axis), Figure 1b shows the relationship between the average quiz mark and the semester test 1 mark, Figure 1c shows the relationship between the prerequisite course mark and the average quiz mark, and Figure 1d shows the relationship between all three numeric covariates. To emphasise the challenging nature of building predictive models, it is meaningful to point out the irregular behaviour of some students. For example, in our dataset some students performed well in one or both of the covariates but failed the course. In contrast, some students passed the course but performed poorly in one or both of the covariates used. Similarly, Marbouti et al. (2016) highlighted that a student's behaviour is seldom the same throughout the semester. For example, in our study, many students did not write semester test two or stopped attending lectures due to financial constraints. Therefore, it seems unreasonable to expect a model to predict every student's final outcome in the course precisely; no model is faultless. Regardless of these difficulties, the traditional steps of the modelling cycle should be followed and continuous work should be carried out to improve the accuracy of the model over time.
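The covariate screening reported in Tables 3 and 4 can be reproduced in outline as follows. This is a minimal sketch assuming the gradebook is available as a flat file with hypothetical file and column names; pandas and SciPy are our choice of tools, not necessarily those used in the study:

    # Minimal sketch of the covariate screening behind Tables 3 and 4.
    # The file name and column names are hypothetical placeholders.
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("gradebook_2017.csv")

    # Table 3: Pearson correlation of each numeric covariate with the final
    # mark; covariates with r >= 0.3 are retained (Marbouti et al., 2016).
    for cov in ["semester_test_1", "avg_quiz_mark", "prerequisite_mark"]:
        r, _ = stats.pearsonr(df[cov], df["final_mark"])
        print(f"{cov}: r = {r:.4f}, retain = {r >= 0.3}")

    # Table 4: chi-squared test of independence between each categorical
    # covariate and the pass/fail outcome, at the 5% significance level.
    for cov in ["repeating", "gender", "school_quintile"]:
        contingency = pd.crosstab(df[cov], df["passed"])
        _, p_value, _, _ = stats.chi2_contingency(contingency)
        print(f"{cov}: p = {p_value:.4f}, related = {p_value < 0.05}")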
Figure 1 (panels a-d). The relationship between the covariates and the module outcome.

4.2.1 Model validation

In order to satisfy the six evaluation criteria mentioned above, we started by partitioning the sample data into two datasets: 50% for training and 50% for model validation. This technique, known as cross-validation, assesses how the models will generalise to an independent dataset. The goal of the model validation is to assess how accurately a predictive model will perform in practice, by using the accuracy formulas above, and to reduce the likelihood of overfitting the model to the training set. To reduce variability in our analysis of the models, we implemented 1000 rounds of the cross-validation, using random samples, and averaged the validation results to obtain an estimate of each model's predictive performance. In addition, we calculated the standard error (SE) of the accuracy estimates over the 1000 rounds. A good predictive model should yield high accuracy measures and a low number of false positives (type I errors) and false negatives (type II errors), with small SE measures. It is particularly important to have as few false negatives as possible, as this classification carries a high consequence (i.e. identifying a student as not at-risk when the student should have been identified as at-risk), whereas the false positive classification (i.e. identifying a student as at-risk when the student is not) carries less consequence. In keeping with the good statistical practices mentioned above, we found the best covariates, based on this dataset, to be: for the logistic regression model, the semester test 1 mark and the prerequisite course mark; for the multiple regression model, the average quiz mark, the semester test 1 mark and the prerequisite course mark; and for the decision trees, the average quiz mark, the semester test 1 mark, the prerequisite course mark, and whether the student is repeating the course or not. Table 5 summarises the accuracy measures for the five predictive models used in this study.

Table 5. Measures of accuracy in the predictive models

    Measure            Base      Logistic    Multiple    Regression  Classification
                       method    regression  regression  tree        tree
    F1.5               43%       66%         72%         62%         61%
      SE               4%        4%          3%          6%          6%
    Accuracy           70%       80%         79%         75%         75%
      SE               2%        2%          2%          3%          3%
    Accuracy-Pass      86%       88%         80%         81%         82%
      SE               2%        4%          3%          5%          5%
    Accuracy-Fail      39%       65%         76%         62%         60%
      SE               5%        6%          5%          8%          8%
    True negative*     57.7%     59.0%       53.6%       54.2%       54.7%
      SE               2.5%      2.4%        2.6%        3.3%        3.1%
    True positive*     12.9%     20.2%       25.1%       20.6%       19.9%
      SE               1.8%      2.0%        2.1%        3.0%        2.6%
    False negative*    20.2%     12.8%       8.0%        12.4%       13.1%
      SE               2.0%      2.7%        2.0%        3.2%        3.1%
    False positive*    9.3%      8.0%        13.3%       12.7%       12.3%
      SE               1.5%      2.3%        2.5%        3.1%        3.2%

    * Represents the percentage of the sample.
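Estimates such as those in Table 5 can be produced in outline by a loop of repeated random 50/50 splits. The sketch below (our illustration, shown for the logistic regression model only; scikit-learn is our choice of tool) assumes a covariate matrix X and an outcome vector y, coded 1 = fail, prepared as described above:

    # Minimal sketch of the repeated 50/50 cross-validation behind Table 5,
    # for the logistic regression model only (illustrative).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def repeated_cv(X, y, rounds=1000):
        """Average the overall accuracy over random 50/50 splits and report
        the spread of the estimates across rounds (reported as SE in Table 5)."""
        scores = []
        for seed in range(rounds):
            X_tr, X_va, y_tr, y_va = train_test_split(
                X, y, test_size=0.5, random_state=seed)
            model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            scores.append(model.score(X_va, y_va))  # accuracy on validation half
        scores = np.asarray(scores)
        return scores.mean(), scores.std(ddof=1)

The other measures in Table 5 follow by replacing the overall accuracy score with a fuller computation on the validation-half predictions, for example the accuracy_measures sketch given earlier.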
From Table 5, the base model, with an F1.5 score of 43% and an overall accuracy of 70%, is not a suitable model for predicting at-risk students. More importantly, it yielded the largest percentage of false negatives; recall the high consequence that the false negative classification carries. The base model is therefore untrustworthy, as no intervention programme for at-risk students based on it could be meaningful. The logistic regression and multiple regression models yielded far superior results, with F1.5 scores of 66% and 72%, respectively, and overall accuracies of 80% and 79%, respectively. In addition, both models yielded a satisfactory percentage of false negative and false positive outcomes, making these two models superior to the base model. The regression and classification tree models did not perform as well as the logistic and multiple regression models in this study (with F1.5 scores of 62% and 61%, respectively, and high SEs). However, decision trees may be more useful when using categorical variables that are related to the dependent variable. The expected percentage of the sample incorrectly predicted (false negatives + false positives) by the models is 29.5% for the base model, 20.8% for the logistic regression model, 21.3% for the multiple regression model, 25.1% for the regression tree, and 25.4% for the classification tree. Thus, the logistic regression and multiple regression models yielded the most trustworthy results in identifying at-risk students, making these our models of choice in this study. In addition, these two regression models satisfied the six evaluation criteria outlined in Section 2.2 (compare Meyer, 2012). In particular, these models (i) yielded accurate results; (ii) are realistic and viable, as the data required to build them are relatively easy to obtain and the models are easy to implement in many statistical packages; (iii) yielded precise predictions (i.e. pass or fail); (iv) are known to be robust, with extensive literature available on testing the robustness of these models (although this is beyond the scope of this inquiry); (v) have successfully been used to predict at-risk students in this Business Statistics course; and (vi) yielded useful results, which will allow educators to identify at-risk students more accurately. An area worthwhile of further investigation would be to determine how well these predictive models perform in other courses, particularly in other mathematically related courses. Such an investigation might point to a model that is robust enough to give accurate results across many courses. Furthermore, it would also be worthwhile to consider incorporating a class attendance covariate into these models, especially since we as educators mostly have the perception that academically successful students have good class attendance. It is, however, often challenging to record class attendance data, especially in large classes where not all students have smart devices for an electronic attendance recording system, particularly in a developing country such as South Africa. Another covariate to consider could be how active students are in the electronic learning environment and how often they visit information and support material on this platform. In addition, both the logistic regression and multiple regression models classified around 30% (true positives + false positives) of the students in the study as at risk of failing the course. These large numbers could place enormous strain on educators to provide sufficient support for at-risk students; after all, it would be a fruitless exercise to identify students as at-risk without providing sufficient academic support.

5. CONCLUSION

In this study, we developed five different predictive mathematical models to identify at-risk students as early as possible in the academic semester in a Business Statistics course at a public university in SA. Quantitative and qualitative data were collected from past
Business Statistics students, and a number of numerical and categorical covariates and their relationships were investigated to answer the two research questions: (1) What is a suitable predictive mathematical model for identifying at-risk students in a Business Statistics course at tertiary level? and (2) How effective is such a model in predicting students' academic success in this course? We followed traditional modelling steps to construct the different predictive models (baseline method, logistic regression, multiple regression, regression tree, classification tree) commonly found in the literature, and compared the accuracy of these models based on Meyer's (2012) criteria. A good selection technique for a predictive model should strike an ideal balance between accuracy and simplicity. In particular, the logistic regression and multiple regression models yielded the most trustworthy results under a cross-validation test. More specifically, these models yielded the highest accuracy with the smallest standard errors. An interpretation of the model results might inform educators early in the academic semester of potential at-risk students, giving educators the opportunity to intervene by providing rich academic support in the learning of statistics. Early detection of possible academic failure, with suitable treatment, can improve throughput rates in statistics courses without compromising academic standards. Furthermore, our effort to find a suitable predictive model is aligned with the notion of learning mathematical and/or statistical knowledge by highlighting the different strands of Kilpatrick et al. (2001): conceptual understanding, procedural fluency, strategic competence, adaptive reasoning and productive disposition. Apart from the knowledge base needed to ultimately develop statistical literacy, reasoning and thinking, and an interconnection between these statistical components (compare Garfield & Ben-Zvi, 2007), a notion of supporting disposition is also present, with the belief that every student can attain the necessary skills for academic success when formal education meets students' needs rather than their incapacity. Further research could follow into efficient intervention programmes for at-risk students and into combining models to build a 'new' hybrid model to predict at-risk students more accurately.

REFERENCES

Bainbridge, J., Melitski, J., Zahradnik, A., Lauría, E.J.M., Jayaprakash, S. & Baron, J. 2015. Using learning analytics to predict at-risk students in online graduate public affairs and administration education. Journal of Public Affairs Education, 21(2), 247-262. https://doi.org/10.1080/15236803.2015.12001831

Bernstein, A. 2013. Mathematics outcomes in South African schools. What are the facts? What should be done? The Centre for Development and Enterprise, South Africa. Retrieved from http://www.cde.org.za

Blum, W. & Leiß, D. 2007. How do students and teachers deal with modelling problems? In C. Haines, P. Galbraith, W. Blum & S. Khan (Eds.), Mathematical modelling: Education, engineering and economics (pp. 222-231). Chichester: Harwood. https://doi.org/10.1533/9780857099419.5.221

Boaler, J. (Ed.). 2000. Multiple perspectives on mathematics teaching and learning. London: Ablex Publishing.

Cassidy, S. 2015. Resilience building in students: The role of academic self-efficacy. Frontiers in Psychology, 6, 1781. https://doi.org/10.3389/fpsyg.2015.01781

Coetzee, S. & Van der Merwe, P. 2010.
Industrial psychology students' attitudes towards statistics. SA Journal of Industrial Psychology/SA Tydskrif vir Bedryfsielkunde, 36(1), 1-8. https://doi.org/10.4102/sajip.v36i1.843

COMAP-SIAM. 2016. Guidelines for assessment & instruction in mathematical modeling education (GAIMME). Norman, Oklahoma. Available from http://www.siam.org

Doerr, H.M., Ärlebäck, J.B. & Misfeldt, M. 2017. Representations of modelling in mathematics education. In G.A. Stillman, W. Blum & G. Kaiser (Eds.), Mathematical modelling and applications (pp. 71-81). Cham: Springer. https://doi.org/10.1007/978-3-319-62968-1_6

Emmert-Streib, F. & Dehmer, M. 2019. Evaluation of regression models: Model assessment, model selection and generalization error. Machine Learning and Knowledge Extraction, 1, 521-551. https://doi.org/10.3390/make1010032

Garfield, J. & Ben-Zvi, D. 2007. How students learn statistics revisited: A current review of research on teaching and learning statistics. International Statistical Review, 75(3), 372-396. https://doi.org/10.1111/j.1751-5823.2007.00029.x

Greefrath, G., Koepf, W. & Neugebauer, C. 2017. Is there a link between preparatory course attendance and academic success? A case study of degree programmes in electrical engineering and computer science. International Journal of Research in Undergraduate Mathematics Education, 3(1), 143-167. https://doi.org/10.1007/s40753-016-0047-9

Kilpatrick, J., Swafford, J. & Findell, B. (Eds.). 2001. Adding it up: Helping children learn mathematics. Washington, DC: National Academies Press.

Lesh, R. & Doerr, H. 2003. Foundations of a models and modelling perspective on mathematics teaching, learning, and problem solving. In R. Lesh & H.M. Doerr (Eds.), Beyond constructivism: Models and modelling perspectives on mathematics problem solving, learning, and teaching (pp. 3-33). Mahwah: Lawrence Erlbaum. https://doi.org/10.4324/9781410607713

Marbouti, F., Diefes-Dux, H.A. & Madhavan, K. 2016. Models for early prediction of at-risk students in a course using standards-based grading. Computers & Education, 103, 1-15. https://doi.org/10.1016/j.compedu.2016.09.005

Meyer, W.J. 2012. Concepts of mathematical modeling. Mineola, New York: Dover Publications, Inc.

Onwuegbuzie, A.J. 2004. Academic procrastination and statistics anxiety. Assessment & Evaluation in Higher Education, 29(1), 3-19. https://doi.org/10.1080/0260293042000160384

Rach, S. & Heinze, A. 2017. The transition from school to university in mathematics: Which influence do school-related variables have? International Journal of Science and Mathematics Education, 15, 1343-1363. https://doi.org/10.1007/s10763-016-9744-8

Samuelsson, J. 2010. The impact of teaching approaches on students' mathematical proficiency in Sweden. International Electronic Journal of Mathematics Education, 5(2), 61-78.

Slavin, R.E. & Madden, N.A. 1989. What works for students at risk: A research synthesis. Educational Leadership, 46(5), 4-13.

Talsma, K., Schüz, B., Schwarzer, R. & Norris, K. 2018. I believe, therefore I achieve (and vice versa): A meta-analytic cross-lagged panel analysis of self-efficacy and academic performance. Learning and Individual Differences, 61, 136-150. https://doi.org/10.1016/j.lindif.2017.11.015

Van Appel, V. & Durandt, R. 2018.
Dissimilarities in attitudes between students in service and mainstream courses towards statistics: An analysis conducted in a developing country. EURASIA Journal of Mathematics, Science and Technology Education, 14(8). https://doi.org/10.29333/ejmste/91912

Van Rijsbergen, C.J. 1979. Information retrieval (2nd ed.). London: Butterworths.

Van Zyl, A., Gravett, S. & De Bruin, G.P. 2012. To what extent do pre-entry attributes predict first year student academic performance in the South African context? South African Journal of Higher Education, 26(5), 1095-1111. https://doi.org/10.20853/26-5-210