Meta-Psychology, 2021, vol. 5, MP.2019.1916, https://doi.org/10.15626/MP.2019.1916
Article type: Original Article
Published under the CC-BY4.0 license
Open data: N/A
Open materials: N/A
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: No
Edited by: Moritz Heene
Reviewed by: J. McGrane, A. Kyngdon
Analysis reproduced by: André Kalmendal
All supplementary files can be accessed at the OSF project page: https://doi.org/10.17605/OSF.IO/GTN9C

Levels of measurement and statistical analyses

Matt N. Williams
School of Psychology, Massey University

Most researchers and students in psychology learn of S. S. Stevens' scales or "levels" of measurement (nominal, ordinal, interval, and ratio), and of his rules setting out which statistical analyses are admissible with each measurement level. Many are nevertheless left confused about the basis of these rules, and whether they should be rigidly followed. In this article, I attempt to provide an accessible explanation of the measurement-theoretic concerns that led Stevens to argue that certain types of analyses are inappropriate with data of particular levels of measurement. I explain how these measurement-theoretic concerns are distinct from the statistical assumptions underlying data analyses, which rarely include assumptions about levels of measurement. The level of measurement of observations can nevertheless have important implications for statistical assumptions. I conclude that researchers may find it more useful to critically investigate the plausibility of the statistical assumptions underlying analyses than to limit themselves to the set of analyses that Stevens believed to be admissible with data of a given level of measurement.

Keywords: Levels of measurement, measurement theory, ordinal, statistical analysis.

Introduction

Most students and researchers in psychology learn of a division of measurement into four scales: nominal, ordinal, interval, and ratio. This taxonomy was created by the psychophysicist S. S. Stevens (1946). Stevens wrote his article in response to a long-running debate within a committee of the British Association for the Advancement of Science which had been formed in order to consider the question of whether it is possible to measure "sensory events" (i.e., sensations and other psychological attributes; see Ferguson et al., 1940). The committee was partly made up of physical scientists, many of whom believed that the numeric recordings taking place in psychology (specifically psychophysics) did not constitute measurement as the term is usually understood in the natural sciences (i.e., as the estimation of the ratio of a magnitude of an attribute to some unit of measurement; see Michell, 1999). Stevens attempted to resolve this debate by suggesting that it is best to define measurement very broadly as "the assignment of numerals to objects or events according to rules" (Stevens, 1946, p. 677), but then divide measurements into four different "scales". These are now often referred to as "levels" of measurement, and that is the terminology I will predominantly use in this paper, as the term "scales" has other competing usages within psychometrics [1].

[1] For example, the term "scale" is often used to refer to specific psychological tests or measuring devices (e.g., the "hospital anxiety and depression scale"; Zigmond & Snaith, 1983). It is also often used to refer to formats for collecting responses (e.g., "a four-point rating scale"). These contemporary usages are quite different from Stevens' "scales of measurement", and therefore the term "levels of measurement" is somewhat less ambiguous.
According to Stevens' definition of measurement, virtually any research discipline can claim to achieve measurement, although not all may achieve interval or ratio measurement. He went on to argue that the level with which an attribute has been measured determines which statistical analyses are permissible (or "admissible") with the resulting data.

Stevens' definition and taxonomy of measurement has been extremely influential. Although I have not conducted a rigorous evaluation, it appears to be covered in the vast majority of research methods textbooks aimed at students in the social sciences (e.g., Cozby & Bates, 2015; Heiman, 2001; Judd et al., 1991; McBurney, 1994; Neuman, 2000; Price, 2012; Ray, 2000; Sullivan, 2001). Stevens' taxonomy is also often used as the basis for heuristics indicating which statistical analyses should be used in particular scenarios (see for example Cozby & Bates, 2015).

However, the fame and influence of Stevens' taxonomy is something of an anomaly in that it forms part of an area of inquiry (measurement theory) which is rarely covered in introductory texts on research methods [2]. Measurement theories are theories directed at foundational questions about the nature of measurement. For example, what does it mean to "measure" something? What kinds of attributes can and cannot be measured? Under what conditions can numbers be used to express relations amongst objects? Measurement theory can arguably be regarded as a branch of philosophy (see Tal, 2017), albeit one that has heavily mathematical features. The measurement theory literature contains several excellent resources pertaining to the topic of admissibility of statistical analyses (e.g., Hand, 1996; Luce et al., 1990; Michell, 1986; Suppes & Zinnes, 1962), but this literature is often written for an audience of readers who have a reasonably strong mathematical background, and can be quite dense and challenging. This means that, while many students and researchers are exposed to Stevens' rules about admissible statistics, few are likely to understand the basis of these rules. This can often lead to significant confusion about important applied questions: For example, is it acceptable to compute "parametric" statistics using observations collected with a Likert scale?

[2] By way of example, the well-known textbook "Psychological Testing: Principles, Applications, & Issues" by Kaplan and Saccuzzo (2018) covers Stevens' taxonomy and rules about permissible statistics (albeit without attribution to Stevens), but does not mention any of the measurement theories discussed in this section (operationalism, representationalism, and the classical theory of measurement).

In this article, therefore, I attempt to provide an accessible description of the rationale for Stevens' rules about admissible statistics. I also describe some major objections to Stevens' rules, and explain how Stevens' measurement-theoretic concerns are different from the statistical assumptions underlying statistical analyses—although there exist important connections between the two. I close with conclusions and recommendations for practice.

Stevens' Taxonomy of Measurement

Stevens' definition and taxonomy of measurement was inspired by two theories of measurement: Representationalism, especially as expressed by Campbell (1920), and operationalism, especially as expressed by Bridgman (1927) [3].

[3] See McGrane (2015) for a discussion of these competing influences on Stevens' definition of measurement.
Operationalism (see Bridgman, 1927; Chang, 2009) holds that an attribute is fully synonymous with the operations used to measure it: That if I say I have measured depression using scores on the Beck Depression Inventory (BDI), then when I speak of a participant's level of depression, I mean nothing more or less than the score the participant received on the BDI. Stevens' definition of measurement—as "the assignment of numerals to objects or events according to rules" (Stevens, 1946, p. 677)—is based on operationalism. In contrast to operationalism, representationalism argues that measurement starts with a set of observable empirical relations amongst objects. The objects of measurement could literally be inanimate objects (e.g., rocks), but they could also be people (e.g., participants in a research study). To a representationalist, measurement consists of transferring the knowledge obtained about the empirical relations amongst objects (e.g., that granite is harder than sandstone) into numbers which encode the information obtained about these empirical relations (see Krantz et al., 1971; Michell, 2007).

Stevens suggested that levels of measurement are distinguished by whether we have "empirical operations" (p. 677) for determining relations (equality, rank-ordering, equality of differences, and equality of ratios). This is an idea that appears to have been influenced by representationalism (representationalism being a theory that concerns the use of numbers to represent information about empirical relations). Nevertheless, an influence of operationalism is apparent here also: To Stevens, the level of measurement of a set of observations depended on whether an empirical operation for determining equality, rank-ordering, equality of differences and/or equality of ratios was applied (regardless of the degree to which the empirical operation produced valid determinations of these empirical relations).

While Stevens' definition and taxonomy of measurement incorporates both representational and operationalist influences, there is a third theory of measurement that he did not incorporate: The classical theory of measurement. This theory of measurement has been the implicit theory of measurement in the physical sciences since classical antiquity (Michell, 1999). The classical theory of measurement states that to measure an attribute is to estimate the ratio of the magnitude of an attribute to a unit of the same attribute. For example, to say that we have measured a person's height as 185 cm means that we have estimated that the person's height is 185 times that of one centimetre (the unit). The classical theory of measurement suggests that only some attributes—quantitative attributes—have a structure such that their magnitudes stand in ratios to one another. A set of axioms demonstrating what conditions need to be met for an attribute to be quantitative was determined by the German mathematician Otto Hölder (1901; for an English translation see Michell & Ernst, 1996). Because the focus of this article is on Stevens' arguments, I will not cover the classical theory of measurement further in this article, but excellent introductions can be found in Michell (1986, 1999, 2012).
It may suffice at this point to note that from a classical perspective, Stevens' "nominal" and "ordinal" levels do not constitute measurement at all.

Stevens defined his four scales or levels of measurement as follows.

Nominal

According to Stevens (1946), nominal measurement is produced when we have an empirical operation that allows us to determine that some objects are equivalent with respect to some attribute, while other objects are noticeably different. For example, imagine we have a group of university students, and via the empirical operation of looking up their academic records we can determine that some of the students are psychology majors while some are business majors. If we wished to compare the psychology students and the business students with respect to some other attribute, we might record information about the participants and their majors in a dataset. For the sake of convenience, we might do so by recording their majors numerically. Specifically, we might enter a 0 in a "Major" column in the dataset for each of the psychology students, and a 1 for each of the business students. In doing so, according to Stevens, we would have accomplished nominal measurement. Importantly, there are many coding rules that would work just as effectively at conveying the information we have about participants' majors. For example, we could just as well use 1 to indicate a psychology student and 0 to indicate a business student, or -10 to indicate a psychology student and 437.3745 to indicate a business student. As long as students' majors are recorded by assigning the psychology students one fixed number and business students another, any two numbers would work just as well at conveying what we have observed about the students.

Ordinal

Ordinal measurement is produced when we have an empirical operation that allows us to determine that some objects are greater or lesser than others with respect to some attribute. Imagine, for example, that we are interested in the attribute satisfaction with life. We perform the empirical operation of asking three participants (Hakim, Jeff, and Sarah) to respond to the question "In general, how satisfied are you with your life?" (see Cheung & Lucas, 2014, p. 2811), with response options of very dissatisfied, moderately dissatisfied, moderately satisfied, and very satisfied. We discover that Hakim indicates that he is very satisfied with life, while Jeff is moderately satisfied, and Sarah is moderately dissatisfied. We have thus performed an empirical operation that allows us to determine whether each of these participants has more or less life satisfaction than another. If we are to record this information numerically, there are many possible coding rules that could convey the information collected about the life satisfaction of our participants, but there is a restriction: The number assigned to Hakim must be higher than the number assigned to Jeff, which must be higher again than that assigned to Sarah. Any such coding rule will record what we have observed: That Hakim has the highest level of life satisfaction, followed by Jeff, followed by Sarah. So, we could assign Hakim a life satisfaction score of 3, Jeff a score of 2, and Sarah a score of 1. Or we could also assign Hakim a score of 1234, Jeff a score of 6, and Sarah a score of 0.45. Either of these two coding rules records the observed ordering Hakim > Jeff > Sarah, and from Stevens' perspective each would be just as adequate as the other.
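As a small illustration (an R sketch of my own, using the hypothetical scores just mentioned; it is not code from the article or its OSF project), the two codings carry exactly the same ordinal information:

```r
# Two admissible ordinal codings of the same observed ordering
# Hakim > Jeff > Sarah (the hypothetical scores from the text).
coding_one <- c(Hakim = 3,    Jeff = 2, Sarah = 1)
coding_two <- c(Hakim = 1234, Jeff = 6, Sarah = 0.45)

rank(coding_one)                                 # Hakim 3, Jeff 2, Sarah 1
rank(coding_two)                                 # identical ranks
identical(order(coding_one), order(coding_two))  # TRUE: same ordering
```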
However, we could not assign Hakim a score of 3, Jeff a score of 1, and Sarah a score of 2; this would imply that Sarah has higher life satisfaction than Jeff, which conflicts with the empirical information we have collected. Formally, any coding system within the class of monotonic transformations will equivalently convey the information that we have about the participants. In other words, if we have assigned numeric scores to the participants such that Hakim > Jeff > Sarah, then we can transform those numeric scores in any way provided that the order of the scores (Hakim > Jeff > Sarah) remains the same.

Interval

Interval measurement is produced when, in addition to having an empirical operation that allows us to observe that some objects are greater or less than others with respect to some attribute, we have an empirical operation that allows us to determine whether the difference between a pair of objects is greater than, less than, or the same as the difference between another pair of objects. The classic example of an interval scale is temperature when measured via a mercury thermometer (i.e., a narrow glass tube containing mercury, with a bulb at the bottom, held upright). If we place the thermometer inside a fridge, we can see that the mercury level will be lower than if we placed the thermometer in a living room. It will be lower again if we place the thermometer in a freezer. If we are willing to assume that mercury expands with increasing temperature, this empirical observation allows us to determine that, with respect to temperature, living room > fridge > freezer. This observation alone would be a purely ordinal one.

However, we can also use a ruler to measure the highest point reached by the mercury in each location. By this method, we can determine whether the distance the mercury expands by when moved from the freezer to the fridge is more, the same, or less than the distance it expands by when moved from the fridge to the living room. If we are willing to assume that the relationship between temperature and the height of the mercury in the thermometer is linear within the range of temperatures observed [4], then we can also empirically compare differences between observations. For example, we might attach a ruler to our tube of mercury, and observe that the difference in the height of the mercury between the living room and the fridge is 5 mm, while the difference between the height of the mercury in the fridge and in the freezer is 10 mm. Given our assumption of linear expansion of mercury with temperature, this implies that the difference in temperature between the freezer and the fridge is twice [5] the difference in temperature between the fridge and the living room. Because we have an empirical operation that allows us to compare differences in temperature, we have achieved interval measurement.

[4] The fact that this assumption is necessary points to the important role theory can have in measurements; for a sophisticated discussion in the context of thermometers and the measurement of temperature, see Sherry (2011).

[5] In fact, it is also possible to produce interval measurement based only on observations about order and equality of differences along with some other conditions; see Suppes and Zinnes' (1962) description of infinite difference systems. For the sake of simplicity and brevity I have focused here on the simpler scenario of observations about ratios of differences.
The information we have collected about the temperature of the fridge, the freezer, and the living room can be recorded via a variety of coding rules, but there is now an additional restriction: Not only must the coding rule preserve the observed ordering living room > fridge > freezer, but the difference between the number we assign to the fridge and the one we assign to the freezer must be twice the difference between the number we assign to the living room and the fridge. We could record the freezer as having a temperature of 0, the fridge a temperature of 2, and the living room a temperature of 3. Or we could record the freezer as having a temperature of 5, the fridge a temperature of 15, and the living room a temperature of 20. But we should not record the freezer as having a temperature of 0, the fridge a temperature of 1, and the living room a temperature of 3; this would imply that the difference in temperature between the living room and the fridge is greater than that between the fridge and the freezer. More formally, if we have a coding rule that records the information we have collected about these temperatures, we can apply any linear transformation to it (e.g., by multiplying the existing values by some number and/or adding a constant) while still adequately representing the information we have collected about the temperatures.

It is worth emphasising here that it is the fact that we can empirically compare differences between temperatures that implies that we have achieved interval measurement. The argument has sometimes been made (e.g., Carifio & Perla, 2008) that while the responses to a rating scale item (as in the earlier example for life satisfaction) are ordinal in nature, a score created by summing the responses to multiple items is interval. This argument confuses the issue of level of measurement with that of the distribution of a variable. The act of summing ordinal observations may increase the degree to which the distribution of scores approximates a normal distribution, but it does not transform these observations from ordinal to interval (because it does not provide an operation for determining equality of differences).

Ratio

Imagine, now, that we wish to compare the lengths of two objects: A pen and a rolling pin. By placing these objects side-by-side we can quickly establish that the rolling pin is the longer. Let us assume that we have several of the same model of pen, each of identical length. Now imagine we lay three of these pens end to end with one another and observe that the rolling pin appears to be equal in length to the three pens [6]. In other words, we have observed that the ratio of the length of the rolling pin to that of the pen is 3. Therefore, we have achieved ratio measurement.

[6] For the sake of simplicity, the situation I describe here is one where the length of one of the objects is exactly divisible by the length of the other. In reality, it might be the case that the rolling pin is slightly longer than three pens, such that I can conclude only that the ratio of the length of the rolling pin to the pen falls in the interval [3, 4]. But we could obtain a more precise estimate of the length of the rolling pin if we had a "standard sequence" of many replicates of the same short length. A ruler measured in millimetres, for example, just represents a sequence of identical one-millimetre lengths laid end to end. For a more rigorous treatment of this topic, see Krantz et al. (1971).
Once again, these observations can be recorded numerically, but our choice of coding rule is now very restricted: Whatever number we assign as the length of the rolling pin must be three times the number assigned to the pen. So we might record the pen as having a length of 1 and the rolling pin a length of 3, or the pen a length of 0.5 and the rolling pin a length of 1.5, but we could not assign the pen a length of 1 and the rolling pin a length of 2. More formally, if we have applied a coding rule that records the information we have collected about the ratios of the lengths of the objects, then the only transformation we can apply to the numeric values is multiplying them by some constant.

Stevens' "Admissible Statistics"

The connection Stevens drew between levels of measurement and statistical analysis was this: Given observations pertaining to a set of objects, there are a variety of coding rules that one could use to encode the information held about the empirical relations amongst objects. Furthermore, if we consider the four levels of measurement on a hierarchy from ratio at the top to nominal at the bottom, the lower levels of measurement offer a much more diverse range of coding rules from which one can arbitrarily select. Statistical analyses, however, may produce different results depending on which coding rule is used, so we should only use statistical analyses that produce invariant results across the class of coding rules that are permissible for the data we have collected.

By way of example, let's return to our measurements of life satisfaction from three participants: Hakim (who was very satisfied with his life), Jeff (moderately satisfied), and Sarah (moderately dissatisfied). Imagine now that we recruit a fourth participant, Ming, who transpires to be very dissatisfied with life. We wish to use these four participants to test the hypothesis that owning a pet is associated with increased life satisfaction. We ask our participants whether they each own a pet; it turns out that Hakim and Jeff do, while Sarah and Ming do not. We can now proceed to comparing the life satisfaction of these two groups of participants. Recall, though, that we have only ordinal observations of life satisfaction, and can apply any coding rule to our observations that preserves the ordering Hakim > Jeff > Sarah > Ming. The outcomes of two such coding rules are displayed in Table 1.

Table 1
Example Life Satisfaction Data

                                      Life satisfaction
Participant   Owns pet?   Qualitative response        Coding rule one   Coding rule two
Hakim         Yes         Very satisfied              4                 1002
Jeff          Yes         Moderately satisfied        3                 1000
  Mean (SD)                                           3.5 (0.71)        1001 (1.41)
Sarah         No          Moderately dissatisfied     2                 1
Ming          No          Very dissatisfied           1                 0
  Mean (SD)                                           1.5 (0.71)        0.5 (0.71)

Note. Coding rule one: Very dissatisfied = 1, moderately dissatisfied = 2, moderately satisfied = 3, very satisfied = 4. Coding rule two: Very dissatisfied = 0, moderately dissatisfied = 1, moderately satisfied = 1000, very satisfied = 1002.

If we applied a Student's t test to compare the mean life satisfaction ratings of the two groups (pet owners, non-pet owners), we would discover that the results differ depending on which coding rule we use.
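The comparison can be carried out in a few lines of R. The sketch below is my own illustration rather than the article's reproducible OSF code; it assumes an equal-variance Student's t test and R's default Wilcoxon rank-sum (Mann-Whitney U) test. The results it produces are those described in the next paragraph.

```r
# Data from Table 1: Hakim, Jeff (pet owners); Sarah, Ming (non-owners).
owns_pet <- c(TRUE, TRUE, FALSE, FALSE)
rule_one <- c(4, 3, 2, 1)          # coding rule one
rule_two <- c(1002, 1000, 1, 0)    # coding rule two

# Student's t test (equal variances assumed) under each coding rule:
t.test(rule_one[owns_pet], rule_one[!owns_pet], var.equal = TRUE)  # t(2) = 2.83,   p = .106
t.test(rule_two[owns_pet], rule_two[!owns_pet], var.equal = TRUE)  # t(2) = 894.87, p < .001

# Mann-Whitney U (Wilcoxon rank-sum) test under each coding rule:
wilcox.test(rule_one[owns_pet], rule_one[!owns_pet])               # p = .33
wilcox.test(rule_two[owns_pet], rule_two[!owns_pet])               # p = .33 (identical)
```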
For coding rule one, the mean difference in life satisfaction between the pet owners and the non-pet owners is 2, and this difference is not statistically significant, t(2) = 2.83, p = .106. But for coding rule two, the mean difference in life satisfaction is 1000.5, and this difference is statistically significant, t(2) = 894.87, p < .001. Thus, it seems that the outcome of the Student's t test varies across these two equally permissible coding rules, which does not seem like a satisfactory state of affairs. On the other hand, if we compare the two samples using a Mann-Whitney U test, the resulting p value is identical across the two coding rules (being p = .33). This is the case for the simple reason that the Mann-Whitney U is calculated using the ranks of the observations rather than their numeric coded values. As such, we might argue that the Mann-Whitney U statistic and its associated p value are invariant across the class of permissible transformations with ordinal data, whereas the Student's t test is not, and that the Mann-Whitney U is therefore the more appropriate test.

Stevens went on to set out a list of statistical analyses that he believed would produce invariant results for variables of each level of measurement. For example, he suggested that a median is admissible as a measure of central tendency for an ordinal variable, since the case (or pair of cases) that falls at the median will always be the same across any monotonic transformation of the variable, even if the numeric value of the median will not. On the other hand, he suggested that a mean is not an admissible measure of central tendency with ordinal data, because both the actual value of the mean and the case to which it most closely corresponds will differ across monotonic transformations of the observations. In noting these distinctions, it is clear there exists a degree of ambiguity about what constitutes "invariance". Michell (1986) and Luce et al. (1990) provide more formal examinations of the type of invariance that is implied by Stevens' arguments.

Parametric and Non-Parametric Statistics

It is common for authors to claim that the issue of admissibility raised by Stevens implies that parametric statistical analyses should only be used with interval or ratio data (e.g., Jamieson, 2004; Kuzon et al., 1996). Broadly speaking, a parametric analysis is one that involves an assumption that observations or errors are drawn from a specific probability distribution, such as the normal distribution (see Altman & Bland, 2009). Some statistical analyses (e.g., rank-based tests such as the Mann-Whitney U) are non-parametric and also produce invariant results across monotonic transformations of the outcome variable, and thus comply with Stevens' rules about admissible statistics with ordinal data. However, there certainly exist non-parametric tests that would not be considered admissible for use with ordinal data by Stevens. For example, a permutation test to compare two means (see Hesterberg et al., 2002) is non-parametric—it does not assume that the errors or observations are drawn from any specific probability distribution—but it will not produce invariant p values across monotonic transformations of the observations. As such, Stevens' rules about admissibility are not accurately described as applying to whether "parametric" analyses can be utilised.
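To illustrate the point about permutation tests, the following sketch (my own, with hypothetical data not drawn from the article) compares a simple Monte Carlo permutation test on means with the Mann-Whitney test before and after an order-preserving (monotonic) recoding of the scores; only the rank-based test is guaranteed to give the same p value in both cases.

```r
# Hypothetical scores for two groups (no claim that these resemble real data).
set.seed(1)
g1 <- c(1, 2, 4, 5, 7, 9)
g2 <- c(3, 6, 8, 10, 11, 12)

# Monte Carlo permutation test for a difference in means (two-sided).
perm_p <- function(x, y, n_perm = 10000) {
  obs    <- mean(x) - mean(y)
  pooled <- c(x, y)
  diffs  <- replicate(n_perm, {
    idx <- sample(length(pooled), length(x))
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  mean(abs(diffs) >= abs(obs))
}

perm_p(g1, g2)                           # p under the original coding
perm_p(exp(g1), exp(g2))                 # p under a monotonic recoding: generally differs
wilcox.test(g1, g2)$p.value              # rank-based p...
wilcox.test(exp(g1), exp(g2))$p.value    # ...unchanged by the monotonic recoding
```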
Cliff (1996) uses the term ordinal statistics to describe those analyses whose conclusions will be unaffected by monotonic transformations of the variables; this term can be helpful when describing those analyses that Stevens would have classed as permissible with ordinal data.

Objections to Stevens' Claims about Admissibility

A range of objections to Stevens' claims about the relationship between levels of measurement and admissible statistical analysis have been offered in the literature. I will not attempt to cover these comprehensively; excellent summaries can be found in Velleman and Wilkinson (1993), and Zumbo and Kroc (2019). In this section, I will focus on just three core objections.

The first fundamental objection to Stevens' dictums is simply that researchers may not necessarily desire to make inferences that will be invariant across all permissible transformations of the measurements they have observed. For example, consider our earlier example of a researcher attempting to measure the relationship between owning a pet and satisfaction with life. The researcher might proceed by coding responses to a life satisfaction scale as very dissatisfied = 1, moderately dissatisfied = 2, moderately satisfied = 3, and very satisfied = 4, and then perform a statistical analysis. Such a researcher might very well see it as entirely irrelevant whether her results would remain invariant if she monotonically transformed her data using the coding rule very dissatisfied = 0, moderately dissatisfied = 1, moderately satisfied = 1000, very satisfied = 1002. Amongst the types of generalisations that researchers seek (e.g., from samples to populations, from observations to causes), generalisations from one coding rule to another may not always be desired or claimed. Correspondingly, it may be inappropriate for methodologists to dictate that researchers should act in such a way as to permit such generalisations.

The second objection is the presence of internal inconsistency in Stevens' prohibitions. Specifically, Stevens was extremely liberal in his definition of measurement (any assignment of numbers to objects is enough to be measurement), and likewise liberal in how he distinguished levels of measurement. For example, he argued that all that is needed to achieve interval measurement of an attribute is an empirical operation for determining equality of differences of the attribute (regardless of the validity of this operation, or the structure of the attribute itself). Taken literally, this would imply that I could achieve "interval" measurements of film quality by applying the empirical operation of asking a group of participants whether they perceive there to be a larger difference in quality between Saw IV and Maid in Manhattan than between Maid in Manhattan and The Godfather (regardless of whether participants actually have any meaningful way of comparing these differences, or whether "film quality" is actually a quantitative attribute). When the definition of what constitutes a particular level of measurement is so loose, it makes little sense to make the level of measurement a strict determining factor for which statistical analyses may be applied. Even Stevens himself wavered on the point of how strictly his rules about admissible statistics should be applied: "…most of the scales used widely and effectively by psychologists are ordinal scales.
In the strictest propriety the ordinary statistics involving means and standard deviations ought not to be used with these scales […] On the other hand, for this 'illegal' statisticizing there can be invoked a kind of pragmatic sanction: In numerous instances it leads to fruitful results" (Stevens, 1946, p. 679).

A final fundamental objection to Stevens' dictums about admissible statistics is the fact that statistical tests make assumptions about the distributions of variables and/or errors—not about levels of measurement. This is the topic I will turn to for the remainder of this article.

Statistical Assumptions

When statisticians evaluate a method for estimating a parameter (e.g., the relationship between two variables in a population), an important task is to show that the estimation method has particular desirable properties. For example, we may desire that a method for estimating a parameter will produce estimates that are unbiased—that, across repeated samplings, do not tend to systematically over- or underestimate the parameter. We may also desire that the estimation method is consistent—that the statistic estimated from the sample will converge to the true population parameter as we collect more and more observations. And we may desire that the estimation method is efficient—that it minimises how much variability or noise there is in the estimates it produces across repeated samples (see Dougherty, 2007, for more detailed descriptions of these concepts). To demonstrate that particular estimation methods have particular desirable properties, statisticians must make assumptions. These assumptions are premises that are used to form deductive arguments (proofs).

For example, a statistical model commonly used by psychologists is the linear regression model, in which a participant's score on an outcome variable [7] is modelled as a function of their scores on a set of predictor variables multiplied by a set of regression coefficients (plus random error). In this model, the errors (over repeated samplings) are typically assumed to be independently, identically and normally distributed with an expected value (true mean) of zero, regardless of the combination of levels of the predictor variables for each participant [8] (Williams et al., 2013). Furthermore, we assume that the predictor variables are measured without error, and that any measurement error in the outcome variable is purely random and uncorrelated with the predictors (Williams et al., 2013). If these assumptions hold, then it can be demonstrated that ordinary least squares estimation will produce estimates of the regression coefficients that are unbiased, consistent, efficient and normally distributed estimators of the true values in the population. This in turn means that statistical tests can be conducted on the coefficients that will abide by their nominal Type I error rates and confidence interval coverage.

[7] An outcome variable is often referred to as a "dependent" variable, and a predictor variable as an "independent" variable. I use the more general terminology of predictor/outcome because some authors reserve the terms "independent variable" and "dependent variable" to refer to variables in a true experiment.

[8] This presentation of assumptions is for a model where the predictor values may be either fixed in advance or sampled from a population. The assumptions of a model where predictor values are fixed in advance are slightly simpler, requiring only that the marginal mean of each error term is zero.

In most cases, the assumptions used to prove that statistical tests have particular desirable properties (e.g., unbiasedness, consistency, efficiency) do not include assumptions about levels of measurement.
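To make this concrete, the linear regression model described above can be written out in full (the notation below is mine, not the article's); note that none of its assumptions mention levels of measurement:

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i ,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2), \quad i = 1, \dots, n .
```

The assumptions concern the behaviour of the errors over repeated sampling (and, as noted above, the absence of certain kinds of measurement error), not whether the y or x variables are nominal, ordinal, interval, or ratio.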
It is not correct to say, for example, that a correlation or a t test or a regression model or an ANOVA directly assumes that any of the variables involved are interval or ratio. As should be clear by now, the concerns that motivated Stevens' rules do not pertain to statistical assumptions. Rather, they are measurement-theoretic concerns, pertaining specifically to the question of whether statistical analyses will produce results that depend on what has been empirically observed as opposed to arbitrary features of the process used to numerically record these observations.

If most statistical tests do not make assumptions about levels of measurement, does this in turn imply that concerns about levels of measurement can safely be disregarded? No. The application of statistical analysis with ordinal or nominal data can result in consequential breaches of statistical assumptions. In fact, considering levels of measurement in terms of their potential impacts on statistical assumptions provides a framework that may be useful for evaluating the extent to which levels of measurement have implications for data analysis decisions.

When researchers apply inferential statistics (e.g., significance tests, confidence intervals, Bayesian analyses) they are by definition aiming to make inferences (e.g., from observations to causal effects, and/or from a sample to a population). The validity of these inferences will necessarily depend on the validity of the statistical assumptions made in forming these inferences—so, whereas it is possible to make an argument that Stevens' dictums can safely be ignored, this is certainly not the case for statistical assumptions.

Below I identify several ways in which the level of measurement of a set of observations may affect whether particular statistical assumptions are met. I make no claim to this being an exhaustive list of such mechanisms. I focus specifically on ordinal data because this is the measurement level which most commonly causes ambiguity with respect to analysis decisions in psychology research. I also focus on multiple linear regression as the statistical analysis of interest, as this is an analysis framework that encompasses many special cases of interest to psychologists (e.g., ANOVA, ANCOVA, t tests), and that itself forms a special case of other more sophisticated analysis techniques often applied by psychologists (e.g., structural equation models, mixed/multilevel models, generalised linear models).

Assumption that Measurement Error in Outcome is Uncorrelated with Predictors

One scenario in which a researcher may find themselves with a set of ordinal observations is when the attribute they seek to measure is continuous, but the observations are obtained in such a way that this quantitative attribute is discretised. For example, we might assume that there exists an underlying continuous latent variable "life satisfaction", and that recording it using a four-point rating scale such as the one described earlier in this article means dividing variation in this continuous attribute into four ordered categories.
The assumption that responses to observed items are caused by variation in underlying latent attributes reflects a perspective on measurement sometimes referred to as latent variable theory (Borsboom, 2005).

Liddell and Kruschke (2018) note that when a researcher aims to make inferences about the effect of a set of predictor variables on an underlying unbounded continuous attribute—but the outcome variable is actually recorded as a response in one of a finite number of ordered categories—the participants' observed ordinal responses can be biased estimates of their levels of the continuous attribute. This is the case because a response scale that consists of a set of discrete options produces responses that are bounded to fall within a range, whereas the underlying continuous attribute may not be bounded to fall within that range. For example, if the underlying continuous attribute is normally distributed, it will have an unbounded distribution, and could theoretically take any value on the real number line. This implies in turn that values of the underlying continuous variable that lie outside the range of the response options will be "censored". For example, if responses are recorded on a rating scale with response options coded as 1 to 5, values on the underlying continuous attribute that are higher than 5 can only be recorded as 5, while values lower than 1 can only be recorded as 1. This means that, for those participants whose values of the attribute are outside the range of the response options, the recorded responses are biased estimates of their levels of the underlying continuous attribute.

Although Liddell and Kruschke do not describe it in these terms, the difference between the ordinal response and the true underlying values of the continuous attribute represents a form of systematic measurement error. And because the magnitude of this error depends on the value of the underlying continuous attribute, the presence of any relationship between the predictor variables and the underlying continuous attribute will mean that the measurement error in the outcome variable is correlated with the predictor variables. This constitutes a breach of the assumptions of a linear regression model, and a breach that can seriously distort parameter estimates and error rates (as demonstrated by Liddell & Kruschke, 2018). As Liddell and Kruschke show, it is also not a problem that is ameliorated when the outcome variable is formed by summing or averaging responses from multiple items. By way of solution, Liddell and Kruschke suggest that regression models specifically designed for ordinal outcome variables (e.g., the ordered probit model) may be useful in such situations; see also Bürkner and Vuorre (2019) for an introduction to a wider range of ordinal regression models. Furthermore, while the discussion above focuses on ordinal outcome variables, using an ordinal predictor variable to make inferences about the effect of an underlying continuous attribute will likewise mean that the underlying attribute is measured with error, and result in a biased estimate of its effect (see Westfall & Yarkoni, 2016).

Admittedly, this is a problem whose salience depends on whether the researcher believes that a continuous attribute underlies an ordinal variable, and wishes to make inferences about the underlying continuous attribute rather than the observed ordinal variable.
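A brief simulation sketch of the situation just described may help. This is my own illustration, not an analysis from Liddell and Kruschke or from the article's OSF materials; it uses the ordered probit model as implemented in the MASS package, one of several possible implementations of the ordinal regression models mentioned above.

```r
# A continuous latent outcome depends linearly on a predictor, but is recorded
# only as one of four ordered categories with unevenly spaced thresholds.
library(MASS)

set.seed(2025)
n      <- 500
x      <- rnorm(n)                    # predictor
latent <- 0.8 * x + rnorm(n)          # latent continuous outcome (true slope 0.8)
y_ord  <- cut(latent, breaks = c(-Inf, -0.5, 0, 2, Inf),
              labels = 1:4, ordered_result = TRUE)

# Treating the 1-4 codes as metric: the slope depends on the arbitrary codes
# and on where the thresholds happen to fall.
coef(lm(as.numeric(y_ord) ~ x))

# Ordered probit regression: targets the latent scale directly, so the
# coefficient should be close to the true value of 0.8.
polr(y_ord ~ x, method = "probit")
```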
However, making inferences only about ordinal variables themselves also presents serious challenges for statistical analysis, as we will see in the next subsection.

Non-linearity

When social scientists specify statistical models, they often assume that relationships between variables are linear. For some statistical analyses (e.g., Pearson's correlation), this assumption is intrinsic to the form of analysis itself. In other cases, it is possible to specify that a particular relationship is non-linear, but doing so requires deliberate action from the data analyst, and the types of non-linear relationship that can be specified are restricted. For example, multiple linear regression can accommodate some types of non-linear relationships between variables (e.g., polynomial relationships), but the data analyst must specify these as part of the model. Furthermore, only models where the outcome variable is a linear function of the parameters can be specified as linear regression models (this is why we call this mode of analysis "linear" regression).

When a relationship between variables is assumed to be linear but in fact is not, the applied statistical model clearly does not capture reality. Even if we accept that a linear regression model is an inaccurate simplification of reality and wish nevertheless to make inferences about the parameters of this model were it fit to the population, the presence of non-linearity will mean that the statistical assumption that the expected value of the errors is zero for all values of the predictors will be breached, implying that the estimation method may produce biased estimates of the population parameters. Admittedly, for some models an assumption of linearity is met by design: For example, if we estimate the effect of an experimentally manipulated binary variable on an outcome variable, it will obviously be possible for a straight line to perfectly connect the two group means. But in many situations—especially when we are trying to estimate effects of measured psychological variables on one another rather than estimating the effects of experimental manipulations—an implied assumption of linearity could well be false.

As an empirical example, consider a study aimed at estimating the effect of perfectionism on procrastination, with both attributes measured using self-report rating scales that we have numerically coded such that they each have a range of 1 to 10. If we fit a simple linear regression model with perfectionism as the predictor and procrastination as the outcome, then we are assuming that increasing perfectionism from 1 to 2 points has exactly the same effect on procrastination as increasing perfectionism from 2 to 3 points, or from 3 to 4 points, and so forth. But if the perfectionism scores are ordinal, this may not be plausible: After all, an ordinal scale is one where we have been unable to compare differences in levels of the attribute. Consequently, the size of the difference between two participants' numeric scores is largely an artefact of the rule we've used to code observations numerically, and we have no evidence that it bears any connection to the magnitudes of the differences in the underlying attribute (in this case, perfectionism).
As such, even if variation in the attribute underlying the predictor variable (perfectionism) has a completely linear effect on the outcome variable (procrastination), there is no strong reason to assume that there would be a linear relationship between the numeric scores.

Exacerbating this problem further is the possibility that, when we estimate the effect of one psychological attribute on another, the attribute underlying the predictor variable may itself not have a linear effect on the outcome variable of interest. After all, different scores on a psychological test may not necessarily represent different levels of some homogeneous quantitative attribute, but may instead represent the presence or absence of qualitatively different properties. Consider, for example, the difference between a person who has obtained an IQ score of 100 on the Wechsler Adult Intelligence Scale (WAIS-IV; Wechsler et al., 2008) and one who has received an IQ score of 120. These different IQ scores may reflect qualitative differences between the participants. For example, the second person may have elements of general knowledge that the first person does not, thus achieving a higher score on the Information subtest, or know how to apply the strategy of "chunking" digits so as to achieve a higher score on the Digit Span subtest. A person with an IQ score of 140 might have access to qualitatively different items of knowledge and cognitive skills again. The differences in "intelligence" between these individuals are not necessarily just differences on some homogeneous quantitative attribute, but rather—at least in part—the presence or absence of qualitatively different items of knowledge and cognitive skills. There may be little reason, then, to assume that each of these qualitative differences would have identical effects on another psychological attribute (e.g., job performance; Schmidt, 2002), despite the equal differences in numeric scores (100 to 120, 120 to 140). Differences in scores on a variable that does not represent varying magnitudes of a homogeneous quantitative attribute but rather qualitative differences in the properties of participants may result in such a variable having distinctly non-linear effects on other variables (see Figure 1).

Figure 1. Illustration of three types of effect. The first is a linear effect. The second is a quadratic effect—an effect that is not linear, but that can readily be specified within a linear regression framework. The third is a non-linear effect that takes the form of a segmented function, where the effect of the predictor variable itself changes abruptly as the predictor variable increases. This kind of effect is plausible when the predictor variable is ordinal, but cannot readily be accommodated within a linear regression framework (at least not without applying a piecewise model).

What can researchers do about this? Statistical analyses that permit the specification of non-linear relationships obviously do exist (e.g., Cleveland & Devlin, 1988). However, psychological theories are rarely specific enough to imply the functional form of relationships.
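As a concrete illustration of the third, segmented pattern in Figure 1, the following sketch (my own, using simulated data rather than any real perfectionism or IQ measures) fits a straight line to an ordinal predictor whose effect changes abruptly partway along the scale; the misfit shows up in the residuals-versus-fitted plot discussed below.

```r
# Simulated ordinal predictor (codes 1-10) with a segmented effect on the
# outcome: flat up to category 6, then a steep increase thereafter.
set.seed(42)
n <- 300
x <- sample(1:10, n, replace = TRUE)
true_effect <- ifelse(x <= 6, 1, 1 + 3 * (x - 6))
y <- true_effect + rnorm(n)

fit <- lm(y ~ x)                       # assumes a single linear slope throughout
summary(fit)$coefficients

# The unmodelled change-point leaves a systematic pattern in the residuals:
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```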
Non-linear models can be selected based on empirical data, but basing such model specification decisions on empirical data alone may (at least in the absence of cross-validation) risk overfitting—i.e., selecting overly complex non-linear models that do not generalise well outside the sample they are trained on (Babyak, 2004; Hawkins, 2004). In the field of statistical learning, this problem is known as the "bias-variance trade-off" (James et al., 2013, p. 33). If we apply a simple model which incorporates inaccurate assumptions (e.g., linear relationships), the resulting estimates may be substantially biased. Applying a more flexible model (e.g., a polynomial model) may reduce this bias, but at the cost of producing estimates that are more variable across datasets (e.g., overfitting). In the absence of a purely statistical solution, this problem may be addressed by the development of theory to be more specific about the functional form of relationships, as occurs in mathematical psychology (see Navarro, 2020).

Where models assuming linear relationships are applied, it is important to apply diagnostic procedures that can detect the presence of non-linearity. Such diagnostics may allow researchers to understand and communicate to readers the degree to which an assumption of linearity is a reasonable approximation of reality in the specified case, and the consequent degree to which additional uncertainty may surround the results. Although a detailed description of methods for detecting non-linearity in relationships is beyond the scope of this paper, perhaps the most well-known method is plotting residuals against predicted ("fitted") values to visually identify the presence of a non-linear pattern (see Gelman & Hill, 2007). More formal tests of non-linearity in the context of regression include the RESET test (Ramsey, 1969) and the rainbow test (Utts, 1982).

Conclusion

At this point it should be clear that I see little reason for contemporary researchers to rigidly follow Stevens' dictums about which statistical analyses are admissible with data of particular levels of measurement. A number of strong objections to Stevens' dictums have been raised in the methodological literature, of which perhaps the most fundamental is that his rules assume a goal on the part of the researcher to achieve a type of generalisation (inferences that apply across a class of coding rules) that may not be of interest to the researcher. Furthermore, most statistical tests do not directly require assumptions about levels of measurement. However, statistical assumptions and measurement-theoretic concerns do intersect in important ways [9].

[9] Sometimes this connection between measurement and analysis is very direct. Consider, for example, the "Matthew effect" in reading, where investigating the claimed phenomenon—compounding differences over time between stronger and weaker readers—clearly requires the empirical comparison of differences (i.e., an interval scale; see Protopapas et al., 2016).

My suggestion that contemporary researchers do not need to follow Stevens' rules exactly as he stated them should not be read as implying that researchers can safely set aside measurement-theoretic concerns. Indeed, much as Michell (1986) suggests, measurement-theoretic issues do have implications for statistical analysis, just not the simple implications proposed by Stevens. I suggest that researchers focus on whether the statistical assumptions of the analyses they wish to perform are consistent with the observations they have collected, considering in doing so how the plausibility of these assumptions may be affected by the level of measurement of the observations.
The assumptions that are made should be clearly communicated to readers and interrogated for plausibility, based both on a priori considerations (e.g., is it likely that an ordinal variable could have linear effects?) and empirical ones (e.g., to what extent is this set of observations consistent with a linear relationship?).

Author Contact

Correspondence regarding this article should be addressed to Matt Williams, School of Psychology, Massey University, Private Bag 102904, North Shore, Auckland, New Zealand. Email: M.N.Williams@massey.ac.nz. ORCID: https://orcid.org/0000-0002-0571-215X

Conflict of Interest and Funding

I report no conflicts of interest. This study did not receive any specific funding.

Author Contributions

I am the sole contributor to the content of this article.

Open Science Practices

This article earned no Open Science Badges because it is theoretical and does not contain any data or data analyses. However, the R code provided in the OSF project was fully reproducible with the given example data.

References

Altman, D. G., & Bland, J. M. (2009). Parametric v non-parametric methods for data analysis. BMJ, 338, a3167. https://doi.org/10.1136/bmj.a3167

Babyak, M. A. (2004). What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine, 66(3), 411–421. https://doi.org/10.1097/01.psy.0000127692.23278.a9

Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge University Press.

Bridgman, P. W. (1927). The logic of modern physics. Macmillan.

Bürkner, P.-C., & Vuorre, M. (2019). Ordinal regression models in psychology: A tutorial. Advances in Methods and Practices in Psychological Science, 2(1), 77–101. https://doi.org/10.1177/2515245918823199

Campbell, N. R. (1920). Physics: The elements. Cambridge University Press.

Carifio, J., & Perla, R. (2008). Resolving the 50-year debate around using and misusing Likert scales. Medical Education, 42(12), 1150–1152. https://doi.org/10.1111/j.1365-2923.2008.03172.x

Chang, H. (2009). Operationalism. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/archives/fall2009/entries/operationalism/

Cheung, F., & Lucas, R. E. (2014). Assessing the validity of single-item life satisfaction measures: Results from three large samples. Quality of Life Research, 23(10), 2809–2818. https://doi.org/10.1007/s11136-014-0726-4

Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403), 596–610. https://doi.org/10.1080/01621459.1988.10478639

Cliff, N. (1996). Answering ordinal questions with ordinal data using ordinal statistics. Multivariate Behavioral Research, 31(3), 331–350. https://doi.org/10.1207/s15327906mbr3103_4

Cozby, P. C., & Bates, S. C. (2015). Methods in behavioral research (12th ed.). McGraw-Hill.
Dougherty, C. (2007). Introduction to econometrics (3rd ed.). Oxford University Press.

Ferguson, A., Myers, C. S., Bartlett, R. J., Banister, H., Bartlett, F. C., Brown, W., Campbell, N. R., Craik, K. J. W., Drever, J., Guild, J., Houstoun, R. A., Irwin, J. O., Kaye, G. W. C., Philpott, S. J. F., Richardson, L. F., Shaxby, J. H., Smith, T., Thouless, R. H., & Tucker, W. S. (1940). Final report of the committee appointed to consider and report upon the possibility of quantitative estimates of sensory events. Report of the British Association for the Advancement of Science, 2, 331–349.

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Hand, D. J. (1996). Statistics and the theory of measurement. Journal of the Royal Statistical Society: Series A (Statistics in Society), 159(3), 445–492. https://doi.org/10.2307/2983326

Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1), 1–12. https://doi.org/10.1021/ci0342472

Heiman, G. W. (2001). Understanding research methods and statistics: An integrated introduction for psychology (2nd ed.). Houghton Mifflin.

Hesterberg, T., Moore, D. S., Monaghan, S., Clipson, A., & Epstein, R. (2002). Bootstrap methods and permutation tests. In D. S. Moore & G. P. McCabe (Eds.), Introduction to the practice of statistics (4th ed.). Freeman.

Hölder, O. (1901). Die axiome der quantität und die lehre vom mass. Teubner.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer.

Jamieson, S. (2004). Likert scales: How to (ab)use them. Medical Education, 38(12), 1217–1218. https://doi.org/10.1111/j.1365-2929.2004.02012.x

Judd, C. M., Smith, E. R., & Kidder, L. H. (1991). Research methods in social relations (6th ed.). Holt Rinehart and Winston.

Kaplan, R. M., & Saccuzzo, D. P. (2018). Psychological testing: Principles, applications, and issues (9th ed.). Cengage.

Krantz, D. H., Suppes, P., & Luce, R. D. (1971). Foundations of measurement: Additive and polynomial representations (Vol. 1). Academic Press.

Kuzon, W., Urbanchek, M., & McCabe, S. (1996). The seven deadly sins of statistical analysis. Annals of Plastic Surgery, 37, 265–272. https://doi.org/10.1097/00000637-199609000-00006

Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348. https://doi.org/10.1016/j.jesp.2018.08.009

Luce, R. D., Suppes, P., & Krantz, D. H. (1990). Foundations of measurement: Representation, axiomatization, and invariance. Academic Press.

McBurney, D. H. (1994). Research methods (3rd ed.). Brooks/Cole.

McGrane, J. A. (2015). Stevens' forgotten crossroads: The divergent measurement traditions in the physical and psychological sciences from the mid-twentieth century. Frontiers in Psychology, 6. https://doi.org/10.3389/fpsyg.2015.00431

Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100(3), 398–407. https://doi.org/10.1037/0033-2909.100.3.398

Michell, J. (1999). Measurement in psychology: A critical history of a methodological concept. Cambridge University Press.

Michell, J. (2007). Representational theory of measurement. In M. Boumans (Ed.), Measurement in economics: A handbook (pp. 19–39). Elsevier.
Michell, J. (2012). Alfred Binet and the concept of heterogeneous orders. Frontiers in Quantitative Psychology and Measurement, 3, 261. https://doi.org/10.3389/fpsyg.2012.00261

Michell, J., & Ernst, C. (1996). The axioms of quantity and the theory of measurement: Translated from part I of Otto Hölder's German text "Die axiome der quantität und die lehre vom mass." Journal of Mathematical Psychology, 40(3), 235–252. https://doi.org/10.1006/jmps.1996.0023

Navarro, D. (2020). If mathematical psychology did not exist we would need to invent it: A case study in cumulative theoretical development [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/ygbjp

Neuman, W. L. (2000). Social research methods (4th ed.). Allyn & Bacon.

Price, P. (2012). Research methods in psychology. Saylor Foundation.

Protopapas, A., Parrila, R., & Simos, P. G. (2016). In search of Matthew effects in reading. Journal of Learning Disabilities, 49(5), 499–514. https://doi.org/10.1177/0022219414559974

Ramsey, J. B. (1969). Tests for specification errors in classical linear least-squares regression analysis. Journal of the Royal Statistical Society: Series B (Methodological), 31(2), 350–371. https://doi.org/10.1111/j.2517-6161.1969.tb00796.x

Ray, W. J. (2000). Methods: Toward a science of behavior and experience (6th ed.). Wadsworth.

Schmidt, F. L. (2002). The role of general cognitive ability and job performance: Why there cannot be a debate. Human Performance, 15(1–2), 187–210. https://doi.org/10.1080/08959285.2002.9668091

Sherry, D. (2011). Thermoscopes, thermometers, and the foundations of measurement. Studies in History and Philosophy of Science Part A, 42(4), 509–524. https://doi.org/10.1016/j.shpsa.2011.07.001

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. https://doi.org/10.1126/science.103.2684.677

Sullivan, T. J. (2001). Methods of social research. Harcourt College Publishers.

Suppes, P., & Zinnes, J. L. (1962). Basic measurement theory. Stanford University.

Tal, E. (2017). Measurement in science. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/archives/fall2017/entries/measurement-science

Utts, J. M. (1982). The rainbow test for lack of fit in regression. Communications in Statistics - Theory and Methods, 11(24), 2801–2815. https://doi.org/10.1080/03610928208828423

Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65–72. https://doi.org/10.2307/2684788

Wechsler, D., Coalson, D. L., & Raiford, S. E. (2008). WAIS-IV technical and interpretive manual. Pearson.

Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PLOS ONE, 11(3), e0152719. https://doi.org/10.1371/journal.pone.0152719

Williams, M. N., Grajales, C. A. G., & Kurkiewicz, D. (2013). Assumptions of multiple regression: Correcting two misconceptions. Practical Assessment, Research & Evaluation, 18(11). https://scholarworks.umass.edu/pare/vol18/iss1/11/

Zigmond, A. S., & Snaith, R. P. (1983). The hospital anxiety and depression scale. Acta Psychiatrica Scandinavica, 67(6), 361–370. https://doi.org/10.1111/j.1600-0447.1983.tb09716.x

Zumbo, B. D., & Kroc, E. (2019). A measurement is a choice and Stevens' scales of measurement do not help make it: A response to Chalmers. Educational and Psychological Measurement, 79(6), 1184–1197. https://doi.org/10.1177/0013164419844305