Meta-Psychology, 2022, vol 6, MP.2019.1615
https://doi.org/10.15626/MP.2019.1615
Article type: Original article
Published under the CC-BY4.0 license
Open data: Not Applicable
Open materials: Yes
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: No
Edited by: Felix D. Schönbrodt
Reviewed by: Williams, M., Dienes, Z.
Analysis reproduced by: Counsell, A., Batinović, L.
All supplementary files can be accessed at OSF: https://doi.org/10.17605/OSF.IO/6WPN4

Testing ANOVA Replications by Means of the Prior Predictive p-Value

M.A.J. Zondervan-Zwijnenburg, Department of Methodology & Statistics, Utrecht University, The Netherlands
A.G.J. van de Schoot, Department of Methodology & Statistics, Utrecht University, The Netherlands; Optentia Research Program, North-West University, South Africa
H.J.A. Hoijtink, Department of Methodology & Statistics, Utrecht University, The Netherlands

Abstract

In the current study, we introduce the prior predictive p-value as a method to test replication of an analysis of variance (ANOVA). The prior predictive p-value is based on the prior predictive distribution. If we use the original study to compose the prior distribution, then the prior predictive distribution contains datasets that are expected given the original results. To determine whether the new data resulting from a replication study deviate from the data in the prior predictive distribution, we need to calculate a test statistic for each dataset. We propose to use F̄, which measures the degree to which the results of a dataset deviate from an inequality constrained hypothesis capturing the relevant features of the original study: H_RF. The inequality constraints in H_RF are based on the findings of the original study and can concern, for example, the ordering of means and interaction effects. The prior predictive p-value consequently tests to what degree the new data deviate from the data predicted given the original results, considering the relevant features of the original study. We explain the calculation of the prior predictive p-value step by step, elaborate on the topic of power, and illustrate the method with examples. The replication test and its integrated power and sample size calculator are made available in an R-package and an online interactive application. As such, the current study supports researchers who want to adhere to the call for replication studies in the field of psychology.

Keywords: ANOVA, comparison of means, power analysis, prior predictive p-value, replication study

Introduction

New studies conducted to replicate earlier original studies are often referred to as replication studies. After the latest “crisis in confidence” in the field of psychology, the call to conduct replication studies is stronger than ever (Anderson & Maxwell, 2016; Asendorpf et al., 2013; Cumming, 2014; Earp & Trafimow, 2015; Ledgerwood, 2014; Open Science Collaboration, 2012, 2015; Pashler & Wagenmakers, 2012; Schmidt, 2009; Verhagen & Wagenmakers, 2014), and large replication projects such as the Reproducibility Project Psychology (Open Science Collaboration, 2015), the Reproducibility Project: Cancer Biology (RP:CB; Errington et al., 2019), and the Many Labs projects (Ebersole et al., 2016; Klein et al., 2014; Klein et al., 2018) have been launched.
As a result, methodology on conducting replication studies has received increasing attention (see, for example, Anderson and Maxwell, 2016; Asendorpf et al., 2013; Brandt et al., 2014; Schmidt, 2009). There is, however, no standard methodology to determine whether a replication is successful or not (Open Science Collaboration, 2015).

The results of an original study are replicated when a new study corroborates the original findings. A common and intuitive method to assess whether a result is replicated is ‘vote-counting’: assessing whether the new effect is statistically significant and in the same direction as the significant effect in the original study (Anderson & Maxwell, 2016; Simonsohn, 2015). Vote-counting, however, has serious shortcomings. First of all, it is a dichotomous evaluation that does not take into account the magnitude of differences between the effect sizes of the original and new study (Asendorpf et al., 2013; Simonsohn, 2015). Secondly, each of the effect sizes being significant does not imply that both effect sizes are the same, nor does one significant effect and one non-significant effect imply that both effects are different (Gelman & Stern, 2006; Nieuwenhuis et al., 2011). Stated otherwise, vote-counting does not formally test whether a result is replicated (Anderson & Maxwell, 2016; Verhagen & Wagenmakers, 2014). Thirdly, underpowered replication studies are less likely to replicate significance, which can lead to misleading conclusions (Asendorpf et al., 2013; Cumming, 2008; Hedges & Olkin, 1980; Simonsohn, 2015).

In the current study, we address the following replication research question: “Does the new study fail to replicate relevant features of the original study?”. For example, the result of an original ANOVA study is: Group A > Group B > Group C. The reported finding can be: “Group A performs better than group B, which performs better than group C”; “Group A performs better than groups B and C”; or “Groups A and B perform better than group C”. The ‘relevant features’ evaluated by the replication test always have to be in line with the original result (i.e., Group A > Group B > Group C) for the test to function properly. If the purpose of the replication test is to put the theory proclaimed by the original study to the test, then the claims of the original study determine the exact relevant features to be evaluated. However, if there is reason to test another feature, it is possible to let the relevant features deviate from the claims in the original study. The relevant features of original studies will be captured in the form of an informative hypothesis (Hoijtink, 2012), which is specified using inequality constraints among the means of the ANOVA model. We propose to evaluate the replication of these hypotheses with the prior predictive p-value (Box, 1980).

The prior predictive p-value was not introduced to test replication. It was originally presented as a method to test whether the current data are unexpected given prior expectations concerning the parameter values of a statistical model. A disadvantage of the prior predictive check as a test of model fit is that it leaves undetermined whether the prior expectations about the parameter values or the model assumptions are incorrect.
Hence, as a model test the prior predictive check has been replaced by the posterior predictive check (Gelman et al., 1996), which does not make prior assumptions about expected parameter values, but instead uses the posterior results given the current data. With respect to testing replication, however, the prior predictive check is a good method for three reasons. First, instead of non-empirical prior expectations, we use the posterior distribution of the model parameters given the original data as the prior distribution. Consequently, we have a well-founded and clear-cut prior. Second, the prior predictive check uses a distribution of datasets (i.e., the prior predictive distribution) that are expected given the prior (i.e., the posterior of the original study). In this manner, the prior predictive distribution takes into account that results in a new dataset resulting from a replication study may deviate from the original results because of random variation instead of meaningful differences. According to our definition, a study replicates if the new dataset is drawn from the same population as the original dataset. Third, the prior predictive check uses a ‘relevant checking function’ for which we propose F̄ (Silvapulle & Sen, 2005, p. 38-39). The statistic F̄ captures the deviance from a constrained hypothesis that we base on the findings of the original study. As a result, we can check whether the new study significantly fails to replicate relevant features of the original study, while taking variation into account.

Table 1
Replication Research Questions and Methods to Address Them

Current study and similar questions:

Replication research question | Method | Setting | Reference
Does the new study fail to replicate relevant features of the original study? | Prior predictive p-value | t-test, ANOVA | Current study
Does the new study fail to replicate the effect size of the original study? | Confidence interval for difference in effect sizes | t-test, correlation | Anderson and Maxwell (2016)
  | Prediction interval | correlation | Patil et al. (2016)
Does the new study replicate the effect size of the original study? | Equivalence test | t-test | Anderson and Maxwell (2016)
  | Bayes factor | t-test | Verhagen and Wagenmakers (2014)
  | Bayes factor | ANOVA | Harms (2018)
  | Bayes factor | BF models (a) | Ly et al. (2018)

Other replication research questions:

Replication research question | Method | Setting | Reference
Is the effect present or absent in the replication study? | Bayes factor | t-test, correlation (b) | Marsman et al. (2017)
Is Cohen’s d in the population of a detectable size? | Telescope test | t-test (c) | Simonsohn (2015)
Is the original effect size extreme in comparison to the new study? | Confidence interval for difference in effect sizes | t-test, correlation | Open Science Collaboration (2015)
What is Cohen’s d in the population? | Confidence interval for average effect size | t-test | Anderson and Maxwell (2016)
What is the effect size (corrected for publication bias) in the population? | Hybrid meta-analysis | t-test | Van Aert and Van Assen (2017)

Note. (a) All models for which a Bayes factor can be computed. (b) The reconceptualization by Ly et al. (2018) generalizes to most common experimental designs. (c) The telescope test is explained in the t-test setting, but applicable to any model for which a power analysis can be conducted.
Table 1 shows how our research question and proposed method relate to other replication research questions and associated methods that have been proposed. Our method addresses a question similar to those in Anderson and Maxwell (2016), Harms (2018), Ly et al. (2018), Verhagen and Wagenmakers (2014) and Patil et al. (2016), but now enables researchers to evaluate the replication of relevant features of an original ANOVA study. The bottom panel of Table 1 shows other replication research questions that will not be pursued in this paper. The reader interested in these questions should consult the given references.

The goal of this paper is to introduce the prior predictive p-value as a method to test replication of relevant features of original ANOVA studies. In the first section, we provide a step by step introduction of the prior predictive p-value as included in the ANOVAreplication R-package and the online interactive application (see osf.io/6h8x3). In the second section, we discuss the statistical power of the prior predictive p-value. In the third section, we explain how to use and interpret the prior predictive p-value by means of a workflow. In the fourth section, we use several studies from the Reproducibility Project Psychology (Open Science Collaboration, 2012) to demonstrate the use of the prior predictive p-value. The paper ends with a discussion and conclusion section.

Prior Predictive p-Value

The evaluation of the replication of an ANOVA study by means of the prior predictive p-value (Box, 1980) consists of three steps that will be explained below.

Step 1: Prior Predictive Distribution of the Data

The ANOVA model is given by:

y_{ijd} = µ_{jd} + ε_{ijd},  ε_{ijd} ~ N(0, σ²_d),   (1)

where y_{ijd} is observation i = 1, ..., n_{jd} in group j = 1, ..., J for dataset d ∈ {o, r, sim}, where o denotes the original data, r denotes the new data, and sim denotes simulated data; the latter will be introduced towards the end of this section. Furthermore, µ_{jd} is the mean of group j in dataset d, ε_{ijd} is the error term, and σ²_d is the pooled variance over all J groups.

The original ANOVA results can be summarized in the posterior distribution of the parameters, g(µ_o, σ²_o | y_o), where µ_o = [µ_{1o}, ..., µ_{Jo}] and y_o includes all observations y_{ijo}:

g(µ_o, σ²_o | y_o) ∝ f(y_o | µ_o, σ²_o) h(µ_o, σ²_o),   (2)

where the density of the data is

f(y_o | µ_o, σ²_o) = ∏_{j=1}^{J} ∏_{i=1}^{n_{jo}} (1 / √(2πσ²_o)) exp( −(y_{ijo} − µ_{jo})² / (2σ²_o) ),   (3)

and the standard prior distribution is

h(µ_o, σ²_o) ∝ 1/σ²_o,   (4)

that is, a uniform prior on the means and Jeffreys prior on the variance. The prior distribution for the analysis of the original data is uninformative, that is, the posterior distribution is completely determined by the original data in order to match the results of the original study. If the original study used a Bayesian analysis, the priors should match those of the original study in order to reproduce the original study results. Given the observed original results, the prior distribution for future parameters is h(µ_r, σ²_r) = h(µ_sim, σ²_sim) = g(µ_o, σ²_o | y_o).
With the prior predictive p-value, we then test H_0: µ_r, σ²_r ~ h(µ_r, σ²_r). H_0 states that µ_r, σ²_r follow the distribution of the prior for µ_r, σ²_r. Loosely formulated, H_0 states that the parameters in the new data are in line with our expectations given the original results. To test H_0, we obtain datasets that are to be expected given the original data. Using this prior we simulate data y_sim that are to be expected given the results of the original study:

f(y_sim) = ∫ f(y_sim | µ_sim, σ²_sim) h(µ_sim, σ²_sim) dµ_sim dσ²_sim,   (5)

where f(y_sim) is the prior predictive distribution of the data. Note that f(y_sim | µ_sim, σ²_sim) is the counterpart of Equation 3 for dataset sim instead of o. Datasets y^t_sim for t = 1, ..., T, where T denotes the number of samples from the prior predictive distribution, are obtained by sampling µ^t_sim, σ²^t_sim from h(µ_sim, σ²_sim) = g(µ_o, σ²_o | y_o), and subsequently simulating y^t_sim from f(y_sim | µ^t_sim, σ²^t_sim) (cf. Equation 3). Datasets y^t_sim have sample sizes n_{1r}, ..., n_{Jr}, because the predicted data need to be compared to the new data y_r, which have sample sizes n_{1r}, ..., n_{Jr}. The steps in the following sections elaborate how the new data y_r can be compared to the T data matrices sampled from f(y_sim) that are to be expected given H_0, using a test statistic that evaluates relevant features of the original data.

Step 2: Test Statistic Evaluating Relevant Features

We propose to use F̄ (Silvapulle & Sen, 2005, p. 38-39) as a test statistic to evaluate how much the predicted data and the observed data deviate from an inequality constrained hypothesis capturing the relevant features of the original study, H_RF:

F̄_{y_d} = (RSS_{d,H_RF} − RSS_{d,H_u}) / S²_d,   (6)

where RSS_{d,H_u} denotes the residual sum of squares in dataset d ∈ {r, sim} for the unrestricted hypothesis H_u: µ_{1d}, ..., µ_{Jd},

RSS_{d,H_u} = ∑_{ij} (y_{ijd} − ȳ_{jd})²,   (7)

where ȳ_{jd} denotes the mean for group j in dataset d. S²_d denotes the mean squared error,

S²_d = RSS_{d,H_u} / (N − J),   (8)

where N = ∑_{j=1}^{J} n_{jr}, and

RSS_{d,H_RF} = ∑_{ij} (y_{ijd} − µ̃_{jd})²,   (9)

where

µ̃_d = [µ̃_{1d}, ..., µ̃_{Jd}] = argmin_{µ_d ∈ H_RF} ∑_{ij} (y_{ijd} − µ_{jd})².   (10)

µ̃_d thus contains the set of parameter estimates that minimize the residual sum of squares for y_d under the constraints imposed by H_RF. F̄_{y_d} is the scaled difference between the residual sum of squares for y_d under the constraints imposed by H_RF and the residual sum of squares for y_d under H_u. As H_u is unrestricted, F̄_{y_d} quantifies the misfit of y_d with H_RF (a small numerical sketch of this computation is given below).

The hypothesis capturing the relevant features of the original data, H_RF, is of the form Rµ_d > 0, where R is a K×J restriction matrix, J denotes the number of groups in the ANOVA study, and K the number of restrictions in H_RF, while µ_d is the mean vector of length J. Examples of constraints that can be applied under Rµ_r > 0 are:

• Simple order constraints: µ_{jd} > µ_{j'd}, or µ_{jd} < µ_{j'd}, for a pair j, j'.
• Interaction effects: (µ_{ABd} − µ_{AB'd}) > (µ_{A'Bd} − µ_{A'B'd}), for a 2×2 factorial design.

The constraints in H_RF should be based on the findings of the original study, which implies and requires that H_RF is always in agreement with the results of the original study (i.e., F̄_{y_o} = 0). The results of the original study alone are usually not enough to determine which H_RF is to be evaluated. For example, an original study shows that ȳ_{1o} < ȳ_{2o} < ȳ_{3o}. This finding may lead to H_RF: µ_{1d} < µ_{2d} < µ_{3d}, but also to H_RF: (µ_{1d}, µ_{2d}) < µ_{3d} or H_RF: µ_{1d} < (µ_{2d}, µ_{3d}).
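To make Equations 6–10 concrete, the sketch below computes F̄ in R for a simple order constraint µ_{1d} ≤ µ_{2d} ≤ ... ≤ µ_{Jd}. It is a minimal illustration under that assumption only, and not the routine implemented in the ANOVAreplication package: the constrained estimates of Equation 10 are obtained here by weighted isotonic regression (pool-adjacent-violators), which holds for simple orderings but not for arbitrary Rµ_d > 0, and the helper names pava and fbar are ours.

```r
# Minimal sketch of the F-bar statistic (Equations 6-10) for a simple order
# constraint mu_1 <= mu_2 <= ... <= mu_J. Helper names are ours.

pava <- function(means, weights) {
  # Weighted pool-adjacent-violators: non-decreasing least-squares fit to the
  # group means, which solves Equation 10 for a simple order constraint.
  val <- as.numeric(means); w <- as.numeric(weights); len <- rep(1, length(val))
  i <- 1
  while (i < length(val)) {
    if (val[i] > val[i + 1]) {                 # adjacent violation: pool the two blocks
      val[i] <- (w[i] * val[i] + w[i + 1] * val[i + 1]) / (w[i] + w[i + 1])
      w[i]   <- w[i] + w[i + 1]
      len[i] <- len[i] + len[i + 1]
      val <- val[-(i + 1)]; w <- w[-(i + 1)]; len <- len[-(i + 1)]
      i <- max(i - 1, 1)                       # re-check the preceding pair
    } else {
      i <- i + 1
    }
  }
  rep(val, times = len)                        # one fitted value per group
}

fbar <- function(y, g) {
  # F-bar (Equation 6) for data y with group labels g, where the groups appear
  # in the hypothesized increasing order of the means.
  g <- factor(g, levels = unique(g))
  n <- tabulate(g)
  m <- tapply(y, g, mean)                      # unconstrained group means
  rss_u  <- sum((y - m[as.integer(g)])^2)      # Equation 7: RSS under H_u
  s2     <- rss_u / (length(y) - nlevels(g))   # Equation 8: mean squared error
  m_con  <- pava(m, n)                         # Equation 10: constrained means
  rss_rf <- rss_u + sum(n * (m - m_con)^2)     # Equation 9: RSS under H_RF
  (rss_rf - rss_u) / s2                        # Equation 6
}

# Example: three groups of 20. F-bar is 0 when the sample means already satisfy
# the hypothesized order and grows as the observed ordering deviates from it.
set.seed(123)
y <- c(rnorm(20, 0.0), rnorm(20, 0.3), rnorm(20, 0.6))
g <- rep(c("g1", "g2", "g3"), each = 20)
fbar(y, g)
```

For hypotheses such as (µ_{1d}, µ_{2d}) < µ_{3d} or interaction constraints, Equation 10 becomes a general inequality-constrained least-squares problem that the simple pool-adjacent-violators step above does not cover.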
Which exact features should be covered in H_RF can be guided by the conclusions of the original study. For example, if in the original study it is concluded that a treatment condition leads to better outcomes than two control conditions, the most logical specification of the relevant features is H_RF: (µ_{controlA,d}, µ_{controlB,d}) < µ_{treatment,d}. Alternatively, if in the original study it is concluded that treatment A is better than treatment B, which is better than the control condition, a logical relevant feature hypothesis would be H_RF: µ_{treatmentA,d} > µ_{treatmentB,d} > µ_{control,d}. It may also occur that the researcher conducting the replication test has an interest in evaluating a claim that is not made in the original study, but could be made based on its results. In all cases, the researcher conducting the replication test should substantiate the choices made in the formulation of H_RF with results from the original study. It is good practice to also pre-register H_RF. In the Examples section, we demonstrate for two studies how the original study is linked to H_RF. First, however, we explain how the prior predictive p-value is calculated.

Step 3: p-value

The third and final step is to compute the prior predictive p-value. When we calculate F̄_{y^t_sim} for each dataset y^t_sim obtained in Step 1, with F̄ as defined in Step 2, a sampling-based representation of the prior predictive distribution of the test statistic, f(F̄_{y_sim}), is obtained. Consequently,

p = P(F̄_{y_sim} ≥ F̄_{y_r} | H_0) = (1/T) ∑_{t=1}^{T} I(F̄_{y^t_sim} ≥ F̄_{y_r}),   (11)

where H_0 denotes “Replication”, that is, H_0: µ_r, σ²_r ~ h(µ_r, σ²_r). Furthermore, I is an indicator function that takes on the value 1 if the argument is true and 0 otherwise.

As illustrated in Figure 1, the prior predictive p-value indicates how exceptional the observed statistic for the new data, F̄_{y_r}, is compared to its prior predictive distribution f(F̄_{y_sim}). The shaded area on the right side of F̄_{y_r} is P(F̄_{y_sim} ≥ F̄_{y_r} | H_0), that is, the prior predictive p-value. If the prior predictive p-value is significant, we reject replication of the relevant features of the original study by the new data. Note that the focus is on rejecting replication of the original results and not on rejecting H_RF in itself for the new study.¹

Figure 1. An illustration of the prior predictive p-value. [Histogram of f(F̄_{y_sim}) with the observed F̄_{y_r} marked and the tail area P(F̄_{y_sim} ≥ F̄_{y_r} | H_0) shaded.]

¹ To test H_RF itself we recommend Hoijtink et al. (2019) and Vanbrabant et al. (2015).

Uniformity. To determine the significance of a p-value by comparing it to some preselected value α, the p-value needs to be uniformly distributed if H_0 is true. Only when the p-value is uniform is α equal to the nominal Type I error rate. We will demonstrate that this is true for the prior predictive p-value if f(F̄_{y_sim}) is continuous, and that it is true up to some α_0 if f(F̄_{y_sim}) is discrete. A p-value is uniform if:

P(p ≤ α | H_0) ≤ α for all α ∈ [0, 1],   (12)

where p denotes a p-value from f(p | H_0), that is, the null distribution of the p-values. The following three steps prove that Equation 12 holds for the prior predictive p-value when f(F̄_{y_sim}) is continuous:

1. P(p < α | H_0) = P(F̄_{y_r} > F̄_{y_sim,1−α} | H_0), where F̄_{y_r} is the test statistic rendering p via p = P(F̄_{y_sim} > F̄_{y_r} | H_0), and F̄_{y_sim,1−α} is the (1−α) quantile of the distribution f(F̄_{y_sim} | H_0).

2. P(F̄_{y_r} > F̄_{y_sim,1−α} | H_0) = ∫_{F̄_{y_r} > F̄_{y_sim,1−α}} f(F̄_{y_r} | H_0) dF̄_{y_r}, where f(F̄_{y_r} | H_0) denotes the distribution of F̄_{y_r} under H_0.
3. For the situations considered in this paper it holds that f(F̄_{y_r} | H_0) = f(F̄_{y_sim}), therefore ∫_{F̄_{y_r} > F̄_{y_sim,1−α}} f(F̄_{y_r} | H_0) dF̄_{y_r} = ∫_{F̄_{y_sim} > F̄_{y_sim,1−α}} f(F̄_{y_sim}) dF̄_{y_sim} = α, which completes the proof.

With constraints of the form Rµ_r > 0, however, f(F̄_{y_sim}) will often be discrete. When f(F̄_{y_sim}) is discrete, the prior predictive p-value is not uniform for all α ∈ [0, 1]. For example, let us obtain g(µ_o, σ²_o | y_o) = h(µ_r, σ²_r) for an original study with ȳ_{1o} = 1, ȳ_{2o} = 2, ȳ_{3o} = 3, s²_o = 5, and n_{jo} = 50, with n_{jr} = 50 and H_RF: µ_{1r} < µ_{2r} < µ_{3r}. Subsequently, we simulate y^t_r for t = 1, ..., 100,000, and calculate the prior predictive p-value for each y^t_r. The result is f(p | H_0), which is plotted in Figure 2a. In Figure 2a, we see a thick vertical line that indicates a set of p-values with exactly the same value, namely 1.00. This set of equal p-values results from the fact that H_RF: µ_{1r} < µ_{2r} < µ_{3r} is true for a substantial number of datasets y^t_r, causing the associated F̄_{y^t_r} to be exactly equal to 0 and the associated prior predictive p-values to be exactly equal to 1 (see Figure 2b). Generally, however, there exists an α_0 up to which f(p | H_0) is uniform (Meng, 1994), since all values in f(F̄_{y_sim}) other than 0 will occur in a continuous fashion. Thus, the p-value is uniform for α ∈ [0, α_0]. If the preselected α < α_0, α is equal to the nominal Type I error rate. α_0 can be computed as 1 − P(F̄_{y_sim} = 0). For example, α_0 ≥ .05 if no more than 95% of the F̄_{y_sim} values are exactly 0. It would be exceptional if more than 95% of the F̄_{y_sim} values were 0, but it could occur with extremely low power in the original study and an unspecific H_RF. A visualization of f(F̄_{y_sim}) can help to roughly estimate α_0. For the discrete f(F̄_{y_sim}) considered here, 53% of the F̄_{y_sim} values equal 0 and α_0 = .47 (Figure 2b).

Figure 2. Uniformity of the prior predictive p-value for H_RF: µ_{1r} < µ_{2r} < µ_{3r}. Panel (a): f(p | H_0). Panel (b): f(F̄_{y_sim}).

In the next section, we deal with another important property of null hypothesis significance testing methods: power.

Power

Power is the probability of rejecting the null hypothesis (of replication) at a preselected α when not the null hypothesis, but an alternative hypothesis, is true. Researchers typically pursue a power of .80. Let us denote power by γ:

γ = P(p < α | H_a) = P(F̄_{y_r} > F̄_{y_sim,1−α} | H_a),   (13)

where H_a is the population under the alternative hypothesis for which replication is to be rejected. Note that any population for which H_0 is not true can qualify to reject replication. The population used is determined by the theoretical context in which the replication test takes place. The population with µ_{1a} = ... = µ_{Ja} is a special population that is generally considered to display a non-effect in ANOVA studies. Hence, µ_{1a} = ... = µ_{Ja} seems a natural default choice for the population under the alternative hypothesis. As a best guess for µ_{ja} and σ²_a in a power analysis, the grand mean ȳ_o and variance σ²_o of the original study can be used. The population under the alternative hypothesis with µ_{1a} = ... = µ_{Ja} is on the edge of H_RF: it deviates minimally from H_RF; hence, the associated γ is a lower limit. Power will increase when the population under the alternative hypothesis differs more from H_RF than the population with equal means does, for example, when the means are ordered differently.
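Before turning to the simulation study, the sketch below ties Steps 1–3 and Equation 13 together for the hypothetical original study used above (three groups of 50 with means 1, 2, 3 and pooled variance 5, H_RF: µ_{1r} < µ_{2r} < µ_{3r}). It reuses the fbar helper defined earlier; the posterior draws follow from the reference prior of Equation 4 (σ² drawn as RSS_o/χ²_{N_o−J}, group means normal given σ²). All names and the illustrative new data are our own and not output of the ANOVAreplication package.

```r
# Prior predictive distribution of F-bar (Step 1 + Step 2), the p-value for a
# new data set (Step 3), and the power of Equation 13 against H_a with equal
# means. Uses fbar() from the earlier sketch; numbers are illustrative.

set.seed(2022)
T_draws <- 10000
n_o    <- c(50, 50, 50); ybar_o <- c(1, 2, 3); s2_o <- 5   # original-study summary
n_r    <- c(50, 50, 50)                                    # new-study group sizes
g_r    <- rep(c("g1", "g2", "g3"), times = n_r)
df_o   <- sum(n_o) - length(n_o)                           # N_o - J

# Step 1: draw (mu, sigma2) from g(mu_o, sigma2_o | y_o) and simulate predicted data.
fbar_sim <- replicate(T_draws, {
  sigma2 <- (df_o * s2_o) / rchisq(1, df_o)                # scaled inverse chi-square
  mu     <- rnorm(3, mean = ybar_o, sd = sqrt(sigma2 / n_o))
  y_sim  <- rnorm(sum(n_r), mean = rep(mu, times = n_r), sd = sqrt(sigma2))
  fbar(y_sim, g_r)                                         # Step 2 on the predicted data
})

# Step 3: prior predictive p-value for an (illustrative) observed new data set.
y_r <- rnorm(sum(n_r), mean = rep(c(1.2, 1.9, 2.9), times = n_r), sd = sqrt(5))
p_value <- mean(fbar_sim >= fbar(y_r, g_r))

# Equation 13: power against H_a with all means equal to the original grand mean.
crit   <- quantile(fbar_sim, probs = 0.95)                 # F-bar_{y_sim, 1 - alpha}, alpha = .05
fbar_a <- replicate(T_draws, fbar(rnorm(sum(n_r), mean = 2, sd = sqrt(5)), g_r))
gamma  <- mean(fbar_a > crit)

c(p_value = p_value, power = gamma)
```

With these settings a large share of fbar_sim is exactly 0, which reflects the discreteness of f(F̄_{y_sim}) discussed above.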
Simulation Study

To illustrate the power of the prior predictive p-value, we conducted a simulation study in which we varied the effect size in the original study, f_o, the sample size per group in the original study, n_{jo}, the sample size per group in the new study, n_{jr}, the relevant feature of interest, H_RF, and the population under the alternative hypothesis, H_a, as specified in Table 2. For each cell in the simulation study, 10,000 samples were drawn from H_a and power was calculated according to Equation 13.

Table 2
Simulation Sample Statistics for Original Study and Population Values under Ha

        y_o                                  |    H_a
 f_o    ȳ_1o    ȳ_2o    ȳ_3o    s²_o         |    f_a    µ_1a    µ_2a     µ_3a    σ²_a
 .10    -0.12   0.00    0.12    1.00         |    0      0.00    0.00     0.00    1.00
 .25    -0.31   0.00    0.31    1.00         |    .10    0.00    -0.24    0.00    1.00
 .40    -0.49   0.00    0.49    1.00         |

Note. Effect size f as introduced by Cohen (1988, p. 274-275). Other simulation factors: n_{jd} ∈ {20, 50, 100}; H_RF1: µ_{1d} < (µ_{2d}, µ_{3d}); H_RF2: µ_{1d} < µ_{2d} < µ_{3d}.

The results of the simulation study are provided in Table 3. As expected, power generally increases with increasing effect sizes, increasing sample sizes, and increasing deviation between y_o and H_a. There are, however, some exceptions. With small f_o and low n_{jo}, larger n_{jr} only emphasize the noise in the original study more and do not lead to an increase in power. Similarly, a more specific H_RF does not always increase power. Given original studies with smaller samples and smaller effect sizes, h(µ_r, σ²_r) is so uninformative that more specific H_RF are only more inaccurate under H_0, and F̄_{y_r} needs to be extremely large to reject the null.

Table 3
Power

                 H_RF1, H_a1        H_RF2, H_a1        H_RF1, H_a2        H_RF2, H_a2
                 n_jr               n_jr               n_jr               n_jr
 f_o   n_jo      20   50   100      20   50   100      20   50   100      20   50   100
 .10   20       .03  .01  .00      .02  .00  .00      .11  .08  .05      .08  .04  .02
 .10   50       .08  .06  .03      .06  .04  .02      .21  .31  .36      .19  .22  .28
 .10   100      .11  .12  .10      .09  .10  .08      .26  .45  .62      .25  .39  .52
 .25   20       .13  .10  .06      .09  .05  .01      .33  .40  .45      .25  .26  .23
 .25   50       .25  .32  .37      .20  .26  .26      .48  .73  .88      .43  .62  .81
 .25   100      .30  .46  .57      .29  .44  .55      .54  .83  .96      .53  .79  .94
 .40   20       .32  .41  .41      .27  .26  .21      .59  .78  .89      .49  .64  .75
 .40   50       .49  .66  .67      .45  .68  .83      .74  .93  .98      .69  .93  .99
 .40   100      .55  .66  .67      .57  .83  .83      .77  .93  .97      .77  .98  .99

Note. Text in cells with γ ≥ .80 is boldface. Text in cells with a maximum γ in relation to the specific H_RF, H_a combination is italic.

Table 3 also shows that the power on the edge (i.e., the power for H_a1) is insufficient for original studies with small and medium effect sizes (γ < .60 in all cells). With medium f_o, power is only sufficient if the new study originates from a population in which the means are ordered differently (e.g., H_a2). For original studies with large effect sizes and at least 50 participants per group in the original study, power can be sufficient under H_a1. Power levels off, however, for H_RF1 and H_RF2 at .67 and .83, respectively. Under µ_{1a} = µ_{2a} = µ_{3a}, H_RF1: µ_{1r} < (µ_{2r}, µ_{3r}) is true in 1/3 of the situations by chance. Consequently, power cannot exceed 1 − 1/3 = .67. For H_RF2: µ_{1r} < µ_{2r} < µ_{3r}, 1/6 of the combinations under H_a1 are in line with replication by chance. Hence, power cannot exceed 1 − 1/6 = .83. If we move further from the edge of H_RF, as we do with H_a2, power increases. Thus, the power of the prior predictive p-value considering an H_RF with three or fewer order constraints will almost never be high if the true means are equal, but can be high if the true ordering differs from the one in H_RF.
The results demonstrate that imprecise estimates (i.e., large standard errors leading to a weakly informative prior) in the original study lead to low power, especially on the edge of H_RF. This is as true for the prior predictive p-value as it is for other approaches. For example, in a classical ANOVA study with three groups of 20 participants each, power is below .10, .40, and .80 for small, medium, and large effect sizes, respectively; a result that was already pointed out by Cohen (1988, p. 313). Zondervan-Zwijnenburg and Rijshouwer (2020) demonstrate the application of different methods to evaluate replication within the context of small samples. Not a single method is unaffected by small sample sizes. As highlighted by Morey and Lakens (2019) and Patil et al. (2016): replication can only be rejected based on the findings of the original study, and when these findings are highly imprecise due to large standard deviations and small sample sizes, rejecting them is hard or even impossible.

Underpowered original studies may result in non-significant prior predictive p-values that have a high probability of being Type II errors (Morey & Lakens, 2019). Therefore, only reporting the prior predictive p-value is not enough: the probability of a Type II error (i.e., 1 − γ) given the population under the alternative hypothesis should be communicated to the reader as well. The next section elaborates on the computation of power and of the sample size required for sufficient statistical power. The Workflow and Examples sections explain how researchers should incorporate prior predictive p-values and power. One of the examples will also demonstrate rejected replication despite low power on the edge of H_0.

Power and Sample Size Determination

As highlighted in the previous sections and in the literature (e.g., Brandt et al., 2014; Simonsohn, 2015), power is an important characteristic of a convincing replication study. It is thus important that researchers can calculate the power of the prior predictive check, and can determine the sample size for a new study such that the replication test has high statistical power. Therefore, the ANOVAreplication R-package and the online interactive application (see osf.io/6h8x3) include a power and sample size calculator.

Given the vector with group sample sizes in the new study, n_r, as well as h(µ_r, σ²_r), H_a, H_RF, and α, the power γ is calculated as follows:

1. Following Steps 1 and 2 of the prior predictive check, t = 1, ..., T datasets are simulated and their F̄ values are computed, which yields f(F̄_{y_sim}) and the critical value F̄_{y_sim,1−α}.

2. Given µ_a and σ_a, t = 1, ..., T datasets with sample sizes n_r are simulated from the population under H_a. Following Step 2 of the prior predictive check, F̄_{y_r} is calculated for each of these datasets.

3. γ = P(F̄_{y_r} > F̄_{y_sim,1−α} | H_a) = (1/T) ∑_{t=1}^{T} I(F̄_{y^t_r} ≥ F̄_{y_sim,1−α}).

As a default choice for µ_a, we recommend using ȳ_o for each group. With this setting, the power to reject replication in case of equal group means is calculated. As a default choice for σ_a, we recommend the pooled standard deviation of the original study.

To determine the required sample size to reject replication with sufficient power, we use an iterative procedure.
In addition to h(µ_r, σ²_r), H_a, H_RF, and α, we use the following information to calculate the required sample size: a target power level γ̃; a small margin of acceptable values around the target power, γ_margin, because the calculated power may not be exactly equal to the target power; a starting value for the group sample size, n_{jr_0}; a maximum number of iterations, Q_max; and a maximum total sample size for the new study, N_{r_max}. Our default values are: γ̃ = .825, γ_margin = .025, α = .05, n_{jr_0} = 20, Q_max = 10, and N_{r_max} = 600.

1. In every iteration q, γ_q is calculated given n_{jr_q}.

2. When q > 1, n_{jr_{q+1}} is determined by regressing {γ_1, ..., γ_q} on {n_{jr_1}, ..., n_{jr_q}} with a linear or quadratic (only if q = 3) function. In case of a linear regression, the regression coefficient β_1 is the power increase per subject, and n_{jr_{q+1}} = (γ̃ − γ_q)/β_1 + n_{jr_q}. In case of regression with a quadratic function, n_{jr_{q+1}} is calculated by solving the polynomial γ̃ = β_0 + β_1 n_{jr_{q+1}} + β_2 n²_{jr_{q+1}}.

3. Steps (1) and (2) are repeated until γ_q ∈ [γ̃ − γ_margin, γ̃ + γ_margin] (i.e., power is sufficient), or γ_{q−1} ≈ γ_q (i.e., power no longer increases up to two decimal places), or n_{jr_{q−1}} = n_{jr_q} (i.e., the sample size no longer changes), or q = Q_max, or the total sample size ∑_{j=1}^{J} n_{jr_q} reaches N_{r_max}.

Workflow

To clarify the procedure for obtaining the prior predictive p-value, the workflow is depicted in Figure 3.

Step 1. The first steps (1a-1c) only require the original study. Step 1a is to derive the relevant feature to be evaluated in the test statistic from the findings of the original study. Step 1b is to define the population for which replication should be rejected (i.e., H_a): what is the ordering of the means in this population and what is the effect size in that ordering? H_a can be a population in which all means are equal, but it does not have to be. Step 1c is to obtain the data of the original study, or to reconstruct the data based on the reported means, standard deviations, and group sample sizes (a small sketch of such a reconstruction is given at the end of this section). If the new study has not yet been conducted, the second step is to calculate the required sample size per group for the new study to reject replication with sufficient power (i.e., γ).

Step 2. The sample size calculation can be conducted with the sample.size.calc function in the ANOVAreplication package. If the function cannot find a (reasonable) group sample size for which γ is sufficient, this implies that the original study is not suited for replication testing with the prior predictive p-value for the specified H_a: its conclusions are too vague (i.e., the standard errors are too wide) to reject replication if H_a is true. There is still a chance that the prior predictive p-value turns out significant, especially if the observed data are more extreme than most samples from H_a, but the researcher should consider whether collecting data with such a low probability of a meaningful result is ethically acceptable.

Step 3. As a third step, the prior predictive p-value can be computed with the function prior.predictive.check. The power associated with the sample size of the new study can be calculated with power.calc. Note that this is not a post-hoc power analysis, as the definition of H_a is unrelated to the new study. Hence, the power to reject replication for H_a can be insufficient (i.e., smaller than 1 − β, with β the preset Type II error rate), while the prior predictive p-value is statistically significant, or vice versa.
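Workflow Step 1c requires the original data or a reconstruction of them from the reported summary statistics. A minimal way to generate data whose group means and standard deviations exactly match reported values is sketched below, using the summary statistics of Fischer et al. (2008) listed in Table 4 of the Examples section; the helper name make_group_data is ours, and this is not necessarily how the ANOVAreplication package generates such data.

```r
# Reconstruct a data set whose sample means and SDs exactly equal reported
# values (Workflow Step 1c). Helper name is ours; the group summaries are
# those of Fischer et al. (2008) as listed in Table 4.

make_group_data <- function(n, m, s) {
  z <- rnorm(n)
  z <- (z - mean(z)) / sd(z)    # standardize the draw to mean 0 and SD 1
  m + s * z                     # rescale to the reported mean and SD
}

set.seed(42)
y_o <- c(make_group_data(28,  0.36, 1.08),   # low self-regulation
         make_group_data(28, -0.19, 0.53),   # high self-regulation
         make_group_data(28, -0.18, 0.81))   # ego-threatened
g_o <- rep(c("low", "high", "ego"), each = 28)
round(tapply(y_o, g_o, mean), 2)             # reproduces the reported group means
```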
Figure 3 assists in interpreting the resulting p-value, considering the statistical power to reject replication for H_a, unless F̄_{y_r} is exactly 0. If the new study perfectly meets the features of the original study as described in H_RF, F̄_{y_r} will be 0 and the prior predictive p-value 1.00. In such a case, we confirm replication of the relevant features of the original study as captured in H_RF, irrespective of power. Theoretically it is possible that F̄_{y_r} = 0 while the new study is an extreme sample from a population in which H_RF is not true. That, however, is not under consideration here, as our question was whether the observed new study replicates, or fails to replicate, relevant features of the original study.

Figure 3. The prior predictive p-value workflow.

In case of a non-significant result in combination with low power, the researcher should emphasize the probability that not rejecting replication is a Type II error, and it is advised to conduct a replication study with larger n_{jr}. The required sample size per group can again be calculated with the sample.size.calc function in the ANOVAreplication package. If the required n_{jr} is excessive given H_a, it may be an inevitable conclusion that the original study is not suited for replication testing by means of the prior predictive p-value. If replication is rejected despite low power, this implies that the observed new dataset deviates more from H_RF than most datasets under H_a. With sufficient statistical power, it is still informative to notify the reader of the achieved power and/or the probability of a Type II error given the population under H_a.

Examples

To illustrate the use of the prior predictive check to assess whether relevant ANOVA features are replicated, we selected two replication studies that were part of the Reproducibility Project Psychology initiated by the Open Science Collaboration (2012, 2015). All calculations can be performed with the ANOVAreplication R-package (Zondervan-Zwijnenburg, 2018).

The first study is Fischer et al. (2008), who studied the impact of self-regulation resources on confirmatory information processing. According to the theory, people who have low self-regulation resources (i.e., depleted participants) will prefer information that matches their initial standpoint. An ego-threat condition was added, because the literature proposes that ego-threat affects decision-relevant information processing, although the direction of this effect is not clear.
To determine which relevant feature of the results (see Table 4) should be tested for replication, we follow the original findings: “Planned contrasts revealed that the confirmatory information processing tendencies of participants with reduced self-regulation resources [...] were stronger than those of nondepleted [...] and ego threatened participants [...]” (Fischer et al., 2008, p. 387). This translates to H_RF: µ_{low self-regulation,r} > (µ_{high self-regulation,r}, µ_{ego-threatened,r}) (Workflow Step 1a). We want to reject replication when all means in the population are equal, that is, H_a: µ_{low self-regulation,r} = µ_{high self-regulation,r} = µ_{ego-threatened,r} (Workflow Step 1b). We simulate the original data based on the means, standard deviations, and sample sizes reported in Fischer et al. (2008) (Workflow Step 1c). As the replication study has already been conducted by Galliani (2015) (see Table 4 for the results), we do not calculate the required sample size to test replication (Workflow Step 2), and proceed to calculate the prior predictive p-value and the power of the replication test (Workflow Step 3).

Table 4
Descriptive Statistics for Confirmatory Information Processing from the Original Study (Fischer et al., 2008) and the New Study (Galliani, 2015)

            Low self-regulation     High self-regulation     Ego-threatened
 Study      n     M (SD)            n     M (SD)             n     M (SD)
 Original   28*   0.36 (1.08)       28*   -0.19 (0.53)       28*   -0.18 (0.81)
 New        48    -0.07 (0.45)      47    -0.05 (0.47)       45    0.13 (0.64)

* Only the total sample size of 85 was provided in Fischer et al. (2008).

The resulting prior predictive p-value was .003 with γ = .66, indicating that we reject replication, despite limited power. The ordering in the new data by Galliani (2015) results in an extreme F̄ score compared to the predicted data. Figure 4 illustrates this conclusion: over 90% of the predicted datasets score perfectly in line with H_RF, but the new study by Galliani (2015) deviates from H_RF and scores in the extreme 0.3% of the predicted data. The replication of the original study conclusions is thus rejected.

Figure 4. The prior predictive p-value for the replication of Fischer et al. (2008) by Galliani (2015). The histogram bars represent F̄ for the predicted data. The thick line on the left represents F̄ for the predicted data that are exactly 0 (i.e., over 90% of the total), whereas the red line represents F̄ for Galliani (2015).

The second study is Janiszewski and Uy (2008), who studied numerical judgements in five experiments. More specifically, they studied the impact of the precision of an anchor, and of the motivation to adjust away from the anchor, on judgement bias. The group means, standard deviations, and sample sizes of experiment 4a in the original study by Janiszewski and Uy (2008) and the replication study by Chandler (2015) are provided in Table 5.

Table 5
Z-Scores of Participants’ Mean Estimates from the Original Study (Janiszewski & Uy, 2008) and the New Study (Chandler, 2015)

            Low Motivation to Adjust                 High Motivation to Adjust
            Precise Anchor      Rounded Anchor       Precise Anchor      Rounded Anchor
 Study      n     M (SD)        n     M (SD)         n     M (SD)        n     M (SD)
 Original   14    -0.76 (0.17)  15    -0.23 (0.48)   15    -0.04 (0.28)  15    0.98 (0.41)
 New        30    -0.35 (0.23)  30    -0.18 (0.37)   30    0.20 (0.34)   30    0.35 (0.44)

Based on these results, Janiszewski and Uy (2008) draw two conclusions. “First, a precise anchor results in less adjustment than a rounded anchor” (p. 126). For experiment 4a, which was replicated by Chandler (2015), this conclusion translates to H_RF: (µ_{low motivation,round,r} > µ_{low motivation,precise,r}) & (µ_{high motivation,round,r} > µ_{high motivation,precise,r}) (Workflow Step 1a). We want to reject replication when all means in the population are equal, that is, H_a: µ_{low motivation,round,r} = µ_{low motivation,precise,r} = µ_{high motivation,round,r} = µ_{high motivation,precise,r} (Workflow Step 1b).
We simulate the original data based on the means, standard deviations, and sample sizes reported in Janiszewski and Uy (2008) (Workflow Step 1c). As the replication study has already been conducted by Chandler (2015), we do not calculate the required sample size to test replication (Workflow Step 2), and proceed to calculate the prior predictive p-value and the power of the replication test (Workflow Step 3). The resulting prior predictive p-value is 1.00. The data obtained by Chandler (2015) were perfectly in line with the H_RF describing the effect as observed by Janiszewski and Uy (2008). Therefore, we do not have further concerns about the obtained power. Hence, we conclude that the results of Janiszewski and Uy (2008) with respect to H_RF: (µ_{low motivation,round,r} > µ_{low motivation,precise,r}) & (µ_{high motivation,round,r} > µ_{high motivation,precise,r}) are replicated by Chandler (2015).

The other conclusion that Janiszewski and Uy (2008) draw concerns the presence of an interaction effect of adjustment motivation and anchor rounding: “The difference in the amount of adjustment between the rounded- and precise-anchor conditions increased as the motivation to adjust went from low [...] to high” (p. 125). The results and conclusions of Janiszewski and Uy with respect to experiment 4a translate to H_RF: (µ_{low motivation,round,r} > µ_{low motivation,precise,r}) & (µ_{high motivation,round,r} > µ_{high motivation,precise,r}) & (µ_{low motivation,round,r} − µ_{low motivation,precise,r}) < (µ_{high motivation,round,r} − µ_{high motivation,precise,r}). The prior predictive p-value related to this H_RF is .014 with γ = .87. Thus, we reject replication of the interaction effect.

Discussion & Conclusion

The goal of the current paper was to introduce the prior predictive check as a manner of testing the replication of ANOVA features. With the prior predictive check, researchers can find an answer to the question: “Does the new study fail to replicate relevant features of the original study?” Identifying a non-replication may make us wonder about the representativeness of the original study, the new study, and the comparability of both studies. Or, as stated by Simonsohn (2015, p. 9): “Statistical techniques help us identify situations in which something other than chance has occurred. Human judgment, ingenuity, and expertise are needed to know what has occurred instead.”

In the current paper, we discussed the prior predictive p-value for the ANOVA setting. For the ANOVA setting, we explained how to test relevant features of the form Rµ_r > 0. Technically, however, the relevant features evaluated by the ANOVAreplication R-package can also be of the form Rµ_r > r and Sµ_r = s, where r and s are vectors of length K containing the constants in H_RF, and S is a K×J restriction matrix like R. Accordingly, minimum (effect size) differences between means can be evaluated, and means can be constrained to equal specific values (a small sketch of how such a restriction matrix can be written down follows below).
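As an illustration of the Rµ_r > 0 notation, the snippet below writes down the three-row restriction matrix for the interaction hypothesis H_RF evaluated above for Janiszewski and Uy (2008), with the groups ordered as (low/precise, low/round, high/precise, high/round), and verifies that the reported original group means satisfy all constraints (so that F̄_{y_o} = 0). This only illustrates the notation; the exact input format expected by the ANOVAreplication functions may differ.

```r
# Restriction matrix R for the interaction H_RF of Janiszewski and Uy (2008),
# group order: (low/precise, low/round, high/precise, high/round).
# Row 1: round > precise under low motivation; row 2: round > precise under
# high motivation; row 3: the round-minus-precise difference is larger under
# high motivation than under low motivation.
R <- rbind(c(-1,  1,  0, 0),
           c( 0,  0, -1, 1),
           c( 1, -1, -1, 1))

mu_o <- c(-0.76, -0.23, -0.04, 0.98)   # original group means from Table 5
R %*% mu_o                             # 0.53, 1.02, 0.49: all positive
all(R %*% mu_o > 0)                    # TRUE: the original study satisfies H_RF
```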
Even though constraints of these forms can be evaluated with the R-package and in the online application, they are not emphasized in the current paper because they will less often relate directly to the findings of an original study.

The prior predictive p-value is generalizable to statistical models other than the ANOVA as well. That is, for any model a predictive distribution can be obtained, constrained hypotheses can be constructed, and a test statistic evaluating the constraints can be calculated. The test as currently provided can already be used for the repeated measures ANOVA by means of contrast weights (see, for example, Furr and Rosenthal, 2003). With contrast weights, a score can be calculated for each participant indicating to what degree the participant follows the expected pattern. Subsequently, the replication of relevant features of these contrast scores over groups can be tested. A pre-print introduction to testing replication with the prior predictive p-value for structural equation models has been published at https://psyarxiv.com/uvh5s.

In the current paper, we introduced the prior predictive p-value as a new tool in the meta-scientific toolbox to quantify replication failure or success. With the prior predictive p-value we test whether the new study significantly deviates from our expectations based on the original study. Other methods to evaluate replication research questions are included in Table 1 and demonstrated in Zondervan-Zwijnenburg and Rijshouwer (2020). Two features of the prior predictive p-value to test replication stand out. First, the prior predictive p-value makes use of a predictive distribution given the original study results, and the new study results are compared to the predicted data. A Bayes factor, on the other hand, weighs the evidence for two competing hypotheses in the new study as it actually occurred, but does not take study variation into account. Second, to compare the new study with the predicted data, we consider relevant features of the original study. While most other methods evaluate the replication of a simple effect size, relevant features can be any constraint or set of constraints of the form Rµ_r > 0, which seamlessly connects to the research objective of most ANOVA studies. With the ANOVAreplication R-package, which includes a vignette as a tutorial, and the interactive application (see osf.io/6h8x3), we provide researchers with an easy-to-use test for the replication of ANOVA features. The availability of the prior predictive p-value to test replications can further promote the trend to conduct more replication studies in the field of psychology.

Author Contact

Correspondence concerning this article should be addressed to Mariëlle Zondervan-Zwijnenburg, Department of Methods and Statistics, Utrecht University, Padualaan 14, 3584CH Utrecht. E-mail: M.A.J.Zwijnenburg@uu.nl.

Conflict of Interest and Funding

Funding for this research is described in the Acknowledgements section.

Author Contributions

MZ and HH were involved in the initial research design. MZ drafted and revised the article in collaboration with HH. MZ developed the interactive application, conducted the simulation studies, and conducted the analyses. RS provided additional feedback and evaluated the interactive application. All authors approved the final manuscript.
The first author (MZ) was the main author and the last author (HH) was the main supervisor on this project.

Acknowledgements

We would like to thank Meta-Psychology editor dr. Felix Schönbrodt, and reviewers dr. Matt Williams and dr. Zoltan Dienes for their helpful feedback on this manuscript.

The first and third author are supported by the Consortium Individual Development (CID), which is funded through the Gravitation program of the Dutch Ministry of Education, Culture, and Science and the Netherlands Organization for Scientific Research (NWO grant number 024.001.003). The second author is supported by a VIDI grant from the Netherlands Organization for Scientific Research (NWO grant number 452.14.006).

Open Science Practices

This article earned the Open Materials badge for making the materials openly available. This article is a methods article and did not include any new data, and it was not pre-registered. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Anderson, S. F., & Maxwell, S. E. (2016). There’s more than one way to conduct a replication study: Beyond statistical significance. Psychological Methods, 21(1), 1–12. https://doi.org/10.1037/met0000051

Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., ..., & Wicherts, J. M. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27(2), 108–119. https://doi.org/10.1002/per.1919

Box, G. E. (1980). Sampling and Bayes’ inference in scientific modelling and robustness. Journal of the Royal Statistical Society. Series A (General), 143(4), 383–430. https://doi.org/10.2307/2982063

Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., Grange, J. A., ..., & Van’t Veer, A. (2014). The replication recipe: What makes for a convincing replication? Journal of Experimental Social Psychology, 50, 217–224. https://doi.org/10.1016/j.jesp.2013.10.005

Chandler, J. (2015). Replication of Janiszewski & Uy (2008, PS, study 4b). Open Science Framework. osf.io/aaudl

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. https://doi.org/10.4324/9780203771587

Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3(4), 286–300. https://doi.org/10.1111/j.1745-6924.2008.00079.x

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966

Earp, B. D., & Trafimow, D. (2015). Replication, falsification, and the crisis of confidence in social psychology. Frontiers in Psychology, 6, 621. https://doi.org/10.3389/fpsyg.2015.00621

Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., Baranski, E., Bernstein, M. J., Bonfiglio, D. B., Boucher, L., Brown, E. R., Budiman, N. I., Cairo, A. H., Capaldi, C. A., Chartier, C. R., Chung, J. M., Cicero, D. C., Coleman, J. A., Conway, J.
G., . . . Nosek, B. A. (2016). Many labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82. https://doi.org/10.1016/j.jesp.2015 .10.012 Errington, T., Tan, F., Lomax, J., Perfito, N., Iorns, E., Gunn, W., & Lehman, C. (2019). Reproducibility project: Cancer biology. osf.io/e81xl Fischer, P., Greitemeyer, T., & Frey, D. (2008). Self- regulation and selective exposure: The impact of depleted self-regulation resources on con- firmatory information processing. Journal of Personality and Social Psychology, 94(3), 382. https://doi.org/10.1037/0022-3514.94 .3.382 Furr, R. M., & Rosenthal, R. (2003). Repeated-measures contrasts for "multiple-pattern" hypotheses. Psychological Methods, 8(3), 275–293. https: //doi.org/10.1037/1082-989X.8.3.275 Galliani, E. (2015). Replication report of Fischer, Greite- meyer, and Frey (2008, JPSP, study 2). https: //osf.io/j8bpa Gelman, A., Meng, X.-L., & Stern, H. (1996). Posterior predictive assessment of model fitness via real- ized discrepancies. Statistica Sinica, 6(4), 733– 760. Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statisti- cian, 60(4), 328–331. https://doi.org/10 .1198/000313006x152649 Harms, C. (2018). A bayes factor for replications of anova results. The American Statistician. https: //doi.org/10.1080/00031305.2018.1518787 Hedges, L. V., & Olkin, I. (1980). Vote-counting meth- ods in research synthesis. Psychological Bulletin, 88(2), 359. https://doi.org/10.1037/0033 -2909.88.2.359 Hoijtink, H. (2012). Informative hypotheses: Theory and practice for behavioral and social scientists. CRC Press. https://doi.org/10.1201/b11158 Hoijtink, H., Mulder, J., van Lissa, C., & Gu, X. (2019). A tutorial on testing hypotheses using the bayes factor. Psychological Methods, 24(5), 539–556. https://doi.org/10.1037/met0000201 Janiszewski, C., & Uy, D. (2008). Precision of the anchor influences the amount of adjustment. Psycho- logical Science, 19(2), 121–127. https://doi .org/10.1111/j.1467-9280.2008.02057.x Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Bahnık, Š., Bernstein, M. J., Bocian, K., Brandt, M. J., Brooks, B., Brumbaugh, C. C., Cemalcilar, Z., Chandler, J., Cheong, W., Davis, W. E., De- vos, T., Eisner, M., Frankowska, N., Furrow, D., Galliani, E. M., . . . Nosek, B. A. (2014). Investi- gating variation in replicability: A ’many labs’ replication project. Social Psychology, 45(3), 142–152. https://doi.org/10.1027/1864 -9335/a000178 Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Alper, S., Aveyard, M., Axt, J. R., Babalola, M. T., Bahnık, Š., Batra, R., Ber- kics, M., Bernstein, M. J., Berry, D. R., Bialo- brzeska, O., Binan, E. D., Bocian, K., Brandt, M. J., Busching, R., . . . Nosek, B. A. (2018). Many labs 2: Investigating variation in replica- bility across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490. https://doi.org/10.1177/ 2515245918810225 Ledgerwood, A. (2014). Introduction to the special section on advancing our methods and prac- tices. Perspectives on Psychological Science, 9(3), 275–277. https : / / doi .org / 10 .1177 / 1745691613513470 Ly, A., Etz, A., Marsman, M., & Wagenmakers, E.-J. (2018). Replication bayes factors from evi- dence updating. 
Behavior Research Methods, 1–11. https://doi.org/10.3758/s13428-018-1092-x

Marsman, M., Schönbrodt, F. D., Morey, R. D., Yao, Y., Gelman, A., & Wagenmakers, E.-J. (2017). A Bayesian bird’s eye view of ‘replications of important results in social psychology’. Royal Society Open Science, 4(1), 160426. https://doi.org/10.1098/rsos.160426

Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics, 22(3), 1142–1160. https://doi.org/10.1214/aos/1176325622

Morey, R. D., & Lakens, D. (2019). Why most of psychology is statistically unfalsifiable. https://doi.org/10.5281/zenodo.838685

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14(9), 1105–1107. https://doi.org/10.1038/nn.2886

Open Science Collaboration. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7(6), 657–660. https://doi.org/10.1177/1745691612462588

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251). https://doi.org/10.1126/science.aac4716

Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ introduction to the special section on replicability in Psychological Science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530. https://doi.org/10.1177/1745691612465253

Patil, P., Peng, R. D., & Leek, J. T. (2016). What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspectives on Psychological Science, 11(4), 539–544. https://doi.org/10.1177/1745691616646366

Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13(2), 90–100. https://doi.org/10.1037/a0015108

Silvapulle, M. J., & Sen, P. K. (2005). Constrained statistical inference: Order, inequality, and shape constraints (Vol. 912). John Wiley & Sons. https://doi.org/10.1002/9781118165614
Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results. Psychological Science, 26(5), 559–569. https://doi.org/10.1177/0956797614567341

Van Aert, R. C., & Van Assen, M. A. (2017). Examining reproducibility in psychology: A hybrid method for combining a statistically significant original study and a replication. Behavior Research Methods, 1–25. https://doi.org/10.3758/s13428-017-0967-6

Vanbrabant, L., Van de Schoot, R., & Rosseel, Y. (2015). Constrained statistical inference: Sample-size tables for ANOVA and regression. Frontiers in Psychology, 5, 1565. https://doi.org/10.3389/fpsyg.2014.01565

Verhagen, J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology: General, 143(4), 1457–1475. https://doi.org/10.1037/a0036731

Zondervan-Zwijnenburg, M. A. J. (2018). ANOVAreplication: Test ANOVA replications by means of the prior predictive p-value (R package version 1.1.3). https://CRAN.R-project.org/package=ANOVAreplication

Zondervan-Zwijnenburg, M. A. J., & Rijshouwer, D. (2020). Testing replication with small samples: Applications to ANOVA. In R. van de Schoot & M. Miocevic (Eds.), Small sample size solutions: A guide for applied researchers and practitioners. Routledge.