Meta-Psychology, 2022, vol 6, MP.2019.1615
https://doi.org/10.15626/MP.2019.1615
Article type: Original article
Published under the CC-BY4.0 license
Open data: Not Applicable
Open materials: Yes
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: No
Edited by: Felix D. Schönbrodt
Reviewed by: Williams, M., Dienes, Z.
Analysis reproduced by: Counsell, A., Batinović, L.
All supplementary files can be accessed at OSF: https://doi.org/10.17605/OSF.IO/6WPN4

Testing ANOVA Replications by Means of the Prior Predictive p-Value

M.A.J. Zondervan-Zwijnenburg, Department of Methodology & Statistics, Utrecht University, The Netherlands
A.G.J. van de Schoot, Department of Methodology & Statistics, Utrecht University, The Netherlands; Optentia Research Program, North-West University, South Africa
H.J.A. Hoijtink, Department of Methodology & Statistics, Utrecht University, The Netherlands

Abstract

In the current study, we introduce the prior predictive p-value as a method to test replication of an analysis of variance (ANOVA). The prior predictive p-value is based on the prior predictive distribution. If we use the original study to compose the prior distribution, then the prior predictive distribution contains datasets that are expected given the original results. To determine whether the new data resulting from a replication study deviate from the data in the prior predictive distribution, we need to calculate a test statistic for each dataset. We propose to use F̄, which measures the degree to which the results of a dataset deviate from an inequality constrained hypothesis capturing the relevant features of the original study: H_RF. The inequality constraints in H_RF are based on the findings of the original study and can concern, for example, the ordering of means and interaction effects. The prior predictive p-value consequently tests to what degree the new data deviate from the data predicted given the original results, considering the relevant features of the original study. We explain the calculation of the prior predictive p-value step by step, elaborate on the topic of power, and illustrate the method with examples. The replication test and its integrated power and sample size calculator are made available in an R-package and an online interactive application. As such, the current study supports researchers who want to adhere to the call for replication studies in the field of psychology.

Keywords: ANOVA, comparison of means, power analysis, prior predictive p-value, replication study

Introduction

New studies conducted to replicate earlier original studies are often referred to as replication studies. After the latest “crisis in confidence” in the field of psychology, the call to conduct replication studies is stronger than ever (Anderson & Maxwell, 2016; Asendorpf et al., 2013; Cumming, 2014; Earp & Trafimow, 2015; Ledgerwood, 2014; Open Science Collaboration, 2012, 2015; Pashler & Wagenmakers, 2012; Schmidt, 2009; Verhagen & Wagenmakers, 2014), and large replication projects such as the Reproducibility Project Psychology (Open Science Collaboration, 2015), the Reproducibility Project: Cancer Biology (RP:CB; Errington et al., 2019), and the Many Labs projects (Ebersole et al., 2016; Klein et al., 2014; Klein et al., 2018) have been launched.
As a result, methodology on conducting replication studies has received increasing attention (see, for example, Anderson and Maxwell, 2016; Asendorpf et al., 2013; Brandt et al., 2014; Schmidt, 2009). There is, however, no standard methodology to determine whether a replication is successful or not (Open Science Collaboration, 2015).

The results of an original study are replicated when a new study corroborates the original findings. A common and intuitive method to assess whether a result is replicated is ‘vote-counting’: assessing whether the new effect is statistically significant and in the same direction as the significant effect in the original study (Anderson & Maxwell, 2016; Simonsohn, 2015). Vote-counting, however, has serious shortcomings. First of all, it is a dichotomous evaluation that does not take into account the magnitude of differences between the effect sizes of the original and new study (Asendorpf et al., 2013; Simonsohn, 2015). Secondly, each of the effect sizes being significant does not imply that both effect sizes are the same, nor does one significant effect and one non-significant effect imply that both effects are different (Gelman & Stern, 2006; Nieuwenhuis et al., 2011). Stated otherwise, vote-counting does not formally test whether a result is replicated (Anderson & Maxwell, 2016; Verhagen & Wagenmakers, 2014). Thirdly, underpowered replication studies are less likely to replicate significance, which can lead to misleading conclusions (Asendorpf et al., 2013; Cumming, 2008; Hedges & Olkin, 1980; Simonsohn, 2015).

In the current study, we address the following replication research question: “Does the new study fail to replicate relevant features of the original study?”. For example, the result of an original ANOVA study is: Group A > Group B > Group C. The reported finding can be: “Group A performs better than group B, which performs better than group C”; “Group A performs better than groups B and C”; or “Groups A and B perform better than group C”. The ‘relevant features’ evaluated by the replication test always have to be in line with the original result (i.e., Group A > Group B > Group C) for the test to function properly. If the purpose of the replication test is to put the theory proclaimed by the original study to the test, then the claims of the original study determine the exact relevant features to be evaluated. However, if there is reason to test another feature, it is possible to let the relevant features deviate from the claims in the original study. The relevant features of original studies will be captured in the form of an informative hypothesis (Hoijtink, 2012), which is specified using inequality constraints among the means of the ANOVA model. We propose to evaluate the replication of these hypotheses with the prior predictive p-value (Box, 1980).

The prior predictive p-value was not introduced to test replication. It was originally presented as a method to test whether the current data are unexpected given prior expectations concerning the parameter values of a statistical model. A disadvantage of the prior predictive check as a test of model fit is that it leaves undetermined whether the prior expectations about the parameter values or the model assumptions are incorrect.
Hence, as a model test the prior predictive check has been replaced by the posterior predictive check (Gelman et al., 1996), which does not make prior assumptions about expected parameter values, but instead uses the posterior results given the current data. With respect to testing replication, however, the prior predictive check is a good method for three reasons. First, instead of non-empirical prior expectations, we use the posterior distribution of the model parameters given the original data as the prior distribution. Consequently, we have a well-founded and clear-cut prior. Second, the prior predictive check uses a distribution of datasets (i.e., the prior predictive distribution) that are expected given the prior (i.e., the posterior of the original study). In this manner, the prior predictive distribution takes into account that results in a new dataset resulting from a replication study may deviate from the original results because of random variation instead of meaningful differences. According to our definition, a study replicates if the new dataset is drawn from the same population as the original dataset. Third, the prior predictive check uses a ‘relevant checking function’ for which we propose F̄ (Silvapulle & Sen, 2005, p. 38-39). The statistic F̄ captures the deviance from a constrained hypothesis that we base on the findings of the original study. As a result, we can check whether the new study significantly fails to replicate relevant features of the original study, while taking variation into account.

Table 1
Replication Research Questions and Methods to Address Them

Current study and similar questions:

Replication research question | Method | Setting | Reference
Does the new study fail to replicate relevant features of the original study? | Prior predictive p-value | t-test, ANOVA | Current study
Does the new study fail to replicate the effect size of the original study? | Confidence interval for difference in effect sizes | t-test, correlation | Anderson and Maxwell (2016)
  | Prediction interval | correlation | Patil et al. (2016)
Does the new study replicate the effect size of the original study? | Equivalence test | t-test | Anderson and Maxwell (2016)
  | Bayes factor | t-test | Verhagen and Wagenmakers (2014)
  | Bayes factor | ANOVA | Harms (2018)
  | Bayes factor | BF models (a) | Ly et al. (2018)

Other replication research questions:

Replication research question | Method | Setting | Reference
Is the effect present or absent in the replication study? | Bayes factor | t-test, correlation (b) | Marsman et al. (2017)
Is Cohen’s d in the population of a detectable size? | Telescope test | t-test (c) | Simonsohn (2015)
Is the original effect size extreme in comparison to the new study? | Confidence interval for difference in effect sizes | t-test, correlation | Open Science Collaboration (2015)
What is Cohen’s d in the population? | Confidence interval for average effect size | t-test | Anderson and Maxwell (2016)
What is the effect size (corrected for publication bias) in the population? | Hybrid meta-analysis | t-test | Van Aert and Van Assen (2017)

Note. (a) All models for which a Bayes factor can be computed. (b) The reconceptualization by Ly et al. (2018) generalizes to most common experimental designs. (c) The telescope test is explained in the t-test setting, but applicable to any model for which a power analysis can be conducted.
Table 1 shows how our research question and proposed method relate to other replication research questions and associated methods that have been proposed. Our method addresses a question similar to those in Anderson and Maxwell (2016), Harms (2018), Ly et al. (2018), Verhagen and Wagenmakers (2014) and Patil et al. (2016), but now enables researchers to evaluate the replication of relevant features of an original ANOVA study. The bottom panel of Table 1 shows other replication research questions that will not be pursued in this paper. The reader interested in these questions should consult the given references.

The goal of this paper is to introduce the prior predictive p-value as a method to test replication of relevant features of original ANOVA studies. In the first section, we provide a step by step introduction of the prior predictive p-value as included in the ANOVAreplication R-package and the online interactive application (see osf.io/6h8x3). In the second section, we discuss the statistical power of the prior predictive p-value. In the third section, we explain how to use and interpret the prior predictive p-value by means of a workflow. In the fourth section, we use several studies from the Reproducibility Project Psychology (Open Science Collaboration, 2012) to demonstrate the use of the prior predictive p-value. The paper ends with a discussion and conclusion section.

Prior Predictive p-Value

The evaluation of the replication of an ANOVA study by means of the prior predictive p-value (Box, 1980) consists of three steps that will be explained below.

Step 1: Prior Predictive Distribution of the Data

The ANOVA model is given by:

y_{ijd} = µ_{jd} + ε_{ijd},  ε_{ijd} ~ N(0, σ²_d),   (1)

where y_{ijd} is observation i = 1, ..., n_{jd} in group j = 1, ..., J for dataset d ∈ {o, r, sim}, where o denotes the original data, r denotes the new data, and sim denotes simulated data; the latter will be introduced towards the end of this section. Furthermore, µ_{jd} is the mean of group j in dataset d, ε_{ijd} is the error term, and σ²_d is the pooled variance over all J groups.

The original ANOVA results can be summarized in the posterior distribution of the parameters, g(µ_o, σ²_o | y_o), where µ_o = [µ_{1o}, ..., µ_{Jo}] and y_o includes all observations y_{ijo}:

g(µ_o, σ²_o | y_o) ∝ f(y_o | µ_o, σ²_o) h(µ_o, σ²_o),   (2)

where the density of the data is

f(y_o | µ_o, σ²_o) = ∏_{j=1}^{J} ∏_{i=1}^{n_{jo}} (1 / √(2πσ²_o)) exp( −(y_{ijo} − µ_{jo})² / (2σ²_o) ),   (3)

and the standard prior distribution is

h(µ_o, σ²_o) ∝ 1/σ²_o,   (4)

that is, a uniform prior on the means and Jeffreys prior on the variance. The prior distribution for the analysis of the original data is uninformative, that is, the posterior distribution is completely determined by the original data in order to match the results of the original study. If the original study used a Bayesian analysis, the priors should match those of the original study in order to reproduce the original study results. Given the observed original results, the prior distribution for future parameters is h(µ_r, σ²_r) = h(µ_sim, σ²_sim) = g(µ_o, σ²_o | y_o).
With the prior predictive p-value, we then test H_0: µ_r, σ²_r ~ h(µ_r, σ²_r). H_0 states that µ_r, σ²_r follow the distribution of the prior for µ_r, σ²_r. Loosely formulated, H_0 states that the parameters in the new data are in line with our expectations given the original results. To test H_0, we obtain datasets that are to be expected given the original data. Using this prior we simulate data y_sim that are to be expected given the results of the original study:

f(y_sim) = ∫ f(y_sim | µ_sim, σ²_sim) h(µ_sim, σ²_sim) dµ_sim dσ²_sim,   (5)

where f(y_sim) is the prior predictive distribution of the data. Note that f(y_sim | µ_sim, σ²_sim) is the counterpart of Equation 3 for dataset sim instead of o. Datasets y^t_sim for t = 1, ..., T, where T denotes the number of samples from the prior predictive distribution, are obtained by sampling µ^t_sim, σ²^t_sim from h(µ_sim, σ²_sim) = g(µ_o, σ²_o | y_o), and subsequently simulating y^t_sim from f(y_sim | µ^t_sim, σ²^t_sim) (cf. Equation 3). Datasets y^t_sim have sample sizes n_{1r}, ..., n_{Jr}, because the predicted data need to be compared to the new data y_r, which have sample sizes n_{1r}, ..., n_{Jr}. The steps in the following sections elaborate how the new data y_r can be compared to the T data matrices sampled from f(y_sim) that are to be expected given H_0, using a test statistic that evaluates relevant features of the original data.

Step 2: Test Statistic Evaluating Relevant Features

We propose to use F̄ (Silvapulle & Sen, 2005, p. 38-39) as a test statistic to evaluate how much the predicted data and the observed data deviate from an inequality constrained hypothesis capturing the relevant features of the original study, H_RF:

F̄_{y_d} = (RSS_{d,H_RF} − RSS_{d,H_u}) / S²_d,   (6)

where RSS_{d,H_u} denotes the residual sum of squares in dataset d ∈ {r, sim} for the unrestricted hypothesis H_u: µ_{1d}, ..., µ_{Jd},

RSS_{d,H_u} = ∑_{ij} (y_{ijd} − ȳ_{jd})²,   (7)

where ȳ_{jd} denotes the mean for group j in dataset d. S²_d denotes the mean squared error,

S²_d = RSS_{d,H_u} / (N − J),   (8)

where N = ∑_{j=1}^{J} n_{jr}, and

RSS_{d,H_RF} = ∑_{ij} (y_{ijd} − µ̃_{jd})²,   (9)

where

µ̃_d = [µ̃_{1d}, ..., µ̃_{Jd}] = argmin_{µ_d ∈ H_RF} ∑_{ij} (y_{ijd} − µ_{jd})².   (10)

µ̃_d thus contains the set of parameter estimates that minimize the residual sum of squares for y_d under the constraints imposed by H_RF. F̄_{y_d} is the scaled difference between the residual sum of squares for y_d under the constraints imposed by H_RF and the residual sum of squares for y_d under H_u. As H_u is unrestricted, F̄_{y_d} quantifies the misfit of y_d with H_RF (a small numerical sketch of this computation is given below).

The hypothesis capturing the relevant features of the original data, H_RF, is of the form Rµ_d > 0, where R is a K×J restriction matrix, J denotes the number of groups in the ANOVA study, and K the number of restrictions in H_RF, while µ_d is the mean vector of length J. Examples of constraints that can be applied under Rµ_r > 0 are:

• Simple order constraints: µ_{jd} > µ_{j'd}, or µ_{jd} < µ_{j'd}, for a pair j, j'.
• Interaction effects: (µ_{ABd} − µ_{AB'd}) > (µ_{A'Bd} − µ_{A'B'd}), for a 2×2 factorial design.

The constraints in H_RF should be based on the findings of the original study, which implies and requires that H_RF is always in agreement with the results of the original study (i.e., F̄_{y_o} = 0). The results of the original study alone are usually not enough to determine which H_RF is to be evaluated. For example, an original study shows that ȳ_{1o} < ȳ_{2o} < ȳ_{3o}. This finding may lead to H_RF: µ_{1d} < µ_{2d} < µ_{3d}, but also to H_RF: (µ_{1d}, µ_{2d}) < µ_{3d} or H_RF: µ_{1d} < (µ_{2d}, µ_{3d}).
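To make Equations 6–10 concrete, the sketch below computes F̄ in R for a simple order constraint µ_{1d} ≤ µ_{2d} ≤ ... ≤ µ_{Jd}. It is a minimal illustration under that assumption only, and not the routine implemented in the ANOVAreplication package: the constrained estimates of Equation 10 are obtained here by weighted isotonic regression (pool-adjacent-violators), which holds for simple orderings but not for arbitrary Rµ_d > 0, and the helper names pava and fbar are ours.

```r
# Minimal sketch of the F-bar statistic (Equations 6-10) for a simple order
# constraint mu_1 <= mu_2 <= ... <= mu_J. Helper names are ours.

pava <- function(means, weights) {
  # Weighted pool-adjacent-violators: non-decreasing least-squares fit to the
  # group means, which solves Equation 10 for a simple order constraint.
  val <- as.numeric(means); w <- as.numeric(weights); len <- rep(1, length(val))
  i <- 1
  while (i < length(val)) {
    if (val[i] > val[i + 1]) {                 # adjacent violation: pool the two blocks
      val[i] <- (w[i] * val[i] + w[i + 1] * val[i + 1]) / (w[i] + w[i + 1])
      w[i]   <- w[i] + w[i + 1]
      len[i] <- len[i] + len[i + 1]
      val <- val[-(i + 1)]; w <- w[-(i + 1)]; len <- len[-(i + 1)]
      i <- max(i - 1, 1)                       # re-check the preceding pair
    } else {
      i <- i + 1
    }
  }
  rep(val, times = len)                        # one fitted value per group
}

fbar <- function(y, g) {
  # F-bar (Equation 6) for data y with group labels g, where the groups appear
  # in the hypothesized increasing order of the means.
  g <- factor(g, levels = unique(g))
  n <- tabulate(g)
  m <- tapply(y, g, mean)                      # unconstrained group means
  rss_u  <- sum((y - m[as.integer(g)])^2)      # Equation 7: RSS under H_u
  s2     <- rss_u / (length(y) - nlevels(g))   # Equation 8: mean squared error
  m_con  <- pava(m, n)                         # Equation 10: constrained means
  rss_rf <- rss_u + sum(n * (m - m_con)^2)     # Equation 9: RSS under H_RF
  (rss_rf - rss_u) / s2                        # Equation 6
}

# Example: three groups of 20. F-bar is 0 when the sample means already satisfy
# the hypothesized order and grows as the observed ordering deviates from it.
set.seed(123)
y <- c(rnorm(20, 0.0), rnorm(20, 0.3), rnorm(20, 0.6))
g <- rep(c("g1", "g2", "g3"), each = 20)
fbar(y, g)
```

For hypotheses such as (µ_{1d}, µ_{2d}) < µ_{3d} or interaction constraints, Equation 10 becomes a general inequality-constrained least-squares problem that the simple pool-adjacent-violators step above does not cover.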
Which exact features should be covered in H_RF can be guided by the conclusions of the original study. For example, if in the original study it is concluded that a treatment condition leads to better outcomes than two control conditions, the most logical specification of the relevant features is H_RF: (µ_{controlA,d}, µ_{controlB,d}) < µ_{treatment,d}. Alternatively, if in the original study it is concluded that treatment A is better than treatment B, which is better than the control condition, a logical relevant feature hypothesis would be H_RF: µ_{treatmentA,d} > µ_{treatmentB,d} > µ_{control,d}. It may also occur that the researcher conducting the replication test has an interest in evaluating a claim that is not made in the original study, but could be made based on its results. In all cases, the researcher conducting the replication test should substantiate the choices made in the formulation of H_RF with results from the original study. It is good practice to also pre-register H_RF. In the Examples section, we demonstrate for two studies how the original study is linked to H_RF. First, however, we explain how the prior predictive p-value is calculated.

Step 3: p-value

The third and final step is to compute the prior predictive p-value. When we calculate F̄_{y^t_sim} for each dataset y^t_sim obtained in Step 1, with F̄ as defined in Step 2, a sampling-based representation of the prior predictive distribution of the test statistic, f(F̄_{y_sim}), is obtained. Consequently,

p = P(F̄_{y_sim} ≥ F̄_{y_r} | H_0) = (1/T) ∑_{t=1}^{T} I(F̄_{y^t_sim} ≥ F̄_{y_r}),   (11)

where H_0 denotes “Replication”, that is, H_0: µ_r, σ²_r ~ h(µ_r, σ²_r). Furthermore, I is an indicator function that takes on the value 1 if the argument is true and 0 otherwise.

As illustrated in Figure 1, the prior predictive p-value indicates how exceptional the observed statistic for the new data, F̄_{y_r}, is compared to its prior predictive distribution f(F̄_{y_sim}). The shaded area on the right side of F̄_{y_r} is P(F̄_{y_sim} ≥ F̄_{y_r} | H_0), that is, the prior predictive p-value. If the prior predictive p-value is significant, we reject replication of the relevant features of the original study by the new data. Note that the focus is on rejecting replication of the original results and not on rejecting H_RF in itself for the new study.¹

Figure 1. An illustration of the prior predictive p-value. [Histogram of f(F̄_{y_sim}) with the observed F̄_{y_r} marked and the tail area P(F̄_{y_sim} ≥ F̄_{y_r} | H_0) shaded.]

¹ To test H_RF itself we recommend Hoijtink et al. (2019) and Vanbrabant et al. (2015).

Uniformity. To determine the significance of a p-value by comparing it to some preselected value α, the p-value needs to be uniformly distributed if H_0 is true. Only when the p-value is uniform is α equal to the nominal Type I error rate. We will demonstrate that this is true for the prior predictive p-value if f(F̄_{y_sim}) is continuous, and that it is true up to some α_0 if f(F̄_{y_sim}) is discrete. A p-value is uniform if:

P(p ≤ α | H_0) ≤ α for all α ∈ [0, 1],   (12)

where p denotes a p-value from f(p | H_0), that is, the null distribution of the p-values. The following three steps prove that Equation 12 holds for the prior predictive p-value when f(F̄_{y_sim}) is continuous:

1. P(p < α | H_0) = P(F̄_{y_r} > F̄_{y_sim,1−α} | H_0), where F̄_{y_r} is the test statistic rendering p via p = P(F̄_{y_sim} > F̄_{y_r} | H_0), and F̄_{y_sim,1−α} is the (1−α) quantile of the distribution f(F̄_{y_sim} | H_0).

2. P(F̄_{y_r} > F̄_{y_sim,1−α} | H_0) = ∫_{F̄_{y_r} > F̄_{y_sim,1−α}} f(F̄_{y_r} | H_0) dF̄_{y_r}, where f(F̄_{y_r} | H_0) denotes the distribution of F̄_{y_r} under H_0.
3. For the situations considered in this paper it holds that f(F̄_{y_r} | H_0) = f(F̄_{y_sim}), therefore ∫_{F̄_{y_r} > F̄_{y_sim,1−α}} f(F̄_{y_r} | H_0) dF̄_{y_r} = ∫_{F̄_{y_sim} > F̄_{y_sim,1−α}} f(F̄_{y_sim}) dF̄_{y_sim} = α, which completes the proof.

With constraints of the form Rµ_r > 0, however, f(F̄_{y_sim}) will often be discrete. When f(F̄_{y_sim}) is discrete, the prior predictive p-value is not uniform for all α ∈ [0, 1]. For example, let us obtain g(µ_o, σ²_o | y_o) = h(µ_r, σ²_r) for an original study with ȳ_{1o} = 1, ȳ_{2o} = 2, ȳ_{3o} = 3, s²_o = 5, and n_{jo} = 50, with n_{jr} = 50 and H_RF: µ_{1r} < µ_{2r} < µ_{3r}. Subsequently, we simulate y^t_r for t = 1, ..., 100,000, and calculate the prior predictive p-value for each y^t_r. The result is f(p | H_0), which is plotted in Figure 2a. In Figure 2a, we see a thick vertical line that indicates a set of p-values with exactly the same value, namely 1.00. This set of equal p-values results from the fact that H_RF: µ_{1r} < µ_{2r} < µ_{3r} is true for a substantial number of datasets y^t_r, causing the associated F̄_{y^t_r} to be exactly equal to 0 and the associated prior predictive p-values to be exactly equal to 1 (see Figure 2b). Generally, however, there exists an α_0 up to which f(p | H_0) is uniform (Meng, 1994), since all values in f(F̄_{y_sim}) other than 0 will occur in a continuous fashion. Thus, the p-value is uniform for α ∈ [0, α_0]. If the preselected α < α_0, α is equal to the nominal Type I error rate. α_0 can be computed as 1 − P(F̄_{y_sim} = 0). For example, α_0 ≥ .05 if no more than 95% of the F̄_{y_sim} values are exactly 0. It would be exceptional if more than 95% of the F̄_{y_sim} values were 0, but it could occur with extremely low power in the original study and an unspecific H_RF. A visualization of f(F̄_{y_sim}) can help to roughly estimate α_0. For the discrete f(F̄_{y_sim}) considered here, 53% of the F̄_{y_sim} values equal 0 and α_0 = .47 (Figure 2b).

Figure 2. Uniformity of the prior predictive p-value for H_RF: µ_{1r} < µ_{2r} < µ_{3r}. Panel (a): f(p | H_0). Panel (b): f(F̄_{y_sim}).

In the next section, we deal with another important property of null hypothesis significance testing methods: power.

Power

Power is the probability of rejecting the null hypothesis (of replication) at a preselected α when not the null hypothesis, but an alternative hypothesis, is true. Researchers typically pursue a power of .80. Let us denote power by γ:

γ = P(p < α | H_a) = P(F̄_{y_r} > F̄_{y_sim,1−α} | H_a),   (13)

where H_a is the population under the alternative hypothesis for which replication is to be rejected. Note that any population for which H_0 is not true can qualify to reject replication. The population used is determined by the theoretical context in which the replication test takes place. The population with µ_{1a} = ... = µ_{Ja} is a special population that is generally considered to display a non-effect in ANOVA studies. Hence, µ_{1a} = ... = µ_{Ja} seems a natural default choice for the population under the alternative hypothesis. As a best guess for µ_{ja} and σ²_a in a power analysis, the grand mean ȳ_o and variance σ²_o of the original study can be used. The population under the alternative hypothesis with µ_{1a} = ... = µ_{Ja} is on the edge of H_RF: it deviates minimally from H_RF; hence, the associated γ is a lower limit. Power will increase when the population under the alternative hypothesis differs more from H_RF than the population with equal means does, for example, when the means are ordered differently.
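Before turning to the simulation study, the sketch below ties Steps 1–3 and Equation 13 together for the hypothetical original study used above (three groups of 50 with means 1, 2, 3 and pooled variance 5, H_RF: µ_{1r} < µ_{2r} < µ_{3r}). It reuses the fbar helper defined earlier; the posterior draws follow from the reference prior of Equation 4 (σ² drawn as RSS_o/χ²_{N_o−J}, group means normal given σ²). All names and the illustrative new data are our own and not output of the ANOVAreplication package.

```r
# Prior predictive distribution of F-bar (Step 1 + Step 2), the p-value for a
# new data set (Step 3), and the power of Equation 13 against H_a with equal
# means. Uses fbar() from the earlier sketch; numbers are illustrative.

set.seed(2022)
T_draws <- 10000
n_o    <- c(50, 50, 50); ybar_o <- c(1, 2, 3); s2_o <- 5   # original-study summary
n_r    <- c(50, 50, 50)                                    # new-study group sizes
g_r    <- rep(c("g1", "g2", "g3"), times = n_r)
df_o   <- sum(n_o) - length(n_o)                           # N_o - J

# Step 1: draw (mu, sigma2) from g(mu_o, sigma2_o | y_o) and simulate predicted data.
fbar_sim <- replicate(T_draws, {
  sigma2 <- (df_o * s2_o) / rchisq(1, df_o)                # scaled inverse chi-square
  mu     <- rnorm(3, mean = ybar_o, sd = sqrt(sigma2 / n_o))
  y_sim  <- rnorm(sum(n_r), mean = rep(mu, times = n_r), sd = sqrt(sigma2))
  fbar(y_sim, g_r)                                         # Step 2 on the predicted data
})

# Step 3: prior predictive p-value for an (illustrative) observed new data set.
y_r <- rnorm(sum(n_r), mean = rep(c(1.2, 1.9, 2.9), times = n_r), sd = sqrt(5))
p_value <- mean(fbar_sim >= fbar(y_r, g_r))

# Equation 13: power against H_a with all means equal to the original grand mean.
crit   <- quantile(fbar_sim, probs = 0.95)                 # F-bar_{y_sim, 1 - alpha}, alpha = .05
fbar_a <- replicate(T_draws, fbar(rnorm(sum(n_r), mean = 2, sd = sqrt(5)), g_r))
gamma  <- mean(fbar_a > crit)

c(p_value = p_value, power = gamma)
```

With these settings a large share of fbar_sim is exactly 0, which reflects the discreteness of f(F̄_{y_sim}) discussed above.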
Simulation Study

To illustrate the power of the prior predictive p-value, we conducted a simulation study in which we varied the effect size in the original study, f_o, the sample size per group in the original study, n_{jo}, the sample size per group in the new study, n_{jr}, the relevant feature of interest, H_RF, and the population under the alternative hypothesis, H_a, as specified in Table 2. For each cell in the simulation study, 10,000 samples were drawn from H_a and power was calculated according to Equation 13.

Table 2
Simulation Sample Statistics for Original Study and Population Values under Ha

        y_o                                  |    H_a
 f_o    ȳ_1o    ȳ_2o    ȳ_3o    s²_o         |    f_a    µ_1a    µ_2a     µ_3a    σ²_a
 .10    -0.12   0.00    0.12    1.00         |    0      0.00    0.00     0.00    1.00
 .25    -0.31   0.00    0.31    1.00         |    .10    0.00    -0.24    0.00    1.00
 .40    -0.49   0.00    0.49    1.00         |

Note. Effect size f as introduced by Cohen (1988, p. 274-275). Other simulation factors: n_{jd} ∈ {20, 50, 100}; H_RF1: µ_{1d} < (µ_{2d}, µ_{3d}); H_RF2: µ_{1d} < µ_{2d} < µ_{3d}.

The results of the simulation study are provided in Table 3. As expected, power generally increases with increasing effect sizes, increasing sample sizes, and increasing deviation between y_o and H_a. There are, however, some exceptions. With small f_o and low n_{jo}, larger n_{jr} only emphasize the noise in the original study more and do not lead to an increase in power. Similarly, a more specific H_RF does not always increase power. Given original studies with smaller samples and smaller effect sizes, h(µ_r, σ²_r) is so uninformative that more specific H_RF are only more inaccurate under H_0, and F̄_{y_r} needs to be extremely large to reject the null.

Table 3
Power

                 H_RF1, H_a1        H_RF2, H_a1        H_RF1, H_a2        H_RF2, H_a2
                 n_jr               n_jr               n_jr               n_jr
 f_o   n_jo      20   50   100      20   50   100      20   50   100      20   50   100
 .10   20       .03  .01  .00      .02  .00  .00      .11  .08  .05      .08  .04  .02
 .10   50       .08  .06  .03      .06  .04  .02      .21  .31  .36      .19  .22  .28
 .10   100      .11  .12  .10      .09  .10  .08      .26  .45  .62      .25  .39  .52
 .25   20       .13  .10  .06      .09  .05  .01      .33  .40  .45      .25  .26  .23
 .25   50       .25  .32  .37      .20  .26  .26      .48  .73  .88      .43  .62  .81
 .25   100      .30  .46  .57      .29  .44  .55      .54  .83  .96      .53  .79  .94
 .40   20       .32  .41  .41      .27  .26  .21      .59  .78  .89      .49  .64  .75
 .40   50       .49  .66  .67      .45  .68  .83      .74  .93  .98      .69  .93  .99
 .40   100      .55  .66  .67      .57  .83  .83      .77  .93  .97      .77  .98  .99

Note. Text in cells with γ ≥ .80 is boldface. Text in cells with a maximum γ in relation to the specific H_RF, H_a combination is italic.

Table 3 also shows that the power on the edge (i.e., the power for H_a1) is insufficient for original studies with small and medium effect sizes (γ < .60 in all cells). With medium f_o, power is only sufficient if the new study originates from a population in which the means are ordered differently (e.g., H_a2). For original studies with large effect sizes and at least 50 participants per group in the original study, power can be sufficient under H_a1. Power levels off, however, for H_RF1 and H_RF2 at .67 and .83, respectively. Under µ_{1a} = µ_{2a} = µ_{3a}, H_RF1: µ_{1r} < (µ_{2r}, µ_{3r}) is true in 1/3 of the situations by chance. Consequently, power cannot exceed 1 − 1/3 = .67. For H_RF2: µ_{1r} < µ_{2r} < µ_{3r}, 1/6 of the combinations under H_a1 are in line with replication by chance. Hence, power cannot exceed 1 − 1/6 = .83. If we move further from the edge of H_RF, as we do with H_a2, power increases. Thus, the power of the prior predictive p-value considering an H_RF with three or fewer order constraints will almost never be high if the true means are equal, but can be high if the true ordering differs from the one in H_RF.
The results demonstrate that imprecise estimates (i.e., large standard errors leading to a weakly informative prior) in the original study lead to low power, especially on the edge of H_RF. This is as true for the prior predictive p-value as it is for other approaches. For example, in a classical ANOVA study with three groups of 20 participants each, power is below .10, .40, and .80 for small, medium, and large effect sizes, respectively; a result that was already pointed out by Cohen (1988, p. 313). Zondervan-Zwijnenburg and Rijshouwer (2020) demonstrate the application of different methods to evaluate replication within the context of small samples. Not a single method is unaffected by small sample sizes. As highlighted by Morey and Lakens (2019) and Patil et al. (2016): replication can only be rejected based on the findings of the original study, and when these findings are highly imprecise due to large standard deviations and small sample sizes, rejecting them is hard or even impossible.

Underpowered original studies may result in non-significant prior predictive p-values that have a high probability of being Type II errors (Morey & Lakens, 2019). Therefore, only reporting the prior predictive p-value is not enough: the probability of a Type II error (i.e., 1 − γ) given the population under the alternative hypothesis should be communicated to the reader as well. The next section elaborates on the computation of power and of the sample size required for sufficient statistical power. The Workflow and Examples sections explain how researchers should incorporate prior predictive p-values and power. One of the examples will also demonstrate rejected replication despite low power on the edge of H_0.

Power and Sample Size Determination

As highlighted in the previous sections and in the literature (e.g., Brandt et al., 2014; Simonsohn, 2015), power is an important characteristic of a convincing replication study. It is thus important that researchers can calculate the power of the prior predictive check, and can determine the sample size for a new study such that the replication test has high statistical power. Therefore, the ANOVAreplication R-package and the online interactive application (see osf.io/6h8x3) include a power and sample size calculator.

Given the vector with group sample sizes in the new study, n_r, as well as h(µ_r, σ²_r), H_a, H_RF, and α, the power γ is calculated as follows:

1. Following Steps 1 and 2 of the prior predictive check, t = 1, ..., T datasets are simulated and their F̄ values are computed, which yields f(F̄_{y_sim}) and the critical value F̄_{y_sim,1−α}.

2. Given µ_a and σ_a, t = 1, ..., T datasets with sample sizes n_r are simulated from the population under H_a. Following Step 2 of the prior predictive check, F̄_{y_r} is calculated for each of these datasets.

3. γ = P(F̄_{y_r} > F̄_{y_sim,1−α} | H_a) = (1/T) ∑_{t=1}^{T} I(F̄_{y^t_r} ≥ F̄_{y_sim,1−α}).

As a default choice for µ_a, we recommend using ȳ_o for each group. With this setting, the power to reject replication in case of equal group means is calculated. As a default choice for σ_a, we recommend the pooled standard deviation of the original study.

To determine the required sample size to reject replication with sufficient power, we use an iterative procedure.
In addition to h(µ_r, σ²_r), H_a, H_RF, and α, we use the following information to calculate the required sample size: a target power level γ̃; a small margin of acceptable values around the target power, γ_margin, because the calculated power may not be exactly equal to the target power; a starting value for the group sample size, n_{jr_0}; a maximum number of iterations, Q_max; and a maximum total sample size for the new study, N_{r_max}. Our default values are: γ̃ = .825, γ_margin = .025, α = .05, n_{jr_0} = 20, Q_max = 10, and N_{r_max} = 600.

1. In every iteration q, γ_q is calculated given n_{jr_q}.

2. When q > 1, n_{jr_{q+1}} is determined by regressing {γ_1, ..., γ_q} on {n_{jr_1}, ..., n_{jr_q}} with a linear or quadratic (only if q = 3) function. In case of a linear regression, the regression coefficient β_1 is the power increase per subject, and n_{jr_{q+1}} = (γ̃ − γ_q)/β_1 + n_{jr_q}. In case of regression with a quadratic function, n_{jr_{q+1}} is calculated by solving the polynomial γ̃ = β_0 + β_1 n_{jr_{q+1}} + β_2 n²_{jr_{q+1}}.

3. Steps (1) and (2) are repeated until γ_q ∈ [γ̃ − γ_margin, γ̃ + γ_margin] (i.e., power is sufficient), or γ_{q−1} ≈ γ_q (i.e., power no longer increases up to two decimal places), or n_{jr_{q−1}} = n_{jr_q} (i.e., the sample size no longer changes), or q = Q_max, or the total sample size ∑_{j=1}^{J} n_{jr_q} reaches N_{r_max}.

Workflow

To clarify the procedure for obtaining the prior predictive p-value, the workflow is depicted in Figure 3.

Step 1. The first steps (1a-1c) only require the original study. Step 1a is to derive the relevant feature to be evaluated in the test statistic from the findings of the original study. Step 1b is to define the population for which replication should be rejected (i.e., H_a): what is the ordering of the means in this population and what is the effect size in that ordering? H_a can be a population in which all means are equal, but it does not have to be. Step 1c is to obtain the data of the original study, or to reconstruct the data based on the reported means, standard deviations, and group sample sizes (a small sketch of such a reconstruction is given at the end of this section). If the new study has not yet been conducted, the second step is to calculate the required sample size per group for the new study to reject replication with sufficient power (i.e., γ).

Step 2. The sample size calculation can be conducted with the sample.size.calc function in the ANOVAreplication package. If the function cannot find a (reasonable) group sample size for which γ is sufficient, this implies that the original study is not suited for replication testing with the prior predictive p-value for the specified H_a: its conclusions are too vague (i.e., the standard errors are too wide) to reject replication if H_a is true. There is still a chance that the prior predictive p-value turns out significant, especially if the observed data are more extreme than most samples from H_a, but the researcher should consider whether collecting data with such a low probability of a meaningful result is ethically acceptable.

Step 3. As a third step, the prior predictive p-value can be computed with the function prior.predictive.check. The power associated with the sample size of the new study can be calculated with power.calc. Note that this is not a post-hoc power analysis, as the definition of H_a is unrelated to the new study. Hence, the power to reject replication for H_a can be insufficient (i.e., smaller than 1 − β, with β the preset Type II error rate), while the prior predictive p-value is statistically significant, or vice versa.
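Workflow Step 1c requires the original data or a reconstruction of them from the reported summary statistics. A minimal way to generate data whose group means and standard deviations exactly match reported values is sketched below, using the summary statistics of Fischer et al. (2008) listed in Table 4 of the Examples section; the helper name make_group_data is ours, and this is not necessarily how the ANOVAreplication package generates such data.

```r
# Reconstruct a data set whose sample means and SDs exactly equal reported
# values (Workflow Step 1c). Helper name is ours; the group summaries are
# those of Fischer et al. (2008) as listed in Table 4.

make_group_data <- function(n, m, s) {
  z <- rnorm(n)
  z <- (z - mean(z)) / sd(z)    # standardize the draw to mean 0 and SD 1
  m + s * z                     # rescale to the reported mean and SD
}

set.seed(42)
y_o <- c(make_group_data(28,  0.36, 1.08),   # low self-regulation
         make_group_data(28, -0.19, 0.53),   # high self-regulation
         make_group_data(28, -0.18, 0.81))   # ego-threatened
g_o <- rep(c("low", "high", "ego"), each = 28)
round(tapply(y_o, g_o, mean), 2)             # reproduces the reported group means
```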
Figure 3 assists in interpreting the resulting p-value, considering the statistical power to reject replication for H_a, unless F̄_{y_r} is exactly 0. If the new study perfectly meets the features of the original study as described in H_RF, F̄_{y_r} will be 0 and the prior predictive p-value 1.00. In such a case, we confirm replication of the relevant features of the original study as captured in H_RF, irrespective of power. Theoretically it is possible that F̄_{y_r} = 0 while the new study is an extreme sample from a population in which H_RF is not true. That, however, is not under consideration here, as our question was whether the observed new study replicates, or fails to replicate, relevant features of the original study.

Figure 3. The prior predictive p-value workflow.

In case of a non-significant result in combination with low power, the researcher should emphasize the probability that not rejecting replication is a Type II error, and it is advised to conduct a replication study with larger n_{jr}. The required sample size per group can again be calculated with the sample.size.calc function in the ANOVAreplication package. If the required n_{jr} is excessive given H_a, it may be an inevitable conclusion that the original study is not suited for replication testing by means of the prior predictive p-value. If replication is rejected despite low power, this implies that the observed new dataset deviates more from H_RF than most datasets under H_a. With sufficient statistical power, it is still informative to notify the reader of the achieved power and/or the probability of a Type II error given the population under H_a.

Examples

To illustrate the use of the prior predictive check to assess whether relevant ANOVA features are replicated, we selected two replication studies that were part of the Reproducibility Project Psychology initiated by the Open Science Collaboration (2012, 2015). All calculations can be performed with the ANOVAreplication R-package (Zondervan-Zwijnenburg, 2018).

The first study is Fischer et al. (2008), who studied the impact of self-regulation resources on confirmatory information processing. According to the theory, people who have low self-regulation resources (i.e., depleted participants) will prefer information that matches their initial standpoint. An ego-threat condition was added, because the literature proposes that ego-threat affects decision-relevant information processing, although the direction of this effect is not clear.
To determine which relevant feature of the results (see Table 4) should be tested for replication, we follow the original findings: “Planned contrasts revealed that the confirmatory information processing tendencies of participants with reduced self-regulation resources [...] were stronger than those of nondepleted [...] and ego threatened participants [...]” (Fischer et al., 2008, p. 387). This translates to H_RF: µ_{low self-regulation,r} > (µ_{high self-regulation,r}, µ_{ego-threatened,r}) (Workflow Step 1a). We want to reject replication when all means in the population are equal, that is, H_a: µ_{low self-regulation,r} = µ_{high self-regulation,r} = µ_{ego-threatened,r} (Workflow Step 1b). We simulate the original data based on the means, standard deviations, and sample sizes reported in Fischer et al. (2008) (Workflow Step 1c). As the replication study has already been conducted by Galliani (2015) (see Table 4 for the results), we do not calculate the required sample size to test replication (Workflow Step 2), and proceed to calculate the prior predictive p-value and the power of the replication test (Workflow Step 3).

Table 4
Descriptive Statistics for Confirmatory Information Processing from the Original Study (Fischer et al., 2008) and the New Study (Galliani, 2015)

            Low self-regulation     High self-regulation     Ego-threatened
 Study      n     M (SD)            n     M (SD)             n     M (SD)
 Original   28*   0.36 (1.08)       28*   -0.19 (0.53)       28*   -0.18 (0.81)
 New        48    -0.07 (0.45)      47    -0.05 (0.47)       45    0.13 (0.64)

* Only the total sample size of 85 was provided in Fischer et al. (2008).

The resulting prior predictive p-value was .003 with γ = .66, indicating that we reject replication, despite limited power. The ordering in the new data by Galliani (2015) results in an extreme F̄ score compared to the predicted data. Figure 4 illustrates this conclusion: over 90% of the predicted datasets score perfectly in line with H_RF, but the new study by Galliani (2015) deviates from H_RF and scores in the extreme 0.3% of the predicted data. The replication of the original study conclusions is thus rejected.

Figure 4. The prior predictive p-value for the replication of Fischer et al. (2008) by Galliani (2015). The histogram bars represent F̄ for the predicted data. The thick line on the left represents F̄ for the predicted data that are exactly 0 (i.e., over 90% of the total), whereas the red line represents F̄ for Galliani (2015).

The second study is Janiszewski and Uy (2008), who studied numerical judgements in five experiments. More specifically, they studied the impact of the precision of an anchor, and of the motivation to adjust away from the anchor, on judgement bias. The group means, standard deviations, and sample sizes of experiment 4a in the original study by Janiszewski and Uy (2008) and the replication study by Chandler (2015) are provided in Table 5.

Table 5
Z-Scores of Participants’ Mean Estimates from the Original Study (Janiszewski & Uy, 2008) and the New Study (Chandler, 2015)

            Low Motivation to Adjust                 High Motivation to Adjust
            Precise Anchor      Rounded Anchor       Precise Anchor      Rounded Anchor
 Study      n     M (SD)        n     M (SD)         n     M (SD)        n     M (SD)
 Original   14    -0.76 (0.17)  15    -0.23 (0.48)   15    -0.04 (0.28)  15    0.98 (0.41)
 New        30    -0.35 (0.23)  30    -0.18 (0.37)   30    0.20 (0.34)   30    0.35 (0.44)

Based on these results, Janiszewski and Uy (2008) draw two conclusions. “First, a precise anchor results in less adjustment than a rounded anchor” (p. 126). For experiment 4a, which was replicated by Chandler (2015), this conclusion translates to H_RF: (µ_{low motivation,round,r} > µ_{low motivation,precise,r}) & (µ_{high motivation,round,r} > µ_{high motivation,precise,r}) (Workflow Step 1a). We want to reject replication when all means in the population are equal, that is, H_a: µ_{low motivation,round,r} = µ_{low motivation,precise,r} = µ_{high motivation,round,r} = µ_{high motivation,precise,r} (Workflow Step 1b).
We simulate the original data based on the means, standard deviations, and sample sizes reported in Janiszewski and Uy (2008) (Workflow Step 1c). As the replication study has already been conducted by Chandler (2015), we do not calculate the required sample size to test replication (Workflow Step 2), and proceed to calculate the prior predictive p-value and the power of the replication test (Workflow Step 3). The resulting prior predictive p-value is 1.00. The data obtained by Chandler (2015) were perfectly in line with the H_RF describing the effect as observed by Janiszewski and Uy (2008). Therefore, we do not have further concerns about the obtained power. Hence, we conclude that the results of Janiszewski and Uy (2008) with respect to H_RF: (µ_{low motivation,round,r} > µ_{low motivation,precise,r}) & (µ_{high motivation,round,r} > µ_{high motivation,precise,r}) are replicated by Chandler (2015).

The other conclusion that Janiszewski and Uy (2008) draw concerns the presence of an interaction effect of adjustment motivation and anchor rounding: “The difference in the amount of adjustment between the rounded- and precise-anchor conditions increased as the motivation to adjust went from low [...] to high” (p. 125). The results and conclusions of Janiszewski and Uy with respect to experiment 4a translate to H_RF: (µ_{low motivation,round,r} > µ_{low motivation,precise,r}) & (µ_{high motivation,round,r} > µ_{high motivation,precise,r}) & (µ_{low motivation,round,r} − µ_{low motivation,precise,r}) < (µ_{high motivation,round,r} − µ_{high motivation,precise,r}). The prior predictive p-value related to this H_RF is .014 with γ = .87. Thus, we reject replication of the interaction effect.

Discussion & Conclusion

The goal of the current paper was to introduce the prior predictive check as a manner of testing the replication of ANOVA features. With the prior predictive check, researchers can find an answer to the question: “Does the new study fail to replicate relevant features of the original study?” Identifying a non-replication may make us wonder about the representativeness of the original study, the new study, and the comparability of both studies. Or, as stated by Simonsohn (2015, p. 9): “Statistical techniques help us identify situations in which something other than chance has occurred. Human judgment, ingenuity, and expertise are needed to know what has occurred instead.”

In the current paper, we discussed the prior predictive p-value for the ANOVA setting. For the ANOVA setting, we explained how to test relevant features of the form Rµ_r > 0. Technically, however, the relevant features evaluated by the ANOVAreplication R-package can also be of the form Rµ_r > r and Sµ_r = s, where r and s are vectors of length K containing the constants in H_RF, and S is a K×J restriction matrix like R. Accordingly, minimum (effect size) differences between means can be evaluated, and means can be constrained to equal specific values (a small sketch of how such a restriction matrix can be written down follows below).
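As an illustration of the Rµ_r > 0 notation, the snippet below writes down the three-row restriction matrix for the interaction hypothesis H_RF evaluated above for Janiszewski and Uy (2008), with the groups ordered as (low/precise, low/round, high/precise, high/round), and verifies that the reported original group means satisfy all constraints (so that F̄_{y_o} = 0). This only illustrates the notation; the exact input format expected by the ANOVAreplication functions may differ.

```r
# Restriction matrix R for the interaction H_RF of Janiszewski and Uy (2008),
# group order: (low/precise, low/round, high/precise, high/round).
# Row 1: round > precise under low motivation; row 2: round > precise under
# high motivation; row 3: the round-minus-precise difference is larger under
# high motivation than under low motivation.
R <- rbind(c(-1,  1,  0, 0),
           c( 0,  0, -1, 1),
           c( 1, -1, -1, 1))

mu_o <- c(-0.76, -0.23, -0.04, 0.98)   # original group means from Table 5
R %*% mu_o                             # 0.53, 1.02, 0.49: all positive
all(R %*% mu_o > 0)                    # TRUE: the original study satisfies H_RF
```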
Even though constraints of these forms can be evaluated with the R-package and in the online application, they are not emphasized in the current paper because they will less often relate directly to the findings of an original study.

The prior predictive p-value is generalizable to statistical models other than the ANOVA as well. That is, for any model a predictive distribution can be obtained, constrained hypotheses can be constructed, and a test statistic evaluating the constraints can be calculated. The test as currently provided can already be used for the repeated measures ANOVA by means of contrast weights (see, for example, Furr and Rosenthal, 2003). With contrast weights, a score can be calculated for each participant indicating to what degree the participant follows the expected pattern. Subsequently, the replication of relevant features of these contrast scores over groups can be tested. A pre-print introduction to testing replication with the prior predictive p-value for structural equation models has been published at https://psyarxiv.com/uvh5s.

In the current paper, we introduced the prior predictive p-value as a new tool in the meta-scientific toolbox to quantify replication failure or success. With the prior predictive p-value we test whether the new study significantly deviates from our expectations based on the original study. Other methods to evaluate replication research questions are included in Table 1 and demonstrated in Zondervan-Zwijnenburg and Rijshouwer (2020). Two features of the prior predictive p-value to test replication stand out. First, the prior predictive p-value makes use of a predictive distribution given the original study results, and the new study results are compared to the predicted data. A Bayes factor, on the other hand, weighs the evidence for two competing hypotheses in the new study as it actually occurred, but does not take study variation into account. Second, to compare the new study with the predicted data, we consider relevant features of the original study. While most other methods evaluate the replication of a simple effect size, relevant features can be any constraint or set of constraints of the form Rµ_r > 0, which seamlessly connects to the research objective of most ANOVA studies. With the ANOVAreplication R-package, which includes a vignette as a tutorial, and the interactive application (see osf.io/6h8x3), we provide researchers with an easy-to-use test for the replication of ANOVA features. The availability of the prior predictive p-value to test replications can further promote the trend to conduct more replication studies in the field of psychology.

Author Contact

Correspondence concerning this article should be addressed to Mariëlle Zondervan-Zwijnenburg, Department of Methods and Statistics, Utrecht University, Padualaan 14, 3584CH Utrecht. E-mail: M.A.J.Zwijnenburg@uu.nl.

Conflict of Interest and Funding

Funding for this research is described in the Acknowledgements section.

Author Contributions

MZ and HH were involved in the initial research design. MZ drafted and revised the article in collaboration with HH. MZ developed the interactive application, conducted the simulation studies, and conducted the analyses. RS provided additional feedback and evaluated the interactive application. All authors approved the final manuscript.
The first author (MZ) was the main author and the last author (HH) was the main supervisor on this project.

Acknowledgements

We would like to thank Meta-Psychology editor dr. Felix Schönbrodt, and reviewers dr. Matt Williams and dr. Zoltan Dienes for their helpful feedback on this manuscript.

The first and third author are supported by the Consortium Individual Development (CID), which is funded through the Gravitation program of the Dutch Ministry of Education, Culture, and Science and the Netherlands Organization for Scientific Research (NWO grant number 024.001.003). The second author is supported by a VIDI grant from the Netherlands Organization for Scientific Research (NWO grant number 452.14.006).

Open Science Practices

This article earned the Open Materials badge for making the materials openly available. This article is a methods article and did not include any new data, and it was not pre-registered. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Anderson, S. F., & Maxwell, S. E. (2016). There’s more than one way to conduct a replication study: Beyond statistical significance. Psychological Methods, 21(1), 1–12. https://doi.org/10.1037/met0000051

Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., ..., & Wicherts, J. M. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27(2), 108–119. https://doi.org/10.1002/per.1919

Box, G. E. (1980). Sampling and Bayes’ inference in scientific modelling and robustness. Journal of the Royal Statistical Society. Series A (General), 143(4), 383–430. https://doi.org/10.2307/2982063

Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., Grange, J. A., ..., & Van’t Veer, A. (2014). The replication recipe: What makes for a convincing replication? Journal of Experimental Social Psychology, 50, 217–224. https://doi.org/10.1016/j.jesp.2013.10.005

Chandler, J. (2015). Replication of Janiszewski & Uy (2008, PS, study 4b). Open Science Framework. osf.io/aaudl

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. https://doi.org/10.4324/9780203771587

Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3(4), 286–300. https://doi.org/10.1111/j.1745-6924.2008.00079.x

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966

Earp, B. D., & Trafimow, D. (2015). Replication, falsification, and the crisis of confidence in social psychology. Frontiers in Psychology, 6, 621. https://doi.org/10.3389/fpsyg.2015.00621

Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., Baranski, E., Bernstein, M. J., Bonfiglio, D. B., Boucher, L., Brown, E. R., Budiman, N. I., Cairo, A. H., Capaldi, C. A., Chartier, C. R., Chung, J. M., Cicero, D. C., Coleman, J. A., Conway, J.
G., . . . Nosek, B. A. (2016). Many labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82. https://doi.org/10.1016/j.jesp.2015 .10.012 Errington, T., Tan, F., Lomax, J., Perfito, N., Iorns, E., Gunn, W., & Lehman, C. (2019). Reproducibility project: Cancer biology. osf.io/e81xl Fischer, P., Greitemeyer, T., & Frey, D. (2008). Self- regulation and selective exposure: The impact of depleted self-regulation resources on con- firmatory information processing. Journal of Personality and Social Psychology, 94(3), 382. https://doi.org/10.1037/0022-3514.94 .3.382 Furr, R. M., & Rosenthal, R. (2003). Repeated-measures contrasts for "multiple-pattern" hypotheses. Psychological Methods, 8(3), 275–293. https: //doi.org/10.1037/1082-989X.8.3.275 Galliani, E. (2015). Replication report of Fischer, Greite- meyer, and Frey (2008, JPSP, study 2). https: //osf.io/j8bpa Gelman, A., Meng, X.-L., & Stern, H. (1996). Posterior predictive assessment of model fitness via real- ized discrepancies. Statistica Sinica, 6(4), 733– 760. Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statisti- cian, 60(4), 328–331. https://doi.org/10 .1198/000313006x152649 Harms, C. (2018). A bayes factor for replications of anova results. The American Statistician. https: //doi.org/10.1080/00031305.2018.1518787 Hedges, L. V., & Olkin, I. (1980). Vote-counting meth- ods in research synthesis. Psychological Bulletin, 88(2), 359. https://doi.org/10.1037/0033 -2909.88.2.359 Hoijtink, H. (2012). Informative hypotheses: Theory and practice for behavioral and social scientists. CRC Press. https://doi.org/10.1201/b11158 Hoijtink, H., Mulder, J., van Lissa, C., & Gu, X. (2019). A tutorial on testing hypotheses using the bayes factor. Psychological Methods, 24(5), 539–556. https://doi.org/10.1037/met0000201 Janiszewski, C., & Uy, D. (2008). Precision of the anchor influences the amount of adjustment. Psycho- logical Science, 19(2), 121–127. https://doi .org/10.1111/j.1467-9280.2008.02057.x Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Bahnık, Š., Bernstein, M. J., Bocian, K., Brandt, M. J., Brooks, B., Brumbaugh, C. C., Cemalcilar, Z., Chandler, J., Cheong, W., Davis, W. E., De- vos, T., Eisner, M., Frankowska, N., Furrow, D., Galliani, E. M., . . . Nosek, B. A. (2014). Investi- gating variation in replicability: A ’many labs’ replication project. Social Psychology, 45(3), 142–152. https://doi.org/10.1027/1864 -9335/a000178 Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Alper, S., Aveyard, M., Axt, J. R., Babalola, M. T., Bahnık, Š., Batra, R., Ber- kics, M., Bernstein, M. J., Berry, D. R., Bialo- brzeska, O., Binan, E. D., Bocian, K., Brandt, M. J., Busching, R., . . . Nosek, B. A. (2018). Many labs 2: Investigating variation in replica- bility across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490. https://doi.org/10.1177/ 2515245918810225 Ledgerwood, A. (2014). Introduction to the special section on advancing our methods and prac- tices. Perspectives on Psychological Science, 9(3), 275–277. https : / / doi .org / 10 .1177 / 1745691613513470 Ly, A., Etz, A., Marsman, M., & Wagenmakers, E.-J. (2018). Replication bayes factors from evi- dence updating. 
Behavior Research Methods, 1–11. https://doi.org/10.3758/s13428-018-1092-x

Marsman, M., Schönbrodt, F. D., Morey, R. D., Yao, Y., Gelman, A., & Wagenmakers, E.-J. (2017). A Bayesian bird’s eye view of ‘replications of important results in social psychology’. Royal Society Open Science, 4(1), 160426. https://doi.org/10.1098/rsos.160426

Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics, 22(3), 1142–1160. https://doi.org/10.1214/aos/1176325622

Morey, R. D., & Lakens, D. (2019). Why most of psychology is statistically unfalsifiable. https://doi.org/10.5281/zenodo.838685

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14(9), 1105–1107. https://doi.org/10.1038/nn.2886

Open Science Collaboration. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7(6), 657–660. https://doi.org/10.1177/1745691612462588

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251). https://doi.org/10.1126/science.aac4716

Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ introduction to the special section on replicability in Psychological Science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530. https://doi.org/10.1177/1745691612465253

Patil, P., Peng, R. D., & Leek, J. T. (2016). What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspectives on Psychological Science, 11(4), 539–544. https://doi.org/10.1177/1745691616646366

Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13(2), 90–100. https://doi.org/10.1037/a0015108

Silvapulle, M. J., & Sen, P. K. (2005). Constrained statistical inference: Order, inequality, and shape constraints (Vol. 912). John Wiley & Sons. https://doi.org/10.1002/9781118165614
Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results. Psychological Science, 26(5), 559–569. https://doi.org/10.1177/0956797614567341

Van Aert, R. C., & Van Assen, M. A. (2017). Examining reproducibility in psychology: A hybrid method for combining a statistically significant original study and a replication. Behavior Research Methods, 1–25. https://doi.org/10.3758/s13428-017-0967-6

Vanbrabant, L., Van de Schoot, R., & Rosseel, Y. (2015). Constrained statistical inference: Sample-size tables for ANOVA and regression. Frontiers in Psychology, 5, 1565. https://doi.org/10.3389/fpsyg.2014.01565

Verhagen, J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology: General, 143(4), 1457–1475. https://doi.org/10.1037/a0036731

Zondervan-Zwijnenburg, M. A. J. (2018). ANOVAreplication: Test ANOVA replications by means of the prior predictive p-value (R package version 1.1.3). https://CRAN.R-project.org/package=ANOVAreplication

Zondervan-Zwijnenburg, M. A. J., & Rijshouwer, D. (2020). Testing replication with small samples: Applications to ANOVA. In R. van de Schoot & M. Miocevic (Eds.), Small sample size solutions: A guide for applied researchers and practitioners. Routledge.