Meta-Psychology, 2021, vol. 5, MP.2020.2529
https://doi.org/10.15626/mp.2020.2529
Article type: Guest Editorial
Published under the CC-BY 4.0 license
Open data: No
Open materials: No
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: No
Edited by: Daniël Lakens
Reviewed by: J. Heathers, Neuroskeptic
Analysis reproduced by: André Kalmendal
All supplementary files can be accessed at OSF: https://doi.org/10.17605/osf.io/38ep5

A Method to Streamline p-Hacking

Ian Hussey
Ghent University

Abstract

The analytic strategy of p-hacking has rapidly accelerated the achievement of psychological scientists' goals (e.g., publications and tenure), but has suffered a number of setbacks in recent years. To remedy this, this article presents a statistical inference measure that can greatly accelerate and streamline the p-hacking process: generating random numbers that are < .05. I refer to this approach as pointless. Results of a simulation study are presented, and an R script is provided for others to use. In the absence of systemic changes to modal p-hacking practices within psychological science (e.g., worrying trends such as preregistration and replication), I argue that vast amounts of time and research funding could be saved through the widespread adoption of this innovative approach.

Keywords: p-hacking, R, NHST

Introduction

p-hacking – updating or adjusting data or analyses in light of prior beliefs about hypotheses – has proven to be of exceptional utility to the goals of psychological scientists (e.g., acquiring high-impact publications, tenure, and paid speaking engagements). While a number of useful tutorials on p-hacking and related strategies exist (e.g., Bakker et al., 2012; Simmons et al., 2011), insightful commentators have pointed out that only those with a 'flair' for it are likely to make it in the world of psychological science (Baumeister, 2016).
However, progress has slowed in recent years due to a number of unfortunate setbacks, including wider use of replication and preregistration (e.g., Munafò et al., 2017; Open Science Collaboration, 2015) by methodological terrorists (Fiske, 2016) and data parasites (Longo & Drazen, 2016). In this article, I introduce the pointless metric and demonstrate how it can streamline the process of p-hacking your results. While this metric does suffer from the mild flaw of providing zero diagnosticity of the presence or absence of a true effect, this property is largely irrelevant to most psychological scientists' primary goal (i.e., publishability; Nosek et al., 2012). Secondary goals, such as valid and useful insights into human behaviour, are also occasionally met, albeit incidentally. More importantly, the metric possesses three superior characteristics. First, it is non-inferior to current p-hacking practices, which also tell us little about the presence or absence of a true effect (large-scale replications put this diagnosticity at no better than a coin toss; Klein et al., 2018; Open Science Collaboration, 2015). Second, it retains a far more important property of hacked p-values: by guaranteeing significant results, it maintains predictive validity for publishability. Finally, it provides economic benefits relative to the high total life-cycle costs associated with traditional p-hacking (e.g., by eliminating the need for comprehensive graduate training in either statistics or 'flair' for p-hacking).

Methods and Results

I observed that traditional approaches are relatively time consuming and inefficient (i.e., exploitation of researcher degrees of freedom until p < .05; Simmons et al., 2011). The pointless metric was inspired by the observation that, regardless of the specific p-hacking strategy employed, the product of this process is highly reliable (i.e., the statistical result "p < .05").
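That reliability can be verified by simulation before any data are collected at all: values drawn directly from the interval [0, .05) are "significant" by construction. Below is a minimal sketch of such a simulation, written in Python for illustration (the article's own one-line script is in R); the function name and seed are of course arbitrary.

```python
import random

# Hypothetical Python port of the pointless procedure; illustrative only.
rng = random.Random(0)

def p_pointless():
    """Generate a 'significant' p-value directly, skipping the
    traditional p-hacking pipeline entirely."""
    return rng.random() * 0.05  # uniform on [0, 0.05)

# Simulation: by construction, every draw is significant.
draws = [p_pointless() for _ in range(10_000)]
print(all(p < 0.05 for p in draws))  # True
```

Note the design choice of multiplying a [0, 1) draw by .05 rather than sampling on a closed interval, which guarantees the strict inequality p < .05 in every draw.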
As such, many intermediary steps are arguably unnecessary, and the same end result can be obtained more efficiently by automation. This is accomplished by generating a random number that is < .05. I recommend that researchers refer to this statistical inference procedure as a form of machine learning to increase their chances of getting published. R code to calculate pointless is provided below:

p_pointless <- runif(n = 1, min = 0, max = .05)

… .350). Similarly, judgments on non-status-related dimensions were also not affected by the choice size manipulation, F(2, 412) = 0.94, p = .391, η²p = 0.005 (Msmall = 4.86, SE = 0.09; Mmedium = 4.77, SE = 0.08; Mlarge = 4.70, SE = 0.08; for comparisons, all ps > .170).

Tunca, Ziano, and Xu

Table 1
Means and standard deviations (in parentheses) for status- and nonstatus-related dimensions across experimental conditions.

                      Status-related dimensions                    Nonstatus-related dimensions
                      High status  Respected    Status (comb.)    Honest       Nice         Attractive   Nonstatus (comb.)
Coffee
  Small (n = 40)      4.47 (1.32)  5.10 (1.06)  4.79 (0.93)       5.42 (1.11)  5.45 (1.22)  4.72 (1.11)  5.20 (0.84)
  Medium (n = 49)     4.35 (1.35)  4.63 (1.27)  4.49 (1.24)       4.96 (1.22)  4.86 (1.32)  4.47 (1.26)  4.76 (1.14)
  Large (n = 48)      4.21 (1.29)  4.42 (1.18)  4.31 (1.14)       4.90 (1.26)  4.56 (1.13)  4.31 (1.07)  4.59 (0.97)
Smoothie
  Small (n = 45)      3.98 (1.36)  4.64 (1.17)  4.31 (1.15)       5.00 (1.04)  4.91 (1.02)  4.67 (1.26)  4.86 (0.91)
  Medium (n = 43)     4.33 (1.34)  4.65 (1.31)  4.49 (1.21)       4.98 (1.26)  4.93 (1.08)  4.51 (1.26)  4.81 (1.03)
  Large (n = 46)      4.07 (1.20)  4.57 (1.42)  4.32 (1.11)       4.98 (1.29)  4.89 (1.27)  4.33 (1.28)  4.73 (1.10)
Pizza
  Small (n = 43)      4.21 (1.17)  4.44 (1.03)  4.33 (0.99)       4.72 (1.18)  4.58 (0.96)  4.37 (0.93)  4.56 (0.77)
  Medium (n = 53)     4.34 (1.33)  4.58 (1.18)  4.46 (1.17)       4.98 (1.32)  4.83 (1.22)  4.40 (1.23)  4.74 (1.13)
  Large (n = 48)      4.33 (1.46)  4.56 (1.11)  4.45 (1.12)       4.92 (1.18)  4.90 (1.06)  4.50 (1.15)  4.77 (0.98)
Average
  Small (n = 128)     4.21 (1.29)  4.72 (1.12)  4.47 (1.05)       5.04 (1.14)  4.97 (1.12)  4.59 (1.11)  4.87 (0.88)
  Medium (n = 145)    4.39 (1.33)  4.62 (1.24)  4.48 (1.20)       4.97 (1.26)  4.87 (1.21)  4.46 (1.24)  4.77 (1.10)
  Large (n = 142)     4.20 (1.32)  4.51 (1.24)  4.36 (1.12)       4.93 (1.24)  4.78 (1.16)  4.38 (1.17)  4.70 (1.01)

Super-Size Me: An Unsuccessful Preregistered Replication

Table 2
ANOVA summary table for the effects of product type, portion size, and status dimensions.

                                  F        p        η²p
Within-subjects effects
  Dimension                       91.33    < .001   0.18
  Dimension × Size                0.78     .461     < 0.01
  Dimension × Product             1.57     .209     0.01
  Dimension × Size × Product      0.54     .708     0.01
  Residual
Between-subjects effects
  Size                            0.74     .476     < 0.01
  Product                         0.73     .482     < 0.01
  Size × Product                  1.58     .179     0.02
  Residual

Figure 1
Comparisons of status- and nonstatus-related dimensions across different product sizes.

In Table 3, we also present a comparison between the results of the original study and the replication study, including the replication classification of LeBel et al. (2019). Although the original study reported rather large effects of product size on status perceptions, in the replication study the effects were nonsignificant and in the opposite of the hypothesized direction (status perceptions were lowest in the large product size condition).

Table 3
Comparisons for status-related dimensions between the replication and original study.
Note. See the supplementary materials for a note on how the original effect sizes were recalculated.

Supplementary Bayesian Analyses

Because frequentist statistics and the interpretation of p-values are generally not informative for quantifying evidence for the null hypothesis, Bayesian analyses are recommended for establishing evidence of absence (Keysers et al., 2020). We therefore supplemented our analyses with Bayes factors, which quantify the plausibility of the observed data under models embodying the null versus the alternative hypothesis. These analyses were not preregistered.
The open-source software JASP was used to conduct the Bayesian ANOVAs reported in this section (van den Bergh et al., 2020; JASP Team, 2020). First, we ran a Bayesian repeated-measures ANOVA for the 3 (size: small, medium, large) × 2 (dimension: status, nonstatus) mixed model. Table 4 compares the likelihood of all possible models relative to the null model; however, given that our focus is on the predictive performance of each component, the analysis of effects presented in Table 5 is more informative (van den Bergh et al., 2020; Keysers et al., 2020). To generate the analysis of effects, we used the recommended 'matched models' option, which, in our case, compares the model with the interaction effect only against the models that exclude the interaction, thereby providing a more conservative estimate of each factor's contribution (Keysers et al., 2020; Mathôt, 2017). As seen in Table 5, the data are 16 times more likely to occur under the models that exclude the interaction effect, which constitutes positive evidence for the null hypothesis with respect to Kass and Raftery's (1995) reference values.

We also conducted a Bayesian ANOVA to examine the effect of portion size on the status dimensions only, as in the main analyses. As seen in Tables 6 and 7, the data are about 24 times more likely to occur under the models excluding the portion-size effect, thereby providing strong evidence for the null hypothesis that portion-size choices are not associated with status perceptions. We further illustrated this lack of evidence by plotting the model with the portion-size effect. As seen in Figure 2, the 95% credible intervals for the different levels of portion size overlap substantially, indicating no differences among the levels.

                    Replication study                            Original study
Size comparison     Mdiff   t(412)   p      Cohen's d [95% CI]   Mdiff   t(182)   p      Cohen's d   Recalc. d [95% CI]   Classification*
Large vs. small     -0.11   -0.77    .442   -0.10 [-0.34, 0.14]  1.95    4.66     .001   1.10        1.49 [1.09, 1.89]    No signal – inconsistent
Large vs. medium    -0.12   -0.90    .367   -0.10 [-0.34, 0.14]  1.19    2.95     .01    0.65        0.89 [0.52, 1.26]    No signal – inconsistent
Medium vs. small     0.01    0.11    .916    0.01 [-0.23, 0.25]  0.76    2.27     .05    0.46        0.62 [0.26, 0.98]    No signal – inconsistent

* Replication classification according to LeBel et al. (2019).

Table 4
Model comparison for all models under consideration for the replication data. (The table itself appears after the note on the effect sizes below.)

Table 5
Analysis of effects of individual factors.

Effects             P(incl)   P(incl|data)   BFexcl
Dimension           0.400     0.993          3.671e-17
Size                0.400     0.114          7.731
Dimension × Size    0.200     0.007          16.241

Note. Compares models that contain the effect to equivalent models stripped of the effect. Higher-order interactions are excluded.

Table 6
Comparison of the portion-size effect model with the null model for the replication data.

Models       P(M)    P(M|data)   BFM      BF01     error %
Null model   0.500   0.960       23.899   1.000
Size         0.500   0.040       0.042    23.899   0.025

Table 7
Analysis of the effect of portion size.

Effects   P(incl)   P(incl|data)   BFexcl
Size      0.500     0.040          23.899

Figure 2
Model-averaged posterior distributions for the portion-size effect model.

Note on the Effect Sizes

In order to compare the effect sizes, we recalculated the original effect sizes from the reported F-values using two independent tools (Lakens, 2013; Uanhoro, 2017). For both the interaction and the effect of product size on status, we found smaller effect sizes and slightly higher p-values than the original. We compared effect sizes in the present replication, effect sizes reported in the original paper, and effect sizes we recalculated based on the summary statistics provided in the original paper, and we provide 90% confidence intervals (as is customary with η²p, which cannot be smaller than 0). The original interaction was F(1, 177) = 4.06, original p = .03, recalculated p = .045, original η²p = 0.05, but we recalculated it as η²p = 0.02, 90% CI [0.0002, 0.07].
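The recalculation from reported F-values rests on the identity η²p = (F · df1) / (F · df1 + df2). A minimal sketch of that arithmetic (in Python for illustration; the authors themselves used the tools of Lakens, 2013, and Uanhoro, 2017, which also return p-values and confidence intervals):

```python
def partial_eta_squared(f, df1, df2):
    """Recover partial eta squared from a reported F statistic:
    eta_p^2 = (F * df1) / (F * df1 + df2)."""
    return (f * df1) / (f * df1 + df2)

# Original interaction, F(1, 177) = 4.06:
print(round(partial_eta_squared(4.06, 1, 177), 2))   # 0.02, not the reported 0.05
# Original size main effect, F(1, 177) = 10.22:
print(round(partial_eta_squared(10.22, 1, 177), 2))  # 0.05, not the reported 0.10
```

Reproducing the recalculated p-values additionally requires the F distribution's survival function (e.g., `scipy.stats.f.sf`), which is omitted here to keep the sketch dependency-free.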
In our replication, we obtained the following results: F(2, 412) = 0.83, p = .435, η²p < 0.01, 90% CI [0, 0.017]. The original main effect of product size on status inferences yielded F(1, 177) = 10.22, original p = .001, recalculated p = .002, original η²p = 0.10, recalculated η²p = 0.05, 90% CI [0.012, 0.12]. Note that the discrepancy in the recalculated p could be due to different rounding. Our replicated effect size on status was F(2, 412) = 0.48, p = .620, η²p < 0.01, 90% CI [0, 0.012]. The same can be said about the Cohen's ds reported on page 1051 of the original article: recalculating them yields much larger effect sizes (see supplementary materials for details), which we report in Table 3.

Table 4

Models                                 P(M)    P(M|data)   BFM         BF01        error %
Null model (incl. subject)             0.200   3.271e-17   1.308e-16   1.000
Dimension                              0.200   0.879       29.132      3.720e-17   0.939
Dimension + Size                       0.200   0.114       0.513       2.876e-16   6.109
Dimension + Size + Dimension × Size    0.200   0.007       0.028       4.671e-15   8.726
Size                                   0.200   3.741e-18   1.497e-17   8.742       5.093

Note. All models include subject.

General Discussion

In this work, we failed to replicate the first experiment of Dubois et al. (2012), which found that consumers associated larger portion choices with higher status. What could be the reasons for this replication failure? First, although we conducted a direct replication in terms of materials and procedure, one major difference from the original study was the sample. The original experiment was based on 183 undergraduate students (average age not available); the replication was based on 415 MTurk participants (Mage = 37.9). Assuming that the undergraduate sample was much younger, age might have influenced the results, such that students could associate mundane products like coffee, smoothies, or pizza with status while older consumers could not.
Second, another possible explanation is that students on average have lower socioeconomic status, and given the link between low socioeconomic status and a higher propensity for status consumption (Dubois & Ordabayeva, 2015), students may have been more likely to associate larger portion sizes with status. However, there are plenty of successful replications in which an original finding obtained with students was successfully reproduced in MTurk samples (Ziano, Wang, et al., 2021; Ziano, Yao, et al., 2020). A third explanation could be the association between status and health behaviors, which have long been linked to higher socioeconomic status (Pampel et al., 2010). In particular, healthy food consumption is prevalent among the middle and upper social classes, while consumption of unhealthy choices such as fast food is common among the lower social classes (Hupkens et al., 2000; Pechey & Monsivais, 2016). This connection between status and health behaviors has been further amplified in today's popular culture. For instance, several famous social media influencers portray a wealthy lifestyle coupled with healthy behaviors such as eating well, meditating, and exercising (Vaterlaus et al., 2015). Consequently, it is possible that consumers today perceive larger portion choices to be unhealthy and do not associate such unhealthy behaviors with status. The findings of Dubois et al. (2012) have been greatly influential in the marketing literature; yet our preregistered direct replication casts doubt on the reliability of the relationship between larger food portions and status. We therefore strongly recommend conducting further preregistered conceptual and direct replications to ascertain whether larger portions in fact signal higher status. Obesity and other health problems related to excessive food consumption have significant consequences; thus, scientific research findings that inform policymaking in these areas must be robust.
The postulation that consumers signal status by choosing larger portions is certainly novel and worthwhile to examine. Nevertheless, we conclude that the evidence for this postulation remains inconclusive until further replications are available in the literature.

Author Contact

Burak Tunca, Department of Business Administration, Lund University School of Economics and Management. burak.tunca@fek.lu.se, https://orcid.org/0000-0001-6381-2979
Ignazio Ziano (corresponding author), Department of Marketing, Grenoble Ecole de Management. ignazio.ziano@grenoble-em.com, https://orcid.org/0000-0002-4957-3614
Wenting Xu, Department of Marketing, Grenoble Ecole de Management. wenting.xu@grenoble-em.com

Conflict of Interest and Funding

We have no conflict of interest or specific funding source to declare.

Author Contributions

Ignazio supervised Wenting Xu's master's thesis. Burak verified the analyses and conclusions, performed new ones, and completed the manuscript submission draft. Ignazio and Burak jointly finalized the manuscript for submission. Wenting Xu conducted the replication as part of her "grand mémoire" (master's thesis) at Grenoble Ecole de Management during the academic year 2018-19.

Open Science Practices

This article earned the Preregistration+, Open Data, and Open Materials badges for preregistering the hypothesis and analysis before data collection, and for making the data and materials openly available. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

van den Bergh, D., van Doorn, J., Marsman, M., Draws, T., van Kesteren, E.-J., Derks, K., Dablander, F., Gronau, Q. F., Kucharský, Š., Gupta, A. R. K. N., Sarafoglou, A., Voelkel, J. G., Stefan, A., Ly, A., Hinne, M., Matzke, D., & Wagenmakers, E.-J. (2020). A tutorial on conducting and interpreting a Bayesian ANOVA in JASP. L'Année Psychologique, 120(1), 73–96.

Dubois, D., & Ordabayeva, N. (2015). Social hierarchy, social status, and status consumption. In The Cambridge Handbook of Consumer Psychology (pp. 332–367). Cambridge University Press. https://doi.org/10.1017/cbo9781107706552.013

Dubois, D., Rucker, D. D., & Galinsky, A. D. (2012). Super size me: Product size as a signal of status. Journal of Consumer Research, 38(6), 1047–1062. https://doi.org/10.1086/661890

Field, S. M., Hoekstra, R., Bringmann, L., & van Ravenzwaaij, D. (2019). When and why to replicate: As easy as 1, 2, 3? Collabra: Psychology, 5(1), 46. https://doi.org/10.1525/collabra.218

Grewal, D. (2011). When you try to buy status, it can backfire. Scientific American. https://www.scientificamerican.com/article/understanding-lure-trap-luxury-goods/

Hupkens, C. L. H., Knibbe, R. A., & Drop, M. J. (2000). Social class differences in food consumption: The explanatory value of permissiveness and health and cost considerations. European Journal of Public Health, 10(2), 108–113. https://doi.org/10.1093/eurpub/10.2.108

JASP Team. (2020). JASP (Version 0.13.1). https://jasp-stats.org

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.2307/2291091

Keysers, C., Gazzola, V., & Wagenmakers, E.-J. (2020). Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence. Nature Neuroscience, 23(7), 788–799. https://doi.org/10.1038/s41593-020-0660-4

Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4. https://doi.org/10.3389/fpsyg.2013.00863

LeBel, E. P., Vanpaemel, W., Cheung, I., & Campbell, L. (2019). A brief guide to evaluate replications. Meta-Psychology, 3. https://doi.org/10.15626/mp.2018.843

Lynch, J. G., Bradlow, E. T., Huber, J. C., & Lehmann, D. R. (2015). Reflections on the replication corner: In praise of conceptual replications. International Journal of Research in Marketing, 32(4), 333–342. https://doi.org/10.1016/j.ijresmar.2015.09.006

Mathôt, S. (2017, May 15). Bayes like a baws: Interpreting Bayesian repeated measures in JASP. https://www.cogsci.nl/blog/interpreting-bayesian-repeated-measures-in-jasp

Morey, R. D., Rouder, J. N., Jamil, T., Urbanek, S., Forner, K., & Ly, A. (2018). BayesFactor: Computation of Bayes factors for common designs (0.9.12-4.2) [Computer software]. https://cran.r-project.org/package=BayesFactor

Pampel, F. C., Krueger, P. M., & Denney, J. T. (2010). Socioeconomic disparities in health behaviors. Annual Review of Sociology, 36, 349–370. https://doi.org/10.1146/annurev.soc.012809.102529

Pechey, R., & Monsivais, P. (2016). Socioeconomic inequalities in the healthiness of food choices: Exploring the contributions of food expenditures. Preventive Medicine, 88, 203–209. https://doi.org/10.1016/j.ypmed.2016.04.012

Simons, D. J. (2014). The value of direct replication. Perspectives on Psychological Science, 9(1), 76–80. https://doi.org/10/f5r8rj

Steenhuis, I., & Poelman, M. (2017). Portion size: Latest developments and interventions. Current Obesity Reports, 6(1), 10–17. https://doi.org/10.1007/s13679-017-0239-x

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9(1), 59–71. https://doi.org/10.1177/1745691613514450

Uanhoro, J. (2017). Effect size calculators. https://effect-size-calculator.herokuapp.com/

Vandenbroele, J., Van Kerckhove, A., & Zlatevska, N. (2019). Portion size effects vary: The size of food units is a bigger problem than the number. Appetite, 140, 27–40. https://doi.org/10.1016/j.appet.2019.04.025

Vaterlaus, J. M., Patten, E. V., Roche, C., & Young, J. A. (2015). #Gettinghealthy: The perceived influence of social media on young adult health behaviors. Computers in Human Behavior, 45, 151–157. https://doi.org/10.1016/j.chb.2014.12.013

Villarica, H. (2011, November 4). Study of the day: What that venti coffee really says about you. The Atlantic. https://www.theatlantic.com/health/archive/2011/11/study-of-the-day-what-that-venti-coffee-really-says-about-you/247864/

Warren, J. (2011, November 3). A new linkage offers possibilities in the anti-obesity campaign. The New York Times. https://www.nytimes.com/2011/11/04/us/a-new-linkage-offers-possibilities-in-the-anti-obesity-campaign.html

Ziano, I., Wang, Y. J., Sany, S. S., Feldman, G., Ho, N. L., Lau, Y. K., Bhattal, I. K., Keung, P. S., Nora, Tong, Z., Cheng, B., & Chan, H. Y. C. (2021). Perceived morality of direct versus indirect harm: Replications of the preference for indirect harm effect. Meta-Psychology, 5. https://doi.org/10.15626/mp.2019.2134

Ziano, I., Yao, D., Gao, Y., & Feldman, G. (2020). Impact of ownership on liking and value: Replications and extensions of three ownership effect experiments. Journal of Experimental Social Psychology. https://doi.org/10.13140/rg.2.2.16962.84163/3

Zlatevska, N., Dubelaar, C., & Holden, S. S. (2014). Sizing up the effect of portion size on consumption: A meta-analytic review. Journal of Marketing, 78(3), 140–154. https://doi.org/10.1509/jm.12.0303

Meta-Psychology, 2021, vol. 5, MP.2019.1778
https://doi.org/10.15626/mp.2019.1778
Article type: Replication Report
Published under the CC-BY 4.0 license
Open data: Yes
Open materials: Yes
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: Yes
Edited by: Rickard Carlsson
Reviewed by: L. O'Brien, Å. Innes-Ker, U. Schimmack
Analysis reproduced by: R.
Carlsson
All supplementary files can be accessed at OSF: https://doi.org/10.17605/osf.io/kx3ej

Perceived Discrimination Against Black Americans and White Americans

L. J. Zigerell
Illinois State University

Abstract

A widely-cited study reported evidence that White Americans gave higher ratings of how much Whites are the victims of discrimination in the United States than of how much Blacks are the victims of discrimination in the United States. However, far fewer than half of White Americans rated discrimination against Whites in the United States today as greater or more frequent than discrimination against Blacks in the United States today, in data from the American National Election Studies 2012 Time Series Study and in preregistered analyses of data from the American National Election Studies 2016 Time Series Study and from a 2017 national nonprobability survey. Given that relative discrimination against Black Americans is a compelling justification for policies to reduce Black disadvantage, results from these three surveys suggest that White Americans' policy preferences have much potential to move in a direction that disfavors programs intended to reduce Black disadvantage.

Keywords: race, discrimination, perceptions

From slavery through Reconstruction through the civil rights movement and modern times, race has been the "most difficult subject" in the United States (Kinder and Sanders, 1996, p. 11). The U.S. population has become more racially diverse, but the comparison of the treatment of Black Americans to the treatment of White Americans has remained an important comparison for assessing racial discrimination in the United States.
Empirical evidence of discrimination against Black Americans (e.g., Quillian et al., 2017) can be balanced at least partly against evidence of discrimination against White Americans (e.g., Axt et al., 2016), as can prominent claims of discrimination in particular domains, such as police disproportionately searching Black Americans (LaFraniere and Lehren, 2015) or affirmative action in college admissions disadvantaging White Americans (Hurley, 2016). The presence of evidence of discrimination against Black Americans and of evidence of discrimination against White Americans raises the question of the direction and size of the net balance of Black/White discrimination in the United States. Perceptions of this balance have the potential to influence legal outcomes and support for race-targeted programs (Carter and Murphy, 2015, p. 274). Survey results reported in Norton and Sommers (2011) indicated that White Americans now perceive this balance of discrimination to disfavor Whites, a finding that has been cited in media outlets such as the New York Times (2011) and NPR (2011) and has been frequently cited in social science publications (e.g., Todd et al., 2012; Mayrl and Saperstein, 2013; Cabrera, 2014; Hughey, 2014; Wilkins et al., 2015; Major and Kaiser, 2017; West and Eaton, 2019). However, the inference from Norton and Sommers (2011) that "Whites have now come to view anti-White bias as a bigger societal problem than anti-Black bias" (p. 215) might be due to the research design of that study. Participants in the Norton and Sommers (2011) study were asked to "indicate how much you think [Blacks/Whites] [were/are] the victims of discrimination in the United States in each of the following decades" (p. 217), with participants reporting perceptions about discrimination against Blacks in each decade from the 1950s to the 2000s and then reporting perceptions about discrimination against Whites in each decade from the 1950s to the 2000s (p. 218).
Consecutively rating discrimination against Blacks across a string of decades might have led participants to anchor each rating on their perception of discrimination against Blacks in the immediately prior decade, rather than on their perception of discrimination against Whites in the corresponding decade. The same phenomenon might have occurred when participants subsequently rated discrimination against Whites: participants might have focused on making their ratings of discrimination against Whites sensible across decades, instead of sensible in comparison to their ratings for Blacks in a given decade. Comparing participant ratings of discrimination against Blacks in the 2000s to participant ratings of discrimination against Whites in the 2000s might thus produce an incorrect inference about relative perceived discrimination in the 2000s. To assess whether White Americans really do perceive there to be more discrimination in the United States today against Whites than against Blacks, the three studies below report on data from large-sample surveys that permit more straightforward research designs, which focus participants on the contemporary time period and/or on direct comparisons of discrimination against Blacks and Whites in the United States today.

Study 1: ANES 2012 Time Series Study [Non-Preregistered]

Data were from the American National Election Studies (ANES) 2012 Time Series Study (American National Election Studies, 2016a). The author's institutional review board does not require review and approval for analysis of de-identified datasets such as the ANES 2012 Time Series Study.

Participants

Data analysis was limited to participants coded as non-Hispanic White who provided substantive responses to the analyzed items. Per ANES documentation (2016b, p. 7): the target population for the survey was adult U.S.
citizens; the key item is from the post-election interviews, which were conducted between 7 November 2012 and 24 January 2013; and the estimated AAPOR RR1 response rates for the pre-election interviews were 38% for the face-to-face mode and 2% for the internet mode, with respective re-interview rates for the post-election interview of 94% and 93%.

Measures

The key item for this analysis is: "How much discrimination is there in the United States today against each of the following groups?". Target groups, presented in random order, included Blacks and Whites. Response options were: "a great deal", "a lot", "a moderate amount", "a little", and "none at all". The key item was used to construct three dichotomous variables, respectively coded 1 if a participant rated discrimination against Blacks greater than discrimination against Whites, rated discrimination against Blacks equal to discrimination against Whites, or rated discrimination against Whites greater than discrimination against Blacks. The 15 non-Hispanic White post-election interview cases without a substantive rating of both discrimination against Blacks and discrimination against Whites were excluded from the analysis, producing a sample size of 3,260 non-Hispanic Whites.

Results

For non-Hispanic White participants, weighted point estimates and 95% confidence intervals indicated that 54% [51, 56] rated discrimination against Blacks greater than discrimination against Whites, 37% [35, 39] rated discrimination against Blacks equal to discrimination against Whites, and 10% [9, 11] rated discrimination against Whites greater than discrimination against Blacks.
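The coding and weighting just described can be sketched in a few lines. The data, category labels, and weights below are toy values chosen for illustration, not actual ANES records; the real analysis used the survey's design weights.

```python
# Illustrative sketch of the dichotomous coding described above.
# Ratings run 0 = "none at all" through 4 = "a great deal".

def code_comparison(rating_blacks, rating_whites):
    """Collapse two 5-point ratings into one of three mutually
    exclusive categories (the three dichotomous indicators)."""
    if rating_blacks > rating_whites:
        return "blacks_greater"
    if rating_blacks == rating_whites:
        return "equal"
    return "whites_greater"

def weighted_share(rows, category):
    """Weighted proportion of respondents falling into `category`."""
    total = sum(w for _, _, w in rows)
    hits = sum(w for b, wt, w in rows if code_comparison(b, wt) == category)
    return hits / total

# Toy rows: (rating_blacks, rating_whites, survey_weight)
rows = [(4, 1, 1.2), (2, 2, 0.8), (1, 3, 1.0), (3, 1, 0.9)]
print(round(weighted_share(rows, "blacks_greater"), 3))  # 0.538
```

Because the three categories are exhaustive and mutually exclusive, their weighted shares sum to 1, matching the three reported percentages.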
The finding that only a small percentage of non-Hispanic Whites rated discrimination against Whites greater than discrimination against Blacks held when the analysis was limited to participants who responded online, a mode in which concern about social desirability biasing responses is lessened: 11% [10, 13] of non-Hispanic Whites with substantive responses to the discrimination items reported greater perceived discrimination against Whites than against Blacks. Moreover, weighted analyses of the general discrimination item, coded from 0 for discrimination rated "none at all" to 1 for discrimination rated "a great deal", indicated that, among non-Hispanic White participants in the online survey, the respective mean ratings were 0.31 and 0.48 for discrimination against Whites and discrimination against Blacks, with respective means of 0.24 and 0.73 among non-Hispanic Black participants.

Study 2: ANES 2016 Time Series Study [Confirmatory, with Modifications Indicated]

In assessing whether the Study 1 finding replicated in the ANES 2016 Time Series Study, data analyses followed a plan preregistered at the Open Science Framework (https://osf.io/n7z4a). Data were from the ANES 2016 Time Series Study (American National Election Studies, 2018a). The author's institutional review board does not require review and approval for analysis of de-identified datasets such as the ANES 2016 Time Series Study.

Hypotheses

The ANES 2016 Time Series Study included the Study 1 item and items regarding police and federal government discrimination against Blacks relative to Whites. The corresponding preregistered hypotheses were:

1. H1 [directional]: White Americans will report perceiving more discrimination in the United States today against Blacks than against Whites.
2. H2 [directional]: White Americans will report perceiving that the police treat Whites better than Blacks.
3.
h3 [non-directional]: white americans might or might not report perceiving that the federal government treats blacks better than whites.

h2 is directional and reflects the expectation that participants will perceive more police discrimination against blacks than against whites, given factors such as then-recent prominent media coverage of police shootings of black americans after the 2014 killing of michael brown in ferguson, missouri (e.g., “study finds police fatally shoot unarmed black men at disproportionate rates”, lowery, 2016). h3 is non-directional, reflecting a lack of similarly prominent media coverage suggesting anti-black discrimination by the federal government and the possibility that some participants might perceive government assistance to equally or disproportionately benefit blacks, given, for example, the association between blacks and welfare receipt (brown-iannuzzi et al., 2017). h1 is directional and reflects the expectation that the anes 2012 time series study pattern will replicate and that perceived discrimination against blacks will be greater than perceived discrimination against whites, given the relative prominence of discrimination against blacks in police shootings and other domains, coupled with black disadvantage in education (sablich, 2016) and wealth (traub et al., 2016).

participants

data analysis was limited to participants coded as non-hispanic white or non-hispanic black and who provided substantive responses to the analyzed items, which were asked in the post-election interview. per anes documentation (2018b, pp. 4-5): target populations for the surveys were adult u.s. citizens in d.c. and the 48 contiguous states for the face-to-face mode and adult u.s. citizens in d.c.
and the 50 states for the internet mode; the key items are drawn from post-election interviews, which were conducted between 9 november 2016 and 8 january 2017; and the estimated aapor rr1 response rates for pre-election interviews were 50% for the face-to-face mode and 44% for the internet mode, with respective re-interview rates for the post-election interview of 90% and 84%.

measures

the post-election interview items analyzed were:

1. “how much discrimination is there in the united states today against each of the following groups?”. target groups presented in random order included blacks and whites. response options were: “a great deal”, “a lot”, “a moderate amount”, “a little”, and “none at all”. post-election interview cases without a substantive response to both items were excluded from the analysis: 99 non-hispanic whites (4%) and 26 non-hispanic blacks (8%).

2. “in general, do the police treat whites better than blacks, treat blacks better than whites, or treat them both the same?”. response options were: “treat whites better”, “treat both the same”, and “treat blacks better”. post-election interview cases without a substantive response to this item were excluded from the analysis: 34 non-hispanic whites (1%) and 7 non-hispanic blacks (2%).

3. “in general, does the federal government treat whites better than blacks, treat blacks better than whites, or treat them both the same?”. response options were: “treat whites better”, “treat both the same”, and “treat blacks better”. post-election interview cases without a substantive response to this item were excluded from the analysis: 41 non-hispanic whites (2%) and 8 non-hispanic blacks (2%).

the software used in the analysis (statacorp, 2017) did not report standard errors or confidence intervals using the preregistered commands with weighting, because at least one stratum had a single sampling unit, so that a variance could not be estimated for that stratum.

table 1. reported perceptions of discrimination [anes 2016 time series study].

greater discrimination against blacks than whites: whites 0.66 [0.64, 0.69] [0.64, 0.68]; blacks 0.79 [0.72, 0.86] [0.71, 0.85]
equal discrimination against blacks and whites: whites 0.27 [0.25, 0.29] [0.25, 0.29]; blacks 0.19 [0.13, 0.26] [0.13, 0.27]
greater discrimination against whites than blacks: whites 0.07 [0.05, 0.08] [0.06, 0.08]; blacks 0.02 [0.00, 0.04] [0.01, 0.05]
the police treat whites better than blacks: whites 0.51 [0.48, 0.53] [0.48, 0.53]; blacks 0.83 [0.78, 0.88] [0.78, 0.88]
the police treat whites and blacks the same: whites 0.48 [0.46, 0.51] [0.46, 0.51]; blacks 0.14 [0.09, 0.19] [0.10, 0.20]
the police treat blacks better than whites: whites 0.01 [0.01, 0.02] [0.01, 0.02]; blacks 0.03 [0.00, 0.05] [0.01, 0.07]
the federal government treats whites better than blacks: whites 0.30 [0.28, 0.32] [0.28, 0.32]; blacks 0.77 [0.69, 0.84] [0.68, 0.83]
the federal government treats whites and blacks the same: whites 0.47 [0.45, 0.49] [0.45, 0.49]; blacks 0.20 [0.13, 0.28] [0.14, 0.29]
the federal government treats blacks better than whites: whites 0.23 [0.21, 0.25] [0.21, 0.25]; blacks 0.03 [0.00, 0.06] [0.01, 0.08]

note. for each group of non-hispanic participants, each entry lists the point estimate for the decimal percentage in weighted analyses based on preregistered use of the stata svy: mean command, followed by the 95% confidence interval from the stata svy: mean command with non-preregistered use of the scaled option, and then the 95% confidence interval based on non-preregistered use of the stata svy: prop command for proportions. sample sizes for the general discrimination item, the police discrimination item, and the federal government discrimination item were 2530, 2595, and 2588 for non-hispanic white participants and 316, 335, and 334 for non-hispanic black participants.
non-preregistered analyses were therefore conducted with each known available non-missing option in the software for handling weighting for strata with a single sampling unit (centered, certainty, and scaled); results are reported for the option that produced the largest standard errors, which was the scaled option.

results

results in table 1 and figure 1 indicate that, among non-hispanic whites, 66% reported greater perceived discrimination against blacks than against whites, 27% reported equal perceived discrimination against blacks and whites, and 7% reported greater perceived discrimination against whites than against blacks; respective percentages were 51%, 48%, and 1% for the police discrimination item and 30%, 47%, and 23% for the federal government item. results regarding whites’ general perceptions of discrimination and whites’ perceptions of discrimination by police were consistent with the preregistered directional hypotheses. ends of 95% confidence intervals for the paired percentages did not overlap for any of the three comparisons of the percentage that reported better treatment of black americans to the percentage that reported better treatment of white americans.

[figure 1 appears here, plotting percentages for non-hispanic whites and for non-hispanic blacks across the nine response categories.]

figure 1. non-hispanic white americans’ and non-hispanic black americans’ reported perceptions of discrimination [anes 2016 time series study]. note. error bars indicate ends of the 95% confidence intervals for weighted analyses based on the stata svy: mean command. source: anes 2016 time series study. graph produced in r (r core team, 2017) using ggplot2 (wickham, 2017).

moreover, in a non-preregistered analysis, the p-value was less than .001 for a wald test of the hypothesis that the constant in a linear regression predicting a dichotomous variable coded 1 for participants who reported greater discrimination against black americans equaled the constant in a linear regression predicting a dichotomous variable coded 1 for participants who reported greater discrimination against white americans. table 1 results also indicated that, in all three items, black americans perceived more favorable treatment for whites than for blacks, with ends of the 95% confidence intervals not overlapping for any comparison of the percentage that reported better treatment of black americans to the percentage that reported better treatment of white americans. non-preregistered weighted analyses indicated that the key inference held when limiting the analysis to participants who responded online: 7% [5, 8] of non-hispanic whites with substantive responses to the general discrimination items reported greater perceived discrimination against whites than against blacks. moreover, non-preregistered weighted analyses of the general discrimination item coded from 0 for discrimination rated at “none at all” to 1 for discrimination rated at “a great deal” indicated that, among non-hispanic white participants in the online survey, respective mean ratings were 0.27 and 0.58 for discrimination against whites and discrimination against blacks, with respective means of 0.20 and 0.79 among non-hispanic black participants.
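the wald test described above compares the constants of two intercept-only regressions fit to the same respondents. for paired 0/1 outcomes this is closely related to a z-test on the mean within-respondent difference, which can be sketched as follows. this toy version ignores survey weights and design effects, so it is an illustration of the logic rather than a reproduction of the stata analysis.

```python
# hedged sketch: test that the proportion reporting greater discrimination
# against blacks equals the proportion reporting greater discrimination
# against whites, using the paired within-respondent differences.
import math
from statistics import NormalDist

def paired_proportion_z_test(more_blacks, more_whites):
    """z-test on the mean of paired differences between two 0/1 codes."""
    diffs = [a - b for a, b in zip(more_blacks, more_whites)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var / n)
    z = mean / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# toy data, not anes data: mutually exclusive 0/1 codes per respondent
more_blacks = [1] * 30 + [0] * 10
more_whites = [0] * 35 + [1] * 5
z, p_value = paired_proportion_z_test(more_blacks, more_whites)
```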
study 3: 2017 yougov survey [confirmatory with modifications indicated]

the anes general discrimination items analyzed in study 1 and study 2 asked participants to respond to an item about the level of discrimination against blacks and to a separate item about the level of discrimination against whites; the anes items also referred to “discrimination” in a way that could permit participants to respond based on a combination of the perceived frequency of discrimination and the perceived strength of discrimination. to provide a clearer inference about participant perceptions, the key item in study 3 asked participants to directly compare their perceived frequency of discrimination against blacks to their perceived frequency of discrimination against whites. data analysis of the 2017 survey followed a plan preregistered at the open science framework (https://osf.io/q7ufz), with one hypothesis, reflecting the expectation that the key pattern in study 1 and study 2 would replicate:

1. h1: a higher proportion of non-hispanic whites will report that black americans are more often the victim of discrimination in the united states today than white americans are, compared to the proportion of non-hispanic whites that report that white americans are more often the victim of discrimination in the united states today than black americans are.

u.s. resident adult participants from a yougov opt-in survey panel completed an online survey fielded between 27 july 2017 and 31 july 2017, with a final sample of 2,000 participants, of which the randomization assigned 359 non-hispanic whites and 52 non-hispanic blacks to the item for this study; see appendix a for more information on the construction of the sample. the key item was: “in the united states today, which of the following two groups is more often the victim of discrimination, compared to the other group?”. response options were “black americans”, “white americans”, and “both groups are the victim of discrimination equally often in the united states today”, with the order of the first two response options randomly reversed and the third response option always third. the key item was used to construct three dichotomous variables, respectively coded 1 if the participant selected the “black americans”, “white americans”, and “both groups...” options. the research for study 3 received approval from the author’s institutional review board.

table 2. reported perceptions of discrimination [2017 yougov survey].

black americans are more often the victim of discrimination: whites 0.43 [0.36, 0.51] [0.36, 0.51]; blacks 0.90 [0.81, 0.98] [0.78, 0.96]
both groups are equally often the victim of discrimination: whites 0.42 [0.35, 0.50] [0.35, 0.50]; blacks 0.07 [0.00, 0.15] [0.03, 0.19]
white americans are more often the victim of discrimination: whites 0.12 [0.07, 0.16] [0.08, 0.17]; blacks 0.01 [-0.01, 0.02] [0.00, 0.05]

note. for each group of non-hispanic participants, each entry lists the point estimate for the corresponding decimal percentage in weighted analyses and the 95% confidence interval, both based on preregistered use of the stata svy: mean command, followed by the 95% confidence interval based on non-preregistered use of the stata svy: prop command for proportions. the sample size was 359 for non-hispanic white americans and 52 for non-hispanic black americans, which included 8 non-hispanic white participants and 1 non-hispanic black participant who were coded as skipping the item.
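the table 2 note illustrates why two kinds of intervals are reported: a normal-approximation (wald-type) interval around the mean of a 0/1 variable can extend outside [0, 1] for small proportions (e.g., the [-0.01, 0.02] interval), whereas proportion-specific intervals respect the bounds. the sketch below contrasts a wald interval with a wilson score interval; note that this is a general illustration of the point, not stata's exact svy: prop method.

```python
# hedged sketch: wald vs. wilson 95% confidence intervals for a proportion.
import math

def wald_ci(k, n, z=1.96):
    """normal-approximation interval; can fall outside [0, 1]."""
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def wilson_ci(k, n, z=1.96):
    """wilson score interval; always within [0, 1]."""
    p = k / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# toy example: 1 "white americans more often" response out of 52
lo_wald, hi_wald = wald_ci(1, 52)
lo_wilson, hi_wilson = wilson_ci(1, 52)
```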
results

for weighted analyses reported in table 2, a larger percentage of non-hispanic whites selected the option indicating that black americans are more often the victim of discrimination in the united states today (43%) than selected the option indicating that white americans are more often the victim of discrimination in the united states today (12%); the p-value was less than .001 for a wald test of the hypothesis that the constant in a linear regression predicting the “black americans” outcome variable equaled the constant in a linear regression predicting the “white americans” outcome variable, supporting the preregistered directional hypothesis. moreover, results indicated that 42% of non-hispanic whites selected the option that both groups are the victim of discrimination equally often in the united states today.

general discussion

results reported in norton and sommers (2011) indicated that, in the united states today, whites perceive that whites are the victims of discrimination more than blacks are. however, analyses of data from three recent large-sample national surveys indicated that white americans do not perceive discrimination in the united states today against whites to be greater or more frequent than discrimination against blacks. this discrepancy might be due to the research design of norton and sommers (2011), in which participants rated discrimination against blacks in a series of decades and then rated discrimination against whites in a series of decades, but were not asked to directly compare discrimination in the united states today against whites to discrimination in the united states today against blacks. discussing results from the 2016 prri/brookings immigration survey, jones et al. (2016) reported that 57% of white americans agreed that “today discrimination against whites has become as big a problem as discrimination against blacks and other minorities” (p. 2).
this result might be perceived to be in tension with the patterns reported for non-hispanic whites in the studies above, but the finding does not indicate that whites believe that whites face more discrimination than blacks face; the 57% estimate is consistent with study 3 results, in which a combined 54% of non-hispanic whites reported the perception that, relative to the frequency of discrimination against black americans, white americans are more often (12%) or equally often (42%) the victim of discrimination in the united states today. this 42% estimate from study 3 of the percentage of white americans who perceive equality in the frequency of black/white discrimination can be paired with corresponding estimates of black/white discrimination equality of 37% and 27% from study 1 and study 2 to produce the inference that a nontrivial percentage of white americans perceive there to be similar levels of discrimination against blacks as against whites. this perception and the perception of more discrimination against white americans than against black americans can have important consequences for attitudes and policy preferences. for example, wellman, liu, and wilkins (2016) reported results suggesting that “when white people perceive increased anti-white bias, it leads them to view interracial relations as zero-sum and to reject affirmative action” (p. 433), and wilkins et al. (2015) reported results suggesting that “perceiving greater bias against men or whites may be associated with favoring policies that ultimately hurt women and blacks” (p. 11). to the extent that norton and sommers (2011) overestimated the percentage of white americans who perceive there to be more discrimination against whites than against blacks in the united states today, reliance on norton and sommers (2011) might produce an overestimate of the change in white americans’ attitudes that might have already occurred due to increases in perceived anti-white discrimination.
discrimination against black americans is a compelling justification for policies to reduce black disadvantage, and results from studies 1 through 3 suggest that white americans’ preferences have more potential to become less favorable toward programs that are intended to reduce black disadvantage, compared to estimates of this potential based on norton and sommers (2011).

author contact

correspondence concerning this article should be addressed to: l.j zigerell, ljzigerell@illinoisstate.edu, orcid 0000-0003-4262-8405.

acknowledgements

the author thanks eric plutzer for comments on a prior version of the manuscript, michael i. norton for providing information on norton and sommers (2011) and for pointers to and suggestions for citation of related studies, and laurie o’brien, åse innes-ker, and ulrich schimmack for their reviews.

conflict of interest and funding

the author declares no conflict of interest. documentation for the anes 2012 time series study (american national election studies, 2016b) indicated that the anes received funding from national science foundation grants ses-0937715 and ses-0937727, the university of michigan, and stanford university. documentation for the anes 2016 time series study (american national election studies, 2018b) indicated that the study was funded by national science foundation grants ses-1444721 and ses-1444910. however, my analyses of anes data received no funding. the 2017 yougov survey received funding from illinois state university new faculty start-up support and from the illinois state university college of arts and sciences.

author contributions

l.j zigerell is the sole author of this contribution.

open science practices

this article earned the preregistration plus, open data and the open materials badge for preregistering the hypothesis and analysis before data collection, and for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article.
the entire editorial process, including the open reviews, is published in the online supplement.

references

american national election studies. (2016a). anes 2012 time series study, may 4, 2016 release.
american national election studies. (2016b). user’s guide and codebook for the anes 2012 time series study, may 4, 2016 release. the university of michigan and stanford university. ann arbor, mi, palo alto, ca.
american national election studies. (2018a). anes 2016 time series study, dec 18, 2018 release.
american national election studies. (2018b). user’s guide and codebook for the anes 2016 time series study, dec 18, 2018 release. the university of michigan and stanford university. ann arbor, mi, palo alto, ca.
axt, j. r., ebersole, c. r., & nosek, b. a. (2016). an unintentional, robust, and replicable pro-black bias in social judgment. social cognition, 34(1), 1–39.
brown-iannuzzi, j. l., dotsch, r., cooley, e., & payne, b. k. (2017). the relationship between mental representations of welfare recipients and attitudes toward welfare. psychological science, 28(1), 92–103.
cabrera, n. l. (2014). “but i’m oppressed too”: white male college students framing racial emotions as facts and recreating racism. international journal of qualitative studies in education, 27(6), 768–784.
carter, e. r., & murphy, m. c. (2015). group-based differences in perceptions of racism: what counts, to whom, and why? social and personality psychology compass, 9(6), 269–280.
hughey, m. w. (2014). white backlash in the ‘postracial’ united states. ethnic and racial studies, 37(5), 721–730.
hurley, l. (2016). u.s. supreme court upholds race-based college admissions. reuters. https://www.reuters.com/article/us-usa-court-affirmativeaction-iduskcn0z91n3
jones, r., cox, d., dionne jr., e., galston, w., cooper, b., & lienesch, r. (2016). how immigration and concerns about cultural change are shaping the 2016 election: prri/brookings survey. prri. http://www.prri.org/research/prri-brookings-poll-immigration-economy-trade-terrorism-presidential-race/
kinder, d. r., & sanders, l. m. (1996). divided by color: racial politics and democratic ideals. university of chicago press.
lafraniere, s., & lehren, a. w. (2015). the disproportionate risks of driving while black. the new york times. https://www.nytimes.com/2015/10/25/us/racial-disparity-traffic-stops-driving-black.html
lowery, w. (2016). study finds police fatally shoot unarmed black men at disproportionate rates. washington post. https://www.washingtonpost.com/national/study-finds-police-fatally-shoot-unarmed-black-men-at-disproportionate-rates/2016/04/06/e494563e-fa74-11e5-80e4-c381214de1a3_story.html
major, b., & kaiser, c. r. (2017). ideology and the maintenance of group inequality. group processes & intergroup relations, 20(5), 582–592.
mayrl, d., & saperstein, a. (2013). when white people report racial discrimination: the role of region, religion, and politics. social science research, 42(3), 742–754.
new york times. (2011). room for debate: is anti-white bias a problem? new york times. https://www.nytimes.com/roomfordebate/2011/05/22/is-anti-white-bias-a-problem
norton, m. i., & sommers, s. r. (2011). whites see racism as a zero-sum game that they are now losing. perspectives on psychological science, 6(3), 215–218.
npr. (2011). racism as a zero-sum game. npr. https://www.npr.org/2011/07/13/137818177/racism-as-a-zero-sum-game
quillian, l., pager, d., hexel, o., & midtbøen, a. h. (2017). meta-analysis of field experiments shows no change in racial discrimination in hiring over time. proceedings of the national academy of sciences, 114(41), 10870–10875.
r core team. (2017). r: a language and environment for statistical computing. vienna, austria. https://www.r-project.org/
sablich, l. (2016). 7 findings that illustrate racial disparities in education. brookings institution. brown center chalkboard. https://www.brookings.edu/blog/brown-center-chalkboard/2016/06/06/7-findings-that-illustrate-racial-disparities-in-education/
statacorp. (2017). stata statistical software: release 15. college station, tx.
todd, a. r., bodenhausen, g. v., & galinsky, a. d. (2012). perspective taking combats the denial of intergroup discrimination. journal of experimental social psychology, 48(3), 738–745.
traub, a., ruetschlin, c., sullivan, l., meschede, t., dietrich, l., & shapiro, t. (2016). the racial wealth gap: why policy matters. demos and the institute on assets & social policy.
wellman, j. d., liu, x., & wilkins, c. l. (2016). priming status-legitimizing beliefs: examining the impact on perceived anti-white bias, zero-sum beliefs, and support for affirmative action among white people. british journal of social psychology, 55(3), 426–437.
west, k., & eaton, a. a. (2019). prejudiced and unaware of it: evidence for the dunning-kruger model in the domains of racism and sexism. personality and individual differences, 146, 111–119.
wickham, h. (2017). ggplot2: elegant graphics for data analysis. new york.
wilkins, c. l., wellman, j. d., babbitt, l. g., toosi, n. r., & schad, k. d. (2015). you can win but i can’t lose: bias against high-status groups increases their zero-sum beliefs about discrimination. journal of experimental social psychology, 57, 1–14.
appendix a

text description of the 2017 survey, drawn from the data deliverables from yougov:

yougov interviewed 2040 respondents who were then matched down to a sample of 2000 to produce the final dataset. the respondents were matched to a sampling frame on gender, age, race, education, ideology, and political interest. the frame was constructed by stratified sampling from the full 2010 american community survey (acs) sample with selection within strata by weighted sampling with replacements (using the person weights on the public use file). data on voter registration status and turnout were matched to this frame using the november 2010 current population survey. data on interest in politics and party identification were then matched to this frame from the 2007 pew religious life survey. the matched cases were weighted to the sampling frame using propensity scores. the matched cases and the frame were combined and a logistic regression was estimated for inclusion in the frame. the propensity score function included age, gender, race/ethnicity, years of education, region, voter registration status, political interest, and ideology. the propensity scores were grouped into deciles of the estimated propensity score in the frame and poststratified according to these deciles.
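the post-stratification used at several points in this weighting procedure can be sketched minimally as follows. this is an illustration only, with hypothetical one-way cells; yougov's actual procedure uses propensity-score deciles and multi-way stratification.

```python
# hedged sketch: post-stratification weights that make weighted sample
# cell shares match known population (frame) cell shares.
from collections import Counter

def post_stratification_weights(sample_cells, frame_shares):
    """one weight per respondent: frame share / sample share of the cell.

    sample_cells: one cell label per respondent (e.g., a gender x age cell).
    frame_shares: target population share for each cell, summing to 1.
    """
    n = len(sample_cells)
    sample_counts = Counter(sample_cells)
    return [frame_shares[c] / (sample_counts[c] / n) for c in sample_cells]

cells = ["a", "a", "a", "b"]   # toy sample: 75% cell a, 25% cell b
target = {"a": 0.5, "b": 0.5}  # toy frame: 50/50
weights = post_stratification_weights(cells, target)
```

after weighting, the weighted share of cell "a" equals the frame share of 0.5, which is the defining property of post-stratification.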
the weights were then post-stratified to a four-way stratification of gender, four-category age, four-category race, and four-category education, to produce the final weight.

further details on the 2017 yougov survey sample:

100% eligibility rate
64.4% rr3
3,349 invitations
-970 non-responses
2,379 starts
-134 refusals
-88 partial completions
2,157 completions
-117 completions screened out for speeding through the items/high refusal rates
2,040 sample matched down to 2,000

meta-psychology, 2021, vol 5, mp.2019.1645
https://doi.org/10.15626/mp.2019.1645
article type: commentary
published under the cc-by4.0 license
open data: n/a
open materials: n/a
open and reproducible analysis: n/a
open reviews and editorial process: yes
preregistration: n/a
edited by: daniël lakens
reviewed by: j. gottfried, m. wolf
analysis reproduced by: n/a
all supplementary files can be accessed at osf: https://osf.io/afusj/

the validation crisis in psychology
ulrich schimmack
department of psychology, university of toronto mississauga

abstract

cronbach and meehl (1955) introduced the concept of construct validity and described how researchers can demonstrate that their measures have construct validity. although the term construct validity is widely used, few researchers follow cronbach and meehl’s recommendation to quantify construct validity with the help of nomological networks. as a result, the construct validity of many popular measures in psychology is unknown.
i call for rigorous tests of construct validity that follow cronbach and meehl’s recommendations to improve psychology as a science. without valid measures, even replicable results are uninformative. i suggest that a proper program of validation research requires a multi-method approach and causal modeling of correlations with structural equation models. construct validity should be quantified to enable cost-benefit analyses and to replace existing measures with better measures that have superior construct validity.

keywords: measurement, construct validity, convergent validity, discriminant validity, structural equation modeling, nomological networks

nine years ago, psychologists started to realize that they have a replication crisis. many published results do not replicate in honest replication attempts that allow the data to decide whether a hypothesis is true (open science collaboration, 2015). one key problem is that original studies often have low statistical power (cohen, 1962; schimmack, 2012). another problem is that researchers use questionable research practices to increase power, which also increases the risk of false positive results (john et al., 2012). new initiatives that are called open science (e.g., preregistration, data sharing, a priori power analyses, registered reports) are likely to improve the replicability of psychological science in the future, although progress towards this goal is painfully slow. unfortunately, low replicability is not the only problem in psychological science. i argue that psychology not only has a replication crisis, but also a validation crisis. the need for valid measures seems obvious. to test theories that relate theoretical constructs to each other (e.g., construct a influences construct b for individuals drawn from population p under conditions c), it is necessary to have valid measures of constructs.
for example, research on intelligence that uses hair length as a measure of intelligence would be highly misleading; highly replicable gender differences in hair length would be interpreted as evidence that women are more intelligent than men. this inference would be false because hair length is not a valid measure of intelligence, even though the relationship between gender and hair length is highly replicable. thus, even successful and replicable tests of a theory may be false if measures lack construct validity; that is, they do not measure what researchers assume they are measuring. the social sciences are notorious for imprecise use of terminology. the terms validity and validation are no exception. in educational testing, where the emphasis is on assessment of individuals, the term validation has a different meaning than in psychological science, where the emphasis is on testing psychological theories (borsboom & wijsen, 2016). in this article, i focus on construct validity. a measure possesses construct validity to the degree that quantitative variation in the measure reflects quantitative variation in the construct that the measure was designed to measure. for example, a measure of anxiety is a valid measure of anxiety if scores on the measure reflect variation in anxiety. hundreds of measures are used in psychological science with the purpose of measuring variation in constructs such as learning, attention, emotions, attitudes, values, personality traits, abilities, or behavioral frequencies. although measures of these constructs are used in thousands of articles, i argue that very little is known about the construct validity of these measures. that is, it is often claimed that psychological measures are valid, but evidence for this claim is often lacking or insufficient. i argue that psychologists could improve the quality of psychological science by following cronbach and meehl’s (1955) recommendations for construct validation.
specifically, i argue that construct validation requires (a) a multi-method approach, (b) a causal model of the relationship between constructs and measures, and (c) quantitative information about the correlation between unobserved variation in constructs and observed scores on measures of constructs.

construct validity

the classic article on “construct validity” was written by cronbach and meehl (1955), two giants in the history of psychology. every graduate student of psychology and surely every psychologist who wants to validate a psychological measure should be familiar with this article. the article was the result of an apa task force that tried to establish criteria, now called psychometric properties, that could be used to evaluate psychological measures. in this seminal article on construct validity, cronbach and meehl note that construct validation is necessary “whenever a test is to be interpreted as a measure of some attribute or quality which is not ‘operationally defined’” (p. 282). this definition makes it clear that there are other types of validity (e.g., criterion validity) and that not all measures require construct validity. however, studies of psychological theories that relate constructs to each other require valid measures of these constructs in order to test psychological theories. in modern language, construct validity is the relationship between variation in observed scores on a measure (e.g., degrees celsius on a thermometer) and a latent variable that reflects corresponding variation in a theoretical construct (e.g., temperature; i.e., average kinetic energy of the particles in a sample of matter). the problem of construct validation can be illustrated with the development of iq tests. iq scores can have predictive validity (e.g., performance in graduate school) without making any claims about the construct that is being measured (iq tests measure whatever they measure, and what they measure predicts important outcomes).
however, iq tests are often treated as measures of intelligence. for iq tests to be valid measures of intelligence, it is necessary to define the construct of intelligence and to demonstrate that observed iq scores are related to unobserved variation in intelligence. thus, construct validation requires clear definitions of constructs that are independent of the measures that are being validated. without clear definitions of constructs, the meaning of a measure reverts essentially to “whatever the measure is measuring,” as in the old saying “intelligence is whatever iq tests are measuring.”

what are constructs

cronbach and meehl (1955) define a construct as “some postulated attribute of people, assumed to be reflected in test performance” (p. 283). the term “reflected” in cronbach and meehl’s definition makes it clear that they define constructs as latent variables and that the process of measurement requires a reflective measurement model. this point is made even clearer when they write “it is clear that factors here function as constructs” (p. 287). individuals are assumed to have attributes; today we may say personality traits or states. these attributes are typically not directly observable (e.g., kindness rather than height), but systematic observation suggests that the attribute exists (some people are kinder than others across time and situations). the first step is to develop a measure of this attribute (e.g., a self-report measure “how kind are you?”). if the self-report measure is valid, variation in the ratings should reflect actual variation in kindness. this needs to be demonstrated in a program of validation research. for example, self-ratings should show convergent validity with informant ratings, and they should predict actual behavior in experience sampling studies or laboratory settings. face validity is not sufficient; that is, “i am kind” is not automatically a valid measure of kindness just because the question directly maps onto the construct.
convergent validity

to demonstrate construct validity, cronbach and meehl advocate a multi-method approach. the same construct has to be measured with several measures. if several measures are available, they can be analyzed with factor analysis. in this factor analysis, the factor represents the construct and factor loadings show how strongly scores on the observed measures are related to variation in the construct. for example, if multiple independent raters agree in their ratings of individuals’ kindness, the common factor in these ratings may correspond to the personality trait kindness, and the factor loadings provide evidence about the degree of construct validity of each measure (schimmack, 2010). it is important to distinguish factor analysis of items and factor analysis of multiple measures. factor analysis of items is common and often used to claim validity of a measure. however, correlations among self-report items are influenced by systematic measurement error (anusic et al., 2009; podsakoff, mackenzie, & podsakoff, 2012). the use of multiple independent methods (e.g., multiple raters) reduces the influence of shared method variance and makes it more likely that correlations among measures are caused by the influence of the common construct that the measures are intended to measure. in the section “correlation matrices and factor analysis,” cronbach and meehl (1955) clarify why factor analysis can reveal construct validity: “if two tests are presumed to measure the same construct, a correlation between them is predicted” (p. 287). the logic of this argument should be clear to any psychology student who was introduced to the third-variable problem in correlational research. two measures may be related even if there is no causal relationship between them because they are both influenced by a common cause. for example, cities with more churches have higher murder rates. here the assumed common cause is population size.
this makes it possible to measure population size with measures of the number of churches and murders. the shared variance between these measures reflects population size. thus, we can think about constructs as third variables that produce shared variance among observed measures of the same construct. this basic idea was refined by campbell and fiske (1959), who coined the term convergent validity. two measures of the same construct possess convergent validity if they are positively correlated with each other. however, there is a catch. two measures of the same construct could also be correlated for other reasons. for example, self-ratings of kindness and considerateness could be correlated due to socially desirable responding or evaluative biases in self-perceptions (campbell & fiske, 1959). thus, campbell and fiske (1959) made clear that convergent validity is different from reliability. reliability shows consistency in scores across measures without examining the source of the consistency in responses. construct validity requires that consistency is produced by variation in the construct that a measure was designed to measure. for this reason, reliability is necessary, but not sufficient, to demonstrate construct validity. an unreliable measure cannot be valid because there is no consistency, but a reliable measure can be invalid. for example, hair length can be measured reliably, but the reliable variance in the measure has no construct validity as a measure of intelligence. one cause of the validation crisis in psychology is that validation studies ignore the distinction between same-method and cross-method correlations (campbell & fiske, 1959). correlations among measures that share method variance (e.g., self-reports) cannot be used to examine convergent validity. unfortunately, few studies use actual behavior to validate self-report measures of personality traits (baumeister, vohs, & funder, 2007).
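the common-cause logic behind convergent validity can be sketched in a short simulation. all numbers below (loadings, rater labels) are made-up illustration values, not estimates from any real study; the point is only that independent raters correlate to the extent that their ratings reflect the same latent construct:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # large sample so estimates are close to population values

# the latent construct (e.g., kindness), standardized
kindness = rng.standard_normal(n)

# three independent methods; each score = loading * construct + unique error
# (loadings are hypothetical construct validities of each method)
l_self, l_friend, l_obs = 0.7, 0.6, 0.5
self_r   = l_self   * kindness + np.sqrt(1 - l_self**2)   * rng.standard_normal(n)
friend_r = l_friend * kindness + np.sqrt(1 - l_friend**2) * rng.standard_normal(n)
obs_r    = l_obs    * kindness + np.sqrt(1 - l_obs**2)    * rng.standard_normal(n)

# convergent validity: methods correlate only because they share the common cause,
# and each correlation equals the product of the two loadings
r_sf = np.corrcoef(self_r, friend_r)[0, 1]  # ~ 0.7 * 0.6 = 0.42
r_so = np.corrcoef(self_r, obs_r)[0, 1]     # ~ 0.7 * 0.5 = 0.35
r_fo = np.corrcoef(friend_r, obs_r)[0, 1]   # ~ 0.6 * 0.5 = 0.30
```

the same structure explains the churches-and-murders example: replace the construct with population size and the raters with counts of churches and murders.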
discriminant validity

the term discriminant validity was introduced by campbell and fiske (1959). however, cronbach and meehl already pointed out that high or low correlations can support construct validity: “only if the underlying theory of the trait being measured calls for high item intercorrelations do the correlations support construct validity” (p. 288). crucial for construct validity is that the correlations are consistent with theoretical expectations. for example, low correlations between intelligence and happiness do not undermine the validity of an intelligence measure because there is no theoretical expectation that intelligence is related to happiness. in contrast, low correlations between intelligence and job performance would be a problem if the jobs require problem-solving skills and intelligence is an ability to solve problems faster or better. it is often overlooked that discriminant validity also requires a multi-method approach (e.g., greenwald, mcghee, & schwartz, 1998). a multi-method approach is required because the upper limit for discriminant validity is the amount of convergent validity for different measures of the same construct, not a value of 1 or the reliability of a scale (campbell & fiske, 1959). for example, martel, schimmack, nikolas, and nigg (2015) examined multi-rater data of children’s attention deficit and hyperactivity (adhd) symptoms. table 1 shows the correlations for the items “listens” and “being organized.” the cross-rater same-item correlations show convergent validity of ratings of the same “symptom” by different raters. the cross-rater different-item correlations show discriminant validity only if they are consistently lower than the convergent validity correlations. in this example, there is little evidence of discriminant validity because cross-construct correlations are nearly as high as same-construct correlations.
an analysis with structural equation modeling of these data shows a latent correlation of r = .99 between a “listening” factor and an “organized” factor. this example illustrates why it is not possible to interpret items on an adhd checklist as distinct symptoms (martel et al., 2015). more important, the example shows that claims about discriminant validity require a multi-method approach.

table 1. correlations among ratings of adhd symptoms

                m-listen  f-listen  t-listen  m-organized  f-organized
f-listen          .558
t-listen          .450      .436
m-organized       .664      .494      .392
f-organized       .432      .561      .324      .437
t-organized       .376      .407      .698      .350        .304

note. m = mother, f = father, t = teacher; ratings of “child listens” and “child is organized.” data from martel et al. (2015).

quantifying construct validity

it is rare to see quantitative claims about construct validity in psychology, and sometimes information about reliability is falsely presented as evidence for construct validity (flake, pek, & hehman, 2017). most method sections include a vague statement that measures have demonstrated construct validity, as if a measure is either valid or invalid. contrary to this current practice, cronbach and meehl made it clear that construct validity is a quantitative construct and that factor loadings can be used to quantify validity: “there is an understandable tendency to seek a ‘construct validity coefficient’. a numerical statement of the degree of construct validity would be a statement of the proportion of the test score variance that is attributable to the construct variable. this numerical estimate can sometimes be arrived at by a factor analysis” (p. 289). and nobody today seems to remember cronbach and meehl’s (1955) warning that rejection of the null hypothesis that the test has zero validity is not the end goal of validation research.
“it should be particularly noted that rejecting the null hypothesis does not finish the job of construct validation. the problem is not to conclude that the test ‘is valid’ for measuring the construct variable. the task is to state as definitely as possible the degree of validity the test is presumed to have” (p. 290). cronbach and meehl are well aware that it is difficult to quantify validity precisely, even if multiple measures of a construct are available, because factors may not be perfect representations of constructs: “rarely will it be possible to estimate definite construct saturations, because no factor corresponding closely to the construct will be available” (p. 289). however, broad information about validity is better than no information about validity (schimmack, 2010). one reason why psychologists rarely quantify validity could be that estimates of construct validity for many tests are embarrassingly low. the limited evidence from some multi-method studies suggests that about 30% to 50% of the variance in rating scales is valid variance (connelly & ones, 2010; zou, schimmack, & gere, 2013). another reason is that it can be difficult or costly to measure the same construct with three independent methods, which is the minimum number of measures needed to quantify validity. two methods are insufficient because it is not clear how much the validity of each method contributes to the convergent validity correlation between them. for example, a correlation of r = .4 between self-ratings and informant ratings is open to very different interpretations. “if the obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test a, test b, or the formulation of the construct” (cronbach & meehl, 1955, p. 300). i believe that the failure to treat construct validity as a quantitative construct is the root cause of the validation crisis in psychology.
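the claim that three methods are the minimum follows from simple algebra. under a single-factor model, each cross-method correlation is the product of two loadings, so three correlations give three equations in three unknowns, while two methods leave one equation with two unknowns. a sketch with hypothetical correlations (the values and the loading-recovery formula assume a single common factor and independent method errors):

```python
import numpy as np

# hypothetical cross-method correlations among three independent measures of
# the same construct (e.g., self-report, informant report, observed behavior)
r12, r13, r23 = 0.42, 0.35, 0.30

# under a single-factor model, r_ij = l_i * l_j, so each loading is identified
l1 = np.sqrt(r12 * r13 / r23)  # validity (loading) of method 1
l2 = np.sqrt(r12 * r23 / r13)  # validity of method 2
l3 = np.sqrt(r13 * r23 / r12)  # validity of method 3

# proportion of valid construct variance in each method
valid_variance = [round(l**2, 2) for l in (l1, l2, l3)]  # [0.49, 0.36, 0.25]

# with only two methods, r12 = l1 * l2 is underdetermined:
# (l1, l2) = (0.7, 0.6) and (0.84, 0.5) both reproduce r12 = 0.42
```

this is why a lone self-informant correlation of r = .4 is open to very different interpretations: it cannot say which method carries how much validity.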
every method is likely to have some validity (i.e., non-zero construct variance), but measures with less than 30% valid variance are unlikely to have much practical usefulness for testing psychological theories and are inadequate for personality assessment (schimmack, 2019). quantification of construct validity would provide an objective criterion to evaluate new measures and stimulate development of better measures. thus, quantifying validity would be an important initiative to improve psychological science. one notable exception is the literature in industrial and organizational psychology, where construct validity has been quantified (cote & buckley, 1987). a meta-analysis of construct validation studies suggested that less than 50% of the variance was valid construct variance, and that a substantial portion of the variance is caused by systematic measurement error. the i/o literature shows that it is possible and meaningful to quantify construct validity. i suggest that other disciplines in psychology follow their example.

the nomological net

some readers may be familiar with the term “nomological net” that was popularized by cronbach and meehl in their 1955 article. however, few readers will be able to explain what a nomological net is, despite the fact that cronbach and meehl considered nomological nets essential for construct validation: “to validate a claim that a test measures a construct, a nomological net surrounding the concept must exist” (p. 291). cronbach and meehl state that “the laws in a nomological network may relate (a) observable properties or quantities to each other; or (b) theoretical constructs to observables; or (c) different theoretical constructs to one another. these ‘laws’ may be statistical or deterministic” (p. 290). i argue that cronbach and meehl would have used the term structural equation model if structural equation modeling had existed when they wrote their article.
after all, structural equation modeling is simply an extension of factor analysis, cronbach and meehl did equate constructs with factors, and structural equation modeling makes it possible to relate (a) observed indicators to each other, (b) observed indicators to latent variables, and (c) latent variables to each other. thus, cronbach and meehl essentially proposed to examine construct validity by modeling multi-trait multi-method data with structural equations. cronbach and meehl also realized that constructs can change as more information becomes available. in this sense, construct validation is an ongoing process of improved understanding of constructs and measures. empirical data can suggest changes in measures or changes in concepts. for example, empirical data might show that intelligence is a general disposition that influences many different cognitive abilities or that it is better conceptualized as the sum of several distinct cognitive abilities. ideally, this iterative process would start with a simple structural equation model that is fitted to some data. if the model does not fit, the model can be modified and tested with new data. over time, the model would become more complex and more stable because core measures of constructs would establish the meaning of a construct, while peripheral relationships may be modified if new data suggest that theoretical assumptions need to be changed. “when observations will not fit into the network as it stands, the scientist has a certain freedom in selecting where to modify the network” (p. 290). the increasing complexity of a model is only an advantage if it is based on better understanding of a phenomenon. weather models have become increasingly more complex and better able to forecast future weather changes. in the same way, better psychological models would be more complex and better able to predict behavior. structural equation modeling is sometimes called confirmatory factor analysis.
in my opinion, the term confirmatory factor analysis has led to the idea that structural equation modeling can only be used to test whether a theoretical model fits the data or not. the consequence of this focus on confirmation was to hamper the use of structural equation modeling for construct validation, because simplistic models did not fit the data. rather than modifying models accordingly, researchers avoided using cfa for construct validation. for example, mccrae, zonderman, costa, bond, and paunonen (1996) dismissed structural equation modeling as a useful method to examine the construct validity of big five measures because it failed to support their conception of the big five as orthogonal dimensions with simple structure. i argue that structural equation modeling is a statistical tool that can be used to test existing models and to explore new models. this flexible use of structural equation modeling would be in the spirit of cronbach and meehl’s vision that construct validation is an iterative process that improves measurement and understanding of constructs as the nomological net is altered to accommodate new information. this suggestion highlights a similarity between the validation crisis and the replication crisis. one cause of the replication crisis was the use of statistics as a tool that could only confirm theoretical predictions, p < .05. in the same way, confirmatory factor analysis was only used to confirm models. in both cases, confirmation bias impeded scientific progress and theory development. a better use of structural equation modeling is to use it as a general statistical framework that can be used to fit nomological networks to data and to use the results in an iterative process that leads to better understanding of constructs and better measures of these constructs. this is the way cfa was intended to be used (jöreskog, 1969).
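as a rough illustration of what a quantitative single-factor analysis of the table 1 correlations looks like, the first principal component can serve as a crude stand-in for a factor solution (a proper cfa with dedicated sem software would be the right tool; this sketch only approximates loadings and valid-variance proportions):

```python
import numpy as np

labels = ["m-listen", "f-listen", "t-listen",
          "m-organized", "f-organized", "t-organized"]

# correlation matrix reconstructed from table 1 (martel et al., 2015)
R = np.array([
    [1.000, 0.558, 0.450, 0.664, 0.432, 0.376],
    [0.558, 1.000, 0.436, 0.494, 0.561, 0.407],
    [0.450, 0.436, 1.000, 0.392, 0.324, 0.698],
    [0.664, 0.494, 0.392, 1.000, 0.437, 0.350],
    [0.432, 0.561, 0.324, 0.437, 1.000, 0.304],
    [0.376, 0.407, 0.698, 0.350, 0.304, 1.000],
])

# crude one-factor solution: loadings from the largest eigenvalue/eigenvector
eigvals, eigvecs = np.linalg.eigh(R)           # eigenvalues in ascending order
loadings = np.abs(eigvecs[:, -1]) * np.sqrt(eigvals[-1])

# squared loadings approximate the proportion of valid variance per rating
valid_variance = dict(zip(labels, np.round(loadings**2, 2)))
```

consistent with the text, a single strong general factor accounts for all six ratings, which is why the "listening" and "organized" factors correlate near unity in the full model.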
network models are not nomological nets

in the past decade, it has become popular to examine correlations among items with network models (schmittmann et al., 2013). network models are graphic representations of correlations or partial correlations among a set of variables. importantly, network models do not have latent variables that could correspond to constructs. “network modeling typically relies on the assumption that the covariance structure among a set of the items is not due to latent variables at all” (epskamp et al., 2017, p. 923). instead, “psychological attributes are conceptualized as networks of directly related observables” (schmittmann et al., 2013, p. 43). it is readily apparent that network models are not nomological nets because they avoid defining constructs independent of specific operationalizations. “since there is no latent variable that requires causal relevance, no difficult questions concerning its reality arise” (schmittmann et al., 2013, p. 49). thus, network models return to operationalism at the level of the network components. each component in the network is defined by a specific measure, which is typically a self-report item or scale. the difficulty of psychological measurement is no longer a problem because self-report items are treated as perfectly valid measures of network components. the example in table 1 shows the problem with this approach. rather than having six independent network components, the six items in table 1 appear to be six indicators of a single construct that are measured with systematic and random measurement error. at least for these data, but probably for multi-method data in general, it makes little sense to postulate direct causal effects between observed scores. for example, it makes little sense to postulate that fathers’ ratings of forgetfulness causally influenced teachers’ ratings of attention.
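the risk of mistaking a single latent cause for a web of direct relations is easy to demonstrate: data generated by one common factor produce a fully connected partial-correlation network, even though no observable directly influences any other. a minimal sketch with made-up loadings:

```python
import numpy as np

# implied covariance of six standardized ratings driven by ONE latent construct:
# sigma = l l' + psi, where psi holds the unique (error) variances
l = np.array([0.7, 0.7, 0.6, 0.7, 0.6, 0.6])   # hypothetical loadings
sigma = np.outer(l, l) + np.diag(1 - l**2)

# partial correlations (the edges of a network model) from the precision matrix
theta = np.linalg.inv(sigma)
d = np.sqrt(np.diag(theta))
pcorr = -theta / np.outer(d, d)
np.fill_diagonal(pcorr, 1.0)

# every off-diagonal partial correlation is positive: the estimated "network"
# is fully connected although the true model has no direct relations at all
off_diag = pcorr[~np.eye(6, dtype=bool)]
```

a network estimated from such data would draw fifteen edges among the six ratings, none of which corresponds to a real direct effect.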
it is noteworthy that recent trends in network modeling acknowledge the importance of latent variables and relegate the use of network modeling to modeling residual correlations (epskamp, rhemtulla, & borsboom, 2017). these network models with latent variables are functionally equivalent to structural equation models with correlated residuals. thus, they are no longer conceptually distinct from structural equation models. a detailed discussion of latent network models is beyond the scope of this article. the main point is that network models without latent variables cannot be used to examine construct validity because constructs are by definition unobservable and can be studied only indirectly by examining their influence on observable measures. any direct relationships between observables either operationalize constructs or avoid the problem of measurement and implicitly assume perfect measurement.

recommendations for users of psychological measures

the main recommendation for users of psychological measures is to be skeptical of claims that measures have construct validity. many of these claims are not based on proper validation studies. at a minimum, a measure should have demonstrated at least modest convergent validity with another measure that used a different method. ideally, a multi-method approach was used to provide some quantitative information about construct validity. researchers should be wary of measures that have low convergent validity. for example, it has been known for a long time that implicit measures of self-esteem have low convergent validity (bosson et al., 2000), but this finding has not deterred researchers from claiming that the self-esteem iat is a valid measure of implicit self-esteem (greenwald & farnham, 2000). proper evaluation of this claim with multi-method data shows no evidence of construct validity (falk et al., 2015; schimmack, 2019). consumers should also be wary of new constructs.
it is very unlikely that all hunches by psychologists lead to the discovery of useful constructs. given the current state of psychological science, it is rather more likely that many constructs turn out to be non-existent. however, the history of psychological measurement has only seen the development of more and more constructs and more and more measures to measure this expanding universe of constructs. since the 1990s, constructs have doubled because every construct has been split into an explicit and an implicit version of the construct. there are now even constructs such as implicit political orientation or implicit gender identity, with little empirical support for these implicit constructs (cf. schimmack, 2019). the proliferation of constructs and measures is not a sign of a healthy science. rather, it shows the inability of empirical studies to demonstrate that a measure is not valid, that a construct does not exist, or that a construct is redundant with other constructs. this is mostly due to self-serving biases and motivated reasoning of test developers. the rewards from a widely used measure are immense: articles that introduced popular measures like the implicit association test (greenwald et al., 1998) have some of the highest citation rates. thus, it is tempting to use weak evidence to make sweeping claims about validity. one task for meta-psychologists could be to critically evaluate claims of construct validity by original authors, because original authors are likely to be biased in their evaluation of construct validity (cronbach, 1989).

the validation crisis

cronbach and meehl make it clear that they were skeptical about the construct validity of many psychological measures: “for most tests intended to measure constructs, adequate criteria do not exist. this being the case, many such tests have been left unvalidated, or a fine-spun network of rationalizations has been offered as if it were validation.
rationalization is not construct validation. one who claims that his test reflects a construct cannot maintain his claim in the face of recurrent negative results because these results show that his construct is too loosely defined to yield verifiable inferences” (p. 291). in my opinion, nothing much has changed in the world of psychological measurement. flake et al. (2017) reviewed current practices and found that reliability is often the only criterion that is used to claim construct validity. however, reliability of a single measure cannot be used to demonstrate construct validity because reliability is only necessary, but not sufficient, for validity. thus, many articles provide no evidence for construct validity, and even if the evidence were sufficient to claim that a measure is valid, it remains unclear how valid a measure is. another sign that psychology has a validity crisis is that psychologists today still use measures that were developed decades ago (cf. schimmack, 2010). although these measures could be highly valid, it is also likely that they have not been replaced by better measures because quantitative evaluations of validity are lacking. for example, rosenberg’s (1965) 10-item self-esteem scale is still the most widely used measure of self-esteem (bosson et al., 2000; schimmack, 2019). however, the construct validity of this measure has never been quantified, and it is not clear whether it is more valid than other measures of self-esteem.

what is the alternative?

while there is general agreement that current practices have serious limitations (kane, 2017; maul, 2017), there is no general agreement about the best way to address the validation crisis. some comments suggest that psychology might fare better without quantitative measurement (maul, 2017). if we look to the natural sciences, this does not appear to be an attractive alternative.
in the natural sciences, progress has been made by increasingly more sophisticated measurements of basic units such as time and length (nanotechnology). meehl was an early proponent of more rather than less rigorous methods in psychology. if psychologists had followed his advice to quantify validity, psychological science would have made more progress. thus, i do not think that abandoning quantitative psychology is an attractive alternative. others believe that cronbach and meehl’s agenda is too ambitious (kane, 2016, 2017): “where the theory is strong enough to support such efforts, i would be in favor of using them, but in most areas of research, the required theory is lacking” (kane, 2017, p. 81). this may be true for some areas of psychology, such as educational testing, but it is not true for basic psychological science, where the sole purpose of measures is to test psychological theories. in this context, construct validation is crucial for the testing of causal theories. for example, theories of implicit social cognition require valid measures of implicit cognitive processes (greenwald et al., 1998; schimmack, 2019). thus, i am more optimistic than kane that psychologists have causal theories of important constructs such as attitudes, personality traits, and well-being that can inform a program of construct validation. the i/o literature shows that it is possible to estimate construct validity even with rudimentary causal theories (cote & buckley, 1987), and there are some examples in social and personality psychology where structural equation modeling was used to quantify validity (schimmack, 2010, 2019; zou et al., 2013). thus, i believe improvement of psychological science requires a quantitative program of research on construct validity.

conclusion

just as psychologists have started to appreciate replication failures in the past years, they need to embrace validation failures.
some of the measures that are currently used in psychology are likely to have insufficient construct validity. if the 2010s were the decade of replication, the 2020s may become the decade of validation. it is time to examine how valid the most widely used psychological measures actually are. cronbach and meehl (1955) outlined a program of construct validation research. ample citations show that they were successful in introducing the term, but psychologists failed to adopt the rigorous practices they were recommending. it is time to change this and establish clear standards of construct validation that psychological measures should meet. most important, validity has to be expressed in quantitative terms to encourage competition for developing new measures of existing constructs with higher validity.

author contact

i am grateful for the financial support of this work by the canadian social sciences & humanities research council (sshrc). correspondence regarding this article should be addressed to ulrich.schimmack@utoronto.ca, department of psychology, university of toronto mississauga, 3359 mississauga road, on l5l 1c6. orcid: https://orcid.org/0000-0001-9456-5536.

conflict of interest and funding

i do not have a conflict of interest.

author contributions

i am the sole contributor to the content of this article.

open science practices

this article earned no open science badges because it is theoretical and does not contain any data or data analyses.

references

anusic, i., schimmack, u., pinkus, r. t., & lockwood, p. (2009). the nature and structure of correlations among big five ratings: the halo-alpha-beta model. journal of personality and social psychology, 97(6), 1142-1156. http://dx.doi.org/10.1037/a0017159

baumeister, r. f., vohs, k. d., & funder, d. c. (2007). psychology as the science of self-reports and finger movements: whatever happened to actual behavior? perspectives on psychological science, 2(4), 396–403.
https://doi.org/10.1111/j.17456916.2007.00051.x borsboom, d. & wijsen, l. d. (2016). frankenstein’s validity monster: the value of keeping politics and science separated. assessment in education: principles, policy & practice, 23, 281-283. doi: 10.1080/0969594x.2016.1141750 bosson, j. k., swann, w. b., jr., & pennebaker, j. w. (2000). stalking the perfect measure of implicit self-esteem: the blind men and the elephant revisited? journal of personality and social psychology, 79(4), 631-643. http://dx.doi.org/10.1037/00223514.79.4.631 campbell, d. t., & fiske, d. w. (1959). convergent and dis-criminant validation by the multitrait-multimethod ma-trix. psychological bulletin, 56(2), 81-105. http://dx.doi.org/10.1037/h0046016 cohen, j. (1962). statistical power of abnormal–social psy-chological research: a review. journal of abnormal and social psychology, 65, 145–153. doi:10.1037/h0045186 connelly, b. s., & ones, d. s. (2010). an otherperspective on personality: meta-analytic integration of observers’ accuracy and predictive validity. psychological bulletin, 136(6), 10921122.http://dx.doi.org/10.1037/a0021212 cote, j. a., & buckley, m. r. (1987). estimating trait, method, and error variance: generalizing across 70 construct validation studies. journal of marketing, 24, 315-318. cronbach, l. j. (1989). construct validation after thirty years. in r. l. linn (ed.), intelligence: measurement, theory, and public policy(pp. 147-171). chicago: university of illinois press cronbach, l. j., & meehl, p. e. (1955). construct validity in psychological tests. psychological bulletin, 52(4), 281-302. http://dx.doi.org/10.1037/h0040957 falk, c., heine, s. j., takemura, k., zhang, c., & hsu, c. w. (2015). are implicit self-esteem measures valid for as-sessing individual and cultural differences? journal of personality, 83, 56-68. doi:10.1111/jopy.12082 flake, j. k., pek, j., & hehman, e. (2017). 
Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8(4), 370–378. https://doi.org/10.1177/1948550617693063

Greenwald, A. G., & Farnham, S. D. (2000). Using the Implicit Association Test to measure self-esteem and self-concept. Journal of Personality and Social Psychology, 79(6), 1022–1038. http://dx.doi.org/10.1037/0022-3514.79.6.1022

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. doi:10.1177/0956797611430953

Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183–202. https://doi.org/10.1007/bf02289343

Kane, M. T. (2016). Explicating validity. Assessment in Education: Principles, Policy & Practice, 23, 198–211. doi:10.1080/0969594X.2015.1060192

Kane, M. T. (2017). Causal interpretations of psychological attributes. Measurement: Interdisciplinary Research and Perspectives, 15, 79–82. doi:10.1080/15366367.2017.1369771

Martel, M. M., Schimmack, U., Nikolas, M., & Nigg, J. T. (2015). Integration of symptom ratings from multiple informants in ADHD diagnosis: A psychometric model with clinical utility. Psychological Assessment, 27(3), 1060–1071.

Maul, A. (2017). Moving beyond traditional methods of survey validation. Measurement: Interdisciplinary Research and Perspectives, 15, 103–109. https://doi.org/10.1080/15366367.2017.1369786

McCrae, R. R., Zonderman, A. B., Costa, P. T., Jr., Bond, M. H., & Paunonen, S. V. (1996).
Evaluating replicability of factors in the Revised NEO Personality Inventory: Confirmatory factor analysis versus Procrustes rotation. Journal of Personality and Social Psychology, 70(3), 552–566. http://dx.doi.org/10.1037/0022-3514.70.3.552

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), 943–950.

Podsakoff, P. M., MacKenzie, S. B., & Podsakoff, N. P. (2012). Sources of method bias in social science research and recommendations on how to control it. Annual Review of Psychology, 63, 539–569.

Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.

Schimmack, U. (2010). What multi-method data tell us about construct validity. European Journal of Personality, 24, 241–257. doi:10.1002/per.771

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566. http://dx.doi.org/10.1037/a0029487

Schimmack, U. (2019). The Implicit Association Test: A method in search of a construct. Perspectives on Psychological Science. https://doi.org/10.1177/1745691619863798

Schmittmann, V. D., Cramer, A. O. J., Waldorp, L. J., Epskamp, S., Kievit, R. A., & Borsboom, D. (2013). Deconstructing the construct: A network perspective on psychological phenomena. New Ideas in Psychology, 31, 43–53. https://doi.org/10.1016/j.newideapsych.2011.02.007

Zou, C., Schimmack, U., & Gere, J. (2013). The validity of well-being measures: A multiple-indicator–multiple-rater model. Psychological Assessment, 25(4), 1247–1254. http://dx.doi.org/10.1037/a0033902
Meta-Psychology, 2022, vol. 6, MP.2020.2460
https://doi.org/10.15626/mp.2020.2460
Article type: Original Article
Published under the CC-BY4.0 license
Open data: Not applicable
Open materials: Yes
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: No
Edited by: Moritz Heene
Reviewed by: Gelman, A., Martin, S. R., Olvera Astivia, O.
Analysis reproduced by: Jens Fust
All supplementary files can be accessed at OSF: https://doi.org/10.17605/osf.io/p6gwf

Power or Alpha? The Better Way of Decreasing the False Discovery Rate

František Bartoš, University of Amsterdam; Faculty of Arts, Charles University
Maximilian Maier, University of Amsterdam
Both authors contributed equally

Abstract

The replication crisis in psychology has led to increased concern regarding the false discovery rate (FDR) – the proportion of false positive findings among all significant findings. In this article, we compare two previously proposed solutions for decreasing the FDR: increasing statistical power and decreasing the significance level α. First, we provide an intuitive explanation of α, power, and the FDR to improve the understanding of these concepts. Second, we investigate the relationship between α and power. We show that, for decreasing the FDR, reducing α is more efficient than increasing power, and we suggest that researchers interested in reducing the FDR should decrease α rather than increase power. By investigating the relative importance of both the α level and power, we connect the literature on these topics, and our results have implications for increasing the reproducibility of psychological science.

Keywords: power, significance level, false discovery rate, alpha

The reproducibility of studies in psychology has been questioned in the last few years.
Massive replication initiatives found that replicability can be as low as 36% (Open Science Collaboration, 2015; but see Camerer et al., 2018; Ebersole et al., 2016; Klein et al., 2014; Klein et al., 2018 for more optimistic estimates), and many researchers have tried to identify the factors affecting the replicability of studies. While a comprehensive overview is beyond the scope of a single article (a whole issue of Perspectives on Psychological Science was dedicated to the problem; Pashler and Wagenmakers, 2012), we focus on statistical power, the significance level α, and the false discovery rate (FDR; the proportion of false positive findings among all statistically significant findings).¹ While some papers emphasize the importance of increasing statistical power to decrease the FDR (Button et al., 2013; Christley, 2010), others call for decreasing α (Benjamin et al., 2018). However, these two views seem disconnected, and it is unclear whether (or under which conditions) researchers should decide to decrease α and when to increase power in order to reduce the FDR. To further explore this disconnect, we reviewed all articles mentioning the FDR (or related terms) in the context of power and α in five methods and evidence-synthesis journals within psychology (for more details see: https://osf.io/9cfg8/). Out of 106 reviewed articles, nine explicitly stated the importance of increasing power to reduce the FDR, while five articles discussed the importance of decreasing α.² Notably, only Miller and Ulrich (2019) discussed that both decreasing α and increasing power would reduce the FDR. However, the efficiency of these two options has not been compared so far.

¹ The FDR is sometimes also called the false positive rate (FPR; Benjamin et al., 2018) or the false positive risk (FPR; Colquhoun, 2017).
The current article aims to bridge the discussion of α and power regarding the FDR and to identify the more efficient way of reducing the FDR. To achieve this, we first reiterate the concepts of power, false positives, and the false discovery rate, explaining them with intuitive examples to deepen the understanding of these concepts. Next, we examine two possible views and their impact on reducing the FDR. The first view concerns planning a study and deciding on α and power independently. The second view concerns balancing α and power for a fixed design, where setting α determines power and vice versa.

False Positives and α

In his pivotal book "Statistical Methods for Research Workers", Fisher (1925) was the first to widely popularize the concepts of hypothesis testing and statistical significance to differentiate signal from noise. Neyman and Pearson (1928) introduced the conceptualization of the significance level α as a tool to control long-term error rates. In other words, decisions from a statistical test with significance level α (e.g., 5%) would not result in more than a rate α of incorrectly rejected true null hypotheses. Thus, α determines the long-term rate of false positives. If researchers set their α to 5%, they will accept the alternative hypothesis when the probability of the data, or more extreme data, assuming the null hypothesis to be true (the p-value) is below α. Let us illustrate this concept with an example from Fisher's (1935) famous experiment "the lady tasting tea". Lady Muriel Bristol claims that she can detect whether tea or milk was added first to a cup. To test whether the lady has these tea tasting abilities, Fisher gives Lady Bristol eight cups of tea, four of which have milk added first, while the other four have tea added first. Fisher wants to keep his long-term rate of false positives below 5%.
Since the lady knows that half of the cups are tea-first, Fisher focuses only on the number of correctly classified tea-first cups (the correctly classified milk-first cups follow from the correctly classified tea-first cups). How many of the four tea-first cups would the lady need to classify correctly to convince Fisher of her abilities? The probability of correctly guessing x tea-first cups in four trials can be obtained from the hypergeometric distribution (Figure 1, left). All four tea-first cups would be guessed correctly with a probability of 1.43%; it would therefore be improbable to see the lady give all eight correct answers if she had no tea tasting abilities and guessed entirely at random. But what if she makes one mistake? The probability of classifying at least three out of four tea-first cups correctly by pure guessing is 24.3%. In other words, this would not provide sufficient evidence against her lack of abilities, and Fisher would be unable to tell whether she can differentiate between the cups: even if she were guessing entirely at random, she would achieve at least three out of four correctly guessed tea-first cups 24.3% of the time.

Power

Neyman and Pearson (1928) introduced the concept of statistical power because of the fundamental asymmetry of controlling Type I error rates without explicitly formalizing Type II error control (the probability of concluding the absence of an effect when it exists; Lehmann, 1992). Statistical power describes the probability that a statistical test rejects the null hypothesis when it is false. In other words, power refers to the probability of rejecting the null hypothesis, assuming that the hypothesized effect is present. The statistical power of a test depends on α, the sample size, and the magnitude of the true effect: a higher α, a larger sample size, and a larger true effect all increase statistical power (Cohen, 1992).
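The guessing probabilities from the tea-tasting example above follow directly from the central hypergeometric distribution. A minimal Python sketch (the function name is ours, not the authors'):

```python
from math import comb

def guess_pmf(k, cups=8, tea_first=4):
    """P(exactly k tea-first cups identified) under pure guessing:
    central hypergeometric (the lady labels 4 of the 8 cups as tea-first)."""
    return comb(tea_first, k) * comb(cups - tea_first, tea_first - k) / comb(cups, tea_first)

p_all_four = guess_pmf(4)                      # 1/70 ≈ 0.0143
p_three_or_more = guess_pmf(3) + guess_pmf(4)  # 17/70 ≈ 0.243

print(round(p_all_four, 4), round(p_three_or_more, 4))  # → 0.0143 0.2429
```

These match the 1.43% and 24.3% reported in the text.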
Power is thus related to false negatives, with higher statistical power decreasing the probability of obtaining a false negative result. Let us continue with the previous example but look at it from the other side. Assume that the lady can indeed distinguish whether the milk or tea was added first. It is a difficult task, and she makes a mistake from time to time: her probability of classifying a cup correctly is 0.7. The resulting probabilities this time follow a noncentral hypergeometric distribution (Liao and Rosen, 2001; Figure 1, right). The probability of her classifying all eight cups correctly is now 19%. In other words, if the lady has the ability to classify correctly in 70% of cases, Fisher would detect this only 19% of the time.

Figure 1. The hypergeometric distribution shows the probability of x successes (x-axis) with a probability of success of 0.50 (left) and 0.70 (right). Note that we only display up to four successes. We can think of the bars as the number of tea-first cups classified correctly. The lady knows how many (but not which) cups have tea added first; therefore, if she classifies all tea-first cups correctly, she necessarily also classifies all milk-first cups correctly. The dark-filled bars correspond to the probability of 4 correct answers.

² Most of the remaining articles focused on correction for multiple testing.

False Discovery Rate

It follows from the previously outlined definitions that power does not influence the probability of observing a false positive result in any single study. However, since negative results are rarely published (Masicampo and Lalande, 2012; Mathur and VanderWeele, 2020; Nelson et al., 1986; Rosenthal, 1979; Rosenthal and Gaito, 1963, 1964; Wicherts, 2017; but see van Aert et al., 2019
for contrary evidence), it is more interesting to investigate the proportion of false positives among significant findings, i.e., the false discovery rate (FDR). This proportion depends on the number of true positives (believing that someone possesses the tea tasting abilities when they truly do) and the number of false positives (believing that someone possesses the tea tasting abilities when they do not). While the number of true positives depends on power and the proportion of true alternative hypotheses, the number of false positives depends on α and the proportion of false hypotheses. The FDR thus connects both previously mentioned concepts, and we illustrate it with our running example. Her Majesty the Queen decides to start a Royal Tea Tasting Society (RTTS) and requests Fisher to recruit new members based on their tea tasting abilities. Assume that one-fifth of the population possesses such abilities and can identify the order of milk and tea in 70% of cases. The remaining four-fifths do not possess this skill, and their answers are equal to random guessing. Fisher decides to use an α of 5%; therefore, 0.05 × 0.80 = 4% of the tests he administers result in false positives. Because he conveniently uses the same set-up as in the previous example, we know that the power of the test is 19%; therefore, 0.19 × 0.20 = 3.8% of the tests he administers yield true positives. Subsequently, he introduces all citizens who passed the test to the Queen, who promotes them to members of the RTTS. What the Queen does not realize, however, is that 0.04/(0.04 + 0.038) = 51% of her RTTS members do not possess any tea tasting abilities (the FDR). As the example shows, there are two ways to decrease the FDR: either increase power and thus the number of true positives, or reduce α and thus the number of false positives.
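Both the 19% power and the 51% FDR of the running example can be checked numerically. A sketch, assuming (following Liao and Rosen, 2001) that a per-cup accuracy p corresponds to the odds ratio omega = (p/(1−p))² in Fisher's noncentral hypergeometric distribution; this parameterization and the function names are ours:

```python
from math import comb

def noncentral_pmf(k, cups=8, tea_first=4, p_correct=0.7):
    """Fisher's noncentral hypergeometric pmf for the number of correctly
    identified tea-first cups, with odds ratio omega = (p/(1-p))**2 as our
    assumed mapping from the per-cup accuracy p."""
    omega = (p_correct / (1 - p_correct)) ** 2
    w = [comb(tea_first, x) * comb(cups - tea_first, tea_first - x) * omega ** x
         for x in range(tea_first + 1)]
    return w[k] / sum(w)

def fdr(alpha, power, p_h0):
    """False discovery rate: false positives among all significant results."""
    return p_h0 * alpha / (p_h0 * alpha + (1 - p_h0) * power)

power = noncentral_pmf(4)                        # ≈ 0.19, as in Figure 1 (right)
rtts_fdr = fdr(alpha=0.05, power=power, p_h0=0.8)

print(round(power, 2), round(rtts_fdr, 2))  # → 0.19 0.51
```

The pmf also reproduces the remaining bars of Figure 1 (right): .000, .019, .231, .559, .190.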
This relationship is depicted in Equation (1), which illustrates how power and α influence the FDR, with P(H0) standing for the proportion of true null hypotheses, α for the significance level, and ρ for statistical power:

FDR = (P(H0) × α) / (P(H0) × α + (1 − P(H0)) × ρ).  (1)

This is the reason why many argue that researchers need to increase statistical power to reduce the FDR. However, we show in the following paragraphs that reducing α is usually the preferable option by investigating two ways of considering the trade-off between power and α. In the first, researchers plan a study and independently determine what levels of α and power should be used. In the second, researchers balance α and power for a fixed design, where setting α determines the power and vice versa.

Determining α and Power Independently

The first view assumes that α and power are set independently.³ For example, researchers plan a study with a desired α and power and compute the required sample size for achieving them. Subsequently, we can study how changing either α or power in the planning phase influences the FDR.

³ The first case and the following derivations were suggested by Stephen R. Martin in his review (https://osf.io/7kdjn/).

Figure 2. The logarithm of the FDR gradient (z-axis) in dependence on α (x-axis) and power (y-axis) for a probability of the null hypothesis being true equal to 0.5. The red surface (with blue lines) depicts the gradient of the FDR with respect to α, and the green surface (with red lines) depicts the gradient of the FDR with respect to power. Note that they intersect when α is equal to power. When α is lower than power (right side), the gradient of the FDR with respect to α dominates the gradient with respect to power. An animated version is accessible at https://osf.io/gbtku/.
To do so, we present derivatives of Equation (1) with respect to α,

δFDR/δα = (ρ × (1 − P(H0)) × P(H0)) / (α × P(H0) + ρ × (1 − P(H0)))²,  (2)

and with respect to power,

δFDR/δρ = (−α × (1 − P(H0)) × P(H0)) / (α × P(H0) + ρ × (1 − P(H0)))².  (3)

Equations (2) and (3) connect a change in α or power to a change in the FDR. Since the denominators are the same and P(H0) is bound between 0 and 1, the comparison of Equations (2) and (3) shows that the gradient of the FDR with respect to α will dominate the gradient of the FDR with respect to power as long as power is larger than α (Figure 2). This is generally true because α is the lower bound on power, unless a one-sided test is used and the effect is in the opposite direction; then power is lower than α, and the gradient of the FDR with respect to power dominates the gradient of the FDR with respect to α. In addition, when a two-sided test is used but power is low, many significant results will be in the opposite direction (Type S error; Gelman and Carlin, 2014); including those in the FDR would further change the results. Compelling visualizations that support this claim are available in the online materials (https://osf.io/gbtku/), and a more detailed discussion of this approach can be found in the open review (https://osf.io/sp95d/). Overall, this indicates that for all conditions typically encountered in hypothesis testing, the gradient with respect to α will dominate the gradient with respect to power. In other words, when designing a study, planning a lower α has a larger effect on the FDR than planning higher power, as long as power is kept higher than α. So, if Fisher wanted to mitigate the proportion of RTTS members with no tea tasting abilities before the experiment was conducted, the best solution would be to decrease α as much as possible.

Trading α and Power

The second view goes one step further.
If we assume that researchers operate with limited resources (i.e., a limited number of participants or limited time), then α determines power, or vice versa. In other words, for a fixed design, researchers can either set α, and power can be expressed as a function of α, or set a desired power, and α can be expressed as a function of power. Equation (4) shows the relationship of power (ρ), on the left side, to α in the case of a two-tailed independent samples z-test. In addition, the sample size n and effect size d are needed to determine the parameter µ of the normal distribution of expected z-statistics under the alternative hypothesis. The significance level α determines the upper and lower cut-off values used for significance testing through the quantile function of the standard normal distribution, Φ⁻¹. The cut-offs are subsequently used in the cumulative distribution function Φµ of the normal distribution with mean µ and standard deviation 1 to determine the probability of obtaining z-values more extreme than the cut-offs:

ρ = 1 − Φµ(Φ⁻¹(1 − α/2)) + Φµ(Φ⁻¹(α/2)).  (4)

The µ parameter of the cumulative distribution function of the normal distribution for a two-sample independent z-test depends only on the effect size d and the number of participants n, split equally between the groups (Equation (5)). More participants or a larger effect size means that the distribution of z-statistics has a higher mean µ:

µ = d × √n / 2.  (5)

Equations (4) and (5) are also depicted for a concrete example with n = 100, d = 0.5, and α = .05 (Figure 3).

Figure 3. Equations (4) and (5) correspond to this visualization when assuming n = 100, d = 0.5, and α = .05.
The vertical lines correspond to the cut-off z-statistics computed using the quantile function of the normal distribution under the null hypothesis (dashed line). The full line corresponds to the expected distribution of z-statistics under the alternative hypothesis, with the grey-filled area corresponding to the power computed using the cumulative distribution function.

If α is decreased, the vertical lines placed at the cut-off z-statistics determined by the quantile function of the normal distribution move further apart from the center and thus reduce the grey-filled area corresponding to the power. On the other hand, one could also increase α and thus increase the area corresponding to power. So, given a constant sample size and effect size, researchers face two possibilities: they can either (a) increase α, relaxing the cut-off and thus achieving higher power, or (b) decrease α and thereby lower the power. There is a convention to set α in statistical tests to 5%; however, there is no reason why α should remain constant at this fixed value. Fisher (1956) explained that the 5% convention should be disregarded whenever there are other substantial reasons to determine α. More recently, scientists have again called for a more flexible adaptation of α (Lakens et al., 2018). In other words, in a psychological science that operates with limited resources, there is always a trade-off to be made between avoiding false positives and detecting true positives. If Fisher wants to mitigate the proportion of RTTS members with no tea tasting abilities (assuming he has a constrained budget), he faces two options. On the one hand, he can decrease α and lower the number of false positives at the cost of decreased power and fewer true positives. On the other hand, he can increase power and the number of true positives at the cost of a higher α, leading to more false positives.
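Equations (4) and (5) can be sketched in a few lines of Python, using the standard library's `statistics.NormalDist` in place of Φ (the function name is ours):

```python
from statistics import NormalDist

def z_test_power(alpha, n, d):
    """Two-sided power of an independent samples z-test, Equations (4)-(5):
    mu = d * sqrt(n) / 2, with n participants split equally across groups."""
    mu = d * n ** 0.5 / 2
    null, alt = NormalDist(), NormalDist(mu, 1.0)
    upper = null.inv_cdf(1 - alpha / 2)   # upper significance cut-off
    lower = null.inv_cdf(alpha / 2)       # lower significance cut-off
    return 1 - alt.cdf(upper) + alt.cdf(lower)

# The Figure 3 set-up: n = 100, d = 0.5, alpha = .05
print(round(z_test_power(alpha=0.05, n=100, d=0.5), 3))  # → 0.705
```

Lowering α pushes the cut-offs outward and shrinks the returned power, exactly the trade-off described above.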
The important question is: which is more efficient in lowering the FDR, lowering α or increasing power? We show that for a two-sided z-test, and for a one-sided z-test with the true effect in the predicted direction, decreasing α leads to a lower FDR than increasing statistical power at a constant sample size. Figure 4 shows this relationship for an independent samples z-test with a proportion of true null hypotheses P(H0) = 0.5, effect size d = 0.5, and sample size n = 100 (50 per group). Similar results can be obtained for different sample sizes, effect sizes, proportions of true null hypotheses, and statistical tests (code to generate 3D plots across different µs can be found at https://osf.io/uszxk/). There is always a decrease in the FDR with decreasing α, with two exceptions. First, if the null hypotheses are either all false or all true (the latter including an effect size equal to 0), then the proportion is 0 or 1, respectively, independent of power and α. Second, for one-sided tests where the true effect is opposite to the expected direction, the FDR will increase with decreasing α. However, these two situations should be relatively rare in practice; therefore, reducing α is usually the most efficient way to decrease the FDR. For a more formal analysis, we also calculated the gradient of the FDR with respect to α (see the supplementary materials at https://osf.io/svu7r). This supports the conclusion that reducing α is more efficient in reducing the FDR, since the derivative is positive for all values of α apart from one-sided tests with an effect in the opposite direction.
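The same dominance argument from the planning view can be verified numerically; a sketch of Equations (2) and (3), with illustrative parameter values of our choosing:

```python
def fdr_grad_alpha(alpha, power, p_h0):
    """dFDR/dalpha, Equation (2): positive, with power in the numerator."""
    return power * (1 - p_h0) * p_h0 / (alpha * p_h0 + power * (1 - p_h0)) ** 2

def fdr_grad_power(alpha, power, p_h0):
    """dFDR/dpower, Equation (3): negative, with alpha in the numerator."""
    return -alpha * (1 - p_h0) * p_h0 / (alpha * p_h0 + power * (1 - p_h0)) ** 2

# Whenever power > alpha, the alpha-gradient dominates in absolute value,
# since the two expressions differ only by the factor power vs. alpha:
for alpha, power in [(0.05, 0.80), (0.05, 0.30), (0.005, 0.06)]:
    assert abs(fdr_grad_alpha(alpha, power, 0.5)) > abs(fdr_grad_power(alpha, power, 0.5))
```

The ratio of the two magnitudes is simply ρ/α, which makes the dominance condition (power larger than α) immediate.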
3D plots showing the derivative for different noncentrality parameters (ncps) can be found at https://osf.io/uszxk/.

Figure 4. Trading off between power and α with P(H0) = 0.5, d = 0.5, and n = 100 (50 per group) results in the displayed FDR. The double x-axis shows α with its corresponding power, scaled according to α in the left chart and according to power in the right chart. To plot the relationship for other statistical tests, see https://osf.io/uwkqz/.

Figure 5. The figure displays the gradient of the FDR with respect to α (and corresponding power) from a trade-off between power and α with P(H0) = 0.5, d = 0.5, and n = 100 (50 per group). The double x-axis shows α with its corresponding power, scaled according to α in the left chart and according to power in the right chart.

Figure 5 shows this gradient for an independent samples z-test with a proportion of true null hypotheses P(H0) = 0.5, effect size d = 0.5, and sample size n = 100 (50 per group). An expected objection is that, instead of trading off by increasing α, one can achieve an increase in power by increasing the sample size. As explained before, there is no apparent reason for keeping α constant when increasing the sample size; instead, one can keep the power fixed and use the higher sample size to decrease α.
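This choice between spending extra participants on power or on a lower α can be sketched numerically: double the sample size, keep either α or power fixed, and compare the resulting FDR. The bisection helper and function names are ours:

```python
from statistics import NormalDist

def z_test_power(alpha, n, d):
    """Two-sided power of an independent samples z-test (Equations (4)-(5))."""
    mu = d * n ** 0.5 / 2
    alt, q = NormalDist(mu, 1.0), NormalDist().inv_cdf
    return 1 - alt.cdf(q(1 - alpha / 2)) + alt.cdf(q(alpha / 2))

def alpha_for_power(target, n, d, lo=1e-12, hi=0.5):
    """Bisection: the alpha that yields the target power at fixed n and d
    (two-sided power increases monotonically in alpha)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if z_test_power(mid, n, d) < target else (lo, mid)
    return (lo + hi) / 2

def fdr(alpha, power, p_h0=0.5):
    return p_h0 * alpha / (p_h0 * alpha + (1 - p_h0) * power)

base_power = z_test_power(0.05, 100, 0.5)        # ≈ 0.705 at the starting design
# Spend n = 200 either on more power (alpha fixed) or on a lower alpha (power fixed):
fdr_fixed_alpha = fdr(0.05, z_test_power(0.05, 200, 0.5))
fdr_fixed_power = fdr(alpha_for_power(base_power, 200, 0.5), base_power)

assert fdr_fixed_power < fdr_fixed_alpha  # lowering alpha wins
```

Under these assumptions the fixed-power route pushes the FDR well below the fixed-α route, mirroring the pattern the text describes.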
Figure 6 shows that keeping the power constant and decreasing α by increasing the sample size is more efficient in lowering the FDR. A similar pattern can be observed irrespective of the starting sample size, α, power, effect size, and proportion of true null hypotheses: the decrease in the FDR is stronger when the increase in sample size is used to reduce α rather than to increase power.

Figure 6. The displayed FDR results when either keeping the power (triangles) or α (circles) fixed while increasing the sample size. The filled circle marks the starting point at n = 100 (50 per group) with P(H0) = 0.5 and d = 0.5, resulting in power = 0.70 and α = 0.05.

Discussion

Our analysis shows that reducing α is usually more effective in reducing the false discovery rate than increasing power. Researchers striving to reduce the false discovery rate should therefore reduce their α instead of increasing power. This is true not only when planning a study and deciding on the levels of α and power, but also when balancing power and α at a constant sample size, or when increasing the sample size and considering whether to "spend" the additional participants on increasing power or on reducing α. Our conclusion is similar to the long-standing literature on α adjustments for controlling the false discovery rate in multiple testing (e.g., Benjamini & Hochberg, 1995). However, the main goal of that literature is to keep the false discovery rate for a set of tests below a certain threshold, rather than trading α against power with respect to the FDR. We also need to consider several limitations of our analyses. In the case of one-sided tests, reducing α is only more beneficial if the true effect is in the expected direction.
In the case of two-sided tests, incorporating Type S errors into the definition of the FDR increases the effectiveness of power when power is close to α. However, neither of these scenarios is plausible under common conditions. In addition, for balancing α and power, we only present results for the two-sample z-test, assuming that the assumptions of the statistical test (e.g., homoscedasticity and normality) are fulfilled. While the relationship between power, α, and the FDR for a variety of other tests can be found at https://osf.io/uwkqz/ and is in line with our analysis, a formal proof that the proposed relationship holds for all tests under all conditions is not presented in this paper. More research is needed to generalize our results to more kinds of tests and settings. We also analyze only the effects of α and power, while an additional cause of non-replicability can be a low prior probability of the tested hypotheses (Benjamin et al., 2018; Hoogeveen et al., 2020; Ioannidis, 2005), which plays a direct role in the FDR formula. In addition, we want to emphasize that we are still advocates of high power, for several reasons.⁴ First, high power is crucial for avoiding Type II errors. Controlling Type I errors is often perceived as more important than controlling Type II errors (e.g., Cohen, 1956); however, in some contexts Type II errors might be more problematic (Fiedler et al., 2012). For example, consider researchers first investigating a new, potentially groundbreaking treatment for depression. Here, the Type II error of not detecting the effectiveness of the treatment might be more costly than concluding that the treatment is effective when it is not: this error (and the consequent abandonment of this line of research) would mean missing an opportunity to improve the lives of people with depression.
Another example might be replication studies, where the primary focus is to test whether a previously reported effect is present, with lesser concern about inflating the FDR. Here, high power is crucial to avoid Type II errors. In addition, low power combined with conditioning on significance leads to an overestimation of effect sizes (Type M error) and to effect size estimates in the wrong direction (Type S error; Gelman and Carlin, 2014). For these reasons, highly powered studies are crucial for cumulative science. We therefore recommend that, in practice, researchers think about their inferential goals, weighing the costs of both Type I and Type II errors, to determine an optimal α and power (Lakens et al., 2018; Maier & Lakens, 2022; Miller & Ulrich, 2019; Mudge et al., 2012). If an important goal is to reduce the FDR, our analyses show that reducing α is more effective than increasing power. Last but not least, we want to point out that the actual α level is often higher than the nominal α level due to questionable research practices, such as optional stopping or failure to report all dependent variables (John et al., 2012; Simmons et al., 2011; Wicherts, 2017). Therefore, finding ways to prevent these practices with tools such as preregistration (Nosek et al., 2018) and Registered Reports (Chambers et al., 2015) is probably one of the most critical tasks psychological science is facing. Some researchers also argue that we should abandon the framework of statistical testing altogether and instead focus solely on summarizing the full information about effect size estimates (McShane et al., 2019).

Conclusion

We strove for two objectives in this paper. First, we reiterated the concepts of α, power, and the false discovery rate, hopefully improving the understanding of these concepts. Second, we compared two previously proposed solutions for decreasing the false discovery rate.
Our results show that, with respect to the false discovery rate, it is usually more effective to decrease α than to increase statistical power. We suggest that researchers interested in reducing the false discovery rate focus on reducing α.

⁴ And we do not fear that our article will lead to a decrease in power, since six decades of articles calling for an increase in statistical power have had no visible impact (Smaldino & McElreath, 2016).

Author Contact

František Bartoš; f.bartos96@gmail.com; Department of Psychological Methods, University of Amsterdam; Faculty of Arts, Charles University; ORCID: 0000-0002-0018-5573

Maximilian Maier; maximilianmaier0401@gmail.com; Department of Psychological Methods, University of Amsterdam; ORCID: 0000-0002-9873-6096

Acknowledgments

We would like to thank Marie Delacre, Jiří Štipl, and Franziska Nippold for helpful comments and suggestions on previous versions of this manuscript.

Conflict of Interest and Funding

The authors declare that there were no conflicts of interest with respect to the authorship or the publication of this article.

Author Contributions

Both authors contributed equally to all stages of the research process and writing the manuscript.

Open Science Practices

This article earned the Open Materials badge for making the materials openly available. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C., et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10. https://doi.org/10.1038/s41562-017-0189-z

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing.
journal of the royal statistical society: series b (methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
button, k. s., ioannidis, j., mokrysz, c., nosek, b. a., flint, j., robinson, e. s., & munafò, m. r. (2013). power failure: why small sample size undermines the reliability of neuroscience. nature reviews neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475
camerer, c. f., dreber, a., holzmeister, f., ho, t.-h., huber, j., johannesson, m., kirchler, m., nave, g., nosek, b. a., pfeiffer, t., et al. (2018). evaluating the replicability of social science experiments in nature and science between 2010 and 2015. nature human behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z
chambers, c. d., dienes, z., mcintosh, r. d., rotshtein, p., & willmes, k. (2015). registered reports: realigning incentives in scientific publishing. cortex, 66, a1–a2.
christley, r. m. (2010). power and error: increased risk of false positive results in underpowered studies. the open epidemiology journal, 3(1). http://dx.doi.org/10.2174/1874297101003010016
cohen, j. (1988). statistical power analysis for the behavioral sciences. routledge.
cohen, j. (1992). statistical power analysis. current directions in psychological science, 1(3), 98–101. https://doi.org/10.1111/1467-8721.ep10768783
colquhoun, d. (2017). the reproducibility of research and the misinterpretation of p-values. royal society open science, 4(12), 171085. https://doi.org/10.1098/rsos.171085
ebersole, c. r., atherton, o. e., belanger, a. l., skulborstad, h. m., allen, j. m., banks, j. b., baranski, e., bernstein, m. j., bonfiglio, d. b., boucher, l., et al. (2016). many labs 3: evaluating participant pool quality across the academic semester via replication. journal of experimental social psychology, 67, 68–82. https://doi.org/10.1016/j.jesp.2015.10.012
fiedler, k., kutzner, f., & krueger, j. i. (2012). the long way from α-error control to validity proper: problems with a short-sighted false-positive debate. perspectives on psychological science, 7(6), 661–669. https://doi.org/10.1177/1745691612462587
fisher, r. a. (1925). statistical methods for research workers. oliver & boyd.
fisher, r. a. (1935). the design of experiments. oliver & boyd.
fisher, r. a. (1956). statistical methods and scientific inference. hafner publishing.
gelman, a., & carlin, j. (2014). beyond power calculations: assessing type s (sign) and type m (magnitude) errors. perspectives on psychological science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642
hoogeveen, s., sarafoglou, a., & wagenmakers, e.-j. (2020). laypeople can predict which social-science studies will be replicated successfully. advances in methods and practices in psychological science, 3(3), 267–285. https://doi.org/10.1177/2515245920919667
ioannidis, j. p. (2005). why most published research findings are false. plos medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
john, l. k., loewenstein, g., & prelec, d. (2012). measuring the prevalence of questionable research practices with incentives for truth telling. psychological science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953
klein, r. a., ratliff, k. a., vianello, m., adams jr, r. b., bahník, š., bernstein, m. j., bocian, k., brandt, m. j., brooks, b., brumbaugh, c. c., et al. (2014). investigating variation in replicability: a “many labs” replication project. social psychology, 45(3), 142. https://doi.org/10.1027/1864-9335/a000178
klein, r. a., vianello, m., hasselman, f., adams, b. g., adams jr, r. b., alper, s., aveyard, m., axt, j. r., babalola, m. t., bahník, š., et al. (2018). many labs 2: investigating variation in replicability across samples and settings. advances in methods and practices in psychological science, 1(4), 443–490. https://doi.org/10.1177/2515245918810225
lakens, d., adolfi, f. g., albers, c. j., anvari, f., apps, m. a., argamon, s. e., baguley, t., becker, r. b., benning, s. d., bradford, d. e., et al. (2018). justify your alpha. nature human behaviour, 2(3), 168–171. https://doi.org/10.1038/s41562-018-0311-x
lehmann, e. (1992). introduction to neyman and pearson (1933) on the problem of the most efficient tests of statistical hypotheses. breakthroughs in statistics (pp. 67–72). springer.
liao, j. g., & rosen, o. (2001). fast and stable algorithms for computing and sampling from the noncentral hypergeometric distribution. the american statistician, 55(4), 366–369.
maier, m., & lakens, d. (2022). justify your alpha: a primer on two practical approaches (no. 2).
masicampo, e., & lalande, d. r. (2012). a peculiar prevalence of p-values just below .05. the quarterly journal of experimental psychology, 65(11), 2271–2279. https://doi.org/10.1080/17470218.2012.711335
mathur, m. b., & vanderweele, t. j. (2020). sensitivity analysis for publication bias in meta-analyses. journal of the royal statistical society: series c (applied statistics), 69(5), 1091–1119.
mcshane, b. b., gal, d., gelman, a., robert, c., & tackett, j. l. (2019). abandon statistical significance. the american statistician, 73(sup1), 235–245.
miller, j., & ulrich, r. (2019). the quest for an optimal alpha. plos one, 14(1), e0208631. https://doi.org/10.1371/journal.pone.0208631
mudge, j. f., baker, l. f., edge, c. b., & houlahan, j. e. (2012). setting an optimal α that minimizes errors in null hypothesis significance tests. plos one, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734
nelson, n., rosenthal, r., & rosnow, r. l. (1986). interpretation of significance levels and effect sizes by psychological researchers. american psychologist, 41(11), 1299. https://doi.org/10.1037/0003-066x.41.11.1299
neyman, j., & pearson, e. s. (1928). on the use and interpretation of certain test criteria for purposes of statistical inference: part i. biometrika, 175–240. https://doi.org/10.1093/biomet/20a.3-4.263
nosek, b. a., ebersole, c. r., dehaven, a. c., & mellor, d. t. (2018). the preregistration revolution. proceedings of the national academy of sciences, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114
open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251). https://doi.org/10.1126/science.aac4716
pashler, h., & wagenmakers, e.-j. (2012). editors’ introduction to the special section on replicability in psychological science: a crisis of confidence? perspectives on psychological science, 7(6), 528–530. https://doi.org/10.1177/1745691612465253
rosenthal, r. (1979). the file drawer problem and tolerance for null results. psychological bulletin, 86(3), 638–641. https://doi.org/10.1037/0033-2909.86.3.638
rosenthal, r., & gaito, j. (1963). the interpretation of levels of significance by psychological researchers. the journal of psychology, 55(1), 33–38. https://doi.org/10.1080/00223980.1963.9916596
rosenthal, r., & gaito, j. (1964). further evidence for the cliff effect in interpretation of levels of significance. psychological reports, 15(2), 570. https://doi.org/10.2466/pr0.1964.15.2.570
simmons, j. p., nelson, l. d., & simonsohn, u. (2011).
false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
smaldino, p. e., & mcelreath, r. (2016). the natural selection of bad science. royal society open science, 3(9), 160384. https://doi.org/10.1098/rsos.160384
van aert, r. c., wicherts, j. m., & van assen, m. a. (2019). publication bias examined in meta-analyses from psychology and medicine: a meta-meta-analysis. plos one, 14(4), e0215052. https://doi.org/10.1371/journal.pone.0215052
wicherts, j. m. (2017). the weak spots in contemporary science (and how to fix them). animals, 7(12), 90–119. https://doi.org/10.3390/ani7120090

meta-psychology, 2021, vol 5, mp.2019.2134, https://doi.org/10.15626/mp.2019.2134. article type: replication report. published under the cc-by4.0 license. open data: yes. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: yes. edited by: rickard carlsson. reviewed by: daniël lakens, arvid erlandsson. analysis reproduced by: andré kalmendal. all supplementary files can be accessed at the osf project page: https://doi.org/10.17605/osf.io/3scaf

perceived morality of direct versus indirect harm: replications of the preference for indirect harm effect

ignazio ziano1, grenoble ecole de management, f-38000 grenoble, france
yu jie wang1, sydney susanto sany1, department of psychology, university of hong kong, hong kong sar
long ho ngai2, yuk kwan lau2, iban kaur bhattal2, pui sin keung2, yan to wong2, wing zhang tong2, department of psychology, university of hong kong,
hong kong sar
bo ley cheng, hill yan chan, department of psychology, university of hong kong, hong kong sar
gilad feldman3, department of psychology, university of hong kong, hong kong sar

1 joint first authors. 2 joint fourth authors. 3 corresponding author.

royzman and baron (2002) demonstrated that people prefer indirect harm to direct harm: they judge actions that produce harm as a by-product to be more moral than actions that produce harm directly. in two preregistered studies, we successfully replicated study 2 of royzman and baron (2002) with a hong kong student sample (n = 46) and an online american mechanical turk sample (n = 314). we found consistent evidential support for the preference for indirect harm phenomenon (d = 0.46 [0.26, 0.65] to 0.47 [0.18, 0.75]), weaker than the effects reported in the original findings of the target article (d = 0.70 [0.40, 0.99]). we also successfully replicated findings regarding reasons underlying a preference for indirect harm (directness, intent, omission, probability of harm, and appearance of harm). all materials, data, and code are available at osf.io/ewq8g.

keywords: direct harm, indirect harm, morality, pre-registered replication, preference for indirect harm

judgments of morality do not only depend on the result of an action, but also on the way it was performed. for instance, acts of omission are considered more moral than acts of commission, despite leading to the same result (omission bias; spranca, minsk, & baron, 1991). in their 2002 article, royzman and baron found that people preferred indirect harm to direct harm and considered indirect harm more moral (studies 1 and 2). in addition, omission bias (jamison, yay, & feldman, 2020) was found to be weaker for indirect compared to direct harm (study 3). what is the difference between direct and indirect harm? consider two actors, ann and bob, with ann inflicting harm on bob. an example of direct harm would be for ann to harm bob by pushing him off the swing.
an example of indirect harm would be for ann to saw down the tree branch to which the swing is attached, which would then in turn lead to bob falling down and getting hurt. both actions lead to the same outcome involving harm – bob getting hurt – yet the difference lies in the direct link in the causal chain of events. in principle, the indirect action could be performed without bob ever being involved and, in such a case, would result in no harm to bob. royzman and baron (2002) hypothesized and found that even if a negative outcome is the same, people judge the morality of actions leading to that negative outcome as dependent on whether there was a direct or indirect link between the action and the outcome. this in turn resulted in a strategic preference for indirect harm: in order to minimize accountability when inflicting harm, people show a preference for inflicting indirect over direct harm.

impact of “the preference for indirect harm”

preference for indirect harm is central to the understanding of moral judgment. in his seminal study, milgram (1974) observed that people were more likely to commit harm if they did not have physical contact with the victim, i.e., when the harm they had to inflict on the experimenter’s confederates was less direct. in general, dislike for physical contact with the victim may be caused by an overall preference for indirect harm. cushman, young, and hauser (2006) summarized and tested three principles of harm: action, intention, and contact. the second principle, which they termed the ‘intention principle’, is an extension of the preference for indirect harm: people prefer harm as a by-product rather than the main goal of an action. they found corroborating evidence for indirect harm as being an intuitive guide to moral judgment, building on work by haidt and hersh (2001) showing that participants were unable to explain why they would prefer indirect to direct harm.
hauser, cushman, young, kang-xing jin, and mikhail (2007) found further support for the preference for indirect harm across cultures, including that participants were unable to readily provide explanations for it. in line with these results, more recent research found further support for the intuitive nature of the preference for indirect harm, as evaluation mode (joint vs. separate) moderated the effect (paharia, kassam, greene, & bazerman, 2009). the preference for indirect harm can be linked to various practices observed in everyday life. for example, bennett (1966) compared direct and indirect action leading to the same outcome – the death of a fetus. some catholic hospitals – opposed to abortion on principle – would consent to performing a hysterectomy on pregnant women whose lives were in danger, while they would not consent to performing an abortion. the hysterectomy would not only kill the fetus, but also make the woman sterile. in these cases, catholic hospitals would prefer an action that leads to a worse indirect harm (the death of the fetus and lifetime infertility for the woman) rather than a direct action leading to a lesser harm (the death of the fetus), on the religious grounds that indirect harm to the fetus is acceptable, while direct harm is not.

choice of study for replication

we chose the royzman and baron (2002) study based on two factors: absence of direct replications and impact. to the best of our knowledge there are no published direct replications of this study thus far. the article has had significant impact on scholarly research in the area of moral psychology. at the time of writing, there were 173 google scholar citations of the article and many important follow-up theoretical and empirical articles, such as the cushman et al. (2006) three principles of harm and the investigation by hauser et al.
(2007) on the dissociation between the conscious nature of moral judgments (such as preference for indirect harm) and the intuitive nature of moral justifications (such as the intuition principle). the original article consisted of three scenario-based studies using university (study 1: n = 176) and online samples (study 2: n = 54; study 3: n = 69). in studies 1 and 2, royzman and baron (2002) asked participants to directly compare actions that lead to the same amount of harm and the same amount of a beneficial outcome. in the first scenario of study 2, for example, study participants had to compare the morality of two actions (action a and action b) leading to the same harmful outcome – preventing an alcoholic patient from receiving a liver transplant – either by lowering his priority on an organ transplant list (direct option) or by increasing everyone else’s priorities (indirect option), by indicating whether they perceived action a or action b to be more wrong. in the present investigation, we conducted two replication attempts of the two scenarios fully detailed in study 2 of royzman and baron (2002).

original findings in target article

a summary of the findings in the target article is provided in table 1. the preference for indirect harm effect was d = 0.70, 95% ci [0.40, 0.99], a medium to strong effect. they examined considerations and found support for all proposed mechanisms, with statistically significant correlations (r = .47 to r = .70) when participants deemed the consideration a reason for moral judgment (‘predicted’ column in table 1). they found weaker and sometimes statistically non-significant correlations (r = .01 to r = .16) when participants did not deem the consideration a reason underscoring a preference for indirect harm (‘opposite’ column in table 1).
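as a quick consistency check, the effect size for a one-sample design can be recovered from the reported test statistic via d = t/√n. a minimal sketch, using the t(53) = 5.12 and n = 54 reported for the target article's combined effect:

```python
import math

# for a one-sample t-test, cohen's d can be recovered as d = t / sqrt(n).
def d_from_t(t, n):
    return t / math.sqrt(n)

# original study: t(53) = 5.12 with n = 54 recovers d of about 0.70,
# matching the effect size reported in the target article.
d_original = d_from_t(5.12, 54)  # ~0.697
```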
table 1
summary of original findings in royzman and baron (2002)

factors    effect (d)  cil   cih
morality   0.70        0.40  0.99

factor                      predicted (direct)  opposite (indirect)  morality judgment r
probability                 12.0%               4.9%                 .467
directness    reason        15.0%               3.5%                 .649
              not a reason  17.4%               5.1%                 .099
appearance    reason        15.7%               3.2%                 .609
              not a reason  27.1%               6.3%                 .157
omission      reason        16.0%               2.8%                 .553
              not a reason  22.2%               6.5%                 .092
intent        reason        16.9%               3.9%                 .698
              not a reason  32.4%               5.6%                 .012

note. correlations with the morality question in the original study, according to whether or not the factor was cited as a reason for a moral judgment, from royzman and baron (2002), p. 174. ‘predicted’ indicates the share of responders indicating that the direct action was more wrong; ‘opposite’ indicates the share of responders indicating that the indirect action was more wrong. all correlations above .092 are significant at α = .05.

methods

pre-registration

in each of the replication studies, we pre-registered the experiment on the open science framework and data collection was launched soon after. pre-registrations, power analyses, disclosures, and all materials used in the experiments are available in the supplementary materials. these, together with data and code, were shared on the open science framework (project: osf.io/ewq8g; pre-registration for the hong kong undergraduate sample: osf.io/qdn2m; pre-registration for the online american sample: osf.io/hwsdc).

power analyses and deviations from preregistration

power analyses indicated that 24 participants would be sufficient to have 95% power to detect the original effect (d = 0.70) with a one-tailed alpha of .05, using a one-sample t-test as in the original article. the preregistration for the first data collection planned to sample 70 participants among hong kong university students, a decision based on convenience, as these participants were students in a psychology course. of that sample, we were able to collect 49 participants, given that participation was voluntary.
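the required sample size can be approximated from normal quantiles as n ≈ ((z₁₋α + z_power) / d)². a rough sketch (the normal approximation lands slightly below the exact noncentral-t calculation used by standard power software, which gives the 24 reported above):

```python
import math
from statistics import NormalDist

# approximate n for a one-sample, one-tailed t-test via the normal
# approximation: n = ceil(((z_{1-alpha} + z_{power}) / d) ** 2).
def approx_n(d, alpha=0.05, power=0.95):
    z = NormalDist().inv_cdf
    return math.ceil(((z(1 - alpha) + z(power)) / d) ** 2)

n_required = approx_n(d=0.70)  # 23, just under the exact t-based value of 24
```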
after excluding the students who designed this very replication, 46 participants remained. sensitivity analyses indicate that this sample size provides approximately 99.8% power to detect the original effect with a one-tailed alpha of .05. the second online data collection on amazon mechanical turk (mturk) was part of a larger project of replications of psychology findings, and this study was combined with other replications, presented in random order. the final sample size (n = 314) is due to power analyses related to the other replications running in the same data collection. sensitivity analyses indicate 99.9%+ power to detect the original effect with a one-tailed alpha of .05.

procedure

the first replication was considered a pre-test and was conducted in an undergraduate course at a university in hong kong. students worked in teams of 3 to 6 to design and run a series of replications, and one of the replications was royzman and baron’s (2002) study. the students then served as the target sample for the experiments designed by their classmates, which they had not designed and had no knowledge of prior to participation. the course materials covered classic judgement and decision-making literature, which means that the students were made aware of a wide array of heuristics and biases; the experiment therefore should be considered a very conservative test of the effect in a non-naive sample. students were randomly assigned into replication teams with different target studies for replication. student groups designed the experiment survey, conducted effect size calculations, ran power analyses, and wrote pre-registrations. pre-registrations on the osf and data collections were managed by the course instructor. all the students registered in the course were invited to take part as respondents in the study.
to ensure anonymity, students were only asked to indicate which replication group they belonged to, and those groups were later excluded from the data analysis of the study they designed. the final sample included the students who were not involved in planning the study, totaling 46 participants (15 males, 31 females; mean age = 20.2, sd = 0.99). for the second replication, two advanced course undergraduate students unrelated to the first replication worked independently to analyze the target article. they conducted effect-size calculations, power analyses, and each separately wrote a preregistration plan. they then reviewed each other's work and made final revisions, reviewed by the teaching assistant and the course coordinator. both plans were pre-registered on the osf prior to data collection by the corresponding author, who was the course instructor of the first replication and the advanced course. the final sample included 314 american mturk workers, recruited using turkprime.com (litman, robinson, & abberbock, 2017) (173 males, 141 females; mean age = 36.8, sd = 11.3). we note that the pre-registration plans included different references to possible exclusion criteria addressing generalized factors such as seriousness, english proficiency, etc. we conducted our analyses both with and without exclusions and found that exclusions had little effect on the results. for the sake of brevity, the findings reported below are without any exclusions. a comparison of the target article sample and the replication samples is provided in table 3. in both replication attempts, participants evaluated the two scenarios described in detail (out of the eight total scenarios; six were not described) in royzman and baron’s (2002) study 2, assessing participants’ preference for indirect harm. the following was the organ transplant scenario (scenario 1 in the target article): “x is in charge of a computer database controlling the distribution of available organ transplants.
the first person in line for a difficult-to-get liver transplant is mr. y. mr. y was an alcoholic and his drinking ruined his liver. y no longer drinks. the rules say that past alcohol use should not be considered, but x still thinks that y should not get priority, so he decides to break the rules and prevent y from getting the next liver. he can do this in two ways:
• [direct] x can lower y’s priority score by 20 points.
• [indirect] x can raise everyone else’s priority score by 20 points.”
the following was the zoo scenario (scenario 2 in the target article): “a zoo has been created to conserve 200 species of wild animal that have become extinct elsewhere. the zoo is now threatened with a parasitic disease that infects the animals. x, the zookeeper, has two options:
• [direct] painlessly poison the animals in which the parasite reproduces, thus saving the other animals. five species will become extinct.
• [indirect] poison the parasites. the same poison will cause five animal species to become extinct.
in both cases, x is sure that he will save most of the species and lose five. the five lost species are of equal value in both cases.”

measures

morality. after each of the two scenarios, participants were asked which of the two options was morally worse (1 = a is much more wrong; 2 = a is a little more wrong; 3 = equal; 4 = b is a little more wrong; 5 = b is much more wrong; note that higher scores indicate higher morality for the indirect option).

reasons: considerations for morality evaluations.
participants compared the two options in each scenario on five factors: directness, intentionality, appearance, and action-omission on a five-point scale (1 = factor is more applicable to the direct harm option, thus the option is more immoral; 2 = direct harm option is not more immoral even though the factor is more applicable to it; 3 = factor is equally applicable to both options [equal morality]; 4 = factor is more applicable to the indirect harm option, thus the option is more immoral; 5 = indirect harm option is not more immoral even though the factor is more applicable to it), and probability of harm on a three-point scale (1 = more likely to cause harm in a than in b; 2 = equally likely to cause harm in a and b; 3 = more likely to cause harm in b than in a). measures are reported in full in the supplementary materials.

replications evaluation

we aimed to compare the replication effects with the original effects in the target article (d = 0.70, 95% ci [0.40, 0.99]) using two methods: (1) we categorized the comparison of effects using the criteria set by lebel, vanpaemel, cheung, and campbell (2019), and (2) we conducted equivalence testing using the toster module (lakens, scheel, & isager, 2018). figures summarizing these criteria are available in the supplementary materials. table 2 provides a classification of the replications using the lebel, mccarthy, earp, elson, and vanpaemel (2018) criteria. we summarize the two replications as "very close replications".

table 2
classification of replications based on lebel et al. (2018)

design facet                hong kong replication   mturk replication
iv operationalization       same                    same
dv operationalization       same                    same
iv stimuli                  same                    same
dv stimuli                  same                    same
procedural details          different               different
physical settings           same                    same
contextual variables        different               different
replication classification  very close replication  very close replication

note.
information on this classification is provided in lebel et al. (2018); see also the figure provided in the supplementary materials.

table 3
differences and similarities between the original study and replication attempts

                     royzman & baron 2002                         hong kong undergraduate students  american mturk workers
sample size          54                                           46                                314
geographic origin    us american                                  hong kong sar                     us american
gender               17 males, 37 females                         15 males, 31 females              173 males, 141 females
median age (years)   34                                           20                                34
average age (years)  not reported                                 20.2                              36.8
age range (years)    17-69                                        19-22                             21-71
medium (location)    computer (online)                            computer (online)                 computer (online)
compensation         $3 (only after participant written request)  none (volunteers)                 nominal payment
year                 not reported (during or before 2002)         2018                              2018

figure 1. plots for the morality ratings prior to categorizing. the two plots in the first row are for the hong kong sample and the two plots in the second row are for the american sample. the first of each plot pair is for the organ scenario, and the second is for the zoo scenario. the scale is from 1 to 5, with 3 representing the mid-point. higher values indicate higher morality ratings for the indirect option.

results

preference for indirect harm

violin plots for the raw morality ratings (on a scale from 1 to 5) are provided in figure 1. across the two scenarios in the two experiments, the ratings were higher than the midpoint neutrality rating of 3. for the analyses, we followed the method set by royzman and baron (2002) and recoded the morality ratings as 0 for the indifference point (3 = equal), 1 for the direct action being more wrong (1 = a is much more wrong; 2 = a is a little more wrong), and -1 for the indirect action being more wrong (4 = b is a little more wrong; 5 = b is much more wrong).
we then ran a series of one-sample, one-sided t-tests comparing to the converted midpoint of 0, followed by dependent t-tests comparing the organ and zoo scenarios in each sample (two-sided), and finally equivalence testing comparing to the effects of the target article’s original findings. note that this strategy, albeit not ideal given the low number of response categories (three) and the grouping of responses, was the one used by the original authors. we therefore complemented these analyses with non-parametric testing (wilcoxon’s signed-rank test) for the one-sample tests against the scale midpoint and for the comparison between scenarios (alpha = .05). the effect size comparisons should, however, be interpreted with caution, since the original effect size was obtained from data across eight scenarios (we did not have access to the remaining six scenarios, the original data, or the effect sizes per scenario). the findings are summarized in table 4. the findings were consistent across the two replication attempts, with similar point estimates and overlapping 95% confidence intervals. the effects in both replications were in the same direction and supported the original study’s findings, but with weaker effects.
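the recoding and the one-sample test against 0 described above can be sketched as follows (python used for illustration; the ratings vector is hypothetical, not the actual response data):

```python
import math

# recode 1-5 morality ratings as in royzman and baron (2002):
# 1-2 (direct more wrong) -> +1, 3 (equal) -> 0, 4-5 (indirect more wrong) -> -1.
def recode(rating):
    return {1: 1, 2: 1, 3: 0, 4: -1, 5: -1}[rating]

# one-sample t statistic against 0 and cohen's d, computed by hand.
def one_sample_t(scores):
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    t = mean / (sd / math.sqrt(n))
    d = mean / sd
    return t, d

ratings = [1, 2, 2, 3, 3, 4, 2, 1, 3, 2]   # hypothetical ratings, illustration only
scores = [recode(r) for r in ratings]
t, d = one_sample_t(scores)   # positive t indicates a preference for indirect harm
```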
table 4
preference for indirect harm findings summary: morality ratings, one-sample t-tests

                         m    sd   statistic       p       d     cil   cih   interpretation
original (n = 54)
combined effect                    t(53) = 5.12    < .001  0.70  0.40  0.99  baseline
hong kong (n = 46)
hk organ                 .33  .60  t(45) = 3.70    < .001  0.55  0.23  0.86  signal; consistent
                                   w = 198         < .001
hk zoo                   .33  .79  t(45) = 2.80    = .004  0.41  0.11  0.71  signal; consistent
                                   w = 408         = .005
comparison                         t(45) = 0.00    = 1.00  0.00  0.00  0.00  no evidence for difference
                                   w = 139.5       = .97
equivalence hk organ               t(45) = -1.03   = .154                    similar effect
equivalence hk zoo                 t(45) = -1.93   = .03                     weaker effect
mturk (n = 314)
mturk organ              .15  .65  t(313) = 4.16   < .001  0.24  0.12  0.34  signal; inconsistent, positive (weaker)
                                   w = 6627        < .001
mturk zoo                .26  .73  t(313) = 6.31   < .001  0.36  0.24  0.47  signal; consistent
                                   w = 12988       < .001
comparison                         t(313) = -2.03  = .040  0.11  .003  .21   weak to no differences
                                   w = 6624.5      = .066
equivalence mturk organ            t(313) = -8.19  < .001                    weaker effect
equivalence mturk zoo              t(313) = -6.05  < .001                    weaker effect

note. categorized morality scores are -1 to 1, with 0 as the mid-point. higher values indicate higher morality ratings for the indirect option. the tests are one-sample t-tests comparing to 0. comparisons are one-sided paired t-tests (alpha = .05) comparing the organ and zoo scenarios within that sample. the equivalence rows are toster equivalence test analyses comparing to the effect size found in the original findings of the target article. the interpretation column is according to the criteria set by lebel et al. (2019) or equivalence testing (lakens et al., 2018). “w” indicates the w statistic of wilcoxon’s signed-rank non-parametric test.
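the equivalence tests in table 4 can be roughly approximated by treating the original effect size as the bound and shifting the one-sample t statistic accordingly, t ≈ (d_observed − d_bound)·√n. a sketch under stated assumptions (the article used the toster r package; this normal-shift shortcut and the rounded d values are my approximations, not the authors' exact procedure):

```python
import math

# rough one-sided "weaker than original" test: use the original effect
# size as the equivalence bound and shift the one-sample t statistic.
# this is an approximation of what equivalence software computes exactly.
def weaker_than_bound_t(d_observed, d_bound, n):
    return (d_observed - d_bound) * math.sqrt(n)

t_hk_organ = weaker_than_bound_t(0.55, 0.70, 46)      # ~ -1.0, not clearly weaker
t_mturk_organ = weaker_than_bound_t(0.24, 0.70, 314)  # strongly negative: clearly weaker
```

with rounded inputs these values track the reported t(45) = -1.03 and t(313) = -8.19 reasonably well, which is the point of the sketch: the large mturk sample makes the "weaker than the original effect" conclusion unambiguous.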
reasons: considerations for morality evaluations

we followed the procedure in the target article to test the reasons for morality evaluations and the preference for indirect harm effect by examining correlations between ratings of morality and the considerations of directness, appearance, omission, and intent. ratings were coded as being more applicable to the direct option, the indirect option, or neither, and then as either being a reason or not for the morality ratings. the findings are summarized in table 5. we found support for the original study findings, with medium to strong correlations (hong kong organ: r = .29 to .71; hong kong zoo: r = .32 to .90; mturk organ: r = .36 to .56; mturk zoo: r = .49 to .63) between each factor and morality ratings when the factor was indicated as a reason, and much weaker correlations, of which half were negative, contrary to predictions (hong kong organ: r = -.11 to .26; hong kong zoo: r = -.20 to .22; mturk organ: r = .10 to .29; mturk zoo: r = .13 to .29), when the factor was not indicated as a reason. probability ratings were all positive and ranged from r = .14 to .50 across the samples and scenarios. royzman and baron (2002) further added an indication to better contextualize the psychological mechanisms underlying the preference for indirect harm. they classified answers to the considerations into two categories, 'predicted' and 'opposite'. 'predicted' represented the share of responders indicating that the direct action was more wrong (thus indicating a preference for indirect harm, in line with predictions); 'opposite' represented the share of responders indicating that the indirect action was more wrong (thus indicating a preference for direct harm, contrary to predictions). royzman and baron (2002) further classified these answers based on whether participants found that the specific consideration was a reason for moral judgment (indicated in table 5 as 'reason') or not (indicated in table 5 as 'not a reason', except for probability). the researchers found that, in general, when indicating that the specific consideration was a reason for moral judgment, more participants showed a preference for indirect harm and indicated the direct action as more wrong (ranging from 15% to 16.9%), whereas fewer participants indicated that the indirect action was more wrong (ranging from 2.8% to 3.9%). similarly, when indicating that the specific consideration was not a reason for moral judgment, more participants showed a preference for indirect harm and indicated the direct action as more wrong (ranging from 17.4% to 32.4%), whereas fewer participants indicated that the indirect action was more wrong (ranging from 5.1% to 6.5%). in the replications we conducted, findings were broadly in line with the results of royzman and baron (2002). when indicating that the specific consideration was a reason for moral judgment, more participants showed a preference for indirect harm and indicated that the direct action was more wrong (ranging from 17.6% to 66.7%), whereas fewer participants indicated that the indirect action was more wrong (ranging from 0% to 12.3%). similarly, when indicating that the specific consideration was not a reason for moral judgment, more participants showed a preference for indirect harm and indicated that the direct action was more wrong (ranging from 13.4% to 51.6%), whereas fewer participants indicated that the indirect action was more wrong (ranging from 6.5% to 29%).
overall, in all cases the proportion of participants who indicated that the direct action was more wrong (thus indicating a preference for indirect harm) was larger than the proportion indicating that the indirect action was more wrong, irrespective of whether they considered the specific consideration to be a reason for moral judgment or not. overall, in both replication attempts, we successfully replicated the correlational evidence that royzman and baron (2002) presented when investigating potential factors underlying the preference for indirect harm (probability of harm, intent, appearance, omission, and directness). this suggests that all of these factors likely play a part as psychological underpinnings of the preference for indirect harm. however, the evidence presented is correlational and only shows a statistical association rather than a neat cause-effect path. further research may experimentally investigate the causality of these associations, for example by manipulating intent or appearance within direct and indirect harm. this is especially interesting in light of the literature showing that moral judgment in general, and the preference for indirect harm in particular, are intuitive processes for which people are unable to quickly provide a justification, or to explain why they prefer indirect harm to direct harm (cushman et al., 2006; paharia et al., 2009).

mini meta-analysis effect summary

we summarized the findings of the two replication studies together with the original findings of the target article using mini meta-analyses for each of the scenarios to assess the overall effect size (goh, hall, & rosenthal, 2016; lakens & etz, 2017; see plots in figure 2). the overall effect for the organ scenario was d = 0.47, 95% ci [0.18, 0.75], and for the zoo scenario d = 0.46 [0.26, 0.65]. we conclude that the two scenarios had comparable weak-to-medium effects that are different from null.
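a mini meta-analysis of this kind combines the per-study cohen's d values into one weighted estimate. the sketch below uses simple sample-size weighting, one of the approaches discussed by goh et al. (2016); the authors' exact weighting and ci computation may differ, so this sketch is not expected to reproduce the d = 0.47 reported above. the (d, n) pairs are the organ-scenario figures from table 4:

```python
def mini_meta_d(effects):
    """sample-size-weighted average of cohen's d values.
    effects: list of (d, n) pairs, one per study."""
    total_n = sum(n for _, n in effects)
    return sum(d * n for d, n in effects) / total_n

# organ scenario: original (d = 0.70, n = 54), hong kong (0.55, 46), mturk (0.24, 314)
d_meta = mini_meta_d([(0.70, 54), (0.55, 46), (0.24, 314)])
# d_meta lies between the smallest and largest input d (here ≈ 0.33)
```

inverse-variance weighting, as used in standard fixed-effect meta-analysis, would give somewhat different weights to the three studies.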
figure 2. mini meta-analysis effect size estimates (cohen's d) and 95% confidence intervals (cis) around effect size estimates for the original study and the two replication attempts in the two scenarios (panels: organ donor scenario; zoo scenario).

table 5
reasons for morality and the preference for indirect harm effect: frequencies and correlations
(rows list predicted (indirect) / opposite (direct) / morality r)

hong kong undergraduate sample, organ donor scenario:
probability: 28.3% / 0% / r = .144 [-0.15, 0.41]
directness, reason: 50.0% / 7.1% / r = .631 [0.34, 0.81]***
directness, not a reason: 43.3% / 16.7% / r = -.058 [-0.41, 0.31]
appearance, reason: 48.1% / 0% / r = .291 [-0.10, 0.60]
appearance, not a reason: 45.5% / 12.1% / r = -.110 [-0.44, 0.24]
omission, reason: 24.3% / 5.4% / r = .714 [0.51, 0.84]***
omission, not a reason: 17.1% / 8.6% / r = .262 [-0.08, 0.55]
intent, reason: 31.4% / 2.9% / r = .685 [0.46, 0.83]***
intent, not a reason: 17.6% / 14.7% / r = .080 [-0.27, 0.41]

american mturk sample, organ donor scenario:
probability: 18.8% / 10.2% / r = .331 [0.23, 0.43]***
directness, reason: 29.4% / 7.1% / r = .555 [0.45, 0.64]***
directness, not a reason: 32.9% / 10.5% / r = .101 [-0.03, 0.23]
appearance, reason: 24.4% / 8.8% / r = .400 [0.28, 0.51]***
appearance, not a reason: 27.3% / 12.8% / r = .293 [0.17, 0.40]***
omission, reason: 21.1% / 7.7% / r = .424 [0.32, 0.52]***
omission, not a reason: 18.5% / 9.5% / r = .228 [0.11, 0.34]***
intent, reason: 21.1% / 5.1% / r = .363 [0.25, 0.47]***
intent, not a reason: 17.0% / 6.5% / r = .166 [0.04, 0.28]***

hong kong undergraduate sample, zoo scenario:
probability: 26.1% / 15.2% / r = .499 [0.24, 0.69]***
directness, reason: 66.7% / 4.8% / r = .532 [0.13, 0.78]***
directness, not a reason: 51.6% / 29.0% / r = -.029 [-0.38, 0.33]
appearance, reason: 58.3% / 4.2% / r = .894 [0.77, 0.95]***
appearance, not a reason: 41.9% / 29.0% / r = .191 [-0.18, 0.51]
omission, reason: 28.1% / 9.4% / r = .707 [0.48, 0.85]***
omission, not a reason: 29.4% / 11.8% / r = .220 [-0.13, 0.52]
intent, reason: 17.6% / 11.8% / r = .324 [-0.02, 0.60]
intent, not a reason: 22.2% / 11.1% / r = -.199 [-0.50, 0.14]

american mturk sample, zoo scenario:
probability: 13.4% / 10.8% / r = .309 [0.21, 0.41]***
directness, reason: 47.3% / 12.2% / r = .612 [0.52, 0.69]***
directness, not a reason: 43.2% / 13.5% / r = .191 [0.05, 0.32]***
appearance, reason: 38.7% / 12.3% / r = .630 [0.54, 0.71]***
appearance, not a reason: 41.9% / 10.5% / r = .290 [0.16, 0.41]***
omission, reason: 30.0% / 10.9% / r = .485 [0.38, 0.58]***
omission, not a reason: 28.2% / 10.0% / r = .171 [0.04, 0.30]*
intent, reason: 31.7% / 11.1% / r = .511 [0.41, 0.60]***
intent, not a reason: 24.3% / 9.5% / r = .132 [0.00, 0.26]

note. correlation (pearson's r) between morality and considerations, according to whether or not the consideration was cited as a reason for a moral judgment. 'predicted' indicates the share of responders indicating that the direct action was more wrong; 'opposite' indicates the share of responders indicating that the indirect action was more wrong. * p < .05; ** p < .01; *** p < .001.

general discussion

we successfully replicated the findings from royzman and baron (2002) study 2 with a non-naive undergraduate sample from hong kong and an american mturk sample. these results provide empirical support for the preference for indirect harm phenomenon: people tend to prefer indirect harm over direct harm. we summarize the replications as "signal and consistent" according to lebel et al.'s (2019) replication success criteria, yet we note that equivalence tests indicated overall weaker effects compared to the target article findings. mini meta-analyses of the replications and the original findings indicated weak to medium effects that are different from null. what may explain the weaker effects? sample and time are the typical suspects. the royzman and baron (2002) study was conducted using an internet sample, resembling the mturk sample in the replications, although mturk workers are likely more experienced in participating in online studies (chandler, mueller, & paolacci, 2014). compared to the original sample, the hong kong sample was of a different cultural and linguistic background and had a much higher familiarity with heuristics and biases.
we believe, however, that both the sample and the passing of time are limited explanations, given that our other judgment and decision-making replications with similar samples have shown high consistency between these two samples and the original findings (e.g., chandrashekar et al., 2020; chen et al., 2020). we cannot, however, rule out any possibility with confidence, and the many differences between the original study and our replications make it difficult to determine the cause. a possible future direction is to conduct a meta-analysis on the literature testing for moderators. our findings suggest that the classic phenomenon is replicable, yet that we may need to update our expectations regarding effect size. replications are especially useful in this regard. researchers can now use the replications' effect sizes as updated and more conservative estimates of the effect when designing their follow-up studies.

author contact

ignazio ziano, ignazio.ziano@grenoble-em.com, orcid.org/0000-0002-4957-3614
yu jie wang, u3529917@connect.hku.hk
sydney susanto sany, sydneyssany@yahoo.com
long ho ngai, sngai717@connect.hku.hk
yuk kwan lau, tonilau@connect.hku.hk
iban kaur bhattal, iban03@connect.hku.hk
pui sin keung, u3534402@connect.hku.hk
yan to wong, norawyt@connect.hku.hk
wing zhang tong, u3544235@connect.hku.hk
bo ley cheng, boleystudies@gmail.com
hill yan cedar chan, cedar@hku.hk
gilad feldman (corresponding author), gfeldman@hku.hk, orcid.org/0000-0003-2812-6599

conflict of interest and funding

the authors declared no potential conflicts of interest with respect to the authorship and/or publication of this article. this research was supported by the european association for social psychology seedcorn grant.

author contributions

gilad feldman (corresponding author, gf from now on and in the table below) was the course instructor for the fundamentals and advanced social psychology courses (psyc2020/3052) and led the two reported replication efforts in those courses.
gf supervised each step in the project, conducted the pre-registrations, and ran the data collection. ignazio ziano (joint first author, iz from now on and in the table below) integrated the two replication efforts into a manuscript, with validation and further extensions of the statistical analyses. gf and iz jointly finalized the manuscript for submission. yu jie wang and sydney susanto sany (joint first authors) conducted the us replication as part of the advanced social psychology course (identified as students psyc3052 in the table below). long ho ngai, yuk kwan lau, iban kaur bhattal, pui sin keung, yan to wong, and wing zhang tong conducted the hong kong replication as part of the fundamentals of social psychology course (joint fourth authors; identified as students psyc2020 in the table below). bo ley cheng (teaching assistant; included in the "tas" column in the table below) guided and assisted the replication effort in the psyc3052 course. hill yan cedar chan (teaching assistant; included in the "tas" column in the table below) guided and assisted the replication effort in the psyc2020 course.

contributor roles taxonomy

in the table below, we employ credit (contributor roles taxonomy) to identify the contributions and roles played by the contributors in the current replication effort.
please refer to the url (https://www.casrai.org/credit.html) for details and definitions of each of the roles listed below.

role (contributor columns: iz, gf, students psyc2020, students psyc3052, tas)
conceptualization: x
pre-registrations: x x x
data curation: x
formal analysis: x x x x
funding acquisition: x
investigation: x x x
methodology: x x
pre-registration peer review / verification: x x x x
data analysis peer review / verification: x x x
project administration: x x
resources: x
software: x x x x
supervision: x
validation: x x
visualization: x
writing - original draft: x x
writing - review and editing: x x

open science practices

this article earned the preregistration+, open data, and open materials badges for preregistering the hypothesis and analysis before data collection, and for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

bennett, j. (1966). 'whatever the consequences'. analysis, 26, 83–102.
chandrashekar, s. p., yeung, s., yau, k., feldman, g., ... (2020). agency and self-other asymmetries in perceived bias and shortcomings: replications of the bias blind spot and extensions linking to free will beliefs. manuscript under review. doi: 10.13140/rg.2.2.19878.16961. retrieved december 2019 from https://www.researchgate.net/publication/331431431_agency_and_selfother_asymmetries_in_perceived_bias_and_shortcomings_replications_of_the_bias_blind_spot_and_extensions_linking_to_free_will_beliefs
chen, j., hui, l. s., yu, t., feldman, g., zeng, s. v., ching, t. l., ... & cheng, b. l. (2020). foregone opportunities and choosing not to act: replications of inaction inertia effect. social psychological and personality science. manuscript accepted for publication. retrieved december 2019 from: https://www.researchgate.net/publication/332550110_foregone_opportunities_and_choosing_not_to_act_replications_of_inaction_inertia_effect
cushman, f., young, l., & hauser, m. (2006). the role of conscious reasoning and intuition in moral judgment: testing three principles of harm. psychological science, 17, 1082–1089.
goh, j. x., hall, j. a., & rosenthal, r. (2016). mini meta-analysis of your own studies: some arguments on why and a primer on how. social and personality psychology compass, 10, 535–549.
haidt, j., & hersh, m. a. (2001). sexual morality: the cultures and emotions of conservatives and liberals. journal of applied social psychology, 31, 191–221.
hauser, m., cushman, f., young, l., kang-xing jin, r., & mikhail, j. (2007). a dissociation between moral judgments and justifications. mind & language, 22, 1–21.
jamison, j., yay, t., & feldman, g. (2020). action-inaction asymmetries in moral scenarios: replication of the omission bias examining morality and blame with extensions linking to causality, intent, and regret. manuscript under review. retrieved december 2019 from https://www.researchgate.net/publication/326260685_action-inaction_asymmetries_in_moral_scenarios_replication_of_the_omission_bias_examining_morality_and_blame_with_extensions_linking_to_causality_intent_and_regret
lakens, d., & etz, a. j. (2017). too true to be bad: when sets of studies with significant and nonsignificant findings are probably true. social psychological and personality science, 8, 875–881.
lakens, d., scheel, a. m., & isager, p. m. (2018). equivalence testing for psychological research: a tutorial. advances in methods and practices in psychological science, 1, 259–269.
lebel, e. p., mccarthy, r. j., earp, b. d., elson, m., & vanpaemel, w. (2018). a unified framework to quantify the credibility of scientific findings. advances in methods and practices in psychological science, 1, 389–402.
lebel, e. p., vanpaemel, w., cheung, i., & campbell, l. (2019). a brief guide to evaluate replications. meta psychology, 541, 1–17. https://doi.org/10.31219/osf.io/paxyn
litman, l., robinson, j., & abberbock, t. (2017). turkprime.com: a versatile crowdsourcing data acquisition platform for the behavioral sciences. behavior research methods, 49, 433–442.
milgram, s. (1974). obedience to authority: an experimental view. new york: harper.
paharia, n., kassam, k. s., greene, j. d., & bazerman, m. h. (2009). dirty work, clean hands: the moral psychology of indirect agency. organizational behavior and human decision processes, 109, 134–141.
paolacci, g., & chandler, j. (2014). inside the turk: understanding mechanical turk as a participant pool. current directions in psychological science, 23, 184–188.
royzman, e. b., & baron, j. (2002). the preference for indirect harm. social justice research, 15, 165–184.
spranca, m., minsk, e., & baron, j. (1991). omission and commission in judgment and choice.
journal of experimental social psychology, 27, 7

meta-psychology, 2021, vol 5, mp.2020.2474, https://doi.org/10.15626/mp.2020.2474
article type: replication report
published under the cc-by4.0 license
open data: yes. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: yes.
edited by: rickard carlsson
reviewed by: streamlined peer review
analysis reproduced by: alexey guzey
all supplementary files can be accessed at the osf project page: https://doi.org/10.17605/osf.io/dynvt

frequency estimation and semantic ambiguity do not eliminate conjunction bias, when it occurs: replication and extension of mellers, hertwig, and kahneman (2001)

subramanya prasad chandrashekar (joint first author), lee shau kee school of business and administration, hong kong metropolitan university, hong kong sar
bo ley cheng, department of psychology, university of hong kong, hong kong sar
yat hin cheng, chi long fong, ying chit leung, yui tung wong (joint first authors), department of psychology, university of hong kong, hong kong sar
gilad feldman (corresponding author), department of psychology, university of hong kong, hong kong sar

abstract: mellers, hertwig, and kahneman (2001) conducted an adversarial collaboration to try and resolve hertwig's contested view that frequency formats eliminate conjunction effects, and that conjunction effects are largely due to semantic ambiguity. we conducted a preregistered, well-powered, very close replication (n = 1032), testing two personality profiles (linda and james) in a four-condition between-subjects design comparing likely and unlikely items with "and" and "and are" conjunctions. findings for the linda profile supported the conjunction effect and were consistent with tversky and kahneman's (1983) arguments for a representativeness heuristic. we found no support for semantic ambiguity. findings for the james profile were a likely failed replication, with no conjunction effect.
we provided additional tests addressing possible reasons, in line with later literature suggesting conjunction effects may be context-sensitive. we discuss implications for research on the conjunction effect, and call for further well-powered pre-registered replications and extensions of classic findings in judgment and decision-making.

keywords: conjunction effect, frequency estimation, replication, linda problem, judgment and decision making

the conjunction fallacy is one of the most well-known judgment errors in the judgment and decision making (jdm) literature. the fallacy consists of judging the conjunction of two events as more likely than either of the two individual events, violating one of the most fundamental tenets of probability theory: the probability of a conjunction of two events can never be higher than the probability of either of the two individual events. kahneman and colleagues initially reported the conjunction effect as a bias, and that resulted in an intense debate in the academic community (e.g., fiedler, 1988; gigerenzer, 1996, 2005; hertwig & chase, 1998; hertwig & gigerenzer, 1999). one view opposing the conjunction effect as a bias came from hertwig and colleagues, who argued that the conjunction effect is not a fallacy at all, demonstrating that the effect arises from semantic ambiguity: participants' understanding of natural-language words such as "probability" and "and" diverged from that of the experimenters (e.g., hertwig & gigerenzer, 1999). daniel kahneman and ralph hertwig engaged in an adversarial collaboration for which barbara mellers served as arbiter. together they examined the potential semantic ambiguity of the "and" conjunction to try to explain the conjunction effect reported in kahneman and tversky's (1996) study. the article has been influential, with over 430 citations according to google scholar at the time of writing.
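the constraint the fallacy violates, p(a and b) ≤ min(p(a), p(b)), is easy to state in the frequency format used in this line of work: an estimate for the conjunction that exceeds the estimate for either conjunct alone is incoherent. a small illustrative python sketch (the function name and example numbers are ours, not from the article):

```python
def violates_conjunction_rule(freq_a, freq_b, freq_a_and_b):
    """true if a 'how many of 100 people' estimate for the conjunction
    exceeds the estimate for either single event, which probability
    theory rules out."""
    return freq_a_and_b > min(freq_a, freq_b)

# hypothetical estimates for 100 people like linda:
# 60 feminists, 10 bank tellers, 25 'bank tellers and feminists'
violates_conjunction_rule(60, 10, 25)  # True: 25 > 10
```

the same check applies regardless of whether responses are probabilities or frequencies, which is why frequency formats alone cannot make a conjunction estimate above the unlikely-item estimate coherent.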
chosen study for replication: outline of mellers et al. (2001)

mellers et al. (2001) examined frequency estimates of personality sketches. they tested two personality sketches across three experiments, one about linda and the other about james. for example, the linda story read as follows:

linda is 31 years old, single, outspoken, and very bright. she majored in philosophy. as a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

participants read the scenario and estimated how many of 100 people like linda fit a particular target description. the target descriptions varied between experimental conditions: likely (feminists), unlikely (bank tellers), semantic "and" (bank tellers and feminists), and semantic "and are" (bank tellers and are feminists). kahneman argued that the conjunction effect would occur even when frequency estimation was used, reflected in the average frequency estimates of the conjunction conditions "and" and "and are" being higher than those of the unlikely-item condition. hertwig proposed that the conjunction phrase "bank tellers and are feminists" would not yield support for conjunction effects. the results for the linda scenario supported kahneman's prediction in two out of the three experiments conducted as part of the adversarial collaboration, whereas for the james scenario just one experiment supported the prediction. we summarize the findings of the original article in table 1. the divergence of findings reported across the three experiments made it hard for readers to assess the overall effect size, and we therefore conducted a mini meta-analysis summary of their effects across experiments, summarized in table 2.

the need for replication

since the first demonstration of the conjunction effect, there have been attempts to develop a theory to explain the phenomenon. semantic ambiguity remains the strongest counterargument to the demonstration of conjunction effects.
with the recent growing recognition of the importance of reproducibility and replicability in psychological science (e.g., brandt et al., 2014; open science collaboration, 2015; van't veer & giner-sorolla, 2016; zwaan, etz, lucas, & donnellan, 2018), we felt it was important to establish the replicability of the findings reported in mellers et al. (2001). we therefore embarked on a well-powered, preregistered, very close replication of mellers et al. (2001), employing current psychological science methods that allow us to test for both the presence and the possible absence of an effect.

present investigation

we had several goals. first, we set out to revisit the original experimental design and assess the replicability of the original findings. with power analyses and higher power, we aimed to detect weak effects that the original study may not have been able to detect. second, we complemented the traditional analyses in the original article with equivalence tests and bayesian analyses, to also allow for quantifying evidence in support of the null hypothesis. third, we added extensions to further examine lay perceptions of the provided statistical information that may explain some of the differences found in the original findings.

table 1
summary of findings in mellers et al. (2001) experiments 1 to 3 and the replication
note. exp1/exp2/exp3 = experiments 1, 2, and 3. standard errors are in parentheses. boldface indicates significant results, p < .05.

table 2
summary of findings of the original study versus the replication
note. the linda story can be concluded as a successful replication. the james replication is a likely failed replication. in addition, there was no support found for semantic ambiguity (comparing "and" and "and are").
in the original article, effect sizes (es) were not reported; we computed cohen's d and confidence intervals based on the mean estimates and standard errors of the mean estimates of the outcome variables of the original study (see full tables in the supplementary). the effect sizes of the original study presented in the table are based on the mini meta-analysis of experiments 1, 2, and 3 of mellers et al. (2001), as that is the closest direct comparison for the replication summary. the replication summary is based directly on the lebel et al. (2019) categories; see details in "evaluation criteria for replication design and findings".

table 1 data (frequency estimates, mean (standard error)):
linda story:
likely target (feminists): exp1 58.1 (2.4); exp2 47.7 (3.4); exp3 47.9 (4.5); replication 58.43 (1.79)
unlikely target (bank tellers): exp1 24.6 (1.9); exp2 21.4 (2.0); exp3 14.3 (2.9); replication 9.87 (0.88)
"and": exp1 39.9 (2.0); exp2 30.4 (2.3); exp3 26.4 (3.9); replication 18.8 (1.36)
"and are": exp1 40.2 (2.7); exp2 21.8 (2.1); exp3 22.8 (2.7); replication 19.55 (1.48)
james story:
likely target (artists): exp1 41.0 (2.7); exp2 45.1 (2.6); exp3 47.1 (3.3); replication 36.2 (1.62)
unlikely target (republicans): exp1 28.9 (2.1); exp2 19.8 (1.8); exp3 12.7 (2.6); replication 18.38 (1.18)
"and": exp1 33.1 (1.8); exp2 42.7 (2.4); exp3 22.9 (3.4); replication 15.19 (1.15)
"and are": exp1 32.0 (2.5); exp2 20.0 (1.9); exp3 21.4 (2.7); replication 15.55 (1.09)

table 2 data (original results vs. replication):
linda story:
"and" vs. unlikely target: original d = 0.59 [0.36, 0.82]; replication t(431.26) = 5.51, p < .001, d = 0.49 [0.31, 0.67]; signal, consistent
"and are" vs. unlikely target: original d = 0.38 [-0.02, 0.77]; replication t(419.21) = 5.63, p < .001, d = 0.50 [0.32, 0.67]; signal, consistent
"and" vs. "and are": original d = 0.18 [-0.09, 0.45]; replication t(505.55) = -0.37, p = .646, d = -0.03 [-0.21, 0.14]; no signal, inconsistent (opposite)
james story:
"and" vs. unlikely target: original d = 0.62 [0.08, 1.15]; replication t(507.82) = -1.93, p = .973, d = -0.17 [-0.35, 0.00]; signal, inconsistent (opposite)
"and are" vs. unlikely target: original d = 0.17 [-0.07, 0.41]; replication t(510.69) = -1.76, p = .960, d = -0.15 [-0.33, 0.02]; no signal, inconsistent (opposite)
"and" vs. "and are": original d = 0.41 [-0.26, 1.08]; replication t(506.05) = -0.23, p = .591, d = -0.02 [-0.19, 0.15]; no signal, inconsistent (opposite)

context: large replication effort of judgment and decision-making findings

the current replication was part of a large-scale pre-registered replication project aiming to revisit well-known research findings in the area of judgment and decision making (jdm) and to examine the reproducibility and replicability of these findings. in this project, all replications are conducted by students in undergraduate courses and guided undergraduate and master's theses at the university of hong kong psychology department. four students in two separate courses were randomly assigned to the current replication. working independently, the students conducted an in-depth analysis of the target article, wrote pre-registrations with power analyses, conducted data analysis on the collected data, and then wrote manuscripts for journal submission. in each student pair, students conducted peer review on one another's work to optimize design and analysis. a teaching assistant (6th author) and the corresponding author supervised and gave feedback at each step of the replication process. the corresponding author conducted all pre-registrations on the osf and the online data collection. more information on the process is provided in the supplementary, and further details and updates on this project can be found at: https://osf.io/5z4a8/ (core, 2020).

method

pre-registration, power analysis, and open science

we pre-registered the experiment on the open science framework (osf), and data collection was launched later that week. the pre-registration with power analyses and all materials used in the study are available in the supplementary materials. all measures, manipulations, and exclusions are reported, and data collection was completed before analyses. osf pre-registration review link for the study: https://osf.io/gb7pk.
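since the original article reported only means and standard errors, cohen's d for a two-condition comparison had to be recovered from summary statistics. a minimal python sketch of that conversion (standard library only); the per-condition n of 258 below is our assumption of roughly equal allocation of the 1032 participants across the four conditions, not a figure reported in the article:

```python
import math

def cohens_d_from_summaries(m1, se1, n1, m2, se2, n2):
    """two-sample cohen's d from group means and standard errors.
    each group's sd is recovered as se * sqrt(n), then pooled."""
    sd1, sd2 = se1 * math.sqrt(n1), se2 * math.sqrt(n2)
    pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled

# linda story, replication: "and" condition vs. unlikely target (table 1 values)
d = cohens_d_from_summaries(18.8, 1.36, 258, 9.87, 0.88, 258)  # ≈ 0.49, as in table 2
```

that the sketch lands on the reported d = 0.49 under the equal-allocation assumption suggests the conversion is of this general form, though the authors' exact computation may differ in detail.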
data and r/rmarkdown code (r core team, 2015) are available on the osf: https://osf.io/6v8e2/. full open-science details and disclosures are provided in the supplementary. please note that the pre-registration crowdsourcing process involved four students who worked independently to analyze the original article, document the hypotheses and tests in the original study, propose analyses for testing predictions, calculate original effects, conduct a power analysis, and propose extensions. we note the differences and similarities across the four pre-registration documents in the supplementary materials (for details see tables s12-s14), and we followed the combination of all of those in our analyses. we aimed to detect a smallest effect size of d = 0.20 at a power of 0.80 (one-tailed, comparing two conditions), despite the reported effects in the target article and original findings being much higher. this was meant to allow us the possibility of detecting effects not found in the target article for one of the two scenarios (details below).

participants

a total of 1032 participants were recruited online through american amazon mechanical turk (mturk) using the turkprime.com platform (litman, robinson, & abberbock, 2017) (mage = 38.77, sdage = 12.07; 550 females). we identified four responses to be excluded based on the exclusion criteria recorded in the pre-registration, due to their self-reported lack of seriousness or english proficiency; the exclusions had no impact on the findings, so our main report focuses on the full sample.

procedure

participants were randomly assigned to one of the four experimental conditions (likely, unlikely, "and", and "and are"). all participants read two personality profiles, one of linda and the other of james, exactly as in the original study. each profile consisted of one short description of a character and frequency estimation questions. all descriptions and questions were taken from the original article (mellers et al., 2001).
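the power target stated above (smallest effect of interest d = 0.20, power = .80, one-tailed alpha = .05, two-condition comparison) implies a per-condition sample size that can be approximated with the standard formula n ≈ 2((z_{1-alpha} + z_{power}) / d)^2. a standard-library python sketch (the function name is ours):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """approximate per-group n for a one-tailed two-sample t-test,
    via the normal approximation (slightly below the exact t-based n)."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * ((z(1 - alpha) + z(power)) / d) ** 2)

n_per_group(0.20)  # 310 per condition under this approximation
```

dedicated tools such as g*power or exact t-distribution methods give very similar, marginally larger, numbers.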
the presentation order of the two profiles was randomized. the linda profile description was as follows:

linda is 31 years old, single, outspoken, and very bright. she majored in philosophy. as a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. of 100 people like linda, how many are [likely: feminists?] [unlikely: bank tellers?] ["and": bank tellers and feminists?] ["and are": bank tellers and are feminists?]

the james profile description was as follows:

james grew up in a bohemian family. his father was a musician, and his mother was a painter. they lived together for 40 years and never got married. james was a very talented child with a special gift for comedy, but he turned into a rebellious troublemaker in his youth. he dropped out of college after two years and traveled to asia to learn crafts. james is now 35 years old. of 100 people like james, how many are [likely: artists?] [unlikely: republicans?] ["and": republicans and artists?] ["and are": republicans and are artists?]

participants answered questions based on the two scenarios, one for linda and one for james, according to their randomly assigned condition (indicated in brackets in the scenarios above). the dependent variable was the estimated frequency of the described personality in the scenario, measured on a scale from 1 to 100. the supplementary materials detail the experimental instructions, scenarios, and response variables.

extension

following the replication materials, participants proceeded to the next page and answered six additional questions. depending on their assigned condition, participants were asked to estimate the percentage of people, females, and males in the united states that match the target item (likely, unlikely, "and", "and are"), and they did so for both profiles.
For example, participants in the likely condition estimated the percentage of people, females, and males in the United States that are 1) feminists and 2) artists. We had two aims with this extension: 1) to assess whether the conjunction effect would appear for the general population without the specific descriptions of James and Linda, and 2) to examine possible gender differences in the estimations of the items used in the James and Linda descriptions.

Data analysis plan

Our analyses matched the original article's hypotheses, as follows:

Hypothesis 1: The frequency estimate for the "and" conjunction phrase will be higher than for the phrase describing the unlikely target alone.

Two sets of competing hypotheses were suggested by Hertwig and Kahneman:

Hypothesis 2a: The frequency estimate for the "and are" conjunction phrase will be higher than for the phrase describing the unlikely target alone.

Hypothesis 2b: The frequency estimate for the "and are" conjunction phrase will not be higher than for the phrase describing the unlikely target alone.

Hypothesis 3a: The frequency estimate for the "and are" conjunction phrase will be lower than the frequency estimate for the "and" conjunction phrase.

Hypothesis 3b: The frequency estimate for the "and are" conjunction phrase will not be lower than the frequency estimate for the "and" conjunction phrase.

A comparison of the three experiments in the original article and the current replication is provided in Table S4 of the supplementary materials. In Table S5, we briefly note the reasons for the chosen differences between the original studies and the replication attempt. In the replication attempt, we did not include filler items, because when filler items are present, the responses are inherently comparative and therefore drive the observed conjunction effect (Hertwig & Chase, 1998).
Supporting this view, both Study 1 and Study 3 of the original study, which included filler items, found support for the conjunction effect for both the "and" and "and are" conjunction phrases. Given the possibility of different psychological processes underlying comparative and non-comparative responses, we excluded filler items, allowing a test of the competing predictions of Kahneman and Hertwig under conditions theorized to be essentially non-comparative in nature. More importantly, given the current focus on testing the main argument, namely whether conjunction effects are driven by the semantic ambiguity of the natural-language term "and" in a frequency representation, we chose "and" and "and are" as the conjunction phrases and implemented a between-subjects design, allowing a clearer test of the competing predictions of Kahneman and Hertwig. For instance, Hertwig argued that frequency judgments are possibly driven by an understanding of "and" as a union operator, so that using the more restrictive "and are" phrase would take away the conjunction effect. Kahneman argued that judgments were driven by a match between a personality description and the prototype of a category; therefore, both the "and" and "and are" phrases would likely yield conjunction effects. Following the analyses in the target article, we first conducted one-tailed Welch's independent-samples t-tests (based on the recommendations of Delacre, Lakens, & Leys, 2017), a null-hypothesis significance testing (NHST) method. When NHST analyses were non-significant, we complemented them with equivalence testing, comparing effects against the minimal effect considered meaningful (TOSTER package; Lakens, 2017; Lakens, Scheel, & Isager, 2018), and with Bayesian analyses, quantifying support for the null hypothesis given a prior (Kruschke & Liddell, 2018; Vandekerckhove, Rouder, & Kruschke, 2018) using the BayesFactor R package (version 0.9.12-4.2; Morey & Rouder, 2015).
We made minor adjustments to the pre-registered data analysis plan, summarized in Table S6.

Evaluation criteria for replication design and findings

Table S7 provides a classification of the replication using the criteria by LeBel, McCarthy, Earp, Elson, and Vanpaemel (2018) (see Figure S2). We summarize the current replication as a "very close replication". To interpret the replication results we followed the framework by LeBel, Vanpaemel, Cheung, and Campbell (2019). They suggested evaluating a replication using three factors: (a) whether a signal was detected (i.e., the confidence interval for the replication effect size (ES) excludes zero), (b) consistency of the replication ES with the original study's ES, and (c) precision of the replication's ES estimate (see Figure S1).

Results

Descriptive statistics are detailed in Table 1, and statistical tests and effect-size findings are summarized in Table 2.

Conjunction effects

We first looked for the conjunction effect for each profile by comparing frequency estimates in the "and" and "and are" conditions with those in the "unlikely" condition. For the Linda scenario, frequency estimates in the "and" condition (n = 252, M = 18.80, SD = 21.62) were greater than in the "unlikely" condition (n = 258, M = 9.87, SD = 14.15; Md = 8.93, t(431.26) = 5.51, p < .001, ds = 0.49, 95% CI [0.31, 0.67]; see Figure 1). Similarly, frequency estimates in the "and are" condition (n = 258, M = 19.55, SD = 23.74) were greater than in the "unlikely" condition (n = 258, M = 9.87, SD = 14.15; Md = 9.69, t(419.21) = 5.63, p < .001, ds = 0.50, 95% CI [0.32, 0.67]). Thus, the results lend support to H1 and H2a in the Linda scenario. However, differences across conditions were small for the James scenario (see summary plot in Figure 1; "and" condition: n = 252, M = 15.19, SD = 18.24; "unlikely" condition: n = 258, M = 18.38, SD = 19.03; "and are" condition: n = 258, M = 15.55, SD = 17.55).
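The Welch t-tests reported here can be recomputed from the summary statistics alone. The authors used R; the following stdlib-Python sketch (function name ours) reproduces the Linda "and" versus "unlikely" contrast:

```python
from math import sqrt

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    computed from group means, SDs, and sample sizes."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Linda profile, "and" vs. "unlikely", using the summary values reported above
t, df = welch_t(18.80, 21.62, 252, 9.87, 14.15, 258)
# t ~ 5.51, df ~ 431.3, matching the reported t(431.26) = 5.51
```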
The "and" versus "unlikely" contrast (Md = −3.19, t(507.82) = −1.93, p = .973; ds = −0.17, 95% CI [−0.35, 0.00]) showed that frequency estimates in the "and" condition were lower than in the "unlikely" condition, although the difference was not statistically significant. Therefore, the results of the James scenario failed to support H1. Similarly, the contrast between the "unlikely" and "and are" conditions (Md = −2.83, t(510.69) = −1.76, p = .960; ds = −0.15, 95% CI [−0.33, 0.02]) showed that frequency estimates in the "and are" condition were lower than in the "unlikely" condition, though the weak effect was not statistically significant. In essence, the results support H2b.

Semantic ambiguity?

To examine whether the semantically ambiguous word "and" had an effect on participants' judgments, we conducted one-tailed Welch t-tests comparing frequency estimates in the "and" and "and are" conditions for each of the personality scenarios. Testing H3a's prediction, we found no support for differences for the Linda profile (Md = −0.75, t(505.55) = −0.37, p = .646, ds = −0.03, 95% CI [−0.21, 0.14]) or for the James profile (Md = −0.36, t(506.05) = −0.23, p = .591, ds = −0.02, 95% CI [−0.19, 0.15]). Next, we conducted an equivalence test of the semantic ambiguity effect. Based on Simonsohn's (2015) recommendation for replication studies, we calculated the smallest effect size of interest (SESOI) as the effect that Mellers et al.'s experiment could have detected with a power of 33%. We chose Experiment 2 as the reference for the equivalence-test analysis based on one important similarity between Experiment 2 and the current replication: both studies did not include filler items. With an n of 96 in each condition, Mellers et al. (2001) had 33% power to detect an effect size of d = 0.22. We used this as the equivalence bound for the study (SESOI set to d = 0.22). Equivalence tests for both the Linda story (t(505.55) = 2.11, p = .018) and the James story (t(506.05) = −2.25, p = .012) were significant, indicating support for the null: the observed effects were meaningfully smaller than the SESOI.

Figure 1

Linda and James profiles: violin plots for the expected frequency of the target item.

Note. Boxes represent the interquartile range of the distribution, with the notch in the middle representing the mean. The density of the violin plots represents the density of the data at each value, with wider sections indicating higher density. Note that the p-values for the contrast effects are for two-tailed tests, unlike the one-tailed tests reported in the text. Plots were generated using the ggstatsplot R package (Patil, 2018).

Furthermore, we conducted one-tailed Bayesian t-tests with the prior scale set at 0.707 and an interval of (0, ∞), such that results against the null (i.e., against mu = 0) would quantify support for the semantic ambiguity hypothesis suggested by Hertwig and colleagues. For the Linda profile, we found BF10 = 0.08 (or BF01 = 13.32), which indicates that, given the data, the null hypothesis is over 13 times more likely than the one-sided alternative. Similarly, for the James profile, BF10 = 0.08 (or BF01 = 12.06), which indicates that, given the data, the null hypothesis is over 12 times more likely than the one-sided alternative.

Additional analyses

The James profile may have been less representative of an artist than the Linda profile was of a feminist. To test this, we compared the average frequency estimations for the James and Linda stories within the "likely" experimental condition, in which participants rated the extent to which Linda and James were representative of a feminist and an artist, respectively. Frequency estimations in the "likely" condition for the Linda profile ("feminists", n = 260, M = 58.43, SD = 28.93) were greater than for the James profile ("artists", M = 36.20, SD = 26.08; Md = 22.22, t(259) = 11.99, p < .001, ds = 0.81, 95% CI [0.61, 0.88]).
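The small-telescopes bound used for the equivalence tests above (d = 0.22, the effect that n = 96 per group detects with 33% power) can be reproduced with a normal approximation. A stdlib-Python sketch (the authors used R; the function name and the approximation, which ignores the far rejection tail, are ours):

```python
from math import sqrt
from statistics import NormalDist

def d_at_power(n_per_group, power=0.33, alpha=0.05):
    """Standardized effect size d that a two-sample, two-tailed test with
    n per group detects at the given power (normal approximation)."""
    z = NormalDist()
    # noncentrality needed for this power, ignoring the opposite tail
    delta = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return delta * sqrt(2 / n_per_group)

sesoi = d_at_power(96)  # ~0.22, the SESOI / equivalence bound used above
```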
In contrast, a similar comparison between the Linda and James stories within the unlikely condition showed that the frequency estimate for Linda ("bank tellers", n = 258, M = 9.87, SD = 14.15) was lower than for James ("Republicans", M = 18.38, SD = 19.03; Md = −8.52, t(257) = −6.87, p < .001, d = −0.50, 95% CI [−0.56, −0.30]). This pattern of observed differences between Linda and James across the "likely" and "unlikely" conditions is consistent with previous work finding that the occurrence of conjunction effects depends on the probabilities of A (Linda is a bank teller) and B (Linda is active in the feminist movement). In particular, a conjunction effect is more likely when people perceive the probability of the less probable constituent, P(A), as low and P(B) as high, compared to cases where P(A) and P(B) are both low or both high (Fisk & Pidgeon, 1996; Wells, 1985). The study included additional variables that mirrored the outcome variables but asked participants to rate the percentage of males and females in the population that fit the description. For example, after reading the Linda story, participants in the "and" condition answered "Try and estimate, what percentage of females in the U.S. are bank tellers and feminists?", and after reading the James story answered "Try and estimate, what percentage of males in the U.S. are Republicans and artists?". We examined the contrasts between the outcome variables and these additional variables across experimental conditions to ascertain whether ratings on the outcome variable were driven by the profile description, rather than by Linda's name being female and, similarly, James's being male.
For the Linda story, across three experimental conditions, Linda was rated higher on the outcome variable than the percentage of females in society (likely condition: Md = 15.31, t(259) = 8.67, p < .001, d = 0.58, 95% CI [0.41, 0.67]; "and" condition: Md = 6.43, t(251) = 4.75, p < .001, d = 0.32, 95% CI [0.17, 0.43]; "and are" condition: Md = 5.79, t(257) = 3.98, p < .001, d = 0.27, CI [0.12, 0.37]). Similarly, for the James story, across conditions, James was rated higher on the outcome variable than the percentage of males in society (likely condition: Md = 19.10, t(259) = 11.15, p < .001, d = 0.87, CI [0.56, 0.83]; "and" condition: Md = 3.81, t(251) = 3.36, p = .001, d = 0.23, 95% CI [0.09, 0.34]; "and are" condition: Md = 2.58, t(257) = 2.39, p = .018, d = 0.15, 95% CI [0.03, 0.27]).

Summary of replication findings

The evaluation of the replication findings is summarized in Table 2. Our replication of the Linda profile supported the confirmatory predictions based on conjunction effects, whereas the results for the James profile were inconsistent. Importantly, the original study reported that the frequency estimate in the "and" condition was higher than in the unlikely condition. This prediction forms the basis for testing the absence or presence of semantic ambiguity in predicting conjunction effects. The replication results for this prediction are in the opposite direction; that is, we found that frequency estimates were lower in the "and" condition than in the unlikely condition. Therefore, the results of the James scenario are inconclusive for teasing apart the semantic ambiguity associated with the "and" conjunction term.

Extension

Descriptive results for the extension are provided in Table S8, and plots are provided in Figures S3 to S6.
We first tested whether the conjunction effect occurred for any of the three items (people, males, females; within-subjects) for each of the profiles (Linda and James; between-subjects) and the assigned condition (likely, unlikely, "and", "and are"). As expected, we found no support for a conjunction effect for females in the general population using the Linda profile items (feminist and bank teller) without the Linda description. Similarly, we found no effect for males in the general population using the James profile items (Republicans and artists) without the James description. These findings should be interpreted with caution, yet they suggest that the conjunction effect demonstrated with the Linda and James problems is affected by the descriptions of Linda and James in a way that makes the conjunction items more salient than the unlikely item. That is, the conjunction effect may depend on the representativeness heuristic (Tversky & Kahneman, 1982) and the preceding profile description. Yet we did find support for a conjunction effect for the Linda items in the estimations for people overall (feminists: M = 29.36, SD = 17.13; bank tellers: M = 8.56, SD = 12.2; "and": M = 11.01, SD = 14.01). It remains to be explored why there would be support for a conjunction effect in the evaluation of people overall but not of females or males, yet it does indicate that the conjunction effect may sometimes occur without the representativeness-heuristic description, and with a within-subjects design. At the very least, this suggests that the conjunction effect is context-sensitive, as is also indicated by the differences in effects we found between the Linda and the James problems.
There were also patterns indicating statistical incoherence in participants' estimates: given a population gender split of 50%-50% females-males, participants indicated means for the general population that were far from the average of their estimates for females and for males (e.g., people who are bank tellers: M = 8.56, SD = 12.2; females who are bank tellers: M = 21.46, SD = 28.64; males who are bank tellers: M = 9.93, SD = 15.40). This is despite the within-subjects design and the three questions being presented together. If participants indeed understood these questions correctly, this may indicate that each estimate was elicited separately, irrespective of context or priors, and/or an inability to process or report percentages. Further findings regarding gender effects for the items in the two profiles are provided in Tables S10 and S11.

Discussion

We conducted a preregistered, well-powered replication of the main design across the three studies of Mellers et al. (2001). Our findings regarding the Linda profile demonstrate support for conjunction effects with both the "and" and "and are" connectors. The findings for the Linda scenario do not support the alternative view that the conjunction effects observed in the Linda story are a manifestation of participants' semantic interpretation of the term "and" as a union instead of an intersection. The semantic ambiguity argument predicted that the "and are" experimental condition would fail to show conjunction effects and that participants' frequency estimates in the "and are" condition would be lower than in the "and" condition. Furthermore, for the Linda story, we tested whether the frequency estimate in the "and are" condition was lower than in the "and" condition; equivalence testing and Bayesian analyses indicated support for a null difference. These findings support the Kahneman view of conjunction effects with frequency estimates.
Our findings for the James profile supported neither the Kahneman nor the Hertwig hypotheses and previous findings. First, the comparison between the "and" and "unlikely" conditions did not support a conjunction effect. Second, we found no support for differences between the frequency estimates in the "and are" and unlikely conditions. Further, as for the Linda story, the planned comparison testing whether the frequency estimate in the "and are" condition was lower than in the "and" condition supported the view that differences between conditions were statistically equivalent to zero. The failure to find empirical support for conjunction effects with the James story suggests that conjunction effects are context-specific. Conjunction effects are commonly demonstrated using the Linda profile, yet the findings regarding other scenarios are less clear (Costello & Watts, 2017). Thus, it is quite possible that the James and Linda scenarios are qualitatively different. A closer examination of the original findings showed that the effects for the James scenario varied considerably across the experiments, from weak effects in Experiment 1 ("and" vs. unlikely: d = 0.21; "and are" vs. unlikely: d = 0.13) with no indication of semantic ambiguity (d = 0.05), to mixed effects in Experiment 2 ("and" vs. unlikely: d = 1.11; "and are" vs. unlikely: d = 0.01) indicating a strong semantic ambiguity effect (d = 1.08). The mini meta-analytic effect we computed for the three original studies seemed to indicate differences in effect size between the Linda and the James scenarios, especially with regard to semantic ambiguity. Additional analyses we conducted suggested that the personality sketch of James was less representative of an artist than Linda's personality sketch was of a feminist.
The observed difference is consistent with Kahneman's argument that conjunction effects arise through the substitution of representativeness estimates for probability estimates. This may be one of the reasons why the current study did not find support for a conjunction effect in the James story even for the comparison between the unlikely and "and" conditions, which was supported in Studies 2 and 3 of the original paper. The current replication effort supports Tversky and Kahneman's (1983) assertion that conjunction effects, when they occur, are a probabilistic error due to the representativeness and availability heuristics. More precisely, the results of the current study for the Linda story support the view that frequency estimates do produce conjunction effects that rely on judgmental heuristics and are not driven by the semantic ambiguity of the conjunction terms. The results for the James profile ranged from inconclusive to a likely failure to replicate. Overall, we found some support for conjunction effects, but they may be less robust than initially expected. These findings indicate the importance of conducting further well-powered, preregistered replications and extensions that revisit classic experiments in this domain and aim to gain deeper insight into the effect, investigating the reliability, generalizability, and contextual variation of the conjunction effect.

Author contact

Subramanya Prasad Chandrashekar, spchandr@ouhk.edu.hk, orcid.org/0000-0002-8599-9241. Correspondence about this article should be addressed to Gilad Feldman at gfeldman@hku.hk.

Conflict of interest and funding

This research was supported by the European Association for Social Psychology seedcorn grant.
Subramanya Prasad Chandrashekar would like to thank the Institute of International Business and Governance (IIBG), established with the substantial support of a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (UGC/IDS 16/17), for its support.

Author contributions

Gilad Feldman (GF) was the course instructor for two social psychology courses (PSYC2071/3052) and led the two replication efforts reported in these courses. GF supervised each step in the project, conducted the pre-registrations, and ran the data collection. Subramanya Prasad Chandrashekar (SPC) integrated the two replication efforts into a manuscript, with validation and further extensions of the statistical analyses. GF and SPC jointly finalized the manuscript for submission. Yat Hin Cheng and Chi Long Fong worked on the replication as part of the judgment and decision-making course (identified as students PSYC2071 in the table below). Ying Chit Leung and Yui Tung Wong worked on the replication as part of the advanced social psychology course (identified as students PSYC3052 in the table below).

Contributor Roles Taxonomy

In the table below, we employ CRediT (Contributor Roles Taxonomy) to identify the contributions and roles played by the contributors in the current replication effort. Please refer to https://www.casrai.org/credit.html for details and definitions of each of the roles listed below.
Role (contributors: SPC, GF, students PSYC2071, students PSYC3052, TA)
Conceptualization: x
Pre-registrations: x x x
Data curation: x
Formal analysis: x x x x
Funding acquisition: x
Investigation: x x x
Methodology: x x
Pre-registration peer review/verification: x x x x
Data analysis peer review/verification: x x x
Project administration: x x
Resources: x
Software: x x x x
Supervision: x
Validation: x x
Visualization: x
Writing - original draft: x x
Writing - review and editing: x x

Open science practices

This article earned the Preregistration+, Open Data, and Open Materials badges for preregistering the hypotheses and analysis before data collection, and for making the data and materials openly available. It has been verified that the analysis reproduced the results presented in the article. The editorial process for this article relied on streamlined peer review, in which peer reviews obtained from previous journal(s) were moved forward and used as the basis for the editorial decision. These reviews are shared in the supplementary files, in the authors' cover letter. The identities of the reviewers are shown or hidden in accordance with the policy of the journal that originally obtained them. The entire editorial process is published in the online supplement.

References

Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., ... & van 't Veer, A. (2014). The replication recipe: What makes for a convincing replication? Journal of Experimental Social Psychology, 50, 217-224. doi: 10.1016/j.jesp.2013.10.005

Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch's t-test instead of Student's t-test. International Review of Social Psychology, 30, 92-101. doi: 10.5334/irsp.82

Collaborative Open-science Research (2020). Large-scale replications and extensions of findings in judgment and decision making. doi: 10.17605/osf.io/5z4a8.
Retrieved March 2020 from http://osf.io/5z4a8

Costello, F., & Watts, P. (2017). Explaining high conjunction fallacy rates: The probability theory plus noise account. Journal of Behavioral Decision Making, 30, 304-321. doi: 10.1002/bdm.1936

Fiedler, K. (1988). The dependence of the conjunction fallacy on subtle linguistic factors. Psychological Research, 50, 123-129. doi: 10.1007/BF00309212

Fisk, J. E., & Pidgeon, N. (1996). Component probabilities and the conjunction fallacy: Resolving signed summation and the low component model in a contingent approach. Acta Psychologica, 94, 1-20. doi: 10.1016/0001-6918(95)00048-8

Gigerenzer, G. (1996). On narrow norms and vague heuristics: A reply to Kahneman and Tversky (1996). Psychological Review, 103, 592-596. doi: 10.1037/0033-295X.103.3.592

Gigerenzer, G. (2005). I think, therefore I err. Social Research: An International Quarterly, 72, 195-218.

Hertwig, R., & Chase, V. M. (1998). Many reasons or just one: How response mode affects reasoning in the conjunction problem. Thinking and Reasoning, 4, 319-352. doi: 10.1080/135467898394102

Hertwig, R., & Gigerenzer, G. (1999). The 'conjunction fallacy' revisited: How intelligent inferences look like reasoning errors. Journal of Behavioral Decision Making, 12, 275-305. doi: 10.1002/(SICI)1099-0771(1999)

Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25, 178-206. doi: 10.3758/s13423-016-1221-4

Lakens, D. (2017). Equivalence tests: A practical primer for t tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8, 355-362. doi: 10.1177/1948550617697177

Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259-269.
doi: 10.1177/2515245918770963

LeBel, E. P., McCarthy, R. J., Earp, B. D., Elson, M., & Vanpaemel, W. (2018). A unified framework to quantify the credibility of scientific findings. Advances in Methods and Practices in Psychological Science, 1(3), 389-402. doi: 10.1177/2515245918787489

LeBel, E. P., Vanpaemel, W., Cheung, I., & Campbell, L. (2019). A brief guide to evaluate replications. Meta-Psychology, 541, 1-17. doi: 10.31219/osf.io/paxyn

Litman, L., Robinson, J., & Abberbock, T. (2017). TurkPrime.com: A versatile crowdsourcing data acquisition platform for the behavioral sciences. Behavior Research Methods, 49, 433-442. doi: 10.3758/s13428-016-0727-z

Mellers, B., Hertwig, R., & Kahneman, D. (2001). Do frequency representations eliminate conjunction effects? An exercise in adversarial collaboration. Psychological Science, 12, 269-275. doi: 10.1111/1467-9280.00350

Morey, R. D., & Rouder, J. N. (2015). BayesFactor: Computation of Bayes factors for common designs (R package version 0.9.12-2). Retrieved from https://cran.r-project.org/package=BayesFactor

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. doi: 10.1126/science.aac4716

Patil, I. (2018). ggstatsplot: "ggplot2" based plots with statistical details. CRAN.

R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.r-project.org

Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results. Psychological Science, 26, 559-569. doi: 10.1177/0956797614567341

Tversky, A., & Kahneman, D. (1982). Judgments of and by representativeness. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases. Cambridge, UK: Cambridge University Press.

Tversky, A., & Kahneman, D. (1983).
Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90, 293-315. doi: 10.1037/0033-295X.90.4.293

Vandekerckhove, J., Rouder, J. N., & Kruschke, J. K. (2018). Bayesian methods for advancing psychological science. Psychonomic Bulletin & Review, 25, 1-4. doi: 10.3758/s13423-018-1443-8

van 't Veer, A. E., & Giner-Sorolla, R. (2016). Pre-registration in social psychology: A discussion and suggested template. Journal of Experimental Social Psychology, 67, 2-12. doi: 10.1016/j.jesp.2016.03.004

Wells, G. L. (1985). The conjunction error and the representativeness heuristic. Social Cognition, 3, 266-279. doi: 10.1521/soco.1985.3.3.266

Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2018). Making replication mainstream. Behavioral and Brain Sciences, 41.

Meta-Psychology, 2021, vol 5, MP.2020.2506
https://doi.org/10.15626/mp.2020.2506
Article type: Original Article
Published under the CC-BY4.0 license
Open data: Not applicable
Open materials: Yes
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: No
Edited by: Henrik Danielsson
Reviewed by: Dixon, P., Buchanan, E., Magnusson, K.
Analysis reproduced by: Lucija Batinović
All supplementary files can be accessed at OSF: https://doi.org/10.17605/osf.io/cb78p

What to make of equivalence testing with a post-specified margin?

Harlan Campbell, University of British Columbia, Department of Statistics
Paul Gustafson, University of British Columbia, Department of Statistics

Abstract

In order to determine whether or not an effect is absent based on a statistical test, the recommended frequentist tool is the equivalence test. Typically, it is expected that an appropriate equivalence margin has been specified before any data are observed. Unfortunately, this can be a difficult task. If the margin is too small, then the test's power will be substantially reduced. If the margin is too large, any claims of equivalence will be meaningless.
Moreover, it remains unclear how defining the margin afterwards will bias one's results. In this short article, we consider a series of hypothetical scenarios in which the margin is defined post hoc or is otherwise considered controversial. We also review a number of relevant, potentially problematic actual studies from clinical trials research, with the aim of motivating a critical discussion as to what is acceptable and desirable in the reporting and interpretation of equivalence tests.

Keywords: equivalence testing, non-inferiority testing, confidence intervals, type 1 error, frequentist testing, clinical trials, negative studies, null results

"Facts do not accumulate on the blank slates of researchers' minds and data simply do not speak for themselves. [...] Interpretation can produce sound judgments or systematic error. Only hindsight will enable us to tell which has occurred." (Kaptchuk, 2003)

Introduction

Consider the following hypothetical situation. After having collected data, we want to determine whether or not an effect is absent based on a statistical test. All too often, in such a situation, non-significance (i.e., p > 0.05), or a combination of both non-significance and supposed high power (i.e., a large sample size), is used as the basis for a claim that the effect is null. Unfortunately, such an argument is logically flawed. As the saying goes, "absence of evidence is not evidence of absence" (Altman and Bland, 1995; Hartung et al., 1983). Instead, to correctly conclude the absence of an effect under the frequentist paradigm, the recommended tool is the equivalence test (also known as a "non-inferiority test" for one-sided testing (Wellek, 2010)). Let θ be our parameter of interest. An equivalence test reverses the question that is asked in a null hypothesis significance test (NHST).
Instead of asking whether we can reject the null hypothesis of no effect, e.g., H0 : θ = 0, an equivalence test examines whether the magnitude of θ is at all meaningful: can we reject the possibility that θ is as large as, or larger than, our smallest effect size of interest, Δ? The null hypothesis for an equivalence test is defined as H0 : θ ∉ (−Δ, Δ). In other words, equivalence implies that θ is small enough that any nonzero effect would be at most equal to Δ. The interval (−Δ, Δ) is known as the equivalence margin and represents a range of values for which θ can be considered negligible. In psychology research and in the social sciences, where the practice of equivalence testing is relatively new, but now "rapidly expanding" (Koh and Cribbie, 2013), there are many questions about how to best conduct and interpret equivalence tests. For example, consider the question of a "post-specified" margin. It is generally accepted that one must specify the equivalence margin a priori, i.e., before any data have been observed (Wellek, 2010). However, in our hypothetical situation, suppose that we did not have the foresight needed to pre-specify this margin. Are we then simply out of luck? It is worth noting that lack of foresight is only one reason we may have failed to pre-specify an appropriate equivalence margin. Defining and justifying the equivalence margin is one of the "most difficult issues" (Hung et al., 2005) for researchers. If the margin we define is deemed too large, then any claim of equivalence will be considered meaningless. If the margin we define is somehow too small, then the probability of declaring equivalence will be substantially reduced (Wiens, 2002).
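Written out in full, the equivalence null and alternative, together with the two one-sided tests (TOST) commonly used to operationalize them, are as follows (a sketch; the notation θ̂ for the point estimate and SE for its standard error is assumed here rather than taken from the text):

```latex
H_0:\ \theta \le -\Delta \ \text{ or } \ \theta \ge \Delta,
\qquad
H_1:\ -\Delta < \theta < \Delta ,
\\[6pt]
t_L = \frac{\hat{\theta} + \Delta}{\mathrm{SE}(\hat{\theta})},
\qquad
t_U = \frac{\hat{\theta} - \Delta}{\mathrm{SE}(\hat{\theta})} .
```

Equivalence is declared at level α when both one-sided tests reject, i.e., when t_L exceeds the upper α critical value and t_U falls below the lower one.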
while the margin is ideally chosen as a boundary that excludes the smallest effect size of interest (lakens et al., 2017), these “ideal” boundaries can be difficult to define, and there is generally no clear consensus among stakeholders (keefe et al., 2013). furthermore, previously agreed-upon meaningful effect sizes may be difficult to ascertain, as they are rarely specified in protocols and published results (djulbegovic et al., 2011). suppose now that, having failed to pre-specify an adequate equivalence margin, we define the equivalence margin post-hoc, having already collected and observed the data. given the potential consequences of interpreting data based on post-hoc decisions, it is understandable that this idea may be alarming to some; e.g., see the “harkonen case” (as discussed in lee and rubin, 2016), in which the u.s. department of justice prosecuted drug-maker intermune (united states v. harkonen, 2013) for making claims based on post-hoc subgroup analyses. in the biostatistics literature there are many warnings about how and when to specify the equivalence margin. hung et al., 2005 note that: “if the margin can change depending on what has been observed [...] statistical testing of non-inferiority [or equivalence] may not be interpretable.” and wiens, 2002 observes that: “the potential biases of defining the margin after the study should be weighed against the cost and inconvenience of better understanding the differences [between study groups].” finally, the committee for proprietary medicinal products (cpmp), 2001 (the eu scientific advisory organization dealing with the approval of new human pharmaceuticals) notes that: “it is prudent to specify a noninferiority margin in the protocol in order to avoid the serious difficulties that can arise from later selection.” statements such as these lead one to ask the following.
under what circumstances would equivalence testing with a data-dependent margin “not be interpretable?” what are the “potential biases” and “serious difficulties” we should consider in these less-than-ideal circumstances? walker and nowacki, 2011 stress that defining the equivalence margin before observing the data is “essential to maintain the type i error at the desired level,” suggesting that potential type i error inflation is the issue of concern. yet this too remains unclear. with equivalence testing becoming more and more common in psychology research, these are important matters to address. in this article we shed light on these questions by considering a series of rather confounding hypothetical scenarios (sections 2 and 3) as well as a number of relevant case studies from biomedical research, where equivalence testing has been widely used for decades (section 4). we conclude (section 5) with an invitation for further discussion of how best to address the title question: what to make of equivalence testing with a post-specified margin?

the pseudo-type i error and a pathological case

before going forward, we would be wise to recall that, under the frequentist paradigm, hypotheses are statements about parameters and are therefore non-random quantities. hence, each hypothesis is either true or false, irrespective of how the data are realized. let θ be the parameter of interest and let x represent the data. borrowing from the notation of wellek, 2017, let θ̲(x; α) be the lower bound of a one-sided 100(1 − α)% confidence interval (ci), and let θ̄(x; α) be the upper bound of a one-sided 100(1 − α)% ci. for example, a one-sided 95% ci for θ could be written as (−∞, θ̄(x; 0.05)]; a two-sided 90% ci could be written as [θ̲(x; 0.05), θ̄(x; 0.05)]. let us define a symmetric equivalence margin as (−∆, ∆). then the standard equivalence testing hypotheses are defined as:

h0 : θ ≤ −∆ or θ ≥ ∆, vs. h1 : −∆ < θ < ∆.

figure 1. the one-to-one correspondence between α and ∆. an equivalence test is conducted on two-sample normally distributed data. the observed mean difference is θ̂ = 0.2 and the observed pooled standard deviation is 1, with n1 = n2 = 50. the shape of this particular curve is specific to these particular data; in general, however, the smallest value of α needed to reject the null (x-axis) decreases as ∆ increases (y-axis). furthermore, as the dashed lines indicate, when ∆ = θ̂, the corresponding value of α will be 0.5.

there is a one-to-one correspondence between symmetric confidence intervals and equivalence testing. the null hypothesis, h0, can be rejected whenever the realized confidence bounds satisfy [θ̲(x; α), θ̄(x; α)] ⊂ (−∆, ∆). conversely, there will be insufficient evidence to reject the null hypothesis whenever [θ̲(x; α), θ̄(x; α)] ⊄ (−∆, ∆). for example, with the standard α = 0.05, we can reject h0 if and only if a 90% ci for θ fits entirely within the equivalence margin. equivalence testing provides the standard guarantee about type 1 error that pr(reject h0 | h0 is true) ≤ α; see wellek, 2017. if we reject the null hypothesis if and only if the 90% ci for θ fits within (−∆, ∆), we can rest assured that we will make a type 1 error in at most 5% of cases. should the equivalence margin not be specified a priori, and instead be defined based on the observed data, we have the following admittedly improper hypothesis test:

h̃0 : θ ≤ −∆(x) or θ ≥ ∆(x), vs. h̃1 : −∆(x) < θ < ∆(x).

in this case, we may not necessarily have pr(reject h̃0 | h̃0 is true) ≤ α. to better understand, let us consider the following admittedly “pathological case.” let ∆(x) be chosen, based on the observed data, to be the smallest possible value for which one can claim equivalence (known in the literature as the “lead” boundaries; see meyners, 2007). this is done by setting:

∆(x) = max(|θ̲(x; α)|, |θ̄(x; α)|) + ε,

where ε is a small positive real number.
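a quick simulation makes the pathology concrete: with the margin set to the lead boundary ∆(x) = max(|θ̲(x; α)|, |θ̄(x; α)|) + ε, the 90% ci fits inside (−∆(x), ∆(x)) by construction, so equivalence is claimed for every data set, whatever the true effect. the python sketch below uses our own assumed setup (two-sample normal data, n = 50 per arm, a large true effect of 1.0); it is an illustration, not the authors’ simulation code.

```python
# simulation of the pathological case: the "lead" margin is, by construction,
# just wide enough that the 90% ci always fits inside (-delta(x), delta(x)),
# so equivalence is claimed in every data set. setup is our own assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def lead_margin_test(n=50, true_effect=1.0, alpha=0.05, eps=0.01):
    x = rng.normal(true_effect, 1.0, n)
    y = rng.normal(0.0, 1.0, n)
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / n + y.var(ddof=1) / n)
    t_crit = stats.t.ppf(1 - alpha, 2 * n - 2)
    lo, hi = diff - t_crit * se, diff + t_crit * se  # two-sided 90% ci
    delta = max(abs(lo), abs(hi)) + eps              # margin chosen to fit the data
    return -delta < lo and hi < delta                # is the ci inside (-delta, delta)?

rate = np.mean([lead_margin_test() for _ in range(1000)])
print(rate)  # 1.0: "equivalence" is claimed every time, despite a true effect of 1.0
```

replacing the data-dependent `delta` with any fixed, pre-specified margin restores the usual type 1 error guarantee.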
for example, if a 90% ci for θ is [−0.2, 0.5], the “pathological” equivalence margin would be defined as (−0.51, 0.51), with ∆(x) = 0.5 + 0.01. given the monotonic relationship between a confidence interval and an equivalence test, there is a one-to-one correspondence between α and ∆. for any given value of α, conditional on a fixed sample of data, there is a value of ∆ for which one can reject h0. conversely, for any given value of ∆, there is a value of α for which one can reject h0; see figure 1. in our pathological case, we have pr(reject h̃0) = 1, i.e., we will always claim equivalence. in this situation, the margin is entirely “data-dependent.” in other words, the data (as summarized by the confidence interval) and the margin are perfectly correlated. we write cor(f(x), ∆) = 1, where f(x) = max(|θ̲(x; α)|, |θ̄(x; α)|). figure 2 displays the relationship between type 1 error and cor(f(x), ∆); see details in the appendix. in the pathological case, since pr(reject h̃0) = 1, we also have pr(reject h̃0 | h̃0 is true) = 1. as such, we have pr(reject h̃0 | h̃0 is true) > α, and therefore the “pseudo-type i error” is not controlled. when there is less correlation, i.e., when the margin is not entirely data-dependent, we can expect less type 1 error inflation. in order for the test to be valid, the key is independence between the margin and the data. in the case when the data and the margin are entirely independent, the type 1 error rate will be at most α, as desired.

a somewhat less pathological case

now let us consider a somewhat less pathological situation. the cpmp published an advisory report, “points to consider on switching between superiority and noninferiority” (committee for proprietary medicinal products (cpmp), 2001), in which they describe another hypothetical situation where the margin is determined after the data are observed:

figure 2.
the relationship between type 1 error and the correlation between the margin and the data. the correlation measure, cor(f(x), ∆), is obtained by varying the probability of setting ∆(x) equal to the lead margin vs. setting ∆(x) equal to a value entirely independent of the data. the curve is the result of repeated simulations of two-sample data; see details in the appendix.

“let us suppose that a bioequivalence trial finds a 90% confidence interval for the relative bioavailability of a new formulation that ranges from 0.90 to 1.15. can we only conclude that the relative bioavailability lies between the conventional limits of 0.80 and 1.25 because these were the predefined equivalence margins? or can we conclude that it lies between 0.90 and 1.15? the narrower interval based on the actual data is the appropriate one to accept. hence, if the regulatory requirement changed to +/-15%, this study would have produced satisfactory results. there is no question here of a data-derived selection process. however, if the trial had resulted in a confidence interval ranging from 0.75 to 1.20, then a post hoc change of equivalence margins to +/-25% would not be acceptable because of the obvious conclusion that the equivalence margin was chosen to fit the data.”

according to this recommendation, it seems that, without any scrutiny, we are free to shrink a pre-specified margin as needed; we must only avoid widening it. if this is the case, a prudent strategy would be to always pre-specify the largest possible margin before collecting data, and then shrink the margin as required. this may strike some as opportunistic and potentially problematic. ng, 2003 studies a similar hypothetical situation in which a large, possibly infinite number of margins are all pre-specified and all the corresponding hypotheses are tested (without any bonferroni-type adjustment for multiple comparisons).
equivalence is then claimed using the narrowest of all the potential pre-specified margins for which equivalence is statistically significant. ng, 2003 explains why this hypothetical strategy may be problematic: “although there is no inflation of the type i error rate [due to the fact that all hypotheses are nested], simultaneous testing of many nested null hypotheses is problematic in a confirmatory trial because the probability of confirming the finding of such testing in a second trial would approach 0.5 as the number of nested null hypotheses approaches infinity.” to better understand ng, 2003’s concern, consider a similar setup in which, for a standard null hypothesis significance test, a large, possibly infinite number of pre-specified α-levels (allowable type i error rates) are defined. the null is then rejected using the smallest of all the potential pre-specified α values. under this procedure, the probability of confirming a statistically significant finding in a second trial (with identical sample size and α) approaches 0.5; see hoenig and heisey, 2001, who describe this (often unappreciated) property of “retrospective power.” as such, one is always expected to specify (and justify) a single α-level prior to observing any data; see the recent commentary of lakens et al., 2018. (these two situations are in fact identical, due to the aforementioned one-to-one correspondence between a data-driven selection of α and a data-driven choice of ∆; see figure 1.)

how hypothetical are situations like these?

while the cases described in the previous sections were purely hypothetical, similar situations do arise in practice. we consider a number of different clinical trial studies as examples, with the aim of motivating a critical discussion as to what is acceptable and desirable in the reporting and interpretation of equivalence tests.
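before turning to actual studies, the “probability of confirming approaches 0.5” claim discussed above is easy to check numerically. in the python sketch below (our own assumed setup: two-sample t-tests, a true effect of 0.3, n = 50 per arm), the first trial’s p-value plays the role of the data-chosen α, and we ask how often an identical second trial is significant at that threshold.

```python
# numerical check: treat the first trial's p-value as the data-chosen alpha
# and ask how often an identical second trial clears it. the two-sample t-test
# setup (true effect 0.3, n = 50 per arm) is our own assumed illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def trial_p(n=50, effect=0.3):
    """p-value from one simulated two-sample trial with a modest true effect."""
    x = rng.normal(effect, 1.0, n)
    y = rng.normal(0.0, 1.0, n)
    return stats.ttest_ind(x, y).pvalue

# fraction of trial pairs where the second p-value beats the first
rate = np.mean([trial_p() > trial_p() for _ in range(4000)])
print(round(rate, 2))  # close to 0.5, echoing hoenig and heisey, 2001
```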
first, consider cases of post-hoc judgement that often arise in the regulatory approval of drugs seeking a designation of bio-equivalence. when the pre-specified margin is deemed too generous (i.e., too wide) by regulatory authorities only after the data have been observed and analyzed, the regulator may decide that, for the purposes of approval, the drug does not meet an appropriate standard for equivalence. consider two examples:

1. the sportif iii and sportif v randomized controlled trials (rcts) were designed to investigate the potential of ximelagatran as the first oral alternative to warfarin for reducing the risk of thromboembolic complications in patients with nonvalvular atrial fibrillation. the primary end point in each study was the incidence of all strokes and systemic embolic events, and the primary objective was to establish the non-inferiority of ximelagatran relative to warfarin with a pre-specified margin of an absolute 2% difference in the event rate; see halperin, 2003. both studies met the primary objective of non-inferiority with the pre-specified margin. as such, upon completion, the studies were heralded as a “major breakthrough” (albers et al., 2005; kulbertus, 2003). however, upon regulatory review by the fda cardiovascular and renal drugs advisory committee (crdac), the pre-specified margin was judged to be “too generous” (boudes, 2006). this post-hoc criticism of the “unreasonably generous” (kaul et al., 2005) margin, along with concerns about potential liver toxicity, led to a unanimous decision by the crdac that the benefit of ximelagatran did not outweigh the risk. the fda then refused to grant approval of ximelagatran for any of the proposed indications; see head et al., 2012 and boudes, 2006, who provide a detailed timeline and description of the approval process.

2.
the everest ii study was an rct designed to evaluate percutaneous mitral valve repair relative to mitral valve surgery (mauri et al., 2010). the primary efficacy end point was defined as the proportion of patients free from death, from surgery for valve dysfunction, and from moderate-severe (3+) or severe (4+) mitral regurgitation at 12 months. upon completion, researchers claimed success when the primary non-inferiority objective was achieved. however, the conclusion of non-inferiority was “difficult to accept due to unduly wide margins” (head et al., 2012). thus, the fda determined that, despite the significant p-value, “non-inferiority is not implied due to the large margin” and therefore the data “did not demonstrate an appropriate benefit-risk profile when compared to standard mitral valve surgery and were inadequate to support approval” (fda, 2013).

in other instances, the complete opposite has occurred. despite the researchers failing to pre-specify a margin prior to observing the data, the regulatory agency will still accept a claim of equivalence/non-inferiority on the basis that, given some non-controversial post-hoc margin, there is sufficient evidence. consider two examples:

1. the goal of mannkind’s “study 103” was to evaluate the inhaled insulin afrezza for the treatment of diabetes mellitus in adults. subjects were randomized to 12 weeks of continued treatment in one of three treatment arms. the pre-specified primary objective was to show superiority of the afrezza ti+metformin arm relative to the secretagogue+metformin arm with respect to change in hba1c at 12 weeks. upon completion, the superiority objective was not achieved, and a non-inferiority margin had not been pre-specified by the researchers. however, the regulators were able to accept a claim of non-inferiority. the fda clinical review states: “the sponsor did not specify a non-inferiority margin.
however, the fda statistical reviewer noted that afrezza ti+metformin was non-inferior to secretagogue+metformin when the standard margin of 0.4% for insulins is used (the upper bound of the 95% confidence interval for the treatment difference in hba1c is 0.3%),” (yanoff, 2014).

2. the ally-3 trial was a one-arm phase 3 trial with the goal of evaluating the safety and efficacy of oral daclatasvir for chronic hcv genotype 3 infection (mccormack, 2015). there was no active or placebo control, and as such it was impossible to conduct a non-inferiority or equivalence test based only on the trial data. the fda therefore looked to other trials to determine estimates of the effectiveness of competitor treatments. in addition, as noted by the oregon health authority, “[t]he ally-3 trial [...] did not define a noninferiority margin for determination of efficacy. the fda analysis calculated it based on historical data and concluded that dcv [daclatasvir] with sof [sofosbuvir] achieved non-inferiority compared to sof [sofosbuvir] with rbv [ribavirin] for 24 weeks[...],” (herink, 2016). in this case, the fda reviewers “clinically justified” their choice of a post-specified non-inferiority margin based on historical data; see struble, 2015.

these studies illustrate the fact that, in some fields, there may be well-established “standard” margins or sufficient “historical data.” such standards no doubt make post-specification less controversial for regulatory agencies. when it comes to peer-reviewed journals, researchers will often note that, while an equivalence margin was not pre-specified, a conclusion of equivalence can still be (cautiously) accepted. we consider two examples. in the first case, the margin was not pre-defined, yet claims of equivalence were nevertheless put forward. in the second case, while a margin was pre-defined, additional conclusions were made based on post-specified margins.

1. a. b.
chang et al., 2008 published the results of an rct with the goal of evaluating a 5- versus 3-day course of oral corticosteroids (cs) for non-hospitalised children with asthma exacerbations. the primary outcome was 2-week morbidity of the children. the study did not show a statistically significant difference between the two treatment arms. in interpreting the results, chang et al., 2008 note that: “it would have been ideal to define a non-inferiority or equivalence margin a priori on the basis of a minimally important effect or historical controls. our study was designed as a superiority trial, and we did not define a noninferiority margin a priori. nevertheless, for the primary outcome measure, the chosen symptom score cut-off of 0.20 (i.e., chosen minimally important difference), the study shows equivalence.” as such, the researchers concluded that the 3-day and 5-day treatment courses were “equally efficacious” in reducing the symptoms of asthma (a. chang et al., 2007).

2. jones et al., 2016 studied the efficacy of isoflurane relative to sevoflurane in cardiac surgery. when interpreting the results, the authors note that: “our choice of non-inferiority margin may seem to be overly generous; however, it is important to emphasize that, if the margin had been reduced to as low as 1.5%, the conclusions of this trial would not have changed,” (jones et al., 2016).

if, following a study’s publication, other researchers take issue with how the study’s equivalence margin was justified, they will often respond in a letter to the journal. the post-hoc debate between groenewoud et al., 2017 and gupta et al., 2016 about the appropriateness of the pre-specified non-inferiority margin defined in the study by groenewoud et al., 2016 on methods for embryo transfer is an excellent example of this. in the end, readers are left to judge for themselves.

conclusion

researchers advocate that equivalence testing has great potential to “facilitate theory falsification” (quintana, 2018).
by clearly distinguishing between what is “evidence of absence” and what is an “absence of evidence,” equivalence testing may facilitate the long “series of searching questions” necessary to evaluate a “failed outcome” (pocock and stone, 2016). as a result, it may encourage the greater publication of null results, which is desperately needed (fanelli, 2011). yet, outside of health research, guidelines on how best to define and interpret margins are lacking. we hope that the question posed in the title of this article will motivate researchers to further consider the delicate issues involved. in clinical trials research, expectations that a margin be pre-specified have been well established for quite some time (piaggio et al., 2006). this is not the case in other disciplines. in psychology research and in the social sciences, discussions of how best to execute equivalence tests are underway, and appropriate recommendations are crucially needed. one might argue that the pathological case of equivalence testing we considered does not actually qualify as testing per se and is, instead, simply a tool for describing the data. this is the opinion of meyners, 2007, who concludes that, as a descriptor of the data, the “lead boundaries,” (−∆(x), ∆(x)), provide “useful information” and in some cases are “even more important than confidence intervals” for reporting results. at the end of the day, everyone must arrive at their own conclusions as to whether or not a sufficient standard of evidence for equivalence has been demonstrated. obviously, this is often easier said than done. as one final example from clinical trials, we turn to the infamous debate over using bevacizumab (avastin) as a treatment for age-related macular degeneration. a non-inferiority study was conducted to investigate (group, 2011).
however, some considered the pre-specified non-inferiority margin of 5 letters (on the etdrs visual acuity chart) as “generous” even before the results of the trial were announced (hirschler, 2011). this suggests that, regardless of the results, some would have remained skeptical of any claim of non-inferiority with the 5-letter margin. in stark contrast, the standard of evidence for many healthcare providers was much weaker. indeed, many doctors determined that the use of bevacizumab (avastin) as a substitute for ranibizumab (lucentis) was justified (particularly given the “too big to ignore” price difference) even before the completion of the non-inferiority trial, and were comfortable treating large numbers of patients with avastin “off-label” (steinbrook, 2006). in this situation, financial incentives clearly competed with statistical considerations of clinical efficacy in determining what was to be considered “equivalent.” while the use of equivalence testing should be encouraged, caution is warranted. in a review of equivalence and non-inferiority clinical trials, le henanff et al., 2006 find that studies often “reported margins [that] were so large that they were clearly unconvincing.” indeed, as gøtzsche, 2006 concludes: “clinicians should especially bear in mind that noninferiority margins are often far too large to be clinically meaningful and that a claim of equivalence may also be misleading if a trial has not been conducted to an appropriately high standard.” we conclude with the following general recommendations:

• if the parameter of interest is not measured in units that are interpretable, one should consider standardized effect sizes.
campbell, 2020 notes that: “equivalence tests for standardized effects may help researchers in situations when what is “negligible” is particularly difficult to determine.” for instance, if the outcome of interest is a depression scale, the clinical relevance of a certain x-point improvement may not be intuitively meaningful, and it may be difficult to define what number of points can be considered “negligible.” however, since a cohen’s d = 0.2 is widely interpreted to be a “small” effect (cohen, 1977; fritz et al., 2012), one could conclude, based on an equivalence test which rejects the null with ∆ = d = 0.2, that any effect, if it exists, is at most small.

• the validity of an equivalence test does not depend on the margin being pre-specified. rather, the necessary requirement for a valid test is that the margin is completely independent of the data. in one of our biomedical examples (afrezza ti+metformin), we described a situation where the researchers had not specified a margin but the fda adopted a “standard margin of 0.4%.” while there are no comparable independent agencies to regulate psychology research, peer-reviewed journals do possess substantial leverage and would be wise to consider adopting a set of “default margins” (based on standardized effect sizes). while “default equivalence margins” may not be appropriate for all studies, their use would be similar to that of “default priors” for bayesian inference (rouder et al., 2012) and offers the potential for more objective analyses.

• simply because a margin has been pre-specified (and is therefore guaranteed to be independent of the data), it is not necessarily an appropriate choice. regardless of whether the margin is pre-specified or defined post-hoc, we must acknowledge that a claim of “noninferiority [or equivalence] is almost certain with lenient noninferiority margins” (flacco et al., 2016). one should always critically consider the practical implications of the given margin.
• if one is to suggest equivalence based on a post-hoc margin, one must, at the very least, be forthcoming and honest about the potential for bias. in such cases, every effort should be made to justify the appropriateness of the post-specified margin based on factors entirely independent of the observed data.

• in the absence of a pre-specified margin, one can always resort to simply reporting the associated confidence interval. if the confidence interval contains the null and is “narrow enough,” the absence of an effect can be deemed likely. this tactic lacks the formalism of equivalence testing, yet avoids the difficulties of interpreting and justifying a post-hoc margin.

• deliberate or not, questionable research practices cause major harm to the credibility of psychology research (sijtsma, 2016). with this in mind, researchers, given their incentive to publish (nosek et al., 2012), are not in the best position to define their own margins. this is true whenever the margin is pre-specified, and especially true when a margin is suggested post-hoc. as such, in order to avoid any potential scrutiny, researchers would be wise to seek an independent party, devoid of any potential biases, to define an appropriate margin. this is already common practice in clinical trials research, where sponsors have undeniable incentives to further drug development and the fda and other regulators will (ideally) set clear guidance for an acceptable margin. in other fields, such as psychology, the suggestion that an equivalence margin be defined/scrutinized by an independent party has recently been considered within the framework of a proposed publication policy. in the conditional equivalence testing (cet) publication policy, the independent journal editors/reviewers are tasked with critically evaluating a given margin prior to the start of a study (campbell and gustafson, 2018).

author contact

h. campbell: https://orcid.org/0000-0002-0959-1594 and p.
gustafson: https://orcid.org/0000-0002-2375-5006. please contact h. campbell at harlan.campbell@stat.ubc.ca with any inquiries.

conflict of interest and funding

we have no conflicts of interest to declare. the research was supported by nserc discovery grant rgpin-2019-03957.

author contributions

h. campbell and p. gustafson both contributed to the concept and writing of this article. h. campbell drafted the original manuscript, and p. gustafson provided critical revisions. both authors approved the final version of the manuscript for submission.

open science statement

this article earned the open materials badge for making the materials available. it was not pre-registered and had no collected data to share. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

albers, g. w., diener, h.-c., frison, l., grind, m., nevinson, m., partridge, s., halperin, j. l., horrow, j., olsson, s. b., petersen, p., et al. (2005). ximelagatran vs warfarin for stroke prevention in patients with nonvalvular atrial fibrillation: a randomized trial. jama, 293(6), 690–698.

altman, d. g., & bland, j. m. (1995). statistics notes: absence of evidence is not evidence of absence. the bmj, 311(7003), 485.

boudes, p. f. (2006). the challenges of new drugs benefits and risks analysis: lessons from the ximelagatran fda cardiovascular advisory committee. contemporary clinical trials, 27(5), 432–440.

campbell, h. (2020). equivalence testing for standardized effect sizes in linear regression. arxiv preprint arxiv:2004.01757.

campbell, h., & gustafson, p. (2018). conditional equivalence testing: an alternative remedy for publication bias. plos one, 13(4), e0195145.

chang, a., clark, r., thearle, d., stone, g., petsky, h., champion, a., wheeler, c., & acworth, j. (2007). longer better than shorter?
a multicentre randomised control trial (rct) of 5 vs 3 days of oral prednisolone for acute asthma in children. respirology, 12, a67.

chang, a. b., clark, r., sloots, t. p., stone, d. g., petsky, h. l., thearle, d., champion, a. a., wheeler, c., & acworth, j. p. (2008). a 5- versus 3-day course of oral corticosteroids for children with asthma exacerbations who are not hospitalised: a randomised controlled trial. medical journal of australia, 189(6), 306–310.

cohen, j. (1977). statistical power analysis for the behavioral sciences. academic press.

committee for proprietary medicinal products (cpmp). (2001). points to consider on switching between superiority and non-inferiority. british journal of clinical pharmacology, 52(3), 223.

djulbegovic, b., kumar, a., magazin, a., schroen, a. t., soares, h., hozo, i., clarke, m., sargent, d., & schell, m. j. (2011). optimism bias leads to inconclusive results: an empirical study. journal of clinical epidemiology, 64(6), 583–593.

fanelli, d. (2011). negative results are disappearing from most disciplines and countries. scientometrics, 90(3), 891–904.

fda. (2013). pma p100009: fda summary of safety and effectiveness data. accessdata.fda.gov.

flacco, m. e., manzoli, l., & ioannidis, j. (2016). noninferiority is almost certain with lenient noninferiority margins. journal of clinical epidemiology, 71, 118.

fritz, c. o., morris, p. e., & richler, j. j. (2012). effect size estimates: current use, calculations, and interpretation. journal of experimental psychology: general, 141(1), 2.

gøtzsche, p. c. (2006). lessons from and cautions about noninferiority and equivalence randomized trials. jama, 295(10), 1172–1174.

groenewoud, e., cohlen, b., al-oraiby, a., brinkhuis, e., broekmans, f., de bruin, j., van den dool, g., fleisher, k., friederich, j., goddijn, m., et al. (2016). a randomized controlled, noninferiority trial of modified natural versus artificial cycle for cryo-thawed embryo transfer. human reproduction, 31(7), 1483–1492.
groenewoud, e., macklon, b. k. n., & cohlen, b. (2017). response to: the impact of an inappropriate non-inferiority margin in a non-inferiority trial. endometrial preparation methods in frozen-thawed embryo transfer, 31, 93.

group, c. r. (2011). ranibizumab and bevacizumab for neovascular age-related macular degeneration. new england journal of medicine, 364(20), 1897–1908.

gupta, r., gupta, h., & banker, m. (2016). the impact of an inappropriate non-inferiority margin in a non-inferiority trial. human reproduction, 1–2.

halperin, j. l. (2003). ximelagatran compared with warfarin for prevention of thromboembolism in patients with nonvalvular atrial fibrillation: rationale, objectives, and design of a pair of clinical studies and baseline patient characteristics (sportif iii and v). american heart journal, 146(3), 431–438.

hartung, j., cottrell, j. e., & giffin, j. p. (1983). absence of evidence is not evidence of absence. anesthesiology: the journal of the american society of anesthesiologists, 58(3), 298–299.

head, s. j., kaul, s., bogers, a. j., & kappetein, a. p. (2012). non-inferiority study design: lessons to be learned from cardiovascular trials. european heart journal, 33(11), 1318–1324.

herink, m. (2016). class update with new drug evaluation: direct antivirals for hepatitis c. https://www.orpdl.org/durm/meetings/meetingdocs/2016_01_28/archives/2016_01_28_hepatitiscclassupdate_final.pdf

hirschler, b. (2011). head-to-head eye drug results tipped for early may. reuters. https://www.reuters.com/article/novartis-roche-lucentis/head-to-head-eye-drug-results-tipped-for-early-may-iduslde72s1t620110330

hoenig, j. m., & heisey, d. m. (2001). the abuse of power: the pervasive fallacy of power calculations for data analysis. the american statistician, 55(1), 19–24.

hung, h., wang, s.-j., & o’neill, r. (2005). a regulatory perspective on choice of margin and statistical inference issue in non-inferiority trials.
biometrical journal, 47(1), 28–36. jones, p. m., bainbridge, d., chu, m. w., fernandes, p. s., fox, s. a., iglesias, i., kiaii, b., lavi, r., & murkin, j. m. (2016). comparison of isoflurane and sevoflurane in cardiac surgery: a randomized non-inferiority comparative effectiveness trialcomparaison de l’isoflurane et du sévoflurane en chirurgie cardiaque: une étude randomisée d’efficacité comparative et de non-infériorité. canadian journal of anesthesia/journal canadien d’anesthésie, 63(10), 1128– 1139. kaptchuk, t. j. (2003). effect of interpretive bias on research evidence. the bmj, 326(7404), 1453– 1455. kaul, s., diamond, g. a., & weintraub, w. s. (2005). trials and tribulations of non-inferiority: the ximelagatran experience. journal of the american college of cardiology, 46(11), 1986–1995. keefe, r. s., kraemer, h. c., epstein, r. s., frank, e., haynes, g., laughren, t. p., mcnulty, j., reed, %5curl%7bhttps://www.orpdl.org/durm/meetings/meetingdocs/2016_01_28/archives/2016_01_28_hepatitiscclassupdate_final.pdf%7d %5curl%7bhttps://www.orpdl.org/durm/meetings/meetingdocs/2016_01_28/archives/2016_01_28_hepatitiscclassupdate_final.pdf%7d %5curl%7bhttps://www.orpdl.org/durm/meetings/meetingdocs/2016_01_28/archives/2016_01_28_hepatitiscclassupdate_final.pdf%7d %5curl%7bhttps://www.orpdl.org/durm/meetings/meetingdocs/2016_01_28/archives/2016_01_28_hepatitiscclassupdate_final.pdf%7d https://www.reuters.com/article/novartis-roche-lucentis/head-to-head-eye-drug-results-tipped-for-early-may-iduslde72s1t620110330 https://www.reuters.com/article/novartis-roche-lucentis/head-to-head-eye-drug-results-tipped-for-early-may-iduslde72s1t620110330 https://www.reuters.com/article/novartis-roche-lucentis/head-to-head-eye-drug-results-tipped-for-early-may-iduslde72s1t620110330 https://www.reuters.com/article/novartis-roche-lucentis/head-to-head-eye-drug-results-tipped-for-early-may-iduslde72s1t620110330 10 s. d., sanchez, j., & leon, a. c. (2013). 
defining a clinically meaningful effect for the design and interpretation of randomized controlled trials. innovations in clinical neuroscience, 10(5-6 suppl a), 4s. koh, a., & cribbie, r. (2013). robust tests of equivalence for k independent groups. british journal of mathematical and statistical psychology, 66(3), 426–434. kulbertus, h. (2003). sportif iii and v trials: a major breakthrough for long-term oral anticoagulation. revue medicale de liege, 58(12), 770– 773. lakens, d., adolfi, f., albers, c., anvari, f., apps, m., argamon, s., baguley, t., becker, r., benning, s., bradford, d., et al. (2018). justify your alpha. nature human behavior, 2, 168–171. lakens, d., scheel, a. m., & isager, p. m. (2017). equivalence testing for psychological research: a tutorial. pre-print retrieved from the open science framework. le henanff, a., giraudeau, b., baron, g., & ravaud, p. (2006). quality of reporting of noninferiority and equivalence randomized trials. jama, 295(10), 1147–1151. lee, j. j., & rubin, d. b. (2016). evaluating the validity of post-hoc subgroup inferences: a case study. the american statistician, 70(1), 39–46. mauri, l., garg, p., massaro, j. m., foster, e., glower, d., mehoudar, p., powell, f., komtebedde, j., mcdermott, e., & feldman, t. (2010). the everest ii trial: design and rationale for a randomized study of the evalve mitraclip system compared with mitral valve surgery for mitral regurgitation. american heart journal, 160(1), 23–29. mccormack, p. l. (2015). daclatasvir: a review of its use in adult patients with chronic hepatitis c virus infection. drugs, 75(5), 515–524. meyners, m. (2007). least equivalent allowable differences in equivalence testing. food quality and preference, 18(3), 541–547. ng, t.-h. (2003). issues of simultaneous tests for noninferiority and superiority. journal of biopharmaceutical statistics, 13(4), 629–639. nosek, b. a., spies, j. r., & motyl, m. (2012). scientific utopia ii. 
restructuring incentives and practices to promote truth over publishability. perspectives on psychological science, 7(6), 615–631. piaggio, g., elbourne, d. r., altman, d. g., pocock, s. j., evans, s. j., group, c., et al. (2006). reporting of noninferiority and equivalence randomized trials: an extension of the consort statement. jama, 295(10), 1152–1160. pocock, s. j., & stone, g. w. (2016). the primary outcome fails -what next? new england journal of medicine, 375(9), 861–870. quintana, d. s. (2018). revisiting non-significant effects of intranasal oxytocin using equivalence testing. psychoneuroendocrinology, 87, 127– 130. rouder, j. n., morey, r. d., speckman, p. l., & province, j. m. (2012). default bayes factors for anova designs. journal of mathematical psychology, 56(5), 356–374. sijtsma, k. (2016). playing with data—or how to discourage questionable research practices and stimulate researchers to do things right. psychometrika, 81(1), 1–15. steinbrook, r. (2006). the price of sight: ranibizumab, bevacizumab, and the treatment of macular degeneration. new england journal of medicine, 355(14), 1409–1412. struble, k. (2015). clinical review, cross discipline team leader review. center for drug evaluation and research, application number: 206843orig1s000. walker, e., & nowacki, a. s. (2011). understanding equivalence and noninferiority testing. journal of general internal medicine, 26(2), 192–196. wellek, s. (2010). testing statistical hypotheses of equivalence and noninferiority. crc press. wellek, s. (2017). a critical evaluation of the current “p-value controversy”. biometrical journal. wiens, b. l. (2002). choosing an equivalence limit for noninferiority or equivalence studies. controlled clinical trials, 23(1), 2–14. yanoff, l. b. (2014). clinical review, cross discipline team leader review. center for drug evaluation and research, application number: 022472orig1s000. 
meta-psychology, 2023, vol 7, mp.2021.2764 https://doi.org/10.15626/mp.2021.2764 article type: file-drawer report published under the cc-by4.0 license open data: yes open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: rickard carlsson reviewed by: deliah sarah bolesta, rima-maria rahal analysis reproduced by: lucija batinović all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/vxqj5 group membership and deviance punishment: are deviant ingroup members actually judged more negatively than outgroup ones? eric bonetto1, timothy s. carsel2, jaïs adam-troian3, florent varet4, lindsay m. keeran1, grégory lo monaco2, and anthony piermattéo4 1aix-marseille university institute 2university of illinois at chicago, chicago illinois 3department of social sciences, canadian university dubai 4université catholique de lille, equipe oces deviance punishment is an important issue for social-psychological research. group members tend to punish deviance through rejection, ostracism and – more commonly – negative judgments. subjective group dynamics proposes to account for social judgment patterns of deviant and conformist individuals. relying on a group identity management perspective, one of the model's core predictions is that the judgment of a deviant target depends on group membership. more specifically, the model predicts that deviant ingroup members should be judged more negatively than outgroup ones. although this effect has been repeatedly observed over the past decades, there is a current lack of sufficiently powered studies in the literature. 
for the first time, we conducted tests of subjective group dynamics in france and the us to investigate whether ingroup deviants were judged more harshly than outgroup ones. across six experiments and an internal mini meta-analysis, we observed no substantial difference in judgment between ingroup and outgroup deviant targets, d = -0.01, 95% ci[-0.07, 0.06]. the findings’ implications for deviance management research are discussed. keywords: deviance, punishment, subjective group dynamics, replication an important focus of social-psychological research about deviance pertains to the way social groups react toward members who deviate from group norms (abrams, 2010). group members tend to punish deviance through rejection, ostracism and more commonly through negative judgments (bendor & swistak, 2001; douglas, 2010; hogg & reid, 2006; lapinski & rimal, 2005; rimal & real, 2003). the subjective group dynamics model (marques, abrams, paez, & hogg, 2001; marques, paez, & abrams, 1998) tackles the issue of deviance punishment and predicts, amongst other things, that deviant ingroup and deviant outgroup members are not punished to the same extent. subjective group dynamics states that ingroup members will be evaluated more extremely than outgroup members because the attitudes and behaviors of ingroup members are more relevant to the ingroup’s identity (marques et al., 1998). from this perspective, pro-normative ingroup members should be evaluated more positively than pro-normative outgroup members. conversely, deviant ingroup members should be judged more negatively than deviant outgroup members. hence, when an ingroup member displays attitudes or behaviors that threaten the positive image of the ingroup, then other ingroup members should react negatively toward the deviant member (see the black sheep effect; marques, 2010; marques, yzerbyt, & leyens, 1988; marques & paez, 1994). 
consequently, the deviant ingroup member should be evaluated negatively or – in some cases – ostracized in an effort to restore the group's positive social identity through maintaining the perceived ingroup superiority compared to the outgroup (marques, 2010; abrams, rutland, & cameron, 2003). on the contrary, because attitudes and behaviors of outgroup members are less relevant to the ingroup's identity, reactions to a deviant outgroup member should be less extreme (marques, 2010). this effect therefore draws upon ingroup favoritism (i.e., the tendency to attribute more symbolic or material rewards to one's own group over an outgroup; tajfel, billig, bundy, & flament, 1971; turner, brown, & tajfel, 1979), whereby individuals must reconcile their knowledge of the existence of undesirable ingroup members with their motivation to uphold a favorable view of the ingroup (marques et al., 1988; pinto, marques, levine, & abrams, 2010). previous studies highlighting an effect of group membership on deviance punishment tend to present participants with a description of a target individual or a target group engaging in a deviant behavior (even if the deviance of this behavior is not systematically pretested; e.g., wang, zheng, meng, lu, & ma, 2016). participants are then asked to judge the target on various dimensions through self-report measures (e.g., warmth, competence). deviance management has attracted considerable research focus over the past decades. the effect of group membership on deviance punishment has been observed in a wide range of intergroup contexts and under a variety of collective identity threat conditions (pinto et al., 2010; branscombe, wann, noel, & coleman, 1993; castano, paladino, coull, & yzerbyt, 2002; coull, yzerbyt, castano, paladino, & leemans, 2001; hutchison & abrams, 2003; khan & lambert, 1998; shin, freda, & yi, 1999; stapel, koomen, & spears, 1999). 
as such, one might conclude that this effect is a highly replicable and robust phenomenon. in fact, it is often included in introductory psychology textbooks as an example of a robust and counterintuitive finding (abrams, hogg, & marques, 2005; albarracin, johnson, & zanna, 2005; fiske, gilbert, & lindzey, 2010; levine & hogg, 2010; postmes & jetten, 2006). despite this amount of literature, a methodological issue calls for further investigation of this effect. many of these studies rely on small samples (e.g., n = 66 for four groups in marques et al., 1998, experiment 1; n = 37 for four groups in experiment 2; n = 46 for two groups in castano et al., 2002, experiment 1; n = 28 for two groups in experiment 2; see also bettencourt et al., 2015). because small samples are unlikely to capture extreme values that are present in the population, they tend to inflate observed effect sizes (button et al., 2013). consequently, it seems possible that the published effects of group membership on deviance punishment are much larger than the true effect. this problem is exacerbated when researchers conduct power analyses prior to data collection using these inflated effect sizes, because they will underestimate the number of participants they actually need (anderson, kelley, & maxwell, 2017). these practices therefore feed into each other and make it difficult to assess the robustness of the effect. the presence of this methodological issue led us to conduct a series of six sufficiently powered tests (i.e., by current, post-replication crisis standards) of subjective group dynamics. more specifically, we sought to test the hypothesis according to which deviant ingroup members are punished more harshly than outgroup ones, in a time of concerns regarding the replicability of social psychological research (earp & trafimow, 2015; nosek et al., 2015). the studies reported below were conducted by independent teams in france and in the us. 
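the inflation mechanism described above can be illustrated with a short simulation (an illustrative sketch, not an analysis from the paper; the true effect size, per-group n, and simulation count are arbitrary choices): when only significant small-sample studies are retained, the average published effect size far exceeds the true one.

```python
# Sketch: simulate how a significance filter plus small samples inflates
# published effect-size estimates. All parameters are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def observed_d(n_per_group, true_d):
    """Run one two-group study; return (cohen's d, p value)."""
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_d, 1.0, n_per_group)
    t, p = stats.ttest_ind(b, a)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (b.mean() - a.mean()) / pooled_sd, p

true_d, n_small = 0.20, 20
sims = [observed_d(n_small, true_d) for _ in range(5000)]
all_d = np.array([d for d, _ in sims])
sig_d = np.array([d for d, p in sims if p < .05])

print(f"mean d, all studies:      {all_d.mean():.2f}")   # close to true d
print(f"mean d, significant only: {sig_d.mean():.2f}")   # heavily inflated
```

with a true d of 0.20 and n = 20 per group, only effects several times larger than the truth reach significance, so the significant-only mean is grossly inflated relative to the unconditional mean.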
method general method over the past four years, independent teams from france and the us conducted replication attempts of the effect according to which deviant ingroup members would be evaluated more negatively than deviant outgroup members, as predicted by the subjective group dynamics model. because the research teams were working independently of each other, our studies span several intergroup contexts and social norm violations, using different dependent variables. consequently, our studies constitute conceptual replication attempts with samples drawn from international populations. in each study, a deviant target was described in a vignette and participants were asked to judge this target on various dimensions (e.g., warmth, competence, social distance) through self-report measures (branscombe et al., 1993; khan & lambert, 1998; rullo, presaghi, & livi, 2015). all multi-item measures were mean-scored. the present studies were conducted with the aim of achieving a sample size of at least n = 50 per condition, as recommended by simmons, nelson and simonsohn (2013). after reporting the individual effects for each study, we conducted a mini meta-analysis of the aggregated results (goh, hall, & rosenthal, 2016) to estimate the size of the effect of group membership on deviance punishment. although some measures that were collected for exploratory purposes are not reported in this article, all data, syntaxes, supplementary information about procedures, and all measures for all studies can be found here: https://osf.io/392ha/. all studies were conducted in accordance with the 1964 helsinki declaration (wmo, 1964) and its later amendments, the ethical principles of the french code of ethics for psychologists (cncdp, 2012), and the 2016 apa ethical principles of psychologists and code of conduct (apa, 2017). this research was approved by the institutional review board [anonymized for peer review] (research protocol 20171027). 
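the scoring just described (mean-scoring multi-item indices and reporting their internal consistency α) can be sketched as follows; the data here are randomly generated stand-ins, not study data, and the item structure is invented for illustration.

```python
# Sketch: mean-score a multi-item judgment index and compute Cronbach's
# alpha (the reliability statistic reported for the indices in the text).
# Simulated stand-in data: items share a latent judgment factor plus noise.
import numpy as np

rng = np.random.default_rng(1)
n_participants, n_items = 120, 4
latent = rng.normal(size=(n_participants, 1))              # shared factor
items = latent + rng.normal(scale=0.8, size=(n_participants, n_items))

def cronbach_alpha(x):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

scale_scores = items.mean(axis=1)   # with missing data, np.nanmean would
alpha = cronbach_alpha(items)       # implement pairwise-style scoring
print(f"alpha = {alpha:.2f}")
```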
all studies are reported, and no subject was removed from the original databases. sample size for each study was determined a priori, without any extension based on initial looks at the results. however, because not all participants answered every question, we used pairwise deletion on the variables for which we did not have responses. consequently, the number of participants in each analysis fluctuates slightly around the total sample size. details for the six studies study 1 (us, 2017) we recruited 300 participants (60.00% male; mage = 34.65, sd = 10.26) from amazon's mechanical turk (mturk) ($0.10/minute). a sensitivity analysis showed that this sample enabled us to detect an effect size of ρ = 0.16 at 80% power. participants were sent a computerized questionnaire upon registering for the experiment. upon signing the informed consent, participants were told that they would read a short newspaper article that was printed shortly after an altercation that ostensibly happened during the 2016 summer olympics between united states and australian basketball fans. we manipulated between subjects whether the australian fans or the united states' fans initiated the altercation. after reading the fake article, participants first completed a 2-item feelings thermometer (r = .87) about the deviant fans ('to what extent do you feel favorable and warm toward the [american fans vs. australian fans] or unfavorable and cold toward them?' from -3 'very cold' to +3 'very warm'). then, they completed a 2-item (r = .91) measure of blame (e.g., 'to what extent do you blame the [american fans/australian fans] for the fight between the american and australian basketball fans?' and 'to what extent do you think the [american fans/australian fans] are responsible for the fight between the american and australian basketball fans?', from 1 'not at all' to 5 'very much'). 
finally, participants completed a punishment measure ('to what extent do you think the [american fans/australian fans] should be punished for their behavior?' from 1 'not at all' to 5 'very much') and provided a fine they would levy against the deviant fans, between $0 and $1000. study 2 (us, 2018) study 2 was a direct replication of study 1 with two exceptions. first, participants were recruited via the psychology subject pool at a us university instead of through mturk. second, because we did not have a direct manipulation check in study 1 on the perceived deviance of the target, we measured all variables within subjects. we also asked participants to what extent they judged the behaviors of each group of fans (australian and united states) to be peaceable versus hostile (-3 = very hostile to +3 = very peaceable), appropriate versus inappropriate (-3 = very inappropriate to +3 = very appropriate), and acceptable versus unacceptable (-3 = very unacceptable to +3 = very acceptable). these three items were averaged together to create the manipulation check (rus = .89, raustralia = .86). we recruited 199 undergraduate students (32.00% male; mage = 19.15, sd = 1.42). a sensitivity analysis showed that this sample enabled us to detect an effect size of ρ = 0.20 at 80% power. as in study 1, participants completed a computerized questionnaire that was sent to them via email, but participated in exchange for course credit. all other measures were the same: feelings thermometer about the deviant fans, rus = .67 and raustralia = .74; blame, rus = .80 and raustralia = .84; punishment; and a fine. the primary analyses were conducted on evaluations of the deviant fans in an independent-samples t-test, but the results do not change when analyzed as a mixed design (see the supplementary information on the osf for these analyses). therefore, these data were treated as if they came from a between-subjects design to fit with the rest of the studies. 
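the sensitivity analyses reported for these studies (e.g., ρ = 0.16 at 80% power in study 1) can be reproduced along these lines. this is a sketch: it root-solves the power function of the noncentral t distribution for the smallest detectable cohen's d, then converts d to a point-biserial correlation to match the ρ values in the text.

```python
# Sketch: smallest Cohen's d detectable at 80% power (alpha = .05,
# two-sided) for a two-group t-test, converted to a correlation (rho).
import numpy as np
from scipy import stats, optimize

def power_two_sample_t(d, n1, n2, alpha=0.05):
    df = n1 + n2 - 2
    ncp = d * np.sqrt(n1 * n2 / (n1 + n2))   # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    nct = stats.nct(df, ncp)                 # noncentral t distribution
    return 1 - nct.cdf(t_crit) + nct.cdf(-t_crit)

def minimum_detectable_rho(n1, n2, power=0.80):
    d = optimize.brentq(lambda d: power_two_sample_t(d, n1, n2) - power,
                        1e-6, 2.0)
    return d / np.sqrt(d ** 2 + 4)           # d -> point-biserial r

print(f"study 1 (n = 151/149): rho = {minimum_detectable_rho(151, 149):.2f}")
```

with the study 1 group sizes this yields ρ ≈ 0.16, matching the value reported above.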
the interaction between the instigating country and the within-subjects' evaluations of the fans on the manipulation check composite was significant, f(1, 195) = 181.48, p < .001, η2p = .48. as expected, when australians instigated the fight, participants rated the australians' behavior as more hostile/inappropriate/unacceptable (m = -1.72, sd = 1.29) than the american fans' behavior (m = -0.07, sd = 1.34), t(195) = 8.93, p < .001, 95% cimd[1.29, 2.02]. when the american fans instigated the fight, participants rated the american fans' behavior as more hostile/inappropriate/unacceptable (m = -1.75, sd = 1.20) than the australian fans' behavior (m = 0.06, sd = 1.29), t(195) = 10.11, p < .001, 95% cimd[1.46, 2.16]. study 3 (us, 2018) studies 2 and 3 were originally part of the same study. however, there was no interaction between outgroup country (i.e., russia versus australia) and any other independent variable. consequently, the two conditions were separated into their own samples for ease of reporting. please see the supplementary information on the osf for these analyses. study 3 was a direct replication of study 2 with one change: instead of the altercation between u.s. and australian fans, the altercation was described as happening between u.s. and russian fans. we recruited 209 undergraduate students (31.43% male; mage = 19.04, sd = 1.18). a sensitivity analysis showed that this sample enabled us to detect an effect size of ρ = 0.19 at 80% power. as in study 2, participants completed a computerized questionnaire that was sent to them via email in exchange for course credit. all measures were identical to those used in study 2: feelings thermometer, rus = .67 and rrussia = .72; blame, rus = .80 and rrussia = .83; punishment; and a fine. 
as in study 2, the same three items were used as a manipulation check (rus = .89, rrussia = .93), and again the interaction between the instigating country and the within-subjects' evaluation of the fans was significant, f(1, 207) = 117.04, p < .001, η2p = .36. when russians instigated the fight, participants rated the russian fans' behavior as more hostile/inappropriate/unacceptable (m = -1.58, sd = 1.43) than the american fans' behavior (m = 0.06, sd = 1.42), t(195) = 7.90, p < .001, 95% cimd[1.23, 2.04]. when the american fans instigated the fight, participants rated the american fans' behavior as more hostile/inappropriate/unacceptable (m = -1.41, sd = 1.33) than the russian fans' behavior (m = 0.10, sd = 1.30), t(195) = 7.36, p < .001, 95% cimd[1.11, 1.92]. study 4 (france, 2016) a paper-pencil questionnaire was distributed among 143 undergraduate students (21.70% male; mage = 19.20, sd = 1.23) in exchange for course credit. a sensitivity analysis showed that this sample enabled us to detect an effect size of ρ = 0.23 at 80% power. participants were asked to read the answers of a young vs. old target to a previous research questionnaire about homosexuality. in the 2016 european social survey (ess), 88.3% of french respondents indicated that they at least 'agreed' with the statement that 'gays and lesbians should be free to live life as they wish', indicating that homophobia is at least a somewhat deviant attitude. analyses were conducted using the ess online analysis tool, with weights applied according to the recommendations of the weighting european social survey data guide. participants were randomly assigned to one of the two vignette conditions (young ingroup member, 21 years old vs. old outgroup member, 50 years old), and the vignette consisted of the target's answers to a few questions about their opinion about homosexuality. 
participants first read the three words that came to the target's mind regarding homosexuality (i.e., 'problem for the society', 'pests', 'deviants'). then, participants read the target's answers to items like 'on a scale ranging from 1 to 10, what is your opinion about homosexuals' (the responses presented the target as homophobic). then, participants answered a 10-item (α = .94) judgment index constructed for the study in line with the literature on social judgment (e.g., 'i have a positive image of this student', 'i think i could get along with this student', from 1 'not at all' to 8 'completely'). study 5 (france, 2017) an online questionnaire was distributed among social network groups (facebook, no incentive). these social media groups were selected to be as neutral as possible, so we used trade and sales advertisement groups. we recruited 120 participants from the general french population (9.20% male; mage = 30.61, sd = 10.86). a sensitivity analysis showed that this sample enabled us to detect an effect size of ρ = 0.25 at 80% power. participants were told that they would take part in an online study about their capacity to guess the personality of others. they were asked to read interview excerpts from a target (an anonymized french vs. belgian person) describing his/her personality (gender was not specified). the deviant target's description was: 'after having formed an impression of something, i often find it difficult to modify it. in fact, i usually do not change the way i think even after a conversation, because i always have the feeling that i'm right'. in the 2016 ess, 92.1% of french respondents indicated at least 'a little like me' to the question 'it's important to be humble and modest, not draw attention', indicating that hubris and dogmatism are likely perceived to be deviant attitudes. participants were randomly assigned to one of two vignette conditions (ingroup french target vs. outgroup belgian target). 
finally, participants answered measures (7-point likert scales, from -3 'not at all' to +3 'completely') of warmth (4 items: 'sweet', 'caring', 'amusing', 'funny'; α = .88) and competence (4 items: 'perfectionist', 'tenacious', 'thorough', 'unshakeable'; α = .68) for the target (bonetto, varet, & troïan, 2019; bonetto, pichot, girandola, & bonnardel, 2020), and a social distance scale (bogardus, 1933). study 6 (france, 2017) study 6 used the same deviant target description as study 5. we recruited 161 undergraduate students (9.30% male; mage = 20.33, sd = 3.72; no incentive). a sensitivity analysis showed that this sample enabled us to detect an effect size of ρ = 0.22 at 80% power. the group membership manipulation was changed to reflect the student population from which we sampled (21-year-old student target, ingroup vs. 50-year-old employee target, outgroup), and participants were randomly assigned to one of the two vignette conditions. after reading about the dogmatic and hubristic deviant target, participants answered a 4-item (α = .83) judgment index (e.g., 'in your opinion, x gives a good image of him/herself', from 1 'completely disagree' to 9 'completely agree'; lo monaco, piermattéo, guimelli, & ernst-vintila, 2011). results all dependent variables were z-scored, and all analyses were independent-samples t-tests. across all studies and all dependent measures, we did not find any evidence for an effect of group membership on deviance punishment. more precisely, we found no evidence for the prediction that deviant ingroup members would be evaluated more harshly than deviant outgroup members (see table 1). therefore, we conducted a mini meta-analysis of our results (goh et al., 2016) to limit the risk of making type ii errors regarding the existence of this effect in our datasets. we meta-analyzed the results of the present six studies using the major package for jamovi (hamilton, 2018). 
means and standard deviations from all studies were weighted by sample size for their respective experimental group (restricted maximum-likelihood estimation with standardized mean differences). the final sample size was n = 1132 (mage = 23.83, sdage = 4.78, 27.27% male). as can be seen in figure 1, we found no evidence for the presence of the effect of group membership on deviance punishment in our data, b = -0.01, 95% ci[-0.07, 0.06], se = 0.03, z = -0.15, p = .88. the effect size was d = -0.01, 95% ci[-0.07, 0.06], and model aic = -13.61, log-likelihood = 8.81. as an alternative way of conducting the meta-analysis, a mixed model was computed with dependent variable as a factor nested within studies within countries, according to the following formula: judgment ~ condition + (1 | dv/study/country). results from this analysis and scripts can be seen in the 'supplementary analysis' section (https://osf.io/5mqgw/) and converge in finding no support for substantial differences between in- and outgroup deviant targets, t = 0.01, p = .99. discussion this series of six conceptual replications, independently conducted by two laboratories in two different countries, did not provide evidence for the effect of group membership on deviance punishment (p = .88, d = -0.01), despite being sufficiently powered. as earp and trafimow (2015, p. 9) put it, 'if a series of replications is carried out, independently by different labs, and deliberately tailored to the parameters and conditions so described – yet they reliably fail to produce the original result – then this should be considered informative'. these results thus provide null effects that may be used by investigators to identify boundary conditions of a deviance punishment asymmetry between ingroup and outgroup. addressing stimulus sampling issues, we tested the hypothesis across a wide range of dimensions along the attitudinal space, hence the high variability across dependent variables (fiagbenu, proch, & kessler, 2021; wells & windschitl, 1999). 
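the pooling step can be sketched as follows. this is a simplified stand-in for the reml fit reported in the text: it uses a dersimonian-laird random-effects estimator applied to the first-listed effect of each study in table 1, so its pooled value will not exactly reproduce the full model's b = -0.01.

```python
# Sketch: random-effects meta-analysis of standardized mean differences
# (DerSimonian-Laird estimator, a simpler stand-in for the REML model
# reported in the text). Inputs: first-listed measure of each study.
import numpy as np

def random_effects_meta(d, n1, n2):
    d, n1, n2 = map(np.asarray, (d, n1, n2))
    # Large-sample variance of Cohen's d
    v = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    w = 1 / v
    d_fixed = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - d_fixed) ** 2)            # heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)       # between-study variance
    w_star = 1 / (v + tau2)
    est = np.sum(w_star * d) / np.sum(w_star)
    se = np.sqrt(1 / np.sum(w_star))
    return est, est - 1.96 * se, est + 1.96 * se

# first-listed d and group sizes per study, from table 1
d  = [-0.10, 0.08, 0.02, 0.27, 0.19, -0.14]
n1 = [151, 104, 105, 70, 61, 79]
n2 = [149, 95, 104, 73, 59, 82]
est, lo, hi = random_effects_meta(d, n1, n2)
print(f"pooled d = {est:.2f}, 95% ci [{lo:.2f}, {hi:.2f}]")
```

as in the reported analysis, the pooled estimate is close to zero with a confidence interval spanning it.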
despite this variety of dependent variables, the present studies highlight consistent failures to replicate the effect of group membership on deviance punishment. despite their consistent results, the studies also contain a number of limitations. first, an argument can be made that sufficiently powered direct replication studies are the only way to establish the presence of an effect (doyen, klein, simons, & cleeremans, 2014). however, such arguments typically fail to consider the cultural context within which an original study was conducted (zwaan, etz, lucas, & donnellan, 2018). indeed, some experiments can be difficult, or even impossible, to replicate directly (crandall & sherman, 2016). consequently, conceptual replication was the only avenue to bear on the purported psychological effect. second, although direct replications can provide precise parameter estimates, we can never be sure that those are not artifacts due to the use of a specific paradigm and materials. replicating an effect independent of its operationalization is the only way to estimate its 'true' size, to make sure it exists as such, and to establish that the effect is generalizable (crandall & sherman, 2016; campbell & fiske, 1959). third, the aggregation of such methodologically different studies is likely to provide a biased estimate of the true effect size because of the combined noise from the use of diverse methods (huf et al., 2011). nonetheless, the nature of the present studies allowed us to limit other biases typical of meta-analyses. all replications were conducted by independent teams (earp & trafimow, 2015; berk & freedman, 2003) with different sample sizes that ranged from medium to high, which decreases the likelihood of small-study effects (greco, zangrillo, biondi-zoccai, & landoni, 2013). 
fourth, although our results cast doubt on the claims made by previous studies regarding deviance punishment, we cannot speak to the veracity of the more general claim regarding the extremization of judgments toward ingroup targets versus outgroup targets, because we focused exclusively on judgments of deviant targets. in other words, these were not replications of the well-known black sheep effect (marques et al., 1988; marques & paez, 1994). indeed, that effect often refers to an interaction, in that researchers typically manipulate both whether a target is an ingroup or an outgroup member and whether the target behaves counter-normatively or pro-normatively. we focused exclusively on the main effect of group membership on deviance punishment in this paper.

table 1
study  measure              n ingr  n outgr  m ingr (sd)   m outgr (sd)   t(df)        p    d
1      feeling thermometer  151     149      0.05 (1.01)   -0.47 (1.00)   -0.81 (298)  .42  -0.10
1      blame                151     149      -0.00 (1.01)  0.00 (0.99)    0.52 (298)   .96  0.01
1      punishment           151     149      -0.03 (1.02)  0.03 (0.98)    0.52 (298)   .60  0.06
1      fine                 151     148      -0.05 (0.95)  0.05 (1.05)    0.86 (297)   .39  0.09
2      feeling thermometer  104     95       -0.08 (1.03)  -0.01 (1.00)   0.53 (197)   .60  0.08
2      blame                103     95       0.09 (1.02)   -0.11 (0.94)   -1.41 (196)  .16  -0.20
2      punishment           103     95       0.06 (1.03)   -0.12 (0.94)   -1.30 (196)  .19  -0.19
2      fine                 103     95       0.08 (1.06)   -0.15 (0.90)   -1.62 (196)  .11  -0.23
3      feeling thermometer  105     104      0.06 (0.93)   0.04 (1.04)    -0.15 (207)  .88  0.02
3      blame                105     104      -0.02 (0.96)  0.04 (1.07)    0.42 (207)   .68  0.06
3      punishment           105     104      0.03 (0.91)   0.03 (1.11)    0.03 (207)   .98  0.00
3      fine                 104     104      -0.02 (0.93)  0.08 (1.09)    0.70 (206)   .48  0.10
4      judgment index       70      73       -0.14 (0.83)  -0.13 (1.13)   1.61 (141)   .11  0.27
5      warmth               61      59       -0.37 (0.89)  -0.20 (0.90)   1.04 (118)   .30  0.19
5      competence           61      59       0.19 (1.00)   0.37 (0.89)    1.02 (118)   .31  0.19
5      social distance      61      59       0.33 (1.06)   0.17 (1.02)    0.85 (118)   .40  -0.16
6      judgment index       79      82       -0.18 (0.85)  -0.30 (0.98)   0.86 (159)   .39  -0.14

the present contribution paves the way for potentially 
important theoretical advances for deviance management research in the context of intergroup processes. As Earp and Trafimow (2015) argue, null findings from conceptual replications have specific theoretical interest: they can establish the boundary conditions of an effect and help proponents of the theory specify under which conditions, and with which kinds of materials, the effect should be obtained. For instance, the effect of group membership on deviance punishment might appear only when ingroup identification among participants is high, which would be a prerequisite condition for obtaining it. (Strength of U.S. identification was collected for Studies 2 and 3. This was not part of our original plan but was suggested by a reviewer. Supplementary analyses indicate that the interactions between group identification and instigator on the dependent variables were either not significant or in the opposite direction to that predicted by subjective group dynamics; see the online materials for all supplementary analyses.) Another methodological limitation is that our studies did not include measures of social identification with the in- and outgroup as manipulation checks. One reason for this choice was an attempt to closely replicate protocols from the literature. For instance, Marques et al. (1989) did not include any measure of social identification in their studies despite claiming a moderation by this construct. Furthermore, when social identification is included, it generally taps the ingroup only (e.g., Pinto et al., 2011), and those manipulation checks do show that participants display ingroup identification over and above the scale's midpoint (Pinto et al., 2011, Studies 1-2). This is to be expected, if only because this identity is made salient by the presence of the survey item, a phenomenon at the basis of self-categorization paradigms (see Reynolds, Turner, Haslam, & Ryan, 2001). 
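The supplementary moderation analyses mentioned above amount to testing a group x identification interaction term in a regression. A minimal sketch with simulated data (all variable names and effect sizes are hypothetical, not taken from the article), assuming ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, n).astype(float)  # 0 = outgroup, 1 = ingroup instigator
ident = rng.normal(0.0, 1.0, n)              # standardised group identification

# Simulate a punishment score with a true moderation effect of 0.5:
# the group effect grows with identification.
punish = 0.2 * group + 0.1 * ident + 0.5 * group * ident + rng.normal(0.0, 1.0, n)

# Design matrix: intercept, two main effects, and the interaction term.
X = np.column_stack([np.ones(n), group, ident, group * ident])
beta, *_ = np.linalg.lstsq(X, punish, rcond=None)
interaction_estimate = beta[3]  # estimates the moderation effect (true value 0.5)
```

A non-significant or sign-reversed estimate of this coefficient is what the authors report finding in their supplementary analyses.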
Although the absence of proper manipulation checks for social identity did not prevent researchers from routinely obtaining group membership effects, stronger tests of the theory should include them and assess their potential moderating effect. Moreover, although previous studies highlighted a host of moderators (e.g., social identification, within-group membership status; Pinto et al., 2010; Abrams, Travaglino, Marques, Pinto, & Levine, 2018), these typically only specify when we should expect an attenuation or exacerbation of the effect. Therefore, our studies provide evidence for when a core prediction of the subjective group dynamics model might not be corroborated, and some of these well-known moderators could actually be necessary conditions for the effect studied here.

Figure 1

Finally, as noted previously, changes in the cultural context within which the effect of group membership on deviance punishment was previously observed should also be considered. More precisely, deviance punishment may have changed over time, and societal-level changes may explain inconsistencies between previous studies on deviance punishment and our attempts to replicate the effect (this kind of interpretation has been considered for stereotype threat; Lewis & Michalak, 2019; see also Muthukrishna, Henrich, & Slingerland, 2020). This series of studies demonstrates that the effect of group membership on deviance punishment might be more sensitive to contextual factors than previously considered. The identification of parameter boundaries is of paramount importance for better theory specification (Earp & Trafimow, 2015). Thus, far from invalidating the basic tenets of subjective group dynamics, these results indicate that it might be a fruitful endeavor to conduct further replications of deviance management studies to clarify what these parameters are.

Author contact

Corresponding author: Eric Bonetto (bonetto.ericbw@gmail.com). 
Conflict of interest and funding

No conflict of interest or specific source of funding.

Author contributions

All authors contributed equally to this research.

Open science practices

This article earned the Open Data and the Open Materials badge for making the data and materials openly available. The studies were not preregistered. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Abrams, D. (2010). Deviance. In J. M. Levine & M. A. Hogg (Eds.), Encyclopedia of group processes and intergroup relations (pp. 206-211). Sage.
Abrams, D., Hogg, M. A., & Marques, J. M. (2005). A social psychological framework for understanding social inclusion and exclusion. Psychology Press.
Abrams, D., Rutland, A., & Cameron, L. (2003). The development of subjective group dynamics: Children's judgments of normative and deviant in-group and out-group individuals. Child Development, 74, 1840-1856. https://doi.org/10.1046/j.1467-8624.2003.00641.x
Abrams, D., Travaglino, G. A., Marques, J. M., Pinto, I., & Levine, J. M. (2018). Deviance credit: Tolerance of deviant ingroup leaders is mediated by their accrual of prototypicality and conferral of their right to be supported. Journal of Social Issues, 74, 36-55. https://doi.org/10.1111/josi.2018.74.issue-1/issuetoc
Albarracin, D., Johnson, B. T., & Zanna, M. P. (2005). The handbook of attitudes. Lawrence Erlbaum Associates.
Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28, 1547-1562. https://doi.org/10.1177/0956797617723724
Bendor, J., & Swistak, P. (2001). The evolution of norms. American Journal of Sociology, 106, 1493-1545. https://doi.org/10.1086/321298
Berk, R., & Freedman, D. (2003). Statistical assumptions as empirical commitments. In T. G. 
Blomberg & S. Cohen (Eds.), Law, punishment and social control: Essays in honor of Sheldon Messinger (pp. 235-254). Aldine de Gruyter.
Bettencourt, B. A., Manning, M., Molix, L., Schlegel, R., Eidelman, S., & Biernat, M. (2015). Explaining extremity in evaluation of group members: Meta-analytic tests of three theories. Personality and Social Psychology Review, 20, 49-74. https://doi.org/10.1177/1088868315574461
Bogardus, E. S. (1933). A social distance scale. Sociology & Social Research, 17, 265-271.
Bonetto, E., Varet, F., & Troïan, J. (2019). To resist or not to resist? Investigating the normative features of resistance to persuasion. Journal of Theoretical Social Psychology, 3, 167-175. https://doi.org/10.1002/jts5.44
Bonetto, E., Pichot, N., Girandola, F., & Bonnardel, N. (2020). The normative features of creativity: Creative individuals are judged to be warmer and more competent. The Journal of Creative Behavior. https://doi.org/10.1002/jocb.477
Branscombe, N. R., Wann, D. L., Noel, J. G., & Coleman, J. (1993). In-group or out-group extremity: Importance of the threatened social identity. Personality and Social Psychology Bulletin, 19, 381-388. https://doi.org/10.1177/0146167293194003
Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365-376.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105. https://doi.org/10.1037/h0046016
Castano, E., Paladino, M. P., Coull, A., & Yzerbyt, V. Y. (2002). Protecting the ingroup stereotype: Ingroup identification and the management of deviant ingroup members. British Journal of Social Psychology, 41, 365-385. https://doi.org/10.1348/014466602760344269
Coull, A., Yzerbyt, V. Y., Castano, E., Paladino, M. P., & Leemans, V. (2001). 
Protecting the ingroup: Motivated allocation of cognitive resources in the presence of threatening ingroup members. Group Processes & Intergroup Relations, 4, 327-339. https://doi.org/10.1177/1368430201004004003
Crandall, C. S., & Sherman, J. W. (2016). On the scientific superiority of conceptual replications for scientific progress. Journal of Experimental Social Psychology, 66, 93-99. https://doi.org/10.1016/j.jesp.2015.10.002
Cuddy, A. J. C., Fiske, S. T., & Glick, P. (2008). Warmth and competence as universal dimensions of social perception: The stereotype content model and the BIAS map. Advances in Experimental Social Psychology, 40, 61-149. https://doi.org/10.1016/s0065-2601(07)00002-0
Douglas, K. (2010). Fads and fashions. In J. M. Levine & M. A. Hogg (Eds.), Encyclopedia of group processes and intergroup relations (pp. 269-272). Sage.
Doyen, S., Klein, O., Simons, D. J., & Cleeremans, A. (2014). On the other side of the mirror: Priming in cognitive and social psychology. Social Cognition, 32(Supplement), 12-32. https://doi.org/10.1521/soco.2014.32.supp.12
Earp, B. D., & Trafimow, D. (2015). Replication, falsification, and the crisis of confidence in social psychology. Frontiers in Psychology, 6, 621. https://doi.org/10.3389/fpsyg.2015.00621
Fiagbenu, M. E., Proch, J., & Kessler, T. (2021). Of deadly beans and risky stocks: Political ideology and attitude formation via exploration depend on the nature of the attitude stimuli. British Journal of Psychology, 112, 342-357. https://doi.org/10.1111/bjop.12430
Fiske, S. T., Gilbert, D. T., & Lindzey, G. (2010). Handbook of social psychology, Vol. 2. Wiley & Sons.
Goh, J. X., Hall, J. A., & Rosenthal, R. (2016). Mini meta-analysis of your own studies: Some arguments on why and a primer on how. Social and Personality Psychology Compass, 10, 535-549. https://doi.org/10.1111/spc3.12267
Greco, T., Zangrillo, A., Biondi-Zoccai, G., & Landoni, G. (2013). Meta-analysis: Pitfalls and hints. Heart, Lung and Vessels, 5, 219-225. 
Hamilton, W. K. (2018). MAJOR: Meta Analysis JamOvi R. For the jamovi project. Available from: http://kylehamilton.com/#publicationsselected
Hogg, M. A., & Reid, S. A. (2006). Social identity, self-categorization, and the communication of group norms. Communication Theory, 16, 7-30. https://doi.org/10.1111/j.1468-2885.2006.00003.x
Huf, W., Kalcher, K., Pail, G., Friedrich, M. E., Filzmoser, P., & Kasper, S. (2011). Meta-analysis: Fact or fiction? How to interpret meta-analyses. The World Journal of Biological Psychiatry, 12, 188-200. https://doi.org/10.3109/15622975.2010.551544
Hutchison, P., & Abrams, D. (2003). Ingroup identification moderates stereotype change in reaction to ingroup deviance. European Journal of Social Psychology, 33, 497-506. https://doi.org/10.1002/ejsp.157
Khan, S., & Lambert, A. J. (1998). Ingroup favoritism versus black sheep effects in observations of informal conversations. Basic and Applied Social Psychology, 20, 263-269. https://doi.org/10.1207/s15324834basp20043
Lapinski, M. K., & Rimal, R. N. (2005). An explication of social norms. Communication Theory, 15, 127-147. https://doi.org/10.1111/j.1468-2885.2005.tb00329.x
Levine, J. M., & Hogg, M. A. (2010). Encyclopedia of group processes and intergroup relations, Vol. 1. Sage.
Lewis Jr., N., & Michalak, N. M. (2019). Has stereotype threat dissipated over time? A cross-temporal meta-analysis. https://doi.org/10.31234/osf.io/w4ta2
Lo Monaco, G., Piermattéo, A., Guimelli, C., & Ernst-Vintila, A. (2011). Using the black sheep effect to reveal normative stakes: The example of alcohol drinking contexts. European Journal of Social Psychology, 41, 1-5. https://doi.org/10.1002/ejsp.764
Marques, J. M. (2010). Black sheep effect. In J. M. Levine & M. A. Hogg (Eds.), Encyclopedia of group processes and intergroup relations, Vol. 1 (pp. 55-57). Sage.
Marques, J. M., Abrams, D., Paez, D., & Hogg, M. A. (2001). Social categorization, social identification, and rejection of deviant group members. In M. A. Hogg & R. S. 
Tindale (Eds.), Blackwell handbook of social psychology: Group processes, Vol. 3 (pp. 400-424). Blackwell.
Marques, J. M., & Paez, D. (1994). The "black sheep effect": Social categorization, rejection of ingroup deviates, and perception of group variability. In W. Stroebe & M. Hewstone (Eds.), European review of social psychology, Vol. 5 (pp. 38-68). John Wiley.
Marques, J. M., Paez, D., & Abrams, D. (1998). Social identity and intragroup differentiation as subjective social control. In S. Worchel, J. F. Morales, D. Paez, & J.-C. Deschamps (Eds.), Social identity: International perspectives (pp. 124-142). Sage.
Marques, J. M., Yzerbyt, V. Y., & Leyens, J. P. (1988). The "black sheep effect": Extremity of judgments towards ingroup members as a function of group identification. European Journal of Social Psychology, 18, 1-16. https://doi.org/10.1002/ejsp.2420180102
Muthukrishna, M., Henrich, J., & Slingerland, E. (2020). Psychology as a historical science. Annual Review of Psychology, 72, 717-749. https://doi.org/10.1146/annurev-psych-082820-111436
Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., et al. (2015). Promoting an open research culture. Science, 348, 1422-1425. https://doi.org/10.1126/science.aab2374
Pinto, I. R., Marques, J. M., Levine, J. M., & Abrams, D. (2010). Membership status and subjective group dynamics: Who triggers the black sheep effect? Journal of Personality and Social Psychology, 99, 107-119. https://doi.org/10.1037/a0018187
Postmes, T., & Jetten, J. (2006). Individuality and the group: Advances in social identity. Sage.
Reynolds, K. J., Turner, J. C., Haslam, S. A., & Ryan, M. K. (2001). The role of personality and group factors in explaining prejudice. Journal of Experimental Social Psychology, 37, 427-434. https://doi.org/10.1006/jesp.2000.1473
Rimal, R. N., & Real, K. (2003). Understanding the influence of perceived norms on behaviors. Communication Theory, 13, 184-203. 
https://doi.org/10.1111/j.1468-2885.2003.tb00288.x
Rullo, M., Presaghi, F., & Livi, S. (2015). Reactions to ingroup and outgroup deviants: An experimental group paradigm for black sheep effect. PLOS ONE, 10, e0125605. https://doi.org/10.1371/journal.pone.0125605
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2013). Life after p-hacking. Meeting of the Society for Personality and Social Psychology, New Orleans, LA, 17-19. https://doi.org/10.2139/ssrn.2205186
Shin, G. W., Freda, J., & Yi, G. (1999). The politics of ethnic nationalism in divided Korea. Nations and Nationalism, 5, 465-484.
Stapel, D. A., Koomen, W., & Spears, R. (1999). Framed and misfortuned: Identity salience and the whiff of scandal. European Journal of Social Psychology, 29, 397-402. https://doi.org/10.1002/(sici)1099-0992(199903/05)29:2/3<397::aid-ejsp936>3.0.co;2-6
Tajfel, H., Billig, M. G., Bundy, R. P., & Flament, C. (1971). Social categorization and intergroup behaviour. European Journal of Social Psychology, 1, 149-178. https://doi.org/10.1002/ejsp.2420010202
Turner, J. C., Brown, R. J., & Tajfel, H. (1979). Social comparison and group interest in ingroup favouritism. European Journal of Social Psychology, 9, 187-204. https://doi.org/10.1002/ejsp.2420090207
Wang, L., Zheng, J., Meng, L., Lu, Q., & Ma, Q. (2016). Ingroup favoritism or the black sheep effect: Perceived intentions modulate subjective responses to aggressive interactions. Neuroscience Research, 108, 46-54. https://doi.org/10.1016/j.neures.2016.01.011
Wells, G. L., & Windschitl, P. D. (1999). Stimulus sampling and social psychological experimentation. Personality and Social Psychology Bulletin, 25, 1115-1125. https://doi.org/10.1177/01461672992512005
Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2018). Making replication mainstream. Behavioral and Brain Sciences, 41, 1-61. 
https://doi.org/10.1017/s0140525x17001972

Meta-Psychology, 2022, vol 6, MP.2021.2923
https://doi.org/10.15626/mp.2021.2923
Article type: Review Protocol
Published under the CC-BY4.0 license
Open data: Not applicable
Open materials: Not applicable
Open and reproducible analysis: Not applicable
Open reviews and editorial process: Yes
Preregistration: Yes
Edited by: Thomas Nordström
Reviewed by: Maude Johansson, Cody Christopherson
Analysis reproduced by: Not applicable
All supplementary files can be accessed at OSF: https://doi.org/10.17605/osf.io/hxvb5

Current understanding of maternal healthcare acceptability from patients' perspectives: A scoping review protocol

Joy Blaise Bucyibaruta (University of Pretoria, Faculty of Health Sciences), Leah Maidment (University of Pretoria, Faculty of Health Sciences), Carl August Daniel Heese (University of Pretoria, Faculty of Health Sciences), Mmapheko Doriccah Peu (University of Pretoria, Faculty of Health Sciences), Lesley Bamford (University of Pretoria, Faculty of Health Sciences; National Department of Health, South Africa), Annatjie Elizabeth van der Wath (University of Pretoria, Faculty of Health Sciences), Estelle Grobler (University of Pretoria, Faculty of Health Sciences), Alfred Musekiwa (University of Pretoria, Faculty of Health Sciences)

Abstract

The importance of the concept of healthcare acceptability cannot be overlooked in the health sciences, including psychology; yet it remains controversial and poorly understood across health disciplines. The concept cuts across all health disciplines and refers to human behaviour such as attitude, trust, and respect in the interactions between patients and health professionals. Many studies have been published on the acceptability of maternal healthcare, but there is no consensus on how it is defined and conceptualised. 
Thus, this study aims to review the existing literature to shed light on the definition and conceptualisation of maternal healthcare acceptability from the patients' perspectives. The study will apply a scoping review to reach this broad purpose. The search for relevant articles in electronic databases and grey literature will be guided by a search strategy developed from the eligibility criteria. Two researchers will independently screen the retrieved articles using the Rayyan software and chart data from the included articles; an agreement of 80% between them will be considered appropriate. The study will provide a general interpretation of key findings in line with the available evidence and consistent with the research purpose. The researchers will discuss the study's limitations and propose potential implications and future research projects.

Keywords: acceptability, attitudes, community interactions, expectations, experiences, healthcare provider interactions, healthcare systems and policy interactions, maternal healthcare, perceptions, support

Introduction

Acceptability of healthcare is of steadily increasing relevance in the health sciences, including psychology, for improving healthcare service delivery to the population (Cameron et al., 2017; Sekhon, Cartwright, & Francis, 2017; Shaw, Larkin, & Flowers, 2014). The concept of acceptability of healthcare cuts across all countries and all healthcare disciplines, with undeniable significance for planning, implementing and monitoring healthcare interventions (Cameron et al., 2017; Shaw et al., 2014). Nevertheless, acceptability of healthcare remains poorly defined and conceptualised (Bucyibaruta et al., 2018; Sekhon et al., 2017). Two theories are used in the literature to describe acceptability of healthcare: unitary construct and multi-construct (Sekhon et al., 2017). 
However, a growing body of evidence supports the multi-construct theory (Bucyibaruta et al., 2018; Sekhon et al., 2017); therefore, this study will approach acceptability of healthcare as a multi-construct concept. Acceptability of healthcare reflects the quality of interactions between the patient and the community, the health provider, or the health system (Gilson, 2007). Those interactions are described by terms conveying beliefs and perceptions of received or anticipated healthcare (Dyer, Owens, & Robinson, 2016; Murphy & Gardner, 2019), such as respect, privacy, confidentiality, trust, understanding, and support. These terms have overextended meanings, and some researchers have proposed categorising them under specific constructs of acceptability by applying best-fit theory (Gilson, 2007; McIntyre, Thiede, & Birch, 2009). The nature of those interactions is clearly multifaceted, making acceptability of healthcare a complex concept. As a consequence, acceptability of healthcare remains a controversial topic without a consensual definition or a shared conceptual framework within the wider community of health professionals. This situation calls for more research to inform a uniform understanding of healthcare acceptability and its practical implications. The concept of acceptability of healthcare (also referred to as cultural access) was introduced in the early 1980s as one of the dimensions of access to healthcare (Penchansky & Thomas, 1981). It is worth noting that affordability (financial access) and availability (physical access) are the other two dimensions of access to healthcare widely described in the literature (Bucyibaruta et al., 2018; McIntyre et al., 2009; Silal, Penn-Kekana, Harris, Birch, & McIntyre, 2012). 
Acceptability was originally defined as "the best fit fulfilment of healthcare expectations between the patient and the healthcare system" (Penchansky & Thomas, 1981). Since then, significant effort has been made to refine the definition of acceptability of healthcare (Dillip et al., 2012; Donabedian, 2002; Kozarewicz, 2014; Kyei-Nimakoh, Carolan-Olah, & McCann, 2017; Rothstein et al., 2016; Russell et al., 2013; Sekhon, Cartwright, & Francis, 2018; Staniszewska et al., 2010). For example, acceptability of healthcare was described as "conformity to the wishes, desires and expectations of patients and responsible members of their families" (Donabedian, 1993). Some authors have referred to acceptability of healthcare as the "social and cultural distance between health care systems and their users" (Hausmann-Muela, Ribera, & Nyamongo, 2003). Acceptability of healthcare was also reported as "individual perceptions influenced by social representations and modified in social interactions, suggesting a 'fit' or match between providers and clients with regard to their understandings of disease" (Dillip et al., 2012). Other authors have characterised acceptability of healthcare as the "attitudes and beliefs of consumers about the health care system to the personal and practice characteristics of health care providers" (Russell et al., 2013). Acceptability of healthcare was furthermore defined as a "multi-faceted construct reflecting the extent to which people delivering or receiving a healthcare intervention consider it to be appropriate, based on anticipated or experienced cognitive and emotional responses to the intervention" (Sekhon et al., 2017). These are only some examples of the divergent definitions of healthcare acceptability in the literature; this confused state of affairs calls for comprehensive clarification within the broader community of health researchers. 
The lack of a shared understanding of how acceptability of healthcare is defined and conceptualised impedes its application to specific healthcare services such as maternal healthcare. While there are common characteristics shared by healthcare services in general, there are distinctive aspects that make maternal healthcare unique as far as acceptability is concerned. For example, antenatal, delivery and post-natal healthcare services are unique to maternal healthcare. Thus, one would expect a dedicated description of maternal healthcare acceptability to match the specific expectations and experiences of mothers attending antenatal, delivery and post-natal healthcare services. Many scholars have published on acceptability of maternal healthcare (Al-Mujtaba et al., 2020; Balde et al., 2017; Cummins et al., 2021; Feinberg, Smith, & Naik, 2009; Grant et al., 2017; Påfs et al., 2015; Sripad, Warren, Hindin, & Karra, 2019). However, those researchers had different conceptions of maternal healthcare acceptability, and a definite definition and conceptualisation of maternal healthcare acceptability is still to be agreed upon amongst researchers. Women often go through psychological distress resulting from various stressors and demands that are difficult to cope with during pregnancy, delivery and the immediate postpartum period (Staneva, Bogossian, & Wittkowski, 2015; Traylor, Johnson, Kimmel, & Manuck, 2020). This situation can shape the acceptability of maternal healthcare through how various health professionals (midwives, doctors, psychologists or psychiatrists) assist the most affected women (Alderdice, McNeill, & Lynn, 2013; Hadfield & Wittkowski, 2017). Nevertheless, the concept of acceptability of maternal healthcare remains too poorly understood by health researchers, including psychology researchers, to advance and support appropriate health practice in such circumstances (Sekhon et al., 2018). 
Moreover, there is a paucity of evidence about the contextual understanding of how acceptability of maternal healthcare is defined and conceptualised in the existing literature. Thus, this study will review the existing literature to shed light on how the concept of maternal healthcare acceptability is defined and conceptualised. The specific objectives are:

1. To identify the gaps in defining the concept of maternal healthcare acceptability.
2. To explore the contextual understanding of maternal healthcare acceptability.
3. To ascertain the practical implications of maternal healthcare acceptability.

Methods

This study is embedded in a bigger PhD research project applying mixed methods, including a scoping review, as presented to and approved by the Faculty of Health Sciences Research Ethics Committee, University of Pretoria. The study will thus be conducted in observance of all ethical and legal considerations as per research ethics certificate reference no. 545/2019. Moreover, this protocol article is submitted as a registered report, and the review will be conducted once in-principle acceptance (IPA) is provided by the journal Meta-Psychology. The protocol is also registered on the Open Science Framework (https://osf.io/s3ymu) to increase research transparency and to avoid unintended duplication of reviews (https://osf.io/gxp3c/). The study will be conducted in line with the registered report guidelines and will be subject to the ethical and policy considerations of Meta-Psychology, which will issue the IPA for this project.

Study design

A scoping review is an appropriate method to organise and summarise existing literature in an orderly and replicable way, to identify gaps in the body of literature, and to answer a broader research question (Armstrong, Hall, Doyle, & Waters, 2011; Dijkers, 2015). This scoping review will be conducted in six steps, as described by Arksey and O'Malley (2005). 
Those steps consist of: (i) identifying the research question, (ii) identifying relevant studies, (iii) selecting eligible studies, (iv) charting the data, (v) collating and summarising the results, and (vi) a consultation exercise with experts in the field (optional). The latter will be included to improve the practical usefulness of the findings. This study will also be guided by the scoping review framework developed by the Joanna Briggs Institute, to enhance methodological quality (Tricco et al., 2018).

Identifying the research questions

In order to establish the current understanding of how acceptability of maternal health services is defined and conceptualised in the existing literature, this scoping review will seek to answer the following questions:

1. How is maternal healthcare acceptability defined and conceptualised?
2. What are the contextual understandings of maternal healthcare acceptability?
3. What are the practical implications of the concept of maternal healthcare acceptability?

Identifying relevant studies

The researchers endeavour to be as comprehensive as possible in identifying relevant studies and documents suitable for answering the research questions. Thus, the principal investigator (PI) and two co-authors will independently conduct online searches for relevant articles in existing databases, including MEDLINE/PubMed, the Cochrane Library, Google Scholar and CINAHL. The researchers will apply a snowball strategy, checking the reference lists of retrieved studies as well as 'cited by' articles to identify additional studies. 
Furthermore, the researchers will search relevant grey literature: dissertations/theses (ProQuest Dissertations & Theses Global), conference abstracts (Embase conference abstracts, conference proceedings), PowerPoint presentations, magazines, the websites of health organisations such as the WHO and departments of health in different countries, the Google website, and unpublished work on the topic. A librarian has been recruited to guide information retrieval from relevant databases and the other steps of this scoping review. The identification of relevant studies will be guided by eligibility criteria and a search strategy developed by the PI, who will ensure that the eligibility criteria and search strategy are understood by the other two researchers involved in identifying relevant studies before this activity is undertaken. Identification of relevant studies will be iterative in nature. Once about 1,000 articles have been retrieved, the researchers will move on to the other steps of the scoping review; however, identification of additional relevant studies may resume based on the preliminary findings, consensus among the researchers of this study, or recommendations from the experts in the consultation exercise.

Selection of eligible studies

An 'open' strategy will be adopted to allow for the inclusion of any and all sources existing in the literature on acceptability of maternal healthcare. However, only sources in English will be included, because English is the common language of the researchers who will screen the retrieved articles. The concept of acceptability of healthcare was first described in 1981 (Penchansky & Thomas, 1981); thus, the selection process will include scientific works on this topic published from 1981 up to now (2022). 
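Records retrieved from several overlapping databases must be collapsed before screening (the PI's de-duplication step). A minimal sketch, keying on DOI where available and otherwise on a normalised title; the record fields and keying rule are our own illustrative assumptions, not prescribed by the protocol:

```python
import re

def dedup(records):
    """Keep the first record per DOI; fall back to a normalised title key."""
    seen, unique = set(), []
    for rec in records:
        doi = (rec.get("doi") or "").lower().strip()
        title = re.sub(r"[^a-z0-9]", "", rec.get("title", "").lower())
        key = ("doi", doi) if doi else ("title", title)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"title": "Acceptability of maternal healthcare", "doi": "10.1/ABC"},
    {"title": "acceptability of maternal healthcare.", "doi": "10.1/abc"},  # same DOI
    {"title": "Maternal health acceptability in context", "doi": ""},
    {"title": "MATERNAL HEALTH ACCEPTABILITY IN CONTEXT", "doi": ""},       # same title
]
unique = dedup(records)  # keeps one record per DOI/title key
```

Note that the two keys are not cross-checked here: a record with a DOI and a DOI-less duplicate of it would both survive, which is why manual de-duplication in EndNote remains part of the workflow.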
In line with the standards of scoping review methodology, the study design, methodological quality and risk of bias of included articles will not be appraised (Armstrong et al., 2011).

Eligibility criteria. Identified studies will be screened using eligibility criteria carefully developed by the PI to ensure that the included studies are relevant to the research questions. The eligibility criteria have been determined using the population-concept-context (P-C-C) framework, as depicted in Table 1.

Exclusion criteria. This scoping review is part of a larger PhD project looking at the impact of acceptability of maternal healthcare on maternal mortality; the exclusion criteria are therefore:

• Population: studies reporting on females aged less than 18 years, including adolescents or teenagers falling pregnant.
• Concept: studies reporting on acceptability of services other than maternal healthcare, or on maternal healthcare acceptability beyond antenatal, delivery and immediate post-partum care (within 42 days after termination of pregnancy or delivery).
• Studies without full text.

N.B.: We will include studies that partly overlap the inclusion and exclusion criteria, such as those covering both young females (less than 18 years) and adult women, stakeholders other than the women themselves, or concepts beyond maternal healthcare acceptability. However, only findings meeting the inclusion criteria will be extracted for data charting, analysis and reporting of the results.

Search strategy. Drawing on the eligibility criteria, the PI has developed the search strategy using specific keywords or Medical Subject Headings (MeSH) terms in various combinations to increase the identification of related studies published on the topic. Table 2 shows some of the MeSH terms that will be used in the search strategy. 
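As an illustration of how the synonym groups in Table 2 combine under the Boolean operators, the sketch below OR-combines the terms within each group and AND-combines the groups into a single search string. The term lists are abridged from Table 2, and the function is ours, not part of the protocol:

```python
def build_query(*groups):
    """OR-combine terms within a group; AND-combine the groups."""
    return " AND ".join(
        "(" + " OR ".join(f'"{term}"' for term in terms) + ")"
        for terms in groups
    )

population = ["women", "mothers", "females", "women of reproductive age"]
concept = ["acceptability", "acceptable", "unacceptable", "trust", "respect"]
setting = ["maternal healthcare", "antenatal care", "delivery", "postpartum"]

query = build_query(population, concept, setting)
```

The resulting string follows the general Boolean syntax accepted by databases such as PubMed, although each database's field tags and MeSH expansion rules differ and would need database-specific adaptation.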
together with the librarian and the two researchers who will be involved in the article search, the pi will conduct a pilot search applying the search strategy to check its appropriateness on different online databases. the search strategy might be refined by the researchers engaged in the online search by using synonymous and/or proxy words to maximise the identification of publications related to acceptability of maternal healthcare. level one screening. after identification of relevant studies, the two researchers will export the retrieved articles into endnote and email them as compressed endnote files to the pi, who will merge them into a single endnote library. the pi will then remove the duplicates and import the merged endnote library into rayyan software for level one screening. the two researchers have been trained in literature screening using rayyan software and will be responsible for independently screening the titles and abstracts of identified sources. the screening process will be blinded. different scoping review studies have used different levels of agreement between the researchers involved in the screening process, including 75% (tricco et al., 2016), 80% (pham et al., 2014) and 85% (damanhoury et al., 2018; tricco et al., 2018). thus, an agreement level of 80% between the two independent researchers will be considered appropriate in the pilot screening of the first 100 articles before proceeding to the screening of the remaining retrieved studies. the pi will resolve any screening conflict between the two independent researchers by reviewing the inclusion and exclusion criteria with them to reach a consensual decision. this will be done after the pilot screening phase and for each article before it is included in or excluded from the next step, level two screening. level two screening. 
after successful screening of titles and abstracts, the pi will export the included articles from rayyan into endnote and email them to the screeners as a compressed endnote file to ensure that the full texts are attached in the endnote library. the two researchers will attach the full texts of the selected articles as pdf documents in the endnote library. then, they will email them back as compressed endnote files to the pi, who will merge them into a single endnote library with full texts attached.

table 1. eligibility criteria.
population: women aged 18 years and above seeking maternal healthcare
concept: acceptability of maternal healthcare (antenatal; delivery; post-partum)
context: open (worldwide)

table 2. search strategy (keywords or mesh terms, with synonymous or proxy words).
population: “women”; proxies: “mothers”, “females”, “women of reproductive age”, etc.
concept: “acceptability”; proxies: “acceptable/unacceptable”, “respectful/disrespectful”, “trust/distrust”, “supportive/unsupportive”, “caring/uncaring”, “perception/experience”, etc.
concept: “maternal healthcare”; proxies: “pregnancy”, “labour”, “delivery”, “postpartum”, “maternal healthcare services”, “antenatal care”, “pmtct”, “mental health in pregnancy”, “breastfeeding”, etc.
context: specific country, e.g., “south africa”, “zimbabwe”, “malawi”, “rwanda”, “united states of america”, “canada”, “united kingdom”; province, town or healthcare facility in a specific country, e.g., “gauteng”, “western cape”, “kwazulu-natal”, “mpumalanga”, “johannesburg”, “cape town”, “durban”, “secunda”, “chris hani baragwanath”, “steve biko”; worldwide or a specific continent, e.g., “global”, “africa”, “europe”; sub-regions within a continent, e.g., “sadec”, “sub-saharan africa”, “north africa”, “western europe”, “north america”.
boolean operators: “or”, “and”, “not”
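the 80% inter-rater agreement threshold used during screening can be computed in a few lines. the following is an illustrative sketch only, not part of the protocol's tooling: the decision lists are hypothetical stand-ins for the include/exclude decisions that would, in practice, be exported from rayyan.

```python
# illustrative sketch: percent agreement between two independent screeners
# on a pilot set of articles. the decision lists below are hypothetical.

def percent_agreement(decisions_a, decisions_b):
    """share of articles on which both screeners made the same decision."""
    if len(decisions_a) != len(decisions_b):
        raise ValueError("screeners must rate the same set of articles")
    matches = sum(a == b for a, b in zip(decisions_a, decisions_b))
    return matches / len(decisions_a)

# hypothetical pilot of 10 articles ("include"/"exclude")
screener_1 = ["include", "exclude", "include", "include", "exclude",
              "exclude", "include", "exclude", "include", "include"]
screener_2 = ["include", "exclude", "exclude", "include", "exclude",
              "exclude", "include", "exclude", "include", "exclude"]

agreement = percent_agreement(screener_1, screener_2)
print(f"agreement: {agreement:.0%}")  # 8 of 10 identical decisions -> 80%
proceed = agreement >= 0.80           # the threshold adopted in this protocol
```

a chance-corrected statistic such as cohen's kappa could be substituted for raw percent agreement if a stricter check were wanted; the protocol itself specifies only the percent-agreement threshold.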
the pi will import the merged endnote library into rayyan software for level two screening, the sole purpose of which is to include or exclude articles for the subsequent data charting process. the screening process will be blinded, with an agreement level of 80% between the two independent researchers considered appropriate. the pi will resolve any conflict occurring between the two screeners during the full-text screening by reviewing the inclusion and exclusion criteria with them to reach a consensual decision. as in level one screening, the agreement level between the independent researchers during level two screening will be checked for the first 100 articles and for each subsequent article before it is included in or excluded from the data charting process. database search. the pi has developed a database search template, which will be completed to summarise the search history. table 3 shows the database search template.

table 3. database search.
columns: search id#; dates; number of studies retrieved; number of studies selected; number of studies included (excluding duplicates) after level one screening; number of studies included (excluding duplicates) after level two screening.
rows: s#1, s#2, s#3, s#4, s#5, etc.

charting the data. the data extraction process will be conducted in such a way as to provide a logical and quantitative descriptive summary of the relevant information from each included study that aligns with the research questions and objectives. the pi has developed a data charting form to record the key information extracted from the articles included in this study. the pi will create a google document with all the data headings from the charting form to be collected from each included article. he will then invite the two researchers to complete it independently. the pi will conduct a pilot data charting exercise with the two researchers, applying the data charting form. an agreement level of 80% between the researchers will be considered appropriate before continuing with data charting of the rest of the included articles. any conflict amongst the researchers will be resolved by the pi. the two researchers will submit their answers and the pi will review the answers with them in a google sheet to resolve any subsequent conflict between them before exporting the database into stata software for quantitative descriptive analysis. table 4 describes the pre-defined data charting form. collating, summarizing and reporting the results. the reporting and mapping of the body of literature will be consistent with the preferred reporting items for systematic reviews and meta-analyses extension for scoping reviews (prisma-scr; tricco et al., 2018) to align study selection with the research objectives. quantitative descriptive statistics such as means, medians, frequencies and percentages will be used to analyse, describe and summarize the results. the researchers will use the prisma-scr flow diagram to demonstrate the process of inclusion of relevant articles, from identification to the retention of articles fulfilling all eligibility criteria (figure 1). the results will be presented in graphical/charted or tabular form, reporting frequencies or percentages of the data charted. in addition, the researchers will provide a narrative summary accompanying the tabulated and/or charted results to highlight how the results are linked to the objectives and research questions of this study. consultation exercise with experts. the consultation exercise has been planned to engage with experts in the field through emails, one-to-one consultative virtual meetings or an open science framework (osf) project, to regularly record experts’ thoughts, opinions and experiences on this topic. 
the researcher will engage various experts in the field of acceptability of maternal health services to enhance the findings from the scoping review and to obtain additional references that may be included in this study. the experts’ consultation will also be used to provide more insights into the scoping review results, additional relevant articles, and implications for future research projects, policy decision-making and the strengthening of health system practices. we will apply the delphi technique as an appropriate method to engage experts in the consultation exercise, to build consensus among them on the findings from the scoping review by the research team, and to validate the overall results, including the experts’ inputs (falzarano & zipp, 2013; nasa, jain, & juneja, 2021). we define an expert as an individual holding a master’s or higher degree in any field who has the knowledge and experience to meaningfully participate in the expert consultation process. we will look for experts from four groups: (1) patients; (2) healthcare providers; (3) healthcare researchers; and (4) healthcare managers/policy makers. experts will be identified globally from authors who have published on this topic and through academic circles with an interest in this topic. we will also apply a snowball sampling strategy in the expert selection process: we will request any interested or recruited expert to name additional experts from her/his circle (expert patients, healthcare providers, healthcare researchers or healthcare managers/policy makers) for potential recruitment. however, they will not know whether the named additional experts have been recruited, to maintain the anonymity of the experts participating in this exercise. the recruitment process will last for five months and only those who commit will be included in this study. the existing literature on the delphi technique does not offer a definite sample size, number of surveys or level of consensus. 
we will aim to recruit at least 5 to 10 participants from each expert group (i.e., 20 to 40 in total) and conduct at least four rounds of surveys, including brainstorming and validation phases, to reach 80% consensual agreement among the experts, in line with other studies applying this method (falzarano & zipp, 2013; nasa et al., 2021). figure 2 outlines the process of administering the delphi surveys and appendix 1 summarizes how each research question will be answered through the experts’ participation (questionnaire 1). for research question 1, we plan to summarize the findings on the definition and conceptual framework of maternal healthcare acceptability from the included articles.

table 4. data charting form.
title of study: title of the article or study
author/s: name of author/s
publication year: year the article was published
study design: qualitative; quantitative; mixed methods; scoping review; systematic review; meta-analysis; unknown
publication type: journal; book; website; conference proceedings; unpublished; other (specify)
keywords: keywords used by author/s
context: study setting or country
type of maternal healthcare: antenatal (specify); labour & delivery (specify); post-natal (specify)
definition of maternal healthcare acceptability: author/s apply/s the definition of healthcare acceptability in general: yes (if yes, specify); no
type of interactions with the mothers: mothers-community interactions; mothers-health provider interactions; mothers-health systems/policy interactions
components of mothers-community interactions: support from husband or partner (yes or no); support from family (yes or no); support from community (yes or no); other (specify)
components of mothers-health provider interactions: language barrier; respecting privacy; assistance in labour; talking to a health worker in private; busy health worker; being shouted at; being hit, slapped or pinched; health worker not respecting other patients; health worker not respecting me; other (specify)
components of mothers-health systems and policy interactions: dirty facilities; satisfied with received services; allowed to have a companion during labour; referred for follow-up care; informed about the child-care grant; other (specify)
practical implications: yes (if yes, specify); no
comments
conclusion: maternal healthcare acceptability proxy term

figure 1. prisma-scr flow diagram.

regarding the definition, the research team will draw on those results to choose or propose a more practical definition of maternal healthcare acceptability and ask the experts whether they agree with the research team or not (questionnaire 1). concerning the conceptual framework, the research team will draw on the results to select or propose the most shared components of each construct and a more practical conceptual framework of maternal healthcare acceptability, and then ask the expert panels whether they agree with them or not (questionnaire 1). for research question 2, we plan to summarize contextual findings related to geographical context and assess any contextual understanding of maternal healthcare acceptability from the included material. based on those results, the research team will make some assertions related to the contextual understanding of maternal healthcare acceptability and ask the experts whether they agree with them or not (questionnaire 1). for research question 3, we plan to summarize the practical implications of maternal healthcare acceptability identified from the included articles and/or recommended by the panel of experts. experts will have the opportunity to make comments and suggestions in every survey round, and these will be considered in the subsequent questionnaires; the cycle will continue until there is 80% consensual agreement on the selected items responding to the three research questions. 
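the stopping rule for the delphi rounds (repeat surveys until 80% of experts agree on each selected item) can be sketched in a few lines. this is an illustrative example only, with hypothetical expert ratings; it is not part of the protocol's tooling, and the item names are invented for the example.

```python
# illustrative sketch: after a delphi survey round, check which items have
# reached the 80% consensus threshold. the responses below are hypothetical.

def consensus_reached(responses, threshold=0.80):
    """true if the share of 'agree' responses meets the threshold."""
    agree = sum(1 for r in responses if r == "agree")
    return agree / len(responses) >= threshold

# hypothetical round with 20 experts rating two items
item_1 = ["agree"] * 17 + ["disagree"] * 3   # 85% agree
item_2 = ["agree"] * 12 + ["disagree"] * 8   # 60% agree

carry_forward = [name for name, responses in
                 [("item_1", item_1), ("item_2", item_2)]
                 if not consensus_reached(responses)]
print(carry_forward)  # only item_2 goes into the next questionnaire
```

items failing the threshold would be carried into the next questionnaire, while items reaching it would move on to the validation phase.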
validation of the consensual results will end the consultation exercise with the experts (figure 2), and the results will be presented under three broad categories, namely: consensual validated results; consensual but not validated results; and non-consensual results. figure 2. delphi surveys administration process. ethics and dissemination. this study will be conducted under an approved ethics certificate and an in-principle acceptance (ipa) issued by meta-psychology. the results will be presented at relevant conferences and published in a peer-reviewed journal. logistics and time schedule. a thoughtful logistics and time schedule has been put in place to ensure the smooth implementation of this project, comprising a project management timetable and an action plan. project management timetable. the pi conceived the idea of writing a protocol article on this scoping review and submitting it as a registered report in april 2021. two researchers were recruited to work on this project via the tuks undergraduate research forum (turf), university of pretoria, in may 2021. the pi and the two researchers attended a workshop on evidence synthesis, including scoping reviews, and a seminar on screening and study selection. these training sessions were organised in may and june 2021 by the office of the deputy dean of research and postgraduate studies, faculty of health sciences. the pi continued to train the two researchers over the course of july 2021 on how to effectively perform searches on different electronic databases and how to use endnote. the gantt chart (figure 3) illustrates the project management timetable (in days) once the ipa is granted. action plan. it is expected that this project will be completed within 350 days after the ipa is granted. table 5 portrays the action plan. 
discussion. this study aims to identify the gaps in the literature on acceptability of maternal healthcare and to explore the conceptual understanding and practical implications of maternal healthcare acceptability in the context of south africa and around the globe. thus, a scoping review is an appropriate method to answer the broad questions of this research (armstrong et al., 2011). the process will provide the current understanding of how acceptability of maternal care is defined and conceptualised. the main results will be summarised in line with the eligibility criteria (population-concept-context) and will be discussed in the light of available evidence on the topic (dijkers, 2015). the discussion of the findings will consider the relevance of key stakeholders (patients, communities, providers and health managers or policy makers). the involvement of experts through a consultation exercise will enhance the relevance of practical considerations (arksey & o’malley, 2005). the discussion will provide a general interpretation of the results with respect to the review questions and objectives. the authors will suggest next steps, such as undertaking systematic review and/or meta-analysis studies informed by the findings from this review.

figure 3. project management timetable (gantt chart).

table 5. action plan.
action | responsible | days | supervisor
identifying relevant studies | two researchers and pi | 30 | pi and is
selection of eligible studies | two researchers | 30 | pi and is
charting the data | two researchers | 120 | pi and is
summarizing the results | pi | 150 | supervisors
consultation exercise | pi | 120 | pi and supervisors
writing the report | all co-authors | 80 | pi

strengths and limitations. strengths. a scoping review is a suitable evidence synthesis method to answer broad research questions such as those in this particular study. a thoughtful and rigorous protocol with clear stages will guide the implementation of this project to reach the study objectives. 
eligibility criteria, a search strategy and a data charting form have been pre-defined to avoid bias. we will apply the scoping review as a transparent and replicable way to review a body of evidence, to identify the gaps in the literature, and to shed some light on how maternal healthcare acceptability is defined and conceptualised in south africa and around the globe. this method is appropriate for ascertaining the practical implications of the maternal healthcare acceptability concept and for suggesting future research studies, such as a systematic review or meta-analysis, to investigate a narrower aspect of this concept. limitations. this study is conditional on ethics approval (reference no: 545/2019) for a phd research project excluding young pregnant women aged less than 18 years old. thus, studies on acceptability related to pregnancy, delivery and post-partum care in teenagers will be excluded from this scoping review. this will result in the exclusion of critical information on the acceptability of healthcare for pregnant adolescents. another limitation is the omission of studies on acceptability of maternal healthcare published in languages other than english, which may result in the elimination of important studies on this topic. data availability. to ensure transparency and reproducibility, all data generated or analysed during this study will be included in the published scoping review article. this will include a list of included and excluded articles with reasons for exclusion, the database search, and the excel spreadsheet of charted data. reporting guidelines. the reporting and mapping of the body of literature will be consistent with the preferred reporting items for systematic reviews and meta-analyses extension for scoping reviews (prisma-scr; tricco et al., 2018) to align the selection of relevant articles with the research objectives. 
author contact. joy blaise bucyibaruta: email: u19375370@tuks.co.za; orcid: 0000-0002-6530-4342. leah maidment: email: u18008799@tuks.co.za; orcid: 0000-0002-1075-2543. carl august daniel heese: email: u19020512@tuks.co.za; orcid: 0000-0002-8429-4178. mmapheko doriccah peu: email: doriccah.peu@up.ac.za; orcid: 0000-0002-1585-2404. lesley bamford: email: lesley.bamford@health.gov.za; orcid: 0000-0002-5788-4308. annatjie elizabeth van der wath: email: annatjie.vanderwath@up.ac.za; orcid: 0000-0001-5117-9272. estelle grobler: email: estelle.grobler@up.ac.za; orcid: 0000-0002-2992-312x. alfred musekiwa: email: alfred.musekiwa@up.ac.za; orcid: 0000-0001-5880-3680. corresponding author: joy blaise bucyibaruta, telephone: (+27) 739 160 808; email: u19375370@tuks.co.za. acknowledgments. we would like to acknowledge the university of pretoria, faculty of health sciences, for ethical approval and other support in conducting this study. special thanks to the tuks undergraduate research forum (turf) for recruiting the independent researchers and to the office of the deputy dean: research and postgraduate studies for organising training on evidence synthesis and the use of rayyan software. many thanks to dr cheryl tosh for manuscript editing and formatting. conflict of interest and funding. the authors declare no conflict of interest, and no grant was received for this article. author contributions. dr joy blaise bucyibaruta is the pi and corresponding author. he is the project administrator, and he was involved in conceptualization, data curation, formal analysis, investigation, methodology, writing (original draft preparation and subsequent corrections), editing, formatting and approval of the article submission. ms leah maidment and mr carl heese are co-authors of this study. they were recruited to be involved in the literature search, level one and two screening, data charting, proof-reading and approval of the article submission. 
prof doriccah peu and prof lesley bamford read the article as the pi’s supervisors. they also provided suggestions on how to improve the article. prof annatjie van der wath read the article as an external researcher and provided suggestions on how to improve the article. ms estelle grobler is an information specialist, and she was recruited to be involved in the literature search. she will also assist the pi in resolving screening conflicts between the two researchers to ensure the reliability of the search. prof alfred musekiwa was appointed as a supervisor with experience in evidence synthesis to guide the entire scoping review project. authors are listed in order of contribution. references al-mujtaba, m., shobo, o., oyebola, b. c., ohemu, b. o., omale, i., shuaibu, a., & anyanti, j. (2020). assessing the acceptability of village health workers’ roles in improving maternal health care in gombe state, nigeria: a qualitative exploration from women beneficiaries. plos one, 15(10), e0240798. alderdice, f., mcneill, j., & lynn, f. (2013). a systematic review of systematic reviews of interventions to improve maternal mental health and well-being. midwifery, 29(4), 389-399. arksey, h., & o’malley, l. (2005). scoping studies: towards a methodological framework. international journal of social research methodology, 8(1), 19-32. armstrong, r., hall, b. j., doyle, j., & waters, e. (2011). cochrane update. ’scoping the scope’ of a cochrane review. j public health (oxf), 33(1), 147-150. doi:10.1093/pubmed/fdr015 balde, m. d., bangoura, a., sall, o., balde, h., niakate, a. s., vogel, j. p., & bohren, m. a. (2017). a qualitative study of women’s and health providers’ attitudes and acceptability of mistreatment during childbirth in health facilities in guinea. reproductive health, 14(1), 1-13. bucyibaruta, b. j., eyles, j., harris, b., kabera, g., oboirien, k., & ngyende, b. (2018). 
patients’ perspectives of acceptability of art, tb and maternal health services in a subdistrict of johannesburg, south africa. bmc health services research, 18(1), 1-15. cameron, s. t., craig, a., sim, j., gallimore, a., cowan, s., dundas, k., . . . lakha, f. (2017). feasibility and acceptability of introducing routine antenatal contraceptive counselling and provision of contraception after delivery: the apples pilot evaluation. bjog, 124(13), 2009-2015. doi:10.1111/1471-0528.14674 cummins, a., griew, k., devonport, c., ebbett, w., catling, c., & baird, k. (2021). exploring the value and acceptability of an antenatal and postnatal midwifery continuity of care model to women and midwives, using the quality maternal newborn care framework. women and birth. damanhoury, s., newton, a., rashid, m., hartling, l., byrne, j., & ball, g. (2018). defining metabolically healthy obesity in children: a scoping review. obesity reviews, 19(11), 1476-1491. dijkers, m. (2015). what is a scoping review? kt update, 4(1). dillip, a., alba, s., mshana, c., hetzel, m. w., lengeler, c., mayumana, i., . . . obrist, b. (2012). acceptability–a neglected dimension of access to health care: findings from a study on childhood convulsions in rural tanzania. bmc health services research, 12(1), 1-11. donabedian, a. (1993). quality in health care: whose responsibility is it? american journal of medical quality, 8(2), 32-36. donabedian, a. (2002). an introduction to quality assurance in health care. oxford university press. dyer, t. a., owens, j., & robinson, p. g. (2016). the acceptability of healthcare: from satisfaction to trust. community dent health, 33(4), 242-251. falzarano, m., & zipp, g. p. (2013). seeking consensus through the use of the delphi technique in health sciences research. journal of allied health, 42(2), 99-105. feinberg, e., smith, m. v., & naik, r. (2009). 
ethnically diverse mothers’ views on the acceptability of screening for maternal depressive symptoms during pediatric well-child visits. journal of health care for the poor and underserved, 20(3), 780. gilson, l. (2007). acceptability, trust and equity. cambridge university press. grant, m., wilford, a., haskins, l., phakathi, s., mntambo, n., & horwood, c. m. (2017). trust of community health workers influences the acceptance of community-based maternal and child health services. african journal of primary health care and family medicine, 9(1), 1-8. hadfield, h., & wittkowski, a. (2017). women’s experiences of seeking and receiving psychological and psychosocial interventions for postpartum depression: a systematic review and thematic synthesis of the qualitative literature. j midwifery womens health, 62(6), 723-736. doi:10.1111/jmwh.12669 hausmann-muela, s., ribera, j. m., & nyamongo, i. (2003). health-seeking behaviour and the health system response. disease control priorities project working paper no14. kozarewicz, p. (2014). regulatory perspectives on acceptability testing of dosage forms in children. international journal of pharmaceutics, 469(2), 245-248. kyei-nimakoh, m., carolan-olah, m., & mccann, t. v. (2017). access barriers to obstetric care at health facilities in sub-saharan africa—a systematic review. systematic reviews, 6(1), 1-16. liberati, a., altman, d. g., tetzlaff, j., mulrow, c., gøtzsche, p. c., ioannidis, j. p., . . . moher, d. (2009). the prisma statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. journal of clinical epidemiology, 62(10), e1-e34. mcintyre, d., thiede, m., & birch, s. (2009). access as a policy-relevant concept in low- and middle-income countries. health econ policy law, 4(pt 2), 179-193. doi:10.1017/s1744133109004836 murphy, a. l., & gardner, d. m. (2019). 
pharmacists’ acceptability of a men’s mental health promotion program using the theoretical framework of acceptability. aims public health, 6(2), 195-208. doi:10.3934/publichealth.2019.2.195 nasa, p., jain, r., & juneja, d. (2021). delphi methodology in healthcare research: how to decide its appropriateness. world journal of methodology, 11(4), 116. påfs, j., musafili, a., binder-finnema, p., klingberg-allvin, m., rulisa, s., & essén, b. (2015). ‘they would never receive you without a husband’: paradoxical barriers to antenatal care scale-up in rwanda. midwifery, 31(12), 1149-1156. penchansky, r., & thomas, j. w. (1981). the concept of access: definition and relationship to consumer satisfaction. medical care, 127-140. pham, m. t., rajić, a., greig, j. d., sargeant, j. m., papadopoulos, a., & mcewen, s. a. (2014). a scoping review of scoping reviews: advancing the approach and enhancing the consistency. research synthesis methods, 5(4), 371-385. rothstein, j. d., jennings, l., moorthy, a., yang, f., gee, l., romano, k., . . . lefevre, a. e. (2016). qualitative assessment of the feasibility, usability, and acceptability of a mobile client data app for community-based maternal, neonatal, and child care in rural ghana. international journal of telemedicine and applications. russell, d. j., humphreys, j. s., ward, b., chisholm, m., buykx, p., mcgrail, m., & wakerman, j. (2013). helping policy-makers address rural health access problems. australian journal of rural health, 21(2), 61-71. doi:10.1111/ajr.12023 sekhon, m., cartwright, m., & francis, j. j. (2017). acceptability of healthcare interventions: an overview of reviews and development of a theoretical framework. bmc health services research, 17(1), 1-13. sekhon, m., cartwright, m., & francis, j. j. (2018). 
acceptability of health care interventions: a theoretical framework and proposed research agenda. in: wiley online library. shaw, r. l., larkin, m., & flowers, p. (2014). expanding the evidence within evidence-based healthcare: thinking about the context, acceptability and feasibility of interventions. bmj evidence-based medicine, 19(6), 201-203. silal, s. p., penn-kekana, l., harris, b., birch, s., & mcintyre, d. (2012). exploring inequalities in access to and use of maternal health services in south africa. bmc health serv res, 12, 120. sripad, p., warren, c. e., hindin, m. j., & karra, m. (2019). assessing the role of women’s autonomy and acceptability of intimate-partner violence in maternal health-care utilization in 63 low-and middle-income countries. international journal of epidemiology, 48(5), 1580-1592. staneva, a. a., bogossian, f., & wittkowski, a. (2015). the experience of psychological distress, depression, and anxiety during pregnancy: a meta-synthesis of qualitative research. midwifery, 31(6), 563-573. staniszewska, s., crowe, s., badenoch, d., edwards, c., savage, j., & norman, w. (2010). the prime project: developing a patient evidence-base. health expect, 13(3), 312-322. doi:10.1111/j.1369-7625.2010.00590.x traylor, c. s., johnson, j., kimmel, m. c., & manuck, t. a. (2020). effects of psychological stress on adverse pregnancy outcomes and non-pharmacologic approaches for reduction: an expert review. american journal of obstetrics & gynecology mfm. tricco, a. c., lillie, e., zarin, w., o’brien, k. k., colquhoun, h., levac, d., . . . weeks, l. (2018). prisma extension for scoping reviews (prisma-scr): checklist and explanation. annals of internal medicine, 169(7), 467-473. tricco, a. c., lillie, e., zarin, w., o’brien, k., colquhoun, h., kastner, m., . . . wilson, k. (2016). a scoping review on the conduct and reporting of scoping reviews. bmc medical research methodology, 16(1), 1-10. 
meta-psychology, 2021, vol 5, mp.2020.2535, https://doi.org/10.15626/mp.2020.2535 article type: commentary published under the cc-by4.0 license open data: not applicable open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: rickard carlsson reviewed by: montoya a.k., fossum j.l., zigerell l.j. analysis reproduced by: lucija batinović all supplementary files can be accessed at the osf project page: https://doi.org/10.17605/osf.io/dchqt are women less likely to ask than men partly because they work fewer hours? a commentary on artz et al. (2018) jens mazei and joachim hüffmeier tu dortmund university a long debate in negotiation research concerns the question of whether gender differences in the propensity to initiate negotiations, in behaviors shown during negotiations, and in negotiation performance actually exist. whereas past negotiation research suggested that women are less likely to initiate negotiations than men, a recent study by artz et al. (2018) seems to suggest that women are as likely as men to “ask” for higher pay. however, this finding by artz et al. (2018) was obtained once the number of weekly hours worked was added as a covariate in the statistical analysis. following extant work, we suggest that the number of weekly hours worked could be—and, from a theoretical standpoint, perhaps should be—considered a mediator of gender differences. conducting a monte carlo analysis based on the results and statistics provided by artz et al. (2018) also yielded empirical evidence suggesting that weekly hours could be a mediator. thus, women may be less likely than men to ask for higher pay, among other potential reasons, because they work fewer weekly hours. 
based on this alternative conceptualization of the role of weekly hours, our commentary has theoretical implications for the understanding of gender differences in the propensity to initiate negotiations and practical implications for the effective reduction of gender inequalities. keywords: gender, sex, negotiation, bargaining, gender gap gender differences are a “hot” topic in negotiation research (e.g., bowles et al., 2019; small et al., 2007). this is because gender differences in the propensity to initiate negotiations as well as in negotiation behaviors and performance may help to account for longstanding inequalities (i.e., gender gaps in pay and top leadership positions; bureau of labor statistics [bls], 2019a; catalyst, 2020; kulik & olekalns, 2012; stuhlmacher & walters, 1999). in fact, research has shown that women are less likely to initiate negotiations than men (for a meta-analysis, see kugler et al., 2018). thus, women may earn less than men, in part, as they “do not ask” (babcock & laschever, 2003; babcock et al., 2006; bowles et al., 2007). yet, a recent study by artz et al. (2018) seems to question this narrative. our commentary focuses on one central result of this recent study, namely that women were no less likely than men to try to get higher pay. however, this result was obtained only when the number of weekly hours worked was added as a covariate (artz et al., 2018; see also luekemann & abendroth, 2018). the study by artz et al. (2018) already had an impact: it has been cited 56 times (google scholar; december 15, 2020), including in prestigious outlets such as the academy of management annals (jang et al., 2018). it has also been discussed in the media as follows: “findings seem to debunk perception that women lack assertiveness when negotiating salaries, a common explanation for wage gap” (lartey, 2016). therefore, the findings by artz et al. 
(2018) may be taken as raising serious doubts about cumulative knowledge obtained in past negotiation research on gender. the key question is: do women ask (artz et al., 2018) or don’t they (babcock & laschever, 2003; kugler et al., 2018)?

in this commentary, we aim to reconcile the seemingly diverging conclusions drawn from the study by artz et al. (2018) and past negotiation research by thoroughly embedding the findings by artz et al. (2018) in extant theory (e.g., bowles & mcginn, 2008; eagly & wood, 2012) and research (e.g., luekemann & abendroth, 2018). doing so suggests an alternative interpretation of the findings by artz et al. (2018), which has markedly different theoretical and practical implications: women may not ask, among other potential reasons, because they work fewer hours than men (see also bowles & mcginn, 2008; livingston, 2014; luekemann & abendroth, 2018). this alternative interpretation has not only nuanced theoretical implications, but also practical implications for the successful mitigation of gender inequalities.

background in brief

to start, we would like to clarify the terms “sex” and “gender” as used in the current commentary. in psychological science, “sex” is usually used to “refer to male, female, and intersex as categories or groups of people” (bosson et al., 2018, p. 6), whereas “gender” is usually used to “refer to the meanings that people give to the different sex categories” (bosson et al., 2018, p. 6). for example, if women had a lower propensity to initiate negotiations than men, this finding could be denoted as a “sex difference” (e.g., wood & eagly, 2002). however, in the current area of research (i.e., negotiations), it is more common to denote such effects as “gender differences” (e.g., bowles et al., 2007; kugler et al., 2018).
hence, to ease communication in the specific area of research to which we aim to contribute, we adhere to the consensus in negotiation research and use the term “gender.” negotiation research on gender dates back to at least the 1970s (rubin & brown, 1975; cf. kray & thompson, 2005; small et al., 2007). from early on, a key question was whether women actually differ from men. given that findings were often mixed (stuhlmacher & walters, 1999; walters et al., 1998), and given that (psychological) research generally has long been characterized by low statistical power (cohen, 1992; maxwell, 2004), meta-analysis plays a crucial role in determining whether gender differences exist in behaviors and outcomes related to negotiations (see also eagly & wood, 2013). meta-analyses revealed small yet significant gender differences, such that women as compared to men show less competitive negotiation behavior (walters et al., 1998), obtain lower economic outcomes (mazei et al., 2015; shan et al., 2019; stuhlmacher & walters, 1999), and, of particular relevance for this commentary, are less likely to initiate negotiations (kugler et al., 2018). the study by artz et al. (2018) a recent study by artz et al. (2018) seems to raise doubts about central findings from past research, as it “concludes that males and females ask equally often for promotions and raises” (p. 611). we would like to make clear from the outset that we appreciate the work by artz et al. (2018), as their study provides noteworthy results that have generated renewed interest in the question of whether (and why) women actually differ from men in the propensity to initiate negotiations. as we will highlight in this commentary, their study also sparks needed discussion specifically about the role played by the number of weekly hours worked for the emergence of gender differences in the propensity to initiate negotiations (see also luekemann & abendroth, 2018). 
however, care is needed because the findings can be interpreted in ways other than “males and females ask equally often” (artz et al., 2018, p. 611). a notable alternative interpretation could be that women no longer differed from men once a mediator was included in the analysis (for details, see below). if such an alternative interpretation is not considered, important lessons learned from past negotiation research (i.e., there are real gender differences in negotiation contexts; kugler et al., 2018) might become obscured, and practitioners might not implement interventions that are actually needed (e.g., those that address the flexibility stigma; williams et al., 2013; see also below). thus, our goal is to suggest an alternative interpretation, based on extant work (e.g., bowles & mcginn, 2008; eagly & wood, 2012), of the findings by artz et al. (2018).

artz et al. (2018) examined a representative sample of n = 4,582 employees from australia (n = 2,639 women and n = 1,943 men) to test at least two notions that are prominent in extant negotiation research on gender. these were (a) that “women may be reluctant to ‘ask,’ because that might be viewed by their manager as pushy or ‘out-of-role’ behavior for a female,” and (b) that “in certain circumstances, women may have a lower propensity than men to ask for pay raises and promotions” (artz et al., 2018, p. 611). the authors obtained many interesting findings, some of which were inconsistent with past research, or at least were interpreted as such (for our alternative interpretation, see below). women were not more likely than men to be concerned about straining the relationship with others as a result of trying to get higher pay or a promotion (artz et al., 2018). this result is in line with the prominent research by bowles et al.
(2007), but not with the likewise prominent research by amanatullah and morris (2010) and babcock et al. (2006). this result does not support the notion that women are afraid to come across as “pushy” (amanatullah & tinsley, 2013). furthermore, women were not found to differ from men in analyses on a variable tapping into the propensity to ask for a promotion (irrespective of which covariates were included; artz et al., 2018). this finding does not suggest a gender difference in the decision to initiate negotiations about career advancement (luekemann & abendroth, 2018). yet women were less likely than men to indicate having had success in negotiating pay (artz et al., 2018), which is in line with another longstanding notion in negotiation research on gender (cf. stuhlmacher & walters, 1999). this noteworthy result led artz et al. (2018, p. 629) to conclude that “females are less successful at getting,” although they might not be less inclined to ask. the distinction between gender differences in “asking” and “getting” is relevant, for instance, because the two phenomena would suggest different interventions (see the section on implications for practice below).

turning to the focus of our commentary, artz et al. (2018) also examined the likelihood with which women and men attempted to get higher pay (rather than a promotion; see above). this criterion is especially relevant because asking for higher pay is a direct way to reduce the gender pay gap. women were less likely than men to have attempted to get higher pay in two regressions that included several covariates (artz et al., 2018). however, this gender difference became non-significant once the number of weekly hours worked was included as an additional covariate (for further analyses on full-time and part-time workers, see table 4 in artz et al., 2018; see also luekemann & abendroth, 2018; stevens & whelan, 2019). as the authors noted (p.
623), “on closer scrutiny, the appearance of a lack of ‘asking’ is being driven statistically by working fewer hours.” the number of weekly hours worked was, in fact, positively related to the likelihood of having attempted to get higher pay. in their conclusions, artz et al. (2018, pp. 628-629) again stressed that “this adjustment for hours is particularly important. once it is done, regression equations for the likelihood of ‘asking’ do not show a statistically significant difference between men and women.” the number of weekly hours worked clearly played an important role in the study by artz et al. (2018).

what is the conceptual role of weekly hours?

the findings by artz et al. (2018) may be interpreted such that women “do ask,” and this seems to be the interpretation favored by the authors. however, in the case of asking for higher pay, this interpretation only holds true if the conceptual role of the number of weekly hours worked is that of a covariate (or control variable/confounder; mackinnon et al., 2000). artz et al. (2018, p. 629) noted: “our results have concentrated on the case in which hours of work are held constant. this is arguably natural, because we wish here to do a ceteris paribus comparison between males and females, but we have not attempted to explain the observed difference in the mean number of working hours between men and women” (in footnote 10, the authors additionally hinted that “these differences in working hours presumably stem in part from historical and sociological differences in the gender roles”). given the relevance of the number of weekly hours in the study by artz et al. (2018), we argue that a deeper elaboration on its conceptual role is in order (see also luekemann & abendroth, 2018). doing so strikes us as important because the results obtained by artz et al. (2018) suggest that the number of weekly hours could also be a mediator. artz et al.
(2018) did not more closely examine the gender difference in weekly hours—the a-path in a mediation—but their reported statistics allowed us to do so. women (m = 33.87, sd = 10.43) worked significantly fewer weekly hours than men (m = 41.60, sd = 9.86), t(4,580) = 25.37, p < .001. this gender difference had an effect size of d = 0.76 (calculated on the basis of the unrounded numbers given by artz et al., 2018, and formulas 12.11 and 12.12 given by borenstein, 2009, p. 226). as mentioned above, weekly hours were also significantly related to the likelihood of having attempted to get higher pay (while controlling for gender; see table 3, equation 9 in artz et al., 2018). this finding suggests a significant b-path (see also baron & kenny, 1986). following the logic of a joint-significance test (see also the component approach in yzerbyt et al., 2018), this pattern of results could be interpreted as suggesting mediation. as kenny (2018) put it, “if step 2 (the test of a) and step 3 (the test of b) are met, it follows that the indirect effect is likely nonzero.” to examine a potential indirect effect through weekly hours, we also conducted a monte carlo analysis using selig and preacher’s tool (2008; see also preacher & selig, 2012). the results are displayed in figure 1. the analysis, based on 20,000 repetitions, yielded a 95% confidence interval (ci) for the indirect effect that ranged from 0.02 to 0.04—thus, it did not include zero. this finding again suggests a significant indirect effect through weekly hours. taken together, statistically speaking, the number of weekly hours could be seen as a covariate (in mackinnon et al.’s, 2000, terms, a “confounder”) or as a mediator.

figure 1. monte carlo analysis of an indirect effect. note: this analysis was conducted using selig and preacher’s tool (2008; see also preacher & selig, 2012). the figure shows the distribution of estimates of the indirect effect (i.e., a × b).
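the a-path effect size and the monte carlo logic described above can be sketched in a few lines of python. this is a minimal sketch, not a reproduction of the published analysis: the group means, sds, and ns are those reported by artz et al. (2018), and the se of the a-path is implied by the reported t value, but the b-path estimate and its standard error below are hypothetical placeholders (the commentary reports only the resulting 0.02 to 0.04 interval, not the underlying b-path statistics).

```python
import numpy as np

# a-path: gender difference in weekly hours (statistics reported by artz et al., 2018)
m_w, sd_w, n_w = 33.87, 10.43, 2639   # women
m_m, sd_m, n_m = 41.60, 9.86, 1943    # men

# pooled sd and cohen's d (the approach of borenstein, 2009, formulas 12.11-12.12)
sd_pooled = np.sqrt(((n_w - 1) * sd_w**2 + (n_m - 1) * sd_m**2) / (n_w + n_m - 2))
d = (m_m - m_w) / sd_pooled
print(f"cohen's d = {d:.2f}")          # ~0.76, matching the commentary

# monte carlo ci for the indirect effect a*b (the selig & preacher approach):
# draw a and b from normal distributions centered on their estimates and take
# percentiles of the product. a is the raw mean difference in weekly hours;
# its se follows from the reported t(4580) = 25.37. the b-path values below
# are HYPOTHETICAL placeholders, not the published regression estimates.
rng = np.random.default_rng(1)
a_hat = m_m - m_w                      # 7.73 hours
se_a = a_hat / 25.37                   # se implied by the reported t value
b_hat, se_b = 0.004, 0.001             # hypothetical hours -> asking slope
reps = 20_000
products = rng.normal(a_hat, se_a, reps) * rng.normal(b_hat, se_b, reps)
ci_low, ci_high = np.percentile(products, [2.5, 97.5])
print(f"95% ci for a*b: [{ci_low:.3f}, {ci_high:.3f}]")
```

with these placeholder b-path values the simulated interval excludes zero, mirroring the joint-significance logic above; substituting the actual estimate and standard error from artz et al.'s table 3 should recover something close to the reported 0.02–0.04 interval.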
the resulting 95% ci excludes zero, suggesting an indirect effect through weekly hours.

given this lack of clarity regarding the conceptual role of weekly hours, our goal is to embed the findings by artz et al. (2018) in extant theory (e.g., bowles & mcginn, 2008; eagly & wood, 2012) to begin disentangling their conceptual role. theory explains “why empirical patterns were observed” (sutton & staw, 1995, p. 374; see also bacharach, 1989; whetten, 1989) and provides hypotheses regarding the specific conceptual roles played by constructs of interest. nevertheless, as the question of whether weekly hours are a mediator or a covariate is a question about the causal influence of weekly hours, suitable empirical methods that test for causality with longitudinal studies are also needed. below, we report relevant empirical work (e.g., luekemann & abendroth, 2018), yet focused longitudinal studies on this topic are currently missing. thus, as it stands, weekly hours only have the potential to be a mediator of the gender difference in the likelihood with which salary negotiations are initiated, and additional research is needed (see the section on implications for theory and future research).

why weekly hours could be a mediator (among others)

current negotiation research on gender is often grounded in social role theory (eagly, 1987; eagly & wood, 2012) and the related role congruity theory (eagly & karau, 2002), as this framework has proven helpful for integrating extant research (e.g., amanatullah & morris, 2010; kugler et al., 2018; stuhlmacher & linnabery, 2013). artz et al. (2018) referred to but did not elaborate on these theories. in figure 2, we depict a simplified model adapted from eagly and wood’s (2012) framework (see figure 49.1 in eagly & wood, 2012, p. 465, for their framework in its original form). the aim of the model shown in figure 2 is to explain gender differences in the propensity to initiate negotiations about salary (see path 1).
in a nutshell, women may be less likely to initiate salary negotiations, among other reasons, because they work fewer weekly hours, which is one aspect of the division of labor (see paths 2 and 3; bowles & mcginn, 2008; eagly et al., 2020; luekemann & abendroth, 2018). in addition, the division of labor leads to the emergence of gender roles (path 4; eagly & steffen, 1984; koenig & eagly, 2014), which can also drive gender differences (path 5; kugler et al., 2018; stuhlmacher & linnabery, 2013). we now elaborate on these notions.

division of labor: women work fewer hours than men

a first key notion in the underlying framework by eagly and wood (2012; see also eagly, 1987) that is relevant for our research is that there is often a division of labor among women and men. as eagly et al. (2020, p. 302) pointed out, “a common arrangement is a neotraditional division of labor”: women as compared to men spend less time at work—that is, they have fewer weekly hours of paid work (see path 2 in figure 2)—but more time on household and caretaking activities (e.g., bls, 2019b; eagly & carli, 2007). in fact, women in the united states, and also in many other western countries, are more likely than men to work part-time; and this gender difference can be observed across different age groups and ethnicities (bls, 2018). in the representative sample examined by artz et al. (2018), women also worked significantly fewer hours than men (d = 0.76). altogether, the observation that women work fewer hours than men is an important aspect of the division of labor (e.g., eagly & wood, 2012). in other words, a division of labor means (in part) that women and men differ in their weekly hours of paid work (see path 2 in figure 2).

figure 2. mediation of gender differences in the propensity to initiate salary negotiations. note:
this figure shows a simplified and adapted model that follows from eagly and wood’s (2012) framework. the model shown here is simplified because eagly and wood’s (2012) framework entails additional constructs and processes (e.g., socialization) that are beyond the scope of our research. the model shown here is adapted, most notably, because we highlight those aspects (e.g., weekly hours of paid work) that are most relevant for our research and also because eagly and wood’s (2012) figure does not include a box for “gender” (i.e., women vs. men). however, visualizing the relationship between gender and weekly hours (path 2) makes explicit the notion that the division of labor entails, among other aspects, a gender difference in weekly hours (e.g., eagly et al., 2020). for a figure of the underlying framework in its original form, see eagly and wood (2012, p. 465).

the gender difference in weekly hours is not the only aspect of the division of labor (e.g., women and men often work in different types of occupations; eagly et al., 2020). moreover, different aspects of the division of labor can influence each other: a prime example is that having (more) caretaking duties makes it more difficult to pursue other (workplace) activities (e.g., eagly & carli, 2007; wood & eagly, 2012). yet, we focus on weekly hours because this was the decisive variable in the study by artz et al. (2018): controlling for weekly hours rendered the gender difference in the likelihood to ask about pay non-significant, and this particular result led to their conclusion that women “do ask” (artz et al., 2018). this does not mean, however, that other variables (e.g., tenure; see table 3 in artz et al., 2018) are irrelevant for gender differences to emerge. thus, it is appropriate to regard weekly hours as only one potential mediator of gender differences in the likelihood to negotiate salary (see also bowles & mcginn, 2008; luekemann & abendroth, 2018).
the division of labor has consequences for the initiation of salary negotiations

the division of labor has many structural and psychological consequences (eagly & wood, 1999). that is, working fewer hours not only reduces one’s pay as a structural consequence (e.g., eagly & carli, 2007; eagly & wood, 2012), it can also yield a flexibility stigma (williams et al., 2013) as a psychological consequence. this stigma is defined as “a type of discrimination triggered whenever an employee signals a need for workplace flexibility due to family responsibilities (e.g., by requesting leaves of absence or flexible hours)” (rudman & mescher, 2013, p. 323; see also chung, 2020). this is because working reduced or flexible hours can be viewed as an indication of lacking work devotion¹ (e.g., bourdeau et al., 2019; williams et al., 2013). of greater relevance for our research, however, are people’s own perceptions, not just others’ discriminatory reactions (i.e., another psychological consequence): if people work reduced or flexible hours, they may deem themselves as lacking work devotion, so that they may not feel entitled or consider it appropriate to receive higher pay (for literature on entitlement, see major, 1989). this is because “the work devotion schema is […] seductive—workers may also believe that a strong work ethic helps form their sense of self and self-worth” (williams et al., 2013, p. 211). thus, if people do not consider it appropriate to receive higher pay, they should be unlikely to ask for it (bowles & mcginn, 2008; luekemann & abendroth, 2018).

¹ the work devotion schema “reflects deep cultural assumptions that work demands and deserves undivided and intensive allegiance” (williams et al., 2013, p. 211).
altogether, as women work fewer hours than men given the extant division of labor (path 2 in figure 2; bls, 2018), they may be less likely to initiate negotiations about salary—a relationship visualized as path 3 in figure 2 (see also bowles & mcginn, 2008, which is described in detail below, and livingston, 2014). as mentioned earlier, artz et al.’s (2018) study suggested a positive relationship between weekly hours and the likelihood with which people attempted to get higher pay (for a similar result, see stevens & whelan, 2019). notably, luekemann and abendroth (2018) also found that women with children (vs. men with children) were less likely to initiate a discussion about career advancement, and this gender difference became non-significant once working hours (as well as overtime hours and tenure) were included as additional covariates (see their model 4 on pp. 16-17 and their footnote 4). although luekemann and abendroth (2018) did not study people’s propensity to negotiate pay, their findings were conceptually and empirically similar to those obtained by artz et al. (2018). thus, luekemann and abendroth (2018, p. 17) similarly noted that “mothers’ lower likelihood to pose claims is mainly driven by them working fewer hours.” moreover, we conducted our own (unpublished) study with a cross-sectional design (mazei, nohe, & hüffmeier, 2017). we observed, for instance, that women as compared to men reported being more responsible for homemaking duties but less for being the breadwinner. in turn, being responsible for homemaking duties was positively related to working part-time, but negatively related to the frequency with which negotiations about pay were initiated.

the division of labor causes gender roles, which also drive gender differences

the division of labor is relevant (cf.
bowles & mcginn, 2008) in still another respect, as it causes gender roles (e.g., eagly & steffen, 1984; eagly & wood, 1999), defined as “consensual [and normative] beliefs about the attributes of women and men” (eagly & karau, 2002, p. 574; eagly, 1987). this relationship is shown as path 4 in figure 2 (eagly & wood, 2012; see also koenig & eagly, 2014). negotiation research has devoted great attention to gender roles because initiating negotiations fits men’s agentic gender role, but not women’s communal gender role (e.g., kugler et al., 2018; stuhlmacher & linnabery, 2013; see also eagly & karau, 2002). given this misfit among women, people react more negatively toward women who initiate negotiations (bowles et al., 2007) and negotiate in an agentic manner (amanatullah & tinsley, 2013; regarding such negative reactions, see also rudman et al., 2012). thus, gender roles also play a role in the emergence of gender differences in the propensity to initiate negotiations (e.g., kugler et al., 2018), as depicted by path 5 in figure 2.

another relevant notion in the literature is that “gender roles coexist with specific roles based on factors such as family relationships and occupation” (eagly & wood, 1999, p. 413), and gender roles and aspects of the division of labor may have partially independent effects in certain settings. for instance, eagly and wood (2012, p. 469, emphasis added) noted that “because specific roles have direct implications for task performance in many natural settings, they can be more important than gender roles.” in other words, the division of labor can drive gender differences not only because it causes gender roles (e.g., eagly & steffen, 1984), which is why figure 2 includes path 3 (see the rationale above and the general notion of “constraints” below; bowles & mcginn, 2008; eagly & wood, 1999).
conversely, gender roles can lead to gender differences even if a current division of labor is controlled for (eagly, 1987; eagly & wood, 1999). this is possible, for instance, because people might have learned gender roles earlier in their lives, which then keep affecting their own and others’ actions even in the absence of a current division of labor (eagly & wood, 1999, 2012). this insight is notable because it explains why gender differences can emerge in laboratory settings in which the division of labor with its usual constraints (eagly & wood, 1999) is less relevant, as was highlighted by eagly (1987). as previously learned gender roles can have an independent effect in a current setting, it may again be appropriate to regard weekly hours as one mediator (potentially among others).

altogether, weekly hours could be a mediator

observing a null effect for gender while controlling for weekly hours or other aspects of the division of labor—as was the case in artz et al. (2018)—can be interpreted in different ways. on the one hand, such a null effect could mean that there simply is no gender difference. on the other hand, it could mean that gender differences in the propensity to initiate salary negotiations are not solely driven by gender roles, but also by those aspects of the division of labor that were controlled for. extant work (e.g., eagly & wood, 1999; luekemann & abendroth, 2018) suggests that the latter possibility could be true. thus, weekly hours could be—and, from a theoretical standpoint, perhaps should be—considered a mediator. all told, the division of labor is a central construct that can help to explain why gender differences in the propensity to initiate negotiations, and in many other contexts, exist (e.g., eagly & wood, 1999; luekemann & abendroth, 2018).
for instance, working fewer hours not only gives rise to the hampering flexibility stigma (e.g., williams et al., 2013), but gender roles, which exist due to the division of labor (e.g., eagly & steffen, 1984), also create obstacles for women in negotiation contexts (e.g., stuhlmacher & linnabery, 2013). these insights were succinctly pointed out by bowles and mcginn (2008): they described gender dynamics as a “two-level game,” such that negotiations at work are related to negotiations at home (see also livingston, 2014). as they put it (p. 395), “the traditional division of labor between the sexes — in which men are the breadwinners and women are the caregivers — creates asymmetries between men and women in terms of how constrained their negotiations with employers (at level one) are by their negotiations with household members (at level two).” thus, women may be less likely to ask about higher pay because they work fewer weekly hours (cf. luekemann & abendroth, 2018).

implications for theory and future research

if the number of weekly hours worked were a mediator, the findings by artz et al. (2018) should no longer be interpreted as contradicting past negotiation research (e.g., babcock & laschever, 2003; kugler et al., 2018). interpreting their results such that women did not ask, among other reasons, because they work fewer hours would suggest the existence of a gender difference, just as past research did. what may be different about the findings by artz et al. (2018), however, is the specific process that drives the gender difference. past research typically focused on the effects of gender roles (see path 5 in figure 2; e.g., amanatullah & morris, 2010; stuhlmacher & linnabery, 2013). gender roles result from the division of labor (see path 4 in figure 2; eagly & wood, 2012; koenig & eagly, 2014), yet they can drive gender differences on their own in certain contexts (see above; eagly, 1987; eagly & wood, 1999).
thus, if the goal of a study is to isolate the potential effects of gender roles, it can make sense to control for aspects of the division of labor, such as weekly hours. if a study pursues this goal, then, weekly hours could also be conceptualized as a covariate. alternatively, the results by artz et al. (2018) may suggest that weekly hours—an aspect of the division of labor—mediate gender differences in the propensity to initiate negotiations (see paths 2 and 3 in figure 2). thus, the division of labor may drive gender differences in the context of negotiations in different ways, and not “only” through gender roles (paths 4 and 5 in figure 2; see also bowles & mcginn, 2008; luekemann & abendroth, 2018). altogether, the seemingly diverging conclusions drawn from past research and artz et al. (2018) can be reconciled by the theoretical argumentation presented here.

some studies have already shed initial light on the potential relevance of weekly hours for the emergence of gender differences in negotiation contexts (e.g., luekemann & abendroth, 2018; and, of course, the study by artz et al., 2018). yet, additional negotiation research that sheds more light on the role of weekly hours and, more generally, the division of labor is clearly needed. suggesting that weekly hours could be a mediator entails that evidence of their causal effects is needed. note, however, that conducting “true” experiments, in which the division of labor among women and men is manipulated, is not possible. thus, a particularly worthwhile avenue for future research would be to conduct longitudinal studies. these studies could examine whether changes in the division of labor help to explain gender differences in the propensity to initiate salary negotiations. this future research will help to determine whether the number of weekly hours not only has the potential to be a mediator, but actually is one.
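the covariate-versus-mediator distinction at the heart of this section can be illustrated with simulated data. this is a minimal sketch with made-up numbers (all effect sizes are arbitrary, chosen only to mimic the qualitative pattern, and asking is modeled as a linear propensity rather than the regression models used by artz et al., 2018): when gender affects hours and hours affect asking, a regression of asking on gender alone shows a gap, but adding hours as a covariate collapses the gender coefficient toward zero.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

def ols(X, y):
    """least-squares coefficients with an intercept column prepended."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# simulated population: woman -> fewer weekly hours -> lower propensity to ask.
# all numbers below are arbitrary illustration values, not published estimates.
woman = rng.integers(0, 2, n)                       # 1 = woman, 0 = man
hours = 41.6 - 7.7 * woman + rng.normal(0, 10, n)   # a-path: women work fewer hours
ask = 0.01 * hours + rng.normal(0, 0.2, n)          # b-path; no direct gender effect

# regression of asking on gender alone: a clear gender gap appears
gap_total = ols(woman.reshape(-1, 1), ask)[1]

# adding hours as a "covariate": the gender coefficient collapses toward zero,
# the pattern artz et al. (2018) report -- consistent with hours as a mediator
gap_adjusted = ols(np.column_stack([woman, hours]), ask)[1]

print(f"gender coefficient without hours: {gap_total:+.4f}")
print(f"gender coefficient with hours:    {gap_adjusted:+.4f}")
```

the point of the sketch is interpretive rather than statistical: both readings produce the same regression output, so whether the adjusted null means "women do ask" or "women ask less because they work fewer hours" has to be settled on theoretical and design grounds, as argued above.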
implications for practice

an alternative interpretation of the findings by artz et al. (2018) would go hand in hand with alternative practical implications. if women were only “less successful at getting” (artz et al., 2018, p. 629), practical interventions would “only” need to make sure that decision-makers grant women’s requests as often and as much as men’s. it is clear that decision-makers should treat women and men equally, because not doing so would be blatant discrimination. however, this intervention on its own is unlikely to be sufficient if women were less likely than men to “ask” in the first place. as the division of labor may drive gender differences in negotiation contexts and beyond (see path 3 in figure 2; e.g., bowles & mcginn, 2008; eagly & wood, 2012), more might need to be done to eliminate gender differences. in the long term, for gender differences to diminish, women and men would need to share breadwinning and caretaking duties more equally (but see croft et al., 2015, who highlighted that men typically fulfill caretaking duties to a lesser extent than women). in the short term, interventions to reduce the flexibility stigma (e.g., williams et al., 2013) would also help. specifically, supportive organizational norms can mitigate perceptions of low work devotion among people who work fewer hours and the related negative consequences (bourdeau et al., 2019). these additional interventions should help to encourage women to “ask” as often and for as much as men do (luekemann & abendroth, 2018).

conclusion

gender differences remain a “hot” topic in negotiation research. artz et al. (2018) provided valuable and noteworthy findings that have stimulated anew the debate about whether (and why) women and men differ in the propensity to initiate negotiations.
given that they interpreted their results in a way that questions cumulative knowledge obtained in past negotiation research, it is crucial to carefully scrutinize the results and their interpretation. as we highlighted in our commentary, following extant work (e.g., bowles & mcginn, 2008; eagly & wood, 2012), weekly hours could be one notable mediator of gender differences in the propensity to initiate salary negotiations. this possibility is relevant because, thus far, "negotiation scholars have largely ignored the structural implications of job candidates' domestic relations when studying negotiations with employers" (bowles & mcginn, 2008, pp. 394-395; but see luekemann & abendroth, 2018). thus, future negotiation research should devote greater attention to weekly hours and examine further their potential relevance for the emergence of gender differences in negotiation contexts. author contact jens mazei, department of psychology, tu dortmund university, https://orcid.org/0000-0003-3579-6857; joachim hüffmeier, department of psychology, tu dortmund university, https://orcid.org/0000-0002-0490-7035. correspondence concerning this article should be addressed to jens mazei, department of psychology, tu dortmund university, emil-figge-straße 50, 44227 dortmund, germany. e-mail: jens.mazei@tu-dortmund.de conflict of interest and funding we declare that we do not have any conflicts of interest that might be interpreted as influencing this commentary. we have not received any specific funding for this research. author contributions jens mazei developed the concept for this commentary and drafted the manuscript. joachim hüffmeier provided comments on the manuscript and edited it. the order of the author names reflects the different contributions by the authors. 
open science practices this article earned the open materials badge for making the materials openly available. this is a commentary on another study and as such did not produce new data and was not pre-registered. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement. references amanatullah, e. t., & morris, m. w. (2010). negotiating gender roles: gender differences in assertive negotiating are mediated by women’s fear of backlash and attenuated when negotiating on behalf of others. journal of personality and social psychology, 98(2), 256–267. https://doi.org/10.1037/a0017094 amanatullah, e. t., & tinsley, c. h. (2013). punishing female negotiators for asserting too much…or not enough: exploring why advocacy moderates backlash against assertive female negotiators. organizational behavior and human decision processes, 120(1), 110–122. https://doi.org/10.1016/j.obhdp.2012.03.006 artz, b., goodall, a. h., & oswald, a. j. (2018). do women ask? industrial relations, 57(4), 611-636. https://doi.org/10.1111/irel.12214 babcock, l., & laschever, s. (2003). women don’t ask: negotiation and the gender divide. princeton, nj: princeton university press. babcock, l., gelfand, m. j., small, d. a., & stayn, h. (2006). gender differences in the propensity to initiate negotiations. in d. de cremer, m. zeelenberg, & j. k. murnighan (eds.), social psychology and economics (pp. 239–259). mahwah, nj: lawrence erlbaum associates publishers. bacharach, s. b. (1989). organizational theories: some criteria for evaluation. academy of management review, 14(4), 496–515. https://doi.org/10.5465/amr.1989.4308374 baron, r. m., & kenny, d. a. (1986). the moderator– mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. journal of personality and social psychology, 51(6), 1173–1182. 
https://doi.org/10.1037/0022-3514.51.6.1173 bear, j. b., & glick, p. (2017). breadwinner bonus and caregiver penalty in workplace rewards for men and women. social psychological and personality science, 8(7), 780–788. https://doi.org/10.1177/1948550616683016 borenstein, m. (2009). effect sizes for continuous data. in h. cooper, l. v. hedges, & j. c. valentine (eds.), handbook of research synthesis and meta-analysis (2nd ed., pp. 221–235). new york, ny: russell sage foundation. bosson, j. k., vandello, j. a., & buckner, c. e. (2018). the psychology of sex and gender. thousand oaks, ca: sage. bourdeau, s., ollier-malaterre, a., & houlfort, n. (2019). not all work-life policies are created equal: career consequences of using enabling versus enclosing work-life policies. academy of management review, 44(1), 172–193. https://doi.org/10.5465/amr.2016.0429 bowles, h. r., babcock, l., & lai, l. (2007). social incentives for gender differences in the propensity to initiate negotiations: sometimes it does hurt to ask. organizational behavior and human decision processes, 103(1), 84–103. https://doi.org/10.1016/j.obhdp.2006.09.001 bowles, h. r., & mcginn, k. l. (2008). gender in job negotiations: a two-level game. negotiation journal, 24(4), 393–410. https://doi.org/10.1111/j.1571-9979.2008.00194.x bowles, h. r., thomason, b., & bear, j. b. (2019). reconceptualizing what and how women negotiate for career advancement. academy of management journal, 62(6), 1645–1671. https://doi.org/10.5465/amj.2017.1497 bureau of labor statistics (2018). who chooses part-time work and why? retrieved april, 8, 2020, from www.bls.gov/opub/mlr/2018/article/who-chooses-part-time-work-and-why.htm bureau of labor statistics (2019a). highlights of women's earnings in 2018. retrieved december, 6, 2019, from www.bls.gov/opub/reports/womens-earnings/2018/pdf/home.pdf bureau of labor statistics (2019b). american time use survey – 2018 results. 
retrieved april, 6, 2020, from www.bls.gov/news.release/pdf/atus.pdf catalyst (2020). pyramid: women in s&p 500 companies. retrieved march, 25, 2020, from www.catalyst.org/research/women-in-sp-500-companies/ chung, h. (2020). gender, flexibility stigma and the perceived negative consequences of flexible working in the uk. social indicators research, 151(2), 521–545. https://doi.org/10.1007/s11205-018-2036-7 cohen, j. (1992). a power primer. psychological bulletin, 112(1), 155–159. https://doi.org/10.1037/0033-2909.112.1.155 croft, a., schmader, t., & block, k. (2015). an underexamined inequality: cultural and psychological barriers to men's engagement with communal roles. personality and social psychology review, 19(4), 343–370. https://doi.org/10.1177/1088868314564789 eagly, a. h. (1987). sex differences in social behavior: a social-role interpretation. hillsdale, nj: lawrence erlbaum associates, inc. eagly, a. h., & carli, l. l. (2007). through the labyrinth: the truth about how women become leaders. boston, ma: harvard business school press. eagly, a. h., & karau, s. j. (2002). role congruity theory of prejudice toward female leaders. psychological review, 109(3), 573–598. https://doi.org/10.1037/0033-295x.109.3.573 eagly, a. h., nater, c., miller, d. i., kaufmann, m., & sczesny, s. (2020). gender stereotypes have changed: a cross-temporal meta-analysis of us public opinion polls from 1946 to 2018. american psychologist, 75(3), 301–315. https://dx.doi.org/10.1037/amp0000494 eagly, a. h., & steffen, v. j. (1984). gender stereotypes stem from the distribution of women and men into social roles. journal of personality and social psychology, 46(4), 735–754. https://doi.org/10.1037/0022-3514.46.4.735 eagly, a. h., & wood, w. (1999). the origins of sex differences in human behavior: evolved dispositions versus social roles. american psychologist, 54(6), 408–423. https://doi.org/10.1037/0003-066x.54.6.408 eagly, a. h., & wood, w. (2012). social role theory. in p. a. m. 
van lange, a. w. kruglanski, & e. t. higgins (eds.), handbook of theories of social psychology (pp. 458–476). thousand oaks, ca: sage publications. eagly, a. h., & wood, w. (2013). the nature–nurture debates: 25 years of challenges in understanding the psychology of gender. perspectives on psychological science, 8(3), 340–357. https://doi.org/10.1177/1745691613484767 jang, d., elfenbein, h. a., & bottom, w. p. (2018). more than a phase: form and features of a general theory of negotiation. academy of management annals, 12(1), 318–356. https://doi.org/10.5465/annals.2016.0053 kenny, d. a. (2018). mediation. retrieved december, 17, 2020, from http://davidakenny.net/cm/mediate.htm koenig, a. m., & eagly, a. h. (2014). evidence for the social role theory of stereotype content: observations of groups' roles shape stereotypes. journal of personality and social psychology, 107(3), 371–392. https://doi.org/10.1037/a0037215 kray, l. j., & thompson, l. l. (2005). gender stereotypes and negotiation performance: an examination of theory and research. in b. m. staw & r. m. kramer (eds.), research in organizational behavior: an annual series of analytical essays and critical reviews (vol. 26, pp. 103–182). new york, ny: elsevier science/jai press. kugler, k. g., reif, j. a. m., kaschner, t., & brodbeck, f. c. (2018). gender differences in the initiation of negotiations: a meta-analysis. psychological bulletin, 144(2), 198–222. https://doi.org/10.1037/bul0000135 kulik, c. t., & olekalns, m. (2012). negotiating the gender divide: lessons from the negotiation and organizational behavior literatures. journal of management, 38(4), 1387–1415. https://doi.org/10.1177/0149206311431307 lartey, j. (2016). women ask for pay increases as often as men but receive them less, study says. 
retrieved march, 31, 2020, from www.theguardian.com/world/2016/sep/05/gender-wage-gap-women-pay-raise-men-study livingston, b. a. (2014). bargaining behind the scenes: spousal negotiation, labor, and work–family burnout. journal of management, 40(4), 949–977. https://doi.org/10.1177/0149206311428355 luekemann, l., & abendroth, a.-k. (2018). women in the german workplace: what facilitates or constrains their claims-making for career advancement? social sciences, 7(11), 214. https://doi.org/10.3390/socsci7110214 mackinnon, d. p., krull, j. l., & lockwood, c. m. (2000). equivalence of the mediation, confounding and suppression effect. prevention science, 1(4), 173–181. https://doi.org/10.1023/a:1026595011371 major, b. (1989). gender differences in comparisons and entitlement: implications for comparable worth. journal of social issues, 45(4), 99–115. https://doi.org/10.1111/j.1540-4560.1989.tb02362.x maxwell, s. e. (2004). the persistence of underpowered studies in psychological research: causes, consequences, and remedies. psychological methods, 9(2), 147–163. https://doi.org/10.1037/1082-989x.9.2.147 mazei, j., hüffmeier, j., freund, p. a., stuhlmacher, a. f., bilke, l., & hertel, g. (2015). a meta-analysis on gender differences in negotiation outcomes and their moderators. psychological bulletin, 141(1), 85–104. https://doi.org/10.1037/a0038184 mazei, j., nohe, c., & hüffmeier, j. (2017). homemaking or breadwinning? gender differences in negotiation as explained by women's and men's domestic roles. unpublished presentation at the 30th conference of the international association for conflict management (iacm). preacher, k. j., & selig, j. p. (2012). advantages of monte carlo confidence intervals for indirect effects. communication methods and measures, 6(2), 77–98. https://doi.org/10.1080/19312458.2012.679848 rubin, j. z., & brown, b. r. (1975). the social psychology of bargaining and negotiation. new york, ny: academic press. rudman, l. a., & mescher, k. (2013). 
penalizing men who request a family leave: is flexibility stigma a femininity stigma? journal of social issues, 69(2), 322–340. https://doi.org/10.1111/josi.12017 rudman, l. a., moss-racusin, c. a., phelan, j. e., & nauts, s. (2012). status incongruity and backlash effects: defending the gender hierarchy motivates prejudice against female leaders. journal of experimental social psychology, 48(1), 165–179. https://doi.org/10.1016/j.jesp.2011.10.008 selig, j. p., & preacher, k. j. (2008, june). monte carlo method for assessing mediation: an interactive tool for creating confidence intervals for indirect effects [computer software]. retrieved december, 17, 2020, from http://quantpsy.org/ shan, w., keller, j., & joseph, d. (2019). are men better negotiators everywhere? a meta-analysis of how gender differences in negotiation performance vary across cultures. journal of organizational behavior, 40(6), 651–675. https://doi.org/10.1002/job.2357 small, d. a., gelfand, m. j., babcock, l., & gettman, h. (2007). who goes to the bargaining table? the influence of gender and framing on the initiation of negotiation. journal of personality and social psychology, 93(4), 600–613. https://doi.org/10.1037/0022-3514.93.4.600 stevens, k., & whelan, s. (2019). negotiating the gender wage gap. industrial relations, 58(2), 141–188. https://doi.org/10.1111/irel.12228 stuhlmacher, a. f., & linnabery, e. (2013). gender and negotiation: a social role analysis. in m. olekalns & w. adair (eds.), handbook of research on negotiation (pp. 221–248). london: edward elgar. stuhlmacher, a. f., & walters, a. e. (1999). gender differences in negotiation outcome: a meta-analysis. personnel psychology, 52(3), 653–677. https://doi.org/10.1111/j.1744-6570.1999.tb00175.x sutton, r. i., & staw, b. m. (1995). what theory is not. administrative science quarterly, 40(3), 371–384. https://doi.org/10.2307/2393788 walters, a. e., stuhlmacher, a. f., & meyer, l. l. (1998). 
gender and negotiator competitiveness: a meta-analysis. organizational behavior and human decision processes, 76(1), 1–29. https://doi.org/10.1006/obhd.1998.2797 whetten, d. a. (1989). what constitutes a theoretical contribution? academy of management review, 14(4), 490–495. https://doi.org/10.5465/amr.1989.4308371 williams, j. c., blair-loy, m., & berdahl, j. l. (2013). cultural schemas, social class, and the flexibility stigma. journal of social issues, 69(2), 209–234. https://doi.org/10.1111/josi.12012 wood, w., & eagly, a. h. (2002). a cross-cultural analysis of the behavior of women and men: implications for the origins of sex differences. psychological bulletin, 128(5), 699–727. https://doi.org/10.1037/0033-2909.128.5.699 wood, w., & eagly, a. h. (2012). biosocial construction of sex differences and similarities in behavior. in j. m. olson, & m. p. zanna (eds.), advances in experimental social psychology (vol. 46, pp. 55–123). burlington, ma: academic press. https://doi.org/10.1016/b978-0-12-394281-4.00002-7 yzerbyt, v., muller, d., batailler, c., & judd, c. m. (2018). new recommendations for testing indirect effects in mediational models: the need to report and test component paths. journal of personality and social psychology, 115(6), 929–943. http://dx.doi.org/10.1037/pspa0000132 meta-psychology, 2021, vol 5, mp.2020.2711 https://doi.org/10.15626/mp.2020.2711 article type: commentary published under the cc-by4.0 license open data: not applicable open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: rickard carlsson reviewed by: streamlined peer review analysis reproduced by: lucija batinović all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/xwved does the privacy paradox exist? 
comment on yu et al.’s (2020) meta-analysis tobias dienlin university of vienna ye sun city university of hong kong abstract in their meta-analysis on how privacy concerns and perceived privacy risk are related to online disclosure intention and behavior, yu et al. (2020) conclude that “the ‘privacy paradox’ phenomenon (...) exists in our research model” (p. 8). in this comment, we contest this conclusion and present evidence and arguments against it. we find five areas of problems: (1) flawed logic of hypothesis testing; (2) erroneous and implausible results; (3) questionable decision to use only the direct effect of privacy concerns on disclosure behavior as evidence in testing the privacy paradox; (4) overinterpreting results from masem; (5) insufficient reporting and lack of transparency. to guide future research, we offer three recommendations: going beyond mere null hypothesis significance testing, probing alternative theoretical models, and implementing open science practices. while we value this meta-analytic effort, we caution its readers that, contrary to the authors’ claim, it does not offer evidence in support of the privacy paradox. keywords: privacy paradox, meta-analysis, comment in a recent meta-analysis on how privacy concerns and perceived privacy risk are related to online disclosure intention and behavior, the authors conclude that “privacy concern cannot significantly affect disclosure behavior, which confirms that the ‘privacy paradox’ phenomenon [. . . ] exists in our research model” (yu et al., 2020, p. 7f.). such a strong claim from a meta-analytic study is likely to impact future research on online privacy in substantial ways. in this comment, we challenge this conclusion and present contesting evidence and arguments. while we value this meta-analytic effort, we caution its readers that, contrary to the authors’ claim, it does not offer evidence in support of the privacy paradox. 
we first describe and discuss five areas of problems in yu et al.'s (2020) analysis. based on these problems, we then offer three recommendations for future research. problem 1: flawed logic of hypothesis testing the privacy paradox phenomenon describes "the dichotomy between information privacy attitudes and actual behavior" (kokolakis, 2017, p. 122). for example, despite stating that they are concerned about privacy, users still share much information online. empirically, the privacy paradox is tested by analyzing the relationship between privacy cognitions (e.g., privacy attitudes, privacy concerns, or perceived privacy risk) and privacy behavior (e.g., information disclosure or privacy protection) (gerber et al., 2018). in primary studies, rejecting the privacy paradox hypothesis is relatively straightforward when there is a significant, negative relationship between cognitive and behavioral variables. in other words, if increased privacy concerns are associated with reduced online sharing, such evidence refutes the privacy paradox (e.g., utz & krämer, 2009). "support" for the paradox, on the other hand, is typically inferred from the lack of a significant relationship (taddicken, 2014). to date, only a few studies have found that privacy cognitions are positively related to disclosure outcomes (e.g., contena et al., 2015), which would constitute direct support for the privacy paradox. yu et al. (2020, p. 4) formally test the privacy paradox via a null hypothesis, which states, "h4: privacy concern has no significant effects on users' personal information disclosure behavior. namely, privacy paradox exists." the logic of this hypothesis testing is problematic, however, because absence of evidence is not evidence of absence (e.g., a sample of all white swans is no evidence that black swans do not exist). 
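this point, that a non-significant test cannot establish a null effect, can be illustrated with a small simulation (a minimal python sketch with hypothetical parameters, not the meta-analytic data): when a true but small correlation exists and power is low, most studies nonetheless return p > .05.

```python
# minimal power simulation: a true but small correlation (hypothetical
# r = .06) frequently yields p > .05 at n = 200, so non-significance is
# not evidence that the effect is absent.
import math
import random

def pearson_r(x, y):
    # sample pearson correlation, computed from sums of squares
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def p_value(r, n):
    # two-sided p for a correlation via the fisher z approximation
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def power_simulation(true_r=0.06, n=200, sims=2000, seed=1):
    # proportion of simulated studies that reach p < .05
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [true_r * a + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
             for a in x]
        if p_value(pearson_r(x, y), n) < 0.05:
            hits += 1
    return hits / sims

print(power_simulation())
```

with these hypothetical values, only a small minority of simulated studies reach significance; a collection of "null results" is therefore exactly what a real but small effect would produce under low power.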
a nonsignificant result (i.e., p > .05) does not mean that the null hypothesis is true or should be accepted (see greenland et al., 2016, for detailed discussions of this and related misperceptions). in sum, a finding of no significant effects cannot demonstrate the absence of the effect (and hence the existence of the privacy paradox). problem 2: erroneous and implausible results to demonstrate the hypothesized null effect, yu et al. (2020) show a non-significant direct effect of privacy concerns on disclosure behavior. we first demonstrate the errors in the evidence presented in their report. we then critically discuss the authors' decision to focus only on this direct effect in problem 3 and their use of meta-analytical structural equation modeling (masem) in problem 4. the authors infer the non-significant direct effect of privacy concerns on disclosure behavior by comparing two structural equation models. the proposed (and final) model does not include the direct effect (see figure 1, top panel), whereas the saturated model does (figure 1, bottom panel). the two models are compared based on only one criterion: the rmsea index. according to the authors, the proposed model shows a good model fit with an rmsea = .008. for the saturated model, the authors write that "[t]he model fit indices [. . . ] were not acceptable (rmsea = 0.368). this implied that our proposed model was effective and that privacy concerns could not predict users' disclosure behavior. thus, our h4 was supported, which indicated that privacy paradox does exist." (yu et al., 2020, p. 5). figure 1. top: proposed/final model reported in yu et al. (2020). bottom: saturated model to test the direct effect of privacy concern on disclosure behavior in yu et al. (2020). the results reported by the authors are erroneous and implausible. 
first, leaving out a relationship in a path model effectively constrains it to zero (kline, 2016), which is unlikely to hold in most cases (e.g., orben & lakens, 2020); model fit should therefore increase if we add another path (kline, 2016). for the saturated model, with an added path between privacy concerns and disclosure behavior, a decreased model fit is implausible. second, the reported rmsea of .368 for the saturated model must be erroneous. the rmsea for a saturated model (with df = 0) should be undefined (or zero), as can be seen from the formula for its calculation (kline, 2016, p. 205): $\mathrm{RMSEA} = \sqrt{\frac{\chi^2_M - df_M}{df_M (N - 1)}}$ (1). we reran the model with the correlation matrix reported in table 1 in yu et al. (2020) (for the syntax and the results, see online supplementary material at https://osf.io/qexpf/). following the authors' procedure, we calculated the harmonic mean using the sample statistics provided in table 1. as expected, our reanalysis showed a "perfect" fit with an rmsea of 0. in addition, for the proposed model, we were able to reproduce all the fit indices reported in yu et al. (2020) except for the rmsea: whereas the authors report an rmsea of .008, our re-analysis produced an rmsea of .08. therefore, the rmseas reported in yu et al. (2020) – .368 for the saturated model and .008 for the final model – are both erroneous. these errors invalidate the findings that the authors use as the key evidence for the privacy paradox hypothesis testing. we also note some other, more minor problems. first, the bivariate correlation between privacy concerns and disclosure behavior is reported to be r = –0.063 in table 1 but as "rpc-db = –0.120" in the text (p. 6). second, at least twice in the text, the authors consider a p-value between .05 and .10 as relevant/statistically significant, without acknowledging or explaining the shift from the usual significance level of 5%. 
finally, using only rmsea as the basis for model comparison is subpar. in addition, for models with low degrees of freedom, rmsea is problematic and should be avoided (kenny et al., 2015). problem 3: the questionable decision to use only the direct effect (or the lack thereof) of privacy concerns on disclosure behavior as evidence in testing the privacy paradox. notably, the authors' claim for evidence regarding the privacy paradox is based only on (the lack of) the direct effect of privacy concerns on disclosure behavior. on theoretical and methodological grounds, we argue that the authors' decision to conclude that the privacy paradox exists based solely on this one path is problematic. problem 3a. the omission of the indirect effect via behavioral intention. yu et al.'s (2020) hypothesis 4 (see above) addresses the overall effect of privacy concerns on disclosure behavior in testing the privacy paradox, not just its residual direct effect. in presenting evidence for h4, however, the authors entirely omit the indirect effect of privacy concerns on disclosure behavior via disclosure intention. statistically, this decision is peculiar: when estimating the overall effect of the independent variable on the outcome, direct and indirect effects are to be combined. theoretically, this omission lacks justification as well. according to the theory of planned behavior (fishbein & ajzen, 1975), on which yu et al.'s (2020) model is based, attitudes affect behavior indirectly via behavioral intentions. behavioral intentions, as a mediator, help explain how and why an effect takes place. the residual direct effect in a model often captures the influence of unexamined mechanisms. neither the indirect nor the direct effect alone addresses the existence of an effect or its magnitude (rohrer, 2018). to this end, in testing h4, the authors should estimate the total effect. 
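to make the decomposition concrete, here is a minimal sketch (python, with hypothetical standardized path coefficients that are not yu et al.'s estimates) of how the total effect combines the residual direct path with the indirect path via disclosure intention:

```python
# hypothetical standardized path coefficients (for illustration only;
# NOT the estimates from yu et al., 2020)
a = -0.30        # privacy concerns -> disclosure intention
b = 0.40         # disclosure intention -> disclosure behavior
c_prime = -0.02  # residual direct effect: concerns -> behavior

indirect = a * b            # effect transmitted through intention
total = c_prime + indirect  # the overall effect that h4 is about

print(f"indirect = {indirect:.2f}, total = {total:.2f}")
```

a near-zero residual direct path (c_prime) says nothing about the total effect when the indirect path is nonzero, which is why h4 cannot be tested on the direct path alone.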
statistically identical to the total effect is the bivariate correlation between the two variables (hayes, 2013), which is provided in the paper. notably, table 1 in yu et al. (2020) reports a significant correlation between privacy concerns and disclosure behavior (r = –0.063, 95% ci [–0.120; –0.005], p = .034). granted, it is a very small effect (see below) – but if we just use the p-value for hypothesis testing (e.g., p < .05), which is the approach used in the paper, the conclusion would be to reject the privacy paradox. problem 3b. the exclusion of privacy risk perceptions from the privacy paradox framework. by focusing on the residual direct effect of privacy concerns on disclosure behavior, the authors in effect claim that risk perceptions, treated as a confounding variable, have no theoretical or empirical role in the privacy paradox framework. we find this decision questionable. we agree with the authors' position that privacy concerns and perceived risk are conceptually distinguishable. nonetheless, making such a distinction does not mean that privacy cognitions, as a larger construct in the privacy paradox, are represented only by privacy concerns. instead, we argue that risk perceptions are also relevant in testing the privacy paradox. as the authors note in their literature review, gerber et al. (2018, p. 245) explicitly list "privacy attitude, concerns, perceived risk, behavioral intention and behavior" as central variables for "privacy paradox explanation attempts". when providing examples of the privacy paradox in the introduction, yu et al. (2020) themselves include research on perceived privacy risks. in any event, making a conceptual distinction between the two variables should not lead to disregarding their close relationship. despite their differences (see also below), both concepts capture cognitions toward privacy. 
empirical data show high correlations between the two: r = .73 (bol et al., 2018), and r = .62 as reported in this meta-analysis. therefore, theoretically and empirically, privacy concerns and risk perceptions are both part of the privacy paradox framework. the role of risk perception should not be excluded from the empirical evidence regarding the privacy paradox. yu et al.'s (2020) analysis finds that perceived privacy risk is a significant predictor of online privacy behavior (r = –.165, p = .003). in other words, people who perceive online information sharing as riskier also share less information, which we believe represents compelling evidence against the privacy paradox. problem 4: overinterpreting masem results we also caution against the authors' overinterpretation of their masem results as evidence of causal relationships. the authors claim that they "conducted structural equation modeling, based on meta-analytically pooled correlations (masem), to investigate the causal effects of [. . . ] privacy cognition on online disclosure intention and behavior" (yu et al., 2020, p. 2, emphasis added). the use of structural modeling techniques alone does not ensure causal inferences. results of model fitting should not be interpreted as if they came from an experiment when they did not (loehlin & beaujean, 2016). in addition, "the data do not confirm a model, they only fail to disconfirm it" (cliff, 1983, p. 116). there are equivalent models that also fit the data, and there are unanalyzed variables that could disconfirm the model if included. most models in social science research are "descriptive models" that simply depict relationships but are presented as "structural models" yielding causal explanations (freedman, 1987, p. 221). yu et al.'s (2020) model, if estimated correctly, could provide descriptive relationships but not causal evidence. the included primary studies typically analyzed cross-sectional, self-reported data. 
the masem approach used in yu et al. (2020) also does not incorporate potential confounding variables such as age, sex, or education level, which may affect the relationship between privacy cognitions and online sharing behavior (kezer et al., 2016; tifferet, 2019). the small number of variables and degrees of freedom in the model also limits the usefulness of masem (cheung, 2021) and leaves little room for testing possible alternative models. yu et al.'s (2020) particular approach to masem is also a limited one with problematic statistical properties. masem includes a collection of methods that combine meta-analysis and sem. what yu et al. (2020) used is the univariate-r approach, which first meta-analyzes each correlation as if it were independent and then fits an sem on the pooled average correlation matrix as if it were an observed covariance matrix (cheung, 2019, 2021). other masem methods, varying in specific procedures, are multivariate approaches that aggregate correlation matrices from primary studies by taking into consideration the dependence of correlations. the latter also allow for handling missing data and addressing estimation uncertainty in fitting the sem. the univariate-r approach has known statistical issues (cheung, 2021). for example, the pairwise aggregation/deletion means that an ad-hoc sample size is used for sem (the harmonic mean is the most common), which leads to biased test statistics and standard errors. the sem results also differ depending on which ad-hoc sample size is used. ignoring sampling uncertainty across studies and treating the correlation matrix as the covariance matrix have also been shown to generate incorrect estimates (cheung, 2019, 2021). overall, we remind readers that yu et al.'s (2020) masem results should be interpreted with great caution. 
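the univariate-r mechanics described above can be sketched in a few lines (python; the per-study correlations and sample sizes below are hypothetical, not extracted from yu et al., 2020):

```python
# sketch of the univariate-r steps: each cell of the pooled correlation
# matrix is meta-analyzed separately, and an ad-hoc sample size (commonly
# the harmonic mean) is then fed to the sem stage.
import math
from statistics import harmonic_mean

# hypothetical per-study correlations and sample sizes for ONE cell of
# the pooled matrix (e.g., concerns x behavior)
rs = [-0.10, -0.02, -0.08]
ns = [150, 300, 250]

# fisher-z pooling, weighting each study by n - 3
zs = [math.atanh(r) for r in rs]
weights = [n - 3 for n in ns]
pooled_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
pooled_r = math.tanh(pooled_z)

# the ad-hoc n used when the pooled correlation matrix is then treated
# as if it were observed data in the sem stage
n_adhoc = harmonic_mean(ns)

print(round(pooled_r, 3), round(n_adhoc, 1))
```

multivariate masem approaches instead pool whole correlation matrices while modeling their dependence and sampling uncertainty, which is why they are generally preferred (cheung, 2021).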
the use of a structural model does not automatically allow for causal inferences, and the implemented univariate-r approach to masem leads to estimates that are, in general, less trustworthy. problem 5: insufficient reporting and lack of transparency finally, there is substantial underreporting in yu et al.'s (2020) meta-analysis. meta-analyses, like other empirical research, are subject to researchers' subjective decisions and common errors. standardized reporting and data transparency are important to the reproducibility of meta-analytic findings (lakens et al., 2016; maassen et al., 2020). yu et al.'s (2020) reporting does not adhere to reporting guidelines for meta-analysis such as the prisma statement (preferred reporting items for systematic reviews and meta-analyses) (moher et al., 2009). key information is not available in either the published paper or the supplemental material. for example, individual effect estimates are not reported. the inclusion and exclusion criteria are described only vaguely. there is no description of how the key variables (such as privacy concerns vs. perceived privacy risks) were operationalized and extracted from the primary studies. regarding moderator coding, no information was provided about coder training or inter-coder reliability. no publication bias assessment was reported. the lack of transparency in yu et al. (2020) makes it hard to assess the validity of their reported meta-analytic data. this comment and the additional analyses we report are consequently constrained despite our best efforts. without sufficient information to evaluate the search, coding, or effect size extraction processes in yu et al.'s (2020) meta-analysis, nor any data to reproduce their summary effect sizes, we remind readers that a re-analysis can only be as good as the available data it is based on, the quality of which we are unable to assess. 
future research

the five major areas of problems with yu et al.'s (2020) analysis, as discussed above, call their conclusion regarding the privacy paradox into question. these problems, to a certain extent, speak to larger issues and challenges in the existing privacy paradox research, meriting further reflection. in this section, we engage in such reflection and provide three recommendations for future research on the privacy paradox.

recommendation 1: going beyond null hypothesis significance testing (nhst)

as we discussed in problem 1, to "confirm" a null hypothesis via statistical non-significance is a flawed approach. this misconception of nhst is not specific to yu et al. (2020). for example, in examining the privacy paradox, hallam and zanella (2017) also proposed a null hypothesis that "privacy concern is not related to sensitive information self-disclosure behavior" (p. 220) and used the non-significant effect as supportive evidence. a p-value above .05 alone cannot distinguish a true null effect from insensitive data (dienes, 2014; lakens et al., 2018). rather than relying on the p-value as the sole arbiter of "truth" in nhst, researchers can instead adopt an interval perspective or calculate bayes factors to assess whether there is evidence for the null effect. in what follows, we introduce and illustrate these two approaches using the data from yu et al. (2020) and from baruh et al. (2017), another meta-analysis on the privacy paradox. the results are to be read in the context of the example and not as a definitive answer to the existence of the privacy paradox.

1a. the interval estimates approach

interval estimates include bayesian credibility intervals and frequentist confidence intervals, which are "the set of possible population values consistent with the data; all other population values may be rejected" (dienes, 2014, p. 3).
an interval approach can overcome the problems associated with nhst by determining a range of values consistent with a hypothesized effect. the null hypothesis, therefore, no longer hinges on a single point value (such as 0 in most null hypotheses in nhst) but specifies a "null region" (dienes, 2014). delineating the null region requires determining a minimally interesting effect size, the so-called "smallest effect size of interest" (sesoi; lakens et al., 2018). we can then make statistical inferences by comparing the observed interval against the null region, following the guidelines outlined in dienes (2014). for example, if an effect is too small to be meaningful, a hypothesis is rejected even if the p-value is below 5%. using a null region with a pre-defined sesoi, researchers can therefore better assess the empirical evidence in terms of its actual theoretical and practical significance.

we illustrate below how these rules may apply in the context of the privacy paradox research. we set a predetermined sesoi of r = -.05 (funder & ozer, 2019), hence a null region of r = -.05 to .05, and depict hypothetical interval estimates in the upper panel of figure 2. corresponding to the four rules in dienes (2014), the depicted scenarios are interpreted as follows:

1. rule 1: if the interval falls completely within the null region, accept the null region hypothesis. this case therefore presents evidence that leads to the acceptance of the privacy paradox hypothesis (i.e., privacy concerns are unrelated to disclosure behavior).

2. rule 2: if the interval falls completely outside of the null region, reject the null region hypothesis. in this case, as the interval has no overlap with the null region and lies on its negative side, it is unambiguous evidence for the alternative hypothesis (i.e., privacy concerns are negatively related to disclosure behavior). the privacy paradox is rejected.

3. rule 3: if the interval overlaps with the null region on only one side, reject the corresponding directional hypothesis. in this case, the upper limit of the interval is below the sesoi, thereby rejecting the positive effect hypothesis. in other words, the hypothesis that there is a positive relationship between privacy concerns and disclosure behavior (i.e., a stronger version of the privacy paradox) is rejected. we suspend judgement regarding a negative effect hypothesis or a null region hypothesis.

4. rule 4: if the interval contains values both above and below the null region, suspend judgement. in this case, the observed interval extends beyond the null region on both sides, which means that the data are insensitive; thus, no conclusion can be made.

the lower panel of figure 2 displays real data from the two meta-analyses on the relationship between privacy concerns and information sharing: the 95% confidence intervals of the overall effect sizes. baruh et al.'s (2017) data ([-.18, -.07]) fall entirely outside and below the null region ([-.05, .05]), thus squarely rejecting the privacy paradox. to interpret yu et al.'s (2020) data ([-.12, -.01]): first, the null region hypothesis cannot be accepted (i.e., rule 1 does not apply). second, the positive directional hypothesis is rejected (i.e., rule 3 applies), meaning that there is evidence against a positive relationship between privacy concerns and disclosure behavior. we suspend judgement regarding the existence of a negative effect or no effect.

1b. the bayes factor approach

another alternative is to use bayes factors, which compare the probability of the data under two competing hypotheses, in this case an alternative hypothesis and the null hypothesis (dienes, 2014). bayes factors (b), ranging from 0 to infinity, indicate that "data are b times more likely under the alternative than under the null" (dienes, 2014, p. 4).
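the four interval decision rules above can be expressed as a small helper function. this is an illustrative python sketch; the function name and the treatment of exact boundary values are our own choices, not taken from dienes (2014):

```python
def interval_decision(lower, upper, sesoi=0.05):
    """classify a confidence/credibility interval [lower, upper] against
    a null region [-sesoi, +sesoi], following the rules in dienes (2014)."""
    if -sesoi <= lower and upper <= sesoi:
        return "accept null region hypothesis"      # rule 1: fully inside
    if upper < -sesoi or lower > sesoi:
        return "reject null region hypothesis"      # rule 2: fully outside
    if upper < sesoi and lower < -sesoi:
        return "reject positive effect hypothesis"  # rule 3: one-sided overlap (low)
    if lower > -sesoi and upper > sesoi:
        return "reject negative effect hypothesis"  # rule 3: one-sided overlap (high)
    return "suspend judgement"                      # rule 4: extends beyond both sides

# the two meta-analytic intervals discussed in the text
print(interval_decision(-0.18, -0.07))  # baruh et al. (2017)
print(interval_decision(-0.12, -0.01))  # yu et al. (2020)
```

applied to the two reported intervals, the helper reproduces the verbal conclusions: baruh et al. (2017) triggers rule 2 (reject the null region) and yu et al. (2020) triggers rule 3 (reject the positive effect hypothesis).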
a value of b larger than 1, therefore, suggests greater evidence for the alternative hypothesis than for the null. although there is no absolute cutoff point for b (unlike the dichotomized p-value for statistical significance), the following guideline has been suggested to ease its interpretation: a b value greater than 3 indicates "substantial" (jeffreys, 1961) or, more recently, "moderate" (lee & wagenmakers, 2013) evidence for the alternative hypothesis; a b lower than 1/3 indicates substantial/moderate evidence for the null hypothesis; and values in between are considered weak or anecdotal evidence (dienes, 2014).

[figure 2. upper panel: illustration of the decision rules proposed by dienes (2014) using hypothetical data, with a smallest effect size of interest of r = -.05 and a null region of r = -.05 to .05. lower panel: 95% confidence intervals of the relation between privacy concerns and information sharing as reported by the two meta-analyses, yu et al. (2020) and baruh et al. (2017). conclusion: baruh et al.'s (2017) data reject the null region (i.e., the privacy paradox) hypothesis; for yu et al.'s (2020) data, the positive effect (i.e., privacy concerns increase disclosure behavior) hypothesis is rejected.]

to apply bayes factors to the privacy paradox research, we could postulate a small negative correlation (r = -.10) for the alternative hypothesis and a null effect (r = 0) for the privacy paradox hypothesis. we then compare the two hypotheses by calculating a bayes factor (dienes, 2008), assuming that the effect is normally distributed with a standard deviation of r / 2 = .05 (see dienes, 2014). using the data from yu et al. (2020), the resulting b is 3.96 (see online supplementary material at https://osf.io/qexpf/).
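the calculation behind such a bayes factor can be approximated with a short python sketch. note the assumptions: we back-compute a point estimate and standard error from yu et al.'s reported 95% ci ([-.12, -.01]) and work on the raw correlation scale, whereas the b = 3.96 in the supplementary material was computed from the full meta-analytic data (dienes-style calculators also typically work on fisher's z), so the value obtained here differs somewhat:

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bayes_factor(obs, se, h1_mean, h1_sd):
    """b = p(data | h1) / p(data | h0), with h0: effect = 0 and
    h1: effect ~ normal(h1_mean, h1_sd). a normal likelihood averaged
    over a normal prior is again normal, so the marginal likelihood
    under h1 needs no numerical integration."""
    likelihood_h1 = normal_pdf(obs, h1_mean, math.sqrt(se ** 2 + h1_sd ** 2))
    likelihood_h0 = normal_pdf(obs, 0.0, se)
    return likelihood_h1 / likelihood_h0

# values reconstructed from the reported ci (our assumption, not the raw data)
obs = (-0.12 + -0.01) / 2             # ci midpoint as the point estimate
se = (-0.01 - (-0.12)) / (2 * 1.96)   # ci half-width divided by 1.96
b = bayes_factor(obs, se, h1_mean=-0.10, h1_sd=0.05)
# this rough version also lands in the "moderate evidence for h1" range (b > 3)
```

the design choice worth noting is that h1 is a full predictive model (a distribution over plausible effect sizes), not a point value, which is what lets b quantify evidence for as well as against the null.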
in other words, the alternative hypothesis of a small negative effect is about four times more likely than the null hypothesis, which constitutes at least moderate evidence against the privacy paradox.

similarly, instead of referring to a null effect, we can also use bayes factors to compare other informative hypotheses. for example, combining the bayes and sesoi logics, we can compare the probability of all meaningful negative effects (say, h1: r < -.05) against its complement (i.e., h2: r ≥ -.05). h2 thus captures all values we would consider "paradoxical", including null effects and positive effects. using the r package "bain" (van lissa et al., 2020) and the data from yu et al. (2020), we compared both hypotheses and found that h1 is 7.54 times more likely than its complement h2. in other words, given the data collected by yu et al. (2020), the theory that there is no privacy paradox is about 7 times more likely than the theory that the privacy paradox exists.

recommendation 2: rethinking the theoretical/conceptual model

as we discussed in problem 3, the theoretical relationship between privacy concerns and risk perceptions is open to rethinking and subject to future empirical testing. in yu et al.'s (2020) proposed/final model, privacy concerns and perceived privacy risks are modeled as parallel predictors of disclosure intentions and/or behavior, and perceived risks are treated as a control variable (see top panel of figure 1). we encourage future research to consider alternative models. for example, perceived privacy risks can be posited as a mediator between privacy concerns and disclosure behavior (see figure 3).

[figure 3. our suggested theoretical model for future research, in which perceived privacy risks and disclosure intention mediate the effect of privacy concerns on disclosure behavior.]
to explain: privacy concerns are often conceptualized as general, trait-like, and intuitive factors; perceived privacy risks, by contrast, are often understood as specific, state-like, and rational factors (bol et al., 2018). because general dispositions often precede more specific cognitions (fishbein & ajzen, 1975), general concerns about privacy may likewise shape more specific perceived privacy risks (dienlin & trepte, 2015; heirman et al., 2013). supporting our theoretical model, a large body of empirical research has also analyzed privacy concerns as predictors of perceived privacy risk (e.g., keith et al., 2013; lancelot miltgen et al., 2013; li et al., 2011; zhou, 2015; see the review in gerber et al., 2018). we encourage future researchers to take up the theoretical and empirical tasks of explicitly clarifying the relationship between privacy concerns and risk perceptions. finding the "correct" model is important: statistically controlling for variables that really are mediators will lead to false results (rohrer, 2018). such clarification will enable more precise modeling and hence more accurate evidence in examining the privacy paradox.

recommendation 3: implementing open science practices

researchers are humans, and humans make mistakes. reporting errors such as numerical inconsistencies are common in the social sciences in general (see the review in nuijten et al., 2016). whereas human errors are often inevitable, we as researchers should nonetheless help one another enhance the rigor of our research processes to avoid, detect, and correct errors. to ensure a discipline's self-scrutiny and hence self-correction, we encourage online privacy researchers to increase openness and transparency. the recent open science movement in the social sciences (munafò et al., 2017; nosek et al., 2015), evoking mertonian norms such as communalism and organized skepticism (merton, 1942), seeks to promote such values and norms in research practices.
transparency is key to improving research reproducibility and replicability, a major goal in the face of the replication crisis (camerer et al., 2016; camerer et al., 2018; open science collaboration, 2015). open science practices include adherence to reporting standards in publications, preregistration of study plans, data sharing, and reproducible workflow documentation (for overviews, see christensen et al., 2019; dienlin et al., 2021; munafò et al., 2017). transparency also means greater efficiency for the research community, as we can share resources for error-checking, replication, and developing new studies. for future meta-analyses, we encourage researchers to engage in more open science practices to build cumulative research. we specifically recommend preregistering analyses, complying with reporting standards, and making the data and other essential materials of the research process publicly available (lakens et al., 2016).

conclusion

meta-analyses do not offer conclusive findings for an area of research. notably, and not discussed in yu et al. (2020), another meta-analysis on the privacy paradox finds a significant negative relationship between privacy concerns and information sharing (r = -.13; baruh et al., 2017), which speaks against the privacy paradox. the meta-review by gerber et al. (2018, p. 226) concludes that "[...] strong predictors for privacy behavior are privacy intention, willingness to disclose, privacy concerns and privacy attitude" (emphasis added). the meta-analysis by yu et al. (2020), like others, represents only one assessment of the area of research and should not be taken as definitive.

in this comment, we lay out evidence and arguments that question the validity of yu et al.'s (2020) data, analyses, and results. based on the accessible information, we contest their conclusion that the privacy paradox exists. re-analyzing their data rather seems to provide some evidence against this paradox.
as we have emphasized, due to the underreporting in yu et al.'s (2020) paper, we were unable to assess the validity of the reported data when performing our re-analyses. this caveat, we hope, serves both to mark the limitations of this comment and to accentuate the importance of standardized reporting and data transparency for empirical researchers, including meta-analysts. in closing, we believe that the privacy paradox remains an open question in need of further theoretical and empirical efforts. we hope that this comment presents a constructive engagement with yu et al.'s (2020) meta-analysis and inspires more theory-based, rigorous, and open research on the privacy paradox in the future.

author contact

tobias dienlin, university of vienna, department of communication, 1090 vienna, austria. e-mail: tobias.dienlin@univie.ac.at. orcid: 0000-0002-6875-8083. ye sun, city university of hong kong, department of media and communication, 83 tat chee avenue, kowloon tong, kowloon, hong kong. e-mail: yesun27@cityu.edu.hk. orcid: 0000-0001-8551-2037. corresponding author: tobias dienlin.

acknowledgements

td has already published on the privacy paradox, and in most of his studies he found evidence against it. we would like to thank malte elson, niklas johannes, philipp masur, julia rohrer, and michael scharkow for valuable feedback on this submission.

conflict of interest and funding

both authors declare no conflicts of interest. while working on the manuscript, td received funding from the volkswagen foundation.

author contributions

td and ys wrote the article; td ran the reanalysis and wrote the code; td supervised the project. authorship order was determined by magnitude of contribution.

open science practices

this article earned the open materials badge for making the materials openly available. it has been verified that the analysis reproduced the results presented in the article.
this is a commentary that analyzed a published article, and as such has no new data. the editorial process for this article relied on streamlined peer review where peer reviews obtained from previous journal(s) were moved forward and used as the basis for the editorial decision. these reviews are shared in the supplementary files, as part of the authors' cover letter. the identities of the reviewers are shown or hidden in accordance with the policy of the journal that originally obtained them. the entire editorial process is published in the online supplement.

references

baruh, l., secinti, e., & cemalcilar, z. (2017). online privacy concerns and privacy management: a meta-analytical review. journal of communication, 67(1), 26–53. https://doi.org/10.1111/jcom.12276
bol, n., dienlin, t., kruikemeier, s., sax, m., boerman, s. c., strycharz, j., helberger, n., & de vreese, c. h. (2018). understanding the effects of personalization as a privacy calculus: analyzing self-disclosure across health, news, and commerce contexts. journal of computer-mediated communication, 23(6), 370–388. https://doi.org/10.1093/jcmc/zmy020
camerer, c. f., dreber, a., forsell, e., ho, t.-h., huber, j., johannesson, m., kirchler, m., almenberg, j., altmejd, a., chan, t., heikensten, e., holzmeister, f., imai, t., isaksson, s., nave, g., pfeiffer, t., razen, m., & wu, h. (2016). evaluating replicability of laboratory experiments in economics. science, 351(6280), 1433–1436. https://doi.org/10.1126/science.aaf0918
camerer, c. f., dreber, a., holzmeister, f., ho, t.-h., huber, j., johannesson, m., kirchler, m., nave, g., nosek, b. a., pfeiffer, t., altmejd, a., buttrick, n., chan, t., chen, y., forsell, e., gampa, a., heikensten, e., hummer, l., imai, t., . . . wu, h. (2018). evaluating the replicability of social science experiments in nature and science between 2010 and 2015. nature human behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z
cheung, m. w.-l. (2019). some reflections on combining meta-analysis and structural equation modeling. research synthesis methods, 10(1), 15–22. https://doi.org/10.1002/jrsm.1321
cheung, m. w.-l. (2021). meta-analytic structural equation modeling. in oxford research encyclopedia of business and management. oxford university press. https://doi.org/10.1093/acrefore/9780190224851.013.225
christensen, g., freese, j., & miguel, e. (2019). transparent and reproducible social science research: how to do open science. university of california press.
cliff, n. (1983). some cautions concerning the application of causal modeling methods. multivariate behavioral research, 18(1), 115–126. https://doi.org/10.1207/s15327906mbr1801_7
contena, b., loscalzo, y., & taddei, s. (2015). surfing on social network sites. computers in human behavior, 49, 30–37. https://doi.org/10.1016/j.chb.2015.02.042
dienes, z. (2014). using bayes to get the most out of non-significant results. frontiers in psychology, 5. https://doi.org/10.3389/fpsyg.2014.00781
dienlin, t., johannes, n., bowman, n. d., masur, p. k., engesser, s., kümpel, a. s., lukito, j., bier, l. m., zhang, r., johnson, b. k., huskey, r., schneider, f. m., breuer, j., parry, d. a., vermeulen, i., fisher, j. t., banks, j., weber, r., ellis, d. a., . . . de vreese, c. (2021). an agenda for open science in communication. journal of communication, 71(1), 1–26. https://doi.org/10.1093/joc/jqz052
dienlin, t., & trepte, s. (2015). is the privacy paradox a relic of the past? an in-depth analysis of privacy attitudes and privacy behaviors. european journal of social psychology, 45(3), 285–297. https://doi.org/10.1002/ejsp.2049
fishbein, m., & ajzen, i. (1975). belief, attitude, intention, and behavior: an introduction to theory and research. addison-wesley.
freedman, d. a. (1987). a rejoinder on models, metaphors, and fables. journal of educational statistics, 12(2), 206–223. https://doi.org/10.2307/1164900
funder, d. c., & ozer, d. j. (2019). evaluating effect size in psychological research: sense and nonsense. advances in methods and practices in psychological science, 2(2), 156–168. https://doi.org/10.1177/2515245919847202
gerber, n., gerber, p., & volkamer, m. (2018). explaining the privacy paradox: a systematic review of literature investigating privacy attitude and behavior. computers & security, 77, 226–261. https://doi.org/10.1016/j.cose.2018.04.002
greenland, s., senn, s. j., rothman, k. j., carlin, j. b., poole, c., goodman, s. n., & altman, d. g. (2016). statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. european journal of epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3
hallam, c., & zanella, g. (2017). online self-disclosure: the privacy paradox explained as a temporally discounted balance between concerns and rewards. computers in human behavior, 68, 217–227. https://doi.org/10.1016/j.chb.2016.11.033
hayes, a. f. (2013). introduction to mediation, moderation, and conditional process analysis: a regression-based approach. guilford press. http://lib.myilibrary.com/detail.asp?id=480011
heirman, w., walrave, m., & ponnet, k. (2013). predicting adolescents' disclosure of personal information in exchange for commercial incentives: an application of an extended theory of planned behavior. cyberpsychology, behavior, and social networking, 16(2), 81–87. https://doi.org/10.1089/cyber.2012.0041
jeffreys, h. (1961). the theory of probability (3rd ed.). oxford university press.
keith, m. j., thompson, s. c., hale, j., lowry, p. b., & greer, c. (2013). information disclosure on mobile devices: re-examining privacy calculus with actual user behavior. international journal of human-computer studies, 71(12), 1163–1173. https://doi.org/10.1016/j.ijhcs.2013.08.016
kenny, d. a., kaniskan, b., & mccoach, d. b. (2015). the performance of rmsea in models with small degrees of freedom. sociological methods & research, 44(3), 486–507. https://doi.org/10.1177/0049124114543236
kezer, m., sevi, b., cemalcilar, z., & baruh, l. (2016). age differences in privacy attitudes, literacy and privacy management on facebook. cyberpsychology: journal of psychosocial research on cyberspace, 10(1). https://doi.org/10.5817/cp2016-1-2
kline, r. b. (2016). principles and practice of structural equation modeling (4th ed.). the guilford press.
kokolakis, s. (2017). privacy attitudes and privacy behaviour: a review of current research on the privacy paradox phenomenon. computers & security, 64, 122–134. https://doi.org/10.1016/j.cose.2015.07.002
lakens, d., hilgard, j., & staaks, j. (2016). on the reproducibility of meta-analyses: six practical recommendations. bmc psychology, 4(1), 24. https://doi.org/10.1186/s40359-016-0126-3
lakens, d., scheel, a. m., & isager, p. m. (2018). equivalence testing for psychological research: a tutorial. advances in methods and practices in psychological science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
lancelot miltgen, c., popovič, a., & oliveira, t. (2013). determinants of end-user acceptance of biometrics: integrating the "big 3" of technology acceptance with privacy context. decision support systems, 56, 103–114. https://doi.org/10.1016/j.dss.2013.05.010
lee, m. d., & wagenmakers, e.-j. (2013). bayesian cognitive modeling: a practical course. cambridge university press.
li, h., sarathy, r., & xu, h. (2011). the role of affect and cognition on online consumers' decision to disclose personal information to unfamiliar online vendors. decision support systems, 51(3), 434–445. https://doi.org/10.1016/j.dss.2011.01.017
loehlin, j. c., & beaujean, a. a. (2016). latent variable models: an introduction to factor, path, and structural equation analysis (5th ed.). routledge.
maassen, e., van assen, m. a. l. m., nuijten, m. b., olsson-collentine, a., & wicherts, j. m. (2020). reproducibility of individual effect sizes in meta-analyses in psychology. plos one, 15(5), e0233107. https://doi.org/10.1371/journal.pone.0233107
merton, r. (1942). a note on science and democracy. journal of legal and political sociology, 1, 115–126.
moher, d., liberati, a., tetzlaff, j., altman, d. g., & the prisma group. (2009). preferred reporting items for systematic reviews and meta-analyses: the prisma statement. plos medicine, 6(7), e1000097. https://doi.org/10.1371/journal.pmed.1000097
munafò, m. r., nosek, b. a., bishop, d. v. m., button, k. s., chambers, c. d., percie du sert, n., simonsohn, u., wagenmakers, e.-j., ware, j. j., & ioannidis, j. p. a. (2017). a manifesto for reproducible science. nature human behaviour, 1(1). https://doi.org/10.1038/s41562-016-0021
nosek, b. a., alter, g., banks, g. c., borsboom, d., bowman, s. d., breckler, s. j., buck, s., chambers, c. d., chin, g., christensen, g., contestabile, m., dafoe, a., eich, e., freese, j., glennerster, r., goroff, d., green, d. p., hesse, b., humphreys, m., . . . yarkoni, t. (2015). promoting an open research culture. science, 348(6242), 1422–1425. https://doi.org/10.1126/science.aab2374
nuijten, m. b., hartgerink, c. h. j., van assen, m. a. l. m., epskamp, s., & wicherts, j. m. (2016). the prevalence of statistical reporting errors in psychology (1985-2013). behavior research methods, 48(4), 1205–1226. https://doi.org/10.3758/s13428-015-0664-2
open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
orben, a., & lakens, d. (2020). crud (re)defined. advances in methods and practices in psychological science, 3(2), 238–247. https://doi.org/10.1177/2515245920917961
rohrer, j. m. (2018). thinking clearly about correlations and causation: graphical causal models for observational data. advances in methods and practices in psychological science, 1(1), 27–42. https://doi.org/10.1177/2515245917745629
taddicken, m. (2014). the 'privacy paradox' in the social web: the impact of privacy concerns, individual characteristics, and the perceived social relevance on different forms of self-disclosure. journal of computer-mediated communication, 19(2), 248–273. https://doi.org/10.1111/jcc4.12052
tifferet, s. (2019). gender differences in privacy tendencies on social network sites: a meta-analysis. computers in human behavior, 93, 1–12. https://doi.org/10.1016/j.chb.2018.11.046
utz, s., & krämer, n. c. (2009). the privacy paradox on social network sites revisited: the role of individual characteristics and group norms. cyberpsychology: journal of psychosocial research on cyberspace, 3(2). www.cyberpsychology.eu/view.php?cisloclanku=2009111001&article=2
van lissa, c. j., gu, x., mulder, j., rosseel, y., van zundert, c., & hoijtink, h. (2020). teacher's corner: evaluating informative hypotheses using the bayes factor in structural equation models. structural equation modeling: a multidisciplinary journal, 1–10. https://doi.org/10.1080/10705511.2020.1745644
yu, l., li, h., he, w., wang, f.-k., & jiao, s. (2020). a meta-analysis to explore privacy cognition and information disclosure of internet users. international journal of information management, 51, 102015. https://doi.org/10.1016/j.ijinfomgt.2019.09.011
zhou, t. (2015). understanding user adoption of location-based services from a dual perspective of enablers and inhibitors. information systems frontiers, 17(2), 413–422. https://doi.org/10.1007/s10796-013-9413-1
meta-psychology, 2021, vol 5, mp.2019.1635. https://doi.org/10.15626/mp.2019.1635. article type: commentary. published under the cc-by4.0 license. open data: not applicable. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: no. edited by: henrik danielsson. reviewed by: john everett marsh, örjan dahlström. analysis reproduced by: lucija batinović. all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/nja69

dissociation between speech and emotion effects in short-term memory: a data reanalysis

stefan wiens, department of psychology, stockholm university

abstract

performance in visual serial recall tasks is often impaired by irrelevant auditory distracters. the duplex-mechanism account of auditory distraction states that if the distracters provide order cues, these interfere with the processing of the order cues in the serial recall task (interference by process). in contrast, the unitary account states that distracters capture only attention on a general level (attentional distraction) without interfering specifically with order processing. marsh et al. (2018, journal of experimental psychology: learning, memory, and cognition, 44, 882-897) reported finding a dissociation between the effects of serial recall tasks and those of a missing-item task on the disruptive effects of speech and of emotional words, as predicted by the duplex-mechanism account. critically, the reported analyses did not test specifically for the claimed dissociation. therefore, i reanalyzed the marsh et al. data and conducted the appropriate analyses.
i also tested the dissociation more directly and added a bayesian hypothesis test to measure the strength of the evidence for a dissociation. results provided strong evidence for a dissociation (i.e., crossover interaction) between effects of speech and of emotion. because the duplex-mechanism account predicts this dissociation between speech effects (interference by process) and emotion effects (attentional diversion) whereas the unitary account does not, marsh et al.’s data support the duplex-mechanism account. however, to show that this dissociation is robust, researchers are advised to replicate this dissociation in an adversarial registered report.

keywords: short-term memory, irrelevant speech, serial recall, auditory distraction

marsh et al. (2018) study

the ability to remember the order of events, which is critical for short-term memory, is commonly tested with serial recall tasks. for example, participants are shown a series of digits at a rate of one digit per second, and afterward, they are asked to recall the correct order of the digits. the irrelevant sound effect refers to the observation that irrelevant speech or other sounds presented during the task impair recall performance (beaman and jones, 1997; ellermeier and zimmer, 2014; jones and macken, 1993). one prominent explanation for this effect is the duplex-mechanism account (hughes, 2014; hughes et al., 2007), which proposes two separate mechanisms: interference by process and attentional diversion. interference by process occurs because irrelevant sounds that change over time provide order cues that are processed automatically, and these order cues interfere with the processing of the order cues in the serial recall task. attentional diversion occurs because irrelevant sounds capture attention, and this attentional diversion away from the serial recall task also impairs performance. according to the duplex-mechanism account, task impairment results from both processes.
in contrast, according to the unitary account, task impairment results only from attentional diversion (bell et al., 2019; körner et al., 2017; röer et al., 2015). in support of the unitary account, recent studies found that the content of speech (i.e., postcategorical properties such as meaning) can disrupt performance in serial recall. for example, recall performance is disrupted more by emotional words than neutral words (buchner et al., 2006; buchner et al., 2004), more by taboo words than neutral words (röer et al., 2017), and more by participants’ own names than control names (röer et al., 2013). however, as argued by marsh et al. (2018), these findings do not necessarily demonstrate that auditory distraction in serial recall can be caused only by attentional diversion. both interference by process and attentional diversion may disrupt recall performance. if so, it should be possible to dissociate these disruptive effects. in a clever study, marsh et al. (2018) conducted two experiments intended to demonstrate this dissociation. in the first experiment, participants performed a serial recall task with eight digits (1 to 8 in random order, one per 900 ms) that were either easy to read (low load) or difficult to read (high load). in low load, the digits were clearly visible, whereas in high load, the digits were embedded in random visual noise. after each series of eight digits, participants had to recall the order of the digits. recall performance was indexed by proportion correct (across serial positions). in the second experiment, participants performed a missing-item task: participants were presented with series of eight different digits in random order (as in serial recall), and at the end of each trial, the participants had to report which digit from the 1-9 range was missing from the series. thus, digit order did not have to be processed to perform the missing-item task.
performance was also indexed by proportion correct. each task comprised six conditions: quiet and five conditions with auditory distracters (15 trials per condition). the five distracter conditions were neutral words and two content categories (social and physical) each of positive and negative words. for each trial with auditory distracters, all distracters were drawn from the same condition. also, each digit in a series was accompanied by an auditory distracter, and both had the same onset. note that the two content categories (social and physical) were merged by marsh et al. (2018) on the basis of preliminary analyses. thus, there were four conditions in the final analyses: quiet, neutral, positive, and negative. because the neutral words were not emotional, they were used to capture the distracting effect of speech per se. quiet (i.e., no sound) was the control condition. thus, the speech effect was the difference of quiet minus neutral words. neutral, positive, and negative words were used to capture the distracting effect of emotion. thus, the emotion effect was the difference of neutral words minus the mean of positive and negative words. note that marsh et al. (2018) referred to a valence effect, but this term implies a specific interest in the difference between positive and negative words. instead, i prefer to refer to an emotion effect because the main interest was the difference between neutral words and emotional words (positive and negative combined). marsh et al. (2018) argued that according to the duplex-mechanism account, the effects of speech (i.e., interference by process) and of emotion (i.e., attentional diversion) should differ (i.e., be dissociated) between the serial recall task and the missing-item task. the authors argued that the following results would support a dissociation: first, in the serial recall task, attending to digits that are hard to read should decrease attentional diversion but not interference by process. 
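the difference scores defined above (speech effect = quiet minus neutral; emotion effect = neutral minus the mean of positive and negative) can be illustrated with a short sketch. the article's analyses were done in r; this is a minimal python equivalent, and the array values are made up for illustration, not the marsh et al. (2018) data:

```python
import numpy as np

# hypothetical per-participant proportion-correct scores in the four
# sound conditions (columns: quiet, neutral, positive, negative).
# values are illustrative only, not the marsh et al. (2018) data.
scores = np.array([
    [0.80, 0.70, 0.62, 0.60],
    [0.75, 0.68, 0.61, 0.59],
    [0.82, 0.73, 0.66, 0.63],
])

quiet, neutral, positive, negative = scores.T

# speech effect: quiet minus neutral
speech_effect = quiet - neutral
# emotion effect: neutral minus the mean of positive and negative
emotion_effect = neutral - (positive + negative) / 2

print(speech_effect.mean())
print(emotion_effect.mean())
```

per-participant difference scores like these are what the 2 × 2 reanalyses below operate on.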
accordingly, high load (vs. low load) should decrease the emotion effect but not the speech effect. second, performing the missing-item task (in which order cues are irrelevant) should decrease interference by process but not attentional diversion. accordingly, the missing-item task (vs. low load in the serial recall task) should decrease the speech effect but not the emotion effect. critically, marsh et al. (2018) analyzed performance data (proportion correct) with anovas and t tests, but these analyses did not test specifically for a dissociation between task effects on speech and on emotion. to address this issue, i reanalyzed the marsh et al. data, which the authors kindly shared with me. below, i review the original analyses and discuss their problems, simulate hypothetical data to illustrate these problems, conduct the appropriate analyses, test the dissociation more directly, add a bayesian hypothesis test to measure the strength of evidence for a dissociation, and discuss theoretical implications. in closing, i discuss some meta-scientific concerns. all scripts, analyses, figures, and additional material are available at the open science framework (https://osf.io/xds8g/). to facilitate open and reproducible science (munafò et al., 2017), this material includes a complete r-markdown script (baptiste, 2017; bates et al., 2015; lawrence, 2016; lüdecke, 2021; müller, 2020; r core team, 2016; singmann et al., 2020; team, 2020; wickham et al., 2019; wiens, 2017; zhu, 2020). for example, recall performance was measured as proportion correct; thus, it may violate assumptions for anovas (e.g., normality). however, because additional analyses suggested that results were unaffected (see r-markdown script), the simpler analyses with proportion correct are reported below.
original analyses

figure 1a shows the mean proportion correct for the four sound categories for both low and high load in the serial recall task of the first experiment and for the missing-item task of the second experiment. marsh et al. (2018) conducted two main analyses to test for the expected pattern of results. within the framework of null hypothesis significance testing (perezgonzalez, 2015; szucs and ioannidis, 2017; wiens and nilsson, 2017), the authors interpreted a significant finding (p < .05) as evidence for an effect (thus, significant indicates statistically significant). the first main analysis compared low and high load in the serial recall task. proportion-correct data were analyzed in a 2 × 4 mixed anova with load (low and high load) as a between-subjects variable and sound (quiet, neutral, positive, and negative) as a within-subjects variable. because the overall interaction was significant (p = 0.031 after greenhouse-geisser correction), marsh et al. (2018) interpreted this as evidence that the effect of load (low vs. high load) was different on the speech effect than on the emotion effect. in follow-up analyses, marsh et al. separately assessed the effect of load (low vs. high) on the speech effect and on the emotion effect. for speech, results showed no significant interaction between load (low and high) and sound (quiet and neutral), p = .90. for emotion, results showed that for low load, proportion correct decreased significantly (p < .05) from neutral to positive and from positive to negative, whereas for high load, proportion correct did not differ significantly among the sound conditions. according to the authors, these results support a dissociation: whereas high load had no effect on the speech effect (quiet vs. neutral), it reduced the emotion effect (performance difference among neutral, positive, and negative). the second main analysis compared the missing-item task with low load in the serial recall task. notably, marsh et al.
(2018) did not conduct a 2 × 4 anova (as in the first analysis) but analyzed the effect of task (missing item vs. low load) on the speech effect and emotion effect separately. with regard to the speech effect, an anova of proportion correct with task (missing-item task and serial recall task with low load) as a between-subjects variable and sound (quiet and neutral) as a within-subjects variable showed that the interaction in this 2 × 2 anova was significant, p = 0.008. with regard to the emotion effect, a similar anova with task and sound (neutral, positive, and negative) showed that the interaction in this 2 × 3 anova was not significant, p = 0.076.

problems with original analyses

although anovas are commonly used and have intuitive appeal, the analyses reported by marsh et al. (2018) were not specific enough to support a claim for a dissociation. the critical question is whether the speech and emotion effects differ between the tasks, that is, high versus low load in the first analysis and missing-item versus low load in the second analysis. with regard to the first analysis, high load (vs. low load) should reduce the emotion effect more strongly than the speech effect. the overall interaction in the 2 × 4 anova should be sensitive to this difference. however, this interaction is unspecific: with its 3 dfs, the interaction can be conceptualized as representing a combination of three orthogonal (i.e., statistically independent) contrasts at the same time (wiens and nilsson, 2017). therefore, the interaction may be significant because of effects that are irrelevant to the critical question. further, in both analyses, marsh et al. (2018) conducted separate tests of the speech and emotion effects, but the results do not resolve whether the two effects differed from each other.
in fact, a nonsignificant effect in one condition and a significant effect in another condition does not imply that the two conditions differ significantly from each other (gelman and stern, 2006; makin and orban de xivry, 2019; nieuwenhuis et al., 2011). accordingly, the difference between a nonsignificant effect and another, significant effect needs to be tested explicitly.

simulation

to illustrate that an omnibus interaction in an anova may be significant even though specific effects are not, i simulated data for a 2 × 4 factorial design (modelled after the first analysis in marsh et al., 2018). figure 2a shows the means for the simulated data, and figure 2b shows difference scores that isolate the speech and emotion effects (i.e., 2 × 2 design). the speech effect is the difference of quiet minus neutral, and the emotion effect is the difference of neutral minus combined positive and negative. as shown in figure 2b, low load had the same (large) effect on speech and emotion, whereas high load had the same (small) effect on speech and emotion. thus, in this 2 × 2 design, there is a main effect of load. critically, because the difference between speech and emotion was identical for both loads, there is absolutely no evidence for a dissociation of the effects of load on speech and emotion. that is, there is no interaction between load and effects on speech and emotion. in support, the overall (3-df) interaction in the 2 × 4 anova of the means was significant (p < .001), but the specific (1-df) interaction in the 2 × 2 anova of the difference scores was not (p = 1). note that exact p values (and 95% ci) are not informative because they depend on the noise in the simulated data. nonetheless, the p values illustrate that an overall interaction in the 2 × 4 anova does not necessarily support the claim that the load effect differs between speech and emotion. therefore, it is unclear whether the significant interaction in the first analysis by marsh et al. shows that the load effect differed between speech and emotion. further, because a significant effect in one condition and a nonsignificant effect in another condition does not necessarily imply a significant difference between the two conditions (gelman and stern, 2006; makin and orban de xivry, 2019; nieuwenhuis et al., 2011), the separate tests in both analyses by marsh et al. (2018) did not reveal whether there was a dissociation.

figure 1. mean proportion correct of the marsh et al. (2018) data for the serial recall task with low load (sr-low), the serial recall task with high load (sr-high), and the missing-item task (mi) for the four sound conditions (a) and for effects of speech and of emotion (b). in (b), the speech effect was the difference of quiet minus neutral (neu), and the emotion effect was the difference of neutral minus combined positive (pos) and negative (neg). the error bars denote the 95% ci for each individual mean.

reanalyses

because the analyses by marsh et al. (2018) did not resolve whether there was a dissociation, i reanalyzed the marsh et al. data to conduct the critical analyses. to simplify them, i computed difference scores to capture the speech effect and the emotion effect, as described above. figure 1b shows the difference scores for low and high load of the serial recall task and for the missing-item task. with regard to the first main analysis in marsh et al. (2018), i conducted a 2 × 2 anova of proportion correct with load (low and high) as a between-subjects variable and effect (speech and emotion) as a within-subjects variable.

figure 2. mean proportion correct of simulated data for the serial recall task with low load (sr-low) and the serial recall task with high load (sr-high) for the four sound conditions (a) and for speech and emotion (b). in (b), the speech effect was the difference of quiet minus neutral (neu), and the emotion effect was the difference of neutral minus combined positive (pos) and negative (neg). the error bars denote the 95% ci for each individual mean.

according to marsh et al., the difference of high minus low load for speech minus emotion should be positive; however, this interaction was not significant, m = 0.04, 95% ci [−0.02, 0.10], p = 0.160. this result does not support the claim of a dissociation between the effects of low and high load on the disruptive effects of speech and of emotional words. specifically, it does not provide evidence for a larger effect of load on emotion than on speech. with regard to the second main analysis in marsh et al. (2018), i conducted a 2 × 2 anova of proportion correct with load (low load in the serial recall task and missing-item task) as a between-subjects variable and effect (speech and emotion) as a within-subjects variable. according to marsh et al., the difference of missing-item task minus low load for the difference of speech minus emotion should be negative; indeed, this interaction was significant, m = −0.08, 95% ci [−0.15, −0.02], p = 0.011. this result is consistent with the claim for a dissociation between task effects on speech and on emotion. specifically, task effects (missing-item vs. low load) were larger on speech than on emotion. taken together, however, it is unclear whether these results support the claim of a dissociation by marsh et al. (2018). on the one hand, there was no significant difference in the load effect (low vs. high load) on speech and emotion. on the other hand, there was a significant difference in the task effect (missing-item task vs. low load) on speech and emotion. to resolve this issue, i propose a direct test of the dissociation.
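the simulation logic above rests on a simple fact: in a 2 (load, between) × 2 (effect, within) design of difference scores, the specific 1-df interaction is equivalent to a between-groups t test on the speech-minus-emotion difference scores. the sketch below (python rather than the article's r, with invented parameters and sample size) simulates a case in which load shrinks both effects equally, so the true interaction is zero by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 40  # participants per load group (illustrative choice)

# simulate per-participant speech and emotion effects so that high load
# shrinks both effects by the same amount: no dissociation by design.
speech_low = rng.normal(0.10, 0.05, n)
emotion_low = rng.normal(0.10, 0.05, n)
speech_high = rng.normal(0.04, 0.05, n)
emotion_high = rng.normal(0.04, 0.05, n)

# the specific (1-df) load x effect interaction reduces to a
# between-groups t test on speech-minus-emotion difference scores.
diff_low = speech_low - emotion_low
diff_high = speech_high - emotion_high
t, p = stats.ttest_ind(diff_low, diff_high)
print(t, p)  # true interaction is zero here, so p is typically large
```

an omnibus 2 × 4 anova on the raw condition means of such data can still yield a significant 3-df interaction, which is exactly the ambiguity the simulation illustrates.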
direct analysis

according to the duplex-mechanism account, there should be a strong dissociation between high load in the serial recall task and the missing-item task in their effects on speech and emotion. because interference by process should occur mainly in the serial recall task, the speech effect should decrease from the serial recall task with high load to the missing-item task. conversely, because attentional diversion should be reduced during high load, the emotion effect should increase from the serial recall task with high load to the missing-item task. thus, there should be a qualitative (crossover) interaction in that task effects should differ in their direction (berrington de gonzález and cox, 2007; vanderweele, 2015). indeed, in a 2 × 2 anova of proportion correct with load (high load and missing-item task) as a between-subjects variable and effect (speech and emotion) as a within-subjects variable, the interaction was significant, p < 0.001. that is, with regard to the difference of high load minus missing-item task for the difference of speech minus emotion, m = 0.12, 95% ci [0.05, 0.19], p = .001. follow-up t tests confirmed that the speech effect decreased from the serial recall task with high load to the missing-item task (mean difference of high load minus missing-item task = 0.06, 95% ci [0.01, 0.11], p = .011). conversely, the emotion effect increased from the serial recall task with high load to the missing-item task (mean difference of high load minus missing-item task = −0.06, 95% ci [−0.09, −0.03], p < .001). the results of this additional analysis clearly support the claim of a dissociation between the effects of the serial recall task with high load and the missing-item task on speech and on emotion. specifically, the two tasks’ effects were in opposite directions for speech and emotion.
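the interaction contrast in this direct analysis is simply the difference of the two task effects. using the unrounded task-effect estimates reported in the article (0.061 for speech and −0.062 for emotion), a quick arithmetic check reproduces the reported interaction of about 0.12:

```python
# task effects (high load minus missing-item task) as reported in the
# article: 0.061 for the speech effect and -0.062 for the emotion effect.
speech_task_effect = 0.061
emotion_task_effect = -0.062

# the 1-df interaction contrast is the difference of these differences.
interaction = speech_task_effect - emotion_task_effect
print(round(interaction, 2))  # 0.12, matching the reported interaction
```

the opposite signs of the two task effects are what make this a crossover (qualitative) interaction rather than a mere difference in magnitude.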
although these results from null hypothesis significance testing provide evidence against the null hypothesis, they are limited because they do not measure the strength of the evidence for the alternative hypothesis (dienes and mclatchie, 2018; szucs and ioannidis, 2017; wagenmakers, 2007). that is, because the alternative hypothesis is not made explicit, a statistically significant finding does not necessarily imply that the data support the alternative hypothesis (dienes, 2008, 2016; wagenmakers, love, et al., 2018; wagenmakers, marsman, et al., 2018; wagenmakers et al., 2016; wiens and nilsson, 2017). for example, a hypothetical study may find that the task effect is significantly smaller on emotion (0.15) than on speech (0.18). although this implies that the task effects differ (i.e., the difference is not nil), the difference may seem too small to be theoretically important. in contrast, bayesian hypothesis testing requires an explicit alternative hypothesis and allows one to distinguish among data that support the alternative hypothesis, support the null hypothesis, or are inconclusive (dienes, 2008, 2016; wagenmakers, love, et al., 2018; wagenmakers, marsman, et al., 2018; wagenmakers et al., 2016; wiens and nilsson, 2017). in bayesian hypothesis tests, the bayes factor (bf) compares the likelihood of the data given the null hypothesis with the likelihood of the data given an alternative hypothesis. because the bf provides a continuous measure of the strength of evidence, i computed the bf to measure the evidence for or against a dissociation between task effects on speech and on emotion. i also used a suggested interpretation scheme to represent the values with a verbal label: 3 > bf > 1 is considered anecdotal (or inconclusive) evidence, 10 > bf > 3 is considered moderate evidence, and 30 > bf > 10 is considered strong evidence (wagenmakers, love, et al., 2018). 
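a dienes-style bayes factor of the kind described here can be sketched by numerical integration: the marginal likelihood of the observed mean difference under a half-normal alternative is divided by its likelihood under a point null at zero. the snippet below is an illustration only; the function name and the input numbers (mean difference and standard error) are my own choices for the sketch, not values from the article's scripts, which may differ in detail:

```python
import numpy as np
from scipy import stats

def bf10_half_normal(mean_diff, se, h1_sd):
    """bayes factor (h1 over h0) for an observed mean difference, with a
    half-normal(0, h1_sd) prior on the true effect under h1 and a point
    null at zero (a dienes, 2014, style calculation)."""
    theta = np.linspace(0, 5 * h1_sd, 2001)      # candidate true effects
    dtheta = theta[1] - theta[0]
    prior = 2 * stats.norm.pdf(theta, 0, h1_sd)  # half-normal density
    likelihood = stats.norm.pdf(mean_diff, theta, se)
    marginal_h1 = np.sum(prior * likelihood) * dtheta  # trapezoid-free sum
    marginal_h0 = stats.norm.pdf(mean_diff, 0, se)
    return marginal_h1 / marginal_h0

# illustrative inputs only (not taken from the article):
print(bf10_half_normal(mean_diff=0.06, se=0.02, h1_sd=0.061))
```

bf01 values like those reported below are simply the reciprocal of such a bf10, so a bf01 of 16.7 corresponds to a bf10 of 0.06.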
in the direct analysis, the mean difference of high load in the serial recall task minus the missing-item task was 0.061 for speech (i.e., the distracting effect of speech was larger during high load than during the missing-item task) and was in the opposite direction for emotion (−0.062). to compute the bf, the observed speech effect (i.e., 0.061) was used to define the alternative hypothesis, that is, the task effect on speech was used as a reasonable estimate of the task effect on emotion. i used three different alternative hypotheses to assess the robustness of the results, as recommended (dienes, 2014; dienes and mclatchie, 2018) and as used in previous research in my lab (ströberg et al., 2017; wiens et al., 2019). for the uniform distribution, the true effect was supposed to fall between 0 and 0.061, and all values were equally likely. for the half-normal distribution, the true effect was modeled as a half-normal with a mean of zero and a standard deviation of 0.061. accordingly, the true effect was supposed to be greater than zero and more likely to be less than 0.061 than to be greater. for the data-driven t distribution, the true effect was modeled as a t distribution as defined by the observed speech effect (dienes and mclatchie, 2018). for the three alternative hypotheses, the bf01 ranged from 12.5 (for uniform) to 16.7 (for half-normal). because these results provide strong support for the null hypothesis (wagenmakers, love, et al., 2018), they support the claim for a dissociation: the effect of the serial recall task with high load versus the missing-item task differed between emotion and speech.

implications

these reanalyses of the marsh et al. (2018) data constitute critical tests of the claim that task effects on speech and on emotion are dissociated. the most important analysis is that of the differences between the serial recall task with high load and the missing-item task.
the duplex-mechanism account predicts a clear dissociation (i.e., crossover interaction) between the task’s effects on speech and on emotion. because interference by process should be greater during the serial recall task with high load than during the missing-item task, the speech effect should be larger during high load than during the missing-item task. conversely, because attentional diversion should be less during the serial recall task with high load than during the missing-item task, the emotion effect should be smaller during high load than during the missing-item task. in support, null hypothesis significance tests suggested that the task effect differed between speech and emotion (p < .001), and that the task effect on speech (p = .011) was opposite to the task effect on emotion (p < .001). further, bayesian hypothesis tests provided strong evidence (16.7 > bf > 12.5) that the task effect differed between emotion and speech. taken together, the present reanalyses confirmed that marsh et al. were correct in their initial claim: their data provide strong evidence for a dissociation between the task effect on speech and the task effect on emotion. the reanalyses of the marsh et al. (2018) data confirm and extend previous reports of an apparent dissociation between interference by process and attentional diversion (elliott et al., 2016; hughes et al., 2013; kattner and ellermeier, 2018). however, these studies relied on the framework of null hypothesis significance testing, and a statistically significant finding does not necessarily imply that the data support the alternative hypothesis, because the alternative hypothesis is not made explicit (dienes, 2016; wagenmakers, marsman, et al., 2018). the present findings extend previous reports because they provide a direct measure of the strength of the evidence for the idea that interference by process can be dissociated from attentional diversion.
although results from hypothesis testing are useful in determining whether there is an effect per se (haaf et al., 2019; wagenmakers, marsman, et al., 2018), a complementary approach is to estimate the effect size (calin-jageman and cumming, 2019; cumming, 2014; wasserstein et al., 2019; wiens and nilsson, 2017). at face value, the 95% confidence intervals (if viewed as likelihood intervals) suggest that the true effect sizes may be rather small (e.g., between 0.01 and 0.11 for the speech effect), but because current theories do not make quantitative predictions, it cannot be resolved whether these effect sizes are theoretically important.

meta-thoughts

in response to some concerns raised by the reviewers, i would like to discuss a few meta-psychological issues, which seem fitting for the present journal. first, dr. marsh introduced me to his article when he visited the department. although his research is outside of my area, dr. marsh encouraged me to submit the reanalysis to the original journal. the three reviewers (dr. marsh among them) were positive, but the associate editor of jep:lmc rejected the submission: “to be clear the reanalyses are certainly important, but they are of limited scope. perhaps with an additional experiment that replicates and extends the findings the current paper would make more of an independent contribution” (2019). this view is problematic because the reanalysis of the data by marsh et al. concerns directly the validity of the authors’ claim for a dissociation. importantly, because this view prioritizes the novelty of a claim over the truth of the claim (nosek et al., 2012), it hinders the critical process of self-correction in science (ferguson and heene, 2012). second, i might not have been able to conduct this reanalysis if john marsh had not shared his data willingly. therefore, any data should be readily available for reanalyses (as in the present case) and future meta-analyses.
for example, previous research on differences between interference by process and attentional diversion may have been confounded because studies used different setups (körner et al., 2017). although the design by marsh et al. (2018) avoids this confound, easy access to the data from previous studies would allow an exploratory meta-analysis to study whether effects appear to differ. therefore, scientists should consider a publication without shared material and raw data as incomplete. third, because bayesian analyses refer to terms such as strength of evidence, results may appear to be more robust than those from null hypothesis significance testing. although it is true that bayesian results are more robust than p values (dienes, 2008; wagenmakers, 2007; wagenmakers, marsman, et al., 2018), they can be “b-hacked” nonetheless (savalei and dunn, 2015). to illustrate, the direct analysis may seem optimal, but this may simply be an illusion driven by hindsight bias. similarly, i could have computed many bayes factors for various alternative hypotheses, picked the largest one, and come up with a convincing post-hoc rationale (assisted by cognitive biases) regarding why this is the optimal alternative hypothesis. accordingly, bayesian analyses are not immune to questionable research practices (john et al., 2012; simmons et al., 2011; wicherts et al., 2016). fourth, although the direct analysis supports the idea of a dissociation (marsh et al., 2018), the robustness of this dissociation is unresolved. progress in science may be described in terms of a cycle of creativity and verification (wagenmakers, dutilh, et al., 2018). in exploratory research, creativity is needed to aggregate current knowledge and data into theories. from these theories, hypotheses with specific predictions are derived.
in the next step, hypothesis-testing (i.e., confirmatory) research tries to verify these predictions by comparing the predictions with independent data (i.e., data that were not used when developing the theory). although both processes of the cycle are important, it is critical to distinguish between postdiction (exploratory research) and prediction (confirmatory research) to avoid biases, and this is easily done with preregistration (nosek et al., 2018). therefore, researchers in this area should embrace preregistering hypotheses and method to strengthen any claims for confirmatory research. fifth, because the present results of a dissociation are encouraging, it seems worthwhile to show that the dissociation is robust. although a single study may claim an effect, several independent studies need to replicate the effect before it can be considered a scientific fact (chambers, 2017; zwaan et al., 2018). the most promising approach for a replication is an adversarial registered report (nosek and errington, 2020a). researchers with one theoretical perspective invite researchers with an opposing theoretical perspective to collaborate on a study (or to serve as reviewers). the researchers from both camps need to agree on study design, method, and analyses. although the researchers do not have to agree on hypotheses (e.g., direction of predicted effect), they have to agree on method and analyses and that results will be informative no matter their outcome (nosek and errington, 2020b). then, a manuscript with introduction, method, and analyses is submitted as a registered report to a journal (chambers, 2015, 2017). because the submission does not contain results (as they are unknown), the reviewers evaluate the merits of the idea and the method. 
after peer review, the submission is locked (preregistered) and receives an in-principle acceptance: if the researchers conduct their study according to the preregistration and interpret the results sensibly, then the final paper will be accepted no matter the results. this approach minimizes biases such as hindsight bias, confirmation bias, and carking (critiquing after results are known; nosek and lakens, 2014) and promotes a productive research process (chambers, 2017). for example, if researchers think of alternative explanations upon viewing the results, then these become hypotheses for follow-up research and are not considered as actual explanations for unexpected outcomes (nosek and errington, 2020a). researchers in this area (and others) are encouraged to embrace this approach because it emphasizes the important role of replication: confronting our current theoretical understanding with new evidence (nosek and errington, 2020b).

conclusion

results provided strong evidence for a dissociation (crossover interaction) between speech effects and emotion effects. because the duplex-mechanism account predicts this dissociation between speech effects (interference by process) and emotion effects (attentional diversion) whereas the unitary account does not, marsh et al.’s (2018) data support the duplex-mechanism account. however, to show that this finding is robust, researchers are advised to replicate this finding in an adversarial registered report.

author contact

stefan wiens, psykologiska institutionen, stockholms universitet, 106 91 stockholm, sweden, su website, +468163933, sws@psychology.su.se, orcid: 0000-0003-4531-4313.

acknowledgements

i thank john marsh for providing me with the data, marco tullio liuzza and stephen pierzchajlo for advice on r, erik van berlekom for helpful discussions and editorial suggestions, and steve palmer for editing.

conflict of interest and funding

the author declares no conflict of interest.
this work was supported by marianne and marcus wallenberg (grant 2019-0102).

author contributions
stefan wiens: conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, project administration, resources, software, validation, visualization, writing-original draft, and writing-review & editing.

open science practices
this article earned the open materials badge for making the materials openly available. this is a commentary that focused on re-analyzing the findings of a published article, and as such there are no (new) data. it has been verified that the analysis reproduced the results presented in the article. this study was not preregistered. the entire editorial process, including the open reviews, is published in the online supplement.

references
baptiste, a. (2017). gridextra: miscellaneous functions for "grid" graphics (version 2.3). https://cran.r-project.org/package=gridextra
bates, d., mächler, m., bolker, b., & walker, s. (2015). fitting linear mixed-effects models using lme4. journal of statistical software, 67(1). https://doi.org/10.18637/jss.v067.i01
beaman, c. p., & jones, d. m. (1997). role of serial order in the irrelevant speech effect: tests of the changing-state hypothesis. journal of experimental psychology: learning, memory, and cognition, 23(2), 459–471. https://doi.org/10.1037/0278-7393.23.2.459
bell, r., röer, j. p., lang, a.-g., & buchner, a. (2019). distraction by steady-state sounds: evidence for a graded attentional model of auditory distraction. journal of experimental psychology: human perception and performance, 45(4), 500–512. https://doi.org/10.1037/xhp0000623
berrington de gonzález, a., & cox, d. r. (2007). interpretation of interaction: a review. the annals of applied statistics, 1(2), 371–385. https://doi.org/10.1214/07-aoas124
buchner, a., mehl, b., rothermund, k., & wentura, d. (2006). artificially induced valence of distractor words increases the effects of irrelevant speech on serial recall. memory & cognition, 34(5), 1055–1062. https://doi.org/10.3758/bf03193252
buchner, a., rothermund, k., wentura, d., & mehl, b. (2004). valence of distractor words increases the effects of irrelevant speech on serial recall. memory & cognition, 32(5), 722–731. https://doi.org/10.3758/bf03195862
calin-jageman, r. j., & cumming, g. (2019). the new statistics for better science: ask how much, how uncertain, and what else is known. the american statistician, 73, 271–280. https://doi.org/10.1080/00031305.2018.1518266
chambers, c. (2015). ten reasons why journals must review manuscripts before results are known. addiction, 110(1), 10–11. https://doi.org/10.1111/add.12728
chambers, c. (2017). the seven deadly sins of psychology: a manifesto for reforming the culture of scientific practice. princeton university press.
cumming, g. (2014). the new statistics: why and how. psychological science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
dienes, z. (2008). understanding psychology as a science: an introduction to scientific and statistical inference. palgrave macmillan.
dienes, z. (2014). using bayes to get the most out of non-significant results. frontiers in psychology, 5. https://doi.org/10.3389/fpsyg.2014.00781
dienes, z. (2016). how bayes factors change scientific practice. journal of mathematical psychology, 72, 78–89. https://doi.org/10.1016/j.jmp.2015.10.003
dienes, z., & mclatchie, n. (2018). four reasons to prefer bayesian analyses over significance testing. psychonomic bulletin & review, 25(1), 207–218. https://doi.org/10.3758/s13423-017-1266-z
ellermeier, w., & zimmer, k. (2014). the psychoacoustics of the irrelevant sound effect. acoustical science and technology, 35(1), 10–16. https://doi.org/10.1250/ast.35.10
elliott, e. m., hughes, r.
w., briganti, a., joseph, t. n., marsh, j. e., & macken, b. (2016). distraction in verbal short-term memory: insights from developmental differences. journal of memory and language, 88, 39–50. https://doi.org/10.1016/j.jml.2015.12.008
ferguson, c. j., & heene, m. (2012). a vast graveyard of undead theories: publication bias and psychological science's aversion to the null. perspectives on psychological science, 7(6), 555–561. https://doi.org/10.1177/1745691612459059
gelman, a., & stern, h. (2006). the difference between "significant" and "not significant" is not itself statistically significant. the american statistician, 60(4), 328–331. https://doi.org/10.1198/000313006x152649
haaf, j. m., ly, a., & wagenmakers, e.-j. (2019). retire significance, but still test hypotheses. nature, 567(7749), 461. https://doi.org/10.1038/d41586-019-00972-7
hughes, r. w. (2014). auditory distraction: a duplex-mechanism account. psych journal, 3(1), 30–41. https://doi.org/10.1002/pchj.44
hughes, r. w., hurlstone, m. j., marsh, j. e., vachon, f., & jones, d. m. (2013).
cognitive control of auditory distraction: impact of task difficulty, foreknowledge, and working memory capacity supports duplex-mechanism account. journal of experimental psychology: human perception and performance, 39(2), 539–553. https://doi.org/10.1037/a0029064
hughes, r. w., vachon, f., & jones, d. m. (2007). disruption of short-term memory by changing and deviant sounds: support for a duplex-mechanism account of auditory distraction. journal of experimental psychology: learning, memory, and cognition, 33(6), 1050–1061. https://doi.org/10.1037/0278-7393.33.6.1050
john, l. k., loewenstein, g., & prelec, d. (2012).
measuring the prevalence of questionable research practices with incentives for truth telling. psychological science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953
jones, d. m., & macken, w. j. (1993). irrelevant tones produce an irrelevant speech effect: implications for phonological coding in working memory. journal of experimental psychology: learning, memory, and cognition, 19(2), 369–381. https://doi.org/10.1037/0278-7393.19.2.369
kattner, f., & ellermeier, w. (2018). emotional prosody of task-irrelevant speech interferes with the retention of serial order. journal of experimental psychology: human perception and performance, 44(8), 1303–1312. https://doi.org/10.1037/xhp0000537
körner, u., röer, j. p., buchner, a., & bell, r. (2017). working memory capacity is equally unrelated to auditory distraction by changing-state and deviant sounds. journal of memory and language, 96, 122–137. https://doi.org/10.1016/j.jml.2017.05.005
lawrence, m. a. (2016). ez: easy analysis and visualization of factorial experiments (version 4.4-0). https://cran.r-project.org/package=ez
lüdecke, d. (2021). sjplot: data visualization for statistics in social science (version 2.8.7). https://cran.r-project.org/package=sjplot
makin, t. r., & orban de xivry, j.-j. (2019). ten common statistical mistakes to watch out for when writing or reviewing a manuscript. elife, 8, e48175. https://doi.org/10.7554/elife.48175
marsh, j. e., yang, j., qualter, p., richardson, c., perham, n., vachon, f., & hughes, r. w. (2018). postcategorical auditory distraction in short-term memory: insights from increased task load and task type. journal of experimental psychology: learning, memory, and cognition, 44(6), 882–897. https://doi.org/10.1037/xlm0000492
müller, k. (2020). here: a simpler way to find your files (version 1.0.0). https://cran.r-project.org/package=here
munafò, m. r., nosek, b. a., bishop, d. v. m., button, k. s., chambers, c. d., du sert, n.
p., simonsohn, u., wagenmakers, e.-j., ware, j. j., & ioannidis, j. p. a. (2017). a manifesto for reproducible science. nature human behaviour, 1(1), 1–9. https://doi.org/10.1038/s41562-016-0021
nieuwenhuis, s., forstmann, b. u., & wagenmakers, e.-j. (2011). erroneous analyses of interactions in neuroscience: a problem of significance. nature neuroscience, 14(9), 1105–1107. https://doi.org/10.1038/nn.2886
nosek, b. a., ebersole, c. r., dehaven, a. c., & mellor, d. t. (2018). the preregistration revolution. proceedings of the national academy of sciences, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114
nosek, b. a., & errington, t. m. (2020a). the best time to argue about what a replication means? before you do it. nature, 583(7817), 518–520. https://doi.org/10.1038/d41586-020-02142-6
nosek, b. a., & errington, t. m. (2020b). what is replication? plos biology, 18(3), e3000691. https://doi.org/10.1371/journal.pbio.3000691
nosek, b. a., & lakens, d. (2014). registered reports: a method to increase the credibility of published results. social psychology, 45(3), 137–141. https://doi.org/10.1027/1864-9335/a000192
nosek, b. a., spies, j. r., & motyl, m. (2012). scientific utopia: ii. restructuring incentives and practices to promote truth over publishability. perspectives on psychological science, 7(6), 615–631. https://doi.org/10.1177/1745691612459058
perezgonzalez, j. d. (2015). fisher, neyman-pearson or nhst? a tutorial for teaching data testing. frontiers in psychology, 6. https://doi.org/10.3389/fpsyg.2015.00223
r core team. (2016). r: a language and environment for statistical computing. retrieved august 1, 2019, from https://www.r-project.org/
röer, j. p., bell, r., & buchner, a. (2013). self-relevance increases the irrelevant sound effect: attentional disruption by one's own name. journal of cognitive psychology, 25(8), 925–931.
https://doi.org/10.1080/20445911.2013.828063
röer, j. p., bell, r., & buchner, a. (2015). specific foreknowledge reduces auditory distraction by irrelevant speech. journal of experimental psychology: human perception and performance, 41(3), 692–702. https://doi.org/10.1037/xhp0000028
röer, j. p., körner, u., buchner, a., & bell, r. (2017). attentional capture by taboo words: a functional view of auditory distraction. emotion, 17(4), 740–750. https://doi.org/10.1037/emo0000274
savalei, v., & dunn, e. (2015). is the call to abandon p-values the red herring of the replicability crisis? frontiers in psychology, 6. https://doi.org/10.3389/fpsyg.2015.00245
simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
singmann, h., bolker, b., westfall, j., aust, f., & ben-shachar, m. s. (2020). afex: analysis of factorial experiments (version 0.27-2). https://cran.r-project.org/package=afex
ströberg, k., andersen, l. m., & wiens, s. (2017). electrocortical n400 effects of semantic satiation. frontiers in psychology, 8, 2117. https://doi.org/10.3389/fpsyg.2017.02117
szucs, d., & ioannidis, j. p. a. (2017). when null hypothesis significance testing is unsuitable for research: a reassessment. frontiers in human neuroscience, 11, 390. https://doi.org/10.3389/fnhum.2017.00390
rstudio team. (2020). rstudio: integrated development environment for r. boston, ma. http://www.rstudio.com
vanderweele, t. j. (2015). explanation in causal inference: methods for mediation and interaction. oxford university press.
wagenmakers, e.-j. (2007). a practical solution to the pervasive problems of p values. psychonomic bulletin & review, 14(5), 779–804. https://doi.org/10.3758/bf03194105
wagenmakers, e.-j., dutilh, g., & sarafoglou, a. (2018). the creativity-verification cycle in psychological science: new methods to combat old idols. perspectives on psychological science, 13(4), 418–427. https://doi.org/10.1177/1745691618771357
wagenmakers, e.-j., love, j., marsman, m., jamil, t., ly, a., verhagen, j., selker, r., gronau, q. f., dropmann, d., boutin, b., meerhoff, f., knight, p., raj, a., van kesteren, e.-j., van doorn, j., šmíra, m., epskamp, s., etz, a., matzke, d., . . . morey, r. d. (2018).
bayesian inference for psychology. part ii: example applications with jasp. psychonomic bulletin & review, 25(1), 58–76. https://doi.org/10.3758/s13423-017-1323-7
wagenmakers, e.-j., marsman, m., jamil, t., ly, a., verhagen, j., love, j., selker, r., gronau, q. f., šmíra, m., epskamp, s., matzke, d., rouder, j. n., & morey, r. d. (2018). bayesian inference for psychology. part i: theoretical advantages and practical ramifications. psychonomic bulletin & review, 25(1), 35–57. https://doi.org/10.3758/s13423-017-1343-3
wagenmakers, e.-j., morey, r. d., & lee, m. d. (2016). bayesian benefits for the pragmatic researcher. current directions in psychological science, 25(3), 169–176. https://doi.org/10.1177/0963721416643289
wasserstein, r. l., schirm, a. l., & lazar, n. a. (2019). moving to a world beyond "p < 0.05". the american statistician, 73, 1–19. https://doi.org/10.1080/00031305.2019.1583913
wicherts, j. m., veldkamp, c. l. s., augusteijn, h. e. m., bakker, m., van aert, r. c. m., & van assen, m. a. l. m. (2016). degrees of freedom in planning, running, analyzing, and reporting psychological studies: a checklist to avoid p-hacking. frontiers in psychology, 7. https://doi.org/10.3389/fpsyg.2016.01832
wickham, h., averick, m., bryan, j., chang, w., mcgowan, l., françois, r., grolemund, g., hayes, a., henry, l., hester, j., kuhn, m., pedersen, t., miller, e., bache, s., müller, k., ooms, j., robinson, d., seidel, d., spinu, v., . . . yutani, h. (2019). welcome to the tidyverse. journal of open source software. retrieved february 1, 2021, from https://joss.theoj.org/papers/10.21105/joss.01686
wiens, s. (2017). aladins bayes factor in r. https://doi.org/10.17045/sthlmuni.4981154.v3
wiens, s., & nilsson, m. e. (2017). performing contrast analysis in factorial designs: from nhst to confidence intervals and beyond. educational and psychological measurement, 77(4), 690–715.
https://doi.org/10.1177/0013164416668950
wiens, s., szychowska, m., eklund, r., & van berlekom, e. (2019). cascade and no-repetition rules are comparable controls for the auditory frequency mismatch negativity in oddball tasks. psychophysiology, 56(1), e13280. https://doi.org/10.1111/psyp.13280
zhu, h. (2020). kableextra: construct complex table with 'kable' and pipe syntax (version 1.3.1). https://cran.r-project.org/package=kableextra
zwaan, r. a., etz, a., lucas, r. e., & donnellan, m. b. (2018).
making replication mainstream. behavioral and brain sciences, 41, e120. https://doi.org/10.1017/s0140525x17001972

meta-psychology, 2022, vol 6, mp.2020.2626 https://doi.org/10.15626/mp.2020.2626 article type: original article published under the cc-by4.0 license open data: not applicable open materials: not applicable open and reproducible analysis: not applicable open reviews and editorial process: yes preregistration: no edited by: henrik danielsson reviewed by: koppel, l., boag, s., roberts, n., beath, a. analysis reproduced by: not applicable all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/jnvcu

the importance of rigorous methods in a growing research field: five practices for asmr researchers
thomas j. hostler
manchester metropolitan university, uk

abstract
a rigorous field of research is constructed on reproducible findings that allow researchers to confidently formulate hypotheses and build theories from accessible literature. as a nascent area of research, the study of autonomous sensory meridian response (asmr) has the opportunity to become such a field through the adoption of transparent and open research practices. in this paper i outline five such practices that can help achieve this aim: preregistration, sharing data and code, sharing materials, posting preprints, and collaboration. failing to adopt such principles could allow the proliferation of findings that are irreproducible and delay the progress of the field.
keywords: autonomous sensory meridian response, asmr, open science, preregistration

background
autonomous sensory meridian response (asmr) is a sensory experience characterized by a tingling sensation on the crown of the head and feelings of calmness, relaxation, and altered consciousness. the experience is triggered by audiovisual stimuli commonly found in popular online youtube videos, including whispering, tapping and scratching sounds, close personal attention, and expert hand movements. people seek out and watch these videos to experience asmr, which produces self-reported psychological benefits including reduced stress, anxiety, depression, loneliness, and insomnia (barratt & davis, 2015). in may 2022, 'asmr' was the 3rd most popular search term on youtube in the world, with nearly 15 million searches (hardwick, 2022). despite huge public popularity, asmr has only recently become the subject of scientific enquiry. the phenomenon was first described in the academic literature by ahuja (2013), and the first empirical investigation was published by barratt and davis (2015): an online survey describing the phenomenon, common triggers, and reasons for engaging with asmr content. since then, the number of published papers on asmr has increased year on year (figure 1), indicating growing academic interest in the phenomenon. researchers have subsequently investigated the triggers of asmr (barratt et al., 2017), its physiological concomitants (poerio et al., 2018), personality correlates and co-occurrences with other sensory experiences (bedwell & butcher, 2020; fredborg et al., 2017, 2018; keizer et al., 2020; lee et al., 2019; mcerlean & banissy, 2017, 2018; mcerlean & osborne-ford, 2020), and underlying brain regions (lochte et al., 2018; smith et al., 2019a; smith et al., 2019b; smith et al., 2017), and have developed reliable self-report measures (roberts et al., 2019) and curated stimulus sets (liu & zhou, 2019).
figure 1. number of publications per year about "autonomous sensory meridian response" from 2013 to 2021, as indexed on web of science.

as asmr is a new field of study, countless novel directions are open for researchers to take. this may create an incentive to conduct and publish research quickly in order to be the first to set foot on an untouched and fertile soil of scientific discovery. it is well documented that such publishing incentives can blind researchers to biases that encourage the overinterpretation of data and the use of questionable research practices (giner-sorolla, 2012; higginson & munafò, 2016). in turn, this leads to an accumulation in the literature of spurious results, inaccurate estimates of effect sizes, and findings that are irreproducible (ioannidis, 2005; munafò et al., 2017). this is a particular concern for a new and developing field, where a 'hard core' of replicated and accepted findings has yet to be established (lakatos, 1978). to develop such a core, and to allow researchers to confidently build upon the findings of previous research, it is paramount that such findings be obtained using unbiased and rigorous practices, and that these methods be clear and transparent so that others can replicate methodologies as closely as possible. therefore, adopting transparent and rigorous research practices will accelerate the accumulation of 'established' asmr findings and, subsequently, theory-building and the field at large. conversely, using traditional 'closed' research practices is likely to facilitate the publication of findings that do not replicate (smaldino & mcelreath, 2016), leading future researchers to pursue unproductive lines of inquiry and delaying the development of the field.
it has been theorised that developing fields go through three stages: differentiation, mobilisation, and legitimacy building (hambrick & chen, 2008). asmr research has a clear unique focus, which helps to differentiate it from the study of other sensory experiences. however, it currently lacks mobilisation, in terms of organisation and resources, and legitimacy, in terms of rigorous methodologies and compelling evidence. hambrick and chen (2008) argue that the growth and legitimacy of a field depend on the quality of the research produced: "scholars in more established fields will look for indicators that the new area's research resembles a style they hold in high regard, a style that is 'on the right track.'"

in this paper, i outline five practices that asmr researchers can use to improve the legitimacy of the field, accelerate the mobilisation of the research community, and increase the transparency, rigour, and reproducibility of their research: pre-registration, sharing data and code, sharing study materials, preprints and post-prints, and collaboration. for each, i will explain its applicability to asmr research with examples.

pre-registration
pre-registration is the process of publicly and transparently reporting a study's design and analysis plan prior to data collection (nosek et al., 2018). it typically takes the form of a time-stamped online document that details a study's design, hypotheses, sampling plan, and data processing and analysis plan. the purpose of pre-registration is to preclude the use of questionable research practices, including 'p-hacking' (exploiting undisclosed flexibility in data analysis plans to obtain significant results) and 'harking' (hypothesising after the results are known, so as to present significant findings as predicted a priori) (chambers, 2017).
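the arithmetic behind the concern with undisclosed flexibility is simple: under a true null hypothesis, each independent test has a 5% chance of yielding p < .05, so trying k analyses and reporting the 'best' one inflates the false-positive rate to 1 − 0.95^k. a minimal, purely illustrative sketch (python; the numbers are generic and not taken from any asmr study):

```python
import random

# analytic false-positive rate when the "best" of k independent
# null tests is the one reported: 1 - 0.95 ** k
for k in (1, 5, 10, 20):
    print(k, round(1 - 0.95 ** k, 2))  # 0.05, 0.23, 0.4, 0.64

# quick monte carlo check: under the null, p-values are uniform(0, 1),
# so each of the k "analyses" is a 5% draw
random.seed(0)
reps, k = 20000, 10
hits = sum(
    any(random.random() < 0.05 for _ in range(k)) for _ in range(reps)
)
print(hits / reps)  # close to 1 - 0.95 ** 10, i.e. about 0.40
```

with ten forks, four out of ten purely null datasets will hand the researcher something 'significant' to report, which is why the analysis plan has to be fixed before the forks are seen.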
this does not mean that studies that were not preregistered have definitely engaged in p-hacking, or in trying multiple analyses until a significant result is obtained, but that it is impossible to tell. only by reporting one's analysis plan in advance can someone say with confidence that there was no undisclosed flexibility in the analysis. it is important to note that p-hacking is often unconscious: the result of human, fallible researchers operating with cognitive biases and within a career progression framework that incentivises the publication of significant results. faced with a multitude of analytical decisions regarding participant exclusion, data quality, and questionnaire scoring, researchers analysing data without a pre-defined plan end up walking in a "garden of forking paths" (gelman & loken, 2019), with each path leading to a different statistical outcome. the combination of the ease of post-hoc rationalisation of findings, confirmation bias, and the incentives to discover and publish significant results means that researchers are likely to publish and justify analyses that turn out as significant, even though alternative analyses may have produced different findings (chambers, 2017). a hypothetical example using asmr research illustrates this. two research teams in alternate universes each recruit 250 asmr participants and administer (amongst other things) the asmr-checklist (fredborg et al., 2017) and the "poor attentional control" subscale of the short imaginal process inventory (sipi) (huba et al., 1982). both research teams obtain exactly the same dataset and test exactly the same hypothesis: that there is a relationship between asmr and attentional control. the first research team, anna and abel, proceed with data analysis as follows: when analysing their data, they decide to exclude 10 people who report in a qualitative response that they experience atypical asmr (e.g.
they only experience it in response to real-world triggers), in order to get a more homogeneous sample of only those who get a typical asmr experience from videos. when scoring the sipi, they follow the instructions from huba et al. (1982): they begin with a baseline score of 42 and add/subtract the scores of specific questions to produce an overall score. they notice that the distribution of the sipi scores is non-normal due to skewness and a number of extreme scores, so they follow textbook instructions to use a non-parametric test for the analysis. they find no significant relationship between asmr and poor attentional control (rho = -.135, p = .091) and conclude that the study does not provide sufficient evidence that the constructs are related. in their discussion, they suggest future research may want to focus on alternative constructs. in the alternate universe, two other researchers, brett and bushra, test the same hypothesis with the same dataset, but in a different way: when analysing their data, they decide to keep those who report that they experience atypical asmr, as they think that asmr is probably quite a heterogeneous experience overall and they do not want to make a judgement on what 'counts' as asmr. however, when scoring the asmr-checklist, they decide to exclude the trigger of "watching someone apply makeup" from the scoring, as they find a relatively low response rate for this trigger and believe it could be unduly influenced by gender and cultural norms. when scoring the sipi, they use a traditional scoring method of recoding negative items and computing a mean score of the items. they notice that the distribution of the sipi scores is non-normal due to skewness and a number of extreme scores, so they follow textbook instructions to log-transform the data before analysis. they find a significant negative relationship between asmr and poor attentional control (r = -.195, p = .035) and conclude that the constructs are related.
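the two forking paths above can be sketched in code. everything below is hypothetical: the data are simulated, and the exclusion, scoring, and transformation rules are arbitrary stand-ins for the decisions described; the only point is that two defensible pipelines applied to the same dataset return different p-values.

```python
import math
import random
import statistics

random.seed(42)

# simulated stand-in for the hypothetical dataset: 250 participants with
# an asmr-checklist total and a sipi "poor attentional control" score
n = 250
asmr = [random.gauss(50, 10) for _ in range(n)]
sipi = [0.05 * a + random.gauss(40, 8) for a in asmr]  # weak built-in link

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def ranks(v):
    # rank-transform (ties are negligible for continuous simulated data)
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0.0] * len(v)
    for pos, i in enumerate(order):
        r[i] = float(pos)
    return r

def perm_p(x, y, reps=2000):
    # two-sided permutation p-value for a correlation coefficient
    obs = abs(pearson(x, y))
    y2 = list(y)
    hits = 0
    for _ in range(reps):
        random.shuffle(y2)
        if abs(pearson(x, y2)) >= obs:
            hits += 1
    return hits / reps

# path a: drop 10 "atypical" participants, then use a rank-based
# (non-parametric) test on the remaining 240
keep = sorted(range(n), key=asmr.__getitem__)[10:]
xa = [asmr[i] for i in keep]
ya = [sipi[i] for i in keep]
p_a = perm_p(ranks(xa), ranks(ya))

# path b: keep everyone, but log-transform the outcome before testing
p_b = perm_p(asmr, [math.log(s) for s in sipi])

print(p_a, p_b)  # same raw data, two defensible pipelines, two p-values
```

neither path is 'wrong' in isolation; the problem is that the choice between them is only made after seeing the data, which is exactly what a pre-registered analysis plan rules out.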
in their discussion, they suggest future research should explore this relationship further. in both of these examples, there were a number of decisions the researchers had to take regarding the analysis: who to exclude, how to score the questionnaires, which analysis to use. many of these decisions were data-driven: brett and bushra would not have decided to remove that specific trigger from the checklist scoring until seeing its relatively low response rate. all of these decisions were reasonable and justified, but made post hoc, and the particular decisions made led to different conclusions. in both cases, the eventual p-value was accepted and interpreted as evidence for or against the pre-planned hypothesis, and the finding incorporated into the literature. due to the small existing literature base and lack of current theory, it would not be difficult to come up with plausible explanations of why asmr may (or may not) be related to poor attentional control, and to suggest diverging directions for future research. however, the different 'paths' taken by the researchers illustrate that, despite testing the same a priori hypothesis, their analyses were essentially exploratory, meaning that it would be wrong to draw any firm conclusions from the data. pre-registering analyses means that researchers have to think more carefully in advance about their hypotheses and justify analysis decisions up front: what 'counts' as asmr for the research question i'm interested in? what criteria for data quality am i willing to accept? what are the psychometric properties of my questionnaire, and is it suitable to answer the research question? answering these questions may be difficult, but not impossible, with some thought, knowledge of the literature, and good methodological practice. one concern many researchers have with pre-registration is that it will prevent them from running exploratory tests, or will constrain their analysis.
what if the data turn out nothing like they thought and they cannot run the test they preregistered? this concern is understandable but ill-founded. pre-registration does not prevent researchers from performing or reporting non-preregistered analyses, or testing hypotheses that they thought of at a later date. however, it necessitates the accurate reporting of such tests as exploratory, and with it the implication that no 'hard' inferences can be drawn from the data. the commonly used procedure of null hypothesis significance testing (nhst) means that a proportion of 'significant' results are expected by chance. as exploratory analyses are by definition a search for potentially 'chance' findings, conclusions drawn from exploratory tests need to be replicated in order to add value to the literature. on the flip side, a statistical test that supports a pre-registered hypothesis allows one to draw much more confident inferences from the results and provides a much sturdier finding upon which to build. studies can be pre-registered on websites including aspredicted (http://aspredicted.org) or the open science framework (osf) (http://osf.io). an example of a preregistration for an asmr study (poerio et al., 2022) can be found at https://osf.io/pjq6s, which we are happy for other researchers to use as a guide, although we do not claim that it is a perfect example. o. klein et al. (2018) present a useful guide for writing pre-registrations.

registered reports

a recent initiative in scientific publishing that utilises the idea of pre-registration for maximum benefit is the registered report (rr) (chambers & tzavella, 2020). an rr is a specific type of article which is submitted to a journal prior to data collection. a researcher will outline the background and idea for their research, and submit this along with their methodology and pre-registration for the study they wish to run.
at this point, the study goes out for peer review, and may return with revisions if necessary. once any revisions have been satisfied, the journal will grant the paper "in principle acceptance", meaning they guarantee to publish the subsequent results of the study, regardless of the outcomes. rrs therefore combine the methodological rigor of a pre-registered study with a commitment to publish null results if necessary, thus correcting for publication bias in the literature. given that publication bias can severely hamper progress in a psychological field (ferguson & heene, 2012), the widespread use of rrs in the new field of asmr research would help to prevent the problem of publication bias occurring in the first place. at the time of writing, journals that currently or will soon offer rrs and that are likely to be open to publishing research on asmr include: affective science; attention, perception, & psychophysics; cognition and emotion; consciousness and cognition; experimental psychology; frontiers in psychology; international journal of psychophysiology; i-perception; journal of personality and social psychology; nature human behaviour; neuroscience of consciousness; perception; plos one; and psychological science. a full live list of journals that accept rrs can be found at: https://cos.io/rr/

sharing data and code

as well as making data analysis decisions transparent via pre-registration, researchers should also share the data itself, and the analysis code used to produce the results. 'sharing' in this context means making the data available on an accessible online repository, rather than authors responding to ad hoc requests to share data via email.
empirical research (and the personal experience of anyone who has done a meta-analysis) confirms that sharing data on an ad hoc basis is highly ineffective for a number of reasons, including poor record keeping and authors changing institutions, and availability declines rapidly over time (savage & vickers, 2009; vines et al., 2014; wicherts et al., 2011; wicherts et al., 2006). making data readily available has multiple benefits for cumulative science. the first is that when data is made available, other researchers can check whether they agree with the conclusions drawn from the data by conducting their own analyses. this has already been evidenced in asmr research: we (hostler et al., 2019) reinterpreted the data from the cash et al. (2018) study by accessing the raw dataset and visualizing it in a different way than presented in the original paper. by doing this, we came to a different conclusion and therefore attempted to 'correct' the literature in the spirit of the self-correcting mechanism of science (merton, 1974). this would not have been possible without access to the raw data. the second benefit of providing raw data is that researchers can perform alternate analyses on the data that the original authors did not or could not, to maximise the utility of the original research. it is impossible to know what other researchers might find interesting or need to know about the data, and therefore not possible to know which statistics and analyses would need to be reported. a hypothetical example is that of a future meta-analyst who wishes to compute an overall effect size from past experimental asmr research. following the guidelines of lakens (2013), they find that in order to compute an accurate effect size from previous research utilising within-subjects designs, they need information about the correlation (r) between the two repeated measures.
this statistic is very rarely reported in studies, but by making the raw data available, it doesn't have to be: the meta-analyst can download the data and calculate it themselves. another example is that of mcerlean and osborne-ford (2020), who were able to access the open data of fredborg et al. (2018) and perform additional analyses to facilitate the comparison of their findings. a third benefit concerns the sharing of analysis code. this allows the computational reproducibility of the findings and is an important mechanism for error detection. by uploading the code used to produce the findings reported in the paper, other researchers can see exactly 'which buttons were pressed' to reach the final numbers, and whether the preregistration was followed (if used). to maximise accessibility, researchers should try to use non-proprietary analysis software such as r, so that any researcher has the capacity to reproduce the analysis, regardless of institutional access to proprietary programs such as spss or stata. the 'gold standard' would be to use online container technology to recreate the virtual environment in which the original code was executed, in case different versions of software produce different results (see clyburne-sherin et al., 2018). if this is not possible, then researchers should still share analysis code from proprietary programs such as spss via the syntax function: partial accessibility is better than none. data sharing is becoming increasingly common across psychology in general, thanks in part to journal and funder requirements (houtkoop et al., 2018), and a few current examples in asmr research have been highlighted above. the challenge for the future is to ensure that data sharing in asmr research is done consistently, and in an ethical and thoughtful way that maximises its benefit, not merely as an afterthought to meet journal requirements.
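to make the meta-analyst example above concrete, here is a minimal sketch (with invented paired scores) of what access to raw data allows: recovering the correlation between the two repeated measures and plugging it into a within-subjects effect size. the d_rm formula follows my reading of lakens (2013); the data are hypothetical.

```python
# sketch (hypothetical data): with raw paired scores available, a meta-analyst
# can compute the correlation (r) between repeated measures themselves and use
# it for a within-subjects effect size (d_rm, following lakens, 2013).
from math import sqrt

pre  = [3.1, 4.0, 2.5, 3.8, 3.3, 2.9, 4.2, 3.6]   # invented pre scores
post = [3.9, 4.4, 3.1, 4.1, 3.5, 3.4, 4.6, 3.8]   # invented post scores
n = len(pre)

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):  # sample standard deviation (n - 1 denominator)
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

# pearson correlation between the two repeated measures
mx, my = mean(pre), mean(post)
r = (sum((x - mx) * (y - my) for x, y in zip(pre, post))
     / ((n - 1) * sd(pre) * sd(post)))

diffs = [b - a for a, b in zip(pre, post)]
d_z  = mean(diffs) / sd(diffs)       # standardised by the sd of differences
d_rm = d_z * sqrt(2 * (1 - r))       # corrected using the correlation r

print(round(r, 3), round(d_z, 2), round(d_rm, 2))
```

note how d_rm depends directly on r: without the raw data (or a reported r), the meta-analyst simply cannot compute it accurately.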
table 1 highlights some potential issues for sharing data and code in asmr research and some possible solutions. for more guidance see meyer (2018).

sharing study materials

one of the most important transparent research practices for asmr research is the sharing of study materials, such as asmr videos. in experimental designs, researchers will often want to use asmr videos to induce asmr in order to measure the asmr response and any influences on this. in correlational research, researchers may want to use asmr videos to verify the 'asmr status' of participants and confirm whether they experience the sensation (hostler et al., 2019). in order to improve consistency amongst studies and the reproducibility of findings, it is important to know exactly which videos are used in asmr research for these purposes. most researchers will take asmr videos created by asmrtists (content creators) from youtube, rather than create original asmr content themselves. whilst the use of these videos in research itself is likely to fall under 'fair use' copyright laws, it is unclear whether sharing the original video files on other public repositories may infringe copyright. depending on the number and length of videos, it may also be prohibitive due to the large file sizes. researchers can circumvent this issue by sharing information on the original sources of the videos used, including urls and timestamps (for an example from poerio et al. (2018) see https://osf.io/u374f/). researchers should also cite the exact video number used from stimuli sets such as liu and zhou (2019), and include descriptions of the content of the videos used in the manuscript or supplementary materials, in case of broken links. as asmr research is in its infancy, there is currently little knowledge about how the specific content and triggers in an asmr video may affect the asmr response.
however, this is the sort of question that could prove to be important in future systematic reviews of asmr research – if sufficient information is available to investigate it. ensuring that the exact content of videos used in asmr research is accessible will allow future meta-analysts to code for any number of factors – for example, which triggers are present, or whether the asmrtist is male or female – as moderating factors in analyses of asmr effects.

table 1: issues and solutions for data sharing in asmr research.

ethical concerns. most data from asmr experiments are unlikely to be particularly sensitive, but in some cases researchers may collect data that necessitates increased anonymity (e.g., data on mental health diagnoses), via the removal of potentially identifying information (e.g., age and gender). participants must also consent for their data to be shared, so researchers should be careful to avoid claiming that data will be "kept confidential" or "destroyed" in their ethics documentation and participant information sheets.

data comprehensibility. shared data is only useful if other researchers can interpret it. researchers should include supplementary "data dictionary" files alongside data files, to explain what each variable represents and how any categorical variables have been coded.

data accessibility. where possible, data should be provided in file formats for non-proprietary software, for example .csv files. ideally, syntax or data analysis code should include any pre-processing of data, including scoring questionnaires and removing outliers, as well as analyses. file names should be machine readable, utilising underscores rather than spaces, and include information about what the file contains (e.g. "asmr_study1_eegdata.csv").

data storage. data and analysis code files can often be hosted as supplementary files on journal websites, or institutional data repositories. however, hosting the data on a searchable repository such as the open science framework (http://osf.io), figshare (http://figshare.com) or zenodo (http://zenodo.org) can increase discoverability and utility, as well as allowing the researcher greater control over updating the file in future, if necessary.

in addition to sharing asmr video materials used in research, researchers should utilise external repositories such as the open science framework to share all their study materials, including recruitment media, participant instructions, questionnaires, and flow-chart protocols. whilst journal articles often have strict word limits on the amount of information that can be included in a method section, there are no such limits on uploading supplementary material to separate repositories, the dois of which can be linked in the article itself. sharing materials is one of the easiest transparency initiatives to introduce into a research workflow. for example, an entire online questionnaire created in qualtrics can be downloaded as a .pdf and uploaded to the osf with only a few clicks. the benefit is that other researchers can see exactly what information participants were told, how the study was set up, and which questions were asked, to allow the study to be replicated as closely as possible.

preprints & post-prints

the traditional publishing workflow dictates that the findings of a piece of research are only communicated to other scientists and the public after undergoing peer review, when the study is published as a journal article. preprints are online copies of manuscripts that shortcut this process, as they are uploaded to a repository prior to peer review and journal submission (speidel & spitzer, 2018).
given that peer review can often take several months before a study is published, preprints accelerate the dissemination of research findings, and thus the productivity and development of the entire field. preprints also increase the visibility and discoverability of research, which can facilitate networking. they can also be used to communicate research that would otherwise sit in a "file drawer", for example studies with null results, small sample sizes, or undergraduate projects (sarabipour et al., 2019). although researchers are encouraged to check individual journal policies, most journals do not regard preprints as 'prior publication', and so they do not act as a disqualifier for eventual journal publication. one common concern regarding preprints is that posting findings online before publication could allow a researcher to be 'scooped' and their idea stolen. however, preprints are associated with a permanent doi, meaning that they have an established mark in the scientific record, and are indexed by google scholar and crossref. therefore, posting a preprint can actually allow a researcher to indicate that they recorded a finding or idea first, without having to wait for 'official' publication. practically speaking, it is also unlikely that another lab would have the time to run an entire study, write up the results, and submit and get it accepted for publication in the time it takes for a posted preprint to be peer reviewed. most psychology researchers submit their preprints to the dedicated psychology preprint server psyarxiv (https://psyarxiv.com/), but asmr research could also be hosted on the server for "mind and contemplative practices" research, mindrxiv (https://mindrxiv.org/), biorxiv for biological mechanisms (https://biorxiv.org), or a geographical server such as africarxiv (https://osf.io/preprints/africarxiv/).
post-prints

confusingly, the term 'preprint' is sometimes also used to refer to a manuscript uploaded after publication, to circumvent publisher paywalls. these 'postprints' are a common way to allow for 'green' open access, where authors can upload a non-typeset copy of their article to their personal, institutional, or academic social networking webpage (e.g. http://researchgate.net; http://academia.edu) if their actual article is not published open access in the journal itself. sharing post-prints of articles is an increasingly common requirement of publicly funded research, and so it is likely that many asmr researchers already do this. however, it is worth reiterating that if an asmr research article is not published 'gold' open access, authors should try to ensure that the non-formatted manuscript is available in as many places as possible, to allow for maximum discoverability and use by the asmr research community.

collaboration

the final practice is a broader call for more collaborative research in asmr, and for researchers to combine resources in the spirit of 'team science' (munafò et al., 2017) to conduct larger, well-powered investigations into the phenomenon. large, multi-lab studies are becoming increasingly common in psychological science (e.g. camerer et al., 2018; ebersole et al., 2016; r. a. klein et al., 2014; r. a. klein et al., 2018; open science collaboration, 2015) as researchers realise the benefits that pooling resources can have, particularly when it comes to participant recruitment, traditionally one of the more practically difficult aspects of psychological research. underpowered studies remain the norm in psychology (maxwell, 2004), and researchers commonly overestimate the power of their own studies (bakker et al., 2016).
this is compounded by the fact that traditional methods of calculating sample size underestimate the number of participants needed, due to imprecise measures of effects: taking effect sizes from previous studies at face value can underestimate the number of participants needed by a factor of ten (anderson et al., 2017). as an example, when accounting for publication bias and type 1 error, researchers wishing to replicate a published study finding a between-groups effect size of d = 0.68 could require up to a total of n = 662 to achieve power of .73. whilst online research methods can facilitate collecting data from large samples like these, many researchers will want to conduct asmr studies in the laboratory to control for environmental factors, standardize participant experience, or employ physiological or neuroscientific measures. however, the time and resources needed to collect large sample sizes in person can make this sort of data collection prohibitive. the solution is for researchers to work together to pool resources. whilst it may be difficult for a single lab to recruit and test 600 participants in person, it would be feasible for six labs to work together and collect 100 participants each for a single, high-powered study. utilising the other principles of transparent research, the researchers involved could share materials and pre-register the design of the study together to ensure consistency in the set-up of the experiment and data collection. the data would then be pooled and analysed with shared, transparent analysis code, so that all contributors could see exactly how the final results were reached. a relevant example of a large, multi-lab study investigating an emotional, sensory phenomenon is zickfeld et al. (2019), in which researchers from 23 different laboratories in 15 countries across five continents worked together to investigate "kama muta", the distinct emotional experience of 'being moved'.
the kaviar project (kama muta validity investigation across regions) members worked together to standardise measurement of the phenomenon and construct a questionnaire, and then recruited a total sample of 3542 participants from across their respective labs. the study was pre-registered and conducted according to open research principles, and readers are encouraged to explore the study osf page as an exemplar of the organisation of a multi-lab project: https://osf.io/cydaw/. the resulting study provided convincing evidence of the validity of kama muta and the measurement tool used, and as a new seminal paper on the topic it has already been cited 21 times in the literature. multi-lab collaborations of course require greater time and resources than individual lab studies, and come with unique challenges, including integrating multiple datasets, multi-site ethical approval, translation issues, and the logistical headache of coordinating dozens of people in disparate time zones (moshontz et al., 2018). in addition, traditional publishing incentives often favour quantity over quality. this means it is tempting for researchers to work in silos on their own studies, publishing as quickly as possible to increase their own number of publications and citations and enhance their careers. however, as discussed earlier, this is bad for the field as a whole, as it results in a multitude of underpowered and therefore non-replicable studies in the literature. this is not conducive to progress in the field, and it is a mistake to think that the limitations of small-sample individual studies and publication bias can be corrected post-hoc via meta-analysis (see kvarven et al., 2019).
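the sample-size arithmetic discussed earlier can be made concrete with a rough normal-approximation power calculation. this sketch is not the bias-adjusting method of anderson et al. (2017) that produced the n = 662 figure quoted above; it simply shows how sharply achieved power drops if a published effect size overestimates the true one.

```python
# rough normal-approximation power for a two-sided, two-sample test.
# (a sketch only; anderson et al.'s (2017) adjustments for publication bias
# and uncertainty are not reproduced here.)
from statistics import NormalDist
from math import sqrt

def power_two_sample(d, n_per_group, alpha=0.05):
    """approximate power of a two-sided two-sample z test for effect size d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n_per_group / 2)   # noncentrality under the alternative
    return 1 - z.cdf(z_crit - ncp) + z.cdf(-z_crit - ncp)

# if the published d = 0.68 were accurate, ~35 per group would give ~80% power;
# if the true effect were half that size, the same study would be badly
# underpowered (~30%)
print(round(power_two_sample(0.68, 35), 2))
print(round(power_two_sample(0.34, 35), 2))
```

this is the logic behind the call for pooled, multi-lab samples: when effect size estimates are inflated by publication bias, studies sized on face-value estimates end up with far less power than planned.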
the focus on quantity over quality is also a false economy for researchers interested in career-boosting metrics. by engaging in collaborative research, researchers will be producing high-value science: pre-registered, high-powered studies represent the highest quality of evidence on a topic, and so the results are inevitably published in more prestigious journals, receive greater prominence in the literature, and are cited more often. in addition, as a multi-lab study may have dozens (if not hundreds) of authors, each promoting the study through their own networks, social media channels, and at conferences and invited talks, the natural reach and visibility of the research is substantially increased. concerns about author credit in large projects can be addressed by utilising clear contribution statements, such as the credit taxonomy (holcombe, 2019), which ensures that all contributors receive appropriate recognition. finally, as evidenced by the success of previous multi-lab studies, the practical barriers to large-scale collaborations are far from insurmountable. cloud-based document editing (e.g. google docs; dropbox paper), multi-user conferencing and communication software (e.g. zoom; slack), and online project management software (e.g. github; openproject) greatly ease the organisational burden of such initiatives and mean that barriers to multi-lab working are more often psychological than practical. researchers wishing to find other labs to collaborate with could utilise the studyswap platform (https://osf.io/meetings/studyswap/), where labs can post "needs" or "haves" to facilitate sharing resources. another option could be proposing an asmr study to be conducted via the psychological science accelerator (https://psysciacc.org/), a network of psychology labs around the world who coordinate data collection on large-scale studies.
in addition, i invite asmr researchers to sign up to a mailing list we have created for sharing asmr research news and collaboration opportunities: https://asmr-net.us1.list-manage.com/subscribe?u=503533eadcf849b8c108a79a7&id=e0763fe0d0.

conclusion

asmr is a new and developing research field with enormous potential to inform our knowledge of emotion, sensory processing, and digitally-mediated therapies, as well as being a fascinating subject of study in its own right. the few studies conducted so far on asmr have tentatively explored the phenomenon, and suggested exciting directions for research to pursue. however, in order for the field to develop and grow successfully, researchers need to be able to trust the findings in the literature and build theories and hypotheses upon 'stable' effects they are confident they will be able to replicate (witte & zenker, 2017). such a "progressive research program" only works when the results of hypothesis tests are trustworthy – i.e., free from bias (preregistered), able to be replicated (open materials), computationally reproducible and scrutinised for errors (open data and code), accessible (preprints), and from high-powered studies (collaboration). whilst there are also theoretical hurdles to overcome to advance research in this area, including questions around the definition and measurement of asmr, transparency and collaboration are also a means for addressing these in a thorough and efficient manner (boag et al., 2021). witte and zenker (2017) argue that psychology must "coordinate our research away from the individualistically organized but statistically underpowered short-term efforts that have produced the [replication] crises, toward jointly managed and well-powered long-term research programs". with the adoption of the open research practices outlined in this article, the nascent field of asmr research has the potential to be the epitome of such a research program.
author contact

please contact me via email (t.hostler@mmu.ac.uk) or twitter (@tomhostler); orcid: https://orcid.org/0000-0002-4658-692x

acknowledgments

i would like to thank brendan o'connor for his comments on an early draft of this piece.

conflict of interest and funding

no conflict of interest or specific source of funding.

author contributions

i am the sole author of the article.

open science practices

this article is a theoretical article that was not preregistered, nor had data or materials to share. the entire editorial process, including the open reviews, is published in the online supplement.

references

ahuja, n. k. (2013). "it feels good to be measured": clinical role-play, walker percy, and the tingles. perspectives in biology and medicine, 56(3), 442–451. https://doi.org/10.1353/pbm.2013.0022

anderson, s. f., kelley, k., & maxwell, s. e. (2017). sample-size planning for more accurate statistical power: a method adjusting sample effect sizes for publication bias and uncertainty. psychological science, 28(11), 1547–1562. https://doi.org/10.1177/0956797617723724

bakker, m., hartgerink, c. h. j., wicherts, j. m., & van der maas, h. l. j. (2016). researchers' intuitions about power in psychological research. psychological science, 27(8), 1069–1077. https://doi.org/10.1177/0956797616647519

barratt, e. l., & davis, n. j. (2015). autonomous sensory meridian response (asmr): a flow-like mental state. peerj, 3, e851. https://doi.org/10.7717/peerj.851

barratt, e. l., spence, c., & davis, n. j. (2017).
sensory determinants of the autonomous sensory meridian response (asmr): understanding the triggers. peerj, 5. https://doi.org/10.7717/peerj.3846

bedwell, s., & butcher, i. (2020). the co-occurrence of alice in wonderland syndrome and autonomous sensory meridian response. psypag quarterly, 114(1), 18–26. http://www.open-access.bcu.ac.uk/id/eprint/8749

boag, s., roberts, n., & beath, a. (2021). commentary on hostler: on collaboration and converging methods for rigorous autonomous sensory meridian response research.

camerer, c. f., dreber, a., holzmeister, f., ho, t.-h., huber, j., johannesson, m., kirchler, m., nave, g., nosek, b. a., pfeiffer, t., altmejd, a., buttrick, n., chan, t., chen, y., forsell, e., gampa, a., heikensten, e., hummer, l., imai, t., . . . wu, h. (2018). evaluating the replicability of social science experiments in nature and science between 2010 and 2015. nature human behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z

cash, d. k., heisick, l. l., & papesh, m. h. (2018). expectancy effects in the autonomous sensory meridian response. peerj, 6. https://doi.org/10.7717/peerj.5229

chambers, c. (2017). the seven deadly sins of psychology: a manifesto for reforming the culture of scientific practice. princeton university press. https://doi.org/10.2307/j.ctvc779w5

chambers, c., & tzavella, l. (2020). registered reports: past, present and future. metaarxiv preprint. https://doi.org/10.31222/osf.io/43298 [accessed 26/03/2020]

clyburne-sherin, a., fei, x., & green, s. a. (2018). computational reproducibility via containers in social psychology. psyarxiv preprint. https://doi.org/10.31234/osf.io/mf82t [accessed 26/03/2020]

ebersole, c. r., atherton, o. e., belanger, a. l., skulborstad, h. m., allen, j. m., banks, j. b., baranski, e., bernstein, m. j., bonfiglio, d. b., boucher, l., brown, e. r., budiman, n. i., cairo, a. h., capaldi, c. a., chartier, c. r., chung, j.
m., cicero, d. c., coleman, j. a., conway, j. g., . . . nosek, b. a. (2016). many labs 3: evaluating participant pool quality across the academic semester via replication. journal of experimental social psychology, 67, 68–82. https://doi.org/10.1016/j.jesp.2015.10.012

ferguson, c. j., & heene, m. (2012). a vast graveyard of undead theories: publication bias and psychological science's aversion to the null. perspectives on psychological science, 7(6), 555–561. https://doi.org/10.1177/1745691612459059

fredborg, b., clark, j., & smith, s. (2017). an examination of personality traits associated with autonomous sensory meridian response (asmr). frontiers in psychology, 8. https://doi.org/10.3389/fpsyg.2017.00247

fredborg, b., clark, j., & smith, s. (2018). mindfulness and autonomous sensory meridian response (asmr). peerj, 6. https://doi.org/10.7717/peerj.5414

gelman, a., & loken, e. (2019).
the garden of forking paths: why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time.

giner-sorolla, r. (2012). science or art? how aesthetic standards grease the way through the publication bottleneck but undermine science. perspectives on psychological science, 7(6), 562–571. https://doi.org/10.1177/1745691612457576

hambrick, d. c., & chen, m.-j. (2008). new academic fields as admittance-seeking social movements: the case of strategic management. the academy of management review, 33(1), 32–54. https://doi.org/10.2307/20159375

hardwick, j. (2022). top youtube searches (as of may 2022). ahrefs blog. retrieved may 17, 2020, from https://ahrefs.com/blog/top-youtube-searches/

higginson, a. d., & munafò, m. r. (2016). current incentives for scientists lead to underpowered studies with erroneous conclusions. plos biology, 14(11), e2000995. https://doi.org/10.1371/journal.pbio.2000995

holcombe, a. o. (2019). contributorship, not authorship: use credit to indicate who did what. publications, 7(3), 48. https://doi.org/10.3390/publications7030048

hostler, t. j., poerio, g. l., & blakey, e. (2019). still more than a feeling: commentary on cash et al., "expectancy effects in the autonomous sensory meridian response" and recommendations for measurement in future asmr research. multisensory research, 32(6), 521–531. https://doi.org/10.1163/22134808-20191366

houtkoop, b. l., chambers, c., macleod, m., bishop, d. v. m., nichols, t. e., & wagenmakers, e.-j. (2018). data sharing in psychology: a survey on barriers and preconditions. advances in methods and practices in psychological science, 1(1), 70–85. https://doi.org/10.1177/2515245917751886

huba, g. j., singer, j. l., aneshensel, c. s., & antrobus, j. s. (1982). manual for the short imaginal processes inventory. research psychologist press.

ioannidis, j. p. a. (2005).
why most published research findings are false. plos medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
keizer, a., chang, t. h., o'mahony, c. j., schaap, n. s., & stone, k. d. (2020). individuals who experience autonomous sensory meridian response have higher levels of sensory suggestibility. perception, 49(1), 113–116. https://doi.org/10.1177/0301006619891913
klein, o., hardwicke, t. e., aust, f., breuer, j., danielsson, h., hofelich mohr, a., ijzerman, h., nilsonne, g., vanpaemel, w., & frank, m. c. (2018). a practical guide for transparency in psychological science. collabra: psychology, 4(1), 20. https://doi.org/10.1525/collabra.158
klein, r. a., ratliff, k. a., vianello, m., adams, r. b., bahník, š., bernstein, m. j., bocian, k., brandt, m. j., brooks, b., brumbaugh, c. c., cemalcilar, z., chandler, j., cheong, w., davis, w. e., devos, t., eisner, m., frankowska, n., furrow, d., galliani, e. m., . . . nosek, b. a. (2014). investigating variation in replicability: a "many labs" replication project. social psychology, 45(3), 142–152. https://doi.org/10.1027/1864-9335/a000178
klein, r. a., vianello, m., hasselman, f., adams, b. g., adams, r. b., alper, s., aveyard, m., axt, j. r., babalola, m. t., bahník, š., batra, r., berkics, m., bernstein, m. j., berry, d. r., bialobrzeska, o., binan, e. d., bocian, k., brandt, m. j., busching, r., . . . nosek, b. a. (2018). many labs 2: investigating variation in replicability across samples and settings. advances in methods and practices in psychological science, 1(4), 443–490. https://doi.org/10.1177/2515245918810225
kvarven, a., strømland, e., & johannesson, m. (2019). comparing meta-analyses and preregistered multiple-laboratory replication projects. nature human behaviour. https://doi.org/10.1038/s41562-019-0787-z [accessed 26/03/2020]
lakatos, i. (1978).
the methodology of scientific research programmes: philosophical papers [cambridge core]. https://doi.org/10.1017/cbo9780511621123
lakens, d. (2013). calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and anovas. frontiers in psychology, 4. https://doi.org/10.3389/fpsyg.2013.00863
lee, m., song, c.-b., shin, g.-h., & lee, s.-w. (2019). possible effect of binaural beat combined with autonomous sensory meridian response for inducing sleep. frontiers in human neuroscience, 13. https://doi.org/10.3389/fnhum.2019.00425
liu, m., & zhou, q. (2019).
a preliminary compilation of a digital video library on triggering autonomous sensory meridian response (asmr): a trial among 807 chinese college students. frontiers in psychology, 10. https://doi.org/10.3389/fpsyg.2019.02274
lochte, b. c., guillory, s. a., richard, c. a. h., & kelley, w. m. (2018). an fmri investigation of the neural correlates underlying the autonomous sensory meridian response (asmr). bioimpacts, 8(4), 295–304. https://doi.org/10.15171/bi.2018.32
maxwell, s. e. (2004). the persistence of underpowered studies in psychological research: causes, consequences, and remedies. psychological methods, 9(2), 147–163. https://doi.org/10.1037/1082-989x.9.2.147
mcerlean, a. b. j., & banissy, m. j. (2017). assessing individual variation in personality and empathy traits in self-reported autonomous sensory meridian response. multisensory research, 30(6, si), 601–613. https://doi.org/10.1163/22134808-00002571
mcerlean, a. b. j., & banissy, m. j. (2018). increased misophonia in self-reported autonomous sensory meridian response. peerj, 6. https://doi.org/10.7717/peerj.5351
mcerlean, a. b. j., & osborne-ford, e. j. (2020). increased absorption in autonomous sensory meridian response. peerj, 8, e8588. https://doi.org/10.7717/peerj.8588
merton, r. k. (1974). the sociology of science: theoretical and empirical investigations (1st ed.). university of chicago press.
meyer, m. n. (2018). practical tips for ethical data sharing. advances in methods and practices in psychological science, 1(1), 131–144. https://doi.org/10.1177/2515245917747656
moshontz, h., campbell, l., ebersole, c. r., ijzerman, h., urry, h. l., forscher, p. s., grahe, j. e., mccarthy, r. j., musser, e. d., antfolk, j., castille, c. m., evans, t. r., fiedler, s., flake, j. k., forero, d. a., janssen, s. m. j., keene, j. r., protzko, j., aczel, b., . . . chartier, c. r. (2018).
the psychological science accelerator: advancing psychology through a distributed collaborative network. advances in methods and practices in psychological science, 1(4), 501–515. https://doi.org/10.1177/2515245918797607
munafò, m. r., nosek, b. a., bishop, d. v. m., button, k. s., chambers, c., percie du sert, n., simonsohn, u., wagenmakers, e.-j., ware, j. j., & ioannidis, j. p. a. (2017). a manifesto for reproducible science. nature human behaviour, 1(1). https://doi.org/10.1038/s41562-016-0021
nosek, b. a., ebersole, c. r., dehaven, a. c., & mellor, d. t. (2018). the preregistration revolution. proceedings of the national academy of sciences, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114
open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
poerio, g. l., blakey, e., hostler, t. j., & veltri, t. (2018). more than a feeling: autonomous sensory meridian response (asmr) is characterized by reliable changes in affect and physiology (j. e. aspell, ed.). plos one, 13(6), e0196645. https://doi.org/10.1371/journal.pone.0196645
poerio, g. l., mank, s., & hostler, t. j. (2022). the awesome as well as the awful: heightened sensory sensitivity predicts the presence and intensity of autonomous sensory meridian response (asmr). journal of research in personality, 97. https://doi.org/10.1016/j.jrp.2021.104183
roberts, n., beath, a., & boag, s. (2019). autonomous sensory meridian response: scale development and personality correlates. psychology of consciousness: theory, research, and practice, 6(1), 22–39. https://doi.org/10.1037/cns0000168
sarabipour, s., debat, h. j., emmott, e., burgess, s. j., schwessinger, b., & hensel, z. (2019). on the value of preprints: an early career researcher perspective. plos biology, 17(2), e3000151. https://doi.org/10.1371/journal.pbio.3000151
savage, c. j., & vickers, a. j. (2009).
empirical study of data sharing by authors publishing in plos journals (c. mavergames, ed.). plos one, 4(9), e7078. https://doi.org/10.1371/journal.pone.0007078
smaldino, p. e., & mcelreath, r. (2016). the natural selection of bad science. royal society open science, 3(9), 160384. https://doi.org/10.1098/rsos.160384
smith, s. d., fredborg, b. k., & kornelsen, j. (2019a). a functional magnetic resonance imaging investigation of the autonomous sensory meridian response. peerj, 7. https://doi.org/10.7717/peerj.7122
smith, s. d., fredborg, b. k., & kornelsen, j. (2019b). atypical functional connectivity associated with
autonomous sensory meridian response: an examination of five resting-state networks. brain connectivity, 9(6), 508–518. https://doi.org/10.1089/brain.2018.0618
smith, s. d., katherine fredborg, b., & kornelsen, j. (2017). an examination of the default mode network in individuals with autonomous sensory meridian response (asmr). social neuroscience, 12(4), 361–365. https://doi.org/10.1080/17470919.2016.1188851
speidel, r., & spitzer, m. (2018). preprints: the what, the why, the how. retrieved march 25, 2020, from https://cos.io/blog/preprints-what-why-how/ [accessed 26/03/2020]
vines, t. h., albert, a. y., andrew, r. l., débarre, f., bock, d. g., franklin, m. t., gilbert, k. j., moore, j.-s., renaut, s., & rennison, d. j. (2014). the availability of research data declines rapidly with article age. current biology, 24(1), 94–97. https://doi.org/10.1016/j.cub.2013.11.014
wicherts, j. m., bakker, m., & molenaar, d. (2011). willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results (r. e. tractenberg, ed.). plos one, 6(11), e26828. https://doi.org/10.1371/journal.pone.0026828
wicherts, j. m., borsboom, d., kats, j., & molenaar, d. (2006). the poor availability of psychological research data for reanalysis. american psychologist, 61(7), 726–728. https://doi.org/10.1037/0003-066x.61.7.726
witte, e. h., & zenker, f. (2017). from discovery to justification: outline of an ideal research program in empirical psychology. frontiers in psychology, 8, 1847. https://doi.org/10.3389/fpsyg.2017.01847
zickfeld, j. h., schubert, t. w., seibt, b., blomster, j. k., arriaga, p., basabe, n., blaut, a., caballero, a., carrera, p., dalgar, i., ding, y., dumont, k., gaulhofer, v., gračanin, a., gyenis, r., hu, c.-p., kardum, i., lazarević, l. b., mathew, l., . . . fiske, a. p. (2019). kama muta: conceptualizing and measuring the experience often labelled being moved across 19 nations and 15 languages.
emotion, 19(3), 402–424. https://doi.org/10.1037/emo0000450

meta-psychology, 2023, vol 7, mp.2021.2840
https://doi.org/10.15626/mp.2021.2840
article type: original article
published under the cc-by4.0 license
open data: not applicable
open materials: not applicable
open and reproducible analysis: not applicable
open reviews and editorial process: yes
preregistration: no
edited by: rickard carlsson
reviewed by: christina bergmann, matthew page
analysis reproduced by: not applicable
all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/fg8tz

an integrative framework for planning and conducting non-intervention, reproducible, and open systematic reviews (niro-sr)

marta k topor*1,2, jade s pickering*3, ana barbosa mendes4, dorothy bishop5, fionn büttner6, mahmoud m elsherif7, thomas rhys evans8, emma l henderson2, tamara kalandadze9, faye t nitschke10, janneke p c staaks11, olmo r van den akker12, siu kit yeung13,14, mirela zaneva5, alison lam15, christopher r madan16, david moreau17, aoife o'mahony18, adam j parker19, amy riegelman20, meghan testerman21, and samuel westwood22,23

1university of copenhagen, denmark; 2university of surrey, uk; 3university of manchester, uk; 4erasmus university rotterdam, netherlands; 5university of oxford, uk; 6university college dublin, ireland; 7university of birmingham, uk; 8university of
greenwich, uk; 9østfold university college, norway; 10university of newcastle, australia; 11university of amsterdam, netherlands; 12tilburg university, netherlands; 13the university of hong kong, hksar, china; 14the chinese university of hong kong, hksar, china; 15university of liverpool, uk; 16university of nottingham, uk; 17university of auckland, nz; 18cardiff university, uk; 19university college london, uk; 20university of minnesota, mn, usa; 21princeton university, nj, usa; 22university of westminster, uk; 23king's college london, uk

*joint first authors

most of the commonly used and endorsed guidelines for systematic review protocols and reporting standards have been developed for intervention research. these excellent guidelines have been adopted as the gold standard for systematic reviews as an evidence synthesis method. in the current paper, we highlight some issues that may arise from adopting these guidelines beyond intervention designs, including in basic behavioural, cognitive, experimental, and exploratory research. we have adapted and built upon the existing guidelines to establish a complementary, comprehensive, and accessible tool for designing, conducting, and reporting non-intervention, reproducible, and open systematic reviews (niro-sr). niro-sr is a checklist composed of two parts that provide itemised guidance on the preparation of a systematic review protocol for pre-registration (part a) and on reporting the review (part b) in a reproducible and transparent manner. this paper, the tool, and an open repository osf.io/f3brw provide a comprehensive resource for those who aim to conduct a high quality, reproducible, and transparent systematic review of non-intervention studies.
keywords: guidelines, non-intervention research, open research, reproducibility, systematic reviews, transparency

introduction

systematic literature reviews are a widely used method for rigorously synthesising existing evidence to answer research questions and to inform best practice and policy-making. the quality of systematic reviews is contingent upon comprehensive, systematic, and transparent identification of all the relevant literature, followed by a balanced critical evaluation and synthesis of the data extracted from that literature. rigorous implementation can minimise biases and questionable reporting practices that can lead to misleading or inconsistent conclusions (ioannidis, 2016; moher et al., 2009; siddaway et al., 2019). however, the most popular guidelines for designing, conducting, reporting, and critically appraising systematic reviews to date have been designed for the synthesis of healthcare, medical, and intervention-based research. these include the prospero protocol pre-registration system and template (booth et al., 2012); the preferred reporting items for systematic reviews and meta-analyses (prisma; page et al., 2021); the cochrane handbook for systematic reviews of interventions (higgins et al., 2019); and the assessing the methodological quality of systematic reviews tool (amstar; shea et al., 2017). the popularity of these tools is evident through endorsement from a number of journals (see prisma endorsers for an example), university libraries, and collaborative groups specialised in conducting systematic reviews (see the list of recommended systematic review tools by the equator network). therefore, these tools are widely chosen by authors of systematic reviews through recommendations, journal requirements, good findability, and/or greater accessibility.
moreover, they are a likely choice for authors who conduct systematic reviews based on studies other than interventions and who reach for these tools through similar routes. for instance, some more general and multidisciplinary journals that publish various types of studies encourage or require that all submitted systematic reviews follow the prisma guidelines intended for intervention studies (e.g. systematic reviews, peerj, or plos one), which may not always be well suited. intervention studies focus on assessing the efficacy or effectiveness of, for example, healthcare interventions and clinical trials that a priori assign participants to different intervention groups (committee of medical journal editors, 2021). as conceptualised by glass (1972), the essential aim of intervention studies is to evaluate the proposed intervention and its effects. many other types of research, such as basic, experimental, and exploratory research in the social, cognitive, and behavioural sciences, do not share the same aims as intervention research; instead, they aim to explore and explain mechanisms, and thus evidence synthesis of such papers must be approached from a different perspective. research that does not fit the scope of intervention, such as explanatory, experimental, and basic research, should also adopt rigorous and transparent practices of conducting evidence synthesis, particularly in the context of the ongoing paradigm shift that places emphasis on replicable and reproducible research (munafò et al., 2017). however, researchers conducting systematic reviews of non-intervention research who wish to follow established guidelines must often resort to adapting the criteria of less applicable guidelines to make them appropriate for assessing these types of studies, leading to ad hoc solutions such as filtering, combining, or customising practices from several sources (macpherson & jones, 2010).
for instance, one popular tool is the "picos" framework (population, intervention, comparison, outcome, study design), which aids the development of a research question and eligibility criteria for evidence synthesis. picos is an important component of the prospero template, the cochrane guidelines, and the amstar tool, and it was only recently removed from prisma following the 2020 update. this framework cannot always be directly applied to diverse research designs (bramer, 2015), and many alternatives have been developed (booth et al., 2019); for example, the spider framework (cooke et al., 2012) for systematic reviews of research using qualitative methods. more general guidelines which are not limited to intervention designs also exist. in the field of psychology, a comprehensive tool, the meta-analysis reporting standards (mars; appelbaum et al., 2018), was recently proposed by the american psychological association (apa) working group on quantitative research reporting standards. the tool advances standards, but there are barriers to its implementation. not all systematic reviews include meta-analyses; thus, for many authors who decide not to include a meta-analysis component, mars may initially be considered unsuitable. in addition, the accessibility of mars as a tool is limited because it is not an open-access resource. in fact, the uptake of mars for evidence synthesis has been very limited and was described as "non-existent" in a recent review (hohn et al., 2020). lastly, mars is a reporting guideline, which in practice means that researchers may follow it retrospectively for reporting purposes only and are less likely to use it to inform the design of their study. in summary, although valuable resources exist for guiding the design and reporting of systematic reviews, researchers have a limited choice when it comes to selecting an appropriate and accessible tool for systematic reviews beyond interventional research.
table 1
definitions of terminology

intervention research: a study which aims to evaluate the effects of an intervention, often against another intervention, on primary or secondary outcomes of interest.
non-intervention research: a study which aims to provide an explanatory framework of an empirical phenomenon, or to provide supporting evidence for a theoretical paradigm (glass, 1972).
intervention systematic review: a systematic review which synthesises results from intervention research.
non-intervention systematic review: a systematic review which synthesises results from non-intervention research.
systematic review protocol: a protocol (ideally pre-registered, see "systematic review pre-registration" below) which outlines specific plans for conducting the systematic review. it may be understood as a 'recipe' for the systematic review.
systematic review pre-registration: a systematic review can be pre-registered by submitting the finished protocol to a pre-registration platform, such as the open science framework.
systematic map: a report of the ongoing research activity on a particular topic, informed by a systematic search and screening strategy, which can be used to identify gaps in research.
methodological systematic review: informed by a systematic search, this review summarises methodological practices or questions in a given area.

the utility of existing guidelines for high quality systematic reviews is limited by whether they are correctly applied. general problems with adherence to guidelines have been highlighted for the 2009 version of prisma (page & moher, 2017), but also for systematic reviews in psychology as a field (hohn et al., 2020). page et al. (2021) and hohn et al. (2020) suggested that low adherence could be related to a possible lack of guideline-adherence checks during peer review, relaxed demands for adherence by journals, or variation in how checklist items are interpreted by systematic reviewers. however, adherence rates might also be significantly impacted by a guideline's appropriateness to specific fields. problems with the use of guidelines differ across disciplines and may be driven by discipline-specific interpretation of items, which can be further exacerbated by ambiguity and lack of clarity in item wording (rethlefsen et al., 2021). this is specifically problematic for fields where non-intervention research is common (gates & march, 2016). therefore, the development of systematic review guidelines that cater beyond interventional designs and are appropriate for explanatory, experimental, and basic research could help to improve adherence rates in fields such as psychology, neuroscience, and economics. the lack of sufficient instructions accompanying guidelines may also contribute to the low adherence problem, especially with regard to items designed to facilitate transparency and robustness of systematic reviews. for example, protocol pre-registration is one of the prisma items with a very low adherence rate.
considering that pre-registration is widely understood to be an important measure to constrain reporting bias (nosek et al., 2018), it is of particular concern that this item was adhered to by only 21% of systematic reviews published using prisma between 2010 and 2017 (page & moher, 2017). this low adherence may be partly due to the uncertainty that surrounds the writing of systematic review protocols, their pre-registration, and how to transparently report and justify deviations from protocol when necessary. for example, it is often considered unclear how immutable a pre-registered protocol is, and when and how systematic reviewers can appropriately deviate from protocol and subsequently report this transparently (dehaven, 2017). in addition, systematic reviews tend not to report specific search results (48%), or screening and extraction procedures (abstract screening: 18%; full-text screening: 20%). furthermore, among meta-analyses specifically, systematic reviews reported effect-size information in 62% of cases and moderator information in 58%. finally, only 11% of systematic reviews contained the statistical code required for reproducibility of the analysis (polanin et al., 2020). this reporting is necessary not only to give context to any additional decisions made during the analysis, but also to give others the information needed to evaluate key decisions made within the planned review, and to improve the reproducibility of evidence synthesis (which is known to be low; maassen et al., 2020). given these issues surrounding uptake, adherence, accessibility, and relevance of existing guidelines, the non-intervention, open, and reproducible evidence synthesis (niroes) collaboration was set up to create a suite of accessible tools designed to facilitate evidence synthesis of non-intervention research, while also minimising the limitations of existing guidelines. in particular, it is designed to have high utility amongst novice systematic reviewers.
this paper presents the non-intervention, open, and reproducible tool for systematic reviews (niro-sr), which is designed with the specific purpose of providing guidelines and a framework for researchers to conduct a systematic review of non-intervention research in line with best practice. we believe this to be particularly applicable to the social, cognitive, and behavioural sciences, as those are the perspectives from which the majority of co-authors have approached the problem, but the guidelines may well prove useful to a wider range of fields given the non-specificity of the items. we acknowledge the importance of conducting meta-analyses as part of systematic reviews; however, a meta-analysis is not a strictly necessary part of a systematic review, so it is outside the scope of the current paper. our tools provide guidance for creating, planning, and pre-registering a systematic review protocol (part a), and for conducting and reporting a systematic review (part b), with the goal of making evidence synthesis as open and reproducible as possible, thereby improving the credibility of the systematic review and reducing the likelihood of biased outcomes and conclusions.

method

item bank

search and information sources. the refinement and specification of the aims and the scope of the project (as reflected in the introduction) occurred during conferences and working groups that engaged researchers, librarians, and journal editors predominantly representing experimental and basic behavioural/cognitive fields from january to december 2019 (e.g. advanced methods for reproducible science workshop, uk reproducibility network 2019; society for the improvement of psychological science conference, 2019; niroes online collaborative hackathon, august 2019). participants were at different career levels and had varied experience of applying systematic reviews in their work.
discussions during these meetings unveiled personal experiences of barriers to conducting systematic reviews beyond intervention research. in addition, many pre-existing tools for guiding systematic reviews across experimental, behavioural, and cognitive fields were shared, forming the first step in compiling the relevant existing tools that would inform the development of niro-sr. talks and presentations given about the project to date can be found through the project's open science framework page (osf.io/8seby/). the initial list of relevant systematic review guidelines was expanded by two authors (mkt and jsp), who conducted a search of existing guidelines for writing, reporting, and quality assessment of systematic reviews, systematic maps, and meta-analyses. this was facilitated through extensive web searches (e.g. "systematic review checklist", "systematic review guidelines", "systematic review reporting"), resources from the equator network website, and further collaborative sessions with the niroes team until we reached saturation, i.e. we could not find any more relevant tools using this method. our search identified 19 guidelines (appendix a) that provided quality assessment and protocols for systematic reviews, with a total of 517 items.

item extraction. all items and explanatory text were extracted verbatim from the 19 sourced guidelines to create an item bank. the prisma 2020 update (page et al., 2020) and its accompanying item bank were published after our item bank was compiled and were therefore not included in it. we cross-referenced our own items with those from prisma 2020 and identified 55 items from various additional guidelines that added value to the items we had already included. the final item bank contained 572 items extracted from all sources. the flowchart for this process is presented in figure 1.
eligibility was determined by two authors (mkt and jsp), who independently coded each item for potential inclusion as "yes", "no", or "maybe" depending on its broad relevance and application to systematic reviews of non-intervention studies. "maybe" was defined as having components that were useful without being directly applicable as a whole item. exclusion criteria included application (e.g., applicable to meta-analyses and/or to systematic maps only); relevance (items that were relevant to clinical/intervention research and not adaptable for non-intervention research systematic reviews); formatting and presentation (items which suggested formatting that was not specific to systematic reviews, for instance, if they referred only to systematic maps); and ambiguity (e.g., items that had a lack of clarity or incomplete guidance). disagreements were resolved by consensus, and irreconcilable disagreements were re-evaluated at a later stage of the niro-sr tool development following discussions with a larger group of collaborators and experts in systematic review methodology. the final item bank, including decisions about the inclusion and exclusion of items, can be found in the project's osf repository (osf.io/p2v34).

figure 1. flowchart showing the records identified from searching, and the records included/excluded during screening throughout the development of niro-sr.

item development

first, eligible items were categorised into the section of a systematic review that was most applicable, which included abstract, title, protocol, introduction, aims, research question, search strategy, screening, data extraction, risk of bias and quality assessment, synthesis, results, transparency, discussion, and miscellaneous items (see item bank tab "included items by category").
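the double-coding step described above, where two authors independently rate each item as "yes", "no", or "maybe" and resolve disagreements by consensus, can be sketched in a few lines of python. the coder data below are invented for illustration, and the paper reports no inter-rater statistic; cohen's kappa is shown here only as one common way to summarise chance-corrected agreement for this kind of screening:

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """cohen's kappa for two raters coding the same set of items."""
    n = len(codes_a)
    # observed proportion of items on which the two coders agree
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # agreement expected by chance, from each coder's marginal frequencies
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a | freq_b) / n ** 2
    return (observed - expected) / (1 - expected)

# hypothetical codes from two screeners for ten candidate checklist items
coder_1 = ["yes", "yes", "no", "maybe", "yes", "no", "no", "yes", "maybe", "yes"]
coder_2 = ["yes", "no", "no", "maybe", "yes", "no", "yes", "yes", "maybe", "yes"]

# items the two coders disagree on go to a consensus discussion
disagreements = [i for i, (a, b) in enumerate(zip(coder_1, coder_2)) if a != b]
print(f"kappa = {cohens_kappa(coder_1, coder_2):.2f}")   # kappa = 0.68
print(f"items for consensus discussion: {disagreements}")  # [1, 6]
```

here the kappa value and the disagreement list simply summarise the screening; the actual inclusion decisions in niro-sr were made by consensus, not by any numeric threshold.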
second, items were further divided to form the two parts of the niro-sr tool: the protocol (part a) and the review (part b). protocol items were applicable when devising and pre-registering a prospective systematic review protocol of non-intervention studies, and review items were applicable for guiding the process of conducting a systematic review and writing a report for publication. finally, each group of items was either rewritten for clarity or adapted for general use in non-intervention research. this process of rewriting items, splitting complex items, and merging similar items was conducted iteratively and collaboratively over several months and alongside other feedback methods (see section 2.3 and section 2.4). the resulting items resemble the original curated items in theme, depth, and scope. an example of an adapted item is provided in table 2. please note that items addressing the risk of bias and heterogeneity of reviewed studies were included in the niro-sr tool, but only to a limited extent, because a separate, complementary tool for guiding the assessment of bias and quality in non-intervention research systematic reviews is under development by niroes. initial feedback; accessibility and understandability. one aim of niro-sr was to make it accessible to researchers who had never conducted a systematic review before. in december 2019, feedback on the initial version of the tool was sought from a convenience sample of students and staff (n = 9) in the school of psychology, university of surrey (all materials and feedback available on osf.io/f3brw). none of the participants had published a systematic review at the time of response; they had little previous experience of conducting systematic reviews, reported relatively low confidence in this method, and worked in non-interventional research areas.
participants were asked to provide general ratings of niro-sr using a three-point scale (“1 not good enough”, “2 could be improved” and “3 good”) across five separate categories: clarity (mean rating = 2.56, sd = 0.53), structure (mean rating = 2.89, sd = 0.33), practicality (mean rating = 2.61, sd = 0.49), relevance (mean rating = 2.86, sd = 0.38), and simplicity (mean rating = 2.44, sd = 0.52). comments were overall positive about the tool’s utility, with suggested revisions limited to improvements in clarity and further guidance for a minority of items. all participants reported that they would want to use this tool when conducting relevant systematic reviews in the future. the feedback guided some initial changes to improve the tool’s clarity for non-expert users, which included adding an explanation of the purpose and procedures of pre-registration at the beginning of the tool, and explaining items in further detail. the study procedures involving human participants were reviewed against the guidelines set out by the ethics committee of the faculty of health and medical sciences, university of surrey, and carried out in accordance with the university of surrey’s code of conduct on good research practice and the declaration of helsinki. final edits; collaborator feedback. in march 2020, a virtual hackathon was hosted to invite final feedback on the tool from a multidisciplinary team of both existing and new collaborators, comprising expert researchers and librarians experienced in systematic reviews, systematic maps, and meta-analyses, as well as more novice researchers with little experience of evidence synthesis. expert researchers revised the tool to ensure that it covered the breadth of knowledge needed to conduct a systematic review, including adding details that were missing based on their own experiences of preparing pre-registration protocols and writing non-intervention systematic reviews.
novice contributors refined the tool with the aim of making it as accessible and understandable as possible to users of all levels of expertise in reporting and conducting systematic reviews. in cases where new items were applicable to only certain types of non-intervention studies, they were marked as optional. finally, it was identified that certain items could benefit from additional illustrative examples, templates, or detailed guidance. these included:
• a full example of a search strategy
• a decision log template to track the decisions made during the screening and data extraction stages
• an example of a screening manual
• a template for data extraction forms
• a risk-of-bias assessment tool to help with the assessment of the credibility of included non-intervention studies
these are outside the scope of the current paper, but represent the need for further information and guidance.

table 2. the table below presents an item which guides authors on how to prepare and report systematic review research questions. on the left, the picos framework sourced from the prisma statement; on the right, the same framework adapted for non-intervention research in niro-sr. in the adapted version, the language clearly guides researchers to state their dependent and independent variables. “interventions” are excluded from the item and there is an added optional position on the consideration of covariates.

picos (prisma statement; moher et al., 2009): provide an explicit statement of questions being addressed with reference to:
• participants,
• interventions,
• comparisons,
• outcomes,
• study design

item 3, niro-sr (part a): what is the primary review question? the review question must be clearly defined and include the following:
• the primary outcome measure(s) of interest (the dependent variable(s); dv)
• the primary independent variables (ivs) of interest
• the population/participants of interest (e.g., undergraduate students, participants with a specific diagnosis, school-age children etc.)
• (optional) study design(s) of interest, for example:
i. observational: measured variables at one time-point
ii. cross-sectional: measured variables with different individuals at different time-points
iii. longitudinal: same individuals followed over time; could be prospective or retrospective
iv. experimental: examining the effect of a specific manipulation
• (optional) any covariates of interest or variables you want to control for (e.g. participant age)
nb. if you find that your research question does not fit the above, for instance in exploratory or methodological systematic reviews, you should state this in the protocol for transparency. if you cannot operationalise the dv and iv, make sure to clearly define the focus (e.g. methodological variation) and the context (e.g. in working memory research) of your investigation.

following this feedback process, niro-sr version 0.1 (and version 0.1.1 for subsequent minor fixes) was uploaded to osf.io/c9wer for any researcher who wanted to use it to guide their systematic review projects. the niro-sr tool has already been used by several projects to inform pre-registration, and the guidelines have been implemented in some curricula, including at the university of coventry and the university of the philippines diliman. feedback from users has been very positive, and they provided further suggestions to improve the tool and increase its clarity. these changes were implemented, and the current paper presents the finalised niro-sr version 1.1. results. niro-sr version 1.0. niro-sr comprises two parts (osf.io/c9wer), a and b. part a is a guide for pre-registering a systematic review protocol composed of 30 items, of which 26 items are required and 4 items are recommended for best practice.
the items are divided into the following sections: title, description and aims, research question, search strategy, screening, data extraction, critical appraisal, synthesis, and transparency. part b is a 38-item guide for high standards of reporting for non-intervention systematic reviews with the following sections: title, abstract, introduction, method (deviations from protocol, search strategy, screening methods, data extraction method, critical appraisal method, synthesis method), results (extracted records results, critical appraisal results, synthesis results), discussion, and transparency. if part a cannot be completed, researchers must give a justification for why this is the case and are advised to include as much relevant content from part a as possible in the final systematic review publication. discussion. niro-sr aims, firstly, to provide guidelines for conducting systematic reviews of research that does not clearly fit the definition of intervention research, such as explanatory, experimental, and basic research. the guidelines are intended to be particularly applicable to the behavioural sciences and related fields, but may also be used in fields outside the expertise of the authors of this paper. secondly, niro-sr aims to place emphasis on the reproducibility, openness and transparency of systematic reviews. part a provides guidance for developing and pre-registering a comprehensive review protocol, and part b guides authors in writing and reporting systematic reviews. both parts of the guidelines are designed to be usable on their own, but can also complement existing tools such as prisma 2020. niro-sr may particularly benefit psychologists and experimental and behavioural scientists who focus on non-intervention research in their systematic reviews, by providing specific advice on how to develop a review protocol, and on how to conduct and report a rigorous systematic review.
it provides guidance to authors on defining primary review questions (item a3), secondary research questions (item a4), hypotheses (item a5), inclusion and exclusion criteria (item a13), and data extraction processes (items a15 to a17). these are the areas where existing systematic review guidelines are often inapplicable to non-intervention research. niro-sr provides a framework that places particular emphasis on the operationalisation of variables of interest (e.g. ivs and dvs) and covariates, whilst still maintaining focus on relevant study designs and participant groups (see table 2 and item a3). it is hoped that, by providing specific advice for conducting comprehensive systematic reviews of basic research in the behavioural sciences, niro-sr will help to begin to standardise and improve the contents of non-intervention systematic review protocols. niro-sr may help prevent author bias (which is usually unintentional) through its emphasis on the development and pre-registration of a protocol before conducting a systematic review. niro-sr does not specify whether the protocol should be publicly available from the outset or only upon publication of the review (for example, by pre-registering with an embargo period on the open science framework), but it places importance on the availability and transparency of the public record. niro-sr advises that the protocol should be available together with the final review and include a statement of transparency which specifies the date of pre-registration and the point in the review process at which the protocol was pre-registered (e.g., before the final search was completed, or before data extraction began; see item a26). the protocol benefits the authors, as it sets out a detailed and transparent plan for the systematic review, and benefits the reader, who can more confidently reflect on how different decisions made throughout the process of conducting the systematic review may have influenced its outcomes.
niro-sr also emphasises the importance of reporting all deviations from the original protocol. we acknowledge that such deviations are often necessary, so we recommend that they are justified and transparently declared in the eventual report of the systematic review (item b5). niro-sr recommends a multiple-author approach when conducting systematic reviews, in line with best practice recommendations (page et al., 2021; watts and li, 2019). for example, multiple team members should independently screen the titles, abstracts and full texts, have a clear procedure for resolving potential disagreements between systematic reviewers, and report a quantitative measure of inter-reviewer reliability (items b13, b14, b18, b20 and b21). this helps facilitate reproducibility by increasing the likelihood that a separate team of researchers could follow the exact steps of the original review and reach the same conclusions (i.e., same data, same method, same results; barba, 2018). researchers should be able to use the same method (i.e., search strategy, screening process and inclusion/exclusion criteria) on the same data (i.e., the databases and search results) and arrive at the same results (i.e., the final set of papers and the extracted data). however, subjective decisions must still be made throughout this process and so, where full reproducibility is not possible, niro-sr emphasises the importance of transparency. we recommend that a decision log is made available that catalogues important decisions (items b14, b15 and b36). the decision log allows anyone trying to reproduce the results to identify and evaluate the subjective decisions behind any discrepancies. niro-sr was developed both to alleviate the barriers preventing researchers from conducting systematic reviews and to encourage novice researchers to conduct systematic reviews in fields where specific guidelines are currently lacking.
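the quantitative measure of inter-reviewer reliability recommended above (items b13, b14, b18, b20 and b21) could, for example, be cohen’s kappa computed over the two reviewers’ screening decisions. a minimal sketch in python; the decision lists are hypothetical, and niro-sr itself does not mandate a specific statistic:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # observed proportion of identical decisions
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # agreement expected by chance, from each rater's marginal frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# hypothetical include/exclude decisions from two screening reviewers
reviewer_1 = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "include", "include", "exclude", "exclude"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # 0.67
```

unlike raw percentage agreement, kappa discounts the agreement two reviewers would reach by chance alone, which matters when most records are excluded.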
we strived to ensure that niro-sr is comprehensive, clear, and openly accessible, to enable researchers to improve their literature review methodology with a systematic and transparent approach. methodological limitations. niro-sr was developed without a pre-registered protocol or previously published methodological guidance for the development of such tools, which could introduce biases at the item selection stage of the tool development. the lack of pre-registration was due to the fact that, as far as we are aware, there was no pre-registration template adequate for the development of niro-sr. our web searches to identify appropriate guidelines and tools were therefore not systematic. we minimised biases through the breadth of the collaboration team and, although the sample of nine researchers providing initial feedback was small, we additionally sought input from multiple independent contributors comprising a cross-discipline mix of academics and librarians with extensive experience of conducting and teaching intervention or non-intervention research systematic reviews. further, we chose to develop niro-sr based on existing, peer-reviewed, consensus-based guidelines of robust methods for rigorous and transparent reporting (see appendix a). as with all guidelines, some limitations may only become fully known once niro-sr has been widely adopted. furthermore, whilst the niroes collaboration represents multiple disciplines and research fields, the dominant field of the authors is the experimental and behavioural sciences, which may reduce the tool’s applicability to some fields. whilst we believe the tool to be particularly applicable to explanatory experimental and basic behavioural/cognitive research, we cannot confidently assess its use in other fields. this paper accompanies the release of niro-sr version 1.1, and we anticipate that further updates will be necessary and may affect the structure, content, and wording of the items.
to retain standardisation, these updates are anticipated to be infrequent. to facilitate future updates, users of niro-sr are encouraged to provide feedback to the corresponding authors. implications and future directions. niro-sr addresses an important gap in the available guidelines, helping reviewers produce high-quality systematic reviews of research in the experimental and behavioural sciences. the project was conceptualised through a collaborative effort during multiple methods- and metascience-oriented meetings, including the society for the improvement of psychological science 2019 conference, the advanced methods for reproducible science 2019 and 2020 workshops, and reproducibilitea meetings. the growing demand for the tool is also reflected in the many presentations about niro-sr delivered at psychology-focused or interdisciplinary meetings and conferences, including the organisation for human brain mapping 2020 conference, the metascience 2021 conference, and the uk reproducibility network’s meeting for open science working groups in 2020. a number of pre-registered protocols have already been completed using niro-sr, some of which can be found on the osf (osf.io/f3brw). we therefore expect a further increase in the use of the niro-sr tool, which we hope will have a significant impact on the quality of systematic reviews in non-intervention research, reducing the need for bespoke customisations of existing guidelines in order to answer specific research questions. a few years after release, we plan to assess the implementation of the niro-sr guidelines to further understand the challenges of conducting systematic reviews in our field, as well as to inform future updates. specifically, we would like to provide an evidence base for whether there is a demand for the tool, as we have already anecdotally observed, and whether reviews using niro-sr are of comparable or greater quality relative to high-quality systematic reviews that have used other pre-existing tools.
the further standardisation of systematic reviews outside of intervention research will also allow for better meta-scientific approaches and the comparison of outcomes across multiple systematic reviews in the future. further, niro-sr provides a solid basis for conducting systematic reviews with a meta-analysis component. whilst niro-sr does not advise on methodology specific to meta-analyses, it will help to raise the standard of the systematic components of such work, such as the establishment of the research question, pre-registration, the search strategy, inclusion/exclusion criteria, and the logging of decision making. finally, niro-sr is tailored for systematic reviews of experimental, cognitive and behavioural research, but future additions to the project could include “plug-ins” for the tool that enhance its existing features (released as needed on the osf page; osf.io/f3brw). for example, additional optional items could assist with reviews of other study designs such as qualitative studies or longitudinal studies, or specific items could be created for other approaches to evidence synthesis such as meta-analyses or systematic maps. additionally, extensions of niro-sr are currently under development, including further guidance for risk of bias and quality assessment (related, but not necessarily synonymous, endeavours). there are elements of a study that may not directly introduce bias but which are nevertheless important indicators of quality, for example incompleteness in the reporting of the methodology, which can lead to problems with replicability. conclusions. niro-sr is a new tool that will allow researchers to follow standardised guidelines for systematic reviews of basic cognitive and behavioural research. it fills an important gap in methodological standards and we hope it will contribute to the improvement of the quality of systematic reviews of research that does not form an intervention.
author contributions. the authorship for this project was determined using the credit taxonomy and the authorship agreement for the niroes collaboration (available on the osf). for the current project, authors were divided into four relevant tiers as specified in the authorship agreement. the first tier, “project management”, specifies the joint lead authors and project co-leads, m.k. topor and j.s. pickering. within each subsequent tier, authors are listed in alphabetical order as follows: tier 2 “major contributions” (data curation, formal analysis, investigation, methodology, visualisation, writing original draft, miscellaneous input into creating the tool): a. barbosa mendes, d.v.m. bishop, f.c. büttner, m.m. elsherif, t.r. evans, e.l. henderson, t. kalandadze, f.t. nitschke, j.p.c. staaks, o. van den akker, s.k. yeung, m. zaneva; tier 3 “feedback and review” (conceptualisation, writing review & editing, feedback on the tool): a. lam, c.r. madan, d. moreau, a. o’mahony, a.j. parker, a. riegelman, m. testerman; and tier 5 “senior supervision” (in addition to tier 2 “major contributions”): s.j. westwood. acknowledgements. special thanks to the attendees of the society for the improvement of psychological science meeting in rotterdam, netherlands, july 2019, as well as the attendees of the advanced methods for reproducible science workshop in windsor, uk, january 2020. we are also very grateful to all researchers at the university of surrey who provided their feedback. additional thanks to: katie corker, margriet groen, matt jaquiery, sayaka kidby, emily kothe, marta majewska, marissa mcbride, and james montilla doble. supplemental material. the niro-sr tool is available at osf.io/c9wer. all data and supplementary materials have been deposited in an open repository on the open science framework. relevant links have been provided throughout the paper for access to specific materials.
all of these materials are hosted on the open niro-sr osf page (osf.io/f3brw; topor et al., 2023). conflict of interest and funding. no funding has been received for the realisation of this project. jade pickering was on the advisory board at meta-psychology at the point of submission. no other authors declare any conflicts of interest. author contact. corresponding author: marta k. topor, marta.topor@hotmail.co.uk (orcid: https://orcid.org/0000-0003-3761-392x). mkt and jsp contributed equally and are joint first authors. open science practices. this article is purely theoretical and as such is not eligible for open science badges. the entire editorial process, including the open reviews, is published in the online supplement.

references

appelbaum, m., cooper, h., kline, r. b., mayo-wilson, e., nezu, a. m., & rao, s. m. (2018). journal article reporting standards for quantitative research in psychology: the apa publications and communications board task force report. the american psychologist, 73(1), 3–25. https://doi.org/10.1037/amp0000191

barba, l. a. (2018). terminologies for reproducible research. arxiv. retrieved october 30, 2020, from https://arxiv.org/abs/1802.03311

booth, a., clarke, m., dooley, g., ghersi, d., moher, d., petticrew, m., & stewart, l. (2012). the nuts and bolts of prospero: an international prospective register of systematic reviews. systematic reviews, 1, 2. https://doi.org/10.1186/2046-4053-1-2

booth, a., noyes, j., flemming, k., moore, g., tunçalp, ö., & shakibazadeh, e. (2019). formulating questions to explore complex interventions within qualitative evidence synthesis. bmj global health, 4(suppl 1), e001107. https://doi.org/10.1136/bmjgh-2018-001107

bramer, w. m. (2015). patient, intervention, comparison, outcome (pico): an overrated tool.
mla news, 55(2).

campbell, m., mckenzie, j. e., sowden, a., katikireddi, s. v., brennan, s. e., ellis, s., hartmann-boyce, j., ryan, r., shepperd, s., thomas, j., welch, v., & thomson, h. (2020). synthesis without meta-analysis (swim) in systematic reviews: reporting guideline. bmj (clinical research ed.), 368, l6890. https://doi.org/10.1136/bmj.l6890

coeytaux, r. r., mcduffie, j., goode, a., cassel, s., porter, w. d., sharma, p., meleth, s., minnella, h., nagi, a., & williams, j. w., jr. (2014). criteria used in quality assessment of systematic reviews (tech. rep.). department of veterans affairs (us). retrieved march 3, 2021, from https://www.ncbi.nlm.nih.gov/books/%7bnbk242394%7d/

collins, a., vercammen, a., mcbride, m., carling, c., & burgman, m. (n.d.). reproducibility of systematic reviews in environmental and conservation science.

committee of medical journal editors. (2021). clinical trials. retrieved march 3, 2020, from http://www.icmje.org/recommendations/browse/publishing-and-editorial-issues/clinical-trial-registration.html

cooke, a., smith, d., & booth, a. (2012). beyond pico: the spider tool for qualitative evidence synthesis. qualitative health research, 22(10), 1435–1443. https://doi.org/10.1177/1049732312452938

critical appraisal skills program. (n.d.). casp systematic review checklist. retrieved december 2, 2020, from https://casp-uk.net/casp-tools-checklists/

dehaven, a. (2017). preregistration: a plan, not a prison. retrieved march 3, 2020, from https://www.cos.io/blog/preregistration-plan-not-prison

gates, n. j., & march, e. g. (2016). a neuropsychologist’s guide to undertaking a systematic review for publication: making the most of prisma guidelines. neuropsychology review, 26(2), 109–120. https://doi.org/10.1007/s11065-016-9318-0

glass, g. v. (1972). the wisdom of scientific inquiry on education. journal of research in science teaching, 9(1), 1–18. https://doi.org/10.1002/tea.3660090103

haddaway, n. r., macura, b., whaley, p., & pullin, a. s. (2018). roses reporting standards for systematic evidence syntheses: pro forma, flow-diagram and descriptive summary of the plan and conduct of environmental systematic reviews and systematic maps. environmental evidence, 7(1), 7. https://doi.org/10.1186/s13750-018-0121-7

higgins, j. p., thomas, j., chandler, j., cumpston, m., li, t., page, m. j., & welch, v. a. (eds.). (2019). cochrane handbook for systematic reviews of interventions (version 6). cochrane. https://doi.org/10.1002/9781119536604

hohn, r. e., slaney, k. l., & tafreshi, d. (2020). an empirical review of research and reporting practices in psychological meta-analyses. review of general psychology. https://doi.org/10.1177/1089268020918844

ioannidis, j. p. a. (2016). the mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses. the milbank quarterly, 94(3), 485–514. https://doi.org/10.1111/1468-0009.12210

joanna briggs institute. (n.d.). critical appraisal tools. retrieved december 2, 2020, from https://joannabriggs.org/critical-appraisal-tools

maassen, e., van assen, m. a. l. m., nuijten, m. b., olsson-collentine, a., & wicherts, j. m. (2020). reproducibility of individual effect sizes in meta-analyses in psychology. plos one, 15(5), e0233107. https://doi.org/10.1371/journal.pone.0233107

macpherson, a., & jones, o. (2010). editorial: strategies for the development of international journal of management reviews. wiley. https://doi.org/10.1111/j.1468-2370.2010.00282.x

methodological expectations of campbell collaboration intervention reviews: conduct standards (tech. rep.). (2019). the campbell collaboration. https://doi.org/10.4073/cpg.2016.3

methodological expectations of campbell collaboration intervention reviews: reporting standards (tech. rep.). (2019). the campbell collaboration.
https://doi.org/10.4073/cpg.2016.4

miller, j. (2002). the scottish intercollegiate guidelines network (sign). the british journal of diabetes & vascular disease, 2(1), 47–49. https://doi.org/10.1177/14746514020020010401

moher, d., liberati, a., tetzlaff, j., altman, d., & the prisma group. (2009). preferred reporting items for systematic reviews and meta-analyses: the prisma statement. plos medicine, 6(7), e1000097. https://doi.org/10.1371/journal.pmed.1000097

moher, d., shamseer, l., clarke, m., ghersi, d., liberati, a., petticrew, m., shekelle, p., stewart, l. a., & the prisma-p group. (2015). preferred reporting items for systematic review and meta-analysis protocols (prisma-p) 2015 statement. systematic reviews, 4(1), 1. https://doi.org/10.1186/2046-4053-4-1

munafò, m. r., nosek, b. a., bishop, d. v. m., button, k. s., chambers, c. d., du sert, n. p., simonsohn, u., wagenmakers, e.-j., ware, j. j., & ioannidis, j. p. a. (2017). a manifesto for reproducible science. nature human behaviour, 1, 0021. https://doi.org/10.1038/s41562-016-0021

national heart, lung, and blood institute. (n.d.). study quality assessment tools. retrieved december 2, 2020, from https://www.nhlbi.nih.gov/health-topics/study-quality-assessment-tools

nosek, b. a., ebersole, c. r., dehaven, a. c., & mellor, d. t. (2018). the preregistration revolution. proceedings of the national academy of sciences of the united states of america, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114

oxman, a. d., & guyatt, g. h. (1991). validation of an index of the quality of review articles. journal of clinical epidemiology, 44(11), 1271–1278. https://doi.org/10.1016/0895-4356(91)90160-b

page, m. j., mckenzie, j. e., bossuyt, p. m., boutron, i., hoffmann, t. c., mulrow, c. d., shamseer, l., tetzlaff, j. m., akl, e. a., brennan, s. e., chou, r., glanville, j., grimshaw, j. m., hróbjartsson, a., lalu, m. m., li, t., loder, e. w., mayo-wilson, e., mcdonald, s., . . . moher, d. (2021). the prisma 2020 statement: an updated guideline for reporting systematic reviews. systematic reviews, 10(1), 89. https://doi.org/10.1186/s13643-021-01626-4

page, m. j., & moher, d. (2017). evaluations of the uptake and impact of the preferred reporting items for systematic reviews and meta-analyses (prisma) statement and extensions: a scoping review. systematic reviews, 6(1), 263. https://doi.org/10.1186/s13643-017-0663-8

polanin, j. r., hennessy, e. a., & tsuji, s. (2020). transparency and reproducibility of meta-analyses in psychology: a meta-review. perspectives on psychological science, 15(4), 1026–1041. https://doi.org/10.1177/1745691620906416

rethlefsen, m. l., kirtley, s., waffenschmidt, s., ayala, a. p., moher, d., page, m. j., koffel, j. b., & the prisma-s group. (2021). prisma-s: an extension to the prisma statement for reporting literature searches in systematic reviews. systematic reviews, 10(1), 39. https://doi.org/10.1186/s13643-020-01542-z

shea, b. j., reeves, b. c., wells, g., thuku, m., hamel, c., moran, j., moher, d., tugwell, p., welch, v., kristjansson, e., & henry, d. a. (2017). amstar 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. bmj (clinical research ed.), 358, j4008. https://doi.org/10.1136/bmj.j4008

siddaway, a. p., wood, a. m., & hedges, l. v. (2019). how to do a systematic review: a best practice guide for conducting and reporting narrative reviews, meta-analyses, and meta-syntheses. annual review of psychology, 70, 747–770. https://doi.org/10.1146/annurev-psych-010418-102803

stroup, d. f., berlin, j. a., morton, s. c., olkin, i., williamson, g. d., rennie, d., moher, d., becker, b. j., sipe, t. a., & thacker, s. b. (2000). meta-analysis of observational studies in epidemiology: a proposal for reporting. meta-analysis of observational studies in epidemiology (moose) group.
the journal of the american medical association, 283(15), 2008–2012. https://doi.org/10.1001/jama.283.15.2008
topor, m., pickering, j. s., barbosa mendes, a., bishop, d. v. m., büttner, f. c., elsherif, m. m., evans, t. r., henderson, e. l., kalandadze, t., nitschke, f. t., staaks, j., van den akker, o., yeung, s. k., zaneva, m., lam, a., madan, c., moreau, d., o’mahony, a., parker, a. j., . . . westwood, s. (2023). non-interventional, reproducible, and open systematic review (niro-sr) guidelines. retrieved march 27, 2023, from https://osf.io/f3brw/
watts, r. d., & li, i. w. (2019). use of checklists in reviews of health economic evaluations, 2010 to 2018. value in health, 22(3), 377–382.
https://doi.org/10.1016/j.jval.2018.10.006
whiting, p., savović, j., higgins, j. p. t., caldwell, d. m., reeves, b. c., shea, b., davies, p., kleijnen, j., churchill, r., & the robis group. (2016). robis: a new tool to assess risk of bias in systematic reviews was developed. journal of clinical epidemiology, 69, 225–234. https://doi.org/10.1016/j.jclinepi.2015.06.005

appendix

the list of guidelines used to extract items for curation and preparation of niro-sr. 517 items have been extracted verbatim from the guidelines below:
• amstar systematic review quality checklist (shea et al., 2017)
• casp checklist for systematic reviews (critical appraisal skills program, n.d.)
• criteria used in quality assessment of systematic reviews (coeytaux et al., 2014)
• joanna briggs institute checklist for systematic reviews (joanna briggs institute, n.d.)
• meccir: conduct standards (methodological expectations of campbell collaboration intervention reviews: conduct standards, 2019)
• meccir: reporting standards (methodological expectations of campbell collaboration intervention reviews: reporting standards, 2019)
• moose: reporting guidelines for meta-analysis of observational studies in epidemiology (stroup et al., 2000)
• national heart lung and blood institute checklist for systematic reviews (national heart, lung, and blood institute, n.d.)
• overview quality assessment questionnaire (oxman and guyatt, 1991)
• prisma protocols (moher et al., 2015)
• prisma statement (moher et al., 2009)
• prisma 2020 update item bank (page et al., 2021)
• prospero (booth et al., 2012)
• reproducibility of systematic reviews in environmental and conservation science (collins et al., n.d.)
• robis: tool to assess risk of bias in systematic reviews (1.2) (whiting et al., 2016)
• roses (haddaway et al., 2018)
• sign tool based on amstar (miller, 2002)
• spider alternative to pico for qualitative and mixed research (cooke et al., 2012)
• synthesis without meta-analysis (swim) in systematic reviews (campbell et al., 2020)
an additional 55 relevant items were extracted from close inspection of the prisma 2020 update item bank (https://osf.io/kbj6v/, page et al., 2021), which included a number of additional tools and guidelines used across different fields.

meta-psychology, 2020, vol 4, mp.2018.872 https://doi.org/10.15626/2018.872 article type: commentary published under the cc-by4.0 license open data: yes open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: n/a edited by: marcel van assen reviewed by: m. van assen, r. van aert analysis reproduced by: andré kalmendal all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/j2qgs

coding errors lead to unsupported conclusions: a critique of hofmann et al. (2015)

donald r. williams, university of california, davis, usa
paul-christian bürkner, aalto university, finland

abstract
we have detected coding errors in the meta-analysis of hofmann et al. (2015), who investigated the effect of intranasal oxytocin on psychiatric symptoms. we demonstrate that, after correcting these errors and reanalysing the data, the main conclusions of hofmann et al. (2015) are no longer supported.

keywords: meta-analysis, coding errors, intranasal oxytocin

introduction

due to converging evidence in animals and healthy human populations, oxytocin has been identified as potentially having therapeutic properties. as such, numerous randomized controlled trials have investigated the efficacy of intranasal oxytocin (in-ot) in reducing psychiatric symptoms in clinical populations.
as results have been mixed, meta-analytic reviews seeking to synthesize the extant literature have been published. one such review was published in psychiatry research (hofmann et al., 2015). the authors concluded that in-ot significantly improved psychiatric symptoms, with significant effects on depression, anxiety, psychotic symptoms, and general psychopathology. we found several errors in this paper; once corrected, all results were null (no significant effect of in-ot), which suggests that the conclusions of hofmann et al. (2015) are incorrect. the current letter therefore has three aims: (1) we outline several errors and raise questions regarding their analysis; (2) we perform a meta-analysis using the same primary studies and similar methods; and (3) we conclude by stating the importance of issuing a correction.

errors and questions

effect size directions

while conducting a meta-analysis on a similar topic, we initially noticed that hofmann et al. (2015) misspecified the direction of one outcome. in other words, the primary study reported that the placebo group improved whereas the in-ot group did not (lee et al., 2013). however, table 1 of hofmann et al. (2015) reports a large effect of in-ot (hedges’ g = 1.07). furthermore, all of the outcomes reported in their table 1 were positive, which indicates in-ot was superior to placebo in all instances. from the primary studies, however, we extracted the relevant data and found that 6 out of the 16 outcomes used to compute the overall effect should have been negative. as seen in figure 1 of this letter, all effects to the left of 0 had the wrong direction in hofmann et al. (2015).

possible selection bias

decisions made during the research process can influence the presence of an effect (gelman and loken, 2014). this is true in meta-analyses, particularly when extracting only one outcome from several possibilities. anagnostou et al. (2012) reported three outcomes on repetitive behavior.
while two of the effects were either minimal (d = 0.13) or in the opposite direction (d = -0.22), hofmann et al. (2015) selected the largest effect in support of in-ot’s efficacy (d = 0.64). while the yale-brown obsessive compulsive scale produced the negative effect in anagnostou et al. (2012), the same scale produced a positive effect in epperson et al. (1996) and was selected for inclusion by hofmann et al. (2015). dadds et al. (2014) reported two measures of repetitive behavior. for these outcomes, the placebo group showed improvement from pre- to post-test scores, whereas symptoms actually increased in the in-ot group. from this study, hofmann et al. (2015) again selected the outcome that was most favorable to in-ot. however, this outcome (child autism rating scale) was not labeled as repetitive behavior in dadds et al. (2014), while the outcomes that favored the placebo group were considered repetitive behavior. finally, since multiple outcomes were extracted from some studies, the overall effect of in-ot on psychiatric symptoms was computed on a subset of effects. based on the effects reported in table 1 of hofmann et al. (2015), the average effect size was larger for the included outcomes (d = 0.83) than the excluded outcomes (d = 0.49).

misspecified outcomes

we also believe several outcomes were not coded accurately in hofmann et al. (2015). the majority of outcomes in the psychotic symptoms category were total scores from the positive and negative symptoms scale (panss). total scores of the panss are a combination of negative symptoms, positive symptoms, and general psychopathology (kay et al., 1987). however, hofmann et al. (2015) included two outcomes as psychotic symptoms that exclusively measured aspects of negative symptoms in schizophrenia. they also coded two brief psychiatric rating scale (bprs) outcomes as general psychopathology.
based on the contents of the scale and other meta-analyses on this topic (oya et al., 2016), these should have either been coded as psychotic symptoms or a rationale for the divergent coding should have been provided. all four of these outcomes were reported as positive which, in addition to the aforementioned errors, likely inflated their meta-analytic estimates.

meta-analysis

based on the methods provided in hofmann et al. (2015), we attempted to replicate their procedures as closely as possible, including the outcomes included, effect size calculation, and assessment of publication bias. we then analyzed the data in a manner that was consistent with the extant literature and previous meta-analyses on this topic. all computations were done in r with the metafor package (viechtbauer, 2010). we used the default settings in metafor, including reml for estimating the between-study variance and the q-profile method for the corresponding confidence intervals. two-tailed p-values are reported. our fully reproducible analysis is available on osf (https://osf.io/kd3en/).

methods

the exact method used for effect size calculation is not entirely clear in hofmann et al. (2015). accordingly, we computed both hedges’ g (smd), exclusively from the post-treatment scores, and the standardized mean change with raw score standardization (smcr), which is a measure of pre- to post-treatment change (r = 0.7) compared between groups (in-ot vs. placebo). from their methods section, we think that an effect size similar to the smcr was most likely used. in the present analysis, when a 95% confidence interval (ci) excluded zero, there was evidence for a significant effect at p-value < 0.05.

replication attempt

as seen in figure 1, the overall estimates for psychiatric symptoms were not significant (smd = 0.22, z = 1.67, p-value = 0.0953, 95% ci [-0.04, 0.47]; smcr = 0.17, z = 1.23, p-value = 0.217, 95% ci [-0.10, 0.43]).
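for concreteness, the two effect size definitions and the decision rule above can be sketched as follows. this is a minimal illustrative sketch, not the authors’ r/metafor code (their reproducible analysis is on osf); the function names and inputs are our own, and the formulas are the standard textbook definitions.

```python
import math

def hedges_g(m_t, m_c, sd_t, sd_c, n_t, n_c):
    """standardized mean difference (smd) from post-treatment scores,
    with hedges' small-sample correction."""
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    d = (m_t - m_c) / sd_pooled
    j = 1 - 3 / (4 * (n_t + n_c) - 9)  # small-sample correction factor
    return d * j

def smcr(m_post, m_pre, sd_pre):
    """standardized mean change with raw score standardization, within one
    group. the pre-post correlation (r = 0.7 above) enters the sampling
    variance, not the point estimate; the group contrast is the difference
    between the treatment and placebo changes."""
    return (m_post - m_pre) / sd_pre

def is_significant(es, se):
    """the decision rule above: a 95% ci (es +/- 1.96 * se) excluding zero."""
    return es - 1.96 * se > 0 or es + 1.96 * se < 0
```

for example, `is_significant(0.41, 0.19)` returns true, mirroring the one significant smd estimate (psychotic symptoms) in table 1.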
trim and fill procedures indicated bias in the smd outcomes and, when corrected, the effect was reduced (smd = 0.07, 95% ci [-0.18, 0.32]). there was significant between-study variance for the smcrs (τ2 = 0.15, p-value = 0.003), but not for the smds (τ2 = 0.09, p-value = 0.1149). we then obtained estimates for specific symptoms (table 1). the meta-analytic estimates for depression, anxiety, repetitive behaviors, and general psychopathology were all non-significant (cis included zero). while using the outcomes reported in hofmann et al. (2015) produced a significant smd estimate for psychotic symptoms (table 1), restricting the outcomes to total psychotic symptoms resulted in a loss of statistical significance.

figure 1. (a) smcr estimates. (b) smd estimates. the effect from averbeck et al. (2012) was computed from a t-statistic on post-treatment scores. outcomes from macdonald et al. (2013) were obtained from a figure using web plot digitizer (rohatgi, 2017). pedersen et al. (2013) did not report pre-scores. through email, dr. pedersen confirmed that the authors of hofmann et al. (2015) did not contact them in regards to pre-scores. as such, we used change scores (smcr) from day 2 to day 3, while smd was calculated from day 3. we used the same outcome for dadds et al. (2014) as hofmann et al. (2015). it should be noted, however, that these were pre-treatment and follow-up scores (3 months later). dr. dadds confirmed that they did not collect post-treatment data. standard deviations (sd) for modabbernia et al.
(2013)

table 1
meta-analytic estimates for specific symptoms

smcr              es      se      z        p-value   95% ci
anxiety           0.09    0.17    0.5147   0.6067    [-0.24, 0.41]
depression        0.29    0.27    1.0843   0.2782    [-0.23, 0.81]
psychopathology   0.10    0.18    0.5664   0.5711    [-0.25, 0.46]
psychotic         0.31    0.18    1.6814   0.0927    [-0.05, 0.66]
repetitive        -0.06   0.21    -0.2889  0.7726    [-0.46, 0.35]
τ2                0.09    0.06             0.0140    [0.01, 0.29]

smd               es      se      z        p-value   95% ci
anxiety           0.08    0.18    0.4695   0.6387    [-0.26, 0.43]
depression        0.28    0.27    1.0300   0.3030    [-0.25, 0.81]
psychopathology   0.14    0.19    0.7301   0.4653    [-0.24, 0.52]
psychotic         0.41    0.19    2.1278   0.0334    [0.03, 0.80]
repetitive        0.15    0.22    0.6876   0.4917    [-0.28, 0.58]
τ2                0.069   0.069            0.0561    [0.00, 0.40]

note: smcr: standardized mean change with raw score standardization. smd: standardized mean difference (hedges’ g). es: effect size. τ2: residual between-study variance after accounting for symptom type.

conclusion

although hofmann et al. (2015) is not a new article, and was recently retracted (hofmann et al., 2016), there are several reasons this letter deserves attention. first, while they concluded that in-ot had robust effects on several psychiatric symptoms, our analysis suggests that all effects were non-significant. second, in-ot research has become a very active field, and ensuring correctness in the published literature is a mental health priority. third, there is overwhelming evidence from animal studies supporting the role of oxytocin in psychiatric disorders, especially those characterized by social dysfunction (lim et al., 2005). by ensuring null results are represented in the literature, researchers might be compelled to improve current methods of delivery or dedicate more resources to developing pharmaceutical drugs that directly activate oxytocin receptors. accordingly, we hope this letter not only results in a correction but also moves the field forward, which is especially important because of the lack of effective treatments for certain aspects of these disorders.
author contact

corresponding author: donald r. williams (email: drwwilliams@ucdavis.edu). paul-christian bürkner (email: paul.buerkner@gmail.com).

conflict of interest and funding

the authors have no conflict of interest to declare. there was no specific funding for this study.

author contributions

drw conducted the analysis and wrote the initial draft of the paper. pcb originally spotted the inconsistencies in hofmann et al. (2015) and provided proofreading for the analysis and the paper. author names are ranked in order of contribution.

open science practices

this article earned the open data and the open materials badge for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

anagnostou, e., soorya, l., chaplin, w., bartz, j., halpern, d., wasserman, s., wang, a. t., pepa, l., tanel, n., kushki, a., & hollander, e. (2012). intranasal oxytocin versus placebo in the treatment of adults with autism spectrum disorders: a randomized controlled trial. molecular autism, 3(1), 16. https://doi.org/10.1186/2040-2392-3-16
averbeck, b. b., bobin, t., evans, s., & shergill, s. s. (2012). emotion recognition and oxytocin in patients with schizophrenia. psychological medicine, 42, 259–266. https://doi.org/10.1017/s0033291711001413
dadds, m. r., macdonald, e., cauchi, a., williams, k., levy, f., & brennan, j. (2014). nasal oxytocin for social deficits in childhood autism: a randomized controlled trial. journal of autism and developmental disorders, 44(3), 521–531. https://doi.org/10.1007/s10803-013-1899-3
epperson, c. n., mcdougle, c. j., & price, l. h. (1996). intranasal oxytocin in obsessive-compulsive disorder. biological psychiatry, 40(6), 547–549. https://doi.org/10.1016/0006-3223(96)00120-5
gelman, a., & loken, e. (2014). the garden of forking paths: why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. psychological bulletin, 140(5), 1272–1280. https://doi.org/10.1037/a0037714
hofmann, s. g., fang, a., & brager, d. n. (2015). effect of intranasal oxytocin administration on psychiatric symptoms: a meta-analysis of placebo-controlled studies. psychiatry research, 228(3), 708–714. https://doi.org/10.1016/j.psychres.2015.05.039
hofmann, s. g., fang, a., & brager, d. n. (2016). notice of retraction and replacement: hofmann et al. effect of intranasal oxytocin administration on psychiatric symptoms: a meta-analysis of placebo-controlled studies. psychiatry research. 2015;228:708-714. psychiatry research. https://doi.org/10.1016/j.psychres.2016.10.055
kay, s. r., fiszbein, a., & opler, l. a. (1987). the positive and negative syndrome scale (panss) for schizophrenia. schizophrenia bulletin, 13(2), 261–276.
lee, m. r., wehring, h. j., mcmahon, r. p., linthicum, j., cascella, n., liu, f., bellack, a., buchanan, r. w., strauss, g. p., contoreggi, c., & kelly, d. l. (2013). effects of adjunctive intranasal oxytocin on olfactory identification and clinical symptoms in schizophrenia: results from a randomized double blind placebo controlled pilot study. schizophrenia research, 145(1-3), 110–115. https://doi.org/10.1016/j.schres.2013.01.001
lim, m. m., bielsky, i. f., & young, l. j. (2005). neuropeptides and the social brain: potential rodent models of autism. international journal of developmental neuroscience, 23(2-3 spec. iss.), 235–243. https://doi.org/10.1016/j.ijdevneu.2004.05.006
macdonald, k., macdonald, t. m., brüne, m., lamb, k., wilson, m. p., golshan, s., & feifel, d. (2013).
oxytocin and psychotherapy: a pilot study of its physiological, behavioral and subjective effects in males with depression. psychoneuroendocrinology, 38(12), 2831–2843. https://doi.org/10.1016/j.psyneuen.2013.05.014
modabbernia, a., rezaei, f., salehi, b., jafarinia, m., ashrafi, m., tabrizi, m., hosseini, s. m. r., tajdini, m., ghaleiha, a., & akhondzadeh, s. (2013). intranasal oxytocin as an adjunct to risperidone in patients with schizophrenia: an 8-week, randomized, double-blind, placebo-controlled study. cns drugs, 27(1), 57–65. https://doi.org/10.1007/s40263-012-0022-1
oya, k., matsuda, y., matsunaga, s., kishi, t., & iwata, n. (2016). efficacy and safety of oxytocin augmentation therapy for schizophrenia: an updated systematic review and meta-analysis of randomized, placebo-controlled trials. european archives of psychiatry and clinical neuroscience, 266(5), 439–450. https://doi.org/10.1007/s00406-015-0634-9
pedersen, c. a., smedley, k. l., leserman, j., jarskog, l. f., rau, s. w., kampov-polevoi, a., casey, r. l., fender, t., & garbutt, j. c. (2013). intranasal oxytocin blocks alcohol withdrawal in human subjects. alcoholism: clinical and experimental research, 37(3), 484–489. https://doi.org/10.1111/j.1530-0277.2012.01958.x
rohatgi, a. (2017). webplotdigitizer. http://arohatgi.info/webplotdigitizer
viechtbauer, w. (2010). conducting meta-analyses in r with the metafor package. journal of statistical software, 36(3), 1–48. https://doi.org/10.18637/jss.v036.i03

meta-psychology, 2022, vol 6, mp.2020.2718 https://doi.org/10.15626/mp.2020.2718 article type: tutorial published under the cc-by4.0 license open data: not applicable open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: rickard carlsson reviewed by: nordström, t., rohrer, j., zigerell, l.j.
analysis reproduced by: lucija batinović all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/w98s6

multiverse analyses in the classroom

tom heyman, methodology and statistics unit, institute of psychology, leiden university
wolf vanpaemel, faculty of psychology and educational sciences, ku leuven

abstract
most empirical papers in psychology involve statistical analyses performed on a new or existing dataset. sometimes the robustness of a finding is demonstrated via data-analytical triangulation (e.g., obtaining comparable outcomes across different operationalizations of the dependent variable), but systematically considering the plethora of alternative analysis pathways is rather uncommon. however, researchers increasingly recognize the importance of establishing the robustness of a finding. the latter can be accomplished through a so-called multiverse analysis, which involves methodically examining the arbitrary choices pertaining to data processing and/or model building. in the present paper, we describe how the multiverse approach can be implemented in student research projects within psychology programs, drawing on our personal experience as instructors. embedding a multiverse project in students’ curricula addresses an important scientific need, as studies examining the robustness or fragility of phenomena are largely lacking in psychology. additionally, it offers students an ideal opportunity to put various statistical methods into practice, thereby also raising awareness about the abundance and consequences of arbitrary decisions in data-analytic processing. an attractive practical feature is that one can reuse existing datasets, which proves especially useful when resources are limited, or when circumstances such as the covid-19 lockdown measures restrict data collection possibilities.
keywords: multiverse analysis; robustness; education; pedagogy; open science

an important part of many psychology students’ (under)graduate programs are research-methods classes in which students are asked to complete their own (small-scale) research project (e.g., kierniesky, 2005). typically, the goal is to run through the entire empirical cycle, thus putting knowledge gained from previous theory-focused courses into practice. however, this can be quite challenging, as time and resources are often limited in such projects. as a consequence, students and instructors might (begrudgingly) take shortcuts, resulting in ill-designed or underpowered studies, poorly-motivated research questions, sloppy measurement practices, and so on. perhaps the most devastating consequence of this approach is that students could come away with a wrong impression of what psychological research entails, and it might even instill bad habits in prospective researchers. in the present paper, we suggest an alternative implementation of research-methods classes that addresses these concerns. in particular, we propose that completing a multiverse analysis project as part of such research-methods classes has several important benefits. first, we explain what a multiverse analysis entails (see steegen et al., 2016). then, we describe the two main ingredients of a multiverse-in-the-classroom project: a suitable dataset and a solid (meta-)scientific background. next, we give a worked example of such a project, based on our personal experience as instructors. finally, we discuss the benefits and challenges of the multiverse-in-the-classroom.

what is a multiverse analysis?

most empirical papers in psychology involve some kind of data analysis. typically, there is no unique path from the raw data to the eventual conclusions of a paper.
researchers need to make a number of decisions along the way, such as whether and how to deal with outliers and missing data, whether and how to transform variables, and so on. in some cases, theoretical considerations provide a clear solution to such questions; yet, at times, researchers have little to go on, so they turn to their gut feeling, lab habits, or field-specific standards, which are often poorly motivated. as a result, when processing and analyzing empirical data, researchers regularly face certain choices that are arbitrary in nature. these researcher degrees of freedom (simmons et al., 2011) lead to a garden of forking paths (gelman & loken, 2014). as an example, suppose that, for a given dataset, a researcher identifies four plausible ways to deal with outliers, three approaches to handle missing data, and two reasonable options to transform a particular variable. assuming all combinations are sensible, this would lead to 4*3*2 = 24 unique paths, each with its own outcome (see also bishop, 2016). however, researchers usually report the results for just one or a few of these paths, by picking only one or a few options out of the pool of plausible alternatives (e.g., deleting observations 2.5 standard deviations above the mean, listwise deletion when encountering missing values, and log-transforming a positively skewed variable; see also elson, 2016, for a practical illustration). in contrast, the idea of a multiverse analysis (steegen et al., 2016) is to explore and report on a wide array of imaginable (combinations of) reasonable alternatives, each of which provides an answer to the same research question. by explicitly considering the results of several reasonable analyses, a multiverse analysis can give an idea about the robustness or fragility of a certain finding, and might even point to moderators of the effect in question (i.e., key choices regarding data processing and/or analysis that the conclusion depends on).
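the 4*3*2 example above can be made concrete with a few lines of code. multiverse tooling in this literature is typically r-based, but the combinatorial idea is language-agnostic; the python sketch below uses made-up option labels purely for illustration:

```python
from itertools import product

# illustrative (made-up) sets of reasonable data-processing options
outlier_rules = ["keep all", "drop > 2.5 sd", "drop > 3 sd", "winsorize"]  # 4 options
missing_rules = ["listwise deletion", "pairwise deletion", "imputation"]   # 3 options
transforms = ["raw scores", "log transform"]                               # 2 options

# each element of `multiverse` is one unique analysis pathway
multiverse = list(product(outlier_rules, missing_rules, transforms))
print(len(multiverse))  # prints 24

# in a full multiverse analysis, the same statistical test would be run
# once per pathway, and the distribution of outcomes across all pathways
# would be reported, rather than a single hand-picked result.
```

this makes explicit why reporting a single pathway is a 1-in-24 selection, and why the multiverse grows multiplicatively with each additional arbitrary choice.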
a multiverse analysis can be applied to newly collected data (e.g., kalokerinos et al., 2019), but also retrospectively to existing data (e.g., moors & hesselmann, 2019). for instance, credé and phillips (2017) conducted a multiverse analysis on data from carney et al. (2010) examining the power pose effect, which is the (controversial) finding that holding a high-power body pose affects hormone levels. their multiverse analysis revealed that most alternative pathways yielded null effects, whereas the original single-pathway analysis produced a significant effect. the importance of a multiverse analysis is also nicely illustrated by the study of silberzahn et al. (2018), in which 29 research teams independently examined whether referees in soccer are more likely to give red cards to players with a darker skin tone compared to light-skin-toned players. all teams used the same dataset to answer this question, yet the conclusions varied considerably: 20 out of 29 teams (69%) found a positive relation (i.e., dark-skin-toned players tended to receive more red cards), whereas 9 teams obtained a null effect, which was even numerically negative in two cases. these results underscore that there are often several ways to process and analyze a given dataset, and that picking a single pathway might be deceiving, which is why conducting a multiverse analysis can be very informative. ideas similar to that of a multiverse analysis have been proposed under different names, such as specification curve analysis (simonsohn et al., 2020), vibration of effects analysis (patel et al., 2015), multimodel analysis (young & holsteen, 2017), and the many-analysts approach by silberzahn et al. (2018) discussed above (though, in contrast with the other approaches, the latter distributes the different choices over research teams rather than having them performed by a single team).
multiverse-style analyses are increasingly being recognized as providing crucial information, and researchers have also proposed various extensions and refinements. for instance, multiverse analyses have been applied in the context of meta-analyses (voracek et al., 2019), suggested as an approach to deal with different random-effect structures of multi-level models (harder, 2020), and used in combination with so-called explorable explanations allowing readers of a paper to dynamically move through the multiverse (dragicevic et al., 2019). in addition, liu et al. (2020) recently developed a programming tool called boba, which helps researchers to conduct and visualize multiverse analyses, whereas others have developed specific r packages to facilitate multiverse analyses (e.g., masur & scharkow, 2019; sarma & kay, 2019).

teaching multiverse analyses

the key message of this paper is that multiverse analyses are ideally suited to be included in laboratory or research-methods classes. in line with its general theme, there are a multitude of ways in which multiverse analyses can be incorporated in research-methods classes, taking into account the available time, place in the curriculum, and learning objectives. yet, they all require two essential ingredients: a suitable dataset and a solid (meta-)scientific background. both of these elements will be discussed in turn, including some guidance based on personal experience.

a suitable dataset

a multiverse analysis can be conducted on newly gathered data, or one could reuse an existing dataset. from an educational point of view, the former option is fairly comparable to a typical student research project, though the eventual statistical analyses will be considerably more elaborate, sophisticated, and time-consuming.
focusing on existing data is perhaps more unusual in the context of a research-methods class, in that it involves finding a suitable dataset and isolating the hypotheses of interest, rather than designing a study to test a hypothesis and collecting data. for short projects, or when students are relatively inexperienced, the instructor could select one or a few suitable studies, thus assuring that students can hit the ground running. alternatively, students with a stronger background could be given the opportunity to find a suitable study themselves. selecting a study from the literature for a multiverse analysis comes with several challenges. one obvious requirement for such a study is that it should have publicly available data, or that the original authors share their data for the agreed-upon purposes. this already narrows down the pool of studies, as psychological scientists are often not able or willing to share research data (vanpaemel et al., 2015; wicherts et al., 2006), though since the start of the open science movement and its various initiatives (e.g., morey et al., 2016), there has been an increase in data availability (kidwell et al., 2016). furthermore, even if data are available, it does not necessarily imply that they are amenable to a multiverse analysis. it might, for instance, be unclear what a certain variable measures, or how a data file is structured (hardwicke et al., 2018). obviously, the multiverse-using-existing-data approach is only feasible when one has access to reusable data. another important criterion is that the study should afford plausible alternative data-analytic pathways, tailored to the students’ capabilities, to test the hypothesis of interest. we suspect that many studies in psychology meet this requirement, by for example focusing on outlier detection, dichotomization of variables, covariate inclusion, and so on. however, the data need to be available at a level raw enough to allow the construction of different pathways.
if one only has access to the processed data (e.g., after dichotomization), rather than to the raw data, certain reasonable alternative processing and analysis options cannot be explored. a final issue to consider is analytical reproducibility (i.e., conducting the same analyses on the same dataset and obtaining the same results). ideally, one selects a study of which the (most important) results are reproducible, or, at minimum, one for which the reason for nonreproducibility is clear. this requirement restricts the pool of possible target studies even further, as analytic reproducibility within psychological research has been shown to be far from ideal. for example, hardwicke et al. (2018) were able to independently reproduce the key results from only 11 out of 35 articles with reusable data published in the journal cognition. more surprisingly, even with the help of the original authors, key results of 13 articles could not be reproduced. artner et al. (2021) describe similar struggles in their attempt to reproduce 232 key statistical claims from 46 articles, based on the raw data, without help from the original authors (see also wicherts et al., 2011). although reproducibility is not strictly necessary in order to conduct a multiverse analysis, it does provide some reassurance that the data were processed and interpreted in the way intended by the authors. for example, before conducting their multiverse analysis, steegen et al. (2016) had to correct various minor reporting errors in the original data, which were discovered only by first attempting to reproduce the results (see their supplemental materials). 
if the original results are not (entirely) reproducible, but the source of the inconsistencies is easily identifiable (e.g., use of dummy coding rather than effect coding, or correctable typos in the data file), one can still be reasonably confident in one's understanding of the data-analysis, and the study might be a suitable target for the type of research project described here. in fact, such cases can be especially interesting from an educational point of view, as they demonstrate the project's relevance, and illustrate that even accomplished researchers might struggle with data analysis at times. yet, when there is no discernible explanation for nonreproducible results, undertaking a multiverse analysis is potentially fruitless, especially when the discrepancies are substantial, because one might have misinterpreted the data. of course, it is also possible that the original authors made a mistake, but it can be time-consuming to figure this out, and the authors might not be able or willing to help clear up any discrepancies. finding a study meeting all these requirements can be quite challenging, for students and instructors alike. a useful starting point for this search process is the article library on curatescience.org, which provides the possibility of filtering articles based on the availability of data (lebel et al., 2018). furthermore, one could browse repositories like the open science framework (soderberg, 2018) for articles with open data. consulting recent issues of journals using badges to signal articles with open data and open materials (https://www.cos.io/our-services/badges; kidwell et al., 2016) is another excellent option. of course, the instructor could provide a dataset of their own or one they are already familiar with. this could either be the primary or only option (see example application below), or serve as a back-up in case (some) students wouldn't be able to find a suitable dataset themselves. 
based on our experience, both of these approaches work well. a solid (meta-)scientific background it is important to build a solid meta-scientific framework, and provide students with sufficient background information about multiverse analyses at the beginning of the project (unless they are already familiar with these concepts from other courses). for example, one could cover some insightful meta-scientific articles such as simmons et al. (2011) about researcher degrees of freedom and their effect on the false positive rate, gelman and loken (2014), which describes how data-analysis can be conceived as a garden of forking paths, and steegen et al. (2016), which introduces multiverse analyses. that way, students are gently introduced to the concept of a multiverse analysis and the rationale behind it. in addition, it serves to foster critical thinking and demonstrates the relevance of such (meta-)scientific studies, including their own. besides these more general meta-scientific articles, students could benefit from several (published) examples of a multiverse analysis (e.g., credé & phillips, 2017; moors & hesselmann, 2019), to give them an idea of what it concretely entails. this serves two purposes. one, it provides guidance on how to summarize and interpret the outcome of a multiverse analysis (e.g., plotting a distribution of p-values, or creating a heatmap with p-values as a function of the various analytic pathways). two, it helps students recognize potentially arbitrary choices, thus giving them inspiration for their own multiverse. still, it can be quite challenging and overwhelming for students to generate alternative data-analytic pathways. a useful source, besides the papers mentioned above, is the work of wicherts et al. (2016), which offers a comprehensive overview of researcher degrees of freedom. moreover, one could also encourage students to look for alternative pathways in related work. 
in particular, when the project involves re-analysis of a published study, students could critically assess the rationale behind the article's data-analytic choices, or examine papers cited in the target article as well as previous publications from the same authors on the same topic. to facilitate this, the instructor could organize a (group) discussion about the paper in question and point out some potentially relevant or remarkable choices. students could (or should) also try to reproduce the original findings, if they haven't done so already as part of the process to select the target study (see above). that way, students familiarize themselves with the target study and its data, which might give them ideas for their eventual multiverse. throughout the project, strong guidance is needed. it is critical to inform students about the expectations regarding a multiverse analysis, and to tackle misconceptions. for one, the goal should not be to merely devise as many paths as possible. the key is that the alternatives are properly motivated: quality over quantity (del giudice & gangestad, 2021). furthermore, when multiple students use the same dataset, it is perfectly plausible to end up with different paths, and thus potentially with seemingly contradictory answers to the same research question. this does not mean that someone made a mistake; rather, it shows the ubiquity of arbitrary decisions. clear communication about these issues is important to avoid any confusion among students. providing feedback to students, particularly when it comes to the construction and implementation of the multiverse analysis, is also instrumental to make the project a success. some students may come up with poorly motivated alternative pathways, in which case the supervisor should steer them in the right direction or encourage them to carefully (re)consider the rationale for their choices. 
feedback could also take the form of a group discussion at a later stage of the project, to address the different pathways students came up with and compare their outcomes. though not strictly necessary, basic knowledge of r (i.e., a programming language primarily used for data analysis and visualization; r core team, 2016), or even r markdown (i.e., an environment to create dynamic, reproducible reports; allaire et al., 2016), can help students in running their analyses and reporting their results, yet there is quite a steep learning curve. multiverse analyses involve combining different options (e.g., different outlier criteria for different dependent variables that are transformed in various ways). especially when this amounts to many individual pathways, it will be more efficient to integrate them instead of performing each analysis separately, yet that does require some programming experience or training. example application this section describes an actual implementation of the multiverse-in-the-classroom approach in the context of an undergraduate research project (see table 1 for a summary of the syllabus). besides illustrating the viability of the approach, we hope that it can inspire instructors, course coordinators, and program directors who would consider including multiverse analyses in their research-methods classes.
table 1. summary of the syllabus for the undergraduate research project involving a multiverse analysis (timing, activity, and primary learning objective(s)):
week 1: general introduction (understand the topic of the thesis)
week 2: group discussion of target article (i.e., smith et al., 2019) (engage in critical thinking about the target article); class on ethics, data sharing, and reproducibility (understand the importance of data sharing and reproducibility)
week 3: group discussion of wicherts et al. (2011) and group discussion of hardwicke et al. (2018) (understand the importance of data sharing and reproducibility)
week 4: group discussion of simmons et al. (2011) (recognize researcher degrees of freedom and realize their impact); group discussion of steegen et al. (2016) (understand what a multiverse analysis entails, how to conduct one, and see how the results could be presented)
week 5: r intro (perform data processing, visualization, and plotting in r)
week 6: r markdown intro (write a reproducible and dynamic report)
weeks 7-17: conduct multiverse analysis and write thesis, including four opportunities for individual feedback (write a thesis incorporating relevant feedback)
of course, there are many alternative ways to implement the multiverse-in-the-classroom approach, taking into account aspects such as timing, group size, students' prior knowledge, learning objectives, and so on. the project took place in the 2020 spring semester with the first author as the instructor, and was inspired by a course jointly taught by both authors in previous years. it was embedded in a course called bachelorproject, which spans 17 weeks and is organized for students in the final year of their undergraduate psychology program. these students have already followed several statistics and methods courses, typically amounting to 30 european credits (ec). the bachelorproject represents a study load of 15 ec, during which students need to write an individual thesis describing the outcomes of a research project. the course is mandatory for all undergraduate psychology students, but they are divided into small groups, each with a different instructor and a different research topic (e.g., mental health in university students, people's interest in psychedelics, individual differences in the attentional bias towards emotion, ...). the multiverse-in-the-classroom approach described here was used in one such group, consisting of eight students. 
students ultimately had to write a thesis about their project following the typical introduction-method-results-discussion structure. the resulting products were evaluated on the same criteria as other research projects within the course by two independent graders (including the instructor). in addition, the instructor also graded the process as a whole. the project involved the re-analysis of an existing dataset, which was provided by the instructor. the selected target article was a study by smith et al. (2019), examining the influence of acute stress on semantic memory retrieval. smith et al. found that participants performed better on an open-ended trivia questionnaire after experiencing acute stress, and when they showed a stronger stress response. the study met all of the above criteria: reusable processed data were available in a sufficiently detailed format (the underlying raw data were, at the time, available upon request, and are now publicly available; see smith, 2020); the results were reproducible (except for one easily identifiable deviation); and the data processing and analysis steps afforded various alternative pathways. in a first meeting with the students (±1 hour), the general topic of the thesis was introduced by the instructor. this included a short description of the target study as well as a brief introduction to the concept of a multiverse analysis. in the next meeting (±2 hours), the target article was examined in detail through a journal club, in which the instructor led the discussion. students were expected to read the article in advance, and were encouraged to pay special attention to methodological and data-analytical choices. furthermore, any aspects of the paper that were unclear to the students were addressed during the meeting. from this point onwards, students were encouraged to start thinking about alternative analysis pathways, inspired by the group discussion, through searching for literature around the same topic, etc. 
the third meeting (±1.5 hours) consisted of an interactive lecture on data sharing (including ethical issues such as protecting the privacy of participants), reproducibility, and scientific integrity (including a discussion of questionable research practices). the idea was to introduce some concepts that are directly relevant for their thesis (e.g., reproducibility) as well as to give students a broad overview of meta-scientific topics. the next four meetings (±2 hours each) involved journal clubs around articles on, respectively, data sharing and reproducibility (i.e., hardwicke et al., 2018; wicherts et al., 2011), researcher degrees of freedom (i.e., simmons et al., 2011), and multiverse analysis (steegen et al., 2016). each time, two students led the discussion, but everyone was supposed to read the paper in advance and take part in the discussion. the instructor intervened sporadically if something was unclear or to point out relevant aspects. the purpose of these meetings was three-fold. first, they served to build a solid meta-scientific background, and to give students inspiration for their own multiverse analysis. second, writing the introduction section for a thesis about multiverse analyses can be challenging as it differs somewhat from that of a “regular” empirical study. hence, discussing a few key articles puts them on the right track. finally, these journal clubs were also meant to improve students’ presentation and discussion skills. the four final collective meetings (±2 hours each) served to introduce the students to r and r markdown. students were guided through a custom-made script showing how to read in data, transform and combine datasets, use conditional statements and loops, make graphs, and perform all the analyses that were used in the target paper. the script already used the data from the target paper to make sure that students understood what the variables meant. 
even though the script introduced all the procedures needed to reproduce the results of the target paper, they were illustrated using different variables. as a take-home exercise, students then tried to independently reproduce the key outcomes of the target paper using r, which they later embedded in an r markdown document. this guaranteed that all students were (eventually) able to follow the processing and analysis pathways outlined by smith et al. (2019). note that students were not required to write their thesis in r markdown, or even to use r for their eventual analyses. in the end, all eight students conducted their multiverse analysis in r, and two of them wrote their final paper using r markdown. from that point onwards, each student had four individual feedback meetings with the instructor in which their research proposal (i.e., rationale for the different pathways), analysis plan, code, results, and write-up were discussed. seventeen weeks after the start of the course, they were expected to submit their final thesis and accompanying analysis script. an exhaustive overview of all the alternatives students came up with would take us too far, but the following examples serve to illustrate the versatility of a multiverse approach to (under)graduate research projects. for instance, smith and colleagues considered responses to a trivia questionnaire as being correct if they completely matched the correct answer, were misspelled but easily extrapolated, were inappropriately pluralized or capitalized, were common synonyms of the correct answer, or if the first four or more letters matched the correct answer. however, students considered various reasonable alternatives to this coding scheme, such as treating incomplete responses as incorrect, regardless of how many letters matched the correct answer (boere, 2020; de jong, 2020; hoogeterp, 2020; kraaijenbrink, 2020; kuipers, 2020; van dijk, 2020; van rijn, 2020; van wijk, 2020). 
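to make this concrete, such alternative coding schemes can be expressed as a small scoring function. the sketch below is hypothetical (the function name, rules, and example responses are ours, not smith et al.'s or the students' code), but it illustrates how a lenient "first four or more letters match" rule and a strict exact-match rule can disagree about the same response:

```r
# hypothetical sketch of two alternative answer-coding rules for an
# open-ended trivia response; not taken from smith et al.'s materials.
score_response <- function(response, answer, rule = c("exact", "prefix4")) {
  rule <- match.arg(rule)
  r <- tolower(trimws(response))
  a <- tolower(trimws(answer))
  if (rule == "exact") {
    return(r == a)                      # strict: full match required
  }
  # lenient rule in the spirit of the original coding scheme:
  # count as correct if the first four or more letters match
  nchar(r) >= 4 && startsWith(a, substr(r, 1, 4))
}

score_response("einst", "einstein", rule = "prefix4")  # lenient: TRUE
score_response("einst", "einstein", rule = "exact")    # strict: FALSE
```

applying each rule to every participant and question yields different accuracy scores, and hence different branches of the multiverse.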
exploring this variation was only possible because students had access to the raw data (i.e., responses of each participant to each question), as the processed data only contained accuracy scores per participant based on the original coding scheme. furthermore, some students redefined the construct reactivity to stress. in the original paper, it was operationalized as the change in cortisol levels relative to a baseline, whereas students also considered the change in the psychological stress response measured through the state-trait inventory for cognitive and somatic anxiety, described in grös et al. (2007) (hoogeterp, 2020; van rijn, 2020). additionally, some students added covariates to the analyses (e.g., age; kraaijenbrink, 2020), or removed covariates (i.e., gender; hoogeterp, 2020; kuipers, 2020; van wijk, 2020). yet other pathways involved imputing missing values (kraaijenbrink, 2020), or removing observations (e.g., excluding participants who did not display an elevated cortisol level after stress-induction; boere, 2020; van dijk, 2020). although there was some overlap in data-analytic choices between students, each individual project featured unique pathways, which were based on existing literature (e.g., merz et al., 2016), statistical arguments, and/or a critical appraisal of the original study. the breadth of options is illustrated in figure 1, showing the distribution of p-values for smith and colleagues’ main finding resulting from each student’s multiverse analysis (see https://osf.io/rtayk/ for the underlying r code). on average, students’ multiverse analyses comprised 78 paths (minimum 18, maximum 160). this outcome highlights the feasibility and potential of undergraduate research projects incorporating multiverse analyses. we hasten to add that it does not serve as a way to evaluate the robustness of smith and colleagues’ main finding, because certain data-analytic choices explored by the students were insufficiently motivated. 
the work done by the students, however, offers an ideal starting point for a more thorough multiverse analysis of the finding (see heyman et al., 2022). benefits of the multiverse-in-the-classroom approach incorporating multiverse analyses in (under)graduate research projects (or other courses) has many benefits for students as well as (psychological) science in general. one strength of the multiverse-in-the-classroom approach is that it can be flexibly adapted to the course’s learning objectives, classroom size, time frame, background of the students, and so on. for instance, one can conduct a multiverse analysis reusing an existing dataset, as in the example described above, or one could use newly gathered data. because the latter option involves an additional step compared to a typical research project, it is well-suited for situations where something extra is required from students (e.g., students enrolled in an honours program), whereas the former option can be applied more broadly. importantly, as there is no need to design a new study, or to collect any data, the students’ overall time-investment is comparable to that of a regular research project. moreover, an adapted version of such a multiverse project can be used in a more statistics-oriented course rather than a research-methods-oriented course. both authors have used a similar approach as part of a 13-week graduate statistics course within a psychology research master track for a number of years. there, the ±40 students were instructed to write a report about the multiverse analysis they conducted in small groups using existing data. because these graduate students are well-versed in statistical analyses and programming, and due to the group nature of the project, it can easily fit in a 13-week course as compared to the 17-week undergraduate research project described above. 
as a multiverse project does not necessarily require collecting new data, one could effectively save a lot of resources (i.e., time of participants and students, money to pay participants, ...). therefore, it is ideal for situations where collecting new data is impractical or impossible, for instance, because special equipment or expertise is required, getting ethical approval takes too much time, or when one does not have access to a participant pool or money to pay participants. this proved to be especially relevant in the lockdown situation due to covid-19 in spring 2020. indeed, the lockdown measures, which involved suspending all in-vivo data collection and required classes to be taught online, had very little impact on the project discussed above, with all students meeting the original deadline. the flexibility also applies to the selection of a target study. each student could focus on a separate paper, or, as was the case in the example above, each student could independently construct their own data-analytic pathways for the same dataset. the latter option is comparable to the many-analysts-one-dataset approach used by silberzahn et al. (2018), augmented with the additional requirement that every analyst (i.e., student) should consider several plausible alternatives rather than a single one. we believe the many-multiverses-one-dataset option is the most interesting of the two, because any given multiverse will rarely (if ever) exhaust all reasonable options, hence it makes sense to adopt a form of data-analytic triangulation. in other words, there is a multiverse of multiverse analyses, which can be captured to some degree by asking different students to focus (semi-)independently on the same overarching topic. although it is unrealistic to expect that every individual project will be of the same quality, it can be enlightening to see the variability, or lack thereof, in outcomes. 
indeed, as figure 1 shows, it is possible that some multiverse analyses suggest the effect in question to be quite robust, whereas others suggest the effect to be rather fragile. although multiverse analyses bridge an important gap in psychological science by showing the robustness or fragility of findings, they are relatively rare, owing perhaps to their apparent complexity and/or their perceived lack of novelty. in that sense, one can draw a parallel to replication studies: once rare in psychology (makel et al., 2012), they are now becoming more mainstream through various initiatives (see zwaan et al., 2018). moreover, frank and colleagues (frank & saxe, 2012; hawkins et al., 2018) promoted conducting replication studies in student research projects (see also grahe et al., 2012; wagge et al., 2019). the current proposal seeks to accomplish a similar goal for multiverse analyses. note that both approaches can complement each other, in that one can conduct a multiverse analysis on replication data, either as part of the same project or across different iterations of the course (e.g., one group conducting the replication study, and another group performing multiverse analyses, possibly the following semester or academic year).
figure 1. distribution of p-values for smith and colleagues’ main finding resulting from each student’s multiverse analysis. [eight histograms, one per student, with p-value on the x-axis and frequency on the y-axis; the number of pathways per student was 160, 140, 48, 36, 78, 18, 110, and 36, respectively.] note. the red dotted line indicates a p-value of .05. the number in brackets indicates the number of pathways in each student’s multiverse. note that not all pathways were properly motivated, so these results should not be considered an evaluation of the robustness of smith and colleagues’ main finding.
another major benefit of adopting the multiverse-in-the-classroom approach, besides its flexibility, is that it gives students the opportunity to make a tangible contribution to psychological science, something that might not always occur with (under)graduate research projects. moreover, under some conditions, the work done by students and instructor(s) can be solidified in a joint research paper, suitable for publication, as was the case for the example application. the classroom phase then serves as an elicitation step of possible reasonable variations, which, in a second step, are evaluated for adoption in a multiverse analysis by a domain expert. such a two-step multiverse analysis, where data-analytic pathways are first elicited from different sources and then synthesized and applied to the data, can yield more comprehensive and less biased results compared to a regular multiverse analysis. the multiverse-in-the-classroom approach also provides ample pedagogical opportunities. conducting a multiverse analysis typically requires students to perform a number of different statistical analyses. it thereby addresses an often-heard complaint from psychology students regarding the relevance of statistics. even though most research projects involve the practical application of statistics, it rarely is a focal point (in some cases, the analysis part might actually be considered a nuisance). furthermore, a multiverse project may help students to better understand the intricacies of statistical analyses. 
importantly, it is not a purely methodological or statistical project, as it also involves an empirical research question such as “to what extent does power posing have an effect on hormone levels” or “what is the effect of stress on semantic memory”. hence, there is still the thrill of discovery, which helps fuel students’ engagement. at a more abstract level, a multiverse project also allows students to gain first-hand experience with the importance of open science, reproducibility, proper documentation of data, and so on. in addition, it teaches them to critically evaluate the rationale behind a study, especially its methodology, and it gives them an idea about the imperfections of psychological science. as future consumers of research, it is relevant to recognize that arbitrary decisions abound in research, and to realize their consequences. a multiverse-in-the-classroom project really drives this point home. moreover, for those students aspiring to become producers of research, it is paramount to adopt responsible research practices, such as assessing the robustness of key outcomes. in fact, students spontaneously mentioned these aspects in an informal evaluation of the course (e.g., “it has really changed my perspective on research. . . and sparked my interest” or “it was interesting to see what happens to the p-values when conducting different analyses”). one could argue that typical research projects, in which students are required to develop a new hypothesis, design a new study and collect data, teach them bad habits or even questionable research practices, as it is rather difficult to accomplish all this in a rigorous manner within the, usually limited, timeframe. challenges and objections a multiverse-in-the-classroom project can involve designing a new study, but that might not be feasible within the confines of a single semester, because developing and conducting such an analysis in itself is rather time-consuming. 
the option involving existing data is more readily applicable, yet one potential objection is that such a project does not cover the entire empirical cycle. although a multiverse project requires a thorough literature search, motivating a research question, and a comprehensive data-analysis of which the results ought to be interpreted and discussed, students may miss out on learning specific skills (e.g., regarding data collection). when the development of such skills is a central objective of the course, one might need to look for a creative solution. for instance, in the example application described above, the absence of a data collection phase was addressed by having students recode the participants’ responses to the trivia questionnaire. note though that one can raise similar concerns about more widely-applied projects such as those involving online data collection. in fact, there is often quite a bit of variability in what is demanded of students across projects within the same (under)graduate program. more fundamentally, accreditation guidelines for research projects in psychology often explicitly mention the possibility of conducting secondary data analyses (e.g., australian psychology accreditation council, 2019; the british psychological society, 2019). another challenge of conducting a multiverse analysis is that it requires combining various alternatives (e.g., three different outlier criteria and four different data transformations yield 12 outcomes). in principle, every analysis can be conducted separately, but this becomes unwieldy quite quickly, so one could use a script to increase efficiency. depending on the students’ background, the latter option might prove to be unattainable unless one includes some programming classes in the curriculum (e.g., teaching the language r). another potential hurdle for students (and instructors alike) revolves around the interpretation of a multiverse analysis. 
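the combinatorial bookkeeping described above can be scripted quite compactly in r. the following sketch is purely illustrative (the data are simulated and the options are generic stand-ins, not the pathways from the example application): it crosses three outlier criteria with four transformations, fits the same simple model along each pathway, and collects the resulting p-values.

```r
# minimal multiverse grid: 3 outlier criteria x 4 transformations = 12
# pathways. data and options are simulated stand-ins for illustration.
set.seed(1)
d <- data.frame(x = rnorm(100), y = rnorm(100))

outlier_rules <- list(
  none = function(d) d,
  sd2  = function(d) d[abs(d$y - mean(d$y)) < 2 * sd(d$y), ],
  sd3  = function(d) d[abs(d$y - mean(d$y)) < 3 * sd(d$y), ]
)
transforms <- list(
  identity = function(y) y,
  log1p    = function(y) log1p(y - min(y)),
  sqrt     = function(y) sqrt(y - min(y)),
  rank     = function(y) rank(y)
)

# every combination of choices is one pathway through the multiverse
grid <- expand.grid(outlier   = names(outlier_rules),
                    transform = names(transforms),
                    stringsAsFactors = FALSE)

grid$p <- mapply(function(o, tr) {
  dd <- outlier_rules[[o]](d)
  dd$y <- transforms[[tr]](dd$y)
  summary(lm(y ~ x, data = dd))$coefficients["x", "Pr(>|t|)"]
}, grid$outlier, grid$transform)

nrow(grid)  # 12 pathways, one p-value per row of the grid
```

a summary such as hist(grid$p) then shows the distribution of p-values across all pathways at once, rather than requiring twelve separate analyses by hand.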
in contrast to a typical research project, one does not end up with a single outcome, but with a collection of outcomes. this elicits questions such as when a finding should be considered robust, when it is presumably a fluke, and how the results should be summarized and presented. indeed, published papers involving multiverse analyses typically eyeball the pattern of results, for instance, by plotting the distribution of p-values. steegen et al. (2016) tentatively suggest focusing inference on the average p-value, but beyond that, there is little guidance as to how to synthesize a multiverse analysis (but see simonsohn et al., 2020). a more fundamental objection could be that approaches such as pre-registration are more desirable, so that students should spend their time learning about pre-registration rather than about multiverse analyses. pre-registration entails that one specifies the analysis plan before knowing the results, if possible even before starting the data collection (nosek et al., 2018). as such, pre-registration makes transparent which choices could be data-driven and which are not. however, if a researcher pre-registers one or a few analytic pathways, one is still left in the dark about how robust or fragile the effect is, or about whether certain choices are more critical than others (for a similar argument see steegen et al., 2016). to that end, one would need to conduct a multiverse-style analysis. of course, one could pre-register a multiverse analysis to combine the strengths of both approaches, but this increases the complexity of the project. finally, one should be cautious that students do not completely lose faith in (psychological) science. indeed, whereas the goal is to make students critical consumers of scientific output, and, as a result, careful producers of scientific output, they should not come away with the idea that science is inherently flawed or that all researchers are opportunistic or fraudulent. 
Along the same lines, students should be made aware that not all hypotheses can necessarily be tested in a myriad of ways. Based on the informal evaluation mentioned above, students did not come away with such incorrect notions, but future research on the effectiveness of the multiverse-in-the-classroom approach should determine whether this is indeed the case.

Conclusion

The present paper proposes implementing multiverse analyses in student research projects, and provides a practical demonstration that we hope will encourage, help, and inspire instructors to adopt the approach in their own courses. Because multiverse analyses speak to the robustness of a (published) finding, they can fulfill an important need in psychological science, thus making the results of such projects truly relevant. Furthermore, the approach is an excellent way to put statistics into practice, fosters critical thinking, and raises awareness about the prevalence and consequences of arbitrary data-analytic decisions. Finally, the flexibility of the multiverse-in-the-classroom approach makes it suitable for all kinds of projects, even when data collection is not feasible.

Author Contact

Correspondence concerning this article should be addressed to Tom Heyman, Methodology and Statistics Unit, Institute of Psychology, Leiden University, Wassenaarseweg 52, 2333 AK Leiden, The Netherlands. E-mail: t.d.p.heyman@fsw.leidenuniv.nl. ORCID: TH 0000-0003-0565-441X; WV 0000-0002-5855-3885.

Conflict of Interest and Funding

The authors declare that there were no conflicts of interest or specific funding with respect to the authorship or the publication of this article.

Author Contributions

Both authors conceptualized the idea. TH wrote the first draft of the manuscript and WV provided extensive feedback. Both authors approved the final version for submission.

Open Science Practices

This article earned the Open Materials badge for making the materials openly available.
It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Allaire, J., Cheng, J., Xie, Y., McPherson, J., Chang, W., Allen, J., Wickham, H., Atkins, A., & Hyndman, R. (2016). rmarkdown: Dynamic documents for R [R package version 1.6]. https://cran.r-project.org/package=rmarkdown
Artner, R., Verliefde, T., Steegen, S., Gomes, S., Traets, F., Tuerlinckx, F., & Vanpaemel, W. (2021). The reproducibility of statistical results in psychological research: An investigation using unpublished raw data. Psychological Methods, 26(5), 527–546. https://doi.org/10.1037/met0000365
Australian Psychology Accreditation Council. (2019). Accreditation standards for psychology programs evidence guide (Version 1.2). https://psychologycouncil.org.au/wp-content/uploads/2021/03/apac-evidence-guide_v1.2.pdf
Bishop, D. (2016). Open research practices: Unintended consequences and suggestions for averting them (Commentary on the Peer Reviewers' Openness Initiative). Royal Society Open Science, 3(4), 160109. https://doi.org/10.1098/rsos.160109
Boere, R. (2020). Het belang van reproduceerbare en transparante wetenschap: Een multiverse benadering [The importance of reproducible and transparent science: A multiverse approach] [Unpublished bachelor's thesis]. Leiden University.
Carney, D. R., Cuddy, A. J., & Yap, A. J. (2010). Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science, 21(10), 1363–1368. https://doi.org/10.1177/0956797610383437
Credé, M., & Phillips, L. A. (2017). Revisiting the power pose effect: How robust are the results reported by Carney, Cuddy, and Yap (2010) to data analytic decisions? Social Psychological and Personality Science, 8(5), 493–499. https://doi.org/10.1177/1948550617714584
De Jong, S. (2020). Het effect van stress op het semantisch geheugen: Een multiverse benadering [The effect of stress on semantic memory: A multiverse approach] [Unpublished bachelor's thesis]. Leiden University.
Del Giudice, M., & Gangestad, S. W. (2021). A traveler's guide to the multiverse: Promises, pitfalls, and a framework for the evaluation of analytic decisions. Advances in Methods and Practices in Psychological Science, 4(1), 1–15. https://doi.org/10.1177/2515245920954925
Dragicevic, P., Jansen, Y., Sarma, A., Kay, M., & Chevalier, F. (2019). Increasing the transparency of research papers with explorable multiverse analyses. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–15.
Elson, M. (2016). Flexibility in methods & measures of social science. https://www.flexiblemeasures.com/
Frank, M. C., & Saxe, R. (2012). Teaching replication. Perspectives on Psychological Science, 7(6), 600–604. https://doi.org/10.1177/1745691612460686
Gelman, A., & Loken, E. (2014). The statistical crisis in science. American Scientist, 102(6), 460–465.
Grahe, J. E., Reifman, A., Hermann, A. D., Walker, M., Oleson, K. C., Nario-Redmond, M., & Wiebe, R. P. (2012). Harnessing the undiscovered resource of student research projects. Perspectives on Psychological Science, 7(6), 605–607. https://doi.org/10.1177/1745691612459057
Grös, D. F., Antony, M. M., Simms, L. J., & McCabe, R. E. (2007). Psychometric properties of the State-Trait Inventory for Cognitive and Somatic Anxiety (STICSA): Comparison to the State-Trait Anxiety Inventory (STAI). Psychological Assessment, 19(4), 369–381. https://doi.org/10.1037/1040-3590.19.4.369
Harder, J. A. (2020). The multiverse of methods: Extending the multiverse analysis to address data-collection decisions. Perspectives on Psychological Science, 15(5), 1158–1177. https://doi.org/10.1177/1745691620917678
Hardwicke, T. E., Mathur, M. B., MacDonald, K., Nilsonne, G., Banks, G. C., Kidwell, M. C., Hofelich Mohr, A., Clayton, E., Yoon, E. J., Henry Tessler, M., Lenne, R. L., Altman, S., Long, B., & Frank, M. C. (2018). Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition. Royal Society Open Science, 5(8), 180448. https://doi.org/10.1098/rsos.180448
Hawkins, R. X., Smith, E. N., Au, C., Arias, J. M., Catapano, R., Hermann, E., Keil, M., Lampinen, A., Raposo, S., Reynolds, J., Salehi, S., Salloum, J., Tan, J., & Frank, M. C. (2018). Improving the replicability of psychological science through pedagogy. Advances in Methods and Practices in Psychological Science, 1(1), 7–18. https://doi.org/10.1177/2515245917740427
Heyman, T., Boere, R., De Jong, S., Hoogeterp, L., Kraaijenbrink, J., Kuipers, C., Van Dijk, M., Van Rijn, L., & Van Wijk, T. (2022). The effect of stress on semantic memory retrieval: A multiverse analysis. Collabra: Psychology, 8(1), 35745. https://doi.org/10.1525/collabra.35745
Hoogeterp, L. (2020). Het effect van stress op het semantisch geheugen: Een multiverse benadering [The effect of stress on semantic memory: A multiverse approach] [Unpublished bachelor's thesis]. Leiden University.
Kalokerinos, E. K., Erbas, Y., Ceulemans, E., & Kuppens, P. (2019). Differentiate to regulate: Low negative emotion differentiation is associated with ineffective use but not selection of emotion-regulation strategies. Psychological Science, 30(6), 863–879. https://doi.org/10.1177/0956797619838763
Kidwell, M. C., Lazarević, L. B., Baranski, E., Hardwicke, T. E., Piechowski, S., Falkenberg, L.-S., Kennett, C., Slowik, A., Sonnleitner, C., Hess-Holden, C., Errington, T. M., Fiedler, S., & Nosek, B. A. (2016). Badges to acknowledge open practices: A simple, low-cost, effective method for increasing transparency. PLOS Biology, 14(5), e1002456. https://doi.org/10.1371/journal.pbio.1002456
Kierniesky, N. C. (2005). Undergraduate research in small psychology departments: Two decades later. Teaching of Psychology, 32(2), 84–90. https://doi.org/10.1207/s15328023top3202_1
Kraaijenbrink, J. (2020). The effect of stress on the semantic memory: A multiverse approach [Unpublished bachelor's thesis]. Leiden University.
Kuipers, C. (2020). The effect of stress on the semantic memory: A multiverse approach [Unpublished bachelor's thesis]. Leiden University.
LeBel, E. P., McCarthy, R. J., Earp, B. D., Elson, M., & Vanpaemel, W. (2018). A unified framework to quantify the credibility of scientific findings. Advances in Methods and Practices in Psychological Science, 1(3), 389–402. https://doi.org/10.1177/2515245918787489
Liu, Y., Kale, A., Althoff, T., & Heer, J. (2020). Boba: Authoring and visualizing multiverse analyses. IEEE Transactions on Visualization and Computer Graphics, 27(2), 1753–1763. https://doi.org/10.1109/tvcg.2020.3028985
Makel, M. C., Plucker, J. A., & Hegarty, B. (2012). Replications in psychology research: How often do they really occur? Perspectives on Psychological Science, 7(6), 537–542. https://doi.org/10.1177/1745691612460688
Masur, P., & Scharkow, M. (2019). specr: Statistical functions for conducting specification curve analyses. https://github.com/masurp/specr
Merz, C. J., Dietsch, F., & Schneider, M. (2016). The impact of psychosocial stress on conceptual knowledge retrieval. Neurobiology of Learning and Memory, 134, 392–399. https://doi.org/10.1016/j.nlm.2016.08.020
Moors, P., & Hesselmann, G. (2019). Unconscious arithmetic: Assessing the robustness of the results reported by Karpinski, Briggs, and Yale (2018). Consciousness and Cognition, 68, 97–106. https://doi.org/10.1016/j.concog.2019.01.003
Morey, R. D., Chambers, C. D., Etchells, P. J., Harris, C. R., Hoekstra, R., Lakens, D., Lewandowsky, S., Morey, C. C., Newman, D. P., Schönbrodt, F. D., Vanpaemel, W., Wagenmakers, E.-J., & Zwaan, R. A. (2016). The Peer Reviewers' Openness Initiative: Incentivizing open research practices through peer review. Royal Society Open Science, 3(1), 150547. https://doi.org/10.1098/rsos.150547
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114
Patel, C. J., Burford, B., & Ioannidis, J. P. (2015). Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations. Journal of Clinical Epidemiology, 68(9), 1046–1058. https://doi.org/10.1016/j.jclinepi.2015.05.029
R Core Team. (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.r-project.org/
Sarma, A., & Kay, M. (2019). multiverse: Explorable multiverse data analysis and reports in R [R package version 0.1.4]. https://cran.r-project.org/package=multiverse
Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E., Bahnik, S., Bai, F., Bannard, C., Bonnier, E., Carlsson, R., Cheung, F., Christensen, G., Clay, R., Craig, M. A., Dalla Rosa, A., Dam, L., Evans, M. H., Flores Cervantes, I., Nosek, B. A., et al. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3), 337–356. https://doi.org/10.1177/2515245917747646
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2020). Specification curve analysis. Nature Human Behaviour, 4, 1208–1214. https://doi.org/10.1038/s41562-020-0912-z
Smith, A. M. (2020). Acute stress enhances general-knowledge semantic memory. https://doi.org/10.17605/osf.io/eq8sy
Smith, A. M., Hughes, G. I., Davis, F. C., & Thomas, A. K. (2019). Acute stress enhances general-knowledge semantic memory. Hormones and Behavior, 109, 38–43. https://doi.org/10.1016/j.yhbeh.2019.02.003
Soderberg, C. K. (2018). Using OSF to share data: A step-by-step guide. Advances in Methods and Practices in Psychological Science, 1(1), 115–120. https://doi.org/10.1177/2515245918757689
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712. https://doi.org/10.1177/1745691616658637
The British Psychological Society. (2019). Standards for the accreditation of undergraduate, conversion and integrated masters programmes in psychology. https://www.psychologicalsociety.ie/source/undergraduate%5c%20accreditation%5c%20guidelines%5c%202019update_file_674.pdf
Van Dijk, M. (2020). Acute stress enhances semantic memory: The robustness of the findings of Smith, Hughes, Davis, and Thomas (2019) [Unpublished bachelor's thesis]. Leiden University.
Van Rijn, L. (2020). The effect of stress on semantic memory: A multiverse approach [Unpublished bachelor's thesis]. Leiden University.
Van Wijk, T. (2020). Het effect van stress op semantisch geheugen: Een multiverse benadering [The effect of stress on semantic memory: A multiverse approach] [Unpublished bachelor's thesis]. Leiden University.
Vanpaemel, W., Vermorgen, M., Deriemaecker, L., & Storms, G. (2015). Are we wasting a good crisis? The availability of psychological research data after the storm. Collabra, 1(1), 3. https://doi.org/10.1525/collabra.13
Voracek, M., Kossmeier, M., & Tran, U. S. (2019). Which data to meta-analyze, and how? A specification-curve and multiverse-analysis approach to meta-analysis. Zeitschrift für Psychologie, 227(1), 64–82. https://doi.org/10.1027/2151-2604/a000357
Wagge, J. R., Baciu, C., Banas, K., Nadler, J. T., Schwarz, S., Weisberg, Y., IJzerman, H., Legate, N., & Grahe, J. (2019). A demonstration of the Collaborative Replication and Education Project: Replication attempts of the red-romance effect. Collabra: Psychology, 5(1), 5. https://doi.org/10.1525/collabra.177
Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLOS ONE, 6(11), e26828. https://doi.org/10.1371/journal.pone.0026828
Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61(7), 726–728. https://doi.org/10.1037/0003-066x.61.7.726
Wicherts, J. M., Veldkamp, C. L., Augusteijn, H. E., Bakker, M., Van Aert, R., & Van Assen, M. A. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, 1832. https://doi.org/10.3389/fpsyg.2016.01832
Young, C., & Holsteen, K. (2017). Model uncertainty and robustness: A computational framework for multimodel analysis. Sociological Methods & Research, 46(1), 3–40. https://doi.org/10.1177/0049124115610347
Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2018). Making replication mainstream. Behavioral and Brain Sciences, 41, e120. https://doi.org/10.1017/s0140525x17001972

Meta-Psychology, 2021, Vol 5, MP.2019.1916, https://doi.org/10.15626/mp.2019.1916. Article type: Original Article. Published under the CC-BY 4.0 license. Open data: N/A. Open materials: N/A. Open and reproducible analysis: Yes. Open reviews and editorial process: Yes. Preregistration: No. Edited by: Moritz Heene. Reviewed by: J. McGrane, A. Kyngdon. Analysis reproduced by: André Kalmendal. All supplementary files can be accessed at the OSF project page: https://doi.org/10.17605/osf.io/gtn9c

Levels of Measurement and Statistical Analyses

Matt N. Williams
School of Psychology, Massey University

Abstract

Most researchers and students in psychology learn of S. S. Stevens' scales or "levels" of measurement (nominal, ordinal, interval, and ratio), and of his rules setting out which statistical analyses are admissible with each measurement level. Many are nevertheless left confused about the basis of these rules, and whether they should be rigidly followed. In this article, I attempt to provide an accessible explanation of the measurement-theoretic concerns that led Stevens to argue that certain types of analyses are inappropriate with data of particular levels of measurement. I explain how these measurement-theoretic concerns are distinct from the statistical assumptions underlying data analyses, which rarely include assumptions about levels of measurement.
The level of measurement of observations can nevertheless have important implications for statistical assumptions. I conclude that researchers may find it more useful to critically investigate the plausibility of the statistical assumptions underlying analyses than to limit themselves to the set of analyses that Stevens believed to be admissible with data of a given level of measurement.

Keywords: levels of measurement, measurement theory, ordinal, statistical analysis

Introduction

Most students and researchers in psychology learn of a division of measurement into four scales: nominal, ordinal, interval, and ratio. This taxonomy was created by the psychophysicist S. S. Stevens (1946). Stevens wrote his article in response to a long-running debate within a committee of the British Association for the Advancement of Science, which had been formed in order to consider the question of whether it is possible to measure "sensory events" (i.e., sensations and other psychological attributes; see Ferguson et al., 1940). The committee was partly made up of physical scientists, many of whom believed that the numeric recordings taking place in psychology (specifically psychophysics) did not constitute measurement as the term is usually understood in the natural sciences (i.e., as the estimation of the ratio of a magnitude of an attribute to some unit of measurement; see Michell, 1999). Stevens attempted to resolve this debate by suggesting that it is best to define measurement very broadly as "the assignment of numerals to objects or events according to rules" (Stevens, 1946, p. 677), but then divide measurements into four different "scales". These are now often referred to as "levels" of measurement, and that is the terminology I will predominantly use in this paper, as the term "scales" has other competing usages within psychometrics [1].

[1] For example, the term "scale" is often used to refer to specific psychological tests or measuring devices (e.g., the "Hospital Anxiety and Depression Scale"; Zigmond & Snaith, 1983). It is also often used to refer to formats for collecting responses (e.g., "a four-point rating scale"). These contemporary usages are quite different from Stevens' "scales of measurement", and the term "levels of measurement" is therefore somewhat less ambiguous.

According to Stevens' definition of measurement, virtually any research discipline can claim to achieve measurement, although not all may achieve interval or ratio measurement. He went on to argue that the level with which an attribute has been measured determines which statistical analyses are permissible (or "admissible") with the resulting data.

Stevens' definition and taxonomy of measurement has been extremely influential. Although I have not conducted a rigorous evaluation, it appears to be covered in the vast majority of research methods textbooks aimed at students in the social sciences (e.g., Cozby & Bates, 2015; Heiman, 2001; Judd et al., 1991; McBurney, 1994; Neuman, 2000; Price, 2012; Ray, 2000; Sullivan, 2001). Stevens' taxonomy is also often used as the basis for heuristics indicating which statistical analyses should be used in particular scenarios (see for example Cozby & Bates, 2015). However, the fame and influence of Stevens' taxonomy is something of an anomaly in that it forms part of an area of inquiry (measurement theory) which is rarely covered in introductory texts on research methods [2].

Measurement theories are theories directed at foundational questions about the nature of measurement. For example, what does it mean to "measure" something? What kinds of attributes can and cannot be measured? Under what conditions can numbers be used to express relations amongst objects? Measurement theory can arguably be regarded as a branch of philosophy (see Tal, 2017), albeit one that has heavily mathematical features.
The measurement theory literature contains several excellent resources pertaining to the topic of admissibility of statistical analyses (e.g., Hand, 1996; Luce et al., 1990; Michell, 1986; Suppes & Zinnes, 1962), but this literature is often written for an audience of readers who have a reasonably strong mathematical background, and can be quite dense and challenging. This means that, while many students and researchers are exposed to Stevens' rules about admissible statistics, few are likely to understand the basis of these rules. This can often lead to significant confusion about important applied questions: for example, is it acceptable to compute "parametric" statistics using observations collected with a Likert scale?

[2] By way of example, the well-known textbook "Psychological Testing: Principles, Applications, & Issues" by Kaplan and Saccuzzo (2018) covers Stevens' taxonomy and rules about permissible statistics (albeit without attribution to Stevens), but does not mention any of the measurement theories discussed in this section (operationalism, representationalism, and the classical theory of measurement).

In this article, therefore, I attempt to provide an accessible description of the rationale for Stevens' rules about admissible statistics. I also describe some major objections to Stevens' rules, and explain how Stevens' measurement-theoretic concerns are different from the statistical assumptions underlying statistical analyses, although there exist important connections between the two. I close with conclusions and recommendations for practice.

Stevens' Taxonomy of Measurement

Stevens' definition and taxonomy of measurement was inspired by two theories of measurement: representationalism, especially as expressed by Campbell (1920), and operationalism, especially as expressed by Bridgman (1927) [3].

[3] See McGrane (2015) for a discussion of these competing influences on Stevens' definition of measurement.

Operationalism (see Bridgman, 1927; Chang, 2009) holds that an attribute is fully synonymous with the operations used to measure it: that if I say I have measured depression using score on the Beck Depression Inventory (BDI), then when I speak of a participant's level of depression, I mean nothing more or less than the score the participant received on the BDI. Stevens' definition of measurement, as "the assignment of numerals to objects or events according to rules" (Stevens, 1946, p. 677), is based on operationalism.

In contrast to operationalism, representationalism argues that measurement starts with a set of observable empirical relations amongst objects. The objects of measurement could literally be inanimate objects (e.g., rocks), but they could also be people (e.g., participants in a research study). To a representationalist, measurement consists of transferring the knowledge obtained about the empirical relations amongst objects (e.g., that granite is harder than sandstone) into numbers which encode the information obtained about these empirical relations (see Krantz et al., 1971; Michell, 2007). Stevens suggested that levels of measurement are distinguished by whether we have "empirical operations" (p. 677) for determining relations (equality, rank-ordering, equality of differences, and equality of ratios). This is an idea that appears to have been influenced by representationalism (representationalism being a theory that concerns the use of numbers to represent information about empirical relations).
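The representationalist idea that numbers encode observed empirical relations can be made concrete with a toy sketch (Python; the hardness ordering of rocks echoes the example in the text, while the specific scores are arbitrary and merely illustrative):

```python
# An observed empirical relation: scratch tests show granite is harder
# than sandstone, and sandstone is harder than chalk.
harder_than = [("granite", "sandstone"), ("sandstone", "chalk"), ("granite", "chalk")]

# A numerical assignment represents hardness, in the representational
# sense, if it preserves every observed relation. The particular
# numbers are arbitrary; only the order they encode matters.
scores = {"granite": 7.0, "sandstone": 4.2, "chalk": 1.0}
assert all(scores[a] > scores[b] for a, b in harder_than)

# A different assignment with the same order is an equally valid representation.
rescores = {"granite": 100, "sandstone": 3, "chalk": -5}
assert all(rescores[a] > rescores[b] for a, b in harder_than)
```

The assertions check exactly the representational requirement: that the numeric relation ">" mirrors the empirical relation "harder than" for every observed pair.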
Nevertheless, an influence of operationalism is apparent here also: to Stevens, the level of measurement of a set of observations depended on whether an empirical operation for determining equality, rank-ordering, equality of differences, and/or equality of ratios was applied (regardless of the degree to which the empirical operation produced valid determinations of these empirical relations).

While Stevens' definition and taxonomy of measurement incorporates both representational and operationalist influences, there is a third theory of measurement that he did not incorporate: the classical theory of measurement. This theory has been the implicit theory of measurement in the physical sciences since classical antiquity (Michell, 1999). The classical theory of measurement states that to measure an attribute is to estimate the ratio of the magnitude of an attribute to a unit of the same attribute. For example, to say that we have measured a person's height as 185 cm means that we have estimated that the person's height is 185 times that of one centimetre (the unit). The classical theory of measurement suggests that only some attributes, namely quantitative attributes, have a structure such that their magnitudes stand in ratios to one another. A set of axioms demonstrating what conditions need to be met for an attribute to be quantitative was determined by the German mathematician Otto Hölder (1901; for an English translation see Michell & Ernst, 1996). Because the focus of this article is on Stevens' arguments, I will not cover the classical theory of measurement further here, but excellent introductions can be found in Michell (1986, 1999, 2012). It may suffice at this point to note that from a classical perspective, Stevens' "nominal" and "ordinal" levels do not constitute measurement at all.

Stevens defined his four scales or levels of measurement as follows.
Nominal

According to Stevens (1946), nominal measurement is produced when we have an empirical operation that allows us to determine that some objects are equivalent with respect to some attribute, while other objects are noticeably different. For example, imagine we have a group of university students, and via the empirical operation of looking up their academic records we can determine that some of the students are psychology majors while some are business majors. If we wished to compare the psychology students and the business students with respect to some other attribute, we might record information about the participants and their majors in a dataset. For the sake of convenience, we might do so by recording their majors numerically. Specifically, we might enter a 0 in a "major" column in the dataset for each of the psychology students, and a 1 for each of the business students. In doing so, according to Stevens, we would have accomplished nominal measurement.

Importantly, there are many coding rules that would work just as effectively at conveying the information we have about participants' majors. For example, we could just as well use 1 to indicate a psychology student and 0 to indicate a business student, or -10 to indicate a psychology student and 437.3745 to indicate a business student. As long as students' majors are recorded by assigning the psychology students one fixed number and the business students another, any two numbers would work just as well at conveying what we have observed about the students.

Ordinal

Ordinal measurement is produced when we have an empirical operation that allows us to determine that some objects are greater or lesser than others with respect to some attribute. Imagine, for example, that we are interested in the attribute satisfaction with life.
we perform the empirical operation of asking three participants (hakim, jeff, and sarah) to respond to the question “in general, how satisfied are you with your life?” (see cheung & lucas, 2014, p. 2811), with response options of very dissatisfied, moderately dissatisfied, moderately satisfied, and very satisfied. we discover that hakim indicates that he is very satisfied with life, while jeff is moderately satisfied, and sarah is moderately dissatisfied. we have thus performed an empirical operation that allows us to determine whether each of these participants has more or less life satisfaction than another. if we are to record this information numerically, there are many possible coding rules that could conwilliams vey the information collected about the life satisfaction of our participants, but there is a restriction: the number assigned to hakim must be higher than the number assigned to jeff, which must be higher again than that assigned to sarah. any such coding rule will record what we have observed: that hakim has the highest level of life satisfaction, followed by jeff, followed by sarah. so, we could assign hakim a life satisfaction score of 3, jeff a score of 2, and sarah a score of 1. or we could also assign hakim a score of 1234, jeff a score of 6, and sarah a score of 0.45. either of these two coding rules records the observed ordering hakim > jeff > sarah, and from stevens’ perspective each would be just as adequate as the other. however, we could not assign hakim a score of 3, jeff a score of 1, and sarah a score of 2; this would imply that sarah has higher life satisfaction than jeff, which conflicts with the empirical information we have collected. formally, any coding system within the class of monotonic transformations will equivalently convey the information that we have about the participants. 
in other words, if we have assigned numeric scores to the participants such that hakim > jeff > sarah, then we can transform those numeric scores in any way provided that the order of the scores (hakim > jeff > sarah) remains the same.

interval

interval measurement is produced when, in addition to having an empirical operation that allows us to observe that some objects are greater or less than others with respect to some attribute, we have an empirical operation that allows us to determine whether the difference between one pair of objects is greater than, less than, or the same as the difference between another pair of objects. the classic example of an interval scale is temperature when measured via a mercury thermometer (i.e., a narrow glass tube containing mercury, with a bulb at the bottom, held upright). if we place the thermometer inside a fridge, we can see that the mercury level will be lower than if we placed the thermometer in a living room. it will be lower again if we place the thermometer in a freezer. if we are willing to assume that mercury expands with increasing temperature, this empirical observation allows us to determine that, with respect to temperature, living room > fridge > freezer. this observation alone would be a purely ordinal one. however, we can also use a ruler to measure the highest point reached by the mercury in each location. by this method, we can determine whether the distance the mercury expands by when moved from the freezer to the fridge is more, the same, or less than the distance it expands by when moved from the fridge to the living room. if we are willing to assume that the relationship between temperature and the height of the mercury in the thermometer is linear within the range of temperatures observed [4], then we can also empirically compare differences between observations. for example, we might attach a ruler to our tube of mercury, and observe that the difference in the height of the mercury between the living room and the fridge is 5 mm, while the difference between the height of the mercury in the fridge and in the freezer is 10 mm. given our assumption of linear expansion of mercury with temperature, this implies that the difference in temperature between the freezer and the fridge is twice [5] the difference in temperature between the fridge and the living room. because we have an empirical operation that allows us to compare differences in temperature, we have achieved interval measurement. the information we have collected about the temperature of the fridge, the freezer, and the living room can be recorded via a variety of coding rules, but there is now an additional restriction: not only must the coding rule preserve the observed ordering living room > fridge > freezer, but the difference between the number we assign to the fridge and the one we assign to the freezer must be twice the difference between the number we assign to the living room and the fridge. we could record the freezer as having a temperature of 0, the fridge a temperature of 2, and the living room a temperature of 3. or we could record the freezer as having a temperature of 5, the fridge a temperature of 15, and the living room a temperature of 20.

[4] the fact that this assumption is necessary points to the important role theory can have in measurement; for a sophisticated discussion in the context of thermometers and the measurement of temperature, see sherry (2011).

[5] in fact, it is also possible to produce interval measurement based only on observations about order and equality of differences, along with some other conditions; see suppes and zinnes' (1962) description of infinite difference systems. for the sake of simplicity and brevity i have focused here on the simpler scenario of observations about ratios of differences.
but we should not record the freezer as having a temperature of 0, the fridge a temperature of 1, and the living room a temperature of 3; this would imply that the difference in temperature between the living room and the fridge is greater than that between the fridge and the freezer. more formally, if we have a coding rule that records the information we have collected about these temperatures, we can apply any linear transformation to it (e.g., by multiplying the existing values by some number and/or adding a constant) while still adequately representing the information we have collected. it is worth emphasising here that it is the fact we can empirically compare differences between temperatures that implies that we have achieved interval measurement. the argument has sometimes been made (e.g., carifio & perla, 2008) that while the responses to a rating scale item (as in the earlier example for life satisfaction) are ordinal in nature, a score created by summing the responses to multiple items is interval. this argument confuses the issue of level of measurement with that of the distribution of a variable. the act of summing ordinal observations may increase the degree to which the distribution of scores approximates a normal distribution, but it does not transform these observations from ordinal to interval (because it does not provide an operation for determining equality of differences).

ratio

imagine, now, that we wish to compare the lengths of two objects: a pen and a rolling pin. by placing these objects side-by-side we can quickly establish that the rolling pin is the longer. let us assume that we have several of the same model of pen, each of identical length. now imagine we lay three of these pens end to end with one another and observe that the rolling pin appears to be equal in length to the three pens [6]. in other words, we have observed that the ratio of the length of the rolling pin to that of the pen is 3. therefore, we have achieved ratio measurement. once again, these observations can be recorded numerically, but our choice of coding rule is now very restricted: whatever number we assign as the length of the rolling pin must be three times the number assigned to the pen. so we might record the pen as having a length of 1 and the rolling pin a length of 3, or the pen a length of 0.5 and the rolling pin a length of 1.5, but we could not assign the pen a length of 1 and the rolling pin a length of 2. more formally, if we have applied a coding rule that records the information we have collected about the ratios of the lengths of the objects, then the only transformation we can apply to the numeric values is multiplying them by some constant.

[6] for the sake of simplicity, the situation i describe here is one where the length of one of the objects is exactly divisible by the length of the other. in reality, it might be the case that the rolling pin is slightly longer than three pens, such that i can conclude only that the ratio of the length of the rolling pin to the pen falls in the interval [3, 4]. but we could obtain a more precise estimate of the length of the rolling pin if we had a "standard sequence" of many replicates of the same short length. a ruler measured in millimetres, for example, just represents a sequence of identical one-millimetre lengths laid end to end. for a more rigorous treatment of this topic, see krantz et al. (1971).

stevens' "admissible statistics"

the connection stevens drew between levels of measurement and statistical analysis was this: given observations pertaining to a set of objects, there are a variety of coding rules that one could use to encode the information held about the empirical relations amongst objects. furthermore, if we consider the four levels of measurement on a hierarchy from ratio at the top to nominal at the bottom, the lower levels of measurement offer a much more diverse range of coding rules from which one can arbitrarily select. statistical analyses, however, may produce different results depending on which coding rule is used, so we should only use statistical analyses that produce invariant results across the class of coding rules that are permissible for the data we have collected. by way of example, let's return to our measurements of life satisfaction from three participants: hakim (who was very satisfied with his life), jeff (moderately satisfied), and sarah (moderately dissatisfied). imagine now that we recruit a fourth participant, ming, who transpires to be very dissatisfied with life. we wish to use these four participants to test the hypothesis that owning a pet is associated with increased life satisfaction. we ask our participants whether they each own a pet; it turns out that hakim and jeff do, while sarah and ming do not. we can now proceed to compare the life satisfaction of these two groups of participants. recall, though, that we have only ordinal observations of life satisfaction, and can apply any coding rule to our observations that preserves the ordering hakim > jeff > sarah > ming. the outcomes of two such coding rules are displayed in table 1.

table 1
example life satisfaction data

                                      life satisfaction
participant   owns pet?   qualitative response       coding rule one   coding rule two
hakim         yes         very satisfied             4                 1002
jeff          yes         moderately satisfied       3                 1000
mean (sd)                                            3.5 (0.71)        1001 (1.41)
sarah         no          moderately dissatisfied    2                 1
ming          no          very dissatisfied          1                 0
mean (sd)                                            1.5 (0.71)        0.5 (0.71)

note. coding rule one: very dissatisfied = 1, moderately dissatisfied = 2, moderately satisfied = 3, very satisfied = 4.
coding rule two: very dissatisfied = 0, moderately dissatisfied = 1, moderately satisfied = 1000, very satisfied = 1002.

if we applied a student's t test to compare the mean life satisfaction ratings of the two groups (pet owners, non-pet owners), we would discover that the results differ depending on which coding rule we use. for coding rule one, the mean difference in life satisfaction between the pet owners and the non-pet owners is 2, and this difference is not statistically significant, t(2) = 2.83, p = .106. but for coding rule two, the mean difference in life satisfaction is 1000.5, and this difference is statistically significant, t(2) = 894.87, p < .001. thus, the outcome of the student's t test varies across these two equally permissible coding rules, which does not seem like a satisfactory state of affairs. on the other hand, if we compare the two samples using a mann-whitney u test, the resulting p value is identical across the two coding rules (p = .33). this is the case for the simple reason that the mann-whitney u is calculated using the ranks of the observations rather than their numeric coded values. as such, we might argue that the mann-whitney u statistic and its associated p value are invariant across the class of permissible transformations of ordinal data, whereas the student's t test is not, and that the mann-whitney u is therefore the more appropriate test. stevens went on to set out a list of statistical analyses that he believed would produce invariant results for variables of each level of measurement. for example, he suggested that a median is admissible as a measure of central tendency for an ordinal variable, since the case (or pair of cases) that falls at the median will always be the same across any monotonic transformation of the variable, even if the numeric value of the median will not.
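the contrast between the t test and the mann-whitney u under the two coding rules of table 1 can be reproduced directly. this is my sketch (assuming scipy is installed); the variable names are mine:

```python
# compare the two coding rules from table 1 with a t test and a mann-whitney u test
from scipy import stats

pet_owners_1, non_owners_1 = [4, 3], [2, 1]          # coding rule one
pet_owners_2, non_owners_2 = [1002, 1000], [1, 0]    # coding rule two

t1 = stats.ttest_ind(pet_owners_1, non_owners_1)
t2 = stats.ttest_ind(pet_owners_2, non_owners_2)
u1 = stats.mannwhitneyu(pet_owners_1, non_owners_1, alternative="two-sided")
u2 = stats.mannwhitneyu(pet_owners_2, non_owners_2, alternative="two-sided")

print(f"t test p values: {t1.pvalue:.3f} vs {t2.pvalue:.3f}")        # differ
print(f"mann-whitney p values: {u1.pvalue:.3f} vs {u2.pvalue:.3f}")  # identical
```

the t test p values differ across the two coding rules (.106 vs < .001), while the rank-based mann-whitney p value is p ≈ .33 under both.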
on the other hand, he suggested that a mean is not an admissible measure of central tendency with ordinal data, because the actual value of the mean and the case to which it most closely corresponds will both differ across monotonic transformations of the observations. in noting these distinctions, it is clear that there exists a degree of ambiguity about what constitutes "invariance". michell (1986) and luce et al. (1990) provide a more formal examination of the type of invariance that is implied by stevens' arguments.

parametric and non-parametric statistics

it is common for authors to claim that the issue of admissibility raised by stevens implies that parametric statistical analyses should only be used with interval or ratio data (e.g., jamieson, 2004; kuzon et al., 1996). broadly speaking, a parametric analysis is one that involves an assumption that observations or errors are drawn from a specific probability distribution, such as the normal distribution (see altman & bland, 2009). some statistical analyses (e.g., rank-based tests such as the mann-whitney u) are non-parametric and also produce invariant results across monotonic transformations of the outcome variable, and thus comply with stevens' rules about admissible statistics for ordinal data. however, there certainly exist non-parametric tests that stevens would not have considered admissible for use with ordinal data. for example, a permutation test to compare two means (see hesterberg et al., 2002) is non-parametric—it does not assume that the errors or observations are drawn from any specific probability distribution—but it will not produce invariant p values across monotonic transformations of the observations. as such, stevens' rules about admissibility are not accurately described as applying to whether "parametric" analyses can be utilised.
cliff (1996) uses the term ordinal statistics to describe those analyses whose conclusions will be unaffected by monotonic transformations of the variables; this term can be helpful when describing those analyses that stevens would have classed as permissible with ordinal data.

objections to stevens' claims about admissibility

a range of objections to stevens' claims about the relationship between levels of measurement and admissible statistical analysis have been offered in the literature. i will not attempt to cover these comprehensively; excellent summaries can be found in velleman and wilkinson (1993) and zumbo and kroc (2019). in this section, i will focus on just three core objections. the first fundamental objection to stevens' dictums is simply that researchers may not necessarily desire to make inferences that will be invariant across all permissible transformations of the measurements they have observed. for example, consider our earlier example of a researcher attempting to measure the relationship between owning a pet and satisfaction with life. the researcher might proceed by coding responses to a life satisfaction scale as very dissatisfied = 1, moderately dissatisfied = 2, moderately satisfied = 3, and very satisfied = 4, and then perform a statistical analysis. such a researcher might very well see it as entirely irrelevant whether her results would remain invariant if she monotonically transformed her data using the coding rule very dissatisfied = 0, moderately dissatisfied = 1, moderately satisfied = 1000, very satisfied = 1002. amongst the types of generalisations that researchers seek (e.g., from samples to populations, from observations to causes), generalisations from one coding rule to another may not always be desired or claimed. correspondingly, it may be inappropriate for methodologists to dictate that researchers should act in such a way as to permit such generalisations.
the second objection is the presence of internal inconsistency in stevens’ prohibitions. specifically, stevens was extremely liberal in his definition of measurement (any assignment of numbers to objects is enough to be measurement), and likewise liberal in how he distinguished levels of measurement. for example, he argued that all that is needed to achieve interval measurement of an attribute is an empirical operation for determining equality of differences of the attribute (regardless of the validity of this operation, or the structure of the attribute itself). taken literally, this would imply that i could achieve “interval” measurements of film quality by applying the empirical operation of asking a group of participants whether they perceive there to be a larger difference in quality between saw iv and maid in manhattan than between maid in manhattan and the godfather (regardless of whether participants actually have any meaningful way of comparing these differences, or whether “film quality” is actually a quantitative attribute). when the definition of what constitutes a particular level of measurement is so loose it makes little sense to make the level of measurement a strict determining factor for which statistical analyses may be applied. even stevens himself wavered on the point of how strictly his rules about admissible statistics should be applied: “…most of the scales used widely and effectively by psychologists are ordinal scales. in the strictest propriety the ordinary statistics involving means and standard deviations ought not to be used with these scales […] on the other hand, for this 'illegal' statisticizing there can be invoked a kind of pragmatic sanction: in numerous instances it leads to fruitful results” (stevens, 1946, p. 679). a final fundamental objection to stevens’ dictums about admissible statistics is the fact that statistical tests make assumptions about the distributions of variables and/or errors—not about levels of measurement. 
this is the topic i will turn to for the remainder of this article.

statistical assumptions

when statisticians evaluate a method for estimating a parameter (e.g., the relationship between two variables in a population), an important task is to show that the estimation method has particular desirable properties. for example, we may desire that a method for estimating a parameter will produce estimates that are unbiased—that, across repeated samplings, do not tend to systematically over- or underestimate the parameter. we may also desire that the estimation method is consistent—that the statistic estimated from the sample will converge to the true population parameter as we collect more and more observations. and we may desire that the estimation method is efficient—that it minimises how much variability or noise there is in the estimates it produces across repeated samples (see dougherty, 2007 for more detailed descriptions of these concepts). to demonstrate that particular estimation methods have particular desirable properties, statisticians must make assumptions. these assumptions are premises that are used to form deductive arguments (proofs). for example, a statistical model commonly used by psychologists is the linear regression model, in which a participant's score on an outcome variable [7] is modelled as a function of their scores on a set of predictor variables multiplied by a set of regression coefficients (plus random error). in this model, the distributions of the errors (over repeated samplings) are typically assumed to be independently, identically and normally distributed with an expected value (true mean) of zero, regardless of the combination of levels of the predictor variables for each participant [8] (williams et al., 2013). furthermore, we assume that the predictor variables are measured without error, and that any measurement error in the outcome variable is purely random and uncorrelated with the predictors (williams et al., 2013).
if these assumptions hold, then it can be demonstrated that ordinary least squares estimation will produce estimates of the regression coefficients that are unbiased, consistent, efficient and normally distributed estimators of the true values in the population. this in turn means that statistical tests can be conducted on the coefficients that will abide by their nominal type i error rates and confidence interval coverage.

[7] an outcome variable is often referred to as a "dependent" variable, and a predictor variable as an "independent" variable. i use the more general terminology of predictor/outcome because some authors reserve the terms "independent variable" and "dependent variable" to refer to variables in a true experiment.

in most cases, the assumptions used to prove that statistical tests have particular desirable properties (e.g., unbiasedness, consistency, efficiency) do not include assumptions about levels of measurement. it is not correct to say, for example, that a correlation or a t test or a regression model or an anova directly assumes that any of the variables involved are interval or ratio. as should be clear by now, the concerns that motivated stevens' rules do not pertain to statistical assumptions. rather, they are measurement-theoretic concerns, pertaining specifically to the question of whether statistical analyses will produce results that depend on what has been empirically observed, as opposed to arbitrary features of the process used to numerically record these observations. if most statistical tests do not make assumptions about levels of measurement, does this in turn imply that concerns about levels of measurement can safely be disregarded? no. the application of statistical analyses to ordinal or nominal data can result in consequential breaches of statistical assumptions.
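the repeated-sampling sense of "unbiased" described above can be illustrated with a small simulation (my sketch; the model and parameter values are arbitrary choices for illustration): when the linear model's assumptions hold, ordinary least squares slope estimates average out to the true slope.

```python
# repeated-sampling sketch: OLS slope estimates are unbiased when the
# linear model's assumptions hold.
import numpy as np

rng = np.random.default_rng(0)
true_intercept, true_slope = 1.0, 2.0
estimates = []
for _ in range(2000):                      # repeated sampling from the same population
    x = rng.uniform(0, 10, size=50)
    y = true_intercept + true_slope * x + rng.normal(0, 1, size=50)
    slope, _ = np.polyfit(x, y, 1)         # ordinary least squares fit
    estimates.append(slope)

print(np.mean(estimates))  # close to the true slope of 2.0
```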
in fact, considering levels of measurement in terms of their potential impacts on statistical assumptions provides a framework that may be useful for evaluating the extent to which levels of measurement have implications for data analysis decisions. when researchers apply inferential statistics (e.g., significance tests, confidence intervals, bayesian analyses) they are by definition aiming to make inferences (e.g., from observations to causal effects, and/or from a sample to a population). the validity of these inferences will necessarily depend on the validity of the statistical assumptions made in forming these inferences—so, whereas it is possible to make an argument that stevens' dictums can safely be ignored, this is certainly not the case for statistical assumptions. below i identify several ways in which the level of measurement of a set of observations may affect whether particular statistical assumptions are met. i make no claim to this being an exhaustive list of such mechanisms. i focus specifically on ordinal data because this is the measurement level which most commonly causes ambiguity with respect to analysis decisions in psychology research. i also focus on multiple linear regression as the statistical analysis of interest, as this is an analysis framework that encompasses many special cases of interest to psychologists (e.g., anova, ancova, t tests), and that itself forms a special case of other more sophisticated analysis techniques often applied by psychologists (e.g., structural equation models, mixed/multilevel models, generalised linear models).

[8] this presentation of assumptions is for a model where the predictor values may be either fixed in advance or sampled from a population. the assumptions of a model where predictor values are fixed in advance are slightly simpler, requiring only that the marginal mean of each error term is zero.
assumption that measurement error in outcome is uncorrelated with predictors

one scenario in which a researcher may find themselves with a set of ordinal observations is when the attribute they seek to measure is continuous, but the observations are obtained in such a way that this quantitative attribute is discretised. for example, we might assume that there exists an underlying continuous latent variable "life satisfaction", and that recording it using a four-point rating scale such as the one described earlier in this article means dividing variation in this continuous attribute into four ordered categories. the assumption that responses to observed items are caused by variation in underlying latent attributes reflects a perspective on measurement sometimes referred to as latent variable theory (borsboom, 2005). liddell and kruschke (2018) note that when a researcher aims to make inferences about the effect of a set of predictor variables on an underlying unbounded continuous attribute—but the outcome variable is actually recorded as a response in one of a finite number of ordered categories—the participants' observed ordinal responses can be biased estimates of their levels of the continuous attribute. this is the case because a response scale that consists of a set of discrete options produces responses that are bounded to fall within a range, whereas the underlying continuous attribute may not be bounded to fall within that range. for example, if the underlying continuous attribute is normally distributed, it will have an unbounded distribution, and could theoretically take any value on the real number line. this implies in turn that values of the underlying continuous variable that lie outside the range of the response options will be "censored".
for example, if responses are recorded on a rating scale with response options coded as 1 to 5, values on the underlying continuous attribute that are higher than 5 can only be recorded as 5, while values lower than 1 can only be recorded as 1. this means that, for those participants whose values of the attribute are outside the range of the response options, the recorded responses are biased estimates of their levels of the underlying continuous attribute. although liddell and kruschke do not describe it in these terms, the difference between the ordinal response and the true underlying values of the continuous attribute represents a form of systematic measurement error. and because the magnitude of this error depends on the value of the underlying continuous attribute, the presence of any relationship between the predictor variables and the underlying continuous attribute will mean that the measurement error in the outcome variable is correlated with the predictor variables. this constitutes a breach of the assumptions of a linear regression model, and a breach that can seriously distort parameter estimates and error rates (as demonstrated by liddell & kruschke, 2018). as liddell and kruschke show, it is also not a problem that is ameliorated when the outcome variable is formed by summing or averaging responses from multiple items. by way of solution, liddell and kruschke suggest that regression models specifically designed for ordinal outcome variables (e.g., the ordered probit model) may be useful in such situations; see also bürkner and vuorre (2019) for an introduction to a wider range of ordinal regression models. furthermore, while the discussion above focuses on ordinal outcome variables, using an ordinal predictor variable to make inferences about the effect of an underlying continuous attribute will likewise mean that the underlying attribute is measured with error, and result in a biased estimate of its effect (see westfall & yarkoni, 2016).
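this censoring mechanism can be made concrete with a small simulation (my sketch, not liddell and kruschke's own code; the latent model and its parameters are assumptions chosen for illustration): a continuous latent outcome is recorded on a bounded 1-to-5 scale, and the resulting measurement error turns out to be correlated with the predictor.

```python
# sketch: recording an unbounded continuous attribute on a bounded 1-5 scale
# produces measurement error that is correlated with the predictor.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(0, 2, size=n)                   # predictor
latent = 3 + x + rng.normal(0, 0.5, size=n)    # unbounded continuous attribute
observed = np.clip(np.rint(latent), 1, 5)      # response forced into 5 ordered categories
error = observed - latent                      # measurement error in the recorded outcome

r = np.corrcoef(x, error)[0, 1]
print(r)  # clearly negative: the error is systematically related to the predictor
```

the negative correlation arises because high values of the latent attribute (which go with high values of the predictor) are censored downwards at the top of the scale, and low values are censored upwards at the bottom.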
admittedly, this is a problem whose salience depends on whether the researcher believes that a continuous attribute underlies an ordinal variable, and wishes to make inferences about the underlying continuous attribute rather than the observed ordinal variable. however, making inferences only about ordinal variables themselves also presents serious challenges for statistical analysis, as we will see in the next subsection.

non-linearity

when social scientists specify statistical models, they often assume that relationships between variables are linear. for some statistical analyses (e.g., pearson's correlation), this assumption is intrinsic to the form of analysis itself. in other cases, it is possible to specify that a particular relationship is non-linear, but doing so requires deliberate action from the data analyst, and the types of non-linear relationship that can be specified are restricted. for example, multiple linear regression can accommodate some types of non-linear relationships between variables (e.g., polynomial relationships), but the data analyst must specify these as part of the model. furthermore, only models where the outcome variable is a linear function of the parameters can be specified as linear regression models (this is why we call this mode of analysis "linear" regression). when a relationship between variables is assumed to be linear but in fact is not, the applied statistical model clearly does not capture reality. even if we accept that a linear regression model is an inaccurate simplification of reality and wish nevertheless to make inferences about the parameters of this model were it fit to the population, the presence of non-linearity will mean that the statistical assumption that the expected value of the errors is zero for all values of the predictors will be breached, implying that the estimation method may produce biased estimates of the population parameters.
admittedly, for some models an assumption of linearity is met by design: for example, if we estimate the effect of an experimentally manipulated binary variable on an outcome variable, it will obviously be possible for a straight line to perfectly connect the two group means. but in many situations—especially when we are trying to estimate effects of measured psychological variables on one another rather than estimating the effects of experimental manipulations—an implied assumption of linearity could well be false. as an empirical example, consider a study aimed at estimating the effect of perfectionism on procrastination, with both attributes measured using self-report rating scales that we have numerically coded such that they each have a range of 1 to 10. if we fit a simple linear regression model with perfectionism as the predictor and procrastination as the outcome, then we are assuming that increasing perfectionism from 1 to 2 points has exactly the same effect on procrastination as increasing perfectionism from 2 to 3 points, or from 3 to 4 points, and so forth. but if the perfectionism scores are ordinal, this may not be plausible: after all, an ordinal scale is one where we have been unable to compare differences in levels of the attribute. consequently, the size of the difference between two participants' numeric scores is largely an artefact of the rule we've used to code observations numerically, and we have no evidence that it bears any connection to the magnitudes of the differences in the underlying attribute (in this case, perfectionism). as such, even if variation in the attribute underlying the predictor variable (perfectionism) has a completely linear effect on the outcome variable (procrastination), there is no strong reason to assume that there would be a linear relationship between the numeric scores.
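the point can be illustrated with a deterministic sketch (my own; the cubic coding rule is hypothetical): even when the latent attribute's effect is perfectly linear, a monotone but unevenly spaced numeric coding makes the relationship between the scores non-linear.

```python
# sketch: a linear latent effect becomes non-linear at the level of the
# numeric scores under a monotone but unevenly spaced coding rule.
import numpy as np

latent = np.linspace(0.01, 1.0, 200)    # latent perfectionism, arbitrary 0-1 scale
outcome = latent                        # the latent effect is perfectly linear
coded = 1 + 9 * latent ** 3             # monotone but unevenly spaced 1-10 coding

# slope of the outcome on the coded scores in the lower vs upper half of the scale
half = len(coded) // 2
low_slope = np.polyfit(coded[:half], outcome[:half], 1)[0]
high_slope = np.polyfit(coded[half:], outcome[half:], 1)[0]
print(low_slope, high_slope)  # slopes differ sharply: the score-level relation is non-linear
```

a single linear regression on the coded scores would average over these very different local slopes, despite the underlying effect being exactly linear.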
exacerbating this problem further is the possibility that, when we estimate the effect of one psychological attribute on another, the attribute underlying the predictor variable may itself not have a linear effect on the outcome variable of interest. after all, different scores on a psychological test may not necessarily represent different levels of some homogeneous quantitative attribute, but may instead represent the presence or absence of qualitatively different properties. consider, for example, the difference between a person who has obtained an iq score of 100 on the wechsler adult intelligence scale (wais-iv; wechsler et al., 2008) and one who has received an iq score of 120. these different iq scores may reflect qualitative differences between the participants. for example, the second person may have elements of general knowledge that the first person does not, thus achieving a higher score on the information subtest, or know how to apply the strategy of "chunking" digits so as to achieve a higher score on the digit span subtest. a person with an iq score of 140 might have access to qualitatively different items of knowledge and cognitive skills again. the differences in "intelligence" between these individuals are not necessarily just differences on some homogeneous quantitative attribute, but rather—at least in part—the presence or absence of qualitatively different items of knowledge and cognitive skills. there may be little reason, then, to assume that each of these qualitative differences would have identical effects on another psychological attribute (e.g., job performance; schmidt, 2002), despite the equal differences in numeric scores (100 to 120, 120 to 140).
Differences in scores on a variable that does not represent varying magnitudes of a homogeneous quantitative attribute but rather qualitative differences in the properties of participants may result in such a variable having distinctly non-linear effects on other variables (see Figure 1).

Figure 1. Illustration of three types of effect. The first is a linear effect. The second is a quadratic effect: an effect that is not linear, but that can readily be specified within a linear regression framework. The third is a non-linear effect that takes the form of a segmented function, where the effect of the predictor variable itself changes abruptly as the predictor variable increases. This kind of effect is plausible when the predictor variable is ordinal, but cannot readily be accommodated within a linear regression framework (at least not without applying a piecewise model).

What can researchers do about this? Statistical analyses that permit the specification of non-linear relationships obviously do exist (e.g., Cleveland & Devlin, 1988). However, psychological theories are rarely specific enough to imply the specific functional form of relationships. Non-linear models can be selected based on empirical data, but basing such model specification decisions on empirical data alone may (at least in the absence of cross-validation) risk overfitting, i.e., selecting overly complex non-linear models that do not generalise well outside the sample they are trained on (Babyak, 2004; Hawkins, 2004). In the field of statistical learning, this problem is known as the "bias-variance trade-off" (James et al., 2013, p. 33). If we apply a simple model which incorporates inaccurate assumptions (e.g., linear relationships), the resulting estimates may be substantially biased. Applying a more flexible model (e.g., a polynomial model) may reduce this bias, but at the cost of producing estimates that are more variable across datasets (e.g., overfitting).
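The bias-variance trade-off can be sketched with synthetic data (a generic illustration under assumed toy values, not a procedure from the works cited here): a flexible polynomial never fits its own training sample worse than the straight line nested within it, but that in-sample gain can come from fitting noise rather than signal.

```python
import random
import numpy as np

# Synthetic data: the true relationship is linear, plus noise.
rng = random.Random(1)
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + np.array([rng.gauss(0, 2.0) for _ in range(len(x))])

def training_mse(degree):
    """Least-squares polynomial fit; mean squared error on the training data."""
    coefs = np.polyfit(x, y, degree)
    return float(np.mean((y - np.polyval(coefs, x)) ** 2))

mse_line = training_mse(1)  # simple model: possibly biased, but stable
mse_poly = training_mse(6)  # flexible model: lower bias, higher variance

# A degree-6 polynomial can represent any straight line, so its
# least-squares training error can never exceed the line's.
print(mse_line, mse_poly)
```

In-sample fit alone therefore cannot arbitrate between the two models; cross-validation or a held-out sample is needed to detect when the flexible fit has started chasing noise.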
In the absence of a purely statistical solution, this problem may be addressed by developing theory to be more specific about the functional form of relationships, as occurs in mathematical psychology (see Navarro, 2020). Where models assuming linear relationships are applied, it is important to apply diagnostic procedures that can detect the presence of non-linearity. Such diagnostics may allow researchers to understand and communicate to readers the degree to which an assumption of linearity is a reasonable approximation of reality in the specified case, and the consequent degree to which additional uncertainty may surround the results. Although a detailed description of methods for detecting non-linearity in relationships is beyond the scope of this paper, perhaps the most well-known method is plotting residuals against predicted ("fitted") values to visually identify the presence of a non-linear pattern (see Gelman & Hill, 2007). More formal tests of non-linearity in the context of regression include the RESET test (Ramsey, 1969) and the rainbow test (Utts, 1982).

Conclusion

At this point it should be clear that I see little reason for contemporary researchers to rigidly follow Stevens' dictums about which statistical analyses are admissible with data of particular levels of measurement. A number of strong objections to Stevens' dictums have been raised in the methodological literature, of which perhaps the most fundamental is that his rules assume a goal on the part of the researcher to achieve a type of generalisation (inferences that apply across a class of coding rules) that may not be of interest to the researcher. Furthermore, most statistical tests do not directly require assumptions about levels of measurement. However, statistical assumptions and measurement-theoretic concerns do intersect in important ways.9
My suggestion that contemporary researchers do not need to follow Stevens' rules exactly as he stated them should not be read as implying that researchers can safely set aside measurement-theoretic concerns. Indeed, much as Michell (1986) suggests, measurement-theoretic issues do have implications for statistical analysis, just not the simple implications proposed by Stevens. I suggest that researchers focus on whether the statistical assumptions of the analyses they wish to perform are consistent with the observations they have collected, considering in doing so how the plausibility of these assumptions may be affected by the level of measurement of the observations. The assumptions that are made should be clearly communicated to readers and interrogated for plausibility, based both on a priori considerations (e.g., is it likely that an ordinal variable could have linear effects?) and empirical ones (e.g., to what extent is this set of observations consistent with a linear relationship?).

9 Sometimes this connection between measurement and analysis is very direct. Consider, for example, the "Matthew effect" in reading, where investigating the claimed phenomenon (compounding differences over time between stronger and weaker readers) clearly requires the empirical comparison of differences (i.e., an interval scale; see Protopapas et al., 2016).

Author Contact

Correspondence regarding this article should be addressed to Matt Williams, School of Psychology, Massey University, Private Bag 102904, North Shore, Auckland, New Zealand. Email: m.n.williams@massey.ac.nz. ORCID: https://orcid.org/0000-0002-0571-215x

Conflict of Interest and Funding

I report no conflicts of interest. This study did not receive any specific funding.

Author Contributions

I am the sole contributor to the content of this article.

Open Science Practices

This article earned no Open Science badges because it is theoretical and does not contain any data or data analyses. However, the R code provided in the OSF project was fully reproducible with the given example data.

References

Altman, D. G., & Bland, J. M. (2009). Parametric v non-parametric methods for data analysis. BMJ, 338, a3167.
https://doi.org/10.1136/bmj.a3167
Babyak, M. A. (2004). What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine, 66(3), 411–421. https://doi.org/10.1097/01.psy.0000127692.23278.a9
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge University Press.
Bridgman, P. W. (1927). The logic of modern physics. Macmillan.
Bürkner, P.-C., & Vuorre, M. (2019). Ordinal regression models in psychology: A tutorial. Advances in Methods and Practices in Psychological Science, 2(1), 77–101. https://doi.org/10.1177/2515245918823199
Campbell, N. R. (1920). Physics: The elements. Cambridge University Press.
Carifio, J., & Perla, R. (2008). Resolving the 50-year debate around using and misusing Likert scales. Medical Education, 42(12), 1150–1152. https://doi.org/10.1111/j.1365-2923.2008.03172.x
Chang, H. (2009). Operationalism. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/archives/fall2009/entries/operationalism/
Cheung, F., & Lucas, R. E. (2014). Assessing the validity of single-item life satisfaction measures: Results from three large samples. Quality of Life Research, 23(10), 2809–2818. https://doi.org/10.1007/s11136-014-0726-4
Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403), 596–610. https://doi.org/10.1080/01621459.1988.10478639
Cliff, N. (1996).
Answering ordinal questions with ordinal data using ordinal statistics. Multivariate Behavioral Research, 31(3), 331–350. https://doi.org/10.1207/s15327906mbr3103_4
Cozby, P. C., & Bates, S. C. (2015). Methods in behavioral research (12th ed.). McGraw-Hill.
Dougherty, C. (2007). Introduction to econometrics (3rd ed.). Oxford University Press.
Ferguson, A., Myers, C. S., Bartlett, R. J., Banister, H., Bartlett, F. C., Brown, W., Campbell, N. R., Craik, K. J. W., Drever, J., Guild, J., Houstoun, R. A., Irwin, J. O., Kaye, G. W. C., Philpott, S. J. F., Richardson, L. F., Shaxby, J. H., Smith, T., Thouless, R. H., & Tucker, W. S. (1940). Final report of the committee appointed to consider and report upon the possibility of quantitative estimates of sensory events. Report of the British Association for the Advancement of Science, 2, 331–349.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Hand, D. J. (1996). Statistics and the theory of measurement. Journal of the Royal Statistical Society: Series A (Statistics in Society), 159(3), 445–492. https://doi.org/10.2307/2983326
Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1), 1–12. https://doi.org/10.1021/ci0342472
Heiman, G. W. (2001). Understanding research methods and statistics: An integrated introduction for psychology (2nd ed.). Houghton Mifflin.
Hesterberg, T., Moore, D. S., Monaghan, S., Clipson, A., & Epstein, R. (2002). Bootstrap methods and permutation tests. In D. S. Moore & G. P. McCabe (Eds.), Introduction to the practice of statistics (4th ed.). Freeman.
Hölder, O. (1901). Die Axiome der Quantität und die Lehre vom Mass. Teubner.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer.
Jamieson, S. (2004). Likert scales: How to (ab)use them. Medical Education, 38(12), 1217–1218.
https://doi.org/10.1111/j.1365-2929.2004.02012.x
Judd, C. M., Smith, E. R., & Kidder, L. H. (1991). Research methods in social relations (6th ed.). Holt Rinehart and Winston.
Kaplan, R. M., & Saccuzzo, D. P. (2018). Psychological testing: Principles, applications, and issues (9th ed.). Cengage.
Krantz, D. H., Suppes, P., & Luce, R. D. (1971). Foundations of measurement: Additive and polynomial representations (Vol. 1). Academic Press.
Kuzon, W., Urbanchek, M., & McCabe, S. (1996). The seven deadly sins of statistical analysis. Annals of Plastic Surgery, 37, 265–272. https://doi.org/10.1097/00000637-199609000-00006
Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348. https://doi.org/10.1016/j.jesp.2018.08.009
Luce, R. D., Suppes, P., & Krantz, D. H. (1990). Foundations of measurement: Representation, axiomatization, and invariance. Academic Press.
McBurney, D. H. (1994). Research methods (3rd ed.). Brooks/Cole.
McGrane, J. A. (2015). Stevens' forgotten crossroads: The divergent measurement traditions in the physical and psychological sciences from the mid-twentieth century. Frontiers in Psychology, 6. https://doi.org/10.3389/fpsyg.2015.00431
Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100(3), 398–407. https://doi.org/10.1037/0033-2909.100.3.398
Michell, J. (1999). Measurement in psychology: A critical history of a methodological concept. Cambridge University Press.
Michell, J. (2007). Representational theory of measurement. In M. Boumans (Ed.), Measurement in economics: A handbook (pp. 19–39). Elsevier.
Michell, J. (2012). Alfred Binet and the concept of heterogeneous orders. Frontiers in Quantitative Psychology and Measurement, 3, 261. https://doi.org/10.3389/fpsyg.2012.00261
Michell, J., & Ernst, C. (1996).
The axioms of quantity and the theory of measurement: Translated from Part I of Otto Hölder's German text "Die Axiome der Quantität und die Lehre vom Mass." Journal of Mathematical Psychology, 40(3), 235–252. https://doi.org/10.1006/jmps.1996.0023
Navarro, D. (2020). If mathematical psychology did not exist we would need to invent it: A case study in cumulative theoretical development [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/ygbjp
Neuman, W. L. (2000). Social research methods (4th ed.). Allyn & Bacon.
Price, P. (2012). Research methods in psychology. Saylor Foundation.
Protopapas, A., Parrila, R., & Simos, P. G. (2016). In search of Matthew effects in reading. Journal of Learning Disabilities, 49(5), 499–514. https://doi.org/10.1177/0022219414559974
Ramsey, J. B. (1969). Tests for specification errors in classical linear least-squares regression analysis. Journal of the Royal Statistical Society: Series B (Methodological), 31(2), 350–371. https://doi.org/10.1111/j.2517-6161.1969.tb00796.x
Ray, W. J. (2000). Methods: Toward a science of behavior and experience (6th ed.). Wadsworth.
Schmidt, F. L. (2002). The role of general cognitive ability and job performance: Why there cannot be a debate. Human Performance, 15(1–2), 187–210. https://doi.org/10.1080/08959285.2002.9668091
Sherry, D. (2011). Thermoscopes, thermometers, and the foundations of measurement. Studies in History and Philosophy of Science Part A, 42(4), 509–524. https://doi.org/10.1016/j.shpsa.2011.07.001
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. https://doi.org/10.1126/science.103.2684.677
Sullivan, T. J. (2001). Methods of social research. Harcourt College Publishers.
Suppes, P., & Zinnes, J. L. (1962). Basic measurement theory. Stanford University.
Tal, E. (2017). Measurement in science. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/archives/fall2017/entries/measurement-science
Utts, J. M.
(1982). The rainbow test for lack of fit in regression. Communications in Statistics: Theory and Methods, 11(24), 2801–2815. https://doi.org/10.1080/03610928208828423
Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65–72. https://doi.org/10.2307/2684788
Wechsler, D., Coalson, D. L., & Raiford, S. E. (2008). WAIS-IV technical and interpretive manual. Pearson.
Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PLOS ONE, 11(3), e0152719. https://doi.org/10.1371/journal.pone.0152719
Williams, M. N., Grajales, C. A. G., & Kurkiewicz, D. (2013). Assumptions of multiple regression: Correcting two misconceptions. Practical Assessment, Research & Evaluation, 18(11). https://scholarworks.umass.edu/pare/vol18/iss1/11/
Zigmond, A. S., & Snaith, R. P. (1983). The hospital anxiety and depression scale. Acta Psychiatrica Scandinavica, 67(6), 361–370. https://doi.org/10.1111/j.1600-0447.1983.tb09716.x
Zumbo, B. D., & Kroc, E. (2019). A measurement is a choice and Stevens' scales of measurement do not help make it: A response to Chalmers. Educational and Psychological Measurement, 76(6), 1184–1197. https://doi.org/10.1177/0013164419844305

Meta-Psychology, 2022, Vol 6, MP.2021.2837
https://doi.org/10.15626/mp.2021.2837
Article type: Original Article
Published under the CC-BY 4.0 license
Open data: Not applicable
Open materials: Not applicable
Open and reproducible analysis: Not applicable
Open reviews and editorial process: Yes
Preregistration: No
Edited by: Rickard Carlsson
Reviewed by: Isager, P., Williams, M., Beffara Bret, A.
Analysis reproduced by: Not applicable
All supplementary files can be accessed at OSF: https://doi.org/10.17605/osf.io/6a24s

Means to Valuable Exploration: I.
The Blending of Confirmation and Exploration and How to Resolve It

Michael Höfler, Faculty of Psychology, Clinical Psychology and Behavioural Neuroscience, Institute of Clinical Psychology and Psychotherapy, Technische Universität Dresden, Dresden, Germany
Stefan Scherbaum, Faculty of Psychology, Institute of General Psychology, Biopsychology, and Psychological Research Methods, Technische Universität Dresden, Dresden, Germany
Philipp Kanske, Faculty of Psychology, Clinical Psychology and Behavioural Neuroscience, Institute of Clinical Psychology and Psychotherapy, Technische Universität Dresden, Dresden, Germany; Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
Brennan McDonald, Faculty of Psychology, Clinical Psychology and Behavioural Neuroscience, Institute of Clinical Psychology and Psychotherapy, Technische Universität Dresden, Dresden, Germany
Robert Miller, Faculty of Psychology, Technische Universität Dresden, Dresden, Germany

Abstract

Data exploration has enormous potential to modify and create hypotheses, models, and theories. Harnessing the potential of transparent exploration replaces the common, flawed purpose of intransparent exploration: to produce results that appear to confirm a claim by hiding steps of an analysis. For transparent exploration to succeed, however, methodological guidance, elaboration, and implementation in the publication system are required. We present some basic conceptions to stimulate further development. In this first of two parts, we describe the current blending of confirmatory and exploratory research and propose how to separate the two via severe testing. A claim is confirmed if it passes a test that would probably have failed if the claim were false. Such a severe test makes a risky prediction. It adheres to an evidential norm with a threshold, usually p < α = .05, but other norms are possible, for example, with Bayesian approaches.
To this end, adherence requires control against questionable research practices like p-hacking and HARKing. At present, preregistration seems to be the most feasible mode of control. Analyses that do not adhere to a norm, or where this cannot be controlled, should be considered exploratory. We propose that exploration serves to modify or create new claims that are likely to pass severe testing with new data. Confirmation and exploration, if sound and transparent, benefit from one another. The second part will provide suggestions for planning and conducting exploration and for implementing more transparent exploratory research.

Keywords: exploration, confirmation, p-hacking, HARKing, preregistration, severity, replication, bias, Bayes

Introduction

Degrees of freedom in the specification of hypotheses and analyses are both a curse and a blessing. In improperly performed confirmatory analyses, they are misused to disguise incidental findings as evidence for a hypothesis, whereas in exploratory analyses they open ways to new insights (Thompson et al., 2020). Questionable research practices (QRPs) like p-hacking and "hypothesising after the results are known" (HARKing; Hollenbeck & Wright, 2017; Rubin, 2017) misuse degrees of freedom to produce nominally confirmatory results (p-value < α). At the same time, results presented as confirmatory are more likely to be published (Francis, 2012; Gigerenzer & Marewski, 2015; Masicampo & Lalande, 2012; Rosenthal, 1979; Scargle, 2000). Both practices disrupt scientific communication by introducing findings into the literature that are not sufficiently substantiated by evidence.
Moreover, results arising from QRPs are less likely to be replicated, such that p-hacking and HARKing are considered significant causes of the replication crisis: the failure to replicate many established experimental psychological findings (Head et al., 2015; but also see Lewandowsky & Oberauer, 2020; Ulrich & Miller, 2020). Confirmation means that an evidential norm with a defined threshold (usually p < α) is applied. It must be strictly adhered to without being affected by the data. In contrast, sloppy confirmation occurs when a scientist trawls through a dataset with various options on how to test a claim (Gelman & Loken, 2013) and cherry-picks the data (e.g., the smallest p-value) while hiding the other results, thus missing the rigour required of confirmation (p-hacking). Or, with intransparent HARKing, a new hypothesis is generated around one preferred result out of the many that were generated; it is presented as confirmed while the other results are suppressed. By beginning with the flawed and improper intention of confirming hypotheses in this manner, one covertly engages in exploration. Confirmation, on the other hand, is a constrained process by which established scientific concepts are held up to empirical scrutiny through precise prediction, study design, and analysis planning. In other words, the "researcher degrees of freedom" (Simonsohn et al., 2020) available during confirmation are intentionally restricted to provide evidence for (or against) a clearly stated hypothesis.

Basic Concepts

In contrast to confirmation, transparent or "open exploration" (Thompson et al., 2020) embraces the degrees of freedom during the analysis to potentially reveal something of substantial interest within the data (Dirnagl, 2020). Exploration, when done transparently, thus allows a free inquiry into the behaviour of a dataset without preconceived delineations as to what patterns it shows (Thompson et al., 2020).
Transparent exploration seems to be very rare (Gigerenzer & Marewski, 2015), and its potential to identify novel insights is rarely utilised. In contrast, intransparent exploration for the purpose of confirmation appears to be worryingly common (Agnoli et al., 2017; Gopalakrishna et al., 2021a; John et al., 2012; Kerr, 1998). Our aim is to call upon transparent exploration's potential and to revive this alternative approach to science. To this end, we present some basic concepts and methodical considerations to stimulate further in-depth and detailed elaborations. We use the term "exploration" as referring to a toolbox of analytical methods to generate and modify hypotheses, models, and theories. With this purpose, HARKing may become transparent (Hollenbeck & Wright, 2017) and then simply describes the generating aspect. Likewise, even transparent p-hacking may serve the purpose of finding novelty, as we shall make explicit in Part II of this pair of consecutive articles. The more exploration succeeds in moving science forward through unpredicted findings, we suggest, the more valuable it is. Value may arise directly, that is, through discoveries that provide ever more accurate depictions of reality. Indirect value might come from exploratively generated claims that are wrong but trigger alternative ideas and thus open other paths to novel insight (Nosek et al., 2018; Stebbins, 1992, 2001, 2006). Thus, we suggest, exploration not only possesses practical value to inform particular scientific domains, but also epistemic value as a systematic method to find the new. (Note that the general term "exploration" has a much broader meaning than used here, including goals such as approaching a new area of research to begin with or "becoming familiar with something by testing it"; Stebbins, 2001.)
With the term "claim" we label statements that are "synthetic [either right or wrong at least in some occasions], testable, falsifiable, parsimonious, and (hopefully) fruitful" (Myers & Hansen, 2012, p. 167). A claim makes an assertion about a hypothesis, model, or theory. We follow the predictivist tradition in the philosophy of science, where a claim is supported if it makes a correct prediction on new data (Barnes, 2008). Confirmation and exploration are both imperative, but serve very different purposes. Confirmation is about rigidly testing a claim; if successful, a new claim becomes an established claim. We use the usual notation for statistical tests, where H1 means that the assertion is true (operationalised as the alternative hypothesis in a statistical test) and H0 that it is not true (the null hypothesis in a statistical test). In exploration, the narrow focus of confirmation is replaced by the freedom to widen the scope with the inherent goal of identifying novelty. This could move existing claims in a wider range of directions or lead to new assertions about the world. Confirmation describes the straightforward, iterative path of research: hypothesise, test, corroborate or discard. If confirmation fails, the straight path ends, opening up the opportunity for a divergent, less-travelled path toward insight. Exploration is the method of venturing from (what should be) the well-trodden confirmatory path with a quantitative quest. It even seems necessary for discovery beyond the mainstream. However, for this alternative scientific track to succeed, researchers must be equipped with competencies in the conceptions, goals, and methods of both confirmation and exploration.

Structure of the Two Parts

Our two articles are intended to outline the required means for valuable confirmation and exploration in scientific research. We begin this first part by discussing how confirmation and exploration are often blended in today's research.
We then describe the related pressure to produce nominally confirmatory results, which we believe can be reduced if one assigns exploration the value it deserves in scientific practice. This, however, requires an epistemically strict distinction between confirmation and exploration. We use the theory of severe testing for this and clarify the role of preregistration. Finally, we lay out how transparent exploration serves confirmation and vice versa. Part II will propose foundations on how to plan, conceptualise, and conduct exploration and how to implement more exploration in scientific practice.

The Blending of Confirmation and Exploration

Manifestations of Blending

We define blending as the (mis)use of exploratory methods of analysis for confirmatory purposes. Conceptually, blending is not unfounded, as studies may be meaningfully placed on a continuum from purely exploratory ("where the hypothesis is found in the data") to purely confirmatory ("where the entire analysis plan has been explicated before the first participant is tested"; Wagenmakers et al., 2012). Such a continuum maps onto the experience of having unexpected difficulties with data. In many cases this leads scientists (either intentionally or unintentionally) to blend these approaches together, and exploratory results are reported as if they were confirmatory. There are several indications that blending is all too common. Direct evidence comes from many researchers admitting to QRPs, such as excluding data, collecting more data, and making post-hoc claims about hypotheses (Agnoli et al., 2017; Gopalakrishna et al., 2021a; John et al., 2012; Kerr, 1998). Additionally, a disconcerting 9% (95% confidence interval: 6–11%) of researchers across scientific fields even concede to data fabrication and/or falsification (Gopalakrishna et al., 2021a), perhaps the worst practices used to trim results in a particular, presumptively confirmatory direction. Then there are multiple strands of indirect evidence.
First, content analyses show that published studies are almost always framed as confirmatory (Banks et al., 2016; Gigerenzer & Marewski, 2015; Spector, 2015; Woo et al., 2017): hypotheses with p < α, even if they have only been established through the analysis of data, appear as already confirmed. Second, p-values just below the usual α = .05 are found far more often than expected. This may be caused both by researchers engaging in QRPs when preparing a paper and by subsequent publication bias towards positive results (Francis, 2012; Masicampo & Lalande, 2012; Rosenthal, 1979; Scargle, 2000). Blending is also reflected in the fact that many more negative results are found in registered reports, the format that publishes a paper irrespective of whether the results confirm a claim (Allen & Mehler, 2019; Chambers & Tzavella, 2020; Scheel et al., 2021a). It is also evident in the one-sided focus on confirmation in the teaching of science and statistics, with statistical testing being misunderstood as "a universal method for scientific inference" (Gigerenzer & Marewski, 2015). Not surprisingly, then, statistical tests are also used for exploratory purposes. Finally, blending is carried further through harmed scientific communication, as the replication crisis shows (Aarts et al., 2015; Camerer et al., 2018; Open Science Collaboration, 2015): recipients of a seemingly confirming result build their research on the false assumption of sound confirmation; or, if sensitive to the problem, do not know for sure whether a conclusion is based on confirmation or mere exploration. Moreover, empirical evidence suggests that the sharp increase in the number of publications exacerbates the problem by further favouring the already prevailing straight paths instead of opening up new ones (Chu & Evans, 2021).
The Pressure to Produce Seemingly Confirming Results

Blending also seems to be caused by the pressure to produce publications in the current incentive system, where the number of publications and citations dominates the evaluation of scientific performance and career opportunities (Gonzales & Cunningham, 2015; Kerr, 1998; McIntosh, 2017; Nosek et al., 2018; Nosek & Lindsay, 2018; Wagenmakers et al., 2012). Indeed, researchers have reported on such pressure (Gopalakrishna et al., 2021a, 2021b), and pressure is related to more frequent participation in at least one "severe QRP" (Gopalakrishna et al., 2021a). Practices such as p-hacking and intransparent HARKing anticipate publication bias in favour of positive results. At a deeper level, researcher bias towards generating and publishing positive results seems to be influenced by the false belief that positive results are associated with more scientific novelty, the flawed "ideal of confirmation" (Kerr, 1998), and, as a consequence, a "positive testing strategy" (Klayman & Ha, 1987). This is, of course, in stark contrast to Popper's insight that the capacity to falsify hypotheses is actually more fundamental to advances in science (Glass & Hall, 2008; Kerr, 1998; Klayman & Ha, 1987; Locke, 2007; Mayo, 2018; Popper, 1959).

Preregistration of Confirmatory Analyses Falls Short of Solving the Problem

Whether something has been preregistered is not logically related to its quality (Szollosi et al., 2020). Kerr (1998) and others have been criticised for exaggerating the value of preregistration in this regard (Devezer et al., 2021; Rubin, 2019, 2020). Preregistration records an a priori plan of the claim and the analysis and thus creates control over the plan history (Heers, 2020; Rubin, 2019, 2020; Wagenmakers et al., 2012). Nonetheless, it seems to be the best mode of control to date because it allows researchers to prove how they have planned a study (Wagenmakers & Dutilh, 2016).
However, the following issues suggest that the pressure to publish positive results partially resists preregistration. First, preregistration can be abused as an option invoked only in the case of success, to sell a result as more convincing post hoc, while a negative result would be kept secret (Bian et al., 2020; Claesen et al., 2019). Registered reports try to address the bias towards positive results through the guarantee of publication before the results are known, but in many cases a final report is not published (Claesen et al., 2019; Hardwicke & Ioannidis, 2018), and even registered reports can be abused to market positive and suppress negative results (Bian et al., 2020). Another issue is superficial preregistration: the underreporting of analyses and collected variables, which leaves room for intransparent exploration (Franco et al., 2016). Finally, despite preregistration becoming ever more widespread, only a minority of psychological researchers use any type of preregistration. Only around 30% of psychological researchers in French-speaking countries mentioned such a practice in a survey (Beffara-Bret & Beffara-Bret, 2019), and only 3% of 188 psychology articles published between 2014 and 2017 included a statement on preregistration (Hardwicke et al., 2020). This indicates remaining hindrances and unresolved issues.

Transparent Exploration Helps to Reduce the Pressure

Preregistration is sometimes misunderstood as eliminating flexibility in hypothesis formulation and modification (Hollenbeck & Wright, 2017) and data analysis (Goldin-Meadow, 2016; Scott, 2013). This is wrong, as preregistration only archives the initial plan for an analysis. There may be important reasons to deviate from a plan, with deviation allowed as long as a justification is provided and changes are clearly stated as soon as they occur.
however, flexibility could also be indicative of deficits and gaps in theory formulation (eronen & bringmann, 2021; fiedler, 2017; gigerenzer, 2010; szollosi & donkin, 2021). this should be taken as a call to fill the gaps with explicit, transparent exploration (woo et al., 2017). we will come back to this in part ii. this opportunity, like any measure to implement more exploration through methods, teaching and publishing policies, would make it easier for researchers to dispense with confirmatory framing. it has, however, been argued that, were preregistration to become mandatory, everything else might appear to be flawed confirmatory research (goldin-meadow, 2016). moreover, more open science practices could give rise to further qrps like "preregistering after the results are known" (yamada, 2018). we believe that the best answer to these concerns is to promote transparent exploration in order to reduce these wrong and even "perverse incentives" (chiacchia, 2017). to achieve this, however, we first require a clear distinction between confirmation and exploration.

differentiating confirmation and exploration

it is crucial to resolve the blending in the reporting of scientific results. otherwise, confirmation is damaged by intransparent exploration, and transparent exploration does not unfold as it could. while replication (zwaan et al., 2018) and multi-lab studies (stroebe, 2019) try to address the consequences of blending, and preregistration has the insufficiencies mentioned above, we address blending by clearing the path for explicit exploration as a viable alternative. however, differentiating confirmation and exploration is not as easy as it may first appear. explorative results are also supported by a certain, albeit exaggerated, amount of evidence, as seen in small p-values (szollosi & donkin, 2021).
only confirmation uses an evidential norm

the principal difference between confirmation and exploration is that confirmation adheres to an evidential norm for the test of a hypothesis to pass. an evidential norm states that an h1 hypothesis is confirmed if the evidence for it at least exceeds a certain threshold. the usual norm requires that 1 – p (with p the p-value) be greater than 1 – α, that is, p < α. specifying a threshold, like α = .05, makes researchers "accountable" for what they would report as confirming (mayo, 2018). in short: the choice of the threshold norm and its strict application must not be influenced by the data. adherence is violated if variation in p across the available analytical options (gelman & loken, 2013) is misused to fish for a particular p that happens to be smaller than α, that is, with p-hacking. adherence is also violated by harking when multiple relations are tested and an assertion is made around the one that happens to yield p < α. in this case, 1 – α is not adhered to in the context of all that has been tested (see the discussion on global vs. local claims at the end of the chapter). we suggest, however, that deviations from an analytical plan are unproblematic if they are not chosen in order to obtain a smaller p, but, for example, to account for otherwise violated model assumptions (field & wilcox, 2017). this keeps an analysis conceptually within the bounds of confirmation. it is in accordance with the logical possibility of changing analytical decisions after seeing the data without reducing the rigour in testing (szollosi et al., 2020). likewise, it is in principle possible to create or modify a claim after looking at the data without being influenced by what is seen in them. however, these possibilities are difficult to control unless they can be anticipated and incorporated into a preregistered plan (e.g., run the model with option a if the parameter estimation converges, run the model with option b otherwise).
such instances might constitute confirmation, but it is difficult to assess whether they truly do. like others before us (lakens, 2019; mayo, 2018), we propose that confirmation is testing with a high risk of failure if a claim is wrong ("severe testing", see below), and that this high risk must not be reduced by analytical decisions. with adherence to an evidential norm, confirming a claim is supported with true evidence. however, because of blending and the experiences of the replication crisis, adherence requires control. reliable control should be the prerequisite for an analysis to be accepted as confirmation. preregistration seems to be the most feasible and effective mode of control, and we therefore agree with others that only preregistered analyses should be accepted as confirmatory (lakens, 2019; yamada, 2018). this should apply from now on, until perhaps a better mode of control is found. (deciding whether to accept old analyses as confirmations, especially those from the period before preregistration, is in itself a difficult question.) note that other modes have been proposed. the retrospective "21 word solution" demands a post-hoc statement with which scientists declare that they have worked properly (simmons et al., 2012). open analysis may be very effective, but is quite effortful (see part ii). neither alternative, however, offers transparency on the plan history. control measures like requiring preregistration place the burden of proof on scientists, at the price of false negative assessments. assuming that researchers have not worked properly, although they have, is rigid but seems necessary in psychology, which has been severely harmed by exaggerated evidence. accordingly, new analyses that have not been preregistered, or have been improperly preregistered, or where the preregistered analytical plan contradicts the report, should be considered exploratory.
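as a toy illustration of why adherence to the evidential norm matters, recall that under h0 a valid p-value is uniformly distributed. the following minimal sketch (assuming, for simplicity, that the explored analytical options yield independent p-values; in practice they are correlated, so the inflation is smaller but still substantial) shows how reporting the smallest p over several options inflates the effective α far beyond the nominal .05:

```python
import random

random.seed(1)

def hacked_rate(n_studies: int, n_options: int, alpha: float = 0.05) -> float:
    """fraction of true-null studies that reach p < alpha when, per study,
    the smallest p over n_options analytical options is reported.
    under h0 each valid p-value is uniform on (0, 1)."""
    hits = 0
    for _ in range(n_studies):
        p_reported = min(random.random() for _ in range(n_options))
        hits += p_reported < alpha
    return hits / n_studies

# nominal alpha = .05, but with 5 independent options the effective alpha is
# 1 - (1 - .05)**5 ≈ .226 -- the norm 1 - p > 1 - alpha is no longer adhered to
effective_alpha = 1 - (1 - 0.05) ** 5
simulated = hacked_rate(20_000, 5)
print(round(effective_alpha, 3), round(simulated, 3))
```

the same arithmetic applies to harking across multiple tested relations: with m tests, the chance that at least one yields p < α under the null is 1 – (1 – α)^m, not α.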
a norm must be used, but which norm is disputable

the usual norm of 1 – p > 1 – α is subject to intense debate. p-values and statistical tests have several interpretational pitfalls and fundamental drawbacks (greenland, 2017a; greenland et al., 2016; wagenmakers, 2007). moreover, they are based on many assumptions along the path from a substantive claim via study design to the produced data and the model that describes them. this relates to "duhem's problem" (ivanova, 2021; mayo, 2018; rakover, 2003), which states that a hypothesis cannot be tested without making assumptions beyond the data. such assumptions often refer to bias and remain intransparent. if true, they would mean that issues like selection, measurement, noncompliance, and unconsidered shared causes of factor and outcome (maclure & schneeweiss, 2001) do not introduce any bias (in the bayesian framework, with a probability of 100 percent; greenland, 2005). we propose that a well-chosen norm is effective in staking out the boundary between the new and the established, considering major sources of bias.

evidential norms and mayo's theory of severe testing

we use mayo's (2018) much debated (gelman et al., 2019) philosophy of "severe testing" to discuss the choice of a norm. for mayo (2018), severity is the probability with which a given test with given data would have found a hypothesis to be wrong if it was truly wrong. a test might have yielded a positive result, but might have been hardly capable of giving a negative result if the claim was wrong. in short: "a test is severe when it is highly capable of demonstrating a claim is false" (lakens, 2019). importantly, this concept is not bound to any specific statistical theory or school (e.g., frequentist vs. bayesian); rather, it is a conceptual framework by which to judge the appropriateness of any evidential norm.
grounded in popper (1959), lakatos (1977) and others, a severe test is difficult to pass and, in case of success, provides evidential support because it could have easily failed. with a non-severe test, a hypothesis is not sufficiently probed; that is, the test was not capable of finding the "flaws or discrepancies of a hypothesis" (mayo, 2018). severity calls for study designs that produce data capable of separating the truth or falsity of a hypothesis from all alternative explanations and thus closely link a hypothesis with the associated empirical observations used in testing it. such awareness should recall the insight, related to duhem's problem, that any single study is incapable of ruling out all alternative assumptions (e.g., greenland, 2005; milde, 2019). it also links to the fundamental question of to what extent truth can be approached (e.g., via "truthlikeness"; cevolani & festa, 2018; niiniluoto, 2020). once a study has been designed and data have been collected, a claim can only be statistically probed. any statistical method is then limited by the study's ability to produce a certain data result (e.g., a high average error rate in a cognitive test) that exceeds an evidential norm if the claim is indeed true (e.g., cognitive impairment is present), but would not be expected to do so under alternative assumptions (e.g., lack of compliance or misunderstanding of the instructions). likewise, mayo's (2018) elaborations on calculating severity rely on this and thus involve only probing against chance (random error), not bias (systematic error). this, however, is the subject of a controversial discussion (gelman et al., 2019). in any case, a general understanding of severity might encourage researchers to reflect on substantive reasons for a claim to be wrong rather than falling prey to self-delusion and hiding behind statistical rituals (gigerenzer, 2018).
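to make the calculation of severity concrete, here is a sketch of post-data severity in the style of mayo (2018) for a one-sample z-test with known σ (all numbers are hypothetical): for the claim μ > μ1, severity is the probability that a sample mean no larger than the observed one would have occurred if μ were exactly μ1.

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def severity(xbar: float, mu1: float, sigma: float, n: int) -> float:
    """post-data severity for the claim mu > mu1 after observing mean xbar:
    the probability of a sample mean <= xbar if mu were exactly mu1."""
    se = sigma / sqrt(n)
    return phi((xbar - mu1) / se)

# hypothetical study: n = 100, sigma = 1, observed mean 0.25 (significant vs mu = 0)
weak_claim = severity(0.25, mu1=0.0, sigma=1.0, n=100)    # claim: mu > 0
strong_claim = severity(0.25, mu1=0.2, sigma=1.0, n=100)  # claim: mu > 0.2
print(round(weak_claim, 3), round(strong_claim, 3))
```

the same data probe different claims with very different severity: the weak claim μ > 0 passes with severity near .99, whereas the stronger claim μ > 0.2 reaches only about .69. as noted above, this probes only against chance, not against bias.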
with regard to the replication debate, the severity framework makes it transparent that qrps like harking and p-hacking create the illusion of greater test severity: a result is sold by covertly exceeding the evidential norm. however, "preregistration makes it possible to evaluate the severity of a test" (lakens, 2019). the framework also sheds light on the limitations of replication. a study design and analytical model might poorly map the phenomenon of interest and barely probe why a hypothesis may be wrong, in which case a wrong result could replicate (devezer et al., 2021; mayo, 2018; steiner et al., 2019). in addition, a finding may indeed replicate (e.g., in a very large sample), but not translate into practical use like an intervention (yarkoni, 2020).

bayesian severity

we propose that evidential norms should be reconsidered along severity considerations to set the right boundaries beyond which exploration should take over. this should involve the capacity of both a study design and an analytical model to probe a substantive hypothesis against alternative non-causal explanations, especially bias. whereas mayo's frequentist-oriented elaborations on calculating severity are not capable of incorporating assumptions beyond the data, elaboration on bayesian severity assessment opens the door to formalising this. although controversial for epistemic reasons (gelman et al., 2019, and papers cited therein; mayo, 2018), severity can be handled in the bayesian framework through a new interpretation. such "falsificationist bayesianism" (gelman et al., 2019; gelman & shalizi, 2013) makes "risky and specific predictions" that could easily turn out to be wrong (van dongen et al., 2020). a prediction might be made, for example, on the "posterior probability" for a claim to be true (given a prior distribution for an effect and, as in frequentist statistics, the data and the model that describes the data).
then, the norm requires that this posterior probability exceed 1 – α for a test to pass. however, to achieve severity it has been argued that one needs to consider how likely a claim already was before seeing the data (mayo, 2018). this shifts the focus to the increment in this probability through data observation (held et al., 2021; wagenmakers et al., 2018). whatever norm is chosen, it must be preregistered to counteract the possibility that its choice, or how it was evaluated, was affected by data inspection. the same is true for the prior distribution, since the posterior distribution may heavily depend on it (gelman et al., 2013). bayesian approaches open up the possibility of addressing two further issues. first, one may probe against scepticism with a "sceptical prior". this expresses the belief in values around 0 before seeing the data (in terms of a normal distribution) and serves the purpose of convincing a sceptic (good, 1950, p. 80 ff.; held, 2020; held et al., 2021). second, bayesian norms could take advantage of the ability, little known in psychology, of bayesian methods to probe causal hypotheses against pure associations via a causal model with explicit assumptions on bias. bias may arise, for instance, from misclassification, selection probabilities and effects of a common cause on factor and outcome, and the absence of major biases is an assumption that seems to hold only in very simple experiments (greenland, 2005, 2009; höfler et al., 2007; lash et al., 2009). in bayesian models, uncertainty about biases can itself be expressed, and this uncertainty carries over to more cautious conclusions about causal effects (greenland, 2005). the assumptions may also be varied so that the sensitivity of the result to these assumptions can be assessed. this informs the readership of how probing against a particular bias scenario versus alternative scenarios relates to meeting a norm (lash et al., 2014; smith et al., 2021; vanderweele & mathur, 2020).
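a minimal sketch of such a bayesian norm under simplifying assumptions (a normal sceptical prior centred at 0 and an approximately normal effect estimate; all numbers are hypothetical): the norm could require the posterior probability of a positive effect to exceed 1 – α = .95.

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def posterior_prob_positive(d: float, se: float, tau: float) -> float:
    """posterior p(effect > 0) for a normal effect estimate d with standard
    error se, combined with a sceptical normal prior n(0, tau^2)
    (standard conjugate normal-normal update)."""
    prior_prec = 1.0 / tau**2
    data_prec = 1.0 / se**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * data_prec * d  # the prior mean of 0 adds nothing
    return phi(post_mean / sqrt(post_var))

# hypothetical effect estimate d = 0.4 (se = 0.15), sceptical prior sd tau = 0.2
prob = posterior_prob_positive(d=0.4, se=0.15, tau=0.2)
passes_norm = prob > 0.95  # evidential norm: posterior probability > 1 - alpha
print(round(prob, 3), passes_norm)
```

here the sceptical prior pulls the estimate towards 0, yet the posterior probability still exceeds .95; a sufficiently sceptical prior (smaller tau) would make the same data fall short of the norm. as stated above, both the norm and the prior would have to be preregistered.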
although seldom used, such bayesian models make assumptions on bias transparent. at the least, they encourage reflection on such mechanisms and generate awareness of which of them require better understanding (e.g., measurement errors). because severity is a fairly new concept, there is much room for the development of rigorous norms beyond frequentist statistical tests. first, the use of confidence intervals (or the bayesian counterpart of "credibility intervals") facilitates identification of severe evidence that the same data may provide for either h1 or h0 (mayo, 2018). second, methodological progress should lead to norms that are based on calculating severity under more defendable assumptions, especially on bias. another way to account for bias is to probe differently against different sources of bias through a range of studies. this embraces the methodical diversity that addresses the requirements of various causal quests in different domains (gigerenzer & marewski, 2015; greenland, 2017b). although it is very difficult to integrate diverse and differently biased evidence (greenland, 2017b), scientists could find ways to approach multi-method norms that advance science through multidimensional confirmation. single studies might nevertheless model unaddressed sources of bias or, at least, be explicit in that they have only been probed against certain biases and thus communicate a better understood piece of evidence.

at the cost of falling short of a norm, exploration opens the door for novelty

the common epistemic price of doing exploration is the possibility of falling short of adherence to a norm. for example, the usual frequentist α may be exceeded: p may be smaller than the nominal α = .05, but through exploration it was effectively only compared with, say, α = .20. this happens with intransparent harking if one explores several outcomes, selects a particular outcome with p < .05 and presents only that outcome (altman et al., 2017).
and, as mentioned, α may be exceeded by p-hacking, since the multiple chances of passing a norm through different analytical options are not taken into account. then, to meet the norm, new data are required: the more exploration has been used, the more data. the second aspect of this price is that the extent of the exceedance quickly becomes incalculable when multiple explorative steps are used. in case of an entire lack of transparency, this may call for a new study that meets the norm on its own. the effortful replication initiatives take this stance (schimmack, 2018). we suggest that exploration should consider "turning all the knobs" (hofstadter & dennett, 1981) around a given hypothesis or model, or when generating new hypotheses or models. this enables the core benefit of exploration: the potential of finding the new, wherever it might be hidden, whatever it might look like. for instance, an explorative quest might include the functional shapes of an effect, factor and outcome categorization, and different effects in subpopulations. we propose that quests should be guided by the following key ideas building on severity:

• the stronger and more specific a claim, the more available it is to severe testing through confirmation (with new data), and thus the greater its ability to advance science if it were true (lakens, 2019).

• claims should be searched for that are likely to pass severe testing with new data.

global versus local claims

another idea to be elaborated in part ii concerns the distinction between global and local claims. as we shall see, this distinction is important in planning and conducting explorations. it deserves to be mentioned here already because it sheds further light on how harking may practically violate the boundary between confirmation and exploration. assume that a set of k factors and a set of l outcomes is explored with regard to factor–outcome associations, and each possible association is analysed with a frequentist level-α test.
in this case, the probability that at least one p-value is smaller than α is high for a large number of tests (k · l). then, globally, the existence of any (at least one) such relation is tested with poor severity. this would not adhere to the norm because multiple testing would be ignored (bender & lange, 2001). the test result for a particular factor–outcome association with p < α, however, could well have been negative. now one may ask to what extent a local claim about the existence of this association is supported. the answer depends on whether the global context can be ignored in substantive terms. the problem is that a claim may be shifted from local to global in the course of data analysis by overgeneralising a single factor and a single outcome as if they represented two latent variables with a relation between them. this is another instance where adherence to a norm is difficult to control without preregistration. part ii will discuss a couple of ambiguities, such as whether to aim at global versus local assertions and how to address them with background knowledge.

transparent exploration serves confirmation and scientific communication

with transparency, explorative research practices are no longer questionable

if scientists become committed to conducting and publishing transparent exploration, there will be less pressure and incentive to blend exploration and confirmation. with transparency regarding exploratory results, communication is no longer harmed by hidden information, and evidence is no longer overstated. transparency also provides the answer to the question of whether the double use of data for confirmation and exploration is problematic. for example, one could misunderstand wagenmakers and colleagues (2012) in this way: "the interpretation of common statistical tests in terms of type i and type ii error rates are valid only if the data were used once and if the statistical test was not chosen on the basis of suggestive patterns in the data".
actually, while exploration must not affect what and how to confirm (barnes, 2008), the data may well be used later to modify a hypothesis through exploration (devezer et al., 2021). double use, when done in this temporal order, is established practice in medical research. for instance, the cohort data from the uk biobank (uk biobank limited, 2022) are made available for exploration by everyone after the confirmatory results have been published. remarkably, having a hypothesis appears to severely impair the ability to detect striking data patterns, compared with not having one (yanai & lercher, 2020), and it would be interesting to assess whether teaching exploration could reduce this effect. we suspect that researchers fear their confirmation trials failing ("all my work will be in vain if i do not confirm my hypothesis!") or, at least, believe that the data may only be used to analyse what has been pre-specified. in the case of non-confirmation of, say, an intervention effect, a researcher might proceed with common practices like subgroup analysis ("is the intervention effective in females and males, respectively?"). this is not problematic as long as a data pattern found in this way does not lead to a confirming assertion ("we confirmed that the intervention is effective in females"), but to a modified hypothesis, yet to be confirmed or not with new data ("we propose the modified hypothesis that the intervention is effective in females.").

concatenated exploration

science has been argued to be most productive if confirmation and exploration co-exist in a "good balance", back and forth from theories to derived claims, study design and data (bogen & woodward, 1988; box, 1980; scheel et al., 2021b). but intransparent harking and p-hacking hinder researchers from recognizing the two (woo et al., 2017). if their difference becomes transparent, exploratively generated or modified hypotheses, models and theories openly invite confirmatory studies.
with "concatenated exploration", stebbins (1992) denotes a cooperative strategy that pays off for all who participate in a "longitudinal research process". such a process may start with exploration (according to popper, new scientific claims may start from anywhere: klayman & ha, 1987; popper, 1959; in lakatos' conception of science, new additions from exploration contribute to the further development of theory: lakatos, 1977). explorative results may give rise to a new claim, a confirmation trial (with perhaps explorative refinement), subsequent studies with confirmation and extensions (maybe using different populations), further adjustment, confirmation and so on. for similar proposals, see behrens (1997), nosek and colleagues (2018) and thompson and colleagues (2020). such a chain process may provide the impetus for the evolution of hypotheses (e.g., excess screen time causes a heightened stress response in adolescents), the development of models (different causes of the stress response and how they interact) or an entire theory (the evolution and development of the stress response). stebbins (2001) describes a range of sociological examples where such a chain of research has advanced science, including postpartum changes in women and the development of women's occupational aspirations across the lifespan. in psychology, concatenated exploration appears to describe the idea behind the very common trial-and-error proceeding in the development of interventions (e.g., the history of origins of dialectical behavior therapy; linehan & wilks, 2015). in addition, an interplay between exploration and confirmation has long been established practice in factor analysis, where the construction of a scale is a chain of setting up a model, testing it, modifying it, testing again and so on (hurley et al., 1997).
researchers who participate in such a chain of research are able to publish at least once and can expect to be cited several times in the currently prevailing quantitative incentive system of publications and citations. in qualitative terms, this commitment to concatenated exploration may be favoured in upcoming new criteria of sustainable scientific achievement (pavlovskaia, 2014; spangenberg, 2011). additionally, given the iterative nature of a research chain, scientists can hope for more confirmed findings, cooperatively generated insights and the ability to look back on research of enduring validity later in life (mckiernan et al., 2016; nosek et al., 2012; pavlovskaia, 2014). such an outlook on incentives for conducting science well, beyond just publishing a paper, might help to counteract behaviour geared towards short-term benefits.

conclusion

several researchers have already called for a major up-valuing of exploration as a complement to confirmation (gonzales & cunningham, 2015; mcintosh, 2017; nosek et al., 2018; scheel et al., 2021b). however, without elaborations on the conceptions and methods, good examples, teaching and implementation practices in the publication system, uncertainty may prevent researchers from abandoning the ritualised (gigerenzer, 2018), almost obsessive, restriction to blended confirmatory research. additional obstacles may hinder more transparent exploration, such as the social barriers that have been described for open science and behavioural change in general (nosek & bar-anan, 2012; zwaan et al., 2018). we believe that transparent exploration is fundamental to the advance of science. a starting point for transparent exploration is an understanding that, to date, the blending of confirmation and exploration has been all too common and that distinguishing these two concepts is vital to the health of science. a sound norm is severe.
adherence to and control over such a norm establish a sharp boundary for the transition of a new assertion into an established one. this promotes scientific communication by requiring that future research be built only on sufficient evidence. in the second part we shall outline how to plan and conduct transparent exploration in practice, setting the goals of "comprehensive exploration" and "efficient exploration", with some ideas on filtering and smoothing data patterns to separate the signals from the noise. we will discuss the roles of preregistration, open data and open analysis. part ii will end with the key points of a research agenda on how to explore in a specific domain and a checklist with recommendations to stakeholders who have the means to establish more transparent exploration in the publication system.

author contact

corresponding author: michael höfler, chemnitzer straße 46, clinical psychology and behavioural neuroscience, institute of clinical psychology and psychotherapy, technische universität dresden, 01187 dresden, germany. michael.hoefler@tu-dresden.de, +49 351 463 36921. orcid: https://orcid.org/0000-0001-7646-8265

conflict of interest and funding

the authors declare that there were no conflicts of interest with respect to the authorship or the publication of this article. stefan scherbaum and philipp kanske are supported by the german research foundation (crc940/a08 and ka4412/2-1, ka4412/4-1, ka4412/5-1, crc940/c07, respectively).

acknowledgements

we wish to thank the three reviewers of our first submission for their detailed comments, in particular matt williams for his epistemic suggestions. several new arguments are based on the reviewers' suggestions. we also thank annekathrin rätsch for help with the references.

author contributions

michael höfler worked out most of the content and had the lead in writing. robert miller and stefan scherbaum contributed to the elaboration of the basic idea and to details.
brennan mcdonald joined in this version, refined epistemic details and was involved in the writing and wording of the entire manuscript. philipp kanske commented on and edited the manuscript.

open science statement

this article is theoretical and as such received no open science badges. the entire editorial process, including the open reviews, is published in the online supplement.

references

aarts, a., anderson, j., anderson, c., attridge, p., attwood, a., axt, j., babel, m., bahník, š., baranski, e., barnett-cowan, m., bartmess, e., beer, j., bell, r., bentley, h., beyan, l., binion, g., borsboom, d., bosch, a., bosco, f., & penuliar, m. (2015). estimating the reproducibility of psychological science. science, 349. 10.1126/science.aac4716

agnoli, f., wicherts, j. m., veldkamp, c. l. s., albiero, p., & cubelli, r. (2017). questionable research practices among italian research psychologists. plos one, 12(3), e0172792. 10.1371/journal.pone.0172792

allen, c., & mehler, d. m. a. (2019). open science challenges, benefits and tips in early career and beyond. plos biology, 17, e3000246. 10.1371/journal.pbio.3000246

altman, d. g., moher, d., & schulz, k. f. (2017). harms of outcome switching in reports of randomised trials: consort perspective. bmj, 356, j396. 10.1136/bmj.j396

banks, g. c., rogelberg, s. g., woznyj, h. m., landis, r. s., & rupp, d. e. (2016). editorial: evidence on questionable research practices: the good, the bad, and the ugly. journal of business and psychology, 31(3), 323–338. 10.1007/s10869-016-9456-7

barnes, e. (2008). the paradox of predictivism. cambridge: cambridge university press. 10.1017/cbo9780511487330

beffara-bret, a., & beffara-bret, b. (2019). open science in european french-speaking countries. retrieved october 18th, 2022, from https://brice-beffara.shinyapps.io/opef/

behrens, j. t. (1997). principles and procedures of exploratory data analysis.
psychological methods, 2(2), 131–160. 10.1037/1082-989x.2.2.131

bender, r., & lange, s. (2001). adjusting for multiple testing – when and how? journal of clinical epidemiology, 54(4), 343–349. 10.1016/s0895-4356(00)00314-0

bian, j., min, j. s., prosperi, m., & wang, m. (2020). are preregistration and registered reports vulnerable to hacking? epidemiology, 31(3), e32. 10.1097/ede.0000000000001162

bogen, j., & woodward, j. (1988). saving the phenomena. philosophical review, 97(3), 303–352. 10.2307/2185445

box, g. e. p. (1980). sampling and bayes inference in scientific modelling and robustness (with discussion and rejoinder). journal of the royal statistical society a, 143(4), 383–430.

camerer, c. f., dreber, a., holzmeister, f. et al. (2018). evaluating the replicability of social science experiments in nature and science between 2010 and 2015. nature human behaviour, 2, 637–644. 10.1038/s41562-018-0399-z

cevolani, g., & festa, r. (2018). a partial consequence account of truthlikeness. synthese, 197, 1627–1646. 10.1007/s11229-018-01947-3

chambers, c., & tzavella, l. (2020). registered reports: past, present and future. preprint at metaarxiv. 10.31222/osf.io/43298

chiacchia, k. (2017, july 12). perverse incentives? how economics (mis-)shaped academic science. hpc wire. retrieved october 26, 2021 from https://www.hpcwire.com/2017/07/12/perverse-incentives-economics-mis-shaped-academic-science/

chu, j. s. g., & evans, j. a. (2021). slowed canonical progress in large fields of science. proceedings of the national academy of sciences, 118(41), e2021636118. 10.1073/pnas.2021636118

claesen, a., gomes, s. l. b. t., tuerlinckx, f., & vanpaemel, w. (2019, may 9). preregistration: comparing dream to reality. retrieved october 14, 2020 from https://psyarxiv.com/d8wex/

devezer, b., navarro, d. j., vandekerckhove, j., & ozge buzbas, e. (2021). the case for formal methodology in scientific reform. royal society open science, 8(3), 200805. 10.1098/rsos.200805

dirnagl, u. (2020).
preregistration of exploratory research: learning from the golden age of discovery. plos biology, 18(3), e3000690. 10.1371/journal.pbio.3000690

eronen, m. i., & bringmann, l. f. (2021). the theory crisis in psychology: how to move forward. perspectives on psychological science, 16(4), 779–788. 10.1177/1745691620970586

fiedler, k. (2017). what constitutes strong psychological science? the (neglected) role of diagnosticity and a priori theorizing. perspectives on psychological science, 12(1), 46–61. 10.1177/1745691616654458

field, a. p., & wilcox, r. r. (2017). robust statistical methods: a primer for clinical psychology and experimental psychopathology researchers. behaviour research and therapy, 98, 19–38. 10.1016/j.brat.2017.05.013

francis, g. (2012). publication bias and the failure of replication in experimental psychology. psychonomic bulletin & review, 19, 975–991. 10.3758/s13423-012-0322-y

franco, a., malhotra, n., & simonovits, g. (2016). underreporting in psychology experiments: evidence from a study registry. social psychological and personality science, 7(1), 8–12. 10.1177/1948550615598377

gelman, a., & loken, e. (2013, november 14). the garden of forking paths: why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. retrieved october 14, 2020 from http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf

gelman, a., carlin, j. b., stern, h. s., dunson, d. b., vehtari, a., & rubin, d. b. (2013). bayesian data analysis (3rd ed.). chapman and hall/crc. 10.1201/b16018

gelman, a., haig, b., hennig, c., owen, a., cousins, r., young, s., robert, c., yanofsky, c., wagenmakers, e. j., kenett, r., & lakeland, d. (2019). many perspectives on deborah mayo's "statistical inference as severe testing: how to get beyond the statistics wars".
retrieved november 2, 2021 from http://www.stat.columbia.edu/ gelman/research/unpublished/mayor eviews2.pd f gelman, a., & shalizi, c. r. (2013). philosophy and the practice of bayesian statistics. british journal of mathematical and statistical psychology, 66, 8-38. 10.1111/j.2044-8317.2011.02037.x gigerenzer, g. (2010). personal reflections on theory and psychology. theory psychology, 20(6), 733–743. 10.1177/0959354310378184 gigerenzer, g. (2018). statistical rituals: the replication delusion and how we got there. advances in methods and practices in psychological science, 1(2), 198–218. 10.1177/0959354310378184 gigerenzer, g., & marewski, j. n. (2015). surrogate science: the idol of a universal method for scientific inference. journal of management, 41(2), 421–440. 10.1177/0149206314547522 glass, d. j., & hall, n. (2008). a brief history of the hypothesis. cell, 134(3): 378–381. 10.1016/j.cell.2008.07.033 goldin-meadow, s. (2016, august 31). why preregistration makes me nervous. retrieved october 14, 2020, from http://www.psychologicalscience.org/observer/why-preregistration-makes-me-nervous gonzales, j. e., & cunningham, c. a. (2015, august). the promise of preregistration in psychological research. psychological science agenda. retrieved october 14, 2020 from https://www.apa.org/science/about/psa/2015/08/preregistration good, i. j. (1950). probability and the weighing of evidence. griffin. gopalakrishna, g., riet, g. t., cruyff, m. j., vink, g., stoop, i., wicherts, j. m., & bouter, l. (2021a, july 6). prevalence of questionable research practices, research misconduct and their potential explanatory factors: a survey among academic researchers in the netherlands. 10.31222/osf.io/vk9yt gopalakrishna, g., wicherts, j. m., vink, g., stoop, i., van den akker, o., riet, g. t., & bouter, l. (2021b, july 6). prevalence of responsible research practices and their potential explanatory factors: a survey among academic researchers in the netherlands. 
10.31222/osf.io/xsn94 greenland, s. (2005). multiple-bias modeling for analysis of observational data (with discussion). journal of the royal statistical society, series a, 168, 267-306. 10.1111/j.1467-985x.2004.00349.x greenland, s. (2009). bayesian perspectives for epidemiologic research: iii. bias analysis via missing-data methods. international journal of epidemiology, 38(6), 1662-73. 10.1093/ije/dyp278 greenland, s. (2017a). invited commentary: the need for cognitive science in methodology, american journal of epidemiology, 186(6), 639–645. doi. 10.1111/j.1467-985x.2004.00349.x greenland, s. (2017b). for and against methodologies: some perspectives on recent causal and statistical inference debates. european journal of epidemiology, 32(1), 3-20. 10.1007/s10654-017-0230-6 greenland, s., senn, s. j., rothman, k. j., carlin, j. b., poole, c., goodman, s. n., & altman, d. g. (2016). statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. european journal of epidemiology, 31(4), 337–350. 10.1007/s10654-016-0149-3 hardwicke, t. e., & ioannidis, j. p. a. (2018). mapping the universe of registered reports. nature human behaviour, 2, 793–796. 10/gf9db hardwicke, t. e., thibault, r. t., kosie, j., wallach, j. d., kidwell, m. c., & ioannidis, j. (2020). estimating the prevalence of transparency and reproducibility-related research practices in psychology (2014-2017). metaarxiv. 10.31222/osf.io/9sz2y 12 head, m. l., holman, l., lanfear, r., kahn, a. t., & jennions, m. d. (2015). the extent and consequences of p-hacking in science. plos biology, 13(3), e1002106. 10.1371/journal.pbio heers, m. (2020). preregistration and registered reports. fors guide no. 09, version 1.0. lausanne: swiss centre of expertise in the social sciences fors. 10.24449/fg-2020-00009 held, l. (2020). a new standard for the analysis and design of replication studies. journal of the royal statistical society, 183(2), 431–448. 
10.1111/rssa.12493 held, l., matthews, r., ott, m., & pawel, s. (2021). reverse-bayes methods for evidence assessment and research synthesis. research synthesis methods. 10.1002/jrsm.1538 höfler, m., lieb, r., & wittchen, h. u. (2007). estimating causal effects from observational data with a model for multiple bias. international journal of methods in psychiatric research, 16(2), 77–87. 10.1002/mpr.205 hofstadter, d. r., & dennett, d. c. (1981). the mind’s i: fantasies and reflections on self and soul. new york: basic books. hollenbeck, j. r., & wright, p. m. (2017). harking, sharking, and tharking: making the case for post hoc analysis of scientific data. journal of management, 43(1), 5–18. 10.1177/0149206316679487 hurley, a. e., scandura, t. a., schriesheim, c. a., brannick, m. t., seers, a., vandenberg, r. j., & williams, l. j. (1997). exploratory and confirmatory factor analysis: guidelines, issues, and alternatives. journal of organizational behavior, 18(6), 667-683. 10.1002/(sici)1099-1379(199711)18:6<667::aid-job874>3.0.co;2-t ivanova, m. (2021). duhem and holism. elements in the philosophy of science. 10.1017/9781009004657 john, l. k., loewenstein, g., prelec, d. (2012). measuring the prevalence of questionable research practices with incentives for truth telling. psychological science, 23, 524–532. kerr, n. l. (1998). "harking: hypothesizing after the results are known". personality and social psychology review, 2(3), 196–217. 10.1207/s15327957pspr02034 klayman, j., & ha, y.-w. (1987). confirmation, disconfirmation, and information in hypothesis testing. psychological review, 94(2), 211–228. 10.1037/0033-295x.94.2.211 lakatos, i. (1977). the methodology of scientific research programmes: philosophical papers volume 1. cambridge university press, cambridge. lakens, d. (2019). the value of preregistration for psychological science: a conceptual analysis. 10.31234/osf.io/jbh4w lash, t. l., fox, m. p., maclehose, r. f., maldonado, g., mccandless, l. 
c., & greenland, s. (2014). good practices for quantitative bias analysis. international journal of epidemiology, 43(6), 1969–1985. 10.1093/ije/dyu149 lash, t. l., fox, m. p., & fink, a. k. (2009). applying quantitative bias analysis to epidemiologic data. new york, ny: springer. lewandowsky, s., & oberauer, k. (2020). low replicability can support robust and efficient science. nature communications, 11(1), 1-12. 10.1038/s41467-019-14203-0 linehan, m. m., & wilks, c. r. (2015). the course and evolution of dialectical behavior therapy. american journal of psychotherapy, 69(2), 97-110. 10.1176/appi.psychotherapy.2015.69.2.97 locke, e. a. (2007). the case for inductive theory building. journal of management, 33, 867-890. 10.1177/0149206307307636 maclure, m., & schneeweiss, s. (2001). causation of bias: the episcope. epidemiology 12(1),114-22. 10.1097/00001648-200101000-00019. masicampo, e. j., & lalande, d. r. (2012). a peculiar prevalence of p values just below .05, the quarterly journal of experimental psychology, 65(11), 2271-2279. 10.1080/17470218.2012.711335 mayo, d. g. (2018). statistical inference as severe testing: how to get beyond the statistics wars. cambridge: cambridge university press. 10.1017/9781107286184 mcintosh, r. d. (2017). exploratory reports: a new article type for cortex. cortex, 96, a1–a44. 10.1016/j.cor-tex.2017.07.014 mckiernan, e. c., bourne, p. e., brown, c. t., buck, s., kenall, a., lin, j., mcdougall, d., nosek, b. a., ram, k., soderberg, c. k., spies, j. r., thaney, k., updegrove, a., woo, k. h., & yarkoni, t. (2016). how open science helps researchers succeed. elife, 5, e16800. 10.7554/elife.16800 milde, c. (2019). what can be concluded from statistical significance? severe testing as an appealing extension to our standard toolkit (ssrn scholarly paper id 3413808). social science research network. 10.2139/ssrn.3413808 13 myers, a., & hansen, c. h. (2012). experimental psychology, 7th edition. 
pacific grove, ca: wadsworth/thomson learning. niiniluoto, i. (2020). truthlikeness: old and new debates. synthese, 197(4), 1581–1599. 10.1007/s11229-018-01975-z nosek, b. a., & bar-anan, y. (2012). scientific utopia: i. opening scientific communication. psychological inquiry, 23(3), 217–243. 10.1080/1047840x.2012.692215 nosek, b. a., ebersole, c. r., dehaven, a. c., & mellor, d. t. (2018). the preregistration revolution. pnas proceedings of the national academy of sciences of the united states of america, 115(11), 2600–2606. 10.1073 nosek, b. a., & lindsay, d. s. (2018, february 2). preregistration becoming the norm in psychological. science. aps observer, 31(3). retrieved october 14, 2020 from https://www.psychologicalscience.org/observer/preregistration-becoming-the-norm-in-psychological-science nosek, b. a., spies, j. r., & motyl, m. (2012). scientific utopia: ii. restructuring incentives and practices to promote truth over publishability. perspectives on psychological science, 7, 615-31. 10.1177/1745691612459058 open science collaboration. (2015). "estimating the reproducibility of psychological science" (pdf). science, 349 (6251), aac4716. 10.1126/science.aac4716 pavlovskaia, e. (2014). sustainability criteria: their indicators, control, and monitoring (with examples from the biofuel sector). environmental sciences in europe, 26, 17. 10.1186/s12302-014-0017-2 popper, k. (1959). the logic of scientific discovery. basic books. rakover, s. s. (2003). experimental psychology and duhem’s problem. journal for the theory of social behaviour, 33(1), 45–66. 10.1111/1468-5914.00205 rosenthal, r. (1979). the "file drawer problem" and tolerance for null results, psychological bulletin, 86(3), 838-641. rubin, m. (2017). when does harking hurt? identifying when different types of undisclosed post hoc hypothesizing harm scientific progress. review of general psychology, 21(4), 308-320. 10.1037/gpr0000128 rubin, m. (2019). the costs of harking. 
british journal for the philosophy of science, 73(2). 10.1093/bjps/axz050 rubin, m. (2020). does preregistration improve the credibility of research findings? the quantitative methods for psychology, 16(4), 376–390. 10.23668/psycharchives.4839 scargle, j. (2000). publication bias: the "file-drawer" problem in scientific inference, journal of scientific exploration, 14(1), 91-106. scheel, a. m., schijen, m. r. m. j., & lakens, d. (2021a). an excess of positive results: comparing the standard psychology literature with registered reports. advances in methods and practices in psychological science, 4(2). 10.1177/25152459211007467 scheel, a. m., tiokhin, l., isager, p. m., & lakens, d. (2021b). why hypothesis testers should spend less time testing hypotheses. perspectives on psychological science, 16(4), 744–755. 10.1177/1745691620966795 schimmack, u. (2018). the replicability revolution. behavioral and brain sciences, 41, (e147) 10.1017/s0140525x18000833 smith, l. h., mathur, m. b., & vanderweele, t. j. (2021). multiple-bias sensitivity analysis using bounds. epidemiology, 32(5), 625–634. 10.1097/ede.0000000000001380 scott, s. k. (2013). preregistration would put science in chains. retrieved october 14, 2020 from https://www.timeshighereducation.com/comment/opinion/preregistration-would-put-science-inchains/2005954.article simmons, j. p., nelson, l. d., & simonsohn, u. (2012). a 21 word solution (october 14, 2012). retrieved february 11, 2021 from https://ssrn.com/abstract=2160588 10.2139/ssrn.2160588 simonsohn, u., simmons, j. p., & nelson, l. d. (2020). specification curve analysis. nature human behavior, 4, 1208-1214. 10.1038/s41562-020-0912-z spangenberg, j. (2011). sustainability science: a review, an analysis and some empirical lessons. environmental conservation, 38(3), 275-287. 10.1017/s0376892911000270 spector, p. e. (2015). induction, deduction, abduction: three legitimate approaches to organizational research. 
video lecture for consortium for advancement of research methods and analysis. university of north dakota (https://razor.med.und.edu/carma/video). 14 stebbins, r. a. (1992). concatenated exploration: notes on a neglected type of longitudinal research. quality & quantity, 26, 435-442. 10.1007/bf00170454 stebbins, r. a. (2001). exploratory research in the social sciences. thousand oaks, calif: sage publications. 10.4135/9781412984249 stebbins, r. a. (2006). concatenated exploration: aiding theoretic memory by planning well for the future. journal of contemporary ethnography, 35(5), 483-494. 10.1177/0891241606286989 steiner, p. m., wong, v. c., & anglin, k. (2019). a causal replication framework for designing and assessing replication efforts. zeitschrift für psychologie, 227(4), 280-292. 10.1027/2151-2604/a000385 stroebe, w. (2019). what can we learn from many labs replications? basic and applied social psychology, 41(2), 91–103. 10.1080/01973533.2019.1577736 szollosi, a., & donkin, c. (2021). arrested theory development: the misguided distinction between exploratory and confirmatory research. perspectives on psychological science, 16, 717 724. 10.1177/1745691620966796 szollosi, a., kellen, d., navarro, d. j., shiffrin, r., van rooij, i., van zandt, t., & donkin, c. (2020). is preregistration worthwhile? trends in cognitive sciences, 24(2), 94–95. 10.4135/9781412984249 thompson, w. h., wright, j., & bissett, p. g. (2020). point of view: open exploration. elife, 9, (e52157). 10.7554/elife.52157 ulrich, r., & miller, j. (2020). questionable research practices may have little effect on replicability. elife, 9, (e58237). 10.7554/elife.58237 uk biobank limited. uk biobank. (2022). retrieved october 18, 2022 from https://www.ukbiobank.ac.uk/ van dongen, n. n. n., wagenmakers, e., & sprenger, j. (2020, december 16). a bayesian perspective on severity: risky predictions and specific hypotheses. 10.31234/osf.io/4et65 vanderweele, t. j., & mathur, m. b. (2020). 
commentary: developing best-practice guidelines for the reporting of e-values. international journal of epidemiology, 49 (5), 1495 1497. 10.1093/ije/dyaa094 wagenmakers, e.-j. (2007). a practical solution to the pervasive problems of p values. psychonomic bulletin review, 14(5), 779–804. 10.3758/bf03194105 wagenmakers, e. j., & dutilh, g. (2016). seven selfish reasons for preregistration. aps observer, 29(9). https://www.psychologicalscience.org/observer/seven-selfish-reasons-for-preregistration wagenmakers, e. j., wetzels, r., borsboom, d., van der maas, h. j. l., & kievit, r. a. (2012). an agenda for purely confirmatory research. perspectives on psychological science, 7, 632–638. 10.1177/1745691612463078 wagenmakers, e.-j., marsman, m., jamil, t., ly, a., verhagen, j., love, j., selker, r., gronau, q. f., šmíra, m., epskamp, s., matzke, d., rouder, j. n., & morey, r. d. (2018). bayesian inference for psychology. part 1: theoretical advantages and practical ramifications. psychonomic bulletin review, 25(1), 35–57. 10.3758/s13423-017-1343-3 woo, s. e., o’boyle, e. h., & spector, p. e. (2017). best practices in developing, conducting, and evaluating inductive research [editorial]. human resource management review, 27(2), 255–264. 10.1016/j.hrmr.2016.08.004 yamada, y. (2018). how to crack preregistration: toward transparent and open science. frontiers in psychology, 9, 1831. 10.3389/fpsyg.2018.0183 yanai, i., & lercher, m. a. (2020). a hypothesis is a liability. genome biology, 21, 23. 10.1186/s13059-020-02133-w yarkoni, t. (2020). implicit realism impedes progress in psychology: comment on fried (2020).psychological inquiry, 31, 326-333. 10.1080/1047840x.2020.1853478. zwaan, r. a., etz, a., lucas, r. e., & donnellan, m. b. (2018). making replication mainstream.behavioral and brain sciences, 41. 
meta-psychology, 2023, vol 7, mp.2020.2472 https://doi.org/10.15626/mp.2020.2472 article type: original article published under the cc-by4.0 license open data: not applicable open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: felix d. schönbrodt reviewed by: lerche v., rousselet g., wilcox, r. analysis reproduced by: lucija batinović all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/mek4u

another warning about median reaction time

jeff miller, university of otago

contrary to the warning of miller (1988), rousselet and wilcox (2020) argued that it is better to summarize each participant's single-trial reaction times (rts) in a given condition with the median than with the mean when comparing the central tendencies of rt distributions across experimental conditions. they acknowledged that median rts can produce inflated type i error rates when conditions differ in the number of trials tested, consistent with miller's warning, but they showed that the bias responsible for this error rate inflation could be eliminated with a bootstrap bias correction technique. the present simulations extend their analysis by examining the power of bias-corrected medians to detect true experimental effects and by comparing this power with the power of analyses using means and regular medians. unfortunately, although bias-corrected medians solve the problem of inflated type i error rates, their power is lower than that of means or regular medians in many realistic situations. in addition, even when conditions do not differ in the number of trials tested, the power of tests (e.g., t-tests) is generally lower using medians rather than means as the summary measures.
thus, the present simulations demonstrate that summary means will often provide the most powerful test for differences between conditions, and they show which aspects of the rt distributions determine the size of the power advantage for means.

keywords: reaction time, power, means, medians, within-subjects comparisons

introduction

in typical reaction time (rt) experiments, researchers collect many rts per participant in each condition, and these are then compared via repeated-measures t-tests or anovas. when researchers want to determine whether the central tendencies of the rts differ between conditions, they face the problem of how to summarize the many within-condition rts per participant into a single number for use in the repeated-measures test. various summary measures have been used for this purpose, most commonly the means and medians of the within-condition rts for each participant. miller (1988) warned that when rt distributions are skewed, as they usually are, median rts are biased, and this bias is larger when the number of trials per condition is small. he therefore recommended that medians not be used when comparing conditions with different numbers of trials, because the larger bias in the condition with fewer trials could make that condition appear slower even when the rt distributions are identical in both conditions. rousselet and wilcox (2020; henceforth, r&w) recently disputed this recommendation based on an extensive series of simulations examining means, medians, and several other summary measures. in particular, they used a standard percentile bootstrap bias correction procedure (e.g., efron, 1979; efron and tibshirani, 1993) and found that it successfully eliminated the bias problem identified by miller (1988). in brief, their procedure estimates the median bias as the difference between the average median across many bootstrap samples and the observed median.
the observed median is then corrected by subtracting this estimated bias, and the result of this subtraction is taken as the bias-corrected median estimate (for further details, see rousselet and wilcox, 2020). based on the success of this correction procedure, among other aspects of their analysis, r&w concluded that "the recommendation by miller (1988) to not use the median when comparing distributions that differ in sample size was ill-advised" (p. 31). their conclusions have been influential in encouraging researchers to analyze median rts (e.g., gordon et al., 2020; maksimenko et al., 2019; thornton and zdravković, 2020). the present article reexamines the use of mean rt, median rt, and bias-corrected median rt as summary measures for the central tendency of an individual participant's rts observed in a particular experimental condition, focusing on the statistical power of each summary measure. it is obviously desirable to use a summary measure that provides as much power as possible while staying within the chosen type i error rate.¹ in particular, the present simulations sought to identify the summary method that would provide the greatest power when comparing condition means of the summary scores across participants via parametric tests (e.g., t-tests or anovas), as is most commonly done. although this question has been examined previously, power appears to be sometimes higher for means and sometimes higher for medians (e.g., ratcliff, 1993; rousselet and wilcox, 2020), and there has been no clear characterization of the conditions under which each one is superior. the primary simulations reported in this article used the ex-gaussian distribution as an ad hoc descriptive model of rt, because this simple distribution generally provides good fits to observed rt distributions (e.g., luce, 1986; hohle, 1965).
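r&w's own implementation is described in their paper; purely as an illustration, here is a minimal numpy sketch of the percentile bootstrap bias correction just described (the function and variable names are mine, not r&w's), using 200 bootstrap samples as in the simulations reported below.

```python
import numpy as np

def bias_corrected_median(x, n_boot=200, rng=None):
    """percentile bootstrap bias correction for the sample median
    (an illustrative sketch of the procedure described in the text,
    not r&w's code)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    m = np.median(x)
    # resample with replacement and take the median of each bootstrap sample
    boot = rng.choice(x, size=(n_boot, x.size), replace=True)
    bias_hat = np.median(boot, axis=1).mean() - m  # estimated bias of the median
    return m - bias_hat                            # i.e., 2*m minus the mean bootstrap median
```

averaged over many small samples from a skewed distribution, the corrected estimate sits much closer to the true median than the raw sample median does.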
the ex-gaussian can be conceived of as the sum of two independent random variables: one is a normal with mean µ and standard deviation σ, the other is an exponential with mean τ, and the overall mean rt is the sum of µ and τ. examples of these distributions are shown in figure 1, which illustrates that the exponential τ parameter reflects the skewness of the rt distribution, that is, the length of the long tail of slow responses characteristically seen in real rt data (burbeck and luce, 1982; luce, 1986; hohle, 1965). the flexibility of the ex-gaussian in describing distributions with different amounts of skew makes it a useful model for simulations investigating type i error rates, because these rates depend on skew (e.g., miller, 1988). in addition to the ex-gaussian, simulations were also carried out using four other statistical models of rt distributions to make sure that the obtained results were not idiosyncratic to the ex-gaussian. specifically, these were the ex-wald distribution (e.g., schwarz, 2001), the shifted lognormal distribution, the shifted gamma distribution, and the three-parameter (i.e., shifted) weibull distribution. as illustrated with the examples in figure 2, these are all similar to observed rt distributions in that they are skewed with a long tail at the high end. for each of the different ex-gaussian distributions examined, parallel simulations of 1,000 experiments were also carried out with each of these alternative distributional models. for these parallel simulations, the parameters of each alternative distribution were adjusted so that the alternative distribution matched the corresponding ex-gaussian at the 5th, 50th, and 95th percentile points, so i will refer to these as the "percentile-matched" distributions. to foreshadow the results, the patterns obtained with all of these percentile-matched distributions closely matched the presented patterns obtained with the ex-gaussian.
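because the ex-gaussian is just the sum of a normal and an independent exponential deviate, it is straightforward to simulate; a small numpy sketch (function name mine) using the figure 1 reference parameters µ = 400, σ = 50, τ = 200:

```python
import numpy as np

def exgauss(n, mu=400.0, sigma=50.0, tau=200.0, rng=None):
    """draw n ex-gaussian rts as normal(mu, sigma) + exponential(tau)."""
    rng = np.random.default_rng(rng)
    return rng.normal(mu, sigma, n) + rng.exponential(tau, n)

rts = exgauss(200_000, rng=1)
# the mean is mu + tau (600 ms here), while the long right tail leaves
# the median (about 545 ms for these parameters) well below the mean
```

the gap between the sample mean and sample median of such draws is exactly the skewness property that drives the median bias discussed below.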
more specifically, although the relative performance of the mean, median, and bias-corrected median summary measures depends strongly on rt skewness, it depends hardly at all on the precise underlying distribution family producing that skewness. the ex-gaussian and other skewed distributions are helpful not only in describing single rt distributions but even more so in describing the effects of experimental manipulations on these distributions. observed rt distributions can easily differ in ways that are too complex to summarize in a single measure of central tendency such as a mean, so other descriptors of distributional changes can provide useful clues about the causes of experimental effects (e.g., balota et al., 2008; balota and yap, 2011; heathcote et al., 1991). besides being of interest in their own right, these distributional differences may also have implications for the choice of the most appropriate measure of central tendency to be used when that is the research focus. one possibility, illustrated by the pair of ex-gaussians on the left of figure 1, is that the experimental manipulation shifts the distribution to the right in the slower condition, which is described within the ex-gaussian model by an increase in the µ parameter with no change in skewness. for example, using a spatial simon paradigm (e.g., hommel, 2011), luo and proctor (2018) asked participants in their experiment 1 to respond with the left versus right hand to red versus green squares that appeared irrelevantly to the left or right of fixation. even though location was irrelevant, responses were faster when the square appeared on the same side as the required response than when it appeared on the opposite side. at the distributional level, this rt difference was well described as a shift effect reflected entirely in the µ parameter, with no change in skewness (τ).
figure 1. example probability density functions (pdfs) and cumulative distribution functions (cdfs) for three ex-gaussian distributions differing in µ and τ, all with σ = 50. a reference distribution with µ = 400 and τ = 200 (solid lines, mean 600 ms, median 544.82 ms) is shown in all panels to facilitate visualization of the effects of changing µ versus τ. the comparison distributions (dotted lines, both with mean 700 ms) differ with respect to either µ (left panels, median 644.82 ms) or τ (right panels, median 612.11 ms).

¹ r&w also evaluated different summary measures with respect to various criteria for identifying "the typical value of a distribution, which provides a good indication of the location of the majority of observations" (p. 2). i will not address those criteria in the present article, but only consider the value of the measures for standard hypothesis testing, which is a very common statistical procedure with such data.

another possibility, illustrated by the pair of ex-gaussians on the right side of figure 1, is that the experimental manipulation stretches the tail of the rt distribution in the slower condition, essentially increasing its skew, which can be described as an effect entirely on τ. for example, in their experiment 3, luo and proctor (2018) asked participants to respond with the left versus right hand to red versus green arrows that pointed irrelevantly to the left or right, and responses were faster when the arrow pointed to the same side as the required response than when it pointed to the opposite side. this time, however, the rt difference was mainly due to a stretched tail, with the increased skew reflected in a larger τ and little change in µ. since the introduction of the ex-gaussian by hohle (1965), many studies have examined the shifting versus tail-stretching effects of various experimental manipulations on the shapes of rt distributions as described in terms of µ and τ.
both µ and τ are typically larger in the slower condition than in the faster one, indicating that most experimental manipulations have both shifting and stretching effects, in varying mixtures. there is unfortunately no consensus about the psychological meanings of changes in these different parameters, because there are at best weak associations between particular types of experimental manipulations and shifting versus stretching effects (e.g., matzke and wagenmakers, 2009; rieger and miller, 2020), but the ex-gaussian distribution nevertheless remains useful as a way of describing changes in the shapes of rt distributions as well as their means.

figure 2. example probability density functions (pdfs) for the different rt distribution families examined. the ex-gaussian distribution has parameters µ = 400, σ = 50, and τ = 200. the parameters of the other distributions were adjusted to match the ex-gaussian at the 5th, 50th, and 95th percentile points, leading to the following values: ex-wald: wald µ = 399.9 and σ = 50.9, with exponential τ = 199.9; shifted lognormal: µ = 312.9 and σ = 215.6, with a shift of c = 287.2; shifted gamma: µ = 246.3 and σ = 206.1, with a shift of c = 353.0; weibull: µ = 239.8 and σ = 205.2, with an offset of c = 359.5.

for the present purposes, the distinction between shifting and stretching effects is relevant because, as will be seen, statistical tests based on means, medians, and bias-corrected medians are especially different in their power to detect stretching effects.
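the contrast between shifting and stretching can be checked by simulation: adding 100 ms to µ moves the median by the full 100 ms, whereas adding 100 ms to τ, which raises the mean by the same amount, moves the median far less because most of the added probability mass goes into the tail. a numpy sketch (illustrative only, not the article's simulation code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # large samples, so the sample medians are precise

def exgauss(mu, sigma, tau):
    """ex-gaussian draws: normal(mu, sigma) + exponential(tau)."""
    return rng.normal(mu, sigma, n) + rng.exponential(tau, n)

base    = np.median(exgauss(400, 50, 200))  # reference: mean 600 ms
shifted = np.median(exgauss(500, 50, 200))  # mu + 100: mean 700 ms
stretch = np.median(exgauss(400, 50, 300))  # tau + 100: mean 700 ms
# the shift moves the median by the full 100 ms (about 545 -> 645 ms),
# whereas the equal-mean tail stretch moves it much less (about 612 ms)
```

the three sample medians reproduce the values quoted in the figure 1 caption (544.82, 644.82, and 612.11 ms) to within sampling error.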
type i error rates

for completeness, and to make the simulation process more concrete, this section briefly reviews the well-established fact that type i error rates are inflated by sample-size-dependent bias when medians are used to compare rts across conditions with unequal numbers of trials (which i will call unequal trial "frequencies" rather than "sample sizes", to avoid confusion with the number of participants). this bias is an artifact that would contaminate comparisons of conditions with different trial frequencies if medians were used to summarize the rts in each condition. originally, comparisons of such conditions were used particularly in studies of the main effects of stimulus and response probability (e.g., hyman, 1953), attentional cuing (e.g., posner et al., 1978), and expectancy (e.g., mowrer et al., 1940; zahn and rosenthal, 1966). in addition, trial frequencies have often been varied across conditions to explore a variety of cognitive processes by investigating their interactions with probability (e.g., broadbent and gregory, 1965; den heyer et al., 1983; miller and pachella, 1973; sanders, 1970; theios et al., 1973). currently, trial frequencies are commonly varied in studies of spatial and temporal statistical learning (e.g., flowers et al., 2021; gibson et al., 2021; liesefeld and müller, 2021; vadillo et al., 2021), the modulation of attentional control processes by environmental contingencies (e.g., cochrane et al., 2021; huang et al., 2021; kang and chiu, 2021), action-outcome contingency learning (e.g., gao and gozli, 2021), adaptation to the frequency of congruent versus incongruent information (e.g., bausenhart et al., 2021; ivanov and theeuwes, 2021; thomson et al., 2021), and between-task resource sharing (e.g., miller and tang, 2021), to name just a few areas. unfortunately, median bias is still sometimes overlooked and may contaminate published comparisons of conditions with different trial frequencies (e.g., bulger et al., 2021).
as noted by miller (1988) and confirmed by r&w's table 2, sample medians are biased with skewed distributions, and the bias is greater when the number of trials is smaller. if medians are used to compare conditions with different trial frequencies, this bias causes the type i error rate to be inflated, perhaps seriously. specifically, the low-frequency condition will often appear to be significantly slower than the high-frequency condition, even if the true rt distributions are identical in the two conditions. a simple simulation of 5,000 experiments illustrates the problem. in each simulated experiment, rts were generated for 60 participants. each participant was tested for 51 trials in the "frequent" condition and 5 trials in the "infrequent" condition, with odd numbers of trials used so that the median of each sample would be the unique middle score. the null hypothesis was always true; that is, rts for both conditions were sampled from the same underlying ex-gaussian distribution with µ = 400, σ = 50, and τ = 200 shown in figure 1. within each simulated experiment, the rts sampled for each participant were summarized by computing the median in each condition. using these medians as the dependent variable, a paired t-test comparing the means of these medians was then computed across the 60 participants, with α = 0.05, two-tailed. since the null hypothesis was true in the simulated experiments, one would theoretically expect approximately 5% significant results (i.e., type i errors) by chance, with half of these yielding significantly larger scores in the frequent condition and half significantly larger scores in the infrequent condition. however, the simulation actually produced 17.8% type i errors in which the infrequent condition appeared slower versus only 0.1% in which the frequent condition appeared slower.
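the simulation just described is easy to reproduce in miniature; the sketch below (my own illustration, not the article's code) uses fewer replications than the reported 5,000 and a hand-coded paired t-test, so the exact percentages will wobble, but the lopsided pattern of type i errors emerges clearly:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, tau = 400.0, 50.0, 200.0
n_exp, n_subj = 400, 60  # 400 experiments here (the text used 5,000) to keep it quick
t_crit = 2.001           # approximate two-tailed .05 critical t for df = 59

def exgauss(shape):
    return rng.normal(mu, sigma, shape) + rng.exponential(tau, shape)

slower_infrequent = slower_frequent = 0
for _ in range(n_exp):
    med_freq = np.median(exgauss((n_subj, 51)), axis=1)  # 51 trials per participant
    med_inf  = np.median(exgauss((n_subj, 5)),  axis=1)  # 5 trials, same distribution
    d = med_inf - med_freq                               # the null is true by construction
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n_subj))     # paired t statistic
    if t > t_crit:
        slower_infrequent += 1   # "significant": infrequent condition looks slower
    elif t < -t_crit:
        slower_frequent += 1
# type i errors pile up in one direction: the infrequent condition is
# declared slower far more often than the nominal 2.5%
```

replacing `np.median` with `np.mean` (or with a bias-corrected median) in the two summary lines brings both directional error rates back to roughly 2.5%.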
thus, in accordance with the warning of miller (1988), comparing the means of participant/condition median rts produced far too many type i errors in the direction that would lead researchers to conclude that responses are slower in the infrequent condition. the inflated type i error rate for medians arises for purely statistical reasons. as described in the appendix, the full sampling distribution of the sample median can be computed numerically using the known properties of order statistics (i.e., the median of the smaller sample is the third order statistic in a sample of five, and the median of the larger sample is the 26th order statistic in a sample of 51), and these sampling distributions are shown in figure 3. crucially, the means of these sampling distributions are 561.4 and 546.7, respectively, so the long-run mean of the smaller-sample medians really is larger than that of the larger-sample medians. the t-test results simply reflect this true difference in average medians for samples of these two sizes from this distribution. in comparison, across exactly the same simulated datasets, using each participant's condition mean or bias-corrected median as the summary measure produced approximately 2.5% type i errors in each direction, as expected. parallel simulations were carried out to determine the extent of error rate inflation under a variety of different simulation conditions, and representative results are shown in figure 4. the different simulation conditions used: (a) varying numbers of trials n in the infrequent condition (the frequent condition always had 51 trials), as shown along the horizontal axis; (b) ex-gaussian (or corresponding percentile-matched) distributions with different values of µ and τ to vary the degree of skewness, shown as different lines; and (c) 30 or 60 participants in the experiment, shown in the left or right panels.
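the order-statistic computation described above can be sketched numerically as follows (a sketch under the stated parameters, using the standard ex-gaussian pdf and cdf; the integration grid is my own choice, and the appendix's exact method may differ):

```python
import math

MU, SIGMA, TAU = 400.0, 50.0, 200.0

def phi(z):
    """standard normal cdf."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def exg_pdf(x):
    """ex-gaussian density."""
    z = (x - MU) / SIGMA
    return (math.exp(SIGMA ** 2 / (2 * TAU ** 2) - (x - MU) / TAU)
            * phi(z - SIGMA / TAU) / TAU)

def exg_cdf(x):
    """ex-gaussian cumulative distribution function."""
    z = (x - MU) / SIGMA
    return phi(z) - math.exp(SIGMA ** 2 / (2 * TAU ** 2) - (x - MU) / TAU) \
        * phi(z - SIGMA / TAU)

def expected_median(n):
    """mean of the middle order statistic (k = (n+1)/2) for odd n,
    by trapezoidal integration of x times the order-statistic density."""
    k = (n + 1) // 2
    c = n * math.comb(n - 1, k - 1)
    lo, hi, steps = 100.0, 4000.0, 40000
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        f_cum = exg_cdf(x)
        density = c * f_cum ** (k - 1) * (1.0 - f_cum) ** (n - k) * exg_pdf(x)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * x * density * dx
    return total

e5 = expected_median(5)    # 3rd order statistic of a sample of 5
e51 = expected_median(51)  # 26th order statistic of a sample of 51
print(round(e5, 1), round(e51, 1))  # the text reports 561.4 and 546.7
```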
the vertical axis shows the proportion of simulated experiments in which researchers would reject the null hypothesis and conclude that responses were slower in the infrequent condition. since scores in both conditions were actually always drawn from the same distribution, these would again be type i errors in that direction. the type i error rates for the median analyses can far exceed the appropriate 2.5% with small ns in the infrequent condition, whereas the error rates for the means do not. bias-corrected medians (here, as in all simulations in this article, based on 200 bootstrap samples) also produced appropriate error rates, replicating r&w's results. very similar patterns of type i error rates were obtained in the simulations with the other four percentile-matched distributions used as rt models (i.e., ex-wald, shifted lognormal, etc.). for example, across the 32 simulation conditions shown in figure 4, the average type i error rate for the median was 6.7% for the ex-gaussian, whereas it ranged from 6.1% to 6.7% with the other four distributions. similarly, the type i error rate exceeded 15% for all distributions in the worst case (i.e., the simulation with 60 participants, five trials in the infrequent condition, and the most-skewed distribution percentile-matched to the ex-gaussian with µ = 350 and τ = 250). meanwhile, the type i error rates for the mean and bias-corrected medians were always around 2% for these other distributions, just as they were with the ex-gaussian (fig. 4). thus, the finding of inflated type i errors for medians seems relatively independent of the precise shape of the skewed rt distribution.

figure 3. probability density function (pdf) for the theoretical sampling distribution of the median for samples of five and 51 trials from an ex-gaussian distribution with µ = 400, σ = 50, and τ = 200, together with the mean of each sampling distribution (561.4 for samples of five trials; 546.7 for samples of 51 trials).

the simulations presented so far have all used pure, uncontaminated rt distributions, but there are reasons to suspect that observed rt distributions contain occasional outliers (e.g., ratcliff, 1993, ulrich and miller, 1994), perhaps because the participant's attention momentarily wanders away from the task. it is an empirical question whether the results shown in figure 4 would change markedly if the simulations included outliers. for example, since means are more affected by extreme scores than medians, the type i error rates associated with mean-based analyses might be inflated when outliers are included. to look at the effects of outliers, additional simulations were conducted using each of the different rt models already introduced. these simulations included either 2% or 4% outliers, and each outlier was formed by summing an rt from the uncontaminated distribution with a random number distributed uniformly between 0–1,000 ms to reflect a distraction delay. such outliers had hardly any influence on the type i error rates obtained using means, medians, or bias-corrected medians, so it seems unlikely that outliers in real rt data would reduce the type i error rate advantage for means and bias-corrected medians relative to regular medians. r&w acknowledged the problem of inflated type i errors when using sample medians for comparing population means (e.g., with t-tests), and their figure 10b even shows simulation results displaying the problem. nonetheless, they essentially dismissed this problem because "the bias can be strongly attenuated by using a percentile bootstrap bias correction" (p. 31), a procedure that was not considered by miller (1988).
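the bias-correction procedure referred to in that quotation can be sketched as follows (a minimal sketch of a standard bootstrap bias correction for the median, with 200 resamples as in the present simulations; the rt values are hypothetical and this is not r&w's code):

```python
import random
import statistics

def bias_corrected_median(rts, n_boot=200, seed=0):
    """subtract the bootstrap-estimated bias from the sample median."""
    rng = random.Random(seed)
    med = statistics.median(rts)
    boot_medians = [statistics.median(rng.choices(rts, k=len(rts)))
                    for _ in range(n_boot)]
    bias = statistics.fmean(boot_medians) - med
    return med - bias  # equivalently, 2 * med - mean(boot_medians)

rts = [432.1, 380.5, 951.7, 410.0, 505.3]  # five hypothetical rts (ms)
print(round(bias_corrected_median(rts), 1))
```

for right-skewed samples the bootstrap medians tend to sit above the sample median, so the correction typically pulls the estimate downward.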
indeed, their figure 10c shows that the bootstrap bias correction completely cures the type i error rate problem, as is also shown in the present figure 4. thus, it is reasonable to consider the bias-corrected median as a possible summary measure of rts, and the next step is to check its power.

power

given that bias correction solves the median's problem of type i error rate inflation, it is tempting to suspect that bias-corrected medians would be preferable to means, because the median is often the preferred measure of central tendency with skewed distributions. (a note on the outlier simulations above: ratcliff (1993) introduced outliers varying uniformly between 0–2,000 ms, but responses delayed by 1,000–2,000 ms would presumably be identified and excluded by commonly-used outlier rejection techniques.)

figure 4. proportions of "infrequent mean larger" type i errors obtained when using means, medians, or bias-corrected medians to compare conditions with different numbers of trials n in the infrequent condition, with an expected type i error rate in this direction of 0.025 based on α = 0.05. each point indicates the proportion of significantly larger means in the infrequent condition across 10,000 simulated experiments with the indicated number of participants. the true distribution was always an ex-gaussian with σ = 50. its value of µ was 350, 400, 450, or 500, with τ = 600 − µ. there were always 51 trials in the frequent condition, and the true underlying rt distributions were always identical ex-gaussians in the frequent and infrequent conditions. for the bias-corrected medians, 200 bootstrap samples were used to correct the median separately for each simulated participant/condition pair.

contrary to this intuition, however, ratcliff (1993) reported that regular medians provide less statistical power than means. r&w acknowledged ratcliff's report, but they downplayed it because of the small trial frequencies used in ratcliff's analysis.
in addition, it remains an open question how the power of bias-corrected medians compares with that of means. the present simulations investigated these issues. fortunately, it is easy to compare the power of means versus bias-corrected medians using simulations similar to those described above for assessing type i error rates. instead of using the same rt distribution for the two conditions being compared, one simply uses different distributions and checks the proportion of simulated experiments yielding a statistically significant difference; this proportion is an estimate of statistical power. to model the different types of experimental effects for which researchers might test, one can allocate the rt increase in the slower condition in different proportions to shifting versus skewing (i.e., tail-stretching) effects on the rt distribution. within the ex-gaussian rt model, this amounts to increases in the µ versus τ parameters, and changes in other parameters produce comparable shifting versus stretching effects within the other rt distribution models. the first set of power simulations examined the ability of the different summary measures to reveal a true between-condition rt difference in experiments where the two conditions had unequal trial frequencies, and the results of these simulations are displayed in figure 5. regular medians would not be appropriate in this situation because of the type i error rate problem described in the previous section, so these simulations only compared the power of tests using means and bias-corrected medians. naturally, these two types of tests were compared under identical simulation conditions, and in fact identical samples of simulated rts were always analyzed with the two summary measures. in total, there were 32 simulation conditions using ex-gaussian rt distributions, corresponding to the 32 points shown in figure 5, for each of the mean-based and bias-corrected median-based tests.
in all 32 simulation conditions, 51 rts per participant were sampled from the faster condition, and the true mean rt in the faster condition was 600 ms. the 32 conditions were formed as the factorial combination of eight different dataset sizes and four conditions differing with respect to rt skewness. the eight dataset sizes consisted of 30 or 60 participants factorially combined with 5, 9, 17, or 33 trials in the slower condition. the four skewness conditions were formed using two amounts of skewness of the rt distribution in the faster condition (i.e., µf = 350 and τf = 250, or µf = 500 and τf = 100, with σ = 50 in both cases) and by allocating the rt increase in the slower condition either 25% to µ and 75% to τ, or the reverse. thus, in different simulation conditions the faster rt distribution was either more or less skewed to begin with, and the mean rt difference between conditions arose either mostly from shifting the distribution in the slower condition or mostly from stretching it. finally, the true mean rt difference between the fast and slow conditions was adjusted individually for each of the 32 simulation conditions to produce an intermediate power level (i.e., approximately 25%–75%) for tests using means as the summary measure. intermediate power levels are desirable because they provide the best opportunity to observe power differences between means and bias-corrected medians; with very low or high power levels, the differences between analysis methods are compressed by floor or ceiling effects. across the 32 simulation conditions, the true mean difference varied from 9–41 ms.
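the power-estimation recipe just described can be sketched as follows (my own illustration with assumed parameter values: a fixed 40 ms effect allocated 25% to µ and 75% to τ, 17 trials in the slower condition, a hard-coded critical t value, and far fewer replications and bootstrap resamples than the article's simulations):

```python
import random
import statistics

def exg(mu, tau, n, rng, sigma=50.0):
    return [rng.gauss(mu, sigma) + rng.expovariate(1.0 / tau) for _ in range(n)]

def bc_median(x, rng, n_boot=60):
    """median minus its bootstrap-estimated bias (few resamples, for speed)."""
    med = statistics.median(x)
    boot = [statistics.median(rng.choices(x, k=len(x))) for _ in range(n_boot)]
    return 2.0 * med - statistics.fmean(boot)

def paired_t(d):
    return statistics.fmean(d) / (statistics.stdev(d) / len(d) ** 0.5)

def estimate_power(n_reps=250, n_participants=30, seed=2):
    rng = random.Random(seed)
    t_crit = 2.045  # two-tailed alpha = .05 critical t for df = 29
    hits_mean = hits_bc = 0
    for _ in range(n_reps):
        d_mean, d_bc = [], []
        for _ in range(n_participants):
            fast = exg(400, 200, 51, rng)   # frequent condition, mean 600 ms
            slow = exg(410, 230, 17, rng)   # infrequent condition, mean 640 ms
            d_mean.append(statistics.fmean(slow) - statistics.fmean(fast))
            d_bc.append(bc_median(slow, rng) - bc_median(fast, rng))
        hits_mean += abs(paired_t(d_mean)) > t_crit
        hits_bc += abs(paired_t(d_bc)) > t_crit
    return hits_mean / n_reps, hits_bc / n_reps

p_mean, p_bc = estimate_power()
print(p_mean, p_bc)  # mean-based tests detect this tail-stretch effect more often
```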
not surprisingly, the results shown in figure 5 indicate that the power of t-test comparisons increases with the number of participants and the number of trials per participant; in fact, these power increases are even more dramatic than shown, because the true differences were adjusted to smaller values with larger datasets in order to avoid ceiling effects on power. more critically, the results also show a clear power advantage for using means rather than bias-corrected medians. thus, although the bias-correction procedure lets the median do as well as the mean with respect to type i errors (fig. 4), this summary measure seems to have much less power than the simpler option of using means. the advantage for mean-based testing depends little either on the number of trials in the slower condition or on the skewness of rts in the faster condition (i.e., µf = 350 versus µf = 500). it is clearly larger, however, when the experimental effect arises mainly from a tail-stretching effect (i.e., ∆µ/∆ = 0.25; upper panels) rather than from a shifting effect (i.e., ∆µ/∆ = 0.75; lower panels). indeed, further simulations (not shown) indicate that the power of mean-based analyses is only slightly higher than that of analyses using bias-corrected medians when slowing is almost entirely due to a shift (i.e., ∆µ/∆ = 0.95). the reasons for this pattern will become clearer after the next set of simulations, which reinforce and extend it.

figure 5. power of mean- and bias-corrected median-based tests for true differences in ex-gaussian rt distributions in experiments comparing a faster condition with 51 trials per participant against a slower condition with fewer trials per participant. each point indicates the proportion of significant results across 5,000 simulated experiments (α = 0.025, one-tailed). simulation conditions also differed with respect to the skewness of the faster rt distribution (µf = 350 and τf = 250, or µf = 500 and τf = 100) and the proportion of the total rt slowing (∆) associated with the µ parameter (∆µ/∆ = 0.25 or 0.75). for the bias-corrected medians, 200 bootstrap samples were used to correct the median separately for each simulated participant/condition pair.

once again, the results of the simulations with the other, percentile-matched rt distributions closely match those of the ex-gaussian rt distributions, with these simulations also showing greater power for mean-based testing. (in the corresponding 32 simulation conditions using each of the percentile-matched distributions, the parameters of those distributions were adjusted as needed to match their percentiles to those of the ex-gaussians used in the fast and slow conditions.) for example, across the 32 simulation conditions in figure 5, the average power levels of the mean- and bias-corrected median-based tests were 0.58 and 0.32, respectively. with the other distributions, the average power for means ranged from 0.55–0.58, and the average power for bias-corrected medians ranged from 0.26–0.31. similarly, across all distribution types and all simulation conditions, the minimum and maximum power levels ranged from 0.30–0.37 and 0.77–0.80 respectively for means, whereas these ranges extended from 0.11–0.12 and 0.51–0.61 for bias-corrected medians. in further simulations including 2% or 4% outliers of the same type used in the earlier type i error rate simulations, power decreased for both mean- and bias-corrected median-based tests, but average power across simulation conditions was still more than 10% higher for the mean than for the bias-corrected median with all distributions. in view of the fact that mean-based rt summaries have demonstrably greater power than bias-corrected median-based summaries for experiments with unequal trial frequencies (fig.
5), it is also sensible to compare power levels in experiments with equal trial frequencies. as noted by miller (1988) and r&w, regular medians are not associated with type i error rate inflation in this situation, because they are equally biased in both conditions, so regular medians can also be considered an appropriate summary of single-trial rts in this case. it is, however, useful to compare the power of all three candidate measures of central tendency (i.e., means, medians, and bias-corrected medians). figure 6 shows the results of simulations analogous to those shown in figure 5, except with equal numbers of trials per participant in the faster and slower conditions; naturally, power again increased in the simulation conditions with more participants and trials, even though these conditions had smaller true mean differences to avoid ceiling effects. power is consistently lower for bias-corrected medians than for regular medians, suggesting that the bias correction should not be used with equal trial frequencies. mean-based tests again have the most power, although the power difference between means and medians depends heavily on whether the experimental manipulation has mostly a shifting or a tail-stretching effect. as can be seen in the upper panels of figure 6, means have substantially more power than medians when a minority of the rt difference results from a change in µ (i.e., ∆µ/∆ = 0.25). the power advantage for means is much reduced when a majority of the rt difference results from a change in µ (i.e., ∆µ/∆ = 0.75), and medians can actually have slightly more power when the rt difference is a pure shift (i.e., ∆µ/∆ = 1.00; not shown). the same qualitative patterns are evident in figs. 14–16 of r&w. overall, the pattern of greater power for mean-based testing shown in figure 6 was again consistent across distributions and outlier conditions.
averaging across the different dataset sizes and skewness combinations shown in the figure, the average power levels of mean-based testing ranged across distributions from 0.55–0.58, whereas the ranges for median- and bias-corrected median-based testing were 0.38–0.45 and 0.27–0.33, respectively. the presence of 2% or 4% outliers reduced these average power levels overall, but average power was still largest for means (ranging across distributions from 0.46–0.49 and 0.41–0.43 with 2% and 4% outliers, respectively), second-largest for medians (ranging from 0.37–0.43 and 0.35–0.41), and smallest for bias-corrected medians (ranging from 0.26–0.31 and 0.26–0.30). why does using participant mean rts as the summary measure have so much more power when the experimental effect is mostly a stretch in the slow tail? the main reason is simply that power increases with effect size, as is true for all statistical tests. consider two conditions whose true mean rts differ by 40 ms. in that case, the expected difference in mean rts between those two conditions is 40 ms, regardless of how the effect is distributed between shifting and stretching and regardless of how many trials there are per participant in each condition. the situation is far more complicated for differences in medians, however, as is illustrated with the ex-gaussian distribution in figure 7. figure 7a shows the expected value of the difference between the medians of the fast and slow conditions (∆mdn) as a function of (a) how much of the 40 ms mean rt difference is produced by changes in µ versus τ (i.e., ∆µ versus ∆τ), and (b) how many trials per participant are tested in each condition. critically, the expected difference between medians is always less than the 40 ms expected difference in means, and it is far less when the conditions differ mostly in τ (i.e., ∆µ = 10 and ∆τ = 30) rather than mostly in µ (i.e., ∆µ = 30 and ∆τ = 10), particularly when the number of trials is large.
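the shrinkage of the expected median difference can be illustrated directly (a monte carlo sketch with assumed parameters: a 40 ms effect split as ∆µ = 10 and ∆τ = 30, with 50 trials per condition; figure 7's exact curves come from numerical computation instead):

```python
import random
import statistics

def exg(mu, tau, n, rng, sigma=50.0):
    return [rng.gauss(mu, sigma) + rng.expovariate(1.0 / tau) for _ in range(n)]

def average_differences(n_trials=50, n_reps=10000, seed=7):
    """long-run mean and median differences between two conditions."""
    rng = random.Random(seed)
    d_mean = d_median = 0.0
    for _ in range(n_reps):
        fast = exg(400, 200, n_trials, rng)  # mean 600 ms
        slow = exg(410, 230, n_trials, rng)  # mean 640 ms (delta mu = 10, delta tau = 30)
        d_mean += statistics.fmean(slow) - statistics.fmean(fast)
        d_median += statistics.median(slow) - statistics.median(fast)
    return d_mean / n_reps, d_median / n_reps

dm, dmd = average_differences()
print(round(dm, 1), round(dmd, 1))  # mean difference near 40 ms; median difference smaller
```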
the fact that the numerical differences are larger for means than for medians strongly suggests that tests using means would have more power. in theory, medians could provide more statistical power despite their smaller effect size in milliseconds if they had much smaller standard errors. they do not, however, as is clear in figure 7b, which shows the corresponding ratios of the standard error of the difference in means to the standard error of the difference in medians. these ratios are quite close to 1.0, which means that the standard errors of the means and medians are nearly equal in all of these cases. figure 7c shows the comparison of means versus medians plotted in terms of cohen's d, a standard effect size measure. effect sizes increase with the number of trials, as expected, because the standard errors of the sample statistics (i.e., mean and median) decrease as the number of trials increases. more importantly, it is clear that effect sizes are larger for means than for medians across all conditions, and this is the source of the power advantage for means. (the results shown in figure 7 were obtained by computation rather than by simulation, using methods explained in the appendix.)

figure 6. power of mean-, median-, and bias-corrected median-based tests for true differences in ex-gaussian rt distributions in experiments comparing faster and slower conditions with equal numbers of trials per participant. the parameters other than the numbers of trials per condition are the same as those in figure 5.

in essence, when much of an experimental manipulation's effect is to stretch the long upper tail of the rt distribution, the median's relative insensitivity to this part of the distribution eliminates part of the very between-condition difference that the researcher is looking for.
this is particularly ironic because insensitivity to skew is often cited as one of the median's benefits, and it is supposed to make the median especially tempting with skewed distributions (e.g., hays, 1973, marascuilo, 1971). (the same problem would arise with trimmed means, though to a lesser extent, because trimming also reduces the contribution of the high end of the rt distribution, where the condition difference is greatest.) as noted by yule (1911), for example, "the median may [italics in original] sometimes be preferable to the mean, owing to its being less affected by abnormally large or small values of the variable" (p. 120), although he also commented that the median's "limitations render the applications of the median in any work in which theoretical considerations are necessary comparatively circumscribed" (p. 119).

figure 7. a: expected difference between fast- and slow-condition median rts (∆mdn) for two conditions whose true means differ by 40 ms, as a function of the number of trials per participant in each condition and of the division of the 40 ms effect between the µ and τ parameters of the ex-gaussian rt distribution (i.e., ∆µ = 10 and ∆τ = 30, ∆µ = 20 and ∆τ = 20, or ∆µ = 30 and ∆τ = 10). in all cases, the expected difference between the mean rts of these conditions is 40 ms. b: the ratio of the standard error of the difference in means (σmn) to the standard error of the difference in medians (σmdn), illustrating that the standard errors of the differences are approximately equal. c: cohen's d for testing the condition effect using means (dmn, thick lines) versus medians (dmdn, thin lines) as the summaries of individual-participant performance in each condition.
as the present simulations show, however, means can have much higher power to detect between-condition rt differences when experimental manipulations increase skewness, as they often do (e.g., heathcote et al., 1991, hockley, 1984, hockley and corballis, 1982, luo and proctor, 2018, mewhort et al., 1992, moutsopoulou and waszak, 2012, possamaï, 1991, singh et al., 2018). in a re-analysis of datasets from seven published articles, for example, rieger and miller (2020) found significant (p < 0.05) increases in τ in 15 of 25 different statistical comparisons involving various distinct experimental manipulations. evidence from research on bilingualism also suggests that the rt advantage for bilinguals is mostly due to a reduced number of very long rts, and that the power of bilingual/monolingual comparisons diminishes greatly when long rts are not considered (zhou and krott, 2016).

distribution of differences

in addition to comparing the effectiveness of means, medians, and bias-corrected medians as summaries of individual-participant rts, r&w also compared three different methods of testing for a significant difference between conditions after a summary measure had been obtained for each participant in each condition. they did this using simulations based on a "g&h" distribution (see below). one method was to conduct a one-sample t-test on the individual-participant between-condition differences in the summary scores. this method is equivalent to testing with a repeated-measures anova or a paired t-test as in the present simulations (e.g., fig. 4), which appear to be the most common methods of testing for overall rt differences between conditions. the second method was to conduct a test on 20% trimmed means, that is, a test excluding the participants with the most extreme between-condition differences. finally, the third method was to test whether the median of the participants' between-condition differences differed from zero.
it is important to realize that r&w's g&h simulations comparing the three different methods of testing for differences in summary measures address a different question from that of how the individual rts of a given participant should be summarized in the first place. specifically, comparing hypothesis testing procedures addresses the question of how best to test for a significant effect of conditions after summarizing the original individual-participant rts in each condition. this is a different question because researchers could initially summarize individual-trial rts with any of the summary methods (i.e., means, medians, bias-corrected medians) and then subsequently test for condition differences with any of the hypothesis testing methods (i.e., t-test, 20% trimmed means test, median test). in principle, any one of these nine options could provide the most statistical power. thus, the conclusions of the present simulations comparing different summary methods are specific to t-tests, and these simulations might have a different outcome if the summary measures were compared across conditions with some other method. in their comparison of different hypothesis testing procedures using the g&h distribution, r&w did not distinguish between the three different methods of summarizing individual-trial rts (i.e., means, medians, bias-corrected medians). in fact, they only generated a single random number for each simulated participant, and this number represented the difference (i.e., condition effect) for that participant summarized from the individual-participant rts with "any type of differences between means, medians or any other quantities" (p. 17). these individual-participant difference scores were generated from "g&h" distributions, which allow convenient parametric variation of distribution skewness and kurtosis (i.e., tail heaviness) through the g and h parameters, respectively.
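for concreteness, the g&h family is tukey's g-and-h transformation of a standard normal variate; a sketch follows (the parameter values here are arbitrary, chosen only to show that g adds skew on top of the heavy tails produced by h):

```python
import math
import random
import statistics

def g_and_h_sample(g, h, n, seed=0):
    """x = ((exp(g*z) - 1) / g) * exp(h * z^2 / 2), z standard normal;
    g controls skewness and h controls tail heaviness."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        body = z if g == 0 else (math.exp(g * z) - 1.0) / g
        out.append(body * math.exp(h * z * z / 2.0))
    return out

def skewness(xs):
    """standardized third moment of a sample."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])

symmetric_heavy = g_and_h_sample(g=0.0, h=0.1, n=20000)  # heavy tails, no skew
skewed_heavy = g_and_h_sample(g=0.5, h=0.1, n=20000)     # heavy tails plus skew
print(round(skewness(symmetric_heavy), 2), round(skewness(skewed_heavy), 2))
```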
although it might seem more appropriate to simulate single-trial rts and examine all nine possible analysis combinations (i.e., 3 summary methods × 3 hypothesis testing methods), it is not clear how to do that realistically. even assuming that all of the individual-participant rt distributions were ex-gaussians, the participants would surely differ in their distribution parameters and in their between-condition differences in these parameters (e.g., effects on µ and τ). the distribution of individual-participant difference scores would be heavily influenced by this participant-to-participant variation as well as by the choice of summary method, but there does not yet exist an appropriate model for this individual variation. thus, it was not unreasonable for r&w to model the final distribution of individual-participant difference scores directly with the g&h distribution rather than attempting to specify a model in which these difference scores would emerge from varying individual rt distributions under each summary method. r&w's simulations comparing the effectiveness of the different hypothesis testing methods produced two particularly important results (e.g., their figs. 12 and 13). first, each of the hypothesis testing methods tends to lose power when the distribution of participant-to-participant difference scores is more skewed or has heavier tails (i.e., larger kurtosis). second, this tendency to lose power with increasing skew or kurtosis is much stronger for the t-test than for the tests using trimmed means or medians. naturally, then, r&w suggested that researchers should consider carefully the amount of skew and kurtosis in their distributions of participant-to-participant difference scores when deciding which procedure to use in testing for a condition effect.
although r&w's simulations comparing hypothesis testing methods do not speak directly to the question of how the individual-participant rts should be summarized in the first place, as was mentioned earlier, one might suspect that they do so indirectly. in particular, their results suggest that researchers should prefer the summary measure for which the participant-to-participant difference scores are the least skewed and have the lightest tails. intuitively, it might seem reasonable to assume that medians, by virtue of their reduced sensitivity to extreme scores, would produce difference score distributions that are less skewed and have lighter tails than those produced by means, but this assumption must be checked empirically. to do that, i examined the two large, publicly available rt datasets of ferrand et al. (2010) and hutchison et al. (2013), both involving lexical decision tasks. in both datasets, responses to words were faster than responses to nonwords, which provided a convenient condition effect to examine. since these are real datasets, they have realistic trial-to-trial rt variability and participant-to-participant variability in condition effects, which obviates the need to specify a formal model for either source of variability. thus, i computed three separate nonword-minus-word difference scores for each participant: one each using the participant's condition mean rts, condition median rts, and bias-corrected condition median rts. the normalized frequency distributions of these difference scores for the two datasets, tabulated across 944 and 503 participants, respectively, are shown in figure 8. perhaps somewhat counterintuitively, the empirical distributions of individual-participant difference scores shown in figure 8 are both less skewed (smaller values of skew and g) and lighter tailed (smaller values of kurtosis and h) when the difference scores are computed from mean rts than when they are computed from either of the median-based summary measures.
in combination with r&w's finding of greater power with less skewed and lighter-tailed difference score distributions, this pattern provides clear evidence that researchers would have more power when using means rather than medians to summarize rts. based on r&w's results, it seems that this would be true regardless of which hypothesis testing procedure was used, but it appears that the mean's advantage would be especially large with standard t-tests or anovas. a distinctive feature of the spp (semantic priming project) and flp (french lexicon project) datasets, relative to many published studies, is that there were unusually many trials in each condition. one might therefore wonder whether the results shown in figure 8 would generalize to datasets with fewer trials per condition (perhaps because there were more conditions). to examine this issue, i conducted simulations with smaller random subsets of the rts for each participant in each condition. to increase the stability of the simulation results, 20 subsets of a given number of rts were randomly selected for each participant, selecting without replacement within each subset but with replacement across subsets (because there were not enough rts to sample without replacement for the larger subsets). for each randomly selected subset of rts from one participant, the condition effect was computed using each of the three summary measures (i.e., mean, median, bias-corrected median). finally, across all simulated subsets for a given number of rts, the distribution of condition effects was analyzed using the same computations as those shown in figure 8 for the full datasets. figure 9 shows the results of these simulations, which nicely extend the results obtained with hundreds of trials per participant in each condition (fig. 8) to datasets with smaller numbers of trials.
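the subsetting scheme just described can be sketched for a single participant and condition as follows (names and data values are mine; the real analysis drew 20 subsets per participant from the spp and flp rts):

```python
import random
import statistics

def subset_summaries(rts, subset_size, n_subsets=20, seed=5,
                     summary=statistics.median):
    """summarize several subsets, each drawn without replacement within
    the subset but independently across subsets."""
    rng = random.Random(seed)
    return [summary(rng.sample(rts, subset_size)) for _ in range(n_subsets)]

# hypothetical single participant: 300 ex-gaussian-like rts
rng = random.Random(1)
one_participant = [rng.gauss(400, 50) + rng.expovariate(1 / 200)
                   for _ in range(300)]
subset_medians = subset_summaries(one_participant, subset_size=10)
print(len(subset_medians), round(statistics.fmean(subset_medians), 1))
```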
with virtually any number of trials per condition per participant selected from these real datasets, the between-participant difference score distributions would be less skewed (i.e., smaller skewness and g) and less heavy-tailed (i.e., smaller kurtosis and h) when differences were computed from mean rts than when they were computed from medians or bias-corrected medians. thus, as with the full datasets, these results, in combination with r&w's demonstration of greater power with less skew and lighter tails, provide a further argument for using the mean to summarize the central tendency of observed rts.

conclusions

r&w concluded that "there seems to be no rationale for preferring the mean over the median as a measure of central tendency for skewed distributions" (p. 31). on the contrary, when performing hypothesis tests to compare the central tendencies of rts between experimental conditions, the present simulations show that there may be an extremely clear rationale involving both type i error rate and statistical power. when comparing conditions with unequal numbers of trials, the sample-size-dependent bias of regular medians can lead to clear inflation of the type i error rate (fig. 4), so these medians definitely should not be used. means and bias-corrected medians are both free of this bias and thus have acceptable type i error rates, so either could be considered as a possible summary measure in this situation. means clearly have greater power (fig. 5) than bias-corrected medians in most situations, however, which would nearly always make them the preferred choice. when comparing conditions with equal numbers of trials, means, medians, and bias-corrected medians all have appropriate type i error rates, so any of these might be the preferred summary measure in this situation. bias-corrected medians always seem to have less power than regular medians, however, so here the choice is really between means and regular medians, depending on which of those has the higher power. as can be seen in figure 6, the answer depends on how the experimental manipulation affects skewness. thus, to choose between means and medians as the summary measure maximizing power, researchers must consider the effect of the experimental manipulation at the level of the rt distribution.

figure 8. normalized histograms of individual-participant rt difference scores computed from three different summary rt measures in the lexical decision task datasets from the semantic priming project (a, c, e; hutchison et al., 2013) and the french lexicon project (b, d, f; ferrand et al., 2010). each participant's observed 800–1,000 word and nonword rts were first summarized by computing the mean, median, or bias-corrected median, and the nonword minus word difference was then computed for each measure. the histograms depict the frequency distributions of these differences across participants, with the skewness (skew) and kurtosis (kurt) of the observed difference scores shown on each panel, together with the g and h parameters of the best-fitting (maximum likelihood) g&h distribution:
- a (spp, mean): skew = 1.82, kurt = 10.88, g = 0.32, h = 0.11
- b (flp, mean): skew = 0.75, kurt = 4.43, g = 0.30, h = 0.09
- c (spp, median): skew = 2.77, kurt = 19.51, g = 0.36, h = 0.14
- d (flp, median): skew = 1.11, kurt = 5.29, g = 0.38, h = 0.10
- e (spp, bias-corrected median): skew = 2.74, kurt = 19.31, g = 0.36, h = 0.14
- f (flp, bias-corrected median): skew = 1.12, kurt = 5.34, g = 0.39, h = 0.10
figure 9. measures of skewness and kurtosis, plus maximum-likelihood estimates of parameters g and h, as a function of the number of trials per condition in the lexical decision task datasets from the semantic priming project (hutchison et al., 2013) and the french lexicon project (ferrand et al., 2010). random subsets of the indicated n of trials per condition were taken for each participant, and parameters were estimated as in figure 8.

the results in figure 6 suggest that the two measures will have approximately equal power when rt skewness is unaffected by the manipulation, whereas medians will have greater power if skewness decreases in the slower condition and means will have greater power if skewness increases in the slower condition. although the ex-gaussian τ is one way of assessing skewness, it is not always necessary to estimate ex-gaussian parameters from rt distributions. instead, one can use a simpler skewness measure, namely the difference between the mean and median of rt, as a proxy for τ. if this difference is smaller in the slower condition than the faster one, that is a sign that power will be better using medians. on the other hand, if this difference is larger in the slower condition, power will be better using means. an important caveat concerning the choice of summary measure is that this choice should not be made based on the data being analyzed. to avoid the inflation of type i error rate that arises when researchers try multiple alternative analyses in the attempt to obtain significant results (i.e., “p-hacking”; simmons et al., 2011), researchers must choose the best summary measure in advance, based on theoretical considerations regarding the expected effect, on prior experience with similar experimental manipulations, or on pilot data. it would be inappropriate to decide whether to analyze mean or median rts based on whichever gave the larger effect in a given dataset, because this would inflate the researcher’s type i error rate.
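as a concrete illustration of this mean-minus-median proxy, the hypothetical helper below chooses between means and medians from pilot data in advance of the main analysis. the function name and the simulated ex-gaussian parameters are illustrative, not part of the article.

```python
import numpy as np

def choose_summary_measure(pilot_fast, pilot_slow):
    """decision rule from the text: (mean - median) is a proxy for
    skewness (roughly the ex-gaussian tau). if the proxy grows in the
    slower condition, means should have more power; if it shrinks,
    medians should."""
    proxy_fast = np.mean(pilot_fast) - np.median(pilot_fast)
    proxy_slow = np.mean(pilot_slow) - np.median(pilot_slow)
    return "mean" if proxy_slow >= proxy_fast else "median"

# simulated pilot data: the slower condition has a heavier exponential tail,
# so skewness (and the mean-minus-median proxy) increases in that condition
rng = np.random.default_rng(7)
fast = rng.normal(500, 50, 2000) + rng.exponential(80, 2000)
slow = rng.normal(520, 50, 2000) + rng.exponential(140, 2000)
print(choose_summary_measure(fast, slow))
```

note that this choice is made on pilot data only; fixing it before seeing the main dataset is what keeps the type i error rate intact.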
author contact

address correspondence to jeff miller, department of psychology, university of otago, dunedin, new zealand. electronic mail may be sent to miller@psy.otago.ac.nz.

acknowledgements

i am grateful to wolf schwarz, patricia haden, veronika lerche, guillaume rousselet, and rand wilcox for helpful comments on earlier versions of this article, and to ludovic ferrand for providing the raw data from the french lexicon project.

conflict of interest and funding

the author declares that he had no conflicts of interest with respect to the authorship or publication of this article.

author contributions

jeff miller: conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing original draft, writing review & editing, visualization, supervision, project administration, funding acquisition.

code availability

the ex-gaussian, ex-wald, shifted lognormal, shifted gamma, and weibull distributions used in this code are part of the cupid package available at https://github.com/milleratotago/cupid.

open science practices

this article earned the open materials badge for making materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

arnold, b. c., balakrishnan, n., & nagaraja, h. n. (1992). a first course in order statistics. wiley.
balota, d. a., yap, m. j., cortese, m. j., & watson, j. m. (2008). beyond mean response latency: response time distributional analyses of semantic priming. journal of memory & language, 59(4), 495–523. https://doi.org/10.1016/j.jml.2007.10.004
balota, d. a., & yap, m. j. (2011). moving beyond the mean in studies of mental chronometry: the power of response time distributional analyses. current directions in psychological science, 20(3), 160–166. https://doi.org/10.1177/0963721411408885
bausenhart, k. m., ulrich, r., & miller, j. o. (2021). effects of conflict trial proportion: a comparison of the eriksen and simon tasks. attention, perception, & psychophysics, 83(2), 810–836. https://doi.org/10.3758/s13414-020-02164-2
broadbent, d. e., & gregory, m. h. p. (1965). on the interaction of s-r compatibility with other variables affecting reaction time. british journal of psychology, 56, 61–67. https://doi.org/10.1111/j.2044-8295.1965.tb00944.x
bulger, e., shinn-cunningham, b. g., & noyce, a. l. (2021). distractor probabilities modulate flanker task performance. attention, perception, & psychophysics, 83(2), 866–881. https://doi.org/10.3758/s13414-020-02151-7
burbeck, s. l., & luce, r. d. (1982). evidence from auditory simple reaction times for both change and level detectors. perception & psychophysics, 32, 117–133. https://doi.org/10.3758/bf03204271
cochrane, a., simmering, v., & green, c. s. (2021). modulation of compatibility effects in response to experience: two tests of initial and sequential learning. attention, perception, & psychophysics, 83(2), 837–852. https://doi.org/10.3758/s13414-020-02181-1
den heyer, k., briand, k. a., & dannenbring, g. l. (1983). strategic factors in a lexical-decision task: evidence for automatic and attention-driven processes. memory & cognition, 11, 374–381.
efron, b. (1979). computers and the theory of statistics: thinking the unthinkable. siam review, 21, 460–480. https://doi.org/10.1137/1021092
efron, b., & tibshirani, r. j. (1993). an introduction to the bootstrap. chapman & hall.
ferrand, l., new, b., brysbaert, m., keuleers, e., bonin, p., méot, a., augustinova, m., & pallier, c. (2010). the french lexicon project: lexical decision data for 38,840 french words and 38,840 pseudowords. behavior research methods, 42(2), 488–496. https://doi.org/10.3758/brm.42.2.488
flowers, c. s., palitsky, r., sullivan, d., & peterson, m. a. (2021). investigating the flexibility of attentional orienting in multiple modalities: are spatial and temporal cues used in the context of spatiotemporal probabilities? visual cognition, 29(2), 105–117. https://doi.org/10.1080/13506285.2021.1873211
gao, c., & gozli, d. g. (2021). are self-caused distractors easier to ignore? experiments with the flanker task. attention, perception, & psychophysics, 83(2), 853–865. https://doi.org/10.3758/s13414-020-02170-4
gibson, b. s., pauszek, j. r., trost, j. m., & wenger, m. j. (2021). the misrepresentation of spatial uncertainty in visual search: single- versus joint-distribution probability cues. attention, perception, & psychophysics, 83(2), 603–623. https://doi.org/10.3758/s13414-020-02145-5
gordon, a., geddert, r., hogeveen, j., krug, m. k., obhi, s., & solomon, m. (2020). not so automatic imitation: expectation of incongruence reduces interference in both autism spectrum disorder and typical development. journal of autism and developmental disorders, 50, 1310–1323. https://doi.org/10.1007/s10803-019-04355-9
hays, w. l. (1973). statistics for the social sciences (2nd ed.). holt, rinehart, & winston.
heathcote, a., popiel, s. j., & mewhort, d. j. k. (1991). analysis of response-time distributions: an example using the stroop task. psychological bulletin, 109, 340–347. https://doi.org/10.1037/0033-2909.109.2.340
hockley, w. e. (1984). analysis of response time distributions in the study of cognitive processes. journal of experimental psychology: learning, memory, & cognition, 10, 598–615. https://doi.org/10.1037/0278-7393.10.4.598
hockley, w. e., & corballis, m. c. (1982). tests of serial scanning in item recognition. canadian journal of psychology, 36, 189–212. https://doi.org/10.1037/h0080637
hohle, r. h. (1965). inferred components of reaction times as functions of foreperiod duration. journal of experimental psychology, 69, 382–386. https://doi.org/10.1037/h0021740
hommel, b. (2011). the simon effect as tool and heuristic. acta psychologica, 136(2), 189–202. https://doi.org/10.1016/j.actpsy.2010.04.011
huang, c., theeuwes, j., & donk, m. (2021). statistical learning affects the time courses of salience-driven and goal-driven selection. journal of experimental psychology: human perception & performance, 47(1), 121–133. https://doi.org/10.1037/xhp0000781
hutchison, k. a., balota, d. a., neely, j. h., cortese, m. j., cohen-shikora, e. r., tse, c.-s., yap, m. j., bengson, j. j., niemeyer, d., & buchanan, e. (2013). the semantic priming project. behavior research methods, 45(4), 1099–1114. https://doi.org/10.3758/s13428-012-0304-z
hyman, r. (1953). stimulus information as a determinant of reaction time. journal of experimental psychology, 45, 188–196. https://doi.org/10.1037/h0056940
ivanov, y., & theeuwes, j. (2021). distractor suppression leads to reduced flanker interference. attention, perception, & psychophysics, 83(2), 624–636. https://doi.org/10.3758/s13414-020-02159-z
kang, m. s., & chiu, y.-c. (2021). proactive and reactive metacontrol in task switching. memory & cognition, 49(8), 1617–1632. https://doi.org/10.3758/s13421-021-01189-8
liesefeld, h. r., & müller, h. j. (2021). modulations of saliency signals at two hierarchical levels of priority computation revealed by spatial statistical distractor learning. journal of experimental psychology: general, 150(4), 710–728. https://doi.org/10.1037/xge0000970
luce, r. d. (1986). response times: their role in inferring elementary mental organization. oxford university press.
luo, c., & proctor, r. w. (2018). the location-, word-, and arrow-based simon effects: an ex-gaussian analysis. memory & cognition, 46(3), 497–506. https://doi.org/10.3758/s13421-017-0767-3
maksimenko, v. a., frolov, n. s., hramov, a. e., runnova, a. e., grubov, v. v., kurths, j., & pisarchik, a. n. (2019). neural interactions in a spatially-distributed cortical network during perceptual decision-making. frontiers in behavioral neuroscience, 13, 220. https://doi.org/10.3389/fnbeh.2019.00220
marascuilo, l. a. (1971). statistical methods for behavioral science research. mcgraw-hill.
matzke, d., & wagenmakers, e. j. (2009). psychological interpretation of the ex-gaussian and shifted wald parameters: a diffusion model analysis. psychonomic bulletin & review, 16, 798–817. https://doi.org/10.3758/pbr.16.5.798
mewhort, d. j. k., braun, j. g., & heathcote, a. (1992). response time distributions and the stroop task: a test of the cohen, dunbar, and mcclelland (1990) model. journal of experimental psychology: human perception & performance, 18, 872–882. https://doi.org/10.1037/0096-1523.18.3.872
miller, j. o. (1988). a warning about median reaction time. journal of experimental psychology: human perception & performance, 14(3), 539–543. https://doi.org/10.1037/0096-1523.14.3.539
miller, j. o., & pachella, r. g. (1973). locus of the stimulus probability effect. journal of experimental psychology, 101(2), 227–231. https://doi.org/10.1037/h0035214
miller, j. o., & tang, j. l. (2021). effects of task probability on prioritized processing: modulating the efficiency of parallel response selection. attention, perception, & psychophysics, 83(1), 356–388. https://doi.org/10.3758/s13414-020-02143-7
moutsopoulou, k., & waszak, f. (2012). across-task priming revisited: response and task conflicts disentangled using ex-gaussian distribution analysis. journal of experimental psychology: human perception & performance, 38(2), 367–374. https://doi.org/10.1037/a0025858
mowrer, o. h., rayman, n., & bliss, e. (1940). preparatory set (expectancy): an experimental demonstration of its “central” locus. journal of experimental psychology, 26, 357–371. https://doi.org/10.1037/h0058172
posner, m. i., nissen, m. j., & ogden, w. c. (1978). attended and unattended processing modes: the role of set for spatial location. in h. l. pick jr. & e. saltzman (eds.), modes of perceiving and processing information (pp. 137–157). lawrence erlbaum.
possamaï, c. a. (1991). a responding hand effect in a simple-rt precueing experiment: evidence for a late locus of facilitation. acta psychologica, 77, 47–63. https://doi.org/10.1016/0001-6918(91)90064-7
ratcliff, r. (1993). methods for dealing with reaction time outliers. psychological bulletin, 114, 510–532. https://doi.org/10.1037/0033-2909.114.3.510
rieger, t. c., & miller, j. o. (2020). are model parameters linked to processing stages? an empirical investigation for the ex-gaussian, ex-wald, and ez diffusion models. psychological research, 84(6), 1683–1699. https://doi.org/10.1007/s00426-019-01176-4
rousselet, g. a., & wilcox, r. r. (2020). reaction times and other skewed distributions: problems with the mean and the median. meta-psychology, 4. https://doi.org/10.15626/mp.2019.1630
sanders, a. f. (1970). some variables affecting the relation between relative stimulus frequency and choice reaction time. acta psychologica, 33, 45–55. https://doi.org/10.1016/0001-6918(70)90121-6
schwarz, w. (2001). the ex-wald distribution as a descriptive model of response times. behavior research methods, instruments & computers, 33, 457–469. https://doi.org/10.3758/bf03195403
simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
singh, t., laub, r., burgard, j. p., & frings, c. (2018). disentangling inhibition-based and retrieval-based aftereffects of distractors: cognitive versus motor processes. journal of experimental psychology: human perception & performance, 44(5), 797–805. https://doi.org/10.1037/xhp0000496
theios, j., smith, p. g., haviland, s., traupmann, j., & moy, m. (1973). memory scanning as a serial self-terminating process. journal of experimental psychology, 97, 323–336. https://doi.org/10.1037/h0034107
thomson, s. j., simone, a. c., & watter, s. (2021). item-specific proportion congruency (ispc) modulates, but does not generate, the backward crosstalk effect. psychological research, 85(3), 1093–1107. https://doi.org/10.1007/s00426-020-01318-z
thornton, i. m., & zdravković, s. (2020). searching for illusory motion. attention, perception, & psychophysics, 82, 44–62. https://doi.org/10.3758/s13414-019-01750-3
ulrich, r., & miller, j. o. (1994). effects of truncation on reaction time analysis. journal of experimental psychology: general, 123(1), 34–80.
https://doi.org/10.1037/0096-3445.123.1.34
vadillo, m. a., giménez-fernández, t., beesley, t., shanks, d. r., & luque, d. (2021). there is more to contextual cuing than meets the eye: improving visual search without attentional guidance toward predictable target locations. journal of experimental psychology: human perception & performance, 47(1), 116–120. https://doi.org/10.1037/xhp0000780
yule, g. u. (1911). an introduction to the theory of statistics. charles griffin & co.
zahn, t. p., & rosenthal, d. (1966). simple reaction time as a function of the relative frequency of the preparatory interval. journal of experimental psychology, 72, 15–19. https://doi.org/10.1037/h0023328
zhou, b., & krott, a. (2016). data trimming procedure can eliminate bilingual cognitive advantage. psychonomic bulletin & review, 23(4), 1221–1230. https://doi.org/10.3758/s13423-015-0981-6

appendix: expected values and standard errors of differences in means and medians

this appendix describes the numerical procedures for computing the expected values and standard errors of between-condition differences in mean rts and between-condition differences in median rts that are depicted in figure 7. let x_{1,i} and x_{2,i}, i = 1 . . . n, be random samples of n rts from the two conditions being compared. these come from assumed probability distributions (e.g., ex-gaussian, etc.) with means µ_1 and µ_2, variances σ²_1 and σ²_2, and cumulative distribution functions (cdfs) f_1(t) and f_2(t), respectively. for simplicity in dealing with medians, assume that n is odd.

means. to analyze the between-condition difference in mean rts, the researcher computes for each participant

    d_mn = x̄_2 − x̄_1 = Σ_{i=1}^{n} x_{2,i}/n − Σ_{i=1}^{n} x_{1,i}/n,    (1)

which has expected value e[d_mn] = µ_2 − µ_1. the variance of this difference is var[d_mn] = σ²_1/n + σ²_2/n, because x_{1,i} and x_{2,i} are independent samples of trials.

medians. to analyze the between-condition difference in median rts, the researcher computes for each participant

    d_mdn = x_{2(k)} − x_{1(k)},    (2)

where x_{·(k)} indicates the k’th order statistic in the sample of n rts. the median is the k’th order statistic for k = (n + 1)/2 when n is odd. given the cdf f(t) for the rts in either condition, the cdf of the median x_{(k)} in that condition is

    f_{x(k)}(t) = Σ_{j=k}^{n} C(n, j) f(t)^j [1 − f(t)]^{n−j}    (3)

(e.g., arnold et al., 1992). as is illustrated in figure 3, the probability distribution of the median rt in this condition is uniquely determined by this cdf, so the median’s expected value e[x_{(k)}] and variance var[x_{(k)}] in the condition can be computed by numerical integration.
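as a sketch of the numerical integration just described, the code below computes e[x_{(k)}] and var[x_{(k)}] for the sample median of ex-gaussian rts, using the order-statistic density implied by eq. (3). the distribution parameters (mu = 500, sigma = 50, tau = 100) are illustrative, not taken from the article.

```python
import numpy as np
from scipy import integrate, special, stats

def median_moments(dist, n):
    """expected value and variance of the sample median (n odd),
    obtained by numerically integrating the order-statistic density
    c * F(t)^(k-1) * [1 - F(t)]^(n-k) * f(t), with F the cdf, f the
    pdf, and c = n! / ((k-1)! (n-k)!); this density is implied by the
    order-statistic cdf in eq. (3)."""
    assert n % 2 == 1, "n must be odd"
    k = (n + 1) // 2                        # the median order statistic
    # log-safe constant; note (k-1)! = (n-k)! when k = (n+1)/2
    c = np.exp(special.gammaln(n + 1) - 2 * special.gammaln(k))
    def pdf_med(t):
        big_f = dist.cdf(t)
        return c * big_f**(k - 1) * (1.0 - big_f)**(n - k) * dist.pdf(t)
    lo, hi = dist.ppf(1e-9), dist.ppf(1 - 1e-9)
    # hint the integrator at the population median, where the density peaks
    e1, _ = integrate.quad(lambda t: t * pdf_med(t), lo, hi,
                           points=[dist.median()], limit=200)
    e2, _ = integrate.quad(lambda t: t * t * pdf_med(t), lo, hi,
                           points=[dist.median()], limit=200)
    return e1, e2 - e1**2

# illustrative ex-gaussian rts: mu = 500, sigma = 50, tau = 100
# (scipy's exponnorm uses the shape parameter K = tau / sigma)
rt = stats.exponnorm(K=100 / 50, loc=500, scale=50)
for n in (11, 51):
    e, v = median_moments(rt, n)
    print(f"n={n:2d}: e[median] = {e:.1f} ms, se = {np.sqrt(v):.2f} ms")
```

for a right-skewed rt distribution like this one, e[x_{(k)}] shrinks toward the population median as n grows, which is the sample-size-dependent bias of the median discussed in the main text.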
once this computation is carried out for each of the two conditions individually, the expected value and variance of the difference between conditions are

    e[d_mdn] = e[x_{2(k)}] − e[x_{1(k)}]    (4)

and

    var[d_mdn] = var[x_{2(k)}] + var[x_{1(k)}].    (5)

meta-psychology, 2023, vol 7, mp.2021.2762
https://doi.org/10.15626/mp.2021.2762
article type: original article
published under the cc-by4.0 license
open data: not applicable
open materials: not applicable
open and reproducible analysis: not applicable
open reviews and editorial process: yes
preregistration: no
edited by: carlsson, r., & innes-ker, å.
reviewed by: rebecca willén, farhan sarwar
analysis reproduced by: not applicable
all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/ycs8t

a theory of ethics to guide investigative interviewing research

david a. neequaye
university of gothenburg

abstract

this article examines ethical considerations relevant to the formulation of psychological investigative interviewing techniques or methods. psychology researchers are now devoting much attention to improving the efficacy of eliciting information in investigative interviews. stakeholders agree that interviewing methods must be ethical. however, there is a less concerted effort at systematically delineating ethical considerations to guide the creation of interviewing methods derived from scientific psychological principles. the disclosures interviewees make may put them at considerable risk, and it is not always possible to determine beforehand whether placing interviewees under such risks is warranted.
thus, i argue that research psychologists aiming to contribute ethical methods in this context should ensure that those methods abide by a standard that actively protects interviewees against unjustified risks. interviewing techniques should provide interviewees, particularly vulnerable ones, with enough agency to freely determine what to disclose. researchers should explicitly indicate the boundary conditions of a method if it cannot achieve this standard. journal editors and reviewers should request such discussions. the suggested standard tasks research psychologists to be circumspect about recommending psychological techniques without fully addressing the ethical boundaries of those methods in their publications. i explain the proposed ethical standard’s necessity and discuss how it can be applied.

keywords: disclosure, ethical investigative interviewing, human intelligence source, psychological manipulation, suspect, witness

a theory of ethics on investigative interviewing

this article discusses ethical considerations relevant to the formulation of psychological investigative interviewing techniques or methods. such techniques are means of asking people questions in investigative interviews. in those interviews, interviewers question interviewees to potentially elicit information pertinent to perceived local/national or international security concerns or interests a governing entity may have. i use the word ‘perceived’ deliberately in this definition of investigative interviewing. different governing entities may have different views on whether an investigation is a security issue or whether an interest is legitimate. another term, interrogation, has a meaning similar to investigative interviewing (see, e.g., hartwig et al., 2014). however, some stakeholders associate interrogation with questioning methods aimed at confirming a preconceived notion rather than eliciting information (rachlew, 2017; williamson, 1993).
therefore, i adopt the term investigative interviewing. the term is arguably a more intuitive designator of the majority of social interactions wherein the objective is to elicit truthful information in an investigation or interview. before proceeding, it is critical to preempt any later confusion by noting relevant topics that are not the current article’s focus. this examination does not discuss the ethics of interviews psychologists conduct, such as psychological evaluations in the health sector, for employers, and other organizations. additionally, the article does not delve into how psychologists should personally comport themselves when directly performing consultations in a professional capacity. the apa ethical principles of psychologists and code of conduct addresses these important issues (american psychological association, 2002), which are not the focus of the present discussion. the article centers on investigative interviewing pertaining to national/international security concerns and interests, and research psychologists contributing ethical methods to that enterprise. research psychologists may serve as consultants, advising directly on actual investigative interviews, and it is generally recommended that such consultations follow the apa ethical guidelines. it is worth repeating here that this article does not delve into how research psychologists should personally comport themselves during professional consultations. previous examinations have addressed these essential issues (see, e.g., american psychological association, 2002; porter et al., 2016). moreover, this work does not address how security personnel, for example, the police, should behave when conducting investigative interviews, or the ethical-legal standards thereof (e.g., the recording of suspect and witness interviews).
other research has examined such essential issues (see, e.g., clarke, 2001; newton, 1998; snook et al., 2020). this article focuses on the burgeoning research field where research psychologists contribute somewhat indirectly by developing and recommending psychological interviewing techniques to be used by other practitioners, specifically, law enforcement and intelligence interviewers. many existing scientific publications recommend such psychological interviewing techniques (see vrij, fisher, et al., 2017, and vrij & granhag, 2014, for overviews). the ethics of these methods published in scientific journals remains underexamined. existing ethics codes are either underspecified for investigative interviewing (e.g., american psychological association, 2002) or typically speak to how researchers or psychologists should personally comport themselves (e.g., porter et al., 2016). this article explores and specifies ethical considerations relevant to indirect contributions to investigative interviewing that feature in psychology journal articles.

researchers’ current focus on developing psychological interviewing techniques

various events in recent history have sparked a general interest in investigative interviewing. for example, (a) the discovery that the central intelligence agency (united states) used undeniably abusive questioning methods at guantanamo bay and abu ghraib (cohen, 2005); (b) the revelations that law enforcement personnel routinely employ dubious techniques that elicit false confessions to crimes (kassin, 2017); (c) the missed opportunities to obtain information that could have helped prevent security breaches worldwide (soufan, 2011). psychology researchers are now devoting much attention to improving the efficacy of eliciting information in investigative interviews (meissner, 2021; vrij, fisher, et al., 2017). the majority of research focused on developing interviewing techniques draws on psychological principles that facilitate disclosure.
researchers and practitioners agree that interviewing methods must be ethical (alison & alison, 2017; hartwig et al., 2014; vrij, fisher, et al., 2017). however, there is a less concerted effort at systematically delineating ethical considerations relevant to the creation of interviewing methods derived from scientific psychological principles. two seminal works have considered this topic previously. skerker (2010) extensively discusses the morality of investigative interviewing. hartwig et al. (2016) reviewed the psychological interviewing methods extant at the time and analyzed their moral implications. the present work draws on skerker (2010) and hartwig et al. (2016), and when necessary, i will describe the relevant aspects. however, this work is not a commentary on those publications. the current article takes a different approach and examines the ethics of investigative interviewing in ways that address existing interviewing methods and those that may be developed in the future. this work aims to contribute to the discussion by calling on the field to now consider a more forward-looking approach. i will argue that interviewing methods should provide interviewees, particularly vulnerable ones, with enough agency to freely determine what to disclose. as such, researchers should explicitly indicate the boundary conditions of a technique if it cannot achieve this standard. the proposed ethical standard may assist in regulating researchers’ creation and publication of psychological interviewing techniques. the proposed standard purposely focuses on the publications of research psychologists for pragmatic reasons. guidelines aimed directly at practicing investigative interviewers may be ineffective since practitioners are generally under no obligation to abide by an ethical standard the academic literature posits, unless the relevant authorities ratify the standard.
such ratification may comprise an arduous and lengthy bureaucratic process, decelerating the goals at hand. nonetheless, practicing interviewers may draw on the recommendations of research publications, and they are encouraged to do so (fallon, 2014). suppose a publication indicates that an interviewing technique leads people to disclose information. in that case, a practitioner may implement the technique in an actual interview. this possibility offers a pragmatic way for the academic literature to proactively contribute to practitioners using ethical methods. we must ensure that the methods researchers develop and recommend via their publications are ethical. the editorial process of scientific publishing is an opportune avenue to assure that researchers primarily disseminate ethical interviewing methods. importantly, this approach could preempt possibilities for the scientific literature to sustain or produce ethically problematic techniques that practitioners might adopt. however, there is little specification when it comes to standards by which to determine the ethical nature of interviewing methods researchers publish. later, i will elaborate on the details of the proposed ethical standard; now, i will explain its pragmatic necessity. the apa ethics code (american psychological association, 2002) offers principles to govern the psychology profession; two are relevant to the publication of investigative interviewing methods in psychology journals. principle a: the beneficence and nonmaleficence principle charges members of the psychology profession to ensure that they do no harm. in this view, we must strive to safeguard the welfare of those whom our work may affect. principle e calls for respecting people’s rights and dignity. the work of psychology should not infringe on individuals’ rights to privacy, confidentiality, and self-determination. it is incumbent upon research psychologists to ensure that their publications align with the relevant ethics codes.
however, the apa ethics code is designed to be applicable across the organization's many (∼54) divisions; the principles therein are purposely aspirational. the respective divisions and fields of psychology are encouraged to create enforceable standards that put the apa ethics principles into action (behnke, 2006). the standard this article proposes is an attempt to provide such an enforceable standard applicable to the development and recommendation of investigative interviewing methods via scientific publication. put simply, the standard is an effort to provide a metric to assure that published interviewing techniques align with the beneficence and nonmaleficence principle and the respect for rights and dignity principle. to my knowledge, a compendium of generally applicable ethics principles does not exist when it comes to psychological interviewing methods. the current state of the literature is not exactly a shortcoming but an inevitable result of the moral and legal entanglements commonly associated with investigative interviewing (see, e.g., sukumar et al., 2016). moreover, there are varied legal jurisdictions and cultural contexts worldwide. broadly examining legal nuances simultaneously with the morality of psychological techniques would be close to unmanageable. thus, this work will focus on moral issues that generally arise when people are subjected to psychological techniques in investigative interviews. the analysis will not delve into the specific laws of any context. also, this article does not necessarily provide a complete theory. the goal is to offer a general proposal to commence open discussions on the topic.

i have structured the remainder of this article as follows. first, the work discusses the moral challenges and legitimacy of investigative interviewing. next, i describe the categories of people that typically feature in interviews and the risks interviews pose to them. 
the article then explores how research psychologists may structure the interviewing methods they develop to navigate such risks ethically. finally, i explore the ethical nature of the scharff technique, an interviewing method that features in the published literature. that analysis provides an example of how one may examine the ethics of an interviewing technique.

moral challenges of investigative interviewing

investigative interviews contain a moral conundrum: to what extent is it permissible to (sometimes) compel people to submit to questioning about whatever topic a governing entity deems fit? surely one is entitled to keep what one knows secret if one wishes? controlling the information one wants to be open or secret is the essence of human autonomy, that is, one's ability to determine one's identity, intentions, possessions, and actions (bok, 1989). previous works, namely hartwig et al. (2016) and skerker (2010), have addressed this general problem investigative interviewing raises. i will draw on the relevant aspects of those works to reiterate the legitimacy of investigative interviewing because the discussion is a useful reminder, and it sets the stage for the remainder of this work. bok (1989) argues that one cannot automatically approve or disapprove of secrecy. one must examine the moral arguments for every occasion where the justifiability of secrecy is under contention (bok, 1989). let us now address whether investigative interviewing, which often comprises probing others' secrets, is a justifiable enterprise. liberal democracy is presently the most popular form of national and international government (e.g., mukand and rodrik, 2020). in this system of governance, founded on deontological ethics, a significant aspect of exercising one's rights is the freedom to enjoy autonomy without interference (hartwig et al., 2016). 
there is a general understanding that people are morally equal; skerker (2010) explains this phenomenon in detail, and the remainder of this paragraph provides a summary. mutual respect for one another's rights upholds moral equality. one can fully enjoy one's rights only if one does not impede another's rights (skerker, 2010). for example, a person is free to own a house provided she purchases it using her own resources rather than stealing another's house. exercising one's rights through illegitimate means is a rights violation. the violator consequently forfeits the expectation that others will respect her rights. hence, it is permissible to inhibit such violations, but only to the extent that the restraint restores the status quo, that is, moral equality (skerker, 2010). typically, in liberal democracies, the state or the legitimate governing entity is responsible for ensuring moral equality by protecting the interests of the governed. accordingly, governing entities usually have a monopoly on coercion. they can, therefore, compel the governed to attend investigative interviews if there is cause to believe a deviation from moral equality has occurred. as such, the interview may support efforts to restore the status quo. for instance, a murder suspect may be detained and questioned about their whereabouts when the crime happened. ordinarily, detaining anyone and asking them to share personal details compromises their autonomy, which is a rights violation. however, hartwig et al. (2016) note that a warranted temporary restriction of a person's rights, as in the example of this murder suspect, is not necessarily a violation but a rights infringement. an infringement, in this case probing the secrets of a suspected murderer, is not inherently wrong. remember that governing entities in liberal democracies are responsible for safeguarding moral equality, which grants them the mandate to compel. 
nonetheless, such rights infringements must not exceed the amount of compulsion a governing entity needs to restore the status quo (skerker, 2010). the governed have the right to procedural fairness or natural justice (bayles, 2012). but investigative interviewing is fraught with epistemic limitations. it is impossible always to know whether subjecting a person to an interview is justified in the first place; indeed, a significant part of interviewing aims to determine whether the interview is appropriate (hartwig et al., 2016). suspects undergo questioning as part of establishing whether they are culprits; they might well be innocent. an interviewer might have to ask an informant probing questions before fully determining whether the probe was necessary for understanding some investigation of interest. these epistemic limitations make it challenging to establish the extent to which chosen interviewing methods overly infringe on an interviewee's autonomy and rise to the level of a rights violation. that is to say, investigative interviews are highly morally risky (hartwig et al., 2016). however, interviews are a necessary aspect of assuring moral equality. thus, it is critical to equip stakeholders to better anticipate such moral risks. that knowledge could be a useful guide when formulating interviewing methods ethically.

people subjected to interviews

a central discussion point regarding the moral risks of investigative interviewing is the injuries interviewing methods pose or could pose to interviewees, that is, people subjected to interviews. for example, the examination by hartwig et al. (2016) classified the existing psychological interviewing techniques into themes and analyzed their potential moral hazards; a summary follows. one theme comprised interviewing techniques that seek to confirm preconceived notions of guilt. 
those methods implement coercive and deceptive practices to elicit confessions; examples include deceptively minimizing or maximizing the consequences of crimes to force suspects to confess at any cost (inbau et al., 2001). confession-based techniques have been shown to reliably elicit false confessions (see kassin, 2017). moreover, such methods damage the integrity of interviewers and the institutions under which they serve (hartwig et al., 2016). the other theme consists of interviewing methods that actively avoid confirming preconceived notions. the peace model is an example of a framework under the information-gathering approach (see clarke, 2001, for an overview). peace is an abbreviation denoting five stages of the interviewing process: planning and preparation; engaging and explaining; asking for an interviewee's account of events; closure; and evaluation. the peace model offers the interviewee maximum opportunities to provide a thorough account in response to an interviewer's inquiries. other techniques advocate asking questions in a manner that elicits differences between liars and truth tellers (granhag & hartwig, 2015; vrij, 2018). some methods also employ subtle elicitation tactics that lead interviewees to disclose information without realizing they have provided new information (oleszkiewicz et al., 2014). in all, the techniques under the information-gathering theme equip interviewers with tactics and strategies to gather information about a topic of interest. hartwig et al. (2016) note that the information-gathering theme reflects a more ethical view of investigative interviewing: it focuses on treating interviewees fairly and respecting their rights, unlike confession-based methods. i have chosen a different way of approaching the topic by first categorizing interviewees to provide a generic classification system. this entry point offers a novel approach different from hartwig et al. 
(2016) and is useful because it allows a more tractable and forward-looking discussion than classifying interviewing techniques. we cannot anticipate all the methods that might be developed in the future. hence, a classification of extant interviewing techniques is likely to become limited as the field develops new methods. classifying interviewees affords a forward-looking analysis and offers more longevity. such a categorization allows one to examine whether any existing or upcoming interviewing method may pose injuries to the class of interviewees from which the technique aims to elicit information. generically classifying interviewees better situates the field to examine the moral implications of current and future interviewing methods. additionally, the shift in analysis, highlighting interviewees' experiences, could spur discussions toward potential modifications existing techniques may need to enhance their ethical standards.

types of interviewees and the risks for worst-case outcomes

surveys of the literature reveal three general functions that can arguably characterize any interviewee[1]: a suspect, a witness, or a human intelligence (humint) source (e.g., vrij, fisher, et al., 2017; vrij, meissner, et al., 2017). an individual is likely to fit at least one of these designations when questioned in an investigative interview at any time point. hence, this classification system sets the stage for a forward-looking ethics analysis that offers more longevity. i will examine each interviewee category and the associated risk for worst-case outcomes in interviews.

suspects. suspects are individuals whom interviewers question because of reasonable grounds to believe, at least temporarily, that the person has committed a crime or some aspect of a crime. typically, interviewers question suspects in a local/national law enforcement jurisdiction (e.g., meissner et al., 2015). 
some suspects can, however, be interviewed in an international law enforcement context (glasius, 2006). international suspects usually feature in investigations sanctioned by intergovernmental organizations such as the international criminal court (icc). any act labeled a crime in a jurisdiction is unlawful by default. a governing entity can mete out punishment to anyone found guilty of a crime (morrison, 2013; tierney, 2009). if a suspect is convicted of a crime, penalties can range from a partial to a total loss of civil liberties. for example, those convicted can receive a fine or a prison sentence. what a suspect reveals during questioning could contribute to absolving or implicating them.

risks. in the worst-case scenario, what a person discloses when questioned as a suspect could inadvertently cause them to lose some civil liberties, or contribute to that loss.

witnesses. like suspects, witnesses can feature in a local/national or an international law enforcement context. these individuals undergo interviews if they report having had some direct sensory experience of a crime under investigation or having been directly harmed. ordinarily, these are cases where the person may have seen or heard aspects of a crime as a bystander (read and craik, 1995; wells and olson, 2003), or they may have been a victim of the crime (spalek, 2016). the information a witness provides sheds light on the crime under investigation in various ways. three are relevant here: (a) whatever a witness discloses may contribute to exculpating or incriminating a suspect; (b) the information may lead someone else to become a suspect; and (c) the witness may become a suspect themselves. in the first instance, a witness is not a suspect, and a witness may refuse to answer incriminating questions (pieck, 1960). nonetheless, if a witness unwittingly provides self-incriminating information, they may become a suspect in the investigation.

risks. 
in the worst-case scenario, what a witness reveals may inadvertently cause another person, or the witness themselves, to become a suspect or ultimately lose some civil liberties (wells and olson, 2003).

human intelligence sources. humint sources are questioned because they might possess information relevant to an investigation. a broad range of humint investigations exist. those investigations could concern previous, ongoing, or possible future local/national or international crimes (hartwig et al., 2014). the interest may also be a general security issue that is not exactly a crime (burkett, 2013), for example, an investigation of known or ostensible threats posed by a local or foreign organization or a foreign government. alternatively, the investigation could be gathering information to further an entity's interests (ransom, 2013), for instance, a government seeking to gain insight into subjects that can inform foreign policy or foreign aid decisions. unlike suspects, humint sources are questioned primarily because of the information they may possess, not necessarily due to a belief that they have committed a crime. there are a few distinctions between the humint source and witness designations worth noting. sometimes, the objective of an intelligence interview is to elicit information about a previous, ongoing, or imminent crime. here, unlike witnesses, intelligence sources are those who may have obtained information indirectly and not necessarily through direct harm or sensory experience. for instance, another person could have provided the information a humint source now holds. in this view, one could consider such indirect sources of

[1] possibly, some jurisdictions may call the categories by slightly different names. however, stakeholders are familiar with and frequently use these designators in the manner defined by the classification scheme. it is unlikely that the names used here will cause unnecessary ambiguity in the investigative interviewing literature. 
information in criminal investigations to be intelligence sources, as they possess relevant information (carter, 1990). indeed, investigators may question other individuals to better understand aspects of an incident, for instance, cross-checking suspects' alibis or witnesses' statements using third-party information. an example of such intelligence gathering could be interviewing various caregivers in a child sexual abuse case to shed light on the alleged abuse. investigators may also ask independent experts about different aspects of a crime, for example, requesting an explanation of dna evidence analysis. the goal could be to better understand, for instance, how the crime occurred.

risks in criminal investigations. despite the functional distinctions, the humint source and witness designations carry similar risks when an intelligence interview relates to a criminal investigation. a humint source may provide information that contributes to implicating or absolving a suspect. under certain circumstances, an intelligence interviewee could incriminate herself. thus, the worst-case scenario here is a humint source unwittingly revealing information whose details lead someone else, or the source herself, to become a suspect in a criminal investigation or eventually lose some civil liberties.

risks in non-criminal investigations. other intelligence interviews entail eliciting information about general security issues such as criminal networks and foreign threats. alternatively, the questioning can center on certain interests an entity may have (e.g., foreign policy decisions). these intelligence investigations do not necessarily concern alleged crimes, at least not until a governing entity launches a formal criminal investigation. thus, here, an intelligence source may come by information directly, as a witness would, or the source could obtain it indirectly, as described earlier. 
unlike witnesses, who provide information about a specific (past) crime, intelligence sources here could undergo questioning about known or ostensible events. when the matter is a general security issue or an entity's interests, the worst-case eventuality for the humint source is revealing details on a subject when one does not intend to.

caveats. i must emphasize that the above categorizations derive from the immediate purpose or function an interviewee is serving during questioning. thus, the classification scheme does not preclude an interviewee from switching roles during an ongoing or a subsequent interview. the point is that an interviewee assumes a designation depending on their current function, and those functions are subject to change.

navigating the moral risks of investigative interviewing

the current article aims to improve the ethics of the investigative interviewing techniques research psychologists publish. if so, why delve into risks to interviewees directly caused by practicing interviewers rather than research psychologists? after all, jurisdictions have laws and guidelines to determine and prevent malpractice, at least in local and national law enforcement (e.g., college of policing, 2019). as i noted earlier, research psychologists recommend that practicing interviewers implement interviewing methods published in the scientific literature. often the recommendation comes with the claim that those techniques are ethical or elicit information effectively (see, e.g., vrij, fisher, et al., 2017). suppose practitioners heed the recommendations of the published literature because of the presumed benefits. in that case, research psychologists may contribute indirectly to the risks interviewees face, even though practitioners directly cause those risks. therefore, it behooves research psychologists to be cognizant of the dangers they could inadvertently contribute to and preempt such possibilities. 
importantly, research psychologists should strive to prevent possible misuse of the techniques they publish. investigative interviewing is an inevitable aspect of assuring moral equality despite the risks involved. thus, the critical question to now address is this: how should stakeholders ethically navigate these moral risks when developing interviewing methods? how should research psychologists design fail-safe techniques that inhibit interviewees from being excessively compelled in interviews? the following sections delve into these issues.

interviewees should decide their disclosures. actions that tend to result in unpleasant outcomes can be permissible if one consents to partaking (hartwig et al., 2016; tadros, 2011). for example, pugilists consent to the possibility of concussions when they engage in prizefighting. in this view, the risks of investigative interviewing are not immoral by default. hartwig et al. (2016) note that, sometimes, such outcomes are worthy of consent because, in liberal democracies, the governed expect governing entities to ensure moral equality. citizens presume that governing entities will take action to keep them safe and protect their interests. however, epistemic limitations prevent knowing beforehand whether an investigative interview is warranted. by extension, we also cannot know whether the potential worst-case scenario an interviewing method presents is worthy of consent. interviewees must first disclose information for the worst-case scenario to arise. as a result, in liberal democracies, interviewees generally have the right to silence, even if they can be compelled to submit to questioning in interviews. the right to silence allows the interviewee control over what information to disclose and what to keep secret. this control assures that interviewees consent to the possible consequences of their disclosures beforehand. 
i use the phrase "possible consequences" deliberately because it is not always viable to foresee the exact outcomes of disclosures in interviews. the right to determine one's disclosures may allow bad-faith actors to withhold information or provide misleading details to avoid deserved punishment. more importantly, however, such autonomy enables good-faith actors to provide all the information a governing entity may need toward ensuring moral equality (seidmann and stein, 2000). these features do not necessarily advocate freeing bad-faith actors. instead, they uphold a primary aspect of moral equality: the protection of good-faith actors (bandes, 2009; cassell, 2017). a critical parameter for what can be done to persons without compromising their rights is that no one should harm another without the other's consent (nozick, 1974). in this vein, an interviewing method worthy of consent should inhibit the worst-case scenarios for good-faith actors in the class of interviewees from which it aims to elicit information. interviewing methods purposed to elicit information from suspects should make it near impossible for innocent suspects to inadvertently say anything that can cause or contribute to them losing civil liberties. the witness and humint source designations in criminal investigations are similar: techniques designed to elicit information from such interviewees should make it near impossible for an innocent person to say anything self-incriminating or to unwittingly incriminate another innocent person. humint interviewing methods aimed at eliciting information about general security threats or an entity's interests should also protect interviewees' autonomy in deciding what to disclose. because these intelligence interviews do not concern formal criminal investigations, interviewees here are not necessarily innocent or guilty. consequently, such interviewees may be invited to submit to an interview rather than compelled as a matter of course. 
to be consent-worthy, elicitation techniques purposed for such humint sources should make it near impossible for a source to say anything they have not formed a clear intention to share. the consequences of disclosure in these humint contexts are highly unpredictable, unlike in criminal investigations, where some rules (e.g., constitutions) regulate the potential outcomes of disclosure. as such, it should remain the source's prerogative to determine what to disclose; this provides reasonable assurance that the source has consented to the possible consequences of the disclosure.

two principles for ethical interviewing methods. how should research psychologists formulate interviewing techniques that are worthy of consent? by what principles should those questioning methods strive to abide? this work offers two principles to set future discussions in motion. the following proposal aims to contribute to developing parameters to assess or guide the ethics of current and future psychological interviewing techniques. a psychological interviewing method can achieve consent worthiness by ensuring the following: (a) an interviewee is maximally and instantaneously aware of an interviewer's inquiry, that is, what the interviewer is asking; call this the inquiry-clarity principle. (b) an interviewee is maximally and instantaneously aware of the information they are choosing to disclose; call this the disclosure-awareness principle. these principles arguably give interviewees substantial agency to decide their disclosures and safeguard against disclosing information unwittingly. suppose an interviewee has agency, that is, maximum and instantaneous awareness of an interviewer's inquiry, and forms a clear intention about an acceptable response, which she then discloses. it is then conceivable that the interviewee has consented to the potential consequences of her disclosure. agency allows the expectation that a person bears responsibility for their actions (frith, 2014; haggard and tsakiris, 2009; moore, 2016). 
it is worth reiterating that the two principles are to be applied in serial order: first the inquiry-clarity principle, then the disclosure-awareness principle. both principles must be satisfied. suppose that an interviewing technique falls short of making an interviewee aware of what an interviewer is asking. one cannot justify the consent-worthiness of that technique by claiming or demonstrating that the method makes interviewees aware of what they are choosing to disclose. without inquiry-clarity, it is challenging, if not impossible, to ascertain a reasonably proximate cause of a disclosure. if the inquiry that elicited a disclosure is unknown, one cannot claim the unknown inquiry is consent-worthy, regardless of the interviewee's response. after satisfying the inquiry-clarity principle, the method must also allow the interviewee to choose their preferred response. suppose a method elicits a legitimate response from an interviewee, but the interviewee does not realize they have offered such a response. in that case, the method falls short of the disclosure-awareness principle. here, a legitimate response means a response on the record that one can materially ascribe to the interviewee. proponents of a psychological interviewing technique must offer theoretical arguments or empirical evidence demonstrating the conditions under which the method satisfies the inquiry-clarity and disclosure-awareness principles. such exposition will reveal the extent to which the method allows interviewees agency over their disclosures. editors and reviewers of investigative interviewing research may draw on the principles offered here to request such discussions from authors. this requirement will encourage much-needed reflection about the ethics of interviewing methods. the principles also offer much-needed clarity about what constitutes a psychologically manipulative technique. currently, the term psychological manipulation remains undefined in the investigative interviewing literature. 
torture, physical coercion, and deceptive methods do not comport with the principles proposed here. analysts condemn such practices (meissner et al., 2015); they compromise interviewees' agency to freely determine what to disclose and essentially prevent one from providing accurate information. for example, deceptive interviewing techniques are prone to elicit false confessions (kassin, 2017; kassin and kiechel, 1996). hartwig et al. (2016) address the immorality of these dubious methods; they note that such techniques undermine society's trust in public institutions and damage interviewers' character. that notwithstanding, indubitably immoral techniques should not be the standard against which stakeholders determine ethical techniques. that criterion sets the bar too low. it does not guarantee interviewees' protection, given that investigative interviewing is fraught with epistemic limitations and governing entities can compel people to attend interviews. accordingly, deriving an interviewing method from scientific psychological principles does not necessarily make a technique consent-worthy. a scientifically derived interviewing technique may still obscure an interviewer's inquiry and lead an interviewee to unwittingly disclose information; in that case, the method is arguably psychologically manipulative. i am not suggesting that any interviewing technique that successfully persuades an interviewee to share information is psychologically manipulative. that interpretation is erroneous. the proposal is that methods are psychologically manipulative if they undermine interviewees' agency when convincing them to share information.

an absolute standard based on the two principles. interviewing methods should be held to an absolute standard that actively protects all interviewees, especially vulnerable ones, such as children and people with intellectual disabilities. 
broadly speaking, vulnerable interviewees, for various reasons, are highly suggestible and prone to providing information unwittingly (see, e.g., farrugia and gabbert, 2020; gudjonsson, 2005; o'mahony et al., 2012). hence, i propose that the minimum requirement of consent-worthiness should be adherence to the inquiry-clarity and disclosure-awareness principles even when a vulnerable person undergoes a technique. this standard calls on research psychologists to apply a maximin rule when formulating interviewing methods. the maximin rule is an ethics principle advocating that, typically, the right state of affairs is one structured such that the worst outcome is as acceptable as it can be (rawls, 1971). a method should allow the most vulnerable interviewee, within a class of interviewees, maximum agency in determining what to disclose. the standard requires research psychologists to be explicit about the possible boundary conditions where a technique fails to be consent-worthy, if such conditions exist. to my knowledge, there is little ongoing discussion about the ethical boundaries of techniques derived from psychological principles. let us say, for example, one designs a technique to elicit information from suspects. at a minimum, i propose that the method give vulnerable suspects agency to determine what information they want to share. for example, the technique should give the most suggestible suspect, who is prone to falsely confessing, maximum agency to determine what to disclose, without undermining such ability in any way. this standard is not always attainable; techniques for vulnerable interviewees may require special considerations. however, the proposed standard tasks developers to indicate the target populations of their methods and not leave that to assumption. for instance, does the method provide agency to neurotypical adults but not to adults with intellectual disabilities and children? 
it is worth clarifying that this work does not suggest that consent-worthy methods must provide an equal amount of agency to all interviewees within the relevant class. such a feat would be unproductive and, ultimately, unattainable due to individual differences. instead, i am proposing that an interviewing method should generally give interviewees enough (viz., maximum) agency such that when vulnerable interviewees undergo the method, they too will have the agency to decide what they wish to reveal. an illustration may assist in understanding the proposal. imagine that the agency a technique provides is a pie. i am not suggesting that an interviewing method should ensure that all interviewees get an equal piece. equal pieces may still be unacceptable if the pie is too small in the first place; for example, by design, a method may offer little agency to any interviewee. the proposed standard calls for a pie big enough that even when one gets the smallest slice, it will still be acceptable. put differently, interviewing methods should provide enough agency that the levels are sufficient even when dealing with vulnerable interviewees. researchers should be explicit about the boundaries of a method if it cannot achieve this standard. overall, the proposed standard may allow psychological investigative interviewing techniques to achieve beneficence and nonmaleficence, and respect for rights and dignity (viz., american psychological association, 2002, principles a and e).

advancing the ethics paradigm: exploring the scharff technique

the standard proposed here arguably provides fairly straightforward and actionable guidelines for the design of future interviewing techniques. nonetheless, when it comes to existing methods, it is useful to explore the conditions under which those techniques impact interviewees' agency to promote or undermine consent worthiness. 
an illustration may help one understand how to engage with the standard when examining a specific technique's ethics. in that light, i will examine the scharff technique. researchers developed the scharff technique by scientifically conceptualizing the interviewing style of hanns scharff, a member of the german luftwaffe active during world war ii.2 see granhag and hartwig (2015) for the first of such conceptualizations (see also oleszkiewicz, 2016). the choice of the scharff technique is arbitrary. every interviewing technique in the published literature should be the subject of an ethics analysis; for example, the variants of the cognitive approach to lie detection (vrij, fisher, et al., 2017), the strategic use of evidence framework (granhag & hartwig, 2015), the tactical disclosure of evidence method (dando & bull, 2011), the verifiability approach (nahari et al., 2014), et cetera. moreover, this analysis of the scharff technique is preliminary and not exhaustive; its goal is to encourage open discussions about the ethical boundaries of existing techniques. the objective is not to impugn the scharff technique or to set a snare for labeling it unethical. the state of the literature will improve through collective effort, not alienation. the present discussion provides an example of a structure and general recommendations. the goal is to get the ball rolling for future examinations of the scharff technique and all other methods in the published literature. the current conception of the scharff technique comprises five components that work in concert to elicit information. these components are best characterized as a way of handling conversations to facilitate the other interlocutor's disclosure, not a checklist of mandatory behaviors to perform in serial order. an interviewer may enact the scharffian components in any order that suits the goals at hand.

• a friendly interpersonal approach. this component advocates conversing with the interviewee in a non-adversarial manner.
the interviewee is to feel relaxed and comfortable, not accused or threatened into complying with demands.

• not pressing for information. this component charges the interviewer to refrain from asking direct questions or badgering the interviewee with questions. note that asking direct questions is not synonymous with badgering.

• creating an impression of omniscience on a topic. this component calls on the interviewer to establish an illusion of possessing substantial information on a topic by presenting truthful information. the presentation could be explicit; for example, the interviewer could talk openly about what they know, like a coherent story. nevertheless, the goal of creating the illusion of omniscience must be pursued subtly, leading the interviewee to perceive that they are unlikely to reveal anything new. it is worth noting that exponents of the scharff technique advocate the use of truthful information, not false evidence ploys.

• confirmations and disconfirmations over direct questions. instead of posing direct questions, the scharff technique recommends that the interviewer present claims in a manner that invites the interviewee to either confirm or disconfirm them. here is an example fashioned after one provided by oleszkiewicz (2016). suppose an interviewer wanted to know whether an event is likely to happen at location-x or location-y. the interviewer could ask a direct question: will the event happen at location-x or location-y? alternatively, one could embed the question in a claim: we know the event is likely to happen at location-x, not location-y. the interviewer is to discern whether the interviewee confirms or disconfirms the claim in any way. drawing on grice (1975), luke (2021) notes that the confirmation-disconfirmation tactic may derive from the architecture of conversational norms; interlocutors typically correct each other's mistakes if any arise. oleszkiewicz et al.
(2014) mention that confirmation and disconfirmation alleviate the responsibility on the interviewee to take the initiative of sharing new information. importantly, the tactic minimizes interviewees' perception of the magnitude of their contribution to the interviewer's knowledge.

• ignoring new information. this component recommends that the interviewer avoid indicating that an interviewee has disclosed anything significant. the interviewer may achieve this goal in several ways, for example, changing topics after the interviewee discloses something significant, downplaying the information, or ignoring it. luke (2021) theorizes that not acknowledging new information may complement other components like not pressing for information and the confirmation-disconfirmation tactic. this symbiosis may bolster the illusion that the interviewer is indeed omniscient on the relevant topic.

the interested reader may consult oleszkiewicz (2016) for a more thorough discussion of the scharff components. overall, the scharff technique's current conception calls for the interviewer to ensure that the interviewee experiences a pleasant interview. interviewers must strive to view interviewees as persons, not just sources of information (oleszkiewicz, 2016). hence, the congeniality should be as authentic as possible, without threats, accusations, emotional manipulation, or accusatory minimization (luke, 2021; oleszkiewicz, 2016).

2 as one can infer, scharff fought for nazi germany, which begs an ethics discussion at a different level of analysis. suppose an interviewing method is consent-worthy by the standard this article proposes. in that case, does it matter who or what inspired the method? are there any conditions that ethically justify, or that should preempt, the development of consent-worthy interviewing methods? i am preparing to address these questions in an upcoming article, but see luke (2017).
generally speaking, the ethos of the technique is laudable and respects interviewees' dignity. based on the extant scientific literature on the scharff technique, one can infer that proponents intend for interviewers to use it to elicit information from human intelligence (humint) interviewees (e.g., granhag & hartwig, 2015). this specification is a good start, and the scharff technique literature offers more in that regard than, say, the cognitive approach to lie detection (see, e.g., vrij, fisher, et al., 2017). in fact, this analysis is possible because the scharff technique literature provides some specifications about the intended target population. nonetheless, the scharff body of work is silent regarding which humint interviewees can undergo the technique. for example, is the technique applicable when interviewing children, adolescents, or adults with intellectual disabilities, or is it designed for interviews with neurotypical adults only? to be fair, as i mentioned earlier, this issue is pervasive across the investigative interviewing literature; it is not specific to the scharff technique. furthermore, it is not clear whether the scharff technique is recommended for eliciting intelligence in criminal investigations, non-criminal investigations, or both. to examine the ethics of any interviewing method, researchers must specify its intended applications. it may be worthwhile to approach that task in a manner analogous to the safe and balanced prescribing of medication. the idea is that medical professionals should strive to prescribe medicines appropriate to the condition under treatment, with a dosage regimen that minimizes harm to the patient (aronson, 2006). thinking along the same lines, it is useful for research psychologists to accompany their recommendations of interviewing techniques with dosage regimens that promote judicious usage and that flag and preempt abuse.
next follows an elaboration of how one can offer an appropriate dosage regimen in the investigative interviewing context.

• (i) one must first identify the categories of interviewees from which the method aims to elicit information. one must outline why it is permissible to subject that category of interviewees to the recommended technique. there should be empirical evidence or a theoretical discussion describing how the method provides agency that allows the relevant interviewees control over their disclosures.

• (ii) it is equally vital to identify potential categories of interviewees that should not be subjects of the technique. if possible, authors should also specify, with theoretical examination or empirical evidence, why it is not permissible to subject a category of interviewees to a proposed technique. these specifications allow clear identification of a technique's ethical boundaries. imagine that the technique seriously undermines the agency of some relevant category of interviewees, but this is left unspecified. suppose the technique provides neurotypical adults agency to decide their disclosures. however, for whatever reason, when adults with intellectual disabilities undergo the technique, they mistake the interview for a casual conversation. say it is not immediately apparent to them that what they are saying constitutes disclosures on the record. in that case, the technique fails the inquiry-clarity and disclosure-awareness principles. the vulnerable class of interviewees is not aware that the interviewer is asking for information on the record, nor that they are choosing to disclose information on that record.

• (iii) finally, authors should aim to anticipate possible misapplications and preempt such possibilities by describing such situations.
for example, exponents of the scharff technique try to prevent misapplication by noting that interviewers should only use truthful information, not false evidence ploys, to create the omniscience illusion. proponents of an interviewing method cannot always be available to chaperone practicing interviewers who may misuse the proposed method such that it falls short of consent-worthiness. for that reason, leaving the permitted and protected categories of interviewees unspecified, or potential misapplications unaddressed, allows dangerous possibilities for abuse. without specification, one can justify any usage by claiming the technique is recommended in the scientific literature and is therefore ethical. specifying the permitted and protected categories and flagging possible misuse will challenge researchers to more thoroughly consider the possible risks a seemingly noncoercive technique could be inviting. exposing those risks will lead researchers to take preemptive action by discontinuing the technique, outlining its boundary conditions, or redesigning it to be fail-safe if possible. let us now explore the scharff technique's consent-worthiness given what is published concerning (a) the category of interviewees from which the method aims to elicit information; (b) the associated risks of being an interviewee in that relevant category; and (c) the effects of the scharff technique on how interviewees think and behave. based on the extant research, one can assume that the scharff technique is purposed to elicit information from humint interviewees who are neurotypical adults. a recent meta-analysis examined experiments where neurotypical adults underwent the scharff technique compared to direct questioning (luke, 2021). all things being equal, direct questions arguably reveal a questioner's inquiry clearly. a neurotypical adult answering those questions is likely to be aware of what they are choosing to disclose.
all things being equal, direct questions satisfy the inquiry-clarity and disclosure-awareness principles in serial order. by the standard this article proposes, direct questioning is consent-worthy when interviewing neurotypical adults in an intelligence interview. the meta-analysis indicated that the scharff technique generally influences humint interviewees as intended (luke, 2021). the method leads interviewees to disclose more new information. importantly, under scharffian questioning, interviewees underestimate the amount of new information they disclose relative to what they actually revealed. the technique also makes interviewees perceive the interviewer as more knowledgeable on the topics of discussion. finally, the scharff technique leads interviewees to report greater difficulty in deciphering the interviewer's questioning objectives. further research needs to elucidate how the various scharffian components contribute to the effects just described. however, one can infer that, taken together, the components may lead neurotypical adults to unwittingly disclose some information in a humint interview. the use of the word some is not trivial. the meta-analysis (luke, 2021) suggests that for certain questions, the scharff technique obscures inquiry-clarity insofar as interviewees report difficulty understanding what information the interviewer is after. additionally, interviewees disclose more new information and underestimate the amount they actually disclose. that result suggests that the scharff technique may obscure disclosure-awareness of some items of information. in the worst-case scenario, during a criminal investigation, for example, when questioning an informant, the technique may reduce the interviewee's agency in deciding whether to reveal or withhold some information items. suppose those items are not bona fide criminal secrets but private information outside the scope of the investigation.
in that case, the scharff technique's influence on neurotypical adults' agency is a cause for concern. as discussed previously, epistemic limitations prevent knowing beforehand whether one possesses bona fide criminal secrets.3 governing entities have a monopoly on coercion and could compel people to interviews, at least in criminal investigations. interviewees' recourse against being excessively compelled is to decide their disclosures and to object to providing information if they wish (skerker, 2010). proponents must address this potential ethical boundary of the scharff technique when it comes to humint interviewees in criminal investigations. concerns similar to the one just described arise when considering scharffian influence on humint interviewees in non-criminal investigations. an example of a non-criminal investigation is a clandestine interview to collect information aimed at directing foreign policy. as discussed earlier, the risks in those interviews are highly unpredictable. typically, there are hardly any codified rules (e.g., constitutions) stipulating the potential outcomes of disclosure for the interviewee. those outcomes could be innocuous or pernicious, but one cannot be entirely sure before disclosing the information. the scharff technique reduces agency to decide some disclosures. hence, the technique may prevent an interviewee from withholding what they have not formed a clear intention to share. for those items of information, the scharff technique falls short of the inquiry-clarity and disclosure-awareness principles. there is a need to examine this flagged potential ethical boundary that could arise in non-criminal investigations. as noted previously, the present analysis is preliminary. the current conception of the scharff technique may change.

3 in a liberal democracy, withholding a criminal secret is morally inconsistent with the expectation that governing entities should ensure moral equality when a deviation occurs (skerker, 2010).
there is also a need for more research to examine what precisely the scharffian components cause interviewees to disclose. the meta-analysis indicates that the scharff technique, compared to direct questions, leads interviewees to disclose a higher amount of information (luke, 2021). nonetheless, it is unclear what information people share in response to the technique. the underlying mechanisms of the technique remain largely unknown. thus, it is uncertain which components exert the elicitation influences and why interviewees reveal the information they do. perhaps scharffian tactics generally elicit information that interviewees must legitimately reveal, for example, criminal secrets. further research may instead indicate that the technique leads interviewees to unwittingly share information irrespective of the information's characteristics. these questions remain unanswered. answering the pending research questions will assist in conducting further ethical analysis to comprehensively determine the scharff technique's consent-worthiness.

concluding remarks

this article has proposed an ethics standard that may guide the formulation of psychological interviewing methods. that is, interviewing techniques should provide interviewees, particularly vulnerable ones, enough agency to freely determine what to disclose. this standard is feasible in systems of governance, pertinently liberal democracies, that arguably respect the natural rights of the governed and of persons in general. applying the principles to develop interviewing methods in immoral contexts, like totalitarian regimes that do not aim to respect the rights of the governed, would be unethical by default. in their publications, researchers should explicitly indicate the boundary conditions of a technique if it cannot achieve the proposed standard. journal editors and reviewers should request such discussions.
it is worth noting that this proposal does not call for researchers to cease exploring how people think and behave in investigative interviews. as such, i do not intend for this work to stifle the exploration of interviewing methods. it is vital to understand the enterprise of interviewing in order to navigate it ethically. the suggested standard tasks research psychologists with being circumspect about recommending psychological techniques without (fully) addressing the ethical boundaries of those methods in their publications. the field must adopt fairly unambiguous metrics to examine the ethics of the interviewing methods authors aim to publish. we must relinquish the comforting fiction that techniques devoid of blatant physical or psychological coercion are automatically ethical without actively investigating those methods. indubitably immoral techniques that take advantage of suggestible interviewees to elicit false confessions or collect information nefariously should not be the standard against which research psychologists determine ethical techniques; that criterion is too low. psychology research needs further theoretical and empirical work that proactively ensures that current and future publications of interviewing methods protect various interviewees against potential manipulation. in addition to thinking about how a method may facilitate disclosure, research psychologists must also be cognizant of the potential misuse or misapplication of the techniques they publish. this article's contribution is that it specifies criteria by which to examine the ethical nature of the interviewing methods researchers publish, reducing ambiguity about what constitutes psychological manipulation. investigative interviewing entails considerable moral risks. stakeholders must take steps to avoid ambiguity about what constitutes consent-worthy methods of eliciting information from interviewees.

author contact

david a.
neequaye, 0000-0002-7355-2784, department of psychology, university of gothenburg, sweden. i thank lorraine hope and erik mac giolla for providing immensely useful comments on earlier versions of this work. i am fully and solely responsible for any errors in this article. correspondence to: david a. neequaye, department of psychology, university of gothenburg, box 500, 405 30 gothenburg, sweden; email: david.neequaye@psy.gu.se

conflict of interest and funding

i have no conflict of interest to declare. i received no specific funding for this research.

author contributions

david a. neequaye: original conceptualization, investigation, project administration, writing – original draft, writing – review & editing

open science practices

this article is purely theoretical and as such did not receive any open science badges. there was no statistical analysis. the entire editorial process, including the open reviews, is published in the online supplement.

references

alison, l., & alison, e. (2017). revenge versus rapport: interrogation, terrorism, and torture. american psychologist, 72(3), 266–277. https://doi.org/10.1037/amp0000064

american psychological association. (2002). ethical principles of psychologists and code of conduct. american psychologist, 57(12), 1060–1073. https://doi.org/10.1037/0003-066x.57.12.1060

aronson, j. k. (2006). balanced prescribing. british journal of clinical pharmacology, 62(6), 629–632. https://doi.org/10.1111/j.1365-2125.2006.02825.x

bandes, s. a. (2009). protecting the innocent as the primary value of the criminal justice system (book review). ohio state journal of criminal law, 7(1), 413–438. retrieved october 11, 2020, from https://heinonline.org/hol/p?h=hein.journals/osjcl7&i=417

bayles, m. e. (2012). procedural justice: allocating to individuals. springer science & business media.

behnke, s. (2006).
apa's ethical principles of psychologists and code of conduct: an ethics code for all psychologists...? monitor on psychology. https://www.apa.org/monitor/sep06/ethics

bok, s. (1989). secrets: on the ethics of concealment and revelation. vintage.

burkett, r. (2013). an alternative framework for agent recruitment: from mice to rascls. studies in intelligence, 57(1), 12.

carter, d. l. (1990). law enforcement intelligence operations: an overview of concepts, issues and terms. national institute of justice. retrieved october 4, 2022, from https://www.ojp.gov/ncjrs/virtual-library/abstracts/law-enforcement-intelligence-operations-overview-concepts-issues

cassell, p. (2017). can we protect the innocent without freeing the guilty? thoughts on innocence reforms that avoid harmful tradeoffs. in wrongful convictions and the dna revolution: twenty-five years of freeing the innocent (p. 25). cambridge university press.

clarke, c. (2001). national evaluation of investigative interviewing peace course (tech. rep.). home office, london.

cohen, s. (2005). post-moral torture: from guantanamo to abu ghraib. index on censorship, 34(1), 24–30. https://doi.org/10.1080/03064220512331339427

college of policing. (2019). investigative interviewing. retrieved june 26, 2019, from https://www.app.college.police.uk/app-content/investigations/investigative-interviewing/#peace-framework

dando, c. j., & bull, r. (2011). maximising opportunities to detect verbal deception: training police officers to interview tactically. journal of investigative psychology and offender profiling, 8(2), 189–202. https://doi.org/10.1002/jip.145

fallon, m. (2014). collaboration between practice and science will enhance interrogations. applied cognitive psychology, 28(6), 949–950. https://doi.org/10.1002/acp.3091

farrugia, l., & gabbert, f. (2020).
vulnerable suspects in police interviews: exploring current practice in england and wales. journal of investigative psychology and offender profiling, 17(1), 17–30. https://doi.org/10.1002/jip.1537

frith, c. d. (2014). action, agency and responsibility. neuropsychologia, 55, 137–142. https://doi.org/10.1016/j.neuropsychologia.2013.09.007

glasius, m. (2006). the international criminal court: a global civil society achievement. routledge. https://doi.org/10.4324/9780203414514

granhag, p. a., & hartwig, m. (2015). the strategic use of evidence technique: a conceptual overview. in a. vrij & b. verschuere (eds.), deception detection: current challenges and new directions (pp. 231–251). john wiley & sons, ltd. https://doi.org/10.1002/9781118510001.ch10

grice, h. p. (1975). logic and conversation. speech acts, 41–58. https://doi.org/10.1163/9789004368811_003

gudjonsson, g. (2005). disputed confessions and miscarriages of justice in britain: expert psychological and psychiatric evidence in the court of appeal. manitoba law journal, 31(3), 489–522. retrieved october 25, 2020, from https://heinonline.org/hol/p?h=hein.journals/manitob31&i=497

haggard, p., & tsakiris, m. (2009). the experience of agency: feelings, judgments, and responsibility. current directions in psychological science.
retrieved october 18, 2020, from
https://journals.sagepub.com/doi/10.1111/j.1467-8721.2009.01644.x

hartwig, m., luke, t. j., & skerker, m. (2016). ethical perspectives on interrogation: an analysis of contemporary techniques. in j. jacobs & j. jacobs (eds.), the routledge handbook of criminal justice ethics. routledge.

hartwig, m., meissner, c. a., & semel, m. d. (2014). human intelligence interviewing and interrogation: assessing the challenges of developing an ethical, evidence-based approach. in r. bull (ed.), investigative interviewing (pp. 209–228). springer. https://doi.org/10.1007/978-1-4614-9642-7_11

inbau, f. e., reid, j., buckley, j., & jayne, b. (2001). criminal interrogation and confessions (4th ed.). aspen publishers, inc.

kassin, s. m. (2017). false confessions: how can psychology so basic be so counterintuitive? american psychologist, 72(9), 951–964. https://doi.org/10.1037/amp0000195

kassin, s. m., & kiechel, k. l. (1996). the social psychology of false confessions: compliance, internalization, and confabulation. psychological science, 7(3), 125–128. https://doi.org/10.1111/j.1467-9280.1996.tb00344.x

luke, t. (2017). the moral backdrop of interrogations [rabbit tracks]. retrieved march 22, 2021, from https://www.rabbitsnore.com/2017/11/the-moral-backdrop-of-interrogations.html

luke, t. (2021). a meta-analytic review of experimental tests of the interrogation technique of hanns joachim scharff. applied cognitive psychology, acp.3771. https://doi.org/10.1002/acp.3771

meissner, c. a. (2021).
"what works?" systematic reviews and meta-analyses of the investigative interviewing research literature. applied cognitive psychology, 35(2), 322–328. https://doi.org/10.1002/acp.3808

meissner, c. a., kelly, c. e., & woestehoff, s. a. (2015). improving the effectiveness of suspect interrogations. annual review of law and social science, 11(1), 211–233. https://doi.org/10.1146/annurev-lawsocsci-120814-121657

moore, j. w. (2016). what is the sense of agency and why does it matter? frontiers in psychology, 7. https://doi.org/10.3389/fpsyg.2016.01272

morrison, w. (2013). what is crime? contrasting definitions and perspectives. in c. hale, k. hayward, a. wahidin, & e. wincup (eds.), criminology. oup oxford.

mukand, s. w., & rodrik, d. (2020). the political economy of liberal democracy. the economic journal, 130(627), 765–792. https://doi.org/10.1093/ej/ueaa004

nahari, g., vrij, a., & fisher, r. p. (2014). exploiting liars' verbal strategies by examining the verifiability of details. legal and criminological psychology, 19(2), 227–239. https://doi.org/10.1111/j.2044-8333.2012.02069.x

newton, t. (1998). the place of ethics in investigative interviewing by police officers. the howard journal of criminal justice, 37(1), 52–69. https://doi.org/10.1111/1468-2311.00077

nozick, r. (1974). anarchy, state, and utopia (vol. 5038). new york: basic books.

oleszkiewicz, s. (2016). eliciting human intelligence: a conceptualization and empirical testing of the scharff-technique (doctoral dissertation). department of psychology, university of gothenburg, göteborg. retrieved november 27, 2018, from http://hdl.handle.net/2077/41567

oleszkiewicz, s., granhag, p. a., & kleinman, s. m. (2014). on eliciting intelligence from human sources: contextualizing the scharff-technique. applied cognitive psychology, 28(6), 898–907. https://doi.org/10.1002/acp.3073

o'mahony, b. m., milne, b., & grant, t. (2012).
to challenge, or not to challenge? best practice when interviewing vulnerable suspects. policing: a journal of policy and practice, 6(3), 301–313. https://doi.org/10.1093/police/pas027

pieck, m. (1960). witness privilege against self-incrimination in the civil law. villanova law review, 5, 33.

porter, s., rose, k., & dilley, t. (2016). enhanced interrogations: the expanding roles of psychology in police investigations in canada. canadian psychology/psychologie canadienne, 57(1), 35–43. https://doi.org/10.1037/cap0000042

rachlew, a. (2017). from interrogating to interviewing suspects of terror: towards a new mindset [penal reform international]. retrieved august 24, 2020, from https://www.penalreform.org/blog/interrogating-interviewing-suspects-terror-towards-new-mindset/

ransom, h. h. (2013). the intelligence establishment. harvard university press. retrieved august 31, 2020, from https://hup.degruyter.com/view/title/323605

rawls, j. (1971). a theory of justice. harvard university press.

read, d., & craik, f. i. m. (1995).
earwitness identification: some influences on voice recognihttps://journals.sagepub.com/doi/10.1111/j.1467-8721.2009.01644.x https://journals.sagepub.com/doi/10.1111/j.1467-8721.2009.01644.x https://doi.org/10.1007/978-1-4614-9642-7_11 https://doi.org/10.1007/978-1-4614-9642-7_11 https://doi.org/http://dx.doi.org.ezproxy.ub.gu.se/10.1037/amp0000195 https://doi.org/http://dx.doi.org.ezproxy.ub.gu.se/10.1037/amp0000195 https://doi.org/http://dx.doi.org.ezproxy.ub.gu.se/10.1037/amp0000195 https://doi.org/10.1111/j.1467-9280.1996.tb00344.x https://doi.org/10.1111/j.1467-9280.1996.tb00344.x https://www.rabbitsnore.com/2017/11/the-moral-backdrop-of-interrogations.html https://www.rabbitsnore.com/2017/11/the-moral-backdrop-of-interrogations.html https://doi.org/10.1002/acp.3771 https://doi.org/https://doi.org/10.1002/acp.3808 https://doi.org/https://doi.org/10.1002/acp.3808 https://doi.org/10.1146/annurev-lawsocsci-120814-121657 https://doi.org/10.1146/annurev-lawsocsci-120814-121657 https://doi.org/10.3389/fpsyg.2016.01272 https://doi.org/10.1093/ej/ueaa004 https://doi.org/10.1093/ej/ueaa004 https://doi.org/https://doi.org/10.1111/j.2044-8333.2012.02069.x https://doi.org/https://doi.org/10.1111/j.2044-8333.2012.02069.x https://doi.org/10.1111/1468-2311.00077 https://doi.org/10.1111/1468-2311.00077 http://hdl.handle.net/2077/41567 https://doi.org/10.1002/acp.3073 https://doi.org/10.1093/police/pas027 https://doi.org/10.1037/cap0000042 https://www.penalreform.org/blog/interrogating-interviewing-suspects-terror-towards-new-mindset/ https://www.penalreform.org/blog/interrogating-interviewing-suspects-terror-towards-new-mindset/ https://www.penalreform.org/blog/interrogating-interviewing-suspects-terror-towards-new-mindset/ https://hup.degruyter.com/view/title/323605 https://hup.degruyter.com/view/title/323605 15 tion. journal of experimental psychology: applied, 1(1), 6–18. https://doi.org/http://dx. doi . org . ezproxy. ub . gu . se / 10 . 
Seidmann, D. J., & Stein, A. (2000). The right to silence helps the innocent: A game-theoretic analysis of the Fifth Amendment privilege. Harvard Law Review, 114(2), 430–510. https://doi.org/10.2307/1342573
Skerker, M. (2010). An ethics of interrogation. University of Chicago Press.
Snook, B., Barron, T., Fallon, L., Kassin, S. M., Kleinman, S., Leo, R. A., Meissner, C. A., Morello, L., Nirider, L. H., Redlich, A. D., & Trainum, J. L. (2020). Urgent issues and prospects in reforming interrogation practices in the United States and Canada. Legal and Criminological Psychology, 26(1), 1–24. https://doi.org/10.1111/lcrp.12178
Spalek, B. (2016). Crime victims: Theory, policy and practice. Macmillan International Higher Education.
Sukumar, D., Wade, K. A., & Hodgson, J. S. (2016). Strategic disclosure of evidence: Perspectives from psychology and law. Psychology, Public Policy, and Law, 22(3), 306–313. https://doi.org/10.1037/law0000092
Tadros, V. (2011). Consent to harm. Current Legal Problems, 64(1), 23–49. https://doi.org/10.1093/clp/cur004
Tierney, J. (2009). Criminology: Theory and context. Pearson Education.
Vrij, A. (2018). Deception and truth detection when analyzing nonverbal and verbal cues. Applied Cognitive Psychology. https://doi.org/10.1002/acp.3457
Vrij, A., Fisher, R. P., & Blank, H. (2017). A cognitive approach to lie detection: A meta-analysis. Legal and Criminological Psychology, 22(1), 1–21. https://doi.org/10.1111/lcrp.12088
Vrij, A., & Granhag, P. A. (2014). Eliciting information and detecting lies in intelligence interviewing: An overview of recent research. Applied Cognitive Psychology, 28(6), 936–944. https://doi.org/10.1002/acp.3071
Vrij, A., Meissner, C. A., Fisher, R. P., Kassin, S. M., Morgan, C. A., & Kleinman, S. M. (2017). Psychological perspectives on interrogation. Perspectives on Psychological Science, 12(6), 927–955. https://doi.org/10.1177/1745691617706515
Wells, G. L., & Olson, E. A.
(2003). Eyewitness testimony. Annual Review of Psychology, 54, 277–295. https://doi.org/10.1146/annurev.psych.54.101601.145028
Williamson, T. M. (1993). From interrogation to investigative interviewing; strategic trends in police questioning. Journal of Community & Applied Social Psychology, 3(2), 89–99. https://doi.org/10.1002/casp.2450030203

Meta-Psychology, 2022, Vol 6, MP.2021.2808. https://doi.org/10.15626/mp.2021.2808. Article type: Original Article. Published under the CC-BY4.0 license. Open data: Yes. Open materials: Yes. Open and reproducible analysis: Yes. Open reviews and editorial process: Yes. Preregistration: Yes. Edited by: Rickard Carlsson. Reviewed by: Streamlined peer review. Analysis reproduced by: Lucija Batinović. All supplementary files can be accessed at OSF: https://doi.org/10.17605/osf.io/gx3bk

The Use of the Term Rapport in the Investigative Interviewing Literature: A
Critical Examination of Definitions

David A. Neequaye, University of Gothenburg
Erik Mac Giolla, University of Gothenburg

Abstract

Researchers typically note that there is much divergence about how rapport is defined in the investigative interviewing literature. We examined the scope of this divergence, the commonalities of extant definitions, and how the current state of affairs impacts the scientific investigation of rapport. We obtained 228 publications that discussed rapport in an investigative interviewing context. Only thirty-two publications (14%) explicitly defined rapport. Twenty-two of those definitions were unique. All of the definitions implied that rapport centers on the quality of the interviewer-interviewee interaction. However, the definitions ascribed different attributes when describing more specifically how rapport relates to the quality of interpersonal interactions. A thematic analysis revealed six major attributes by which rapport could be characterized. The attributes were communication, mutuality, positivity, respect, successful outcomes, and trust. These attributes were disparately distributed across the definitions. Based on the considerable disparity in its definitions, we question the theoretical and practical value of the term rapport. The current situation creates ambiguity about the meaning of rapport and impedes its objective assessment. To avoid further ambiguity, we believe the field must collectively determine a finite set of attributes to denote the term rapport. Until those attributes are determined, stakeholders should stop indiscriminately using the word rapport to describe any collection of attributes of the interviewer-interviewee interaction.
Keywords: definitions, rapport, investigative interviewing, source, suspect, victim, witness

A Critical Examination of Rapport

"Before we inquire into origins and functional relations, it is necessary to know the thing we are trying to explain" (Asch, 1952, reprinted in 1987).

This work examines the use of the term rapport in the extant investigative interviewing literature. Here, an investigative interview refers to a social interaction in which human interviewers solicit information from human sources (i.e., interviewees) for security reasons or legal purposes. It is generally accepted that rapport is important for conducting successful—that is, ethical and effective—interviews (e.g., Vrij et al., 2017). (Effective denotes the interviewer achieving the goals of the interview.) However, existing literature reviews note that there is much divergence about how rapport is defined (Abbe & Brandon, 2013, 2014; Gabbert et al., 2020; Vallano & Schreiber Compo, 2015; Vanderhallen & Vervaeke, 2014). Our aim is to systematically explore the extent of such variance, the commonalities shared by extant definitions, and how this state of affairs may influence the scientific investigation of rapport.

The Value of Rapport

The extant investigative interviewing literature univocally suggests that rapport is a critical component of successful interviewing. For example, the importance of rapport has been described in the following ways: "a necessary condition for a successful interview" (Abbe & Brandon, 2013, p. 241); "the cornerstone of any attempt to elicit information from an uncooperative source" (Kelly et al., 2013, p. 169); and "the heart of a good interview" (St-Yves, 2006, p. 92).
Moreover, researchers credit the successful inclusion of rapport in an interview with benefits such as the following: children's plentiful and accurate disclosures of details about sexual abuse (e.g., Hershkowitz et al., 2015; Sternberg et al., 1997); adult witnesses' improved recall and cooperativeness (e.g., Collins et al., 2002; Kieckhaefer et al., 2014; Nash et al., 2016; Duke et al., 2018); and crime suspects' tendency to engage with interviewers and disclose information (e.g., Alison et al., 2014; Alison et al., 2013; Kelly et al., 2016; Walsh & Bull, 2012). In light of these advantages, the excellence of investigative interviews in the field (as opposed to laboratory experiments) is often judged, in part, by the extent to which interviewers are able to establish rapport in an interview (see, e.g., Clarke & Milne, 2001; Schreiber Compo et al., 2012). Major interviewing regulations recommend that interviewers build and maintain rapport with interviewees throughout interviews. These regulations include the PEACE model (see, e.g., Bull & Milne, 2004; College of Policing, 2013) and the Achieving Best Evidence (ABE) guidelines (Home Office, 2011)—commonly implemented in the United Kingdom; and the Cognitive Interview (Fisher & Geiselman, 1992), the National Institute of Child Health and Human Development protocol (NICHD; Lamb et al., 2007), and the Army Field Manual (AFM 2-22.3; Department of the Army, 2006)—commonly applied in the United States.

Problems in Defining Rapport

Although stakeholders endorse rapport univocally, there seems to be uncertainty about what rapport entails or should entail. For example, Abbe and Brandon (2014, p. 207) note that the extent to which rapport has a similar meaning across different countries and interviewing contexts is unclear. Vallano and Schreiber Compo (2015, p. 86) mention that the existing work is unable to provide a clear and consistent definition of rapport.
Vanderhallen and Vervaeke (2014, p. 77) note that the term rapport is conceptually weak in the literature. Saywitz et al. (2015, p. 383) found that research has not defined the critical elements of rapport clearly. Sauerland and colleagues (2018, p. 269) write that definitions (and operationalizations) of rapport are vague and varying. Typically, defining a construct is a prerequisite to its operationalization and analysis. Inadequate definitions and operationalizations obstruct the measurement of a construct, obscuring the inferences one can draw (Shadish et al., 2002). In any body of work, a reference point that defines the fundamental aspects of a construct is central to measuring the construct coherently and comprehensibly. Kripke's (1972) work on reference-fixing in stipulative definitions further highlights the importance of establishing definitions with a common reference point. Kripke describes a peculiar type of stipulative definition whereby one introduces a term (e.g., a name) using a description that tells the audience what the speaker is referring to by the term. For example, Joanne Rowling (i.e., the term) is the author of the Harry Potter novels (i.e., the description). In the current example, the speaker's description is not synonymous with Joanne Rowling. The description fixes a reference indicating what the speaker means when saying Joanne Rowling. The reference, in the pertinent aspects, is now univocal and clarifies what the speaker is saying. However, such reference-fixing, according to Kripke, is also contingent, pending further insight that may lead to revising the reference. For example, Joanne Rowling is the author of the Harry Potter and Fantastic Beasts novels. A term like Joanne Rowling is rigid; the name Joanne Rowling is always the name Joanne Rowling. Conversely, fixing the term's (i.e., the name's) reference can be non-rigid. One can describe Joanne Rowling in many different ways (see also Gupta, 2015, on Kripke, 1972).
Problems with reference-fixing arise when users of the same rigid designator use the term without sufficiently explaining how their use builds on or relates to previous uses. Such an explanation is required so that Joanne Rowling, author of the Harry Potter and Fantastic Beasts novels, is not confused with another, identically named Joanne Rowling—who happens to work at a chocolate factory. In order for two people to have an unambiguous conversation about Joanne Rowling, they must be referring to the same person. Applying Kripke's musings to rapport definitions, we can infer that across the investigative interviewing literature, the term rapport is a rigid designator. However, as mentioned, there is great uncertainty about how the reference for the term rapport is fixed. This reference-fixing issue creates ambiguity about the meaning of rapport across the literature. For two people to have an intelligible conversation about Joanne Rowling, they need to know that they are referring to the same person. Similarly, to systematically research the construct rapport, researchers need to know that they are referring to the same construct. Ideally, this would require some univocal baseline description, or reference point, of rapport in investigative interviewing contexts. By reviewing the extant definitions of rapport, we hoped to identify such a reference point.

[Footnote: PEACE is an acronym representing the following five interview stages: planning and preparation; engaging and explaining; asking an interviewee's account of events; closure; and evaluation. Walsh and Bull (2015) mention that Australia, Canada, and New Zealand also adopt the PEACE model.]

Method

Search strategy. We aimed to gather a comprehensive list of definitions of the term rapport within the academic literature (in English) on investigative interviewing. A literature search was carried out on the PsycINFO database.
The primary search word was rapport, which was combined with terms specifying the field of investigative interviewing. We searched full texts, not just abstracts or keywords, to allow us to flag any literature that fits the scope of the review—that is, literature that examines rapport as a main and/or secondary investigation within the investigative interviewing context. The formal search strategy was:

"rapport" AND "investigative interview*" OR "suspect*" OR "eyewitness*" OR "police" OR "interrogation" OR "cognitive interview" OR "peace model" OR "intelligence gathering" OR "nichd"

We complemented the formal search strategy with an informal search of relevant review articles, official documents, and publication lists of key researchers in the field. We preregistered the parameters of our review here: https://osf.io/5zha3/

Identification of definitions. We reviewed the selected literature to identify the extant unique definitions of rapport by examining each publication's entire text. The first criterion for determining definitions was the authors' explicit, or sufficiently clear, indication that by a certain sentence, or parts of the sentence, they are bringing a reader to know the meaning they (i.e., the authors) have assigned to the term rapport. Using this rule, we extracted as definitions the predicates of sentences whose subjects resemble the following forms: rapport is defined as, refers to, indicates, is regarded as, is conceptualized as, or is described as something—and something is the predicate. In some cases, the word rapport was not necessarily the subject of a sentence from which we identified a potential definition. Here, we determined the authors' intention to provide such a rapport definition from the immediately surrounding discussion. The second criterion for identifying definitions was that the intended meaning of rapport could be fully deciphered from the text provided.
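The first inclusion criterion can be approximated programmatically. The sketch below is illustrative only—the function `looks_like_definition` and its cue pattern are our own construction, not part of the review's preregistered materials—and shows how sentences that introduce rapport with one of the definitional verb phrases listed above could be flagged for manual review:

```python
import re

# Cues mirroring the first inclusion criterion: "rapport" followed (within
# the same sentence) by a definitional verb phrase.
DEFINITIONAL_CUES = re.compile(
    r"\brapport\b[^.]*?\b(?:is defined as|refers to|indicates|"
    r"(?:is )?regarded as|(?:is )?conceptualized as|(?:is )?described as)\b",
    re.IGNORECASE,
)

def looks_like_definition(sentence: str) -> bool:
    """Heuristic screen; flagged sentences still require manual review,
    since definitions can also appear without rapport as the subject."""
    return DEFINITIONAL_CUES.search(sentence) is not None
```

A screen like this only narrows the candidate set; the second criterion (whether the intended meaning is fully decipherable) remains a human judgment.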
Thus, we disregarded potential definitions whereby authors describe the nature of rapport by referring to anecdotes about a particular investigative interview whose content cannot be verified thoroughly or objectively.

Thematic analysis. We thematically analyzed the definitions following Braun and Clarke's (2006) recommendations. We began by familiarizing ourselves with the definitions and followed this with an initial coding phase where codes closely represented the data. Based on these initial codes, we created broader and more interpretative attributes of the definitions. These broader attributes were then compared to the initial codes for fit and altered accordingly. Figure 1 illustrates the attributes in the specific definitions. For a detailed explanation of how each code relates to each definition, see the supplemental figure (https://osf.io/qxw2r/). We strongly encourage a reading of the supplemental figure, as it demonstrates the intricacies of the thematic analysis.

Overview. The search obtained 228 relevant publications, and we analyzed all the articles; none were excluded. Approximately 86% of these publications did not provide a definition of rapport. That is, authors invoked the term without any clear definition or without explaining in detail how the current invocation is consistent with or different from prior uses of rapport. An inevitable consequence of this finding is that this article's reference list does not contain publications that did not define rapport. One can find a comprehensive reference list at the following link: https://osf.io/5zha3/ Of the 32 publications that defined rapport, we identified 22 unique definitions. Six of the definitions were proxied by the authors from other sources but met our inclusion criteria. Thematic analysis of the 22 definitions uncovered one overarching reference point and six subordinate attributes of rapport.
The overarching reference point was that all definitions referred in some way to the quality of the interviewer-interviewee interpersonal interaction. However, the six subordinate attributes showed considerable variance across definitions. No single attribute was common to all the definitions. Table 1 includes a detailed list of all the definitions and a chart of the corresponding attributes. The table provides a quick snapshot of the variance in attributes. The most common attribute, "positivity"—that rapport implies a positive interaction—was mentioned in approximately 68% of definitions. Each of the remaining attributes was mentioned in approximately 41% of definitions or fewer. This finding suggests that there is considerable variance in how rapport is defined in the investigative interviewing literature, with the caveat that most definitions propose that rapport in part refers to a positively valenced interaction. The subsequent paragraphs describe each main attribute in turn.

Positivity. This attribute captured definitions which held that rapport implied a positive interpersonal interaction—that is, an interaction which an interviewer or interviewee might consider desirable. This categorization included definitions that invoked concepts such as a working relationship, warmth, and harmony. Additionally, the positivity attribute captured definitions advocating that, to induce rapport, an interviewer should display behaviors that would increase an interviewee's positive perceptions of the interaction. These behaviors include expressing sympathy, interest in the interviewee's welfare, and acceptance of the interviewee. Fifteen of the twenty-two definitions included positivity.

Mutuality. This attribute captured definitions containing a focus on the shared characteristics between the interviewer and interviewee.
This designation thus includes mentions such as shared understanding, shared attention, common ground, and communicative alliance. Mutuality was a component of nine of the twenty-two definitions.

Communication. This attribute captures definitions that emphasize the role of communication in facilitating rapport. Furthermore, we assigned this attribute to definitions indicating that interviewers should be genuine in their dealings with an interviewee. Seven of the twenty-two definitions included the attribute of communication.

Successful outcomes. This attribute captures definitions where rapport comprises a successful interview. Such success stipulations include increasing the interviewee's cooperation and willingness to talk, as well as the productivity or amount of intelligence (viz., information) obtained from the interaction. Successful outcomes were included in seven of the twenty-two definitions.

Trust. This attribute captures definitions that explicitly mention that trust is a component of rapport. Additionally, we assigned this attribute to definitions that emphasize that rapport should increase an interviewee's confidence in an interviewer. Trust was a component of five of the twenty-two definitions.

Respect. This attribute captures definitions which explicitly mention that respect is a component of rapport. We also include definitions mentioning that inducing rapport consists of emphasizing an interviewee's autonomy through an unforced interaction. Two of the twenty-two definitions included the attribute of respect.

Reliability analysis. We subjected the thematic analysis to a reliability check. A coder was assigned to independently replicate our thematic analysis using the descriptions of the six attributes. Specifically, the coder, who was blind to the research question, rated the presence or absence of the six attributes in each of the 22 definitions.
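The agreement statistics used for such a check (percent agreement and Cohen's kappa) can be computed from two coders' binary attribute ratings as in the following sketch. The example ratings here are hypothetical, for illustration only, and are not the study's data:

```python
def percent_agreement(a, b):
    """Proportion of items on which two raters gave the same code."""
    assert len(a) == len(b) and a, "ratings must be paired and non-empty"
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two binary (0/1) ratings of the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_e is chance agreement."""
    n = len(a)
    po = percent_agreement(a, b)
    pa, pb = sum(a) / n, sum(b) / n      # each rater's base rate of 1s
    pe = pa * pb + (1 - pa) * (1 - pb)   # expected agreement by chance
    return (po - pe) / (1 - pe)

# Hypothetical data: 22 definitions x 6 attributes, flattened into one
# 132-item presence/absence vector per coder.
coder_1 = [1, 1, 0, 0, 1, 0] * 22
coder_2 = [1, 0, 0, 0, 1, 0] * 22
agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
```

Kappa corrects raw agreement for the agreement expected by chance given each coder's base rates, which is why it is reported alongside the raw percentage in analyses like the one described here.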
We subsequently examined the consistency between the coder's ratings and the results of our thematic analysis. There was 91.7% agreement between the coder's ratings and our own, κ = .81, SE = .06, 95% CI [.70, .92]. The most consistent minor disagreement arose when some attributes of a definition appeared in an adjective qualifying a noun. An example is a definition including the phrase mutual respect. Here, the coder designated the attribute denoted by the noun only (i.e., respect), but we assigned the attributes denoted by both the adjective and the noun (i.e., mutuality and respect). The reliability analysis coding can be accessed here: https://osf.io/5zha3/

Influential descriptions of rapport without explicit definitions. There were some noteworthy descriptions of the meaning of rapport that featured in the literature but did not necessarily fit our inclusion criteria. This section thus focuses on describing influential lines of rapport research that do not provide an explicit definition of the term. Unlike the examinations that hardly offered definitions, these expositions explain what rapport means in various ways, but we could not delineate the exact definitions of rapport the authors intended. The Observing Rapport-Based Interpersonal Techniques (ORBIT) framework is a notable example (see, e.g., Alison et al., 2013; Alison et al., 2014; Christiansen et al., 2018). Here, the researchers suggest that motivational interviewing is a useful tool to evoke rapport. Miller and Rollnick (2009), for example, define motivational interviewing (MI) as a collaborative, person-centered form of guiding to elicit and strengthen motivation for change. Exponents of the ORBIT framework seem to interpret the MI definition as one suggesting that rapport means 'creating a collaborative atmosphere'—that is, an interaction conducive to open communication between an interviewer and interviewee (see Alison et al., 2013, p. 413; Alison et al., 2014, p. 2).
Nonetheless, it was unclear whether the ORBIT framework implements the interpretation of MI just described as a rapport definition, or whether the framework adopts the exact MI definition—namely, the definition by Miller and Rollnick (2009).

Table 1. Definitions of rapport in the investigative interviewing literature and their corresponding attributes. Each entry lists the definition, its source (and proxied source, where applicable), and the assigned attributes.

1. "A working relationship between operator [an interviewer] and source [an interviewee] based on a mutually shared understanding of each other's goals and needs that can lead to successful actionable intelligence or information." — Kelly, Miller, Redlich, & Kleinman (2013); Redlich, Kelly, & Miller (2014); Meissner, Surmon-Böhr, Oleszkiewicz, & Alison (2017). Attributes: positivity, mutuality, successful outcomes.
2. "A shared understanding and communication between the interviewer and interviewee." — Risan, Binder, & Milne (2016). Attributes: mutuality, communication.
3. "A working relationship in which progress is made, the interviewee simply being willing to talk, the development of trust, and mutual respect." — Russano, Narchet, Kleinman, & Meissner (2014); derived from a synthesis of practitioners' definitions. Attributes: positivity, mutuality, successful outcomes, trust, respect.
4. "The relationship between [an] interviewer and interviewee." — Vallano & Schreiber Compo (2015). Attributes: none assigned.
5. "A positive or negative relationship involving trust and communication." — Vallano, Evans, Schreiber Compo, & Kieckhaefer (2015); derived from a synthesis of practitioners' definitions. Attributes: positivity, communication, trust.
6. "The bond or connection between an investigative interviewer and interviewee." — Vallano, Evans, Schreiber Compo, & Kieckhaefer (2015). Attributes: positivity.
7. "A relationship that results from interaction between people and provides participants with a warm feeling, is harmonious and natural (unforced), offers trust, and stimulates cooperation." — Vanderhallen, Vervaeke, & Holmberg (2011). Attributes: positivity, successful outcomes, trust, respect.
8. "A positive and productive affect between people that facilitates mutuality of attention and harmony." — Vrij et al. (2017); Walsh & Bull (2012); proxied from Bernieri & Gillis (2001), who sourced their definition from the Concise Oxford Dictionary (1995). Attributes: positivity, mutuality, successful outcomes.
9. "Ensuring a working relationship and effective communication throughout an interview." — Walsh & Bull (2015). Attributes: positivity, communication.
10. "Developing a harmonious relationship with another person and conveying understanding and acceptance towards that person." — Wright, Nash, & Wade (2015). Attributes: positivity, mutuality.
11. "A harmonious, sympathetic connection to another." — Childs & Walsh (2017); Collins, Lincoln, & Frank (2002); Kieckhaefer & Wright (2015); Kieckhaefer, Vallano, & Schreiber Compo (2014); Vallano & Schreiber Compo (2011); Villalba (2014); Nash, Nash, Morris, & Smith (2016); proxied from Newberry & Stubbs (1990). Attributes: positivity.
12. "Developing a positive relationship with others using an inviting approach." — Klein, Klein, Lande, Borders, & Whitacre (2015). Attributes: positivity.
13. "The process of establishing a harmonious and productive working relationship between an interviewer and interviewee." — MacDonald, Keeping, Snook, & Luther (2017). Attributes: positivity, successful outcomes.
14. "The quality of an interaction that allows individuals to communicate effectively." — Matsumoto & Hwang (2018). Attributes: communication.
15. "A smooth and positive interpersonal interaction." — Abbe & Brandon (2014). Attributes: positivity, mutuality.
16. "A state of communicative alliance." — Abbe & Brandon (2013). Attributes: mutuality, communication.
17. "The interpersonal relationship or connection between [an] interviewer and interviewee established over the course of their interaction." — Alison & Alison (2017). Attributes: positivity.
18. "The establishment of a relationship in which the people involved in the interaction understand each other and have good communication." — Collins, Doherty-Sneddon, & Doherty (2014). The authors credit Bernieri (2005); nonetheless, we could not ascertain the exact definition from the paper by Bernieri (2005) and have thus included it as a unique definition. Attributes: mutuality, communication.
19. "A psychological state in which social distance is reduced and trust increased [through the establishment of shared topics, interest, background, and other factors]." — David, Rawls, & Trainum (2018). Attributes: mutuality, trust.
20. "A harmonious, positive, and productive relationship between an interviewer and interviewee." — Ewens et al. (2016). The authors credit Evans, Houston, & Meissner (2012) and Walsh & Bull (2012); however, the definition is not exactly identical to that mentioned in Walsh & Bull (2012), and Evans et al. (2012) do not provide a rapport definition. Attributes: positivity, successful outcomes.
21. "Being genuinely open, interested, and approachable, as well as being interested in the interviewee's feelings or welfare." — College of Policing, United Kingdom (2019). Attributes: positivity, communication.
22. "A condition established by the HUMINT collector [i.e., interviewer] that is characterized by source [i.e., interviewee] confidence in the HUMINT collector and a willingness to cooperate with him." — AFM 2-22.3, Department of the Army, United States (2006). Attributes: successful outcomes, trust.

Figure 1. Overview of attributes inherent in the rapport definitions.

Other research has implemented the concept—the working alliance—to proxy rapport. Similar to MI, the concept originates in the counseling and therapy literature (see, e.g., Bordin, 1979; Martin et al., 2000). We could not trace a specified definition of the working alliance in the investigative interviewing literature drawing on the concept (viz., Vanderhallen, Vervaeke, & Holmberg, 2011; Vanderhallen & Vervaeke, 2014). Bordin (1979)—whose work has informed the current adaptations of the notion—describes the nature of the working alliance as including three features: "an agreement on goals, an assignment of a task or a series of tasks, and the development of bonds." Vanderhallen et al. (2011, p. 114) argue that Bordin's (1979) description emphasizes agreement and an emotional bond between an interviewer and interviewee (i.e., interactants). The authors do not explicate the role task assignment plays in evoking rapport in investigative interviewing contexts. Thus, we could not ascertain precisely how Bordin's (1979) stipulations map onto the meaning of rapport as advocated by Vanderhallen et al. (2011) and Vanderhallen and Vervaeke (2014).
That is, are certain aspects of the working alliance enough to evoke rapport? Or must an interaction include all three features of the working alliance to sufficiently induce rapport?

Another popularly implemented description of elements that constitute rapport is one offered by Tickle-Degnen and Rosenthal (1990). The work examines the nonverbal correlates of rapport. Here, the authors do not explicitly offer a rapport definition but describe its nature as consisting of three dynamic components—mutual attentiveness, positivity, and coordination (Tickle-Degnen & Rosenthal, 1990). It is noted that the influence of the three components may vary in evoking an instance of rapport due to rapport's dynamism in interpersonal interaction. Tickle-Degnen and Rosenthal (1990) propose that the temporal stage of interaction significantly contributes to a constituent's import. Specifically, early interactions rely more on positivity and mutual attentiveness; coordination and mutual attentiveness come to bear on late interactions. It is thus unclear whether—at a minimum—the three elements are required to instantiate rapport, or whether the influence of a component entirely depends on the stage of interaction. Investigative interviewing research drawing on Tickle-Degnen and Rosenthal's (1990) theorizing similarly does not explicitly define rapport (see, e.g., Collins & Carthy, 2018; Driskell et al., 2013; Holmberg & Madsen, 2014). Moreover, such work provides few a priori specifications about whether the three elements are required at a minimum or whether the significance of rapport's elements is solely derived from the temporal stage of an interview. Duke et al. (2018) have developed scales to measure interviewees' perceptions of rapport (RS3i). The RS3i examines the extent to which an interviewee experiences rapport in an interview by measuring specific perceptions of an interviewer.
That is, the interviewer's attentiveness, trustworthiness and respectfulness, professional competence, cultural similarity (to the interviewee), and connected flow. Here, connected flow indicates the ease with which the interviewee perceived their interaction with the interviewer. These perceptions were drawn from a literature review that sought to identify the components of rapport (see Duke, 2013; Duke et al., 2018, p. 65). In the exposition, Duke et al. (2018) allude to two prior descriptions of rapport (viz., Kleinman, 2006; Neuman & Salinas-Serrano, 2006). However, the authors do not offer a working definition, nor do they explain how the prior descriptions they mention encapsulate the dimensions of the RS3i. For instance, it is not entirely clear whether inducing a single dimension of the RS3i (e.g., attentiveness) is sufficient to create an instance of rapport, or whether all the elements must be present to evoke rapport.

[3] We have maintained the phrasing—the working alliance—throughout this paper because exponents of the concept describe it as such.

Discussion

We examined the existing definitions of rapport in the investigative interviewing literature. A formal search obtained 228 publications that discussed rapport in an investigative interviewing context. Only thirty-two publications (14%) explicitly defined rapport. Twenty-two of those definitions were unique. A thematic analysis of the definitions revealed six major attributes: communication, mutuality, positivity, respect, successful outcomes, and trust. However, these attributes were disparately distributed across the definitions, demonstrating considerable differences in how rapport is defined. This pattern has created a literature replete with different definitions of a term univocally seen as a fundamental part of an effective interview. Consider the following examples. It is not apparent whether the 'mutually shared understanding of goals and needs' on which Kelly et al.
(2013) base their definition is equivalent to the 'effective communication' proposed by Matsumoto and Hwang (2018). Additionally, it is not clear whether 'the relationship' Vallano and Schreiber Compo (2015) refer to is one where actors can understand each other's goals (Kelly et al., 2013), communicate effectively (Matsumoto & Hwang, 2018), or both (Kelly et al., 2013; Matsumoto & Hwang, 2018). Moreover, the extent to which 'a state of communicative alliance' (Abbe & Brandon, 2013) equals a 'smooth, positive interpersonal interaction' (Abbe & Brandon, 2014) remains unknown. And it is unclear whether such a positive interpersonal interaction must include trust (e.g., Vallano et al., 2015), respect (e.g., Vanderhallen et al., 2011), or both (e.g., Russano et al., 2014). The ambiguity about the meaning of rapport is also evident when considering the definitions posited by the major investigative interviewing guidelines. We believe that discussing the variance between those guidelines is warranted for practical reasons. Furthermore, researchers often invoke the importance of rapport by noting that at least one of the major interviewing guidelines endorses it. Researchers do not necessarily examine whether potential differences in the meanings of rapport affect the generalizability of research across jurisdictions using the different regulations (see, e.g., Meissner et al., 2015; Walsh & Bull, 2012; Vallano & Schreiber Compo, 2015). In our opinion, these reasons suggest that even when research highlights subtle definitional differences, rapport is still assumed to be identical across the major interviewing guidelines. Significant variations, however, exist. According to the AFM, rapport is a condition established by an interviewer that is characterized by an interviewee's confidence in the interviewer and willingness to cooperate (AFM 2-22.3; Department of the Army, 2006, p. 141).
The definition includes the themes of trust and successful outcomes, as delineated in this research. This AFM definition is markedly different from the one posited in the PEACE model. PEACE describes rapport as a property of the interviewer: being 'genuinely open, interested, and approachable, as well as being interested in the interviewee's feelings or welfare' (College of Policing, 2013). The PEACE model definition centers on the themes of positivity and communication. From the AFM perspective, rapport does not necessarily derive from congeniality (viz., positivity), as suggested by the PEACE model. In fact, the AFM indicates that rapport could be based on friendliness, mutual gain, or even fear (p. 141). To our knowledge, the Cognitive Interview, the NICHD protocol, and the ABE guidelines do not provide specific rapport definitions but instead recommend behaviors by which interviewers can create rapport. Without specified definitions, it is not possible to examine the meanings and value of rapport for these guidelines, nor to ascertain whether one behavior is sufficient to induce rapport or whether an interviewer has to enact several behaviors to do so.

How did we get here?

Definitional issues plague the systematic study of rapport. We believe these definitional issues have arisen because researchers have attempted to define rapport at the wrong level of analysis. The extant definitions have defined rapport by its potential constituent parts, such as trust, friendliness, or respect. In contrast, we argue that rapport is a higher order concept and, as such, requires a higher order definition. A proposal for this higher order definition can be gleaned from the common reference point we identified, on which all extant rapport definitions build: that rapport refers to the quality of the interviewer-interviewee interaction.
This reference point suggests that the term rapport is axiomatic: since all interviews require interpersonal interaction, rapport refers to a necessary and self-evident aspect of any investigative interview. Confusion about the definition of rapport has arisen because, when defining the term, researchers have focused on different aspects of the quality of this interaction. A comparison can be made with the term 'personality' within trait psychology. Like rapport, personality is a higher order term: it refers to one's relatively stable pattern of behaviors, cognitions, and emotions (Cloninger, 2009). How many traits there are and how they relate to each other is an empirical question. Importantly, whether one believes there are three (Eysenck & Eysenck, 1984), five (Goldberg, 1990), or six traits (Ashton et al., 2004) does not change the definition of personality as one's relatively stable pattern of behaviors, cognitions, and emotions. Similarly, whether one believes the important attributes of rapport are friendliness, trust, or respect does not change the higher order definition of rapport as the quality of the interviewer-interviewee interaction. As with personality, what constitutes the attributes of rapport is an empirical question. It may be that the attributes of rapport are the six subordinate attributes we identified; it may be that there is only one overarching good-bad dimension of rapport. We do not know. An answer to this question will require rigorous empirical analysis. This distinction between the higher order axiomatic definition of rapport and the lower order attributes of rapport has important consequences for how we research and discuss the term. For instance, by defining rapport as the quality of the interviewer-interviewee interaction, many catchall statements invoking the importance of rapport become tautological or vacuous. Consider the statement 'rapport is important for the outcome of an interview'.
This amounts to little more than saying, 'the quality of the interaction between interviewer and interviewee is important for the outcome of an interview'. All things being equal, surely this must be the case. Such statements are of little interest to either the researcher or the practitioner. Instead, researchers should focus on the specific attributes of the interviewer-interviewee interaction. Consider, for instance, the hypothetical finding that a more trustful and respectful interaction increases an interviewee's disclosure. This finding is of both theoretical and applied value. However, the value of this finding is lost if the terms trust and respect are replaced by the umbrella term rapport, because the field has no agreed-upon finite set of attributes that rapport encapsulates. As long as researchers continue to define the attributes of rapport in different ways, claiming that 'rapport increases an interviewee's disclosure' could mean any number of things. Until the field can agree upon such a finite set of attributes, we strongly recommend that stakeholders stop indiscriminately using the word rapport to describe any collection of attributes of the interviewer-interviewee interaction.

Moving forward

If the field wishes to continue using the term rapport without the ambiguity and associated problems we currently see, it must collectively determine this finite set of attributes. Here the field can draw inspiration from other areas of inquiry that have dealt with similar concerns. Again, we can draw on trait psychology. Early assessments of personality traits included scales that all nominally measured personality but, in fact, measured different attributes of personality (John & Srivastava, 1999). Personality researchers eventually reached a consensus by centering their pursuit on their common ground: the lexical hypothesis.
That is, the idea that descriptions of significant entities, such as personality, eventually become part of people's language (Ashton & Lee, 2005). In brief, by scouring dictionaries, personality researchers identified comprehensive lists of adjectives describing traits. Insofar as only significant entities enter our language, they argued, these lists should comprise all meaningful ways in which personality can be described. For over five decades, these lists were subjected to rigorous analysis by the research community (Allport & Odbert, 1936; Goldberg, 1990). In broad strokes, researchers had people rate themselves and others on these trait adjectives. Goldberg (1990), for example, had people make ratings on a list of over 1,400 adjectives. By subjecting these ratings to exploratory factor analyses, the underlying structure of personality could be uncovered. These efforts ultimately yielded the Big Five personality structure (McCrae, 1989). An adapted lexical approach may also be applicable to investigating the attributes of rapport, defined as the quality of the interaction between an interviewer and interviewee. In this work, we have provided an exhaustive list of the extant rapport definitions. This list can serve as a point of departure: researchers can complement the current list of definitions with additional descriptors of the quality of interpersonal interactions. The ultimate goal of the compilation is to identify all meaningful ways in which such interactions can be described. Similar to research on the structure of personality, these lists can then be used to rate the quality of interpersonal interactions in investigative interviews. Such ratings can be conducted by the interviewer, the interviewee, or even by external observers. Researchers can then use factor analyses to empirically determine the constituent components and underlying structure of what the field wants to call rapport.
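To make the proposed analytic step concrete, the following is a minimal sketch, not the authors' analysis, of how descriptor ratings could be factor-analyzed to recover latent dimensions of interaction quality. The adjectives, the simulated two-factor structure, and the use of scikit-learn's FactorAnalysis are all illustrative assumptions for this example.

```python
# Illustrative sketch: exploratory factor analysis of hypothetical
# interaction-quality ratings. All adjective names and the simulated
# two-factor structure are assumptions, not data from this study.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_interviews = 500

# Two hypothetical latent qualities of the interviewer-interviewee interaction.
warmth = rng.normal(size=n_interviews)
trust = rng.normal(size=n_interviews)

# Six observed descriptor ratings, each loading mainly on one latent quality.
adjectives = ["friendly", "attentive", "respectful",
              "trustworthy", "honest", "reliable"]
ratings = np.column_stack([
    warmth + rng.normal(scale=0.5, size=n_interviews),  # friendly
    warmth + rng.normal(scale=0.5, size=n_interviews),  # attentive
    warmth + rng.normal(scale=0.5, size=n_interviews),  # respectful
    trust + rng.normal(scale=0.5, size=n_interviews),   # trustworthy
    trust + rng.normal(scale=0.5, size=n_interviews),   # honest
    trust + rng.normal(scale=0.5, size=n_interviews),   # reliable
])

# Fit a two-factor model and inspect which descriptors cluster together.
fa = FactorAnalysis(n_components=2, random_state=0).fit(ratings)
loadings = fa.components_.T  # shape: (n_adjectives, n_factors)
for adjective, row in zip(adjectives, loadings):
    print(f"{adjective:>12}: {row.round(2)}")
```

In real use, the number of factors would itself be an empirical question (e.g., decided by parallel analysis or model comparison), which is precisely the open question the proposal aims to settle for rapport.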
A project of this scale will likely require collaboration at the level of the research community—for example, a multi-lab study. The collaboration must involve practitioners and experts in the field. The potential rewards are vast, as the field will take a large stride toward answering the question, 'What is rapport in investigative interviewing?'

Metascientific discussion: comments and responses

Since our first submission, we have received eight independent reviews, from two different journals, on this text. Some were positive about the paper; others were highly critical of its value and conclusions. All of the reviews can be read at https://osf.io/5zha3/. In the following sections we summarize and respond to what we see as the most substantive criticisms of the paper.

A sermon to the choir?

Several reviewers dismissed this work, arguing that it simply rehashes a well-known issue—namely, that rapport is ill defined in the investigative interviewing literature. Consequently, this research is merely convincing the convinced. We partly agree, in that many do acknowledge significant problems in defining rapport (e.g., Vallano & Schreiber Compo, 2015). However, we believe these observations only emphasize the importance of the present research. By systematically reviewing all extant rapport definitions in investigative interviewing, we quantify, and clearly highlight, the extent of the issue. We believe this is a necessary first step to initiate a public discussion by the research community on how the term rapport is defined, how it should be defined, and how it is used as a psychological concept. Making the issue an open discussion will prevent us from glossing over definitional issues, as is typically the case in the literature. Indeed, in 86% of the articles we reviewed, the term rapport was used but was not explicitly defined at all.

Defining psychological concepts is challenging—why pick on rapport?
Some reviewers highlighted that psychological constructs are generally difficult to define concisely. They argued that our definitional gripe with rapport could be raised with any number of psychological constructs, including 'anxiety', 'love', or 'depression'. We agree that many psychological constructs are difficult to define. We struggle, however, to see why this should prevent us from addressing definitional issues in our own field of inquiry. This argument amounts to little more than saying, 'my backyard is messy, but this is ok because yours is messy too'. Instead, the investigative interviewing research community must attempt to discuss the problem and strive to arrive at a unified working definition of rapport. Other areas of psychology do take definitional issues seriously. Indeed, for each of the constructs 'anxiety' (Akiskal, 1998; Nyatanga & De Vocht, 2006; Cambre & Cook, 1985), 'love' (Beall & Sternberg, 1995; Fehr, 1988; Fehr & Russell, 1991), and 'depression' (Blatt et al., 1982; Haaga et al., 1991), researchers have written extensively about defining these difficult concepts. Others are trying to tidy up their backyards. We must begin tidying too.

Different strokes for different folks?

Other reviewers suggested that rapport must have different definitions, since rapport will vary depending on the interview context. Suppose an interviewer questions an innocent and cooperative eyewitness willing to provide information about a crime under investigation. Rapport in this context, it is argued, may differ from rapport when the same interviewer tries to elicit information from an uncooperative suspect accused of committing a crime. As with problems in the definitions of rapport more generally, we believe this criticism confuses the level of analysis. When researchers argue that rapport will vary from context to context, we believe they in fact mean that the importance of rapport's lower-level attributes will vary.
For example, effective communication may be of utmost importance when interviewing a cooperative witness. In contrast, trust may be more important when interviewing an uncooperative suspect. In both situations, however, the higher order definition of rapport still refers to the quality of the interviewer-interviewee interaction. By conflating the higher order definition with its lower-level attributes, we worry that researchers will erroneously conclude that the term rapport requires context-specific definitions. Even if one grants the view that different contexts require different rapport definitions, in some ways this argumentation only exacerbates the issue. This is because researchers would then be required to outline: (1) their specific definition; (2) the specific context to which their definition of rapport refers, such as 'cooperative eyewitness rapport' or 'uncooperative suspect rapport'; and (3) how their definition differs from other definitions of rapport. This may be a viable way of dealing with definitional issues in rapport. However, by our reading of the literature, it is not the typical way in which researchers use the term.

Definitional issues, so what?

A final critique from some reviewers was that the disparity in rapport definitions in investigative interviewing may be a non-issue. It was questioned whether the disparity is harmful to science if studies consistently find that inducing rapport leads interviewees to disclose more information than no rapport—irrespective of what rapport means. We agree that extant studies may be finding similar results despite definitional disparities. However, if these definitional disparities mean the studies have examined different constructs, any conclusion or understanding one can draw from the studies will be constrained. As an analogy, consider the finding that two completely different medications both lead to a reduction in depression.
If these different medications were classified, or 'defined', as the same drug, conclusions concerning the utility of the two drugs may still stand, but the science and understanding surrounding them would be significantly flawed. We cannot see a world in which definitional issues with rapport would not undermine its scientific inquiry in a similar way. Additionally, definitional disparities bring methodological concerns. Without a yardstick to flag what constitutes rapport, it becomes challenging, if not impossible, to objectively assess the extent to which a given investigation has examined something called rapport. One could measure several attributes of the interviewer-interviewee interaction and arbitrarily decide that some of those attributes (or none) concern rapport, depending on the hypothesis one has predetermined to support. As Flake and Fried (2020) note, ambiguity about a concept's definition presents opportunities for researchers to knowingly or accidentally exploit the ambiguity to engage in questionable measurement practices like the one just described.

Concluding remarks

We are not claiming that a construct called rapport does not exist or that it has limited utility in improving investigative interviewing. Rather, we have drawn attention to the commonality in, and the scope of variance among, rapport definitions. All the extant definitions imply that rapport centers on the quality of the interviewer-interviewee interaction. Nonetheless, the definitions vary considerably in what they regard as the underlying attributes of rapport. This disparity creates ambiguity about the meaning of rapport and impedes its objective assessment. In the short term, stakeholders should avoid using the word rapport as a cover term to encapsulate disparate collections of attributes relating to the interviewer-interviewee interaction. In the long term, the field should collectively determine a finite set of attributes to denote what we mean by rapport.
We believe these suggestions are tractable pathways to reducing ambiguity about the meaning of the word rapport in investigative interviewing, thereby improving both the theoretical and applied value of the term.

Author contact

David A. Neequaye, 0000-0002-7355-2784, Department of Psychology, University of Gothenburg, Sweden. Erik Mac Giolla, 0000-0002-5285-5321, Department of Psychology, University of Gothenburg, Sweden. For their assistance with data collection, we thank Simon Karlsson, Katarina Radonakova, and Madison Turner (listed alphabetically by surname). Correspondence to: David A. Neequaye, Department of Psychology, University of Gothenburg, Box 500, 405 30 Gothenburg, Sweden; email: david.neequaye@psy.gu.se

Conflict of interest and funding

We have no conflict of interest to declare. We received no specific funding for this research.

Author contributions

David A. Neequaye: original conceptualization, data curation, formal analysis, investigation, methodology, project administration, writing – original draft, writing – review & editing. Erik Mac Giolla: refining conceptualization, data curation, formal analysis, investigation, methodology, writing – review & editing.

Open science practices

The editorial process for this article relied on streamlined peer review, in which peer reviews obtained from previous journal(s) were moved forward and used as the basis for the editorial decision. These reviews are shared in the supplementary files. The identities of the reviewers are shown or hidden in accordance with the policy of the journal that originally obtained them. This article earned the Preregistration, Open Data, and Open Materials badges for preregistering the hypothesis before data collection and for making the data and materials openly available. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process is published in the online supplement.

References

Abbe, A., & Brandon, S. E. (2013).
The role of rapport in investigative interviewing: A review. Journal of Investigative Psychology and Offender Profiling, 10(3), 237–249. https://doi.org/10.1002/jip.1386

Abbe, A., & Brandon, S. E. (2014). Building and maintaining rapport in investigative interviews. Police Practice and Research, 15(3), 207–220. https://doi.org/10.1080/15614263.2013.827835

Akiskal, H. S. (1998). Toward a definition of generalized anxiety disorder as an anxious temperament type. Acta Psychiatrica Scandinavica, 98, 66–73. https://doi.org/10.1111/j.1600-0447.1998.tb05969.x

Alison, L., Alison, E., Noone, G., Elntib, S., Waring, S., & Christiansen, P. (2014). The efficacy of rapport-based techniques for minimizing counter-interrogation tactics amongst a field sample of terrorists. Psychology, Public Policy, and Law, 20(4), 421–430. https://doi.org/10.1037/law0000021

Alison, L. J., Alison, E., Noone, G., Elntib, S., & Christiansen, P. (2013). Why tough tactics fail and rapport gets results: Observing Rapport-Based Interpersonal Techniques (ORBIT) to generate useful information from terrorists. Psychology, Public Policy, and Law, 19(4), 411–431. https://doi.org/10.1037/a0034564

Allport, G. W., & Odbert, H. S. (1936). Trait-names: A psycho-lexical study. Psychological Monographs, 47(1), i–171. https://doi.org/10.1037/h0093360

Asch, S. E. (1987). Social psychology (Original work published 1952). New York: Oxford University Press.

Ashton, M. C., & Lee, K. (2005). A defence of the lexical approach to the study of personality structure. European Journal of Personality, 19(1), 5–24. https://doi.org/10.1002/per.541

Ashton, M. C., Lee, K., Perugini, M., Szarota, P., de Vries, R. E., Di Blas, L., Boies, K., & De Raad, B. (2004).
A six-factor structure of personality-descriptive adjectives: Solutions from psycholexical studies in seven languages. Journal of Personality and Social Psychology, 86(2), 356–366. https://doi.org/10.1037/0022-3514.86.2.356

Beall, A. E., & Sternberg, R. J. (1995). The social construction of love. Journal of Social and Personal Relationships, 12(3), 417–438. https://doi.org/10.1177/0265407595123006

Blatt, S. J., Quinlan, D. M., Chevron, E. S., McDonald, C., & Zuroff, D. (1982). Dependency and self-criticism: Psychological dimensions of depression. Journal of Consulting and Clinical Psychology, 50(1), 113–124. https://doi.org/10.1037/0022-006X.50.1.113

Bordin, E. S. (1979). The generalizability of the psychoanalytic concept of the working alliance. Psychotherapy: Theory, Research & Practice, 16(3), 252–260. https://doi.org/10.1037/h0085885

Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77–101. https://doi.org/10.1191/1478088706qp063oa

Bull, R., & Milne, B. (2004). Attempts to improve the police interviewing of suspects. In G. D. Lassiter (Ed.), Interrogations, confessions, and entrapment (pp. 181–196). Springer US. https://doi.org/10.1007/978-0-387-38598-3_8

Cambre, M. A., & Cook, D. L. (1985). Computer anxiety: Definition, measurement, and correlates. Journal of Educational Computing Research, 1(1), 37–54. https://doi.org/10.2190/FK5L-092H-T6YB-PYBA

Christiansen, P., Alison, L., & Alison, E. (2018). Well begun is half done: Interpersonal behaviours in distinct field interrogations with high-value detainees. Legal and Criminological Psychology, 23(1), 68–84. https://doi.org/10.1111/lcrp.12111

Clarke, C., & Milne, R. (2001).
National evaluation of investigative interviewing: PEACE course. Home Office, London.

College of Policing. (2013). Investigative interviewing. Retrieved June 26, 2019, from https://www.app.college.police.uk/app-content/investigations/investigative-interviewing/#peace-framework

Collins, K., & Carthy, N. (2018). No rapport, no comment: The relationship between rapport and communication during investigative interviews with suspects. Journal of Investigative Psychology and Offender Profiling. https://doi.org/10.1002/jip.1517

Collins, R., Lincoln, R., & Frank, M. G. (2002). The effect of rapport in forensic interviewing. Psychiatry, Psychology and Law, 9(1), 69–78. https://doi.org/10.1375/pplt.2002.9.1.69

De Vocht, H., & Nyatanga, B. (2006). Towards a definition of death anxiety. International Journal of Palliative Nursing, 12(9), 410–413. https://doi.org/10.12968/ijpn.2006.12.9.21868
Driskell, T., Blickensderfer, E. L., & Salas, E. (2013). Is three a crowd? Examining rapport in investigative interviews. Group Dynamics: Theory, Research, and Practice, 17(1), 1–13. https://doi.org/10.1037/a0029686

Duke, M. C. (2013). The development of the Rapport Scales for Investigative Interviews and Interrogations (Doctoral dissertation). The University of Texas at El Paso, United States. Retrieved February 28, 2022, from http://www.proquest.com/docview/1418015989/abstract/8c36cb36043047depq/1

Duke, M. C., Wood, J. M., Magee, J., & Escobar, H. (2018). The effectiveness of Army Field Manual interrogation approaches for educing information and building rapport. Law and Human Behavior, 42(5), 442–457. https://doi.org/10.1037/lhb0000299

Eysenck, H. J., & Eysenck, S. B. G. (1984). Eysenck Personality Questionnaire–Revised. Sevenoaks, Kent, UK: Hodder & Stoughton.

Fehr, B. (1988). Prototype analysis of the concepts of love and commitment.
Journal of Personality and Social Psychology, 55(4), 557–579. https://doi.org/10.1037/0022-3514.55.4.557

Fehr, B., & Russell, J. A. (1991). The concept of love viewed from a prototype perspective. Journal of Personality and Social Psychology, 60(3), 425–438. https://doi.org/10.1037/0022-3514.60.3.425

Fisher, R. P., & Geiselman, R. E. (1992). Memory enhancing techniques for investigative interviewing: The cognitive interview. Charles C Thomas Publisher.

Flake, J. K., & Fried, E. I. (2020). Measurement schmeasurement: Questionable measurement practices and how to avoid them. Advances in Methods and Practices in Psychological Science, 3(4), 456–465. https://doi.org/10.1177/2515245920952393

Gabbert, F., Hope, L., Luther, K., Wright, G., Ng, M., & Oxburgh, G. (2020). Exploring the use of rapport in professional information-gathering contexts by systematically mapping the evidence base. Applied Cognitive Psychology. https://doi.org/10.1002/acp.3762

Goldberg, L. R. (1990). An alternative "description of personality": The Big-Five factor structure. Journal of Personality and Social Psychology, 59(6), 1216–1229. https://doi.org/10.1037/0022-3514.59.6.1216

Gupta, A. (2015). Definitions. In Stanford Encyclopedia of Philosophy (Summer 2015 ed.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/sum2015/entries/definitions/

Haaga, D. A., Dyck, M. J., & Ernst, D. (1991). Empirical status of cognitive theory of depression. Psychological Bulletin, 110(2), 215–236. https://doi.org/10.1037/0033-2909.110.2.215

Hershkowitz, I., Lamb, M. E., Katz, C., & Malloy, L. C. (2015).
Does enhanced rapport-building alter the dynamics of investigative interviews with suspected victims of intra-familial abuse? Journal of Police and Criminal Psychology, 30(1), 6–14. https://doi.org/10.1007/s11896-013-9136-8

Holmberg, U., & Madsen, K. (2014). Rapport operationalized as a humanitarian interview in investigative interview settings. Psychiatry, Psychology and Law, 21(4), 591–610. https://doi.org/10.1080/13218719.2013.873975

John, O. P., & Srivastava, S. (1999). The Big Five trait taxonomy: History, measurement, and theoretical perspectives. In Handbook of personality: Theory and research (pp. 102–138).

Kelly, C. E., Miller, J. C., & Redlich, A. D. (2016). The dynamic nature of interrogation. Law and Human Behavior, 40(3), 295–309. https://doi.org/10.1037/lhb0000172

Kelly, C. E., Miller, J. C., Redlich, A. D., & Kleinman, S. M. (2013). A taxonomy of interrogation methods. Psychology, Public Policy, and Law, 19(2), 165–178. https://doi.org/10.1037/a0030310

Kieckhaefer, J. M., Vallano, J. P., & Compo, N. S. (2014). Examining the positive effects of rapport building: When and why does rapport building benefit adult eyewitness memory? Memory, 22(8), 1010–1023. https://doi.org/10.1080/09658211.2013.864313

Kleinman, S. M. (2006). KUBARK counterintelligence interrogation review: Observations of an interrogator. In I. S. Board (Ed.), Educing information. Interrogation: Science and art. Foundations for the future (p. 95).
https://doi.org/10.1037/a0029686 https://doi.org/10.1037/a0029686 http%5c://www.proquest.com/docview/1418015989/abstract/8c36cb36043047depq/1 http%5c://www.proquest.com/docview/1418015989/abstract/8c36cb36043047depq/1 http%5c://www.proquest.com/docview/1418015989/abstract/8c36cb36043047depq/1 https://doi.org/10.1037/lhb0000299 https://doi.org/10.1037/lhb0000299 https://doi.org/10.1037/0022-3514.55.4.557 https://doi.org/10.1037/0022-3514.55.4.557 https://doi.org/10.1037/0022-3514.60.3.425 https://doi.org/10.1037/0022-3514.60.3.425 https://doi.org/10.1177/2515245920952393 https://doi.org/10.1177/2515245920952393 https://doi.org/https\://doi.org/10.1002/acp.3762 https://doi.org/https\://doi.org/10.1002/acp.3762 https://doi.org/http\://dx.doi.org.ezproxy.ub.gu.se/10.1037/0022-3514.59.6.1216 https://doi.org/http\://dx.doi.org.ezproxy.ub.gu.se/10.1037/0022-3514.59.6.1216 https://doi.org/http\://dx.doi.org.ezproxy.ub.gu.se/10.1037/0022-3514.59.6.1216 https%5c://plato.stanford.edu/archives/sum2015/entries/definitions/ https%5c://plato.stanford.edu/archives/sum2015/entries/definitions/ https%5c://plato.stanford.edu/archives/sum2015/entries/definitions/ https://doi.org/10.1037/0033-2909.110.2.215 https://doi.org/10.1037/0033-2909.110.2.215 https://doi.org/10.1007/s11896-013-9136-8 https://doi.org/10.1007/s11896-013-9136-8 https://doi.org/10.1080/13218719.2013.873975 https://doi.org/10.1080/13218719.2013.873975 https://doi.org/10.1037/lhb0000172 https://doi.org/10.1037/lhb0000172 https://doi.org/10.1037/a0030310 https://doi.org/10.1037/a0030310 https://doi.org/10.1080/09658211.2013.864313 https://doi.org/10.1080/09658211.2013.864313 14 washington, dc national defense intelligence college press. kripke, s. a. (1972). naming and necessity. in d. davidson & g. harman (eds.), semantics of natural language (pp. 253–355). springer netherlands. https://doi.org/10.1007/9789401025577_9 lamb, m. e., orbach, y., hershkowitz, i., esplin, p. w., & horowitz, d. (2007). 
a structured forensic interview protocol improves the quality and informativeness of investigative interviews with children a review of research using the nichd investigative interview protocol. child abuse & neglect, 31(11), 1201–1231. https://doi.org/ 10.1016/j.chiabu.2007.03.021 martin, d. j., garske, j. p., & davis, m. k. (2000). relation of the therapeutic alliance with outcome and other variables a meta-analytic review. journal of consulting and clinical psychology, 68(3), 438–450. https : / / doi . org / http\ : //dx.doi.org.ezproxy.ub.gu.se/10.1037/0022006x.68.3.438 matsumoto, d., & hwang, h. c. (2018). social influence in investigative interviews the effects of reciprocity. applied cognitive psychology, 32(2), 163–170. https://doi.org/10.1002/acp.3390 mccrae, r. r. (1989). why i advocate the five-factor model joint factor analyses of the neo-pi with other instruments. in d. m. buss & n. cantor (eds.), personality psychology recent trends and emerging directions (pp. 237–245). springer us. https : / / doi . org / 10 . 1007 / 978 1 4684 0634-4_18 meissner, c. a., kelly, c. e., & woestehoff, s. a. (2015). improving the effectiveness of suspect interrogations. annual review of law and social science, 11(1), 211–233. https : / / doi . org / 10 . 1146/annurev-lawsocsci-120814-121657 miller, w. r., & rollnick, s. (2009). ten things that motivational interviewing is not. behavioural and cognitive psychotherapy, 37(2), 129–140. https: //doi.org/10.1017/s1352465809005128 nash, r. a., nash, a., morris, a., & smith, s. l. (2016). does rapport-building boost the eyewitness eyeclosure effect in closed questioning? legal and criminological psychology, 21(2), 305–318. https://doi.org/10.1111/lcrp.12073 neuman, a., & salinas-serrano, d. (2006). custodial interrogations what we know, what we do, and what we can learn from law enforcement experiences. in i. s. board (ed.), educing information. interrogation science and art. foundations for the future (p. 141). 
washington, dc national defense intelligence college press. of the army, d. (2006). fm 2-22.3 (fm 34-52) human intelligence collector operations. office, h. (2011). achieving best evidence in criminal proceedings. london. russano, m. b., narchet, f. m., kleinman, s. m., & meissner, c. a. (2014). structured interviews of experienced humint interrogators interviews of humint interrogators. applied cognitive psychology, 28(6), 847–859. https://doi.org/10. 1002/acp.3069 sauerland, m., brackmann, n., & otgaar, h. (2018). rapport little effect on children’s, adolescents’, and adults’ statement quantity, accuracy, and suggestibility. journal of child custody, 15(4), 268–285. https://doi.org/10.1080/15379418. 2018.1509759 saywitz, k. j., larson, r. p., hobbs, s. d., & wells, c. r. (2015). developing rapport with children in forensic interviews systematic review of experimental research developing rapport with children. behavioral sciences & the law, 33(4), 372–389. https://doi.org/10.1002/bsl.2186 schreiber compo, n., hyman gregory, a., & fisher, r. (2012). interviewing behaviors in police investigators a field study of a current us sample. psychology, crime & law, 18(4), 359–375. https : / / doi . org / 10 . 1080 / 1068316x . 2010 . 494604 sternberg, k. j., lamb, m. e., hershkowitz, i., yudilevitch, l., orbach, y., esplin, p. w., & hovav, m. (1997). effects of introductory style on children’s abilities to describe experiences of sexual abuse. child abuse & neglect, 21(11), 1133– 1146. https : / / doi . org / 10 . 1016 / s0145 2134(97)00071-9 st-yves, m. (2006). the psychology of rapport five basic rules. investigative interviewing. https://doi. org/10.4324/9781843926337-15 tickle-degnen, l., & rosenthal, r. (1990). the nature of rapport and its nonverbal correlates. psychological inquiry, 1(4), 285–293. https://doi.org/ 10.1207/s15327965pli0104_1 vallano, j. p., evans, j. r., schreiber compo, n., & kieckhaefer, j. m. (2015). 
rapport-building during witness and suspect interviews a survey of law enforcement rapport-building during interviews. applied cognitive psychology, 29(3), 369–380. https://doi.org/10.1002/acp.3115 vallano, j. p., & schreiber compo, n. (2015). rapportbuilding with cooperative witnesses and criminal suspects a theoretical and empirical review. https://doi.org/10.1007/978-94-010-2557-7_9 https://doi.org/10.1007/978-94-010-2557-7_9 https://doi.org/10.1016/j.chiabu.2007.03.021 https://doi.org/10.1016/j.chiabu.2007.03.021 https://doi.org/http\://dx.doi.org.ezproxy.ub.gu.se/10.1037/0022-006x.68.3.438 https://doi.org/http\://dx.doi.org.ezproxy.ub.gu.se/10.1037/0022-006x.68.3.438 https://doi.org/http\://dx.doi.org.ezproxy.ub.gu.se/10.1037/0022-006x.68.3.438 https://doi.org/10.1002/acp.3390 https://doi.org/10.1007/978-1-4684-0634-4_18 https://doi.org/10.1007/978-1-4684-0634-4_18 https://doi.org/10.1146/annurev-lawsocsci-120814-121657 https://doi.org/10.1146/annurev-lawsocsci-120814-121657 https://doi.org/10.1017/s1352465809005128 https://doi.org/10.1017/s1352465809005128 https://doi.org/10.1111/lcrp.12073 https://doi.org/10.1002/acp.3069 https://doi.org/10.1002/acp.3069 https://doi.org/10.1080/15379418.2018.1509759 https://doi.org/10.1080/15379418.2018.1509759 https://doi.org/10.1002/bsl.2186 https://doi.org/10.1080/1068316x.2010.494604 https://doi.org/10.1080/1068316x.2010.494604 https://doi.org/10.1016/s0145-2134(97)00071-9 https://doi.org/10.1016/s0145-2134(97)00071-9 https://doi.org/10.4324/9781843926337-15 https://doi.org/10.4324/9781843926337-15 https://doi.org/10.1207/s15327965pli0104_1 https://doi.org/10.1207/s15327965pli0104_1 https://doi.org/10.1002/acp.3115 15 psychology, public policy, and law, 21(1), 85– 99. https://doi.org/10.1037/law0000035 vanderhallen, m., & vervaeke, g. (2014). between investigator and suspect the role of the working alliance in investigative interviewing. in r. bull (ed.), investigative interviewing (pp. 63– 90). springer new york. 
Meta-Psychology, 2022, vol 6, MP.2020.2741
https://doi.org/10.15626/mp.2020.2741
Article type: Tutorial
Published under the CC-BY4.0 license
Open data: Not applicable
Open materials: Not applicable
Open and reproducible analysis: Not applicable
Open reviews and editorial process: Yes
Preregistration: No
Edited by: Erin Buchanan
Reviewed by: Dorothy Bishop, Eiko Fried
Analysis reproduced by: Not applicable
All supplementary files can be accessed at OSF: https://doi.org/10.17605/osf.io/2f73h

A Meta-Analytic Approach to Evaluating the Explanatory Adequacy of Theories

Alejandrina Cristia
Laboratoire de Sciences Cognitives et de Psycholinguistique, Département d'Études Cognitives, ENS, EHESS, CNRS, PSL University, France

Sho Tsuji
International Research Center for Neurointelligence, Institute for Advanced Studies, The
University of Tokyo, Japan

Christina Bergmann
Language Development Department, Max Planck Institute for Psycholinguistics, The Netherlands

Abstract

How can data be used to check theories' explanatory adequacy? The two traditional and most widespread approaches use single studies and non-systematic narrative reviews to evaluate theories' explanatory adequacy; more recently, large-scale replications entered the picture. We argue here that none of these approaches fits the tenets of cumulative science. We propose instead community-augmented meta-analyses (CAMAs), which, like meta-analyses and systematic reviews, are built using all available data; like meta-analyses but not systematic reviews, can rely on sound statistical practices to model methodological effects; and, like no other approach, are broad-scoped, cumulative, and open. We explain how CAMAs entail a conceptual shift from meta-analyses and systematic reviews, a shift that is useful when evaluating theories' explanatory adequacy. We then provide step-by-step recommendations for how to implement this approach – and what it means when one cannot. This leads us to conclude that CAMAs highlight areas of uncertainty better than alternative approaches that bring data to bear on theory evaluation, and can trigger a much-needed shift towards a cumulative mindset with respect to both theory and data, leading us to do and view experiments and narrative reviews differently.

Keywords: meta-analysis, variability, replication, sample size, effect size, quantitative, open science, cumulative science, theory adjudication, explanatory adequacy

Introduction

As cognitive scientists and psychologists, we strive for generality, trying to see beyond individual data points and experiments. Theories are key in this process.
Ranging from broad frameworks to implemented computational models, theories are the tools we use to capture observed patterns and to generate new predictions. Given this crucial role, theories need to be evaluated, updated, and, when there are competing accounts, compared. In this context, an important question arises: how can we best evaluate theories against empirical evidence, particularly in the age of the replicability crisis (e.g., Vazire, 2018)? In this paper, we argue that the usual strategies are at odds with cumulative science (defined as the endeavor to optimally integrate findings into the web of knowledge), and we propose a novel approach based on open community-augmented meta-analyses (CAMAs; Tsuji et al., 2014). We first discuss the ways in which this approach more closely fits the desiderata of cumulative science, which recommends an integrative approach to empirical studies. We then provide step-by-step instructions for how, in the future, we can work towards letting the evidence decide: rather than checking a theory's explanatory adequacy via individual studies (which cannot by themselves cover the whole potential scope of a theory) or narrative reviews (where result integration is verbal), we propose a shift in mindset supported by meta-analytic tools.

Theories and cumulative science

The psychological sciences saw a sea change as reports of relatively low levels of replication bubbled to the surface (e.g., Klein et al., 2014; Open Science Collaboration, 2015). A first reaction was to blame our data collection and reporting practices: a great deal of writing has been devoted to quantifying questionable research practices (John et al., 2012), estimating their causal impact on replication (Ulrich and Miller, 2020), and evaluating alternative research approaches (Scheel et al., 2021). More recently, attention turned to theory, with the realization that lack of replication is at least partially due to what we may call questionable theoretical practices.
This young body of writing is already too extensive to be reviewed here (see e.g., Fried, 2020, and replies in the same issue), but for our purposes, the most important insights include a definition of what theory is, and what the steps of theory development are. We follow Robinaugh et al. (2021) in defining theories as models of the world, meaning that they represent in a simplified, abstract manner a portion of the complexity of the world. Several researchers agree that psychological theories, as well as those found in many areas of cognitive science (but not all; e.g., aspects of decision-making, Palminteri et al., 2017), are purely verbal or narrative, and tend to be underspecified and ambiguous. Current recommendations are thus to strive for further precision, leading Borsboom et al. (2021) to propose that the first three phases of theory development involve (1) identifying a domain, (2) constructing a proto-theory, and (3) formalizing the theory (note that alternative proposals for the steps have been laid out, for instance in Robinaugh et al., 2021; divergences on this are immaterial to the claims and proposals in the present article). Identifying a domain involves specifying the boundary of application of the theory, including the definition of its scope (i.e., when the theory applies or not). Constructing a proto-theory involves specifying what the "parts" of the theory are, as well as what their "relationships" are. In the formalization phase, the relationship between the parts comes to be defined precisely in mathematical notation. The next phase involves a check on the explanatory adequacy of the theory, a step that involves comparing the theory-implied data against empirical observations, which typically requires auxiliary hypotheses. The present paper is focused on the phase where explanatory adequacy is evaluated.
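The adequacy-checking phase can be caricatured in a few lines of code: a formalized theory is a function from conditions to predicted observations, and the check compares theory-implied values against empirical ones under auxiliary assumptions. A minimal sketch, in which the linear toy theory, the tolerance, and all numbers are invented purely for illustration:

```python
def theory_implied(age_months, rate=0.05):
    """Toy formalized theory: the predicted effect size grows
    linearly with age at `rate` per month, from a 0.5 baseline."""
    return 0.5 + rate * age_months

# Hypothetical observed effect sizes at three ages (in months).
observed = {6: 0.78, 9: 0.94, 12: 1.08}

# Auxiliary hypothesis: a prediction within 0.1 of the observation
# counts as adequate.
adequate = all(abs(theory_implied(age) - d) < 0.1
               for age, d in observed.items())
print("explanatorily adequate:", adequate)
```

The point of the sketch is only that, once a theory is formalized, "checking explanatory adequacy" becomes an explicit, reproducible computation rather than a verbal judgment.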
This phase has already been a focus of attention, with, for instance, some work explaining that it is not identical to simply fitting a statistical model to data (see Fried, 2020, pp. 274 ff.) and arguing that instead this step will involve relating statistical modeling to data generated from formalized theories (see saliently Robinaugh et al., 2021, section 4.2). Our proposal is conceptually independent from these recommendations, as they do not specify which data bear on a theory's evaluation. We argue here that this operation should integrate all relevant and accessible information, rather than partial or select information. We detail our arguments in the next section.

How data are currently used in theory evaluation

In this section, we review two commonly employed approaches to checking the explanatory adequacy of one or more theories, against which we compare our own proposal. We assume that, prior to checking explanatory adequacy, the scope of the theory has been defined, and the theory itself has been clarified and ideally formalized (see Guest and Martin, 2021; Robinaugh et al., 2021 for further information on these steps). For our purposes, what is important is that one has clarified the factors in the theory and how they can potentially be measured. To explain our proposal, we will use a running example of how infants' learning of the sounds and words of their native language may be linked. The preceding work, then, will have defined what is meant by "sounds", "words", "infants", "native language", and "learn". In our running example, we will discuss three alternative (verbal) theories: top-down (stating that infants learn words first, and then use them to learn sounds; e.g., Kuhl, 1983); bottom-up (stating that infants learn sounds first, and then use them to learn words; e.g., Feldman et al., 2013); and parallel (stating that infants learn sounds and words independently from each other). Evaluating explanatory adequacy will involve five stages.
These are listed in Table 1, and we provide information on how each stage is addressed via different methods in subsequent sections. Stage I is scope determination: deciding which studies in the literature bear on a given theory or group of theories being evaluated. For instance, in our running example, the researcher may decide that sound discrimination studies are relevant, as are word recognition studies, whereas studies checking whether infants prefer different prosodic patterns are not, because they do not refer to either sounds or words. Stage II is design space sampling, which refers to the types of procedures, stimuli, populations, etc. that are relevant for that theory. In our running example, the researcher may decide that, given their definition of learning as changing one's behavior, behavioral studies are relevant, whereas neuroimaging studies are not. At this point, the corpus of data to be considered has been defined: it is all the studies where the "parts" of the theory are invoked, and where the "relationships" between the parts can be studied. Stage III involves checking the previous literature in terms of the quality of the data. There are several individual steps in this stage, which will be detailed below. The majority of checks are borrowed directly from the meta-analytic literature, and they may be involved in assessing the quality of individual studies (checking, for instance, for evidence of selective reporting of results), as well as collections of studies (checking for publication bias). In our running example, the researcher may check the quality of each body of literature they are relying on (sound discrimination studies, word recognition studies). Other checks are more specific to the evaluation of explanatory adequacy, and they involve checking whether the whole scope of the theory is already represented in the literature, or whether there are gaps in the design space that could eventually reveal inappropriate generalization.
In our running example, large age gaps and differences in the age sampling between sound discrimination studies, which have been run from birth, and word recognition studies, which are not run before the seventh month, could prevent appropriate modeling of acquisition order in the next stages. Stage IV is quantitatively controlling for study differences, which may be integrated with the fifth stage discussed next, but conceptually is closer to the third stage discussed just above. At this stage, we ask ourselves how to conceptually combine studies, given that the body of literature considered will often not be a string of strict replications or systematic variations of single factors. Some of the questions that arise require us to consider what to do with studies that vary in precision (or sample size) and/or that vary in methods (albeit within conceptual and methodological scope, given decisions made during the first and second stages). In our running example, this will entail considering what to do with studies from the 1980s and 1990s, which often had sample sizes of 6-8 infants (e.g., Kuhl, 1983), versus the 2010s, which have seen some studies with over 100 infants (Newman et al., 2016). Stage V is result integration, where we try to draw a comprehensive picture based on the assembled corpus of data. When doing so, we may need to control for study differences (for instance, if several different procedures fall within the scope of a theory and these lead to different results, what are we to conclude?). At a minimum, this will require statistical modeling of the body of data, in which case the tools can again be borrowed from the meta-analytic literature. When theories are sufficiently specified, they may constitute theories of processes, in which case assessing results at the level of studies may be insufficient or inappropriate.
In this case, the researcher will need additional modeling steps, for instance using the body of previous literature in a rawer form. Tools at this stage include individual participant data (IPD) meta-analyses (Riley et al., 2020; Verhage et al., 2020); mega-analyses (Sung et al., 2014); and hybrid meta- and mega-analyses or pseudo-IPD meta-analyses (Koile and Cristia, 2021; Papadimitropoulou et al., 2019). All points discussed here apply to these different formats of quantitatively aggregating evidence, but for simplicity we limit our considerations to summarizing group-level data. In our running example, this would be when we check for evidence that the public body of literature is consistent with top-down, bottom-up, or parallel theories of early language acquisition.

Table 1
Stages when evaluating a theory's explanatory adequacy using a single study (including large-scale replications), narrative (non-systematic) review, meta-analyses, and CAMA approaches. n/a* = does not represent the whole body of literature.

Stage | Single study | Narrative review | Meta-analyses | CAMA
I. Scope determination | n/a* | subjective, static | static | dynamic
II. Design space sampling | one point | subjective | comprehensive, narrow | comprehensive, broad
III. Checks for literature quality | n/a* | subjective | bias at study/literature level, power analysis | bias at study/literature level, addition of file-drawer studies
IV. Quantitatively controlling for study differences | impossible | impossible | moderator analysis, weighting | moderator analysis, weighting
V. Result integration | irrelevant | narrative; vote counting | meta-regression | replicable, reproducible, extendable meta-regression

Individual studies

Probably the most common way to evaluate theories' explanatory adequacy is by means of individual studies, i.e., a single experiment or manipulation (so not a paper or a series of experiments). Typically, specific predictions are empirically tested (either with human participants or computational models), and the resulting data are taken to support only one of the competing accounts. For instance, in our running example, a prominent individual study often invoked as supporting the bottom-up proposal is Werker and Tees (1984), who documented a decline in the discrimination of non-native sounds between 6 and 12 months of age, before children have built a vocabulary. Or so people thought at the time: Tincoff and Jusczyk (1999) found that children did know some words by 6 months, which put the top-down and parallel theories back in the race. But we argue that an individual study cannot be used to thoroughly check a theory's explanatory adequacy by itself, for at least the following two reasons. First, each study is very specific: it employs one experimental setting, including stimuli, implementation, and sample, and results may not generalize to other settings that vary along one or more dimensions (Brown et al., 2014). When theorizing, we disregard the specificity of studies unless some other study proves that a given setting mattered. We may then revise the theory to predict this difference (which gives enormous weight to that result), or we may argue against the validity of that result to avoid changing our theory. This exception aside, most of the time absence of evidence of a methodological or population-specific effect is implicitly taken as evidence of absence: each theory is as general as it can be given the extant evidence and, in return, each empirical result is taken to be as generalizable as possible barring counterevidence. We agree with Yarkoni (2020) that this is not sound theoretical evaluation practice. Second, single studies are always a noisy window into reality.
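This noisiness can be made concrete with a small simulation, a sketch under invented parameters rather than a model of any cited dataset: many identically designed small studies of one modest true effect produce observed effects that scatter widely, some even pointing in the wrong direction, even though their aggregate is accurate.

```python
import random
import statistics

random.seed(1)

def observed_effect(true_d=0.3, n_per_group=20):
    """Simulate one small two-group study and return its observed
    standardized mean difference (crude pooled SD, for illustration)."""
    a = [random.gauss(true_d, 1.0) for _ in range(n_per_group)]
    b = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
    sd = statistics.pstdev(a + b)
    return (statistics.mean(a) - statistics.mean(b)) / sd

estimates = [observed_effect() for _ in range(1000)]
# Individual studies scatter widely around the true effect of 0.3...
print(min(estimates), max(estimates))
# ...while the aggregate across studies recovers it closely.
print(statistics.mean(estimates))
```

With 20 infants per group, single-study estimates routinely land near zero or far above the true value, which is exactly why no single study can settle a theory's adequacy.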
The best-case scenario is that a predictable proportion of results are misleading because of our inferential tools, which allow false positives and negatives to seep into the literature. Even in this idealized case, it is impossible to determine whether a single result accurately reflects reality, as there are no mechanisms to detect false positives or negatives at the study level. To draw from our running example, data in a meta-analysis of infant vowel discrimination (Tsuji and Cristia, 2014) show that individual studies yield a wide array of results: across different studies, infants discriminate vowels well, barely, or not at all. However, the situation is even more complex in a realistic scenario, because it is not the case that the literature accurately reflects all findings. Indeed, the extant literature (and any single study in it) may be misleading because of questionable research practices, which are eminently difficult to eradicate (Scheel et al., 2021), and because of publication bias skewed towards significant results and thus potentially over-representing false positives (Ferguson and Heene, 2012).

A special case of single studies: large-scale replications. Recent years have seen the rise of cross-laboratory replications, which address several weaknesses we highlighted in the context of individual studies (a good set of proposals in this direction is found in Uhlmann et al., 2019). In particular, initiatives like "Many Labs" (e.g., Klein et al., 2018; Open Science Collaboration, 2015) could address both the over-specificity and the noisiness of single studies. When many labs collect data on a given phenomenon using largely the same experimental procedure, they are varying experimenter identity and increasing sample diversity, which already contributes to greater trust in the likelihood of the study generalizing to a new sample collected by a new experimenter.
Their larger sample sizes also reduce the chance of observing false negatives through their greater precision. Such studies are typically also more trustworthy because analyses are usually pre-registered, and data are open, allowing correction of any analytic judgment error that may have occurred. However, these collaborative efforts have not yet gone so far as to vary methodological parameters systematically (but see Baribault et al., 2018; ManyBabies Consortium, 2020). As a result, they still provide a single datum localized to one specific region in methodological space, and thus they cannot speak to broad generalizability (see also Machery, 2020).¹

Narrative reviews. Narrative reviews seem to provide a framework to weave together multiple studies. We talk here about non-systematic qualitative reviews, which are the prevalent form of evidence integration, often as part of the introduction and/or discussion of an experimental paper, or in invited submissions. As a result, such evaluations of the empirical evidence are often not peer-reviewed independently. Moreover, narrative reviews authored by prominent researchers come with an implied stamp of approval and are hard to contest without also appearing to attack the author – which makes the absence of appropriate peer review all the more problematic. The first major shortcoming of narrative reviews is that data selection is not done in an overt and transparent way, with no obligation to objectively check for quality and bias. In fact, despite the author's best intentions, the procedure whereby a narrative review is put together is fraught with occasions for biases to seep in, including data and outcome selection (for a self-reflective account of how this may happen, see Bishop, 2020). A documented example comes from a recent study of reviews on a potential link between depression and nutrition: Thomas-Odenthal et al.
(2020) found that strong conclusions and recommendations were eight times more common in narrative reviews than in meta-analyses, despite the fact that narrative reviews relied on fewer studies. It may be interesting to replicate such a study focusing on a more theoretical topic in psychology. The second shortcoming of narrative reviews is that single-study interpretation and narrativization can iron out discrepancies. For instance, going back to our running example, imagine that we find a study where infants' sound discrimination correlates with their word recognition abilities, and two studies where the correlation between the two is zero (this is based on observed patterns: Cristia et al., 2014; Wang et al., 2021). Depending on how they feel about the parallel theory, the researcher interpreting these data may argue that the latter two studies failed to find an effect because they were poorly designed or underpowered (so one piece of data supports the bottom-up account, and the other two are ignored); or they may argue that the sound discrimination study was poorly designed, loading on lexical skills, and thus that this is a spurious correlation (allowing the body of results to be consistent with the parallel theory). This is because narrative reviews lack a framework for quantitative evaluation and comparison, and thus inherit some of the issues with single studies. Sometimes, authors of narrative reviews do attempt to take into account a body of evidence with heterogeneous results – but this is hard to do in narrative terms: authors may produce a table summarizing the studies, with a column tagging studies with + or − (or even 0) depending on whether they support a conclusion or not. This entails making a decision about what constitutes a "+": is it a significant result, and does the direction of the effect matter? Is it a result that is numerically in the "right" direction?
What is the threshold for deciding that the evidence aligns one way or another? This method is even more impractical in the case of theoretically relevant and/or methodological moderators that are suspected of having a major effect. Verbally postulating them based on diverging outcomes is not good scientific practice, because it amounts to claiming a "significant" difference without testing for it. The final reason why narrative reviews are pernicious is that there is no procedure for deciding that there is enough evidence. Often, a single study will be considered enough, again reflecting the "single study is decisive" assumption.

Meta-analyses. The criticisms we leveled against single studies have motivated a push towards systematic reviews and meta-analyses in many fields, including psychology. The detailed procedures that have been laid down to guide systematic reviews and meta-analyses (e.g., PRISMA; Moher et al., 2009; Page et al., 2021; Shamseer et al., 2015) can help us counter our selection biases, overtly report quality judgments, and use objective and quantitative methods for study weighting and moderator tests. Moreover, a range of tools can be used to deal with heterogeneous data, and to check for bias in the field as a whole (e.g., Egger et al., 1997). Of course, meta-analyses are not perfect (Ioannidis, 2016), and recent investigations into the transparency and reproducibility of meta-analyses revealed considerable issues (Maassen et al., 2020; Polanin et al., 2020). This makes sense: no tool can force its handler to use it wisely. Meta-analyses are often done to check whether a statement is true or false – e.g., to what extent a certain treatment can reduce depression (e.g., Cuijpers et al., 2013).

¹ We don't discuss our running example here because there have not been any large-scale efforts to replicate sound discrimination and/or word recognition yet (but see ManyBabies Consortium, 2021).
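The field-level bias check cited above (Egger et al., 1997) is, at its core, a regression of standardized effects on precision, with a non-zero intercept flagging funnel-plot asymmetry of the kind produced by publication bias. A self-contained sketch with invented numbers (a real analysis would also test the intercept against its standard error):

```python
# Egger-style asymmetry check: regress each study's z-score
# (effect / SE) on its precision (1 / SE); a non-zero OLS intercept
# suggests small-study asymmetry.

def egger_intercept(effects, ses):
    z = [e / s for e, s in zip(effects, ses)]   # standardized effects
    precision = [1.0 / s for s in ses]          # predictor
    n = len(z)
    mx = sum(precision) / n
    my = sum(z) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(precision, z))
             / sum((x - mx) ** 2 for x in precision))
    return my - slope * mx                      # OLS intercept

# Symmetric literature: same underlying effect at every precision.
symmetric = egger_intercept([0.30, 0.30, 0.30], [0.10, 0.20, 0.40])
# Asymmetric literature: the imprecise studies report inflated effects.
asymmetric = egger_intercept([0.30, 0.50, 0.90], [0.10, 0.20, 0.40])
print(round(symmetric, 3), round(asymmetric, 3))
```

In the symmetric case the intercept is zero; in the asymmetric case it is clearly positive, which is the pattern a biased literature tends to produce.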
Considerations of moderating factors are less common (although certainly not to be ignored; see Riley et al., 2020 on the importance of integrating patient characteristics in individual participant data meta-analyses). This mindset is appropriate for a simple hypothesis-testing, dichotomous reading of what the evidence has to tell us. As a result, heterogeneity is often seen as a threat to interpretation validity (although its etiology is complex, e.g., Engels et al., 2000), meaning that some researchers will be tempted to keep the scope of their meta-analysis narrow (e.g., Li et al., 2015). In the context of checking explanatory adequacy, such traditional meta-analyses have clear advantages over the two alternative approaches, including systematic inclusion of the previous literature and overt modeling of study differences. Our running example was chosen because, in fact, there are meta-analyses for both sound discrimination (Tsuji and Cristia, 2014) and the recognition of word forms (Bergmann and Cristia, 2016), which thus provide information on the timeline of acquisition of these two levels considering all previous evidence, while statistically accounting for, e.g., methodological factors thought to be irrelevant to the theory being tested (although they account for significant variance in effect sizes; cf. Bergmann et al., 2018). Meta-analyses are, however, limited in ways that will become clear in the next section, where we explain our proposed approach.

Our proposal: CAMAs. We have proposed community-augmented meta-analyses (henceforth CAMAs; Tsuji et al., 2014) as a way to further improve on the already powerful meta-analytic approach in two key ways. First, in CAMAs the meta-analytic procedures for screening, inclusion, and qualitative and quantitative analyses, as well as the resulting data and scripts, are public and open, allowing community members to detect and correct any problems at a relatively low cost.
Second, community members can rescue meta-analyses from post-publication deterioration by adding data points that emerged after the meta-analysis was originally carried out; in fact, we have seen that CAMAs provide a natural home for unpublished studies, which helps counter publication bias. Users can also add new variables of interest that the original meta-analyst might not have been aware of or interested in. As a result of these two features (openness and dynamicity), the stage is set for labor to be distributed and decision-making to be democratized.

Extant CAMAs also suggest additional benefits. Indeed, communities are created around the resource, for instance to profit from those data during experiment planning, leading to agreements for standardized formats to be used when extending extant CAMAs or creating new ones. This facilitates the re-use of analysis scripts and enables meta-meta-analyses.² Additionally, the dynamic nature of a CAMA supports the constant integration and evaluation of new evidence, naturally subverting binary readings. We have an interesting anecdote on this which bears on our running example: when the meta-analysis on vowel discrimination was published, it was taken to support the conclusion that vowels become attuned to the native language by about 6 months, based on discrimination trajectories that differed for native versus non-native contrasts, with significant increases for native contrasts and non-significant decreases for non-native ones. Since publication, the meta-analysis has become a CAMA hosted within the MetaLab platform (Bergmann et al., 2018), and the last time we checked, neither the native nor the non-native trend was significant.

Although CAMAs share many features with meta-analyses based on systematic reviews, and we can thus build on insights and methods developed (largely) in the medical sciences, their application in the context of the cognitive sciences, and for theory evaluation specifically, does entail an important mind shift.
We noted above a preference for meta-analyses to be based on a narrow scope, with heterogeneity interpreted as a validity threat. In contrast, theories in cognitive science that aim for generality will need to adopt a broader scope, which may be burdensome for the meta-analyst (as it entails entering more data). The unique features of a CAMA, however, help with this: democratization of the data entry process allows other researchers to add more data points.

A step-by-step manual for using CAMAs to check theories' explanatory adequacy. In this section, we provide 10 steps you can take to check theories against extant data in a cumulative scientific framework, with an extra step that is based on fostering educational synergies in research training (see Figure 1).

² See metalab.stanford.edu and the PsychOpen CAMA at leibniz-psychology.org/en/services/ for implemented CAMAs. We would also like to point to living systematic reviews (Elliott et al., 2014), which, as far as we can judge, are conceptually equivalent to CAMAs and were developed in parallel. In what follows, we continue using the name CAMA, as this is the name under which we had proposed this idea, which has been picked up by others in the cognitive sciences (Burgard et al., 2021; IJzerman et al., 2021).

Step 0: Consider educational opportunities. By trying to use CAMAs to bring data to bear during theory evaluation, you will learn a great deal not only about meta-analyses, but also about how challenging it is to evaluate a theory against data in the age of cumulative science when you have not been trained for it. Considering educational opportunities means that you will make this easier for future generations of researchers, and if you think about it in advance, it may also lighten your load. Although it would be ideal to integrate early-career researchers in any and all steps, the steps leading to the highest synergies are Steps 3, 4, 6, and 9.
In a nutshell, these are either steps in which data are entered into a CAMA (Step 4; notice that Steps 5-7 also involve adding information to an extant CAMA), or steps at which you realize that data are incomplete and more need to be collected (Steps 3, 6, 9). We have involved undergraduate and graduate students in data entry during workshops at international conferences and in our teaching (e.g., Tsuji et al., 2016; now integrated in Black and Bergmann, 2017). Regarding collecting additional data, we support the call for inviting early-career researchers to be involved in replication (Hawkins et al., 2018), which would be useful to increase statistical power (Steps 3 and 9), except that we propose a twist: instead of only engaging in strict replication, students could be involved in expanding the coverage of extant studies by varying methodology in ways predicted to be irrelevant by the theory being evaluated (Step 6). If you are teaching a data analysis course, consider using CAMAs to train students to consider data in the framework of theory evaluation statistically (using meta-regression), but also to critique extant data (by, e.g., checking for design diversity, quality, heterogeneity, and power; Steps 8-10).

Step 1: Define the scope of data your theory is supposed to explain. Reflect on what you need to evaluate your theory (see Table 1): What is the range of study types the theory is thought to cover? Is your candidate theory large in scope ("how children learn") or narrow ("how adults deploy overt attention in visual object tracking")? Are there specific quality features that your theory predicts to be crucial (using a specific type of eye-tracker, removing inattentive participants)? This is also a good point at which to consider pre-registering your meta-analysis (Watt and Kennedy, 2017).
As mentioned previously, for our running example on how learning sounds and words relate to each other in infancy, a reasonable scope would include studies on native sound discrimination as well as studies on the processing of native words.

Step 2: Find (CA)MAs that fall within the scope defined in Step 1. Look for meta-analyses or CAMAs that match the scope you defined in the previous step. Today, this will probably involve combining multiple meta-analyses to fully cover that scope, since most meta-analyses today are phenomenon-driven and insufficiently broad (see Step 4), and then turning your broad, composite meta-analysis into a CAMA. Meta-analyses can be turned into CAMAs by:

• if applicable and possible, formatting raw data according to common standards (see Footnote 1);
• providing a codebook for all columns in the raw data;
• sharing meta-analytic raw data (as extracted from papers or received from authors) and search protocols in an open format (e.g., .csv spreadsheets and .txt files), as well as code to compute effect sizes and perform analyses;
• devising a protocol for adding data (e.g., a form) and for quality control (e.g., a dedicated curator).

As time goes by, the broad-scope CAMAs proposed in Step 4 will become more and more prevalent. Readers of the future: look for extant CAMAs in your field before starting one. If you don't find one, proceed to the next step; if you do, skip to Step 5. In our running example, we found a meta-analysis on vowels and several related to word processing in MetaLab. This is a good start, but given the scope that was defined on theoretical grounds in Step 1, we would conclude that we are missing a meta-analysis on native consonant and tone processing. It would also be ideal to have a meta-analysis on individual measures of sound and word processing (i.e., the correlation between the two).

Step 3: Stop before you start a new CAMA.
You haven't found relevant CAMAs, and you are uncertain whether a CAMA (as an approach) is useful because there are only one or a few studies within the scope defined in Step 1. At this point, you should in fact directly conclude that more research is needed: if there are not enough data points to check for generalizability, how can we trust them (or it) to tell us general facts about psychological phenomena? Come back to this manual when there seems to be enough evidence. If you do find enough studies, then continue to the next step. In our running example, we established in Step 2 that there were a few relevant meta-analyses, but one estimating correlations in individual variation for sound and word processing was missing. Our own knowledge of the literature suggests that there are fewer than 5 studies on this topic, and thus it may be too soon to attempt a meta-analysis on this topic precisely.

[Figure 1. Workflow for using CAMAs to evaluate theories. The number in the black circle refers to the step in the manual; (CA)MA stands for (community-augmented) meta-analyses. Decision points in the workflow: does a relevant (CA)MA exist? Does the literature contain enough studies? Do CAMA studies sample the generalization space? Do you have enough power?]

Step 4: Set up a broad meta-analytic scope. You've defined your scope, failed to find CAMAs that cover it completely, but believe there are enough studies to check for generalizability of the theory you are interested in, so you decide to perform a meta-analysis. Typical meta-analyses are built to evaluate whether there is sufficient evidence for a specific phenomenon, and thus data entry is limited to the scope defined by the theory.
However, this means that criteria of relevance (Step 5), methodological coverage (Step 6), and quality (Step 7) are folded into one, which will make it harder to spot and recover from subjective judgments on any of these points. (Incidentally, this also limits the reusability of the data entered, and thus contradicts cumulative science principles.) So think instead in CAMA terms: define your scope as broadly as you can, and no broader. In the meta-analyses we are considering for our running example, Tsuji and Cristia (2014) included all infant vowel discrimination studies (spanning both behavioral and neuroimaging methods, and diverse populations ranging from normative samples to a variety of less commonly studied infant groups); Bergmann and Cristia (2016) included all infant word segmentation studies.

Step 5: Code CAMA studies for scope. By either finding, combining, and augmenting existing (CA)MAs (Step 2) or constructing your own (Steps 3-4), you are now in possession of a body of data that probably includes studies outside of the scope defined in Step 1. Add a field to the CAMA defining relevance for your particular theory, or programmatically exclude such studies in analysis code, for example by selecting for specific study or population characteristics. Notice that this transparency will allow future reviewers and readers to evaluate whether inclusion was subjective or principled. In our running example, the above-mentioned CAMAs on vowel and word processing were subsequently used for testing theories with a narrow scope (Bergmann and Cristia, 2016; Tsuji and Cristia, 2017) and a broad scope (Bergmann et al., 2017). The latter was in fact an attempt to determine the relative timeline of acquisition of sounds and words.
In that study, we revisited inclusion decisions: we could only find significant effects of age, as predicted by the theory, when we subset the data to studies on typically-developing monolingual infants that had multiple age groups in the same paper.

Step 6: Code CAMA studies for generalizability. Even after subsetting to relevant studies, the CAMA you are using may contain data collected with many methodologies. This is not a weakness. The belief that a single study can focus on a phenomenon by isolating it presumes that methodological variation goes away, which is basically an optical trick: we don't see the variation because we are focusing on one point. In contrast, broadly-defined CAMAs give us an opportunity to overtly consider that variability. Ask yourself instead: has the theory's full design space (set in Step 1) been thoroughly sampled, without confounds? If so, you can use statistical tools to account for this variability (Step 8); if there are regions of the space that have not been sampled, or have been sampled with confounds, consider first collecting more data. In our running example, the fact that we could only retrieve the predicted age effects in a subset of the data generated some concern. At present, we do not know whether this implies a true limit to the generalizability of the theories, or merely a failure of statistical power, due to the fact that effects measured in infancy tend to be very small (Bergmann et al., 2018).

Step 7: Code CAMA studies for quality. As in the previous step, make sure you apply your pre-defined quality criteria from Step 1. In some research, this may mean coding whether data points come from double-blind randomized controlled trials as opposed to correlational research (e.g., Armijo-Olivo et al., 2015). For experimental research, you as an expert can develop field-specific criteria to code studies, ideally by crafting the definitions and then asking a third party to apply them.
An important next step is to statistically test for potential effects of these quality codes, to confirm whether differences in data quality actually matter. Reviewers and readers can then make an informed judgment of whether these explicit and transparent criteria were subjective or principled. Regarding our running example, we made an attempt to check whether measures of data quality defined in advance explained significant variance in the meta-analyses we were considering, and found they did not (Tsuji et al., 2020).

Step 8: Check for heterogeneity and control for orthogonal variance. In Step 1, you defined scope, design space, and quality based on the theory being evaluated. This theory may incorrectly predict homogeneity of results within this whole space. Check whether this is true using traditional meta-analytic tools, including heterogeneity checks (Huedo-Medina et al., 2006), and incorporate statistical controls for methodological (Step 6) and quality (Step 7) dimensions via weighting or as fixed or random factors, as appropriate. In our running example, we systematically control for differences in sample size by inverse-variance weighting; we declare method (i.e., specific methodologies among behavioral and neuroimaging ones) as a fixed effect; and we check for heterogeneity (Bergmann and Cristia, 2016; Tsuji and Cristia, 2017).

Step 9: Consider power. At this point, you will have a CAMA covering precisely the studies within scope, sampling throughout the design space with no confounds, and taking quality into account. You are ready to integrate results using standard meta-analytic regressions, and as in any such work, you should consider whether you have sufficient power (Pigott, 2020). If you find that you do not, you can estimate how much more work is needed and recommend a roadmap for future work, where you may also highlight limits on generalizability present in the extant body of literature your CAMA describes.
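The core quantities behind Steps 8 and 9 — the inverse-variance-weighted pooled effect, its standard error (which drives power), and the Q and I² heterogeneity statistics (Huedo-Medina et al., 2006) — can be sketched in a few self-contained lines. This is an illustrative sketch with made-up effect sizes and variances, not data or code from the meta-analyses discussed here:

```python
import math

def fixed_effect_summary(effects, variances):
    """Inverse-variance-weighted pooled effect, its SE, Cochran's Q, and I^2."""
    weights = [1.0 / v for v in variances]  # weight each study by its precision
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations of study effects from the pooled estimate
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I^2: proportion of total variability attributable to between-study heterogeneity
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    se = math.sqrt(1.0 / sum(weights))  # standard error of the pooled effect
    return pooled, se, q, i2

# Hypothetical study-level effect sizes (e.g., Cohen's d) and their sampling variances
effects = [0.31, 0.10, 0.45, 0.22]
variances = [0.005, 0.01, 0.008, 0.006]
pooled, se, q, i2 = fixed_effect_summary(effects, variances)
print(f"pooled = {pooled:.3f} (SE {se:.3f}), Q = {q:.2f}, I^2 = {i2:.0%}")
```

A Q value well above its degrees of freedom (here, 3) signals heterogeneity, which in a CAMA would motivate the moderator controls described above (method as a fixed effect, random effects, and so on).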
In our running example, we found both power limitations and systematic gaps in the literature; e.g., Tsuji and Cristia (2014) found few studies on the timeline for non-native vowels, and studies since have addressed those gaps (e.g., Mazuka et al., 2014).

Step 10: Continue the work of evaluating your theory. You have made tremendous progress in evaluating your theory in a cumulative-scientific framework, which is all the more reason not to stop now. Be extremely careful about how you interpret your meta-regression, avoiding conclusions like "the theory is (not) right because the mean effect size is (not) significant". This is once again binary reading rearing its ugly head, now treating a meta-analysis as if it were a single study, with a focus on strict significance. Apply to meta-analyses, including CAMAs, the same lessons you learned from improved statistical practices in analyzing single experiments (see also Moreau and Gamble, 2020). In our running example, we felt that the evidence at the time was most consistent with a sounds-first, rather than a words-first, theoretical explanation (Bergmann et al., 2017), but recognized several limitations of the evidence, including the fact that it merely indicated a difference in timelines between vowel and word processing, not a causal relationship. In any case, at this point only one aspect of theory evaluation has occurred, and as described in the introduction and developed further below, there are many other procedures that we can apply to not only check but also develop and improve our theories.

How this approach may change how you use single studies and narrative reviews. We believe that CAMAs are the most promising tool for transparently bringing data to bear during evaluation of theories' explanatory adequacy in the age of cumulative science. In this section, we briefly discuss the place other approaches have in the scientific process (see Figure 2).
Use CAMAs to decide not to run a new study. As CAMAs become more prevalent, it will be increasingly easy to use them to decide whether or not to run a new study. A good example comes from our CAMA of word segmentation (Bergmann and Cristia, 2016), which documented an effect size so small that new studies have a recommended sample size of over two hundred infants, which is not currently feasible for single labs. Another example comes from a CAMA on phonotactic learning (Cristia, 2018), collecting laboratory experiments in which infants were briefly exposed to sound sequences. There were many such studies, following essentially the same method and all published as supporting the theory that prelinguistic infants can learn sound sequences after brief exposure. However, the meta-analysis revealed an effect of zero, strongly suggesting that the phenomenon is not reliable, because (significant) opposite effects were sometimes observed within the same lab with nearly identical methods. This should lead, at a minimum, to changing the technique (habituating the child to the pattern, rather than using brief fixed exposures), and could prompt abandonment of the theory (perhaps humans can only learn sound sequences much later, after we start talking).

Use CAMA-informed single studies to efficiently sample the design space. CAMAs are useful for revealing gaps in the literature. If gathering more data along a similar line just to increase power (see Steps 0, 3, and 9), you may worry about being able to publish the result. Although we do hope there will be a change in attitude towards this kind of study (see also Zwaan et al., 2018), we acknowledge that such work might most plausibly be done in the context of student training, or as a first step during a PhD program (Frank and Saxe, 2012; Hawkins et al., 2018; Roettger and Baer-Henney, 2019). If collected as a student project, the sample may be too small to warrant independent publication.
Nonetheless, the study would still be included in CAMAs and thus contribute to the body of evidence (see Steps 7-8 for adequate integration of studies potentially varying in quality).

Use CAMA-informed studies to replicate-and-extend. Alternatively, you may be able to design your study in such a way that you both collect data that increase power on an established phenomenon and add novel conditions, for instance to extend the coverage of methodological variables predicted to be irrelevant by the theory (see Steps 0 and 6). When writing up the results, it is then possible to emphasize the importance of the novel component (which opens the way to generating knowledge in a new direction), while calling for more work on that same topic with reference to the CAMA results. This way, an author can both signal the importance of cumulativity and write a compelling article (Rabagliati et al., 2019).

Break new ground with single studies. You may have come up with a novel hypothesis for which no suitable previous data exist. Or you may have found that extant empirical data, as integrated in a CAMA, contradict predictions of current theoretical accounts, and you have subjectively interpreted this contradiction, developing a new hypothesis about which factors caused the observed discrepancy. Perhaps you decided that the method is flawed and/or the theory is false, so you would like to launch a new line of research to explore different kinds of methods and/or alternative theories. We do not want to discourage you from running this type of study. However, we hope you will remember that the role of this new study cannot be to prove or disprove a theory (see the section on single studies), but to propose an idea that can then serve as a starting point for a new cumulative research endeavor.
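The sample-size recommendations that fall out of a CAMA (such as the figure of over two hundred infants mentioned above) rest on a standard power calculation. The following is a minimal sketch under simplifying assumptions: a two-sided one-sample test and a normal approximation, with the effect size of 0.20 as a placeholder value rather than the exact meta-analytic estimate of Bergmann and Cristia (2016):

```python
from math import ceil
from statistics import NormalDist

def n_for_one_sample_test(d, alpha=0.05, power=0.80):
    """Approximate N needed for a two-sided one-sample test of standardized effect size d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile corresponding to the desired power
    return ceil(((z_alpha + z_power) / d) ** 2)

# A small, placeholder meta-analytic effect size
print(n_for_one_sample_test(0.20))  # → 197
print(n_for_one_sample_test(0.50))  # → 32; a medium effect needs far fewer participants
```

The exact t-based figure is slightly larger than the normal approximation, consistent with recommendations that exceed what most individual labs can collect.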
Use narrative reviews to inform other stages of theory evaluation, adaptation, and development. We have focused here on explanatory adequacy, but as summarized in the introduction, building solid theories takes much more than that. Some experts on this broader view of theory development recommend formalization for the precision it brings to theoretical discussions (e.g., Robinaugh et al., 2021). Without discounting these important ideas, Guest and Martin (2021) highlight the value of considering a wide range of levels of specificity when describing psychological phenomena, ranging from very specific hypotheses made in the context of one study to abstract theories in which plausible mechanisms have been specified (see saliently their Figure 2). In this context, narrative reviews still play a role as we try to clarify concepts and phenomena, and their relations to each other (crucial for theory development, e.g., Borsboom et al., 2021).

[Figure 2. Proposed key roles of CAMAs, single studies, and narrative reviews in the context of cumulative science. Narrative reviews: clarify concepts; establish scope & bounds. CAMAs: suggest critical data needed for extant theories' coverage; integrate results on extant theories; highlight contradictions between data & theory. Single studies: add data to increase precision; add data to increase coverage; provide data on new topics; determine components; propose mechanisms.]

Limitations of the present paper. Before closing, we would like to highlight some shortcomings of this paper, the first being that we focused on CAMAs' role in the explanatory adequacy phase. We thus say little about other phases, and notably about the question of when one should abandon a theory altogether, which, as one of our reviewers cogently pointed out, may be behind wasted research efforts. We believe this is an important topic that should be revisited, at which point CAMAs may be found useful in two particular ways.
First, CAMAs may reveal that a theory's scope is so narrow, and/or the proportion of variance explained so small, that it is of little use in explaining psychological phenomena in the real world. Second, having open meta-analytic repositories where data are more easily integrated into the body of literature can help provide a home for studies that would otherwise be destined for the file drawer, and thus CAMAs could help us measure wasted scientific effort.

Another limitation of the present paper is that the types of examples we have discussed are based on group-level effect sizes, typically averaged across trials and conditions, and this type of approach may be suboptimal in the quest to shed light on cognitive processes. Haines et al. (2020) recently drew attention to this issue and provided recommendations for data analyses. We would like to stress that systematicity, openness, and dynamicity, the three features that make CAMAs particularly powerful for testing explanatory adequacy, should carry over to this context. Of course, laying out how to engage in CAMAs using more granular data (at the trial level and below) will require additional work, which we hope will be undertaken in the future.

Conclusion. In this paper, we have considered traditional ways of bringing data to bear when evaluating theories, and concluded that none of them is perfect in the current age of cumulative science. Specifically, considering single studies in isolation (including large-scale collaborations) as well as weaving together single studies in a narrative, non-systematic review both suffer from selection biases and inappropriate sampling of the space of possibilities. We have instead provided step-by-step instructions for using meta-analyses based on rigorous systematic reviews, particularly open, community-augmented meta-analyses (CAMAs).
Note that these still require the person using a meta-analysis for theory evaluation to have a clear idea of what the theory states, what its key concepts are, and what reasonable implementations of those concepts are. Are CAMAs perfect? We suspect not, because CAMAs still rely on the extant literature, and thus flaws in the literature can be carried over. Although as meta-analysts we have a few tools in our kit to deal with imperfection (see Step 8), the result of a CAMA is still bounded by the overall quality and quantity of the underlying literature; we want to emphasize, however, that CAMAs make those empirical boundaries clearer. Being a scientist means standing on the shoulders of giants. We hope that our proposal provides guidance on how to stand firmly on these shoulders, and how others can in turn stand on ours. We look forward to a new generation of psychologists that cumulatively and systematically builds on previous work, and approaches data collection and theory construction with this novel lens, making ours a sustainable discipline that ever continues to approach the truth.

Author contact. We are grateful to Caroline Rowland, Dorothy Bishop, and Eiko Fried for invaluable feedback on an earlier version of this manuscript. All errors remain our own.

• Alejandrina Cristia, ORCID 0000-0003-2979-4556, alecristia@gmail.com
• Sho Tsuji, ORCID 0000-0001-9580-4500, tsujish@gmail.com
• Christina Bergmann, ORCID 0000-0003-2656-9070, chbergma@gmail.com

Conflict of interest and funding. The authors declare no conflict of interest related to the contents of this manuscript. The authors acknowledge grants from the Berkeley Initiative for Transparency in the Social Sciences, a program of the Center for Effective Global Action (CEGA), with support from the Laura and John Arnold Foundation. The authors were further supported by the H2020 European Research Council [Marie Skłodowska-Curie grant nos.
659553 and 660911], Agence Nationale de la Recherche (ANR-17-CE28-0007 LangAge, ANR-16-DATA-0004, ANR-14-CE30-0003, ANR-17-EURE-0017), the Fetzer-Franklin Fund, and the James S. McDonnell Foundation Understanding Human Cognition Scholar Award.

Author contributions. AC: conceptualization, investigation, methodology, project administration, visualization, writing – original draft, writing – review & editing; ST: conceptualization, investigation, methodology, writing – original draft, writing – review & editing; CB: conceptualization, investigation, methodology, visualization, writing – original draft, writing – review & editing.

Open science practices. This article is theoretical and does not have accompanying data and materials, nor was it pre-registered. Thus it was not eligible for the Open Science badges. The entire editorial process, including the open reviews, is published in the online supplement.

References

Armijo-Olivo, S., da Costa, B. R., Cummings, G. G., Ha, C., Fuentes, J., Saltaji, H., & Egger, M. (2015). PEDro or Cochrane to assess the quality of clinical trials? A meta-epidemiological study. PLOS ONE, 10(7).

Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., van Ravenzwaaij, D., White, C. N., De Boeck, P., & Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences, 115(11), 2607–2612.

Bergmann, C., & Cristia, A. (2016). Development of infants' segmentation of words from native speech: A meta-analytic approach. Developmental Science, 19(6), 901–917.

Bergmann, C., Tsuji, S., & Cristia, A. (2017). Top-down versus bottom-up theories of phonological acquisition: A big data approach. Interspeech 2017, 2103–2107.

Bergmann, C., Tsuji, S., Piccinini, P. E., Lewis, M. L., Braginsky, M., Frank, M. C., & Cristia, A. (2018).
Promoting replicability in developmental research through meta-analyses: Insights from language acquisition research. Child Development, 89(6), 1996–2009.

Bishop, D. V. (2020). The psychology of experimental psychologists: Overcoming cognitive constraints to improve research: The 47th Sir Frederic Bartlett Lecture. Quarterly Journal of Experimental Psychology, 73(1), 1–19.

Black, A., & Bergmann, C. (2017). Quantifying infants' statistical word segmentation: A meta-analysis. 39th Annual Meeting of the Cognitive Science Society, 124–129.

Borsboom, D., van der Maas, H. L., Dalege, J., Kievit, R. A., & Haig, B. D. (2021). Theory construction methodology: A practical framework for building theories in psychology. Perspectives on Psychological Science, 16, 756–766.

Brown, S. D., Furrow, D., Hill, D. F., Gable, J. C., Porter, L. P., & Jacobs, W. J. (2014). A duty to describe: Better the devil you know than the devil you don't. Perspectives on Psychological Science, 9(6), 626–640.

Burgard, T., Bošnjak, M., & Studtrucker, R. (2021). Community-augmented meta-analyses (CAMAs) in psychology. Zeitschrift für Psychologie, 229, 15–23.

Cristia, A. (2018). Can infants learn phonology in the lab? A meta-analytic answer. Cognition, 170, 312–327.

Cristia, A., Seidl, A., Junge, C., Soderstrom, M., & Hagoort, P. (2014). Predicting individual variation in language from infant speech perception measures. Child Development, 85(4), 1330–1345.

Cuijpers, P., Berking, M., Andersson, G., Quigley, L., Kleiboer, A., & Dobson, K. S. (2013). A meta-analysis of cognitive-behavioural therapy for adult depression, alone and in comparison with other treatments. The Canadian Journal of Psychiatry, 58(7), 376–385.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. British Medical Journal, 315(7109), 629–634.

Elliott, J. H., Turner, T., Clavisi, O., Thomas, J., Higgins, J. P., Mavergames, C., & Gruen, R. L. (2014).
living systematic reviews: an emerging opportunity to narrow the evidence-practice gap. plos medicine, 11(2). engels, e. a., schmid, c. h., terrin, n., olkin, i., & lau, j. (2000). heterogeneity and statistical significance in meta-analysis: an empirical study of 125 meta-analyses. statistics in medicine, 19(13), 1707–1728. feldman, n. h., myers, e. b., white, k. s., griffiths, t. l., & morgan, j. l. (2013). word-level information influences phonetic learning in adults and infants. cognition, 127(3), 427–438. ferguson, c. j., & heene, m. (2012). a vast graveyard of undead theories: publication bias and psychological science’s aversion to the null. perspectives on psychological science, 7(6), 555–561. frank, m. c., & saxe, r. (2012). teaching replication. perspectives on psychological science, 7(6), 600– 604. fried, e. i. (2020). lack of theory building and testing impedes progress in the factor and network literature. psychological inquiry, 31(4), 271–288. guest, o., & martin, a. e. (2021). how computational modeling can force theory building in psychological science. perspectives on psychological science, 16(4), 789–802. haines, n., kvam, p. d., irving, l., smith, c., beauchaine, t. p., pitt, m. a., & turner, b. (2020). theoretically informed generative models can advance the psychological and brain sciences: lessons from the reliability paradox. https : / / psyarxiv . com / xr7y3 / download?format=pdf hawkins, r. x., smith, e. n., au, c., arias, j. m., catapano, r., hermann, e., keil, m., lampinen, a., raposo, s., reynolds, j., et al. (2018). improving the replicability of psychological science through pedagogy. advances in methods and practices in psychological science, 1(1), 7– 18. huedo-medina, t. b., sánchez-meca, j., marinmartinez, f., & botella, j. (2006). assessing heterogeneity in meta-analysis: q statistic or i2 index? psychological methods, 11(2), 193–206. 
https://psyarxiv.com/xr7y3/download?format=pdf https://psyarxiv.com/xr7y3/download?format=pdf 14 ijzerman, h., hadi, r., coles, n., paris, b., elisa, s., fritz, w., klein, r. a., & ropovik, i. (2021). social thermoregulation: a meta-analysis. https : //psyarxiv.com/fc6yq/download?format=pdf ioannidis, j. p. (2016). the mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses. the milbank quarterly, 94(3), 485–514. john, l. k., loewenstein, g., & prelec, d. (2012). measuring the prevalence of questionable research practices with incentives for truth telling. psychological science, 23(5), 524–532. klein, r. a., ratliff, k. a., vianello, m., adams jr, r. b., bahnık, š., bernstein, m. j., bocian, k., brandt, m. j., brooks, b., brumbaugh, c. c., et al. (2014). investigating variation in replicability. social psychology, 45, 142–152. klein, r. a., vianello, m., hasselman, f., adams, b. g., adams jr, r. b., alper, s., aveyard, m., axt, j. r., babalola, m. t., bahnık, š., et al. (2018). many labs 2: investigating variation in replicability across samples and settings. advances in methods and practices in psychological science, 1(4), 443–490. koile, e., & cristia, a. (2021). towards cumulative cognitive science: a comparison of meta-analysis, mega-analysis, and hybrid approaches. open mind, 5, 154–173. kuhl, p. k. (1983). perception of auditory equivalence classes for speech in early infancy. infant behavior and development, 6(2-3), 263–285. li, s.-j., jiang, h., yang, h., chen, w., peng, j., sun, m.-w., lu, c. d., peng, x., & zeng, j. (2015). the dilemma of heterogeneity tests in metaanalysis: a challenge from a simulation study. plos one, 10(5), e0127538. maassen, e., van assen, m. a., nuijten, m. b., olssoncollentine, a., & wicherts, j. m. (2020). reproducibility of individual effect sizes in metaanalyses in psychology. plos one, 15(5). machery, e. (2020). what is a replication? philosophy of science, 87(4), 545–567. 
manybabies consortium. (2020). quantifying sources of variability in infancy research using the infant-directed-speech preference. advances in methods and practices in psychological science, 3(1), 24–52. manybabies consortium. (2021). mb-athome: online infant data collection. https : / / manybabies . github.io/mb-athome/ mazuka, r., hasegawa, m., & tsuji, s. (2014). development of non-native vowel discrimination: improvement without exposure. developmental psychobiology, 56(2), 192–209. moher, d., liberati, a., tetzlaff, j., altman, d. g., & group, p. (2009). preferred reporting items for systematic reviews and meta-analyses: the prisma statement. plos medicine, 6(7). moreau, d., & gamble, b. (2020). conducting a metaanalysis in the age of open science: tools, tips, and practical recommendations. psychological methods. https : / / psycnet . apa . org / fulltext / 2020-66880-001.pdf newman, r. s., rowe, m. l., & ratner, n. b. (2016). input and uptake at 7 months predicts toddler vocabulary: the role of child-directed speech and infant processing skills in language development. journal of child language, 43(5), 1158– 1173. open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251). page, m. j., moher, d., bossuyt, p. m., boutron, i., hoffmann, t. c., mulrow, c. d., shamseer, l., tetzlaff, j. m., akl, e. a., brennan, s. e., et al. (2021). prisma 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. british medical journal, 372. palminteri, s., wyart, v., & koechlin, e. (2017). the importance of falsification in computational cognitive modeling. trends in cognitive sciences, 21(6), 425–433. papadimitropoulou, k., stijnen, t., dekkers, o. m., & le cessie, s. (2019). one-stage random effects meta-analysis using linear mixed models for aggregate continuous outcome data. research synthesis methods, 10(3), 360–375. pigott, t. d. (2020). 
power of statistical tests for subgroup analysis in meta-analysis. n. ting, jc cappelleri, s. ho,(din) d.-g. chen (editors), design and analysis of subgroups with biopharmaceutical applications, 347–368. polanin, j. r., hennessy, e. a., & tsuji, s. (2020). transparency and reproducibility of meta-analyses in psychology: a meta-review. perspectives on psychological science, 15(4), 1026–1041. rabagliati, h., ferguson, b., & lew-williams, c. (2019). the profile of abstract rule learning in infancy: meta-analytic and experimental evidence. developmental science, 22(1). riley, r. d., debray, t. p., fisher, d., hattle, m., marlin, n., hoogland, j., gueyffier, f., staessen, j. a., wang, j., moons, k. g., et al. (2020). individual participant data meta-analysis to examhttps://psyarxiv.com/fc6yq/download?format=pdf https://psyarxiv.com/fc6yq/download?format=pdf https://manybabies.github.io/mb-athome/ https://manybabies.github.io/mb-athome/ https://psycnet.apa.org/fulltext/2020-66880-001.pdf https://psycnet.apa.org/fulltext/2020-66880-001.pdf 15 ine interactions between treatment effect and participant-level covariates: statistical recommendations for conduct and planning. statistics in medicine, 39(15), 2115–2137. robinaugh, d. j., haslbeck, j. m., ryan, o., fried, e. i., & waldorp, l. j. (2021). invisible hands and fine calipers: a call to use formal theory as a toolkit for theory construction. perspectives on psychological science, 16(4), 725–743. roettger, t. b., & baer-henney, d. (2019). toward a replication culture: speech production research in the classroom. phonological data and analysis, 1, 13. scheel, a. m., schijen, m. r., & lakens, d. (2021). an excess of positive results: comparing the standard psychology literature with registered reports. advances in methods and practices in psychological science, 4(2), 1–12. shamseer, l., moher, d., clarke, m., ghersi, d., liberati, a., petticrew, m., shekelle, p., & stewart, l. a. (2015). 
preferred reporting items for systematic review and meta-analysis protocols (prisma-p) 2015: elaboration and explanation. british medical journal, 349. sung, y. j., schwander, k., arnett, d. k., kardia, s. l., rankinen, t., bouchard, c., boerwinkle, e., hunt, s. c., & rao, d. c. (2014). an empirical comparison of meta-analysis and megaanalysis of individual participant data for identifying gene-environment interactions. genetic epidemiology, 38(4), 369–378. thomas-odenthal, f., molero, p., van der does, w., & molendijk, m. (2020). impact of review method on the conclusions of clinical reviews: a systematic review on dietary interventions in depression as a case in point. plos one, 15(9). tincoff, r., & jusczyk, p. w. (1999). some beginnings of word comprehension in 6-month-olds. psychological science, 10(2), 172–175. tsuji, s., bergmann, c., & cristia, a. (2014). community-augmented meta-analyses: toward cumulative data assessment. perspectives on psychological science, 9(6), 661–665. tsuji, s., & cristia, a. (2014). perceptual attunement in vowels: a meta-analysis. developmental psychobiology, 56(2), 179–191. tsuji, s., & cristia, a. (2017). which acoustic and phonological factors shape infants’ vowel discrimination? exploiting natural variation in inphondb. interspeech, 2108–2112. tsuji, s., cristia, a., frank, m. c., & bergmann, c. (2020). addressing publication bias in metaanalysis. zeitschrift für psychologie, 228, 50–61. tsuji, s., lewis, m., bergmann, c., frank, m., & cristia, a. (2016). tutorial: meta-analytic methods for cognitive science. in a. papafragou, d. grodner, d. mirman, & j. trueswell (eds.), roceedings of the 38th annual conference of the cognitive science society (pp. 33–34). cognitive science society. uhlmann, e. l., ebersole, c. r., chartier, c. r., errington, t. m., kidwell, m. c., lai, c. k., mccarthy, r. j., riegelman, a., silberzahn, r., & nosek, b. a. (2019). scientific utopia iii: crowdsourcing science. 
perspectives on psychological science, 14(5), 711–733. ulrich, r., & miller, j. (2020). meta-research: questionable research practices may have little effect on replicability. elife, 9. vazire, s. (2018). implications of the credibility revolution for productivity, creativity, and progress. perspectives on psychological science, 13(4), 411–417. verhage, m. l., schuengel, c., duschinsky, r., van ijzendoorn, m. h., fearon, r. p., madigan, s., roisman, g. i., bakermans–kranenburg, m. j., oosterman, m., & on attachment transmission synthesis, c. (2020). the collaboration on attachment transmission synthesis (cats): a move to the level of individual-participant-data metaanalysis. current directions in psychological science, 29(2), 199–206. wang, y., seidl, a., & cristia, a. (2021). infant speech perception and cognitive skills as predictors of later vocabulary. infant behavior and development, 62. watt, c. a., & kennedy, j. e. (2017). options for prospective meta-analysis and introduction of registration-based prospective meta-analysis. frontiers in psychology, 7, 2030. werker, j. f., & tees, r. c. (1984). cross-language speech perception: evidence for perceptual reorganization during the first year of life. infant behavior and development, 7(1), 49–63. yarkoni, t. (2020). the generalizability crisis. behavioral and brain sciences, 1–37. zwaan, r. a., etz, a., lucas, r. e., & donnellan, m. b. (2018). making replication mainstream. behavioral and brain sciences, 41. 
Meta-Psychology, 2021, vol 5, MP.2020.2625, https://doi.org/10.15626/mp.2020.2625
Article type: Tutorial
Published under the CC-BY4.0 license
Open data: Not applicable
Open materials: Not applicable
Open and reproducible analysis: Not applicable
Open reviews and editorial process: Yes
Preregistration: No
Edited by: Erin M. Buchanan
Reviewed by: Williams, M., & Dunleavy, D.
Analysis reproduced by: Not applicable
All supplementary files can be accessed at the OSF project page: https://doi.org/10.17605/osf.io/u68y7

Preregistration of Secondary Data Analysis: A Template and Tutorial

Olmo R. van den Akker, Department of Methodology and Statistics, Tilburg University
Lorne Campbell, Department of Psychology, University of Western Ontario
Rodica Ioana Damian, Department of Psychology, University of Houston
Andrew N. Hall, Department of Psychology, Northwestern University
Elliott Kruse, EGADE Business School, Tec de Monterrey
Stuart J. Ritchie, Social, Genetic and Developmental Psychiatry Centre, King's College
Anna E. van 't Veer, Methodology and Statistics Unit, Institute of Psychology, Leiden University
Sara J. Weston, Department of Psychology, University of Oregon
William J. Chopik, Department of Psychology, Michigan State University
Pamela E. Davis-Kean, Department of Psychology, University of Michigan
Jessica E. Kosie, Department of Psychology, Princeton University
Jerome Olsen, Department of Applied Psychology: Work, Education and Economy, University of Vienna; Max Planck Institute for Research on Collective Goods
K. D. Valentine, Division of General Internal Medicine, Massachusetts General Hospital
Marjan Bakker, Department of Methodology and Statistics, Tilburg University

Preregistration has been lauded as one of the solutions to the so-called 'crisis of confidence' in the social sciences and has therefore gained popularity in recent years. However, the current guidelines for preregistration have been developed primarily for studies where new data will be collected. Yet preregistering secondary data analyses, where new analyses are proposed for existing data, is just as important, given that researchers' hypotheses and analyses may be biased by their prior knowledge of the data. The need for proper guidance in this area is especially pressing now that data are increasingly shared publicly. In this tutorial, we present a template specifically designed for the preregistration of secondary data analyses and provide comments and a worked example that may help with using the template effectively. Through this illustration, we show that completing such a template is feasible, helps limit researcher degrees of freedom, and may make researchers more deliberate in their data selection and analysis efforts.

Keywords: preregistration, secondary data analysis

Preregistration has been lauded as one of the key solutions to the replication crisis in the social sciences, mainly because it has the potential to prevent p-hacking by restricting researcher degrees of freedom, but also because it improves transparency and study planning, and can reduce publication bias.
However, despite its growing popularity, preregistration is still in its infancy and preregistration practices are far from optimal (Claesen, Gomes, Tuerlinckx, & Vanpaemel, 2019; Veldkamp et al., 2018). Moreover, the current guidelines for preregistration are primarily relevant for studies in which new data will be collected. In this paper, we suggest that preregistration is also attainable when testing new hypotheses with pre-existing data and provide a tutorial on how to effectively preregister such secondary data analyses.

Secondary data analysis involves the analysis of existing data to investigate research questions, often in addition to the main ones for which the data were originally gathered (Grady, Cummings, & Hulley, 2013). Analyzing these datasets comes with its own challenges (Cheng & Phillips, 2014; Smith et al., 2011). For instance, common secondary datasets often include many different variables from many different respondents, sometimes measured at different points in time (e.g., the World Values Survey, Inglehart et al., 2014; the Wisconsin Longitudinal Study, Herd, Carr, & Roan, 2014). This provides ample opportunity for researchers to p-hack and increases the likelihood of obtaining spurious statistically significant results (Weston, Ritchie, Rohrer, & Przybylski, 2019). In addition, because secondary data are often extensive and difficult to collect initially, researchers frequently analyze the same dataset multiple times to answer different research questions. Researchers are therefore not likely to come to a dataset with completely fresh eyes, and may have insight regarding associations between at least some of the variables in the dataset. Such prior knowledge may steer the researchers toward a hypothesis that they already know is in line with the data. This practice is called HARKing (hypothesizing after results are known; Kerr, 1998) and can lead to false positive results (Rubin, 2017).
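The earlier point that many candidate variables invite spurious significance can be made concrete with a little arithmetic: across k independent tests of true null hypotheses at alpha = .05, the probability of at least one false positive is 1 - (1 - .05)^k. A minimal sketch (the test counts are arbitrary, chosen only for illustration):

```python
# Chance of at least one false positive when k independent
# true null hypotheses are each tested at alpha = .05.
alpha = 0.05

def familywise_rate(k, alpha=alpha):
    """P(at least one false positive) across k independent tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20, 60):
    print(f"{k:3d} tests -> P(>=1 false positive) = {familywise_rate(k):.2f}")
```

With 60 candidate variables the rate exceeds .95, so an unconstrained fishing expedition is all but guaranteed to net a "significant" result somewhere, which is precisely the risk a preregistration is meant to curb.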
If HARKing goes undisclosed, it is not possible for third parties to evaluate whether the statistical tests for the hypotheses are well founded, as statistical hypothesis tests (e.g., null hypothesis significance tests, NHST) are only valid when the hypotheses are drawn up a priori (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012; but see Devezer et al., 2020). Because secondary data analyses are particularly sensitive to data-driven researcher decisions, preregistering them is especially important.

Other options exist to increase error control and illustrate sensitivity to flexibility in data analysis, however. For example, a multiverse analysis (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016) or specification-curve analysis (Simonsohn, Simmons, & Nelson, 2015) would be useful if researchers are unsure about which specific analysis is most suitable to test their hypothesis. In these approaches, all plausible analytic specifications are implemented to get an overall picture of the evidence without the need to choose a specific (and potentially biased) statistical analysis. This makes it impossible for researchers to cherry-pick variables or analyses based on their prior knowledge. However, it would still be possible to cherry-pick the range of analyses, and it is difficult to weight the results from the different analyses in an unbiased manner. It would thus be appropriate to complement these methods with a preregistration, especially when the aim is to limit the potential for p-hacking and HARKing, for both primary and secondary data analysis.

To facilitate the preregistration of secondary data analyses, a session was organized at the Society for the Improvement of Psychological Science (SIPS, see https://improvingpsych.org) conference in 2018 with the aim of creating an expert-generated preregistration template specifically tailored to secondary data analysis.
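The multiverse/specification-curve idea discussed above can be sketched in a few lines: enumerate a grid of analytic choices, run the same analysis under each, and inspect the full set of estimates rather than one hand-picked result. The sketch below uses simulated null data and two invented researcher degrees of freedom (an outlier rule and a sample-subset rule); a real application (Steegen et al., 2016) would cover far more specifications:

```python
import itertools
import random
import statistics

random.seed(1)
# Toy data: predictor x and outcome y with no built-in relation.
xs = [random.gauss(0, 1) for _ in range(200)]
ys = [random.gauss(0, 1) for _ in range(200)]

def slope(x, y):
    """Ordinary least squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical researcher degrees of freedom (invented for illustration).
outlier_cutoffs = {"none": float("inf"), "2sd": 2.0, "3sd": 3.0}
subsets = {"all": lambda i: True, "first-half": lambda i: i < 100}

sd_y, mean_y = statistics.pstdev(ys), statistics.fmean(ys)
estimates = {}
for (o_name, cut), (s_name, keep) in itertools.product(
        outlier_cutoffs.items(), subsets.items()):
    idx = [i for i in range(200)
           if keep(i) and abs(ys[i] - mean_y) <= cut * sd_y]
    estimates[(o_name, s_name)] = slope([xs[i] for i in idx],
                                        [ys[i] for i in idx])

# A miniature "specification curve": all estimates, sorted by size.
for spec, b in sorted(estimates.items(), key=lambda kv: kv[1]):
    print(spec, round(b, 3))
```

Reporting the whole curve makes it obvious when a conclusion hinges on one particular combination of choices, which is exactly the cherry-picking the surrounding text warns about.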
Providing guidance on how to preregister is vital, as preregistration is hard and requires practice and effort to be effective (Nosek et al., 2019). Participants in the session were experts on or had experience with secondary data analysis, preregistration, or both, thereby providing a good mix of expertise for the task at hand. The session began with analyzing the standard OSF preregistration template (Bowman et al., 2016), and through successive rounds of discussion and testing, participants decided whether items could be edited, omitted, or added to make the template suitable for secondary data analysis. The resulting first draft of the template was further improved in the months following the conference through a digital back and forth involving the preregistration of an actual secondary data analysis. These efforts (the generation of the template and the preregistration of an example analysis) culminated in the preregistration template presented here.

Specific templates like this can greatly facilitate preregistration, as they give authors guidance about what to include in the preregistration so that all researcher degrees of freedom are covered (Veldkamp et al., 2018). As such, the template would also be well suited as a framework for a Registered Report submission that focuses on secondary data. Some of the questions in the preregistration template for secondary data analysis are similar to the questions in more 'traditional' templates; others aim to solve the challenges unique to the preregistration of secondary data analysis, such as the increased need for transparency about the process leading up to the preregistration.

The template presented here is not the only preregistration template for secondary data analysis. Mertens and Krypotos (2019) simultaneously developed a template consisting of 10 questions based on the AsPredicted template (see https://aspredicted.org).
Our template differs from that template in two ways. First, it involves 25 questions and therefore captures a wider array of researcher degrees of freedom. For example, our template includes specific questions about defining and handling outliers, and the specification of robustness checks, both of which give leeway for data-driven decisions in secondary data analyses (Weston et al., 2019). Moreover, a more comprehensive template gives researchers the option to use as many or as few of the questions as they want, in order to tailor their preregistration to specific study needs. Second, our template comes with elaborate comments and a worked example that we hope makes the preregistration of secondary data analysis more concrete. We think both these contributions are helpful to researchers looking to preregister their secondary data analysis.

Using the template to preregister a secondary data analysis: template questions and example answers, with guiding comments in italics

Part 1: Study information

Question 1: Provide the working title of your study.

Do religious people follow the golden rule? Assessing the link between religiosity and prosocial behavior using data from the Wisconsin Longitudinal Study.

We specifically mention the data set we are using so that readers know we are preregistering a secondary data analysis. Clarifying this from the outset is helpful because readers may look at such preregistrations differently than they look at preregistrations of primary data analyses.

Question 2: Name the authors of this preregistration.

Josiah Carberry (JC) – ORCID ID: https://orcid.org/0000-0002-1825-0097
Pomona Sprout (PS) – personal webpage: https://en.wikipedia.org/wiki/Hogwarts_staff#Pomona_Sprout

When listing the authors, add an ORCID ID or a link to a personal webpage so that you and your co-authors can be easily identified.
This is particularly important when preregistering secondary data analyses because you may have prior knowledge about the data that may influence the contents of the preregistration. If a reader has access to a personal profile that lists prior research, they can judge whether any prior knowledge of the data is plausible and whether it potentially biased the data analysis. That is, whether it introduced systematic error in the testing because researchers selected or encouraged one outcome or answer over others (Merriam-Webster, n.d.).

Question 3: List each research question included in this study.

RQ1 = Are more religious people more prosocial than less religious people?
RQ2 = Does the relationship between religiosity and prosociality differ for people with different religious affiliations?

Research questions are often used as a steppingstone for the development of specific and testable hypotheses and can therefore be phrased on a more conceptual level than hypotheses. Note that it is perfectly fine to skip the research questions and only preregister your hypotheses.

Question 4: Please provide the hypotheses of your secondary data analysis. Make sure they are specific and testable, and make it clear what your statistical framework is (e.g., Bayesian inference, NHST). In case your hypothesis is directional, do not forget to state the direction. Please also provide a rationale for each hypothesis.

"Do to others as you would have them do to you" (Luke 6:31). This golden rule is taught by all major religions, in one way or another, to promote prosociality (Parliament of the World's Religions, 1993). Religious prosociality is the idea that religions facilitate behavior that is beneficial for others at a personal cost (Norenzayan & Shariff, 2008).
The encouragement of prosocial behavior by religious teachings appears to be fruitful: a considerable amount of research shows that religion is positively related to prosocial behavior (e.g., Friedrichs, 1960; Koenig, McGue, Krueger, & Bouchard, 2007; Morgan, 1983). For instance, religious people have been found to give more money to, and volunteer more frequently for, charitable causes than their non-religious counterparts (e.g., Grønbjerg & Never, 2004; Lazerwitz, 1962; Pharoah & Tanner, 1997). Also, the more important people viewed their religion, the more likely they were to do volunteer work (Youniss, McLellan, & Yates, 1999). Based on the above, we expect that religiosity is associated with prosocial behavior in our sample as well. To assess this prediction, we will test the following hypotheses using a null hypothesis significance testing framework:

H0(1) = In men and women who graduated from Wisconsin high schools in 1957, there is no association between religiosity and prosociality.
H1(1) = In men and women who graduated from Wisconsin high schools in 1957, there is a positive association between religiosity and prosociality.

Just like in primary data analysis, a good hypothesis is specific (i.e., it includes a specific population), quantifiable, and testable. A one-sided hypothesis is suitable if theory, previous literature, or (scientific) reasoning indicates that your effect of interest is likely to be in a certain direction (e.g., a < b). Note that we provided detailed information about the theory and previous literature in our answer. This is crucial for secondary data analysis because it allows the reader to assess the thought process behind the hypotheses. Readers can then judge for themselves whether they think the hypotheses logically follow from the theory and previous literature or whether they may have been tainted by the authors' prior knowledge of the data.
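For readers who want to see what a directional test of a hypothesis like H1(1) could look like in code, here is a sketch using a one-sided permutation test on simulated scores. The variable names, the effect size, and the choice of a permutation test are all illustrative assumptions, not the analysis preregistered in this worked example:

```python
import random
import statistics

random.seed(42)

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def one_sided_perm_test(x, y, n_perm=2000):
    """P(r_perm >= r_obs) under the null of no association;
    directional, matching the positive sign predicted by H1(1)."""
    r_obs = pearson_r(x, y)
    y_shuf = list(y)
    hits = 0
    for _ in range(n_perm):
        random.shuffle(y_shuf)
        if pearson_r(x, y_shuf) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Hypothetical scores standing in for religiosity / prosociality.
religiosity = [random.gauss(0, 1) for _ in range(80)]
prosociality = [r * 0.4 + random.gauss(0, 1) for r in religiosity]
r, p = one_sided_perm_test(religiosity, prosociality)
print(f"r = {r:.2f}, one-sided p = {p:.3f}")
```

Note that only the upper tail of the permutation distribution counts against the null, which is what makes the test one-sided; a negative observed correlation, however large, could never yield a small p here.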
Ideally, your preregistration already contains the framework for the introduction of the final paper. Moreover, writing up the introduction now instead of post hoc forces you to think clearly about the way you arrived at the hypotheses and may uncover flaws in your reasoning that can then be corrected before data collection begins.

Part 2: Data description

Question 5: Name and describe the dataset(s), and if applicable, the subset(s) of the data you plan to use. Useful information to include here is the type of data (e.g., cross-sectional or longitudinal), the general content of the questions, and some details about the respondents. In the case of longitudinal data, information about the survey's waves is useful as well.

To answer our research questions we will use a dataset from the Wisconsin Longitudinal Study (WLS; Herd, Carr, & Roan, 2014). The WLS provides long-term data on a random sample of all the men and women who graduated from Wisconsin high schools in 1957. The WLS involves twelve waves of data. Six waves were collected from the original participants or their parents (1957, 1964, 1975, 1992, 2004, and 2011), four were collected from a selected sibling (1977, 1994, 2005, and 2011), one from the spouse of the original participant (2004), and one from the spouse of the selected sibling (2006). The questions vary across waves and are related to domains as diverse as socio-economic background, physical and mental health, and psychological makeup. We will use the subset consisting of the 1957 graduates who completed the follow-up 2003-2005 wave of the WLS dataset because it includes specific modules on religiosity and volunteering.

Like the WLS data we use in our example, many large-scale datasets are outlined in detail in an accompanying paper. It is important to cite papers like this, but also to mention the most relevant information in the preregistration so that readers do not have to search for the information themselves.
Sometimes information about the dataset is not readily available. In those cases, be especially candid with the information you have about the dataset, because the information you provide may be the only information about the data available to readers of the preregistration.

Question 6: Specify the extent to which the dataset is open or publicly available. Make note of any barriers to accessing the data, even if it is publicly available.

The dataset we will use is publicly available, but you need to formally agree to acknowledge the funding source for the Wisconsin Longitudinal Study, to cite the data release in any manuscripts, working papers, or published articles using these data, and to inform WLS about any published papers for use in the WLS bibliography and for reporting purposes. To do this you need to submit some information about yourself on the website (https://www.ssc.wisc.edu/wlsresearch/data/downloads). You will then receive an email with a download link.

It is important to check whether the data are open or publicly available to other researchers as well. For example, it could be that you have access via the organization providing the data (explain this in your answer to Q7), but that does not necessarily mean that it is publicly available to others. An example of publicly available data that is difficult to access would be data for which you need to register a profile on a website, or for which the owners of the data need to accept your request before you can have access.

Question 7: How can the data be accessed? Provide a persistent identifier or link if the data are available online, or give a description of how you obtained the dataset.
The data can be accessed by going to the following link and searching for the variables that are specified in Q12 of this preregistration: https://www.ssc.wisc.edu/wlsresearch/documentation/browse/?label=&variable=&wave_108=on&searchbutton=search

When available, report the dataset's persistent identifier (e.g., a DOI) so that the data can always be retrieved from the internet. In our example, we could only provide a link, but we added instructions for the reader to retrieve the data. In general, try to bring the reader as close to the relevant data as possible; so instead of giving the link to the overarching website, give the link to the part of the website where the data can easily be located.

Question 8: Specify the date of download and/or access for each author.

PS: downloaded 12 February 2019; accessed 12 February 2019.
JC: downloaded 3 January 2019 (estimated); accessed 12 February 2019.
We will use the data accessed by JC on 12 February 2019 for our statistical analyses.

State here for each author when the dataset was initially downloaded (e.g., for previous analyses or merely to obtain the data) and when either the metadata or the actual data (specify which) was first accessed (e.g., to identify variables of interest or to help fill out this form). Also, specify the author whose downloaded data you will use for the statistical analyses. This information is crucial in light of the reproducibility of your study because it is possible that the data have been edited since you last downloaded or accessed them. If you cannot retrieve when you downloaded or accessed the data, estimate those dates. In case you collected the data yourself to answer another research question, please state the date you first looked at the data. Finally, because not everybody will use the same date format, it is important to state the date you downloaded or accessed the data unambiguously. For example, avoid dates like 12/02/2019 and instead use 12 February 2019 or December 2nd, 2019.
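One low-effort way to keep access dates unambiguous, especially in machine-readable preregistration files, is the ISO 8601 format (YYYY-MM-DD). A small illustration using the access date from Q8:

```python
from datetime import date

# Access date from the Q8 example answer.
accessed = date(2019, 2, 12)
print(accessed.isoformat())           # ISO 8601: 2019-02-12
print(accessed.strftime("%d %B %Y"))  # written out: 12 February 2019
```

Either form is unambiguous; "12/02/2019", by contrast, reads as 12 February in one locale and as December 2nd in another.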
Question 9: If the data collection procedure is well documented, provide a link to that information. If the data collection procedure is not well documented, describe, to the best of your ability, how the data were collected.

The WLS data have been, and are still being, collected by the University of Wisconsin Survey Center for use by the research community. The origins of the WLS can be traced back to a state-sponsored questionnaire administered during the spring of 1957 at all Wisconsin high schools to students in their final year. The dataset therefore constitutes a specific sample that is not necessarily representative of the United States as a whole. Most panel members were born in 1939, and the sample is broadly representative of white, non-Hispanic American men and women who completed at least a high school education. A flowchart for the data collection can be found here: https://www.ssc.wisc.edu/wlsresearch/about/flowchart/cor459d7.pdf

While describing the data collection procedure, pay specific attention to the representativeness of the sample and to possible biases stemming from the data collection. For example, describe the population that was sampled from, whether the aim was to acquire a representative / regional / convenience sample, whether the data collectors were aware of this aim, the data collectors' recruitment efforts, the procedure for running participants, whether randomization was used, and whether participants were compensated for their time. All of this information can be used to judge whether the sample is representative of a wider population or whether the data are biased in some way, which crucially determines the conclusions that can be drawn from the results. In addition, thinking about the representativeness of a dataset is a crucial part of the planning stage of the research. For example, you might come to the conclusion that the dataset at hand is not suitable after all and opt for a different dataset, thereby preventing research waste.
Finally, it is good practice to describe which entity originally collected the data (e.g., your own lab, another lab, a multi-lab collaboration, a (national) survey collection organization, a private organization), because different data sources may have different purposes for collecting the data, which may also result in biased data.

Question 10: Some studies offer codebooks to describe their data. If such a codebook is publicly available, link to it here or upload the document. If not, provide other available documentation. Also provide guidance on which parts of the codebook or other documentation are most relevant.

The codebook for the dataset we use can be found here: https://www.ssc.wisc.edu/wlsresearch/documentation/waves/?wave=grad2k. We will mainly use questions from the mail survey about religion and spirituality and from the phone survey on volunteering, but we will also use some questions from other modules (see the answer to Q12).

Any documentation is welcome here, as readers will use it to make sense of the dataset. If applicable, provide the codebook for the entire dataset, but guide the reader to the relevant parts so they do not have to search for them extensively. Alternatively, you can create your own data dictionaries/codebooks (Arslan, 2019; Buchanan et al., 2019). If, for some reason, codebook information cannot be shared publicly, provide an explanation.

Part 3: Variables

Question 11: If you are going to use any manipulated variables, identify them here. Describe the variables and the levels or treatment arms of each variable (note that this is not applicable for observational studies and meta-analyses). If you are collapsing groups across variables, this should be explicitly stated, including the relevant formula. If your further analysis is contingent on a manipulation check, describe your decision rules here.

Not applicable.
Manipulated variables in secondary datasets usually originate from another study investigating another research question. You may therefore need to adapt the manipulated variable to answer your own research question. For example, it may be necessary to relabel or even omit one of the treatment arms. Provide a careful log of all these adaptations so that readers have a clear grasp of the variable you will be using and how it differs from the variable in the original dataset. Any resources mentioned in the answer to Q10 may be useful here as well.

Question 12: If you are going to use measured variables, identify them here. Describe both outcome measures and predictors and covariates, and label them accordingly. If you are using a scale or an index, state the construct the scale/index represents, which items the scale/index will consist of, how these items will be aggregated, and whether this aggregation is based on a recommendation from the study codebook or validation research. When the aggregation of the items is based on exploratory factor analysis (EFA) or confirmatory factor analysis (CFA), also specify the relevant details (EFA: rotation, how the number of factors will be determined, how best fit will be selected; CFA: how loadings will be specified, how fit will be assessed, which residual variance terms will be correlated). If you are using any categorical variables, state how you will code them in the statistical analyses.

Religiosity (IV): Religiosity is measured using a newly created scale with a subset of items from the religion and spirituality module of the 2004 mail survey (described here: https://www.ssc.wisc.edu/wlsresearch/documentation/waves/?wave=grad2k&module=gmail_religion). The scale includes general questions about how religious/spiritual the individual is and how important religion/spirituality is to them.
Importantly, the questions are not specific to a particular denomination and use the same response scale. The specific variables are as follows:

1. il001rer: How religious are you?
2. il002rer: How spiritual are you?
3. il003rer: How important is religion in your life?
4. il004rer: How important is spirituality in your life?
5. il005rer: How important was it, or would it have been if you had children, to send your children for religious or spiritual instruction?
6. il006rer: How closely do you identify with being a member of a religious group?
7. il007rer: How important is it for you to be with other people who are the same religion as you?
8. il008rer: How important do you think it is for people of your religion to marry other people who are the same religion?
9. il009rer: How strongly do you believe that one should stick to a particular faith?
10. il010rer: How important was religion in your home when you were growing up?
11. il011rer: When you have important decisions to make in your life, how much do you rely on your religious or spiritual beliefs?
12. il012rer: How much would your spiritual or religious beliefs influence your medical decisions if you were to become gravely ill?

The levels of all of these variables are indicated by a Likert scale with the following options: (1) not at all; (2) not very; (3) somewhat; (4) very; (5) extremely, as well as 'system missing' (the participant did not provide an answer) and 'refused' (the participant refused to answer the question). Variables il006rer, il008rer, and il012rer additionally include the option 'don't know' (the participant stated that they did not know how to answer the question). We will use the average score (after omitting non-numeric and 'don't know' responses) on the twelve variables as a measure of religiosity. We constructed this average score ourselves; it was not already part of the dataset.
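The scale construction described above (averaging the twelve items after dropping non-numeric and 'don't know' responses) can be sketched as follows. The response dictionary is hypothetical toy data for one participant, not actual WLS records:

```python
# Toy responses to the twelve il0..rer items (hypothetical, for illustration).
# Numeric codes: 1 = not at all ... 5 = extremely; non-numeric codes are dropped.
responses = {
    "il001rer": 4, "il002rer": 3, "il003rer": 5, "il004rer": 4,
    "il005rer": "refused", "il006rer": "don't know", "il007rer": 2,
    "il008rer": 3, "il009rer": 4, "il010rer": 5, "il011rer": 3,
    "il012rer": 4,
}

def religiosity_score(responses):
    """Average the numeric responses, omitting refusals, system-missing
    values, and 'don't know' answers, as specified in the preregistration."""
    numeric = [v for v in responses.values() if isinstance(v, int)]
    return sum(numeric) / len(numeric) if numeric else None

score = religiosity_score(responses)  # mean of the ten numeric answers: 3.7
```

A participant with no numeric answers at all yields `None`, i.e., a missing scale score rather than a misleading zero.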
Prosociality (DV): In line with previous research (Konrath, Fuhrel-Forbis, Lou, & Brown, 2012), we will use three measures of prosociality that capture three aspects of engagement in other-oriented activities (see Brookfield, Parry, & Bolton, 2018 for the link between prosociality and volunteering). The prosociality variables come from the volunteering module of the 2004 phone survey. The codebook of that module can be found here: https://www.ssc.wisc.edu/wlsresearch/documentation/waves/?wave=grad2k&module=gvol. The three measures of prosociality we will use are:

1. gv103re: Did the graduate do volunteer work in the last 12 months? This dichotomous variable assesses whether or not the participant has engaged in any volunteering activities in the last 12 months. The levels of this variable are yes/no; 'yes' will be coded as 1, 'no' will be coded as 0.
2. gv109re: Number of the graduate's other volunteer activities in the past 12 months. This variable is a summary index providing a quantitative measure of the participant's volunteering activities. Scores on this variable range from 1 to 5 and reflect the number of the previous five questions to which the participant answered 'yes'. The previous five questions assess whether or not the participant volunteered at any of the following organization types: (1) religious organizations; (2) school or educational organizations; (3) political groups or labor unions; (4) senior citizen groups or related organizations; (5) other national or local organizations. For each of these questions the answer 'yes' is coded as 1 and the answer 'no' is coded as 0.
3. gv111re: How many hours did the graduate volunteer during a typical month in the last 12 months? This numerical variable provides information on how many hours per month, on average, the participant volunteered.

The three variables will be treated as separate measures; they are already part of the dataset and do not require manual aggregation.
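The coding rules just described (yes/no mapped to 1/0, and a count of 'yes' answers across the five organization-type questions) can be sketched like this; the answers shown are invented for illustration, not WLS data:

```python
def code_volunteered(answer):
    """gv103re-style coding: 'yes' is coded 1, 'no' is coded 0,
    as specified in the preregistration."""
    return {"yes": 1, "no": 0}[answer.lower()]

def volunteer_activity_index(org_answers):
    """gv109re-style summary index: the number of the five organization-type
    questions (religious, school, political, senior, other) answered 'yes'."""
    return sum(code_volunteered(a) for a in org_answers)

# Hypothetical participant: volunteered, at a religious org and a school.
did_volunteer = code_volunteered("yes")                             # 1
index = volunteer_activity_index(["yes", "yes", "no", "no", "no"])  # 2
```

In the WLS release these variables are already coded, so this mapping only documents what the released codes mean.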
Number of siblings (covariate): We will include the participant's number of siblings as a control variable because many religious families are large (Pew Research Center, 2015), and it can be argued that cooperation and trust arise more naturally in larger families because of the larger number of social interactions in those families. To measure participants' number of siblings we use the variable gk067ss: the total number of siblings ever born, from the 2004 phone survey siblings module (see: https://www.ssc.wisc.edu/wlsresearch/documentation/waves/?wave=grad2k&module=gsib). This is a numerical variable with the possibility for the participant to state 'I don't know'. At the interview, participants were instructed to include "siblings born alive but no longer living, as well as those alive now and to include step-brothers and step-sisters and children adopted by their parents."

Agreeableness (covariate): We will include the summary score for agreeableness (ih009rec, see https://www.ssc.wisc.edu/wlsresearch/documentation/waves/?wave=grad2k&module=gmail_values) in the analysis as a control variable because a previous study we were involved in (on the same dataset, see the answer to Q18) showed a positive association between agreeableness and prosociality. Because previous research also indicates a positive association between agreeableness and religiosity (Saroglou, 2002), we need to include agreeableness as a control variable to disentangle the influence of religiosity on prosociality from the influence of agreeableness on prosociality. The variable ih009rec is a sum score of the variables ih003rer–ih008rer (to what extent do you agree that you see yourself as someone who is talkative / is reserved [reverse coded] / is full of energy / tends to be quiet [reverse coded] / is sometimes shy or inhibited [reverse coded] / generates a lot of enthusiasm).
All of these items were scored from 1 to 6 (1 = "agree strongly", 2 = "agree moderately", 3 = "agree slightly", 4 = "disagree slightly", 5 = "disagree moderately", 6 = "disagree strongly"), and participants could also refuse to answer a question. If a participant refused to answer one of the questions, that participant's score was not included in the sum score variable ih009rec.

If you are using measured variables, describe them in such a way that readers know exactly which variables will be used in the statistical analyses. Because secondary datasets often involve many measured variables, there is ample room to select variables after doing an analysis. It is therefore essential to be exhaustive here: variables you do not mention here should not pop up in your analysis later unless you have a good reason. As you can see, we clearly label the function of each variable, the specific items related to that variable, and the items' response options. It could be that you choose to combine items into an index or scale in a way that has not been done in previous studies. Carefully detail this process and indicate that you constructed the index or scale yourself to avoid confusion. Finally, note that we include covariates to be able to make statements about the causal effect of religion on prosociality. This is common practice in the social sciences, but causal inference is complex and there may be better solutions in other situations, and even in this one. See Rohrer (2018) for more information about causation in observational data.

Question 13: Which units of analysis (respondents, cases, etc.) will be included or excluded in your study? Taking these inclusion/exclusion criteria into account, indicate the (expected) sample size of the data you will be using for your statistical analyses, to the best of your knowledge.
In the next few questions, you will be asked to refine this sample size estimation based on your judgments about missing data and outliers.

Initially, the WLS consisted of 10,317 participants. As we are not interested in a specific group of Wisconsin people, we will not exclude any participants from our analyses. However, only 7,265 participants filled out the questions on prosociality and the number of siblings in the phone survey, and only 6,845 filled out the religiosity items in the mail survey (Herd et al., 2014). This corresponds to response rates of 73% and 69%, respectively. Because we do not know whether the participants who completed the mail survey also completed the phone survey, our minimum expected sample size is 10,317 × 0.73 × 0.69 = 5,197.

Provide information on the total sample size of the dataset, the sample size(s) of the wave(s) you are going to use (if applicable), and the number of participants who provided data on each of the questions and/or scales to be used in the data analyses. In our example we do not exclude any participants, but if you have a research question about a certain group, you may need to exclude participants based on one or more characteristics. Be very specific when describing these characteristics, so that readers with no knowledge of the data can easily retrace your steps. For our WLS dataset, it is impossible to know the exact sample size without inspecting the data. If that is the case, provide an estimate. If you provide an estimate, try to be conservative and pick the lowest sample size of the possible options. If it is impossible to provide an estimate, it is also possible to mask the data. For example, you could add random noise to all values of the dependent variable: it then becomes impossible to pick up any real effects, and you are essentially blind to the data.
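The conservative sample-size estimate in the example answer above is just the product of the initial panel size and the two response rates, so the computation is easy to make explicit (all numbers are taken from the text):

```python
# Conservative minimum expected sample size for the WLS example:
# initial panel size times the phone- and mail-survey response rates.
initial_n = 10_317
phone_rate = 0.73  # prosociality and number-of-siblings items (phone survey)
mail_rate = 0.69   # religiosity items (mail survey)

# Because the overlap between the two responder groups is unknown,
# the product of the rates is used as the minimum expected sample size.
expected_min_n = round(initial_n * phone_rate * mail_rate)  # 5197
```

Note that the product rounds to 5,197, which is why the estimate in the answer is only a lower bound pending inspection of the actual overlap.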
Similarly, it is possible to blind yourself to real effects in the data by having someone relabel the treatment levels so that you can no longer link them to the original levels. These and other methods of data blinding are clearly described by Dutilh, Sarafoglou, and Wagenmakers (2019).

Question 14: What do you know about missing data in the dataset (i.e., overall missingness rate, information about differential dropout)? How will you deal with incomplete or missing data? Based on this information, provide a new expected sample size.

The WLS provides a documented set of missing codes. In Table 1 below, you can find missingness information for every variable we will include in the statistical analyses. 'System missing' refers to the number of participants who did not or could not complete the questionnaire. 'Partial interview' refers to the number of participants who did not get that particular question because they were only partially interviewed. The rest of the codes are self-explanatory. Importantly, some respondents refused to answer the religiosity questions. These respondents presumably felt strongly about these questions, which could indicate that they are either very religious or very anti-religious. If that is the case, a respondent's propensity to respond is directly associated with their level of religiosity, and the data are missing not at random (MNAR). Because it is not possible to test the stringent assumptions of the modern techniques for handling MNAR data, we will resort to simple listwise deletion. It must be noted that this may bias our data, as we may lose respondents who are very religious or very anti-religious.
However, we believe this bias to be relatively harmless, given that our sample still includes many respondents who provided extreme responses to the items about the importance of the different facets of religion (see: https://www.ssc.wisc.edu/wlsresearch/documentation/waves/?wave=grad2k&module=gmail_religion). Moreover, because our initial sample size is very large, statistical power is not substantially compromised by omitting these respondents. That being said, we will extensively discuss any potential biases resulting from missing data in the limitations section of our paper. Employing listwise deletion leads to an expected minimum of 10,317 × 0.30 × 0.70 × 0.64 = 1,387 participants for the binary logistic regression, and expected minimums of 10,317 × 0.24 × 0.70 × 0.64 = 1,109 (gv109re) and 10,317 × 0.23 × 0.70 × 0.64 = 1,063 (gv111re) participants for the linear regressions.

Provide descriptive information, if available, on the amount of missing data for each variable you will use in the statistical analyses, and discuss potential issues with the pattern of missing data for your planned analyses. Also provide a plan for how the analyses will take the presence of missing data into account. Where appropriate, provide specific details on how this plan will be implemented, for instance by specifying a step-by-step protocol for how you will impute any missing data. You could first explain how you will assess whether the data are missing at random (MAR), missing completely at random (MCAR), or missing not at random (MNAR), and then state that you will use technique X in case of MAR data, technique Y in case of MCAR data, and technique Z in case of MNAR data. For an overview of the types of missing data and the different techniques to handle missing data, see Lang and Little (2018). Note that the missing data technique we used in our example, listwise deletion, is usually not the best way to handle missing data.
We decided to use it in this example because it gave us the opportunity to illustrate how researchers can describe potential biases arising from their analysis methods in a preregistration. If you cannot specify the exact amount of missing data because the dataset does not provide that information, provide an estimate. If you provide an estimate, try to be conservative and pick the lowest sample size of the possible options. If it is impossible to provide an estimate, you could also mask the data (see Dutilh, Sarafoglou, & Wagenmakers, 2019). It is good practice to state all missingness information in relation to the total sample size of the dataset.

Table 1. An overview of the missing values for all variables we will use in our analyses.

Variable | System missing | Don't know | Inappropriate | Refused | Not ascertained | Partial interview | Could not code | Remaining | Remaining (%)
il001rer | 3,471 | 0 | 0 | 190 | 0 | 0 | 0 | 6,656 | 64
il002rer | 3,471 | 0 | 0 | 212 | 0 | 0 | 0 | 6,634 | 64
il003rer | 3,471 | 0 | 0 | 191 | 0 | 0 | 0 | 6,655 | 65
il004rer | 3,471 | 0 | 0 | 241 | 0 | 0 | 0 | 6,605 | 64
il005rer | 3,471 | 0 | 0 | 201 | 0 | 0 | 0 | 6,645 | 64
il006rer | 3,471 | 1 | 0 | 201 | 0 | 0 | 0 | 6,644 | 64
il007rer | 3,471 | 0 | 0 | 192 | 0 | 0 | 0 | 6,654 | 65
il008rer | 3,471 | 1 | 0 | 199 | 0 | 0 | 0 | 6,646 | 64
il009rer | 3,471 | 0 | 0 | 219 | 0 | 0 | 0 | 6,627 | 64
il010rer | 3,471 | 0 | 0 | 190 | 0 | 0 | 0 | 6,656 | 65
il011rer | 3,471 | 0 | 0 | 190 | 0 | 0 | 0 | 6,656 | 65
il012rer | 3,471 | 1 | 0 | 198 | 0 | 0 | 0 | 6,647 | 64
gv103re | 3,052 | 0 | 3,955 | 1 | 0 | 182 | 0 | 3,127 | 30
gv109re | 3,052 | 0 | 4,590 | 0 | 0 | 182 | 0 | 2,493 | 24
gv111re | 3,052 | 50 | 4,716 | 0 | 0 | 182 | 0 | 2,317 | 23
gk067ss | 3,052 | 21 | 0 | 0 | 0 | 0 | 0 | 7,244 | 70

Question 15: If you plan to remove outliers, how will you define what a statistical outlier is in your data? Please also provide a new expected sample size. Note that this will be the definitive expected sample size for your study, and you will use this number to do any power analyses.
The dataset probably does not involve any invalid data, since it has previously been 'cleaned' by the WLS data controllers and any clearly unreasonable values have been removed. However, to be sure, we will create a box-and-whisker plot for all continuous variables (the dependent variables gv109re and gv111re, the covariate gk067ss, and the scale for religiosity) and remove any data point that lies more than 1.5 times the IQR below the 25th percentile or above the 75th percentile. Assuming normally distributed data, we expect that 2.1% of the data points will be removed this way, leaving 1,358 of 1,387 participants for the binary regression with gv103re as the outcome variable, and 1,086 of 1,109 and 1,041 of 1,063 participants for the linear regressions with gv109re and gv111re as the outcome variables, respectively.

Estimate the number of outliers you expect for each variable and calculate the expected sample size of your analysis. The expected sample size is required to do a power analysis for the planned statistical tests (Q21), and estimating it in advance also prevents you from discarding a significant portion of the data during or after the statistical analysis. If it is impossible to provide such an estimate, you can mask the data and make a more informed estimation based on the masked data (see Dutilh, Sarafoglou, & Wagenmakers, 2019). If you expect to remove many outliers, or if you are unsure about your outlier handling strategy, it is good practice to preregister analyses both including and excluding outliers. To see how decisions about outliers can influence the results of a study, see Bakker and Wicherts (2014) and Lonsdorf et al. (2019). For more information about outliers in the context of preregistration, see Leys, Delacre, Mora, Lakens, and Ley (2019).

Question 16: Are there sampling weights available with this dataset? If so, are you using them, or are you using your own sampling weights?
The WLS dataset does not include sampling weights, and we will not use our own sampling weights, as we do not seek to make claims that generalize to the national population.

Because secondary data samples may not be entirely representative of the population you are interested in, it can be useful to incorporate sampling weights into your analysis. State here whether (and why) you will use sampling weights, and provide specifics on exactly how you will use them. To implement sampling weights in your analyses, we recommend the "survey" package in R (Lumley, 2004).

Part 4: Knowledge of Data

Question 17: List the publications, working papers (in preparation, unpublished, preprints), and conference presentations (talks, posters) you have worked on that are based on the dataset you will use. For each work, list the variables you analyzed, but limit yourself to variables that are relevant to the proposed analysis. If the dataset is longitudinal, also state which wave of the dataset you analyzed. Importantly, some of your team members may have used this dataset while others have not, so specify the previous works for every co-author separately. Also mention relevant work on this dataset by researchers you are affiliated with, as their knowledge of the data may have spilled over to you. When the provider of the data maintains an overview of all the work that has been done using the dataset, link to that overview.

Both authors (PS and JC) have previously used the graduates 2003-2005 wave to assess the link between Big Five personality traits and prosociality. The variables we used to measure the Big Five personality traits were ih001rei (extraversion), ih009rei (agreeableness), ih017rei (conscientiousness), ih025rei (neuroticism), and ih032rei (openness).
The variables we used to measure prosociality were ih013rer ("To what extent do you agree that you see yourself as someone who is generally trusting?"), ih015rer ("To what extent do you agree that you see yourself as someone who is considerate to almost everyone?"), and ih016rer ("To what extent do you agree that you see yourself as someone who likes to cooperate with others?"). We presented the results at the ARP conference in St. Louis in 2013, and we are currently finalizing a manuscript based on these results. Additionally, a senior graduate student in JC's lab used the graduates 2011 wave for exploratory analyses on depression. She linked depression to alcohol use and general health indicators; she did not look at variables related to religiosity or prosociality, and her results have not yet been submitted anywhere. An overview of all publications based on the WLS data can be found here: https://www.ssc.wisc.edu/wlsresearch/publications/pubs.php?topic=all.

It is important to specify the different ways you have previously used the data because this information helps you to establish any knowledge of the data you may already have. This prior knowledge will need to be provided in Q18. If available, include persistent identifiers (e.g., a DOI) for any relevant papers and presentations. Understandably, there is subjectivity involved in determining what constitutes "relevant" work or "relevant" variables for the proposed analysis. We advise researchers to use their professional judgment and, when in doubt, to always mention the work or variable so readers can assess its relevance themselves. In the worked example, the exploratory analysis by the student in JC's lab is probably not relevant, but because of the close affiliation of the student with JC, it is good to include it anyway.
Listing previous works based on the data also helps to prevent a practice identified by the American Psychological Association (2019) as unethical: the so-called "least publishable unit" practice (also known as "salami slicing"), in which researchers publish multiple papers on closely related variables from the same dataset. Given that secondary datasets often involve many closely related variables, this is a particularly pernicious issue here.

Question 18: What prior knowledge do you have about the dataset that may be relevant to the proposed analysis? Your prior knowledge could stem from working with the data first-hand, from reading previously published research, or from codebooks. Also provide any relevant knowledge of subsets of the data you will not be using. Provide prior knowledge for every author separately.

In a previous study (mentioned in Q17) we used three prosociality variables (ih013rer, ih015rer, and ih016rer) that may be related to the prosociality variables we use in this study. We found that ih013rer, ih015rer, and ih016rer are positively associated with agreeableness (ih009rec). Because previous research (on other datasets) shows a positive association between agreeableness and religiosity (Saroglou, 2002), agreeableness may act as a confounding variable. To account for this, we will include agreeableness in our analysis as a control variable. We did not find any associations between prosociality and the other Big Five variables.

It is important to document your prior knowledge diligently because it provides information about possible biases in your statistical analysis decisions. For example, you may have learned at an academic conference, or in a footnote of another paper, that the correlation between two variables is high in this dataset. If you then test this hypothesis, you already know the test result, making the interpretation of the test invalid (Wagenmakers et al., 2012).
In cases like this, where you have direct knowledge about a hypothesized association, you should forgo a confirmatory analysis altogether or base one on a different dataset. Indirect knowledge about the hypothesized association does not preclude a confirmatory analysis, but it should be transparently reported in this section. In our example, we mentioned that we know about the positive association between agreeableness and prosociality, which may say something about the direction of our hypothesized association given the association between agreeableness and religiosity. Moreover, this prior knowledge prompted us to add agreeableness as a control variable. Thus, aside from improving your preregistration, evaluating your prior knowledge of the data can also improve the analyses themselves. All information that may influence the hypothesized association is relevant here. For example, restriction of range (Meade, 2010), measurement reliability (Silver, 2008), and the number of response options (Gradstein, 1986) have been shown to influence the association between two variables. You may have provided univariate information on these aspects in previous questions; in this section, you can write about how they may affect your hypothesized association. Do note that it is unlikely you will be able to account for all the effects of prior knowledge on your analytical decisions. For example, you may have prior knowledge that you are not consciously aware of. The best way to capture this unconscious prior knowledge is to revisit previous work, think deeply about any information that might be relevant to the current project, and present it here to the best of your ability.
This exercise helps you reflect on potential biases you may have, and it makes it possible for readers of the preregistration to assess whether the prior knowledge you mention is plausible given the list of prior work you provided in Q17. Of course, it is still possible that researchers purposefully neglect to mention prior knowledge or provide false information in a preregistration. Even though we believe that deliberate deceit like this is rare, at the end of our template we require researchers to formally "promise" that they have truthfully filled out the template and that no other preregistration exists on the same hypotheses and data. A violation of this formal statement can be seen as misconduct, and we believe researchers are unlikely to cross that line.

Part 5: Analyses

Question 19: For each hypothesis, describe the statistical model you will use to test the hypothesis. Include the type of model (e.g., ANOVA, multiple regression, SEM) and the specification of the model. Specify any interactions and post hoc analyses, and remember that any test not included here must be labeled as an exploratory test in the final paper.

Our first hypothesis will be tested using three analyses, since we use three variables to measure prosociality. For each, we will run a directional null hypothesis significance test to assess whether religiosity has a positive effect on prosociality. For the first outcome (gv103re: Did the graduate do volunteer work in the last 12 months?) we will run a logistic regression with religiosity, the number of siblings, and agreeableness as predictors. For the second and third outcomes (gv109re: number of the graduate's other volunteer activities in the past 12 months; gv111re: How many hours did the graduate volunteer during a typical month in the last 12 months?) we will run two separate linear regressions with religiosity, the number of siblings, and agreeableness as predictors.
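As an illustrative sketch of a model of this form (the authors' actual analysis script is archived on the OSF, and the data below are simulated toy values, not WLS records), one of the linear regressions with the same predictor set could be fit by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Toy stand-ins for the WLS variables (hypothetical distributions).
religiosity = rng.normal(3, 1, n)
siblings = rng.poisson(3, n).astype(float)
agreeableness = rng.normal(20, 4, n)
prosociality = (0.5 * religiosity + 0.1 * siblings
                + 0.05 * agreeableness + rng.normal(0, 1, n))

# Design matrix with intercept; same predictor set as the preregistered models.
X = np.column_stack([np.ones(n), religiosity, siblings, agreeableness])
coef, *_ = np.linalg.lstsq(X, prosociality, rcond=None)

# The preregistered test is directional: H1 is that this coefficient is > 0.
b_religiosity = coef[1]
```

The logistic regression for gv103re would follow the same recipe with a binary outcome and a logit link, for which a dedicated routine (e.g., from statsmodels) would normally be used.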
the code we will use for all these analyses can be found at https://osf.io/e3htr. think carefully about the variety of statistical methods that are available for testing each of your hypotheses. one of the classic “questionable research practices” is trying multiple methods and only publishing the ones that “work” (i.e., that support your hypothesis). almost every method has several options that may be more or less suited to the question you are asking. therefore, it is crucial to specify a priori which one you are going to use and how. if you can, include the code you will use to run your statistical analyses, as this forces you to think about your analyses in detail and makes it easy for readers to see exactly what you plan to do. ideally, when you have loaded the data in a software program you only have to press one button to run your analyses. if including the code is impossible, describe the analyses such that you could give a positive answer to the question: “would a colleague who is not involved in this project be able to recreate this statistical analysis?” question 20: if applicable, specify a predicted effect size or a minimum effect size of interest for all the effects tested in your statistical analyses. for the logistic regression with ‘did the graduate do volunteer work in the last 12 months?’ as the outcome variable, our minimum effect size of interest is an odds ratio of 1.05. this means that a one-unit increase on the religiosity scale would be associated with a 1.05 factor change in the odds of having done volunteering work in the last 12 months versus not having done so. for the linear regressions with ‘the number of graduate’s volunteer activities in the last 12 months’ and ‘how many hours did the graduate volunteer during a typical month in the last 12 months?’ as the outcome variables, the minimum regression coefficients of interest of the religiosity variable are 0.05 and 0.5, respectively. 
this means that a one-unit increase in the religiosity scale would be associated with 0.05 extra volunteering activities in the last 12 months and with 0.5 more hours of volunteering work in the last 12 months. all of these smallest effect sizes of interest are based on our own intuition. to make comparisons possible between the effects in our study and similar effects in other studies, the unstandardized linear regression coefficients will be transformed into standardized regression coefficients using the following formula: β_i = b_i(s_i / s_y), where b_i is the unstandardized regression coefficient of independent variable i, and s_i and s_y are the standard deviations of the independent and dependent variable, respectively. comment(s): a predicted effect size is ideally based on a representative preliminary study or meta-analytical result. if those are not available, it is also possible to use your own intuition. for advice on setting a minimum effect size of interest, see lakens, scheel, and isager (2018) and funder and ozer (2019). question 21: present the statistical power available to detect the predicted effect size(s) or the smallest effect size(s) of interest, or present the accuracy that will be obtained for estimation. use the sample size after updating for missing data and outliers, and justify the assumptions and parameters used (e.g., give an explanation of why anything smaller than the smallest effect size of interest would be theoretically or practically unimportant). the sample size after updating for missing data and outliers is 1,358 for the logistic regression with gv103re as the outcome variable, and 1,086 and 1,041 for the linear regressions with gv109re and gv111re as the outcome variables, respectively. for all three analyses this corresponds to a statistical power of approximately 1.00 when assuming our minimum effect sizes of interest. 
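a power figure like this can be checked by simulation. the following hedged python sketch mirrors the assumptions stated in the text for the gv111re regression (slope 0.5, residual variance 1.0, n = 1,041, one-sided alpha .01); it is an illustration, not the authors' g*power or r analysis, and it uses a large-sample normal approximation in place of the exact t distribution.

```python
import math
import numpy as np

# hedged simulation sketch of the power calculation: simulate many
# datasets under the minimum effect of interest, test the slope each
# time, and count the proportion of significant results.
rng = np.random.default_rng(42)

def simulated_power(n=1041, slope=0.5, alpha=0.01, n_sims=300):
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, n)                 # standardized predictor
        y = slope * x + rng.normal(0.0, 1.0, n)     # residual variance 1.0
        xc = x - x.mean()
        b = (xc @ y) / (xc @ xc)                    # ols slope estimate
        resid = y - y.mean() - b * xc
        se = math.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
        # one-sided p-value via the normal approximation to the t test
        p_one_sided = 0.5 * math.erfc((b / se) / math.sqrt(2.0))
        hits += p_one_sided < alpha
    return hits / n_sims

power = simulated_power()
```

with these inputs the slope's t statistic is roughly 16, so the simulated power is essentially 1.00, consistent with the value reported above.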
for the linear regressions we additionally assumed the variance explained by the predictor to be 0.2 and the residual variance to be 1.0 (see the figure below for the full power analysis of the regression with the lowest sample size). for the logistic regression we assumed an intercept of -1.56, corresponding to a situation where half of the participants have done volunteer work in the last year (see the r code for the full power analysis at https://osf.io/f96rn). advice on conducting a power analysis using g*power can be found in faul, erdfelder, buchner, and lang (2009). advice on conducting a power analysis using r can be found here: cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html. note that power analyses for secondary data analyses are unlike power analyses for primary data analyses because we already have a good idea about what our sample size is based on our answers to q13, q14, and q15. therefore, we are primarily interested in finding out what effect sizes we are able to find for a given power level or what our power is given our minimum effect size of interest. in our example, we chose the second option. when presenting your power analysis, be sure to state the version of g*power, r, or any other tool you calculated power with, including any packages or add-ons, and also report or copy all the input and results of the power analysis. question 22: what criteria will you use to make inferences? describe the information you will use (e.g., specify the p-values, effect sizes, confidence intervals, bayes factors, specific model fit indices), as well as cut-off criteria, where appropriate. will you be using one- or two-tailed tests for each of your analyses? if you are comparing multiple conditions or testing multiple hypotheses, will you account for this, and if so, how? 
we will make inferences about the association between religiosity and prosociality based on the p-values and the size of the regression coefficients of the religiosity variable in the three main regressions. we will conclude that a regression analysis supports our hypothesis if both the p-value is smaller than .01 and the regression coefficient is larger than our minimum effect size of interest. we chose an alpha of .01 to account for the fact that we do a test for each of the three regressions (0.05/3, rounded down). if the conditions above hold for all three regressions, we will conclude that our hypothesis is fully supported; if they hold for one or two of the regressions, we will conclude that our hypothesis is partially supported; and if they hold for none of the regressions, we will conclude that our hypothesis is not supported. it is crucial to specify your inference criteria before running a statistical analysis because researchers have a tendency to move the goalposts when making inferences. for example, almost 40% of p-values between 0.05 and 0.10 are reported as “marginally significant”, even though these values are not significant when compared to the traditional alpha level of 0.05, and the evidential value of these p-values is low (olsson-collentine, van assen, & hartgerink, 2019). similarly, several studies have found that the majority of studies reporting p-values do not use any correction for multiple comparisons (cristea & ioannidis, 2018; wason, stecher, & mander, 2014), perhaps because this lowers the chance of finding a statistically significant result. for an overview of multiple-comparison correction methods relevant to secondary data analysis, see thompson, wright, bissett, and poldrack (2019). question 23: what will you do should your data violate assumptions, your model not converge, or some other analytic problem arise? 
when (a) the distribution of the number of volunteering hours (gv111re) is significantly non-normal according to the kolmogorov-smirnov test (massey, 1951), and/or (b) the linearity assumption is violated (i.e., the points are asymmetrically distributed around the diagonal line when plotting observed versus predicted values), we will log-transform the variable. it is, of course, impossible to predict every single way that things might go awry during the analysis. one of the variables may have a strange and unexpected distribution, one of the models may not converge because of a quirk of the correlational structure, and you may even encounter error messages that you have never seen before. you can use your prior knowledge of the dataset to set up a decision tree specifying possible problems that might arise and how you will address them in the analyses. thinking through such a decision tree will leave you less overwhelmed when something does end up going differently than expected. however, note that decision trees come with their own problems and can quickly become very complex. alternatively, you might choose to select analysis methods that make assumptions that are as conservative as possible; preregister robustness analyses that test the sensitivity of your findings to analysis strategies that make different assumptions; and/or pre-specify a single primary analysis strategy but note that you will also report an exploratory investigation of the validity of distributional assumptions (williams & albers, 2019). of course, there are pros and cons to all methods of dealing with violations, and you should choose a technique that is most appropriate for your study. question 24: provide a series of decisions about evaluating the strength, reliability, or robustness of your focal hypothesis test. 
this may include within-study replication attempts, additional covariates, cross-validation efforts (out-of-sample replication, split/hold-out sample), applying weights, selectively applying constraints in an sem context (e.g., comparing model fit statistics), overfitting adjustment techniques used (e.g., regularization approaches such as ridge regression), or some other simulation/sampling/bootstrapping method. to assess the sensitivity of our results to our selection criterion for outliers, we will run an additional analysis without removing any outliers. there are many methods you can use to test the limits of your hypothesis. the options mentioned in the question are not supposed to be exhaustive or prescriptive. we included these examples to encourage researchers to think about these methods, all of which serve the same purpose as preregistration: improving the robustness and replicability of the results. question 25: if you plan to explore your dataset to look for unexpected differences or relationships, describe those tests here, or add them to the final paper under a heading that clearly differentiates this exploratory part of your study from the confirmatory part. as an exploratory analysis, we will test the relationship between scores on the religiosity scale and prosociality after adjusting for a variety of social, educational, and cognitive covariates that are available in the dataset. we have no specific hypotheses about which covariates will attenuate the religiosity-prosociality relation most substantially, but we will use this exploratory analysis to generate hypotheses to test in other, independent datasets. whereas it is not presently the norm to preregister exploratory analyses, it is often good to be clear about which variables will be explored (if any), for example, to differentiate these from the variables for which you have specific predictions or to plan ahead about how to compute these variables. 
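the sensitivity analysis described under question 24 (rerunning the analysis without outlier removal) can be sketched as follows. this is an illustrative python stand-in, not the authors' code: the data are simulated, and the |z| > 3 flag below is a hypothetical outlier rule standing in for whatever criterion the preregistration defined earlier.

```python
import numpy as np

# hedged sketch of an outlier sensitivity check: estimate the focal
# slope with every observation kept, and again after removing points
# flagged by a simple (illustrative) |z| > 3 rule on the outcome.
rng = np.random.default_rng(7)
n = 300
x = rng.normal(0.0, 1.0, n)
y = 0.4 * x + rng.normal(0.0, 1.0, n)   # invented true slope of 0.4
y[:3] = [9.0, -8.0, 10.0]               # inject gross outliers to remove

def slope(x, y):
    """ols slope of y on x."""
    xc = x - x.mean()
    return float((xc @ (y - y.mean())) / (xc @ xc))

z = np.abs((y - y.mean()) / y.std())
keep = z < 3
b_all = slope(x, y)                     # analysis keeping every observation
b_trimmed = slope(x[keep], y[keep])     # analysis after outlier removal
```

reporting both estimates side by side shows readers how much the conclusion hinges on the outlier criterion, which is exactly what the question asks you to plan for.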
part 6: statement of integrity the authors of this preregistration state that they filled out this preregistration to the best of their knowledge and that no other preregistration exists pertaining to the same hypotheses and dataset. summary in this tutorial we presented a preregistration template for the analysis of secondary data and have provided guidance for its effective use. we are aware that the number of questions (25) in the template may be overwhelming, but it is important to note that not every question is relevant for every preregistration. our aim was to be inclusive and cover all bases in light of the diversity of secondary data analyses. even though none of the questions are mandatory, we do believe that an elaborate preregistration is preferable to a concise preregistration simply because it restricts more researcher degrees of freedom. we therefore recommend that authors answer as many questions in as much detail as possible. and, if questions are not applicable, it would be good practice to also specify why this is the case so that readers can assess your reasoning. effectively preregistering a study is challenging and can take a lot of time but, like nosek et al. (2019) and many others, we believe it can improve the interpretability, verifiability, and rigor of your studies and is therefore more than worth it if you want both yourself and others to have more confidence in your research findings. the current template is merely one building block toward a more effective preregistration infrastructure and, given the ongoing developments in this area, will be a work in progress for the foreseeable future. any feedback is therefore greatly appreciated. please send any feedback to the corresponding author, olmo van den akker (ovdakker@gmail.com). author contact correspondence concerning this article should be addressed to olmo van den akker, e-mail: ovdakker@gmail.com. 
https://orcid.org/0000-0002-0712-3746 acknowledgments the authors would like to thank the participants of the society for the improvement of psychological science (sips) conference in 2018 who helped to create the first draft of the preregistration template but were unable to help out with the subsequent extensions (brian brown, oliver clark, charles ebersole, and courtney soderberg). conflict of interest and funding this research is based on data from the wisconsin longitudinal study, funded by the national institute on aging (r01 ag009775; r01 ag033285), and was supported by a consolidator grant (improve) from the european research council (erc; grant no. 726361). author contributions conceptualization: oa, sw, mb writing (original draft): oa, lc, wc, rd, pdk, ah, jk, ek, jo, sr, kdv, av, mb writing (reviewing and editing): oa, sw, lc, wc, rd, pdk, ah, jk, ek, jo, sr, kdv, av, mb project administration: oa open science practices this article is a tutorial that did not include any data or analyses. it was not pre-registered. therefore, it did not receive any badges. the entire editorial process, including the open reviews, is published in the online supplement. references american psychological association. (2019). publication practices & responsible authorship. retrieved from https://www.apa.org/research/responsible/publication arslan, r. c. (2019). how to automatically document data with the codebook package to facilitate data reuse. advances in methods and practices in psychological science. https://doi.org/10.1177/2515245919838783 bakker, m., & wicherts, j. m. (2014). outlier removal, sum scores, and the inflation of the type i error rate in independent samples t tests: the power of alternatives and recommendations. psychological methods, 19(3), 409. https://doi.org/10.1037/met0000014 bowman, s. d., dehaven, a. c., errington, t. m., hardwicke, t. e., mellor, d. t., nosek, b. 
a., & soderberg, c. k. (2016, january 1). osf prereg template. https://doi.org/10.31222/osf.io/epgjd brookfield, k., parry, j., & bolton, v. (2018). getting the measure of prosocial behaviors: a comparison of participation and volunteering data in the national child development study and the linked social participation and identity study. nonprofit and voluntary sector quarterly, 47(5), 1081-1101. https://doi.org/10.1177/0899764018786470 buchanan, e. m., crain, s. e., cunningham, a. l., johnson, h. r., stash, h. e., papadatou-pastou, m., …, & aczel, b. (2019, may 20). getting started creating data dictionaries: how to create a shareable dataset. https://doi.org/10.31219/osf.io/vd4y3 cheng, h. g., & phillips, m. r. (2014). secondary analysis of existing data: opportunities and implementation. shanghai archives of psychiatry, 26(6), 371-375. https://doi.org/10.11919/j.issn.1002-0829.214171 claesen, a., gomes, s. l. b. t., tuerlinckx, f., & vanpaemel, w. (2019, may 9). preregistration: comparing dream to reality. https://doi.org/10.31234/osf.io/d8wex cristea, i. a., & ioannidis, j. p. (2018). p values in display items are ubiquitous and almost invariably significant: a survey of top science journals. plos one, 13(5), e0197440. https://doi.org/10.1371/journal.pone.0197440 devezer, b., navarro, d. j., vandekerckhove, j., & buzbas, e. o. (2020). the case for formal methodology in scientific reform. biorxiv, 2020.04.26.048306. https://doi.org/10.1101/2020.04.26.048306 dutilh, g., sarafoglou, a., & wagenmakers, e. j. (2019). flexible yet fair: blinding analyses in experimental psychology. synthese, 1-28. https://doi.org/10.1007/s11229-019-02456-7 faul, f., erdfelder, e., buchner, a., & lang, a. g. (2009). statistical power analyses using g*power 3.1: tests for correlation and regression analyses. behavior research methods, 41(4), 1149-1160. https://doi.org/10.3758/brm.41.4.1149 friedrichs, r. w. (1960). alter versus ego: an exploratory assessment of altruism. 
american sociological review, 496-508. http://doi.org/10.2307/2092934 funder, d. c., & ozer, d. j. (2019). evaluating effect size in psychological research: sense and nonsense. advances in methods and practices in psychological science, 2(2), 156-168. https://doi.org/10.1177/2515245919847202 gradstein, m. (1986). maximal correlation between normal and dichotomous variables. journal of educational statistics, 11(4), 259-261. grady, d. g., cummings, s. r., & hulley, s. b. (2013). research using existing data. designing clinical research, 192-204. retrieved from https://pdfs.semanticscholar.org/343e/04f26f768c9530f58e1847aff6a4e072d0be.pdf grønbjerg, k. a., & never, b. (2004). the role of religious networks and other factors in types of volunteer work. nonprofit management and leadership, 14(3), 263-289. https://doi.org/10.1002/nml.34 herd, p., carr, d., & roan, c. (2014). cohort profile: wisconsin longitudinal study (wls). international journal of epidemiology, 43(1), 34-41. https://doi.org/10.1093/ije/dys194 inglehart, r., c. haerpfer, a. moreno, c. welzel, k. kizilova, j. diez-medrano, m. lagos, p. norris, e. ponarin, & b. puranen et al. (eds.). (2014). world values survey: round six country-pooled datafile version: www.worldvaluessurvey.org/wvsdocumentationwv6.jsp. jd systems institute. kerr, n. l. (1998). harking: hypothesizing after the results are known. personality and social psychology review, 2(3), 196-217. https://doi.org/10.1207/s15327957pspr0203_4 koenig, l. b., mcgue, m., krueger, r. f., & bouchard, t. j. jr. (2007). religiousness, antisocial behavior, and altruism: genetic and environmental mediation. journal of personality, 75(2), 265-290. https://doi.org/10.1111/j.1467-6494.2007.00439.x konrath, s., fuhrel-forbis, a., lou, a., & brown, s. (2012). motives for volunteering are associated with mortality risk in older adults. health psychology, 31(1), 87-96. http://doi.org/10.1037/a0025226 lang, k. m., & little, t. d. (2018). 
principled missing data treatments. prevention science, 19(3), 284-294. https://doi.org/10.1007/s11121-016-0644-5 lakens, d., scheel, a. m., & isager, p. m. (2018). equivalence testing for psychological research: a tutorial. advances in methods and practices in psychological science, 1(2), 259-269. https://doi.org/10.1177/2515245918770963 lazerwitz, b. (1962). membership in voluntary associations and frequency of church attendance. journal for the scientific study of religion, 2(1), 74-84. http://doi.org/10.2307/1384095 leys, c., delacre, m., mora, y. l., lakens, d., & ley, c. (2019). how to classify, detect, and manage univariate and multivariate outliers, with emphasis on pre-registration. international review of social psychology, 32(1). http://doi.org/10.5334/irsp.289 lonsdorf, t. b., klingelhöfer-jens, m., andreatta, m., beckers, t., chalkia, a., gerlicher, a., …, & merz, c. j. (2019). how to not get lost in the garden of forking paths: lessons learned from human fear conditioning research regarding exclusion criteria. https://doi.org/10.31234/osf.io/6m72g lumley, t. (2004). analysis of complex survey samples. journal of statistical software, 9(1), 1-19. http://doi.org/10.18637/jss.v009.i08 massey, f. j. jr. (1951). the kolmogorov-smirnov test for goodness of fit. journal of the american statistical association, 46(253), 68-78. meade, a. w. (2010). restriction of range. in n. j. salkind (ed.), encyclopedia of research design. sage publishing. retrieved from https://sk.sagepub.com/reference/researchdesign/n388.xml merriam-webster (n.d.). bias. in merriam-webster.com dictionary. retrieved january 26, 2021, from https://www.merriam-webster.com/dictionary/bias. mertens, g., & krypotos, a. m. (2019). preregistration of analyses of preexisting data. psychologica belgica, 59(1), 338-352. http://doi.org/10.5334/pb.493 morgan, s. p. (1983). a research note on religion and morality: are religious people nice people? social forces, 61(3), 683-692. 
http://doi.org/10.2307/2578129 norenzayan, a., & shariff, a. f. (2008). the origin and evolution of religious prosociality. science, 322(5898), 58-62. http://doi.org/10.1126/science.1158757 nosek, b. a., beck, e. d., campbell, l., flake, j. k., hardwicke, t. e., mellor, d. t., van 't veer, a. e., & vazire, s. (2019). preregistration is hard, and worthwhile. trends in cognitive sciences, 23(10), 815-818. https://doi.org/10.1016/j.tics.2019.07.009 olsson-collentine, a., van assen, m. a., & hartgerink, c. h. (2019). the prevalence of marginally significant results in psychology over time. psychological science, 30(4), 576-586. https://doi.org/10.1177/0956797619830326 parliament of the world’s religions. (1993). toward a global ethic: an initial declaration. retrieved from https://www.weltethos.org/1-pdf/10stiftung/declaration/declaration_english.pdf pew research center. (2015). america’s changing religious landscape. pew research center. retrieved from https://www.pewforum.org/2015/05/12/americas-changing-religious-landscape pharoah, c., & tanner, s. (1997). trends in charitable giving. fiscal studies, 18(4), 427-443. https://doi.org/10.1111/j.1475-5890.1997.tb00272.x rohrer, j. m. (2018). thinking clearly about correlations and causation: graphical causal models for observational data. advances in methods and practices in psychological science, 1(1), 27-42. https://doi.org/10.1177/2515245917745629 rubin, m. (2017). when does harking hurt? identifying when different types of undisclosed post hoc hypothesizing harm scientific progress. review of general psychology, 21(4), 308-320. https://doi.org/10.1037/gpr0000128 saroglou, v. (2002). religion and the five factors of personality: a meta-analytic review. personality and individual differences, 32(1), 15-25. https://doi.org/10.1016/s0191-8869(00)00233-6 silver, n. c. (2008). attenuation. in p. j. lavrakas (ed.), encyclopedia of survey research methods. 
sage publishing. retrieved from http://methods.sagepub.com/reference/encyclopedia-of-survey-research-methods/n24.xml simonsohn, u., simmons, j. p., & nelson, l. d. (2015). better p-curves: making p-curve analysis more robust to errors, fraud, and ambitious p-hacking, a reply to ulrich and miller (2015). journal of experimental psychology: general, 144(6), 1146-1152. https://doi.org/10.1037/xge0000104 smith, a. k., ayanian, j. z., covinsky, k. e., landon, b. e., mccarthy, e. p., wee, c. c., & steinman, m. a. (2011). conducting high-value secondary dataset analysis: an introductory guide and resources. journal of general internal medicine, 26(8), 920-929. https://doi.org/10.1007/s11606-010-1621-5 steegen, s., tuerlinckx, f., gelman, a., & vanpaemel, w. (2016). increasing transparency through a multiverse analysis. perspectives on psychological science, 11(5), 702-712. https://doi.org/10.1177/1745691616658637 thompson, w. h., wright, j., bissett, p. g., & poldrack, r. a. (2019). dataset decay: the problem of sequential analyses on open datasets. biorxiv, 801696. https://doi.org/10.1101/801696 veldkamp, c. l. s., bakker, m., van assen, m. a. l. m., crompvoets, e. a. v., ong, h. h., nosek, b. a., …, & wicherts, j. m. (2018). ensuring the quality and specificity of preregistrations. https://doi.org/10.31234/osf.io/cdgyh wagenmakers, e.-j., wetzels, r., borsboom, d., van der maas, h. l. j., & kievit, r. a. (2012). an agenda for purely confirmatory research. perspectives on psychological science, 7(6), 632-638. https://doi.org/10.1177/1745691612463078 wason, j. m., stecher, l., & mander, a. p. (2014). correcting for multiple-testing in multi-arm trials: is it necessary and is it done? trials, 15(1), 364. https://doi.org/10.1186/1745-6215-15-364 weston, s. j., ritchie, s. j., rohrer, j. m., & przybylski, a. k. (2019). recommendations for increasing the transparency of analysis of pre-existing datasets. advances in methods and practices in psychological science. 
https://doi.org/10.1177/2515245919848684 williams, m. n., & albers, c. (2019). dealing with distributional assumptions in preregistered research. meta-psychology, 3. https://doi.org/10.15626/mp.2018.1592 youniss, j., mclellan, j. a., & yates, m. (1999). religion, community service, and identity in american youth. journal of adolescence, 22(2), 243-253. https://doi.org/10.1006/jado.1999.0214 meta-psychology, 2023, vol 7, mp.2021.2911 https://doi.org/10.15626/mp.2021.2911 article type: original article published under the cc-by4.0 license open data: not applicable open materials: not applicable open and reproducible analysis: not applicable open reviews and editorial process: yes preregistration: no edited by: rickard carlsson reviewed by: peder isager, julia rohrer analysis reproduced by: not applicable all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/cafu5 visual argument structure tool (vast) version 1.0 daniel leising, oliver grenke, and marcos cramer technische universität dresden we present the first version of the visual argument structure tool (vast), which may be used for jointly visualizing the semantic, conceptual, empirical, and reasoning relationships that constitute arguments. its primary purpose is to promote exactness and comprehensiveness in systematic thinking. the system distinguishes between concepts and the words (“names”) that may be used to refer to them. it also distinguishes various ways in which concepts may be related to one another (causation, conceptual implication, prediction, transformation, reasoning), and all of these from beliefs as to whether something is the case and/or ought to be the case. using these elements, the system allows for formalizations of narrative argument components at any level of vagueness vs. precision that is deemed possible and/or necessary. 
this latter feature may make the system particularly useful for attaining greater theoretical specificity in the humanities, and for bridging the gap between the humanities and the “harder” sciences. however, vast may also be used outside of science, to capture argument structures in e.g., legal analyses, media reports, belief systems, and debates. keywords: modelling, formalization, narrative, theory, humanities, science introduction argument structures are ubiquitous and important: we encounter them every day, in newspaper articles, in court rulings, in political and non-political debates on and off screen, and in our own more informal conversations with other people, and with ourselves. in an abstract sense, most arguments are about what is true, why and how things are related to one another, and about whether things are good or not. too often, however, arguments seem to run in circles, end in stalemates, or just fizzle out and are given up upon, instead of being conclusively resolved. we argue (yes), that these things happen because people tend to lose sight of some of the claims that they or others have made before. therefore, it is often advisable to aim for a comprehensive analysis of all relevant argument components. another reason why so many arguments remain unproductive is that those who argue tend to overlook the actual complexity of their own and others’ claims, and the relative vagueness of many claims. interestingly, the same problems seem to plague much of the more “narrative” theorizing that is so common in the humanities, and in psychology. in fact, there has been no shortage of calls for better (e.g. more formalized) theorizing in psychology, precisely to counter these shortcomings (devezer et al., 2021; eronen & bringmann, 2021; fried, 2020; glöckner & betsch, 2011; muthukrishna & henrich, 2019; robinaugh et al., 2021; smaldino, 2017, 2019). 
concrete advice on how exactly such better theorizing may be achieved is largely missing, however (borsboom et al., 2021). in the present paper, we introduce a tool devised for dealing with those problems. the tool is called vast (visual argument structure tool). the core idea is to visually display all relevant components of an argument structure at once, while at the same time aiming for exactness. a comprehensive display will make it harder to overlook or downplay related claims made earlier (e.g., because those previous claims do not align well with more recent ones). visual displays may also be more intuitive and easier to digest for most users, especially when compared to the alternative of using algebraic expressions. after all, there is a reason why so many articles in scientific journals as well as in the news media are accompanied by figures illustrating their main points. furthermore, visual displays tend to be more parsimonious: with formulae, the same variable name will have to be written again each time it is used as an input or output of some new equation. in contrast, a visual display may incorporate the variable only once, and establish all of the relevant relationships with other variables through arrows or lines pointing in various directions. this is how the matter is handled in vast, and aligns well with the typical approaches in structural equation modelling (sem), structural causal models (scm) and directed acyclic graphs (dag) (dablander, 2020; pearl, 1995; pearl & mackenzie, 2018; rohrer, 2018). the graphical structuring of arguments was also inspired to some extent by developments in formal argumentation (see baroni et al., 2018). vast overlaps significantly with all of these previous approaches, and also incorporates many elements of formal logic, in particular logical connectives (and, or, and xor) from classical propositional logic (büning & lettmann, 1999) in the tradition of boole (1854) and frege (1879). given that we allow truth-values between 0 and 1 (see below), vast is also influenced by continuously-valued logics (preparata & yeh, 1972). 

table 1. components of the system

element name | meaning | symbolized by | default range
concept | a feature that may apply to certain objects | frame with abstract label | 0; 1
name | a natural-language label that may be used for a concept | frame with label in quotation marks | 0; 1
higher-order concept | a specific combination of elements that may in its entirety apply to certain objects | frame containing two or more elements | 0; 1
data | a set of actual observations | frame with thick black edge on one side | 0; 1
is | how much x is the case | pentagon containing is in capitals | 0; 1
ought | how much x ought to be the case | pentagon containing ought in capitals | 0; 1
perspective | how much the perspective-holder agrees with x | oval connected to is / ought, containing name of perspective-holder | 0; 1
naming | how appropriate it is to call x by the name y | arrow accompanied by lowercase letter n | -1; 1
conceptual implication | how much thinking of something as being x also implies thinking of it as being y | arrow accompanied by lowercase letter i | -1; 1
causation | how reliably x will trigger y | arrow accompanied by lowercase letter c | -1; 1
transformation | how strongly x maps onto y | arrow accompanied by lowercase letter t | -1; 1
prediction | how well y may be predicted from x | arrow accompanied by lowercase letter p | -1; 1
reasoning | how much x is a reason to believe y | arrow accompanied by lowercase letter r | -1; 1
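the default value ranges in table 1 can be encoded in a small data structure, for example to validate the weights in a machine-readable vast diagram. this is our own illustrative sketch; the dictionary keys and the helper function are inventions for the example, not part of vast itself.

```python
# hypothetical encoding of table 1's default value ranges:
# elements and statements take values in [0, 1]; the six
# relationship types (n, i, c, t, p, r) take values in [-1, 1].
RANGES = {
    "concept": (0.0, 1.0), "name": (0.0, 1.0), "data": (0.0, 1.0),
    "is": (0.0, 1.0), "ought": (0.0, 1.0), "perspective": (0.0, 1.0),
    "n": (-1.0, 1.0), "i": (-1.0, 1.0), "c": (-1.0, 1.0),
    "t": (-1.0, 1.0), "p": (-1.0, 1.0), "r": (-1.0, 1.0),
}

def valid_value(kind, value):
    """check that a value lies inside the default range for its element."""
    lo, hi = RANGES[kind]
    return lo <= value <= hi
```

for instance, a causation weight of -0.5 is admissible (causation ranges over -1; 1), while an is value of -0.5 is not (is ranges over 0; 1).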
however, vast is comparatively broader and more integrative in that it explicitly accounts for various types of relationships between concepts (i.e., naming, conceptual implication, causation, prediction, transformation, and reasoning). the strength of all of these relationships may be expressed in terms of the same metric, as we will explain in more detail below. vast also accounts for the possibility that concepts may be applied to different sets of objects, for claims as to whether something is and/or ought to be the case, and for different perspectives on these issues. we discuss some of the overlap and differences between vast and previous tools with a similar scope further below.

the system

in this section, we introduce the different types of elements that, taken together, constitute our system in its entirety. table 1 lists all of these elements alongside each other. to facilitate comprehension, we will use a variety of examples along the way to illustrate their potential uses.

concepts

concepts are the basic building blocks of cognition. note that concepts are assumed to exist before language is used (see below) — they may exert their influence irrespective of the words ("names", see below) that are used to denote them. a concept assigns values to objects. thus, concepts are very similar to mathematical functions in that they produce an output value for every input (i.e., object). in the simplest case, that output will be dichotomous, so the concept will yield a value of either 0 or 1 for each object. here, 1 means that the object is an exemplar of the concept, and 0 means that it is not. for instance, a person may look at a number of objects and determine whether any of them are exemplars of the concept that one would refer to with the word "car". note, however, that many concepts are not dichotomous but allow for continuous variation of output values.
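the function-like view of concepts sketched above can be made concrete in code. the following is a minimal illustration of our own (not part of vast itself); the objects, their attributes, and the "car" rules are all invented for the example:

```python
# a concept modelled as a function from objects to membership values.
# the dichotomous "car" rule and the graded typicality rule below are
# invented purely for illustration.

def is_car(obj: dict) -> int:
    """dichotomous concept: returns 1 (exemplar) or 0 (non-exemplar)."""
    return 1 if obj.get("wheels") == 4 and obj.get("engine") else 0

def car_typicality(obj: dict) -> float:
    """graded concept: a value between 0 and 1, in the spirit of
    'more or less typical exemplars'."""
    score = 0.0
    score += 0.5 if obj.get("wheels") == 4 else 0.0
    score += 0.5 if obj.get("engine") else 0.0
    return score

sedan = {"wheels": 4, "engine": True}
trailer = {"wheels": 4, "engine": False}
print(is_car(sedan), is_car(trailer))  # → 1 0
print(car_typicality(trailer))         # → 0.5
```

the graded variant anticipates the normalization to the 0–1 range that vast uses for non-dichotomous concepts.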
in vast, this variation is usually normalized to a range between 0 and 1, to make comparisons between different concepts easier. this basically incorporates the so-called "prototype approach" to classification (rosch, 1978), in which an object may be a more or less typical exemplar of a category / concept / class. in vast, concepts are usually displayed in the form of frames bearing abstract labels (e.g., x, y, or xyz). it is important to note that the display of a concept only symbolises the assumed existence and relevance of a cognitive process that would assign certain values to some objects and other values to other objects. it is agnostic with regard to the desired, assumed, or measured distribution of those values. the question of whether something is or should be the case is the realm of is and ought statements, which will be introduced later. empirical data constitute a special type of concept, which is symbolized by adding a thick black edge to one side of the respective frame. this is basically the same distinction that is made between "manifest" (measured) and "latent" (imagined) variables in structural equation modelling (see figure 12 and the accompanying text for an example).

types of relationships between concepts

the next major element of the system is the relationship that may exist between concepts. these are basically if-then relations: if one concept applies (to some of the relevant objects), then another concept also applies, at least to some extent. there are different qualities of such relationships, however, which must be distinguished from one another. in vast, we explicitly account for naming, implication, causation, prediction, transformation, and reasoning relationships, which we consider to be among the most common and relevant ones. specifying additional relationship types is also possible, if needed. all of this will now be explained in more detail.

relationship type 1: naming (n).
people tend to use words to denote the different ways in which they think about objects. in vast, these words are called "names". to clearly distinguish concepts and their names from one another, vast uses abstract labels (e.g., c, r) for the former and real-language labels in quotation marks (e.g., "cat", "rocket") for the latter. the relationship between a concept and a word denoting that concept is called a "naming relationship". it is symbolized by an arrow pointing from the concept to the name, accompanied by the lowercase letter n. as with all relationship types (see above), the arrow stands for an if-then relation: if an object is an exemplar of the respective concept, then one may call this object by the respective name.

figure 1. selected naming relationships. the word "bat" is a homonym for concepts x and y, whereas the words "furious" and "enraged" are synonyms for concept z.

figure 1 displays some examples of naming relationships, including homonyms (the same name is used for different concepts) and synonyms (different names are used for the same concept). note that both (a) the appropriateness and (b) the strengths of the relationships displayed in figure 1 are treated as irrelevant for now. distinguishing between concepts and their names is often necessary, because idiosyncratic word usage accounts for all sorts of problems (e.g., misunderstandings) in everyday arguments. the same issue is relevant for psychology, which continues to suffer from — often unacknowledged — jingle-fallacies (use of homonymic theoretical concepts) and jangle-fallacies (use of synonymic theoretical concepts) (block, 1995).

relationship type 2: conceptual implication (i).
conceptual implication is about the extent to which classifying objects as exemplars of one concept implies also categorizing the same objects as exemplars of another concept. figure 2 displays an example. here, when an object is considered to be a "sun", the same object is also somewhat likely to be considered "hot" and "bright". conceptual implications are symbolized by arrows accompanied by the lowercase letter i. figure 2 contains three different ways of displaying basically the same information.

figure 2. conceptual implications among three concepts. left: full (default) mode in which concepts and names are connected within the same structure. middle: full (default) mode with concepts separated from their names. right: finger-is-moon-mode (fimm) in which concept labels reference the concepts' names. the dashed lines symbolise that these are alternative ways of displaying the same set of concepts, names, and relationships, not three parts of the same vast display.

in the display on the left-hand side, all concepts and the relevant relationships between them are displayed as a part of one coherent whole. this is the default mode that we suggest for use with most vast analyses, as it maximises parsimony while retaining all of the relevant information. in the middle display, the conceptual implications (type i) among the three concepts and the naming relationships (n) have been separated from one another. this way of displaying things may sometimes be helpful to avoid clutter. however, this approach comes at the price of somewhat lower parsimony, because every concept now has to be displayed twice. in the display on the right-hand side, we use concept labels that directly reference the concepts' names. this is what we call the "finger-is-moon-mode" (fimm), as it abolishes the explicit distinction between signifier and referent.
this constitutes another possible way of reducing clutter, but comes at the significant risk of overlooking the importance of semantics, especially (partial) homonymity, synonymity, and antonymity. for example, another display using fimm could show that the concept star has the same conceptual implications (hot, bright) as the concept sun. here, the use of fimm might obscure the fact that this is the case simply because "sun" and "star" are two different words for the exact same type of thing (s). to highlight the risk of semantic ambiguities like this one, we recommend explicating when the fimm is being used, by adding the respective acronym in one corner of the display (see figure 2). also, many concept names are too long to be used as concept labels. in these cases, we recommend the approach exemplified in the middle of figure 2. as a next step, we will introduce four more types of relationships between concepts that frequently feature in argument structures. figure 3 displays the ways in which they are distinguished from one another (in terms of lowercase letters accompanying the respective arrows), along with a very simple example for each type. note that, for simplicity, this figure uses fimm, as signalled by the acronym in the upper right-hand corner.

relationship type 3: causation (c). many important articles and books have been written about causation (eronen & bringmann, 2021; pearl, 1995; pearl & mackenzie, 2018; rohrer, 2018). in vast, we use a concept of causation that is also reflected in how most experimentalists tend to think about their research designs. this concept involves temporal order as a necessary ingredient: causes always precede effects, never the other way round.
also, causation would become evident if we were able to manipulate the suspected cause variable and then observe subsequent changes in another variable.

figure 3. four more types of relationships between concepts (c = causation, t = transformation, p = prediction, r = reasoning).

note that all of this concerns the ways in which we (and most people, presumably) think about causation, irrespective of whether such a suspected causal link may ever be proven or disproved in terms of data. note also that most causal relationships between concepts could be — but do not have to be — decomposed into a number of intervening steps. for example, the causal relationship in figure 3 reflects a relatively proximal link between cause (smoking) and effect (lung cancer). it could be amended by inserting tar accumulation as a mediator that is caused by smoking and that causes lung cancer.

relationship type 4: transformation (t). this relationship type is used to account for situations in which the applicability of one concept may be deduced from another concept by mere computation. the respective example in figure 3 reflects a case in which one variable (temperature in celsius) is basically rescaled into another variable, by multiplying the former's values by a factor (1.8) and then adding a constant (32). the specific values for the factor and the constant are not displayed, but could be. the transformation type of relationship may also be used to account for scoring procedures, such as the specific ways in which an operational measure of socioeconomic status is derived from a number of indicators (e.g., highest degree attained, annual income).

relationship type 5: prediction (p). this type of relationship is about knowing something about the values of y when we know something about the values of x.
note that this is possible without knowing anything about the mechanism underlying the association. for example, type p relationships may ignore the direction of causal effects, as in the respective example in figure 3: here, a person's height predicts the number of y chromosomes that same person has, although certainly the former is not the cause of the latter. often, such predictive relationships may be found and described first (e.g., a certain set of symptoms appearing together in patients), and only later be replaced by more specific explanations (e.g., in terms of a virus causing all of those symptoms).

relationship type 6: reasoning (r). this relationship type is about the conclusions that people draw from certain premises, on purely intellectual grounds. it reflects the idea that if some concept applies (e.g., x + 4 = 8, see figure 3), one may infer that some other concept (e.g., x = 4, see figure 3) also applies. note that this is not limited to conclusions that would generally be regarded as "logical", but extends to just about any conclusion that someone thinks they may draw. in fact, vast may be used to first explicate one person's line of reasoning and then refute that reasoning based on some other reasoning. for example, peter may think that the results of some empirical study clearly suggest that y is the case, whereas trudy may think otherwise. such discrepancies may then be explained using vast, by analysing why exactly peter and trudy come to such different conclusions (e.g., because one of them trusts the authors of the study, whereas the other does not). so-called "logical conclusions" simply constitute a special case in which certain lines of reasoning are viewed as (in-)defensible by a group of people (e.g., scientists) who endorse some set of reasoning rules. that endorsement then serves as the premise for drawing conclusions as to whether "x is a reason to believe y" or not — which is another type r relationship.

additional relationship types.
in the present paper, we only address those types of relationships between concepts that we think feature prominently in many arguments — everyday ones as well as scientific ones. needless to say, the selection is and has to be somewhat subjective. it is relatively easy to come up with examples of other relationship types that may be useful to employ under certain circumstances (e.g., metamorphosis (m): when x turns into y over time; association (a): when thinking of x makes it likely to also think of y; element of (e): when x is among the ingredients that, together, constitute y; etc.). we assume that the principles laid out in the following (e.g., regarding relationship strength and the construction of higher-order concepts) will still apply in these instances. in cases where a relationship between concepts is assumed to exist but the exact nature of that relationship is (yet) unknown, we recommend using the letter u.

figure 4. relationship strengths and relationship patterns.

relationship strength

in vast, the default interpretation of an arrow that points from one concept (e.g., x) to another (e.g., y) is that this relationship is considered relevant and positive (i.e., the more x, the more y). thus, if an arrow is absent between x and y, this means that the relationship is zero and/or that it is regarded as unimportant for the present analysis. so far, we have not used any further specifications of relationship strength, and this approach may be perfectly sufficient in many cases. sometimes, however, such specifications will be deemed useful or even necessary. vast allows for the use of verbal labels such as "weak", "strong", "negative" etc. for this purpose.
this approach will often be appropriate when trying to visualise the structure of an existing argument that has been made using natural language. it may also be the most useful approach when a numerical specification does not (yet) seem possible. case 1 in figure 4 displays an example. note that we rather arbitrarily added the letter c (causation) to all the arrows in figure 4. this may easily be replaced with any other relationship type, as everything we say here about relationship strength applies equally to all types. if a simple numerical quantification of relationship strength is wanted, we propose using normalized coefficients ranging from −1 to 1. "normalized" means that these coefficients ignore the particular scales of the concepts that they connect, and instead quantify relationship strength in terms of proportions of these features' ranges. note that this is only possible if such a range may reasonably be assumed to exist. based on our own experience, this seems to be the case with most psychological concepts, however.

figure 5. relationship strength quantified in terms of default coefficients ranging from -1 to 1. upper two examples: relationships between two dichotomous concepts. lower three examples: relationships between two ordinal or metric concepts.

the default coefficient of relationship strength that we propose reflects the increase in y (expressed as a percentage of the available range of that feature) that is associated with a "perfect" or "complete" increase in x (i.e., an increase that covers the whole available range of x, from the smallest conceivable value to the highest conceivable value). case 2 in figure 4 presents an example in which the coefficient is 0.5.
this type of coefficient may be applied to relationships between dichotomous concepts as well as to relationships between continuous concepts. in the former (dichotomous) case, it may be interpreted as the percentage of cases in which a complete change in x (from 0 to 1) is accompanied by a change in y (from 0 to 1). in the latter (continuous) case, the coefficient reflects the average increase on the continuous y scale that accompanies the largest possible increase (from 0 to 1) on the continuous x scale (again both expressed in terms of percentages of the respective ranges). figure 5 contains a number of examples showcasing this broad applicability. negative coefficients are to be interpreted accordingly: the more x is the case, the less y is the case. this default coefficient of relationship strength is generic enough to be applied to all types of relationships between concepts (e.g., type p: “wearing glasses” makes it 70 percent likely for a person to also be “smart”; type r: it is 90 percent reasonable to assume someone “is in love with you” when that person “giggles a lot while talking to you”; type c: being “obese” makes it 50 percent likely for someone to develop “diabetes type ii” as a consequence). for non-dichotomous concepts (see the lower three panels of figure 5), the default coefficient of relationship strength is largely agnostic regarding the distributions of concepts’ values: for example, the relationships displayed in the middle panel and in the panel to the right have the same strength coefficient (0.5) but in the latter case the relationship is deterministic whereas in the former case it is noisier. this difference may also be accounted for in vast, as we will discuss in the next section ("noise"). vast’s default coefficient of relationship strength is based on percentages of the ranges of the concepts that the relationship connects. 
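to make the default coefficient concrete, here is a small sketch of how one might estimate it from paired observations, given assumed conceivable ranges for both concepts. the least-squares slope on range-normalized values is our own illustrative estimator; the paper defines the coefficient conceptually, not computationally, and the data below are invented:

```python
# estimate a range-normalized strength coefficient for the relationship
# between two concepts. both concepts are first rescaled to [0, 1] using
# their assumed conceivable ranges; the slope then expresses the change
# in y (as a fraction of its range) per full-range change in x.

def normalized_strength(xs, ys, x_range, y_range):
    xn = [(x - x_range[0]) / (x_range[1] - x_range[0]) for x in xs]
    yn = [(y - y_range[0]) / (y_range[1] - y_range[0]) for y in ys]
    mx, my = sum(xn) / len(xn), sum(yn) / len(yn)
    cov = sum((a - mx) * (b - my) for a, b in zip(xn, yn))
    var = sum((a - mx) ** 2 for a in xn)
    return cov / var

# invented data: y rises by half of its range as x covers its full range
xs = [0, 20, 40, 60, 80, 100]
ys = [5, 10, 15, 20, 25, 30]
print(round(normalized_strength(xs, ys, (0, 100), (0, 50)), 3))  # → 0.5
```

note that for very steep or non-monotone data this simple slope can leave the −1 to 1 band, which is one reason the paper treats the coefficient as a conceptual quantity rather than a fixed estimator.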
sometimes, however, there may be good reasons to deviate from the default (e.g., when earthquake magnitude on the unbounded richter scale is part of an argument). in such cases, using other measures of relationship strength (e.g., an exact function translating x into y) is possible. as the exact function connecting certain concepts will often be too long to be written above an arrow in its entirety, we suggest placing it somewhere else in the display and referencing it using an asterisk (e.g., case 3 in figure 4). diamonds should be used when several concepts are jointly related to another concept. this includes logical connectives such as and, or and xor (exclusive or), as shown in case 4 in figure 4. when a more specific formula is needed to derive a joint output from several inputs (e.g., a scoring procedure), the diamond and asterisk elements may be combined, as shown in case 5. a diamond with and inside it may also be used to symbolise the interaction effect that two concepts (x and y) have on a third concept (z). case 6 in figure 4 displays this possibility along with the two main effects of x and y on z, so this is basically a vast-type depiction of a two-factor anova. we also use case 6 to showcase the possibility of indexing coefficients (c1, c2, c3). doing so is often useful to facilitate discussions among analysts.

noise

sometimes, we may not only wish to display the relationship between specific concepts, but also acknowledge that there are additional unspecified influences ("noise") on these concepts. to symbolise these influences, we recommend using "noise arrows" similar to the ones that are used in structural equation modelling (sem). a noise arrow always points toward a given concept (i.e., the one that is affected by the noise) but does not originate in a specific concept. cases 7, 8, 9 and 10 in figure 4 provide examples.
note that, unlike in structural equation modelling, noise arrows in vast do not stand for the residuals that remain between observed y values and the y values that one predicts from x. rather, they stand for other influences apart from x that may move the values of y toward its maximum (default, positive coefficient) or toward its minimum (negative coefficient). case 7 displays the default situation in which noise may lead to an increase in concept y. if noise may move the values of a concept in either direction, this may be specified using "<> 0", as shown in case 8. if it is important to specify that there is no noise, this may be expressed using a coefficient of zero, as shown in case 9. this expresses the idea that the only factor whose values may make a difference with regard to the values of concept y is concept x. finally, when an influence may exist and be relevant for the analysis, but one is not really sure yet, we recommend using a question mark as "coefficient" (see case 10 in figure 4).

figure 6. relationship arrows connecting concepts both ways. use of two arrows implies that the exact ways in which the concepts are related depend on direction. use of a bidirectional arrow (as in the last case) implies that direction does not matter. the i 1 coefficient in the latter case reflects the assumption that concepts x and y are identical.

relationship direction

arrows in vast stand for if-then relationships between concepts. we will now briefly address the direction into which arrows may point, and how this differs between relationship types. generally speaking, if there is an arrow pointing from x to y, there may also be an arrow pointing from y to x.
in most cases, the shape of the respective relationship will differ depending on its direction. if both directions are of interest to the current analysis, we recommend signalling this difference by using two separate arrows, one for each direction. figure 6 presents a few examples. a few specifics need to be briefly discussed in this regard: first, naming relationships are special in that arrows may only point from a concept to its name but not the other way round. second, when the strengths of conceptual implications (type i) between x and y differ depending on direction, this means that one concept (the one that is the target of the arrow with the higher coefficient) is broader and more inclusive than the other. this is highly relevant to all kinds of conceptual hierarchies (taxonomies).

figure 7. a simple example with is- and ought-statements regarding the same concept.

third, vast does allow for causal (type c) arrows pointing from x to y and back. this is relevant for displaying all kinds of positive and negative feedback loops. note that this diverges from recommendations in the literature on directed acyclic graphs (rohrer, 2018). fourth, vast also allows for displays of circular reasoning (type r), because the purpose of vast is not to prescribe rules as to how one should think, but to make visible the ways in which someone thinks. this includes the possibility of displaying beliefs that others may find unconvincing or even irrational. fifth, we recommend using a bidirectional arrow with an "i 1" coefficient for expressing the idea that x and y are identical (see figure 6). likewise, a bidirectional arrow with a coefficient of "i -1" would imply that one concept is the exact opposite of the other (e.g., between concept h named "huge" and concept t named "tiny").
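the point about asymmetric implication strengths indicating conceptual breadth can be illustrated with a toy sketch. the concepts and coefficient values below are invented for the example (the bird/animal pair echoes figure 6):

```python
# directional implication coefficients between two concepts; under the
# rule described in the text, the concept receiving the stronger arrow
# is the broader, more inclusive one. values are invented.

implication = {
    ("bird", "animal"): 1.0,  # being a bird fully implies being an animal
    ("animal", "bird"): 0.3,  # the reverse implication is much weaker
}

def broader(a, b, i):
    """return the broader of two concepts under asymmetric implication."""
    return b if i[(a, b)] > i[(b, a)] else a

print(broader("bird", "animal", implication))  # → animal
```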
the is and ought elements

in many arguments, the extent to which something is considered to be the case and the extent to which something should be the case play important roles. to capture these extents, vast uses two special elements, called is and ought. both denote specific values on a given concept. they are important in a variety of ways: first, disagreements often arise because people start from different premises regarding the extent to which something is the case (e.g., whether vaccines are safe) or the extent to which something ought to be the case (e.g., whether one should trust the government). second, discrepancies between is- and ought-values on the same concept often explain why people decide to act in certain ways — often they do so in order to move the is-value closer to the ought-value. in vast, is and ought are symbolized by pentagons which include the respective term (is or ought) in capitals. these are connected to one or more concepts using simple lines rather than arrows, in order to distinguish them from relationships between concepts. the specific is- and ought-values are written next to the respective lines. figure 7 shows a very simple example. is and ought are not concepts themselves but rather denote specific locations within the range of values that a given concept may have. if is/ought is used, specifying the metric of measurement may sometimes be helpful (e.g., figure 7). if it is not possible or useful to specify this metric, we recommend expressing is- and ought-values in terms of fractions of the normalized range (between 0 and 1) of the respective concept. if no specific value is given, we recommend using "applies more likely than not" (> 0.5) as the default interpretation. note that is may be interpreted as a measure of the respective concept's central tendency (e.g., the arithmetic mean). it is possible, however, to provide whole ranges of is- or ought-values, if a single value is deemed insufficient.
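a minimal data sketch of is/ought statements attached to a concept, loosely in the spirit of figure 7 (the dictionary layout and the numeric values are our own illustration, not a prescribed vast format):

```python
# a concept with an is-value (what is assumed to be the case) and an
# ought-value (what is considered desirable). the structure and numbers
# are invented for illustration.

concept = {
    "name": ("expected increase of average global temperature "
             "in the next 50 years, measured in kelvin"),
    "is": 2.0,
    "ought": 0.0,
}

# the is-ought discrepancy is what often motivates action:
discrepancy = concept["is"] - concept["ought"]
print(discrepancy)  # → 2.0
```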
figure 8. an example of how assumed and desired concept features may be displayed separately in a table, to avoid clutter.

concept | name | unit | is (m) | is (sd) | ought (m) | ought (sd)
x | "having a good day" | % | 50 | 10 | 70 | 10
y | "being in a good mood" | % | 65 | 25 | 65 | 10

going further, it may sometimes be helpful to specify the assumed and/or desired distribution characteristics of a concept even more. in such cases, we recommend providing the respective information in a separate table on the side, to avoid clutter. an example is shown in figure 8. the example tells us that, if a person (= object) "has a good day", that person will be more likely to also "be in a good mood", and that this effect is a causal one. the table also tells us that (a) it would be good if the average percentage of people having a good day increased (from is: 50 to ought: 70), and that (b) the variation (sd) in people's mood went down (from is: 25 to ought: 10). the latter goal may be rooted, e.g., in the assumption that it is important to avoid extreme unhappiness in people.

figure 9. different is and ought values for different perspective-holders.

the perspective element

almost by definition, arguments tend to involve disagreements between viewpoints. to account for this, vast incorporates a so-called "perspective" element that reflects how strongly a given entity agrees with something. this "entity" is usually a person, but it may also be a group of people or something more abstract like a corporation. the perspective element may only be used to condition is and ought statements. when used to condition an is-statement, it reflects the extent to which a perspective-holder agrees that the given level of the concept applies.
when used to condition an ought-statement, it reflects the extent to which the perspective-holder agrees that this is the most desirable level of the concept. levels of agreement may be quantified using values between 0 ("does not agree at all") and 1 ("agrees completely"). if a perspective-holder's level of agreement is left unspecified, we recommend using "tends to agree" (> 0.5) as the default interpretation. if a (part of a) vast display does not explicate who the perspective-holder is, then is and ought statements reflect the view of the analyst who created the display (see next section). figure 9 showcases the use of the perspective element in vast: the name of the perspective-holder is displayed inside an oval, and the strength of the perspective-holder's belief is again expressed using coefficients ranging from 0 to 1. note that a value of 0 would only imply a complete lack of agreement with x, not the belief that the opposite of x is true. this is necessary because many of the concepts that we use in everyday life do not have clearly defined opposites. if it seems necessary to not only visualise a perspective-holder's lack of agreement with x but also what else (y) they believe in, that alternative view will have to be specified, as well. the specific example given in figure 9 conveys a wealth of information at one glance: daniel and marcos both think that daniel does not spend any time studying esperanto. however, daniel is perfectly certain about that (1.0) whereas marcos — who cannot really know for sure — is a little less certain (0.8). also, daniel is almost certain (0.9) that he should not take up any esperanto-learning, whereas marcos is also quite sure that daniel should spend 4 hours per week learning esperanto. making differences such as these visible may go a long way in explaining the different behavioural choices that people make.
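the figure 9 example can be encoded in a small sketch of our own: each perspective-holder attaches a (value, agreement) pair to an is- or ought-statement. the is- and ought-values and most agreement levels come from the text; marcos's ought-agreement (0.8) is read off the figure and should be treated as an assumption:

```python
# perspective elements as (value, agreement) pairs per stance.
# agreement ranges from 0 ("does not agree at all") to 1 ("agrees
# completely"); marcos's ought-agreement of 0.8 is assumed.

concept = "hours per week that daniel spends studying esperanto"
perspectives = {
    "daniel": {"is": (0, 1.0), "ought": (0, 0.9)},
    "marcos": {"is": (0, 0.8), "ought": (4, 0.8)},
}

def disagree(p, q, stance):
    """two holders disagree on a stance if they endorse different values."""
    return p[stance][0] != q[stance][0]

print(disagree(perspectives["daniel"], perspectives["marcos"], "is"))     # → False
print(disagree(perspectives["daniel"], perspectives["marcos"], "ought"))  # → True
```

the two holders share the same is-value but endorse different ought-values, which is exactly the "agree to disagree" pattern discussed below.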
the perspective element of vast incorporates the important issue of subjective certainty, which plays a key role in scientific theorizing. in fact, if one plotted all of the possible is-values (x-axis) against a perspective-holder's subjective certainties (y-axis) and rescaled the latter such that their sum is 1, one would basically obtain a density distribution very much akin to a bayesian prior. outside of scientific theorizing, however, inspecting entire distributions of possible is-values is relatively rare. thus, we recommend displaying the is-value with the highest subjective certainty as a default. if necessary, alternative is-values and their respective certainties may be displayed in addition. the perspective feature may also be used to express someone's "hunches" (e.g., the suspicion that there may be another yet unrecognized factor involved in accounting for some effect). this also includes suspicions as to what someone may or may not have meant by saying something (i.e., implications).

the analyst element

each vast display has to be created by someone. notably, the persons creating such displays are responsible for arranging the various concepts and their relationships with one another in the most accurate or helpful ways possible, but not for the actual content of the respective argument. in fact, it is possible to display the structure of an argument with great precision while at the same time disagreeing wholeheartedly with most or all of the points that are being made. this is why we prefer to call these persons "analysts" rather than "authors". we recommend naming the analyst who created a display in a header, as also shown in the upper left corner of figure 9. the persons named as analysts are responsible for the display in its entirety (again: irrespective of how much they agree with the display's actual content).
to make this point clear, we decided to include the same persons (daniel and marcos) in figure 9, in two different roles: as analysts, and as perspective-holders. altogether, the figure tells us that daniel and marcos (analysts) agree that daniel and marcos (perspective-holders) have different views on how much time daniel should spend learning esperanto (the specifics of their viewpoints were discussed above and will not be repeated here). in other words, they “agree to disagree”.

higher-order concepts

after having introduced all of the different ways in which concepts may be related to one another, as well as the is, ought, perspective and analyst elements, it is now time to introduce a final element of great importance: in vast, any combination of elements may itself become a “higher-order concept” (hoc) and thus be related to other (higher-order) concepts or be the subject of is or ought statements. in this, all of the rules explained so far do apply as well. figure 10 displays a very simple example.

[figure 10: higher-order concepts. here, is and ought statements for different perspective-holders refer to two higher-order concepts, each of which contains a number of causal relationships between lower-order concepts.]

here, two persons (toby and tina) are portrayed (by peter) as disagreeing in regard to the question of whether the causal effect of x on z is mediated by y1, or by y2. toby is certain that the former is the case, whereas tina is undecided. similar displays may be used to account for all sorts of differences between viewpoints, such as naming conventions (i.e., what it is that a given label refers to, or what that label ought to be used for). such issues are as relevant to contemporary psychology today as they were decades ago (block, 1995).
it is important to note that a higher-order concept binds all of its components together in an inseparable fashion: the higher-order concept applies (to some object) if and only if all of its lower-order components apply. by using higher-order concepts, vast users may “zoom in” on certain parts of a display if they wish to elaborate on its details, or “zoom out” when they decide to ignore some of the details for some time. figure 11 provides an example.

[figure 11: using higher-order concepts to “zoom in” on a particular part of a vast display. the lower part of the display details what concept b is about.]

the lower part of the figure “zooms in” on the meaning of one of the concepts (b) that features in the causal chain displayed in the upper part of the figure. specifically, we learn that b stands for m1 being the case (is) and m1 eliciting m2. furthermore, all of this elicits y. higher-order concepts may be used to display a hierarchy of concepts that apply to different sets of objects. so far, we abstained from explicating the sets of objects that the concepts in a vast analysis apply to. instead, we tacitly assumed that those objects were the same across all concepts. sometimes, however, specifying the sets of objects to which different concepts apply is necessary. we recommend using greek letters for this purpose. figure 12 displays a hypothetical example. here, intelligence test scores and school grades were obtained from students (τ). remember that the thick black edges of the respective concept frames symbolise the fact that these concepts were actually measured.
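the if-and-only-if rule stated above — a higher-order concept applies exactly when all of its lower-order components apply — is a plain conjunction. a small sketch (the dictionary encoding and component names are our own):

```python
def hoc_applies(component_applies: dict) -> bool:
    """a higher-order concept applies to an object if and only if
    all of its lower-order components apply to it (conjunction)."""
    return all(component_applies.values())

# figure 11, informally: concept b stands for "m1 is the case" and "m1 elicits m2"
b_components = {"m1 is the case": True, "m1 elicits m2": True}
```

if even one component fails to apply, the higher-order concept as a whole does not apply; the components cannot be detached from one another.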
we assume that the students’ intelligence test scores do predict their grades to some extent.

[figure 12: use of concepts (int, gra, sel) and higher-order concepts (hoc1, hoc2) that are applied to different sets of objects (τ vs. σ).]

note that we use p instead of c relationships in this display, because the two types of data are empirically related (probably because they both reflect the students’ actual cognitive abilities), but the students’ test results are not the cause of their grades. the figure also tells us that this predictive relationship is assumed to be different for students in school 1 as compared to students in school 2. we learn this from comparing coefficients p1 and p2, and from how the two higher-order concepts (hoc1 and hoc2) are named. at this point, a new set of objects (σ) has to be introduced to distinguish the two schools from one another. we also learn that something else (sel), named “selectivity”, is assumed to vary across the same objects (σ), and that this variation causally explains the difference between p1 and p2.

discussion

in this paper, we presented the first version of vast, the visual argument structure tool. it provides a set of clear rules by which the ways in which people think and speak about things may be visually organized, in order to help understand those ways better. vast may be used for constructing new arguments, as well as for analysing, completing, revising and/or (partly) refuting existing ones. the system captures some of the key types of elements that many everyday arguments and scientific theories share by means of a relatively small set of graphical symbols.
in our view, its appeal lies in its intuitiveness and relative economy, in its capacity to account for any degree of specification or relative fuzziness, and in its applicability to basically any content domain. in the next section, we give a brief overview of the system’s possible uses. depending on the material at hand, one or several of the following goals may play a more prominent role in a vast analysis: (a) afford comprehensiveness, (b) explicate premises in terms of what is the case and what ought to be the case, (c) clarify the views that different (groups of) people have, including areas of (dis-)agreement, (d) identify areas of under-specification, inconsistency or outright contradiction, (e) deduce defensible conclusions.

potential uses of the system

theory specification. there is no shortage of complaints that the mainly narrative theories which are so common in the humanities and in psychology are of limited value because of their relative fuzziness and under-specification. at the same time, proposals as to how this situation may be improved are largely lacking, and the existing ones are often relatively unspecific themselves. we think that vast may offer a solution to this problem, for two reasons: first, a vast analysis may help pinpoint those parts of a narrative theory that may and should be better specified, and then aid in the specification process. we consider this the preferable approach, compared to the alternative of rejecting narrative theories altogether. by specifying a theory better, its “empirical content” (empirischer gehalt; glöckner and betsch, 2011; popper, 2002, page 96) and thus ultimately its utility will be improved (e.g., it will become easier to refute). second, vast allows for accommodating any level of fuzziness that seems acceptable or unavoidable at present. this very much aligns with the idea of scientific theory-development as an incremental process of gradually increasing specificity.
furthermore, by being able to incorporate natural-language components of an argument, a vast display may help bridge the chasm that exists between the humanities and the “harder” sciences, with psychology dangling somewhere in between. in the appendix, we provide a somewhat more complex example, showcasing an attempt to clarify the meaning of a short theory paragraph from a research paper.

nomenclature issues. psychology in particular has long suffered — and continues to suffer — from significant jingle- and jangle-problems (block, 1995): to this day, psychologists often use different words to denote the same thing, or the same words to denote different things. both practices are at odds with scientific ideals of efficiency and parsimony. vast may be used to help improve on the present situation quite a bit by making visible (at a glance) (a) which terms are used, (b) by whom, (c) to denote what (including the relationships among the concepts that the terms refer to). the corresponding displays will almost certainly involve naming and implication relationships as well as the perspective element. for example: is the thing that is called “narcissism” by author a the same as the thing that is called “narcissism” by author b (e.g., in terms of its assumed or shown relationships with other concepts)? to what extent are “arrogance”, “dominance”, and “self-enhancement” just different words for the same thing, and is that the same thing that is also called “narcissism” by some scientists? and so on.

facilitating scientific discourse. through its perspective element, vast analyses may very well be used to account for the inherently social nature of all scientific discourse (oreskes, 2020), as they enable an explicit and comprehensive showcasing of points of convergence or disagreement between the views of different scholars studying the same subject.
we assume that scientific debates may become significantly more efficient when sources of disagreement are made visible and then worked through, one after another, possibly in an iterative fashion. this way, a vast analysis may actually serve as a kind of road-map to help guide the scientific process (e.g., in systematic attempts at forming consensus).

peer review. vast may also be used as a tool in peer review. this seems particularly promising when reviewing a paper from a research field with which the reviewer is not that familiar. in such cases, it may be helpful to first organise the available information in the paper as to (a) how many relevant concepts there are, (b) how they are assumed to relate to one another, (c) how they are named, (d) how they were measured, (e) how well these measurements reflect the assumed relationships between the concepts of interest, and so on.

use as a tool for gathering research data. the use of vast may also be helpful when people’s belief systems (e.g., so-called conspiracy theories) are the research domain of interest. for example, how aware are people of the logical (in-)consistencies within their own worldviews? what happens if you make them aware of existing inconsistencies (e.g., do they add new components post hoc that mitigate them)? which components of people’s belief systems are particularly hard to change (e.g., the ones that are of key importance to several intertwined belief systems)? and: do people find it easier to map arguments they agree with, as compared to arguments with which they do not agree?

explicating the paths from premises to conclusions (and back). vast may be used to derive defensible conclusions from a given set of premises, or to elucidate the ways in which a given perceiver seems to draw conclusions from such premises. likewise, vast may be used to infer the premises upon which some existing set of conclusions was built. often, perspectives may be of particular importance in such analyses.
this is because believing different things to be true (is) or desirable (ought) goes a long way in explaining wildly different conclusions (e.g., in terms of how one should act).

finding common ground. the potential use of vast for working toward consensual viewpoints is not limited to scientists (see above). we hope that vast may just as well be used to enable more traceable and rational conversations among proponents of viewpoints that may seem irreconcilable at first (e.g., regarding abortion, second amendment rights, vaccination etc.). the extent to which this hope is warranted will have to become the subject of future research, however.

teaching critical thinking and the art of argumentation. vast may be used as a teaching tool, helping teachers explain to students the various ways in which concepts may be related to one another, and the important roles that is and ought statements as well as different perspectives play in many arguments. ideally, these things would be taught by way of analysing existing arguments together, or by jointly developing new ones (cf. cullen et al., 2018).

comparison with related tools

vast’s intended domain of use overlaps very significantly with those of many other systems, most prominently directed acyclic graphs (dag; pearl, 1995; pearl and mackenzie, 2018; rohrer, 2018) and structural equation modelling (sem). major points of convergence with these systems are obviously the display of concepts or variables (“nodes”), and the display of relationships between them (“edges”). in sem, numerical coefficients are often used to express the strengths of relationships between variables. a similar route is taken when creating so-called “research maps” summarizing, for a given research field, the theoretical assertions and the evidence speaking for or against them (matiasz et al., 2018). from sem, vast has further borrowed the use of “noise arrows” to symbolise additional, unspecified influences.
diverging from sem conventions, however, vast uses a different default meaning for the absence of arrows between concepts: whereas in sem this usually signals an unrelatedness of variables, in vast it means “unrelated, or not related in ways relevant to the current argument”. this is a somewhat more liberal approach, and more in line with how everyday arguments are structured, according to our experience: when people do not talk about relationships between concepts, this usually means that they see no reason for doing so, but not necessarily that they assume the respective coefficient to be zero. the coverage of vast exceeds that of dags and sems by a wide margin: like dags and sems, vast does cover the (measured or unmeasured) features of objects and the relationships (causation and association) among those features. unlike those other two frameworks, however, vast also covers some more “psychological” relationship types such as naming, conceptual implication and reasoning. for this reason, the default coefficient of relationship strength in vast expresses the covariation of two concepts in terms of percentages of ranges (e.g., how much more will i consider an object to “be a car” if it “has tires”?). vast also goes beyond the aforementioned systems in that it enables an explicit accounting for assumptions as to how much something is and ought to be the case, and for differences between people in regard to such assumptions. all of this is unquestionably of key importance for many everyday arguments. the coverage and methodology of vast overlaps considerably with tools developed in philosophy, such as mindmup (https://maps.simoncullen.org/), reason!able (van gelder, 2002) and others (for an overview of argument visualisation approaches, see okada et al., 2014). these other tools usually enable users to zoom in on any parts of a verbal argument and deduce the logical relationships among them.
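the percentage-of-ranges coefficient mentioned above (the car/tires example) can be illustrated numerically. a sketch under our own reading of that metric — the change in the target concept’s level, on its 0-to-1 range, when the source concept moves across its full range; the 0.25/0.75 membership values are invented, and this is an interpretation, not an official vast formula:

```python
def range_coefficient(level_without: float, level_with: float) -> float:
    """one possible reading of vast's default strength metric: the shift in the
    target concept's level (both levels on a 0..1 range) produced by moving the
    source concept from absent to fully present."""
    return level_with - level_without

# invented example: "is a car" judged at 0.25 without tires, 0.75 with tires,
# i.e. a covariation of 50% of the target concept's range
coefficient = range_coefficient(0.25, 0.75)
```

the same metric can then be reused across relationship types (causation, implication, reasoning), which is the uniformity the paragraph above emphasises.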
this concerns reasons for drawing certain conclusions as well as objections to doing so. in vast, these are captured using positive or negative type r relationships. many tools account for premises that may lead to certain conclusions either by themselves, or in combination. in vast, these would be distinguished from one another in terms of separate vs. combined (and/or) arrows pointing toward a concept. several tools also afford the possibility of making whole strands of argument (e.g., “x is a reason to believe y”) the subject of further reasoning (e.g., “z is a reason not to believe that x is a reason to believe y”). in vast, this is captured using higher-order concepts. variants of is-statements and quantifications of reasoning strength are also found in some existing tools (e.g., reason!able). however, a major difference between these tools and vast is that the former deal exclusively with relationships of the reasoning (r) type, whereas the latter also accounts for many other possible types of relationships between concepts, while still capturing their strengths with the same (default) metric.

limitations and outlook

at this early stage in the development of vast, it is difficult to predict how eagerly it will be picked up and eventually used by others. we have spent significant amounts of time over the course of approximately three years developing, testing, revising and refining the system through numerous iterations, trying to make it work with analyses of diverse sets of examples both from within science and outside of it. we are convinced that the current version does work reasonably well, but we certainly expect additional improvements in the future. to facilitate these, we encourage our readers to give the system a try, to put it to use on whatever arguments they find interesting, and to let us know about their experiences.
news media articles, statements by political agents (e.g., parties or office-holders), court sentences, advertisements, and of course science texts are all fair game. based on our own experiences, we predict that, like us, most readers will find this type of analysis intellectually challenging. we hope that the substantial effort that tends to be associated with specifying argument structures this way will not deter people from trying. vast analyses may be real eye-openers in regard to the, well, vast level of complexity that does permeate many arguments but tends to be overlooked when sticking to purely narrative ways of formulating them. in this regard, a reviewer (julia rohrer) alerted us to a potential conflict of interest, especially for scientists who consider using the system: given the current incentive structure in academia, there may be good (i.e., rational) reasons to avoid greater levels of theory specification. in fact, low levels of specification may be selected for because using them is likely to involve a lower risk of being proven wrong. ill-specified theories will also be less likely to incur strong negative reactions from reviewers and may be easier to “sell” to the public. all of this is true and may in fact explain part of why weak theorizing is so persistent in psychology to this day. we do consider it unlikely that the present paper will incite a mass movement of theory specification enthusiasts. however, based on our own experience with the tool, we do believe that for those psychologists who already are genuinely interested in improving the specificity of their (and others’) theorizing, using vast is definitely worth a try. considerably greater clarity is usually achieved. at present, vast merely consists of a set of conceptual distinctions and related rules for how they should be visualized.
as this is the core of the system, complete and satisfactory vast analyses are already possible using any standard graphics tool, or even just paper and pencil. however, our ultimate goal is to implement the system as a free web resource that will be capable of (a) developing vast displays by asking users the right questions and (b) checking any given display for consistency and completeness.

appendix: a more complex example

the task at hand is to clarify the meaning of the following theory paragraph from the paper by theves et al. (2020), using vast. such a clarification attempt may — and usually does — lead to an identification of areas of underspecification, ambiguity or even contradiction. note, however, that this particular paragraph was picked for no other reason than being a relatively typical example of narrative theorizing in psychology (and because it deals with the subject of “concepts”, which play a key role in vast). we do not consider this a particularly problematic case, but rather just use it to showcase a typical application of vast. thousands of other paragraphs may have been used just as well. the paragraph from the theves et al. (2020) paper (page 7318) goes like this:

"concepts are organizing structures that define how contents are related to each other and can be used to transfer meaning to novel input (smith & medin, 1981; kemp, 2010). their formation thus inherently depends on generalization over, and integration of experiences. thus, a role of the hippocampus in generalization seemed considerable due to its roles in binding elements into spatial and episodic context (davachi et al., 2003; komorowski et al., 2013; davachi, 2006; ranganath, 2010) as well as integration of information over episodes (schlichting et al., 2015; davis et al., 2012; collin et al., 2015; milivojevitch et al., 2015 [...]"

figure 13 shows a vast display created by daniel (analyst) to reflect the paragraph above.
more specifically, it shows what this analyst thinks the authors (perspective) of the paragraph are saying (is), which is everything inside the largest frame. note that, for simplicity, a minor portion at the end of the narrative paragraph was excluded, as indicated by the use of brackets at the end (see above). daniel used the default mode of vast in which all concepts and their names are separated from one another. he identified 8 relevant concepts to be accounted for. five of these (c1 to c5) reflect key sentences from the paragraph, as shown in the lower part of the figure where the respective naming relationships are listed. note that some minor inferences and modifications were necessary to enable each concept name to speak for itself. the other three concepts (l1 to l3) represent the references to the literature that also feature in the paragraph. these are empirical in nature, as indicated by the use of thick black edges on the respective frames. they are also assumed to be given, as indicated by the use of is pentagons. in this first vast display, daniel focuses on what he thinks is the reasoning structure in the paragraph. his use of vast’s default mode along with his decision to set the naming relationships aside allows us to fully concentrate on this reasoning structure, without being distracted by the actual content of specific concepts. there is no need for indexing the naming relationships in this context, because they will not be individually discussed. daniel identified seven relevant relationships of the reasoning type (r1 to r7): because l1 is given (is), we may believe c1 to be the case, via r1. this is how daniel interprets the first citation. if c1 is the case, we may also believe c2 to be the case, via r2. this is how daniel interprets the first “thus”. if all of this is the case, we may also believe c3 to be the case, via r3. this is daniel’s interpretation of the second “thus”. 
however, this (r3) is only true if c4 and c5 are also the case, via r4 and r5. this latter conditioning is how daniel interprets the “due to” in the paragraph. furthermore, l2 is given, which is a reason (via r6; reflecting the second citation) to believe c4 to be the case. finally, l3 is also given, which is a reason (via r7; reflecting the third citation) to believe c5 to be the case. note that the analyst had to make numerous auxiliary assumptions about things that were not entirely clear in the narrative paragraph itself.

[figure 13: reasoning structure in a theory paragraph by theves et al. (2020), according to daniel (analyst).]

for example, the analyst treated all of the papers supposedly supporting a given proposition as a unified whole (i.e., a single concept). alternatively, he could have used a separate concept for each paper, which would have meant that each paper by itself supports the respective proposition. also, daniel treated r4 and r5 as two entirely separate paths by which r3 is supported.
alternatively, he could have interpreted the “as well as” in the respective sentence to mean that only the combination of c4 and c5 is a reason to believe c3, which he would then have had to express using an and diamond. needless to say, each of these additional assumptions, and any other element in the display, may be challenged at any time — but only if they are made explicit, which is the whole point of vast analyses. by making specification gaps and ambiguities explicit this way, a vast analysis may eventually help drive theory development. as a next step, daniel decides to zoom in a bit more on two of the key concepts in the paragraph. the first of these is c1, which he interprets to be mostly about conceptual implications. this is how he interprets the word “are” in this concept’s name. figure 14 shows the outcome of this zooming-in: daniel believes the content of c1 and the entire content of the largest frame underneath it to be interchangeable. to him, these two concepts have the exact same meaning, as indicated by his using a bidirectional implication arrow with coefficient 1 (signalling identity). what daniel does here is specify the meaning of a higher-order concept (c1) by breaking it down into four components (c6 to c9) and some relevant relationships (i2 to i4) between them: objects that are considered exemplars of c6 (named “concepts”) are also considered exemplars of c7, c8 and c9. note that these latter implications use unidirectional arrows only. note further that all of this is expressly daniel’s opinion, and not necessarily shared by the authors of the paragraph. this is because only daniel is named as the analyst in the byline, but the original authors of the paragraph do not appear as perspective-holders anywhere in the figure. the second concept that the analyst chooses to zoom in on is c2. figure 15 showcases the result.
as in the previous analysis, daniel assumes that the content of c2 and the content of the largest frame underneath it are mutually interchangeable (i5).

[figure 14: “zooming in” on concept c1 (ignoring citations).]

again, he explicates the meaning of a concept (c2) by breaking it down into a few subordinate concepts (c10, c11, and c12) and some relationships among them (this time of the c type, which is how daniel interprets the word "depends" in the text). daniel’s use of an and diamond signals that c12 may only be the case if c10 and c11 are both given, but not if only c10 or only c11 is given. this is how he interprets the word "and" in the text. furthermore, the c = 0 influence on c12 makes it clear that there are no other causal pathways by which c12 may come about. this is how daniel interprets the word “inherently" in the text. however, the question mark above the third arrow pointing toward the and diamond signals that daniel is not sure whether another influence is needed to bring about c12. this is because the name of concept c2 only seems to say that c10 and c11 are necessary for c12 to happen, but not that they are sufficient. here, the analysis points to a need for greater specification. finally, a brief word on the sets of objects that the concepts in the three figures pertain to. these are obviously not the same. in figure 13, the objects are conceivable realities: the one that is described as given (in which certain rules apply and certain research papers exist), and possible alternative ones. in figure 14, the objects are rather ill-defined.
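daniel’s reading of the and-diamond in figure 15, as described above, can be expressed as a truth function. a sketch under his interpretation (the optional third argument mirrors the question mark, i.e. the possibly needed further influence; the function name is ours):

```python
def c12_occurs(c10: bool, c11: bool, unknown_extra: bool = True) -> bool:
    """figure 15 under daniel's reading: the and-diamond plus the c = 0 arrow
    mean there is no other causal pathway to c12, so c12 requires both c10 and
    c11 - and possibly a further, unidentified influence (the question mark);
    whether that extra factor is truly needed is left unspecified in the text."""
    return c10 and c11 and unknown_extra
```

setting `unknown_extra` to its default of `True` treats c10 and c11 as jointly sufficient; passing `False` models the scenario in which the unidentified extra influence is required but absent.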
it may be useful to think of them as being various kinds of mental phenomena here (e.g., perceptions, memories etc.). in figure 15, the objects may be thought of as variants of people’s developmental trajectories: only in those in which c10 and c11 (and maybe something else, see question mark) take place, will we also see c12 happening. for simplicity, we did not explicitly account for the different sets of objects in figures 13, 14, and 15. “[concepts’] formation […] inherently depends on generalization over, and integration of experiences.” analyst: daniel i5 1 “[concepts’] formation” “generalization over [experiences]” “integration of experiences” c2 n and c11 c10 c12 n n n c 1 c 0 ? figure 15 “zooming-in” in on concept c2 (ignoring citations) author contact corresponding author: daniel leising (orcid: 0000-00001-8503-5840), technische universität dresden, daniel.leising@tu-dresden.de conflict of interest and funding the authors state that they have no conflict of interest to declare. this project was not connected to any particular source of third-party funding. author contributions daniel leising: conceptualization, methodology, visualization, writing — original draft, writing — review & editing; oliver grenke: conceptualization, methodology, visualization, writing — review & editing; marcos cramer: conceptualization, methodology, visualization, writing — review & editing. acknowledgments the authors would like to thank moritz leistner for help with layouting this paper. open science practices this article is theoretical and as such is not eligible for open science badges. the entire editorial process, including the open reviews, is published in the online supplement. 18 references baroni, p., gabbay, d., giacomin, m., & van der torre, l. (2018). handbook of formal argumentation. college publications. block, j. (1995). a contrarian view of the five-factor approach to personality description. psychological bulletin, 117, 187–215. https : / / doi . org / 10 . 
meta-psychology, 2021, vol 5, mp.2020.2539 https://doi.org/10.15626/mp.2020.2539 article type: commentary published under the cc-by4.0 license open data: not applicable open materials: not applicable open and reproducible analysis: not applicable open reviews and editorial process: yes preregistration: no edited by: rickard carlsson reviewed by: brand, c.o., martin, s.r. analysis reproduced by: not applicable all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/jgxk7 comparing bayesian posterior passing with meta-analysis joshua pritsker purdue university abstract brand, von der post, ounsley, and morgan (2019) introduced bayesian posterior passing as an alternative to traditional meta-analyses. in this commentary i relate their procedure to traditional meta-analysis, showing that posterior passing is equivalent to fixed-effects meta-analysis. to overcome the limitations of simple posterior passing, i introduce improved posterior passing methods that account for heterogeneity and publication bias. additionally, practical limitations of posterior passing and the role that it can play in future research are discussed. keywords: bayesian updating, posterior passing, meta-analysis introduction the ability to accumulate evidence across studies is often said to be a great advantage of bayesian inference.
this point was recently discussed by brand, von der post, ounsley, and morgan (2019), who suggested that bayesian posterior passing alleviates the need for traditional meta-analysis. they performed a simulation study comparing posterior passing to non-cumulative analysis and to a combined analysis of the data from all studies. however, they did not provide a formal theoretical comparison of posterior passing to traditional methods of meta-analysis. this commentary relates posterior passing to traditional meta-analysis, allowing one to assess the performance of posterior passing on the basis of how traditional meta-analytic methods are known to perform. to avoid some of the pitfalls of posterior passing, i suggest improved procedures that account for heterogeneity and publication bias. i address brand et al.'s (2019) suggestion that posterior passing avoids some of the problems of traditional meta-analyses, and discuss practical limitations in using it as a replacement for traditional meta-analysis. posterior passing as meta-analysis as brand et al. (2019) suggest that posterior passing may replace traditional meta-analyses, one might wonder how posterior passing relates to them. given that meta-analyses are typically based on proxy statistics and their standard errors instead of individual data points, one might consider posterior distributions generated the same way. we can do this by using the likelihood of a statistic instead of the likelihood of the full data. for instance, supposing that we have a parameter θ that we want to make inferences about, we might construct the likelihood function using the sampling distribution of an estimate of θ. in standard meta-analysis, studies are typically summarized by their estimates and standard errors. if an estimate was derived by maximizing a likelihood that takes the form of a gaussian function over θ, the estimate and its standard error fully summarize this likelihood function.
even in cases where the likelihood function is not fully described by its maximum likelihood estimate and standard error, using the estimate and standard error may be viewed as a second-order asymptotic approximation to the log-likelihood. taking this approach, with $\hat\theta_i$ being the estimate of θ from study i, we can get the posterior after study i, denoted $f_i(\theta)$, by:

$$f_i(\theta) \propto \pi(\theta) \, f(x_i \mid \theta) \quad (1a)$$
$$= f_{i-1}(\theta) \, f(\hat\theta_i \mid \theta, se_i^2) \quad (1b)$$

where $\pi(\cdot)$ is our prior distribution, $x_i$ refers to all the information of study i, and $se_i$ is the standard error of $\hat\theta_i$. typically, $f(\hat\theta_i \mid \theta, se_i^2) = \phi(\hat\theta_i \mid \theta, se_i^2)$, where $\phi(\hat\theta_i \mid \theta, se_i^2)$ is the density at $\hat\theta_i$ under a normal distribution with a mean of θ and a variance of $se_i^2$. now, we may expand (1) across studies:

$$f(\theta) \propto \pi_0(\theta) \prod_{i=1}^{k} \phi(\hat\theta_i \mid \theta, se_i^2) \quad (2)$$

where $\pi_0(\theta)$ is the prior distribution used by the initial study. hence, the posterior is proportional to the initial prior times the product of the likelihoods of the relevant studies. how does this compare to the posterior distribution given by a traditional meta-analysis? in a fixed-effects setting (3), they are identical, provided that the meta-analysis uses the same prior as the initial study. the standard fixed-effects model is that each $\hat\theta_i$ follows a normal distribution with a universal mean value of θ and a variance of $se_i^2$:

$$\hat\theta_i \sim N(\theta, se_i^2) \quad (3)$$

then, the posterior density can be constructed by taking the product of the likelihoods of each estimate:

$$f_{fixed}(\theta \mid x) \propto \pi(\theta) \prod_{i=1}^{k} \phi(\hat\theta_i \mid \theta, se_i^2) \quad (4)$$

this equivalence also answers some of the questions brought up by brand et al. (2019), such as how posterior passing would perform under publication bias. now, it is clear that we can use existing studies on traditional fixed-effects meta-analyses to answer these questions (e.g., simonsohn et al., 2014).
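the equivalence of (2) and (4) can be checked numerically. the sketch below (plain python, with made-up estimates and standard errors) runs posterior passing as sequential conjugate normal updates under a near-flat initial prior, and compares the result to the usual inverse-variance-weighted fixed-effects pooling:

```python
def update(prior_mean, prior_var, est, se):
    # one posterior passing step: conjugate normal update with
    # likelihood phi(est | theta, se^2), as in equation (1b)
    w0, w1 = 1.0 / prior_var, 1.0 / se ** 2
    post_var = 1.0 / (w0 + w1)
    post_mean = post_var * (w0 * prior_mean + w1 * est)
    return post_mean, post_var

# hypothetical study estimates theta_hat_i and standard errors se_i
ests = [0.30, 0.45, 0.25]
ses = [0.10, 0.15, 0.08]

# posterior passing: each study's posterior becomes the next study's prior
mean, var = 0.0, 1e6  # near-flat initial prior pi_0
for est, se in zip(ests, ses):
    mean, var = update(mean, var, est, se)

# fixed-effects meta-analysis: inverse-variance weighted pooling
w = [1.0 / se ** 2 for se in ses]
fixed_mean = sum(wi * e for wi, e in zip(w, ests)) / sum(w)
fixed_var = 1.0 / sum(w)
```

up to the negligible influence of the near-flat prior, the two posteriors have the same mean and variance.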
in a random-effects setting, we assume that each study samples from a slightly different population, and these populations have their own parameter values, referred to as $\mu_i$, which vary around the true θ value:

$$\mu_i \sim N(\theta, \tau^2) \quad (5a)$$
$$\hat\theta_i \sim N(\mu_i, se_i^2) \quad (5b)$$

then, the likelihood for each estimate is given by marginalizing out $\mu_i$:

$$f_{random}(\theta \mid x) \propto \pi(\theta, \tau) \prod_{i=1}^{k} \int_{\mu} \phi(\hat\theta_i \mid \mu_i, se_i^2) \, \phi(\mu_i \mid \theta, \tau^2) \, d\mu_i \quad (6a)$$
$$= \pi(\theta, \tau) \prod_{i=1}^{k} \phi(\hat\theta_i \mid \theta, se_i^2 + \tau^2) \quad (6b)$$

where $\mu_i$ is the mean of the population that study i comes from, $\tau^2$ is the variance across populations, and the prior distribution is now over both θ and τ. hence, in a random-effects setting, posterior passing, like the fixed-effects posterior (4), will underestimate the posterior variance. this point was previously noted by martin (2017). a random-effects model is typically preferable, as the fixed-effects assumption of no between-study variance is implausible in most settings (borenstein et al., 2010). the use of the full-data posterior by brand et al. (2019) is inconsequential to this result. thus, although posterior passing will produce consistent point estimates, the posterior variance may be underestimated. how can we improve posterior passing? incorporating random effects an obvious question at this point is whether we can modify posterior passing to incorporate random effects. a bayesian solution is to update τ along with θ, modeling their joint posterior distribution. using statistic likelihoods as in the previous section, we can derive the joint posterior update by marginalizing out $\mu_i$, just as in a standard random-effects meta-analysis:

$$f_i(\theta, \tau) \propto \pi(\theta, \tau) \, f(x_i \mid \theta, \tau) \quad (7a)$$
$$= f_{i-1}(\theta, \tau) \, f(x_i \mid \theta, \tau) \quad (7b)$$
$$= f_{i-1}(\theta, \tau) \int_{\mu} f(x_i \mid \mu_i) \, \phi(\mu_i \mid \theta, \tau^2) \, d\mu_i \quad (7c)$$
$$= f_{i-1}(\theta, \tau) \, \phi(\hat\theta_i \mid \theta, se_i^2 + \tau^2) \quad (7d)$$

to get the marginal posterior distribution of θ, one may integrate out τ, and vice versa. notably, the parameter τ must be tracked across studies to gain evidence about its value.
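the marginalization step from (6a) to (6b) can likewise be verified numerically. the sketch below (hypothetical values for θ, τ², se², and an estimate) integrates the product of the two normal densities over µ and compares the result to a single normal density with the variances added:

```python
import math

def phi(x, mean, var):
    # univariate normal density
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# hypothetical values for theta, tau^2, se_i^2 and an estimate theta_hat_i
theta, tau2, se2, est = 0.3, 0.04, 0.01, 0.5

# left side of (6): integrate phi(est | mu, se2) * phi(mu | theta, tau2) over mu
n, lo, hi = 40001, theta - 5.0, theta + 5.0
step = (hi - lo) / (n - 1)
marginal = sum(
    phi(est, lo + k * step, se2) * phi(lo + k * step, theta, tau2) * step
    for k in range(n)
)

# right side of (6): a single normal with the variances added
direct = phi(est, theta, se2 + tau2)
```

the two quantities agree to within the integration error of the grid.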
however, even if previous studies had not used a random-effects model, they can still be added using their likelihood functions:

$$f_i(\theta, \tau) \propto \pi(\theta, \tau) \, \phi(\hat\theta_{i-1} \mid \theta, se_{i-1}^2 + \tau^2) \, \phi(\hat\theta_i \mid \theta, se_i^2 + \tau^2) \quad (8)$$

where study i − 1 is the study that had not used a random-effects model. in the case of the current study being the first on a topic, no evidence will be gained about the value of τ, but incorporating it using only the information from a subjective prior distribution will nonetheless give a more realistic view of the uncertainty in $f_i(\theta)$. addressing publication bias a second major issue with posterior passing is that brand et al. (2019) provide no way to address publication bias. without adjustment, posterior passing will perform identically to a fixed-effects meta-analysis that completely ignores publication bias. unadjusted meta-analyses are known to perform poorly when publication bias is substantial, yielding potentially misleading results (simonsohn et al., 2014). this problem can be viewed as one of biased sampling, so the likelihood function is given by a weighted distribution as follows (pfeffermann et al., 1998):

$$f(x_i \mid \mathrm{published}_i) = \frac{E[p \mid x_i] \, f(x_i)}{E[p]} = \frac{E[p \mid x_i]}{\int_x E[p \mid x_i] \, f(x_i) \, dx_i} \, f(x_i) \quad (9)$$

where p denotes the probability of publication. now, we need to model $E[p \mid x_i]$. considering that studies are typically given a dichotomous interpretation (mcshane & gal, 2017), a realistic option is a simple step function:

$$E[p \mid x_i] = \begin{cases} \alpha & s_i(\hat\theta_i) = 0 \\ \beta & s_i(\hat\theta_i) = 1 \end{cases} \quad (10)$$

where $s_i(\hat\theta_i)$ is a function that gives the standard interpretation of $\hat\theta_i$ in a pass/fail manner, such as its p-value being below 0.05. however, it is unclear whether dichotomization exists to the same extent in bayesian studies as it does in frequentist studies. in such a context one may replace (10) with a smoother model, such as a logistic one:

$$E[p \mid x_i] = \mathrm{logistic}\left(\alpha + \beta s_i(\hat\theta_i)\right) \quad (11)$$

where $s_i(\hat\theta_i)$ could represent a bayes factor cutoff or similar.
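as an illustration of the weighted-distribution likelihood in (9) with the step function (10), the sketch below (hypothetical α, β, θ, and se) computes the normalizer e[p] by numerical integration and confirms that the bias-adjusted density still integrates to one while up-weighting "significant" estimates:

```python
import math

def phi(x, mean, var):
    # univariate normal density
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# hypothetical selection model: "significant" estimates (|x/se| > 1.96)
# are published with probability beta, the rest with probability alpha
alpha, beta = 0.2, 0.9
theta, se = 0.1, 0.15

def pub_prob(x):
    # step-function E[p | x], equation (10)
    return beta if abs(x / se) > 1.96 else alpha

# normalizer E[p]: integral of pub_prob(x) * f(x) over x, as in (9)
n, lo, hi = 40001, theta - 10 * se, theta + 10 * se
step = (hi - lo) / (n - 1)
grid = [lo + k * step for k in range(n)]
norm = sum(pub_prob(x) * phi(x, theta, se ** 2) * step for x in grid)

def published_density(x):
    # weighted-distribution likelihood f(x | published), equation (9)
    return pub_prob(x) * phi(x, theta, se ** 2) / norm

# the bias-adjusted density integrates to one, and significant
# results are over-represented relative to the unadjusted density
total = sum(published_density(x) * step for x in grid)
```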
for a review of other models that have been suggested, see sutton, song, gilbody, and abrams (2000). in any case, the posterior update is now as follows:

$$f_i(\theta, \tau, \alpha, \beta) \propto \pi(\theta, \tau, \alpha, \beta) \, f(x_i \mid \theta, \tau, \alpha, \beta) \quad (12a)$$
$$\propto \pi(\theta, \tau, \alpha, \beta) \, \frac{E[p \mid x_i, \theta, \tau, \alpha, \beta]}{E[p \mid \theta, \tau, \alpha, \beta]} \, f(x_i \mid \theta, \tau) \quad (12b)$$
$$= f_{i-1}(\theta, \tau, \alpha, \beta) \, \frac{E[p \mid x_i, \theta, \tau, \alpha, \beta]}{E[p \mid \theta, \tau, \alpha, \beta]} \int_{\mu} f(x_i \mid \mu_i) \, \phi(\mu_i \mid \theta, \tau^2) \, d\mu_i \quad (12c)$$
$$= f_{i-1}(\theta, \tau, \alpha, \beta) \, \frac{E[p \mid x_i, \theta, \tau, \alpha, \beta]}{E[p \mid \theta, \tau, \alpha, \beta]} \, \phi(\hat\theta_i \mid \theta, se_i^2 + \tau^2) \quad (12d)$$

as with τ, multiple studies are needed to identify α and β. hence, inference in early studies will be highly dependent on the prior distribution for these parameters, which should not be uninformative. however, this problem dissipates as evidence about these parameters accumulates. including studies outside of the posterior passing chain in meta-analysis, extensive literature searches are conducted to avoid systematically excluding any studies. however, it may not be obvious how one can incorporate studies outside of a posterior passing 'chain' (a sequence of studies where each study's prior is equal to the previous study's posterior) into our prior distribution. the same problem occurs if two studies are done simultaneously, creating a fork in the chain. brand et al. (2019) suggest that a normal prior with variance representing our uncertainty and with a mean at the estimate given by the study may be used. however, it can be difficult to determine such a variance, and this procedure only makes sense if ours is the first study in a chain aiming to get information from an unchained study. a better answer is to include the study's likelihood function in our posterior derivation. when there are simultaneous studies, one can simply take one study's posterior as the prior and use the likelihood from the other as if it were outside of the chain altogether.
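folding an unchained or simultaneous study into the posterior is just another multiplication by its likelihood, so the order in which studies are incorporated does not matter. the sketch below (made-up estimates, normal likelihoods throughout) verifies this commutativity:

```python
def update(prior_mean, prior_var, est, se):
    # conjugate normal update with likelihood phi(est | theta, se^2)
    w0, w1 = 1.0 / prior_var, 1.0 / se ** 2
    v = 1.0 / (w0 + w1)
    return v * (w0 * prior_mean + w1 * est), v

# a chained study and an unchained (or simultaneous) study,
# each summarized by a hypothetical (estimate, standard error) pair
chained = (0.40, 0.10)
unchained = (0.20, 0.12)

prior_mean, prior_var = 0.0, 100.0  # weak initial prior

# fold the studies in either order: the posterior is the same,
# so a fork in the chain can be merged by multiplying likelihoods
m1, v1 = update(*update(prior_mean, prior_var, *chained), *unchained)
m2, v2 = update(*update(prior_mean, prior_var, *unchained), *chained)
```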
switching back to the simple fixed-effects procedure for conciseness, this gives us:

$$f_i(\theta) \propto \pi(\theta) \, f(x_{i-1} \mid \theta) \, f(x_i \mid \theta) \quad (13)$$

to include more studies, one simply needs to add more $f(x_n \mid \theta)$ factors. as a side note, the nature of this function yields an obvious option for bayesian-style updating in a non-bayesian framework:

$$L_i(\theta) \propto f(x_1, \ldots, x_{i-1} \mid \theta) \, f(x_i \mid \theta) \quad (14)$$

this may simply be interpreted as the likelihood function for θ across all the included studies. inferences can then be made using standard frequentist sequential trial methods (cf. wetterslev et al., 2017). further comments and discussion is posterior passing a practical replacement for meta-analysis? brand et al. (2019) suggest that posterior passing can replace traditional meta-analyses. indeed, the improved posterior passing procedures introduced in the previous section can compete with traditional meta-analyses, but is the approach practical? to match the quality of traditional meta-analyses, one would have to meet the same conditions, including an extensive literature search and the inclusion of all available studies. furthermore, most studies are currently done in a frequentist manner, and splits can occur in a chain due to simultaneous studies, so in practice each study in the chain would have to do its own mini meta-analysis. at that point, it might be preferable to just do traditional meta-analyses. brand et al. (2019) also suggest that posterior passing can solve the problem of conflicting meta-analyses by updating evidence in real time. meta-analyses can provide contradictory results for a number of reasons, such as differing inclusion criteria and differing methods. however, posterior passing only appears able to mitigate differences that arise from meta-analyses being conducted at different times, not those that arise from different methods. in that case, though, one would simply go with the most recent meta-analysis.
however, when multiple meta-analyses disagree, they often differ in statistical methodology or inclusion criteria. replacing meta-analyses with posterior passing would likely result in multiple posterior passing chains that reflect this disagreement. perhaps this might be avoided if fields are sufficiently vigilant in preventing conflicting chains, but the same would be true of traditional meta-analytic conflicts. in fact, for any criticism of a meta-analysis, one could seemingly make an equivalent criticism of a posterior passing chain. instead of having disagreeing meta-analyses, we would simply have a number of individual studies in disagreement. it may be that posterior passing could help mitigate disagreements of this nature because inclusion criteria could change with any study in the chain. however, having such fuzzy inclusion criteria is clearly undesirable, as it would lead to results with an unclear interpretation. hence, posterior passing does not appear to avoid the problem of conflicting meta-analyses in a desirable manner. the same applies to all methodological limitations of standard meta-analyses, such as susceptibility to publication bias, as posterior passing and meta-analysis are mathematically equivalent. alternative roles for posterior passing even if posterior passing cannot generally replace traditional meta-analyses, it may nonetheless be useful. with the improvements suggested above, posterior passing can replace traditional meta-analysis in areas where meta-analyses are unlikely to produce conflicting results in the first place. an alternative to creating posterior passing chains that still utilizes the posterior passing mechanism is to use it within meta-analyses. this does not solve the issue of conflicting meta-analyses, but it has practical advantages.
by using the posterior distribution of the last similar meta-analysis as a prior distribution, meta-analyses can be performed in chunks instead of having to be redone in full with each update. similarly, instead of using posterior passing chains, studies can use posterior distributions from meta-analyses as their priors to get accurate net effect estimates within each study. this allows for broader conclusions than would otherwise be warranted by the study alone. a particularly relevant case for this is large-scale replication projects, where the prior can be obtained from the last meta-analysis. this yields a readily interpretable lower bound on the extent to which a field's view on a topic should shift as the result of the replication effort, by providing the change in posterior under the assumption that previous studies had been conducted properly. hence, although posterior passing may have problems as a replacement for meta-analysis, it can have utility regardless. author contact the corresponding author may be contacted at jpritsk@purdue.edu, orcid 0000-0001-9647-6684. conflict of interest and funding no conflicts of interest declared. author contributions pritsker is the sole author of this article. open science practices this article is a commentary and had no data or materials to share, and it was not pre-registered. the entire editorial process, including the open reviews, is published in the online supplement. references borenstein, m., hedges, l. v., higgins, j. p., & rothstein, h. r. (2010). a basic introduction to fixed-effect and random-effects models for meta-analysis. research synthesis methods, 1, 97–111. https://doi.org/10.1002/jrsm.12 brand, c. o., ounsley, j. p., van der post, d. j., & morgan, t. j. h. (2019). cumulative science via bayesian posterior passing. meta-psychology, 3. https://doi.org/10.15626/mp.2017.840 martin, s. (2017). open peer review by stephen martin.
meta-psychology: decision letter for brand et al. https://doi.org/10.17605/osf.io/c4wn8 mcshane, b. b., & gal, d. (2017). statistical significance and the dichotomization of evidence. journal of the american statistical association, 112, 885–895. https://doi.org/10.1080/01621459.2017.1289846 pfeffermann, d., krieger, a. m., & rinott, y. (1998). parametric distributions of complex survey data under informative probability sampling. statistica sinica, 8, 1087–1114. simonsohn, u., nelson, l. d., & simmons, j. p. (2014). p-curve and effect size: correcting for publication bias using only significant results. perspectives on psychological science, 9, 666–681. https://doi.org/10.1177/1745691614553988 sutton, a. j., song, f., gilbody, s. m., & abrams, k. r. (2000). modelling publication bias in meta-analysis: a review. statistical methods in medical research, 9, 421–445. https://doi.org/10.1177/096228020000900503 wetterslev, j., jakobsen, j. c., & gluud, c. (2017). trial sequential analysis in systematic reviews with meta-analysis. bmc medical research methodology, 17. https://doi.org/10.1186/s12874-017-0315-7
meta-psychology, 2019, vol 3, mp.2018.892 https://doi.org/10.15626/mp.2018.892 article type: original article published under the cc-by4.0 license open data and materials: n/a open and reproducible analysis: n/a open reviews and editorial process: yes preregistration: n/a edited by: rickard carlsson reviewed by: nick brown, jack davis, nicholas a coles all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/ps5ru computational reproducibility via containers in psychology april clyburne-sherin independent consultant xu fei code ocean seth ariel green code ocean abstract scientific progress relies on the replication and reuse of research. recent studies suggest, however, that sharing code and data does not suffice for computational reproducibility, defined as the ability of researchers to reproduce "particular analysis outcomes from the same data set using the same code and software" (fidler and wilcox, 2018). to date, creating long-term computationally reproducible code has been technically challenging and time-consuming. this tutorial introduces code ocean, a cloud-based computational reproducibility platform that attempts to solve these problems. it does this by adapting software engineering tools, such as docker, for easier use by scientists and scientific audiences. in this article, we first outline arguments for the importance of computational reproducibility, as well as some reasons why this is a nontrivial problem for researchers. we then provide a step-by-step guide to getting started with containers in research using code ocean. (disclaimer: the authors all worked for code ocean at the time of this article's writing.)
keywords: computational reproducibility, social psychology, containers, docker, code ocean introduction: the need for computational reproducibility stodden (2014) distinguishes between three forms of reproducibility: statistical, empirical, and computational. in psychology, statistical reproducibility, encompassing transparency about analytic choices and strategies, has received sustained attention (simmons, nelson, and simonsohn, 2011; grange et al., 2018; gelman and loken, 2014; morey and lakens, 2016). likewise, empirical reproducibility, providing enough information about procedures to enable high-fidelity independent replication, has been a high-profile issue in light of work by the center for open science (open science collaboration, 2015; nosek and lakens, 2014). computational reproducibility, by contrast, has been less of a focus. kitzes (2017) describes a research project as being "computationally reproducible"1 when "a second investigator (including you in the future) can recreate the final reported results of the project, including key quantitative findings, tables, and figures, given only a set of files and written instructions."2 [footnote 1: note that we use the term 'reproduction' to refer to recreating given results using given data, and replication to refer to analyzing new data. this is broadly in line with the definitions used by, among others, peng (2011), claerbout (2011), and donoho, maleki, rahman, shahram, and stodden (2008), but some disciplines use the terms differently; for an overview, see marwick, rokem, and staneva (2017) and barba (2018).] [footnote 2: fidler and wilcox (2018) distinguish between two senses of the term: "direct (reproducing particular analysis outcomes from the same data set using the same code and software)...[and] conceptual (analyzing the same raw data set with alternative approaches, different models or statistical frameworks)." this article primarily concerns direct reproducibility.] computational reproducibility facilitates the accumulation of knowledge by enabling researchers to assess the analytic choices, assumptions, and implementations that led to a set of results; it also enables testing the robustness of methods to alternate specifications. hardwicke et al. (2018) call this form of reproducibility a "minimum level of credibility" (p. 2). moreover, as donoho (2017) argues, preparing one's work for reproducible publication provides "benefits to authors. working from the beginning with a plan for sharing code and data leads to higher quality work, and ensures that authors can access their own former work, and those of their co-authors, students and postdocs" (p. 760). because computations are central to modern research in the social sciences, their reproducibility, or lack thereof, warrants attention within the broader open science movement and the scientific community. many psychology journals (lindsay, 2017; jonas and cesario, 2015) address reproducibility through strong policies on sharing data, code, and materials. the society for personality and social psychology's "task force on publication and research practices" (funder et al., 2014) advises authors to make "available research materials necessary" to reproduce statistical results, and to adhere "to spsp's data sharing policy" (p. 3). the american psychological association's ethics policy (section 8.14) asks that "psychologists do not withhold the data on which their conclusions are based from other competent professionals who seek to verify the substantive claims through reanalysis" (american psychological association, 2012). many journals in the field require that authors sign off on this policy (e.g., cooper, 2013). the challenge of computational reproducibility for two reasons, however, such policies do not suffice for computational reproducibility. first, data and code that are available "upon request" may turn out to be unavailable when actually requested (stodden, seiler, and ma, 2018; wicherts, borsboom, kats, and molenaar, 2006; vanpaemel, vermorgen, deriemaecker, and storms, 2015; wood, müller, and brown, 2018). second, code and data that are publicly available do not necessarily yield the results one sees in the accompanying paper. this is due to a number of technical challenges. dependencies, the packages and libraries that a researcher's code relies on, change over time, often in ways that produce errors (bogart, kästner, and herbsleb, 2015) or change outputs. software versions are not always perfectly recorded (barba, 2016), which makes reconstruction of the original computational environment difficult. while there are many useful guides to best practices for scientific research (wilson et al., 2017; sandve, nekrutenko, taylor, and hovig, 2013), adopting them is an investment of scarce time and attention. more prosaically, differences between scientists' machines can be nontrivial, and memory or storage limitations can halt a reproduction effort (deelman and chervenak, 2008). as a result, publicly available code and data are often not computationally reproducible. an example comes from the journal cognition. following the journal's adoption of a mandatory data sharing policy, hardwicke et al. (2018) attempted to reproduce the results of 35 articles for which they had code and data, and were able to do so, without author assistance, for just 11 papers; a further 11 were reproducible with author assistance, and the remaining 13 were not reproducible "despite author assistance" (p. 3). while the authors are careful to note that these issues do not appear to "seriously impact" original conclusions, nevertheless, "suboptimal data curation, unclear analysis specification, and reporting errors can impede computational reproducibility" (p. 3).3 [footnote 3: the authors also note that "assessments in the "reproducible" category took between 2-4 person hours, and assessments in the "reproducible with author assistance" and "not fully reproducible, despite author assistance" categories took between 5-25 person hours" (p. 28).] rates of reproducibility appear similar in other disciplines. at the quarterly journal of political science, editors found that from "september 2012 to november 2015. . . 14 of the 24 empirical papers subject to in-house review were found to have discrepancies between the results generated by authors' own code and those in their written manuscripts" (eubank, 2016, p. 273). in sociology, after working closely with authors, liu and salganik (2019) were able to reproduce the results of seven of 12 papers for a special issue of the journal socius. in development economics, wood et al. (2018) looked at 109 papers and found only 29 to be "push button replicable" (the authors' synonym for computationally reproducible). in general, how much information suffices for reproduction becomes clear only when it is attempted. literate programming, a valuable paradigm for documentation and explanation, does not necessarily address these issues. woodbridge (2017) recounts attempting to identify a sample of jupyter notebooks (kluyver et al., 2016) mentioned in pubmed central, thinking that reproduction "would simply involve searching the text of each article for a notebook reference, then downloading and executing it. . . it turned out that this was hopelessly naive." dependencies were frequently unmentioned and were not always included with notebooks; troubleshooting language- and tool-specific issues required expertise and hindered portability; and notebooks would often "assume the availability of non-python software being available on the local system," but such software "may not be freely available." in sum, as silver (2017) notes, lab-built tools rarely come ready to run. . . much of the software requires additional tools and libraries, which the user may not have installed.
even if users can get the software to work, differences in computational environments, such as the installed versions of the tools it depends on, can subtly alter performance, affecting reproducibility. (p. 173)

a welcome development: containers

meanwhile, tools designed by engineers to share code are available, but are often befuddling to non-specialists. chamberlain and schommer (2014) note that virtual machines "have serious drawbacks," including the difficulty of use "without a high level of systems administration knowledge" and requiring "a lot of storage space, which makes them onerous to share" (p. 1). one major advance for sharing code is container software. containers reduce complexity, silver (2017) writes, "by packaging the key elements of the computational environment needed to run the desired software. . . into a lightweight, virtual box. . . [t]hey make the software much easier to use, and the results easier to reproduce" (p. 174).

docker

a container platform called docker is rising in popularity in some academic fields (merkel, 2014; boettiger, 2015). docker's core virtues include:

1. a rich and growing ecosystem of supporting tools and environments, such as rocker (boettiger and eddelbuettel, 2017), a repository of docker images4 specifically for r users, and biocimagebuilder for bioconductor-based builds (almugbel et al., 2017);
2. ease of use, relative to other container and virtual machine technology;
3. an open-source code base, allowing for adaptation (hung, kristiyanto, lee, and yeung, 2016) and integration with existing academic software (grüning et al., 2016; almugbel et al., 2017);
4. relatively lightweight installation, because a docker container "does not replicate the full operating system, only the libraries and binaries of the application being virtualized" (chamberlain and schommer, 2014); and
5.
compatibility with any programming language that can be installed on linux.5

adoption of container technology like docker in psychology, however, remains scant.6 a few explanations come to mind. the first is simply lack of awareness. the second is lack of incentives, as journals increasingly require the sharing of code and data but not of a full-fledged computational environment. the third is that docker, though easier to use than many other software engineering tools, requires familiarity with the command line and dependency management. these skills take time and effort to learn, are not part of the standard curriculum for training researchers (boettiger, 2015), and are not self-evidently a worthwhile investment when weighing opportunity costs.

code ocean: customizing container technology for researchers

code ocean attempts to address these issues. it is a platform for creating, running, and collaborating on research code. it allows scientists to package code, data, results, metadata, and a computational environment into a single compendium (called a 'compute capsule,'7 or simply 'capsule' for short) whose results can be reproduced by anyone who presses a 'run' button. it does so by providing a simple-to-use interface for configuring computational environments, getting code up and running online, and publishing final results. each published capsule is assigned a unique, persistent identifier in the form of a digital object identifier (doi) and can be embedded either directly into the text of an article or on its landing page. the platform hopes to make code accompanying research articles reproducible in perpetuity8 by

4 a docker image is the executable package containing all necessary prerequisites for a software application to run.
5 for a more thorough overview of docker's capabilities and scientific use cases, see boettiger (2015).
6 a search of http://www.apa.org on 23 april 2019 for the words "docker container," for instance, yielded zero matches.
containing all analyses within stable and portable computational environments.

7 thank you to christopher honey for the term.
8 for code ocean's preservation plan, see https://help.codeocean.com/faq/code-oceans-preservation-plan.

the remainder of this article will illustrate these features by walking through a capsule called "the contact hypothesis re-evaluated: code and data", available at https://doi.org/10.24433/co.4024382.v6 or https://codeocean.com/capsule/8235972/tree/v6. this capsule reproduces the results of a july 2018 article published in behavioural public policy (paluck, green, and green, 2018).9 (it may help to open up the capsule in a new tab or window while reading.)

reuse without downloading or technical setup

code ocean allows reuse without installing anything locally. figure 1 shows the default view for this capsule. code is in the top left, data are in the bottom left, and a set of published results are on the right (in the 'reproducibility pane'). readers can view and edit selected files in the center pane. a published capsule's code, data, and results are open-access; they can be viewed and downloaded by all, with or without a code ocean account.10 the 'reproducible run' button reproduces all results in their entirety. this is possible by dint of two things: a run script and a fully configured computational environment. the 'run' script (also called the 'master script'), visible in figure 1 as the code file with the flag icon, is a script that executes each analysis script in its proper order. authors can designate different files as their entrypoints by selecting 'set as file to run'. all capsules must have a run script to be published. clicking on 'environment' will give the user a snapshot of the computational environment (figure 2).
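the run-script convention just described can be sketched in miniature. the following python driver is an illustration only, not code ocean's actual implementation; the script names are hypothetical, and the `runner` hook exists so the logic can be exercised without real files.

```python
import subprocess
import sys

# hypothetical analysis scripts, listed in the order they must run so that
# later scripts can rely on the outputs of earlier ones
ANALYSIS_SCRIPTS = ["01_clean_data.py", "02_fit_models.py", "03_make_figures.py"]

def run_all(scripts, runner=None):
    """run each analysis script in order; stop at the first failure."""
    if runner is None:
        # by default, execute each script as a subprocess and use its exit code
        runner = lambda script: subprocess.run([sys.executable, script]).returncode
    for script in scripts:
        if runner(script) != 0:
            return f"failed at {script}"
    return "all analyses completed"

# demonstration with a stub runner that reports success for every script
print(run_all(ANALYSIS_SCRIPTS, runner=lambda script: 0))
```

running everything from a single, ordered entrypoint is what makes the 'reproducible run' button possible: there is no hidden knowledge about which file to execute first.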
this tab offers a number of common package managers, customized for each base environment, and a postinstall script wherein you can download and install software that isn't currently available through a package manager, or precisely specify an order of operations (figure 3). whenever possible, package versions are labeled and held static to ensure transparency and long-term stability of computations. for published capsules, environments are pre-configured by authors and do not need to be altered by readers to reproduce results.

configured to support research workflows

code ocean supports any open-source language that can be installed on linux, as well as the proprietary languages stata and matlab (figure 4). this particular capsule runs stata and r code in sequence (figure 2). each language comes with pre-configured base environments for common use cases; readers can also start from a blank slate, with no scientific programming languages installed.

metadata and preservation

code ocean asks authors to provide sufficient metadata on published capsules to facilitate intelligibility. attaching rich metadata to a capsule encourages citation and signals that published code is a first-class research object. in addition to metadata provided by authors, published capsules are automatically provided with a doi and citation information (figure 5). metadata about an associated publication establishes a compute capsule as a 'version of record' of code and data to support published findings.

cloud workstations

by default, pressing 'reproducible run' on code ocean runs the main script from top to bottom. readers may also wish to run code line by line (or snippet by snippet) iteratively. cloud workstations support this. following instructions provided at https://help.codeocean.com/en/articles/2366255-cloud-workstations-an-overview, authors and readers can currently run terminal, jupyter, jupyterlab, r shiny, and rstudio workstations, with more options planned.
this particular capsule has rstudio preinstalled and ready to launch (figure 6).

exporting capsules for local reproduction

for any capsule readers have access to, including all public capsules, they can download code, data, metadata, and a formula for the computational environment, as well as instructions on reproducing results locally. local reproduction will require some familiarity with docker, as well as all applicable software licenses (figure 7).

share or embed a capsule

finally, code ocean lets readers easily share published capsules. capsules can be posted to social media, or embedded as interactive widgets directly into the text of articles, websites, or blogs (figure 8).

9 note that one author (seth green) is the author of this capsule and a co-author of the accompanying bpp article.
10 running code requires an account to prevent abuse of available computational resources, which include gpus. authors who sign up with academic email addresses receive 10 hours of runtime per month and 20 gb of storage by default. code ocean's current policy is to provide authors with any and all resources they need to publish capsules on the platform. for more details, see https://codeocean.com/pricing.

figure 1. a capsule contains directories for code, data, and results. additional sub-directories can be added by authors. note: all figures are available as standalone files at https://doi.org/10.17605/osf.io/s8mz4.

figure 2. this capsule requires packages from apt-get, cran, and ssc.

figure 3. software can also be added via a custom postinstall script.

figure 4.
when creating a new compute capsule, an author can select environments with pre-installed languages and language-specific installers, or start from a blank slate ('ubuntu linux'). this figure displays available matlab environments.

conclusion: answering the call to make reproducibility tools simpler

in the context of discussing docker, boettiger (2015) writes that:

a technical solution, no matter how elegant, will be of little practical use for reproducible research unless it is both easy to use and adapt to the existing workflow patterns of practicing domain researchers . . . another researcher may be less likely to build on existing work if it can only be done by using a particular workflow system or monolithic software platform with which they are unfamiliar. likewise, a user is more likely to make their own computational environment available for reuse if it does not involve a significant added effort in packaging and documenting. perhaps the most important feature of a reproducible research tool is that it be easy to learn and fit relatively seamlessly into existing workflow patterns of domain researchers. (pp. 4-5)

we believe that containers are an important advance in this direction, and hope that code ocean, by building on this technology and adapting it specifically to the needs of researchers, helps enable a fully reproducible workflow that is "easy to use and adapt" to existing research habits.

figure 5. an excerpt from a capsule's metadata. a doi and citation data are automatically added to any published capsule.

figure 6. a variety of cloud workstation types are available.

figure 7. export will download all necessary components for reproducing results locally.

open science practices

because this article is a tutorial, there are no relevant data, materials, analyses or preregistration(s) to be shared.

figure 8.
compute capsules can be embedded into the text of articles so that analyses can be reviewed and assessed in context. this capsule appears within the text of gilad and mizrahi-man (2015).

author note

april clyburne-sherin is an independent consultant on open science tools, methods, training, and community stewardship; as of august 2019, she was director of scientific outreach at code ocean. xu fei is outreach scientist at code ocean. seth green is developer advocate at code ocean. correspondence concerning this article can be addressed to xufei at codeocean dot com and seth at codeocean dot com. we would like to thank shahar zaks, christopher honey, rickard carlsson, our reviewers nick brown and jack davis for their feedback, and nicholas a. coles for his helpful comments on our psyarxiv preprint.

author contributions

april clyburne-sherin contributed conceptualization, investigation, visualization, and writing (original draft, reviewing and editing). xu fei contributed conceptualization, visualization, and writing (reviewing and editing). seth ariel green contributed conceptualization, investigation, visualization, and writing (original draft, reviewing and editing). authorship order follows alphabetical order of last names.

conflict of interest

all three authors worked at code ocean during the writing of this paper.

funding

the authors did not receive any grants for writing this paper.

references

almugbel, r., hung, l.-h., hu, j., almutairy, a., ortogero, n., tamta, y., & yeung, k. y. (2017). reproducible bioconductor workflows using browser-based interactive notebooks and containers. journal of the american medical informatics association, 25(1), 4–12.

american psychological association. (2012). ethics code updates to the publication manual. retrieved september 1, 2012.

barba, l. a. (2016). the hard road to reproducibility. science, 354(6308), 142–142.

barba, l. a. (2018). terminologies for reproducible research. corr, abs/1802.03311.

boettiger, c. (2015).
an introduction to docker for reproducible research. acm sigops operating systems review, 49(1), 71–79.

boettiger, c., & eddelbuettel, d. (2017). an introduction to rocker: docker containers for r. arxiv preprint arxiv:1710.03675.

bogart, c., kästner, c., & herbsleb, j. (2015). when it breaks, it breaks. in proc. of the workshop on software support for collaborative and global software engineering (scgse).

chamberlain, r., & schommer, j. (2014). using docker to support reproducible research. figshare. https://doi.org/10.6084/m9.figshare.1101910

claerbout, j. (2011). reproducible computational research: a history of hurdles, mostly overcome. technical report.

open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251), aac4716.

cooper, j. (2013). on fraud, deceit and ethics. journal of experimental social psychology, 2(49), 314.

deelman, e., & chervenak, a. (2008). data management challenges of data-intensive scientific workflows. in cluster computing and the grid, 2008. ccgrid'08. 8th ieee international symposium on (pp. 687–692). ieee.

donoho, d. (2017). 50 years of data science. journal of computational and graphical statistics, 26(4), 745–766.

donoho, d., maleki, a., rahman, i., shahram, m., & stodden, v. (2008). 15 years of reproducible research in computational harmonic analysis. technical report.

eubank, n. (2016). lessons from a decade of replications at the quarterly journal of political science. ps: political science & politics, 49(2), 273–276.

fidler, f., & wilcox, j. (2018). reproducibility of scientific results. in e. n. zalta (ed.), the stanford encyclopedia of philosophy (winter 2018). metaphysics research lab, stanford university.

funder, d. c., levine, j. m., mackie, d. m., morf, c. c., sansone, c., vazire, s., & west, s. g. (2014). improving the dependability of research in personality and social psychology: recommendations for research and educational practice.
personality and social psychology review, 18(1), 3–12.

gelman, a., & loken, e. (2014). the statistical crisis in science. american scientist, 102(6), 460.

gilad, y., & mizrahi-man, o. (2015). a reanalysis of mouse encode comparative gene expression data. f1000research, 4.

grange, j., lakens, d., adolfi, f., albers, c., anvari, f., apps, m., . . . benning, s., et al. (2018). justify your alpha. nature human behavior.

grüning, b., rasche, e., rebolledo-jaramillo, b., eberhart, c., houwaart, t., chilton, j., . . . nekrutenko, a. (2016). enhancing pre-defined workflows with ad hoc analytics using galaxy, docker and jupyter. biorxiv, 075457.

hardwicke, t. e., mathur, m. b., macdonald, k., nilsonne, g., banks, g. c., kidwell, m. c., . . . henry tessler, m., et al. (2018). data availability, reusability, and analytic reproducibility: evaluating the impact of a mandatory open data policy at the journal cognition. royal society open science, 5(8), 180448.

hung, l.-h., kristiyanto, d., lee, s. b., & yeung, k. y. (2016). guidock: using docker containers with a common graphics user interface to address the reproducibility of research. plos one, 11(4), e0152686.

jonas, k. j., & cesario, j. (2015). guidelines for authors. retrieved from http://www.tandf.co.uk/journals/authors/rrsp-submission-guidelines.pdf

kitzes, j. (2017). introduction. in j. kitzes, d. turek, & f. deniz (eds.), the practice of reproducible research: case studies and lessons from the data-intensive sciences. university of california press.

kluyver, t., ragan-kelley, b., pérez, f., granger, b. e., bussonnier, m., frederic, j., . . . corlay, s., et al. (2016). jupyter notebooks: a publishing format for reproducible computational workflows. in elpub (pp. 87–90).

lindsay, d. s. (2017). sharing data and materials in psychological science. sage publications, los angeles, ca.

liu, d., & salganik, m. (2019). successes and struggles with computational reproducibility: lessons from the fragile families challenge.
socarxiv.

marwick, b., rokem, a., & staneva, v. (2017). assessing reproducibility. in j. kitzes, d. turek, & f. deniz (eds.), the practice of reproducible research: case studies and lessons from the data-intensive sciences. university of california press.

merkel, d. (2014). docker: lightweight linux containers for consistent development and deployment. linux journal, 2014(239), 2.

morey, r. d., & lakens, d. (2016). why most of psychology is statistically unfalsifiable. submitted.

nosek, b. a., & lakens, d. (2014). registered reports. hogrefe publishing.

paluck, e. l., green, s. a., & green, d. p. (2018). the contact hypothesis re-evaluated. behavioural public policy, 1–30.

peng, r. d. (2011). reproducible research in computational science. science, 334(6060), 1226–1227.

sandve, g. k., nekrutenko, a., taylor, j., & hovig, e. (2013). ten simple rules for reproducible computational research. plos computational biology, 9(10), e1003285.

silver, a. (2017). software simplified. nature, 546(7656), 173–174.

simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359–1366.

stodden, v. (2014). what scientific idea is ready for retirement. edge.

stodden, v., seiler, j., & ma, z. (2018). an empirical analysis of journal policy effectiveness for computational reproducibility. proceedings of the national academy of sciences, 115(11), 2584–2589.

vanpaemel, w., vermorgen, m., deriemaecker, l., & storms, g. (2015). are we wasting a good crisis? the availability of psychological research data after the storm. collabra: psychology, 1(1).

wicherts, j. m., borsboom, d., kats, j., & molenaar, d. (2006). the poor availability of psychological research data for reanalysis. american psychologist, 61(7), 726.
wilson, g., bryan, j., cranston, k., kitzes, j., nederbragt, l., & teal, t. k. (2017). good enough practices in scientific computing. plos computational biology, 13(6), e1005510.

wood, b. d., müller, r., & brown, a. n. (2018). push button replication: is impact evaluation evidence for international development verifiable? plos one, 13(12), e0209416.

woodbridge, m. (2017). jupyter notebooks and reproducible data science. retrieved from https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html

meta-psychology, 2019, vol 3, mp.2018.871, https://doi.org/10.15626/mp.2018.871. article type: original article. published under the cc-by4.0 license. open data: not applicable. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: no. edited by: daniël lakens. reviewed by: patrick r. heck, angelika stefan, felix schönbrodt. analysis reproduced by: jack davis. all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/69xmg

insights into criteria
for statistical significance from signal detection analysis

jessica k. witt
colorado state university

what is the best criterion for determining statistical significance? in psychology, the criterion has been p < .05. this criterion has been criticized since its inception, and the criticisms have been rejuvenated with recent failures to replicate studies published in top psychology journals. several replacement criteria have been suggested, including reducing the alpha level to .005 or switching to other types of criteria such as bayes factors or effect sizes. here, various decision criteria for statistical significance were evaluated using signal detection analysis on the outcomes of simulated data. the signal detection measure of area under the curve (auc) is a measure of discriminability, with a value of 1 indicating perfect discriminability and 0.5 indicating chance performance. applied to criteria for statistical significance, it provides an estimate of the decision criterion's performance in discriminating real effects from null effects. aucs were high (m = .96, median = .97) for p values, suggesting merit in using p values to discriminate significant effects. aucs can be used to assess methodological questions such as how much improvement will be gained with increased sample size, how much discriminability will be lost with questionable research practices, and whether it is better to run a single high-powered study or a study plus a replication at lower power. aucs were also used to compare performance across p values, bayes factors, and effect size (cohen's d). aucs were equivalent for p values and bayes factors and were slightly higher for effect size. signal detection analysis provides separate measures of discriminability and bias. with respect to bias, the specific thresholds that produced maximally optimal utility depended on sample size, although this dependency was particularly notable for p values and less so for bayes factors.
the application of signal detection theory to the issue of statistical significance highlights the need to focus on both false alarms and misses, rather than false alarms alone.

keywords: statistical significance, bayes factor, effect size, p values

jessica k. witt, department of psychology, colorado state university, fort collins, co 80523. the author would like to thank anne cleary, john wixted, mark prince, susan wagner cook, mike dodd, art glenberg, jim nairne, jeremy wolfe, and ben prytherch for useful discussions and feedback on an earlier draft. the author would also like to thank the editor (daniël lakens) and reviewers (patrick heck, angelika stefan, felix schönbrodt, and daniel benjamin) for valuable suggestions. this work was supported by grants from the national science foundation to jkw (bcs-1348916 and bcs-1632222). please address correspondence to jessica witt, colorado state university, behavioral sciences building, fort collins, co 80523 usa. email: jessica.witt@colostate.edu.

scientists across many disciplines, including psychology, biology, and economics, use p < .05 as the criterion for statistical significance. this threshold has recently been challenged due to numerous failures to replicate findings published in top journals (begley & ellis, 2012; camerer et al., 2016; open science collaboration, 2015). changes in the recommendations for statistical significance include using a stricter criterion for significance (e.g., p < .005; benjamin et al., 2017) and minimizing flexibility in decisions around data collection and analysis (e.g., simmons, nelson, & simonsohn, 2011). these recommendations were designed to increase replicability by decreasing false alarm rates, which is the rate at which null effects are incorrectly labeled as significant. however, the best criteria for statistical significance are ones that maximize discriminability between real and null effects, not just those that minimize false alarms.
one analytic technique that is intended to measure the discriminability of a test is signal detection theory (green & swets, 1966). signal detection theory has previously been applied to evaluate p values (krueger & heck, 2017). here, the signal detection theory measure of area under the curve (auc) is offered as a tool to quantify the effectiveness of various measures of statistical effects. signal detection analysis involves categorizing outcomes into four categories. applied to criteria for statistical significance, a hit occurs when there is a true effect and the analysis correctly identifies it as significant (see table 1). a miss occurs when there is a true effect but the analysis identifies it as not significant. a correct rejection occurs when there is no effect and the analysis correctly identifies it as not significant, and a false alarm occurs when there is no effect but the analysis identifies it as significant. in statistics, type i errors (false alarms) and type ii errors (misses) are sometimes considered separately, with type i errors being a function of the alpha level and type ii errors being a function of power. an advantage of signal detection theory is that it combines type i and type ii errors into a single analysis of discriminability and also considers the relative distributions of each type of error in the analysis of bias.

1 this number was selected somewhat arbitrarily, and the results generalized to other numbers. larger numbers of repeats reduced the standard deviations of the results reported below but did not affect the means. the decision to simulate sets of studies was to allow for . . .

data simulations for experiment 1

data were simulated for two independent groups of 64 participants each, which corresponds to 80% power at an alpha level of .05 for a two-tailed independent-samples t-test.

table 1.
signal detection classification of data based on the example criteria p < .05 for a true effect (cohen's d = 0.50) and a null effect (cohen's d = 0).

            p < .05          p > .05
            "significant"    "not significant"
d = .50     hit              miss
d = 0       false alarm      correct rejection

data for one group was sampled from a normal distribution with a mean of 50 and a standard deviation of 10 (such as might be found on a memory test with a total score of 100). the data for the other group was sampled from a normal distribution with a mean of 50 (for studies with a null effect) or 45 (for studies with an effect size of cohen's d = .50) and a standard deviation of 10. the data were submitted to an independent-samples t-test (all simulations and analyses were conducted in r; r core team, 2017). details of the simulation are available in the online supplementary materials (https://osf.io/bwqm8/). this initial simulation will be referred to as experiment 1. see the appendix for an overview of all experiments. data were simulated from 20 studies1, half of which had an effect size of 0 and half of which had a medium effect size (cohen's d = .50). the result from each simulated study was classified as a hit or miss (for studies modeled as a medium effect) or as a correct rejection or false alarm (for studies modeled as a null effect). the classification was based on four criteria for statistical significance related to p values: p < .10, p < .05, p < .005, and p < .001. this process was repeated 100 times1.

1 (cont.) . . . to allow for multiple comparisons across a variety of measures (p values, bayes factors, and effect sizes).

the outcomes across all studies were summarized into the proportions of hits, misses, false alarms, and correct rejections for each criterion (see figure 1). in addition, the hit rates and false alarm rates were calculated for the purpose of plotting the receiver operator characteristic (roc) curves (see figure 2).
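the simulation just described can be sketched as follows. this is an illustrative reimplementation, not the author's r code: to stay dependency-free it approximates the two-tailed t-test p value with a normal (z) approximation, which is reasonable at n = 64 per group.

```python
import math
import random

def simulate_study(delta, n=64, sd=10, rng=random):
    """simulate one two-group study and return a two-sided p value.

    group means are 50 and 50 - delta (delta = 5 with sd = 10 gives
    cohen's d = .50, the medium effect used in experiment 1).
    """
    g1 = [rng.gauss(50, sd) for _ in range(n)]
    g2 = [rng.gauss(50 - delta, sd) for _ in range(n)]
    m1, m2 = sum(g1) / n, sum(g2) / n
    v1 = sum((x - m1) ** 2 for x in g1) / (n - 1)
    v2 = sum((x - m2) ** 2 for x in g2) / (n - 1)
    z = (m1 - m2) / math.sqrt(v1 / n + v2 / n)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p, normal approximation

def classify(p, true_effect, alpha=0.05):
    """map one study outcome onto the four categories of table 1."""
    if true_effect:
        return "hit" if p < alpha else "miss"
    return "false alarm" if p < alpha else "correct rejection"

# one set of 20 studies: 10 with a medium effect, 10 with a null effect
rng = random.Random(1)
outcomes = [classify(simulate_study(delta=5, rng=rng), True) for _ in range(10)]
outcomes += [classify(simulate_study(delta=0, rng=rng), False) for _ in range(10)]
```

repeating the final two lines 100 times and tallying the four categories per threshold reproduces the kind of summary plotted in figure 1.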
the hit rate is the proportion of studies for which the simulated effect was real and the criterion classified it as significant, and the false alarm rate is the proportion of studies for which the simulated effect was null but the criterion classified it as significant. to clarify, whereas the proportion of hits (as plotted in figure 1) is the number of hits divided by the total number of studies, the hit rate (plotted in figure 2) is the number of hits divided by the number of studies modeled as a real effect. bayes factors, which are also plotted, are discussed below.

figure 1. proportion of each outcome as a function of the decision criterion for significance. brighter colors correspond to errors and dark colors correspond to correct classifications. for criteria of bayes factors greater than 2, 3, or 10, studies that produced a bayes factor less than the criterion but greater than the inverse of the criterion were considered inconclusive, which is why the total proportion of outcomes does not equal 1.

figure 2. mean hit rates are plotted as a function of mean false alarm rates and the decision criterion (see legend) for one set of 20 studies (left panel) and averaged across all 100 sets of 20 studies (right panel). roc curves are plotted for criteria based on p values (thick green line) and bayes factors (thin blue line). the two lines are identical (as was the case for all 100 sets of 20 studies). area under the curve (auc) is the shaded area.
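the distinction between the proportion of hits and the hit rate can be made concrete in a few lines; the counts in the usage example are made up for illustration.

```python
def rates(hits, misses, false_alarms, correct_rejections):
    """compute proportions with the two different denominators described above."""
    total = hits + misses + false_alarms + correct_rejections
    return {
        # denominator: all studies (as plotted in figure 1)
        "proportion_of_hits": hits / total,
        # denominator: only studies modeled as a real effect (figure 2)
        "hit_rate": hits / (hits + misses),
        # denominator: only studies modeled as a null effect
        "false_alarm_rate": false_alarms / (false_alarms + correct_rejections),
    }

# e.g., 10 real-effect studies (8 hits, 2 misses) and 10 null studies
# (1 false alarm, 9 correct rejections):
r = rates(8, 2, 1, 9)  # proportion_of_hits = 0.4, hit_rate = 0.8
```

the hit rate and false alarm rate, not the raw proportions, are the coordinates of each point on the roc curve.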
in selecting a criterion for statistical significance, researchers must select a measure (e.g., p values) and a threshold within that measure (e.g., alpha = .05). a measure can be evaluated by assessing its ability to discriminate between real and null effects, which can be quantified by calculating the area under the roc curve (auc; macmillan & creelman, 2008). with respect to evaluating thresholds for a specific measure (e.g., comparing .005 to .05), the location of each threshold on the roc curve can be calculated. location on the curve is a measure of bias. each of these measures will be considered in turn. to measure the discriminability of p values, the auc was computed 100 times, once for each set of 20 studies. unlike the discriminability measure of d', the discriminability measure of auc makes no assumptions regarding the underlying distributions, which is critical because distributions of p values are not normally distributed. higher aucs indicate better ability to discriminate real effects from null effects. if discrimination were perfect, the curve would follow the left and top boundaries in figure 2, and the auc would equal 1 (i.e., the entire area would be under the curve). if discrimination were at chance, the curve would follow the diagonal line in figure 2, and the auc would be .5 (i.e., only 50% of the area would be under the curve). as is apparent in figure 2, p values produced curves that were closer to 1 (perfect performance) than to .5 (chance performance).
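the auc computation can be sketched with the trapezoidal rule over the roc points, one point per threshold; as noted above, no distributional assumptions are needed. the points in the usage comments are sanity checks, not simulation results.

```python
def auc_from_roc(points):
    """trapezoidal area under an roc curve.

    `points` are (false_alarm_rate, hit_rate) pairs, one per threshold;
    the endpoints (0, 0) and (1, 1) are added automatically.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# sanity checks: a point on the diagonal gives chance performance (auc = .5),
# and a threshold at the top-left corner gives perfect performance (auc = 1)
chance = auc_from_roc([(0.5, 0.5)])   # 0.5
perfect = auc_from_roc([(0.0, 1.0)])  # 1.0
```

feeding in the hit rates and false alarm rates computed at each alpha threshold yields the shaded area shown in figure 2.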
the mean auc was .96 (median = .97, sd = .04). thus, p values were effective, though not perfect, at discriminating between real and null effects. this aligns with conclusions from other evaluations of p values (e.g., krueger & heck, 2017, 2018). these auc values suggest some benefit in using p values, at least as a continuous measure without necessarily having strict thresholds for significance (mcshane, gal, gelman, robert, & tackett, 2018). perhaps alternative methods to reduce false alarm rates would be more beneficial than eliminating p values altogether (e.g., trafimow & marks, 2015). note that measures of discriminability evaluate p values as a measure without consideration of the specific alpha value adopted as the criterion. specific alpha levels relate to bias, and are discussed below. what could improve discriminability when using p values as the criterion for statistical significance? one suggestion has been to lower the threshold from .05 to .005. this would not alter discriminability, because discriminability relates to p values as a whole, not to specific thresholds. thresholds refer to locations on the curve, and these dictate bias rather than discriminability. signal detection theory distinguishes between discriminability and bias. as applied to criteria for statistical significance, discriminability refers to the criterion's performance at identifying real effects versus null effects, and bias refers to whether the errors tend to be false alarms or misses. assessing bias can be useful for selecting the appropriate criterion for asserting statistical significance. for example, assume that the cost of a miss is equivalent to the cost of a false alarm in a particular field. in that case, optimal utility would be achieved by setting the criterion in such a way that its point on the roc curve is the one that falls closest to the upper left corner in figure 2.
the euclidean distance between each point on the roc curve and the point of perfect performance is plotted in figure 3. for the scenario that was simulated, an alpha level near the blue dot, which corresponds to an alpha level of .10, would come closer to achieving that maximum-utility outcome than an alpha level of .005. lowering the criterion for statistical significance to p < .005 would increase the number of studies that will replicate by decreasing false alarms, but it would do so at the cost of missing real effects (see also krueger & heck, 2017). note the proportion of misses in figure 1 across the various criteria, particularly for the criterion of p < .005. misses are bad for science (fiedler, kutzner, & krueger, 2012; murayama, pekrun, & fiedler, 2014). assuming that null effects are theoretically interesting and practically important, it is important to determine which null effects are due to a genuine lack of difference versus a miss of a true effect. is the trade-off to increase replicability worth the large increase in misses? perhaps science can adopt alternative means to improve replicability without incurring so many misses, such as increasing incentives for publishing statistically- and scientifically-sound significant findings and also publishing (statistically and scientifically sound) null results.

figure 3. distance to perfection was calculated as the euclidean distance between each point on the roc curve (see figure 2) and the top-left corner (which corresponds to a 100% hit rate and a 0% false alarm rate) across all 100 sets of 20 studies. a lower distance to perfection score indicates better discriminability between real and null effects. error bars represent 95% confidence intervals.

one effective way to improve replicability is to increase sample size.
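the distance-to-perfection score is a one-line calculation; a python sketch (the operating points below are hypothetical, chosen only to show how a lenient and a strict criterion compare on this metric):

```python
import math

def distance_to_perfection(hit_rate, false_alarm_rate):
    """Euclidean distance from a point on the ROC curve to the
    top-left corner (false alarm rate 0, hit rate 1); lower is better."""
    return math.hypot(false_alarm_rate - 0.0, hit_rate - 1.0)

# illustrative operating points for two criteria (hypothetical values):
lenient = distance_to_perfection(hit_rate=0.90, false_alarm_rate=0.10)   # e.g., alpha = .10
strict = distance_to_perfection(hit_rate=0.55, false_alarm_rate=0.005)   # e.g., alpha = .005
```

with these hypothetical rates, the lenient criterion sits closer to perfection: its extra false alarms cost less distance than the strict criterion's many misses.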
many studies are underpowered (e.g., etz & vandekerckhove, 2016; fraley & vazire, 2014; ioannidis, 2005; sedlmeier & gigerenzer, 1989). the simulations in experiment 1 showed that at a power of 80% (at an alpha level of .05), the mean auc for p values was .96. at a power of 50%, the mean auc for p values was .85 (median = .87; sd = .10). increasing power to 90% produced a mean auc of .975 (median = .99; sd = .03), increasing power to 95% produced a mean auc of .984 (median = 1; sd = .03), and increasing power to 99% produced a mean auc of .999 (median = 1; sd = .004). if resources are unlimited, increasing sample size to increase power is an effective way of improving discriminability of real effects from null effects (krueger & heck, 2017). assuming limited resources, one might wonder whether it is better to run one high-powered study or a study plus a replication that are both at 80% power. aucs can help a researcher make these decisions. two additional “experiments” (i.e., sets of simulations) were conducted. in experiment 2, everything was the same as in experiment 1 except the sample size for each group was 105 (which corresponds to 95% power at an alpha level of .05). in experiment 3, everything was the same as in experiment 1 except that for every study that was simulated, a second study with the same parameters was simulated and the higher p value was retained. this emulates a situation for which a study is conducted that produces a significant p value and then a replication fails to find a significant effect, so the effect is considered not significant. this is why the higher p value was retained. the mean auc for experiment 2 was .99 (median = 1; sd = .01). the mean auc for experiment 3 was .97 (median = .99; sd = .04). this suggests that higher power produces better discriminability than replicating a study with both the original and replication studies at 80% power. 
however, the higher-powered study produced more false alarms, whereas the study plus replication produced few false alarms but more misses (see figure 4). again, researchers will need to decide what trade-offs between false alarms and misses make the most sense for their science.

figure 4. proportion of each outcome as a function of the decision criterion and whether one or two studies were run. the left panel shows the outcomes across 100 sets of 20 studies, each with 105 data points per group (which corresponds to 95% power at alpha = .05). the right panel shows the outcomes across 100 sets of 20 studies. for each study, a replication was conducted. both the original study and the replication had 64 data points per group (which corresponds to 80% power at alpha = .05). in order for an effect to meet the decision criterion, both the original study and the replication had to produce values that exceeded the decision criterion. for example, for the criterion of p < .05, both the study and the replication had to produce p values < .05; otherwise the set of studies was considered not significant.

power, rather than effect size, is more important for discriminability. in experiment 4, data were simulated at 80% power (at an alpha of .05) for each of 8 effect sizes ranging from d = .1 to .8. the aucs for each were approximately the same (m = .95; range of means for each effect size = .947 to .961; variations were due to chance rather than systematic differences).
as shown in figure 5, when power was consistent, there were also no substantial differences in the rate of the different outcomes. thus, while studying bigger effects will reduce the number of participants needed, it will not improve discriminability on its own.

questionable research practices

some recommendations to improve replicability concern practices to avoid. these have been labeled questionable research practices, and have been identified as particularly problematic (simmons et al., 2011). aucs can be used to assess the degree to which various questionable research practices reduce discriminability. one recommendation is to designate the number of participants to be run ahead of time, rather than use an optional stopping rule (simmons et al., 2011). in a new set of simulations (experiment 5), each simulated study was conducted with 30 participants per group with either a cohen's d = .50 or d = 0. a lower sample size was used given that published studies tend to be underpowered. to try to mimic typical use of the optional stopping rule, for each study, if the p value was between .20 and .05, an additional 10 participants were added per group. after this addition, if the p value was less than .05, data collection stopped; otherwise the process was repeated up to 9 more times. on average, p-hacking in the form of adding more participants occurred 4.3 times in each set of 20 studies (sd = 2; range = 0 – 11). the optional stopping rule produced differences in the aucs relative to the original sample, but the differences were not systematic.

figure 5. proportion of sdt outcomes is plotted as a function of effect size for the single criterion for statistical significance of p < .05. data were all simulated at a power of 80% at an alpha of .05. as in experiment 1, 20 studies were simulated, and this was repeated 100 times.
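the optional stopping rule as simulated (add 10 per group whenever .05 < p < .20, at most 9 additional times) can be sketched in python. the normal-approximation p value and the helper names are assumptions of this sketch, not the article's code:

```python
import math
import random
import statistics

def p_value(a, b):
    """Two-sided p for a two-sample comparison via a normal approximation."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def optional_stopping_p(d, rng, n_start=30, n_add=10, max_adds=9):
    """Start with n_start per group; while .05 < p < .20, add n_add
    participants per group, at most max_adds times (mimicking experiment 5)."""
    a = [rng.gauss(d, 1.0) for _ in range(n_start)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_start)]
    p = p_value(a, b)
    adds = 0
    while 0.05 < p < 0.20 and adds < max_adds:
        a += [rng.gauss(d, 1.0) for _ in range(n_add)]
        b += [rng.gauss(0.0, 1.0) for _ in range(n_add)]
        p = p_value(a, b)
        adds += 1
    return p, adds

rng = random.Random(0)
p, n_adds = optional_stopping_p(d=0.5, rng=rng)
```

running this over many simulated studies, for both d = .5 and d = 0, and feeding the resulting p values into an auc calculation is the shape of the analysis behind figures 6 and 7.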
sometimes running additional unplanned participants improved discriminability and other times it worsened discriminability (see figure 6). how can this questionable research practice have no impact on the discriminability of real effects from null effects? the reason is that these questionable research practices increase the false alarm rate, but they also increase the hit rate (see figure 7). much of the attention on the replication crisis has sought to minimize false alarms, but it is also necessary to discuss the corresponding increase in the number of misses (i.e., the decrease in the number of hits). discriminability between real effects and null effects takes into account both the false alarm rate and the hit rate.

figure 6. the area under the curve (auc) for hacked studies plotted as a function of the auc for the original studies. a higher auc indicates better discrimination between real and null effects. the line is at unity. data points above the line indicate better discriminability for the hacked studies, and data points below the line indicate better discriminability for the original studies.

a decreased hit rate directly corresponds to an increased miss rate. furthermore, the data were simulated so that the studies were underpowered. although p-hacking increased the false alarm rates (see also ioannidis, 2005), adding participants increased power, which is good for discriminability. to be clear, the recommendation is not to p-hack by running participants until the effect is significant. instead, experiments should be run with sufficient power, or should allow only restricted flexibility in stopping data collection, for example by following the recommendations of lakens (2014) or using sequential bayes factors with a minimum and maximum n (schönbrodt & wagenmakers, 2018).
but with respect to interpreting published research, the current simulations suggest that flexibility in data collection via an optional stopping rule does not necessarily void the findings (see also murayama et al., 2014; salomon, 2015). in these simulations, p-hacking increased the hit rate by 28% while only increasing the false alarm rate by 12%. note, however, that p-hacking via optional stopping rules does not always increase hit rates more than false alarm rates. if power is high (e.g., > 99%), simulations showed that hit rates increased from 99.9% to 100% but false alarm rates increased from 5.4% to 9.8%.

figure 7. proportion of hits, false alarms, misses, and correct rejections as a function of whether the studies were the original sample of 30 data points per group or had been p-hacked via an optional stopping rule. outcomes shown only for the decision criterion of p < .05. note that the seeming benefit for p-hacking is dependent on the low power of the simulated study.

bayes factor versus p values

an alternative to p values is to use bayes factors (e.g., dienes, 2011; kass & raftery, 1995; kruschke, 2013; lee & wagenmakers, 2005; rouder, speckman, sun, morey, & iverson, 2009). the bayes factor is the ratio of the likelihoods of the data under the alternative hypothesis relative to the null hypothesis. a bayes factor of 1 corresponds to equal likelihood for the alternative and the null hypotheses, and a bayes factor greater than 1 is evidence for the alternative hypothesis relative to the null hypothesis. bayes factors quantify how well a hypothesis predicts the data relative to a competing hypothesis (such as the null hypothesis), and thus provide a continuous measure for which the focus is on the strength of the evidence, rather than a specific cut-off for deeming effects significant or not.
however, bayes factors between 1 and 3 are considered weak or anecdotal evidence, so a bayes factor of 3 could be considered a decision criterion akin to a criterion for significance (see table 2), though not everyone agrees with the idea of using strict cut-offs (e.g., morey, 2015).

table 2. overview of the relationship between the bayes factor and the conclusion about the evidence being in favor of the alternative hypothesis (ha) or the null hypothesis (h0). adapted from wetzels et al. (2011), lakens (2016), and jeffreys (1961).

bayes factor | interpretation
> 100 | decisive evidence for ha over h0
30 – 100 | very strong evidence for ha over h0
10 – 30 | strong evidence for ha over h0
3 – 10 | substantial evidence for ha over h0
1 – 3 | anecdotal evidence for ha over h0
1 | no evidence
1/3 – 1 | anecdotal evidence for h0 over ha
1/10 – 1/3 | substantial evidence for h0 over ha
1/30 – 1/10 | strong evidence for h0 over ha
1/100 – 1/30 | very strong evidence for h0 over ha
< 1/100 | decisive evidence for h0 over ha

to measure discriminability and bias for bayes factors, the studies simulated in experiment 1 were also evaluated using four decision criteria related to the bayes factor (bf): bf > 1, bf > 2, bf > 3, and bf > 10. studies were classified as shown in table 3. note that for bayes factors that fell between the criterion and its inverse (e.g., 1/3 – 3), no classification was made because the data were inconclusive. this is why the outcomes do not sum to 1 in figure 1.
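the classification scheme, with its inconclusive zone between the criterion and its inverse, can be sketched as (a python illustration; the bayes factor values passed in are hypothetical):

```python
def classify(bf, effect_is_real, criterion=3.0):
    """Signal detection classification of one study from its Bayes factor,
    leaving values between 1/criterion and criterion unclassified."""
    if bf > criterion:
        return "hit" if effect_is_real else "false alarm"
    if bf < 1.0 / criterion:
        return "miss" if effect_is_real else "correct rejection"
    return "inconclusive"

# hypothetical studies: (bayes factor, whether the simulated effect was real)
outcomes = [classify(bf, real) for bf, real in
            [(8.0, True), (0.2, True), (1.5, True),      # real effects
             (4.0, False), (0.1, False), (0.8, False)]]  # null effects
```

because inconclusive studies drop out of the four outcome categories, the classified proportions need not sum to 1, which is the pattern visible in figure 1 for the bf > 2, 3, and 10 criteria.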
table 3. signal detection classification of data based on the example criterion bayes factor > 3 for a true effect (cohen's d = 0.50) and a null effect (cohen's d = 0).

modeled effect | bayes factor > 3 ("significant") | bayes factor < 1/3 ("not significant") | 1/3 < bayes factor < 3 ("inconclusive")
d = .50 | hit | miss | no classification
d = 0 | false alarm | correct rejection | no classification

the calculation of the aucs is a function of the bayes factor itself, rather than classifications of outcomes, so even though not all studies could be classified into the four sdt outcomes, all studies contributed to the auc calculation. the bayesfactor r package (morey, rouder, & jamil, 2014) was used to calculate the bayes factors. the default cauchy prior was used when calculating bayes factors, but different priors produced the same auc results. changing the prior produced shifts along the roc curve but did not change discriminability. as shown in figure 2, the aucs related to the bayes factor were also quite high. in fact, the aucs for the bayes factor corresponded perfectly to the aucs for p values. this means that for the situation simulated here, bayes factors are not any better (or worse) than p values at discriminating real effects from null effects. in other words, the bayes factor confers no advantage over p values at detecting a real effect versus a null effect for the current scenario. this is because bayes factors are redundant with p values for a given sample size.
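the article computed bayes factors with the bayesfactor r package. as a rough cross-language sketch of the same idea, the jzs bayes factor for a one-sample t-test can be obtained by numerically integrating the expression from rouder et al. (2009) over the prior on effect-size variance g; treat this as an approximation of the idea under stated assumptions (midpoint quadrature, default cauchy scale sqrt(2)/2), not a substitute for the package:

```python
import math

def jzs_bf10(t, n, r=math.sqrt(2) / 2, steps=4000):
    """JZS Bayes factor BF10 for a one-sample t test with n observations,
    numerically integrating the Rouder et al. (2009) expression via the
    substitution g = u / (1 - u) and a midpoint rule on u in (0, 1)."""
    v = n - 1  # degrees of freedom
    null_like = (1 + t * t / v) ** (-(v + 1) / 2)

    def integrand(g):
        a = 1 + n * g * r * r
        return (a ** -0.5
                * (1 + t * t / (a * v)) ** (-(v + 1) / 2)
                * (2 * math.pi) ** -0.5
                * g ** -1.5 * math.exp(-1 / (2 * g)))

    alt_like = 0.0
    for k in range(steps):
        u = (k + 0.5) / steps
        g = u / (1 - u)
        alt_like += integrand(g) / ((1 - u) ** 2) / steps

    return alt_like / null_like

bf_strong = jzs_bf10(t=5.0, n=50)  # large t: strong evidence for an effect
bf_null = jzs_bf10(t=0.0, n=50)    # t = 0: the evidence favors the null
```

note that the inputs are only the t statistic and the sample size, which is exactly the redundancy with p values discussed in the text.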
both p values and bayes factors can be calculated from the t-statistic and the sample size, so it is expected that they would be related. in these simulations, there was a near-perfect linear relationship between the (log of the) bayes factors and the (log of the) p values, as has been shown previously (benjamin et al., 2017; krueger & heck, 2018; wetzels et al., 2011). the equivalence in aucs between bayes factors and p values generalized to other scenarios as well, including one-sample t-tests and correlations (see figure 8). although the discriminability of p values and bayes factors was equivalent across a variety of situations, as revealed by equal aucs (see figures 2, 8, and 9), the exact relationship between them differed as a function of sample size. in experiment 6, for 30 different sample sizes ranging from 32 to 2000 per group, 100 simulations of 20 studies were conducted (10 with a cohen's d modeled at .50 and 10 with a cohen's d modeled at 0). for each sample size, a linear regression was conducted to predict the log of the bayes factor from the log of the p value. the results are shown in figure 9. these simulations show near-complete redundancy between p values and bayes factors. this redundancy also supports the conclusion that for the conditions simulated, p values and bayes factors are equally adept at distinguishing real effects from null effects.

figure 8. simulations were run for 20 studies (repeated 100 times) for 3 effect sizes for 3 power levels (two-tailed at alpha = .05) for 4 types of statistical tests. aucs for the bayes factors are plotted as a function of aucs for the p values. they are identical in every case, which is consistent with the claims of equal discriminability between p values and bayes factors. size of the symbol corresponds to effect size, which is cohen's d (for two-sample t-tests), cohen's dz (for one-sample t-tests), and r*2 (for correlations).
for the uneven two-sample t-test, group 2 had 20% more participants than group 1. the plot collapses across all conditions given that the patterns were the same regardless of test type, power, or effect size.

figure 9. outcomes from 100 simulations of 20 studies (half simulated as a null effect; half as a medium effect) for each of 30 different sample sizes ranging from 32 to 2000. color corresponds to sample size. panel a shows the area under the curve (auc) for p values and bayes factors as a function of sample size. a bigger auc indicates better discrimination between real and null effects. panel b shows the relationship between the p value and the bayes factor in the range for which p values are highest (the inset shows the relationship for the entire range, and the dotted box shows the area that has been expanded in the main figure). the legend corresponds to sample size. the black vertical line corresponds to a p value of .05, and the black horizontal line corresponds to a bayes factor of 1. panels c and d show the intercepts and slopes from linear regressions that predict the log of the bayes factor from the log of the p values. the intercept is the p value that corresponds to a bayes factor of 1, so it corresponds to the value of the p value along the horizontal line in panel b. the slope, plotted in panel d, corresponds to the steepness of the curves in panel b.

figure 10.
left column shows results from experiment 7 (equal number of null and real effects) and right column shows results from experiment 8 (four times as many null as real effects). in the top row, hit rates are plotted as a function of false alarm rates and criterion for experiment 7 (left panel) and experiment 8 (right panel). each point corresponds to a different decision criterion related to the posterior odds (bf > 1, 2, 3, and 10; not labeled, but within each cluster of 4 the points go sequentially from the top-right corner to the bottom-left corner) as a function of the prior odds (see legend). the receiver operating characteristic (roc) curves are plotted for three different sets of prior odds for each panel. the area under the curve (auc) is shown in grey. the curves and aucs are identical across all prior odds in each panel. in the bottom row, the proportion of each outcome (calculated as the number of each outcome divided by the total number of studies) across prior odds is shown only for the decision criterion of bayes factor > 3.

despite the equivalence in discriminability between p values and bayes factors, these simulations illustrate a previously acknowledged discrepancy in the conclusions supported by the two types of criteria (lindley, 1957).
specifically, in figure 9b, all data points to the left of the black vertical line that are also below the black horizontal line would be classified as significant according to the criterion of p < .05, but according to a bayes factor interpretation, the evidence would favor the null hypothesis over the alternative. this illustrates why it is possible to get results for which the p value indicates a significant finding (i.e., evidence for the alternative hypothesis) but the bayes factor shows evidence for the null hypothesis relative to the alternative. these conflicting outcomes occurred in studies for which sample size (or, more precisely, power) was high. these simulations help illustrate the point that for high-powered studies, a p value of .05 is more evidence for the null hypothesis than for the alternative hypothesis (lakens, 2015). when power is high, researchers using p values to determine statistical significance should use a lower criterion.

including priors

whereas bayes factors do not take into account the prior odds of an effect being real, the posterior odds do. posterior odds can be calculated by multiplying the bayes factor by the prior odds (see equation 1). the posterior odds are the probability of the alternative hypothesis (m = h1) given the data (d) over the probability of the null hypothesis (m = h0) given the data (d):

p(h1 | d) / p(h0 | d) = [p(d | h1) / p(d | h0)] × [p(h1) / p(h0)]. (equation 1)

to evaluate the effect of prior odds on discriminability, two additional experiments were conducted. in experiment 7, the same conditions as in experiment 1 were simulated, but aucs were calculated for posterior odds across three different prior odds: 0.1, 1, and 10. in experiment 8, everything was the same as in experiment 1 except there were four times as many studies with d = 0 (16 studies) as with d = .5 (4 studies). aucs were calculated for posterior odds across three prior odds (.25, 1, 4).
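equation 1 is a one-line computation; a small sketch with hypothetical numbers (a bayes factor of 3 combined with skeptical prior odds of 1/4):

```python
def posterior_odds(bayes_factor, prior_odds):
    """Equation 1: posterior odds of H1 over H0 = BF10 x prior odds."""
    return bayes_factor * prior_odds

# a bayes factor of 3 with skeptical prior odds of 1/4 (both hypothetical):
odds = posterior_odds(3.0, 0.25)
prob_h1 = odds / (1 + odds)  # convert odds to a posterior probability of H1
```

here "substantial" evidence (bf = 3) still leaves the alternative less likely than not, because the prior odds pull the posterior down; this is why the prior merely rescales every study's score, shifting points along the roc curve without changing their ordering.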
as shown in figure 10, adding information about prior odds to the bayes factor merely shifted the points along the roc curve but did not alter discriminability, regardless of the accuracy of the prior odds. in addition, changing the proportion of real effects did not have much impact on discriminability. in experiment 8, the mean auc was .95 (median = .97, sd = .07) for all sets of prior odds (as well as for p values), which was similar to the mean auc of .96 (median = .98, sd = .04) for all sets of prior odds (and for p values) in experiment 7. except for experiment 8, all of the simulations conducted involved simulating studies for which half had a true effect and half had a null effect. this assumes that effects are to be expected half of the time, which is an assumption that is unlikely to be true. the results from experiment 8 show, however, that similar patterns are found even when the null hypothesis is likely to be true. unreported simulations show similar patterns even when the alternative hypothesis is likely to be true. thus, the results regarding discriminability (measured with aucs) are independent of specific assumptions regarding the likelihood of the null hypothesis. put another way, the discriminability of p values and bayes factors is high in situations for which real effects are likely and in situations for which real effects are unlikely. obviously, more p values and bayes factors reach thresholds for significance when there are more significant effects, so "significant" effects are more common in 'safe' studies than in 'risky' studies (krueger & heck, 2018). nevertheless, the diagnosticity of the p value (and of the bayes factor) is high regardless of the likelihood of finding a real effect.

bayes factor and bias

as with p values, we can consider bias related to bayes factors.
as shown in figure 3, the cut-off that achieved maximum utility, assuming equal weights given to false alarms and misses, was bayes factor > 1. this contrasts with the typical interpretation of bayes factors (e.g., table 2), for which bayes factors between 1 and 3 are considered anecdotal evidence. unlike with p values, the threshold that should be used for bayes factors did not vary as much with changes in sample size as did the alpha levels of the p values (see figure 11). compare the red points to the green points, which correspond to p < .10 and p < .005. for smaller sample sizes, the red points achieve better performance than the green points, but for larger sample sizes, the relationship flips and the green points achieve better performance. this repeats the point made earlier that at larger sample sizes, a lower alpha should be used. for bayes factors, compare the light blue and purple points, which correspond to bayes factor thresholds of 1 and 3. for smaller sample sizes, the light blue points achieved better performance, but for larger sample sizes, the purple points achieved better performance. however, unlike with p values, this reversal was not nearly as dramatic, and the decision criterion of bayes factor > 1 performed better than, or nearly as well as, the other thresholds across all sample sizes. it is also worth noting that as sample size increased, all bayes factor criteria improved, whereas p values plateaued at their alpha levels. thus, another advantage of bayes factors is that increasing the amount of evidence increases their ability to accurately detect an effect. signal detection analysis is a tool that scientists can use to evaluate relative trade-offs across various decision criteria.
this is not to say that scientists should only use or always use decision criteria (as opposed to estimations of effect size, for example), but that when a criterion for statistical significance is adopted, consideration should be made for both false alarms and misses. if the goal is to maximize utility, giving equal weight to hits and correct rejections (or, equivalently, equal tolerance for false alarms and misses), distance to perfection can be used to assess various criteria. in the case of a medium effect size with 64 participants per group, the decision criteria of p < .10, p < .05, and bf > 1 led to better performance than the criteria of p < .005, bf > 3, and bf > 10. as sample size increased, the criteria of p < .005 and all tested bayes factor thresholds led to better performance than p < .10.

discriminability with effect size

as a final note, discriminability (as measured using aucs) was as good or better when using effect size (in this case, cohen's d) than when using p values or bayes factors (see figure 12). effect size improved discriminability because cohen's d is signed (i.e., it differentiates -.5 from .5). when discriminability was assessed using absolute effect size, the aucs matched those obtained with p values and bayes factors. the measure of effect size does not have the feature of a specific decision criterion for statistical significance, so for researchers who want strict thresholds for significance, effect size is unlikely to be a useful tool. but for researchers who want to know the strength of the evidence or the magnitude of the effect, effect size would be useful.
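why does the sign of cohen's d matter for discriminability? null studies sometimes yield negative observed effects; ranking by signed d places those below every positive real-effect estimate, whereas ranking by |d| interleaves them. a python re-sketch of this comparison, with parameters assumed from experiment 1 (d = .5 or 0, 64 per group; higher observed d treated as more "effect-like"):

```python
import random
import statistics

def observed_d(d_true, n, rng):
    """Observed Cohen's d for one simulated two-sample study."""
    a = [rng.gauss(d_true, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

def auc(real_scores, null_scores):
    """Probability a real-effect study scores higher than a null study."""
    pairs = [(r, s) for r in real_scores for s in null_scores]
    wins = sum(1.0 if r > s else 0.5 if r == s else 0.0 for r, s in pairs)
    return wins / len(pairs)

rng = random.Random(7)
real = [observed_d(0.5, 64, rng) for _ in range(100)]
null = [observed_d(0.0, 64, rng) for _ in range(100)]

auc_signed = auc(real, null)
auc_absolute = auc([abs(d) for d in real], [abs(d) for d in null])
```

`auc_signed` will typically be at least as high as `auc_absolute`, since taking the absolute value throws away the sign information that lets negative null estimates rank below the real effects.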
figure 11. distance to perfection was calculated as the euclidean distance between each point on the roc curve (see figure 2) and the top-left corner (which corresponds to a 100% hit rate and a 0% false alarm rate). distance to perfection scores were calculated for each of 100 sets of 20 studies (half of which were modeled as a null effect and half of which were modeled with cohen's d = .5) for each sample size. the data are grouped by sample size, and color corresponds to the criterion for statistical significance. error bars correspond to 95% confidence intervals.

figure 12. area under the curve (auc) for cohen's d as a function of the aucs for p values and bayes factors (bf). data are from experiment 1. each point corresponds to one set of 20 studies, with half modeled with cohen's d = .5 and half modeled with cohen's d = 0. the dotted line is at unity.

conclusion

an essential part of science is that it is replicable. but another essential part of science is to uncover new discoveries. changing the standard criterion for statistical significance merely moves the standard along the roc curve. any change to this standard, such as decreasing the required p value or using bayes factors instead, will not improve discriminability between real and null effects. rather, a change to be more conservative will decrease false alarm rates at the expense of increasing miss rates. false alarm rates should not be considered in isolation without also considering miss rates.
rather, researchers should consider the relative importance for each in deciding the criterion to adopt. this aligns with other recommendations for researchers to justify their alphas (lakens et al., 2018). in addition, given that true null results can be theoretically interesting and practically important, a conservative criterion can produce critically misleading interpretations by labeling real effects as if they were null effects. insights into criteria for statistical significance from signal detection analysis 15 moving forward, the recommendation is to acknowledge the relationship between false alarms and misses, rather than implement standards based solely on false alarm rates. open science practices this article earned the open materials badge for making the materials available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, are published in the online supplement. references begley, c. g., & ellis, l. m. (2012). drug development: raise standards for preclinical cancer research. nature, 483(29 march), 531533. doi: 10.1038/483531a benjamin, d. j., berger, j. o., johannesson, m., nosek, b. a., wagenmakers, e. j., berk, r., . . . johnson, v. e. (2017). redefine statistical significance. nature human behaviour. doi: 10.1038/s41562-017-0189-z camerer, c. f., dreber, a., forsell, e., ho, t.-h., huber, j., johannesson, m., . . . wu, h. (2016). evaluating replicability of laboratory experiments in economics. science, 351(6280), 1433-1436. collaboration, o. s. (2015). estimating the reproducibility of psychological science. science, 349(6251), aac4716. doi: 10.1126/science.aac4716 dienes, z. (2011). bayesian versus orthodox statistics: which side are you on? perspectives on psychological science, 6, 274-290. doi: 10.1177/1745691611406920 etz, a., & vandekerckhove, j. (2016). a bayesian perspective on the reproducibility project: psychology. plos one, 11(2), e0149794. 
doi: 10.1371/journal.pone.0149794
fiedler, k., kutzner, f., & krueger, j. i. (2012). the long way from α-error control to validity proper: problems with a short-sighted false-positive debate. perspectives on psychological science, 7(6), 661-669. doi: 10.1177/1745691612462587
fraley, r. c., & vazire, s. (2014). the n-pact factor: evaluating the quality of empirical journals with respect to sample size and statistical power. plos one, 9(10), e109019. doi: 10.1371/journal.pone.0109019
green, d. m., & swets, j. a. (1966). signal detection theory and psychophysics. new york: wiley.
ioannidis, j. p. a. (2005). why most published research findings are false. plos med, 2(8), e124. doi: 10.1371/journal.pmed.0020124
jeffreys, h. (1961). theory of probability. oxford, uk: oxford university press.
kass, r. e., & raftery, a. e. (1995). bayes factors. journal of the american statistical association, 90(430), 773-795.
krueger, j. i., & heck, p. r. (2017). the heuristic value of p in inductive statistical inference. frontiers in psychology, 8(908). doi: 10.3389/fpsyg.2017.00908
krueger, j. i., & heck, p. r. (2018). testing significance testing. collabra: psychology, 4(1), 11. doi: 10.1525/collabra.108
kruschke, j. k. (2013). bayesian estimation supersedes the t test. journal of experimental psychology: general, 142(2), 573-603. doi: 10.1037/a0029146
lakens, d. (2014). performing high-powered studies efficiently with sequential analyses. european journal of social psychology, 44, 701-710. doi: 10.1002/ejsp.2023
lakens, d. (2015, march 20). how a p-value between 0.04-0.05 equals a p-value between 0.16-0.17. retrieved from http://daniellakens.blogspot.com/2015/03/how-p-value-between-004-005-equals-p.html
lakens, d. (2016, january 14). power analysis for default bayesian t-tests. retrieved from http://daniellakens.blogspot.com/2016/01/power-analysis-for-default-bayesian-t.html
lakens, d., adolfi, f. g., albers, c. j., anvari, f., apps, m. a. j., argamon, s. e., . . . zwaan, r. a.
(2018). justify your alpha. nature human behaviour, 2(3), 168-171. doi: 10.1038/s41562-018-0311-x
lee, m. d., & wagenmakers, e. j. (2005). bayesian statistical inference in psychology: comment on trafimow (2003). psychological review, 112(3), 662-668. doi: 10.1037/0033-295x.112.3.662
lindley, d. v. (1957). a statistical paradox. biometrika, 44(1/2), 187-192.
macmillan, n. a., & creelman, c. d. (2008). detection theory: a user's guide (second edition). new york: psychology press.
mcshane, b. b., gal, d., gelman, a., robert, c., & tackett, j. l. (2018). abandon statistical significance. arxiv preprint arxiv:1709.07588.
morey, r. d. (2015). on verbal categories for the interpretation of bayes factors. retrieved from http://bayesfactor.blogspot.com/2015/01/on-verbal-categories-for-interpretation.html
morey, r. d., rouder, j. n., & jamil, t. (2014). bayesfactor: computation of bayes factors for common designs (version 0.9.8). retrieved from http://cran.r-project.org/package=bayesfactor
murayama, k., pekrun, r., & fiedler, k. (2014). research practices that can prevent an inflation of false-positive rates. personality and social psychology review, 18(2), 107-118. doi: 10.1177/1088868313496330
rouder, j. n., speckman, p. l., sun, d., morey, r. d., & iverson, g. (2009). bayesian t-tests for accepting and rejecting the null hypothesis. psychonomic bulletin & review, 16, 225-237. doi: 10.3758/pbr.16.2.225
salomon, e. (2015). p-hacking true effects. retrieved from http://www.erikasalomon.com/2015/06/phacking-true-effects/
schönbrodt, f. d., & wagenmakers, e.-j. (2018). bayes factor design analysis: planning for compelling evidence. psychonomic bulletin & review, 25(1), 128-142. doi: 10.3758/s13423-017-1230-y
sedlmeier, p., & gigerenzer, g. (1989). do studies of statistical power have an effect on the power of studies? psychological bulletin, 105(2), 309-316. doi: 10.1037/0033-2909.105.2.309
simmons, j. p., nelson, l. d., & simonsohn, u. (2011).
false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359-1366. doi: 10.1177/0956797611417632
r core team. (2017). r: a language and environment for statistical computing. retrieved from https://www.r-project.org
trafimow, d., & marks, m. (2015). editorial. basic and applied social psychology, 37(1), 1-2.
wetzels, r., matzke, d., lee, m. d., rouder, j. n., iverson, g. j., & wagenmakers, e.-j. (2011). statistical evidence in experimental psychology: an empirical comparison using 855 t tests. perspectives on psychological science, 6(3), 291-298. doi: 10.1177/1745691611406923

meta-psychology, 2022, vol 6, mp.2020.2628, https://doi.org/10.15626/mp.2020.2628. article type: replication report. published under the cc-by4.0 license. open data: yes. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: yes. edited by: rickard carlsson. reviewed by: adrien fillon, artur nilsson. analysis reproduced by: adrien fillon. all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/pf6dn

mortality salience effects fail to replicate in traditional and novel measures

bjørn sætrevik, operational psychology research group, department for psychosocial science, faculty of psychology, university of bergen
hallgeir sjåstad, department of strategy and management, norwegian school of economics, and snf centre for applied research at nhh

abstract

mortality salience (ms) effects, where death reminders lead to ingroup-bias and defensive protection of one’s worldview, have been claimed to be a fundamental human motivator. ms phenomena have ostensibly been identified in several hundred studies within the “terror management theory” framework, but transparent and high-powered replications are lacking.
experiment 1 (n = 101 norwegian lab participants) aimed to replicate the traditional ms effect on national patriotism, with additional novel measures of democratic values and pro-sociality. experiment 2 (n = 784 us online participants) aimed to replicate the ms effect on national patriotism in a larger sample, with ingroup identification and pro-sociality as additional outcome measures. the results showed that neither experiment replicated the traditional ms effect on national patriotism. the experiments also failed to support conceptual replications and underlying mechanisms on democratic values, processing speed, psychophysiological responses, ingroup identification, and pro-sociality. this indicates that the effect of death reminders is less robust and generalizable than previously assumed.

keywords: mortality salience, death reminders, worldview defence, terror management, replication.

the concept of mortality salience (ms) refers to a phenomenon where reminders of death lead to subconscious changes in attitudes and behaviour, typically in the form of increased ingroup-bias and behaviour that may serve the role of defending one’s cultural worldview. the ms effect has been reported in several hundred experiments since the 1980s, with unusually large effect sizes (for meta-analyses, see: burke et al., 2013; burke et al., 2010). the dominant theoretical framework to account for ms effects has been the “terror management theory” (tmt, greenberg et al., 1986; pyszczynski et al., 2015). this theory emerges from a psychodynamic approach to existential questions, and proposes automatic defence mechanisms that may protect the person from conscious death awareness. specifically, tmt states that cognitions related to mortality evoke an aversive state of existential anxiety, which motivates either the suppression of thoughts of vulnerability (proximal defences), or bolstering self-esteem and affirming cultural values to find meaning beyond death (distal defences).
this has some similarities to the concept of “psychological defence mechanisms” in psychodynamic theory. it is mainly the ms effects of distal defence, often referred to as “cultural worldview defence”, that have been investigated in social psychology experiments. in the terminology of tmt and the writings of ernest becker (1973), the general idea is that adhering to a cultural worldview can work as a buffer against the fear of death by providing a form of “symbolic immortality”. the aim of the current study was to provide a high-powered and preregistered replication of the ms effect using traditional outcome measures, and to use novel outcome measures to examine possible mechanisms of ms.

need for mortality salience replication

there is an ostensibly solid empirical basis for ms effects in terror management research, with a meta-analytical effect size of d = 0.82 (burke et al., 2010). however, the research tradition has also been called into doubt due to claimed theoretical weakness and non-falsifiability (fiedler et al., 2012; martin and bos, 2014), researcher effects (yen and cheng, 2013), failure to replicate past findings (trafimow and hughes, 2012), and contradictory empirical findings (hart, 2014). given the recent method reform in psychological research and other fields (munafò et al., 2017), it should also be noted that there have been few, if any, preregistered direct replications with open datasets showing robust ms effects. as with other psychodynamic theories, the postulation of complex subconscious processes makes it challenging to empirically test ms effects. past studies have used a variety of subtle experimental manipulations without appropriate manipulation checks. a review (burke et al., 2010) is sometimes cited to argue that the ms effect is thoroughly empirically established.
however, the review also reveals a great deal of variation in the experiment designs, in terms of different manipulations and different outcome measures, whether or not there are “delay tasks” (and their duration and number), and whether the ms effect relies on various covariates. few of the studies report performing manipulation checks or provide open data. despite the vast number of studies, no standard experiment approach for producing the ms effect appears to have emerged. it is noteworthy that there are still few preregistered replications of the basic ms effect. hayes and schimel (2018) performed a series of three experiments, where study 2 was a preregistered online experiment. this experiment showed a decrease in self-esteem after performing a word-completion task with death-related words. however, the effect only emerged when applying an unregistered exclusion of some of the participants. pepper and colleagues (2017) failed to replicate the ms effects from a previous study (griskevicius et al., 2011), and a recent preregistered replication of a much-cited tmt experiment failed to observe any evidence for a ms effect (rodríguez-ferreiro et al., 2019). this study also provided an analysis and a discussion of the literature cited by burke and colleagues (2010), and argued that the distribution of reported effect sizes given the sample sizes did not follow the distribution one would expect from complete reports of a true effect. this makes it an open question to what extent the ms effect can be reproduced and replicated under more restrictive and transparent conditions.

mechanisms of mortality salience

ms effects are typically described in relatively general terms (i.e., “threshold for awareness”, and “proximal and distal defences”), without going into details about the cognitive or psychophysiological mechanisms for the effects. as with other priming effects, ms may be accounted for by spreading activation in semantic networks (morewedge and kahneman, 2010).
if so, cognitive representations related to death are closely linked to representations of cultural values in an associative network, so that activation of one part of the network lowers the threshold for activating semantically linked parts of the network (see arndt et al., 2002, for a similar account). if ms works through such a mechanism, a conceptual replication would be to expect the ms manipulation to lead to increased distraction when a stroop task presents words related to in-group classification. this may be compared to ms studies that have used a lexical decision task and similar measures to estimate death thought accessibility (hayes et al., 2010; hayes et al., 2008). further, it has been argued that ms increases “tension”, “discomfort” or “reluctance” associated with being reminded of death, which proximal or distal defences may reduce (greenberg et al., 1995; greenberg et al., 1992). if so, one may expect the ms manipulation to lead to increased psychophysiological activation either during stimulation, or as a residual effect while the ms is in effect (see e.g. arndt, 1999; arndt et al., 2001; rosenblatt et al., 1989). previous research has suggested that ms effects may be moderated by individual differences in cognitive style (juhl and routledge, 2010) or political orientation (burke et al., 2013), which would indicate that one should control for such factors or examine possible interaction effects.

generalizability of mortality salience

although it has been claimed that ms is a fundamental motivator for vast aspects of the human condition (greenberg et al., 1986; greenberg et al., 1997), a majority of the studies (72.6% in a meta-analysis, burke et al., 2010) used the same outcome measures, namely an effect on attitude measures.
moreover, a considerable part of the ms literature has used outcome measures that may be confounded with aggression towards out-group members in the face of threats (such as increase in patriotism or support for the local sports team, see e.g. stets, 2006; turner et al., 1994). the tmt claim that ms increases adherence to cultural values would be better supported if it could be shown for cultural values that cannot be construed as out-group aggression. finally, although some cross-cultural work has been done (heine et al., 2002; routledge et al., 2010), studies in more diverse cultural settings are needed. cross-cultural studies could indicate the wider applicability of a ms effect, and could also contribute to excluding competing causal mechanisms. if ms effects could be shown for patriotism and other values outside of north america and central europe, this would further support the argument that ms enhances ingroup processes.

study overview

as reviewed above, there are reasons to question the robustness of ms effects and the underlying mechanism. to address concerns regarding mechanism and generalizability, our experiment 1 was done in a lab setting and included both the traditional measure of national patriotism and novel measures of democratic values, prosociality, stroop processing and psychophysiology. to address the need for high-powered direct replications, experiment 2 was conducted as an online experiment of ms effects on a measure of national patriotism, with additional measures of in-group favouritism and prosociality. the sample size for main effects in experiment 1 is about twice as large as the typical study in the published ms literature, whereas the sample size in experiment 2 is over 17 times larger than the typical ms study (calculated from the n per analysis cell reported in the meta-analysis of burke et al., 2010). this provides sufficient power for both experiments to detect the effects reported in the literature.
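as a rough illustration of why these sample-size comparisons matter, statistical power can be approximated with a normal approximation to the two-sample t-test. this is only a sketch under stated assumptions: the effect size d = 0.82 is the meta-analytic value cited above, while the 1.96 critical value and the example group sizes are illustrative and not the authors' own power analysis.

```python
import math

def normal_cdf(x: float) -> float:
    """standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(d: float, n_per_group: int, z_crit: float = 1.96) -> float:
    """normal-approximation power of a two-sided two-sample test at
    alpha = .05, counting only the upper rejection region, for a true
    standardized effect size d and n participants per group."""
    noncentrality = d * math.sqrt(n_per_group / 2.0)
    return normal_cdf(noncentrality - z_crit)

# with the meta-analytic d = 0.82, power rises steeply with group size;
# compare hypothetical cells of 25 vs. 400 participants:
small, large = approx_power(0.82, 25), approx_power(0.82, 400)
```

note that this approximation slightly overstates power at small n compared with the exact noncentral t distribution; it is only meant to show why a 17-fold increase in sample size matters for detecting the reported effects.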
power to detect effects of different sizes is discussed in a later section. all hypotheses and analysis approaches for both experiments were preregistered ahead of data collection and were performed in accordance with local ethical guidelines.

experiment 1

background

based on the above review, the main aim of experiment 1 was to replicate the traditional ms effect on national patriotism in a lab setting. as further aims, we also attempted to conceptually replicate the ms effect in novel but theoretically related outcome measures, while controlling for individual cognitive and psychosocial sensitivity to the manipulation. our preregistration (available at https://osf.io/ec4yk) described four manipulation checks and six hypotheses, which are justified below and listed in table 2. to test that the construct of mortality interferes with cognitive processing, we checked (e1mc1) whether the ms group were slower to respond to death-related words. to assess whether we succeeded in manipulating ms on a psychophysiological level, we checked (e1mc2) whether the ms group had higher psychophysiological activation. further, we checked that (e1mc3) the pro-patriotic essay was preferred over the anti-patriotic essay, and whether (e1mc4) the pro-democratic essay was preferred over the anti-democratic essay. in order to directly replicate the most common type of ms research, where ms increases preference for national patriotism, experiment 1 measured preference for pro- and anti-norway essays (essays were taken from rosenblatt et al., 1989, with minor cultural adjustments). here, we expected (e1h3) participants in the ms group to show a higher preference for the essay expressing national patriotic values. data was collected among a norwegian population where democratic values of privacy, citizenship and human rights are mainstream pro-social values.
on this background, the patriotic essays were supplemented with essays about how the concern for democratic values should be handled in the aftermath of a terrorist attack. if ms increases the relevance of ingroup cultural values, one would expect (e1h2) an effect of increased preference for an essay expressing democratic values, compared to an essay expressing anti-democratic values. the novel essays were included to test whether ms effects could be shown to be independent of outgroup aggression. a possible underlying mechanism of ms may be that existential threats make membership in social groups more important, and thus make people more aware of social categorization. if so, we would expect (e1h1) that ms activates cognitive constructs related to “social categorization”, and thus makes these words more intrusive on stroop processing, resulting in longer response times (“rt”). while most ms studies measure attitudes (using essays like the types mentioned above), it would be beneficial to supplement this with measures of behavioural intentions. previous research on charitable giving has suggested that people are more generous towards recipients that belong to a common ingroup (everett et al., 2015; grimson et al., 2020). some studies have found ms to increase donation to charities (jonas et al., 2002; roberts and maxfield, 2019; zaleskiewicz et al., 2015). we included a novel measure of pro-sociality that asked participants how they would have shared a hypothetical lottery prize with individuals that were closely or more distantly related to them. (footnote 1: the study was preregistered before any of the data was inspected. please note that different dates may be displayed for the preregistration of experiment 1. the correct timestamp is 2015-10-12, as shown in the osf registration date and the osf version control: https://osf.io/ec4yk)
this was intended as a short online measure that could also distinguish between charity towards different groups, in order to better explore the motivation for the charity (i.e., giving to family and relatives, as opposed to giving to strangers). this measure was partly inspired by singer’s (2011) concept of a “moral circle” that may include people more or less “distant” from you. the structure of the task is very similar to the “dictator game” (kahneman et al., 1986), where the decision-maker is asked to divide a given endowment between themselves and an anonymous partner, which is an established choice measure of generosity that has previously been adapted to charity giving as well (see sjåstad, 2019). in the tmt account, one would expect (e1h4) the ms group to express a preference for sharing more of the prize with non-relatives, in an attempt to be remembered beyond their physical death by a larger social circle. previous studies have claimed that ms effects are more evident in more cognitively flexible individuals (juhl and routledge, 2010). we thus expect (e1h5) the ms effects on the four measures listed above to be enhanced for participants low on “need for closure” (nfc, federico et al., 2007). finally, one may expect individual differences in the effectiveness of the ms stimulation, and that this may be indexed by psychophysiological measures. heart rate variability (hrv, acharya et al., 2006) may be used as an index of the body’s ability to adapt to the changing demands of the environment. higher hrv (more high-frequency modulation of heartbeat intervals) has been taken to indicate more physiological adaptability and executive function, while lower hrv has been argued to indicate states of stress and emotional activation (delaney and brodie, 2000; lane et al., 2009). hrv measurement is unobtrusive once the sensor has been mounted.
the tmt account states that ms leads to an uncomfortable emotional state that is alleviated through either proximal or distal defences. we thus expect (e1h6) the ms effects on the four measures listed above to be enhanced for participants with lower hrv. see table 1 for an overview of manipulation checks and hypotheses. the tmt literature emphasizes the need for a “delay task” to avoid conscious processing of the death reminder (although delay tasks are not used consistently, see burke et al., 2010). no explicit delay task was used in experiment 1, to avoid the risk of fatigue effects of an overly long experiment obscuring any ms effects. however, there were two tasks not related to mortality following the traditional ms manipulation and preceding the traditional ms measurement (see the experiment procedure in the next section). in particular, the “social stroop task” may serve the role of a delay task that precedes all outcome measures.

experiment 1 methods

experiment 1 outline

we conducted a laboratory experiment where the predictor variable was ms vs. control manipulation (between subjects, two conditions). the outcome variables were rt for death-related words, rt for social words, preference for pro-democratic essays, preference for pro-patriotic essays, nfc and hrv. the preregistered experiment procedure and materials can be found online at https://osf.io/naxz6/. data was collected in our lab between 2015-09-29 and 2015-10-30.

experiment 1 sample

we recruited 101 university students (44 female) through email advertising (the preregistered sample was 100; one participant was replaced during data collection due to non-compliance). this sample size was set in order to be larger than the typical ms experiment in the published literature, while also restricted by practical concerns for in-lab studies. please see the full power analysis in the discussion.
all participants were undergraduate psychology students, self-identified as having a norwegian identity and normal colour vision. the median age group was 22-25 years old. the experiment program randomized participants to the ms group or the control group, without the experimenters knowing who was in which group. due to an administration error, there were 52 participants in the ms group and 48 participants in the control group. no participants were excluded in the data analysis.

experiment 1 procedure

the whole experiment took about 25 minutes, and was conducted in sound-attenuated testing booths, on desktop computers running the e-prime experiment presentation software (psychology software tools, 2012), where responses were given with a pc keyboard. (footnote 2: note that the preregistrations distinguished between e1h4a and e1h4b. e1h4b was based on the theoretical framework of coalition psychology, and would be supported if ms group participants expressed a greater preference to share the prize with genetically related individuals. we have excluded this discussion since the distinction between the two accounts is unclear and neither of them was supported.) after signing an informed consent form, all participants went through the following experiment procedure:

1. a seven-item scale for «need for closure» (about 1 minute).
2. ms manipulation: two questions where participants were asked to write short responses about either «death» or «toothache» (2-3 minutes).
3. a «social stroop» task with half the words related to social categorization (about 5 minutes).
4. pro- and anti-democratic essays (order counterbalanced), each followed by five questions evaluating the essay’s content and author (about 5 minutes).
5. pro- and anti-patriotic essays (order counterbalanced), each followed by five questions evaluating the essay’s content and author (about 4 minutes).
6. pro-sociality measure (1-2 minutes).
7.
a «death stroop» task with half the words related to death (2-3 minutes).

in the beginning of the experiment (stage 1), participants filled out a questionnaire for cognitive style, using a seven-item measure of nfc (federico et al., 2007, translated to norwegian by the authors). for the manipulation (stage 2), participants were randomized into two experimental groups, which were asked two different questions. the randomization was done by the experiment presentation software and was double-blinded for both participants and experimenters. we used the traditional manipulation of ms (rosenblatt et al., 1989), where both groups were asked to write down their answers to two short questions presented sequentially, using the standard instructions of «responding based on gut feeling». the first question asked about thoughts and feelings evoked by «death» for the ms group (or «toothache» for the control group), while the second question asked about what they thought would happen at death and after death (or toothache). this manipulation was used in 79.8% of the ms studies in a meta-analysis (burke et al., 2010). a manual inspection showed that all participants provided relevant answers. a stroop task was presented (stage 3), with words written in red, blue, green or yellow text against a grey background, and participants were asked to indicate what colour each word was written in as quickly as possible. 50% of the words were related to social categorization (such as «them», «us», «conflict» and «cooperation»), and 50% were neutral words matched for letter length and word frequency. there were 200 trials, and the order of presentation for the words was randomized online. participants responded by using mouse clicks, where the placement of the response boxes switched between trials (this was done to avoid preference for left/right side or centre/lateralized responses, and thus each trial required a visual search for the intended response).
for each participant, we calculated the ratio of the time taken to respond to social words divided by the time taken on neutral words. only responses with rt within one sd of the participant’s mean were included in the analysis (applying a more lenient criterion of two sds did not significantly change the results). next (stages 4 and 5), four brief essays were presented. the first two essays (about 200 words long) were novel for the study and presented two opposing views of how norwegian security policy should be handled in the aftermath of a terrorist attack that happened four years before the data collection. the themes of these essays were whether norwegian society was essentially safe or under threat, whether extreme viewpoints should be discussed in public or censored, whether the best safeguard against terrorism is social integration and prevention or surveillance and control, and whether terror measures should be balanced against democratic rights or not. the next two essays (about 120 words long) were the pro- and anti-patriotic essays that are traditionally used in ms studies (rosenblatt et al., 1989). in the review of ms experiments by burke et al. (2010), a wide range of outcome variables is used, but the most common finding is that ms leads to a more favourable evaluation of national patriotic essays (and a less favourable evaluation of non-patriotic essays). the essays had been translated into norwegian by the authors, and one aspect was changed to suit the norwegian setting (from “picking fruits” to “work as store clerk”). all four essays are available online, both in norwegian as they were used in the experiment and translated to english (https://osf.io/d2zus/). whether the pro or anti essays were presented first was counterbalanced between participants.
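the per-participant stroop index described above can be sketched as follows. this is an illustrative reconstruction, not the authors' analysis code (which was run in jamovi); the paper does not state whether the trimming mean and sd were computed over all trials or per word type, so this sketch assumes the participant's overall rt distribution.

```python
from statistics import mean, stdev

def stroop_interference_ratio(social_rts, neutral_rts, k=1.0):
    """ratio of mean rt on social-categorization words to mean rt on
    matched neutral words, keeping only responses within k sds of the
    participant's overall mean rt (k=1 per the preregistered criterion,
    k=2 for the more lenient check)."""
    all_rts = list(social_rts) + list(neutral_rts)
    m, s = mean(all_rts), stdev(all_rts)
    keep = lambda rts: [rt for rt in rts if abs(rt - m) <= k * s]
    return mean(keep(social_rts)) / mean(keep(neutral_rts))
```

a ratio above 1 indicates slower responding to social-categorization words, the pattern that e1h1 predicted for the ms group.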
after each of the four essays, participants were asked to rate on a nine-point scale how well they liked the author, how intelligent and knowledgeable they thought the author was, and to which extent they agreed with the essay and thought it made an accurate assessment of the issue (higher scores indicating a more positive evaluation). essay evaluations were calculated as the participant’s average score for the answers to the five questions for each of the four essays (cronbach’s alpha for pro-democratic essays = .92, for anti-democratic essays = .93, for the pro-patriotic essay = .89, for the anti-patriotic essay = .91). for each participant, a difference score was calculated between the responses to the pro- and the anti-patriotic essays, and similarly for the democratic essays (for both difference scores, higher values indicated being more positive to the pro than the anti essays). thereafter (stage 6), a novel measure of pro-sociality was applied, where the participants were asked how they would have liked to share a hypothetical lottery prize equivalent to about usd 1,000,000 between their (a) close family (parents and siblings), (b) relatives (grandparents, uncles and aunts, cousins), (c) friends and (d) charities, and (e) what they would keep for themselves and any immediate family. participants typed in a percentage for how much they wanted to share with each party. all the entered percentages were displayed on screen. the participants were asked to check if they were satisfied with the distribution, and had the option to distribute again. a ratio was calculated of the percentage assigned to friends and charity over the percentage kept for self, family and relatives: (c+d) / (a+b+e). finally (stage 7), a second stroop task (92 trials) was presented, where 50% of the words were related to mortality (e.g., «funeral», «obituary» and «mortal»), with matched neutral words.
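the pro-sociality ratio (c+d) / (a+b+e) described above amounts to a simple arithmetic operation on the five entered percentages. a minimal sketch (the function name and the example split are illustrative, not from the original materials):

```python
def prosociality_ratio(family, relatives, friends, charity, self_share):
    """share given to friends (c) and charity (d) relative to the share
    kept for close family (a), relatives (b) and oneself (e).
    all arguments are percentages and should sum to 100."""
    assert abs((family + relatives + friends + charity + self_share) - 100) < 1e-9
    return (friends + charity) / (family + relatives + self_share)

# e.g. 30% family, 10% relatives, 15% friends, 5% charity, 40% self:
# (15 + 5) / (30 + 10 + 40) = 0.25
```

higher values indicate sharing more of the prize beyond one's kin group, the direction the tmt account (e1h4) would predict for the ms group.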
presentation, response and calculations were similar to those in the first stroop task (stage 3). this stage was placed last in the experiment, to prevent the presentation of the death-related words from interfering with the assumed ms effect on previous stages. this is in line with the concern of hayes and schimel (2018) that activation of death-related constructs may disrupt the measurement of ms effects. throughout the experiment, heart rate was measured from all participants using a polar rs800cx waist sensor and wrist recording unit. we excluded hrv data from participants with signal loss for more than 1/3 of the recording. we had a preregistered approach to select a 5-minute analysis window from the middle of the recording (which roughly matches when participants evaluated the essays), and the onset was adjusted based on data quality in the window (before participants were linked to their experimental condition). hrv was calculated as the root mean square of successive differences (rmssd) of the distances between the peaks of each qrs complex (rr beat-to-beat interval, a time-domain analysis). a conventional interpretation (delaney and brodie, 2000) is that lower hrv indicates states of stress and emotional activation (but note that this interpretation has been disputed).

experiment 1 results

based on our planned directional predictions, we performed one-tailed null hypothesis testing of our manipulation checks e1mc1, e1mc2, e1mc3 and e1mc4 and hypotheses e1h1, e1h2, and e1h3. as the direction of the hypothesis e1h4 for the pro-sociality task was not clear in the preregistration, we tested it with a two-tailed test. all analyses were done in the jamovi software (jamovi project, 2019). see table 1 for a summary of manipulation checks, hypothesis testing and results. the dataset (https://osf.io/2q7kp/) and analyses (https://osf.io/n6ysk/) are available online.
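rmssd has a standard time-domain definition: the root mean square of the differences between successive rr intervals. a minimal sketch of that computation (variable names are illustrative; the authors' actual hrv pipeline is not published as code here):

```python
import math

def rmssd(rr_intervals_ms):
    """root mean square of successive differences between adjacent
    rr (beat-to-beat) intervals, in milliseconds; lower values are
    conventionally read as more stress and arousal, though that
    interpretation is disputed."""
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# rmssd([800, 810, 790, 805]) -> successive differences 10, -20, 15
# -> sqrt((100 + 400 + 225) / 3) ≈ 15.55 ms
```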
manipulation check

the first manipulation check (e1mc1) gave no indication that the semantic construct of «death» was more central for the ms participants. neither did the second manipulation check (e1mc2) show a significant effect of ms on psychophysiological activation (hrv). an additional test of hrv at the time of the ms manipulation also failed to find a significant effect of ms (one-tailed p = .081). the remaining manipulation checks (e1mc3 and e1mc4) showed that the intended essay was preferred in both pairs of essays (pro-norway and pro-democratic values, respectively), confirming that the essay measure was successful at creating a cultural ingroup versus outgroup scenario.

confirmatory analyses

the planned tests of e1h1, e1h2 and e1h3 were performed with one-tailed t-tests in the direction stated in the preregistration. each was followed up with a general linear model analysis in which nfc score or hrv score was included as a covariate to test for interaction effects (corresponding to the e1h5 and e1h6 hypotheses in the preregistration, respectively). this approach is functionally equivalent to the glm approaches described in the preregistration. the planned test for e1h1 showed no significant main effect of ms on the «social stroop» test, nor any significant interaction with nfc or hrv. e1h2 showed no significant main effect of ms on the rating of democratic essays, nor any significant interaction with nfc or hrv. e1h3 showed no significant main effect of ms on the rating of patriotic essays, nor any significant interaction with nfc or hrv. answer distributions on this central outcome variable are shown in violin plots in figure 1, indicating no difference in distribution between the experiment conditions. thus, experiment 1 did not find any support for the primary hypotheses of the ms effects found in the tmt literature.

[3] this pre-registered approach is consistent with the standard method in previous research on tmt.
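as an illustration, the directional tests and cohen's d values reported in table 1 take the following general form (a sketch with made-up toy scores, not the study data; the authors ran these tests in jamovi rather than python):

```python
import numpy as np
from scipy import stats

# hypothetical essay-evaluation scores for two conditions
ms_group = np.array([5.0, 6.0, 7.0, 5.0, 6.0])
control = np.array([4.0, 5.0, 5.0, 4.0, 5.0])

# one-tailed (directional) student's t-test, as prescribed in the preregistration
t, p = stats.ttest_ind(ms_group, control, alternative='greater')

# cohen's d using the pooled standard deviation (equal group sizes)
pooled_sd = np.sqrt((ms_group.var(ddof=1) + control.var(ddof=1)) / 2)
d = (ms_group.mean() - control.mean()) / pooled_sd
```

the `alternative='greater'` argument makes the p-value one-tailed in the predicted direction; the two-tailed test used for e1h4 would instead use the default `alternative='two-sided'`.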
open data and code are provided in order for interested readers to perform alternative analyses (e.g., using the pro-essay score as a covariate for the effect on the anti-essay score).

table 1. list of manipulation checks and hypotheses, operationalization, tests and extent of support in experiment 1 (one-tailed p-values where not otherwise indicated).

e1mc1: ms will make semantic constructs related to death more obtrusive on reading.
operationalization: the ms group will show longer rt for reporting the colour of words with «death related» content compared to words with neutral content.
test: one-tailed t-test of the difference between conditions in the ratio of death-word rt to neutral-word rt.
result: not supported (n = 99, ms group m = 11.9 ms (sd = 66) vs. control group m = 7.2 ms (sd = 48.3), p = .345, ci = -inf. – 14.7, d = -0.08).

e1mc2: ms will increase psychophysiological activation.
operationalization: participants in the ms group will have lower hrv than the control group.
test: one-tailed t-test of rmssd difference between conditions during essay reading.
result: not supported (n = 75, ms group m = 44.7 (sd = 19.9) vs. control group m = 54.1 (sd = 31.7), p = .065 in expected direction, ci = -0.8 – inf., d = 0.35).

e1mc3: the pro-democratic essay represents the majority view.
operationalization: participants will show a preference for the pro-democratic essay.
test: one-tailed t-test will show a higher average score on five questions about the pro-democratic essay than the average of the same questions for the anti-democratic essay.
result: supported (n = 100, pro-essay m = 6.7 (sd = 1.3) vs. anti-essay m = 3.7 (sd = 1.5), p < .001 in expected direction, ci = 2.73 – inf., d = 1.64).

e1mc4: the pro-patriotic essay represents the majority view.
operationalization: participants will show a preference for the pro-patriotic essay.
test: one-tailed t-test will show a higher average score on five questions about the pro-patriotic essay than the average of the same questions for the anti-patriotic essay.
result: supported (n = 100, pro-essay m = 6.5 (sd = 1.25) vs. anti-essay m = 4.1 (sd = 1.5), p < .001 in expected direction, ci = 2.08 – inf., d = 1.37).

e1h1: ms will increase activation of semantic constructs related to social categorization.
operationalization: participants in the ms group will have longer rt to the social words than to the neutral words in a stroop task.
test: a t-test of experiment group (ms group vs. control group) on the stroop effect for social words as outcome variable. to test the h5a prediction, the same relationship was also tested with a two-way anova with an added interaction of nfc score. to test the h6a prediction, the same relationship was also tested with a two-way anova with an added interaction of hrv index.
result: not supported (n = 99, ms group m = -2.3 ms (sd = 40.8) vs. control group m = 11.6 ms (sd = 58.5), p = .09, ci = -inf. – 30.7, d = .28). e1h5a not supported (interaction p = .44). e1h6a not supported (interaction p = .55).

e1h2: ms will increase support for democratic values.
operationalization: participants in the ms group will show a higher preference for the democratic essay compared to the anti-democratic essay.
test: a t-test for experiment group (ms group vs. control group) on pro/anti-democratic essays as outcome variable. to test the h5b prediction, the same relationship was also tested with a two-way anova with an added interaction of nfc score. to test the h6b prediction, the same relationship was also tested with a two-way anova with an added interaction of hrv index.
result: not supported (n = 100, ms group m = 3.1 (sd = 1.9) vs. control group m = 2.9 (sd = 1.8), p = .29, ci = -inf. – 0.41, d = -0.11). e1h5b not supported (interaction p = .87). e1h6b not supported (interaction p = .45).

e1h3: ms will increase national patriotism.
operationalization: participants in the ms group will show a higher preference for the patriotic essay compared to the anti-patriotic essay.
test: a t-test for experiment group (ms group vs. control group) on patriotic essays as outcome variable. to test the h5c prediction, the same relationship was also tested with a two-way anova with an added interaction of nfc score. to test the h6c prediction, the same relationship was also tested with a two-way anova with an added interaction of hrv index.
result: not supported (n = 100, ms group m = 2.3 (sd = 1.75) vs. control group m = 2.4 (sd = 1.73), p = .58, ci = -inf. – 0.65, d = .04). e1h5c not supported (interaction p = .85). e1h6c not supported (interaction p = .29).

e1h4: ms will affect the degree of pro-sociality.
operationalization: the amount shared with friends and charities relative to the amount shared with family will be different for the ms group.
test: a t-test for experiment group (ms group vs. control group) on the ratio of giving to friends and charities as outcome variable. to test the h5d and h5e predictions, the same relationship was also tested with a two-way anova with an added interaction of nfc score. to test the h6d and h6e predictions, the same relationship was also tested with a two-way anova with an added interaction of hrv index.
result: not supported (n = 100, m = 0.44 (sd = 0.71) vs. m = 0.25 (sd = 0.31), two-tailed p = .09, ci = -inf. – -0.01, d = 0.35). e1h5d not supported (interaction p = .22). e1h6d not supported (interaction p = .99).

figure 1. violin plot of scores on the patriotic essay evaluation difference score in experiment 1, which did not show a significant difference between mortality salience and the control condition (n = 100, norway).

on the measure of pro-sociality, the ms group stated they would share more (23.1%) with friends and charities than the control group (16.9%).
an effect in that direction would support the e1h4 prediction derived from tmt that ms should increase preference for sharing with non-relatives. however, the effect was not significant in the two-tailed test prescribed by the bidirectional hypothesis in the preregistration. the score distributions indicate that the difference may have been driven by a few extreme values (see violin plot in the analysis files on osf). nevertheless, the difference in means between the groups justifies a further examination of this measure, which is performed in experiment 2. there were no significant interactions of nfc or hrv on the pro-social task.

experiment 2 background

the main aim of experiment 2 was to directly replicate the traditional effect of ms on national patriotism in an american sample. as additional aims, we wanted to conceptually replicate ms effects on ingroup identification and pro-sociality. the preregistration for experiment 2 is available at https://osf.io/d26fq. as manipulation checks, we (e2mc1) verified that the pro-usa essays were in fact preferred over the anti-usa essays. in addition, we (e2mc2) manually verified that participants had in fact provided meaningful answers to the manipulation questions about death (preregistration of this analysis: https://osf.io/sw6md). the first hypothesis (e2h1) tested whether ms effects involve mechanisms of ingroup identification and group membership. this corresponds to the mechanism tested in e1h1 in experiment 1 (the "social stroop test"), but the current test has higher face validity. moreover, the test is an attempt to replicate an effect of ms leading to higher ingroup identification previously found in an italian sample (castano et al., 2002).
the second hypothesis (e2h2) was intended to directly replicate the traditional lab experiments, where participants who write short answers to questions about death showed increased patriotism later in the study (greenberg et al., 1994; greenberg et al., 1992; simon et al., 1997). we attempted to make both the manipulation and the outcome measure in the online experiment as similar as possible to the traditional experiments. the third hypothesis (e2h3) was intended to further explore the effects of ms on the novel measure of pro-sociality used in experiment 1. this measured generosity towards people outside one's family, and is thus an indicator of pro-sociality. previous studies have shown ms to lead to increased pro-sociality (jonas et al., 2002; roberts and maxfield, 2019; zaleskiewicz et al., 2015). while experiment 1 had no explicit delay task, experiment 2 included a 20-item mood measure to serve this function, to maintain similarity with the research we wanted to replicate. the same delay task is often used in the tmt literature. the delay task was presented after the ms manipulation, and before the three outcome measures. finally, we also measured political orientation in order to perform a preregistered exploration of whether political orientation moderates the predicted effects of e2h1, e2h2 and e2h3.

experiment 2 methods

experiment 2 outline

we conducted a high-powered online experiment on a total sample of 800 us participants. as in experiment 1, the predictor variable was ms versus control. the outcome variables were american ingroup identification, national patriotism, and pro-sociality.

experiment 2 sample

a total of 803 us participants signed up for this study in exchange for usd 0.50. after excluding 19 duplicate responders and one incomplete responder, the final sample consisted of 784 participants (389 randomised to the ms condition, age m = 38, 61% female).
as noted earlier, this sample is more than 17 times larger per analysis cell than the typical ms experiment in the tmt literature (see the power-curve in the discussion). data were collected on july 12th, 2019.

experiment 2 procedure

the amazon mechanical turk online platform (buhrmester et al., 2011; hauser and schwarz, 2016) was used to recruit participants for a study about "personality and attitudes", and the experiment was programmed and administered on the qualtrics platform (available as online materials: https://osf.io/jm4uh/). on average, participants spent slightly over 8 minutes on the experiment. the experiment procedure was as follows:

1. ms manipulation: two questions where participants were asked to write short responses about either «death» or «toothache» (2-3 minutes).
2. delay task: indicate current mood, 20 questions (about a minute).
3. dv1: ingroup identification, 5 questions (about half a minute).
4. dv2: national patriotism, ratings of pro-usa and anti-usa essays, order counterbalanced (about 3 minutes).
5. dv3: pro-social task (1-2 minutes).
6. moderator variable: political orientation (a few seconds).

when clicking through to the survey, participants were randomized to receive the ms manipulation or the control task. at the beginning of the experiment (stage 1), the ms and control groups were asked to write brief answers to the same two questions as in experiment 1 about either "death" or "toothache", respectively. the two questions were presented on separate pages, and answers were written in an empty text box beneath each question. to prevent participants from simply clicking their way through the survey without responding, the answer to each of the two questions had to be at least 15 characters long before they could proceed. while this approach to a manipulation check has weaknesses, it should be noted that manipulation checks even to this extent are rarely reported in the published ms literature.
an advantage of this manipulation check is that the subsequent assessment of written responses should not interfere with the subconscious processing of death reminders that is proposed by tmt. on the next screen (stage 2), there was a delay task of answering 20 questions about current mood (panas; watson et al., 1988). participants were asked to rate the extent to which they felt "interested", "distressed", "excited" etc. on a five-point scale from "not at all" to "extremely". this task was included to maintain similarity to traditional ms experiments. the same task is used in 47.7% of the ms literature (burke et al., 2010), most typically as the only delay task. the inclusion of a delay task between the manipulation and the outcome variables is sometimes argued to be critical for the ms phenomenon to emerge. thereafter (stage 3), participants were presented with a screen with five statements about their american identity, based on the group identification scale (doosje et al., 1995). each item was rated on a seven-point scale from "not at all" to "totally". the statements were: "i perceive myself as an american", "i feel strong ties with other americans", "being an american does not mean much to me" (reversed), "i identify with american people", and "being an american has nothing to do with my identity" (reversed). cronbach's alpha for the responses was .84. next (stage 4), national patriotism was measured with one pro-usa and one anti-usa essay presented sequentially in counterbalanced order. these were the traditional essays used in ms experiments (see e.g. rosenblatt et al., 1989) and similar to the essays used in experiment 1, except that they had not been translated and adapted to fit a norwegian context. after reading each essay, participants answered the same five questions as in experiment 1 (cronbach's alpha = .91 for pro-usa, .95 for anti-usa).
to get a score for national patriotism, the average score on the anti-usa essay was subtracted from the average score on the pro-usa essay (higher scores indicate higher patriotism). thereafter (stage 5), the same pro-social task as in experiment 1 was applied (value of lottery win set to usd 1,000,000). in a sequential list, participants entered the percentage they would like to share with (a) self or immediate family, (b) close family, (c) extended family, (d) friends, and (e) charities. the summed percentage was shown beneath, and the sum had to be 100% in order to continue the experiment. as in experiment 1 and as preregistered, an index for pro-sociality was calculated as (d + e) / (a + b + c), in which a higher number indicates a higher level of pro-sociality. at the end of the experiment (stage 6), participants reported age and gender. then, as a single-item measure of political views, the question "in general, what would be the most accurate description of your political views?" was answered on a 7-point scale (marked with 1 = very left-wing/liberal, 4 = centrist/moderate, 7 = very right-wing/conservative). the responses used the full range, with a central tendency in the middle (m = 3.64, sd = 1.8).

experiment 2 results

the manipulation checks confirmed that the pro-usa essay was preferred over the anti-usa essay (e2mc1), indicating that national patriotism as expressed in these essays was in fact the dominant cultural value. further, almost all participants did in fact write meaningful responses to the ms experimental manipulation (e2mc2), indicating that the manipulation was successful in evoking thoughts about death versus toothache (control). human verification of written responses (e2mc2) also helped ensure the data quality, since participants randomly clicking their way through the survey or not understanding the instructions would be screened out in this procedure. the written responses are described in more detail in a separate publication (storelv and sætrevik, 2021). in accordance with the preregistration, we tested e2h1, e2h2 and e2h3 with t-tests against a two-tailed alpha of .05. in addition, we tested the effects of political orientation on the outcomes as simple regressions, and we tested their moderating effect on the relationship between ms and outcomes. all analyses were done in the jamovi software (jamovi project, 2019), using the "medmod" module for the moderation analyses. see table 2 for results from the preregistered confirmatory hypothesis tests (e2h1, e2h2 and e2h3). dataset and analyses are available online (at https://osf.io/mejnt/ and https://osf.io/zpn92/).

table 2. list of manipulation checks and hypotheses, operationalization, tests and extent of support in experiment 2 (two-tailed p-values for hypothesis tests; one-tailed for e2mc1).

e2mc1: the pro-usa essay will be preferred over the anti-usa essay (across conditions).
operationalization: participants will show a preference for the pro-patriotic essay.
test: one-tailed t-test will show a higher average score on five questions about the pro-patriotic essay than the average of the same questions for the anti-patriotic essay.
result: supported (n = 784, ms group m = 5.4 (sd = 1.04) vs. control group m = 4.4 (sd = 1.62), p < .001 in expected direction, d = 0.59).

e2mc2: the manipulation instructions were followed.
operationalization: participants will provide meaningful answers to the manipulation questions.
test: manual classification of all 800 responses.
result: 98% provided relevant responses.

e2h1: ms will increase ingroup identification.
operationalization: participants in the ms group will to a larger degree identify as americans.
test: a t-test of experiment group (ms group vs. control group) on ingroup identification score. regression of political views on ingroup identification. moderation of political views on the relationship between mortality salience and ingroup identification.
result: not supported (n = 784, ms group m = 5.2 (sd = 1.38) vs. control group m = 5.28 (sd = 1.39), p = .46, d = 0.05). more conservative participants showed significantly higher ingroup identification (t = 9.67, p < .001). no significant moderation of political views (z = -0.8, p = .425).

e2h2: ms will increase national patriotism.
operationalization: participants in the ms group will show a higher preference for the patriotic essay compared to the anti-patriotic essay.
test: a t-test for experiment group (ms group vs. control group) on patriotic essays as outcome variable. regression of political views on patriotism. moderation of political views on the relationship between mortality salience and patriotism.
result: not supported (n = 784, ms group m = 1.15 (sd = 1.99) vs. control group m = 1.17 (sd = 1.95), p = .91, d = 0.01). more conservative participants showed significantly higher patriotism (t = 11, p < .001). no significant moderation of political views on patriotism (z = -0.04, p = .97).

e2h3: ms will increase pro-sociality.
operationalization: participants in the ms group will state that they would share more of a hypothetical money prize with friends and charities relative to the amount shared with family and relatives.
test: a t-test for experiment group (ms group vs. control group) on the ratio of giving to friends and charities as outcome variable. regression of political views on pro-sociality. moderation of political views on the relationship between mortality salience and pro-sociality.
result: support for less sharing in ms group (n = 781, ms group m = 0.177 (sd = 0.267) vs. control group m = 0.234 (sd = 0.234), p = .036, d = .15). no significant effect of political views on pro-sociality (t = 1.62, p = .106). no significant moderation of political views on sharing (z = 1.5, p = .135).
despite the successful manipulation check, the results showed no significant difference between the ms and the control group on ratings of ingroup identification (e2h1). further, there was no significant difference between the ms and control group on the focal outcome measure of national patriotism (e2h2). a violin plot of the scores is shown in figure 2. thus, experiment 2 did not show any support for the primary hypotheses about the effect of death reminders, as derived from tmt and previous research. on the pro-sociality measure there was a small but significant difference in the opposite direction of the e2h3 prediction (d = .15, p = .036), with the ms group sharing less (12.7% of the amount) than the control group (14.8%) with friends and charities. as suggested in the preregistration, we explored the possible moderator effect of political orientation on the outcome variables. political orientation was significantly correlated with two of the outcome measures, indicating that more conservative participants identified more as americans and showed higher national patriotism, while there was no significant effect on pro-sociality. however, there was no significant interaction between ms and political orientation on any of the three outcome measures (ingroup identification, national patriotism and pro-sociality), thus showing no difference in how conservative and liberal participants responded to the ms manipulation.

figure 2. violin plot of difference in national patriotism between experiment groups in experiment 2, which did not show a significant difference between mortality salience and the control condition (n = 784, usa).

discussion

the aim of the current study was to directly replicate the effect of ms on attitudes to national patriotic essays, and to conceptually replicate the effect on other measures to explore mechanisms and boundary conditions.
this was tested across two preregistered experiments, one in a lab and one online, with a combined sample of 884 participants from two different countries. despite our best efforts, we failed both to directly and to conceptually replicate the ms effects. neither did we find indications of the assumed mechanisms of ms (word processing times, psychophysiology or ingroup identification). the second experiment showed a small but significant effect on pro-sociality, where ms led to reduced pro-sociality. this effect is in the opposite direction from the prediction derived from traditional ms theories and previous research. we were thus unable to obtain any empirical support for direct or conceptual replication of the ms effect or its assumed mechanisms on any of the outcome measures. as opposed to most previous research on ms and tmt, in both experiments we manually verified that the manipulation was adhered to (that the ms group in fact wrote relevant answers related to "death" themes), and we provide this and all other outcomes in public datasets. we thus see it as unlikely that the null results can be explained by a failure to manipulate death awareness. the results are further discussed below.

no direct replication of ms effect on patriotism

both experiments failed to directly replicate the traditional ms effect on national patriotism, using the typical essay measure in both a norwegian (e1h3) and an american sample (e2h2). although our manipulation checks confirmed that the patriotic essay reflected the dominant cultural values in our samples, the ms treatment did not increase this preference. this result opposes much of the published ms literature, typically described in terms of the tmt framework (burke et al., 2010; greenberg et al., 1994). since experiment 1 aimed for conceptual rather than direct replication, a stroop task and a novel essay task were performed before the patriotic essay.
the presence of these tasks may be a possible explanation of the null finding (for e1h3), although we assumed that these measures would be sufficiently indirect to not interfere with the ms effect on patriotic essays. nevertheless, the high-powered experiment 2 had the patriotic essays immediately after the delay task (e2h2), which constitutes a direct replication of the prototypical ms study (see e.g. greenberg et al., 1994).

no conceptual replication of ms effect on novel essays

for experiment 1 we constructed novel essays for measuring ms effects on preferences for democratic values. a manipulation check (e1mc3) confirmed that the pro-democratic essay expressed values that were dominant in the sample. our theoretical extension (conceptual replication) of the traditional ms effect was to expect that ms would increase the preference for expressing democratic values. however, the e1h2 test did not show an ms effect of increased preference for the democratic essay. experiment 1 thus failed to demonstrate that the ms effect transfers to a novel and culturally adapted measure (support of democratic values). although this replication in experiment 1 contains novel aspects, the literature often describes ms effects as universal across cultures, with wide-reaching consequences for most aspects of human social life. further, reviews have shown ms effects in a wide range of outcome variables (burke et al., 2010), and the effect has been shown to transfer to defence of cultural values in american, european and non-western societies (e.g. heine et al., 2002; routledge et al., 2010).

no direct replication of ms effect on ingroup identification

as a straightforward test of the mechanism assumed to cause the ms effect, experiment 2 (e2h1) tested whether the manipulation increased social identification with the larger ingroup (i.e., being an american).
there was no significant effect on this measure, thus failing to support what tmt has claimed is a fundamental mechanism behind ms effects. this also constitutes a failed replication of the results from a previous study in italy, using similar measures (castano et al., 2002). to our knowledge, no other studies have directly tested this assumed mechanism of ms on ingroup identification, but have instead tested the effect that ms has on expressing the ingroup's values.

no effect of ms on word processing speed

there was no significant effect in the experiment 1 manipulation check of ms increasing stroop processing times for death-related words (e1mc1). in our view, this fails to support the claims of tmt (see e.g. arndt et al., 1997), as there was no indication that the ms manipulation made concepts related to mortality more accessible for participants in a subsequent task. neither did experiment 1 find an ms effect on stroop processing times for words associated with social categorization (e1h1). this opposes the expectation derived from tmt that ms makes social identification (or cultural belonging) more relevant as a way of finding meaning beyond physical death. we propose three possible explanations for the lack of ms effects on the stroop task: either (1) ms does not work through a basic cognitive mechanism of spreading activation in a conceptual network, and can thus not be measured with a stroop task, (2) our stroop methodology was not suitable to register changes in availability of cognitive constructs, or (3) the standard ms manipulation does not robustly produce cognitive effects (at least not in the form described in the literature). using a computer mouse for stroop responses may produce some random variation in response times, but this should be compensated for by the high number of stroop trials.
to the best of our knowledge, no previous experiments have tested a direct effect of ms on processing speed of death-related concepts, or an indirect effect on processing speed of concepts such as social words that are assumed to be causally linked with mortality. some studies (e.g. gailliot et al., 2006) have suggested that ms slows down stroop processing in general, but without using stroop words with relevant/irrelevant content. this leaves us without an established framework for evaluating whether the stroop task is a suitable approach to test the mechanisms assumed to underlie the ms effects. we encourage further testing of this approach in future studies.

no ms interaction effect on psychophysiology

tmt assumes that ms leads to an uncomfortable state that motivates the affirmation of one's cultural values (arndt et al., 2001; delaney and brodie, 2000; henry et al., 2010; lane et al., 2009; routledge et al., 2010; schuler et al., 2017; silveira et al., 2013). we assumed that such a change of state could be detectable in psychophysiology. however, the manipulation check in experiment 1 (e1mc2) failed to show a significant effect of decreased hrv in the ms condition. an effect of ms on hrv may have been indicated (one-tailed p = .065) but did not meet our preregistered alpha level. a psychophysiological activation effect may have been obscured by individual variation, measurement noise or analysis choices. it is possible that different hrv analysis approaches (such as different artefact smoothing or different analysis windows) could have indicated effects. however, note that there were no significant effects in an additional testing window (during ms stimulation), nor in the interactions between ms and hrv on any of the outcome measures. given the high sensitivity of hrv methods to analytic flexibility, we chose not to explore the data outside of our planned analysis.
note that research on physiological indicators of ms appears to have shown mixed results, even within the tmt literature (arndt, 1999; rosenblatt et al., 1989).

effect of ms to decrease pro-sociality

both experiments included our novel money-sharing measure of whether ms increased pro-sociality. a conceptual extension of tmt was that ms should lead to increased generosity to people outside of one's family, as a means of having an impact beyond one's own life (building on jonas et al., 2002; roberts and maxfield, 2019; zaleskiewicz et al., 2015). experiment 1 showed a non-significant tendency of ms leading to increased money-sharing with friends and charity. however, the more robust test in experiment 2 showed a significant effect in the opposite direction, of less sharing in the ms group. this is difficult to align with the tmt account, as being less generous with friends and strangers cannot be seen as a culturally dominant value. one could perhaps formulate an ad hoc explanation for why this measure showed opposite effects in the two studies (e.g., different norm salience, different sample populations), or for a prediction opposite to our hypothesis (e.g., wanting to retain resources as a proximal defence against death). this illustrates how difficult it can be to derive falsifiable predictions from tmt (see similar issues raised by martin and bos, 2014). in adherence to the hypothetico-deductive approach, we caution against interpreting a non-predicted result as supportive evidence for a given hypothesis. further, one should keep in mind that the pro-social task is novel and unestablished and was included as an exploratory measure.

replication of experiment design

in an attempt to account for the current null findings, one may point to differences between our experimental design and previous studies.
for example, there was no explicit delay task before the target measure in experiment 1, which previous research has suggested may be necessary for ms effects to occur (greenberg et al., 2000). on the other hand, one could argue that the stroop test and the democratic essays filled this role, as they were performed between the ms manipulation and the patriotic essays measure and are ostensibly unrelated to both mortality and patriotism. further, experiment 2 used the most common type of delay task (a 20-item mood measure) and still failed to show an ms effect on national patriotism or ingroup identification in a large us sample. in both experiments the patriotic essays were among the first measures following the manipulation and delay task, to ensure a sensitive test of the primary hypothesis. a review (burke et al., 2010) has argued that ms effects can be shown across a number of experiment procedures, citing examples that use zero, one, two or three delay tasks. as tmt claims that ms is a fundamental motivator of human behaviour, it would be surprising if it relied on the exact repetition of the minor variations in experimental procedure discussed here. in any case, we would argue that experiment 2 adheres to the crucial design features in the published literature, in having one delay task immediately preceding the central outcome measures. following previous research, our experiments also tested whether the outcome variables showed interactions with cognitive style, political orientation, or the manipulation's effectiveness (although these tests are lower powered than the main-effect tests). these tests failed to provide any further support for the ms effects. based on the ms literature, one may argue that the political orientation of the participants should be taken into consideration.
a meta-analysis (burke et al., 2013) found that in some cases ms can lead to a general "conservative shift", whereas other studies have found a polarizing response in which pre-existing political attitudes are amplified regardless of their ideological nature ("worldview defense"). in the current replication, we failed to find empirical support for either of these hypotheses. specifically, we did not find a significant main effect of death reminders on national patriotism. although participants with right-leaning political views were more supportive of the national-patriotic essay than left-leaning liberal participants, exposure to the death reminder (ms) did not lead to increased national patriotism in this sub-group either.

sample size and power

in terms of sample size, experiment 1 had 50 participants in each condition, while experiment 2 had 400 in each condition. the sample sizes reported in a meta-analysis covering more than 400 experiments (burke et al., 2010) can be divided by the first and second predictor variable, to reveal a median sample size per condition of n = 23.3. another meta-analysis of the ms effect on political attitudes (burke et al., 2013) across 49 experiments reported an average effect size of d = 1.15, and a similarly calculated median sample size of n = 25.9. some of the studies may also have had additional, undeclared predictors (argued by john et al., 2012, to be common in psychology research). both experiments were thus larger (experiment 1) and considerably larger (experiment 2) than the studies on ms effects in the literature.

[figure 3. power-curve for experiment 2, assuming a quarter of the effect sizes reported in meta-analyses. adapted from https://shiny.ieis.tue.nl/d_p_power/]
these samples are sufficient to detect the effects reported in the literature. however, in replication studies one may be concerned that the originally reported effects are inflated, which should be compensated for by assuming smaller actual effects. burke and colleagues (2010) reported the average ms effect size to be d = 0.82. experiment 1 had 99% power to detect an effect of this size, and 65% power to detect an effect of half that size. if one assumes that effect sizes have been severely overestimated in the past, one may still consider experiment 1 to be underpowered. with its larger sample, experiment 2 had power approaching 100% to detect both the reported meta-analytic effect size and half of it, and an 80% chance to detect an effect a quarter of that size. we therefore consider experiment 2 to be a high-powered test of the traditional ms hypothesis. for an overview, see the power-curve in figure 3 for the statistical power experiment 2 would have to detect a broad range of effect sizes.

the use of online samples

to achieve this level of statistical power, experiment 2 recruited a large online sample using amazon's mechanical turk, while most of the previous literature has been conducted in person in physical labs. there have been some recent concerns about reduced data quality when using services like mechanical turk (chmielewski and kucker, 2019), calling for better data screening. however, closer examination has indicated that such online samples may in fact be more representative of the general population than the student samples typically used in lab experiments (buhrmester et al., 2011; buhrmester et al., 2018). moreover, some studies have indicated that online participants at mechanical turk pay closer attention to study instructions than student samples (hauser and schwarz, 2016) and provide comparable data quality (kees et al., 2017).
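the power figures discussed here can be approximated with a standard power calculation. the sketch below is illustrative, assuming independent-samples t-tests with a one-tailed α of .05 (the text does not state the exact test behind each figure, so small discrepancies with the reported percentages are expected):

```python
# Approximate the reported power figures for experiments 1 and 2,
# assuming independent-samples t-tests, one-tailed, alpha = .05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d_meta = 0.82  # average ms effect size reported by burke et al. (2010)

# experiment 1: n = 50 per condition
p_full = analysis.power(effect_size=d_meta, nobs1=50, alpha=0.05,
                        alternative='larger')
p_half = analysis.power(effect_size=d_meta / 2, nobs1=50, alpha=0.05,
                        alternative='larger')

# experiment 2: n = 400 per condition, a quarter of the meta-analytic effect
p_quarter = analysis.power(effect_size=d_meta / 4, nobs1=400, alpha=0.05,
                           alternative='larger')

print(f"exp 1, d = 0.82:  power = {p_full:.2f}")
print(f"exp 1, d = 0.41:  power = {p_half:.2f}")
print(f"exp 2, d = 0.205: power = {p_quarter:.2f}")
```

under these assumptions the first two values land close to the reported 99% and 65%, and the quarter-effect power for experiment 2 comfortably exceeds 80%.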
in our view, this suggests that in relatively short and focused survey experiments, online samples can be a valuable alternative to student samples in order to achieve high-powered studies. we recommend that future research develop beyond convenience samples of college students and mechanical turk, by testing and replicating focal hypotheses in large-scale representative samples. validity would be supported if similar results are obtained across different participant samples and methodological approaches. it could also be argued that the ms phenomenon hinges on the person-to-person interaction that arises in lab experiments. it should be noted that the literature does not specify that ms is restricted to such conditions, or that it is some aspect of the interaction with the experimenter that in itself causes the effect. if the social interaction is essential for the effect (and it can be shown that transparent and high-powered in-person lab procedures can reliably produce the ms effect), the boundary condition of the need for a social interaction should be incorporated into the theoretical account of the phenomenon. we are not aware of any successful replication satisfying these criteria. a different approach to criticizing the use of online samples could be to argue that the ms manipulation may not be attended to or taken seriously when presented online. in response to these concerns, we conducted a manual inspection of the written responses to the traditional ms manipulation task (see storelv and sætrevik, 2021, for details). if a notable proportion of our sample did not provide meaningful reflections about the topic they were assigned to write about (death or toothache), that would suggest that the overall attention level and corresponding data quality were low. this possibility was rejected, since the manipulation check e2mc2 found that 98% of the online sample provided valid responses to the manipulation.
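the precision of a manipulation-check rate like the 98% reported above is easy to quantify. a minimal sketch, assuming n = 784 (the experiment 2 sample) and inferring the valid-response count from the reported percentage, neither of which is given as a raw count in this passage:

```python
# Confidence interval around the reported 98% valid-response rate,
# using a wilson score interval (count inferred from the percentage).
from statsmodels.stats.proportion import proportion_confint

n_total = 784                     # experiment 2 online sample size
n_valid = round(0.98 * n_total)   # ~768 valid manipulation responses (inferred)

low, high = proportion_confint(n_valid, n_total, alpha=0.05, method='wilson')
print(f"valid-response rate: {n_valid / n_total:.3f}, "
      f"95% ci [{low:.3f}, {high:.3f}]")
```

even the lower bound of the interval stays well above 95%, supporting the conclusion that low attention cannot explain the null results.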
we should note that non-compliance with the manipulation may also occur in in-person studies, and in contrast to our study, the traditional ms studies do not typically report manipulation checks.

failed replication of mortality salience

a possible explanation for the current null-findings is that the ms effect is less robust than previously assumed. as with the majority of psychological research preceding the recent awareness of fundamental methodological issues (munafò et al., 2017; simmons et al., 2011), most ms research has been conducted in a non-transparent way without preregistration or open data. this makes it difficult to assess the extent of unpublished results and undisclosed flexibility in design and analysis in the ms literature. independently of our study, a recent "many labs 4" project (klein et al., 2019) tried to replicate the ms effects across 21 different labs (n = 2,220). these preregistered experiments failed to replicate effects of the original studies, both with and without original author involvement. note that the initial report from this project was criticised for not adhering closely to the preregistration in determining which studies to include (chatard et al., 2020). however, a bayesian multiverse approach to all the many labs 4 studies (haaf et al., 2020) found evidence against ms effects in the majority of the analyses. a recent registered report study also failed to find support for ms across three experiments using established measures (schindler et al., 2021). these findings mirror the current results, and further indicate that a lack of methodological expertise in study design is not a likely explanation for the null-findings, and that the ms effect cannot be reliably reproduced even in lab studies.
the current preregistered null-findings in a controlled lab study and in a large-scale online sample constitute one of three independent replication projects that have failed to support the ms hypothesis from tmt (klein et al., 2019; schindler et al., 2021). this may indicate that the traditional ms effect is not a robust and replicable phenomenon, despite the high number of past publications (burke et al., 2013; burke et al., 2010). at the very least, the current null-findings emphasize the need for high-powered, preregistered and transparent replication of the traditional ms effect.

conclusion

the current study, with a norwegian lab experiment (n = 101) and an american online experiment (n = 784), aimed to directly replicate an often-reported ms effect on national patriotism, and a previously reported ms effect on ingroup identification. the study further aimed to conceptually replicate the ms effect on support for democratic values, and to explore a potential ms effect on a novel measure of pro-sociality. all these efforts failed to support the predicted ms effects. one of the experiments found a significant ms effect of decreased pro-sociality, but this effect is in the opposite direction of the hypothesis derived from the established literature. the lab experiment was unable to find any effect of ms on processing speed of concepts related to death or social categorization. there was some indication that ms led to increased psychophysiological activation, but this failed to reach the cut-off for one-tailed significance in two different analysis windows. we also failed to find support for interaction effects derived from reasonable interpretations of the ms literature. some methodological shortcomings are discussed above. one could claim that while being more transparent and better powered than most of the cited literature, experiment 1 is nevertheless underpowered and has a rather complex experimental design.
however, experiment 2 can be interpreted alone as a high-powered attempt at directly replicating the central ms effect (greenberg et al., 1994). given the claim that ms effects are robust and should generalize across a variety of settings and outcome measures (burke et al., 2013; burke et al., 2010; pyszczynski et al., 2015), it is noteworthy that both our attempts at preregistered replications of the traditional ms effect failed. if one would like to argue that there is solid empirical support for the ms effect in the past literature, one should define the necessary and sufficient conditions to produce the effect (e.g., type and duration of delay task, lab or survey-based data collection, which covariates are necessary), and one should only count studies that fulfil these conditions as having supported the effect. in our view, the current results show that the basic ms effect is more difficult to reproduce than what is indicated in the literature. it is possible that the ms effect hinges on methodological quirks, specific samples or other boundary conditions that have not been reported or identified in previous research. variations in theoretical, experimental or analytical approaches may thus have produced different results in the current study. we welcome further research on the proposed ms effect, but we will view the proposed phenomenon with scepticism until such conditions are identified. we actively encourage attempts to replicate the current null-findings. divergent findings could help to identify boundary conditions for the effect (nosek and errington, 2020), whereas similar results would further strengthen the current conclusions. note that the tmt literature has also been claimed to be supported by non-experimental approaches, such as longitudinal studies following personal events of threat or loss (ben-zur and zeidner, 2009).
although the current study fails to replicate the most commonly cited experimental demonstration of the ms effect, the overall tmt may still be supported by other approaches. we find it uncontroversial that avoidance of death can be a powerful motivator, and that human psychology is endowed with instincts to favour and support the ingroup. however, it is less obvious that an abstract awareness of mortality could account for a vast array of behaviours not associated with death, or that a subtle death reminder is sufficient to motivate complex behaviour through subconscious processes. despite our original intention to verify and further explore the nature of ms effects, we found no empirical support for this hypothesis in the present study.

author contact

bjørn sætrevik (corresponding author), department of psychosocial science, faculty of psychology, university of bergen, christies gate 12, 5015 bergen, norway. email: bjorn.satrevik@uib.no. orcid: 0000-0002-9367-6987. hallgeir sjåstad, norwegian school of economics. orcid: 0000-0002-8730-1038.

conflict of interest and funding

the authors have no financial or non-financial competing interests to report. no benefits impinge on the publication of this article, apart from the expectation that the authors should contribute to objective scientific dissemination as part of their professional role. the interpretation of the results is not biased by the authors' previous professional or scientific work. experiment 2 of this research was supported by the research council of norway through its centres of excellence scheme, fair project no 262675.

author contributions

both authors were involved in developing hypotheses and designing both experiments. author bs conducted experiment 1, while author hs conducted experiment 2. author bs has performed the analyses and been in charge of writing the manuscript, while hs has contributed to the process. author names are ranked to reflect their level of contribution.
acknowledgements

we were assisted in the data collection by student researchers anna berglann, ingrid kleiven, marte lovise aagaard-nilsen, silje johansen, daniel gunstveit, astrid bergwitz, isabella ahlvin, line molander, and frida ölander.

open science practices

this article earned the preregistration+, open data and open materials badges for preregistering the hypotheses and analyses before data collection, and for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

acharya, u., joseph, k., kannathal, n., lim, c., & suri, j. (2006). heart rate variability: a review. medical and biological engineering and computing, 44(12), 1031–1051. arndt, j. (1999). searching for the terror in terror management: mortality salience and physiological indices of arousal and affect. arndt, j., allen, j., & greenberg, j. (2001). traces of terror: subliminal death primes and facial electromyographic indices of affect. motivation and emotion, 25(3), 253–277. arndt, j., greenberg, j., & cook, a. (2002). mortality salience and the spreading activation of worldview-relevant constructs: exploring the cognitive architecture of terror management. journal of experimental psychology: general, 131(3), 307. arndt, j., greenberg, j., solomon, s., pyszczynski, t., & simon, l. (1997). suppression, accessibility of death-related thoughts, and cultural worldview defense: exploring the psychodynamics of terror management. journal of personality and social psychology, 73(1), 5. becker, e. (1973). the denial of death. the free press. ben-zur, h., & zeidner, m. (2009). threat to life and risk-taking behaviors: a review of empirical findings and explanatory models. personality and social psychology review, 13(2), 109–128. buhrmester, m., kwang, t., & gosling, s. (2011).
amazon's mechanical turk: a new source of inexpensive, yet high-quality, data? perspectives on psychological science, 6(1), 3–5. buhrmester, m., talaifar, s., & gosling, s. (2018). an evaluation of amazon's mechanical turk, its rapid rise, and its effective use. perspectives on psychological science, 13(2), 149–154. burke, b., kosloff, s., & landau, m. (2013). death goes to the polls: a meta-analysis of mortality salience effects on political attitudes. political psychology, 34(2), 183–200. burke, b., martens, a., & faucher, e. (2010). two decades of terror management theory: a meta-analysis of mortality salience research. personality and social psychology review, 14(2), 155–195. castano, e., yzerbyt, v., paladino, m.-p., & sacchi, s. (2002). i belong, therefore, i exist: ingroup identification, ingroup entitativity, and ingroup bias. personality and social psychology bulletin, 28(2), 135–143. chatard, a., hirschberger, g., & pyszczynski, t. (2020). a word of caution about many labs 4: if you fail to follow your preregistered plan, you may fail to find a real effect. chmielewski, m., & kucker, s. (2019). an mturk crisis? shifts in data quality and the impact on study results. social psychological and personality science, 11, 464–473. delaney, j., & brodie, d. (2000). effects of short-term psychological stress on the time and frequency domains of heart-rate variability. perceptual and motor skills, 91(2), 515–524. doosje, b., ellemers, n., & spears, r. (1995). perceived intragroup variability as a function of group status and identification. journal of experimental social psychology, 31(5), 410–436. everett, j., faber, n., & crockett, m. (2015). preferences and beliefs in ingroup favoritism. frontiers in behavioral neuroscience, 9, 15. federico, c., jost, j., pierro, a., & kruglanski, a. (2007). the need for closure and political attitudes: final report for the anes pilot. anes pilot study report. fiedler, k., kutzner, f., & krueger, j. (2012).
the long way from α-error control to validity proper: problems with a short-sighted false-positive debate. perspectives on psychological science, 7(6), 661–669. gailliot, m., schmeichel, b., & baumeister, r. (2006). self-regulatory processes defend against the threat of death: effects of self-control depletion and trait self-control on thoughts and fears of dying. journal of personality and social psychology, 91(1), 49. greenberg, j., arndt, j., simon, l., pyszczynski, t., & solomon, s. (2000). proximal and distal defenses in response to reminders of one's mortality: evidence of a temporal sequence. personality and social psychology bulletin, 26(1), 91–99. greenberg, j., porteus, j., simon, l., pyszczynski, t., & solomon, s. (1995). evidence of a terror management function of cultural icons: the effects of mortality salience on the inappropriate use of cherished cultural symbols. personality and social psychology bulletin, 21(11), 1221–1228. greenberg, j., pyszczynski, t., & solomon, s. (1986). the causes and consequences of a need for self-esteem: a terror management theory. public self and private self (pp. 189–212). springer. greenberg, j., pyszczynski, t., solomon, s., simon, l., & breus, m. (1994). role of consciousness and accessibility of death-related thoughts in mortality salience effects. journal of personality and social psychology, 67(4), 627. greenberg, j., simon, l., pyszczynski, t., solomon, s., & chatel, d. (1992). terror management and tolerance: does mortality salience always intensify negative reactions to others who threaten one's worldview? journal of personality and social psychology, 63(2), 212. greenberg, j., solomon, s., & pyszczynski, t. (1997). terror management theory of self-esteem and cultural worldviews: empirical assessments and conceptual refinements. advances in experimental social psychology (pp. 61–139). elsevier. grimson, d., knowles, s., & stahlmann-brown, p. (2020). how close to home does charity begin?
applied economics, 52(34), 3700–3708. griskevicius, v., tybur, j., delton, a., & robertson, t. (2011). the influence of mortality and socioeconomic status on risk and delayed rewards: a life history theory approach. journal of personality and social psychology, 100(6), 1015. haaf, j., hoogeveen, s., berkhout, s., gronau, q., & wagenmakers, e.-j. (2020). a bayesian multiverse analysis of many labs 4: quantifying the evidence against mortality salience. hart, j. (2014). toward an integrative theory of psychological defense. perspectives on psychological science, 9(1), 19–39. https://doi.org/10.1177/1745691613506018 hauser, d., & schwarz, n. (2016). attentive turkers: mturk participants perform better on online attention checks than do subject pool participants. behavior research methods, 48(1), 400–407. hayes, j., & schimel, j. (2018). unintended effects of measuring implicit processes: the case of death-thought accessibility in mortality salience studies. journal of experimental social psychology, 74, 257–269. hayes, j., schimel, j., arndt, j., & faucher, e. (2010). a theoretical and empirical review of the death-thought accessibility concept in terror management research. psychological bulletin, 136(5), 699. hayes, j., schimel, j., faucher, e., & williams, t. (2008). evidence for the dta hypothesis ii: threatening self-esteem increases death-thought accessibility. journal of experimental social psychology, 44(3), 600–613. heine, s., harihara, m., & niiya, y. (2002). terror management in japan. asian journal of social psychology, 5(3), 187–196. henry, e., bartholow, b., & arndt, j. (2010). death on the brain: effects of mortality salience on the neural correlates of ingroup and outgroup categorization. social cognitive and affective neuroscience, 5(1), 77–87. the jamovi project. (2019). jamovi (version 0.9) [computer software].
retrieved from https://www.jamovi.org. john, l., loewenstein, g., & prelec, d. (2012). measuring the prevalence of questionable research practices with incentives for truth telling. psychological science, 0956797611430953. jonas, e., schimel, j., greenberg, j., & pyszczynski, t. (2002). the scrooge effect: evidence that mortality salience increases prosocial attitudes and behavior. personality and social psychology bulletin, 28(10), 1342–1353. juhl, j., & routledge, c. (2010). structured terror: further exploring the effects of mortality salience and personal need for structure on worldview defense. journal of personality, 78(3), 969–990. kahneman, d., knetsch, j. l., & thaler, r. h. (1986). fairness and the assumptions of economics. journal of business, s285–s300. kees, j., berry, c., burton, s., & sheehan, k. (2017). an analysis of data quality: professional panels, student subject pools, and amazon's mechanical turk. journal of advertising, 46(1), 141–155. klein, r., cook, c., ebersole, c., vitiello, c., nosek, b., chartier, c., & ratliff, k. (2019). many labs 4: failure to replicate mortality salience effect with and without original author involvement. retrieved from https://psyarxiv.com/vef2c lane, r., mcrae, k., reiman, e., chen, k., ahern, g., & thayer, j. (2009). neural correlates of heart rate variability during emotion. neuroimage, 44(1), 213–222. martin, l., & bos, k. (2014). beyond terror: towards a paradigm shift in the study of threat and culture. european review of social psychology, 25(1), 32–70. morewedge, c., & kahneman, d. (2010). associative processes in intuitive judgment. trends in cognitive sciences, 14(10), 435–440. munafò, m., nosek, b., bishop, d., button, k., chambers, c., sert, n., & ioannidis, j. (2017). a manifesto for reproducible science. nature human behaviour, 1, 0021. nosek, b., & errington, t. (2020). what is replication? plos biology, 18(3). https://doi.org/10.31222/osf.io/u4g6t
pepper, g., corby, d., bamber, r., smith, h., wong, n., & nettle, d. (2017). the influence of mortality and socioeconomic status on risk and delayed rewards: a replication with british participants. peerj, 5, 3580. psychology software tools, inc. (2012). e-prime 2.0 [computer software]. retrieved from https://www.pstnet.com pyszczynski, t., solomon, s., & greenberg, j. (2015). thirty years of terror management theory: from genesis to revelation. advances in experimental social psychology, 52, 1–70. roberts, j., & maxfield, m. (2019). mortality salience and age effects on charitable donations. american behavioral scientist, 0002764219850864. rodríguez-ferreiro, j., barberia, i., gonzález-guerra, j., & vadillo, m. (2019). are we truly special and unique? a replication of goldenberg et al. (2001). royal society open science, 6(11), 191114. rosenblatt, a., greenberg, j., solomon, s., pyszczynski, t., & lyon, d. (1989). evidence for terror management theory: i. the effects of mortality salience on reactions to those who violate or uphold cultural values. journal of personality and social psychology, 57(4), 681. routledge, c., ostafin, b., juhl, j., sedikides, c., cathey, c., & liao, j. (2010). adjusting to death: the effects of mortality salience and self-esteem on psychological well-being, growth motivation, and maladaptive behavior. journal of personality and social psychology, 99(6), 897. schindler, s., reinhardt, n., & reinhard, m.-a. (2021). defending one's worldview under mortality salience: testing the validity of an established idea. journal of experimental social psychology, 93, 104087. schuler, e., mlynski, c., & wright, r. (2017). influence of mortality salience on effort-related cardiovascular response to an identity-relevant challenge. motivation science, 3(2), 164–171.
silveira, s., graupmann, v., agthe, m., gutyrchik, e., blautzik, j., demirçapa, i., & reiser, m. (2013). existential neuroscience: effects of mortality salience on the neurocognitive processing of attractive opposite-sex faces. social cognitive and affective neuroscience, 9(10), 1601–1607. simmons, j., nelson, l., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359–1366. simon, l., greenberg, j., harmon-jones, e., solomon, s., pyszczynski, t., arndt, j., & abend, t. (1997). terror management and cognitive-experiential self-theory: evidence that terror management occurs in the experiential system. journal of personality and social psychology, 72(5), 1132. singer, p. (2011). the expanding circle: ethics, evolution, and moral progress. princeton university press. sjåstad, h. (2019). short-sighted greed? focusing on the future promotes reputation-based generosity. judgment and decision making, 14(2). stets, j. (2006). identity theory. in p. burke (ed.), contemporary social psychological theories (pp. 88–110). stanford university press. storelv, s., & sætrevik, b. (2021). nothing is certain except taxes and the other thing: searching for death anxiety in a large online sample. https://doi.org/10.31234/osf.io/3tkzq trafimow, d., & hughes, j. (2012). testing the death thought suppression and rebound hypothesis: death thought accessibility following mortality salience decreases during a delay. social psychological and personality science, 3(5), 622–629. turner, j., oakes, p., haslam, s., & mcgarty, c. (1994). self and collective: cognition and social context. personality and social psychology bulletin, 20(5), 454–463. watson, d., clark, l., & tellegen, a. (1988). development and validation of brief measures of positive and negative affect: the panas scales. journal of personality and social psychology, 54(6), 1063. yen, c.-l., & cheng, c.-p. (2013).
researcher effects on mortality salience research: a meta-analytic moderator analysis. death studies, 37(7), 636–652. zaleskiewicz, t., gasiorowska, a., & kesebir, p. (2015). the scrooge effect revisited: mortality salience increases the satisfaction derived from prosocial behavior. journal of experimental social psychology, 59, 67–76.

meta-psychology, 2023, vol 7, mp.2020.2640 https://doi.org/10.15626/mp.2020.2640 article type: original article published under the cc-by4.0 license open data: yes open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: yes edited by: rickard carlsson reviewed by: ignazio ziano, ulrich schimmack analysis reproduced by: lucija batinović all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/4sz2q

the effectiveness of the "but-you-are-free" technique: meta-analysis and re-examination of the technique

adrien alejandro fillon1, lionel souchet2, alexandre pascual3, and fabien girandola2 1era chair in science and innovation policy & studies, university of cyprus 2department of social psychology, aix-marseille university 3department of social psychology, university of bordeaux

the "but you are free..." (byaf) technique is a technique to increase compliance (for example, to give spare change for the bus), by adding the words "but you are free to accept or refuse" to the request. in this pre-registered meta-analysis, we examine the effect of the byaf technique in 52 experiments (n = 19,528). an analysis of 74 effect sizes showed a medium effect (g = 0.44, 95% confidence interval (ci) [0.36, 0.51]) for the byaf technique. a moderator analysis found a stronger effect for face-to-face interactivity over other types of interactivities. none of the other moderators we tested was statistically significant.
we did not find any differences between articles published before and after carpenter's (2013) meta-analysis. an examination of risk of bias showed that only seven studies were of "low risk", and a meta-analysis of these studies showed no effect of the byaf (g = 0.11, 95% ci [-0.18, 0.40]). we also found that most recent studies on the subject are too low-powered to detect the effect found by carpenter (2013), and the reproducibility rates were critically low (r-index = 9.77%, z-curve expected discovery rate = 6%). we propose some improvements to the designs and experiments to ensure the effects found in the literature exist and are replicable. all materials are available on https://osf.io/8eqa5/ keywords: but-you-are-free, commitment, compliance, meta-analysis

introduction

the but-you-are-free (byaf) technique is a commitment technique invented and used by guéguen and pascual (2000). the technique consists of adding the words "but you are free" to a request to enhance acceptance of the request. the byaf technique is one of many commitment techniques (see pratkanis, 2007, for a review), based on reactance theory (brehm, 1966). contrary to other techniques, the byaf is easy to use – you only need to add one sentence to the request. for example, guéguen and pascual (2000) observed a 10% compliance rate with the request "sorry madam/sir, would you have some coins to take the bus, please?" (control condition), whereas 47.5% was obtained with "sorry madam/sir, would you have some coins to take the bus, please? but you are free to accept or to refuse." (byaf condition). the byaf technique can be combined with other techniques such as the "foot in the door" technique to further increase compliance. furthermore, this technique can be applied in many situations, such as face-to-face interaction, but also in indirect interaction, for example over the internet (e.g., e-mail; pascual, 2002). but how does it work?
the exact wording (i.e., "but you are free") is not required to enhance compliance, as other wordings such as "but obviously do not feel obliged" (guéguen et al., 2013, p. 129) are equally effective. the technique relies on the salience of the target's freedom in their decision-making process. the acknowledgement that one can say "no" leads people to say "yes" more often, and to be more committed, as shown by the amount of money given in most of the studies (e.g., guéguen and pascual, 2000). as commitment theory (kiesler, 1971; kiesler and sakumura, 1966) postulates, it is possible to manipulate the degree of commitment by manipulating the degree of perceived choice when performing the act. as such, the byaf technique can be considered a non-pressure manipulation used to enhance compliance.

original study and its follow-up

in the original study (guéguen and pascual, 2000), the researchers indicated in the subjects section that four confederates (two men and two women, on average 20-22 years old) approached 40 men and 40 women chosen at random in the street. in the procedure section, they indicated that the experiment took place in a mall. in the control condition, the confederates said "sorry madam/sir, would you have some coins to take the bus, please?" and in the byaf condition "sorry madam/sir, would you have some coins to take the bus, please? but you are free to accept or to refuse." the confederate then noted whether the participant agreed to give money, noted the amount given before returning it to the participant, and debriefed them. in the results section, the researchers indicated that 10% of subjects accepted the request in the control group, and 47.5% in the byaf group, whereas the mean amount was $0.48 in the control group and $1.04 in the byaf group (all differences were statistically significant at an α level of 0.05).
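the original compliance difference can be re-tested from the reported percentages. a minimal sketch, assuming the counts implied by 10% and 47.5% of 40 participants per condition (4 and 19), and using a two-proportion z-test rather than whatever test the original authors ran:

```python
# Two-proportion z-test on the compliance counts implied by
# gueguen and pascual (2000): 19/40 (byaf) vs 4/40 (control).
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

complied = np.array([19, 4])   # byaf condition, control condition
asked = np.array([40, 40])     # participants approached per condition

stat, pval = proportions_ztest(complied, asked)
print(f"z = {stat:.2f}, p = {pval:.4f}")
```

even with only 40 participants per condition, a difference of this size is clearly significant, consistent with the original report.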
the researchers indicated that this experiment shows the effectiveness of the byaf technique in increasing both the probability of compliance (saying yes to the request) and the implication of the subject (giving a higher amount of money). in 2013, carpenter conducted a meta-analysis of the byaf technique covering 42 studies published after the original described above. his goal was to summarize the effect size of this technique and to identify probable mediators and moderators. specifically, he wanted to test whether face-to-face interaction was important to the byaf technique, and whether the type of choice (prosocial, offer, or selfish) and the timing of the request (immediate or delayed) influenced the byaf effect. also, as the first studies were based on a monetary request, it was important to assess whether the byaf works in other contexts, such as signing a petition. the meta-analysis showed that the sample-size weighted correlation between the presence or absence of the byaf technique and the proportion of those who complied with the request was r = .13 (i.e., d = 0.26), which is, according to the author, a moderate-sized increase in effectiveness with the use of the byaf technique; it is typically considered a small to medium effect size (sawilowsky, 2009). sampling error explained 22% of the variation in effect size. the confidence interval of the correlation was not reported. carpenter identified several moderators. an immediate request led to an r = .18 (i.e., d = 0.37), and a delayed request to an r = .07 (i.e., d = 0.14), which shows the importance of positioning the byaf technique close to the targeted request. prosocial requests were as likely to work (r = .16, d = 0.32) as selfish requests (r = .16, d = 0.32). concerning the analysis of publication bias, carpenter correlated sample sizes with effect sizes and found r = -.30. this means that, as sample size increases, the effect may decrease, potentially to a null effect.
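the r-to-d conversions quoted above follow the standard formula d = 2r / √(1 − r²); a quick python check (illustrative only – the paper's own analyses are in r):

```python
import math

def r_to_d(r):
    """Convert a correlation effect size r to Cohen's d
    using the standard formula d = 2r / sqrt(1 - r^2)."""
    return 2 * r / math.sqrt(1 - r ** 2)

# values reported by carpenter (2013):
# overall r = .13 -> d ~ 0.26; immediate r = .18 -> d ~ 0.37;
# delayed r = .07 -> d ~ 0.14; prosocial/selfish r = .16 -> d ~ 0.32
```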
this result suggests that publication bias is present and that the effect size estimate is inflated; the actual effect size might therefore be small. carpenter also used the trim-and-fill technique (duval and tweedie, 2000) but did not provide the associated plot. the trim-and-fill technique leads to a reduction of the effect size by .04 (from r = .13 to r = .09, d = 0.18). some meta-analysts have indicated that the trim-and-fill technique performs poorly in the presence of substantial between-study heterogeneity (j. higgins, chandler, et al., 2022). finally, as carpenter pointed out, nearly all the experiments were conducted by either guéguen or pascual (see table 1), yet they found both the strongest and the smallest effect sizes for the technique. one major problem of the carpenter (2013) meta-analysis is that some studies were flagged as at risk of containing fabricated data (brown, 2020). the flagged studies have the strongest effect sizes found (odds ratio for dufourcq-brana et al., 2006, or = 6.57; guéguen and pascual, 2000, or = 8.14; pascual and guéguen, 2002, or = 6); thus, eliminating these results from our analysis might reveal a null effect of the byaf technique. it is also possible that research on this subject improves over time, with larger sample sizes and stronger methods, leading to convergence toward the "true" effect size of the byaf technique on compliance. in most cases in psychology, the original effect sizes are inflated (schäfer and schwarz, 2019). this is why we conducted a novel preregistered and open meta-analysis of the byaf technique on compliance, with attention to the inconsistencies between our analysis and the one from carpenter (2013).

moderators

we want to investigate the moderators that can influence the effect of the byaf technique. the research on the subject suggests that the relevant moderators are the type of request (pro-social vs.
selfish), the temporality (immediate or delayed), the gender of the subject and of the confederate (man vs. woman), the culture (individualistic vs. collectivistic), the interactivity (face-to-face vs. indirect), and the type of freedom evocation ("but-you-are-free" vs. other). we also want to test whether there are substantial differences between the effect sizes found before and after the carpenter (2013) meta-analysis.

type of request

as carpenter (2013) pointed out, the effectiveness of the byaf technique might depend on the type of request. for carpenter and boster (2009), compliance-gaining techniques work better for pro-social benefits, like giving to a charity, than for selfish reasons, like getting coins to take the bus. nonetheless, carpenter (2013) found no difference in compliance rate between the pro-social and selfish types of requests. we seek to redo the analysis with the same hypothesis, given that the larger number of studies involved could give a better estimate of the effect size and possibly detect a moderator effect of the type of request. in doing so, our hypothesis is the same as in carpenter's (2013) meta-analysis: the compliance rate will be higher for the pro-social type of request than for the selfish type of request.

temporality

temporality was called "immediate or delayed" in carpenter's (2013) analysis. indeed, depending on the study, researchers can look at whether the participant complied with the request immediately after the technique was used (e.g., when asked for money, as in the original study), or after a certain amount of time (e.g., by sending an email and then testing whether the participant had made a purchase; grassini et al., 2012). we seek to replicate the effect of temporality found in carpenter's meta-analysis, where the compliance rate was lower when the confederate was absent (delayed condition) than present (immediate condition).
two explanations are possible: reactance is easier when the confederate is absent, or people want to present a better self-image when the researcher is present. we seek to redo the analysis with the addition of new studies, expecting to find that the immediate use of the byaf technique is more effective than the delayed use.

subject gender

studies seem to indicate that men are less compliant than women (grosch and rau, 2016). for example, one study found that men cheat more than women (fischbacher and föllmi-heusi, 2013). grosch and rau (2016) indicated that this difference can be explained by the cultural roles of men and women, as women are seen as more pro-social than men. thus, we think that female participants will comply more with the request in the byaf condition than men.

confederate gender

many experiments have shown that confederate gender influences the compliance rate. for example, vaughn et al. (2009) only found an effect on compliance when the confederate was a woman. long et al. (1996) found that women were helped more than men. on the contrary, dolinska and dolinski (2006) found that both sexes have a better chance of finding compliers when the confederate's sex matches the participant's sex. this difference can be explained by cultural variation. since most of the experiments were conducted in france, we think that the byaf technique will be more effective if the confederate is a woman. indeed, we hypothesize that participants will comply more with women confederates than with men confederates in the byaf condition.

culture

in a pro-social culture such as china, one could expect more compliance than in a more individualistic culture such as france (see hamamura et al., 2018). there are at least three reasons for this hypothesis.
in general, the theory of commitment is more effective in individualistic than in collectivistic cultures (kim and sherman, 2007), because people in individualistic cultures have a more internal locus of control (channouf, 1990; desrumaux, 1996) and are more easily reactant (jonas et al., 2009). thus, the byaf technique, which reduces reactance, should work better for people in an individualistic culture. indeed, pascual et al. (2012) showed that the byaf technique induces more compliance in individualistic countries (i.e., france, romania) than in collectivistic countries (i.e., ivory coast, china, and russia). according to triandis (1989), individualist cultures include northern and western europe as well as north america, whereas collectivist cultures are characteristic of asia, africa, and south america. we therefore expect participants from an individualistic country to comply more with the byaf technique than participants from a collectivistic country.

interactivity

if the byaf technique has a different effect depending on the gender of the participant and/or the gender of the confederate, this difference implies a "face-to-face" interaction. furthermore, the difference in temporality (immediate or delayed) implies a difference between a "face-to-face" interaction and more distal interactions. we believe that participants are more engaged when the interaction is "face-to-face" rather than indirect, via email, phone call, or the internet.

type of freedom evocation

the byaf technique is an induction in a sentence (typically "but you are free to accept or to refuse") that induces a feeling of freedom, making the recipient more willing to accept the demand, or to comply. other evocations include propositions such as "do not feel obliged", "do as you wish", or "feel free to refuse". it is possible that some evocations are better than others at inducing compliance.
indeed, the proposition "but you are free to refuse" is the most salient, leading to the best understanding by the recipient that he/she is free to accept or not. it should therefore have a stronger effect on compliance than the other ways of evoking freedom.

before and after carpenter's analysis

garmendia et al. (2019) have shown that 46% of meta-analyses have their conclusions altered by false data, whether from fraudulent/plagiarized studies or from errors. as we showed above, carpenter's analysis has this problem. original effect sizes are inflated (schäfer and schwarz, 2019), and we tend to think that the most recent research is of better quality than research from before the crisis in social science (motyl et al., 2017). given that in carpenter's (2013) meta-analysis the use of the trim-and-fill method reduced the effect size found close to the null, we hypothesize that the effect will be lower after carpenter's analysis than before.

summary hypotheses

main hypothesis: people tend to comply more with the "but you are free" technique than with direct asking.

confirmatory hypotheses: the compliance rate will be higher for 1) the pro-social type of request than for the selfish type of request and 2) immediate asking than delayed asking.

exploratory hypotheses: the compliance rate will be higher (a) for women than for men, (b) for women confederates than for men confederates, (c) for participants from an individualistic country than from a collectivistic country, (d) in a "face to face" interaction than in other types of interaction, (e) with the exact proposition "but you are free" than with the other types of evocation, and (f) in studies included in the carpenter (2013) meta-analysis than in studies made after it.

method

open science, replicability, and our current study

we preregistered our analysis, followed the prisma checklist (moher et al., 2009), and made all our data and our analysis in r/rmarkdown available on the osf (https://osf.io/8eqa5/). the r packages used can be found in the supplementary materials.
literature search

we systematically searched google scholar (for its suitability for meta-analyses, see gehanno et al., 2013; martín-martín et al., 2018; walters, 2007) with the term "but you are free", as carpenter did in 2013. we provide an overview of the search process in figure 1. the database searches yielded 1760 hits. we also searched for articles by scanning the reference sections of found articles and using the "related articles" and "cited by" options in google scholar. based on reviewer feedback, we asked for unpublished studies on the adrips, eadm, and easp social networks, without any additional results. after adjusting for duplicates, 81 sources remained. to minimize potential publication bias, we contacted all identified authors directly and requested unpublished manuscripts. we were provided with twenty-two additional articles, leading to a total of 103 sources. all abstracts, tables, and results sections of empirical sources were scanned to assess their relevance. after this step, 29 articles remained as potentially includable. our eligibility criterion is the use of the "but you are free" technique with a direct measure of compliance. we only include experimental designs with a clear contrast between the byaf technique and a control group, with compliance operationalized as saying "yes/no", giving money, clicking a button online, or sending postal mail. we exclude studies 1) that do not measure direct compliance or that use a scale to measure the strength of compliance, 2) without a control group, or that contrast the byaf technique with another technique, and 3) that do not provide the exact term for the byaf technique, or for which the term used is disconnected/too far away from the term "but you are free".

figure 1
meta-analysis flow diagram (adapted from prisma 2020)
finally, we exclude studies with missing or unreported statistics: studies that do not report measures crucial for calculating the effect size, such as the number of participants or the standard deviation, are excluded from the sample. we briefly read through all articles to examine whether they met our inclusion criteria. a total of 7 articles qualified for exclusion, leaving 22 identified articles with codable data. in total, 52 samples were included in this meta-analysis, yielding 74 effect sizes. we provide a list of all included experiments in table 1.

coding

we used a data extraction sheet that had already been used successfully in other meta-analyses (e.g., fillon et al., 2021; yeung et al., 2021). the coding process for the pre-tests was completed by two coders to ensure high inter-rater reliability. we documented and reported all decisions in detail. after testing, one review author extracted all data and provided detailed information about coding decisions. a second author verified the coding. disagreements were resolved by discussion between the two authors. all coding decisions were documented in the extraction sheet. we made the raw data and the email exchanges with authors available on the osf. the source of each extracted data point is documented in the "source" column.

included studies

we included a total of 52 experiments with a total of 19528 participants. the final sample consists of 18 published and 4 unpublished studies. most studies were conducted in a face-to-face experimental design, in the street; others were conducted online, via an online video game, or by email, phone, or postal letter. an overview of all included studies is provided in table 1.

analysis

we ran our analyses in r. we used the following meta-analysis related packages: metafor, psych, compute.es, mbess, mad, poweranalysis, metaforest, esc, metaviz, puniform, zcurve (see the supplementary materials for the full list of r packages used).
given the range of different study types and designs, we expected heterogeneity in the sample to be relatively high; therefore, a random-effects model was used. we coded the sheet with the total number of participants in each group (experimental via the byaf technique, control) and the number of participants who complied in each group. in most cases these numbers were provided, but for some we computed them from the available test statistics. all conversions and coding decisions were documented. we preregistered the use of cohen's d as the effect size but used hedges' g instead because it corrects for small sample sizes (delacre et al., 2021). we produced forest plots of the effect size distribution. a meta-analysis examined the overall main effect; a meta-regression was conducted to assess the impact of the described moderators. statistical heterogeneity was quantified using tau² and i², the latter representing the percentage of the total variation in a set of studies that is due to heterogeneity (higgins, 2003). this yielded a point estimate, confidence interval, and p-value, along with heterogeneity statistics assessed using the q statistic and the i² statistic. we detected significant heterogeneity and therefore proceeded to explore potential moderators. we also performed analyses for the presence of publication bias, including funnel plots and statistical tests for publication bias (publication status as a moderator) and funnel plot asymmetry tests (trim-and-fill method, rank correlation test, egger's unweighted regression symmetry test, etc.). finally, we tested for robustness via a graphical display of study heterogeneity (gosh) and plotted a z-curve to estimate replicability.

moderator analyses

we tested subgroups and moderators by comparing fixed-effects meta-analysis models.
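the exact conversion from compliance counts to hedges' g lives in the linked rmarkdown and is not reproduced here; one standard route for binary outcomes (an assumption for illustration, not a quote of the authors' code) is the hasselblad–hedges logit method, d = ln(or)·√3/π, followed by the small-sample correction j = 1 − 3/(4·df − 1). a python sketch:

```python
import math

def hedges_g_from_counts(compliers_t, n_t, compliers_c, n_c):
    """Binary-outcome effect size: log odds ratio converted to d via the
    Hasselblad-Hedges logit transform, then Hedges' small-sample
    correction J = 1 - 3/(4*df - 1) with df = n_t + n_c - 2."""
    or_ = (compliers_t / (n_t - compliers_t)) / (compliers_c / (n_c - compliers_c))
    d = math.log(or_) * math.sqrt(3) / math.pi
    df = n_t + n_c - 2
    j = 1 - 3 / (4 * df - 1)
    return j * d

# guéguen & pascual (2000), 19/40 vs 4/40 compliers: gives g ~ 1.15,
# in line with the forest plot entry for that study
```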
most of our hypotheses are exploratory; we tested the type of request and temporality (immediate or delayed) as confirmatory, since they were already studied in the carpenter (2013) meta-analysis. for the other moderators, we conducted exploratory analyses.

results

the but-you-are-free main effect

in an analysis of all studies on the impact of the byaf technique on compliance, we found an effect of g = 0.44 [0.36, 0.51]. we found considerable heterogeneity (q(73) = 271.67, p < .001, i² = 80.7%) in the observed effect sizes. the variation in effect sizes was greater than would be expected from sampling error alone, indicating that moderator variables might account for the variance in the effects. a meta-analysis forest plot is provided in figure 2.

study design and measures as moderators

we summarize all moderator findings in table 2. overall, the only exploratory moderator that had an impact on the byaf effect was the type of interactivity, as face-to-face interactivity yielded a significantly higher number of compliers than the others combined (email, phone, postal letter, and internet). on the other hand, the two confirmatory moderators had a significant effect, as we found that a face-to-face interaction led to a stronger effect than the other forms of interactivity, and a direct request led to a stronger effect than a delayed request.

table 1
all experiments included in the meta-analysis
# | article | n | interactivity | culture | published
1 | barbier (2018) | 422 | internet | france | no
2 | carpenter & pascual (2016) | 131 | face-to-face | usa | yes
3 | carpenter & pascual (2016) | 320 | face-to-face | france | yes
4 | carpenter & pascual (2016) | 240 | face-to-face | norway | yes
5 | dufourcq-brana (2007) | 400 | email | france | no
6 | dufourcq-brana (2007) | 60 | face-to-face | france | no
7 | dufourcq-brana (2007) | 100 | face-to-face | france | no
8 | farley et al. (2019) | 45 | face-to-face | usa | yes
9 | farley et al. (2019) | 40 | face-to-face | usa | yes
10 | grassini et al. (2012) | 900 | email | france | yes
11 | guéguen & pascual (2000) | 80 | face-to-face | france | yes
12 | guéguen & pascual (2005) | 159 | face-to-face | france | yes
13 | guéguen et al. (2002) | 600 | email | france | yes
14 | guéguen et al. (2010) | 100 | face-to-face | france | yes
15 | guéguen et al. (2013) | 2160 | face-to-face | france | yes
16 | guéguen et al. (2013) | 160 | face-to-face | france | yes
17 | guéguen et al. (2013) | 4421 | face-to-face | france | yes
18 | guéguen et al. (2013) | 400 | face-to-face | france | yes
19 | guéguen et al. (2013) | 100 | face-to-face | france | yes
20 | guéguen et al. (2013) | 2608 | phone | france | yes
21 | guéguen et al. (2013) | 4515 | email | france | yes
22 | guéguen et al. (2013) | 2230 | postal letter | france | yes
23 | guéguen et al. (2013) | 400 | postal letter | france | yes
24 | guéguen et al. (2013) | 344 | face-to-face | france | yes
25 | guéguen et al. (2013) | 300 | face-to-face | france | yes
26 | guéguen et al. (2013) | 400 | face-to-face | france | yes
27 | guéguen et al. (2015) | 120 | face-to-face | france | yes
28 | guéguen et al. (2017) | 60 | face-to-face | france | yes
29 | marchand et al. (2009) | 74 | face-to-face | france | yes
30 | meineri et al. (2016) | 60 | face-to-face | france | yes
31 | meineri et al. (2016) | 649 | face-to-face | france | yes
32 | pascual & guéguen (2002) | 80 | face-to-face | france | yes
33 | pascual & guéguen (2002) | 120 | face-to-face | france | yes
34 | pascual & guéguen (2002) | 200 | face-to-face | france | yes
35 | pascual & guéguen (2002) | 306 | face-to-face | france | yes
36 | pascual & guéguen (2002) | 126 | face-to-face | france | yes
37 | pascual (2002) | 181 | face-to-face | france | no
38 | pascual (2002) | 320 | face-to-face | france | no
39 | pascual (2002) | 167 | face-to-face | france | no
40 | pascual (2002) | 306 | face-to-face | france | no
41 | pascual (2002) | 220 | face-to-face | france | no
42 | pascual et al. (2012) | 609 | face-to-face | france, ivory coast | yes
43 | pascual et al. (2012) | 360 | face-to-face | france, romania, russia | yes
44 | pascual et al. (2012) | 360 | face-to-face | france, romania, russia | yes
45 | pascual et al. (2012) | 128 | face-to-face | france, china | yes
46 | pascual et al. (2002) | 400 | email | france | yes
47 | pascual et al. (2009) | 120 | face-to-face | france | yes
48 | pascual et al. (2015) | 60 | face-to-face | france | yes
49 | pascual et al. (2015) | 160 | face-to-face | france | yes
50 | pascual et al. (2020) | 314 | face-to-face | france, china | yes
51 | pascual et al. (2020) | 788 | face-to-face | france, moldavia, tunisia | yes
52 | silone et al. (2016) | 155 | postal letter | france | yes

figure 2
meta-analysis forest plot for all studies. each row shows author(s), year, and study #, sample size, and the observed outcome [95% ci]; the random-effects model estimate is 0.44 [0.36, 0.51].

subject gender

we hypothesized that the byaf technique would increase compliance to a higher degree with women than with men. while we found a slightly larger effect size of the byaf technique for women, this difference was not statistically significant.

confederate gender

we hypothesized that the byaf technique would increase compliance to a higher degree with women confederates than with men confederates. we did not find support for this hypothesis, as the test for the difference was non-significant. we also performed an anova on the confederate and subject gender moderators to check for a possible interaction effect. the anova revealed no statistically significant interaction effect (q(3) = 2.18, p = 0.54).

culture

we hypothesized that the byaf technique would increase compliance to a higher degree in individualistic cultures than in collectivistic cultures. our results indicate a higher effect size of the byaf technique in individualistic cultures than in collectivistic ones, but the result is not significant.

interactivity

we hypothesized that the byaf technique would increase compliance to a higher degree in face-to-face interaction than in the other types of interaction. our results indicate a higher and significant effect size of the byaf technique for face-to-face interaction than for the others, yet we caution against drawing any general conclusions from these findings, as we did not find enough effect sizes for the "other" moderator levels. for example, we collected only one effect size for the use of the technique by phone.

freedom evocation

we hypothesized a stronger effect of the byaf technique with the exact term "but you are free" than with other terms.
on the contrary, our results indicate a higher effect for the other framings combined, although the difference is not significant.

carpenter's analysis

we hypothesized a stronger effect size for the studies coded in carpenter's (2013) meta-analysis than for the experiments conducted after it. we did not find any differences between the studies made before and after carpenter's analysis, as the average effect sizes are very similar.

type of request

we hypothesized a higher number of compliers with the byaf technique for a prosocial request than for a selfish one. our results tend to indicate the contrary: participants complied more with a selfish request than with a prosocial request under the byaf technique, but the effect is not significant.

temporality

we hypothesized that the effect of the byaf technique would be stronger for immediate requests and weaker for delayed ones. our results corroborate this hypothesis; we found a stronger and significant effect for immediate requests (g = 0.47, 95% ci [0.41, 0.54]) than for delayed requests (g = 0.25, 95% ci [0.03, 0.47]).

publication bias

we tested for the presence of publication bias using several methods; a summary of the publication bias analyses is provided in table 3. we ran the publication bias analyses on effect sizes collapsed by study, with one effect size per study. point estimates are consistent, and the methods that produce confidence intervals show substantial overlap between methods. the range of estimates goes from 0.25 to 0.56. the trim-and-fill method indicates an asymmetry of the funnel, with 17 studies missing on the left side, confirmed by a significant egger's regression test. the asymmetry of a funnel plot can be caused by two things: publication bias or other factors (e.g., poor methodological quality, true heterogeneity, artefacts, or chance; egger et al., 1997).
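egger's regression test was presumably run in r (e.g., metafor provides it); the core idea can be sketched in a few lines of python with hypothetical data: regress the standardized effect (effect / se) on precision (1 / se), and a nonzero intercept signals funnel asymmetry.

```python
import numpy as np

def egger_regression(effects, ses):
    """Core of Egger's test: OLS of standardized effect (effect / se)
    on precision (1 / se). The intercept captures small-study effects
    (funnel plot asymmetry); the slope estimates the underlying effect."""
    effects, ses = np.asarray(effects, float), np.asarray(ses, float)
    y = effects / ses          # standardized effects
    x = 1.0 / ses              # precision
    X = np.column_stack([np.ones_like(x), x])
    (intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)
    return intercept, slope

# toy example: effect = 0.2 + 0.5 * se, i.e. noisier studies report
# larger effects -> intercept ~ 0.5 flags the asymmetry, slope ~ 0.2
ses = [0.05, 0.1, 0.2, 0.3, 0.4]
effects = [0.2 + 0.5 * s for s in ses]
b0, b1 = egger_regression(effects, ses)
```

(the full test additionally divides the intercept by its standard error to get the z statistic reported in table 3; that step is omitted here for brevity.)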
the distinction between publication bias and other factors relies on where the missing studies fall in the funnel plot. if the missing studies are in the significant area (i.e., the white area inside the funnel plot), the meta-analysis lacks significant effect sizes, which is mainly due to other factors. if the missing studies are in the non-significant area (i.e., the darker areas of the funnel plot), this probably indicates publication bias. based on the funnel plot (figure 3) and the trim-and-fill plot (figure 4), our results indicate the presence of both signs, as we found support for a lack of both significant and non-significant studies. these results are strengthened by the three-parameter selection model (3psm) estimate, for which the likelihood ratio test is close to the significance threshold, which could indicate selective reporting (hedges, 1992). in the case of inconsistencies between estimators, the 3psm is a better indication (carter et al., 2019) and, in our case, does not exclude possible publication bias. overall, while some estimators indicate possible publication bias, the tests more robust to high heterogeneity do not favor the possibility of selective reporting. this result is, however, accompanied by a possible problem of poor methodological quality leading to a (rather small) inflation of the effect, from a found effect of 0.44 to an estimated mean effect between 0.34 and 0.38. we ran p-curve and p-uniform analyses, which found estimated effects of g = 0.41 and g = 0.38, respectively. the p-uniform analysis found 45 significant effect sizes, and the p-curve analysis indicated the presence of evidential value and no absence of evidential value (see the supplementary materials for the p-curve table). as requested by the editor, we ran statcheck (nuijten, 2018) on the statistics we used to retrieve the number of participants in each condition and found only one inconsistent result, which did not affect the overall result.

table 2
moderator analysis of the but you are free technique
moderator | subgroup | k | n | mean g | 95% ci | difference [95% ci] | p
subject gender | woman | 38 | 9316 | 0.48 | [0.40, 0.56] | |
subject gender | man | 41 | 8008 | 0.42 | [0.35, 0.50] | -0.059 [-0.17, 0.05] | .28
confederate gender | woman | 50 | 4355 | 0.45 | [0.36, 0.54] | |
confederate gender | man | 26 | 2048 | 0.41 | [0.27, 0.55] | -0.04 [-0.21, 0.13] | .62
culture | individualistic | 65 | 18550 | 0.45 | [0.37, 0.53] | |
culture | collectivistic | 9 | 978 | 0.32 | [0.20, 0.44] | 0.13 [-0.01, 0.28] | .08
interactivity | face-to-face | 64 | 9101 | 0.49 | [0.42, 0.56] | |
interactivity | by e-mail | 5 | 7115 | 0.19 | [-0.13, 0.52] | |
interactivity | by phone | 1 | 1625 | 0.43 | [0.33, 0.53] | |
interactivity | by postal letter | 2 | 1479 | 0.02 | [-0.60, 0.64] | |
interactivity | by internet | 2 | 208 | -0.01 | [-0.25, 0.23] | |
interactivity | overall other than face-to-face | 10 | 10427 | 0.15 | [-0.05, 0.36] | -0.34** [-0.55, -0.13] | .002
freedom evocation | « but you are free » | 59 | 14069 | 0.42 | [0.34, 0.50] | |
freedom evocation | other | 13 | 5218 | 0.53 | [0.36, 0.71] | 0.11 [-0.08, 0.30] | .26
carpenter | before | 54 | 16835 | 0.44 | [0.35, 0.52] | |
carpenter | after | 20 | 2693 | 0.43 | [0.27, 0.60] | 0.007 [-0.18, 0.19] | .95
type of request | selfish | 40 | 12603 | 0.50 | [0.41, 0.60] | |
type of request | prosocial | 34 | 6925 | 0.36 | [0.26, 0.47] | 0.14 [-0.005, 0.29] | .06
temporality | immediate | 63 | 10451 | 0.47 | [0.41, 0.54] | |
temporality | delayed | 12 | 9212 | 0.25 | [0.03, 0.47] | 0.23 [-0.004, 0.45] | .05
note. k = number of samples; n = total number of individuals in k; mean g = average hedges' g effect size; ci = lower and upper limits of the 95% confidence interval; * p < .05, two-tailed; ** p < .01, two-tailed; *** p < .001, two-tailed.

robustness

we did not pre-register an estimation of robustness. still, we ran a script to create a graphical display of study heterogeneity (gosh) to assess the robustness of the effect size found.
Table 3
Publication bias analyses results

Method                                          Results and adjusted models
Three-parameter selection model                 Likelihood ratio test: 3.39, p = .07; adjusted model: g = 0.38, 95% CI [0.26, 0.50]
PET                                             b = 0.34 [0.25, 0.42], p < .001
PEESE                                           b = 0.36 [0.30, 0.42], p < .001
p-uniform                                       Adjusted model: g = 0.45, 95% CI [0.37, 0.56], 45 significant
Henmi & Copas (2010)                            Adjusted model: g = 0.36, 95% CI [0.26, 0.51]
Trim and fill                                   Funnel plot asymmetry: 17 studies missing on the left side
Rank correlation test (Begg & Mazumdar, 1994)   Kendall's tau = 0.14, p = .09
Egger's regression test                         z = 2.06, p = .04

Note. Values in brackets indicate 95% confidence intervals [lower bound, upper bound].

[Figure 3. Funnel plot for all studies (observed outcome against standard error).]

[Figure 4. Trim-and-fill funnel plot (Hedges' g against standard error). Note. The 17 missing studies are shown in black. We used the trim-and-fill method to impute studies on the left with a random-effects model; the Egger regression line is shown in red.]

We provide the R script in the supplementary materials rather than in the RMarkdown because of the time consumption of the analysis: on our recent computer, it took between 3 and 4 hours. One test of robustness is the leave-one-out analysis, a method (Olkin et al., 2012) designed to see the influence of one effect size on heterogeneity. Another possibility is to estimate the influence of subgroups, which requires a very large number of meta-analyses to cover every combination of effect sizes that could influence the robustness of the analysis. In fact, with the 74 effect sizes found, this leads to about 1.88 × 10^22 meta-analyses, which makes an exhaustive comparison impossible. The GOSH makes the analysis graphical, by plotting each meta-analysis as a dot.
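The subset count and the idea behind a GOSH display can be sketched briefly. This is a minimal illustration in Python with synthetic data and simplified fixed-effect pooling; the function names and data are hypothetical, and our actual analysis used the R script in the supplementary materials:

```python
import random

def pooled_fixed(effects, variances):
    """Inverse-variance (fixed-effect) pooled estimate and Cochran's Q."""
    w = [1.0 / v for v in variances]
    est = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - est) ** 2 for wi, yi in zip(w, effects))
    return est, q

def gosh_points(effects, variances, n_subsets=2000, seed=1):
    """One (pooled estimate, I^2) point per random subset of studies."""
    rng = random.Random(seed)
    k = len(effects)
    points = []
    for _ in range(n_subsets):
        idx = [i for i in range(k) if rng.random() < 0.5]
        if len(idx) < 2:
            continue  # a meta-analysis needs at least two studies
        est, q = pooled_fixed([effects[i] for i in idx],
                              [variances[i] for i in idx])
        i2 = max(0.0, (q - (len(idx) - 1)) / q) if q > 0 else 0.0
        points.append((est, i2))
    return points

# With k = 74 effect sizes, exhaustive enumeration is hopeless:
n_subsets_total = 2 ** 74 - 1  # about 1.89e22 non-empty subsets
```

Because enumerating every subset is impossible at k = 74, GOSH-style software samples random subsets and plots each subset's estimate against its heterogeneity, as in the sketch above.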
If the dots are homogeneously displayed, the effect found is robust, while if two or more clusters appear, at least one subgroup influences the overall effect size too strongly. Our GOSH plot can be found in Figure 5. The figure presents a homogeneous circle form, showing that all meta-analyses have an average estimate between 0.3 and 0.6 and heterogeneity between I² = 60% and I² = 90%. We conclude that the meta-analysis estimate is robust to left-out studies.

[Figure 5. GOSH plot for robustness. Note. The plot shows how heterogeneity varies with the overall estimate for every left-out meta-analysis. For every meta-analysis, the overall estimate varies between 0.3 and 0.6, with heterogeneity between I² = 60% and I² = 90%.]

Z-curve analysis

Based on feedback from a reviewer, we created a z-curve analysis (Figure 6; Bartoš and Schimmack, 2021). The z-curve is a method for estimating publication bias and the possibility of false positives. The observed discovery rate is 45% (64 significant tests out of 141). The expected discovery rate, or the mean power before selection for significance, is 6%. The expected replication rate, or the mean power after selection for significance, is 73%. Thus, the power of studies after selection for significance is far higher than before, a clear indication of publication bias with a high false-positive risk.

Risk of Bias 2 (RoB2)

As requested by the editor, we conducted a RoB2 check (J. Higgins, Thomas, et al., 2022; McGuinness and Higgins, 2021). We detailed the check by domain alongside the assessment in the spreadsheet.
Overall, we found that nearly 40% of studies did not randomize or declare the randomization of the participants, nearly 40% lacked an explanation of missing data, and 50% had a bias in the measurement of the outcome, because we cannot trust the studies made with Guéguen's students (see Brown, 2020). We found no bias due to deviation from the intended intervention, because all interventions were straightforward, with measurement of the direct behavior. Finally, no study declared a pre-registered plan (see Figure 7). After assessing the overall risk of bias, we created a traffic-light plot visualizing the risks by study (see Figure 8). The complete plot is in the supplementary materials on OSF. We found only seven studies at low risk and decided to run another meta-analysis on these studies. These seven studies indicated no effect of the BYAF technique, with g = 0.11, 95% CI [-0.18, 0.40]. The heterogeneity was huge, with I² = 95%. A forest plot of the effect can be found in Figure 9. Based on an exchange with the editor, we conducted a third meta-analysis, including all studies with an overall rating of "low risk" or "some concerns". The result of that meta-analysis is g = 0.38, 95% CI [0.27, 0.49], and I² = 84%.

[Figure 6. Z-curve analysis of the but-you-are-free effect (expectation maximization, EM method). Range: 0.01 to 5.18; 74 tests, 15 significant; observed discovery rate: 0.20, 95% CI [0.12, 0.32]; expected discovery rate: 0.18, 95% CI [0.05, 0.46]; expected replicability rate: 0.22, 95% CI [0.14, 0.51].]

Discussion

We conducted three meta-analyses of the BYAF technique. We tested several moderators and found support for a contextual effect of the technique on compliance.
Including all studies, we found a direct medium effect of the BYAF technique (g = 0.44), consistent across most of our moderators. Excluding high-risk studies, the effect found was weaker (g = 0.38), and it was nonexistent with only low-risk-of-bias studies (g = 0.11, CI including the null).

[Figure 7. Risk of bias in the studies included in our meta-analysis, summarized by domain (randomization process, deviations from intended interventions, missing outcome data, measurement of the outcome, selection of the reported result, and overall), each rated as low risk, some concerns, high risk, or no information.]

[Figure 8. Traffic-light plot of the first ten studies included in our meta-analysis.]

Confirmatory moderators

Type of request

We initially hypothesized that the efficacy of the BYAF technique might be higher for prosocial requests than for selfish requests, as Carpenter (2013) first hypothesized. In his meta-analysis, he did not find evidence that prosocial requests were associated with a higher level of compliance. With the addition of new effect sizes based on new experiments, we also did not find a significant difference, but our difference now goes in the other direction: our results indicate a non-significant higher effect size for selfish requests. This result, while surprising, might be confounded with other moderators. Indeed, selfish requests were often made face-to-face and immediately, two conditions with high average effect sizes, while prosocial requests were often made indirectly and in delayed conditions, two conditions with lower average effect sizes. In the prosocial condition, the effect size found (g = 0.36) was medium, indicating that this moderator does not play a fundamental role in the effectiveness of the BYAF technique: it might work independently of this contextual effect.
Temporality

Our moderator analysis revealed that the effect of the BYAF technique is stronger for immediate than for delayed requests, while being at the threshold of significance (α = .05, p = .05). This finding is in accordance with our hypothesis and with the findings of Carpenter (2013). Indeed, the effect found in the immediate condition was medium to strong (g = 0.47) and weak in the delayed condition (g = 0.25). This finding is not surprising, since we only added two effect sizes to the delayed condition relative to Carpenter's meta-analysis; most recent work in the BYAF literature uses the immediate condition. The effectiveness of the BYAF technique is impacted by the temporality moderator: once the demand is delayed, we cannot be sure that the BYAF technique is effective, which shows the importance for the participant of being directly linked to the confederate.

Exploratory moderators

Subject and confederate gender

We hypothesized that the BYAF technique would make women comply more than men and that individuals would comply more with a woman confederate. We found no support for a moderating effect of gender. Indeed, across the four conditions, the effect sizes remain constant (between g = 0.41 and g = 0.48). Also, the result from the ANOVA reveals no interaction effect: the gender of the individual does not interact with the gender of the confederate.

Culture

We hypothesized that the BYAF technique could be stronger in individualistic countries than in collectivistic countries. Our results could possibly corroborate this hypothesis. However, the p-value is not significant, and we only found 9 effect sizes for participants in collectivistic countries, which limits our possibility of explanation. Nonetheless, we found that the BYAF technique can be more effective in an individualistic setting. This might be due in part to the ease with which people in individualistic countries become reactant to the request, and to the greater effectiveness of the BYAF technique at lowering reactance in this situation. Other mental processes might be active in individualistic countries that influence people not to comply, but they remain unknown.

[Figure 9. Forest plot of "low risk" studies included in our meta-analysis, showing the effect of the BYAF technique on compliance for Silone et al. (2016), Pascual et al. (2002), Pascual (2002), Guéguen et al. (2013), Guéguen et al. (2002), Dufourcq-Brana (2007), and Carpenter & Pascual (2016); random-effects estimate g = 0.11, 95% CI [-0.18, 0.41].]

Interactivity

We hypothesized that face-to-face interaction would lead to more compliance with the BYAF effect than other types of interaction. Overall, we found a significant difference in this direction: face-to-face interaction had the highest average effect size (g = 0.51). In more detail, we found that phoning could be a good way to apply the BYAF technique, with a medium effect size (g = 0.43), but we only found one study with this type of interaction. E-mail can also be an effective way to use the BYAF technique, but the effect size found was considerably lower (g = 0.19) and included the null. We call for further examination of this interactivity condition, since the results are not clear. For the other types of interaction (i.e., postal letter, internet), we found no effect of the BYAF technique, but we are limited by the number of studies included, with only 2 effect sizes found for each condition. Overall, we found a significant difference between face-to-face interaction and the others, but we cannot draw a definitive conclusion due to too few effect sizes in the other conditions.
Freedom evocation

The goal of this moderator was to understand whether the exact phrase "but you are free" was necessary for the effect to appear. We found that it was not, as the effects found did not differ between the exact phrase and others. The combination of the other phrasings leads to a higher, non-significant average effect size, signaling a possibly more effective phrasing to induce compliance than the standard "but you are free". But what were the other types of evocation that gave the highest effect sizes? Given the forest plot (see supplementary materials), we see that at least three studies give a very high effect size. In the first (Farley et al., 2019, Study 2, g = 1.70), the confederate added the phrase "feel free to say no". In the second (Guéguen et al., 2013, Study 11, g = 0.91), the confederate added the phrase "do as you wish", and in the third (Pascual and Guéguen, 2002, Study 7, g = 0.75), the confederate added the phrase "you are not obliged". Comparing the three phrases, we do not find any pattern leading to a meaningful conclusion about how they produce a stronger effect of the BYAF technique. The only common point between the three studies is that they have very few participants (respectively 40, 86, and 19), leading to a probable overestimate of the effect size.

Before and after Carpenter (2013)

Finally, we wanted to see whether the studies published before and after the Carpenter (2013) analysis lead to a different effect size. Carpenter (2013) found an average effect size of r = .13. Once our overall effect size (g = 0.45) was transformed into a correlation, we found an r = .22 for the technique, nearly twice what Carpenter found. This result still holds for the analysis we made of the identical dataset used by Carpenter. Why do we have so much difference? We found several errors in the Carpenter (2013) analysis.
For example, Carpenter used one experiment (Dufourcq-Brana, 2007) twice. Carpenter also made ambiguous and unreported decisions in his study. For example, for the experiments with two measures (e.g., Guéguen et al., 2002; Marchand et al., 2009; Pascual, 2002), he decided to take one of them without making the reason transparent. In our analysis, we decided to merge the measures unless one variable was not includable, as reported in our preregistration. We also found several errors in the original papers; some tests were not compatible with the reported number of participants (e.g., Guéguen et al., 2013; Pascual et al., 2002, Study 10). All the discrepancies found are made open in the commentary columns of the dataset. For the publication bias section, Carpenter only used the trim-and-fill technique, which found no missing studies. We do not know how the researchers used the algorithm (bilateral or left-centered, as recommended), and the trim-and-fill plot is not available. The researchers also did not report the heterogeneity (I² or tau²) found, while giving the percentage of variability explained by sampling error. They still found that the BYAF technique only accounts for 22% of the variation, a condition in which the trim-and-fill tool alone might not be sensitive (Carter et al., 2019). Thus, we think that the use of this single publication bias estimator is not enough to assess the credibility of the effect size found. Finally, with the use of the trim-and-fill, Carpenter (2013) found an overall corrected effect size of r = 0.04.
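The conversion between a standardized mean difference and a correlation, used above to compare our g with Carpenter's r, can be sketched with the standard formulas; this assumes equal group sizes and is an illustration, not the exact code from our RMarkdown:

```python
from math import sqrt

def d_to_r(d):
    """Standardized mean difference (d or g) to point-biserial r, assuming equal group sizes."""
    return d / sqrt(d ** 2 + 4)

def r_to_d(r):
    """Point-biserial r back to a standardized mean difference."""
    return 2 * r / sqrt(1 - r ** 2)

# g = 0.45 converts to roughly the r = .22 reported above,
# and Carpenter's r = .13 converts to roughly d = 0.26.
r_overall = d_to_r(0.45)
d_smallest = r_to_d(0.13)
```

The two functions are inverses of each other, so converting r = .13 to d and back recovers .13.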
We found that: 1) the average effect size was much higher than the one reported by Carpenter (2013) for the overall sample; 2) the average effect size did not differ before and after the analysis made in 2013; 3) there was a lack of transparency in the choices made in 2013, leading to some errors and curious effect sizes being taken into account; and 4) there was not enough assessment of possible publication bias, leading one to think that the effect size found was more meaningful than it possibly is.

Implications

With all studies included, we found a medium effect size, but only one meaningful moderator, as the BYAF technique works better in the face-to-face condition than in others, with possible covariates. Also, several publication bias estimators flag possible problems in relation to the experiments on this technique. We did not find that temporality is important to the effectiveness of the BYAF technique. More surprisingly, we did not find that subject and confederate genders were important. Also, we did not find differences between a selfish and a prosocial request, and in fact found quite the contrary, as selfish requests were more responsive to the BYAF technique than prosocial ones. Our results indicate that participants do not seem to process a selfish request more carefully than a prosocial one. The interactivity moderator was significant, but with too few studies for most modalities, and merging them could mislead our results. Finally, culture was barely significant, with far more participants from France than from the other countries, and we cannot be sure that the effect is clearly related to culture and not to country and/or to the confederates in these countries. Overall, we did not find any consistent evidence for possible moderators. A few publication bias estimators indicate a possibility of publication bias. We found a slight asymmetry in our funnel plot, for significant and non-significant results, via different techniques.
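Egger's regression test, one of the asymmetry checks reported in Table 3, can be illustrated with a minimal sketch: regress each study's z value (effect/SE) on its precision (1/SE); a non-zero intercept signals funnel-plot asymmetry. The data below are synthetic, and this toy version omits the intercept's standard error and p-value that real software reports:

```python
def simple_ols(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def egger_intercept(effects, ses):
    """Regress z = effect/SE on precision = 1/SE; the intercept indexes asymmetry."""
    z = [e / s for e, s in zip(effects, ses)]
    precision = [1.0 / s for s in ses]
    return simple_ols(precision, z)[0]

ses = [0.05, 0.10, 0.20, 0.30, 0.40]
symmetric = [0.40] * 5                       # same effect at every SE -> intercept ~ 0
small_study = [0.40 + 1.5 * s for s in ses]  # effects grow with SE -> intercept ~ 1.5
```

When effects are unrelated to their standard errors, the intercept is near zero; when small (high-SE) studies report inflated effects, the intercept moves away from zero, which is the pattern a significant Egger test flags.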
According to Egger et al. (1997), four explanations are to be considered for this asymmetry: selection bias, poor methodological quality, true heterogeneity, and artefact. Regarding selection bias, location and language bias are possible. For example, most of the original experiments on compliance were said to have been made in the same street of the same city (Vannes, France), and others in Bordeaux (France). Also, it is possible that "but you are free" and « vous êtes libre de » do not have the same meaning, most importantly once translated into the languages of collectivistic countries. We found selective reporting in the Carpenter (2013) meta-analysis (reported in a commentary in the dataset). By looking at the original articles, we found missing and inconsistent data and had to ask the author (we report their answers on OSF). We also found several signs of poor methodological quality: many of the studies have very low power relative to the effect size found in Carpenter's analysis. For example, with a correlation of r = .13 transformed to d = 0.26, a power of 95%, an equal number of participants in each group, an α of 5%, and a one-tailed test (since we do not expect the control group to be more effective than the BYAF), we would need 321 participants per group to have a chance of detecting the effectiveness of the technique (see the supplementary materials for more details). In the forest plot of articles published after the Carpenter (2013) analysis (see supplementary materials), we find that only one article (i.e., Grassini et al., 2012) has the necessary power to detect an effect. Unfortunately, this experiment was made via e-mail and does not give us information about the standard face-to-face use of the technique, and we cannot rule out unknown covariates linked to the use of an online store.
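The 321-per-group figure above reproduces under a simple normal-approximation formula for a two-sample comparison of means; exact software (e.g., G*Power) uses the noncentral t distribution and can differ slightly. A sketch:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.95, one_tailed=True):
    """Normal-approximation a priori sample size per group for a two-sample test of Cohen's d."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha) if one_tailed else nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# d = 0.26 (r = .13 converted), one-tailed alpha = .05, 95% power -> 321 per group
n_required = n_per_group(0.26)
```

The dominant term is (z_alpha + z_beta)² / d², which is why halving the expected effect size roughly quadruples the required sample.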
Overall, given the smallest effect size of interest of r = .13, no study conducted is sufficiently powered to ensure that the effect of the BYAF technique leads to compliance. As for true heterogeneity, we see that the confidence intervals are mostly wide, due to low sample sizes.

Limitations

Sample size and power

In the first published paper on the BYAF technique, the researchers employed 20 participants per condition (Guéguen and Pascual, 2000). Afterward, Carpenter found a very low effect size for the BYAF technique, which implies the need for a large sample (n = 321 per condition for 95% power, n = 240 per condition for 80% power, with an alpha of 0.05, as shown in the Implications section). In the most recent experiment on the subject (Farley et al., 2019), the researchers assigned 25 participants to the BYAF group and 20 to the control group. In between, we found no studies with samples large enough to detect an effect if the effect exists. Low sample size is a major concern for the possibility of establishing the effectiveness of the BYAF technique. To see the power of each study in the meta-analysis, we performed a power test (Figure 10). We set the test with r = 0.13 and alpha = 0.05. The redder the area, the less power; the greener, the more. We found 5 studies in the green area and only two in the yellow area. The average power is 9.70% and the replicability index 0%, which means that we have less than 10% chance to reject H0 when there is a true effect, and no chance at all to replicate one study (see Motyl et al., 2017, for the R-index). Also, the z-curve showed a very low discovery rate of 6%.

[Figure 10. Power test of the articles published after Carpenter's (2013) analysis (α = 0.05, δ = 0.13; median power = 9.7%; d33% = 0.31, d66% = 0.49; R-index = 0%). Note. We set alpha to 0.05 and the effect size to r = 0.13, the effect size found by Carpenter (2013). The redder the area, the less power; the greener, the more. We found no studies in the green area and only one in the yellow area: the average power is 9.20% and the replicability index 0%.]

Guéguen's work

One main reason for conducting this meta-analysis was to see how reliable the effect of the But You Are Free technique is. One major limitation of the present meta-analysis is that nearly all the studies using this technique had Guéguen's authorship or were made by a Ph.D. student or close collaborator of Guéguen. We tried to run a meta-analysis without Guéguen's name and found a similar effect size of g = 0.48 [0.28, 0.68], with a total N = 1010, of which n = 457 participants came from the Pascual and collaborators (2021) study. The most implausible odds ratios come from Guéguen's studies, as we found odds ratios higher than 5 and some close to 10, with the huge exception of Farley and collaborators (2019), whose result was higher than OR = 23, mostly because of a lack of power (the data were 19/20 compliers in the BYAF condition and 9/20 in the control condition). We checked these studies using the RoB2 tool and found them problematic for many reasons, such as no randomization, confederates who were mostly young students aware of the experimentation, no preregistration, and curious ways of selecting participants. As Brown (2020) shows, we cannot trust Guéguen's young student confederates, because some fabricated their data. Finally, we conducted a "low risk" meta-analysis, which showed no effect. This result clearly questions the existence of the BYAF effect.
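The OR > 23 noted for Farley and collaborators (2019) follows directly from the counts reported above; a quick sketch of the computation:

```python
def odds_ratio(a_events, a_total, b_events, b_total):
    """Odds ratio comparing the odds of compliance in group A with group B."""
    odds_a = a_events / (a_total - a_events)
    odds_b = b_events / (b_total - b_events)
    return odds_a / odds_b

# 19/20 compliers with BYAF vs. 9/20 in control: (19/1) / (9/11) = 209/9, about 23.2
or_farley = odds_ratio(19, 20, 9, 20)
```

With only one non-complier in the treated group, the estimate is extremely unstable; a standard continuity correction (adding 0.5 to each cell) would shrink it considerably, which illustrates why such extreme odds ratios from tiny samples deserve suspicion.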
Limitations in moderators

We tried to test several moderators to reduce heterogeneity: when and how can we ensure that the BYAF is effective? For most of them, the number of experiments was particularly low. Also, when aggregating them, we did not find any differences between them, and even if we had, we could not have drawn a strong conclusion, because some of these moderators are very different from each other. For the moderators with more than 10 effect sizes, we did not find any differences, and we cannot explain why the heterogeneity persists in the effects found. The only significant moderator we found was the one from our confirmatory hypothesis, temporality, as we found that immediate requests are more effective than delayed ones. Finally, we did not find any evidence that moderators can diminish the heterogeneity of the BYAF technique, leading to the conclusion that: 1) we did not take into account the most important moderators, mostly because researchers failed to draw attention to them; 2) there are no important moderators of the BYAF technique, which contradicts the moderators found for other techniques (see Carpenter, 2013, for a review of some); or 3) publication bias and/or the possibility of a truly random effect leads to an inflated effect size.

Culture

While it might be less important than the issues raised above, we cannot be sure that our simple dichotomy of individualistic versus collectivistic countries is appropriate for this technique. Indeed, the BYAF technique might rely on subtle or important differences between countries, as some studies in cross-cultural psychology have pointed out. For example, Boski (2020) found that men complied far less with men in Poland, but not in England, based on a sociocultural model. We do not know whether the distinction we made was the best possible, and we cannot compare countries because, France aside, all the other countries have experiments from only one study.
Approximation of effect

While we made transparent how we coded our effect sizes, they are not all closely similar. Indeed, for some studies we merged two conditions to obtain a control group. In other studies, we took only one possible effect size, the one most closely related to the BYAF effect. Nonetheless, it is still possible that our decisions lead to bias. This drawback may apply to almost every meta-analysis in empirical science, but we tried to improve transparency and the completeness of reporting to keep bias as low as possible. Finally, the RoB2 check was made only by the first author and, as transparent as it is, the coding is subjective and could lead to a selection of "low risk" studies different from that of another coder.

Direction for future research

The BYAF technique

Since the first meta-analysis, we have not found any study powered enough to detect an effect of the BYAF technique. The most conservative meta-analysis we made, with only the "low risk" studies, questions the existence of the effect. The main direction to take is to run a well-powered study in the main and original context in which the technique appeared: a face-to-face interaction with a request for spare change for the bus. This replication should be made by several confederates of both genders, in many places across the world. This amount of work could also be done as a collaborative replication, to see how the effect varies across different contexts and environments. At best, the study should be a pre-registered experiment or registered report based on the minimum effect size of interest of r = .13, with a power of 90% and an alpha of 0.05, leading to a required sample size of 616 participants. The RoB2 check helped us detail what would be needed for the quality of the study. First, one should carefully explain the randomization and selection of participants in the street.
Confederates should not be aware of the experimental conditions and should not be the ones who select the participants. Researchers also need to report the targeted individuals who did not reply at all. Confederates should be of all ages, because only (very) young students were confederates in the studies included in the present meta-analysis (in most studies, the mean age of confederates is close to 20 years).

Moderators

Once a well-designed, highly powered study is made, it would be possible to investigate some moderators. For example, one type of moderator might be highly relevant: the interactivity (face-to-face or a more indirect setting) and immediate-versus-delayed moderators were significant, which means that the presence of the confederate might be a necessity for the BYAF technique to work. One direction is to run studies in an internet setting, refining the BYAF technique into a nudge, an easy and cheap intervention in the choice architecture (Thaler and Sunstein, 2009) aiming to improve the acceptance of a request. The difference found between the internet and face-to-face settings could greatly improve our understanding of how the BYAF technique works. Also, the only study we have in an e-mail setting shows an effect size in the range of effects found in nudge theory (DellaVigna and Linos, 2020). Another direction to investigate is the respective impact of gender, age, and culture. We tried to investigate the impact of gender and found no effect, but we could not control for age, which can affect the relationship between the gender of the confederate and that of the participant in the face-to-face setting. Having the help of confederates of multiple ages and genders could help us understand the impact of these social cues on compliance with the request. Also, once gender and age are controlled for, we can move on and enhance the theory by building on cultural variation, in different countries and cultures.
Finally, we did not find any differences between the exact phrase of evocation "but you are free" and others; for now, we propose not to pursue this direction until a well-powered preregistered replication of the initial effect is made.

Author contact

Adrien Fillon, ERA Chair in Science and Innovation Policy & Studies (SInnoPSis), University of Cyprus, adrienfillon@hotmail.fr

Conflict of interest and funding

The authors declared no potential conflicts of interest with respect to the authorship and/or publication of this article. One author, who provided data and verified the coding, was an important author in the literature. However, the main coder was independent of the literature. Coding and verifications were made transparent in the Excel file, and the source for the code is provided in the Excel file (column N). The authors received no financial support for the research and/or authorship of this article.

Author contributions

Adrien worked under the supervision of Lionel and Fabien at Aix-Marseille University in conducting the pre-registered meta-analysis. Adrien wrote the pre-registration, with verification and registration by Alexandre, Lionel, and Fabien. Adrien and Alexandre conducted the literature search, developed the coding scheme, and coded the articles. Adrien provided the RMarkdown code and analyses. Adrien summarized the methods and results and wrote the manuscript. Adrien, Alexandre, Lionel, and Fabien finalized the manuscript for submission.

Open science practices

This article earned the Preregistration+, Open Data, and Open Materials badges for preregistering the hypothesis and analysis before data collection, and for making the data and materials openly available. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Bartoš, F., & Schimmack, U. (2021). zcurve: An implementation of z-curves.
Retrieved November 26, 2021, from https://cran.r-project.org/package=zcurve

Boski, P. (2020). Investigating the sociocultural models with cultural experiments: A Polish–English study on request → compliance in gender relations. Asian Journal of Social Psychology, 23(2). https://doi.org/10.1111/ajsp.12408

Brehm, J. W. (1966). A theory of psychological reactance. Academic Press.

Brown, N. (2020). Nick Brown's blog: The Guéguen saga update, summer 2020 edition. Retrieved November 25, 2021, from https://steamtraen.blogspot.com/2020/06/the-gueguen-saga-update-summer-2020.html

Carpenter, C. J. (2013). A meta-analysis of the effectiveness of the "but you are free" compliance-gaining technique. Communication Studies, 64(1), 6–17. https://doi.org/10.1080/10510974.2012.727941

Carpenter, C. J., & Boster, F. J. (2009). A meta-analysis of the effectiveness of the disrupt-then-reframe compliance gaining technique. Communication Reports, 22(2), 55–62. https://doi.org/10.1080/08934210903092590

Carter, E. C., Schönbrodt, F. D., Gervais, W. M., & Hilgard, J. (2019). Correcting for bias in psychology: A comparison of meta-analytic methods. Advances in Methods and Practices in Psychological Science, 2(2), 115–144. https://doi.org/10.1177/2515245919847196

Channouf, A. (1990). Antécédents et effets cognitifs et comportementaux des conduites : de l'internalité à la consistance [Doctoral dissertation, Université Pierre Mendès France (Grenoble)]. Retrieved November 25, 2021, from https://www.theses.fr/1990gre29032

Delacre, M., Lakens, D., Ley, C., Liu, L., & Leys, C. (2021). Why Hedges' g's based on the non-pooled standard deviation should be reported with Welch's t-test. https://doi.org/10.31234/osf.io/tu6mp

DellaVigna, S., & Linos, E. (2020). RCTs to scale: Comprehensive evidence from two nudge units (Working Paper No. 27594). National Bureau of Economic Research. https://doi.org/10.3386/w27594

Desrumaux, P. (1996). Hypothèse de dépendance entre les processus d'explications causales internes et externes et l'engagement pro-attitudinal : une application au don du sang. Revue Internationale de Psychologie Sociale, 9, 77–94.

Dolinska, B., & Dolinski, D. (2006). To command or to ask? Gender and effectiveness of "tough" vs "soft" compliance-gaining strategies. Social Influence, 1(1), 48–57. https://doi.org/10.1080/15534510500314571

Dufourcq-Brana, M. (2007). L'influence d'une déclaration de liberté sur l'efficacité du pied-dans-la-porte et de l'amorçage [Doctoral dissertation, Lorient]. Retrieved November 25, 2021, from http://www.theses.fr/2007loris100

Dufourcq-Brana, M., Pascual, A., & Guéguen, N. (2006). Abstract. Revue Internationale de Psychologie Sociale, 19(3), 173–187. Retrieved November 25, 2021, from https://www.cairn.info/revue-internationale-de-psychologie-sociale-2006-3-page-173.htm

Duval, S., & Tweedie, R. (2000). A nonparametric "trim and fill" method of accounting for publication bias in meta-analysis. Journal of the American Statistical Association, 95(449), 89–98. https://doi.org/10.1080/01621459.2000.10473905

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629–634. https://doi.org/10.1136/bmj.315.7109.629

Farley, S. D., Kelly, J., Singh, S., Jr, C. T., & Young, T. (2019). "Free to say no": Evoking freedom increased compliance in two field experiments. The Journal of Social Psychology, 159(4), 482–489. https://doi.org/10.1080/00224545.2018.1505707

Fillon, A., Kutscher, L., & Feldman, G. (2021). Impact of past behaviour normality: Meta-analysis of exceptionality effect. Cognition and Emotion, 35(1), 129–149. https://doi.org/10.1080/02699931.2020.1816910

Fischbacher, U., & Föllmi-Heusi, F. (2013). Lies in disguise—An experimental study on cheating. Journal of the European Economic Association, 11(3), 525–547. https://doi.org/10.1111/jeea.12014

Garmendia, C. A., Nassar Gorra, L., Rodriguez, A. L., Trepka, M. J., Veledar, E., & Madhivanan, P. (2019). Evaluation of the inclusion of studies identified by the FDA as having falsified data in the results of meta-analyses: The example of the apixaban trials. JAMA Internal Medicine, 179(4), 582–583. https://doi.org/10.1001/jamainternmed.2018.7661

Gehanno, J.-F., Rollin, L., & Darmoni, S. (2013). Is the coverage of Google Scholar enough to be used alone for systematic reviews. BMC Medical Informatics and Decision Making, 13(1), 7. https://doi.org/10.1186/1472-6947-13-7

Grassini, A., Pascual, A., Guéguen, N., & Jacob, C. (2012). The effect of the 'evoking freedom' technique on sales in a computer-mediated field setting. The International Review of Retail, Distribution and Consumer Research, 22(4), 435–437. https://doi.org/10.1080/09593969.2012.690780

Grosch, K., & Rau, H. A. (2016). Gender differences in compliance: The role of social value orientation (Working Paper No. 88). GlobalFood Discussion Papers. Retrieved November 25, 2021, from https://www.econstor.eu/handle/10419/146902

Guéguen, N., Jacob, C., Pascual, A., & Morineau, T. (2002). Request solicitation and semantic evocation of freedom: An evaluation in a computer-mediated communication context. Perceptual and Motor Skills, 95(1), 208–212.
https://doi.org/10.2466/pms.2002.95.1.208 https://doi.org/10.1080/08934210903092590 https://doi.org/10.1080/08934210903092590 https://doi.org/10.1177/2515245919847196 https://doi.org/10.1177/2515245919847196 https://www.theses.fr/1990gre29032 https://www.theses.fr/1990gre29032 https://doi.org/10.31234/osf.io/tu6mp https://doi.org/10.31234/osf.io/tu6mp https://doi.org/10.3386/w27594 https://doi.org/10.3386/w27594 https://doi.org/10.1080/15534510500314571 https://doi.org/10.1080/15534510500314571 http://www.theses.fr/2007loris100 http://www.theses.fr/2007loris100 https://www.cairn.info/revue-internationale-de-psychologie-sociale-2006-3-page-173.htm https://www.cairn.info/revue-internationale-de-psychologie-sociale-2006-3-page-173.htm https://www.cairn.info/revue-internationale-de-psychologie-sociale-2006-3-page-173.htm https://doi.org/10.1080/01621459.2000.10473905 https://doi.org/10.1080/01621459.2000.10473905 https://doi.org/10.1136/bmj.315.7109.629 https://doi.org/10.1080/00224545.2018.1505707 https://doi.org/10.1080/00224545.2018.1505707 https://doi.org/10.1080/02699931.2020.1816910 https://doi.org/10.1080/02699931.2020.1816910 https://doi.org/10.1111/jeea.12014 https://doi.org/10.1111/jeea.12014 https://doi.org/10.1001/jamainternmed.2018.7661 https://doi.org/10.1001/jamainternmed.2018.7661 https://doi.org/10.1186/1472-6947-13-7 https://doi.org/10.1186/1472-6947-13-7 https://doi.org/10.1080/09593969.2012.690780 https://doi.org/10.1080/09593969.2012.690780 https://www.econstor.eu/handle/10419/146902 https://www.econstor.eu/handle/10419/146902 https://doi.org/10.2466/pms.2002.95.1.208 19 guéguen, n., joule, r.-v., halimi-falkowicz, s., pascual, a., fischer-lokou, j., & dufourcq-brana, m. (2013). i’m free but i’ll comply with your request: generalization and multidimensional effects of the “evoking freedom” technique. journal of applied social psychology, 43(1). https : //doi.org/10.1111/j.1559-1816.2012.00986.x guéguen, n., & pascual, a. (2000). 
evocation of freedom and compliance: the but you are free of. . . technique. current research in social psychology, 5, 264–270. hamamura, t., bettache, k., & xu, y. (2018). individualism and collectivism. in the sage handbook of personality and individual differences: volume ii: origins of personality and individual differences (pp. 365–382). sage publications ltd. https://doi.org/10.4135/9781526470317 hedges, l. v. (1992). meta-analysis. journal of educational statistics, 17(4), 279–296. https : / / doi . org/10.3102/10769986017004279 higgins, j., chandler, t., cumpston, m., li, t., page, m., & welch, v. ( (2022). cochrane handbook for systematic reviews of interventions version 6.3 (updated february 2022). cochrane. www. training.cochrane.org/handbook higgins, j., thomas, j., chandler, j., cumpston, m., li, t., page, m., & welch, v. (2022). cochrane handbook for systematic reviews of interventions version 6.3. cochrane. www.training.cochrane. org/handbook jonas, e., graupmann, v., kayser, d. n., zanna, m., traut-mattausch, e., & frey, d. (2009). culture, self, and the emergence of reactance: is there a “universal” freedom? journal of experimental social psychology, 45(5), 1068–1080. https : / / doi.org/10.1016/j.jesp.2009.06.005 kiesler, c. a. (1971). the psychology of commitment: experiments linking behavior to belief. academic press. kiesler, c. a., & sakumura, j. (1966). a test of a model for commitment. journal of personality and social psychology, 3(3), 349–353. https://doi.org/ 10.1037/h0022943 kim, h. s., & sherman, d. k. (2007). "express yourself": culture and the effect of self-expression on choice. journal of personality and social psychology, 92(1), 1–11. https://doi.org/10.1037/ 0022-3514.92.1.1 long, d. a., mueller, j. c., wyers, r., khong, v., & jones, b. (1996). effects of gender and dress on helping behavior. psychological reports, 78(3), 987–994. https : / / doi . org / 10 . 2466/pr0.1996.78.3.987 marchand, m., halimi-falkowicz, s., & joule, r. 
.-. (2009). comment aider les résidents d’une maison de retraite à librement décider de participer à une activité sociale ? toucher, « vous êtes libre de. . . » et pied-dans-la-porte. european review of applied psychology, 59(2), 153–161. https:// doi.org/10.1016/j.erap.2008.05.001 martín-martín, a., orduna-malea, e., thelwall, m., & delgado lópez-cózar, e. (2018). google scholar, web of science, and scopus: a systematic comparison of citations in 252 subject categories. journal of informetrics, 12(4), 1160– 1177. https : / / doi . org / 10 . 1016 / j . joi . 2018 . 09.002 mcguinness, l. a., & higgins, j. p. t. (2021). risk-ofbias visualization (robvis): an r package and shiny web app for visualizing risk-of-bias assessments. research synthesis methods, 12(1), 55–61. https://doi.org/10.1002/jrsm.1411 moher, d., liberati, a., tetzlaff, j., altman, d. g., & group, t. p. (2009). preferred reporting items for systematic reviews and meta-analyses: the prisma statement. plos medicine, 6(7), e1000097. https://doi.org/10.1371/journal. pmed.1000097 motyl, m., demos, a. p., carsel, t. s., hanson, b. e., melton, z. j., mueller, a. b., prims, j. p., sun, j., washburn, a. n., wong, k. m., yantis, c., & skitka, l. j. (2017). the state of social and personality science: rotten to the core, not so bad, getting better, or getting worse? journal of personality and social psychology, 113(1), 34–58. https://doi.org/10.1037/pspa0000084 nuijten, s. e. bibinitperiod m. b. (2018). statcheck: extract statistics from articles and recompute p values. retrieved november 25, 2021, from https: //cran.r-project.org/package=statcheck olkin, i., dahabreh, i. j., & trikalinos, t. a. (2012). gosh – a graphical display of study heterogeneity. research synthesis methods, 3(3), 214– 223. https://doi.org/10.1002/jrsm.1053 pascual, a. (2002). soumission sans pression et technique du "vous êtes libre de. . . " (these de doctorat). bordeaux 2. 
retrieved november 25, 2021, from http://www.theses.fr/2002bor21000 pascual, a., dufourcq-brana, m., & guéguen, n. (2002). l’induction d’un sentiment de liberté au service du pied-dans-la-porte : une évaluation dans le cadre de la communication par ordinateur. unpublished. pascual, a., & guéguen, n. (2002). la technique du "vous êtes libre de...": induction d’un sentihttps://doi.org/10.1111/j.1559-1816.2012.00986.x https://doi.org/10.1111/j.1559-1816.2012.00986.x https://doi.org/10.4135/9781526470317 https://doi.org/10.3102/10769986017004279 https://doi.org/10.3102/10769986017004279 www.training.cochrane.org/handbook www.training.cochrane.org/handbook www.training.cochrane.org/handbook www.training.cochrane.org/handbook https://doi.org/10.1016/j.jesp.2009.06.005 https://doi.org/10.1016/j.jesp.2009.06.005 https://doi.org/10.1037/h0022943 https://doi.org/10.1037/h0022943 https://doi.org/10.1037/0022-3514.92.1.1 https://doi.org/10.1037/0022-3514.92.1.1 https://doi.org/10.2466/pr0.1996.78.3.987 https://doi.org/10.2466/pr0.1996.78.3.987 https://doi.org/10.1016/j.erap.2008.05.001 https://doi.org/10.1016/j.erap.2008.05.001 https://doi.org/10.1016/j.joi.2018.09.002 https://doi.org/10.1016/j.joi.2018.09.002 https://doi.org/10.1002/jrsm.1411 https://doi.org/10.1371/journal.pmed.1000097 https://doi.org/10.1371/journal.pmed.1000097 https://doi.org/10.1037/pspa0000084 https://cran.r-project.org/package=statcheck https://cran.r-project.org/package=statcheck https://doi.org/10.1002/jrsm.1053 http://www.theses.fr/2002bor21000 20 ment de liberté et soumission à une requête ou le paradoxe d’une liberté manipulatrice. [the "you are free of..." technique: induction of a feeling of freedom and compliance in a request or the paradox of a manipulating freedom.] revue internationale de psychologie sociale, 15(1), 51–80. pascual, a., oteme, c., samson, l., wang, q., halimifalkowicz, s., souchet, l., girandola, f., guéguen, n., & joule, r.-v. (2012). 
crosscultural investigation of compliance without pressure: the “you are free to. . .” technique in france, ivory coast, romania, russia, and china. cross-cultural research, 46(4), 394–416. https : / / doi . org / 10 . 1177 / 1069397112450859 pratkanis, a. r. (ed.). (2007). the science of social influence: advances and future progress. psychology press. https : / / doi . org / 10 . 4324 / 9780203818565 sawilowsky, s. s. (2009). new effect size rules of thumb. journal of modern applied statistical methods, 8(2), 597–599. https : / / doi . org / 10 . 22237/jmasm/1257035100 schäfer, t., & schwarz, m. a. (2019). the meaningfulness of effect sizes in psychological research: differences between sub-disciplines and the impact of potential biases. frontiers in psychology, 10, 813. https://doi.org/10.3389/fpsyg. 2019.00813 thaler, r. h., & sunstein, c. r. (2009). nudge: improving decisions about health, wealth, and happiness. (updated édition). penguin books. triandis, h. c. (1989). the self and social behavior in differing cultural contexts. psychological review, 96(3), 506–520. https : / / doi . org / 10 . 1037 / 0033-295x.96.3.506 vaughn, a. j., firmin, m. w., & hwang, c.-e. (2009). efficacy of request presentation on compliance. social behavior and personality: an international journal, 37(4), 441–449. https://doi.org/10. 2224/sbp.2009.37.4.441 walters, w. h. (2007). google scholar coverage of a multidisciplinary field. information processing & management, 43(4), 1121–1132. https : / / doi . org/10.1016/j.ipm.2006.08.006 yeung, s. k., fillon, a., protzko, j., elsherif, m., moreau, d., & feldman, g. (2021). experimental metaanalysis registered report template. osf. 
https: //doi.org/10.17605/osf.io/ytgrp https://doi.org/10.1177/1069397112450859 https://doi.org/10.1177/1069397112450859 https://doi.org/10.4324/9780203818565 https://doi.org/10.4324/9780203818565 https://doi.org/10.22237/jmasm/1257035100 https://doi.org/10.22237/jmasm/1257035100 https://doi.org/10.3389/fpsyg.2019.00813 https://doi.org/10.3389/fpsyg.2019.00813 https://doi.org/10.1037/0033-295x.96.3.506 https://doi.org/10.1037/0033-295x.96.3.506 https://doi.org/10.2224/sbp.2009.37.4.441 https://doi.org/10.2224/sbp.2009.37.4.441 https://doi.org/10.1016/j.ipm.2006.08.006 https://doi.org/10.1016/j.ipm.2006.08.006 https://doi.org/10.17605/osf.io/ytgrp https://doi.org/10.17605/osf.io/ytgrp introduction original study and its follow-up moderators type of request temporality subject gender confederate gender culture interactivity type of freedom evocation before and after carpenter’s analysis summary hypotheses main hypotheses confirmatory hypotheses exploratory hypotheses method open-science, replicability, and our current study literature search coding included studies analysis moderator analyses results the but-you-are-free main effect study design and measures as moderators subject gender confederate gender culture interactivity freedom evocation carpenter’s analysis type of request temporality publication bias robustness z-curve analysis risk of bias 2 (rob2) discussion confirmatory moderators type of request temporality exploratory moderator subject and confederate gender culture interactivity freedom evocation before and after carpenter (2013) implication limitations sample size and power guéguen’s work limitations in moderators culture approximation of effect direction for future research the byaf technique moderators author contact conflict of interest and funding author contributions open science practices meta-psychology, 2020, vol 4, mp.2019.2117 https://doi.org/10.15626/mp.2019.2117 article type: original article published under the cc-by4.0 license open data: 
n/a
open materials: yes
open and reproducible analysis: yes
open reviews and editorial process: yes
preregistration: n/a
edited by: daniël lakens
reviewed by: w. qu, a. lohman, s. grønneberg
analysis reproduced by: andré kalmendal
all supplementary files can be accessed at osf: http://doi.org/10.17605/osf.io/b5cxq

issues, problems and potential solutions when simulating continuous, non-normal data in the social sciences

oscar l. olvera astivia
university of washington

abstract
computer simulations have become one of the most prominent tools for methodologists in the social sciences to evaluate the properties of their statistical techniques and to offer best-practice recommendations. amongst the many uses of computer simulations, evaluating the robustness of methods to their assumptions, particularly univariate or multivariate normality, is crucial to ensure the appropriateness of data analysis. in order to accomplish this, quantitative researchers need to be able to generate data where they have a degree of control over its non-normal properties. even though great advances have been achieved in statistical theory and computational power, the task of simulating multivariate, non-normal data is not straightforward. there are inherent conceptual and mathematical complexities implied by the phrase “non-normality” which are not always reflected in the simulation studies conducted by social scientists. the present article attempts to offer a summary of some of the issues concerning the simulation of multivariate, non-normal data in the social sciences. an overview of common algorithms is presented, as well as some of the characteristics and idiosyncrasies implied in them which may exert undue influence on the results of simulation studies.
a call is made to encourage the meta-scientific study of computer simulations in the social sciences in order to understand how simulation designs frame the teaching, usage and practice of statistical techniques within the social sciences.

keywords: monte carlo simulation, non-normality, skewness, kurtosis, copula distribution

introduction
the method of monte carlo simulation has become the workhorse of the modern quantitative methodologist, enabling researchers to overcome a wide range of issues, from handling intractable estimation problems to helping guide and evaluate the development of new mathematical and statistical theory (beisbart & norton, 2012). from its inception in los alamos national laboratory, monte carlo simulations have provided insights to mathematicians, physicists, statisticians and almost any researcher who relies on quantitative analyses to further their field. i posit that computer simulations can address three broad classes of issues depending on the ultimate goal of the simulation itself: issues of estimation and mathematical tractability, issues of data modelling, and issues of robustness evaluation. the first issue is perhaps best exemplified in the development of markov chain monte carlo (mcmc) techniques to estimate parameters for bayesian analysis or to approximate the solution of complex integrals. the second is more often seen within areas such as mathematical biology or financial mathematics, where the behaviour of chaotic systems can be approximated as if they were random processes (e.g., hoover & hoover, 2015). in this case, computer simulations are designed to answer “what if” type questions where slight alterations to the initial conditions of the system may yield widely divergent results.
the final issue (and the one that will concern the rest of this article) is a hybrid of the previous two and particularly popular within psychology and the social sciences: the evaluation of robustness in statistical methods (carsey & harden, 2013). whether it is testing for violations of distributional assumptions, presence of outliers, model misspecifications or finite-sample studies where the asymptotic properties of estimators are evaluated under realistic sample sizes, the vast majority of quantitative research published within the social sciences is either exclusively based on computer simulations or presents a new theoretical development which is also evaluated or justified through simulations. just by looking at the table of contents of three leading journals in quantitative psychology for the present year, multivariate behavioural research, the british journal of mathematical and statistical psychology and psychometrika, one can see that every article makes use of computer simulations in one way or another. this type of simulation study can be described in four general steps:

(1) decide the models and conditions from which the data will be generated (i.e., what “holds” in the population).
(2) generate the data.
(3) estimate the quantities of interest for the models being studied in step (1).
(4) save the parameter estimates, standard errors, goodness-of-fit indices, etc. for later analyses and go back to step (2).

steps (2)-(4) would be considered a replication within the framework of a monte carlo simulation, and repeating them a large number of times reveals the patterns of behaviour of the statistical methods under investigation that will result in further recommendations for users of these methods. robustness simulation studies emphasize the decisions made in step (1) because the selection of statistical methods to test and data conditions will guide the recommendations that will subsequently inform data practice.
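the four steps above can be sketched as a minimal monte carlo loop. as an illustration (not an example from this article), consider a hypothetical robustness study of the two-sample t-test under skewed, exponential data; all names and settings below are illustrative:

```r
# a minimal sketch of steps (1)-(4) for a robustness simulation: the
# type i error rate of the two-sample t-test under skewed (exponential)
# data. the design and all settings here are illustrative.
set.seed(1)
n_reps <- 1000                 # number of replications of steps (2)-(4)
n      <- 30                   # per-group sample size, fixed in step (1)
p_vals <- numeric(n_reps)
for (r in 1:n_reps) {
  g1 <- rexp(n)                # step (2): generate non-normal data
  g2 <- rexp(n)                #           under a true null hypothesis
  p_vals[r] <- t.test(g1, g2)$p.value  # steps (3)-(4): estimate and save
}
mean(p_vals < .05)             # empirical type i error rate; a value
                               # near .05 suggests robustness
```

analysing the saved replications (here, the rejection rate) is what turns the loop into a robustness study: changing the conditions fixed in step (1) maps out when the method holds up.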
for the case of non-normality, the level of skewness or kurtosis, the presence/absence of outliers, etc. would be encoded here. most of the time, steps (2) through (4) are assumed to operate seamlessly, either because the researcher has sufficient technical expertise to program them in a computer or because it is simply assumed that the subroutines and algorithms employed satisfy the requests of the researcher. a crucial aspect of the implementation of these algorithms, and of the performance of the simulation in general, is the ability of the researcher to ensure that the simulation design and its actual computer implementation are consistent with one another. if this consistency is absent, then step (2) is brought into question and one, either as a producer or consumer of simulation research, needs to wonder whether or not the conclusions obtained from the monte carlo studies are reliable. this issue constitutes the central message of this article as it pertains to how one would simulate multivariate, non-normal data, the types of approaches that exist to do this, and what researchers should be on the lookout for.

non-normal data simulation in the social sciences

investigating possible violations of distributional assumptions is one of the most prevalent types of robustness studies within the quantitative social sciences. monte carlo simulations have been used for such investigations on the general linear model (e.g., beasley & zumbo, 2003; finch, 2005), multilevel modelling (e.g., shieh, 2000), logistic regression (e.g., hess, olejnik, & huberty, 2001), structural equation modelling (e.g., curran, west & finch, 1996) and many more. when univariate properties are of interest (such as, for example, the impact that non-normality has on the t-test or anova), researchers have a plethora of distribution types to choose from.
distributions such as the exponential, log-normal and uniform are usually employed to test for non-zero skewness or excess kurtosis (e.g., oshima & algina, 1992; wiedermann & alexandrowicz, 2007; zimmerman & zumbo, 1990). however, when the violation of assumptions implies a multivariate, non-normal structure, the data-generating process becomes considerably more complex because, for the continuous case, many candidate densities can be called the “multivariate” generalization of a well-known univariate distribution (kotz, balakrishnan & johnson, 2004). consider, for instance, the case of a multivariate distribution closely related to the normal: the multivariate t distribution. kotz and nadarajah (2004) list fourteen different representations of distributions that could be considered as “multivariate t”, with the most popular representation being used primarily out of convenience, due to its connection with elliptical distributions. in general, there can be many mathematical objects which could be considered the multivariate generalization or “version” of well-known univariate probability distributions, and choosing among the potential candidates is not always a straightforward task.

from multivariate normal to multivariate non-normal distributions: what works and what does not work

the normal distribution possesses a property that eases the generalization process from univariate to multivariate spaces in a relatively straightforward fashion: it is closed under convolutions (i.e., closed under linear combinations), such that adding normal random variables and multiplying them by constants results in a random variable which is itself normally distributed. let $a_1, a_2, a_3, \ldots, a_n$ be independent, normally distributed random variables such that $a_i \sim N(\mu_i, \sigma_i^2)$ for $i = 1, 2, \ldots, n$. if $k_1, k_2, k_3, \ldots, k_n$ are real-valued constants, then it follows that:

$$\sum_{i=1}^{n} k_i a_i \sim N\!\left(\sum_{i=1}^{n} k_i \mu_i,\ \sum_{i=1}^{n} k_i^2 \sigma_i^2\right) \qquad (1)$$

consider $z = (z_1, z_2, z_3, \ldots, z_n)'$ where each $z_i \sim N(0, 1)$. for any real-valued matrix $B$ of proper dimensions (besides the null matrix), define $y = Bz$. then, by using the property presented in equation (1), the random variable $y + \mu$ follows a multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma = BB'$. this relies on what is known as the cholesky decomposition or cholesky factorization: for this particular calculation, a matrix (in this case the covariance matrix $\Sigma$) can be “decomposed” or “factorized” into a lower-triangular matrix ($B$ in the example) and its transpose. it is important to point out that other matrix-decomposition approaches (such as principal component analysis or factor analysis) could serve a similar role. although it would be tempting to follow the same general approach to construct a multivariate, non-normal distribution (i.e., select a covariance/correlation matrix, decompose it into its factors $BB'$, and multiply the matrix of uncorrelated, non-normal distributions of choice by a factor), and it has been done in the past (see, for instance, hittner, may & silver, 2003; silver, hittner & may, 2004; wilcox & tian tian, 2008), it is of utmost importance to highlight that this procedure would only guarantee that the population correlation or covariance matrix is the one intended by the researcher. this property holds irrespective of whether the distributions to correlate are normal or non-normal. the univariate marginal distributions, however, would lose their unique structures and, by the central limit theorem, would become more and more normally distributed the more one-dimensional marginals are added. figure 1 highlights this fact by reproducing the simulation conditions described in silver, hittner and may (2004) for the uniform case.
consider four independent, identically distributed random variables $(x_1, x_2, x_3, x_4)$ which follow a standard uniform distribution, $U(0, 1)$, and a population correlation matrix $R_{4 \times 4}$ with equal correlations of 0.5 in the off-diagonals. the process of multiplying the matrix of standard-uniform random variables by the cholesky-decomposed $R_{4 \times 4}$ to induce the correlational structure (matrix $B$ in the paragraph above) ends up altering the univariate distributions such that they no longer follow the non-normal distributions intended by the researchers (vale & maurelli, 1983). the r code below exemplifies this process. in order to truly generate multivariate non-normal structures with a degree of control over the marginal distributions and the correlation structure simultaneously, more complex simulation techniques are needed.

# block 1
set.seed(124)
## creates the correlation matrix and factors it
r <- matrix(0.5, nrow = 4, ncol = 4)
diag(r) <- 1
b <- chol(r)
## generates four uncorrelated u(0, 1) variables and induces the correlation
x <- matrix(runif(4 * 10000), ncol = 4)
x_corr <- x %*% b

(a matrix $A$ is positive-definite if, for all non-zero vectors $v$, $v'Av > 0$; equivalently, all the eigenvalues of said matrix should be greater than 0. all covariance (and correlation) matrices are defined to be positive-definite. if they are not, they are not a true correlation/covariance matrix.)

… population excess kurtosis of 15. the following r code would generate the non-normal distribution e with the user-specified non-normality:

# block 2
set.seed(124)
## eqn4 (the fleishman system) to be solved

> cor(x)
       [,1]   [,2]
[1,] 1.0000 0.5002
[2,] 0.5002 1.0000
## correlation matrix of gaussian copula
> cor(y)
       y1     y2
y1 1.0000 0.4405
y2 0.4405 1.0000

figure 2. four-dimensional distribution implied by the simulation design of silver, hittner & may (2004) with assumed $U(0, 1)$ marginal distributions.

correlation shrinkage

although the gaussian copula induces a relationship between the gamma and uniform marginals, it is important to highlight that the original correlation of $\rho = 0.5$ has now changed.
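the kind of construction that produces correlation output like the above can be sketched as follows, a minimal gaussian copula with gamma(1, 1) and u(0, 1) marginals initialized at a correlation of 0.5; the sample size and variable names are illustrative, not the article's own:

```r
# sketch of a bivariate gaussian copula with gamma(1, 1) and u(0, 1)
# marginals, initialized at rho = 0.5. names and n are illustrative.
set.seed(124)
n  <- 1e5
z1 <- rnorm(n)
z2 <- 0.5 * z1 + sqrt(1 - 0.5^2) * rnorm(n)  # bivariate normal, rho = 0.5
x  <- cbind(z1, z2)
u  <- pnorm(x)                               # probability integral transform
y  <- cbind(y1 = qgamma(u[, 1], shape = 1, rate = 1),  # inverse-cdf step
            y2 = qunif(u[, 2]))
cor(x)[1, 2]  # close to the initial 0.5
cor(y)[1, 2]  # shrunk to roughly 0.44
```

the uniforms `u` carry the dependence of the normals; applying the inverse cdfs fixes the marginal shapes, and the shrinkage of the pearson correlation falls out of that last, non-linear step.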
when calculating the pearson correlation of the original sample from the bivariate normal distribution against the one from the gaussian copula, the difference becomes apparent. there is a shrinkage of about 0.05-0.06 units in the correlation metric, with the shrinkage contingent on the size of the initial correlation. figure 3 further clarifies this issue by relating the initial correlation of the bivariate normal distribution to the final correlation of the gaussian copula (notice that figure 3 is not directly related to the code in block 2). as the figure shows, there is a downward bias of approximately 0.15 units in the correlation metric at the theoretically maximum correlation of this copula. there are two important reasons for why this happens, even though it is not always acknowledged within the simulation literature in the social sciences. first, both the probability integral transform (u <- pnorm(z)) and the quantile functions (or inverse cdfs) needed to obtain the non-normal marginal distributions (qgamma and qunif) are not linear transformations. the pearson correlation is only invariant under linear transformations, so it stands to reason that if non-linear transformations are applied, there is no expectation that the correlation will remain the same. second, there exists a result from copula distribution theory that places further restrictions on the range of the pearson correlation, referred to as the fréchet–hoeffding bounds.

figure 3. relationship between the original correlation for the bivariate normal distribution and the final correlation for the gaussian copula with $G(1, 1)$ and $U(0, 1)$ univariate marginals. the horizontal axis includes values for the correlation for the bivariate normal distribution and the vertical axis presents the transformed correlation after the copula is constructed. the identity function (straight line) is included as reference.

hoeffding (1940) showed that the covariance
between two random variables $(s, t)$ can be expressed as:

$$\mathrm{cov}(s, t) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left[ h(s, t) - f(s)\,g(t) \right] ds\, dt \qquad (7)$$

where $h(s, t)$ is their joint cumulative distribution function and $f(s)$, $g(t)$ are the marginal distribution functions of the random variables $s$ and $t$, respectively. define $h(s, t)_{\min} = \max[f(s) + g(t) - 1,\ 0]$ and $h(s, t)_{\max} = \min[f(s),\ g(t)]$. fréchet (1951) and hoeffding (1940) independently proved that:

$$h(s, t)_{\min} \leq h(s, t) \leq h(s, t)_{\max} \qquad (8)$$

and, by using these bounds in equation (7) above, it can be shown that $\rho_{\min} \leq \rho \leq \rho_{\max}$, where $\rho$ is the linear correlation between the marginal distributions. the implication of this inequality is that, given the distributional shapes defined by $f(s)$ and $g(t)$, the correlation coefficient may not fully span the theoretical [-1, 1] range. for instance, astivia and zumbo (2017) have shown that for the case of standard lognormal variables the theoretical correlation range is restricted to $[-1/e, 1]$ (approximately $[-0.368, 1]$), where $e$ is the base of the natural logarithm. for the gaussian copula above, the theoretical upper bound is approximately 0.85 on the positive side of the correlation range. when the inverse cdfs are implemented to generate the non-normal marginals, the fréchet–hoeffding bounds are induced, restricting the types of correlational structures that multivariate, non-normal distributions can generate when compared to multivariate normal ones. moreover, the fréchet–hoeffding bounds are the greatest lower and least upper bounds; that is, they cannot be improved upon in general (joe, 2014; nelsen, 2010). although there is nothing that can expand the fréchet–hoeffding bounds to the full [-1, 1] range if the marginals are fixed, the intermediate correlation matrix approach described in section 2.2 can be used to find the proper value for the correlation coefficient needed to initialize the gaussian copula, if a specific population value is desired.
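the lognormal restriction quoted above can be checked numerically: the bounds are attained by perfectly monotone (comonotonic and countermonotonic) pairs. a minimal sketch, with illustrative names:

```r
# numerical illustration of the fréchet–hoeffding bounds for standard
# lognormal marginals, whose pearson correlation range is [-1/e, 1]
set.seed(1)
z     <- rnorm(1e6)
s     <- exp(z)    # standard lognormal
t_max <- exp(z)    # comonotonic pair: attains the upper bound
t_min <- exp(-z)   # countermonotonic pair: attains the lower bound
cor(s, t_max)      # 1
cor(s, t_min)      # approximately -1/e = -0.368
```

even though `t_min` is a perfect (decreasing) function of `s`, the pearson correlation cannot go below about -0.368, which is exactly the point of the bounds: the marginal shapes, not the strength of the dependence, cap the linear correlation.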
as long as the value of this intermediate correlation is within the bounds specified by the non-normal marginals, the final correlation after the marginal transformation is completed will match the population parameter intended by the researcher.

relationship of the norta method, the vale–maurelli algorithm and gaussian copulas

gaussian copulas have other important connections to the simulation work done within the social sciences, notably the fact that the norta method can be parameterized as a gaussian copula (qing, 2017). by extension, the vale–maurelli algorithm has also been proven to be a special case of gaussian copulas, so that the majority of simulation work conducted in the social sciences has really only considered the gaussian copula as its test case (foldnes & grønneberg, 2015; grønneberg & foldnes, 2019). the same issues and limitations presented in sections 2.2 and 2.3 are, in fact, interchangeable, given that the data-generating methods considered in both cases share the same essential properties.

distributions closed under linear transformation and their connection to simulating multivariate non-normality

as presented in section 2.1, one of the many attractive properties of the normal distribution is that the sum of independent, normal random variables is itself normally distributed. this property is known as being "closed under convolutions" (i.e., when one combines in a certain way, or "convolves", random variables, the resulting random variable belongs to the same family as its original components). through the use of this property, one can define the multivariate normal distribution by finding linear combinations (i.e., convolutions) of the one-dimensional normal marginals that will result in xn×d ∼ n(µd×1, σd×d). although not very common, this property is shared by some other probability distributions, making it the preferred starting point for defining multivariate generalizations of them.
continuous distributions such as the cauchy and gamma share this property, and their multivariate extensions depend on it. the gamma distribution is a particularly relevant case given its connection to other well-known probability distributions such as the exponential and the chi-square. if x1 ∼ g(α1, β) and x2 ∼ g(α2, β), then x1 + x2 ∼ g(α1 + α2, β). notice that this closure property holds only when the rate parameter β is the same: the sum of two generic gamma distributions is not necessarily gamma-distributed (see moschopoulos, 1985). for the interested reader, an introduction to the theory of gamma distributions can be found in chapter 15 of krishnamoorthy (2016). as a motivating example to showcase a multivariate distribution that is not gaussian, yet closed under convolutions, consider p and q to be independently distributed poisson random variables with parameters (λp, λq) respectively. if w = p + q, then w ∼ poisson(λw = λp + λq). by using this property, one can generalize the univariate poisson distribution to multivariate spaces. consider p, q and v to be independent, poisson-distributed random variables with respective parameters λp, λq and λv. define two new random variables p∗ and q∗ as follows:

p∗ = p + v
q∗ = q + v    (9)

because p∗ and q∗ share v in common, (p∗, q∗)′ exhibits poisson-distributed, univariate marginal distributions with a covariance equal to λv. notice that this construction only allows for the case where the covariance between p∗ and q∗ is positive because, by definition, the parameter λ of a poisson distribution must be positive. figure 4 shows the bivariate histogram of a simulated example with p ∼ poisson(1), q ∼ poisson(2) and v ∼ poisson(3). in r code:

#block 4
set.seed(124)
## simulates independent poisson random variables p, q and v
p <- rpois(10000, lambda = 1)
q <- rpois(10000, lambda = 2)
v <- rpois(10000, lambda = 3)
## p* and q* share v, inducing a covariance equal to lambda_v
pstar <- p + v
qstar <- q + v

reproduction of onyike et al.'s (2003) table 2 (continued):

12                              39.4 (1.8)    38.6 (1.7)
marital status
  married                       52.2 (1.4)    50.9 (1.6)
  separated/divorced/widowed    11.4 (0.7)     4.8 (0.5)
  never married                 36.2 (1.5)    44.1 (1.7)
area of residence
  urban                         49.6 (5.0)    50.4 (4.9)
  rural                         50.4 (5.0)    49.6 (4.9)

notes.
underscored values are different from those reported in onyike et al.'s (2003) table 2. for "race/ethnicity", the number of males is 3,849; see discussion in the main text.

table 3. reproduction of onyike et al.'s (2003) table 3. % with dis/dsm-iii depression.

relative body weight             no. of participants   all respondents   females   males
normal weight (bmi 18.5–24.9)    4,154                  2.79              3.82      1.67
underweight (bmi <18.5)            301                  3.24              3.82      1.82
overweight (bmi 25.0–29.9)       2,297                  2.42              4.01      1.37
obese (bmi ≥30)                  1,658                  5.13              6.74      2.85
obesity class 1 (bmi 30–34.9)      981                  3.55              4.97      1.88
obesity class 2 (bmi 35–39.9)      410                  4.80              6.79      0.83
obesity class 3 (bmi ≥40)          267                 12.51             13.03     11.54

note. underscored values are different from those reported in onyike et al.'s (2003) table 3. numbers in parentheses represent the standard error of the corresponding percentage estimate.

in most cases, the wider confidence intervals in our reproduction do not affect whether the odds ratios reported in table 4 are statistically significant at the .05 level, with two exceptions:

• the odds ratio for the relationship between bmi (treated as a continuous variable) and past-month depression in females is statistically significant in onyike et al.'s (2003) table 4, 95% ci [1.03, 1.06], but not in our reproduction, 95% ci [0.99, 1.04].

• the odds ratio for the comparison of the prevalence of past-month major depression between obesity class 3 and normal weight participants in the male subsample is statistically significant in onyike et al.'s table 4, 95% ci [1.03, 57.26], but not in our reproduction, 95% ci [0.12, 486.2]. as mentioned above, there were only three male participants in obesity class 3 for this comparison; it does not seem implausible that minor variations in calculation methods between statistical software packages could cause substantial differences in their outputs for such small subsamples.
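how drastically interval estimates for odds ratios widen when a cell of the 2×2 table is tiny can be illustrated with a short python sketch. the counts and the helper `wald_or_ci` below are hypothetical illustrations, not values from the nhanes-iii data, and the sketch uses a plain wald interval rather than the survey-weighted estimators the original analyses require.

```python
import math

def wald_or_ci(a, b, c, d, z=1.96):
    """odds ratio (a*d)/(b*c) for a 2x2 table of counts
    [[a, b], [c, d]], with a wald confidence interval built
    on the log-odds scale."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # se of log(or)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# hypothetical large-cell table: a moderate, stable interval
print(wald_or_ci(50, 950, 30, 970))

# hypothetical tiny-cell table (e.g., 2 of only 3 exposed cases):
# the interval spans several orders of magnitude
print(wald_or_ci(2, 1, 30, 970))
```

with only a handful of participants in a cell, the standard error of the log odds ratio is dominated by the 1/a and 1/b terms, so small software-level differences in estimation can plausibly move the reported bounds a great deal.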
these two discrepancies nevertheless relate to relatively ancillary findings that were not emphasized in onyike et al.'s (2003) abstract or discussion.

table 4. reproduction of onyike et al.'s (2003) table 4. each cell gives the odds ratio (or) and its 95% ci for past-month, past-year, lifetime, and recurrent major depression, in that order.

all respondents
  bmi (continuous variable), n = 8,410: 1.05 [1.01, 1.09]; 1.03 [0.99, 1.06]; 1.02 [0.99, 1.05]; 1.01 [0.97, 1.05]
  normal weight (bmi 18.5–24.9), n = 4,154: 1.00§; 1.00§; 1.00§; 1.00§
  underweight (bmi <18.5), n = 301: 1.17 [0.49, 2.77]; 1.39 [0.66, 2.92]; 1.35 [0.74, 2.45]; 1.21 [0.54, 2.69]
  overweight (bmi 25.0–29.9), n = 2,297: 0.86 [0.53, 1.40]; 0.84 [0.53, 1.32]; 0.93 [0.65, 1.33]; 0.84 [0.54, 1.28]
  obese (bmi ≥30), n = 1,658: 1.88 [1.03, 3.43]; 1.42 [0.86, 2.33]; 1.22 [0.82, 1.81]; 1.13 [0.73, 1.76]
  obesity class 1 (bmi 30–34.9), n = 981: 1.28 [0.65, 2.53]; 1.01 [0.55, 1.84]; 0.87 [0.55, 1.38]; 0.78 [0.47, 1.29]
  obesity class 2 (bmi 35–39.9), n = 410: 1.76 [0.78, 3.97]; 1.67 [0.92, 3.06]; 1.39 [0.78, 2.46]; 1.41 [0.72, 2.77]
  obesity class 3 (bmi ≥40), n = 267: 4.98 [2.07, 11.98]; 2.92 [1.28, 6.63]; 2.60 [1.39, 4.86]; 2.28 [0.92, 5.67]

females
  bmi (continuous variable), n = 4,561: 1.05 [1.01, 1.08]; 1.02 [0.99, 1.05]; 1.02 [0.99, 1.04]; 1.00 [0.97, 1.03]
  normal weight (bmi 18.5–24.9), n = 2,180: 1.00§; 1.00§; 1.00§; 1.00§
  underweight (bmi <18.5), n = 202: 1.00 [0.38, 2.62]; 1.36 [0.61, 3.02]; 1.20 [0.59, 2.43]; 1.03 [0.40, 2.62]
  overweight (bmi 25.0–29.9), n = 1,095: 1.05 [0.65, 1.72]; 0.81 [0.54, 1.21]; 0.94 [0.66, 1.34]; 0.71 [0.45, 1.12]
  obese (bmi ≥30), n = 1,084: 1.82 [1.02, 3.25]; 1.29 [0.81, 2.07]; 1.12 [0.78, 1.61]; 0.97 [0.64, 1.49]
  obesity class 1 (bmi 30–34.9), n = 597: 1.32 [0.61, 2.86]; 0.90 [0.45, 1.80]; 0.74 [0.43, 1.28]; 0.68 [0.37, 1.25]
  obesity class 2 (bmi 35–39.9), n = 285: 1.84 [0.71, 4.75]; 1.66 [0.79, 3.46]; 1.41 [0.74, 2.70]; 1.40 [0.67, 2.94]
  obesity class 3 (bmi ≥40), n = 202: 3.78 [1.67, 8.55]; 2.19 [0.97, 4.87]; 2.15 [1.19, 3.87]; 1.36 [0.60, 3.13]

males
  bmi (continuous variable), n = 3,849: 1.06 [0.97, 1.16]; 1.04 [0.98, 1.10]; 1.02 [0.97, 1.07]; 1.03 [0.96, 1.10]
  normal weight (bmi 18.5–24.9), n = 1,974: 1.00§; 1.00§; 1.00§; 1.00§
  underweight (bmi <18.5), n = 99: 1.09 [0.13, 9.28]; 0.57 [0.07, 4.95]; 1.06 [0.23, 4.90]; 1.12 [0.04, 29.83]
  overweight (bmi 25.0–29.9), n = 1,202: 0.82 [0.35, 1.94]; 1.08 [0.56, 2.07]; 1.16 [0.65, 2.06]; 1.25 [0.64, 2.45]
  obese (bmi ≥30), n = 574: 1.73 [0.52, 5.71]; 1.54 [0.71, 3.36]; 1.28 [0.67, 2.47]; 1.40 [0.57, 3.45]
  obesity class 1 (bmi 30–34.9), n = 384: 1.13 [0.40, 3.17]; 1.22 [0.59, 2.54]; 1.14 [0.61, 2.13]; 1.00 [0.40, 2.47]
  obesity class 2 (bmi 35–39.9), n = 125: 0.49 [0.00, 3.1e8]; 0.99 [0.16, 5.97]; 0.66 [0.11, 3.94]; 0.71 [0.00, 1.9e9]
  obesity class 3 (bmi ≥40), n = 65: 7.68 [0.12, 486.2]; 4.53 [0.41, 50.07]; 3.26 [0.40, 26.54]; 5.15 [0.53, 50.23]

note. underscored values are different from those reported in onyike et al.'s (2003) table 4. §: reference category.

table 5

we were unable to reproduce onyike et al.'s (2003) table 5 because several of the covariates that these authors claimed to have included were either not available in the nhanes-iii data set that we downloaded, or were calculated in an unclear way. specifically:

• we were unable to find any measure of the use of psychiatric medicine in the nhanes-iii data set or code books.

• we have no way to determine the criteria used by onyike et al. to categorize participants' alcohol use as none, moderate, and abuse, based on the six variables (mypf1, mypf2, mypf3s, mypf4, mypf5s, and mypf6s) that correspond to participants' responses to questions about their alcohol consumption in the nhanes-iii interview.

• onyike et al. classified participants as (a) current smokers, (b) former smokers, or (c) those who had never smoked. the nhanes-iii survey and examination data sets contain a number of items related to the smoking of cigarettes, cigars, and pipes; it is not clear how these were combined to arrive at onyike et al.'s three-way classification.

• we do not understand why the five categories (excellent, very good, good, fair, poor) for physician's health rating from the nhanes-iii examination (variable pep13a) were collapsed into just three categories (excellent, good, fair/poor).

• we do not understand why the four race/ethnic categories from table 2 were collapsed into three in table 5, with "hispanic/other" apparently being used as an omnibus category for anyone who was not classed as "white" or "african-american" (this last category apparently being a synonym for "black" from table 2).

we could, of course, have reproduced the table with these covariates either omitted or guessed at, but a comparison of the results with the published table would probably not have been very meaningful.

discussion

our efforts to reproduce onyike et al.'s (2003) analyses were made easier by the fact that the underlying data set was openly accessible and extensively documented (the nhanes-iii documentation consists of many hundreds of pages for each data set). this is in contrast to the situation facing researchers who wish to reproduce articles for which the data are less thoroughly documented or simply not available for re-analysis at all.
despite this, however, it was difficult for us to reproduce many of onyike et al.'s tables, because we did not know all of the choices that these authors made in analyzing the data. the fact that our reanalysis was so challenging even in this seemingly favorable scenario speaks to the importance of sharing not only data and descriptions of analyses, but also the original code (typically in the form of scripts in the language of a statistical software package) that was used to process and analyze the data. it is only with access to this code that readers and reviewers can obtain full insight into how the data were actually analyzed. a number of positive changes in the process of analyzing scientific data and publishing the results of those analyses have taken place in the 18 years since onyike et al.'s (2003) article was published. first, the widespread dissemination and adoption of free software such as r (r core team, 2018) and its associated packages has made powerful computing tools and associated support resources available at essentially no cost to anyone with access to a quite modest desktop or laptop computer. second, organizations such as the open science framework (https://osf.io/) now make it easy for authors to share their analysis code and (depending on licensing arrangements and confidentiality issues) data. third, helped by the improvements mentioned in the two previous points, the sharing of code and data so that other researchers may reproduce and possibly extend one's results is rapidly becoming a standard part of publishing a scientific article (e.g., lindsay, 2017). all of these developments have played their part in our replication efforts and the writing of the current article. we were able to reproduce most of the figures in onyike et al.'s (2003) tables 1 and 2, although the analyses necessary to reproduce table 2 are somewhat inconsistent with the written description in the article (cf.
the issue of the "demographic characteristics of the respondents"), and include what appear to be at least two data processing errors. nevertheless, tables 1 and 2 represent primarily descriptive information rather than statistics bearing on onyike et al.'s research questions. for tables 3 and 4 (representing bivariate relationships between bmi and various operationalizations of depression), we were able to reproduce the reported statistics, albeit with some minor discrepancies. on the other hand, we were completely unable to reproduce table 5. this table represents arguably the most crucial statistical output of the study, in that it presents information about the relationship between bmi and depression while controlling for the variables that onyike et al. considered to be plausible confounds. our inability to reproduce the statistics in this table does not mean that onyike et al.'s results are invalid—indeed, they are entirely congruent with the findings of subsequent systematic reviews and meta-analyses, such as those by luppino et al. (2010) and pereira-miranda et al. (2017)—but it does suggest that they were presented without sufficient information to permit direct replication. despite the issues we have raised in the present article, we do not believe that onyike et al.'s (2003) article is severely flawed; certainly we do not think that it is atypical of the research that was being published at the time. nor do we think that an extensive corrigendum is required, although perhaps a brief note could be added to the published article to correct the most obvious errors that we have identified and add sufficient information about the data preparation and analysis process to allow the reproduction of the reported results.
our take-home message for researchers is, rather, a more general one: even with a carefully curated data set such as nhanes-iii, the process of data analysis requires precision and care, preferably with multiple sets of eyes and the sharing of code (and, where they are not already public, data) to allow for computational reproducibility (donoho, 2010) of the findings. we believe that the time needed for the reader of an article to reproduce the calculations in a published paper ought to be measurable in minutes, not months.

author contact

corresponding author is nicholas j. l. brown. author contact: nicholasjlbrown@gmail.com

conflict of interest and funding

the authors declare that no conflict of interest exists. no funding was involved in this research.

author contributions

all four authors analyzed the data independently. nicholas j. l. brown wrote the paper and the other authors provided critical revisions. all authors approved the final version of the manuscript.

open science practices

this article earned the open materials badge for making the materials available. this is a commentary that focused on reproducing the findings of a published article, and as such there are no (new) data. it was not pre-registered. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, are published in the online supplement.

references

brown, n. (2018, march 13). announcing a crowdsourced reanalysis project [weblog post]. retrieved august 28, 2021 from https://steamtraen.blogspot.com/2018/03/announcing-crowdsourced-reanalysis.html

donoho, d. l. (2010). an invitation to reproducible computational research. biostatistics, 11(3), 385–388. https://doi.org/10.1093/biostatistics/kxq028

lakens, d. (2014, december 19). observed power, and what to do if your editor asks for post-hoc power analyses [weblog post].
retrieved august 28, 2021 from https://daniellakens.blogspot.com/2014/12/observed-power-and-what-to-do-if-your.html

lindsay, d. s. (2017). sharing data and materials in psychological science. psychological science, 28(6), 699–702. https://doi.org/10.1177/0956797617704015

lumley, t. (2019). package 'survey', v. 3.35-1. https://cran.r-project.org/web/packages/survey/survey.pdf

luppino, f. s., de wit, l. m., bouvy, p. f., stijnen, t., cuijpers, p., penninx, b. w. j. h., & zitman, f. g. (2010). overweight, obesity, and depression: a systematic review and meta-analysis of longitudinal studies. archives of general psychiatry, 67(3), 220–229. https://doi.org/10.1001/archgenpsychiatry.2010.2

nhanes-iii. (1996a). third national health and nutrition examination survey (nhanes iii), 1988–94: nhanes iii household adult data file documentation. http://www.nber.org/nhanes/ftp.cdc.gov/pub/health_statistics/nchs/nhanes/nhanes3/1a/adult-acc.pdf

nhanes-iii. (1996b). third national health and nutrition examination survey (nhanes iii), 1988–94: nhanes iii examination data file documentation. http://www.nber.org/nhanes/ftp.cdc.gov/pub/health_statistics/nchs/nhanes/nhanes3/1a/exam-acc.pdf

nhanes-iii. (1996c). third national health and nutrition examination survey (nhanes iii), 1988–94: analytic and reporting guidelines. https://wwwn.cdc.gov/nchs/data/nhanes/analyticguidelines/88-94-analytic-reporting-guidelines.pdf

onyike, c. u., crum, r. m., lee, h. b., lyketsos, c. g., & eaton, w. w. (2003). is obesity associated with major depression? results from the third national health and nutrition examination survey. american journal of epidemiology, 158(12), 1139–1147. https://doi.org/10.1093/aje/kwg275

r core team. (2018). r: a language and environment for statistical computing. r foundation for statistical computing, vienna, austria. https://www.r-project.org/

pereira-miranda, e., costa, p. r. f., queiroz, v. a. o., pereira-santos, m., & santana, m. l. p. (2017).
overweight and obesity associated with higher depression prevalence in adults: a systematic review and meta-analysis. journal of the american college of nutrition, 36(3), 223–233. https://doi.org/10.1080/07315724.2016.1261053

simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

meta-psychology, 2019, vol 3, mp.2018.898, https://doi.org/10.15626/mp.2018.898 article type: original article published under the cc-by4.0 license open data: not relevant open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: rickard carlsson reviewed by: oscar olvera, paul lodder analysis reproduced by: jack davis all supplementary files can be accessed at the osf project page: https://osf.io/em62j/
differences of type i error rates for anova and multilevel linear models using sas and spss for repeated measures designs

nicolas haverkamp (university of bonn) and andré beauducel (university of bonn)

to derive recommendations on how to analyze longitudinal data, we examined type i error rates of multilevel linear models (mlm) and repeated measures analysis of variance (ranova) using sas and spss. we performed a simulation with the following specifications: to explore the effects of high numbers of measurement occasions and small sample sizes on type i error, measurement occasions of m = 9 and 12 were investigated as well as sample sizes of n = 15, 20, 25 and 30. effects of non-sphericity in the population on type i error were also inspected: 5,000 random samples were drawn from two populations containing neither a within-subject nor a between-group effect. they were analyzed including the most common options to correct ranova and mlm results: the huynh-feldt correction for ranova (ranova-hf) and the kenward-roger correction for mlm (mlm-kr), which could help to correct the progressive bias of mlm with an unstructured covariance matrix (mlm-un). moreover, uncorrected ranova and mlm assuming a compound symmetry covariance structure (mlm-cs) were also taken into account. the results showed a progressive bias of mlm-un for small samples, which was stronger in spss than in sas. moreover, an appropriate bias correction of type i error via ranova-hf and an insufficient correction by mlm-un-kr for n < 30 were found. these findings suggest using mlm-cs or ranova if sphericity holds, and correcting a violation via ranova-hf. if an analysis requires mlm, spss yields more accurate type i error rates for mlm-cs and sas yields more accurate type i error rates for mlm-un.
keywords: multilevel linear models, software differences, repeated measures anova, simulation study, kenward-roger correction, type i error rate

correspondence address: dr. nicolas haverkamp, institute of psychology, university of bonn, kaiser-karl-ring 9, 53111 bonn, germany. nicolas.haverkamp@uni-bonn.de

in times of a replication crisis that is yet to be overcome, we feel a need to improve methodological standards in order to regain credibility of scientific knowledge. it is therefore important to generate clearly formulated "best practice" recommendations when there are multiple competing methodological approaches for the same issue in question. progress in psychology requires researchers to understand and investigate their methodological tools in order to know their strengths and weaknesses as well as the circumstances under which they should or should not be used. in this study, we will therefore focus on two popular methods to analyze dependent means as they occur, for example, in longitudinal data. it is important to examine whether a mean change over time is of statistical relevance or not. in recent longitudinal research, a trend to use multilevel linear
models (mlm) instead of repeated measures analysis of variance (ranova) can be identified (arnau, balluerka, bono, & gorostiaga, 2010; arnau, bono, & vallejo, 2009; goedert, boston, & barrett, 2013; gueorguieva & krystal, 2004); appeals to researchers in favor of mlm over ranova have even been made (boisgontier & cheval, 2016). despite the high popularity of mlm, the terminology is not free of ambiguity. we follow a definition of tabachnick and fidell (2013) in using the term "mlm" to denote models with the following characteristics: regression models based upon at least two data levels, where the levels are typically given by measurement occasions nested within individuals; models containing covariance pattern models; and fixed as well as random effects. although tabachnick and fidell (2013, p. 788) indicate that "mlm" is used for "a highly complex set of techniques", they mention the presence of at least two data levels first, giving the impression that this is the most important aspect of these techniques. as we noticed massive differences in type i error rates for different approaches before (haverkamp & beauducel, 2017), we will furthermore focus on the type i error corrections that are offered by the respective method. moreover, the large type i errors that we have noticed before could trigger publications of results that cannot be replicated or reproduced. this is why we consider the focus on type i error rates to be of special importance for the current debate on the reproducibility of results. if the features of mlm and ranova are compared, three main advantages of mlm become apparent: first, mlms permit the modeling of data that are structured in levels. if there are reasons to suppose two or more nested data levels, mlm is applicable. in the case of one level of measurement occasions plus one level of individuals, ranova is also adequate. however, if the structure is any more complex, comprising several levels, ranova will always be less appropriate than mlm (baayen, davidson, & bates, 2008). second, several randomly distributed missing values can emerge in repeated measures designs containing a large number of measurement occasions. even then, mlm is robust, because there is no requirement for complete data over occasions, as individual parameters (e.g., slope parameters) are estimated. a third comparative advantage over ranova is the potential to draw comparisons between mlms with differing assumptions about the covariance structure inherent in the data (baek & ferron, 2013). for example, mlms with compound symmetry (cs), with uncorrelated structure, or with auto-regressive covariance structure are feasible.
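the covariance structures just mentioned, and the sphericity property that separates them, can be sketched in a few lines of python (the helper names below are hypothetical; sphericity here means that the variance of the difference between any two measurement occasions is constant):

```python
def cs_cov(m, var, rho):
    # compound symmetry: equal variances, one common covariance
    return [[var if i == j else rho * var for j in range(m)]
            for i in range(m)]

def ar1_cov(m, var, rho):
    # first-order autoregressive: covariance decays with the lag |i - j|
    return [[var * rho ** abs(i - j) for j in range(m)]
            for i in range(m)]

def pairwise_diff_variances(sigma):
    # var(x_i - x_j) = s_ii + s_jj - 2*s_ij; sphericity holds when
    # this value is the same for every pair of measurement occasions
    m = len(sigma)
    return [sigma[i][i] + sigma[j][j] - 2 * sigma[i][j]
            for i in range(m) for j in range(i + 1, m)]

cs = pairwise_diff_variances(cs_cov(4, 1.0, 0.5))
ar = pairwise_diff_variances(ar1_cov(4, 1.0, 0.5))
print(cs)  # all equal: sphericity holds under compound symmetry
print(ar)  # unequal: an ar(1) structure violates sphericity
```

this is exactly the distinction that makes corrections such as huynh-feldt necessary once the population covariance structure departs from compound symmetry.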
if no particular preconceptions or assumptions on the covariances can be formulated a priori, mlm with an unstructured covariance matrix (un) is the most common choice (tabachnick & fidell, 2013). to the best of our knowledge, a comparison of all advantages and disadvantages of mlm and ranova is not available. however, the reader may find a discussion of several advantages of mlm over ranova in finch, bolin and kelley (2014). in longitudinal research, small sample sizes occur frequently (mcneish, 2016). it is therefore of special interest how the issues related to sample size (e.g., incorrect type i error rates) can adequately be addressed. in recent literature, mlm is recommended as more appropriate than ranova for small sample sizes if some precautions are taken: mcneish and stapleton (2016b), among others, report that restricted maximum likelihood (reml) estimation improves the small-sample properties of mlm for sample sizes below 25 and even into the single digits. they give a clear recommendation against maximum likelihood (ml) estimation if sample sizes are small, because variance components are underestimated and type i error rates are inflated (mcneish, 2016, 2017). however, as reml is seen as not completely solving these issues, the kenward-roger correction (kenward & roger, 1997, 2009) is suggested as best practice to maintain nominal type i error rates (mcneish & stapleton, 2016a). this correction is not yet available in spss but was recently included in sas (mcneish, 2017). we therefore decided to follow these recommendations by using mlm with reml and considering the kenward-roger correction in our analyses of small sample properties for the different methods. another issue is the robustness of mlm and anova results across different statistical software packages. so far, this has not been systematically examined.
for simple tests, like t-tests or simple anova models, no substantial differences between software packages are to be expected. however, for more complex statistical techniques like mlm, different explicit or implicit default settings (e.g., number of iterations, correction methods) may occur. this may also be related to the different purposes and abilities of the different software packages (tabachnick & fidell, 2013). as the number of options can be large and differences between the algorithms may sometimes not be made entirely transparent in the software descriptions (see results section), we consider this a critical topic. for very simple repeated measures designs without any complex interaction or covariate, it should nevertheless be expected that different software packages provide the same results. to our knowledge, however, this has not been investigated until now, so we would like to shed some light on this topic by means of our study. to compare the results of different mlm designs in this study, it is necessary for the respective software package to allow certain specifications of the model(s). tabachnick and fidell (2013) provide an overview of the abilities of the most popular packages: spss, sas, hlm, mlwin (r) and systat. for this simulation study, a few features will be necessary: first, it must be possible to specify the structure of the variance-covariance matrix as unstructured or with compound symmetry. second, probability values as well as degrees of freedom for effects have to be included in the output to allow for corrections if the sphericity assumption is violated. following the specifications of tabachnick and fidell (2013), we decided to compare sas and spss, as only these two software packages provide all of the required features mentioned above.
in accordance with this, a literature search shows sas and spss to be among the most popular software packages. table 1 shows the number of google scholar hits for a reference search for a slightly broader set of keywords ("spss", "sas", "stata", "r project", "r core team", "multilevel linear model", and "hierarchical linear model", as well as the relevant packages to perform mlm in r; see the note of table 1 for more details). we acknowledge that the validity of reference searches depends on the search terms and that some additional terms might also be considered relevant in the present context. for example, "mixed models", "random-effects models", and "nested data models" might also be interesting terms for a reference search. however, we did not use "mixed models" and "random-effects models" here because conventional repeated measures anova can also be described with these terms. moreover, we did not use "nested data models" as this term could be used for several different techniques, like non-linear mixed models. thus, our keywords were chosen in order to enhance the probability that the search results are specific to multilevel/hierarchical linear models rather than to anova methods. keeping the limitations of this reference search in mind, the results nevertheless indicate that spss and sas are often used for mlm. even if the relative number of hits might be questioned, the absolute number of hits indicates that several hundred researchers used spss or sas for mlm, so that our comparison might be of interest at least for these researchers. moreover, we performed a literature search for simulation studies on mlm software packages. the results of this search are shown in table 2, indicating for each mlm simulation study the smallest sample size included and the software package used to analyze the data.

table 1. google scholar hits for mlm using spss, sas, stata or r

software package    google scholar hits
spss                2070
sas                 1790
stata                984
r                    512

note.
the search was performed on the 9th of september, 2018. the search strings were: '"spss" -"sas" "stata" -"r core team" -"r project" and "multilevel linear model" or "hierarchical linear model"' (for the spss search); '-"spss" "sas" -stata -"r core team" -"r project" and "multilevel linear model" or "hierarchical linear model"' (for the sas search); '-"spss" -"sas" stata "r core team" -"r project" and "multilevel linear model" or "hierarchical linear model"' (for the stata search); '"spss" -"sas" -stata "r core team" or "r project" or "nlme" or "lme4" or "lmertest" or "lme" or "pbkrtest" and "multilevel linear model" or "hierarchical linear model"' (for the r search).

meta-psychology, 2019, vol 3, mp.2018.898
https://doi.org/10.15626/mp.2018.898
article type: original article
published under the cc-by4.0 license
open data: not relevant
open materials: yes
open and reproducible analysis: yes
open reviews and editorial process: yes
preregistration: no
edited by: rickard carlsson
reviewed by: oscar olvera, paul lodder
analysis reproduced by: jack davis
all supplementary files can be accessed at the osf project page: https://osf.io/em62j/

table 2. simulation studies on mlm with small sample sizes

author(s)                                          year   smallest sample size   software package(s)
arnau et al.                                       2009   30 (5)                 sas
ferron, bell, hess, rendina-gobioff, and hibbard   2009   4                      sas
ferron, farmer, and owens                          2010   4                      sas
goedert et al.                                     2013   6                      stata/ic
gomez, schaalje, and fellingham                    2005   3                      sas
gueorguieva and krystal                            2004   50                     sas
haverkamp and beauducel                            2017   20                     spss
keselman, algina, kowalchuk, and wolfinger         1999   30 (6)                 sas
kowalchuk, keselman, algina, and wolfinger         2004   30 (6)                 sas
maas and hox                                       2005   5                      mlwin (r)
usami                                              2014   10                     r

note. the number in brackets refers to the smallest group size in the simulation study.

simulation efforts that focus on very particular models, options, and data yield fairly idiosyncratic results.
they may, of course, be of relevance for a specific research field if the mlm defined in the simulation study is consistent with the mlm that is usually implemented in this field of research. for example, arnau et al. (2009) investigated different methods for repeated measures mlm in sas. they found the satterthwaite correction (satterthwaite, 1946) to be too liberal in contrast to the kenward-roger correction (kenward & roger, 1997, 2009), which delivered more robust results, but their study concentrated on split-plot designs only. the studies of ferron and colleagues (2009; 2010) likewise investigated type i error rates for mlm in sas, but were restricted to multiple-baseline data. paccagnella (2011), meanwhile, examined binary response 2-level model data in his study on sufficient sample sizes for accurate estimates and standard errors of estimates in mlm. nagelkerke, oberski, and vermunt (2016) delivered a detailed analysis of type i error and power but limited themselves to multilevel latent class analysis. however, we are convinced that these specific simulation studies should be rounded off by simulation studies focusing on rather simple, 'basic' models and data (berkhof & snijders, 2001), which are less contingent upon particular modeling options and data features.
differences of type i error rates for anova and multilevel-linear-models using sas and spss for repeated measures designs 5
although the coverage of simulation approaches will naturally be restricted, using basic models and population data specifications can build a background for the investigation of more specific models.
therefore, this simulation approach focuses solely on the effects of a violation of the sphericity assumption on mean type i error rates in ranova models (without correction and with huynh-feldt correction) and mlm (based on compound symmetry as well as on an unstructured covariance matrix) for a within-subjects effect without any between-group effect. as ranova is not capable of the simultaneous analysis of more than one data level, there is no point in a comparison of ranova and mlm for data of such complex structure. this study is therefore limited to a subset of simulated repeated measures data that allows for an analysis with ranova as well as mlm. haverkamp and beauducel (2017) also used rather basic population models and data, but their study was limited to the spss package, so that they could not include the options provided by sas. the present study extends the study by haverkamp and beauducel (2017) in that sas, the kenward-roger correction, smaller sample sizes, and a larger number of measurement occasions were investigated. the kenward-roger correction (kenward & roger, 1997, 2009), which is available in sas but not in spss, was considered here as it should result in a more appropriate type i error rate for mlm based on an unstructured covariance matrix (arnau et al., 2009; gomez et al., 2005; mcneish & stapleton, 2016a). as kenward and roger (1997, 2009) have shown that their correction works with sample sizes of about 12 cases, small sample sizes will also be investigated in the present simulation study. as violations of the sphericity condition or compound symmetry (cs) have been found to affect the type i error rates in ranova and mlm, this aspect was also investigated here. it should be noted that cs is similar but not identical to the sphericity assumption of ranova. as the cs assumption is more restrictive than the sphericity assumption (field, 1998), mlm with the cs assumption will also satisfy the sphericity assumption.
accordingly, uncorrected ranova and huynh-feldt-corrected (hf) ranova were compared in order to investigate effects of the violation of the sphericity condition. for mlm, models based on compound symmetry (cs) and an unstructured covariance matrix (un) were checked. in consequence, there will be five versions of mlm in the study (mlm-un sas, mlm-un spss, mlm-cs sas, mlm-cs spss, mlm-kr sas), and the present simulation study will allow for a comparison of the type i error rate of mlm with kenward-roger correction with other mlm based on sas and spss for models with and without cs. reml will be used as the estimation method for mlm because it is more suitable for small sample sizes than ml (mcneish & stapleton, 2016b) and because it has been proven to be most accurate for random-effects models, i.e., for models that do not contain any fixed between-group effects (west, welch, & galecki, 2007). to summarize, this simulation study has two major aims: first, the results of uncorrected ranova, ranova-hf, mlm-un, and mlm-cs are compared for sas and spss, as they are available in both packages. if the results show substantial differences between the software packages, this will have immediate consequences for software applications, as the software with the more correct type i error rate should be preferred. second, sas offers the kenward-roger correction, which was developed to correct mlm-un results for a progressive bias in type i error (kenward & roger, 1997, 2009), especially for small sample sizes. therefore, the samples were also analyzed under this condition (mlm-kr) to compare the results to those delivered by the other ranova and mlm specifications. our expectations are as follows: normally, one would expect that statistical methods have a type-i error at the level of the a priori significance level when they are used appropriately.
this implies that uncorrected anova (ranova) and mlm-cs should have 5% type-i errors at an alpha-level of 5% when they are used on data without violation of the sphericity assumption. however, when these methods are used with data violating the sphericity assumption, the percentage of type-i errors should be larger than 5%. we also expect that ranova-hf and mlm-kr result in 5% type-i errors even in data violating the sphericity assumption, whereas mlm-un results in a larger percentage of type-i errors in small samples with and without violation of the sphericity assumption (kenward & roger, 1997, 2009; haverkamp & beauducel, 2017). finally, if everything works fine, even in light of different default settings, no substantial differences between spss and sas should occur for the simple repeated measures data structure that we will investigate when identical methods (i.e., mlm-un, mlm-cs, ranova, and hf) are performed.
haverkamp & beauducel 6
material and methods
we performed the analyses of the simulated data with sas version 9.4 (sas studio 3.7) and ibm spss statistics version 23.0.0.3. we manipulated the violation of the sphericity assumption, the sample size, and the number of measurement occasions. there was no between-subject effect and no within-subject effect in the population data. under the sphericity condition, the sphericity assumption holds in the population. there were t = 1 to m measurement occasions for each individual i, for m = 9 and m = 12. we used the spss mersenne twister random number generator to generate a population of normally distributed, z-standardized, and uncorrelated variables z_ti (e[z_ti] = 0; var[z_ti] = 1). since dependent variables in a repeated measures design are typically correlated, we generated a correlation of .50 between the dependent variables according to the procedure described by knuth (1981).
accordingly, the correlated dependent variables y_ti were generated by means of

y_{ti} = \sqrt{.50}\, c_i + \sqrt{.50}\, z_{ti}, \quad t = 1, \ldots, m \quad (1)

where c_i and z_ti are the scores of individual i on uncorrelated, z-standardized, normally distributed random variables. in equation 1, the common random variable c_i represents the part of the scores that is identical in all y_ti, whereas the random variables z_ti represent the specific scores that are different in each y_ti. the inter-correlation of the y_ti variables may be due to a constant variable across time or it may be due to other aspects inducing statistical dependency between the y_ti variables. this form of data generation for m = 9 can also be described in terms of the factor model, with a pattern of population common factor loadings

p' = [\sqrt{.50} \; \sqrt{.50} \; \sqrt{.50} \; \sqrt{.50} \; \sqrt{.50} \; \sqrt{.50} \; \sqrt{.50} \; \sqrt{.50} \; \sqrt{.50}] \quad (2)

and a pattern of unique factor loadings d = \mathrm{diag}(i - pp')^{1/2}. as in snook and gorsuch (1989, p. 149-150), the population matrix of correlated random variables y can be written as

y = cp' + zd \quad (3)

where vector c contains the common random variable and z is a matrix of m independent random variables (an example population file for the sphericity condition and m = 9 containing the resulting y_t-variables, the common variable c, and the independent random variables z_t has been uploaded in spss-format and in ascii-format; an spss-syntax example of data generation can be found at https://osf.io/4g96f/). the condition with violation of the sphericity condition was based on a population of dependent variables with a population correlation of .50 for the even values of t and a population correlation of .80 for the odd values of t. the correlation of .80 was generated by introducing a second common random variable c_2i that is aggregated only for the variables with odd values of t.
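this data-generating model is simple enough to sketch in code. the following python illustration (my own, not the authors' spss syntax, which is available at the osf link above) draws a sample under both conditions and checks the implied population correlations empirically; function names such as make_sample and pearson are hypothetical:

```python
import random

def pearson(x, y):
    """plain pearson correlation (stdlib only, used for the check below)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def make_sample(n, m=12, violate_sphericity=False, rng=random):
    """draw n cases with m repeated measures following the article's model:
    y_t = sqrt(.50)*c + sqrt(.50)*z_t, so all pairs correlate at .50.
    under the violation condition, odd occasions add a second common
    variable c2: y_t = sqrt(.50)*c + sqrt(.30)*c2 + sqrt(.20)*z_t,
    so odd-odd pairs correlate at .80 while the remaining pairs stay at .50."""
    sample = []
    for _ in range(n):
        c, c2 = rng.gauss(0, 1), rng.gauss(0, 1)
        row = []
        for t in range(1, m + 1):
            z = rng.gauss(0, 1)
            if violate_sphericity and t % 2 == 1:
                row.append(0.50**0.5 * c + 0.30**0.5 * c2 + 0.20**0.5 * z)
            else:
                row.append(0.50**0.5 * c + 0.50**0.5 * z)
        sample.append(row)
    return sample

# large 'population' draw to verify the implied correlation structure
random.seed(42)
pop = make_sample(20000, m=4, violate_sphericity=True)
r_odd = pearson([r[0] for r in pop], [r[2] for r in pop])   # t = 1 vs t = 3
r_even = pearson([r[1] for r in pop], [r[3] for r in pop])  # t = 2 vs t = 4
print(round(r_odd, 2), round(r_even, 2))  # close to .80 and .50
```

note that the coefficients under the violation condition sum to one (.50 + .30 + .20), so the generated variables remain z-standardized in the population.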
for m = 12 this yields

y_{ti} = \sqrt{.50}\, c_i + \sqrt{.30}\, c_{2i} + \sqrt{.20}\, z_{ti}, \quad if t = 2k + 1, for k = 0 to 5
y_{ti} = \sqrt{.50}\, c_i + \sqrt{.50}\, z_{ti}, \quad if t = 2k, for k = 1 to 6 \quad (4)

from each population of generated variables, 5,000 samples were drawn and submitted to repeated measures anova without correction based on sas (ranova sas) and spss (ranova spss), ranova with huynh-feldt correction based on sas (ranova-hf sas) and spss (ranova-hf spss), mlm with compound symmetry based on sas (mlm-cs sas) and spss (mlm-cs spss), and mlm with unstructured covariance matrix based on sas (mlm-un sas) and spss (mlm-un spss). moreover, the samples were submitted to sas-based mlm-un with kenward-roger correction (mlm-kr sas). note that the same sample data were used for the analyses with spss and sas. as sample sizes were n = 15, 20, 25 and 30, the simulation study was based on 144 conditions (= sphericity [2] × analysis methods [9] × n [4] × m [2]) with 5,000 samples per condition. for all statistical analyses the respective type i error rate was calculated for the .05 alpha-level. to identify substantial bias in the results, we followed the criterion of bradley (1978), by which a test is robust if the empirical error rate is within the range 0.025–0.075 for α = .05. a test is considered to be liberal when the empirical type i error rate exceeds the upper limit. if the error rate is below the lower limit, the test is regarded as conservative.
results
the results for nine measurement occasions under the condition of sphericity showed a progressive bias for mlm-un and small sample sizes (fig. 1). type i error inflation was higher for mlm-un performed in spss compared to mlm-un in sas.
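bradley's robustness criterion is easy to state in code. the sketch below (an illustration, not the authors' sas/spss pipeline; the names bradley_class and empirical_type1 are my own) computes an empirical type i error rate from a set of null-simulation p-values and classifies it by the 0.5·α to 1.5·α bounds used above:

```python
import random

def bradley_class(error_rate, alpha=0.05):
    """classify an empirical type i error rate by bradley's (1978)
    liberal criterion: robust if within [0.5*alpha, 1.5*alpha]
    (i.e., 0.025-0.075 for alpha = .05), liberal above, conservative below."""
    lo, hi = 0.5 * alpha, 1.5 * alpha
    if error_rate < lo:
        return "conservative"
    if error_rate > hi:
        return "liberal"
    return "robust"

def empirical_type1(p_values, alpha=0.05):
    """share of nominally significant results among null simulations."""
    return sum(p < alpha for p in p_values) / len(p_values)

# under a true null hypothesis with a well-calibrated test, p-values are
# uniform on [0, 1]; 5,000 draws mimic one simulation condition here.
random.seed(1)
null_ps = [random.random() for _ in range(5000)]
rate = empirical_type1(null_ps)
print(round(rate, 3), bradley_class(rate))
```

with a well-calibrated test the classification should come out "robust"; the liberal results reported below for mlm-un correspond to rates far above the 0.075 bound.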
multilevel linear models with compound symmetry demonstrated a slightly better performance for sas than for spss, as the type i error rates of mlm-cs sas were closer to the 5% level. the kenward-roger correction for mlm-un sas reduced the type i error rate but did not fully solve the issues of small sample sizes, especially for n = 25 or below. the uncorrected ranova showed the expected type i error rates close to five per cent when the sphericity condition holds, regardless of whether it was performed in spss or sas and with or without huynh-feldt correction.
figure 1. average type i error rates for 5,000 tests: nine measurement occasions, sphericity assumption holds.
the results for nine measurement occasions showed higher inflation in type i error rates for mlm-un when sphericity was violated (fig. 2). again, this progressive bias turned out to be stronger for mlm-un in spss than in sas. the kenward-roger correction results did not differ much from the type i error rates of this method for nine measurement occasions under the sphericity condition (cf. fig. 1). the type i error rates for the uncorrected ranova in spss and sas as well as for mlm-cs in spss did not differ substantially and showed a moderately inflated type i error. the huynh-feldt correction provided satisfying type i error rates close to five per cent for both software packages, while mlm-cs showed a striking conservative bias when performed with sas and a large difference to the results for the same method when performed in spss.
figure 2. average type i error rates for 5,000 tests: nine measurement occasions, sphericity violation.
for twelve measurement occasions and no sphericity violation, a large progressive bias for mlm-un and small sample sizes emerged (fig. 3; please note the different scaling of the ordinate). again, this type i error inflation was higher for mlm-un performed in spss compared to mlm-un in sas.
the kenward-roger correction for mlm-un in sas does not solve the problem of type i error inflation for n = 30 or below. the mlm-cs and ranova results showed type i error rates close to five per cent, regardless of whether they were performed in spss or sas or - in the case of ranova - with or without huynh-feldt correction.
figure 3. average type i error rates for 5,000 tests: twelve measurement occasions, sphericity assumption holds.
when sphericity was violated, the results for twelve measurement occasions showed a similarly high inflation in type i error rates for mlm-un as without violation (fig. 4). as under the previous conditions, this progressive bias was stronger for mlm-un in spss than in sas. the kenward-roger correction results resemble the type i error rates of this method under the sphericity condition for 12 measurement occasions. the rates for the uncorrected ranova in spss and sas as well as for mlm-cs in spss appear similar and show an expected, moderately inflated type i error. again, the huynh-feldt correction delivers type i error rates close to five per cent for both software packages, while a conservative effect for mlm-cs was found for sas.
figure 4. average type i error rates for 5,000 tests: twelve measurement occasions, sphericity violation.
concluding the results section, a few findings concerning mlm-un should be pointed out: the type i error inflation for the uncorrected mlm-un is remarkably high when sample sizes are small. this effect is so massive that it cannot be adequately corrected via the use of mlm-kr. on the other hand, the results show a trend, for both software packages, of mlm-un showing less type i error inflation as the sample size increases.
to investigate whether a large sample size would lead to an acceptable average type i error rate, we performed an additional simulation using the same data as in our main study, but only for all facets of mlm-un (uncorrected in sas/spss, kenward-roger in sas), for a sample size of n = 100 with twelve measurement occasions under the condition of sphericity (see table 3).

table 3. average type i error rates (in %) for different sample sizes and versions of mlm-un

mlm-un version   n = 15   n = 30   n = 100
mlm-un sas       56.33    19.56    7.76
mlm-sat sas      56.33    19.56    7.76
mlm-kr sas       36.87    9.86     5.66
mlm-un spss      69.42    24.66    8.80

note. rates are for 5,000 tests: twelve measurement occasions, sphericity assumption holds.

table 3 shows the results of the additional simulation. concerning our expectations, two conclusions can be drawn: first, the trend of uncorrected mlm-un towards lower type i error rates as sample size increases can be confirmed. however, even for a large sample size of n = 100, the average type i error rates of mlm-un still failed to meet bradley's liberal criterion (bradley, 1978) in sas and spss. only if mlm-un results were corrected by means of mlm-kr did they show no liberal bias in type i error. as the differences between spss and sas for mlm-un are considerable, we tried to examine how these disparities can be explained. first, we inspected the underlying linear mixed model algorithms for spss (ibm corporation, 2013) and sas (sas institute inc., n. d.) and found no differences.
second, we noticed indications of differences in the calculation of denominator degrees of freedom between the mixed procedures of spss and sas, including the advice to employ the satterthwaite method to compute denominator degrees of freedom in sas if one "want[s] to compare sas results for mixed linear models to those from spss" (ibm corporation, 2016), because this method is reportedly used by default in spss. to explore whether the heterogeneity between the type i error rates of mlm-un for sas and spss can be explained by this difference, we also included the satterthwaite method to correct mlm-un results (mlm-sat) in our additional simulation in sas, as there is no option to alter the default method in spss (see table 3). we would expect similar average type i error rates for mlm-sat sas and mlm-un spss if the supposed differences in the calculation of denominator degrees of freedom are causal for the diverging simulation results in mlm-un. however, it turned out that it was not possible to reproduce the results of mlm-un spss by employing the satterthwaite method to compute denominator degrees of freedom in sas, because the results of mlm-sat and mlm-un sas were nearly identical. it therefore remains an important question for future research to explain these disparities between sas and spss for supposedly identical methods in rather simple repeated measures data.
discussion
as expected, we found that uncorrected anova (ranova) and mlm-cs had 5% type-i errors at an alpha-level of 5% when they were used on data without violation of the sphericity assumption. the expected increase of type-i error rates was also found for ranova and mlm-cs with data violating the sphericity assumption. although we found the expected type-i error rate of 5% for ranova-hf, we found unexpectedly larger type-i error rates for mlm-kr in data violating the sphericity assumption.
the larger type-i error rates for mlm-un in small samples with and without violation of the sphericity assumption were again confirmed (kenward & roger, 1997, 2009; haverkamp & beauducel, 2017). as kenward and roger (1997) noted, the reason for the bias of mlm-un is probably that the precision is obtained from an estimate of the variance-covariance matrix of its asymptotic distribution. however, in small samples, asymptotic-based measures of precision can overestimate the true precision. the results of our study thus confirm that asymptotic-based measures of precision can lead to biased results of mlm. finally, unexpected differences between mlm-un spss and mlm-un sas as well as between mlm-cs spss and mlm-cs sas occurred for the simple repeated measures data structure investigated. the results of this simulation study bear some implications for the analysis of repeated measures designs in terms of best practice recommendations. note that these suggestions are based on very basic designs, as the simulated data contained no within-subject effect and neither a between-subjects nor a between-group effect. as pointed out before, we took these restrictions to examine type i error rates for within-subject models that are not distorted in any way by the influences of other effects or levels. the following implications for simple within-subject repeated measures designs can be derived from this simulation study:
1. the use of mlm-un to analyze data with nine or more measurement occasions with samples comprising 30 cases or less is generally not recommended without a correction method. this bias is stronger when mlm-un is performed with spss. when mlm-un has to be applied, it is best to use it with the kenward-roger correction (mlm-kr).
if an uncorrected mlm-un has to be the method of choice for some reason, estimation via sas would be more appropriate than estimation via spss, but would still result in a huge inflation of type i error if the sample size is small. although there was more convergence between mlm-un spss and mlm-un sas for a sample of about 100 participants, there was still a slightly smaller type i error for sas. moreover, a small post-hoc simulation revealed that the differences between mlm-un sas and mlm-un spss cannot be accounted for by the satterthwaite method for the correction of degrees of freedom, which is a non-default option in sas and which is always used in spss.
2. according to the criterion of bradley (1978), mlm-un without correction showed a liberal bias under every simulated condition, regardless of the software package. for twelve measurement occasions, the kenward-roger correction in sas does not solve the problem of type i error inflation for n = 30 or smaller. for nine measurement occasions, kenward-roger only delivers a result without a liberal bias if the sample size is above n = 25. the kenward-roger correction does, however, correct for some of the large liberal bias of uncorrected mlm-un. if mlm-un is required for the analysis of repeated measures data that involves a high number of measurement occasions as well as a small sample size of about n = 25 or larger, it is recommended to use it with the kenward-roger correction.
3. for nine measurement occasions, a conservative bias according to bradley (1978) was found for mlm-cs if sphericity was violated. this effect was specific to the sas software package, as the mlm-cs results for spss showed no conservative bias but type i error rates that were on the verge of the liberal criterion. these findings argue for the use of spss if mlm-cs has to be applied in spite of non-sphericity.
4.
in accordance with previous research, the findings of this simulation study in general argue for the use of mlm-cs or ranova if the sphericity assumption holds, as well as a correction of ranova results via the huynh-feldt correction if sphericity is violated. no major differences between the software packages occurred for the results of these methods. the encouraging results on ranova are in line with previous results on anova when the assumption of the normal distribution is violated (schmider, ziegler, danay, beyer, & bühner, 2010).
there are, of course, several limitations to this study:
• the population data contained no within-subject effect, neither a between-subjects nor a between-group effect, and no interactions. accordingly, the model was restricted to a simple within-subjects design.
• not all methods, particularly corrections for mlm such as kenward-roger, were available in both software packages. this is a limitation because we do not know how the kenward-roger correction would work in the context of the spss algorithm.
• sas and spss do not provide a complete description of their algorithms, and they do not provide the software scripts. therefore, the exact reasons for the differences could not be determined. of course, the software packages are protected by law because the people who develop the software scripts need to be paid for their work. however, when considerable differences between software packages occur even for rather simple data, this legal protection might constitute a limitation for the scientific value of the software.
furthermore, this study yields some indications for future research:
• the examination of type i error rates for the discussed methods should be expanded to more complex models including between-subject effects or between-within interaction effects.
• the differences in the results of very basic methods in statistical software packages have to be further explored, especially concerning mlm-un and mlm-cs.
• the reasons for the massive type i error inflation for mlm-un at lower sample sizes have to be analyzed in depth. it may also be interesting to include r in further research in order to have at least one software package where all the scripts are available.
in the course of the ongoing debate about the lack of reproducibility of scientific studies, different recommendations have been developed: benjamin et al. (2017) proposed to set the statistical standards of evidence higher by shifting the threshold for defining statistical significance for new discoveries from p < 0.05 to p < 0.005. lakens et al. (2017), on the other hand, formulate a more general demand for justification of all key choices in research design and statistical practice to increase transparency. we therefore see the results of this study as helpful for researchers' methodological choices when analyzing repeated measures designs: only if the characteristics of different methods under specific conditions (e.g., their robustness against progressive bias when sample sizes are small or sphericity is violated) are known can researchers choose their method on the basis of this knowledge.
open science practices
this article earned the open materials badge for making the materials available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.
references
arnau, j., bono, r., & vallejo, g. (2009). analyzing small samples of repeated measures data with the mixed-model adjusted f test. communications in statistics - simulation and computation, 38(5), 1083–1103. https://doi.org/10.1080/03610910902785746
arnau, j., balluerka, n., bono, r., & gorostiaga, a. (2010). general linear mixed model for analysing longitudinal data in developmental research.
perceptual and motor skills, 110(2), 547–566. https://doi.org/10.2466/pms.110.2.547-566
baayen, r. h., davidson, d. j., & bates, d. m. (2008). mixed-effects modeling with crossed random effects for subjects and items. journal of memory and language, 59(4), 390–412. https://doi.org/10.1016/j.jml.2007.12.005
baek, e. k., & ferron, j. m. (2013). multilevel models for multiple-baseline data: modeling across-participant variation in autocorrelation and residual variance. behavior research methods, 45(1), 65–74. https://doi.org/10.3758/s13428-012-0231-z
benjamin, d., berger, j., johannesson, m., nosek, b., wagenmakers, e.-j., berk, r., . . . johnson, v. (2017). redefine statistical significance.
berkhof, j., & snijders, t. a. b. (2001). variance component testing in multilevel models. journal of educational and behavioral statistics, 26(2), 133–152. https://doi.org/10.3102/10769986026002133
boisgontier, m. p., & cheval, b. (2016). the anova to mixed model transition. neuroscience and biobehavioral reviews, 68, 1004–1005. https://doi.org/10.1016/j.neubiorev.2016.05.034
bradley, j. v. (1978). robustness? british journal of mathematical and statistical psychology, 31(2), 144–152. https://doi.org/10.1111/j.2044-8317.1978.tb00581.x
ferron, j. m., bell, b. a., hess, m. r., rendina-gobioff, g., & hibbard, s. t. (2009). making treatment effect inferences from multiple-baseline data: the utility of multilevel modeling approaches. behavior research methods, 41(2), 372–384. https://doi.org/10.3758/brm.41.2.372
ferron, j. m., farmer, j. l., & owens, c. m. (2010). estimating individual treatment effects from multiple-baseline data: a monte carlo study of multilevel-modeling approaches. behavior research methods, 42(4), 930–943. https://doi.org/10.3758/brm.42.4.930
field, a. (1998). a bluffer's guide to … sphericity. the british psychological society: mathematical, statistical & computing section newsletter, 6(1), 13–22.
finch, w. h., bolin, j. e., & kelley, k. (2014).
multilevel modeling using r. new york: crc press.
goedert, k. m., boston, r. c., & barrett, a. m. (2013). advancing the science of spatial neglect rehabilitation: an improved statistical approach with mixed linear modeling. frontiers in human neuroscience, 7, 211. https://doi.org/10.3389/fnhum.2013.00211
gomez, e. v., schaalje, g. b., & fellingham, g. w. (2005). performance of the kenward-roger method when the covariance structure is selected using aic and bic. communications in statistics - simulation and computation, 34(2), 377–392. https://doi.org/10.1081/sac-200055719
gueorguieva, r., & krystal, j. h. (2004). move over anova: progress in analyzing repeated-measures data and its reflection in papers published in the archives of general psychiatry. archives of general psychiatry, 61(3), 310–317. https://doi.org/10.1001/archpsyc.61.3.310
haverkamp, n., & beauducel, a. (2017). violation of the sphericity assumption and its effect on type-i error rates in repeated measures anova and multi-level linear models (mlm). frontiers in psychology, 8, 1841. https://doi.org/10.3389/fpsyg.2017.01841
ibm corporation. (2013). ibm knowledge center. model (linear mixed models algorithms). retrieved september 7, 2018, from https://www.ibm.com/support/knowledgecenter/en/sslvmb_22.0.0/com.ibm.spss.statistics.algorithms/alg_mixed_model.htm
ibm corporation. (2016, september 07). ibm support. denominator degrees of freedom for fixed effects in spss mixed. retrieved september 7, 2018, from http://www-01.ibm.com/support/docview.wss?uid=swg21477296
kenward, m. g., & roger, j. h. (1997). small sample inference for fixed effects from restricted maximum likelihood. biometrics, 53(3), 983. https://doi.org/10.2307/2533558
kenward, m. g., & roger, j. h. (2009). an improved approximation to the precision of fixed effects from restricted maximum likelihood. computational statistics & data analysis, 53(7), 2583–2595. https://doi.org/10.1016/j.csda.2008.12.013
keselman, h.
j., algina, j., kowalchuk, r. k., & wolfinger, r. d. (1999). a comparison of recent approaches to the analysis of repeated measurements. british journal of mathematical and statistical psychology, 52(1), 63–78. https://doi.org/10.1348/000711099158964
knuth, d. e. (1981). the art of computer programming: vol. 2. seminumerical algorithms (2nd ed.). reading, mass.: addison-wesley.
kowalchuk, r. k., keselman, h. j., algina, j., & wolfinger, r. d. (2004). the analysis of repeated measurements with mixed-model adjusted f tests. educational and psychological measurement, 64(2), 224–242. https://doi.org/10.1177/0013164403260196
lakens, d., adolfi, f., albers, c., anvari, f., apps, m., argamon, s., . . . zwaan, r. (2017). justify your alpha.
maas, c. j. m., & hox, j. j. (2005). sufficient sample sizes for multilevel modeling. methodology, 1(3), 86–92. https://doi.org/10.1027/1614-2241.1.3.86
mcneish, d. (2016). on using bayesian methods to address small sample problems. structural equation modeling: a multidisciplinary journal, 23(5), 750–773. https://doi.org/10.1080/10705511.2016.1186549
mcneish, d. (2017). small sample methods for multilevel modeling: a colloquial elucidation of reml and the kenward-roger correction. multivariate behavioral research, 52(5), 661–670. https://doi.org/10.1080/00273171.2017.1344538
mcneish, d. m., & stapleton, l. m. (2016a). the effect of small sample size on two-level model estimates: a review and illustration. educational psychology review, 28(2), 295–314. https://doi.org/10.1007/s10648-014-9287-x
mcneish, d. m., & stapleton, l. m. (2016b). modeling clustered data with very few clusters. multivariate behavioral research, 51(4), 495–518. https://doi.org/10.1080/00273171.2016.1167008
nagelkerke, e., oberski, d. l., & vermunt, j. k. (2016). power and type i error of local fit statistics in multilevel latent class analysis.
structural equation modeling: a multidisciplinary journal, 24(2), 216–229. https://doi.org/10.1080/10705511.2016.125063 9 paccagnella, o. (2011). sample size and accuracy of estimates in multilevel models. methodology, 7(3), 111–120. https://doi.org/10.1027/16142241/a000029 sas institute inc. (n. d.). sas/stat(r) 14.1 user's guide, second edition. retrieved september 7, 2018, from http://support.sas.com/documendifferences of type i error rates for anova and multilevel-linear-models using sas and spss for repeated measures designs 15 tation/cdl/en/statug/68162/html/default/viewer.htm#statug_mixed_overview02.htm satterthwaite, f. e. (1946). an approximate distribution of estimates of variance components. biometrics bulletin, 2(6), 110. https://doi.org/10.2307/3002019 schmider, e., ziegler, m., danay, e., beyer, l., & bühner, m. (2010). is it really robust? methodology, 6(4), 147–151. https://doi.org/10.1027/1614-2241/a000016 snook, s. c., & gorsuch, r. l. (1989). component analysis versus common factor analysis: a monte carlo study. psychological bulletin, 106(1), 148–154. https://doi.org/10.1037/00332909.106.1.148 tabachnick, b. g., & fidell, l. s. (2013). using multivariate statistics (6. ed.). boston: pearson education. usami, s. (2014). a convenient method and numerical tables for sample size determination in longitudinal-experimental research using multilevel models. behavior research methods, 46(4), 1207–1219. https://doi.org/10.3758/s13428013-0432-0 west, b. t., welch, k. b., & galecki, a. t. (2007). linear mixed models: a practical guide using statistical software. boca raton fla. u.a.: chapman & hall/crc. meta-psychology, 2021, vol 5, mp.2020.2548 https://doi.org/10.15626/mp.2021.2548 article type: commentary published under the cc-by4.0 license open data: yes open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: not applicable edited by: rickard carlsson reviewed by: n. brown & j. 
ferreira analysis reproduced by: andré kalmendal all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/72gac

collinearity isn’t a disease that needs curing
jan vanhove
university of fribourg

abstract
once they have learnt about the effects of collinearity on the output of multiple regression models, researchers may unduly worry about these and resort to (sometimes dubious) modelling techniques to mitigate them. i argue that, to the extent that problems occur in the presence of collinearity, they are not caused by it but rather by common mental shortcuts that researchers take when interpreting statistical models and that can also lead them astray in the absence of collinearity. moreover, i illustrate that common strategies for dealing with collinearity only sidestep the perceived problem by biasing parameter estimates, reformulating the model in such a way that it maps onto different research questions, or both. i conclude that collinearity in itself is not a problem and that researchers should be aware of what their approaches for addressing it actually achieve.

keywords: regression assumptions, multiple regression, interpreting regression models

as researchers and students learn more about statistical models, they sooner or later stumble across the term (multi)collinearity. collinearity, which roughly means that the predictors in a statistical model are correlated with each other, is often cast as a problem for statistical analysis. this suggests that the conscientious analyst has to solve it. i will argue that, to the extent that problems occur in the presence of collinearity, these are not caused by the collinearity itself but rather by a faulty way of thinking about statistical models that can lead analysts astray even in the absence of collinearity.
common strategies for dealing with collinear predictors do not solve these perceived problems but instead sidestep them, often by fitting a model that, perhaps unbeknownst to the analyst, answers a different set of questions from the original one. this article does not present any novel insights, but i hope that it will nonetheless be educational to readers who sometimes find the output of regression models befuddling. i will focus on collinearity between two continuous predictors in (ordinary least squares) multiple regression models. in this case the strength of the collinearity can be gauged from the correlation between the predictors. however, all of my points apply to models with categorical predictors or a mix of categorical and continuous predictors as well. i will not discuss methods for assessing the degree of collinearity between three or more predictors for the simple reason that i find them a distraction: in what follows, i will argue that collinearity is not a statistical problem and should not be checked for (also see o’brien, 2007).

collinearity and its consequences
collinearity means that a substantial amount of information contained in some of the predictors included in a statistical model can be pieced together as a linear combination of some of the other predictors in the model. the easiest case is when you have a multiple linear regression model with two correlated predictors, as in the examples to follow. these predictors can be continuous or categorical, but i will stick to continuous predictors for ease of exposition. i created four datasets with two continuous predictors to illustrate collinearity and its consequences. you can find the r code to reproduce all analyses at https://osf.io/jupd8/.
the outcome in each dataset was created using the following equation; the parameter values were chosen arbitrarily:

outcome_i = 0.4 × predictor1_i + 1.9 × predictor2_i + ε_i,   (1)

where the residuals (ε_i) were drawn from a normal distribution with a standard deviation of 3.5 (see footnote 1). the four datasets are presented in figures 1 through 4. in figure 1, a linear function of predictor1 captures most of the information contained in predictor2, so the two predictors are strongly collinear. in figure 3, by contrast, both predictors are completely unrelated, and a linear function of one predictor cannot capture any information in the other. hence, the two predictors are not collinear at all. some readers may be surprised to see that i consider a situation where two predictors are correlated at r = 0.50 (figure 2) to be a case of weak rather than moderate or strong collinearity. but in fact, the consequences of having two predictors that are correlated at r = 0.50 (rather than at r = 0.00) are negligible. finally, figure 4 highlights the linear part in collinearity: while the two predictors in this figure are related in that predictor2 perfectly determines predictor1, there is no linear relationship between them whatsoever. (you cannot uniquely determine the value for predictor2 when you know the value for predictor1, though.) the dataset in figure 4 is not affected by any of the statistical consequences of collinearity, but it will be useful to illustrate a point i want to make below. to illustrate the statistical consequences of collinearity, i simulated 10,000 samples of 50 observations in which the two predictors were highly correlated (sample correlation of r = 0.98, yielding datasets similar to the one in figure 1) and 10,000 samples of 50 observations in which they were completely orthogonal (sample r = 0.00, yielding datasets similar to the one in figure 3).
in all cases, both predictors were independently related to the outcome according to equation 1. on each simulated sample, i ran a multiple regression model from which i extracted the estimated model coefficients. figure 5 shows the estimated coefficients for the first predictor, whose true parameter value is 0.4. clearly, the estimates vary more when the predictors are strongly correlated than when they are not, such that individual estimates can lie farther from the true parameter value and often have the opposite sign from this true parameter value. however, on average, the estimates equal the true parameter value. in statistics parlance, they are “unbiased.” crucially, and happily, this greater variability is reflected in the standard errors and confidence intervals around these estimates: the standard errors and confidence intervals are automatically wider when the estimated coefficients are affected by collinearity.

figure 1. a dataset with strongly collinear predictors (r = 0.98).

this is illustrated in figure 6: if you fit multiple regressions on the datasets plotted in figures 1 to 4, the confidence intervals are considerably wider when the predictors are strongly collinear than when they are not. moreover, the confidence intervals retain their nominal coverage rates (i.e., x% of the x% confidence intervals contain the true parameter value). so the statistical consequence of collinearity is automatically taken care of in the model’s output and requires no additional computations on the part of the analyst. the greater variability in the estimates and the appropriately larger standard errors and wider confidence intervals all reflect a relative lack of information in the sample (also see morrissey and ruxton, 2018).
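the simulation just described can be sketched compactly. the article’s own analyses are in r (see the osf link above); what follows is an equivalent sketch in python with numpy, using fewer replications (2,000 rather than 10,000) and an approximate rather than exact sample correlation, which is enough to reproduce the pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_estimates(rho, n=50, reps=2000):
    """repeatedly generate data from equation (1) with a given predictor
    correlation and collect the ols estimate for predictor1."""
    estimates = np.empty(reps)
    for i in range(reps):
        predictor1 = rng.normal(size=n)
        predictor2 = rho * predictor1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        outcome = 0.4 * predictor1 + 1.9 * predictor2 + rng.normal(scale=3.5, size=n)
        X = np.column_stack([np.ones(n), predictor1, predictor2])
        estimates[i] = np.linalg.lstsq(X, outcome, rcond=None)[0][1]
    return estimates

strong = simulate_estimates(rho=0.98)  # strongly collinear predictors
ortho = simulate_estimates(rho=0.0)    # unrelated predictors

# both sets of estimates centre on the true value 0.4 (unbiasedness),
# but the collinear ones are far more variable from sample to sample
print(strong.mean(), ortho.mean(), strong.std(), ortho.std())
```

under collinearity the spread of the estimates is several times larger, mirroring figure 5; the wider standard errors the model reports are simply an honest account of this.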
it is difficult to improve on york’s (2012) explanation of the problem and possible solutions: “collinearity is at base a problem about information. if two factors are highly correlated, researchers do not have ready access to much information about conditions of the dependent variables when only one of the factors actually varies and the other does not. if we are faced with this problem, there are really only three fundamental solutions: (1) find or create (e.g. via an experimental design) circumstances where there is reduced collinearity; (2) get more data (i.e. increase the n size), so that there is a greater quantity of information about rare instances where there is some divergence between the collinear variables; or (3) add a variable or variables to the model, with some degree of independence from the other independent variables, that explain(s) more of the variance of y, so that there is more information about that which is being modeled.” (p. 1384)

footnote 1: some researchers examining the consequences of collinearity generate both the predictors and the outcome directly from multivariate normal distributions in which the correlation between the predictors varies but the correlations between the outcome and the individual predictors do not (e.g., wurm and fisicaro, 2014) rather than generating the outcome as a function of the predictors as i did. in doing so, they implicitly allow the true regression equation to vary from simulation to simulation: if you fix the correlations between the predictors and the outcome but want to vary the intercorrelation between the predictors, you have to vary the β parameters (fixed at 0.4 and 1.9 in my example) and the σ parameter (fixed at 3.5 in my example). the result is that such simulations may paradoxically show that you obtain more significant estimates when the predictors are strongly collinear (see wurm and fisicaro, 2014, table 6), but they actually compare different data-generating processes.

figure 2. a dataset with weakly collinear predictors (r = 0.50).
figure 3. a dataset with completely unrelated predictors (r = 0.00).
figure 4. a dataset with orthogonal (r = 0.00) but perfectly related predictors: once you know the value of predictor2, you know the value of predictor1.

is collinearity a problem?
for the most part, i think that collinearity is a problem for statistical analyses in the same way that belgium’s lack of mountains is detrimental to the country’s chances of hosting the winter olympics: it is an unfortunate fact of life, but not something that has to be solved. the three solutions that york (2012) mentions, i.e., running another study, obtaining more data, or reducing the error variance using covariates, are all sensible, but if you have to work with the data that you have, the model output will be unbiased and will appropriately reflect the degree of uncertainty in the estimates. so i do not consider collinearity a problem. what is the case, however, is that collinearity highlights problems with the way many people think about statistical models and inferential statistics. let’s look at a couple of these.

“collinearity decreases statistical power.”
you may have heard that collinearity decreases statistical power, i.e., the chances of obtaining a statistically significant coefficient estimate if the true parameter value is different from zero.
this is true, but the lower statistical power is a direct result of the larger standard errors, which appropriately reflect the greater sampling variability of the estimates. this is only a problem if you interpret “lack of statistical significance” as “zero effect.” but then the problem does not lie with collinearity but with the belief that non-significant estimates indicate zero effects. (schmidt (1996) calls this false belief “the most devastating of all to the research enterprise” (p. 126).) it is just that this false belief is even more likely than usual to lead you astray when your predictors are collinear. if instead of focusing solely on the p-value, you take into account both the estimate and its uncertainty interval, then there is no problem.

figure 5. the parameters for the first predictor in 10,000 samples as estimated by multiple linear regression models. when the predictors are strongly collinear, the estimates vary more from sample to sample, but the estimates are unbiased in either case. the dashed vertical lines show the true parameter value (0.4).
figure 6. estimated coefficients and their 95% confidence intervals for the models fitted to the four datasets. the dashed vertical lines show the true parameter values.

incidentally, i think that some people may be misled when they hear that collinearity “decreases” statistical power or “increases” standard errors, as this wording may be taken to suggest that collinearity is a process that can be halted or reversed.
it is true that compared to situations in which there is less or no collinearity and all other things are equal, the standard errors are larger and statistical power is lower when there is stronger collinearity. but outside of computer simulations, you cannot reduce collinearity while keeping all other things equal. in the real world, collinearity is not an unfolding process that can be nipped in the bud without bringing about other changes in the research design, the sampling procedure, or the statistical model and its interpretation. similarly, you may have heard that collinearity “inflates” standard errors or p-values. this wording, too, is misleading as it suggests that, in the presence of collinearity, standard errors and p-values are larger than they should have been. they are not, as per the discussion in the previous section (see morrissey and ruxton, 2018).

“none of the predictors is significant but the overall model fit is.”
with collinear predictors, you may end up with a statistical model for which the f-test of the overall model fit is highly significant but that does not contain a single significant predictor. this is illustrated in table 1. the overall model fit for the dataset with strong collinearity (see figure 1) is highly significant, but as shown in figure 6, neither predictor has an estimated coefficient that is significantly different from zero: both 95% confidence intervals contain zero. if this seems strange, you need to keep in mind that the tests for the individual coefficient estimates and the test for the overall model fit seek to answer different questions, so there is no contradiction if they yield different answers.

table 1
f-tests and p-values for the overall model fit for the multiple regression models on the four datasets. even though neither predictor has a significant estimated coefficient in the ‘strong collinearity’ dataset (as shown in figure 6), the overall fit is highly significant.

dataset                                   f-test            p-value
strong collinearity                       f(3, 47) = 8.0    0.001
weak collinearity                         f(3, 47) = 6.6    0.003
no collinearity (unrelated predictors)    f(3, 47) = 5.9    0.005
no collinearity (related predictors)      f(3, 47) = 9.8    0.000

to elaborate, the test for the overall model fit asks if all predictors jointly can account for variance in the outcome; the tests for the individual coefficients ask whether these are different from zero. with collinear predictors, it is possible that the answer to the first question is “yes” and the answer to the second is “i have no idea.” the reason for this is that with collinear predictors, either predictor could act as the stand-in of the other so that, as far as the model is concerned, either coefficient could well be zero, as long as the other is not. but due to the lack of information in the collinear sample, it is not clear which, if either, is zero (see mcelreath, 2020, chapter 6, for a lucid explanation). so again, there is no real problem: the tests answer different questions, so they may yield different answers. it is just that when you have collinear predictors, this tends to happen more often than when you do not.

“collinearity means that you can’t take model coefficients at face value.”
it is sometimes said that collinearity makes it more difficult to interpret estimated model coefficients. but the appropriate interpretation of an estimated regression coefficient is always the same, regardless of the degree of collinearity: according to the model, what would the difference in the mean outcome be if you took two large groups of observations that differed by one unit in the focal predictor but whose other predictor values were the same? the emphasised clause (“but whose other predictor values were the same”) is crucial, and note the absence of any appeal to causality in the previous sentence.
the interpretational difficulties that become obvious when there is collinearity are not caused by the collinearity itself but by mental shortcuts that people take when interpreting regression models. for instance, you may obtain a coefficient estimate in a multiple regression model with collinear predictors that you interpret to mean that older children perform more poorly on a foreign-language (l2) writing task than younger children. this would be counterintuitive, and you may find that, in your sample, older children actually outperform younger ones. you could chalk this one up to collinearity, but the problem really is related to a faulty mental shortcut you took when interpreting your model: you forgot to take into account the crucial “but whose other predictor values are the same” clause. if your model also includes measures of the children’s previous exposure to the l2, their motivation to learn the l2, and their l2 vocabulary knowledge, then what the estimated coefficient means is emphatically not that, according to the model, older children perform on average more poorly on a writing task than younger children. what it means is that, according to the model, older children perform more poorly than younger children with the same values on the previous exposure, motivation, and vocabulary knowledge measures. if, on reflection, this is not what you are actually interested in, then you should fit a different model (also see miller and chapman, 2001, for a similar point in the context of analysis of covariance). for instance, if you are interested in the overall difference between younger and older children regardless of their previous exposure, motivation and vocabulary knowledge, do not include these variables as predictors. but then you should have also not included these predictors if the collinearity had not been as strong. 
another interpretational difficulty emerges if you recast the interpretation of the estimate as follows: according to the model, what would the expected difference in mean outcome be if you took an observation and increased its value on the focal predictor by one unit but kept the other predictor values constant? the difference between this interpretation and the one that i offered earlier is that we have moved from a purely descriptive one to both a causal and an interventionist one (viz., the idea that one could change some predictor values while keeping the others constant and that this would have an effect on the outcome). in the face of strong collinearity, it becomes clear that this interventionist interpretation may be wishful thinking: it may be impossible to change values in one predictor without also changing values in the predictors that are collinear with it. but the problem here again is not the collinearity but the mental shortcut in the interpretation. statistical models describe associations; imbuing them with a causal or even interventionist interpretation requires strong additional assumptions (for guidance, see elwert, 2013; rohrer, 2018; shmueli, 2010). in fact, you can run into the same difficulties when you apply the interventionist mental shortcut in the absence of collinearity: in the dataset shown in figure 4, it is impossible to change the second predictor without also changing the first since the first is a transformation of the second. yet the two variables are not collinear, since the transformation is completely nonlinear. or say you want to model quality ratings of texts in terms of the number of words in the text (“tokens”), the number of unique words in the text (“types”), and the type/token ratio.
the model will output estimated coefficients for the three predictors, but as an analyst you should realise that it is impossible to find two texts differing in the number of tokens but having both the same number of types and the same type/token ratio: if you change the number of tokens and keep constant the number of types, the type/token ratio changes, too. a final mental shortcut that is laid bare in the presence of collinearity is conflating a measured variable with the theoretical construct that this variable is assumed to capture. conflating measurements and constructs can completely invalidate the conclusions drawn from a model even in the absence of collinearity (see berthele and vanhove, 2020; brunner and austin, 2009; loftus, 1978; wagenmakers et al., 2012; westfall and yarkoni, 2016). the literature on lexical diversity offers another case in point. the type/token ratio (ttr) discussed in the previous paragraph is one of several possible measures of a text’s lexical diversity. if you take a collection of otherwise comparable texts, chances are that the longer texts tend to have lower ttr values (see malvern et al., 2004, chapter 2). this text-size dependence has led quantitative linguists to abandon the use of the ttr, even though the relationship in any given dataset need not be that strong (see figure 7 for an example). however, the reason why researchers have abandoned the use of the ttr is not collinearity per se. rather, it is that the ttr is a poor measure of what it is supposed to capture, viz., the lexical diversity displayed in a text. specifically, because of the statistical properties of language, the ttr is pretty much bound to conflate a text’s lexical diversity with its length. the negative correlation between the ttr and text length is not a big problem for statistical modelling, but it is a symptom of a more fundamental problem: a measure of lexical diversity should not as a matter of fact be related to text length. 
the fact that the ttr is so related shows that it is a poor measure of lexical diversity.

figure 7. the type/token ratio tends to be negatively correlated with text length (here: log-2 number of tokens). but the problem is not that the type/token ratio is collinear with text length; it is that the type/token ratio also measures something it is not supposed to measure (length) and is a poor measure of what it is supposed to measure (lexical diversity, represented here by human ratings). data from the french corpus published by vanhove et al., 2019.

this problem is hidden if researchers mentally equate the ttr with the construct of lexical diversity rather than remaining cognizant of the fact that it is but an attempt to quantify the construct (and not a successful one at that). to be clear, it is not necessarily a problem that measures of lexical diversity empirically correlate with text length. after all, it is possible that the lexical diversity of longer texts is greater than that of shorter texts or vice versa: texts may be pithy but lexically diverse if the writers often used le mot juste instead of elaborate circumlocutions, and long texts may be lexically more diverse than shorter ones if they were written by more sophisticated writers with more to tell. the problem with the ttr is that it almost necessarily correlates with text length, even if, at the construct level, the texts’ lexical diversity does not. for instance, if you take increasingly longer snippets of texts from the same book, you will find that the ttr goes down (see tweedie and baayen, 1998). this does not mean that the writer’s vocabulary skills went down in the process of writing the book, but that s/he had to reuse common words (e.g., articles, pronouns, prepositions, copula verbs, common or important content words).
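the text-length dependence of the ttr is easy to reproduce. in this sketch (python; the toy sentence is mine, not from the corpus the article uses), ever longer prefixes of the same repetitive text yield ever lower ttr values even though the “writer” never changes:

```python
def type_token_ratio(tokens):
    """number of unique words divided by total number of words."""
    return len(set(tokens)) / len(tokens)

# a deliberately repetitive toy text: common words must be reused
text = "the cat sat on the mat and the dog sat near the cat " * 40
tokens = text.split()

# ttr for increasingly long snippets of the same text
for size in (13, 52, 208, 520):
    print(size, round(type_token_ratio(tokens[:size]), 3))
```

the drop reflects nothing about vocabulary skill, only the statistical necessity of reusing function words as a text grows.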
more generally, if your predictors correlate strongly when they are not supposed to, your problem is not collinearity, but it may be that in trying to capture one construct, you have also captured the one represented by the other predictor. in sum, the interpretational challenges encountered when predictors are collinear are not caused by the collinearity itself but by mental shortcuts that may lead researchers astray even in the absence of collinearity.

collinearity does not require a statistical solution
i have argued that collinearity is not a genuine statistical problem, so i do not think it should be addressed by statistical means. let’s take a closer look at some popular strategies that analysts resort to when their predictors are collinear and the repercussions of these strategies.

residualising predictors
the first popular strategy for dealing with collinearity is to residualise one collinear predictor against the other. this means that one of the predictors is fitted as the dependent variable in a regression model with the other predictor(s) as the independent variable(s). the estimated residuals are extracted from this model and then used as a replacement for the original predictor in the multiple regression model. york (2012) and wurm and fisicaro (2014) comprehensively discuss the consequences of this approach; figure 8 highlights the main points. as seen in the top left and bottom right panels, residualising one of the predictors against the other and using these residuals in lieu of the original predictor does not bias the estimates for the residualised predictor relative to the original true parameter values in equation (1). it also does not reduce the sample-to-sample variability of these estimates. so, as far as the residualised predictor is concerned, there is no downside or upside to this approach.
however, as seen in the top right and bottom left panels, the estimates of the residualiser (i.e., the predictor that was not residualised) show less sample-to-sample variability, but they are substantially biased relative to the original true parameter values in equation (1) when the original predictors are collinear (see footnote 2). the reason for this is that any variance in the outcome that could be accounted for by both predictors is now assigned wholly to the residualiser (see york, 2012). residualising one of the predictors against the other, then, changes the meaning of the estimated coefficient for the residualiser in a way that i suspect is opaque to most analysts and consumers. in fact, i cannot wrap my head around the sentence that i am about to foist upon you: what, according to the model, would be the mean difference if you took a large group of data points that differed by one unit in the residualiser but whose other predictor values differed by the same amount and in the same direction from the values that you would expect this predictor to have based on the linear association between it and the residualiser in the sample? (everything following ‘but’ describes what it means for the estimated residuals to be held constant.) perhaps such estimates can be useful, but hardly more than once in a blue moon.

dropping collinear predictors
a second approach is to drop one or more of the collinear predictors from the model. i have no problem with this approach per se. but the problem that it solves is not collinearity but rather that the original model was misspecified. this approach only represents a solution if the new model is capable of answering the research question since, crucially, estimated coefficients from models with different predictors do not have the same meaning.
for instance, say you are interested in the association between l2 grammatical knowledge and l2 reading proficiency and you fit a model with l2 reading test scores as the outcome, the learners’ scores on an l2 grammar test as the focal predictor and their scores on an l2 vocabulary test as a ‘control variable.’ if you decide to drop the vocabulary test scores from the model because of their correlation with the grammar test scores, you change the meaning of the estimate of the coefficient for the grammar test scores. in the full model, this estimate captures the mean difference in reading proficiency between learners with the same vocabulary score but with a one-unit difference in grammar test scores. in the reduced model, the estimate captures the mean difference in reading proficiency between learners with a one-unit difference in grammar test scores, regardless of their performance on the vocabulary test. either estimate may be useful for addressing the research question, but this depends on the research question, not on the degree of collinearity. if the reduced model makes more sense than the full model in the presence of collinearity, it would have also made more sense in the absence of collinearity. something to be particularly aware of is that by dropping one of the collinear predictors, you bias the estimates of the other predictors relative to their original parameter values as shown in figure 9 (and see note 2). the reason is that, thanks to their correlation with the dropped predictor, the remaining predictors can now do some of its job in accounting for variance in the outcome.

footnote 2: i say “relative to the original true parameter values” since, technically, the estimates in the new model are not biased either, but they estimate something different from the estimates in the original model.
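both manoeuvres discussed so far, residualising one collinear predictor and dropping it altogether, shift the shared variance onto the remaining predictor in exactly the same way, and this is an exact in-sample identity rather than an approximation. the sketch below (python with numpy; the article’s own code is in r, and the simulated data are my illustration) fits the full, residualised, and reduced models to the same data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + 0.2 * rng.normal(size=n)             # strongly collinear
y = 0.4 * x1 + 1.9 * x2 + rng.normal(scale=3.5, size=n)

def ols(y, *predictors):
    """ols coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

_, b1, b2 = ols(y, x1, x2)        # full model
g0, g1 = ols(x2, x1)              # x2 regressed on the residualiser x1
e2 = x2 - (g0 + g1 * x1)          # residualised x2
_, r1, r2 = ols(y, x1, e2)        # model with residualised predictor
_, d1 = ols(y, x1)                # model with x2 dropped

# the residualised predictor keeps its coefficient; the residualiser
# absorbs the shared variance, and dropping x2 does exactly the same
print(b2, r2)            # equal
print(b1 + g1 * b2, r1)  # equal
print(d1, r1)            # equal
```

the coefficient on the residualiser (and in the reduced model) is b1 + g1 × b2: the original slope plus the dropped predictor’s slope scaled by the between-predictor association, which is why it no longer estimates the original parameter.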
[Figure 8 appeared here; only its caption is recoverable.] Figure 8. In the presence of strong collinearity (r = 0.98), residualising one of the predictors against the other does not bias the estimates for the residualised predictor or reduce their sample-to-sample variability (top left and bottom right), but it does bias the estimates of the residualiser (top right and bottom left) relative to the original parameter values (shown as dashed vertical lines).

[Figure 9 appeared here; only its caption is recoverable.] Figure 9. Dropping a collinear predictor changes the meaning of the estimate for the predictor retained (right panel). The reduced model now yields biased estimates of the original parameters (represented by the dashed vertical lines). When the predictors are perfectly orthogonal, this does not happen, but this is a special case.

Averaging predictors

A third strategy for dealing with collinearity is to compress the information in the collinear predictors into a smaller set of less strongly correlated predictors. For instance, analysts sometimes take the average of several (possibly z-standardised) predictors and use this average instead of the original predictors. Alternatively, they might submit these predictors to a principal component or factor analysis and extract one or more components or factors from this analysis to use in lieu of the original predictors. I do not mind this approach per se, either, but analysts should be aware that the meaning of their model estimates is now different from those in the model that they originally fitted.
The estimates now express the model's best guess of the mean difference in the outcome when sampling a large number of data points that differ by one unit in the newly created variable but have otherwise identical predictor values. Depending on the research question, such a model may be more defensible than the model originally fitted. But this depends on the research question, not on the degree of collinearity between the predictors.

[Figure 10 appeared here; only its caption is recoverable.] Figure 10. Ridge regression is a form of biased estimation, so naturally the estimates it yields are biased. When the predictors are orthogonal, all estimates are biased towards zero (left panel). When the predictors are collinear, the estimates for the weaker predictor are biased away from zero (right panel), whereas the estimates for the stronger predictor are biased towards zero (not shown).

Using estimation methods such as ridge regression

With independently and identically distributed errors (i.e., when the independence and homoskedasticity assumptions are met), ordinary least squares regression is guaranteed to yield unbiased estimates with the lowest sample-to-sample variability achievable by any linear unbiased estimator. Ridge regression and its cousins (the lasso, the elastic net) sacrifice unbiasedness in order to obtain estimates with an even lower sample-to-sample variability. This can be particularly useful in models optimised for predicting (as opposed to describing or explaining; see Kuhn and Johnson, 2013; Shmueli, 2010). Since collinearity is associated with more variable estimates, it is understandable that ridge regression and the like are used to tackle it. But the result of using models that deliberately bias the estimates is, quite naturally, that you end up with biased estimates.
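This bias pattern can be sketched directly. The snippet below is not a reproduction of the article's Figure 10 analysis: the λ value and the data-generating numbers (a strong predictor with coefficient 2.0, a weak one with 0.5, r = 0.98) are my own arbitrary choices. On centred data, the ridge estimates have the closed form (XᵀX + λI)⁻¹Xᵀy:

```python
import numpy as np

# Hypothetical setup: strong predictor (beta = 2.0), weak predictor
# (beta = 0.5), strong collinearity (r = 0.98). Illustrative numbers only.
rng = np.random.default_rng(5)
n, lam = 50_000, 500.0
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + np.sqrt(1 - 0.98**2) * rng.normal(size=n)
y = 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

# Centre everything so the intercept can be ignored.
X = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
yc = y - y.mean()

b_ols = np.linalg.solve(X.T @ X, X.T @ yc)                    # near (2.0, 0.5)
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ yc)

# Ridge pulls the two collinear estimates towards each other: the stronger
# coefficient shrinks towards zero while the weaker one is pushed away
# from zero, the pattern described for Figure 10.
print(b_ols, b_ridge)
```

With these numbers the ridge penalty amounts to λ/n = 0.01, and the population-level ridge solution works out to roughly (1.74, 0.74), against true values of (2.0, 0.5).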
I illustrate this in Figure 10, for which I reanalysed the data underlying Figure 5 using ridge regression. (Details of the choice of the λ parameter are available in the supplementary materials, but they are not important here.) For orthogonal predictors, all estimates are biased towards zero. For strongly collinear predictors, the estimates for the weaker predictor will be biased away from zero (shown in the figure), and those for the stronger predictor will be biased towards zero. Biased estimation, then, reduces sampling variability in the estimates, but at the cost of, well, biased estimation. Moreover, the usefulness of standard errors and confidence intervals for ridge regression and its cousins is contested (see Goeman et al., 2018), so a further drawback is debatable statistical inference (see note 3).

In sum, popular strategies to address collinearity involve giving up the unbiased estimates of ordinary least squares regression, redefining the statistical model so that it answers different questions from the original model, or both. As York (2012) writes, "statistical 'solutions,' such as residualization, that are often used to address collinearity problems do not, in fact, address the fundamental issue, a limited quantity of information, but rather serve to obfuscate it. It is perhaps obvious to point out, but nonetheless important in light of the widespread confusion on the matter, that no statistical procedure can actually produce more information than exists in the data" (p. 1384).

Summary

Collinearity is a form of lack of information that is already appropriately reflected in the output of your statistical model. When collinearity is associated with interpretational difficulties, these difficulties are not caused by the collinearity itself. Rather, they reveal that the model was poorly specified (in that it answers a question different from the one of interest), that the analyst has focused overly on significance rather than on the estimates and the uncertainty about them, or that the analyst took a mental shortcut in interpreting the model that could also have led them astray in the absence of collinearity. These shortcuts include failing to interpret parameter estimates conditional on all the other predictors in the model, lending a causal or interventionist interpretation to what is a descriptive model without proper justification, and conflating a measure with the construct that it is supposed to represent. Lastly, if you do decide to deal with collinearity, make sure that you can still answer the question of interest and that any bias in the estimates can be justified.

Note 3: With informative prior distributions on the parameters, Bayesian models can yield fairly narrow posterior distributions (i.e., a fairly low degree of uncertainty) for the estimates even in the presence of collinearity. But this is achieved by incorporating information from outside the sample into the model by means of the prior distributions, not by conjuring information out of thin air.

Author contact

Jan Vanhove, University of Fribourg, Department of Multilingualism, Rue de Rome 1, 1700 Fribourg, Switzerland. ORCID ID: https://orcid.org/0000-0002-4607-4836. Website: https://janhove.github.io. I thank Twitter user @facupalacio12 for the reference to Morrissey and Ruxton (2018), and Johan Ferreira, Nick Brown, and Rickard Carlsson for their comments.

Conflict of interest and funding

The author declares no conflict of interest and received no specific funding for the present work.

Author contributions

JV was the sole author of this article. This article is based on a blog post with the same title (https://janhove.github.io/analysis/2019/09/11/collinearity).

Open science practices

This article earned the Open Data and the Open Materials badge for making the data and materials openly available.
It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Berthele, R., & Vanhove, J. (2020). What would disprove interdependence? Lessons learned from a study on biliteracy in Portuguese heritage language speakers in Switzerland. International Journal of Bilingual Education and Bilingualism, 23(5), 550–566. https://doi.org/10.1080/13670050.2017.1385590

Brunner, J., & Austin, P. C. (2009). Inflation of Type I error rate in multiple regression when independent variables are measured with error. Canadian Journal of Statistics, 37(1), 33–46. https://doi.org/10.1002/cjs.10004

Elwert, F. (2013). Graphical causal models. In S. L. Morgan (Ed.), Handbook of causal analysis for social research. Dordrecht, The Netherlands: Springer. https://doi.org/10.1007/978-94-007-6094-3_13

Goeman, J., Meijer, R., & Chaturvedi, N. (2018). L1 and L2 penalized regression models. https://cran.r-project.org/web/packages/penalized/vignettes/penalized.pdf

Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer. https://doi.org/10.1007/978-1-4614-6849-3

Loftus, G. R. (1978). On interpretations of interactions. Memory & Cognition, 6(3), 312–319.

Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment. Basingstoke, UK: Palgrave Macmillan. https://doi.org/10.1007/978-0-230-51180-4

McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan (2nd ed.). Boca Raton, FL: CRC Press.

Miller, G. A., & Chapman, J. P. (2001). Misunderstanding analysis of variance. Journal of Abnormal Psychology, 110(1), 40–48. https://doi.org/10.1037/0021-843x.110.1.40

Morrissey, M. B., & Ruxton, G. D. (2018). Multiple regression is not multiple regressions: The meaning of multiple regression and the non-problem of collinearity. Philosophy, Theory, and Practice in Biology, 10(3). https://doi.org/10.3998/ptpbio.16039257.0010.003

O'Brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41, 673–690. https://doi.org/10.1007/s11135-006-9018-6

Rohrer, J. M. (2018). Thinking clearly about correlations and causation: Graphical causal models for observational data. Advances in Methods and Practices in Psychological Science, 1(1), 27–42. https://doi.org/10.1177/2515245917745629

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115–129. https://doi.org/10.1037/1082-989x.1.2.115

Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-sts330

Tweedie, F. J., & Baayen, R. H. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352. https://doi.org/10.1023/a:1001749303137

Vanhove, J., Bonvin, A., Lambelet, A., & Berthele, R. (2019). Predicting perceptions of the lexical richness of short French, German, and Portuguese texts using text-based indices. Journal of Writing Research, 10(3), 499–525. https://doi.org/10.17239/jowr-2019.10.03.04

Wagenmakers, E.-J., Krypotos, A.-M., Criss, A. H., & Iverson, G. (2012). On the interpretation of removable interactions: A survey of the field 33 years after Loftus. Memory & Cognition, 40(2), 145–160. https://doi.org/10.3758/s13421-011-0158-0

Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PLOS ONE, 11(3), e0152719. https://doi.org/10.1371/journal.pone.0152719

Wurm, L. H., & Fisicaro, S. A. (2014). What residualizing predictors in regression analyses does (and what it does not do). Journal of Memory and Language, 72, 37–48. https://doi.org/10.1016/j.jml.2013.12.003

York, R. (2012). Residualization is not the answer: Rethinking how to address multicollinearity. Social Science Research, 41, 1379–1386. https://doi.org/10.1016/j.ssresearch.2012.05.014

Meta-Psychology, 2017 (1), a1001. Article type: Editorial. Published under the CC-BY 4.0 license. Pre-print DOI: N/A. Paper DOI: 10.15626/MP2017.1001. Reviews DOI: N/A. Edited by: Rickard Carlsson. Reviewed by: Not peer-reviewed.

Inaugural editorial of Meta-Psychology

Rickard Carlsson, Henrik Danielsson, Moritz Heene, Åse Innes-Ker, Daniël Lakens, Ulrich Schimmack, Felix D. Schönbrodt, Marcel van Assen, Yana Weinstein

In 1957, Robert K. Merton wondered how historians living in 2050 would look back at how the sociology of science developed, and predicted that they would see a 'spacious area of neglect' (Merton, 1957, p. 635).
Sixty years later, we might safely make a similar prediction about how future historians will look back at the psychology of science. Science is a social enterprise, and psychologists are ideally suited to study the inter- and intra-individual processes that shape how science is done. One specific area within the psychology of science is the psychology of psychological science, and we refer to this as meta-psychology.

The past several years have seen an increased focus on analyzing the systemic and psychological factors that threaten the validity of research in general, and of psychology in particular. This focus has resulted in some practical changes, such as journals offering preregistration of hypotheses, platforms for sharing datasets, and an increased discussion of how to improve all aspects of the research cycle, from planning to peer review. This is still work in progress. There are questions about how to increase transparency, how to critique published findings, and how we aggregate results to build a cumulative science. The issues identified can be analyzed from a psychological perspective. For example, we prefer results that confirm our biases, believe prestigious individuals, and remember things selectively. Any recommendations for change should be grounded in an understanding of human psychology, in order to work with it rather than chafe against it.

We believe that it is time to create a specialized journal that publishes articles in the emerging field of meta-psychological research; a journal that questions the basic assumptions of research paradigms and monitors the progress of psychological science as a whole. The new journal Meta-Psychology aims to provide a platform for academic work on the psychology of psychological science, as well as an outlet for new types of contributions, such as high-quality post-publication peer reviews, articles that empty the file drawers of researchers, and registered reports.

Affiliations: Rickard Carlsson, Linnaeus University, Sweden; Henrik Danielsson, Linköping University, Sweden, and the Swedish Institute for Disability Research, Sweden; Moritz Heene, Ludwig-Maximilians-University Munich, Germany; Åse Innes-Ker, Lund University, Sweden; Daniël Lakens, Eindhoven University of Technology, the Netherlands; Ulrich Schimmack, University of Toronto, Canada; Felix D. Schönbrodt, Ludwig-Maximilians-University Munich, Germany; Marcel van Assen, Tilburg University and Utrecht University, the Netherlands; Yana Weinstein, University of Massachusetts, Lowell, USA.

Psychology needs a journal dedicated to meta-psychology

Most scientific journals focus on publishing original research articles or review articles (including meta-analyses) of studies on a particular topic. So far, there has been no outlet dedicated to meta-psychological articles. A large number of high-quality meta-psychological blogs have appeared in recent years, and these blogs are indicative of the need of scholars to communicate ideas, (re-)analyses, and data that do not fit existing scientific outlets. We believe it is time to provide researchers access to a specialized journal that publishes articles in the emerging field of meta-psychology, in order to incorporate this important body of work into the scientific archive and facilitate open peer review of these contributions. The journal Meta-Psychology (MP) provides researchers interested in these issues an opportunity to publish their work in a peer-reviewed scientific journal. Meta-Psychology shares some goals with two journals from the Association for Psychological Science: Perspectives on Psychological Science (PPS) and Advances in Methods and Practices in Psychological Science (AMPPS).
In his first editorial (Diener, 2006), Ed Diener described PPS as a journal that publishes an "eclectic mix of provocative reports and articles, including broad integrative reviews, overviews of research programs, meta-analyses, theoretical statements, and articles on topics such as the philosophy of science, opinion pieces about major issues in the field, autobiographical reflections of senior members of the field, and even occasional humorous essays and sketches". AMPPS's website states that it "is the home for innovative developments in research methods, practices, and conduct across the full range of areas and topics within psychological science". We believe that despite the similarities with PPS and AMPPS, Meta-Psychology provides a unique and important outlet for meta-psychological research. Indeed, Meta-Psychology differs from PPS and AMPPS in several important ways.

First, PPS and AMPPS are both traditional journals that require readers to pay in order to read the articles. In contrast, Meta-Psychology is an open access journal that is free both for readers (gold open access) and for authors (no article processing fees). Open access is a defining attribute of any journal subscribing to the ideals of openness and transparency. Expensive subscriptions exclude readers who are not in the fortunate position of being employed at a wealthy university. Although open access journals solve this problem, they often have expensive article processing charges that exclude authors who cannot acquire the necessary funding. We firmly believe that in order to ensure diversity in meta-psychological research, the only way forward is completely free open access.
We are proud to have made this possible by relying on open source software (Open Journal Systems, https://pkp.sfu.ca/ojs/; the apa6 class for LaTeX by Brian Beitzel, 2012), a free infrastructure provided by Linnaeus University, and volunteering editors who receive no monetary compensation for their work.

Meta-Psychology will not only be an outlet for examining and advancing psychological science, but will also function as an avant-garde experiment that pushes the boundaries of scientific publishing. Already at the start, Meta-Psychology will be quite different from traditional scientific journals. Meta-Psychology has no physical or economic limitations on the number of articles that can be published. The sole criterion for publication is that an open peer-review process considers a submission a valuable contribution to the field of meta-psychological research. Manuscripts submitted to Meta-Psychology are first published as pre-prints. The full editorial work is openly available for anyone to inspect, independently of whether an article is ultimately published or rejected. All submissions will be reviewed by two or three expert reviewers selected by the action editor, but anyone can read and comment on pre-prints, and these comments will be incorporated in the formal review process. Commenters who make a substantial contribution will be credited as additional reviewers on a given paper.

In addition to providing an open peer-review process and transparent editorial decisions, we also expect high standards of openness from our authors. All published articles have to be accompanied by data, code, and materials posted on an Open Science Framework project page that is cited in the article. The analysis will be independently reproduced by reviewers and/or editors prior to publication. Furthermore, pre-registration is strongly encouraged.
In terms of the Transparency and Openness Promotion (TOP) guidelines, we aim to reach the highest level (Level 3) or the next highest level (Level 2) on all criteria. For more details, see the submission guidelines on our website. To summarize, Meta-Psychology will have absolutely no barriers with respect to funding and accessibility, a low threshold on perceived novelty and splashy findings, and a high threshold on openness and transparency. Importantly, it is our vision that Meta-Psychology will lead the way forward by being an early adopter of advances in how psychological research is conducted and communicated. Indeed, our intention is that Meta-Psychology will be a living and evolving meta-psychology experiment in itself.

First call for papers

Meta-Psychology publishes original articles on the topic of meta-psychology, and authors should read the About the Journal section on the website to discover the large scope of submissions the journal will consider. However, at launch we would like to emphasize three types of contributions that we are especially proud to give a home in Meta-Psychology.

File-drawer reports

The overall goal of accepting file-drawer reports is to correct the scientific record by encouraging researchers to write up and publish findings that would otherwise remain in the file drawer and bias the literature. These articles provide an opportunity to completely empty your (or your lab's) file drawer on a psychology research topic. The exhaustiveness of the emptying is crucial, meaning that even suspected failures (e.g., a failed manipulation check, suspected problems with data collection) should be included, but of course carefully detailed and explained in order to give future meta-analysts the information they need to decide whether the data are useful. File-drawer reports should ideally not be rejected papers that would do well as original contributions in other journals.
Rather, we literally want you to scrape the bottom of your drawers and submit findings that would otherwise never be published in a scientific journal. Because this is an entirely new format, there will likely be some initial confusion about what the reports should contain (but please see the submission instructions) and what the appropriate topics are. For that reason, we encourage authors to contact Rickard Carlsson to discuss ideas about how to go about emptying your file drawer.

Expand your blog to an article

There is no doubt that blogs have played an important role in the emerging field of meta-psychology. Blogs provide a medium to quickly disseminate information that would otherwise be hard to publish in traditional journals. Sometimes blogs are just random musings, but sometimes they contain highly important contributions. We think it would be a shame if these contributions did not become part of the scientific record. Although you may independently start rewriting your blog into a submission type that fits Meta-Psychology (see the submission guidelines), we encourage you to contact an editorial board member to discuss whether your blog post has the potential to be expanded into an MP article before you start working on it.

Registered reports in Meta-Psychology

We are proud to offer the option to submit proposals for meta-psychological research that will be given in-principle acceptance (IPA) prior to any data collection and/or data analysis. Start by consulting the submission guidelines, and if you have any more questions, you should consult with the registered report section editor, Marcel van Assen.

Outlook

Being on the frontier of the ongoing changes in academic publishing means that we are experimental, and thus modest in that we might not get everything right from the start. Expect Meta-Psychology to evolve and be the first adopter of important meta-science and publication process advances.

References

Beitzel, B. (2012).
Formatting LaTeX documents in APA style (6th edition) using the apa6 class. The PracTeX Journal, 2012(1), 1–12.

Diener, E. (2006). Editorial. Perspectives on Psychological Science, 1, 1–4. https://doi.org/10.1111/j.1745-6916.2006.00001.x

Merton, R. K. (1957). Priorities in scientific discovery: A chapter in the sociology of science. American Sociological Review, 22, 635–659. https://doi.org/10.2307/2089193

Appendix: Information about the journal

Co-editors-in-chief (2017 – 2021)
• Rickard Carlsson (Linnaeus University, Sweden)
• Ulrich Schimmack (University of Toronto, Mississauga, Canada)

Editorial board (2017 – 2021)
• Henrik Danielsson (Linköping University, Sweden)
• Moritz Heene (Ludwig-Maximilians-University Munich, Germany)
• Åse Innes-Ker (Lund University, Sweden)
• Daniël Lakens (Eindhoven University of Technology, the Netherlands)
• Felix Schönbrodt (Ludwig-Maximilians-University Munich, Germany)
• Marcel van Assen (Tilburg University, and Utrecht University, the Netherlands)
• Yana Weinstein (University of Massachusetts, Lowell, USA)

Advisory board (2017 – 2021)
• Nick Brown (University of Groningen, the Netherlands)
• Paul Bürkner (University of Münster, Germany)
• Gary Burns (Wright State University, USA)
• Cody Christopherson (Southern Oregon University, USA)
• James Coyne (University of Pennsylvania, USA)
• Malte Elson (Ruhr University Bochum, Germany)
• Eiko Fried (University of Amsterdam, the Netherlands)
• Kristoffer Magnusson (Karolinska Institutet, Sweden)
• Stephen Martin (Baylor University, USA)
• David Meyer (University of Michigan, Ann Arbor, USA)
• Julia Rohrer (University of Leipzig, Germany)
• Anne Scheel (Eindhoven University of Technology, the Netherlands)
• Donald Williams (University of California, Davis, USA)
• Matt Williams (Massey University, New
Zealand)
• Tim van der Zee (Leiden University, the Netherlands)

Links
• The journal's website: https://open.lnu.se/index.php/metapsychology
• The journal's OSF project: https://osf.io/3m4z3/

Meta-Psychology, 2020, Vol. 4, MP.2019.1992, https://doi.org/10.15626/MP.2019.1992. Article type: Original article. Published under the CC-BY 4.0 license. Open data: N/A. Open materials: Yes. Open and reproducible analysis: Yes. Open reviews and editorial process: Yes. Preregistration: N/A. Edited by: Erin M. Buchanan. Reviewed by: K. D. Valentine, Donald R. Williams. Analysis reproduced by: Erin M. Buchanan. All supplementary files can be accessed at the OSF project page: https://doi.org/10.17605/osf.io/9b6z3

Multiplicity control vs replication: Making an obvious choice even more obvious

Andrew Hunter, York University; Nataly Beribisky, York University; Linda Farmus, York University; Robert Cribbie, York University

This paper presents a side-by-side consideration of multiplicity control procedures and replication as solutions to the problem of multiplicity. Several independent theoretical arguments are presented which demonstrate that replication serves several important functions and that multiplicity control procedures have a number of serious flaws. Subsequently, the results of a simulation study are provided, showing that under typical conditions, replication provides familywise error control and power similar to those of multiplicity control procedures.
Taken together, these theoretical and statistical arguments lead to the conclusion that researchers who are concerned about the problem of multiplicity should shift their attention away from multiplicity control procedures and towards increased use of replication.

Keywords: multiplicity control, familywise error, power, replication, effect size, meta-analysis

It is easier than ever to collect and analyze vast amounts of data. Plentiful research participants, accessible statistical software, and the popularity of the social sciences have led to a golden age of quantitative research. Much of this research is still being conducted through the lens of null hypothesis significance testing (NHST). In NHST, tests of "statistical significance" compare the probability, under the null hypothesis, of obtaining a test statistic as extreme as (or more extreme than) the one found to a pre-selected nominal Type I error rate. Situations in which findings produced by sampling error are erroneously deemed "significant" are referred to as "Type I errors" or "false positives". As the number of statistical tests being conducted has risen, social science stakeholders have become increasingly concerned with Type I errors (false positives), that is, finding a "significant" effect simply as a result of sampling error. This is because as more and more tests are conducted, the probability of a Type I error occurring increases. Understandably, there have been repeated calls for the adoption of methods (termed "multiplicity control") to reduce the number of false positive results in research (Alibrandi, 2017). At the same time, the value of replication is being touted across many disciplines as a way of ensuring that the results of scientific studies are legitimate (Cumming, 2014; Shrout & Rodgers, 2018). To date, multiplicity control and replication have rarely been discussed within the same context.
This is surprising, since both purport to reduce the likelihood of Type I errors in the results of research studies. Specifically, replications provide more insight over time into the existence (and magnitude) of effects, while multiplicity control procedures control the rate of decision-making about the existence of given effects within a single framework. In this paper, we discuss the tenets and principles of multiplicity control and replication, and then move on to a comparison of the methods, both theoretically and methodologically. We show that one of the many advantages of increased replication is a minimized need for Type I error control via multiplicity control procedures.

Multiplicity control

Multiplicity refers to testing multiple hypotheses with the goal of isolating those that are statistically significant. The problem is that as the number of tests conducted increases, the probability of obtaining a Type I error also increases. To illustrate this principle, let us say we are comparing the speed at which participants walk. Participants are separated into four groups, and each group is primed with a different list of words. If we hypothesize that priming will affect subsequent walking speeds, then we may wish to compare each group to every other group individually (i.e., test all six pairwise comparisons). Though each test carries a specific probability of making a Type I error (α), the overall probability of a Type I error (α′) across all six tests will be higher than α. In this way, we can see how researchers are often put in the agonizing position of having interesting results that likely contain one or more false positives.

Multiplicity control procedures

Researchers have traditionally attempted to control for the increased likelihood of a Type I error when multiple tests are conducted by using multiplicity control procedures (MCPs).
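The inflation in the four-group walking example can be computed directly. Treating the six pairwise tests as independent for the sake of illustration (pairwise comparisons on the same groups are not truly independent, so this is only a ballpark figure):

```python
def familywise_alpha(alpha: float, t: int) -> float:
    """P(at least one Type I error) across t independent tests at level alpha."""
    return 1 - (1 - alpha) ** t

# One test at alpha = .05 risks a 5% false positive; six independent tests
# inflate the overall risk to roughly 26%.
print(round(familywise_alpha(0.05, 6), 4))   # → 0.2649
```

Even a modest family of tests thus pushes the familywise error rate far above the nominal .05 level, which is the problem MCPs are designed to address.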
There are many different MCPs, but all accomplish essentially the same goal: they make the cut-off demarcating statistically significant from statistically non-significant results more conservative as the number of statistical tests conducted increases (Olejnik, Li, Supattathum, & Huberty, 1997; Sakai, 2018). MCPs can be applied to many kinds of tests, such as pairwise mean comparisons, multiple tests of correlation or regression coefficients, multiple parameters in structural equation modeling, tests repeated over multiple outcome variables, multiple voxel-level tests in fMRI, and more. Some of the most popular MCPs provide familywise error control (αFW), which controls the probability of at least one Type I error at α across all comparisons (i.e., α′ = α). The most popular approach for αFW control is the Bonferroni method (Dunn, 1961), which controls for multiplicity by dividing the overall probability of a Type I error (αFW) by the number of tests conducted (T). The resulting per-test alpha level is αT = α / T. Numerous alternatives to the Bonferroni procedure for controlling α′ at α have been proposed, such as the Holm (1979) procedure, a flexible and popular alternative. The Holm procedure makes inferences regarding statistical significance in a stepwise manner; the term stepwise implies that the significance tests take place in a prespecified order and that αT can depend on the specific stage of testing.

Replication

Replication lends validity and generalizability to empirical results, and as such has been heralded as a cornerstone of the so-called "new statistics" (Cumming, 2014). It also happens that some forms of replication address the multiplicity problem by leveraging a simple principle: it is highly unlikely that sampling error would yield the same false positive result across several studies. These are some of the reasons why replications are gaining traction.
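The Bonferroni and Holm adjustments described in the previous section can be sketched as follows (a minimal illustration, not the authors' implementation):

```python
def bonferroni(pvals, alpha=0.05):
    """Single-step Bonferroni: reject H_i when p_i < alpha / T."""
    T = len(pvals)
    return [p < alpha / T for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down Holm: compare the k-th smallest p-value to alpha / (T - k)."""
    T = len(pvals)
    order = sorted(range(T), key=lambda i: pvals[i])
    reject = [False] * T
    for k, i in enumerate(order):
        if pvals[i] < alpha / (T - k):
            reject[i] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are retained
    return reject

# Holm's stepwise thresholds give it more power than single-step Bonferroni:
print(bonferroni([0.01, 0.02, 0.04]))  # [True, False, False]
print(holm([0.01, 0.02, 0.04]))        # [True, True, True]
```

Both procedures control the familywise error rate at α; Holm achieves this with per-stage thresholds α/T, α/(T − 1), …, α, which is why it rejects more hypotheses in the example above.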
Indeed, many academic journals have stated that they are now open to accepting replication studies (e.g., Lucas & Donnellan, 2013; Vazire, 2016). It is our position that replication, an indispensable tool in its own right, naturally and effectively deals with the multiplicity problem. It is important to note that there are many forms of replication (some have even suggested as many as 12 different types; Radder, 1992). Scholars generally distinguish between direct replications and conceptual replications. Direct replications involve repeating the precise methodology of a previously conducted study, whereas conceptual replications involve testing the same hypothesis using different methods (Schmidt, 2009). The purpose of a direct replication is to determine the reliability of an effect, whereas a conceptual replication provides a new test of a theory (Simons, 2014). In the context of the multiplicity problem, direct replications are most relevant, because we are concerned with unreliable effects arising from sampling error (i.e., Type I errors). If we were concerned with the validity of a claim based on study results (that is, if we wanted to further test whether a given result actually supports a theoretical claim), conceptual replications would be our focus. Therefore, in this paper, when we use the word "replication" we are referring to direct replications.

Multiplicity Control Procedures vs Replication

Although MCPs and replication are theoretically very different, they share a common goal of reducing the probability of Type I errors; hence we find a comparison of these strategies informative. Below we outline the reasons why we find replication a more logical and natural way to control for Type I errors than adopting MCPs.
To begin, there are important theoretical issues with the general practice of multiplicity control, highlighted by the fact that there is no principled basis for linking the alpha (α) level (or maximum acceptable Type I error rate) used for a particular test to the number of other tests conducted within a study (Carmer & Walker, 1985; Cribbie, 2017; Rothman, 1990; Saville, 1990, 2003, 2015). Although many have highlighted that α′ can increase drastically when researchers ignore the effects of multiplicity (e.g., Bland & Altman, 1995; Hancock & Klockars, 1996; Holland, Basu, & Sun, 2010; Ryan, 1959; Tyler, Normand, & Horton, 2011), there is still no logical theoretical basis for linking the number of tests conducted to the per-test Type I error rate. For example, conducting all T = 6 pairwise comparisons in one study is statistically no different from conducting six studies each with T = 1 pairwise comparison, so why should there be a penalty for conducting all the tests together? If you believe that these situations are equivalent, and that MCPs should be applied in both cases, why not control for all tests conducted by a researcher over their scholarly career? Or, even further, all statistical tests ever conducted? The ridiculousness of this suggestion speaks to the way that the logic, or lack thereof, underlying MCPs does not scale. In addition, linking the number of tests conducted to the per-test Type I error rate leads to strange recommendations, such as limiting the number of variables studied in order to reduce the potential for Type I errors (Schochet, 2007). Second, replication involves the repetition of a methodology under slightly different conditions (e.g., different cities, lab settings, research assistants, samples). MCPs only address the likelihood of erroneous results within the study at hand.
In contrast, replication reduces error by increasing the scope of an initial study, which directly contributes to the generalizability of the findings (Fisher, 1935; Lindsay & Ehrenberg, 1993). Repeated findings and generalizability, rather than a low chance of error in a single study, have widely been regarded as the hallmark of legitimate results (Carver, 1993; Fisher, 1935; Lykken, 1968; Nuzzo, 2014; Popper, 1934; Steiger, 1990). Methodologists have long stressed the importance of replication for establishing generalizability; for example, Cohen (1994) writes, "for generalization, psychologists must finally rely, as has been done in all the older sciences, on replication" (p. 997). Third, it has been noted that most (if not all) variables investigated in meaningful studies are related, although the magnitude of the association might not be large (Cohen, 1990, 1994; Gelman, Hill, & Yajima, 2012; Rothman, 2014; Tukey, 1991). This claim suggests that one of the core assumptions of NHST, that the null hypothesis corresponds to a complete non-effect or lack of association, does not map well onto reality. Thus, statistical procedures aimed at reducing Type I errors (like MCPs), which are grounded in NHST, are at best over-conservative and at worst unnecessary and irrelevant, since Type I errors of this nature are virtually nonexistent. As Tukey (1991) notes, "[it is] foolish…to ask…'are the effects [of A and B] different?'…A and B are always different—in some decimal place" (p. 100). And Cohen (1990) states that the null hypothesis can only be true in the "bowels of a computer processor running a Monte Carlo study" (p. 1308). To put this differently, MCPs have the single goal of reducing the likelihood of Type I errors, while replication has the broader goal of exploring the reliability and generalizability of research findings (including the direction and magnitude of effects). If Type I errors do not exist, then MCPs are unnecessary.
In contrast, because replication was not specifically designed to address multiplicity (i.e., it is not rooted in NHST, even though some researchers might define replication in terms of NHST results), it remains a valuable pursuit. Fourth, since replication, unlike MCPs, is not a procedure founded within NHST, many extensions are available, such as focusing on the magnitude of effect sizes and meta-analyzing the effects across multiple replication studies. In contrast, MCPs are tools directly embedded within the NHST framework. Accordingly, MCPs are subject to the same dichotomous decision-making as the rest of NHST, which has been strongly criticized (e.g., Gigerenzer, Krauss, & Vitouch, 2004). Lastly, across popular testing situations, multiplicity control is not superior to replication in terms of reducing the likelihood of Type I errors. This novel finding is the primary focus of this paper. If replication is theoretically superior to multiplicity control while providing the statistical benefits of MCPs, then there would appear to be a clear winner.

Methodology

We conducted a Monte Carlo simulation to evaluate whether replication provides a similar level of Type I error control to multiplicity control. We simulated pairwise comparisons within a one-way independent groups framework, comparing the familywise error control and statistical power of the Bonferroni and Holm MCPs to that of replication (for a list of terms and definitions used in this section, see Table 1). Several factors were manipulated in this study, including sample size, number of replications, number of populations, population mean configuration, and the method of error control. Sample sizes per group were set at n = 25 and n = 100 to reflect common sample sizes in psychological research. The number of groups was set at either J = 4 (six pairwise comparisons) or J = 7 (21 pairwise comparisons).
To test both familywise error control and statistical power, we manipulated population means so that three types of configurations were adopted: all population means equal (complete null), some population means equal (partial null), and no population means equal (complete non-null). In the complete null case, we investigated familywise error rates. In the partial null case, we evaluated both familywise error rates and power. In the complete non-null case, we investigated power. Power was recorded in terms of all-pairs power (the proportion of simulations in which all null hypotheses associated with non-null pairwise comparisons were correctly rejected) and average per-pair power (the proportion of truly significant differences correctly identified as statistically significant, averaged across all simulations). The population mean configurations used in the study can be found in Table 2. The within-group standard deviation was fixed at 20, and thus every one-unit increase in the difference between means increased the effect size (Cohen's d) by .05. For example, the population value of Cohen's d for the μ1 = 0 and μ2 = 16 comparison is −.8 [(0 − 16)/20]. In the non-null condition, population means were equally spaced (e.g., 0, 8, 16, 24). Familywise error control and power were evaluated in situations where the study was not replicated, replicated once, or replicated twice. With no replication, the familywise error rate was calculated as the proportion of simulations in which at least one pairwise comparison was falsely declared statistically significant (i.e., there was at least one false positive). With one or two replications, the familywise error rate was calculated as the proportion of simulations in which at least one pairwise comparison was falsely declared statistically significant in the original study and in each replication (i.e., the false positive persisted across replications).
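The "persisted across replications" criterion can be sketched in simplified form. This is not the authors' simulation (which generated group data and ran actual pairwise tests); it assumes independent comparisons and draws null p-values directly as Uniform(0, 1), which holds under H0:

```python
import random

def fwer(t: int, alpha: float = 0.05, replications: int = 0,
         n_sims: int = 20000, seed: int = 1) -> float:
    """Proportion of simulated 'studies' in which at least one of t null
    comparisons is significant in the original study AND in every replication."""
    random.seed(seed)
    hits = 0
    for _ in range(n_sims):
        # a comparison counts as a persistent false positive only if every
        # one of the (replications + 1) independent studies yields p < alpha
        persists = any(
            all(random.random() < alpha for _ in range(replications + 1))
            for _ in range(t)
        )
        hits += persists
    return hits / n_sims

# With t = 6 null comparisons, a single replication pushes the familywise
# error rate from roughly .26 down to well below .05.
```

Under this simplification, a persistent false positive on one comparison requires p < α twice in a row (probability α², i.e., .0025), which is why the replicated rates in the paper's tables collapse so quickly.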
Note that the order in which the errors are evaluated (study first, replication second, or replication first, study second) is irrelevant, since a false effect would need to be present in both to count towards the familywise error rate. With no replication, the per-pair power rate was calculated as the proportion of non-null pairwise comparisons correctly declared statistically significant, averaged across all simulations. The all-pairs power rate was the proportion of simulations in which all non-null pairwise comparisons were correctly declared statistically significant. With replication, the per-pair power rate was calculated as the proportion of non-null pairwise comparisons correctly declared statistically significant in the original study and in each replication, averaged across all simulations. With replication, the all-pairs power rate was the proportion of simulations in which all non-null pairwise comparisons were correctly declared statistically significant in the original study and in each replication. To examine familywise error rates across simulations, we adopted three methods of multiplicity control. The first method evaluated each pairwise comparison at α (i.e., no multiplicity control), the second was the Bonferroni MCP, and the third was the stepwise Holm MCP. In addition to computing the test statistics separately for the original study and each replication, to model the accumulation of research over time, Type I error and power rates were also investigated when the combined (meta-analytic) effect across replications was analyzed. Meta-analysis is a useful tool for combining research that examines the same effect, and here we use it to model how replication effects may be combined.
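One simple way to model the meta-analytic pooling of an original study with its replications is Stouffer's method, shown here as an assumed stand-in (the section above does not specify which combination estimator was used):

```python
from statistics import NormalDist

def stouffer_p(pvals):
    """Combine one-sided p-values from independent studies into a single
    pooled p-value via Stouffer's Z: z = sum(z_i) / sqrt(k)."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvals) / len(pvals) ** 0.5
    return 1.0 - nd.cdf(z)

# Three studies, each individually non-significant at .05, can pool to a
# significant combined result:
print(stouffer_p([0.10, 0.10, 0.10]) < 0.05)  # True
```

This mirrors the pattern reported later in the paper: pooling raises power (the combined analysis behaves like one large study), but it imposes no multiplicity control across the separate comparisons.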
Lastly, beyond traditional NHST-based approaches, rates were also calculated for instances in which the effect size (Cohen's d) meets a minimum meaningful value (ε), in both non-replicated and replicated situations. Since our simulated populations did not violate the assumptions of ANOVA or Cohen's d, we did not need to measure the amount of bias or use a robust measure of effect size. The effect-size equivalent of a Type I error occurs when the observed d value mistakenly exceeds ε (i.e., the population value of d < ε, but the observed value of d > ε), whereas the equivalent of power occurs when the observed d value correctly exceeds ε (i.e., the population value of d > ε, and the observed value of d > ε). When the population value of d > ε, we can calculate the average proportion of correct statements regarding d, or the proportion of simulations in which all statements regarding d are correct (i.e., all pairwise d values are greater than ε when the population d > ε). For this study, the nominal Type I error rate was set at α = .05, ε was set at d = .3, and 5000 simulations were conducted for each condition. It should be noted, though, that the choice of appropriate values for α and ε is affected not only by general recommendations but also by the context of the study; given the lack of context in this simulation, our choices can be considered somewhat arbitrary. Finally, it is worth noting that we simulated perfect, direct replications. As stated earlier, direct replications are useful for (1) testing the reliability of an effect, and (2) establishing the generalizability of an effect. The imperfect nature of replications (i.e., they are conducted in different laboratories, with different kinds of participants, under slightly different conditions) is what makes them useful for establishing generalizability. However, this is not the focus of the present paper.
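The effect-size criterion above can be sketched as follows (hypothetical helper names; ε = .3 and the within-group SD of 20 are taken from the study description):

```python
def cohens_d(mean1: float, mean2: float, pooled_sd: float) -> float:
    """Standardized mean difference."""
    return (mean1 - mean2) / pooled_sd

def meaningful_in_all(observed_ds, epsilon: float = 0.3) -> bool:
    """Effect-size analogue of a persistent finding: |d| meets or exceeds
    epsilon in the original study and in every replication."""
    return all(abs(d) >= epsilon for d in observed_ds)

# Population d for mu1 = 0, mu2 = 16 with pooled SD = 20, as in the text:
print(cohens_d(0, 16, 20))  # -0.8

# A conclusion of "meaningful" requires the observed d to clear epsilon
# in every study, not just the original:
print(meaningful_in_all([-0.80, -0.65, -0.71]))  # True
print(meaningful_in_all([-0.80, -0.20]))         # False
```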
Because this paper is interested purely in the ability of replications to control for multiplicity, we are solely concerned with the way that direct replications establish the reliability of an effect. Thus, while our simulated replications are artificial and unrealistic, they are all that is needed to compare replication and MCPs in the control of multiplicity.

Results

Familywise Error Control

Tables 3 and 4 show that when studies are not replicated, the Bonferroni and Holm MCPs are, unsurprisingly, vastly preferable to no control. For example, Table 4 shows that in a non-replicated study with 100 participants per group and 21 comparisons (μ = 0, 0, 0, 0, 0, 0, 8), the Bonferroni and Holm procedures keep the error rate below the nominal α = .05, whereas with no control the familywise error rate greatly exceeds α = .05 (.374). In the more extreme situation where no true differences exist (μ = 0, 0, 0, 0, 0, 0, 0), the error rate with no control is even worse (.440). When replications are conducted, error rates shift dramatically. Familywise error rates associated with the Bonferroni and Holm MCPs drop to .000, regardless of sample size and number of comparisons. Those associated with no control also drop noticeably. Tables 3 and 4 show that when a single replication is conducted, familywise error rates without multiplicity control are maintained at or below α = .05. When two replications are conducted, empirical familywise error rates are maintained below .01. Although we included results where both multiplicity control and replication are utilized, this paper specifically contrasts the use of multiplicity control in a single (non-replicated) study against a replicated study with no multiplicity control. Thus, the important contrast is between the Bonferroni and Holm results without replication and the no-control condition with replication.
Tables 3 and 4 show that in this comparison, MCPs and no control with replication both keep empirical familywise error rates at or below α = .05. It is important to remind readers that replication was not designed, like MCPs, to maintain familywise error rates at specific levels. In fact, if the number of tests were very large, it might take more than one or two replications to control the familywise error rate at α. The statistical properties of replication are nonetheless attractive: the probability of repeating an error over and over is very slim when the probability of an error in each instance (e.g., study) is small (i.e., α).

Power

A strategy that controls familywise error by squandering statistical power has little utility. Therefore, we also compared the power provided by MCPs and replication. As expected, overall power (i.e., the probability of finding a significant effect in the original study and each replication) decreases as the number of replications increases, since the probability of finding a statistically significant effect across multiple studies is (1 − β)^r, where 1 − β is the power per study, β is the Type II error rate, and r is the number of studies. When a partial null mean structure was used (μ = 0, 0, 0, 8 or μ = 0, 0, 0, 0, 0, 0, 8), no control with one replication provided similar or higher per-pair power rates than multiplicity control without replication. Per-pair power rates for no control with two replications and for MCPs with no replication were highly similar. When a complete non-null mean structure was used (μ = 0, 8, 16, 24 or μ = 0, 8, 16, 24, 32, 40, 48), differences in per-pair power between replication and MCPs became slight (often inconsequential).
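The (1 − β)^r formula above can be checked with a one-line sketch (hypothetical helper; assumes independent studies with equal per-study power):

```python
def overall_power(power_per_study: float, r: int) -> float:
    """(1 - beta)^r: probability that an effect reaches significance in
    every one of r independent studies of equal power."""
    return power_per_study ** r

# Three studies at 80% power each jointly succeed only about half the time:
print(round(overall_power(0.8, 3), 3))  # 0.512
```

This makes the trade-off concrete: each added replication multiplies the joint detection probability by (1 − β), which is the power cost that the tables weigh against the familywise error reduction.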
When all-pairs power was the outcome of interest, results were highly comparable, except when sample sizes were large (e.g., n = 100) and the mean structure contained several true differences (e.g., μ = 0, 8, 16, 24 or μ = 0, 8, 16, 24, 32, 40, 48). In these cases, the Holm procedure often demonstrated superior all-pairs power (see Tables 5 and 6). For example, with population means of 0, 8, 16, 24 and n = 100, the Holm procedure had an all-pairs power rate of .450, whereas the all-pairs power rate with no multiplicity control was .224 with one replication and .104 with two replications. The all-pairs power rate for the Bonferroni method was .089.

Meta-Analysis

As meta-analyses are a natural extension of how replications provide more evidence regarding an effect, we also explored familywise Type I error and power rates when the results of the original study and the replication studies are combined into a single result. As expected, since no multiplicity control is imposed, familywise Type I error rates mirror what would be found in a single study with no multiplicity control. However, since meta-analysis combines the effects of multiple studies, the sample size, and hence the power, rises dramatically. Thus, both the per-pair and all-pairs power rates, especially with larger sample sizes, were much larger than for any of the procedures that require statistical significance in each of the studies conducted. Given recent support for the contention that Type I errors are theoretically implausible in most behavioral science research (e.g., Cribbie, 2017), focusing on power via meta-analytic solutions is very appealing.

Cohen's d

Effect sizes have become increasingly popular and are commonly reported alongside traditional NHST. Effect size measures allow us to move from the dichotomous determination of the presence or absence of an effect in NHST to an evaluation of the magnitudes of observed effects.
Unlike statistical significance, measures of effect size have no inherent statistical cut-offs. Effect size interpretations vary with both context and magnitude (Beribisky, Davidson, & Cribbie, 2019); however, when there is little theoretical precedent for what constitutes a "meaningful" effect size, unofficial rules of thumb are often used to determine a "minimal meaningful value". Here, we have chosen an effect size of Cohen's d = .3, which is conventionally regarded as within the "small" range of Cohen's d. Ideally, the chosen value should correspond to the smallest meaningful difference (rather than simply any difference from zero), since research in many fields (e.g., clinical, health) is often aimed at determining whether observed effects are meaningful or not. Because MCPs are embedded in NHST, which is sample-size dependent, MCPs cannot be discussed in relation to observed effect sizes in studies. However, effect sizes can be evaluated across replication studies, and thus we compare our NHST-based results with the ability of replications to prevent a conclusion that the data provide evidence for a meaningful effect size when in fact the population effect size is null. Table 7 summarizes the ability of replication to prevent the effect-size equivalent of a Type I error. That is, we present the rates at which replication results in a conclusion of a "meaningful" difference between samples when the population effect size is null. Table 8 summarizes the ability of replication to prevent the effect-size equivalent of a Type II error (obtaining a sample effect size that is not meaningful when the population effect size is non-null). Recall that in our simulations we used d = .3 as the minimal meaningful value.
Table 7 shows that when sample sizes are sufficiently large (n = 100), replication effectively reduces the frequency of false conclusions about the meaningfulness of Cohen's d. For example, when a partial-null mean structure was used (0, 0, 0, 8 or 0, 0, 0, 0, 0, 0, 8), the probability that a Cohen's d value was erroneously equal to or greater than d = .3 across all replications was less than 3% with a single replication, and approximately 0% with two. In other words, the values in Table 7 refer to the proportion of comparisons with d greater than .3 when the true population difference was 0 (not to conditions in which the population Cohen's d is greater in magnitude than zero). Table 8 reports two different measures. The first is the "average proportion of correct" (APC) statements regarding the magnitude of Cohen's d: the proportion of Cohen's d values that were accurately at or above d = .3, averaged across all simulations. This shows how likely it is that a truly meaningful effect size will persist across replications. The second is the "proportion of all correct" (PAC) statements regarding the magnitude of Cohen's d: the proportion of simulations in which all truly meaningful effect sizes were sufficiently large (d ≥ .3) to be labelled as such. This shows how likely it is that every truly meaningful effect size in a study will persist across replications. These two measures are analogous to per-pair power and all-pairs power. Like Table 7, Table 8 reports the proportion of correct statements for conditions in which the population effect size is greater than d = .3. Table 8 shows that the APC and PAC "power rates" for Cohen's d are highly comparable to the NHST per-pair and all-pairs power rates obtained previously; namely, rates decrease as the number of tests increases and the sample size decreases. This again highlights the importance of utilizing large sample sizes when possible.
Conclusion

Our simulation study yielded two important conclusions regarding the comparison of multiplicity control and replication on a statistical level. First, both MCPs and replication maintained Type I error rates at acceptable levels. Second, replication and MCPs provide roughly equivalent power. We also extended the comparison by demonstrating that obvious extensions of replication, such as focusing on effect sizes and meta-analyzing the results of replications, provide valuable research strategies. For example, meta-analysis, as expected, generally provides an advantage in power, although at the cost of higher Type I error rates when no MCPs are applied. Further, replication is a valuable strategy for minimizing the possibility that a researcher could incorrectly conclude that a meaningful effect size has been detected. While there were some situations where replication performed better than multiplicity control and vice versa, the overall pattern suggested that MCPs and replication perform very similarly. Furthermore, our simulations show that when parameters most closely resemble those found in typical social science research (i.e., a healthy sample size and a moderate number of comparisons of which some but not all are truly different), replication provides satisfactory familywise error control and equivalent or superior power. Thus, in most situations replication is as good as, or better than, multiplicity control. Given these results, how should the everyday scientist address the multiplicity problem? It is our position that replication is the best answer. Some may believe this position fails to appreciate practical constraints; two prominent constraints are limited time and money, and institutional pressure to produce novel (rather than rigorous) results. We recognize the legitimacy and severity of these concerns.
However, because of the problematic assumptions underlying MCPs (e.g., that null relationships are common) and the subjective nature of many decisions involved in MCPs (e.g., how to define an appropriate "family"), we recommend that they not be used. An analogous situation is the use of advanced data analysis techniques (e.g., structural equation modeling, multilevel modeling) by researchers with limited sample sizes. There may be very good reasons why they cannot access more participants (e.g., lack of access to participants, low funding, time-limited data collection) and equally valid reasons why their analysis technique would otherwise make sense. However, those two truths do not change the fact that their results will be challenging to obtain and interpret with a low sample size. In the same way, the fact that many researchers face barriers to replication does not mean that MCPs are an acceptable answer. In sum, the results of the present simulation study make the choice to conduct replications, and abandon the use of MCPs, even more obvious by demonstrating that, in addition to being theoretically superior, replication provides natural multiplicity control. Replications also indirectly enable other beneficial research practices, such as comparing effect sizes and combining effect sizes (i.e., meta-analysis). We hope this will encourage members of the social science community to take Wilkinson et al.'s shrewd advice to heart and "let replications promote reputations" (Wilkinson, 1990, p. 600).

Open Science Practices

This article earned the Open Materials badge for making materials openly available. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.
Author Contact

Andrew Hunter, hunter07@yorku.ca, ORCID: 0000-0001-7236-0900
Linda Farmus, lifarm@yorku.ca, ORCID: 0000-0002-5303-6408
Nataly Beribisky, natalyb1@yorku.ca, ORCID: 0000-0002-1081-0125
Robert Cribbie, cribbie@yorku.ca, ORCID: 0000-0002-9247-497X

Conflict of Interest and Funding

None of the authors have any conflicts of interest. This research was not funded by any specific source.

Author Contributions

All authors contributed equally to the final paper, including the development of the ideas, the conducting of the simulation study, and the writing of the paper. Robert Cribbie was the senior author and hence is the last author; remaining authorship is alphabetical by first name.

Appendix

Table 1
Terminology and definitions used in the simulation study.

n: Sample size per group (25 or 100); chosen to reflect commonly found sample sizes within psychology.

J: Number of groups. The number of groups directly determines the number of pairwise comparisons: comparisons = J(J − 1)/2.

Complete null condition: All population means are equal. One of three configurations used in the simulation study; permits investigation of familywise error rates.

Partial null condition: Some population means are equal. One of three configurations used in the simulation study; permits investigation of both familywise error rates and power.

Complete non-null condition: No population means are equal. One of three configurations used in the simulation study; permits investigation of power.

All-pairs power (AP): For the true, non-null differences between the groups, all pairwise comparison null hypotheses are correctly rejected. No-replication condition: proportion of simulations in which all the real pairwise differences are statistically significant. One or two replications: proportion of simulations in which all non-null pairwise comparisons are statistically significant in the original study and in each replication.

Average per-pair power (PP): For the true, non-null differences between the groups, the corresponding pairwise comparison null hypotheses are correctly rejected (averaged across all simulations). No-replication condition: proportion of real pairwise differences that are statistically significant, averaged across all simulations. One or two replications: proportion of real pairwise differences that are statistically significant in the original study and each replication, averaged across all simulations.

Familywise error control: Controls the likelihood of at least one Type I error at α across all comparisons. No-replication condition: proportion of simulations in which at least one pairwise comparison is incorrectly deemed significant. One or two replications: proportion of simulations in which the false positive exists in the original study and in each replication.

Bonferroni method: A multiple comparison procedure; α′ = α / (number of comparisons), and each p-value is compared to α′.

Holm method: A multiple comparison procedure; a sequential modified-Bonferroni procedure that provides greater power while still maintaining strict familywise error control (see Cribbie, 2017, for more details).

d: Cohen's d, the standardized mean difference. Type I error: population d is truly less than ε yet observed d incorrectly exceeds ε. Power: population d is truly greater than ε and observed d exceeds ε.

ε: Minimally meaningful value.
table 2 simulation mean structure familywise error control familywise error control/ power power 6 comparisons μ = 0,0,0,0 μ = 0,0,0,8 μ = 0,8,16,24 |dp| = 0 |dp| = 0 or .4a |dp| = .4, .8, 1.2 21 comparisons μ = 0,0,0,0,0,0,0 μ = 0,0,0,0,0,0,8 μ = 0,8,16,24,32,40,48 |dp|= 0 |dp| = 0 or .40 |dp| = .4,.8,1.2,1.6,2.0 or 2.4 note. dp represents the population value for the pairwise cohen’s d; when multiple dp values are provided, e.g., for μ = 0,0,0,8, |dp| can be 0 or .40, this implies that for some pairwise comparisons |dp| = 0, e.g., for μ1 vs μ2, dp = |(μ1 μ2)/sp| = |(0-0)/20| = 0, and for other pairwise comparisons |dp| = .40, e.g., for μ1 vs μ4, dp = |(μ1 – μ4)/sp| = |(0-8)/20| = |-.40| = .40 table 3 familywise error rates for 4 groups (t = 6) n = 25 n = 100 μ = 0,0,0,0 μ = 0,0,0,8 μ = 0,0,0,0 μ = 0,0,0,8 no replication bonferroni .041 .023 .038 .022 holm .041 .026 .038 .032 no control .202 .120 .205 .125 one replication bonferroni .001 .000 .000 .000 holm .001 .000 .000 .000 no control .013 .007 .013 .007 meta-analysis .208 .120 .209 .121 two replications bonferroni .000 .000 .000 .000 holm .000 .000 .000 .000 no control .001 .000 .001 .000 meta-analysis .200 .117 .206 .122 multiplicity control vs replication: making an obvious choice even more obvious 11 table 4 familywise error rates for 7 groups (t = 21) n = 25 n = 100 μ = 0,0,0,0,0,0,0 μ = 0,0,0,0,0,0,8 μ = 0,0,0,0,0,0,0 μ = 0,0,0,0,0,0,8 no replication bonferroni .039 .028 .043 .028 holm .039 .028 .043 .032 no control .442 .380 .440 .374 one replication bonferroni .000 .000 .000 .000 holm .000 .000 .000 .000 no control .046 .036 .050 .031 meta-analysis .440 .363 .438 .362 two replications bonferroni .000 .000 .000 .000 holm .000 .000 .000 .000 no control .002 .002 .004 .002 meta-analysis .447 .371 .440 .364 table 5 average per-pair and all pairs power rates for 4 groups (6 comparisons) n = 25 n = 100 μ = 0,0,0,8 μ = 0,8,16,24 μ = 0,0,0,8 μ = 0,8,16,24 pp ap pp ap pp ap pp ap no replication bonferroni .103 
.016 .380 .000 .566 .307 .782 .089 holm .110 .025 .419 .000 .596 .370 .885 .450 no control .288 .092 .567 .002 .802 .610 .903 .470 one replication bonferroni .010 .000 .239 .000 .319 .096 .658 .009 holm .011 .001 .266 .000 .354 .139 .795 .203 no control .080 .008 .410 .000 .647 .374 .825 .224 meta-analysis .491 .241 .737 .042 .979 .948 .990 .941 two replications bonferroni .001 .000 .180 .000 .180 .032 .590 .001 holm .001 .000 .201 .000 .211 .053 .727 .090 no control .024 .000 .332 .000 .524 .233 .762 .104 meta-analysis .671 .433 .838 .215 .998 .995 .999 .996 note. pp = per pair power rates, ap = all pairs power rates hunter, beribisky, farmus, & cribbie 12 table 6 average per-pair and all pairs power rates for 7 groups (21 comparisons) n = 25 n = 100 μ = 0,0,0,0,0,0,8 μ = 0,8,16,24,32,40,48 μ = 0,0,0,0,0,0,8 μ = 0,8,16,24,32,40,48 pp ap pp ap pp ap pp ap no replication bonferroni .048 .001 .545 .000 .405 .083 .829 .000 holm .050 .002 .594 .000 .420 .100 .913 .150 no control .287 .039 .742 .000 .810 .487 .944 .202 one replication bonferroni .002 .000 .450 .000 .160 .005 .758 .000 holm .002 .000 .494 .000 .172 .008 .851 .021 no control .080 .001 .642 .000 .645 .227 .898 .042 meta-analysis .501 .147 .907 .000 .977 .903 .994 .874 two replications bonferroni .000 .000 .407 .000 .064 .000 .729 .000 holm .000 .000 .448 .000 .071 .001 .809 .002 no control .022 .000 .592 .000 .515 .106 .862 .007 meta-analysis .682 .312 .890 .031 .998 .990 .999 .990 note. 
pp = per pair power rates, ap = all pairs power rates table 7 proportion of incorrect statements regarding the magnitude of cohen’s d for 4 and 7 groups n = 25 n = 100 μ = 0,0,0,0 μ = 0,0,0,8 μ = 0,0,0,0 μ = 0,0,0,8 no replication .722 .543 .156 .090 one replication .372 .221 .007 .003 two replications .132 .072 .000 .000 μ = 0,0,0,0,0,0,0 μ = 0,0,0,0,0,0,8 μ = 0,0,0,0,0,0,0 μ = 0,0,0,0,0,0,8 no replication .938 .904 .347 .293 one replication .714 .612 .025 .016 two replications .358 .275 .002 .000 multiplicity control vs replication: making an obvious choice even more obvious 13 table 8 average proportion of correct (apc) and proportion of all correct (pac) statements regarding magnitude of cohen’s d for 4 and 7 groups n = 25 n = 100 μ = 0,0,0,8 μ = 0,8,16,24 μ = 0,0,0,8 μ = 0,8,16,24 apc pac apc pac apc pac apc pac no replication .651 .404 .810 .173 .758 .543 .880 .368 one replication .413 .155 .681 .032 .581 .301 .788 .137 two replications .268 .063 .595 .005 .443 .168 .720 .051 μ = 0,0,0,0,0,0,8 μ = 0,8,16,24,32,40,48 μ = 0,0,0,0,0,0,8 μ = 0,8,16,24,32,40,48 no replication .647 .268 .890 .016 .765 .422 .931 .117 one replication .422 .070 .816 .000 .576 .170 .878 .016 two replications .270 .015 .766 .000 .431 .065 .839 .002 hunter, beribisky, farmus, & cribbie 14 references alibrandi, a. (2017). closed testing procedure for multiplicity control. an application on oxidative stress parameters in hashimoto's thyroiditis. epidemiology, biostatistics and public health, 14(1), 1-6. doi: 10.2427/1205 beribisky, n., davidson, h., & cribbie, r. a. (2019). exploring perceptions of meaningfulness in visual representations of bivariate relationships. peerj, 7, e6853. doi: 10.7717/peerj.6853 bland, j. m., & altman, d. g. (1995). multiple significance tests: the bonferroni method. british medical journal, 310(6973), 170. doi: 10.1136/bmj.310.6973.170 carmer, s. g., & walker, w. m. (1985). pairwise multiple comparisons of treatment means in agronomic research. 
journal of agronomic education, 14, 19-26. https://www.crops.org/files/publications/ns e/pdfs/jnr014/014-01-0019.pdf carver, r. p. (1993). the case against statistical significance testing, revisited. journal of experimental education, 61, 287–292. doi: 10.1080/00220973.1993.10806591 cohen, j. (1990). things i have learned (so far). american psychologist, 45, 1304-1312. cohen, j. (1994). the earth is round (p < .05). american psychologist 49, 997–1003. cribbie, r. a. (2017). multiplicity control, school uniforms, and other perplexing debates. canadian journal of behavioural science, 49, 159-165. doi: 10.1037/cbs0000075 cumming, g. (2014). the new statistics: why and how. psychological science, 25(1), 7-29. doi: 10.1177/0956797613504966 fisher, r. a. (1935). the design of experiments. oxford, england: oliver & boyd gelman, a., hill, j., & yajima, m. (2012). why we (usually) don't have to worry about multiple comparisons. journal of research on educational effectiveness, 5(2), 189-211. doi: 10.1080/19345747.2011.618213 gigerenzer, g., krauss, s., & vitouch, o. (2004). the null ritual: what you always wanted to know about significance testing but were afraid to ask. in d. kaplan (ed.), sage handbook of quantitative methodology for the social sciences (pp. 391–408). thousand oaks, ca: sage. hancock, g. r., & klockars, a. j. (1996). the quest for α: developments in multiple comparison procedures in the quarter century since. review of educational research, 66(3), 269-306. doi: 10.3102/00346543066003269 holland, b., s. basu, and f. sun. 2010. neglect of multiplicity when testing families of related hypotheses. working paper, temple university. doi: 10.2139/ssrn.1466343 holm, s. (1979). a simple sequentially rejective multiple test procedure. scandinavian journal of statistics, 6(2), 65-70. lindsay, r. m., & ehrenberg, a. s. c. (1993). the design of replicated studies. the american statistician, 47(3), 217-228. doi: 10.1080/00031305.1993.10475983 lucas, r. 
e., & donnellan, m. b. (2013). improving the replicability and reproducibility of research. journal of research in personality, 47, 453-454. doi: 10.1016/j.jrp.2013.05.002 lykken, d. t. (1968). statistical significance in psychological research. psychological bulletin, 70, 151–159. doi: 10.1037/h0026141 nuzzo, r. (2014). scientific method: statistical errors. nature, 506(7487), 150-152. doi: 10.1038/506150a olejnik, s., li, j., supattathum, s., & huberty, c. j. (1997). multiple testing and statistical power with modified bonferroni procedures. journal of educational and behavioral statistics, 22(4), 389-406. doi: 10.3102/10769986022004389 popper, k. r. (1968). the logic of scientific discovery. new york, ny: harper & row. radder, h. (1992). experimental reproducibility and the experimenters’ regress. in d. hull, m. forbes, & k. okruhlik (eds.), proceedings of the 1992 biennal meeting of the philosophy of science association (pp. 63–73). east lansing, mi: philosophy of science association. rothman, k. j. (1990). no adjustments are needed for multiple comparisons. epidemiology, 1(1), 43-46. doi: 10.1097/00001648-19900100000010 rothman, k. j. (2014). six persistent research misconceptions. journal of general internal multiplicity control vs replication: making an obvious choice even more obvious 15 medicine, 29(7), 1060-1064. doi: 10.1007/s11606013-2755-z ryan, t.a. (1959). multiple comparisons in psychological research. psychological bulletin, 56, 26-47. doi: 10.1037/h0042478 sakai, t. (2018). multiple comparison procedures. in laboratory experiments in information retrieval. springer: singapore. doi: 10.1007/978-981-13-1199-4_4 saville, d. j. (1990). multiple comparison procedures: the practical solution. the american statistician, 44(2), 174-180. doi: 10.1080/00031305.1990.10475712 saville, d. j. (2003). basic statistics and the inconsistency of multiple comparison procedures. canadian journal of experimental psychology, 57(3), 167. 
doi: 10.1037/h0087423 saville, d. j. 2015. multiple comparison procedures—cutting the gordian knot. agronomics journal (107), 730-735. doi:10.2134/agronj2012.0394. schochet, p. z. (2007). guidelines for multiple testing in experimental evaluations of educational interventions. princeton, nj: mathematica policy research, inc. schmidt, s. (2009). shall we really do it again? the powerful concept of replication is neglected in the social sciences. review of general psychology, 13(2), 90-100. doi: 10.1037/a0015108 shrout, p. e., & rodgers, j. l. (2018). psychology, science, and knowledge construction: broadening perspectives from the replication crisis. annual review of psychology, 69, 487-510. doi: 10.1146/annurev-psych-122216-011845 simons, d. j. (2014). the value of direct replication. perspectives on psychological science, 9(1), 7680. steiger, j. h. (1990). structural model evaluation and modification: an interval estimation approach. multivariate behavioral research, 25(2), 173-180. doi: 10.1207/s15327906mbr2502_4 tukey, j. w. (1991). the philosophy of multiple comparisons. statistical science, 6(1), 100-116. tyler, k. m., normand, s. l. t., & horton, n. j. (2011). the use and abuse of multiple outcomes in randomized controlled depression trials. contemporary clinical trials, 32(2), 299-304. doi: 10.1016/j.cct.2010.12.007 vazire, s. (2016). editorial. social psychological and personality science, 7(1), 3-7. doi: 10.1177/1948550615603955 wilkinson, l. (1999). statistical methods in psychology journals: guidelines and explanations. american psychologist, 54(8), 594-604. 
doi: 10.1037/0003-066x.54.8.594 mp.2018.895.witt_20191107 meta-psychology, 2019, vol 3, mp.2018.895, https://doi.org/10.5626/mp.2018.895 article type: original article published under the cc-by4.0 license open data: yes open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: felix schönbrodt reviewed by: julia rohrer, gordon feld, steve haroz analysis reproduced by: tobias mühlmeister all supplementary files can be accessed at the osf project page: https://doi.org/10.17605/osf.io/hxk2u graph construction: an empirical investigation on setting the range of the y-axis jessica k. witt colorado state university graphs are an effective and compelling way to present scientific results. with few rigid guidelines, researchers have many degrees-of-freedom regarding graph construction. one such choice is the range of the y-axis. a range set just beyond the data will bias readers to see all effects as big. conversely, a range set to the full range of options will bias readers to see all effects as small. researchers should maximize congruence between visual size of an effect and the actual size of the effect. in the experiments presented here, participants viewed graphs with the y-axis set to the minimum range required for all the data to be visible, the full range from 0 to 100, and a range of approximately 1.5 standard deviations. the results showed that participants’ sensitivity to the effect depicted in the graph was better when the y-axis range was between one to two standard deviations than with either the minimum range or the full range. in addition, bias was also smaller with the standardized axis range than the minimum or full axis ranges. to achieve congruency in scientific fields for which effects are standardized, the y-axis range should be no less than 1 standard deviations, and aim to be at least 1.5 standard deviations. 
keywords: graph design, effect size, sensitivity, bias one way to lie with statistics is to set the range of the y-axis to form a misleading impression of the data. a range set too narrow will exaggerate a small effect and can even make a non-significant trend appear to be a substantial effect (pandey, rall, satterthwaite, nov, & bertini, 2015). yet the default setting of many statistical and graphing software packages automatically sets the range as narrow as the data will allow. the problem of creating misleading graphs persists even when the full range is shown instead. as shown in the studies reported below, a range set too wide also creates a misleading impression of the data by making effects seem smaller than they are. here, i argue that for scientific fields that use standardized effect sizes and adopt cohen’s convention that an effect of d = 0.8 is big, the range of the y-axis should be approximately 1.5 standard deviations (sds). how should the y-axis range of a graph be determined? graph construction should account for the visual experience of the people reading the graphs (cleveland & mcgill, 1985; kosslyn, 1994; tufte, 2001) and the strong link between perception and cognition (barsalou, 1999; glenberg, witt, & metcalfe, 2013). when the visual size of the effect aligns with the actual size of the effect, the person reading the graph does not have to exert mental effort to decode effect size from the graph. instead, the size of the effect is processed automatically. this increases graph fluency by making it easier to understand jessica k. witt, department of psychology, colorado state university. data, scripts, and supplementary materials available at osf.io/hw2ac. address correspondence to jkw, department of psychology, colorado state university, fort collins, co 80523, usa. email: jessica.witt@colostate.edu witt 2 table 1. overview of the five experiments. 
experiment n effect sizes graph type standardized condition1 1 9 0.1, 0.3, 0.5, 0.8 bar graph 2 sds 2 14 0.1, 0.3, 0.5, 0.8 bar graph 1.4 sds 3 13 0, 0.3, 0.5, 0.8 bar graph with error bars 1.2 sds 4 20 0, 0.3, 0.5, 0.8 line graph 1.4 sds 5 15 0, 0.3, 0.5, 0.8 line graph 1 sd notes. 1this refers to the range depicted in the standardized condition, so a range of 1.4 sds is when the graph was centered on the grand mean and extended 0.7 sds in either direction. that an effect is big when it looks big and an effect is small when it looks small. to increase graph fluency, the range of the y-axis should be selected to maximize compatibility between visual size and actual effect size (kosslyn, 1994; pandey et al., 2015; tufte, 2001). however, the current literature fails to provide clear guidelines on how to achieve this compatibility. for example, some recommend displaying only the relevant range so that the axis goes from just below the lowest data point to just above the highest data point (kosslyn, 1994). this would not achieve the recommended compatibility because small effects would look big. others assert that the y-axis should always start from 0, particularly for bar graphs (few, 2012; pandey et al., 2015; wong, 2010). this too could fail to achieve compatibility by making effects look too small. in the case of scientific fields for which effect size is standardized based on standard deviation, the range of the y-axis should be a function of the standard deviation (sd). in behavioral sciences such as psychology and economics, for example, the mean effect size is approximately half a sd (bosco, aguinis, singh, field, & pierce, 2015; open science collaboration, 2015; paterson, harms, steel, & crede, 2016), and a standardized effect size of d = .8 is considered a big effect (cohen, 1988). consequently, an appropriate range for the y-axis would be one to two sds, which would be plotted as the group mean ± 0.75 sd (or ±0.5 – 1 sds). 
with this range, big effects such as a cohen’s d of .8 would look big and small effects of d = .3 would look small. in other words, this range would help achieve compatibility between the visual impression of the size of the effect and the actual size of the effect. empirical studies the effect of visual-conceptual size compatibility on graph fluency was empirically tested in 57 participants across 5 experiments (see table 1). the participants were naïve college students, which serves as an appropriate sample given that scientific results should be accessible and comprehensible to this population and not just to experts in one’s field. the stimuli were bar or line graphs that had been constructed from simulated data. data were simulated from two (hypothetical) groups of participants by sampling from normal distributions in r (r core team, 2017). for one group, the data were drawn from a normal distribution with a mean of 50 and a standard deviation of 10 (as in a memory experiment with mean performance of 50% and sd of 10%). for the other group, the data were drawn from a normal distribution with a standard deviation of 10 and the mean at 49, 47, 45, or 42. these means correspond to effect sizes of d = 0.1, 0.3, 0.5, and 0.8, respectively. in experiments 3-5, the mean of 49 (d = 0.1) was replaced with the mean of 50 (d = 0). in experiments 2-5, the data were re-sampled if the attained effect size differed by more than 0.1 from the intended effect size. data were simulated 10 times for each of the four effect sizes to create 40 sets of data for each experiment. in experiments 1-3, the means of the simulated data were displayed as a bar graph depicting two groups of participants who engaged in different study strategies (spaced versus massed; see figure 1). in experiments 4-5, the means were used to determine the end points of a line graph, and the x-axis was labeled as “hours spent studying”. 
for each set of data, three graphs were constructed that varied in the range of the y-axis. the full condition showed the full range from 0 to 100 on a hypothetical memory test. the minimal condition showed the smallest range necessary to see the data. the standardized condition was centered on the group graph construction: an empirical investigation on setting the range of the y-axis 3 mean and extended by one to two sds in either direction (the exact value differed across experiments, see table 1 or the appendix). figure 1 shows several examples of graphs that served as stimuli. in experiment 3, error bars were also included and explained to the participants. within an experiment, the same set of 120 graphs (3 axis ranges x 4 effect sizes x 10 sets) were shown to the participants. graphs were shown one at a time, order was randomized, and participants completed 4 blocks of 120 trials. in all experiments, the participants’ task was to indicate whether there was no effect, a small effect, a medium effect, or a big effect for each graph by pressing 1, 2, 3 or 4 on the keyboard. graph fluency was measured using linear regressions rather than accuracy because regression coefficients have the advantage that they provide two separate measures. the slope provides an estimate of sensitivity to the magnitude of the effect depicted in the plot. a steeper slope indicates better sensitivity to effect size than a shallower slope. the intercept provides an estimate of bias. two graphs could lead to similar levels of sensitivity but different levels of bias. separate linear regressions were calculated for each participant for each y-axis range condition (full, standardized, and minimal). full standardized minimal . figure 1. sample stimuli in the experiments on bar graphs and on line graph. the bar graphs show final test score as a function of whether study style was spaced or massed. the line graphs show final test score as a function of hours spent studying from 1 to 4. 
within each experiment, the same data were plotted using the full range from 0-100, the standardized range (in this case, the group mean +/0.7 sd), or the minimal range necessary to see the data. in this example, a medium effect (cohen’s d = 0.5) was simulated for the bar graphs (top row) and a small effect (cohen’s d = 0.3) was simulated for the line graphs (bottom row). the participant’s task was to indicate whether there was no effect, a small effect, a medium effect, or a big effect. witt 4 in each regression, the dependent measure was response (on the scale of 1 to 4). the effect sizes were recoded to also be on a scale from 1 to 4 then centered by subtracting 2.5 so that perfect performance would produce a regression coefficient for the slope of 1 and an intercept of 2.5. figure 2 shows the mean slope coefficients across all 5 experiments. sensitivity was best for the standardized graphs and worse for the full range graphs. participants were better able to assess the size of the effect depicted in the graph for the standardized graphs, than for the minimal or full graphs. participants were also less biased when viewing the standardized graphs. figure 3 shows the mean bias across all 5 experiments. bias scores were calculated as a percent bias based on the coefficients for the intercept. a negative score indicates a bias to respond that effects were small, and a positive score indicates a bias to respond that the effects were big. for the full graphs, there was a large bias to respond that the effects were small. when looking at graphs with the full range, participants responded that almost all effects (86%) were null or small. for the minimal graphs, there was a large bias to respond that the effects were substantial. when looking at graphs with the minimal range for cohen’s d = 0.10 – 0.80, participants responded that the effect was big on 49% of the trials. in contrast, there was much less bias with the standardized graphs (see supplemental materials). 
figure 2. sensitivity is plotted as a function of graph axis condition for the three types of graphs across all 5 experiments. sensitivity was measured as the coefficient for the slope from regressions of actual effect size on estimated effect size. only trials for which the graph depicted an effect size greater than d = 0.1 are included (see supplementary materials for all the data). a higher sensitivity score corresponds to better performance, and a coefficient of 1 corresponds to perfect performance. a coefficient of 0 indicates chance performance. in the left panel, mean sensitivity across all experiments is shown. error bars are 1 sem calculated within-subjects, and are approximately the same size as the symbols. the yaxis range is 3 sd. the right panel shows sensitivity for each participant for each experiment. the data are color-coded by experiment (e.g. red = experiment 1, orange = experiment 2) and are also laterally positioned from left to right within graph type category. each point corresponds to one participant, and each participant has one symbol for each of the three graph types. the solid horizontal line at 0 shows the point of no sensitivity and the dashed horizontal line at 1 shows the point of perfect sensitivity. graph construction: an empirical investigation on setting the range of the y-axis 5 discussion the visual impression of the size of an effect has a strong influence on the judged size of an effect. when the visual impression was compatible with the actual effect size, judgments of effect size were better calibrated and less biased compared with the typical default setting of showing the minimum range to display the data and the setting of showing the full potential range. based on the current studies, the recommendation is to center the y-axis on the grand mean and extend the range 0.75 sds in either direction so that the range of the y-axis is 1.5 sds. 
the current studies show improved sensitivity to effect size and reduced bias in estimating effect size when the range of the y-axis was centered on the grand mean of the data and extended approximately 0.7 sds in either direction. the various studies used slightly different extensions ranging from 0.5 sds to 1 sd. there were not large detectable differences in sensitivity or bias depending on the exact range that was used, so the precise value of the y-axis range might not be critical. rather, the key feature is that the visual size aligns with the actual size of the effect. the specific range to be used might vary as a function of the size of the error bars (the range should be large enough to encompass them), the size of the effect (the range would have to be extended for particularly large effects, such as was done with the current results), if doing so would make the range include nonsensical numbers (such as negative numbers for performance), and to achieve a consistent scale across multiple graphs to enhance across-graph comparisons. given that the exact range in terms of sd could vary from plot to plot, it could be useful to indicate the range in sd units in the figure caption. this indication would be particularly useful in cases for which researchers do not include error bars. the current experiments explored graphs of stimulated data from between-subjects designs. the recommendations likely generalize to withinsubject designs with the caveat that the y-axis figure 3. bias (as a percentage) is plotted as a function of graph axis condition for the three types of graphs across all 5 experiments. a negative bias corresponds to responding that effects are smaller than they are, and a positive bias corresponds to responding that effects are bigger than their actual size. in the left panel, mean bias across all experiments is shown. error bars are 1 sem calculated within-subjects, and are approximately the same size as the symbols. the y-axis range is 4 sd. 
the right panel shows bias for each participant for each experiment. the data are color-coded by experiment (e.g. red = experiment 1, orange = experiment 2) and are also ordered from left to right within graph type category. each point corresponds to one participant, and each participant has one symbol for each of the three graph types. witt 6 should be a function of the denominator used to calculate the within-subjects effect size. for example, the denominator for cohen’s dz is the square root of the sum of the squares of the standard deviations minus the product of the standard deviations and the correlation between the two measures. graphs plotting within-subjects data could be ± 0.75 times this denominator (or one of the other suggested measures for within-subjects effects sizes; e.g. lakens, 2013). in cases for which there are both between-subjects and within-subjects factors, the researchers will have to decide which denominator to use for the range depending on which effect they most want to emphasize. it is debatable whether the recommendation offered here should be employed with bar graphs. some have shown that graphs that start at a position other than 0 are deceptive (e.g., pandey et al., 2015). the idea is that bar graphs should always start at 0 because the height of the bar signifies the value of the condition being represented. when the y-axis starts at a value greater than 0, the height of the bar corresponds to the difference between the condition’s value and the starting point, rather than the condition’s value itself. consider the following example: imagine that group a scored 70% on a memory test and group b scored 60%. on a plot for which the y-axis starts at 50%, group a’s score would appear twice as big as group b’s score, even though they only scored 10% higher. the issue at hand concerns the visual impression of the data. 
if the graph gives the impression that the differences are big, and that aligns with the size of the effect, the graph would be produce compatibility between vision and true effect size. if, however, the impression is that one group’s performance was twice as good as the other group’s performance, this would produce a misleading impression of the data. the current experiments cannot speak to which impression was experienced because participants were asked to rate the size of the effect as being no effect, small, medium, or big, rather than quantifying the size of one bar relative to another. the specific task used here did not permit measuring the spontaneous impression given by the graphs. one option is for researchers to use alternative types of graphs to avoid the issue. alternatives include point graphs and a newly-designed type of graph called a hat graph (witt, 2019). the recommendation to set the y-axis range to be 1.5 sds does not generalize to fields for which the sd is unknown or irrelevant for interpreting effect size. for these fields, previous recommendations such as tufte’s lie detector ratio could be appropriate (tufte, 2001). but for scientific fields that rely on standard deviation to interpret effect size, this is the first empirically-based recommendation that provides clear guidelines for constructing graphs to communicate the magnitude of the effects. maximizing compatibility between visual size and conceptual size improved comprehension of the effects shown in the graphs. the data presented in the graphs were exactly the same, yet participants were less biased and were more sensitive to the size of the depicted effect when the axis range was one to two sds. furthermore, emphasizing sd and effect size in graph construction could help shift researchers’ focus to effect size, rather than statistical significance. 
indeed, effect size (as measured with cohen’s d) provides a better measure for discriminating real effects from null effects than p values or bayes factors (witt, 2019). such a shift could help guard against practices that have contributed to recent failures to replicate in various scientific fields (camerer et al., 2016; open science collaboration, 2015). in his famous book on how to lie with statistics, huff noted that as long as the y-axis is correctly labeled, “nothing has been falsified – except the impression that it gives” (huff, 1954, p. 62). the impression matters. researchers should select the range of the y-axis so that small effects look small and big effects look big (based on the field’s adopted conventions). a simple way to do this is to set the range to be 1.5 (or more) standard deviations of the dependent measure. that this improves graph comprehension is both intuitive and is now supported by empirical evidence. open science practices this article earned the open data and the open materials badge for making the data and materials graph construction: an empirical investigation on setting the range of the y-axis 7 openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, are published in the online supplement. author contribution witt is solely responsible for this manuscript. the author read and approved the final manuscript. funding this work was supported by grants from the national science foundation (bcs-1632222 and bcs1348916). conflict of interest statement the author declares there were no conflicts of interest. references barsalou, l. w. (1999). perceptions of perceptual symbols. behavioral and brain sciences, 22, 577-660. belia, s., fidler, f., williams, j., & cumming, g. (2005). researchers misunderstand confidence intervals and standard error bars. psychological methods, 10(4), 389-396. doi: 10.1037/1082989x.10.4.389 bosco, f. 
a., aguinis, h., singh, k., field, j. g., & pierce, c. a. (2015). correlational effect size benchmarks. journal of applied psychology, 100(2), 431-449. doi: 10.1037/a0038047
camerer, c. f., dreber, a., forsell, e., ho, t.-h., huber, j., johannesson, m., . . . chan, t. (2016). evaluating replicability of laboratory experiments in economics. science, 351(6280), 1433-1436.
cleveland, w. s., & mcgill, r. (1985). graphical perception and graphical methods for analyzing scientific data. science, 229(4716), 828-833. doi: 10.1126/science.229.4716.828
cohen, j. (1988). statistical power analysis for the behavioral sciences. new york, ny: routledge academic.
open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251), aac4716. doi: 10.1126/science.aac4716
cumming, g., & finch, s. (2005). inference by eye: confidence intervals and how to read pictures of data. american psychologist, 60(2), 170-180. doi: 10.1037/0003-066x.60.2.170
few, s. (2012). show me the numbers: designing tables and graphs to enlighten (2nd ed.). burlingame, ca: analytics press.
glenberg, a. m., witt, j. k., & metcalfe, j. (2013). from revolution to embodiment: 25 years of cognitive psychology. perspectives on psychological science, 8(5), 574-586.
huff, d. (1954). how to lie with statistics. new york, ny: w. w. norton & company.
kosslyn, s. m. (1994). elements of graph design. new york: w. h. freeman and company.
lakens, d. (2013). calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and anovas. frontiers in psychology, 4, 863. doi: 10.3389/fpsyg.2013.00863
morey, r. d., rouder, j. n., & jamil, t. (2014). bayesfactor: computation of bayes factors for common designs (version 0.9.8). retrieved from http://cran.r-project.org/package=bayesfactor
pandey, a. v., rall, k., satterthwaite, m. l., nov, o., & bertini, e. (2015).
how deceptive are deceptive visualizations?: an empirical analysis of common distortion techniques. paper presented at the proceedings of the 33rd annual acm conference on human factors in computing systems, seoul, republic of korea.
paterson, t. a., harms, p. d., steel, p., & crede, m. (2016). an assessment of the magnitude of effect sizes: evidence from 30 years of meta-analysis in management. journal of leadership & organizational studies, 23(1), 66-81.
revelle, w. (2018). psych: procedures for psychological, psychometric, and personality research. retrieved from https://cran.r-project.org/package=psych
r core team. (2017). r: a language and environment for statistical computing. retrieved from https://www.r-project.org
tufte, e. r. (2001). the visual display of quantitative information (2nd ed.). cheshire, ct: graphics press.
witt, j. k. (2019). introducing hat graphs. retrieved from psyarxiv.com/sg37q.
witt, j. k. (2019). insights into criteria for statistical significance from signal detection analysis. meta-psychology, 3, mp.2018.871. doi: 10.15626/mp.2018.871
wong, d. m. (2010). the wall street journal guide to information graphics: the dos & don'ts of presenting data, facts, and figures. new york, ny: w. w. norton.

appendix: experimental details

experiment 1: bar graphs with axis range of 2 sd
participants judged the size of effects depicted in bar graphs that were constructed with three axis range options.

method
participants. nine students in an introductory psychology course participated in exchange for course credit. in this and all subsequent experiments, the number of participants was maximized within a pre-determined time limit.
stimuli and apparatus. graphs were constructed in r (r core team, 2017). for each graph, two means were generated. one mean was 50, and the other mean was 49, 47, 45, or 42.
these equated to effect sizes of cohen’s d = .1, .3, .5, and .8, respectively. to add some noise to each graph, each mean was drawn from a normal distribution centered on the desired mean with 1000 samples and a standard deviation of 10. the means were presented in bar graphs (see figure a1). the left bar was white and labeled “spaced” and the right bar was black and labeled “massed”. for each set of simulated data, three bar graphs were constructed that corresponded to the three y-axis range conditions. for the full graphs, the y-axis range went from 0 to 100. for the minimal graphs, the y-axis went from the smallest data value minus 1 to the largest data value plus 1. for the standardized graphs, the mean of the two groups was calculated, and 1 sd (10) was added in either direction to set the y-axis range. this process of creating 3 graphs for each set of data was repeated 10 times for each of the 4 effect sizes for a total of 120 graphs. graphs were 500 pixels by 500 pixels and were shown on a 19” computer monitor with 1280 x 1024 resolution.
procedure. after providing informed consent, each participant was seated at a computer. they were given the following instructions: “you will see graphs showing the effect of study style on final test performance. there were two study styles. massed is like cramming everything at once just before the exam. spaced refers to studying a little bit every day for weeks before the exam. the y-axis shows final test performance, with higher values meaning better performance.
[figure a1: a 5 x 3 grid of sample graphs; rows are experiments 1-5 with standardized ranges of 2, 1.4, 1.2, 1.4, and 1 sd, and columns are the full, standardized, and minimal conditions.]
figure a1. sample stimuli for each of the 5 experiments. each row corresponds to one experiment and shows a single set of data plotted in the three different ways (full, standardized, and minimal). in all cases, the data show a medium effect (cohen’s d = 0.5).
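the three axis-range rules just described can be sketched in code. the article constructed its graphs in r; the following python sketch is purely illustrative (the function name and defaults are not from the article), using experiment 1's 2-sd standardized range:

```python
def axis_ranges(group_means, sd=10.0, range_in_sds=2.0):
    """Compute y-axis limits for the three graph conditions (illustrative)."""
    # full: the entire scale of the measure (test performance, 0-100)
    full = (0.0, 100.0)
    # minimal: smallest data value minus 1 to largest data value plus 1
    minimal = (min(group_means) - 1.0, max(group_means) + 1.0)
    # standardized: grand mean plus/minus half of the chosen sd range
    grand_mean = sum(group_means) / len(group_means)
    half = sd * range_in_sds / 2.0
    standardized = (grand_mean - half, grand_mean + half)
    return {"full": full, "minimal": minimal, "standardized": standardized}

# two group means that differ by half an sd (d = .5 when sd = 10)
ranges = axis_ranges([50.0, 45.0])
```

note how the minimal range tracks only the plotted values, while the standardized range is anchored to the standard deviation of the measure, which is what ties visual size to effect size.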
the number in parentheses under the experiment number indicates the range of the standardized condition. for each graph, indicate if study style had 1. no effect, 2. a small effect, 3. a medium effect, 4. a big effect on final performance. ready? press enter”. a trial began with a fixation cross at the center of the screen for 500ms. the graph was then shown. above the graph, text reminded participants of the four response options. the graph remained until participants made a response, at which point, the graph disappeared and a blank screen was shown for 500ms. each block of trials consisted of the presentation of each of the 120 graphs (3 graph types x 4 depicted effect sizes x 10 repetitions). order was randomized within block, and participants completed 4 blocks for a total of 480 trials.
results and discussion
one participant only completed 431 trials, but their data were still included. the depicted effect size was recoded on a scale from 1 to 4 to be consistent with the scale of the response. the smallest effect size (d = .1) was coded as 1.5 to account for the idea that this effect is smaller than a small effect but bigger than no effect. in later experiments, these graphs were replaced with graphs for which there was no effect instead of d = .1. for each participant for each of the 3 axis range conditions, the data were submitted to separate linear regressions with estimated effect size as the dependent factor and actual effect size (recoded on a scale from 1-4 then centered by subtracting 2.5) as the independent factor. the regressions produced two coefficients for each participant for each axis range condition. the slope indicates sensitivity to the size of the effect. a slope of 1 indicates perfect sensitivity. a slope less than 1 indicates attenuated sensitivity. the intercept indicates any bias to see effects as smaller or larger than their true size.
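the per-participant regression just described (estimated effect size regressed on the recoded, centered depicted effect size) can be sketched as follows. the article's analyses were done in r; this python sketch uses hypothetical responses for one participant and np.polyfit in place of r's lm:

```python
import numpy as np

# hypothetical responses from one participant in one axis-range condition:
# depicted effects recoded as 1.5, 2, 3, 4 (for d = .1, .3, .5, .8)
depicted = np.array([1.5, 2.0, 3.0, 4.0, 1.5, 2.0, 3.0, 4.0])
response = np.array([1.0, 2.0, 2.0, 3.0, 2.0, 2.0, 3.0, 3.0])

centered = depicted - 2.5  # center the predictor, as in the article
slope, intercept = np.polyfit(centered, response, deg=1)
# slope: sensitivity to effect size (1 = perfect, < 1 = attenuated)
# intercept: estimated response at the scale midpoint; compare with 2.5 for bias
```

with these made-up responses the slope comes out well below 1, i.e., the kind of attenuated sensitivity reported for the full and minimal graphs.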
one participant had slopes that were identified as outliers in the full and minimal conditions because they were greater than 1.5 times the interquartile range for each condition. this participant was excluded from the analysis (despite being the best performer in the group) because their data were not typical of the rest of the sample. another participant had a slope less than 1.5 times the interquartile range in the full condition, and was also excluded for not being typical of the rest of the sample. the coefficients were analyzed using paired-samples t-tests to compare each graph condition to the others. analyses were done in r (r core team, 2017). bayes factors were calculated using the bayesfactor package in r with a medium prior (morey, rouder, & jamil, 2014). a bayes factor greater than 3 indicates moderate evidence, and a bayes factor greater than 10 indicates substantial evidence for the alternative hypothesis over the null hypothesis. conversely, bayes factors less than .33 and less than .10 indicate moderate and substantial evidence, respectively, for the null hypothesis over the alternative hypothesis. effect sizes were calculated using the recommendations of lakens (2013), and 95% confidence intervals (cis) on the effect size were calculated using the cohen.d.ci function in the psych package (revelle, 2018).
figure a2. mean response is plotted as a function of depicted effect size and graph type for experiment 1. error bars are 1 sem calculated within-subjects. solid lines represent linear regressions for depicted effects d ≥ .3. dashed lines represent linear regressions for depicted effects d ≤ .3.
the standardized graphs produced significantly greater slopes than the full graphs, t(6) = 3.84, p = .009, dz = 1.45, 95% cis [.33, 2.51], bayes factor = 7.54 (see figure a2).
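the interquartile-range exclusion rule used throughout these experiments can be read as tukey's fences (values beyond 1.5 times the iqr outside the quartiles). the following python sketch is one possible implementation under that reading; the article's own r implementation may differ:

```python
import numpy as np

def tukey_outliers(slopes, k=1.5):
    """Flag slopes beyond k * IQR outside the quartiles (Tukey's fences)."""
    slopes = np.asarray(slopes, dtype=float)
    q1, q3 = np.percentile(slopes, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (slopes < lower) | (slopes > upper)

# five typical slopes and one hypothetical atypical participant
flags = tukey_outliers([0.30, 0.32, 0.28, 0.31, 0.29, 0.95])
```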
with the standardized y-axis range, participants were more sensitive to the differences in actual effect size (m = .47, sd = .11) compared with graphs that showed the full range from 0 to 100 (m = .30, sd = .07). sensitivity was also better for the standardized graphs than the minimal graphs, t(6) = 3.61, p = .011, dz = 1.37, 95% cis [.28, 2.40], bayes factor = 6.17. the minimal graphs (m = .28, sd = .04) produced sensitivity similar to the full graphs, p = .51, dz = .26, 95% cis [-.50, 1.01], bayes factor = 0.43. these data show an advantage for the standardized graphs because participants were more sensitive to differences among magnitudes of the depicted effect sizes with the standardized graphs than with the full or minimal graphs. however, the standardized graphs led to performance that was far from perfect. the slope was .47, and perfect performance would have produced slopes of 1. thus, even though the standardized graphs signify an improvement over the other two options, more work is still necessary to improve graph comprehension. another advantage for the standardized graphs can be seen with respect to bias. bias scores were calculated as a percentage score of underestimation (negative values) and overestimation (positive values). they were calculated as the participant’s coefficient for the intercept minus the true intercept (2.5) divided by the true intercept. there were significant differences between the bias scores across all conditions, ps < .003. the bias scores for the full graphs were negative (m = -27%, sd = 10%) and significantly below 0, t(6) = -7.01, p < .001, dz = 2.64, 95% cis [1.00, 4.27], bayes factor = 82. the bias scores for the minimal graphs were positive (m = 36%, sd = 19%) and significantly above 0, t(6) = 4.91, p = .003, dz = 1.86, 95% cis [.57, 3.10], bayes factor = 19.
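the bias-score calculation described above (the fitted intercept minus the true intercept of 2.5, divided by the true intercept, as a percentage) translates directly to code. a minimal python sketch; the 1.825 input is a hypothetical intercept chosen to illustrate a 27% underestimation of the kind reported for the full graphs:

```python
def bias_percent(intercept, true_intercept=2.5):
    # (observed intercept - true intercept) / true intercept, as a percentage;
    # negative values = underestimation, positive values = overestimation
    return 100.0 * (intercept - true_intercept) / true_intercept

underestimate = bias_percent(1.825)  # hypothetical full-range-graph intercept
no_bias = bias_percent(2.5)          # an unbiased participant scores 0
```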
in contrast, the bias scores for the standardized graphs were significantly less biased than in the other conditions (ps < .003), and were not significantly different from 0 (m = 1%, sd = 4%), t(6) = 0.47, p = .66, dz = .18, 95% cis [-.58, .82], bayes factor = 0.39. with the full graphs, most effects looked like small effects. indeed, 91% of the trials with the full graphs were labeled as showing no effect or a small effect. with the minimal graphs, 58% of the effects were labeled as big effects and 88% were labeled as medium or big. with the standardized graphs, small effects looked small and medium effects looked medium (see figure a3). however, the big effects only looked medium. thus, the experiment was replicated but with a smaller range in the standardized condition to determine if that would improve detection of big effects.
experiment 2: bar graphs with axis range of 1.4 sd
standardized graphs, for which the y-axis range is a function of the standard deviation, produced better sensitivity and less bias in participants who judged the size of the depicted effect compared with graphs that showed the full range and graphs that showed only the minimal range necessary to see the data. however, sensitivity with the standardized graphs was still below perfect performance. in this experiment, the range of the standardized graphs was decreased from 2 sds to 1.4 sds.
method
fourteen students in an introductory psychology course participated in exchange for course credit. everything was the same as in experiment 1 except for the construction of the standardized graphs, for which the y-axis range went from the group mean minus 0.7 sd to the group mean plus 0.7 sd (see figure a1). thus, the standardized range was 1.4 sd (instead of 2 sd as in experiment 1). in addition, the simulated data were evaluated to ensure that the outcomes were similar to the intended outcomes.
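this evaluation of the simulated data can be sketched as a resampling loop: draw the two groups, compute the realized cohen's d, and redraw until it is within 0.1 of the intended value (the tolerance used in this experiment). the sketch below is illustrative python (the article used r); the pooled-sd formula, function name, and sample size are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2025)  # seed is arbitrary

def simulate_groups(target_d, n=1000, base_mean=50.0, sd=10.0, tol=0.1):
    """Resample two groups until the realized Cohen's d is within tol of target_d."""
    mean2 = base_mean - target_d * sd
    while True:
        g1 = rng.normal(base_mean, sd, n)
        g2 = rng.normal(mean2, sd, n)
        # realized cohen's d with a pooled standard deviation (assumed formula)
        pooled_sd = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / 2.0)
        d = (g1.mean() - g2.mean()) / pooled_sd
        if abs(d - target_d) <= tol:
            return g1, g2, d

g1, g2, realized_d = simulate_groups(0.5)  # aim for a medium effect
```

with samples of this size the realized d rarely misses the target by more than the tolerance, so the loop typically terminates after one or two draws.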
the effect size of the simulated data was compared to the intended effect size, and if they differed by more than 0.1, the data were resampled until the discrepancy was less than 0.1. participants completed 4 blocks of 120 trials, and order was randomized within block.
figure a3. response is plotted as a function of depicted effect size for the three types of axis range conditions (full, minimal, and standardized) for experiment 1. the bottom right panel shows the correct response. response was entered as 1 (no effect), 2 (small effect), 3 (medium effect), and 4 (big effect). each point corresponds to one participant’s response on one trial. the data have been jittered along both axes to enable visibility.
results and discussion
the data were analyzed as before. three participants had a slope that was deemed an outlier for being beyond at least 1.5 times the interquartile range for the full or minimal graphs. the slope, which indicates sensitivity to the size of the effect in the graph, was greater for the standardized graphs (m = .54, sd = .17) than the full graphs (m = .31, sd = .08), t(10) = 3.46, p = .006, dz = 1.04, 95% cis [.28, 1.77], bayes factor = 9.00 (see figure a4). sensitivity was also greater for the standardized graphs than the minimal graphs (m = .30, sd = .06), t(10) = 4.07, p = .002, dz = 1.23, 95% cis [.42, 2.00], bayes factor = 20. replicating experiment 1, the current data show that setting the range of the y-axis to be a function of the standard deviation, rather than the full range of options or the minimal range necessary to show the data, improved graph comprehension. recall that participants were not asked to indicate how big the effect looked but rather how big the effect was. full and minimal graphs both produced misleading impressions of the data that severely attenuated sensitivity to effect size. simply setting the range of the y-axis in relation to the standard deviation improved readers’ sensitivity to the data.
figure a4.
mean response is plotted as a function of depicted effect size and graph type for experiment 2. error bars are 1 sem calculated within-subjects. solid lines represent linear regressions for depicted effects d ≥ .3. dashed lines represent linear regressions for depicted effects d ≤ .3.
bias was again found for the full and minimal graphs but not the standardized graphs. for the full graphs, the bias was to underestimate effect size by 28% (sd = 9%), t(10) = -10.51, p < .001, dz = 3.17, 95% cis [1.67, 4.64], bayes factor > 100. indeed, of all the trials with the full graphs, the effect was labeled as small or no effect on 90% of responses. the bias was of a similar magnitude but in the opposite direction for the minimal graphs, t(10) = 4.91, p < .001, dz = 1.48, 95% cis [.59, 2.33], bayes factor = 61. with the minimal graphs, participants overestimated the size of the effects by 31% (sd = 21%). over half of all effects with the minimal graphs were labeled big (53%), and 81% were labeled as medium or big. in contrast, the bias was much smaller (m = 6%, sd = 9%) for the standardized graphs, and only marginally significantly different from 0, t(10) = 2.13, p = .059, dz = .64, 95% cis [-.02, 1.28], bayes factor = 1.50. the bias with the standardized graphs was far less than the biases observed with the full and minimal graphs, ps < .001. the evidence thus far is clear: graphs with a y-axis range that is a function of the standard deviation produce better sensitivity and less bias in participants when they are tasked with judging the size of an effect, compared with graphs that present the full range and with graphs that present only the minimal range necessary to view all of the data.
experiment 3: bar graphs with error bars
the graphs in experiments 1 and 2 did not contain error bars. as a result, the graphs did not contain enough information to know if an effect was null, small, medium, or big.
this was a conscious decision given that introductory psychology students might not know how to interpret error bars. yet, it is necessary to know if standardized graphs still produce an advantage even when there is enough information presented in the graphs to be able to accurately answer the question. in addition, the graphs with the smallest effects in experiments 1 and 2 had the awkward feature of being bigger than no effect but smaller than a “small” effect, so it was unclear whether the correct answer should be 1 or 2. this ambiguity was eliminated in the current experiment.
method
thirteen students in an introductory psychology course participated in exchange for course credit. graphs were constructed similarly as in experiment 2 with the following exceptions. the four effect sizes that were modeled were cohen’s d = 0, .3, .5, and .8, which correspond to no effect, a small effect, a medium effect, and a big effect, respectively. the data were simulated as coming from two independent groups of 100 participants. the mean used to model the data for the hypothetical group that used the spaced studying strategy was always 50 (as in 50% accuracy on a memory test). the mean used to model the data for the hypothetical group that used the massed studying strategy was 50 minus 0, 3, 5, or 8 depending on the effect size being modeled. using these means and a sd of 10, data were sampled from a normal distribution and summarized for the graphs. error bars were calculated as 95% confidence intervals. in addition to the instructions given in experiments 1 and 2, participants were also told the following: “important! an effect is statistically significant if p < .05. however, you can also assess statistical significance by looking at error bars. error bars are lines that extend from the mean of each condition. the mean of each condition is shown by the top of the bar.
if the error bar from one condition overlaps the mean from the other condition, the effect is not significant. if neither bar overlaps the mean of the other condition, then the effect is significant. the farther apart the error bars, the bigger the effect.” note that this rule of thumb is overly simplified. there can be cases for which the error bars overlap but the effect is statistically significant at the p < .05 level (cumming & finch, 2005), but this level of nuance was not presented to the participants. for each set of simulated data, 3 graphs were constructed. for the full graphs, the y-axis range went from 0 to 100. for the standardized graphs, the y-axis range went from the grand mean minus 0.6 sd to the grand mean plus 0.6 sd. for the minimal graphs, the bottom of the y-axis range was the smallest combination of the mean minus the lower confidence interval minus 0.1, and the top of the range was the biggest combination of the mean plus the upper confidence interval plus 0.1. participants completed 4 blocks of 120 randomized trials.
results and discussion
the data were analyzed as before. one participant had a negative slope for the standardized graphs, and another participant had a high slope for the full graphs. both were 1.5 times beyond the interquartile range and excluded from analyses.
figure a5. mean response is plotted as a function of depicted effect size and graph type for experiment 3. error bars are 1 sem calculated within-subjects. solid lines represent linear regressions for depicted effects d ≥ .3. dashed lines represent linear regressions for depicted effects d ≤ .3.
the slopes were steeper, showing better sensitivity, for the standardized graphs (m = .62, sd = .19) compared with the full graphs (m = .24, sd = .09) and the minimal graphs (m = .55, sd = .20). the difference in slopes between the standardized and full graphs was significant, t(10) = 7.76, p < .001, dz = 2.34, 95% cis [1.16, 3.50], bayes factor > 100.
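the simplified error-bar rule given to participants, together with 95% confidence intervals of the kind used for the bars, can be sketched as follows. this python sketch is illustrative: the 1.96 normal approximation for the ci and the function names are assumptions, not details from the article:

```python
import math

def ci95_half_width(sd, n):
    # normal-approximation 95% confidence interval half-width (1.96 * se)
    return 1.96 * sd / math.sqrt(n)

def looks_significant(mean1, mean2, half1, half2):
    # the simplified rule given to participants: "significant" if neither
    # error bar overlaps the mean of the other condition
    diff = abs(mean1 - mean2)
    return diff > half1 and diff > half2

half = ci95_half_width(10.0, 100)                 # sd = 10, n = 100 per group
sig = looks_significant(50.0, 45.0, half, half)   # a medium effect, d = .5
```

as the article notes (cumming & finch, 2005), non-overlapping bars are a rough heuristic rather than an exact significance test; the sketch implements only the rule as it was worded to participants.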
the difference in slopes between the standardized versus minimal graphs was also significant, t(10) = 3.09, p = .011, dz = .93, 95% cis [.20, 1.63], bayes factor = 5.46. even though all the information was the same across the three graph conditions and even though this information was sufficient for determining the size of each effect, participants were better able to determine effect size when the range of the y-axis was a function of the standard deviation (see figure a5). the impression given by figure a3 indicates that sensitivity was just as good if not better for the minimal graphs than the standardized graphs when comparing no effect to a small effect (ds = 0 and .3), but sensitivity was better (steeper) for the standardized graphs when comparing across small, medium, and big effects (ds = .3, .5, and .8). this impression prompted an unplanned analysis. linear regressions were again conducted for each participant for each graph condition. however, in one set of regressions, only effect sizes 0 and .3 were included. in another set of regressions, only effect sizes .3, .5, and .8 were included. two additional participants were identified as outliers because the slopes for all three graphs in the latter analysis were 1.5 times beyond the interquartile range, and were excluded from the remaining analyses. with respect to determining whether or not an effect is present (by comparing slopes for graphs depicting ds = 0 and .3), all three graph types led to similar performance (standardized: m = .89, sd = .40; full: m = .49, sd = .22; minimal: m = 1.02, sd = .45). with all three types of graphs, participants were sensitive to whether or not there was an effect, as shown by coefficients for each graph type that were positive and significantly greater than 0, ps < .001. the standardized graph produced some benefit over the full graphs, t(8) = 2.82, p = .022, dz = .85, 95% cis [.14, 1.53], bayes factor = 3.76.
the standardized graph was no better, and marginally worse, than the minimal graphs, t(8) = -1.82, p = .11, dz = .55, 95% cis [-.10, 1.17], bayes factor = 1.03. it should be noted that a bias to see all effects as being bigger (as found with minimal graphs) would lead to a steeper slope when comparing just the graphs that depict a null effect and a small effect. thus, it cannot be known whether sensitivity is better with the minimal graphs or if the bias caused by the minimal graphs leads to greater estimates of sensitivity. with respect to determining the magnitude of an effect that is present (by comparing slopes for graphs depicting ds = .3, .5, and .8), the standardized graphs produced better sensitivity than the full or minimal graphs (standardized: m = .46, sd = .11; full: m = .09, sd = .06; minimal: m = .25, sd = .13), ps ≤ .001. the comparison between the standardized graphs and the full graphs resulted in a bayes factor greater than 100, dz = 2.81, 95% cis [1.45, 4.14]. the comparison between the standardized graphs and the minimal graphs resulted in a bayes factor of 65, dz = 1.50, 95% cis [.60, 2.35]. in each of the three graph types, participants showed some level of sensitivity to the magnitude of the effect, as shown by the coefficients being significantly greater than 0, ps < .003. in addition to better sensitivity with the standardized graphs, the standardized graphs also produced less bias compared with the other graphs, ps ≤ .001. for the full graphs, there was a 28% bias (sd = 12%) to underestimate effect size, which was significantly different from 0, t(10) = -7.82, p < .001, dz = 2.36, 95% cis [1.17, 3.52], bayes factor > 100. for the minimal graphs, there was a 14% bias (sd = 24%) to overestimate the size of the effect, which was marginally significantly different from 0, t(10) = 2.03, p = .069, dz = .61, 95% cis [-.05, 1.25], bayes factor = 1.33.
for the standardized graphs, the bias was 7% (sd = 19%) and was not significantly different from 0, t(10) = 1.29, p = .227, dz = .39, 95% cis [-.23, .99], bayes factor = .58. in summary, even with error bars, graphs with the y-axis range set as a function of the standard deviation produced better sensitivity and less bias compared with graphs that showed the full range and graphs that showed only the minimal range necessary to see the data.
experiment 4: line graphs with axis range of 1.4 sd
the current experiment used line graphs as stimuli instead of bar graphs to see if the previous recommendations generalized to a different kind of graph.
method
twenty students in an introductory psychology course participated in exchange for course credit. stimuli were graphs that were constructed by simulating data from two groups, and connecting their means with a line to create an impression of data across four groups. the four effect sizes that were modeled were cohen’s d = 0, .3, .5, and .8, which correspond to no effect, a small effect, a medium effect, and a big effect, respectively. the y-axis range was full (0-100), minimal (smallest value minus 1 to largest value plus 1), or standardized (group mean minus 0.7 sd to the group mean plus 0.7 sd). everything else was the same as in the previous experiments, except the x-axis was labeled as hours spent studying on a range from 1-4.
results and discussion
the data are shown in figure a6. the data were analyzed as before with three separate linear regressions for each participant for each graph type for each combination of all effect sizes, d = 0 and .3 only, and d = .3-.8 only. one participant had slopes greater than 1.5 times the interquartile range for the full and minimal graphs, and 3 participants had slopes less than 1.5 times the interquartile range for the minimal graphs. all 4 were excluded.
for regressions on all effect sizes depicted in the graphs, the standardized graphs led to greater slopes than the full graphs, t(15) = 7.16, p < .001, dz = 1.79, 95% cis [.98, 2.59], bayes factor > 100 (see table a1). the standardized graphs did not lead to significantly different slopes than the minimal graphs when calculated across the entire range, t(15) = 0.18, p = .86, dz = .05, 95% cis [-.45, .53], bayes factor = .26. however, this is because the minimal graphs produced superior performance with respect to determining whether there was an effect or not but inferior performance when an effect was present and the magnitude had to be determined. for regressions comparing d = 0 to d = .3, the slopes for the minimal graphs were higher than for the standardized graphs, t(15) = -4.70, p < .001, dz = 1.17, 95% cis [.52, 1.81], bayes factor > 100. again, recall that the bias generated by the minimal graphs to see effects as bigger would produce greater sensitivity scores even if participants were not necessarily more sensitive to the effect. indeed, the slope coefficient is 1.29, which is greater than perfect accuracy of 1, which implies some bias. for regressions comparing ds > 0, the slopes for the standardized graphs were higher than for the minimal graphs, t(15) = 3.05, p = .008, dz = .76, 95% cis [.19, 1.31], bayes factor = 6.46. this suggests that the standardized graphs still produced better outcomes than the full or minimal graphs.
figure a6. mean response is plotted as a function of depicted effect size and graph type for experiment 4. error bars are 1 sem calculated within-subjects. solid lines represent linear regressions for depicted effects d ≥ .3. dashed lines represent linear regressions for depicted effects d ≤ .3.
table a1. mean (and sd) coefficients for the slopes for each graph type for each analysis from experiment 4.
graph type      all data    ds = .3-.8   ds = 0-.3
full            .30 (.08)   .15 (.08)    .61 (.22)
standardized    .61 (.15)   .52 (.18)    .86 (.32)
minimal         .61 (.06)   .31 (.25)    1.29 (.56)
note. the slopes indicate the linear relationship between the size of the effect depicted and the estimate of the effect size, both of which were coded on a scale from 1-4.
regarding bias, similar results were found as in previous experiments. the bias was -26% (sd = 11%) with the full graphs, indicating a bias to underestimate the effects, t(15) = -9.52, p < .001, dz = 2.38, 95% cis [1.39, 3.34], bayes factor > 100. the bias was 19% (sd = 17%) with the minimal graphs, indicating a bias to overestimate the size of the effects, t(15) = 4.36, p < .001, dz = 1.09, 95% cis [.46, 1.70], bayes factor = 64. with the standardized graphs, the bias was 2% (sd = 10%), which was not significantly different from 0, t(14) = 0.73, p = .48, dz = .18, 95% cis [-.31, .67], bayes factor = .32. with the line graphs, as with the bar graphs, the standardized axis range produced better sensitivity and less bias than the full axis range or the minimal axis range.
experiment 5: line graphs with axis range of 1 sd
the current experiment replicated experiment 4 using a smaller axis range for the standardized graphs.
method
fifteen students in an introductory psychology course participated in exchange for course credit. the stimuli were the same as in experiment 4 except that for the standardized graphs, the range was the group mean ± 0.5 sd.
results and discussion
the data were analyzed as before with three separate linear regressions for each participant for each graph type for each combination of all effect sizes, d = 0 and .3 only, and d = .3-.8 only. one participant had a slope that was less than 1.5 times the interquartile range for the minimal graphs, and one had a slope greater than 1.5 times the interquartile range for the full graphs. both were excluded.
the mean slope coefficients for the remaining participants are shown in table a2 and the data are shown in figure a7.
table a2. mean (and sd) coefficients for the slopes for each graph type for each analysis from experiment 5.
graph type      all data    ds = .3-.8   ds = 0-.3
full            .32 (.12)   .20 (.13)    .58 (.22)
standardized    .55 (.20)   .48 (.17)    .76 (.49)
minimal         .53 (.16)   .36 (.25)    .96 (.52)
figure a7. mean response is plotted as a function of depicted effect size and graph type for experiment 5. error bars are 1 sem calculated within-subjects. solid lines represent linear regressions for depicted effects d ≥ .3. dashed lines represent linear regressions for depicted effects d ≤ .3.
the patterns match those found in experiment 4. participants were more sensitive to the size of the effect for the standardized graphs than for the full graphs when all trials were included, t(13) = 4.41, p < .001, dz = 1.22, 95% cis [.48, 1.94], bayes factor = 46, and when only trials for which the depicted effect size was greater than 0 were included, t(13) = 6.69, p < .001, dz = 1.86, 95% cis [.93, 2.76], bayes factor > 100, but not when only trials for which the depicted effect size was null or small were included, t(13) = 1.41, p = .19, dz = .39, 95% cis [-.18, .95], bayes factor = .62. participants were more sensitive to the size of the effect for the standardized graphs than for the minimal graphs but only when the depicted effect in the graph was greater than 0, t(13) = 2.59, p = .023, dz = .72, 95% cis [.09, 1.32], bayes factor = 2.88. there was no difference in sensitivity across all effect sizes, p = .65, bayes factor = .31, and the minimal graphs produced better sensitivity when only data from graphs depicting a null or small effect were included, t(13) = -3.11, p = .009, dz = .86, 95% cis [.21, 1.49], bayes factor = 6.27. as before, the bias created by the minimal graphs could account for this apparent increase in sensitivity.
graph construction: an empirical investigation on setting the range of the y-axis

regarding the bias, the full graphs produced a bias of -15% (sd = 17%), indicating a bias to underestimate effect size, t(12) = -3.07, p = .010, dz = .85, 95% cis [.20, 1.48], bayes factor = 5.90. the minimal graphs produced a bias of 12% (sd = 20%), which was marginally above 0, t(12) = 2.21, p = .047, dz = .61, 95% cis [.01, 1.20], bayes factor = 1.69. the standardized graphs led to a small bias of 6% (sd = 14%) that was not significantly different from 0, t(12) = 1.63, p = .13, dz = .45, 95% cis [-.13, 1.01], bayes factor = .80.

across-experiment comparisons

sample size was not selected to achieve sufficient power to do analyses across experiments. to facilitate preliminary exploration of the data, the coefficients are reported in tables a3, a4, and a5, and are plotted in figure a8 and figures 2 and 3 in the main text. it may be interesting to note that sensitivity to the size of the effect was not notably better with error bars than without error bars, even though error bars are necessary to understand effect size. although this may not be surprising given that the participants were introductory psychology students, the pattern is consistent with previous findings that many researchers do not know how to interpret error bars (belia, fidler, williams, & cumming, 2005). in addition, the lack of noticeable differences in sensitivity between the experiments suggests that the use of a y-axis range that is approximately 1.5 sds could help better report the results in cases for which researchers neglect to include error bars.

table a3. mean slopes (and standard deviations) from regressions on all trials for each of the 5 experiments.

graph type     exp 1      exp 2      exp 3      exp 4      exp 5
full           .28 (.10)  .31 (.08)  .21 (.08)  .30 (.08)  .32 (.13)
standardized   .46 (.13)  .54 (.17)  .58 (.17)  .61 (.15)  .55 (.20)
minimal        .27 (.03)  .30 (.06)  .49 (.15)  .61 (.06)  .53 (.16)

note.
a slope of 1 indicates perfect performance and a slope of 0 indicates chance performance.

table a4. mean slopes (and standard deviations) from regressions on trials for which cohen's d > 0.1 for each of the 5 experiments.

graph type     exp 1      exp 2      exp 3      exp 4      exp 5
full           .17 (.10)  .18 (.10)  .09 (.06)  .15 (.08)  .20 (.13)
standardized   .42 (.16)  .46 (.16)  .46 (.11)  .52 (.18)  .48 (.17)
minimal        .07 (.09)  .13 (.14)  .25 (.13)  .31 (.25)  .36 (.25)

note. a slope of 1 indicates perfect performance and a slope of 0 indicates chance performance.

table a5. mean bias scores as a percentage (and standard deviations) for each of the 5 experiments.

graph type     exp 1     exp 2     exp 3     exp 4     exp 5
full           -27 (5)   -28 (9)   -25 (10)  -26 (11)  -15 (17)
standardized   1 (4)     6 (9)     14 (13)   2 (10)    6 (14)
minimal        36 (21)   31 (21)   23 (16)   19 (17)   12 (20)

note. bias scores were calculated as a percent bias based on intercepts from regressions on all trials, including those for which cohen's d = 0.

meta-psychology, 2020, vol 4, mp.2020.2560
https://doi.org/10.15626/mp.2020.2560
article type: tutorial
published under the cc-by4.0 license
open data: not applicable
open materials: not applicable
open and reproducible analysis: not applicable
open reviews and editorial process: yes
preregistration: not applicable
edited by: erin m. buchanan
reviewed by: o. van den akker, r.-m. rahal
analysis reproduced by: not applicable
all supplementary files can be accessed at osf: https://osf.io/2qujn

conducting high impact research with limited financial resources (while working from home)

paul h. p. hanel
university of essex, university of bath

abstract

the covid-19 pandemic has far-reaching implications for researchers. for example, many researchers cannot access their labs anymore and are hit by budget cuts from their institutions. luckily, there is a range of ways in which high-quality research can be conducted without funding and face-to-face interactions.
in the present paper, i discuss nine such possibilities, including meta-analyses, secondary data analyses, web-scraping, scientometrics, and sharing one's expert knowledge (e.g., writing tutorials). most of these possibilities can be pursued from home, as they require only access to a computer, the internet, and time, but no state-of-the-art equipment or funding to pay participants. thus, they are particularly relevant for researchers with limited financial resources, beyond pandemics and quarantines.

keywords: resources; meta-analysis; secondary data-analysis; covid-19

lower student numbers and a general economic recession caused by global quarantine measures to control the covid-19 pandemic are putting a lot of pressure on universities and researchers (adams, 2020). for example, lab access is suspended, research budgets are cut, and recruitment of diverse samples, even online, might be more difficult (lourenco & tasimi, 2020). the lack of funding can hamper the quantity and quality of research output and cause numerous issues. indeed, early career researchers identified having few resources as a major reason they struggle with publishing and therefore with advancing their careers (e.g., lennon, 2019; urbanska, 2019). furthermore, a lack of resources and funding can have a detrimental effect on the mental health of phd students (levecque et al., 2017) and academic staff (gillespie et al., 2001). however, while having substantial resources arguably facilitates primary research (i.e., researchers collecting their own data), it is possible to conduct high-impact and high-quality research with little or no funding, even while working remotely from home. in this paper, i provide nine examples of how high-impact research in the biomedical and social sciences can be conducted with limited material resources. that is, research which is published in prestigious journals (e.g., journals that are among the top 25% in a given field according to scopus).
the list of examples presented is neither meant to be exhaustive nor representative. nevertheless, i hope that the examples provided can inspire researchers to think of new research questions or methods and allow them to take some pressure off themselves. i discuss how people can conduct high-impact research using information provided within published work, with data collected by others (secondary data analysis), with researchers' own expertise and interests (e.g., tutorials), as well as with simulation studies. table 1 provides an overview of the nine approaches, which are discussed in detail below.

table 1. how to conduct high impact research with limited resources: an overview

- meta-analysis: a quantitative review of the literature. example papers: cuijpers et al. (2013); webb et al. (2012). introductory texts: borenstein et al. (2009); cheung and vijayakumar (2016); moher et al. (2009).
- scientometrics: analysis of scientific publications. example papers: fanelli (2010a); leimu and koricheva (2005). introductory texts: leydesdorff and milojević (2013).
- network and cluster analysis: analysing the relations of objects (e.g., researchers, journals) with each other. example papers: cipresso et al. (2018); wang and bowers (2016). introductory texts: costantini et al. (2015).
- data collected by organisations: typically large datasets that are openly accessible on the internet. example papers: hanel and vione (2016); ondish and stern (2017). introductory texts: cheng and phillips (2014); rosinger and ice (2019).
- re-using data: using data collected by researchers; typically the main findings are already published. example papers: coelho et al. (2020).
- web-scraping: extracting or harvesting data from the internet (e.g., social media). example papers: guess et al. (2019); preis et al. (2013). introductory texts: michel et al. (2011); paxton and griffiths (2017).
- tutorials: sharing one's expert knowledge. example papers: clifton and webster (2017); weissgerber et al. (2015).
- theoretical papers: developing new theories. example papers: ajzen (1991); festinger (1957). introductory texts: van lange (2013); smaldino (2020).
- simulation studies: computer experiments, creating data. example papers: may and hittner (1997); schmidt-catran and fairbrother (2016). introductory texts: beaujean (2018); feinberg and rubright (2016); morris et al. (2019).

of course, whether it will be easy or difficult to acquire the necessary skills to write a paper within any of the nine approaches discussed below depends on a range of factors, such as previous experience, the complexity of the research question, and the availability of data to answer a specific research question. that is, it can be easier to publish a paper using, for example, secondary data because no data collection is required, but if a researcher is unfamiliar with specific statistical analyses such as multi-level modeling and the relevant literature, it might take longer than collecting primary data and writing up a paper.

information provided within articles

meta-analyses

a meta-analysis is a quantitative review of the literature on a specific topic. the main aims are to estimate the strength of an effect across studies, to test for moderators and publication bias, and to identify gaps in the literature (borenstein et al., 2009; simonsohn et al., 2015). for example, researchers might be interested in testing which emotion regulation strategy works best (webb et al., 2012) or whether psychotherapy is better than pharmacotherapy in treating depressive and anxiety disorders (cuijpers et al., 2013). to perform a meta-analysis, researchers tend to start with a systematic literature review1, identify relevant articles and ideally unpublished studies, extract descriptive statistics (e.g., sample sizes, effect sizes) and information on relevant moderators (e.g., country of origin, sample type), and finally meta-analyse across samples (cheung & vijayakumar, 2016). thus, researchers need only a computer and access to the internet to perform a meta-analysis2.
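as an illustration of the core computation, the pooled estimate in a fixed-effect meta-analysis is a weighted average of study effect sizes, with each study weighted by the inverse of its sampling variance. the sketch below uses only the python standard library and invented effect sizes (not from any real review); in practice, dedicated software such as the r-package metafor (viechtbauer, 2010) handles this and much more (random-effects models, moderator tests, bias diagnostics).

```python
# hypothetical effect sizes (cohen's d) and their sampling variances
# from five studies; illustrative numbers only, not from a real review.
effects = [0.30, 0.15, 0.45, 0.25, 0.35]
variances = [0.02, 0.05, 0.04, 0.03, 0.06]

# fixed-effect (inverse-variance) pooling: each study is weighted by
# the inverse of its sampling variance.
weights = [1.0 / v for v in variances]
pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
se_pooled = (1.0 / sum(weights)) ** 0.5

print(f"pooled d = {pooled:.3f}, "
      f"95% ci [{pooled - 1.96 * se_pooled:.3f}, "
      f"{pooled + 1.96 * se_pooled:.3f}]")
```

precise studies (small variances) dominate the pooled estimate, which is why extracting sample sizes and effect sizes accurately from each article matters so much.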
nevertheless, a meta-analysis is hard work, and a range of pitfalls, such as an unsystematic literature review, must be avoided. luckily, guidelines exist which help to overcome pitfalls (e.g., the prisma guidelines; moher et al., 2009) as well as to reduce publication bias (stanley & doucouliagos, 2014), and powerful software can facilitate the statistical analysis and visualisations (e.g., the r-package metafor; viechtbauer, 2010). also, pre-registration of meta-analyses is possible (quintana, 2015; stewart et al., 2012). meta-analyses are useful for many disciplines because they provide a robust effect size estimate for a specific research question. also, meta-analyses typically attract more citations than empirical studies (patsopoulos et al., 2005). meta-analyses that identify moderators or develop new taxonomies based on the literature can be especially influential (webb et al., 2012).

1 a systematic review alone, without a quantitative synthesis, can be useful as well. for example, when there are only a few or too diverse papers published in a specific topical area, a qualitative summary alone can be informative.

2 many research projects in general, and meta-analyses in particular, can benefit from collaborations. for example, any coding of studies is ideally done by at least two researchers. finding reliable collaborators can be an issue for people with a smaller research network, especially in times when labs are closed, conferences are cancelled, and working from home is encouraged. there are many ways in which potential collaborators can be identified (sparks, 2019). one is to first identify researchers who have already published relevant articles, or graduate students who are listed on the lab pages of more senior researchers, and start following them on social media to get an impression of their views and beliefs on various issues. then reach out to them via email to gauge their general interest and, in case of a positive reply, schedule a video chat. if this is going well, it might be useful to discuss early on who contributes what, and authorships. who does what? who gets to be first author? it is worth keeping in mind that shared (first) authorships are possible.

if meta-analyses already exist in a given subfield, researchers can consider performing a second-order meta-analysis: a meta-analysis across meta-analyses, to get even more robust effect size estimates (hyde, 2005) or to test for moderators such as cultural factors (fischer et al., 2019). additionally, meta-analyses come with secondary benefits for meta-analysts themselves. everyone who has performed a meta-analysis knows that identifying the relevant information, such as descriptive statistics or effect sizes, in empirical articles can easily get frustrating because authors often do not report sufficient information. this can mean that otherwise perfectly suitable studies cannot be included in a meta-analysis. thus, every phd student in the biomedical and social sciences working on a quantitative research question might want to consider performing a meta-analysis at the beginning of their program, to learn first-hand the importance of reporting detailed results and, ideally, also of sharing the (anonymised) data openly. one objection against the claim that every researcher with a computer and internet access can perform a meta-analysis might be that particularly less affluent institutions cannot pay the high subscription fees for many scientific journals. however, as the number of pre-prints and open access journals is increasing, paywalls are becoming less of an issue. further, researchers from less affluent institutions can collaborate with colleagues from institutions with access to the required journals. finally, while legally questionable, researchers have found a way to bypass the paywall of most scientific publishers (bohannon, 2016).

scientometrics

scientometrics is an interdisciplinary scientific field that analyses scientific publication trends using various statistical methods. there are countless ways in which publications can be analysed; i will discuss a few of them in this and the next section. for example, one line of publications investigates how often so-called statistically significant findings occur: are 'positive' results increasing "down the hierarchy of the sciences" (fanelli, 2010a), does publication pressure increase scientists' bias (fanelli, 2010b), or are p-values just below .05 occurring more frequently than one would expect assuming no publication bias (simonsohn et al., 2015)? a prominent example of scientometrics is citation analysis. for example, what predicts whether a scientific article gets cited? is it whether it is published open access (mckiernan et al., 2016) or whether sample sizes are large (hanel & haase, 2017)? all relevant information to address these questions can be extracted from the articles of a specific scientific (sub-)field and sometimes even from meta-analyses (hanel & haase, 2017). typically, questions such as these are investigated separately in each subfield, such as internal medicine (van der veer et al., 2015). similar research questions can be tested with citations aggregated at the journal level. the number of citations to articles published in the last 2, 3, or 5 years in a specific journal is averaged and used as a quality indicator of that journal (i.e., the so-called journal impact factor or, more recently, the citescore) (teixeira da silva & memon, 2017). however, it is an empirical question in its own right whether these quality indicators are associated with other quality indicators of empirical studies (brembs, 2018), and whether there are unintended consequences of ranking journals based on alleged quality (brembs et al., 2013). research questions such as these, and others like them, can again be tested with limited resources, as they often only require the coding of published articles (e.g., on some quality indicators). furthermore, some journals ask reviewers to assess the quality of a manuscript quantitatively when providing their review. if one has access to how reviewers evaluate manuscripts, it is possible to assess whether reviewers agree on the quality of the manuscript (bornmann et al., 2010), or whether reviewers can predict how well a paper or researcher will be cited in the following years (reinhart, 2009).

network and cluster analyses of the published literature

yet another way to perform research at low cost is to perform network and cluster analyses. a network "is an abstract representation of a system of entities or variables (i.e., nodes) that have some form of connection with each other (i.e., edges)" (dalege et al., 2017, p. 528). nodes can represent a variety of things, including people, journals, or keywords. in short, network analyses typically reveal how strongly objects are associated. for example, by combing through keywords, journal names, citation counts, or the country of origin of authors from hundreds or thousands of articles, it is possible to identify emerging themes and track a discipline's evolution. this can show which keywords are more frequently used together, which journals cite each other (journal citation network analysis), or researchers from which countries collaborate more frequently (cipresso et al., 2018). in addition, these analyses allow researchers to identify potential gaps in the literature (e.g., if two or more keywords are not linked in a keyword network analysis, this might indicate a potential gap in the literature).
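the keyword co-occurrence idea can be sketched in a few lines of python: treat every pair of keywords appearing on the same article as an edge, and count in how many articles each pair co-occurs. the keyword lists below are invented for illustration; real analyses would draw them from bibliographic databases and typically use dedicated network software.

```python
from collections import Counter
from itertools import combinations

# hypothetical keyword lists from six articles; illustrative only.
articles = [
    ["meta-analysis", "publication bias", "effect size"],
    ["publication bias", "p-curve", "effect size"],
    ["network analysis", "attitudes"],
    ["meta-analysis", "effect size"],
    ["network analysis", "personality", "attitudes"],
    ["p-curve", "publication bias"],
]

# each unordered keyword pair within an article is an edge; the number
# of articles in which a pair co-occurs is the edge weight.
edges = Counter()
for keywords in articles:
    for pair in combinations(sorted(set(keywords)), 2):
        edges[pair] += 1

# heavy edges reveal dominant keyword pairings; pairs that never
# co-occur may point to gaps in the literature.
for pair, weight in edges.most_common(3):
    print(pair, weight)
```

the same counting logic scales to thousands of articles; visualisation and clustering would then be handed to a network package.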
finally, moving beyond network analysis, extracting the full text of scientific articles can be used to analyse their readability (plavén-sigray et al., 2017) or to estimate the accuracy of the reported statistical information (nuijten et al., 2015), for instance.

secondary data analysis

data made available by (research) organisations

over the past decades, the number of large, openly available surveys relevant to the social sciences and to researchers interested in mental health has grown rapidly. several of them are conducted with nationally representative samples in just one country (e.g., the british election study, the american national election studies), while others contain data from up to 70 countries (e.g., the european social survey, the world values survey). there is also a range of open datasets that might be of interest to biomedical researchers and neuroscientists, such as the human connectome project, which includes anatomical and diffusion neuroimaging data; the star*d project, which includes data on the antidepressant treatment of patients diagnosed with major depressive disorder; or the uk biobank, which contains health information from 500,000 volunteer participants. many of these surveys are conducted every few years. since the surveys are openly and freely available to researchers and contain many variables relevant to social scientists, they can be used to answer a range of research questions. research questions addressed by past research include whether student samples provide a good estimate of the general public (hanel & vione, 2016), whether social trust and self-rated health are positively correlated (jen et al., 2010), and whether scales are invariant across groups of people (cieciuch et al., 2017). additionally, it is possible to combine data from large surveys with other data. for example, nosek et al.
(2009) correlated implicit gender-science stereotypes from project implicit with gender differences in science and math achievement from the trends in international mathematics and science study (gonzales et al., 2003). basabe and valencia (2007) correlated the country averages of hofstede's (2001) cultural dimensions, inglehart's values (inglehart & baker, 2000) as measured by the world values survey, and schwartz's (2006) cultural value dimensions with indices of human development provided in the united nations report (e.g., 2014) and de rivera's (2004) culture of peace dimensions. such analyses allow researchers to identify, for example, what predicts whether a country is more likely to engage in wars or suppress its own population. as all the previously mentioned datasets are openly available, it is relatively easy to reproduce all analyses and to come up with new research questions that can be answered with these datasets. further, it is possible to pre-register secondary data analyses (van den akker et al., 2019). the complexity of the statistical analysis depends on the research question and the data. for example, testing hypotheses with large (n > 40,000) datasets containing data from various countries typically requires multilevel modeling, because participants are nested within countries (for an example paper, see rudnev & vauclair, 2018). in contrast, when two or more datasets have been combined and, for example, only country-level data are available, researchers typically rely on correlation and regression analyses (e.g., basabe & valencia, 2007; inman et al., 2017). recommendations for performing secondary data analyses exist, for example, for social studies (fitchett & heafner, 2017), the medical sciences (cheng & phillips, 2014), human biology (rosinger & ice, 2019), and qualitative research (sherif, 2018).
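when only country-level aggregates are available, the analysis often reduces to a simple correlation. a minimal sketch in python, with invented country-level values for social trust and self-rated health (the numbers and variable names are illustrative, not taken from any of the surveys mentioned above):

```python
import math

# made-up country-level aggregates (one value per country);
# purely illustrative, not real survey data.
trust = [5.1, 6.3, 4.2, 7.0, 5.8, 3.9]
health = [6.0, 6.8, 5.1, 7.4, 6.2, 4.8]

def pearson_r(x, y):
    """pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"r = {pearson_r(trust, health):.2f}")
```

with only a handful of countries such a correlation is fragile, which is one reason the individual-level, multilevel approach is preferred whenever the raw data are available.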
http://www.humanconnectomeproject.org/
https://www.nimh.nih.gov/funding/clinical-research/practical/stard/allmedicationlevels.shtml
https://implicit.harvard.edu/

reusing data

this point is similar to the one above, except that it focuses solely on reusing data collected by either researchers' own lab group or shared by other researchers. typically, the data were collected to answer some pre-defined research question, but not for the additional analyses someone thought about only after data collection. further, if one has access to several similar datasets that also included some demographic information, which may have been reported but was not the focus of the main paper(s), the datasets can be combined and reanalysed to test for differences and similarities between the demographic groups on several of the primary variables (assuming this analysis has not been reported in the primary papers). in a similar vein, if several primary studies included a scale with, as a rule of thumb, more than eight items per dimension, it is worth testing whether a subset of the items of each dimension is as reliable and valid as the original scale (coelho et al., 2020). both types of research questions (comparisons across demographic groups and scale validation), along with several other ones, can also be addressed entirely with datasets openly shared by researchers. google has created a search engine for open datasets (https://datasetsearch.research.google.com/; see also https://dataverse.harvard.edu/) which can be used directly or combined with other datasets. to the best of my knowledge, the number of articles based on re-using data collected by other researchers is still very limited. however, since more and more researchers are sharing their data and search engines allow researchers to identify potentially relevant datasets, the number of papers based on other researchers' data is likely to increase.
an additional way to reuse data is to verify the results of an already published article with the data collected by the original authors. the necessity of this is illustrated by an attempt to replicate 59 macroeconomic papers using the original data (chang & li, 2017): only 29 papers could be replicated, even with the help of the original authors. such an initiative would be very useful in other scientific fields too. replicating the results of single papers has also been encouraged. for example, the journal cortex has recently announced a new article type, "verification reports", which report independent replications of the findings of a published article by repeating the original analyses. this is to "provide scientists with professional credit for evaluating one of the most fundamental forms of credibility: whether the claims in previous studies are justified by their own data" (chambers, 2020, p. a1).

web-scraping

when people use social media or a search engine, they produce data. some of the traces people leave online can be relatively easily scraped (i.e., extracted or harvested) and allow us to answer research questions we would not be able to answer with traditional approaches (cf. paxton & griffiths, 2017). webpages from which data can be obtained relatively easily include twitter and reddit, as well as google ngram viewer and google trends. for example, researchers have used twitter to test whether survey responses about social media use are accurate (guess et al., 2019) and to identify predictors of expressions of solidarity with refugees (smith et al., 2018). further, google trends, which reports how often people searched google for specific terms in one or all countries on a specific date, has been used to test whether online health-seeking behaviour predicts influenza-like symptoms (ginsberg et al., 2009) and whether google searches predict stock market moves (preis et al., 2013).
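the extraction step of web-scraping can be illustrated with python's built-in html parser. the html snippet and the post-title class below are made up; a real scraper would fetch pages over the network (respecting the site's terms of service, api limits, and robots.txt) and adapt the tag and attribute names to the page at hand.

```python
from html.parser import HTMLParser

# a made-up stand-in for a fetched page; in practice the html would
# come from urllib/requests or a platform api.
PAGE = """
<html><body>
  <h2 class="post-title">Solidarity with refugees</h2>
  <p>...</p>
  <h2 class="post-title">Does social media use predict survey answers?</h2>
</body></html>
"""

class TitleScraper(HTMLParser):
    """collects the text of every <h2 class="post-title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "post-title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

scraper = TitleScraper()
scraper.feed(PAGE)
print(scraper.titles)
```

once the extracted fields are in a list like this, they can be counted, coded, or merged with other data sources just like any other dataset.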
other outputs

tutorials

to conduct high-quality primary research, researchers often need to acquire specific skills. examples of expert knowledge and skills that researchers may have include recruiting participants from hard-to-reach populations, setting up testing equipment, which often includes programming skills (e.g., in cognitive psychology or neuroscience), and analysing the data. without a good mentor, helpful peers, or informative tutorials, acquiring such skills can be cumbersome. once one has acquired some specialist expert knowledge, sharing it by writing blog posts or peer-reviewed articles (e.g., tutorials) can therefore be very useful. for example, what recruitment methods work well to get couples to participate in unpaid online or lab studies (e.g., distributing flyers in places where people are waiting anyway, such as train stations, schools, or on campus, or targeting specific groups on social media)? what are best practices for writing reproducible code? how should data of a specific format be analysed? writing a step-by-step tutorial (assuming one does not yet exist), ideally with some concrete examples, may be cited often and help to establish a reputation as an expert. previous tutorials have focused on various statistical methods, such as response surface analysis (barranti et al., 2017), network analysis (dalege et al., 2017), multilevel meta-analyses (assink & wibbelink, 2016), or bayesian statistics (weaver & hamada, 2016); recommendations for data visualisation (weissgerber et al., 2016) or web-scraping (bradley & james, 2019); suggestions for open science practices (allen & mehler, 2019); or how to use databases (waagmeester et al., 2020). a more advanced type of tutorial concerns software packages, because they usually include computer code that assists others directly in performing a specific analysis, in addition to a (peer-reviewed) article (viechtbauer, 2010).
expert knowledge also allows researchers to write commentaries on various topics more easily. popular commentaries include topics such as the scientific publication system (lawrence, 2003) or cargo cult science (feynman, 1974).

theoretical papers

related to tutorials, scientists can integrate and advance research in theoretical papers. theories are important because they help us to see "the coherent structures in seemingly chaotic phenomena and make inroads into previously uncharted domains, thus affording progress in the way we understand the world around us" (van lange, 2013, p. 40). in contrast to tutorials, which typically focus on solving specific problems such as conducting a specific analysis, theoretical papers can both solve problems, by integrating apparently contradictory findings into one broader framework, and 'cause' problems, by making novel predictions, and are therefore crucial for new empirical discoveries (higgins, 2004). prominent examples include the theory of planned behavior (ajzen, 1985, 1991), which aims to explain planned human behaviour, and cognitive dissonance theory (festinger, 1957), which aims to explain how people deal with internal inconsistencies. however, developing a formalised and testable theory can be challenging. for example, van lange (2013) argues that good theories should contain "truth, abstraction, progress, and applicability as standards" (p. 40) and provides recommendations for how this can be done. smaldino (2020) discusses various options for how verbal theories can be translated into formal models.

simulation studies

another way to get data without needing to conduct a study is to simulate data from hundreds or often even thousands of studies using specialised statistical software, such as the freely available program r (feinberg & rubright, 2016). in a simulation study, data are generated that may or may not reflect real data.
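as a minimal sketch of such a simulation, the following script (python standard library only) generates many datasets under a true null hypothesis and estimates the type-i error rate of a two-sided z-test; the sample size, number of simulations, and alpha level are arbitrary choices for illustration, and the paper's recommended tools for this kind of work are r-based.

```python
import math
import random

random.seed(1)

# simulate many "experiments" in which the null hypothesis is true
# (normal data, true mean 0, known sd 1) and count how often a
# two-sided z-test on the sample mean rejects at alpha = .05.
n, n_sims, alpha = 30, 20_000, 0.05
rejections = 0
for _ in range(n_sims):
    sample = [random.gauss(0, 1) for _ in range(n)]
    # z = sample mean divided by its standard error (1 / sqrt(n))
    z = (sum(sample) / n) * math.sqrt(n)
    # two-sided p-value from the standard normal cdf
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    if p < alpha:
        rejections += 1

print(f"estimated type-i error rate: {rejections / n_sims:.3f}")
```

a well-calibrated test should reject in roughly 5% of these null simulations; a real simulation study would vary the sample size, the test, and the data-generating assumptions systematically rather than fixing them as above.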
thanks to advances in processing capacity, many simulation studies can nowadays be done without access to a supercomputer. simulation studies have been used to answer a range of questions, such as which mediation test best balances type-i error and statistical power (mackinnon et al., 2002), and to illustrate the pitfalls of specifying fixed and random effects in multilevel models (schmidt-catran & fairbrother, 2016). the first step in a simulation study is typically to define the problem. for example, a researcher might be interested in exploring which, out of multiple tests that serve the same purpose, has the lowest type-i and type-ii error rates. other steps include making assumptions, simulating the data, evaluating the output, and finally disseminating the findings (for tutorials, see beaujean, 2018; feinberg & rubright, 2016). in short, simulation studies are an effective way to conduct cheap research, but they require advanced programming skills.

conclusion

in the present paper, i provide suggestions for how impactful research can be conducted with limited resources while working remotely. the above list is not meant to be exhaustive but will hopefully provide some examples that might inspire researchers to consider alternative ways to study the phenomena they find interesting. importantly, encouraging researchers to conduct more research using secondary data analysis does not disregard primary empirical research. however, it is sometimes not feasible for everyone to conduct well-powered empirical studies because of limited resources. thus, being aware of alternative ways to conduct research can help researchers in this situation to get to a point at which they can compete with researchers who have access to more resources (cf. lepori et al., 2019). ultimately, it might make science more egalitarian, because it also allows researchers from financially less well-situated institutions to publish in prestigious journals.
author contact

paul hanel, department of psychology, university of essex, colchester, united kingdom. p.hanel@essex.ac.uk

acknowledgements

i wish to thank martha fitch little and wijnand van tilburg for useful comments on an earlier version of this paper.

conflict of interest and funding

the author has no conflict of interest to declare. there was no specific funding for this project.

author contributions

this is a single author contribution.

open science practices

this theoretical article contains no data, materials or analysis. the entire editorial process, including the open reviews, is published in the online supplement.

references

adams, r. (2020, april 22). coronavirus uk: universities face £2.5bn tuition fee loss next year. the guardian. https://www.theguardian.com/education/2020/apr/23/coronavirus-uk-universities-face-25bn-tuition-fee-loss-next-year

ajzen, i. (1985). from intentions to actions: a theory of planned behavior. in j. kuhl & j. beckmann (eds.), action control: from cognition to behavior (pp. 11–39). springer. https://doi.org/10.1007/978-3-642-69746-3_2

ajzen, i. (1991). the theory of planned behavior. organizational behavior and human decision processes, 50(2), 179–211. https://doi.org/10.1016/0749-5978(91)90020-t

allen, c., & mehler, d. m. a. (2019). open science challenges, benefits and tips in early career and beyond. plos biology, 17(5), e3000246. https://doi.org/10.1371/journal.pbio.3000246

assink, m., & wibbelink, c. j. m. (2016). fitting three-level meta-analytic models in r: a step-by-step tutorial. the quantitative methods for psychology, 12, 154–174. https://doi.org/10.20982/tqmp.12.3.p154

barranti, m., carlson, e. n., & côté, s. (2017). how to test questions about similarity in personality and social psychology research: description and empirical demonstration of response surface analysis. social psychological and personality science, 8(4), 465–475.
https://doi.org/10.1177/1948550617698204 basabe, n., & valencia, j. (2007). culture of peace: sociostructural dimensions, cultural values, and emotional climate. journal of social issues, 63(2), 405–419. https://doi.org/10.1111/j.15404560.2007.00516.x beaujean, a. a. (2018). simulating data for clinical research: a tutorial. journal of psychoeducational assessment, 36(1), 7–20. https://doi.org/10.1177/0734282917690302 bohannon, j. (2016). who’s downloading pirated papers? everyone. science, 352(6285), 508–512. https://doi.org/10.1126/science.352.6285.508 borenstein, m., hedges, l. v., higgins, j. p. t., & rothstein, h. (2009). introduction to meta-analysis. john wiley & sons. bornmann, l., mutz, r., & daniel, h.-d. (2010). a reliability-generalization study of journal peer reviews: a multilevel metaanalysis of inter-rater reliability and its determinants. plos one, 5(12), e14331. https://doi.org/10.1371/journal.pone.0014331 bradley, a., & james, r. j. e. (2019). web scraping using r. advances in methods and practices in psychological science. https://doi.org/10.1177/2515245919859535 brembs, b. (2018). prestigious science journals struggle to reach even average reliability. frontiers in human neuroscience, 12. https://doi.org/10.3389/fnhum.2018.00037 brembs, b., button, k., & munafò, m. (2013). deep impact: unintended consequences of journal rank. frontiers in human neuroscience, 7, 291. https://doi.org/10.3389/fnhum.2013.00291 chambers, c. d. (2020). verification reports: a new article type at cortex. cortex, 129, a1–a3. https://doi.org/10.1016/j.cortex.2020.04.020 chang, a. c., & li, p. (2017). a preanalysis plan to replicate sixty economics research papers that worked half of the time. american economic review, 107(5), 60–64. https://doi.org/10.1257/aer.p20171034 cheng, h. g., & phillips, m. r. (2014). secondary analysis of existing data: opportunities and implementation. shanghai archives of psychiatry, 26(6), 371–375. 
https://doi.org/10.11919/j.issn.10020829.214171 cheung, m. w.-l., & vijayakumar, r. (2016). a guide to conducting a meta-analysis. neuropsychology review, 26(2), 121–128. https://doi.org/10.1007/s11065-016-9319-z cieciuch, j., davidov, e., algesheimer, r., & schmidt, p. (2017). testing for approximate measurement invariance of human values in the european social survey. sociological methods & research, 47(4), 665-686 . https://doi.org/10.1177/0049124117701478 cipresso, p., giglioli, i. a. c., raya, m. a., & riva, g. (2018). the past, present, and future of virtual and augmented reality research: a network and cluster analysis of the literature. frontiers in psychology, 9. https://doi.org/10.3389/fpsyg.2018.02086 clifton, a., & webster, g. d. (2017). an introduction to social network analysis for personality and social psychologists. social psychological and personality science, 8(4), 442–453. https://doi.org/10.1177/1948550617709114 coelho, g. l. de h., hanel, p. h. p., & wolf, l. j. (2018). the very efficient assessment of need for cognition: developing a 6-item version. assessment. https://doi.org/10.1177/1073191118793208 costantini, g., epskamp, s., borsboom, d., perugini, m., mõttus, r., waldorp, l. j., & cramer, a. o. j. (2015). state of the art personality research: a tutorial on network analysis of personality data in r. journal of research in personality, 54, 13–29. https://doi.org/10.1016/j.jrp.2014.07.003 cuijpers, p., sijbrandij, m., koole, s. l., andersson, g., beekman, a. t., & reynolds, c. f. (2013). the efficacy of psychotherapy and pharmacotherapy in treating depressive and anxi8 ety disorders: a meta-analysis of direct comparisons. world psychiatry, 12(2), 137–148. https://doi.org/10.1002/wps.20038 dalege, j., borsboom, d., van harreveld, f., & van der maas, h. l. j. (2017). network analysis on attitudes: a brief tutorial. social psychological and personality science, 8(5), 528–537. https://doi.org/10.1177/1948550617709827 de rivera, j. (2004). 
assessing the basis for a culture of peace in contemporary societies. journal of peace research, 41(5), 531–548. https://doi.org/10.1177/0022343304045974 fanelli, d. (2010a). “positive” results increase down the hierarchy of the sciences. plos one, 5(4), e10068. https://doi.org/10.1371/journal.pone.0010068 fanelli, d. (2010b). do pressures to publish increase scientists’ bias? an empirical support from us states data. plos one, 5(4), e10271. https://doi.org/10.1371/journal.pone.0010271 feinberg, r. a., & rubright, j. d. (2016). conducting simulation studies in psychometrics. educational measurement: issues and practice, 35(2), 36–49. https://doi.org/10.1111/emip.12111 festinger, l. (1957). a theory of cognitive dissonance. stanford university press. feynman, r. p. (1974). cargo cult science. engineering and science, 37, 10–13. fischer, r., karl, j. a., & fischer, m. v. (2019). norms across cultures: a cross-cultural meta-analysis of norms effects in the theory of planned behavior. journal of crosscultural psychology, 50(10), 1112–1126. https://doi.org/10.1177/0022022119846409 fitchett, p. g., & heafner, t. l. (2017). quantitative research and large-scale secondary analysis in social studies. in handbook of social studies research (pp. 68–94). john wiley & sons, ltd. https://doi.org/10.1002/9781118768747.ch4 gillespie, n. a., walsh, m., winefield, a. h., dua, j., & stough, c. (2001). occupational stress in universities: staff perceptions of the causes, consequences and moderators of stress. work & stress, 15(1), 53–72. https://doi.org/10.1080/02678370117944 ginsberg, j., mohebbi, m. h., patel, r. s., brammer, l., smolinski, m. s., & brilliant, l. (2009). detecting influenza epidemics using search engine query data. nature, 457(7232), 1012–1014. https://doi.org/10.1038/nature07634 gonzales, p. (2003). highlights from the trends in international mathematics and science study (timss) 2003. guess, a., munger, k., nagler, j., & tucker, j. (2019). 
how accurate are survey responses on social media and politics? political communication, 36(2), 241–258. https://doi.org/10.1080/10584609.2018.150 4840 hanel, p. h. p., & haase, j. (2017). predictors of citation rate in psychology: inconclusive influence of effect and sample size. frontiers in psychology, 8. https://doi.org/10.3389/fpsyg.2017.01160 hanel, p. h. p., & vione, k. c. (2016). do student samples provide an accurate estimate of the general public? plos one, 11(12), e0168354. https://doi.org/10.1371/journal.pone.0168354 higgins, e. t. (2004). making a theory useful: lessons handed down. personality and social psychology review, 8(2), 138–145. https://doi.org/10.1207/s15327957pspr0802_7 hofstede, g. (2001). culture’s consequences: comparing values, behaviors, institutions and organizations across nations (2nd ed.). sage. hyde, j. s. (2005). the gender similarities hypothesis. american psychologist, 60(6), 581–592. https://doi.org/10.1037/0003-066x.60.6.581 inglehart, r. f., & baker, w. e. (2000). modernization, cultural change, and the persistence of traditional values. american sociological review, 65(1), 19–51. https://doi.org/10.2307/2657288 inman, r. a., silva, s. m. d., bayoumi, r., & hanel, p. h. p. (2017). cultural value orientations and alcohol consumption in 74 countries: a societal-level analysis. frontiers in psychology: cultural psychology, 8. https://doi.org/10.3389/fpsyg.2017.01963 jen, m. h., sund, e. r., johnston, r., & jones, k. (2010). trustful societies, trustful individuals, and health: an analysis of self-rated health and social trust using the world value survey. health & place, 16(5), 1022–1029. https://doi.org/10.1016/j.healthplace.2010.06 .008 lawrence, p. a. (2003). the politics of publication. nature, 422(6929), 259–261. https://doi.org/10.1038/422259a leimu, r., & koricheva, j. (2005). what determines the citation frequency of ecological papers? trends in ecology & evolution, 20(1), 28–32. 
https://doi.org/10.1016/j.tree.2004.10.010 lennon, j. c. (2019). navigating academia as a psyd student. nature human behaviour. https://socialsciences.nature.com/channels/21 40-is-it-publish-or-perish/posts/52824-competi ng-in-the-world-of-academia-as-a-psyd-student lepori, b., geuna, a., & mira, a. (2019). scientific output scales with resources. a 9 comparison of us and european universities. plos one, 14(10), e0223415. https://doi.org/10.1371/journal.pone.0223415 levecque, k., anseel, f., de beuckelaer, a., van der heyden, j., & gisle, l. (2017). work organization and mental health problems in phd students. research policy, 46(4), 868–879. https://doi.org/10.1016/j.respol.2017.02.008 leydesdorff, l., & milojević, s. (2013). scientometrics. arxiv:1208.4566 [cs]. http://arxiv.org/abs/1208.4566 lourenco, s. f., & tasimi, a. (2020). no participant left behind: conducting science during covid19. trends in cognitive sciences, 24(8), 583–584. https://doi.org/10.1016/j.tics.2020.05.003 mackinnon, d. p., lockwood, c. m., hoffman, j. m., west, s. g., & sheets, v. (2002). a comparison of methods to test mediation and other intervening variable effects. psychological methods, 7(1), 83. may, k., & hittner, j. b. (1997). tests for comparing dependent correlations revisited: a monte carlo study. the journal of experimental education, 65, 257–269. mckiernan, e. c., bourne, p. e., brown, c. t., buck, s., kenall, a., lin, j., mcdougall, d., nosek, b. a., ram, k., soderberg, c. k., spies, j. r., thaney, k., updegrove, a., woo, k. h., & yarkoni, t. (2016). how open science helps researchers succeed. elife, 5, e16800. https://doi.org/10.7554/elife.16800 michel, j.-b., shen, y. k., aiden, a. p., veres, a., gray, m. k., team, t. g. b., pickett, j. p., hoiberg, d., clancy, d., norvig, p., orwant, j., pinker, s., nowak, m. a., & aiden, e. l. (2011). quantitative analysis of culture using millions of digitized books. science, 331(6014), 176–182. 
https://doi.org/10.1126/science.1199644 moher, d., liberati, a., tetzlaff, j., & altman, d. g. (2009). preferred reporting items for systematic reviews and meta-analyses: the prisma statement. annals of internal medicine, 151(4), 264–269. https://doi.org/10.7326/0003-4819151-4-200908180-00135 morris, t. p., white, i. r., & crowther, m. j. (2019). using simulation studies to evaluate statistical methods. statistics in medicine, 38(11), 2074–2102. https://doi.org/10.1002/sim.8086 nosek, b. a., smyth, f. l., sriram, n., lindner, n. m., devos, t., ayala, a., bar-anan, y., bergh, r., cai, h., gonsalkorale, k., kesebir, s., maliszewski, n., neto, f., olli, e., park, j., schnabel, k., shiomura, k., tulbure, b. t., wiers, r. w., . . . greenwald, a. g. (2009). national differences in gender–science stereotypes predict national sex differences in science and math achievement. proceedings of the national academy of sciences, 106(26), 10593–10597. https://doi.org/10.1073/pnas.0809921106 nuijten, m. b., hartgerink, c. h. j., assen, m. a. l. m. van, epskamp, s., & wicherts, j. m. (2015). the prevalence of statistical reporting errors in psychology (1985–2013). behavior research methods, 1–22. https://doi.org/10.3758/s13428-015-06642 ondish, p., & stern, c. (2017). liberals possess more national consensus on political attitudes in the united states: an examination across 40 years. social psychological and personality science, 9(8), 935-943. https://doi.org/10.1177/1948550617729410 patsopoulos, n. a., analatos, a. a., & ioannidis, j. p. a. (2005). relative citation impact of various study designs in the health sciences. jama, 293(19), 2362–2366. https://doi.org/10.1001/jama.293.19.2362 paxton, a., & griffiths, t. l. (2017). finding the traces of behavioral and cognitive processes in big data and naturally occurring datasets. behavior research methods, 49(5), 1630–1638. https://doi.org/10.3758/s13428-017-0874-x plavén-sigray, p., matheson, g. j., schiffler, b. c., & thompson, w. h. 
(2017). research: the readability of scientific texts is decreasing over time. elife, 6, e27725. https://doi.org/10.7554/elife.27725 preis, t., moat, h. s., & stanley, h. e. (2013). quantifying trading behavior in financial markets using google trends. scientific reports, 3. https://doi.org/10.1038/srep01684 quintana, d. s. (2015). from pre-registration to publication: a non-technical primer for conducting a meta-analysis to synthesize correlational data. frontiers in psychology, 6. https://doi.org/10.3389/fpsyg.2015.01549 reinhart, m. (2009). peer review of grant applications in biology and medicine. reliability, fairness, and validity. scientometrics, 81(3), 789–809. https://doi.org/10.1007/s11192-008-2220-7 rosinger, a. y., & ice, g. (2019). secondary data analysis to answer questions in human biology. american journal of human biology, 31(3), e23232. https://doi.org/10.1002/ajhb.23232 rudnev, m., & vauclair, c.-m. (2018). the link between personal values and frequency of drinking depends on cultural values: a cross-level interaction approach. frontiers in psychology, 9. https://doi.org/10.3389/fpsyg.2018.01379 schmidt-catran, a. w., & fairbrother, m. (2016). 10 the random effects in multilevel models: getting them wrong and getting them right. european sociological review, 32(1), 23–38. https://doi.org/10.1093/esr/jcv090 schwartz, s. h. (2006). a theory of cultural value orientations: explication and applications. comparative sociology, 5(2), 137–182. https://doi.org/10.1163/1569133067786673 57 sherif, v. (2018). evaluating preexisting qualitative research data for secondary analysis. forum qualitative sozialforschung / forum: qualitative social research, 19(2). https://doi.org/10.17169/fqs19.2.2821 simonsohn, u., simmons, j. p., & nelson, l. d. (2015). better p-curves: making pcurve analysis more robust to errors, fraud, and ambitious p-hacking, a reply to ulrich and miller (2015). journal of experimental psychology: general, 144(6), 1146–1152. 
https://doi.org/10.1037/xge0000104 smaldino, p. e. (2020). how to translate a verbal theory into a formal model. https://files.osf.io/v1/resources/n7qsh/provid ers/osfstorage/5ecd62d2aeeb6d01d6087b01? format=pdf&action=download&direct&versi on=2 smith, l. g. e., mcgarty, c., & thomas, e. f. (2018). after aylan kurdi: how tweeting about death, threat, and harm predict increased expressions of solidarity with refugees over time. psychological science, 29(4), 623–634. https://doi.org/10.1177/0956797617741107 sparks, s. (2019). how to find international collaborators for your research. british council. https://www.britishcouncil.org/voicesmagazine/how-to-find-internationalcollaborators-for-your-research stanley, t. d., & doucouliagos, h. (2014). metaregression approximations to reduce publication selection bias. research synthesis methods, 5(1), 60–78. https://doi.org/10.1002/jrsm.1095 stewart, l., moher, d., & shekelle, p. (2012). why prospective registration of systematic reviews makes sense. systematic reviews, 1(1), 7. https://doi.org/10.1186/2046-4053-1-7 teixeira da silva, j. a., & memon, a. r. (2017). citescore: a cite for sore eyes, or a valuable, transparent metric? scientometrics, 111(1), 553–556. https://doi.org/10.1007/s11192-017-2250-0 united nations developmental programme. (2014). human developmental report: human development index (hdi). http://hdr.undp.org/en/data urbanska, k. (2019). oh no, i haven’t published: navigating the job market without a publication record. nature human behaviour. https://socialsciences.nature.com/users/30163 3-karolina-urbanska/posts/54645-oh-no-i-hav ent-published-navigating-the-job-market-wi thout-apublication-record van den akker, o., weston, s. j., campbell, l., chopik, w. j., damian, r. i., davis-kean, p., hall, a. n., kosie, j. e., kruse, e. t., olsen, j., ritchie, s. j., valentine, k. d., van ’t veer, a. e., & bakker, m. (2019). preregistration of secondary data analysis: a template and tutorial [preprint]. psyarxiv. 
https://doi.org/10.31234/osf.io/hvfmr van der veer, t., baars, j. e., birnie, e., & hamberg, p. (2015). citation analysis of the ‘big six’ journals in internal medicine. european journal of internal medicine, 26(6), 458–459. https://doi.org/10.1016/j.ejim.2015.05.017 van lange, p. a. m. (2013). what we should expect from theories in social psychology: truth, abstraction, progress, and applicability as standards (tapas). personality and social psychology review, 17(1), 40–55. https://doi.org/10.1177/1088868312453088 viechtbauer, w. (2010). conducting meta-analyses in r with the metafor package. journal of statistical software, 36(3), 1–48. waagmeester, a., stupp, g., burgstaller-muehlbacher, s., good, b. m., griffith, m., griffith, o. l., hanspers, k., hermjakob, h., hudson, t. s., hybiske, k., keating, s. m., manske, m., mayers, m., mietchen, d., mitraka, e., pico, a. r., putman, t., riutta, a., queralt-rosinach, n., . . . su, a. i. (2020). wikidata as a knowledge graph for the life sciences. elife, 9, e52614. https://doi.org/10.7554/elife.52614 wang, y., & bowers, a. j. (2016). mapping the field of educational administration research: a journal citation network analysis. journal of educational administration, 54(3). https://doi.org/10.1108/jea02-2015-0013 weaver, b. p., & hamada, m. s. (2016). quality quandaries: a gentle introduction to bayesian statistics. quality engineering, 28(4), 508–514. https://doi.org/10.1080/08982112.2016.1167 220 webb, t. l., miles, e., & sheeran, p. (2012). dealing with feeling: a meta-analysis of the effectiveness of strategies derived from the process model of emotion regulation. psychological bulletin, 138(4), 775–808. https://doi.org/10.1037/a0027600 weissgerber, t. l., garovic, v. d., savic, m., winham, s. j., & milic, n. m. (2016). from static to interac11 tive: transforming data visualization to improve transparency. plos biology, 14(6), e1002484. https://doi.org/10.1371/journal.pbio.1002484 weissgerber, t. l., milic, n. 
m., winham, s. j., & garovic, v. d. (2015). beyond bar and line graphs: time for a new data presentation paradigm. plos biology, 13(4), e1002128. https://doi.org/10.1371/journal.pbio.1002128 information provided within articles meta-analyses scientometrics network and cluster analyses of the published literature secondary data analysis data made available by (research) organisations reusing data web-scraping other outputs tutorials theoretical papers simulation studies conclusion author contact conflict of interest and funding author contributions open science practices references meta-psychology, 2022, vol 6, mp.2021.2803 https://doi.org/10.15626/mp.2021.2803 article type: original article published under the cc-by4.0 license open data: yes open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: yes edited by: thomas nordström reviewed by: matthew miller, františek bartoš analysis reproduced by: lucija batinović all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/39w6k meta-analytic findings of the self-controlled motor learning literature: underpowered, biased, and lacking evidential value brad mckay school of human kinetics, university of ottawa department of kinesiology, mcmaster university zachary d. yantha school of human kinetics, university of ottawa julia hussien school of human kinetics, university of ottawa michael j. carter department of kinesiology, mcmaster university diane m. ste-marie school of human kinetics, university of ottawa abstract the self-controlled motor learning literature consists of experiments that compare a group of learners who are provided with a choice over an aspect of their practice environment to a group who are yoked to those choices. a qualitative review of the literature suggests an unambiguous benefit from self-controlled practice. 
a meta-analysis was conducted on the effects of self-controlled practice on retention test performance measures with a focus on assessing and potentially correcting for selection bias in the literature, such as publication bias and p-hacking. first, a naïve random effects model was fit to the data and a moderate benefit of self-controlled practice, g = .44 (k = 52, n = 2061, 95% ci [.31, .56]), was found. second, publication status was added to the model as a potential moderator, revealing a significant difference between published and unpublished findings, with only the former reporting a benefit of self-controlled practice. third, to investigate and adjust for the impact of selectively reporting statistically significant results, a weight-function model was fit to the data with a one-tailed p-value cutpoint of .025. the weight-function model revealed substantial selection bias and estimated the true average effect of selfcontrolled practice as g = .107 (95% ci [.047, .18]). p-curve analyses were conducted on the statistically significant results published in the literature and the outcome suggested a lack of evidential value. fourth, a suite of sensitivity analyses were conducted to evaluate the robustness of these results, all of which converged on trivially small effect estimates. overall, our results suggest the benefit of self-controlled practice on motor learning is small and not currently distinguishable from zero. keywords: motor learning, retention, choice, optimal theory, meta-analysis, p-curve, publication bias introduction asking learners to control any aspect of their practice environment has come to be known as self-controlled practice in the motor learning literature (sanli et al., 2013; wulf & lewthwaite, 2016). the first published experiments to test self-controlled learning asked learners to control their augmented feedback schedule (janelle et al., 1997; janelle et al., 1995). 
for example, in an experiment by janelle et al., 1997, participants practiced throwing tennis balls at a target with their https://doi.org/10.15626/mp.2021.2803 https://doi.org/10.17605/osf.io/39w6k 2 non-dominant hand. the practice period occurred over two separate days. participants were assigned to one of four experimental groups (n = 12): self-controlled knowledge of performance, yoked-to-self-control, summary knowledge of performance after every five trials, and a knowledge of results only control group. the self-controlled group could request knowledge of performance whenever they wanted it, while each yoked group participant was matched with a self-control group counterpart and received knowledge of performance on the same schedule. the experimenter evaluated the participants’ throws, identified the most critical error in their throwing form, and provided knowledge of performance via video feedback, along with directing attention to the error and giving prescriptive feedback. during a delayed-retention test, the accuracy, form, and speed of the throw were assessed. the results indicated that the self-control group threw more accurately and with better form than all other groups on the retention test. the self-control and yoked groups did not significantly differ in throwing speed, but the control group threw faster than the self-control group on the second retention block. the results were interpreted as evidence that the participants provided with choice were able to process information more efficiently than their counterparts who received a fixed schedule of feedback. figure 1 shows that the number of experiments comparing self-controlled groups to yoked groups has been increasing since the original experiments by janelle and his colleagues (1997, 1995). researchers have experimented with giving learners control over a variety of variables in the practice environment. 
a qualitative assessment of the literature suggests that self-control is generally beneficial regardless of choice-type (wulf & lewthwaite, 2016). for example, self-control has been effective when participants have been provided choice over what can be considered instructionally-relevant variables, such as knowledge of results (patterson & carter, 2010), knowledge of performance (lim et al., 2015), concurrent feedback (huet et al., 2009), use of an assistive device (wulf et al., 2001), observation of a skilled model (lemos et al., 2017), practice schedule (wu & magill, 2011), practice volume (lessa & chiviacowsky, 2015), and task difficulty (leiker et al., 2016). additionally, self-controlled benefits have also been found for instructionally-irrelevant variables, such as the colour of various objects in the practice environment (wulf et al., 2018), other decorative choices (iwatsuki et al., 2019), and the choice of what to do after the retention test is complete (lewthwaite et al., 2015). despite the widespread optimism that self-controlled practice is useful for enhancing motor learning, re0 2 4 6 8 10 19 95 19 96 19 97 19 98 19 99 20 00 20 01 20 02 20 03 20 04 20 05 20 06 20 07 20 08 20 09 20 10 20 11 20 12 20 13 20 14 20 15 20 16 20 17 20 18 20 19 year n u m b e r o f e xp e ri m e n ts figure 1. number of self-controlled learning experiments meeting the inclusion criteria by year. searchers continue to debate the underlying mechanisms responsible for the effect (m. j. carter & stemarie, 2017b; wulf et al., 2018). beginning with janelle et al. (1995), both motivational and information processing mechanisms were proposed as possible explanations for self-control benefits. 
researchers have since supported these two mechanisms and, from a motivational perspective, have posited that self-control enhances confidence (chiviacowsky, wulf, & lewthwaite, 2012; janelle et al., 1995; wulf & lewthwaite, 2016) and satisfies the basic psychological need for autonomy (sanli et al., 2013; wulf & lewthwaite, 2016), motivating motor performance and learning enhancement. most self-controlled learning experiments, however, have involved participants making choices over potentially informative variables, which could act as a confounding variable. citing this potential motivational/informational confound, lewthwaite et al. (2015) experimented with providing instructionally-irrelevant choices, such as the colour of the golf balls to putt, the painting to hang on the wall, and what to do following the retention test. lewthwaite and her colleagues reasoned that information processing explanations could not account for benefits due to these incidental choices, and instead motivational factors were more likely. consistent with the motivational hypothesis, participants exhibited significantly greater motor learning on a golf putting task (experiment 1) and on a balance task (experiment 2). subsequently, several experiments have reported benefits with instructionally-irrelevant choices (abdollahipour et al., 2017; chua et al., 2018; halperin et al., 2017; iwatsuki et al., 2019; wulf et al., 2014; wulf et al., 2018), further reinforcing this motivational perspective. a contrasting line of research has been reported by m. j. carter and his colleagues (2014, 2017a, 2017b) in which informational factors, the second dominant 3 perspective, are given more weight as an explanatory variable. in one experiment by m. j. carter et al. 
(2014), self-control participants were provided with choice over receiving knowledge of results, but divided into three experimental groups; those who could make their knowledge of results decision before the trial, after the trial, or both (they would decide before, but could change their mind following the trial). timing of the choice significantly attenuated the self-control benefit. while the self-after and self-both groups exhibited learning advantages relative to their yoked counterparts, the self-before group displayed no such advantage. the argument proffered by the researchers was that there was more informational value to be gained from knowledge of results requested after a trial than when it had to be requested before the outcome of the trial occurred (also see chiviacowsky & wulf, 2005). in another experiment (m. j. carter & ste-marie, 2017a), asking learners to complete an interpolated activity in the interval preceding their choice of whether to receive knowledge of results significantly attenuated the self-control benefit (also see couvillion et al., 2020; woodard & fairbrother, 2020). as a final example, m. j. carter and ste-marie (2017b) compared an instructionally-relevant choice group (i.e., when to receive knowledge of results) to an instructionallyirrelevant choice group (i.e., which video game to play after retention and which colour arm wrap to wear while practicing). unlike the experiment by wulf and colleagues (2018), m. j. carter and ste-marie found that instructionally-relevant choices were more effective than task-irrelevant choices. overall, they have used these different findings to tie self-controlled learning benefits to information-processing activities of the learner and, in particular, those related to the processing of intrinsic feedback (e.g., m. j. carter & ste-marie, 2017a; chiviacowsky & wulf, 2005) and the provided knowledge of results (e.g., grand et al., 2015). 
in the present research, these different viewpoints concerning the mechanisms of self-controlled learning advantages were examined via meta-analysis with choice-type included as a moderator. the logic was that the motivational and informational perspectives would have different predictions. more specifically, from a motivation hypothesis, no moderating effect of choicetype on motor learning would be expected. in contrast, smaller effects for irrelevant-choice type, as compared to relevant-choice types, would be expected from the information-processing perspective. beyond this interest in the possible theoretical mechanisms, a more important question addressed was whether there is in fact evidential value for the selfcontrolled learning benefit. this is of relevance because the current consensus in the field is that self-controlled practice is generally more effective than yoked practice (for reviews see sanli et al., 2013; ste-marie et al., 2019; wulf & lewthwaite, 2016). reflecting this confidence in its benefits for motor learning, researchers have recommended adoption of self-control protocols in varied settings, such as medical training (brydges et al., 2009; jowett et al., 2007; wulf et al., 2010), physiotherapy (hemayattalab et al., 2013; wulf, 2007), music pedagogy (wulf & mornell, 2008), strength and conditioning (halperin et al., 2018), and sports training (janelle et al., 1995; sigrist et al., 2013). problematic though is that recent, high-powered experiments with pre-registered analysis plans have failed to observe motor learning or performance benefits with self-control protocols (grand et al., 2017; mckay & stemarie, 2022; st. germain et al., 2022; yantha et al., 2022). against the backdrop of the so-called replication crisis in psychology (open science collaboration, 2015), there is reason for pause when evaluating the ostensible benefits of self-controlled learning. further, lohse et al. 
(2016) have raised concerns about publication bias, uncorrected multiple comparisons, p-hacking, and other selection effects in the motor learning literature. therefore, to address the impact of selection effects on estimates of the self-controlled learning effect, a weight function model (e. c. carter et al., 2019; hedges & vevea, 1996; mcshane et al., 2016; vevea & hedges, 1995; vevea & woods, 2005) with a one-tailed p-value cutpoint of .025 was fit to the dataset of effects to provide a pre-registered adjusted estimate of the overall self-controlled learning effect. even the adjusted estimate is biased if the data generating processes are biased in ways not captured by the assumptions of the model, so further sensitivity analyses were conducted to estimate the average effect of self-control after correcting for selection effects (e. c. carter et al., 2019; vevea & woods, 2005). in parallel, in an effort to investigate the presence of evidential value in the literature, significant results were subjected to a p-curve analysis (simonsohn et al., 2014b; simonsohn et al., 2015). the pcurve analysis focuses exclusively on significant results and therefore is not affected by publication bias. in sum, the objectives of this meta-analysis were to estimate the true average effect of self-controlled learning and evaluate the evidential value of the selfcontrolled learning literature. bias resulting from selective publication was addressed with weight function and p-curve models and effect size estimates were adjusted accordingly. a key theoretical question related to the underlying mechanisms of putative self-controlled learning advantages (motivational versus informational influences) was also addressed through moderator anal4 yses, but, to anticipate, inferences will depend on the reliability of the evidence overall. 
Finally, sensitivity analyses were conducted in addition to the pre-registered analyses in an effort to understand the extent to which our conclusions depended on the modeling techniques and assumptions adopted.

Methods

Pre-registration

The procedures followed to conduct this meta-analysis were pre-registered and can be viewed at https://osf.io/qbg69. This meta-analysis was retrospective, and earlier samples of the literature had been meta-analyzed prior to this pre-registration, albeit with different data collection procedures, a different scope, and excluding recent experiments. This study adheres to PRISMA reporting guidelines (Page et al., 2021).

Literature Search

The literature search and data extraction were conducted by three authors (BM, ZY, JH) and one research assistant (HS) independently. The goal of the search was to identify all articles that met the inclusion criteria for the meta-analysis. Specifically, randomized experiments were subject to five criteria for inclusion: 1) a self-control group in which participants were asked to make at least one choice during practice, 2) a yoked group that experienced the same practice conditions as the self-controlled group, 3) a delayed retention test administered after a 24-hour or longer delay interval, 4) an objective measurement of motor performance, and 5) publication in a peer-reviewed journal or acceptance as part of a master's or PhD thesis. The literature search was completed on August 2, 2019. The search commenced on PubMed and Google Scholar with the following query: self-control* OR self-regulat* OR self-direct* OR learner-control* OR learner-regulat* OR learner-direct* OR subject-control* OR subject-regulat* OR subject-direct* OR performer-control* OR performer-regulat* OR performer-direct* AND motor learning*. The query retrieved 9,014 hits on PubMed and 98,600 hits on Google Scholar.
Each researcher excluded hits based on title alone, or title and abstract when necessary, and quit searching the databases at self-selected intervals following extended periods of excluding 100% of search results (ranging between 20 and 30 results pages without identifying a relevant record). Following an initial run of searching databases, each researcher employed their own search strategies, including reviewing the reference sections of reviews and included articles, consulting the OPTIMAL theory website, and searching the ProQuest thesis database. This literature search process resulted in 160 articles that could not be excluded without consulting the full text of the article. All 160 articles were coded for inclusion or exclusion by two researchers independently. All instances of disagreement between coders were reviewed by three authors (BM, ZY, and JH), and consensus was reached in each case. Disagreements were infrequent and were often caused by a lack of clarity in the articles (e.g., 100% knowledge-of-results groups labeled as yoked groups). None of the coding disagreements evolved into conceptual disagreements; rather, in each case, one coder had missed a detail in the full text that changed its inclusion eligibility. Subsequent to this process, a total of 73 articles, which included 78 experiments, met the inclusion criteria (see Table 1).

Dependent Variable Selection

The focus of this meta-analysis was on performance outcomes associated with the goal of the skill. The primary theoretical perspectives offered as an account for self-controlled learning are likewise focused on performance outcomes. For example, the OPTIMAL theory proposes that a learner's movements become coupled with the goal they are trying to achieve when they experience autonomy support during practice (Wulf & Lewthwaite, 2016).
To reflect this focus, a dependent-measure priority list was developed that gave higher priority to absolute error measures and lower priority to consistency measures, time/work measures, and form scores. Dependent-measure priority was ordered as follows: 1) absolute error (and analogous measures: radial error, points in an accuracy measure), 2) root-mean-square error (RMSE), 3) absolute constant error, 4) variable error, 5) movement time (and distance travelled), 6) movement form as assessed by expert raters, 7) an otherwise unspecified objective performance measure reported first in the research report. In the event that multiple measures of motor performance were reported for an experiment, effect sizes were calculated for the highest-priority measure reported in the study. In experiments with multiple self-control groups and one yoked group, the self-control groups were combined (Higgins & Green, 2011). If multiple choice-types or subpopulations were included in an experiment, combined and individual effects were calculated for inclusion in moderator analyses. Many of the self-controlled learning experiments analyzed in this study included multiple dependent measures.

Footnote 1: The webpage link that was consulted is no longer available (https://optimalmotorlearning.com/index.php/did-you-know-that/). A new webpage devoted to the OPTIMAL theory can be accessed at https://gwulf.faculty.unlv.edu/optimal-motor-learning/.

Footnote 2: Radial error, accuracy points, and distance travelled were added to the pre-registered dependent measures as they arose during data extraction. Decisions were made blind to the data by an author not involved in said extraction (BM or DSM).
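The priority rule above is mechanical enough to sketch in code. The following is an illustrative Python sketch only (the actual extraction was done by hand, with analyses in R); the measure labels and the `pick_measure` helper are our own naming, not code from the analysis scripts.

```python
# Illustrative sketch of the dependent-measure priority rule described
# in the text. PRIORITY mirrors the paper's ordering; pick_measure() is
# a hypothetical helper, not part of the authors' analysis scripts.

PRIORITY = [
    "absolute error",               # includes radial error, accuracy points
    "rmse",
    "absolute constant error",
    "variable error",
    "movement time",                # includes distance travelled
    "movement form",                # expert raters
    "unspecified objective measure",
]

def pick_measure(reported):
    """Return the highest-priority measure present in `reported`.

    `reported` maps measure names to extracted summary data. Returns
    None when no listed measure was reported; per the text, such an
    experiment would be excluded unless the authors supplied data.
    """
    for measure in PRIORITY:
        if measure in reported:
            return measure
    return None

# Example: an experiment reporting variable error and RMSE.
chosen = pick_measure({"variable error": {"sd": 5.1}, "rmse": {"mean": 2.3}})
# chosen == "rmse": RMSE outranks variable error on the priority list.
```

The rule is deliberately order-based rather than data-based, which is what makes it resistant to picking measures by the strength of their results.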
However, including multiple measures from the same experiment introduces bias and inflates Type I error (Scammacca et al., 2014). Although there are a variety of methods for dealing with multiple measures from the same studies in meta-analysis, we chose to create a priority list and always selected the highest-priority dependent measure that was reported. If the highest-priority measure was not described in adequate detail to calculate the effect size, the authors were contacted and the data were requested. If the authors could not provide the data for the highest-priority dependent measure reported in their study, the experiment was left out of our analysis. The rationale for this approach was based on five considerations. First, our interest was in motor learning as reflected by an enhanced capability to perform a skill. Motor learning studies often report multiple error measures, but these measures are not equally coupled with performance outcome. Constant error, for example, was not included on the priority list because it is possible to have zero constant error while performing terribly overall. Therefore, we chose to prioritize measures that can be considered tightly coupled with performance, such as absolute error, RMSE, and absolute constant error. If these measures were not used, measures that are only correlated with performance, such as variable error, movement time, and movement form, were selected. We reasoned this selection strategy would focus the analysis on measures related to improved skill while de-emphasizing other effects. Second, we reasoned that averaging across dependent measures could introduce additional heterogeneity to the analysis by combining potentially disparate dependent measures. The third, fourth, and fifth considerations all relate to avoiding bias, but differ with regard to the source of the bias and the alternate method that would include such bias.
Thus, the third consideration was that imposing a priority list was thought to better avoid biases that could emerge from selecting the most focal measure in a given study, because an unknowable percentage of studies may have defined the focal measure based on the strength of the findings. Fourth, we reasoned that some measures may only get reported if they support the predicted benefit of self-control. Scammacca et al. (2014) reported that effect size estimates were inflated when random dependent measures were selected in a meta-analysis case study, perhaps reflecting a selective reporting bias. Averaging across all reported measures (a fair alternative to our approach) could conceivably pick up some of this reporting bias. Fifth, we ignored lower-priority measures with data when higher-priority measures lacked data because we reasoned there could be a systematic reason for this pattern: a preference for reporting data associated with positive effects. Indeed, there were articles where the only measure reported with sufficient data to calculate an effect size was also the only measure with a significant result (e.g., Wulf et al., 2005).

Data Extraction

The four researchers separated into pairs, and half of the included experiments were coded independently by one pair; the other half were coded independently by the other pair. The coding included the moderators described below, publication year, and sample size. Hedges' g was calculated from reported statistics and sample sizes using the compute.es package (Re, 2013) in R (R Core Team, 2021). Effect sizes were calculated from means and standard deviations, from test statistics such as t and F, or from precisely reported p-values. When covariates were included in the analysis, the correlation coefficient for the covariate-dependent measure relationship was required to calculate accurate effect sizes. Since this information is often not reported, authors were contacted and the information was requested.
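For readers unfamiliar with compute.es, the core Hedges' g computation is simple. Below is a minimal Python sketch of g from group summary statistics or from an independent-samples t statistic; the function names are ours, not compute.es's API.

```python
import math

def hedges_g_from_means(m1, m2, sd1, sd2, n1, n2):
    """Hedges' g from group means and SDs: Cohen's d times the
    small-sample correction J = 1 - 3 / (4*df - 1)."""
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / s_pooled
    return (1 - 3 / (4 * df - 1)) * d

def hedges_g_from_t(t, n1, n2):
    """Hedges' g recovered from an independent-samples t statistic,
    as when only test statistics are reported."""
    df = n1 + n2 - 2
    d = t * math.sqrt(1 / n1 + 1 / n2)
    return (1 - 3 / (4 * df - 1)) * d

# Example: means 10 vs. 8, common SD 2, n = 20 per group gives d = 1.0
# and g slightly below 1 after the small-sample correction.
g = hedges_g_from_means(10, 8, 2, 2, 20, 20)
```

Both routes give identical g when the t statistic is exact, which is why precisely reported statistics were preferred over rounded p-values.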
One effect size was calculated for each of three time points for each experiment: acquisition, retention, and transfer. The independent data extractions were compared and inconsistent results were highlighted. There was 89% absolute agreement between pairs of coders on 1,344 data points. For data points with disagreement, one of the researchers from the other coding pair reviewed the relevant experiment to confirm the value to be used in the analysis. Several articles failed to report the data necessary to calculate effect sizes at some or all time points. A total of 39 authors were emailed with requests for missing data, and 17 were able to provide data following a minimum one-month period after the request. After requesting missing data, 25 experiments were excluded from primary analyses for missing retention data. A total of 52 effects from 51 experiments reported in 46 articles were included in the primary meta-analysis. In addition to extracting effect sizes, inferential statistics were scraped from published experiments that reported a statistically significant effect at retention. Two authors (BM and JH) independently completed a p-curve disclosure form consisting of a direct quote of the stated hypotheses for each experiment, the experimental design, and a direct quote of the results indicating a significant result (see Appendix A). There was 94% absolute agreement between the independent forms. Mismatches were resolved with consensus.

Footnote 3: On one occasion, the third researcher was unable to match either effect calculation, so the involved researchers discussed the issue, determined the source of the inconsistency, and asked a fourth researcher to recalculate the effect size with clear instructions for avoiding confusion. The source of inconsistency was simply a rounding error when combining multiple groups, and the fourth researcher was able to corroborate the calculation.
Outlier Screening

The meta-analysis R package metafor (Viechtbauer, 2010) was used to screen the data for potentially influential outliers (see analysis script). In order to identify outlier values and exclude them from further analyses, the following nine influence statistics were calculated: a) externally standardized residuals, b) DFFITS values, c) Cook's distances, d) covariance ratios, e) DFBETAS values, f) the estimates of τ² when each study is removed in turn, g) the test statistics for (residual) heterogeneity when each study is removed in turn, h) the diagonal elements of the hat matrix, and i) the weights (in %) given to the observed outcomes during model fitting. Any experiment with effects identified as extremely influential by any three of the nine influence metrics was removed from subsequent analyses.

Risk of Bias

All articles were assessed for risk of bias by the lead author using the Cochrane Risk of Bias 1.0 tool (Higgins et al., 2011). Each article was coded as either high risk, unclear (some concerns), or low risk on seven dimensions: sequence generation, allocation concealment, incomplete outcome data, selective outcome reporting, blinding of outcome assessment, blinding of participants and personnel, and other sources of bias.

Pre-specified Analyses

Random Effects Model

A naïve random effects model was fit to the retention effect sizes to estimate the average reported effect of self-controlled learning and to assess heterogeneity in effect sizes between experiments. Heterogeneity was evaluated with the Q statistic and described with I². A mixed-effects model was fit to evaluate whether differences in experimental design or sample characteristics moderated the effect of self-controlled learning.

Moderator Analyses

Moderators were determined based on the authors' collective knowledge of the self-controlled learning literature.
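The naïve random effects model, together with Q, τ², and I², can be written compactly. Below is an illustrative Python implementation of the DerSimonian-Laird moment estimator; note this is a simplified stand-in for exposition (metafor defaults to REML estimation), not a reproduction of the pre-registered analysis.

```python
import math

def dersimonian_laird(y, v):
    """Random-effects meta-analysis via the DerSimonian-Laird moment
    estimator. y: observed effect sizes; v: their sampling variances.
    Returns (pooled estimate, SE of the estimate, tau^2, Q, I^2)."""
    w = [1.0 / vi for vi in v]                       # fixed-effect weights
    sw = sum(w)
    mu_fe = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Cochran's Q: weighted squared deviations from the FE mean
    q = sum(wi * (yi - mu_fe) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                    # moment estimate
    # Re-weight with between-study variance added
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_re = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0    # % true heterogeneity
    return mu_re, se_re, tau2, q, i2
```

With homogeneous inputs τ² collapses to zero and the pooled estimate reduces to the fixed-effect mean; with heterogeneous inputs the weights flatten and I² rises toward 1.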
We coded for discrete differences in protocols between experiments to investigate whether differing methodologies resulted in different effect size estimates. Further, based on a meta-analysis reporting that the effect of choice on intrinsic motivation can be moderated by whether participants were compensated for completing the study (Patall et al., 2008), we also coded for compensation type. Finally, we investigated whether publication status was a moderator of the effect of self-control as part of our overall approach to examining the impact of publication bias on the self-controlled learning literature. The following six moderators were analyzed separately in mixed-effects models: a) choice-type: choices were categorized as instructionally irrelevant, knowledge of results, knowledge of performance, concurrent feedback, amount of practice, use of an assistive device, practice schedule, observational practice, or difficulty of practice; b) experimental setting: experiments were categorized as laboratory, applied, or laboratory-applied. We defined a laboratory setting as one where learners are asked to acquire a skill not typically performed in everyday life. We defined an applied setting as one where learners are asked to acquire a skill often performed outside of a laboratory.
Finally, we defined a laboratory-applied setting as one where learners are asked to acquire a skill resembling skills often performed outside the laboratory but with researcher-contrived differences; c) subpopulation: the following subgroups were analyzed: adults (18-50 years of age), children/adolescents (under 18 years old), older adults (over 50 years old), and clinical populations (as defined by the research article); d) publication status: articles were classified as published or unpublished (e.g., theses); e) compensation: whether participants were compensated for participating in the experiment was categorized as compensated, not compensated, or not stated; f) retention delay interval: coded as 24 hours, 48 hours, or >48 hours.

Table 1. Experiment characteristics and moderator coding.

Authors | Year | Setting | Compensation | Choice-type | Population | Retention | N | Published
Aiken et al. | 2012 | Applied | Not stated | Observation | Adult | 24-hr | 28 | Yes
Alami | 2013 | Lab | Yes | Feedback (KR) | Adult | 24-hr | 22 | No
Ali et al. | 2012 | Lab | Not stated | Feedback (KR) | Adult | 24-hr | 48 | Yes
Andrieux et al. | 2016 | Lab | Not stated | Task difficulty | Adult | 24-hr | 48 | Yes
Andrieux et al. | 2012 | Lab | Not stated | Task difficulty | Adult | 24-hr | 38 | Yes
Arsal | 2004, Expt 1 | Lab | Not stated | Feedback (KR) | Adult | 48-hr | 28 | No
Arsal | 2004, Expt 2 | Lab | Not stated | Feedback (KR) | Adult | 48-hr | 28 | No
Barros | 2010, blocked | Lab | Not stated | Feedback (KR) | Adult | 24-hr | 48 | No
Barros | 2010, random | Lab | Not stated | Feedback (KR) | Adult | 24-hr | 48 | No
Barros et al. | 2019, Expt 1 | Lab-applied | No | Feedback (KR) | Adult | 24-hr | 60 | Yes
Barros et al. | 2019, Expt 2 | Lab | No | Feedback (KR) | Adult | 24-hr | 60 | Yes
Bass | 2015 | Lab | No | Feedback (KR) | Adult | 24-hr | 20 | No
Bass | 2018 | Applied | No | Feedback (KR) | Adult | 24-hr | 60 | No
Brydges et al. | 2009 | Applied | Not stated | Observation | Adult | >48-hr | 48 | Yes
Bund & Wiemeyer | 2004 | Lab-applied | No | Observation | Adult | 24-hr | 52 | Yes
Carter & Patterson | 2012 | Lab | Not stated | Feedback (KR) | Adult | 24-hr | 20 | Yes
Carter & Patterson | 2012 | Lab | Not stated | Feedback (KR) | Older | 24-hr | 20 | Yes
Carter & Patterson | 2012 | Lab | Not stated | Feedback (KR) | Two | 24-hr | 40 | Yes
Chen et al. | 2002 | Lab | Yes | Feedback (KR) | Adult | 48-hr | 48 | Yes
Chiviacowsky | 2014 | Lab | Not stated | Feedback (KR) | Adult | 24-hr | 28 | Yes
Chiviacowsky & Lessa | 2017 | Lab | Not stated | Feedback (KR) | Older | 48-hr | 22 | Yes
Chiviacowsky & Wulf | 2002 | Lab | Not stated | Feedback (KR) | Adult | 24-hr | 30 | Yes
Chiviacowsky et al. | 2012 | Lab | Not stated | Feedback (KR) | Clinical | 24-hr | 30 | Yes
Chiviacowsky et al. | 2008 | Lab | Not stated | Feedback (KR) | Children | 24-hr | 26 | Yes
Chiviacowsky et al. | 2012 | Lab | Not stated | Assistive device | Clinical | 24-hr | 28 | Yes
Davis | 2009 | Applied | Not stated | Model | Adult | 24-hr | 24 | No
Fagundes et al. | 2013 | Lab-applied | Not stated | Feedback (KR) | Adult | 48-hr | 52 | Yes
Fairbrother et al. | 2012 | Lab | Not stated | Feedback (KR) | Adult | 24-hr | 48 | Yes
Ferreira et al. | 2019 | Lab | Not stated | Feedback (KR) | Adult | 24-hr | 60 | Yes
Figueiredo et al. | 2018 | Lab | No | Feedback (KR) | Adult | 24-hr | 30 | Yes
Ghorbani | 2019, Expt 2 | Lab-applied | Not stated | Feedback (KR) | Adult | 24-hr | 36 | Yes
Grand et al. | 2015 | Lab | No | Feedback (KR) | Adult | 24-hr | 36 | Yes
Grand et al. | 2017 | Lab | Yes | Incidental | Adult | >48-hr | 68 | Yes
Hansen et al. | 2011 | Lab | No | Feedback (KR) | Adult | 24-hr | 24 | Yes
Hartman | 2007 | Lab | Not stated | Assistive device | Adult | 24-hr | 18 | Yes
Hemayattalab et al. | 2013 | Lab | Not stated | Feedback (KR) | Clinical | 24-hr | 20 | Yes
Ho | 2016 | Lab | Not stated | Amount of practice | Adult | 24-hr | 120 | No
Holmberg | 2013 | Lab-applied | No | Feedback (KP) | Adult | 24-hr | 24 | No
Huet et al. | 2009 | Lab-applied | Not stated | Feedback (concurrent) | Adult | 24-hr | 20 | Yes
Ikudome et al. | 2019, Expt 1 | Lab-applied | No | Incidental | Adult | 24-hr | 40 | Yes
Ikudome et al. | 2019, Expt 2 | Lab-applied | No | Observation | Adult | 24-hr | 40 | Yes
Jalalvan et al. | 2019 | Lab-applied | Not stated | Task difficulty | Adult | 24-hr | 60 | Yes
Janelle et al. | 1997 | Lab-applied | Yes | Feedback (KP) | Adult | >48-hr | 48 | Yes
Jones | 2010 | Lab | Yes | Repetition schedule | Adult | 24-hr | 40 | No
Kaefer et al. | 2014 | Lab | No | Feedback (KR) | Adult | 24-hr | 56 | Yes
Keetch & Lee | 2007 | Lab | Yes | Repetition schedule | Adult | 24-hr | 96 | Yes
Kim et al. | 2019 | Lab | Yes | Feedback (KR) | Adult | 24-hr | 42 | Yes
Leiker et al. | 2016 | Lab-applied | Not stated | Task difficulty | Adult | >48-hr | 60 | Yes
Leiker et al. | 2019 | Lab | Not stated | Task difficulty | Adult | >48-hr | 60 | Yes
Lemos et al. | 2017 | Applied | No | Observation | Children | 24-hr | 24 | Yes
Lessa & Chiviacowsky | 2015 | Applied | Not stated | Amount of practice | Older | 48-hr | 36 | Yes
Lewthwaite et al. | 2015, Expt 1 | Lab-applied | Not stated | Incidental | Adult | 24-hr | 24 | Yes
Lewthwaite et al. | 2015, Expt 2 | Lab | Not stated | Incidental | Adult | 24-hr | 30 | Yes
Lim et al. | 2015 | Applied | Not stated | Feedback (KP) | Adult | 24-hr | 24 | Yes
Marques & Correa | 2016 | Applied | Not stated | Feedback (KP) | Adult | 48-hr | 70 | Yes
Marques et al. | 2017 | Applied | Not stated | Feedback (KP) | Adult | 24-hr | 30 | Yes
Norouzi et al. | 2016 | Lab | Not stated | Feedback (KR) | Adult | 24-hr | 45 | Yes
Nunes et al. | 2019 | Lab-applied | No | Feedback (KP) | Older | 24-hr | 40 | Yes
Ostrowski | 2015 | Lab | Not stated | Feedback (KR) | Adult | 24-hr | 80 | No
Patterson & Carter | 2010 | Lab | Yes | Feedback (KR) | Adult | 24-hr | 24 | Yes
Patterson & Lee | 2010 | Lab-applied | Yes | Task difficulty | Adult | 48-hr | 48 | Yes
Patterson et al. | 2013 | Lab | Yes | Feedback (KR) | Adult | 24-hr | 48 | Yes
Patterson et al. | 2011 | Lab | Yes | Feedback (KR) | Adult | 24-hr | 60 | Yes
Post et al. | 2016 | Lab-applied | No | Feedback (KP) | Adult | 24-hr | 44 | Yes
Post et al. | 2011 | Applied | No | Amount of practice | Adult | 24-hr | 24 | Yes
Post et al. | 2014 | Applied | Not stated | Amount of practice | Adult | 24-hr | 30 | Yes
Rydberg | 2011 | Applied | Not stated | Repetition schedule | Adult | 24-hr | 16 | No
Sanli & Patterson | 2013 | Lab | No | Repetition schedule | Adult | 24-hr | 24 | Yes
Sanli & Patterson | 2013 | Lab | No | Repetition schedule | Children | 24-hr | 24 | Yes
Ste-Marie et al. | 2013 | Applied | No | Feedback (KP) | Children | 24-hr | 60 | Yes
Tsai & Jwo | 2015 | Lab | Yes | Feedback (KR) | Adult | 24-hr | 36 | Yes
von Lindern | 2017 | Lab | Not stated | Feedback (KR) | Adult | 24-hr | 48 | No
Williams et al. | 2017 | Lab | Yes | Feedback (concurrent) | Adult | 24-hr | 29 | Yes
Wu & Magill | 2011 | Lab | No | Repetition schedule | Adult | 24-hr | 30 | Yes
Wu | 2007, Expt 1 | Lab-applied | Yes | Repetition schedule | Adult | 24-hr | 30 | No
Wulf & Adams | 2014 | Lab | No | Repetition schedule | Adult | 24-hr | 20 | Yes
Wulf & Toole | 1999 | Lab-applied | Yes | Assistive device | Adult | 24-hr | 26 | Yes
Wulf et al. | 2015, Expt 1 | Lab-applied | No | Repetition schedule | Adult | 24-hr | 68 | Yes
Wulf et al. | 2001 | Lab-applied | Yes | Assistive device | Adult | 24-hr | 26 | Yes
Wulf et al. | 2018, Expt 1 | Lab-applied | No | Incidental | Adult | 24-hr | 32 | Yes
Wulf et al. | 2018, Expt 2 | Lab-applied | No | Incidental | Adult | 48-hr | 28 | Yes
Wulf et al. | 2018, Expt 2 | Lab-applied | No | Observation | Adult | 48-hr | 28 | Yes
Wulf et al. | 2018, Expt 2 | Lab-applied | No | Two | Adult | 48-hr | 42 | Yes
Wulf et al. | 2005 | Applied | No | Observation | Adult | >48-hr | 26 | Yes

Note. KR = knowledge of results; KP = knowledge of performance.

Adjusting for Selection Effects

Selection bias in the motor learning literature is likely caused by filtering based on the statistical significance of results (Lohse et al., 2016). To assess and adjust for selection effects, the R package weightr (Coburn & Vevea, 2017) was used to fit a Vevea-Hedges weight-function model to the retention data (Vevea & Hedges, 1995). The weight-function model estimates the true average effect, heterogeneity, and the probability that a non-significant result survives censorship and is available for analysis. Selection effects are modelled by a step function that divides the effects into two bins at one-tailed p = .025, coinciding with a two-tailed p-value of .05. The probability of a non-significant effect surviving censorship to appear in the model is estimated relative to the probability of observing a study with a significant effect.
The selection-adjusted model was compared to the naïve random effects model with a likelihood ratio test; a better fit for the adjusted model suggests selection bias in the literature. The adjusted estimate from the weight-function model was pre-registered as the primary estimate of the true average effect in this meta-analysis. Please note that while the weight-function model attempts to estimate the true effect of self-controlled learning after correcting for selection biases, the estimated effect cannot be considered definitive. Nevertheless, the adjusted estimate is likely less biased than the naïve random effects estimate (E. C. Carter et al., 2019; Hong & Reed, 2021; Kvarven et al., 2020; Vevea & Hedges, 1995). The difference between the estimates can be informative about the potential impact of selection biases, with larger disparities between models suggesting greater selection effects.

P-curve Analysis

To investigate the evidential value of the self-controlled learning literature, the significant positive results at retention reported in peer-reviewed journals were submitted to a p-curve analysis (Simonsohn et al., 2015). To be included in the analysis, articles needed to meet the following criteria: a) be a published article; b) state explicitly that self-controlled learning was expected to be more effective than yoked practice; c) report inferential statistics directly comparing a self-control group and a yoked group on a retention test; and d) conclude that the self-control group performed significantly better than the yoked group. If an article included multiple dependent measures showing a significant effect, the dependent-measure priority list was used to select the highest-priority measure. If only one measure was reported as significant, that effect was included even if the experiment included higher-priority measures that were null. This resulted in a slightly different sample of effects from the random effects and weight-function models.
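The logic of the one-cutpoint selection model and its likelihood ratio test can be conveyed with a toy implementation. The sketch below fits the model by a coarse grid-search over the mean and the relative survival probability of non-significant results, using a fixed-effect effect size model for simplicity (weightr uses proper numerical optimization and supports a random-effects variant); all function names are our own.

```python
import math
import random

Z_CRIT = 1.959964  # one-tailed p = .025 in the positive direction

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_lik(y, v, mu, w):
    """Log-likelihood of a fixed-effect model with one selection
    cutpoint: significant effects (one-tailed p < .025) survive with
    relative probability 1, non-significant effects with probability w."""
    ll = 0.0
    for yi, vi in zip(y, v):
        se = math.sqrt(vi)
        p_sig = 1.0 - norm_cdf((Z_CRIT * se - mu) / se)
        a = p_sig + w * (1.0 - p_sig)            # normalizing constant
        weight = 1.0 if yi / se > Z_CRIT else w
        ll += (math.log(weight)
               - 0.5 * ((yi - mu) / se) ** 2
               - math.log(se * math.sqrt(2.0 * math.pi) * a))
    return ll

def fit(y, v):
    """Grid-search MLE for (mu, w) plus the likelihood ratio statistic
    against the no-selection (w = 1) model."""
    grid_mu = [i / 100.0 for i in range(-60, 61, 2)]
    grid_w = [i / 100.0 for i in range(2, 101, 2)]
    best = max((log_lik(y, v, m, w), m, w) for m in grid_mu for w in grid_w)
    null = max((log_lik(y, v, m, 1.0), m) for m in grid_mu)
    return best[1], best[2], 2.0 * (best[0] - null[0])

# Demo under stated assumptions: true effect zero (SE = .2), all
# significant results survive, non-significant results survive 10% of
# the time. The naive mean is inflated; the adjusted estimate is not.
random.seed(1)
published = [e for e in (random.gauss(0.0, 0.2) for _ in range(2000))
             if e / 0.2 > Z_CRIT or random.random() < 0.10]
naive = sum(published) / len(published)
adjusted, w_hat, lrt = fit(published, [0.04] * len(published))
```

The large likelihood ratio statistic in the demo mirrors the paper's comparison of the adjusted and naïve models: when censorship is real, allowing w < 1 fits the data much better.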
the distribution of significant p-values is a function of the power of the experiments included in the analysis. if a p-curve included only type 1 errors, the expected distribution would be uniform. as the power of included experiments increases, so too does the amount of right skew in the p-curve, with smaller p-values appearing more frequently than large p-values. the pcurve analysis tests the null hypothesis that there is no evidentiary value by analyzing the amount of right skew in the distribution of p-values. conversely, if researchers peek at their data and stop collecting when they reach statistical significance, a practice known as p-hacking, the distribution of significant p-values under the null would be left skewed, with p-values near .05 occurring more frequently. varying mixtures of true effect sizes and intensities of p-hacking produce varying shapes of p-curve, therefore the observed p-curve was compared to the distribution of p-values expected if the studies were conducted with 33% power. it is unlikely that researchers would continuously conduct experiments that fail >66% of the time whilst studying the self-controlled learning phenomenon. observing a p-curve significantly “flatter” than what would be expected with 33% power would suggest a lack of evidential value among the significant results (simonsohn et al., 2014a, 2014b). sensitivity analyses the primary analyses were followed up with several sensitivity analyses. sensitivity analyses are used to evaluate the sensitivity of the results to the specific parameters chosen for the original analyses. the selfcontrolled learning literature, like many areas of behavioural research, was not produced exclusively by registered experiments with pre-specified analysis plans and 100% reporting frequency. 
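The right-skew test can be illustrated with the Stouffer-style combination of pp-values used in the p-curve literature: each significant p-value is rescaled to its conditional probability under the null (pp = p/.05, uniform on (0, 1) when there is no effect), probit-transformed, and summed. This Python sketch is a minimal illustration, not the full p-curve app, which also runs the 33%-power ("flatness") comparison:

```python
import math
from statistics import NormalDist

def pcurve_right_skew_p(p_values, alpha=0.05):
    """Stouffer-style p-curve test of right skew.

    Under the null of no true effect, a significant p-value is uniform
    on (0, alpha), so pp = p/alpha is uniform on (0, 1). An excess of
    very small p-values (right skew) makes the probit-transformed pp
    values negative on average. Returns the one-sided p-value of the
    combined test; small values indicate evidential value.
    """
    nd = NormalDist()
    z = [nd.inv_cdf(p / alpha) for p in p_values]
    stouffer_z = sum(z) / math.sqrt(len(z))
    return nd.cdf(stouffer_z)
```

For example, five results all at p = .001 yield a tiny combined p-value (strong right skew), whereas five results clustered at p = .045 do not, matching the intuition that a pile-up just below .05 signals p-hacking rather than a true effect.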
The complexity of selection effects at various levels, including editorial decisions, author decisions, analysis decisions, and missing data, renders the accuracy of modeled effects impossible to estimate (E. C. Carter et al., 2019). Producing a range of estimates based on varying assumptions is intended to provide the reader with a broader picture of the uncertainty of the point estimates in the primary analyses. Bias correction methods vary in their performance depending on the total amount of heterogeneity, the true average effect size, the amount of publication bias, and the intensity of p-hacking in the data (E. C. Carter et al., 2019). To determine which bias correction models perform well in the various plausible conditions for the data in this meta-analysis, model performance checks were conducted using the meta-showdown explorer Shiny app developed by E. C. Carter and colleagues (2019) (http://www.shinyapps.org/apps/metaexplorer/). Simulated conditions were as follows: medium publication bias (significant results published at 100% frequency, non-significant results published at 20% frequency, wrong-direction effects published at 5% frequency); medium questionable research practice (QRP) environment (for a detailed explanation of QRP environments, see E. C. Carter et al., 2019); τ = 0, .2; g = 0, .2, .5; k = 60. Good performance was defined as a maximum of .1 upward or downward bias and a maximum mean absolute error of .1; models were also tested with maximum bias and error values of .15.

Figure 2. Proportion of studies with low risk of bias, some concerns, and high risk of bias in each of the seven dimensions of the Cochrane RoB 1.0 tool.
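The "medium publication bias" condition can be made concrete with a small simulation: generate studies with a known true effect, publish significant results at 100%, same-direction non-significant results at 20%, and wrong-direction results at 5%, then measure how far the naïve average drifts from the truth. The following is an illustrative Python sketch under those stated assumptions, not the meta-showdown code itself:

```python
import random

def simulate_naive_bias(true_g=0.0, se=0.2, n_studies=5000, seed=42):
    """Return (naive mean of published effects, its bias) under the
    medium publication-bias rule described in the text: significant
    results (one-tailed p < .025, i.e., g/se > 1.96) published at 100%,
    same-direction non-significant results at 20%, wrong-direction
    results at 5%."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        g = rng.gauss(true_g, se)
        z = g / se
        if z > 1.959964:
            keep = 1.00   # significant, right direction
        elif g >= 0:
            keep = 0.20   # non-significant, right direction
        else:
            keep = 0.05   # wrong direction
        if rng.random() < keep:
            published.append(g)
    naive = sum(published) / len(published)
    return naive, naive - true_g
```

With a true effect of exactly zero, this censoring rule inflates the naïve average to a clearly positive value, which is precisely why the bias-corrected estimators above are needed.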
With good performance defined by a maximum bias in either direction of .1 and a maximum absolute error of .1, the weight-function model and, to a lesser extent, p-curve models provided coverage across all plausible conditions except the highest heterogeneity condition (τ = .4). With good performance defined as a maximum bias and error of .15, the precision-effect estimate with standard error (PEESE) method provided good performance in all conditions. Therefore, sensitivity analyses were conducted on the effect size data via p-curve and PEESE methods. An additional sensitivity analysis of the estimated power among the included studies was conducted with z-curve (Bartoš & Schimmack, 2020). Z-curve, like p-curve, analyzes only statistically significant results and estimates the power of the included studies (called the expected replication rate, ERR). However, unlike p-curve, z-curve is robust to heterogeneity because it fits a finite mixture model of seven distributions, allowing the underlying true effects to vary. Further, z-curve also estimates the power of all studies that have been conducted (called the expected discovery rate, EDR), which can be compared to the observed discovery rate in order to test for the presence of publication bias.

Primary P-curve

A leave-one-out analysis of the p-curve results was conducted to assess the extent to which the primary results depended on the inclusion of one or two extreme results. Results that depend on the inclusion of one or two extreme results should not be considered robust.

Results

Risk of Bias

The risk of bias assessment revealed that lackluster reporting standards were pervasive among the included articles (see Figure 2). For example, comparing a self-control group to a yoked group usually involves first collecting a self-control participant, then their yoked counterpart.
Despite this, most articles simply reported that the participants were randomly assigned to these conditions, with no indication of how this temporal constraint was addressed. A similar issue was observed with respect to addressing outliers and attrition. Over 75% of the included articles failed to mention outliers and how they were addressed (captured by the incomplete outcome data dimension). Most of the included studies were not double-blind, largely due to the inherent difficulties of conducting a double-blind study of self-controlled motor learning. While the risk of bias associated with a lack of double blinding has been debated (see Howick, 2008), it is nonetheless notable that double blinding was rare among the included studies.

Outlier Removal

Two studies were flagged as significantly influential outliers by all nine influence metrics calculated during data screening: Lemos et al. (2017, g = 3.7) and Marques et al. (2017, g = 3.95). No other effect sizes were identified as outliers by any metric. Both outliers were removed from all subsequent analyses.

Naïve Random Effects Model

The naïve random effects model estimated the average treatment effect of self-controlled practice as g = .44 (k = 52, N = 2061, 95% CI [.31, .56]). However, there was significant variability in the average effect estimated across experiments, Q(df = 51) = 103.45, p < .0001, τ = .31. It was estimated that 47.9% (I²) of the total variability in effect sizes across experiments was due to true heterogeneity in the underlying effects measured (see Figure 3).

Moderator Analyses

Six moderators selected for theoretical and/or methodological reasons were tested separately. Five moderators failed to account for a significant amount of heterogeneity: experimental setting (p = .46, R² = 1%), compensation (p = .99, R² = 0%), choice-type (p = .71, R² = 0%), subpopulation (p = .74, R² = 0%), and retention interval (p = .54, R² = 0%).
One moderator, publication status, accounted for a statistically significant amount of heterogeneity, p < .0001, R² = 48%. Among published experiments, self-controlled practice had a strong benefit, g = .54, 95% CI [.28, .81]. However, among unpublished experiments, self-controlled practice had essentially no effect, g = .003, 95% CI [-.23, .24].

Selection Model

The weight-function model combines an effect size model and a selection model (Hedges & Vevea, 1996). The effect size model is equivalent to the naïve random effects model, specifying what the distribution of effect sizes would be in the absence of publication bias or other selection effects. The selection model accounts for the probability that a given study survives selection based on its p-value and specifies how the effect size distribution is modified by selection. A weight-function model with a one-tailed p-value cutpoint of .025 was fit to the retention effect size estimates (see Figure 4). The results of a likelihood ratio test suggest the adjusted model fit the data significantly better than the unadjusted model, χ²(df = 1) = 21.18, p < .0001. The adjusted effect size estimate was significantly different from zero, g = .107, p < .001, 95% CI [.05, .17]. According to the adjusted model, non-significant results were 6% as likely to survive selection as significant results. Note that the weightr function failed to estimate the random effects model, and the results reported here are based on a fixed-effect estimate.

P-curve

The purpose of the p-curve analysis was to investigate the evidential value in the published reports (n = 26) of statistically significant self-controlled learning benefits. Visual inspection of Figure 5 reveals a V-shaped distribution with the greatest frequency of p-values in the < .05 bin. The observed p-curve was significantly flatter than would be expected if the experiments had 33% power, p = .0035, indicating an absence of evidential value.
conversely, the half p-curve (simonsohn et al., 2015) was significantly right skewed, suggesting the presence of evidential value. sensitivity analysis, however, revealed that the half curve does not remain significantly right skewed following removal of the most extreme p-value from the sample. the estimated power of the included studies was 5%, 95% ci [5%, 17%].

interim discussion

the primary results described above suggest that selection effects have caused a seriously distorted record of self-controlled learning. estimated benefits are less than one third of the naïve estimate, g = .107, 95% ci [.05, .17]. the p-curve analysis failed to detect robust evidence of a self-controlled learning effect. the performance of the weight-function model depends on the specific conditions present in the meta-analysis, although these conditions are unknowable (e. c. carter et al., 2019). it was necessary to conduct sensitivity analyses with additional bias correction methods to assess the reliability of the selection-adjusted weight-function model estimate (note that the likelihood ratio test is not robust to misspecification of the random effects model; hedges & vevea, 1996). based on performance checks conducted under a range of plausible conditions, it was determined that sensitivity analyses conducted with a peese meta-regression and p-curve effect size estimation would provide good performance coverage across most plausible conditions.

figure 3. forest plot of hedges' g (95% ci) for self-controlled versus yoked groups on retention tests. size of squares is proportional to 1/σ² (precision). experiments are divided into published and unpublished subgroups and the black polygons represent 95% ci estimates from subgroup analyses. the black polygon at the bottom of the figure represents the 95% ci estimate for all included experiments. [per-study plot data omitted; the random effects estimates shown in the figure were g = 0.54, 95% ci [0.42, 0.67] for the published subgroup, g = 0.01, 95% ci [-0.25, 0.26] for the unpublished subgroup, and g = 0.44, 95% ci [0.31, 0.56] overall.]

figure 4. funnel plot of self-controlled learning studies at retention. standard error is plotted on the y-axis and hedges' g is plotted on the x-axis. dark gray contour regions represent two-tailed p-values between .10 and .05 (not quite significant). the light gray contour regions represent two-tailed p-values between .05 and .01. in the absence of bias (and other forms of heterogeneity), the most precise experiments would center on the naïve random effects estimate near the top of the plot and as experiments get progressively less precise they would move down the plot and spread out symmetrically. in the presence of bias, one would expect experiments to cluster in the light gray contour regions. the clustering of experiments in the positive light gray contour region in this plot suggests substantial bias.

sensitivity analyses

precision-effect with standard error (peese) model

when publication bias is present in a body of evidence, sample size and effect size can be negatively correlated (stanley & doucouliagos, 2014). the peese model fits a quadratic relationship between effect size and standard error to reflect the intuition that publication bias is stronger for low precision studies than high precision studies.
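the peese model just described amounts to a weighted least squares regression of effect size on squared standard error; a minimal sketch (the analysis itself was run in r) is:

```python
import numpy as np

def peese(g, se):
    """Precision-effect with standard error (PEESE): WLS regression of
    effect size on squared standard error, weighted by inverse variance.
    The intercept b0 is the bias-adjusted estimate (the predicted effect
    of an ideal study with SE = 0)."""
    g, se = np.asarray(g, float), np.asarray(se, float)
    X = np.column_stack([np.ones_like(g), se ** 2])  # columns: [1, se^2]
    w = 1.0 / se ** 2                                # inverse-variance weights
    XtW = X.T * w                                    # X^T W without forming W
    b0, b1 = np.linalg.solve(XtW @ X, XtW @ g)
    return b0, b1
```

reading off the intercept as the adjusted estimate is what makes peese a bias correction: studies are extrapolated down to the precision of a hypothetical study with zero standard error.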
the rationale is that low precision studies need to overestimate effects to achieve significance and get published, while high precision studies can publish without exaggerated effects, thus creating greater publication bias among lower precision studies (e. c. carter et al., 2019; stanley & doucouliagos, 2014). a weighted-least-squares regression model was fit with effect size regressed on the square of the standard error, weighted by the inverse of the variance:

g_i = β0 + β1 · se_i² + e_i (1)

the peese method estimated a non-significant benefit of self-controlled learning after controlling for publication bias, g = .054, 95% ci [-.18, .29], p = .659.

figure 5. p-curve analysis of published experiments that were statistically significant at retention. if the included experiments are studying a true null hypothesis, the expected distribution of p-values is uniform, represented by the dotted line. if the experiments are studying a true effect, the expected distribution becomes increasingly right skewed as a function of statistical power. the expected right skewed distribution associated with 33% power is plotted by the dashed line. the observed p-curve is plotted by the solid line and was substantially flatter than the 33% power distribution. the half p-curve analysis included p-values below p = .025 and was significantly right skewed. the right skew did not survive deletion of the most extreme value.

p-curve effect estimation

a p-curve model was fit to the overall retention effect size data, unlike the first primary p-curve which was fit to the reported significant results. the p-curve is a function of sample size and effect size, and because sample size is known, the effect size that provides the best fit to the observed p-curve can be estimated (simonsohn et al., 2014a). a p-curve analysis conducted with the r package dmetar (harrer et al., 2019) was used to estimate the average effect size among the statistically significant effects in the meta-analysis.
the model estimated an average effect of g = .035 (this p-curve of effect sizes was also significantly flatter than the expected 33% power curve, p = .009). the estimated power of included studies was 7%, 95% ci [5%, 22%]. unfortunately, p-curve does not perform well in the presence of heterogeneity and these results should be interpreted cautiously.

z-curve

a z-curve was fit to the overall retention data and estimated the power of statistically significant studies (the expected replication rate, err) as 12%, 95% ci [3%, 34%]. the power of all studies conducted (the expected discovery rate, edr) was estimated as 6%, 95% ci [5%, 13%]. the 95% confidence intervals for both the err and edr failed to include the observed discovery rate of 48%, suggesting significant publication bias in the data.

acquisition and transfer

in light of the evidence that experiments are apparently selected for positive self-controlled learning effects at retention, pre-planned exploratory estimates of the effect of self-controlled practice on acquisition and transfer performance can no longer be considered reliable. however, given that some have argued that transfer tests are more sensitive measures of motor learning than delayed retention tests (chiviacowsky & wulf, 2002; fairbrother et al., 2012), the transfer test data were analyzed via both naïve random effects and weight-function models. the naïve estimate at transfer was g = .52, while the bias-corrected estimate was g = .17, p = .24. as with delayed retention, the selection model provided a better fit to the transfer data than the naïve model, p = .008. the primary takeaway from these analyses is that the reported self-controlled learning effects to date are unreliable.

discussion

the primary objective of this meta-analysis was to assess the effect of providing choices during the acquisition of a motor skill on delayed retention performance in the general population.
a secondary objective was to test between motivational and informational explanations for self-controlled learning benefits by investigating whether choice-type moderates the effect of choice. to this end, an extensive search for experiments that compared self-controlled practice to a yoked comparison group was conducted. effect size and moderator data were ascertained from data reported in the research articles or, in some cases, received directly from the authors of the studies. efforts were taken to ensure that each effect size calculation and moderator code could be reproduced by an independent party. in parallel, the results of published experiments that achieved a hypothesized statistically significant result in favour of self-control were extracted directly from the articles and outlined in a p-curve disclosure form (see appendix a). pre-registered primary analyses were applied to the data and results were followed up with a suite of sensitivity analyses. the naïve random effects model estimated a benefit from self-controlled practice of g = .44. however, the naïve model fails to account for selection effects, such as publication bias and p-hacking, and as such overestimates the true average effect when these selection effects are present (e. c. carter et al., 2019; hedges & vevea, 1996; stanley & doucouliagos, 2014). publication status was a significant moderator of the self-controlled practice effect, accounting for 48% of the total heterogeneity in the model. published experiments reported an average benefit of g = .54 while unpublished experiments reported no benefit at all on average. it is possible that researchers use statistical significance, typically defined as p < .05 on a two-tailed test, to filter their results for publication.
to account for potential selection effects driven by statistical significance, a weight-function model was fit to the retention test effect size data with a one-tailed p-value cutpoint of .025 included in the model (vevea & hedges, 1995). the adjusted model provided a significantly better fit to the data than the naïve random effects model. the model estimated the selection-adjusted benefit of self-controlled learning as g = .11, a dramatic departure from the naïve estimate of g = .44. two additional bias correction techniques were conducted to assess the sensitivity of this result to changes in correction methodology. the peese method estimated the effect at g = .05, while p-curve estimated g = .04, and neither analysis was able to rule out the null hypothesis. in parallel to the meta-analysis described above, a p-curve was conducted on the reported significant results. the p-curve used somewhat different inclusion criteria, focusing only on published, statistically significant results suggesting a self-controlled learning benefit. in addition, the p-curve included results reported for any dependent measure in an article, even if the focal measure (of this meta-analysis) was reported as non-significant. therefore, the p-curve was more inclusive of evidence reported by authors as favouring a self-controlled benefit while ignoring experiments with null effects. the results revealed both significant right skew below p = .025 (two-tailed) and a p-curve that was significantly flatter than a distribution with an expected power of 33%. the evidence of right skew, indicating superiority of self-control relative to yoked conditions, was tenuous and did not survive the deletion of the most extreme result: an experiment that reported a benefit from self-control of g = 2.16 (wulf & adams, 2014). the overall p-curve produced an estimate that the true power of the included experiments was 5%, leading to a rejection of the hypothesis that the experiments contained evidential value.
it appears from these analyses that the substantial self-controlled learning literature is, as of now, insufficient to provide evidence that self-controlled practice is more effective than yoked practice. the bias correction techniques applied in this analysis are sensitive to unknown conditions, such as the true average effect size and the amount of true heterogeneity, although efforts were taken to provide coverage across most plausible conditions. the corrected estimates produced by the weight-function model, p-curve, and peese methods appeared to converge on trivially small effects. further, the p-curve of significant results suggested a lack of evidential value. based on the model performance parameters we tested (e. c. carter et al., 2019), which allowed up to .15 units of maximum bias or mean absolute error as acceptable performance, our results are consistent with a self-controlled learning benefit ranging from g = -.11 to .26, with a plausible upper 95% confidence limit of g = .33. thus, this analysis does not rule out the possibility that self-controlled practice provides meaningful motor learning benefits on average. the present literature, however, appears insufficient to establish that a self-control benefit indeed exists. turning to the current theoretical debates surrounding the motivational and informational underpinnings of self-controlled learning, these debates now seem moot, or at least premature. the effectiveness of self-control was not moderated by choice-type, suggesting that self-controlled practice may be ineffective regardless of the nature of the choices provided. indeed, the only factor we tested that moderated the effect of self-controlled practice was publication status.
future studies

given that the current meta-analysis failed to support the widely touted assertion of a substantial self-controlled learning benefit (sanli et al., 2013; ste-marie et al., 2019; wulf & lewthwaite, 2016), consideration needs to be given to the design and research practices for future studies. registered reports provide one possible path forward (caldwell et al., 2020). a registered report involves submitting a research proposal to a two-phase peer-review. the first phase of the review occurs prior to data collection and is assessed based on the proposed methodology, rationale, and potential contribution. if accepted in principle, researchers commit to carrying out the registered experiment and submitting the results in a final article for the second phase of peer-review. the final article is peer-reviewed for quality and adherence to the registered plan, but accept-reject decisions at this point are not based on the results. in theory, this practice should eliminate p-hacking and, for literatures composed entirely of registered reports, publication bias. a number of motor behaviour and/or kinesiology journals have begun adopting registered reports as an option for authors, including human movement science, frontiers in movement science and sport psychology, journal of sport and exercise psychology, journal of sports sciences, and reports in sport and exercise (formerly registered reports in kinesiology). while registered reports are a potentially fruitful process to begin the accumulation of evidence regarding self-controlled learning, there are practical issues with investigating self-controlled learning that motor learning researchers may find overly burdensome. for example, to have 80% power to detect an effect of g = .26 with a two cell experimental design, 506 participants are required. if the weight-function adjusted estimate of g = .11 is accurate, n = 2600 are required.
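the sample size figures above can be approximated with the usual two-group power formula; this sketch uses the normal (z) approximation, so it yields somewhat smaller numbers than exact t-based power software for the larger effect size:

```python
from math import ceil
from scipy.stats import norm

def total_n_two_groups(d, power=0.80, alpha=0.05):
    """Approximate total N for a two-sided, two-sample comparison of means
    with standardized effect size d (normal approximation to the t-test)."""
    z_a = norm.isf(alpha / 2)   # ~1.96 for alpha = .05
    z_b = norm.ppf(power)       # ~0.84 for 80% power
    n_per_group = 2 * ((z_a + z_b) / d) ** 2
    return 2 * ceil(n_per_group)

# total_n_two_groups(0.26) -> roughly 470 total participants
# total_n_two_groups(0.11) -> roughly 2600 total participants
```

the steep growth of n as d shrinks (quadratic in 1/d) is what makes adequately powered follow-ups of a g = .11 effect so demanding.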
more challenging still would be testing between hypothesized motivational and informational mechanisms. for example, if a 2 (choice) × 2 (choice-relevance) experiment were conducted to test whether the instructional-relevance of choice fully attenuates its effect, four times as many participants would be required to maintain the same degree of power (simonsohn, 2015). in contrast, the median sample size among experiments included in this meta-analysis was n = 36, which is typical of motor learning experiments in general (lohse et al., 2016). in addition to challenges with establishing that an effect exists, additional challenges will emerge if researchers are interested in generalizing the benefits of self-controlled practice beyond comparisons to a yoked group, as has been the case thus far (ste-marie et al., 2019; wulf & lewthwaite, 2016). yoking may allow for inferences to be made about the act of making certain choices, but it may not provide an adequate control group for evaluating best practices in an applied setting (e.g., j. a. c. barros et al., 2019; ste-marie et al., 2019; yantha et al., 2022). indeed, given that our estimate suggests the advantage of self-controlled over yoked practice is small, if it exists at all, it seems unlikely that self-control would be more effective than an instructor-guided practice. an instructor-guided group could easily be argued to have advantages over a yoked group, because of the ability for the instructor to adapt choices to the current practice context and to make use of personal experience and expertise. following this logic, experiments investigating the benefit of self-controlled over instructor-guided practice could conceivably require substantially larger samples than experiments that use yoked comparison groups.
exploratory analysis of pre-registered experiments

there have been, to our knowledge, four pre-registered experiments that have compared self-controlled and yoked practice (grand et al., 2017; mckay & ste-marie, 2022; st. germain et al., 2022; yantha et al., 2022). three of these experiments failed to meet our inclusion criteria because they were not published or part of an accepted thesis at the time of the analysis (mckay & ste-marie, 2022; st. germain et al., 2022; yantha et al., 2022). these pre-registered experiments should provide estimates of the self-control effect unbiased by selection effects and are therefore more useful for estimating the real average effect than attempting to correct biased experiments after the fact (e. c. carter et al., 2019). a random effects model was used to estimate the average effect of self-control in the four experiments and yielded g = .02, 95% ci [-.17, .21]. these results converge with the bias-corrected estimates around trivially small differences between self-controlled and yoked practice conditions.

conclusions

we set out to assess the effect of self-controlled practice on motor learning. the published literature on the subject to date appeared unambiguously supportive of a self-control benefit, yet the results of this meta-analysis suggest this may not be the case. if authors, reviewers, and editors select for statistical significance when deciding if experiments get published, the published literature becomes biased (ioannidis, 2005). worse still, filtering based on statistical significance may well incentivize researchers to leverage researcher degrees of freedom to achieve a significant result, a practice known as p-hacking, further biasing the literature (wicherts et al., 2016). an instructive example of the potential impact of selection effects comes from research studying the so-called ego-depletion effect (baumeister et al., 2007; hagger et al., 2010).
in a typical study, participants are asked to engage in activities that supposedly drain a limited reservoir of willpower, termed ego-depletion, and are subsequently measured on a dependent measure requiring an additional exertion of self-control, such as a stroop task. the typical finding is that performance suffers on the second task if ego-depletion occurs beforehand. a meta-analysis by hagger et al. (2010) reported the average effect of ego-depleting interventions on willpower dependent measures was d = .62. there was apparent consensus in the field that willpower relied on a limited resource due to the ostensibly unambiguous evidence in support of the theory (baumeister & vohs, 2016). nevertheless, when bias correction methods were applied in a meta-analysis of the ego-depletion literature, the adjusted estimates often did not differ significantly from zero (e. c. carter et al., 2015). subsequently, a pre-registered, multi-lab replication project tested a sample of n = 2141 and reported that the ego-depletion effect was close to zero (hagger et al., 2016). thus, a prominent psychological construct substantiated by a large corpus of peer-reviewed evidence was investigated using cutting-edge meta-analytic techniques that corrected for selection bias and the result was a trivially small estimated effect, an estimate supported by a subsequent large-scale pre-registered replication effort. notably, both the bias-corrected meta-analysis and the subsequent multi-lab replication efforts have been criticized by ego-depletion theorists (baumeister & vohs, 2016; cunningham & baumeister, 2016). others have sharply challenged these critiques (schimmack, 2020), and while debate continues among social psychologists about the underlying theory at stake (e.g., dang, 2018), there is consensus that several methods shown to produce positive results in the past are unlikely to replicate in future experiments.
in stark parallel to the ego-depletion literature, the findings of the current research suggest the self-controlled motor learning literature may be similarly biased. as motor learning researchers consider the path forward for self-controlled learning, non-bias related limitations of the extant literature should be addressed. for example, yoked groups fail to isolate putative motivational and informational processes when self-controlling learners make choices pertinent to acquiring a skill (m. j. carter et al., 2016; m. j. carter & ste-marie, 2017b; lewthwaite et al., 2015). further, exclusive reliance on yoked comparison groups limits the generalizability of self-controlled learning to applied settings where the alternative to self-control is typically coach or instructor control (i.e., those with domain-specific knowledge). as motor learning researchers in this area move forward, they are faced with the question of whether this effect is worth the resources required to study it. if that answer is yes, then in addition to being pre-registered and adequately powered, future self-controlled learning experiments should provide insight about either the underlying processes at work or the real-world usefulness of this practice variable.

author contact

corresponding author: brad mckay, department of kinesiology, mcmaster university, 1280 main st w, hamilton on canada, l8s 4k1. permanent e-mail: bradmckay8@gmail.com institution e-mail: mckayb9@mcmaster.ca
brad mckay 0000-0002-7408-2323
zachary d. yantha 0000-0003-1851-7609
julia hussien 0000-0001-7434-228x
michael j. carter 0000-0002-0675-4271
diane m. ste-marie 0000-0002-4574-9539

acknowledgements

all authors thank heather smith for her help with data extraction.

conflict of interest and funding

the authors declare no conflicts of interest. bm was supported by a social sciences and humanities research council (sshrc) of canada canada graduate scholarship (doctoral).
mjc was supported by a natural sciences and engineering research council (nserc) of canada discovery grant (rgpin-2018-05589).

r packages used in this project

we used r (version 4.0.4; r core team, 2021) and the r packages compute.es (del re, 2013), dmetar (version 0.0.9000; harrer et al., 2019), kableextra (version 1.3.4; zhu, 2021), meta (version 4.18.0; balduzzi et al., 2019), metafor (version 3.0.2; viechtbauer, 2010), papaja (version 0.1.0.9997; aust & barth, 2020), rcolorbrewer (neuwirth, 2014), robvis (version 0.3.0; mcguinness, 2019), tidyverse (version 1.3.0; wickham et al., 2019), and weightr (version 2.0.2; coburn & vevea, 2019).

author contributions (credit taxonomy)

conceptualization: bm, zdy, jh, mjc, dsm
data curation: bm, mjc
formal analysis: bm
funding acquisition: bm, mjc
investigation: bm, zdy, jh, mjc
methodology: bm
project administration: bm
software: bm, mjc
supervision: bm, dsm
validation: bm, mjc
visualization: bm, mjc
writing – original draft: bm, zdy, jh, mjc, dsm
writing – review & editing: bm, zdy, jh, mjc, dsm

open science practices

this article earned the preregistration+, open data and the open materials badge for preregistering the hypothesis and analysis before data collection, and for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

references marked with an asterisk (*) indicate studies included in the meta-analysis. abdollahipour, r., palomo nieto, m., psotta, r., & wulf, g. (2017). external focus of attention and autonomy support have additive benefits for motor performance in children. psychology of sport and exercise, 32, 17–24. *aiken, c.
a., fairbrother, j. t., & post, p. g. (2012). the effects of self-controlled video feedback on the learning of the basketball set shot. frontiers in psychology, 3, 338. *alami, a. (2013). an examination of feedback request strategies when learning a multi-dimensional motor task under self-controlled and yoked conditions (doctoral dissertation). university of tennessee, knoxville. *ali, a., fawver, b., kim, j., fairbrother, j., & janelle, c. m. (2012). too much of a good thing: random practice scheduling and self-control of feedback lead to unique but not additive learning benefits. frontiers in psychology, 3, 503. *andrieux, m., boutin, a., & thon, b. (2016). self-control of task difficulty during early practice promotes motor skill learning. journal of motor behavior, 48(1), 57–65. *andrieux, m., danna, j., & thon, b. (2012). self-control of task difficulty during training enhances motor learning of a complex coincidence-anticipation task. research quarterly for exercise and sport, 83(1), 27–35. *arsal, g. (2004). effects of external and self-controlled feedback schedule on retention of anticipation timing and ball throwing task (master's thesis). middle east technical university. aust, f., & barth, m. (2020). papaja: create apa manuscripts with r markdown [r package version 0.1.0.9997]. https://github.com/crsh/papaja balduzzi, s., rücker, g., & schwarzer, g. (2019). how to perform a meta-analysis with r: a practical tutorial. evidence-based mental health, 22, 153–160. https://doi.org/10.1136/ebmental-2019-300117 *barros, j. a. c., yantha, z. d., carter, m. j., hussien, j., & ste-marie, d. m. (2019). examining the impact of error estimation on the effects of self-controlled feedback. human movement science, 63, 182–198. *barros, j. a. (2010). the effects of practice schedule and self-controlled feedback manipulations on the acquisition and retention of motor skills (doctoral dissertation). university of tennessee, knoxville. bartoš, f., & schimmack, u.
(2020). z-curve.2.0: estimating replication rates and discovery rates. *bass, a. d. (2015). an experiment to chronometrically examine the effects of self-controlled feedback on the performance and learning of a sequential timing task (master's thesis). university of tennessee, knoxville. *bass, a. d. (2018). the effect of observation on motor learning in a self-controlled feedback protocol (doctoral dissertation). university of tennessee, knoxville. baumeister, r. f., & vohs, k. d. (2016). strength model of self-regulation as limited resource: assessment, controversies, update (chapter 2). in j. m. olson & m. p. zanna (eds.), advances in experimental social psychology (pp. 67–127). academic press. baumeister, r. f., vohs, k. d., & tice, d. m. (2007). the strength model of self-control. current directions in psychological science, 16(6), 351–355. *brydges, r., carnahan, h., safir, o., & dubrowski, a. (2009). how effective is self-guided learning of clinical technical skills? it's all about process. medical education, 43(6), 507–515. *bund, a., & wiemeyer, j. (2004). self-controlled learning of a complex motor skill: effects of the learner's preferences on performance and self-efficacy. journal of human movement studies, 47, 215–236. caldwell, a. r., vigotsky, a. d., tenan, m. s., radel, r., mellor, d. t., kreutzer, a., lahart, i. m., mills, j. p., boisgontier, m. p., & consortium for transparency in exercise science (cotes) collaborators. (2020). moving sport and exercise science forward: a call for the adoption of more transparent research practices. sports medicine, 50(3), 449–459. carter, e. c., kofler, l. m., forster, d. e., & mccullough, m. e. (2015). a series of meta-analytic tests of the depletion effect: self-control does not seem to rely on a limited resource.
journal of experimental psychology: general, 144(4), 796–815. carter, e. c., schönbrodt, f. d., gervais, w. m., & hilgard, j. (2019). correcting for bias in psychology: a comparison of meta-analytic methods. advances in methods and practices in psychological science, 2(2), 115–144. carter, m. j., carlsen, a. n., & ste-marie, d. m. (2014). self-controlled feedback is effective if it is based on the learner's performance: a replication and extension of chiviacowsky and wulf (2005). frontiers in psychology, 5, 1325. *carter, m. j., & patterson, j. t. (2012). self-controlled knowledge of results: age-related differences in motor learning, strategies, and error detection. human movement science, 31(6), 1459–1472. carter, m. j., rathwell, s., & ste-marie, d. (2016). motor skill retention is modulated by strategy choice during self-controlled knowledge of results schedules. journal of motor learning and development, 4(1), 100–115. carter, m. j., & ste-marie, d. m. (2017a). an interpolated activity during the knowledge-of-results delay interval eliminates the learning advantages of self-controlled feedback schedules. psychological research, 81(2), 399–406. carter, m. j., & ste-marie, d. m. (2017b). not all choices are created equal: task-relevant choices enhance motor learning compared to task-irrelevant choices. psychonomic bulletin & review, 24(6), 1879–1888. *chen, d. d., hendrick, j. l., & lidor, r. (2002). enhancing self-controlled learning environments: the use of self-regulated feedback information. journal of human movement studies, 43(1), 69. *chiviacowsky, s. (2014). self-controlled practice: autonomy protects perceptions of competence and enhances motor learning. psychology of sport and exercise, 15(5), 505–510. *chiviacowsky, s., & lessa, h. t. (2017). choices over feedback enhance motor learning in older adults. journal of motor learning and development, 5(2), 304–318. *chiviacowsky, s., & wulf, g. (2002).
Self-controlled feedback: Does it enhance learning because performers get feedback when they need it? Research Quarterly for Exercise and Sport, 73(4), 408–415.
Chiviacowsky, S., & Wulf, G. (2005). Self-controlled feedback is effective if it is based on the learner's performance. Research Quarterly for Exercise and Sport, 76(1), 42–48.
*Chiviacowsky, S., Wulf, G., de Medeiros, F. L., Kaefer, A., & Tani, G. (2008). Learning benefits of self-controlled knowledge of results in 10-year-old children. Research Quarterly for Exercise and Sport, 79(3), 405–410.
Chiviacowsky, S., Wulf, G., & Lewthwaite, R. (2012). Self-controlled learning: The importance of protecting perceptions of competence. Frontiers in Psychology, 3, 458.
*Chiviacowsky, S., Wulf, G., Lewthwaite, R., & Campos, T. (2012). Motor learning benefits of self-controlled practice in persons with Parkinson's disease. Gait & Posture, 35(4), 601–605.
*Chiviacowsky, S., Wulf, G., Machado, C., & Rydberg, N. (2012). Self-controlled feedback enhances learning in adults with Down syndrome. Revista Brasileira de Fisioterapia, 16(3), 191–196.
Chua, L.-K., Wulf, G., & Lewthwaite, R. (2018). Onward and upward: Optimizing motor performance. Human Movement Science, 60, 107–114.
Coburn, K. M., & Vevea, J. L. (2017). weightr: Estimating weight-function models for publication bias. R package version 1.2.
Coburn, K. M., & Vevea, J. L. (2019). weightr: Estimating weight-function models for publication bias [R package version 2.0.2]. https://cran.r-project.org/package=weightr
Couvillion, K. F., Bass, A. D., & Fairbrother, J. T. (2020). Increased cognitive load during acquisition of a continuous task eliminates the learning effects of self-controlled knowledge of results. Journal of Sports Sciences, 38(1), 94–99.
Cunningham, M. R., & Baumeister, R. F. (2016). How to make nothing out of something: Analyses of the impact of study sampling and statistical interpretation in misleading meta-analytic conclusions.
Frontiers in Psychology, 7, 1639.
Dang, J. (2018). An updated meta-analysis of the ego depletion effect. Psychological Research, 82(4), 645–651.
*Davis, J. (2009). Effects of self-controlled feedback on the squat (Master's thesis). State University of New York College at Cortland.
*Fagundes, J., Chen, D. D., & Laguna, P. (2013). Self-control and frequency of model presentation: Effects on learning a ballet passé relevé. Human Movement Science, 32(4), 847–856.
*Fairbrother, J. T., Laughlin, D. D., & Nguyen, T. V. (2012). Self-controlled feedback facilitates motor learning in both high and low activity individuals. Frontiers in Psychology, 3, 323.
*Ferreira, B. P., Malloy-Diniz, L. F., Parma, J. O., Nogueira, N. G. H. M., Apolinário-Souza, T., Ugrinowitsch, H., & Lage, G. M. (2019). Self-controlled feedback and learner impulsivity in sequential motor learning. Perceptual and Motor Skills, 126(1), 157–179.
*Figueiredo, L. S., Ugrinowitsch, H., Freire, A. B., Shea, J. B., & Benda, R. N. (2018). External control of knowledge of results: Learner involvement enhances motor skill transfer. Perceptual and Motor Skills, 125(2), 400–416.
*Ghorbani, S. (2019). Motivational effects of enhancing expectancies and autonomy for motor learning: An examination of the OPTIMAL theory. Journal of General Psychology, 146(1), 79–92.
*Grand, K. F., Bruzi, A. T., Dyke, F. B., Godwin, M. M., Leiker, A. M., Thompson, A. G., Buchanan, T. L., & Miller, M. W. (2015). Why self-controlled feedback enhances motor learning: Answers from electroencephalography and indices of motivation. Human Movement Science, 43, 23–32.
*Grand, K. F., Daou, M., Lohse, K. R., & Miller, M. W. (2017). Investigating the mechanisms underlying the effects of an incidental choice on motor learning. Journal of Motor Learning and Development, 5(2), 207–226.
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., Brand, R., Brandt, M. J., Brewer, G., Bruyneel, S., Calvillo, D. P., Campbell, W.
K., Cannon, P. R., Carlucci, M., Carruth, N. P., Cheung, T., Crowell, A., De Ridder, D. T. D., Dewitte, S., . . . Zwienenberg, M. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11(4), 546–573.
Hagger, M. S., Wood, C., Stiff, C., & Chatzisarantis, N. L. D. (2010). Ego depletion and the strength model of self-control: A meta-analysis. Psychological Bulletin, 136(4), 495–525.
Halperin, I., Chapman, D. W., Martin, D. T., Lewthwaite, R., & Wulf, G. (2017). Choices enhance punching performance of competitive kickboxers. Psychological Research, 81(5), 1051–1058.
Halperin, I., Wulf, G., Vigotsky, A. D., Schoenfeld, B. J., & Behm, D. G. (2018). Autonomy: A missing ingredient of a successful program? Strength & Conditioning Journal, 40(4), 18.
Hancock, G. R., Butler, M. S., & Fischman, M. G. (1995). On the problem of two-dimensional error scores: Measures and analyses of accuracy, bias, and consistency. Journal of Motor Behavior, 27(3), 241–250.
*Hansen, S., Pfeiffer, J., & Patterson, J. T. (2011). Self-control of feedback during motor learning: Accounting for the absolute amount of feedback using a yoked group with self-control over feedback. Journal of Motor Behavior, 43(2), 113–119.
Harrer, M., Cuijpers, P., Furukawa, T., & Ebert, D. D. (2019). dmetar: Companion R package for the guide 'Doing Meta-Analysis in R' [R package version 0.0.9000]. http://dmetar.protectlab.org/
*Hartman, J. M. (2007). Self-controlled use of a perceived physical assistance device during a balancing task. Perceptual and Motor Skills, 104(3 Pt 1), 1005–1016.
Hedges, L. V., & Vevea, J. L. (1996). Estimating effect size under publication bias: Small sample properties and robustness of a random effects selection model. Journal of Educational and Behavioral Statistics, 21(4), 299–332.
*Hemayattalab, R., Arabameri, E., Pourazar, M., Ardakani, M. D., & Kashefi, M. (2013).
Effects of self-controlled feedback on learning of a throwing task in children with spastic hemiplegic cerebral palsy. Research in Developmental Disabilities, 34(9), 2884–2889.
Higgins, J. P., & Green, S. (Eds.). (2011). Cochrane handbook for systematic reviews of interventions (Vol. 4). John Wiley & Sons.
Higgins, J. P., Altman, D. G., Gøtzsche, P. C., Jüni, P., Moher, D., Oxman, A. D., Savović, J., Schulz, K. F., Weeks, L., & Sterne, J. A. (2011). The Cochrane Collaboration's tool for assessing risk of bias in randomised trials. BMJ, 343.
*Ho, R. L. M. (2016). Self-controlled learning and differential goals: Does "too easy" and "too difficult" affect the self-control paradigm? (Doctoral dissertation). California State University, Long Beach.
*Holmberg, B. A. (2013). The "when" and the "what": Effects of self-control of feedback about multiple critical movement features on motor performance and learning (Doctoral dissertation). University of Tennessee, Knoxville.
Hong, S., & Reed, W. R. (2021). Using Monte Carlo experiments to select meta-analytic estimators. Research Synthesis Methods, 12(2), 192–215.
Howick, J. (2008). Against a priori judgements of bad methodology: Questioning double-blinding as a universal methodological virtue of clinical trials.
*Huet, M., Camachon, C., Fernandez, L., Jacobs, D. M., & Montagne, G. (2009). Self-controlled concurrent feedback and the education of attention towards perceptual invariants. Human Movement Science, 28(4), 450–467.
*Ikudome, S., Kou, K., Ogasa, K., Mori, S., & Nakamoto, H. (2019). The effect of choice on motor learning for learners with different levels of intrinsic motivation. Journal of Sport and Exercise Psychology, 41(3), 159–166.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124.
Iwatsuki, T., Navalta, J. W., & Wulf, G. (2019).
Autonomy enhances running efficiency. Journal of Sports Sciences, 37(6), 685–691.
*Jalalvand, M., Bahram, A., Daneshfar, A., & Arsham, S. (2019). The effect of gradual self-control of task difficulty and feedback on learning golf putting. Research Quarterly for Exercise and Sport, 90(4), 429–439.
*Janelle, C. M., Barba, D. A., Frehlich, S. G., Tennant, L. K., & Cauraugh, J. H. (1997). Maximizing performance feedback effectiveness through videotape replay and a self-controlled learning environment. Research Quarterly for Exercise and Sport, 68(4), 269–279.
Janelle, C. M., Kim, J., & Singer, R. N. (1995). Subject-controlled performance feedback and learning of a closed motor skill. Perceptual and Motor Skills, 81(2), 627–634.
*Jones, A. (2010). Effects of amount and type of self-regulation opportunity during skill acquisition on motor learning (Doctoral dissertation). McMaster University.
Jowett, N., LeBlanc, V., Xeroulis, G., MacRae, H., & Dubrowski, A. (2007). Surgical skill acquisition with self-directed practice using computer-based video training. American Journal of Surgery, 193(2), 237–242.
*Kaefer, A., Chiviacowsky, S., Meira, C. d. M., Jr., & Tani, G. (2014). Self-controlled practice enhances motor learning in introverts and extroverts. Research Quarterly for Exercise and Sport, 85(2), 226–233.
*Keetch, K. M., & Lee, T. D. (2007). The effect of self-regulated and experimenter-imposed practice schedules on motor learning for tasks of varying difficulty. Research Quarterly for Exercise and Sport, 78(5), 476–486.
*Kim, Y., Kim, J., Kim, H., Kwon, M., Lee, M., & Park, S. (2019). Neural mechanism underlying self-controlled feedback on motor skill learning. Human Movement Science, 66, 198–208.
Kvarven, A., Strømland, E., & Johannesson, M. (2020). Comparing meta-analyses and preregistered multiple-laboratory replication projects. Nature Human Behaviour, 4(4), 423–434.
*Leiker, A. M., Bruzi, A. T., Miller, M. W., Nelson, M., Wegman, R., & Lohse, K. R. (2016).
The effects of autonomous difficulty selection on engagement, motivation, and learning in a motion-controlled video game task. Human Movement Science, 49, 326–335.
*Leiker, A. M., Pathania, A., Miller, M. W., & Lohse, K. R. (2019). Exploring the neurophysiological effects of self-controlled practice in motor skill learning. Journal of Motor Learning and Development, 7(1), 13–34.
*Lemos, A., Wulf, G., Lewthwaite, R., & Chiviacowsky, S. (2017). Autonomy support enhances performance expectancies, positive affect, and motor learning. Psychology of Sport and Exercise, 31, 28–34.
*Lessa, H. T., & Chiviacowsky, S. (2015). Self-controlled practice benefits motor learning in older adults. Human Movement Science, 40, 372–380.
*Lewthwaite, R., Chiviacowsky, S., Drews, R., & Wulf, G. (2015). Choose to move: The motivational impact of autonomy support on motor learning. Psychonomic Bulletin & Review, 22(5), 1383–1388.
*Lim, S., Ali, A., Kim, W., Kim, J., Choi, S., & Radlo, S. J. (2015). Influence of self-controlled feedback on learning a serial motor skill. Perceptual and Motor Skills, 120(2), 462–474.
Lohse, K., Buchanan, T., & Miller, M. (2016). Underpowered and overworked: Problems with data analysis in motor learning studies. Journal of Motor Learning and Development, 4(1), 37–58.
*Marques, P. G., & Corrêa, U. C. (2016). The effect of learner's control of self-observation strategies on learning of front crawl. Acta Psychologica, 164, 151–156.
*Marques, P. G., Thon, R. A., Espanhol, J., Tani, G., & Corrêa, U. C. (2017). The intermediate learner's choice of self-as-a-model strategies and the eight-session practice in learning of the front crawl swim. Kinesiology, 49(1).
McGuinness, L. A. (2019). robvis: An R package and web application for visualising risk-of-bias assessments. https://github.com/mcguinlu/robvis
McKay, B., & Ste-Marie, D. M. (2022). Autonomy support via instructionally irrelevant choice not beneficial for motor performance or learning.
Research Quarterly for Exercise and Sport, 93, 64–76.
McShane, B. B., Böckenholt, U., & Hansen, K. T. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11(5), 730–749.
Neuwirth, E. (2014). RColorBrewer: ColorBrewer palettes [R package version 1.1-2]. https://cran.r-project.org/package=RColorBrewer
*Norouzi, E., Hossini, F. S., & Aghdasi, M. T. (2016). Effect of self-control feedback on the learning of a throwing task with emphasis on decision-making process. Open Science Journal of Psychology, 2(6), 32.
*Nunes, M. E. d. S., Correa, U. C., Souza, M. G. T. X. d., Basso, L., Coelho, D. B., & Santos, S. (2019). No improvement on the learning of golf putting by older persons with self-controlled knowledge of performance. Journal of Aging and Physical Activity, 27(3), 300–308.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
*Ostrowski, J. (2015). The influence of shame on the frequency of self-controlled feedback and motor learning (Master's thesis). Southern Illinois University Carbondale.
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J. M., Akl, E. A., Brennan, S. E., et al. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372.
Patall, E. A., Cooper, H., & Robinson, J. C. (2008). The effects of choice on intrinsic motivation and related outcomes: A meta-analysis of research findings. Psychological Bulletin, 134(2), 270–300.
*Patterson, J. T., Carter, M., & Sanli, E. (2011). Decreasing the proportion of self-control trials during the acquisition period does not compromise the learning advantages in a self-controlled context. Research Quarterly for Exercise and Sport, 82(4), 624–633.
*Patterson, J. T., & Carter, M. J. (2010).
Learner regulated knowledge of results during the acquisition of multiple timing goals. Human Movement Science, 29(2), 214–227.
*Patterson, J. T., Carter, M. J., & Hansen, S. (2013). Self-controlled KR schedules: Does repetition order matter? Human Movement Science, 32(4), 567–579.
*Patterson, J. T., & Lee, T. D. (2010). Self-regulated frequency of augmented information in skill learning. Canadian Journal of Experimental Psychology, 64(1), 33–40.
*Post, P. G., Aiken, C. A., Laughlin, D. D., & Fairbrother, J. T. (2016). Self-control over combined video feedback and modeling facilitates motor learning. Human Movement Science, 47, 49–59.
*Post, P. G., Fairbrother, J. T., & Barros, J. A. C. (2011). Self-controlled amount of practice benefits learning of a motor skill. Research Quarterly for Exercise and Sport, 82(3), 474–481.
*Post, P. G., Fairbrother, J. T., Barros, J. A. C., & Kulpa, J. D. (2014). Self-controlled practice within a fixed time period facilitates the learning of a basketball set shot. Journal of Motor Learning and Development, 2(1), 9–15.
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.r-project.org/
Re, A. C. D. (2013). compute.es: Compute effect sizes. https://cran.r-project.org/package=compute.es
*Rydberg, N. (2011). The effect of self-controlled practice on forearm passing, motivation, and affect in women's volleyball players (Master's thesis). University of Nevada, Las Vegas.
*Sanli, E. A., & Patterson, J. T. (2013). Learning effects of self-controlled practice scheduling for children and adults: Are the advantages different? Perceptual and Motor Skills, 116(3), 741–749.
Sanli, E.
A., Patterson, J. T., Bray, S. R., & Lee, T. D. (2013). Understanding self-controlled motor learning protocols through the self-determination theory. Frontiers in Psychology, 3, 611.
Scammacca, N., Roberts, G., & Stuebing, K. K. (2014). Meta-analysis with complex research designs: Dealing with dependence from multiple measures and multiple group comparisons. Review of Educational Research, 84(3), 328–364.
Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology, 61(4), 364–376.
Sigrist, R., Rauter, G., Riener, R., & Wolf, P. (2013). Augmented visual, auditory, haptic, and multimodal feedback in motor learning: A review. Psychonomic Bulletin & Review, 20(1), 21–53.
Simonsohn, U. (2015). [17] No-way interactions. The Winnower. https://doi.org/10.15200/winn.142559.90552
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9(6), 666–681.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547.
Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2015). Better p-curves: Making p-curve analysis more robust to errors, fraud, and ambitious p-hacking, a reply to Ulrich and Miller (2015). Journal of Experimental Psychology: General, 144(6), 1146–1152.
St. Germain, L., Williams, A., Poskus, A., Balbaa, N., Leshchyshen, O., Lohse, K. R., & Carter, M. J. (2022). Increased perceptions of autonomy through choice fail to enhance motor skill retention. Journal of Experimental Psychology: Human Perception and Performance, 48(4), 370–379.
Stanley, T. D., & Doucouliagos, H. (2014). Meta-regression approximations to reduce publication selection bias. Research Synthesis Methods, 5(1), 60–78.
Stanley, T. D., Jarrell, S. B., & Doucouliagos, H. (2010).
Could it be better to discard 90% of the data? A statistical paradox. American Statistician, 64(1), 70–77.
Ste-Marie, D. M., Carter, M. J., & Yantha, Z. D. (2019). Self-controlled learning: Current findings, theoretical perspectives, and future directions. In N. J. Hodges & A. M. Williams (Eds.), Skill acquisition in sport: Research, theory and practice (pp. 119–140). Routledge.
*Ste-Marie, D. M., Vertes, K. A., Law, B., & Rymal, A. M. (2013). Learner-controlled self-observation is advantageous for motor skill acquisition. Frontiers in Psychology, 3, 556.
*Tsai, M.-J., & Jwo, H. (2015). Controlling absolute frequency of feedback in a self-controlled situation enhances motor learning. Perceptual and Motor Skills, 121(3), 746–758.
Vevea, J. L., & Hedges, L. V. (1995). A general linear model for estimating effect size in the presence of publication bias. Psychometrika, 60(3), 419–435.
Vevea, J. L., & Woods, C. M. (2005). Publication bias in research synthesis: Sensitivity analysis using a priori weight functions. Psychological Methods, 10(4), 428–443.
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://doi.org/10.18637/jss.v036.i03
*von Lindern, A. D. (2017). Self-control effect during a reduction of feedback availability (Doctoral dissertation). University of Tennessee, Knoxville.
Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., & van Assen, M. A. L. M. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, 1832.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., . . . Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.
https://doi.org/10.21105/joss.01686
*Williams, C. K., Tseung, V., & Carnahan, H. (2017). Self-control of haptic assistance for motor learning: Influences of frequency and opinion of utility. Frontiers in Psychology, 8, 2082.
Woodard, K. F., & Fairbrother, J. T. (2020). Cognitive loading during and after continuous task execution alters the effects of self-controlled knowledge of results. Frontiers in Psychology, 11, 1046.
*Wu, W. F. W. (2007). Self-control of learning multiple motor skills (Doctoral dissertation). Louisiana State University.
*Wu, W. F. W., & Magill, R. A. (2011). Allowing learners to choose: Self-controlled practice schedules for learning multiple movement patterns. Research Quarterly for Exercise and Sport, 82(3), 449–457.
*Wulf, G., Clauss, A., Shea, C. H., & Whitacre, C. A. (2001). Benefits of self-control in dyad practice. Research Quarterly for Exercise and Sport, 72(3), 299–303.
*Wulf, G., & Toole, T. (1999). Physical assistance devices in complex motor skill learning: Benefits of a self-controlled practice schedule. Research Quarterly for Exercise and Sport, 70(3), 265–272.
Wulf, G. (2007). Self-controlled practice enhances motor learning: Implications for physiotherapy. Physiotherapy, 93(2), 96–101.
Wulf, G., & Adams, N. (2014). Small choices can enhance balance learning. Human Movement Science, 38, 235–240.
*Wulf, G., Chiviacowsky, S., & Cardozo, P. L. (2014). Additive benefits of autonomy support and enhanced expectancies for motor learning. Human Movement Science, 37, 12–20.
*Wulf, G., Chiviacowsky, S., & Drews, R. (2015). External focus and autonomy support: Two important factors in motor learning have additive benefits. Human Movement Science, 40, 176–184.
*Wulf, G., Iwatsuki, T., Machin, B., Kellogg, J., Copeland, C., & Lewthwaite, R. (2018). Lassoing skill through learner choice. Journal of Motor Behavior, 50(3), 285–292.
Wulf, G., & Lewthwaite, R. (2016). Optimizing performance through intrinsic motivation and attention for learning: The OPTIMAL theory of motor learning. Psychonomic Bulletin & Review, 23(5), 1382–1414.
Wulf, G., & Mornell, A. (2008). Insights about practice from the perspective of motor learning: A review. Music Performance Research, 2, 1–25.
*Wulf, G., Raupach, M., & Pfeiffer, F. (2005). Self-controlled observational practice enhances learning. Research Quarterly for Exercise and Sport, 76(1), 107–111.
Wulf, G., Shea, C., & Lewthwaite, R. (2010). Motor skill learning and performance: A review of influential factors. Medical Education, 44(1), 75–84.
Yantha, Z. D., McKay, B., & Ste-Marie, D. M. (2022). The recommendation for learners to be provided with control over their feedback schedule is questioned in a self-controlled learning paradigm. Journal of Sports Sciences, 40(7), 769–782.
Zhu, H. (2021). kableExtra: Construct complex table with 'kable' and pipe syntax [R package version 1.3.4]. https://cran.r-project.org/package=kableExtra

Appendix A: P-curve disclosure form

Table 2. Experiment information from papers included in the p-curve analysis.
Original paper | Quoted text from original paper indicating predicted benefit of self-control relative to yoked practice | Design | Key statistical result | Quoted text from original paper with statistical results | Result

Andrieux, Danna, & Thon (2012) | "Thus, we hypothesized that a practice condition in which the learner could set the level of task difficulty would be more beneficial for learning than a condition in which this parameter was imposed." | Two cell | Difference in means | "A follow up analysis restricted to the first two blocks revealed a significant difference between groups, F(1, 36) = 4.85, p < .05, partial eta squared = .12. Self-controlled learners were significantly more accurate (M AE = 12.73 mm, SE = 1.57) than their yoked counterparts (M AE = 18.1 mm, SE = 1.87) after a 24-hr rest." | F(1, 36) = 4.85

Andrieux, Boutin, & Thon (2016) | "Two main reasons led us to expect that self-control of nominal task difficulty would enhance motor skill learning, and especially when introduced during early practice rather than during late practice." | Four cell (full self-control, full yoked, self-control then yoked, yoked then self-control) | Difference in means | "Planned pairwise comparisons revealed that the self-control groups exhibited lower RMSE (SC + SC, SC + YO, and YO + SC groups) than their yoked group counterparts (YO + YO group), F(1, 44) = 14.02, p < .01." | F(1, 44) = 14.02

Brydges, Carnahan, Safir, & Dubrowski (2009) | "We hypothesised that participants with self-guided access to instruction would learn more than participants whose access to instruction was externally controlled." | 2 (control: self, yoked) x 2 (goals: process, outcome) | Difference in means | "The self-process group performed better on the retention test than the control-process group (Fig. 1).
This effect was significant for time taken, (F[1,23] = 4.33, p < 0.05)." | F(1,23) = 4.33

Chiviacowsky (2014) | "We hypothesized that participants of the self-controlled group would show superior motor learning than yoked participants" | Two cell | Difference in means | "The self group outperformed the yoked group. The group main effect was significant, t(26) = 2.08, p = .04, d = .78." | t(26) = 2.08

Chiviacowsky, Wulf, de Medeiros, Kaefer, & Tani (2008) | "Therefore, the purpose of the present study was to examine whether the learning benefits of self-controlled KR would generalize to children." | Two cell | Difference in means | "The self-control group had higher accuracy scores than the yoked group. This difference was significant, F(1, 24) = 4.40, p < .05." | F(1, 24) = 4.40

Chiviacowsky, Wulf, Lewthwaite, & Campos (2012) | "The potential benefits of self-controlled practice have yet to be examined in persons with PD... under the assumption that self-controlled practice would enhance the learning of the task..." | Two cell | Difference in means | "The self-control group was overall more effective than the yoked group. Time in balance was significantly longer for the self-control group, F(1, 26) = 4.25, p < .05." | F(1, 26) = 4.25

Chiviacowsky, Wulf, Machado, & Rydberg (2012) | "We predicted that self-controlled practice, in particular the ability to choose when to receive feedback, would result in more effective learning compared to a practice condition without this opportunity (yoked group)." | Two cell | Difference in means | "The day following practice, a retention test (without feedback) revealed lower AEs for the self-control group than the yoked group (see Figure 2, right).
The group difference was significant, with F(1, 28) = 4.72, p < 0.05, eta squared = .14." | F(1, 28) = 4.72

Hartman (2007) | "The primary aim of this study was to test whether there would exist a learning advantage for a self-controlled group, as opposed to a yoked control group, for learning a dynamic balance task." | Two cell | Difference in means | "To assess the relatively permanent or learning effects of practice with or without a self-controlled use of a balance pole, both groups performed a retention test on day 3. The group effect was significant, F(1, 17) = 8.29, p < .01, with the self-control group outperforming the yoked group." | F(1, 17) = 8.29

Kaefer, Chiviacowsky, Meira Jr., & Tani (2014) | "...both self-controlled groups (introverts and extroverts) will achieve a level of activation that facilitates learning through the control of stimulation source (feedback) in comparison with the groups that do not have control over it." | 2 (control: self, yoked) x 2 (personality: introvert, extrovert) | Difference in means | "The groups' main effects were detected on the factor 'feedback type': self-controlled groups performed better, F(1, 52) = 4.13, p < .05, compared with externally controlled groups" | F(1, 52) = 4.13

Leiker, Bruzi, Miller, Nelson, Wegman, & Lohse (2016) | "We hypothesized that participants in the self-controlled group would show superior learning (i.e., better performance on retention and transfer tests) compared to the yoked group." | Two cell | Difference in means | "Controlling for pre-test, there was a significant main effect of group, F(1,57) = 4.51, p = 0.04, partial eta squared = 0.07, such that participants in the self-controlled group performed better on the posttest than participants in the yoked group." | F(1,57) = 4.51

Lemos, Wulf, Lewthwaite, & Chiviacowsky (2017) | "Independent of which factor the learner is given control over – or whether or not this factor is directly related to the task to be learned – the learning benefits appear to be very robust." | Two cell
| Difference in means | "On the retention test, choice participants clearly outperformed the control group. The group main effect was significant, F(1, 22) = 88.16, p < 0.01." | F(1, 22) = 88.16

Lessa & Chiviacowsky (2015) | "...it was hypothesized that older adult participants of the self-group would demonstrate superior motor learning results, presenting faster task times on the speed cup-stacking task, when compared with participants in the yoked control group." | Two cell | Difference in means | "The analysis of the retention test revealed significant differences between groups, F(1,34) = 4.87, p < .05... with participants of the self-control group presenting faster task times compared to yoked participants." | F(1,34) = 4.87

Lewthwaite, Chiviacowsky, Drews, & Wulf (2015; Exp. 1) | "In the present experiment, the choice learners were given was not related to task performance per se. Therefore, any learning benefits resulting from having, as opposed to not having, a choice would suggest that motivational factors are responsible for those effects." | Two cell | Difference in means | "On the retention test, during which white golf balls were used, the choice group showed significantly higher putting accuracy (36.8) than the yoked group (26.4), F(1, 22) = 7.31, p < .05" | F(1, 22) = 7.31

Lewthwaite, Chiviacowsky, Drews, & Wulf (2015; Exp.
2) | "Given the potential theoretical importance of the finding in Experiment 1, we wanted to replicate it with another task and different type of choice." | Two cell | Difference in means | "On the retention test 1 day later, the choice group demonstrated significantly longer times in balance than the yoked group, F(1, 27) = 7.93, p < .01." | F(1, 27) = 7.93

Lim, Ali, Kim, Choi, & Radlo (2015) | "It was expected that a self-controlled feedback schedule would be more effective for the learning and performance of serial skills for both acquisition and retention phases than a yoked schedule." | Two cell | Difference in means | "In the retention phase, there was a significant main effect for group (F(1, 22) = 18.27, p < .05). The follow-up test indicated that the self-controlled feedback group had higher performance (Cohen's d = 6.4) than the yoked-feedback group during the retention test in both blocks." | F(1, 22) = 18.27

Patterson, Carter, & Sanli (2011; comparison 1) | "We expected that the structure of this self-controlled practice context would either add to or compromise the existing benefits attributed to a self-controlled practice context." | 2 (control: self, yoked) x 3 (structure: full, all, faded) | Difference in means | "Specifically, the self-self condition demonstrated less |CE| compared to their yoked-yoked counterparts. This main effect was significant, F(1, 18) = 8.06, p < .05." | F(1, 18) = 8.06

Patterson, Carter, & Sanli (2011; comparison 2) | "We expected that the structure of this self-controlled practice context would either add to or compromise the existing benefits attributed to a self-controlled practice context." | 2 (control: self, yoked) x 3 (structure: full, all, faded) | Difference in means | "The all-self condition demonstrated less |CE| compared to the all-yoked condition.
This main effect was also statistically significant, F(1, 18) = 4.67, p < .05." | F(1, 18) = 4.67

Patterson, Carter, & Sanli (2011; comparison 3) | "We expected that the structure of this self-controlled practice context would either add to or compromise the existing benefits attributed to a self-controlled practice context." | 2 (control: self, yoked) x 3 (structure: full, all, faded) | Difference in means | "The faded-self condition demonstrated less |CE| compared to the faded-yoked condition, supported by a main effect for group, F(1, 18) = 5.78, p < .05." | F(1, 18) = 5.78

Post, Fairbrother, Barros, & Kulpa (2014) | "It was hypothesized that learners in the SC group would demonstrate superior accuracy and form scores compared with the yoked group during the retention test." | Two cell | Difference in means | "The univariate ANOVA for retention revealed a significant group effect, F(1, 29) = 6.08, p = .020. The SC group had higher accuracy scores than the YK group" | F(1, 29) = 6.08

Ste-Marie, Vertes, Law, & Rymal (2013) | "We hypothesized that the learner controlled group would show superior physical performance of the trampoline skills... compared to the experimenter controlled group." | Two cell | Difference in means | "A separate independent samples t-test showed that the learner controlled group had significantly higher performance scores compared to the experimenter controlled group at retention, t(58) = 3.21, p < .05, d = .753." | t(58) = 3.21

Wulf & Adams (2014) | "We asked whether giving performers an incidental choice would also result in more effective learning of exercise routines." | 2 (group: self-control, yoked) x 3 (exercise: toe touch, head turn, ball pass) x 2 (leg: left, right) mixed design with repeated measures on the final two factors | Difference in means | "On the retention test... the choice group showed fewer errors than the control group.
the main effects of group, f(1,18) = 25.35, p <.001, was significant.” f(1,18) = 25.35 wulf & toole (1999) “if the beneficial effects of selfcontrol found in previous studies are more general in nature (i.e., some general mechanism responsible for these effects), learning advantage would also be expected for self-controlled use of physical assistance.” two cell difference in means “the main effect of group, f(1,24) = 4.54, p <.05, was significant. thus, allowing learners to select their own schedule of physical assistance during practice had a clearly beneficial effect on learning.” f(1,24) = 4.54 3 0 wulf, clauss, shea & whitacre (2001) “importantly, however, if self-control promotes the development of a more efficient movement technique, one should see greater movement efficiency, as indicated by delayed force onsets, in self-control as compared to yoked participants.” two cell difference in means “whereas the self-control group demonstrated relative force onsets that, on average, occurred about half the distance between the center of the apparatus and the participant’s maximum amplitude, the yoked group’s average force onset had already occurred after they had travelled less than 20% of the distance to the maximum amplitude. this group difference was significant, f(1,24) = 4.43, p <.05.” f(1,24) = 4.43 wulf, raupach & pfeiffer (2005) “thus, if the learning advantages of self-controlled practice generalize to observational practice, allowing learners to decide when they want to view a model presentation should result in enhanced retention performance, with regard to movement form and, perhaps, movement accuracy, compared to that of yoked learners.” two cell difference in means “overall, the self-control group had higher form scores than the yoked group throughout retention. the main effect of group f(1,23) = 5.16, p <.05, was significant.” f(1,23) = 5.16 wulf, iwatsuki, machin, kellogg, copeland, & lewthwaite (2017) exp 1. 
“the purpose of the present experiments was threefold. first, we deemed it important to provide further evidence for the impact of incidental choices on motor skill learning. given that self-controlled practice benefits for learning have frequently been interpreted from an information-processing perspective (e.g., carter, carlson, & ste-marie, 2014; carter & ste-marie, 2016), with limited regard for rewardingmotivational explanations, further experimental evidence for learning enhancements through choices not directly related to the task seemed desirable (experiments 1 and 2).” two cell difference in means “on the retention test one day later, the choice group demonstrated higher scores than did the control group. the group effect was significant, f(1, 29) = 5.72, p <.05.” f(1, 29) = 5.72 3 1 wulf, chiviacowsky & drews (2015) “to summarize, we hypothesized that an external focus and autonomy support would have additive benefits for motor learning (i.e., retention and transfer performance), as evidenced by main effects for each factor.” 2 (autonomy support: self, yoked) x 2 (focus: external, internal) difference in means “on the retention test, the main effect of autonomy support was significant, f(1, 64) = 6.98, p <.01.” f(1,64) = 6.98 ikudome, kuo, ogasa, mori & nakamoto (2019; exp. 2) “previous studies manipulating participants’ choice of variables relevant to the experimental task have indicated that such choices have a positive effect on motor learning due to deeper information processing by the participants. based on these studies, it is possible that this positive effect would be observed regardless of participants’ levels of intrinsic motivation, because this type of choice would not induce a change in perceived locus of causality from internal to external.” 2(choice: self, yoked) x 2 (motivation: high, low) difference in means “an ancova indicated significant main effects of choice, f(1, 39) = 8.93, p = .005.” f(1,39) = 8.93 note. 
kr = knowledge of results; pd = parkinson’s disease; sc = self-controlled.

appendix b: missing data

of the 78 experiments that met the eligibility criteria of this meta-analysis, 25 were excluded because of missing data. those 25 experiments included 13 experiments that reported a statistically significant result, along with 12 that failed to find a significant self-controlled learning effect. among the 13 experiments with missing data reporting a significant self-control benefit, one reported an inappropriate analysis (hemayattalab et al., 2013; footnote 6), one reported statistics that do not match the experimental design (jalalvand et al., 2019; footnote 7), one reported significant effects on a partial analysis of their data rather than overall (brydges et al., 2009), and one was previously identified by lohse and colleagues (2016) as an outlier study (m. j. carter & patterson, 2012). the meta-analysis may have been strengthened by the exclusion of these results (stanley et al., 2010). among the remaining nine experiments reporting a significant effect with missing data, two reported effects collapsed across immediate and delayed retention only (patterson et al., 2013; wu & magill, 2011), two reported null effects on a higher priority measure and did not include sufficient data to calculate the effect size, while reporting a significant effect on a lower priority measure (wulf et al., 2001; wulf et al., 2005; both studies were included in the primary p-curve analysis), and five compared three or more groups in an omnibus anova and reported the group effect as significant but did not include sufficient data to calculate the effect size for the self-control versus yoked comparison (chen et al., 2002; ghorbani, 2019; huet et al., 2009; janelle et al., 1997; norouzi et al., 2016).

footnote 6: although data were collected in one dimension using concentric circles, ae and a measure of dispersion were analyzed together in a manova. this measure of dispersion is not an accurate reflection of variability on a two-dimensional task for reasons described by hancock et al. (1995).

footnote 7: a subgroup analysis involving two groups (n = 15) was reported with df = 56. the article reports r2 effect sizes associated with each test that cannot be reproduced with the reported statistics or best guesses.

meta-psychology, 2023, vol 7, mp.2021.2932
https://doi.org/10.15626/mp.2021.2932
article type: replication report
published under the cc-by4.0 license
open data: yes. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: yes.
edited by: rickard carlsson
reviewed by: joachim hüffmeier, lukas röseler
analysis reproduced by: jens fust
all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/hdxny

we are all less risky and more skillful than our fellow drivers: successful replication and extension of svenson (1981)

lina koppel1, david andersson1, gustav tinghög1,2, daniel västfjäll3,4, and gilad feldman5
1 department of management and engineering, division of economics, linköping university
2 department of health, medicine and caring sciences, division of health care analysis, linköping university
3 department of behavioral sciences and learning, division of psychology, linköping university
4 decision research, eugene, or
5 department of psychology, university of hong kong, hong kong sar

abstract

the better-than-average effect refers to the tendency to rate oneself as better than the average person on desirable traits and skills. in a classic study, svenson (1981) asked participants to rate their driving safety and skill compared to other participants in the experiment. results showed that the majority of participants rated themselves as far above the median, despite the statistical impossibility of more than 50% of participants being above the median.
we report a preregistered, well-powered (total n = 1,203), very close replication and extension of the svenson (1981) study. our results indicate that the majority of participants rated their driving skill and safety as above average. we added different response scales as an extension and findings were stable across all three measures. thus, our findings are consistent with the original findings by svenson (1981). materials, data, and code are available at https://osf.io/fxpwb/.

keywords: better-than-average effect, self-evaluation, self-enhancement, replication

when people are asked to rate themselves on desirable traits and skills, most people rate themselves as above average. this is known as the better-than-average effect and has been demonstrated in a variety of domains. in one of the most well-known examples, svenson (1981) asked participants to rate their safety and skill as drivers compared to other participants in the experiment. results showed that the majority of participants rated themselves far above the median, despite the statistical impossibility of more than 50% of participants being above the median. here, we embarked on a preregistered, very close replication and extension of the svenson study to examine the replicability of the original finding.

the better-than-average effect

the better-than-average effect has been demonstrated in a variety of domains and is generally considered a manifestation of self-evaluation bias. drivers believe that they are better drivers (svenson, 1981; inspired by preston and harris, 1965), college instructors believe they are better teachers (cross, 1977), social psychologists believe they are better researchers (van lange et al., 1997), couples believe they have better marriages (rusbult et al., 2000), and undergraduates believe they have better leadership skills, athletic prowess, and ability to get along with others (brown, 1986).
people even believe that they are less biased than others, an effect known as the bias blind spot (pronin et al., 2002). a recent meta-analysis of 124 published articles found that the better-than-average effect was large and robust across studies (zell et al., 2020). although it is closely related to several other biases, including unrealistic optimism (predicting that positive outcomes are more likely and negative outcomes are less likely to happen to oneself compared to others; shepperd et al., 2013; weinstein, 1980) and the dunning-kruger effect (overestimating the rank of one’s performance compared to objective measures; dunning, 2011), the better-than-average effect is unique in that it involves comparing the present self to an average other on a relatively enduring attribute or skill. much research has been dedicated to finding boundary conditions and explanations for the effect (for reviews, see alicke and govorun, 2005; chambers and windschitl, 2004; moore and healy, 2008; sedikides and alicke, 2012, 2019; sedikides and gregg, 2008; zell et al., 2020). yet, to the best of our knowledge, no direct replications exist of the original finding by svenson (1981). the importance of replicability has received increasing recognition in the field of psychological science over the past few years (e.g., asendorpf et al., 2013; brandt et al., 2014; camerer et al., 2018; nosek and errington, 2020; nosek et al., 2021; open science collaboration, 2015; zwaan et al., 2017). replication is considered a cornerstone of science, yet it is only recently that researchers have begun to systematically investigate the replicability of published findings. here we revisit this classic phenomenon to examine the replicability of the original finding with an independent replication.
choice of study for replication

we chose the svenson (1981) study based on two factors: absence of direct replications and impact. to the best of our knowledge, there are no published direct replications of this study thus far.1 the article has had significant impact on scholarly research in several areas of psychology, including social psychology and judgment and decision making. at the time of writing, there were 2,112 citations of the article in google scholar.

findings in the original article

in the original study by svenson (1981), participants were asked to rate either their skill or their safety as drivers in relation to other participants in the experiment. data were collected in sweden (n = 80) and in the us (n = 81) in lab experiments. the results indicated that the majority of participants regarded themselves as more skillful and less risky than the average driver in each group respectively. among the swedish participants, 77% ranked their safety as above average and 69% ranked their skill as above average. among the us participants, 88% ranked their safety as above average and 93% ranked their skill as above average.

adjustments and extensions

we had to make several adjustments to the original design. first, rather than including two different samples for the two different questions, we ran the questions together in a within-subjects design that would allow us to compare the effects of the two questions and their associated dependent variables. second, we had to adjust the questionnaire to match the target sample: online american amazon mechanical turk (mturk) workers. we first introduced a few verification questions to ensure that workers were drivers. because our study was conducted online, we also had to adjust the reference group. we chose to focus on the participant’s us state of residence as the reference group. third, we had issues with reproducing the question used to elicit rankings and so had to make adjustments.
when doing so, we noticed issues with the 10 categories used for percentile ranks (e.g., the midpoint is grouped as 41–50%, and the first category includes a range of 11 percentiles compared to other categories with a range of 10). we therefore added an extension and chose to randomize the dependent variable question across three designs: (1) our best estimate of what the target article used, (2) an adjusted 11-item scale with a mid-point indicated as 50% (average), and (3) a simple 7-item likert scale asking participants to compare to the average. we compared effects across the three designs. thus, the use of three different response scales helps to check the robustness of the effect, as minor methodological features can influence the results of hypothesis tests (e.g., baribault et al., 2018; landy et al., 2020).

method

we report all measures, conditions, data exclusions, and how we determined sample size.

participants

a total of 1,203 american amazon mechanical turk (mturk) participants completed the study using turkprime.com (mean age = 40.40, sd = 12.21; 641 females). a comparison of the target article sample and the replication sample is provided in table 1. an a priori power analysis in g*power 3.1 (exact test, two-tailed, with 95% power) indicated that 90 participants were needed to detect the smallest effect size from the original paper, cohen’s g = 0.19 (see supplementary materials). however, a sample size of n = 90 is smaller than the sample size in the original study (n = 161) and is based on an effect size estimate that might be larger than the true effect size. therefore, we decided to follow suggestions from simonsohn (2015) and aim for 2.5 times the original sample size. the data collection was combined with data collection for a different study (see chen et al., 2021, experiment 2) that required a much larger sample size (studies displayed in randomized order).
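the power analysis above can also be reproduced, at least approximately, outside g*power. a minimal python sketch (scipy assumed available; this is not the authors’ code): cohen’s g is a proportion’s deviation from .50, so g = 0.19 corresponds to a true proportion of .69, and the power of a two-sided exact binomial test at n = 90 can be summed directly over the rejection region.

```python
from scipy.stats import binom, binomtest

def exact_binomial_power(n, p_true, p_null=0.5, alpha=0.05):
    """power of a two-sided exact binomial test: total probability,
    under p_true, of all outcomes whose p-value falls below alpha."""
    return sum(
        binom.pmf(k, n, p_true)
        for k in range(n + 1)
        if binomtest(k, n, p_null).pvalue < alpha
    )

# cohen's g = 0.19 corresponds to a true proportion of 0.50 + 0.19 = 0.69
power = exact_binomial_power(n=90, p_true=0.69)
print(round(power, 3))
```

at n = 90 and p_true = .69 this yields power of roughly 95%, consistent with the g*power result reported above.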
participants first consented to participate in the study and were then asked verification questions regarding having a driver’s license, year and location of license, and state of residence.1 participants (n = 84) who indicated they did not have a driver’s license were filtered out.

table 1: differences and similarities between samples in original study and replication (svenson, 1981 vs. replication)
sample size: 161 vs. 1,203
geographic origin: us american & sweden vs. us american
gender: unknown vs. 562 males, 641 females
median age (years): 22 (us), 33 (sweden) vs. 37
average age (years): unknown vs. 40.40
age range (years): unknown vs. 18–87
medium (location): lab (sweden & us) vs. computer (online)
compensation: unknown vs. nominal payment
year: before 1981 vs. 2019

footnote 1: at the time of writing, a google scholar search for the term “replication” within works citing svenson (1981) yielded 259 results, but of these, we found none that would count as a very close replication in the framework of lebel et al. (2017) while also having high statistical power. there are some studies that essentially replicate the original study (groeger and brown, 1989; svenson et al., 1985); however, sample sizes are relatively small and there are methodological differences, especially in terms of the response scale format. in sum, although not a perfect method, we take the results of our search as a strong likelihood that no close replication has been conducted.

procedure

participants indicated how safe and how skilled they were as drivers (both questions included, displayed in random order). they then answered a funneling section and provided demographic information (age, gender, country of birth, family social class, english understanding of study), before being debriefed.

measures

there were two dependent variables: driving safety and driving skill. the question about safety was phrased as follows: we would like to know what you think about how safely you drive an automobile. all drivers are not equally safe drivers.
we want you to compare your own skill to the skills of other people in your state. by definition, there is a least safe and a most safe driver. we want you to indicate your own estimated position among the people in your state. of course, this is a difficult question because you do not know all the people in your state, much less how safely they drive. but please make the most accurate estimate you can.

the question about skill was phrased as follows: we would like to know what you think about how skilled you are at driving an automobile. all drivers are not equally skilled drivers. we want you to compare your own skill to the skills of other people in your state. by definition, there is a least skilled and a most skilled driver. we want you to indicate your own estimated position among the people in your state. of course, this is a difficult question because you do not know all the people in your state, much less how skilled drivers they are. but please make the most accurate estimate you can.

for each question, participants indicated their driving safety/skill compared to the average driver in their state using one of the three following response scales:

1. reproduced materials: “please indicate how [safely you drive/skilled you are] compared to others by marking your estimated position among drivers in your state” in 10 categories from 0–10% (least safe/skilled drivers) [...] 41–50%, 51–60%, [...] 91–100% (most safe/skilled drivers).

2. 11-item scale to include midpoint (changes underlined in the original materials): same question as above but with the following scale: 0–9% (least safe/skilled drivers) [...] 40–49%, 50% (average), 51–60%, [...] 91–100% (most safe/skilled drivers).

3. standard comparison 7-item likert scale: “please indicate how [safely you drive/skilled you are] compared to others by marking your estimated position compared to other drivers in your state” (1 = far below average; 4 = average; 7 = far above average).
evaluation criteria for replication

table 2 provides a classification of the replication using criteria by lebel et al. (2017). we summarize the replication as a “very close replication”. we compare the replication effects with the original effects in the target article using criteria from lebel et al. (2019).

table 2: classification of the replication, based on lebel et al. (2017)
iv operationalization: same
dv operationalization: same
iv stimuli: same
dv stimuli: same
procedural details: different
physical settings: different
contextual variables: different
replication classification: very close replication

data analysis

the original article did not include any statistical tests, and the scale and design make it difficult to conduct such a test. yet, our best estimate of an appropriate analysis is to compute the percentage of participants who answered in the categories above 50% and compare it to an expected 50% (binomial test). for the likert scale, we conducted a one-sample t-test comparing to the mean of 4, the scale midpoint. we examined normality in the distribution of frequencies, including parameters of skewness and kurtosis. analysis code can be found in the supplementary materials.

results

replication

descriptive statistics of all measures are presented in table 3. statistical tests of the hypotheses are summarized in tables 4–5 and plotted in figures 1–3. the medians for the distributions of safety judgments in table 3 fall in the interval 71–80% for both percentile category response scales. this indicates that half of the participants believed themselves to be among the safest 30 percent of drivers. over 90% of participants (93% for the reproduced materials and 91% for the adjusted materials with a 50% midpoint) believed themselves to be safer than the median driver. binomial tests against test proportion 0.50 (two-tailed) indicated that this effect was statistically significant, ps < .001 (see table 4).
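the two tests described above are standard; the authors’ own analysis code (in r) is in the supplementary materials. a minimal python sketch of the same logic, assuming scipy is available (the binomial counts are those reported in table 4; the likert ratings are simulated for illustration, not the study data):

```python
import numpy as np
from scipy.stats import binomtest, ttest_1samp

# binomial test: is the proportion rating themselves above the median
# different from 0.5? counts for the reproduced safety scale as reported
# in table 4: 386 participants above the median categories, 27 below.
res = binomtest(k=386, n=386 + 27, p=0.5, alternative='two-sided')
print(res.pvalue)  # < .001

# one-sample t-test of likert ratings against the scale midpoint of 4.
# illustrative simulated ratings (integers 3..7), skewed above the midpoint:
rng = np.random.default_rng(1)
ratings = rng.integers(3, 8, size=387)
t, p = ttest_1samp(ratings, popmean=4)
```

with counts as lopsided as 386 vs. 27, the binomial p-value is far below .001, matching the ps < .001 reported above.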
in comparison, the original study found that the medians for the distributions of safety judgments fell in the interval 81–90% for the us group and 71–80% for the swedish group, indicating that half of the participants believed themselves to be among the safest 20 (us) or 30 (sweden) percent of the drivers in the two groups respectively. 88% in the us group and 77% in the swedish group believed themselves to be safer than the median driver.

the medians for the distributions of skill judgments in table 3 fall in the interval 71–80% (both for the reproduced and for the adjusted materials). this indicates that half of the participants believed themselves to be among the most skilled 30 percent of drivers. 91% (for the reproduced materials) and 78% (for the adjusted materials) believed themselves to be more skilled than the median driver. binomial tests against test proportion 0.50 (two-tailed) indicated that this effect was statistically significant, ps < .001 (see table 4). in comparison, the original study found that the medians for the distributions of skill judgments fell in the interval 61–70% for the us group and 51–60% for the swedish group. 93% in the us sample and 69% in the swedish sample believed themselves to be more skilled than the median driver.

when participants rated themselves on a 7-item likert response scale, they also rated themselves as significantly safer than average, m = 5.50 (sd = 1.08), t(386) = 27.28, p < .001, g = 1.39, 95% ci [1.25, 1.53], and more skilled than average, m = 5.28 (sd = 1.08), t(378) = 23.13, p < .001, g = 1.19, 95% ci [1.06, 1.32] (see table 5).

extensions

figure 4 shows the effect size (hedges’s g) and 95% confidence intervals for each rating scale. for skills ratings, the cis are overlapping in all cases, suggesting no evidence for a difference in the size of the better-than-average effect depending on the type of rating scale used.
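the likert-scale effect sizes reported above can be recomputed from the summary statistics alone. a minimal sketch of hedges’s g for a one-sample design: the standardized mean difference from the midpoint of 4, multiplied by the usual small-sample correction j = 1 - 3/(4*df - 1). this is the standard textbook formula, not the authors’ code.

```python
def hedges_g_one_sample(mean, sd, n, mu):
    """cohen's d against mu, with hedges' small-sample correction
    j = 1 - 3 / (4*df - 1), where df = n - 1."""
    d = (mean - mu) / sd
    df = n - 1
    return d * (1 - 3 / (4 * df - 1))

# safety ratings on the likert scale, summary values as reported above
g = hedges_g_one_sample(mean=5.50, sd=1.08, n=387, mu=4)
print(round(g, 2))  # 1.39
```

the result matches the reported g = 1.39 for safety; the skill value lands within rounding error of the reported 1.19, since m and sd are themselves rounded.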
for safety ratings, the cis for the two scales that involve percentile categories are overlapping, but the cis for the likert scale are slightly lower, suggesting a slightly smaller better-than-average effect when safety is rated on a likert scale. nevertheless, the effect is very large in all cases. for effect sizes, confidence intervals, and important study characteristics of the replication, original study, and meta-analysis by zell et al. (2020), see supplementary table s7.

figure 5 shows the mean safety and skills ratings in each state (excluding states with fewer than 5 responses). we find no obvious pattern in the effect across states. however, some states had very few observations and cis are generally very large, which complicates interpretation. therefore, we chose not to analyze this data further.

exploratory analyses (not pre-registered)

a series of exploratory ols regressions were run to investigate whether participants’ gender, age, and driving experience (i.e., years since driver’s license was obtained) predicted their ratings of driving safety and skill. the regressions also included item order (i.e., whether participants rated safety or skills first) and study order (i.e., whether participants completed this study or the study reported in chen et al., 2021 first). the analyses revealed that age and driving experience were associated with both safety and skills ratings, such that the rating increased with increasing age and experience (see supplementary tables s1–s6). in addition, there was a significant link between gender and safety ratings using the likert scale and between gender and skills ratings using the likert scale and the adjusted materials, indicating that women rated themselves lower. however, there was no such link in the other scales; thus, the results involving gender seem to depend on the response scale format and item content. including item order and study order in the regression analyses did not alter the interpretation of the effects of gender, age, and driving experience. item order and study order also had no consistent effect on participants’ ratings, although completing the svenson (1981) replication first was associated with higher safety ratings in one of the scales (the reproduced materials) and rating safety before skills was associated with lower safety ratings in another (the likert scale; see supplementary tables s1–s6).

table 3: proportion of participants in each category
panel a: percentile categories used in original study
categories: 0–10, 11–20, 21–30, 31–40, 41–50, 51–60, 61–70, 71–80, 81–90, 91–100
safety (n = 413): 0.2%, 0.0%, 0.5%, 1.9%, 3.9%, 7.5%, 15.3%, 24.2%, 28.3%, 18.2%
skill (n = 405): 0.2%, 0.0%, 0.5%, 2.2%, 5.7%, 14.6%, 16.0%, 26.2%, 21.7%, 12.8%
panel b: 10-percentile categories with 50% midpoint
categories: 0–9, 10–19, 20–29, 30–39, 40–49, 50, 51–60, 61–70, 71–80, 81–90, 91–100
safety (n = 403): 0.0%, 0.2%, 0.5%, 1.2%, 1.5%, 6.0%, 8.2%, 13.2%, 26.6%, 29.5%, 13.2%
skill (n = 419): 0.0%, 0.2%, 1.2%, 2.9%, 2.9%, 14.6%, 10.0%, 16.2%, 25.5%, 18.4%, 8.1%
panel c: likert response scale
categories: 1, 2, 3, 4, 5, 6, 7
safety (n = 387): 0.5%, 2.6%, 1.8%, 16.3%, 25.3%, 38.5%, 17.3%
skill (n = 379): 0.0%, 1.3%, 3.2%, 20.6%, 25.9%, 39.3%, 9.8%

table 4: summary of statistical tests for the items with percentile categories
percentile categories used in original study:
safety: n (>50) = 386, n (<50) = 27; observed prop. = 0.94 vs. 0.06, 95% ci [0.91, 0.96]; test prop. = 0.50; p < .001; interpretation: signal – consistent
skill: n (>50) = 370, n (<50) = 35; observed prop. = 0.91 vs. 0.09, 95% ci [0.88, 0.94]; test prop. = 0.50; p < .001; interpretation: signal – consistent
10-percentile categories with 50% midpoint:
safety: n (>50) = 365, n (<50) = 38; observed prop. = 0.91 vs. 0.09, 95% ci [0.87, 0.93]; test prop. = 0.50; p < .001; interpretation: signal – consistent
skill: n (>50) = 328, n (<50) = 91; observed prop. = 0.78 vs. 0.22, 95% ci [0.74, 0.82]; test prop. = 0.50; p < .001; interpretation: signal – consistent
note. binomial tests comparing the percentage of participants who rated their driving safety and skill as above average to an expected 50%.
nevertheless, the regression results address the question of whether gender, age, and driving experience are associated with participants’ ratings of driving safety and skill; they do not address the question of whether gender, age, and driving experience affect whether participants rate themselves above average. because the vast majority of participants rated themselves as above average, we did not conduct such an analysis. finally, we investigated the correlation between skills and safety ratings in the three response scales. this analysis indicated that participants’ skills ratings were positively correlated with their safety ratings in all three scales (original scale: tau = .52, p < .001, n = 122; adjusted scale with 50% midpoint: tau = .48, p < .001, n = 136; likert scale: tau = .47, p < .001, n = 121).

table 5: summary of statistical tests for the likert scale
safety: t = 27.38, df = 386, p < .001, mean diff = 1.50, 95% ci [1.40, 1.61], hedges’s g = 1.39, 95% ci [1.25, 1.53]; interpretation: signal – consistent
skill: t = 23.13, df = 378, p < .001, mean diff = 1.28, 95% ci [1.17, 1.39], hedges’s g = 1.19, 95% ci [1.06, 1.32]; interpretation: signal – consistent
note. one-sample t-test, test value: 4.

figure 1: proportion of participants in each percentile category of safety ratings and skills ratings, using the same percentile categories as the original article (safety ratings: mean = 8.1; skills ratings: mean = 7.7; plotted test value = 5.5).

figure 2: proportion of participants in each percentile category of safety ratings and skills ratings, using the adjusted percentile categories.
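the exploratory skills–safety correlations above use kendall’s tau. a minimal python sketch with simulated paired ratings (scipy assumed available; the data below are illustrative only, not the study data):

```python
import numpy as np
from scipy.stats import kendalltau

# simulated paired ratings on a 7-point scale: safety tracks skill with a
# little noise, so the two should correlate positively, as in the
# reported exploratory analysis.
rng = np.random.default_rng(0)
skill = rng.integers(1, 8, size=120)
safety = np.clip(skill + rng.integers(-1, 2, size=120), 1, 7)

tau, p = kendalltau(skill, safety)
```

kendall’s tau is a rank correlation, which suits these coarse ordinal scales better than pearson’s r.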
figure 2 panel statistics: safety ratings mean = 8.9, skills ratings mean = 8.2 (plotted test value = 6.0).

figure 3: proportion of participants in each percentile category of safety ratings and skills ratings, using the likert scale (safety ratings: mean = 5.5; skills ratings: mean = 5.3; plotted test value = 4.0).

figure 4: effect sizes (hedges’s g) and 95% cis for each rating scale.

discussion

we embarked on a preregistered replication and extension of a classic phenomenon in the judgment and decision-making literature known as the better-than-average effect. the original article found that the majority of participants reported that they were safer and more skilled than the average driver (svenson, 1981). the findings from our replication are consistent with the original findings. that is, the majority of participants rated their driving safety and skill as above the median. results were stable across three different response scales: our best estimate of the original materials, an adjusted scale with a 50% midpoint, and a 7-item likert scale. our replication adds to a larger literature investigating the replicability of published research in psychological science (e.g., camerer et al., 2018; open science collaboration, 2015). importantly, our study design closely follows the original study by svenson (1981) and thereby classifies as a very close replication according to replication criteria by lebel et al. (2017).
recently, ziano et al. (2020) conducted a replication of another classic study on the better-than-average effect (alicke, 1985), which indicated that college students’ ratings of how characteristic a trait was of them (vs. an average student) increased with increasing desirability of the trait, and that this effect was stronger among more controllable traits. findings from ziano et al. (2020) were consistent with the original findings. in sum, findings from the present study are in line with the view of the better-than-average effect as a robust phenomenon.
tennessee (n= 11 ) washington (n= 7 ) wisconsin (n= 6 ) new york (n= 26 ) north carolina (n= 19 ) california (n= 29 ) minnesota (n= 6 ) ohio (n= 14 ) indiana (n= 8 ) texas (n= 21 ) virginia (n= 17 ) arizona (n= 13 ) alabama (n= 9 ) 6 8 10 12 safety rating using adjusted scale with midpoint louisiana (n= 6 ) oregon (n= 8 ) illinois (n= 27 ) colorado (n= 7 ) new york (n= 27 ) missouri (n= 9 ) north carolina (n= 14 ) new jersey (n= 8 ) nevada (n= 6 ) florida (n= 31 ) pennsylvania (n= 28 ) michigan (n= 25 ) minnesota (n= 6 ) arizona (n= 14 ) georgia (n= 22 ) oklahoma (n= 8 ) massachusetts (n= 5 ) tennessee (n= 7 ) virginia (n= 15 ) california (n= 37 ) south carolina (n= 5 ) indiana (n= 6 ) texas (n= 25 ) wisconsin (n= 6 ) ohio (n= 16 ) 4 6 8 10 skill rating using adjusted scale with midpoint maryland (n= 5 ) missouri (n= 10 ) virginia (n= 10 ) georgia (n= 13 ) kentucky (n= 8 ) tennessee (n= 8 ) pennsylvania (n= 18 ) new york (n= 25 ) ohio (n= 11 ) wisconsin (n= 11 ) illinois (n= 21 ) indiana (n= 5 ) massachusetts (n= 5 ) minnesota (n= 5 ) north carolina (n= 16 ) texas (n= 24 ) new jersey (n= 11 ) connecticut (n= 5 ) oklahoma (n= 5 ) florida (n= 35 ) michigan (n= 17 ) colorado (n= 7 ) california (n= 30 ) west virginia (n= 5 ) oregon (n= 11 ) nevada (n= 7 ) washington (n= 8 ) arizona (n= 11 ) south carolina (n= 6 ) 3 4 5 6 7 safety rating using likert scale tennessee (n= 10 ) missouri (n= 11 ) connecticut (n= 8 ) michigan (n= 20 ) hawaii (n= 6 ) kentucky (n= 8 ) nevada (n= 7 ) texas (n= 25 ) wisconsin (n= 7 ) south carolina (n= 5 ) ohio (n= 11 ) minnesota (n= 7 ) virginia (n= 17 ) florida (n= 28 ) georgia (n= 9 ) illinois (n= 14 ) massachusetts (n= 8 ) california (n= 29 ) utah (n= 5 ) north carolina (n= 17 ) pennsylvania (n= 17 ) maryland (n= 6 ) oregon (n= 12 ) new york (n= 23 ) arizona (n= 10 ) washington (n= 14 ) indiana (n= 6 ) mississippi (n= 6 ) new jersey (n= 8 ) 4 5 6 7 skill rating using likert scale figure 5 mean safety and skills ratings in each state 
(excluding states with fewer than 5 observations). error bars represent 95% cis.

author contact

lina koppel, orcid: 0000-0002-6302-0047. gustav tinghög, orcid: 0000-0002-8159-1249. daniel västfjäll, orcid: 0000-0003-2873-4500. correspondence: gilad feldman, department of psychology, university of hong kong, hong kong sar; gfeldman@hku.hk; orcid: 0000-0003-2812-6599

conflict of interest and funding

the author(s) declared no potential conflicts of interests with respect to the authorship and/or publication of this article. the author(s) received no financial support for the research and/or authorship of this article.

author contributions

role: lk da gt dv gf
conceptualization: x x x x x
pre-registration: x x x x x
data curation: x
formal analysis: x x x
funding acquisition: x
investigation: x x x x x
pre-registration peer review/verification: x x x x x
data analysis peer review/verification: x x x x x
methodology: x x x x x
project administration: x x
resources: x
software: x x x x x
supervision: x
validation: x x x x x
visualization: x x x
writing – original draft: x
writing – review and editing: x x x x x

target article

svenson, o. (1981). are we all less risky and more skillful than our fellow drivers? acta psychologica, 47, 143–148. https://doi.org/10.1016/0001-6918(81)90005-6

links to project files

project page on osf with datasets and code: https://osf.io/fxpwb/. pre-registration (including materials and analysis code): https://osf.io/jky24.

acknowledgments

we thank members of the jedi lab for valuable contributions to this study during a workshop at linköping university in august 2019.

open science practices

this article earned the preregistration+, open data and the open materials badge for preregistering the hypothesis and analysis before data collection, and for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article.
the entire editorial process, including the open reviews, is published in the online supplement.

references

alicke, m. d. (1985). global self-evaluation as determined by the desirability and controllability of trait adjectives. journal of personality and social psychology, 49(6), 1621–1630. https://doi.org/10.1037/0022-3514.49.6.1621

alicke, m. d., & govorun, o. (2005). the better-than-average effect. in m. d. alicke, d. a. dunning, & l. e. krueger (eds.), the self in social judgment (pp. 85–106). psychology press.

asendorpf, j. b., conner, m., de fruyt, f., de houwer, j., denissen, j. j., fiedler, k., fiedler, s., funder, d. c., kliegl, r., nosek, b. a., perugini, m., roberts, b. w., schmitt, m., van aken, m. a., weber, h., & wicherts, j. m. (2013). recommendations for increasing replicability in psychology. european journal of personality, 27(2), 108–119. https://doi.org/10.1002/per.1919

baribault, b., donkin, c., little, d. r., trueblood, j. s., oravecz, z., van ravenzwaaij, d., white, c. n., de boeck, p., & vandekerckhove, j. (2018). metastudies for robust tests of theory. proceedings of the national academy of sciences of the united states of america, 115(11), 2607–2612. https://doi.org/10.1073/pnas.1708285114

brandt, m. j., ijzerman, h., dijksterhuis, a., farach, f. j., geller, j., giner-sorolla, r., grange, j. a., perugini, m., spies, j. r., & van 't veer, a. (2014). the replication recipe: what makes for a convincing replication? journal of experimental social psychology, 50(1), 217–224. https://doi.org/10.1016/j.jesp.2013.10.005

brown, j. d. (1986). evaluations of self and others: self-enhancement biases in social judgments. social cognition, 4(4), 353–376. https://doi.org/10.1521/soco.1986.4.4.353

camerer, c. f., dreber, a., holzmeister, f., ho, t. h., huber, j., johannesson, m., kirchler, m., nave, g., nosek, b. a., pfeiffer, t., altmejd, a., buttrick, n., chan, t., chen, y., forsell, e., gampa, a., heikensten, e., hummer, l., imai, t., . . . wu, h. (2018). evaluating the replicability of social science experiments in nature and science between 2010 and 2015. nature human behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z

chambers, j. r., & windschitl, p. d. (2004). biases in social comparative judgments: the role of nonmotivated factors in above-average and comparative-optimism effects. psychological bulletin, 130(5), 813–838. https://doi.org/10.1037/0033-2909.130.5.813

chen, j., hui, l. s., yu, t., feldman, g., zeng, s., ching, t. l., ng, c. h., wu, k. w., yuen, c. m., lau, t. k., cheng, b. l., & ng, k. w. (2021). foregone opportunities and choosing not to act: replications of inaction inertia effect. social psychological and personality science, 12(3), 333–345. https://doi.org/10.1177/1948550619900570

cross, k. p. (1977). not can, but will college teaching be improved? new directions for higher education, 17, 1–15. https://doi.org/10.1002/he.36919771703

dunning, d. (2011). the dunning-kruger effect: on being ignorant of one's own ignorance. in j. m. olson & m. p. zanna (eds.), advances in experimental social psychology (pp. 247–296). academic press. https://doi.org/10.1016/b978-0-12-385522-0.00005-6

groeger, j., & brown, i. (1989). assessing one's own and others' driving ability: influences of sex, age, and experience. accident analysis & prevention, 21(2), 155–168. https://doi.org/10.1016/0001-4575(89)90083-3

landy, j., jia, m., ding, i., viganola, d., tierney, w., dreber, a., johanneson, m., pfeiffer, t., ebersole, c., gronau, q., ly, a., van den bergh, d., marsman, m., derks, k., wagenmakers, e.-j., proctor, a., bartels, d. m., bauman, c. w., brady, w. j., . . . uhlmann, e. l. (2020). crowd-sourcing hypothesis tests: making transparent how design choices shape research results. ssrn electronic journal, 146(5), 451–479. https://doi.org/10.2139/ssrn.3654406

lebel, e., berger, d., campbell, l., & loving, t. (2017). falsifiability is not optional. journal of personality and social psychology, 113(2), 254–261. https://doi.org/10.1037/pspi0000106

lebel, e., vanpaemel, w., cheung, i., & campbell, l. (2019). a brief guide to evaluate replications. meta-psychology, 3. https://doi.org/10.15626/mp.2018.843

moore, d. a., & healy, p. j. (2008). the trouble with overconfidence. psychological review, 115(2), 502–517. https://doi.org/10.1037/0033-295x.115.2.502

nosek, b. a., & errington, t. m. (2020). what is replication? plos biology, 18(3), 1–8. https://doi.org/10.1371/journal.pbio.3000691

nosek, b. a., hardwicke, t. e., corker, k. s., & rohrer, j. (2021). replicability, robustness, and reproducibility in psychological science. annual review of psychology. https://doi.org/10.31234/osf.io/ksfvq

open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

preston, c. e., & harris, s. (1965). psychology of drivers in traffic accidents. journal of applied psychology, 49(4), 284–288. https://doi.org/10.1037/h0022453

pronin, e., lin, d. y., & ross, l. (2002). the bias blind spot: perceptions of bias in self versus others. personality and social psychology bulletin, 28(3), 369–381. https://doi.org/10.1177/0146167202286008

rusbult, c. e., van lange, p. a., wildschut, t., yovetich, n. a., & verette, j. (2000).
perceived superiority in close relationships: why it exists and persists. journal of personality and social psychology, 79(4), 521–545. https://doi.org/10.1037/0022-3514.79.4.521

sedikides, c., & alicke, m. d. (2012). self-enhancement and self-protection motives. in r. m. ryan (ed.), oxford handbook of motivation (pp. 303–322). oxford university press.

sedikides, c., & alicke, m. d. (2019). the five pillars of self-enhancement and self-protection. in m. ryan (ed.), the oxford handbook of human motivation (2nd ed., pp. 307–319). oxford university press.

sedikides, c., & gregg, a. p. (2008). self-enhancement: food for thought. perspectives on psychological science, 3(2), 102–116. https://doi.org/10.1111/j.1745-6916.2008.00068.x

shepperd, j. a., klein, w. m., waters, e. a., & weinstein, n. d. (2013). taking stock of unrealistic optimism. perspectives on psychological science, 8(4), 395–411. https://doi.org/10.1177/1745691613485247

simonsohn, u. (2015). small telescopes: detectability and the evaluation of replication results. psychological science, 26(5), 559–569. https://doi.org/10.1177/0956797614567341

svenson, o. (1981). are we all less risky and more skillful than our fellow drivers? acta psychologica, 47, 143–148. https://doi.org/10.1016/0001-6918(81)90005-6

svenson, o., fischhoff, b., & macgregor, d. (1985). perceived driving safety and seatbelt usage. accident analysis and prevention, 17(2), 119–133. https://doi.org/10.1016/0001-4575(85)90015-6

van lange, p. a., taris, t. w., & vonk, r. (1997). dilemmas of academic practice: perceptions of superiority among social psychologists. european journal of social psychology, 27(6), 675–685. https://doi.org/10.1002/(sici)1099-0992(199711/12)27:6<675::aid-ejsp838>3.0.co;2-f

weinstein, n. d. (1980). unrealistic optimism about future life events. journal of personality and social psychology, 39(5), 806–820. https://doi.org/10.1037/0022-3514.39.5.806

zell, e., strickhouser, j. e., sedikides, c., & alicke, m. d. (2020). the better-than-average effect in comparative self-evaluation: a comprehensive review and meta-analysis. psychological bulletin, 146(2), 118–149. https://doi.org/10.1037/bul0000218

ziano, i., mok, p. y., & feldman, g. (2020). replication and extension of alicke (1985) better-than-average effect for desirable and controllable traits. social psychological and personality science. https://doi.org/10.1177/1948550620948973

zwaan, r. a., etz, a., lucas, r. e., & donnellan, m. b. (2017). making replication mainstream. behavioral and brain sciences, 41, 1–50. https://doi.org/10.1017/s0140525x17001972

meta-psychology, 2020, vol 4, mp.2018.894, https://doi.org/10.15626/mp.2018.894
article type: original article
published under the cc-by4.0 license
open data: n/a
open materials: n/a
open and reproducible analysis: n/a
open reviews and editorial process:
yes
preregistration: n/a
edited by: åse innes-ker
reviewed by: d. meyer, m. olson
analysis reproduced by: n/a
all supplementary files can be accessed at the osf project page: https://osf.io/gdmb9/

two questions to foster critical thinking in the field of psychology: are there any reasons to expect a different outcome, and what are the consequences if we don't find what we were looking for?

peter holtz
leibniz-institut für wissensmedien iwm, tübingen, germany

there are many factors that contribute to the present crisis of confidence in psychology, among them epistemological causes: under pressure to 'publish or perish' and to 'get visible or vanish' in order to survive in an increasingly globalized academic job market, psychologists may often be too eager to find their hypotheses confirmed by empirical data. they may also not pay enough attention to alternative theories and consequently often miss opportunities to learn from their failures to obtain the expected results in their studies. in this paper, i propose that two questions physicist john platt proposed in 1964 be asked on a regular basis in the field of psychology as a means of fostering critical thinking, or of encouraging a critical approach to the growth of scientific knowledge: are there reasons to expect a different outcome, and what consequence is it going to have if the study does not yield the expected results? i explore the potential these two questions have for ensuring epistemological progress by asking them with respect to social-priming research, one of the research programmes that have recently been criticized in the course of the 'reproducibility debate'.

keywords: epistemology, philosophy of science, critical rationalism, falsificationism, strong inference, social priming, replicability, reproducibility, crisis of confidence.
in this paper i will argue that the reasons for the present ‘crisis’ in psychology can be attributed in part to epistemological deficiencies: psychologists are too eager to find their theories corroborated by empirical evidence, they do not consider competing theories often enough, and they often do not pay enough attention to the inferences that can be drawn from not finding the expected results. however, scores of philosophers (e.g., dewey, 1903/2004; popper, 1934/1959) and scientists (e.g. feynman, 1974) have argued that as scientists, and as human beings in general, we can learn most of all from our mistakes. as a remedy to psychologists’ apparently overly optimistic approach to scientific research, i will introduce two questions that physicist john platt (1964) once proposed as a means of accelerating progress in science. the questions are related to possible alternative theories and provoke the researcher to consider empirical outcomes that are contrary to expectations. i will argue that these two questions can and should be asked with regard to any empirical study, and that the field of psychology would benefit from asking them on a regular basis. if researchers were to do so, critical thinking or a “critical approach” (popper, 1962, p. 51) to the growth of knowledge could become over time one of the mainstays of research in psychology along with, among others, methodological rigor, honesty, and transparency. the usefulness of these two questions is explored by asking them with respect to one of the fields of research that among many others has come under scrutiny in the ensuing reproducibility debate (e.g., pashler & wagenmakers, 2012): research on social priming (e.g., bargh, chen, & burrows, 1996; bargh & chartrand, 1999). 
the present 'crisis of confidence' in psychology

the problem is epistemology, not statistics

much has been written and said about the "crisis of confidence" (pashler & wagenmakers, 2012) in psychology since the ominous years 2011 and 2012, when diederik stapel's academic fraud was discovered, when daryl bem was able to publish a paper supposedly 'proving' humans' precognitive abilities in the jpsp (bem, 2011; see also wagenmakers, wetzels, borsboom, & van der maas, 2011), and when doubts were cast on the reproducibility of social priming effects (doyen, klein, pichon, & cleeremans, 2012). consequently, initiatives such as the open science collaboration (open science collaboration, 2015) and the many labs project (klein et al., 2014) set out to further investigate the degree to which psychological studies can be replicated. a closely related problem is the apparently frequent use of questionable research practices (qrps; fanelli, 2009; leslie, loewenstein, & prelec, 2012; martinson, anderson, & de vries, 2005) such as p-hacking (making a statistically non-significant result appear significant; simmons, nelson, & simonsohn, 2011) or harking (making up hypotheses after the results are known while pretending that they had been formulated in advance; e.g., kerr, 1998) in the field of psychology. taken together, these practices may make the current practice of significance testing in psychology more or less obsolete: with enough perseverance, p-hacking, and harking, researchers can create empirical evidence in favor of more or less any theoretical assumption (simmons, nelson, & simonsohn, 2011; gelman & loken, 2013). as a consequence, a reader of scientific publications in the field of psychology apparently cannot assume with reasonable certainty that the reported research findings are 'true' in the sense that they could be replicated by independent researchers.
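one of the qrps named above, p-hacking by 'optional stopping', can be demonstrated in a few lines: if a researcher tests after every small batch of observations and stops as soon as the result is significant, the false-positive rate climbs well above the nominal 5% even when there is no effect at all. this is a minimal sketch of the idea (a plain z-test on simulated null data), not an analysis from either paper; the batch size and sample limit are arbitrary choices.

```python
import random
import statistics

random.seed(7)

def p_hack_by_optional_stopping(batch=10, max_n=1000):
    """collect data from a pure null effect (mean 0), testing after every
    batch and stopping as soon as the two-sided test is 'significant'."""
    sample = []
    while len(sample) < max_n:
        sample.extend(random.gauss(0.0, 1.0) for _ in range(batch))
        n = len(sample)
        z = statistics.fmean(sample) / (statistics.stdev(sample) / n ** 0.5)
        if abs(z) > 1.96:  # p < .05, two-sided
            return True    # a 'significant' result is found and reported
    return False

runs = 200
false_positive_rate = sum(p_hack_by_optional_stopping() for _ in range(runs)) / runs
print(false_positive_rate)  # far above the nominal 5%
```

the inflation grows with the number of interim looks at the data, which is why preregistered analysis plans (or formal sequential designs) matter.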
a number of authors have proposed statistical and methodological solutions to these problems, such as using bayesian instead of frequentist statistics (e.g., marsman et al., 2017; wagenmakers et al., 2017) or using stricter thresholds in null-hypothesis tests (benjamin et al., 2017). however, several other authors (e.g., holtz & monnerjahn, 2017; strack, 2017) reiterated paul meehl's statement from 1997 that "the problem is epistemology, not statistics" (p. 394). meehl's point of departure here and in other publications was still, of course, a statistical one: the way null-hypothesis tests are used in psychology does not put theories in "grave danger of refutation" (1978, p. 806). in contrast to hard sciences such as physics, the theories or conjectures (if we want to use the term theory only for substantial and elaborated systems of knowledge) in soft sciences such as psychology most often do not yield point predictions in the sense that they predict a certain measurable numerical outcome (see also meehl, 1967). in psychology, conjectures usually posit only that a given factor has some measurable influence on outcome variables. as a consequence, the aforementioned questionable research practices notwithstanding, even a random conjecture has in principle a 50% chance of being 'verified' in an empirical study given infinite sample sizes (unlimited statistical power): with increasing numbers of participants, the null hypothesis is more and more likely to be rejected; when the number of participants converges towards infinity, the question is just whether the (almost certainly significant) effect goes in the right direction or not. these and other considerations prompted meehl to conclude that the established use of statistics in psychology leads to an overly optimistic appraisal of the truth status of psychological conjectures, while at the same time not enough attention is paid to potential alternative explanations.
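meehl's arithmetic here can be checked with a toy simulation: give a directional conjecture a random true direction, collect a sample large enough that the two-sided test is essentially always significant, and count how often the effect lands on the predicted side. the sketch below is mine, not meehl's; the effect size and sample size are arbitrary choices that merely guarantee very high power.

```python
import random
import statistics

random.seed(1)

def random_conjecture_confirmed(n=10_000, effect=0.08):
    """the conjecture predicts a positive effect, but the true direction
    is a coin flip -- the conjecture was picked at random."""
    sign = random.choice([-1, 1])
    sample = [random.gauss(sign * effect, 1.0) for _ in range(n)]
    z = statistics.fmean(sample) / (statistics.stdev(sample) / n ** 0.5)
    # with n this large, the two-sided test at alpha = .05 is almost
    # always significant, so 'confirmation' reduces to the sign of z
    return abs(z) > 1.96 and z > 0

trials = 300
rate = sum(random_conjecture_confirmed() for _ in range(trials)) / trials
print(rate)  # hovers around one half
```

the point is exactly meehl's: at unlimited power, a directional null-hypothesis test 'verifies' a randomly chosen conjecture about half the time, no better than a coin toss.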
as a consequence, the growth of scientific knowledge in the field of psychology is not as fast or steady as it could be: by (more or less) only confirming what everyone believes to be true anyway, psychologists miss out on many opportunities to improve their theories as a consequence of research findings contradicting their assumptions. similar arguments have been brought forward by a number of other authors before (e.g., pettigrew, 1991; meehl, 1990, 1997) and after (e.g., earp & trafimow, 2015; holtz & monnerjahn, 2017) the emergence of the current crisis in psychology. one of the reasons for the apparent unwanted optimism with regard to empirical confirmation of one's theory may be related to the strong publication pressure that emerged over the course of the "academic capitalism" which developed in the 20th century (münch, 2014): scientists increasingly have to publish or perish and to get visible or vanish (doyle & cuthill, 2015; holtz, deutschmann, & dobewall, 2017) as means of surviving in the academic job market and of getting tenured positions. more than 90% of the publications in the field of psychology present evidence in favor of a theory (fanelli, 2011; sterling, rosenbaum, & weinkam, 1995; yong, 2012). alternative hypotheses and competing theories are only considered in a small number of cases (e.g., 21.6% and 11.4%, respectively, of 236 jpsp articles published between 1982 and 2005, according to uchino, thoman, and byerly, 2010). scientists more often than not build their careers around a certain 'pet theory', and their writings are mostly read and in turn cited by other scientists from the same research community. such communities are built upon the assumption that a given theory is 'true' or at least useful in explaining relevant phenomena.
hence, being overly optimistic with regard to one’s pet theory can be a means of ensuring funding, publications, and hence survival in a highly competitive academic world (for an extensive analysis see billig, 2013). this attitude of using empirical research as a sales angle for a theory to be tendered in the academic market of ideas (and often enough in non-academic markets as well, such as consulting) was recently expressed in a very straightforward way by daryl bem in an interview with slate magazine (engber, 2017): “if you looked at all my past experiments, they were always rhetorical devices. i gathered data to show how my point would be made. i used data as a point of persuasion, and i never really worried about, ‘will this replicate or will this not?’”. whereas i believe that demanding ‘impartiality’ and ‘objectivity’ from researchers is illusory (see e.g., holtz & monnerjahn, 2017; holtz & odağ, 2018), i think nevertheless that the drawbacks of such a salesman-like approach to science are fairly obvious. such an attitude is particularly detrimental when there are no serious alternative theories regarding the phenomena to be explained. without competing theories, there is no competition on the market of ideas and no ‘checks and balances’ by scientists holding different views, which, according to philosopher karl popper (e.g., 1972) for example, are urgently needed to keep researchers’ optimism with regard to their theories at bay. it should be noted that bem’s habit of using data to show his point—which he likely employed during most of his prodigious 50 plus-year career in social psychology—was only regarded as problematic after he attacked a widespread common sense assumption: human beings don’t have precognitive abilities. but how can we change such salesman-like attitudes? 
changing the hearts and minds of researchers

the focus of this paper will be on the question of whether and how unwanted optimism in looking for confirmations of theories, and finding these theories too easily corroborated in empirical studies, can be addressed in and of itself. i am not going to discuss the specific statistical solutions meehl and others have proposed for these problems. without denying the importance of the related statistical debates, i believe that a change for the better is needed not only in terms of methodology, but also in terms of the general mindset that guides psychologists in conducting their research. ioannidis (2005) wrote in his seminal article on 'false positives' in science: "diminishing bias through enhanced research standards and curtailing of prejudices may also help. however, this may require a change in scientific mentality that might be difficult to achieve" (p. 0701). apart from statistical recommendations such as larger sample sizes and power analysis, ioannidis also proposed to have researchers pre-register their intended studies whenever possible. this suggestion has been echoed by a number of scholars from the field of psychology (e.g., lindsay, simons, & lilienfeld, 2016; van 't veer & giner-sorolla, 2016), and several journals have implemented some form of a pre-registration policy since then (see e.g., chambers, feredoes, muthukumaraswamy, & etchells, 2014; jonas & cesario, 2016). apart from pre-registration, in their manifesto for reproducible science, munafò and colleagues (2017) propose the 'blinding' of researchers as a means of protecting against cognitive biases (p. 2), for example in the form of not telling those who analyze the data which data points represent the experimental condition. they suggest further measures, such as educating researchers about the effects of questionable research practices and defining guidelines for making data collection and analysis processes more transparent.
researchers should also be given incentives by funding agencies and publishers for following these open science recommendations.

i will argue in the next paragraphs in favor of making use of two simple questions that physicist john platt formulated in an article in 1964 as a means of ensuring and accelerating progress in science in general. these two questions sum up, in an easily understandable and accessible way, the central 'mantra' of competing approaches in current epistemology: scientific progress can only be defined as an advancement beyond existing knowledge, and we can learn most of all from our mistakes. i will call such an approach to the growth of (scientific) knowledge critical thinking or a "critical approach" (popper, 1962, p. 51). the advantage of my proposal is that it can be implemented easily with any kind of study. many of the aforementioned methodological and statistical measures more or less follow directly from paying more attention to competing theories and the falsification of results. hence, i propose first to tackle the 'hearts and minds' of researchers by introducing a simple behavior pattern which will prepare the ground for the methodological suggestions mentioned above for improving the reproducibility of psychological science. i believe that if only a small group of researchers were to start asking these two questions on a regular basis, there is a good chance that critical thinking could become just as commonplace as preregistration and replicability research have become over the last years.

platt's 'two questions'

platt's (1964) main point of departure was that, from his perspective, progress (at least during the 1960s) apparently happened faster and more steadily in certain branches of science, such as high-energy physics and molecular biology, than in others. he attributed these differences to the frequent use of strong inference in these disciplines.
according to him, strong inference entails the following steps:

1) devising alternative hypotheses;
2) devising a crucial experiment (or several of them), with alternative possible outcomes, each of which will, as nearly as possible, exclude one or more of the hypotheses;
3) carrying out the experiment so as to get a clean result;
1') recycling the procedure, making subhypotheses or sequential hypotheses to refine the possibilities that remain and so on. (p. 347)

of course, this idea is not new. the concept of "strong inference" by means of comparing competing theories had already been introduced by the geologist chamberlin in 1890, and the idea of an experimentum crucis as a means of testing competing theories extends at least back to bacon's "novum organum scientiarum" in the 17th century (bacon, 1620). it must also be noted that platt's account of the history of strong inference and its use in modern science has been criticized for historical inaccuracies and for over-simplifying the issues that are related to devising crucial experiments (e.g., o'donohue & buchanan, 2001). so why should we return to his article, which was written over 50 years ago by a perhaps overly enthusiastic physicist, as a potential partial remedy for the current crisis in psychology? platt's analysis of the problems in many scientific fields corresponds to my (of course limited) experience in the field of psychology: scientists, when they are asked what science ideally should be like, usually know that they are supposed to be critical, employ rigorous tests of their theories, and compare theories whenever possible. however, in our daily lives as scientists, our minds are occupied with other things: for example, we have to rapidly publish our findings, lots of them, to keep our careers going, and we have lots of other everyday duties to fulfill that may prevent us from employing the scientific rigor that we know is needed.
as platt puts it: how many of us write down our alternatives and crucial experiments every day, focusing on the exclusion of a hypothesis? we may write our scientific papers so that it looks as if we had steps 1, 2, and 3 in mind all along. but in between, we do busywork. we become “method-oriented” rather than “problem-oriented.” we say we prefer to “feel our way” toward generalizations. we fail to teach our students how to sharpen up their inductive inferences. (p. 348; emphasis as in the original) however, the strongest part of platt’s paper is in my opinion his description of “aids” (p. 352) for the implementation of strong inference in daily scientific practice. how can we—in the middle of our everyday struggles—enforce a critical mindset upon ourselves, our students, and our colleagues? platt proposes two simple questions that can and should be asked about any scientific study as a means of employing a “yardstick” (p. 352) for the study’s effectiveness: [question 1:] “but sir, what experiment could disprove your hypothesis?”; or, on hearing a scientific experiment described, [question 2:] “but sir, what hypothesis does your experiment disprove?” this goes straight to the heart of the matter. it forces everyone to refocus on the central question of whether there is or is not a testable scientific step forward. (p. 352) he continues: if such a question were asked aloud, many a supposedly great scientist would sputter and turn livid and would want to throw the questioner out, as a hostile witness! such a man is less than he appears, for he is obviously not accustomed to think in terms of alternative hypotheses and crucial experiments for himself; and one might also wonder about the state of science in the field he is in. but who knows?—the question might educate him, and his field too! (p.
352) in the next section, i will discuss the epistemological importance of these two questions as well as criticism of the empiricist ‘success formula’ that platt had presented in his paper. the epistemological importance of platt’s “two questions” first, it should be noted that to platt, these two questions are in fact just one question. following the empiricist tradition, platt believed in the experimentum crucis as the driving factor behind scientific progress. in this sense, asking what kind of evidence could go against one’s theory is always the same as asking for evidence that would support another theory, because the effectiveness of an experimentum crucis depends on the possibility of identifying relevant competing theories as well as deciding which one is ‘wrong’ and which one is ‘right’. probably like every ‘baconian’ ever since the 17th century, platt may be oversimplifying matters here: first of all, it is not always possible to identify all of the relevant competing theories. actually, in some areas of science, we may be faced with the fact that we have entered uncharted territory insofar as there is at that point no theory yet that could explain the phenomena in question. furthermore, the idea of setting up studies that once and for all could clarify which theory is right and which is wrong is probably naive, as it ignores underdetermination, or the ‘quine-duhem thesis’ (harding, 1976): research findings are not only influenced by the theoretical mechanisms that we want to test, but also by innumerable so-called auxiliary hypotheses, ranging from rather trivial assumptions such as ‘our measurement instrument worked’ to serious hitherto unknown alternative explanations. for example, in 1906, physicist pierre duhem mathematically proved for a subfield of physics (newton’s law of universal mutual gravitation) that the number of such auxiliary hypotheses is necessarily infinite. subsequently, philosopher w. v. o.
quine (1951) generalized duhem’s argument to more or less everything that can be known to human beings. the experimentum crucis idea also becomes problematic whenever a theory does better at explaining certain phenomena in certain areas while a competing one offers better explanations in other areas. however, platt’s two questions also make a lot of sense for those who are aware of the limits of empiricism. for example, critical rationalists in the tradition of karl popper assume that no empirical method whatsoever can clarify theoretical questions once and for all. all that empirical methods can do is to discover inconsistencies between theoretical predictions and empirical results (falsification). they can thus point towards ways in which theories can be improved. over time, employing such a critical mindset towards theories and working on ways of improving them lead to a growth of knowledge in an evolutionary sense (popper, 1972): theories are continuously replaced with better theories, and this process refines our understanding of the world, although we will never know with certainty whether or not our theories are true in a metaphysical sense. consequently, even without considering a competing theory, just asking for empirical results that would go against a researcher’s expectations can in itself be an important part of the research process: scientific progress is possible to the critical rationalist only through the discovery of such inconsistencies. for the second step, the replacement of theories with ‘better’ theories, comparing and critically evaluating theories is pivotal for obvious reasons. hence, platt’s ‘question’ can be put to use in critical rationalism only after making ‘two questions’ of the one. in the next paragraph, i will formulate a more generalized version of these questions that is particularly suited for use in all branches of psychology.
my objective here is again to facilitate the asking of these questions on a regular basis as a means of making critical thinking or a critical approach “mainstream” in the field of psychology. i believe that without the ballast of a formalized empiricism, platt’s two questions can be even stronger tools to foster critical thinking in the field of psychology. the two questions revisited the way platt formulated his questions may scare off some psychologists through the use of strong terms such as ‘disprove’ and by explicitly mentioning ‘experiments’ as the method of choice. because i believe that the same epistemological principles guide qualitative as well as quantitative research traditions in psychology (see holtz & odağ, 2018), i would like to introduce a more generalized wording of platt’s two questions that in my opinion is better suited for use in a multifaceted field such as psychology. a new wording of the first question should address the identification of possible competing theories, for example in the following form: “are there any reasons to expect a different outcome than the one you expect?” if reasons to expect a different outcome are then provided, the follow-up sub-question should be: “to what extent can our study provide arguments in favor of or against the competing assumptions?” if this question is answered in the negative, one should think about ways to refine the study. the second question is meant to make the researcher aware that science is not only about finding confirmations of one’s beliefs, but rather about critically testing them. the researcher must remain aware that more often than not, inconsistencies between predictions and observations drive scientific progress: “which outcomes would clearly contradict your assumptions?” or another way to word it would be, “which outcomes could cast doubt on confidence in your underlying assumptions?”.
this question first of all has the purpose of making the researcher aware of possible discrepancies between predictions and observations. this results in the critical mindset that in critical rationalism is one of the two driving forces behind scientific progress (the other one is “intuition or imagination”; popper 1979, p. 167). it may be followed up by the question “and what consequences is it going to have if you get results that go against your expectations?”. in the following paragraphs i will attempt to demonstrate the usefulness of these two questions with their follow-up questions for the field of psychology by asking them not about a single study, but about what could maybe be called a research programme (lakatos, 1978) in social psychology: research on social priming (e.g., bargh, chen, & burrows, 1996; bargh & chartrand, 1999). i think a very similar argument could be made for other research programmes such as ego depletion research (in the tradition of baumeister, bratslavsky, muraven, & tice, 1998; for a critical perspective see e.g. carter & mccullough, 2014) or power posing research (in the tradition of carney, cuddy, & yap, 2010; for a critical perspective see e.g. ranehill et al., 2015). it is important to keep in mind that i do not want to propagate the idea of strong inference in the sense of platt (1964). i just want to use his two questions to explore in the form of a thought experiment how psychology would benefit from more critical thinking or a critical approach. the case of social priming the rise and fall of social priming research in psychology, the word priming usually designates effects of exposure to a stimulus (such as a word, a bodily sensation, or an observation) on a subject’s responses in a situation subsequent to the stimulus exposure.
early research on priming focused primarily on how reading certain words had an effect on the perception and processing of subsequent associated and/or semantically related words (e.g., meyer & schvaneveldt, 1971; meyer & schvaneveldt, 1976; neely, 1977). the term social priming is most often used as an umbrella term for different kinds of priming in the form of unconscious activation of social categories (such as old, polite, rude, …), resulting in behavioral tendencies that are based on those respective schemes, role concepts, or stereotypes (e.g., molden, 2014). the term social priming has been used by, among others, daniel kahneman in his concerned open letter (2012) to the “students of social priming”. kahneman had previously devoted a part of his best-selling book “thinking, fast and slow” (2011) to social priming research. in a seminal study that paved the way for a large number of other publications with variations on the putative underlying phenomenon of social priming, bargh, chen, and burrows (1996) found, among other results, behaviors that allegedly stemmed from social priming: participants in laboratory experiments were more likely to interrupt a conversation between an experimenter and a confederate when, prior to the conversation, they had solved verbal problems (so-called ‘scrambled-sentence tasks’) which included words that were related to rudeness (e.g., bold, aggressively, rude, …) than when the problems included words that were related to politeness or were ‘neutral’ in this regard. in another experiment, participants walked down a hallway significantly more slowly when they had previously solved problems which included words that were related to old age (e.g., old, florida, grey, …) than when the problems included neutral words or words that were associated with youth.
although the paper by bargh and colleagues (1996) has been cited literally thousands of times, according to doyen, klein, pichon, & cleeremans (2012), only two partially successful replication studies were published between the years 1996 and 2012 (aarts & dijksterhuis, 2002; cesario, plaks, & higgins, 2006). in a first experiment of their own, doyen and colleagues were unable to replicate the primed elder-walking experiment of bargh and colleagues (1996); in a second study, they succeeded in producing similar effects as those in the original study—but only if the experimenter was aware of the hypothesis and not if s/he was ‘blind’ towards the expected outcome. after initial discussions regarding differences between the setups of the experiments and after what some commentators perceived to be attacks by bargh and colleagues against the quality and credibility of doyen and colleagues’ paper (yong, 2012), the controversy regarding the replicability of social priming effects continues to this day (e.g., weingarten, chen, mcadams, yi, hepler, & albarracin, 2016; schimmack, heene, & kesavan, 2017; see also daniel kahneman’s related comment as reported in mccook, 2018). in the following paragraphs, i will ask platt’s two questions about the social priming research programme (lakatos, 1978) as a whole, and i will discuss potential consequences: [question 1] are there reasons to expect a different outcome, and [question 2] what could some consequences of unexpected findings be? are there reasons to expect a different outcome? in their 1996 paper, bargh and colleagues argue (pp. 230-231) that whereas it is widely accepted that attitudes, emotions, and self-concepts can be affected in an unconscious way by means of activating schemes and scripts and the like, “behavioral responses to the social environment are [usually believed to be] under conscious control” (p. 230).
they continue by quoting two authors (fiske, 1989; devine, 1989) who acknowledged that behavior may be affected by automatic processes, for example, in the form of the unconscious activation of stereotypes; however, fiske and devine both made the claim that human beings can still consciously decide to overcome such behavioral tendencies and decide not to act in accordance with their prejudice. social priming research makes the contrary claim that behavioral responses to automatic cognitive processes are neither mediated by attitudes and emotions nor can they be overruled by higher cognitive instances, because they happen unconsciously. but who would doubt that such cases of unconscious behavior affected by priming exist? judging from the aforementioned quote, bargh and colleagues seem to assume that some imagined opponent would maintain that there are none, and that there cannot be any cases in which behavior is affected in an unconscious and automatized way through scheme activation and the like. in formal logic, such a statement could have the form of the implication a=>b (read: ‘if a, then b’ or ‘whenever a is the case, b is going to be the case as well’) with a being a behavioral response and b being some degree of conscious control. in this case, in accordance with the so-called modus tollens, a singular observation of a and non-b would make the a=>b clause false. but are there really opponents who completely rule out the possibility that there can be a behavioral response without conscious control? at least fiske (1989) and devine (1989), the two authors whom bargh and colleagues (1996) quote as proponents of a conscious cognitive behavioral control mechanism (p. 231), would probably not subscribe to any statement positing the impossibility of unconscious effects on behavior in a radical form.
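the logical structure sketched above can be checked mechanically. as a minimal illustrative sketch (the `implies` helper and the truth-table loop are mine, not part of the original argument), the following verifies both points: a single case of a together with non-b makes the clause a=>b false, and the modus tollens pattern itself is logically valid:

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """material implication: a => b is false only when a is true and b is false."""
    return (not a) or b

# the text's point: one observation of a (a behavioral response) together with
# non-b (no conscious control) makes the clause a => b false
assert implies(True, False) is False

# modus tollens: from (a => b) and not-b, infer not-a. verify validity by
# exhaustive truth table: whenever both premises hold, so does the conclusion
valid = all(
    (not a)                       # conclusion: not-a
    for a, b in product([False, True], repeat=2)
    if implies(a, b) and not b    # premises: a => b and not-b
)
print(valid)  # True
```

note that with only four truth-value assignments, the exhaustive check is trivial here; the same pattern scales to any propositional inference rule.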
to me, it rather seems that unconscious mechanisms have been one of the defining features of psychology as a discipline from the days of the psychoanalysts and the gestalt theorists to modern social psychology. but let us imagine such a radical opponent, for example, in the form of a stubborn economic rationalist or a theologist who steadfastly believes in human beings’ free will, and who refuses to acknowledge the possibility that some of our behavior is the consequence of automatized and unconscious psychic processes: is there any chance that such a person could be convinced through studies such as the ‘primed elder-walking experiment’? probably not. one reason is that the experiments by bargh and colleagues were statistically ‘underpowered’ (schimmack, heene, & kesavan, 2017; weingarten, chen, mcadams, yi, hepler, & albarracin, 2016). this means that given the average effect size, the sample sizes in the experiments (ranging from 30 to 34) were not sufficient to ensure significant results with reasonable certainty (e.g., >80%). a mean-spirited opponent would immediately seize on this deficiency. furthermore, the description of the procedure follows the conventions of the day, but a hostile adversary would perhaps point out that the information is not sufficient to enable exact replication of the study in question (see e.g. stark, 2018). a steadfast opponent would also point out that the study was not pre-registered, and that neither the data nor analysis procedures were made public. furthermore, bargh and his colleagues would have to state how many other experiments conducted by them failed to yield reliable results in a predicted direction and so were relegated to the ‘file drawer’, rather than being published alongside their ‘successful’ experiments. an opponent could also point to the artificiality of such psychological laboratory experiments and demand evidence from studies using more ‘natural’ behavior data (mcguire, 1973 & 2004).
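the underpowering claim is easy to quantify. the sketch below uses a normal approximation to the power of a two-sided two-sample t-test; the assumed effect size of d = 0.5 (a conventional ‘medium’ effect) is my illustrative choice, not a figure from the original studies:

```python
from math import sqrt
from statistics import NormalDist

def approx_power_two_sample(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """normal approximation to the power of a two-sided two-sample t-test."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)     # critical value, ~1.96 for alpha = .05
    ncp = d * sqrt(n_per_group / 2)        # approximate noncentrality of the test statistic
    # probability of exceeding either critical bound under the alternative
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)

# with n = 32 per group and d = 0.5, power is only about 52%,
# well below the conventional 80% threshold mentioned in the text
print(round(approx_power_two_sample(0.5, 32), 2))
```

under these assumptions, detecting a medium effect with better than 80% certainty would require roughly twice as many participants per group, which illustrates why samples of 30 to 34 leave a replication-minded opponent unconvinced.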
all in all, to convince such an opponent, the design as well as the reporting standards would probably have to be much stricter. in a later publication, bargh and chartrand (1999) presented a stronger version of their thesis based on the studies in the 1996 paper and a few related studies: our thesis here—that most of a person's everyday life is determined not by their conscious intentions and deliberate choices but by mental processes that are put into motion by features of the environment and that operate outside of conscious awareness and guidance—is a difficult one for people to accept. (p. 462) this is of course a more provocative thesis that will most definitely meet opposition among at least a few psychologists and other social scientists arguing in favor of a more humanistic image of a human being as an at least partly rational being that is able to defy its more animalistic tendencies. the statement in its strong form also bears relevance for ethical and legal debates: is a human being really responsible for her actions if ‘most’ of her life is ‘determined’ by environmental forces operating ‘outside of conscious awareness and guidance’? but do bargh and colleagues’ findings here and elsewhere support this far-reaching conclusion? no. the fact that it is possible to create an experiment in which unconscious factors affect behavior does not allow for generalized conclusions about the importance of these processes in everyday life or about the percentage of everyday decisions that are affected or even determined (sic) by uncontrollable unconscious forces. this is all of course assuming that social priming experiments are in fact replicable. thus, it seems that bargh and colleagues are first creating a strawman by attributing to others the untenable claim that behavior is always under conscious control.
then they use their findings to propagate a much more far-reaching theory of the conditio humana as a miserable being that, evoking freud, is not even master in its own house. of course, research questions regarding the degree to which behavior is under conscious control are valuable. but in this case, just creating an experiment that does (or does not) show that unconscious effects are possible is not enough. different kinds of studies comparing behavioral reactions systematically in different scenarios (in the sense of the aforementioned strong inference) would be needed to allow for this kind of generalization (see also mcguire, 1973 & 2004). such studies would also have to be adequately powered in terms of the number of subjects to allow for these kinds of conclusions. what if you don’t find what you expected? is there a possible research outcome that would eventually cause bargh and colleagues to abandon their idea of unconscious automatized effects on behavior? does the strawman mentioned earlier have any theoretical chance at all to defeat its opponents? i don’t think so. one reason is that it is all but impossible statistically to clearly demonstrate the absence of any effect (e.g., cohen, 1994). of course, the case is different for the stronger statement that such unconscious effects determine most of our daily lives. but here as well, the question of whether or not it is possible to demonstrate unconscious effects in an experiment seems to be by and large irrelevant. obviously, there are a large number of studies showing the rather trivial fact that conscious processes can ‘override’ automatic behavior tendencies (e.g., in devine, 1989, mentioned beforehand). but for estimating the relevance of any findings for everyday processes, a more comprehensive approach would again be needed. so what is the theoretical consequence of bargh and colleagues’ studies being replicable or not?
would any imaginable outcome of a large-scale replication study have an effect on our understanding of the world and human nature? maybe a failure to replicate bargh and colleagues’ findings could be used as an argument that, after all, it is not so easy to manipulate human beings’ minds. but actually the relevance of this argument could be more related to the large degree of publicity that bargh and colleagues’ studies received than to the empirical evidence per se. in itself, the fact that an experiment ‘does not work’ (does not yield the expected results) does of course not rule out the possibility that another experiment could be designed that would ‘work’, finally demonstrating the intended effect. newell’s (1973) infinite yes-or-no game and the question why social priming research has become so popular applying platt’s two questions to classical social priming studies seems to indicate that bargh and colleagues’ studies present little relevant empirical evidence with regard to the most relevant question of the extent to which human beings can consciously control their behavior. as i have explained, no psychologist would seriously rule out the possibility that there may be unconscious effects of automatized cognition on behavior, and the experimental setup of the studies does not really allow for any generalization in the sense of estimating the degree to which everyday behavior is affected by conscious and unconscious processes. the small sample sizes, the relatively loose research protocol (no preregistration), and the relatively sparse information about the procedures—which were however absolutely in line with the conventions of the day—are factors that also limit the study’s capacity to convince a stubborn opponent of the existence of potential priming effects.
in order to provide empirical evidence supporting the stronger thesis that ‘most’ of our behavior is ‘determined’ by unconscious processes, a wider range of studies systematically comparing conscious and unconscious influences and aiming at discovering the limits of the respective underlying assumptions would be more helpful than just repeating experiments that show that some unconscious automatized effect can be elicited. but why have variations of exactly these studies then become so popular? one reason may be related to the naive positivistic idea that obtaining the expected result in a thoroughly controlled scientific experiment ‘proves’ the existence of the phenomenon in question. this claim has been by and large discredited in philosophy, for example, because of the aforementioned issues with underdetermination (quine, 1951). even given a reproducible stable effect, someone could come along at any time with a better explanation of the observed phenomena in question and show hitherto unknown evidence falsifying the original assumptions (e.g., popper, 1962). thinking about social priming’s popularity, i am also reminded of newell’s (1973) brilliant analysis of the issues in psychological research in the 1970s. he argued that psychologists too often study complex questions by means of reducing them to a yes-or-no type question. examples would be the nature vs. nurture debate or the debate over conscious vs. unconscious information processing already ongoing at that time (newell lists 24 such yes-or-no questions on p. 288). the true answer to all these questions is most likely one starting with “it all depends”, and answering them in a productive way would actually require complex theories, elaborated research designs, and strong inference (as outlined in platt, 1964).
however, asking complex questions in a yes-or-no fashion and only counting confirmations of the respective theories sets the stage for an infinite game: apparent opponents produce an endless chain of evidence in favor of their own stance regarding such a yes-or-no question without ‘hurting’ each other and without necessarily producing anything resembling a growth of knowledge. the right question to ask here in order to achieve progress would be “to what extent is your research potentially able to convince an opponent of your point of view?” productive discussions would then hopefully be facilitated among proponents of different views. conclusion science is not about finding your assumptions confirmed and finding ways to sell your ideas to an audience. instead, it is (or should be) about critically examining your beliefs and correcting them whenever they do not correspond with empirical observations, and about being willing to give them up whenever there is a better explanation for the phenomena in question. this is what i call critical thinking or a critical approach to the growth of scientific knowledge. i assume that most psychologists would agree with these statements. still, the question is to what extent scientists also (can) act according to these principles in a globalized capitalist academic market where productivity and public attention determine a scientist’s career. it should be the task of scientific organizations such as the scientific associations that edit prestigious journals to help scientists to act in accordance with what they know would be the right thing to do. incentive structures should be established that reward good scientific practices, such as taking into account alternative explanations, exploring the limits of one’s assumptions, and being willing to report unexpected results as well (e.g., munafò et al., 2017).
just as large-scale replication projects, preregistration, and open science have been popularized by small groups of researchers continuing to make their point, i hope that critical thinking could become a mainstay of psychology as well, if enough researchers begin to ask platt’s two questions on a regular basis. in view of the most recent crisis in psychology, i think it is also fair to demand that those who teach scientific methods to students put as much emphasis on developing critical thinking as they put on teaching thorough methodological knowledge. for those of us who work on scientific textbooks and who convey scientific knowledge to the wider public, i think that the question should be asked whether a salesman-like attitude of (over)selling the benefits of scientific research is really a way to sustain worthwhile scientific activity. platt’s two questions indeed can and should be asked with regard to any scientific study in the field of psychology, and i hope that i have provided some arguments for my cause. at the very least, i have the impression that my own scientific output might have been more relevant and interesting if i had asked these questions about my own work on a regular basis. of course, asking the questions does not entail having or prescribing an answer. i am aware that my discussion of social priming research is to some degree provocative and will evoke criticism from those who are more knowledgeable in this area of psychology than i am. maybe they can convince me that these questions can or should be answered differently for social priming research. the most critical part of this paper is probably not the question whether platt’s two questions should be asked on a regular basis—i suppose there will not be much opposition to this thesis. instead, it is my claim that these questions have not been asked often enough in psychology so far.
still, i hope that i have at least made the point that the methodological recommendations that are meant to counter the current crisis (e.g., benjamin et al., 2018; munafò et al., 2017) follow immediately from a more critical approach to scientific research. eventually, critical thinking could become just as ‘mainstream’ as preregistration and large-scale replication studies have become over the last several years. and in the end ‘market forces’ could themselves contribute to a critical culture in psychology once a critical degree of popularity has been reached: as soon as scientists have to demonstrate their ability to think critically in order to, for example, obtain a tenured position, a significant incentive will have been created. consequently, scientists may then have to document, for example, how many hypotheses they have falsified or the number of occasions on which an empirical finding had made them revise the theory to be tested. falsifications of null hypotheses would only count here whenever they constitute a critical test of the theory to be tested in the sense that other researchers would be willing to defend the null hypothesis, as is the case, for example, in parapsychology. here, most scientists hold the view that psi phenomena do not exist, and falsifying this hypothesis (of course, in a reproducible and replicable way) would be a substantial advancement of scientific knowledge. in the previously discussed example of social priming, the case is different and—as i argued in the previous paragraphs—a rejection of the null hypothesis may not mean much from an epistemological point of view. i would consider it an ideal outcome of this paper if there were a kind of movement in the direction of simply asking these two questions on a regular basis and studying the reactions. i would expect some respondents to indeed turn ‘livid’, whereas i hope that others would start heading towards more critical thinking.
social media could perhaps facilitate the exchange of experiences with reactions to these questions. author contact dr. peter holtz. schleichstr. 6, 72074 tübingen, germany; p.holtz@iwm-tuebingen.de. orcid: https://orcid.org/0000-0001-7539-6992. conflict of interest and funding i have no conflict of interest to declare. my research has been funded since 2018 by the leibniz association, germany (leibniz competition 2018, funding line "collaborative excellence", project salient [k68/2017]). author contributions i am the sole author of this article. acknowledgements several in-text comments by david e. meyer on the first version of this article were integrated into this revised version of the original manuscript. i would like to thank professor meyer for his invaluable contribution to the paper! i also want to thank reviewer michael olson and the editors åse innes-ker and rickard carlsson for their valuable feedback and their help during the review process! open science practices this is a conceptual article without data, materials or analysis that could have been preregistered. the entire editorial process, including the open reviews, is published in the online supplement. references aarts, h., & dijksterhuis, a. (2002). category activation effects in judgment and behaviour: the moderating role of perceived comparability. british journal of social psychology, 41(1), 123-138. bacon, f. (1620). new organon. available from: http://www.constitution.org/bacon/nov_org.htm; last retrieved june 2018. bargh, j. a., chen, m., & burrows, l. (1996). automaticity of social behavior: direct effects of trait construct and stereotype activation on action. journal of personality and social psychology, 71(2), 230. bargh, j. a., & chartrand, t. l. (1999). the unbearable automaticity of being. american psychologist, 54(7), 462. baumeister, r. f., bratslavsky, e., muraven, m., & tice, d. m. (1998). ego depletion: is the active self a limited resource?
journal of personality and social psychology, 74(5), 1252-1265.
bem, d. j. (2011). feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. journal of personality and social psychology, 100(3), 407-425.
benjamin, d. j., berger, j. o., johannesson, m., nosek, b. a., wagenmakers, e. j., berk, r., ... & cesarini, d. (2018). redefine statistical significance. nature human behaviour, 2(1), 6.
billig, m. (2013). learn to write badly: how to succeed in the social sciences. cambridge: cambridge university press.
carney, d. r., cuddy, a. j., & yap, a. j. (2010). power posing: brief nonverbal displays affect neuroendocrine levels and risk tolerance. psychological science, 21(10), 1363-1368.
carter, e. c., & mccullough, m. e. (2014). publication bias and the limited strength model of self-control: has the evidence for ego depletion been overestimated? frontiers in psychology, 5, 823.
cesario, j., plaks, j. e., & higgins, e. t. (2006). automatic social behavior as motivated preparation to interact. journal of personality and social psychology, 90(6), 893.
chambers, c. d., feredoes, e., muthukumaraswamy, s. d., & etchells, p. (2014). instead of "playing the game" it is time to change the rules: registered reports at aims neuroscience and beyond. aims neuroscience, 1(1), 4-17.
cohen, j. (1994). the earth is round (p < .05). american psychologist, 49(12), 997.
chamberlin, t. c. (1890). the method of multiple working hypotheses. science, 15(366), 92-96.
devine, p. g. (1989). stereotypes and prejudice: their automatic and controlled components. journal of personality and social psychology, 56, 5-18.
dewey, j. (1903/2004). democracy and education. mineola, ny: dover.
doyen, s., klein, o., pichon, c. l., & cleeremans, a. (2012). behavioral priming: it's all in the mind, but whose mind? plos one, 7(1), e29081.
duhem, p. (1954/1906). the aim and structure of physical theory (transl. p. p. wiener).
princeton, nj: princeton university press.
earp, b. d., & trafimow, d. (2015). replication, falsification, and the crisis of confidence in social psychology. frontiers in psychology, 6, 621.
engber, d. (2017). daryl bem proved esp is real: which means science is broken. slate. online document available from https://slate.com/health-and-science/2017/06/daryl-bem-proved-esp-is-real-showed-science-is-broken.html; last retrieved may 2020.
fanelli, d. (2009). how many scientists fabricate and falsify research? a systematic review and meta-analysis of survey data. plos one, 4(5), e5738.
fanelli, d. (2011). negative results are disappearing from most disciplines and countries. scientometrics, 90(3), 891-904.
feynman, r. p. (1974). cargo cult science. engineering and science, 37(7), 10-13.
fiske, s. t. (1989). examining the role of intent: toward understanding its role in stereotyping and prejudice. in j. s. uleman & j. a. bargh (eds.): unintended thought. new york: guilford press, 253-283.
gelman, a., & loken, e. (2013). the garden of forking paths: why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. department of statistics, columbia university. online document available at: http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf; last retrieved june 2018.
gilbert, d. t., king, g., pettigrew, s., & wilson, t. d. (2016). comment on "estimating the reproducibility of psychological science". science, 351(6277), 1037-1037.
harding, s. (ed.). (1976). can theories be refuted?: essays on the duhem-quine thesis (vol. 81). new york: springer.
holtz, p., deutschmann, e., & dobewall, h. (2017). cross-cultural psychology and the rise of academic capitalism: linguistic changes in ccr and jccp articles, 1970-2014. journal of cross-cultural psychology, 48(9), 1410-1431.
holtz, p., & monnerjahn, p. (2017).
falsificationism is not just 'potential' falsifiability, but requires 'actual' falsification: social psychology, critical rationalism, and progress in science. journal for the theory of social behaviour, 47, 348-362.
holtz, p., & odağ, ö. (2018). popper was not a positivist: why critical rationalism could be an epistemology for qualitative as well as quantitative social scientific research. qualitative research in psychology, advance online publication. online document available from: https://doi.org/10.1080/14780887.2018.1447622; last retrieved may 2020.
john, l. k., loewenstein, g., & prelec, d. (2012). measuring the prevalence of questionable research practices with incentives for truth telling. psychological science, 23(5), 524-532.
jonas, k. j., & cesario, j. (2016). how can preregistration contribute to research in our field? comprehensive results in social psychology, 1, 1-7.
kahneman, d. (2012). a proposal to deal with questions about priming effects. online document available from: https://www.nature.com/polopoly_fs/7.6716.1349271308!/suppinfofile/kahneman%20letter.pdf; last retrieved june 2018.
klein, r. a., ratliff, k. a., vianello, m., adams jr, r. b., bahník, š., bernstein, m. j., bocian, k., brandt, m. j., brooks, b., brumbaugh, c. c., & cemalcilar, z. (2014). investigating variation in replicability: a "many labs" replication project. social psychology, 45(3), 142-152.
lakatos, i. (1978). the methodology of scientific research programmes. cambridge, uk: cambridge university press.
lindsay, d. s., simons, d. j., & lilienfeld, s. o. (2016). research preregistration 101. aps observer, 29(10). online document available from https://www.psychologicalscience.org/observer/research-preregistration-101; last retrieved june 2018.
marsman, m., schönbrodt, f. d., morey, r. d., yao, y., gelman, a., & wagenmakers, e. j. (2017). a bayesian bird's eye view of 'replications of important results in social psychology'.
royal society open science, 4(1), 160426.
martinson, b. c., anderson, m. s., & de vries, r. (2005). scientists behaving badly. nature, 435(7043), 737.
mccook, a. (2018). "i placed too much faith in underpowered studies:" nobel prize winner admits mistakes. online document available from https://retractionwatch.com/2017/02/20/placed-much-faith-underpowered-studies-nobel-prize-winner-admits-mistakes/; last retrieved june 2018.
mcguire, w. j. (1973). the yin and yang of progress in social psychology: seven koan. journal of personality and social psychology, 26(3), 446.
mcguire, w. j. (2004). a perspectivist approach to theory construction. personality and social psychology review, 8(2), 173-182.
meehl, p. e. (1967). theory-testing in psychology and physics: a methodological paradox. philosophy of science, 34, 103-115.
meehl, p. e. (1978). theoretical risks and tabular asterisks: sir karl, sir ronald, and the slow progress of soft psychology. journal of consulting and clinical psychology, 46(4), 806-834.
meehl, p. e. (1990). appraising and amending theories: the strategy of lakatosian defense and two principles that warrant it. psychological inquiry, 1(2), 108-141.
meehl, p. e. (1997). the problem is epistemology, not statistics: replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. in: l. l. harlow, s. a. mulaik, & j. h. steiger (eds): what if there were no significance tests? london: routledge, 393-425.
meyer, d. e., & schvaneveldt, r. w. (1971). facilitation in recognizing pairs of words: evidence of a dependence between retrieval operations. journal of experimental psychology, 90(2), 227.
meyer, d. e., & schvaneveldt, r. w. (1976). meaning, memory structure, and mental processes. science, 192(4234), 27-33.
münch, r. (2014). academic capitalism: universities in the global struggle for excellence. london: routledge.
munafò, m. r., nosek, b. a., bishop, d. v., button, k. s., chambers, c. d., du sert, n. p., simonsohn, u., wagenmakers, e. j., ware, j. j., & ioannidis, j. p. (2017). a manifesto for reproducible science. nature human behaviour, 1, 0021.
neely, j. h. (1977). semantic priming and retrieval from lexical memory: roles of inhibitionless spreading activation and limited-capacity attention. journal of experimental psychology: general, 106(3), 226.
newell, a. (1973). you can't play 20 questions with nature and win: projective comments on the papers of this symposium. in w. g. chase (ed.): visual information processing. new york: academic press, 283-308.
molden, d. c. (2014). understanding priming effects in social psychology: what is "social priming" and how does it occur? social cognition, 32(supplement), 1-11.
o'donohue, w., & buchanan, j. a. (2001). the weaknesses of strong inference. behavior and philosophy, 29, 1-20.
open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251), aac4716.
pashler, h., & wagenmakers, e. j. (2012). editors' introduction to the special section on replicability in psychological science: a crisis of confidence? perspectives on psychological science, 7(6), 528-530.
pettigrew, t. f. (1991). toward unity and bold theory: popperian suggestions for two persistent problems of social psychology. in c. w. stephan, w. g. stephan, & t. pettigrew (eds.): the future of social psychology. new york: springer, 13-27.
platt, j. r. (1964). strong inference. science, 146(3642), 347-353.
popper, k. r. (1934/1959). the logic of scientific discovery. london: routledge (german original 1934).
popper, k. r. (1962). conjectures and refutations: the growth of scientific knowledge. new york & london: basic books.
popper, k. (1972). objective knowledge: an evolutionary approach. oxford: oxford university press, 246-284.
popper, k. r. (1979). three worlds. ann arbor, mi: university of michigan press.
ranehill, e., dreber, a., johannesson, m., leiberg, s., sul, s., & weber, r. a. (2015). assessing the robustness of power posing: no effect on hormones and risk tolerance in a large sample of men and women. psychological science, 26(5), 653-656.
quine, w. v. o. (1951). two dogmas of empiricism. the philosophical review, 60(1), 20-43.
schimmack, u., heene, m., & kesavan, k. (2017). reconstruction of a train wreck: how priming research went off the rails. replicability-index: improving the replicability of empirical research. online document available from https://replicationindex.wordpress.com/2017/02/02/reconstruction-of-a-train-wreck-how-priming-research-went-of-the-rails/; last retrieved june 2018.
shrout, p. e., & rodgers, j. l. (2018). psychology, science, and knowledge construction: broadening perspectives from the replication crisis. annual review of psychology, 69, 487-510.
simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359-1366.
stark, p. b. (2018). before reproducibility must come preproducibility. nature, 557(7707), 613.
sterling, t. d., rosenbaum, w. l., & weinkam, j. j. (1995). publication decisions revisited: the effect of the outcome of statistical tests on the decision to publish and vice versa. the american statistician, 49(1), 108-112.
strack, f. (2017). from data to truth in psychological science. a personal perspective. frontiers in psychology, 8, 702.
stroebe, w., & strack, f. (2014). the alleged crisis and the illusion of exact replication. perspectives on psychological science, 9(1), 59-71.
uchino, b. n., thoman, d., & byerly, s. (2010). inference patterns in theoretical social psychology: looking back as we move forward. social and personality psychology compass, 4(6), 417-427.
van't veer, a. e., & giner-sorolla, r. (2016).
pre-registration in social psychology—a discussion and suggested template. journal of experimental social psychology, 67, 2-12.
wagenmakers, e. j., wetzels, r., borsboom, d., & van der maas, h. l. (2011). why psychologists must change the way they analyze their data: the case of psi: comment on bem (2011). journal of personality and social psychology, 100(3), 426-432.
weingarten, e., chen, q., mcadams, m., yi, j., hepler, j., & albarracín, d. (2016). from primed concepts to action: a meta-analysis of the behavioral effects of incidentally presented words. psychological bulletin, 142(5), 472.
wagenmakers, e. j., verhagen, a. j., ly, a., matzke, d., steingroever, h., rouder, j. n., & morey, r. d. (2017). the need for bayesian hypothesis testing in psychological science. in s. o. lilienfeld & i. d. waldman (eds.): psychological science under scrutiny: recent challenges and proposed solutions. new york: wiley, 123-138.
yong, e. (2012). bad copy. nature, 485(7398), 298.

meta-psychology, 2022, vol 6, mp.2020.2577. https://doi.org/10.15626/mp.2020.2577. article type: original article. published under the cc-by4.0 license. open data: not applicable. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: no. edited by: danielsson, h., carlsson, r. reviewed by: schönbrodt, f., schmukle, s., hedge, c. analysis reproduced by: batinović, l., fust, j. all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/w7cp3

exploring reliability heterogeneity with multiverse analyses: data processing decisions unpredictably influence measurement reliability

sam parsons, university of oxford; radboud university medical center

abstract

analytic flexibility is known to influence the results of statistical tests, e.g., effect sizes and p-values. yet, the degree to which flexibility in data processing decisions influences measurement reliability is unknown.
in this paper i attempt to address this question using a series of 36 reliability multiverse analyses, each with 288 data processing specifications, including accuracy and response time cut-offs. i used data from a stroop task and flanker task at two time points, as well as a dot probe task across three stimuli conditions and three timepoints. this allowed for a broad overview of internal consistency and test-retest reliability estimates across a multiverse of data processing specifications. largely arbitrary decisions in data processing led to differences between the highest and lowest reliability estimate of at least 0.2, but potentially exceeding 0.7. importantly, there was no consistent pattern in the reliability estimates resulting from the data processing specifications, across time as well as tasks. taken together, data processing decisions have a strong, and largely unpredictable, influence on measure reliability. i discuss actions researchers could take to mitigate some of the influence of reliability heterogeneity, including adopting hierarchical modelling approaches. yet, there are no approaches that can completely save us from measurement error. measurement matters, and i call on readers to help us move from what could be a measurement crisis towards a measurement revolution.

keywords: reliability, multiverse, analytic flexibility, data processing

in this paper i was concerned with the influence of analytic flexibility on measurement reliability, specifically in data processing or data cleaning. i took inspiration from numerous papers reporting the unsettlingly low reliability of dot probe attention bias indices (e.g. jones et al., 2018; schmukle, 2005; staugaard, 2009) and other work investigating alternative analyses and data processing strategies, with the intention of yielding a more reliable measurement (e.g. jones et al., 2018; price et al., 2015).
when considering the impact of researcher degrees of freedom, focus is drawn to decisions made at the beginning (task design) or at the end (data analysis) of the research process. i was interested in the middle step: data processing and measure reliability. in this paper, i explore and visualise the influence of data processing steps on reliability using a series of reliability multiverse analyses.

getting up to speed with reliability

the accuracy of our conclusions rests on the quality, and the strength, of our evidence. our evidence rests on the bedrock of our measurements. the quality of our measures defines the quality of our results. without adequate focus on the validity of our measures, how can we be assured that we are capturing the concept or process that we are interested in? without any attention to the reliability of our measures, how can we be sure that we are capturing a phenomenon with any precision? psychological science has a guilty habit of neglecting these foundations, though of course some areas fare better than others. in a recent paper, my colleagues and i argued for a widespread appreciation for the reliability of our cognitive measures (parsons et al., 2019). briefly, low reliability places doubt on the veracity of statistical analyses using that measure; measurement reliability restricts the observable range of effect sizes in simple correlational analyses, and unpredictably in more complicated models; and failing to correct for measurement error makes comparing effect sizes between, and within, studies difficult. these issues are compounded by the sad observation that the reporting of reliability (and validity) evidence is woefully poor. scale validity and reliability are not routinely examined, and many scales are adapted on an ad hoc basis with little or no validation (flake et al., 2017).
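the claim that reliability restricts observable effect sizes can be made concrete with the classical test theory attenuation formula, under which the expected observed correlation is the true correlation scaled by the square root of the product of the two measures' reliabilities. the following is my own minimal python sketch of that formula, not code from the paper:

```python
import math

def attenuated_r(r_true, rel_x, rel_y):
    """expected observed correlation between two measures, given the
    true correlation and each measure's reliability (classical test
    theory attenuation formula)."""
    return r_true * math.sqrt(rel_x * rel_y)

def disattenuated_r(r_obs, rel_x, rel_y):
    """correct an observed correlation for measurement error."""
    return r_obs / math.sqrt(rel_x * rel_y)

# a true correlation of .5, measured with reliabilities of .6 and .7,
# is expected to appear as roughly .32
print(round(attenuated_r(0.5, 0.6, 0.7), 2))  # → 0.32
```

this is why low reliability caps the observable range of effect sizes in simple correlational analyses: whatever the true effect, the expected observed correlation cannot exceed the square root of the product of the reliabilities.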
in other cases scales fail to pass deeper psychometric evaluation, including tests of measurement invariance (hussey and hughes, 2018). this likely reflects issues with more superficial approaches to establishing validity evidence, i.e., reporting cronbach's alpha, stating it is adequate, and moving on. pockets of psychological science take a more enlightened approach. however, i feel it is reasonable to argue that the field at large is not doing well in our measurement practices. most relevant to this paper: it is the exception rather than the norm to evaluate the psychometric properties of cognitive measurements (gawronski et al., 2011). strictly speaking, we cannot state that a task is unreliable, although we might observe a consistent pattern of unreliability in the measurements obtained that causes us to question further use of the task. an important reminder: estimates of reliability refer to the measurement obtained in a specific sample and under particular circumstances, including the task parameters. reliability is therefore not fixed; it may differ between populations, samples, and testing conditions. variations of a task may lead to the generation of more or less reliable measurements. for example, the stimulus presentation duration will likely influence the cognitive processes involved in completing the task, perhaps leading participants to perform more consistently in one version, relative to another. reliability is a property of the measurement, not of the task used to obtain it. in this study, we are concerned with the data processing steps researchers take and how these influence our measurement, and the resulting reliability estimates. to explore this, i invite you to join me, dear reader, on a walk through the garden of forking paths.
analytic flexibility and the garden of forking paths

every result presented in every research article is the culmination of many decisions made by one or more researchers; the sheer number of combinations of valid decisions is likely incalculable. the "garden of forking paths" (gelman and loken, 2013) is a useful analogy to illustrate this. with each decision that must be made, however arbitrary, the researcher comes to a fork in their research path and selects one branch. to add a little suspense, there will be many cases when the researcher does not notice a fork in the road. perhaps the researcher unconsciously makes the same turn as always, their feet working of their own accord. these forks in the path, the decisions researchers make (whether they are aware of them or not), may be reasonably combined to make a near uncountable number of paths. each path also leads to a location; some paths end close to one another, and other times the paths diverge wildly. we can think of the end of the path as the statistical result our researcher arrives at. the researcher has to decide their path, based on the soundest justifications they can make at each fork (e.g., lakens et al., 2018). of course, psychological science has become fully aware of the detrimental effects of selecting one's path retrospectively, based on where the path ends or the results most exciting to the researcher (read as: p < .05; e.g., simmons et al., 2011). analytic flexibility is not inherently bad. however, we must acknowledge the ramifications. the effects we observe, or do not, are potentially influenced by all of the decisions made to arrive at them. thus, a range of possible effects may have been observed that could be more or less equally valid or justifiable based on the analytical decisions made. in discussions of analytical flexibility, focus is usually given primarily to decisions made during statistical analysis. for example, should i control for age and gender?
do i reason that this model is more appropriate than that one? or where should i set my alpha and how should i justify the decision? discussions of analytical flexibility often concern issues around p-hacking and other qrps (intended or unintended). however, as leek and peng (2015) note, p-values are the tip of the iceberg; not enough scrutiny is given to the impact of the many steps in the research pipeline that precede inference testing. i agree. in my estimation, flexibility in measurement and data handling do not receive the scrutiny they deserve. if the garden of forking paths concerns analytic flexibility, then measurement flexibility decides which gateway one enters the garden through in the first place. as an example, a recent review highlighted the lack of consensus around the processing of task data in the attention control literature, including but not limited to the data pre-processing used in this paper (von bastian et al., 2020, p. 47-48).

mapping the garden of forking paths with multiverse analyses

multiverse analyses (steegen et al., 2016) offer us a "gps in the garden of forking paths" (quintana and heathers, 2019). the process is simpler than one might expect. first, we define a set of reasonable data processing and analysis decisions. second, we run the entire set of analyses. we can then examine results across the entire range of specifications. specification curve analysis (simonsohn et al., 2015) adds a third step, allowing for inference tests across the distribution of results generated in the multiverse (for insightful applications of specification curve analyses, see orben and przybylski, 2019; rohrer et al., 2017). in this paper i use 'specification' to refer to each combination of data processing decisions in the multiverse analysis. multiverse analyses enable us to explore how a researcher's – sometimes arbitrary – choices in data processing (e.g. outlier removal) and analysis decisions (e.g.
including covariates, splitting samples) influence statistical results, and the conclusions drawn from the analysis. from this we can examine which choices are more or less influential than others, as well as how robust the result is across the full set of specifications.

a reliability multiverse from many data processing decisions

in this paper i report multiverse analyses exploring the influence of data processing specifications on the reliability of a calculated measurement. i used openly accessible stroop task and flanker task data generously shared by hedge and colleagues (hedge et al., 2018) and dot probe task data from the cogbias project (booth et al., 2017; booth et al., 2019). following our previous work in this area (parsons et al., 2019), i was interested in the stability and range of reliability estimates of cognitive-behavioural measures. broadly, i was interested in the impact of data processing decisions on reliability. it is possible that certain analytic decisions tend to yield higher reliability estimates; it may be that particular combinations of decisions are also better, or worse, than others. beyond that, i was interested in the range of estimates. a small range would suggest that measure reliability is relatively stable as we make potentially arbitrary data processing decisions while walking the garden of forking paths. a large range suggests hidden measurement reliability heterogeneity. this is potentially an important, and underappreciated, contributor to the replicability crisis (loken and gelman, 2017). alternatively, this could be a herald for a crisis of measurement.

methods

data

stroop and flanker task data were obtained from the online repository for hedge, sumner, and powell (hedge et al., 2018; https://osf.io/cwzds/). full details of the data collection, study design, and procedure can be found in hedge et al. (2018).
these data are ideal for our purposes as they a) contain many trials, helping us obtain more precise estimates of reliability, and b) include two assessment time-points approximately 3-4 weeks apart, allowing us to explore both internal consistency and test-retest reliability. the data were collected from different studies; for simplicity in this paper, the data across studies were pooled (n = 107 before any data processing – note that this may differ from the sample size presented by hedge et al. due to differences in data processing). dot probe data were obtained from the cogbias project (booth et al., 2017; booth et al., 2019). full details of the study and data collection can be found in booth et al. (2017; 2019). these data complement the stroop and flanker data as they provide a longer test-retest duration (approximately 1.5 years between repeated measures) across three timepoints. in addition, the task incorporated three stimuli conditions, allowing cross-sectional comparisons of reliability stability within the same task. the dot probe data were pooled such that only the subset of participants who completed the task at all three timepoints was retained (n = 285). interested readers can find the data and code used to perform the multiverse analyses and generate this manuscript in the open science framework repository for this project (https://osf.io/haz6u/).[1]

[1] i used the following r packages for all analyses and figures, and to generate this document: r (version 4.1.0; r core team, 2018) and the r-packages cairo (version 1.5.12.2; urbanek and horner, 2019), dplyr (version 1.0.9; wickham, françois, et al., 2019), forcats (version 0.5.1; wickham, 2019a), ggplot2 (version 3.3.6; wickham, 2016), gridextra (version 2.3; auguie, 2017), papaja (version 0.1.1; aust and barth, 2018), patchwork (version 1.1.1; pedersen, 2019), psych (version 2.1.6; revelle, 2019), purrr (version 0.3.4; henry and wickham, 2019), readr (version 1.4.0; wickham et al., 2018), splithalf (version 0.8.1; parsons, 2021), stringr (version 1.4.0; wickham, 2019b), tibble (version 3.1.7; müller and wickham, 2019), tidyr (version 1.2.0; wickham and henry, 2019), tidyverse (version 1.3.1; wickham, averick, et al., 2019), and tinylabels (version 0.2.3; barth, 2022).

stroop task

participants made keyed responses to the colour of a word presented in the centre of the screen. in congruent conditions the word was the same as the font colour, whereas, in incongruent trials, the word was a different colour from the font colour. in a neutral condition, the word was not a colour word. participants completed 240 of each trial type. the outcome index we explore here is the rt cost, calculated as the average rt for incongruent trials minus the average rt for congruent trials.

flanker task

participants made keyed responses to the direction of a central arrow presented in the centre of the screen. in congruent trials the flanking arrows pointed in the same direction as the central arrow, whereas, in incongruent trials, they pointed in the opposite direction. in a neutral condition, the flanking stimuli were non-directional. participants completed 240 of each trial type. the outcome index we explore here is the rt cost, calculated as the average rt for incongruent trials minus the average rt for congruent trials.

dot probe task

participants made keyed responses to the identity of a probe presented on screen. the probe was presented in the same location as one of the paired faces presented on screen for 500ms prior. the paired faces were an emotional face (angry, pained, or happy) paired with a neutral face (taken from the stoic faces database, roy et al., 2009). in congruent trials, the probe was presented in the same location as the emotional face. in incongruent trials, the probe was presented in the same location as the neutral face. participants completed three blocks of 56 trials corresponding to the emotion presented. the 'attention bias' outcome index (macleod et al., 1986) was calculated as the average rt for incongruent trials minus the average rt for congruent trials.

multiverse analysis

in an effort to make my research reproducible, and also to help others perform similar analyses, i have developed simple functions to perform the multiverse analyses reported in this paper. readers interested in performing similar analyses can find these functions within the splithalf package (parsons, 2021) and tutorials on the related github page (https://github.com/sdparsons/splithalf). the key functions are: splithalf.multiverse, testretest.multiverse, and multiverse.plot. intraclass correlation coefficients (icc2) were estimated using the psych r package (revelle, 2019). interested readers can also inspect the code used to perform the analyses in this paper (https://osf.io/haz6u/).

step 1. creating a list of all specifications.

no data were removed before the multiverse analysis. to my knowledge, there are no fixed standards in the literature for processing data from any of the tasks. i identified six decisions common to processing rt data, though there are many more. for simplicity i stuck to rt difference scores as the outcome measure of interest. however, there are very different analytical techniques that might be applied to rt tasks such as this (for example, multilevel modelling and drift-diffusion modelling approaches). the decisions were as follows:

• total accuracy. researchers may opt to remove participants with accuracy lower than a prespecified cut-off, for example 80 or 90 per cent. i used three options: no cut-off, 80 per cent, and 90 per cent.

• absolute response time removals. researchers will often remove trials faster than a minimum rt threshold and trials that exceed a maximum rt threshold. i used minimum rt cut-offs of 100ms and 200ms, as well as no cut-off, and two maximum rt cut-offs: 2000ms and 3000ms.

• relative rt cut-offs. after absolute rt cut-offs, researchers can decide to remove trials with rts greater than a number of standard deviations from the mean (sometimes called relative cut-offs or trimmed means). three sds from the mean would remove very extreme outliers; two sds from the mean is common. i have not seen researchers use one sd from the mean as a cut-off, as it is likely too conservative a threshold. as i was interested in a wide range of possible specifications, i included one standard deviation. i used no relative cut-off, and one, two, and three sds from the mean, in the multiverse.

• where to apply the relative cut-off. the decision to remove trials based on a sd cut-off comes with its own decision: at what granularity? we could remove trials with rts greater than 2 sds from the participant's average rt, for example. we could also remove trials with rts greater than 2 sds from the mean rt within each trial type (congruent and incongruent, for example). i included both options: participant level, and trial type level.

• averaging. most often the mean rt within each trial type is calculated, and may then be analysed directly, or used to calculate a difference score for analysis. researchers may opt to use the median rt instead. i included both options.

the number of possible combinations (data processing specifications) quickly increases with every additional option. here we have 3 × 2 × 3 × 4 × 2 × 2 = 288 possible specifications.

step 2. run all specifications and extract reliability estimates.

from this decision list, we have a complete list of 288 data processing specifications. in the multiverse analysis the data are processed following each specification's parameters, before estimating the reliability of the resulting outcome measure.
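to make the combinatorial structure concrete, the decision grid and the processing pipeline for a single specification can be sketched as follows. this is an illustrative python sketch, not the paper's actual pipeline (which uses the r splithalf package); the option names are invented here, accuracy screening is omitted, and the relative cut-off is applied within each trial type only.

```python
from itertools import product
from statistics import mean, median, stdev

# one entry per data-processing decision described above;
# option values mirror the text, the key names are my own
DECISIONS = {
    "accuracy_cutoff": [None, 0.80, 0.90],      # 3 options
    "max_rt": [2000, 3000],                     # 2
    "min_rt": [None, 100, 200],                 # 3
    "sd_cutoff": [None, 1, 2, 3],               # 4
    "sd_level": ["participant", "trial_type"],  # 2
    "average": ["mean", "median"],              # 2
}

def all_specifications(decisions):
    """cartesian product of every option of every decision."""
    keys = list(decisions)
    return [dict(zip(keys, combo))
            for combo in product(*(decisions[k] for k in keys))]

def rt_cost(congruent, incongruent, spec):
    """apply one specification to two lists of trial rts (ms) and
    return the incongruent-minus-congruent difference score.
    simplified: accuracy screening is skipped and the relative
    cut-off is applied within each trial type only."""
    average = mean if spec["average"] == "mean" else median
    def clean(rts):
        rts = [rt for rt in rts
               if (spec["min_rt"] is None or rt >= spec["min_rt"])
               and rt <= spec["max_rt"]]
        if spec["sd_cutoff"] is not None and len(rts) > 1:
            m, s = mean(rts), stdev(rts)
            rts = [rt for rt in rts if abs(rt - m) <= spec["sd_cutoff"] * s]
        return rts
    return average(clean(incongruent)) - average(clean(congruent))

specs = all_specifications(DECISIONS)
print(len(specs))  # 3 * 2 * 3 * 4 * 2 * 2 = 288
```

in the full analysis, each of the 288 specifications would be applied to every participant's trials, and the reliability of the resulting difference scores estimated once per specification.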
internal consistency was estimated using 500 permutations of the splithalf (parsons et al., 2019) procedure for each specification (5000 is standard, but 500 was selected to reduce processing time). following hedge et al. (2018), and because icc relates to both the correlation and the agreement among repeated measures, test-retest reliability was estimated using icc2k (koo & li, 2016). step 3. visualising the multiverse. i find that one of the joys of multiverse analyses is the visualisations, because sometimes science is more art than science. i explain the visualisations in the results section. analysis plan for the core analysis i performed 18 multiverse analyses following the steps described above. separately for each of the stroop and flanker task data, i examined internal consistency reliability at time 1 and at time 2, as well as test-retest reliability from time 1 to time 2. for the dot probe data, i examined internal consistency reliability at each of the three timepoints, separately for the three task conditions, as well as test-retest reliability across timepoints. for each multiverse i report the median estimate and its 95% confidence interval, the proportion of estimates exceeding 0.7, and the range of estimates in that multiverse. in addition to visualising each multiverse, i also include visualisations overlapping the internal consistency multiverses over time. these overlapped plots allow us to visually inspect whether the pattern of reliability estimates following the full range of data processing specifications is comparable across each time point. inferences from the multiverse it is not my aim in this paper to make inferences from these reliability multiverse analyses as one would in a specification curve analysis (simonsohn et al., 2015). one could use this method to perform inference testing against the curve of reliability estimates.
however, it is not clear what this would add: testing whether the reliability estimates significantly differ from zero is a low bar for assessing the reliability of a measure. results i include a visualisation for each multiverse analysis. the reliability estimates are presented on the y-axis at the top of the figure; each estimate is represented by a black dot and the 95% confidence interval is represented by the shaded band. the x-axis indicates each individual multiverse specification of processing decisions (288 total), displayed in the 'dashboard' at the bottom of the figure. the vertical dashed line running through the top panel and the bottom dashboard represents the median reliability estimate. this line is extended through the dashboard to demonstrate that the estimate is derived from the unique combination of data processing decisions, including (from top to bottom, in order of processing step): 1) participant removal below total accuracy threshold, 2) maximum rt cut-off, 3) minimum rt cut-off, 4) removal of rts > this number of sds from the mean, 5) whether this removal is at the trial or subject level, and 6) use of mean or median to derive averages. stroop time 1: internal consistency the median reliability estimate was 0.76, 95% ci [0.69, 0.92]. estimates ranged from 0.68 to 0.92. 97% of the reliability estimates were > 0.7. stroop time 2: internal consistency the median reliability estimate was 0.66, 95% ci [0.61, 0.89]. estimates ranged from 0.58 to 0.90. 25% of the reliability estimates were > 0.7. stroop: test-retest the median reliability estimate was 0.56, 95% ci [0.50, 0.63]. estimates ranged from 0.47 to 0.63. 0% of the reliability estimates were > 0.7. flanker time 1: internal consistency the median reliability estimate was 0.82, 95% ci [0.65, 0.92]. estimates ranged from 0.62 to 0.93. 93% of the reliability estimates were > 0.7. figure 1.
internal consistency reliability multiverse for stroop rt cost at time 1 flanker time 2: internal consistency the median reliability estimate was 0.71, 95% ci [0.62, 0.91]. estimates ranged from 0.59 to 0.91. 55% of the reliability estimates were > 0.7. flanker: test-retest the median reliability estimate was 0.55, 95% ci [0.30, 0.69]. estimates ranged from 0.29 to 0.72. 2% of the reliability estimates were > 0.7. overlapping time 1 and time 2 multiverses in the next two figures i overlap the time 1 and time 2 multiverses, separately for the stroop and flanker data. the specifications are ordered by the reliability estimates at time 1 for each measure (figures 1 and 3). these figures allow us to compare the patterns of reliability estimates following the same data processing decisions. dot probe task for ease of presentation (and to reduce the total number of figures), we visualise the dot probe task reliability multiverses entirely as overlapping plots. angry faces for the angry faces condition, the median and 95% ci for each wave of testing were: wave 1, -0.04, 95% ci [-0.17, 0.58]; wave 2, 0.01, 95% ci [-0.21, 0.61]; wave 3, 0.03, 95% ci [-0.08, 0.65]. figure 2. internal consistency reliability multiverse for stroop rt cost at time 2 happy faces for the happy faces condition, the median and 95% ci for each wave of testing were: wave 1, -0.09, 95% ci [-0.19, 0.58]; wave 2, -0.01, 95% ci [-0.10, 0.65]; wave 3, 0.07, 95% ci [-0.04, 0.65]. pained faces for the pained faces condition, the median and 95% ci for each wave of testing were: wave 1, 0.04, 95% ci [-0.26, 0.65]; wave 2, -0.09, 95% ci [-0.17, 0.60]; wave 3, 0.15, 95% ci [-0.08, 0.68]. dot probe: test-retest test-retest reliability estimates (icc2) for each condition were: angry, 0.04, 95% ci [0, 0.10]; happy, 0, 95% ci [0, 0.07]; pain, 0, 95% ci [0, 0.01]. secondary analyses: reliability and number of trials increasing the number of trials typically increases reliability estimates (e.g.
hedge et al., 2018; von bastian et al., 2020). a visual inspection of the multiverses suggests that specifications involving the removal of more trials (i.e. removing trials greater than 1 standard deviation from the average) lead to higher reliability estimates. table 1 presents the pearson correlations between the reliability estimates and the number of trials retained in each specification. for internal consistency reliability these correlations typically ran counter to the expectation that fewer trials lead to reduced reliability. in most cases the association was negative: removing more trials during data processing was associated with higher reliability estimates. in contrast, for most of the test-retest reliability multiverses, removal of more trials led to lower reliability estimates. figure 3. test-retest reliability multiverse for stroop rt cost to investigate this further, i reran the multiverses for stroop and flanker data using only the first half of trials collected for each participant. i also reran the multiverses for the dot probe data using only the first 20 trials for each trial type (i attempted to rerun the dot probe data with only 14 trials for each trial type, but this led to errors under stricter specifications where there were too few trials to run the reliability estimation). to save the reader from viewing all 18 multiverses for a second time, the code and all outputs can be found in the supplementary materials. on visual inspection of the multiverse visualisations, the overall pattern of results is similar: specifications resulting in the removal of more trials tend to result in higher reliability estimates. the final column in table 1 presents the mean difference in reliability estimates for each of the 18 multiverses (positive values indicate higher reliability estimates with the full number of trials).
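the textbook expectation that more trials yields higher reliability can be illustrated with a small classical-test-theory simulation. this python sketch uses invented data and a simplified permutation split-half estimator with a spearman-brown correction; it is an illustration of the general technique, not the splithalf package's implementation:

```python
import numpy as np

def splithalf_sb(x, rng, n_perms=100):
    """mean spearman-brown corrected correlation over random half-splits.
    x: 2d array (participants x trials); a simplified illustrative estimator."""
    out = []
    for _ in range(n_perms):
        idx = rng.permutation(x.shape[1])
        a = x[:, idx[: x.shape[1] // 2]].mean(axis=1)
        b = x[:, idx[x.shape[1] // 2 :]].mean(axis=1)
        r = np.corrcoef(a, b)[0, 1]
        out.append(2 * r / (1 + r))  # spearman-brown step-up
    return float(np.mean(out))

rng = np.random.default_rng(0)
true_scores = rng.normal(size=(200, 1))        # stable individual differences
noise = rng.normal(scale=2.0, size=(200, 80))  # trial-level noise
full = true_scores + noise                     # 80 trials per person
short = full[:, :20]                           # first 20 trials only
print(splithalf_sb(short, rng), splithalf_sb(full, rng))
```

in this simple generating model, halving or quartering the trial count reliably lowers the split-half estimate, which is exactly the expectation the internal consistency correlations in table 1 run counter to.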
for internal consistency estimates: multiverses with fewer trials had lower reliability estimates, on average, for the stroop and flanker tasks. but, against expectations, reliability estimates increased for the dot probe task when the number of trials was reduced. in contrast, almost all test-retest estimates were reduced in the reduced number of trials analyses. figure 13 presents the difference between reliability estimates in full vs reduced trials multiverses for all 18 multiverse analyses. discussion across 18 reliability multiverse analyses, and their colourful visualisations, we explored the influence of data pre-processing specifications on measure reliability. to briefly summarise: internal consistency reliability estimates ranged from 0.58 to 0.92 in the stroop data, 0.59 to 0.93 in the flanker data, and -0.28 to 0.68 in the dot probe data. test-retest reliability estimates ranged from 0.47 to 0.63 in the stroop data, 0.29 to 0.72 in the flanker data, and 0 to 0.11 in the dot probe data. figure 4. internal consistency reliability multiverse for flanker rt cost at time 1 from the introduction we recall that reliability estimates are a product of: the sample and the population they are drawn from, the task (including any differences in implementation), and the circumstances in which the measurement was obtained; i.e. reliability is not an inherent quality of the task itself. the first conclusion we can draw from these multiverse analyses is that data processing specifications are also an integral part of this list. at the outset of this project, i thought it reasonable to assume that a particular feature of the data processing path might result in consistently higher (or lower) reliability estimates. the clearest indication we can take from these analyses is that there is no single set of data processing specifications, or combination of data processing decisions, that leads to improved reliability.
the wide ranges of estimates are an additional cause for concern. seemingly arbitrary data processing decisions can lead to differences of more than .3 in the reliability of a measure. these decisions are all equally reasonable and logical choices, and we should not expect them to have a meaningful impact on the theoretical questions being asked of the data. the reliability multiverse analyses presented here demonstrate this using data from a stroop and a flanker task. beyond comparisons across tasks, overlapping the time 1 and time 2 multiverses for both tasks highlights that even the same set of specifications does not lead to directly comparable internal consistency reliability estimates over time. data processing decisions appear to be extremely important contributors to measure reliability, but their influence is unpredictable and arbitrary. figure 5. internal consistency reliability multiverse for flanker rt cost at time 2 the secondary analyses give us more insight into the relationship between the number of trials retained through data processing and the resultant reliability estimates. the picture is not a simple one. figure 13 highlights the unpredictable influence of what is essentially another multiverse specification decision – do i remove half of the trials before any other data processing? while the underlying pattern of more data reduction leading to greater reliability generally holds across tasks, within tasks fewer trials led to lower reliability on average for the stroop and flanker tasks (as we should expect) but not the dot probe. more work is needed to unravel these influences, but a take-home message may be: while administering more trials to participants is typically a good thing for reliability, there may be some benefit (in terms of reliability) to removing more trials. though, as i discuss below, pursuit of reliability alone should not be the goal.
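a miniature version of the core demonstration above can be sketched in python: take one decision from the multiverse (the relative sd cut-off) and watch the split-half estimate move as the cut-off changes. the data and the estimator here are invented for illustration; this is not the paper's r pipeline:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sub, n_trials = 60, 80
true_scores = rng.normal(loc=50, scale=15, size=(n_sub, 1))
# right-skewed trial noise, loosely mimicking rt data
rts = true_scores + rng.lognormal(mean=3.5, sigma=0.8, size=(n_sub, n_trials))

def splithalf_sb(x, rng, n_perms=100):
    """mean spearman-brown corrected correlation over random half-splits
    (nan-aware, so trimmed trials are simply ignored)."""
    out = []
    for _ in range(n_perms):
        idx = rng.permutation(x.shape[1])
        a = np.nanmean(x[:, idx[: x.shape[1] // 2]], axis=1)
        b = np.nanmean(x[:, idx[x.shape[1] // 2 :]], axis=1)
        r = np.corrcoef(a, b)[0, 1]
        out.append(2 * r / (1 + r))
    return float(np.mean(out))

estimates = {}
for sd_cut in (None, 1, 2, 3):  # one decision from the multiverse
    trimmed = rts.astype(float).copy()
    if sd_cut is not None:
        m = trimmed.mean(axis=1, keepdims=True)
        s = trimmed.std(axis=1, keepdims=True)
        trimmed[np.abs(trimmed - m) > sd_cut * s] = np.nan  # drop outlying trials
    estimates[sd_cut] = splithalf_sb(trimmed, rng)
print(estimates)
```

looping the remaining five decisions over their options in the same way reproduces the full 288-specification grid; the point is simply that the reliability estimate is a function of the processing choice, not of the data alone.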
in the core of this discussion i raise several open questions and suggest some plausible actions that could be taken to mitigate some of the risk reliability heterogeneity poses. how do we guard against reliability heterogeneity? in simple bivariate analyses, we usually think low reliability will simply attenuate estimated effect sizes (e.g. spearman, 1904). but the influence can be far less predictable (the reader may be noticing a trend of unpredictability in this paper). low reliability can lead to elevation of effect size estimates and even reversals in direction (for examples, see brakenhoff et al., 2018; segerstrom and boggero, 2020), with the influence becoming more unpredictable in more complex models. figure 6. internal consistency reliability multiverse for flanker rt cost at time 2 it is therefore important to take reliability heterogeneity into account when comparing effect sizes (for several clear examples, see cooper et al., 2017). it is plausible that some studies may have obtained smaller or larger effect sizes than others based, in part, on the reliability of the measurements taken. similarly, identical observed effect sizes may represent very different 'true' effect sizes, once reliability is taken into account. recently, wiernik and dahlke (2020) made a strong case for correcting for measurement error in meta-analyses and provided the necessary formulae and code for doing so. there are several actions we can take to begin to account for reliability heterogeneity. two simple recommendations to briefly reiterate two recommendations i and my colleagues have made previously: a) report all data processing steps taken, and b) report the reliability of measures analysed (parsons et al., 2019). these recommendations will not 'fix' potential psychometric issues within one's study, or reliability heterogeneity across studies. however, complete reporting of data processing will assist in the computational reproducibility of one's results.
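as an aside on the attenuation point above, spearman's (1904) classical correction can be written in one line; the numbers below are hypothetical, chosen only to show the mechanics:

```python
def disattenuate(r_obs, rel_x, rel_y):
    """spearman's correction for attenuation:
    r_true = r_obs / sqrt(rel_x * rel_y)."""
    return r_obs / (rel_x * rel_y) ** 0.5

# hypothetical numbers: an observed r of .25 with both reliabilities at .5
print(disattenuate(0.25, 0.5, 0.5))  # -> 0.5
```

the correction makes the dependence explicit: the same observed correlation implies a very different 'true' correlation under different reliabilities, which is exactly why reliability heterogeneity complicates effect size comparisons.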
reporting psychometric information will assist in the interpretation of results, including comparisons of effect sizes, as well as provide useful information about the utility of a task in studies of individual differences. figure 7. overlapped internal consistency reliability multiverse for stroop rt cost at times 1 and 2 multiverse analyses as a robustness check one approach is running a multiverse across a justified set of data processing specifications (ones that yield the same theoretically justified construct of interest; see the below section on validity) and generating a distribution of effect sizes from the final analyses under these specifications. in principle this is the same as a sensitivity or robustness analysis, and acts as a check on the reliability heterogeneity introduced by different (but equally justifiable) data processing specifications. adopt a modelling approach incorporating trial-level variation into our analyses with hierarchical modelling approaches (aka mixed models, multilevel models) will likely be a vital step in protecting us against reliability heterogeneity. psychological effects are often heterogeneous across individuals (bolger et al., 2019), and factors within tasks have important effects (e.g. stimulus differences; debruine and barr, 2021). it follows that our models should take trial-level variation into account. more than this, models that capture the theorized data generating process, including relevant distributions (e.g. response time distributions are typically very right skewed), likely have a better chance of capturing the process of interest in the first place. using the stroop and flanker data from hedge et al. (2018), rouder, kumar, and haaf (2019; see also rouder and haaf, 2018) demonstrated that hierarchical models should be used to account for error in measurement (for additional guidance on applying this modelling, see haines, 2019).
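the intuition for why modelling measurement error helps can be shown with a toy partial-pooling (empirical-bayes-style shrinkage) sketch on simulated data. this is a deliberately simplified stand-in for the hierarchical models discussed here, not rouder and colleagues' implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
n_sub, n_trials = 100, 20
true_effect = rng.normal(loc=50, scale=10, size=n_sub)  # per-person true rt cost
observed = true_effect[:, None] + rng.normal(scale=60, size=(n_sub, n_trials))

person_means = observed.mean(axis=1)                         # no-pooling estimates
error_var = observed.var(axis=1, ddof=1).mean() / n_trials   # error variance of a mean
total_var = person_means.var(ddof=1)
signal_var = max(total_var - error_var, 0.0)
weight = signal_var / total_var                              # reliability-like weight
shrunk = weight * person_means + (1 - weight) * person_means.mean()

# partial pooling pulls noisy per-person estimates toward the group mean,
# reducing error relative to the raw means when trial noise is large
mse_raw = float(np.mean((person_means - true_effect) ** 2))
mse_shrunk = float(np.mean((shrunk - true_effect) ** 2))
print(round(mse_raw, 1), round(mse_shrunk, 1))
```

a full hierarchical model does this weighting (and more) as part of estimation, which is what allows it to 'correct' effect sizes for measurement error inside the model rather than as an afterthought.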
adopting this approach has the benefit of 'correcting' the effect size estimate (and standard error) for measurement error as part of the model, rather than as an additional step to aid in interpretations and effect size comparisons (a step that is often missed once reliability is deemed "acceptable", assuming that reliability is estimated in the first place). figure 8. overlapped internal consistency reliability multiverse for flanker rt cost at times 1 and 2 rouder and colleagues demonstrate that this is also a more effective approach than 'correcting' the effect size estimate using e.g. spearman's correction for attenuation formula (spearman, 1904). yet, even better corrections cannot fully save us from measurement error. hierarchical models do bring their own considerations and potential issues. applied researchers, or those without training, may need further support to ensure the model specifications are appropriate. the model covariance structure, and appropriate priors in the case of bayesian approaches, do have the potential to introduce additional sources of bias/researcher degrees of freedom. but, given existing resources and a growing body of training materials and work in this area, it is my view that a modelling approach is likely the best next step (haines et al., 2020; rouder et al., 2019; sullivan-toole et al., 2021; debruine and barr, 2021). an additional benefit of these approaches is that they typically avoid much of the data pre-processing discussed in this paper, and thus the reliability heterogeneity it generates. limitations and room for expansion a small number of tasks. one limitation of this study is the focus on a small sample of tasks. it is possible that data from other tasks tend to yield more or less consistent patterns of reliability estimates across data processing specifications. similarly, i have only examined rt costs (i.e. a difference score between two trial types) as the outcome measure. figure 9. internal consistency reliability multiverse for dot probe attention bias (angry faces) at times 1, 2, and 3 the analyses could have examined accuracy rates, rt averages, signal detection, and a wide variety of other outcome measures. it is very possible that other outcome indices would be more or less consistently reliable across the range of data processing specifications. i opted for brevity in this paper by selecting only these tasks; i welcome future work seeking to examine a wider range of tasks and outcome indices. extracting the influence of individual decisions. the analyses here do not allow for an in-depth examination of the influence of specific data processing decisions. given the lack of consistency across timepoints and measures, i am not confident that robust conclusions could be drawn about a specific decision compared to another. a plausible approach to examine this is a vibration of effects analysis (e.g. klau et al., 2021) in which the variance of the final distribution of estimates can be decomposed to examine the relative influence of different categories of decisions, e.g. model specifications and data processing decisions. using this information, we might be able to prioritise sources of measurement heterogeneity more accurately. applicability to experimental vs correlational analyses. there is a paradox in measurement reliability (see hedge et al., 2018): experimental effects that are highly replicable (for example, the stroop effect) may also show low reliability. homogeneity within groups or experimental conditions allows for larger and more robust effects; researchers can opt to develop tasks that capitalise on homogeneity. unfortunately, reliability requires robust individual differences (and vice versa). highly reliable measures by necessity show consistent, potentially large, individual differences and
internal consistency reliability multiverse for dot probe attention bias (happy faces) at times 1, 2, and 3
would not be suitable for group differences or experimental research. as a result, measures tend to be more appropriate for questions of either a) assessing differences between groups or experimental conditions, or b) correlational or individual differences analyses. i was primarily concerned with the use of these measures in individual differences research – hence the focus on reliability. yet, it would be overly simplistic to assert that the discussions in this paper do not also relate to experimental questions. indeed, the data processing specifications that maximise a measure's utility in individual differences analyses can also hinder the measure's utility in experimental questions. further research would be needed to quantify the relative influences on correlational vs experimental analyses. yet, large fluctuations in relative between-subjects vs within-subjects variance, due to data processing, hold importance for any research question. simulation studies. several valuable extensions to the current approach could be made via simulation approaches. by simulating data with a known measurement structure, we could examine the variance in reliability estimates that operates purely by chance: i.e. where no systematic differences in reliability exist across preprocessing decisions. comparing those distributions to the ones observed in tasks such as those analysed here would offer insight into how severe the reliability heterogeneity introduced in "real world" data is. these simulations are beyond the scope of this initial paper; however, they hold promise for detecting variance and bias relative to a 'true' value of reliability in the simulated data. figure 11. internal consistency reliability multiverse for dot probe attention bias (pain faces) at times 1, 2, and 3 what about validity?
others have previously demonstrated that measures are often used ad hoc or with little reported validation effort (e.g. flake et al., 2017; hussey and hughes, 2018). this study cannot begin to assess the influence of data processing flexibility on measure validity – nor did this paper attempt to address this question. reliability is only one piece of evidence needed to demonstrate the validity of a measure. yet, it is an important piece of evidence, as "reliability provides an upper bound for validity" (zuo et al., 2019, p. 3). while we cannot directly conclude that flexibility in data processing influences measure validity, we should look to further research to investigate. one possibility would be to conduct a validity multiverse analysis similar to the "many analysts, one data set" project (silberzahn et al., 2018). in this project, 29 teams (61 analysts in total) analysed the same dataset. the teams adopted a number of different analytic approaches, which resulted in a range of results. the authors concluded that "uncertainty in interpreting research results is therefore not just a function of statistical power or the use of questionable research practices; it is also a function of the many reasonable decisions that researchers must make in order to conduct the research" (p. 354). another important validity consideration is the relationship between our data processing pipelines and the (latent) construct of interest. in questionnaire development, removing or adapting items might influence reliability. but, more importantly, it will give rise to a different measure that may be more or less related to our latent construct of interest. for example, fried (2017) found that several common depression questionnaires captured very different clusters of symptoms, which figure 12.
internal consistency reliability multiverse for dot probe attention bias (pain faces) at times 1, 2, and 3
should make us question what is meant by "depression" in the first place when using these measures. more relevant to task measures: to maximise reliability, we might seek to develop a novel version of a task that relies on average response times, instead of a difference score between average response times. while this would yield highly reliable measures, the purpose of the difference score is to isolate the process of interest. therefore, while we would have maximised reliability, we would also have influenced both the construct of interest and the validity of the measure. perhaps this more reliable measure fails to capture the effect we intended to measure. for a more in-depth discussion about balancing these theoretical, validity, and reliability considerations, see von bastian et al. (2020) and goodhew and edwards (2019). with respect to the data pre-processing steps taken in this paper, it could reasonably be argued that some pre-processing specifications yield different constructs of interest, or could be more or less valid for the process of interest. are we really interested in a construct defined by including only very accurate participants and only 60% of trials close to the average response time? in this sense, the data pre-processing decisions a researcher might adopt are certainly not arbitrary from a validity standpoint. a reasonable approach in applied work would be to select a narrower set of processing specifications that the researcher believes are theoretically similar enough that the same construct is being measured. returning to the garden my intention for this project was to provide some indication of the influence of data processing pathways on the reliability of our cognitive measurements.
table 1. correlations between reliability estimates and number of trials retained across specifications

task       time  measure    correlation  95% ci          difference
stroop     1     splithalf  -0.38        [-0.48, -0.28]   0.13
stroop     2     splithalf  -0.38        [-0.48, -0.28]   0.10
flanker    1     splithalf  -0.61        [-0.68, -0.53]   0.12
flanker    2     splithalf  -0.55        [-0.63, -0.47]   0.24
dpt angry  1     splithalf  -0.54        [-0.62, -0.45]  -0.03
dpt angry  2     splithalf  -0.66        [-0.72, -0.58]  -0.27
dpt angry  3     splithalf  -0.27        [-0.37, -0.16]  -0.23
dpt happy  1     splithalf  -0.58        [-0.65, -0.50]  -0.02
dpt happy  2     splithalf  -0.51        [-0.59, -0.42]  -0.22
dpt happy  3     splithalf  -0.42        [-0.51, -0.32]  -0.23
dpt pain   1     splithalf  -0.59        [-0.66, -0.51]  -0.06
dpt pain   2     splithalf  -0.39        [-0.49, -0.29]  -0.27
dpt pain   3     splithalf  -0.15        [-0.26, -0.03]  -0.20
stroop     –     icc         0.37        [0.26, 0.46]     0.08
flanker    –     icc        -0.59        [-0.66, -0.51]   0.11
dpt angry  –     icc         0.61        [0.54, 0.68]     0.04
dpt happy  –     icc         0.42        [0.32, 0.51]    -0.01
dpt pain   –     icc        -0.01        [-0.12, 0.11]    0.00

the influence can be profound; the multiverse analyses show large differences between the highest and lowest reliability estimates. yet, we see little consistency in the pattern of decisions leading to higher, or lower, estimates. we have the worst of both worlds: data processing decisions are largely arbitrary yet can have a large – and relatively unpredictable – impact on the resulting reliability estimates. briefly returning to the garden of forking paths metaphor: i imagined that this project would help illuminate the point at which our hypothetical researcher would enter the garden, based on their data processing decisions. but our investigation has uncovered an unfortunate secret: our researcher's forking paths are almost entirely arbitrary and interwoven. each path diverges wildly, leading almost anywhere in the garden. it is as if our researcher is simply spinning in dizzy circles until they stumble somewhere along the fence of reliability. i discussed several actions researchers can take collectively to help with the issue.
but these were by no means complete remedies for our reliability issues, nor would they directly help with the validity of our measurements. thankfully, there is a growing awareness that measurement matters (fried & flake, 2018). a valuable term, questionable measurement practices (qmps), was recently added to our vernacular by flake and fried (2020). qmps describe "decisions researchers make that raise doubts about the validity of the measures used in a study, and ultimately the validity of the final conclusion" (p. 458). i hope that qmps and the importance of measurement become as widely discussed as the parallel term, 'questionable research practices' (qrps). most importantly, wider discussion of these practices should make it clear to all researchers that we make many potentially impactful decisions in the design of our measures, our data processing or cleaning, and our data analysis. i am concerned that we sit on the precipice of a measurement crisis. the so-called replication crisis shook much of our field into widespread and ongoing reforms. yet, much of the focus has been on improving methodological and statistical practices. this is undoubtedly worthwhile, but it largely omits discussion of the reliability and validity of our measurements – despite our measurements forming the basis of any outcome or inference. this oversight feels like repairing a damaged wall while ignoring the shifting foundations under it. i hope that this paper, along with other related work, highlights the issue and encourages researchers to place more emphasis on quality measurement. as a field, we can orchestrate a measurement revolution (cf. the "credibility revolution," vazire, 2018) in which the quality and validity of our measurements are placed above obtaining desired results. if the reader takes home a single message from this paper, please let it be "measurement matters." figure 13.
difference in reliability estimates from all trials to reduced trials. note: red = test-retest icc2, blue = internal consistency estimate. author contact correspondence should be addressed to sam parsons, donders institute for brain, cognition and behaviour, radboud university medical center, nijmegen, the netherlands. email: sam.parsons@radboudumc.nl. orcid: 0000-0002-7048-4093. conflict of interest i declare no conflicts of interest. funding sp is currently supported by a radboud excellence fellowship. this work was initially supported by an esrc grant [es/r004285/1]. acknowledgements i would like to thank ana todorovic for her insightful feedback on an earlier version of this manuscript. author contributions sp was responsible for all aspects of this manuscript: data analysis, visualisations, writing, & revisions. open science practices this article earned the open materials badge for making the materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement. references auguie, b. (2017). gridextra: miscellaneous functions for "grid" graphics [r package version 2.3]. https://cran.r-project.org/package=gridextra aust, f., & barth, m. (2018). papaja: create apa manuscripts with r markdown [r package version 0.1.0.9842]. https://github.com/crsh/papaja barth, m. (2022). tinylabels: lightweight variable labels [r package version 0.2.3]. https://cran.r-project.org/package=tinylabels bolger, n., zee, k. s., rossignac-milon, m., & hassin, r. r. (2019). causal processes in psychology are heterogeneous. journal of experimental psychology: general, 148(4), 601–618. https://doi.org/10.1037/xge0000558 booth, c., songco, a., parsons, s., heathcote, l., vincent, j., keers, r., & fox, e. (2017).
the cogbias longitudinal study protocol: cognitive and genetic factors influencing psychological functioning in adolescence. bmc psychology, 5(1). https://doi.org/10.1186/s40359-017-0210-3 booth, c., songco, a., parsons, s., heathcote, l. c., & fox, e. (2019). the cogbias longitudinal study of adolescence: cohort profile and stability and change in measures across three waves. bmc psychology, 7(73). https://doi.org/10.1186/s40359-019-0342-8 brakenhoff, t. b., van smeden, m., visseren, f. l. j., & groenwold, r. h. h. (2018). random measurement error: why worry? an example of cardiovascular risk factors (r. sichieri, ed.). plos one, 13(2), e0192298. https://doi.org/10.1371/journal.pone.0192298 cooper, s. r., gonthier, c., barch, d. m., & braver, t. s. (2017). the role of psychometrics in individual differences research in cognition: a case study of the ax-cpt. frontiers in psychology, 8, 1–16. https://doi.org/10.3389/fpsyg.2017.01482 debruine, l., & barr, d. j. (2021). understanding mixed-effects models through data simulation. advances in methods and practices in psychological science, 4(1), 1–15. https://doi.org/10.1177/2515245920965119 flake, j. k., & fried, e. i. (2020). measurement schmeasurement: questionable measurement practices and how to avoid them. advances in methods and practices in psychological science, 3(4), 456–465. flake, j. k., pek, j., & hehman, e. (2017). construct validation in social and personality research: current practice and recommendations. social psychological and personality science, 8(4), 370–378. https://doi.org/10.1177/1948550617693063 fried, e. i. (2017). the 52 symptoms of major depression: lack of content overlap among seven common depression scales. journal of affective disorders, 208, 191–197. https://doi.org/10.1016/j.jad.2016.10.019 fried, e. i., & flake, j. k. (2018). measurement matters. observer. https://www.psychologicalscience.org/observer/measurement-matters gawronski, b., deutsch, r., & banse, r. (2011). response interference tasks as indirect measures of automatic associations. cognitive methods in social psychology (pp. 78–123). the guilford press. gelman, a., & loken, e. (2013). the garden of forking paths: why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. https://doi.org/10.1037/a0037714 goodhew, s. c., & edwards, m. (2019). translating experimental paradigms into individual-differences research: contributions, challenges, and practical recommendations. consciousness and cognition, 69, 14–25. https://doi.org/10.1016/j.concog.2019.01.008 haines, n. (2019). thinking generatively: why do we use atheoretical statistical models to test substantive psychological theories? http://haines-lab.com/post/thinking-generatively-why-do-we-use-atheoretical-statistical-models-to-test-substantive-psychological-theories/ haines, n., kvam, p. d., irving, l. h., smith, c., beauchaine, t. p., pitt, m. a., ahn, w.-y., & turner, b. (2020). theoretically informed generative models can advance the psychological and brain sciences: lessons from the reliability paradox (preprint). psyarxiv. https://doi.org/10.31234/osf.io/xr7y3 hedge, c., powell, g., & sumner, p. (2018). the reliability paradox: why robust cognitive tasks do not produce reliable individual differences. behavior research methods, 50(3), 1166–1186. https://doi.org/10.3758/s13428-017-0935-1 henry, l., & wickham, h. (2019). purrr: functional programming tools [r package version 0.3.3].
https://cran.r-project.org/package=purrr https://cran.r-project.org/package=gridextra https://cran.r-project.org/package=gridextra https://github.com/crsh/papaja https://github.com/crsh/papaja https://cran.r-project.org/package=tinylabels https://cran.r-project.org/package=tinylabels https://doi.org/10.1037/xge0000558 https://doi.org/10.1037/xge0000558 https://doi.org/10.1186/s40359-017-0210-3 https://doi.org/doi.org/10.1186/s40359-019-0342-8 https://doi.org/doi.org/10.1186/s40359-019-0342-8 https://doi.org/10.1371/journal.pone.0192298 https://doi.org/10.1371/journal.pone.0192298 https://doi.org/10.3389/fpsyg.2017.01482 https://doi.org/10.3389/fpsyg.2017.01482 https://doi.org/10.1177/2515245920965119 https://doi.org/10.1177/2515245920965119 https://doi.org/10.1177/1948550617693063 https://doi.org/10.1177/1948550617693063 https://doi.org/10.1016/j.jad.2016.10.019 https://doi.org/10.1016/j.jad.2016.10.019 https://www.psychologi%20calscience.org/observer/measurement-matters https://www.psychologi%20calscience.org/observer/measurement-matters https://www.psychologi%20calscience.org/observer/measurement-matters https://doi.org/dx.doi.org/10.1037/a0037714 https://doi.org/dx.doi.org/10.1037/a0037714 https://doi.org/10.1016/j.concog.2019.01.008 https://doi.org/10.1016/j.concog.2019.01.008 http://haines-lab.com/post/thinking-generatively-why-do-we-use-atheoretical-statistical-models-to-test-substantive-psychological-theories/ http://haines-lab.com/post/thinking-generatively-why-do-we-use-atheoretical-statistical-models-to-test-substantive-psychological-theories/ http://haines-lab.com/post/thinking-generatively-why-do-we-use-atheoretical-statistical-models-to-test-substantive-psychological-theories/ http://haines-lab.com/post/thinking-generatively-why-do-we-use-atheoretical-statistical-models-to-test-substantive-psychological-theories/ https://doi.org/10.31234/osf.io/xr7y3 https://doi.org/10.31234/osf.io/xr7y3 https://doi.org/10.3758/s13428-017-0935-1 
https://doi.org/10.3758/s13428-017-0935-1 https://cran.r-project.org/package=purrr 21 hussey, i., & hughes, s. (2018). hidden invalidity among fifteen commonly used measures in social and personality psychology [00000]. https: //doi.org/10.31234/osf.io/7rbfp jones, a., christiansen, p., & field, m. (2018). failed attempts to improve the reliability of the alcohol visual probe task following empirical recommendations. psychology of addictive behaviors, 32(8), 922–932. https : / / doi . org / 10 . 31234 / osf.io/4zsbm klau, s., hoffmann, s., patel, c. j., ioannidis, j. p., & boulesteix, a.-l. (2021). examining the robustness of observational associations to model, measurement and sampling uncertainty with the vibration of effects framework. international journal of epidemiology, 50(1), 266–278. https://doi.org/10.1093/ije/dyaa164 koo, t. k., & li, m. y. (2016). a guideline of selecting and reporting intraclass correlation coefficients for reliability research [arxiv: pmc4913118 publisher: elsevier b.v. isbn: 1556-3707]. journal of chiropractic medicine, 15(2), 155–163. https : / / doi . org / 10 . 1016 / j . jcm.2016.02.012 lakens, d., adolfi, f. g., albers, c. j., anvari, f., apps, m. a. j., argamon, s. e., baguley, t., becker, r. b., benning, s. d., bradford, d. e., buchanan, e. m., caldwell, a. r., van calster, b., carlsson, r., chen, s.-c., chung, b., colling, l. j., collins, g. s., crook, z., . . . zwaan, r. a. (2018). justify your alpha. nature human behaviour, 2(3), 168–171. https : / / doi . org / 10 . 1038 / s41562 018-0311-x leek, j. t., & peng, r. d. (2015). p values are just the tip of the iceberg. nature, 520, 612. https : / / doi.org/10.1038/520612a loken, e., & gelman, a. (2017). measurement error and the replication crisis. science, 355(6325), 584– 585. https://doi.org/10.1126/science.aal3618 macleod, c., mathews, a., & tata, p. (1986). attentional bias in emotional disorders. journal of abnormal psychology, 95(1), 15–20. 
https : / / doi.org/10.1037//0021-843x.95.1.15 müller, k., & wickham, h. (2019). tibble: simple data frames [r package version 2.1.3]. https : / / cran.r-project.org/package=tibble orben, a., & przybylski, a. k. (2019). the association between adolescent well-being and digital technology use. nature human behaviour, 3(2), 173–182. https : / / doi . org / 10 . 1038 / s41562 018-0506-1 parsons, s. (2021). splithalf: robust estimates of split half reliability. journal of open source software, 6(60), 3041. https://doi.org/10.21105/joss. 03041 parsons, s., kruijt, a.-w., & fox, e. (2019). psychological science needs a standard practice of reporting the reliability of cognitive-behavioral measurements. advances in methods and practices in psychological science, 2(4), 378–395. https://doi.org/10.1177/2515245919879695 pedersen, t. l. (2019). patchwork: the composer of plots [r package version 1.0.0]. https : / / cran . r project.org/package=patchwork price, r. b., kuckertz, j. m., siegle, g. j., ladouceur, c. d., silk, j. s., ryan, n. d., dahl, r. e., & amir, n. (2015). empirical recommendations for improving the stability of the dot-probe task in clinical research. psychological assessment, 27(2), 365–376. https : / / doi . org / 10 . 1037 / pas0000036 quintana, d. s., & heathers, j. (2019). a gps in the garden of forking paths (with amy orben). 10. 17605/osf.io/38kpe r core team. (2018). r: a language and environment for statistical computing. r foundation for statistical computing. vienna, austria. https://www. r-project.org/ revelle, w. (2019). psych: procedures for psychological, psychometric, and personality research [r package version 1.9.12]. northwestern university. evanston, illinois. https://cran.r-project.org/ package=psych rohrer, j. m., egloff, b., & schmukle, s. c. (2017). probing birth-order effects on narrow traits using specification-curve analysis. psychological science, 28(12), 1821–1832. https://doi.org/10. 1177/0956797617723726 rouder, j., & haaf, j. 
m. (2018). a psychometrics of individual differences in experimental tasks [00000]. https : / / doi . org / 10 . 31234 / osf. io / f3h2k rouder, j., kumar, a., & haaf, j. m. (2019). why most studies of individual differences with inhibition tasks are bound to fail [00000]. https : / / doi . org/10.31234/osf.io/3cjr5 roy, s., roy, c., éthier-majcher, c., fortin, i., belin, p., & gosselin, f. (2009). stoic: a database of dynamic and static faces expressing highly recognizable emotions, 15. http : / / mapageweb . umontreal.ca/gosselif/sroyetal_sub.pdf schmukle, s. c. (2005). unreliability of the dot probe task. european journal of personality, 19(7), 595–605. https://doi.org/10.1002/per.554 segerstrom, s. c., & boggero, i. a. (2020). expected estimation errors in studies of the cortisol awakhttps://doi.org/10.31234/osf.io/7rbfp https://doi.org/10.31234/osf.io/7rbfp https://doi.org/10.31234/osf.io/4zsbm https://doi.org/10.31234/osf.io/4zsbm https://doi.org/10.1093/ije/dyaa164 https://doi.org/10.1016/j.jcm.2016.02.012 https://doi.org/10.1016/j.jcm.2016.02.012 https://doi.org/10.1038/s41562-018-0311-x https://doi.org/10.1038/s41562-018-0311-x https://doi.org/10.1038/520612a https://doi.org/10.1038/520612a https://doi.org/10.1126/science.aal3618 https://doi.org/10.1037//0021-843x.95.1.15 https://doi.org/10.1037//0021-843x.95.1.15 https://cran.r-project.org/package=tibble https://cran.r-project.org/package=tibble https://doi.org/10.1038/s41562-018-0506-1 https://doi.org/10.1038/s41562-018-0506-1 https://doi.org/10.21105/joss.03041 https://doi.org/10.21105/joss.03041 https://doi.org/10.1177/2515245919879695 https://cran.r-project.org/package=patchwork https://cran.r-project.org/package=patchwork https://doi.org/10.1037/pas0000036 https://doi.org/10.1037/pas0000036 10.17605/osf.io/38kpe 10.17605/osf.io/38kpe https://www.r-project.org/ https://www.r-project.org/ https://cran.r-project.org/package=psych https://cran.r-project.org/package=psych 
https://doi.org/10.1177/0956797617723726 https://doi.org/10.1177/0956797617723726 https://doi.org/10.31234/osf.io/f3h2k https://doi.org/10.31234/osf.io/f3h2k https://doi.org/10.31234/osf.io/3cjr5 https://doi.org/10.31234/osf.io/3cjr5 http://mapageweb.umontreal.ca/gosselif/sroyetal_sub.pdf http://mapageweb.umontreal.ca/gosselif/sroyetal_sub.pdf https://doi.org/10.1002/per.554 22 ening response: a simulation. psychosomatic medicine, 82(8), 751–756. https://doi.org/10. 1097/psy.0000000000000850 silberzahn, r., uhlmann, e. l., martin, d. p., anselmi, p., aust, f., awtrey, e., bahník, š., bai, f., bannard, c., bonnier, e., carlsson, r., cheung, f., christensen, g., clay, r., craig, m. a., dalla rosa, a., dam, l., evans, m. h., flores cervantes, i., . . . nosek, b. a. (2018). many analysts, one data set: making transparent how variations in analytic choices affect results. advances in methods and practices in psychological science, 1(3), 337–356. https://doi.org/10. 1177/2515245917747646 simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant [03883]. psychological science, 22(11), 1359–1366. https : //doi.org/10.1177/0956797611417632 simonsohn, u., simmons, j. p., & nelson, l. d. (2015). specification curve: descriptive and inferential statistics on all reasonable specifications. ssrn electronic journal. https : / / doi . org / 10 . 2139/ssrn.2694998 spearman, c. (1904). the proof and measurement of association between two things. the american journal of psychology, 15(1), 72. https : / / doi . org/10.2307/1412159 staugaard, s. r. (2009). reliability of two versions of the dot-probe task using photographic faces. psychology science quarterly, 51(3), 339–350. steegen, s., tuerlinckx, f., gelman, a., & vanpaemel, w. (2016). increasing transparency through a multiverse analysis. perspectives on psychological science, 11(5), 702–712. https : / / doi . 
org / 10.1177/1745691616658637 sullivan-toole, h., haines, n., dale, k., & olino, t. m. (2021). enhancing the psychometric properties of the iowa gambling task using full generative modeling (preprint). psyarxiv. https://doi.org/ 10.31234/osf.io/yxbjz urbanek, s., & horner, j. (2019). cairo: r graphics device using cairo graphics library for creating highquality bitmap (png, jpeg, tiff), vector (pdf, svg, postscript) and display (x11 and win32) output [r package version 1.5-10]. https://cran.rproject.org/package=cairo vazire, s. (2018). implications of the credibility revolution for productivity, creativity, and progress. perspectives on psychological science, 13(4), 411–417. https : / / doi . org / https : / / doi . org / 10.1177/1745691617751884 von bastian, c. c., blais, c., brewer, g. a., gyurkovics, m., hedge, c., kałamała, p., meier, m. e., oberauer, k., rey-mermet, a., rouder, j. n., souza, a. s., bartsch, l. m., conway, a. r. a., draheim, c., engle, r. w., friedman, n. p., frischkorn, g. t., gustavson, d. e., koch, i., . . . wiemers, e. a. (2020). advancing the understanding of individual differences in attentional control: theoretical, methodological, and analytical considerations (preprint). psyarxiv. https://doi.org/10. 31234/osf.io/x3b9k wickham, h. (2016). ggplot2: elegant graphics for data analysis. springer-verlag new york. https : / / ggplot2.tidyverse.org wickham, h. (2019a). forcats: tools for working with categorical variables (factors) [r package version 0.4.0]. https : / / cran . r project . org / package=forcats wickham, h. (2019b). stringr: simple, consistent wrappers for common string operations [r package version 1.4.0]. https : / / cran . r project . org / package=stringr wickham, h., averick, m., bryan, j., chang, w., mcgowan, l. d., franã§ois, r., grolemund, g., hayes, a., henry, l., hester, j., kuhn, m., pedersen, t. l., miller, e., bache, s. m., müller, k., ooms, j., robinson, d., seidel, d. p., spinu, v., . . . yutani, h. (2019). 
welcome to the tidyverse. journal of open source software, 4(43), 1686. https://doi.org/10.21105/joss.01686 wickham, h., françois, r., henry, l., & müller, k. (2019). dplyr: a grammar of data manipulation [r package version 0.8.3]. https : / / cran . r project.org/package=dplyr wickham, h., & henry, l. (2019). tidyr: tidy messy data [r package version 1.0.0]. https : / / cran . r project.org/package=tidyr wickham, h., hester, j., & francois, r. (2018). readr: read rectangular text data [r package version 1.3.1]. https://cran.r-project.org/package= readr wiernik, b. m., & dahlke, j. a. (2020). obtaining unbiased results in meta-analysis: the importance of correcting for statistical artifacts. advances in methods and practices in psychological science. https : / / doi . org / 10 . 1177 / 2515245919885611 zuo, x.-n., xu, t., & milham, m. p. (2019). harnessing reliability for neuroscience research [00000]. nature human behaviour. https://doi.org/10. 1038/s41562-019-0655-x https://doi.org/10.1097/psy.0000000000000850 https://doi.org/10.1097/psy.0000000000000850 https://doi.org/10.1177/2515245917747646 https://doi.org/10.1177/2515245917747646 https://doi.org/10.1177/0956797611417632 https://doi.org/10.1177/0956797611417632 https://doi.org/10.2139/ssrn.2694998 https://doi.org/10.2139/ssrn.2694998 https://doi.org/10.2307/1412159 https://doi.org/10.2307/1412159 https://doi.org/10.1177/1745691616658637 https://doi.org/10.1177/1745691616658637 https://doi.org/10.31234/osf.io/yxbjz https://doi.org/10.31234/osf.io/yxbjz https://cran.r-project.org/package=cairo https://cran.r-project.org/package=cairo https://doi.org/https://doi.org/10.1177/1745691617751884 https://doi.org/https://doi.org/10.1177/1745691617751884 https://doi.org/10.31234/osf.io/x3b9k https://doi.org/10.31234/osf.io/x3b9k https://ggplot2.tidyverse.org https://ggplot2.tidyverse.org https://cran.r-project.org/package=forcats https://cran.r-project.org/package=forcats https://cran.r-project.org/package=stringr 
Meta-Psychology, 2022, vol 6, MP.2020.2595, https://doi.org/10.15626/mp.2020.2595
Article type: Tutorial
Published under the CC-BY4.0 license
Open data: Not applicable
Open materials: Yes
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: No
Edited by: Rickard Carlsson
Reviewed by: Sacha Epskamp, Ulrich Schimmack
Analysis reproduced by: Marco Lauriola
All supplementary files can be accessed at the OSF project page: https://doi.org/10.17605/osf.io/h9dv5

A Tutorial in Longitudinal Measurement Invariance and Cross-Lagged Panel Models Using lavaan

Sean P. MacKinnon, Dalhousie University
Robin Curtis, Dalhousie University
Roisin M. O'Connor, Concordia University

In longitudinal studies involving multiple latent variables, researchers often seek to test how latent variables measured at earlier time points predict the same latent variables measured at later time points. Cross-lagged panel modeling, a form of structural equation modeling, is a useful way to conceptualize and test these relationships. However, prior to making causal claims, researchers must first ensure that the measured constructs are equivalent between time points. To do this, they test for measurement invariance, constructing and comparing a series of increasingly strict and parsimonious models, each placing more constraints across time than the last. This comparison process, though challenging, is an important prerequisite to interpreting the results.
Fortunately, testing for measurement invariance in cross-lagged panel models has become easier thanks to the wide availability of R and its packages. This paper serves as a tutorial in testing for measurement invariance and cross-lagged panel models using the lavaan package. Using real data from an openly available study on perfectionism and drinking problems, we provide a step-by-step guide to testing for longitudinal measurement invariance, conducting cross-lagged panel models, and interpreting the results. Original data source with materials: https://osf.io/gduy4/. Project website with data/syntax for the tutorial: https://osf.io/hwkem/.

Keywords: cross-lagged panel; lavaan; measurement invariance; R; tutorial; perfectionism; social anxiety

The proliferation of R as a free and versatile programming language and analytic tool, coupled with the increasing power of modern computers, has made possible a great range of new statistical tests for students and professionals across varied disciplines. However, the learning curve for R is steep, and many statistical topics are so specialized that they lack coherent step-by-step guides with accompanying syntax. This paper and its accompanying OSF page (https://osf.io/hwkem/) demonstrate the process of selecting, running, and evaluating cross-lagged panel models with the lavaan package in R, using real data from an open-access source. While our target audience for this paper is graduate students with a basic understanding of R, confirmatory factor analysis, and structural equation modelling, we seek to present the material in such a way that parts of it may be useful to researchers at a range of levels. Readers unfamiliar with structural equation modelling might start with Ullman (2006), which is a relatively accessible introduction.

Measurement Invariance

Much of the time in psychology, we do not measure the construct of interest directly, but rather infer it through a series of items associated with it.
Such constructs are called latent variables; examples include drinking motives (MacKinnon et al., 2017) and perfectionism (Rice, Loscalzo, Giannini, & Rice, 2020).[1] Confirmatory factor analysis (CFA) is a statistical technique that allows us to test whether clusters of items in our measure are indeed reflective of the latent construct to which we have assigned them. When studying constructs over time, we administer the same measurement instruments repeatedly. To make logical claims about how latent variables change across time, we must first establish that our instruments are measuring the construct consistently over time. Measurement invariance (MI) is upheld in a study when "participants across all [time periods] interpret the individual questions, as well as the underlying latent factor, in the same way" (van de Schoot et al., 2012, pp. 1-2). If MI is not upheld, then the nature of the latent construct changes over time, making comparisons across measurement occasions difficult. Since the proposal of MI as a concept thirty years ago (Byrne, Shavelson, & Muthén, 1989), researchers have considered MI an important quality to check for in longitudinal studies incorporating latent variables.

In simpler terms, a fundamental problem in longitudinal measurement is that the mere passage of time (or the act of observing one's own thoughts through repeated measurement) can sometimes change how people interpret questionnaire items. To make comparisons over time, we want MI to ensure that the nature of the construct has not changed substantially. To use CFA to test for MI, researchers first set up a set of nested model comparisons: essentially, a series of CFA models with increasingly strict constraints on equality over time. Stricter, more parsimonious models allow fewer parameters to vary over time for the same latent construct.
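These nested equality constraints are imposed in lavaan by reusing parameter labels across waves. A hypothetical two-wave, three-item sketch (placeholder names `f` and `x1`-`x3`, not the tutorial's variables; identification is simplified here, relying on lavaan's default marker method of fixing the first loading to 1, whereas the tutorial's own models instead fix latent variances to 1):

```r
# All loadings free at each wave: only the factor structure is shared
# (the least constrained, configural-style model)
unconstrained <- '
  f.t1 =~ x1.t1 + x2.t1 + x3.t1
  f.t2 =~ x1.t2 + x2.t2 + x3.t2
'

# Reusing the labels L2 and L3 forces each loading to take the same
# value at both waves (an equality constraint over time)
constrained <- '
  f.t1 =~ x1.t1 + L2*x2.t1 + L3*x3.t1
  f.t2 =~ x1.t2 + L2*x2.t2 + L3*x3.t2
'
```

Stricter models add further shared labels (e.g., on intercepts via `x2.t1 ~ i2*1`, or on residual variances via `x2.t1 ~~ r2*x2.t1`), so each model is nested within the previous one.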
Broadly speaking, the parameters we are concerned with are: 1) factor loadings, which show how representative each item is of its latent factor; 2) intercepts, which relate to the mean levels of each item; and 3) residual variances, which represent the other, unexplained influences on item responses besides the latent variables. The most parsimonious model that maintains adequate CFA fit indices determines the level of invariance. We can further subdivide MI according to four levels (Widaman & Reise, 1997; Widaman et al., 2010).

The least stringent level of invariance, referred to as configural invariance, allows factor loadings, item intercepts, and residual variances to vary across waves. This establishes that the same factor structure applies across waves (i.e., the same number of latent variables, with the same items loading on each factor). The next level is metric - sometimes called weak[2] - invariance, which constrains factor loadings to equality across waves. This establishes that items do not become more (or less) representative of the latent construct at different measurement occasions. That is, as factor loadings get larger, they are stronger indicators of the latent variable; metric invariance proposes that items do not vary in how representative they are of the construct over time. The following level is scalar invariance, which constrains not only factor loadings but also item intercepts to equality across waves. Constraining item intercepts to equality establishes that the mean levels of the underlying items themselves do not vary significantly between time periods. That is, if scalar invariance is violated, the interpretation of the absolute value of a score changes as time goes on. It is analogous to how $100 today does not have the same value as it did 100 years ago. Using a psychological example, a lack of scalar invariance might make it appear that means are increasing over time when really participants are changing how they interpret the response scale over time. The final level is residual invariance, which constrains factor loadings, item intercepts, and residual variances to equality across waves. Residual variance represents the degree to which the model deviates from the actual data due to external factors. Fixing residual item variances to equality across waves therefore establishes that any external factors (e.g., unmeasured variables that change over time and predict variation in the measured latent construct) also display minimal change over time (van de Schoot et al., 2012).

[1] These two papers also serve as good templates for reporting measurement invariance results for a beginning learner.
[2] Terminology for model strictness can vary. In this paper, we follow van de Schoot et al. (2012) in using the terms "metric," "scalar," and "residual." On the other hand, Widaman et al. (2012) and Wu et al. (2007) opt for the terms "weak," "strong," and "strict" in referring to the same concepts.

Why does this matter? Without accounting for MI, researchers may misinterpret the causes behind observed effects. For instance, in an investigation of student narcissism levels from the 1990s to the 2010s, Wetzel et al. (2017) examined MI to check the validity of prior findings suggesting that narcissism is increasing among today's youth (Twenge & Campbell, 2009). In a three-wave model incorporating responses from more than 50,000 students, Wetzel et al. found nonequivalence in several aspects of the measurement of narcissism on the Narcissistic Personality Inventory.
Specifically, facets of leadership and vanity were not invariant, suggesting that students' interpretations of questions pertaining to these aspects were changing over time, a feature of the data not previously acknowledged in Twenge and Campbell's (2009) study. When accounting for this partial nonequivalence, their model actually suggested a decrease in narcissism over time. Checking for MI is thus an important practice prior to interpreting longitudinal results.

Cross-Lagged Panel Models

Another important aspect of longitudinal studies is directional effects between variables over time. Researchers can use cross-lagged panel models (CLPM) to investigate how well different variables predict future iterations of each other, helping to make stronger causal claims by establishing temporal precedence (Cole & Maxwell, 2003). Results can sometimes help clarify the direction of relationships in a way cross-sectional correlations cannot. For example, MacKinnon (2012) found a small positive correlation between perceived social support and school grades cross-sectionally; however, a cross-lagged panel model suggested that higher grades led to more social support, rather than the reverse, contrary to common belief. This study used a three-wave design over several years, but diary studies using more waves over shorter time periods are also common (e.g., Sherry & Hall, 2009). Cross-lagged panel models are one attempt to make stronger causal claims with longitudinal data. However, it is important to note that cross-lagged panel models were criticized quite early on (e.g., Rogosa, 1980) and more recently by Hamaker, Kuiper, and Grasman (2015).

[3] Readers should also note that the random intercepts cross-lagged panel model requires a minimum of three measurement occasions, unlike the traditional cross-lagged panel model, which is identifiable with only two.

The crux of the criticism
is that the traditional cross-lagged panel model does not properly disentangle within-person processes (e.g., state-like, day-to-day change) from between-person processes (e.g., trait-like stability from day to day). As a result, traditional cross-lagged panel models can produce incorrect results for statistical significance, for which relationship is larger, and even for the sign/direction of the relationship (Hamaker et al., 2015)! As a potential solution to these issues, Hamaker et al. (2015; see also Mulder & Hamaker, 2020, for the extension to multiple indicators, as used in this paper) introduced the random intercepts cross-lagged panel model, which properly accounts for the stable, trait-like nature of many constructs. Thus, though the bulk of the paper will focus on the traditional cross-lagged panel model, we also present code and interpretation for a random intercepts cross-lagged panel model.[3]

MI in Cross-Lagged Panel Models

When testing for MI in CLPMs, model complexity and the number of participants can also have an impact on interpretations. The greater the number of waves and items, the more complex the model will be, and the greater the number of participants needed to facilitate reliable testing. This is true of structural equation models in general (Kyriazos, 2018). The configural model estimates the greatest number of parameters and is thus the least parsimonious model.

To simplify the problem of longitudinal studies slightly: in each wave, participants' actual scores will differ somewhat from the model's predictions, but the same participants are responding to the measure each time, creating non-independence of observations across waves. We model this non-independence as a covariance between the residuals of the same items among waves (see the annotated syntax file that accompanies this paper). It is a reasonable a priori assumption to expect the magnitude of covariances between residuals across waves to be similar.
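In lavaan syntax, this correlated error structure is written with the `~~` operator between the same item at different waves. A hypothetical fragment for one PSP item across days 7-9 (the shared label `rc1` additionally fixes the two covariances to equality, in line with the similar-magnitude assumption; the label name is illustrative):

```r
# Residual covariances between the same item at adjacent waves;
# reusing the label rc1 constrains their magnitudes to be equal
resid_cov_fragment <- '
  psp1.7 ~~ rc1*psp1.8
  psp1.8 ~~ rc1*psp1.9
'
```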
It is common for researchers to fix residual covariances to equality across waves.[4] Fixing these values to equality is thus often theoretically justified, and it helps reduce the total number of parameters estimated by the configural model (Cole & Maxwell, 2003).[5]

[4] Another common set of constraints used in longitudinal data is a first-order autoregressive, or AR(1), correlated error structure; that is, a set of constraints predicting that the covariances will get smaller as the time lags increase (i.e., constructs measured more closely in time should be more strongly related).

Prior to setting up and testing the final structural equation model, it is useful to map out the predicted relationships between variables as figures. The dataset used in this study derives from a 21-day diary study investigating perfectionism, motivations for drinking, and alcohol-related problems. It is open-access, and all data are free to download at https://osf.io/hwkem/. For the purpose of this paper, our key latent variables of interest are: 1) perfectionistic self-presentation (PSP), which (as operationalized in these data) measures an individual's desire to hide their imperfections; and 2) state social anxiety (SSA), which measures transitory feelings of anxiety associated with social situations. PSP was first proposed as an aspect of perfectionism by Hewitt et al. (2003), whereas SSA was proposed as a measure of social anxiety by Kashdan and Steger (2006). For a recent study using these data that discusses the relation between PSP and SSA in more detail, please see Kehayes and MacKinnon (2019). To make the example easier to follow, we focus on only 5 of the 20 days on which perfectionistic self-presentation and social anxiety were measured (arbitrarily, days 7-11). Thus, in the present example we first (a) establish longitudinal measurement invariance over 5 days and then (b) test a cross-lagged panel model.
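As a preview of step (b), the directional part of a cross-lagged panel model is written with the regression operator (`~`) rather than covariances. A hypothetical two-day fragment using the study's latent variable names (a sketch only, not the tutorial's full model):

```r
# Each day-8 latent is regressed on both day-7 latents:
# psp.8 ~ psp.7 is an autoregressive path; psp.8 ~ ssa.7 is cross-lagged
clpm_fragment <- '
  psp.8 ~ psp.7 + ssa.7
  ssa.8 ~ ssa.7 + psp.7

  # same-day covariances between the two constructs
  psp.7 ~~ ssa.7
  psp.8 ~~ ssa.8
'
```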
in general, think of the measurement invariance portion as a necessary first step before proceeding to hypothesis testing in the cross-lagged panel model. though theory suggests that perfectionism would cause social anxiety rather than the reverse, we do not concern ourselves with formally testing confirmatory hypotheses in this paper, even though the cross-lagged panel model would be the place where hypotheses about the directionality of relationships are formally tested in a traditional paper. instead, we focus on the technical, analytical aspects, with the goal of teaching readers how to conduct the analysis.

footnote 5: note that measurement invariance between independent groups using multi-group modelling (e.g., comparing men, women, and nonbinary groups) is comparatively simpler than measurement invariance in the longitudinal context because of this correlated error structure. readers interested in measurement invariance for independent groups can take advantage of the measurementinvariance() function in lavaan for convenience functions that are much shorter than the code in this tutorial (https://lavaan.ugent.be/tutorial/groups.html), even though the core principles are the same.

method

dataset

our study uses a simplified version of the dataset published by mackinnon et al. (2021). we first trimmed and reformatted the dataset to contain only the variables of interest (see appendix a). the abridged dataset contains responses given by 251 participants for two latent variables (psp, composed of three items; and ssa, composed of seven items) across five days (days seven through eleven of the study). psp items were measured using a 7-point scale from 1 to 7. ssa items were measured using a 5-point scale from 0 to 4. beyond trimming the dataset to the variables of interest, we also converted the data from long format (in which every data point receives a unique row, with participants and categorical variables recurring across rows) to wide format (in which every participant receives a unique row and each variable receives a unique column); for a simple illustration of this pertaining to ssa and psp values across days, see figure 1. this format conversion aided in setting up the code for our later models.

figure 1. examples of wide vs. long format in psp and ssa values across two days

data analysis strategy

this section describes our strategy for comparing and selecting models. our goal was to compare nested versions of our clpm, using cfa, to determine the most appropriate parameters for our final structural model. we sought to use the simplest model (i.e., the model estimating the fewest parameters) that maintained a good fit to our data while also making good theoretical sense. because lavaan allows undefined parameters to vary freely by default, simpler models are those with more constraints defined; thus, somewhat counterintuitively, more parsimonious models appear more complicated in the code.

in constructing our models in lavaan, we relied on four key operators: 1) =~ , which is used for factor loadings and can be thought of as “is measured by”; 2) ~ , which is used for regression formulas and can be thought of as “is regressed on”; 3) ~~ , which is used for defining variances and residual covariances and can be thought of as “varies with”; and 4) ~ 1 , which is a special notation for defining intercepts. within any given formula, labels may be assigned to terms using the asterisk (*); any items to which the same label is applied are fixed to equality in the model’s calculations. to better understand the full definitions of each model below, we recommend referring to figures 2-5, which show simplified versions of each model with their respective constraints. it may also be useful to simultaneously follow along with our annotated code on our osf page.
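the four operators and the labelling mechanism can be illustrated together in a small toy model string. this is a sketch of the syntax only; the variable names (f, y, x1-x3) and the label b are placeholders, not taken from the tutorial's models:

```r
toy.model <- '
  # =~ : "is measured by" (factor loadings); the repeated label b
  #      fixes the x2 and x3 loadings to equality
  f =~ x1 + b*x2 + b*x3
  # ~  : "is regressed on" (regression)
  y ~ f
  # ~~ : "varies with" (variances and residual covariances)
  x1 ~~ x2
  # ~ 1 : intercept of x1
  x1 ~ 1
'
```

because the model is just a character string, constraints are added or removed simply by editing the string before passing it to a fitting function; this is why the more constrained (more parsimonious) models look longer in the code.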
our approach involves five steps in total: (a) configural model; (b) metric model; (c) scalar model; (d) residual model; and (e) structural model. notably, in the first four steps we use covariances (~~) rather than regressions (~) for relationships between variables. in the fifth step, we fit a true cross-lagged panel model in which temporal directionality of relationships is assumed (e.g., day 7 predicting day 8). this choice has no impact on the overall fit indices, but it is worth noting that the relationships between variables in the first four steps are more akin to bivariate correlations (albeit corrected for measurement unreliability), while the last step is the true test of hypotheses, with paths allowing for stronger causal inferences than correlations by adjusting for past-day levels of each variable.

configural model. we began by defining our configural model. excluding the correlated error structure, the full configural model is as follows:

configural.v1 <- '
# psp factor loadings defined
psp.7  =~ NA*psp1.7  + psp2.7  + psp3.7
psp.8  =~ NA*psp1.8  + psp2.8  + psp3.8
psp.9  =~ NA*psp1.9  + psp2.9  + psp3.9
psp.10 =~ NA*psp1.10 + psp2.10 + psp3.10
psp.11 =~ NA*psp1.11 + psp2.11 + psp3.11

# psp variance constrained to 1
psp.7  ~~ 1*psp.7
psp.8  ~~ 1*psp.8
psp.9  ~~ 1*psp.9
psp.10 ~~ 1*psp.10
psp.11 ~~ 1*psp.11

# ssa factor loadings defined
ssa.7  =~ NA*ssa1.7  + ssa2.7  + ssa3.7  + ssa4.7  + ssa5.7  + ssa6.7  + ssa7.7
ssa.8  =~ NA*ssa1.8  + ssa2.8  + ssa3.8  + ssa4.8  + ssa5.8  + ssa6.8  + ssa7.8
ssa.9  =~ NA*ssa1.9  + ssa2.9  + ssa3.9  + ssa4.9  + ssa5.9  + ssa6.9  + ssa7.9
ssa.10 =~ NA*ssa1.10 + ssa2.10 + ssa3.10 + ssa4.10 + ssa5.10 + ssa6.10 + ssa7.10
ssa.11 =~ NA*ssa1.11 + ssa2.11 + ssa3.11 + ssa4.11 + ssa5.11 + ssa6.11 + ssa7.11

# ssa variance constrained to 1
ssa.7  ~~ 1*ssa.7
ssa.8  ~~ 1*ssa.8
ssa.9  ~~ 1*ssa.9
ssa.10 ~~ 1*ssa.10
ssa.11 ~~ 1*ssa.11
'
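a model string such as configural.v1 is then passed to lavaan's cfa() function for estimation. the following is a hedged sketch of that step; the data-frame name dat and the missing = "fiml" choice are our assumptions, not the authors' exact code:

```r
library(lavaan)

# sketch: estimate the configural model and inspect global fit.
# 'dat' stands in for the wide-format data frame; full-information
# maximum likelihood is a common choice for diary data with missing days.
configural.model <- cfa(configural.v1, data = dat, missing = "fiml")
summary(configural.model, fit.measures = TRUE, standardized = TRUE)
fitMeasures(configural.model, c("chisq", "df", "cfi", "rmsea", "srmr"))
```

the same call pattern applies to the metric, scalar, residual, and structural models; only the model string changes.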
discussion

to recap, we used confirmatory factor analysis to test two related variables for measurement invariance across five waves, using four increasingly restrictive nested models: configural, metric, scalar, and residual. after selecting the residual model as the simplest that maintained good fit, we applied its constraints to a structural equation model, allowing us to quantify the cross-lagged relationships between our two variables across days.

interestingly, not all of our fit indices preferred the same model. the log-likelihood ratio tests preferred the metric model, as this test tends to prefer less parsimonious models. meanwhile, the cfi preferred the residual model based on cheung & rensvold’s (2002) criteria. aic preferred the scalar model, due to its higher parsimony coupled with good overall fit. bic, our final deciding criterion, placed a higher emphasis on parsimony and therefore preferred the residual model. there are two main take-away points here. first, the relatively good fit of all of our models across indices suggests that our theoretical reasons for investigating this relationship between variables were most likely sound. second, while researchers may wish to report multiple fit indices in their papers, it is important that they decide beforehand which index they will use when determining their final model.

it is worth noting that the structural models had poor fit based on the srmr index, which is probably due to constraining to zero some paths that still evince a positive relationship. if this were an a priori criterion used for assessing model fit in a research paper, you would need to investigate the source of this misfit further. remember, srmr differs from rmsea insofar as it has no penalty for model complexity; to the extent that you value parsimony in model selection, you might prefer rmsea as your selection tool instead.
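the comparisons described above can be run in a few lines; this is a sketch assuming fitted lavaan objects named configural.model, metric.model, scalar.model, and residual.model (our names, not necessarily the tutorial's):

```r
library(lavaan)

# likelihood-ratio tests of the nested models
# (this test tends to prefer less parsimonious models)
anova(configural.model, metric.model, scalar.model, residual.model)

# side-by-side fit indices for each model; a change in cfi of .01 or less
# between adjacent models is the cheung & rensvold (2002) criterion
sapply(list(configural = configural.model, metric = metric.model,
            scalar = scalar.model, residual = residual.model),
       fitMeasures, fit.measures = c("cfi", "aic", "bic", "rmsea", "srmr"))
```

deciding in advance which index will adjudicate (here, bic) avoids cherry-picking among indices that disagree, as they did in this example.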
in our case, the final model conformed to the strictest level of measurement invariance. however, in some studies this will not be the case. scalar invariance is typically sufficient for general data analysis, as it indicates that participants do not vary greatly across waves in the ways they interpret and answer questions. however, should only metric invariance be upheld, researchers must qualify any subsequent results by acknowledging that, although the latent factors load similarly across waves, individual interpretations of the items may change over time. for more examples of this, see steenkamp & baumgartner (1998).

it is also worth noting that this method for testing measurement invariance in clpms works best on models with a low to moderate number of waves. for a 20-day diary study, it may be more pragmatic for researchers to instead implement multi-level modelling techniques or multilevel structural equation modelling. though the present tutorial dataset is ill-suited to examining differences in latent means, it is worth noting that scalar invariance is often a preliminary step towards examining differences between means. that is, most substantive research questions on longitudinal data are not about measurement invariance per se, but rather about regression/covariance and mean differences over time. though longitudinal latent mean differences are beyond the scope of this tutorial, readers interested in learning more might read bishop, geiser, & cole (2015) for three approaches to modelling latent growth curves with multiple indicators. moreover, breitsohl (2019) is an excellent tutorial for converting common experimental designs (e.g., anova) to sem frameworks.

author contact

correspondence concerning this article should be addressed to sean p. mackinnon. email: mackinnon.sean@dal.ca. http://orcid.org/0000-0003-0921-9589

conflict of interest and funding

the authors have no conflict of interest to declare.
the data used for this tutorial were collected with the help of a social sciences and humanities research council insight development grant [#430-2016-00805].

author contributions

authors are listed in order of most to least contribution. sean mackinnon took the lead role in conceptualizing the tutorial, supervised robin’s work as part of his ph.d. comprehensives, edited the first draft of the manuscript and code, wrote original sections, created the figures, and analyzed the data with the random-intercepts clpm. robin curtis created the majority of the r syntax and the online appendices for our osf page, excluding the random-intercepts model. he also took a lead role in writing the first draft of the manuscript. roisin o’connor assisted with editing the writing in the manuscript and editing/reviewing the tutorial materials.

open science practices

this article earned the open materials badge for making the materials openly available. it is a tutorial that used data from a published study, and as such has no (new) collected data. it was not pre-registered. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

bishop, j., geiser, c., & cole, d. a. (2015). modeling latent growth with multiple indicators: a comparison of three approaches. psychological methods, 20(1), 43-62. https://doi.org/10.1037/met0000018

breitsohl, h. (2019). beyond anova: an introduction to structural equation models for experimental designs. organizational research methods, 22(3), 649-677. https://doi.org/10.1177/1094428118754988

byrne, b. m., shavelson, r. j., & muthen, b. o. (1989). testing for equivalence of factor covariance and mean structures: the issue of partial measurement invariance. psychological bulletin, 105(3), 456-466. https://doi.org/10.1037/0033-2909.105.3.456

cheung, g. w., & rensvold, r. b. (2002).
evaluating goodness-of-fit indexes for testing measurement invariance. structural equation modeling, 9(2), 233-255. https://doi.org/10.1207/s15328007sem0902_5

cole, d., & maxwell, s. (2003). testing mediational models with longitudinal data: questions and tips in the use of structural equation modeling. journal of abnormal psychology, 112(4), 558-577. https://doi.org/10.1037/0021-843x.112.4.558

hamaker, e. l., kuiper, r. m., & grasman, r. p. (2015). a critique of the cross-lagged panel model. psychological methods, 20(1), 102-116. https://doi.org/10.1037/a0038889

hewitt, p. l., flett, g. l., sherry, s. b., habke, m., parkin, m., lam, r. w., ... stein, m. b. (2003). the interpersonal expression of perfection: perfectionistic self-presentation and psychological distress. journal of personality and social psychology, 84(6), 1303-1325. https://doi.org/10.1037/0022-3514.84.6.1303

kashdan, t., & steger, m. (2006). expanding the topography of social anxiety: an experience-sampling assessment of positive emotions, positive events, and emotion suppression. psychological science, 17(2), 120-128. https://doi.org/10.1111/j.1467-9280.2006.01674.x

kehayes, i.-l. l., & mackinnon, s. p. (2019). investigating the relationship between perfectionistic self-presentation and social anxiety using daily diary methods: a replication. collabra: psychology, 5(1), 33. https://doi.org/10.1525/collabra.257

kenny, d. a. (2015). measuring model fit. http://www.davidakenny.net/cm/fit.htm

kyriazos, t. (2018). applied psychometrics: sample size and sample power considerations in factor analysis (efa, cfa) and sem in general. psychology, 9, 2207-2230. https://doi.org/10.4236/psych.2018.98126

lin, l. c., huang, p. h., & weng, l. j. (2017). selecting path models in sem: a comparison of model selection criteria. structural equation modeling, 24, 855-869. https://doi.org/10.1080/10705511.2017.1363652

mackinnon, s. (2012).
perceived social support and academic achievement: cross-lagged panel and bivariate growth curve analyses. journal of youth and adolescence, 41(4), 474-485. https://doi.org/10.1007/s10964-011-9691-1

mackinnon, s. p., couture, m.-e., cooper, m. l., kuntsche, e., o’connor, r. m., stewart, s. h., & the drinc team. (2017). cross-cultural comparisons of drinking motives in 10 countries: data from the drinc project. drug and alcohol review, 36, 721-730. https://doi.org/10.1111/dar.12464

mackinnon, s. p., ray, c. m., firth, s. m., & o’connor, r. m. (2021). data from “perfectionism, negative motives for drinking, and alcohol-related problems: a 21-day diary study”. journal of open psychology data, 9(1), 1-6. https://doi.org/10.5334/jopd.44

mulder, j. d., & hamaker, e. l. (2020). three extensions of the random intercept cross-lagged panel model. structural equation modeling, 1-11. https://doi.org/10.1080/10705511.2020.1784738

raftery, a. (1995). bayesian model selection in social research. sociological methodology, 25, 111-163. https://doi.org/10.2307/271063

rice, s. p., loscalzo, y., giannini, m., & rice, k. g. (2020). perfectionism in italy and the usa: measurement invariance and implications for cross-cultural assessment. european journal of psychological assessment, 36, 207-211. https://doi.org/10.1027/1015-5759/a000476

rogosa, d. (1980). a critique of cross-lagged correlation. psychological bulletin, 88(2), 245-258. https://doi.org/10.1037/0033-2909.88.2.245

sherry, s., & hall, p. (2009). the perfectionism model of binge eating: tests of an integrative model. journal of personality and social psychology, 96(3), 690-709. https://doi.org/10.1037/a0014528

steenkamp, j., & baumgartner, h. (1998). assessing measurement invariance in cross-national consumer research. journal of consumer research, 25(1), 78-107. https://doi.org/10.1086/209528

twenge, j. m., & campbell, w. k. (2009). the narcissism epidemic: living in the age of entitlement. new york, ny: atria.
ullman, j. b. (2006). structural equation modeling: reviewing the basics and moving forward. journal of personality assessment, 87(1), 35-50.

van de schoot, r., lugtig, p., & hox, j. (2012). a checklist for testing measurement invariance. european journal of developmental psychology, 9(4), 486-492. https://doi.org/10.1080/17405629.2012.686740

wetzel, e., brown, a., hill, p., chung, j., robins, r., & roberts, b. (2017). the narcissism epidemic is dead; long live the narcissism epidemic. psychological science, 28(12), 1833-1847. https://doi.org/10.1177/0956797617724208

widaman, k., ferrer, e., & conger, r. (2010). factorial invariance within longitudinal structural equation models: measuring the same construct across time. child development perspectives, 4(1), 10-18. https://doi.org/10.1111/j.1750-8606.2009.00110.x

widaman, k. f., & reise, s. p. (1997). exploring the measurement invariance of psychological instruments: applications in the substance use domain. in k. j. bryant, m. windle, & s. g. west (eds.), the science of prevention: methodological advances from alcohol and substance abuse research (pp. 281-324). washington, dc: american psychological association.
a tutorial in longitudinal measurement invariance and cross-lagged panel models using lavaan

appendix a: supplementary tables

table a1
factor loadings

item   configural              metric        scalar        residual
psp1   0.84-0.88 (1.62-1.75)   0.86 (1.68)   0.63 (1.68)   0.86 (1.68)
psp2   0.93-0.96 (1.89-2.08)   0.95 (1.99)   0.90 (1.99)   0.95 (1.98)
psp3   0.89-0.92 (1.73-1.96)   0.91 (1.87)   0.89 (1.87)   0.91 (1.86)
ssa1   0.85-0.90 (1.01-1.06)   0.87 (1.04)   0.87 (1.04)   0.87 (1.04)
ssa2   0.87-0.92 (1.11-1.18)   0.90 (1.14)   0.90 (1.14)   0.90 (1.14)
ssa3   0.89-0.93 (1.11-1.21)   0.91 (1.16)   0.90 (1.15)   0.90 (1.15)
ssa4   0.85-0.94 (1.03-1.12)   0.90 (1.13)   0.90 (1.13)   0.90 (1.13)
ssa5   0.85-0.91 (1.06-1.15)   0.88 (1.12)   0.88 (1.12)   0.88 (1.11)
ssa6   0.72-0.81 (0.87-0.99)   0.75 (0.93)   0.75 (0.93)   0.75 (0.92)
ssa7   0.60-0.67 (0.64-0.75)   0.62 (0.68)   0.62 (0.68)   0.63 (0.68)

note. values are formatted as “standardized (unstandardized).” value ranges (min-max) are provided for the configural model because factor loadings varied by day. for the latter three models, factor loadings were constrained to equality across days. however, for the metric and scalar models, standardized factor loadings still fluctuated very slightly across days because variances differed across days; this will be typical in most real data. arithmetic mean values are therefore provided for standardized scores in the metric and scalar models, for ease of presentation in the table. in the residual model, residual error was also constrained to equality across days, which resulted in no fluctuation of standardized scores.

table a2
unstandardized intercepts

item   configural   metric      scalar   residual
psp1   3.63-3.94    3.63-3.94   3.75     3.76
psp2   3.49-3.74    3.49-3.75   3.64     3.64
psp3   3.28-3.66    3.28-3.66   3.47     3.48
ssa1   1.53-1.64    1.53-1.64   1.60     1.61
ssa2   1.39-1.56    1.39-1.56   1.52     1.52
ssa3   1.31-1.53    1.31-1.53   1.44     1.44
ssa4   1.40-1.64    1.40-1.64   1.54     1.54
ssa5   1.31-1.47    1.32-1.47   1.41     1.41
ssa6   1.09-1.23    1.09-1.23   1.16     1.16
ssa7   0.97-1.10    0.97-1.10   1.01     1.02

note.
ranges are provided for the configural and metric models, for which intercepts varied by day. for the latter two models, intercepts were fixed across days.

table a3
unstandardized residual variances

item   configural   metric      scalar      residual
psp1   0.89-1.11    0.87-1.11   0.89-1.12   1.00
psp2   0.34-0.57    0.33-0.57   0.34-0.57   0.44
psp3   0.67-0.80    0.68-0.82   0.67-0.80   0.74
ssa1   0.26-0.38    0.26-0.38   0.26-0.38   0.33
ssa2   0.23-0.40    0.23-0.40   0.23-0.40   0.31
ssa3   0.22-0.35    0.22-0.35   0.22-0.36   0.30
ssa4   0.18-0.41    0.18-0.41   0.18-0.41   0.30
ssa5   0.27-0.44    0.26-0.44   0.27-0.44   0.37
ssa6   0.54-0.76    0.53-0.76   0.54-0.76   0.66
ssa7   0.68-0.77    0.68-0.76   0.68-0.77   0.73

note. in the residual model, residual error was constrained to equality across days. for all other models, the range of values across the five days is provided.

meta-psychology, 2019, vol 3, mp.2018.843
https://doi.org/10.15626/mp.2018.843
article type: original article
published under the cc-by4.0 license
open data: not relevant
open materials: not relevant
open and reproducible analysis: not relevant
open reviews and editorial process: yes
preregistration: not relevant
edited by: rickard carlsson
reviewed by: nuijten, m. & schimmack, u.
all supplementary files can be accessed at the osf project page: https://doi.org/10.17605/osf.io/q56e8

a brief guide to evaluate replications

etienne p. lebel, ku leuven
irene cheung, huron university college
wolf vanpaemel, ku leuven
lorne campbell, western university

abstract

the importance of replication is becoming increasingly appreciated; however, considerably less consensus exists about how to evaluate the design and results of replications. we make concrete recommendations on how to evaluate replications with more nuance than is currently typical in the literature.
we highlight six study characteristics that are crucial for evaluating replications: replication method similarity, replication differences, investigator independence, method/data transparency, analytic result reproducibility, and auxiliary hypotheses’ plausibility evidence. we also recommend a more nuanced approach to statistically interpreting replication results at the individual-study and meta-analytic levels, and propose clearer language to communicate replication results.

keywords: transparency, replicability, direct replication, evaluating replications, reproducibility

there is growing consensus in the psychology community regarding the fundamental scientific value and importance of replication. considerably less consensus, however, exists about how to evaluate the design and results of replication studies. in this article, we make concrete recommendations on how to evaluate replications with more nuance than is currently typical in the literature. these recommendations are intended to maximize the likelihood that replication results are interpreted in a fair and principled manner.

we propose a two-stage approach. the first stage involves considering and evaluating six crucial study characteristics (the first three specific to replication studies, the last three relevant for any study): (1) replication method similarity, (2) replication differences, (3) investigator independence, (4) method/data transparency, (5) analytic result reproducibility, and (6) auxiliary hypotheses’ plausibility evidence. second, and assuming sound study characteristics, we recommend more nuanced ways to interpret replication results at the individual-study and meta-analytic levels. finally, we propose the use of clearer and less ambiguous language to more effectively communicate the results of replication studies.
these recommendations are directly based on curating n = 1,127 replications (as of august 2018) available at curate science (curatescience.org), a web platform that organizes and tracks the transparency and replications of published findings in the social sciences (lebel, mccarthy, earp, elson, & vanpaemel, 2018). this is the largest known metascientific effort to evaluate and interpret replication results of studies across a wide and heterogeneous set of study types, designs, and methodologies.

author note: we thank the editor rickard carlsson and reviewers michèle nuijten and ulrich schimmack for valuable feedback on an earlier version of this article. we also thank chiel mues for copyediting our manuscript. correspondence concerning this article should be addressed to etienne p. lebel, quantitative psychology and individual differences unit, ku leuven, tiensestraat 102 box 3713, leuven, belgium, 3000. email: etienne.lebel@gmail.com

replication-specific study characteristics

when evaluating replication studies, the following three study characteristics are of crucial importance:

1. methodological similarity. a first aspect is whether a replication study employed a sufficiently similar methodology to the original study (i.e., at minimum, used the same operationalizations for the independent and dependent variables, as in “close replications”; lebel et al., 2018). this is required because only such replications can cast doubt upon an original hypothesis (assuming sound auxiliary hypotheses; see section below) and hence, in principle, falsify a hypothesis (lebel, berger, campbell, & loving, 2017; pashler & harris, 2012). studies that are not sufficiently similar can only speak to the generalizability, but not the replicability, of a phenomenon under study, and should therefore be treated as “generalizability studies” rather than “replication studies”.
such studies are sometimes called “conceptual replications”, but this is a misnomer given that it is more accurate to conceptualize such studies as “extensions” rather than replications (lebel et al., 2017; zwaan, etz, lucas, & donnellan, 2017).

2. replication differences. a second aspect to carefully consider is whether any study design characteristics differed from the comparison original study, whether those differences were within or beyond the researcher’s control (lebel et al., 2018). such differences are critical to consider because they help the community begin to understand the replicability and generalizability of an effect. consistent positive replication evidence across replications with minor design differences suggests an effect is likely robust across those design differences. on the other hand, for inconsistent replication evidence, such differences may provide initial clues regarding potential boundary conditions of an effect.

3. investigator independence. a final important consideration is the degree of independence between the replication investigators and the researchers who conducted the original study. this is important in order to mitigate the problem of “correlated investigators” (rosenthal, 1991), whereby non-independent investigators may be more susceptible to confirmation biases given their vested interest in an effect (although preregistration and other transparent practices can alleviate these issues; see next section).

general study characteristics

when evaluating studies in general, the following three study characteristics are important to consider.

1. study transparency. sufficient transparency is required to allow comprehensive scrutiny of how any study was conducted.
sufficient transparency means posting the experimental materials and underlying data in a readable format (e.g., with a codebook) on a public repository (the criteria for earning the open materials and open data badges, respectively; kidwell et al., 2016) and following the relevant reporting standards for the type of study and methodology used (e.g., the consort reporting standard for experimental studies; schulz, altman, & moher, 2010). if a study is not reported with sufficient transparency, it cannot be properly scrutinized. the findings from such a study are consequently of little value because the target hypothesis was not tested in a sufficiently falsifiable manner. preregistering a study (which publicly commits to data collection, processing, and analysis plans prior to data collection) offers even more transparency and limits researcher degrees of freedom (assuming that the preregistered procedure was actually followed).

2. analytic result reproducibility. for any study, it is also important to consider whether a study’s primary result (or set of results) is analytically reproducible. that is, whether a study’s primary result can be successfully reproduced (within a certain margin of error) from the raw or transformed data (this is contingent, of course, on the data actually being available, whether publicly, as in the case of “open data”, or otherwise). if analytic reproducibility is confirmed, then our confidence in a study’s reported results is boosted (and ideally results can also be confirmed to be robust across alternative justifiable data-analytic choices; steegen, tuerlinckx, gelman, & vanpaemel, 2016). if analytic reproducibility is not confirmed and/or discrepancies are detected, then our confidence should be reduced, and this should be taken into account when interpreting a study’s results.

3. auxiliary hypotheses.
finally, for any study, researchers should consider all available evidence regarding how plausible it is that the relevant auxiliary hypotheses, needed to test the substantive hypothesis at hand, were true (lebel et al., 2018). auxiliary hypotheses include, for example, the psychometric validity of the measuring instruments and the sound realization of experimental conditions (meehl, 1990). this can be done by examining reported evidence of positive controls, or evidence that a replication sample had the ability to detect some effect (e.g., replicating a past known effect; manipulation check evidence). these considerations are particularly crucial when interpreting null results, so that one can rule out more mundane reasons for not having detected a signal (e.g., fatal experimenter or data processing errors; though such fatal errors can also sometimes cause false positive results).

nuanced statistical interpretation and language

once these six study characteristics have been evaluated and taken into account, we recommend statistical approaches to interpreting the results of a replication study at the individual-study and meta-analytic levels that are more nuanced than what is currently typical. we then propose the use of clearer language to communicate replication results.

statistical interpretation: individual-study level. at the individual-study level, we recommend considering the following three distinct statistical aspects of a replication result: (1) whether a signal was detected, (2) the consistency of the replication effect size (es) relative to the original study es, and (3) the precision of the replication es estimate relative to the original study. such considerations yield the following replication outcome categories for the situation where an original study detected a signal (see figure 1, panel a, for visual depictions of these distinct scenarios) [1]:

1. signal – consistent: replication es 95% confidence interval (ci) excludes 0 and includes the original es point estimate (panel a replication scenario #1; e.g., chartier’s, 2015, reproducibility project: psychology [rpp] #31 replication result of mccrea’s, 2008, study 5; see table 1 in the appendix for details of chartier’s, 2015, rpp #31 replication and subsequently cited replication examples).

2. signal – inconsistent: replication es 95% ci excludes 0 but also excludes the original es point estimate. three sub-categorizations exist within this outcome category:
a. signal – inconsistent, larger (same direction): replication es is larger and in the same direction as the original es (panel a replication scenario #2; e.g., veer et al.’s, 2015, rpp #36 replication result of armor et al.’s, 2008, study 1).
b. signal – inconsistent, smaller (same direction): replication es is smaller and in the same direction as the original es (panel a replication scenario #3; e.g., ratliff’s, 2015, rpp #26 replication result of fischer et al.’s, 2008, study 4).
c. signal – inconsistent, opposite direction/pattern: replication es is in the opposite direction (or reflects an inconsistent pattern) relative to the original es direction/pattern (panel a replication scenario #4; e.g., earp et al.’s, 2014, study 3 replication result of zhong & liljenquist’s, 2006, study 2).

3. no signal – consistent: replication es 95% ci includes 0 but also includes the original es point estimate (panel a replication scenario #5; e.g., hull et al.’s, 2002, study 1b replication result of bargh et al.’s, 1996, study 2a).

4. no signal – inconsistent: replication es 95% ci includes 0 but excludes the original es point estimate (panel a replication scenario #6; e.g., lebel & campbell’s, 2013, study 1 replication result of vess’, 2012, study 1).

figure 1. distinct hypothetical outcomes of a replication study based on considering three statistical aspects of a replication result: (1) whether a signal was detected, (2) consistency of the replication effect size (es) relative to an original study, and (3) the precision of the replication es estimate relative to the es estimate precision in the original study. outcomes are separated for situations where an original study detected a signal (panel a) versus did not detect a signal (panel b).

in cases where a replication effect size estimate was less precise than the original (i.e., the replication es confidence interval is wider than the original's), which can occur when a replication uses a smaller sample size and/or when the replication sample exhibits higher variability, we propose the label “less precise” be used to warn readers that such a replication result should only be interpreted meta-analytically (panel a replication scenario #7; e.g., schuler & wanke’s, 2016, study 2 replication result of caruso et al.’s, 2013, study 2).

footnote 1: the es estimate precision of an original study is not currently accounted for because the vast majority of legacy-literature original studies do not report 95% cis (and cis most often cannot be calculated because insufficient information is reported). in the rare cases that cis are reported, they are typically so wide (given the underpowered nature of the legacy literature) that es estimates are not statistically falsifiable in practical terms. once it becomes the norm in the field to report highly precise es estimates, however, it will become possible and desirable to account for original study es estimate precision when statistically interpreting replication results.
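the signal/consistency logic behind these outcome categories reduces to two confidence-interval checks. the function below is our own minimal sketch of that logic (not code from the article), for the case where the original study detected a signal:

```r
# sketch: classify a replication outcome from the replication's 95% ci
# bounds and the original study's effect size point estimate.
# signal     = the replication ci excludes 0
# consistent = the replication ci includes the original point estimate
classify_replication <- function(rep_ci_lower, rep_ci_upper, orig_es) {
  signal     <- rep_ci_lower > 0 || rep_ci_upper < 0
  consistent <- rep_ci_lower <= orig_es && orig_es <= rep_ci_upper
  if (signal && consistent)  return("signal - consistent")
  if (signal && !consistent) return("signal - inconsistent")
  if (consistent)            return("no signal - consistent")
  "no signal - inconsistent"
}

classify_replication(0.10, 0.60, 0.45)   # "signal - consistent"
classify_replication(-0.20, 0.15, 0.45)  # "no signal - inconsistent"
```

the sub-categorizations (larger/smaller/opposite direction) and the "less precise" label would additionally compare the sign and magnitude of the replication es and the widths of the two intervals.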
in the situation where an original study did not detect a signal, such considerations yield the following replication outcome categories (see figure 1, panel b, for visual depictions of these distinct scenarios):

1. no signal – consistent: replication es 95% confidence interval (ci) includes 0 and includes the original es point estimate (panel b replication scenario #1; e.g., selterman et al.’s, 2015, rpp #29 replication result of eastwick & finkel’s, 2008, study 1).

2. no signal – consistent (less precise): replication es 95% ci includes 0 and includes the original es point estimate, but the replication es estimate is less precise than in the original study (panel b replication scenario #2; no replication is yet known to fall under this scenario).

3. signal – consistent: replication es 95% ci excludes 0 but includes the original es point estimate (panel b replication scenario #3; e.g., roebke & penna’s, 2015, rpp #76 replication result of couture et al.’s, 2008, study 1).

4. signal – inconsistent: replication es 95% ci excludes 0 and excludes the original es point estimate. two sub-categorizations exist within this outcome category:
a. signal – inconsistent, positive effect: replication es involves a positive effect (panel b replication scenario #4; e.g., cohn’s, 2015, rpp #45 replication result of ranganath & nosek’s, 2008, study 1).
b. signal – inconsistent, negative effect: replication es involves a negative effect (panel b replication scenario #5; no replication is yet known to fall under this scenario).

from this perspective, the proposed improved language to describe a replication study under replication scenario #6 would be: “we report a replication study of effect x.
no signal was detected and the effect size was inconsistent with the original one.” this terminology contrasts favorably with several ambiguous or unclear replication-related terms that are currently commonly used to describe replication results (e.g., “unsuccessful”, “failed”, “failure to replicate”, “non-replication”). the terms “unsuccessful” or “failed” (or “failure to replicate”) are ambiguous: was it the replication methodology or the replication result that was unsuccessful or failed (with similar logic applying to the ambiguous term “non-replication”)? the terms “unsuccessful” or “failed” are also problematic because of the implicit message that something was “wrong” with the replication. for example, though the “small telescopes” approach (simonsohn, 2015) was an improvement over the prior simplistic standard of considering a replication p < .05 as “successful” and p > .05 as “unsuccessful”, the approach nonetheless uses ambiguous language that does not actually describe a replication result (e.g., “uninformative” vs. “informative failure to replicate”). instead, the terminology we propose offers unambiguous and descriptively accurate language, stating both whether a signal was detected and the consistency of the replication es estimate relative to the original study. the proposed nuanced approach to statistically interpreting replication evidence thereby improves the clarity of the language used to describe and communicate replication results.

statistical interpretation: meta-analytic level.

interpreting the outcomes of a set of replication studies can proceed in two ways: an informal approach when only a few replications are available, and a more quantitative meta-analytic approach when several replications are available for a specific operationalization of an effect.
the informal approach considers whether replications consistently detect a signal with es estimates that are consistent (i.e., of similar magnitude) with the es point estimate from the original study (panel a replication scenario #1). under this situation, one could informally say that an effect is “replicable.” when several replications are available, a more quantitative meta-analytic approach can be taken: an effect can be considered “replicable” when the meta-analytic es estimate excludes zero and is consistent with the original es point estimate (also replication scenario #1, see figure 1, panel a; see also mathur & vanderweele, 2018).

conclusion

it is important to note that replicability should be seen as a minimum requirement for scientific progress rather than an arbiter of truth. replicability ensures that a research community avoids going down blind alleys chasing after anomalous results that emerged due to chance, noise, or other unknown errors. however, when adjudicating the replicability of an effect, it is important to keep in mind that an effect that does not appear to be replicable does not necessarily mean the tested hypothesis is false: it is always possible that an effect is replicable via alternative methods or operationalizations and/or that there were problems with some of the auxiliary hypotheses (e.g., invalid measurement or unclear instructions). this possibility, however, should not be exploited: eventually one must consider the value of continued testing of a hypothesis across different operationalizations and contexts. conversely, an effect that appears replicable does not necessarily mean the tested hypothesis is true: a replicable effect may not necessarily reflect a valid and/or generalizable effect (e.g., a replicable effect may simply reflect a measurement artifact and/or may not generalize to other methods, populations, or contexts).
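the individual-study classification used throughout this guide rests on three checks: whether the replication ci excludes zero (signal), whether it includes the original es point estimate (consistency), and whether it is wider than the original ci (precision). a minimal python sketch of that logic; the function name, interface, and hyphenated labels are my own choices for illustration (the article works from figure 1, not code):

```python
def classify_replication(orig_es, rep_ci, orig_ci=None):
    """classify a single replication result from its 95% ci.

    orig_es: original study's effect size point estimate.
    rep_ci:  (low, high) 95% ci of the replication effect size.
    orig_ci: optional (low, high) 95% ci of the original effect size,
             used only to flag a "less precise" replication.
    """
    lo, hi = rep_ci
    signal = "signal" if (lo > 0 or hi < 0) else "no signal"              # ci excludes 0?
    consistent = "consistent" if lo <= orig_es <= hi else "inconsistent"  # ci covers original es?
    label = f"{signal} - {consistent}"
    if orig_ci is not None and (hi - lo) > (orig_ci[1] - orig_ci[0]):
        label += " (less precise)"  # replication ci wider than original ci
    return label

# examples derived from the appendix table 1 rows (ci bounds = es plus/minus half-width):
print(classify_replication(0.34, (0.05, 0.53)))                 # mccrea replication -> "signal - consistent"
print(classify_replication(1.02, (-0.10, 1.16)))                # bargh replication -> "no signal - consistent"
print(classify_replication(0.43, (-0.48, 0.30), (0.13, 0.73)))  # caruso replication -> "no signal - inconsistent (less precise)"
```

the helper deliberately ignores sign of the original effect when judging "signal"; like the article, it only asks whether the replication interval excludes zero.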
the recommendations advocated in this article are based on curating over one thousand replications at curate science (as of august 2018). these recommendations have been applied to each of the replications in its database, including employing our suggested language to describe the outcome of each of its curated replications. it is expected, however, that these recommendations will evolve over time as additional replications, from an even wider set of studies, are curated and evaluated (indeed, as of september 2018, approximately 1,800 replications are in the queue to be curated at curate science). consequently, these recommendations should be seen as a starting point for the research community to more accurately evaluate replication results, as we gradually learn more sophisticated approaches to interpreting them. we hope that our proposed recommendations will be a stepping stone in this direction and consequently accelerate psychology’s path to becoming a more cumulative and valid science.

appendix

table 1. known published replication results that fall under the distinct hypothetical replication outcomes depicted in figure 1 (when available).

| target effect | original study | original es estimate (± 95% ci) | replication es estimate (± 95% ci) | replication study | replication outcome |
|---|---|---|---|---|---|
| signal detected in original study: | | | | | |
| math self-handicapping effect | mccrea (2008) study 5 | r = .34 ± .35 | r = .29 ± .24 | chartier (2015, rpp #31) | signal – consistent |
| prescribed optimism effect | armor, massey et al. (2008) study 1 | r = .68 ± .10 | r = .76 ± .06 | veer et al. (2015, rpp #36) | signal – inconsistent, larger |
| selective exposure information quantity effect | fischer, schulz-hardt et al. (2008) study 4 | r = .50 ± .21 | r = .22 ± .16 | ratliff (2015, rpp #26) | signal – inconsistent, smaller |
| macbeth effect | zhong & liljenquist (2006) study 2 | r = .45 ± .31 | r = -.11 ± .11 | earp et al. (2014) study 3 | signal – inconsistent, opposite |
| elderly priming effect | bargh et al. (1996) study 2a | d = 1.02 ± .76 | d = .53 ± .63 | hull et al. (2002) study 1b | no signal – consistent |
| anxious attachment warm food effect | vess (2012) study 1 | d = .60 ± .55 | d = .03 ± .27 | lebel & campbell (2013) study 1 | no signal – inconsistent |
| money priming effect | caruso et al. (2013) study 2 | d = .43 ± .30 | d = -.09 ± .39 | schuler & wänke (2016) study 2 | no signal – inconsistent (less precise) |
| no signal detected in original study: | | | | | |
| generalized earning prospect predicts romantic interest effect | eastwick & finkel (2008) study 1 | r = .14 ± .16 | r = .03 ± .11 | selterman, chagnon et al. (2015, rpp #29) | no signal – consistent |
| hebb repetition effect revisited | couture, lafond, & tremblay (2008) study 1 | r = .35 ± .38 | r = .27 ± .24 | roebke & penna (2015, rpp #76) | signal – consistent |
| implicit attitude generalization occurs immediately effect | ranganath & nosek (2008) study 1 | r = .00 ± .08 | r = .11 ± .04 | cohn (2015, rpp #45) | signal – inconsistent, larger |

references

armor, d. a., massey, c., & sackett, a. m. (2008). prescribed optimism: is it right to be wrong about the future? psychological science, 19, 329-331. doi:10.1111/j.1467-9280.2008.02089.x

bargh, j. a., chen, m., & burrows, l. (1996). automaticity of social behavior: direct effects of trait construct and stereotype activation on action. journal of personality and social psychology, 71(2), 230-244. doi:10.1037/0022-3514.71.2.230

chartier, c. r., & perna, o. (2015). replication of “self-handicapping, excuse making, and counterfactual thinking: consequences for self-esteem and future motivation” by s. m. mccrea (2008, journal of personality and social psychology). retrieved from https://osf.io/ytxgr/ (reproducibility project: psychology study #31)

cohn, m. a. (2015). replication of “implicit attitude generalization occurs immediately; explicit attitude generalization takes time” (ranganath & nosek, 2008).
retrieved from: https://osf.io/9xt25/ (reproducibility project: psychology study #45)

caruso, e. m., vohs, k. d., baxter, b., & waytz, a. (2013). mere exposure to money increases endorsement of free-market systems and social inequality. journal of experimental psychology: general, 142, 301-306. doi:10.1037/a0029288

couture, m., lafond, d., & tremblay, s. (2008). learning correct responses and errors in the hebb repetition effect: two faces of the same coin. journal of experimental psychology: learning, memory, and cognition, 34, 524-532. doi:10.1037/0278-7393.34.3.524

earp, b. d., everett, j. a. c., madva, e. n., & hamlin, j. k. (2014). out, damned spot: can the "macbeth effect" be replicated? basic and applied social psychology, 36, 91-98. doi:10.1080/01973533.2013.856792

eastwick, p. w., & finkel, e. j. (2008). sex differences in mate preferences revisited: do people know what they initially desire in a romantic partner? journal of personality and social psychology, 94, 245-264. doi:10.1037/0022-3514.94.2.245

fischer, p., schulz-hardt, s., & frey, d. (2008). selective exposure and information quantity: how different information quantities moderate decision makers' preference for consistent and inconsistent information. journal of personality and social psychology, 94, 231-244. doi:10.1037/0022-3514.94.2.231

hull, j., slone, l., meteyer, k., & matthews, a. (2002). the nonconsciousness of self-consciousness. journal of personality and social psychology, 83, 406-424. doi:10.1037/0022-3514.83.2.406

kidwell, m., lazarevic, l., baranski, e., hardwicke, t., piechowski, s., falkenberg, l., . . . nosek, b. (2016). badges to acknowledge open practices: a simple, low-cost, effective method for increasing transparency. plos biology, 14, e1002456. doi:10.1371/journal.pbio.1002456

lebel, e. p., & campbell, l. (2013). heightened sensitivity to temperature cues in individuals with high anxious attachment: real or elusive phenomenon? psychological science, 24, 2128-2130.
doi:10.1177/0956797613486983

lebel, e., berger, d., campbell, l., & loving, t. (2017). falsifiability is not optional. journal of personality and social psychology, 113, 696-696. doi:10.1037/pspi0000117

lebel, e. p., mccarthy, r., earp, b., elson, m., & vanpaemel, w. (2018). a unified framework to quantify the credibility of scientific findings. advances in methods and practices in psychological science, 1(3), 389-402.

mathur, m. b., & vanderweele, t. j. (2018, may 7). new statistical metrics for multisite replication projects [preprint]. https://doi.org/10.31219/osf.io/w89s5

mccrea, s. m. (2008). self-handicapping, excuse making, and counterfactual thinking: consequences for self-esteem and future motivation. journal of personality and social psychology, 95, 274-292. http://dx.doi.org/10.1037/0022-3514.95.2.274

meehl, p. e. (1990). why summaries of research on psychological theories are often uninterpretable. psychological reports, 66, 195-244. doi:10.2466/pr0.66.1.195-244

pashler, h., & harris, c. r. (2012). is the replicability crisis overblown? three arguments examined. perspectives on psychological science, 7, 531-536. doi:10.1177/1745691612463401

ranganath, k. a., & nosek, b. a. (2008). implicit attitude generalization occurs immediately; explicit attitude generalization takes time. psychological science, 19, 249-254. doi:10.1111/j.1467-9280.2008.02076.x

ratliff, k. a. (2015). replication of fischer, schulz-hardt, and frey (2008). retrieved from: https://osf.io/5afur/ (reproducibility project: psychology study #26)

roebke, m., & penna, n. d. (2015). replication of “learning correct responses and errors in the hebb repetition effect: two faces of the same coin” by m. couture, d. lafond, s. tremblay (2008, journal of experimental psychology: learning, memory, and cognition). retrieved from: https://osf.io/qm5n6/ (reproducibility project: psychology study #76)

rosenthal, r. (1991).
applied social research methods: meta-analytic procedures for social research. thousand oaks, ca: sage. doi:10.4135/9781412984997

schuler, j., & wänke, m. (2016). a fresh look on money priming: feeling privileged or not makes a difference. social psychological and personality science, 7, 366-373. doi:10.1177/1948550616628608

schulz, k. f., altman, d. g., moher, d., & the consort group. (2010). consort 2010 statement: updated guidelines for reporting parallel group randomised trials. bmj, 340, 698-702. doi:10.1136/bmj.c332

selterman, d. f., chagnon, e., & mackinnon, s. (2015). replication of “sex differences in mate preferences revisited: do people know what they initially desire in a romantic partner?” by paul eastwick & eli finkel (2008, journal of personality and social psychology). retrieved from: https://osf.io/5pjsn/ (reproducibility project: psychology study #29)

simonsohn, u. (2015). small telescopes: detectability and the evaluation of replication results. psychological science, 26, 559-569. http://dx.doi.org/10.1177/0956797614567341

steegen, s., tuerlinckx, f., gelman, a., & vanpaemel, w. (2016). increasing transparency through a multiverse analysis. perspectives on psychological science, 11, 702-712. doi:10.1177/1745691616658637

veer, a. v. t., lassetter, b., brandt, m. j., & mehta, p. h. (2015). replication of “prescribed optimism: is it right to be wrong about the future?” by david a. armor, cade massey & aaron m. sackett (2008, psychological science). retrieved from: https://osf.io/8u5v2/ (reproducibility project: psychology study #36)

vess, m. (2012). warm thoughts: attachment anxiety and sensitivity to temperature cues. psychological science, 23, 472-474. doi:10.1177/0956797611435919

zhong, c., & liljenquist, k. (2006). washing away your sins: threatened morality and physical cleansing. science, 313, 1451-1452.
doi:10.1126/science.1130726

zwaan, r., etz, a., lucas, r., & donnellan, m. (2017). making replication mainstream. behavioral and brain sciences, 1-50. doi:10.1017/s0140525x17001972

meta-psychology, 2019, vol 3, mp.2018.870, https://doi.org/10.15626/mp.2018.870
article type: commentary
published under the cc-by4.0 license
preprint: https://dx.doi.org/10.17605/osf.io/tpv4q
data and materials: not relevant
edited by: rickard carlsson
reviewed by: malte friese et al., julia rohrer, gary burns
peer review report: https://doi.org/10.17605/osf.io/u4prw
editorial history: https://doi.org/10.17605/osf.io/m8p4s

contemporary understanding of mediation testing

adrian meule
university of salzburg

this is a commentary about contemporary understanding of mediation testing. specifically, this commentary highlights that outdated concepts of mediation testing are still highly prevalent in the mindsets of researchers and that many researchers use software based on contemporary mediation testing wrongly, misinterpret results, or describe mediation in terms of outdated concepts while inappropriately referring to literature about contemporary concepts.

keywords: mediation; indirect effect; causal steps approach; bootstrap sampling; process

a common question in psychological research is how or through which mechanism(s) an effect occurs.
the most basic form of this question can be represented in a simple mediation model, which consists of an antecedent (or independent) variable (x) that is linked to a consequent (or dependent) variable (y) through an intermediary (or mediator) variable (m; figure 1). in 1986, baron and kenny published a seminal article in which they presented a mediation analysis method that is often referred to as the causal steps approach.1 according to this approach, “a variable functions as a mediator when it meets the following conditions: (a) variations in levels of the independent variable significantly account for variations in the presumed mediator (i.e., path a), (b) variations in the mediator significantly account for variations in the dependent variable (i.e., path b), and (c) when paths a and b are controlled, a previously significant relation between the independent and dependent variables is no longer significant, with the strongest demonstration of mediation occurring when path c is zero” (baron & kenny, 1986, p. 1176). thus, a mediation effect can only occur when there is a significant correlation between the independent and the dependent variable and when this relationship is no longer significant when controlling for the mediator. if the mediator does not entirely account for the association between the independent and the dependent variable, it partially mediates that effect. finally, baron and kenny (1986) recommend using the sobel test (sobel, 1982) as a formal significance test for the existence of a mediation effect. the popularity of the baron and kenny approach can be seen in that (as of this writing) their article has been cited more than 36,000 times according to web of science and more than 80,000 times according to google scholar. in 2018 alone, the article was cited more than 2,000 times according to web of science and more than 3,000 times according to google scholar.
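the causal steps conditions just quoted reduce to a simple decision rule over four p values. a minimal python sketch of that rule, for illustration only (the function name, p-value interface, and the .05 threshold are my own; baron and kenny describe conditions in prose, not code):

```python
ALPHA = 0.05  # conventional significance threshold, assumed for illustration

def causal_steps(p_a, p_b, p_c, p_c_prime):
    """baron & kenny (1986) causal steps decision rule, given p values for:
    path a (x -> m), path b (m -> y controlling for x),
    path c (x -> y), and path c' (x -> y controlling for m)."""
    # conditions (a)-(c): paths a, b, and c must all be significant
    if p_a >= ALPHA or p_b >= ALPHA or p_c >= ALPHA:
        return "no mediation"
    # if c' is no longer significant, the mediator fully accounts for the
    # x -> y relation ("full mediation"); otherwise only partially
    return "full mediation" if p_c_prime >= ALPHA else "partial mediation"
```

writing the rule out this way makes the feature criticized below explicit: a non-significant a, b, or c path short-circuits the procedure, so mediation is never tested.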
although the causal steps approach has intuitive appeal, it is now widely accepted that baron and kenny’s (1986) approach to mediation testing is obsolete (e.g., hayes, 2009, 2013, 2017; hayes & rockwood, 2017; mackinnon & fairchild, 2009; zhao, lynch, & chen, 2010). specifically, neither are significant relationships between the independent, mediator, and dependent variables a prerequisite for a mediation effect to be possible, nor is a significant reduction of the relationship between the independent and the dependent variable by the mediator necessary for inferring mediation. this understanding also renders the concept of “partial mediation” moot (e.g., hayes, 2013; hayes & rockwood, 2017; zhao, lynch, & chen, 2010). finally, inferential tests about mediation based on monte carlo or bootstrap confidence intervals rather than the sobel test are now recommended (hayes & scharkow, 2013).

author note: adrian meule, phd. department of psychology, university of salzburg, hellbrunner straße 34, 5020 salzburg, austria. e-mail: adrian.meule@sbg.ac.at.

1 it is worth pointing out here that this commentary is about statistical mediation testing and not about different study designs that allow for inferring causality. it has been argued that statistical mediation testing has little value when applied to, for example, cross-sectional data. instead, using cross-lagged repeated measures data or experimental manipulations provides a much stronger argument for the causal direction of effects (bullock, green, & ha, 2010; spencer, zanna, & fong, 2005; stone-romero & rosopa, 2008, 2011). while these issues are important to consider when testing for mediation statistically, they are beyond the scope of this commentary.

figure 1. conceptual diagram of a simple mediation model. x denotes the independent or antecedent variable. m denotes the mediator or intermediary variable. y denotes the dependent or consequent variable. path a represents the relationship between x and m.
path b represents the relationship between m and y when controlling for x. path c represents the relationship between x and y (total effect). path c’ represents the relationship between x and y when controlling for m (direct effect). the indirect effect is the product of a × b. the total effect is the sum of the direct and the indirect effect (c = c’ + a × b).

in current understanding of mediation testing, the relationship between the independent variable and the dependent variable is called the total effect (i.e., the c path). the relationship between the independent variable and the dependent variable while controlling for the mediator is called the direct effect (usually denoted as c’). the indirect (i.e., mediation) effect is the product of the a and b paths (i.e., the relationship [a] between the independent variable and the mediator and [b] between the mediator and the dependent variable when controlling for the independent variable). the total effect is the sum of the direct and indirect effect (total effect = direct effect + indirect effect, or c = c’ + a × b). thus, the statistical significance of both the total and the direct effect is irrelevant for the existence of an indirect effect. the indirect effect can be significant (1) in the absence of a direct effect (indirect-only mediation), (2) in addition to a significant direct effect of the same sign (complementary mediation), and even (3) in addition to a significant direct effect of the opposite sign (competitive mediation; zhao et al., 2010). in each case, the significance of the total effect is irrelevant for the presence of a direct or indirect effect. note the crucial differences between the causal steps and the contemporary approach. researchers who adhere to the baron and kenny thinking stop when at least one of the three paths (a, b, and c) is not significant.
that is, they do not test for mediation because the alleged assumptions to do so are not met (also note that many researchers falsely assume that the b path is the correlation between m and y, although it is the relation between m and y when controlling for x). in addition, situations like complementary and competitive mediation (i.e., a significant indirect effect in addition to a significant direct effect) have no place in the baron and kenny framework because a significant relationship between x and y when controlling for m (i.e., the direct effect) would indicate that there is no full mediation effect. finally, in contemporary thinking about mediation analysis, the indirect effect is either significant or not significant, regardless of the significance of the total effect. as there is, therefore, no need for an “effect to be mediated”, the concept of “partial mediation” is incompatible with the contemporary approach. most researchers find it difficult to imagine how an indirect effect in the absence of a total effect can be possible. as an example, let us consider the relationship between an impulsive personality and body weight. several mediation studies showed that higher impulsivity is associated with a higher body weight through eating behavior-related variables (meule, 2017). in other words, people who are more impulsive tend to overeat, which in turn relates to a higher body weight. however, higher impulsivity also relates to higher substance use (e.g., stanford et al., 2009). in turn, active substance users usually have a lower body weight than non-substance users (e.g., crossin, lawrence, andrews, & duncan, in press). thus, there may be several indirect effects of impulsivity on body weight with opposite signs (e.g., a positive indirect effect through eating behavior and a negative indirect effect through substance use), which cancel each other out, such that there is no significant total effect of impulsivity on body weight (meule, 2017).
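the cancellation scenario in the impulsivity example can be made concrete with a small simulation: two mediators carry indirect effects of opposite sign, so the total effect is near zero even though each indirect effect is not. a minimal python/numpy sketch (all coefficients and variable roles are illustrative, not estimates from the cited studies):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                        # e.g., impulsivity
m1 = 0.5 * x + rng.normal(size=n)             # e.g., overeating (a1 = +0.5)
m2 = 0.5 * x + rng.normal(size=n)             # e.g., substance use (a2 = +0.5)
y = 0.4 * m1 - 0.4 * m2 + rng.normal(size=n)  # e.g., body weight (b1 = +0.4, b2 = -0.4)

# total effect c: slope from regressing y on x alone
c = np.polyfit(x, y, 1)[0]

# a paths: slopes from regressing each mediator on x
a1 = np.polyfit(x, m1, 1)[0]
a2 = np.polyfit(x, m2, 1)[0]

# b paths and direct effect c': regress y on m1, m2, and x jointly
design = np.column_stack([m1, m2, x, np.ones(n)])
b1, b2, c_prime, _ = np.linalg.lstsq(design, y, rcond=None)[0]

ind1, ind2 = a1 * b1, a2 * b2  # indirect effects, roughly +0.2 and -0.2
# ols decomposition identity: c = c' + a1*b1 + a2*b2, so c is near zero here
```

the total effect c comes out near zero while each indirect effect is clearly nonzero, which is exactly the pattern the causal steps approach would wrongly read as "no mediation."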
among other tools, mediation analysis can be easily conducted using a macro called process (www.processmacro.org; hayes, 2013, 2017). in process, testing a simple mediation model is based on two linear regression analyses. first, the mediating variable is predicted by the independent variable (a). second, the dependent variable is predicted by both the mediating variable (b) and the independent variable (c’). the test for inferring whether there is a significant indirect effect (ab) is based on bootstrap confidence intervals. if the confidence interval does not include zero, then it is inferred that the indirect effect is significant, that is, that there is a mediation effect. demonstrating the popularity of process, a white paper about it (hayes, 2012) has been cited more than 3,000 times and hayes’ book (hayes, 2013, 2017), which is the official reference for process, has been cited more than 18,000 times according to google scholar. unfortunately, while many researchers use process, it seems that many of them are reluctant to abandon the causal steps logic. table 1 lists 20 of the more than 5,000 articles that cited hayes’ book in 2018 (according to google scholar). as can be seen from this compilation, researchers seem to be stuck in the baron and kenny thinking although they are using process or refer to its manual. this is actually mentioned by hayes himself on his faq page:

good academic practice is to cite something only if you have actually read it and are familiar with its content. i don’t recommend using process without familiarity with what it does, as described in the book. it may not be doing what you think it is doing. i have seen many instances of researchers reporting results from the output of process that are inconsistent with what process actually is doing. these mistakes are easily avoided by reading the documentation.
(http://www.processmacro.org/faq)

common mistakes include the following:

(1) highlighting the fact that the relationship between x and y was significant but was no longer significant when controlling for m, although both of these statements are irrelevant for the existence of an indirect effect when using process.

(2) stating that a variable partially mediated an effect when the direct effect was significant, although the concept of partial mediation is incompatible with what process does and with what hayes writes in its documentation.

(3) reporting results of a sobel test as an inferential test for mediation, although this is either redundant (because it will suggest the same conclusion as the bootstrap confidence intervals) or the sobel test is not significant because it has lower power than bootstrap sampling, which is why hayes (2013, 2017) and others (zhao et al., 2010) explicitly recommend the latter.

in conclusion, researchers still follow the causal steps logic, even when they are using programs that are not based on this logic or when they are referring to literature about contemporary mediation testing. the consequences of these fallacies include minor subtleties (e.g., writing about a direct effect when researchers actually want to refer to the total effect, or stating that “an effect was mediated by” a variable) but also redundancy (e.g., reporting results from a sobel test in addition to bootstrap confidence intervals), possibly unstable results (e.g., when using 1,000 bootstrap samples instead of the 5,000 recommended by hayes, 2013), prematurely concluding that there was no mediation effect (e.g., when not testing for mediation because one of the paths was not significant), and misinterpreting results (e.g., describing results as full vs. partial mediation).
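the bootstrap test described in this section can also be sketched outside of process. a minimal python/numpy sketch of a percentile bootstrap ci for the indirect effect a × b (the function name and interface are my own for illustration; this is not the process macro itself):

```python
import numpy as np

def bootstrap_indirect(x, m, y, n_boot=5_000, seed=0):
    """percentile bootstrap 95% ci for the indirect effect a*b:
    resample cases with replacement, refit path a (m regressed on x)
    and path b (y regressed on m, controlling for x), collect a*b."""
    rng = np.random.default_rng(seed)
    n = len(x)
    ab = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)           # resample cases with replacement
        a = np.polyfit(x[idx], m[idx], 1)[0]  # path a
        design = np.column_stack([m[idx], x[idx], np.ones(n)])
        b = np.linalg.lstsq(design, y[idx], rcond=None)[0][0]  # path b
        ab[i] = a * b
    return np.percentile(ab, [2.5, 97.5])

# illustrative data with a true indirect effect of 0.5 * 0.4 = 0.2
rng = np.random.default_rng(1)
x = rng.normal(size=500)
m = 0.5 * x + rng.normal(size=500)
y = 0.4 * m + rng.normal(size=500)
lo, hi = bootstrap_indirect(x, m, y)
# the interval excludes zero, so a significant indirect effect is inferred
```

the default of 5,000 resamples follows the recommendation attributed to hayes (2013) above; no test of the total or direct effect is involved anywhere in the procedure.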
zhao and colleagues (2010) and hayes and rockwood (2017) provide brief and easy-to-read articles in which they outline the drawbacks of the causal steps approach and guide readers through the rationale and practical implementation of contemporary mediation testing. with such guidance, researchers will hopefully not only use statistical software based on state-of-the-art thinking about mediation, but also adjust their own mindsets accordingly when writing about mediation results.

open science practices

the nature of this commentary meant that there are no data or research materials to be shared.

references

alt, n. p., chaney, k. e., & shih, m. j. (in press). “but that was meant to be a compliment!”: evaluative costs of confronting positive racial stereotypes. group processes & intergroup relations. doi: 10.1177/1368430218756493

baron, r. m., & kenny, d. a. (1986). the moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. journal of personality and social psychology, 51, 1173–1182. doi: 10.1037/0022-3514.51.6.1173

bender, a., & ingram, r. (2018). connecting attachment style to resilience: contributions of self-care and self-efficacy. personality and individual differences, 130, 18–20. doi: 10.1016/j.paid.2018.03.038

bhalla, a., allen, e., renshaw, k., kenny, j., & litz, b. (2018). emotional numbing symptoms partially mediate the association between exposure to potentially morally injurious experiences and sexual anxiety for male service members. journal of trauma & dissociation, 19, 417–430. doi: 10.1080/15299732.2018.1451976

buckner, j. d., lewis, e. m., shah, s. m., & walukevich, k. a. (2018). risky sexual behavior among cannabis users: the role of protective behavioral strategies. addictive behaviors, 81, 50–54. doi: 10.1016/j.addbeh.2018.01.039

bullock, j. g., green, d. p., & ha, s. e. (2010). yes, but what's the mechanism? (don't expect an easy answer).
journal of personality and social psychology, 98, 550–558. doi: 10.1037/a0018933

crossin, r., lawrence, a. j., andrews, z. b., & duncan, j. r. (in press). altered body weight associated with substance abuse: a look beyond food intake. addiction research & theory. doi: 10.1080/16066359.2018.1453064

feinberg, l., kerns, c., pincus, d. b., & comer, j. s. (2018). a preliminary examination of the link between maternal experiential avoidance and parental accommodation in anxious and non-anxious children. child psychiatry & human development, 49, 652–658. doi: 10.1007/s10578-018-0781-0

friese, m., loschelder, d. d., gieseler, k., frankenbach, j., & inzlicht, m. (in press). is ego depletion real? an analysis of arguments. personality and social psychology review. doi: 10.1177/1088868318762

hayes, a. f. (2009). beyond baron and kenny: statistical mediation analysis in the new millennium. communication monographs, 76, 408–420. doi: 10.1080/03637750903310360

hayes, a. f. (2013). introduction to mediation, moderation, and conditional process analysis [1st ed.]. new york: the guilford press.

hayes, a. f. (2017). introduction to mediation, moderation, and conditional process analysis [2nd ed.]. new york: the guilford press.

hayes, a. f. (2012). process: a versatile computational tool for observed variable mediation, moderation, and conditional process modeling [white paper]. retrieved from http://www.afhayes.com/public/process2012.pdf

hayes, a. f., & rockwood, n. j. (2017). regression-based statistical mediation and moderation analysis in clinical research: observations, recommendations, and implementation. behaviour research and therapy, 98, 39–57. doi: 10.1016/j.brat.2016.11.001

hayes, a. f., & scharkow, m. (2013). the relative trustworthiness of inferential tests of the indirect effect in statistical mediation analysis: does method really matter? psychological science, 24, 1918–1927. doi: 10.1177/0956797613480187

karriker-jaffe, k. j., klinger, j.
l., witbrodt, j., & kaskutas, l. a. (2018). effects of treatment type on alcohol consumption partially mediated by alcoholics anonymous attendance. substance use & misuse, 53, 596–605. doi: 10.1080/10826084.2017.1349800

li, w., gao, l., chen, h., cao, n., & sun, b. (2018). prediction of injunctive and descriptive norms for willingness to quit smoking: the mediating role of smoking risk perception. journal of substance use, 23, 274–279. doi: 10.1080/14659891.2017.1394378

lum, z. k., tsou, k. y. k., & lee, j. c. (2018). mediators of medication adherence and glycaemic control and their implications for direct outpatient medical costs: a cross-sectional study. diabetic medicine, 35, 807–815. doi: 10.1111/dme.13619

mackenbach, j. d., charreire, h., glonti, k., bárdos, h., rutter, h., compernolle, s., ... & lakerveld, j. (in press). exploring the relation of spatial access to fast food outlets with body weight: a mediation analysis. environment and behavior. doi: 10.1177/0013916517749876

mackinnon, d. p., & fairchild, a. j. (2009). current directions in mediation analysis. current directions in psychological science, 18, 16–20. doi: 10.1111/j.1467-8721.2009.01598.x

maroney, n., williams, b. j., thomas, a., skues, j., & moulding, r. (in press). a stress-coping model of problem online video game use. international journal of mental health and addiction. doi: 10.1007/s11469-018-9887-7

meule, a. (2017). commentary: questionnaire and behavioral task measures of impulsivity are differentially associated with body mass index: a comprehensive meta-analysis. frontiers in psychology, 8(1222), 1–4. doi: 10.3389/fpsyg.2017.01222

peña, j., ibarretxe-bilbao, n., sánchez, p., uriarte, j. j., elizagarate, e., gutierrez, m., & ojeda, n. (2018). mechanisms of functional improvement through cognitive rehabilitation in schizophrenia. journal of psychiatric research, 101, 21–27.
doi: 10.1016/j.jpsychires.2018.03.002 polanco-roman, l., moore, a., tsypes, a., jacobson, c., & miranda, r. (2018). emotion reactivity, comfort expressing emotions, and future suicidal ideation in emerging adults. journal of clinical psychology, 74, 123–135. doi: 10.1002/jclp.22486 poless, p. g., torstveit, l., lugo, r. g., andreassen, m., & sütterlin, s. (2018). guilt and proneness to shame: unethical behaviour in vulnerable and grandiose narcissism. europe’s journal of psychology, 14, 28–43. doi: 10.5964/ejop.v14i1.1355 quinlan, e., deane, f. p., crowe, t., & caputi, p. (2018). do attachment anxiety and hostility mediate the relationship between experiential avoidance and interpersonal problems in mental health carers?. journal of contextual behavioral science, 7, 63–71. doi: 10.1016/j.jcbs.2018.01.003 reilly, e. e., gordis, e. b., boswell, j. f., donahue, j. m., emhoff, s., & anderson, d. a. (2018). evaluating the role of repetitive negative thinking in the maintenance of social appearance anxiety: an experimental manipulation. behaviour research and therapy, 102, 36–41. doi: 10.1016/j.brat.2018.01.001 roussotte, f. f., siddarth, p., merrill, d. a., narr, k. l., ercoli, l. m., martinez, j., ... & small, g. w. (2018). in vivo brain plaque and tangle burden mediates the association between diastolic blood pressure and cognitive functioning in nondemented adults. american journal of geriatric psychiatry, 26, 13–22. doi: 10.1016/j.jagp.2017.09.001 sobel, m. e. (1982). asymptotic confidence intervals for indirect effects in structural equation models. in leinhart, s. (ed.), sociological methodology (pp. 290–312). san francisco, ca: jossey-bass. spencer, s. j., zanna, m. p., & fong, g. t. (2005). establishing a causal chain: why experiments are often more effective than mediational analyses in examining psychological processes. journal of personality and social psychology, 89, 845–851. doi: 10.1037/0022-3514.89.6.845 stanford, m. s., mathias, c. w., dougherty, d. 
m., lake, s. l., anderson, n. e., & patton, j. h. (2009). fifty years of the barratt impulsiveness scale: an update and review. personality and individual differences, 47, 385–395. doi: 10.1016/j.paid.2009.04.008 stone-romero, e. f., & rosopa, p. j. (2008). the relative validity of inferences about mediation as a function of research design characteristics. organizational research methods, 11, 326–352. doi: 10.1177/1094428107300342 stone-romero, e. f., & rosopa, p. j. (2011). experimental tests of mediation models: prospects, problems, and some solutions. organizational research methods, 14, 631–646. doi: 10.1177/1094428110372673 varni, j. w., shulman, r. j., self, m. m., saeed, s. a., zacur, g. m., patel, a. s., ... & denham, j. m. (2018). perceived medication adherence barriers mediating effects between gastrointestinal symptoms and health-related quality of life in pediatric inflammatory bowel disease. quality of life research, 27, 195–204. doi: 10.1007/s11136-0171702-6 widmer, e. d., girardin, m., & ludwig, c. (2018). conflict structures in family networks of older adults and their relationship with health-related quality of life. journal of family issues, 39, 1573–1597. doi: 10.1177/0192513x17714507 yildirim, c., & dark, v. j. (2018). the mediating role of mindfulness in the relationship between media multitasking and mind wandering. in proceedings of the techmindsociety ’18. acm, new york, ny, usa. doi: 10.1145/3183654.3183711 zhao, x., lynch, j. g., & chen, q. (2010). reconsidering baron and kenny: myths and truths about mediation analysis. journal of consumer research, 37, 197–206. doi: 10.1086/651257 meule 6 table 1 some recent examples of articles referring to mediation analyses with process (hayes, 2013) reference indirect effect(s) of… issues alt et al. (in press) …stereotype valence on favorable evaluations through perceived offensiveness followed the causal steps logic by highlighting that “the direct effect […] was no longer significant” (pp. 
6, 7, 10)
bender et al. (2018) | …closeness, dependability, and anxiety on resilience through self-care and self-efficacy | followed the causal steps logic by stating that significant direct effects suggest “partial mediation” (p. 19)
bhalla et al. (2018) | …exposure to potentially morally injurious events on sexual anxiety through emotional numbing | followed the causal steps logic by stating that “this was a partial mediation, as the direct effect […] remained significant” (p. 424)
buckner et al. (2018) | …cannabis use on condom use through protective behavior strategies | followed the causal steps logic by highlighting that “there was no longer a significant direct effect” (p. 52)
feinberg et al. (2018) | …experiential avoidance on accommodation through beliefs about anxiety | followed the causal steps logic by highlighting that “when negative beliefs about child anxiety were incorporated into the model this direct effect was no longer significant” (p. 652)
friese et al. (in press) | …ego depletion on performance through monitoring processes and self-efficacy | followed the causal steps logic by describing mediation in terms of the baron and kenny approach (p. 9); confused total effect with direct effect
karriker-jaffe et al. (2018) | …treatment type on alcohol consumption through treatment duration and alcoholics anonymous attendance | followed the causal steps logic by stating that “effects of treatment type on alcohol consumption [are] partially mediated by alcoholics anonymous attendance” (p. 596)
li et al. (2018) | …injunctive and descriptive norms on willingness to quit smoking through smoking risk perception | used 1,000 bootstrap samples although hayes (2013) recommends using a minimum of 5,000 bootstrap samples; followed the causal steps logic by stating that “smoking risk perception partially mediated the association between descriptive norms and willingness to quit smoking” (p. 277)
lum et al. (2018) | …medication adherence on glycated hemoglobin concentration through diabetes-related distress and perception of hyperglycaemia | tested simple mediation models with the causal steps approach and the sobel test, but a subsequent multiple mediation model with process/bootstrapping (p. 810); followed the causal steps logic by referring to full/complete vs. partial mediation
mackenbach et al. (in press) | …access to fast food outlets on overweight through perceived availability and usage of fast food outlets | tested mediation effects with the sobel test; followed the causal steps logic by not testing for mediation for certain outcomes because “none of the a- or b-paths were statistically significant” (p. 17) and highlighting how much of the total effect can be explained by the indirect effect (“proportion mediated”, p. 9)
maroney et al. (in press) | …depression, loneliness, and social anxiety on problem video game use through escapism and social interaction motives for gaming | followed the causal steps logic by stating that “these effects being partially mediated by escapism and social interaction motives for gaming” (p. 1)
peña et al. (2018) | …cognitive rehabilitation on functional improvement through changes in processing speed and verbal memory | followed the causal steps logic by stating that the mediating variable “partially mediated” the relationship between the independent and dependent variable (p. 23)
polanco-roman et al. (2018) | …emotion reactivity on suicidal ideation through depressive symptoms | followed the causal steps logic by highlighting that “the direct effect of emotion reactivity on suicidal ideation, though reduced, remained statistically significant after comfort expressing love, happiness, anger, and sadness were entered into the model […], but was no longer significant after depressive symptoms were entered in the model” (p. 128)
poless et al. (2018) | …narcissism on ethical behavior by guilt repair through guilt | tested mediation effects with the sobel test; followed the causal steps logic by highlighting the significant total and non-significant direct effects (figures 1-4 on pp. 34-35)
quinlan et al. (2018) | …experiential avoidance on interpersonal problems through attachment anxiety and hostility | followed the causal steps logic by describing results as full vs. partial mediation (pp. 66-68)
reilly et al. (2018) | …pretest social appearance anxiety on posttest social appearance anxiety through repetitive negative thinking | followed the causal steps logic by describing results as partial mediation (p. 39)
roussotte et al. (2018) | …diastolic blood pressure on cognitive functioning through brain plaque and tangle burden | used 1,000 bootstrap samples although hayes (2013) recommends using a minimum of 5,000 bootstrap samples; followed the causal steps logic (“if the 95% confidence interval for a × b does not include 0 and the association between the predictor and outcome variables including the mediator variable (i.e., the c’-path) is no longer significant, then a significant (p < 0.05) mediation has occurred.”, p. 16)
varni et al. (2018) | …gastrointestinal symptoms on health-related quality of life through perceived medication adherence barriers | tested mediation effects with the sobel test in addition to bootstrap sampling; confused total effect with direct effect and followed the causal steps logic (“mediators are intervening variables that are hypothesized to account partially or in full for the relationship between a predictor variable and an outcome variable […] the predictor variable is hypothesized to have a direct effect on the outcome variable”, p. 198)
widmer et al. (2018) | …family conflict on health through individual stress | used 1,000 bootstrap samples although hayes (2013) recommends using a minimum of 5,000 bootstrap samples; additionally tested mediation effects with the sobel test; confused total effect with direct effect and followed the causal steps logic (“the inclusion of stress reduces the direct association between conflict structures and the psychological dimensions of health”, p. 1590)
yildirim & dark (2018) | …media multitasking on mind wandering through trait mindfulness | followed the causal steps logic (“given that all paths were statistically significant, that the inclusion of trait mindfulness as a mediator led to reductions in the magnitude of the effect of media multitasking frequency on mind wandering tendency (path c’ < path c), and that the indirect effect of media multitasking frequency on mind wandering tendency through trait mindfulness was significant, it can be concluded that trait mindfulness partially mediated the relationship between media multitasking frequency and mind wandering tendency.”, p. 3)
meta-psychology, 2023, vol 7, mp.2021.2938 https://doi.org/10.15626/mp.2021.2938 article type: commentary published under the cc-by4.0 license open data: yes open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: nick brown reviewed by: borgert, n., lawrence, j., anaya, j. analysis reproduced by: lucija batinović all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/bjgae
a critical re-analysis of six implicit learning papers
brad mckay, mcmaster university; michael j. carter, mcmaster university
abstract: we present a critical re-analysis of six implicit learning papers published by the same authors between 2010 and 2021. we calculated effect sizes for each pairwise comparison reported in the papers using the data published in each article.
we further identified mathematically impossible data reported in multiple papers, either with deductive logic or by conducting a grimmer analysis of reported means and standard deviations. we found the pairwise effect sizes were implausible in all six articles in question, with cohen’s d values often exceeding 100 and sometimes exceeding 1000. in contrast, the largest effect size observed in a million simulated experiments with a true effect of d = 3 was d = 6.6. impossible statistics were reported in four out of the six articles. reported test statistics and η2 values were also implausible, with several η2 = .99 and even η2 = 1.0 for between-subjects main effects. the results reported in the six articles in question are unreliable. many of the problems we identified could be spotted without further analysis. keywords: metascience, grimmer, effect sizes, perceptual motor learning introduction statistical reporting errors may commonly occur in psychology articles (brown & heathers, 2017; nuijten et al., 2016) and such errors are often consistent with hypothesized results (bakker & wicherts, 2011). when the primary conclusions in research articles depend on reporting errors, replicability is unlikely and future research may be wasted if researchers attempt to build on the erroneously reported results (munafò et al., 2017). in this paper, we scrutinize six papers published by the same two authors,1 where the authors report a high number of erroneous or implausible data on which their primary conclusions depend. we first became aware of the lola and tzetzis (2021) paper when the paper was highlighted in a social media post (gray, 2021). during an initial read through by one of us (bm), a number of reporting and statistical issues were noticed. the paper also referenced past research published by these authors. given our concerns over the issues found in the lola and tzetzis (2021) paper, we deemed it necessary to examine these other papers. 
the data irregularities we found are similar across the target articles and at times even include repeated values (e.g., f-statistics) across multiple papers. regardless of the conclusion one reaches with respect to the mechanism behind these errors, it is our contention that the results reported in these papers are unreliable and that the respective journals in which the papers are published should take corrective actions.2
footnote 1: one of the six papers had a third author and one had a third and fourth author.
below, we outline our causes for concern and the overarching issues we found across the six articles in question.
the articles in question
we reanalyzed six articles by afroditi lola, george tzetzis, and their colleagues. in all experiments, the authors investigated the effects of implicit and explicit instructions on perceptual and motor learning. all experiments sampled young females who were enrolled in a volleyball camp (see table 1). our reanalysis of the target articles evaluated the plausibility of the reported means, standard deviations, and test statistics. we will refer to the six articles throughout this paper using the following numbering system based on reverse chronological order:
1. lola, a.c., giatsis, g., pérez-turpin, j.a., & tzetzis, g.c. (2021). the influence of analogies on the development of selective attention in novices in normal or stressful conditions. journal of human sport and exercise. https://doi.org/10.14198/jhse.2023.181.12
2. lola, a.c., & tzetzis, g.c. (2021). the effect of explicit, implicit and analogy instruction on decision making skill for novices, under stress. international journal of sport and exercise psychology, 1-21. https://doi.org/10.1080/1612197x.2021.1877325
3. lola, a.c., & tzetzis, g.c. (2020). analogy versus explicit and implicit learning of a volleyball skill for novices: the effect on motor performance and self-efficacy.
journal of physical education and sport, 20(5), 2478-2486. https://doi.org/10.7752/jpes.2020.05339
4. tzetzis, g.c., & lola, a.c. (2015). the effect of analogy, implicit, and explicit learning on anticipation in volleyball serving. international journal of sport psychology, 46(2), 152-166. https://doi.org/10.7352/ijsp.2015.46.152
5. lola, a.c., tzetzis, g.c., & zetou, h. (2012). the effect of implicit and explicit practice in the development of decision making in volleyball serving. perceptual and motor skills, 114(2), 665-678. https://doi.org/10.2466/05.23.25.pms.114.2.665-678
6. tzetzis, g.c., & lola, c.a. (2010). the role of implicit, explicit instruction and their combination in learning anticipation skill, under normal and stress conditions. international journal of sport sciences and physical education, 1, 54-59.3
although there were some differences between the reported experiments in the target articles, there were many methodological commonalities that can be summarized. all six articles involved female children learning a volleyball skill as part of a volleyball camp. in each case, the participants were reported to have minimal experience (i.e., were described as novices) with the task at hand. the purpose of all six experiments was to evaluate perceptual or motor learning differences as a function of the type of instruction received during practice. each experiment included a pre-test, an acquisition (i.e., practice) phase involving 12 sessions spaced over four weeks, and a post-test. a high-stress test was also included in articles 1, 2, and 6. in articles 1-4, the groups differed with respect to the type of instruction received: implicit, explicit, or analogy. in articles 5 and 6, a sequential group (see below for description) replaced the analogy group. all six experiments also included a control group that did not practice the task.
implicit instruction did not contain any explicit information about how to perform the task, and the learners were asked to perform a distracting task, such as counting backwards, while practicing to prevent them from acquiring declarative rules for performance. in contrast, explicit instruction consisted of direct verbal instructions for performing the task. analogy instruction was considered a type of implicit instruction wherein an analogy or metaphor was provided to the learner. for example, "imagine that the opponents’ surface is covered with water. send the ball where there is more water and no opponents at the court." (lola & tzetzis, 2021, p. 9). sequential instruction involved receiving explicit instruction for the first half of training followed by implicit instruction for the second half. across experiments, the authors predicted that implicit forms of instruction (implicit, analogy, and sequential) would be more effective than explicit instruction for motor and perceptual learning. this advantage was also predicted to be greater when testing was conducted in a high-stress situation. in article 2, for instance, high stress was induced by falsely telling participants that the best performers would be selected for a draft to the national team. further, it was predicted that analogy or sequential instruction would offer improvements relative to implicit instruction. the primary outcome measures used in these experiments were reaction time (articles 1, 2, 4, 5, and 6),
footnote 2: we contacted the journal editors for articles 2-6 on september 22, 2021 and for article 1 on january 21, 2022. all editors indicated their intention to further investigate these issues and/or take corrective actions.
footnote 3: this journal has been identified as a potential predatory journal. we were unable to find an online version of this article on the journal’s webpage and, interestingly, the earliest issue on the webpage is from 2016.
we were only able to find an online version on researchgate (https://www.researchgate.net/profile/angela-calder/publication/234000504_the_scientific_basis_for_recovery_training_practices_in_sport/links/5428fff80cf26120b7b574ad/the-scientific-basis-for-recovery-training-practices-in-sport.pdf, with the target article beginning on page 57).
table 1. participant demographics in each of the target articles.
target article | sample size and participant details
article 1: lola et al. (2021) | 60 females, age range: 11 to 12 years (mage and sd not reported)
article 2: lola & tzetzis (2021) | 60 females, age range: 10 to 11 years (mage = 10.48, sd = 0.911)a
article 3: lola & tzetzis (2020) | 80 females, age range: 10 to 11 years (mage = 10.48, sd = 0.911)a
article 4: tzetzis & lola (2015) | 60 females, age range: 9 to 12 years (mage = 10.48, sd = 0.91)a
article 5: lola et al. (2012) | 60 females, age range: 10 to 12 years (mage = 11.2, sd = 0.3)
article 6: tzetzis & lola (2010) | 48 females, age range: 12 to 13 years (mage = 12.38, sd = 0.34)
note. a: articles 2-4 report identical means and standard deviations for the age of their participants despite a different sample size in article 3 from articles 2 and 4, and a different age range in article 4 from articles 2 and 3.
response accuracy (articles 1, 2, 4, and 5), and motor performance measured on a 4-point scale (article 3). in addition, articles 2 and 6 included a measure of state anxiety, the competitive state anxiety inventory-2 (tsorbatzoudis et al., 1998), and article 3 had a measure of self-efficacy using a likert scale. the number of explicit rules recalled was assessed in articles 2, 4, 5, and 6.
methods
none of the six articles in question included a link to a public repository where the data could be accessed. we first wrote (email sent february 10, 2021) to the corresponding author of article 2 and asked if they would be willing to share the data for this experiment. the authors’ response was that the data could not be shared as they were not finished with their analyses and were in the process of running different tests (a.
lola, personal communication, february 12, 2021). we followed up on this email (sent february 12, 2021) by asking whether they would instead be willing to share the data from any of articles 3 to 6, as these were less recent and presumably all planned analyses had been completed. after a two-week period with no response, we followed up with a third email (sent february 26, 2021) and reiterated our interest in obtaining their data from any of these articles. the authors’ response was that they were unable to share data from any of these articles because in some cases they no longer had the data and in other cases they had plans to conduct further analyses (a. lola, personal communication, march 2, 2021). our first two requests did not include any indication of our concerns regarding the data irregularities. subsequently, in a fourth email (sent april 12, 2021) we outlined our concerns for each article4 and once again reiterated our request to the authors to share any available data for any of the target articles. these requests were once again refused. the authors did address some specific concerns regarding article 2, but for the most part only provided more general responses to our concerns. the authors admitted that some of the values reported in the other target articles were incorrect, but did not identify which values or articles. despite this, the authors maintained that the data irregularities (identified in our email and described in this paper) do not impact the veracity of their analyses or conclusions (a. lola, personal communication, april 22, 2021). we illustrate below that the data and analyses reported in each of the articles reviewed are unreliable. our extracted data and analysis scripts can be accessed using either of the following links: https://osf.io/raz6q/ or https://www.github.com/cartermaclab/comm_lola-tzetzis-data-irregularities.
effect size calculations and simulations
means and standard deviations were extracted from each article for all measures and time points that were reported. cohen’s d was calculated for each pairwise comparison using the r package compute.es. consistent with the group sizes reported in article 3, which had the largest groups among the target articles, we simulated data from two groups of n = 20. we ran simulations with true effect sizes of d = .8 and d = 3 one million times each and report the range of effect sizes observed in those simulations.
footnote 4: excluding article 1 (lola et al., 2021) because we were not aware of it at the time, as it had not yet been accepted for publication.
mathematically impossible data and granularity analysis
in two of the articles in question, it was clear that some of the reported results were not mathematically possible based on the scale of measurement that was used. when outcomes were single-item integers (a granularity of 1), such as the number of explicit rules recalled, we used a web application (http://www.prepubmed.org/grimmer_sd/) to conduct a granularity analysis (grimmer) of reported means and standard deviations (anaya, 2016). grimmer builds off the original granularity-related inconsistency of means (grim) analysis (brown & heathers, 2017), which leveraged the fact that the means of granular data are also granular. given a data set of size n and granularity g, only means of granularity g/n are possible. thus, all possible means for data of a given g and n can be enumerated, and only means that match these possibilities are considered grim-consistent. the grimmer analysis extends this test by also evaluating whether mean-standard deviation pairs are possible.
first, the grim analysis is conducted to determine if the mean is grim-consistent. next, lower and upper bounds on the variance are calculated based on how many decimals d are reported, as (sd ± 0.5 × 10^−d)^2. then all possible variances between these bounds are enumerated, converted back to standard deviations, and rounded to the nearest d decimals. the reported standard deviation is checked for a match with any of these values. finally, the mean-variance pair is compared to possible mean-variance pairings (the grimmer test handles sample sizes between 5 and 99). using grimmer, it is possible to determine if specific mean and standard deviation pairs are possible for data of a given sample size. to be conservative, we specified that we did not know whether the standard deviation was calculated for the sample or population, nor whether ambiguous values were rounded up or down. mean and standard deviation pairs that are mathematically possible are considered grimmer-consistent, while mean and standard deviation pairs that are not mathematically possible are grimmer-inconsistent.
eta-squared
each of the articles reported only omnibus test statistics and then reported post-hoc analyses with symbols demarcating significant and non-significant differences. in response to our expression of concern, the authors suggested that many of the issues were due to misprints in the articles. specifically, they indicated that the reported means and standard deviations in their tables were incorrect and that the root of the errors must have been their outsourcing of the formatting of their tables. the authors then insisted that despite these typographic errors, their discussion of the results and corresponding conclusions were still accurate (a. lola, personal communication, april 22, 2021). however, the test statistics reported for many analyses were implausibly large, and the authors often reported η2 values associated with the omnibus tests.
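the grim step described above is simple enough to sketch in a few lines. the following is an illustrative python reimplementation of the mean check only (the function name and tolerance handling are ours, not those of the web application the authors used):

```python
def grim_consistent(reported_mean, n, granularity=1, decimals=2):
    """check whether a reported mean is attainable from n values of the
    given granularity (e.g., integer scores have granularity 1).

    attainable means are multiples of granularity / n; the reported mean
    is grim-consistent if some attainable mean rounds to it at the
    reported number of decimal places.
    """
    tol = 0.5 * 10 ** (-decimals)  # half-width of the rounding interval
    # only sums near reported_mean * n / granularity can possibly match
    k_mid = int(round(reported_mean * n / granularity))
    for k in range(max(0, k_mid - 2), k_mid + 3):
        if abs(k * granularity / n - reported_mean) <= tol + 1e-12:
            return True
    return False
```

for example, with integer data and n = 28, a reported mean of 5.18 is attainable (145/28 ≈ 5.1786 rounds to 5.18), while a reported mean of 5.19 is not.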
our examination of the reported η2 values revealed that, as with the reported pairwise comparisons, many were implausibly large.
results
implausible effect sizes
cohen’s d is used to describe the standardized mean difference of an effect, and values are unbounded in both the negative and positive directions. we calculated absolute values so that all effects were positive. cohen’s ds (cohen, 1988) is the observed difference between group means divided by their pooled standard deviation (see lakens, 2013, for a detailed discussion). conventional benchmarks for small, medium, and large effects are d = .2, .5, and .8, respectively (cohen, 1962); however, this mindless approach to effect size interpretation has been heavily discouraged (correll et al., 2020; field, 2016; lakens, 2013; thompson, 2007). recently, an analysis of 6447 cohen’s d statistics extracted from social psychology meta-analyses observed median and 75th percentile cohen’s d values of .36 and .65, respectively, suggesting the conventional benchmarks may overestimate typical effects (lovakov & agadullina, 2021). in the field of motor learning, recent meta-analyses have found average effect sizes in the published literature of d = .19 (mckay, hussien, et al., 2022), d = .54 (mckay, yantha, et al., 2022), and d = .71 (lohse et al., 2016). to evaluate the maximum plausible cohen’s d statistics one might encounter from experiments similar to those reported in the target articles, we conducted two simulations that each consisted of one million experiments (see figure 1). we set the true effect size at d = .8, the conventional benchmark for a "large" treatment effect, in the first simulation. the largest effect size observed from the one million simulated experiments was d = 2.97. in the second simulation, we set the true effect size at d = 3, an unrealistically large effect size that would rarely be encountered in the psychology and/or motor learning literature.
the maximum effect size observed in the one million simulated experiments was d = 6.6. in the context of the maximum values observed in our simulations, all six articles in question reported implausibly large effect sizes. the original table of summary statistics in article 1 indicated that the smallest post-intervention difference in reaction times was d = .64. however, all other effects were larger than d = 8.7 and the largest effect was d = 41. the accuracy data also reflected improbably large post-intervention differences, with two-thirds of all comparisons showing effects larger than d = 5 and a largest effect of d = 13.35.
[figure 1. absolute cohen’s d estimates from all articles except article 6 plotted on a logarithmic scale. only data from the original tables in article 1 are included. all pairwise comparisons have been included for all dependent measures in each experiment. the range of observed values from a simulation of 1,000,000 experiments with a true effect of d = .8 is illustrated by shaded green and blue regions of the figure, reaching a maximum value of d = 2.97. the range of observed values from a simulation of 1,000,000 experiments with a true effect of d = 3 is illustrated by the shaded purple and blue regions of the figure, reaching a maximum value of d = 6.6.]
however, a correction to the tables of summary statistics was published that included substantially smaller standard deviations than the original tables. while the updated data do imply smaller effect sizes, as we discuss below, they appear to be inconsistent with the reported analyses.
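the simulation procedure described above can be sketched as follows. this is an illustrative python/numpy version, not the authors' scripts (which are written in r and available in the linked repository); the function name and the assumption of group sds of 1 are ours:

```python
import numpy as np

rng = np.random.default_rng(2021)  # fixed seed for reproducibility

def max_observed_d(true_d, n_per_group=20, n_experiments=10_000):
    """largest cohen's d_s observed across simulated two-group experiments.

    each experiment draws two normal samples whose means differ by true_d
    (both sds = 1) and computes the absolute standardized mean difference
    using the pooled standard deviation.
    """
    largest = 0.0
    for _ in range(n_experiments):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_d, 1.0, n_per_group)
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        largest = max(largest, abs(b.mean() - a.mean()) / pooled_sd)
    return largest
```

even the maximum over a million such experiments stayed below d = 3 (true d = .8) and below d = 6.6 (true d = 3), which is why cohen's d values in the hundreds or thousands are far outside anything this sampling process produces.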
in article 2, the smallest pre-test difference for reaction time was d = 1.29 and the largest pre-test difference was d = 35.32—although none of the groups were reported as significantly different in the article. the smallest post-intervention effect at any of the three time points was d = 286.42, while the largest effect was d = 3504.86. a similar picture emerges when analyzing the accuracy data. all the pre-test differences were improbably large (all d’s ≥ 2.52) despite being reported as not significantly different in the articles. ten of the pairwise comparisons resulted in d’s ≥ 100 following treatment with the independent variables. the motor component data revealed post-treatment effect sizes ranging from d = 1.16 to d = 13.5. in article 3, post-intervention motor performance effect sizes ranged from d = 3.1 to d = 20.95. similarly, post-intervention self-efficacy effect sizes ranged from d = 1.79 to d = 44.46. likewise, in article 4 post-intervention reaction time effect sizes ranged from d = 2.28 to d = 35.97. continuing this pattern, post-intervention response accuracy effect sizes ranged from d = 5.84 to d = 29.7. in article 5, many response accuracy effect sizes were implausibly large beginning at pre-test, wherein effects ranged from d = 2.53 to d = 15.50. nevertheless, all pre-test comparisons were reported as non-significant. following intervention, the effect sizes ranged from d = 23.13 to d = 155.08. relative to other reported effect sizes, those reported for reaction time were not implausibly large at any time point, ranging from d = 0 to d = .86. however, the authors reported an implausibly large effect size, η2 = .94, for the 4 (group) x 3 (time) anova. further, despite only one pairwise comparison being statistically significant, all post-intervention comparisons were reported as being significant in the article. in article 6, the authors did not report means and standard deviations for most of the analyses.
however, η2 effect sizes were reported and these ranged from η2 = .52 to η2 = .98. these effect sizes are discussed further below. all the post-intervention effects reviewed above were directionally consistent with the researchers’ expectations. the sometimes implausibly large pre-test effects were not expected, but also were not reported as significant.

impossible data and granularity analysis

in article 2, the competitive state anxiety inventory-2 was used to assess the level of cognitive and somatic stress experienced by participants. responses were measured on a likert scale ranging from 1 to 4, with the data appearing to represent the average response per item. at each of the three low-stress time points, the means reported for all four groups ranged from 1.02 to 1.09. during the high-stress time point, the means ranged from 3.95 to 4.09. the means for two groups were reported as greater than 4, which is not possible given the maximum score on the competitive state anxiety inventory-2 is 4. in article 3, participants were asked to receive a served volleyball and pass it to a target consisting of three concentric circles. motor performance was measured based on where the pass landed, with three points awarded for a pass to the central circle on the target, two points for the middle circle, one point for the outermost circle, and zero points for a pass that missed the target.5 results were presented as average performance per trial and the analogy group was reported to have a mean score of 3.00 at retention (a perfect score) but with a standard deviation of .09. the perfect score was not a rounding error because the same group was reported to have a mean score of 2.99 with a standard deviation of .11 on the post-test. these data are not possible. in articles 2, 5, and 6, the authors reported means and standard deviations for the number of explicit rules recalled by participants following the intervention phase.
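the impossible mean–sd pairs noted above can be checked mechanically: for data bounded on an interval [a, b], the population variance can be at most (mean − a)(b − mean) (the bhatia–davis inequality), so a mean at the scale maximum forces the standard deviation to zero, and a mean beyond the scale maximum is impossible outright. a minimal python sketch of this check (our own; the original analyses were run in r, and the function name is ours):

```python
import math

def max_possible_sd(mean, lo, hi, n):
    """Largest sample SD (ddof=1) achievable for n scores bounded on [lo, hi]
    with the given mean, via the Bhatia-Davis inequality."""
    if not lo <= mean <= hi:
        return float("nan")  # the mean itself is impossible on this scale
    pop_var_max = (mean - lo) * (hi - mean)
    return math.sqrt(n / (n - 1) * pop_var_max)

# article 3: a mean of 3.00 on a 0-3 scale forces sd = 0, so sd = .09 is impossible
assert max_possible_sd(3.00, 0, 3, 15) == 0.0
# article 2: a mean of 4.09 exceeds the 1-4 response scale entirely
assert math.isnan(max_possible_sd(4.09, 1, 4, 15))
```

note that the check is one-sided: a pair that survives it (such as m = 2.99, sd = .11) is merely not ruled out by the scale bounds alone.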
as single-item analyses of integer data, these results were suitable for a grimmer analysis. in article 2, the mean and standard deviation pairs were grimmer inconsistent for all four groups (implicit: m = .73, sd = .35; analogy: m = 1.03, sd = .25; explicit: m = 4.8, sd = .78; control: m = .67, sd = .48; n = 15). in article 5, the mean and standard deviation pairs were grimmer consistent for the explicit rules group (m = 4.8, sd = 1.78) and the implicit group (m = 2.3, sd = 1.3). the mean and standard deviation pairs for the remaining two groups were grimmer inconsistent (sequential: m = 4.2, sd = 1.07; control: m = 1.8, sd = .3; n = 15). in article 6, the mean and standard deviation pairs were grimmer consistent for three of the four groups if the standard deviations were calculated for the population rather than the sample (sequential: m = 4.2, sd = 1.07; implicit: m = 2.3, sd = 1.3; control: m = 1.4, sd = .9; n = 12). for two of the groups, they were consistent regardless of which method of calculating the standard deviation was used. however, the results for the explicit group were grimmer inconsistent (m = 4.8, sd = 1.78).6

eta-squared

eta-squared (η2) is calculated by dividing the sum of squares for the effect by the total sum of squares. it can be interpreted as analogous to r2 as it represents the proportion of the total variation in the dependent measure that can be explained by a given main effect or interaction in an anova (lakens, 2013). benchmarks have been suggested for small, medium, and large effect sizes as η2 = .01, .06, and .14, respectively (cohen, 1988). importantly, if the main effect of instruction-type results in η2 = .99, as was commonly reported in the target articles, this suggests that 99% of the total variability in the outcome measure can be explained by group assignment alone. such a result is implausible. article 2 did not report η2 values but had the largest pairwise effects and f-statistics of the five articles in question.
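the grimmer-style checks described above rest on a simple necessary condition: n integer scores imply an integer sum and an integer sum of squares. the brute-force python sketch below is our own simplified variant of that logic, not the published grimmer implementation (anaya, 2016); it assumes non-negative integer scores and a sample-based (n − 1) standard deviation, can rule a mean/sd pair out, and a pass means only "not impossible":

```python
import math

def grimmer_consistent(mean, sd, n, mean_dec=2, sd_dec=2, max_item=20):
    """Necessary-condition check for a reported mean/SD of n integer scores.

    Search for any integer (sum, sum-of-squares) pair that reproduces the
    reported mean and SD after rounding. False means the pair is impossible;
    True means it is merely not ruled out by this check."""
    for total in range(0, n * max_item + 1):
        if round(total / n, mean_dec) != round(mean, mean_dec):
            continue  # this integer sum cannot produce the reported mean
        for sumsq in range(total, total ** 2 + 1):
            var = (sumsq - total ** 2 / n) / (n - 1)
            if var < 0:
                continue
            if round(math.sqrt(var), sd_dec) == round(sd, sd_dec):
                return True
    return False
```

applied to the explicit rules data above, the pair from article 5 (m = 4.8, sd = 1.78, n = 15) survives the check, while the pair from article 2 (m = 4.8, sd = .78, n = 15) is impossible: the only integer sum matching the mean is 72, and no integer sum of squares then yields an sd that rounds to .78.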
article 3 reported η2 = .994, η2 = .996, and η2 = .996 for the time, group, and time x group effects on motor performance, respectively. similarly, variance explained on the self-efficacy measure was η2 = .995, η2 = .994, and η2 = .997 for the time, group, and time x group effects, respectively. article 4 also reported η2 = .99 for all three effects on both response time and response accuracy measures. article 5 reported η2 = 1.0 for the main effect of time and the time x group interaction, as well as η2 = .95 for the main effect of group on the response time measure. interestingly, the time x group interaction had the smallest reported significant f-statistic among the five articles in question. with respect to response accuracy, the reported effects were η2 = .98, η2 = .94, and η2 = .93 for the time, group, and time x group analyses, respectively. article 6 reported η2 = .66, η2 = .52, and η2 = .72 for the time, group, and time x group analyses, respectively.

footnote 5: independent of the issues we have raised, this approach to measuring motor performance has been shown to be inappropriate and flawed for this type of task (fischman, 2015; hancock et al., 1995; reeve et al., 1994).

footnote 6: you may have noticed that two of the same mean and standard deviation pairings (m = 4.8, sd = 1.78 and m = 4.2, sd = 1.07) were classified as grimmer inconsistent for one paper and consistent for the other. this is because of sample size differences (n = 15 and n = 12).

other oddities

although the means and standard deviations for the explicit rules analysis were only reported in three of the articles in question, analyses were reported in articles 2, 4, 5, and 6. the reported test statistic in these four articles was f = 52.67, albeit with different degrees of freedom in article 6 that reflected the different sample size in this experiment (48 versus 60 in the others). articles 2-4 were published over a span of 6 years with reported sample sizes of 60 in articles 2 and 4, and 80 in article 3.
yet, the authors report identical means and standard deviations for the age of their participants in these three articles (see table 1). we assumed that each article was based on different samples as none of the articles mentioned using any previously published data. article 1 was submitted to the journal of human sport and exercise following our correspondence with the authors and published online (september 3, 2021) 11 days before we posted our preprint (version 1). we were unaware of this article when posting the original preprint and it was not included in that version. however, when we became aware of this paper it was immediately apparent that the data reported therein again reflected improbably large effect sizes. further, in comparing the reaction time and accuracy means reported in article 1 to those reported in article 2, it appeared the data shared a remarkably similar pattern. to investigate this similarity further, we conducted a correlation analysis between the two data sets. the reaction time means for each group and time point were highly correlated between the two papers, r = .99, as were the accuracy means, r = .99.

new developments following september 28, 2021 update

after contacting the editors for articles 2-5 on september 22, 2021 and updating our preprint on september 28th, at least two important developments transpired. first, all four editors responded, indicating an intention to investigate the issues we raised. the international journal of sport and exercise psychology is the only journal that has taken observable action to date, issuing an expression of concern regarding article 2 on october 11th, 2021 (see https://doi.org/10.1080/1612197x.2021.1991102). the current editor of the journal where article 4 was published included us in an email to the authors. we were also included in the authors’ reply, wherein they again insisted that their analyses and conclusions remained valid.
they offered an updated manuscript to the journal, but we are unaware of any decisions or further actions. the second important development was that the tables in article 1 were updated on october 8th, 2021. the new tables made changes to the standard deviation values for both the reaction time and accuracy measures. the standard deviations of the reaction time data were adjusted such that the decimal point was shifted one place to the right compared to the original version. for example, an original standard deviation of 10.25 is now 102.51. the adjustments to the accuracy data now show the original standard deviation values as standard errors, and new standard deviations are reported. although the adjustments to the tables reflect corrections to plausible mishaps in the publication process, and correcting such errors should be applauded, the new data themselves are problematic when compared to the reported analyses. to illustrate the disconnect between the new values and the test statistics in article 1, we used the r package faux to simulate data from multivariate normal distributions with the same mean and standard deviation parameters as the original and updated tables. we then analyzed the data using the same 4 x 4 mixed anova model reported by the authors and compared the test statistics we observed to those reported in article 1. we tried various correlations between time points and chose the value that produced the closest agreement between our analyses and theirs (r = .8). our analysis of simulated reaction time data using the originally reported standard deviations produced f-statistics that were more similar to the values reported in article 1 than our analysis based on the updated numbers. using the originally reported figures, we observed f = 18193 for the main effect of time. the authors reported f = 16055 for this analysis. our analysis using the updated statistics resulted in f = 186.
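the simulation step described above (faux in r) amounts to drawing repeated measures from a multivariate normal with the reported means and standard deviations and a chosen correlation between time points. a numpy analogue of that step, with a compound-symmetric correlation of r = .8, is sketched below; the means and standard deviations here are hypothetical placeholders, not the values from article 1:

```python
import numpy as np

def simulate_group(means, sds, r, n, rng):
    """Draw n participants' scores at len(means) time points from a
    multivariate normal with compound-symmetric correlation r."""
    k = len(means)
    corr = np.full((k, k), r)
    np.fill_diagonal(corr, 1.0)
    cov = corr * np.outer(sds, sds)  # covariance from correlation and SDs
    return rng.multivariate_normal(means, cov, size=n)

rng = np.random.default_rng(0)
# hypothetical reaction-time parameters for one group across four time points
data = simulate_group([500, 450, 440, 445], [100, 90, 90, 95], r=0.8, n=5_000, rng=rng)
emp_r = np.corrcoef(data, rowvar=False)  # empirical correlations near 0.8
```

one such data set per group, analyzed with the same mixed anova model as the target article, is what allows the simulated f-statistics to be compared against the reported ones.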
similarly, the time x group interaction was f = 4242 in article 1, f = 6208 in our simulation of the original parameters, and f = 54 using the updated numbers. the main effect of group was f = 7156 in article 1, f = 3030 in our simulation of the original parameters, and f = 23 using the updated numbers. as is evident, although the updated standard deviations lead to more plausible effect size calculations than the originally reported values, they are not consistent with the analyses reported in the paper. we observed similarly discordant f-statistics with analyses of simulated accuracy data based on the updated statistics. article 1 reported a main effect of time of f = 3278, we observed f = 2412 when analyzing our simulation of the original parameters, and f = 132 when using the updated parameters. the authors reported a group x time interaction of f = 657, we found f = 494 with simulations of the original statistics, and f = 27 when using the updated values. finally, the main effect of group was reported as f = 922, we found f = 254 with simulations of the original statistics, and f = 21 when analyzing simulations of the updated data.

discussion

we have reviewed concerning data irregularities spanning six articles investigating implicit motor and perceptual learning (lola et al., 2021; lola & tzetzis, 2020, 2021; lola et al., 2012; tzetzis & lola, 2010, 2015). these data irregularities include implausibly large effect sizes for pairwise comparisons and impossible descriptive statistics—both of which have been acknowledged by the authors as misprints due to an outsourcing of table formatting (a. lola, personal communication, april 22, 2021).
further, the reported test statistics and associated η2 values are also implausibly large, which is inconsistent with the authors’ claim that the results and discussions remain valid despite these aforementioned typographic errors in the tables. we discovered that the data reported in articles 1 and 2 are very highly correlated despite ostensibly coming from different experiments, samples, and situations (lola et al., 2021; lola & tzetzis, 2021). finally, we observed that recently updated tables of summary statistics in article 1 were incompatible with the analyses reported in the paper, while the original summary statistics, which indicated implausible effect sizes, were more compatible with the analyses. considering these findings, the conclusions from these articles are not reliable. the data published in article 1 are especially concerning. the article information indicates that it was submitted on june 17, 2021. our email correspondence with the authors ended on april 22, 2021. therefore, the article was submitted with data that reflect implausible effect sizes as large as d = 41.0 after we had shared our concerns about the previous five articles, and after the authors had suggested at least some of the implausible effect sizes were due to misprints. despite this correspondence, the authors published an additional article with results that were not only implausible, but highly correlated with the results reported in a previous article. subsequently, the authors published corrected tables with new standard deviation values. the means remained highly correlated with those reported in article 2, but the new standard deviations were substantially larger than the original values. the updated summary statistics are not consistent with the test statistics reported in article 1. it is noteworthy that the results reported in each of these articles perfectly reflect the authors’ expectations. 
indeed, our attention was drawn to these articles after the lola and tzetzis (2021) paper was shared on twitter (gray, 2021), possibly because the results appeared to be exemplary. although these errors seem unlikely to have aligned with expectations by chance alone, our exposure to them occurred after they had been selected for publication. we cannot rule out that these papers were selected for publication because of their exemplary results and happened to contain errors, with this selection causing those errors to correlate with the authors’ expectations. other irregularities, such as a repeating f-statistic for all four analyses of explicit rules and the recurring age of participants, potentially reflect sloppiness more than expectation. indeed, the authors have already admitted that some values reported in their tables were in error, but failed to identify which values and which articles. overall, it seems errors occurred in all of the articles we have reviewed. these errors were pervasive and appear to have substantially affected the conclusions of the articles in question. at a minimum, the consistent reporting errors across these six articles seem to reflect excessive carelessness throughout the publication process. even if the authors offer additional corrections, which they have suggested they intend to do,7 many in the research community may find it difficult to trust any of these results.

footnote 7: there is no indication that such corrective actions were taken by the authors prior to us contacting the editors on september 22, 2021. however, as mentioned, the tables in article 1 were updated and the authors offered some corrections in response to the editors of article 5. we do not know if additional corrections have been submitted.
r packages used in this project

we used r (version 4.0.4; r core team, 2021) and the r packages compute.es (version 0.2.5; re, 2013), daff (version 0.3.5; fitzpatrick et al., 2019), faux (version 1.1.0; debruine, 2021), gridgraphics (version 0.5.1; murrell & wen, 2020), kableextra (version 1.3.4; zhu, 2021), lemon (version 0.4.5; edwards, 2020), lsr (version 0.5; navarro, 2015), papaja (version 0.1.0.9997; aust & barth, 2020), rcolorbrewer (version 1.1.3; neuwirth, 2014), scales (version 1.2.0; wickham & seidel, 2020), and tidyverse (version 1.3.0; wickham et al., 2019).

author contact

corresponding authors: brad mckay (bradmckay8@gmail.com; mckayb9@mcmaster.ca) and michael j. carter (cartem11@mcmaster.ca)
brad mckay: https://orcid.org/0000-0002-7408-2323
michael j. carter: https://orcid.org/0000-0002-0675-4271

acknowledgements

we would like to thank abbey corson for her help with data extraction from the target articles.

conflict of interest and funding

the authors declare no competing interests or funding for this project.

author contributions (credit taxonomy)

conceptualization: bm, mjc
data curation: bm
formal analysis: bm
methodology: bm, mjc
project administration: bm, mjc
software: bm
supervision: mjc
validation: bm, mjc
visualization: bm
writing – original draft: bm, mjc
writing – review & editing: bm, mjc
author order was determined by contribution.

open science practices

this article earned the open data and the open materials badge for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

anaya, j. (2016). the grimmer test: a method for testing the validity of reported measures of variability (tech. rep.). https://doi.org/10.7287/peerj.preprints.2400v1
aust, f., & barth, m. (2020). papaja: create apa manuscripts with r markdown [r package version 0.1.0.9997]. https://github.com/crsh/papaja
bakker, m., & wicherts, j. (2011). the (mis)reporting of statistical results in psychology journals. behavior research methods, 43(3), 666–678. https://doi.org/10.3758/s13428-011-0089-5
brown, n., & heathers, j. (2017). the grim test: a simple technique detects numerous anomalies in the reporting of results in psychology. social psychological and personality science, 8(4), 363–369. https://doi.org/10.1177/1948550616673876
cohen, j. (1962). the statistical power of abnormal-social psychological research: a review. journal of abnormal and social psychology, 65, 145–153. https://doi.org/10.1037/h0045186
cohen, j. (1988). statistical power analysis for the behavioral sciences (2nd ed.). routledge. https://doi.org/10.4324/9780203771587
correll, j., mellinger, c., mcclelland, g., & judd, c. (2020). avoid cohen’s ‘small’, ‘medium’, and ‘large’ for power analysis. trends in cognitive sciences, 24(3), 200–207. https://doi.org/10.1016/j.tics.2019.12.009
debruine, l. (2021). faux: simulation for factorial designs [r package version 1.1.0]. zenodo. https://doi.org/10.5281/zenodo.2669586
edwards, s. m. (2020). lemon: freshing up your ’ggplot2’ plots [r package version 0.4.5]. https://cran.r-project.org/package=lemon
field, a. (2016). an adventure in statistics: the reality enigma.
fischman, m. (2015). on the continuing problem of inappropriate learning measures: comment on wulf et al. (2014) and wulf et al. (2015). human movement science, 42, 225–231. https://doi.org/10.1016/j.humov.2015.05.011
fitzpatrick, p., de jonge, e., & warnes, g. r. (2019). daff: diff, patch and merge for data.frames [r package version 0.3.5]. https://cran.r-project.org/package=daff
gray, r. (2021). the effect of explicit, implicit and analogy instruction on decision making skill for novices, under stress.
hancock, g., butler, m., & fischman, m. (1995). on the problem of two-dimensional error scores: measures and analyses of accuracy, bias, and consistency. journal of motor behavior, 27(3), 241–250. https://doi.org/10.1080/00222895.1995.9941714
lakens, d. (2013). calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and anovas. frontiers in psychology, 4. https://doi.org/10.3389/fpsyg.2013.00863
lohse, k., buchanan, t., & miller, m. (2016). underpowered and overworked: problems with data analysis in motor learning studies. journal of motor learning and development, 4(1), 37–58. https://doi.org/10.1123/jmld.2015-0010
lola, a., giatis, g., pérez-turpin, j., & tzetzis, g. (2021). the influence of analogies on the development of selective attention in novices in normal or stressful conditions. journal of human sport and exercise, 2023, 139–152.
https://doi.org/10.14198/jhse.2023.181.12
lola, a., & tzetzis, g. (2020). analogy versus explicit and implicit learning of a volleyball skill for novices: the effect on motor performance and self-efficacy. journal of physical education and sport, 20(5), 2478–2486. https://www.cabdirect.org/cabdirect/abstract/20203562097
lola, a., & tzetzis, g. (2021). the effect of explicit, implicit and analogy instruction on decision making skill for novices, under stress. international journal of sport and exercise psychology, 0(0), 1–21. https://doi.org/10.1080/1612197x.2021.1877325
lola, a., tzetzis, g., & zetou, h. (2012). the effect of implicit and explicit practice in the development of decision making in volleyball serving. perceptual and motor skills, 114(2), 665–678. https://doi.org/10.2466/05.23.25.pms.114.2.665-678
lovakov, a., & agadullina, e. (2021). empirically derived guidelines for effect size interpretation in social psychology. european journal of social psychology, 00, 1–20. https://doi.org/10.1002/ejsp.2752
mckay, b., hussien, j., vinh, m.-a., mir-orefice, a., brooks, h., & ste-marie, d. m. (2022). meta-analysis of the reduced relative feedback frequency effect on motor learning and performance. psychology of sport and exercise, 61, 102165. https://doi.org/10.1016/j.psychsport.2022.102165
mckay, b., yantha, z. d., hussien, j., carter, m. j., & ste-marie, d. m. (2022). meta-analytic findings in the self-controlled motor learning literature: underpowered, biased, and lacking evidential value. meta-psychology. https://doi.org/10.15626/mp.2021.2803
munafò, m., nosek, b., bishop, d., button, k., chambers, c., percie du sert, n., simonsohn, u., wagenmakers, e., ware, j., & ioannidis, j. (2017). a manifesto for reproducible science. nature human behaviour, 1(1), 1–9. https://doi.org/10.1038/s41562-016-0021
murrell, p., & wen, z. (2020). gridgraphics: redraw base graphics using ’grid’ graphics [r package version 0.5-1].
https://cran.r-project.org/package=gridgraphics
navarro, d. (2015). learning statistics with r: a tutorial for psychology students and other beginners (version 0.5). university of adelaide. adelaide, australia. http://ua.edu.au/ccs/teaching/lsr
neuwirth, e. (2014). rcolorbrewer: colorbrewer palettes [r package version 1.1-2]. https://cran.r-project.org/package=rcolorbrewer
nuijten, m., hartgerink, c., van assen, m., epskamp, s., & wicherts, j. (2016). the prevalence of statistical reporting errors in psychology (1985–2013). behavior research methods, 48(4), 1205–1226. https://doi.org/10.3758/s13428-015-0664-2
r core team. (2021). r: a language and environment for statistical computing. r foundation for statistical computing. vienna, austria. https://www.r-project.org/
re, a. c. d. (2013). compute.es: compute effect sizes. https://cran.r-project.org/package=compute.es
reeve, t., fischman, m., christina, r., & cauraugh, j. (1994). using one-dimensional task error measures to assess performance on two-dimensional tasks: comment on ’attentional control, distractors, and motor performance’. human performance, 7(4), 315–319. https://doi.org/10.1207/s15327043hup0704_6
thompson, b. (2007). effect sizes, confidence intervals, and confidence intervals for effect sizes. psychology in the schools, 44(5), 423–432. https://doi.org/10.1002/pits.20234
tsorbatzoudis, h., barkoukis, v., kaissidis-rodafinos, a., & grouios, g. (1998).
a test of the reliability and factorial validity of the greek version of the csai-2. research quarterly for exercise and sport, 69(4), 416–419. https://doi.org/10.1080/02701367.1998.10607717
tzetzis, g., & lola, a. (2010). the role of implicit, explicit instruction and their combination in learning anticipation skill under normal and stress conditions.
international journal of sport sciences and physical education, 1, 54–59.
tzetzis, g., & lola, a. (2015). the effect of analogy, implicit, and explicit learning on anticipation in volleyball serving. international journal of sport psychology, 46(2), 152–166. https://doi.org/10.7352/ijsp.2015.46.152
wickham, h., averick, m., bryan, j., chang, w., mcgowan, l. d., françois, r., grolemund, g., hayes, a., henry, l., hester, j., kuhn, m., pedersen, t. l., miller, e., bache, s. m., müller, k., ooms, j., robinson, d., seidel, d. p., spinu, v., . . . yutani, h. (2019). welcome to the tidyverse. journal of open source software, 4(43), 1686. https://doi.org/10.21105/joss.01686
wickham, h., & seidel, d. (2020). scales: scale functions for visualization [r package version 1.1.1]. https://cran.r-project.org/package=scales
zhu, h. (2021). kableextra: construct complex table with ’kable’ and pipe syntax [r package version 1.3.4]. https://cran.r-project.org/package=kableextra

meta-psychology, 2023, vol 7, mp.2020.2556 https://doi.org/10.15626/mp.2020.2556 article type: original article published under the cc-by4.0 license open data: not applicable open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: danielsson, h., carlsson, r. reviewed by: parsons, s., young, c., antonoplis, s. analysis reproduced by: batinović, l., tobias, c.
all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/8mzv3

comparing the vibration of effects due to model, data pre-processing, and sampling uncertainty on a large data set in personality psychology

simon klau1,2, felix d. schönbrodt3,4, chirag j. patel5, john p.a. ioannidis6,7,8,9, anne-laure boulesteix1,4, and sabine hoffmann1,4

1institute for medical information processing, biometry, and epidemiology, munich, germany
2leibniz-institute for prevention research and epidemiology − bips, bremen, germany
3department of psychology, ludwig-maximilians-universität münchen, munich, germany
4lmu open science center, ludwig-maximilians-universität münchen, munich, germany
5department of biomedical informatics, harvard medical school, boston, ma, usa
6meta-research innovation center at stanford (metrics), stanford university, stanford, ca, usa
7department of epidemiology and population health, stanford university, stanford, ca, usa
8department of biomedical data science, stanford university, stanford, ca, usa
9department of statistics, stanford university, stanford, ca, usa

researchers have great flexibility in the analysis of observational data. if combined with selective reporting and pressure to publish, this flexibility can have devastating consequences on the validity of research findings. we extend the recently proposed vibration of effects approach to provide a framework comparing three main sources of uncertainty which lead to instability in empirical findings, namely data pre-processing, model, and sampling uncertainty. we analyze the behavior of these sources for varying sample sizes for two associations in personality psychology. through the joint investigation of model and data pre-processing vibration, we can compare the relative impact of these two types of uncertainty and identify the most influential analytical choices.
while all types of vibration show a decrease for increasing sample sizes, data pre-processing and model vibration remain non-negligible, even for a sample of over 80,000 participants. the increasing availability of large data sets that are not initially recorded for research purposes can make data pre-processing and model choices very influential. we therefore recommend the framework as a tool for transparent reporting of the stability of research findings.

keywords: metascience, researcher degrees of freedom, stability, replicability, big five

in recent years, a series of attempts to replicate published research findings on independent data has shown that these replications tend to produce much weaker evidence than the original study (open science collaboration, 2015), leading to what has been referred to as a ‘replication crisis’. while there have been a number of widely publicized examples of fraud and scientific misconduct (ince, 2011; van der zee et al., 2017), many researchers agree that this is not the major problem causing the crisis (gelman and loken, 2014; ioannidis et al., 2014). instead, the problems seem to be more subtle and partly due to the multiplicity of possible analysis strategies (goodman et al., 2016; open science collaboration, 2015). in this vein, there is evidence that the instability of empirical associations can be partly explained by the fact that researchers tend to run several analysis strategies on a given data set, but report only one of them, selected post-hoc (simmons et al., 2011). indeed, there are a great number of implicit and explicit choices that have to be made when analyzing observational data. it is necessary to make various decisions when specifying a probability model to study the association between possible predictor variables and an outcome of interest (leamer, 1983).
in addition to possible choices involved in the specification of a probability model, denoted as ‘model uncertainty’ in the following, there are numerous judgments and decisions that are required prior to fitting the model to the data. when pre-processing the data, there are many possibilities regarding not only the definition of predictor and outcome variables, but also data inclusion and exclusion criteria, and the treatment of outliers (wicherts et al., 2016). we denote this type of uncertainty as ‘data pre-processing uncertainty’. apart from the problems arising through the multiplicity of possible analysis strategies, there seem to be more fundamental issues in the analysis of observational data that originate from the low statistical power which characterizes many psychological studies (maxwell, 2004; szucs and ioannidis, 2017). in psychology, effect sizes tend to be small and sample sizes are typically small to moderate. this combination leads to studies with low statistical power and therefore high sampling uncertainty when the same analysis strategies are applied to different samples with the aim of answering the same research question. high sampling uncertainty decreases the chances of being able to replicate the results of studies that detect a true effect. in recent years, a plethora of solutions to the replication crisis have been proposed in different disciplines. there are several approaches that allow the reporting of results for a large number of possible analysis strategies (muñoz and young, 2018; simonsohn et al., 2015; steegen et al., 2016; young, 2018), including the vibration of effects which was proposed by ioannidis (2008) and further developed by patel, burford, and ioannidis (2015), palpacuer et al. (2019), and klau, hoffmann, patel, ioannidis, and boulesteix (2021).
alternatively, the flexibility in the choice of analysis strategies can be reduced before analyzing the data through preregistration and registered reports (chambers, 2013; wagenmakers et al., 2012). similarly, the instability of empirical findings arising from sampling uncertainty can be assessed through resampling (meinshausen and bühlmann, 2010; sauerbrei et al., 2011) or sampling uncertainty can be reduced by increasing the sample size (button et al., 2013; maxwell, 2004; schönbrodt and perugini, 2013). while the solutions proposed so far address important pieces of the problem by either focusing on the multiplicity of analysis strategies or on sampling uncertainty, it is important to be able to investigate sampling, model, and data pre-processing uncertainty in a common framework to understand the full picture. klau, martin-magniette, boulesteix, and hoffmann (2020) rely on a resampling procedure to compare method and sampling uncertainty, but focus their application on the selection and ranking of molecular biomarkers. in this work, we use the vibration of effects approach (ioannidis, 2008) to assess model, data pre-processing, and sampling uncertainty in order to provide a tool for applied researchers to quantify and compare the instability of research findings arising from all three sources of uncertainty. we study this instability for varying sample sizes for two associations in personality psychology, namely between neuroticism and relationship status, and extraversion and physical activity, by analyzing a large and publicly available data set.

the vibration of effects framework

the vibration of effects framework to quantify the effect of model, data pre-processing, and sampling uncertainty

the vibration of effects framework (ioannidis, 2008; patel et al., 2015) provides researchers with a tool to assess the robustness of their research findings in terms of alternative analysis strategies.
in particular, it allows the quantification of the impact of different choices on the stability of results, helping researchers identify the most influential analysis choices. in this respect, the vibration of effects framework has some conceptual similarities to specification curve analysis (simonsohn et al., 2015), multiverse analysis (steegen et al., 2016) and multi-analyst experiments (aczel et al., 2021). all these approaches try to assess whether results vary according to different specifications. in contrast to the latter approaches, the vibration of effects framework presents the effect estimates and p-values resulting from a large number of analysis strategies simultaneously in a volcano plot. moreover, vibration of effects considers a more comprehensive set of possible strategies: the specification curve and multiverse analysis include a step in which researchers try to define which specifications are reasonable (a task that is often difficult), and multi-analyst experiments typically ask many different analysts to independently make their best choice of analysis strategy. while the vibration of effects approach was initially proposed and applied to assess model uncertainty, it has recently been extended to enable comparison of the relative impact of different analysis choices with measurement and sampling uncertainty (klau et al., 2021). the vibration of effects framework can be used in the context of modeling an association of interest, i.e., when estimating the effect of a predictor of interest on an outcome of interest to obtain effect estimates and corresponding p-values, while controlling for the effect of several covariates. in the application of the framework by patel et al. (2015), the authors consider the association between a predictor of interest and a survival outcome, and assess the vibration by defining a large number of models which result from the inclusion or exclusion of a number of potential covariates.
in this work, we will refer to the type of vibration investigated by patel et al. (2015) as ‘model vibration’, apply the framework to subsamples of the data (klau et al., 2021), and extend it to data pre-processing choices in order to compare model vibration to ‘sampling vibration’ and ‘data pre-processing vibration’. to quantify sampling vibration, we use a resampling-based approach where we draw a large number of random subsets from our data set and fit the same model on each of these subsets. furthermore, we fit a model for a large number of data pre-processing strategies in order to assess data pre-processing vibration. these choices could, for example, include the handling of outliers, eligibility criteria, or the definition of predictor and outcome variables. examples for the implementation of these three types of vibration are provided in section applying the vibration of effects framework to the sapa dataset. figure 1 shows three possible patterns of vibration of effects generated with fictive data. since our application of the vibration of effects will focus on binary outcomes in the context of logistic regression models, we present these figures with odds ratios (or) as effect estimates. in the left panel, a regular pattern is visualized where all effects are positive (or > 1) and significant (p < 0.05). this is recognizable by a vertical line, where or = 1, and a horizontal line, where p = 0.05, illustrating the significance threshold, respectively. furthermore, dotted lines provide information about the 1st, 50th and 99th percentile of results. such a regular pattern demonstrates the robustness of the estimated effect to alternative model specifications, data pre-processing options or to resampling, depending on the type of vibration that is presented.
the second panel demonstrates a pattern which is characterized by significant and non-significant results in both positive and negative directions − here, the median or is close to one. we refer to this pattern as the ‘janus effect’, in allusion to the two-headed ancient roman god (patel et al., 2015). while a janus effect pattern indicates that there is no consistent association between the predictor of interest and the outcome, the occurrence of both positive and negative significant results can lead to researchers selectively reporting a significant finding in the desired direction if they try a number of possible analysis strategies. finally, the right panel contains a more irregular pattern. such a pattern can, for example, result from the inclusion or exclusion of a particular covariate, or by different choices in the definition of a covariate. by highlighting the data points referring to such a definition, the results can be visually connected to these choices. to quantify the variability in these results, patel et al. (2015) propose two summary measures, namely relative hazard ratios and relative p-values (rp). these summary measures are defined as the ratio of the 99th and 1st percentile of hazard ratios and the difference between the 99th and 1st percentile of -log10(p-value), respectively. following patel et al. (2015), we define the relative odds ratio (ror) as the ratio of the 99th percentile and 1st percentile of the or. the ror provides a more robust and intuitive measure of variability than the variance. the minimal possible value of ror is 1, indicating no vibration of effects at all, while larger ror values indicate larger vibration.

comparing the vibration of effects due to different types of uncertainty and identifying the most influential analytical choices

for an association of interest, model, data pre-processing, and sampling uncertainty can be compared through the vibration of effects framework.
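a minimal python sketch of the ror and rp summary measures (the linear-interpolation percentile rule is an assumption of this illustration, chosen to match the common default in numerical software):

```python
import math

def percentile(values, q):
    """q-th percentile (q in [0, 100]) with linear interpolation."""
    s = sorted(values)
    k = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    if lo == hi:
        return s[lo]
    # interpolate between the two neighbouring order statistics
    return s[lo] * (hi - k) + s[hi] * (k - lo)

def relative_odds_ratio(odds_ratios):
    """ror: ratio of the 99th to the 1st percentile of the odds ratios."""
    return percentile(odds_ratios, 99) / percentile(odds_ratios, 1)

def relative_p(p_values):
    """rp: difference between the 99th and 1st percentile of -log10(p)."""
    neg_log_p = [-math.log10(p) for p in p_values]
    return percentile(neg_log_p, 99) - percentile(neg_log_p, 1)
```

an ror of 1 then corresponds to identical odds ratios across all analysis strategies, and larger values indicate stronger vibration.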
in order to assess the variability in effect estimates and p-values for one type of vibration, the other types of vibration have to be fixed to a ‘favorite’ specification. for instance, when focusing on sampling vibration only, decisions on a favorite model as well as a favorite data pre-processing choice must be made. in addition to the investigation of individual types of vibration, the joint impact of model and data pre-processing choices on the variability of results can be quantified. for simplicity, we will refer to the combination of a model and all necessary data pre-processing choices as an analysis strategy. in the joint investigation of model and data pre-processing choices, the calculation of ror is straightforward and can give an estimate for the total amount of vibration caused by the analysis strategy. additionally, the relative impact of data pre-processing and model choices on the vibration that is caused by the choice of the analysis strategy can be quantified, and it is possible to identify the model and data pre-processing choices that explain the largest variation in results. to do so, we can use a linear model in which the effect estimate of interest (in our case the log(or)) is described as an outcome variable depending on two categorical covariates indicating the data pre-processing and model choices. by performing a variance decomposition through an analysis of variance (anova), we can determine the data pre-processing choices and model choices that contribute most to the total amount of vibration caused by the analysis strategy. in the following section, we will give detailed examples of the application of the vibration of effects framework regarding model, sampling and data pre-processing choices.

figure 1: vibration of effects with fictive data.
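the anova-based variance decomposition described above can be sketched as follows; for a balanced design, the main-effect sum of squares of each factor is computable directly from group means (the record layout and function names are illustrative):

```python
from collections import defaultdict

def anova_decomposition(records):
    """records: (data_preprocessing_choice, model_choice, log_or) triples.
    returns the sum of squares attributable to each factor's group means
    and the total sum of squares of the log odds ratios."""
    n = len(records)
    grand_mean = sum(y for _, _, y in records) / n

    def main_effect_ss(index):
        # group the log odds ratios by the level of one categorical factor
        groups = defaultdict(list)
        for record in records:
            groups[record[index]].append(record[2])
        return sum(len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
                   for ys in groups.values())

    total_ss = sum((y - grand_mean) ** 2 for _, _, y in records)
    return {"pre-processing": main_effect_ss(0),
            "model": main_effect_ss(1),
            "total": total_ss}
```

the factor with the larger share of the total sum of squares is then the more influential analytical choice.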
applying the vibration of effects framework to the sapa dataset

the data and research questions of interest

for the application of the vibration of effects, we use a large data set from the sapa project personality test (condon et al., 2017) which is publicly available at the dataverse repository (https://dataverse.harvard.edu/dataverse/sapa-project). the sample consists of 126884 participants who were invited to complete an online survey between 2013 and 2017 in order to evaluate the structure of personality traits. the data set comprises information about a large pool of 696 personality items which were completed by the participants on a 6-point scale ranging from 1 (very inaccurate) to 6 (very accurate) and a set of additional variables including gender, age, country, job status, educational attainment level, physical activity, smoking status, relationship status and body mass index (bmi) of participants. in this work, we use these data to assess the extent to which associations between the big five (agreeableness, conscientiousness, extraversion, neuroticism and openness to experience) and the five outcome variables physical activity, educational achievement, relationship status, smoking habits and obesity are influenced by data pre-processing, model, and sampling uncertainty. in order to investigate the behavior of the three types of vibration with increasing sample size, we consider different subsets of the original data with subset sizes n ∈ {500, 5000, 15000, 50000, 84045}, where 84045 is the size of the complete data set after excluding participants with missing observations.
lower sample sizes than the original sample size were obtained by generating random subsamples from the original data set, without replacement. in the application of our framework, we consider six associations of interest, comprising five for which we found empirical evidence in the psychological literature. in the presentation of our results, we focus on the association between neuroticism and relationship status and between extraversion and physical activity (rhodes and smith, 2006). there is a large body of evidence on the association between neuroticism and relationship satisfaction (dyrenforth et al., 2010; malouff et al., 2010; o’meara and south, 2019), which might for instance be explained by cognitive biases in the interpretation of ambiguous situations (finn et al., 2013). concerning the association between extraversion and physical activity, eysenck, nias, and cox (1982) suggested that individuals with high levels of extraversion would be more likely to start sports and to excel in them because the bodily activity would satisfy their sensation seeking behavior. according to wilson and dishman (2015), an association between extraversion and physical activity may also result from the fact that extraverts are more social and outgoing, making them more exposed to situations that offer the possibility to be physically active. additional results on the association between agreeableness and smoking (malouff et al., 2006), neuroticism and obesity (gerlach et al., 2015), and conscientiousness and education (sorić et al., 2017) can be found in the supplementary material, together with results on openness and physical activity, for which no evidence of an association could be found (rhodes and smith, 2006). 
quantifying and comparing the effect of model, sampling and data pre-processing uncertainty

we describe each association of interest through a logistic regression model in which we estimate the effect of the predictor of interest (e.g., neuroticism or extraversion) on the binary outcome of interest (e.g., relationship status or physical activity) to obtain odds ratios (or) and corresponding p-values, while controlling for the effect of several covariates. as potential control variables, we consider all variables introduced in section the data and research questions of interest that are not part of the association of interest. for instance, the association between neuroticism and relationship status comprises the control variables age, gender, continent, job status, bmi, smoking, education, physical activity, conscientiousness, agreeableness, extraversion and openness. for the association between physical activity and extraversion, we replace these two variables in the list of potential control variables with neuroticism and relationship status. this results in a total number of 12 control variables for each association of interest. we quantify the instability of these associations through the vibration of effects framework introduced in section the vibration of effects framework to quantify the effect of model, data pre-processing, and sampling uncertainty.

model vibration

in order to assess model vibration, we consider all possible combinations of control variables as described in the introduction of the framework. following patel et al. (2015), we will consider age and gender as baseline variables which are included in every model, resulting in a total number of 2^10 = 1024 possible models for a given association of interest.
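this enumeration can be reproduced directly; a sketch for the neuroticism–relationship status association, with age and gender fixed as baseline covariates and the ten remaining control variables optional (the variable names follow the list above and are illustrative identifiers):

```python
from itertools import combinations

BASELINE = ("age", "gender")  # included in every model
OPTIONAL = ("continent", "job_status", "bmi", "smoking", "education",
            "physical_activity", "conscientiousness", "agreeableness",
            "extraversion", "openness")

def all_models():
    """every possible model: the baseline plus any subset of the
    ten optional control variables."""
    return [BASELINE + subset
            for k in range(len(OPTIONAL) + 1)
            for subset in combinations(OPTIONAL, k)]
```

fitting the logistic regression once per covariate set then yields the 1024 odds ratios and p-values that enter the volcano plot for model vibration.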
sampling vibration

to quantify sampling vibration, we follow the strategy of drawing a large number of random subsets from our data set and fitting the same logistic regression model on each of the subsets, as outlined in the introduction of the framework. in particular, we draw 1000 subsets of size 0.5n, with n as the number of observations from the data sets defined in section the data and research questions of interest, which comprise different numbers of observations themselves. although each subset is drawn without replacement, the observations of subsets overlap between repetitions.

data pre-processing vibration

the data pre-processing choices we are considering comprise the handling of outliers, eligibility criteria, and the definition of predictor and outcome variables. these data pre-processing choices are based on studies found in the literature. for a given association of interest, we fit a logistic regression model for each data pre-processing strategy.

eligibility criteria. the eligibility criteria are based on the variables age, gender and the country of participants. for age, either the full group of participants is included in the analyses (age eligibility criterion definition 1) or a subgroup is defined by excluding participants who are younger than 18 (age eligibility criterion definition 2), which can be justified by their inability to legally provide consent (barchard and williams, 2008). furthermore, studies about associations involving the big five personality traits are often carried out on subgroups of countries, for instance as shown by malouff et al. (2006) and malouff et al. (2010) for the variables smoking and physical activity. therefore, we distinguish two alternative study populations based on the participants’ country.
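the subset-drawing scheme described under sampling vibration can be sketched as follows (the seed handling and function name are illustrative assumptions):

```python
import random

def draw_subsets(observation_ids, n_subsets=1000, fraction=0.5, seed=2021):
    """draw n_subsets random subsets of size fraction * n, each without
    replacement; observations may recur across different subsets."""
    rng = random.Random(seed)
    size = int(len(observation_ids) * fraction)
    return [rng.sample(observation_ids, size) for _ in range(n_subsets)]
```

the same logistic regression model is then fitted once per subset, and the resulting odds ratios feed into the ror for sampling vibration.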
either all participants are included in the analyses and continent is considered as a categorical control variable (country eligibility criterion definition 1), or we include only participants from the united states, which presents the single largest country in the data set. in this case (country eligibility criterion definition 2), we exclude the control variable specifying the continent from the analyses. in total, this results in 2 × 2 = 4 possible combinations for the definition of eligibility criteria.

handling of outliers. a further data pre-processing choice is the handling of outliers. a variety of different outlier definitions can be found in the literature. bakker and wicherts (2014), for instance, provide a large range of z-values (which is the number of standard deviations that a value deviates from the mean) that are used to define outliers. furthermore, it is either possible to remove or winsorize outlier values (osborne and overbay, 2004). here, we focus on three different choices concerning all continuous covariates, comprising the five personality dimensions, as well as age and bmi: firstly, we perform no further pre-processing with these covariates (outlier definition 1). as a second option, we delete observations with absolute z-values greater than 2.5 (outlier definition 2). finally, we perform winsorization to achieve absolute z-values less than or equal to 2.5 (outlier definition 3). thereby we replace values with z > 2.5 by 2.5, and values with z < −2.5 by −2.5.

dichotomization of outcome variables and covariates.
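before turning to the dichotomizations, the three outlier definitions just listed can be sketched as follows (mapping the clipped z-values back to the original measurement scale is an assumption of this illustration):

```python
import statistics

def handle_outliers(values, definition, cutoff=2.5):
    """definition 1: no pre-processing; definition 2: delete observations
    with |z| > cutoff; definition 3: winsorize z-values to [-cutoff, cutoff]."""
    if definition == 1:
        return list(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    z = [(v - mean) / sd for v in values]
    if definition == 2:
        return [v for v, zi in zip(values, z) if abs(zi) <= cutoff]
    # definition 3: clip z-scores and map them back to the original scale
    return [mean + min(max(zi, -cutoff), cutoff) * sd for zi in z]
```

definition 2 shortens the sample, whereas definition 3 keeps every observation but caps extreme values.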
in the definition of the outcome variables and covariates, we only consider the influence of different pre-processing choices for the three variables smoking (which is the outcome variable in the association between agreeableness and smoking, see results in the supplementary material, and a covariate in all other associations), physical activity (which is the outcome variable in the association between extraversion and physical activity and between openness and physical activity, see results in the supplementary material, and a covariate in all other associations) and education (which is the outcome variable in the association between conscientiousness and education, see results in the supplementary material, and a covariate in all other associations). all three variables are recorded with a certain number of categories (nine categories for smoking, six categories for physical activity and seven categories for education) and have to be dichotomized in order to be able to model them as a binary outcome in a logistic regression model. for all three variables, a literature search revealed a lack of common definitions. for smoking and physical activity, for instance, summaries of these definitions are provided by malouff et al. (2006) and rhodes and smith (2006), respectively. similarly, the term education is very ambiguous, and even the more specific phrase of academic achievement exhibits a large variety of definitions (fan and chen, 2001). therefore, we aim at reasonable dichotomizations of our given categories. for smoking, we either consider a definition based on never smokers vs. all other categories of smoking (smoking definition 1) or based on non-smokers (never smokers and study participants who did not smoke the previous year) versus all other study participants (smoking definition 2).
for physical activity, we either assume a definition based on the two categories ‘less than once per week’ versus ‘once per week or more’ (physical activity definition 1) or, alternatively, ‘less than once per month’ versus ‘less than once per week or more’ (physical activity definition 2). finally, in the definition of education we distinguish between study participants with a high level of education and study participants with a low level of education. in this distinction, we either assign current university students to the group with a high level of education (education definition 1), because they will soon obtain a university degree, or to the group with a low level of education (education definition 2), as they have not obtained a degree yet. all other variables (job status, relationship status, bmi) are included in the analyses without considering alternative pre-processing choices. therefore, we should acknowledge that the vibration of effects due to pre-processing choices can be larger than what is illustrated here. for more details on the variables which were collected in the sapa project, we refer to condon et al. (2017).

personality scores. the definitions of the five personality dimensions, i.e., openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism, are based on the corresponding personality items. there are a large number of different strategies to combine several items to a scale value. indeed, the sapa data set contains almost 700 items that were designed to assess personality, but each participant only completed a subset of these items. in order to determine a score on each of the personality dimensions, a correlation matrix, which is based on pairwise complete cases, can be analyzed through factor analysis.
as the big five personality traits were initially constructed as orthogonal factors (saucier, 2002), we consider orthogonal rotation techniques as a first option (factor rotation definition 1) for the factor analysis. however, saucier (2002) argues that the scales used to measure the big five are not orthogonal in practice. in fact, a more common option in factor analysis of the personality traits is the use of oblique rotation techniques (factor rotation definition 2). the assignment of items to the five personality dimensions can be realized by determining a minimal factor loading that has to be achieved to assign an item to a factor, but there is no consensus in the literature on an optimal cut-off value for such a minimal factor loading. here, we either choose a minimal factor loading of 0.3 (factor loading definition 1) or of 0.4 (factor loading definition 2). the score of a participant can then be calculated by taking the mean score of all items that were assigned to a given factor. this strategy might lead to missing values for some participants on the personality dimensions, as it is only reasonable to calculate such a score if there is a minimum number of completed items. here, we use a required minimum value of 5 completed items. while there are numerous analysis strategies to determine the personality score of a participant, it is not within the scope of this study to consider all possible analysis strategies. therefore, we limit the number of possible data pre-processing strategies by only considering the two choices, orthogonal vs. oblique rotation, and mean scores on items assigned to a factor with loadings greater than 0.3 or 0.4. while these variable definitions are based on the raw data set with all observations, the other data pre-processing choices are subsequently implemented on the data sets of different sizes.
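this scoring rule can be sketched as follows (the item names and dictionary layout are illustrative; in practice the loadings would come from the factor analysis described above):

```python
def personality_score(responses, loadings, min_loading=0.3, min_items=5):
    """mean response over the items assigned to a factor, i.e. items whose
    absolute loading reaches the cut-off; returns None when fewer than
    min_items of the assigned items were completed by the participant."""
    assigned = [item for item, loading in loadings.items()
                if abs(loading) >= min_loading]
    completed = [responses[item] for item in assigned if item in responses]
    if len(completed) < min_items:
        return None  # too few completed items for a reliable score
    return sum(completed) / len(completed)
```

raising the loading cut-off from 0.3 to 0.4 shrinks the set of assigned items, which can tip some participants below the five-item minimum and turn their score into a missing value.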
the combination of the definition of personality scores with all other data pre-processing choices results in 384 different data pre-processing strategies in total. these represent only a subset of a larger number of choices that may be made, in theory. however, in practical terms, they represent the main choices that are likely to be considered.

table 1: data pre-processing choices

eligibility criteria
  age: all participants (definition 1, favorite) | only ≥ 18 (definition 2)
  country: all participants (definition 1, favorite) | only from us (definition 2)

handling of outliers
  no pre-processing (definition 1, favorite) | exclusion if |z| > 2.5 (definition 2) | winsorization if |z| > 2.5 (definition 3)

dichotomization of outcome and covariates
  smoking (original categories: never smokers; not in the last year; less than once a month; less than once a week; 1 to 3 days a week; most days; everyday (5 or less times); up to 20 times a day; more than 20 times a day):
    definition 1 (favorite): ‘never smokers’ vs. all other participants
    definition 2: non-smokers (‘never smokers’ and ‘not in the last year’) vs. all other participants
  physical activity (original categories: very rarely or never; less than once a month; less than once a week; 1 or 2 times a week; 3 or 5 times a week; more than 5 times a week):
    definition 1 (favorite): less than once a week vs. once a week or more
    definition 2: less than once a month vs. ‘less than once a week’ or more
  education (original categories: less than 12 years; high school graduate; currently in college/university; some college/university, but did not graduate; college/university degree; currently in graduate or professional school; graduate or professional school degree):
    definition 1 (favorite): high (incl. ‘currently in college/university’) vs. low
    definition 2: high vs. low (incl. ‘currently in college/university’)

personality scores
  rotation technique: oblique (definition 1, favorite) | orthogonal (definition 2)
  minimal factor loading: 0.3 (definition 1, favorite) | 0.4 (definition 2)

comparing the vibration of effects due to different types of uncertainty

for each association of interest, we quantify and compare model, data pre-processing, and sampling uncertainty through the vibration of effects framework for varying sample sizes.
our favorite data pre-processing choice is pre-processing without any subgroup analysis, without special handling of outliers, and with variable definition 1 for education, smoking and physical activity. additionally, the favorite definition of the personality traits is performed with the oblique rotation technique and factor loadings greater than 0.3. our favorite model choice simply consists of the model that contains all potential control variables. furthermore, if the aim is to assess data pre-processing vibration or model vibration, we define the full data set as our favorite sample. in addition to the investigation of individual types of vibration, we will compare the joint impact of model and data pre-processing choices on the variability of results with sampling vibration. here, the combination of data pre-processing and model choices results in 1024 × 384 = 393216 analysis strategies. however, not every possible combination yields useful and valid results. for instance, when we consider the data pre-processing choice where the association of interest is only explored for participants from the us, the model including continent as a control variable is not valid. thus, the total amount of feasible analysis strategies falls to 294912. moreover, we quantify the relative impact of data pre-processing and model choices on the vibration that is caused by the choice of the analysis strategy as previously described.

results

the variability in effect estimates for one type of vibration

for more stable results, we repeat the analyses of all types of vibration for sample sizes of 500, 5000 and 15000 ten times and average the results across the obtained rors. for the visualization of vibration patterns, however, we choose one representative plot out of the total number of ten. for a sample size of 50000, we consider the variability between rors as negligible and run the analyses on only one sampled data set.
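the strategy counts given in the methods (1024 × 384 = 393216 combinations, of which 294912 remain feasible) can be checked by direct enumeration; a sketch under the choice sets described above (the labels are illustrative):

```python
from itertools import combinations, product

OPTIONAL = ("continent", "job_status", "bmi", "smoking", "education",
            "physical_activity", "conscientiousness", "agreeableness",
            "extraversion", "openness")

# 1024 models: any subset of the ten optional covariates
# (age and gender are baseline variables and always included)
models = [frozenset(subset)
          for k in range(len(OPTIONAL) + 1)
          for subset in combinations(OPTIONAL, k)]

# 384 pre-processing strategies: 2 (age) x 2 (country) x 3 (outliers)
# x 2 x 2 x 2 (smoking / physical activity / education) x 2 x 2 (scores)
preprocessing = list(product(("all ages", "only >= 18"),
                             ("all countries", "us only"),
                             (1, 2, 3),
                             (1, 2), (1, 2), (1, 2),
                             ("oblique", "orthogonal"),
                             (0.3, 0.4)))

# a us-only analysis cannot include continent as a control variable
feasible = [(m, p) for m, p in product(models, preprocessing)
            if not (p[1] == "us only" and "continent" in m)]
```

the exclusion removes exactly one quarter of the combinations, since half of the models contain continent and half of the pre-processing strategies restrict the sample to the us.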
for the association between neuroticism and relationship status and the association between extraversion and physical activity, results of measures quantifying the variability in effect estimates for one type of vibration are visualized in figures 2 and 3, respectively. corresponding figures for the other associations are provided in the supplementary material.

figure 2: data pre-processing, model, and sampling vibration for different sample sizes (top panel), and bar plots visualizing the type of results in terms of significance of estimated effects (bottom panel) for the association between neuroticism and relationship status.

in the upper panels, rors are displayed against the sample size n for the three types of vibration (data pre-processing, model, and sampling). for both associations, sampling vibration is higher than model and data pre-processing vibration for low sample sizes (n = 500 and n = 5000). for the lowest sample size of n = 500, the ror quantifying sampling vibration is close to 1.8 (1.81 for the association between relationship status and neuroticism and 1.77 for the association between physical activity and extraversion). for larger sample sizes, sampling vibration decreases and tends to an ror of 1.
therefore, the influence of a specific sample can be expected to be negligible for sufficiently large sample sizes. focusing on the two other types of vibration, data pre-processing vibration is larger for low sample sizes than model vibration, and decreases for increasing sample size, however, without approximating an ror of 1. model vibration, in contrast, is less influenced by the sample size. although we observe a slight decrease for rors quantifying model vibration for increasing sample sizes, it is lower than sampling and data pre-processing vibration for small sample sizes and does not tend to a value of 1 for larger sample sizes. in the lower panels of figures 2 and 3, bar plots provide information about the percentage of significant results for each sample size and each type of vibration for the three categories: ‘negative significant’, ‘non-significant’, and ‘positive significant’. for all three types of vibration, most results are not significant for a sample size of n = 500 while for the larger sample sizes, the results are mostly significant. for the largest sample size, the association between neuroticism and relationship status shows a janus effect with both negative and positive significant results for model vibration. on the other hand, for sampling and data pre-processing vibration, only negative-significant or non-significant effects can be observed. for the association between extraversion and physical activity, all types of vibration yield positive significant effects for sample sizes larger than 5000, which is in accordance with the results from the literature (rhodes and smith, 2006).

figure 3: data pre-processing, model, and sampling vibration for different sample sizes (top panel), and bar plots visualizing the type of results in terms of significance of estimated effects (bottom panel) for the association between extraversion and physical activity.

hence, a janus effect can
1.0 1.2 1.4 1.6 1.8 500 5000 15000 50000 84045 sample size r e la tiv e o d d s r a tio ( st re n g th o f vi b ra tio n ) type of vibration data pre−processing model sampling association between extraversion and physical activity type of results negative significant non−significant positive significant 0.00 0.25 0.50 0.75 1.00 s a m p lin g m o d e l d a ta p re − p ro ce ss in g 0.00 0.25 0.50 0.75 1.00 s a m p lin g m o d e l d a ta p re − p ro ce ss in g 0.00 0.25 0.50 0.75 1.00 s a m p lin g m o d e l d a ta p re − p ro ce ss in g 0.00 0.25 0.50 0.75 1.00 s a m p lin g m o d e l d a ta p re − p ro ce ss in g 0.00 0.25 0.50 0.75 1.00 s a m p lin g m o d e l d a ta p re − p ro ce ss in g not be observed for this association. the volcano plots in figures 4 and 5 allow investigating the behavior of the three types of vibration in more detail by providing the exact patterns of -log10(p-value) and ors for the three sample sizes n = 5000, n = 15000 and n = 50000. for the association between neuroticism and relationship status, we can distinguish a clear janus effect for sampling vibration with both positive and negative results for all three sample sizes. for model vibration, we initially only observe positive results for a sample size of n = 5000, but with increasing sample size, there are also results indicating a negative association between neuroticism and relationship status. in contrast, for data pre-processing vibration, we observe both positive and negative non-significant results for n = 5000 and n = 15000 whereas for a large sample size of n = 50000 the volcano plot clearly indicates a negative association, even though only about one third of the results are significant. 
in summary, for the association between neuroticism and relationship status, the results and conclusions critically depend on the chosen analysis strategy, and there is a high potential for researchers to find contradictory findings on the same data set if they make different analytical choices. the volcano plots in figure 5 for the association between extraversion and physical activity show a more regular pattern. we observe only positive associations for all three sample sizes, with only positive significant results for n = 15000 and n = 50000, indicating that the results for this association are much more robust to the choice of the analysis strategy. the relative impact of model and data pre-processing choices, and the cumulative impact of both on the total amount of vibration caused by model and data pre-processing choices, are visualized in figure 6 for the association between neuroticism and relationship status, and figure 7 for the association between extraversion and physical activity.

figure 4. volcano plots for different types of vibration and different sample sizes (n) for the association between neuroticism and relationship status. the summary measures ror and rp indicate relative odds ratios and relative p-values, respectively. green dots indicate results obtained with favorite model choices (middle row) and favorite data pre-processing choices (bottom row). (panel summaries for n = 5000, 15000, 50000: sampling vibration ror = 1.17, 1.09, 1.05 and rp = 1.51, 1.22, 1.91; model vibration ror = 1.12, 1.14, 1.12 and rp = 3.71, 8.15, 16.11; data pre-processing vibration ror = 1.06, 1.05, 1.04 and rp = 0.39, 0.71, 3.25.)

in these figures, the top panels allow for a comparison of this joint vibration, also referred to as vibration due to the analysis strategy, and sampling vibration. for a low sample size of n = 500, sampling vibration is higher than vibration caused by the analysis strategy for both associations. for a medium sample size of n = 5000, rors corresponding to these two types of vibration are very similar (e.g., 1.18 (vibration due to the analysis strategy) and 1.17 (sampling vibration) for the association between relationship status and neuroticism, and 1.19 (vibration due to the analysis strategy) and 1.16 (sampling vibration) for physical activity and extraversion).

figure 5. volcano plots for different types of vibration and different sample sizes (n) for the association between extraversion and physical activity. the summary measures ror and rp indicate relative odds ratios and relative p-values, respectively. green dots indicate results obtained with favorite model choices (middle row) and favorite data pre-processing choices (bottom row). (panel summaries for n = 5000, 15000, 50000: sampling vibration ror = 1.17, 1.09, 1.05 and rp = 7.77, 13.18, 23.54; model vibration ror = 1.10, 1.10, 1.10 and rp = 13.82, 38.46, 131.58; data pre-processing vibration ror = 1.14, 1.12, 1.09 and rp = 9.94, 31.66, 94.61.)
for larger sample sizes, vibration caused by the analysis strategy is larger than sampling vibration, which, as seen above, tends to an ror of 1 for the largest sample size. vibration caused by the analysis strategy, in contrast, does not show an obvious decrease for sample sizes larger than 5000 and remains in a range between 1.13 and 1.15 (for the association between relationship status and neuroticism) and between 1.16 and 1.17 (for the association between physical activity and extraversion). pie charts in the bottom panels illustrate the relative impact of model and data pre-processing choices on the total vibration caused by the choice of the analysis strategy. due to the high computational burden of the variance decomposition, we randomly select three of the ten data sets for the low sample sizes of 500, 5000 and 15000 to estimate the relative impact of data pre-processing and model choices, and average the results over the three selected data sets. for both associations, the relative impact of data pre-processing choices exceeds the impact of model vibration for a sample size of n = 500. for sample sizes larger than 500, however, the relative model impact is larger than the relative impact due to data pre-processing. this is particularly pronounced for the association between relationship status and neuroticism, where between 79.1% (n = 5000) and 89.5% (n = 50000) of the total vibration can be explained by model choices. for the association between physical activity and extraversion, between 53.0% (n = 5000) and 61.7% (n = 50000) of the total vibration can be explained by model choices; the relative impact of data pre-processing choices lies between 35.9% (n = 50000) and 55.0% (n = 500) for this association.
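the relative impacts above come from a variance decomposition over the grid of analysis choices. the paper's exact formula is not reproduced in this excerpt; as an illustration only, a simple main-effects split of the variation in log odds ratios over a model-by-pre-processing grid could look like this (a sketch, not the authors' implementation):

```python
from statistics import mean

def relative_impact(log_ors):
    """log_ors[i][j] is the log odds ratio obtained under model choice i
    combined with data pre-processing choice j. the explained variation
    is split between the two sources via the main-effect sums of squares
    of this two-way layout."""
    rows, cols = len(log_ors), len(log_ors[0])
    grand = mean(v for row in log_ors for v in row)
    row_means = [mean(row) for row in log_ors]
    col_means = [mean(log_ors[i][j] for i in range(rows)) for j in range(cols)]
    ss_model = cols * sum((m - grand) ** 2 for m in row_means)
    ss_pre = rows * sum((m - grand) ** 2 for m in col_means)
    total = ss_model + ss_pre
    if total == 0:  # no vibration at all
        return 0.0, 0.0
    return ss_model / total, ss_pre / total

# only the model choice moves the estimate -> all vibration is model vibration
shares = relative_impact([[0.0, 0.0], [1.0, 1.0]])  # (1.0, 0.0)
```

in practice the grid would contain one fitted estimate per combination of choices, averaged over the selected data sets as described above.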
a more detailed investigation of data pre-processing vibration as part of the total vibration shows that, among the data pre-processing choices, the variable age has the largest impact on the vibration of effects for the association between physical activity and extraversion (17.5% of the total vibration for the largest sample size). however, the associations in the supplementary material reveal that the relative impact of data pre-processing and model choices, and the variables with the largest impact, strongly depend on the research question of interest. for the association between education and conscientiousness, for example, 97.8% of the vibration caused by the analysis strategy can be explained by data pre-processing choices for the largest sample size, with education itself as the variable with the highest impact (96.1% of the total vibration explained by education).

discussion

summary

researchers have great flexibility in the analysis of observational data. if this flexibility is combined with selective reporting and pressure to publish significant results, it can have devastating consequences for the replicability of research findings. in this work, we extended the vibration of effects approach, proposed by ioannidis (2008), to quantify and compare the impact of model and data pre-processing choices on the stability of empirical associations. through this extension, the vibration of effects framework allows identifying the choices in the analysis strategy that explain the most variation in results, and comparing the impact of different choices with sampling uncertainty. we illustrated three different types of vibration on the sapa data set, considering reasonable data pre-processing choices and modeling strategies based on a logistic regression model, focusing on two associations of interest in personality psychology. we quantified sampling vibration by considering the results obtained from random subsets of the data set in use.
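the random-subset procedure for sampling vibration can be sketched in a few lines. the toy below uses a simple 2x2-table odds ratio rather than the paper's logistic regression, and a max/min spread rather than the 99th/1st-percentile ror; data, variable names, and parameters are illustrative only:

```python
import random

def odds_ratio(rows):
    """odds ratio for a binary exposure x and binary outcome y, computed
    from a list of (x, y) pairs, with a 0.5 continuity correction so no
    cell of the 2x2 table is empty."""
    a = sum(1 for x, y in rows if x and y) + 0.5
    b = sum(1 for x, y in rows if x and not y) + 0.5
    c = sum(1 for x, y in rows if not x and y) + 0.5
    d = sum(1 for x, y in rows if not x and not y) + 0.5
    return (a * d) / (b * c)

def sampling_spread(data, n, draws=100, seed=1):
    """repeat the same analysis on `draws` random subsets of size n and
    return the max/min spread of the odds ratios (the paper summarizes
    the spread with a percentile ratio instead)."""
    rng = random.Random(seed)
    ors = [odds_ratio(rng.sample(data, n)) for _ in range(draws)]
    return max(ors) / min(ors)

# toy data with a weak positive exposure-outcome association
rng = random.Random(0)
data = []
for _ in range(20000):
    exposed = rng.random() < 0.5
    outcome = rng.random() < (0.55 if exposed else 0.45)
    data.append((exposed, outcome))

spread_small = sampling_spread(data, n=500)
spread_large = sampling_spread(data, n=10000)
```

with these seeds, the spread for the small subsets exceeds the spread for the large ones, mirroring the finding that sampling vibration shrinks as n grows.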
we found that sampling vibration decreased with increasing sample size and became negligible, while model and data pre-processing vibration showed an initial decrease with increasing sample size and then remained at a constant, non-negligible level. considering all possible combinations of model and data pre-processing choices allowed us to identify the decisions which had the most influence on the variability in results. in addition to the two associations presented in the main text, we show the results of four other associations in the supplement. these results demonstrate that our findings are not specific to the two examples discussed in our paper, but relevant to a broad variety of associations, including one for which no evidence of an association could be found in the literature.

limitations

when interpreting our results, it is important to keep in mind that both model vibration and data pre-processing vibration are in reality rather elusive concepts, as they critically depend on the number and the type of analysis strategies under consideration. in theory, there are an infinite number of models and an infinite number of possible data pre-processing strategies, so any attempt to quantify the variability in an effect estimate resulting from every possible analysis strategy is doomed to fail. as it is futile to quantify the vibration in results arising from every possible strategy, we decided to focus on analysis strategies that seemed reasonable to us, i.e., those that could have been selected in an actual research project.

figure 6. cumulative model and data pre-processing vibration (‘analysis strategy’) compared to sampling vibration (top panel), and relative impact of model and data pre-processing vibration for different sample sizes (bottom panel) for the association between neuroticism and relationship status.
while there is a firm theoretical basis to predict sampling vibration, the behavior of model and data pre-processing vibration critically depends on the particular data set and the number of possible choices under consideration. as pointed out by del giudice and gangestad (2021), it is not straightforward to identify a set of reasonable analysis strategies, and the inclusion of poorly justified analysis strategies in multiverse-style analyses entails the risk of hiding meaningful effects in a "mass of poorly justified alternatives". the authors also caution against the inclusion of analysis decisions that are not truly arbitrary, because they might, for instance, modify the research question or reduce the reliability or validity with which key variables are measured. note that the set of considered analysis strategies may sometimes also be limited by the available computing capacity, as the computational cost of determining which model and data pre-processing choices lead to the most variation in results depends critically on the total number of possible analysis strategies. following patel et al. (2015), we focused only on a special type of model vibration, namely the vibration of effects that is due to the inclusion or exclusion of all potential control variables. vibration of effects may be larger in situations where very complex models are involved, encompassing a very large number of control variables. conversely, it may have less of an impact in data-poor studies with few variables measured and considered. furthermore, we only considered linear effects and examined neither interaction terms nor mediator variables, which may be essential in some settings. the definition of possible data pre-processing choices is challenging since these choices are sometimes "hidden", i.e., they are typically not discussed in great detail in a publication and some choices are omitted completely.
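the "inclusion or exclusion of all potential control variables" type of model vibration amounts to enumerating all 2^k adjustment sets for k candidate controls. a minimal sketch (the covariate names are hypothetical):

```python
from itertools import combinations

def model_specifications(exposure, controls):
    """enumerate the 2^k adjustment sets obtained by including or
    excluding each potential control variable; each tuple lists the
    covariates of one candidate regression model."""
    specs = []
    for k in range(len(controls) + 1):
        for subset in combinations(controls, k):
            specs.append((exposure,) + subset)
    return specs

# hypothetical covariate names, for illustration only
specs = model_specifications("neuroticism", ["age", "gender", "bmi", "smoking"])
# 2**4 = 16 candidate models, from unadjusted to fully adjusted
```

each specification would be fit once, and the spread of the resulting estimates is the model vibration.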
two recent multi-analyst experiments (huntington-klein et al., 2021; schweinsberg et al., 2021), in which multiple teams of researchers were asked to answer the same research question on the same data set, found large variations in data pre-processing options among the different teams, including choices concerning the operationalization of key theoretical variables and inclusion and exclusion criteria.

figure 7. cumulative model and data pre-processing vibration (‘analysis strategy’) compared to sampling vibration (top panel), and relative impact of model and data pre-processing vibration for different sample sizes (bottom panel) for the association between extraversion and physical activity.

in huntington-klein et al. (2021), no two teams of researchers reported the same sample size when analyzing the same research question on the same data set, but "nearly all of the decisions driving data construction would be likely omitted from a paper, or skimmed over by a reader" (huntington-klein et al., 2021). since many data pre-processing choices are not transparently reported in the literature, it is very difficult to determine a set of reasonable data pre-processing steps, and multi-analyst experiments seem like the only naturalistic and convincing option to assess the full analytical variability on a given data set, assuming that the multiple analysts are reliable experts. when assessing the vibration of effects for a certain research question, both the set of considered analysis strategies and the selection of "favorite" model and data pre-processing options are to some degree arbitrary but may substantially impact the results.
as the main focus of our work was to illustrate how the vibration of effects framework can be used to quantify and compare the impact of different sources of uncertainty, and to identify the analytical choices that have the most influence on the results, it was not in the scope of our work to quantify analytical variability itself, for instance by organizing a multi-analyst experiment to identify a set of reasonable data pre-processing choices. while the vibration of effects framework is an important tool to assess the robustness of empirical findings to model, data pre-processing, measurement, and sampling uncertainty, it is not the only way to address these sources of uncertainty. as pointed out by hoffmann et al. (2021), a variety of approaches have been proposed across different disciplines to reduce, report, integrate or accept model, data pre-processing, measurement, sampling, method and parameter uncertainty. efforts to standardize analytical options by building consensus among investigators are underway in some scientific fields, and these efforts may diminish the space for potential vibration of effects. finally, we illustrated the vibration of effects framework with logistic regression models, which are not standard in personality psychology. however, the framework can be adapted with slight modifications to more commonly used methods, including, for instance, gaussian regression and correlation analyses.

conclusion and outlook

when analyzing observational data, it is necessary to make model and data pre-processing choices which rely on many implicit and explicit assumptions. the vibration of effects framework provides investigators with a tool to quantify the impact of these choices on the stability of results, helping them focus their attention on the choices that have the greatest influence and are therefore worth further investigation or discussion.
alternatively, other frameworks could be used and extended for this purpose, such as specification curve analysis (simonsohn et al., 2015) or multiverse analysis (steegen et al., 2016). compared to these frameworks, the vibration of effects approach allows presenting a large number of effect estimates and p-values simultaneously. furthermore, it provides an intuitive quantitative measure of the uncertainty in the form of a ratio, and is better suited to reporting sampling uncertainty. to establish our framework as a tool, we recommend visualizing data pre-processing, model and sampling vibration with volcano plots, as we have demonstrated in the supplementary material for the association between neuroticism and relationship status. moreover, the systematic reporting of rors and p-value characteristics for these types of vibration is a simple but informative guideline for quantifying the stability of published results. the framework can also be useful for readers in the interpretation of these results: when used as a tool to report the robustness of empirical associations, it helps readers (including reviewers) to interpret these results in the context of all the possible results that could have been obtained with alternative, equally justified analysis strategies. when the research data of a publication are made publicly available, which is increasingly common to enhance transparency, a reader can use the vibration of effects framework to assess the extent to which the originally reported results are fragile or incredible because they depend on very specific analytical decisions. in this vein, it is possible to specify a number of model and data pre-processing choices and to apply the framework to assess the variability in effect estimates arising from these possible analysis strategies.
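applying the framework to a public data set amounts to enumerating a grid of pre-processing and model choices, fitting each resulting strategy once, and summarizing the collected estimates, for instance by checking for a janus effect. a sketch with hypothetical choice sets (the option names are illustrative, not from the paper):

```python
from itertools import product

# hypothetical choice sets; the concrete options depend on the data at hand
options = {
    "outliers":         ["keep", "trim_3sd"],
    "age":              ["raw", "log"],
    "adjust_gender":    [False, True],
    "adjust_education": [False, True],
}

def strategies(options):
    """cartesian product of all pre-processing and model choices: every
    combination is one analysis strategy whose OR and p-value would be
    recorded."""
    keys = list(options)
    return [dict(zip(keys, combo)) for combo in product(*options.values())]

def janus_effect(results, alpha=0.05):
    """true if significant effects in both directions occur across
    strategies; results is a list of (odds_ratio, p_value) pairs."""
    pos = any(o > 1 and p < alpha for o, p in results)
    neg = any(o < 1 and p < alpha for o, p in results)
    return pos and neg

grid = strategies(options)  # 2*2*2*2 = 16 strategies
```

fitting each of the 16 strategies and feeding the collected (or, p) pairs into janus_effect reproduces, in miniature, the check performed in the results section.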
in our application of the framework in personality psychology, we observed many cases in which both significant and non-significant results could be obtained, depending on the choice of the analysis strategy. in extreme cases, it was even possible to obtain both positive and negative significant associations, and this phenomenon persisted for a very large sample size of over 80000 participants. the number of decisions which have to be made in the analysis of observational data becomes even more important when analyzing data that were not initially recorded for research purposes. while the increasing availability of large data sets, for instance in the form of twitter accounts (barberá et al., 2015) or transaction data (gladstone et al., 2019), offers unprecedented opportunities to study complex phenomena of interest, it also increases the number of untestable assumptions which must be made in the data pre-processing and in the choice of model used to describe the data. in light of our results, we suggest using the vibration of effects framework as a tool to assess the robustness of conclusions from observational data.

author contact

correspondence concerning this article should be addressed to simon klau, https://orcid.org/0000-0002-7857-1263.

acknowledgements

we thank alethea charlton and anna jacob for valuable language corrections and david condon for providing the data.

conflict of interest and funding

the authors declare that there are no conflicts of interest with respect to the authorship or the publication of this article. this work was funded by the deutsche forschungsgemeinschaft (individual grants bo3139/4-3 and bo3139/7-1).

author contributions

s. klau, s. hoffmann and a.-l. boulesteix developed the study concept. s. hoffmann and s. klau conducted the study and wrote the manuscript. s. klau performed the statistical analysis. c. j. patel, j. p. a. ioannidis, f. d. schönbrodt and a.-l. boulesteix substantially contributed to the manuscript.
all authors approved the final version.

open science practices

this article earned the open materials badge for making the materials openly available. it has been verified that the analysis reproduced the results presented in the article, with minor issues due to the complexity of the analyses and computational time requirements. the entire editorial process, including the open reviews, is published in the online supplement.

references

aczel, b., szaszi, b., nilsonne, g., van den akker, o. r., albers, c. j., van assen, m. a. l. m., bastiaansen, j. a., benjamin, d. j., boehm, u., botvinik-nezer, r., & wagenmakers, e.-j. (2021). consensus-based guidance for conducting and reporting multi-analyst studies [metaarxiv]. https://doi.org/10.31222/osf.io/5ecnh
bakker, m., & wicherts, j. m. (2014). outlier removal, sum scores, and the inflation of the type i error rate in independent samples t tests: the power of alternatives and recommendations. psychological methods, 19(3), 409–427. https://doi.org/10.1037/met0000014
barberá, p., jost, j. t., nagler, j., tucker, j. a., & bonneau, r. (2015). tweeting from left to right: is online political communication more than an echo chamber? psychological science, 26(10), 1531–1542. https://doi.org/10.1177/0956797615594620
barchard, k. a., & williams, j. (2008). practical advice for conducting ethical online experiments and questionnaires for united states psychologists. behavior research methods, 40(4), 1111–1128. https://doi.org/10.3758/brm.40.4.1111
button, k. s., ioannidis, j. p. a., mokrysz, c., nosek, b. a., flint, j., robinson, e. s., & munafò, m. r. (2013). power failure: why small sample size undermines the reliability of neuroscience. nature reviews neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475
chambers, c. d. (2013). registered reports: a new publishing initiative at cortex. cortex, 49(3), 609–610. https://doi.org/10.1016/j.cortex.2012.12.016
condon, d., roney, e., & revelle, w. (2017). a sapa project update: on the structure of phrased self-report personality items. journal of open psychology data, 5(1), 3. https://doi.org/10.5334/jopd.32
del giudice, m., & gangestad, s. w. (2021). a traveler’s guide to the multiverse: promises, pitfalls, and a framework for the evaluation of analytic decisions. advances in methods and practices in psychological science, 4(1), 1–15. https://doi.org/10.1177/2515245920954925
dyrenforth, p. s., kashy, d. a., donnellan, m. b., & lucas, r. e. (2010). predicting relationship and life satisfaction from personality in nationally representative samples from three countries: the relative importance of actor, partner, and similarity effects. journal of personality and social psychology, 99(4), 690–702. https://doi.org/10.1037/a0020385
eysenck, h. j., nias, d. k. b., & cox, d. n. (1982). sport and personality. advances in behaviour research and therapy, 4(1), 1–56. https://doi.org/10.1016/0146-6402(82)90004-2
fan, x., & chen, m. (2001). parental involvement and students’ academic achievement: a meta-analysis. educational psychology review, 13(1), 1–22. https://doi.org/10.1023/a:1009048817385
finn, c., mitte, k., & neyer, f. j. (2013). the relationship-specific interpretation bias mediates the link between neuroticism and satisfaction in couples. european journal of personality, 27(2), 200–212. https://doi.org/10.1002/per.1862
gelman, a., & loken, e. (2014). the statistical crisis in science. american scientist, 102(6), 460–465.
gerlach, g., herpertz, s., & loeber, s. (2015). personality traits and obesity: a systematic review. obesity reviews, 16(1), 32–63. https://doi.org/10.1111/obr.12235
gladstone, j. j., matz, s. c., & lemaire, a. (2019). can psychological traits be inferred from spending? evidence from transaction data. psychological science, 30(7), 1087–1096. https://doi.org/10.1177/0956797619849435
goodman, s. n., fanelli, d., & ioannidis, j. p. a. (2016). what does research reproducibility mean? science translational medicine, 8(341), 341ps12. https://doi.org/10.1126/scitranslmed.aaf5027
hoffmann, s., schönbrodt, f., elsas, r., wilson, r., strasser, u., & boulesteix, a.-l. (2021). the multiplicity of analysis strategies jeopardizes replicability: lessons learned across disciplines. royal society open science, 8(4), 1–13. https://doi.org/10.1098/rsos.201925
huntington-klein, n., arenas, a., beam, e., bertoni, m., bloem, j. r., burli, p., chen, n., grieco, p., ekpe, g., pugatch, t., saavedra, m., & stopnitzky, y. (2021). the influence of hidden researcher decisions in applied microeconomics. economic inquiry, 59(3), 944–960. https://doi.org/10.1111/ecin.12992
ince, d. (2011). the duke university scandal – what can be done? significance, 8(3), 113–115. https://doi.org/10.1111/j.1740-9713.2011.00505.x
ioannidis, j. p. a. (2008). why most discovered true associations are inflated. epidemiology, 19(5), 640–648. https://doi.org/10.1097/ede.0b013e31818131e7
ioannidis, j. p. a., munafo, m. r., fusar-poli, p., nosek, b. a., & david, s. p. (2014). publication and other reporting biases in cognitive sciences: detection, prevalence, and prevention. trends in cognitive sciences, 18(5), 235–241. https://doi.org/10.1016/j.tics.2014.02.010
klau, s., hoffmann, s., patel, c. j., ioannidis, j. p. a., & boulesteix, a.-l. (2021). examining the robustness of observational associations to model, measurement and sampling uncertainty with the vibration of effects framework. international journal of epidemiology, 50(1), 266–278. https://doi.org/10.1093/ije/dyaa164
klau, s., martin-magniette, m.-l., boulesteix, a.-l., & hoffmann, s. (2020). sampling uncertainty versus method uncertainty: a general framework with applications to omics biomarker selection. biometrical journal, 62(3), 670–687. https://doi.org/10.1002/bimj.201800309
leamer, e. e. (1983). let’s take the con out of econometrics. the american economic review, 73(1), 31–43.
malouff, j. m., thorsteinsson, e. b., & schutte, n. s. (2006). the five-factor model of personality and smoking: a meta-analysis. journal of drug education, 36(1), 47–58. https://doi.org/10.2190/9ep8-17p8-ekg7-66ad
malouff, j. m., thorsteinsson, e. b., schutte, n. s., bhullar, n., & rooke, s. e. (2010). the five-factor model of personality and relationship satisfaction of intimate partners: a meta-analysis. journal of research in personality, 44(1), 124–127. https://doi.org/10.1016/j.jrp.2009.09.004
maxwell, s. e. (2004). the persistence of underpowered studies in psychological research: causes, consequences, and remedies. psychological methods, 9(2), 147–163. https://doi.org/10.1037/1082-989x.9.2.147
meinshausen, n., & bühlmann, p. (2010). stability selection. journal of the royal statistical society: series b (statistical methodology), 72(4), 417–473. https://doi.org/10.1111/j.1467-9868.2010.00740.x
muñoz, j., & young, c. (2018). we ran 9 billion regressions: eliminating false positives through computational model robustness. sociological methodology, 48(1), 1–33. https://doi.org/10.1177/0081175018777988
o’meara, m. s., & south, s. c. (2019). big five personality domains and relationship satisfaction: direct effects and correlated change over time. journal of personality, 87(6), 1206–1220. https://doi.org/10.1111/jopy.12468
open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
osborne, j. w., & overbay, a. (2004). the power of outliers (and why researchers should always check for them). practical assessment, research & evaluation, 9(6), 1–8.
palpacuer, c., hammas, k., duprez, r., laviolle, b., ioannidis, j. p. a., & naudet, f. (2019). vibration of effects from diverse inclusion/exclusion criteria and analytical choices: 9216 different ways to perform an indirect comparison meta-analysis. bmc medicine, 17(174), 1–13. https://doi.org/10.1186/s12916-019-1409-3
patel, c. j., burford, b., & ioannidis, j. p. a. (2015). assessment of vibration of effects due to model specification can demonstrate the instability of observational associations. journal of clinical epidemiology, 68(9), 1046–1058. https://doi.org/10.1016/j.jclinepi.2015.05.029
rhodes, r. e., & smith, n. e. i. (2006). personality correlates of physical activity: a review and meta-analysis. british journal of sports medicine, 40(12), 958–965. https://doi.org/10.1136/bjsm.2006.028860
saucier, g. (2002). orthogonal markers for orthogonal factors: the case of the big five. journal of research in personality, 36(1), 1–31. https://doi.org/10.1006/jrpe.2001.2335
sauerbrei, w., boulesteix, a.-l., & binder, h. (2011). stability investigations of multivariable regression models derived from low- and high-dimensional data. journal of biopharmaceutical statistics, 21(6), 1206–1231. https://doi.org/10.1080/10543406.2011.629890
schönbrodt, f. d., & perugini, m. (2013). at what sample size do correlations stabilize? journal of research in personality, 47(5), 609–612. https://doi.org/10.1016/j.jrp.2013.05.009
schweinsberg, m., feldman, m., staub, n., van den akker, o. r., van aert, r. c. m., van assen, m. a. l. m., liu, y., althoff, t., heer, j., kale, a., & uhlmann, e. l. (2021). same data, different conclusions: radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis. organizational behavior and human decision processes, 165, 228–249. https://doi.org/10.1016/j.obhdp.2021.02.003
simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
simonsohn, u., simmons, j., & nelson, l. d. (2015). specification curve: descriptive and inferential statistics on all reasonable specifications. https://doi.org/10.2139/ssrn.2694998
sorić, i., penezić, z., & burić, i. (2017). the big five personality traits, goal orientations, and academic achievement. learning and individual differences, 54, 126–134. https://doi.org/10.1016/j.lindif.2017.01.024
steegen, s., tuerlinckx, f., gelman, a., & vanpaemel, w. (2016). increasing transparency through a multiverse analysis. perspectives on psychological science, 11(5), 702–712. https://doi.org/10.1177/1745691616658637
szucs, d., & ioannidis, j. p. a. (2017). empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. plos biology, 15(3), 1–18. https://doi.org/10.1371/journal.pbio.2000797
van der zee, t., anaya, j., & brown, n. j. (2017). statistical heartburn: an attempt to digest four pizza publications from the cornell food and brand lab. bmc nutrition, 3(54), 1–15. https://doi.org/10.1186/s40795-017-0167-x
wagenmakers, e.-j., wetzels, r., borsboom, d., van der maas, h. l., & kievit, r. a. (2012). an agenda for purely confirmatory research. perspectives on psychological science, 7(6), 632–638. https://doi.org/10.1177/1745691612463078
wicherts, j. m., veldkamp, c. l. s., augusteijn, h. e. m., bakker, m., van aert, r. c. m., & van assen, m. a. l. m. (2016). degrees of freedom in planning, running, analyzing, and reporting psychological studies: a checklist to avoid p-hacking. frontiers in psychology, 7(1832), 1–12. https://doi.org/10.3389/fpsyg.2016.01832
wilson, k. e., & dishman, r. k. (2015). personality and physical activity: a systematic review and meta-analysis. personality and individual differences, 72, 230–242. https://doi.org/10.1016/j.paid.2014.08.023
young, c. (2018). model uncertainty and the crisis in science. socius, 4, 1–7. https://doi.org/10.1177/2378023117737206
https://doi.org/10.15626/mp.2018.933
article type: original article
published under the cc-by4.0 license
open data: n/a
open materials: yes
open and reproducible analysis: yes
open reviews and editorial process: yes
preregistration: n/a
edited by: s. r. martin
reviewed by: j. d. blume, o. l. olvera astivia
analysis reproduced by: andré kalmendal
all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/zp3kf

equivalence testing and the second generation p-value

daniël lakens, eindhoven university of technology, the netherlands
marie delacre, université libre de bruxelles, belgium

abstract

to move beyond the limitations of null-hypothesis tests, statistical approaches have been developed where the observed data are compared against a range of values that are equivalent to the absence of a meaningful effect. specifying a range of values around zero allows researchers to statistically reject the presence of effects large enough to matter, and prevents practically insignificant effects from being interpreted as a statistically significant difference. we compare the behavior of the recently proposed second generation p-value (blume, d’agostino mcgowan, dupont, & greevy, 2018) with the more established two one-sided tests (tost) equivalence testing procedure (schuirmann, 1987). we show that the two approaches yield almost identical results under optimal conditions. under suboptimal conditions (e.g., when the confidence interval is wider than the equivalence range, or when confidence intervals are asymmetric) the second generation p-value becomes difficult to interpret. the second generation p-value is interpretable in a dichotomous manner (i.e., when the sgpv equals 0 or 1 because the confidence interval lies completely within or outside of the equivalence range), but this dichotomous interpretation does not require calculating the sgpv itself.
we conclude that equivalence tests yield more consistent p-values, distinguish between datasets that yield the same second generation p-value, and allow for easier control of type i and type ii error rates.

keywords: equivalence testing, second generation p-values, hypothesis testing, tost, statistical inference

to test predictions researchers predominantly rely on null-hypothesis tests. this statistical approach can be used to examine whether observed data are sufficiently surprising under the null hypothesis to reject an effect that equals exactly zero. null-hypothesis tests have an important limitation, in that they can only reject the hypothesis that there is no effect, while scientists should also be able to provide statistical support for equivalence. when testing for equivalence researchers aim to examine whether an observed effect is too small to be considered meaningful, and therefore practically equivalent to zero. by specifying a range of values around the null hypothesis that are deemed practically equivalent to the absence of an effect (i.e., 0 ± 0.3) the observed data can be compared against this equivalence range, and researchers can test whether a meaningful effect is absent (hauck & anderson, 1984; kruschke, 2018; rogers, howard, & vessey, 1993; serlin & lapsley, 1985; spiegelhalter, freedman, & parmar, 1994; wellek, 2010; westlake, 1972). second generation p-values (sgpv) were recently proposed as a statistic that represents “the proportion of data-supported hypotheses that are also null hypotheses” (blume et al., 2018). the researcher specifies an equivalence range around a null hypothesis containing the values that are considered practically equivalent to it. the sgpv measures the degree to which the set of data-supported parameter values falls within this interval null hypothesis. if the estimation interval falls completely within the equivalence range, the sgpv is 1.
if the confidence interval falls completely outside of the equivalence range, the sgpv is 0. otherwise the sgpv is a value between 0 and 1 that expresses the overlap of data-supported hypotheses and the equivalence range. when calculating the sgpv the set of data-supported parameter values can be represented by a confidence interval (ci), although one could also choose to use credible intervals or likelihood support intervals (si). when a confidence interval is used, the sgpv and equivalence tests such as the two one-sided tests (tost) procedure (lakens, 2017; meyners, 2012; quertemont, 2011; schuirmann, 1987) appear to have close ties, because both tests compare a confidence interval against an equivalence range. here, we aim to examine the similarities and differences between the tost procedure and the sgpv. we limit our analysis to continuous data sampled from a bivariate normal distribution. the tost procedure also relies on the confidence interval around the effect. in the tost procedure the data are tested against the lower equivalence bound in the first one-sided test, and against the upper equivalence bound in the second one-sided test (lakens, scheel, & isager, 2018). for an excellent discussion of the strengths and weaknesses of different frequentist equivalence tests, including alternatives to the tost procedure, see meyners (2012). if both tests statistically reject an effect as extreme or more extreme than the equivalence bound, one can conclude that the observed effect is practically equivalent to zero from a neyman-pearson approach to statistical inference. because one-sided tests are performed, one can also conclude equivalence by checking whether the 1 − 2α confidence interval (e.g., when the alpha level is 0.05, a 90% ci) falls completely within the equivalence bounds.
because both equivalence tests and the sgpv are based on whether and how much a confidence interval overlaps with the equivalence bounds, it seems worthwhile to compare the behavior of the newly proposed sgpv to equivalence tests to examine the unique contribution of the sgpv to the statistical toolbox.

the relationship between p-values from tost and sgpv when confidence intervals are symmetrical

the second generation p-value (sgpv) is calculated as:

p_delta = |i ∩ h0| / |i| × max{ |i| / (2|h0|), 1 }

where i is the interval based on the data (e.g., a 95% confidence interval) and h0 is the equivalence range. the first term of this formula implies that the second generation p-value is the width of the confidence interval that overlaps with the equivalence range, divided by the total width of the confidence interval. the second term is a “small sample correction” (which will be discussed later) that comes into play whenever the confidence interval is more than twice as wide as the equivalence range. to examine the relation between the tost p-value and the sgpv we can calculate both statistics across a range of observed effect sizes. building on the example by blume et al. (2018), in figure 1 p-values are plotted for the tost procedure and the sgpv. the statistics are calculated for hypothetical one-sample t-tests for observed means ranging from 140 to 150 (on the x-axis). the equivalence range is set to 145 ± 2 (i.e., an equivalence range from 143 to 147), the observed standard deviation is assumed to be 2, and the sample size is 30. for example, for the left-most point in figure 1 the sgpv and the tost p-value are calculated for a hypothetical study with a sample size of 30, an observed standard deviation of 2, and an observed mean of 140, where the p-value for the equivalence test is 1, and the sgpv is 0.
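the formula above can be sketched in a few lines of code. the sketch below (function and variable names are ours, and the t critical value 2.045 for 29 degrees of freedom is a looked-up constant rather than computed) reproduces the left-most point of figure 1:

```python
# sketch of the sgpv formula for the figure 1 scenario:
# one-sample design, n = 30, sd = 2, equivalence range 143 to 147.

def sgpv(ci_low, ci_high, eq_low, eq_high):
    """second generation p-value for interval i = [ci_low, ci_high]
    and interval null hypothesis h0 = [eq_low, eq_high]."""
    i_width = ci_high - ci_low
    h0_width = eq_high - eq_low
    overlap = max(0.0, min(ci_high, eq_high) - max(ci_low, eq_low))
    # the second term is the "small sample correction", which only
    # matters when the ci is more than twice as wide as h0
    return (overlap / i_width) * max(i_width / (2 * h0_width), 1.0)

n, sd = 30, 2
t_crit = 2.045                 # ~97.5th percentile of t(29), looked up
se = sd / n ** 0.5

for m in (140, 145):           # observed means discussed in the text
    ci_low, ci_high = m - t_crit * se, m + t_crit * se
    print(m, round(sgpv(ci_low, ci_high, 143, 147), 2))
```

for the mean of 140 the 95% ci (roughly 139.25 to 140.75) lies entirely outside the equivalence range, so the sgpv is 0; for the mean of 145 it lies entirely inside, so the sgpv is 1.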
our conclusions about the relationship between tost p-values and sgpv hold for second generation p-values calculated from confidence intervals, and assume data are sampled from a bivariate normal distribution. readers can explore the relationship between tost p-values and sgpv for themselves in an online shiny app: http://shiny.ieis.tue.nl/tost_vs_sgpv/. the sgpv treats the equivalence range as the null-hypothesis, while the tost procedure treats the values outside of the equivalence range as the null-hypothesis. for ease of comparison we can plot 1-sgpv (see figure 2) to make the values more easily comparable. we see that the p-value from the tost procedure and the sgpv follow each other closely. when we discuss the relationship between the p-values from tost and the sgpv, we focus on their correspondence at three values, namely where the tost p = 0.025 and the sgpv is 1, where the tost p = 0.5 and the sgpv = 0.5, and where the tost p = 0.975 and the sgpv = 0. these three values are important for the sgpv because they indicate the values at which the sgpv indicates the data should be interpreted as compatible with the null hypothesis (sgpv = 1), or with the alternative hypothesis (sgpv = 0), or when the data are strictly inconclusive (sgpv = 0.5). these three points of overlap are indicated by the horizontal dotted lines in figure 2 at tost p-values of 0.975, 0.5, and 0.025.

figure 1. comparison of p-values from tost (black line) and sgpv (grey line) across a range of observed sample means (x-axis) tested against a mean of 145 in a one-sample t-test with a sample size of 30 and a standard deviation of 2, illustrating that when the tost p-value = 0.5, the sgpv = 0.5, when the tost p-value is 0.975, 1-sgpv = 1, and when the tost p-value = 0.025, 1-sgpv = 0.
when the observed sample mean is 145, the sample size is 30, and the standard deviation is 2, and we are testing against equivalence bounds of 143 and 147 using the tost procedure for a one-sample t-test, the equivalence test is significant, t(29) = 5.48, p < .001. because the 95% ci falls completely within the equivalence bounds, the sgpv is 1 (see figure 1). on the other hand, when the observed mean is 140, the equivalence test is not significant (the observed mean is far outside the equivalence range of 143 to 147), t(29) = -8.22, p = 1 (or more accurately, p > .999, as p-values are bounded between 0 and 1). because the 95% ci falls completely outside the equivalence bounds, the sgpv is 0 (see figure 1).

sgpv as a uniform measure of overlap

it is clear the sgpv and the p-value from tost are closely related. when confidence intervals are symmetric we can think of the sgpv as a straight line that is directly related to the p-value from an equivalence test for three values. when the tost p-value is 0.5, the sgpv is also 0.5 (note that the reverse is not true). the sgpv is 50% when the observed mean falls exactly on the lower or upper equivalence bound, because 50% of the symmetrical confidence interval overlaps with the equivalence range. when the observed mean equals the equivalence bound, the difference between the mean in the data and the equivalence bound is 0, the t-value for the equivalence test is also 0, and thus the p-value is 0.5 (situation a, figure 3). two other points always have to overlap.

figure 2. comparison of p-values from tost (black line) and 1-sgpv (grey line) across a range of observed sample means (x-axis) tested against a mean of 145 in a one-sample t-test with a sample size of 30 and a standard deviation of 2.

figure 3. means, normal distribution, and 95% ci for three example datasets that illustrate the relationship between p-values from tost and sgpv.
when the 95% ci falls completely inside the equivalence region, and one endpoint of the confidence interval is exactly equal to one of the equivalence bounds (see situation b in figure 3), the tost p-value (which relies on a one-sided test) is always 0.025, and the sgpv is 1. note that when sample sizes are small or equivalence bounds are narrow, small p-values for the tost or an sgpv of 1 might not be observed in practice if too few observations are collected. the third point where the sgpv and the p-value from the tost procedure should overlap is where the 95% ci falls completely outside of the equivalence range, but one endpoint of the confidence interval is equal to the equivalence bound (see situation c in figure 3), when the p-value will always be 0.975, and the sgpv is 0. note that this situation is in essence a minimum-effect test (murphy, myors, & wolach, 2014). the goal of a minimum-effect test is not just to reject a difference of zero, but to reject the smallest effect size of interest (i.e., the equivalence bounds). an equivalence test and a minimum-effect test against the same equivalence bound are complementary, and when a tost p-value is larger than 0.975, the p-value for the minimum-effect test is smaller than 0.05 (and therefore the minimum-effect test provides no additional information that cannot be derived from the p-value from the equivalence test). the sgpv summarizes the information from an equivalence test (and the complementary minimum-effect test). these can be two relevant questions to ask, although it often makes sense to combine an equivalence test and a null-hypothesis test instead (lakens et al., 2018). for example, in figure 4 we have plotted four sgpvs. from a to d the sgpv is 0.76, 0.81, 0.86, and 0.91.

figure 4. means, normal distribution, and 95% ci for samples where the observed mean is 1.5, 1.4, 1.3, and 1.2.
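the figure 4 values can be reproduced with a short stdlib-only sketch. note that the setup used here (normal 95% intervals with a standard error of 0.5 and equivalence bounds of ±2) is our inference from the reported numbers, not stated explicitly in the text:

```python
# reproduce the four sgpvs from figure 4 and the tail probabilities
# discussed next; the standard error (0.5) and equivalence bounds (±2)
# are inferred from the reported values rather than given in the text.
from statistics import NormalDist

SE = 0.5                         # assumed standard error of the mean
Z = NormalDist().inv_cdf(0.975)  # ~1.96, for a 95% ci
EQ_LOW, EQ_HIGH = -2.0, 2.0      # assumed equivalence range

for m in (1.5, 1.4, 1.3, 1.2):   # situations a to d
    ci_low, ci_high = m - Z * SE, m + Z * SE
    overlap = max(0.0, min(ci_high, EQ_HIGH) - max(ci_low, EQ_LOW))
    sgpv = overlap / (ci_high - ci_low)  # correction term is 1 here
    # probability of data more extreme than the upper bound of 2
    p_tail = 1 - NormalDist(mu=m, sigma=SE).cdf(2.0)
    print(m, round(sgpv, 2), round(p_tail, 2))
```

under these assumptions the sgpvs step uniformly (0.76, 0.81, 0.86, 0.91), while the tail probabilities (0.16, 0.12, 0.08, 0.05) shrink non-uniformly, matching the discussion that follows.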
the difference in the percentage of overlap between a and b (0.05) is identical to the difference in the percentage of overlap between c and d (0.05) as the mean gets 0.1 closer to the test value. as the observed mean in a one-sample t-test lies closer to the test value, from situation a to d, the overlap changes uniformly. however, as we move the observed mean closer to the test value in steps of 0.1 across a to d, the p-values calculated for normally distributed data do not change uniformly. the probability of observing data more extreme than the upper bound of 2 is (from a to d) 0.16, 0.12, 0.08, and 0.05. as we can see, the difference between a and b (0.04) is not the same as the difference between c and d (0.03). indeed, the difference in p-values is the largest as you start at p = 0.5 (when the observed mean falls on the test value), which is why the line in figure 1 is the steepest at p = 0.5. note that where the sgpv reaches 1 or 0, p-values closely approximate 0 and 1, but never reach these values.

when different p-values for equivalence tests yield the same sgpv

there are three situations where p-values for tost differentiate between observed results, while the sgpv does not differentiate. the first two situations were discussed before and can be seen in figure 1. when the sgpv is either 0 or 1, p-values from the equivalence test fall between 0.975 and 1 or between 0 and 0.025. where the sgpv is 1 as long as the confidence interval falls completely within the equivalence bounds, the p-value for the tost continues to differentiate between results as a function of how far the confidence interval lies within the equivalence bounds (the further the confidence interval is from both bounds, the lower the p-value). the easiest way to see this is by plotting the sgpv against the p-value from the tost procedure.

figure 5. the relationship between p-values from the tost procedure and the sgpv for the same scenario as in figure 1.
the situations where the p-values from the tost procedure continue to differentiate based on how extreme the results are, but the sgpv is a fixed value, are indicated by the parts of the curve where there are vertical straight lines at second generation p-values of 0 and 1. a third situation in which the sgpv remains stable across a range of observed effects, while the tost p-value continues to differentiate, is whenever the ci is wider than the equivalence range, and the ci overlaps with the upper and lower equivalence bound. when the confidence interval is more than twice as wide as the equivalence range the sgpv is set to 0.5. blume et al. (2018) call this the “small sample correction factor”. however, it is not a correction in the typical sense of the word, since the sgpv is not adjusted to any “correct” value. when the normal calculation would be “misleading” (i.e., the sgpv would be small, which normally would suggest support for the alternative hypothesis, but at the same time all values in the equivalence range are supported), the sgpv is set to 0.5, which according to blume and colleagues signals that the sgpv is “uninformative”. note that the ci can be twice as wide as the equivalence range whenever the sample size is small (and the confidence interval width is large) or when the equivalence range is narrow. it is therefore not so much a “small sample correction” as it is an exception to the typical calculation of the sgpv whenever the ratio of the confidence interval width to the equivalence range exceeds 2:1 and the ci overlaps with the upper and lower bounds.

figure 6. comparison of p-values from tost (black line) and sgpv (grey line) across a range of observed sample means (x-axis). because the sample size is small (n = 10) and the standard deviation is 2, the ci is more than twice as wide as the equivalence range (set to -0.4 to 0.4), and the sgpv is set to 0.5 (horizontal light grey line) across a range of observed means.
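this exception can be seen numerically in the figure 6 scenario. the sketch below (variable names are ours; the t critical value 2.262 for 9 degrees of freedom is looked up) shows the sgpv pinned at exactly 0.5 across several observed means:

```python
# figure 6 scenario: n = 10, sd = 2, equivalence range -0.4 to 0.4.
# whenever the ci covers both bounds and is more than twice as wide as
# the range, the overlap equals the full range and the "small sample
# correction" pins the sgpv at exactly 0.5.

t_crit = 2.262                 # ~97.5th percentile of t(9), looked up
se = 2 / 10 ** 0.5
eq_low, eq_high = -0.4, 0.4
h0_width = eq_high - eq_low

for m in (-0.5, 0.0, 0.5):     # a range of observed means
    ci_low, ci_high = m - t_crit * se, m + t_crit * se
    i_width = ci_high - ci_low                 # ~2.86, ratio ~3.6:1
    overlap = max(0.0, min(ci_high, eq_high) - max(ci_low, eq_low))
    sgpv = (overlap / i_width) * max(i_width / (2 * h0_width), 1.0)
    print(m, round(sgpv, 2))
```

all three means yield an sgpv of 0.5, even though the tost p-values for these means would still differ.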
we can examine this situation by calculating the sgpv and performing the tost for a situation where sample sizes are small and the equivalence range is narrow, such that the ci is more than twice as large as the equivalence range (see figure 6). when the two statistics are plotted against each other we can see where the sgpv is the same while the tost p-value still differentiates different observed means (indicated by straight lines in the curve, see figure 7). we see the sgpv is 0.5 for a range of observed means where the p-value from the equivalence test still varies. it should be noted that in these calculations the p-values for the tost procedure are never smaller than 0.05 (i.e., they do not get below 0.05 on the y-axis). in other words, we cannot conclude equivalence based on any of the observed means. this happens because we are examining a scenario where the 90% ci is so wide that it never falls completely within the two equivalence bounds. as lakens (2017) notes: “in small samples (where cis are wide), a study might have no statistical power (i.e., the ci will always be so wide that it is necessarily wider than the equivalence bounds).” none of the p-values based on the tost procedure are below 0.05, and thus, in the long run we have 0% power.

figure 7. the relationship between p-values from the tost procedure and the sgpv for the same scenario as in figure 6.

the p-value from the tost procedure still differentiates observed means, while the sgpv does not, when the ci is wider than the equivalence range (so the precision is low) and overlaps with the upper and lower equivalence bound, but the ci is not twice as wide as the equivalence range. in the example below, we see that the ci is only 1.79 times as wide as the equivalence bounds, but the ci overlaps with the lower and upper equivalence bounds (figure 8).
this means the sgpv is not set to 0.5, but it is constant across a range of observed means, while the tost p-value is not constant across this range. if the observed mean were somewhat closer to 0, or further away from 0, the sgpv would remain constant (the ci width does not change, and it completely overlaps with the equivalence range) while the p-value for the tost procedure does vary. we can see this in figure 9 below. the sgpv is not set to 0.5, but is slightly higher than 0.5 across a range of means. how high the sgpv will be for a ci that is not twice as wide as the equivalence range, but overlaps with the lower and upper equivalence bounds, depends on the width of the ci and the equivalence range. if we once more plot the two statistics against each other we see the sgpv is 0.56 for a range of observed means where the p-value from the equivalence test still varies, as indicated by the straight section of the line (figure 10).

figure 8. example of a 95% ci that overlaps with the lower and upper equivalence bound (indicated by the vertical dotted lines).

figure 9. comparison of p-values from tost (black line) and sgpv (grey line) across a range of observed sample means (x-axis). the sample size is small (n = 10), but because the sd is half as big as in figure 7 (1 instead of 2) the ci is less than twice as wide as the equivalence range (set to -0.4 to 0.4). the sgpv is not set to 0.5 (horizontal light grey line) but reaches a maximum slightly above 0.5 across a range of observed means.

to conclude this section, there are situations where the p-value from the tost procedure continues to differentiate, while the sgpv does not. therefore, interpreted as a continuous statistic, the sgpv is more limited than the p-value from the tost procedure.
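the figure 9/10 scenario can be checked with the same kind of sketch (variable names are ours; the t critical value 2.262 for t(9) is looked up):

```python
# figure 9/10 scenario: n = 10, sd = 1, equivalence range -0.4 to 0.4.
# the ci is about 1.79 times as wide as the range, so the "small
# sample correction" does not apply, and the sgpv plateaus just
# above 0.5 for every mean whose ci covers both bounds.

t_crit = 2.262                 # ~97.5th percentile of t(9), looked up
se = 1 / 10 ** 0.5
eq_low, eq_high = -0.4, 0.4
i_width = 2 * t_crit * se
print(round(i_width / (eq_high - eq_low), 2))   # ratio of widths

for m in (-0.2, 0.0, 0.2):     # means whose ci covers both bounds
    ci_low, ci_high = m - t_crit * se, m + t_crit * se
    overlap = min(ci_high, eq_high) - max(ci_low, eq_low)
    sgpv = overlap / i_width   # max(i_width / (2 * h0), 1) = 1 here
    print(m, round(sgpv, 2))
```

the width ratio comes out at 1.79 and the sgpv at 0.56 for all three means, while the tost p-value would keep changing across this range.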
the relation between equivalence tests and sgpv for asymmetrical confidence intervals around correlations

so far we have only looked at the relation between equivalence tests and the sgpv when confidence intervals are symmetric (e.g., for confidence intervals around mean differences). for correlations, which are bound between -1 and 1, confidence intervals are only symmetric for a correlation of exactly 0. the confidence interval for a correlation becomes increasingly asymmetric as the observed correlation nears -1 or 1. for example, with ten observations, an observed correlation of 0 has a symmetric 95% confidence interval ranging from -0.63 to 0.63, while an observed correlation of 0.7 has an asymmetric 95% confidence interval ranging from 0.13 to 0.92. note that calculating confidence intervals for a correlation involves a fisher z-transformation, which transforms the values such that they are approximately normally distributed, allowing one to compute symmetric confidence intervals on the z scale. these confidence intervals are then back-transformed into correlations, and are asymmetric whenever the correlation is not exactly zero. the effect of asymmetric confidence intervals around correlations is most noticeable at smaller sample sizes. in figure 11 we plot the p-values from equivalence tests and the sgpv (again plotted as 1-sgpv for ease of comparison) for correlations. the sample size is 30 pairs of observations, and the lower and upper equivalence bounds are set to -0.45 and 0.45, with an alpha of 0.05. as the observed correlation in the sample moves from -.99 to 0 the p-value from the equivalence test becomes smaller, as does 1-sgpv. the pattern is quite similar to that in figure 2.

figure 10. the relationship between p-values from the tost procedure and the sgpv for the same scenario as in figure 9.
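the two intervals mentioned above can be reproduced with a stdlib-only sketch of the fisher z construction (the function name is ours):

```python
# back-transformed (asymmetric) 95% ci for a correlation, via the
# fisher z-transformation described in the text.
from math import atanh, tanh, sqrt
from statistics import NormalDist

def r_confidence_interval(r, n, alpha=0.05):
    """ci for a correlation r observed in n pairs of observations."""
    z = atanh(r)                      # fisher z-transformation
    se = 1 / sqrt(n - 3)              # standard error on the z scale
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # symmetric on the z scale, asymmetric after back-transformation
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

print([round(x, 2) for x in r_confidence_interval(0.0, 10)])  # symmetric
print([round(x, 2) for x in r_confidence_interval(0.7, 10)])  # asymmetric
```

with n = 10 this yields [-0.63, 0.63] for r = 0 and [0.13, 0.92] for r = 0.7, matching the values in the text.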
the p-value for the tost procedure and 1-sgpv are still related as discussed above, with tost p-values of 0.975 and 0.025 corresponding to a 1-sgpv of 1 and 0, respectively. there are two important differences, however. first of all, the sgpv is no longer a straight line, but a curve, due to the asymmetry in the 95% ci. second, and most importantly, the p-value for the equivalence test and the sgpv no longer overlap at p = 0.5. the reason that the equivalence test and sgpv no longer overlap is the asymmetry of the confidence intervals. if the observed correlation falls exactly on the equivalence bound the p-value for the equivalence test is 0.5. in the equivalence test for correlations the p-value is computed based on a z-transformation, which better controls error rates (goertzen & cribbie, 2010). this transformation is computed as follows, where r is the observed correlation and ρ is the theoretical correlation under the null:

z = ( log((1 + r) / (1 − r)) / 2 − log((1 + ρ) / (1 − ρ)) / 2 ) / sqrt(1 / (n − 3))

because the z-distribution is symmetric, the probability of observing the observed or more extreme z-score, assuming the equivalence bound is the true effect size, is 50%. however, because the r distribution is not symmetric, this does not mean that there is always a 50% probability of observing a correlation smaller or larger than the true correlation.

figure 11. comparison of p-values from tost (black line) and 1-sgpv (grey curve) across a range of observed sample correlations (x-axis) tested against equivalence bounds of r = -0.45 and r = 0.45 with n = 30 and an alpha of 0.05.

figure 12. three 95% confidence intervals for observed effect sizes of r = -0.45, r = 0, and r = 0.45 for n = 30. only the confidence interval for r = 0 is symmetric.
as can be seen in figure 12, the proportion of the confidence interval that overlaps with the equivalence range is larger than 50% when the observed correlations are r = -.45 and r = .45, meaning that the two second generation p-values associated with these correlations are larger than 50%. because the confidence interval is asymmetric around the observed effect size of 0.45 (ranging from 0.11 to 0.70), according to blume et al. (2018) 58.11% of the data-supported hypotheses are null hypotheses, and therefore compatible with the null premise. the further the equivalence bound lies from 0, the larger the sgpv when the observed correlation falls on the equivalence bound. the sgpv is the proportion of values in a 95% confidence interval that overlap with the equivalence range, but not the probability that these values will be observed. in the most extreme case (i.e., a sample size of 4, and equivalence bounds set to r = -0.99 and 0.99, with a true correlation of 0.99) 97.60% of the confidence interval overlaps with the equivalence range, even though in the long run only 36% of the correlations observed in the future will fall in this range. it should be noted that in larger sample sizes the sgpv is closer to 0.5 whenever the observed correlation falls on the equivalence bound, but this extreme example nevertheless clearly illustrates the difference between the question the sgpv answers and the question a p-value answers. the conclusion of this section on asymmetric confidence intervals is that a sgpv of 1 or 0 can still be interpreted as a p < 0.025 or p > 0.975 in an equivalence test, since the sgpv and p-value for the tost procedure are always directly related at the values p = 0.025 and p = 0.975. although blume et al. (2018) state that “the degree of overlap conveys how compatible the data are with the null premise”, this definition of what the sgpv provides does not hold for asymmetric confidence intervals.
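the extreme example can be checked by combining the fisher z interval with the overlap calculation (stdlib only; names are ours, and the result matches the reported 97.60% up to rounding):

```python
# extreme case from the text: n = 4, observed r = 0.99, equivalence
# bounds r = -0.99 and 0.99; nearly the whole ci overlaps with the
# equivalence range even though future correlations often will not
# fall in that range.
from math import atanh, tanh, sqrt
from statistics import NormalDist

n, r = 4, 0.99
eq_low, eq_high = -0.99, 0.99
z_crit = NormalDist().inv_cdf(0.975)
se = 1 / sqrt(n - 3)                       # fisher z standard error

ci_low = tanh(atanh(r) - z_crit * se)      # back-transformed 95% ci
ci_high = tanh(atanh(r) + z_crit * se)
overlap = min(ci_high, eq_high) - max(ci_low, eq_low)
sgpv = overlap / (ci_high - ci_low)        # correction term is 1 here
print(round(sgpv, 3))                      # ~0.976
```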
although a sgpv of 1 or 0 can be directly interpreted, a sgpv between 0 and 1 is not interpretable as “compatibility with the null hypothesis” under the assumption of a bivariate normal distribution, and the generalizability of this statement needs to be examined beyond normal bivariate distributions. indeed, blume and colleagues write in the supplemental material that “the magnitude of an inconclusive second-generation p-value can vary slightly when the effect size scale is transformed. however definitive findings, i.e. a p-value of 0 or 1 are not affected by the scale changes.”

what are the relative strengths and weaknesses of equivalence testing and the sgpv?

when introducing a new statistical method, it is important to compare it to existing approaches and specify its relative strengths and weaknesses. here, we aimed to compare the sgpv against equivalence tests based on the tost procedure. first of all, even though a sgpv of 1 or 0 has a clear interpretation (we can reject effects outside or inside the equivalence range), intermediate values are not as easy to interpret (especially for effects that have asymmetric confidence intervals). in one sense, they are what they are (the proportion of overlap), but it can be unclear what this number tells us about the data we have collected. this is not too problematic, since the main use of the sgpv (e.g., in all examples provided by blume and colleagues) seems to be to examine whether the sgpv is 0, 1, or inconclusive. as already mentioned, this interpretation of a sgpv is very similar to the neyman-pearson interpretation of an equivalence test and a minimum-effect test (which are complementary). the difference is that where a sgpv of 1 can be interpreted as p < .025, equivalence tests provide exact p-values, and they continue to differentiate between, for example, p = 0.024 and p = 0.002. whether this is desirable depends on the perspective that is used.
from a neyman-pearson perspective on statistical inferences the main conclusion is based on whether or not p < α, and thus an equivalence test and a sgpv can be performed by simply checking whether the confidence interval falls within the equivalence range, just as a null-hypothesis test can be performed by checking whether the confidence interval contains zero or not. at the same time, it is recommended to report exact p-values (american psychological association, 2010), and exact p-values might provide information of interest to readers about precisely how surprising the data, or more extreme data, are under the null model. some researchers might be interested in combining an equivalence test with a null-hypothesis significance test. this allows a researcher to ask whether there is an effect that is statistically different from zero, and whether effect sizes that are considered meaningful can be rejected. equivalence tests combined with null-hypothesis tests classify results into four possible categories, and for example allow researchers to conclude an effect is significant and equivalent (i.e., statistically different from zero, but also too small to be considered meaningful; see lakens et al., 2018). an important issue when calculating the sgpv is its reliance on the “small sample correction”, where the sgpv is set to 0.5 whenever the ratio of the confidence interval width to the equivalence range exceeds 2:1 and the confidence interval overlaps with the upper and lower equivalence bounds. this exception to the normal calculation of the sgpv is introduced to prevent misleading values: without it, an extremely wide confidence interval combined with an extremely narrow equivalence range would lead to a very low value for the sgpv. blume et al.
(2018) suggest that under such a scenario “the data favor alternative hypotheses”, even when a better interpretation would be that there is not enough data to accurately estimate the true effect compared to the width of the equivalence range. although it is necessary to set the sgpv to 0.5 whenever the ratio of the confidence interval width to the equivalence range exceeds 2:1, this leads to a range of situations where the sgpv is set to 0.5 while the p-value from the tost procedure continues to differentiate (see for example figure 6). an important benefit of equivalence tests is that they do not need such a correction to prevent misleading results.

figure 13. comparison of p-values from tost (black line) and 1-sgpv (grey curve) across a range of observed sample correlations (x-axis) tested against equivalence bounds of r = 0.4 and r = 0.8 with n = 10 and an alpha of 0.05.

as a more extreme example of the peculiar behavior of the “small sample correction” as currently implemented in the calculation of the sgpv, see figure 13. in this figure observed correlations (from a sample size of 10) from -.99 to .99 are tested against an equivalence range from r = 0.4 to r = 0.8. we can see that the sgpv has a peculiar shape because it is set to 0.5 for certain observed correlations, even though there is no risk of a “misleading” sgpv in this range. this example suggests that the current implementation of the “small sample correction” could be improved. if, on the other hand, the sgpv is mainly meant to be interpreted when it is 0 or 1, it might be preferable to simply never apply the “small sample correction”. blume et al. (2018) claim that when using the sgpv “adjustments for multiple comparisons are obviated” (p. 15). however, this is not correct.
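our reading of the “small sample correction” discussed above can be sketched as follows. this is a python illustration of the rule as described in the text, not the authors' reference implementation, and the function name and example values are ours:

```python
def sgpv_small_sample(ci_low, ci_high, eq_low, eq_high, correct=True):
    """interval-overlap sgpv with the small sample correction: the sgpv is
    set to 0.5 when the ci is more than twice as wide as the equivalence
    range and overlaps both equivalence bounds."""
    ci_width = ci_high - ci_low
    eq_width = eq_high - eq_low
    if correct and ci_width > 2 * eq_width and ci_low < eq_low and ci_high > eq_high:
        return 0.5
    overlap = max(0.0, min(ci_high, eq_high) - max(ci_low, eq_low))
    return overlap / ci_width

# a very wide ci (-0.9 to 0.9) against a narrow range (-0.1 to 0.1): the
# correction returns 0.5 instead of the small uncorrected proportion
print(sgpv_small_sample(-0.9, 0.9, -0.1, 0.1),
      round(sgpv_small_sample(-0.9, 0.9, -0.1, 0.1, correct=False), 3))  # 0.5 0.111
```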
given the direct relationship between tost and sgpv highlighted in this manuscript (where a tost p = 0.025 equals sgpv = 1, as long as the sgpv is calculated based on confidence intervals, and assuming data are sampled from a continuous bivariate normal distribution), not correcting for multiple comparisons will inflate the probability of concluding the absence of a meaningful effect based on the sgpv in exactly the same way as it will for equivalence tests. whenever statistical tests are interpreted as support for a hypothesis (e.g., sgpv = 0 or sgpv = 1), it is possible to do so erroneously, and if researchers want to control error rates, they need to correct for multiple comparisons.

conclusion

we believe that our explanation of the similarities between the tost procedure and the sgpv provides context to interpret the contribution of second-generation p-values to the statistical toolbox. the novelty of the sgpv can be limited when confidence intervals are asymmetrical or wider than the equivalence range. there are strong similarities with p-values from the tost procedure, and in all situations where the statistics yield different results, the behavior of the p-value from the tost procedure is more consistent. we hope this overview of the relationship between the sgpv and equivalence tests will help researchers to make an informed decision about which statistical approach provides the best answer to their question. our comparisons show that when proposing alternatives to null-hypothesis tests, it is important to compare new proposals to already existing procedures. we believe equivalence tests achieve the goals of the second-generation p-value while allowing users to easily control error rates, and while yielding more consistent statistical outcomes.

authors note

all code associated with this article, including the reproducible manuscript, is available from https://github.com/lakens/tost_vs_sgpv and https://osf.io/8crkg/.
the preprint can be found at https://psyarxiv.com/7k6ay/. correspondence concerning this article should be addressed to daniël lakens, den dolech 1, ipo 1.33, 5600 mb, eindhoven, the netherlands. e-mail: d.lakens@tue.nl

open science practices

this article earned the open materials badge for making the materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

conflict of interest and funding

no conflict of interest and no external funding. this work was supported by the netherlands organization for scientific research (nwo) vidi grant 452-17-013.

author contributions

dl conceptualized the idea, both authors wrote and revised this manuscript.

references

american psychological association (ed.). (2010). publication manual of the american psychological association (6th ed.). washington, dc: american psychological association.
blume, j. d., d’agostino mcgowan, l., dupont, w. d., & greevy, r. a. (2018). second-generation p-values: improved rigor, reproducibility, & transparency in statistical analyses. plos one, 13(3), e0188299. doi:10.1371/journal.pone.0188299
goertzen, j. r., & cribbie, r. a. (2010). detecting a lack of association: an equivalence testing approach. british journal of mathematical and statistical psychology, 63(3), 527–537. doi:10.1348/000711009x475853
hauck, d. w. w., & anderson, s. (1984). a new statistical procedure for testing equivalence in two-group comparative bioavailability trials. journal of pharmacokinetics and biopharmaceutics, 12(1), 83–91. doi:10.1007/bf01063612
kruschke, j. k. (2018). rejecting or accepting parameter values in bayesian estimation. advances in methods and practices in psychological science, 1(2), 270–280. doi:10.1177/2515245918771304
lakens, d. (2017). equivalence tests: a practical primer for t tests, correlations, and meta-analyses.
social psychological and personality science, 8(4), 355–362. doi:10.1177/1948550617697177
lakens, d., scheel, a. m., & isager, p. m. (2018). equivalence testing for psychological research: a tutorial. advances in methods and practices in psychological science, 1(2), 259–269. doi:10.1177/2515245918770963
meyners, m. (2012). equivalence tests – a review. food quality and preference, 26(2), 231–245. doi:10.1016/j.foodqual.2012.05.003
murphy, k. r., myors, b., & wolach, a. h. (2014). statistical power analysis: a simple and general model for traditional and modern hypothesis tests (4th ed.). new york: routledge, taylor & francis group.
quertemont, e. (2011). how to statistically show the absence of an effect. psychologica belgica, 51(2), 109–127. doi:10.5334/pb-51-2-109
rogers, j. l., howard, k. i., & vessey, j. t. (1993). using significance tests to evaluate equivalence between two experimental groups. psychological bulletin, 113(3), 553–565. doi:10.1037/0033-2909.113.3.553
schuirmann, d. j. (1987). a comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. journal of pharmacokinetics and biopharmaceutics, 15(6), 657–680. doi:10.1007/bf01068419
serlin, r. c., & lapsley, d. k. (1985).
rationality in psychological research: the good-enough principle. american psychologist, 40(1), 73–83. doi:10.1037/0003-066x.40.1.73
spiegelhalter, d. j., freedman, l. s., & parmar, m. k. (1994). bayesian approaches to randomized trials. journal of the royal statistical society. series a (statistics in society), 357–416. doi:10.2307/2983527
wellek, s. (2010). testing statistical hypotheses of equivalence and noninferiority (2nd ed.). boca raton: crc press.
westlake, w. j. (1972). use of confidence intervals in analysis of comparative bioavailability trials. journal of pharmaceutical sciences, 61(8), 1340–1341. doi:10.1002/jps.2600610845
meta-psychology, 2018, vol 2, mp.2018.842, https://doi.org/10.15626/mp.2018.842. article type: commentary. published under the cc-by4.0 license. preregistration: n/a. preprint: https://doi.org/10.17605/osf.io/szx45. data, code and materials: https://doi.org/10.17605/osf.io/yt2dq. edited by: rickard carlsson. reviewed by: andreas ivarsson & ulrich schimmack. peer review report: https://doi.org/10.17605/osf.io/dzw8c. editorial history: https://doi.org/10.17605/osf.io/rq4fv

no myth far and wide: relay swimming is faster than individual swimming and the conclusion of skorski et al. (2016) is unfounded

joachim hüffmeier, technische universität dortmund
stefan krumm, freie universität berlin

abstract

skorski, extebarria, and thompson (2016) target our article on relay swimmers (hüffmeier, krumm, kanthak, & hertel, 2012). we have shown that professional freestyle swimmers at relay positions 2 to 4 swam faster in the relay than in the individual competition if they had a high chance to win a relay medal. after applying a reaction-time correction that controls for different starting procedures in relay and individual competitions, skorski et al. (2016) conclude that swimmers in relays do not swim faster. at first sight, their results appear to show this very pattern. however, we argue that the authors’ findings and conclusion—that our finding is a myth—are not warranted. first, we have also controlled for quicker reaction times in the relay competition. our correction was based on the swimmers’ own reaction time data rather than on a constant reaction time estimate and is, thus, more precise than theirs. second, skorski et al. treat data from international and national competitions equally, although national relay competitions are less attractive for the swimmers than national individual competitions.
this difference likely biases their data towards slower relay times. third, the authors select a small and arbitrary sample without explicit power considerations or a clear stopping rule. fourth, they unfavorably aggregate their data. we conclude that the reported results are most likely due to the methodological choices by skorski et al. and do not invalidate our findings.

keywords: teams; groups; performance gains; effort gains; relay swimming; reaction-time correction; exchange-block times

in their recent study, skorski, extebarria, and thompson (2016) set out to show that our findings (hüffmeier, krumm, kanthak, & hertel, 2012) are unfounded or, in their words, “a myth” (cf. the title of their article). in the current commentary, we intend to illustrate (i) that the reported findings in the article by skorski et al. (2016) most likely result from unwarranted methodological choices and (ii) that their assessment of our findings is, thus, not valid. to do so, we first portray our pertinent studies. we focus on our study procedures, the methodological choices we made, our main findings, our explanation as well as possible alternative explanations for these findings, and we also discuss the robustness of our findings in relation to different methodological choices. we then describe the procedure and findings of skorski et al. (2016). finally, we point out four central shortcomings in the skorski et al. article and conclude that the results reported by skorski et al. cannot effectively question our results.
our studies

main goal of our studies. our article (hüffmeier et al., 2012) is part of a series of published (hüffmeier et al., 2017; hüffmeier & hertel, 2011; hüffmeier, kanthak, & hertel, 2013) and currently not yet published studies (schleu, mojzisch, & hüffmeier, 2018). they build on a long research tradition in (social) psychology that investigated—mostly in laboratory experiments—how working as part of a group affects the members’ motivation and performance (see karau & williams, 1993; weber & hertel, 2007, for overviews). in their meta-analysis, weber and hertel (2007) have shown that group work can result in effort gains in groups (i.e., higher effort during group as compared to individual work).1 they have further found that perceived social indispensability (i.e., the perception that the own contribution to the group performance is critical) is an important driver of such effort gains. our studies investigated whether and under which conditions group work results in such indispensability-based effort gains in groups outside the research laboratory (see hüffmeier & hertel, in press, for an overview of the swimming studies).

methodological approach of our studies. most of our pertinent studies were situated in a professional swimming context and compared swimming performances from the individual and relay competitions of the same event (e.g., the 2008 olympic games; in a recent study, we investigated performances in a track and field context, cf. schleu et al., 2018). due to the high standardization of performances, professional swimming is an apt context to study the motivational impact of group versus individual work. in fact, only the prescribed starting procedures differentiate between the work that swimmers do in the individual and relay competitions. the starting rules for official swimming competitions (cf. the rules of the world swimming federation [fina]) prescribe the respective starting procedures:2 swimmers at the first relay position and swimmers in the individual competition have to follow identical starting rules. these swimmers perform a “flat start”, meaning that they cannot reliably anticipate their acoustic starting signal and are not allowed to move before this starting signal sounds. by contrast, in the relay competition, swimmers at relay positions 2 to 4 perform a “flying start”. that is, they can reliably anticipate their “starting signal” (i.e., their predecessor touching the pool wall), and they are allowed to move before this starting signal as long as at least one foot still touches the starting block when their predecessor touches the wall. obviously, these differences lead to quicker starting or reaction times in the relay as compared to the individual competitions for swimmers at relay positions 2 to 4. consequently, the resulting reaction-time advantages in relays of about 0.422 to 0.524 sec (depending on the swimmers’ gender, relay position, and swimming distance, see hüffmeier et al., 2017; see also table 1) have to be controlled to assess the pure swimming time difference between relay and individual swimming.

our studies shared the following approach: first, we sampled data of swimmers taking part in both the individual and relay competitions in the same swimming discipline at the same professional and international events (e.g., 100 meter freestyle competitions at the 2008 olympic games).3 we analyzed data from national competitions in only one article (hüffmeier et al., 2017). second, with one exception (hüffmeier & hertel, 2011), we made sure that the included data came from comparable stages in both competitions (e.g., final races from both the individual and relay competitions). third, in most studies, we only included a swimmer if his/her reaction times from both competitions were available (hüffmeier et al., 2013, 2012; hüffmeier & hertel, 2011). as an alternative approach, we conducted a pilot study to derive estimates of the reaction-time advantage separately for each relay position (hüffmeier et al., 2017; see also table 1). we used both types of data (i.e., original reaction-time data or estimates) to correct for the reaction-time advantages separately for each relay position (i.e., by subtracting the respective reaction times from the swimming times for the individual and relay competition). fourth, we either included all data that was available at the time of a study (e.g., study 1 in hüffmeier et al., 2017; hüffmeier et al., 2013; 2012) or we conducted a power analysis to determine the necessary sample size to detect the expected effects (study 2 in hüffmeier et al., 2017). fifth, in accordance with our finding that indispensability perceptions increase across the positions of the relay (hüffmeier & hertel, 2011), we expected a parallel increase in effort across the relay. thus, we conducted analyses testing a corresponding linear contrast as well as separate analyses for each relay position rather than aggregating the swimming performance data across relay positions. sixth, we interpreted the observed performance differences between the individual and relay competition as differences in the swimmers’ effort levels.

table 1. overview of the reaction time estimates applied in skorski et al. (2016) and in the two studies of hüffmeier et al. (2017) for the four relay positions (reported in seconds)

relay position | skorski, extebarria, & thompson (2016)1 | hüffmeier et al. (2017), study 12 | hüffmeier et al. (2017), study 23
position 1 | – | 0.019 s (95% ci: -0.07; 0.045) | -0.006 s (95% ci: -0.010; -0.002)
position 2 | 0.48 s | 0.505 s (95% ci: 0.442; 0.568) | 0.465 s (95% ci: 0.448; 0.482)
position 3 | 0.48 s | 0.455 s (95% ci: 0.366; 0.544) | 0.455 s (95% ci: 0.431; 0.479)
position 4 | 0.48 s | 0.524 s (95% ci: 0.445; 0.603) | 0.463 s (95% ci: 0.447; 0.479)

note. 1 skorski et al. (2016) applied a constant of 0.48 seconds to the swimming times of 100 meter relay swimmers at the second, third, and fourth relay position as a reaction time estimate. this constant was taken from a swimmer sample that was unrelated to their own sample (see saavedra et al., 2014, for details). 2 in their study 1, hüffmeier et al. (2017) collected data to specifically estimate the reaction time advantages in relays separately for relay position, swimming distance, and gender. to illustrate the methodological differences as compared to skorski et al. (2016), the data from men competing in the 100 meter freestyle competition is reported in table 1. 3 in their study 2, hüffmeier et al. (2017) used a data set not specifically collected for this purpose to derive reaction time estimates. to illustrate the methodological differences as compared to skorski et al. (2016), the data from men competing in the 100 meter freestyle competition is reported in table 1.

footnotes. 1 prior research used the term “motivation gains in groups”. however, motivation encompasses the direction, intensity, and persistence of behavior. pertinent small group research such as ours typically does not study the direction, but the intensity and persistence of behavior (i.e., the two components of effort). we, thus, prefer the term “effort gains in groups”. 2 see the fina homepage: http://www.fina.org/content/fina-rules. 3 the resulting data are not distributed equally across relay positions because normally only two swimmers from one country can qualify for individual competitions at the prestigious international championships and these two strongest national swimmers are typically assigned to the first and fourth relay position (see, for instance, table 2 in neugart & richiardi, 2013, or table 1 below).

results of our studies.
across all five studies focusing on swimming performances, we found the expected linear increase in effort across the relay—in those conditions where we expected it. the effect sizes for this linear increase were typically medium to large (cohen, 1992; see also table 2). in four of five studies (except for study 1 in hüffmeier et al., 2017), the effort level at the first relay position was comparable to the expended effort in the individual competition (i.e., no performance differences between the two competition types). again, in four of five studies (except for study 2 in hüffmeier et al., 2017), we observed very small, but statistically significant effort gains in groups at relay position 2 (i.e., quicker swimming times in the relay as compared to the individual competition; see table 2). we consistently found effort gains at position 3, which were largely comparable in size to those at position 2. at the fourth position, we again consistently found effort gains. these gains were more pronounced than at the prior positions in all but one study (hüffmeier et al., 2013; see table 2), but they were still small in size. concerning potential moderators of these findings, we have initial evidence that type of swimming relay (e.g., freestyle vs. medley relays) may act as a moderator (i.e., we found no effort gains in medley relays, hüffmeier et al., 2013). we have more reliable evidence that (i) the swimmers’ medal chances in the relay and (ii) the relative valence of the obtainable group outcomes may moderate these findings in the following way: swimmers without chances to win a medal in the relay competition and swimmers competing for less attractive outcomes in the relay than in the individual competition did not exhibit effort gains in groups (hüffmeier et al., 2017; 2012). by contrast, we did not find that swimmers’ gender, the swimming distance (100 vs. 200 meters) or the sports discipline (swimming vs. track and field relays) moderated the observed findings. 
in a first track and field study, we studied runners’ efforts (n = 397) in the individual and relay 400 meter running competitions of the same prestigious and international sports events (schleu et al., 2018). paralleling our findings from the swimming domain, we found a linear increase in self-reported effort and also in objective performance across the running relay. we again observed effort gains in groups for the athletes running at later positions in the relay.

table 2. summary of exemplary previous studies and effect sizes indicating increases in effort across the relay and effort gains in groups

effect sizes1
study | total sample size | relevant condition | sub-sample size | position 1 | position 2 | position 3 | position 4 | planned contrast4
hüffmeier et al. (2012) | 199 | high instrumentality of group performance2 | 151 | -0.014 | 0.040 | 0.045 | 0.107 | .74
hüffmeier et al. (2017) – study 1 | 302,576 | high instrumentality of group performance2 and high valence competitions3 | 928 | 0.010 | 0.003 | 0.007 | 0.089 | .80

note. to allow for a high comparability of the findings in the 2017 data set with the original findings in hüffmeier et al. (2012), only 100 m free-style races were analyzed for this table. effect sizes are cohen’s d, using the standard deviation of individual swimming times. negative effect sizes indicate performance losses. 1 mean difference between individual and relay competition times (corrected by reaction times of different starting procedures) divided by the standard deviation of the individual times. 2 good chance of winning a medal. 3 olympic games, world championships, european championships, pan pacific games, commonwealth games, and universiades. 4 contrary to the analyses in the referenced papers, planned contrasts were analyzed within the relevant condition only. the t-value of the planned contrast was converted into cohen’s d.
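the effect size defined in note 1 to table 2 (mean difference between individual and reaction-time-corrected relay times, divided by the standard deviation of the individual times) can be sketched in a few lines. this is a python illustration with invented swimming times, not the published data, and the function name is ours:

```python
from statistics import mean, stdev

def relay_effort_gain_d(individual_times, relay_times, reaction_advantage):
    # add the relay reaction-time advantage back onto the measured relay
    # times so that only pure swimming time is compared (cf. table 1)
    corrected_relay = [t + reaction_advantage for t in relay_times]
    # positive differences = faster pure swimming in the relay (effort gain)
    diffs = [ind - rel for ind, rel in zip(individual_times, corrected_relay)]
    # cohen's d as in note 1 to table 2: mean difference divided by the
    # standard deviation of the individual times
    return mean(diffs) / stdev(individual_times)

# invented 100 m times (seconds) for four swimmers, corrected with a
# uniform 0.48 s reaction-time estimate
individual = [48.90, 49.20, 48.50, 49.00]
relay = [48.30, 48.75, 47.95, 48.40]
print(round(relay_effort_gain_d(individual, relay, 0.48), 2))  # 0.24
```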
explanation for our findings. to learn whether the perceived indispensability of the own contribution to the group performance (cf. weber & hertel, 2007) could also be a relevant mediator explaining the effort gains of relay swimmers, we conducted a study with competitive adolescent swimmers (hüffmeier & hertel, 2011). confirming our expectations, they perceived an increasing indispensability of the own contribution to the group performance across the relay (i.e., they progressively perceived that an own bad performance could not be compensated by fellow relay swimmers). because this increase in indispensability across the relay parallels the consistently observed increase in effort across the relay, we believe that perceived indispensability may be an important mechanism underlying the effort gains among relay swimmers.

alternative explanations. in this section, we discuss four prominent alternative explanations for our main findings. first, the different starting procedures in the individual and relay condition most likely do not represent a valid alternative explanation: this account cannot explain the consistently found linear increase in effort across the relay and, in particular, the most pronounced effort gains at the last relay position. second, the order in which swimmers took part in their competitions (e.g., relay competition first vs. individual competition first) could be proposed as another explanation. in the studies where we could test this explanation, we did not, however, observe a significant moderation by competition order (hüffmeier et al., 2017; 2012). third, although prominent in the literature on effort gains in groups (weber & hertel, 2007), social comparison is a rather unlikely alternative explanation. obviously, the clearest possibilities for comparing one’s performance with those of others exist in the individual competition and for the first relay swimmers because all swimmers start simultaneously in these situations.
however, first relay swimmers did not exhibit effort gains in any of our studies, and the backlogs or leads accumulated over the course of the relay competitions impede such comparisons for the remaining relay positions. these are, however, the positions where effort gains were observed. fourth, group members’ relative strength was discussed as an alternative explanation (cf. osborn, irwin, skogsberg, & feltz, 2012), meaning that effort gains were primarily expected among the weaker swimmers of a relay. this expectation is transferred from laboratory research where the weaker group members often exhibit effort gains in typical laboratory tasks (cf. weber & hertel, 2007). we conducted a study (study 2 in hüffmeier et al., 2017) to put this explanation to an empirical test. to do so, we collected data on the relative strength of all relay members and ordered them accordingly in our analysis. the resulting linear contrast for relative member strength was, however, not significant and could therefore not explain the observed results.

robustness of our findings in relation to different methodological choices. our extant results were largely unaffected by the type of analysis and the type of reaction-time correction we used. in most of our studies, we report analyses using difference scores (i.e., we subtracted the relay from the individual swimming times). however, our results did not hinge on this type of analysis: we obtained equivalent results in all studies when conducting repeated-measures analyses. our results were also unaffected by the different types of data we applied to correct for the reaction-time advantages in the relay. first, we observed parallel findings when using the exact reaction-time data of the swimmers (e.g., hüffmeier & hertel, 2011; hüffmeier et al., 2012) and when using reaction-time estimates (hüffmeier et al., 2017).
second, different types of reaction-time estimates also did not affect our results: in our two most recent studies, we applied two sets of reaction-time corrections (i.e., from a pilot study designed to establish estimates and from a big data set not specifically collected for this purpose; cf. hüffmeier et al., 2017, and table 1). we obtained comparable results with both sets of reaction-time corrections. we are, however, well aware that our findings—especially at relay positions 2 and 3—are probably sensitive to methodological choices because the related effects are very small (see table 2). for instance, if a uniform and higher estimate is applied to each relay position to correct for the reaction-time advantages in relays (as, for instance, in neugart & richiardi, 2013), we probably would not observe effort gains at these positions. applying a uniform estimate to each relay position and choosing higher estimates are, however, doubtful methodological decisions (see hüffmeier et al., 2017, for related data and a discussion; see also table 1). nevertheless, the effort gains at relay positions 2 and 3 are probably less robust than the effort gains at the fourth relay position and especially less robust than the overall linear increase in effort across the relay.

our study targeted by skorski et al. (2016). in our study (hüffmeier et al., 2012), we analyzed the swimming performances of n = 199 professional freestyle swimmers from about a decade of 100 meter competitions (i.e., olympic games from 1996 to 2008, world championships from 1998 to 2011, and european championships from 2000 to 2010) and we corrected their swimming times from the individual and relay competition by using their own reaction times.
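the difference between this own-reaction-time correction and a uniform constant can be made concrete with a small python sketch. the swimming and reaction times below are invented for illustration; only the 0.48 s constant is the one skorski et al. (2016) applied:

```python
def corrected_difference(ind_time, ind_rt, relay_time, relay_rt):
    # subtract each measured reaction time before comparing, as in
    # hüffmeier et al. (2012); positive = faster pure swimming in the relay
    return (ind_time - ind_rt) - (relay_time - relay_rt)

# an invented swimmer whose actual relay advantage (0.70 - 0.10 = 0.60 s)
# is larger than the 0.48 s constant: the constant-based correction finds
# an effort gain, while the own-reaction-time correction finds a loss
own_rt = corrected_difference(48.90, 0.70, 48.35, 0.10)
constant = 48.90 - (48.35 + 0.48)
print(round(own_rt, 2), round(constant, 2))  # -0.05 0.07
```

this is why applying the same constant to every swimmer and relay position can flip conclusions whenever the true reaction-time advantages vary, as table 1 shows they do.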
our analysis shows that swimmers at the first relay position swam equally fast in both competition types, while swimmers at relay positions 2 to 4 swam faster in the relay as compared to the individual competition if they had a high chance to win a relay medal.4 these effort gains in groups increased from the second to the fourth relay position.
4 this finding is not restricted to relay swimming with high medal chances as compared to individual swimming with freely varying medal chances. we have obtained similar findings when the chances of success in both the relay and individual competition varied freely (see study 2 in hüffmeier et al., 2017).
the skorski et al. (2016) article. skorski et al. (2016) analyze the swimming performances of 166 male swimmers from 778 races. these races were either part of the 100 and 200 meter freestyle competitions of the olympic games from 2000, 2004, and 2012 (n = 144 races) or of international and national non-olympic 100 and 200 meter competitions (n = 634 races). the authors frame their use of a reaction-time correction controlling for the quicker reaction times in the relay as compared to the individual competition as the main contribution of their research. the authors report that they no longer find that the analyzed swimmers swim faster in relay events than in their individual races once they control for the existing reaction-time advantages in the relay competitions. as corrections, they add 0.48 sec to the swimming times of the 100 m relay competitions and 0.45 sec to the swimming times of the 200 m relay competitions. while the authors' conclusion does not seem to be fully accurate (on page 412, the authors report that "there was a small positive performance effect after adjusting for reaction time" for the swimming times of swimmers at positions 2 to 4 in olympic 100 meter freestyle relays), we appreciate their efforts to critically assess our own findings. upon closer inspection, however, it seems that the authors may have overlooked some aspects of our study and that they made several methodological decisions that taint their reported results and call their conclusion into question.
clarification and four central shortcomings of the skorski et al. (2016) article
a clarification of the research literature: most pertinent studies applied a reaction-time correction. skorski et al. present their use of a reaction-time correction as a significant contribution. this gives readers the impression that we did not apply a reaction-time correction in the targeted article (hüffmeier et al., 2012). but of course, we and others have consistently done so in the targeted and other pertinent articles (e.g., hüffmeier & hertel, 2011; hüffmeier et al., 2017; 2013; neugart & richiardi, 2013; osborn et al., 2012).
the first shortcoming: our applied reaction-time correction is more precise than that of skorski et al. (2016). the reaction-time correction in our 2012 article was more precise than the one reported by skorski et al. (2016) because we relied on the swimmers' own reaction-time data: thus, we did not use reaction-time estimates, but only included swimming data for a swimmer in our analysis if his/her exact reaction times were available for both the individual and relay competition. by contrast, skorski et al. (2016) added two different and rather rough constants to their data from the 100 and 200 meter relay competitions (0.48 sec and 0.45 sec, respectively). these constants are not related to the swimmers of their sample, but were instead derived from a prior, unrelated study (cf. saavedra, garcía-hermoso, escalante, dominguez, arellano, & navarro, 2014). our own recent research suggests that adding the same constant to each relay swimming time is an imprecise approach because reaction times differ depending on the relay position (see hüffmeier et al., 2017, and table 1, for details). thus, the authors not only ignored that their use of a reaction-time correction was not novel at all, but they also applied a less precise reaction-time correction than we did. it is, thus, unlikely that the absence of observed differences between swimming times from the relay and individual competitions is due to the mere use of reaction-time corrections, as the authors claim. in contrast, their imprecise correction may possibly have obscured otherwise observable differences.
the second shortcoming: the sample in skorski et al. (2016) with many more non-olympic than olympic races systematically biases the results towards slower relay swimming times. skorski et al. (2016) analyze a variety of competitions: races from olympic competitions (n = 144) as well as from international and national non-olympic competitions (n = 634), including national championships. given that the authors attempted to show that the results of our article are unfounded, this sampling approach is problematic because we focused our analysis exclusively on international championships (i.e., olympic games, world and european championships; see hüffmeier et al., 2012). in these championships, the relay and individual competitions are equally attractive for the professional swimmers—in both types of competitions they pursue the ultimate goal of winning the race or at least a medal. the situation is different for national championships: here, the individual competition is much more attractive than the relay competition because swimmers can win and qualify for the upcoming international events in the individual competition (e.g., olympic games).
by contrast, they can typically only win, but not qualify, in the relay competition (i.e., relays at international events mostly consist of the best four national swimmers and not of the best relay from the national championship). in fact, many national swimming associations do not even consider relay performances from national championships as a relevant criterion for selecting swimmers for major international swimming competitions like the olympic games or world championships.5
5 for recent examples, see the nomination criteria for the 2017 world championships by the german swimming association (www.dsv.de/schwimmen/nationalmannschaft/nominierungs-richtlinien/), the austrian swimming association (http://www.osv.or.at/schwimmen-mit-open-water/qualifikationsrichtlinien/) or the us national swimming association (https://www.usaswimming.org/resources-home/resource-topic/resource-subtopic).
we have found empirical evidence that accords with the differing attractiveness of national individual and relay competitions in a re-analysis of a data set with more than 300,000 races (hüffmeier et al., 2017; cf. neugart & richiardi, 2013). in this sample, professional athletes show the typical pattern of increasing effort gains in relays from position 2 to 4 for international competitions, while we even observe effort losses in groups for national and local competitions (i.e., slower swimming times in the relay as compared to the individual competition; for a related hypothesis on the valence of group outcomes, see karau, markus, & williams, 2000). in skorski et al. (2016), a parallel difference between olympic and non-olympic competitions appears to be present for the 100 meter races (see their figure 1). the authors even describe that "there was a small positive performance effect after adjusting for reaction time" (p. 412) for the swimming times of swimmers at positions 2 to 4 in olympic 100 meter freestyle relays, which—nicely—replicates our findings from the 2012 study.6 the distribution of olympic (n = 144) and non-olympic races (n = 634), thus, most likely biases the results of skorski et al. (2016) towards effort losses during group work (i.e., slower swimming times in the relay than in the individual competition). the sample selected for the authors' study appears not to be suited for showing that our findings may be unfounded, but rather analyzes another and very specific population of relay competitions.
6 note, however, that the exact numbers reported in skorski et al. (2016) for these data are wrong or at least unclear: although the depicted reaction-time-corrected effect sizes in their figure 1 for the olympic 100 meter competitions are clearly positive and appear to have an effect size d of about 0.50-0.60, the graphically depicted values are described in the figure caption as being either very small and positive (0.05) or even strongly negative (-1.06).
the third shortcoming: skorski et al. select an arbitrary and rather small sample without providing a sufficient justification. above, we have illustrated that only the olympic competition data from the skorski et al. article is suited for showing that our findings are unfounded. the sample of skorski et al. (2016) consists of n = 144 olympic races and, thus, is rather small. it is not clear why the authors did not try to collect a bigger sample (e.g., by including the data from the 1996 olympics, which would also have been available on the website they used for their data collection [i.e., swimrankings.net]), and the authors do not provide a sufficient justification for the arbitrary selection of their sample.7 assembling a certain sample size without a clear stopping rule (e.g., to collect all data that is available at the time of the study or to continue data collection until a stop criterion is reached; e.g., frick, 1998) or without an a priori power analysis is problematic because it decreases the possibility of detecting true effects (and increases the possibility of obtaining false positive effects; cf. cohen, 1992). the small sample size may have contributed to the absence of significant differences between relay and individual swimming performances—beyond the imprecise reaction-time correction (cf. the first shortcoming) and the predominant inclusion of national competition data (cf. the second shortcoming). with bigger samples and, thus, more statistical power, we have repeatedly found evidence for faster swimming times in the relay competition (see, for instance, the two studies in hüffmeier et al., 2017).
7 as far as we know, data availability in databases starts with the 1996 olympic games.
the fourth shortcoming: skorski et al. aggregate data over systematically different relay positions and may thereby obscure otherwise observable differences between relay and individual swimming times. skorski et al. (2016) aggregate the swimming times for swimmers at the relay positions 2 to 4 and only provide statistical results for this aggregate (i.e., no separate results per relay position are given). this is problematic because, in our studies, the size of the performance gains in the relay competition varied across positions. this information was available when skorski et al. (2016) wrote their article (cf. hüffmeier & hertel, 2011; hüffmeier et al., 2012)—but, admittedly, it became most evident in our most recent article, which was published only later (hüffmeier et al., 2017). in that most recent article, we found only (very) small effort gains for positions 2 and 3, even in the condition that made effort gains in the relay most likely (i.e., high chances of winning a medal in the relay competition, high valence of the relay competition [e.g., at world championships]; see hüffmeier et al., 2017). specifically, we observed relay swimming times that were only 0.008 s (position 2) and 0.031 s (position 3) faster than the respective individual times—as compared to 0.202 s faster times at relay position 4. while all these differences in swimming times were significantly different from zero in the hüffmeier et al. (2017) study due to its large sample (n = 738 swimmers at positions 2 and 3 in 100 and 200 m races), studies that use smaller samples and aggregate swimming time differences across positions 2 to 4 might obscure otherwise observable faster swimming times, especially at relay position 4.
conclusion
in our discussion of the skorski et al. (2016) article, we have identified the need to clarify the existing literature in one respect and, more importantly, four shortcomings. regarding the clarification: while skorski et al. suggest that controlling for reaction-time advantages in relays is a novel contribution of their study, it is in fact common research practice. turning to the first shortcoming, the authors applied imprecise correction factors. second, they included data from national competitions and—due to the relative unattractiveness of national relay competitions—most likely biased their findings towards slow relay swimming times (cf. hüffmeier et al., 2017; see also karau et al., 2000). this data is not well suited for an unbiased empirical test of the hypothesis that relay swimming is not faster than individual swimming. third, skorski et al. selected an arbitrary and comparably small sample, thereby reducing the chance to find significant differences.
fourth and finally, they only provide aggregated data analyses despite well-documented differences between relay positions. they thus obscure possibly otherwise observable differences between swimming times from relay and individual competitions, especially at the last relay position. given the multitude, weight, and interplay of these four shortcomings, we are convinced that the authors' findings are inconclusive in view of their research question. their data therefore do not question the validity of our findings, and their conclusion—that our findings are "a myth"—is necessarily unfounded.
open science practices
this article earned the open data badge for making the data available through the open science framework project page at https://doi.org/10.17605/osf.io/rq4fv. an editorial assistant independently confirmed that the code reproduced the results presented in the article. the editorial history can also be accessed from the osf project page. the nature of the research question meant that there were no relevant research materials.
references
cohen, j. (1992). a power primer. psychological bulletin, 112, 155-159. doi:10.1037//0033-2909.112.1.155
frick, r. w. (1998). a better stopping rule for conventional statistical tests. behavior research methods, 30, 690-697. doi:10.3758/bf03209488
hüffmeier, j., filusch, m., mazei, j., hertel, g., mojzisch, a., & krumm, s. (2017). on the boundary conditions of effort losses and effort gains in action teams. journal of applied psychology, 102, 1673-1685. doi:10.1037/apl0000245
hüffmeier, j., & hertel, g. (2011). when the whole is more than the sum of its parts: group motivation gains in the wild. journal of experimental social psychology, 47, 455-459. doi:10.1016/j.jesp.2010.12.004
hüffmeier, j., & hertel, g. (in press). effort losses and effort gains in sports teams. in s. j. karau (ed.), individual motivation within groups: social loafing and motivation gains in work, academic, and sports teams. new york: academic press.
hüffmeier, j., kanthak, j., & hertel, g. (2013). specificity of partner feedback as moderator of group motivation gains in olympic swimmers. group processes & intergroup relations, 16, 516-525. doi:10.1177/1368430212460894
hüffmeier, j., krumm, s., kanthak, j., & hertel, g. (2012). "don't let the group down": facets of instrumentality moderate the motivating effects of groups in a field experiment. european journal of social psychology, 42, 533-538. doi:10.1002/ejsp.1875
karau, s. j., markus, m. j., & williams, k. d. (2000). on the elusive search for motivation gains in groups: insights from the collective effort model. zeitschrift für sozialpsychologie, 31, 179-190. doi:10.1024//0044-3514.31.4.179
neugart, m., & richiardi, m. g. (2013). sequential teamwork in competitive environments: theory and evidence from swimming data. european economic review, 63, 186-205. doi:10.1016/j.euroecorev.2013.07.006
osborn, k. a., irwin, b. c., skogsberg, n. j., & feltz, d. l. (2012). the köhler effect: motivation gains and losses in real sports groups. sport, exercise, and performance psychology, 1, 242-253. doi:10.1037/a0026887
saavedra, j. m., garcía-hermoso, a., escalante, y., dominguez, a. m., arellano, r., & navarro, f. (2014). relationship between exchange block time in swim starts and final performance in relay races in international championships. journal of sports sciences, 32, 1783-1789. doi:10.1080/02640414.2014.920099
schleu, j. e., mojzisch, a., & hüffmeier, j. (2018). run! run for the team: an analysis of effort gains in track and field relays. manuscript in preparation.
skorski, s., etxebarria, n., & thompson, k. g. (2016). breaking the myth that relay swimming is faster than individual swimming. international journal of sports physiology and performance, 11, 410-413.
doi:10.1123/ijspp.2014-0577
meta-psychology, 2022, vol 6, mp.2020.2573. https://doi.org/10.15626/mp.2020.2573. article type: tutorial. published under the cc-by4.0 license. open data: not applicable. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: no. edited by: rickard carlsson. reviewed by: daniël lakens, brenton wiernik. analysis reproduced by: lucija batinović. all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/3c6hu
designing studies and evaluating research results: type m and type s errors for pearson correlation coefficient
giulia bertoldo, claudio zandonella callegher, and gianmarco altoè
department of developmental psychology and socialisation, university of padova, padova, italy
abstract
it is widely appreciated that many studies in psychological science suffer from low statistical power. one of the consequences of analyzing underpowered studies with thresholds of statistical significance is a high risk of finding exaggerated effect size estimates, in the right or the wrong direction. these inferential risks can be directly quantified in terms of type m (magnitude) error and type s (sign) error, which directly communicate the consequences of design choices on effect size estimation. given a study design, type m error is the factor by which a statistically significant effect is on average exaggerated. type s error is the probability of finding a statistically significant result in the opposite direction to the plausible one. ideally, these errors should be considered during a prospective design analysis in the design phase of a study to determine the appropriate sample size. however, they can also be considered when evaluating studies' results in a retrospective design analysis.
in the present contribution, we aim to facilitate the consideration of these errors in research practice in psychology. for this reason, we illustrate how to consider type m and type s errors in a design analysis using one of the most common effect size measures in psychology: pearson correlation coefficient. we provide various examples and make the r functions freely available to enable researchers to perform design analysis for their research projects. keywords: correlation coefficient, type m error, type s error, design analysis, effect size
introduction
psychological science is increasingly committed to scrutinizing its published findings by promoting large-scale replication efforts, where the protocol of a previous study is repeated as closely as possible with a new sample (camerer et al., 2016; camerer et al., 2018; ebersole et al., 2016; klein et al., 2014; klein et al., 2018; open science collaboration, 2015). interestingly, many replication studies found smaller effects than the originals (camerer et al., 2018; open science collaboration, 2015) and, among many possible explanations, one relates to a feature of study design: statistical power. in particular, it is plausible for original studies to have lower statistical power than their replications. in the case of underpowered studies, we are usually aware of the lower probability of detecting an effect if it exists, but the less obvious consequences for effect size estimation are often neglected. when underpowered studies are analyzed using thresholds, such as statistical significance levels, effects passing such thresholds are bound to exaggerate the true effect size (button et al., 2013; gelman et al., 2017; ioannidis, 2008; ioannidis et al., 2013; lane & dunlap, 1978).
indeed, as will be extensively shown below, in underpowered studies only large observed effects correspond to values that can reject the null hypothesis and be statistically significant. as a consequence, if the original study was underpowered and found an exaggerated estimate of the effect, the replication effect will likely be smaller. the concept of statistical power finds its natural development in the neyman-pearson framework of statistical inference, and this is the framework that we adopt in this contribution. contrary to null hypothesis significance testing (nhst), the neyman-pearson approach requires defining both the null hypothesis (i.e., usually, but not necessarily, the absence of an effect) and the alternative hypothesis (i.e., the magnitude of the expected effect). further discussion of the neyman and pearson approach and a comparison with the nhst is available in altoè et al. (2020) and gigerenzer et al. (2004). when conducting hypothesis testing, we usually consider two inferential risks: the type i error (i.e., the probability α of rejecting the null hypothesis if it is true) and the type ii error (i.e., the probability β of not rejecting the null hypothesis if it is false). then, statistical power is defined as the probability 1 − β of finding a statistically significant result if the alternative hypothesis is true. all this leads to a narrow focus on statistical significance in hypothesis testing, overlooking another important aspect of statistical inference, namely, effect size estimation. when effect size estimation is conditioned on statistical significance (i.e., effect estimates are evaluated only if their p-values are lower than α), effect size exaggeration is a corollary consequence of low statistical power that might not be evident at first. this point can be highlighted considering the type m (magnitude) and type s (sign) errors characterizing a study design (gelman & carlin, 2014).
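the claim that only large observed effects can reach significance in small samples can be checked numerically. the following sketch is an illustration added here (python with scipy, rather than the r functions provided by the authors): it computes the smallest |r| that can reach two-tailed significance for a given sample size.

```python
# for a two-tailed test of a pearson correlation, the observed r is significant
# only if |r| exceeds a critical value, and that critical value grows as the
# sample shrinks: small-sample significant estimates are necessarily large.
from scipy import stats

def critical_r(n, alpha=0.05):
    """smallest |r| that reaches two-tailed significance with n pairs."""
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)    # critical t for df = n - 2
    return t / (t ** 2 + n - 2) ** 0.5          # back-transform t to the r scale

# with n = 20, only |r| above roughly .44 can be significant, whatever the true
# effect; with n = 200 the bar drops to roughly .14
small_bar = critical_r(20)
large_bar = critical_r(200)
```

this makes the conditioning-on-significance mechanism concrete: whatever the true correlation, a significant estimate from n = 20 cannot be smaller than about .44 in absolute value.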
given a study design (i.e., sample size, statistical test directionality, α level, and plausible effect size formalization), type m error, also known as exaggeration ratio, indicates the factor by which a statistically significant effect would be, on average, exaggerated. type s error indicates the probability of finding a statistically significant effect in the opposite direction to the one considered plausible. the analysis that researchers perform to evaluate the type m and type s errors in their research practice is called design analysis, given its special focus on the design of a study (altoè et al., 2020; gelman & carlin, 2014). both errors are defined starting from a reasoned guess about the plausible magnitude and direction of the effect under study, which is called the plausible effect size (gelman & carlin, 2014). a plausible effect size is an assumption the researchers make about the expected effect in the population. it should not be based on noisy results from a pilot study but, rather, could derive from an extensive evaluation of the literature (e.g., theoretical or literature reviews and meta-analyses). when considering the published literature to define the plausible effect size, however, it is important to take into account the presence of publication bias (franco et al., 2014) and to consider techniques for adjusting for the possible inflation of effect size estimates (anderson et al., 2017). for example, if, after taking possible inflation into account, the main results in a given topic with a specific experimental design indicate that the correlation ranges between r = .15 and r = .25, we could reasonably choose as the plausible effect size a value within this range. or, even better, we could consider multiple values to evaluate the results under different scenarios.
note that the definition of the plausible effect size is inevitably highly context-dependent, so any attempt to provide reference values would not be useful; on the contrary, it would prevent researchers from reasoning about the phenomenon of interest. even in extreme cases where no previous information is available, which would call into question the exploratory/confirmatory nature of the study, researchers could still evaluate which effect would be considered relevant (e.g., from a clinical or economic perspective) and define the plausible effect size accordingly. why do these errors matter? the concepts of type m and type s errors allow enhancing researchers' awareness of a complex process such as statistical inference. strictly speaking, design analysis used in the design phase of a study provides similar information as classical power analysis; indeed, a given level of power corresponds to particular type m and type s errors. however, it is a valuable conceptual framework that can help researchers understand the important role of statistical power both when designing a new study and when evaluating previous results from the literature. in particular, it highlights the unwanted (and often overlooked) consequences for effect estimation when filtering for statistical significance in underpowered studies. in these scenarios, there is not only a lower probability of rejecting the null when it is actually false but, even more importantly, any significant result would most likely lead to a misleading overestimation of the actual effect. the exaggeration of effect sizes, in the right or the wrong direction, has important implications on a theoretical and applied level. on a theoretical level, study designs with high type m and type s errors can foster distorted expectations about the effect under study, triggering a vicious cycle in the planning of future studies.
this point is also relevant for the design of replication studies, which could turn out to be underpowered if they do not take into account possible inflations of the original effect (button et al., 2013). when studies are used to inform policymaking and real-world interventions, implications can go beyond the academic research community and can impact society at large. in these settings, we may witness a "hype and disappointment cycle" (gelman, 2019b), where true effects turn out to be much less impressive than expected. this can produce undesirable consequences for people's lives, a consideration that invites researchers to assume responsibility for effectively communicating the risks related to effect quantification. to our knowledge, type m (magnitude) and type s (sign) errors are not widely known in the psychological research community, but their consideration during the research process has the potential to improve current research practices, for example, by increasing awareness of the consequences design choices have for possible study results. in a previous work, we illustrated type m and type s errors using cohen's d as a measure of effect size (altoè et al., 2020). the purpose of the present contribution is to further increase familiarity with type m and type s errors, considering another common effect size measure in psychology: pearson correlation coefficient, ρ. we aim to provide an accessible introduction to the design analysis framework and enhance the understanding of type m and type s errors using several educational examples. the rest of this article is organized as follows: introduction to type m and type s errors; description of what a design analysis is and how to conduct one; analysis of type s and type m errors when varying alpha levels and hypothesis directionality.
moreover, the two appendices present further implications of design analysis for pearson correlation (appendix a) and an extensive illustration of the r functions for design analysis for pearson correlation (appendix b).
type m and type s errors
pearson correlation coefficient is a standardized effect size measure indicating the strength and the direction of the relationship between two continuous variables (cohen, 1988; ellis, 2010). even though the correlation coefficient is widely known, we briefly go over its main features using an example. imagine that we were interested in measuring the relationship between anxiety and depression in a population and we plan a study with n participants, where, for each participant, we measure the level of anxiety (i.e., variable x) and the level of depression (i.e., variable y). at the end of the study, we will have n pairs of values x and y. the correlation coefficient helps us answer the questions: how strong is the linear relationship between anxiety and depression in this population? is the relationship positive or negative? correlation ranges from -1 to +1, indicating respectively the two extreme scenarios of perfect negative relationship and perfect positive relationship.1 the correlation coefficient is a dimensionless number: it is a signal-to-noise ratio where the signal is given by the covariance between the two variables (cov(x, y)) and the noise is expressed by the product of the standard deviations of the two variables (s_x s_y; see formula 1). in this contribution, following conventional standards, we will use the symbol ρ to indicate the correlation in the population and the symbol r to indicate the value measured in a sample.

r = cov(x, y) / (s_x s_y) . (1)

magnitude and sign are two important features characterizing pearson correlation coefficient and effect size measures in general. and, when estimating effect sizes, errors can be committed precisely regarding these two aspects.
gelman and carlin (2014) introduced two indexes to quantify these risks:
• type m error, where m stands for magnitude (also called exaggeration ratio): the factor by which a statistically significant effect is on average exaggerated.
• type s error, where s stands for sign: the probability of finding a statistically significant result in the opposite direction to the plausible one.
1 correlation indicates a relationship between variables but does not imply causation. we do not discuss this relevant aspect here, but we refer the interested reader to rohrer (2018).
note that, differently from the other inferential errors, type m error is not a probability but rather a ratio indicating the average percentage of inflation. how are these errors computed? in the next paragraphs, we approach this question from an intuitive perspective. for a formal definition of these errors, we refer the reader to altoè et al. (2020), gelman and carlin (2014), and lu et al. (2018). take as an example the previous fictitious study on the relationship between anxiety and depression and imagine we decide to sample 50 individuals (sample size, n = 50), to set the α level to 5%, and to perform a two-tailed test. based on theoretical considerations, we expect the plausible true correlation in the population to be quite strong and positive, which we formalize as ρ = .50. to evaluate the type m and type s errors in this research design, imagine repeating the same study many times with new samples drawn from the same population and, for each study, registering the observed correlation (r) and the corresponding p-value. the first step to compute type m error is to select only the observed correlation coefficients that are statistically significant, take their absolute values (for the moment, we do not care about the sign), and calculate their mean.
type m error is given by the ratio between this mean (i.e., the mean of the statistically significant correlation coefficients in absolute value) and the plausible effect hypothesized at the beginning, which in this example is ρ = .50. thus, given a study design, type m error tells us the average overestimation of an effect that is statistically significant. type s error is computed as the proportion of statistically significant results that have the opposite sign compared to the plausible effect size. in the present example we hypothesized a positive relationship, specifically ρ = .50. then, type s error is the ratio between the number of times we observed a negative statistically significant result and the total number of statistically significant results. in other words, type s error indicates the probability of obtaining a statistically significant result in the opposite direction to the one hypothesized. the central and possibly most difficult point in this process is reasoning about the plausible magnitude and direction of the effect of interest. this critical process, which is central also in traditional power analysis, is an opportunity for researchers to aggregate, formalize and incorporate prior information on the phenomenon under investigation (gelman & carlin, 2014). what is plausible can be determined on theoretical grounds, using expert knowledge elicitation techniques (see for example o’hagan, 2019) and by consulting literature reviews and meta-analyses, always taking into account the presence of effect size inflation in the published literature (anderson, 2019). given these premises, it is important to stress that a plausible effect size should not be determined by considering the results of a single study, given the high level of uncertainty associated with such an effect size estimate.
the idea is that the plausible effect size should approximate the true effect, which, although never known, can be thought of as “that which would be observed in a hypothetical infinitely large sample” (gelman & carlin, 2014, p. 642). for a more exhaustive description of plausible effect size, we refer the interested reader to altoè et al. (2020) and gelman and carlin (2014). before we proceed, it is worth noting that there are other recent valuable tools that start from different premises for designing and evaluating studies. among others, we refer the interested reader to methods that start from the definition of the smallest effect size of interest (sesoi; for a tutorial, see lakens, scheel, et al., 2018).

design analysis

researchers can consider type m and type s errors in their practice by performing a design analysis (altoè et al., 2020; gelman & carlin, 2014). ideally, a design analysis should be performed when designing a study. in this phase, it is specifically called prospective design analysis and it can be used as a sample size planning strategy where statistical power is considered together with type m and type s errors. however, design analysis can also be beneficial for evaluating the inferential risks in studies that have already been conducted and whose design is known. in these cases, type m and type s errors can support the interpretation of results by communicating the inferential risks of that research design. when design analysis happens at this later stage, it takes the name of retrospective design analysis. note that retrospective design analysis should not be confused with post-hoc power analysis. a retrospective design analysis defines the plausible effect size according to previous results in the literature or other information external to the study, whereas post-hoc power analysis defines the plausible effect size based on the observed results of the study and is a widely deprecated practice (gelman, 2019a; goodman & berlin, 1994).
in the following sections, we illustrate how to perform prospective and retrospective design analysis using some examples. we developed two r functions to perform design analysis for pearson correlation, which are available at https://osf.io/9q5fr/. the function to perform a prospective design analysis is pro_r(). it requires as input the plausible effect size (rho), the statistical power (power), and the directionality of the test (alternative), which can be set to “two.sided”, “less” or “greater”. the type i error rate (sig_level) is set by default at 5% and can be changed by the user. the pro_r() function returns the sample size necessary to achieve the desired statistical power, the type m error rate, the type s error probability, and the critical value(s) above which a statistically significant result can be found. the function to perform a retrospective design analysis is retro_r(). it requires as input the plausible effect size, the sample size used in the study, and the directionality of the test that was performed. also in this case, the type i error rate is set by default at 5% and can be changed by the user. the function retro_r() returns the type m error rate, the type s error probability, and the critical value(s). for further details regarding the r functions, refer to appendix b. all code and materials are also available in a codeocean capsule at https://codeocean.com/capsule/7935517.

case study

to familiarize the reader with type m and type s errors, we start our discussion with a retrospective design analysis of a published study. however, the ideal temporal sequence in the research process would be to perform a prospective design analysis in the planning stage of a research project. this is the time when the design is being laid out and useful improvements can be made to obtain more robust results.
in this contribution, the order of presentation aims first to provide an understanding of how to interpret type m and type s errors, and then to discuss how they can be taken into account. the following case study was chosen for illustrative purposes only, and our objective is by no means to judge the study beyond illustrating how to calculate type m and type s errors for a published study. we consider the study published in science by eisenberger et al. (2003) entitled “does rejection hurt? an fmri study of social exclusion”. the research question originated from the observation that the anterior cingulate cortex (acc) is a region of the brain known to be involved in the experience of physical pain. could pain from social stimuli, such as social exclusion, share similar neural underpinnings? to test this hypothesis, 13 participants were recruited and each one had to play a virtual game with two other players while undergoing functional magnetic resonance imaging (fmri). the other two players were fictitious, and participants were actually playing against a computer program. players had to toss a virtual ball among each other in three conditions: social inclusion, explicit social exclusion and implicit social exclusion. in the social inclusion condition, the participant regularly received the ball. in the explicit social exclusion condition, the participant was told that, due to technical problems, they were not going to play that round. in the implicit social exclusion condition, the participant experienced being intentionally left out of the game by the other two players. at the end of the experiment, each participant completed a self-report measure of their perceived distress when they were intentionally left out by the other players. considering only the implicit social exclusion condition, a correlation coefficient was estimated between the measure of distress and neural activity in the anterior cingulate cortex.
as suggested by the large and statistically significant correlation between perceived distress and activity in the acc, r = .88, p < .005 (eisenberger et al., 2003, p. 291), the authors concluded that social and physical pain seem to share similar neural underpinnings. before proceeding to the retrospective design analysis, we refer the interested reader to some background history regarding this study. this was one of the many studies included in the famous paper “puzzlingly high correlations in fmri studies of emotion, personality, and social cognition” (vul et al., 2009), which raised important issues regarding the analysis of neuroscientific data. in particular, that paper noted that the magnitude of the correlation coefficients between fmri measures and behavioural measures was beyond what could be considered plausible. we also refer the interested reader to the commentary by yarkoni (2009), who noted that the implausibly high correlations in fmri studies could be largely explained by the low statistical power of the experiments. a retrospective design analysis should start with thorough reasoning on the plausible size and direction of the effect under study. to produce valid inferences, a lot of attention should be devoted to this point by integrating external information. for the sake of this example, we turn to the considerations made by vul and pashler (2017), who suggested that correlations between personality measures and neural activity are likely to be around ρ = .25. a correlation of ρ = .50 was deemed plausible but optimistic, and a correlation of ρ = .75 was considered theoretically plausible but unrealistic.

² an r package was subsequently developed and is now available on cran (prda: conduct a prospective or retrospective design analysis, https://cran.r-project.org/web/packages/prda/index.html). prda contains other features of design analysis that are beyond the aim of the present paper.

³ critical value is the name usually employed in hypothesis testing within the neyman-pearson framework. in research practice, this is also known as the minimal statistically detectable effect (cook et al., 2014; phillips et al., 2001).

retrospective design analysis

to perform a retrospective design analysis on the case study, we need information on the research design and the plausible effect size. based on the previous considerations, we set the plausible effect size to ρ = .25. information on the sample size was not available in the original study (eisenberger et al., 2003) and was retrieved from vul et al. (2009) to be n = 13. the α level and the directionality of the test were not reported in the original study, so for the purpose of this example we will consider α = .05 and a two-tailed test. given this study design, what are the inferential risks in terms of effect size estimation? we can use the r function retro_r(), whose inputs and outputs are displayed in figure 1. in this study, the statistical power is .13; that is to say, there is a 13% probability of rejecting the null hypothesis if an effect of at least ρ = .25 in absolute value exists. consider this point together with the results obtained in the experiment: r = .88, p < .005 (eisenberger et al., 2003, p. 291). it is clear that, even though the probability of rejecting the null hypothesis is low (power of 13%), this event could happen. and when it does happen, it is tempting to believe that the results are even more remarkable (gelman & loken, 2014).
however, this design comes with serious inferential risks for the estimation of effect sizes, which can be grasped by presenting type m and type s errors. a glance at their values communicates that it is not impossible to find a statistically significant result, but when it does happen, the effect size could be largely overestimated (type m = 2.58) and maybe even in the wrong direction (type s = .03). the type m error rate of 2.58 indicates that a statistically significant correlation is on average about two and a half times the plausible value. in other words, statistically significant results emerging from such a research design will on average overestimate the plausible correlation coefficient by about 160%. the type s error of .03 suggests that there is a three percent probability of finding a statistically significant result in the opposite direction, in this example a negative relationship. in this research design, the critical values above which a statistically significant result is declared correspond to r = ±.55 (figure 1). these values are highlighted in figure 2 as the vertical lines in the sampling distribution of correlation coefficients under the null hypothesis. notice that the plausible effect size lies in the region of acceptance of the null hypothesis. therefore, it is impossible to simultaneously find a statistically significant result and estimate an effect close to the plausible one (ρ = .25). the figure represents the so-called winner’s curse: “the ‘lucky’ scientist who makes a discovery is cursed by finding an inflated estimate of that effect” (button et al., 2013).

figure 1. input and output of the function retro_r() for retrospective design analysis. case study: eisenberger et al. (2003). the plausible correlation coefficient is ρ = .25, the sample size is 13, and the statistical test is two-tailed. the option seed allows setting the random number generator to obtain reproducible results.
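values of the kind reported in figure 1 can also be approximated without simulation. the following python sketch (our own approximation, not the authors' retro_r()) works on the fisher z scale and uses conditional means of truncated normals in place of repeated sampling; taking tanh of a conditional mean is a simplification, so the results are close to but not identical with the simulation-based r output:

```python
import math
import statistics

def approx_design(rho, n, alpha=0.05):
    """closed-form approximation of power, type m and type s errors
    for a two-tailed test of h0: rho = 0, on the fisher z scale."""
    nd = statistics.NormalDist()
    zr, se = math.atanh(rho), 1 / math.sqrt(n - 3)
    zc = nd.inv_cdf(1 - alpha / 2) * se        # critical value on the z scale
    a, b = (zc - zr) / se, (-zc - zr) / se
    p_up, p_low = 1 - nd.cdf(a), nd.cdf(b)     # significance in each direction
    power = p_up + p_low
    e_up = zr + se * nd.pdf(a) / p_up          # mean significant z, right tail
    e_low = zr - se * nd.pdf(b) / p_low        # mean significant z, left tail
    type_m = (p_up * math.tanh(e_up) + p_low * abs(math.tanh(e_low))) / power / rho
    type_s = p_low / power
    return power, type_m, type_s

power, type_m, type_s = approx_design(rho=0.25, n=13)
print(round(power, 3), round(type_m, 2), round(type_s, 3))
```

for the case-study design (ρ = .25, n = 13, two-tailed α = .05), this approximation lands near the figure 1 values of power ≈ .13, type m ≈ 2.6 and type s ≈ .02–.03.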
prospective design analysis

ideally, type m and type s errors should be considered in the design phase of a study, during the decision-making process regarding the experimental protocol. at this stage, prospective design analysis can be used as a sample size planning strategy which aims to minimize type m and type s errors in the upcoming study. imagine that we were part of the research team in the previous case study exploring the relationship between activity in the anterior cingulate cortex and perceived distress. when drafting the research protocol, we face the inevitable discussion on how many participants we are going to recruit. this choice depends on available resources, the type of study design, constraints of various natures and, importantly, the plausible magnitude and direction of the phenomenon that we are going to study. as previously mentioned, deciding on a plausible effect size is a fundamental step which requires great effort and should not be done by trying different values only to obtain a more desirable sample size. instead, proposing a plausible effect size is where the expert knowledge of the researcher can be formalized and can greatly contribute to the informativeness of the study that is being planned.

figure 2. winner’s curse. h0 = null hypothesis, h1 = alternative hypothesis. when the sample size, the directionality of the test and the type i error probability are set, the smallest effect size above which it is possible to find a statistically significant result is also set. in this case, the plausible effect size, ρ = .25, lies in the region where it is not possible to reject h0 (the region delimited by the two vertical lines). thus, it is impossible to simultaneously find a statistically significant result and an effect close to the plausible one. in other words, a statistically significant effect must exaggerate the plausible effect size.

for the sake of these examples, we adopt the previous considerations and suppose that common agreement is reached on a plausible correlation coefficient of around ρ = .25. finally, we would like to leave open the possibility of exploring whether the relationship goes in the opposite direction to the one hypothesized, so we decide to perform a two-tailed test. we can implement the prospective design analysis using the function pro_r(), whose inputs and outputs are displayed in figure 3. about 125 participants are necessary to have an 80% probability of detecting an effect of at least ρ = .25 in absolute value, if it actually exists. with this sample size, the type s error is minimized and approximates zero. in this study design, the type m error is 1.11, indicating that statistically significant results are on average exaggerated by 11%. it is possible to notice that the critical values are r = ±.18, further highlighting that our plausible effect size is included among those values that lead to the rejection of the null hypothesis. in a design analysis, it is advisable to investigate how the inferential risks would change under different scenarios in terms of statistical power and plausible effect size. changes in both these factors impact type m and type s errors. for example, maintaining the plausible correlation of ρ = .25, if we decrease statistical power from .80 to .60, only 76 participants are required (see table 1). however, this is associated with an increase in the type m error rate from 1.11 to 1.28. that is to say, with 76 subjects the plausible effect size will be on average overestimated by 28%. alternatively, imagine that we would like to maintain a statistical power of 80%: what happens if the plausible effect size is slightly larger or smaller? the necessary sample size would spike to 344 for ρ = .15 and decrease to 60 for ρ = .35.
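sample sizes of this order can be approximated without simulation via the fisher z transformation. the following python sketch is our own approximation, not the authors' pro_r() (which is simulation-based), so the returned values can differ from the paper's by a few participants:

```python
import math
import statistics

def approx_n(rho, power=0.80, alpha=0.05):
    """approximate n for a two-tailed test of h0: rho = 0,
    using the standard fisher z sample size formula."""
    nd = statistics.NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(rho)) ** 2 + 3)

for rho in (0.15, 0.25, 0.35):
    print(rho, approx_n(rho))
```

for ρ = .25 this gives roughly 124 participants, in line with the ~125 reported above; the approximations for ρ = .15 and ρ = .35 likewise land near the 344 and 60 of table 1.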
in both scenarios, the type m error remains about 1.12, which reflects the more general point that for 80% power, type m error is around 1.10. in all these scenarios, type s error is close to zero, hence not worrisome.

figure 3. input and output of the function pro_r() for prospective design analysis. the plausible correlation coefficient is ρ = .25, the statistical power is 80% and the statistical test is two-tailed. the option seed allows setting the random number generator to obtain reproducible results.

table 1
prospective design analysis in different scenarios of plausible effect size and statistical power.

ρ      power   sample size   type m   type s   critical r value
0.25   0.6     76            1.280    0        ±0.226
0.15   0.8     344           1.116    0        ±0.106
0.35   0.8     60            1.115    0        ±0.254

note: in all cases, alternative = "two.sided" and sig_level = .05.

figure 4. how type m, type s and statistical power vary as a function of sample size in three different scenarios of plausible effect size (ρ = .25, ρ = .50, ρ = .75). note that, for the sake of interpretability, we decided to use different scales for both the x-axis and y-axis in the three scenarios of plausible effect size.

for completeness, figure 4 summarizes the relationship between statistical power, type m and type s errors as a function of sample size in three scenarios of plausible correlation coefficients. we display the three values that vul and pashler (2017) considered for correlations between fmri measures and behavioural measures with different degrees of plausibility.
an effect of ρ = .75 was deemed theoretically plausible but unrealistic, ρ = .50 was more plausible but optimistic, and ρ = .25 was more likely. the curves illustrate a general point: type m and type s errors increase with smaller sample sizes, smaller plausible effect sizes and lower statistical power. the figure also shows that statistical power, type m and type s errors are related to each other: as power increases, type m and type s errors decrease. at first, it might seem that type m and type s errors are redundant with the information provided by statistical power. even though they are related, we believe that type m and type s errors bring added value during the design phase of a research protocol because they facilitate a connection between how a study is planned and how its results will actually be evaluated. that is to say, the final results will comprise a test statistic with an associated p-value and an effect size measure. if the interest is in maximizing the accuracy with which effects will be estimated, then type m and type s errors directly communicate the consequences of design choices on effect size estimation.

varying α levels and hypotheses directionality

so far, we have not discussed two other important decisions that researchers have to make when designing a study: the statistical significance threshold, or α level, and the directionality of the statistical test, one-tailed or two-tailed. in this section, we illustrate how different choices regarding these aspects impact type m and type s errors. a lot has been written regarding the automatic adoption of a conventional α level of 5% (e.g., gigerenzer et al., 2004; lakens, adolfi, et al., 2018). this practice is increasingly discouraged, and researchers are invited to think about the best trade-off between α level and statistical power, considering the aim of the study and the available resources. the α level impacts type m and type s errors as much as it impacts statistical power.
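how the critical correlation value moves with the α level can be sketched in a few lines of python. this uses the fisher z approximation (our simplification; the exact critical values come from the t distribution, so at this small n the approximation runs slightly below the exact values, especially for very small α):

```python
import math
import statistics

def critical_r(n, alpha, two_tailed=True):
    """approximate critical |r| for a test of h0: rho = 0 (fisher z scale)."""
    q = 1 - alpha / 2 if two_tailed else 1 - alpha
    z = statistics.NormalDist().inv_cdf(q)
    return math.tanh(z / math.sqrt(n - 3))

# case-study design: n = 13, two-tailed, over a range of alpha levels
for alpha in (0.10, 0.05, 0.01, 0.005, 0.001):
    print(alpha, round(critical_r(13, alpha), 3))
```

as α shrinks, the critical |r| grows, which is exactly why a stricter threshold worsens the type m error in an underpowered design: only ever-larger sample correlations can clear the bar.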
everything else equal, type m error increases with decreasing α level (i.e., a negative relationship), whereas type s error decreases with decreasing α level (i.e., a positive relationship). to further illustrate the relation between type m error and α level, let us take as an example the previous case study with a sample of 13 participants, a plausible effect size of ρ = .25 and a two-tailed test. table 2 shows that by lowering the α level from 10% to 0.1%, the critical values move from r = ±.48 to r = ±.80. this suggests that, with these new higher thresholds, the exaggeration of effects will be even more pronounced, because effects have to be even larger to pass such higher critical values. the relationship between type s error and α level, instead, can be clarified by noting that by lowering the statistical significance threshold we become less likely to falsely reject the null hypothesis in general, which implies that we are also less likely to falsely reject the null hypothesis in the wrong direction.

table 2
how changes in α level impact power, type m error, type s error and critical values.

α-level   power   type m   type s   critical r value
0.100     0.212   2.369    0.040    ±0.476
0.050     0.127   2.583    0.028    ±0.553
0.010     0.035   2.977    0.011    ±0.684
0.005     0.021   3.088    0.014    ±0.726
0.001     0.005   3.340    0.000    ±0.801

note: in all cases, ρ = .25, n = 13, and alternative = "two.sided".

another important choice in study design is the directionality of the test (i.e., one-tailed or two-tailed). design analysis invites reasoning on the plausible effect size and hypothesizing the direction of the effect, not only its magnitude. so why should a researcher perform a non-directional statistical test when there is a hypothesized direction? performing a two-tailed test leaves open the possibility of finding an unexpected result in the opposite direction (cohen, 1988), a possibility which may be of special interest for preliminary exploratory studies.
however, in more advanced stages of a research program (i.e., confirmatory studies), directional hypotheses benefit from higher statistical power and lower type m error rates (figure 5). as an example, let us consider the differences between a two-tailed test and a one-tailed test in the previous case study. we can perform a new prospective design analysis (figure 6) with a plausible correlation of ρ = .25 and 80% statistical power, but this time setting the argument alternative in the r function to “greater”. a comparison of the two prospective design analyses, figure 3 and figure 6, suggests that the same type m error rate of about 1.10 (i.e., an average exaggeration of about 10%) is guaranteed with 94 participants, instead of the 125 subjects necessary with a two-tailed test. note that a type s error is not possible in directional statistical tests; indeed, all the statistically significant results are obtainable only in the hypothesized direction, not the opposite one. valid conclusions require decisions on test directionality and α level to be taken a priori, not while data are being analyzed (cohen, 1988). these decisions can take place during a prospective design analysis, which aligns with the increasing interest in psychological science in transparently communicating and justifying design choices through the preregistration of studies in public repositories (e.g., open science framework; aspredicted.com). preregistration of a study’s protocol is particularly valuable for researchers endorsing an error statistics philosophy of science, where the evaluation of research results takes into account the severity with which claims are tested (lakens, 2019; mayo, 2018). severity depends on the degree to which a research protocol tries to falsify a claim. for example, a one-tailed statistical test provides greater severity than a two-tailed statistical test.
as noted by lakens (2019), preregistration is important to openly share a priori decisions, such as test directionality, providing valuable information for researchers interested in evaluating the severity of research claims.

publication bias and significance filter

on a concluding note, we would like to clarify the relationship of design analysis with publication bias and the statistical significance filter. while publication bias and type m and type s errors are related, they operate at two different levels. publication bias refers to a publication system that favours statistically significant results over non-statistically significant findings. this phenomenon alone cannot explain the presence of exaggerated effects. imagine if all studies in the literature were conducted with high statistical power: then statistically significant findings would probably not be so extreme. the problem of exaggerated effect sizes in the literature can be explained only by a combination of publication bias with studies that have low statistical power.

figure 5. comparison of type m error rate and power level between one-tailed and two-tailed tests with ρ = .25, α = .05. n = sample size.

figure 6. input and output of the function pro_r() for prospective design analysis. the plausible correlation coefficient is ρ = .25, the statistical power is 80% and the statistical test is one-tailed.

as previously shown, statistical power and type m and type s errors are related to each other: low statistical power corresponds to higher type m and type s errors. the critical element is the application of the statistical significance filter without taking into account statistical power. design analysis per se does not solve this issue but, instead, it allows us to recognize its problematic consequences.
in the same way as statistical power is a characteristic of a study design, so are type m and type s errors. however, the two are qualitatively different in terms of the kind of reasoning they favour. statistical power is defined in terms of the probability of rejecting the null hypothesis and, even though this is based on an effect size of interest, the relationship “low power, high possibility of exaggeration” may not be straightforward for everyone. instead, type m and type s errors directly quantify the possible exaggeration. furthermore, their consideration protects against another possible pitfall. when a statistically significant result is found in a study and the associated effect size estimate is large, the finding could be interpreted as robust and impressive. however, this interpretation is not always appropriate. here, the missing piece of information is statistical power. if power is considered, researchers would realize that a large effect was found in a context where there was a low probability of finding it. but this interpretation still leaves an important aspect unstated: in these conditions, the only way to find a statistically significant result is by overestimating the true effect. on the contrary, this consequence becomes immediately clear once type m and type s errors are considered retrospectively. similarly, considering type m and type s errors prospectively favours reasoning in terms of effect size rather than the probability of rejecting the null hypothesis when setting the sample size in a design analysis.

discussion and conclusion

in the scientific community, the idea that the literature is affected by a problem of effect size exaggeration is quite widespread. this issue is usually explained in terms of studies’ low statistical power combined with the use of thresholds of statistical significance (button et al., 2013; ioannidis, 2008; ioannidis et al., 2013; lane & dunlap, 1978; yarkoni, 2009; young et al., 2008).
statistically significant results can be obtained even in underpowered studies, and it is precisely in these cases that we should worry the most about issues of overestimation. type m and type s errors quantify and highlight the inferential risks directly in terms of effect size estimation, risks which are implied by the concept of statistical power but might not be recognizable outright. so far, only a handful of papers have explicitly mentioned type m and type s errors (altoè et al., 2020; gelman, 2018; gelman & carlin, 2013, 2014; gelman et al., 2017; gelman & tuerlinckx, 2000; lu et al., 2018; vasishth et al., 2018). with the broader goal of facilitating their consideration in psychological science, in the present contribution we illustrated how type m and type s errors are considered in a design analysis using one of the most common effect size measures in psychology, the pearson correlation coefficient. peculiar to design analysis is the focus on the implications of design choices for effect size estimation rather than statistical significance only. we illustrated how type m and type s errors can be taken into account with a prospective design analysis. in the planning stage of a research project, design analysis has the potential to increase researchers’ awareness of the consequences that their sample size choices have on the uncertainty of the final estimates of the effects. this favours reasoning in terms similar to those in which results will be evaluated, that is to say, effect size estimation. but understanding the inferential risks in a study design is also beneficial once results are obtained. we presented a retrospective design analysis of a published study, and the same process can be useful for studies in general, especially those that ended without the sample size necessary to maximize statistical power and minimize type m and type s errors. in all cases, presenting their values effectively communicates the uncertainty of the results.
in particular, type m and type s errors raise a red flag when results are statistically significant but the effect size could be largely overestimated and in the wrong direction. finally, both prospective and retrospective design analysis favour cumulative science by encouraging the incorporation of expert knowledge in the definition of plausible effect sizes. it is important to remark that, even if design analysis is based on the definition of a plausible effect size, a best practice is to conduct multiple design analyses considering different scenarios, which include different plausible effect sizes and levels of power, to maximize the informativeness of both a prospective and a retrospective analysis. to make design analysis accessible to the research community, we provide the r functions to perform prospective and retrospective design analysis for the pearson correlation coefficient at https://osf.io/9q5fr/, together with a short guide on how to use the r functions and a summary of the examples presented in this contribution (appendix b). finally, prospective design analysis can contribute to better research design; however, many other important factors were not considered in this contribution. for example, the validity and reliability of measurements should be at the forefront in research design, and careful planning of the entire research protocol is of the utmost importance. future work could tackle some of these shortcomings, for example by including an analysis of the impact of measurement quality on the estimates of type m and type s errors. also, we believe that it would be valuable to provide extensions of design analysis to other common effect size measures, with the development of statistical software packages that could be directly used by researchers. moreover, design analysis for the pearson correlation can be easily extended to the multivariate case where multiple predictors are considered.
lastly, design analysis is not limited to the neyman-pearson framework but can also be considered within other statistical approaches, such as the bayesian approach. future work could implement design analysis to evaluate the inferential risks related to the use of bayes factors and bayesian credibility intervals. summarizing, choices regarding a study's design affect effect size estimation, and type m (magnitude) and type s (sign) errors allow researchers to directly quantify these inferential risks. considering them in a prospective design analysis increases awareness of the consequences of sample size choices, using the same terms in which results will be evaluated. retrospective design analysis, instead, provides further guidance on interpreting research results. more broadly, design analysis reminds researchers that statistical inference should start before data collection and does not end when results are obtained.

author contact

giulia bertoldo: 0000-0002-6960-3980; claudio zandonella callegher: 0000-0001-7721-6318; gianmarco altoè: 0000-0003-1154-9528. corresponding author: gianmarco altoè, department of developmental psychology and socialization, university of padova, via venezia 8, 35131 padova, italy, gianmarco.altoe@unipd.it

conflict of interest and funding

we have no known conflict of interest to disclose.

author contributions

gb and ga conceived the original idea. gb drafted the paper. cz contributed to the development of the original idea and drafted sections of the manuscript. cz and ga wrote the r functions. all authors took care of the statistical analyses and contributed to the manuscript revision, read, and approved the submitted version.

open science practices

this article earned the open materials badge for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article.
the entire editorial process, including the open reviews, is published in the online supplement.

references

altoè, g., bertoldo, g., zandonella callegher, c., toffalini, e., calcagnì, a., finos, l., & pastore, m. (2020). enhancing statistical inference in psychological research via prospective and retrospective design analysis. frontiers in psychology, 10. https://doi.org/10.3389/fpsyg.2019.02893
anderson, s. f. (2019). best (but oft forgotten) practices: sample size planning for powerful studies. the american journal of clinical nutrition, 110(2), 280–295. https://doi.org/10.1093/ajcn/nqz058
anderson, s. f., kelley, k., & maxwell, s. e. (2017). sample-size planning for more accurate statistical power: a method adjusting sample effect sizes for publication bias and uncertainty. psychological science, 28(11), 1547–1562. https://doi.org/10.1177/0956797617723724
button, k., ioannidis, j., mokrysz, c., nosek, b., flint, j., robinson, e., & munafò, m. (2013). power failure: why small sample size undermines the reliability of neuroscience. nature reviews neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475
camerer, c. f., dreber, a., forsell, e., ho, t.-h., huber, j., johannesson, m., kirchler, m., almenberg, j., altmejd, a., chan, t., heikensten, e., holzmeister, f., imai, t., isaksson, s., nave, g., pfeiffer, t., razen, m., & wu, h. (2016). evaluating replicability of laboratory experiments in economics. science, 351(6280), 1433–1436. https://doi.org/10.1126/science.aaf0918
camerer, c. f., dreber, a., holzmeister, f., ho, t.-h., huber, j., johannesson, m., kirchler, m., nave, g., nosek, b. a., pfeiffer, t., altmejd, a., buttrick, n., chan, t., chen, y., forsell, e., gampa, a., heikensten, e., hummer, l., imai, t., . . . wu, h. (2018). evaluating the replicability of social science experiments in nature and science between 2010 and 2015. nature human behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z
cohen, j. (1988).
statistical power analysis for the behavioral sciences. lawrence erlbaum associates. https://doi.org/10.4324/9780203771587
cook, j., hislop, j., adewuyi, t., harrild, k., altman, d., ramsay, c., fraser, c., buckley, b., fayers, p., harvey, i., briggs, a., norrie, j., fergusson, d., ford, i., & vale, l. (2014). assessing methods to specify the target difference for a randomised controlled trial: delta (difference elicitation in trials) review. health technol assess, 18(28). https://doi.org/10.3310/hta18280
ebersole, c. r., atherton, o. e., belanger, a. l., skulborstad, h. m., allen, j. m., banks, j. b., baranski, e., bernstein, m. j., bonfiglio, d. b., boucher, l., brown, e. r., budiman, n. i., cairo, a. h., capaldi, c. a., chartier, c. r., chung, j. m., cicero, d. c., coleman, j. a., conway, j. g., . . . nosek, b. a. (2016). many labs 3: evaluating participant pool quality across the academic semester via replication. journal of experimental social psychology, 67, 68–82. https://doi.org/10.1016/j.jesp.2015.10.012
eisenberger, n. i., lieberman, m. d., & williams, k. d. (2003). does rejection hurt? an fmri study of social exclusion. science, 302(5643), 290–292. https://doi.org/10.1126/science.1089134
ellis, p. d. (2010). the essential guide to effect sizes. cambridge university press. https://doi.org/10.1017/cbo9780511761676
fisher, r. a. (1915). frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. biometrika, 10(4), 507. https://doi.org/10.2307/2331838
franco, a., malhotra, n., & simonovits, g. (2014). publication bias in the social sciences: unlocking the file drawer. science, 345(6203), 1502. https://doi.org/10.1126/science.1255484
gelman, a. (2018). the failure of null hypothesis significance testing when studying incremental changes, and what to do about it. personality and social psychology bulletin, 44(1), 16–23.
https://doi.org/10.1177/0146167217729162
gelman, a. (2019a). don't calculate post-hoc power using observed estimate of effect size. annals of surgery, 269(1), e9–e10. https://doi.org/10.1097/sla.0000000000002908
gelman, a. (2019b). from overconfidence in research to over certainty in policy analysis: can we escape the cycle of hype and disappointment? new america. retrieved may 29, 2020, from http://newamerica.org/public-interest-technology/blog/overconfidence-research-over-certainty-policy-analysis-can-we-escape-cycle-hype-and-disappointment/
gelman, a., & carlin, j. (2013). retrospective design analysis using external information [unpublished]. retrieved april 28, 2020, from http://www.stat.columbia.edu/~gelman/research/unpublished/retropower5.pdf
gelman, a., & carlin, j. (2014). beyond power calculations: assessing type s (sign) and type m (magnitude) errors. perspectives on psychological science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642
gelman, a., & loken, e. (2014). the statistical crisis in science. american scientist, 102(6), 460–466. https://doi.org/10.1511/2014.111.460
gelman, a., skardhamar, t., & aaltonen, m.
(2017). type m error might explain weisburd's paradox. journal of quantitative criminology. https://doi.org/10.1007/s10940-017-9374-5
gelman, a., & tuerlinckx, f. (2000). type s error rates for classical and bayesian single and multiple comparison procedures. computational statistics, 15(3), 373–390. https://doi.org/10.1007/s001800000040
gigerenzer, g., krauss, s., & vitouch, o. (2004). the null ritual: what you always wanted to know about significance testing but were afraid to ask. the sage handbook of quantitative methodology for the social sciences (pp. 392–409). sage publications, inc. https://doi.org/10.4135/9781412986311.n21
goodman, s., & berlin, j. (1994). the use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. annals of internal medicine, 121(3), 200–206. https://doi.org/10.7326/0003-4819-121-3-199408010-00008
ioannidis, j. p. a. (2008). why most discovered true associations are inflated. epidemiology, 19(5), 640–648. https://doi.org/10.1097/ede.0b013e31818131e7
ioannidis, j. p. a., pereira, t. v., & horwitz, r. i. (2013). emergence of large treatment effects from small trials—reply. jama, 309(8), 768–769. https://doi.org/10.1001/jama.2012.208831
klein, r. a., ratliff, k. a., vianello, m., adams, r. b., bahník, š., bernstein, m. j., bocian, k., brandt, m. j., brooks, b., brumbaugh, c. c., cemalcilar, z., chandler, j., cheong, w., davis, w. e., devos, t., eisner, m., frankowska, n., furrow, d., galliani, e. m., . . . nosek, b. a. (2014). investigating variation in replicability. social psychology, 45(3), 142–152. https://doi.org/10.1027/1864-9335/a000178
klein, r. a., vianello, m., hasselman, f., adams, b. g., reginald b. adams, j., alper, s., aveyard, m., axt, j. r., babalola, m. t., bahník, š., batra, r., berkics, m., bernstein, m. j., berry, d. r., bialobrzeska, o., binan, e. d., bocian, k., brandt, m. j., busching, r., . . . nosek, b. a. (2018).
many labs 2: investigating variation in replicability across samples and settings. advances in methods and practices in psychological science, 1(4), 443–490. https://doi.org/10.1177/2515245918810225
kurkiewicz, d. (2017). docstring: provides docstring capabilities to r functions. https://cran.r-project.org/package=docstring
lakens, d. (2019). the value of preregistration for psychological science: a conceptual analysis (preprint). psyarxiv. https://doi.org/10.31234/osf.io/jbh4w
lakens, d., adolfi, f. g., albers, c. j., anvari, f., apps, m. a. j., argamon, s. e., baguley, t., becker, r. b., benning, s. d., bradford, d. e., buchanan, e. m., caldwell, a. r., van calster, b., carlsson, r., chen, s.-c., chung, b., colling, l. j., collins, g. s., crook, z., . . . zwaan, r. a. (2018). justify your alpha. nature human behaviour, 2(3), 168–171. https://doi.org/10.1038/s41562-018-0311-x
lakens, d., scheel, a. m., & isager, p. m. (2018). equivalence testing for psychological research: a tutorial. advances in methods and practices in psychological science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
lane, d. m., & dunlap, w. p. (1978). estimating effect size: bias resulting from the significance criterion in editorial decisions. british journal of mathematical and statistical psychology, 31(2), 107–112. https://doi.org/10.1111/j.2044-8317.1978.tb00578.x
lu, j., qiu, y., & deng, a. (2018). a note on type s/m errors in hypothesis testing.
british journal of mathematical and statistical psychology. https://doi.org/10.1111/bmsp.12132
mayo, d. g. (2018). statistical inference as severe testing: how to get beyond the statistics wars (1st ed.). cambridge university press. https://doi.org/10.1017/9781107286184
o'hagan, a. (2019). expert knowledge elicitation: subjective but scientific. the american statistician, 73, 69–81. https://doi.org/10.1080/00031305.2018.1518265
open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
phillips, b. m., hunt, j. w., anderson, b. s., puckett, h. m., fairey, r., wilson, c. j., & tjeerdema, r. (2001). statistical significance of sediment toxicity test results: threshold values derived by the detectable significance approach. environmental toxicology and chemistry, 20(2), 371–373. https://doi.org/10.1002/etc.5620200218
rohrer, j. m. (2018). thinking clearly about correlations and causation: graphical causal models for observational data. advances in methods and practices in psychological science, 1(1), 27–42. https://doi.org/10.1177/2515245917745629
vasishth, s., mertzen, d., jäger, l. a., & gelman, a. (2018). the statistical significance filter leads to overoptimistic expectations of replicability. journal of memory and language, 103, 151–175. https://doi.org/10.1016/j.jml.2018.07.004
venables, w. n., & ripley, b. d. (2002). modern applied statistics with s. springer. https://cran.r-project.org/web/packages/mass/index.html
vul, e., harris, c., winkielman, p., & pashler, h. (2009).
puzzlingly high correlations in fmri studies of emotion, personality, and social cognition. perspectives on psychological science, 4(3), 274–290. https://doi.org/10.1111/j.1745-6924.2009.01125.x
vul, e., & pashler, h. (2017). suspiciously high correlations in brain imaging research. psychological science under scrutiny (pp. 196–220). john wiley & sons, ltd. https://doi.org/10.1002/9781119095910.ch11
yarkoni, t. (2009). big correlations in little studies: inflated fmri correlations reflect low statistical power—commentary on vul et al. (2009). perspectives on psychological science, 4(3), 294–298. https://doi.org/10.1111/j.1745-6924.2009.01127.x
young, n. s., ioannidis, j. p. a., & al-ubaydli, o. (2008). why current publication practices may distort science. plos medicine, 5(10), 1–5. https://doi.org/10.1371/journal.pmed.0050201

appendix a: pearson correlation and design analysis

to conduct a design analysis, it is necessary to know the sampling distribution of the
effect of interest, that is, the distribution of effects we would observe if n observations were sampled over and over again from a population with a given effect. this allows us, in turn, to evaluate the sampling distribution of the test statistic of interest not only under the null hypothesis (h0) but also under the alternative hypothesis (h1), and thus to compute the statistical power and the inferential risks of the study considered. for pearson's correlation between two normally distributed variables, the sampling distribution is bounded between -1 and 1 and its shape depends on ρ and n, respectively the population correlation and the sample size. the sampling distribution is approximately normal if ρ = 0, whereas for positive or negative values of ρ it is negatively or positively skewed, respectively. skewness is greater for higher absolute values of ρ but decreases when larger sample sizes are considered. in figure 7, correlation sampling distributions are presented for increasing values of ρ and fixed sample size (n = 30). in the following paragraphs, we consider the consequences of pearson's correlation sampling distribution for statistical inference and the behaviour of type m and type s errors as a function of statistical power.

statistical inference

to test a hypothesis or to derive confidence intervals, the sampling distribution of the test statistic of interest must follow a known distribution. in the case of h0: ρ = 0, the sample correlation is approximately normally distributed with standard error se(r) = √((1 − r²)/(n − 2)). thus, statistical inference is performed using the test statistic

t = r / se(r) = r √(n − 2) / √(1 − r²),   (2)

which follows a t-distribution with df = n − 2. however, in the case of ρ ≠ 0, the sample correlation is no longer normally distributed. as we have previously seen, the sampling distribution is skewed for large values of ρ and small sample sizes.
thus, the test statistic of interest does not follow a t-distribution.⁴ to overcome this issue, the fisher transformation was introduced (fisher, 1915):

f(r) = (1/2) ln((1 + r)/(1 − r)) = arctanh(r).   (3)

after this transformation, the resulting sampling distribution is approximately normal with mean f(ρ) and se = 1/√(n − 3). thus, the test statistic follows a standard normal distribution and statistical inference is performed using z-scores. alternatively, other methods can be used to obtain reliable results, for example monte carlo simulation. monte carlo simulation is based on random sampling to approximate the quantities of interest. in the case of correlation, n observations are iteratively simulated from a bivariate normal distribution with a given ρ, and the observed correlation is recorded. as the number of iterations increases, the distribution of simulated correlation values approximates the actual correlation sampling distribution and can be used to compute the quantities of interest. although monte carlo methods are more computationally demanding than analytic solutions, this approach yields reliable results in a wider range of conditions, even when no closed-form solutions are available. for these reasons, the functions pro_r() and retro_r() presented in this paper are based on monte carlo simulation to compute power, type m, and type s error values. this guarantees a more general framework into which other future applications can easily be integrated.

type m and type s errors

design analysis was first introduced by gelman and carlin (2014) assuming that the sampling distribution of the test statistic of interest follows a t-distribution. this is the case, for example, for cohen's d, an effect size used to measure the mean difference between two groups on a continuous outcome. the behaviour of type m and type s errors as a function of statistical power in the case of cohen's d is presented in figure 8.
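the fisher transformation and the monte carlo approach described above can be checked numerically. the following sketch is an illustration in python (not the authors' r implementation): it simulates the sampling distribution of r for ρ = 0.6 and n = 30 and verifies that the transformed values arctanh(r) are approximately normal with mean arctanh(ρ) and standard deviation 1/√(n − 3).

```python
import math
import random
import statistics

def sample_r(rho, n, rng):
    """Sample n pairs from a bivariate normal with correlation rho
    and return the observed Pearson correlation."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    # y = rho*x + sqrt(1 - rho^2)*e gives corr(x, y) = rho
    ys = [rho * x + math.sqrt(1 - rho**2) * rng.gauss(0, 1) for x in xs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2020)
# raw sampling distribution of r: bounded in (-1, 1), negatively
# skewed because rho > 0
rs = [sample_r(rho=0.6, n=30, rng=rng) for _ in range(5000)]

# fisher-transformed values: approximately normal with
# mean ~ arctanh(0.6) ~ 0.69 and sd ~ 1/sqrt(30 - 3) ~ 0.19
zs = [math.atanh(r) for r in rs]
print(round(statistics.fmean(zs), 2), round(statistics.stdev(zs), 2))
```

the simulated mean and standard deviation of the transformed values should land close to the analytic approximations, up to monte carlo noise and the small finite-sample bias of the fisher z.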
for different values of the hypothetical population effect size (d = .2, .5, .7, .9), we can observe that type s and type m errors are low for high levels of power and high for low values of power. as expected, the relation between power and inferential errors is not influenced by the value of d (i.e., the four lines overlap). limit cases are obtained for power = 1 and power = 0.05 (note that the lowest possible value of power is given by the alpha value chosen as the statistical significance threshold). in the former case, type s error is 0 and type m error is 1; in the latter case, type s error is 0.5 and type m error goes to infinity. in the case of pearson's correlation, we noted above that the sampling distribution is skewed for large values of ρ and small sample sizes. moreover, the support is bounded between -1 and 1. thus, the relations between power, type m, and type s error are influenced by the value of the hypothetical population effect size (see figure 9).

footnote 4: note that the t-distribution is defined as the distribution of a random variable t = z / √(v/df), with z a standard normal variable and v a chi-squared variable with df degrees of freedom. thus, if the sample correlation is no longer approximately normally distributed, the test statistic is no longer t-distributed.

figure 7. pearson correlation coefficient sampling distributions for increasing values of ρ (0, 0.4, 0.6, 0.8) and fixed sample size (n = 30).

figure 8. the behaviour of type m and type s errors as a function of statistical power in the case of cohen's d (d = 0.2, 0.5, 0.7, 0.9). note that the four lines are overlapping.
we can observe how, for different values of the correlation (ρ = .2, .5, .7, .9), type m error increases at different rates as power decreases, whereas type s error follows a consistent pattern (the small differences are due to numerical approximation). we can intuitively explain this behaviour by considering that, for low levels of power, the sampling distribution includes a wider range of correlation values. however, correlation values cannot exceed 1, so the distribution becomes progressively more skewed. this does not influence the proportion of statistically significant sampled correlations with the incorrect sign (type s error), but it does affect the mean absolute value of statistically significant sampled correlations (used to compute type m error). in particular, the sampling distribution for greater values of ρ becomes skewed more rapidly, and thus type m error increases at a lower rate. finally, since correlation values are bounded, type m error for a given value of ρ can increase only to a maximum of 1/ρ. for example, for ρ = .5 the maximum type m error is 2, as .5 × 2 = 1 (i.e., the maximum correlation value).

figure 9. the behaviour of type m and type s errors as a function of statistical power in the case of pearson's correlation ρ (ρ = 0.2, 0.5, 0.7, 0.9).

in this appendix, we discussed for completeness the implications of conducting a design analysis for the pearson correlation effect size. we considered extreme scenarios that are unlikely to occur in real research settings. nevertheless, we regard this as important for evaluating the statistical behaviour and properties of type m and type s errors for pearson's correlation, as well as for helping researchers understand design analysis more deeply.
appendix b: r functions for design analysis with pearson correlation

here we describe the r functions defined to perform a prospective and retrospective design analysis in the case of pearson correlation. first, we give instructions on how to load and use the functions. subsequently, we provide the code to reproduce the examples included in the article. these functions can be used as a base to further develop design analysis in more complex scenarios that were beyond the aim of the paper.

r functions

the code of the functions is available in the file design_analysis_r.r at https://osf.io/9q5fr/. after downloading the file, run the following line, indicating the correct path where the file was saved:

source("design_analysis_r.r")

the script will automatically load into your workspace the functions and two required r packages: MASS (venables & ripley, 2002) and docstring (kurkiewicz, 2017). if you do not have them installed already, run the line install.packages(c("MASS", "docstring")).

the r functions are:

• retro_r() for retrospective design analysis. given the hypothetical population correlation value and sample size, this function performs a retrospective design analysis according to the defined alternative hypothesis and significance level. power level, type m error, and type s error are computed, together with the critical correlation value (i.e., the minimum absolute correlation value that would be statistically significant).

retro_r(rho, n, alternative = c("two.sided", "less", "greater"), sig_level = .05, b = 1e4, seed = NULL)

• pro_r() for prospective design analysis. given the hypothetical population correlation value and the required power level, this function performs a prospective design analysis according to the defined alternative hypothesis and significance level. the required sample size is computed together with the associated type m error, type s error, and the critical correlation value.
pro_r(rho, power = .80, alternative = c("two.sided", "less", "greater"), sig_level = .05, range_n = c(1, 1000), b = 1e4, tol = .01, display_message = FALSE, seed = NULL)

for further details about the function arguments, run the line docstring(retro_r) or docstring(pro_r). this creates documentation similar to the help pages of r functions. note: two other functions are defined in the script and will be loaded into your workspace (i.e., compute_crit_r() and print.design_analysis()). these are internal functions that should not be called directly by the user.

examples code

below we report the code to reproduce the examples included in the article.

# example from figure 1
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .05, seed = 2020)

# example from figure 3
pro_r(rho = .25, power = .8, alternative = "two.sided", sig_level = .05, seed = 2020)

# example from figure 6
pro_r(rho = .25, power = .8, alternative = "two.sided", sig_level = .05, seed = 2020)

# examples from table 1
pro_r(rho = .25, power = .6, alternative = "two.sided", sig_level = .05, seed = 2020)
pro_r(rho = .15, power = .8, alternative = "two.sided", sig_level = .05, seed = 2020)
pro_r(rho = .35, power = .8, alternative = "two.sided", sig_level = .05, seed = 2020)

# examples from table 2
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .100, seed = 2020)
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .050, seed = 2020)
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .010, seed = 2020)
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .005, seed = 2020)
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .001, seed = 2020)

meta-psychology, 2020, vol 4, mp.2019.2266 https://doi.org/10.15626/mp.2019.2266 article type: original article published under the cc-by4.0 license open data: no open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no
edited by: rickard carlsson
reviewed by: u. schimmack, m. heene
analysis reproduced by: n. brown, p. langford
all supplementary files can be accessed at the osf project page: https://doi.org/10.17605/osf.io/4tgq9

excess success in "don't count calorie labeling out: calorie counts on the left side of menu items lead to lower calorie food choices"

gregory francis, department of psychological sciences, purdue university, west lafayette, usa
evelina thunell, department of clinical neuroscience, karolinska institutet, stockholm, sweden; department of psychological sciences, purdue university, west lafayette, usa

abstract

based on findings from six experiments, dallas, liu, and ubel (2019) conclude that placing calorie labels to the left of menu items influences consumers to choose lower calorie food options. contrary to previously reported findings, they suggest that calorie labels can influence food choices, but only when placed to the left because they are in this case read first. if true, these findings have important implications for the design of menus and may help address the obesity pandemic. however, an analysis of the reported results indicates that they seem too good to be true. we show that if the effect sizes in dallas et al. (2019) are representative of the populations, a replication of the six studies (with the same sample sizes) has a probability of only 0.014 of producing uniformly significant outcomes. such a low success rate suggests that the original findings might be the result of questionable research practices or publication bias. we therefore caution readers and policy makers to be skeptical about the results and conclusions reported by dallas et al. (2019).

keywords: calorie labeling, statistics, test for excess success.

many scientists take significant results that are replicated across multiple studies as strong support for their conclusions. however, this interpretation requires that the studies have high power.
for example, when conducting six independent studies, each with a power of 0.5, one should expect only about half of the studies to produce significant results. it would be very rare for all six studies to produce significant results: the probability is only 0.5⁶ ≈ 0.016. when such excess success is observed in a publication, readers should suspect that the experiments were carried out using questionable research practices (john et al., 2012; simmons et al., 2011) or that some experiments with nonsignificant results were run but not reported (publication bias; francis, 2012a). a set of studies with too much success likely misrepresents reality, and conclusions from such studies should be discounted until non-biased investigations can be performed. here, we use a test for excess success (tes) analysis (francis, 2013a; ioannidis & trikalinos, 2007; see also schimmack, 2012) to show that the results of a recent article by dallas et al. (2019) seem too successful. while there are other methods (see for example renkewitz & keiner, 2019) that aim to detect publication bias or questionable research practices, the tes analysis is currently the only approach that deals with multiple hypothesis tests from a single sample, something that is relevant for the findings reported in dallas et al. (2019). existing alternative methods must select just one test from each sample because they require independent statistics. however, it is not always clear which test should be selected, and the choice can make a big difference in the conclusions and interpretation (e.g., bishop & thompson, 2016; erdfelder & heck, 2019; simonsohn et al., 2014; ulrich & miller, 2015). thus, we have opted for the tes, because it allows us to consider the full set of tests that dallas et al. (2019) use to support their conclusions.
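the arithmetic behind this expectation is simple binomial probability and can be checked directly. the short python sketch below (illustrative only, not part of any tes implementation) computes the probability of observing k significant results out of six independent studies that each have power 0.5.

```python
from math import comb

def prob_k_significant(k, n_studies=6, power=0.5):
    """Binomial probability that exactly k of n_studies independent
    studies reach significance when each has the stated power."""
    return comb(n_studies, k) * power**k * (1 - power)**(n_studies - k)

# probability that all six studies come out significant
print(prob_k_significant(6))  # 0.015625, i.e. about 0.016

# expected number of significant studies: about half of them
expected = sum(k * prob_k_significant(k) for k in range(7))
print(expected)  # 3.0
```

the same calculation underlies the tes logic more generally: multiply the estimated power of each reported study to get the probability that an unbiased set of experiments would be uniformly successful.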
concerns about the tes analysis method (e.g., morey, 2013; simonsohn, 2013; vandekerckhove, guan, & styrcula, 2013) have been addressed by francis (2013a, 2013b), who argues that the criticisms reflect misunderstandings about the test or about the notion of excess success. in particular, some critics have been concerned that the tes only confirms what is already known, since all studies are biased in some way. while the critics may be correct in the broadest sense of the term bias, here we use the tes to identify bias that specifically undermines the claims of the original study. we suspect that the authors of papers with results that seem too good to be true did not realize that their reported findings were actually incompatible with their claims. thus, the tes analysis, as used here, provides previously unknown insights into the interpretation of their studies. some critics have also been concerned that there may be a "file drawer" of tes analyses of studies that did not show signs of bias, and that selective reporting of tes analyses of studies that did show signs of bias undermines the type i error control of the method in the same way that publication bias can give a false representation of the strength of an effect. while there surely is a file drawer of tes analyses, it does not matter for interpreting a given data set. in general, a conclusion that a set of studies seems to be biased should be made relative to the conclusions of the original authors: when applying the tes, we can draw conclusions about the presence of bias in a set of studies even if other, unrelated, analyses of other studies do not indicate any bias, or are not analyzed or reported. in fact, just as in experimental studies, publication bias across tes analyses becomes a problem only if the conclusions are wrongly generalized (to, say, all publications from the author in question, all articles published in a certain field, or all articles in a certain journal).
in the present study, the tes file drawer is not a problem because we are drawing conclusions about the set of studies in dallas et al. (2019) relative to the original authors' conclusions, and we are thus analyzing the whole population of interest. finally, some forms of the tes pool effect sizes across studies and thus do not behave well when there is heterogeneity of effect sizes (renkewitz & keiner, 2019), a characteristic shared by many other methods for investigating questionable research practices. here, we use a version of the tes that estimates power for each individual study rather than pooling effect sizes. thus, this concern does not apply to our current analysis. based on six successful studies, dallas et al. (2019) conclude that consumers opt for lower calorie food items when calorie information is displayed on the menu – but only when it is placed so that it is read before the food names. according to the authors, previous failures to show an effect of calorie labeling on menus can be attributed to the fact that the calorie information was placed to the right of, and was thus read after, the food names. dallas et al. (2019) correctly argue that their conclusions could have important implications for policy making to address the obesity crisis in america and elsewhere (sunstein, 2019). however, the implications are only valid if the conclusions are valid, and the excess success of their findings undermines the credibility of the conclusion. material and methods the mean, standard deviation, and sample size for each condition in each experiment reported by dallas et al. (2019) and an associated corrigendum (dallas et al., 2020) are reproduced in table 1, together with the key hypotheses used to support their theoretical conclusions. (the corrigendum corrected the sample size of the right label condition in study 1 and the sample size, mean, and standard deviation for the no label condition in study 3.
we used these corrected values for our analysis.) each of the six reported experiments fully satisfied the hypotheses, thereby providing uniform support for the conclusions. namely, in five of the studies (studies 1, 2, s1, s2, and s3), participants ordered fewer calories when calorie labels were placed to the left of food names, as compared to when they were placed on the right side. the remaining study (study 3) reported a corresponding effect for hebrew readers (who read from right to left rather than left to right), so that calorie labels on the right (vs. left) led to lower calorie choices. studies 1, s2, and s3 included a third condition with no calorie labels, and the number of calories ordered in the left condition was also significantly lower than this no label condition. we evaluated the plausibility of all six studies producing uniform success by computing estimates of experimental power for replications of each of the studies. these power estimates are based on the statistics reported by dallas et al. (2019) and the corrigendum (dallas et al., 2020), so our analysis starts by supposing that the reported findings are valid and accurate. since the studies are statistically independent, we can then compute the probability of the full set of studies being uniformly successful by multiplying the power estimates of the individual studies. we first describe how to calculate the experimental power of the studies that used a single hypothesis test (study 2 and study s1). in this case the reported t value and sample sizes can be converted to a hedges' g standardized effect size (hedges' g is similar to cohen's d, but with a correction for small sample sizes). for example, the conclusion of study 2 in dallas et al. (2019) is based on a significant two-sample t-test between the left and right calorie conditions, with g = 0.25. based on this value, the power of a future experiment for any given sample size is easy to calculate.
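this single-test power calculation can be sketched as follows. the original analysis used the pwr package in r; the python/scipy version below mirrors the standard noncentral-t computation, and the two-sided test is our assumption:

```python
import numpy as np
from scipy import stats

def two_sample_power(g, n1, n2, alpha=0.05):
    """power of a two-sided, two-sample t-test for a standardized effect g,
    via the noncentral t distribution (the computation pwr-style tools use)."""
    df = n1 + n2 - 2
    nc = g * np.sqrt(n1 * n2 / (n1 + n2))   # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # P(|T| > t_crit) when T follows a noncentral t with df and nc
    return 1 - stats.nct.cdf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

# study 2: g = 0.25 with n = 143 (left) and n = 132 (right)
p_study2 = two_sample_power(0.25, 143, 132)
print(round(p_study2, 3))  # about 0.54
```

for study 2 this lands near the 0.5426 reported in table 1; small discrepancies reflect rounding of the reported g.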
we used the pwr library (champely et al., 2018) in r to compute power for a replication experiment that uses the same sample sizes as the original study. alternatively, the power could be computed from the means, standard deviations, and sample sizes by using the online calculator in francis (2018) or similar tools. the same procedure applies to study s1. it is more complicated to estimate the power when a study's conclusions depend on multiple hypothesis tests. studies 1, 3, s2, and s3 in dallas et al. (2019) are based on at least three significant hypothesis tests. we describe the procedure for study 1, which is representative of our approach. in study 1, a significant anova was required to indicate a difference across conditions (left, right, or no calorie labels). in addition, the conclusions required both a significant contrast between the left and right calorie label conditions and a significant contrast between the left and no calorie label conditions. because multiple tests are required for the results to fully support the conclusions, there is no single standardized effect size that can be used to compute the power of the study. instead, we ran simulated experiments that drew random samples of the same size as the original study from normal distributions with population means and standard deviations matching the statistics reported by dallas et al. (2019). we then performed the three tests that were used in the original study on the simulated data. the process was repeated 100,000 times to give a reliable measure of the proportion of simulated experiments that found significance for all three tests. this proportion was then used as an estimate of the overall power of the study. the same procedure was used for studies 3, s2, and s3, using the respective reported statistics.
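the simulation procedure can be sketched in a few lines for study 1, using its summary statistics from table 1. python/numpy stands in for the original r code (available on the osf), and treating the two contrasts as two-sided two-sample t-tests is our assumption about the exact tests, so the estimate should only be expected to land near the reported value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# study 1 statistics from table 1 of this paper: (mean, sd, n) per condition
left = (654.53, 390.45, 45)
right = (865.41, 517.26, 54)
nolabel = (914.34, 560.94, 50)

def one_simulated_experiment():
    """draw one simulated study 1 and check whether all three tests succeed."""
    a = rng.normal(*left)
    b = rng.normal(*right)
    c = rng.normal(*nolabel)
    anova_p = stats.f_oneway(a, b, c).pvalue
    lr_p = stats.ttest_ind(a, b).pvalue  # left vs right contrast (assumed t-test)
    ln_p = stats.ttest_ind(a, c).pvalue  # left vs no-label contrast (assumed t-test)
    return anova_p < .05 and lr_p < .05 and ln_p < .05

n_sim = 5000  # the paper used 100,000 replicates
power_est = sum(one_simulated_experiment() for _ in range(n_sim)) / n_sim
print(round(power_est, 2))  # should land near the 0.4582 reported in table 1

# the six power values reported in table 1; their product is the joint probability
powers = [0.4582, 0.5426, 0.3626, 0.5358, 0.5667, 0.4953]
joint = float(np.prod(powers))
print(round(joint, 3))  # 0.014, as reported in the results
```

because the studies are independent, the product of the six power estimates gives the probability that all six succeed, which is the 0.014 quoted in the abstract.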
some of these studies comprised additional mediation analyses, which we did not include in our tes analysis (the provided summary statistics do not contain enough information to generate simulated data for these tests). since all of the mediation effects were in agreement with the conclusions in dallas et al. (2019), including them in our analysis could only further reduce the estimated power. simulation source code written in r (r core team, 2017) for all of the analyses is available at the open science framework: https://osf.io/xrdhj/. results the rightmost column in table 1 shows the estimated power for each of the studies in dallas et al. (2019).

table 1. supporting hypotheses, statistical properties, and estimated power for the tests in the tes analysis of the six studies in dallas et al. (2019).

study 1 (main effect of calorie information; μleft < μright; μleft < μnocalories): left x̄ = 654.53, s = 390.45, n = 45; right x̄ = 865.41, s = 517.26, n = 54; no label x̄ = 914.34, s = 560.94, n = 50; power = 0.4582.
study 2 (μleft < μright): left x̄ = 1249.83, s = 449.07, n = 143; right x̄ = 1362.31, s = 447.35, n = 132; power = 0.5426.
study 3 (main effect of calorie information; μleft > μright; μnocalories > μright): left x̄ = 1428.24, s = 377.02, n = 85; right x̄ = 1308.66, s = 420.14, n = 86; no label x̄ = 1436.79, s = 378.47, n = 81; power = 0.3626.
study s1 (μleft < μright): left x̄ = 185.94, s = 93.92, n = 99; right x̄ = 215.73, s = 95.33, n = 77; power = 0.5358.
study s2 (main effect of calorie information; μleft < μright; μleft < μnocalories): left x̄ = 1182.15, s = 477.60, n = 139; right x̄ = 1302.23, s = 434.41, n = 141; no label x̄ = 1373.74, s = 475.77, n = 151; power = 0.5667.
study s3 (main effect of calorie information; μleft < μright; μleft < μnocalories): left x̄ = 1302.03, s = 480.02, n = 336; right x̄ = 1373.15, s = 442.49, n = 337; no label x̄ = 1404.35, s = 422.03, n = 333; power = 0.4953.

each study has a power of around 0.5, so replication studies with the same sample sizes as the original studies should produce significant results only about half of the time, assuming that the population effects are similar to the reported sample effects. thus, even if the position of calorie labels does influence food selection, it is very unlikely that six studies like these would consistently show such an effect. indeed, the probability of all six studies being successful is the product of the power values in table 1, which is only 0.014. how could dallas et al. (2019) find positive results in all their studies when this outcome was so unlikely? one possible explanation is publication bias: perhaps dallas et al. (2019) ran more than six studies but did not report the studies that failed to produce significant outcomes. such selective reporting is problematic (francis, 2012b). consider the extreme case where there is no effect at all: because of random sampling, some studies will still produce significant results. it is clearly misleading to report only these false positives and leave out the majority of studies that did not show a significant effect. another possible explanation is that the reported analyses were a subset of the full range of conducted analyses. a typical example of this approach is when researchers use several methods for outlier exclusion and then report only the one that resulted in the most favorable outcome. as another example, researchers may run their analysis on a set of data and then decide whether to gather more data based on the outcome of this intermediate analysis (e.g., stop if the results are significant and otherwise gather more data). because this procedure results in multiple tests, the type i error rate is inflated (simmons et al., 2011). the selective reporting strategies described above are unfortunately rather common (john et al., 2012). we cannot know how dallas et al.
(2019) achieved their excessively successful results, but we can conclude that this set of studies does not make a plausible argument in support of the authors' conclusions. we recommend that scientists generally ignore the findings and conclusions of dallas et al. (2019). new studies will be required to determine whether calorie label position actually has the hypothesized effect on food choices. designing new studies a scientist planning new studies on calorie label position might be tempted to base a power analysis on the results of dallas et al. (2019). we show how this can be done, but we also caution readers that findings exhibiting excess success likely overestimate the reported effect size (francis, 2012b; simmons et al., 2011). therefore, this approach likely overestimates experimental power and underestimates the necessary sample sizes. for simplicity, we consider only the main comparison in the studies of dallas et al. (2019): a difference in calories ordered depending on whether the calorie information was presented before or after the food names (in terms of reading direction). we used the reported means and (pooled) standard deviation to compute a standardized effect size (hedges' g) for each study, and ran a meta-analysis to pool the effect sizes across experiments (source code is at the open science framework). as an aside, a meta-analysis might not actually be appropriate for these studies because they differed in a number of potentially important methodological details. for example, in some studies the instruction was to order an entrée and a drink, while in others the menu only included entrées, or entrées and desserts. still, the reported standardized effect sizes have mostly overlapping confidence intervals, and the meta-analysis will give a rough estimate of the effect size that might exist for a new study. researchers who feel that the full meta-analysis is not appropriate might pool the data in other ways.
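the pooling and sample-size planning can be sketched from the table 1 summary statistics. the fixed-effect, inverse-variance model below and the equal-groups power solver are our assumptions about the exact computations (the source code on the osf is authoritative), and python/scipy again stands in for r:

```python
import numpy as np
from scipy import stats

# left-vs-right summary statistics from table 1, ordered (m1, s1, n1, m2, s2, n2)
# with condition 1 = calorie labels read first; study 3 (hebrew readers) is
# therefore entered with the right-label condition first
studies = [
    (654.53, 390.45, 45, 865.41, 517.26, 54),      # study 1
    (1249.83, 449.07, 143, 1362.31, 447.35, 132),  # study 2
    (1308.66, 420.14, 86, 1428.24, 377.02, 85),    # study 3
    (185.94, 93.92, 99, 215.73, 95.33, 77),        # study s1
    (1182.15, 477.60, 139, 1302.23, 434.41, 141),  # study s2
    (1302.03, 480.02, 336, 1373.15, 442.49, 337),  # study s3
]

def hedges_g(m1, s1, n1, m2, s2, n2):
    """standardized mean difference with the small-sample correction."""
    df = n1 + n2 - 2
    sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    return (m2 - m1) / sp * (1 - 3 / (4 * df - 1))

# fixed-effect inverse-variance pooling (our assumption about the exact model)
gs, ws = [], []
for m1, s1, n1, m2, s2, n2 in studies:
    g = hedges_g(m1, s1, n1, m2, s2, n2)
    v = (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))  # sampling variance of g
    gs.append(g)
    ws.append(1 / v)
g_pooled = float(np.dot(gs, ws) / sum(ws))
print(round(g_pooled, 4))  # should land near the reported g* = 0.2366

def power(g, n, alpha=0.05):
    """power of a two-sided two-sample t-test with n per condition."""
    df = 2 * n - 2
    nc = g * np.sqrt(n / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return 1 - stats.nct.cdf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

def required_n(g, target=0.80):
    """smallest equal per-condition sample size reaching the target power."""
    n = 2
    while power(g, n) < target:
        n += 1
    return n

n80 = required_n(g_pooled)           # table 2 reports 282 for 80% power
n80_half = required_n(g_pooled / 2)  # table 2 reports 1123 at half the effect
print(n80, n80_half)
```

the larger studies dominate the weighted average, which is why the pooled estimate sits closer to the small effect of study s3 than to the large effect of study 1.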
the standardized effect sizes for the individual experiments in dallas et al. (2019) vary from 0.15 to 0.45, with smaller effect sizes for the studies with larger sample sizes. in a meta-analysis, studies with larger sample sizes carry more weight. taking this weighting into account, the pooled effect size is g* = 0.2366. the second column of table 2 shows the sample size per condition needed to achieve a specified power in a single study, based on the pooled effect size. to achieve 80% power, a new study should use 282 participants per condition; only one (study s3) out of the six studies in dallas et al. (2019) had at least this many participants. to achieve 90% power, sample sizes larger than any of the six studies (377 participants per condition) are required. since these sample sizes are based only on the main comparison of left versus right positions of the calorie labels, new studies that also include the left vs. no labels comparison or mediation analysis will require even larger sample sizes. as noted above, the excess success analysis suggests that the pooled effect size is based on studies that most likely overestimate the effect size. a cautious scientist might therefore want to suppose that the population effect is smaller than the meta-analysis estimate: say, by one half. the third column of table 2 shows the corresponding required sample sizes. in this case, to achieve 80% power for detecting a difference between the two calorie label placements, a sample size of 1123 participants per condition is needed. for an experiment to have 90% power, it would need to have 1502 participants in each condition. of course, power is not the only important characteristic of an experiment. one issue is that in the menus used by dallas et al. (2019), placing the calorie labels after the food item name tended to place the label next to the item price.
having two numeric items next to each other introduces visual clutter that can make it difficult for viewers to parse out relevant information (shive & francis, 2008). it is possible that viewers better process the calorie information when it is presented before the food name simply because it is then presented far from the price information. bleich et al. (2017) describe a number of additional challenges in studying the impact of calorie labels on food choices.

table 2. sample sizes required for a new study investigating left/right placement of calorie labels to have a desired power.

desired power: n1 = n2 to detect g = 0.2366; n1 = n2 to detect g = 0.1183
0.80: 282; 1123
0.85: 322; 1284
0.90: 377; 1502
0.95: 465; 1858
0.99: 658; 2626

conclusions dallas et al. (2019) note that their findings may have important implications for policies regarding calorie labels and their possible impact on obesity. however, such implications are only valid if the reported data support their conclusions. given the inherent variability in data collection, some nonsignificant results are highly likely when conducting experiments like their six reported studies, even if the effect of calorie label position is real and similar in magnitude to what they report. the excess success in the reported studies, i.e., the lack of nonsignificant results, indicates that something has gone wrong during data collection, analysis, reporting, or interpretation, perhaps unbeknownst to the authors (gelman & loken, 2014). we therefore advise readers to be skeptical about the results and conclusions reported by dallas et al. (2019). author contact corresponding author: gregory francis, gfrancis@purdue.edu, orcid: https://orcid.org/0000-0002-8634-794x. evelina thunell, evelina.thunell@ki.se, orcid: https://orcid.org/0000-0002-9368-4661 conflict of interest and funding the authors declare no competing interests. author contributions gf and et contributed to conception and interpretation. gf developed the analyses and wrote the code. gf and et wrote the article and approved the submitted version for publication. open science practices this article earned the open materials badge for materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement. references bishop, d. v. m., & thompson, p. a. (2016). problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value. peerj, 4, e1715. https://doi.org/10.7717/peerj.1715 bleich, s. n., economos, c. d., spiker, m. l., vercammen, k. a., vanepps, e. m., block, j. p., elbel, b., story, m., & roberto, c. a. (2017). a systematic review of calorie labeling and modified calorie labeling interventions: impact on consumer and restaurant behavior. obesity, 25(12), 2018–2044. https://doi.org/10.1002/oby.21940 champely, s., ekstrom, c., dalgaard, p., gill, j., weibelzahl, s., ford, c., & volcic, r. (2018). package 'pwr': basic functions for power analysis. dallas, s. k., liu, p. j., & ubel, p. a. (2019). don't count calorie labeling out: calorie counts on the left side of menu items lead to lower calorie food choices. journal of consumer psychology, 29(1), 60–69. https://doi.org/10.1002/jcpy.1053 dallas, s. k., liu, p. j., & ubel, p. a. (2020). corrigendum: don't count calorie labeling out: calorie counts on the left side of menu items lead to lower calorie food choices. journal of consumer psychology, 30(3), 571–571. https://doi.org/10.1002/jcpy.1162 erdfelder, e., & heck, d. w. (2019). detecting evidential value and p-hacking with the p-curve tool: a word of caution. zeitschrift für psychologie / journal of psychology, 227(4), 249–260. https://doi.org/10.1027/2151-2604/a000383 francis, g. (2012a).
publication bias and the failure of replication in experimental psychology. psychonomic bulletin & review, 19(6), 975–991. https://doi.org/10.3758/s13423-012-0322-y francis, g. (2012b). too good to be true: publication bias in two prominent studies from experimental psychology. psychonomic bulletin & review, 19, 151–156. https://doi.org/10.3758/s13423-012-0227-9 francis, g. (2013a). replication, statistical consistency, and publication bias. journal of mathematical psychology, 57(5), 153–169. https://doi.org/10.1016/j.jmp.2013.02.003 francis, g. (2013b). we should focus on the biases that matter: a reply to commentaries. journal of mathematical psychology, 57(5), 190–195. https://doi.org/10.1016/j.jmp.2013.06.001 francis, g. (2018). power for independent means. introstats online (2nd edition). https://introstatsonline.com/chapters/calculators/mean_two_sample_power.shtml gelman, a., & loken, e. (2014). the statistical crisis in science. american scientist, 102(6), 460–465. https://doi.org/10.1511/2014.111.460 ioannidis, j. p. a., & trikalinos, t. a. (2007). an exploratory test for an excess of significant findings. clinical trials, 4(3), 245–253. https://doi.org/10.1177/1740774507079441 john, l. k., loewenstein, g., & prelec, d. (2012). measuring the prevalence of questionable research practices with incentives for truth telling. psychological science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953 morey, r. d. (2013). the consistency test does not–and cannot–deliver what is advertised: a comment on francis (2013). journal of mathematical psychology, 57(5), 180–183. https://doi.org/10.1016/j.jmp.2013.03.004 renkewitz, f., & keiner, m. (2019). how to detect publication bias in psychological research. zeitschrift für psychologie, 227(4), 261–279. https://doi.org/10.1027/2151-2604/a000386 schimmack, u. (2012). the ironic effect of significant results on the credibility of multiple-study articles. psychological methods, 17(4), 551–566.
https://doi.org/10.1037/a0029487 shive, j., & francis, g. (2008). applying models of visual search to map display design. international journal of human-computer studies, 66(2), 67–77. https://doi.org/10.1016/j.ijhcs.2007.08.004 simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632 simonsohn, u. (2013). it really just does not follow, comments on francis (2013). journal of mathematical psychology, 57(5), 174–176. https://doi.org/10.1016/j.jmp.2013.03.006 simonsohn, u., nelson, l. d., & simmons, j. p. (2014). p-curve: a key to the file-drawer. journal of experimental psychology: general, 143(2), 534–547. https://doi.org/10.1037/a0033242 sunstein, c. r. (2019). putting the calorie count before the cheeseburger. bloomberg opinion. https://www.bloombergquint.com/view/calorie-counts-on-menus-might-work r core team. (2017). r: a language and environment for statistical computing. r foundation for statistical computing. ulrich, r., & miller, j. (2015). p-hacking by post hoc selection with multiple opportunities: detectability by skewness test?: comment on simonsohn, nelson, and simmons (2014). journal of experimental psychology: general, 144(6), 1137–1145. https://doi.org/10.1037/xge0000086 vandekerckhove, j., guan, m., & styrcula, s. a. (2013). the consistency test may be too weak to be useful: its systematic application would not improve effect size estimation in meta-analyses. journal of mathematical psychology, 57(5), 170–173.
https://doi.org/10.1016/j.jmp.2013.03.007 meta-psychology, 2023, vol 7, mp.2022.3271 https://doi.org/10.15626/mp.2022.3271 article type: commentary published under the cc-by4.0 license open data: not applicable open materials: not applicable open and reproducible analysis: not applicable open reviews and editorial process: yes preregistration: no edited by: rickard carlsson reviewed by: david a. neequaye, elizabeth tenney analysis reproduced by: not applicable all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/tw9ga unfortunately, journals in industrial, work, and organizational psychology still fail to support open science practices joachim hüffmeier and marc mertes tu dortmund university currently, journals in industrial, work, and organizational (iwo) psychology collectively do too little to support open science practices. to address this problematic state of affairs, we first point out numerous problems that characterize the iwo psychology literature. we then describe seven frequent arguments, which all lead to the conclusion that the time is not ripe for iwo psychology to broadly adopt open science practices. to change this narrative and to promote the necessary change, we reply to these arguments and explain how open science practices can contribute to a better future for iwo psychology with more reproducible, replicable, and reliable findings. 
keywords: open science practices, reproducibility, replicability, openness, transparency it is unfortunate how slowly positive change is coming to industrial, work, and organizational psychology (iwo psychology) and the broader management literature.1 the field is riddled with problems, such as (i) low statistical power, (ii) non-transparent research practices and a lack of data sharing, (iii) a high prevalence of questionable research practices (qrps, e.g., hypothesizing after the results are known [harking] or nondisclosure of unsupported hypotheses), (iv) many false positive findings, (v) publication bias and a substantial file drawer problem (i.e., findings that are not published because they are not statistically significant), (vi) a bias towards novelty at the expense of replication studies and cumulative science, and (vii) the low replicability of its findings. importantly, a promising cure for many of these problems has long been found: open science practices (osps; see table 1, for an overview of osps and their effectiveness).2 however, as our own (torka et al., 2023) and other research (tipu and ryan, 2021) shows, most iwo psychology and management journals generally do little to support researchers' use of osps. for instance, our analysis of the policies of iwo psychology and management journals showed that only five of 202 analysed journals (2.5%) offered registered reports as a publication option and only one journal (0.5%) provided open science badges (torka et al., 2023). if anything, the journals seem to endorse "business as usual", which prevents overdue improvements to the state of the literature. in the following, we will illustrate that the listed problems do in fact exist, specifically in the iwo psychology/management field, and that there are hardly any excuses for not taking action on the part of the journals.
we will do so by presenting typical arguments that we observed in our own studies with scientists (a survey with scientists in iwo psychology, hüffmeier et al., n.d.) and journal editors (a survey with editors of iwo psychology journals, torka et al., 2023) and (over)heard in informal conversations with colleagues. then, we will reply to these arguments (see table 2, for an overview of the seven arguments and our refutations).

footnote 1: because much iwo psychology research is published in management journals (e.g., journal of organizational behavior or journal of management), the two fields cannot really be separated.
footnote 2: technically, some of these measures, such as journals' support of replications, do not necessarily make science more open, transparent, or accessible, although they improve science. other authors therefore speak of "open science and reform practices" rather than of osps (see tenney et al., 2021). however, to keep with established conventions, we still use the term "osps" in this manuscript.
footnote 3: replicability means that findings from new (replication) studies, which converge with those of the original studies, "can be obtained with other random samples drawn from a multidimensional space that captures the most important facets of the research design" (asendorpf et al., 2013, p. 109; see, for instance, open science foundation, 2015).

the first argument: osps are for scientific fields that evidentially have documented problems with the replicability of their findings, like social psychology. it is of course true that replicability (or rather the lack thereof) is better documented in other fields, especially social psychology.3 however, the replicability of reported results is low across many research domains. these domains include, but are not limited to, management
(bergh et al., 2017) and neighbouring disciplines such as marketing (simmons and nelson, 2019) and economics (camerer et al., 2016). there is little reason to assume that the situation is fundamentally different in iwo psychology research (see also goldfarb and king, 2016) because the incentives and publishing practices in all these fields are highly comparable and, thus, equally problematic. finally, the methodological problems of iwo psychology and management are not restricted to replicability (see also the next argument). the second argument: show us the evidence that our field does in fact have severe methodological problems. maybe then we will be willing to start supporting osps. we will use our above list to substantiate the prevailing methodological problems. first, low statistical power is very common in iwo psychology and management studies. one recent study found that only 37% of considered studies had a power of at least .80 (paterson et al., 2016; see also mone et al., 1996). second, at least for iwo psychology and for strategic management research, research practices are so non-transparent that it is often impossible to reproduce reported findings even when the data are available (see bergh et al., 2017; for an overview, see artner et al., 2021). however, related efforts often fail one step earlier because researchers are unwilling to share their data (e.g., tenopir et al., n.d.; wicherts et al., 2006). third, there is converging evidence across many studies that questionable research practices (qrps) are widespread in the field (e.g., banks et al., 2016; o'boyle jr et al., 2017). the problem is even more prevalent for articles appearing in prestigious journals such as organizational behavior and human decision processes or the academy of management journal (kepes et al., 2022).
fourth, when statistical power is low, as is prevalent in our field (see above), the risk that a published significant finding is a false positive (i.e., a rejection of a true null hypothesis) rises sharply. another perspective on the same issue is the rate of supported hypotheses in a field, which has been found to be especially high for the overarching field of economics and business (i.e., no further differentiation was made within this field; fanelli, 2012)—a clearly worrying finding for the state of the literature. fifth, many pertinent journals do not publish statistically nonsignificant results (tenney et al., 2021), deeming them either irrelevant or unworthy of publication. therefore, such negative results typically remain in researchers' file drawers (i.e., publication bias; harrison et al., 2017; o'boyle jr et al., 2014). sixth, nearly all journals in the field stress that new manuscripts must contribute theoretical and empirical extensions to the current knowledge (e.g., group and organization management seeks "[...] the work of scholars and professionals who extend management and organization theory [...]. innovation, conceptual sophistication, methodological rigor, and cutting-edge scholarship are the driving principles"). this coincides with the underrepresentation of replication studies in the field, as shown by ryan and tipu (2022), who estimate in their quantitative analysis that less than 1.5% of published research in the business and management literature consists of replication studies. the one-sided quest for novelty together with the prevailing disinterest in replication studies that is well-documented for most journals (tipu and ryan, 2021; see also evanschitzky et al., 2007; tenney et al., 2021) limits our collective ability to establish a cumulative knowledge base and to "differentiate 'truth from nonsense'" (kidwell et al., 2014, p. 304).
the third argument: we would like to support osps, but they are made exclusively for experimental (laboratory) research. there are no suitable templates for other approaches (correlative [field] research, secondary data analyses, qualitative studies, etc.). this is not true, and it has not been for a while. while the first osps and preregistration templates were indeed often developed with a focus on experimental (laboratory) research (e.g., van't veer and giner-sorolla, 2021), many further developments followed. there are now templates that allow preregistering analyses of pre-existing data (mertens and krypotos, 2019), systematic reviews (van den akker et al., 2020), meta-analyses (moreau and gamble, 2022), and qualitative studies (haven et al., 2020; kern and gleditsch, 2017). moreover, extant templates originally developed for experimental research can be adapted for all kinds of research with relative ease. we argue that a preregistration not fitting the template perfectly is better than no preregistration at all. while a preregistration should always contain certain information (e.g., how the sample size is determined and what measures will be used), every effort to limit researcher degrees of freedom (and thereby possibilities to engage in qrps) via preregistration is a step in the right direction. based on our own experience, we can recommend the template from the aspredicted.org website, which is also offered via the open science framework (osf; http://osf.io). the template can be used without a word limit or length restrictions on the osf, while the aspredicted.org website has a word limit. the template is simple, short, and easily adapted to a variety of study types. in different projects of our research group, we have used it, for instance, for experimental studies, correlational studies, analyses of pre-existing data, meta-analyses, and qualitative studies.
thus, although its original focus may have been experimental research, it is clearly not restricted to such studies. however, the templates that were designed for specific study types are of course less generic and facilitate the declaration of necessary study-specific details (e.g., study eligibility criteria or the literature search strategy for meta-analyses; see moreau and gamble, 2022). the fourth argument: if journals implement osps, it raises the bar and makes publishing more difficult. this especially applies to certain kinds of research, for instance research on minorities or on hard-to-reach or small populations.4 admittedly, such concerns about gatekeeping can be justified because the requirements for publication would increase. for instance, when preregistering a study, researchers are asked to provide an a priori justification of their sample size (see, for instance, the templates on the aspredicted.org website or by van’t veer and giner-sorolla, 2021). this often means collecting [much] larger samples than when conducting a study without a sample size justification (mone et al., 1996; paterson et al., 2016). however, providing a sample size justification does not always mean that collecting a large sample is necessary (and often it is not done; see bakker et al., 2020). having resource constraints and/or studying a hard-to-reach or small population are legitimate justifications for the realized sample size (lakens, 2022; although collecting surprisingly large samples is more often possible than researchers might think at first, vazire, 2015). still, while there are often good reasons to conduct and publish research with rather small samples, scientists should then actively acknowledge the potential, goals, and limits of their statistical analyses. moreover, it can be debated how problematic a higher bar for publishing would actually be.
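the a priori sample size justification discussed above can be made concrete. the following sketch is ours, not from the commentary; it uses the common normal approximation for a two-sided, two-sample t-test, so exact tools (e.g., g*power) return slightly larger numbers:

```python
# hedged sketch: approximate per-group n for a two-sample t-test via the
# normal approximation; alpha is two-sided.
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size_d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group to detect Cohen's d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for two-sided alpha
    z_power = z.inv_cdf(power)           # quantile corresponding to desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)

print(n_per_group(0.50))  # medium effect: 63 per group
print(n_per_group(0.30))  # small-to-medium effect: 175 per group
```

an effect of d ≈ .30 (a field-typical magnitude in the range of the meta-analytic effects reported by paterson et al., 2016) already implies several hundred participants in total, which illustrates why justified sample sizes are often larger than the status quo.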
in fact, there has been at least some agreement for some time now (e.g., nelson et al., 2012; vazire, 2018) that individual researchers should publish fewer manuscripts while increasing their quality. to allow for stronger scientific claims (vazire, 2018), researchers should improve the methods they apply, including the use of osps, but not excluding further improvements in other methodological areas. the fifth argument: it does not make much of a difference if journals actively support osps. researchers do not want to use them. while it may be true that first initiatives to foster the use of osps in a field are not necessarily met with enthusiasm by most researchers, there is no reason to be pessimistic. as with any innovation, people take it up at different speeds, and it takes a while for change to affect the habits of the majority. but researchers do willingly take up these new measures, especially if esteemed journals lead the movement. the psychological flagship journal psychological science, for instance, has been an “early adopter” of osps since 2014 and has actively supported (but not enforced) their use. the journal saw various positive results of its new policy rather quickly: since the introduction of open science badges (see table 1), the data sharing rate for published articles has increased. in fact, when researchers earned an open data badge rather than merely indicating data availability, the data “were more likely to be actually available, correct, usable, and complete” (kidwell et al., 2016). the higher rate of published replication studies in the journal since the introduction of the “preregistered direct replication” article format indicates another positive change. the sixth argument: journals that actively support osps experience a competitive disadvantage because scientists consider them less attractive target journals.
journals like leadership quarterly or the journal of business and psychology endorsed and supported the use of osps relatively early. if anything, these journals benefitted from this decision: although their strongly positive development in terms of journal metrics such as the journal impact factor is certainly driven by various factors and decisions, their articulated attitude towards osps at the very least did not prevent this development (see also the recent development of the journal of applied psychology after introducing transparency-related changes). and of course, there are other journals that have not yet embraced osps and did not show a comparably positive development in the same time span. the seventh argument: journals that do at least encourage some osps do more than others and they therefore do enough. while some journals do actively support the use of selected osps (e.g., the journal of personnel psychology or group and organization management offering hybrid registered report submission; see gardner, 2020), these efforts are not very visible. researchers typically have to search actively for this option. if they do not know it is offered or do not know what to look for, there is a good chance that they will not even find the option on a journal website. moreover, supporting only one osp does not and cannot address all of the problems we listed above. to do so, it would be much more effective to actively support all osps (see table 1).

iwo psychology and management journals should do more to support osps

positive change is not coming to our field automatically. illustrating this notion, a current study (tenney
[footnote 4: we would like to thank our reviewer elizabeth tenney for suggesting this argument.]
table 1. open science practices: definitions and demonstrated and assumed benefits for the field

transparency requirements for data, method and code, material or stimuli
definition: as a minimum requirement, authors indicate whether they will make their data, analytic methods used in the analysis (i.e., methods and code), and research materials used to conduct the research (i.e., material or stimuli) available to any researcher.
benefits: • studies employing high statistical power, complete methodological transparency, and preregistration are highly replicable and more replicable than past studies in prior multi-lab replication efforts (protzko et al., 2020).

preregistration
definition: preregistration is defined as “specifying your research plan in advance of your study and submitting it to a registry” (center for open science, n.d.-a).
benefits: • preregistered studies are more transparent concerning the reporting of their findings than non-preregistered studies and also report a lower rate of confirmed hypotheses (toth et al., 2021). • researchers with experience using preregistrations (n = 299) reported mostly positive experiences; they believed that preregistrations had improved the quality of their research projects and “that the benefits outweigh the challenges” (sarafoglou et al., 2021).

registered reports
definition: registered reports are “a publishing format used by over 250 journals that emphasizes the importance of the research question and the quality of methodology by conducting peer review prior to data collection” (center for open science, n.d.-c).
benefits: • researchers blinded to the rated article type rated registered reports more positively across a number of criteria, including the rigorousness of the employed methods and the analysis as well as the overall manuscript quality and the importance of produced discoveries (soderberg et al., 2021).

open science badges
definition: open science badges “are incentives for researchers to share data, materials, or to preregister” (center for open science, n.d.-b). specific badges indicate that an article was preregistered, or that its data or its material has been made publicly available.
benefits: • open science badges increase data and materials sharing (kidwell et al., 2016). • shared data are “more likely to be actually available, correct, usable, and complete” when researchers earn an open data badge than when they only indicate data availability (kidwell et al., 2016).

replications
definition: replications are “a fundamental feature of the scientific process” (zwaan et al., 2018). when conducting replications, researchers critically test the robustness and validity of scientific discoveries.
benefits: • replications ensure the robustness of published research because “a finding needs to be repeatable to count as a scientific discovery” (zwaan et al., 2018).

table 2. the seven arguments treated in this commentary and their refutations

(1) argument: osps are for scientific fields that evidentially have documented problems with the replicability of their findings, like social psychology.
refutation: the replicability of research findings is consistently rather low for many scientific fields, including many neighbouring fields of iwo psychology (beyond social psychology). due to the extant similarities in publishing practices and incentives across disciplines, it is unlikely that the situation is different in iwo psychology.

(2) argument: show us the evidence that our field does in fact have severe methodological problems. maybe then we will be willing to start supporting osps.
refutation: iwo psychology has the following well-documented methodological problems: (i) low statistical power, (ii) non-transparent research practices and a lack of data sharing, (iii) a high prevalence of questionable research practices, (iv) many false positive findings, (v) publication bias and a substantial file drawer problem, (vi) a bias towards novelty at the expense of replication studies and cumulative science.

(3) argument: we would like to support osps, but they are made exclusively for experimental (laboratory) research. there are no suitable templates for other approaches (correlative [field] research, secondary data analyses, qualitative studies, etc.).
refutation: suitable templates have been specifically developed for the analysis of extant (correlational) data, qualitative studies, meta-analyses, etc. moreover, existing templates can be easily adapted.

(4) argument: if journals implement osps, it raises the bar and makes publishing more difficult. this especially applies to certain kinds of research, for instance research on minorities, hard-to-reach or small populations.
refutation: if journals implement osps, it would probably often raise the bar for publishing research. it is, however, wrong that osps would always require larger sample sizes. moreover, raising the bar would probably be good for the scientific enterprise.

(5) argument: it does not make much of a difference if journals actively support osps. researchers do not want to use them.
refutation: it may take time, but researchers do want to use osps if journals implement them and incentivize their use, as for instance the case of psychological science shows.

(6) argument: journals that actively support osps experience a competitive disadvantage because scientists consider them as less attractive target journals.
refutation: the limited evidence that we have does not support this argument. early osp adopters among the journals (journal of business and psychology, leadership quarterly) fared pretty well in comparison to non-adopters.

(7) argument: journals that do at least encourage some osps do more than others and they therefore do enough.
refutation: these selected efforts are typically not sufficiently visible. supporting only a part of the osps cannot fully and effectively address the field’s problems.

et al., 2021) found that less than one percent of articles in the field’s flagship journals are preregistered, less than one percent of the publications are replication studies (for comparable results, see ryan and tipu, 2022) or report null results, and for less than three percent, authors indicate that they openly shared their data or their materials. these low rates most likely reflect the journal policies concerning osps (cf. torka et al., 2023). for example, concerning replications, tipu and ryan (2021) showed that only 4.7% of more than 600 analysed business and management journals explicitly considered replication studies, while “238 (39.7%) were implicitly dismissive of replication studies, and the remaining 3 (0.5%) journals were explicitly disinterested in considering replication studies for publication” (tipu and ryan, 2021, p. 101; for comparable results concerning replications and also further osps, see torka et al., 2023). with this contribution, we would like to invite and challenge iwo psychology and management journals to foster researchers’ use of as many osps as possible. to be clear, we do not suggest forcing researchers to use certain osps. we rather ask the journals to contribute to the needed cultural change in the field’s research practices by (i) encouraging and incentivizing methodological transparency and the use of preregistrations (e.g., by offering open science badges), (ii) offering registered reports as an equitable publishing format, and (iii) explicitly inviting well-designed replications.
these measures, while cheap and easy to implement, can increase researchers’ perceptions of a journal as an attractive outlet for their high-quality research, increase the quality of research overall and the resulting trust in it, and change the field for the better by addressing the systemic roots of qrps and the low replicability of findings.

author contact

joachim hüffmeier, department of psychology, tu dortmund university, emil-figge-straße 50, 44227 dortmund, germany. e-mail: joachim.hueffmeier@tu-dortmund.de (corresponding author); marc mertes, e-mail: marc.mertes@tu-dortmund.de

conflict of interest and funding

the authors have no conflicts of interest. there was no specific source of funding.

author contributions

joachim hüffmeier wrote the manuscript and marc mertes provided revisions.

open science practices

this article is conceptual and is not eligible for open science badges. the entire editorial process, including the open reviews, is published in the online supplement.

references

artner, r., verliefde, t., steegen, s., gomes, s., traets, f., tuerlinckx, f., & vanpaemel, w. (2021). the reproducibility of statistical results in psychological research: an investigation using unpublished raw data. psychological methods, 26(5), 527–546. https://doi.org/10.1037/met0000365

asendorpf, j. b., conner, m., de fruyt, f., de houwer, j., denissen, j. j. a., fiedler, k., fiedler, s., funder, d. c., kliegl, r., nosek, b. a., perugini, m., roberts, b. w., schmitt, m., van aken, m. a. g., weber, h., & wicherts, j. m. (2013). recommendations for increasing replicability in psychology. european journal of personality, 27(2), 108–109. https://doi.org/10.1002/per.1919

bakker, m., veldkamp, c. l., van den akker, o. r., van assen, m. a., crompvoets, e., ong, h. h., & wicherts, j. m. (2020). recommendations in pre-registrations and internal review board proposals promote formal power analyses but do not increase sample size. plos one, 15(7), e0236079.
https://doi.org/10.1371/journal.pone.0236079

banks, g. c., rogelberg, s. g., woznyj, h. m., landis, r. s., & rupp, d. e. (2016). evidence on questionable research practices: the good, the bad, and the ugly. journal of business and psychology, 31, 323–338. https://doi.org/10.1007/s10869-016-9456-7

bergh, d. d., sharp, b. m., aguinis, h., & li, m. (2017). is there a credibility crisis in strategic management research? evidence on the reproducibility of study findings. strategic organization, 15(3), 423–436. https://doi.org/10.1177/1476127017701076

camerer, c. f., dreber, a., forsell, e., ho, t.-h., huber, j., johannesson, m., kirchler, m., almenberg, j., altmejd, a., chan, t., heikensten, e., holzmeister, f., imai, t., isaksson, s., nave, g., pfeiffer, t., razen, m., & wu, h. (2016). evaluating replicability of laboratory experiments in economics. science, 351(6280), 1433–1436. https://doi.org/10.1126/science.aaf0918

center for open science. (n.d.-a). future-proof your research. preregister your next study. https://www.cos.io/initiatives/prereg

center for open science. (n.d.-b). open science badges enhance openness, a core value of scientific practice. https://www.cos.io/initiatives/badges

center for open science. (n.d.-c). registered reports: peer review before results are known to align scientific values and practices. https://www.cos.io/initiatives/registered-reports

evanschitzky, h., baumgarth, c., hubbard, r., & armstrong, j. s. (2007).
replication research’s disturbing trend. journal of business research, 60(4), 411–415. https://doi.org/10.1016/j.jbusres.2006.12.003

fanelli, d. (2012). negative results are disappearing from most disciplines and countries. scientometrics, 90, 891–904. https://doi.org/10.1007/s11192-011-0494-7

gardner, w. l. (2020). farewell from the outgoing editor. group & organization management, 45(6), 762–767. https://doi.org/10.1177/1059601120980536

goldfarb, b., & king, a. a. (2016). scientific apophenia in strategic management research: significance tests & mistaken inference. strategic management journal, 37(1), 167–176. https://doi.org/10.1002/smj.2459

harrison, j. s., banks, g. c., pollack, j. m., o’boyle, e. h., & short, j. (2017). publication bias in strategic management research. journal of management, 43(2), 400–425. https://doi.org/10.1177/0149206314535438

haven, t. l., errington, t. m., gleditsch, k. s., van grootel, l., jacobs, a. m., kern, f. g., piñeiro, r., rosenblatt, f., & mokkink, l. b. (2020). preregistering qualitative research: a delphi study. international journal of qualitative methods, 19, 1–13. https://doi.org/10.1177/1609406920976417

hüffmeier, j., mertes, m., schultze, t., nohe, c., mazei, j., & zacher, h. (n.d.). prevalence, problems, and potential: a survey on selected open science practices among iwo psychologists. [manuscript in preparation].

kepes, s., keener, s. k., mcdaniel, m. a., & hartman, n. s. (2022). questionable research practices among researchers in the most research-productive management programs. journal of organizational behavior, advance online publication.

kern, f. g., & gleditsch, k. s. (2017). exploring pre-registration and pre-analysis plans for qualitative inference. https://t1p.de/nvb9i

kidwell, m. c., lazarević, l. b., baranski, e., hardwicke, t. e., piechowski, s., falkenberg, l. s., kennett, c., slowik, a., sonnleitner, c., hess-holden, c., errington, t.
m., fiedler, s., & nosek, b. a. (2014). facts are more important than novelty: replication in the education sciences. educational researcher, 43(6), 304–316. https://doi.org/10.3102/0013189x14545513

kidwell, m. c., lazarević, l. b., baranski, e., hardwicke, t. e., piechowski, s., falkenberg, l. s., kennett, c., slowik, a., sonnleitner, c., hess-holden, c., errington, t. m., fiedler, s., & nosek, b. a. (2016). badges to acknowledge open practices: a simple, low-cost, effective method for increasing transparency. plos biology, 14(5), e1002456. https://doi.org/10.1371/journal.pbio.1002456

lakens, d. (2022). sample size justification. collabra: psychology, 8(1), 33267.

mertens, g., & krypotos, a.-m. (2019). preregistration of analyses of preexisting data. psychologica belgica, 59(1), 338–352. https://doi.org/10.5334/pb.493

mone, m. a., mueller, g. c., & mauland, w. (1996). the perceptions and usage of statistical power in applied psychology and management research. personnel psychology, 49(1), 103–120. https://doi.org/10.1111/j.1744-6570.1996.tb01793.x

moreau, d., & gamble, b. (2022). conducting a meta-analysis in the age of open science: tools, tips, and practical recommendations. psychological methods, 27(3), 426–432. https://doi.org/10.1037/met0000351

nelson, l. d., simmons, j. p., & simonsohn, u. (2012). let’s publish fewer papers. psychological inquiry, 23(3), 291–293. https://doi.org/10.1080/1047840x.2012.70524

o’boyle jr, e. h., banks, g. c., & gonzalez-mulé, e. (2017). the chrysalis effect: how ugly initial results metamorphosize into beautiful articles. journal of management, 43, 376–399. https://doi.org/10.1177/0149206314527133

o’boyle jr, e. h., rutherford, m. w., & banks, g. c. (2014). publication bias in entrepreneurship research: an examination of dominant relations to performance. journal of business venturing, 29(6), 773–784. https://doi.org/10.1016/j.jbusvent.2013.10.001

open science collaboration. (2015).
estimating the reproducibility of psychological science. science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

paterson, t. a., harms, p., steel, p., & credé, m. (2016). an assessment of the magnitude of effect sizes: evidence from 30 years of meta-analysis in management. journal of leadership & organizational studies, 23(1), 66–81. https://doi.org/10.1177/1548051815614321

protzko, j., krosnick, j., nelson, l. d., nosek, b. a., axt, j., berent, m., buttrick, n., debell, m., ebersole, c.
r., lundmark, s., macinnis, b., o’donnell, m., perfecto, h., pustejovsky, j. e., roeder, s., walleczek, j., & schooler, j. w. (2020). high replicability of newly-discovered social-behavioral findings is achievable. psyarxiv. https://doi.org/10.31234/osf.io/n2a9x

ryan, j. c., & tipu, s. a. (2022). business and management research: low instances of replication studies and a lack of author independence in replications. research policy, 51(1), 104408. https://doi.org/10.1016/j.respol.2021.104408

sarafoglou, a., kovacs, m., bakos, b. e., wagenmakers, e., & aczel, b. (2021). a survey on how preregistration affects the research workflow: better science but more work. psyarxiv. https://doi.org/10.31234/osf.io/6k5gr

simmons, j. p., & nelson, l. d. (2019). data replicada. data colada. https://datacolada.org/81

soderberg, c. k., errington, t. m., schiavone, s. r., bottesini, j., thorn, f. s., vazire, s., esterling, k. m., & nosek, b. a. (2021). initial evidence of research quality of registered reports compared with the standard publishing model. nature human behaviour, 5(8), 990–997. https://doi.org/10.1038/s41562-021-01142-4

tenney, e. r., costa, e., allard, a., & vazire, s. (2021). open science and reform practices in organizational behavior research over time (2011 to 2019). organizational behavior and human decision processes, 162, 218–223. https://doi.org/10.1016/j.obhdp.2020.10.015

tenopir, c., allard, s., douglass, k., aydinoglu, a., wu, l., read, e., & manoff, m. (n.d.). data sharing by scientists: practices and perceptions. plos one, 6(6), e21101. https://doi.org/10.1371/journal.pone.0021101

tipu, s. a. a., & ryan, j. c. (2021). are business and management journals anti-replication? an analysis of editorial policies. management research review, 45(1), 101–117. https://doi.org/10.1108/mrr-01-2021-0050

torka, a.-k., mazei, j., bosco, f., cortina, j., götz, m., kepes, s., o’boyle, e., & hüffmeier, j. (2023).
how well are open science practices implemented in industrial, work, and organizational psychology and management? management research review, manuscript under journal review.

toth, a. a., banks, g. c., mellor, d., o’boyle, e. h., dickson, a., davis, d. j., dehaven, a., bochantin, j., & borns, j. (2021). study preregistration: an evaluation of a method for transparent reporting. journal of business and psychology, 36, 553–571.

van den akker, o., peters, g. j., bakker, c., carlsson, r., coles, n. a., corker, k. s., ..., & yeung, s. k. (2020). prosysrev: a generalized form for registering reproducible systematic reviews. metaarxiv. https://osf.io/preprints/metaarxiv/3nbea/

van’t veer, a. e., & giner-sorolla, r. (2021). preregistration in social psychology—a discussion and suggested template. journal of experimental social psychology, 67, 2–12. https://doi.org/10.1016/j.jesp.2016.03.004

vazire, s. (2015). super power. https://sometimesimwrong.typepad.com/wrong/2015/11/super-power.html

vazire, s. (2018). implications of the credibility revolution for productivity, creativity, and progress. perspectives on psychological science, 13(4), 411–417. https://doi.org/10.1177/1745691617751884

wicherts, j. m., borsboom, d., kats, j., & molenaar, d. (2006). the poor availability of psychological research data for reanalysis. american psychologist, 61(7), 726–728. https://doi.org/10.1037/0003-066x.61.7.726

zwaan, r. a., etz, a., lucas, r. e., & donnellan, m. b. (2018). making replication mainstream. behavioral and brain sciences, 41, e120. https://doi.
org/10.1017/s0140525x17001972

meta-psychology, 2023, vol 7, mp.2022.3270 https://doi.org/10.15626/mp.2022.3270 article type: original article published under the cc-by4.0 license open data: not applicable open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: rickard carlsson reviewed by: peder isager, matt williams analysis reproduced by: lucija batinović associated osf project:
https://doi.org/10.17605/osf.io/9x7d4

means to valuable exploration ii: how to explore data to modify existing claims and create new ones

michael höfler1,2, brennan mcdonald1,2, philipp kanske1,2, and robert miller1
1faculty of psychology, technische universität dresden, dresden, germany
2clinical psychology and behavioural neuroscience, institute of clinical psychology and psychotherapy, technische universität dresden, germany

abstract

transparent exploration in science invites novel discoveries by stimulating new or modified claims about hypotheses, models, and theories. in this second article of two consecutive parts, we outline how to explore data patterns that inform such claims. transparent exploration should be guided by two contrasting goals: comprehensiveness and efficiency. comprehensiveness calls for a thorough search across all variables and possible analyses so as not to miss anything that might be hidden in the data. efficiency adds that new and modified claims should withstand severe testing with new data and give rise to relevant new knowledge. efficiency aims to reduce false positive claims, which is better achieved if a large set of results is condensed into a few claims. means for increasing efficiency are methods for filtering local data patterns (e.g., only interpreting associations that pass statistical tests or using cross-validation) and for smoothing global data patterns (e.g., reducing associations to relations between a few latent variables). we suggest that researchers should condense their results with filtering and smoothing before publication. coming up with just a few most promising claims saves resources for confirmation trials and keeps scientific communication lean. this should foster the acceptance of transparent exploration. we end with recommendations derived from the considerations in both parts: an exploratory research agenda and suggestions for stakeholders such as journal editors on how to implement more valuable exploration.
these include special journal sections or entire journals dedicated to explorative research and a mandatory separate listing of the confirmed and new claims in a paper’s abstract. keywords: exploration, transparency, smoothing, filtering, preregistration, open data, open analysis, severe testing, replication introduction it has long been recognised that confirmatory and exploratory research are beneficial for each other. exploratory findings can provide insights for new or improved scientific claims to be tested (lakatos, 1977; popper, 1959; stebbins, 1992), and the failure of a confirmatory trial might suggest exploring for a better claim and a more promising next trial. however, for exploration to inform confirmation well, researchers need to be equipped with an understanding of the aims and means of exploratory analysis in advance. in the first of two consecutive articles (höfler et al., 2022), we called for a sharp boundary between confirmation and exploration to separate established from new scientific claims about hypotheses, models and theories. a claim is confirmed if an evidential norm is met, such as p-value (p) < α. strict adherence to an evidential norm ensures severe testing (mayo, 2018): a confirmatory test of a claim must be likely to fail if the claim is wrong. such a risky probe ensures that a claim is supported by meaningful evidence. unfortunately, adherence is often violated through the use of questionable research practices, by cherry-picking a p < α from numerous different analyses (p-hacking) or a hypothesis that happens to yield such a p (harking; hollenbeck and wright, 2017). practices like these constitute intransparent exploration, misused to produce seeming confirmation of a hypothesis by pretending to meet the norm (höfler et al., 2022). non-adherence may thus remain hidden behind non-transparency in how data were analysed and hypotheses were generated.
therefore, adherence requires control to accept an analysis as confirmatory, for example by pre-registration (höfler et al., 2022). in contrast, transparent or “open exploration” (thompson et al., 2020) enjoys the freedom to extensively analyse data (manuti & giancaspro, 2019) and embraces all “researchers’ degrees of freedom” (dirnagl, 2020; simonsohn et al., 2020) to modify existing or create new claims about the world. however, by trying different analyses, for instance by using multiple statistical tests, the evidential norm may not be adhered to because α accumulates over several tests (bender & lange, 2001). in consequence, a confirmatory trial with new data is required to adhere to the norm. this idea extends to concatenated exploration, an iterative process, in which exploration and confirmation repeatedly feed each other, modifying and testing claims, to identify the best possible claims that can be confirmed (stebbins, 1992). likewise, empirical science has been described as a process of mapping of knowledge back and forth from a claim via study design to data analysis and modification of the claim, with modification guided, for example, by exploratory results (bogen & woodward, 1988; box, 1980; lakatos, 1977; mayo, 2018; popper, 1959; suppes, 1969). for transparent exploration to evolve, however, researchers need to be equipped with a conceptual understanding and practical skills of exploratory analysis. this shall foster researchers’ self-efficacy and make them more willing to freely conduct and openly report exploration (stebbins, 2001). however, what exploration actually is has rarely been asked in psychology, with a few exceptions (behrens, 1997; dirnagl, 2020).
likewise, exploration, recognisable as such, appears hard to find outside the current data mining/big data movement (adjerid & kelley, 2018), qualitative investigations (kassis & papps, 2020), planned reviews (moghaddam, 2004) and theses (sohmer, 2020). in this second article we outline what we believe are important foundations for conceptualising and conducting transparent exploration. we begin by discussing the goals of comprehensive and efficient exploration. we then describe basic ideas on how to refine existing hypotheses, models and theories and how to create new hypotheses. based on these foundations, we summarize analytical means to address efficiency through filtering and smoothing of explorative results. the paper ends with a small research agenda framework and recommendations for stakeholders who have the means to establish more transparent exploratory research. goals of exploration exploration as a quantitative quest for novelty as in part i (höfler et al., 2021), we refer to exploration in the specific sense of “a toolbox of analytical methods to generate and modify hypotheses, models, and theories”. creating and refining such claims about the world allows for scientific novelty and may be achieved by quantitative analysis. note that we do not address qualitative analysis here, which may serve the same purpose (i. & r., 1998). we regard quantitative exploration as a quest for data patterns that may give rise to novelty. we exemplify data patterns with associations between variables, but data patterns may also be higher-order relations such as interactions, clusters of individuals or variables that appear similar in a substantive respect, trajectories over time, or other “data regularities” that may point to new insights (adjerid & kelley, 2018; hand, 2007; nguyen, 2000).
a quest for such patterns may be theoretically well informed and thus planned, or may be primarily data-driven, starting with inspection or quantitative analysis of the data resulting in unusual, unexpected or striking patterns. these may be of direct interest or suggest where and how to further explore. comprehensive exploration and the explorative search-space perhaps the most straightforward idea of exploration is comprehensiveness. comprehensiveness embraces the potential to discover any and all patterns in a dataset that would give rise to a hidden truth about nature or challenge prior beliefs (stebbins, 2001; swedberg, 2018). however, given the inherent difficulties associated with collecting new data or even analysing given data, the resources to explore will always be limited by feasibility, time, financial and other practical constraints. nevertheless, we suggest that comprehensiveness should initially guide the planning of exploration. for example, if one’s goal is to identify unknown risk factors for mental health problems, all possible variables, analyses, and observational levels ranging from the biochemical to the level of society (williams, 2021) should be taken into consideration in the first place. theoretical arguments and prior empirical results may then suggest where the most important patterns are hidden. for instance, one may collect a data set with hundreds of potential risk factors from different domains (parental mental health, childhood risk factors, nutrition, stressors . . . ) and dozens of mental health outcomes (disorders, disability measures . . . ), and for any factor-outcome combination an association may be found. alternatively, researchers may decide to focus on exploring a specific domain and a small range of outcomes, for example, diet factors and their relation with affective disorders (martins et al., 2021). such considerations ask for boundaries within which to explore.
we conceptualize this with the exploratory search-space. the exploratory search-space comprises all data patterns (e.g., associations between variables) that are actually explored among all patterns that could be explored. choosing an exploratory search-space is akin to placing the lasso of one’s practical resources around the area where background knowledge suggests the most novelty. (figure 1: schematic illustration of two explorative search-spaces that could be chosen to find data patterns of interest (green dots). a pattern of interest can only be found within the boundaries of a search-space.) figure 1 illustrates this with a very simple research quest, where 200 data patterns (potential associations) could be explored, out of which 6 are patterns of interest (e.g., true associations) that could be later identified by explorative analysis. the figure shows two possible choices of explorative search-spaces, s1 and s2. the narrow s1 contains 4 out of 6 patterns of interest and 22 patterns of no interest. s2 is a more comprehensive extension of s1. exploring s2 requires more resources. besides, only 1 more pattern of interest could be identified in a later analysis, but 23 more patterns of no interest could be falsely identified (e.g., by randomly yielding p < α). efficient exploration the danger of false-positive results already introduces the second goal of transparent exploration: efficiency. efficient exploration aims to advance science with new insights while not polluting the literature with a multitude of claims whose subsequent non-confirmation would waste the resources of other researchers or otherwise fail to advance science. such findings are the cost of enjoying comprehensiveness in data exploration. if one turns over every stone, one will find every hidden coin, but also every piece of junk underneath.
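the s1-versus-s2 trade-off can be put into numbers (a toy calculation, not from the paper; it assumes each pattern of no interest is falsely identified with probability α = .05, independently of the others):

```python
# toy numbers for the figure 1 trade-off between search-spaces s1 and s2
alpha = 0.05

s1 = {"of_interest": 4, "of_no_interest": 22}
s2 = {"of_interest": 5, "of_no_interest": 45}  # s2 = s1 plus 1 + 23 patterns

# expected number of falsely identified patterns if every pattern of no
# interest has probability alpha of randomly yielding p < alpha
false_s1 = alpha * s1["of_no_interest"]  # 1.1
false_s2 = alpha * s2["of_no_interest"]  # 2.25

print(false_s1, false_s2)
```

widening the search-space thus roughly doubles the expected number of false identifications while adding only one more pattern of interest that could be found.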
with “patterns of interest” (the 6 green dots in figure 1) we suggest that exploration should aim at identifying patterns that are both 1. true and 2. relevant. by “true” we mean that a data pattern is not caused by chance and gives rise to a previously unknown claim, requiring substantive explanation. in the popperian tradition, a true claim must improve predictions about the world that could turn out to be wrong (box, 1976; popper, 1959). thus (1) aims at finding claims that are likely to pass severe testing with new data (mayo, 2018). note that a claim derived from a pattern might be close to the pattern; for example, hypothesising an association if a statistically significant association is found in the data. it may, however, require additional substantive input to form a meaningful statement (rubin & donkin, 2022). this is especially the case when a causal hypothesis is derived from the finding of an association (elhai & montag, 2020; glymour et al., 2019). with respect to (2), exploring for relevant patterns aims to exclude proposals of weak or modest scientific value for the benefit of stronger new or modified claims. this serves science per se but also gives rise to more severe testing. a simple example is the claim that an effect is particularly large, rather than just that the effect is greater than zero. not only is this more scientifically informative, but it can also more easily turn out to be wrong. generally, with relevance we mean any substantive argument that might render a claim scientifically interesting. for instance, causal claims have been argued to be much more relevant than associational claims to inform theories and to assess the potential of interventions (hernán, 2018; höfler et al., 2021).
beyond objective dimensions like effect magnitude, practical and clinical significance (kirk, 1996) or the generalisability of a claim (from a narrow to a more general population), “relevance” is a qualitative term that, we believe, should not be defined in general terms across scientific domains. perhaps the best general answer to the meaning of relevance is that it must always be renegotiated by the scientific community even within a domain, because what appears relevant might itself be subject to change. note that a wrong claim might nevertheless trigger true insights and thus be relevant in that sense (nosek et al., 2018; stebbins, 1992, 2006). for example, claims on ego depletion have not been replicated (lurquin & miyake, 2017), but have spawned the idea and finding that willpower is not a limited resource (job et al., 2010). exploring around existing claims with this understanding of comprehensiveness and efficiency, we are now equipped to derive some basic ideas on how to actually explore data. these are not intended to be complete, but rather to sketch out some promising directions that might one day form part of a thorough and mathematically formalized elaboration. we begin with explorative quests around existing knowledge before discussing searches for the entirely new. thus, we proceed from narrow to wider search-spaces, just as science has been hierarchically classified into single hypotheses, models based on multiple hypotheses, and theories for a full explanation of a phenomenon (gelman et al., 2019). exploring along an existing hypothesis with specific claims (hypotheses) it is easier to infer what is wrong, while falsifying global claims (models, theories) leaves open which components actually require modification. additionally, a hypothesis might be wrong but might become true (at least make better predictions) if modified. a hypothesis might also be true but not make a strong proposition. consider again the magnitude of an effect.
commonly, researchers hypothesise that an effect is greater than 0, in which case a confirmative result supports any magnitude greater than zero, including an effect magnitude arbitrarily close to zero. thus, an effect could be below any threshold of practical (e.g., clinical, or public health) significance (“nullism”, greenland, 2017). for a stronger proposition, exploration may aim to identify the highest δ so that the claim “effect > δ” remains true. mayo (2018) gives instances of how to estimate δ based on severe testing calculations. in general, we believe that “turning all the knobs” (hofstadter & dennett, 1981) is a useful metaphor to think about the components of a hypothesis and changing which of them may give rise to a better statement about the world. for example, a hypothesis might state that a particular diet has a positive effect on quality of life. this hypothesis might be modified to say that the effect only occurs in a certain domain of life, or that the diet is only effective if its ingredients are changed. box 1 describes how trying different analytical methods might lead to a better proposition on an effect or an association. box 1: exploring around a hypothesis with “multiverse analyses” “specification curves” (masur & scharkow, 2020; simonsohn et al., 2020) and “multiverse analyses” (del giudice & gangestad, 2021; steegen et al., 2016) try different analytical methods and options and show how a result (p-value, confidence interval) varies across them, that is, how robust it is against the assumptions that a particular analysis makes. knowing what a method is robust against then helps to understand the nature of a relationship under inspection. for instance, there might be clear evidence (p = .001) in ordinary least squares regression for higher quality of life on average if a certain diet is followed versus not followed.
the evidence might however vanish (p = .450) if “robust linear regression” is used instead, a method that is robust against extreme values and outliers in the residuals (erceg-hurn & mirosevich, 2008; field & wilcox, 2017; huber, 1981; wilcox, 2012). this may indicate that extreme values dominate the result in ordinary regression if not accounted for. if further data inspection is consistent with that explanation, the initial hypothesis may be refined from a difference in the mean outcome to just a higher probability of extreme values if the diet is followed. that is, from an overall association to an association only in some individuals. further exploration, for example with “finite mixture models” (skrondal & rabe-hesketh, 2004), might identify who these are. exploring within a theory’s or model’s degrees of freedom when modifying a model or theory, “turning all the knobs” calls for questioning all the single propositions from which the model or theory is built. a theory could be broken down into its component parts, changed, if necessary, as described above, and put back together again to form a modified theory. however, it has been criticised that some theories leave knobs unset in the first place, leaving open how they could turn out to be wrong (bringmann et al., 2022; scheel, 2021). underspecification renders them inaccessible to severe testing when tested as a whole, because turning knobs according to the data improves the theory’s overall fit to the data (eronen & bringmann, 2021; fiedler, 2017; gigerenzer, 2010; lakatos, 1977; lakens, 2019; szollosi & donkin, 2021). because they are poorly falsifiable, some theories “are not even wrong” (scheel, 2021). however, with transparency in exploration, filling the gaps becomes an explicit and desirable purpose (woo et al., 2017). 
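the ordinary-versus-robust contrast in box 1 can be sketched numerically (a toy illustration with hypothetical data; for self-containedness it uses a simple theil-sen-style median-of-pairwise-slopes estimator rather than the robust regression methods cited in the box, but the point, that two analytical specifications can disagree when extreme values are present, is the same):

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical data: diet score x, quality-of-life outcome y, true slope = 2
x = rng.uniform(0, 1, 40)
y = 2.0 * x + rng.normal(0, 0.3, 40)
y[np.argsort(x)[-3:]] += 25.0  # a few extreme values at high x dominate

# specification 1: ordinary least-squares slope (sensitive to the outliers)
ols_slope = float(np.polyfit(x, y, 1)[0])

# specification 2: robust theil-sen-style slope (median of pairwise slopes)
i, j = np.triu_indices(len(x), k=1)
robust_slope = float(np.median((y[j] - y[i]) / (x[j] - x[i])))

print(ols_slope, robust_slope)  # the two specifications disagree markedly
```

if further inspection confirms that a handful of extreme values drives the ordinary estimate, the hypothesis may be refined as described in the box, from an overall mean difference to a higher probability of extreme values.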
this is rewarded with publications, and a completed model or theory which makes specific predictions and thus becomes subject to severe confirmative testing both as a whole and in its completed parts. knobs to be particularly turned are causal claims in theories, which have only been tested as if they were associative (höfler et al., 2022; höfler et al., 2021). another source of hidden need for modification is poor measurement with established, but questionable, instruments (e.g., schimmack, 2021). exploring to create new claims local versus global data patterns large-scale studies collect data on many factors and outcomes, such as in the epidemiology of mental disorders (kessler & merikangas, 2004), let alone the huge data sets from genetic or imaging studies (pennycook, 2018; thompson et al., 2020). with such studies one may find countless associations, and the question arises whether to explore them individually or summarize them in advance (hand, 2007). imagine one assesses 20 nutrition factors in relation to 10 mental health outcomes. here, local patterns are associations between specific nutritional factors and specific outcomes. if indicative of causal effects, they might have different implications for science or practice: a theory might suppose that different nutritional factors have very different impacts on various aspects of mental health. accordingly, interventional effects may depend on which factor is changed to affect which outcome. for example, the absence of alcohol consumption might have a different impact on social well-being than a vegan diet has on personal growth. on the other hand, the 20 factors and 10 outcomes could be manifestations of just a few latent variables, which might explain why a certain set of associations can be found. in this case, one may focus on the global pattern of associations, for instance the relation between healthy nutrition and overall mental well-being.
such a focus has been used, for example, to hypothesise about the relationships between psychopathology and neural measures using canonical correlations (linke et al., 2021). in neuroscience, exploring for global claims has been argued to be more important for insight and prediction (bzdok & ioannidis, 2019). box 2 illustrates how globally focusing on any association versus locally focusing on specific associations relates to severe testing when using statistical tests in an explorative manner, and whether one should adjust the α for each test to the number of associations tested. box 2: severe testing of any association versus a particular association if several associations could be found with 20 factors and 10 outcomes, 200 associations may be tested, each with a level α significance test to separate randomly from non-randomly occurring patterns. here, α * 200 of the tests would be expected to yield p < α in the absence of any associations (e.g., colquhoun, 2014). with α = .05, this equals 10. if one happens to find at least one p < α, the result “any association found” has not been severely probed and hence provides only little evidence for anything being truly there, because there were 200 chances for identifying a pattern. referring to “any association” puts these tests into the global context of all the 200 investigated associations, and with this global perspective, α is inflated (bender & lange, 2001). the other possible result, “no association found”, would be supported with considerable initial evidence because it could have been refuted 200 times, especially if the sample is large and thus the beta errors of the individual tests are small. the evidential norm, however, may be adjusted for the number of tests, replacing α with α/200 in each test (bonferroni correction).
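the box 2 arithmetic can be checked with a short sketch (a toy calculation, not part of the paper; it assumes the 200 tests are independent and no true associations exist):

```python
# toy check of the box 2 arithmetic: 200 independent tests, no true effects
alpha = 0.05
n_tests = 200

# expected number of tests yielding p < alpha under the null
expected_false_positives = alpha * n_tests  # 10.0

# probability of "any association found" under the null (alpha inflation)
p_any = 1 - (1 - alpha) ** n_tests  # very close to 1

# bonferroni correction: test each association at alpha / n_tests instead
alpha_bonferroni = alpha / n_tests  # 0.00025
p_any_bonferroni = 1 - (1 - alpha_bonferroni) ** n_tests  # close to alpha

print(expected_false_positives, p_any, alpha_bonferroni, p_any_bonferroni)
```

the family-wise probability of "any association found" is near 1 without adjustment, so that result has been probed with almost no severity; after the bonferroni correction it drops back to just under α.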
not doing so has been criticised to undermine trust in some fields of science through spurious results, for example in genome-wide association studies (jorgensen et al., 2009; marigorta et al., 2018). the adjustment turns the matter around: now the result of “any association” is much more severely probed, but the result of “no association” a great deal less severely than before. if background knowledge suggests that local associations are of interest, each association should be tested with a level α test irrespective of the other associations (bender & lange, 2001). a statistically significant association then has been probed with a severity of 1 – α, and a statistically non-significant association with a severity of 1 – β (mayo, 2018). filtering local data patterns when searching for the new, it may be desirable to choose a large search-space, but a comprehensive exploration carries the risk of many false positive results. this danger is countered by more rigorous filtering, which results in a smaller number of identified patterns. in figure 2 we illustrate exploration as a process of firstly choosing an explorative search space as before (figure 1 with search space s1), then filtering the data patterns and finally creating new claims out of the remaining patterns. after choosing search space s1 with a total of 26 patterns, 4 patterns of interest could be identified. 3 patterns are actually identified by filtering, 2 of which are patterns of interest. the efficiency of filtering within s1 can then be described by the proportion of identified patterns of interest among all patterns of interest (2 out of 4) and the proportion of patterns of no interest among all identified patterns (1 out of 3). (figure 2: the specification of an exploratory search space in step 1 is efficient in that it covers 4 out of 6 patterns of interest, while the vast majority of patterns of no interest are omitted from the outset (figure 1). in step 2, the 22 patterns of no interest (grey dots) and the 4 patterns of interest (green dots) within the search space are subjected to filtering, after which 1 pattern of no interest and 2 patterns of interest remain. finally, claims are derived from these.) finally, the 3 identified patterns need to be translated into substantive claims. most simply and close to the data, one may create 3 separate associational hypotheses. or, 2 associations may appear substantively similar, like: 1. more chronic stress when crash diets are used, and 2. more chronic stress when diet adherence is exceptionally high. this may give rise to more global hypothesizing: “extreme attention on healthy nutrition is related to more chronic stress”. what are specific methods to filter, here to move from 26 patterns to perhaps 3? so far in this article, we have solely mentioned the dominant method of statistical tests. statistical tests are useful to eliminate random patterns, but, with their profound origin as an approach to confirmation and without explicit explorative language, they contribute to the blending of confirmation and exploration. this applies as long as the difference is not made very clear by a statement such as “explorative testing was conducted” (höfler et al., 2022). alternatives include confidence intervals, descriptive statistics, data mining and machine learning techniques (adjerid & kelley, 2018; alonso et al., 2018; romero & ventura, 2020), bayesian approaches and any other method that may happen to be effective. individual versus community-driven filtering a yet more fundamental question when filtering results is who should do it. with individual-driven filtering, as assumed so far, scientists themselves filter their results before coming up with new claims in a publication. the most universal individual filtering method is internal cross-validation (de rooij & weeda, 2020; fleming et al., 2021; xiong et al., 2020). it can, in principle, be combined with any analytical method.
its key idea is splitting a large data set randomly into n subsets, and repeatedly running an exploratory analysis in n-k “training data” subsets while probing its results with the remaining k “test data” subsets (parvandeh et al., 2020; xiong et al., 2020). this, however, requires a large total sample size. most importantly, internal cross-validation affords researchers the freedom to explore beyond a potentially existing plan or even without a plan, because each pattern found, no matter with which method and with how many analytical options tried, must pass the test data. (this works as long as the entire procedure is not repeated with new randomly created subsets until a striking pattern happens to be seemingly confirmed. however, this danger is easy to address by transparency in the seed value of the random process that divides the sample into subsamples.) by contrast, community-driven filtering relies on the scientific public and is usually implemented through pre-publication peer reviews. another instance is external cross-validation, where different data sets are used to generate claims and filter them (“independent replication”; könig, 2011). we propose that individually-driven filtering should often precede community-driven filtering, because otherwise rigorously filtered results may receive little attention amidst many published poorly filtered results. as an important exception to the previous discussions, each result, for example on diet – mental health associations, might be potentially informative for other researchers. in such cases, particularly in modestly large search-spaces, no filtering at all seems warranted. all associations may be published on a public repository (pennycook, 2018; thompson et al., 2020) so that others are enabled to probe the associations predicted from their causal models (greenland et al., 2004; ryan et al., 2019).
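the internal cross-validation idea described above can be sketched minimally (hypothetical data and thresholds; a simple split-half scheme with a transparent fixed seed, not any specific published procedure): explore freely in a training half, then keep only the patterns that also hold in the held-out half.

```python
import numpy as np

rng = np.random.default_rng(2024)  # fixed seed: the split itself is transparent

n, n_factors = 400, 20
x = rng.normal(size=(n, n_factors))     # e.g., 20 nutrition factors
y = 0.5 * x[:, 0] + rng.normal(size=n)  # only factor 0 is truly associated

# split-half internal cross-validation
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2:]

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# step 1: explore freely in the training half (filter: |r| above a threshold)
candidates = [j for j in range(n_factors)
              if abs(corr(x[train, j], y[train])) > 0.2]

# step 2: keep only candidates that also pass in the held-out test half
kept = [j for j in candidates if abs(corr(x[test, j], y[test])) > 0.2]

print(candidates, kept)  # spurious training-half patterns tend to drop out
```

however many analytical options were tried in step 1, a pattern only survives if it reappears in data that played no role in finding it.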
smoothing global data patterns background knowledge, however, might suggest that one should focus on global data patterns beforehand instead of analysing local patterns and maybe aggregating these into global claims later (hand, 2007). with such knowledge one may decide to summarize observed variables into latent variables before running an analysis, for example by fitting a structural equation model. or, some entities are known to be more similar than others along a dimension, for example genomic loci along the dna strand and brain activation or body cells along their two-dimensional spatial distance. then it is possible to arrange the observations accordingly and to smooth the data with statistical methods (farcomeni & greco, 2016). smoothing aims to reduce the variation along the dimension, because otherwise every single point along the dimension is subject to individually occurring random error, potentially hiding the overall pattern of interest or “latent structure”. such smoothing serves to “clean up” the data in the first place (greenland, 2006). consider the example of epigenetic responses to stress exposure across genomic loci. one may explore the variation of the response locally, locus by locus, and thus allow it to freely vary. this preserves all the patterns in the data, but many of those will just be noise, the result of random error. the background knowledge that two gene loci are more associated with an outcome the closer they are spatially is then ignored (jaffe et al., 2012). figure 3 shows a fictive example, in which the outcome y, stress response, varies along the genomic loci’s relative spatial location x (for illustration, one-dimensional and scaled from 0 to 100). the red line displays how y truly varies across location x according to the function y = sin(sqrt(x)) * 10*x. we assume that other factors contribute to y through a normally distributed error with expectation = 0 and standard deviation = 500.
for smoothing, we use polynomial splines, a technique of non-parametric regression (takezawa, 2005) that controls the extent of smoothing through the degree of a polynomial (command twoway lpoly in stata 15.2, statacorp, 2017). figure 3a shows the random pattern that emerges if no smoothing is done and only local patterns are investigated. here, the patterns are the spikes that represent outcome values. the height of each spike is a separately estimated parameter. these many estimates (here 50) may poorly carry over to new data, that is, overfitting is likely. the spikes might be used to generate a set of individual hypotheses while neglecting the spatial dependency. with luck, such a dependency might emerge with spikes fairly close to the true structure. we suspect, however, that such luck will rarely occur. with moderate smoothing (figure 3b) we are able to identify a rough course and might hypothesise that the outcome is highest if x ranges between 40 and 70. stronger smoothing (figure 3c) results in a fairly good fit to the true function and allows further hypothesising a local minimum around x = 25. however, if too much smoothing is applied (figure 3d, linear approximation), underfitting occurs: the latent structure cannot be described with only two parameters, and core features like the peak in the range of 40-70 are overlooked. (figure 3: the figure illustrates fictive data where an outcome y varies across genomic loci (x) along a spatial position. the true y-x relation (red line) equals y = sin(sqrt(x)) * 10*x, where deviations from it arise from random error in a sample of n = 50 (normally distributed with expectation = 0 and standard deviation = 500). plot (a) shows the results (blue peaks) if no smoothing is done; plots (b) through (d) apply different levels of smoothing, from insufficient smoothing (b) and adequate smoothing (c) to over-smoothing (d).)
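the figure 3 simulation can be re-created in outline (a sketch following the data-generating details in the text, but using numpy polynomial fits in place of stata's lpoly local-polynomial smoother; here the polynomial degree plays the role of the smoothing parameter, and the chosen degrees are illustrative):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(7)

# fictive data as in figure 3: n = 50 loci, true structure plus heavy noise
x = np.linspace(0, 100, 50)
y_true = np.sin(np.sqrt(x)) * 10 * x
y = y_true + rng.normal(0, 500, size=x.size)

def fit_error(degree):
    """mean absolute distance between the smoothed curve and the true curve."""
    smooth = Polynomial.fit(x, y, degree)  # degree acts as the smoothing knob
    return float(np.mean(np.abs(smooth(x) - y_true)))

# no smoothing: every locus is its own estimate, so the error is the raw noise
raw_error = float(np.mean(np.abs(y - y_true)))

# linear (over-smoothed), moderate, and flexible fits
errors = {degree: fit_error(degree) for degree in (1, 5, 10)}
print(raw_error, errors)
```

comparing the errors across degrees mirrors the figure: too little smoothing reproduces the noise, too much smoothing (degree 1) misses the latent structure, and an intermediate amount recovers the true course best.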
smoothing may reveal a new hypothesis like “stress response has its genetic basis in the range 40-70” or only be an intermediate step (greenland, 2006), with the smoothed structure (blue curve in the example) further analysed, e.g., in relation to factors that might influence stress response across location. for example, particularly high peaks in the 40-70 range in individuals with negative childhood events might indicate that genes in this range are activated more strongly in these individuals. several methods to smooth psychological data are common, albeit not under the idea of smoothing. table 1 summarises some of them and lists their “smoothing parameters” that regulate the degree of smoothing. much elaboration, however, is required for sound guidance on how to apply such methods for exploration: whether they do the right smoothing to the appropriate extent to efficiently stimulate new claims in a specific research domain. as general advice, the more a field is already understood, the more the data may be smoothed. table 1: some methods for statistical smoothing, their search spaces and the parameters via which they smooth. (1) non-parametric regression; search space: functions that describe an x-y association or the associations of several x with y; smoothing parameter(s): e.g., the degree of a polynomial (local polynomial smoothing). (2) regularisation methods in regression with many predictors (lasso, elastic net regression, etc.); search space: estimates of regression parameters; smoothing parameter(s): e.g., the sum of the regression coefficients, besides the intercept (lasso). (3) exploratory factor analysis; search space: latent dimensions and their loadings on observed items; smoothing parameter(s): number of latent dimensions and choice of rotation method. (4) cluster analysis, latent mixture models; search space: possible clusters of individuals that are homogenous within but heterogenous between; smoothing parameter(s): number of clusters. (5) canonical correlation analysis; search space: linear combinations of factors and outcomes; smoothing parameter(s): number of latent dimensions behind a set of factors and number of latent dimensions behind a set of outcomes. planning exploration and transparency on how one has explored after outlining the goals of comprehensiveness and efficiency and some basic ideas on how to explore data, we are equipped to discuss the possibility of planning an explorative quest. we suggest that, if the sample size does not allow for internal cross-validation, a well-underpinned plan may render data exploration more efficient in identifying patterns of interest. the argument is that a planned exploration may be more focused and therefore require less analysis. identified patterns might in turn be supported by more initial evidence (höfler et al., 2022; simmons et al., 2011). note that this is a heuristic argument, because severity depends on what exact analyses are conducted. however, the following strict statement can be made: the severity with which a pattern has been “pre-tested” becomes smaller if additional statistical tests are conducted (α becomes greater with each additional test) or any additional filtering has been done. if there is a plan, it should be transparent, that is, made public, to enable researchers to “take credit” for it (wagenmakers & dutilh, 2016) when publishing the results that it generates. also, without a plan, we suggest that transparency beyond the obligatory distinction between exploration and confirmation is crucial for scientific communication.
a lack of transparency about how the data have been explored could hide exploratory steps. readers may then be misled about how promising confirmation attempts are. to give an extreme example, an association might appear present or absent, small or large, positive or negative, simply through the choice of a narrowly defined subsample in which a relation might be claimed (e.g., vul et al., 2009). if many subsamples have been tried, it may be unlikely that the association will be found again with new data. box 3 summarises how some measures inform about what has actually been done and how much initial evidence there is.

box 3: transparency measures for what and how much exploration has been conducted

• preregistration: preregistering the mere intention to explore counteracts later false claims that results actually obtained by exploration were confirmatory (a point we heard from eric-jan wagenmakers in a 2020 talk). if there is a plan for how to explore, it should also be preregistered so that one can later show that one had this plan. changes to a plan might become necessary for various reasons when enjoying the dynamics of digging into data. these can be transparently recorded with an audit trail (version management) system such as git (chacon & straub, 2014).

• open data: access to data (isbell, 2021), preferably the raw data (arribas-bel et al., 2021; nikiforova, 2020; wilkinson et al., 2016), allows researchers to reproduce found patterns, or to reproduce problems in the data that have made changes to a plan necessary. researchers can also try their own analyses to see whether they arrive at the same result (shahin et al., 2020).
if such an analysis (preferably done by independent re-analysers, e.g., silberzahn et al., 2018) identifies the same pattern, the initial evidence for the claim is larger, because the alternative analysis might have failed to identify it (e.g., an association that is also found with a statistical method that is more robust against irregularities in the data, such as non-normally distributed residuals; field & wilcox, 2017). to ease open access to data, several publishers have recently started to offer purpose-designed, peer-reviewed, and citable journal contribution templates that allow for the publication of data sets (a collection is provided by “data journals – forschungsdaten.org,” 2022).

• open analysis: open analysis creates transparency about which analyses have actually been done, through access to the complete syntax used and the results it has generated (van dijk et al., 2021). together with open data, it serves to reproduce a whole explorative quest. automatic documentation ensures that no analyses with possibly unfavourable results are concealed. powerful software packages have been developed that store an entire analytical workflow (peikert & brandmaier, 2021; van lissa et al., 2020; wratten et al., 2021), as well as notebooks customised for this purpose (beg et al., 2021).

further research agenda for exploration: where to explore, what and how to explore

the following proposals summarise the ideas from the first (höfler et al., 2022) and this second article on exploration. they are likely to yield highly context-dependent answers and are therefore intended for separate consideration across the many fields and research quests of psychology. beyond these, we invite researchers to probe the conceptions of this paper with their own explorative quests. this opens what is probably the most promising avenue for refinement.

1. reconsider established evidential norms for confirmation.
passing more severe tests may be required for a claim to move from a new claim to an established claim. in particular, specify which alternative explanations (e.g., bias) must be probed against.

2. identify hypotheses, models and theories that might benefit from exploring around them.

3. identify little-understood domains. specify where key features might be found, and with which methods of measuring and analysing.

4. determine what new data should be collected or what existing data should be explored. what are promising search spaces? are global or local claims of greater interest?

5. identify gaps in theories that should be filled by exploration for them to be complete and severely testable. also identify poorly probed components of theories that might benefit from explorative quests for modifications.

6. methodically elaborate on the efficiency of methods of filtering and smoothing. consider the applicability of exploratory methods used outside psychology, for example those recommended for exploring the huge amount of medical data in the uk biobank (2022). formal elaboration of these concepts should be helpful: mathematical rigour probes for stringency and may indicate needs for change and gaps that have to be filled.

7. use open science measures for transparency about how exploration was carried out and how much initial evidence may exist for identified patterns.

recommendations to stakeholders

we end with a list of recommendations for stakeholders, including journal editors, peer reviewers and funding agencies. these three groups have the largest means for change if they cooperate in addressing the following points. in general, we suggest that funding agencies should provide financial incentives for explorative quests, public repositories and methodical elaboration. editors should offer space and define rules that promote transparent exploration of high quality. reviewers should check these issues.
open review seems preferable, because it creates transparency in the control process. specific recommendations are:

1. mandate the separation between tested versus new hypotheses (gigerenzer, 2018), listed already in the abstract of an article.

2. create new journal sections for exploration papers and reserve space for them (mcintosh, 2017; thompson et al., 2020). perhaps fund entire exploration journals, as the publisher open exploration did with its four medical journals (open exploration publishing, 2021).

3. use editorials to point out gaps in theories (lakens, 2019) that could be filled by exploration (woo et al., 2017).

4. “place exploratory analyses (regardless of the outcome) on citable public repositories” (pennycook, 2018). funding agencies are asked to create more space and to fund corresponding studies to inform other researchers (thompson et al., 2020) with results suitable to test or feed theories (greenland et al., 2004).

5. the received wisdom that every publication must have an introduction and a discussion section may be questioned. a purely exploratory publication, for example on a range of somewhat plausible potential risk factors for a disease, does not necessarily require an introduction (it would merely list weak justifications and have little space to describe the theoretical background for analysing each of the many investigated factors). the same applies to the discussion section: a deeper discussion may be better placed in a paper format that discusses the results from several studies and their impact on theory building, interventions and public health (greenland et al., 2004). publications on only grossly justified observational data with association results (e.g., short-term planned covid-19 research) appear most useful if they simply describe the methods and report the results (greenland et al., 2004).
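before concluding, the subsample-picking hazard noted earlier (cf. vul et al., 2009) can be demonstrated with a small simulation: even when two variables are unrelated, trying enough small subsamples almost always yields one with a seemingly strong correlation. a sketch using only the python standard library (the sample sizes, seed, and the helper pearson_r are our illustrative choices, not part of the article):

```python
import random

def pearson_r(xs, ys):
    """pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)
x = [random.gauss(0, 1) for _ in range(200)]
y = [random.gauss(0, 1) for _ in range(200)]  # generated independently of x

full_r = pearson_r(x, y)  # near zero: the variables are pure noise

# try many small subsamples and keep the most "impressive" one
best_r = 0.0
for _ in range(1000):
    idx = random.sample(range(200), 15)
    r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
    if abs(r) > abs(best_r):
        best_r = r

print(round(full_r, 2), round(best_r, 2))
```

with the data being pure noise, the full-sample correlation stays near zero, while the best of 1,000 subsamples of 15 cases shows a sizeable association that would be unlikely to reappear in new data: exactly the scenario that transparent reporting of all exploratory steps guards against.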
conclusion

science has been argued to have made its biggest discoveries through chance (gaughan, 2010; roberts, 1989), but perhaps chance can be prompted by providing scientists with the means for valuable exploration. psychology seems to have a particularly large potential here. scientific communication, too, could benefit greatly from treating exploratory findings not as established knowledge, but as pure suggestions on the rocky path from data to truth: invitations to walk on without knowing where one will arrive. teaching some basic insights, such as how valuable exploration and true confirmation benefit from one another, might help, at least in the long run, once those who are now taught are ready to conceptualise their own studies. probably almost every reader has been taught statistics and methods with a nearly exclusive focus on confirmation. once a new generation of two-trail scientists emerges, it might come up with powerful ways of cooperative exploration that our generation is incapable of imagining because of our confirmatory priming. we wish to conclude with the admittedly emotional remark that the necessity of writing these two articles on the value of exploration in science has felt somewhat strange: its value should be self-evident, and that self-evidence should be reason enough to engage in strict confirmation and transparent exploration and, in turn, to look forward to a science that we believe would thus be enriched.

author contact

michael höfler, chemnitzer straße 46, clinical psychology and behavioural neuroscience, institute of clinical psychology and psychotherapy, technische universität dresden, 01187 dresden, germany. michael.hoefler@tu-dresden.de, +49 351 463 36921. orcid: https://orcid.org/0000-0001-7646-8265

acknowledgements

we thank annekathrin rätsch for help with the references.

conflict of interest and funding

robert miller is an employee of pfizer pharma gmbh.
the authors declare that there were no conflicts of interest with respect to the authorship or the publication of this article. philipp kanske is supported by the german research foundation (ka4412/2-1, ka4412/4-1, ka4412/5-1, ka4412/9-1, crc940/c07).

author contributions

michael höfler had the lead in developing the conceptions and in the writing. brennan mcdonald contributed epistemic details and was involved in the writing and wording of the entire manuscript. philipp kanske commented on and edited the manuscript. robert miller contributed methodological aspects and reviewed and edited the manuscript.

open science practices

this article earned the open materials badge for making the materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

adjerid, i., & kelley, k. (2018). big data in psychology: a framework for research advancement. american psychologist, 73(7), 899–917. https://doi.org/10.1037/amp0000190
alonso, s. g., de la torre-díez, i., hamrioui, s., lópez-coronado, m., calvo barreno, d., morón nozaleda, l., & franco, m. (2018). data mining algorithms and techniques in mental health: a planned review. journal of medical systems, 42, 161. https://doi.org/10.1007/s10916-018-1018-2
arribas-bel, d., green, m., rowe, f., & singleton, a. (2021). open data products: a framework for creating valuable analysis-ready data. journal of geographical systems, 23, 497–514. https://doi.org/10.1007/s10109-021-00363-5
beg, m., taka, j., kluyver, t., konovalov, a., ragan-kelley, m., thiery, n. m., & fangohr, h. (2021). using jupyter for reproducible scientific workflows. computing in science & engineering, 23(2), 36–46. https://doi.org/10.1109/mcse.2021.3052101
behrens, j. t. (1997). principles and procedures of exploratory data analysis.
psychological methods, 2(2), 131–160. https://doi.org/10.1037/1082-989x.2.2.131
bender, r., & lange, s. (2001). adjusting for multiple testing: when and how? journal of clinical epidemiology, 54(4), 343–349. https://doi.org/10.1016/s0895-4356(00)00314-0
bogen, j., & woodward, j. (1988). saving the phenomena. philosophical review, 97, 303–352.
box, g. e. p. (1976). science and statistics. journal of the american statistical association, 71(356), 791–799. https://doi.org/10.1080/01621459.1976.10480949
box, g. e. p. (1980). sampling and bayes inference in scientific modelling and robustness (with discussion and rejoinder). journal of the royal statistical society, series a, 143, 383–430.
bringmann, l. f., elmer, t., & eronen, m. i. (2022). back to basics: the importance of conceptual clarification in psychological science. current directions in psychological science, 31(4), 340–346. https://doi.org/10.1177/09637214221096485
bzdok, d., & ioannidis, j. p. a. (2019). exploration, inference, and prediction in neuroscience and biomedicine. trends in neurosciences, 42(4), 251–262. https://doi.org/10.1016/j.tins.2019.02.001
chacon, s., & straub, b. (2014). pro git. apress.
colquhoun, d. (2014). an investigation of the false discovery rate and the misinterpretation of p-values. royal society open science, 1, 140216. https://doi.org/10.1098/rsos.140216
data journals – forschungsdaten.org. (2022).
de rooij, m., & weeda, w. (2020). cross-validation: a method every psychologist should know. advances in methods and practices in psychological science, 3(2), 248–263. https://doi.org/10.1177/2515245919898466
del giudice, m., & gangestad, s. w. (2021). a traveler’s guide to the multiverse: promises, pitfalls, and a framework for the evaluation of analytic decisions. advances in methods and practices in psychological science, 4(1). https://doi.org/10.1177/2515245920954925
dirnagl, u. (2020).
preregistration of exploratory research: learning from the golden age of discovery. plos biology, 18(3), e3000690. https://doi.org/10.1371/journal.pbio.3000690
elhai, j. d., & montag, c. (2020). the compatibility of theoretical frameworks with machine learning analyses in psychological research. current opinion in psychology, 36, 83–88. https://doi.org/10.1016/j.copsyc.2020.05.002
erceg-hurn, d. m., & mirosevich, v. m. (2008). modern robust statistical methods: an easy way to maximize the accuracy and power of your research. american psychologist, 63(7), 591–601. https://doi.org/10.1037/0003-066x.63.7.591
eronen, m. i., & bringmann, l. f. (2021). the theory crisis in psychology: how to move forward. perspectives on psychological science, 16(4), 779–788.
farcomeni, a., & greco, l. (2016). robust methods for data reduction. chapman & hall/crc. https://doi.org/10.1201/b18358
fiedler, k. (2017). what constitutes strong psychological science? the (neglected) role of diagnosticity and a priori theorizing. perspectives on psychological science, 12(1), 46–61. https://doi.org/10.1177/1745691616654458
field, a. p., & wilcox, r. r. (2017). robust statistical methods: a primer for clinical psychology and experimental psychopathology researchers. behaviour research and therapy, 98, 19–38. https://doi.org/10.1016/j.brat.2017.05.013
fleming, j. i., wilson, s. e., hart, s. a., therrien, w. j., & cook, b. g. (2021). open accessibility in education research: enhancing the credibility, equity, impact, and efficiency of research. educational psychologist, 56(2), 110–121. https://doi.org/10.1080/00461520.2021.1897593
gaughan, r. (2010). accidental genius: the world’s greatest by-chance discoveries. metro books.
gelman, a., haig, b., hennig, c., owen, a., cousins, r., young, s., robert, c., yanofsky, c., wagenmakers, e. j., kenett, r., & lakeland, d. (2019). many perspectives on deborah mayo’s “statistical inference as severe testing: how to get beyond the statistics wars”. retrieved november 2, 2021, from http://www.stat.columbia.edu/~gelman/research/unpublished/mayo_reviews_2.pdf
gigerenzer, g. (2010). personal reflections on theory and psychology. theory & psychology, 20(6), 733–743. https://doi.org/10.1177/0959354310378184
gigerenzer, g. (2018). statistical rituals: the replication delusion and how we got there. advances in methods and practices in psychological science, 1(2), 198–218. https://doi.org/10.1177/2515245918771329
glymour, c., zhang, k., & spirtes, p. (2019). review of causal discovery methods based on graphical models. frontiers in genetics, 10, 524. https://doi.org/10.3389/fgene.2019.00524
greenland, s. (2006).
smoothing observational data: a philosophy and implementation for the health sciences. international statistical review, 74, 31–46. https://doi.org/10.1111/j.1751-5823.2006.tb00159.x
greenland, s. (2017). invited commentary: the need for cognitive science in methodology. american journal of epidemiology, 186(6), 639–645. https://doi.org/10.1093/aje/kwx259
greenland, s., gago-dominguez, m., & castelao, j. e. (2004). the value of risk-factor ("black-box") epidemiology. epidemiology, 15(5), 529–535. https://doi.org/10.1097/01.ede.0000134867.12896.23
hand, d. j. (2007). principles of data mining. drug safety, 30(7), 621–622. https://doi.org/10.2165/00002018-200730070-00010
hernán, m. a. (2018). the c-word: scientific euphemisms do not improve causal inference from observational data. american journal of public health, 108(5), 616–619. https://doi.org/10.2105/ajph.2018.304337
höfler, m., scherbaum, s., kanske, p., mcdonald, b., & miller, r. (2022). means to valuable exploration i. the blending of confirmation and exploration and how to resolve it. meta-psychology, 2(6). https://doi.org/10.15626/mp.2021.2837
höfler, m., trautmann, s., & kanske, p. (2021). qualitative approximations to causality: non-randomizable factors in clinical psychology. clinical psychology in europe, 3(2), e3873. https://doi.org/10.32872/cpe.3873
hofstadter, d. r., & dennett, d. c. (1981). the mind’s i: fantasies and reflections on self and soul. basic books.
hollenbeck, j. r., & wright, p. m. (2017). harking, sharking, and tharking: making the case for post hoc analysis of scientific data. journal of management, 43(1), 5–18. https://doi.org/10.1177/0149206316679487
huber, p. j. (1981). robust statistics. john wiley & sons, inc.
newman, i., & benz, c. r. (1998). qualitative-quantitative research methodology: exploring the interactive continuum. southern illinois university press.
isbell, d. r. (2021). open science, data analysis, and data sharing.
open science framework preprint. https://doi.org/10.31219/osf.io/pdj9y
jaffe, a. e., murakami, p., lee, h., leek, j. t., fallin, m. d., feinberg, a. p., & irizarry, r. a. (2012). bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. international journal of epidemiology, 41(1), 200–209. https://doi.org/10.1093/ije/dyr238
job, v., dweck, c. s., & walton, g. m. (2010). ego depletion: is it all in your head? implicit theories about willpower affect self-regulation. psychological science, 21(11), 1686–1693. https://doi.org/10.1177/0956797610384745
jorgensen, t. j., ruczinski, i., kessing, b., smith, m. w., shugart, y. y., & alberg, a. j. (2009). hypothesis-driven candidate gene association studies: practical design and analytical considerations. american journal of epidemiology, 170(8), 986–993. https://doi.org/10.1093/aje/kwp242
kassis, a., & papps, f. a. (2020). integrating complementary and alternative therapies into professional psychological practice: an exploration of practitioners’ perceptions of benefits and barriers. complementary therapies in clinical practice, 41, 101238. https://doi.org/10.1016/j.ctcp.2020.101238
kessler, r. c., & merikangas, k. r. (2004). the national comorbidity survey replication (ncs-r): background and aims. international journal of methods in psychiatric research, 13(2), 60–68. https://doi.org/10.1002/mpr.166
kirk, r. e. (1996). practical significance: a concept whose time has come. educational and psychological measurement, 56(5), 746–759. https://doi.org/10.1177/0013164496056005002
könig, i. r. (2011). validation in genetic association studies. briefings in bioinformatics, 12(3), 253–258. https://doi.org/10.1093/bib/bbq074
lakatos, i. (1977). the methodology of scientific research programmes: philosophical papers volume 1. cambridge university press.
lakens, d. (2019). the value of preregistration for psychological science: a conceptual analysis. psyarxiv preprint. https://doi.org/10.31234/osf.io/jbh4w
linke, j. o., abend, r., kircanski, k., clayton, m., stavish, c., et al. (2021). shared and anxiety-specific pediatric psychopathology dimensions manifest distributed neural correlates. biological psychiatry, 89(6), 579–587. https://doi.org/10.1016/j.biopsych.2020.10.018
lurquin, j. h., & miyake, a. (2017).
challenges to ego-depletion research go beyond the replication crisis: a need for tackling the conceptual crisis. frontiers in psychology, 8, 568. https://doi.org/10.3389/fpsyg.2017.00568
manuti, a., & giancaspro, m. l. (2019). people make the difference: an explorative study on the relationship between organizational practices, employees’ resources, and organizational behavior enhancing the psychology of sustainability and sustainable development. sustainability, 11, 1499. https://doi.org/10.3390/su11051499
marigorta, u. m., rodríguez, j. a., gibson, g., & navarro, a. (2018). replicability and prediction: lessons and challenges from gwas. trends in genetics, 34(7), 504–517. https://doi.org/10.1016/j.tig.2018.03.005
martins, l. b., braga tibães, j. r., sanches, m., jacka, f., berk, m., & teixeira, a. l. (2021). nutrition-based interventions for mood disorders. expert review of neurotherapeutics, 21(3), 303–315. https://doi.org/10.1080/14737175.2021.1881482
masur, p. k., & scharkow, m. (2020). specr: conducting and visualizing specification curve analyses.
mayo, d. g. (2018). statistical inference as severe testing: how to get beyond the statistics wars. cambridge university press. https://doi.org/10.1017/9781107286184
mcintosh, r. d. (2017). exploratory reports: a new article type for cortex. cortex, 96, a1–a4. https://doi.org/10.1016/j.cortex.2017.07.014
moghaddam, f. m. (2004). from ‘psychology in literature’ to ‘psychology is literature’: an exploration of boundaries and relationships. theory & psychology, 14(4), 505–525. https://doi.org/10.1177/0959354304044922
nguyen, s. h. (2000). regularity analysis and its applications in data mining. in l. polkowski, s. tsumoto, & t. y. lin (eds.), rough set methods and applications (pp. 289–378). physica-verlag hd. https://doi.org/10.1007/978-3-7908-1840-6_7
nikiforova, a. (2020).
comparative analysis of national open data portals or whether your portal is ready to bring benefits from open data. iadis international conference on ict, society and human beings.
nosek, b. a., ebersole, c. r., dehaven, a. c., & mellor, d. t. (2018). the preregistration revolution. proceedings of the national academy of sciences of the united states of america, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114
parvandeh, s., yeh, h. w., paulus, m. p., & mckinney, b. a. (2020). consensus features nested cross-validation. bioinformatics, 36(10), 3093–3098. https://doi.org/10.1093/bioinformatics/btaa046
peikert, a., & brandmaier, a. m. (2021). a reproducible data analysis workflow with r markdown, git, make, and docker. quantitative and computational methods in behavioral sciences, 1, e3763. https://doi.org/10.5964/qcmb.3763
pennycook, g. (2018). you are not your data. behavioral and brain sciences, 41. https://doi.org/10.1017/s0140525x1800081x
popper, k. (1959). the logic of scientific discovery. basic books.
open exploration publishing. (2021). https://www.explorationpub.com [accessed: 2021-01-13].
roberts, r. m. (1989). serendipity: accidental discoveries in science. john wiley & sons, inc.
romero, c., & ventura, s. (2020). educational data mining and learning analytics: an updated survey. wires data mining and knowledge discovery, 10(3). https://doi.org/10.1002/widm.1355
rubin, m., & donkin, c. (2022). exploratory hypothesis tests can be more compelling than confirmatory hypothesis tests. philosophical psychology. https://doi.org/10.1080/09515089.2022.2113771
ryan, o., bringmann, l. f., & schuurman, n. k. (2019). the challenge of generating causal hypotheses using network models [preprint]. psyarxiv. https://doi.org/10.31234/osf.io/ryg69
scheel, a. m. (2021). why most psychological research findings are not even wrong [preprint]. psyarxiv. https://doi.org/10.31234/osf.io/8w2sd
schimmack, u. (2021). the implicit association test: a method in search of a construct. perspectives on psychological science, 16(2), 396–414. https://doi.org/10.1177/1745691619863798
shahin, m. h., bhattacharya, s., silva, d., kim, s., burton, j., podichetty, j., romero, k., & conrado, d. j. (2020). open data revolution in clinical research: opportunities and challenges. clinical and translational science, 13(4), 665–674. https://doi.org/10.1111/cts.12756
silberzahn, r., uhlmann, e. l., martin, d. p., anselmi, p., aust, f., awtrey, e., et al. (2018).
many analysts, one data set: making transparent how variations in analytic choices affect results. advances in methods and practices in psychological science, 1(3), 337–356. https://doi.org/10.1177/2515245917747646
simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
simonsohn, u., simmons, j. p., & nelson, l. d. (2020). specification curve analysis. nature human behaviour. https://doi.org/10.1038/s41562-020-0912-z
skrondal, a., & rabe-hesketh, s. (2004). generalized latent variable modeling: multilevel, longitudinal, and structural equation models. chapman & hall/crc.
sohmer, o. r. (2020). an exploration of the value of cooperative inquiry for transpersonal psychology, education, and research: a theoretical and qualitative inquiry (doctoral dissertation). california institute of integral studies. https://search.proquest.com/docview/2464456670
stebbins, r. a. (1992). concatenated exploration: notes on a neglected type of longitudinal research. quality & quantity, 26, 435–442. https://doi.org/10.1007/bf00170454
stebbins, r. a. (2001). exploratory research in the social sciences. sage publications, inc. https://doi.org/10.4135/9781412984249
stebbins, r. a. (2006). concatenated exploration: aiding theoretic memory by planning well for the future. journal of contemporary ethnography, 35(5), 483–494. https://doi.org/10.1177/0891241606286989
steegen, s., tuerlinckx, f., gelman, a., & vanpaemel, w. (2016). increasing transparency through a multiverse analysis. perspectives on psychological science, 11(5), 702–712. https://doi.org/10.1177/1745691616658637
suppes, p. (1969). models of data. in e. nagel, p. suppes, & a. tarski (eds.), logic, methodology, and philosophy of science: proceedings of the 1960 international congress (pp.
252–261). stanford university press. swedberg, r. (2018). on the uses of exploratory research and exploratory [retrieved october 14, 2020]. szollosi, a., & donkin, c. (2021). arrested theory development: the misguided distinction between exploratory and confirmatory research. perspectives on psychological science, 16, 717–724. https://doi.org/10.1177/1745691620966796 takezawa, k. (2005). introduction to nonparametric regression. john wiley & sons. https://doi.org/ 10.1002/0471771457 thompson, w. h., wright, j., & bissett, p. g. (2020). point of view: open exploration. elife, 9. https: //doi.org/10.7554/elife.52157 van lissa, c. j., brandmaier, a. m., brinkman, l., lamprecht, a.-l., peikert, a., struiksma, m. e., & vreede, b. (2020). worcs: a workflow for open reproducible code in science. data scihttps://doi.org/10.1017/s0140525x1800081x https://doi.org/10.1017/s0140525x1800081x https://www.explorationpub.com https://www.explorationpub.com https://doi.org/10.1002/widm.1355 https://doi.org/10.1002/widm.1355 https://doi.org/10.1080/09515089.2022.2113771 https://doi.org/10.1080/09515089.2022.2113771 https://doi.org/10.31234/osf.io/ryg69 https://doi.org/10.31234/osf.io/8w2sd https://doi.org/10.31234/osf.io/8w2sd https://doi.org/10.1177/1745691619863798 https://doi.org/10.1177/1745691619863798 https://doi.org/10.1111/cts.12756 https://doi.org/10.1177/2515245917747646 https://doi.org/10.1177/2515245917747646 https://doi.org/10.1177/0956797611417632 https://doi.org/10.1177/0956797611417632 https://doi.org/10.1038/s41562-020-0912-z https://doi.org/10.1038/s41562-020-0912-z https://search.proquest.com/docview/2464456670 https://search.proquest.com/docview/2464456670 https://doi.org/10.1007/bf00170454 https://doi.org/10.1007/bf00170454 https://doi.org/10.4135/9781412984249 https://doi.org/10.4135/9781412984249 https://doi.org/10.1177/0891241606286989 https://doi.org/10.1177/0891241606286989 https://doi.org/10.1177/1745691616658637 https://doi.org/10.1177/1745691616658637 
https://doi.org/10.1177/1745691620966796 https://doi.org/10.1002/0471771457 https://doi.org/10.1002/0471771457 https://doi.org/10.7554/elife.52157 https://doi.org/10.7554/elife.52157 15 ence, 4(1), 29–49. https : / / doi . org / 10 . 3233 / ds-210031 van dijk, w., schatschneider, c., & hart, s. a. (2021). open science in education sciences. journal of learning disabilities, 54(2), 139–152. https:// doi.org/10.1177/0022219420945267 vul, e., harris, c., winkielman, p., & pashler, h. (2009). puzzlingly high correlations in fmri studies of emotion, personality, and social cognition. perspectives on psychological science, 4(3), 274– 290. https : / / doi . org / 10 . 1111 / j . 1745 6924 . 2009.01125.x wagenmakers, e.-j., & dutilh, g. (2016). seven selfish reasons for preregistration. aps observer, 29(9). https : / / www . psychologicalscience . org / observer / seven selfish reasons for preregistration wilcox, r. r. (2012). introduction to robust estimation and hypothesis testing. academic press. wilkinson, m. d., dumontier, m., aalbersberg, i. j., appleton, g., axton, m., baak, a., & et al. (2016). the fair guiding principles for scientific data management and stewardship. scientific data, 3(1), 160018. https://doi.org/10.1038/sdata. 2016.18 williams, m. n. (2021). levels of measurement and statistical analyses. meta-psychology, 5. https : / / doi.org/10.15626/mp.2019.1916 woo, s. e., o’boyle, e. h., & spector, p. e. (2017). best practices in developing, conducting, and evaluating inductive research [editorial]. human resource management review, 27(2), 255–264. https://doi.org/10.1016/j.hrmr.2016.08.004 wratten, l., wilm, a., & göke, j. (2021). reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. nature methods, 18, 1161–1168. https : //doi.org/10.1038/s41592-021-01254-9 xiong, z., chen, y., li, z., & zhao, y. (2020). 
meta-psychology, 2020, vol 4, mp.2018.874
https://doi.org/10.15626/mp.2018.874
article type: original article
published under the cc-by4.0 license
open data: n/a
open materials: yes
open and reproducible analysis: yes
open reviews and editorial process: yes
preregistration: n/a
edited by: marcel van assen
reviewed by: stephen martin, jack davis, donald williams, daniël lakens and rink hoekstra
analysis reproduced by: erin buchanan
all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/peumw

estimating population mean power under conditions of heterogeneity and selection for significance

jerry brunner and ulrich schimmack
university of toronto mississauga

abstract

in scientific fields that use significance tests, statistical power is important for successful replications of significant results because it is the long-run success rate in a
series of exact replication studies. for any population of significant results, there is a population of power values of the statistical tests on which conclusions are based. we give exact theoretical results showing how selection for significance affects the distribution of statistical power in a heterogeneous population of significance tests. in a set of large-scale simulation studies, we compare four methods for estimating population mean power of a set of studies selected for significance (a maximum likelihood model, extensions of p-curve and p-uniform, & z-curve). the p-uniform and p-curve methods performed well with a fixed effect size and varying sample sizes. however, when there was substantial variability in effect sizes as well as sample sizes, both methods systematically overestimated mean power. with heterogeneity in effect sizes, the maximum likelihood model produced the most accurate estimates when the distribution of effect sizes matched the assumptions of the model, but z-curve produced more accurate estimates when the assumptions of the maximum likelihood model were not met. we recommend the use of z-curve to estimate the typical power of significant results, which has implications for the replicability of significant results in psychology journals.

keywords: power estimation, post-hoc power analysis, publication bias, maximum likelihood, z-curve, p-curve, p-uniform, effect size, replicability, meta-analysis

the purpose of this paper is to develop and evaluate methods for predicting the success rate if sets of significant results were replicated exactly. we call this statistical property the average power of a set of studies. average power can range from the criterion for a type-i error, if all significant results are false positives, to 100%, if the statistical power of original studies approaches 1. average power can be used to quantify the degree of evidential value in a set of studies (simonsohn et al., 2014b).
in the end, we estimate the mean power of studies that were used to examine the replicability of psychological research, and compare the results to actual replication outcomes (open science collaboration, 2015). estimating average power of original studies is interesting because it is tightly connected with the outcome of replication studies (greenwald et al., 1996; yuan & maxwell, 2005). to claim that a finding has been replicated, a replication study should reproduce a statistically significant result, and the probability of a successful replication is a function of statistical power. thus, if reproducibility is a requirement of good science (bunge, 1998; popper, 1959), it follows that high statistical power is a necessary condition for good science. information about the average power of studies is also useful because selection for significance increases the type-i error rate and inflates effect sizes (ioannidis, 2008). however, these biases are relatively small if the original studies had high power. thus, knowledge about the average power of studies is useful for the planning of future studies. if average power is high, replication studies can use the same sample sizes as original studies, but if average power is low, sample sizes need to be increased to avoid false negative results. given the practical importance of power for good science, it is not surprising that psychologists have started to examine the evidential value of results published in psychology journals. at present, two statistical methods have been used to make claims about the average power of psychological research; namely p-curve (simonsohn et al., 2017) and z-curve (schimmack, 2015, 2018a), but so far neither method has been peer-reviewed.

statistical power before and after a study has been conducted

before we proceed, we would like to clarify that statistical power of a statistical test is defined as the probability of correctly rejecting the null hypothesis (neyman & pearson, 1933).
this probability depends on the sampling error of a study and the population effect size. the traditional definition of power does not consider effect sizes of zero (false positives) because the goal of a priori power planning is to ensure that a non-zero effect can be demonstrated. however, our goal is not to plan future studies, but to analyze results of existing studies. for post-hoc power analysis, it is impossible to distinguish between true positives and false positives and to estimate the average power conditional on the unknown status of hypotheses (i.e., the null-hypothesis is true or false). thus, we use the term average power as the probability of correctly or incorrectly rejecting the null-hypothesis (sterling et al., 1995). this definition of average power includes an unknown percentage of false positives that have a probability equal to alpha (typically 5%) of reproducing a significant result in a replication attempt. at the same time, we believe that the strict null-hypothesis is rarely true in psychological research (cohen, 1994). it would be ideal if it were possible to estimate the power of a single statistical test that supports a particular finding. unfortunately, well-documented problems with the “observed power” method suggest that the goal of estimating the power of an individual test may be out of reach (boos & stefanski, 2012; hoenig & heisey, 2001). often the main problem is that estimates for a single result are too variable to be practically useful (yuan & maxwell, 2005; but also see anderson, kelley, & maxwell, 2017). it is important to distinguish our undertaking from that of cohen (1962) and the follow-up studies by chase and chase (1976) and sedlmeier and gigerenzer (1989). in cohen's classic survey of power in the journal of abnormal and social psychology, the results of the studies were not used in any way. power was never estimated. it was calculated exactly for a priori effect sizes deemed “small,” “medium,” and “large.”
even if a “medium” effect size referred to the population mean (which cohen never claimed), power at the mean effect size is still not the same as mean power. in contrast, we aim to estimate the mean power given the actual population effect sizes in a set of studies.

two populations of studies

we distinguish two populations of tests. one population contains all tests that have been conducted. this population contains significant and non-significant results. the other population contains the subset of studies that produced a significant result. we focus on the population of studies selected for significance for two reasons. first, non-significant results are often not available because journal articles mostly report significant results (rosenthal, 1979; sterling, 1959; sterling et al., 1995). second, only significant results are used as evidence for a theoretical prediction. it is irrelevant how many tests produced non-significant results because these results are inconclusive. as psychological theories mainly rest on studies that produced significant results, only the evidential value of significant results is relevant for evaluations of the robustness of psychology as a science. in short, we are interested in statistical methods that can estimate the average power of a set of studies with significant results.

the study selection model

we developed a number of theorems that specify how selection for significance influences the distribution of power. these theorems are very general. they do not depend on the particular population distribution of power, the significance tests involved, or the type-i error probabilities of those tests. the only requirement is that, for every study with a specific population effect size, sample size, and statistical test, the probability of a result being selected is the true power of the study. we discuss the two most important theorems in detail.
all six theorems are provided in the appendix, along with an illustration of the theorems by simulation.

theorem 1. population mean true power equals the overall probability of a significant result.

theorem 1 establishes the central importance of population mean power after selection for significance for predicting replication outcomes. think of a coin-tossing experiment in which a large population of coins is manufactured, each with a different probability of heads; that is, these coins are not fair coins with equal probabilities for both sides. also consider heads to be successes or wins. repeatedly tossing the set of coins and counting the number of heads produces an expected value of the number of successes. for example, the experiment may yield 60% heads and 40% tails. while the exact probabilities of heads for the individual coins are unknown, the observable success rate is equivalent to the mean power of all coins. theorem 1 states that success rate and mean power are equivalent even if the set of coins is a subset of all coins. for example, assume all coins were tossed once and only coins showing heads were retained. repeating the coin toss experiment, we would still find that the success rate for the set of selected coins matches the mean probabilities of the selected coins.

theorem 2. the effect of selection for significance on power after selection is to multiply the probability of each power value by a quantity equal to the power value itself, divided by population mean power before selection.

if the distribution of power is continuous, this statement applies to the probability density function. figure 1 illustrates theorem 2 for a simple, artificial example in which power before selection is uniformly distributed on the interval from 0.05 to 1.0. the corresponding distribution after selection for significance is triangular; now studies with more power are more likely to be selected. figure 1.
uniform distribution of power before selection. [figure: density of power before selection (uniform from 0.05 to 1.0) and the triangular density after selection; expected power = 0.525 before selection, 0.635 after selection.]

in figure 2, power before selection is less heterogeneous, and higher on average. consequently, the distributions of power before selection and after selection are much more similar. in both cases, though, mean true power after selection for significance is higher than mean true power before selection for significance.

figure 2. example of higher power before selection. [figure: densities of power before and after selection; expected power = 0.700 before selection, 0.714 after selection. note: power before selection follows a beta distribution with a = 13 and b = 6, multiplied by .95 plus .05, so that it ranges from .05 to 1.]

the coin-tossing selection model proposed here may seem overly simplistic and unrealistic. few researchers conduct a study and give up after a first attempt produces a nonsignificant result. for example, morewedge et al. (2014) disclosed that they did not report “some preliminary studies that used different stimuli and different procedures and that showed no interesting effects.” from a theoretical perspective, it is important that all studies test the same hypothesis, but for our selection model it is not. even if all studies used exactly the same procedures and had exactly the same power, the probability of being selected into the set of reported studies matches their power, and theorem 2 holds. each study that was conducted by morewedge et al. has an unknown true power to produce a significant result, and theorem 2 implies (via theorem 5 in the appendix) that their selected studies with significant results have higher mean power than the full set of studies that were conducted.
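the coin-tossing model can be checked with a small simulation (my own illustration, not the authors' code, and in python rather than r for self-containment): coins with power drawn uniformly from .05 to 1 are tossed once, heads are retained, and the retained coins' success rate on a fresh toss matches their mean power, which exceeds the pre-selection mean.

```python
# illustrative simulation of theorems 1 and 2 via the coin-tossing model:
# each coin (study) has its own probability of heads (power); selection
# keeps only coins that came up heads on a first toss.
import random

random.seed(2)
coins = [random.uniform(0.05, 1.0) for _ in range(200_000)]  # true power values

# toss each coin once; keep the "significant" ones (heads)
selected = [p for p in coins if random.random() < p]

mean_before = sum(coins) / len(coins)          # mean power before selection
mean_after = sum(selected) / len(selected)     # mean power after selection

# theorem 1: a fresh toss of the selected coins succeeds at a rate equal
# to their mean power
success_rate = sum(random.random() < p for p in selected) / len(selected)

print(round(mean_before, 3), round(mean_after, 3), round(success_rate, 3))
```

selection raises mean power because each coin's retention probability is proportional to its own power, which is exactly the reweighting described by theorem 2.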
we are only interested in the statistical power and replicability of the published studies with significant results.

estimation methods

in this section, we describe four methods for estimating population mean power under conditions of heterogeneity, after selection for statistical significance.

notation and statistical background

to present our methods formally, it is necessary to introduce some statistical notation. rather than using traditional notation from statistics that might make it difficult for non-statisticians to understand our method, we follow simonsohn et al. (2014a), who employed a modified version of the s syntax (becker et al., 1988) to represent probability distributions. the s language is familiar to psychologists who use the r statistical software (r core team, 2017). the notation also makes it easier to implement our methods in r, particularly in the simulation studies. the outcome of an empirical study is partially determined by random sampling error, which implies that statistical results will vary across studies. this variation is expected to follow a random sampling distribution. each statistical test has its own sampling distribution. we will use the symbol T to denote a general test statistic; it could be a t-statistic, f, chi-squared, z, or something else. assume an upper-tailed test, so that the null hypothesis will be rejected at significance level α (usually α = 0.05) when the continuous test statistic T exceeds a critical value c. typically there is a sample of test statistic values t1, ..., tk, but when only one is being considered the subscript will be omitted. the notation p(t) refers to the probability under the null hypothesis that T is less than or equal to the fixed constant t. the symbol p would represent pnorm if the test statistic were standard normal, pf if the test statistic had an f-distribution, and so on. while p(t) is the area under the curve, d(t) is the value on the y axis for a particular t, as in dnorm.
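for readers more familiar with python than r, the p(t) and d(t) notation has direct stdlib analogues for the standard normal case (an illustration of the notation only; the paper's own code is in r):

```python
# p(t) and d(t) for a standard normal test statistic, using python's
# statistics.NormalDist as an analogue of r's pnorm/dnorm.
from statistics import NormalDist

norm = NormalDist()  # standard normal distribution

t = 1.96
area = norm.cdf(t)    # p(t): area under the curve up to t (r's pnorm)
height = norm.pdf(t)  # d(t): height of the density at t (r's dnorm)

print(round(area, 3), round(height, 4))  # prints 0.975 0.0584
```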
following the conventions of the s language, the inverse of p is q, so that p(q(t)) = q(p(t)) = t. sampling distributions when the null-hypothesis is true are well known to psychologists because they provide the foundation of null-hypothesis significance testing. most psychologists are less familiar with noncentral sampling distributions (see johnson et al., 1995, for a detailed and authoritative treatment). when the null hypothesis is false, the area under the curve of the test statistic's sampling distribution is p(t, ncp), representing particular cases like pf(t, df1, df2, ncp). the initials ncp stand for “non-centrality parameter.” this notation applies directly when T has one of the common non-central distributions like the non-central t, f or chi-squared under the alternative hypothesis, but it can be extended to the distribution of any test statistic under any specific alternative, even when the distribution in question is technically not a non-central distribution. the non-centrality parameter is positive when the null hypothesis is false, and statistical power is a monotonically increasing function of the non-centrality parameter. this function is given explicitly by power = 1 − p(c, ncp). for the most important non-central distributions (z, t, chi-squared and f), the non-centrality parameter can be factored into the product of two terms. the first term is an increasing function of sample size, and the second term is an increasing function of effect size. in symbols, ncp = f1(n) · f2(es). (1) this formula is capable of accommodating different definitions of effect size (cohen, 1988; grissom & kim, 2012) by making corresponding changes to the function f2 in f2(es). as an example of equation (1), consider a standard f-test for a difference between the means of two normal populations with a common variance.
after some simplification, the noncentrality parameter of the non-central f may be written as ncp = n · ρ(1 − ρ) · d², where n = n1 + n2 is the total sample size, ρ is the proportion of cases allocated to the first treatment, and d is cohen's (1988) effect size for the two-sample problem. this expression for the non-centrality parameter can be factored in various ways to match equation 1; for example, f1(n) = n · ρ(1 − ρ) and f2(es) = es². note that this is just an example; equation 1 applies to the non-centrality parameters of the non-central z, t, chi-squared and f distributions in general. thus for a given sample size and a given effect size, the power of a statistical test is power = 1 − p(c, f1(n) · f2(es)). (2) in this formula, c is the criterion value for statistical significance; the test is significant if t > c. the function f2(es) can also be applied to sets of studies with different traditional effect sizes. for example, es could be cohen's d, and the alternative effect size es′ could be the point-biserial correlation r (cohen, 1988, p. 24). symbolically, es′ = g(es). since the function g(es) is monotone increasing, a corresponding inverse function exists, so that es = g⁻¹(es′). then equation (2) becomes power = 1 − p(c, f1(n) · f2(es)) = 1 − p(c, f1(n) · f2(g⁻¹(es′))) = 1 − p(c, f1(n) · f2′(es′)), where f2′ just means another function f2. that is, if the definition of effect size is changed (in a monotone way), the change is absorbed by the function f2, and equation (2) still applies. we are now ready to introduce our four methods for the estimation of mean power based on a set of studies that vary in power with known sample sizes and unknown population effect sizes. the four methods are called p-curve, p-uniform, maximum likelihood model, and z-curve.

estimation methods

the first two estimation methods are based on methods that were developed for the estimation of effect sizes.
our use of these methods for the estimation of mean power is an extension of them. our simulation studies should not be considered tests of these methods for the estimation of effect sizes. we built on these methods simply because power is a function of effect size and sample size, and sample sizes are known. thus, only estimation of the unknown effect sizes is needed to estimate power with these methods. power estimation is a simple additional step: compute power for each study as a function of the effect size estimate and the sample size of each study. these models should work well when all studies have the same effect size and heterogeneity in power is only a function of heterogeneity in sample size, as assumed by these models.

p-curve 2.1 and p-uniform

a p-curve method for estimation of mean power is available online (www.p-curve.com). it is important to point out that this method differs from the p-curve method that we developed. the online p-curve method is called p-curve 4.06. we built our p-curve method on the effect size p-curve method with the version code p-curve 2.0 (simonsohn et al., 2014b). hence, we refer to our p-curve method as p-curve 2.1. p-uniform is very similar to p-curve (van assen et al., 2014). both methods aim to find an effect size that produces a uniform distribution of p-values between .00 and .05. since we developed our p-uniform method for power estimation, a new estimation method has been introduced (van aert et al., 2016). we conducted our studies with the original estimation method, and our results are limited to the performance of this implementation of p-uniform. to find the best fitting effect size for a set of observed test statistics, p-curve 2.1 and p-uniform compute p-values for various effect sizes and choose the effect size that yields the best approximation of a uniform distribution.
if the modified null hypothesis that effect size = es is true, the cumulative distribution function of the test statistic is the conditional probability

f0(t) = pr{T ≤ t | T > c} = [p(t, ncp) − p(c, ncp)] / [1 − p(c, ncp)] = [p(t, f1(n) · f2(es)) − p(c, f1(n) · f2(es))] / [1 − p(c, f1(n) · f2(es))],

using ncp = f1(n) · f2(es) as given in equation 1. the corresponding modified p-value is

1 − f0(t) = [1 − p(t, f1(n) · f2(es))] / [1 − p(c, f1(n) · f2(es))].

note that since the sample sizes of the tests may differ, the symbols p, n and c as well as t may have different referents for j = 1, ..., k test statistics. the subscript j has been omitted to reduce notational clutter. if the modified null hypothesis were true, the modified p-values would have a uniform distribution. both p-curve 2.1 and p-uniform choose as estimated effect size the value of es that makes the modified p-values most nearly uniform. they differ only in the criterion for deciding when uniformity has been reached. p-curve 2.1 is based on a kolmogorov-smirnov test for departure from a uniform distribution, choosing the es value yielding the smallest value of the test statistic. p-uniform is based on a different criterion. denoting by p_j the modified p-value associated with test j, calculate y = −∑_{j=1}^{k} ln(p_j), where ln is the natural logarithm. if the p_j values were uniformly distributed, y would have a gamma distribution with expected value k, the number of tests. the p-uniform estimate is the modified null hypothesis effect size es that makes y equal to k, its expected value under uniformity. these technologies are designed for heterogeneity in sample size only, and assume a common effect size for all the tests. given an estimate of the common effect size, estimated power for each test varies only as a function of sample size, which can be determined by expression 2 because sample sizes are known. population mean power can then be estimated by averaging the k power estimates.
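for the simplest case of upper-tailed z-tests sharing one effect size, with noncentrality m = sqrt(n) · es (an assumed parameterization for this sketch, not a formula from the paper), the p-uniform criterion reduces to a one-dimensional search. the sketch below, my own python illustration rather than the authors' r code, recovers the effect size and then averages the per-study power estimates:

```python
# sketch of the p-uniform criterion for upper-tailed z-tests with a common
# effect size: find es such that y = -sum(ln(modified p)) equals k.
import math
import random
from statistics import NormalDist

norm = NormalDist()
c = norm.inv_cdf(0.975)  # significance cutoff for z, about 1.96

def modified_p(z, n, es):
    """modified p-value of a significant z under the hypothesized effect size es."""
    m = math.sqrt(n) * es  # z-test noncentrality (assumed parameterization)
    return (1 - norm.cdf(z - m)) / (1 - norm.cdf(c - m))

def y_statistic(zs, ns, es):
    return -sum(math.log(modified_p(z, n, es)) for z, n in zip(zs, ns))

def puniform_es(zs, ns, lo=0.0, hi=2.0):
    """bisection: y decreases as es grows, so find the es with y(es) = k."""
    k = len(zs)  # expected value of y under uniformity
    for _ in range(60):
        mid = (lo + hi) / 2
        if y_statistic(zs, ns, mid) > k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# simulate 500 significant z-tests with true es = 0.3 and n = 100
random.seed(3)
zs, ns = [], []
while len(zs) < 500:
    z = random.gauss(math.sqrt(100) * 0.3, 1.0)
    if z > c:  # literal selection for significance
        zs.append(z)
        ns.append(100)

es_hat = puniform_es(zs, ns)
# average the per-study power estimates (expression 2 for the z-test case)
mean_power = sum(1 - norm.cdf(c - math.sqrt(n) * es_hat) for n in ns) / len(ns)
print(round(es_hat, 2), round(mean_power, 2))
```

with a single common effect size and sample size, as here, the recovered mean power is close to the true power of the simulated studies; the paper's point is that this breaks down once effect sizes are heterogeneous.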
maximum likelihood model

our maximum likelihood (ml) model also first estimates effect sizes and then combines effect size estimates with known sample sizes to estimate mean power. unlike p-curve 2.1 and p-uniform, the ml model allows for heterogeneity in effect sizes. in this way, the model is similar to hedges and vevea's (1996) model for effect size estimation before selection for significance. to take selection for significance into account, the likelihood function of the ml model is a product of k conditional densities; each term is the conditional density of the test statistic T_j, given the sample size n_j and T_j > c_j, the critical value.

likelihood function. the model assumes that sample sizes and effect sizes are independent before the selection for significance. suppose that the distribution of effect size before selection is continuous with probability density gθ(es). this notation indicates that the distribution of effect size depends on an unknown parameter or parameter vector θ. in the appendix, it is shown that the likelihood function (a function of θ) is a product of k terms of the form

[∫0^∞ d(t_j, f1(n_j) · f2(es)) gθ(es) des] / [∫0^∞ (1 − p(c_j, f1(n_j) · f2(es))) gθ(es) des], (3)

where the integrals denote areas under curves that can be computed with r's integrate function. the maximum likelihood estimate is the parameter value yielding the highest product. to be applicable to actual data, the ml model has to make assumptions about the distribution of effect sizes. the ml model that was used in the simulation studies assumed a gamma distribution of effect sizes. a gamma distribution is defined by two parameters that need to be estimated based on the data. the effect sizes based on the most likely distribution are then combined with information about sample sizes to obtain power estimates for each study. an estimate of population mean power is then produced by averaging estimated power for the k significance tests.
as shown in the appendix, the terms to be averaged are

[∫0^∞ (1 − p(c_j, f1(n_j) · f2(es)))² g_θ̂(es) des] / [∫0^∞ (1 − p(c_j, f1(n_j) · f2(es))) g_θ̂(es) des]. (4)

z-curve

z-curve follows traditional meta-analyses that convert p-values into z-scores as a common metric to integrate results from different original studies (rosenthal, 1979; stouffer et al., 1949). the use of z-scores as a common metric makes it possible to fit a single function to p-values arising from different statistical methods and tests. the method is based on the simplicity and tractability of power analysis for z-tests, in which the distribution of the test statistic under the alternative hypothesis is just a standard normal shifted by a fixed quantity that plays the role of a non-centrality parameter, and will be denoted by m. input to z-curve is a sample of p-values, all less than α = 0.05. these p-values are processed in several steps to produce an estimate.

1. convert p-values to z-scores. the first step is to imagine, for simplicity, that all the p-values arose from two-tailed z-tests in which results were in the predicted direction. this is equivalent to an upper-tailed z-test. in our simulations, alpha was set to .05, which results in a selection criterion of z = 1.96. the conversion to z-scores (stouffer et al., 1949) consists of finding the test statistic z that would have produced that p-value. the formula is z = qnorm(1 − p/2). (5)

2. set aside z > 6. we set aside extreme z-scores. this avoids fitting a large number of normal distributions to extremely small p-values. this step has no influence on the final result because all of these p-values have an observed power of 1.00 (rounded to the second decimal). this step also avoids numerical problems that arise from small p-values rounded to 0.

3. fit a finite mixture model.
before selecting for significance and setting aside values above six, the distribution of the test statistic z given a particular non-centrality parameter value m is normal with mean m. afterwards, it is a normal distribution truncated on the left at the critical value c (usually 1.96), truncated on the right at 6, and rescaled to have area one under the curve. because of heterogeneity in sample size and effect size, the full distribution of z is an average of truncated normals, with potentially a different value of m for each member of the population. as a simplification, heterogeneity in the distribution of z is represented as a finite mixture with r components. the model is equivalent to the following two-stage sampling plan. first, select a non-centrality parameter m from m1, ..., mr according to the respective probabilities w1, ..., wr. then generate z from a normal distribution with mean m and standard deviation one. finally, truncate and re-scale. under this approximate model, the probability density function of the test statistic after selection for significance is

f(z) = ∑_{j=1}^{r} w_j · dnorm(z − m_j) / [pnorm(6 − m_j) − pnorm(c − m_j)]. (6)

the finite mixture model is only an approximation because it approximates k standard normal distributions with a smaller set of standard normal distributions. preliminary studies showed negligible differences between models with 3 or more parameters. thus, the z-curve method that was used in the simulation studies approximated the observed distribution of z-scores between 1.96 and 6 with three truncated standard normal distributions. the observed density distribution was estimated based on the observed z-scores using the kernel density estimate (silverman, 1986) as implemented in r's density function, with the default settings. the default settings are gaussian approximation and 512 nodes. the most critical default parameter is the bandwidth.
the default bandwidth is 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34, times the sample size to the negative one-fifth power (https://stat.ethz.ch/r-manual/r-devel/library/stats/html/density.html). specifically, the fitting step proceeds as follows. first, obtain the kernel density estimate based on the sample of significant z values, re-scaling it so that the area under the curve between 1.96 and 6 equals one. to do so, all density values are divided by the sum of the density values times the bandwidth parameter of the density function. then, numerically choose w_j and m_j values so as to minimize the sum of absolute differences between expression (6) and the density estimate.

4. estimate mean power for z < 6. the estimate of the rejection probability upon replication for z < 6 is the area under the curve above the critical value, with weights and non-centrality values from the curve fitting step. the estimate is

ℓ = ∑_{j=1}^{r} ŵ_j (1 − pnorm(c − m̂_j)), (7)

where ŵ1, ..., ŵr and m̂1, ..., m̂r are the values located in step 3. note that while the input data are censored both on the left and right as represented in formula (6), there is no truncation in formula (7) because it represents the distribution of z upon replication.

5. re-weight using z > 6. let q denote the proportion of the original set of z statistics with z > 6. again, we assume that the probability of significance for those tests is essentially one. bringing this in as one more component of the mixture estimate, the final estimate of the probability of rejecting the null hypothesis for an exact replication of a randomly selected test is

zest = (1 − q) · ℓ + q · 1 = q + (1 − q) ∑_{j=1}^{r} ŵ_j (1 − pnorm(c − m̂_j)). (8)

by theorem 1, this is also an estimate of population true mean power after selection. unlike the other estimation methods, z-curve does not require information about sample size.
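steps 4 and 5 amount to a few lines of arithmetic; a sketch (my python illustration, with made-up fitted weights and means rather than output of the actual fitting step):

```python
# z-curve's final re-weighting, formulas (7) and (8): combine the fitted
# mixture components for z in (c, 6) with the proportion q of z-scores
# that were set aside above 6.
from statistics import NormalDist

norm = NormalDist()
c = 1.96  # selection criterion

def zcurve_estimate(weights, means, q):
    """weights, means: fitted w_j and m_j; q: proportion of z > 6."""
    ell = sum(w * (1 - norm.cdf(c - m)) for w, m in zip(weights, means))  # (7)
    return q + (1 - q) * ell                                              # (8)

# hypothetical fitted values, for illustration only
estimate = zcurve_estimate([0.5, 0.3, 0.2], [1.0, 2.5, 4.0], q=0.05)
print(round(estimate, 3))
```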
Unlike p-curve 2.1 and p-uniform, z-curve does not assume a fixed effect size. Finally, z-curve does not make assumptions about the distribution of true effect sizes or true power, but approximates the actual distribution with a weighted combination of three standard normal distributions.

Simulations

The simulations reported here were carried out using the R programming environment (R Core Team, 2017), distributing the computation among 70 quad-core Apple iMac computers. The R code is available in the supplementary materials at https://osf.io/bvraz. In the simulations, the four estimation methods (p-curve 2.1, p-uniform, maximum likelihood, and z-curve) were applied to samples of significant chi-squared or F statistics, all with p < .05. This covers most cases of interest, since t statistics may be squared to yield F statistics, while z may be squared to yield chi-squared with one degree of freedom.

Heterogeneity in sample size only: effect size fixed

Sample sizes after selection for significance were randomly generated from a Poisson distribution with mean 86, so that they were approximately normal, with population mean 86 and population standard deviation 9.3. Population mean power, the number of test statistics on which the estimates were based, the type of test (chi-squared or F), and the (numerator) degrees of freedom were varied in a complete factorial design. Within each combination, we generated 10,000 samples of significant test statistics and applied the four estimation methods to each sample. In these simulations, it was not necessary to simulate test statistic values and then literally select those that were significant. A great deal of computation was saved by using the R functions rsigf and rsigchi (available from the supplementary materials) to simulate directly from the distribution of the test statistic after selection. A description of the simulation method and a proof of its correctness are given in the appendix.
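The idea behind simulating directly from the post-selection distribution can be sketched with inverse-CDF sampling restricted to the significant tail. This is an assumed Python analogue of what a function like rsigchi might do, not the authors' supplementary code; their method and its proof of correctness are in the appendix.

```python
# Assumed Python analogue of direct sampling from the post-selection
# distribution of a chi-squared statistic (in the spirit of rsigchi; the
# authors' actual R implementation is in the supplementary materials).
import numpy as np
from scipy.stats import chi2, ncx2

def rsig_chisq(size, df, ncp, alpha=0.05, seed=None):
    """Draw noncentral chi-squared statistics conditional on significance
    by inverse-CDF sampling on the tail above the critical value; no
    simulate-then-discard loop is needed."""
    rng = np.random.default_rng(seed)
    crit = chi2.ppf(1.0 - alpha, df)       # critical value under H0
    p_lo = ncx2.cdf(crit, df, ncp)         # probability of a non-significant result
    u = rng.uniform(p_lo, 1.0, size=size)  # uniform over the significant tail
    return ncx2.ppf(u, df, ncp)

x = rsig_chisq(10_000, df=1, ncp=4.0, seed=1)
```

Every draw lands above the critical value by construction, which is what makes this far cheaper than simulating all studies and discarding the non-significant ones.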
The first simulation had a 4 × 5 × 3 design, with true power after selection for significance (.05, .25, .50, and .75), the number of test statistics k on which estimates were based (15, 25, 50, 100, and 250), and numerator degrees of freedom (just degrees of freedom for the chi-squared tests; 1, 3, and 5) as factors. To obtain the desired levels of power, we used the effect size metric f for F-tests and w for chi-squared tests (Cohen, 1988, p. 216). Because the pattern of results was similar for F-tests and chi-squared tests and for different degrees of freedom, we only report details for F-tests with one numerator degree of freedom; preliminary data mining of the psychological literature suggests that this is the case most frequently encountered in practice. Full results are given in the supplementary materials.

Average performance. Table 1 shows means and standard deviations of the mean power estimates based on 10,000 simulations in each cell of the design. Differences between the estimates and the true values represent systematic bias in the estimates. The results show that all methods performed fairly well, with z-curve showing more bias than the other methods, especially for small sets of studies.

Absolute error of estimation. Although the standard deviations in Table 1 provide some information about estimation errors in individual simulations, we also computed mean absolute errors, abs(true power − estimated power), to supplement this information. With 50% power, at least 100 studies would be needed to reduce the mean absolute error to less than 6% for all methods. Thus, fairly large sets of studies are needed to obtain precise estimates of mean power.

Heterogeneity in both sample size and effect size

The results of the first simulation study were reassuring in that our methods performed well under conditions that were consistent with model assumptions.
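For a concrete sense of how effect sizes can be chosen to hit a target power, the sketch below is our own illustration (assuming a simple two-group design with denominator degrees of freedom n − 2 and noncentrality f² · n; the paper's exact design may differ): it solves for the Cohen's f that gives a desired power for an F-test with numerator df = 1 and n = 86.

```python
# Our own illustration of choosing Cohen's f to yield a target power for
# an F-test. Assumptions: two-group design (df2 = n - 2), ncp = f^2 * n.
from scipy.stats import f as f_dist, ncf
from scipy.optimize import brentq

def f_test_power(f_es, n, df1=1, alpha=0.05):
    """Power of the F-test at effect size f_es (Cohen's f)."""
    df2 = n - 2
    crit = f_dist.ppf(1.0 - alpha, df1, df2)       # critical value under H0
    return ncf.sf(crit, df1, df2, f_es ** 2 * n)   # upper-tail noncentral F

def effect_size_for_power(target, n, df1=1):
    """Solve for the Cohen's f that gives the target power."""
    return brentq(lambda f_es: f_test_power(f_es, n, df1) - target, 1e-6, 2.0)

f50 = effect_size_for_power(0.50, n=86)  # effect size for 50% power at n = 86
```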
P-curve, p-uniform, and the ML model performed better than z-curve because they used information about sample sizes and correctly assumed that all studies have the same population effect size. However, our main goal was to test these methods under more realistic conditions where effect sizes vary across studies. To model heterogeneity in effect size, we let effect size before selection vary according to a gamma distribution (Johnson et al., 1995), a flexible continuous distribution taking positive values. Sample size before selection remained Poisson distributed with a population mean of 86. For convenience, sample size and effect size were independent before selection for significance. The maximum likelihood model correctly assumed a gamma distribution for effect size, and the likelihood search was over the two parameters of the gamma distribution. The other three methods were not modified in any way. P-curve 2.1 and p-uniform continued to assume a fixed effect size, and z-curve continued to assume heterogeneity in the non-centrality parameter without distinguishing between heterogeneity in sample size and heterogeneity in effect size.

Table 1. Average estimated population mean power for heterogeneity in sample size only (SD in parentheses): F-tests with numerator df = 1.

                         Number of tests
                 15      25      50      100     250
Population mean power = .05
p-curve 2.1    .083    .073    .064    .059    .055
              (.059)  (.039)  (.024)  (.015)  (.007)
p-uniform      .076    .067    .061    .058    .054
              (.050)  (.032)  (.019)  (.012)  (.006)
ml-model       .076    .067    .061    .057    .054
              (.050)  (.033)  (.020)  (.012)  (.006)
z-curve        .086    .071    .058    .049    .040
              (.088)  (.065)  (.044)  (.031)  (.019)
Population mean power = .25
p-curve 2.1    .269    .261    .256    .253    .251
              (.156)  (.128)  (.095)  (.069)  (.046)
p-uniform      .256    .253    .252    .251    .251
              (.147)  (.121)  (.089)  (.065)  (.042)
ml-model       .260    .255    .253    .251    .251
              (.146)  (.120)  (.087)  (.064)  (.042)
z-curve        .314    .305    .293    .280    .268
              (.155)  (.127)  (.093)  (.068)  (.045)
Population mean power = .50
p-curve 2.1    .484    .491    .496    .497    .499
              (.175)  (.139)  (.102)  (.073)  (.046)
p-uniform      .473    .485    .493    .496    .499
              (.170)  (.132)  (.097)  (.070)  (.044)
ml-model       .479    .489    .495    .497    .499
              (.166)  (.130)  (.095)  (.068)  (.043)
z-curve        .513    .516    .513    .508    .502
              (.151)  (.121)  (.091)  (.068)  (.045)
Population mean power = .75
p-curve 2.1    .728    .736    .742    .747    .749
              (.128)  (.098)  (.069)  (.048)  (.030)
p-uniform      .721    .732    .740    .746    .748
              (.126)  (.097)  (.067)  (.047)  (.029)
ml-model       .728    .736    .742    .747    .749
              (.121)  (.093)  (.065)  (.045)  (.028)
z-curve        .704    .712    .717    .723    .728
              (.105)  (.084)  (.064)  (.048)  (.033)

Table 2. Mean absolute error of estimation (in percentage points) for heterogeneity in sample size only: F-tests with numerator df = 1.

                         Number of tests
                 15      25      50      100     250
Population mean power = .05
p-curve 2.1    3.32    2.25    1.41    0.93    0.52
p-uniform      2.57    1.75    1.11    0.76    0.43
ml-model       2.59    1.74    1.09    0.73    0.39
z-curve        6.53    4.90    3.38    2.44    1.79
Population mean power = .25
p-curve 2.1   12.94   10.49    7.69    5.53    3.64
p-uniform     12.11    9.87    7.17    5.18    3.38
ml-model      12.07    9.76    7.05    5.10    3.32
z-curve       13.55   11.09    8.21    5.96    3.87
Population mean power = .50
p-curve 2.1   14.32   11.20    8.14    5.80    3.67
p-uniform     13.93   10.68    7.80    5.56    3.51
ml-model      13.61   10.41    7.60    5.39    3.41
z-curve       12.42    9.91    7.44    5.48    3.59
Population mean power = .75
p-curve 2.1    9.77    7.59    5.38    3.72    2.35
p-uniform      9.79    7.59    5.34    3.71    2.32
ml-model       9.33    7.23    5.11    3.53    2.21
z-curve        8.34    6.96    5.56    4.30    3.13

We used the same design as in Study 1, with one additional factor: the amount of heterogeneity in effect size, as represented by the standard deviation of the effect size distribution. Figure 3 shows the distribution of effect sizes after selection for significance for three levels of heterogeneity, the standard deviation of effect size after selection (0.10, 0.20, or 0.30), crossed with three levels of true population mean power (0.25, 0.50, or 0.75). Effect sizes were transformed into Cohen's d for ease of interpretation. We dropped the condition with 5% power because it implies a fixed effect size of 0.
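The quantity being estimated in Study 2, population mean power after selection, can be illustrated with a small Monte Carlo sketch. The setup below is a simplification of the described design (a one-sided one-sample z-test approximation with noncentrality d · sqrt(n); the gamma parameters are arbitrary), not the paper's simulation code.

```python
# Illustration of "mean power after selection" with heterogeneous sample
# sizes and effect sizes. Assumptions: one-sample upper-tail z-test,
# arbitrary gamma parameters for the effect size distribution.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
k = 200_000
n = rng.poisson(86, size=k)                    # sample sizes (mean 86)
d = rng.gamma(shape=2.0, scale=0.15, size=k)   # heterogeneous effect sizes

# Per-study power under the one-sided z-test approximation.
power = norm.sf(1.96 - d * np.sqrt(n))

# Selection for significance: each study "succeeds" with probability
# equal to its own power.
sig = rng.uniform(size=k) < power
mean_power_before = power.mean()
mean_power_after = power[sig].mean()           # the quantity z-curve estimates
```

Selection necessarily inflates mean power: studies with higher power are more likely to survive, so the post-selection mean equals E[power²]/E[power], which exceeds E[power] whenever power is heterogeneous.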
We also varied the number of test statistics in a simulation (k = 100, 250, 500, 1,000, or 2,000), experimental degrees of freedom (1, 3, or 5), and type of test (F or chi-squared). Within each cell of the design, ten thousand samples of significant test statistics were randomly generated, and population mean power was estimated using all four methods. For brevity, we only present results for F-tests with numerator df = 1. Full results are given in the supplementary materials.

In our simulations with heterogeneity in effect sizes, maximum likelihood is computationally demanding. Using R's integrate function, the calculation involves fitting a histogram to each curve and then adding the areas of the bars. Numerical accuracy is an issue, especially for ratios of areas when the denominators are very small. In addition, it is necessary to try more than one starting value to have a hope of locating the global maximum, because the likelihood function has many local maxima. In our simulations, we used three random starting points.

Figure 3. Distribution of effect sizes (Cohen's d) for the simulations in Study 2. Heterogeneity: black = .1, blue = .2, red = .3; power: solid = 25%, dots = 50%, dashes = 75%.
The ML model benefited from the fact that it assumed a gamma distribution of effect sizes, which matched the simulated effect size distributions. In contrast, z-curve made no distributional assumptions, and the other two methods falsely assumed a fixed effect size.

Average performance. Table 3 shows estimated population mean power as a function of true population mean power. Results were consistent with the differences in assumptions. P-curve 2.1 and p-uniform overestimated mean power, and this bias increased with increasing heterogeneity and increasing mean power. Z-curve estimates were actually better than in the previous simulations with fixed effect sizes. The maximum likelihood model had the best fit, presumably because it anticipated the actual effect size distribution.

Absolute error of estimation. Table 4 shows the mean absolute error of estimation. It confirms the pattern of results seen in Table 3. Most important are the large absolute errors for the two methods that assume a fixed effect size. These large mean absolute differences arise despite small standard deviations because p-curve 2.1 and p-uniform systematically overestimate mean power, and large sample sizes cannot correct for systematic estimation errors.
These results show that fixed effect size models cannot be used for the estimation of mean power when there is substantial heterogeneity in power. The results also show that the differences between z-curve and the ML model are slight and have no practical significance. The good performance of z-curve is encouraging because it does not require assumptions about the effect size distribution.

Table 3. Average estimated power (SD in parentheses) for heterogeneity in sample size and effect size, based on k = 1,000 F-tests with numerator df = 1.

                  Standard deviation of ES
                 0.1      0.2      0.3
Population mean power = .25
p-curve 2.1     .225     .272     .320
               (.024)   (.033)   (.039)
p-uniform       .294     .694     .949
               (.029)   (.056)   (.028)
maxlike         .230     .269     .283
               (.069)   (.016)   (.015)
z-curve         .233     .225     .226
               (.027)   (.026)   (.024)
Population mean power = .50
p-curve 2.1     .549     .679     .757
               (.024)   (.027)   (.026)
p-uniform       .602     .913     .995
               (.024)   (.019)   (.003)
maxlike         .501     .502     .506
               (.025)   (.019)   (.019)
z-curve         .504     .492     .487
               (.026)   (.026)   (.025)
Population mean power = .75
p-curve 2.1     .824     .928     .962
               (.013)   (.009)   (.006)
p-uniform       .861     .992    1.000
               (.012)   (.003)   (.000)
maxlike         .752     .750     .750
               (.022)   (.017)   (.014)
z-curve         .746     .755     .760
               (.021)   (.017)   (.016)

Violating the assumptions of the ML model

In the preceding simulation study, heterogeneity in effect size before selection was modeled by a gamma distribution, with effect size independent of sample size before selection. The maximum likelihood model had a substantial and arguably unfair advantage, since the simulation was consistent with its assumptions. It is well known that maximum likelihood models are very accurate compared to other methods when their assumptions are met (Stuart & Ord, 1999, ch. 18).
We used a beta distribution of effect sizes to examine how the ML model performs when its assumption of a gamma distribution is violated. In this simulation, z-curve may have the upper hand because it makes no assumptions about the distribution of effect sizes or the correlation between effect sizes and sample sizes. It is well known that selection for significance (e.g., publication bias) introduces a correlation between sample sizes and effect sizes. However, there might also be negative correlations between sample sizes and effect sizes before selection for significance, if researchers conduct a priori power analyses to plan their studies, or if researchers learn from non-significant results that they need larger samples to achieve significance. The design of this simulation study was similar to the previous design, but we only simulated the most extreme heterogeneity condition (SD = .3) and added a factor for the correlation between sample size and effect size (r = 0, −.2, −.4, −.6, −.8). As before, we ran 10,000 simulations in each condition. To make the results comparable to those in Table 4, we show the results for the simulation with k = 1,000 per simulated meta-analysis. Figure 4 shows the effect size distributions after selection for significance. As before, effect sizes were transformed into Cohen's d values so that they can be compared to the distributions in Figure 3.

Table 4. Mean absolute error of estimation in percentage points, for heterogeneity in sample size and gamma-distributed effect size, based on k = 1,000 F-tests with numerator df = 1.

                  Standard deviation of ES
                 0.1      0.2      0.3
Population mean power = .25
p-curve 2.1     2.87     3.16     7.08
p-uniform       4.50    44.38    69.90
maxlike         3.55     2.06     3.34
z-curve         2.59     3.08     2.90
Population mean power = .50
p-curve 2.1     4.93    17.86    25.70
p-uniform      10.21    41.28    49.54
maxlike         1.80     1.49     1.50
z-curve         2.12     2.19     2.23
Population mean power = .75
p-curve 2.1     7.45    17.75    21.23
p-uniform      11.08    24.17    24.99
maxlike         1.42     1.18     1.16
z-curve         1.69     1.42     1.55
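One standard way to induce a target correlation between Poisson-distributed sample sizes and beta-distributed effect sizes is a Gaussian copula. The paper's actual mechanism is in the supplementary R code; the Python sketch below, with assumed Beta(2, 5) margins for effect size, only illustrates the general technique.

```python
# Gaussian-copula illustration of drawing correlated (n, effect size)
# pairs. The Beta(2, 5) margin for d and the Poisson(86) margin for n
# are assumptions for this sketch, not the paper's exact specification.
import numpy as np
from scipy.stats import norm, poisson, beta

def correlated_n_and_es(k, rho, seed=None):
    """Draw (n, d) pairs with Poisson(86) margins for n, Beta(2, 5)
    margins for d, and Gaussian-copula dependence rho between them."""
    rng = np.random.default_rng(seed)
    g = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=k)
    n = poisson.ppf(norm.cdf(g[:, 0]), mu=86).astype(int)  # Poisson margin
    d = beta.ppf(norm.cdf(g[:, 1]), a=2, b=5)              # beta margin
    return n, d

n, d = correlated_n_and_es(50_000, rho=-0.8, seed=5)
r = float(np.corrcoef(n, d)[0, 1])
```

The Pearson correlation of the generated pairs is slightly attenuated relative to the copula parameter because the margins are discrete and skewed, but a strongly negative rho yields a strongly negative sample correlation.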
Only the most extreme correlations, 0 and −.8, are shown to avoid cluttering the figure. As the figure shows, the correlation has relatively little impact on the distributions.

Figure 4. Effect size distributions (Cohen's d) for Study 3. Correlation: black = 0, red = −.8; power: solid = 25%, dots = 50%, dashes = 75%.

Average performance. Table 5 shows average estimated population mean power as a function of the correlation between sample size and effect size and of the level of power. One reassuring finding is that the correlation between effect size and sample size has virtually no influence on any of the four estimation methods, because the correlation before selection for significance is typically unknown. P-curve 2.1 and p-uniform again overestimate mean power. More important is the comparison of the ML model and z-curve. Both methods perform reasonably well with a mean true power of 50%, although z-curve performs slightly better. With low or high power, however, the ML model overestimates mean power by 5 and 8 percentage points, respectively. The bias of z-curve is smaller, although even z-curve overestimates high power by 4 percentage points. We explored the cause of this systematic bias and found that it is caused by the default bandwidth method with smaller sets of studies.
When we set the bandwidth to a value of 0.05, the z-curve estimates with a correlation of zero were .235, .492, and .743, respectively.

Table 5. Average estimated power (SD in parentheses) with beta-distributed effect sizes and sample size correlated with effect size: k = 1,000 F-tests with numerator df = 1.

               Correlation between N and ES
               −.8     −.6     −.4     −.2     .0
Population mean power = .25
p-curve       .407    .405    .403    .403    .402
             (.043)  (.044)  (.043)  (.044)  (.044)
p-uniform     .853    .852    .852    .852    .852
             (.003)  (.004)  (.003)  (.004)  (.004)
maxlike       .302    .301    .300    .300    .300
             (.015)  (.015)  (.015)  (.015)  (.015)
z-curve       .232    .231    .230    .231    .230
             (.015)  (.015)  (.015)  (.015)  (.015)
Population mean power = .50
p-curve       .839    .840    .841    .841    .841
             (.022)  (.022)  (.022)  (.022)  (.022)
p-uniform     .906    .906    .906    .906    .906
             (.004)  (.004)  (.004)  (.004)  (.004)
maxlike       .532    .533    .533    .534    .534
             (.018)  (.018)  (.019)  (.019)  (.019)
z-curve       .493    .494    .495    .495    .495
             (.023)  (.023)  (.023)  (.023)  (.023)
Population mean power = .75
p-curve       .990    .991    .992    .992    .992
             (.002)  (.002)  (.002)  (.002)  (.002)
p-uniform     .964    .966    .966    .967    .967
             (.003)  (.003)  (.003)  (.003)  (.003)
maxlike       .826    .832    .836    .838    .840
             (.016)  (.016)  (.015)  (.015)  (.015)
z-curve       .785    .790    .793    .794    .796
             (.013)  (.013)  (.013)  (.012)  (.012)

Discussion

In this paper, we have compared four methods for estimating the mean statistical power of a heterogeneous population of significance tests, after selection for significance. We have discovered and formally proved a set of theorems relating the distribution of power values before and after selection for significance.

Mean power and replicability

Several events in 2011 triggered a crisis of confidence in the replicability and credibility of published findings in psychology journals. As a result, there have been various attempts to assess the replicability of published results. The most impressive evidence comes from the Open Science Collaboration's reproducibility project, which conducted 100 replication studies of results from articles published in 2008.
The key finding was that 50% of significant results from cognitive psychology could be replicated successfully, whereas only 25% of significant results from social psychology could be replicated successfully (Open Science Collaboration, 2015). Social psychologists have questioned these results. Their main argument is that the replication studies were poorly done: "Nosek's ballyhooed finding that most psychology experiments didn't replicate did enormous damage to the reputation of the field, and that its leaders were themselves guilty of methodological problems" (Nisbett, quoted in Bartlett, 2018).

Estimating mean power provides an empirical answer to the question of whether replication failures are caused by problems with the original studies or with the replication studies. If the original studies achieved significance only by means of selection for significance or other questionable research practices, estimated mean power would be low. In contrast, if the original studies had good power and replication failures are due to methodological problems of the replication studies, estimated mean power would be high. We applied z-curve to the original studies that were replicated in the Open Science project and found an estimate of 66% mean power (Schimmack & Brunner, 2016). This estimate is higher than the overall success rate of 37% for the actual replication studies, which suggests (though not conclusively) that problems with conducting exact replication studies contributed to the low success rate of 37%. At the same time, the estimate of 66% is considerably lower than the success rate of 97% for the original studies. This discrepancy shows that success rates in journals are inflated by selection for significance, which partially explains replication failures in psychology, especially in social psychology. This example shows that estimates of mean power provide useful information for the interpretation of replication failures.
Without this information, precious resources might be wasted on further replication studies that fail simply because the original results were selected for significance.

Historic trends in power

Our statistical approach to estimating mean power is also useful for examining changes in statistical power over time. So far, power analyses of psychology have relied on fixed values of effect sizes recommended by Cohen (1962, 1988). However, actual effect sizes may change over time or from one field to another. Z-curve makes it possible to examine the actual power in a field of study and whether this power has changed over time. Despite much talk about improvement in psychological science in response to the replication crisis, mean power has increased by less than 5 percentage points since 2011, and improvements are limited to social psychology (Schimmack, 2018b).

Mean power as a quality indicator

One problem in psychological science is the use of quantitative indicators such as the number of publications or the number of studies per article to evaluate the productivity and quality of psychological scientists. We believe that mean power is an important additional indicator of good science. A single study with good power provides more credible evidence and a more sound theoretical foundation than three or more studies with low power that were selected from a larger population of studies with non-significant results (Schimmack, 2012). However, without quantitative information about power, it is unclear whether reported results are trustworthy. Reporting the mean power of studies from a lab or a particular field of research can provide this information, which journalists or textbook writers can use to select articles that report credible empirical evidence likely to replicate in future studies.

P-curve estimates of mean power

Simonsohn et al. (2017) provided users with a free online app to compute mean power.
However, they did not report the performance of their method in simulation studies, and their method has not been peer-reviewed. We evaluated their online method and found that the current version, p-curve 4.06, overestimates mean power under conditions of heterogeneity (Schimmack & Brunner, 2017). Moreover, even heterogeneity in sample sizes alone can produce biased estimates with p-curve 4.06 (Brunner, 2018). However, we agree with Simonsohn et al. (2014b) that p-curve 2.0 can be used for the estimation of mean effect sizes and that these estimates are relatively bias-free even when there is moderate heterogeneity in effect sizes. Importantly, these estimates are unbiased only for the population of studies that produced significant results; they are inflated estimates for the population of studies before selection for significance. Failing to distinguish these two populations of studies (i.e., before and after selection for significance) has produced a lot of confusion and unnecessary criticism of selection models in general (McShane et al., 2016). While it is difficult to obtain accurate estimates of effect sizes or power before selection for significance from the subset of studies that were selected for significance, p-curve 2.0 provides reasonably good estimates of effect sizes after selection for significance, which is the reason we built p-curve 2.1 in the first place. However, p-curve 2.1, and especially p-curve 4.06, produce biased estimates of mean power even for the set of studies selected for significance. Therefore, we do not recommend using p-curve to estimate mean power.

P-uniform estimation of mean power

Unlike p-curve, the authors of p-uniform limited their method to the estimation of effect sizes before selection for significance. We used their estimation method to create a method for the estimation of mean power after selection.
Like p-curve, the method had problems with heterogeneity in effect sizes, and it performed even worse than p-curve. Recently, the developers of p-uniform changed the estimation method to make it more robust in the presence of heterogeneity and outliers (van Aert et al., 2016). The new approach simply averages the rescaled p-values and finds the effect size that produces a mean p-value of 0.50; this is called the Irwin-Hall method. We conducted new simulation studies with this method for the zero-correlation condition in Table 5 with 25%, 50%, and 75% true power. We found that it performed much better (24%, 76%, 99%) than the old p-uniform method (85%, 91%, 97%), and slightly better than p-curve 2.1 (40%, 84%, 99%). However, the method still produces inflated estimates for medium and high mean power.

Maximum likelihood model

Our ML model is similar to Hedges and Vevea's (1996) ML method, which corrects for publication bias in effect size meta-analyses. Although this model has rarely been used in actual applications, it has received renewed attention during the current replication crisis. McShane et al. (2016) argued that p-curve and p-uniform produce biased effect size estimates, whereas a heterogeneous ML model produces accurate estimates. However, their focus was on estimating the average effect size before selection for significance, an aim different from ours of estimating mean power after selection for significance. Moreover, in their simulation studies the ML model benefited from the fact that it assumed a normal distribution of effect sizes, and this was the distribution of effect sizes in the simulation study. In our simulation studies, the ML model also performed very well when the simulated data met the model assumptions. However, estimates were biased when the model assumptions differed from the effect size distribution in the data.
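The averaging idea behind the new p-uniform estimator can be sketched for a simple z-test formulation (our own illustration of the principle, not van Aert et al.'s implementation): rescale each significant z-statistic to its conditional p-value under a candidate effect size, and solve for the effect size that makes the mean conditional p-value equal to 0.50.

```python
# Sketch of the mean-p idea for a z-test formulation. Under the true
# effect, the conditional p-values of significant results are uniform,
# so their mean is 0.5; we solve for the effect with that property.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

C = 1.96

def mean_conditional_p(mu, z):
    """Mean of P(Z > z | Z > C) under effect mu, over the observed z."""
    return float(np.mean(norm.sf(z - mu) / norm.sf(C - mu)))

# Simulate significant z-statistics under a known true effect
# (truncated normal via inverse-CDF sampling).
rng = np.random.default_rng(3)
true_mu = 2.8
u = rng.uniform(norm.cdf(C - true_mu), 1.0, size=5_000)
z = true_mu + norm.ppf(u)

# Solve for the effect size whose mean conditional p-value is 0.5.
mu_hat = brentq(lambda mu: mean_conditional_p(mu, z) - 0.5, 0.0, 6.0)
```

The mean conditional p-value is monotone in the candidate effect, so the root is unique and simple bracketing suffices.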
Hedges and Vevea (1996) also found that their ML model is sensitive to the actual distribution of population effect sizes, which is unknown. The main advantage of z-curve over ML models is that it does not make any distributional assumptions about the data. However, this advantage is limited to the estimation of mean power. Whether it is possible to develop finite mixture models without distributional assumptions for the estimation of the mean effect size after selection for significance remains to be examined.

Future directions

One concern about z-curve is its suboptimal performance when effect sizes are fixed. However, an improved z-curve method may be able to produce better estimates in this scenario as well. As most studies are likely to have some heterogeneity, we recommend z-curve as the default method for estimating mean power. Another issue is the performance of z-curve when researchers have used questionable research practices (John et al., 2012). One questionable research practice is to include multiple dependent variables and to report only those that produced a significant result. This practice is no different from researchers running multiple exact replication studies with the same dependent variable and reporting only the studies that produced significant results for the selected DV. The probability that this result is selected is the true power of the study with the chosen DV, and the probability that this finding will replicate equals the true power for the chosen DV. Power can vary across DVs, but the power of the DVs that were discarded is irrelevant. Things become more complicated, however, if multiple DVs are selected, or if only the strongest result is selected among several significant DVs (van Aert et al., 2016). Some questionable research practices may cause z-curve to underestimate mean power.
For example, researchers who conduct studies with moderate power may deal with marginally significant results by removing a few outliers to obtain a just-significant result (John et al., 2012). This would create a pile of z-scores close to the critical value, leading z-curve to underestimate mean power. We recommend inspecting the z-curve plot for this QRP, which should produce a spike in z-scores just above 1.96. Another issue is that studies may use different significance thresholds. Although most studies use p < .05 (two-tailed) as the criterion, some studies use more stringent criteria, for example to correct for multiple comparisons. Including these results would lead to an overestimation of mean power, just as using p < .05, one-tailed, as a criterion would lead to overestimation, because most studies used the more stringent two-tailed criterion to select for significance. One solution would be to exclude studies that did not use alpha = .05, or to run separate analyses for sets of studies with different criteria for significance. However, such results are currently so rare that they have no practical consequences for mean power estimates.

Conclusion

Although this article is the seminal introduction of z-curve, we have been writing about z-curve and its applications since 2015 on social media. Thus, there has already been peer-reviewed criticism of our aims and methods before we were able to publish the method itself. We would like to take this opportunity to correct some of these criticisms and to ask future critics to base their criticism on this article. De Boeck and Jeon (2018) claim that estimation methods for mean power are problematic because they "aim at rather precise replicability inferences based on other not always precise inferences, without knowing the true values of the effect size and whether the effect is fixed or varies" (p. 769).
Contrary to this claim, our simulations show that z-curve can provide precise estimates of replicability, that is, of the success rate in a set of exact replication studies, without information about population effect sizes. To do so, only test statistics or exact p-values are needed; if the relevant statistical information (e.g., means, SDs, and N) is not reported, an article does not contain quantitative information. We hope that researchers will use z-curve (https://osf.io/w8nq4) to estimate mean power when they conduct meta-analyses. Hopefully, the reporting of mean power will help researchers pay more attention to power when they plan future studies, and we might finally see an increase in statistical power, more than 50 years after Cohen (1962) pointed out the importance of power for good psychological science. More awareness of the actual power in psychological science could also be beneficial for grant applications, to fund research projects properly and to reduce the need for questionable research practices that boost power by inflating the risk of Type I errors. Thus, we hope that the estimation of mean power serves the most important goal in science, namely to reduce errors. Conducting studies with adequate power reduces Type II errors (false negatives), and in the presence of selection bias it also reduces Type I errors. The downside appears to be that fewer studies would be published, but underpowered studies selected for significance do not provide sound empirical evidence. Maybe reducing the number of published studies would be beneficial, or, to paraphrase Cohen (1990), "less is more, except for statistical power".

Author contributions

Most of the ideas in this paper were developed jointly. An exception is the z-curve method, which is solely due to Schimmack. Brunner is responsible for the theorems.

Acknowledgements

We would like to thank Dr.
Jeffrey Graham for providing remote access to the computers in the psychology laboratory at the University of Toronto Mississauga. Thanks to Josef Duchesne for technical advice.

Conflict of Interest and Funding

No conflict of interest to report. This work was not supported by a specific grant.

Contact Information

Correspondence regarding this article should be sent to: brunner@utstat.toronto.edu

Open Science Practices

This article earned the Open Materials badge for making the materials openly available. Preregistration and data badges are not applicable for this type of research. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28, 640–646.
Bartlett, T. (2018). I want to burn things to the ground. Retrieved May 30, 2019, from https://www.chronicle.com/article/i-want-to-burn-things-to/244488
Becker, R. A., Chambers, J. M., & Wilks, A. R. (1988). The new S language: A programming environment for data analysis and graphics. Pacific Grove, California: Wadsworth & Brooks/Cole.
Boos, D. D., & Stefanski, L. A. (2012). P-value precision and reproducibility. The American Statistician, 65, 213–221.
Brunner, J. (2018). An even better p-curve. Retrieved May 30, 2019, from https://replicationindex.wordpress.com/2018/05/10/an-even-better-p-curve
Bunge, M. (1998). Philosophy of science. New Brunswick, N.J.: Transaction.
Chase, L. J., & Chase, R. B. (1976). Statistical power analysis of applied psychological research. Journal of Applied Psychology, 61, 234–237.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, New Jersey: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
De Boeck, P., & Jeon, M. (2018). Perceived crisis and reforms: Issues, explanations, and remedies. Psychological Bulletin, 144, 757–777.
Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175–183.
Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate applications. New York: Routledge.
Hedges, L. V., & Vevea, J. L. (1996). Estimating effect size under publication bias: Small sample properties and robustness of a random effects selection model. Journal of Educational and Behavioral Statistics, 21, 299–332.
Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 19–24.
Ioannidis, J. P. (2008). Why most discovered true associations are inflated. Epidemiology, 19(5), 640–646.
John, L.
K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 517–523.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions (2nd ed.). New York: Wiley.
McShane, B. M., Böckenholt, U., & Hansen, K. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11, 730–749.
Morewedge, C. K., Gilbert, D., & Wilson, T. D. (2014). Reply to Francis. Retrieved June 7, 2019, from https://www.semanticscholar.org/paper/reply-to-francis-morewedge-gilbert/019dae0b9cbb3904a671bfb5b2a25521b69ff2cc
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society, Series A, 231, 289–337.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Popper, K. R. (1959). The logic of scientific discovery. London, England: Hutchinson.
R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.r-project.org/
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638–641.
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.
Schimmack, U. (2015). Post-hoc power curves: Estimating the typical power of statistical tests (t, F) in Psychological Science and Journal of Experimental Social Psychology. Retrieved May 30, 2019, from https://replicationindex.com/2015/06/27/232/
Schimmack, U. (2018a). An introduction to z-curve: A method for estimating mean power after selection for significance (replicability).
Retrieved May 30, 2019, from https://replicationindex.com/2018/10/19/an-introduction-to-z-curve
Schimmack, U. (2018b). Replicability rankings. Retrieved May 30, 2019, from https://replicationindex.com/2018/12/29/2018-replicability-rankings
Schimmack, U., & Brunner, J. (2016). How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies. Retrieved May 30, 2019, from http://www.utstat.toronto.edu/~brunner/papers/howreplicable.pdf
Schimmack, U., & Brunner, J. (2017). Z-curve: A method for the estimation of replicability. Manuscript rejected from AMPPS. Retrieved May 30, 2019, from https://replicationindex.wordpress.com/2017/11/16/preprint-z-curve-a-method-for-the-estimating-replicability-based-on-test-statistics-in-original-studies-schimmack-brunner-2017
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316.
Silverman, B. W. (1986). Density estimation. London: Chapman & Hall.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). P-curve: A key to the file drawer. Journal of Experimental Psychology: General, 143, 534–547.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2017). P-curve app 4.06. Retrieved May 30, 2019, from http://www.p-curve.com
Sterling, T. D. (1959). Publication decision and the possible effects on inferences drawn from tests of significance – or vice versa. Journal of the American Statistical Association, 54, 30–34.
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.
Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A., & Williams, R. M., Jr. (1949). The American soldier, Vol. 1: Adjustment during army life. Princeton: Princeton University Press.
Stuart, A., & Ord, J. K. (1999). Kendall's advanced theory of statistics, Vol. 2: Classical inference & the linear model (5th ed.). New York: Oxford University Press.
van Aert, R. C. M., Wicherts, J. M., & van Assen, M. A. L. M. (2016). Conducting meta-analyses based on p values: Reservations and recommendations for applying p-uniform and p-curve. Perspectives on Psychological Science, 11, 713–729.
van Assen, M. A. L. M., van Aert, R. C. M., & Wicherts, J. M. (2014). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20, 293–309.
Yuan, K. H., & Maxwell, S. (2005). On the post hoc power in testing mean differences. Journal of Educational and Behavioral Statistics, 30, 141–167.

Appendix: Proofs of the Theorems, with an Example

We present proofs of six theorems about the relationship between power and the outcome of replication studies.
The first two theorems are assumptions of z-curve. The other four theorems are theoretically interesting, very useful for simulation studies, and can be used to further develop z-curve in the future. The theorems are also illustrated with a numerical example. Consider a population of F-tests with 3 and 26 degrees of freedom, and varying true power values. Variation in power comes from variation in the non-centrality parameter, which is sampled from a chi-squared distribution with degrees of freedom chosen so that population mean power is very close to 0.80. Denoting a randomly selected power value by G and the non-centrality parameter by λ, population mean power is

E(G) = ∫_0^∞ [1 − pf(c, ncp = λ)] dchisq(λ) dλ

To verify the numerical value of expected power for the example,

> alpha = 0.05; criticalvalue = qf(1-alpha,3,26)
> fun = function(ncp,df)
+   (1 - pf(criticalvalue,df1=3,df2=26,ncp))*dchisq(ncp,df)
> integrate(fun,0,Inf,df=14.36826)
0.8000001 with absolute error < 5.9e-06

The strange fractional degrees of freedom were located using the R function uniroot, numerically minimizing the absolute difference between the output of integrate and the value 0.8 over the degrees-of-freedom value. The minimum occurred at 14.36826.

Theorem 1. Population mean true power equals the overall probability of a significant result.

Proof. Suppose that the distribution of true power is discrete. Again denoting a randomly chosen power value by G, the probability of rejecting the null hypothesis is

Pr{T > c} = Σ_g Pr{T > c | G = g} Pr{G = g} = Σ_g g Pr{G = g} = E(G),   (9)

which is population mean power. If the distribution of power is continuous with probability density function f_G(g), the calculation is

Pr{T > c} = ∫_0^1 Pr{T > c | G = g} f_G(g) dg = ∫_0^1 g f_G(g) dg = E(G). □

Continuing with the numerical example, we first sample one million non-centrality parameter values from the chi-squared distribution that yields an expected power of 80%.
These values are in the vector ncp. We then calculate the corresponding power values, placing them in the vector power. Next, we generate one million random F statistics from non-central F distributions, using the non-centrality parameter values in ncp. In the R output below, observe that mean power is very close to the proportion of F statistics exceeding the critical value. This illustrates Theorem 1 for the distribution of power before selection. Needless to say, Theorem 1 applies both before and after selection.

> popsize = 1000000; set.seed(9999)
> ncp = rchisq(popsize,df=14.36826)
> power = 1 - pf(criticalvalue,df1=3,df2=26,ncp)
> mean(power)
[1] 0.8002137
> fstat = rf(popsize,df1=3,df2=26,ncp)
> sigf = subset(fstat,fstat>criticalvalue)
> length(sigf)/popsize # proportion significant
[1] 0.800177

To show how Theorem 1 applies to the distribution of power after selection, the sub-population of power values corresponding to significant results is stored in sigpower. The tests that were significant are repeated (with the same non-centrality parameters), and the test statistics placed in fstat2. The proportion of test statistics in fstat2 that are significant is very close to the mean of sigpower. This gives empirical support to the statement that population mean power after selection for significance equals the probability of obtaining a significant result again.

> sigpower = subset(power,fstat>criticalvalue)
> mean(sigpower) # mean power after selection
[1] 0.8274357
> # Replicate the tests that were significant.
> signcp = subset(ncp,fstat>criticalvalue)
> fstat2 = rf(length(sigf),df1=3,df2=26,ncp=signcp)
> # proportion of replications significant
> length(subset(fstat2,fstat2>criticalvalue)) /
+   length(sigf)
[1] 0.827172

Theorem 2. The effect of selection for significance is to multiply the probability of each power value by a quantity equal to the power value itself, divided by population mean power before selection.
If the distribution of power is continuous, this statement applies to the probability density function.

Proof. Suppose the distribution of power is discrete. Using Bayes' theorem,

Pr{G = g | T > c} = Pr{T > c | G = g} Pr{G = g} / Pr{T > c} = g Pr{G = g} / E(G).   (10)

If the distribution of power is continuous with density f_G(g),

Pr{G ≤ g | T > c} = Pr{G ≤ g, T > c} / Pr{T > c} = ∫_0^g Pr{T > c | G = x} f_G(x) dx / E(G) = ∫_0^g x f_G(x) dx / E(G).

By the fundamental theorem of calculus, the conditional density of power given significance is

(d/dg) Pr{G ≤ g | T > c} = g f_G(g) / E(G). □   (11)

For the numerical example we are pursuing by simulation, the density function of power before selection is a technical challenge and we will not attempt it. As a substitute, suppose that power before selection follows a beta distribution, a very flexible family on the interval from zero to one (Johnson et al., 1995). If power before selection (denoted by G) has a beta distribution with parameters α and β, Theorem 2 says that the density of power after selection (a function of the power value g) is

f(g | T > c) = [Γ(α+β) / (Γ(α)Γ(β))] g^(α−1) (1−g)^(β−1) (g / E(G))
            = [1 / (α/(α+β))] [Γ(α+β) / (Γ(α)Γ(β))] g^α (1−g)^(β−1)
            = [(α+β) Γ(α+β) / (α Γ(α) Γ(β))] g^(α+1−1) (1−g)^(β−1)
            = [Γ(α+1+β) / (Γ(α+1) Γ(β))] g^(α+1−1) (1−g)^(β−1),

which is again a beta density, this time with parameters α+1 and β. M.A.L.M. van Assen has pointed out the similarity of this result to conjugate prior-posterior updating in Bayesian statistics. Figure 5 shows how a beta with α = 2 and β = 4 is transformed into a beta with α = 3 and β = 4.

Theorem 3. Population mean power after selection for significance equals the population mean of squared power before selection, divided by the population mean of power before selection.

Proof. Suppose that the distribution of power is discrete. Then using (10),

E(G | T > c) = Σ_g g · g Pr{G = g} / E(G) = E(G²) / E(G).   (12)

[Figure 5. Beta density of power before and after selection.]
If the distribution of power is continuous, (11) is used to obtain

E(G | T > c) = ∫_0^1 g · g f_G(g) / E(G) dg = E(G²) / E(G). □   (13)

In the example, sigpower contains the sub-population of power values corresponding to significant results. Observe the verification of formula (13).

> # repeating ...
> sigpower = subset(power,fstat>criticalvalue)
> mean(sigpower)
[1] 0.8274357
> mean(power^2)/mean(power)
[1] 0.8275373

Theorem 4. Population mean power before selection equals one divided by the population mean of the reciprocal of power after selection.

Proof. Using formula (10),

E(1/G | T > c) = Σ_g (1/g) g Pr{G = g} / E(G) = (1/E(G)) Σ_g Pr{G = g} = (1/E(G)) · 1 = 1/E(G),

so that E(G) = 1 / E(1/G | T > c). A similar calculation applies in the continuous case. □

To illustrate Theorem 4, recall that the example was constructed so that mean power before selection was equal to 0.80.

> 1/mean(1/sigpower)
[1] 0.8000502

In the example, population mean power is 0.80, while population mean power given significance is roughly 0.83. It is reasonable that selecting significant tests would also tend to select higher power values on average, and in fact this intuition is correct. Since Var(G) = E(G²) − (E(G))² ≥ 0, we have E(G²) ≥ (E(G))², and hence E(G²)/E(G) ≥ E(G). Theorem 3 says E(G²)/E(G) = E(G | T > c), so that E(G | T > c) ≥ E(G). That is, population mean power given significance is greater than the mean power of the entire population, except in the homogeneous case where Var(G) = 0. The exact amount of increase has a compact and somewhat surprising form.

Theorem 5. The increase in population mean power due to selection for significance equals the population variance of power before selection divided by the population mean of power before selection.

Proof.

E(G | T > c) − E(G) = E(G²)/E(G) − E(G) = E(G²)/E(G) − (E(G))²/E(G) = Var(G)/E(G).
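The selection identities of Theorems 3, 4, and 5 can also be checked outside R. The sketch below is a Python re-implementation of the illustration, with one simplifying assumption: power is drawn directly from the Beta(2, 4) distribution of Figure 5 rather than built from the chi-squared non-centrality construction. Each "study" is then selected with probability equal to its power:

```python
import numpy as np

rng = np.random.default_rng(0)

# Power values before selection: Beta(2, 4), as in Figure 5.
power = rng.beta(2.0, 4.0, 1_000_000)

# A study is "significant" with probability equal to its power.
significant = rng.random(power.size) < power
sigpower = power[significant]

# Theorem 3: mean power after selection = E(G^2) / E(G).
print(sigpower.mean(), (power**2).mean() / power.mean())

# Theorem 4: mean power before selection = 1 / E(1/G | significant).
print(1.0 / (1.0 / sigpower).mean(), power.mean())

# Theorem 5: increase due to selection = Var(G) / E(G).
print(sigpower.mean() - power.mean(), power.var() / power.mean())
```

For Beta(2, 4), E(G) = 1/3 and E(G²)/E(G) = 3/7, the mean of the Beta(3, 4) density after selection, so the three printed pairs agree up to simulation error.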
□

Illustrating Theorem 5 for the ongoing example,

> mean(sigpower) - mean(power)
[1] 0.02722205
> var(power)/mean(power)
[1] 0.02732371

Theorem 6. The effect of selection for significance is to multiply the joint distribution of sample size and effect size before selection by power for that sample size and effect size, divided by population mean power before selection.

Proof. Note that power for a given sample size and effect size is P{T > c | X = es, N = n}. Suppose effect size is discrete. Then P{X = es, N = n | T > c} is

P{X = es, N = n, T > c} / P{T > c}
= P{T > c | X = es, N = n} P{X = es, N = n} / E(G)
= ( P{T > c | X = es, N = n} / E(G) ) P{X = es, N = n},

where E(G) is expected power before selection, equal to P{T > c} by Theorem 1. Suppose that effect size is continuous with density g(es). The joint distribution of sample size and effect size before selection is determined by P{N = n | X = es} g(es). The joint distribution after selection is determined by

P{N = n | X = es, T > c} g(es | T > c) = ( P{T > c | X = es, N = n} / E(G) ) P{N = n | X = es} g(es).

It is also possible to write the joint distribution of sample size and effect size as the conditional density of effect size given sample size, times the discrete probability of sample size. That is, the joint distribution before selection is determined by g(es | N = n) P{N = n}, and the joint distribution after selection is determined by

g(es | N = n, T > c) P{N = n | T > c}
= (d/des) P{X ≤ es | N = n, T > c} P{N = n | T > c}
= (d/des) [ P{X ≤ es, N = n, T > c} / P{N = n, T > c} ] · [ P{N = n, T > c} / P{T > c} ]
= (1/E(G)) (d/des) ∫_0^es P{T > c | X = y, N = n} g(y | N = n) P{N = n} dy
= P{T > c | X = es, N = n} g(es | N = n) P{N = n} / E(G)
= ( P{T > c | X = es, N = n} / E(G) ) g(es | N = n) P{N = n}. □   (14)

Theorem 6 cannot be illustrated for the ongoing numerical example, because the example employs a distribution of the non-centrality parameter, rather than of sample size and effect size jointly.
As a substitute, consider that an observed distribution of sample size after selection must imply a distribution of sample size in the unpublished studies before selection. If that distribution is too outlandish (for example, implying an enormous "file drawer" of pilot studies with tiny sample sizes), we may be forced to another model of the research and publication process. Theorem 6 allows one to solve for P{N = n}, the unconditional probability distribution of sample size before selection, though an estimated or hypothesized distribution of effect size given sample size before selection is needed. When sample size and effect size are deemed independent before selection, this is not a serious obstacle. Expression (14) says that g(es | N = n, T > c) P{N = n | T > c} is equal to

( P{T > c | X = es, N = n} / E(G) ) g(es | N = n) P{N = n},

so that, integrating both sides with respect to es,

∫ g(es | N = n, T > c) P{N = n | T > c} des
= P{N = n | T > c} ∫ g(es | N = n, T > c) des
= P{N = n | T > c} · 1
= ∫ ( P{T > c | X = es, N = n} / E(G) ) g(es | N = n) P{N = n} des
= ( P{N = n} / E(G) ) ∫ P{T > c | X = es, N = n} g(es | N = n) des,

and we have

P{N = n} = E(G) [ P{N = n | T > c} / ∫ P{T > c | X = es, N = n} g(es | N = n) des ].   (15)

The numerator of the fraction is the probability of observing a sample size of n after selection for significance. The denominator is expected power given that sample size, and could be calculated with R's integrate function. By Theorem 1, the quantity E(G) is both population mean power before selection and P{T > c}, the probability of randomly choosing a significant result from the population of tests before selection. In equation (15), though, it is just a proportionality constant. In practice, one obtains P{N = n} by calculating the fraction in brackets for each n, and then dividing by the total to obtain numbers that add to one.

Maximum Likelihood

Even though sample size is a random variable, the quantities n1, ..., nk are treated as fixed constants.
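The "divide by expected power, then normalize" recipe of equation (15) is easy to mechanize. The sketch below is a Python illustration with a made-up observed sample-size distribution and a hypothetical power model (a fixed effect size d = 0.5 in a two-sample t-test, so the inner integral collapses to a single power value per n); none of these values come from the paper:

```python
import numpy as np
from scipy import stats

# Hypothetical observed distribution of N after selection.
ns = np.array([20, 40, 80, 160])
p_n_given_sig = np.array([0.10, 0.20, 0.30, 0.40])

# Expected power given N = n for a two-sample t-test with a
# fixed standardized effect size d (two-tailed, alpha = .05).
def expected_power(n, d=0.5, alpha=0.05):
    df = 2 * n - 2
    crit = stats.t.ppf(1 - alpha / 2, df)
    ncp = d * np.sqrt(n / 2)
    return stats.nct.sf(crit, df, ncp) + stats.nct.cdf(-crit, df, ncp)

# Equation (15): divide by expected power, then normalize.
raw = p_n_given_sig / expected_power(ns)
p_n_before = raw / raw.sum()
print(p_n_before)  # shifts mass toward small n relative to p_n_given_sig
```

Small-n studies have low power, so dividing by power inflates their share: the implied file drawer is dominated by small studies, exactly the kind of implication the text suggests checking for plausibility.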
This is similar to the way that x values in normal regression and logistic regression are treated as fixed constants in the development of the theory, even though clearly they are often random variables in practice. Making the estimation conditional on the observed values n1, ..., nk allows it to be distribution free with respect to sample size, just as regression and logistic regression are distribution free with respect to x. This is preferable to adopting parametric assumptions about the joint distribution of sample size and effect size. Suppose there is heterogeneity in both sample size and effect size, and that effect size is continuous. The likelihood function given significance is a product of conditional densities evaluated at the observed values of the test statistics. Each term is the conditional density of the test statistic given both the sample size and the event that the test statistic exceeds its respective critical value. The joint probability distribution of sample size and effect size before selection is determined by the marginal distribution of sample size P{N = n} and the conditional density of effect size given sample size g_θ(es | n), where θ is a vector of unknown parameters.
Denoting the random effect size by X, the conditional density of an observed test statistic t given significance and a particular sample size n is

(d/dt) P{T ≤ t | T > c, N = n}
= (d/dt) P{T ≤ t, T > c, N = n} / P{T > c, N = n}
= (d/dt) P{c < T ≤ t | N = n} P{N = n} / ( P{T > c | N = n} P{N = n} )
= (d/dt) P{c < T ≤ t | N = n} / P{T > c | N = n}
= (d/dt) ∫_0^∞ P{c < T ≤ t | N = n, X = es} g_θ(es | n) des / ∫_0^∞ P{T > c | N = n, X = es} g_θ(es | n) des
= (d/dt) ∫_0^∞ [ P(t, f1(n) f2(es)) − P(c, f1(n) f2(es)) ] g_θ(es | n) des / ∫_0^∞ [ 1 − P(c, f1(n) f2(es)) ] g_θ(es | n) des
= ∫_0^∞ (d/dt) P(t, f1(n) f2(es)) g_θ(es | n) des / ∫_0^∞ [ 1 − P(c, f1(n) f2(es)) ] g_θ(es | n) des
= ∫_0^∞ d(t, f1(n) f2(es)) g_θ(es | n) des / ∫_0^∞ [ 1 − P(c, f1(n) f2(es)) ] g_θ(es | n) des,

where moving the derivative through the integral sign is justified by dominated convergence. The likelihood function is a product of k such terms. In the main paper, the simplifying assumption that sample size and effect size are independent before selection means that g_θ(es | n) is replaced by g_θ(es), yielding expression (3). In the problem of estimating power under heterogeneity in effect size, the unknown parameter is the vector θ in the density of effect size. Let θ̂ denote the maximum likelihood estimate of θ. This yields a maximum likelihood estimate of the true power of each individual test in the sample, and then the estimates are averaged to obtain an estimate of mean power. We now give details. Consider randomly sampling a single test from the population of tests that were significant the first time they were carried out. Let T1 denote the value of the test statistic the first time a hypothesis is tested, and let T2 denote the value of the test statistic the second time that particular hypothesis is tested, under exact repetition of the experiment. Conditionally on fixed values of sample size n and effect size es, T1 and T2 are independent.
By Theorem 1, population mean power after selection is

P{T2 > c | T1 > c} = Σ_n P{T2 > c | T1 > c, N = n} P{N = n | T1 > c}.   (16)

This is the expression we seek to estimate. Applying Theorem 3 to the sub-population of tests based on a sample of size n,

P{T2 > c | T1 > c, N = n} = E(G² | N = n) / E(G | N = n)
= ∫_0^∞ [ 1 − P(c, f1(n) f2(es)) ]² g_θ(es | n) des / ∫_0^∞ [ 1 − P(c, f1(n) f2(es)) ] g_θ(es | n) des.   (17)

Substituting (17) into (16) yields

P{T2 > c | T1 > c} = Σ_n { ∫_0^∞ [ 1 − P(c, f1(n) f2(es)) ]² g_θ(es | n) des / ∫_0^∞ [ 1 − P(c, f1(n) f2(es)) ] g_θ(es | n) des } P{N = n | T1 > c}.   (18)

Expression (18) has two unknown quantities: the parameter θ of the effect size distribution, and P{N = n | T1 > c}. For the former quantity, we use the maximum likelihood estimate, while the P{N = n | T1 > c} values are estimated by the empirical relative frequencies of sample size, which is the non-parametric maximum likelihood estimate. The result is a maximum likelihood estimate of population power given significance:

(1/k) Σ_{j=1}^k ∫_0^∞ [ 1 − P(c_j, f1(n_j) f2(es)) ]² g_θ̂(es | n_j) des / ∫_0^∞ [ 1 − P(c_j, f1(n_j) f2(es)) ] g_θ̂(es | n_j) des.

In the simulations, the density g of effect size is assumed gamma, there is no dependence on n, and the parameter θ is the pair (a, b) that parameterizes the gamma distribution.

Simulation

Direct simulation from the distribution of the test statistic given significance. To study the behaviour of an estimation method under selection for significance, it is natural to simulate test statistics from the distribution that applies before selection, and then discard the ones that are not significant. But if one can simulate from the joint distribution of sample size and effect size after selection, the wasteful discarding of non-significant test statistics can be avoided. The idea is to do the simulation in two stages. First, simulate pairs from the joint distribution of sample size and effect size after selection, and calculate a non-centrality parameter as ncp = f1(n) f2(es).
Then, using that ncp value, simulate from the distribution of the test statistic given significance. We will now show how to do the second step. It is well known that if F(t) is the cumulative distribution function of a continuous random variable and U is uniformly distributed on the interval from zero to one, then the random variable T = F⁻¹(U) has cumulative distribution function F(t). In this case the cumulative distribution function from which we wish to simulate is

P{T ≤ t | T > c, X = es, N = n}
= P{T ≤ t, T > c | X = es, N = n} / P{T > c | X = es, N = n}
= P{c < T ≤ t | X = es, N = n} / P{T > c | X = es, N = n}
= [ P(t, ncp) − P(c, ncp) ] / [ 1 − P(c, ncp) ]

for t > c, where as usual ncp = f1(n) f2(es). To obtain the inverse, set u equal to the probability and solve for t, as follows. Denoting the power of the test by γ = 1 − P(c, ncp),

u = [ P(t, ncp) − P(c, ncp) ] / [ 1 − P(c, ncp) ]
⇔ u (1 − P(c, ncp)) = P(t, ncp) − P(c, ncp)
⇔ P(t, ncp) = u (1 − P(c, ncp)) + P(c, ncp)
⇔ P(t, ncp) = γu + 1 − γ
⇔ t = Q(γu + 1 − γ, ncp).

Accordingly, let U be a uniform (0,1) random variable. The significant test statistic is

t = Q(γu + 1 − γ, ncp) = Q(1 + γ(u − 1), ncp) = Q(1 − γ(1 − u), ncp).

Since 1 − U also has a uniform (0,1) distribution, one may proceed as follows. For a given sample size and effect size, first calculate the non-centrality parameter ncp = f1(n) f2(es), and use that to compute the power value γ = 1 − P(c, ncp). Then calculate the significant test statistic

t = Q(1 − γu, ncp),   (19)

where u is a pseudo-random variate from a uniform (0,1) distribution. In R, the process can be applied to a vector of ncp values and a vector of independent u values of the same length. Again, this is the second step. The first step is to simulate a collection of ncp values using the desired joint distribution of sample size and effect size after selection for significance.
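Equation (19) vectorizes directly. The sketch below is a Python translation using scipy's non-central F distribution; the F(3, 26) setup and the chi-squared ncp distribution mirror the running example, but any vector of ncp values would do. Every draw is significant by construction, so nothing is discarded:

```python
import numpy as np
from scipy import stats

dfn, dfd, alpha = 3, 26, 0.05
crit = stats.f.ppf(1 - alpha, dfn, dfd)  # critical value c

def rsignificant(ncp, rng):
    """Draw one significant F statistic per ncp value via
    equation (19): t = Q(1 - gamma*u, ncp), where Q is the
    non-central F quantile function and gamma is power."""
    ncp = np.asarray(ncp, dtype=float)
    gamma = stats.ncf.sf(crit, dfn, dfd, ncp)  # power = 1 - P(c, ncp)
    u = rng.random(ncp.shape)
    return stats.ncf.ppf(1.0 - gamma * u, dfn, dfd, ncp)

rng = np.random.default_rng(9999)
ncp = rng.chisquare(14.36826, size=5_000)
t = rsignificant(ncp, rng)
print(t.min() > crit)  # every draw exceeds the critical value
```

Because 1 − γu lies between 1 − γ and 1, and Q(1 − γ, ncp) = c, the quantile function maps every uniform draw to a value above the critical value, which is exactly the point of the two-stage scheme.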
Naturally, simulation is easiest if sample size and effect size come from well-known distributions with built-in random number generation, and if sample size and effect size are specified to be independent after selection. In one of our simulations, sample size and effect size after selection were correlated. The next section describes how this was done.

Correlated sample size and effect size. Let effect size X have density g_θ(es), where θ represents a vector of parameters for the distribution of effect size. Conditionally on X = es, let sample size be Poisson distributed with expected value exp(β0 + β1 es). This is standard Poisson regression. Simulation from the joint distribution is easy. One simply simulates an effect size es according to the density g, computes the Poisson parameter λ = exp(β0 + β1 es), and then samples a value n from a Poisson distribution with parameter λ. The challenge is to choose the parameters θ, β0 and β1 so that after selection, (a) the population mean power has a desired value, and at the same time (b) the population correlation between sample size and effect size has a desired value. Population mean power is

γ = ∫_0^∞ Σ_n [ 1 − P(c, f1(n) f2(es)) ] P{N = n | X = es} g_θ(es) des.

Given values of θ, β0 and β1, this expression can be calculated by numerical integration; recall that P{N = n | X = es} is a Poisson probability. The population correlation between sample size and effect size is

ρ = [ E(XN) − E(X) E(N) ] / [ SD(X) SD(N) ],

where SD(·) refers to the population standard deviation. The quantities E(X) and SD(X) are direct functions of θ. The standard deviation of sample size is SD(N) = sqrt( E(N²) − [E(N)]² ), where

E(N) = E( E[N | X] ) = ∫_0^∞ E[N | X = es] g_θ(es) des = ∫_0^∞ e^(β0 + β1 es) g_θ(es) des

and

E(N²) = E( E[N² | X] ) = E( Var(N | X) + E[N | X]² ) = ∫_0^∞ ( e^(β0 + β1 es) + e^(2β0 + 2β1 es) ) g_θ(es) des.

Finally,

E(XN) = ∫_0^∞ Σ_n es · n P{N = n | X = es} g_θ(es) des = ∫_0^∞ es E(N | X = es) g_θ(es) des = ∫_0^∞ es e^(β0 + β1 es) g_θ(es) des.
All these expected values can be calculated by numerical integration using R's integrate function, so that the correlation ρ can be evaluated for any set of θ, β0 and β1 values. In our simulation of correlated sample size and effect size, g_θ(es) was a beta density, re-parameterized so that θ = (µ, σ²) consisted of the mean µ and variance σ². Conditionally on effect size, sample size was Poisson distributed with expected value exp(β0 + β1 es). We set the variance of effect size σ² to a fixed value of 0.09, so that the standard deviation of effect size after selection was 0.30, a high value. Given any mean effect size µ and slope β1, the parameter β0 (the intercept of the Poisson regression) was adjusted so that expected sample size at the mean effect size was equal to 86: β0 = ln(86) − β1 µ. With these constraints, the population mean power γ and correlation ρ were a function of the two free parameters µ and β1. Let γ0 be a desired value of mean power; for example, γ0 = 0.5. Let ρ0 be a desired value of the correlation between sample size and effect size; for example, ρ0 = −0.8. Values of µ and β1 were located by numerically minimizing the function f(µ, β1) = |γ − γ0| + |ρ − ρ0|. We used R's optim function.
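The sampling scheme itself is only a few lines. The sketch below is a Python illustration with arbitrary parameter values, not the calibrated µ and β1 from the simulations; it draws correlated (effect size, sample size) pairs via the Poisson-regression construction and confirms the sign of the induced correlation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Arbitrary illustration values (not the calibrated ones).
mu, sigma2 = 0.4, 0.09           # mean and variance of effect size
beta1 = -3.0                     # negative slope: bigger effects, smaller n
beta0 = np.log(86) - beta1 * mu  # expected n is 86 at the mean effect size

# Re-parameterize Beta(a, b) from mean mu and variance sigma2.
common = mu * (1 - mu) / sigma2 - 1
a, b = mu * common, (1 - mu) * common

es = rng.beta(a, b, 100_000)                 # effect sizes
n = rng.poisson(np.exp(beta0 + beta1 * es))  # Poisson regression draw
print(np.corrcoef(es, n)[0, 1])  # negative, as beta1 < 0 implies
```

The beta re-parameterization uses a + b = µ(1 − µ)/σ² − 1, so the chosen (µ, σ²) pair must satisfy σ² < µ(1 − µ) for the shape parameters to be positive.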
meta-psychology, 2019, vol 3, mp.2017.840, https://doi.org/10.15626/mp.2017.840 article type: original article published under the cc-by4.0 license open data: yes open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: rickard carlsson reviewed by: katherine s. corker, donald r. williams, stephen r. martin, david manheim analysis reproduced by: jack davis all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/c4wn8 cumulative science via bayesian posterior passing: an introduction charlotte o. brand university of exeter, penryn campus, college of life and environmental science james p. ounsley university of st andrews, school of biology daniel j. van der post university of st andrews, school of biology thomas j.h. morgan arizona state university, school of human evolution and social change this paper introduces a statistical technique known as "posterior passing" in which the results of past studies can be used to inform the analyses carried out by subsequent studies.
we first describe the technique in detail and show how it can be implemented by individual researchers on an experiment by experiment basis. we then use a simulation to explore its success in identifying true parameter values compared to current statistical norms (anovas and glmms). we find that posterior passing allows the true effect in the population to be found with greater accuracy and consistency than the other analysis types considered. furthermore, posterior passing performs almost identically to a data analysis in which all data from all simulated studies are combined and analysed as one dataset. on this basis, we suggest that posterior passing is a viable means of implementing cumulative science. furthermore, because it prevents the accumulation of large bodies of conflicting literature, it alleviates the need for traditional meta-analyses. instead, posterior passing cumulatively and collaboratively provides clarity in real time as each new study is produced and is thus a strong candidate for a new, cumulative approach to scientific analyses and publishing. keywords: bayesian statistics, metascience, cumulative science, replication crisis, stereotype threat, psychological methods the past two decades have seen a great increase in the study of the scientific process itself, a field dubbed 'metascience' (munafo et al. 2017). the growth of this field has been driven by a series of results that question the reliability of science as it is currently practiced. for instance, in 2015 a global collaboration of scientists failed to replicate 64 of 100 findings published in top psychology journals in 2008 (open science collaboration 2015). given this, many scientists have turned their efforts towards identifying potential improvements to the scientific process. a key focus within metascience is how to improve the use of statistical methods and the process of scientific publishing.
problems such as p-hacking, "harking" and the "file-drawer" effect have been discussed in science for many years with mixed opinions and widespread debate (e.g. bissell 2013; bohannon 2014; kahneman 2014; schnall 2014; fischer 2015; pulverer 2015). recent proposals such as guidelines against the misuse of p-values (wasserstein & lazar 2016), banning the p-value (trafimow & marks 2015), statistical checking software (epskamp & nuijten, 2016), redefining statistical significance (benjamin et al., 2018), justifying your alpha (lakens et al., 2018), pre-registering methods (chambers et al. 2014; van 't veer & giner-sorolla, 2016) and the open science movement generally (e.g. kidwell et al. 2016) are propagating discussion and endorsement of substantial changes to scientific publishing and research methods. one indication that current scientific practice is underperforming is the presence of large numbers of publications presenting conflicting conclusions about the same phenomenon. this is the case, for instance, in the 'stereotype threat' literature, in which experiments are designed to "activate" a negative stereotype in participants' minds, which leads to reduced performance in participants for whom the stereotype is relevant. a common question is whether being told that women typically perform worse than men at mathematical or spatial tasks depresses the performance of female participants on these tasks (e.g. flore & wicherts 2015). despite the seemingly straightforward nature of this question, and the publication of over 100 papers on this topic, there are no clear conclusions about the veracity of stereotype threat. traditionally, meta-analyses are conducted to clarify the existence of a purported effect in the literature.
this is also true of the stereotype threat literature, which includes seven meta-analyses (walton & cohen 2003; nguyen & ryan 2008; walton & spencer 2009; stoet & geary 2012; picho, rodriguez & finnie 2013; flore & wicherts 2014; doyle & voyer 2016). however, in many cases meta-analyses do not lead to increased certainty (ferguson 2014; lakens et al., 2017; lakens, hilgard, & staaks, 2016). one reason is that meta-analyses are not just used to determine whether an effect truly exists but can also be used to reveal the underlying causes of variation across studies. consequently, meta-analyses often differ in their interpretations of the literature, depending on the specific question that the authors are interested in. moreover, in some cases meta-analyses can actually increase uncertainty. for instance, they can uncover evidence of publication bias (as is the case with four of the stereotype threat meta-analyses), or researcher effects, both of which undermine the credibility of individual studies. more generally, the lack of objective inclusion criteria can render the conclusions of meta-analyses just as fraught as the results of individual studies. an example of this is seen in the 'cycle shift' debate, which asks whether women's mate preferences change over their ovulatory cycle. here, an initial meta-analysis argued against the existence of an effect, only to be followed by another meta-analysis that found the exact opposite to be the case (gildersleeve, haselton, & fales, 2014a, 2014b; wood, kressel, joshi, & louie, 2014). despite both being based on the aggregation of a large number of studies (many of which were included in both meta-analyses) one meta-analysis must be wrong. given these difficulties, meta-analyses do not provide an unambiguous solution for resolving conflicted literatures. meta-analyses are not the only way to mathematically combine the results of multiple studies.
many recent proposals fall within the category of "cumulative science", a process in which each study incorporates prior work into its analyses. for this reason, each study can be considered a meta-analysis of sorts, with its conclusions reflecting the data collected both in that study and in prior studies. as such, there is no need for traditional meta-analyses in a cumulative science framework. one example of cumulative science is the use of sequential bayes factors, which can be used to update the extent to which evidence is weighted in favour of the presence of an effect based on new data (schönbrodt, wagenmakers, zehetleitner, & perugini, 2017). similarly, 'curate science', and measures of replication success, have gained support (lebel, vanpaemel, cheung, & campbell, 2018; zwaan, etz, lucas, & donnellan, 2017). here, we describe and test another approach to cumulative science, "posterior passing", which is a straightforward extension of bayesian methods of data analysis. in what follows we first cover bayesian inference, which is the theoretical background of posterior passing. we then describe how posterior passing can be implemented in practice. finally, using the case study of stereotype threat mentioned above, we use a simulation to compare the ability of traditional analytic techniques and posterior passing to correctly identify effects of different sizes (including 0). we demonstrate that, given a number of studies representative of the stereotype threat literature, posterior passing provides an up-to-date, accurate estimation of the true population level effect without the need for a dedicated meta-analysis. conversely, traditional analytic techniques such as anovas used in "one-shot" analyses produced an abundance of conflicting effect size estimates, as is found in the stereotype threat literature at present.
furthermore, posterior passing produces almost identical results to a 'meta' glmm analysis in which all available data were combined and analysed as one dataset.

bayes rule and posterior passing

bayes' theorem (a.k.a. bayes' rule) is a method of assigning probabilities to hypotheses. given a set of competing hypotheses and our beliefs about how likely they are to be true, it provides us with the probability that each hypothesis is true when we collect more data. more formally this can be written as:

p(h|d) = p(h) p(d|h) / p(d)

where p(h|d) is the probability that each hypothesis is true taking the data into account (the "posterior"), p(h) is the probability of each hypothesis being true prior to collecting data (the "prior"), and p(d|h) is the probability that each hypothesis would have produced the observed data (the "likelihood"). the denominator, p(d), can be conceptualized as the probability of getting the data under any hypothesis, but in practice it acts as a normalizing constant to ensure that the posterior probabilities sum to 1. to illustrate the application of bayes' theorem we will now walk through a simple example based on a thought experiment used by the 17th-century statistician jacob bernoulli. other introductions to bayesian inference can be found elsewhere (van de schoot et al. 2014; morgan, laland & harris 2014; mcelreath 2016; kruschke 2011) and we encourage readers to seek these out. consider an urn containing a mix of blue and white pebbles and imagine we are interested in understanding what proportion of the pebbles are blue. to start with, we have two competing hypotheses: (1) 75% of the pebbles are blue, or (2) 75% of the pebbles are white (we assume that these are the only two possibilities). we will test these hypotheses by collecting data; three times we will draw a pebble from the urn, note its color, and replace it. before collecting data, let us note our prior beliefs (p(h) in the above equation).
without any knowledge we could assign each hypothesis equal prior probability (i.e. 50% in both cases) but let us imagine we have reason to suspect hypothesis 2 is more likely (perhaps we know blue pebbles are rare, or we know that the urn was filled at a factory that produces more white than blue pebbles, or maybe someone told us that they glanced inside the urn and it looked mostly white, etc.). given this we assign prior probabilities of 0.4 and 0.6 to the two hypotheses. now to data collection; let us assume we happen to draw three blue pebbles. we need to use this data to calculate the likelihood for each hypothesis, i.e. the probability of drawing three blue pebbles under each hypothesis (p(d|h) in the above equation). the probability of drawing a blue pebble three times is 0.75³ under hypothesis 1, and 0.25³ under hypothesis 2; this is 0.42 and 0.016 respectively. note that the likelihood is much higher for hypothesis 1; this means that the data are more consistent with hypothesis 1 than with hypothesis 2, and so we should expect bayes' theorem to shift the probabilities of each hypothesis in favor of hypothesis 1. the next step is to calculate the normalizing constant, p(d), which is the probability of getting the data under any hypothesis. it is the sum of the probability of getting the data under each hypothesis multiplied by the prior probability that each hypothesis is true, i.e. it is the sum of the likelihoods multiplied by the priors. in our case we only have two hypotheses, so p(d) is p(h1)p(d|h1) + p(h2)p(d|h2). we now have all the necessary parts to execute bayes' rule and we can calculate the probability that each hypothesis is true. the table below summarizes this process, showing that because the data were more consistent with hypothesis 1 it is now the more likely of the two hypotheses, even though it started with a lower prior probability. this example can also illustrate how bayes' theorem facilitates cumulative science.
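the arithmetic of the urn example can be checked directly; this sketch reproduces the numbers in the table (priors 0.4 and 0.6, three blue draws).

```python
# discrete bayes update for the two-hypothesis urn example:
# h1: 75% of pebbles are blue, h2: 25% are blue; data = three blue draws.
priors = [0.4, 0.6]
likelihoods = [0.75 ** 3, 0.25 ** 3]           # p(d|h) for each hypothesis

unnormalized = [p * l for p, l in zip(priors, likelihoods)]
normalizer = sum(unnormalized)                 # p(d), the normalizing constant
posteriors = [u / normalizer for u in unnormalized]
# posteriors come out near 0.95 and 0.05, matching the table
```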
assume someone else decides to draw more pebbles from the same urn. how can they include our data in their analyses? the solution is straightforward: they simply need to use our posterior as their prior. more generally, by using the posterior from a previous study as the prior in the next one, the posterior of the second study will reflect the data collected in both studies, forming a chain of studies each of which builds on the last to provide an increasingly precise understanding of the world. this method is referred to as "posterior passing" (beppu & griffiths 2009) and is the focus of this manuscript. if posterior passing is effectively implemented, it is mathematically equivalent to collecting all the data in a single high-power study (beppu & griffiths 2009). in both a theoretical analysis and a lab experiment, beppu and griffiths (2009) found that posterior passing led to successively better inferences over time. given this, posterior passing may offer a valuable addition to the scientific process, with particular benefits for fields suffering from ambiguous literatures or replication crises. the passing of posteriors across studies not only incorporates information from prior studies, but also prevents any single experimental dataset from carrying too much weight. in the next section we discuss how posterior passing can be implemented as part of the bayesian analysis of data.

posterior passing in practice

while the above example is much simpler than most scientific problems, it is relatively straightforward to generalize the theory to continuous hypothesis spaces as is characteristic of much scientific research. for instance, say we are hypothesizing about the value of a parameter in a model. rather than assigning prior probabilities to specific hypotheses (such as "the parameter is 2.5") we describe a probability density function across the range of possible values for the parameter.
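the equivalence between posterior passing and a single pooled analysis can be illustrated with a toy conjugate beta-binomial model (a hypothetical example for this continuous case, not the glmms used later in the paper): passing the posterior from study to study gives exactly the same final distribution as analysing all the data at once.

```python
# conjugate beta-binomial updating: with a beta(a, b) prior on a proportion
# and k successes in n trials, the posterior is beta(a + k, b + n - k).
def update(prior, k, n):
    a, b = prior
    return (a + k, b + n - k)

flat = (1, 1)                      # flat prior on the proportion of blue pebbles

# study 1 draws 3 blue pebbles in 3 draws; study 2 draws 10 blue in 20 draws
study1 = update(flat, 3, 3)        # posterior after study 1
study2 = update(study1, 10, 20)    # study 2 uses study 1's posterior as its prior

pooled = update(flat, 13, 23)      # all data analysed as one dataset
# study2 and pooled are the same beta distribution: posterior passing
# matches the single high-power analysis exactly in the conjugate case
```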
for instance, if we have reason to believe the parameter is close to 10, we might use a normal distribution with a mean of 10 and standard deviation of 1 (this permits any value from positive to negative infinity, but 10 is the single most likely value and 95% of the probability mass falls between 8 and 12). the likelihood, too, becomes a function over the hypothesis space and the normalizing constant is calculated the same way as before: the prior is multiplied by the likelihood and the resulting function is summed across the hypothesis space. the posterior, again calculated as the prior multiplied by the likelihood and divided by the normalizing constant, is also a probability density function, allocating posterior probability over the parameter space which can then be passed as the prior in subsequent studies. the application of bayes' rule to continuous hypothesis spaces (such as in parameter estimation) runs into problems, however, because it is often impossible to calculate the normalizing constant. the circumnavigation of this problem relied on the development of modern computers and new techniques such as markov chain monte carlo (mcmc) methods. the details of this technique are complicated (for accessible introductions to mcmc see mcelreath 2016; kruschke 2011), but it works by providing the user with a series of values (called "samples") that approximate values drawn from the posterior probability density function even though the exact density function itself remains unknown.

table 1. a summary of the example execution of bayes' theorem. the probability that each hypothesis is true, taking the data into account (the "posterior", column 5), is the prior (column 2) multiplied by the likelihood (column 3) and divided by a normalizing constant (the sum of column 4).

hypothesis | prior, p(h) | likelihood, p(d|h) | prior × likelihood, p(h)p(d|h) | posterior, p(h|d)
1          | 0.4         | 0.42               | 0.17                           | 0.95
2          | 0.6         | 0.016              | 0.009                          | 0.05
as the number of samples approaches infinity, statistical descriptions of the samples converge on the same values of the posterior probability density function itself. for instance, the mean of the samples approaches the mean of the posterior probability density function, and an interval that contains 95% of the samples will also contain 95% of the probability mass of the posterior distribution. so, even though the posterior probability density function technically remains unknown, we can nonetheless describe it in a variety of ways. in order to implement posterior passing we now need a means by which samples from the posterior can be translated into a probability density function that will be the prior in subsequent studies.

figure 1. the translation of a histogram of samples into a probability distribution. here the samples (black histogram of 50,000 samples) look somewhat normal, but they are all positive and the histogram is positively skewed. using the mean and variance to define a normal distribution (solid blue lines, scaled x10000) produces a reasonable fit, but the lower tail places non-zero probability density on negative values while the peak appears to be slightly higher than the peak of the histogram. using a gamma distribution instead (solid red line) produces a perfect fit. artificially inflating the variance (dashed lines) changes the distributions. in the case of the normal distribution it greatly widens it, placing an increasing amount of probability mass below 0. in the case of the gamma distribution the positive skew grows, but all the probability mass remains above 0. in this way the variance is increased but the mean of the probability distribution remains the same as the mean of the samples.
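the sample-based summaries described above can be computed directly; this sketch uses random gamma draws as a stand-in for mcmc output (the distribution and its parameters are illustrative, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for mcmc output: 50,000 draws from a hypothetical posterior
samples = rng.gamma(shape=4.0, scale=0.25, size=50_000)

post_mean = samples.mean()                    # approximates the posterior mean
lo, hi = np.percentile(samples, [2.5, 97.5])  # interval holding ~95% of the mass
inside = np.mean((samples >= lo) & (samples <= hi))   # ~0.95 by construction
```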
the simplest approach is to assume the posterior is normally distributed and then define the prior as a normal distribution with the same mean and standard deviation as the samples generated. so, if the mean of the samples is 5.3 and the standard deviation is 0.8, then this can be assumed to correspond to the probability distribution n(5.3, 0.8). however, assuming normality could lead to an inaccurate description of the posterior. to alleviate this concern researchers can inflate the standard deviation of the passed posterior, for instance changing n(5.3, 0.8) to n(5.3, 8). this will broaden the prior, effectively weakening its influence and so avoiding distorted results due to the posterior passing process. by the same token, however, broadening the prior will also lessen the influence of past research and potentially slow down scientific accumulation. a more nuanced approach is to build bespoke priors for each set of posterior samples. the researcher can choose a probability distribution that closely matches the shape of the posterior samples (e.g. normal, exponential, gamma or beta) and appropriate parameter values can be calculated from the samples: for instance, a normal distribution with the same mean and standard deviation as the samples, or a gamma distribution with shape and rate parameters calculated from the mean and variance of the samples. as before, if the researcher wishes to err on the side of caution by weakening the effect of the passed posterior on subsequent analyses, they simply need to inflate the variance of the distribution. figure 1 shows an example of this in action, illustrating that a highly suitable distribution can be derived in this manner. inflating the uncertainty in the prior also facilitates posterior passing in cases where there are differences in experimental design or analytic technique. even within a single area of research it is rare that any two studies are exactly the same.
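the moment matching and variance inflation described here can be sketched as follows; for a gamma distribution, the shape and rate are recovered from the sample mean m and variance v as shape = m²/v and rate = m/v (the stand-in samples and the 10x inflation factor are illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.gamma(shape=9.0, scale=0.5, size=50_000)  # stand-in posterior samples

m, v = samples.mean(), samples.var()

# normal approximation: match the mean and standard deviation of the samples
normal_prior = (m, np.sqrt(v))

# gamma approximation: shape and rate recovered from the mean and variance
shape, rate = m ** 2 / v, m / v

# variance inflation: same mean, 10x the variance, weakening the passed prior
shape_infl, rate_infl = m ** 2 / (10 * v), m / (10 * v)
# shape_infl / rate_infl equals m, so the mean is unchanged by inflation
```

for the n(5.3, 0.8) example in the text, the same move would keep the mean 5.3 and scale up only the spread.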
these differences mean that the posterior produced by one study may not be entirely appropriate as the prior in another. however, experiments do not have to be identical to engage in posterior passing: as long as they are addressing the same theoretical "effect" then there is reason to draw on previous knowledge. in these cases, the uncertainty in the posterior should be inflated to account for differences in experimental design. another possibility is that the prior could be based on a previous study that used non-bayesian methods. if even a point estimate for the effect size is given, then this can be used as the mean of the prior, with the variance set to a suitable value corresponding to the researcher's level of uncertainty. precisely how much a prior should be watered down in cases such as these will depend on the similarity of the studies in question, and discussion of this should be an important part of the peer-review and publication process. moreover, where concerns are raised, robustness analyses can be used in which the prior is varied and the resultant effect on the conclusions described and discussed. in the short term, it may be beneficial to compare results with and without posterior passing to show the difference in inference that results from either approach.

the simulation

in this section we present a simulation of the scientific process, testing the hypothesis that posterior passing will benefit science relative to other methods of data analysis and avoid the accumulation of large, ambiguous literatures. we simulate a series of experiments testing for an interaction between two variables. we vary (i) the true effect size of the interaction, (ii) the scale of between-individual differences and (iii) the statistical technique employed by scientists. the simulation is, in part, based on the stereotype threat literature, as this produced an ambiguous and conflicted literature, as discussed previously.
as such, we refer to the interacting variables as sex and condition, and various simulation parameters (e.g. the number of participants per study) are set to values representative of the stereotype threat literature. we use the simulation to compare four different analysis methods: an analysis of variance (anova), a generalised linear mixed model (glmm), a bayesian glmm using mcmc estimation (henceforth "bglmm") and a bayesian glmm using mcmc estimation and posterior passing (henceforth "pp"). anovas have been widely used in psychology for decades and still represent one of the most commonly used analytic approaches (including for studies of stereotype threat), despite suggestions of their inadequacy for many types of experimental design (jaeger 2008). a move towards using generalised linear mixed models for categorical and binomial data has been suggested as more appropriate than methods often used by psychologists and ecologists (jaeger 2008; bolker et al. 2009), and so we include both a frequentist glmm and a bayesian equivalent. finally, we include "posterior passing" (beppu & griffiths 2009) to examine whether implementing posterior passing as a form of cumulative knowledge updating would be beneficial. additionally, we performed a single bglmm analysis over all simulated datasets combined (henceforth "meta bglmm") in order to compare posterior passing against the best possible scenario of a single high-power study. each repeat of the simulation involved the following three steps: 1) a population of one million potential experimental subjects was created, 2) 60 sequential experiments were carried out, each involving 80 participants taking part in 25 experimental trials (numbers chosen as representative of the stereotype threat literature) and 3) the 60 datasets were analysed using the four different analysis methods. for each combination of parameter values, we carried out 20 repeat simulations.
further details are given below, and full model code is available at www.github.com/thomasmorgan/posterior-passing.

population creation

each of the 1,000,000 simulated participants is defined by two values: their sex (0 or 1, with half of the population having each value) and their performance at the experimental task relative to the population average (positive values indicate above average performance, and negative values below average performance). each participant's performance value was drawn randomly from a normal distribution with mean 0 and with variance that varied across simulations (from 0 to 1 in steps of 0.25). for each participant there was another participant of the same sex but with the opposite performance value, and another of the opposite sex, but with the same performance value. this ensured that the average performance value in the population was exactly 0 (equivalent to 50% success on a binary choice trial), and the variation in performance within each sex was equal.

data collection

datasets were generated by randomly selecting a sample of 80 individuals who were then split into a control group and an experimental group (20 of each sex in each group). each simulated participant was presented with 25 binary-choice trials and the number of trials they answered correctly was generated by sampling from a binomial distribution in which the likelihood of success per trial was:

p_i = logistic(performance_i + e × condition_i × sex_i)

where e is the unknown interaction effect that the simulated experiments are attempting to identify. in the context of stereotype threat, it can be considered as the magnitude of the effect of the stereotype threat condition on the behaviour of women (i.e. participants of sex 1). note that participants of sex 0 (i.e. men) are insensitive to condition, and condition 0 (the control condition) does not affect participant behaviour.
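the generative model above can be sketched directly (in python rather than the authors' r/jags code; the per-trial success probability follows the logistic equation in the text, and the sketch also evaluates the logistic at the effect sizes considered in the simulations).

```python
import numpy as np

rng = np.random.default_rng(3)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_study(e, perf_var=0.25, n_per_cell=20, n_trials=25):
    """one experiment: 2 (sex) x 2 (condition) cells, 20 participants each."""
    sex = np.repeat([0, 0, 1, 1], n_per_cell)              # 0 = men, 1 = women
    condition = np.tile(np.repeat([0, 1], n_per_cell), 2)  # 0 = control
    performance = rng.normal(0.0, np.sqrt(perf_var), size=sex.size)
    p = logistic(performance + e * condition * sex)  # per-trial success chance
    successes = rng.binomial(n_trials, p)            # correct answers out of 25
    return sex, condition, successes

sex, condition, successes = simulate_study(e=2.0)

# average success probabilities at performance 0 for e = 0 .. 2
probs = [round(float(logistic(e)), 2) for e in (0, 0.5, 1, 1.5, 2)]
```

only the women (sex 1) in the experimental condition have their success probability shifted by e; everyone else stays at the logistic of their own performance value.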
across simulations, we consider five different values for e (the magnitude of the effect in question): 0, 0.5, 1, 1.5 and 2. given that the average population performance was 0 (equivalent to a 50% chance of success per trial), these values increase the average probability of success from 0.5 to 0.5, 0.62, 0.73, 0.82 and 0.88 respectively. thus, the cases we explored range from no interaction effect up to a large effect, exceeding the effect sizes reported in meta-analyses of the stereotype threat literature (e.g. doyle & voyer 2016).

data analysis

we performed four methods of analysis on each simulated dataset. the first is the predominant method of analysis used in the stereotype threat literature: analysis of variance (2 x 2 anova). average success on the task (i.e. number of successes/number of trials) was subjected to a 2(sex) x 2(condition) anova that included a main effect of sex, a main effect of condition, and an interaction between sex and condition, at a significance value of p < 0.05. the second method is a generalised linear mixed model (glmm) which models number of successes as a binomially distributed variable and uses a logit link function. the same outcome and predictor variables are used as in the anova (i.e. a baseline effect, and effects of sex and condition as well as a sex*condition interaction), but a random effect for participant is implemented. this method fits parameters based on a maximum likelihood approach and estimates the linear effect that our manipulation and independent variables have on the log odds of success in any given trial. the third approach uses the same model formulation as the glmm, but uses bayesian mcmc methods to generate parameter estimates in jags. minimally informative priors (normal distributions with mean 0 and precision 0.01) were used for all parameters, and so we expect that the outcomes of this analysis should be extremely similar to those of the frequentist glmm.
the fourth approach is a bayesian glmm with "posterior passing" (beppu & griffiths 2009), in which the prior for e (the interaction effect) is based on the posterior from the most recent previous experiment. as a deliberately coarse implementation of posterior passing we assumed the posterior was normal and defined it solely by its mean and precision.

results

within each simulation, the general pattern was for posterior passing to converge on the true effect size while the other analysis types produced a series of independent results distributed stochastically around the true value with no convergence over time (see fig. 2a and 2b). five metrics were used to more thoroughly examine the performance of each analytic technique across simulations: 1) the average point estimate; 2) the true positive rate; 3) the false positive rate; 4) the average width of the 95% confidence/credible interval; and 5) the average difference between the effect estimate and the true value.

figure 2. analysis estimates produced from a single simulation of 60 experiments. the true effect sizes (displayed by the horizontal black line) are (a) 0 and (b) 2 (equivalent to an increase in the probability of responding correctly of 0.38). in both cases there is no individual variation. across experiments, the estimates produced by the anova, glmm and bglmm vary stochastically around the true population average, with the accuracy or certainty of each analysis unrelated to its position in the series. furthermore, in panel (b) the anova estimates are considerably less certain than those of the glmm or bglmm. in contrast to all other methods considered, posterior passing allows the analysis to become more accurate and more certain over time. note that a particularly skewed data set (data set 9, in panel a) prompts all analyses to find a positive result; despite this, posterior passing is nonetheless able to correct itself by the end of the simulation.
these were calculated for the analysis of each simulated data set, except in the case of posterior passing, where only the final analysis in each simulation was used. this is because, with posterior passing, information from each dataset is incorporated into subsequent analyses and so the final analysis contains information from all 60 datasets.

effect size estimates

all analysis types were generally effective at estimating the size of the effect (see fig. 3). however, as the between-individual variation increases, the anova underestimates the effect size to a modest extent.

true positive result rates

for all analyses, the ability to detect a positive effect increased with the true effect size (fig. 4). however, for the anova, glmm and bglmm, increasing individual variation decreased the true positive result rate. with posterior passing, there is no such effect; positive results are likely to be found whenever the true effect is non-zero.

figure 3. average effect size estimates from the five analysis types over 20 simulations of 60 datasets (bayesian glmm not shown as it is identical to the glmm). the true population average is on the x axis, and the individual variation on the y axis. colour corresponds to the effect size estimate according to the key to the right of the panels.

false positive result rates

when the true effect was 0, all analyses were unlikely to produce (false-)positive results but did occasionally do so (see fig. 4). across all datasets (n = 6000) the anova produced 304 false positives (5.1%, close to the expected false positive rate of 5%), the glmm 341 false positives (5.7%) and the bglmm 304 false positives (5.1%). for posterior passing (and the meta bglmm), we are concerned only with whether the final analysis in each series produced a false-positive result. over 100 simulations in which the true effect size was zero (i.e.
20 repeats of 5 different variance levels), posterior passing produced two false positives, while the meta bglmm produced one.

uncertainty

in general, the width of the 95% confidence/credible intervals (henceforth "uncertainty") decreases with the true effect size, but increases with individual variation (fig. 5). there are differences between analyses, however. the anova is much more sensitive to individual variation than to the true effect size, i.e. increasing the true effect size only modestly reduces uncertainty, while increasing individual variation greatly increases uncertainty. both the glmm and bglmm produce confident results provided either the true effect size is high or individual variation is low; however, if the effect size is small but variation high, then model estimates are highly uncertain. finally, while the uncertainty of both posterior passing and the combined bglmm is sensitive to the effect size and individual variation, it is only minimally so, and confidence is very high across all of the parameter space we explored. the meta bglmm, in which all data are analysed in a single analysis, performs almost identically to posterior passing.

figure 4. positive rate for the five analysis types over 20 simulations of 60 datasets (bayesian glmm not shown as it is identical to the glmm). the size of the true population average is on the x axis, and the individual variation on the y axis. colour gives the positive result rate, ranging from 0 to 1, according to the key to the right of the panels. an analysis finds a positive effect in the population if the upper and lower bounds for its 95% confidence/credible interval do not include zero; the proportion of analyses in which a positive effect is found is the positive result rate.
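the positive-result criterion described in the figure 4 caption (a 95% interval whose bounds exclude zero) can be sketched as follows. this is a normal-approximation illustration, not the paper's analysis code; the helper `excludes_zero`, the sample size of 30, and the 2000 repetitions are assumptions chosen for the sketch.

```python
import random
import statistics


def excludes_zero(sample):
    """True if the normal-approximation 95% CI for the sample mean
    excludes zero, i.e. the analysis 'finds a positive effect'."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - 1.96 * se > 0 or m + 1.96 * se < 0


random.seed(0)
# with a true effect of zero, the positive-result rate is the false
# positive rate, and should land near the nominal 5%
n_reps = 2000
hits = sum(
    excludes_zero([random.gauss(0.0, 1.0) for _ in range(30)])
    for _ in range(n_reps)
)
print(hits / n_reps)
```

run over many null datasets, the proportion of intervals excluding zero approximates the 5% false positive rate the paper reports for the anova and glmm.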
error

the average difference between the parameter estimates and the parameter's true value was very low across all of the parameter combinations we considered, except in the case of the anova (fig. 6). this is because the anova systematically underestimates the value of the parameter when the true population average is high and individual variation is high (see fig. 6).

discussion

this paper introduces posterior passing: a statistical technique, based on bayes' theorem, that uses the results of prior studies to inform future work. in this way it allows the operationalization of cumulative science, allowing individual studies to build on each other, avoiding conflicted literatures, and thereby reducing the need for dedicated meta-analyses. to test the performance of posterior passing, we conducted a simulation of datasets sampled from populations with varying effect sizes. different statistical techniques were used to analyse the datasets and compared to a posterior passing approach over the same datasets. we found that although no method was perfect (e.g. all methods produced a non-zero number of false-positive results), posterior passing leads to greater certainty over time about the existence and size of an effect compared with the other statistical methods considered. as such, this work supports the proposal that posterior passing is a viable means by which cumulative science can be implemented. one of the goals of this project was to test whether posterior passing could effectively identify the true value of an effect in a context where other analytic techniques have led to the build-up of an ambiguous literature, such as that concerning stereotype threat.

figure 5. analysis uncertainty measure for the five analysis types as a function of true population average and individual variation (bayesian glmm not shown as it is identical to the glmm). colour represents the uncertainty of each analysis as given by the key to the right of the panels.
such literatures are defined by a mix of positive and negative findings, and in practice they have remained ambiguous despite multiple meta-analyses. in our simulations, examination of the positive result rate shows that such ambiguity is common to all non-cumulative analyses when between-individual variation is high and when the effect size is small. nonetheless, posterior passing is highly successful at correctly identifying the effect size in these cases (only 2% of simulations produced a false-positive result). these findings have important implications for the way scientists conduct, analyse and publish their research. firstly, the use of anovas (the current norm in priming studies) is shown to be particularly problematic. in our results, the anova was the least accurate at identifying the effect size, especially when the effect size was small and the variation high. the priming literature is precisely a context in which researchers predict effect sizes to be small and individual variation to be high, as the mechanism underlying the effect is unknown and some individuals are expected to be more or less susceptible to the effect depending on various moderating variables (see bargh, 2012; gelman, 2016). therefore, the fact that anovas are less likely than other methods to accurately detect an effect in these types of datasets suggests that researchers studying priming effects (as well as other small, variable effects) should move away from anovas and on to other methods, as has been previously suggested (e.g. jaeger, 2008). a second implication of our results is that the benefit comes from posterior passing specifically, rather than from using bayesian methods per se.

figure 6. analysis estimate error as a function of true population average and individual variation (bayesian glmm not shown as it is identical to the glmm). colour gives the size of the error, ranging from -0.1 to 0.1, according to the key to the right of the panels.
with minimally informative priors, the bayesian glmm did not provide any detectable improvement in performance compared to the frequentist glmm. this was expected, because the important difference between the glmm and the bayesian glmm was the use of priors, and by using minimally informative priors we masked this difference. this might appear to suggest that there is little benefit to using bayesian methods over frequentist methods if priors are implemented uninformatively. however, other benefits exist that are not considered in our simulation. for instance, the philosophy of bayesian inference is arguably more intuitive than null hypothesis significance testing (mcelreath, 2016), with bayesian credible intervals more readily understood than frequently misinterpreted p-values and confidence intervals (belia, fidler, williams, & cumming, 2005; greenland et al., 2016). nonetheless, bearing these other benefits in mind, our simulation results clearly suggest that posterior passing is a major benefit of adopting a bayesian approach. reassuringly, even our deliberately coarse implementation of posterior passing (in which only the posterior for the interaction term was passed, and it was assumed to be normal) was highly successful. moreover, even when spurious results are present (e.g. fig. 2a, dataset 9), posterior passing rapidly reverts to the true population effect. as a measure of the success of posterior passing, we compared it to a single "meta" bayesian glmm conducted over all 60 datasets combined, as this is equivalent to the greatest possible performance achievable through posterior passing. according to all of our metrics for evaluating the performance of different analytic techniques, posterior passing was virtually indistinguishable from this meta bayesian glmm.
nonetheless, further work could measure the effect of more refined implementations of posterior passing (including passing all parameters), as this may accelerate the convergence of knowledge concerning the effects in question. posterior passing is not the only means to achieve cumulative science, however; as mentioned in the introduction, sequential bayes factors and curated replications hold similar promise. our results cannot speak to the efficacy of these methods relative to posterior passing; however, there are some key differences between the approaches. first, the use of bayes factors is not uncontroversial, and their application has been debated elsewhere (e.g. robert, 2016). one such argument is that bayes factors retain the "accept/reject" philosophy of null hypothesis significance testing, whereas other researchers have called for a shift towards more accurate parameter estimation and model comparison approaches (cumming, 2013; mcelreath, 2016). we agree with the sentiment of schönbrodt and colleagues (2017) that estimation and hypothesis testing answer different questions and have separate goals, reflected by a trade-off between accuracy and efficiency respectively. we argue that ultimately scientists should value both accuracy and efficiency, but not prioritise efficiency at the expense of accuracy. furthermore, posterior passing offers a means of achieving estimates that are both accurate and more efficient than those of the other analysis techniques included in our simulation, as posterior passing converges on the correct effect size within 10-15 analyses (rather than requiring the full 60 datasets). with regards to curated replications and calls for measures of replication success (lebel et al., 2018; zwaan et al., 2017), these approaches can be distinguished from posterior passing in that they formalize the process of replication to ensure the robustness of findings.
posterior passing, conversely, does away with the notion of replications, as studies build on each other rather than specifically testing the results of prior studies. despite its success in our simulation, posterior passing is unlikely to be a scientific cure-all. one factor identified as a problem in science, but not considered in our simulation, is publication bias (the increased likelihood of publishing positive findings compared to null findings). it is likely that the performance of posterior passing, along with the other analyses considered, will be negatively affected by publication bias. indeed, posterior passing may exacerbate the problem of publication bias if researchers only put forward their positive results to be part of a posterior passing framework. that said, if available data from multiple studies are put towards a cumulative analysis, regardless of their novelty, researchers may be more motivated to publish their null results, as well as replications, and so the implementation of posterior passing may reduce publication bias indirectly. given these uncertainties, it would be valuable for further work to ascertain how sensitive each analysis type is to various levels of publication bias. another assumption of our simulation is that all analyses are similar or comparable enough to use in a posterior passing framework. in actual scientific practice, however, scientists may struggle to use the results of one analysis to inform the next due to differences in experimental design or analytic model structure. moreover, even where a single model structure is agreed upon, this may systematically differ from reality, introducing bias into model estimates. further work is needed to explore the effects of this kind of mismatch on the performance of posterior passing. nonetheless, as previously discussed, posteriors can be watered down by increasing their variance, thereby lessening the effect of prior work on current findings.
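the "watering down" just mentioned can be sketched as inflating a posterior's variance (equivalently, dividing its precision) before reusing it as a prior. the helper `widen_posterior` and the inflation factor are illustrative assumptions, not part of the paper.

```python
def widen_posterior(mean, precision, inflation=2.0):
    """Weaken a normal posterior before passing it on as a prior by
    multiplying its variance by `inflation` (dividing its precision).
    The mean is left untouched; only the certainty is reduced."""
    if inflation < 1.0:
        raise ValueError("inflation must be >= 1 to weaken, not sharpen, the prior")
    return mean, precision / inflation


# a sharp posterior (precision 100, i.e. variance 0.01) becomes a
# weaker prior (precision 50, i.e. variance 0.02)
prior_mean, prior_precision = widen_posterior(1.5, 100.0, inflation=2.0)
print(prior_mean, prior_precision)  # 1.5 50.0
```

a wider prior lets new data pull the estimate further, which is exactly the trade-off the text describes: slower accumulation in exchange for robustness to incompatibilities between studies.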
while such practice necessarily slows down scientific accumulation, it will reduce the risks that inter-study incompatibilities pose to posterior passing. this highlights how appropriate use of priors will be an important issue for researchers, as well as editors and reviewers, and that it is important that manuscripts make clear which priors were used and why. researchers may also wish to include robustness checks in which priors are modestly adjusted, with the subsequent change in results included in supplementary materials. in this manuscript, we have presented posterior passing as one way in which cumulative science can be implemented. among the benefits of posterior passing is that it is easy to implement as a simple extension of standard bayesian analyses of data. moreover, our simulations suggest that posterior passing works well in contexts where traditional, non-cumulative analyses produce conflicting results across multiple studies. the use of posterior passing in these contexts would potentially identify the true effect with confidence, and without relying on meta-analyses that, in practice, often fail to resolve debates. nonetheless, further work is needed to evaluate posterior passing, in particular how well it fares when faced with other known problems in science, such as biases in publication.

open science practices

this article earned the open data and the open materials badge for making the data and materials available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

bargh, j. a. (2012). priming effects replicate just fine, thanks. psychology today. from: www.psychologytoday.com

belia, s., fidler, f., williams, j., & cumming, g. (2005). researchers misunderstand confidence intervals and standard error bars. psychological methods, 10(4), 389.

benjamin, d. j., berger, j. o., johannesson, m., nosek, b.
a., wagenmakers, e. j., berk, r., ... & cesarini, d. (2018). redefine statistical significance. nature human behaviour, 2(1), 6.

beppu, a., & griffiths, t. l. (2009). iterated learning and the cultural ratchet. in proceedings of the 31st annual conference of the cognitive science society (pp. 2089-2094). austin, tx: cognitive science society.

bissell, m. (2013). reproducibility: the risks of the replication drive. nature, 503, 333-334.

bohannon, j. (2014). replication effort provokes praise—and 'bullying' charges. science, 344, 788-789.

bolker, b. m., brooks, m. e., clark, c. j., geange, s. w., poulsen, j. r., stevens, m. h. h., & white, j. s. s. (2009). generalized linear mixed models: a practical guide for ecology and evolution. trends in ecology & evolution, 24(3), 127-135.

chambers, c. d., feredoes, e., muthukumaraswamy, s. d., & etchells, p. (2014). instead of "playing the game" it is time to change the rules: registered reports at aims neuroscience and beyond. aims neuroscience, 1(1), 4-17.

cumming, g. (2013). understanding the new statistics: effect sizes, confidence intervals, and meta-analysis. routledge.

doyle, r. a., & voyer, d. (2016). stereotype manipulation effects on math and spatial test performance: a meta-analysis. learning and individual differences, 47, 103-116.

epskamp, s., & nuijten, m. b. (2015). statcheck: extract statistics from articles and recompute p values. r package version 1.0.1. http://cran.r-project.org/package=statcheck

ferguson, c. j. (2014). comment: why meta-analyses rarely resolve ideological debates. emotion review, 6(3), 251-252.

fischer, m. r. (2015). replication–the ugly duckling of science? gms z med ausbild, 32, 5.

flore, p. c., & wicherts, j. m. (2015). does stereotype threat influence performance of girls in stereotyped domains? a meta-analysis. journal of school psychology.

gelman, a. (2016, february 12). priming effects replicate just fine, thanks.
from www.andrewgelman.com/2012

gildersleeve, k., haselton, m. g., & fales, m. r. (2014). do women's mate preferences change across the ovulatory cycle? a meta-analytic review. psychological bulletin, 140(5), 1205.

gildersleeve, k., haselton, m. g., & fales, m. r. (2014). meta-analyses and p-curves support robust cycle shifts in women's mate preferences: reply to wood and carden (2014) and harris, pashler, and mickes (2014).

greenland, s., senn, s. j., rothman, k. j., carlin, j. b., poole, c., goodman, s. n., & altman, d. g. (2016). statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. european journal of epidemiology, 31(4), 337-350.

jaeger, t. f. (2008). categorical data analysis: away from anovas (transformation or not) and towards logit mixed models. journal of memory and language, 59(4), 434-446.

kahneman, d. (2014). a new etiquette for replication. social psychology, 45, 310-311.

kidwell, m. c., lazarević, l. b., baranski, e., hardwicke, t. e., piechowski, s., falkenberg, l. s., ... & errington, t. m. (2016). badges to acknowledge open practices: a simple, low-cost, effective method for increasing transparency. plos biol, 14(5), e1002456.

kruschke, j. (2011). doing bayesian data analysis: a tutorial with r, jags, and stan. academic press.

lakens, d., adolfi, f. g., albers, c. j., anvari, f., apps, m. a., argamon, s. e., ... & buchanan, e. m. (2018). justify your alpha. nature human behaviour, 2(3), 168.

lakens, d., hilgard, j., & staaks, j. (2016). on the reproducibility of meta-analyses: six practical recommendations. bmc psychology, 4(1), 24.

lakens, d., lebel, e. p., page-gould, e., van assen, m. a. l. m., spellman, b., schönbrodt, f. d., … hertogs, r. (2017, july 9). examining the reproducibility of meta-analyses in psychology. retrieved from osf.io/q23ye

lebel, e. p., mccarthy, r., earp, b. d., elson, m., & vanpaemel, w. (2018). a unified framework to quantify the credibility of scientific findings.
psyarxiv.

mcelreath, r. (2016). statistical rethinking: a bayesian course with examples in r and stan (vol. 122). crc press.

morgan, t. j., laland, k. n., & harris, p. l. (2015). the development of adaptive conformity in young children: effects of uncertainty and consensus. developmental science, 18(4), 511-524.

munafò, m. r., nosek, b. a., bishop, d. v., button, k. s., chambers, c. d., du sert, n. p., ... & ioannidis, j. p. (2017). a manifesto for reproducible science. nature human behaviour, 1, 0021.

nguyen, h. h. d., & ryan, a. m. (2008). does stereotype threat affect test performance of minorities and women? a meta-analysis of experimental evidence. journal of applied psychology, 93(6), 1314.

open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251), aac4716.

pulverer, b. (2015). reproducibility blues. the embo journal, 34(22), 2721-2724.

picho, k., rodriguez, a., & finnie, l. (2013). exploring the moderating role of context on the mathematics performance of females under stereotype threat: a meta-analysis. the journal of social psychology, 153(3), 299-333.

robert, c. p. (2016). the expected demise of the bayes factor. journal of mathematical psychology, 72, 33-37.

schnall, s. (2014). clean data: statistical artifacts wash out replication efforts. social psychology, 45(4), 315-317.

schönbrodt, f. d., wagenmakers, e. j., zehetleitner, m., & perugini, m. (2017). sequential hypothesis testing with bayes factors: efficiently testing mean differences. psychological methods, 22(2), 322.

stoet, g., & geary, d. c. (2012). can stereotype threat explain the gender gap in mathematics performance and achievement? review of general psychology, 16(1), 93.

trafimow, d., & marks, m. (2015). editorial. basic and applied social psychology, 37(1), 1-2.
van de schoot, r., kaplan, d., denissen, j., asendorpf, j. b., neyer, f. j., & van aken, m. a. (2014). a gentle introduction to bayesian analysis: applications to developmental research. child development, 85(3), 842-860.

van't veer, a. e., & giner-sorolla, r. (2016). pre-registration in social psychology—a discussion and suggested template. journal of experimental social psychology, 67, 2-12.

wasserstein, r. l., & lazar, n. a. (2016). the asa's statement on p-values: context, process, and purpose. the american statistician.

walton, g. m., & cohen, g. l. (2003). stereotype lift. journal of experimental social psychology, 39(5), 456-467.

walton, g. m., & spencer, s. j. (2009). latent ability: grades and test scores systematically underestimate the intellectual ability of negatively stereotyped students. psychological science, 20(9), 1132-1139.

wood, w., kressel, l., joshi, p. d., & louie, b. (2014). meta-analysis of menstrual cycle effects on women's mate preferences. emotion review, 6(3), 229-249.

zwaan, r. a., etz, a., lucas, r. e., & donnellan, m. b. (2017). making replication mainstream. behavioral and brain sciences, 1-50.

meta-psychology, 2020, vol 4, mp.2019.2238. https://doi.org/10.15626/mp.2019.2238. article type: commentary. published under the cc-by4.0 license. open data: n/a. open materials: n/a. open and reproducible analysis: n/a. open reviews and editorial process: yes. preregistration: n/a. edited by: e. m. buchanan. reviewed by: d. navarro, s. farrell. analysis reproduced by: n/a. all supplementary files can be accessed at osf: https://osf.io/2mdfj/

what factors are most important in finding the best model of a psychological process? comment on navarro (2019)

nathan j.
evans

department of psychology, university of amsterdam, the netherlands; school of psychology, university of queensland, australia

abstract

psychology research has become increasingly focused on creating formalized models of psychological processes, which can make exact quantitative predictions about observed data that are the result of some unknown psychological process, allowing a better understanding of how psychological processes may actually operate. however, using models to understand psychological processes comes with an additional challenge: how do we select the best model from a range of potential models that all aim to explain the same psychological process? a recent article by navarro (2019; computational brain & behavior) provided a detailed discussion of several broad issues within the area of model selection, with navarro suggesting that "one of the most important functions of a scientific theory is ... to encourage directed exploration of new territory" (p.30), that "understanding how the qualitative patterns in the empirical data emerge naturally from a computational model of a psychological process is often more scientifically useful than presenting a quantified measure of its performance" (p.33), and that "quantitative measures of performance are essentially selecting models based on their ancillary assumptions" (p.33). here, i provide a critique of several of navarro's points on these broad issues. in contrast to navarro, i argue that all possible data should be considered when evaluating a process model (i.e., not just data from novel contexts), that quantitative model selection methods provide a more principled and complete method of selecting between process models than visual assessments of qualitative trends, and that the idea of ancillary assumptions that are not part of the core explanation in the model is a slippery slope to an infinitely flexible model.

keywords: model selection, science, quantitative model comparison, cognitive models.
over the past several decades, psychology research has become increasingly focused on creating formalized models of psychological processes (e.g., ratcliff, 1978; brown & heathcote, 2008; usher & mcclelland, 2001; nosofsky & palmeri, 1997; shiffrin & steyvers, 1997; osth & dennis, 2015). these process models are created by taking verbal explanations of a process and formalizing them with an exact mathematical functional form. process models make exact quantitative predictions about observed data that are the result of some unknown psychological process, and by attempting to see which models can best account for these observed data, we can better understand how this unknown process may actually operate. however, using models to understand psychological processes comes with an additional challenge: how do we select the best model from a range of potential models that all aim to explain the same psychological process? this is an area of research known as model selection (myung & pitt, 1997; myung, 2000; myung, navarro, & pitt, 2006; evans, howard, heathcote, & brown, 2017; evans & annis, 2019), and is subject to ongoing debate, both at broad levels (e.g., qualitative methods [thura, beauregard-racine, fradet, & cisek, 2012] vs. quantitative methods [evans, hawkins, boehm, wagenmakers, & brown, 2017]) and specific levels (e.g., bayes factors [gronau & wagenmakers, 2019a] vs. out-of-sample prediction [vehtari, simpson, yao, & gelman, 2019]). a recent article by navarro (2019) provided a detailed discussion of both specific and broad issues within the area of model selection. although this article was a comment on the specific critique of bayesian leave-one-out cross-validation by gronau and wagenmakers (2019a), navarro (2019) also made several broader points on the philosophy of modelling, and how we should evaluate these formalized theories.
these broader points made by navarro (2019) appear to have been the most impactful part of the entire debate so far, with navarro's article currently (as of the 18th of january, 2019) having over 3,400 downloads and 122 shares, compared to the 678 downloads and 11 shares of the original article by gronau and wagenmakers. in general, navarro (2019) suggested that 1) "one of the most important functions of a scientific theory is ... to encourage directed exploration of new territory" (p.30), 2) "understanding how the qualitative patterns in the empirical data emerge naturally from a computational model of a psychological process is often more scientifically useful than presenting a quantified measure of its performance" (p.33), and 3) "quantitative measures of performance are essentially selecting models based on their ancillary assumptions" (p.33). although gronau and wagenmakers (2019b) provided a reply to all three commentaries made on their original article (vehtari et al., 2019; navarro, 2019; chandramouli & shiffrin, 2019), their response mostly focused on replying to vehtari et al. (2019) with further limitations of bayesian leave-one-out cross-validation. their section replying to navarro (2019) briefly mentioned that quantitative methods are useful as "the data may not yield a clear result at first sight" (p.42), but focused on a more specific point, regarding how useful simple examples (or, in the more critical terms of navarro, "toy examples") are in assessing the robustness of analysis methods. here, i provide a critique of some of navarro's broader perspectives, such as the function of scientific theories, the importance of qualitative patterns compared to precise quantitative performance, and the distinction between core and ancillary assumptions.
specifically, i argue that 1) all possible data should be considered when evaluating a process model (i.e., not just data from novel contexts), 2) quantitative model selection methods provide a more principled and complete method of selecting between process models than visual assessments of qualitative trends, and 3) the idea of ancillary assumptions that are not part of the core explanation in the model is a slippery slope to an infinitely flexible model. however, before providing my arguments, i would like to note that my arguments only reflect one side of the contentious debate over how models of psychological processes should be evaluated – just as navarro's arguments only reflected another side of the debate. therefore, i believe that researchers should read both navarro (2019) and my comment with an appropriate level of scrutiny, in order to gain a more complete perspective on the broad issues within this debate and decide how they believe models of psychological processes should be evaluated.

what is the most important function of a process model?

most of navarro's (2019) perspectives regarding model selection appear to be based around one key underlying factor: what is the most important function of a scientific theory (or, in these cases, a formalized process model that encapsulates a scientific theory)? from navarro's perspective, "one of the most important functions of a scientific theory is ... to encourage directed exploration of new territory". more specifically, in the section "escaping mice to be beset by tigers" (p.30–31), navarro appears to suggest that good process models – the models that provide better representations of the unknown psychological process that we wish to understand – are the ones that make accurate predictions about novel contexts, and that these novel predictions are how process models – and, more generally, scientific theories – should be evaluated.
although navarro's perspective may be a popular one among many researchers, i believe that this is only a single perspective on a contentious issue. within this section i present a different perspective on what the most important function of a process model is, and how we should determine the best model(s) of a process: that 1) the most important function of a process model is to explain the unknown psychological process as well as possible, 2) process models should be evaluated based upon all known data, and 3) the most principled way of making these evaluations is using quantitative model selection techniques. first and foremost, i agree with navarro (2019) that encouraging directed exploration of new territory can be a useful function of a scientific theory, and that making predictions about novel contexts (in navarro's words, "human reasoning generalization"; p.30) can help us efficiently gain knowledge about an unknown psychological process, especially if our knowledge is quite limited. like a state-of-the-art optimization algorithm in the context of estimating the parameter values of a model, a model that makes predictions for novel contexts provides an efficient method of searching through the space of all potential data. these novel predictions can help lead researchers to sources of data that are most informative in teasing apart different models, while avoiding less informative sources of data; a level of efficiency that a giant 'grid search' through all possible data would be unable to achieve. providing an efficient search of the data space is where i believe predictions about novel contexts are most useful, as they provide clear directions for which experiments are most likely to discriminate between competing models most clearly. however, i also believe that this is where the value of predictions about novel contexts ends.
when researchers are trying to find the best explanation for a process, predictions for novel contexts do not provide any more information about which model provides the closest representation of the process than predictions for known contexts. from my perspective, data are simply observations that we make of some unknown process. data are not inherently of "theoretical interest" (p.32), apart from in their ability to tell us which model provides the closest representation of this unknown process – a process that we, as scientists, wish to understand. therefore, to be the best explanation of a process, a model should provide the best predictions across all possible data from all possible contexts that we believe are observations of this same unknown process, and not just the data that are from novel contexts. importantly, assessing which model makes the best predictions across all available data is something that quantitative model selection methods have been specifically designed to achieve (myung & pitt, 1997; evans & brown, 2018). quantitative model selection methods compare models in their ability to make accurate, yet tightly constrained (i.e., low flexibility), predictions about the data; factors that many have argued are important in finding a theory that accurately reflects the underlying psychological process (roberts & pashler, 2000; myung, 2000; evans, howard, et al., 2017). as a concrete example of whether predictions about novel contexts are most important in theory evaluation, navarro (2019) eloquently points out that the rescorla-wagner model (rescorla & wagner, 1972) served an important purpose in research on classical conditioning, with its novel predictions pushing researchers to explore new, specific directions.
exploring these novel predictions led to the discovery of many empirical phenomena – with the rescorla-wagner model making accurate predictions for many of these novel contexts – which helped to further shape researchers’ understanding of classical conditioning. however, should we consider the rescorla-wagner model to be the best explanation of classical conditioning (e.g., the explanation we provide in textbooks for how the process operates) if it provides substantially worse predictions than other models for all of the data that we already know about? for me, this is a very clear ‘no’. in my opinion, navarro (2019) has conflated two distinct goals of process models in this example: the ability to provide the best explanation of what is actually happening, and the usefulness in guiding us to new empirical discoveries that we may not have thought of exploring otherwise (e.g., predictions for novel contexts that lead to new empirical phenomena). while novel predictions are useful for guiding empirical discovery, evaluating models based on their ability to successfully make these predictions ignores all other observations we have about this same unknown process from other contexts. therefore, assessing only novel predictions provides a poor overall reflection of which model provides the best explanation of a psychological process.

are certain data of more “theoretical interest” than others?

throughout the section “between the devil and the deep blue sea” (p.31–33), navarro (2019) makes numerous suggestions that some parts of the empirical data are of more “theoretical interest” (p.32) than others.
specifically, navarro states that “to my way of thinking, understanding how the qualitative patterns in the empirical data emerge naturally from a computational model of a psychological process is often more scientifically useful than presenting a quantified measure of its performance” (p.33), and makes numerous references throughout the concrete example of hayes, banner, forrester, and navarro (2018) to how the qualitative patterns in the data are of greater value than the quantitative fits. however, what exactly makes these qualitative patterns more scientifically useful than precise quantitative measurement? as i discussed previously, from my perspective data are just observations that we make of some unknown psychological process, and we use these observations to try and better understand this process. therefore, it seems strange to me that some specific parts of the data (i.e., the data that compose the specific qualitative pattern) would provide a more theoretically interesting answer about which model best explains the psychological process of interest than the other parts of the data (i.e., the data that quantitative model selection methods would also take into account). below i critique three general arguments for why qualitative patterns are commonly thought to be more theoretically interesting than quantified measures of performance. these arguments are each either explicitly stated, or appear to be alluded to, by navarro (2019), and in my experience are often the beliefs of researchers who prefer qualitative assessments over quantitative model selection. the arguments that i critique are: that 1) qualitative trends are able to distinguish between models more clearly, 2) precise quantitative differences can be harder to observe and understand, and 3) qualitative trends can often avoid ancillary assumptions of the models, which model selection methods can heavily depend on.
note that i give the third argument its own section (where is the border between core and ancillary model assumptions?), as i believe that this is a more general point about core and ancillary assumptions in process models.

the ‘qualitative trends often distinguish between the models more clearly’ argument

one argument for qualitative trends being more theoretically interesting than precise quantified measures of performance is that qualitative trends are able to distinguish between the models clearly. i think many would argue that the ‘proof of the pudding is in the eating’ here, as qualitative trends have been one of the main methods in psychology for deciding between competing models, and many of these robust qualitative trends end up serving as benchmarks for new potential models to meet before being taken seriously. however, this general argument seems to imply that quantitative model selection methods cannot distinguish between models clearly, and that qualitative trends are able to magically capture something that quantified measures of performance cannot. first, it seems important to define what exactly is meant by ‘distinguishing’ between models. i think a reasonable definition is something along the lines of ‘situations where evidence can be shown for one model over another, to reduce ambiguity in which model provides a better explanation of the psychological process of interest’. if this is an accurate definition of what it means to distinguish between models, then i believe that it is categorically false to suggest that quantitative model selection methods cannot clearly distinguish between models, or that the distinction obtained through quantitative model selection methods is in any way inferior to the distinction obtained from qualitative trends.
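as a minimal illustration of how a quantitative criterion can reduce this ambiguity, the sketch below computes a toy bayes factor in python for a hypothetical binomial experiment. the example and both hypotheses are my own, chosen because their marginal likelihoods have simple closed forms; they are not taken from the article:

```python
import math

def bayes_factor_binomial(k, n):
    """toy bayes factor (bf10) for k successes in n trials, comparing
    H1 (theta ~ uniform on [0, 1]) against H0 (theta fixed at 0.5)."""
    # marginal likelihood under H0: a plain binomial probability
    m0 = math.comb(n, k) * 0.5**n
    # marginal likelihood under H1: integrating the binomial over the
    # uniform prior gives C(n,k) * B(k+1, n-k+1) = 1 / (n + 1)
    m1 = 1.0 / (n + 1)
    # bf10 = 1 is no distinction; larger values favour H1, smaller favour H0
    return m1 / m0

bf_even = bayes_factor_binomial(50, 100)  # ambiguous data: favours the point null
bf_skew = bayes_factor_binomial(90, 100)  # extreme data: overwhelmingly favours H1
```

the magnitude of the bayes factor directly expresses how much the ambiguity between the two models was reduced by the data, which is exactly the sense of ‘distinguishing’ defined above.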
for example, in the case of the bayes factor (kass & raftery, 1995), a value of 1 indicates no distinction between the models, whereas larger (or smaller) bayes factors reflect greater distinction between the models, until the evidence becomes overwhelming in favour of one model over the other. therefore, quantitative model selection appears to both have the ability to reduce the ambiguity in which model is better, and to know the strength of evidence for one model over the other (i.e., the amount that the ambiguity was reduced by), meaning that quantified measures of performance can distinguish between models just as clearly as qualitative trends.

the ‘qualitative trends are easier to observe and understand than quantitative differences’ argument

another argument for qualitative trends being more theoretically interesting than precise quantified measures of performance is that qualitative trends can be visually observed in a clear manner, whereas the more precise quantitative differences can be harder to see, and it can be harder to understand why one model beats another. navarro (2019) states in the example of hayes et al. (2018) that “it is clear from inspection that the data are highly structured, and that there are systematic patterns to how people’s judgements change across conditions. the scientific question of most interest to me is asking what theoretical principles are required to produce these shifts. providing a good fit to the data seems of secondary importance.” (p.32). here, navarro seems to suggest that the difference between the models can be clearly seen in the qualitative trends, making these trends of theoretical interest, and that accounting for the rest of the trends in the data, which the quantitative fit detects, is less important as these trends are less clear. i agree with navarro – and others who make this general argument – to some extent here.
understanding why one model is better than others is an important scientific question that increases our understanding of a process, and provides us with future directions for model development (e.g., ‘model x misfits data pattern y, so therefore, we should look into mechanism z that may be able to deal with data pattern y’). gaining insights into this ‘what went wrong?’ question is most easily achieved through visual assessments of qualitative trends, as we can clearly see which trends certain models miss, and which trends certain models capture. however, ‘selecting the model that provides the best explanation of the unknown process’ and ‘understanding what specific trends in the data certain models cannot explain’ are two completely different goals, and the ability of qualitative trends to achieve the latter does not make them better than quantitative model selection at performing the former, in contrast to what navarro appears to suggest. more generally, i do not believe that being able to visually observe a trend – based on the way that the data have been plotted – means that the observed trend should have priority over all other possible qualitative and quantitative trends in the data. realistically, there are always likely to be several trends that can potentially be visually observed in the data, which may be shown or obscured by different ways of visualizing the data. a clear example of this can be seen in the comparisons of the diffusion model and the urgency-gating model in evans, hawkins, et al. (2017), who show that only looking at certain trends in the data (such as interactions in summary statistics over conditions) can be misleading, and that plotting the entire distributions shows other, clearer trends that distinguish between the models (see figure 1 for a more detailed walk-through of this example).
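the same point can be made in toy form. in the python sketch below (hypothetical data and models of my own devising, not from evans, hawkins, et al., 2017), two models make identical predictions for a single summary statistic, so assessing only that one trend cannot separate them, while the likelihood of the full data set separates them decisively:

```python
import numpy as np

rng = np.random.default_rng(7)
# hypothetical observations of an unknown process (secretly exponential, mean 1)
data = rng.exponential(scale=1.0, size=500)

# a single 'trend': both candidate models predict a mean of exactly 1.0,
# so comparing them on this one summary statistic cannot separate them
predicted_mean_a = 1.0  # model a: exponential with rate 1
predicted_mean_b = 1.0  # model b: normal with mu = 1, sigma = 1

def loglik_exponential(x):
    # log density of exp(1) is -x for x >= 0
    return float(np.sum(-x))

def loglik_normal(x, mu=1.0, sigma=1.0):
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - (x - mu)**2 / (2 * sigma**2)))

# the full-data likelihood weighs every observation (every trend) at once,
# and cleanly separates the two models that the single trend could not
ll_a = loglik_exponential(data)
ll_b = loglik_normal(data)
```

here the shape of the distribution (skew, support only on positive values) is exactly the kind of information that a single summary statistic discards, and that a full-data criterion retains.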
however, even in cases where we manage to plot the data in every way possible, and find every qualitative trend present in the data, how do we weight these different trends? as the number of trends increases, it seems unlikely that every trend will be best accounted for by a single model, making selecting a model based on qualitative trends difficult. in contrast, quantitative model selection methods are able to simultaneously account for all of the trends in the data that they are applied to, and provide a principled approach for weighting all of the trends together. essentially, quantitative model selection methods are able to take into account everything that visually assessing a finite number of qualitative trends can, and more. the only reason that assessing qualitative trends can give different results to quantitative model selection is that assessing only a subset of the data – as is the case when assessing qualitative trends – ignores all other aspects of the data. if researchers are only interested in explaining the single qualitative trend in the data, then i can see why only assessing the single qualitative trend makes sense. however, in cases where researchers want to explain the entire psychological process – which i think is most situations – then only assessing these visually observed qualitative trends is a limiting practice, rather than a theoretically interesting one.

where is the border between core and ancillary model assumptions?

one last argument for qualitative trends being more ‘theoretically interesting’ than precise quantified measures of performance is that assessing qualitative trends focuses on the core assumptions of the models, whereas model selection methods can heavily depend on the ancillary assumptions of the models.
navarro (2019) states in the concluding paragraph that “it seems to me that in real life, many exercises in which model choice relies too heavily on quantitative measures of performance are essentially selecting models based on their ancillary assumptions” (p.33). here, navarro seems to suggest that we should only be interested in specific assumptions of models – deemed to be core to the explanation – and attempt to ignore other assumptions – deemed to be ancillary to the explanation. i agree with navarro that all assumptions in the model can have a large influence on quantitative model selection methods, and researchers may consider some of these assumptions to be ancillary. however, what are the implications of starting to classify certain assumptions as ones that the model is committed to, and others as ones that are flexible and interchangeable? the idea of core and ancillary assumptions appears to come up quite regularly in theory and model development. models can have core assumptions, which are fundamental parts of the model’s explanation of the process that cannot be changed, and ancillary assumptions, which are only made because they are required for the formalization of the model (e.g., for simulation, or fitting). however, what exactly makes one assumption core, and another ancillary? the distinction may seem like common sense while speaking in abstract terms, but i believe that these different types of assumptions become much harder to distinguish between in practice. in practice, the lines between core and ancillary assumptions can often be blurred, and declaring certain assumptions in a model as being ancillary allows a researcher to still find evidence in favour of their preferred model – success that they attribute to the core assumptions – and dismiss evidence against their preferred model – failure that they attribute to incorrect ancillary assumptions. 
importantly, being allowed to adjust these ancillary assumptions can make a model infinitely flexible, even if any instantiation of the model with a specific set of ancillary assumptions is not infinitely flexible. jones and dzhafarov (2014) provide a clear example of this issue with the diffusion model, where if the distribution of trial-to-trial variability in drift rate is considered an ancillary assumption of the model – a common thought among researchers in the field – then the diffusion model has infinite flexibility in explaining choice response time distributions. however, the diffusion model has been shown to be quite constrained in its predictions when assuming a specific distribution of trial-to-trial variability in drift rate (smith, ratcliff, & mckoon, 2014; heathcote, wagenmakers, & brown, 2014), such as the normal distribution (ratcliff, 2002), suggesting that the change in flexibility is created by the choice of whether the assumption is labelled as core or ancillary. this suggests that breaking models into core and ancillary assumptions can be a slippery slope, and the flexibility of a model can rapidly increase by labelling certain assumptions as being ancillary. in contrast to navarro (2019), who wished to avoid making interpretations based on ancillary assumptions, i believe that when a formalized model of a process is defined, then this model represents the complete explanation of the process. there are no core and ancillary assumptions of the model: just assumptions. this is similar to the point made by heathcote et al. (2014) in their reply to jones and dzhafarov (2014), where they suggest that jones and dzhafarov targeted a straw-man definition of the diffusion model, and that the distributional assumptions should not be considered ancillary.
therefore, i believe that the ability to remove the influence of ancillary assumptions does not make qualitative trends more theoretically interesting to assess, and instead creates a slippery slope towards infinite flexibility. having said this, i can also understand why researchers may be reluctant to commit to all assumptions of a model as being core to the explanation. as navarro (2019) points out in the hayes et al. (2018) example, there are often difficult decisions that need to be made to create a formalized model, and some of these choices can end up being somewhat arbitrary. however, making each model a complete explanation with only core assumptions does not mean that assumptions that would normally be considered ancillary cannot be tested. specifically, multiple models can be proposed as explanations of the unknown psychological process, with each model containing a different instantiation of these ancillary assumptions, and these models can then be compared using quantitative model selection methods. however, each model with different assumptions is now a different, separate explanation, and researchers cannot switch between these different models for different paradigms while still claiming that this represents a success of a single explanation. i believe this presents a principled way to address the issue of ancillary assumptions in models, providing the robustness against potentially arbitrary modelling choices desired by navarro, while preventing the ancillary assumptions from making models infinitely flexible.

a brief digression: is automation actually a bad thing?

one final point that appears to be implied by navarro (2019) is that it is a negative that model selection methods are automated.
although this isn’t a central point of navarro’s, the statements “to illustrate how poorly even the best of statistical procedures can behave when used to automatically quantify the strength of evidence for a model” (p.31) and “i find myself at a loss as to how cross-validation, bayes factors, or any other automated method can answer it” (p.32) both appear to carry some level of negative connotation around the methods being automated. i agree that there is a general issue with ‘black box’ approaches, which can be applied and interpreted incorrectly when users do not understand them properly. however, i believe that automated quantitative model selection methods, which are applied in a consistent manner from situation to situation, do not belong in this category. more generally, why would a method being consistent in how it is applied be considered a bad thing? generally, i would consider automation to often be a good thing, and the case of model selection is no exception. instead of calling these methods automated, i would refer to them as principled. quantitative model selection methods are based on statistical theory, are clearly defined, and follow a systematic procedure that always compares the models in the same way. from my perspective, this seems like a good thing; being methodical is often what makes science robust. in the context of experimentation, a lot of the designs that researchers commonly use are essentially automated: these designs are based on methodological theory (i.e., minimizing measurement error and potential confounds), are clearly defined, and researchers implement them in an almost automatic, systematic fashion from experiment to experiment. however, i do not think that many researchers would argue that systematic experimentation using robust, well-developed experimental designs is a negative. so why does this principled, systematic nature of science suddenly become a negative when it comes to our method of inference?
i would instead suggest that automation is the natural next step after an approach becomes rigorously defined, and when a method cannot be automated, we should question how rigorous the method actually is. i believe that it is actually problematic that there is no automated process for visually assessing qualitative trends, where the same result would always be reached by entering the same information at the beginning, and that the lack of automation suggests that this approach lacks clear principles in at least some regard.

author contact
correspondence concerning this article may be addressed to: nathan evans: nathan.j.evans@uon.edu.au

conflict of interest and funding
there are no potential conflicts of interest regarding the current article. nje was supported by an australian research council discovery early career researcher award (de200101130) and a european research council advanced grant (unify-743086).

author contributions
nje conceptualized the project, reviewed the relevant literature, and wrote the manuscript.

open science practices
this article is a commentary without data, material, or analysis of the type that could have been pre-registered and reproduced. the entire editorial process, including the open reviews, is published in the online supplement.

references
brown, s. d., & heathcote, a. (2008). the simplest complete model of choice response time: linear ballistic accumulation. cognitive psychology, 57(3), 153–178.
carland, m. a., marcos, e., thura, d., & cisek, p. (2015). evidence against perfect integration of sensory information during perceptual decision making. journal of neurophysiology, 115(2), 915–930.
carland, m. a., thura, d., & cisek, p. (2015). the urgency-gating model can explain the effects of early evidence. psychonomic bulletin & review, 22(6), 1830–1838.
chandramouli, s. h., & shiffrin, r. m. (2019). commentary on gronau and wagenmakers. computational brain & behavior, 2, 12–21.
cisek, p., puskas, g. a., & el-murr, s. (2009).
decisions in changing conditions: the urgency-gating model. journal of neuroscience, 29(37), 11560–11571.
evans, n. j., & annis, j. (2019). thermodynamic integration via differential evolution: a method for estimating marginal likelihoods. behavior research methods, 1–18.
evans, n. j., & brown, s. d. (2017). people adopt optimal policies in simple decision-making, after practice and guidance. psychonomic bulletin & review, 24(2), 597–606.
evans, n. j., & brown, s. d. (2018). bayes factors for the linear ballistic accumulator model of decision-making. behavior research methods, 50(2), 589–603.
evans, n. j., hawkins, g. e., boehm, u., wagenmakers, e.-j., & brown, s. d. (2017). the computations that support simple decision-making: a comparison between the diffusion and urgency-gating models. scientific reports, 7(1), 16433.
evans, n. j., howard, z. l., heathcote, a., & brown, s. d. (2017). model flexibility analysis does not measure the persuasiveness of a fit. psychological review, 124(3), 339.
gronau, q. f., & wagenmakers, e.-j. (2019a). limitations of bayesian leave-one-out cross-validation for model selection. computational brain & behavior, 2, 1–11.
gronau, q. f., & wagenmakers, e.-j. (2019b). rejoinder: more limitations of bayesian leave-one-out cross-validation. computational brain & behavior, 2, 35–47.
hayes, b., banner, s., forrester, s., & navarro, d. (2018). sampling frames and inductive inference with censored evidence.
heathcote, a., wagenmakers, e.-j., & brown, s. d. (2014). the falsifiability of actual decision making models. psychological review, 121(4).
jones, m., & dzhafarov, e. n. (2014). unfalsifiability and mutual translatability of major modeling schemes for choice reaction time. psychological review, 121(1), 1.
kass, r. e., & raftery, a. e. (1995). bayes factors. journal of the american statistical association, 90(430), 773–795.
kiani, r., hanks, t. d., & shadlen, m. n. (2008).
bounded integration in parietal cortex underlies decisions even when viewing duration is dictated by the environment. journal of neuroscience, 28(12), 3017–3029.
myung, i. j. (2000). the importance of complexity in model selection. journal of mathematical psychology, 44(1), 190–204.
myung, i. j., navarro, d. j., & pitt, m. a. (2006). model selection by normalized maximum likelihood. journal of mathematical psychology, 50(2), 167–179.
myung, i. j., & pitt, m. a. (1997). applying occam’s razor in modeling cognition: a bayesian approach. psychonomic bulletin & review, 4(1), 79–95.
navarro, d. j. (2019). between the devil and the deep blue sea: tensions between scientific judgement and statistical model selection. computational brain & behavior, 2, 28–34.
nosofsky, r. m., & palmeri, t. j. (1997). an exemplar-based random walk model of speeded classification. psychological review, 104(2), 266.
osth, a. f., & dennis, s. (2015). sources of interference in item and associative recognition memory. psychological review, 122(2), 260.
pilly, p. k., & seitz, a. r. (2009). what a difference a parameter makes: a psychophysical comparison of random dot motion algorithms. vision research, 49(13), 1599–1612.
ratcliff, r. (1978). a theory of memory retrieval. psychological review, 85(2), 59.
ratcliff, r. (2002). a diffusion model account of response time and accuracy in a brightness discrimination task: fitting real data and failing to fit fake but plausible data. psychonomic bulletin & review, 9(2), 278–291.
rescorla, r. a., & wagner, a. r. (1972). a theory of pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. classical conditioning ii: current research and theory, 2, 64–99.
roberts, s., & pashler, h. (2000). how persuasive is a good fit? a comment on theory testing. psychological review, 107(2), 358.
shiffrin, r. m., & steyvers, m. (1997). a model for recognition memory: rem – retrieving effectively from memory.
psychonomic bulletin & review, 4(2), 145–166.
smith, p. l., ratcliff, r., & mckoon, g. (2014). the diffusion model is not a deterministic growth model: comment on jones and dzhafarov (2014). psychological review, 121(4).
thura, d., beauregard-racine, j., fradet, c.-w., & cisek, p. (2012). decision making by urgency gating: theory and experimental support. journal of neurophysiology, 108(11), 2912–2930.
tsetsos, k., gao, j., mcclelland, j. l., & usher, m. (2012). using time-varying evidence to test models of decision dynamics: bounded diffusion vs. the leaky competing accumulator model. frontiers in neuroscience, 6, 79.
usher, m., & mcclelland, j. l. (2001). the time course of perceptual choice: the leaky, competing accumulator model. psychological review, 108(3), 550.
vehtari, a., simpson, d. p., yao, y., & gelman, a. (2019). limitations of “limitations of bayesian leave-one-out cross-validation for model selection”. computational brain & behavior, 2, 22–27.
winkel, j., keuken, m. c., van maanen, l., wagenmakers, e.-j., & forstmann, b. u. (2014). early evidence affects later decisions: why evidence accumulation is required to explain response time data. psychonomic bulletin & review, 21(3), 777–784.

figure 1. an example of three different ways (a, b, c) that the data could be, and were, visualized in evans, hawkins, et al. (2017). evans, hawkins, et al. (2017) attempted to compare the diffusion model (ddm; ratcliff, 1978) and the urgency-gating model (ugm; cisek et al., 2009) by using a random dot motion task (e.g., pilly & seitz, 2009; evans & brown, 2017), where the evidence for each alternative changed over the course of each trial. each trial began with a brief burst of ‘early evidence’, which was either the same as (congruent) or different to (incongruent) the ‘late evidence’, creating a variable of ‘congruency’.
the late evidence either increased over time at one of four rates (slow, medium, fast, very fast), or was not present (none), creating a variable of ‘ramp rate’. panel a plots a single qualitative trend, being the interaction between congruency and ramp rate on mean response time. both models appear to capture the overall pattern of the interaction, though the ugm is able to account for the negligible effect of congruency on mean response time, whereas the ddm overpredicts the effect. panel b plots a single, but different, qualitative trend, being the change in the difference between congruent and incongruent trial accuracy over time (i.e., a conditional accuracy function; caf) for the ‘none’ condition. again, both models appear to capture the overall pattern of change over time, though the ugm is able to account for the quick decrease of the caf to zero, whereas the ddm underpredicts the rate of the decrease. panel c plots the entire choice response time distributions for the ‘none’ condition; data that include both of the qualitative trends shown in a and b. however, in this case the ddm clearly provides a better account of the entire distributions than the ugm, with the ugm displaying sizable misfit in several aspects of the data. these sources of misfit for the ugm were obscured by the methods of visualizing the data in a and b, where a and b appeared to suggest that both models explained the data well, but that the ugm did so somewhat better. importantly, most previous studies comparing these models (cisek et al., 2009; thura et al., 2012; winkel et al., 2014; carland, thura, & cisek, 2015; carland, marcos, et al., 2015) or similar models (kiani et al., 2008; tsetsos et al., 2012) had only focused on the qualitative trends seen in a and b, calling the validity of those previous findings into question.
meta-psychology, 2019, vol 3, mp.2018.880, https://doi.org/10.15626/mp.2018.880. article type: file drawer report. published under the cc-by4.0 license. open data: yes. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: no. edited by: rickard carlsson. reviewed by: lee jussim, åse innes-ker, ulrich schimmack. analysis reproduced by: tobias mühlmeister. all supplementary files can be accessed at the osf project page: https://doi.org/10.17605/osf.io/z6sdm

in search of experimental evidence for secondary antisemitism: a file drawer report

roland imhoff, johannes gutenberg university mainz, germany; social cognition center cologne, germany
mario messer, social cognition center cologne, germany

abstract
in 1955, adorno attributed antisemitic sentiments voiced by germans to a paradoxical projection: feelings of guilt that were experienced only latently were warded off by antisemitic defense mechanisms. similar predictions of increases in antisemitic prejudice in response to increased holocaust salience follow from other theoretical apparatuses (e.g., social identity theory as well as just-world theory). based on the – to the best of our knowledge – only experimental evidence for such an effect (published in psychological science in 2009), the present research reports a series of studies originally conducted to better understand the contribution of the different assumed mechanisms. in light of a failure to replicate the basic effect, however, the studies shifted to an effort to demonstrate the basic process. we report all studies our lab has conducted on the issue. overall, the data did not provide any evidence for the original effect.
in addition to the obvious possibility of an original false positive, we speculate what might be responsible for this conceptual replication failure.

keywords: file drawer; secondary antisemitism; victim blaming; guilt defense; replication

back in 2007, we conducted an experimental study to test the widespread notion that ongoing reminders of jewish suffering due to nazi crimes will evoke some kind of prejudicial reaction in germans, a defensive “secondary” antisemitism. the (in hindsight severely underpowered) study “worked” perfectly: reminding german participants of ongoing jewish suffering led to an increase in antisemitism (compared to baseline), but only if they felt that untruthful (but socially desirable) responding was futile, as we would detect such lies. all built-in validity checks made almost perfect sense. we had never seen such a pretty data pattern before (and never thereafter) and were very happy when others agreed and the paper got accepted for publication in psychological science (imhoff & banse, 2009). fueled by this success, we applied for and received a grant to explore this fascinating effect in more detail. the original plan to infer the underlying theoretical process by identifying moderators and mediators failed, however, as we could not even replicate the basic effect. the following is the tale of a long series of (mostly conceptual) non-replications. we will summarize the theoretical background of our original study, explain the goals we had with an expansion of the line of research, and describe a total of eight studies intended to replicate and expand the original findings (studies 1 and 2) or empirically address the failure to replicate the basic finding (studies 3a to 5).

author note: the reported research and preparation of this paper was supported by a deutsche forschungsgemeinschaft (dfg) grant (im147/1-1) awarded to roland imhoff. we thank claudia beck, maren-julia boden, lena drees, laura melzer, nanette münnich, and ben sturm for help with data collection and amanda seyle jones for help in editing the manuscript. correspondence should be addressed to roland imhoff via roland.imhoff@uni-mainz.de.

the notion of secondary antisemitism is a highly popular concept across several disciplines. although there are nuances in how exactly it was conceptualized, most definitions encapsulate the idea of an antisemitism not despite but because of the holocaust. briefly after world war ii (wwii) and the nazis’ efforts to literally annihilate jews all over europe, peter schönbach (1961) observed remarkable levels of antisemitism in german youths. this seemed puzzling, as the now widespread awareness of the antisemitic atrocities committed only a few years earlier should have served as a potent warning sign against all forms of antisemitism. he thus proposed that the adolescents knew about their parents’ complicity (guilt by either action or omission) in the actions of the nazi regime and had to somehow cope with this knowledge or – psychologically speaking – the experienced dissonance of loving their parents but associating them with such horrific actions. to do so, they – according to schönbach – were more or less forced to rewarm the nazi regime’s antisemitic propaganda to generate justifications for their parents’ demeanor. adorno (1955) made similar observations in his interpretation of group discussions organized by the frankfurt institute for social research, and his explanation was also similar: the participating adults, so he argued, had feelings of latent guilt for what happened during the holocaust and had to – psycho-dynamically speaking – project this guilt onto the victims (jews) to alleviate these feelings.
although this version of antisemitism as a defense mechanism is the most common interpretation of adorno's reasoning (as also reflected in synonyms like “schuldabwehrantisemitismus”, a defense-against-guilt antisemitism; bergmann, 2006), adorno's writings also point to another explanation (one he never explicates as an alternative mechanism): an identity management account. over the years, these identity concerns moved to the core of the current understanding of secondary antisemitism as an antisemitism borne out of the outrage that jews' insistence on remembering what happened spoils the positive identity of being german. this has been most famously coined in a quip (ascribed to the israeli psychoanalyst zvi rex): “the germans will never forgive the jews for auschwitz” (buruma, 2003). of course, this mechanism is not only a well-established trope in the political arena but also makes a lot of sense against the background of a plethora of psychological theories. blaming innocent victims is a central aspect of just-world theory (lerner, 1980), whereby construing victims as negative and undeserving helps to uphold the illusion that the world is a just place (correia & vala, 2003; friedman & austin, 1978). likewise, from a social identity perspective, derogating outgroup victims is functional in attenuating threats to the moral value of the ingroup (branscombe, schmitt, & schiffhauer, 2007; castano & giner-sorolla, 2006). system justification theory (jost, banaji, & nosek, 2004) integrates many of these tenets to postulate that rationalizing the status quo (e.g., justifying the ongoing suffering of holocaust victims by finding fault in their character) may help reduce guilt, dissonance, and discomfort (jost & hunyady, 2002). despite these many theoretical lines allowing the same prediction, the very core idea of secondary antisemitism had never been experimentally tested.
existing work on the issue was predominantly non-psychological and based on secondary antisemitism as a rhetoric rather than a process. these studies invited respondents to indicate their agreement with statements that encapsulated what researchers understood as secondary antisemitism. prominent examples are items like “jews should stop complaining about what happened to them in nazi germany” (selznick & steinberg, 1969), “the jews exploit remembrance of the holocaust for their own benefit” (heitmeyer, 2006), or “i am tired of continuously hearing about german crimes against jews” (bergmann, 2006). although such utterances may well reflect what has been conceptualized as secondary antisemitism, agreement with them is not indicative of the underlying process. it is, for instance, conceivable that a respondent just dislikes jews in general, without any specific emphasis on the holocaust. this respondent will certainly agree with these statements as they communicate the negativity he or she sees in jews, but this agreement will not be the result of the need to alleviate guilt or defend one’s ingroup’s moral value. in fact, the very same argument could be made regarding the original participants in the studies by the frankfurt institute for social research. maybe they were antisemitic during wwii and continued to be antisemitic thereafter without any indirect mediation via latent guilt or the need to justify their parents. the fact that subscales tapping into agreement with traditional forms of antisemitism (e.g., “jews have too much power and influence in this world”; weil, 1985) and secondary antisemitism correlate up to r = .84 at a latent level (imhoff, 2010) adds further fuel to this fire. we thus aimed to provide experimental evidence for secondary antisemitism as a process rather than a rhetoric. 
in search of experimental evidence for secondary antisemitism: a file drawer report

as a way to induce feelings of (collective) guilt or uneasiness about german atrocities, we aimed to make holocaust victims' ongoing suffering salient, with the expectation that this salience should increase antisemitism as a form of victim derogation (to alleviate guilt, to see the world as just, or to see one's group as moral). something about this prediction, however, did not feel right. clearly, telling people how much a certain group suffers should, if anything, raise the threshold for devaluing that group, as suffering is expected to evoke sympathy (heider, 1958) rather than derogation. we aimed to resolve this by reaching into the social psychologist's bag of tricks: maybe people did have this sentiment but did not express it because social norms prevented them from doing so. so all we needed was a way to block the influence of such norms. if people had the feeling that we could know what they actually felt, then socially desirable (but dishonest) responding would be futile, since we would not only find out about their prejudice anyway but would also see that they were liars (a double norm violation). this sums up the logic of bogus pipeline procedures, which allegedly detect dishonest responding and thus lead participants to respond truthfully to avoid the double norm violation described above. so, this was how we proceeded: we asked as many of our undergraduate psychology students as we could find (a whopping 70 participants) to indicate their agreement with 29 statements of antisemitism as part of a larger paper-and-pencil test (the infamous “mass testing”).
three months later, they were invited to participate in individual testing sessions and 63 of them agreed and showed up for an experiment involving two independent variables manipulated between subjects: was the suffering of holocaust victims described as having ongoing negative consequences for them and their descendants (ongoing suffering: yes/no)? were participants hooked up to (slightly outdated) eeg machinery and a hand palm electrode with the information that this would help us detect untruthful responding (bogus pipeline: yes/no)? afterwards, participants wrote down all the thoughts they had while reading the text, then completed a measure of implicit antisemitism, the same antisemitism scale as three months earlier, and a manipulation check item to make sure that they had indeed read the initial text (“please briefly recall the introductory text. did it mention ongoing consequences for the victims?”). when we finally looked at the results, they were beautiful – everything looked exactly as it “should”. we had an unexpectedly large number of failed manipulation checks (15 people), but the pattern made perfect sense (in hindsight): almost all of these wrong responses came from the ongoing suffering conditions (13 people). thus, instead of derogating the victims to alleviate guilt, they just refused to even take note of the ongoing suffering. the remaining 48 participants, however, showed exactly the pattern we expected (figure 2, left panel). without mention of ongoing suffering, the level of antisemitism stayed more or less the same (operationalized as standardized residuals of predicting time 2 antisemitism from time 1 antisemitism; r = .89). mentioning ongoing suffering, however, decreased the expression of antisemitic prejudice in the control condition but led to an increase when attached to a bogus pipeline. 
the results were even significant despite the small sample; clearly, the strategy of controlling for baseline antisemitism had made our measure very sensitive. there were more details in the data that added to the picture of a perfect study: the correlation between implicit and explicit antisemitism was independently moderated by the bogus pipeline condition and by a time 1 measurement of the motivation to control prejudiced reactions (banse & gawronski, 2003), further validating the experimental procedure and the data in general. presenting this study at conferences in the following months was rewarded with a lot of positive feedback that boosted our confidence to reach high with this one: we submitted to psychological science and received the happy news roughly 11 weeks later: “in both its subject matter as in its empirical approach, your paper is (in my humble opinion) a prototypical psychological science paper: it reports on a phenomenon that many people think or have heard about but does so in a way that makes this phenomenon more worthwhile, more important, and much more consequential than lay psychology would have predicted.” sure, the reviewers still had critical comments; none, however, referred to sample size. we resubmitted the manuscript within 10 days and it was accepted shortly thereafter. in light of the positive feedback we got, it seemed only logical to follow up on this line of research. the many theoretical lines that converged in predicting the effect we found were a plus in making a convincing argument. on the flipside, however, this also meant that we had not one but several candidates for the psychological process underlying this effect. our project sought to tackle this. specifically, we expected three distinct, not necessarily mutually exclusive, processes to be potentially involved (figure 1).
building on the originally psycho-dynamic reasoning, the first possibility was that the mediating mechanism rested on (latent) feelings of guilt that were fought off by derogating the victims and/or interpreting their suffering as deserved. the implication would be that this mechanism should be restricted to victims of one's own group (as feeling guilty for atrocities committed by another group seemed unlikely), should be moderated by the propensity to feel guilty, should be mediated via feelings of guilt, and should be reduced if this guilt was alleviated in any other way. the second alternative was built on the notion of social identity and individuals' motivation to see their own group as moral (branscombe, ellemers, spears, & doosje, 1999) and defend its positive identity (branscombe, schmitt, & schiffhauer, 2007). here, too, the effect should be restricted to victims of the ingroup (as there exists no motivation to see outgroups as moral) and should be particularly prominent among people who identify (defensively) with their ingroup. the mediating mechanism would be the perceived threat to the ingroup's moral image, and any alternative means to repair this image might reduce the effect. the final distinct possibility was that victim derogation here was a means to restore one's illusion of the world as a just place (e.g., correia & vala, 2003; friedman & austin, 1978; godfrey & lowe, 1975; lerner & simmons, 1966; miller, 1977; simmons & piliavin, 1972). the strong need to see the world as a place where everyone gets what they deserve and deserves what they get (lerner, 1980) should prompt the desire to generate reasons why jewish suffering was actually deserved, likely leading to victim blaming. importantly, this mechanism is not exclusive to one's own group's victims but should be a general process independent of who brought about the suffering.
people with a greater need to see the world as just should be more prone to show the effect, and re-establishing a sense of the world as just by alternative means should reduce the effect.

figure 1. potential pathways from perception of ongoing victim suffering to increased prejudice.

the present research. we planned a research program that sought to replicate the basic finding of secondary antisemitism and address the plausibility of each of the three theoretical possibilities outlined above by three strategies. first, all three accounts propose different moderators for the effect: guilt proneness, defensive national identification, and just-world beliefs. second, the boundary conditions of the effect should also be informative. whereas the first two accounts would predict the effect to be limited to victims of the ingroup, the last would make a general prediction for any (innocent) victim. third, all three theories allow predictions about the specific kind of alternative means that could alleviate the discomforting feelings of guilt, ingroup threat, or just-world threat. washing one's hands, we reasoned, should alleviate guilt; re-affirming the morality of one's nation should alleviate concerns about one's group's morality; and providing examples of fair and just procedures should re-establish a sense of justice in this world. as an additional possibility, we planned to explore indirect effects via measured mediators (e.g., latent guilt). below we describe the first two studies from that line of research, which could not even establish the basic effect, let alone any moderation. in light of this, we refrained from conducting additional studies with experimental moderators (e.g., washing hands). instead, all other reported studies describe efforts to find evidence for the basic process of an increase in antisemitic prejudice by making the history of the holocaust salient (not necessarily ongoing victim suffering).
we employed more subtle measures of prejudice (studies 3a-3c), less egalitarian samples (studies 4a and 4b), or more modest forms of negativity, like reduced empathy (study 5). none of these succeeded in providing such evidence.

study 1

in the first study, we aimed to replicate imhoff and banse's (2009) study and to test the role of latent guilt as a potential mediating process. we utilized an adaptation of the implicit positive and negative affect test (ipanat; quirin, kazén, & kuhl, 2009), which served as an indirect measure of guilt. we examined whether a) ongoing jewish suffering increases implicit guilt, whether b) implicit guilt is positively correlated with antisemitism under bogus-pipeline conditions, and whether c) implicit guilt mediates the effect of ongoing jewish suffering on antisemitism. to maximize our chances of finding subtle effects, we took an earlier baseline measurement of our central dependent variable.

method

participants. an a priori power analysis suggested a required sample of n = 120 to find an interaction effect of size f = .30 (the effect size was f = 0.36 in imhoff & banse, 2009) with 90% power. as we expected substantial dropout, we sought to oversample at t1. specifically, we circulated an invitation to participate in a study consisting of two parts (a 45-minute online study and a 15-minute lab experiment) via an e-mail to individuals who had signed up as interested in study participation. to enhance participation at both measurement times, we offered 12 eur that would be given in cash after completion of the second, lab-based experiment. despite this incentive and three invitation e-mails, only 109 individuals (34 men, 74 women, 1 missing; mean age: 27.05, sd = 6.70) participated in the online study.
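the a priori power figure above can be sanity-checked by simulation. the sketch below is our illustration, not the authors' analysis: it assumes an interaction-only pattern of cell means (+f, -f, -f, +f), for which cohen's f equals the common absolute cell-mean deviation, and a fixed two-sided critical value of t ≈ 1.98 for df = 116, written in stdlib python:

```python
import random
import statistics

def simulate_power(f=0.30, n_total=120, n_sims=2000, seed=1):
    """Monte Carlo power for the 2x2 interaction contrast.

    Cell means follow an interaction-only pattern (+f, -f, -f, +f)
    with sd = 1, for which Cohen's f equals f. The interaction is
    tested via the contrast m11 - m12 - m21 + m22 against a fixed
    two-sided critical value (t ~ 1.98 for df = n_total - 4 = 116).
    """
    rng = random.Random(seed)
    n_cell = n_total // 4
    t_crit = 1.98
    hits = 0
    for _ in range(n_sims):
        cells = [[rng.gauss(mu, 1.0) for _ in range(n_cell)]
                 for mu in (f, -f, -f, f)]
        means = [statistics.fmean(c) for c in cells]
        # pooled within-cell variance across the four cells
        s2 = (sum((n_cell - 1) * statistics.variance(c) for c in cells)
              / (n_total - 4))
        contrast = means[0] - means[1] - means[2] + means[3]
        se = (4 * s2 / n_cell) ** 0.5
        if abs(contrast / se) > t_crit:
            hits += 1
    return hits / n_sims

# should land near the reported 90% power for f = .30 and N = 120
print(round(simulate_power(), 2))
```

under these assumptions the simulated power comes out close to the 90% the authors report, which is why n = 120 was the recruitment target.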
upon completion (roughly 3 months after the first invitation to the online study), participants were contacted individually to make appointments for the lab study. a total of 83 participants (29 men, 54 women; mean age: 27.71, sd = 7.22; drop-out: 23.9%) were successfully recruited to show up for the lab study. this equipped us with 77% power to detect the estimated effect of f = .30. the data of one additional participant in the lab study had to be excluded because he or she provided a participant code not included in the dataset of the pretest.

online testing. the purpose of the online test was twofold. first, we needed a baseline measure of antisemitism to control for at t2. this would reduce the noise due to stable individual differences and thus isolate the proportion of the variance that was not due to such individual differences and was therefore in principle susceptible to experimental manipulation. second, we included a long list of moderators predicted by the different theoretical models outlined above. the overarching goal was to identify systematic patterns across a series of studies to bolster the robustness of one specific theoretical approach. specifically, we included measures of guilt proneness, national identification, and just-world beliefs. some additional measures were added on a purely exploratory basis.

antisemitism. explicit antisemitism was assessed using imhoff's (2010) scale for the measurement of primary and secondary antisemitism on seven-point scales ranging from 1 (totally disagree) to 7 (totally agree). in order to attenuate reactance, and as in the original study, these items were preceded by a filler item (“i think the relationship between germans and jews is still influenced by the past.”).
additionally, among the clearly negative items we included items that indicated more positive attitudes (e.g., 9 items tapping into collective guilt and regret, imhoff, bilewicz, & erb, 2012; 5 items on contact and contact intention; 5 items on reparation intentions). the actual antisemitism scale consisted of 29 items measuring modern antisemitism (e.g., “jews have too much influence on public opinion”; 4 reverse-coded; cronbach's α = .91). as a second measurement approach, participants indicated how warm (5 items, e.g. “good-natured”, cronbach's α = .92) and competent (4 items, e.g. “competent”, cronbach's α = .77; fiske, cuddy, glick, & xu, 2002) they perceived jews to be, using a list of 20 adjectives (including 11 filler items) on the same scale.

guilt proneness. we assessed the disposition to experience strong feelings of guilt using two instruments: the test of self-conscious affect-3 (tosca-3; german version by rüsch & brück, 2003; 5-point scale) and the guilt and shame proneness scale (gasp; german translation by cohen, wolf, panter, & insko, 2011; 7-point scale). both measures ask participants to imagine various scenarios and to indicate how likely it is for them to experience guilt (among other possible reactions) in these situations. cronbach's α was .47 for the tosca-3 guilt scale and .60 for the guilt – negative behavior evaluation scale of the gasp.

national identification. national identification was measured in two ways so that the impact of the defensive form of national identification (i.e., glorification controlled for attachment, collective narcissism) could be isolated.
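for reference, the cronbach's α values reported throughout the method sections all come from the same formula; a minimal stdlib-python sketch, with made-up item data purely for illustration:

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha for a scale given as a list of item-score columns.

    alpha = k/(k-1) * (1 - sum of item variances / variance of sum scores)
    """
    k = len(items)
    n = len(items[0])
    item_vars = [statistics.variance(col) for col in items]
    # per-person total score across the k items
    totals = [sum(col[i] for col in items) for i in range(n)]
    total_var = statistics.variance(totals)
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# three perfectly parallel items yield the maximum reliability
parallel = [[1, 2, 3, 4, 5]] * 3
print(round(cronbach_alpha(parallel), 2))  # prints 1.0
```

low values such as the α = .47 reported for the tosca-3 guilt scale simply mean the item variances are large relative to the variance of the sum scores.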
we measured attachment to the national group (8 items; e.g., “being a german is an important part of my identity”; cronbach's α = .90) and glorification of this group (8 items; e.g., “germany is better than other nations in all respects”; cronbach's α = .82) on seven-point scales ranging from 1 (totally disagree) to 7 (totally agree) with items by roccas, sagiv, halevy, and eidelson (2008) that were adapted and translated into german. as an additional measure of defensive national identification, we included a measure of collective narcissism, the exaggerated belief that one's own national group is superior to other groups, on the same scale. to this end we used the german translation of nine items (cronbach's α = .85) of the collective narcissism scale (e.g., “i wish other groups would more quickly recognize the authority of the germans”; golec de zavala, cichocka, eidelson, & jayawickreme, 2009).

belief in a just world. we used dalbert's (2001) general belief in a just world scale, which consists of six items (e.g., “i think basically the world is a just place”; cronbach's α = .72). the items of this scale were answered on a six-point scale ranging from 0 (totally disagree) to 5 (totally agree).

additional variables. we measured right-wing authoritarianism (rwa; funke, 2005), social dominance orientation (sdo; von collani, 2002), the big five (bfi-10; rammstedt & john, 2007), conspiracy mentality (imhoff & bruder, 2014), and the coping modes vigilance and cognitive avoidance (mainz coping inventory, abi; egloff & krohne, 1998) using german versions of the scales.

procedure.
after giving informed consent, participants completed all scales in a fixed order (tosca-3, belief in a just world, collective narcissism, glorification and attachment, antisemitism, conspiracy mentality, right-wing authoritarianism, social dominance orientation, mainz coping inventory, gasp, bfi-10, demographics) before generating the individual code needed to match their pretest data with the lab study data.

lab study. all participants who participated in the online study and left contact details were invited via e-mail to participate in the lab study. upon arriving at individually arranged sessions, they were randomly assigned to one of the four conditions resulting from a 2 (ongoing consequences: yes vs. no) by 2 (bogus pipeline: yes vs. no) design.

information on ongoing consequences. participants read a text, ostensibly taken from a history book, which described the german atrocities committed against jews in the auschwitz concentration camp. this text was identical to that used by imhoff and banse (2009). the last paragraph contained the manipulation of ongoing consequences. participants either read that the suffering of the jewish victims was part of a terrible history that has no direct implications for jews today (no ongoing consequences) or that even today jews are suffering either as auschwitz survivors or as their descendants because of “secondary traumatization” (ongoing consequences).

bogus pipeline. the implementation of the bogus pipeline differed from the original study (imhoff & banse, 2009) because we initially intended to explore physiological reactions to both versions of the text about the holocaust. in the bogus pipeline condition, the electrode belt of a heart rate monitor watch was applied to participants' chests. in addition, electrodes were attached to the palmar surfaces of the participants' index and middle fingers and to the back of their hands, supposedly to measure galvanic skin response.
participants were informed that physiological data were measured because “previous research has shown that we can detect quite well whether someone answers truthfully or with a lie”. participants in the control condition underwent measurement of heart rate as well but did not have electrodes attached to their hands. importantly, participants in this condition were informed that physiological measures were obtained merely in order to explore whether physiological parameters correlate with information processing in reading.

measures.

implicit guilt. we used an adaptation of the implicit positive and negative affect test (ipanat; quirin, kazén, & kuhl, 2009) that assesses anger, fear, happiness, and guilt (ipanat-4-em) to measure implicit guilt. participants were asked to judge the extent to which artificial words (e.g., “vikes”) express each of three emotional qualities per emotion cluster. guilt was represented by the emotion words “guilt”, “regret”, and “shame”, cronbach's α = .88.

explicit guilt. the same emotions that were measured with the ipanat-4-em in an indirect way were also assessed using a self-report measure. participants indicated to what extent they felt anger, fear, happiness, and guilt (“guilty”, “regretful”, and “ashamed”) at that moment, cronbach's α = .81.

antisemitism. participants completed the same scale as in the online study, α = .93.

heart rate variability. we collected heart rate variability data for exploratory purposes using heart rate monitor watches by polar.

procedure. after an interval ranging between seven days and three months between the online survey and participation in the lab study (time 2), participants were randomly assigned to one of four experimental conditions in a 2 (ongoing consequences vs. no ongoing consequences) × 2 (bogus pipeline vs. control) factorial design. the session started with the bogus pipeline manipulation and the physiological setup.
after a two-minute baseline measurement of heart rate variability, participants read a text about the german atrocities in the auschwitz concentration camp, which included the manipulation of consequences for present-day jews. the individual paragraphs of the text moved across the screen over a period of 140 seconds to allow for a mapping of physiological reactions onto specific parts of the text. after the reading task, participants were asked to write down on a piece of paper the thoughts they had had while reading the text. subsequently, they completed the ipanat-4-em, which included our measure of implicit guilt, and filled in the measure of explicit guilt. finally, they again answered the same antisemitism questionnaire that they had completed at time 1 and indicated whether the text presented to them before contained information about ongoing negative consequences for jews today as a manipulation check (“yes” or “no”).

results

antisemitism showed high stability between both measurements, r(83) = .89, p < .001. we followed the strategy of the original study (imhoff & banse, 2009) in analyzing the effect of the information on ongoing consequences on antisemitism. time 1 antisemitism scores were entered as a predictor of time 2 antisemitism scores in a regression analysis, and standardized residual change scores were used as an index of change in antisemitism. the resulting residual change scores were subjected to a 2 (ongoing consequences vs. no ongoing consequences) × 2 (bogus pipeline vs. control) analysis of variance (anova). in contrast to our hypothesis and the results of the original study, no evidence was found for an interaction between the information on ongoing negative consequences for jews and the bogus pipeline manipulation, f(1, 79) = 0.28, p = .602, ηp² = 0.003 (figure 2, right panel). likewise, none of the experimental factors showed a main effect, fs < 1.
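the residual-change index used in these analyses is easy to reconstruct: regress time-2 scores on time-1 scores and standardize the residuals. a stdlib-python sketch with made-up toy data (not the study data):

```python
import statistics

def standardized_residual_change(t1, t2):
    """Change index used in the analyses: regress time-2 scores on
    time-1 scores (OLS) and standardize the residuals, so that stable
    individual differences are partialled out."""
    m1, m2 = statistics.fmean(t1), statistics.fmean(t2)
    sxy = sum((x - m1) * (y - m2) for x, y in zip(t1, t2))
    sxx = sum((x - m1) ** 2 for x in t1)
    slope = sxy / sxx
    intercept = m2 - slope * m1
    resid = [y - (intercept + slope * x) for x, y in zip(t1, t2)]
    sd = statistics.stdev(resid)  # residuals have mean 0 by construction
    return [r / sd for r in resid]

def pearson_r(x, y):
    """Pearson correlation, e.g. the reported wave-to-wave stability."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# toy data only (five cases, two waves)
t1 = [2.0, 3.0, 4.0, 3.0, 2.5]
t2 = [2.1, 3.0, 4.8, 2.9, 2.4]
change = standardized_residual_change(t1, t2)
# the change scores have mean 0 and sd 1 and would then enter the 2x2 anova
```

because the residuals are standardized, the resulting scores isolate exactly the variance that the experimental manipulation could in principle move, which is why the design is sensitive even with modest samples.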
confronting german participants with ongoing negative consequences for present-day jews did not result in increased antisemitism, even when participants thought that untruthful responses could be detected by the experimenter.

figure 2. change in explicit antisemitism (standardized residuals) from time 1 to time 2 as a function of the information on ongoing consequences and bogus pipeline manipulations in the original study (imhoff & banse, 2009; left panel) and in study 1 of the current research. error bars represent standard errors of the mean.

despite this lack of support for the basic effect, we analyzed whether ongoing jewish suffering increases implicit guilt. a t-test for independent samples revealed no significant difference in implicit guilt between the ongoing consequences condition (m = 3.25, sd = 0.95) and the no ongoing consequences condition (m = 3.06, sd = 0.90), t(81) = 0.93, p = .354, hedges's gs = 0.20, 95% ci [-0.23, 0.64]. in contrast to our hypothesis, implicit guilt was not positively correlated with antisemitism under bogus pipeline conditions, r(44) = .12, p = .451. in order to test the moderator hypotheses, we performed separate hierarchical multiple regression analyses using the standardized residual change scores in antisemitism as the dependent variable. product terms representing the three-way interactions among both experimental factors and the potential moderator variables were entered as predictors in a third step, after the simple predictors and all possible two-way products. none of these regression analyses revealed evidence for a moderating effect of collective narcissism (see table osm.1 on our osf project page), national glorification (see table osm.2), just-world beliefs (see table osm.3), or guilt proneness (see table osm.4).
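hedges's gs, as reported above, is the pooled-sd standardized mean difference between two independent groups corrected for small-sample bias. a stdlib-python sketch with invented scores (the correction factor 1 − 3/(4·df − 1) is the usual approximation):

```python
import statistics

def hedges_gs(group1, group2):
    """Standardized mean difference between two independent groups
    (pooled sd), corrected for small-sample bias (Hedges's g_s)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.fmean(group1), statistics.fmean(group2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * statistics.variance(group1) +
                  (n2 - 1) * statistics.variance(group2)) / df
    d = (m1 - m2) / pooled_var ** 0.5   # Cohen's d_s
    correction = 1 - 3 / (4 * df - 1)   # small-sample bias correction
    return d * correction

# invented scores, not the study data
ongoing = [3.0, 3.5, 4.0, 2.5, 3.2]
no_ongoing = [2.8, 3.1, 3.4, 2.6, 2.9]
g = hedges_gs(ongoing, no_ongoing)
```

the correction shrinks cohen's d slightly, which matters for samples as small as the n = 83 analyzed here.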
discussion

study 1 provided no lead on the research question of which psychological processes are plausibly responsible for increased prejudice in light of ongoing suffering, predominantly because it failed to replicate this finding. although descriptively the mean scores were in the predicted direction, this trend was far from significant. several reasons appeared conceivable for this. as always, the non-significant findings could be a false negative due to too little power. we failed to collect data from 120 participants as planned based on the a priori power analysis, and this analysis might already have been biased by an effect size estimate, taken from the original study, that was too optimistic. alternatively, the bogus pipeline manipulation might not have worked as it did in the original study. we had used different equipment (a heart rate monitor plus hand electrodes instead of forehead electrodes plus hand electrodes) in a different setting (a neutral, almost empty room instead of a slightly messy laboratory with many cables lying around) and sampled from a different population (via a volunteer participant e-mail list instead of first-year undergraduates) with different incentives (cash payment instead of course credit). potentially, any of these factors or their combination undermined the credibility of our bogus pipeline manipulation. in fact, unlike in the previous study, we had no evidence for the validity of the procedure. in our original study, we had included an affective misattribution procedure (payne, cheng, govorun, & stewart, 2005) as a measure of implicit antisemitism. as we expected, this measure correlated substantially with the explicit measure under bogus pipeline conditions (i.e., participants really self-report what they “feel”), but not under the control condition (where they corrected their responses in a socially desirable way).
in study 1, we had eliminated the indirect measure between the ongoing suffering manipulation and the dependent variable in an effort to streamline the procedure. nevertheless, we continued as planned with study 2.

study 2

in study 2, we aimed to test just-world theory as an explanation of the effect of ongoing jewish suffering against the hypotheses of guilt defense and the protection of a positive social identity. we did so by introducing a condition in which just-world theory would make a different prediction than guilt defense or social identity theory. just-world beliefs should be threatened by unjustly suffering victims in any case, irrespective of who the perpetrator is. in contrast, not every case of injustice should result in increased guilt or in a threatened positive social identity. only if the perpetrators are members of the in-group (in this case, germans) should one be motivated to derogate the victims. accordingly, we manipulated the group membership of the perpetrators.

method

participants. we again aimed for a final sample of 120 participants. one hundred and eighty-five first-year psychology students (27 men, 158 women; mean age: 22.29, sd = 4.88) from the university of cologne, germany, participated in an online study at the first measurement. seventy-eight participants dropped out between the first and the second measurement occasion (42%). the post-test data of two participants had to be excluded because they provided participant codes not included in the dataset of the pretest. we excluded nine further participants before running the analyses because they did not remember that the historical text they had read contained information about ongoing negative consequences for the victims or because they did not remember who the perpetrators had been. the remaining sample of n = 96 (86 women, 10 men) ranged from 17 to 39 in age (m = 21.55, sd = 4.11). participants received 12 eur for their participation (approx. 7.50 eur per hour).

measures.
the main dependent variable in study 2 was explicit prejudice against the victim group participants read about during the experiment. depending on the experimental condition, participants responded to items measuring prejudice against jews or chinese. we chose ten items from the antisemitism scale (imhoff, 2010; cronbach's α = .72) that could be modified to assess prejudice against chinese (e.g., “chinese have too much influence on public opinion”; cronbach's α = .80). the prejudice items were supplemented by four items on collective guilt (e.g., “i can easily feel guilty for the negative consequences that were brought about by germans [japanese]”; cronbach's α = .83 and .62, respectively) and, for participants who read about the holocaust, by two items on primary and five items on secondary antisemitism for exploratory purposes. study 2 included the same measures of potential moderators and additional variables as study 1, except that we excluded the tosca-3 (but kept the gasp as a measure of guilt proneness), the in-group attachment and glorification scales (but kept the measure of collective narcissism), and the abi (measuring anxiety coping styles that might be related to the tendency to avoid – and therefore misremember – threatening information). in addition, we included the following measures on a purely exploratory basis: a response latency-based measure of prejudice (adapted from vala, pereira, eugênio, lima, & leyens, 2012), a rating of jews and chinese on eight warmth-related traits, and a feeling thermometer assessing feelings towards these groups (among other groups). the biopac system, which had the main purpose of serving as the bogus pipeline setup (see below), was also used to record electrodermal activity data for exploratory purposes. we did not analyze the physiological data, but the raw data can be obtained from the authors.

independent variables.
we manipulated group membership of the perpetrators by presenting participants with either a text about the holocaust (which was the same as in study 1) or a text about the ongoing suffering of chinese victims of the nanking massacre committed by japanese troops. in both conditions, the last paragraph stressed the ongoing negative consequences for present-day jews or chinese, respectively. presentation of the text differed from study 1 in that the whole text was shown on the screen at once, whereas the individual paragraphs moved across the screen in study 1. in contrast to study 1 and more similar to the original study (imhoff & banse, 2009), we operationalized the bogus pipeline manipulation as measuring electrodermal activity under the pretext of lie detection vs. no physiological measurement at all. participants in the bogus pipeline condition were informed that “specific parameters of electrodermal activity allow us to detect whether someone answers truthfully or with a lie”. subsequently, the experimenter attached the electrodes of a biopac system to the palmar surfaces of the participants’ index and middle fingers. in order to increase the credibility of the bogus pipeline, the experimenter continued with an alleged calibration that required participants to follow some instructions while the experimenter was monitoring the physiological parameters at another computer behind a room divider. specifically, participants were instructed to take a deep breath and hold the breath for a moment. after that, the experimenter asked participants to memorize a number between one and six printed on a card (which was a 4 in every case). analogous to a concealed information test, the experimenter then read a series of numbers that could have been on the card, and participants were instructed to answer “yes” to every number, whether accurate or not.
after a few seconds, participants were informed that the apparatus was working properly and that they were ready to start with the study. participants in the control condition received no treatment at all. procedure. the first measurement of explicit prejudice against the victim group alongside assessment of the potential moderator variables was obtained in a classroom testing session (time 1). after an interval of five or six months, participants were invited to the laboratory for an individual session (time 2). participants were randomly assigned to one of four groups in a 2 (group membership of the perpetrators: in-group vs. out-group) × 2 (bogus pipeline vs. control) design. after the bogus pipeline manipulation had been administered, participants gave demographic information and read a neutral text about the history of an abandoned town, which served as a control task for the assessment of electrodermal activity, involving reading but without injustice-related content. in the bogus pipeline condition, this reading task was preceded by a three-minute baseline measurement of electrodermal activity. after this initial reading task, participants were given two minutes to write down on a piece of paper their thoughts about the text. after a one-minute rest period, participants were presented with the critical text that contained the manipulation of the perpetrators’ group membership and again wrote down their thoughts. subsequently, participants completed 48 trials of the response latency-based measure of prejudice and answered the prejudice questionnaire. finally, they were asked whether the text contained information on ongoing negative consequences for the victims (“yes” or “no”) and who the perpetrators had been as a manipulation check (“the red army”, “japanese troops”, “ss officers”, or “american soldiers”). results the stability of antisemitism was lower than in study 1, r(44) = .57, p < .001. 
the stability of prejudice against chinese was r(52) = .72, p < .001. prejudice against the victim group was analyzed as change in prejudice between both measurement occasions exactly as in study 1. the standardized residual change scores were subjected to a 2 (group membership of the perpetrators: in-group vs. out-group) × 2 (bogus pipeline vs. control) anova. results revealed neither a significant main effect of the bogus pipeline manipulation, which would have been predicted by just-world theory, f(1, 92) = 0.05, p = .830, ηp2 = 0.00, nor an interaction effect, which would have been predicted by guilt-defense and social identity theory, f(1, 92) = 0.01, p = .919, ηp2 = 0.00. the only significant experimental effect was a (hard-to-explain) main effect of victim group, f(1, 92) = 15.74, p < .001, ηp2 = 0.17: whereas antisemitic prejudice showed a relative decrease compared to t1, the opposite was true for anti-chinese prejudice (figure 3). separate moderator analyses confirmed this result for participants high in collective narcissism (see table osm.5), just-world beliefs (see table osm.6), and guilt proneness (see table osm.7). figure 3. change in explicit prejudice against the victims (standardized residuals) from time 1 to time 2 as a function of the group membership of the perpetrators and the bogus pipeline manipulation in study 2. error bars represent standard errors of the mean. discussion studies 1 and 2 failed to replicate the basic effect of an increase in antisemitism in response to the manipulation of the ongoing suffering of jewish victims, which had been reported in the original study (imhoff & banse, 2009). in light of this repeated failure to replicate the interaction of bogus pipeline and ongoing suffering, we decided to switch gears and focus on establishing the basic effect.
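for readers who want to reproduce this type of analysis, the standardized residual change scores described above (time-2 scores regressed on time-1 scores, with the standardized residuals serving as the dependent variable) can be sketched as follows. this is a minimal illustration in python with simulated data, not the original analysis script; all variable names and simulated values are ours:

```python
import numpy as np

def standardized_residual_change(t1, t2):
    """Regress time-2 scores on time-1 scores and return the
    standardized residuals, i.e., change not predictable from t1."""
    t1, t2 = np.asarray(t1, float), np.asarray(t2, float)
    # np.polyfit with deg=1 returns [slope, intercept] for t2 ~ t1
    slope, intercept = np.polyfit(t1, t2, deg=1)
    residuals = t2 - (intercept + slope * t1)
    return (residuals - residuals.mean()) / residuals.std(ddof=1)

# simulated pre/post prejudice scores for n = 96 (illustrative only)
rng = np.random.default_rng(1)
t1 = rng.normal(3.0, 0.7, 96)                # prejudice at time 1
t2 = 0.6 * t1 + rng.normal(1.2, 0.5, 96)     # time 2, partly stable
z = standardized_residual_change(t1, t2)
print(round(float(z.mean()), 3), round(float(z.std(ddof=1)), 3))
```

the resulting scores have mean 0 and standard deviation 1 by construction and could then be submitted to the 2 × 2 anova exactly as described in the text.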
the bogus pipeline manipulation appeared to us as the most plausible candidate to explain this failure. clearly, participants needed to place a lot of trust in the researchers to believe that untruthful responding could indeed be detected. in contrast to the time when bogus pipelines were originally proposed in the early 1970s (e.g., sigall & page, 1971), current students are very likely aware of the fact that a simple “lie detector” is a gadget from fictional literature, not a real thing. based on the working hypothesis that lie detection machines have been too thoroughly debunked in public discourse to affect participants’ responding, we turned to another popular approach to circumvent socially desirable responding: more subtle measures. studies 3a – 3c in studies 3a to 3c, we investigated whether the very basic effect shown in the original study (imhoff & banse, 2009) – germans show increased antisemitism when confronted with the holocaust – is detectable. as we were not confident in the effectiveness of the bogus pipeline manipulation given the results of studies 1 and 2, we employed an alternative approach to addressing the problem of measuring antisemitic attitudes, which are socially very undesirable to express. instead of a bogus pipeline setup, we adopted a reverse-correlation paradigm as a subtle, indirect measure of prejudice. if confronting germans with the crimes their ancestors committed against jews results in them becoming more antisemitic, we expected germans to remember the face of a jewish person as more negative when the holocaust is mentioned at the initial confrontation with this person. to test this hypothesis, we asked participants to form a first impression of a person who was either jewish or christian. in addition, we manipulated whether the text about this person contained information about the holocaust or not.
participants then completed a reverse-correlation image-classification task based on the memory they had of the target person’s face, which allowed us to visualize the remembered facial appearance of that person. we replicated this study twice (studies 3b and 3c) with minor changes regarding the materials, as explained below. method participants. seventy-eight psychology students from the university of cologne, germany, were recruited via mailing lists, flyers, social networks, or by being personally approached on the university campus to take part in study 3a. based on a priori set criteria (see below), we excluded 17 participants before running the analyses because they did not remember correctly that the target person was jewish [vs. christian] or that he was volunteering in an organization that supports holocaust survivors [vs. an organization working to protect forests] or both. the remaining sample of n = 61 (47 women and 14 men) ranged from 20 to 49 years in age (m = 24.67, sd = 5.82). participants received course credit for their participation. roughly 120 students from different fields of study participated in exchange for 4 eur in study 3b (n = 121) and study 3c (n = 120), respectively. the effective sample size after exclusions based on the same criteria as in study 3a was n = 94 (50 women and 44 men; age 18 to 38 years, m = 22.71, sd = 3.42) in study 3b, and n = 89 (59 women, 29 men, one participant did not indicate; age 18 to 40 years, m = 23.22, sd = 4.23) in study 3c. independent variables. the session started with an impression formation task that contained the manipulation of both independent variables. participants read a short text about a person containing irrelevant information about that person’s job, residence, and leisure time, and, critically, cues to the person’s religious affiliation and a sentence mentioning the holocaust or a control issue. participants were told that the person was active in his synagogue [vs. 
church] and volunteered with an organization that helps holocaust survivors because his grandfather had been murdered in the auschwitz concentration camp [vs. an organization working to protect forests]. in studies 3b and 3c, we introduced minor changes in the manipulations. specifically, we reasoned that volunteer work in any religious group might be seen as a cue to morality or other positive traits. in studies 3b and 3c, religious affiliation was thus made salient without implying volunteer work: the sentence containing the manipulation of group membership was changed so that the target person was not active in a synagogue or church but had been asked whether he wanted to become active in his father’s synagogue [vs. church]. participants in the holocaust condition read that the target person was involved in an organization demanding reparation payments for holocaust survivors (whereas he was working for another charity not related to the holocaust in the other condition). in contrast to study 3a, the text contained no information about any victims among his family members to eliminate potential effects of direct sympathy. in each of the three versions of study 3, the text about the target person was accompanied by a picture showing the face of a young man. in studies 3a and 3b, the face image was the averaged neutral male face of the karolinska directed emotional faces database (lundqvist & litton, 1998), whereas we used a morph of sixteen emotionally neutral faces in frontal view taken from the radboud faces database (langner et al., 2010) in study 3c. both images have been used in previous reverse-correlation research (e.g., dotsch et al., 2008, and imhoff et al., 2013, respectively). central dependent variable: reverse-correlation image-classification task.
we relied on reverse correlation to assess whether participants’ memory of a person’s face is biased by information on that person’s group membership and mention of the holocaust. reverse correlation is a data-driven approach that enables researchers to visualize an idealized decision criterion. by tracking which kinds of subtle (and random) alterations in the appearance of a face correlate with a classification decision (e.g., which of two faces looks more female; mangini & biederman, 2004), one can estimate what a face that fulfills all criteria in an ideal way looks like (the classification image). beyond very basic decisions (e.g., male vs. female), and more relevant to this study, reverse-correlation techniques can be used to construct images that reflect the expected or remembered facial appearance of a target person without making any a priori assumptions about relevant features. previous studies applied this approach to investigate the biased expected facial appearance of out-group members (dotsch, wigboldus, langner, & van knippenberg, 2008; dotsch, wigboldus, & van knippenberg, 2013; imhoff & dotsch, 2013; imhoff, dotsch, bianchi, banse, & wigboldus, 2011) and of previously encountered individuals (karremans, dotsch, & corneille, 2011). for instance, karremans et al. (2011) found that people involved in a romantic relationship held a less attractive memory of an attractive alternative’s face than uninvolved individuals. when asked to select a face that best represents a typical member of a certain social group (e.g., manager, nursery teacher), stereotypical beliefs about these groups’ warmth as well as competence are encoded in the face and can be decoded from the classification image by independent perceivers (imhoff, woelki, hanke, & dotsch, 2013). image creation. subsequently, participants worked through the reverse-correlation task, which allowed us to obtain visualizations of the participants’ memories of the target face.
we used a two-image forced-choice variant of the reverse-correlation paradigm (e.g., dotsch et al., 2008; imhoff et al., 2011), in which each participant completed 400 trials of selecting one of two presented faces. in each of these trials, they selected the face that they thought looked more like the target person they had seen before (i.e., during the impression formation task). the stimuli used in the picture classification task were all based on the face they had seen on the page about the target person. to generate the stimuli, this base image had been converted to grayscale and superimposed with random noise, resulting in random variations of the facial appearance between the stimuli (for noise generation, see dotsch & todorov, 2011). every trial employed a different noise pattern, with the original pattern displayed on the left and the negative of that pattern on the right side of the screen. participants selected pictures by pressing a left or right button on the keyboard. by averaging all noise patterns participants had selected separately for each experimental condition and superimposing these classification patterns on the base image, we obtained a classification image for every condition (see figure 4). trials with a response time lower than 200 ms were excluded before constructing the classification images (<5% of the trials). the resulting classification images visualized how participants in each of the four experimental groups remembered the target face on average. in addition to the classification images aggregated on a group level, we also analyzed classification images of individual participants in studies 3b and 3c in order to explore the possibility that derogation of victims could occur on inter-individually different dimensions and hence be reflected in different facial features. figure 4.
classification image as a function of information about the holocaust (holocaust is mentioned vs. control) and group membership of the target person (jewish vs. christian) in study 3a. image rating. in the second phase of study 3, the classification images created by every experimental group in the first phase were rated on warmth (cronbach’s α between .84 and .92) and competence (cronbach’s α between .72 and .90) by 56 independent participants recruited via amazon mechanical turk (mturk; 30 women and 26 men, age 18 to 75 years, m = 38.57, sd = 14.59; study 3b: n = 43, 20 women and 23 men, age 20 to 66 years, m = 37.93, sd = 13.04; study 3c: n = 64, 40 women and 23 men, one person did not indicate, age 18 to 71 years, m = 35.56, sd = 12.68). five other participants were excluded because they indicated that they had answered randomly or purposely falsely, or that they would exclude their data if they were the researcher (six exclusions in study 3b and four in study 3c). the warmth and competence items were the same as in the first phase of the study. responses were made using a five-point scale ranging from 1 (strongly disagree) to 5 (strongly agree). every rater in this second phase of the study rated each of the four group-wise classification images. accordingly, ratings were analyzed using within-subjects tests. the warmth ratings of the classification images constituted the main dependent variable. the individual classification images from the first phases of studies 3b and 3c were rated by independent participants by indicating “how likable” they found each of the persons. participants were paid 25 cents in study 3a and 50 cents in studies 3b and 3c. additional measures. after completing the reverse-correlation task, participants were probed for suspicion using a funneled debriefing procedure (cf.
chartrand & bargh, 1996) and were then asked to indicate their first impression of the target person by a) describing the person in their own words and b) rating the person’s warmth and competence. for the warmth and competence ratings, participants indicated to what extent each of 20 adjectives representing warmth (5 items, e.g., “good-natured”, cronbach’s α = .82) and competence (4 items, e.g., “competent”, cronbach’s α = .69; fiske et al., 2002) characterized the target person on a five-point scale (1 = not at all to 5 = very much). in studies 3b and 3c, we excluded the question asking participants to describe the target person and replaced the warmth and competence items with ten items assessing likability of the target person (e.g., “how likable do you find david s.?”), which also included five reverse-coded items representing common negative stereotypes about jews (e.g., “how stingy do you find david s.?”). these ten items were combined into a single explicit likability scale, cronbach’s α = .84 in study 3b and .88 in study 3c. next, participants answered ten (in studies 3b and 3c, six) questions about the target person, of which three served as a manipulation check, and gave demographic information. finally, they completed an antisemitism questionnaire (only in study 3a) consisting of 14 items taken from the scale used in study 1 (imhoff, 2010; cronbach’s α = .86). in studies 3b and 3c, we included a word stem completion task to explore whether representations of the holocaust were successfully activated in the holocaust condition. this task was administered after the warmth and competence ratings and asked participants to complete 30 word stems of which ten could be completed to form a word related to the holocaust (e.g., “endl_____” could be completed to “endlösung” [final solution]). answers on the ten critical items were coded as holocaust-related or not by a single rater and aggregated to a sum score.
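as an aside for readers unfamiliar with the technique, the construction of classification images described in the image creation section above (averaging the noise patterns a participant selected, after excluding trials faster than 200 ms, and superimposing the mean pattern on the base image) can be sketched roughly as follows. this is an illustrative python sketch with simulated data, not the software used in the studies; array sizes and names are ours:

```python
import numpy as np

def classification_image(base, noises, chosen, rts, rt_cutoff_ms=200):
    """Average the noise patterns selected across trials (dropping
    too-fast responses) and superimpose the mean pattern on the base image.

    base   : (H, W) grayscale base face, values in [0, 1]
    noises : (n_trials, H, W) noise pattern shown on the left;
             the right-hand stimulus used its negative (-noise)
    chosen : (n_trials,) 0 = left face chosen, 1 = right face chosen
    rts    : (n_trials,) response times in ms
    """
    keep = rts >= rt_cutoff_ms                     # exclude fast trials
    sign = np.where(chosen[keep] == 0, 1.0, -1.0)  # right = negative pattern
    mean_noise = (sign[:, None, None] * noises[keep]).mean(axis=0)
    return np.clip(base + mean_noise, 0.0, 1.0)

# simulated example: 400 trials on a 16 x 16 toy "face"
rng = np.random.default_rng(7)
base = np.full((16, 16), 0.5)
noises = rng.normal(0.0, 0.1, (400, 16, 16))
chosen = rng.integers(0, 2, 400)
rts = rng.normal(800, 250, 400)
ci = classification_image(base, noises, chosen, rts)
print(ci.shape)
```

averaging such per-participant images within an experimental condition yields the group-wise classification images that were rated in the second phase.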
furthermore, the positive and negative affect schedule (panas; watson, clark, & tellegen, 1988) was added in between the impression formation task and the reverse-correlation image-classification task in studies 3b and 3c for exploratory purposes. materials and procedure. participants were seated at a computer in individual cubicles and were randomly assigned to one of four experimental conditions following a 2 (group membership of the target person: jewish vs. christian) × 2 (holocaust is mentioned vs. control information) design. secondary antisemitism, we reasoned, would be exhibited in a face that independent others would perceive as less warm if the person was introduced as jewish and the holocaust was mentioned. results based on the idea of secondary antisemitism, we expected the classification images created by participants who were both presented with a jewish target person and reminded of the holocaust to be rated as less warm or likable than those from the other conditions. warmth ratings of the group-wise classification images were subjected to a 2 (group membership of the target person: jewish vs. christian) × 2 (holocaust is mentioned vs. control information) repeated measures anova. contrary to the hypothesis, in studies 3a and 3b results did not show a significant interaction effect, f(1, 55) = 0.03, p = .872, ηp2 = .00, and f(1, 42) = 0.02, p = .897, ηp2 = .00, respectively. in study 3c, a significant interaction effect emerged, f(1, 63) = 10.06, p = .002, ηp2 = .14. however, the pattern of means was contrary to expectations, as the classification image from the jewish condition was rated as warmer when participants were reminded of the holocaust (vs. control information). for the analysis of individual classification images, likability ratings were averaged across raters, yielding a mean likability rating for every individual classification image.
the likability scores were then submitted to a 2 (group membership of the target person: jewish vs. christian) × 2 (holocaust is mentioned vs. control information) between-subjects anova. neither for study 3b nor for study 3c did the anova reveal any differences between experimental conditions, tests of the interaction effects, f(1, 90) = 0.03, p = .871, ηp2 = .00 and f(1, 85) = 0.16, p = .693, ηp2 = .00, respectively. in addition to the primary analyses looking at the classification images reported above, we explored the explicit ratings of the target person’s warmth (study 3a) and likability (studies 3b and 3c). between-subjects anovas did not yield an interaction effect in any of the studies, f(1, 57) = 0.02, p = .879, ηp2 = .00 in study 3a, f(1, 90) = 2.97, p = .088, ηp2 = .03 in study 3b, and f(1, 85) = 0.38, p = .537, ηp2 = .00 in study 3c. to explore whether representations of the holocaust were activated to a higher degree in the conditions mentioning the holocaust than in the control conditions, we compared the number of holocaust-related answers in the word stem completion task. in contrast to our expectations, the sum of holocaust-related answers was not significantly higher in the holocaust conditions (m = 2.22, sd = 1.74 in study 3b and m = 1.58, sd = 1.18 in study 3c) than in the control conditions (m = 1.66, sd = 1.58 and m = 1.43, sd = 1.17), t(92) = 1.63, p = .108, hedges’s gs = 0.33, 95% ci [-0.08, 0.74] and t(87) = 0.59, p = .559, hedges’s gs = 0.12, 95% ci [-0.29, 0.54], respectively. discussion studies 3a to 3c failed to provide any evidence for the notion that making the holocaust salient increases participants’ need to derogate the victim group. if anything, the effect was in the opposite direction in one study, but it was not found reliably in the other studies. this invites speculation as to whether the chosen measure is indeed immune to social desirability concerns.
although it is not explicitly an evaluation task, participants are of course free to take all the time they need to select images according to whatever impression they want to convey of themselves (e.g., as particularly unprejudiced). it may thus be that the measure taps into participants’ very explicit and elaborate evaluation as much as typical prejudice scales do. the unexpected effect (somewhat reminiscent of the pattern in the no bogus pipeline condition in the original paper) is compatible with this interpretation, but the lack of any effect in the following studies does not corroborate this speculation. at present, there is no consistent effect (in any direction) of making the atrocities of the holocaust salient. as perhaps a side effect rather than the focus of the current interest, we were also not able to produce consistent effects on what we perceived as a simple manipulation check: a word stem completion task. the logic was that making the history of the holocaust salient should increase participants’ tendency to complete ambiguous word stems in a semantically consistent way. such tasks are highly popular instruments in the field of social cognition to tap into the semantic accessibility of certain constructs (or concept activation). while our failure to find any effect in such measures may raise doubts about their validity, it should be noted that the employed task was constructed ad hoc without proper pilot testing of base rates of word completion tendencies. in our own lab, we have gained experience with such tasks in other domains (i.e., to what extent pictures of pregnant women or real pregnant women make baby-related word completions more likely) with more success (marhenke & imhoff, 2018). we would thus caution against throwing the baby out with the bath water based on the failure presented here.
at the same time, we caution that it is bad practice how naïvely we and other colleagues construct such measures ad hoc and interpret them as valid as long as they produce the desired effects, but discard them as unreliable and invalid if they do not. another reason for the failure to replicate the effect could be the population we sampled from in studies 1–3. most were student samples from the university of cologne, more specifically the school of humanities with a specialization in special needs education. students from this school have a reputation for being particularly liberal (and their average self-reported political orientation was left of the scale midpoint in both studies 1 and 2), whereas students in the original study were psychology students who do not necessarily have the same reputation. to increase our chances of finding support for the mechanism of secondary antisemitism, we thus changed the research setting to a less restricted sample that might not have egalitarian norms to the same extent. we therefore conducted two studies in the city center of cologne with pedestrians from the general population as participants. studies 4a and 4b to include more politically diverse participants, we recruited individuals walking in front of the main station in cologne, germany, to fill in a “short survey on opinions on violent conflicts”. instead of open antisemitic expressions, we used agreement to criticism of israel as a dependent variable. we assumed that criticism of israel would be perceived as less taboo and thus be reported openly in a questionnaire, so that we would not need a bogus pipeline setup. this approach was built on the notion that not only are anti-israeli sentiment and antisemitism highly correlated in europe (kaplan & small, 2006), but certain forms of criticism of israeli politics are also construed as a substitute communication.
demonizing israel is socially more accepted than demonizing jews (steinberg, 2004), but – in the context of secondary antisemitism – serves the same purpose: by portraying the (jewish) state of israel as a ruthless perpetrator of human rights violations, the (german) crimes against jews become less salient (i.e., victim-perpetrator reversal; imhoff, 2010). in line with the hypothesis of secondary antisemitism, we expected participants to show higher agreement (relative to a control condition) to statements criticizing israel after being reminded of the holocaust. this effect might be greater for individuals high in national glorification. method participants. one hundred passers-by approached in front of the main station of cologne, germany, participated in study 4a (57 women and 43 men). participants ranged from 17 to 76 years in age (m = 33.87, sd = 14.59). for study 4b we recruited 196 passers-by (119 women, 73 men, four did not indicate their gender) ranging from 14 to 63 in age (m = 27.55, sd = 12.03). another four participants were excluded before running the analyses because of missing responses on more than 50% of the items of the main dependent variable. in both studies, we included an attention check by asking participants in the last sentence of the instructions to write an x on the page margin. as a very high proportion of participants failed this attention check (45% in study 4a and 27% in study 4b), we decided to keep these participants in the sample. the results reported below do not change when these participants are excluded. participants received no compensation. in study 4b, participants scored higher on national glorification (m = 2.32, sd = 1.22) than our student sample in study 1 (m = 1.72, sd = 0.88), t(276) = 4.02, p < .001, hedges’s gs = 0.52, 95% ci [0.26, 0.79] using the same three items. although a mean of 2.32 is still relatively low on a seven-point scale, our goal of acquiring a less liberal sample was achieved. materials and procedure.
the study was conducted in summer 2014 during the 2014 israel-gaza conflict. participants were approached by the experimenter and asked to participate in a “short survey on opinions on wars and violent conflicts” (in study 4b, “on the israeli-palestinian conflict”). they were then handed a two-page paper-and-pencil questionnaire and were randomly assigned to one of two experimental conditions in which they were either reminded of the holocaust or not. the manipulation of being reminded of the holocaust vs. a control condition was embedded in the instructions of the questionnaire. in study 4a, participants in the holocaust condition read that “70 years have passed since the monstrous german crime, the holocaust. within 4 years, the germans systematically killed 6 million jews in extermination camps like auschwitz.” in the control condition, the first part of the instructions read, “history of humanity is a sequence of wars”, making no reference to the holocaust. participants then indicated their agreement to 13 statements criticizing israel (cronbach’s α = .77), which had been taken from the existing literature (e.g., “israel is a state that stops at nothing”; kempf, 2014). to make the cover story (a survey on wars and violent conflicts) more credible, the questionnaire also included ten items on two other wars, five on the war in ukraine and five on the war in syria, which were not analyzed. in study 4b, we included a more detailed description of the holocaust in the holocaust condition emphasizing that a) most germans from all parts of german society participated in the genocide or willfully ignored the crimes and that b) jews are still suffering today as a result of the holocaust. besides the text, the holocaust condition included a picture showing corpses of prisoners of the buchenwald concentration camp.
the information about the holocaust was introduced by referring to the public discussion in germany about the role of the nazi past for the contemporary relations to israel. the control condition in study 4b was a baseline measure that simply asked participants to report their opinion on the israeli-palestinian conflict. the main dependent variable, criticism of israel, was assessed using an 18-item scale (as compared to 13 items in study 4a; cronbach’s α = .84). this scale comprised the same 13 items that had been used in study 4a and five more items that were newly created on the basis of actual comments on the 2014 israel-gaza conflict from the media (e.g., “the war that israel initiated against gaza is completely unjustifiable. it is a crime, whether from the air or on the ground.”). to test whether holocaust reminders increase criticism of israel especially among germans who glorify their national group, we included three of the items by roccas et al. (2008) on national glorification in study 4b (cronbach’s α = .62). in study 4b, we also included five items from the antisemitism scale (imhoff, 2010), four items on group-based shame, three items on national attachment, and, for exploratory purposes, an open-ended question on participants’ understanding of german identity. results neither study 4a nor study 4b provided evidence for an increase in criticism of israel as a reaction to being reminded of the holocaust, t(98) = -0.29, p = .776, hedges’s gs = -0.06, 95% ci [-0.45, 0.33] and t(194) = 0.14, p = .890, hedges’s gs = 0.02, 95% ci [-0.26, 0.30], respectively. participants who read about the holocaust did not criticize israel more (study 4a: m = 3.02, sd = 0.72; study 4b: m = 3.78, sd = 0.85) than those in the control condition (study 4a: m = 3.06, sd = 0.91; study 4b: m = 3.76, sd = 0.88).
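the effect size used throughout these studies, hedges’s gs with a 95% confidence interval, can be computed from group summary statistics. the sketch below uses the small-sample correction factor j and a standard large-sample variance approximation (borenstein, hedges, higgins, & rothstein, 2009). it is an illustration only: the cell sizes plugged in are assumed to be equal, which the text does not report, and the published intervals may rest on a different (e.g., noncentral-t) method, so the output will only approximate the values reported above:

```python
import math

def hedges_gs(m1, sd1, n1, m2, sd2, n2):
    """Between-subjects Hedges's g_s with an approximate 95% CI
    (small-sample correction J; large-sample variance approximation)."""
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / s_pooled                  # Cohen's d_s
    j = 1.0 - 3.0 / (4.0 * df - 1.0)          # small-sample correction
    g = j * d
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2.0 * (n1 + n2))
    se_g = j * math.sqrt(var_d)
    return g, (g - 1.96 * se_g, g + 1.96 * se_g)

# summary statistics from study 4a; equal cell sizes are an assumption
g, (lo, hi) = hedges_gs(3.02, 0.72, 50, 3.06, 0.91, 50)
print(round(g, 2), round(lo, 2), round(hi, 2))
```

with these assumed inputs the point estimate lands close to the reported gs = -0.06 with an interval straddling zero, consistent with the null result described in the text.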
this result also held for participants high in national glorification, as revealed by a hierarchical multiple regression analysis. in a first step, the effect-coded group variable (holocaust = 1; control = -1), attachment to germany, and glorification of germany were entered as predictors of criticism of israel, followed by three product terms representing all possible two-way interactions between the predictor variables in a second step, and the three-way interaction in a third step. the interaction between experimental condition and glorification of germany did not significantly predict criticism of israel in the full model, β = .05, t(187) = 0.60, p = .550.

study 5

in study 4, we tried to assess antisemitism in a less blatant form, harsh criticism of israel. in study 5, we accounted for the possibility that germans might be equally sensitized to criticism of israel as to open antisemitic statements. it might be the case that, when confronted with statements criticizing israel in a questionnaire, germans retrieve learned answers, leaving little room for situational influences. however, emotional reactions and behavior towards individual persons might be elicited more spontaneously and thus be more responsive to situational influences. therefore, we turned away from antisemitism and investigated a more subtle reaction in the area of intergroup emotions and intergroup behavior: does reminding germans of the holocaust result in decreased empathy and support towards israeli victims of rocket attacks? again, we also aimed at testing whether this presumably defensive reaction is dependent on the level of national glorification.

method

participants. ninety-eight students (53 women and 45 men) of different fields of study from the university of cologne, germany, participated in exchange for a bar of chocolate and the opportunity to win 50 eur in a raffle. participants ranged from 17 to 56 years in age (m = 22.97, sd = 5.68).
another two participants dropped out during the experiment and were deleted from the data set.

materials and procedure. participants were seated at computers in individual cubicles and were randomly assigned to one of two experimental conditions (holocaust reminder vs. control). after reading the instructions, in which they were informed that the study was about german perceptions of israel, participants started by reporting their age, gender, educational status, citizenship, and whether their family had a history of migration. next, they were either reminded of the holocaust or not. in the holocaust reminder condition, participants read the same text that had already been used in study 4b. the control condition was a baseline condition in which participants were simply told that “we are interested in your perception of different aspects of israel.” subsequently, participants were presented with a short (47 s) television news video about the gaza-based rocket attacks on a village in israel. after presentation of the video, empathy towards the israeli victims was assessed using six adjectives from the empathy literature (e.g., “compassionate”; cronbach’s α = .73; batson, fultz, & schoenrade, 1987). using a seven-point scale (1 = not at all to 7 = extremely), participants indicated for each of the items to what extent they had felt the given emotion while watching the video about the situation of the israeli victims. the empathy items were presented among ten filler items – eight distress adjectives (e.g., “worried”) and two guilt adjectives (e.g., “guilty”) – for exploratory purposes.
imhoff & messer 16

in order to increase credibility of the cover story that the study was about german perceptions of different aspects of israel, participants were presented with another short video, a report about young israelis protesting against housing shortage and high rent prices, and indicated their agreement to six statements about the social protests (e.g., “i can easily identify with the requests of the protesters.”). subsequently, participants answered the eight-item national attachment scale (cronbach’s α = .86) and the eight-item national glorification scale (cronbach’s α = .79). finally, they were presented with a screen saying that the study was over and that they could participate in a raffle offering the opportunity to win 50 eur. participation in the raffle included a measure of financial support of the israeli victims as a second dependent variable (in addition to empathy). participants were invited to pledge to donate a portion of their choice of these 50 eur (between 0 and 50 eur) to an organization supporting the israeli victims in case they won the raffle. the text included a description of the charity organization.

results

empathy scores and donation pledge amounts were subjected to independent-samples t-tests. empathy scores did not significantly differ between participants who were reminded of the holocaust (m = 4.03, sd = 1.07) and those in the control group (m = 3.72, sd = 0.94), t(96) = -1.53, p = .129, hedges’s gs = -0.31, 95% ci [-0.70, 0.09]. likewise, results revealed no significant group difference in the amounts participants pledged to donate in support of the israeli victims (holocaust reminder: m = 18.48, sd = 13.41; control: m = 21.52, sd = 15.26), t(46) = 0.74, p = .466, hedges’s gs = 0.21, 95% ci [-0.78, 0.36].
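the moderation tests in studies 4b and 5 rest on the same hierarchical regression structure: an effect-coded condition variable, attachment, glorification, and their products. as a minimal sketch of that model structure (python, with simulated data containing a known interaction; all variable names here are illustrative, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# predictors: effect-coded condition (holocaust = 1, control = -1),
# plus mean-centered attachment and glorification scores
cond = rng.choice([-1.0, 1.0], size=n)
attach = rng.normal(size=n)
glor = rng.normal(size=n)

# simulated, noiseless outcome with a known condition-by-glorification
# interaction of 0.4, purely to show that the fit recovers it
y = 0.3 * cond + 0.2 * attach + 0.5 * glor + 0.4 * cond * glor

# full (step 3) design matrix: intercept, main effects,
# all two-way products, and the three-way product
X = np.column_stack([
    np.ones(n), cond, attach, glor,
    cond * attach, cond * glor, attach * glor,
    cond * attach * glor,
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b_cond_x_glor = beta[5]  # the moderation term of interest
```

in the papers' analyses, the coefficient of the condition-by-glorification product in the full model is the term whose t-test is reported.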
to test whether level of national glorification moderated the hypothesized effect of holocaust reminders on empathy, we conducted a hierarchical multiple regression analysis with the effect-coded group variable (holocaust = 1; control = -1), attachment to germany, glorification of germany, and all possible two-way and three-way interaction terms as predictors. in contrast to our hypothesis, the product term representing the interaction between glorification of germany and experimental condition did not significantly predict empathy in the full regression model, β = -.20, t(90) = -1.29, p = .200.

table 1
means, t-tests, and effect sizes for simple effects that would indicate secondary antisemitism across all studies.

study 1: change in explicit antisemitism – ongoing consequences (n = 22), m = 0.11 (sd = 0.90); no ongoing consequences (n = 22), m = -0.00 (sd = 0.78); t(42) = 0.44, p = .666, hedges’s g = 0.13, 95% ci [-0.45, 0.71], se(g) = 0.30.
study 2: change in explicit antisemitism – bogus pipeline (n = 21), m = -0.47 (sd = 0.58); control (n = 23), m = -0.41 (sd = 0.87); t(42) = -0.27, p = .791, hedges’s g = -0.08, 95% ci [-0.66, 0.50], se(g) = 0.30.
study 3a: warmth ratings of classification images (n = 56) – jewish/holocaust, m = 3.02 (sd = 0.84); jewish/control, m = 3.51 (sd = 0.96); t(55) = 2.89, p = .006, hedges’s g = 0.54, 95% ci [0.15, 0.93], se(g) = 0.20.
study 3b: warmth ratings of classification images (n = 43) – jewish/holocaust, m = 3.00 (sd = 0.89); jewish/control, m = 3.16 (sd = 0.86); t(42) = 1.11, p = .272, hedges’s g = 0.17, 95% ci [-0.14, 0.48], se(g) = 0.16.
study 3c: warmth ratings of classification images (n = 64) – jewish/holocaust, m = 3.10 (sd = 0.84); jewish/control, m = 2.61 (sd = 0.78); t(63) = -4.49, p < .001, hedges’s g = -0.60, 95% ci [-0.88, 0.31], se(g) = 0.15.
study 4a: criticism of israel – holocaust (n = 50), m = 3.02 (sd = 0.72); control (n = 50), m = 3.06 (sd = 0.91); t(98) = -0.29, p = .776, hedges’s g = -0.06, 95% ci [-0.45, 0.33], se(g) = 0.20.
study 4b: criticism of israel – holocaust (n = 100), m = 3.78 (sd = 0.85); control (n = 96), m = 3.76 (sd = 0.88); t(194) = 0.14, p = .890, hedges’s g = 0.02, 95% ci [-0.26, 0.30], se(g) = 0.14.
study 5: empathy towards israeli victims of rocket attacks – holocaust (n = 50), m = 3.72 (sd = 0.94); control (n = 48), m = 4.03 (sd = 1.07); t(96) = -1.53, p = .129, hedges’s g = -0.31, 95% ci [-0.70, 0.09], se(g) = 0.20.
note: the effect size measure used in studies 1, 2, 4a, 4b, and 5 is hedges’s gs. in studies 3a through 3c, we used hedges’s gav for the difference of two correlated measurements, as recommended by lakens (2013).

discussion

study 5 did not provide evidence for the notion that being reminded of the holocaust makes germans (who score high on national glorification) less empathic with israeli victims of palestinian rocket attacks. we thus, again, failed to observe data patterns in line with secondary antisemitism. although the study did not have particularly high statistical power according to current standards, the effect on empathy was – if anything – in the unexpected direction (hedges’s g of 0.31).

meta-analysis across studies

before discussing the findings, we want to integrate them to give an overall impression of the obtained evidence or lack thereof. to do so, we calculated simple effects for each study that would indicate secondary antisemitism (table 1). although we predicted interactions for the first three studies, we decided to select simple comparisons instead of effect sizes of interaction terms, because simple comparisons are more informative regarding the direction of effects. for studies 1 and 3 (which have a 2x2 design), we focused on the comparison between conditions with reminders of either ongoing suffering or the holocaust and conditions without these reminders on the evaluation of jewish targets. for study 2, each condition included an ongoing suffering manipulation, and thus we compared the degree of (baseline-corrected) antisemitism in the bogus pipeline condition to that in the control condition. we calculated hedges’s gs (or gav, respectively; see lakens, 2013) for each of the studies and conducted a random-effects model using the r metafor package (viechtbauer, 2010).
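the authors fit the random-effects model with r's metafor package. as a hedged sketch of the same idea, the following python code reimplements a random-effects meta-analysis with a dersimonian-laird estimator, using the rounded g and se values from table 1. because of the rounding and because metafor's default estimator (reml) differs from dersimonian-laird, q, i², and the pooled estimate will only approximately match the paper's reported figures.

```python
import numpy as np

# simple-effect estimates (hedges's g) and standard errors from table 1
g  = np.array([0.13, -0.08, 0.54, 0.17, -0.60, -0.06, 0.02, -0.31])
se = np.array([0.30, 0.30, 0.20, 0.16, 0.15, 0.20, 0.14, 0.20])

w = 1 / se**2                          # fixed-effect (inverse-variance) weights
mu_fe = np.sum(w * g) / np.sum(w)      # fixed-effect pooled estimate
Q = np.sum(w * (g - mu_fe)**2)         # cochran's q
df = len(g) - 1
I2 = max(0.0, (Q - df) / Q) * 100      # heterogeneity as a percentage

# dersimonian-laird between-study variance, then random-effects pooling
C = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - df) / C)
w_re = 1 / (se**2 + tau2)
mu_re = np.sum(w_re * g) / np.sum(w_re)
```

with these inputs the pooled random-effects estimate lands very close to zero, with substantial heterogeneity, which is the qualitative pattern the text describes.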
the data show substantial heterogeneity, q(7) = 27.14, p < .001, i² = 72.26%, and meta-analytically again no evidence for any secondary antisemitism (figure 5). given the large heterogeneity, an average effect size of almost exactly zero may be taken as an indication that – despite underpowered studies – it was not merely small samples that prevented us from replicating the effect.

figure 5. forest plot of all studies reported in the manuscript. simple effects are coded so that positive effects speak to the hypothesis of secondary antisemitism.

general discussion

across a research program spanning two years and eight studies, we did not provide evidence for the notion that reminders of the holocaust evoke negative responses towards jews among german participants. none of the studies had particularly strong statistical power, and none was a direct replication in exact detail. nevertheless, their consistency in not producing any result has shaken our confidence in the very basic effect. in light of this, the original goal to better understand the involved processes has lost some of its relevance. as a caveat, although all of the studies reported here were conducted after the current debate on how to achieve more reproducible and reliable science had already taken off in 2011, the spirit behind this research program is still rooted in the “old way” of doing research. what we did (without success) was hunt down an effect, desperately seeking a way to “make it significant”. what we did not do is systematically plan for compelling evidence – in either direction. in hindsight, many steps and detours we made may seem premature, and a single extremely high-powered study might have been more advisable. that was not the typical procedure for conducting social psychological science in our book. instead, if the effect was not there, the researcher had the wrong method to show it and therefore needed to change that method.
independent of how we might have obtained more compelling evidence in either direction, all our studies consistently converged in not producing results in line with a very influential theory of post-holocaust antisemitism. what can we now make of this pattern? does it mean that the original study (imhoff & banse, 2009) was a false positive? although virtually everything in that original study fell exactly into place, luck might have played a trick on us and convinced us of something that was never there to begin with (despite a plethora of qualitative writings and discourse analyses along the same lines). in light of the present research, we would argue that this is entirely possible. assuming that the original study was indeed a false positive might mean that either the very idea of secondary antisemitism is wrong or – as a more modest interpretation – that psychologists’ illusions of omnipotence to translate a societal discourse and its dynamics into a pretty, 30-minute experiment are ill-advised. maybe scholarly interpretations of the effect of continuous holocaust reminders are indeed to the point, but it is naïve to emulate this in a cute little study. while we can only emphasize again that we are more than open to this possibility, we would – for the sake of the argument – like to entertain alternative explanations. under the (admittedly speculative) assumption that the original study was a true positive, how could we explain the absence of any effect in that direction in all studies reported here? one of the most parsimonious (and potentially cheap) explanations could rest on the assertion that – for whatever reasons – the bogus pipeline procedure just never worked as nicely as it did in 2007. as the original study attested, however, this procedure is crucial for finding the diverging effects of holocaust reminders.
the dissemination of psychological research and knowledge about the (missing) validity of “lie detectors” might have made it increasingly difficult to convince people of the operating principle of the bogus pipeline. in fact, lie detectors are debunked on a regular basis, not only in undergraduate psychology classes but also in popular media outlets. if that were true, the psychological processes of secondary antisemitism indeed happened within our participants as they did in 2007, but our setup was not potent enough to make participants admit antisemitism. although we have no direct evidence against this possibility, we would argue that the several other steps we undertook to reduce social desirability should then have produced at least a suggestive pattern in line with the hypothesis. another counterpoint to that argument might rest on the observation that psychologists tend to overestimate the power of social desirability, as most people are quite confident in the validity of their beliefs and do not adapt them according to what they think they ought to think and feel instead. in that case, bogus pipelines would fail to produce effects not because they are not working, but because there is no hidden “real” belief to be revealed: people already speak frankly without such an apparatus. this would, however, mean that ongoing victim suffering did not increase our participants’ prejudice. if it had, it should have produced a main effect independent of the bogus pipeline manipulation (not an interaction with the – in this logic – obsolete bogus pipeline), which we never observed. a second potential explanation addresses the lack of effects in subsequent studies while maintaining the speculation that the original effect was a true positive. this explanation is a little more difficult to argue with and is in fact a recurring argument in the replication debate.
heraclitus’ dictum that you cannot step into the same river twice, as it is not the same river and you are not the same person, served as an encapsulation of the notion that the objectively same thing is no longer the same once time has passed. meanings change and persons change, and potentially the effect we sought is subject to zeitgeist effects to a much greater extent than we anticipated. in fact, many effects, particularly in social psychology, may be more prone to changing times and norms than most of us are typically willing to concede. the year 2007 was not only the year in which our original study took place, but also the year in which the last public negotiation between the jewish claims conference against germany and the german federal government came to an agreement. this was a publicly followed event, and for many germans martin walser’s (1998) infamous speech, in which he conceded that “i am almost glad when i think i can discover that more often not the remembrance, the not-allowed-to-forget is the motive, but the exploitation of our shame for current goals”, still resonated. very possibly, much like studies on prejudices against african americans from the 1970s would not necessarily replicate in the 1980s, and much less today, maybe also here the discursive context changed and thus made the seemingly highly similar experimental situation a psychologically very different one. this is a specific case of gergen’s (1973) more general argument that psychology is primarily a historical inquiry, as it “deals with facts that are largely nonrepeatable and which fluctuate markedly over time” (p. 310). if empirical studies are then mere manifestations of a historically situated effect that neither does nor should be expected to be diachronically robust, what is the use of doing such studies?
first, (obviously always under the premise that the original finding was not a false positive) it illustrates this historically situated principle. it might be of interest (admittedly primarily for historical reasons) that in the early 21st century a markedly negative reaction towards jewish suffering was observable among german students. second, and potentially more interesting, the original study might illustrate a general principle and a methodological approach to it, rather than establish a human constant. the principle may be that the emphasis on victim suffering can backfire if respondents are liberated from social desirability concerns. the method might thus provide a potential blueprint of how to tackle such research questions. is this enough for a field of science that so desperately seeks to be as hard a science as the physical sciences, in which “stable and broad generalizations can be established with a high degree of confidence” (gergen, 1973, p. 309) and thus allow explanations that can be empirically tested? does such an understanding of science have a place in the replication era? one positive aspect of the new way of doing psychology is that (hopefully), unlike in the present case, original studies will have been pre-registered and (internally) replicated before publication. the likelihood that such results are true positives is much higher than in the current case of a single underpowered study. failure to replicate in a different time, location, and/or context might thus inform our theorizing about the situatedness of the effect at hand. thus, pre-registration will not only increase the trust in published findings, but also provide an informative hint as to whether it is advised or futile to seek hidden moderators. in summary, we repeatedly failed to conceptually replicate one of our findings across a relatively large number of studies, with different methodological approaches.
thus, the claim that confrontation with the holocaust evokes a backlash of antisemitism among germans is not empirically well supported. either the initial finding was a false positive, or this process needs to be specifically situated in a given time and context.

open science practices

this article earned the open data and the open materials badge for making the data and materials available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

adorno, t. w. (1975). schuld und abwehr. in t. w. adorno, gesammelte schriften, band 9 – soziologische schriften ii.2 (pp. 121-326). frankfurt: suhrkamp.
banse, r., & gawronski, b. (2003). die skala motivation zu vorurteilsfreiem verhalten: psychometrische eigenschaften und validität. diagnostica, 49, 4-13.
batson, c. d., fultz, j., & schoenrade, p. a. (1987). distress and empathy: two qualitatively distinct vicarious emotions with different motivational consequences. journal of personality, 55, 19-39.
bergmann, w. (2006). “nicht immer als tätervolk dastehen” – zum phänomen des schuldabwehrantisemitismus in deutschland. in d. ansorge (ed.), antisemitismus in europa und in der arabischen welt (pp. 81-106). paderborn-frankfurt: bonifatius verlag.
branscombe, n. r., ellemers, n., spears, r., & doosje, b. (1999). the context and content of social identity threat. in n. ellemers, r. spears, & b. doosje (eds.), social identity: context, commitment, content (pp. 35-58). oxford, england: blackwell science.
branscombe, n. r., schmitt, m. t., & schiffhauer, k. (2007). racial attitudes in response to thoughts of white privilege. european journal of social psychology, 37, 203-215.
buruma, i. (2003, august). how to talk about israel. new york times, section 6, p. 28.
castano, e., & giner-sorolla, r. (2006).
not quite human: infrahumanization in response to collective responsibility for intergroup killing. journal of personality and social psychology, 90, 804-818.
chartrand, t. l., & bargh, j. a. (1996). automatic activation of impression formation and memorization goals: nonconscious goal priming reproduces effects of explicit task instructions. journal of personality and social psychology, 71, 464-478.
cohen, t. r., wolf, s. t., panter, a. t., & insko, c. a. (2011). introducing the gasp scale: a new measure of guilt and shame proneness. journal of personality and social psychology, 100, 947-966.
correia, i., & vala, j. (2003). when will a victim be secondarily victimized? the effect of observer’s belief in a just world, victim’s innocence and persistence of suffering. social justice research, 16, 379-400.
dalbert, c. (1999). the world is more just for me than generally: about the personal belief in a just world scale’s validity. social justice research, 12, 79-98.
dotsch, r., & todorov, a. (2012). reverse correlating social face perception. social psychological and personality science, 3, 562-571.
dotsch, r., wigboldus, d. h. j., & van knippenberg, a. (2013). behavioral information biases the expected facial appearance of members of novel groups. european journal of social psychology, 43, 116-125.
dotsch, r., wigboldus, d. h., langner, o., & van knippenberg, a. (2008). ethnic out-group faces are biased in the prejudiced mind. psychological science, 19, 978-980.
egloff, b., & krohne, h. w. (1998). die messung von vigilanz und kognitiver vermeidung: untersuchungen mit dem angstbewältigungsinventar (abi). diagnostica, 44, 189-200.
fiske, s. t., cuddy, a. j. c., glick, p., & xu, j. (2002). a model of (often mixed) stereotype content: competence and warmth respectively follow from perceived status and competition. journal of personality and social psychology, 82, 878-902.
friedman, j. s., & austin, w. (1978).
observers’ reactions to an innocent victim: effect of characterological information and degree of suffering. personality and social psychology bulletin, 4, 569-574.
funke, f. (2005). the dimensionality of right-wing authoritarianism: lessons from the dilemma between theory and measurement. political psychology, 26, 195-218.
gergen, k. j. (1973). social psychology as history. journal of personality and social psychology, 26, 309-320.
godfrey, b. w., & lowe, c. a. (1975). devaluation of innocent victims: an attribution analysis within the just world paradigm. journal of personality and social psychology, 31, 944-951.
golec de zavala, a., cichocka, a., eidelson, r., & jayawickreme, n. (2009). collective narcissism and its social consequences. journal of personality and social psychology, 97, 1074-1096.
heider, f. (1958). the psychology of interpersonal relations. new york: wiley.
heitmeyer, w. (2005). deutsche zustände (folge 3) [german circumstances (vol. 3)]. frankfurt: suhrkamp.
imhoff, r., & banse, r. (2009). ongoing victim suffering increases prejudice: the case of secondary anti-semitism. psychological science, 20, 1443-1447.
imhoff, r., & bruder, m. (2014). speaking (un-)truth to power: conspiracy mentality as a generalised political attitude. european journal of personality, 28, 25-43.
imhoff, r., & dotsch, r. (2013). do we look like me or like us? visual projection as self- or ingroup-projection. social cognition, 31, 806-816.
imhoff, r. (2010). zwei formen des modernen antisemitismus? eine skala zur messung primären und sekundären antisemitismus. conflict & communication online, 9.
imhoff, r., bilewicz, m., & erb, h. (2012). collective regret versus collective guilt: different emotional reactions to historical atrocities. european journal of social psychology, 42, 729-742.
imhoff, r., dotsch, r., bianchi, m., banse, r., & wigboldus, d. h. j. (2011). facing europe: visualizing spontaneous in-group projection. psychological science, 22, 1583-1590.
imhoff, r., woelki, j., hanke, s., & dotsch, r. (2013). warmth and competence in your face! visual encoding of stereotype content. frontiers in psychology, 4, 386.
jost, j. t., & hunyady, o. (2002). the psychology of system justification and the palliative function of ideology. european review of social psychology, 13, 111-153.
jost, j. t., banaji, m. r., & nosek, b. a. (2004). a decade of system justification theory: accumulated evidence of conscious and unconscious bolstering of the status quo. political psychology, 25, 881-919.
kaplan, e. h., & small, c. a. (2006). anti-israel sentiment predicts anti-semitism in europe. journal of conflict resolution, 50, 548-561.
karremans, j. c., dotsch, r., & corneille, o. (2011). romantic relationship status biases memory of faces of attractive opposite-sex others: evidence from a reverse-correlation paradigm. cognition, 121, 422-426.
kempf, w. (2014). anti-semitism and criticism of israel: methodology and results of the asci survey. conflict & communication online, 14.
lakens, d. (2013). calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and anovas. frontiers in psychology, 4, 863.
langner, o., dotsch, r., bijlstra, g., wigboldus, d. h. j., hawk, s., & van knippenberg, a. (2010). presentation and validation of the radboud faces database. cognition and emotion, 24, 1377-1388.
lerner, m. j., & simmons, c. h. (1966). observer’s reaction to the “innocent victim”: compassion or rejection? journal of personality and social psychology, 4, 203-210.
lerner, m. j. (1980). belief in the just world. new york: plenum press.
lundqvist, d., flykt, a., & öhman, a. (1998). the karolinska directed emotional faces (kdef). cd rom from department of clinical neuroscience, psychology section, karolinska institutet.
mangini, m., & biederman, i. (2004).
making the ineffable explicit: estimating the information employed for face classifications. cognitive science, 28, 209-226.
marhenke, t., & imhoff, r. (2018). increased accessibility of semantic concepts after (more or less) subtle activation of related concepts: support for the basic tenet of priming research. unpublished manuscript.
miller, d. t. (1977). altruism and threat to a belief in a just world. journal of experimental social psychology, 13, 113-124.
payne, b. k., cheng, c. m., govorun, o., & stewart, b. d. (2005). an inkblot for attitudes: affect misattribution as implicit measurement. journal of personality and social psychology, 89, 277-293.
quirin, m., kazén, m., & kuhl, j. (2009). when nonsense sounds happy or helpless: the implicit positive and negative affect test (ipanat). journal of personality and social psychology, 97, 500-516.
rammstedt, b., & john, o. p. (2007). measuring personality in one minute or less: a 10-item short version of the big five inventory in english and german. journal of research in personality, 41, 203-212.
roccas, s., klar, y., & liviatan, i. (2006). the paradox of group-based guilt: modes of national identification, conflict vehemence, and reactions to the in-group’s moral violations. journal of personality and social psychology, 91, 698-711.
rüsch, n., corrigan, p. w., bohus, m., jacob, g. a., brueck, r., & lieb, k. (2007). measuring shame and guilt by self-report questionnaires: a validation study. psychiatry research, 150, 313-325.
schönbach, p. (1961). reaktionen auf die antisemitische welle im winter 1959/1960 [reactions to the anti-semitic wave in the winter 1959/1960]. frankfurt: europäische verlagsanstalt.
selznick, g. j., & steinberg, s. (1969). the tenacity of prejudice: anti-semitism in contemporary america. oxford, england: harper & row.
sigall, h., & page, r. (1971). current stereotypes: a little fading, a little faking. journal of personality and social psychology, 18, 247-255.
simons, c. w., & piliavin, j.
a. (1972). effect of deception on reactions to a victim. journal of personality and social psychology, 21, 56-60.
steinberg, g. (2004). abusing the legacy of the holocaust: the role of ngos in exploiting human rights to demonize israel. jewish political studies review, 16, 59-72.
vala, j., pereira, c. p., eugênio, m., lima, o., & leyens, j. (2012). intergroup time bias and racialized social relations. personality and social psychology bulletin, 38, 491-504.
viechtbauer, w. (2010). conducting meta-analyses in r with the metafor package. journal of statistical software, 36(3), 1-48.
von collani, g. (2002). das konstrukt der sozialen dominanzorientierung als generalisierte einstellung: eine replikation [the construct of social dominance orientation as a generalized attitude: a replication]. zeitschrift für politische psychologie, 10, 263-282.
walser, m. (1998). erfahrungen beim verfassen einer sonntagsrede [experiences while composing an oration]. in börsenverein des deutschen buchhandels (ed.), friedenspreis des deutschen buchhandels 1998 – ansprachen aus anlaß der verleihung [peace prize of the german book trade 1998 – speeches from the award ceremony]. frankfurt/main.
watson, d., clark, l. a., & tellegen, a. (1988). development and validation of brief measures of positive and negative affect: the panas scales. journal of personality and social psychology, 54, 1063-1070.
weil, f. d. (1985). the variable effects of education on liberal attitudes: a comparative historical analysis of anti-semitism using public opinion survey data. american sociological review, 50, 458-474.

meta-psychology, 2022, vol 6, mp.2020.2601 https://doi.org/10.15626/mp.2020.2601 article type: original article published under the cc-by4.0 license open data: yes open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: yes edited by: alex o. holcombe reviewed by: jason m.
chin, hannah fraser analysis reproduced by: lucija batinović all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/ha6kd

better understanding the population size and stigmatization of psychologists using questionable research practices

nicholas w. fox, rutgers university, new brunswick, new jersey, usa
nathan honeycutt, rutgers university, new brunswick, new jersey, usa
lee jussim, rutgers university, new brunswick, new jersey, usa

abstract

there has been low confidence in the replicability and reproducibility of published psychological findings. previous work has demonstrated that a population of psychologists exists that has used questionable research practices (qrps): behaviors during data collection, analysis, and publication that can increase the number of false-positive findings in the scientific literature. across two survey studies, we sought to estimate the current size of the qrp-using population of american psychologists and to identify whether this sub-population of scientists is stigmatized. using a self-report direct estimator, we estimate that approximately 18% of american psychologists have used at least one qrp in the past 12 months. we then demonstrate the use of two additional estimators: the unmatched count estimate (an indirect self-report estimator) and the generalized network scale-up method (an indirect social network estimator). additionally, attitudes of psychologists towards qrp users, and ego network data collected from self-reported qrp users, suggest that qrp users are a stigmatized sub-population of psychologists. together, these findings provide insight into how many psychologists are using questionable practices and how they exist in the social environment.
keywords: questionable research practices, qrps, replication crisis, social networks, stigma, person perception

introduction

it is the psychology researcher's job to generate theories, test hypotheses, collect data, interpret results, and publish their findings. this is all done to learn more about the world and how it works. in pursuing these tasks, the researcher has many decisions to make: how many observations will i collect? how will i operationalize my variables? what is my population of interest for this given study? should i exclude any observations from the final analysis? each decision point is a "researcher degree of freedom" (simmons et al., 2011) with the potential to introduce error and bias. since there is a high level of ambiguity in academic research, these degrees of freedom can be resolved in a variety of ways. in reviewing how researchers handle outlying observations, simmons et al. (2011) found that different research groups made different decisions about what was most correct. when researchers cleaned their data and removed participants who made responses that were "too fast", some defined this as two standard deviations below the mean response speed, some defined it as any observation smaller than 200 milliseconds, and others removed the fastest 2.5% of observations. none of these definitions are an inherently incorrect interpretation of "too fast", which creates a problem: without clear reporting standards in place, this type of flexible decision making can blur the lines between what decision is right, what decision produces a desired result, and what decision is most likely to help a finding get published. there are many "researcher degrees of freedom" that exploit the grey areas of acceptable practice and may bias research findings (john et al., 2012; wicherts et al., 2016).
some examples include trying different ways to score the chosen primary dependent variable and deciding how to deal with outlying observations in an ad hoc manner. ten of these types of behaviors have been collectively called "questionable research practices" (qrps) and have been defined as behaviors during data collection, analysis, and reporting that have the potential to increase false-positive findings in the published scientific literature. for this study, nine of the ten qrps were considered (table 1). we did not include "fabricated data" (qrp item 10), as the authors consider this a fraudulent, not questionable, behavior. not only can qrp use increase the number of false-positive findings (e.g., taking a "non-significant" result and pushing it over a threshold into being "significant"), but using multiple qrps can also influence the reported effect size of a given finding due to sampling bias and low power (button et al., 2013). thus, qrp use can lead to field-wide interpretations of findings that are not warranted by the data.

prevalence of questionable research practice users

consider one of the most basic questions to ask about the current replication crisis in psychology: how many people are contributing to it? john et al. (2012) found 63% of psychologists admitted to publishing work without all the dependent measures included (at some point in their academic career). as articulated by simmons et al. (2011), this is highly problematic, as increasing the number of dependent variables increases the probability of finding a significant result. without reporting all dependent measures, readers are left with a false impression of the research activity underlying the reported findings. this estimate from john et al. (2012) was contested by fiedler and schwarz (2016).
in their conceptual replication that used differently worded questions, a different conceptualization of “prevalence”, and tested a german (as opposed to an american) cohort of psychologists, they found less than 10% prevalence of the same questionable practice (omitting dependent variables). agnoli et al. (2017) recently replicated the original john et al. (2012) study in an italian cohort of psychologists and found somewhat higher levels of qrp use (47.9% of respondents had omitted dependent variables, see table 1). consequently, there is currently no consensus on the prevalence of qrp use in psychology. given these inconsistencies in assessing the prevalence of questionable research practices, the present work seeks to expand on this existing literature in several ways. first, we investigate current qrp users, operationalized as an “individual” who has used at least one of nine qrps “in the past 12 months”. this is different from the previous literature as it shifts the attention to individuals who perform questionable practices and away from the behavior as a concept. second, it addresses the recent use of qrps by defining behaviors performed within a specified time period of one year. previous work estimating qrp prevalence has done so by either estimating lifetime prevalence or via estimating frequency of qrp use, both providing limited insight on recent use of questionable research practices. put another way, knowing whether a researcher has used a qrp at some point during their career does not tell us much about how many researchers currently use qrps, nor does it provide an accurate estimate of the size of the current qrp-using population. a third unique contribution of the present research is that it addresses prevalence of qrp users with three different estimating methodologies. one is a direct estimate heavily based on prior research (agnoli et al., 2017; fiedler and schwarz, 2016; john et al., 2012). 
we directly asked researchers whether they had used any of the 9 behaviors assessed in table 1 at least once in the past 12 months. previous work to estimate the prevalence of qrp use in psychology has also relied on direct self-report of behavior from participants. it is well known that asking participants to self-report on their socially undesirable behaviors can lead to an underestimation, as participants can lie about their behaviors to researchers to avoid potential negative consequences of their actions (fisher, 1993; holbrook and krosnick, 2010; salganik and heckathorn, 2004). even when survey responses are completely anonymous, many participants may feel pressure to respond in the socially desirable way (makimoto et al., 2001).

table 1. the 10 behaviors commonly described as "questionable research practices", including previous estimates of the prevalence of these behaviors across participants' careers from john et al. (2012) and agnoli et al. (2017). items 1-9 are used in the present work, as item 10, falsifying data, is fraudulent behavior rather than questionable.

 #   questionable research practice                                                                  john et al. (2012)        agnoli et al. (2017)
                                                                                                     (us sample, control group) (italian sample)
 1   failing to report all of a study's dependent measures                                           63.4%                     47.9%
 2   collecting more data after looking to see if the results were significant                       55.9%                     53.2%
 3   failing to report all of a study's conditions                                                   27.7%                     16.4%
 4   stopping data collection earlier than planned because one found the result one was looking for  15.6%                     10.4%
 5   rounding off p-values to achieve significance                                                   22.0%                     22.2%
 6   selectively reporting studies that "worked"                                                     45.8%                     40.1%
 7   deciding whether to exclude observations after seeing the effect of doing so on the results     38.2%                     39.7%
 8   reporting unexpected findings as being predicted from the start                                 27.0%                     37.4%
 9   reporting results are unaffected by demographics when actually unsure or not tested             3.0%                      3.1%
10   falsifying data                                                                                 0.6%                      2.3%
for this reason, we felt it was important to attempt to address this known bias by using two different indirect methods of estimation, in addition to the self-report estimate. the first indirect method, called the unmatched count technique, is an estimating technique aimed at reducing social desirability response bias in self-reports (arentoft et al., 2016) (see method for details). the second method generates an indirect estimate of the population size of qrp users by using social network information from the general population of psychologists (jing et al., 2014; mccormick et al., 2010; salganik et al., 2011; zheng et al., 2006), circumventing the need for a participant to report on their own behavior entirely. neither the unmatched count technique nor this social network method requires participants to identify as belonging to a potentially stigmatized group (i.e., qrp users), thereby reducing the risk of socially desirable response bias compared to more traditional direct estimates. while network methods are expected to provide insights into qrp use prevalence, they have yet to be used in psychology. thus, this work produced three estimates of qrp use prevalence.

stigmatization of questionable research practice users

the term "stigma" was formally described by erving goffman as "an attribute that makes [a person] different from others in a category of persons available for [them] to be, and of a less desirable kind" (goffman, 1963). goffman describes two states of a stigmatized identity: "discredited", where the stigmatizing attribute is outwardly identifiable to strangers (e.g., race, gender, physical handicap; sometimes referred to as "spoiled identities"), and "discreditable", where the stigmatizing attribute can be concealed from others (e.g., sexual orientation, medical condition, certain mental disorders, behaviors).
since discredited people suffer from a reduced social status, it is potentially beneficial for discreditable people to conceal their stigmatized attribute and to continue being considered "normal" (goffman, 1963). this is controlled through the process of impression management, where the actor (a person with a concealable stigma) communicates with an audience (others in a social group unaware of the actor's "true" identity) in a manner that convinces the audience of the appropriateness of their assumed role in society (goffman, 1959). reactions towards stigmatized members of society can differ depending on the perceived controllability the stigmatized individual has over their stigma. for example, people with lung cancer tend to be blamed more for their condition compared to other cancer patients due to the link between cigarette smoking (a controllable behavior) and lung cancer (chapple et al., 2004). this effect persists even if the individual with lung cancer never smoked. corrigan (2000) describes differing affective responses by population members towards stigmatized individuals depending on whether or not that person is responsible for their stigma. those seen as responsible are met with anger and potential punishment, while those seen as not responsible are met with pity and potential helping behaviors. qrp use could be framed as either externally or internally attributed. one could argue that qrp use is an inevitable outcome of working in a stressful academic career where success is measured in scientific output (here, qrp use is externally attributed to stress). it could also be argued that qrps are only used by those unfit to be academics, who resort to using qrps to make up for their own inadequacies (here, qrp use is internally attributed to low ability). there are ways that stigmatized individuals may attempt to manage their identity while minimizing negative effects. one way is through social withdrawal.
by interacting with fewer people, there are fewer moments when a concealed identity can be revealed (ilic et al., 2014). another way is through selective disclosure of their stigmatized identity. selective disclosure to trusted others (often those who share this concealed identity) is an adaptive identity management strategy: it allows the stigmatized individual to control their social interactions in a beneficial way and reduces stigmatizing experiences. social withdrawal, on the other hand, demands more of the individual by requiring them to continuously monitor their social network and anticipate their potential social interactions. this additional burden results in worse mental health outcomes and no reduction in stigmatizing experiences (ilic et al., 2014). considering the potential stigmatization of qrp users is important: determining if qrp users are stigmatized will enable the development of interventions that either decrease or increase stigmatization. it is generally accepted that increased stigmatization of tobacco smokers has decreased the number of people who smoke (bayer, 2008), though it is unclear whether the group or the individual should bear more of the stigma burden (courtwright, 2013). for these reasons, it is important to first understand how qrp users exist within their social environment prior to implementing interventions aimed at reducing qrp use. to assess whether qrp use is stigmatizing, we attempt to measure stigma in two ways. first, we assess the attitudes held by the general population of american psychologists towards qrp users, focusing on four theoretical domains: attribution theory and stigma, social norms and stigma, fear and stigma, and power and stigma (stuber et al., 2008). these domains are important for understanding if qrp use is stigmatized by psychologists.
for instance, population members may fear that qrp users will damage the reputation of psychology as a scientific field and thus look down on those who they perceive to be negative contributors. additionally, link and phelan (2001) argue that individuals who are stigmatized must have less power than those doing the stigmatizing, which is investigated in this study. in addition to measuring the attitudes of the general population of psychologists towards qrp users, this study also measures social withdrawal and selective disclosure behaviors of self-identified qrp users. by using this two-pronged approach, this study attempts to answer the following research questions:

1. are qrp users stigmatized by the general population of psychologists?
2. do qrp users behave as a stigmatized group?

better understanding the size of the qrp-using population of psychologists, and how psychologists view their peers using qrps, will set a foundation for future interventions aimed at reducing qrp use.

study 1: sizing the qrp-using population of psychologists

methods

preregistration statement. this study, which describes three estimates of qrp prevalence in the us psychologist population, was preregistered on may 15, 2017. the preregistration is available here: https://osf.io/xu25n.

population of interest and target group. the population of interest for this work was all tenured or tenure-track researchers associated with a phd-granting psychology department in the united states. qrp users (the target group) are therefore a subgroup of this population, with a size greater than zero and at most the size of the population of interest. a complete list of names and contact information for the population of interest was obtained via private correspondence with dr. leslie john (john et al., 2012). the list provided was current as of 2010, so name and email contact data were updated in may 2017, as this research program was beginning.
this was done by reviewing the faculty at each phd-granting psychology department in the united states and then adding or removing individuals as appropriate.

survey distribution. members of the population of interest were invited via email to participate in a brief survey on personal social network size and attitudes towards researchers. all invitations were sent and all surveys were administered using the qualtrics web tool (qualtrics, 2005). all members of the population of interest (n = 7,101) were solicited via email to participate. emails were sent in 10 waves, with each wave consisting of 200-400 invitations. all initial emails were sent to potential participants on a thursday, and a single follow-up "reminder" email was sent on the following monday. participants who had finished the survey were sent a "thank you" email on the thursday following the initial solicitation. all invitations were sent between september 2017 and december 2017. three surveys were distributed simultaneously, to facilitate the different types of direct and indirect estimates described in the following sections. surveys 1 and 2 were each distributed to 1,775 members of the population of interest. survey 3, which included the self-report direct estimator, was distributed to 3,551 members of the population of interest. to maximize the number of self-reported qrp users observed who would then receive additional questions about their social networks, we distributed survey 3 to half of the total 7,101 population members and split the remaining half between surveys 1 and 2. all surveys included relevant instructions and definitions (i.e., defining behaviors identified as qrps). see https://osf.io/2zwqf/ for the survey materials distributed, as well as supplemental materials describing the deviations from the preregistration.
in these surveys, "qrp use" was defined as having used at least one of the nine qrps in table 1 in the past 12 months. similarly, a "qrp user" was defined as a person who has used at least one of the nine items in table 1 (excluding item 10, for reasons described previously) in the past 12 months. participants were presented with the definition of qrp use at the start of the survey, and the definition was always available by hovering over highlighted text in the survey.

survey responses. of the 7,101 email solicitations sent, 214 emails bounced (3%). six hundred thirteen full responses were collected (9% full response rate), and 296 partial responses were collected. no compensation was offered for participation. only full responses were used in the generation of population size estimates. additionally, 26 participant responses were removed, either for being marked complete erroneously by the qualtrics webtool or for breaking estimate-specific criteria. for example, if a respondent claimed to know 290 individuals who have used a qrp in the past 12 months, yet the estimate of the size of their total social network was only 150 individuals, that respondent would be excluded from analysis. two hundred ninety-nine (49%) participants identified as female, 279 (46%) identified as male, and 19 (3%) chose not to identify their gender. one hundred thirty-one (21%) participants identified as assistant professor, 141 (23%) identified as associate professor, and 208 (34%) identified as full professor. one hundred thirteen participants chose not to disclose their tenure level.

estimating methods

estimate 1: direct estimate. the self-report direct estimate involved asking members of the target population whether they have used at least one qrp in the past 12 months, and was calculated as the number of respondents who self-identified as using at least one qrp divided by the total number of respondents.

estimate 2: unmatched count technique estimate.
the unmatched count technique (uct) is an indirect way of measuring base rates of concealable and potentially stigmatized identities (gervais and najle, 2017). in this estimate, two groups of participants are given a list of innocuous items that could apply to them (e.g., i own a dishwasher; i exercise regularly). the list of items for both groups is the same except for one additional item that one group receives and the other does not. this extra item asks about the concealable identity (e.g., i own a dishwasher; i exercise regularly; i smoke crack cocaine; examples from gervais and najle, 2017). see table 2 for the full list of items used. participants are asked to count and report the number of items in the list that apply to them. at no point does a participant identify which items they are counting. the proportion of participants that identify with the extra item is calculated as the mean difference between the innocuous and concealable identity lists.

table 2. items used in the unmatched count technique (uct). items 1-9 were included on both lists, while only item 10 was used in the "sensitive item" list (list 2).

 #   item                                                        list
 1   i am a vegetarian.                                          1 & 2
 2   i own a dog.                                                1 & 2
 3   i work on a computer nearly every day.                      1 & 2
 4   i have a dishwasher in my kitchen.                          1 & 2
 5   i can drive a motorcycle.                                   1 & 2
 6   my job allows me to work from home at least once a week.    1 & 2
 7   i jog at least four times a week.                           1 & 2
 8   i enjoy modern art.                                         1 & 2
 9   i have attended a professional soccer match.                1 & 2
10   i have used at least one qrp in the past 12 months.         2 only

estimate 3: generalized network scale up estimate. network methods estimate population sizes using information about the personal networks of respondents, based on the assumption that personal networks are, on average, representative of the population (salganik et al., 2011). each participant's social network provides a sample of the general population, and by collecting network data on many participants, those accumulated social networks provide access to the larger population. participants were asked how many psychologists they "know" in the population of interest. in this study, "know" was defined as: the person knows you by face or by name, you know them by face or by name, you could contact the person if you wanted to, and you have been in contact with them in the past two years (bernard et al., 2010). participants were then asked a series of questions to estimate the total size of their social network, and the number of people they know who have used at least one qrp in the past 12 months. together, the network scale-up can be used to estimate the proportion of qrp users, and was calculated as follows:

ρ = Σ y_i / Σ d_i   (1)

where ρ is the proportion estimate of people who have used at least one qrp in the past 12 months, y_i is the number of people known in the target group by participant i, and d_i is the estimated total number of people known by participant i within the population of interest (see killworth et al., 1998 for more on estimating d). this equation makes two assumptions: that members of the population of interest know all identity information about all members of their ego networks, and that qrp users have the same size social networks as the general population of interest. since qrp use is concealable and potentially stigmatizing, the assumptions made for the previous estimate may not be appropriate. for that reason, data was collected from self-identifying qrp users to estimate how qrp-use identity information transmits through ego networks. this estimate is called the transmission rate, or tau (τ), and estimates the social transmissibility of a person's identity information. this data was collected using the game of contacts method (salganik et al., 2012), described below.
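equation (1) amounts to pooling reported counts across respondents: total qrp users known, divided by total network size. a minimal sketch in python (the published analysis was conducted in r; the respondent data here are hypothetical):

```python
def nsum_estimate(y, d):
    """equation (1): rho = sum(y_i) / sum(d_i), where y_i is the number of
    qrp users respondent i reports knowing and d_i is the estimated size of
    respondent i's network within the frame population."""
    return sum(y) / sum(d)

# hypothetical responses from five participants
y = [2, 0, 1, 3, 0]          # qrp users known by each respondent
d = [150, 80, 120, 200, 90]  # each respondent's estimated network size
rho = nsum_estimate(y, d)    # 6 / 640 = 0.009375
```

note that pooling the sums (rather than averaging per-respondent ratios) weights respondents with larger networks more heavily, which is the standard scale-up formulation.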
to estimate the qrp use identity transmission rate tau, we performed the game of contacts with participants who self-identified as using at least one qrp in the past 12 months. briefly, this method has participants answer a set of questions about what they know about the qrp use of several others (called "alters") in their social network, and what those alters know about the participant's qrp use. the questions are semi-graphical and responses are recorded on a digital 2x2 grid, representing the four possible ways information can flow through a given ego-alter relationship (both the participant and the alter know of each other's qrp use, the alter knows of the participant's qrp use only, the participant knows of the alter's qrp use only, or neither the participant nor the alter has insight into the qrp behaviors of the other). the transmission rate τ is then calculated as:

τ = Σ w_i / Σ x_i   (2)

where w_i is the number of alters that know the ego is a member of the target group, and x_i is the total number of alters generated by the ego. this produces a value between zero and one, where one represents complete transparency of information (all alters are aware of the participant's qrp use) and zero represents the identity being completely hidden from all alters. for a full description of the game of contacts, see salganik et al. (2012). the current study utilized a digital distribution of the game of contacts. this method is typically performed in a face-to-face interview setting with the participant (salganik et al., 2012). due to the distributed nature of our frame population, this was not feasible. instead, participants were presented with the game of contacts via qualtrics (qualtrics, 2005). these questions were pretested for clarity with several academics not within the population of interest. a comparison between an in-person and digital game of contacts has been preregistered by the authors (https://osf.io/yf4xc/) for future study.
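equation (2) can be computed directly from game-of-contacts responses; a sketch with hypothetical ego reports (python rather than the authors' r; only the alter-knows-ego side of the 2x2 grid enters τ):

```python
def transmission_rate(ego_reports):
    """equation (2): tau = sum(w_i) / sum(x_i). for each ego (a self-identified
    qrp user), w_i counts alters who know of the ego's qrp use and x_i is the
    total number of alters the ego generated."""
    w = sum(sum(alter_knows) for alter_knows in ego_reports)  # True counts as 1
    x = sum(len(alter_knows) for alter_knows in ego_reports)
    return w / x

# hypothetical data: True = this alter knows the ego has used a qrp
ego_reports = [
    [True, False, False, False],        # 1 of 4 alters knows
    [False, False, False],              # identity fully concealed
    [True, True, False, False, False],  # selectively disclosed to 2 of 5
]
tau = transmission_rate(ego_reports)    # 3 / 12 = 0.25
```

a low τ (the paper reports 0.06) inflates the final scale-up estimate substantially, since the basic count of known qrp users is divided by τ.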
additionally, to relax the assumption of equal social network sizes between the general population of psychologists and qrp users, a popularity ratio (delta, δ) was calculated as:

δ = d_e / d_t   (3)

where d_e is the average network size for the target group (qrp users), and d_t is the average network size for the population of interest (tenured or tenure-track faculty associated with phd-granting psychology departments in the united states). together, tau and delta adjust the network scale-up estimate into the generalized network scale-up as follows:

ρ = (Σ y_i / Σ d_i) * (1/τ) * (1/δ)   (4)

where ρ is the proportion estimate of people who have used at least one qrp in the past 12 months, Σ y_i / Σ d_i is the network scale-up estimate, τ is the transmission rate, and δ is the popularity ratio. all network scale-up results are calculated using this equation, incorporating τ and δ.

results

the three estimates of recent qrp use in the frame population of american tenured or tenure-track faculty are summarized in figure 1 and described in detail below. [figure 1. qrp user prevalence estimates (%) using three estimating techniques: the generalized network scale up estimate (gnsum), the direct estimate, and the unmatched count technique (uct). bars represent 95% percentile bootstrapped confidence intervals.]

direct estimate. to ensure the highest number of participants in our game of contacts, half of the total population were asked to participate in survey 3, which contained our direct estimate question. thus, 3,551 psychologists were solicited, and we received 308 responses able to be analyzed. of the 308 participants, 56 indicated they had used at least one qrp in the past 12 months. we calculated qrp prevalence to be 18.18% (percentile bootstrapped 95% confidence interval [13.96%, 22.40%]). it is possible this estimate underestimates the true number of psychologists using qrps.
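the direct estimate and its percentile bootstrapped interval can be sketched as follows. this is a python illustration of the standard percentile bootstrap for a proportion, not the authors' code (their analysis is in r); the 56-of-308 figures come from the text:

```python
import random

def percentile_bootstrap_ci(successes, n, reps=10_000, alpha=0.05, seed=1):
    """percentile bootstrap for a proportion: resample the n binary responses
    with replacement and take the empirical 2.5th and 97.5th percentiles."""
    rng = random.Random(seed)
    responses = [1] * successes + [0] * (n - successes)
    means = sorted(sum(rng.choices(responses, k=n)) / n for _ in range(reps))
    return means[int(reps * alpha / 2)], means[int(reps * (1 - alpha / 2))]

# numbers reported above: 56 self-identified qrp users among 308 usable responses
point = 56 / 308                         # ≈ 0.1818, i.e. 18.18%
lo, hi = percentile_bootstrap_ci(56, 308)
```

with these inputs the interval lands close to the reported [13.96%, 22.40%], though the exact endpoints depend on the random seed and number of replicates.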
for one, social desirability may lead some scientists who have used qrps to be unwilling to admit it. this estimate is only generated by those participants willing to reveal their identity as a qrp user. given the somewhat critical social environment for qrp users (fiske, 2016; teixeira da silva, 2018), it is reasonable to believe some participants withheld their identity when we asked directly. the following indirect estimation methods sought to mitigate this social desirability bias.

unmatched count technique estimate. the remaining 3,550 psychologists contacted were asked to participate in our unmatched count estimate, with 1,775 randomized into the innocuous list condition and 1,775 randomized into the sensitive list condition. from this, we received 279 responses for analysis. the average number of list items endorsed by participants in the innocuous list condition was 4.28. the average number of list items endorsed by participants in the sensitive list condition was 4.39. we calculated qrp user prevalence to be 10.46% (percentile bootstrapped 95% confidence interval [-20.19%, 22.40%]). it was unexpected that the calculated uct estimate would be lower than our direct estimate. typically, due to reduced response bias, uct estimates are larger than direct estimates when the behavior or identity in question is concealable and potentially stigmatized (gervais and najle, 2017; starosta and earleywine, 2014; wolter and laier, 2014). given that the bootstrapped 95% confidence interval crosses zero, it is likely the relatively low number of participants in our uct (n = 279) led this calculation to be overly sensitive to individual responses. at a reviewer's suggestion, we calculated the 95% confidence interval using three additional bootstrapping methods: basic, normal, and bca, using the r package 'boot' (ripley, 2021). these three additional methods produced similar ci ranges (basic = [-19.3%, 41.1%], normal = [-19.4%, 40.1%], bca = [-19.2%, 41.1%]).
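why an interval around a small mean difference can cross zero is easy to see by bootstrapping the two lists directly. a python sketch with hypothetical item counts (the real lists had n = 279; these toy samples only echo the small gap between list means reported above):

```python
import random

def uct_estimate(innocuous, sensitive):
    """uct point estimate: mean of the sensitive-list counts minus
    mean of the innocuous-list counts."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(sensitive) - mean(innocuous)

def percentile_bootstrap_ci(innocuous, sensitive, reps=5_000, seed=2):
    """resample each list independently; with a small mean difference and
    noisy counts, many replicates come out negative."""
    rng = random.Random(seed)
    diffs = sorted(
        uct_estimate(rng.choices(innocuous, k=len(innocuous)),
                     rng.choices(sensitive, k=len(sensitive)))
        for _ in range(reps)
    )
    return diffs[int(reps * 0.025)], diffs[int(reps * 0.975)]

# hypothetical item counts with a small positive gap between list means
innocuous = [4, 5, 3, 4, 6, 4, 5, 3, 4, 5, 4, 3, 5, 4, 4]
sensitive = [5, 4, 4, 5, 3, 6, 4, 4, 5, 4, 3, 5, 4, 5, 4]
point = uct_estimate(innocuous, sensitive)              # small positive difference
lo, hi = percentile_bootstrap_ci(innocuous, sensitive)  # interval crosses zero
```

with counts this noisy relative to the gap, a sizeable share of bootstrap replicates flip sign, which is exactly the pathology described next.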
since the uct estimate is calculated as the mean difference between the two item list means, and because both our sample size and the observed mean difference (0.11) were small, bootstrapping the two item lists and then calculating the uct estimate can produce replicates where the mean for the innocuous item list is larger than the mean for the concealable identity list, producing a negative population size estimate. the fact that this estimate's confidence interval crosses zero indicates that, although the mean difference can be used to generate a point estimate of population size, the variability of responses within each list group is sufficient to make this estimate uninterpretable.

generalized network scale up estimate. all participants who were randomized into the uct estimate were also asked to answer questions about their social networks, and to estimate how many researchers they know who have used at least one qrp in the past 12 months. participants who were randomized into the direct estimate and who self-identified as a qrp user in that estimate were also asked to answer questions about their social network and to participate in the game of contacts method. participants in the direct estimate who did not self-identify as a qrp user were asked questions about their social network as well, but were not asked how many researchers they know who have used at least one qrp in the past 12 months. this was because these participants would later be asked about their views on those who use qrps (see study 2), and we did not want to prime these participants to think about qrp users in their own social network, in an effort to reduce response bias.
therefore, we collected social network responses from 531 participants from the general frame population (to be used in estimating δ), 56 responses from participants who self-identified as qrp users and who also completed the game of contacts (to be used in estimating τ and δ), and 279 responses from participants who estimated the number of researchers they know who have used at least one qrp in the past 12 months. these 279 individuals identified a sum total of 664 qrp users, and know a sum total of 46,828 researchers. given that the total frame population is 7,101, we are fairly confident all or nearly all members were identified at least once by our participants. using the network scale-up estimate (which does not include tau or delta), this generates an estimate of 1.42% (percentile bootstrapped 95% confidence interval [0.85%, 2.14%]). this estimate assumes qrp use is completely transparent and that all participants would know the qrp use of the members of their social network. clearly, this is a poor assumption for this population, but it serves as the base starting point for our key network estimate, the generalized network scale-up estimator (gnsum), detailed below. the gnsum relaxes the assumptions of equal network size (delta) and total information transmission (tau) by incorporating these estimates into the equation. using the 531 responses from the general population, the 56 responses from the participants who indicated using a qrp in the past 12 months, and equation 3, we estimate δ, the ratio of average social network sizes between self-identified qrp users and the general population of psychologists, to be 0.97. this means that, on average, the social network of a self-identified qrp-using psychologist is 97% the size of that of a psychologist who has not identified as a qrp user.
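equation (4) simply rescales the basic network estimate. a python sketch (the authors' analysis is in r) using the rounded quantities reported in this section: 664 qrp users identified across 46,828 known researchers, transmission rate τ = 0.06, and popularity ratio δ = 0.97:

```python
def gnsum_estimate(sum_y, sum_d, tau, delta):
    """equation (4): scale the basic nsum estimate up by 1/tau (how often
    qrp-use identity is actually visible to alters) and 1/delta (the relative
    network size of qrp users)."""
    return (sum_y / sum_d) * (1 / tau) * (1 / delta)

# rounded values from the text; the published estimate uses unrounded inputs
rho = gnsum_estimate(664, 46_828, tau=0.06, delta=0.97)
# with these rounded inputs, rho comes out near 0.24, close to the
# reported 24.40%
```

the sensitivity of ρ to τ is worth noting: because τ sits in the denominator and is itself estimated with wide uncertainty, small changes in τ move the final estimate substantially, which is reflected in the wide confidence interval reported below for the gnsum.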
using the game of contacts and equation 2, we estimate τ, the transmissibility of qrp-use identity information, to be 0.06 (percentile bootstrapped 95% confidence interval [0.03, 0.10]). using equation 4 to calculate the generalized network scale-up estimate, we estimate qrp user prevalence to be 24.40% (percentile bootstrapped 95% confidence interval [10.93%, 58.74%]). additional analyses assessed the validity of the network scale-up method in this population by using it to generate estimates of other populations of known size. nsum estimates (which do not include τ or δ; see equation 1) were then compared to those actual population sizes. if nsum estimates correspond well with the actual sizes of these populations, it would suggest that the gnsum method most likely provides a good estimate of population size in this group of participants. to this end, we generated additional estimates of 24 populations of known size: the number of psychologists with particular first names (the number of psychologists named david, named janet, etc.). the 24 names were gender balanced and represented common, uncommon, and rare names that exist within the census of the population of interest. the estimates made by our participants of the sizes of these 24 populations are similar to the actual prevalence of these groups (see figure 2); the correlation between our participants' estimates of those group sizes and the actual group sizes is r = 0.91. the nsum estimate we calculated for the proportion of qrp-using psychologists was 1.42%. based on the validity estimate, it is possible this nsum value is an underestimate of the true proportion of psychologists who have used a qrp in the past 12 months. this would result in our gnsum estimate of 24.4% also being an underestimate.
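the reported values are numerically consistent with the gnsum adjustment being a small calculation: the basic nsum estimate divided by τ and δ. the python sketch below uses the rounded values reported in the text, so it recovers the published 24.40% only up to rounding; the exact form of equation 4 and the authors' r code are on their osf page.

```python
# gnsum sketch: deflate the basic nsum estimate by transmissibility (tau)
# and the network-size ratio (delta); inputs are rounded published values
nsum = 664 / 46_828          # basic nsum estimate (~1.42%)
tau = 0.0602                 # transmissibility from the game of contacts
delta = 178.60 / 184.93      # network-size ratio (~0.97)

gnsum = nsum / (tau * delta)
print(round(gnsum * 100, 1))  # → 24.4 (percent), matching the reported 24.40%
```

this also makes the sensitivity of the gnsum clear: because τ sits in the denominator, a small absolute error in the transmissibility estimate moves the prevalence estimate substantially, which helps explain the wide confidence interval.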
even the most common first name for our population of interest (david) only had a true prevalence of 2.5%, so the relationship between nsum-estimated and actual prevalence beyond this value cannot be determined with our data. we cannot know for certain whether the nsum and gnsum estimates accurately identified the true proportion of qrp users in psychology, given that we are estimating several variables that can affect the population size estimate. nonetheless, that using the nsum with the same participants generated estimates similar to their known values across multiple populations is consistent with the conclusion that our gnsum estimate may also be similar to the true proportion of qrp users in psychology.

discussion

because of inconsistencies in previous research, this study generated three estimates of current qrp use, using three estimating procedures.

figure 2. comparison of estimates made using the gnsum estimate to the actual prevalence of populations (researchers with specific first names). dotted line represents when the estimate equals the actual prevalence. larger groups have a tendency to be underestimated, a phenomenon observed in other published gnsum estimations (such as salganik et al., 2011). correlation between estimated prevalence and actual prevalence r = 0.91.

while the point estimates generated by our three estimators range from 10.4% to 24.4%, the large confidence intervals generated for both the gnsum and the uct estimates make it difficult to make a precise assessment based on these two estimating methods. these large confidence intervals are most likely due to two reasons: first, compared to the direct estimate, both the gnsum and the uct estimating equations have more values being estimated (two in the uct and six in the gnsum).
second, we observed a high amount of variance, which may be due to the small size of the population of interest (7,101 individuals total) and the low response rate we recorded within this population (8.63%). this in turn affected the precision of our estimates. however, we have more confidence that our direct estimate of 18.18% [13.96%, 22.40%] is an accurate estimate of the proportion of psychologists who have used a qrp in the past 12 months, while acknowledging that it may be an underestimate due to the vulnerability of self-report measures to response bias. to the best of our knowledge, this is the first report of the prevalence of qrp users in a proximal timespan. as such, it is difficult to draw conclusions about the magnitude of our estimates compared to previous estimates in the literature. compared to john et al. (2012), fraser, parker, nakagawa, barnett, and fidler (2018), makel et al. (2019), and agnoli et al. (2017), we estimate lower rates of questionable research practices. compared to fiedler and schwarz (2016), however, we estimate higher rates of these practices. one often discussed reason for inconsistent qrp use estimates is how qrp behaviors are defined. in this work, we defined questionable research practices using the same language as john et al. (2012) and agnoli et al. (2017), though we restricted use to a timespan of at most 15 months (question wording of "in the past 12 months", with data collection lasting 3 months). it is therefore to be expected that our estimates would be lower than some of those reported previously that used an unrestricted timespan of qrp use. additionally, our estimate may be lower than other reported estimates due to lower usage of qrps: increased attention to replication failures in psychology may have led to a decrease in these behaviors. this is also the first report to use the generalized network scale-up estimator to investigate the prevalence of qrp users in psychology.
previous uses of this estimator in the domains of public health (those most at risk of hiv/aids) and oncology (cancer prevalence in iran) have both shown the usefulness of using social networks to measure hard-to-reach populations (salganik et al., 2011; vardanjani et al., 2015). a major strength of this estimating technique is that it can incorporate estimates of information transmissibility, or how available information is to an observer. direct estimates, on the other hand, rely on an individual's willingness to participate and their willingness to honestly share their identity with the researcher. pressure to appear a certain way (social desirability bias) can distort a direct estimate downward. social network methods, in contrast, enable researchers to better understand the social processes at work that produce an environment where members vary in their identity and the identity information they share with others (zheng et al., 2006). in the process of producing the reported population size estimate for current qrp users, we also report the first estimate of the social transmissibility of qrp-use identity: 0.06 [0.03, 0.10], or 6.02%. this means that only 6% of the population of qrp users is "visible" through the social networks of the general population of psychologists. this estimate suggests that, for every qrp user a psychologist knows, there are approximately 16 other psychologists in their social network who are also qrp users. these population size estimates can serve as a baseline to measure the effectiveness of current initiatives, as well as a foundation for new ones. while much work is being done to grow support for interventions such as pre-registration (wagenmakers and dutilh, 2016) and registered reports (chambers et al., 2014), it is unknown what quantitative effect these are having on curbing behaviors associated with inflated type i error, such as qrps.
by performing follow-up estimates at future time points, the field can use the baseline estimates presented here to measure the effectiveness of these programs at reducing qrp use. as noted previously, qrps exist in a grey area of accepted scientific practice. it is therefore difficult to interpret the severity of qrp use. this difficulty, along with the high variability among previous estimates of qrp prevalence, has led to a number of different conclusions. some have concluded that the problems are overstated (fanelli, 2018), while others argue qrp use presents a real threat to the viability of several scientific fields, such as education and political science (bosco et al., 2016). although our estimates move the field forward in understanding the prevalence of those who use these behaviors, they provide less guidance on the severity of the consequences of qrp use on the whole.

study 2: assessing the stigmatization of qrp-using psychologists

methods

preregistration statement. this study was not preregistered and should be considered an exploratory assessment of the stigmatization of qrp users within the us psychologist population. future preregistered studies should be conducted to confirm the relationships described in this study. population of interest and target group. the population of interest for this study was all tenured or tenure-track researchers associated with a phd-granting psychology department in the united states. as this was the same population of interest as in study 1, data were collected for both studies simultaneously. survey distribution. survey material was distributed as described previously (study 1). in total, 1,775 population members were solicited to participate. stigma-related survey items were restricted to survey 1, which did not ask individuals about their own qrp use. one hundred thirty responses were collected from this survey, of which 98 were full responses without missing data. these 98 responses were used for analysis.
dependent measure. because there were no existing measures of qrp-related stigma, questionnaire items measuring stigma related to being a qrp user were developed from a scale designed to assess perceived devaluation and discrimination related to smoking cigarettes (link and phelan, 2001; stuber et al., 2008). the measure assesses respondent perceptions of what most other researchers believe. these items were modified to frame them in terms of qrp use. for example, the item "most people think less of a person who smokes" was modified to "most people think less of those who use qrps". cronbach's alpha was calculated to assess the reliability of the items as a scale; α = 0.78, suggesting acceptable internal consistency (tavakol and dennick, 2011). responses to each question were on a four-point likert scale that ranged from "strongly disagree" to "strongly agree". the dependent measure was constructed as the sum of these four item responses, where larger values indicated higher qrp stigma. independent measures. the independent measures were as follows. age: participants self-reported their age in years. phd year: participants self-reported the year in which they obtained their phd. although collected, this measure was not used in subsequent analyses. acceptability: to assess descriptive and injunctive social norms at a peer level, participants were asked one question: "how do most of your colleagues feel about using qrps? do they find it acceptable, unacceptable, or that they don't care one way or another?" the 17 participants who responded "they don't care one way or another" were excluded from analyses that included this measure due to ambiguity in whether this response indicated positive or negative attitudes about qrp use.
attribution: two items were used to assess what participants believed were the causes of qrp use: "qrp use is due to weak character", which was used to assess internal attribution, and "qrp use is due to stress", which was used to assess external attribution. fear: to assess fear related to the academic hazards posed by qrp users in their capacity as mentors, one item was used: "qrp users are a threat to their students". power: socioeconomic status was assessed by tenure level (assistant professor, associate professor, or full professor), and by individual income level (measured with six bins: less than $49,999, $50,000–$74,999, $75,000–$99,999, $100,000–$149,999, $150,000–$199,999, $200,000 or more). although collected, tenure level was not used in subsequent analyses. control variables. racial/ethnic status was assessed by self-identification with categories planned to be used in the 2020 u.s. census (white, black or african american, latino, hispanic or spanish origin, american indian or alaska native, asian, middle eastern or north african, native hawaiian or other pacific islander, none of the above, or prefer not to say). political orientation ("politics") was assessed on a 6-point scale (very conservative, somewhat conservative, middle-of-the-road, somewhat liberal, very liberal, and not sure). gender was assessed as either female, male, or prefer not to say. behavioral measures. to assess behaviors associated with concealing a stigmatized identity, social withdrawal and selective information transmission were measured. the average social network size of qrp users was measured and used in the calculation of the generalized network scale-up method in study 1. if qrp users socially withdraw as an adaptation to living and working with a stigmatized identity, we would predict that their average social network size would be smaller than the average social network size of the general population of psychologists.
selective transmission was assessed by measuring the number of social network alters in each qrp user's social network who are aware of the qrp-use identity of the participant, and assessing which alters are also qrp users. if a qrp user selectively discloses their identity information to in-group members, we predict that another qrp user is more likely to know the qrp-use status of a qrp-using participant than a psychologist whose own qrp-use status is unknown to that participant. in other words, qrp users disclose their qrp-use identity information to other qrp users rather than to individuals with an unknown qrp-use status. statistical analyses. for descriptive analyses, responses answered on a four-point likert scale were reduced to two bins ("agree" and "disagree"). linear regression was used to assess the direct relationship between independent measures and the dependent measure using the statistical program r (version 4.0.2; rmarkdown files with full analyses and r packages used are available on our project osf page: https://osf.io/2zwqf/). a possible curvilinear relationship between power and qrp stigma was tested by introducing the squared power predictor to an additional model. data points depicted in linear regression graphs were jittered to provide increased clarity. an odds ratio was calculated to determine the odds of a qrp-using alter knowing the participant's qrp-use identity compared to an alter with unknown qrp-use status knowing the participant's qrp-use identity. an independent samples t test was calculated to determine the mean difference between the average social network size of qrp users and the average network size of the general psychologist population.

figure 3.
prevalence of perceived stigma against qrp users among the general population of psychologists. fewer than half believe qrp users perceive stigma against them, though nearly 80% of respondents believe the researcher community thinks less of qrp users.

results

figure 3 shows the prevalence of perceived stigma against qrp users among the general population of psychologists. participants agreed that "most researchers think less of those that use qrps" (77.3% of participants agree) and that "most researchers would not let a qrp user mentor their students" (55.8% of participants agree). furthermore, 44.6% of participants agreed that using qrps is a sign of professional failure. interestingly, only 36.7% of participants agreed with the statement that qrp users perceive high stigma against them. it could be argued that the gap between "most researchers think less of those who use qrps" and "qrp users perceive high stigma against them" speaks to the nature of stigma itself: that it is a negative process established at the environmental level (as opposed to the individual level) by those free of the stigmatizing mark. table 3 reports the multiple regression output of all independent variables of interest regressed on the dependent variable. for this analysis, income was used as the operationalization of power, and age (in years) was used as the operationalization of age (as opposed to phd conferral year), as these were more interpretable variables and have been used in previous literature (stuber et al., 2008). this model also included the control variables of gender, ethnicity, and political orientation. in this model, age and fear are both significant predictors of stigmatization of qrp users. here, younger participants stigmatized qrp users significantly more than older participants (p = 0.03), and those who feared qrp users as a threat to their students stigmatized qrp users significantly more (p = 0.0069).
as we are interested in whether specific theoretical domains of stigma predict stigma against qrp users, it was theoretically important to also look at the direct relationships between the predictors in the multiple regression and the qrp stigma outcome (mela and kopalle, 2002).

table 3. multiple regression output of the single model that includes all stigma domains (age, acceptability, external attribution, internal attribution, fear, and power (linear)), as well as the control variables gender, ethnicity, and political orientation.

coefficient             estimate(β)   estimate(b)   se       t-value   p-value
(intercept)             –             7.8437        2.4069   3.26      .0016**
age                     -0.23245      -0.0408       0.0185   -2.21     .03*
acceptability           -0.01056      -0.0494       0.4917   -0.10     .9202
internal attribution    0.13725       0.4822        0.3803   1.27      .2083
external attribution    -0.00675      -0.0220       0.3511   -0.06     .9502
fear                    0.31118       0.9136        0.3299   2.77      .0069**
power (linear)          0.16157       0.3680        0.2328   1.58      .1178
gender                  -0.07151      -0.2580       0.4253   -0.61     .5458
ethnicity               -0.05825      -0.1381       0.2852   -0.48     .6296
political orientation   0.02853       0.0508        0.1909   0.27      .7908

table 4. β coefficient outputs of the seven individual regressions run to test domains of stigma. each model was specified as follows: stigma dv regressed on domain (age, acceptability, internal attribution, external attribution, fear, power (linear), or power (quadratic)) + gender + ethnicity + political orientation.

                        model 1   model 2   model 3   model 4   model 5   model 6   model 7   p-value
age                     -0.22                                                                 .031*
acceptability                     0.43                                                        .000***
internal attribution                        0.28                                              .008***
external attribution                                  0.15                                    .160
fear                                                            0.40                          .000***
power (linear)                                                            0.12                .250
power (quadratic)                                                                   0.27      .660
gender                  0.01      -0.03     -0.09     -0.03     -0.03     -0.05     -0.04
ethnicity               -0.15     -0.04     -0.10     -0.12     -0.03     -0.14     -0.15
political orientation   -0.07     -0.03     0.00      -0.06     -0.01     -0.02     -0.02
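the direct-model specification used for table 4 (stigma regressed on one domain plus the controls gender, ethnicity, and political orientation) can be sketched as a standardized regression. the data, variable values, and effect size below are entirely synthetic and hypothetical; the published models were fit in r with the authors' real responses.

```python
# sketch of one "direct model" (synthetic, hypothetical data): stigma dv
# regressed on a single stigma domain plus the three control variables.
# standardizing all variables first makes the fitted slopes the β weights.
import numpy as np

rng = np.random.default_rng(0)
n = 98  # matches the number of complete responses analyzed in the text

fear = rng.normal(size=n)                          # domain predictor
gender = rng.integers(0, 2, size=n).astype(float)  # control variables
ethnicity = rng.integers(0, 2, size=n).astype(float)
politics = rng.integers(1, 7, size=n).astype(float)
stigma = 0.4 * fear + rng.normal(scale=0.9, size=n)  # built-in true effect

def zscore(x):
    return (x - x.mean()) / x.std()

# design matrix: intercept column plus the standardized predictors
X = np.column_stack([np.ones(n)] +
                    [zscore(v) for v in (fear, gender, ethnicity, politics)])
beta, *_ = np.linalg.lstsq(X, zscore(stigma), rcond=None)
print(round(float(beta[1]), 2))  # standardized β for the domain predictor
```

repeating this fit once per domain, with the same three controls each time, yields the seven columns of table 4.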
investigating the direct relationships between each theoretical domain of stigma and qrp stigma provides additional insight into whether qrp use satisfies conditions predicted by stigma theory: namely, that qrp use breaks social norms, is internally attributed, is feared, and that qrp users are in a lower position of power compared to the general population of psychologists. age is an additional predictor that is outside of classic stigma theory, but interesting in this specific context, as qrp use and the resulting scientific reform movement may unequally affect researchers across age (everett and earp, 2015). the results of the seven direct models are reported in table 4. age was a significant predictor of stigma, with younger participants holding more stigmatizing views of qrp users than older participants (β = -0.22, p = 0.031). acceptability was dummy coded, with qrp use being acceptable coded as "0" and qrp use being unacceptable coded as "1". in the direct model, acceptability of qrp use was a significant predictor of stigma: participants who considered qrp use unacceptable held more stigmatizing views of qrp users than those who considered it acceptable (β = 0.43, p < .001). in the direct model, internal attribution of qrp use was a significant predictor of stigma: participants who more strongly believed that qrp use was due to a researcher's weak character held more stigmatizing views of qrp users (β = 0.28, p = 0.008). however, we did not observe a statistically significant effect of external attribution on stigma towards qrp users (p = 0.16). fear of qrp users was a significant predictor of stigma.
participants who more strongly believed that qrp users were a threat to their students held more stigmatizing views of qrp users (β = 0.40, p < .001).[1] power was operationalized as individual income, and was modeled both linearly and curvilinearly, as it was theoretically plausible that those at the very low and high ends of income in the academic workplace would hold more similar views towards qrp users compared to those at middle incomes. in both the linear and quadratic models, we did not observe a statistically significant difference in stigma predicted by power (p = 0.25 and p = 0.66). beyond the bivariate relationships, it is also important to consider the frequency of participant responses. table 5 reports the prevalence of agreement with the independent measures used in the previous regression models. although internal attribution was a significant and positive predictor of stigma (see table 4), only a small number (24%) of participants agreed that qrp use could be internally attributed. most participants (66.2%) agreed that qrp use could be externally attributed to stress. similarly, most participants (75%) agreed that qrp use broke social norms and that qrp use was threatening to students (68.5%). stigma-related behaviors. to assess whether self-identified qrp-using psychologists behave in ways predicted by social stigma theory, two behaviors were observed and assessed: social withdrawal and selective information transmission (or selective disclosure). social withdrawal. the average professional social network size for the general population of psychologists was 184.93 individuals. the average professional social network size for self-identified qrp-using psychologists was 178.60 individuals. we did not observe a statistically significant difference in social network size, t(72) = -0.2, p = 0.8. see figure 4 for the kernel density plot of professional social network sizes for all participants. selective disclosure.
the 56 self-identified qrp users in this work produced a total of 1,230 social network alters from the game of contacts procedure (described in study 1). one hundred of these alters were considered "in-group" members, meaning these were alters who were identified as qrp users by participants in the study who self-identified as qrp users. in other words, the participants and these alters shared a common "qrp user" identity. the other 1,130 social network alters were out-group members, or psychologists whose qrp-use status was unknown to the 56 qrp-using participants described in this work. participants, or "egos", were asked for each alter whether or not that person knew of the participant's qrp-user identity status (either "this person knows i have used a qrp in the past 12 months" or "i do not know if this person knows i have used a qrp in the past 12 months"). the counts of these responses are depicted in figure 5. as seen in figure 5, 58 out of 100 in-group alters generated know the ego's qrp-use identity (58%, top left panel). conversely, when the alter's qrp-use status is unknown to the ego, only 16 out of 1,130 alters generated know of the ego's qrp-use identity (1.44%, top right panel). this results in an odds ratio of 96.14 (95% confidence interval [51.03, 181.14]; calculation described in szumilas, 2010), indicating that the odds of an in-group alter knowing the ego's qrp-use status are 96.14 times higher than for out-group alters.

figure 4. kernel density plot of social network size for self-identified qrp users and the general population (those who did not self-identify as a qrp user). the social network distributions for these two groups were not significantly different, t(72) = -0.2, p = 0.8. generated using the geom_density() function within the ggplot2 package in r 4.0.2.
this provides evidence of selective transmission of qrp-identity status to in-group members over out-group members, a behavior also observed in other stigmatized populations (herman, 1993).

[1] note that one item of the stigma dv, "qrp users are a threat to their students", is similar to the iv item operationalizing the fear component of stigma, "most researchers would not let a qrp user mentor their students". in a post hoc analysis performed during manuscript review, the fear iv item was a significant predictor of 3 of the 4 items in the stigma inventory: "most researchers think less of those who use qrps", β = 0.44, p < 0.01; "most researchers would not let a qrp user mentor their students", β = 0.37, p < 0.01; and "most researchers believe using qrps is a sign of professional failure", β = 0.43, p < 0.01. it was not a significant predictor of the fourth item, "qrp users perceive high stigma against them", β = 0.06, p = 0.54. this should provide some additional insight that this relationship is not being driven solely by a similarity between iv and dv items.

table 5. percent of participants who agreed or strongly agreed with the items used as independent measures.

domain                  item                                                           % agree
acceptability           most of your colleagues feel using qrps is unacceptable        75.0%
fear                    qrp users are a threat to their students                       68.5%
external attribution    most researchers believe using qrps is due to stress           66.2%
internal attribution    most researchers believe using qrps is due to weak character   24.0%

figure 5. a 2x2 plot of the 1,230 alters generated by the 56 self-identified qrp users in this study. if the participant in our study (the ego) knows the alter is a qrp user, the alter is much more likely to know the qrp-use identity of the ego compared to when the ego does not know the qrp-use behavior of the alter (odds ratio = 96.14, 95% confidence interval [51.03, 181.14]).
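the odds ratio and its confidence interval can be recovered from the counts in figure 5 with the standard log-odds method cited above (szumilas, 2010), and the same responses also reproduce study 1's transmissibility estimate. a python sketch (the published analyses were in r):

```python
# odds ratio and 95% ci from the 2x2 counts in figure 5, via the
# standard log-odds-ratio method (szumilas, 2010)
import math

in_know, in_dont = 58, 100 - 58        # alters known to be qrp users
out_know, out_dont = 16, 1_130 - 16    # alters of unknown qrp-use status

odds_ratio = (in_know * out_dont) / (in_dont * out_know)
se_log = math.sqrt(1 / in_know + 1 / in_dont + 1 / out_know + 1 / out_dont)
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log)
print(round(odds_ratio, 1), round(lower, 1), round(upper, 1))
# ≈ 96.1 [51.0, 181.2], close to the reported 96.14 [51.03, 181.14]

# transmissibility: share of all 1,230 alters who know the ego's identity
tau = (in_know + out_know) / 1_230
print(round(tau * 100, 2))  # → 6.02, i.e. study 1's tau of 0.06
```

the small discrepancies in the final decimals relative to the published values are consistent with rounding of intermediate quantities.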
discussion

this study focused on the relationships between groups of research psychologists and whether qrp-using psychologists were stigmatized by their peers. all analyses except those focused on socioeconomic power (models 6 and 7, table 4) support the hypothesis that qrp users are a stigmatized subpopulation of psychologists. one reviewer was not sympathetic to the prediction, made by link and phelan's (2001) model of stigma, that those with more socioeconomic power would be more stigmatizing in the academic context. we believe there are some potential reasons why power was not a significant predictor of qrp stigma in this study. it could be that economic power is a poor operationalization of power in the academic social environment. it is possible that the number of published papers, citation count, h-index, or years in a prestigious position could serve as better proxies of power in the academic social setting than income (bourdieu, 1988). it could also be that there is no difference in power between qrp users and the general population of psychologists. academia is unlike the typical social environment in some key ways. for instance, success as an academic psychologist has relied more and more on working with others. collaboration rates in psychology have been rising over the past 90 years (zafrunnisha and pullareddy, 2009), and this selective pressure to collaborate may serve as a vehicle for high-income and lower-income academics to intersect. the academic model is also based on a mentor-mentee relationship, where professors who make an adequate salary often work closely with graduate students, who are either unpaid, paid a modest stipend, or economically insecure (ehrenberg and mavros, 1992). academia may not support a social environment where those of higher economic power can stigmatize those of lower economic power.
it could also be that those in high socioeconomic positions used qrp behaviors to get to that position of power, and thus are in no position to hold stigmatizing attitudes towards other qrp users. taken together, the results of these models suggest that qrp-using psychologists are stigmatized by the general population of psychologists. qrp users are seen as breaking social norms and are feared as a threat to their students. when qrp use is internally attributed, stigmatizing attitudes are higher. however, when asked directly, most participants agreed that qrp use was more attributable to external variables (like stress; see table 5) than internal variables (like weak character). beyond investigating the attitudes of the general population of psychologists towards qrp users, this study also directly observed stigma-related behaviors of qrp users themselves. this is a step forward in determining whether qrp users are stigmatized, because we can ask the question "do qrp users act like other stigmatized groups?". there were two stigma-related behaviors observed in this study: social withdrawal and selective information transmission (selective disclosure). figure 4 shows the comparison in social network size between qrp users and the general population of psychologists. although qrp users have a slightly smaller average social network size (178.60, versus the general population of psychologists' average of 184.93), this difference was not statistically significant. here too, it is possible that the nature of academic psychology inhibits qrp users from socially withdrawing. as mentioned previously, success as a psychologist has relied more and more on collaboration; restricting one's academic social network therefore directly inhibits success. this outcome may also be due to selection bias, whereby those qrp-using psychologists who had socially withdrawn to protect their stigmatized identity no longer found success in academia and moved on to other careers.
having a sufficiently large social network may be a key factor in success in academic psychology, and shrinking one's social network to protect a concealed identity may reduce academic success and, with it, the likelihood of being solicited for this study. the other stigma-related behavior studied was selective transmission of qrp-use identity. figure 5 shows the number of people in qrp users' social networks who either do or do not know about that person's qrp-use identity, given that the social network member either is or is not of known qrp-use status themselves. it suggests that the social transmission of qrp-use identity is dependent on a shared in-group social status. when both members of a social dyad (ego and alter) are qrp users, they are more likely to know that information about each other. when the qrp-use identity of an alter is unknown (they may or may not be a qrp user), the alter is much less likely to know the qrp-using identity of the ego. this is evidence that qrp users selectively disclose their qrp use to other known qrp users. revealing is one significant way individuals can manage an invisible social identity (goffman, 1963). being stigmatized is harmful, as it can lead to stereotyping, loss of status, and discrimination (clair et al., 2005; link and phelan, 2001). by selectively revealing an invisible stigmatized identity to in-group members (in this case, other qrp users), one can avoid the harmful effects of stigmatization while minimizing the negative consequences of keeping one's identity a secret from others (garcia and crocker, 2008; ilic et al., 2014).

general discussion

contributions

the present research makes a number of important contributions. first, it identifies that approximately 20% of american psychologists are recent users of qrps. this is a large proportion, especially given that the "replication crisis" is already several years old.
the current research suggests that even at this time, a non-negligible number of psychologists are using practices, in data collection but especially in preparing scientific reports, that can increase the number of false positives in the published literature. it shows that more work must be done to change researcher behaviors that are beyond the influence of statistical initiatives like lowering the conventional alpha threshold in null hypothesis significance testing (benjamin et al., 2017). six of the ten qrps defined in table 1 take place during manuscript writing and preparation, meaning an intervention that goes beyond data analysis is needed to reduce these behaviors. second, it contributes to the literature on stigma. we use data from both the general population and from the potentially stigmatized population to determine the stigma status of that group. being able to observe a group collectively manage their stigma, while simultaneously collecting data on the negative attitudes held by the general population towards that group, provides us with additional confidence in the conclusion that qrp-using psychologists are indeed a stigmatized population of scientists. that said, as an observational study, causal relationships between stigmatizing attitudes and potential behavioral responses in qrp users cannot be determined here.

strengths, limitations, and future research

there were numerous strengths to these studies. rather than relying solely on self-reports, the population sizing was conducted using three different estimators. for this reason, we learned not only about the size of the population, but also about how these estimates and their confidence interval ranges can vary according to the estimator selected. this is important, especially within the context of attempting to measure a sub-population (qrp users) of a small population of interest (american psychologists in phd-granting departments, total n = 7,101).
the social network estimator allowed us to estimate the size of the population, but also provided insight into how qrp users share their identity information with others, a critical insight elaborated on in study 2. while both studies have elements of self-report (the self-report estimate in study 1, and the attitude measures of the general population of psychologists in study 2), each study used multiple approaches to minimize potential social desirability biases. a major limitation of this work was the low response rate we observed (8.63% full response rate). there are a few possible reasons why the response rate was low. first, we did not offer an incentive of any type for participating in this survey, because the work was unfunded. another potential reason for this low response rate was that the window to participate was only open for one week following our email solicitation. we also only included participants who completed the entire survey, further reducing our response rate. the behaviors of researchers have the potential to shift quickly as norms change with the increased adoption of interventions like preregistration and the registered reports format of publishing. future research should continue to estimate the total number of qrp users to help determine if these interventions are having an effect, or if new, different mechanisms are needed. future work should also use the stigma literature to its advantage when considering how best to reduce the use of questionable research practices. by knowing that qrp users are stigmatized, future research could focus on the causal relationships that may exist between social attitudes and qrp users feeling stigmatized.
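the scale-up logic behind the social network estimator can be sketched with a minimal example. the sketch below uses the basic network scale-up ratio (bernard et al., 2010), not the generalized estimator applied in study 1, and all respondent numbers are hypothetical:

```python
# minimal sketch of the basic network scale-up estimator (nsum):
# the hidden-population size is inferred from how many hidden-population
# members respondents report knowing, relative to their total network size.
# all respondent numbers below are hypothetical, for illustration only.

def scale_up_estimate(hidden_alters, network_sizes, total_population):
    """basic nsum estimate: n_hidden = n_total * (sum of m_i) / (sum of d_i)."""
    return total_population * sum(hidden_alters) / sum(network_sizes)

# e.g. 5 respondents drawn from a population of 7,101 psychologists
m = [3, 1, 0, 2, 4]            # qrp-using alters each respondent reports knowing
d = [180, 150, 200, 170, 190]  # each respondent's estimated network size

print(scale_up_estimate(m, d, 7101))  # ≈ 79.8 qrp users in this toy example
```

the ratio treats each respondent's personal network as a small probability sample of the whole population, which is why a reliable estimate of network size (d) matters as much as the count of hidden-population alters (m).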
future interventions could investigate whether decreasing stigma produces an environment that promotes qrp users revealing their identity or reforming their research behaviors, or whether increasing stigma on qrp users limits the number of researchers who believe qrps are acceptable research practices and thus limits the number of new qrp users (bayer, 2008).

conclusions

much work has shown that there are psychologists who use questionable research practices in the course of analyzing their data and preparing their manuscripts, practices that contribute to the inflated false-positive rates in the published literature. the current studies provide an estimate of the size of this population among tenured or tenure-track american research psychologists (18.18% using a direct estimate, though this study showed that both the point estimate and the variance surrounding it can depend on the estimator used). these researchers are a stigmatized subgroup of psychologists: members of the general population of psychologists hold negative attitudes towards them in domains consistent with the stigma literature, and they selectively disclose their qrp-using identity to in-group others, that is, social network members they have identified as qrp users like themselves. these results suggest that even after several years of a "replication crisis" and a movement towards reform, the field of psychology has much work to do in curbing the use of questionable research practices and reducing the influence of the researcher on the results of the research.

author contact

corresponding author: nicholas fox
institutional email: nwf7@scarletmail.rutgers.edu
permanent email: nfox423@gmail.com
orcid: 0000-0002-3772-8666

conflict of interest and funding

at the time of manuscript submission, nicholas fox worked at the center for open science as a research scientist. the center for open science has an interest in seeing research become more transparent and shareable.
all work, including data collection and the writing of initial drafts, was performed while nicholas was a phd candidate at rutgers university, working in lee jussim's laboratory. this project was not explicitly funded, though access to qualtrics was provided by rutgers university to all faculty and students.

author contributions

credit taxonomy
conceptualization: nf
resources: lj
data curation: nf
software: nf, nh, lj
formal analysis: nf
supervision: lj
funding acquisition: n/a
validation: nf
investigation: nf
visualization: nf
methodology: nf
writing, original draft: nf
writing, editing: nf, nh
project administration: nf, lj

nicholas fox conceptualized the work, carried it out, and developed the first and final drafts of the manuscript, and thus is first author. lee jussim supervised the project from beginning to end, provided feedback and guidance as a phd candidate advisor throughout the work, and provided physical space for conducting this work, and is thus last author. nathan honeycutt provided critical feedback during the editing process of this manuscript, including recommending the addition of study 2, and is thus second author. he also submitted the manuscript to the journal, and for that nicholas fox is extremely thankful.

open science practices

this article earned the preregistration+, open data and open materials badges for preregistering the hypothesis and analysis before data collection, and for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

agnoli, f., wicherts, j. m., veldkamp, c. l. s., albiero, p., & cubelli, r. (2017). questionable research practices among italian research psychologists. plos one, 12(3), 1–17. https://doi.org/10.1371/journal.pone.0172792
arentoft, a., van dyk, k., thames, a.
d., sayegh, p., thaler, n., schonfeld, d., labrie, j., & hinkin, c. h. (2016). comparing the unmatched count technique and direct self-report for sensitive health-risk behaviors in hiv+ adults. aids care: psychological and socio-medical aspects of aids/hiv, 28(3), 370–375. https://doi.org/10.1080/09540121.2015.1090538
bayer, r. (2008). stigma and the ethics of public health: not can we but should we. social science and medicine, 67(3), 463–472. https://doi.org/10.1016/j.socscimed.2008.03.017
benjamin, d. j., berger, j. o., johannesson, m., nosek, b. a., wagenmakers, e. j., berk, r., & johnson, v. e. (2017). redefine statistical significance. psyarxiv, (july 22), 1–18. https://doi.org/10.17605/osf.io/mky9j
bernard, h. r., hallett, t., iovita, a., johnsen, e. c., lyerla, r., mccarty, c., mahy, m., salganik, m. j., saliuk, t., scutelniciuc, o., shelley, g. a., sirinirund, p., weir, s., & stroup, d. f. (2010). counting hard-to-count populations: the network scale-up method for public health. sexually transmitted infections, 86 suppl 2, ii11–5. https://doi.org/10.1136/sti.2010.044446
bosco, f. a., aguinis, h., field, j. g., pierce, c. a., & dalton, d. r. (2016). harking's threat to organizational research: evidence from primary and meta-analytic sources. personnel psychology, 69(3), 709–750. https://doi.org/10.1111/peps.12111
bourdieu, p. (1988). homo academicus. stanford university press.
button, k. s., ioannidis, j. p. a., mokrysz, c., nosek, b. a., flint, j., robinson, e. s. j., & munafò, m. r. (2013). power failure: why small sample size undermines the reliability of neuroscience. nature reviews neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475
chambers, c., feredoes, e., muthukumaraswamy, s. d., & etchells, p. j. (2014). instead of "playing the game" it is time to change the rules: registered reports at aims neuroscience and beyond. aims neuroscience, 1(1), 4–17.
https://doi.org/10.3934/neuroscience.2014.1.4
chapple, a., ziebland, s., & mcpherson, a. (2004). stigma, shame and blame experienced by patients with lung cancer: qualitative study. bmj, online first (june), 1–5. https://doi.org/10.1136/bmj.38111.639734.7c
clair, j. a., beatty, j. e., & maclean, t. l. (2005). out of sight but not out of mind: managing invisible social identities in the workplace. academy of management review, 30(1), 78–95. https://doi.org/10.5465/amr.2005.15281431
corrigan, p. (2000). mental health stigma as social attribution: implications for research methods and attitude change. clinical psychology: science and practice, 7(1), 48–67. https://doi.org/10.1093/clipsy.7.1.48
courtwright, a. (2013). stigmatization and public health ethics. bioethics, 27(2), 74–80. https://doi.org/10.1111/j.1467-8519.2011.01904.x
ehrenberg, r., & mavros, p. (1992). do doctoral students' financial support patterns affect their times-to-degree and completion probabilities?
everett, j. a. c., & earp, b. d. (2015). a tragedy of the (academic) commons: interpreting the replication crisis in psychology as a social dilemma for early-career researchers. frontiers in psychology, 6(august), 1–4. https://doi.org/10.3389/fpsyg.2015.01152
fanelli, d. (2018). is science really facing a reproducibility crisis, and do we need it to? proceedings of the national academy of sciences of the united states of america, in press, 1–4. https://doi.org/10.1073/pnas.1708272114
fiedler, k., & schwarz, n. (2016). questionable research practices revisited. social psychological and personality science, 7(1), 45–52. https://doi.org/10.1177/1948550615612150
fisher, r. j. (1993). social desirability bias and the validity of indirect questioning. journal of consumer research, 20(september 1993), 303–315.
https://doi.org/10.1086/209351
fiske, s. t. (2016). mob rule or wisdom of crowds. aps observer.
fraser, h., parker, t., nakagawa, s., barnett, a., & fidler, f. (2018). questionable research practices in ecology and evolution. plos one, 13(7). https://doi.org/10.1371/journal.pone.0200303
garcia, j. a., & crocker, j. (2008). reasons for disclosing depression matter: the consequences of having egosystem and ecosystem goals. social science and medicine, 67(3), 453–462. https://doi.org/10.1016/j.socscimed.2008.03.016
gervais, w. m., & najle, m. b. (2017). how many atheists are there? social psychological and personality science. https://doi.org/10.1177/1948550617707015
goffman, e. (1959). the presentation of self in everyday life.
garden city, ny: doubleday.
goffman, e. (1963). stigma: notes on the management of spoiled identity. simon & schuster.
herman, n. j. (1993). return to sender: reintegrative stigma-management strategies of ex-psychiatric patients. journal of contemporary ethnography, 22(3), 295–330. https://doi.org/10.1177/089124193022003002
holbrook, a. l., & krosnick, j. a. (2010). social desirability bias in voter turnout reports: tests using the item count technique. public opinion quarterly, 74(1), 37–67. https://doi.org/10.1093/poq/nfp065
ilic, m., reinecke, j., bohner, g., röttgers, h. o., beblo, t., driessen, m., frommberger, u., & corrigan, p. w. (2014). managing a stigmatized identity: evidence from a longitudinal analysis about people with mental illness. journal of applied social psychology, 44(7), 464–480. https://doi.org/10.1111/jasp.12239
jing, l., qu, c., yu, h., wang, t., & cui, y. (2014). estimating the sizes of populations at high risk for hiv: a comparison study. plos one, 9(4), 1–6. https://doi.org/10.1371/journal.pone.0095601
john, l. k., loewenstein, g., & prelec, d. (2012). measuring the prevalence of questionable research practices with incentives for truth telling. psychological science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953
killworth, p. d., mccarty, c., bernard, h. r., shelley, g. a., & johnsen, e. c. (1998). estimation of seroprevalence, rape, and homelessness in the united states using a social network approach. evaluation review, 22(2), 289–308. https://doi.org/10.1177/0193841x9802200205
link, b. g., & phelan, j. c. (2001). conceptualizing stigma. annual review of sociology, 27, 363–385. https://doi.org/10.1146/annurev.soc.27.1.363
makel, m., hodges, j., cook, b., & plucker, j. (2019). questionable and open research practices in education research. https://doi.org/10.35542/osf.io/f7srb
makimoto, k., iida, y., hayashi, m., & takasaki, f. (2001). response bias by neuroblastoma screening participation status and social desirability bias in an anonymous postal survey, ishikawa, japan. journal of epidemiology, 11(2), 70–73.
mccormick, t. h., salganik, m. j., & zheng, t. (2010). how many people do you know?: efficiently estimating personal network size. journal of the american statistical association, 105(489), 59–70. https://doi.org/10.1198/jasa.2009.ap08518
mela, c., & kopalle, p. (2002). the impact of collinearity on regression analysis: the asymmetric effect of negative and positive correlations. applied economics, 34(6), 667–677. retrieved march 17, 2020, from https://econpapers.repec.org/article/tafapplec/v_3a34_3ay_3a2002_3ai_3a6_3ap_3a667-677.htm
qualtrics. (2005). qualtrics survey software. www.qualtrics.com
ripley, b. (2021). package 'boot'.
salganik, m. j., fazito, d., bertoni, n., abdo, a. h., mello, m. b., & bastos, f. i. (2011). assessing network scale-up estimates for groups most at risk of hiv/aids: evidence from a multiple-method study of heavy drug users in curitiba, brazil. american journal of epidemiology, 174(10), 1190–1196. https://doi.org/10.1093/aje/kwr246
salganik, m. j., & heckathorn, d. d. (2004). sampling and estimation in hidden populations using respondent-driven sampling. sociological methodology, 34(1), 193–240. https://doi.org/10.1111/j.0081-1750.2004.00152.x
salganik, m. j., mello, m. b., abdo, a. h., bertoni, n., fazito, d., & bastos, f. i. (2012). the game of contacts: estimating the social visibility of groups. social networks, 33(1), 70–78. https://doi.org/10.1016/j.socnet.2010.10.006
simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology.
psychological science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
starosta, a. j., & earleywine, m. (2014). assessing base rates of sexual behavior using the unmatched count technique. health psychology and behavioral medicine, 2(1), 198–210. https://doi.org/10.1080/21642850.2014.886957
stuber, j., galea, s., & link, b. g. (2008). smoking and the emergence of a stigmatized social status.
social science and medicine, 67(3), 420–430. https://doi.org/10.1016/j.socscimed.2008.03.010
szumilas, m. (2010). explaining odds ratios. journal of the canadian academy of child and adolescent psychiatry, 19(3), 227–229.
tavakol, m., & dennick, r. (2011). making sense of cronbach's alpha. international journal of medical education, 2, 53–55. https://doi.org/10.5116/ijme.4dfb.8dfd
teixeira da silva, j. a. (2018). freedom of speech and public shaming by the science watchdogs. journal of advocacy, research, and education, 5(1).
vardanjani, h. m., baneshi, m. r., & haghdoost, a. (2015). cancer visibility among iranian familial networks: to what extent can we rely on family history reports? plos one, 10(8), e0136038. https://doi.org/10.1371/journal.pone.0136038
wagenmakers, e.-j., & dutilh, g. (2016). seven selfish reasons for preregistration. https://www.psychologicalscience.org/observer/seven-selfish-reasons-for-preregistration/comment-page-1
wicherts, j. m., veldkamp, c. l., augusteijn, h. e., bakker, m., van aert, r. c., & van assen, m. a. (2016). degrees of freedom in planning, running, analyzing, and reporting psychological studies: a checklist to avoid p-hacking. frontiers in psychology, 7(nov), 1–12. https://doi.org/10.3389/fpsyg.2016.01832
wolter, f., & laier, b. (2014). the effectiveness of the item count technique in eliciting valid answers to sensitive questions. an evaluation in the context of self-reported delinquency. survey research methods, 8(3), 153–168.
zafrunnisha, n., & pullareddy, v. (2009). authorship pattern and degree of collaboration in psychology. annals of library and information studies, 56(december), 255–261.
zheng, t., salganik, m. j., & gelman, a. (2006). how many people do you know in prison? journal of the american statistical association, 101(474), 409–423.
https://doi.org/10.1198/016214505000001168
meta-psychology, 2022, vol 6, mp.2021.3078 https://doi.org/10.15626/mp.2021.3078 article type: tutorial published under the cc-by4.0 license open data: not applicable open materials: not applicable open and reproducible analysis: not applicable open reviews and editorial process: yes preregistration: no edited by: erin m buchanan reviewed by: nataly beribisky, daniel baker analysis reproduced by: not applicable all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/j5v26 power to the people: a beginner's tutorial to power analysis using jamovi james e bartlett school of psychology and neuroscience, university of glasgow, uk sarah j charles department of psychology, institute of psychiatry, psychology & neuroscience, king's college london, uk abstract authors have highlighted for decades that sample size justification through power analysis is the exception rather than the rule. even when authors do report a power analysis, there is often no justification for the smallest effect size of interest, or they do not provide enough information for the analysis to be reproducible. we argue one potential reason for these omissions is the lack of a truly accessible introduction to the key concepts and decisions behind power analysis. in this tutorial targeted at complete beginners, we demonstrate a priori and sensitivity power analysis using jamovi for two independent samples and two dependent samples. respectively, these power analyses allow you to ask the questions: "how many participants do i need to detect a given effect size?", and "what effect sizes can i detect with a given sample size?". we emphasise how power analysis is most effective as a reflective process during the planning phase of research to balance your inferential goals with your resources.
by the end of the tutorial, you will be able to understand the fundamental concepts behind power analysis and extend them to more advanced statistical models. keywords: power analysis, effect size, tutorial, a priori, sensitivity, jamovi

introduction

"if he is a typical psychological researcher, not only has he exerted no prior control over his risk of committing a type ii error, but he will have no idea what the magnitude of this risk is." (cohen, 1965, p. 96; footnote: we are aware of the problem with using gendered language like in the original quote. despite this issue, we think the quote still resonates.)

for decades researchers have highlighted that empirical research has chronically low statistical power (button et al., 2013; cohen, 1962; sedlmeier & gigerenzer, 1989). this means that a typical study does not include enough participants to reliably detect a realistic effect size (see table 1 for a definition of key terms). one method to avoid low statistical power is to calculate how many participants you need for a given effect size in a process called "power analysis". power analysis is not the only way to justify your sample size (see lakens, 2022), but despite increased attention to statistical power, it is still rare to find articles that justify their sample size through power analysis (chen & liu, 2019; guo et al., 2014; larson & carbine, 2017). even for those that do report a power analysis, there are often other problems such as poor justification for the effect size, misunderstanding statistical power, or not making the power analysis reproducible (bakker et al., 2020; beribisky et al., 2019; collins & watt, 2021). therefore, we present a beginner's tutorial which outlines the key decisions behind power analysis and walks through how they apply to t-tests for two independent samples and two dependent samples.
we expect no background knowledge as we will explain the key concepts and how to interact with the software we use. before beginning the tutorial, it is important to explain why we need power analysis. there is a negative relationship between the sample size of a study and the effect size the study can reliably detect. holding everything else constant, a larger study can detect smaller effect sizes, and conversely, a smaller study can only detect larger effect sizes. a study can be described as underpowered if the effect size you are trying to detect is smaller than the effect size your study has the ability to detect. if we published or shared the results of all the studies we ever conducted, underpowered research would be less of a problem: we would just see more statistically non-significant findings. however, since there is publication bias that favours significant findings (dwan et al., 2008; franco et al., 2014), underpowered studies warp whole fields of research (button et al., 2013). imagine five research groups were interested in the same topic and performed a similar study using 50 participants each. the results from the first four groups were not statistically significant, but the fifth group by chance observed a larger effect size which was statistically significant. we know that non-significant findings are less likely to be published (franco et al., 2014), so only the fifth group published their findings. now imagine you wanted to build on this research; to inform your study, you review the literature. all you find is the fifth study reporting a larger statistically significant effect, and you use that effect size to inform your study, meaning you recruit a smaller sample than if you had expected a smaller effect size. if studies systematically use small sample sizes, only larger, more unrealistic effect sizes are published, and smaller, more realistic effect sizes are hidden.
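the five-groups scenario above can be made concrete with a short simulation. this sketch is our illustration, not part of the tutorial: it assumes a small true effect (d = 0.2), n = 50 per group, and a literature that only publishes significant results, and it shows how the "published" effect sizes end up inflated:

```python
# simulate many identical two-sample studies of a small true effect and
# compare the average observed effect size across all studies with the
# average among significant (i.e. "published") studies only.
# true_d, n, and the number of studies are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, studies = 0.2, 50, 5000

observed_d, significant = [], []
for _ in range(studies):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    t, p = stats.ttest_ind(b, a)
    # cohen's d from the sample means and pooled standard deviation
    d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    observed_d.append(d)
    significant.append(p < 0.05)

observed_d = np.array(observed_d)
significant = np.array(significant)
print(f"mean observed d, all studies:      {observed_d.mean():.2f}")
print(f"mean observed d, significant only: {observed_d[significant].mean():.2f}")
print(f"proportion significant (power):    {significant.mean():.2f}")
```

the significant-only mean is noticeably larger than the true effect, and the proportion of significant studies shows how underpowered this design is for an effect of this size.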
moreover, researchers tend to have poor intuitions about statistical power, and underestimate what sample size they need for a given effect size (bakker et al., 2016). in combination, this means researchers tend to power their studies for unrealistically large effect sizes and think they need a sample size which would be too small to detect more realistic effect sizes (etz & vandekerckhove, 2016). in short, systematically underpowered research is a problem as it warps researchers' understanding of both what constitutes a realistic effect size and what an appropriate sample size is. a power analysis tutorial article is nothing new. there are comprehensive guides to power analysis (e.g., brysbaert, 2019; perugini et al., 2018), but from our perspective, previous tutorials move too quickly from beginner to advanced concepts. in research methods curricula, educators only briefly cover power analysis across introductory and advanced courses (sestir et al., 2021; targ meta-research group, 2020). in their assessment of researchers' understanding of power analysis, collins and watt (2021) advise that clearer educational materials should be available. in response, our approach is to present a beginner's tutorial that can support both students and established researchers who are unfamiliar with power analysis. we have split our tutorial into three parts, starting with a recap of the statistical concepts underlying power analysis. there are common misconceptions around useful types of power analysis (beribisky et al., 2019), so it is important to outline what we are trying to achieve. in part two, we outline the decisions you must make when performing a power analysis, like choosing your alpha, beta, and smallest effect size of interest. we then present a walk-through in part three on performing a priori and sensitivity power analyses for two independent samples and two dependent samples.
absolute beginners unfamiliar with power analysis should start with part one, while readers with a general understanding of power analysis can start with part two. we conclude with recommendations for future reading that outline power analysis for more advanced statistical tests.

part one: the statistics behind power analysis

type i and type ii errors

the dominant theory of statistics in psychology is known as "frequentist" or "classical" statistics. power analysis is used within this framework, where probability is assigned to "long-run" frequencies of observations (many things happening over time). in contrast, bayesian statistics uses another theory of probability that can be applied to individual events by combining prior belief with a likelihood function (see kruschke and liddell, 2018, for how power analysis applies to bayesian statistics). in this article, we are only covering the frequentist approach, where "long-run" probability is where p-values come from. researchers often misinterpret the information provided by p-values (goodman, 2008). in our following explanations, we focus on the neyman-pearson approach (lakens, 2021), where the aim of the frequentist branch of statistics is to help you make decisions and limit the number of errors you will make in the long-run (neyman, 1977). the formal definition of a p-value by cohen (1994) is the probability of observing a result at least as extreme as the one observed, assuming the null hypothesis (there is no effect) is true. this means a small p-value (closer to 0) indicates the results are unlikely if the null hypothesis is true, while a large p-value (closer to 1) indicates the results are more likely if the null is true. the probabilities do not relate to individual studies but tell you the probability attached to the procedure if you repeated it many times.
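this long-run reading can be checked with a short simulation (our illustration, not part of the tutorial): when the null hypothesis is true, p-values from a two-sample t-test are uniformly distributed, so over many repetitions about 5% of them fall below .05:

```python
# simulate many null studies (both groups drawn from the same distribution)
# and count how often p < .05. in the long-run, this false-positive rate
# converges on alpha. group size and study count are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, studies = 0.05, 30, 10000

p_values = []
for _ in range(studies):
    a = rng.normal(0.0, 1.0, n)  # the null is true:
    b = rng.normal(0.0, 1.0, n)  # both groups share one population
    p_values.append(stats.ttest_ind(a, b).pvalue)

false_positive_rate = np.mean(np.array(p_values) < alpha)
print(f"type i error rate over {studies} null studies: {false_positive_rate:.3f}")
```

the printed rate hovers around .05, which is exactly the long-run guarantee that setting alpha provides; no individual study's p-value tells you whether that study is one of the errors.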
so, a p-value of .05 means that, if you were to keep taking lots of samples from the population, and the null hypothesis was true, the chance of finding a result at least as extreme as the one we have seen is 5%, or 1 in 20. this can be phrased as "if we conducted an infinite number of studies with the same design as this, 5% of all of the results would be at least this extreme". the reason for using long-run probability is that any single measurement carries with it the possibility of some kind of 'error': unaccounted-for variance that is not assumed in our hypotheses. for example, the researcher's reaction times, the temperature of the room, the participant's level of sleep, the brightness of the screens on equipment being used, etc. could all cause slight changes to the accuracy of the measures we make. in the long-run, these are likely to average out. we used a p-value of .05 as an example because an alpha value of .05 tends to be used as the cut-off point in psychology to conclude "we are happy to say that this result is unlikely/surprising enough to make a note of". alpha (sometimes written as "α") is the probability of concluding there is an effect when there is not one, known as a type i error (said, type one error) or false positive. this is normally set at .05 (5%) and it is the threshold we use for a significant effect. setting alpha to .05 means we are willing to make a type i error 5% of the time in the long-run. in the neyman-pearson approach, we create cut-offs to help us make decisions (lakens, 2021). we want to know if we can reject the null hypothesis and conclude we have observed some kind of effect. by setting alpha, we are saying the p-value for this study must be smaller than alpha to reject the null hypothesis. if our p-value is larger than alpha, we cannot reject the null hypothesis. this is where the term "statistical significance" comes from.
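the long-run behaviour of alpha can be made concrete with a short simulation. the sketch below is our illustration, not part of the original tutorial: it uses a simplified z-test with a known standard deviation rather than the t-test you would normally run, and the sample size is an arbitrary choice. both groups are drawn from the same population, so the null hypothesis is true by construction, and roughly 5% of tests come out significant anyway.

```python
# illustration: simulating the long-run type i error rate under a true null.
# assumptions (not from the tutorial): a z-test with known sd = 1, n = 30 per group.
import random
from statistics import NormalDist, mean

random.seed(42)
ALPHA = 0.05
N_PER_GROUP = 30
N_SIMULATIONS = 10_000
z_dist = NormalDist()

false_positives = 0
for _ in range(N_SIMULATIONS):
    # both groups come from the same population: the null is true by construction
    group_a = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    group_b = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    # z statistic for a difference in means with known sd = 1
    z = (mean(group_a) - mean(group_b)) / (2 / N_PER_GROUP) ** 0.5
    p = 2 * (1 - z_dist.cdf(abs(z)))  # two-tailed p-value
    if p < ALPHA:
        false_positives += 1

print(false_positives / N_SIMULATIONS)  # close to 0.05 in the long run
```

the exact proportion will wobble from run to run, which is the point: alpha is a statement about the procedure over many repetitions, not about any single study.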
as a scientific community, we have come to the collective conclusion that this cut-off point is enough to say "the null hypothesis may not be true", and we understand that in the long-run we would be willing to make a mistake 5% of the time if the null was really true. it is important to understand that the cut-off of 5% appears immutable now for disciplines like psychology that routinely use it, but it was never meant as a fixed standard of evidence. fisher, one of the pioneers of hypothesis testing, commented that he accepted 5% as a low standard of evidence across repeated findings (goodman, 2008). fisher (1926) emphasised that individual researchers should consider which alpha is appropriate for the standard of evidence in their study, but this nuance has been lost over time. for example, bakker et al. (2020) reported that of studies that specifically mention alpha, 91% of power analyses use 5%. this shows how, in psychology, alpha is synonymous with 5% and it is rare for researchers to use a different alpha value. the opposite problem is where we say there is not an effect when there actually is one. this is known as a type ii error (said, type two error) or a false negative. in the neyman-pearson approach, this is the second element of using hypothesis testing to help us make decisions. in addition to alpha limiting how many type i errors (false positives) we are willing to make, we set beta to limit how many type ii errors (false negatives) we are willing to make. beta (sometimes written as "β") is the probability of concluding there is not an effect when there really is one. this is normally set at .20 (20%), which means we are willing to make a type ii error 20% of the time in the long-run. by setting these two values, we are stating rules to help us make decisions and trying to limit how often we will be wrong in the long-run. we will consider how you can approach these decisions in part two.
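beta can be illustrated the same way. in this hedged sketch (again our illustration: the true effect size, sample size, and simplified z-test are made-up assumptions, not anything from the tutorial), the true effect really exists, so every non-significant result is a type ii error, and the proportion of significant results estimates statistical power.

```python
# illustration: estimating power and beta by simulation.
# assumptions (made up for the example): true d = 0.5, n = 30 per group,
# z-test with known sd = 1 as a stand-in for the usual t-test.
import random
from statistics import NormalDist, mean

random.seed(2023)
ALPHA = 0.05
TRUE_D = 0.5       # true standardised mean difference
N_PER_GROUP = 30
N_SIMULATIONS = 5_000
z_dist = NormalDist()

significant = 0
for _ in range(N_SIMULATIONS):
    control = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    treatment = [random.gauss(TRUE_D, 1) for _ in range(N_PER_GROUP)]
    z = (mean(treatment) - mean(control)) / (2 / N_PER_GROUP) ** 0.5
    if 2 * (1 - z_dist.cdf(abs(z))) < ALPHA:
        significant += 1

power = significant / N_SIMULATIONS  # estimated power, roughly 0.5 here
beta = 1 - power                     # estimated type ii error rate
print(power)
```

with these made-up inputs, the study misses a real effect about half the time, which is exactly the kind of underpowered design the tutorial warns against.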
one- and two-tailed tests in significance testing, we describe the null hypothesis as a probability distribution centred on zero. we can reject the null hypothesis if our observed result is greater than a critical value determined by our alpha value. the area beyond the critical value creates a rejection region in the outer tails of the distribution. if the observed result is in this rejection region, we conclude the data would be unlikely assuming the null hypothesis is true, and reject the null hypothesis. there are two ways of stating the rejection region. these are based on the type of alternative hypothesis we are interested in. there are two types of alternative hypotheses: (1) non-directional, and (2) directional. a non-directional hypothesis is simply a statement that there will be any effect, irrespective of the direction of the effect, e.g., 'group a is different from group b'. in contrast to this, the assumed null hypothesis is 'group a is not different from group b'. group a could be smaller than group b, or group a could be bigger than group b. in both situations, the null hypothesis could be rejected. a directional hypothesis, on the other hand, is a statement that there will be a specific effect, e.g., 'group a is bigger than group b'. now, the assumed null hypothesis is 'group a is not bigger than group b'. in this instance, even if we find evidence that group b is bigger than group a, the null hypothesis could not be rejected. this is where the number of tails in a test comes in. in a two-tailed test (also known as a non-directional test), when alpha is set to 5%, there are two separate 2.5% areas to create a rejection region in both the positive and negative tails. together, the two tails create a total area of 5%. to be statistically significant, the observed result can be in either the positive or negative rejection region.
group a could be higher than group b, or group b could be higher than group a; you are just interested in a difference in any direction. in a one-tailed test (also known as a directional test), there is just one larger area totalling 5% to create a rejection region in either the positive or negative tail (depending on the direction you are interested in). the critical value is slightly smaller, but to be statistically significant the result must be in the direction you predicted. this means you would only accept a result of 'group a is bigger than group b' as significant. you could still find that group b is bigger than group a, but no matter how big the difference is, you cannot reject the null hypothesis as it is contrary to your directional prediction. statistical power statistical power is defined as the probability of correctly deciding to reject the null hypothesis when the null hypothesis is not true. in plain english: the likelihood of successfully detecting an effect that is actually there (see baguley, 2009 for other lay definitions). when we have sufficient statistical power, we are making the study sensitive enough to avoid making too many type ii errors. statistical power is related to beta: it is 1-β and is typically expressed as a percentage. if we use a beta value of .20, that means we are aiming for statistical power of 80% (1 - .20 = .80 = 80%). effect size for statistical power, we spoke about "detecting an effect that is actually there". the final piece of the power analysis puzzle is the smallest effect size of interest. an effect size can be defined as a number that expresses the magnitude of a phenomenon relevant to your research question (kelley & preacher, 2012). depending on your research question, this could be the difference between groups or the association between variables. for example, you could study the relationship between how much alcohol you drink and reaction time.
we could say "alcohol has the effect of slowing down reaction time". however, there is something missing from that statement. how much does alcohol slow down reaction time? is one drop of alcohol enough or do you need to consume a full keg of beer before your reaction time decreases by just 1 millisecond? the smallest effect size of interest outlines what effect size you would consider practically meaningful for your research question. effect sizes can be expressed in two ways: as an unstandardised effect, or as a standardised effect. an unstandardised effect size is expressed in the original units of measurement. for example, if you complete a stroop task, you measure response times to congruent and incongruent colour words in milliseconds. the mean difference in response time between congruent and incongruent conditions is an unstandardised effect size and stays in the same units across studies. this means you could say one study reporting a mean difference of 106ms had a larger effect than a study reporting a mean difference of 79ms. unstandardised effect sizes are easy to compare if the measurement units are consistent, but in psychology we do not always have easily comparable units. many subdisciplines use likert scales to measure an effect of interest. for example, in mental health research, one might be interested in how much anxiety someone experiences each week (participants are often given options such as "not at all", "a little", "neither a little nor a lot", "a lot", and "all the time"). these responses are not in easily interpretable measurements but, as scientists, we would still like to provide a numerical value to explain what an effect means. this is where standardised effect sizes are useful as they allow you to compare effects across contexts, studies, or slightly different measures.
for example, if study one used a five-point scale to measure anxiety but study two used a seven-point scale, a difference of two points on each scale has a different interpretation. a standardised effect size allows you to convert these differences into common units, making it easier to compare results across studies using different units of measurement. there are many types of standardised effect sizes, such as cohen's d or η2 (said, eta squared), which we use in different contexts (see lakens, 2013 for an overview). in this tutorial, we mainly focus on cohen's d as the standardised mean difference, as it is the effect size used in jamovi, the software we use in part three below. although there are different formulas, cohen's d is normally the mean difference divided by the pooled standard deviation. this means it represents the difference between groups or conditions, expressed as standard deviations instead of the original units of measurement (note, we use likert scales as an example of measurements with different scales, but it is normally better to analyse ordinal data with ordinal models; see bürkner & vuorre, 2019). standardised and unstandardised effect sizes each have their strengths and weaknesses (baguley, 2009). unstandardised effect sizes are easier to interpret, particularly for lay readers who would find it easier to understand a difference of 150ms instead of 0.75 standard deviations. however, it can be harder to compare unstandardised effect sizes across studies when there are different measurement scales. standardised effect sizes help with this as they convert measures to standardised units, making it easier to compare effect sizes across studies.
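the "mean difference divided by the pooled standard deviation" definition can be written out in a few lines. this is a minimal sketch of that common formula; the function name and the anxiety scores below are invented for illustration, not taken from any study.

```python
# a minimal sketch of cohen's d: mean difference over pooled standard deviation.
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """standardised mean difference between two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    # pooled sd: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2 +
                  (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# hypothetical anxiety scores on a five-point scale
therapy_a = [2, 3, 2, 4, 3, 2, 3, 2]
therapy_b = [3, 4, 4, 5, 3, 4, 4, 3]
print(round(cohens_d(therapy_a, therapy_b), 2))  # → -1.55
```

the sign simply reflects the order of the groups; the magnitude says the groups differ by about one and a half pooled standard deviations, regardless of the scale's original units.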
however, the standardisation process can cause problems, as effect sizes can change depending on whether the design was within- or between-subjects, if the measures are unreliable, and if sampling affects the variance of the measures by restricting the values to a smaller range of the scale (baguley, 2009). similarly, the frame of reference is important when interpreting standardised effect sizes. when classifying the magnitude of standardised effects, cohen (1988, p. 25) specifically says "the terms 'small,' 'medium,' and 'large' are relative, not only to each other, but to the area of behavioural science or even more particularly to the specific content and research method being employed in any given investigation". cohen emphasised that the qualitative labels ("small", "medium", and "large") are arbitrarily applied to specific values, and should be applied differently in different fields. this means that interpretation should not simply follow rules of thumb that were established outside of the research field of interest. although the software we introduce in part three relies on standardised effect sizes, baguley (2009) emphasises it is better to focus on interpreting unstandardised effect sizes wherever possible. to bring this back to statistical power (successfully detecting a true effect), the bigger an effect is, the easier it is to detect. in the anxiety example, we could compare the effects of two types of therapy. if the difference between therapy a and therapy b was, on average, 3 points on an anxiety scale, it would be easier to detect than if the average difference was 1 point. the smaller decrease of 1 point would be harder to detect than the larger decrease of 3 points. you would need to test more people in each therapy group to successfully detect this weaker effect because of the greater overlap between the two sets of therapy outcomes.
it is this principle that allows us to say that the bigger the effect size, the easier it is to detect. in other words, if you have the same number of participants, statistical power increases as the effect size increases. we have now covered the five main concepts underlying power analysis: alpha, beta, sample size, effect size, and one- or two-tailed tests. for ease, we have provided a summary of these concepts, their meaning, and how we often use them in table 1. it takes time to appreciate the interrelationship between these concepts, so we recommend using the interactive visualisation by magnusson (https://rpsychologist.com/d3/nhst/). types of power analysis as four of the concepts behind power analysis (alpha, power, effect size, and sample size) are directly related, we can calculate any one as the outcome if we state the other three. the most common types of power analysis relating to these outcomes are (1) a priori, (2) sensitivity, and (3) post-hoc. if we want sample size as the outcome, we use a priori power analysis to determine how many participants we need to reliably detect a given smallest effect size of interest, alpha, and power. alternatively, if we want the effect size as the outcome, we can use sensitivity power analysis to determine what effect size we can reliably detect given a fixed sample size, alpha, and power. there is also post-hoc power analysis if we want statistical power as the outcome given an observed effect size, sample size, and alpha. post-hoc power analysis is an attractive idea, but it should not be reported as it essentially expresses the p-value in a different way. there is a direct relationship between observed power and the p-value of your statistical test, where a p-value of .05 means your observed power is 50% (lakens, 2022). remember, probability in frequentist statistics does not apply to individual events, so using the observed effect size from a single study ignores the role of the smallest effect size of interest in the long-run.
as post-hoc power is uninformative, we only focus on a priori and sensitivity power analysis in this tutorial. part two: decision making in power analysis now that you are familiar with the concepts, we turn our focus to decision making in power analysis. in part one, we defined the main inputs used in power analysis, but now you must decide on a value for each one. setting your inputs is the most difficult part of power analysis as you must understand your area of research and be able to justify your choices (lakens, 2022). power analysis is a reflective process that is most effective during the planning stage of research, meaning that you must balance your inferential goals (what you want to find out) with the resources you have available (time, money, equipment, etc.). in this part, we will outline different strategies for choosing a value for alpha, beta/power, one- or two-sided tests, your smallest effect size of interest, and your sample size.

table 1. the basic concepts underlying power analysis, what they mean, and how they are often used.

alpha (α): the cut-off value for how frequently we are willing to accept a false positive. this is traditionally set to .05 (5% of the time), but it is often set to lower thresholds in disciplines like physics. the lower alpha is, the fewer false positives there will be in the long-run.

beta (β): the cut-off value for how frequently we are willing to accept a false negative. in psychology, this is usually set to .20 (20% of the time), implicitly suggesting false negatives are less of a concern than false positives. the lower beta is, the fewer false negatives there will be in the long-run.

power (1-β): the chance of detecting an effect that exists. the opposite of beta, power is how likely you are to detect a given effect size. this is usually set to .80 (80% of the time). the higher power is, the more likely you are to successfully detect a true effect if it is there.

effect size: a number that expresses the magnitude of a phenomenon relevant to your research question. unstandardised effect sizes express the difference or relationship in the original units of measurement, such as milliseconds. standardised effect sizes express the difference or relationship in standardised units, such as cohen's d. higher absolute effect sizes mean a larger difference or stronger relationship.

one-tailed test: when the rejection region in null hypothesis significance testing is limited to one tail in a positive or negative direction. if you have a clear (ideally preregistered) directional prediction, one-tailed tests mean you would only reject the null hypothesis if the result was in the direction you predicted. the observed result may be in the extreme of the opposite tail, but you would still fail to reject the null hypothesis.

two-tailed test: when the rejection region in null hypothesis significance testing is present in the extremes of both the positive and negative tail areas. if you would accept a result in any direction, you can use a two-tailed test to reject the null hypothesis if the observed result is in the extremes of either the positive or negative tail.

a priori power analysis: how many participants do we need to reliably detect a given smallest effect size of interest, alpha, and power? we tend to use the term 'a priori' in front of a power analysis that is conducted before data is collected. this is because we are deducing the number of participants from information we already have.

sensitivity power analysis: what effect size could we detect with our fixed sample size, alpha, and desired power? we use a sensitivity power analysis when we already know how many participants we have (e.g., using secondary data, or access to a rare population). we use this type of analysis to evaluate what effect sizes we can reliably detect.
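the a priori and sensitivity questions can also be sketched numerically. the helpers below are our own illustration (not the jpower implementation), using the common normal approximation for a two-group design; exact t-based routines such as jamovi's give marginally larger sample sizes. the function names and the example inputs (d = 0.5, alpha = .05, 80% power) are our assumptions.

```python
# rough sketches of a priori and sensitivity power analysis
# (normal approximation for two independent groups; not jpower's exact method).
import math
from statistics import NormalDist

z = NormalDist().inv_cdf  # quantile function of the standard normal

def n_per_group(d, alpha=0.05, power=0.80, tails=2):
    """a priori: participants per group needed to detect effect size d."""
    return math.ceil(2 * ((z(1 - alpha / tails) + z(power)) / d) ** 2)

def detectable_d(n, alpha=0.05, power=0.80, tails=2):
    """sensitivity: smallest d reliably detectable with n per group."""
    return (z(1 - alpha / tails) + z(power)) * math.sqrt(2 / n)

print(n_per_group(0.5))            # → 63 per group (exact t-based tools give 64)
print(round(detectable_d(50), 2))  # → 0.56 with 50 per group
```

the two functions are the same equation solved for different unknowns, which is why stating any three of alpha, power, effect size, and sample size determines the fourth.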
alpha the first input to set is your alpha value. traditionally, we use .05 to say we are willing to accept making a type i error up to 5% of the time. there is nothing special about using an alpha of .05; it was only a brief suggestion by fisher (1926) for what felt right, and he emphasised you should justify your alpha for each experiment. decades of tradition mean the default alpha is set to .05, but there are different approaches you can take to argue for a different value. you could start with the traditional alpha of .05 but adjust it for multiple comparisons. for example, if you were planning on performing four related tests and wanted to correct for multiple comparisons, you could use the corrected alpha value in your power analysis. if you used the bonferroni-holm method (cramer et al., 2016), the most stringent alpha value would be .0125 instead of .05. using this lower alpha value would require more participants to achieve the same power, but you would ensure your most stringent test had your desired level of statistical power. alternatively, you could argue your study requires deviating from the traditional .05 alpha value. one approach is to switch between a .05 alpha for suggestive findings and a .005 alpha for confirmatory findings (benjamin et al., 2018). this means if you have a strong prediction, or one that has serious theoretical implications, you could argue your study requires a more stringent alpha value. physicists take this approach of using a more stringent value even further, and use an alpha value of .0000003 (known as 'five sigma'). the reason for having such a stringent alpha level is that changes to our understanding of physics would have knock-on effects for all other sciences, so avoiding false positives is of the utmost importance. another approach is to justify a bespoke alpha for each study (lakens et al., 2018).
the argument here is that you should perform a cost-benefit analysis to determine your alpha based on how difficult it is to recruit your sample. see maier and lakens (2021) for a primer on justifying your alpha. beta beta also has a traditional value: most studies aim for a beta of .20, meaning they want 80% power. cohen (1965) suggested the use of 80% as he felt that type ii errors are relatively less serious than type i errors. at the time of writing, 80% power would have led to roughly double the sample sizes of the studies he critiqued in his review (cohen, 1962). aiming for 80% power has proved influential, as bakker et al. (2020) found it was the most common value researchers reported in their power analyses. aiming for 80% power was largely a pragmatic choice, so you may argue it is not high enough. cohen (1965) explicitly stated that you should ignore his suggestion of 80% if you have justification for another value. setting beta to .20 means you are willing to accept a type ii error 20% of the time. this implicitly treats type ii errors as four times less serious than type i errors when alpha is set to .05. to match the two error rates, you could aim for 95% power (beta = .05); bakker et al. (2020) found this was the next most common value, but it represented only 19% of the power analyses in their sample. earlier, we mentioned working with rare populations. many such populations (such as those with rare genetic conditions) may receive specialist care or support. if one were to assess the effectiveness of this specialist care or support, then not finding an effect that does exist (a type ii error) might lead to this support being taken away. in such circumstances, you could argue that type ii errors are just as important to avoid, if not more important, than type i errors. as such, you might want to increase your power (use a lower beta value) to avoid undue harm to a vulnerable population.
deciding on the beta value for your own study will involve a similar process to justifying your alpha. you must decide what your inferential goals are and whether 80% power is enough, knowing you could miss a real effect 20% of the time. however, increasing power to 90% or 95% will require more participants, so you must perform a cost-benefit analysis based on how easy it will be to recruit your sample (lakens, 2022). one- and two-tailed tests in tests comparing two values, such as the difference between two groups or the relationship between two variables, you can choose a one- or two-tailed test. lakens (2016a) argued one-tailed tests are underused and offer a more efficient procedure. as the rejection region is one 5% area (instead of two 2.5% areas), the critical value is smaller, so, holding everything else constant, you need fewer participants for a statistically significant result. one-tailed tests also offer a more severe test of a hypothesis, since the observed result must reach the rejection region in the hypothesised direction. the p-value in your test may be smaller than alpha, but if the result is in the opposite direction to what you predicted, you still cannot reject the null hypothesis. this means one-tailed tests can be an effective option when you have a strong directional prediction. one-tailed tests are not always appropriate though, so it is important you provide clear justification for why you are more interested in an effect in one direction and not the other (ruxton & neuhäuser, 2010). if you would be interested in an effect in either a positive or negative direction, then a two-tailed test is better suited. one-tailed tests have also been a source of suspicion, since they effectively halve the p-value. for example, wagenmakers et al. (2011) highlighted how some studies took advantage of an opportunistic use of one-tailed tests for their results to be statistically significant.
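the efficiency gain from a one-tailed test can be quantified with the same normal approximation used in power formulas (a sketch of our own, not jamovi's exact t-based calculation; d = 0.5, alpha = .05, and 80% power are example inputs we chose): the one-tailed critical value is smaller, and so is the required sample.

```python
# illustration with made-up inputs: one- vs two-tailed sample sizes
# under the normal approximation for a two-group comparison.
import math
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard normal quantile function

def n_per_group(d, alpha=0.05, power=0.80, tails=2):
    """participants per group needed to detect effect size d."""
    return math.ceil(2 * ((z(1 - alpha / tails) + z(power)) / d) ** 2)

print(round(z(1 - 0.05 / 2), 2))   # two-tailed critical z → 1.96
print(round(z(1 - 0.05), 2))       # one-tailed critical z → 1.64
print(n_per_group(0.5, tails=2))   # → 63 per group
print(n_per_group(0.5, tails=1))   # → 50 per group
```

the saving of roughly a fifth of the sample is exactly why opportunistic switching to one-tailed tests is treated with suspicion unless the direction was specified in advance.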
this means one-tailed tests are most convincing when combined with preregistration (see kathawalla et al. (2021) if preregistration is a procedure you are unfamiliar with) as you can demonstrate that you had a clear directional hypothesis and planned to test that hypothesis with a one-tailed test. effect size in contrast to alpha and beta, there is not one traditional value for your choice of effect size. many studies approach power analysis with a single effect size and value for power in mind, but as we will demonstrate in part three, power exists along a curve. in most cases, you do not know what the exact effect size is, or you would not need to study it. the effect size you use is essentially the inflection point for which effect sizes you want sufficient power to detect. if you want 80% power for an effect of cohen’s d = 0.40, you will be able to detect effects of 0.40 with 80% power. you will have increasingly higher levels of power for effects larger than 0.40, but increasingly lower levels of power for effects smaller than 0.40. this means it is important to think of your smallest effect size of interest, as you are implicitly saying you do not care about detecting effects smaller than this value. the most difficult part of an a priori power analysis is justifying your smallest effect size of interest. choosing an effect size to use in power analysis and interpreting effect sizes in your study requires subject matter expertise (panzarella et al., 2021). you must decide what effect sizes you consider important or meaningful based on your understanding of the measures and designs in your area. for example, is there a relevant theory that outlines expected effects; what have studies testing similar hypotheses found; what are the practical implications of the results? 
some of these decisions are difficult to make and not all of these strategies will always be available, but there are different sources of information you can consult for choosing and justifying your smallest effect size of interest (lakens, 2022). first, you could identify a meta-analysis relevant to your area of research which summarises the average effect across several studies. if you want to use the estimate to inform your power analysis, you must think about whether the studies in the meta-analysis are similar to your planned study. sometimes, meta-analyses will report an overall estimate amalgamating all the results, then report moderator analyses for the types of studies they include, so you could check whether there is an estimate restricted to methods similar to your study. we also know meta-analyses can report inflated effect sizes due to publication bias (lakens, 2022), so you could look for a more conservative estimate, such as the lower bound of the confidence interval around the average effect, or a bias-corrected average effect if the authors report one. second, there may be one key study you are modelling your project on. you could use their effect size to inform your power analysis, but as panzarella et al. (2021) warn, effect sizes are best interpreted in context, so question how similar your planned methods are. as an estimate from a single study, there will be uncertainty around the effect size, so think about the width of the confidence interval around it and use a conservative estimate. finally, you can consult effect size distributions. the most popular guidelines are from cohen (1988), who argued you should use d = 0.2 for new areas of research as the measures are likely to be imprecise, 0.5 for phenomena observable to the naked eye, and 0.8 for differences you hardly need statistics to detect. cohen (1988) explicitly warned these guidelines were for new areas of research, when there was nothing else to go on.
but, like many heuristics, the original suggestions have lost their nuance and are now taken as a ubiquitous 'rule of thumb'. it is important to consider what effect sizes mean for your subject area (baguley, 2009), but researchers seldom critically choose an effect size. an analysis of studies that did justify their effect size found that the majority simply cited cohen's suggested values (bakker et al., 2020). relying on these rules of thumb can lead to strange interpretations, such as paradoxes where even incredibly small effect sizes (by cohen's rule of thumb) can be meaningful. abelson (1985) found that an r2 of .003 was the effect size of the most significant characteristic predicting baseball success (batting average). in context, then, an r2 of .003 is clearly meaningful, so it is important to interpret effect sizes in context rather than apply broad generalisations. if you must rely on effect size distributions, there are articles which are sub-field specific. for example, gignac and szodorai (2016) collated effects in individual differences research and szucs and ioannidis (2021) outlined effects in cognitive neuroscience research. effect size distributions can be useful to calibrate your understanding of effect sizes in different areas, but they are not without fault. panzarella et al. (2021) demonstrated that in the studies that cited effect size distributions, most used them to directly interpret the effect sizes they observed in their study (e.g., "in this study we found a 'large' effect, which means..."). however, as seen in abelson's paradox, small effects in one context can be meaningful in another context. effect size distributions can help you understand the magnitude of effect sizes within and across subject areas, but comparing your observed effect size to an amalgamation of effects across all of psychology leads to a loss of nuance.
if you have no other information, effect size distributions can help to inform your smallest effect size of interest, but when it comes to interpretation it is important to put your effect size in context by comparing it to studies investigating a similar research question. with these strategies in mind, it is important to consider what represents the smallest effect size of interest for your specific study. it is the justification that is important, as there is no single right or wrong answer. power analysis is always a compromise between designing an informative study and designing a feasible study for the resources at your disposal. you could always set your effect size to d = 0.05, but the sample size required would often be unachievable, and effects this small may not be practically meaningful. therefore, you must explain and justify what represents the smallest effect size of interest for your area of research. sample size the final input you can justify is your sample size. you may be constrained by resources or the population you study, meaning you know the sample size and want to know what effect sizes you could detect in a sensitivity power analysis. lakens (2022) outlined that two strategies for sample size justification are measuring an entire population and working under resource constraints. if you study a specific population, such as participants with a rare genetic condition, you might know there are only 30 such participants in your country whom you could recruit, placing a limit on the sample size. alternatively, in many student projects, the time or money available to conduct research is limited, so the sample size may be influenced by resource constraints. you might have £500 for recruitment; if you pay each participant £10 for an hour of their time, you only have enough money for 50 participants. in both scenarios, you start off knowing what your sample size will be.
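both scenarios translate directly into a sensitivity power analysis. the sketch below is our own illustration (normal approximation, and it assumes each fixed sample is split into two equal independent groups, which is one design choice among many): it asks what effect size each fixed sample can reliably detect.

```python
# illustration: sensitivity analysis for the two fixed-n scenarios above
# (normal approximation; assumes two equal independent groups, alpha = .05,
# 80% power, two-tailed).
import math
from statistics import NormalDist

z = NormalDist().inv_cdf

def detectable_d(n_per_group, alpha=0.05, power=0.80, tails=2):
    """smallest standardised effect detectable with n per group."""
    return (z(1 - alpha / tails) + z(power)) * math.sqrt(2 / n_per_group)

# rare population: 30 participants split into two groups of 15
print(round(detectable_d(15), 2))  # → 1.02 (only very large effects)
# £500 budget at £10 per participant: 50 people, 25 per group
print(round(detectable_d(25), 2))  # → 0.79
```

seeing that only d of about 1.0 is detectable with 15 per group is exactly the kind of result that should prompt a rethink of the design, such as switching to a more sensitive within-subjects measure.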
this does not mean you can ignore statistical power, but the question changes from calculating the sample size necessary to detect a given effect size, to asking what effect size you can detect with a given sample size. this allows you to decide whether the study you plan on conducting is informative; if it would be uninformative, you have the opportunity to change the design or use more precise measures to produce larger effect sizes.

part three: power analysis using jamovi

for this tutorial, we will be using the open source software jamovi (2021). although it currently offers a limited selection of power analyses, it is perfect for an introduction for three reasons. first, it is free and accessible on a wide range of devices. historically, g*power (faul et al., 2009) was a popular choice, but it is no longer under active development, which presents accessibility issues. second, the output in jamovi contains written guidance on how to interpret the results and emphasises underrepresented concepts, such as power existing along a curve. finally, bakker et al. (2020) observed that authors often fail to provide enough information to reproduce their power analysis. in jamovi, you save your output and options in one file, meaning you can share this file to be fully reproducible. combined, these features make jamovi the perfect software for this tutorial. to download jamovi to your computer, navigate to the download page (https://www.jamovi.org/download.html) and install the solid version, which is the most stable. once you open jamovi, click modules (in the top right, figure 1a in red), click jamovi library (figure 1a in blue), and scroll down in the available tab until you see jpower, then click install (figure 1b in green). this is an additional module written by morey and selker which appears in your jamovi toolbar. in the following sections, imagine we are designing a study to build on irving et al. (2022), who tested an intervention to correct statistical misinformation.
participants read an article about a new fictional study where one passage falsely concludes watching tv causes cognitive decline. in the correction group, participants receive an extra passage where the fictional researcher explains they only reported a correlation, not a causal relationship. in the no-correction group, the extra passage just explains the fictional researcher was not available to comment. irving et al. then tested participants' comprehension of the story and coded their answers for mistaken causal inferences. they expected participants in the correction group to make fewer causal inferences than those in the no-correction group, and found evidence supporting this prediction with an effect size equivalent to cohen's d = 0.64, 95% ci = [0.28, 0.99]. inspired by their study, we want to design an experiment to correct another type of misinformation in articles.

figure 1. opening jamovi and managing your additional modules. click modules (a in red), then jamovi library (a in blue) to manage your modules, and scroll down to install jpower (b in green) to follow along with the tutorial.

irving et al. (2022) themselves provide an excellent example of explaining and justifying the rationale behind their power analysis, so we will walk through the decision making process and how it changes the outputs. for our smallest effect size of interest, our starting point is the estimate of d = 0.64. however, it is worth consulting other sources to calibrate our understanding of effects in the area, such as the meta-analysis by chan et al. (2017) that irving et al. cited. for debunking, the average effect across 30 studies was d = 1.14, 95% ci = [0.68, 1.61], so we could use the lower bound of the confidence interval, but this may still represent an overestimate. irving et al.
used the smallest effect (d = 0.54) from the studies included in the meta-analysis that were most similar to their design. as a value slightly smaller than the other estimates, we will also use this as the smallest effect size of interest for our study, so d = 0.54 appears throughout the following demonstrations. we start with a priori and sensitivity power analyses for two independent samples, exploring how the outputs change as we alter inputs like alpha, power, and the number of tails in the test. for each demonstration, we explain how you can transparently report the power analysis to your reader. we then repeat the demonstrations for two dependent samples to show how you require fewer participants when the same participants complete multiple conditions instead of being allocated to separate groups.

two independent samples

a priori power analysis in independent samples. if you open jamovi, you should have a new window with no data or output. for an independent samples t-test, make sure you are on the analyses tab, click on jpower, and select independent samples t-test. this will open the window shown in figure 2. we will start by calculating power a priori for an independent samples t-test. on the left side, you have your inputs and on the right side, you have the output from your choices. depending on the type of analysis you select under calculate, one of the inputs will be blanked out in grey. this means it is the parameter you want as the output on the right side and you will not be able to edit it. to break down the main menu options in this window:

• calculate: your choice of calculating one of (a) the minimum number of participants (n per group) needed for a given effect size, (b) what your power is given a specific effect size and sample size, or (c) the smallest effect size that you could reliably detect given a fixed sample size.

figure 2. default settings for an independent samples t-test using the jpower module in jamovi.

• minimally-interesting effect size: this is the standardised effect size known as cohen's d. here we can specify our smallest effect size of interest.

• minimum desired power: this is our long-run power. power is traditionally set at .80 (80%), but some researchers argue it should be higher at .90 (90%) or .95 (95%). the default setting in jpower is .90 (90%); see part two for justifying this value.

• n for group 1: this input is currently blanked out, as in this example we are calculating the minimum sample size, but here you would define how many participants are in the first group.

• relative size of group 2 to group 1: if this is set to 1, the sample size is calculated assuming equal group sizes. you can specify unequal group sizes by changing this input. for example, 1.5 would mean group 2 is 1.5 times larger than group 1, whereas 0.5 would mean group 2 is half the size of group 1.

• α (type i error rate): this is your long-run type one error rate, which is conventionally set at .05. see part two for strategies on justifying a different value.

• tails: is the test one- or two-tailed? you can specify whether you are looking for an effect in just one direction or whether you would be interested in any significant result.

for this example, our smallest effect size of interest will be d = 0.54, following our discussion of building on irving et al. (2022). we can enter the following inputs: effect size d = 0.54, alpha = .05, power = .90, relative size of group 2 to group 1 = 1, and two-tailed. you should get the output in figure 3. the first table "a priori power analysis" tells us that to detect our smallest effect size of interest, we would need two groups of 74 participants (n = 148) to achieve 90% power in a two-tailed test. in jamovi, it is clear how statistical power exists along a curve.
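jamovi computes these numbers from the noncentral t distribution, but you can sanity-check them with nothing but the python standard library. the sketch below is an illustrative approximation (normal quantiles plus a common small-sample correction), not jpower's exact routine, so it can occasionally differ from jamovi by a participant; here it reproduces the 74 per group, and varying alpha, power, and tails previews how the required n shifts:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.90, tails=2):
    """approximate per-group n for an independent samples t-test:
    normal approximation plus a small-sample correction term."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / tails)  # critical quantile for the test
    z_beta = z.inv_cdf(power)               # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2 + z_alpha ** 2 / 4)

print(n_per_group(0.54))              # 74 per group (n = 148), as in figure 3
print(n_per_group(0.54, tails=1))     # 60 per group for a one-tailed test
print(n_per_group(0.54, alpha=0.01))  # 104 per group at a stricter alpha
print(n_per_group(0.54, power=0.80))  # 55 per group at the traditional 80%
```

the last three lines anticipate the input changes discussed next; the approximation happens to match the jpower answers for all four scenarios here, but it is a sketch, not a substitute for the exact calculation.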
the second table in figure 3, "power by effect size", shows what range of effect sizes we would likely detect with 74 participants per group. we would have 80-95% power to detect effect sizes between d = 0.46-0.60. however, we would only have 50-80% power to detect effects between d = 0.32-0.46. this shows our smallest effect size of interest could be detected with 90% power, but smaller effects have lower power and larger effects would have higher power. this is also reflected in the power contour plot, which is reported by default (figure 4). if you cannot see the plot, make sure the "power contour plot" option is ticked under plots and scroll down in the results window, as it is included at the bottom.

figure 3. a priori power analysis results for a two-tailed independent samples t-test using d = 0.54 as the smallest effect size of interest. we would need 74 participants per group (n = 148) for 90% power.

the level of power you choose is the black line that curves from the top left to the bottom right. for our effect size, we travel along the horizontal black line until we reach the curve, and the down arrow tells us we need 74 participants per group. for larger effects, as you travel up the curve, we would need fewer participants, and for smaller effects down the curve we would need more participants. now that we have explored how many participants we would need to detect our smallest effect size of interest, we can alter the inputs to see how the number of participants changes. wherever possible, it is important to perform a power analysis before you start collecting data, as you can explore how changing the inputs impacts your sample size.

• tail(s): if you change the number of tails to one, this decreases the number of participants in each group from 74 to 60. this saves a total of 28 participants (14 in each group).
if your experiment takes 30 minutes per participant, that saves you 14 hours' worth of work or cost while still providing your experiment with sufficient power.

• α: if you change α to .01, we would need 104 participants in each group (for a two-tailed test), 60 more participants than our first estimate and 30 more hours of data collection.

• minimum desired power: if we decreased power to the traditional 80%, we would need 55 participants per group (for a two-tailed test; alpha = .05). this would be 38 fewer participants than our first estimate, saving 19 hours of data collection.

figure 4. a power contour to show how, as the effect size decreases (smaller values on the y-axis), the number of participants required to detect the effect increases (higher values on the x-axis). our desired level of 90% power is indicated by the black curved line.

it is important to balance creating an informative experiment with the resources available. therefore, it is crucial that, where possible, you perform a power analysis in the planning phase of a study, as you can make these kinds of decisions before you recruit any participants. you can make fewer type one (decreasing alpha) or type two (increasing power) errors, but you must recruit more participants. in the original power analysis by irving et al. (2022), they used inputs of d = 0.54, alpha = .05, power = .95, and one-tailed for a directional prediction, and they aimed for two groups of 75 (n = 150) participants. in these demonstrations, we are walking through changing the inputs to see how they affect the output, but you can look at their article for a good example of justifying and reporting a power analysis. how can this be reported? bakker et al. (2020) warned that only 20% of power analyses contained enough information to be fully reproducible.
to report your power analysis, the reader needs the following four key pieces of information:

• the type of test being conducted,
• the software used to calculate power,
• the inputs that you used, and
• why you chose those inputs.

for the original example in figure 3, we could report it like this: "to detect an effect size of cohen's d = 0.54 with 90% power (alpha = .05, two-tailed), the jpower module in jamovi suggests we would need 74 participants per group (n = 148) for an independent samples t-test. similar to irving et al. (2022), the smallest effect size of interest was set to d = 0.54, but we used a two-tailed test as we were less certain about the direction of the effect." this provides the reader with all the information they would need to reproduce the power analysis and ensure you have calculated it accurately. the statement also includes your justification for the smallest effect size of interest. please note there is no single 'correct' way to report a power analysis; just be sure that you have included the four key pieces of information.

sensitivity power analysis in independent samples. selecting the smallest effect size of interest for an a priori power analysis is an effective strategy when you want to calculate how many participants you need while designing your study. now imagine you already knew the sample size or had access to a population of a known size. in this scenario, you would conduct a sensitivity power analysis. this tells you what effect sizes your study is powered to detect for a given alpha, power, and sample size. this is helpful for interpreting your results, as you can outline what effect sizes your study was sensitive to and which effects would be too small for you to reliably detect. if you change the "calculate" input to effect size, "minimally-interesting effect size" will now be greyed out.
imagine we had finished collecting data and we knew we had 40 participants in each group but did not conduct a power analysis when designing the study. if we enter 40 for n for group 1, 1 for relative size of group 2 to group 1, alpha = .05, power = .90, and two-tailed, we get the output in figure 5. the first table in figure 5, "a priori power analysis", tells us that the study is sensitive enough to detect effect sizes of d = 0.73 with 90% power (note the table is still labelled a priori despite it being a sensitivity power analysis; this is a quirk of the software, and it is indeed running a sensitivity analysis). this helps us to interpret the results if we did not plan with power in mind or we had a rare sample. the second table in figure 5, "power by effect size", shows we would have 80-95% power to detect effect sizes between d = 0.63-0.82, but 50-80% power to detect effect sizes between d = 0.44-0.63. as the effect size gets smaller, there is less chance of detecting it with 40 participants per group, but we would have greater than 90% power to detect effect sizes larger than d = 0.73.

figure 5. sensitivity power analysis results for an independent samples t-test when there is a fixed sample size of 40 per group (n = 80). we would be able to detect an effect size of d = 0.73 with 90% power.

to acknowledge how power exists along a curve, we also get a second type of graph. we now have a power curve (figure 6), with the x-axis showing the potential effect size and the y-axis showing what the power would be for that potential effect size. if this plot is not visible in the output, make sure you have "power curve by effect size" ticked in the plots options. this tells us how power changes as the effect size increases or decreases, with our other inputs held constant. at 90% power, we can detect effect sizes of d = 0.73 or larger. if we follow the black curve towards the bottom left, power decreases for smaller effect sizes.
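a sensitivity analysis simply solves the same power relationship for the effect size instead of the sample size. a hedged stdlib sketch (a normal approximation, so it lands a shade under jpower's t-based d = 0.73, and slightly overstates power for a given effect):

```python
from statistics import NormalDist

def detectable_d(n_per_group, alpha=0.05, power=0.90, tails=2):
    """approximate smallest effect size detectable at the desired power
    with a fixed per-group n (normal approximation)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / tails) + z.inv_cdf(power)) * (2 / n_per_group) ** 0.5

def power_at(d, n_per_group, alpha=0.05, tails=2):
    """approximate power for a given effect size at a fixed per-group n."""
    z = NormalDist()
    noncentrality = d * (n_per_group / 2) ** 0.5  # approximate test statistic mean
    return z.cdf(noncentrality - z.inv_cdf(1 - alpha / tails))

print(round(detectable_d(40), 2))    # ~0.72, close to jpower's 0.73
print(round(power_at(0.63, 40), 2))  # ~0.80, echoing the power-by-effect-size table
```

the gap between 0.72 and 0.73 comes from ignoring the heavier tails of the t distribution; `power_at` also traces out the power curve if you evaluate it over a range of d values.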
this shows that once we have a fixed sample size, power exists along a curve for different effect sizes. when interpreting your results, it is important you have sufficient statistical power to detect the effects you do not want to miss. if the sensitivity power analysis suggests you would miss effects you consider meaningful, you need to recalibrate your expectations of how informative your study is. how can this be reported? we can also state the results of a sensitivity power analysis in a report. if you did not perform an a priori power analysis, you could report this in the method section to comment on your final sample size. if you are focusing on interpreting how informative your results are, you could explore it in the discussion.

figure 6. a power curve to show how, as the effect size decreases (smaller values on the x-axis), we would have less statistical power (lower values on the y-axis) for our fixed sample size. our desired level of 90% power is indicated by the intersection of the black horizontal line and the black curved line.

much like an a priori power analysis, there are key details that must be included to ensure it is reproducible and informative. for the example in figure 5, you could report: "the jpower module in jamovi suggests an independent samples t-test with 40 participants per group (n = 80) would be sensitive to effects of cohen's d = 0.73 with 90% power (alpha = .05, two-tailed). this means the study would not be able to reliably detect effects smaller than cohen's d = 0.73". as with an a priori power analysis, there are multiple ways you can describe the sensitivity power analysis, with the example above demonstrating one of them. the main goal is to communicate the four key pieces of information to ensure your reader could reproduce the sensitivity power analysis and confirm you calculated it accurately.

two dependent samples

a priori power analysis in dependent samples.
now we will demonstrate how you can conduct a power analysis for a within-subjects design. this time, you need to select paired samples t-test from the jpower menu to get a window like figure 7. the inputs are almost identical to those we used for the independent samples t-test, but this time we only have four inputs, as we do not need to worry about the ratio of group 2 to group 1. in a paired samples t-test, every participant must contribute a value for each condition. if we repeat the inputs from our independent samples t-test a priori power analysis (d = 0.54, alpha = .05, power = .90, two-tailed), your window should look like figure 8. the table "a priori power analysis" suggests we would need 39 participants to achieve 90% power to detect our smallest effect size of interest (d = 0.54) inspired by irving et al. (2022). we would need 109 fewer participants than our first estimate, saving 54.5 hours of data collection, assuming your experiment takes 30 minutes. we also have the second table, "power by effect size", to show how power changes for different effect size ranges. before we move on to reporting the power analysis, we will note the important lesson that using a within-subjects design will always save you participants. the reason is that instead of every participant contributing just one value (which may contain measurement error because of extraneous variables), they contribute two values (one to each condition). the error caused by many extraneous variables (such as age, eyesight, strength, or any other participant-specific variable that might cause error) is the same in both conditions for the same person. the less error there is in our measurements, the more confident we can be that the results we see are due to our manipulation. as within-participants designs lower the amount of error compared to between-participants designs, they need fewer participants to achieve the same amount of power.
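the paired-design saving can be sanity-checked with the one-sample formula applied to the difference scores. this is a hedged stdlib approximation, not jpower's routine: because it uses normal quantiles it returns 38, one participant below jpower's exact noncentral t answer of 39, but it makes the scale of the saving obvious:

```python
from math import ceil
from statistics import NormalDist

def n_paired(d, alpha=0.05, power=0.90, tails=2):
    """approximate total n for a paired samples t-test (normal
    approximation with a small-sample correction term; the exact
    noncentral t calculation can require one more participant)."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / tails)
    z_b = z.inv_cdf(power)
    return ceil(((z_a + z_b) / d) ** 2 + z_a ** 2 / 2)

print(n_paired(0.54))  # 38 total, within one participant of jpower's 39
```

either way, roughly 40 participants in total versus 148 for the between-subjects version of the same question.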
the amount of error accounted for in a within-participants design means you need approximately half the number of participants to detect the same effect size as in a between-subjects design (lakens, 2016b). when you are designing a study, think about whether you could convert the design to within-subjects to make it more efficient. while it helps save on participants, it is not always possible, or practical, to use a within-subjects design. for example, in the experiment we are designing here, participants are shown two versions of a news story with a subtle manipulation. a between-subjects design might be a better choice, as participants are randomised into one of two groups and do not see the alternative manipulation. this means participants would find it more difficult to work out the aims of the study and change their behaviour. in a within-subjects design, you would need at least two versions of the news story to create one 'correction' condition and one 'no correction' condition. this means participants would experience both conditions and could work out the aims of the study and potentially change their behaviour. in addition, you would need to ensure the two versions of the news story were different enough that participants did not simply provide the same answer, but comparable enough that you are not introducing a confound. this is another example of where thinking about statistical power in the design stage of research is most useful: you can decide whether a within- or between-subjects design is best suited to your procedure.

figure 7. default settings for a paired samples t-test using the jpower module in jamovi.

how can this be reported? for the example in figure 8, you could report: "to detect an effect size of cohen's d = 0.54 with 90% power (alpha = .05, two-tailed), the jpower module in jamovi suggests we would need 39 participants for a paired samples t-test. similar to irving et al.
(2022), the smallest effect size of interest was set to d = 0.54, but we used a two-tailed test as we were less certain about the direction of the effect.".

sensitivity power analysis in dependent samples. if you change "calculate" to effect size, we can see what effect sizes a within-subjects design is sensitive enough to detect. imagine we sampled 30 participants without performing an a priori power analysis. setting the inputs to power = .90, n = 30, alpha = .05, and two-tailed, you should get the output in figure 9. the "a priori power analysis" table shows us that the design would be sensitive enough to detect an effect size of d = 0.61 with 90% power using 30 participants. this helps us to interpret the results if we did not plan with power in mind or had a limited sample. the second table in figure 9, "power by effect size", shows we would have 80-95% power to detect effect sizes between d = 0.53-0.68, but 50-80% power to detect effect sizes between d = 0.37-0.53. as the effect size gets smaller, there is less chance of detecting it with 30 participants, but we would have greater than 90% power to detect effect sizes larger than d = 0.61. how can this be reported? for the example in figure 9, you could report: "the jpower module in jamovi suggests a paired samples t-test with 30 participants would be sensitive to effects of cohen's d = 0.61 with 90% power (alpha = .05, two-tailed). this means the study would not be able to reliably detect effects smaller than cohen's d = 0.61".

conclusion

in this tutorial, we demonstrated how to perform a power analysis for both independent and paired samples t-tests using the jpower module in jamovi. we outlined two of the most useful types of power analysis: (1) a priori, for when you want to know how many participants you need to detect a given effect size, and (2) sensitivity, for when you want to know what effect sizes you can detect with a given sample size.
we also emphasised the key information you must report to ensure power analyses are reproducible. our aim was to provide a beginner's tutorial on the fundamental concepts of power analysis, so you can build on these lessons and apply them to more complicated designs.

figure 8. a priori power analysis results for a paired samples t-test using d = 0.54 as the smallest effect size of interest. we would need 39 participants for 90% power.

figure 9. sensitivity power analysis results for a paired samples t-test when there is a fixed sample size of 30 participants. we would be able to detect an effect size of d = 0.61 with 90% power.

there are three key lessons to take away from this tutorial. first, you can plan to make fewer type one (decreasing alpha) or type two (increasing power) errors, but it will cost more participants, assuming you want to detect the same effect size. second, using a one-tailed test offers a more severe test of a hypothesis and requires fewer participants to achieve the same level of power. finally, using a within-subjects design requires fewer participants than a between-subjects design. power analysis is a reflective process, and it is important to keep these three lessons in mind when designing your study. designing an informative study is a balance between your inferential goals and the resources available to you (lakens, 2022). that is why we framed changes in the inputs around how many hours of data collection your study would take, assuming it lasted 30 minutes per participant. you will rarely have unlimited resources as a researcher, whether from the funding body supporting your research or from the number of participants in your population of interest. planning your study with statistical power in mind provides you with the most flexibility, as you can make decisions, like considering a one-tailed test or using a within-subjects design, before you preregister and conduct your study.
the number of participants required for a sufficiently powered experiment might have surprised you. depending on the inputs and design, we needed between 39 and 208 participants to detect the same smallest effect size of interest (d = 0.54) to build on irving et al. (2022). for resource-limited studies like student dissertations, or participant-limited studies on rare populations, recruiting so many participants may be unachievable. as a result, changing to a within-participants design, or changing the other inputs, might be needed where possible. in circumstances where within-participants designs are not possible, and changing inputs (e.g., alpha) does not work, you can still conduct the study, provided you adjust your expectations and make the results available. while your sample size may be too small in isolation to detect your smallest effect size of interest, your results can then be collated into meta-analyses, providing they are available to other researchers. alternative solutions include studying larger effect sizes and/or focusing on 'team science'. cohen (1973) quipped that instead of chasing smaller effects, psychology should emulate older sciences by creating larger effects through stronger manipulations or more precise measurements. alternatively, if you cannot conduct an informative study individually, you could pool resources and engage in team science. for example, student dissertations can benefit from projects where multiple students work together, with each student contributing one component and collecting data for a larger network/project (creaven et al., 2021; wagge et al., 2019), or labs across the world can pool their resources (moshontz et al., 2018). in the past, students have been encouraged to work on a project by themselves to gain experience of conducting a science experiment independently.
however, encouraging students to work in groups may now be just as useful, as such group work reflects the paradigm shift towards 'team science' seen in the wider research community in recent years (wuchty et al., 2007). to conclude our tutorial, we present a list of resources you can refer to for additional applications of power analysis. we limited our tutorial to two independent samples and two paired samples for maximum accessibility, so it is important to outline resources for additional designs.

g*power

although g*power (faul et al., 2009) is no longer in active development, it supports power analysis for a range of statistical tests, such as correlation, non-parametric tests, and anova. there is a longer companion guide to this manuscript that walks through power analysis for correlation and anova (bartlett, 2021).

superpower

g*power can calculate power for anova models, but it does not accurately scale to factorial anova and pairwise comparisons. to target these limitations, lakens and caldwell (2021) developed an r package and shiny app called superpower. superpower provides more flexibility and scales up to factorial anova, as you enter means and standard deviations per cell for your smallest effect sizes of interest. for guidance, see our companion guide for the shiny app (bartlett, 2021) and the authors' ebook for the r package (caldwell et al., 2021).

pwr r package

if you use r, the pwr package (champely et al., 2020) supports many of the same tests as g*power, such as t-tests, correlation, and regression. the arguments are also similar to g*power's inputs, such as the numerator and denominator degrees of freedom, effect size as f², alpha, and power; you omit whichever one you want as the output.

simulation

packages like pwr are user-friendly, as they only require you to define inputs to calculate power analytically, but one of the benefits of a program like r is the flexibility to simulate your own bespoke power analysis.
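as an illustration of that bespoke simulation idea, here is a stand-alone sketch in plain python (rather than r, so it runs with no extra packages): generate two groups with a known standardised difference, test them, repeat many times, and take the proportion of significant results. counting |t| beyond the critical value is equivalent to counting p-values below alpha; the hard-coded 1.976 is the two-tailed .05 cut-off for t with df = 146, matching two groups of 74 from our running example:

```python
import random
import statistics

def simulate_power(n_per_group, d, n_sims=4000, t_crit=1.976, seed=1):
    """estimate power by simulation for an independent samples t-test
    with a true standardised mean difference of d."""
    rng = random.Random(seed)  # seeded for reproducibility
    hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
        # pooled-variance t statistic for equal group sizes
        sp2 = (statistics.variance(a) + statistics.variance(b)) / 2
        t = (statistics.mean(b) - statistics.mean(a)) / (2 * sp2 / n_per_group) ** 0.5
        hits += abs(t) > t_crit
    return hits / n_sims

print(simulate_power(74, 0.54))  # ~0.90, recovering the a priori result
```

the estimate wobbles with monte carlo error, shrinking as n_sims grows; the r packages cited below apply the same logic to far more complex designs.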
the starting point is simulating a dataset with known attributes, like the mean and standard deviation of each group or the correlation between variables, and applying your statistical test. you then repeat this simulation process many times and store the p-values from each iteration. as probability in frequentist statistics relates to long-run frequencies, you calculate what percentage of those p-values were lower than your alpha, providing your statistical power. see quandt (2020) and sleegers (2021) for demonstrations of simulation applied to power analysis in r, and the summer school workshop series organised by psypag (https://simsummerschool.github.io/). simulation approaches also scale to more advanced techniques, such as accounting for the number of trials in a task instead of solely the number of participants (baker et al., 2021), or mixed-effects models, which are growing in popularity in psychology. power analysis procedures for mixed-effects models rely on simulation, so see brysbaert and stevens (2018), debruine and barr (2021), and kumle et al. (2021) for guidance.

author contact

orcid jeb: 0000-0002-4191-5245; email jeb: james.bartlett@glasgow.ac.uk; orcid sjc: 0000-0002-3559-1141; email sjc: sarah.charles@kcl.ac.uk.

conflict of interest and funding

we have no conflicts of interest to disclose. writing this article was not supported by any funding sources.

author contributions

conceptualization (jeb, sjc); writing original draft (jeb, sjc); writing review & editing (jeb, sjc); visualization (jeb, sjc). jeb placed first due to original conceptualization, but authorship is fully shared between jeb and sjc.

open science practices

this article is theoretical and as such provides no new data or materials, and was not pre-registered. the entire editorial process, including the open reviews, is published in the online supplement.

references

abelson, r. p. (1985). a variance explanation paradox: when a little is a lot. psychological bulletin, 97(1), 129–133.
https://doi.org/10.1037/0033-2909.97.1.129

baguley, t. (2009). standardized or simple effect size: what should be reported? british journal of psychology, 100(3), 603–617. https://doi.org/10.1348/000712608x377117

baker, d. h., vilidaite, g., lygo, f. a., smith, a. k., flack, t. r., gouws, a. d., & andrews, t. j. (2021). power contours: optimising sample size and precision in experimental psychology and human neuroscience. psychological methods, 26(3), 295–314. https://doi.org/10.1037/met0000337

bakker, m., hartgerink, c. h. j., wicherts, j. m., & van der maas, h. l. j. (2016). researchers' intuitions about power in psychological research. psychological science, 27(8), 1069–1077. https://doi.org/10.1177/0956797616647519

bakker, m., veldkamp, c. l. s., akker, o. r. v. d., assen, m. a. l. m. v., crompvoets, e., ong, h. h., & wicherts, j. m. (2020). recommendations in pre-registrations and internal review board proposals promote formal power analyses but do not increase sample size. plos one, 15(7), e0236079. https://doi.org/10.1371/journal.pone.0236079

bartlett, j. e. (2021). introduction to power analysis: a guide to g*power, jamovi, and superpower. https://osf.io/zqphw/

benjamin, d. j., berger, j. o., johannesson, m., nosek, b. a., wagenmakers, e.-j., berk, r., bollen, k. a., brembs, b., brown, l., camerer, c., cesarini, d., chambers, c. d., clyde, m., cook, t. d., de boeck, p., dienes, z., dreber, a., easwaran, k., efferson, c., . . . johnson, v. e. (2018). redefine statistical significance. nature human behaviour, 2(1), 6–10. https://doi.org/10.1038/s41562-017-0189-z

beribisky, n., davidson, h., & cribbie, r. a. (2019). exploring perceptions of meaningfulness in visual representations of bivariate relationships. peerj, 7, e6853. https://doi.org/10.7717/peerj.6853

brysbaert, m. (2019).
how many participants do we have to include in properly powered experiments? a tutorial of power analysis with reference tables. journal of cognition, 2(1), 16. https://doi.org/10.5334/joc.72

brysbaert, m., & stevens, m. (2018). power analysis and effect size in mixed effects models: a tutorial. journal of cognition, 1(1), 9. https://doi.org/10.5334/joc.10

bürkner, p.-c., & vuorre, m. (2019). ordinal regression models in psychology: a tutorial. advances in methods and practices in psychological science, 2(1), 77–101. https://doi.org/10.1177/2515245918823199

button, k. s., ioannidis, j. p. a., mokrysz, c., nosek, b. a., flint, j., robinson, e. s. j., & munafò, m. r. (2013). power failure: why small sample size undermines the reliability of neuroscience. nature reviews neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475

caldwell, a. r., lakens, d., & parlett-pelleriti, c. m. (2021). power analysis with superpower. retrieved november 23, 2021, from https://aaroncaldwell.us/superpowerbook/

champely, s., ekstrom, c., dalgaard, p., gill, j., weibelzahl, s., anandkumar, a., ford, c., volcic, r., & rosario, h. d. (2020). pwr: basic functions for power analysis. retrieved november 23, 2021, from https://cran.r-project.org/package=pwr

chan, m.-p. s., jones, c. r., hall jamieson, k., & albarracín, d. (2017). debunking: a meta-analysis of the psychological efficacy of messages countering misinformation. psychological science, 28(11), 1531–1546. https://doi.org/10.1177/0956797617714579

chen, l.-t., & liu, l. (2019). content analysis of statistical power in educational technology research: sample size matters. international journal of technology in teaching and learning, 15(1), 49–75. retrieved july 8, 2021, from https://eric.ed.gov/?id=ej1276088

cohen, j. (1962). the statistical power of abnormal-social psychological research: a review. the journal of abnormal and social psychology, 65(3), 145–153. https://doi.org/10.1037/h0045186

cohen, j. (1965). some statistical issues in psychological research. in b. b. wolman (ed.), handbook of clinical psychology. mcgraw-hill.

cohen, j. (1973). statistical power analysis and research results. american educational research journal, 10(3), 225–229. https://doi.org/10.2307/1161884

cohen, j. (1988). statistical power analysis for the behavioral sciences (2nd ed.). lawrence erlbaum associates.

cohen, j. (1994). the earth is round (p < .05). american psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066x.49.12.997

collins, e., & watt, r. (2021). using and understanding power in psychological research: a survey study. collabra: psychology, 7(1), 28250. https://doi.org/10.1525/collabra.28250
cramer, a. o. j., van ravenzwaaij, d., matzke, d., steingroever, h., wetzels, r., grasman, r. p. p. p., waldorp, l. j., & wagenmakers, e.-j. (2016). hidden multiplicity in exploratory multiway anova: prevalence and remedies. psychonomic bulletin & review, 23(2), 640–647. https://doi.org/10.3758/s13423-015-0913-5

creaven, a.-m., button, k., woods, h., & nordmann, e. (2021). maximising the educational and research value of the undergraduate dissertation in psychology. retrieved april 4, 2022, from https://psyarxiv.com/deh93/

debruine, l. m., & barr, d. j. (2021). understanding mixed-effects models through data simulation. advances in methods and practices in psychological science, 4(1), 2515245920965119. https://doi.org/10.1177/2515245920965119

dwan, k., altman, d. g., arnaiz, j. a., bloom, j., chan, a.-w., cronin, e., decullier, e., easterbrook, p. j., elm, e. v., gamble, c., ghersi, d., ioannidis, j. p. a., simes, j., & williamson, p. r. (2008). systematic review of the empirical evidence of study publication bias and outcome reporting bias. plos one, 3(8), e3081. https://doi.org/10.1371/journal.pone.0003081

etz, a., & vandekerckhove, j. (2016). a bayesian perspective on the reproducibility project: psychology (d. marinazzo, ed.). plos one, 11(2), 1–12. https://doi.org/10.1371/journal.pone.0149794

faul, f., erdfelder, e., buchner, a., & lang, a.-g. (2009). statistical power analyses using g*power 3.1: tests for correlation and regression analyses. behavior research methods, 41(4), 1149–1160. https://doi.org/10.3758/brm.41.4.1149

fisher, r. a. (1926). the arrangement of field experiments. journal of the ministry of agriculture, 33, 503–515.

franco, a., malhotra, n., & simonovits, g. (2014). publication bias in the social sciences: unlocking the file drawer. science. https://doi.org/10.1126/science.1255484

gignac, g. e., & szodorai, e. t. (2016). effect size guidelines for individual differences researchers.
personality and individual differences, 102, 74–78. https://doi.org/10.1016/j.paid.2016.06.069

goodman, s. (2008). a dirty dozen: twelve p-value misconceptions. seminars in hematology, 45(3), 135–140. https://doi.org/10.1053/j.seminhematol.2008.04.003

guo, q., thabane, l., hall, g., mckinnon, m., goeree, r., & pullenayegum, e. (2014). a systematic review of the reporting of sample size calculations and corresponding data components in observational functional magnetic resonance imaging studies. neuroimage, 86, 172–181. https://doi.org/10.1016/j.neuroimage.2013.08.012

irving, d., clark, r. w. a., lewandowsky, s., & allen, p. j. (2022). correcting statistical misinformation about scientific findings in the media: causation versus correlation. journal of experimental psychology: applied. https://doi.org/10.1037/xap0000408

kathawalla, u.-k., silverstein, p., & syed, m. (2021). easing into open science: a guide for graduate students and their advisors. collabra: psychology, 7(18684). https://doi.org/10.1525/collabra.18684

kelley, k., & preacher, k. j. (2012). on effect size. psychological methods, 17(2), 137–152. https://doi.org/10.1037/a0028086

kruschke, j. k., & liddell, t. m. (2018). the bayesian new statistics: hypothesis testing, estimation, meta-analysis, and power analysis from a bayesian perspective. psychonomic bulletin & review, 25(1), 178–206. https://doi.org/10.3758/s13423-016-1221-4

kumle, l., võ, m. l.-h., & draschkow, d. (2021). estimating power in (generalized) linear mixed models: an open introduction and tutorial in r. behavior research methods, 53(6), 2528–2543. https://doi.org/10.3758/s13428-021-01546-0

lakens, d. (2013). calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and anovas. frontiers in psychology, 4. https://doi.org/10.3389/fpsyg.2013.00863

lakens, d. (2021). the practical alternative to the p value is the correctly used p value.
perspectives on psychological science, 16(3), 639–648. https://doi.org/10.1177/1745691620958012

lakens, d. (2022). sample size justification. collabra: psychology, 8(1), 33267. https://doi.org/10.1525/collabra.33267

lakens, d. (2016a). one-sided tests: efficient and underused. retrieved march 29, 2022, from http://daniellakens.blogspot.com/2016/03/one-sided-tests-efficient-and-underused.html

lakens, d. (2016b). why within-subject designs require fewer participants than between-subject designs. retrieved november 21, 2021, from http://daniellakens.blogspot.com/2016/11/why-within-subject-designs-require-less.html

lakens, d., adolfi, f. g., albers, c. j., anvari, f., apps, m. a. j., argamon, s. e., baguley, t., becker, r. b., benning, s. d., bradford, d. e., buchanan, e. m., caldwell, a. r., van calster, b., carlsson, r., chen, s.-c., chung, b., colling, l. j., collins, g. s., crook, z., . . . zwaan, r. a. (2018). justify your alpha. nature human behaviour, 2(3), 168–171. https://doi.org/10.1038/s41562-018-0311-x

lakens, d., & caldwell, a. r. (2021). simulation-based power analysis for factorial analysis of variance designs. advances in methods and practices in psychological science, 4(1), 2515245920951503. https://doi.org/10.1177/2515245920951503

larson, m. j., & carbine, k. a. (2017). sample size calculations in human electrophysiology (eeg and erp) studies: a systematic review and recommendations for increased rigor. international journal of psychophysiology, 111, 33–41. https://doi.org/10.1016/j.ijpsycho.2016.06.015

maier, m., & lakens, d. (2021). justify your alpha: a primer on two practical approaches. retrieved june 24, 2021, from https://psyarxiv.com/ts4r6/

moshontz, h., campbell, l., ebersole, c. r., ijzerman, h., urry, h. l., forscher, p. s., grahe, j. e., mccarthy, r. j., musser, e. d., antfolk, j., castille, c. m., evans, t. r., fiedler, s., flake, j. k., forero, d. a., janssen, s. m. j., keene, j. r., protzko, j., aczel, b., . . . chartier, c. r. (2018). the psychological science accelerator: advancing psychology through a distributed collaborative network.
advances in methods and practices in psychological science, 1(4), 501–515. https://doi.org/10.1177/2515245918797607

neyman, j. (1977). frequentist probability and frequentist statistics. synthese, 36(1), 97–131. http://www.jstor.org/stable/20115217

panzarella, e., beribisky, n., & cribbie, r. a. (2021). denouncing the use of field-specific effect size distributions to inform magnitude. peerj, 9, e11383. https://doi.org/10.7717/peerj.11383

perugini, m., gallucci, m., & costantini, g. (2018). a practical primer to power analysis for simple experimental designs. international review of social psychology, 31(1). https://doi.org/10.5334/irsp.181

quandt, j. (2020). power analysis by data simulation in r part i. retrieved april 1, 2022, from https://julianquandt.com/post/power-analysis-by-data-simulation-in-r-part-i/

ruxton, g. d., & neuhäuser, m. (2010). when should we use one-tailed hypothesis testing? methods in ecology and evolution, 1(2), 114–117. https://doi.org/10.1111/j.2041-210x.2010.00014.x

sedlmeier, p., & gigerenzer, g. (1989). do studies of statistical power have an effect on the power of studies? psychological bulletin, 105(2), 309–316.

sestir, m. a., kennedy, l. a., peszka, j. j., & bartley, j. g. (2021). new statistics, old schools: an overview of current introductory undergraduate and graduate statistics pedagogy practices. teaching of psychology, 00986283211030616. https://doi.org/10.1177/00986283211030616

sleegers, w. (2021). simulation-based power analyses. https://willemsleegers.com/content/posts/9-simulation-based-power-analyses/simulation-based-power-analyses.html

szucs, d., & ioannidis, j. p. a. (2021). correction: empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. plos biology, 19(3), e3001151. https://doi.org/10.1371/journal.pbio.3001151

targ meta-research group. (2020). statistics education in undergraduate psychology: a survey of uk course content.
https://doi.org/10.31234/osf.io/jv8x3

the jamovi project. (2021). jamovi. https://www.jamovi.org

wagenmakers, e.-j., wetzels, r., borsboom, d., & van der maas, h. l. j. (2011). why psychologists must change the way they analyze their data: the case of psi: comment on bem (2011). journal of personality and social psychology, 100(3), 426–432. https://doi.org/10.1037/a0022790

wagge, j. r., brandt, m. j., lazarevic, l. b., legate, n., christopherson, c., wiggins, b., & grahe, j. e. (2019). publishing research with undergraduate students via replication work: the collaborative replications and education project. frontiers in psychology, 10. https://doi.org/10.3389/fpsyg.2019.00247

wuchty, s., jones, b. f., & uzzi, b. (2007). the increasing dominance of teams in production of knowledge. science, 316(5827), 1036–1039. https://doi.org/10.1126/science.1136099
meta-psychology, 2019, vol 3, mp.2018.1481, https://doi.org/10.15626/mp.2018.1481. article type: original article. published under the cc-by4.0 license. open data: yes. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: yes. edited by: henrik danielsson. reviewed by: stefan kölsch and carine signoret. analysis reproduced by: martina sladekova. all supplementary files can be accessed at the osf project page: https://doi.org/10.17605/osf.io/epd42

similar event-related potentials to music and language: a replication of patel, gibson, ratner, besson, & holcomb (1998)

joshua r. de leeuw, jan andrews, zariah altman, rebecca andrews, robert appleby, james l. bonanno, isabella destefano, eileen doyle-samay, ayela faruqui, christina m. griesmer, jackie hwang, kate lawson, rena a. lee, yunfei liang, john mernacaj, henry j. molina, hui xin ng, steven park, thomas possidente, anne shriver

vassar college

abstract

we report a replication of patel, gibson, ratner, besson, and holcomb (1998). the results of our replication are largely consistent with the conclusions of the original study. we found evidence of a p600 component of the event-related potential (erp) in response to syntactic violations in language and harmonic inconsistencies in music.
there were some minor differences in the spatial distribution of the p600 on the scalp between the replication and the original. the experiment was pre-registered at https://osf.io/g3b5j/. we conducted this experiment as part of an undergraduate cognitive science research methods class at vassar college; we discuss the practice of integrating replication work into research methods courses.

keywords: eeg, erp, p600, language, music, replication.

patel, gibson, ratner, besson, and holcomb (1998) found that violations of expected syntactic structure in language and violations of expected harmonic structure in music both elicit the p600 component of the event-related potential (erp). the p600 is a positive erp component that occurs approximately 600 ms after stimulus onset. while previous work had established a link between the p600 component and syntactic violations in language (osterhout & holcomb, 1992, 1993; osterhout, holcomb, & swinney, 1994), patel and colleagues were the first to report a direct comparison of the p600 for violations of musical and linguistic structure, finding that the amplitude and scalp distribution of the p600 were similar for linguistic and musical violations. this result has been influential in theorizing about the relationship between music and language, with more than 700 citations twenty years after publication (google scholar search, september 2018). it has been used as evidence for the “shared syntactic integration resource hypothesis,” a theory that posits that structural processing of music and language utilizes the same cognitive and neural resources (patel, 2003).
it has also been used to argue more broadly for the shared neurological basis of music and language (e.g., abrams et al., 2011; besson & schön, 2001; herdener et al., 2014; merrill et al., 2012; patel, 2010; sammler et al., 2010, 2013), and for the existence of shared cognitive resources/constraints for processing music and language (e.g., besson, chobert, & marie, 2011; chobert, françois, velay, & besson, 2014; christiansen & chater, 2008; lima & castro, 2011; moreno et al., 2009; thompson, schellenberg, & husain, 2004; tillmann, 2012).

acknowledgements: we are extremely grateful to debra ratchford and prashit parikh for their assistance supervising eeg data collection, and polyphony bruna for assisting with data organization and stimulus measurement.

though the work has been influential, we are not aware of any published direct replications of the main result. several studies have found erp correlates of structural violations in music (besson & faïta, 1995; besson, faïta, & requin, 1994; janata, 1995), though there is variation in the kinds of components that are found (featherstone, morrison, waterman, & macgregor, 2013; featherstone, waterman, & morrison, 2012). other studies have found that erp markers of violations of linguistic structure are systematically affected by the presence or absence of simultaneous structural violations in music (koelsch, gunter, wittfoth, & sammler, 2005; steinbeis & koelsch, 2008). these findings, along with many other behavioral and non-erp neural measures (see koelsch, 2011, for a review), support the general conclusion of patel et al. (1998) that there is overlap between the processing of structural violations in music and language.
while this converging evidence should bolster our belief in the results, there is no substitute for a direct replication, given the well-documented problem of publication bias in the literature (e.g., ingre & nilsonne, 2018; rosenthal, 1979).

this experiment was part of an undergraduate research methods course in cognitive science, which 2 of us co-taught, 17 of us were enrolled in, and 1 of us was serving as a course intern. a major focus of this course was exposure to and training in practices that have developed in response to the replication crisis, including an increased emphasis on direct replications (zwaan, etz, lucas, & donnellan, 2017), pre-registration of experiments (wagenmakers, wetzels, borsboom, van der maas, & kievit, 2012), and transparency through public sharing of materials, data, and analysis scripts (nosek et al., 2015). to gain hands-on experience with these practices, the class conducted this replication study. we chose to replicate patel et al. (1998) given its theoretical significance in the field, lack of prior direct replications, and practical considerations like the complexity of the data analysis and study design.

our replication is what lebel et al. (2017) would call a very close replication. while we were able to operationalize the independent and dependent variables in the same manner as patel et al. and were able to use either the exact same stimuli (music) or close replicas of the original stimuli (language), we did make some changes to their procedure. we removed two conditions (out of six) to shorten the overall length of the experiment, which was necessary to run the experiment in a classroom environment. we also focused the analysis on what we took to be the key findings of the original. we highlight these deviations from the original throughout the methods section below.
very close replications like this one are efforts to establish the “basic existence” of phenomena (lebel et al., 2017), which is an essential step for creating a set of robust empirical facts for theory development.

method

all stimuli, experiment scripts, data, and analysis scripts are available on the open science framework at https://osf.io/zpm9t/. the study pre-registration is available at https://osf.io/g3b5j/. all participants provided informed consent and this study was approved by the vassar college institutional review board.

overview

in both the original experiment and our replication, participants listened to short sentences and musical excerpts and made judgments about whether the sentence/music was acceptable or unacceptable. erps in response to particular words or musical events were measured with eeg. in the original experiment there were three critical kinds of sentences (grammatically simple, grammatically complex, and ungrammatical) and three critical kinds of musical excerpts (in key, nearby out of key, and distant out of key). the p600 is measured by comparing the amplitude of the erp in the grammatically simple condition to the other two language conditions, and the in-key condition to the other two music conditions (see results, below). due to logistical constraints of lab availability, time, and class schedule, we opted to restrict the replication to two kinds of sentences and two kinds of musical excerpts. we used only the grammatically simple and ungrammatical sentences for the language stimuli (plus their associated control stimuli; see stimuli below), and only the in-key and distant out-of-key musical excerpts. we believe that this choice is justifiable, as the theoretical claims of patel et al.
are most strongly based on the p600 that was found in the ungrammatical and distant out-of-key conditions, as these are the stronger contrasts (i.e., they are more “syntactically wrong”). the grammatical and in-key conditions serve as the baseline for these analyses, and so must also be included. the original also included unanalyzed filler sentences and musical excerpts to balance certain (possibly confounding) properties of the stimuli; because we did not include many of the original stimuli, these properties were more balanced in the critical stimuli, and we were able to drop all of the fillers in the music condition and 20 of the fillers in the language condition. altogether, the original experiment contained 150 sentences (3 x 30 plus 60 fillers) and 144 musical excerpts (3 x 36 plus 36 fillers), and our replication contained 100 sentences (2 x 30 plus 40 fillers) and 72 musical excerpts (2 x 36).

participants

44 vassar college students, ages 18-22 (m = 19.8 years, sd = 1.2 years), participated in the study. our pre-registered target was 40, which is slightly more than 2.5 times the original sample (n = 15). we aimed for at least 2.5 times the original sample based on the heuristic provided by simonsohn (2015). the goal of the heuristic is for replications to have sufficient power to detect effects that are smaller than the original but still plausibly detectable by the original study. while we ran more participants than the target of 40, 5 participants did not complete the experiment due to technical difficulties such as recording problems with the eeg equipment. thus, we ended up with 39 participants, one under our pre-registered target. we stopped data collection because we reached our pre-registered cutoff date of 2/24/18 prior to having 40 usable recordings. the cutoff date was necessary for the schedule of the class. participants in patel et al. (1998) were musically trained but specifically did not have perfect pitch.
their participants had an average of 11 years of musical experience, had studied music theory, and played a musical instrument for an average of 6.2 hours per week. all of our participants had at least 4 years of prior musical experience (m = 9.7 years, sd = 3.3 years), which we defined as participation in music lessons, enrollment in music coursework, or experience with any musical instrument (including voice). we also required that participants not have perfect pitch (by self-report). we did not require that participants had studied music theory. our participants played a musical instrument for an average of 5.8 hours per week (sd = 3.4 hours per week).

stimuli

patel graciously provided the music stimuli used in the original study. the language stimuli were no longer available in audio form, but we were provided with a list of the text of the original stimuli. we refer the reader to patel et al. (1998) for the full details of the stimuli; here we give a basic overview of the format to provide enough context for understanding the experiment, as well as our process for recording the audio stimuli.

the music stimuli were short sequences of chords synthesized using a generic piano midi instrument. they were about 6 seconds long. the chords initially established a harmonic key. the target chord — either the root chord of the established key (in-key condition) or the root chord of a distantly-related harmonic key (distant out-of-key condition) — occurred in the second half of the excerpt. an example in-key sequence can be heard at https://osf.io/z6vcu/. an example out-of-key sequence can be heard at https://osf.io/wde67/.
to simplify condition labeling in what follows, the in-key (harmonically congruous) musical stimuli will be called grammatical and the distant out-of-key (harmonically incongruous) musical stimuli will be called ungrammatical, even though we recognize that the application of those terms to music is not necessarily as straightforward as it is for language.

the language stimuli were spoken sentences with a target noun phrase that was either grammatical or ungrammatical given the prior context. there were two primary types of sentences (grammatical and ungrammatical) as well as two kinds of filler sentences, designed to prevent listeners from using cues other than the target noun phrase in context to judge the acceptability of the sentence. the grammatical but unacceptable fillers make it so that not all instances of “had” are acceptable. the grammatical fillers make it so that not all instances of verb + “the” are unacceptable. examples of each sentence type are below (the target noun phrase is italicized):

grammatical: some of the soldiers had discovered a new strategy for survival.

ungrammatical: some of the marines pursued the discovered a new strategy for survival.

grammatical, unacceptable (filler): some of the lieutenants had reimbursed a new strategy for survival.

grammatical (filler): some of the explorers pursued the idea of a new strategy for survival.

sentences ranged from 2.9 to 4.8 seconds long, spoken by one of the female experimenters at a rate of approximately six syllables per second using a blue snowball ice condenser microphone, sampled at 44.1 khz. the audio files were later amplified in audacity so that volume was similar across sentences and approximately comparable to that of the music stimuli.
for each file, the onset and duration of the target noun phrase was recorded (in milliseconds) to refer to in analysis when identifying the onset of erp components (see https://osf.io/tr7mq/ for the complete list). in addition to the music and language stimuli used in the original experiment, we created sample stimuli to provide a short pre-task tutorial for participants. these consisted of six new sentences and six new musical excerpts, designed to match the properties of the original stimuli. the music files were created in musescore (musescore development team, 2018).

procedure

participants completed the experiment in a quiet room seated at a computer screen and keyboard. audio files were played through a pair of speakers (the original study used headphones). the experiment was built using the jspsych library (de leeuw, 2015). communication between jspsych and the eeg recording equipment was managed through a chrome extension that enables javascript-based control over a parallel port (rivas, 2016).

each trial began with the audio file playing while a fixation cross was present on the screen. participants were asked to avoid blinking or moving their eyes while the fixation cross was present, to prevent eye movement artifacts in the eeg data. after the audio file concluded, participants saw a blank screen for 1450 ms. finally, a text prompt appeared on the screen asking participants if the sentence or musical excerpt was acceptable or unacceptable. participants pressed either the a (acceptable) or u (unacceptable) key in response. this procedure is nearly identical to the original, except for the use of a keyboard instead of a response box held in the participant’s lap.

the experiment started with a short set of practice trials: 6 language trials followed by 6 music trials. following the practice trials, the experimenter verified that the participant understood the instructions before the experiment proceeded.
the experiment consisted of 5 blocks: 3 language blocks containing 33, 33, and 34 trials, and 2 music blocks containing 36 trials each. the experiment always started with a language block and then alternated between language and music. grammatical and ungrammatical trials were randomly intermixed within each block. at the conclusion of a block, participants were given the opportunity to take a break; participants controlled the length of the break.

erp recording

we recorded eeg activity using a 128-channel sensor net (electrical geodesics inc.) at a sampling rate of 1000 samples/s referenced to cz. the data were amplified using a net amps 400 amplifier (electrical geodesics inc.). we focused on the 13 scalp locations that were used in patel et al. (1998). the locations and their corresponding electrode numbers on the egi-128 system were fz (11), cz (129), and pz (62) (midline sites), and f8 (122), atr (115), tr (108), wr (93), o2 (83), f7 (33), atl (39), tl (45), wl (42), and o1 (70) (lateral sites). vertical eye movements and blinks were monitored by means of two electrodes located above and one located below each eye; horizontal eye movements were monitored by means of one electrode located to the outer side of each eye. impedances for all of these electrodes were kept below 50 kω prior to data collection. netstation 5.4 waveform tools were used to process the eeg data offline, first by applying a high-pass filter at 0.1 hz and a low-pass filter at 30 hz. data were segmented into 1100 ms segments starting 100 ms prior to and ending 1000 ms after target stimulus onset. segments containing ocular artifacts were excluded from further analyses, as were any segments that had more than 20 bad channels.
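the segmentation and baseline-correction arithmetic is straightforward: at 1000 samples/s, each segment is 1100 samples (100 pre-stimulus, 1000 post-stimulus), and each channel is corrected to the mean of its 100 ms pre-stimulus samples. the processing itself was done with netstation tools; this is only an illustrative python/numpy sketch with hypothetical variable names:

```python
import numpy as np

def epoch(eeg, onsets, srate=1000, pre_ms=100, post_ms=1000):
    """Cut continuous EEG (channels x samples) into segments running
    from pre_ms before to post_ms after each stimulus onset."""
    pre = int(pre_ms * srate / 1000)    # 100 samples before onset
    post = int(post_ms * srate / 1000)  # 1000 samples after onset
    segments = []
    for onset in onsets:
        seg = eeg[:, onset - pre:onset + post]  # 1100 samples long
        # baseline-correct each channel to its 100 ms pre-stimulus mean
        seg = seg - seg[:, :pre].mean(axis=1, keepdims=True)
        segments.append(seg)
    return np.stack(segments)  # trials x channels x samples

# toy continuous recording: 2 channels, 5 s at 1000 samples/s
rng = np.random.default_rng(0)
eeg = rng.normal(size=(2, 5000))
segs = epoch(eeg, onsets=[1500, 3200])
print(segs.shape)  # 2 trials x 2 channels x 1100 samples
```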
similar event-related potentials to music and language: a replication of patel, gibson, ratner, besson, & holcomb (1998)

the netstation bad channel replacement tool was applied to the eeg data, which were re-referenced using an average reference and baseline corrected to the 100 ms prior to stimulus onset. these processing steps are similar to those used by patel et al. (1998; see pgs. 729-730), with some minor differences due to the use of a different eeg system. information about all tool settings is available at https://osf.io/96bjn/.

results

we conducted our analyses in r v3.4.2 (r core team, 2017) using several packages (henry & wickham, 2017; lawrence, 2016; morey & rouder, 2015; wickham, 2016; wickham, francois, henry, & müller, 2017; wickham & henry, 2018; wickham, hester, & francois, 2017; wilke, 2017). the complete annotated analysis script is available as an r notebook at https://osf.io/m9kej/.

data exclusions

thirty-nine participants had a complete data set. we preregistered a plan to exclude trials that contained artifacts, but we did not pre-register a decision rule for how many good erp segments a participant would need in each condition to be included in the analysis. to avoid making a biased decision, we tabulated the number of artifact-free segments for each of the four conditions for each participant and chose a cutoff as the very first step in our analysis, prior to any examination of the waveforms. based on this ad-hoc inspection of the data (see https://osf.io/w7hrm/), we decided to exclude 4 participants who had at least one condition with fewer than 19 good segments. we chose 19 as the cutoff because the data had some natural clustering; the 4 participants who did not meet that cutoff had 15 or fewer good segments in at least one condition. this left us with data from 35 participants. all subsequent analyses are based only on these 35 participants. the mean number of usable trials across participants was 27.7 for language-grammatical and language-ungrammatical, 33.3 for music-grammatical, and 33.0 for music-ungrammatical.

figure 1. grand average waveforms for language stimuli. the shaded box highlights the time window for analyzing p600 differences (500-800 ms after stimulus onset) and the area surrounding each line represents ±1 se. the plots are arranged to represent approximate scalp position of each electrode, with posterior electrodes at the bottom.

figure 2. grand average waveforms for music stimuli. the shaded box highlights the time window for analyzing p600 differences (500-800 ms after stimulus onset) and the area surrounding each line represents ±1 se. the plots are arranged to represent approximate scalp position of each electrode, with posterior electrodes at the bottom.

behavioral data

we calculated the accuracy of the acceptable/unacceptable judgments that participants made, and we compare these data with the data from patel et al. in table 1. overall, the accuracy of our participants seems consistent with the accuracies reported in patel et al., with the largest difference in the ungrammatical language condition.

table 1. behavioral data. accuracy in participants' judgements of whether stimuli were acceptable or unacceptable in our study compared to the patel et al. (1998) study. patel et al. did not report sds.

condition               | patel et al. (1998) | replication
language, grammatical   | m = 95%             | m = 93.3%, sd = 5.3%
language, ungrammatical | m = 96%             | m = 88.2%, sd = 18.0%
music, grammatical      | m = 80%             | m = 84.5%, sd = 14.5%
music, ungrammatical    | m = 72%             | m = 69.1%, sd = 15.5%

eeg data

in the original experiment, patel et al. analyzed the eeg data in two primary ways. we repeat and extend these analyses below. first, they calculated mean amplitude of the waveforms in all conditions (they had six total conditions, but we have four) and then used anovas to model the effects of grammaticality and electrode site on the amplitude of the erp. they used separate anova models for the language and music conditions and did not treat this as a factor in this part of the analysis. they analyzed three time windows, 300-500 ms, 500-800 ms, and 800-1100 ms, replicating the anovas separately in each time window. finally, they repeated this analysis separately for midline electrodes and lateral electrodes. this was a total of 12 anovas. given that the p600 should be strongest in the 500-800 ms window, we pre-registered a decision to restrict our analysis to the 500-800 ms window only, reducing the number of anovas to 4. we view this as the strongest test of the original conclusion. the results of these four anovas are reported in table 2. while we cannot make direct comparisons with the anova results reported by patel et al. because we dropped one of the levels of the grammar factor from the procedure, we can look at whether the results align at a high level. for both music and language stimuli, patel et al. report a significant effect of grammaticality at both midline and lateral electrode sites, as well as a significant interaction between electrode location and grammaticality at both midline and lateral electrode sites. we found most of these effects; the exceptions were that we found no main effect of grammaticality for lateral electrodes and language stimuli, and no main effect of grammaticality for lateral electrodes and music stimuli. however, we did consistently find an interaction between electrode site and grammaticality for all conditions, which makes the differences in main effects somewhat difficult to interpret. for language stimuli, the interaction between electrode site and grammaticality was due to a stronger effect of grammaticality at posterior electrode sites.
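the dependent measure in these anovas, mean amplitude in the 500-800 ms post-stimulus window, reduces to a simple average over samples. the published analysis was done in r; the sketch below is an illustrative python/numpy version with hypothetical names, assuming epochs at 1000 samples/s with stimulus onset at sample 100 of each 1100-sample segment:

```python
import numpy as np

def mean_amplitude(epochs, srate=1000, t0_index=100, win_ms=(500, 800)):
    """Mean amplitude per trial and channel in a post-stimulus window.

    epochs: array of shape (trials, channels, samples), where t0_index
    is the sample corresponding to stimulus onset."""
    start = t0_index + int(win_ms[0] * srate / 1000)  # 500 ms after onset
    stop = t0_index + int(win_ms[1] * srate / 1000)   # 800 ms after onset
    return epochs[:, :, start:stop].mean(axis=2)      # (trials, channels)

# toy epochs: 4 trials, 3 channels, 1100 samples, constant value 2.0
epochs = np.full((4, 3, 1100), 2.0)
amps = mean_amplitude(epochs)
print(amps.shape)  # one mean amplitude per trial and channel
```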
this stronger posterior effect of grammaticality is also what patel et al. found. for music stimuli, the effect of grammaticality was also stronger at posterior sites, with the exception of the two most posterior sites (o1 and o2), where there was no clear effect of grammaticality. this is a difference from the original study, as patel et al. did observe the music-based p600 effect at these sites. the second analysis that patel et al. ran was to calculate difference waves — subtracting the grammatical erp from the ungrammatical erp — to, in theory, isolate the p600 and then directly compare the amplitude of the difference waves for language and music stimuli. for an unexplained reason, they shifted the time window of analysis to 450-750 ms. we pre-registered a decision to analyze the difference waves in the 500-800 ms range, to remain consistent with the prior analysis. patel et al. found no significant difference in the amplitude of the difference waves and concluded "in the latency range of the p600, the positivities to structurally incongruous elements in language and music do not appear to be distinguishable" (pg. 726). we note that a failure to find a statistically significant difference is not necessarily indicative of equivalence (gallistel, 2009; lakens, 2017). we repeat this analysis for the sake of comparison, but we also include an analysis using bayes factors to examine how the relative probabilities of models that do and do not include the factor of stimulus type (language v. music) are affected by these data. the results of the 2 anovas are shown in table 3.

figure 3. grand average difference waves (ungrammatical minus grammatical) for language and music. the shaded box highlights the time window for analyzing p600 differences (500-800 ms after stimulus onset) and the area surrounding each line represents ±1 se.
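a difference wave is simply a pointwise subtraction of the two condition waveforms. an illustrative python/numpy sketch (not the published r analysis; the array names are hypothetical):

```python
import numpy as np

def difference_wave(ungrammatical, grammatical):
    """Pointwise difference of two grand-average waveforms
    (ungrammatical minus grammatical), per channel and sample."""
    return np.asarray(ungrammatical) - np.asarray(grammatical)

# toy grand averages for one channel (5 samples each)
ungram = np.array([0.0, 1.0, 3.0, 2.0, 1.0])
gram = np.array([0.0, 0.5, 1.0, 1.0, 0.5])
diff = difference_wave(ungram, gram)
print(diff)  # the elementwise difference, one value per sample
```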
table 3. anova results for the difference waves. note: "stimulus" refers to language v. music.

electrode set | stimulus                    | electrode                        | stimulus * electrode
midline       | f(1, 34) = 0.226, p = 0.637 | f(2, 68) = 16.289, p = 0.000002  | f(2, 68) = 0.315, p = 0.731
lateral       | f(1, 34) = 1.784, p = 0.190 | f(9, 306) = 12.181, p < 0.000001 | f(9, 306) = 2.009, p = 0.038

table 2. anova results for grammaticality x electrode models. "electrode" refers to the specific electrode sites within the midline and lateral site groups.

stimulus | electrode set | grammaticality                | electrode                        | grammaticality * electrode
language | midline       | f(1, 34) = 6.41, p = 0.016    | f(2, 68) = 1.00, p = 0.372       | f(2, 68) = 5.11, p = 0.009
language | lateral       | f(1, 34) = 0.44, p = 0.512    | f(9, 306) = 3.68, p = 0.0002     | f(9, 306) = 4.99, p = 0.000003
music    | midline       | f(1, 34) = 23.94, p = 0.00002 | f(2, 68) = 7.00, p = 0.002       | f(2, 68) = 12.43, p = 0.00003
music    | lateral       | f(1, 34) = 1.12, p = 0.298    | f(9, 306) = 11.10, p < 0.000001  | f(9, 306) = 6.47, p < 0.000001

like patel et al., we found no main effect of stimulus type (language v. music) in either lateral or midline electrodes. however, we did find a significant interaction between stimulus type and electrode site for lateral electrodes, though we note that the p-value is relatively high (p = 0.038) with no correction for multiple comparisons. we conducted the bayes factor analysis using the bayesfactor r package (morey & rouder, 2015). briefly, the analysis evaluates the relative support for five different models of the data. all models contain a random effect of participant; models 2-5 also contain one or more fixed effects. model 2 contains the fixed effect of electrode; model 3 contains the fixed effect of stimulus type; model 4 contains both fixed effects; and model 5 contains both fixed effects plus their interaction.
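because every bayes factor in table 4 is expressed against the same participant-only baseline, any two models can be compared by taking the ratio of their table entries. a quick python sketch of that arithmetic using the reported midline values (illustrative only; the analysis itself used the bayesfactor r package):

```python
# bayes factors relative to the participant-only model (midline electrodes),
# as reported in table 4
bf_midline = {
    "electrode": 18160,
    "stimulus": 0.170,
    "electrode + stimulus": 3015,
    "electrode + stimulus + interaction": 367,
}

# comparing two models = ratio of their bayes factors against the shared baseline
bf_electrode_vs_both = bf_midline["electrode"] / bf_midline["electrode + stimulus"]
print(round(bf_electrode_vs_both, 2))  # ~6: data favor dropping stimulus type
```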
in each model, the scaling factor for fixed-effect priors is 0.5, and the scaling factor for random-effect priors is 1.0. see rouder et al. (2012) for model details. the bayes factors for all models are reported in table 4. for the midline electrodes, the model with the greatest positive change in posterior probability relative to just the random effect of participant was the model that added only the fixed effect of electrode. the bayes factor in favor of this model relative to the next best model, which added the fixed effect of stimulus type, was 6.02 (ratio of 18,160 to 3,015). thus, these data should shift our belief in the model that does not contain stimulus type, relative to the model that does, by about 6x. for the lateral electrodes, the model with only a fixed effect of electrode and random effect of participant was also the winning model. however, the evidence against an effect of stimulus type is not as strong here. the bayes factor in favor of the electrode-only model relative to the full model with both main effects and their interaction is only 1.46. the full model is also favored over the main-effects-only model by a bayes factor of 4.22. these results suggest that our relative belief in these models is not shifted much by the data.

discussion

patel et al. (1998) concluded that "... the late positivities elicited by syntactically incongruous words in language and harmonically incongruous chords in music were statistically indistinguishable in amplitude and scalp distribution in the p600 latency range. … this strongly suggests that whatever process gives rise to the p600 is unlikely to be language-specific" (pg. 726). the results of our replication mostly support this conclusion. we found that the amplitude of the erp 500-800 ms after stimulus onset was more positive for ungrammatical words and chords than for grammatical words and chords.
we also found that the effect of grammaticality is stronger in posterior electrodes, though we do find some minor differences from patel et al. in the consistency of this effect for lateral electrodes. the data are somewhat inconclusive as to whether there is an effect of stimulus type on the amplitude of the erp in the p600 window, with (at best) moderate evidence to support the conclusion that there is no difference in mean amplitude. this is despite a sample size (n = 35) that is more than twice the original (n = 15). one aspect of the data that is visually striking is the clear differences in the shape of the waveforms for music and language stimuli (figures 1 and 2). patel et al. (1998) also noted this difference and attributed it to theoretically-irrelevant differences between the musical and linguistic stimuli.

table 4. bayes factors for models of the effect of electrode site and stimulus type (language v. music) at midline and lateral electrodes.

model (bayes factor relative to participant-only model)  | midline electrodes | lateral electrodes
electrode + participant                                  | 18,160 ±1.26%      | 1.81x10^9 ±0.35%
stimulus + participant                                   | 0.170 ±2.03%       | 0.157 ±2.03%
electrode + stimulus + participant                       | 3,015 ±1.32%       | 2.93x10^8 ±1.22%
electrode + stimulus + electrode*stimulus + participant  | 367 ±2.02%         | 1.24x10^9 ±1.30%

note: bayes factors indicate the change in posterior odds for the model relative to the model that contains only the random effect of participant. bayes factors larger than 1 therefore indicate relative support for the model, with larger bayes factors representing more support. bayes factors less than 1 indicate relative support for the participant-only model, with numbers closer to 0 indicating more support.
the musical excerpts are rhythmic with short gaps of silence, while the sentences are more variable and continuous. patel et al. argued that this could explain the difference. this seems plausible, but the statistical models they (and therefore we) used are limited to making comparisons on the mean amplitude in a particular time window, which is a substantial reduction in the information content of the waveforms. an advantage of making the full data set available is that other researchers can choose to analyze the data with other kinds of models. another difference between the language and music waveforms reported by patel et al. was a right anterior temporal negativity (ratn) in the 300-400 ms range (n350) only for the music condition. this was reported as an interesting, unexpected effect but not one that was important theoretically for the main result of similar processing of language and music structural violations. the ratn pattern was not evident in our music waveform data and the relevant statistical analysis did not replicate this element of patel et al.’s findings (see appendix for further details). of course, some concerns can only be addressed through changes to the experimental design, such as creating stimuli that have different properties designed to control for additional factors. featherstone, waterman, and morrison (2012) point out potential confounding factors in the stimuli used by patel et al. (1998) and other similar, subsequent studies. for example, the musical violations are both improbable in context and violate various rules of western musical harmony. direct replications, while crucial for establishing the reliability of a particular finding, necessarily also contain any methodological weakness of the original study. while we contend that this replication supports the empirical conclusions of the original study, we are mindful of the need to also examine support for the theoretical conclusion with a variety of methodological approaches. 
the relative increase in mean amplitude in the 500-800 ms window after structural violations in music and language might reflect shared processing resources, but it’s also possible that there are two distinct processes that both generate this kind of eeg signal. as we described in the introduction, there is already a literature with numerous studies that examine the behavioral and neurological overlap between music and language, a literature in which debates about the best theoretical interpretation of the empirical findings are unfolding. finally, we note that there has been a growing interest in conducting serious replication studies in undergraduate and graduate research methods classes (frank & saxe, 2012; grahe, brandt, ijzerman, & cohoon, 2014; hawkins et al., 2018; wagge, baciu, banas, nadler, & schwarz, 2019; wagge, brandt, et al., 2019). the hypothesized benefits are numerous: students act as real scientists with tangible outcomes, motivating careful and engaged work on the part of the students and benefiting the scientific community with the generation of new evidence; students learn about the mechanics and process of conducting scientific research with well-defined research questions and procedures, providing a stronger foundation for generating novel research in the future; reading papers with the goal of replication teaches students to critically evaluate the methods and rationales in order to be able to replicate the work (frank & saxe, 2012). exposing the next generation of researchers to methodological innovations that improve replicability and reproducibility spreads those practices, hopefully producing a more reliable corpus of knowledge in the future. our experience with this project anecdotally supports these hypotheses. students were engaged and produced high-quality work. moreover, the replication project provided a strong foundation for novel experimental work. 
the class was structured so that smaller teams of students conducted original studies following the whole-class replication effort. students were able to apply a variety of methodological skills learned from the replication project — pre-registration, data analysis techniques, use of the open science framework, and, more abstractly, an understanding of what the complete research process entails — to this second round of projects. given our experiences, we endorse similar initiatives that involve students in replication work as part of their methodological training.

open science practices

this article earned the preregistration plus, open data and the open materials badge for preregistering the hypothesis and full analysis plan before data collection, and for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

author note

correspondence regarding this article should be sent to joshua de leeuw: jdeleeuw@vassar.edu

author contributions

de leeuw and andrews were the leaders of the project and are the first and second authors of this article. the remaining authors contributed equally and are listed in alphabetical order.

conflict of interest

the authors have no conflict of interest to declare.

references

abrams, d. a., bhatara, a., ryali, s., balaban, e., levitin, d. j., & menon, v. (2011). decoding temporal structure in music and speech relies on shared brain resources but elicits different fine-scale spatial patterns. cerebral cortex, 21(7), 1507–1518.
besson, m., chobert, j., & marie, c. (2011). transfer of training between music and speech: common processing, attention, and memory. frontiers in psychology, 2, 94.
besson, m., & faïta, f. (1995). an event-related potential (erp) study of musical expectancy: comparison of musicians with nonmusicians. journal of experimental psychology: human perception and performance, 21(6), 1278–1296.
besson, m., faïta, f., & requin, j. (1994). brain waves associated with musical incongruities differ for musicians and non-musicians. neuroscience letters, 168(1-2), 101–105.
besson, m., & schön, d. (2001). comparison between language and music. annals of the new york academy of sciences, 930, 232–258.
chobert, j., françois, c., velay, j.-l., & besson, m. (2014). twelve months of active musical training in 8- to 10-year-old children enhances the preattentive processing of syllabic duration and voice onset time. cerebral cortex, 24(4), 956–967.
christiansen, m. h., & chater, n. (2008). language as shaped by the brain. the behavioral and brain sciences, 31(5), 489–508; discussion 509–558.
de leeuw, j. r. (2015). jspsych: a javascript library for creating behavioral experiments in a web browser. behavior research methods, 47(1), 1–12.
featherstone, c. r., morrison, c. m., waterman, m. g., & macgregor, l. j. (2013). semantics, syntax or neither? a case for resolution in the interpretation of n500 and p600 responses to harmonic incongruities. plos one, 8(11), e76600.
featherstone, c. r., waterman, m. g., & morrison, c. m. (2012). norming the odd: creation, norming, and validation of a stimulus set for the study of incongruities across music and language. behavior research methods, 44(1), 81–94.
frank, m. c., & saxe, r. (2012). teaching replication. perspectives on psychological science, 7(6), 600–604.
gallistel, c. r. (2009). the importance of proving the null. psychological review, 116(2), 439–453.
grahe, j., brandt, m., ijzerman, h., & cohoon, j. (2014). replication education. aps observer, 27(3).
hawkins, r. x. d., smith, e. n., au, c., arias, j. m., catapano, r., hermann, e., … frank, m. c. (2018). improving the replicability of psychological science through pedagogy. advances in methods and practices in psychological science, 1(1), 7–18.
henry, l., & wickham, h. (2017). purrr: functional programming tools. retrieved from https://cran.r-project.org/package=purrr
herdener, m., humbel, t., esposito, f., habermeyer, b., cattapan-ludewig, k., & seifritz, e. (2014). jazz drummers recruit language-specific areas for the processing of rhythmic structure. cerebral cortex, 24(3), 836–843.
ingre, m., & nilsonne, g. (2018). estimating statistical power, posterior probability and publication bias of psychological research using the observed replication rate. royal society open science, 5(9), 181190.
janata, p. (1995). erp measures assay the degree of expectancy violation of harmonic contexts in music. journal of cognitive neuroscience, 7(2), 153–164.
koelsch, s. (2011). toward a neural basis of music perception – a review and updated model. frontiers in psychology, 2, 110.
koelsch, s., gunter, t. c., wittfoth, m., & sammler, d. (2005). interaction between syntax processing in language and in music: an erp study. journal of cognitive neuroscience, 17(10), 1565–1577.
lakens, d. (2017). equivalence tests: a practical primer for t-tests, correlations, and meta-analyses. social psychological and personality science, 8(4), 355–362.
lawrence, m. a. (2016). ez: easy analysis and visualization of factorial experiments. retrieved from https://cran.r-project.org/package=ez
lebel, e. p., berger, d., campbell, l., & loving, t. j. (2017). falsifiability is not optional. journal of personality and social psychology, 113(2), 254–261.
lima, c. f., & castro, s. l. (2011). speaking to the trained ear: musical expertise enhances the recognition of emotions in speech prosody. emotion, 11(5), 1021–1031.
merrill, j., sammler, d., bangert, m., goldhahn, d., lohmann, g., turner, r., & friederici, a. d. (2012). perception of words and pitch patterns in song and speech. frontiers in psychology, 3, 76.
moreno, s., marques, c., santos, a., santos, m., castro, s. l., & besson, m. (2009). musical training influences linguistic abilities in 8-year-old children: more evidence for brain plasticity. cerebral cortex, 19(3), 712–723.
morey, r. d., & rouder, j. n. (2015). bayesfactor: computation of bayes factors for common designs. retrieved from https://cran.r-project.org/package=bayesfactor
musescore development team. (2018). musescore (version 2.3.2). retrieved from https://musescore.org/
nosek, b. a., alter, g., banks, g. c., borsboom, d., bowman, s. d., breckler, s. j., … yarkoni, t. (2015). promoting an open research culture. science, 348(6242), 1422–1425.
osterhout, l., & holcomb, p. j. (1992). event-related brain potentials elicited by syntactic anomaly. journal of memory and language, 31(6), 785–806.
osterhout, l., & holcomb, p. j. (1993). event-related potentials and syntactic anomaly: evidence of anomaly detection during the perception of continuous speech. language and cognitive processes, 8(4), 413–437.
osterhout, l., holcomb, p. j., & swinney, d. a. (1994). brain potentials elicited by garden-path sentences: evidence of the application of verb information during parsing. journal of experimental psychology: learning, memory, and cognition, 20(4), 786–803.
patel, a. d. (2003). language, music, syntax and the brain. nature neuroscience, 6(7), 674–681.
patel, a. d. (2010). music, language, and the brain. oxford university press, usa.
patel, a. d., gibson, e., ratner, j., besson, m., & holcomb, p. j. (1998). processing syntactic relations in language and music: an event-related potential study. journal of cognitive neuroscience, 10(6), 717–733.
r core team. (2017). r: a language and environment for statistical computing (version 3.4.2). vienna, austria: r foundation for statistical computing. retrieved from https://www.r-project.org/
rivas, d. (2016). jspsych hardware (version v0.2-alpha). retrieved from https://github.com/rivasd/jspsych-hardware
rosenthal, r. (1979). the "file drawer problem" and tolerance for null results. psychological bulletin, 86(3), 638–641.
rouder, j. n., morey, r. d., speckman, p. l., & province, j. m. (2012). default bayes factors for anova designs. journal of mathematical psychology, 56(5), 356–374.
sammler, d., baird, a., valabrègue, r., clément, s., dupont, s., belin, p., & samson, s. (2010). the relationship of lyrics and tunes in the processing of unfamiliar songs: a functional magnetic resonance adaptation study. journal of neuroscience, 30(10), 3572–3578.
sammler, d., koelsch, s., ball, t., brandt, a., grigutsch, m., huppertz, h.-j., … schulze-bonhage, a. (2013). co-localizing linguistic and musical syntax with intracranial eeg. neuroimage, 64, 134–146.
simonsohn, u. (2015). small telescopes: detectability and the evaluation of replication results. psychological science, 26(5), 559–569.
steinbeis, n., & koelsch, s. (2008). shared neural resources between music and language indicate semantic processing of musical tension-resolution patterns. cerebral cortex, 18(5), 1169–1178.
thompson, w. f., schellenberg, e. g., & husain, g. (2004). decoding speech prosody: do music lessons help? emotion, 4(1), 46–64.
tillmann, b. (2012). music and language perception: expectations, structural integration, and cognitive sequencing. topics in cognitive science, 4(4), 568–584.
wagenmakers, e.-j., wetzels, r., borsboom, d., van der maas, h. l. j., & kievit, r. a. (2012). an agenda for purely confirmatory research. perspectives on psychological science, 7(6), 632–638.
wagge, j. r., baciu, c., banas, k., nadler, j. t., & schwarz, s. (2019). a demonstration of the collaborative replication and education project: replication attempts of the red-romance effect. collabra: psychology, 5(1). https://doi.org/10.1525/collabra.177
wagge, j. r., brandt, m. j., lazarevic, l. b., legate, n., christopherson, c., wiggins, b., & grahe, j. e. (2019). publishing research with undergraduate students via replication work: the collaborative replications and education project. frontiers in psychology, 10, 247.
wickham, h. (2016). ggplot2: elegant graphics for data analysis. springer-verlag new york. retrieved from http://ggplot2.org
wickham, h., francois, r., henry, l., & müller, k. (2017). dplyr: a grammar of data manipulation. retrieved from https://cran.r-project.org/package=dplyr
wickham, h., & henry, l. (2018). tidyr: easily tidy data with "spread()" and "gather()" functions. retrieved from https://cran.r-project.org/package=tidyr
wickham, h., hester, j., & francois, r. (2017). readr: read rectangular text data. retrieved from https://cran.r-project.org/package=readr
wilke, c. o. (2017). cowplot: streamlined plot theme and plot annotations for "ggplot2." retrieved from https://cran.r-project.org/package=cowplot
zwaan, r. a., etz, a., lucas, r. e., & donnellan, m. b. (2017). making replication mainstream. behavioral and brain sciences, 41, 1–50.

appendix

an unexpected discovery reported by patel et al. was a right anterior temporal negativity (ratn) in the 300-400 ms window only for the music condition. patel et al. also referred to this peak as an n350 and noted its potential relation to the left anterior negativity (lan) reported for linguistic grammatical processing (but not observed by patel et al. for their linguistic stimuli). patel et al.
note that these hemispheric effects of opposite laterality for language and music suggest distinct but possibly analogous cognitive processes and propose that they should receive additional investigation but do not discuss them further. we did not pre-register any analyses of this effect because we did not consider it relevant to the theoretical claim of syntactic processing similarity between music and language shown by the p600 effect. however, in response to a reviewer’s request we investigated whether our data supported patel et al.’s finding of an ratn/n350 for the music condition. the key statistical result reported by patel et al. was a significant three-way interaction between condition (in-key chord vs. distant-key chord), hemisphere, and electrode site for the 300-400 ms window. the corresponding result for a grammaticality x hemisphere x electrode site anova performed on our data was not significant (f(4, 136) = .366, p = .832), an outcome that fits with the appearance of the waveforms for the music condition of our experiment which show no sign of the ratn that appeared in patel et al.’s figure 5 (compare to our figure 2). in order to further address the strength of evidence provided by our data with respect to this three-way interaction, we conducted a bayes factor analysis using the bayesfactor r package (morey & rouder, 2015) to evaluate the relative support for models containing fixed effects of electrode, hemisphere, and grammaticality (and their interactions) relative to the null model containing only a random effect of participant. the data are 431,034 times less likely under the full model that adds in all three main effects, the three two-way interactions, and the three-way interaction than under the null model. to isolate the contribution of the three-way interaction, we can compare the full model containing the three-way interaction to the model containing all terms except the three-way interaction. 
the data are 39 times less likely under the model with the three-way interaction. thus, we clearly did not replicate the ratn for music reported by patel et al. the complete set of results for this analysis and the analysis scripts are available on the open science framework at https://osf.io/zpm9t/.

meta-psychology, 2021, vol 5, mp.2018.869 https://doi.org/10.15626/mp.2018.869 article type: tutorial published under the cc-by4.0 license open data: yes open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: not applicable edited by: rickard carlsson reviewed by: f.d. schönbrodt, a. stefan, s.r. martin, c.o. brand analysis reproduced by: andré kalmendal all supplementary files can be accessed at osf: https://osf.io/v2pnc/

a fully automated, transparent, reproducible, and blind protocol for sequential analyses

brice beffara bret lppl ea 4638, université de nantes, nantes, france laboratoire de psychologie ea 3188, université de franche-comté, besançon, france the walden iii slowpen science laboratory, villeurbanne, france amélie beffara bret lppl ea 4638, université de nantes, nantes, france the walden iii slowpen science laboratory, villeurbanne, france ladislas nalborczyk department of experimental clinical and health psychology, ghent university, belgium the walden iii slowpen science laboratory, villeurbanne, france

abstract

despite many cultural, methodological, and technical improvements, one of the major obstacles to results reproducibility remains the pervasive problem of low statistical power. in response to this problem, a lot of attention has recently been drawn to sequential analyses. this type of procedure has been shown to be more efficient (it requires fewer observations and therefore fewer resources) than classical fixed-n procedures. however, these procedures are subject to both intrapersonal and interpersonal biases during data collection and data analysis.
in this tutorial, we explain how automation can be used to prevent these biases. we show how to synchronise open and free experiment software programs with the open science framework and how to automate sequential data analyses in r. this tutorial is intended for researchers with beginner-level experience in r; no previous experience with sequential analyses is required. keywords: sequential analysis, sequential testing, sequential bayes factor, automation, expectancy effects, reproducibility, blind analyses

on reproducibility

it may be referred to as a crisis, a revolution, or a renaissance, but psychology has undeniably undergone a decade of unparalleled methodological reflection and reform (for an overview, see fidler & wilcox, 2018; nelson et al., 2018). although many of the practices that are currently recognised as part of the problem (e.g., poor understanding of statistical methods, questionable research practices) have long been acknowledged (e.g., babbage, 1830),1 recently introduced practices have brought considerable improvements to the reliability of findings in psychological science (e.g., see smaldino et al., 2019). there are many ways to define what reproducibility is, and it is often unclear what is meant by the terms reproducibility, replicability, repeatability, or reliability. to avoid this confusion, we adopt the terminology suggested by goodman et al. (2016). when discussing research reproducibility, we make a distinction between i) methods reproducibility: the ability to reproduce, as closely as possible, the methodological procedures developed by a certain team (i.e., what is usually meant by reproducibility), ii) results reproducibility: the ability to reproduce a certain result in a given methodological setting (i.e., what is usually meant by replicability), and iii) inferential reproducibility: the ability (for an independent team) to replicate an inferential conclusion, that is, to arrive at the same conclusion.
whereas results reproducibility concerns the outcome of a computational or experimental procedure, methods reproducibility is a property of the methods being used to produce this particular outcome. as put by meehl (1990), a scientific study is akin to a recipe, and a good methods description should allow other cooks to prepare the same kind of cake as the person who wrote the recipe did. as such, methods reproducibility is essential to results reproducibility. fortunately, recent technical developments have made the task far easier than it used to be. key components of a modern reproducible workflow may include:

• transparency: exhaustive and intelligible description and sharing of all materials, scripts, etc. (for a practical introduction, see klein et al., 2018)

• self-containment: writing reproducible documents in latex or rmarkdown (e.g., see the r package papaja, aust & barth, 2018) and sharing self-contained code (e.g., see https://codeocean.com)

• version control: using git and github (or gitlab) to track changes in working documents (for an introduction, see vuorre & curley, 2018)

• automation: minimising mistakes by automating as many steps of the research process as possible (e.g., rouder, 2016; rouder et al., 2018; yarkoni et al., 2019)

although methods reproducibility is essential to results reproducibility, it is not sufficient. despite a long history of scrutiny in psychology, one of the major threats to results reproducibility remains low statistical power, where power can be broadly defined as the probability of achieving a certain goal, given that a suspected underlying state of the world is true (kruschke, 2015). we know that a low-powered result has (all other things being equal) a lower probability of replicating, as the initial result comes with a higher probability of erroneous inference (e.g., type-m or type-s errors, gelman & carlin, 2014).
even though there are many ways to increase statistical power (e.g., see hansen & collins, 1994), we focus here on sequential testing, that is, the continuous analysis of data during its collection. this procedure has been shown to optimise the amount of resources (e.g., money and time) to be spent in order to attain a certain goal, as compared to classical a priori power analysis strategies (lakens, 2014; schönbrodt et al., 2017). we turn now to a brief presentation of several sequential analysis procedures, followed by a discussion of the methodological precautions that need to be taken to ensure the validity of these procedures. finally, we outline the core aspects of a "born-open" (following the terminology of rouder, 2016), fully automated, and reproducible workflow for sequential analyses.

a brief introduction to sequential analyses procedures

in this section, we briefly introduce three sequential analysis procedures that address three distinct goals. more precisely, these procedures make it possible either to i) accumulate relative evidence for a hypothesis (the sequential bayes factor procedure), ii) efficiently accept or reject a value or range of values for a parameter (the sequential hdi+rope procedure), or iii) sample observations until a desired level of estimation precision is reached.

sequential bayes factor

schönbrodt et al. (2017) presented an alternative to null-hypothesis significance testing with a priori power analysis (nhst-pa) by introducing the sequential bayes factor (sbf) procedure. the sbf procedure uses bayes factors (bfs) to iteratively examine the relative evidential support for a hypothesis during data collection.2

1 where the same could be said for recently proposed solutions like preregistration or radical transparency (e.g., de groot, 2014).

2 technically speaking, the bayes factor is a ratio of marginal likelihoods (i.e., what is considered as evidence in the bayesian framework). broadly, a bf can be interpreted as an updating factor, indicating how credibility should be re-allocated from prior knowledge (what was known before seeing the data) to posterior knowledge (what is known after seeing the data).
the first step of the sbf procedure is to pick thresholds (one for each of the two hypotheses being compared) that determine the end of data collection. these thresholds should be selected to reflect the level of evidence that the experimenter considers sufficient to stop data collection, but should also be defined in consideration of specific goals and cost-benefit analyses. indeed, more stringent thresholds require larger sample sizes, but are associated with lower risks of misleading inferences (false positives and false negatives), all other things being equal. then, after picking the appropriate prior distribution for the alternative hypothesis, a first batch of observations is collected, during which no bf is computed (to avoid misleading inferences due to early terminations). starting at nmin observations (the a priori defined minimum sample size), a bf is computed at each stage (or each observation). the sampling procedure goes on until the current bf reaches the a priori defined threshold or until reaching nmax observations (the a priori defined maximum sample size). schönbrodt et al. (2017) provided detailed simulation results for this procedure when comparing the means of two independent groups (the equivalent of a two-samples t-test). they show that the error rates and the average length of the procedure (i.e., how many observations are needed to reach the threshold) are a function of the population effect size, the threshold, and the prior for the alternative hypothesis. for instance, when chasing a medium effect size (d = 0.5) and when using a "medium scaled" prior for the alternative (r = 1), stopping data collection at bf = 6 instead of bf = 3 results in a percentage of wrong inferences of 4.6% instead of 40% (see table 1 in schönbrodt et al., 2017, p. 10). based on these results, it is possible to combine prior guesses or expectations with the known properties of the sbf procedure to design experiments.
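to make the stopping rule concrete, the sbf loop described above can be sketched in a few lines. this is only an illustration in python (the tutorial's own materials use r): the bayes factor here is a rough bic-based approximation for a two-group mean comparison, not the jzs default bayes factor used by schönbrodt et al. (2017), and the nmin, nmax, batch, and threshold values are arbitrary example choices.

```python
import numpy as np

def bf10_bic_approx(x, y):
    """Rough BF10 for a two-group mean difference via the BIC approximation
    BF10 ~ exp((BIC0 - BIC1) / 2); a crude stand-in for a default Bayes factor."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x) + len(y)
    rss1 = np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)  # two-means model
    pooled = np.concatenate([x, y])
    rss0 = np.sum((pooled - pooled.mean()) ** 2)                      # one-mean model
    bic1 = n * np.log(rss1 / n) + 2 * np.log(n)
    bic0 = n * np.log(rss0 / n) + 1 * np.log(n)
    return float(np.exp((bic0 - bic1) / 2))

def sbf_run(group_a, group_b, n_min=20, n_max=200, threshold=6.0, batch=5):
    """Sequential Bayes factor procedure: start testing at n_min observations
    per group, then stop as soon as BF10 >= threshold ("accept H1"),
    BF10 <= 1/threshold ("accept H0"), or n_max is reached ("inconclusive")."""
    n = n_min
    while True:
        bf = bf10_bic_approx(group_a[:n], group_b[:n])
        if bf >= threshold:
            return "accept H1", n, bf
        if bf <= 1.0 / threshold:
            return "accept H0", n, bf
        if n >= n_max:
            return "inconclusive", n, bf
        n = min(n + batch, n_max)

# deterministic toy data: a clear group difference stops the procedure at n_min
a = np.tile([1.9, 2.1], 100)   # mean 2.0, tiny spread
b = np.tile([-0.1, 0.1], 100)  # mean 0.0, tiny spread
print(sbf_run(a, b))           # stops at n = 20 with "accept H1"
```

note how the decision, the stopping sample size, and the final bf fall out of the same loop; a real analysis would substitute a properly calibrated bf (e.g., from the bayesfactor r package) for the approximation above.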
this strategy is known as design analysis and includes the classical power analyses of the nhst framework as a particular case. in this vein, schönbrodt and wagenmakers (2018) introduced the bayes factor design analysis tool and demonstrated how this strategy can help to design more informative empirical studies (see also stefan et al., 2019).

the sequential hdi+rope procedure

bayes factors are not the only available option to perform sequential testing. whereas bfs quantify the evidence in favour of a hypothesis (relative to another hypothesis), each individual hypothesis can also be examined on its own. for instance, the hypothesis that the group difference of some measured variable is equal to zero might be assessed by looking directly at the posterior distribution of the group difference.3 this distribution can be summarised via its mean and highest density interval (hdi), an interval that contains the x% most credible values for the parameter (kruschke & liddell, 2018).4 the hypothesis according to which the group difference is equal to zero can then be assessed by checking whether the hdi includes zero as a credible value. alternatively, and more interestingly, the hdi can be compared to a region of practical equivalence (rope; kruschke, 2018), defining the range of effect sizes that we consider negligible (i.e., equivalent to zero) for practical purposes. the comparison of the hdi and the rope can be summarised by computing the proportion of the hdi that is included in the rope, giving an idea of the extent to which the hypothesis of no effect is supported. alternatively, the comparison of the hdi to the rope can result in three categorical outcomes: i) the null hypothesis is rejected (when the hdi falls completely outside the rope), ii) the null hypothesis is accepted (when the hdi falls completely inside the rope), or iii) the data are said to be inconclusive (when neither of the above applies).
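the three-way hdi-versus-rope decision rule lends itself to a short sketch. this is an illustration in python (the tutorial's own materials use r); it assumes we already have draws from the posterior distribution of the effect (here simply simulated normal draws as stand-ins for a real posterior), and the rope limits are example values.

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the posterior samples."""
    s = np.sort(np.asarray(samples, float))
    n_keep = int(np.ceil(mass * len(s)))
    widths = s[n_keep - 1:] - s[: len(s) - n_keep + 1]
    i = int(np.argmin(widths))
    return s[i], s[i + n_keep - 1]

def rope_decision(samples, rope=(-0.1, 0.1), mass=0.95):
    """Three categorical outcomes from comparing the HDI to the ROPE."""
    lo, hi = hdi(samples, mass)
    if hi < rope[0] or lo > rope[1]:
        return "reject null"    # HDI falls completely outside the ROPE
    if rope[0] <= lo and hi <= rope[1]:
        return "accept null"    # HDI falls completely inside the ROPE
    return "inconclusive"       # HDI and ROPE only partially overlap

rng = np.random.default_rng(1)
print(rope_decision(rng.normal(0.5, 0.10, 10_000)))  # posterior far above the rope -> "reject null"
print(rope_decision(rng.normal(0.0, 0.02, 10_000)))  # posterior concentrated near zero -> "accept null"
print(rope_decision(rng.normal(0.0, 0.20, 10_000)))  # wide posterior straddling the rope -> "inconclusive"
```

in a sequential setting, `rope_decision` would be re-run on the updated posterior at each interim look, and data collection would stop on the first conclusive outcome.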
when the goal of the analysis is to accept or reject a reference value, it is then possible to stop data collection when the hdi does not include this reference value. thus, this procedure is similar to the sbf procedure, except that the sampling procedure is terminated by a (conclusive) comparison of a hdi to a rope, instead of being terminated by a comparison of a bf value to some threshold (for a detailed study of the characteristics of this procedure, see kruschke, 2015).

aiming for precision

because data collection stops when the accumulated evidence reaches a certain threshold, sequential hypothesis testing procedures (e.g., sbf or hdi+rope) are known to be biased by extreme observations. in other words, data collection stops when the collected data support the hypothesis, preventing the opportunity of collecting contradictory data afterwards (kruschke, 2015). performing sequential analysis until a certain estimation precision is reached overcomes this bias (kruschke, 2015). in this procedure, the goal is not to stop data collection based on the rejection of a hypothesis but rather to sample observations until a predefined level of precision in the estimation of a parameter (including effect size) is reached.

3 the posterior distribution is the result of any bayesian analysis. it is a probability distribution that allocates probability to parameter values, given the model (including the priors) and the observed data.

4 the hdi is a particular type of credible interval, the bayesian equivalent of the frequentist confidence interval. it should be noted however that the interpretation of bayesian and frequentist intervals differs considerably (e.g., morey et al., 2016; nalborczyk et al., 2019).
the estimation precision can be quantified by the width of the hdi, and kruschke (2015) proposes to stop data collection when the hdi width is less than 80% of the rope's width. for instance, if the smallest effect size of interest (sesoi, lakens et al., 2018) is δ = 0.2, we can define a rope around 0 from δ = −0.1 to δ = 0.1. this means that we will consider an effect as approximately null when it is less than half our sesoi. planning for precision, we would therefore stop data collection when the width of our hdi (on the estimate of the effect size) is less than 0.8 × 0.2 = 0.16 (kruschke, 2018). of course, there are other methods to determine the desired level of precision. for instance, researchers can base their sesoi on an effect of minimal clinical relevance. in this context, the sesoi varies depending on the costs and benefits of the treatment (lakens et al., 2018). we can also imagine other criteria to determine precision on the basis of the rope. in any case, even if one wants to check whether the hdi falls inside the rope at the end of data collection, planning for precision enables us to focus on parameter or effect size estimation instead of hypothesis testing. if we decide to plan for moderate precision, we will not need a large number of observations, but it is likely that the hdi will overlap the rope. as a consequence, it will not be possible to reach a categorical conclusion. if we plan for high precision, more observations will be needed to stop data collection, but both precision and hypothesis testing will be improved. in most situations, planning for precision will require more observations than hypothesis testing. whatever the procedure, the stopping rule should be thought out carefully, by considering and balancing the costs and benefits of collecting more observations. a small number of observations is affordable but likely to lead to biased conclusions; a large number of observations can be expensive but is likely to lead to more robust conclusions.
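to give a feel for the sample sizes this criterion implies, the 0.8 × rope-width rule can be turned into a rough back-of-the-envelope calculation. this sketch (in python; the tutorial's own materials use r) assumes an approximately normal posterior for a mean with known sd, so the 95% hdi width is about 2 × 1.96 × sd / √n; a real sequential analysis would instead compute the hdi width on the actual posterior at each interim look.

```python
import math

Z95 = 1.959964  # half-width multiplier of a 95% central normal interval

def n_for_precision(sd, rope_width=0.2, frac=0.8):
    """Smallest n at which the 95% HDI of a normal-posterior mean estimate
    (width ~ 2 * Z95 * sd / sqrt(n)) drops below frac * rope_width,
    i.e. Kruschke's 80%-of-ROPE precision criterion."""
    target = frac * rope_width          # e.g. 0.8 * 0.2 = 0.16
    n = 1
    while 2 * Z95 * sd / math.sqrt(n) >= target:
        n += 1
    return n

print(n_for_precision(sd=1.0))  # -> 601 observations before the HDI is narrower than 0.16
```

the steep n illustrates the point made above: planning for precision typically requires more observations than stopping on a hypothesis-testing criterion.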
both frequentist and bayesian methods have their own ways to deal with risks and errors in sequential hypothesis testing. criteria are modified for sequential hypothesis testing so that "significant" p-values (e.g., .0294 instead of .05 for one interim analysis; lakens, 2014) or bayes factors (e.g., bf = 6 instead of bf = 3, depending on acceptable error rates; schönbrodt et al., 2017) are not the same as in classical (non-sequential) hypothesis testing. this should also be considered in the sequential hdi+rope procedure.

some difficulties

the procedures previously described and discussed in schönbrodt et al. (2017) and kruschke (2015, 2018) offer an attractive perspective on data collection. however, some precautions need to be taken in order to preserve the precision and long-term error rates they provide. more precisely, we discuss two main categories of biases that need to be controlled. we call the first category intrapersonal biases: biases that are expressed within individuals. these biases mainly emerge during data analysis and data reporting. we call the second category interpersonal biases: biases that are expressed between individuals. these biases mainly emerge during data collection (e.g., when the researcher interacts with participants). these biases can arise in any study, but sequential procedures present specific risks on both the intra- and interpersonal dimensions. we explain why in the next section.

what could possibly go wrong? intrapersonal biases during sequential analyses

most intrapersonal biases occur because of what some would call "researcher degrees of freedom" (simmons et al., 2011; wicherts et al., 2016). following wicherts et al. (2016)'s nomenclature, we identified the following intrapersonal biases as having increased risks in sequential procedures:

• c3: correcting, coding, or discarding data during data collection in a non-blinded manner.
• a1: choosing between different options of dealing with incomplete or missing data on ad hoc grounds.

• a2: specifying pre-processing of data (e.g., cleaning, normalization, smoothing, motion correction) in an ad hoc manner.

• a3: deciding how to deal with violations of statistical assumptions in an ad hoc manner.

• a4: deciding on how to deal with outliers in an ad hoc manner.

these degrees of freedom allow flexibility when processing data before statistical inference. this flexibility can be dangerous when incentives, cultural norms, or previous practices influence data analysis, and where there are few safeguards. thus, the intrapersonal nature of these biases does not refer to an absence of social influence, but rather to biases occurring at the point where the researcher makes choices on her own. this can be the case when someone discards an outlier to reach a significant result because it is easier to publish significant results. this can also be the case when one tries different data transformations to find the one that best matches their preferred hypothesis. here the problematic nature of the degrees of freedom is the researcher's subjectivity. discarding an outlier or transforming data is not necessarily motivated by the observation actually being an outlier or the data actually being skewed, but can also be motivated by what the researcher wants to see, given economic, social (e.g., reputation pressure), or cultural contexts. these biases are not specific to sequential procedures, but their consequences might be amplified during sequential procedures because of the possible impact of each look at the data. indeed, the degrees of freedom can impact data analysis at time t1, which can in turn impact data collection in a self-perpetuating cycle. data analysis at time t2 is therefore likely to be biased by the degrees of freedom not only at time t2 but also at time t1. thus, errors accumulate because the effects of the degrees of freedom are multiplied.
when a data analyst has expectations about what should be observed, the data analysis is likely to be biased by these expectations through confirmation (favouring a hypothesis) or disconfirmation (stronger scepticism toward data against the hypothesis than toward data corroborating the hypothesis) biases (maccoun & perlmutter, 2017). while continuously analysing data, the data analyst is faced with many choices about the best way to deal with incoming data. based on previous studies, they might have expectations about the range of plausible values, or they might need to use particular methods to process physiological signals, to recode, or to transform data in a specific way, and so on. we urge researchers to make these decisions explicit before data collection. the properties of sequential procedures have been studied extensively via simulation (kruschke, 2015; schönbrodt et al., 2017; wagenmakers et al., 2017). however, noise and irregularities in simulated data only come from sampling variability, and not from practical problems that can be encountered during empirical data collection (e.g., technical issues or experimenter biases). when collecting data, researchers would like to get as close as possible to the shape of simulated data (i.e., we would like to minimise sources of error other than sampling variability). in order for the bf, the hdi, or the precision to be reliable stopping criteria, they have to be computed on reliable data. we acknowledge that what can be considered reliable data heavily depends on the type of study. as such, decisions concerning the analysis workflow should be justified by the existing literature as much as possible. however, changing the criteria and methods for data preparation based on the state of the sequential procedure is not acceptable. the result of an experiment cannot determine the way it is itself defined. real-time result-hacking would jeopardise the confidence one can have in this result.
therefore, we propose that all the steps of the sequential procedure (i.e., data preparation and cleaning, outlier detection, data transformation, model assumption checks, and data analysis) should be automated and performed in an incremental manner. the entire dataset can be continually reanalysed, including former outliers, so that the process follows the progressive incorporation of new observations. put more formally, such a procedure should be able to take into account that an extreme observation at time t might not be extreme anymore at time t+n, and should therefore be reincluded in subsequent analyses. the fact that this iterative procedure is automated should protect the data analyst from the hazards that are commonly encountered during data manipulation (see previous section). these hazards have particularly important consequences in sequential procedures in comparison to traditional (i.e., fixed-n) procedures because of the incremental nature of evidence accumulation. a specific preprocessing choice at time t can influence the computation of inference criteria at time t+n, which can subsequently influence other preprocessing choices, and so on. with accumulated modifications in preprocessing decisions, the inference criteria are computed based not only on the data but also on the scientist's sequential, incremental choices. we propose that these steps could instead be programmed and coded on the basis of preregistered choices, before starting to collect data. preregistered automated data analysis would therefore ensure that conclusions based on empirical procedures are similar to the results (e.g., long-term error rates) that schönbrodt et al. (2017) or kruschke (2015) obtained using simulation, and would explicitly fulfil the requirements of transparency and reproducibility.
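the "reanalyse everything at every look" idea above can be sketched as follows. this is an illustration in python (the tutorial's own materials use r); the z-score outlier rule and its threshold are hypothetical stand-ins for whatever preregistered preprocessing rule a given study would use.

```python
import numpy as np

def preprocess(raw, z_cut=3.0):
    """A preregistered cleaning rule, re-applied to the FULL raw dataset at
    every look: drop observations more than z_cut SDs from the overall mean.
    Because the rule is recomputed from scratch each time, a value flagged
    as extreme at an early look can re-enter the analysis once later data
    shift the mean and SD."""
    raw = np.asarray(raw, float)
    z = (raw - raw.mean()) / raw.std(ddof=1)
    return raw[np.abs(z) < z_cut]

def analyse_at_each_look(stream, n_min=10, batch=5):
    """At each interim look, rerun the whole pipeline on all raw data
    collected so far (never on a previously cleaned subset)."""
    looks = []
    for n in range(n_min, len(stream) + 1, batch):
        clean = preprocess(stream[:n])  # always starts from the raw stream
        looks.append((n, len(clean), float(clean.mean())))
    return looks

rng = np.random.default_rng(0)
stream = rng.normal(0.0, 1.0, 60)
for n, kept, m in analyse_at_each_look(stream):
    print(f"n={n:2d}  kept={kept:2d}  mean={m: .3f}")
```

the essential design choice is that `analyse_at_each_look` only ever slices the raw stream: no interim decision is stored, so no earlier judgement call can propagate into later looks.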
however, being able to define in advance the goal and the criteria for success of these sequential procedures requires the researcher i) to be well aware of the literature of interest, ii) to know how the data behave, by manipulating data from very similar previous experiments or pretests, and iii) to be able to implement a data preparation procedure for model computation before seeing new data. these three points might seem trivial, but they are even more important for sequential analyses than for classical procedures in order to avoid intermediate influences in data preparation based on known interim results. in addition to these intermediate influences (i.e., influences between the collection of the first batch of data and the final data analysis) in data preparation, intermediate influences can also be problematic during data collection. these influences are what we call "interpersonal biases".

interpersonal biases during sequential analyses

by interpersonal biases we refer to biases that might occur when a researcher interacts with participants during data collection. according to wicherts et al. (2016)'s nomenclature, the interpersonal bias with potentially increased risks in sequential procedures is:

• c2: insufficient blinding of participants and/or experimenters.

when an experimenter has expectations about what should be observed, data collection is likely to be biased by these expectations (gilder & heerey, 2018; klein et al., 2012; orne, 1962; rosenthal, 1963; rosenthal, 1964; rosenthal & rubin, 1978; zoble & lehman, 1969). one solution to prevent this bias is to make sure that experimenters are blind to the experimental conditions. double-blind5 designs are expected to minimise expectancy effects (gilder & heerey, 2018; klein et al., 2012). however, when the experimenter cannot be blind, expectancy effects are to be expected. this bias has been clearly identified by lakens (2014) as a particularly strong risk (i.e.
observer effects are likely to be stronger in the context of sequential testing) in sequential analysis: "experimenter bias is important to consider when performing a study under normal circumstances [...] but becomes even more important to consider when the experimenter has performed an interim analysis." what is the specific status of sequential testing concerning analyst and observer expectancy effects? expectancy effects arise when one has prior beliefs and/or motivations about the outcome of an experiment and involuntarily (we assume scientific honesty) influences the results on the basis of these prior beliefs and motivations. the confidence in a hypothesis can be influenced by previous results from the literature, naive representations about the studied phenomenon, and other sources of information. these sources may deal with the studied phenomenon but rarely with the ongoing study specifically, and, as a consequence, the hypothesis remains subject to uncertainty. when performing sequential testing, one has direct access to the accumulation of evidence concerning the ongoing study. hence, the prior information accumulated from sequential analysis specifically reduces uncertainty about the potential results of the ongoing experiment, as compared to information gathered from previous studies or naive representations. in other words, a bias toward a particular result may stem from previous literature or personal beliefs. in the sequential analysis context, a bias toward a particular result may also stem from observing the accumulating data across the sequential procedure. this could be particularly strong, because one can directly see the accumulated data as the sample size increases in real time. knowing about intermediate results can therefore increase the risk of falling into an "evidence confirmation loop".
in the previous section, we proposed that this risk applies to confirmation and disconfirmation biases (data analysis), where the intrapersonal bias of data evaluation can inflate with accumulated evidence. in this section, we propose that this loop can also worsen experimenter expectancy effects during data collection. the interpersonal bias of experimenter-participant interactions can be seen as a self-fulfilling prophecy amplified by feedback from previous data. clearly, it is very difficult to obtain robust results concerning the effect size of analyst and observer expectancy effects. indeed, one would have to carry out experiments on experiments in order to study these biases. this "meta-science" problem is arduous because these biases can apply at all levels of manipulation, as one experiment is included in another. for instance, barber (1978) suggested that expectancy biases can also occur in expectancy bias research itself. it can also be difficult to collect large observation samples per experimental condition (e.g., zoble & lehman, 1969), although recent work has shown that it is possible (gilder & heerey, 2018). thus, we can only draw attention to these effects as a potential risk to consider rather than as a precisely quantified danger to avoid. when double-blind designs are not practicable, interpersonal biases seem obvious. however, when a double-blind design is set up, the existence of an interpersonal bias is probably more questionable. how could knowledge about previous data influence the outcome of the experiment? it is possible that the experimenter's verbal and non-verbal motor cues impact the participant's behaviour (zoble & lehman, 1969). however, it is unlikely that only non-verbal cues underlie the experimenter expectancy effect, at least in simple or familiar tasks (hazelrigg et al., 1991).
what we know today is that the experimenter expectancy effect can be inadvertent and can depend on the interaction between the experimenter and the participant (gilder & heerey, 2018; hazelrigg et al., 1991). personality variables such as the need for social influence (on the experimenter's side) and the susceptibility to social influence (on the participant's side) can also increase the expectancy bias (hazelrigg et al., 1991). in a double-blind design, the experimenter cannot influence the participant's responses on the basis of knowledge of the experimental condition. however, the (de)motivation and the disappointment/satisfaction of seeing the preferred hypothesis contradicted/confirmed by the sequential testing procedure can possibly influence the participant.6 we cannot rule out the possibility that the confidence in a hypothesis interacts with experimental conditions and impacts the results of the experiment in one way or another. because the experimenter is not aware of the experimental condition of the participants, they will probably influence them more uniformly than in a simple blind design. this means that the behaviour of the experimenter can potentially change the baseline value of a parameter in all participants. also, we cannot exclude the possibility that the effect of the experimental manipulation is biased by this baseline value shift. more generally, "contextual variables, such as experimenters' expectations, are a source of error that obscures the process of interest" (klein et al., 2012). what could be expected? to the best of our knowledge, there is no experiment reporting expectancy biases when the experimenter is blind to the experimental condition.

5 we use the "double blind" terminology according to the classical definition, where both the participant and the experimenter are blind to the experimental condition.
however, blinding the experimenter to interim analyses is certainly recommended (lakens, 2014) when blinding experimental conditions is not possible. we suggest that blinding the analysis should also be considered as a precaution, even when the experimenter is blind. in table 1, we describe hypothetical observable consequences of such biases on the sbf and hdi+rope procedures. importantly, expectation biases can emerge in all combinations of a priori expectations and population effect sizes. congruent observations are expected to increase the speed with which the threshold is reached (h0+ and h1+), whereas incongruent observations are expected to slow down this process (h0- and h1-) and to increase the number of false alarms. evidence is insufficient to draw conclusions about the practical significance of analyst and observer expectancy effects, especially in double-blind designs. if the necessary methods to reduce bias were costly and the potential benefits uncertain, then it would be reasonable to be sceptical of our proposal. however, we will show in the next section that the methods required to reduce bias are easy to implement, and therefore nearly costless to adopt. we have presented how knowledge of previous data can bias the data collection process and have also illustrated the predicted consequences of these biases on the evolution of sequentially computed bfs. in the next section, we focus on how to prevent these biases from happening. we suggest two ways of implementing analysis blinding as a precaution against experimenter biases during sequential testing, and present a proof of concept for an automated procedure that would ensure objectivity.

a fully automated, transparent, reproducible and triple-blind protocol for sequential testing

blinding is the procedure that hides the assigned condition from people involved in the experiment. it can notably be applied to participants, experimenters, or data analysts (schulz & grimes, 2002).
If possible, it is preferable to apply blinding to anyone involved in the experiment to avoid expectancy effects. Whereas participant and experimenter blinding is often considered in psychology, much less attention has been given to analysis blinding, probably due to material and time constraints. However, the use of analysis blinding would help eliminate some of the biases identified in Wicherts et al. (2016). Again, this has been well described by Lakens (2014): "In large medical trials, tasks such as data collection and statistical analysis are often assigned to different individuals, and it is considered good practice to have a data and safety monitoring board that is involved in planning the experiment and overseeing any interim analyses. In psychology, such a division of labor is rare, and it is much more common that researchers work in isolation." Analysis blinding can take two different forms in the context of sequential analysis procedures. First, analysis blinding can refer to a procedure ensuring that the person analysing the data is blind to the hypotheses (Miller & Stewart, 2011). This configuration minimises intrapersonal biases because the analyst does not have the information necessary to influence the data analysis in a specific direction (congruent or incongruent with the hypothesis). Second (and more specific to sequential testing procedures), analysis blinding can refer to a procedure ensuring that the experimenter is blind to the data analysis. This configuration minimises interpersonal biases because the experimenter does not have the information necessary to influence data collection in a specific direction (congruent or incongruent with the hypothesis). If the experimenter is not the data analyst, they can be blind to the evolution of the intermediate results until data collection stops. As a consequence, the specific experimenter expectancy bias in the sequential procedure is avoided.

[6] Ideally, scientists should be interested in all possible results, whatever they are.

Table 1
Possible interactions between the population value of the effect size and the a priori expectations of the experimenter during a (non-blind) sequential testing procedure.

                               No difference in the population   A difference in the population
                               (H0: δ = 0)                       (H1: δ ≠ 0)
Researcher 1, believes in H0   H0+ (congruent)                   H1− (incongruent)
Researcher 2, believes in H1   H0− (incongruent)                 H1+ (congruent)

Another solution is to automate analysis blinding so that the data analyst and the experimenter (who can be the same person) are blind to intermediate results computed on previous sets of observations. To illustrate this idea, we describe below how to perform a transparent blind sequential analysis. We propose an example for two independent-groups comparisons (as in Schönbrodt et al., 2017). This tutorial covers all experimental steps from preregistration to results reporting. When several options are available for a specific step, we detail one in the manuscript and the other possibilities in the supplementary materials. We provide a functional example of the procedure on the OSF (Open Science Framework) based on an emotional Stroop experiment. In order to describe the procedure, we follow the idea of Rouder (2016), who took the perspective of his dog Kirby to explain Git and GitHub. Here we take the perspective of Lisa Loud,[7] who loves carrying out experiments but has probably never used automated sequential analysis before.

Prerequisites

• Lisa needs to have an Open Science Framework (OSF, https://osf.io/register/) account.
• Lisa needs to have at least some basic knowledge of how to use the OSF. See Soderberg (2018) and https://help.osf.io/ for a practical introduction.
• Lisa needs to have R (R Core Team, 2019) and RStudio (https://www.rstudio.com/) installed on her computer.
• Lisa needs to have a recent version of OpenSesame (https://osdoc.cogsci.nl/) or PsychoPy (https://www.psychopy.org/) installed on her computer.

"Born-open" data

The aim of this tutorial is not to discuss the theoretical work necessary before carrying out an experiment. We will directly focus on the preparation of the materials needed to collect data. This tutorial mainly deals with computerised experiments.[8] In this situation, users need to program an experiment to collect data. The logic proposed here is compatible with experiments programmed with PsychoPy (Peirce, 2007, 2008; Peirce et al., 2019) or OpenSesame (Mathôt et al., 2012). These software programs have the advantage of being free and able to communicate with the OSF. These two qualities are crucial for transparent procedures and easy sharing. We will use the possibility of linking the software with the OSF in order to propose an intuitive "born-open" data procedure (Rouder, 2016). Although OSF synchronisation tools are generally easy to install on Windows and macOS, it can be slightly more complicated on other operating systems (OS) such as Linux. Most users probably work on Windows and macOS. However, because Linux OSs are free and open, we believe that they fit well with the open science philosophy. Consequently, we will provide minimal examples of how to use this tutorial on Ubuntu, a popular and easy-to-use Linux distribution. We propose a procedure using OpenSesame below. Procedures with PsychoPy can be found in the supplementary materials. OpenSesame probably allows the simplest synchronisation with the OSF. However, it is less flexible than PsychoPy for programming experiments because, unlike PsychoPy, it does not provide access to a coder view. Direct OSF synchronisation is available with PsychoPy 2 but not with PsychoPy 3. The example available on the OSF is based on PsychoPy 2 but has also been successfully tested with OpenSesame.
Table 2 offers a summary of the main strengths and weaknesses of each method.

Table 2
Global overview of the characteristics and synchronisation handling of open experiment-programming software.

                                          OpenSesame   PsychoPy 2   PsychoPy 3
Synchronisation                           OSF          OSF          Pavlovia and GitLab*
Automatic synchronisation ease            +
Graphical interface for synchronisation   ++           ++
Synchronisation quality                   ++           +            +++
Software flexibility                      ++           ++

* Indirect synchronisation with the OSF is possible because synchronisation is possible between the OSF and GitLab. Direct synchronisation with the OSF could also be possible, but probably not straightforward.

[7] https://theloudhouse.fandom.com/wiki/lisa_loud
[8] However, some parts of the proposed protocol can be adapted to other kinds of experiments.

Programming the experiment for "born-open" data. We present how to program an experiment allowing automatic "born-open" data with OpenSesame, currently the easiest way to program automatic "born-open" data. For more flexibility in experiment programming, PsychoPy options are described in the supplementary materials. OpenSesame is "a graphical experiment builder for the social sciences". It is free, open-source, and cross-platform. It features "a comprehensive and intuitive graphical user interface and supports Python scripting for complex tasks" (Mathôt et al., 2012). OSF integration is normally included by default for Windows and macOS. If not (or for other OSs such as Ubuntu), the installation procedure is described in the supplementary materials. If OSF integration is installed, the user should see the OSF icon in OpenSesame (Figure 1). Click the OSF log-in button and sign in with your OSF account. More details on OSF integration in OpenSesame can be found at https://osdoc.cogsci.nl/3.1/manual/osf/.
If Lisa reads this part of the manual, she will know exactly what to do in order to link data to the OSF (the following part is a reformulation of the manual for the purpose of our tutorial). If she links data to the OSF, each time data has been collected (normally after every experimental session), this data is also uploaded to the OSF. Lisa should follow these steps in order to do so:

• Lisa has to save the experiment on her computer.
• She then has to open the OSF explorer, right-click on the folder that she wants the data to be uploaded to, and select "Sync data to this folder". The OSF node that the data is linked to will be shown at the top of the explorer.
• She finally needs to check "Always upload collected data", and data files will be automatically saved to the OSF after they have been collected.

Script preparation and piloting

Performing sequential analyses requires strict experiment programming and data analysis preparation. Because data is continuously analysed during data collection, everything must be ready before collecting data.

Figure 1. OSF log-in button in the main toolbar in OpenSesame.

This can be seen as a disadvantage because there is a lot of work to be done before data collection. Indeed, analysing data after data collection allows us to delay several choices, and thus to launch data collection more quickly (we do not mean that preparation is specific to sequential testing: preparation can be recommended for all designs, but it is unavoidable in sequential analysis). However, this can also be considered a huge advantage because there are no unexpected surprises after data collection. It is not possible to discover that something has not been recorded, that the data is not in the appropriate format, or that the data is more difficult to analyse than expected. Everything is thought out upstream, because everything must work for the data analysis performed during data collection. We propose this six-step procedure to prepare the analysis:

• Clearly define the variables involved in the study in order to program the experiment
• Carefully consider how to analyse the data to be collected
• Program the experiment in keeping with the planned data analysis
• Test the experiment to check that everything is working as expected
• Prepare the scripts that will be used to analyse the data
• Run some pilots to test the procedure

These steps are necessary in order to launch the actual experiment. Otherwise, sequential analyses are likely to fail at some point.

Preregistration

Let's assume Lisa has successfully completed script preparation and piloting. When everything is ready, she has everything she needs to preregister her study. Preregistration is very important in sequential testing because it forces Lisa to explicitly state her statistical criteria of interest and therefore to announce when data collection will end. This is very useful in order to limit biases due to Lisa's degrees of freedom (Lakens, 2014; Wicherts et al., 2016). In addition to generic preregistration, the first important thing to preregister is the sequential data cleaning procedure. In this part, Lisa should describe how data will be handled before inferential statistical modelling. This can include physiological signal processing, potential observation or participant removal criteria, or any other data manipulation happening between data collection and modelling. After that, Lisa should indicate the appropriate stopping statistic depending on the procedure (e.g., SBF, HDI+ROPE, precision). The stopping statistic should be described along with a clear and detailed description of the model which is computed to obtain this statistic. Finally, Lisa should also specify a minimum sample size required to compute the models, and a maximum affordable sample size, which would determine the end of data collection independently of the stopping statistic.
Transparent data collection

When possible, making data openly available can improve the quality of science (Klein et al., 2018). In the context of sequential data analysis, it can be even more important. As we explained above, sequential analyses offer strong advantages but also increase Lisa's degrees of freedom. Making data open can reduce this bias. In this perspective, "born-open" data (Rouder, 2016) can be even more efficient, especially in the context of sequential analysis. "Born-open" data is the procedure which makes data automatically open as soon as it is collected. This procedure has at least two major benefits for sequential analysis. First, the time course of data collection is transparent, because data is necessarily sent online and time-stamped just after being collected. This is very important because, in doing so, each choice made by Lisa will be clear and justified. Second, "born-open" data facilitates real-time online data analysis, which is very useful for sequential designs. Automatic "born-open" data is possible with OpenSesame or PsychoPy by automatically publishing data on the OSF (or on GitHub or GitLab, for instance).

Automated data cleaning

After collecting her first batch of data, Lisa might not be able to directly fit the statistical model she is interested in. She probably needs a data cleaning procedure in order to get her data ready for statistical inference. Data cleaning can include physiological signal processing, artefact removal, error removal, dealing with potential outliers, and everything needed to get meaningful data from raw data. Lisa has to perform data cleaning before statistical modelling. Because Lisa is analysing data sequentially, she also has to clean data sequentially. For instance, in our example (the emotional Stroop task), we decided to remove missing data and response times (RT) below 100 ms. This is done each time new data is incorporated (see lines 101 to 110 of the sequential_analyses.r script).
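As an illustration of this sequential cleaning step, here is a minimal sketch in Python (the article's own implementation lives in the sequential_analyses.r R script); the trial representation and the "rt" field name are assumptions made for this sketch.

```python
def clean_batch(trials, rt_floor=100):
    """Sequential data-cleaning step applied to every new batch:
    drop trials with a missing response time and trials whose RT
    falls below the preregistered floor (100 ms in the article's
    emotional Stroop example)."""
    return [t for t in trials if t.get("rt") is not None and t["rt"] >= rt_floor]

# Applied each time newly collected data is incorporated:
new_batch = [{"rt": 452}, {"rt": 87}, {"rt": None}, {"rt": 615}]
print(clean_batch(new_batch))  # → [{'rt': 452}, {'rt': 615}]
```

Because the rule is fixed and preregistered, the same function can be applied blindly to every incoming batch, without any analyst judgement entering the process.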
We could also have chosen to analyse RTs only for correct responses and/or to remove observations based on a specific descriptive statistic.

Automated blind data analysis

Because Lisa has prepared everything needed for her procedure, she is able to automate data analysis and therefore to remain blind to the details of the analysis while she is collecting data. Here is how she can proceed, and how we proceeded in our example. In this paragraph, we describe how Lisa can handle task scheduling on a Unix system (macOS and Linux). Lisa can use the cronR package (Wijffels, 2018a) to schedule tasks in R. This package will be useful in order to retrieve data from the OSF and to analyse it. Lisa will have to create a task by running the appropriate script (main_script.r in our example) once before collecting the first observation. She will be able to decide how often she wants the script to run automatically. Lisa could also apply the same procedure on a Windows system thanks to the taskscheduleR package (Wijffels, 2018b). A short description of how to use this package can be found at https://cran.r-project.org/web/packages/taskscheduleR/vignettes/taskscheduleR.html. If Lisa chooses that the script should auto-run each hour, the main script will check data from the OSF each hour. This is done with the osfr package (Wolen & Hartgerink, 2019). In our example (lines 106 to 143 in main_script.r), we check whether new data is available on the OSF and download it if that is the case. After some data formatting (lines 145 to 197), the script runs the sequential analysis (lines 199 to 243). The main script calls the sequential_analyses.r script, which contains the sequential analysis function. Lisa will also have to specify some parameters depending on the model she is interested in. At the end of the analysis, Lisa will receive an email telling her whether to stop or to continue data collection.
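The core of the blinding is that the scheduled script reduces the interim result to a bare stop/continue message. The following is a minimal sketch of such a decision rule, written in Python for illustration (the article's procedure is implemented in R and emails the decision via gmailr); the symmetric Bayes factor bound of 10 and the sample-size limits are assumptions of this sketch and would in practice be whatever Lisa preregistered.

```python
def blinded_decision(bf10, n, bf_bound=10, n_min=20, n_max=200):
    """Reduce the current interim result to a blinded message.
    The returned string reveals neither the Bayes factor's value nor
    the direction of the effect, only whether a preregistered stopping
    criterion has been met."""
    if n < n_min:
        return "continue"                      # below the minimum sample size
    if bf10 >= bf_bound or bf10 <= 1 / bf_bound:
        return "stop"                          # an evidence bound was crossed
    if n >= n_max:
        return "stop"                          # maximum affordable sample size
    return "continue"

print(blinded_decision(bf10=14.2, n=85))   # → stop
print(blinded_decision(bf10=2.3, n=85))    # → continue
```

Only this string is emailed to Lisa, so she learns when to stop collecting data while remaining blind to what the data say.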
This email will report neither the effect nor its direction, only the information to stop or to continue data collection. This means that Lisa will be able to follow a sequential design without any information about the results of the data analysis, except the fact that she has (or has not) reached her criterion. The email will be sent automatically from R thanks to the gmailr package (Hester, 2016).

Reproducible reporting

Lisa will stop data collection when she reaches her statistical criterion or her maximum affordable sample size. She will then be able to write a report describing her results. She can do this using R Markdown (Allaire et al., 2019; Xie et al., 2018) in order to incorporate her results automatically from her R scripts into her report (e.g., see Bauer, 2018). Lisa could for instance use the R package papaja (Aust & Barth, 2018), which would allow her to write a reproducible APA manuscript with R Markdown. Scientific writing with R Markdown has important benefits for any type of research design, but is even more valuable for sequential analyses.

Feasibility of the proposed protocol and limitations

A graphical summary of the procedure is depicted in Figure 2.[11] All the tools needed to set it up are available for free to everyone. Using OpenSesame or PsychoPy is relatively straightforward and does not necessarily require coding skills. We proposed standard R scripts in order to automate sequential analysis. However, we concede that using these scripts requires minimal knowledge of R. Hence, we also propose a Shiny application, available at https://barelysignificant.shinyapps.io/blind_sequential_analyses/. With this application, Lisa would just have to specify all the important details of her analysis in boxes, from which the application generates the corresponding R script, almost ready for sequential analysis. This application is meant to facilitate the creation of R scripts.
It automatically writes around 90% of the code Lisa would have to write to use such a sequential analysis procedure. However, it is almost certain that the produced R code would not work immediately. It would require some minor tweaking from her, such as checking the local path, making sure that the scripts and the data are in the same repository, adapting the data import step to the specific properties of the data under consideration, and so on. The procedure proposed here only requires one computer; hence, the implementation cost is rather low. This procedure is also well suited for multi-lab studies. Indeed, an experiment can be run at different places, but the data is automatically centralised on one platform and can be analysed automatically and sequentially by one single computer. The experiment we designed with PsychoPy 2 (see the supplementary materials for more information) is specifically designed for multi-lab studies: it automatically records information about the computer which runs the experiment, which enables the identification of the site associated with each participant. We concede that the automation of data analysis forgoes one interesting advantage of sequential testing, namely the fact that data collection can be stopped whenever the behaviour of the data is unexpected, allowing the experimenter to rethink the experimental design or aim before collecting more data (Lakens, 2014). Depending on their confidence and expected familiarity with the data to be collected, researchers have to choose between automated or "two-person" analysis blinding. The first option has low implementation costs, whereas the second one is more flexible. In any case, after performing a sequential analysis, nothing prevents Lisa from performing additional analyses based on unexpected data specificities, taking care to record and state the exploratory nature of any such analyses.
A word on blind analysis by multiple people

If Lisa can afford to work with a colleague on her study, and if she prefers to do so, we advise her to apply the logic of the automated procedure described in this tutorial. The only difference will be that her colleague will analyse the data while Lisa collects it (or conversely). If her colleague analyses the data, they will have to retrieve the data online (e.g., on the OSF) and to perform the planned, preregistered analysis (unless the data behave very unexpectedly, in which case they will have the responsibility to adapt the analysis or stop data collection prematurely). The only contact Lisa and her colleague will have concerning the experiment will be the email sent to inform Lisa whether to stop or continue data collection, nothing more.

[11] This figure was inspired by Figure 1 in Quintana et al. (2016).

Figure 2. Schematic procedure of a transparent and blind sequential data analysis.

Conclusions

We began by presenting the intra- and interpersonal biases that might emerge during sequential testing and sequential analysis procedures. To tackle these issues, we proposed a novel automated, transparent, reproducible and blind protocol for sequential analysis.
The main interest of this procedure is to reduce possible biases that could be encountered during intermediate data analysis, and to prevent the inflation of social influences during data collection. This protocol should be considered a proof-of-concept for sequential analysis automation; future work will be able to propose more comprehensive and more user-friendly solutions for sequential analyses. For instance, the reliance on the user's R programming skills might be alleviated with the development of an online platform that would automate the procedure online, without the need for coding. More work is also needed to precisely quantify intra- and interpersonal biases during data collection and analysis. For instance, one could set up experimental procedures to pinpoint these biases in realistic lab situations (e.g., see Gilder & Heerey, 2018). In addition to experimental procedures, computational modelling could also be used to estimate the presence of bias in extant (published or not) sequential analyses. By formalising the sequential analysis procedure (e.g., using an evidence accumulation model) and by explicitly modelling the biases that we describe in the present article, we might be able to assess the likelihood that an observed set of collected statistics (e.g., BFs) has been obtained under the assumption of bias (or no bias).

Supplementary materials

Reproducible code and supplementary materials are available on the OSF: https://osf.io/mwtvk/.

Author contact

Correspondence concerning this article should be addressed to Brice Beffara Bret, Université de Nantes, Nantes, France. E-mail: brice.beffara@univ-nantes.fr.
ORCID:
Brice Beffara Bret: https://orcid.org/0000-0002-0586-6650
Amélie Beffara Bret: https://orcid.org/0000-0002-9129-0415
Ladislas Nalborczyk: https://orcid.org/0000-0002-7419-9855

We thank Hans IJzerman for helpful comments and Edward Collett for his help with English proofreading of a previous version of this manuscript.
Many thanks to Christopher Moulin for his feedback and for his help with English proofreading of the updated version of this manuscript. We also thank Rickard Carlsson, Félix Schönbrodt, Angelika Stefan, Stephen Martin, and Charlotte Brand for insightful comments and suggestions during the peer-review process.

Conflict of interest and funding

The authors report no conflict of interest. We thank the Université de Nantes, the Université Grenoble Alpes, and the Université de Franche-Comté for supporting our research. The author(s) received no specific financial support for the research, authorship, and/or publication of this article.

Author contributions

Following the CRediT – Contributor Roles Taxonomy (https://casrai.org/credit/):
Conceptualisation: BBB (lead), ABB, LN
Data curation: BBB, LN
Formal analysis: LN
Investigation: BBB, ABB
Methodology: BBB, ABB, LN
Project administration: BBB, ABB, LN
Resources: BBB, ABB, LN
Software: BBB (OpenSesame, PsychoPy, Python), LN (R)
Validation: BBB, ABB, LN
Visualisation: BBB, ABB, LN
Writing: BBB, ABB, LN

Authorship order was determined as follows (by order of significance): first author, original idea and major contribution to the "software" role (inter alia); last author, major contribution to the "software" role (inter alia); second author, major contribution to the "investigation" role (inter alia).

Open science practices

This article earned the Open Data and the Open Materials badges for making the data and materials openly available. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.
References

Allaire, J., Xie, Y., McPherson, J., Luraschi, J., Ushey, K., Atkins, A., Wickham, H., Cheng, J., Chang, W., & Iannone, R. (2019). rmarkdown: Dynamic documents for R [R package version 1.12]. https://rmarkdown.rstudio.com
Aust, F., & Barth, M. (2018). papaja: Create APA manuscripts with R Markdown [R package version 0.1.0.9842]. https://github.com/crsh/papaja
Babbage, C. (1830). Reflections on the decline of science in England, and on some of its causes. B. Fellowes [etc.].
Barber, T. X. (1978). Expecting expectancy effects: Biased data analyses and failure to exclude alternative interpretations in experimenter expectancy research. Behavioral and Brain Sciences, 1(3), 388. https://doi.org/10.1017/s0140525x00075531
Bauer, P. C. (2018). Writing a reproducible paper in R Markdown. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3175518
De Groot, A. (2014). The meaning of "significance" for different types of research [Translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas]. Acta Psychologica, 148, 188–194. https://doi.org/10.1016/j.actpsy.2014.02.001
Fidler, F., & Wilcox, J. (2018). Reproducibility of scientific results. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Winter 2018). Metaphysics Research Lab, Stanford University.
Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642
Gilder, T. S. E., & Heerey, E. A. (2018). The role of experimenter belief in social priming. Psychological Science, 29(3), 403–417. https://doi.org/10.1177/0956797617737128
Goodman, S. N., Fanelli, D., & Ioannidis, J. P. A. (2016). What does research reproducibility mean? Science Translational Medicine, 8(341), 341ps12. https://doi.org/10.1126/scitranslmed.aaf5027
Hansen, W. B., & Collins, L. M. (1994). Seven ways to increase power without increasing N. NIDA Research Monograph, 142, 184–195. https://doi.org/10.1037/e495862006-008
Hazelrigg, P. J., Cooper, H., & Strathman, A. J. (1991). Personality moderators of the experimenter expectancy effect: A reexamination of five hypotheses. Personality and Social Psychology Bulletin, 17(5), 569–579. https://doi.org/10.1177/0146167291175012
Hester, J. (2016). gmailr: Access the Gmail RESTful API [R package version 0.7.1]. https://cran.r-project.org/package=gmailr
Klein, O., Doyen, S., Leys, C., Magalhães de Saldanha da Gama, P. A., Miller, S., Questienne, L., & Cleeremans, A. (2012). Low hopes, high expectations: Expectancy effects and the replicability of behavioral experiments. Perspectives on Psychological Science, 7(6), 572–584. https://doi.org/10.1177/1745691612463704
Klein, O., Hardwicke, T. E., Aust, F., Breuer, J., Danielsson, H., Hofelich Mohr, A., IJzerman, H., Nilsonne, G., Vanpaemel, W., & Frank, M. C. (2018). A practical guide for transparency in psychological science. Collabra: Psychology, 4(1), 20. https://doi.org/10.1525/collabra.158
Kruschke, J. K. (2015). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd ed.). Academic Press.
Kruschke, J. K. (2018). Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science, 1(2), 270–280. https://doi.org/10.1177/2515245918771304
Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1), 178–206. https://doi.org/10.3758/s13423-016-1221-4
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710. https://doi.org/10.1002/ejsp.2023
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
MacCoun, R. J., & Perlmutter, S. (2017). Blind analysis as a correction for confirmatory bias in physics and in psychology. In S. O. Lilienfeld & I. D. Waldman (Eds.), Psychological science under scrutiny (pp. 295–322). John Wiley & Sons, Inc. https://doi.org/10.1002/9781119095910.ch15
Mathôt, S., Schreij, D., & Theeuwes, J. (2012). OpenSesame: An open-source, graphical experiment builder for the social sciences. Behavior Research Methods, 44(2), 314–324. https://doi.org/10.3758/s13428-011-0168-7
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141. https://doi.org/10.1207/s15327965pli0102_1
Miller, L. E., & Stewart, M. E. (2011). The blind leading the blind: Use and misuse of blinding in randomized controlled trials. Contemporary Clinical Trials, 32(2), 240–243. https://doi.org/10.1016/j.cct.2010.11.004
Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103–123. https://doi.org/10.3758/s13423-015-0947-8
Nalborczyk, L., Bürkner, P.-C., & Williams, D. R. (2019). Pragmatism should not be a substitute for statistical literacy: A commentary on Albers, Kiers, and van Ravenzwaaij (2018). Collabra: Psychology, 5(1). https://doi.org/10.1525/collabra.197
Nelson, L. D., Simmons, J., & Simonsohn, U. (2018). Psychology's renaissance. Annual Review of Psychology, 69(1), 511–534. https://doi.org/10.1146/annurev-psych-122216-011836
Orne, M. T. (1962). On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications. American Psychologist, 17(11), 776–783. https://doi.org/10.1037/h0043424
Peirce, J. W. (2007). PsychoPy—Psychophysics software in Python. Journal of Neuroscience Methods, 162(1–2), 8–13. https://doi.org/10.1016/j.jneumeth.2006.11.017
Peirce, J. W. (2008). Generating stimuli for neuroscience using PsychoPy. Frontiers in Neuroinformatics, 2. https://doi.org/10.3389/neuro.11.010.2008
Peirce, J. W., Gray, J. R., Simpson, S., MacAskill, M., Höchenberger, R., Sogo, H., Kastman, E., & Lindeløv, J. K. (2019). PsychoPy2: Experiments in behavior made easy. Behavior Research Methods, 51(1), 195–203. https://doi.org/10.3758/s13428-018-01193-y
Quintana, D., Alvares, G., & Heathers, J. (2016). Guidelines for Reporting Articles on Psychiatry and Heart rate variability (GRAPH): Recommendations to advance research communication. Translational Psychiatry, 6(5), e803. https://doi.org/10.1038/tp.2016.73
R Core Team. (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. https://www.r-project.org/
Rosenthal, R. (1963). On the social psychology of the psychological experiment: The experimenter's hypothesis as unintended determinant of experimental results. American Scientist, 51, 268–283.
Rosenthal, R. (1964). Experimenter outcome-orientation and the results of the psychological experiment. Psychological Bulletin, 61(6), 405–412. https://doi.org/10.1037/h0045850
Rosenthal, R., & Rubin, D. B. (1978). Interpersonal expectancy effects: The first 345 studies. Behavioral and Brain Sciences, 1(3), 377. https://doi.org/10.1017/s0140525x00075506
Rouder, J. N. (2016). The what, why, and how of born-open data. Behavior Research Methods, 48(3), 1062–1069. https://doi.org/10.3758/s13428-015-0630-z
Rouder, J. N., Haaf, J. M., & Snyder, H. K. (2018). Minimizing mistakes in psychological science. PsyArXiv. https://doi.org/10.31234/osf.io/gxcy5
Schönbrodt, F. D., & Wagenmakers, E.-J. (2018). Bayes factor design analysis: Planning for compelling evidence. Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-017-1230-y
Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods, 22(2), 322–339. https://doi.org/10.1037/met0000061
Schulz, K. F., & Grimes, D. A. (2002). Blinding in randomised trials: Hiding who got what. The Lancet, 359(9307), 696–700. https://doi.org/10.1016/s0140-6736(02)07816-9
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Smaldino, P., Turner, M. A., & Contreras Kallens, P. A. (2019). Open science and modified funding lotteries can impede the natural selection of bad science. https://doi.org/10.31219/osf.io/zvkwq
Soderberg, C. K. (2018). Using OSF to share data: A step-by-step guide. Advances in Methods and Practices in Psychological Science, 1(1), 115–120. https://doi.org/10.1177/2515245918757689
Stefan, A. M., Gronau, Q. F., Schönbrodt, F. D., & Wagenmakers, E.-J. (2019). A tutorial on Bayes factor design analysis using an informed prior. Behavior Research Methods, 51(3), 1042–1058.
https://doi.org/10.3758/s13428-018-01189-8 vuorre, m., & curley, j. p. (2018). curating research assets: a tutorial on the git version control system. advances in methods and practices in psychological science, 1(2), 219–236. https://doi. org/10.1177/2515245918754826 wagenmakers, e.-j., marsman, m., jamil, t., ly, a., verhagen, j., love, j., selker, r., gronau, q. f., šmíra, m., epskamp, s., matzke, d., rouder, j. n., & morey, r. d. (2017). bayesian inference for psychology. part i: theoretical advantages and practical ramifications. psychonomic bulletin & review, 25(1), 35–57. https : / / doi . org/10.3758/s13423-017-1343-3 wicherts, j. m., veldkamp, c. l. s., augusteijn, h. e. m., bakker, m., van aert, r. c. m., & van assen, m. a. l. m. (2016). degrees of freedom in planning, running, analyzing, and reporting psychological studies: a checklist to avoid phacking. frontiers in psychology, 7. https://doi. org/10.3389/fpsyg.2016.01832 wijffels, j. (2018a). cronr: schedule r scripts and processes with the ’cron’ job scheduler [r package version 0.4.0]. https : / / cran . r project . org / package=cronr wijffels, j. (2018b). taskscheduler: schedule r scripts and processes with the windows task scheduler [r package version 1.4.0]. https : / / cran . r project.org/package=taskscheduler wolen, a., & hartgerink, c. (2019). osfr: r interface to osf [http://centerforopenscience.github.io/osfr, https://github.com/centerforopenscience/osfr]. xie, y., allaire, j., & grolemund, g. (2018). r markdown: the definitive guide [isbn 9781138359338]. chapman; hall/crc. https: //bookdown.org/yihui/rmarkdown yarkoni, t., eckles, d., heathers, j., levenstein, m., smaldino, p., & lane, j. i. (2019). enhancing and accelerating social science via automation: challenges and opportunities. https://doi.org/ 10.31235/osf.io/vncwe zoble, e. j., & lehman, r. s. (1969). interaction of subject and experimenter expectancy effects in a tone length discrimination task. 
Behavioral Science, 14(5), 357–363. https://doi.org/10.1002/bs.3830140503

Contents:
On reproducibility
A brief introduction to sequential analyses procedures
Sequential Bayes factor
The sequential HDI+ROPE procedure
Aiming for precision
Some difficulties
What could possibly go wrong?
Intrapersonal biases during sequential analyses
Interpersonal biases during sequential analyses
What could be expected?
A fully automated, transparent, reproducible and triple-blind protocol for sequential testing
Prerequisites
"Born-open" data
Programming the experiment for "born-open" data
Script preparation and piloting
Preregistration
Transparent data collection
Automated data cleaning
Automated blind data analysis
Reproducible reporting
Feasibility of the proposed protocol and limitations
A word on blind analysis by multiple people
Conclusions
Supplementary materials
Author contact
Conflict of interest and funding
Author contributions
Open science practices

Meta-Psychology, 2019, Vol 3, MP.2018.1592, https://doi.org/10.15626/mp.2018.1592
Article type: Original article
Published under the CC-BY4.0 license
Open data: Yes
Open materials: Not relevant
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: Not relevant
Edited by: Moritz Heene
Reviewed by: Felix Naumann, Sven Hilbert
Analysis reproduced by: Rickard Carlsson
All supplementary files can be accessed at the OSF project page: https://doi.org/10.17605/osf.io/dzn3s

Dealing with Distributional Assumptions in Preregistered Research

Matt N. Williams, Massey University, New Zealand
Casper J. Albers, University of Groningen, Netherlands

Virtually any inferential statistical analysis relies on distributional assumptions of some kind. The violation of distributional assumptions can result in consequences ranging from small changes to error rates through to substantially biased estimates and parameters fundamentally losing their intended interpretations. Conventionally, researchers have conducted assumption checks after collecting data, and then changed the primary analysis technique if violations of distributional assumptions are observed.
An approach to dealing with distributional assumptions that requires decisions to be made contingent on observed data is problematic, however, in preregistered research, where researchers attempt to specify all important analysis decisions prior to collecting data. Limited methodological advice is currently available regarding how to deal with the prospect of distributional assumption violations in preregistered research. In this article, we examine several strategies that researchers could use in preregistrations to reduce the potential impact of distributional assumption violations. We suggest that pre-emptively selecting analysis methods that are as robust as possible to assumption violations, performing planned robustness analyses, and/or supplementing preregistered confirmatory analyses with exploratory checks of distributional assumptions may all be useful strategies. On the other hand, we suggest that prespecifying "decision trees" for selecting data analysis methods based on the distributional characteristics of the data may not be practical in most situations.

Keywords: preregistrations, distributional assumptions, open science, transparency.

We thank Moritz Heene as editor and reviewers Felix Naumann and Sven Hilbert for valuable feedback on an earlier version of this article. We also thank Tobias Mühlmeister for assisting in the copyediting of our manuscript. Correspondence regarding this article can be addressed to Matt N. Williams, School of Psychology, Massey University, Private Bag 102903 North Shore, Auckland 0745, New Zealand. Email: m.n.williams@massey.ac.nz

Virtually any inferential statistical method relies on some set of distributional assumptions[1] in order to produce valid inferences.
For example, a regression model estimated via ordinary least squares relies on the assumptions that the predictors are measured without error, that any measurement error in the response variable is uncorrelated with the predictors, and that the error terms are independent, identically and normally distributed with a mean of zero for all values of the predictors[2] (Williams, Grajales, & Kurkiewicz, 2013). Even nonparametric methods have assumptions, albeit not with respect to the specific probability distribution of particular variables. For example, if a Mann-Whitney test is used to test the equality of the medians of two populations on some variable, one must assume that the distribution of the variable has the same shape and spread within each of the populations (Fagerland & Sandvik, 2009). Distributional assumptions are a common source of misconceptions (Ernst & Albers, 2017), but even when assumptions are correctly identified and investigated, several issues can arise. One of these will be discussed in this paper. Breaches of distributional assumptions can cause problems for inference, including biased estimates, artificially narrow (or broad) confidence intervals, and increases in Type I and/or Type II error rates (Ernst & Albers, 2017; Williams et al., 2013). The severity of these problems varies depending on the analysis method, the sample size, the nature of the assumption breach, and on whether one or more assumptions are violated simultaneously.
The consequences of an assumption breach can vary from a minor change in Type I error rates and confidence interval coverage, through to biased parameter estimates, right through to parameter estimates fundamentally losing their intended interpretation.

[1] By "distributional assumption" we mean an assumption with respect to the univariate, bivariate, or multivariate distribution of variables and/or error terms—e.g., that the relationship between two variables is linear, or that the variances of a set of error terms are identical, or that the distribution of a response variable is negative binomial conditional on a set of predictor values. Such assumptions are also sometimes referred to as "statistical assumptions" or just "assumptions". We use the modifier "distributional" simply to differentiate such assumptions from purely non-statistical assumptions (e.g., assumptions about ontology or epistemology).

[2] The assumption that the error terms all have mean zero for any combination of values of the predictor variables implies that the independent effects of the predictor variables included in the model on the response variable are additive and linear. Indeed, some presentations of the assumptions of multiple regression (e.g., Gelman & Hill, 2007) replace a description of this assumption with a description of the assumption of "additivity and linearity". Note that the predictor variables included in a regression model may include transformations of the original variables (e.g., polynomial terms), which provides some capacity to specify nonlinear effects between the original variables in a dataset and the response variable.
For example, in a simple linear regression model estimated using ordinary least squares, a breach of the assumption that the error terms are normally distributed will not cause biased or inconsistent estimates or harm the interpretability of the parameters, but only affect confidence interval coverage and Type I error rates, and even these effects can be mild (see Gelman & Hill, 2007; Lumley, Diehr, Emerson, & Chen, 2002; Meuleman, Loosveldt, & Emonds, 2015; Williams et al., 2013). On the other hand, if the assumption of a linear relationship between the predictor variable and the response variable is breached, the slope loses its interpretability as a measure of the dependency between the predictor and response variables (see Meuleman et al., 2015): a measure of linear relationship is of little value if the true relationship is not linear. Many methodological textbooks and other resources offer researchers advice on how to detect and respond to distributional assumption violations. There are many methods for detecting distributional assumption problems, including both graphical approaches and inferential tests. For example, a researcher interested in whether the error terms in her regression model are normally distributed could evaluate this assumption using visual methods, such as a Q-Q plot, a formal statistical test such as the Shapiro-Wilk or Kolmogorov-Smirnov test, or by evaluating skewness and kurtosis statistics. Likewise, the potential responses available for dealing with a distributional problem are legion, including transformations, deletion of outliers, trimming of samples, alternative estimation algorithms, randomisation-based tests, rank-based nonparametric statistics, and many others.
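To make the last of these detection options concrete, the moment-based skewness and excess kurtosis of a set of residuals can be computed directly. The sketch below is a hypothetical illustration in Python (not code from the article), using simple population-moment estimators and ignoring small-sample bias corrections:

```python
def skewness(xs):
    # Third standardized moment: 0 for a perfectly symmetric sample,
    # positive for right skew, negative for left skew.
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

def excess_kurtosis(xs):
    # Fourth standardized moment minus 3: 0 for a normal distribution,
    # positive for heavy tails, negative for light tails.
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum((x - m) ** 4 for x in xs) / (n * s ** 4) - 3

# Illustrative residuals with one large outlier (invented data).
residuals = [-1.9, -1.1, -0.4, -0.2, 0.1, 0.3, 0.5, 1.0, 1.2, 4.5]
print(skewness(residuals), excess_kurtosis(residuals))
```

Large absolute values would flag a potential normality problem; but, as the discussion of decision rules below makes clear, translating such statistics into a hard preregistered cutoff is less straightforward than it may appear.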
Nevertheless, an important meta-strategy underlies the advice about dealing with distributional assumptions found in many methodological resources (see the discussion in Wells & Hintze, 2007): first, one should check for distributional problems, and then, if problems are detected, select a strategy to deal with the problems. We term this the "test then respond" meta-strategy for dealing with distributional assumption violations. The potential risks of the "test then respond" meta-strategy will be apparent to readers aware of the problems of the "garden of forking paths" (Gelman & Loken, 2014, p. 460) and "researcher degrees of freedom" (Simmons, Nelson, & Simonsohn, 2011, p. 1359). By applying different analysis strategies contingent on the characteristics of the observed data, a researcher may end up happening upon a statistically significant result in favour of a particular hypothesis that is itself contingent on an analysis decision made after observing the data. This will be especially problematic if the researcher is motivated to search for and selectively report statistically significant results ("p-hacking"). P-hacking can result in inflated Type I error rates and seriously biased estimates, and is thought to be one of the major causes of the current "replication crisis" in psychology and other sciences. Although p-hacking has especially negative effects, the making of analysis decisions contingent on observed data can still be problematic even if the researcher's intentions are entirely scrupulous. Specifically, the nominal error rates of analysis strategies are invariably derived based on repeated sampling with a fixed analysis method, and will not necessarily apply where the analysis method depends on the data.
For example, the common strategy of using a Student's t test to compare two means if a preliminary Levene's test fails to reject the null hypothesis of equal variances across the two groups, but using a Welch's t test if the Levene's test is significant, can result in Type I error rates that differ markedly from the nominal alpha level (Albers, Boon, & Kallenberg, 2000; Bancroft, 1964; Zimmerman, 2004).

Preregistration

One important strategy gaining popularity as a partial solution to problems with replicability and p-hacking is preregistration. In a landmark paper, Wagenmakers, Wetzels, Borsboom, van der Maas and Kievit (2012) argued that when research is intended to be confirmatory—i.e., to test hypotheses—a data collection and analysis plan should be preregistered in advance. Doing so reduces the capacity of researchers to exploit flexibility in their data collection and analysis procedures to produce positive (or statistically significant) findings. Preregistration is a crucial control strategy in an environment where researchers are incentivised to produce statistically significant and novel findings in order to achieve publications in high-impact journals. Online platforms for uploading and permanently timestamping preregistrations have since been developed (osf.io, aspredicted.org), and preregistered research studies have been increasingly frequent in the pages of psychology journals, especially within experimental social psychology; see, for example, the 67(1) special issue of the Journal of Experimental Social Psychology, which was entirely dedicated to preregistered research. Preregistration is useful for increasing the credibility of individual studies, but it can also help address the broader problem of publication bias—a problem wherein statistically significant findings are more likely to be published than non-significant ones (see Ferguson & Heene, 2012).
Publication bias can distort the mean effect sizes estimated in meta-analyses both because studies that produce non-significant findings are less likely to end up in the published literature, and because authors may respond to the existence of this bias by exploiting flexibility in data collection and analysis procedures to produce statistically significant findings. Preregistration can help to address publication bias both in the sense that the individual preregistered studies included in a meta-analysis may be less likely to have used methods that produce biased effect sizes (e.g., p-hacking), but also in the sense that researchers conducting meta-analyses can search for preregistrations that have not resulted in published outputs, and thus obtain some information about the size of the unpublished "file drawer". Preregistration is clearly valuable, but it is currently unclear how researchers should deal with distributional assumptions when performing preregistered research. The conventional "test then respond" approach to dealing with distributional assumptions—where the researcher selects a primary analysis strategy, performs exploratory assumption checks using a range of statistical and graphical measures, and then uses their judgment to determine whether a change in analysis strategy is needed—is clearly anathema to a preregistered approach to research. So how should researchers writing preregistrations deal with distributional assumptions? In this article we aim to provide concrete and practical advice to help researchers pre-emptively respond to the spectre of breaches of distributional assumptions when performing preregistered research.

Strategies for Dealing with Distributional Assumptions in Preregistrations

One strategy for dealing with distributional assumptions in preregistered research is to simply ignore the issue and preregister a primary analysis strategy without any conscious attention to distributional assumptions whatsoever.
This strategy does at least avoid the possibility of biased estimates caused by the researcher exploiting analytic flexibility to produce statistically significant findings. In this sense it may arguably be superior to a strategy of using exploratory strategies to diagnose distributional problems, running multiple analyses, and potentially allowing decisions about which analyses to report to be affected by whether or not they produce significant results. This said, selective reporting or p-hacking is obviously not the only cause of biased estimates or untrustworthy findings. Some distributional assumption violations can cause very serious inferential problems (consider, for example, the bias in estimates that can result when predictors in regression models are measured with error; Westfall & Yarkoni, 2016). As such, simply ignoring distributional assumptions in preregistered research is obviously not a complete solution. Alternatively, there are a variety of more sophisticated approaches that a researcher could make use of for dealing with distributional assumptions in the context of preregistered research. In the current section, we consider and discuss four potential strategies in turn. In discussing each strategy we will focus on how these strategies can be conducted and their consequences for the resulting analyses of empirical data, but we will also briefly touch on the implications of these strategies for analyses conducted before data collection—i.e., statistical power and sample size determination.

Strategy 1: Decision Tree

The first strategy is to outline a decision tree specifying what methods will be used to identify distributional assumption breaches, and which alternative methods will be applied if breaches are identified. For example, a researcher might specify that linear regression using ordinary least squares will be applied in order to test their hypothesis, and a Shapiro-Wilk test applied to the residuals.
If the Shapiro-Wilk test statistic is not significant, confidence intervals will be calculated via the usual Wald method; if it is significant (indicating non-normality), they will be calculated using a percentile bootstrap. Such a decision tree is effectively just a preregistered form of the test then respond meta-strategy, albeit with a commitment to pre-specified criteria for making decisions. The decision tree method is currently implicitly endorsed in the widely used template for preregistrations in social psychology (van 't Veer & Giner-Sorolla, 2016), which asks researchers to specify "assumptions of analyses, and plans for alternative/corrected analyses if each assumption is violated." However, we suggest that this strategy does have several problems.

Problem 1: Uncertainty involved in diagnosing distributional problems. The first problem is that determining whether a particular distributional assumption violation is present—and of a magnitude likely to harm inferences—can be difficult, and the conclusion of such an investigation may come with great uncertainty attached. Consider, for example, a researcher conducting a simple experiment with a continuous response variable and two conditions (treatment and control). Here, the primary research question might be simple: is the mean of the response variable higher for treated participants than control participants? The most obvious analysis strategy would also be simple: an independent-samples Student's t test. Yet the questions to be investigated in an examination of the distributional assumptions pertaining to this test would be much more complex.
For example, investigating the assumption of independent error terms means answering this question: over repeated sampling, would each of the n error distributions in this design be statistically independent of one another, and, if not, is the form and magnitude of this non-independence sufficient to cause substantive changes to the error rates of the t test in the main analysis? It should be obvious that this distributional question is more complex than the primary research question regarding the differences between two means. It is complex both because it refers to the relationships between a large matrix of error terms (rather than just two means), about which we must make inferences based on a single sample of residuals, and also because it is double-barrelled—the question is not just whether the error terms are statistically dependent, but whether this dependence in turn is likely to harmfully affect the resulting inferences in the main analysis. The net result would be that any investigation of the validity of this assumption is likely to come with substantial uncertainty attached, and as such relying on such an investigation to determine the final choice of primary analysis may be problematic.

Problem 2: Difficulty of formulating hard-and-fast rules. Relatedly, it is often difficult to formulate hard-and-fast rules for diagnosing particular distributional assumption breaches. Many conventional distributional assumption diagnostics require the researcher to subjectively interpret a plot (e.g., a Q-Q plot for diagnosing non-normality, an autocorrelation function plot for diagnosing correlated error terms, a scatterplot of predicted values vs. residuals for diagnosing heteroscedasticity or non-linearity, etc.). Using such graphical methods to detect assumption violations is explicitly recommended by the APA's statistical task force (Wilkinson & Task Force on Statistical Inference, 1999). In the context of preregistration, however, which is largely intended to remove flexibility in how data analysis is conducted, relying on the researcher to subjectively interpret a plot and then make a decision is problematic. Firmer statistical rules for diagnosing particular problems exist, of course. For example, one can statistically test a null hypothesis that a particular distributional assumption is valid. A Shapiro-Wilk test, for instance, can be used to test a null hypothesis that the error terms for a particular model are normally distributed. There is a paradox here, though, in the sense that statistical tests will generally have low power to detect distributional problems when the sample size is small, even though this is precisely the scenario in which distributional assumptions matter most. On the other hand, if the sample size is large, statistical tests of assumptions may be powerful, even though the large sample size means that the primary analysis may be robust to the detected assumption violation[3]. Of course, this is a general problem with statistical tests of assumptions, rather than one that solely applies in the context of preregistration or the decision tree strategy.

[3] This admittedly depends on the nature of the assumption violation. For example, a linear regression model estimated using ordinary least squares will be more and more robust to a violation of the assumption of normally distributed error terms as the sample size increases, due to the central limit theorem (provided that the other assumptions of the model hold). On the other hand, if measurement error is present in the predictor variables, increasing the sample size would not necessarily reduce the biasing effect of this measurement error on the estimates of the regression parameters.

Problem 3: Complexity of decision trees required.
The third problem is the fact that, even if it were possible to accurately and objectively diagnose distributional problems, there is a large number of distributional problems that may arise for any given analysis, and thus an exploding quantity of potential remedies. For example, a simple linear regression model can be afflicted by a wide variety of distributional problems, including measurement error of any sort in the predictor variable, measurement error in the response variable that is correlated with the predictor, non-linearity of the relationship between the predictor and the response variable, dependent error terms, heteroscedasticity of errors, or non-normality of errors. Often these problems occur in combination with one another. Each problem has an array of potential remedies: for example, heteroscedasticity might be dealt with by using Huber-White "sandwich" standard errors, by variance-stabilising transformations, or by bootstrapping (see Liu, 1988). Further checks may be necessary to check whether particular remedies "worked", those checks potentially implying the need for yet more decisions (did the variance-stabilising transformation produce homoscedasticity? If not, what potential remedy should be tried next?). Furthermore, the order in which the assumptions are checked—although arbitrary—can have an effect on the final model that is applied to the data. Setting out a decision tree that will select an appropriate analysis strategy for each of the various combinations of problems that may occur for a given analysis would thus be an extremely difficult task, and the prospect of doing so might discourage researchers considering using preregistration.

Problem 4: Effects on nominal error rates.
A final problem with the "decision tree" approach to diagnosing and responding to distributional assumption problems in preregistrations is the fact that, whatever the primary analysis technique ends up being, its nominal Type I and Type II error rates (e.g., the pre-set alpha and 1 − power) will typically be based on an assumption that the analysis technique was fixed in advance. Unless they perform simulation studies, researchers will not typically know what the applicable error rates are in the scenario of analysis decisions being made contingent on particular features of the data. As noted previously, these error rates may vary considerably from the nominal error rates of the individual techniques considered in isolation (e.g., Zimmerman, 2004). Specifying simulations to estimate error rates and power for the decision tree approach could often present significant challenges. Such an analysis would require a simulation in which multiple samples are simulated from a population in which there is a particular hypothesised effect size, the analysis method for each sample is decided according to the preregistered decision rule, and the data analysis then conducted. This alone may be challenging for many researchers to program, but an even more difficult challenge would be deciding what distributional characteristics or misspecifications to incorporate in the simulated data. The researcher is thus faced with the prospect of trying to predict the types of distributional problems that are likely to occur and to then simulate data that embodies these problems, which will inevitably require some complex (and relatively arbitrary) decision-making.

Strategy 1: Conclusion.
in summary, while the decision tree approach to dealing with distributional assumption breaches in preregistrations may be appropriate when wielded by expert researchers in some circumstances, we suggest that it has several important problems that make it unsuitable as a default strategy in preregistrations.

strategy 2: selecting a robust primary analysis strategy

given the need to make analysis decisions prior to any opportunity to check for distributional assumption problems, it may be useful to select primary analysis strategies that are as conservative as possible with respect to those assumptions. as stated earlier, virtually any inferential analysis will require distributional assumptions of some sort, but some analyses require stronger assumptions than others. some examples of analysis choices that may reduce the reliance on at least some distributional assumptions in commonly used statistical analyses include:

● using bootstrapping or permutation tests rather than normal theory to calculate confidence intervals and p values (thus obviating the need for normally distributed errors in linear models)
● using huber-white "sandwich" standard errors rather than assuming homoscedasticity in regression-like models (white, 1980)
● using so-called robust methods that are designed to rely less on distributional assumptions (wilcox, 2012).

beyond these familiar examples, the possibility of using bayesian models estimated using markov chain monte carlo (mcmc) means that researchers can quite readily estimate models where specific assumptions are loosened in specific ways.
for example, a bayesian regression model can be estimated in which the distribution of the error terms is modelled not with a normal distribution but rather, for example, a skew-normal distribution in which a skewness parameter is freely estimated (and in which the normal distribution is a special case; see azzalini, 1985). similarly, one can specify a bayesian regression model in which the variance of the errors is not necessarily constant across all the values of the predictor variables, but is instead some function of the values of the predictor variables (with constant variance again being a special case). bayesian models are by no means assumption-free, but do give the researcher the capacity to thoughtfully loosen specific distributional assumptions, and thus may be an attractive option in the context of preregistered research. excellent introductions to bayesian data analysis can be found in kruschke and liddell (2018) and etz and vandekerckhove (2018). the programming language stan (carpenter et al., 2017) provides a framework for implementing bayesian analyses that apply flexible sets of assumptions. in attempting to select a robust primary analysis method, researchers should carefully consider what distributional issues or problems are most likely to arise in the context of their specific study. a researcher conducting a study will often have some familiarity with the measurement instruments being used (whether these are survey scales, electromyography devices, counts of particular incidents, or reaction times), and the typical characteristics of the data produced by these instruments. researchers may thus be able to draw on their own contextual knowledge to anticipate the distributional issues that might arise, and pick a statistical analysis method that is robust to the most plausible problems.
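as a minimal, concrete illustration of the first of the robust options listed above (bootstrapping rather than normal theory), the following python sketch computes a percentile-bootstrap confidence interval for a difference in means. the data, sample sizes, and number of resamples are all arbitrary choices of ours for illustration, not taken from this paper.

```python
# sketch: percentile-bootstrap 95% confidence interval for a difference in
# means, avoiding the normality assumption of the normal-theory interval.
# the simulated skewed (exponential) data are purely illustrative.
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=1.0, size=30)  # skewed, clearly non-normal samples
y = rng.exponential(scale=1.5, size=30)

boot = np.empty(5000)
for i in range(5000):
    # resample each group with replacement, independently
    bx = rng.choice(x, size=x.size, replace=True)
    by = rng.choice(y, size=y.size, replace=True)
    boot[i] = bx.mean() - by.mean()

lo, hi = np.percentile(boot, [2.5, 97.5])  # 95% percentile interval
print(round(lo, 2), round(hi, 2))
```

the percentile interval is read directly off the empirical distribution of the resampled statistic, so no normality of errors is assumed; only independent sampling within groups.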
the approach of selecting a primary analysis method that makes distributional assumptions that are as weak as possible can certainly be useful in preregistrations. doing so avoids the need to make analytic decisions contingent on data. in comparison to the other strategies discussed in this section, this strategy also minimises the need for conducting multiple data analyses, thus streamlining the analysis process and resulting in a more concise write-up. this said, this strategy can also be applied in combination with the two strategies we will consider below. however, this strategy is not without its problems either. if all the assumptions of the linear model are met, then the standard parametric methods are uniformly most powerful. the benefit of not having distorted type i error rates if the assumptions are violated comes at the cost of structurally lower power, and thus a higher type ii error rate, if the assumptions actually hold true. admittedly, the extent of this problem differs depending on the planned test. for instance, the welch t test, which allows for unequal variances, has barely lower power than the standard student t test (delacre, lakens, & leys, 2017). power analysis may again be challenging when applying the robust primary analysis strategy, mainly in that there may be ambiguity in terms of the distributional characteristics that should be assumed when simulating data for the purposes of power analysis. for example, if a welch's t test is planned, should we nevertheless simulate data from two populations with equal variances? if not, how different should the variances be? this said, the complication of programming a simulation in which the analysis method applied to each sample differs depending on its characteristics (as for strategy 1) is avoided, making power analysis slightly easier for strategy 2 (robust primary analysis strategy) than is the case for strategy 1 (decision tree).
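the complication mentioned above (simulating a procedure whose analysis method depends on the data, as in strategy 1) is programmable, though. a minimal python sketch, entirely our own construction with an arbitrary two-step tree and arbitrary sample sizes, estimates the realised type i error rate of a procedure that picks between student's t test and the mann-whitney test based on a preliminary shapiro-wilk check:

```python
# sketch: monte carlo estimate of the realised type i error rate of a
# data-dependent "decision tree" procedure (all choices here are arbitrary).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n = 2000, 20
rejections = 0
for _ in range(n_sims):
    a = rng.normal(size=n)  # both groups drawn from the same population,
    b = rng.normal(size=n)  # so every rejection is a type i error
    # preliminary check on the deviations from each group's mean
    residuals = np.concatenate([a - a.mean(), b - b.mean()])
    if stats.shapiro(residuals).pvalue < 0.05:
        p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
    else:
        p = stats.ttest_ind(a, b).pvalue
    rejections += p < 0.05

print(rejections / n_sims)  # realised error rate of the whole procedure
```

the same skeleton extends to power estimation (add an effect to one group) or to deeper trees; the hard part, as noted above, is deciding which misspecifications to build into the simulated data.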
strategy 3: robustness analysis

the third strategy we consider here is that of deliberately preregistering multiple analyses that answer the same research questions (while making different distributional assumptions). we will term this strategy that of performing robustness analyses. two related terms are sensitivity analysis (which investigates how uncertainty pertaining to the inputs into scientific models relates to uncertainty in the outputs; see saltelli, tarantola, campolongo, & ratto, 2004), and multiverse analysis (which focuses on how different decisions made during data processing can lead to different datasets and different conclusions; see steegen, tuerlinckx, gelman, & vanpaemel, 2016). we prefer the term robustness analysis over robustness checks because we see the purpose of this strategy as being to investigate how the results vary across numerous plausible choices of analysis, rather than just to check that some favoured conclusion holds up under an alternative specification. as an example of the robustness analysis approach, imagine a researcher seeking to test the hypothesis that one continuous variable has a positive relationship with another continuous variable (with both variables comprising series of observations gathered over time). here the researcher might preregister a simple ordinary least squares regression as a primary analysis, with a model that also includes an autoregressive correlated error structure as a robustness analysis. this robustness analysis would help to deal with the potential problem of a breach of the assumption of independent error terms, which is a problem that often arises when analysing time series data. robustness analyses are especially useful when there is genuine ambiguity over which analysis technique is the most appropriate choice to address a given research question.
when employing this strategy, one might designate one particular analysis as the primary analysis, and a set of other analyses as the robustness analyses. one could also regard all the planned analyses as having equal priority (and all being "robustness analyses"). regardless, the key to an effective preregistered robustness analysis is identifying different analysis methods that test the same (preregistered) hypotheses, while making different (but plausible) distributional assumptions. as some examples:

• if planning a correlational analysis, one could specify both an analysis that assumes that the variables have a linear relationship (pearson's correlation) as well as an analysis that assumes only a monotonic relationship (spearman's rho).
• if planning a linear regression, one could specify an analysis that assumes homoscedasticity (ols estimation), as well as an analysis that produces consistent estimates even in the presence of heteroscedasticity (huber-white sandwich standard errors).
• if planning a linear regression, one could specify both an analysis using standard ols estimation as well as a model including an autoregressive ar(1) term, to allow for possible serial dependence in the data.

synthesising the results from robustness analyses.

it is possible to preregister a specific decision rule for interpreting the results of multiple analyses (e.g., "if the relationship is positive and statistically significant in both the pearson's correlation analysis and the spearman's rho analysis, we will conclude that the data supports the hypothesis"). however, such decision rules are necessarily arbitrary, and designing a sensible interpretation structure may be more difficult for a larger number of analyses (e.g., what if the coefficient is statistically significant in six out of seven planned robustness analyses?)
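the first bullet above (pairing pearson's correlation with spearman's rho) takes only a few lines; in this python sketch the data are simulated by us to be monotonic but non-linear, and every number is illustrative rather than drawn from any real study.

```python
# sketch of a minimal robustness pair: pearson's r (assumes linearity)
# alongside spearman's rho (assumes only monotonicity), on simulated data
# with a monotonic but clearly non-linear relationship.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 3, 100)
y = np.exp(x) + rng.normal(0, 1, 100)  # exponential trend plus noise

r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(f"pearson r = {r:.2f} (p = {p_r:.4f}), spearman rho = {rho:.2f} (p = {p_rho:.4f})")
```

reporting both estimates side by side lets a reader judge whether the conclusion depends on the linearity assumption.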
there is also no strong reason to assume that a combined hypothesis test based on multiple statistical methods would have more desirable long-run properties than any one of the single analyses that are included within the combined test. as such, it may be more appropriate to preregister multiple robustness analyses, preregister criteria for the interpretation of each, and to subsequently attempt to sensibly integrate the findings produced across these analyses, but not to specify an overarching decision rule based on the combined results of multiple analyses. this means that the synthesis of findings (e.g., in a discussion section) may be more complex when applying robustness analyses than when using any of the other strategies, with more ambiguity about whether a particular set of findings supports or does not support a particular hypothesis. it may nevertheless allow the researcher to clearly communicate the degree to which a particular finding "holds up" across multiple reasonable analysis options.

power in robustness analyses.

conducting a power analysis when planning to apply strategy 3 could be either straightforward or very complex. if a relatively simple analysis method with strong assumptions is selected as the primary analysis strategy (in conjunction with some additional robustness analyses with different assumptions), it could be justifiable to perform the power analysis solely for the primary analysis strategy. this means that it could be possible to perform the power analysis using point-and-click software such as g*power (faul, erdfelder, buchner, & lang, 2009). such an analysis would nevertheless need to come with an acknowledgement that the power of the additional analyses is almost inevitably lower (depending in part on the actual degree to which any assumptions are breached), and that the reported power analysis should be interpreted as representing a best-case scenario.
when taking this approach, it would probably be helpful to increase the planned sample size somewhat beyond what the power analysis suggests is required. on the other hand, a more comprehensive power analysis would check power for all of the planned robustness analyses, and do so based on data that is simulated so as to display plausible assumption violations. such a power analysis could be challenging to conduct.

robustness analysis and meta-analyses.

a problem with the robustness analysis approach is the ambiguity in terms of how the findings of a study that conducted a robustness analysis should be coded if included in a later meta-analysis, given that each statistical method included in the robustness analysis may have produced a different estimated effect size. a researcher conducting a meta-analysis and aiming to code the effect sizes reported in an individual study conducted using a robustness analysis may thus face uncertainty about whether to include just one effect size estimate (and if so, which one), or whether to aggregate the findings of the various analyses in some way. while this may be an ambiguity that can satisfactorily be resolved within the preregistration for a given meta-analysis (see quintana, 2015), the use of a robustness analysis within an individual study probably presents greater complications for meta-analysis than do the other strategies discussed in this article.

strategy 4: exploratory assumption checks

the final strategy we consider here is preregistering a primary (confirmatory) analysis method that will be followed regardless of the characteristics of the data, without necessarily making this primary analysis method a robust one, but including in the ensuing data analyses some exploratory investigations of distributional assumptions (or "assumption checks").
a plan for these investigations could be specified to at least some degree in the preregistration (perhaps making them more confirmatory in nature), but this is not absolutely necessary—provided that the analyses are explicitly tagged as exploratory in the final report. this strategy has the advantage of allowing for the communication of information about distributional assumptions without increasing the complexity of the preregistration too greatly. it also makes power analysis straightforward, since there would be just one primary analysis method to plan (for each research question), and the analysis method might well be a commonly-used one for which power analysis is available in easy-to-use software such as g*power. furthermore, the fact that the distributional assumption checks would not need to be used to make binary decisions (unlike the case in strategy 1) means that graphical methods could be used to convey information about the magnitude of any assumption breaches. the primary downside of this strategy would be the resulting ambiguity with respect to how the findings of the assumption checks should impact the interpretation of the results of the primary analysis. we would essentially suggest that the preregistered primary analysis should be conducted, reported and interpreted as planned in the preregistration effectively regardless of the outcomes of the distributional assumption checks, but that the researcher should identify any apparent distributional problems as a reason to interpret the results with some extra caution. there is a risk here, however, that a researcher might describe and emphasise evidence for a particular distributional problem differently depending on whether the results of the primary analysis are “positive” or not (e.g., ignoring a distributional problem if the main findings are positive, vs. 
emphasising the distributional problem as a possible explanation of the results if the data would otherwise appear to falsify a favoured theory). as such, this strategy has its dangers, but may be useful in some contexts, particularly student research, where preregistering a relatively simple analysis plan and using a relatively simple power analysis, while allowing some capacity for distributional assumption checks, may be desirable.

example

in this section, we will give a practical example of the four strategies suggested in this paper. in a paper published in psychological science, schroeder and epley (2015) reported multiple related organizational psychological studies. these studies investigated the effect of voice on (hypothetical) job applications. in this paper, we look only at their experiment 4, and use the description of this experiment by mcintyre (2016). in this experiment, 39 professional recruiters were assigned to one of two conditions. the recruiters either listened to a spoken job application or read a written transcript of the application. the recruiters rated the applicants on intelligence, competence, and thoughtfulness, resulting in an average rating denoted as intellect. some covariates were measured as well, but we will ignore these for the sake of simplicity. the main research question was whether ratings differed between spoken and written job pitches. even for such a seemingly simple design, there are many researcher degrees of freedom. for instance, the choice to define "intellect" as the arithmetic mean of competence, thoughtfulness and intelligence implies that these three variables are equally important ingredients of intellect.

table 1.
data for experiment 4 of schroeder & epley (2015)

written job pitch: 1.67, 2.00, 2.67, 2.67, 3.00, 3.33, 4.33, 4.33, 4.67, 4.67, 4.67, 4.67, 5.67, 5.67, 6.67, 7.33, 7.67, 8.00
spoken job pitch: 3.33, 4.33, 4.67, 5.67, 5.67, 6.00, 6.00, 6.00, 6.33, 6.67, 6.67, 6.67, 7.00, 7.00, 7.00, 7.00, 7.00, 7.67, 8.67, 10.00, 10.00

as these issues are beside the focus of our paper, we will not discuss them further, and instead take the 39 intellect ratings as given and work from there. the scores are given in table 1, and are also available (along with analysis code) in the osf project for this paper (https://osf.io/h2xry/). below, we indicate how the test for comparing both experimental groups could be preregistered, according to the four strategies. the beginning of the preregistration, describing the data collection itself, will be the same for each of the strategies: "we will recruit 39 professional recruiters. each recruiter is assigned at random to the "written" or "spoken" condition. [+some detailed description of the materials.] after reading/hearing the job pitch, the recruiter scores the candidate on three categories. the average score will be designated the intellect rating." we should note in passing that the sample size in this example is quite small, meaning that this study only has adequate power to detect fairly large effects. for example, if the primary analysis was to be an independent-samples student's t test (with a 2-sided test), and if the assumptions of the student's t test were satisfied, then this study would have just 33% power to detect a "medium" effect size of d = 0.5. the power to detect an effect of this magnitude would be lower again for the other analysis methods specified below, even if the t test assumptions were met, and potentially lower again in the presence of violations of the assumptions of these methods (depending on the nature of those assumption violations).
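the 33% figure above can be reproduced without g*power. the following python sketch uses the noncentral t distribution, assuming the 18/21 group split shown in table 1:

```python
# sketch of the power calculation quoted above: two-sided independent
# student's t test, alpha = .05, d = 0.5, groups of 18 and 21 (the split in
# table 1), computed from the noncentral t distribution.
import numpy as np
from scipy import stats

n1, n2, d, alpha = 18, 21, 0.5, 0.05
df = n1 + n2 - 2
ncp = d * np.sqrt(n1 * n2 / (n1 + n2))   # noncentrality parameter
t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value

# power = probability of landing beyond either critical value under the
# noncentral t distribution implied by the assumed effect size
power = stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
print(round(power, 3))  # close to the ~33% quoted in the text
```

the same calculation with a different `ncp` formula generalises to other designs; g*power performs an equivalent computation internally.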
in short, the data are useful for the purposes of illustration, but the sample size is not one that we would typically recommend. please note that this is no criticism of schroeder and epley (2015), as they base their conclusions on four separate experiments, three of which have a sample size much larger than 39. for the sake of brevity, we have not included individual power analyses for each of the four separate strategies we illustrate.

strategy 1, the decision tree

to compare the intellect ratings of both groups, we could apply a student's t test. however, this test assumes (i) normality of the dependent variable within each of the two populations, and (ii) equal variances between populations. we will therefore apply the following decision tree:

1. first, we apply the shapiro-wilk test for normality on the deviations from the group average and denote the resulting p value by psw.
   a. if psw < 0.05, we deem the normality assumption breached and we will apply the non-parametric mann-whitney/wilcoxon test.
   b. if psw > 0.05, there is no evidence to reject the assumption of normality and we proceed with:
2. second, we apply the f test for equality of variances and denote the resulting p value by pf.
   a. if pf < 0.05, we deem this assumption violated and we will apply the welch t test.
   b. if pf > 0.05, we will apply student's t test.

for the chosen test, we will report the p value of the two-sided comparison. note that this decision tree is just one way to check the assumptions. we chose the shapiro-wilk procedure following the recommendation by razali and wah (2011), but could have employed other normality tests instead. a similar argument holds for the f test for equality of variances. furthermore, the order of testing both assumptions matters, but is arbitrary.
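as an illustration, the decision tree above can be implemented directly. this python sketch is our own construction (the paper's analysis code, in r, is on the osf project) applied to the table 1 data:

```python
# sketch implementing the strategy 1 decision tree on the table 1 data.
import numpy as np
from scipy import stats

written = np.array([1.67, 2.00, 2.67, 2.67, 3.00, 3.33, 4.33, 4.33, 4.67,
                    4.67, 4.67, 4.67, 5.67, 5.67, 6.67, 7.33, 7.67, 8.00])
spoken = np.array([3.33, 4.33, 4.67, 5.67, 5.67, 6.00, 6.00, 6.00, 6.33,
                   6.67, 6.67, 6.67, 7.00, 7.00, 7.00, 7.00, 7.00, 7.67,
                   8.67, 10.00, 10.00])

# step 1: shapiro-wilk on the deviations from each group's mean
residuals = np.concatenate([written - written.mean(), spoken - spoken.mean()])
p_sw = stats.shapiro(residuals).pvalue  # reported as .124 in the text

if p_sw < 0.05:
    result = stats.mannwhitneyu(written, spoken, alternative="two-sided")
else:
    # step 2: two-sided f test for equality of variances
    f = written.var(ddof=1) / spoken.var(ddof=1)
    df1, df2 = written.size - 1, spoken.size - 1
    p_f = 2 * min(stats.f.cdf(f, df1, df2), stats.f.sf(f, df1, df2))
    if p_f < 0.05:
        result = stats.ttest_ind(written, spoken, equal_var=False)  # welch
    else:
        result = stats.ttest_ind(written, spoken)  # student's t

print(result.statistic, result.pvalue)
```

with these data, both preliminary checks are non-significant, so the procedure ends at student's t test, matching the results reported below.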
strategy 2, robust primary analysis strategy

to compare the intellect ratings of both groups, we will use a bootstrapped version of yuen's (1974) two-sample trimmed t test. this test is an alternative to the independent samples student's t test designed for situations where there are both unequal variances across groups and non-normality within groups. the test is particularly useful with very long-tailed error distributions. to apply this test, we use the function yuenbt from the wrs2 package (mair & wilcox, 2018) in the statistical software r (r core team, 2018). we use 20% trimming, thus removing the 20% smallest and the 20% largest observations in each group. this way, the influence of outliers is diminished. we employ 1,000 bootstrap samples, which is sufficient for accurate results (mair & wilcox, 2018). before bootstrapping we set the random number generator to seed 1. we will report the p value of this test as well as the 95% confidence interval for the trimmed mean difference.

strategy 3, robustness analysis

to compare the intellect ratings of both groups, we will perform the following tests, each based on a different set of assumptions: (i) student's t test, (ii) the welch t test, (iii) the mann-whitney/wilcoxon test, (iv) yuen's test for the 20% trimmed means, and (v) the bootstrap version of test (iv) with 1,000 bootstrap samples and a seed set to 1. this last test mimics that used in strategy 2 above.

strategy 4, exploratory assumption checks

to compare the intellect ratings of both groups, we perform student's t test. we also provide exploratory checks of the validity of the distributional assumptions underlying this test.

example: results for each strategy

in this case we obviously already have the data, so we can see how the four preregistrations work out.

strategy 1, decision tree. the shapiro-wilk test was non-significant (p = .124), and so was the two-sided f test for equality of variances (p = .458).
we therefore conducted student's t test, and this test found a significant difference between both groups, t(37) = -3.525, p = .001.

strategy 2, robust primary analysis strategy. the bootstrap version of the yuen test provided a p value of .002 and a 95% confidence interval for the trimmed mean difference of [-3.323, -0.698].

strategy 3, robustness analysis. all five tests yielded a significant difference (at α = 0.05) in favour of the audio group, with the following test results. student's t: t(37) = -3.525, p = .001; welch's t: t(33.441) = -3.478, p = .001; wilcoxon test: w = 84.5, p = .003; yuen's test: trimmed mean difference -2.010, p = .004; bootstrapped yuen's test: trimmed mean difference -2.010, p = .002. thus, we conclude that applicants with a spoken job pitch receive higher ratings than those with a written job pitch.

strategy 4, exploratory assumption checks. student's t test indicated that the audio group performed significantly better than the written group, t(37) = -3.525, p = .001. the shapiro-wilk test for normality provided no significant evidence for non-normality (w = 0.966, p = .124) and the f test for equality of variances did not provide significant evidence for a violation of this assumption, f(17, 20) = 1.411, p = .458.

example: summary

it will be clear that all four strategies have their own benefits and drawbacks. a clear benefit of strategy 2, for instance, is that the results can be reported in a very condensed form. a drawback, however, is that fewer people are familiar with the yuen test compared to more conventional tests. in this example, we deliberately kept the methodology as simple as possible, with a single test for a two-group comparison. the complexity of some strategies will grow beyond feasible limits when the research questions are more complex.
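for readers working in python rather than r, the strategy 2 analysis can be approximated with scipy's trimmed (yuen's) t test. note that this is the non-bootstrapped version of the test, so the p value will not exactly match the bootstrap result reported above.

```python
# sketch: yuen's 20% trimmed t test on the table 1 intellect ratings via
# scipy (non-bootstrapped approximation of the wrs2::yuenbt analysis).
import numpy as np
from scipy import stats

written = np.array([1.67, 2.00, 2.67, 2.67, 3.00, 3.33, 4.33, 4.33, 4.67,
                    4.67, 4.67, 4.67, 5.67, 5.67, 6.67, 7.33, 7.67, 8.00])
spoken = np.array([3.33, 4.33, 4.67, 5.67, 5.67, 6.00, 6.00, 6.00, 6.33,
                   6.67, 6.67, 6.67, 7.00, 7.00, 7.00, 7.00, 7.00, 7.67,
                   8.67, 10.00, 10.00])

# 20% trimmed means per group; their difference is the effect estimate
diff = stats.trim_mean(written, 0.2) - stats.trim_mean(spoken, 0.2)

# trim=0.2 makes this yuen's trimmed t test (welch-type degrees of freedom)
res = stats.ttest_ind(written, spoken, trim=0.2, equal_var=False)
print(round(diff, 3), round(res.pvalue, 3))
```

the trimmed mean difference reproduces the -2.010 reported in the robustness analysis above; the `trim` parameter requires scipy 1.7 or later.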
summary: strategies for distributional assumption checks

in the subsections above, we have outlined four strategies for dealing with distributional assumptions in preregistered research. obviously, however, we have not considered all possible approaches that researchers could take to dealing with distributional issues in a preregistration. of the four strategies we have discussed, strategy 1 (a decision tree) seems the least desirable on several counts, although it may be useful in some research contexts, particularly when applied by researchers with strong statistical expertise. strategy 3 (preregistering robustness analyses) is probably the most comprehensive and sophisticated way of dealing with the uncertainty arising from distributional assumptions. it is nevertheless a relatively challenging strategy to adequately specify in a preregistration (and to perform a power analysis for), and the strategy that would result in the most verbose write-up of the results. on the other hand, strategy 4 would be the easiest to specify in a preregistration, and may be useful for student research, or for researchers new to preregistration. finally, strategy 2 represents something of a compromise between difficulty level and level of sophistication, would produce a very concise write-up of results, and could also be applied in conjunction with any of the three other strategies.

additional tactics for dealing with distributional assumptions

the four strategies listed above are, to at least some degree, competing strategies. on the other hand, there are two additional tactics that may be useful in virtually any preregistered research project, and that can be employed in conjunction with whichever of the four strategies above is selected.
tactic 1: clearly and accurately describe distributional assumptions

the first of these tactics is to transparently and accurately describe the distributional assumptions of the statistical analyses employed, and then acknowledge in the limitations section of the discussion that any uncertainty with respect to the validity of these assumptions adds to the uncertainty surrounding the substantive conclusions. a description of the assumptions of various statistical analyses is beyond the scope of this article, but some useful sources include gelman and hill (2007), williams et al. (2013), and casella and berger (2002). we note in passing that it is not uncommon for distributional assumptions to be described incorrectly in resources aimed at psychologists; it can often be useful to look to more rigorous sources written for statisticians.

tactic 2: open data

the second tactic, which we suggest applying in virtually any study (subject to any restrictions necessary to safeguard the privacy of the participants, or necessary for legal reasons), is to make a de-identified copy of the raw data openly accessible to readers and reviewers (see houtkoop et al., 2018; munafò et al., 2017; nosek et al., 2015). although the authors of any given piece of research will always have the primary responsibility for conducting and reporting data analyses that address uncertainty arising due to distributional assumptions, sharing open data nevertheless helps to ensure that others who might wish to apply a different approach to checking distributional assumptions (or a different approach to the primary analyses) are able to check the robustness of the findings to such alternative approaches. along with the raw data it is useful to post the programming code or syntax necessary to process the data and apply the analyses reported in a given article.
the open science framework (https://osf.io) is a particularly useful venue for posting data and analysis syntax, but other options are available, including posting supplementary materials along with the article on a journal's website. when preregistering a study that will use an open data policy, it is important to consider how this will be signalled to participants and in any institutional review board/ethics committee application (see meyer, 2018). in some parts of the world, institutional review boards may expect that data will be kept entirely confidential to the research team, and boilerplate information sheet and consent materials may encode this expectation. it is thus crucial to plan an open data policy from the beginning of a project, rather than leaving any decisions about data sharing until after data has been collected (at which point the researchers may find themselves inadvertently locked into a restrictive data sharing policy).

templates for preregistrations

to assist researchers in preparing preregistrations that pre-emptively deal with the prospect of distributional assumption violations, we suggest that preregistration templates include prompts leading researchers to consider employing some of the strategies described above. the specific prompts we would suggest are: what are the distributional assumptions of the statistical analyses you will be applying? how have you accounted for the possibility of violations of these distributional assumptions? (some options include selecting analysis methods that make assumptions that are as conservative as possible; preregistering robustness analyses which test the robustness of your findings to analysis strategies that make different assumptions; and/or pre-specifying a single primary analysis strategy, but noting that you will also report an exploratory investigation of the validity of distributional assumptions.)
conclusion

preregistration is a valuable strategy in confirmatory research projects, but it does come with challenges. one of those challenges is the need to make decisions about how distributional assumption violations will be dealt with before examining the data itself. in this article, we have examined several strategies that researchers could adopt for addressing distributional assumptions in preregistered research. while preregistering "decision trees" for changing the primary analysis depending on the presence of particular assumption breaches has several problems, preregistering analyses that make weaker distributional assumptions, or preregistering robustness analyses, may be more useful approaches. alternatively, students and other researchers new to preregistration may find it useful to preregister a simple primary analysis strategy (and stick with it regardless of the characteristics of the data) but also conduct and report an exploratory post hoc analysis of the validity of the distributional assumptions made. we recommend that researchers use the guidance above to select the strategy, or combination of strategies, that is most appropriate for their given context. finally, we suggest that transparently and accurately communicating the assumptions of the analyses employed, and making raw data openly available, can be useful tactics for ensuring that readers and reviewers have sufficient information available to reach an informed judgment about the impact of any distributional problems on the validity of a study's conclusions.

references

albers, w., boon, p. c., & kallenberg, w. c. m. (2000). size and power of pretest procedures. the annals of statistics, 28(1), 195–214. https://doi.org/10.1214/aos/1016120369
azzalini, a. (1985). a class of distributions which includes the normal ones. scandinavian journal of statistics, 12(2), 171–178.
bancroft, t. a. (1964).
analysis and inference for incompletely specified models involving the use of preliminary test(s) of significance. biometrics, 20(3), 427–442. https://doi.org/10.2307/2528486
carpenter, b., gelman, a., hoffman, m. d., lee, d., goodrich, b., betancourt, m., … riddell, a. (2017). stan: a probabilistic programming language. journal of statistical software, 76(1). https://doi.org/10.18637/jss.v076.i01
casella, g., & berger, r. l. (2002). statistical inference (2nd ed.). pacific grove, ca: duxbury.
delacre, m., lakens, d., & leys, c. (2017). why psychologists should by default use welch's t-test instead of student's t-test. international review of social psychology, 30(1), 92–101. https://doi.org/10.5334/irsp.82
ernst, a. f., & albers, c. j. (2017). regression assumptions in clinical psychology research practice—a systematic review of common misconceptions. peerj, 5. https://doi.org/10.7717/peerj.3323
etz, a., & vandekerckhove, j. (2018). introduction to bayesian inference for psychology. psychonomic bulletin & review, 25(1), 5–34. https://doi.org/10.3758/s13423-017-1262-3
fagerland, m. w., & sandvik, l. (2009). the wilcoxon–mann–whitney test under scrutiny. statistics in medicine, 28(10), 1487–1497. https://doi.org/10.1002/sim.3561
faul, f., erdfelder, e., buchner, a., & lang, a.-g. (2009). statistical power analyses using g*power 3.1: tests for correlation and regression analyses. behavior research methods, 41(4), 1149–1160. https://doi.org/10.3758/brm.41.4.1149
ferguson, c. j., & heene, m. (2012). a vast graveyard of undead theories: publication bias and psychological science's aversion to the null. perspectives on psychological science, 7(6), 555–561. https://doi.org/10.1177/1745691612459059
gelman, a., & hill, j. (2007). data analysis using regression and multilevel/hierarchical models. cambridge, united kingdom: cambridge university press.
gelman, a., & loken, e. (2014). the statistical crisis in science. american scientist, 102(6), 460–465.
https://doi.org/10.1511/2014.111.460
houtkoop, b. l., chambers, c., macleod, m., bishop, d. v. m., nichols, t. e., & wagenmakers, e.-j. (2018). data sharing in psychology: a survey on barriers and preconditions. advances in methods and practices in psychological science, 1(1), 70–85. https://doi.org/10.1177/2515245917751886
kruschke, j. k., & liddell, t. m. (2018). bayesian data analysis for newcomers. psychonomic bulletin & review, 25(1), 155–177. https://doi.org/10.3758/s13423-017-1272-1
liu, r. y. (1988). bootstrap procedures under some non-iid models. the annals of statistics, 16(4), 1696–1708. https://doi.org/10.1214/aos/1176351062
lumley, t., diehr, p., emerson, s., & chen, l. (2002). the importance of the normality assumption in large public health data sets. annual review of public health, 23, 151–169. https://doi.org/10.1146/annurev.publhealth.23.100901.140546
mair, p., & wilcox, r. (2018). wrs2: a collection of robust statistical methods (version 0.10-0). retrieved from https://cran.r-project.org/package=wrs2
mcintyre, k. p. (2016). do spoken or written words better express intelligence? retrieved august 3, 2018, from open stats lab website: https://sites.trinity.edu/osl/data-sets-and-activities/t-test-activities
meuleman, b., loosveldt, g., & emonds, v. (2015). regression analysis: assumptions and diagnostics. in h. best & c. wolf (eds.), regression analysis and causal inference (pp. 83–110). london, united kingdom: sage.
meyer, m. n. (2018). practical tips for ethical data sharing. advances in methods and practices in psychological science, 1(1), 131–144. https://doi.org/10.1177/2515245917747656
munafò, m. r., nosek, b. a., bishop, d. v. m., button, k. s., chambers, c. d., sert, n. p. du, … ioannidis, j. p. a. (2017). a manifesto for reproducible science. nature human behaviour, 1. https://doi.org/10.1038/s41562-016-0021
nosek, b. a., alter, g., banks, g. c., borsboom, d., bowman, s. d., breckler, s. j., … yarkoni, t. (2015). promoting an open research culture.
science, 348(6242), 1422–1425. https://doi.org/10.1126/science.aab2374
quintana, d. s. (2015). from pre-registration to publication: a non-technical primer for conducting a meta-analysis to synthesize correlational data. frontiers in psychology, 6. https://doi.org/10.3389/fpsyg.2015.01549
r core team. (2018). r: a language and environment for statistical computing. retrieved from http://www.r-project.org/
razali, n. m., & wah, y. b. (2011). power comparisons of shapiro-wilk, kolmogorov-smirnov, lilliefors and anderson-darling tests. journal of statistical modeling and analytics, 2(1), 21–33.
saltelli, a., tarantola, s., campolongo, f., & ratto, m. (2004). sensitivity analysis in practice: a guide to assessing scientific models. chichester, united kingdom: wiley.
schroeder, j., & epley, n. (2015). the sound of intellect: speech reveals a thoughtful mind, increasing a job candidate's appeal. psychological science, 26(6), 877–891. https://doi.org/10.1177/0956797615572906
simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
steegen, s., tuerlinckx, f., gelman, a., & vanpaemel, w. (2016). increasing transparency through a multiverse analysis. perspectives on psychological science, 11(5), 702–712. https://doi.org/10.1177/1745691616658637
van 't veer, a. e., & giner-sorolla, r. (2016). preregistration in social psychology—a discussion and suggested template. journal of experimental social psychology, 67, 2–12. https://doi.org/10.1016/j.jesp.2016.03.004
wagenmakers, e.-j., wetzels, r., borsboom, d., van der maas, h. l. j., & kievit, r. a. (2012). an agenda for purely confirmatory research. perspectives on psychological science, 7(6), 632–638. https://doi.org/10.1177/1745691612463078
wells, c. s., & hintze, j. m. (2007). dealing with assumptions underlying statistical tests. psychology in the schools, 44(5), 495–502. https://doi.org/10.1002/pits.20241
westfall, j., & yarkoni, t. (2016). statistically controlling for confounding constructs is harder than you think. plos one, 11(3), e0152719. https://doi.org/10.1371/journal.pone.0152719
white, h. (1980). a heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. econometrica: journal of the econometric society, 48(4), 817–838. https://doi.org/10.2307/1912934
wilcox, r. (2012). introduction to robust estimation and hypothesis testing (3rd ed.). waltham, ma: academic press.
wilkinson, l., & task force on statistical inference. (1999). statistical methods in psychology journals: guidelines and explanations. american psychologist, 54(8), 594–604. https://doi.org/10.1037/0003-066x.54.8.594
williams, m. n., grajales, c. a. g., & kurkiewicz, d. (2013). assumptions of multiple regression: correcting two misconceptions. practical assessment, research & evaluation, 18(11). retrieved from http://www.pareonline.net/getvn.asp?v=18&n=11
yuen, k. k. (1974). the two-sample trimmed t for unequal population variances. biometrika, 61(1), 165–170. https://doi.org/10.1093/biomet/61.1.165
zimmerman, d. w. (2004). a note on preliminary tests of equality of variances. british journal of mathematical and statistical psychology, 57(1), 173–181.
https://doi.org/10.1348/000711004849222

meta-psychology, 2022, vol 6, mp.2019.2162, https://doi.org/10.15626/mp.2019.2162
article type: file drawer report
published under the cc-by4.0 license
open data: yes. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: yes.
edited by: rickard carlsson
reviewed by: rima-maria rahal, ignazio ziano, adrien fillon
analysis reproduced by: lucija batinović
all supplementary files can be accessed at the osf project page: https://doi.org/10.17605/osf.io/y2tf6

four failures to demonstrate that scarcity magnifies preference for familiarity

stephen antonoplis, university of california, berkeley
serena chen, university of california, berkeley

as economic inequality increases in the united states and around the world, psychologists have begun to study how the psychological experience of scarcity impacts people's decision making. recent work in psychology suggests that scarcity—the experience of having insufficient resources to accomplish a goal—makes people more strongly prefer what they already like relative to what they already dislike or like less. that is, scarcity may polarize preferences. one common preference is the preference for familiarity: the systematic liking of more often experienced stimuli, compared to less often experienced stimuli. across four studies—three experiments and one cross-sectional survey (all pre-registered; see https://osf.io/7zyfr/)—we investigated whether scarcity polarizes the preference for familiarity. despite consistently replicating people's preference for the familiar, we consistently failed to show that scarcity increased the degree to which people preferred the familiar to the unfamiliar. we discuss these results in light of recent failures to replicate famous findings in the scarcity literature.
keywords: scarcity, familiarity, open science

with economic inequality rising markedly since the 1980s, and especially since the 2007 global recession (piketty, 2014), scholars from various fields have turned their attention to understanding the effects of scarcity on human psychology. a growing approach to this question investigates the impact of the psychological experience of scarcity on thoughts, feelings, and behaviors. in this file drawer report, we sought to understand the impact of scarcity on the familiarity bias, the systematic liking of more familiar, compared to less familiar, stimuli (zajonc, 1968).

what is scarcity?

scarcity is defined as a lack of sufficient resources for accomplishing a goal (shah, mullainathan, & shafir, 2012). research has found that it increases overborrowing and focus on the present (shah et al., 2012; shah, mullainathan, & shafir, 2018); increases the propensity to lie in order to secure financial rewards (gino & pierce, 2009); and increases the likelihood of taking risks and the quickness to approach temptations if participants grew up in a lower social class (griskevicius et al., 2013). importantly, the resources for which a person experiences scarcity can be of many different forms, and the form of the resources (e.g., time or money) is not thought to change the effects of scarcity on human psychology (mullainathan & shafir, 2013). for instance, both time and material scarcity have been found to increase overborrowing in the present (shah et al., 2012; shah et al., 2018). most pertinent to the present research, recent research suggests that scarcity polarizes preferences (zhu & ratner, 2015). when offered a choice between various products, participants more strongly preferred their favorite (vs. non-favorite) option when few of each option were available (scarcity) versus when many were available (abundance).
this occurred because scarcity was perceived as threatening, inducing higher arousal, which has been previously shown to polarize people's preferences (e.g., gorn et al., 2001; mano, 1992, 1994). below, we propose that this effect may extend to the preference for familiarity.

the familiarity bias

familiarity bias refers to the systematic preference for more familiar stimuli over less familiar stimuli, where familiarity is defined as frequency of exposure. in other words, familiarity bias describes the phenomenon that people tend to like things they have been exposed to more often simply because of the rate of exposure (zajonc, 1968). many classic studies in psychology suggest that people normatively prefer more familiar to less familiar stimuli. for instance, research on the mere exposure effect has shown that individuals rate stimuli more positively if the stimuli occur more versus less frequently in participants' natural environment, as well as if the stimuli more versus less closely resemble other stimuli in that environment (e.g., johnson, thomson, & frincke, 1960; zajonc, 1968, 2001). familiarity bias has been shown across many kinds of stimuli, including fruit and vegetables (zajonc, 1968), nonsense syllables (johnson et al., 1960), and people's names (oppenheimer, 2004). two meta-analyses have examined the robustness of the phenomenon. across 208 experiments, bornstein (1989) found the effect to be quite reliable, although the impact of publication bias was difficult to assess because adequate tests for it did not exist when the meta-analysis was conducted. montoya et al. (2017) built on bornstein's (1989) meta-analysis and found that the effect was reliable across 118 studies and, using more appropriate tests, that publication bias likely did not bias the estimates. thus, a large body of research indicates that people, in general, prefer more familiar to less familiar objects.
how might the psychological experience of scarcity alter this preference? below, we suggest that scarcity may magnify people's preference for familiarity, making people prefer familiar stimuli more strongly under scarcity than in its absence.

does scarcity increase the familiarity bias?

as noted above, recent research suggests that scarcity polarizes preferences: participants offered a choice between various products more strongly preferred their favorite (vs. non-favorite) option when few of each option were available (scarcity) versus when many were available (abundance; zhu & ratner, 2015). the researchers argued that this occurred because scarcity was perceived as threatening, inducing higher arousal, which has been previously shown to polarize people's preferences (e.g., gorn et al., 2001; mano, 1992, 1994). when participants experienced scarcity (here, of quantity), they felt threatened by it. this threat increased their arousal, which restricted the number of evaluative dimensions considered relevant to the decision. one dimension, prior liking, was deemed particularly relevant to the decision, perhaps because threat constitutes a negative affective experience and people experiencing negative affect often choose simple decision strategies (mano, 1994). finally, to determine their preferences, participants relied more heavily on this single dimension of prior liking, producing more polarized preferences than would have resulted if other, imperfectly correlated dimensions had been incorporated into the decision. applied to the familiarity bias, such theorizing suggests that, when experiencing scarcity, people will feel more threatened and aroused, causing them to use simpler decision strategies. because the familiarity bias is a common phenomenon (bornstein, 1989; montoya et al., 2017; zajonc, 1968, 2001) and familiarity is a simple judgment to make (e.g., glaze, 1928), people may rely on the familiarity of stimuli to guide their choices.
as familiarity already breeds liking (zajonc, 1968), relying predominantly on familiarity in a decision task should increase stratification along familiarity. in other words, people should come to more strongly prefer familiar stimuli, relative to less familiar stimuli, under scarcity. in addition, since scarcity is expected to impact people similarly regardless of its form (mullainathan & shafir, 2013), the effect should appear across any form of scarcity (e.g., material, time, quantity). hence, in this file drawer report, we examined the effect of different forms of scarcity (material, time, and quantity) on the familiarity bias.

the present research

across four studies we tested whether scarcity, in various forms, amplifies individuals' preference for familiar over unfamiliar stimuli. we aimed to show that this pattern was consistent across various kinds of more versus less familiar stimuli. key to note is that the studies by zhu and ratner (2015), on which we based ours, took an idiographic approach to measuring preferences, examining changes in each participant's favorite and non-favorite options. all of the work they cited in support of the hypothesized link between scarcity and preferences took a nomothetic approach (e.g., more vs. less risky options in terms of normative probability; gorn et al., 2001; mano, 1992, 1994). in line with this, the authors speculated that their hypothesis would hold using a nomothetic approach (p. 12, zhu & ratner, 2015). hence, in the present studies, we tested the effect of scarcity on nomothetic preference for familiarity. this work, then, should be understood as a generalizability test of the work presented by zhu and ratner (2015), rather than a replication (lebel et al., 2019).
in addition, some of our studies, unlike those of zhu and ratner (2015), used an incidental manipulation of scarcity, in which the experience of scarcity was not incorporated into the same situation or context as the assessment of preference. as others have argued (bargh, 1992), whether a social psychological phenomenon is manipulated incidentally or explicitly is not crucial to studying it. what is crucial is that manipulations bring to mind whatever concept (in this case, scarcity) is of interest, thereby allowing this concept to shape how subsequent stimuli are perceived. accordingly, social psychological research on scarcity has been able to use incidental manipulations of scarcity without issue (e.g., roux et al., 2015). for all studies, we pre-registered focal hypotheses, data exclusion criteria, statistical modeling, and dependent and independent variables on the open science framework (available at https://osf.io/7zyfr/). this is an exhaustive report of all data available from research projects relating to the topic for which at least one of the authors was principal investigator or otherwise had the right to publish the results. this includes not only null or unexpected findings, but also studies suspected to have failed, with careful explanation of the circumstances of the failure (e.g., experimental error, failed manipulation check). the context surrounding how these data were collected, and whether they are somehow connected to already published studies (e.g., dropped experiments), is carefully explained. we report how we determined our sample sizes, all data exclusions, and all measures in all studies. all analyses were conducted in r (version 3.6.2; r core team, 2019). finally, to improve the paper's narrative, we report the studies in a different order than the chronological order in which they were conducted.
study 1

for the initial test of our hypothesis, we sought to combine methods from both the scarcity and familiarity bias literatures in order to use noncontroversial, reliable methods. to manipulate scarcity, we had people recall a time they experienced scarcity (cf. roux et al., 2015; mani et al., 2013). to measure preference for familiarity, participants rated how much they liked both familiar and unfamiliar given names and surnames (cf. oppenheimer, 2004), as well as nonsense syllables (cf. johnson et al., 1960). the key test of our hypothesis was whether the scarcity manipulation moderated participants' preference for familiarity such that this preference was heightened under scarcity. the pre-registration form, study materials, and data are available here: https://osf.io/7vtqr/.

method

following an informal lab policy of collecting 100 participants per between-subjects condition, 201 participants were recruited from amazon's mechanical turk. their demographic characteristics matched typical samples on mturk (mostly white, mostly men, in their mid-30's, with some amount of college education, and earning a relatively low income; buhrmester et al., 2011) and are reported in full in table 1.

table 1
demographics across all studies (proportions and means)

                           study 1            study 2            study 3            study 4
gender
  man                      .50                .61                .46                .50
  woman                    .27                .39                .53                .50
  transgender              .00                .00                .004               .00
  decline to state         .23                .00                .004               .00
race
  white                    .77                .75                .78                .67
  latinx/hispanic          .06                .05                .08                .11
  black                    .09                .08                .05                .10
  native american          .00                .02                .00                .00
  asian                    .06                .11                .06                .00
  middle eastern           .00                .00                .00                .00
  mixed                    .01                .00                .04                .02
  other                    .00                .00                .00                .04
  decline to state         .02                .00                .00                .00
born in the u.s.
  yes                      .83                .98                .94                –
  no                       .00                .02                .05                –
  decline to state         .17                .00                .01                –
age (m, sd)                39.30 (10.97)      33.64 (9.58)       36.91 (11.66)      50.19 (16.72)
income (m, sd)             $38,130 ($25,871)  $36,093 ($21,684)  $38,643 ($29,447)  $72,053 ($47,986)
education
  high school or less      .13                .14                .12                .41
  at least some college    .86                .86                .88                .59
  decline to state         .02                .00                .00                .00

note. "–" indicates that an item was not administered in that dataset. with the exceptions of age and income, all numbers in cells are proportions.

participants were randomly assigned to a scarcity or control condition. in the scarcity condition, participants wrote about a time they felt their resources were scarce (i.e., did not meet their needs; taken from roux et al., 2015). we expected writing about an experience of scarcity to be affectively unpleasant and threatening and, thus, different from most day-to-day experiences. hence, participants in the control condition wrote about an experience they had in the past week, whether an activity, an interaction, or whatever came to mind. after writing about a scarcity experience or a recent experience, participants rated how much they liked eight female given names (four familiar, four unfamiliar), eight male given names (four familiar, four unfamiliar), and eight surnames (four familiar, four unfamiliar). participants also rated how good or bad they thought the meanings of twenty-four nonsense syllables were in a foreign language. all names were rated on a scale from 1 (=dislike) to 7 (=like). all nonsense syllables were rated on a scale from 1 (=bad) to 7 (=good). to avoid order effects, participants were randomly assigned to rate either all the names first and the syllables second, or the syllables first and all the names second. although the use of an incidental manipulation indirectly related to the dependent variable might seem problematic for ecological validity, this practice is fairly common in the scarcity (e.g., mani et al., 2013) and familiarity bias literatures (e.g., muthukrishnan et al., 2009).
female and male given names were taken from the 2016 us social security registry of baby names (available at https://namecensus.com/babynames/popular-girl-names-in-2016/ for female names and https://namecensus.com/babynames/popular-boy-names-in-2016/ for male names). for each gender, we selected four names from the top twenty most common as the familiar names (females: isabella, sophia, emma, olivia; males: jacob, ethan, michael, william) and four names from the bottom twenty of the top 1000 (i.e., names 981–1000) as the unfamiliar names (females: lilith, charleigh, dania, savannah; males: truman, eliezer, reuben, bailey). we chose names from the top and bottom of the top 1000 to make sure that the frequencies of the names varied and that all names were somewhat recognizable (i.e., to avoid outlier names). we used this same process to select surnames, though names were pulled from the most recent (2010) us census instead of social security data (familiar: smith, johnson, williams, brown; unfamiliar: galloway, bray, nieves, petty; data available at https://www.census.gov/topics/population/genealogy/data/2010_surnames.html). we used names as stimuli because prior work had obtained familiarity effects using names (oppenheimer, 2004). the nonsense syllables were taken from study 3 by johnson et al. (1960). they found that syllables with low (0%), medium (47–53%), and high (100%) rates of judged association with english words (in glaze, 1928) were thought to have better (i.e., more "good") meanings when participants were told the syllables were words from foreign languages and judged how much the words referred to "good" or "bad" things. glaze (1928) obtained the syllables' associations with english words by asking 15 participants whether they could quickly form an association to an english word for each syllable. the association rates (i.e., low, medium, and high) are the percentage of the 15 participants who reported forming an association to a syllable.
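the name-selection rule described above (four familiar names from the top twenty of a frequency-ranked list, four unfamiliar names from ranks 981–1000) can be sketched as follows. this is an illustrative sketch, not the authors' code: the ranked list, function name, and random seed are hypothetical stand-ins for the actual social security and census data.

```python
import random

def select_stimuli(ranked_names, n_per_set=4, top_pool=20, list_size=1000, seed=0):
    """Sample familiar names from the top of a frequency-ranked list and
    unfamiliar names from the bottom of the top `list_size` entries."""
    rng = random.Random(seed)
    familiar = rng.sample(ranked_names[:top_pool], n_per_set)
    unfamiliar = rng.sample(ranked_names[list_size - top_pool:list_size], n_per_set)
    return familiar, unfamiliar

# illustrative ranked list: "name0001" is the most frequent, "name1000" the least
ranked = [f"name{i:04d}" for i in range(1, 1001)]
familiar_names, unfamiliar_names = select_stimuli(ranked)
```

restricting the unfamiliar pool to ranks 981–1000, rather than reaching beyond the top 1000, mirrors the authors' stated aim of keeping all names somewhat recognizable.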
johnson et al. (1960) found that more familiar words (i.e., those more frequently associated with known words) were judged more positively than unfamiliar words. after rating the names and nonsense syllables, participants completed standard demographic items (race, gender, income, education, subjective ses) and an embedded attention check. all participants were debriefed. participants who did not follow the manipulation instructions (e.g., copied and pasted text from a secondary document instead of describing an experience) were excluded (n=2), yielding a final sample size of 199 participants (ncontrol=103, nscarcity=96). participants were paid $0.75 for completing the study.

results

confirmatory results

data were submitted to multilevel models that regressed liking ratings on stimulus familiarity, experimental condition, and their interaction. in addition, all ratings were partitioned into random intercepts within participants and random effects of familiarity within participants. that is, we controlled for the possibility that overall rating patterns might vary across participants and that preferences for familiarity might vary across participants. we also included random intercepts and slopes for experimental condition within stimuli in order to prevent stimulus-specific effects from impacting the overall result. table 2 shows the multilevel regression results for all dependent variables.

table 2
b (se) from hierarchical linear models for study 1

regression term        syllables (0% ref.)   syllables (0% & 47–53% ref.)   female given names   male given names   surnames
scarcity               -0.11 (0.08)          -0.11 (0.08)                   -0.26 (0.15)         -0.20 (0.15)       0.04 (0.18)
47–53%                 0.66 (0.22)**         –                              –                    –                  –
100%                   0.91 (0.16)***        1.11 (0.20)***                 –                    –                  –
familiar               –                     –                              1.17 (0.33)*         1.30 (0.34)**      0.53 (0.35)
scarcity x 47–53%      0.06 (0.12)           –                              –                    –                  –
scarcity x 100%        0.09 (0.12)           0.11 (0.14)                    –                    –                  –
scarcity x familiar    –                     –                              -0.30 (0.21)         -0.02 (0.19)       -0.11 (0.19)

note. * p < .05, ** p < .01, *** p < .001. "–" denotes that a regression term was not included in a model.

figure 1. scarcity condition depicted with filled circles and solid lines; control condition, with hollow circles and dashed lines. all name ratings were in terms of liking (1=dislike, 7=like). syllable ratings were in terms of how bad (=1) or good (=7) participants thought a syllable's meaning was in a foreign language.

as expected, participants rated more familiar given names as more preferable than less familiar given names (females: b=1.17, t(6.62)=3.57, p=.010, pseudo-r2=.66; males: b=1.30, t(6.51)=3.82, p=.008, pseudo-r2=.69). in addition, relative to low-association syllables, participants rated medium-association syllables as better-sounding (b=0.66, t(22.58)=2.97, p=.007, pseudo-r2=.55) and high-association syllables as better-sounding (b=0.91, t(25.43)=5.62, p<.001, pseudo-r2=.28). high-association syllables were also rated as better-sounding than combined medium- and low-association syllables (b=1.11, t(25.43)=5.62, p<.001, pseudo-r2=.45). thus, all of these given name and nonsense syllable stimuli appeared to operate as expected in that they yielded the normative preference for more familiar over less familiar objects. in contrast, preferences did not vary across more and less familiar surnames (b=0.53, t(6.63)=1.50, p=.179, pseudo-r2=.25), suggesting that the chosen surnames were inappropriate for testing our hypothesis.
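the key confirmatory test described above is the familiarity x condition interaction. the paper fit multilevel models in r with random intercepts and slopes; the sketch below is a deliberately simplified fixed-effects analogue in python, run on simulated data with hypothetical effect sizes (no random effects), intended only to show how the interaction term is coded and estimated.

```python
import numpy as np

rng = np.random.default_rng(1)

# simulated long-format data: 200 participants x 8 name stimuli (assumed layout)
n_participants, n_stimuli = 200, 8
scarcity = np.repeat(rng.integers(0, 2, n_participants), n_stimuli).astype(float)
familiar = np.tile(np.r_[np.zeros(4), np.ones(4)], n_participants)

# generative model: familiarity bias of ~1.2 scale points, no true interaction
liking = (3.8 + 1.2 * familiar + 0.0 * scarcity
          + 0.0 * scarcity * familiar
          + rng.normal(0, 1.5, n_participants * n_stimuli))

# design matrix: intercept, scarcity, familiar, scarcity x familiar
X = np.column_stack([np.ones_like(liking), scarcity, familiar, scarcity * familiar])
beta, *_ = np.linalg.lstsq(X, liking, rcond=None)
b_familiar, b_interaction = beta[2], beta[3]
```

a non-zero interaction coefficient in the hypothesized direction would indicate that scarcity magnified the familiarity preference; the paper's reported interaction estimates (table 2) were all small and non-significant.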
in addition, as expected, there were no main effects of scarcity on ratings (b's from -0.26 to 0.03; p's from .093 to .771; pseudo-r2's from .0004 to .04). recalling an experience of scarcity did not cause participants to rate all stimuli as more or less preferable or good, relative to the control condition. had this main effect been observed, it might have suggested a different psychological effect of scarcity than hypothesized: that scarcity makes people like or dislike all stimuli more, on top of any heightened preference contrasts between subsets of stimuli. thus, our stimuli and scarcity manipulation mostly conformed with our reasoning about how each would function. our focal hypothesis—that scarcity would magnify preferences for familiar over unfamiliar objects—did not receive support (b's from -0.30 to 0.11; p's from .155 to .910; pseudo-r2's from .00008 to .09). figure 1 shows scatterplots and means across conditions for all dependent variables. table 3 lists the means and standard deviations of ratings for each group and dependent variable. in general, means are quite consistent across experimental groups. any apparent moderation of ratings by experimental group appears to come from participants in the scarcity condition disliking unfamiliar objects more, rather than liking familiar options more. in fact, participants in the control condition typically reported more liking of familiar objects than those in the scarcity condition.

table 3
mean (sd) for unfamiliar and familiar stimuli across two experimental conditions in study 1

condition   stimulus level   nonsense syllables   female given names   male given names   surnames
control     0%               3.27 (1.43)          –                    –                  –
            47–53%           3.89 (1.37)          –                    –                  –
            100%             4.64 (1.48)          –                    –                  –
            unfamiliar       –                    3.83 (1.82)          3.41 (1.79)        4.03 (1.65)
            familiar         –                    5.15 (1.53)          4.72 (1.59)        4.61 (1.36)
scarcity    0%               3.09 (1.44)          –                    –                  –
            47–53%           3.78 (1.50)          –                    –                  –
            100%             4.60 (1.60)          –                    –                  –
            unfamiliar       –                    3.72 (1.87)          3.21 (1.86)        4.12 (1.64)
            familiar         –                    4.74 (1.66)          4.51 (1.64)        4.60 (1.46)

note. "–" denotes a stimulus level that does not apply to that dependent variable. the three levels of nonsense syllables reflect association rates between syllables and english words made by participants (n = 15) reported in glaze (1928).

exploratory results

what proportion of participants preferred familiarity? following a reviewer's suggestion, we checked the proportion of participants whose personal preferences for familiarity matched the normative preference. to do so, we examined the distribution of random effects of familiarity, calculating the percentage of participants with a random effect greater than 0. after that, we re-ran the models using only participants whose personal preference matched the normative preference. we did this for all four outcomes. across all outcomes, a majority of participants' personal preferences matched the normative preference for familiar over unfamiliar stimuli (from 77% to 96%). subgroup analyses examining only participants who preferred familiar over unfamiliar stimuli did not yield substantively different results from the main analyses. the critical interaction between familiarity and scarcity remained non-significant (p's from .149 to .879). these results suggest that focusing on individual versus normative preference for familiarity does not explain our null results.

bootstrapped equivalence test. though results were inconsistent with our hypothesis, failure to reject a null hypothesis is not equivalent to demonstrating evidence in favor of the null. to argue in favor of the null, one would need to show either that the results are more consistent with some prior belief about the distribution of the data (i.e., bayesian analysis) or that the observed effect falls outside the range of effect sizes one considers worth studying (i.e., a smallest effect size of interest, as in equivalence tests).
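the general idea of bootstrapping an effect size to characterize its plausible range can be sketched as follows: resample the data with replacement, re-estimate the variance uniquely explained by the interaction term, and take percentile confidence limits. this is a simplified fixed-effects illustration on simulated data with no true interaction, not the paper's multilevel bootstrap; all names and settings here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def interaction_r2(y, scarcity, familiar):
    """Variance uniquely attributable to the scarcity x familiar term,
    computed as the drop in residual sum of squares relative to total SS."""
    X_full = np.column_stack([np.ones_like(y), scarcity, familiar, scarcity * familiar])
    X_reduced = X_full[:, :3]

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return resid @ resid

    total_ss = ((y - y.mean()) ** 2).sum()
    return (rss(X_reduced) - rss(X_full)) / total_ss

# simulated ratings with a familiarity effect but no true interaction
n = 800
scarcity = rng.integers(0, 2, n).astype(float)
familiar = rng.integers(0, 2, n).astype(float)
y = 4.0 + 1.0 * familiar + rng.normal(0, 1.5, n)

boot_r2 = []
for _ in range(1000):
    idx = rng.integers(0, n, n)          # resample rows with replacement
    boot_r2.append(interaction_r2(y[idx], scarcity[idx], familiar[idx]))
ci_low, ci_high = np.percentile(boot_r2, [2.5, 97.5])
```

because the reduced model is nested in the full model, the r-squared increment is non-negative by construction, so for a null effect the percentile interval piles up just above zero.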
we did not have a strong prior about the effect size or a smallest effect size of interest, so instead we bootstrapped effect sizes (r2's) for the key interaction test for our four outcome variables. the bootstrapped estimate and confidence intervals provide a sense of what the true effect is likely to be, and other researchers may decide whether effects in this range are worth pursuing. we started the bootstrapping using the full models from the confirmatory hypothesis tests. if models consistently failed to converge, random effects were dropped until the models consistently converged. this process resulted in dropping random effects of experimental condition and stimulus familiarity for the surnames outcome and for all pairwise comparisons between syllables outcomes. for surnames, the average r2 was .002, 95% ci [1.76e-6, .01]; for female given names, .11, 95% ci [2e-4, .36]; for male given names, .04, 95% ci [3.75e-5, .18]; for high versus low syllables, .0005, 95% ci [4.98e-7, .002]; for medium versus low syllables, .0002, 95% ci [1.59e-7, 7e-4]; and for high versus medium and low syllables, .001, 95% ci [1.24e-6, .005]. clearly, the inclusion or exclusion of random effects made a large difference in the estimates and confidence intervals. outcomes with simpler models indicated that the key interaction was unlikely to account for very much variance at all. outcomes with the full model (i.e., female and male given names) indicated that the key interaction could account for very little variance or a considerable portion of the variance. these results suggest that our samples were not sufficiently powered to estimate the interaction's effect size using the full model.

discussion

recalling an experience of scarcity, versus a recent experience, did not make participants more strongly prefer familiar to unfamiliar objects. these results do not appear to be due to poor stimulus selection or to alternative psychological processes of scarcity.
all but one set of stimuli successfully recreated the normative preference for familiarity, and overall rating patterns did not vary across experimental conditions. in addition, manual inspection of written responses to the manipulation showed that most participants (99%) wrote a relevant response. it is possible that our manipulation did not work, but our lack of a manipulation check prevents probing this possibility. the manipulation has previously been found to successfully manipulate felt scarcity in the same population we used (roux et al., 2015) and follows the format of other successful manipulations of broad social constructs and mindsets (e.g., social power; galinsky, gruenfeld, & magee, 2003; kraus, chen, & keltner, 2011). hence, though a different manipulation of scarcity might better test our hypothesis, the present results seem more consistent with the null hypothesis. in the following studies, we tested whether alternative manipulations of scarcity may produce our predicted effect. we also tested the possibility that the stimuli we used (names and nonsense syllables) are inappropriate for testing our hypothesis. finally, we tested whether the predicted effect is ill-suited to brief experimental methods (e.g., because the effect unfurls over time) and is instead better tested with an individual-difference approach.

study 2: a different manipulation of scarcity

after finishing study 1, we became aware of a study by litt, reich, maymin, and shiv (2011) that claimed to show our effect of interest. in two studies, the authors found that, under increased time pressure, participants were more likely to select a strategy associated with a stimulus they had previously been made more familiar with (i.e., an incidentally familiar strategy), even though the more familiar strategy was less helpful for their goal completion.
based on these results, the authors concluded that scarcity (here, of time) increased preference for familiarity. however, by varying the utility of strategies for goal completion, the authors added a factor to their design, a factor for which they did not test all levels. the authors studied only the condition where the familiar is less helpful than the unfamiliar (familiar < unfamiliar), not where the two are equal in helpfulness (familiar = unfamiliar) or familiar is more helpful than unfamiliar (familiar > unfamiliar). thus, the studies do not indicate what the baseline preference for familiarity is across different utilities and whether the difference observed in the reported studies results from time pressure increasing preference for familiarity or lack of time pressure opening people up to the unfamiliar when it is more helpful. these studies demonstrate only that under different amounts of time pressure participants, on average, preferred a more familiar object at different rates. they do not provide information on whether participants’ behavior under time pressure represents a deviation from standard rates of preference for familiarity. to fully understand whether scarcity increases preference for the familiar, we conducted a similar study to litt et al. (2011) that examined choices across all possible utilities (i.e., familiar > unfamiliar, familiar = unfamiliar, familiar < unfamiliar). in particular, we adapted the general experimental design of manipulating scarcity (here, as financial pressure, instead of as time pressure), the familiarity of possible strategies for completing the target goal (here, as given names, instead of as primed familiarity), and the degree of helpfulness of possible strategies (here, where the more familiar strategy could be more, equal, or less helpful, instead of only less helpful). thus, this study should not be considered a replication of litt et al. 
(2011) but instead a conceptual replication and extension (lebel et al., 2019). the pre-registration form, study materials, and data are available here: https://osf.io/2tykn/.

method

after designing the experiment, we, the authors, disagreed about the likelihood it would show the hypothesized effect and so devised a stopping rule for participation based on how much money we were willing to spend on the study. we decided to first collect data from 66 participants (33 participants per between-subjects condition); inspect the condition means; and, if the means were in the hypothesized order, proceed to collect data until we reached our informal lab standard of 100 participants per between-subjects condition. in total, 66 participants were recruited from amazon's mechanical turk and paid $0.40 for completing a 4-minute study. their demographic characteristics matched typical samples on mturk (mostly white, mostly men, in their mid-30s, with some college education, and earning a relatively low income; buhrmester et al., 2011) and are reported in full in table 1. after consenting to participate, participants were randomly assigned to a scarcity or control manipulation. participants then read a passage instructing them to imagine a hypothetical scenario for one minute. in the scarcity condition, participants read the following passage, similar to other passages used to manipulate scarcity (e.g., mani et al., 2013):

you're short for rent this month. you need about $1,000 to make it. imagine you have the opportunity to win this amount by playing some small bets online. you are offered six bets, but they are paired up into three pairs—and you have only enough money to choose one of the bets in each pair, for a total of three bets. whichever three you play, winning all three guarantees you at least $1,000. the bets are presented on the following pages. which three do you choose?
participants in the control condition read the following passage, which we designed to trigger a sense of abundance or non-scarcity:

every few months you like to play some small bets online and treat yourself to something nice with whatever you win. imagine that this month, you're offered six bets, but they are paired up into three pairs—and you can only choose one of the bets in each pair, for a total of three bets. whichever three you play, winning all three guarantees you at least $1,000. the bets are presented on the following pages. which three do you choose?

an embedded, invisible timer in the page required participants to spend at least thirty seconds reading and imagining the passage they were shown. after thirty seconds had passed, participants could advance to the next screen, where they were presented with three pairs of bets. the bets asked participants to guess the rank of a male or female given name's frequency of assignment, within ±20 positions, across all us newborns in the next calendar year. the names used were the same as in study 1 (familiar: isabella, sophia, emma, olivia, jacob, ethan, michael, william; unfamiliar: lilith, charleigh, dania, savannah, truman, eliezer, reuben, bailey). for each pair of bets, participants saw a new pair of names, always chosen so that one was familiar and the other unfamiliar. in total, each subject saw six of 16 possible names. within each pair of bets, names were matched on gender. participants were randomly assigned to see three female pairs, two female pairs and one male pair, one female pair and two male pairs, or three male pairs. thus, for two participants who each saw two female pairs and one male pair, one subject could see {isabella vs. lilith, sophia vs. charleigh, jacob vs. truman} and the other, {sophia vs. dania, olivia vs. savannah, michael vs. bailey}.
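the pairing-and-payout scheme of this method can be sketched in code. this is an illustrative reconstruction, not the authors' experiment script: the name pools are those listed above, and the payouts ($450/$350, $400/$400, and $350/$450 for the three pay-rate conditions) are taken from the bet description later in this method.

```python
import random
from itertools import product

FAMILIAR = {"female": ["isabella", "sophia", "emma", "olivia"],
            "male": ["jacob", "ethan", "michael", "william"]}
UNFAMILIAR = {"female": ["lilith", "charleigh", "dania", "savannah"],
              "male": ["truman", "eliezer", "reuben", "bailey"]}
# payout to the (familiar, unfamiliar) option in each pay-rate condition
PAYOUTS = {"fam>unfam": (450, 350), "fam=unfam": (400, 400), "fam<unfam": (350, 450)}

def make_bets(rng):
    """build three gender-matched familiar-vs-unfamiliar pairs and
    randomly assign each pair to one of the three pay rates."""
    n_female = rng.choice([3, 2, 1, 0])  # 3F, 2F+1M, 1F+2M, or 3M
    genders = ["female"] * n_female + ["male"] * (3 - n_female)
    rates = rng.sample(list(PAYOUTS), 3)  # randomized pay-rate order
    pools = {g: (rng.sample(FAMILIAR[g], 3), rng.sample(UNFAMILIAR[g], 3))
             for g in ("female", "male")}
    return [(pools[g][0].pop(), pools[g][1].pop(), g, r)
            for g, r in zip(genders, rates)]

# every choice pattern wins more than $1,000, and choosing all familiar
# or all unfamiliar options yields the same total
totals = {c: sum(PAYOUTS[r][i] for r, i in zip(PAYOUTS, c))
          for c in product([0, 1], repeat=3)}  # 0 = familiar, 1 = unfamiliar
print(sorted(set(totals.values())))  # → [1100, 1200, 1300]
```

the `totals` check confirms the design property stated in the method: all combinations exceed $1,000, and the all-familiar and all-unfamiliar totals are both $1,200, so there is no monetary incentive to prefer one option type.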
each pair of names was randomly assigned to one of the three pay rates (familiar > unfamiliar, familiar = unfamiliar, familiar < unfamiliar), and the order in which the three pay rates were presented was randomized for each participant. this degree of randomization was necessary to remove any potential order or pairing confounds from the results. each pair of bets was presented as follows:

which do you play? (both names are in the top 1000 most popular names.)

earn [$400, $450, $350] if you guess the rank, somewhere between 1 and 1000, of the name [familiar name] relative to other names given to us newborns next year (within 20 rank positions).

earn [$400, $350, $450] if you guess the rank, somewhere between 1 and 1000, of the name [unfamiliar name] relative to other names given to us newborns next year (within 20 rank positions).

all possible combinations of winnings totaled more than $1000, as indicated in the initial instructions participants read ($400 + $450 + $350 = $1200; $400 + $450 + $450 = $1300; $400 + $350 + $350 = $1100). totals from winning all three bets were equal if a subject chose all the familiar ($400 + $450 + $350 = $1200) or all the unfamiliar bets ($400 + $350 + $450 = $1200), so there was no monetary incentive to prefer one option type over the other. after selecting the three bets they would play, participants completed the same demographic items as in study 1 (race, gender, education, income, subjective ses) and an embedded attention check (i.e., "please select 'strongly agree' for this item."). following our pre-registration, participants who failed the attention check were removed from all analyses (n=2), leaving a final sample size of 64 (ncontrol=32, nscarcity=32).

results

as pre-registered, after collecting data from sixty-six participants, we inspected means across conditions to determine whether participants' choices were in the predicted directions.
overall, participants preferred to bet on the more familiar option, even when it was worth less than the unfamiliar option. figure 2 plots means and standard errors of choice across experimental condition and bet payout. contrary to our hypothesis that scarcity would increase preference for familiarity, participants in the scarcity condition appeared to less strongly prefer the familiar option across all bets (scarcity: mchoosefamiliar=2.06, sd=0.91; control: mchoosefamiliar=2.31, sd=0.82) and to show a stronger decline in preference for the familiar bet across payouts (scarcity: from 84% choosing familiar in familiar > unfamiliar to 53% choosing familiar in familiar < unfamiliar; control: from 88% choosing familiar in familiar > unfamiliar to 59% choosing familiar in familiar < unfamiliar). based on these patterns and our pre-registration, we ceased data collection.

figure 2. bars are standard errors.

exploratory analyses

planned analyses. per a reviewer's suggestion, we conducted our planned statistical analyses on the data from 64 participants. consistent with the qualitative inspection, participants, on average, preferred the familiar bet to the unfamiliar bet across all conditions (b=3.33, z=3.72, p<.001, or=27.94, 95% ci [4.83, 161.49]). this preference was not significantly stronger in the control versus scarcity condition (b=-0.73, z=-0.61, p=.541, or=0.48, 95% ci [0.05, 5.02]), and it varied linearly across the bet-worth conditions such that the familiar option was chosen most often when it was worth more than the unfamiliar option and less often when the two were equal in value or the unfamiliar option was worth more (b=-1.00, z=-3.01, p=.003, or=0.37, 95% ci [0.19, 0.71]).
finally, the interaction between scarcity versus control condition and bet worth was not significant (b=0.08, z=0.17, p=.866, or=1.09, 95% ci [0.41, 2.85]), suggesting that the scarcity condition did not make participants more strongly prefer familiarity even when it was not in their interest to prefer familiarity.

what proportion of participants preferred familiarity? following study 1, we checked the proportion of participants whose personal preferences for familiarity matched the normative preference and re-ran the main analyses using only these participants. one hundred percent of participants showed the normative preference in their personal preferences. hence, restricting analyses did not eliminate any deviant participants, and results remained as reported above, suggesting that scarcity did not make participants more strongly prefer familiarity when it was not in their interest to prefer familiarity.

equivalence test. the odds-ratio effect size (or=1.09) and its 95% confidence interval ([0.41, 2.85]) for the interaction between scarcity and familiarity from the full model provide a simple check of what effect sizes can be ruled out from the present data. the confidence interval covers a very large range of effect sizes, including both very large negative effects (a lower bound of 0.41, indicating that, on average, participants in the scarcity condition less heavily favored the familiar bet relative to participants in the control condition as its worth decreased) and very large positive effects (an upper bound of 2.85, indicating that, on average, participants in the scarcity condition more heavily favored the familiar bet relative to participants in the control condition as its worth decreased). thus, these results are not very informative about the range of effect sizes that can plausibly be ruled out. this is to be expected from the small sample size.
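the odds ratios reported in these analyses are the exponentiated logistic-regression coefficients. a quick sketch of the conversion, using the reported b values (the interaction term is printed without an expected value because the reported b of 0.08 is itself rounded):

```python
import math

def to_odds_ratio(b):
    # a logistic-regression coefficient b is a log odds ratio, so or = exp(b)
    return math.exp(b)

print(round(to_odds_ratio(3.33), 2))   # → 27.94  (overall familiarity preference)
print(round(to_odds_ratio(-0.73), 2))  # → 0.48   (control vs. scarcity)
print(round(to_odds_ratio(-1.00), 2))  # → 0.37   (linear bet-worth effect)
print(to_odds_ratio(0.08))             # interaction; reported as or=1.09
```

the same transformation applies to the confidence limits, which is why the interval for the interaction is reported in odds-ratio units ([0.41, 2.85]) rather than on the log-odds scale.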
discussion

as in study 1, we failed to reject the null hypothesis per the conditions stipulated in our preregistration. thus, we again failed to demonstrate that scarcity magnifies the preference for familiarity. admittedly, our manipulation was slightly non-intuitive and lacked high ecological validity (who is short on rent but has enough money to place somewhat large bets?), and that may have impacted results. despite this, it was internally valid in that the names used showed the expected familiarity bias. moreover, being short on rent was a common scarcity experience described in the written responses to study 1's manipulation of scarcity. perhaps the problem emanated from our control condition, winning extra money for a treat. but given that participants in study 1 considered being short on money for critical things (e.g., rent, textbooks, medical expenses) to be experiences of scarcity, trying to win money for a non-essential luxury (a treat) would seem to be a good conceptual opposite of what mturkers consider experiences of scarcity. as we did not include a manipulation check, it is not possible to test how well the manipulation induced scarcity for participants. another possibility is that our stimuli are inappropriate. maybe our hypothesized effect occurs for only a subset of all possible stimuli, and given names lie outside that subset. though we were not sure what that subset would be, the stimuli used by zhu and ratner (2015), who showed that scarcity magnifies individual preferences, would appear to be appropriate. hence, we adapted an experimental design from zhu and ratner (2015) for a subsequent study.

study 3: alternative stimuli

having observed two non-significant results, we thought it best to try a manipulation from the study that reported the result that inspired our own. perhaps the two manipulations of scarcity we had used, though apparently valid in prior work, were inappropriate for our research question.
in addition, the stimuli we used may have been inappropriate, whereas stimuli from the original report should be appropriate. hence, we adapted the experimental design of study 1 from zhu and ratner (2015). in particular, we used the same "buying groceries at the market" paradigm and altered the kinds of groceries available in order to match our theoretical question (both described further below). thus, as stated in the introduction, this study should not be considered a replication of zhu and ratner (2015), but instead a generalizability study extending to normative preferences for familiarity (lebel et al., 2019). the pre-registration form, study materials, and data are available here: https://osf.io/4nxrb/.

method

zhu and ratner (2015) used the following procedure: first, participants were asked to report their preferences for four flavors of yogurt (as a rank order and as a rating scale from 0=not at all to 100=very much). participants were then asked to imagine they were shopping for groceries and encountered a "pick any 4 yogurts for $1" sale on yogurt. participants were randomly assigned to one of two conditions: resource scarcity, wherein only four of each yogurt flavor were available, or resource abundance, wherein forty of each yogurt flavor remained. finally, participants selected how much of each flavor they wanted. examining differences between participants' favorite (=rank 1) and non-favorite yogurts (=all other ranks), zhu and ratner (2015) found that participants in the scarcity, versus abundance, condition reported a larger difference between their favorite and non-favorite flavors for both liking and share of chosen yogurts. we altered this procedure as follows: first, we used fruits as stimuli, rather than yogurt flavors, as information on fruit familiarity, but not yogurt flavor familiarity, was available.
second, we did not ask participants for their preferences regarding the fruit prior to the manipulation but instead selected fruit to vary in their normatively defined familiarity. we describe the experiment in greater detail below. whereas our lab normally collects 100 participants per between-subjects condition, we decided to increase the number to 150 for the added statistical power. thus, we recruited three hundred participants from amazon's mechanical turk (paying $0.25 for a two-minute study). their demographic characteristics matched typical samples on mturk (mostly white, in their mid-30s, with some college education, and earning a relatively low income; buhrmester et al., 2011) and are reported in full in table 1. as we did not have a readily available dataset of yogurt flavor consumption or production, we used fruit as stimuli instead of yogurt flavors. we thought the use of fruit was justified because fruit are a kind of food typically consumed as a snack, like yogurt, and are considered healthy, like yogurt. in addition, zhu and ratner (2015) used a variety of stimuli, both food and non-food, with no apparent heterogeneity in effect presence. because we wanted to examine whether the effect extended to nomothetic preferences, we did not ask participants for their fruit preferences before choosing. instead, we chose fruit that varied in normative familiarity. we selected fruit based on the amount consumed in the us per year (usda, 2016), how highly ranked they were according to online ranking websites ("delicious" from ranker.com, 2018; "favorites" from thetoptens.com, 2018a; "delicious" from thetoptens.com, 2018b), and the average calories per serving of each fruit (usda, 2016). in addition, we also sought to select fruit that would be similarly easy to eat (e.g., whether they need to be washed before eating or peeled to eat).
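the "average voted rank" reported in table 4 is the mean of the three website ranks, and the spearman correlation between consumption rank and average voted rank reported in the table's note can be reproduced from the table's numbers. a minimal sketch (values transcribed from table 4; rho is computed with the rank-difference formula, valid here because neither variable has ties):

```python
# consumption rank and the three site ranks (top tens favorites, top tens
# delicious, ranker delicious) for each fruit, transcribed from table 4
ranks = {
    "apples":       (2,  (5, 5, 8)),
    "bananas":      (1,  (4, 6, 5)),
    "cherries":     (10, (11, 42, 11)),
    "grapefruits":  (9,  (25, 46, 28)),
    "grapes":       (4,  (6, 9, 2)),
    "oranges":      (3,  (9, 14, 4)),
    "peaches":      (7,  (10, 12, 6)),
    "pears":        (8,  (15, 47, 13)),
    "pineapples":   (6,  (7, 4, 9)),
    "strawberries": (5,  (1, 2, 1)),
}

avg_voted = {f: sum(sites) / 3 for f, (_, sites) in ranks.items()}

# spearman's rho via the rank-difference formula (no ties in either variable)
order = sorted(ranks, key=lambda f: avg_voted[f])
voted_rank = {f: i + 1 for i, f in enumerate(order)}
n = len(ranks)
d2 = sum((ranks[f][0] - voted_rank[f]) ** 2 for f in ranks)
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(round(rho, 2))  # → 0.77
```

this matches the rho of .77 reported in table 4's note, confirming that normative rated preference and actual consumption track each other closely in these data.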
based on these criteria, we selected apples and bananas as the more familiar fruit and oranges and peaches as the less familiar fruit. data for all the fruit we considered are shown in table 4. the fruit considered for the study were constrained by the kinds of fruit on which the usda provided data. we considered all fruit for which the usda provided data (except lemons, which are rarely eaten as a snack in the us, where the study was conducted). following zhu and ratner (2015), participants were told they were going to the grocery store to buy some snacks for the day, and that there was a sale on fruit at the grocery store. the grocery store was allowing customers to buy four fruit consisting of any combination of apples, bananas, peaches, and oranges for $1. customers could buy four apples for $1; two bananas and two peaches for $1; one of each fruit for $1; or any other combination of the four fruit for $1. the conditions participants had been randomly assigned to varied by the amount of fruit available.

table 4
criteria for selecting fruit stimuli for study 3

                  consumption              subjective ratings                                additional considerations
fruit             lbs/year/    rank        top tens    top tens    ranker      average      calories     preparation
                  capita                   favorites   delicious   delicious   voted rank   (1 serving)
apples*           18.5         2           5           5           8           6            72           wash
bananas*          27.5         1           4           6           5           5            105          peel
cherries          1.2          10          11          42          11          21.33        87           wash
grapefruits       1.9          9           25          46          28          33           82           peel
grapes            8.1          4           6           9           2           5.67         104          wash
oranges*          9.2          3           9           14          4           9            62           peel
peaches*          2.9          7           10          12          6           9.33         59           wash
pears             2.8          8           15          47          13          25           95           wash
pineapples        7.3          6           7           4           9           6.67         82           cut
strawberries      8            5           1           2           1           1.33         46           wash

note. consumption data are from 2016 in the us. subjective rating data were accessed in 2018. spearman's rho between consumption rank and average voted rank was .77 (p=.01), indicating that normative rated preference and actual consumption were highly related. fruits marked with an asterisk were selected for use in the study.

in the scarcity condition, there were only four of each fruit available. in the control condition, there were 40 of each fruit available. participants indicated how many of each fruit they wanted by writing a number from one to four in empty boxes next to each fruit. the experiment was programmed such that only numbers could be entered into the boxes, and participants could not progress in the experiment if the sum of the numbers entered did not equal four. after indicating their choices, participants completed an instructional manipulation check ("in the scenario you just read, about how many of each kind of fruit were available at the grocery store?" with ≤ 5, 10–20, and > 35 as possible responses) and reported demographic characteristics (income, macarthur ladder, education, race, gender), with an attention check embedded in the demographics section (e.g., "please select 'strongly agree' for this item."). per our pre-registration, participants who failed the attention check (n=10) or answered the instructional check incorrectly given their experimental condition (n=41) were excluded from all analyses. these exclusions reduced the final sample size to 249 individuals (ncontrol=108, nscarcity=141).

results

confirmatory results

due to the clustered nature of our data, we used a multilevel model for analysis. participants' fruit choices were regressed on scarcity condition (=1; dummy-coded), whether a fruit was familiar (=1; dummy-coded), and their interaction.
we included random intercepts and random slopes of fruit familiarity within participants, and random intercepts and random slopes of experimental condition within fruits (following the logic of study 1). note that the random effects within fruits depart from the pre-registration's specification of only random effects within participants. we made an error in the pre-registration and report the more appropriate model here (though exclusion of random effects within fruits does not change the results). we further departed from the preregistration by using a gaussian, instead of binomial, distribution. the dependent variable is a count variable, so a binomial distribution would have been impossible to use. we did not use a poisson distribution because our data violate a key assumption of it: that the variable is unbounded. participants could not select more than four of any kind of fruit. as expected, participants chose more familiar (m=1.20, sd=1.02) than unfamiliar fruit (m=0.80, sd=0.99), b=0.40, t(4.88)=4.62, p=.006, pseudo-r2=.82. in addition, the scarcity condition did not affect mean fruit choices across conditions (scarcity: m=1.00, sd=1.08; control: m=1.00, sd=0.95), b=0.00, t(184.7)=0.00, p=1.00, pseudo-r2=.00. contrary to our hypothesis, preference for familiar fruit was not stronger in the scarcity than control condition, b=-0.02, t(184.9)=-0.16, p=.875, pseudo-r2=.0001. consistent with study 1, participants showed nearly identical preferences for familiar (scarcity: m=1.19, sd=1.05; control: m=1.20, sd=0.98) over unfamiliar fruits (scarcity: m=0.81, sd=1.07; control: m=0.80, sd=0.88) across our two experimental conditions. figure 3 displays these results.

figure 3. bars are standard errors.

exploratory results

what proportion of participants preferred familiarity? following study 1, we checked the proportion of participants whose personal preferences for familiarity matched the normative preference and re-ran the main analyses using only these participants.
eighty-two percent of participants showed the normative preference in their personal preferences. restricting analyses to participants who showed the normative preference did not substantively change the results: the interaction between scarcity and familiarity remained non-significant (p=.580).

bootstrapped equivalence test. following study 1, we used 5000 bootstrapped resamples to estimate pseudo-r2 values for the key interaction. the mean r2 was .001, 95% ci [1.62e-6, .007]. this suggests we can rule out pseudo-r2's larger than .007 as plausible effect sizes.

discussion

in a third experiment, we failed to show that situational scarcity magnifies the preference for familiarity. based on prior work, our experimental manipulation would appear to be internally valid, and our stimuli clearly are, as they replicated the classic familiarity bias. in addition, our increased sample size allowed us to find that the key interaction between scarcity and familiarity is likely very small, in fact much smaller than typical effect sizes in social–personality psychology (richards et al., 2003). whereas studies 1 and 2's results may be explained in terms of methodological issues, the results of study 3 seem to more clearly suggest that situationally induced scarcity does not magnify preference for familiarity, although we cannot rule out that the manipulation did not induce scarcity for participants, as we did not include a manipulation check. still, one possibility that remains untested is that situationally induced scarcity does not alter familiarity bias, but longer-term scarcity does. we tested this possibility in study 4.

study 4: individual differences

whereas the apparent null effects in studies 1–3 suggest that experimentally induced (i.e., situational) scarcity does not magnify situational preference for familiarity, it remains possible that longer-term experiences of scarcity may be correlated with higher preference for familiarity.
this would be the case if scarcity's effect on preference for familiarity builds up over time; repeated exposure to scarcity would then be necessary in order to observe differences in preference for familiarity. hence, in a final study, we examined whether individual differences in both perceived time and material scarcity correlated with individual differences in preference for novelty. the preregistration form, study materials, and data are available here: https://osf.io/spzv4/.

method

sample. data came from the first phase of the measuring morality dataset maintained by the kenan institute for ethics at duke university (available at https://kenan.ethics.duke.edu/attitudes/resources/measuring-morality/). this was a nationally representative sample of adults in the united states. the full sample includes data from 1,519 individuals (full details of demographics are reported in table 1). we selected items based on our own assessment of whether they measured our constructs of interest. we pre-registered that we planned to combine the items into general indices but that items might be dropped based on low correlations with other items in the index.

measures. to assess perceived material scarcity, we used an index of the following items: "agree that i just don't have enough money to live the life i would like to live." (code: ppl10018; from 1=strongly agree to 5=strongly disagree; reverse-scored), "agree that generally, i live from paycheck to paycheck." (code: ppfs0684; from 1=strongly disagree to 4=strongly agree), "how would you rate your own personal finances these days?" (code: ppfs0679; from 1=excellent to 4=poor), and "are your personal finances getting better these days, or worse?" (code: ppfs0680; 1=better, 2=worse, 3=same; recoded so that worse and same switched numerical values, and then reverse-scored). all four items were standardized prior to averaging and showed acceptable internal consistency (α=.73).
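the index construction described here (z-score each item, then average) and the internal-consistency check can be sketched as follows. this is a generic illustration on simulated data, not the authors' analysis code; the function names are ours.

```python
import numpy as np

def cronbach_alpha(items):
    """cronbach's alpha for an (n_respondents, k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

def scale_index(items):
    # standardize each item, then average across items into one index
    items = np.asarray(items, dtype=float)
    z = (items - items.mean(axis=0)) / items.std(axis=0, ddof=1)
    return z.mean(axis=1)

# simulated respondents: one latent trait plus item-specific noise,
# which yields an expected alpha of about .80 for four items
rng = np.random.default_rng(0)
trait = rng.normal(size=500)
items = trait[:, None] + rng.normal(size=(500, 4))
print(round(cronbach_alpha(items), 2))
```

the same check motivates the decision reported later for the familiarity items: when alpha is low (.54 there), averaging the items into a single index is hard to justify, so the items are analyzed separately.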
to measure perceived time scarcity, we used an index of the following items: "agree that life is so busy that i find i have less time to spend with family and friends." (code: ppl10008; from 1=strongly agree to 5=strongly disagree; reverse-scored), "agree that it is hard for me to find the time to be involved in local/community matters." (code: ppl10009; from 1=strongly agree to 5=strongly disagree; reverse-scored), and "agree that it is becoming increasingly difficult to find the time to relax and unwind." (code: ppl10012; from 1=strongly agree to 5=strongly disagree; reverse-scored). although these items differ from canonical manipulations of time scarcity wherein participants have more or less time to complete a task (shah et al., 2012), we thought that as individual-difference measures of time scarcity they are sufficient. an experience of scarcity occurs when a person lacks sufficient resources to meet a goal or need. our three items all referred to normatively valued goals: personal relationships, community involvement, and relaxing/leisure. thus, our items appear to satisfy the basic definition of scarcity (lacking a resource, i.e., time, for valued goals). moreover, people might vary in the extent to which they value these goals, but that is also true of the small monetary awards typical of lab studies (cf. shah et al., 2012) and, thus, not unique to our items. all three items were standardized prior to averaging and showed good internal consistency (α=.81). to measure preference for familiarity, we used the following three items: "i think it is important to do lots of different things in life.
i always look for new things to try." (code: sv6; from 1=very much like me to 6=not like me at all), "i often try new brands because i like variety." (code: ppadopt2; from 1=strongly agree to 5=strongly disagree), and "agree that i am usually the first of my friends to try new products and services." (code: ppfs0687; from 1=strongly disagree to 4=strongly agree). although we had planned to combine these items into a single index, they showed low internal consistency (α=.54), so we used them as three separate items. note that sample sizes vary across hypothesis tests because some questions were not asked of all participants, some participants declined or refused to answer questions, and some participants reported being unsure of their response. we recoded any responses of "not asked", "refused", "not applicable", and "not sure" as missing because they were not substantive responses.

results

confirmatory results

checking preference for familiarity. to check that we had selected appropriate items, in addition to their face validity, we examined whether average responses to our three items on preference for familiarity were below the scale midpoint. that is, did participants, on average, think that "look[ing] for new things to try" was "not like me," disagree that "i often try new brands because i like variety," and disagree that they are the first of their friends to "try new products and services"? affirmations of these questions would indicate average preferences to avoid novelty, presumably in favor of seeking familiarity. to test these hypotheses, we conducted one-sample t-tests comparing the sample mean to the scale midpoint (3.5 for question 1, 3 for question 2, and 2.5 for question 3). we used one-tailed tests because we had directional predictions: that sample means would be lower than the scale midpoints. the mean of question 1 ("i think it is important to do lots of different things in life.
i always look for new things to try.”) was 3.96 (sd = 1.29), above the scale midpoint of 3.5 (p=1.00). questions 2 and 3, however, showed the expected levels. the mean of question 2 (“i often try new brands because i like variety.”) was 2.75 (sd=1.14), t(1496)=-8.33, p<.001. the mean of question 3 (“agree that i am usually the first of my friends to try new products and services.”) was 1.94 (sd=0.86), t(1086)=-21.55, p<.001. thus, two of the familiarity preference items we selected operated as expected, and one did not. scarcity and preference for familiarity. table 5 displays bivariate correlations between time and material scarcity and the three measures of preference for familiarity. evidence in favor of our hypothesis would be a statistically significant negative correlation between the scarcity and familiarity measures. no such correlations were obtained. thus, we failed to reject the null hypothesis across all six hypothesis tests.

table 5. correlations between scarcity and familiarity preference (study 4)

items for familiarity preference                                time scarcity              material scarcity
                                                               r     p     95% ci        r     p     95% ci
i think it is important to do lots of different things
in life. i always look for new things to try.                  .03   .209  [-.02, .08]  -.03   .212  [-.08, .02]
i often try new brands because i like variety.                 .08   .002  [.03, .13]    .02   .472  [-.03, .07]
agree that i am usually the first of my friends to
try new products and services.                                 .05   .113  [-.01, .11]   .00   .921  [-.06, .06]

exploratory results alternative measures of scarcity. prior research on scarcity has used traditional measures of social class (e.g., income, “mouths to feed”) as indicators of scarcity (i.e., mani et al., 2013; shah et al., 2015). the logic supporting this usage is that people with fewer resources in general are also likely to have fewer resources than they need.
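the one-tailed, one-sample t-tests against the scale midpoints described above can be sketched as follows. the authors analyzed their data in r (r core team, 2019); python/scipy is used here purely for illustration, and the responses are simulated rather than the study data.

```python
# illustrative sketch, not the study data: a one-tailed, one-sample t-test of a
# 1-5 item mean against its scale midpoint, mirroring the confirmatory analyses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=1497).astype(float)  # simulated 1-5 ratings

# directional prediction: the sample mean lies below the midpoint (3 for a 1-5 item)
res = stats.ttest_1samp(responses, popmean=3.0, alternative="less")
```

a sample mean above the midpoint under this directional alternative yields a p-value near 1, as reported for question 1.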
hence, we estimated post-hoc correlations between “mouths to feed” (household income / sqrt(family size)) and our three measures of preference for familiarity. again, scarcity, as “mouths to feed,” was uncorrelated with preference for familiarity (r’s from -.01–.05, p’s from .77–1.00). what proportion of participants preferred familiarity? following study 1, we checked the proportion of participants whose personal preferences for familiarity matched the normative preference and re-ran the main analyses using only these participants. thirty-five percent of participants disagreed that it was important to try new things in life. forty-two percent disagreed that they tried new brands because they like variety. seventy-five percent disagreed that they were the first of their friends to try new products and services. analyzing data for only participants who showed the expected preferences for familiarity, material scarcity was significantly negatively correlated with thinking it is important to try new things in life (r=.09, p=.045, n=512) and with being the first of one’s friends to try new products and services (r=-.08, p=.030, n=818), but not with trying new brands because of taste for variety (r=-.00, p=.966, n=619). these results suggest that people experiencing greater material scarcity had a relatively stronger preference for familiarity. however, the p-values were quite high, higher than would be expected if these were true effects (simonsohn et al., 2014), so we are not entirely sure they are real. time scarcity was not significantly correlated with thinking it is important to try new things in life (r=.02, p=.614, n=497) or trying new products and services before one’s friends (r=.01, p=.681, n=797), but was significantly positively related to trying new brands because of taste for variety (r=.09, p=.022, n=604).
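the “mouths to feed” adjustment (household income / sqrt(family size)) and its correlation with a familiarity item can be sketched as below. all values are simulated for illustration; none of the variable names or distributions come from the actual survey.

```python
# illustrative sketch with simulated values (not the actual survey data):
# "mouths to feed" = household income adjusted for household size, then its
# pearson correlation with a single familiarity item.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
income = rng.lognormal(mean=11.0, sigma=0.7, size=800)      # simulated incomes
family_size = rng.integers(1, 7, size=800).astype(float)    # simulated sizes
mouths_to_feed = income / np.sqrt(family_size)              # the adjustment

familiarity_item = rng.integers(1, 6, size=800).astype(float)  # simulated 1-5 item
r, p = stats.pearsonr(mouths_to_feed, familiarity_item)
```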
that is, people who reported experiencing more time scarcity exhibited a preference for novelty: they were more likely to report trying new brands because of taste for variety. if real, this effect is in the direction opposite to our prediction. equivalence test. the correlation effect sizes (r’s in table 5) and their 95% confidence intervals (95% ci in table 5) provide a simple check of what effect sizes can be ruled out from the present data. the confidence intervals cover a relatively small range of effect sizes, from r=-.06 to r=.13, suggesting that we can rule out r’s outside of this range as plausible effect sizes. in addition, relative to other effect sizes in social–personality psychology, these fall near or below the 33rd percentile, suggesting that the effect of scarcity on familiarity, assuming it is not null, is smaller than most effects in social–personality psychology. discussion taking an individual differences approach, we found that participants preferred familiarity to novelty and that scarcity, both as time and material scarcity, did not magnify this preference. consistent with study 3, an equivalence test suggested that the effect of scarcity on familiarity bias, to the extent that it exists, is very small. these results cohere with our experimental results from studies 1–3. in aggregate, these results suggest that scarcity does not increase preference for familiarity in states or in longer term attitudes. general discussion across four pre-registered studies, we failed to find evidence for the hypothesis that scarcity polarizes preferences for familiarity. three studies tested this experimentally, using diverse stimuli and manipulations. a fourth tested it using an individual differences approach. although perhaps surprising given prior research (zhu & ratner, 2015), these null results help identify a potential boundary condition of when scarcity polarizes preferences.
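the confidence intervals in table 5 can be reproduced, to rounding, with the standard fisher z transformation (an assumption on our part; the text does not state how the intervals were computed, and the per-item n is taken from the reported t-test). python is used for illustration.

```python
# fisher z confidence interval for a pearson correlation; it is an assumption
# that this standard formula is how the table 5 intervals were computed.
import math
from scipy import stats

def r_confint(r, n, level=0.95):
    """confidence interval for a correlation via the fisher z transformation."""
    z = math.atanh(r)                       # z-transformed correlation
    se = 1.0 / math.sqrt(n - 3)             # standard error of z
    crit = stats.norm.ppf(0.5 + level / 2)  # 1.96 for a 95% interval
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

lo, hi = r_confint(0.03, 1497)  # r from table 5, row 1 (time); n assumed
# rounds to (-0.02, 0.08), matching the tabled [-.02, .08]
```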
in particular, scarcity may yield this effect only at the idiographic level. when people experience scarcity, versus abundance, they may exhibit stronger preferences for things they themselves already like, and not for things that are generally liked across people. beyond the possibility that idiographic preferences are key to the predicted effect of scarcity increasing the familiarity bias, why else might we have observed these null results? we do not suspect it is an issue of study design. poor stimulus selection cannot explain these failures, as we consistently found evidence for a normative preference for familiarity (seven of nine dependent variables showed the effect). in addition, we do not expect that we poorly operationalized scarcity. for one, our operationalizations map onto the definition of scarcity pretty well (e.g., not having enough time or money or options to satisfy all of one’s desires). second, several of our operationalizations of scarcity were used in prior research that found effects of scarcity on psychological outcomes and used manipulation checks to assess the key assumption that they induced an experience or perception of scarcity (with all p’s < .001; litt et al., 2011; roux et al., 2015; zhu & ratner, 2015). of course, we cannot fully rule out the possibility that our manipulations may have failed to induce a psychological experience of scarcity, due to our not including manipulation checks. third, we used both experimental and correlational designs, suggesting the null result is not a feature of study type. one possible explanation is low power. studies 3 and 4, which had the largest sample sizes of all of our studies, suggested that the key effect was quite small, to the extent that it existed at all. in particular, study 3’s results suggested that the key effect was a pseudo-r2 of .001, 95% ci [1.62x10^-6, .007], and study 4’s results suggested that the key effect was an r from -.06–.13 (r2 from .004–.017).
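to make the low-power point concrete, the approximate sample size needed to detect a small correlation with a two-tailed test can be computed from the fisher z approximation. the sketch below uses python; r = .07 is a hypothetical value inside the study 4 interval, not a figure from the paper.

```python
# rough sample-size illustration (hypothetical r, not a figure from the paper):
# n required to detect a correlation r with a two-tailed test, via fisher z.
import math
from scipy import stats

def n_for_correlation(r, alpha=0.05, power=0.80):
    """approximate n needed to detect r, using the fisher z approximation."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for two-tailed alpha
    z_beta = stats.norm.ppf(power)           # quantile for the desired power
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

n_needed = n_for_correlation(0.07)  # on the order of 1,600 participants
```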
none of our studies was designed to detect an effect that small. to the extent that methodological issues do not explain the null results, a theoretical explanation is possible, too. in particular, it may be the case that scarcity does not have “secondary” effects in the sense that it does not impact thoughts, feelings, or behaviors that are not relevant to the immediate context in which scarcity was experienced. some recent work on scarcity has begun to suggest that scarcity may not have such “secondary” effects. for instance, camerer et al. (2018) failed to replicate the finding that a brief experience of scarcity reduced cognitive control on a subsequent, unrelated task (i.e., ego depletion; originally reported as study 1 in shah et al., 2012). in response, shah et al. (2018) replicated every study from their own 2012 paper and found that none of scarcity’s secondary effects replicated. these included the aforementioned depletion effect, as well as neglect of future demands and neglect of details helpful for future tasks, but not the immediate task. shah et al. (2018) did, however, replicate all of the “primary” effects of scarcity (i.e., greater present focus, more overborrowing). these failures to replicate suggest that scarcity’s effects may be limited to the immediate situation at hand (e.g., spending more time focusing a shot when one has limited shots) and cease when the situation changes (e.g., considering strategies for future rounds, an unrelated cognitive control task). given that we studied a secondary effect (the primary effects being typical mediators like stress, arousal, etc.), our hypothesis may have been doomed from the start. still, at least one version of the hypothesis has received empirical support (i.e., the idiographic approach; zhu & ratner, 2015), so future research should determine the robustness of these original results. conclusion we failed to find support for the hypothesis that scarcity magnifies the preference for familiarity. 
these results may help place a boundary on prior work showing similar results (zhu & ratner, 2015). at the very least, they identify for other researchers a hypothesis that is unlikely to be generative or, alternatively, demonstrate several sub-optimal tests of the hypothesis, which future researchers can know to avoid. author contact stephen antonoplis, antonoplis@berkeley.edu, department of psychology, university of california, berkeley, ca 94720, u.s.a. acknowledgements the authors thank members of the self, identity, and relationships lab for their feedback on this project. conflict of interest and funding the authors hold no known conflicts of interest relevant to this work. s. antonoplis was funded by a nsf graduate research fellowship, grant number dge 1752814. author contributions s.a. and s.c. conceived the project and designed the studies jointly. s.a. conducted data analyses and drafted the manuscript, with critical feedback from s.c. open science practices this article earned the preregistration+, open data, and open materials badges for preregistering the hypothesis and analysis before data collection, and for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement. references bargh, j. a. (1992). does subliminality matter to social psychology? awareness of the stimulus versus awareness of its influence. in r.f. bornstein & t. s. pittman (eds.), perception without awareness: cognitive, clinical, and social perspectives (pp. 236–255). guilford press. bornstein, r. f. (1989). exposure and affect: overview and meta-analysis of research, 1968–1987. psychological bulletin, 106(2), 265–289. buhrmester, m., kwang, t., & gosling, s. d. (2011).
amazon’s mechanical turk: a new source of inexpensive, yet high-quality, data? perspectives on psychological science, 6(1), 3–5. https://doi.org/10.1177/1745691610393980 camerer, c. f., dreber, a., holzmeister, f., ho, t.h., huber, j., johannesson, m., kirchler, m., nave, g., nosek, b. a., pfeiffer, t., altmejd, a., buttrick, n., chan, t., chen, y., forsell, e., gampa, a., heikensten, e., hummer, l., imai, t., … wu, h. (2018). evaluating the replicability of social science experiments in nature and science between 2010 and 2015. nature human behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z galinsky, a. d., gruenfeld, d. h., & magee, j. c. (2003). from power to action. journal of personality and social psychology, 85(3), 453–466. https://doi.org/10.1037/0022-3514.85.3.453 gino, f., & pierce, l. (2009). the abundance effect: unethical behavior in the presence of wealth. organizational behavior and human decision processes, 109(2), 142–155. https://doi.org/10.1016/j.obhdp.2009.03.003 glaze, j. a. (1928). the association value of nonsense syllables. the pedagogical seminary and journal of genetic psychology, 35(2), 255–269. https://doi.org/10.1080/08856559.1928.10532156 gorn, g., pham, m. t., & sin, l. y. (2001). when arousal influences ad evaluation and valence does not (and vice versa). journal of consumer psychology, 11(1), 43–55. griskevicius, v., ackerman, j. m., cantu, s. m., delton, a. w., robertson, t. e., simpson, j. a., thompson, m. e., & tybur, j. m. (2013). when the economy falters, do people spend or save? responses to resource scarcity depend on childhood environments. psychological science, 24(2), 197–205. https://doi.org/10.1177/0956797612451471 johnson, r. c., thomson, c. w., & frincke, g. (1960). word values, word frequency, and visual duration thresholds. psychological review, 67(5), 332. kraus, m. w., chen, s., & keltner, d. (2011). the power to be me: power elevates self-concept consistency and authenticity.
journal of experimental social psychology, 47(5), 974–980. https://doi.org/10.1016/j.jesp.2011.03.017 lebel, e. p., vanpaemel, w., cheung, i., & campbell, l. (2019). a brief guide to evaluate replications. meta-psychology, 3, 9. litt, a., reich, t., maymin, s., & shiv, b. (2011). pressure and perverse flights to familiarity. psychological science, 22(4), 523–531. https://doi.org/10.1177/0956797611400095 mani, a., mullainathan, s., shafir, e., & zhao, j. (2013). poverty impedes cognitive function. science, 341(6149), 976–980. mano, h. (1992). judgments under distress: assessing the role of unpleasantness and arousal in judgment formation. organizational behavior and human decision processes, 52(2), 216–245. mano, h. (1994). risk-taking, framing effects, and affect. organizational behavior and human decision processes, 57, 38–58. montoya, r. m., horton, r. s., vevea, j. l., citkowicz, m., & lauber, e. a. (2017). a reexamination of the mere exposure effect: the influence of repeated exposure on recognition, familiarity, and liking. psychological bulletin, 143(5), 459–498. https://doi.org/10.1037/bul0000085 mullainathan, s., & shafir, e. (2013). scarcity: why having too little means so much. times books/henry holt and co. muthukrishnan, a. v., wathieu, l., & xu, a. j. (2009). ambiguity aversion and the preference for established brands. management science, 55(12), 1933–1941. https://doi.org/10.1287/mnsc.1090.1087 oppenheimer, d. m. (2004). spontaneous discounting of availability in frequency judgment tasks. psychological science, 15(2), 100–105. piketty, t. (2014). capital in the twenty-first century. harvard university press. r core team. (2019). r: a language and environment for statistical computing. r foundation for statistical computing. https://www.r-project.org/ ranker.com. (2018). the most delicious fruits. https://www.ranker.com/list/mostdelicious-fruits/analise.dubner roux, c., goldsmith, k., & bonezzi, a. (2015).
on the psychology of scarcity: when reminders of resource scarcity promote selfish (and generous) behavior. journal of consumer research, ucv048. https://doi.org/10.1093/jcr/ucv048 shah, a. k., mullainathan, s., & shafir, e. (2012). some consequences of having too little. science, 338(6107), 682–685. https://doi.org/10.1126/science.1222426 shah, a. k., mullainathan, s., & shafir, e. (2018). an exercise in self-replication: replicating shah, mullainathan, and shafir (2012). shah, a. k., shafir, e., & mullainathan, s. (2015). scarcity frames value. psychological science, 26(4), 402–412. simonsohn, u., nelson, l. d., & simmons, j. p. (2014). p-curve: a key to the file-drawer. journal of experimental psychology: general, 143(2), 534–547. thetoptens.com. (2018a). top ten favorite fruits. https://www.thetoptens.com/favorite-fruits/ thetoptens.com. (2018b). top ten most delicious fruits. https://www.thetoptens.com/mostdelicious-fruits/ usda. (2016). fruit and tree nut yearbook tables. https://www.ers.usda.gov/dataproducts/fruit-and-tree-nut-data/fruit-andtree-nut-yearbooktables/#supply%20and%20utilization zajonc, r. b. (1968). attitudinal effects of mere exposure. journal of personality and social psychology, 9(2, pt. 2), 1–27. zajonc, r. b. (2001). mere exposure: a gateway to the subliminal. current directions in psychological science, 10(6), 224–228. zhu, m., & ratner, r. k. (2015). scarcity polarizes preferences: the impact on choice among multiple items in a product class. journal of marketing research, 52(1), 13–26. https://doi.org/10.1509/jmr.13.0451 meta-psychology, 2020, vol 4, mp.2018.884, https://doi.org/10.15626/mp.2018.884 article type: original article published under the cc-by4.0 license open data: yes open materials: yes open and reproducible analysis: yes open reviews and editorial process: yes preregistration: no edited by: daniël lakens reviewed by: k. coburn, f. d. schönbrodt analysis reproduced by: erin m.
buchanan all supplementary files can be accessed at the osf project page: https://doi.org/10.17605/osf.io/my8s6 publication bias in meta-analyses of posttraumatic stress disorder interventions helen niemeyer* freie universität berlin, department of clinical psychological intervention, germany sebastian schmid department of psychiatry and psychotherapy, charité university medicine berlin, campus mitte, germany christine knaevelsrud freie universität berlin, department of clinical psychological intervention, germany robbie c.m. van aert* tilburg university, department of methodology and statistics, the netherlands dominik uelsmann department of psychiatry and psychotherapy, charité university medicine berlin, campus mitte, germany olaf schulte-herbrueggen department of psychiatry and psychotherapy, charité university medicine berlin, campus mitte, germany * hn and rva contributed equally and share first authorship. meta-analyses are susceptible to publication bias, the selective publication of studies with statistically significant results. if publication bias is present in psychotherapy research, the efficacy of interventions will likely be overestimated. this study has two aims: (1) investigate whether the application of publication bias methods is warranted in psychotherapy research on posttraumatic stress disorder (ptsd) and (2) investigate the degree and impact of publication bias in meta-analyses of the efficacy of psychotherapeutic treatment for ptsd. a comprehensive literature search was conducted and 26 meta-analyses were eligible for bias assessment. a monte-carlo simulation study closely resembling characteristics of the included meta-analyses revealed that the statistical power of publication bias tests was generally low and that, due to characteristics of the data, methods to correct effect sizes for publication bias yielded imprecise estimates.
we recommend assessing publication bias using multiple methods, but only including methods that show acceptable performance in a method performance check that researchers first have to conduct themselves. keywords: publication bias, meta-meta-analysis, meta-analysis, posttraumatic stress disorder, psychotherapy posttraumatic stress disorder (ptsd) following potentially traumatic events is a highly distressing and common condition, with lifetime prevalence rates in the adult population of 11.7% for women and 4% for men in the united states of america (kessler, petukhova, sampson, zaslavsky, & wittchen, 2012). ptsd is characterized by the re-experiencing of a traumatic event, avoidance of stimuli that could trigger traumatic memories, negative cognitions and mood, and alterations in arousal and reactivity (american psychiatric association, 2013). the dsm criteria have been updated recently, but most research is still based on the previous versions dsm-iv-tr (american psychiatric association, 2000), dsm-iv (american psychiatric association, 1994) or dsm-iii-r (american psychiatric association, 1987). various forms of psychological interventions for treating ptsd have been investigated in a large number of studies. cognitive behavioral therapies (cbt) and eye movement desensitization and reprocessing (emdr) are the most frequently studied approaches (e.g., bisson, roberts, andrew, cooper, & lewis, 2013).
trauma-focused cognitive behavioral therapies (tf-cbt) use exposure to trauma memory or reminders and the identification and modification of maladaptive cognitive distortions related to the trauma in their treatment protocols (e.g., ehlers, clark, hackmann, mcmanus, & fennell, 2005; foa & rothbaum, 1998; resick & schnicke, 1993). non trauma-focused cognitive behavioral therapies (non tf-cbt) do not focus on trauma memory or meaning, but for example on stress management (veronen & kilpatrick, 1983). emdr includes an imaginal confrontation of traumatic images, the use of eye movements and some core elements of tf-cbt (see forbes et al., 2010). although a range of other psychological treatments exists (e.g., psychodynamic therapies or hypnotherapy), fewer empirical studies of these approaches have been conducted (bisson et al., 2013). meta-analysis methods are used to quantitatively synthesize the results of different studies on the same research question. meta-analysis has become increasingly popular, as reflected in the gradual increase of published papers applying meta-analysis methods, especially since the beginning of the 21st century (aguinis, dalton, bosco, pierce, & dalton, 2010), and it has been called the "gold standard" for synthesizing individual study results (aguinis, gottfredson, & wright, 2011; head, holman, lanfear, kahn, & jennions, 2015). results of meta-analyses are often used for deciding which treatment should be applied in clinical practice, and international evidence-based guidelines recommend tf-cbt and emdr for the treatment of ptsd (acpmh; forbes et al., 2007; nice; national collaborating centre for mental health, 2005). publication bias in psychotherapy research the validity of meta-analyses is highly dependent on the quality of the included data from primary studies (valentine, 2009).
one of the most severe threats to the validity of a meta-analysis is publication bias, which is the selective reporting of statistically significant results (rothstein, sutton, & borenstein, 2005). approximately 90% of the main hypotheses of published studies within psychology are statistically significant (fanelli, 2012; sterling, rosenbaum, & weinkam, 1995) and this is not in line with the on average low statistical power of studies (bakker, van dijk, & wicherts, 2012; ellis, 2010). if only published studies are included in a meta-analysis, the efficacy of interventions may be overestimated (hopewell, clarke, & mallett, 2005; ioannidis, 2008; lipsey & wilson, 2001; rothstein et al., 2005). about one out of four funded studies examining the efficacy of a psychological treatment for depression did not result in a publication, and adding the results of the retrieved unpublished studies lowered the mean effect estimate by 25% from a medium to a small effect size (driessen, hollon, bockting, & cuijpers, 2017). the treatments in evidence-based psychotherapy are mainly selected based on published research (gilbody & song, 2000). the scientist-practitioner model (shapiro & forrest, 2001) calls for clinical psychologists to let empirical results guide their work, aiming to move away from opinion- and experience-driven therapeutic decision making toward the use of research results in clinical practice. if publication bias is present, guidelines may offer recommendations seemingly based on apparent empirical evidence that are only erroneously supported by the results of meta-analyses (berlin & ghersi, 2005). consequently, psychotherapists who follow the scientist-practitioner model would be prompted to apply interventions in routine care that may be less efficacious than assumed and may even have detrimental effects for patients.
a re-analysis of meta-analyses in psychotherapy research for schizophrenia and depression revealed that evidence for publication bias was found in about 15% of these meta-analyses (niemeyer, musch, & pietrowsky, 2012, 2013). however, until now no further comprehensive assessment of publication bias in meta-analyses of the efficacy of psychotherapeutic treatments for other clinical disorders has been conducted. hence, the presence and impact of publication bias in psychotherapy research on ptsd, too, remains largely unknown. although trauma-focused interventions are claimed to be efficacious, their efficacy may be overestimated and might be lower if publication bias were taken into account. this in turn would result in suboptimal recommendations in the treatment guidelines and consequently also in unnecessarily high costs for the health care system (jaycox & foa, 1999; maljanen et al., 2016; margraf, 2009). due to publication bias being widespread and its detrimental impact on the results of meta-analyses (dickersin, 2005; fanelli, 2012; rothstein & hopewell, 2009), a statistical assessment of publication bias should be conducted in every meta-analysis investigating psychotherapeutic treatments. this is in line with recommendations in the meta-analysis reporting standards (mars; american psychological association, 2010) and the preferred reporting items for systematic reviews and meta-analyses (prisma; moher, liberati, tetzlaff, & altman, 2009). a considerable number of statistical methods to investigate the presence and impact of publication bias have been developed in recent years. these methods should also be applied to already published meta-analyses in order to examine whether publication bias distorts the results (banks, kepes, & banks, 2012; van assen, van aert, & wicherts, 2015). the development of publication bias methods and recommendations to apply these methods will likely make the assessment of publication bias in meta-analyses more routine.
however, research has shown that publication bias tests generally suffer from low statistical power, especially if only a small number of studies are included in a meta-analysis and publication bias is not extreme (begg & mazumdar, 1994; egger, smith, schneider, & minder, 1997; renkewitz & keiner, 2019; sterne, gavaghan, & egger, 2000; van assen, van aert, & wicherts, 2015). this raises the question whether routinely applying publication bias tests without taking into account characteristics of the meta-analysis, such as the number of included studies, is a good practice. objectives the first goal of this paper is to study whether applying publication bias tests is warranted under conditions that are representative for published meta-analyses on ptsd treatments. applying publication bias tests may not always be appropriate if, for example, statistical power of these tests is low owing to a small number of studies included in the meta-analysis. hence, we study the statistical properties of publication bias tests by conducting a monte-carlo simulation study that closely resembles the meta-analyses on ptsd treatments. the second goal of our study is to assess the severity of publication bias in the meta-analyses published on ptsd treatments. we will not interpret the results of the publication bias tests if it turns out that these tests have low statistical power. regardless of these results, we will apply multiple methods to correct effect size for publication bias to the meta-analyses on ptsd treatments. effect size estimates of these methods become less precise (wider confidence intervals), but they still provide relevant insights into whether the effect size estimate becomes closer to zero if publication bias is taken into account. method data sources we conducted a literature search following the search strategies recommended by lipsey and wilson (2001) to identify all meta-analyses published on ptsd treatments.
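the logic of such a power check can be illustrated with a toy monte-carlo simulation. this is not the authors' actual simulation design; it is a minimal sketch in python assuming a null true effect, one-sided selective publication of significant results, and egger's regression test applied to each simulated meta-analysis.

```python
# toy monte-carlo power check for a publication bias test (not the authors'
# actual design): k published studies, true effect zero, one-sided selective
# publication, then egger's regression test on each simulated meta-analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulate_meta(k, pub_prob_nonsig=0.05):
    """draw k published studies under selective publication of significant results."""
    effects, ses = [], []
    while len(effects) < k:
        se = rng.uniform(0.1, 0.4)       # per-study standard error
        d = rng.normal(0.0, se)          # observed effect; true effect is 0
        significant = d / se > 1.96      # one-sided "positive and significant"
        if significant or rng.random() < pub_prob_nonsig:
            effects.append(d)
            ses.append(se)
    return np.asarray(effects), np.asarray(ses)

def egger_p(effects, ses):
    """egger's test: t-test on the intercept of (effect/se) regressed on (1/se)."""
    res = stats.linregress(1.0 / ses, effects / ses)
    t = res.intercept / res.intercept_stderr
    return 2 * stats.t.sf(abs(t), effects.size - 2)

reps = 200
power = np.mean([egger_p(*simulate_meta(k=6)) < 0.05 for _ in range(reps)])
```

with k as small as 6, the rejection rate stays well below conventional power targets, which is the pattern motivating the authors' caution about routinely interpreting such tests.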
we screened the databases psycinfo, psyndex, pubmed, and the cochrane database of systematic reviews for all published and unpublished meta-analyses in english or german up to 5th september 2015. the search combined terms indicative of meta-analyses or reviews and terms indicative of ptsd. the exact search terms were [(“metaana*” or “meta-ana*” or “review” or “übersichtsarbeit”) and (“stress disorders, post traumatic” (mesh) or “post-trauma*” or “posttrauma*” or “posttraumatic stress disorder” or “trauma*” or “ptsd” or “ptbs”)]. in addition, a snowball search system was used for the identification of further potentially relevant studies by screening the reference lists of included articles and of conference programs from the field of ptsd and trauma as well as psychotherapy research (see https://osf.io/9b4df/ for more information). experts in the field were contacted, but no additional meta-analyses were obtained. meta-analyses were retrieved for further assessment if the title or abstract suggested that these dealt with a meta-analysis of psychotherapy for ptsd. if an abstract provided insufficient information, the respective article was examined in order not to miss a relevant meta-analysis. study selection and data extraction meta-analyses were required to meet the following inclusion criteria: 1) a psychotherapeutic intervention was evaluated. psychotherapy was defined as “the informed and intentional application of clinical methods and interpersonal stances derived from established psychological principles for the purpose of assisting people to modify their behaviors, cognitions, emotions, and/or other personal characteristics in directions that the participants deem desirable" (norcross, 1990, p. 219).
2) the intervention aimed at reducing subclinical or clinical ptsd, according to diagnostic criteria for ptsd (e.g., using one of the versions of the dsm) or according to ptsd symptomatology as measured by a validated self-report or clinician measure in an adult population (i.e., aged 18 years and older). and 3) a summary effect size was provided. both uncontrolled designs investigating changes in one group (within-subjects design) and multiple group comparisons (between-subjects design) were suitable for inclusion. exclusion criteria were: 1) pooling of studies with various disorders, so that samples composed of other disorders along with ptsd were included in a meta-analysis and the effect sizes were combined to an overall effect estimate not restricted to the treatment of ptsd; and 2) the meta-analysis examined the efficacy of pharmacological treatment. three independent raters (du, hn, ssch) decided on the inclusion or exclusion of each meta-analysis upon preliminary reading of the abstract and discussed in the case of dissent.1 we included a meta-analysis that did not explicitly target children and adolescents even if there were minor hints that studies with such samples were included. however, this was only acceptable if it concerned individual primary studies within a meta-analysis, and if we found such hints only when thoroughly checking the list of references. for conciseness, we use the term meta-analysis to refer to the article that was published and use the term data set for the effect sizes included in a meta-analysis. a meta-analysis can comprise more than one data set if, for instance, treatment efficacy was investigated for different outcomes, such as ptsd symptoms and depressive symptoms, or when the efficacy of two treatments (e.g., tf-cbt and emdr) was investigated separately in the same meta-analysis. the term primary study is used to refer to the original study that was included in the meta-analysis.
when a meta-analysis consisted of multiple data sets, we included all data sets for which primary studies' effect sizes and a measure of their precision were provided or could be computed. we tried to extract the effect sizes and their precision for the primary studies from the meta-analysis. if the required data were not reported, we contacted the corresponding authors and re-analyzed the primary studies in order to obtain the data. data were extracted independently by one author (ssch), cross-checked by a second reviewer (hn), and in case of deviations during the statistical calculations checked by two researchers (rva, hn). all data sets for which the data were available and for which we could reproduce the average effect size reported in the meta-analysis ourselves were included. a data set was considered reproducible if the absolute difference between the reported and recomputed average effect size was at most 0.1. we labeled a data set as not reproducible if we could not reproduce the results based on the available data and the description of the analyses after contacting the authors of a meta-analysis. moreover, there were no restrictions with respect to the dependent variable; that is, all primary and secondary outcomes of the meta-analyses were suitable for inclusion. primary outcomes in meta-analyses on ptsd are usually ptsd symptom score or clinical status, whereas secondary outcomes often vary (e.g., anxiety, depression, dropout, or other; see also bisson, roberts, andrew, cooper, & lewis, 2013). the objectives of our paper were to study whether applying publication bias tests is warranted in meta-analyses on the efficacy of psychotherapeutic treatment for ptsd and to assess the severity of publication bias in these meta-analyses. 
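this reproducibility check can be sketched as follows. a fixed-effect (inverse-variance) average is used for illustration, whereas the original analyses matched the model of each meta-analysis; the function names and numbers are hypothetical:

```python
def fixed_effect_average(effects, ses):
    """inverse-variance weighted (fixed-effect) average effect size."""
    weights = [1.0 / se ** 2 for se in ses]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)

def is_reproducible(reported_average, effects, ses, tolerance=0.1):
    """a data set counts as reproducible if the recomputed average is
    within `tolerance` (the paper's 0.1 criterion) of the reported one."""
    return abs(reported_average - fixed_effect_average(effects, ses)) <= tolerance
```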
the majority of statistical methods to detect the presence of publication bias do not perform well if the true effect sizes are heterogeneous (e.g., stanley & doucouliagos, 2014; van aert et al., 2016; van assen et al., 2015), and some are even recommended not to be used in this situation (ioannidis, 2005). hence, it was necessary to include only data sets where the proportion of variance caused by heterogeneity in true effect size, as quantified by the i2-statistic, was smaller than 50%. we also excluded all data sets of a meta-analysis that included fewer than six studies, because publication bias tests suffer from low statistical power in case of a small number of studies in a meta-analysis and if severe publication bias is absent (begg & mazumdar, 1994; sterne et al., 2000). others recommend a minimum of 10 studies (sterne et al., 2011), but we adopted a less strict criterion for two reasons. first, we wanted to study whether applying publication bias tests is warranted under conditions that are representative of published meta-analyses. meta-analyses often contain fewer than 10 studies. for example, the median number of studies in meta-analyses published in the cochrane database of systematic reviews is 3 (rhodes, turner, & higgins, 2015; turner, jackson, wei, thompson, & higgins, 2015). the number of studies in meta-analyses in psychotherapy research is also usually small. meta-analyses on the efficacy of psychotherapy for schizophrenia (niemeyer, musch, & pietrowsky, 2012) as well as depression (niemeyer, musch, & pietrowsky, 2013) likewise applied a minimum of six studies as the lower limit for the application of publication bias tests. second, more recently developed methods to correct effect size for publication bias can be used to estimate the effect size even if the number of studies in a meta-analysis is small. 
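as a sketch of this screening rule, the i2-statistic can be computed from cochran's q under a fixed-effect model; the function names are hypothetical, and the rule shown (at least six studies, i2 below 50%) mirrors the criteria described above:

```python
def i_squared(effects, ses):
    """higgins' i^2 (in %): share of total variation attributable to
    between-study heterogeneity, derived from cochran's q under a
    fixed-effect model."""
    w = [1.0 / s ** 2 for s in ses]
    ybar = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    return 0.0 if q <= df else (q - df) / q * 100.0

def meets_screening_rule(effects, ses):
    """the paper's screen, sketched: at least six studies and i^2 < 50%."""
    return len(effects) >= 6 and i_squared(effects, ses) < 50.0
```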
for example, a method that was developed for combining an original study and a replication has shown that two studies can already be sufficient for accurately evaluating effect size (van aert & van assen, 2018). however, a consequence of applying publication bias methods to meta-analyses based on a small number of studies is that effect size estimates become less precise and the corresponding confidence intervals wider (stanley et al., 2017; van assen et al., 2015). statistical methods publication bias test. we assessed for the following publication bias tests whether it was warranted to apply them to the data sets in ptsd psychotherapy research: egger's regression test (egger et al., 1997), the rank-correlation test (begg & mazumdar, 1994), the test of excess significance (tes; ioannidis & trikalinos, 2007b), and p-uniform's publication bias test (van assen et al., 2015). these methods were included because they are either commonly applied in meta-analyses (egger's regression test and the rank-correlation test) or outperformed existing methods in some situations (tes and p-uniform's publication bias test; renkewitz & keiner, 2019). it is important to note that egger's regression test and the rank-correlation test were developed to test for small-study effects. small-study effects refer to the tendency of smaller studies to go along with larger effect sizes. one of the causes of small-study effects is publication bias, but another cause is, for instance, heterogeneity in true effect size (see egger et al., 1997, for a list of causes of small-study effects). the tes was also not specifically developed to test for publication bias, but examines whether the observed and expected numbers of statistically significant effect sizes in a meta-analysis are in line with each other (see https://osf.io/b9t7v/ for an elaborate overview of existing publication bias tests). 
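the idea behind the tes can be sketched as follows: count the observed significant effects and compare them with the number expected given each study's power to detect the fixed-effect estimate. this is a simplified normal-approximation sketch, not the exact procedure of ioannidis and trikalinos (2007b), and the function name is hypothetical:

```python
from statistics import NormalDist

def tes_counts(effects, ses, alpha=0.05):
    """test of excess significance, sketched: compare the observed number
    of statistically significant effect sizes with the number expected
    given each study's power to detect the fixed-effect estimate
    (two-tailed test, normal approximation)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    w = [1.0 / s ** 2 for s in ses]
    theta = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)  # fixed-effect estimate
    observed = sum(abs(y / s) > z_crit for y, s in zip(effects, ses))
    # power of study i to yield |z| > z_crit when z ~ n(theta / se_i, 1)
    expected = sum(
        (1 - nd.cdf(z_crit - theta / s)) + nd.cdf(-z_crit - theta / s) for s in ses
    )
    return observed, expected
```

a large excess of observed over expected significant results would then be taken as a signal of selective reporting.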
in order to investigate whether the application of the publication bias tests to the included data sets was warranted, we conducted a monte-carlo simulation study to examine the statistical power of the publication bias tests for the data sets. data were generated in a way that stayed as close as possible to the characteristics of the data sets. that is, the same number of effect sizes as in the data set as well as the same effect size measure were used for generating the data. the data were simulated under the fixed-effect (a.k.a. equal-effects) model, so effect sizes for each data set were sampled from a normal distribution with mean θ and variance equal to the "observed" squared standard errors. statistically significant effect sizes based on a one-tailed test with α = .025 (to reflect the common practice of testing a two-tailed hypothesis and only reporting results in the predicted direction) were always "published" and included in a simulated meta-analysis. publication bias restricted the "publication" of statistically nonsignificant effect sizes such that these effect sizes had a probability of 1 - pub to be included in a simulated meta-analysis. effect sizes were simulated until the included number of simulated effect sizes equaled the number of effect sizes in a data set. we examined the type-i error rate and statistical power of egger's regression test, the rank-correlation test, the tes, and p-uniform's publication bias test for each simulated meta-analysis using α = .05. two-tailed hypothesis tests were conducted for egger's regression test and the rank-correlation test. one-tailed hypothesis tests were used for the tes and p-uniform's publication bias test, because for these methods only evidence in one direction is indicative of publication bias. for each publication bias test, we recorded the proportion of data sets for which its statistical power was larger than 0.8. 
meta-analyses were simulated 10,000 times for all included data sets. the true effect size θ was fixed to zero for generating data, because this enabled simulating data using the same effect size measure as in the data sets. selected values for publication bias (pub) were 0, 0.25, 0.5, 0.75, 0.85, 0.95, and 1, where pub equal to 0 indicates no publication bias and 1 extreme publication bias. this monte-carlo simulation study was programmed in r 3.5.3 (r core team, 2019) and the packages "metafor" (viechtbauer, 2010), "puniform" (van aert, 2019), and "parallel" (r core team, 2019) were used. r code for this monte-carlo simulation study is available at https://osf.io/pg7sj. estimating effect size corrected for publication bias. five different methods were included to estimate the effect size: traditional meta-analysis, trim and fill, pet-peese, p-uniform, and the selection model approach proposed by vevea and hedges (1995). traditional meta-analysis was included because it is the analysis that is conducted in every meta-analysis. either a fixed-effect (fe) or random-effects (re) model was selected depending on the statistical model used in the meta-analysis. the other publication bias methods were selected because they are either often applied in meta-analyses (trim and fill) or outperformed other methods (pet-peese, p-uniform, and the selection model approach; mcshane et al., 2016; stanley & doucouliagos, 2014; van assen et al., 2015). p-curve (simonsohn et al., 2014) was not included in the present study because the methodology underlying p-curve is the same as that of p-uniform, and p-uniform has the advantage that it can also test for publication bias and estimate a 95% confidence interval (ci; see https://osf.io/b9t7v/ for an elaborate overview of existing methods to correct effect size for publication bias). 
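the data-generating step of the monte-carlo design described above can be sketched as follows. the paper evaluated the power of the publication bias tests on such data; for brevity, this sketch instead records how the selection mechanism inflates a naive fixed-effect estimate when the true effect is zero. the standard errors, replication count, and function names here are illustrative, not taken from the included data sets:

```python
import random

def simulate_meta(ses, pub, rng):
    """one simulated meta-analysis following the data-generating process
    above: true effect fixed at zero, each effect drawn from n(0, se_i^2);
    significant effects (one-tailed alpha = .025, i.e. y / se > 1.96) are
    always 'published', nonsignificant effects survive with probability
    1 - pub, and sampling continues until the simulated meta-analysis has
    as many effects as the real data set."""
    published = []
    for se in ses:
        while True:
            y = rng.gauss(0.0, se)
            if y / se > 1.959964 or rng.random() > pub:
                published.append(y)
                break
    return published

def fe_estimate(effects, ses):
    """naive inverse-variance (fixed-effect) average."""
    w = [1.0 / s ** 2 for s in ses]
    return sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

# hypothetical standard errors (not from the paper's data sets):
rng = random.Random(2020)
ses = [0.15, 0.2, 0.25, 0.3, 0.35, 0.4]
for pub in (0.0, 0.5, 1.0):
    estimates = [fe_estimate(simulate_meta(ses, pub, rng), ses) for _ in range(2000)]
    print(pub, round(sum(estimates) / len(estimates), 3))
```

with pub = 0 the naive estimate stays close to the true value of zero, while pub = 1 yields a clearly positive average, illustrating the overestimation that the correction methods discussed next try to undo.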
average effect size estimates of traditional meta-analysis, trim and fill, pet-peese, p-uniform, and the selection model approach were computed and transformed to a common effect size measure (i.e., cohen's d) before interpreting them. for data sets that used log relative risks as effect size measure, the analyses were conducted based on log odds ratios, and these average effect size estimates were transformed to cohen's d values. if there was not enough information to transform hedges' g to cohen's d, hedges' g was used in the analyses. effect sizes were computed using the formulas described in borenstein (2009). we assessed the severity of publication bias by computing difference scores in effect size estimates between traditional meta-analysis and each publication bias method (i.e., trim and fill, pet-peese, p-uniform, and the selection model approach). that is, we subtracted the effect size estimate of traditional meta-analysis from the method's effect size estimate. a difference score of zero reflects that the estimates of traditional meta-analysis and the publication bias method were the same, whereas a positive or negative difference score indicates that the estimates differed. subsequently, the mean and standard deviation (sd) of these difference scores were computed for each method. all analyses were conducted using r version 3.5.3 (r core team, 2019). the "metafor" package (viechtbauer, 2010) was used for conducting fixed-effect or random-effects meta-analysis, trim and fill, the rank-correlation test, and egger's regression test. the "puniform" package (van aert, 2019) was used for applying the p-uniform method with the default estimator based on the irwin-hall distribution. in line with the recommendation by stanley (2017), α = 0.1 was used for the right-tailed test of whether the intercept of a pet analysis was different from zero, and therefore whether the results of pet or peese had to be interpreted. 
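the conditional pet-peese rule described above (interpret peese only when pet's intercept test rejects one-tailed at α = .10) can be sketched as follows. this is a simplified stand-in for the metafor-based analysis: weights of 1/se² are assumed, the normal critical value replaces the exact t quantile, and the function names are hypothetical:

```python
import math

def wls_line(y, x, w):
    """weighted least squares of y on x; returns (intercept, t statistic
    of the intercept)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b1 = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sxx
    b0 = my - b1 * mx
    k = len(y)
    s2 = sum(wi * (yi - b0 - b1 * xi) ** 2 for wi, xi, yi in zip(w, x, y)) / (k - 2)
    se_b0 = math.sqrt(s2 * (1.0 / sw + mx ** 2 / sxx))
    return b0, (b0 / se_b0 if se_b0 > 0 else float("inf"))

def pet_peese(effects, ses, t_crit=1.2816):
    """conditional pet-peese, sketched: pet regresses effects on their
    standard errors; if the pet intercept is significantly above zero
    (one-tailed alpha = .10, normal approximation of the critical value),
    the peese intercept (regression on se^2) is reported instead."""
    w = [1.0 / s ** 2 for s in ses]
    pet_b0, pet_t = wls_line(effects, ses, w)
    if pet_t > t_crit:
        return wls_line(effects, [s ** 2 for s in ses], w)[0]
    return pet_b0
```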
the selection model approach proposed by vevea and hedges (1995) and implemented in the "weightr" package (coburn & vevea, 2019) was applied to all data sets. data and r code of the analyses are available at https://osf.io/afnvr/ and https://osf.io/taq5f/. results description of meta-analyses investigated a flowchart illustrating the procedure of selecting meta-analyses and data sets is presented in figure 1. the literature search resulted in 7,647 hits including duplicates; the screening process reduced this number to 502 meta-analyses, of which 89 dealt with the efficacy of psychotherapeutic interventions for ptsd and were included (see appendix a and https://osf.io/pkzx8/). of these 89 meta-analyses, four could not be located as they were unpublished dissertations and the authors did not reply to our requests.2 one meta-analysis was excluded because it used a network meta-analysis approach (gerger et al., 2014) and the included publication bias methods cannot be applied to this type of data. a multi-site study (morrissey et al., 2015) was excluded, because meta-analysis methods were used to combine the results from the different sites. of the remaining 83 meta-analyses, we contacted 36 authors (43.4%) because the effect size data were not fully reported in their paper and obtained data from six authors (16.7%).

figure 1. flow chart: identification and selection of meta-analyses and data sets (stages: identification, screening, eligibility, obtaining data, analyzed data).
• identification: 7,647 hits including duplicates (screened up to 5th september 2015): psycinfo/psyndex (2,980), pubmed (4,412), cochrane (131), references (123), conference programs and communication with experts (0).
• screening: 7,145 records excluded as duplicates or not meta-analyses (e.g., primary studies, book chapters, editorials, comments or corrections to other publications), leaving 502 meta-analyses.
• eligibility: 419 meta-analyses excluded: no psychotherapeutic intervention (245), no ptsd diagnosis (112), studies with mixed disorders including ptsd but without ptsd subgroup (10), targeting children and adolescents (22), targeting acute stress disorder (17), older version of a meta-analysis (5), fmri as outcome (2), network meta-analysis approach (1), multisite study (1), data not available (4); leaving 83 meta-analyses (2,110 data sets)*.
• obtaining data: 36 authors contacted for requesting data (28 no data received).
• analyzed data: 2,017 data sets excluded: fewer than six studies (1,510), heterogeneity (309), results not replicable (141), duplicates (6), random assignment of signs† (5), studies for acute stress disorder, psychotropic drugs, or children included (16), no effect sizes (e.g., rather dropout in percentage) (25); leaving 98 data sets (26 meta-analyses).
notes. *438 primary studies. † positive and negative signs were randomly assigned to each effect in the meta-analysis.

[journal front matter: meta-psychology, 2020, vol 4, mp.2018.884, https://doi.org/10.15626/mp.2018.884. article type: original article. published under the cc-by4.0 license. open data: yes. open materials: yes. open and reproducible analysis: yes. open reviews and editorial process: yes. preregistration: no. edited by: daniël lakens. reviewed by: k. coburn, f. d. schönbrodt. analysis reproduced by: erin m. buchanan. all supplementary files can be accessed at the osf project page: https://doi.org/10.17605/osf.io/my8s6.]

our analysis of the 83 meta-analyses first examined whether they discussed the problem of publication bias. fifty-eight meta-analyses (69.9%) mentioned publication bias, whereas 25 (30.1%) did not mention it at all. in 35 meta-analyses (42.2%), it was specified that the search strategies included unpublished studies, and 20 (24.1%) indeed found and included unpublished studies. 
however, in 46 meta-analyses (55.4%) unpublished studies were explicitly regarded as unsuitable for inclusion, and two meta-analyses (2.4%) did not specify their search and inclusion criteria with respect to unpublished studies. forty-seven meta-analyses (56.6%) statistically assessed publication bias, whereas 36 (43.4%) did not. five meta-analyses (6.0%) included the rank-correlation test, six (7.2%) egger's regression test, and nine (10.8%) the trim and fill procedure. the tes, pet-peese, and p-uniform were not applied in any of the meta-analyses. a funnel plot (light & pillemer, 1984) was presented in 26 meta-analyses (31.3%) and failsafe n (rosenthal, 1979) was computed in 26 meta-analyses (31.3%). these results indicate that a large number of meta-analyses did not assess publication bias or only applied a selection of publication bias methods. pet-peese and p-uniform have been developed more recently, and we therefore did not expect them to be regularly applied. the 83 meta-analyses included a total of 2,110 data sets, of which 98 (4.6%) data sets from 26 meta-analyses fulfilled all inclusion criteria and were eligible for publication bias assessment (see the flowchart in figure 1). figure 2 is a histogram of the number of effect sizes per data set before data sets were excluded due to fewer than six studies or heterogeneous true effect size. the results show that the majority of data sets contained fewer than six effect sizes, and that only a small number of data sets included more than 15 effect sizes. many data sets were excluded because there were fewer than six studies (1,510 data sets), and due to heterogeneity in true effect size (309 data sets). all meta-analyses of which data sets were included in our study are marked with an asterisk in the list of references. figure 2. histogram of the number of primary studies' effect sizes included in data sets. 
the vertical red dashed line denotes the cut-off that was used for assessing publication bias in a meta-analysis. characteristics of included data sets thirty-nine (39.8%) data sets reported hedges' g as effect size measure, 29 (29.6%) cohen's d, 3 (3.1%) a standardized mean difference, 7 (7.1%) a raw mean difference, 16 (16.3%) a risk ratio, 2 (2.0%) a log odds ratio, and 2 (2.0%) data sets glass' delta. the median number of effect sizes in a data set was 7 (first quartile 7, third quartile 10). since publication bias tests have low statistical power if the number of effect sizes in a meta-analysis is small (begg & mazumdar, 1994; sterne et al., 2000; van assen et al., 2015), the characteristics of many of the data sets are not well-suited for methods to detect publication bias. additionally, p-uniform cannot be applied if there are no statistically significant effect sizes in a meta-analysis, because a requirement is that at least one study in a meta-analysis is statistically significant. the median number of statistically significant effect sizes in the data sets was 3 (34.3%; first quartile 1 (13%), third quartile 6 (80.4%)), and 77 data sets (78.6%) included at least one significant effect size (see appendix a, which also reports the number of studies included in each data set). consequently, conditions were also not well-suited for p-uniform in particular, since this method uses only the statistically significant effect sizes. the median i2-statistic was 0% (first quartile 0%, third quartile 28.7%). publication bias test before applying the publication bias tests to the data sets, we conducted a monte-carlo simulation study to examine whether the statistical power of the tests is large enough (> 0.8) to warrant applying them. 
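the applicability requirement for p-uniform mentioned above (at least one statistically significant effect size) can be sketched as a simple screen; a two-tailed z test per effect is assumed, and the function names are hypothetical:

```python
from statistics import NormalDist

def significance_flags(effects, ses, alpha=0.05):
    """flag each primary study's effect as statistically significant
    on a two-tailed z test (effect divided by its standard error)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return [abs(y / s) > z_crit for y, s in zip(effects, ses)]

def p_uniform_applicable(effects, ses, alpha=0.05):
    """p-uniform requires at least one significant effect size."""
    return any(significance_flags(effects, ses, alpha))
```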
the type-i error rate and statistical power of the rank-correlation test (open circles), egger's test (triangles), tes (diamonds), and p-uniform's publication bias test (solid circles) as a function of publication bias (pub) are shown in figure 3. the results in the figure were obtained by averaging over the 98 data sets and the 10,000 replications in the monte-carlo simulation study. the type-i error rate of all publication bias tests was smaller than α = .05, implying that the tests were conservative. the results also indicate that the statistical power of none of the methods was above 0.5 for pub < 0.95. the statistical power of only the tes was larger than 0.8 in case of extreme publication bias (pub = 1). figure 3. type-i error rate and statistical power obtained with the monte-carlo simulation study of the rank-correlation test (open circles), egger's test (triangles), test of excess significance (tes; diamonds), and p-uniform's publication bias test (solid circles). we also studied in the simulations whether for each data set the statistical power of a publication bias test was larger than 0.8. this enabled us to select the data sets where publication bias tests would be reasonably powered to detect publication bias if it was present. the statistical power of none of the methods was larger than 0.8 for any data set if pub < 0.95 (results are available at https://osf.io/6bnc5/ for the rank-correlation test, https://osf.io/ufdps/ for egger's test, https://osf.io/5yehp/ for the tes, and https://osf.io/feux3/ for p-uniform). it is highly unlikely that publication bias is this extreme in the included data sets, because many data sets contained statistically nonsignificant effect sizes (median percentage of nonsignificant effect sizes in a data set: 65.7%). the publication bias tests would thus most likely be severely underpowered when applied to the published meta-analyses on ptsd, and it follows from these results that the tests should not be applied here. 
therefore, we only report the results of applying the publication bias tests to the data sets as a supplement in the online repository (https://osf.io/49cke/) for completeness. effect size corrected for publication bias the data set with id 77 (from the meta-analysis by kehle-forbes et al., 2013) was excluded when estimating effect sizes corrected for publication bias because not enough information was available to transform the log relative risks to cohen's d. hedges' g effect sizes could not be transformed into cohen's d for 12 data sets, and hedges' g was used instead (see appendix a). descriptive results of the effect size estimates of traditional meta-analysis, trim and fill, pet-peese, p-uniform, and the selection model approach are presented in table 1. p-uniform could only be applied to data sets with at least one statistically significant result (77 data sets), and the selection model approach did not converge for two data sets. results showed that especially the estimates of pet-peese were closer to zero than those of traditional meta-analysis, and that the standard deviation of the estimates of pet-peese and p-uniform was larger than that of traditional meta-analysis, trim and fill, and the selection model approach. see appendix a for the results of the effect size estimates corrected for publication bias per data set.

table 1. descriptive results of data sets analyzed with meta-analysis (fixed-effect or random-effects model depending on the model that was used in the original meta-analysis), trim and fill, pet-peese, p-uniform, and the selection model approach. entries are mean, median [min.; max.], (sd) of estimates.
meta-analysis: 0.603, 0.532 [0.015; 1.85], (0.447)
trim and fill: 0.574, 0.467 [-0.047; 1.789], (0.411)
pet-peese: 0.219, 0.203 [-1.656; 3.075], (0.696)
p-uniform: 0.556, 0.693 [-6.681; 2.158], (1.385)
sel. model: 0.603, 0.536 [-0.061; 1.828], (0.439)
note. min. is the minimum value, max. 
is the maximum value, and sd is the standard deviation. the mean of the difference in effect size estimate between pet-peese and the meta-analytic estimate was -0.101 (sd = 0.872). however, the median of the difference in effect size estimate was close to zero (mdn = -0.002), suggesting that the estimates of pet-peese and traditional meta-analysis were generally close. the mean of the difference between the estimates of trim and fill and traditional meta-analysis (-0.009, mdn = 0, sd = 0.104) and between the selection model approach and traditional meta-analysis (0.026, mdn = 0.026, sd = 0.145) was negligible. analyses for data sets including significant effect sizes. p-uniform was applied to a subset of 77 data sets (see appendix a), because this method requires that at least one study is statistically significant. the mean of the difference in effect size estimate between p-uniform and traditional meta-analysis was -0.174 (mdn = 0.04, sd = 1.273). the large standard deviation is caused by situations in which an extreme effect size was estimated because a primary study's effect size was only marginally significant (i.e., p-value just below .05). in order to counteract these extreme effect size estimates, we set p-uniform's effect size estimate to zero when the average of the statistically significant p-values was larger than half the α-level.3 this is in line with the recommendation by van aert et al. (2016). setting this effect size to zero resulted in a mean of the difference in effect size estimate between p-uniform and traditional meta-analysis of -0.019 (mdn = 0.04, sd = 0.364). the change in the difference in effect size estimate was caused by setting the effect size estimates of p-uniform in seven data sets to zero, in which p-uniform originally substantially corrected for publication bias. 
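the zero-setting rule can be sketched as follows (function name hypothetical):

```python
def corrected_p_uniform_estimate(estimate, significant_p_values, alpha=0.05):
    """van aert et al.'s (2016) safeguard, sketched: when the mean of the
    statistically significant p-values exceeds alpha / 2, those p-values
    are consistent with a zero true effect, so p-uniform's (possibly
    extreme) estimate is set to zero."""
    if not significant_p_values:
        return estimate
    mean_p = sum(significant_p_values) / len(significant_p_values)
    return 0.0 if mean_p > alpha / 2 else estimate
```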
the mean of the difference scores between pet-peese and traditional meta-analysis when computed based on this subset of 77 data sets was -0.129 (mdn = -0.011, sd = 0.968); for trim and fill the mean of the difference scores was -0.014 (mdn = 0, sd = 0.105), and for the selection model approach it was 0.028 (mdn = 0.024, sd = 0.155). explaining estimates of p-uniform, the selection model approach, and pet-peese. we illustrate deficiencies of p-uniform, the selection model approach, and pet-peese by discussing the results of two exemplary data sets. estimates of p-uniform can be imprecise (i.e., have a wide ci) if they are based on a small number of effect sizes in combination with p-values of these effect sizes close to the α-level. in 29 out of 77 data sets, p-uniform's estimate was based on at most three studies. for instance, the estimated average log relative risk of random-effects meta-analysis of the data set from bisson et al. (2013, id=20) was -0.177, 95% ci [-0.499, 0.145], whereas p-uniform's estimate was based on a single study and equaled -0.504, 95% ci [-3.809, 8.174]. the effect size estimate of p-uniform, as for any other method, is more precise the larger the number of effect sizes in a data set or the larger the primary studies' sample sizes. the selection model approach also suffers from a small number of statistically significant effect sizes. the weights for the intervals of the method's selection model are imprecisely estimated if only a small number of effect sizes fall within an interval. in the extreme situation where no effect sizes are observed in an interval of the selection model, the implementation of the selection model approach by vevea and hedges (1995) in the r package "weightr" assigns a weight of 0.01 to this interval. bias in effect size estimation increases the more this weight deviates from its true value. 
pet-peese also did not produce reasonable effect size estimates in all of the data sets, especially not if the standard errors of the primary studies were highly similar (i.e., were based on similar sample sizes). figure 4 shows the funnel plot based on the data set from bisson et al. (2007) comparing tf-cbt versus wait list and active controls (id=14; left panel), with the filled circles being the 15 observed effect sizes. the studies' standard errors diverged from each other, which makes it possible to fit a regression line through the observed effect sizes in the data set (dashed black line). pet-peese's effect size estimate was -0.027 (95% ci [-0.663, 0.609]; denoted by the asterisk in figure 4), which was closer to zero than that of traditional meta-analysis (0.260, 95% ci [-0.057, 0.578]) but had a wider ci. the data set from diehle et al. (2014) comparing two different treatments of tf-cbt (id=44) is presented in the right panel of figure 4. pet-peese was hindered by the highly similar studies' standard errors, which ranged from 0.227 to 0.478. hence, the effect size estimate of pet-peese (0.44, 95% ci [-1.079, 1.958]) was unrealistically larger than the estimate of traditional meta-analysis (-0.153, 95% ci [-0.084, 0.39]), and its ci was wider. figure 4. funnel plots of the data sets from bisson et al. (2013) (id=14; left panel) and diehle et al. (2014) (id=44; right panel). filled circles are the observed effect sizes in a meta-analysis, the dashed black line is the fitted regression line through the observed effect sizes, and the asterisks indicate the estimate of pet-peese. discussion publication bias is widespread in the psychology research literature (bakker et al., 2012; fanelli, 2012; sterling et al., 1995), resulting in overestimated effect sizes in primary studies and meta-analyses (kraemer, gardner, brooks, & yesavage, 1998; lane & dunlap, 1978). 
guidelines such as the mars (american psychological association, 2010) and prisma (moher, liberati, tetzlaff, & altman, 2009) recommend routinely correcting for publication bias in any meta-analysis. others recommend re-analyzing published meta-analyses to study the extent of publication bias in whole fields of research (ioannidis, 2009; ioannidis & trikalinos, 2007a; van assen et al., 2015) by using multiple publication bias methods (coburn & vevea, 2015; kepes et al., 2012). however, the question is whether routinely assessing publication bias is indeed a good recommendation, because researchers may end up applying publication bias methods in situations where these do not have appropriate statistical properties, potentially leading to faulty conclusions. we tried to answer this question by re-analyzing a large number of meta-analyses published on the efficacy of psychotherapeutic treatment for ptsd. we re-analyzed 98 data sets from 26 meta-analyses studying a wide variety of psychotherapeutic treatments for ptsd. we had to exclude a large portion of the data sets (95.4%), mainly due to heterogeneity in true effect size and data sets containing fewer than six primary studies. these exclusion criteria were necessary because publication bias methods do not perform well in case of heterogeneity in true effect size (ioannidis, 2005), and a small number of primary studies yields low power of publication bias methods and imprecise effect size estimation (sterne et al., 2000). the included data sets were characterized by a small number of primary studies (median 7 studies), resulting in challenging conditions for any publication bias method. before applying publication bias tests, we studied whether these tests would have sufficient statistical power (> 0.8). 
we conducted a monte-carlo simulation study in which data were generated in a way that stayed as close as possible to the included data sets. the statistical power of the publication bias tests was only larger than 0.8 in case of extreme publication bias (i.e., nonsignificant effect sizes having a probability of 0.05 or smaller to be included in a meta-analysis). hence, we concluded that it was not warranted to apply publication bias tests. of note is that the median percentage of nonsignificant effect sizes in a data set was 65.7%, suggesting that extreme publication bias was absent. publication bias methods that correct the effect size for bias are also affected by a small number of primary studies, because the effect size estimates then become imprecise (i.e., have a wide ci). however, comparing estimates of these methods with those of traditional meta-analysis, which does not correct for publication bias, still provides insights into the severity of publication bias. this analysis revealed no evidence for severe overestimation caused by publication bias, as the corrected estimates were close to those of traditional meta-analysis. our results imply that following up on the guidelines to assess publication bias in any meta-analysis is far from straightforward in practice. many data sets in our study were too heterogeneous for publication bias analyses. moreover, even after the exclusion of data sets with fewer than six studies, the statistical power of the publication bias tests for each data set was low if extreme publication bias was absent, and the cis of methods that provided estimates corrected for publication bias were wide. these results even call for revising the recommendation by sterne et al. (2011) to apply publication bias tests only to meta-analyses with more than 10 studies. 
our results are also corroborated by a recent study by renkewitz and keiner (2019), who concluded on the basis of a simulation study that publication bias can only be reliably detected with at least 30 studies. a caveat here is that these recommendations depend heavily on the severity of publication bias that is assumed to be present in a meta-analysis. most important, therefore, is that researchers are aware that publication bias tests suffer from low statistical power and that a nonsignificant publication bias test does not imply that publication bias is absent.

recommendations

we consider it important to give practical advice to researchers. we recommend that researchers follow the mars guidelines, apply publication bias tests, and report effect size estimates corrected for publication bias. however, a well-informed choice has to be made when selecting publication bias methods, because no method has the best statistical properties and outperforms all other methods in all conditions (carter et al., 2019; renkewitz & keiner, 2019). carter and colleagues (2019) conclude that it has not yet been investigated whether the application of publication bias methods is warranted in real data in psychology, and that this is ultimately an empirical question that should be the focus of future research. routinely applying publication bias methods without paying attention to their statistical properties given the characteristics of the respective meta-analysis cannot be recommended. hence, researchers need to consider the characteristics of their data sets and check the properties of publication bias methods for those data sets before actually applying them. such a "method performance check" has also been recommended by carter et al. (2019) for methods that correct effect size for publication bias, and can be conducted with their meta explorer web application (http://www.shinyapps.org/apps/metaexplorer/) or with simulation studies. 
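a method performance check by simulation can be sketched as follows: generate many meta-analyses under an assumed selection model and record how often a chosen publication bias test rejects. the sketch below uses a simplified variant of begg and mazumdar's rank-correlation test (correlating standardized effects with their variances, rather than the exact standardization of the original test) under one-sided selection; all parameter values are illustrative assumptions, not those from the simulation study reported above.

```python
import numpy as np
from scipy import stats

def simulate_power(k=7, delta=0.2, p_incl=0.05, reps=200, alpha=0.05, seed=0):
    """Monte Carlo power of a simplified Begg-and-Mazumdar-style
    rank-correlation test for k-study meta-analyses generated under
    one-sided selection: significant studies are always 'published',
    nonsignificant ones only with probability p_incl."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        effects, variances = [], []
        while len(effects) < k:
            n = rng.integers(20, 100)              # per-group sample size
            v = 2.0 / n                            # rough variance of Cohen's d
            d = rng.normal(delta, np.sqrt(v))      # observed effect size
            significant = d / np.sqrt(v) > 1.96
            if significant or rng.random() < p_incl:
                effects.append(d)
                variances.append(v)
        z = np.asarray(effects) / np.sqrt(variances)
        # simplified variant: correlate standardized effects with variances
        tau, p = stats.kendalltau(z, variances)
        rejections += p < alpha
    return rejections / reps

power = simulate_power()
```

if the estimated power for the assumed selection model falls well below 0.8, applying the test to the real data set is, per the recommendation above, not warranted.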
a complicating factor, however, is that a method performance check requires information about the true effect size, the heterogeneity in true effect size, and the extent of publication bias, which is not available. hence, researchers are advised to use multiple levels of these parameters in a method performance check as a sensitivity analysis. as there is no single publication bias method that outperforms all other methods, and as selecting a method depends on unknown parameters, we recommend applying multiple publication bias methods that show acceptable performance in a method performance check. such triangulation (coburn & vevea, 2015; kepes et al., 2012) following a method performance check, rather than applying only one publication bias method, will yield more insight into the presence and severity of publication bias, because each method uses its own approach to examining publication bias. researchers should refrain from testing for publication bias if a method performance check by means of a power analysis reveals that publication bias is unlikely to be detected in their meta-analysis. applying methods to correct effect size for publication bias is still useful in case of a small number of studies in a meta-analysis, because estimates corrected for publication bias can be compared to the uncorrected estimate to assess the severity of publication bias. we consider it important to emphasize that the reporting of publication bias methods should be independent of their results. the analysis procedure of the meta-analysis, as well as the publication bias tests, should preferably be preregistered in a pre-analysis plan before the analyses are actually conducted. moreover, conflicting results of publication bias methods are an interesting and important finding in their own right and should be discussed in the paper. 
limitations

heterogeneous data sets had to be excluded, because assessing publication bias with the included methods is only accurate when based on meta-analyses with no or small heterogeneity in true effect size (ioannidis & trikalinos, 2007a; terrin et al., 2003). for that reason, data sets were excluded from the analyses if the i2-statistic was larger than 50%. however, the i2-statistic is generally imprecise, especially if the number of effect sizes in a meta-analysis is small (ioannidis, patsopoulos, & evangelou, 2007). this is also reflected in the wide confidence intervals around the i2-statistics of the included data sets (see appendix a). moreover, the effect of publication bias on the i2-statistic has been shown to be large, complex, and non-linear, such that publication bias can both dramatically decrease and increase the i2-statistic (augusteijn, van aert, & van assen, 2019). a consequence of using a selection criterion based on the i2-statistic in the current study is therefore that data sets with heterogeneity in true effect size may still have been included, which may in turn have biased the results of the publication bias methods, because these methods do not perform well under substantial heterogeneity (ioannidis, 2005; terrin et al., 2003; van assen et al., 2015). conversely, by limiting ourselves to homogeneous data sets, we may also have excluded data sets affected by publication bias. imagine a data set consisting of multiple statistically significant effect sizes caused by publication bias and one nonsignificant effect size that is not influenced by publication bias. the inclusion of this nonsignificant effect size likely causes the i2-statistic to be larger than 50%, while the true effect size may in fact be homogeneous. hence, publication bias may also have resulted in the exclusion of homogeneous data sets. 
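the i2-statistic discussed above can be computed from cochran's q (higgins & thompson, 2002). a minimal sketch follows, giving the point estimate only; the confidence intervals around i2 reported in appendix a require additional test-based methods not shown here, and the toy data are ours.

```python
import numpy as np

def i_squared(effects, variances):
    """Higgins & Thompson's I^2: percentage of total variability in the
    effect sizes attributable to between-study heterogeneity."""
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)   # inverse-variance weights
    theta_fe = np.sum(w * y) / np.sum(w)           # fixed-effect estimate
    q = np.sum(w * (y - theta_fe) ** 2)            # Cochran's Q
    df = len(y) - 1
    return 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0

# identical effects: no heterogeneity; strongly conflicting precise
# effects: nearly all variability is attributed to heterogeneity
low = i_squared([0.5, 0.5, 0.5], [0.02, 0.02, 0.02])
high = i_squared([0.0, 1.0, 0.0, 1.0], [0.01, 0.01, 0.01, 0.01])
```

the example illustrates the single-outlier scenario in the text: one discrepant but precise effect size can push i2 past a 50% cutoff even when the remaining effects are homogeneous.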
another limitation is that questionable research practices, known as p-hacking (i.e., all behaviors researchers can use to obtain desired results; simmons, nelson, & simonsohn, 2011), may have further biased the results of the publication bias methods as well as the traditional meta-analysis (van aert et al., 2016). of note is also that the data sets in the current investigation often contained multiple statistically nonsignificant effect sizes when an active treatment was compared to a passive or active control group, which is not expected in case of extreme publication bias. comparisons between two active treatments, especially, resulted in very few significant differences in efficacy. these meta-analyses with nonsignificant comparative effects might nevertheless be affected by publication bias. for example, when a new treatment is found to be as efficacious as an established one, this might be newsworthy and have a larger chance of getting published than a finding demonstrating the well-known superiority of the state-of-the-art treatment. this implies that publication bias can lead to the publication of statistically nonsignificant rather than significant effects. in such a situation, publication bias would not be detected by any of the methods used in this study.

conclusion

routinely assessing publication bias in any meta-analysis is recommended by guidelines such as mars and prisma. we have shown, however, that the characteristics of meta-analyses in research on ptsd treatments are generally unfavorable for publication bias methods. that is, heterogeneity and small numbers of studies in meta-analyses result in low statistical power and imprecise corrected estimates. of note, the need to interpret results from small data sets cautiously applies to meta-analyses in general. 
the characteristics of the meta-analyses in our study on ptsd treatments are deemed typical of psychotherapy research, and potentially of other areas of clinical psychology as well. new publication bias methods, and improvements to existing methods, are needed that allow the true effect size to be heterogeneous and that perform well with a small number of effect sizes in a meta-analysis. a promising development is the extension of p-uniform to enable accurate effect size estimation in the presence of heterogeneity in true effect size (van aert & van assen, 2020). other promising developments are bayesian methods to correct for publication bias (du, liu, & wang, 2017; guan & vandekerckhove, 2016) and the increased attention for selection model approaches (citkowicz & vevea, 2017; mcshane et al., 2016). we hope that our work raises awareness of the limitations of publication bias methods, and we recommend that researchers apply and report multiple publication bias methods that have shown good statistical properties for the meta-analysis under study.

author contact

helen niemeyer, freie universität berlin, department of clinical psychological intervention, schwendenerstr. 27, 14195 berlin, germany. phone: 0049/3083854798. helen.niemeyer@fu-berlin.de

conflict of interest and funding

all authors declare no conflict of interest. the preparation of this article was supported by grant 406-13-050 from the netherlands organization for scientific research (rva).

author contributions

hn and rva designed the study. hn and ssch developed the search strategy, and ssch coordinated the literature search. ssch, hn, and du served as independent raters in the process of selecting the meta-analyses. hn and ssch conducted the data collection, coding, and data management. ssch contacted the authors of all primary studies for which data were missing. du developed the treatment coding scheme, and hn, du, and ssch categorized the treatment and control conditions. 
rva performed all statistical analyses. hn and rva drafted the manuscript. ck and osh supervised the conduct of the meta-meta-analysis and revised the manuscript.

open science practices

this article earned the open data and the open materials badges for making the data and materials openly available. it has been verified that the analysis reproduced the results presented in the article. it should be noted that only the final analysis has been verified, not the coding of the literature. the entire editorial process, including the open reviews, is published in the online supplement.

acknowledgements

the authors would like to thank helen-rose and sinclair cleveland, andrea ertle, manuel heinrich, marcel van assen, josephine wapsa and jelte wicherts, who helped proofread the article.

references

aguinis, h., dalton, d. r., bosco, f. a., pierce, c. a., & dalton, c. m. (2010). meta-analytic choices and judgment calls: implications for theory building and testing, obtained effect sizes, and scholarly impact. journal of management, 37(1), 5-38. doi:10.1177/0149206310377113
aguinis, h., gottfredson, r. k., & wright, t. a. (2011). best-practice recommendations for estimating interaction effects using meta-analysis. journal of organizational behavior, 32(8), 1033-1043. doi:10.1002/job.719
american psychiatric association. (1987). diagnostic and statistical manual of mental disorders (3rd ed., rev.). washington, dc: author.
american psychiatric association. (1994). diagnostic and statistical manual of mental disorders (4th ed.). washington, dc: author.
american psychiatric association. (2000). diagnostic and statistical manual of mental disorders (4th ed., text rev.). washington, dc: author.
american psychiatric association. (2013). diagnostic and statistical manual of mental disorders (5th ed.). washington, dc: author.
american psychological association. (2010). 
publication manual of the american psychological association. washington, dc: author.
augusteijn, h. e. m., van aert, r. c. m., & van assen, m. a. l. m. (2019). the effect of publication bias on the q test and assessment of heterogeneity. psychological methods, 24(1), 116-134. doi:10.1037/met0000197
bakker, m., van dijk, a., & wicherts, j. m. (2012). the rules of the game called psychological science. perspectives on psychological science, 7(6), 543-554. doi:10.1177/1745691612459060
banks, g. c., kepes, s., & banks, k. p. (2012). publication bias: the antagonist of meta-analytic reviews and effective policymaking. educational evaluation and policy analysis, 34(3), 259-277. doi:10.3102/0162373712446144
banks, g. c., kepes, s., & mcdaniel, m. a. (2012). publication bias: a call for improved meta-analytic practice in the organizational sciences. international journal of selection and assessment, 20(2), 182-197. doi:10.1111/j.1468-2389.2012.00591.x
becker, b. j. (2005). failsafe n or file-drawer number. in h. r. rothstein, a. j. sutton, & m. borenstein (eds.), publication bias in meta-analysis: prevention, assessment and adjustments (pp. 111-125). chichester, england: wiley.
begg, c. b., & mazumdar, m. (1994). operating characteristics of a rank correlation test for publication bias. biometrics, 50, 1088-1101.
benish, s. g., imel, z. e., & wampold, b. e. (2008). the relative efficacy of bona fide psychotherapies for treating post-traumatic stress disorder: a meta-analysis of direct comparisons. clinical psychology review, 28(5), 746-758. doi:10.1016/j.cpr.2007.10.005
berlin, j. a., & ghersi, d. (2005). preventing publication bias: registries and prospective meta-analysis. in h. r. rothstein, a. j. sutton, & m. borenstein (eds.), publication bias in meta-analysis: prevention, assessment and adjustments (pp. 35-48). chichester, england: wiley.
bisson, j. i., roberts, n. p., andrew, m., cooper, r., & lewis, c. (2013). 
psychological therapies for chronic post-traumatic stress disorder (ptsd) in adults. cochrane database of systematic reviews, 12. doi:10.1002/14651858.cd003388.pub4
borenstein, m. (2009). effect sizes for continuous data. in h. cooper, l. v. hedges, & j. c. valentine (eds.), the handbook of research synthesis and meta-analysis (pp. 221-236). new york, ny: russell sage foundation.
borenstein, m., hedges, l. v., higgins, j. p. t., & rothstein, h. r. (2009). introduction to meta-analysis. chichester, england: wiley.
bornstein, h. a. (2004). a meta-analysis of group treatments for post-traumatic stress disorder: how treatment modality affects symptoms. (64), proquest information & learning, us. retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=psyh&an=2004-99008373&site=ehost-live available from ebscohost psyh database.
carter, e. c., schönbrodt, f. d., gervais, w. m., & hilgard, j. (2019). correcting for bias in psychology: a comparison of meta-analytic methods. advances in methods and practices in psychological science, 2, 1-24. retrieved from osf.io/preprints/psyarxiv/9h3nu
chard, k. m. (1995). a meta-analysis of posttraumatic stress disorder treatment outcome studies of sexually victimized women. (55), proquest information & learning, us. retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=psyh&an=1995-95007211&site=ehost-live available from ebscohost psyh database.
citkowicz, m., & vevea, j. l. (2017). a parsimonious weight function for modeling publication bias. psychological methods, 22(1), 28-41. doi:10.1037/met0000119
coburn, k. m., & vevea, j. l. (2015). publication bias as a function of study characteristics. psychological methods, 20(3), 310-330. doi:10.1037/met0000047
coburn, k. m., & vevea, j. l. (2019). weightr: estimating weight-function models for publication bias. r package version 2.0.1. 
retrieved from https://cran.r-project.org/package=weightr
dickersin, k. (2005). publication bias: recognizing the problem, understanding its origins and scope, and preventing harm. in h. r. rothstein, a. j. sutton, & m. borenstein (eds.), publication bias in meta-analysis: prevention, assessment and adjustments (pp. 11-33). chichester, england: wiley.
driessen, e., hollon, s. d., bockting, c. l. h., & cuijpers, p. (2017). does publication bias inflate the apparent efficacy of psychological treatment for major depressive disorder? a systematic review and meta-analysis of us national institutes of health-funded trials. plos one, 10(9), e0137864. doi:10.1371/journal.pone.0137864
du, h., liu, f., & wang, l. (2017). a bayesian "fill-in" method for correcting for publication bias in meta-analysis. psychological methods, 22(4), 799-817. doi:10.1037/met0000164
duval, s., & tweedie, r. (2000a). a nonparametric "trim and fill" method of accounting for publication bias in meta-analysis. journal of the american statistical association, 95(449), 89-98.
duval, s., & tweedie, r. (2000b). trim and fill: a simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. biometrics, 56(2), 455-463. doi:10.1111/j.0006-341x.2000.00455.x
egger, m., smith, g. d., schneider, m., & minder, c. (1997). bias in meta-analysis detected by a simple, graphical test. british medical journal, 315(7109), 629-634. doi:10.1136/bmj.315.7109.629
ehlers, a., bisson, j., clark, d. m., creamer, m., pilling, s., richards, d., . . . yule, w. (2010). do all psychological treatments really work the same in posttraumatic stress disorder? clinical psychology review, 30(2), 269-276.
ehlers, a., clark, d. m., hackmann, a., mcmanus, f., & fennell, m. (2005). cognitive therapy for post-traumatic stress disorder: development and evaluation. behaviour research and therapy, 43(4), 413-431.
ellis, p. d. (2010). 
the essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results. new york: cambridge university press.
fanelli, d. (2012). negative results are disappearing from most disciplines and countries. scientometrics, 90(3), 891-904. doi:10.1007/s11192-011-0494-7
ferguson, c. j., & brannick, m. t. (2012). publication bias in psychological science: prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. psychological methods, 17(1), 120-128. doi:10.1037/a0024445
field, a. p., & gillett, r. (2010). how to do a meta-analysis. british journal of mathematical and statistical psychology, 63(3), 665-694. doi:10.1348/000711010x502733
fleiss, j. l. (1971). measuring nominal scale agreement among many raters. psychological bulletin, 76, 378-382. doi:10.1037/h0031619
foa, e. b., & rothbaum, b. o. (1998). treating the trauma of rape: cognitive-behavioral therapy for ptsd. new york, ny: guilford press.
forbes, d., creamer, m., bisson, j. i., cohen, j. a., crow, b. e., foa, e. b., . . . ursano, r. j. (2010). a guide to guidelines for the treatment of ptsd and related conditions. journal of traumatic stress, 23(5), 537-552.
forbes, d., creamer, m., phelps, a., bryant, r., mcfarlane, a., devilly, g. j., . . . newton, s. (2007). australian guidelines for the treatment of adults with acute stress disorder and posttraumatic stress disorder. australian and new zealand journal of psychiatry, 41(8), 637-648. doi:10.1080/00048670701449161
francis, g. (2013). replication, statistical consistency, and publication bias. journal of mathematical psychology, 57(5), 153-169. doi:10.1016/j.jmp.2013.02.003
gerger, h., munder, t., gemperli, a., nüesch, e., trelle, s., jüni, p., & barth, j. (2014). integrating fragmented evidence by network meta-analysis: relative effectiveness of psychological interventions for adults with post-traumatic stress disorder. 
psychological medicine, 44(15), 3151-3164. doi:10.1017/s0033291714000853
gilbody, s., & song, f. (2000). publication bias and the integrity of psychiatry research. psychological medicine, 30, 253-258. doi:10.1017/s0033291700001732
guan, m., & vandekerckhove, j. (2016). a bayesian approach to mitigation of publication bias. psychonomic bulletin & review, 23(1), 74-86. doi:10.3758/s13423-015-0868-6
head, m. l., holman, l., lanfear, r., kahn, a. t., & jennions, m. d. (2015). the extent and consequences of p-hacking in science. plos biology, 13(3), e1002106. doi:10.1371/journal.pbio.1002106
hedges, l. v., & vevea, j. l. (2005). selection method approaches. in h. r. rothstein, a. j. sutton, & m. borenstein (eds.), publication bias in meta-analysis: prevention, assessment and adjustments (pp. 145-174). chichester, england: wiley.
higgins, j., & thompson, s. g. (2002). quantifying heterogeneity in a meta-analysis. statistics in medicine, 21(11), 1539-1558. doi:10.1002/sim.1186
hinkle, d. e., wiersma, w., & jurs, s. g. (2003). applied statistics for the behavioral sciences. boston, ma: houghton mifflin.
hopewell, s., clarke, m., & mallett, s. (2005). grey literature and systematic reviews. in h. r. rothstein, a. j. sutton, & m. borenstein (eds.), publication bias in meta-analysis: prevention, assessment and adjustments (pp. 49-72). chichester, england: wiley.
ioannidis, j. p. (2005). differentiating biases from genuine heterogeneity: distinguishing artifactual from substantive effects. in h. r. rothstein, a. j. sutton, & m. borenstein (eds.), publication bias in meta-analysis: prevention, assessment and adjustments (pp. 287-302). sussex, england: wiley.
ioannidis, j. p. (2008). why most discovered true associations are inflated. epidemiology, 19(5), 640-648. doi:10.1097/ede.0b013e31818131e7
ioannidis, j. p. (2009). integration of evidence from multiple meta-analyses: a primer on umbrella reviews, treatment networks and multiple treatments meta-analyses. 
canadian medical association journal, 181(8), 488-493. doi:10.1503/cmaj.081086
ioannidis, j. p., patsopoulos, n. a., & evangelou, e. (2007). uncertainty in heterogeneity estimates in meta-analyses. british medical journal, 335(7626), 914-916. doi:10.1136/bmj.39343.408449.80
ioannidis, j. p., & trikalinos, t. a. (2007a). the appropriateness of asymmetry tests for publication bias in meta-analyses: a large survey. canadian medical association journal, 176(8), 1091-1096. doi:10.1503/cmaj.060410
ioannidis, j. p., & trikalinos, t. a. (2007b). an exploratory test for an excess of significant findings. clinical trials, 4(3), 245-253. doi:10.1177/1740774507079441
iyengar, s., & greenhouse, j. b. (1988). selection models and the file drawer problem. statistical science, 3, 109-135.
jaycox, l. h., & foa, e. b. (1999). cost-effectiveness issues in the treatment of post-traumatic stress disorder. in n. e. miller & k. m. magruder (eds.), cost-effectiveness of psychotherapy: a guide for practitioners, researchers, and policymakers. new york, ny: oxford university press.
karen, r. m. (1990). shame and guilt as the treatment focus in post-traumatic stress disorder: a meta-analysis. (51), proquest information & learning, us. retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=psyh&an=1991-51715001&site=ehost-live available from ebscohost psyh database.
kepes, s., banks, g. c., mcdaniel, m., & whetzel, d. l. (2012). publication bias in the organizational sciences. organizational research methods, 15(4), 624-662.
kessler, r. c., petukhova, m., sampson, n. a., zaslavsky, a. m., & wittchen, h. u. (2012). twelve-month and lifetime prevalence and lifetime morbid risk of anxiety and mood disorders in the united states. international journal of methods in psychiatric research, 21(3), 169-184.
kraemer, h. c., gardner, c., brooks, j., & yesavage, j. a. (1998). 
advantages of excluding underpowered studies in meta-analysis: inclusionist versus exclusionist viewpoints. psychological methods, 3(1), 23-31. doi:10.1037/1082-989x.3.1.23
lane, d. m., & dunlap, w. p. (1978). estimating effect size: bias resulting from the significance criterion in editorial decisions. british journal of mathematical and statistical psychology, 31(2), 107-112.
light, r., & pillemer, d. (1984). summing up: the science of research reviewing. cambridge, ma: harvard university press.
lipsey, m. w., & wilson, d. b. (2001). practical meta-analysis. thousand oaks, ca: sage publications.
loevinger, j. (1948). the technique of homogeneous tests compared with some aspects of scale analysis and factor analysis. psychological bulletin, 45(6), 507-529. doi:10.1037/h0055827
macaskill, p., walter, s. d., & irwig, l. (2001). a comparison of methods to detect publication bias in meta-analysis. statistics in medicine, 20(4), 641-654.
maljanen, t., knekt, p., lindfors, o., virtala, e., tillman, p., härkänen, t., & helsinki psychotherapy study group. (2016). the cost-effectiveness of short-term and long-term psychotherapy in the treatment of depressive and anxiety disorders during a 5-year follow up. journal of affective disorders, 190, 254-263. doi:10.1016/j.jad.2015.09.065
margraf, j. (2009). kosten und nutzen der psychotherapie [costs and benefits of psychotherapy]. heidelberg: springer medizin.
mcshane, b. b., böckenholt, u., & hansen, k. t. (2016). adjusting for publication bias in meta-analysis: an evaluation of selection methods and some cautionary notes. perspectives on psychological science, 11(5), 730-749. doi:10.1177/1745691616662243
moher, d., liberati, a., tetzlaff, j., & altman, d. g. (2009). preferred reporting items for systematic reviews and meta-analyses: the prisma statement. annals of internal medicine, 151(4), 264-269.
mokken, r. j. (1971). a theory and procedure of scale analysis: with applications in political research (vol. 1). berlin: walter de gruyter.
moreno, s. 
g., sutton, a. j., ades, a. e., stanley, t. d., abrams, k. r., peters, j. l., & cooper, n. j. (2009). assessment of regression-based methods to adjust for publication bias through a comprehensive simulation study. bmc medical research methodology, 9(2). doi:10.1186/1471-2288-9-2
national collaborating centre for mental health. (2005). post-traumatic stress disorder: the management of ptsd in adults and children in primary and secondary care (nice clinical guidelines, no. 26). leicester, uk: gaskell.
niemeyer, h., musch, j., & pietrowsky, r. (2012). publication bias in meta-analyses of the efficacy of psychotherapeutic interventions for schizophrenia. schizophrenia research, 138(2), 103-112.
niemeyer, h., musch, j., & pietrowsky, r. (2013). publication bias in meta-analyses of the efficacy of psychotherapeutic interventions for depression. journal of consulting and clinical psychology, 81(1), 58-74.
niemeyer, h., pieper, a., uelsmann, d., schulte-herbrüggen, o., & knaevelsrud, c. (2017). evidence based psychotherapy for complex posttraumatic stress disorder (ptsd), and ptsd following complex traumatization: an overview. manuscript in preparation.
norcross, j. c. (1990). an eclectic definition of psychotherapy. in j. k. zeig & w. m. munion (eds.), what is psychotherapy? contemporary perspectives (pp. 218-220). san francisco, ca: jossey-bass.
orwin, r. g. (1983). a fail-safe n for effect size in meta-analysis. journal of educational statistics, 8, 157-159. doi:10.2307/1164923
peters, j. l., sutton, a. j., jones, d. r., abrams, k. r., & rushton, l. (2007). performance of the trim and fill method in the presence of publication bias and between-study heterogeneity. statistics in medicine, 26(25), 4544-4562. doi:10.1002/sim.2889
r core team. (2019). r: a language and environment for statistical computing.
renkewitz, f., & keiner, m. (2019). 
how to detect publication bias in psychological research? a comparative evaluation of six statistical methods. zeitschrift für psychologie, 227(4), 261-279. doi:10.31234/osf.io/w94ep
resick, p. a., & schnicke, m. k. (1993). cognitive processing therapy for rape victims: a treatment manual. newbury park, ca: sage.
rhodes, k. m., turner, r. m., & higgins, j. p. (2015). predictive distributions were developed for the extent of heterogeneity in meta-analyses of continuous outcome data. journal of clinical epidemiology, 68(1), 52-60.
rosenthal, r. (1979). the file drawer problem and tolerance for null results. psychological bulletin, 86(3), 638-641. doi:10.1037/0033-2909.86.3.638
rothstein, h. r., & bushman, b. j. (2012). publication bias in psychological science: comment on ferguson and brannick (2012). psychological methods, 17, 129-136. doi:10.1037/a0027128
rothstein, h. r., & hopewell, s. (2009). grey literature. in h. cooper, l. v. hedges, & j. c. valentine (eds.), the handbook of research synthesis and meta-analysis (2nd ed., pp. 103-125). new york: russell sage foundation.
rothstein, h. r., sutton, a. j., & borenstein, m. (2005). publication bias in meta-analysis: prevention, assessment and adjustments. chichester, england: wiley.
shapiro, f., & forrest, m. s. (2001). eye movement desensitization and reprocessing: basic principles, protocols and procedures. new york, ny: guilford press.
simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359-1366. doi:10.1177/0956797611417632
simonsohn, u., nelson, l. d., & simmons, j. p. (2014). p-curve and effect size: correcting for publication bias using only significant results. perspectives on psychological science, 9(6), 666-681. doi:10.1177/1745691614553988
sloan, d. m., feinstein, b. a., gallagher, m. w., beck, j. g., & keane, t. m. (2013). 
efficacy of group treatment for posttraumatic stress disorder symptoms: a meta-analysis. psychological trauma: theory, research, practice, and policy, 5(2), 176-183.
stanley, t. d., & doucouliagos, h. (2014). meta-regression approximations to reduce publication selection bias. research synthesis methods, 5(1), 60-78.
stanley, t. d., doucouliagos, h., & ioannidis, j. p. (2017). finding the power to reduce publication bias. statistics in medicine. doi:10.1002/sim.7228
sterling, t. d., rosenbaum, w. l., & weinkam, j. j. (1995). publication decisions revisited: the effect of the outcome of statistical tests on the decision to publish and vice versa. the american statistician, 49(1), 108-112. doi:10.2307/2684823
sterne, j. a. c., becker, b. j., & egger, m. (2005). the funnel plot. in h. r. rothstein, a. j. sutton, & m. borenstein (eds.), publication bias in meta-analysis: prevention, assessment and adjustments (pp. 111-125). chichester, england: wiley.
sterne, j. a. c., gavaghan, d., & egger, m. (2000). publication and related bias in meta-analysis: power of statistical tests and prevalence in the literature. journal of clinical epidemiology, 53(11), 1119-1129.
sterne, j. a. c., sutton, a. j., ioannidis, j. p., terrin, n., jones, d. r., lau, j., . . . schmid, c. h. (2011). recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials. british medical journal, 343(7818), 1-8. doi:10.1136/bmj.d4002
terrin, n., schmid, c. h., & lau, j. (2005). in an empirical evaluation of the funnel plot, researchers could not visually identify publication bias. journal of clinical epidemiology, 58(9), 894-901. doi:10.1016/j.jclinepi.2005.01.006
terrin, n., schmid, c. h., lau, j., & olkin, i. (2003). adjusting for publication bias in the presence of heterogeneity. statistics in medicine, 22(13), 2113-2126. doi:10.1002/sim.1461
turner, r. m., jackson, d., wei, y., thompson, s. g., & higgins, j. p. t. (2015). 
predictive distributions for between-study heterogeneity and simple methods for their application in bayesian meta-analysis. statistics in medicine, 34(6), 984-998.
valentine, j. c. (2009). judging the quality of primary research. in h. cooper, l. v. hedges, & j. c. valentine (eds.), the handbook of research synthesis and meta-analysis (vol. 2, pp. 129-146). new york, ny: russell sage foundation.
van aert, r. c. m. (2019). puniform: meta-analysis methods correcting for publication bias. r package version 0.1.1. retrieved from https://cran.r-project.org/package=puniform
van aert, r. c. m., & van assen, m. a. l. m. (2020). correcting for publication bias in a meta-analysis with the p-uniform* method. doi:10.31222/osf.io/zqjr9
van aert, r. c. m., wicherts, j. m., & van assen, m. a. l. m. (2016). conducting meta-analyses based on p-values: reservations and recommendations for applying p-uniform and p-curve. perspectives on psychological science, 11, 713-729. doi:10.1177/1745691616650874
van assen, m. a. l. m., van aert, r. c. m., & wicherts, j. m. (2015). meta-analysis using effect size distributions of only statistically significant studies. psychological methods, 20(3), 293-309. doi:10.1037/met0000025
veronen, l. j., & kilpatrick, s. (1983). stress management for rape victims. in d. meichenbaum & m. e. jaremko (eds.), stress reduction and prevention (pp. 341-374). london, england: plenum.
vevea, j. l., & hedges, l. v. (1995). a general linear model for estimating effect size in the presence of publication bias. psychometrika, 60(3), 419-435. doi:10.1007/bf02294384
vevea, j. l., & woods, c. m. (2005). publication bias in research synthesis: sensitivity analysis using a priori weight functions. psychological methods, 10(4), 428-443. doi:10.1037/1082-989x.10.4.428
viechtbauer, w. (2007). confidence intervals for the amount of heterogeneity in meta-analysis. statistics in medicine, 26(1), 37-52. 
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1-48.
Wampold, B. E., Imel, Z. E., Laska, K. M., Benish, S., Miller, S. D., Flückiger, C., . . . Budge, S. (2010). Determining what works in the treatment of PTSD. Clinical Psychology Review, 30(8), 923-933.
Wilen, J. S. (2015). A systematic review and network meta-analysis of psychosocial interventions for adults who were sexually abused as children. (75), ProQuest Information & Learning, US. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=psyh&an=2015-99031072&site=ehost-live Available from EBSCOhost psyh database.

Publication bias in meta-analyses of posttraumatic stress disorder interventions

Footnotes

1 We also coded the treatment that was studied in the meta-analyses to compare the results of the publication bias methods between the treatments. Additional information on the coding of the treatments can be found in an online repository (https://osf.io/gh729/), as well as the results split per treatment (https://osf.io/usm9f/).

2 The four unpublished dissertations were: Bornstein, 2004; Chard, 1995; Karen, 1990; Wilen, 2015.

3 We assumed that two-tailed hypothesis tests with α = .05 were used in the primary studies. Hence, p-uniform's effect size estimate was set equal to zero if the average of the statistically significant p-values was smaller than α/4.

Niemeyer*, van Aert*, Schmid, Uelsmann, Knaevelsrud & Schulte-Herbrüggen

Appendix A

Table A. Results of traditional meta-analysis, Begg and Mazumdar's rank-correlation test, Egger's regression test, TES, p-uniform, trim and fill, PET-PEESE, and the selection model approach of Vevea and Hedges (1995), grouped by treatment category.

data set no. (id) author intervention / dependent measure / time of measurement original effect size [and 95% ci] replicated effect size, i2 [and 95% ci] no. of studies† (no. of sign. studies) begg (τ) and egger (z) tes p-uniform [and 95% ci], pub.
bias test (!"#) trim and fill [and 95% ci], no. of studies imputed (kimp) petpeese [and 95% ci] selection model [and 95% ci] 74 jonas et al. (2013) tf-cbt vs. wl / depressive symptoms (bdi, sensitivity analysis including high risk of bias studies) / post wmd = -8.03 [10.14, -5.93] (re) -8.02 [-10.07, 5.96], i2 = 0 [0, 42.5] 7 (5) τ = 0.24, z = 0.33 a = 0.01 -7.06 [-9.61, 1.33], $%# = 0.74 -8.02 [-10.07, 5.96], kimp = 0 -8.68 [-11.3, -6.06] -7.93 [10.32, 5.53] 69 jonas et al. (2013) tf-cbt vs. wl / ptsd symptoms (sensitivity analysis including high risk of bias studies) / post d = -1.13 [-1.33, 0.92] (re) -1.13 [-1.34, 0.92], i2 = 0 [0, 54.2] 9 (8) τ = 0.11, z = -0.27 a = 0.15 -1.11 [-1.36, -0.81], $%# = 0.19 -1.13 [-1.34, 0.92], kimp = 0 -1.09 [-1.51, -0.67] -1.11 [-1.35, -0.86] 71 jonas et al. (2013) tf-cbt vs. wl / ptsd symptoms (caps, sensitivity analysis including high risk of bias studies) / post wmd = -27.21 [32.29, -22.13] (re) -27.13 [-32.07, 22.2], i2 = 0 [0, 66.4] 6 (6) τ = 0.47, z = 0.93 a = 0.19 -26.3 [-31.14, 20.18], $%# = 0.33 -30.02 [-34.25, -25.8], kimp = 3 -32.55 [45.84, 19.26] -26.78 [34.2, 19.35] 64 hofmann & smits (2008) tf-cbt vs. active controls / ptsd symptoms / post g = 0.62 [0.28, 0.96] (re) 0.62 [0.28, 0.97], i2 = 48.1 [0, 92.5] 6 (3) τ = 0.2, z = 0.61 a = 0 0.75 [0.15, 1.45], $%# = -0.54 0.62 [0.28, 0.97], kimp = 0 -0.02 [3.13, 3.1] 0.63 [0.13, 1.13] 2 acpmh (2013) tf-cbt vs. wl & active controls / ptsd diagnosis (itt) / post mh-rr = 0.51 [0.44, 0.59] (fe) 0.52 [0.46, 0.6], i2 = 0 [0, 0] 10 (9) τ = -0.64*, z = -1.9 a = 4.87* 0.56 [0.45, 0.81], $%# = 0.63 0.55 [0.48, 0.62], kimp = 4 0.58 [0.48, 0.7] 0.66 [0.41, 1.08] 4 acpmh (2013) tf-cbt vs. wl & active controls / depressive symptoms (itt) / post d = -0.59 [-0.76, 0.41] (fe) -0.59 [-0.76, 0.41], i2 = 0 [0, 0] 11 (6) τ = -0.02, z = 0.51 a = 0.19 -0.59 [-0.87, -0.1], $%# = -0.02 -0.59 [-0.76, 0.41], kimp = 0 -0.64 [-1.12, -0.16] -0.49 [0.82, -0.16] 5 acpmh (2013) tf-cbt vs. 
wl & active controls / anxiety symptoms (itt) / post d = -0.64 [-0.88, 0.39] (fe) -0.64 [-0.89, 0.39], i2 = 0 [0, 0] 8 (4) τ = -0.43, z = -2.23* a = 0.17 -0.67 [-1.23, 0.12], $%# = -0.09 -0.47 [-0.69, 0.25], kimp = 3 0.84 [-0.26, 1.93] -0.54 [-1.2, 0.13] 6 acpmh (2013) tf-cbt vs. wl & active controls / attrition (itt) / post mh-rr = 1.48 [0.99, 2.21] (fe) 1.29 [0.84, 1.98], i2 = 0 [0, 0] 10 (1) τ = 0.02, z = 0.71 a = 0.42 00 [0, 0.8], $%# = 1.66* 1.29 [0.84, 1.98], kimp = 0 1 [0.34, 2.92] 1.25 [0.64, 2.42] 14 bisson et al. (2007) b tf-cbt vs. wl & active controls / withdrawal rate / post mh-rr = 1.42 [1.05, 1.94] (fe) 1.3 [0.94, 1.78], i2 = 0 [0, 0] 15 (0) τ = -0.24, z = 1.12 a = 0.93 no significant studies 1.16 [0.86, 1.57], kimp = 4 0.97 [0.52, 1.84] 1.38 [0.98, 1.94] publication bias in meta-analyses of posttraumatic stress disorder interventions 23 data set no. (id) author intervention / dependent measure / time of measurement original effect size [and 95% ci] replicated effect size, i2 [and 95% ci] no. of studies† (no. of sign. studies) begg (τ) and egger (z) tes p-uniform [and 95% ci], pub. bias test (!"#) trim and fill [and 95% ci], no. of studies imputed (kimp) petpeese [and 95% ci] selection model [and 95% ci] 17 bisson et al. (2013) tf-cbt vs. wl & active controls / leaving the study early / post mh-rr = 1.64 [1.30, 2.06] (fe) 1.45 [1.15, 1.84], i2 = 0 [0, 0] 29 (2) τ = -0.09, z = 1.24 a = 0.06 0.050 [0, 9.1], $%# = 1.07 1.32 [1.05, 1.65], kimp = 4 1.12 [0.68, 1.83] 1.92 [1.12, 3.29] 29 bisson et al. (2013) tf-cbt (individual & group) vs. wl & active controls / leaving the study early / post mh-rr = 1.21 [0.94, 1.55] (fe) 1.19 [0.93, 1.52], i2 = 0 [0, 0] 7 (0) τ = 0.24, z = 0.44 a = 0.54 no significant studies 1.19 [0.93, 1.52], kimp = 0 1.1 [0.76, 1.59] 1.18 [nan, nan] 97 bisson et al. (2013) tf-cbt vs. 
wl & active controls / anxiety / post smd = -0.81 [1.03, -0.59] (re) -0.8 [-1.02, 0.58], i2=43.3 [0, 71.3] 17 (10) τ = -0.06, z = -0.76 a = 0.04 -0.95 [-1.22, 0.61], $%# = -1.09 -0.8 [-1.02, 0.58], kimp = 0 -0.18 [-1.38, 1.02] -0.82 [-1.15, -0.49] 67 jonas et al. (2013) tf-cbt vs. wl & active controls / ptsd symptoms / post d = -1.27 [-1.54, 1.00] (re) -1.27 [-1.54, -1], i2 = 23.7 [0, 85.7] 7 (7) τ = -0.33, z = -1.84 a = 0.73 -1.29 [-1.6, -1.01], $%# = -0.39 -1.11 [-1.4, 0.82], kimp = 3 -0.53 [-1.51, 0.45] -1.2 [-1.49, -0.92] 68 jonas et al. (2013) tf-cbt vs. wl & active controls / ptsd symptoms (sensitivity analysis including high risk of bias studies) / post d = -1.19 [-1.38, 0.99] (re) -1.2 [-1.4, -1], i2 = 0 [0, 68.5] 11 (10) τ = -0.09, z = -1.09 a = 0.31 -1.2 [-1.46 ,-0.93], $%# = -0.04 -1.12 [-1.34, 0.89], kimp = 2 -1.01 [-1.45, -0.57] -1.16 [-1.39, -0.93] 70 jonas et al. (2013) tf-cbt vs. wl & active controls / ptsd symptoms (caps, sensitivity analysis including high risk of bias studies) / post wmd = -27.92 [32.87, -22.96] (re) -27.88 [-32.68, 23.07], i2 = 0 [0, 69.5] 7 (7) τ = 0.05, z = -0.13 a = 0.43 -27.6 [-32.55, 21.87], $%# = 0.11 -27.88 [-32.68, 23.07], kimp = 0 -26.1 [38.42, 13.78] no convergence 72 jonas et al. (2013) tf-cbt vs. wl & active controls / depressive symptoms (bdi) / post wmd = -8.21 [10.30, -6.12] (re) -8.21 [-10.25, 6.17], i2 = 0 [0, 29.7] 6 (6) τ = -0.2, z = -0.27 a = 1.36 -6.93 [-9.32, 2.52], $%# = 1.05 -8.21 [-10.25, 6.17], kimp = 0 -7.79 [11.21, -4.38] -7.3 [-9.26, -5.33] 73 jonas et al. (2013) tf-cbt vs. wl & active controls / depressive symptoms (bdi, sensitivity analysis including high risk of bias studies) / post wmd = -7.85 [9.80, -5.89] (re) -7.82 [-9.72, 5.92], i2 = 0 [0, 32.8] 9 (6) τ = 0.39, z = 0.69 a = 0 -6.93 [-9.32, 2.52], $%# = 0.72 -8.04 [-9.89, 6.18], kimp = 1 -8.97 [11.21, -6.73] -7.75 [-10, 5.51] 75 jonas et al. (2013) tf-cbt vs. 
wl & active controls (including pct) / depressive symptoms (bdi, sensitivity analysis including pct) / post wmd = -6.91 [8.86, -4.96] (re) -6.96 [-8.91, 5.01], i2 = 23.2 [0, 75.6] 7 (7) τ = -0.33, z = -2.08* a = 2.84 -6.02 [-8.63, 2.76], $%# = 0.38 -5.86 [-7.79, 3.93], kimp = 3 -1.97 [-7.02, 3.07] -5.2 [-9.42, -0.98] niemeyer*, van aert*, schmid, uelsmann, knaevelsrud & schulte-herbruggen 24 data set no. (id) author intervention / dependent measure / time of measurement original effect size [and 95% ci] replicated effect size, i2 [and 95% ci] no. of studies† (no. of sign. studies) begg (τ) and egger (z) tes p-uniform [and 95% ci], pub. bias test (!"#) trim and fill [and 95% ci], no. of studies imputed (kimp) petpeese [and 95% ci] selection model [and 95% ci] 76 jonas et al. (2013) tf-cbt vs. wl & active controls (including pct) / depressive symptoms (bdi, sensitivity analysis including pct and high risk of bias studies) / post wmd = -6.29 [7.84, -4.75] (re) -6.38 [-7.99, 4.76], i2 = 6.4 [0, 68.7] 11 (7) τ = 0.31, z = -0.12 a = 0.45 -6.02 [-8.63, 2.76], $%# = 0.17 -6.38 [-7.99, 4.76], kimp = 0 -6.26 [9.01, -3.52] -5.92 [8.17, -3.67] 45 dimauro (2014) tf-cbt within group / ptsd symptoms / post d = 0.69 [0.35, 1.02] (re) 0.68 [0.27, 1.09], i2 = 0 [0, 95.2] 6 (1) τ = 0.6, z = 1.6 a = 0.26 -6.580 [-51.19, 5.44], $%# = 1.01 0.6 [0.21, 0.99], kimp = 2 -0.1 [-0.78, 0.58] 0.87 [nan, nan] 8 acpmh (2013) cbt combined (mostly tfcbt, individual & group) vs. active controls / ptsd symptoms (clinician rated) / 2-3 months follow-up d = -0.43 [-0.65, 0.20] (fe) -0.43 [-0.65, 0.2], i2 = 0 [0, 0] 7 (2) τ = -0.33, z = -1.58 a = 0 -0.34 [-1.3, 1.26], $%# = 0.19 -0.38 [-0.6, 0.17], kimp = 1 0.42 [-0.68, 1.53] -0.44 [0.61, -0.26] 9 acpmh (2013) cbt combined (mostly tfcbt) vs. 
active controls / depressive symptoms / post d = -0.68 [-0.92, 0.44] (fe) -0.68 [-0.92, 0.45], i2 = 0 [0, 0] 8 (4) τ = -0.29, z = -1.33 a = 0.01 -0.59 [-1.18, 1.14], $%# = 0.32 -0.56 [-0.77, 0.35], kimp = 3 -0.12 [-1.01, 0.77] -0.67 [-1.14, -0.19] 10 acpmh (2013) cbt combined (mostly tfcbt, individual & group) vs. active controls / attrition / post mh-rr = 1.36 [0.86, 2.15] (fe) 1.27 [0.79, 2.05], i2 = 0 [0, 0] 10 (0) τ = 0.38, z = 1.15 a = 0.51 no significant studies 1.18 [0.74, 1.87], kimp = 2 0.76 [0.34, 1.7] 1.25 [nan, nan] 1 acpmh (2013) cbt combined (mostly tfcbt) vs. wl & active controls / ptsd symptoms (self-rated) / post d = -1.14 [-1.32, 0.95] (fe) -1.14 [-1.32, 0.95], i2 = 0 [0, 0] 11 (10) τ = -0.24, z = -1.11 a = 0.47 -1.12 [-1.37, -0.9], $%# = 0.14 -1.14 [-1.32, 0.95], kimp = 0 -1.03 [1.39, -0.68] -1.11 [-1.31, -0.9] 3 acpmh (2013) cbt combined (mostly tfcbt) vs. wl & active controls / ptsd symptoms (self-rated, itt) / post d = -1.06 [-1.30, 0.82] (fe) -1.08 [-1.32, 0.84], i2 = 0 [0, 0] 6 (5) τ = -0.07, z = -0.7 a = 0.15 -1.03 [-1.37, -0.71], $%# = 0.37 -1.08 [-1.32, 0.84], kimp = 0 -1.01 [-1.6, -0.43] -1.05 [-1.31, -0.8] 7 acpmh (2013) cbt combined (mostly tfcbt, individual & group) vs. wl & active controls / ptsd symptoms (self-rated, motor vehicle accident) / post d = -1.25 [-1.57, 0.94] (fe) -1.26 [-1.57, 0.94], i2 = 0 [0, 0] 6 (5) τ = 0.33, z = 0.6 a = 0.03 -1.22 [-1.59, 0.65], $%# = 0.18 -1.29 [-1.59, 0.98], kimp = 1 -1.4 [-2.13, -0.68] -1.24 [-1.5, -0.97] 30 casement & swanson (2012) cbt combined (mostly tfcbt) within group / nightmare frequency / post g = 0.82 [0.57, 1.07] (re) 0.82 [0.66, 0.98], i2 = 2.8 [0, 83.3] 7 (6) τ = -0.33, z = -0.13 a = 0 0.75 [0.45, 0.95], $%# = 0.77 0.86 [0.72, 1.01], kimp = 2 0.84 [0.46, 1.22] 0.82 [0.66, 0.99] publication bias in meta-analyses of posttraumatic stress disorder interventions 25 data set no. 
(id) author intervention / dependent measure / time of measurement original effect size [and 95% ci] replicated effect size, i2 [and 95% ci] no. of studies† (no. of sign. studies) begg (τ) and egger (z) tes p-uniform [and 95% ci], pub. bias test (!"#) trim and fill [and 95% ci], no. of studies imputed (kimp) petpeese [and 95% ci] selection model [and 95% ci] 31 casement & swanson (2012) cbt combined (mostly tfcbt) within group / ptsd symptoms / post g = 0.71 [0.46, 0.95] (re) 0.69 [0.5, 0.88], i2 = 28.4 [0, 91] 7 (5) τ = 0.24, z = 1.12 a = 0.05 0.77 [0.55, 1.05], $%# = -0.91 0.66 [0.46, 0.86], kimp = 1 0.56 [0.23, 0.88] 0.68 [0.49, 0.88] 48 dorrepaal et al. (2014) cbt combined (mostly tfcbt, individual & group) within group / ptsd symptoms (completer) / post d = 1.7 (fe) 1.68 [1.36, 2], i2 = 0 [0, 0] 8 (8) τ = 0.43, z = 2.76* a = 0.78 1.86 [1.34, 2.27], $%# = -0.74 1.43 [1.14, 1.71], kimp = 3 -0.55 [2.23, 1.14] 1.56 [1.24, 1.87] 49 dorrepaal et al. (2014) cbt combined (mostly tfcbt, individual & group) within group / ptsd symptoms (itt) / post d = 1.3 (fe) 1.29 [1.05, 1.52], i2 = 0 [0, 0] 8 (7) τ = 0.64*, z = 2.73* a = 0.98 1.45 [1.14, 1.75], $%# = -1.07 1.15 [0.94, 1.37], kimp = 2 -1.66 [-3.8, 0.49] 1.36 [1.06, 1.66] 52 dorrepaal et al. (2014) cbt combined (mostly tfcbt, individual & group) within group / ptsd symptoms (completer, complex ptsd) / post d = 1.6 (fe) 1.54 [1.18, 1.9], i2 = 0 [0, 0] 6 (6) τ = 0.47, z = 2.31* a = 0.82 1.64 [1.03, 2.18], $%# = -0.34 1.17 [0.86, 1.47], kimp = 3 -0.46 [2.71, 1.78] 1.33 [0.98, 1.68] 53 dorrepaal et al. (2014) cbt combined (mostly tfcbt, individual & group) within group / ptsd symptoms (itt, complex ptsd) / post d = 1.2 (fe) 1.17 [0.91, 1.44], i2 = 0 [0, 0] 6 (5) τ = 0.73, z = 1.99* a = 0.96 1.31 [0.97, 1.67], $%# = -0.8 1.08 [0.84, 1.33], kimp = 1 -1.51 [-5.15, 2.13] 1.26 [0.95, 1.57] 32 chen et al. (2014) emdr vs. 
active controls (including pe, sit/pe) / depressive symptoms / post g = -0.45 [-0.65, 0.25] (re) -0.45 [-0.65, 0.25], i2 = 0 [0, 63.3] 11 (3) τ = -0.2, z = -0.83 a = 0 -0.22□ [-1.25, 2.67], $%# = 0.39 -0.45□ [-0.65, 0.25], kimp = 0 -0.07□ [0.96, 0.82] -0.44 [0.71, -0.18] 34 chen et al. (2014) emdr vs. active controls (including ttp) / anxiety symptoms (equivalent group) / post g = -0.41 [-0.62, 0.21] (re) -0.41 [-0.62, 0.2], i2 = 0 [0, 74.5] 8 (4) τ = -0.21, z = 0 a = 2.11 0.340,□ [-0.39, 1.93], $%# = 2* -0.41□ [-0.62, 0.2], kimp = 0 -0.41□ [1.16, 0.34] -0.16 [nan, nan] 35 chen et al. (2014) emdr vs. active controls (including ttp, exp) / subjective distress (equivalent group) / post g = -0.57 [-0.81, 0.33] (re) -0.57 [-0.81, 0.33], i2 = 0 [0, 64] 8 (2) τ = -0.14, z = 0.53 a = 0.47 -0.43□ [-1.19, 26.29], $%# = 0.13 -0.57□ [-0.81, 0.33], kimp = 0 -0.68□ [1.3, -0.05] -0.73 [1.09, -0.36] 37 chen et al. (2014) emdr vs. active controls (including ttp, pe, cbt, sit/pe, exp, sm) / ptsd symptoms (equivalent group) / post g = -0.58 [-0.73, 0.42] (re) -0.57 [-0.73, 0.42], i2 = 3 [0, 63.3] 13 (5) τ = 0.1, z = 0.56 a = 0.64 -0.67□ [-0.94, 0.34], $%# = -0.76 -0.7□ [-0.87, 0.52], kimp = 4 -0.63□ [0.94, -0.32] -0.66 [0.87, -0.46] niemeyer*, van aert*, schmid, uelsmann, knaevelsrud & schulte-herbruggen 26 data set no. (id) author intervention / dependent measure / time of measurement original effect size [and 95% ci] replicated effect size, i2 [and 95% ci] no. of studies† (no. of sign. studies) begg (τ) and egger (z) tes p-uniform [and 95% ci], pub. bias test (!"#) trim and fill [and 95% ci], no. of studies imputed (kimp) petpeese [and 95% ci] selection model [and 95% ci] 15 bisson et al. (2007) b emdr vs. wl & active controls / withdrawal rate / post mh-rr = 1.21 [0.66, 2.22] (re) 1.27 [0.69, 2.35], i2 = 0 [0, 55.1] 6 (0) τ = -0.07, z = -0.88 a = 0.31 no significant studies 1.66 [1, 2.77], kimp = 3 1.66 [0.8, 3.43] 1.25 [0.77, 2] 24 bisson et al. (2013) emdr vs. 
wl & active controls / leaving the study early / post mh-rr = 1.05 [0.62, 1.79] (fe) 1.04 [0.6, 1.8], i2 = 0 [0, 0] 7 (0) τ = 0.14, z = -0.77 a = 0.2 no significant studies 1.19 [0.71, 2], kimp = 2 1.64 [0.51, 5.25] 1 [0.68, 1.47] 25 bisson et al. (2013) emdr vs. wl & active controls / depressive symptoms / post d = -1.15 [-1.52, 0.78] (re) -1.15 [-1.52, 0.78], i2 = 38.3 [0, 88.2] 7 (5) τ = 0.2, z = 0.15 a = 0.5 -1.32 [-1.71, -0.91], $%# = -0.82 -1.15 [-1.52, 0.78], kimp = 0 -1.47 [-4.8, 1.86] -1.26 [-1.63, -0.88] 26 bisson et al. (2013) emdr vs. wl & active controls / anxiety symptoms / post d = -1.02 [-1.36, 0.69] (fe) -1.02 [-1.35, 0.69], i2 = 0 [0, 0] 6 (5) τ = 0.2, z = 0.94 a = 0.78 -0.19 [-1.14, 1.95], $%# = 1.6 -1.02 [-1.35, 0.69], kimp = 0 -1.49 [-2.71, -0.27] -0.62 [1.64, 0.39] 33 chen et al. (2014) emdr vs. wl & active controls / depressive symptoms (adults) / post g = -0.63 [-0.83, 0.44] (re) -0.64 [-0.83, 0.44], i2 = 37.1 [0, 69.4] 17 (6) τ = -0.09, z = 0.21 a = 1.51 -0.78□ [-1.27, 0.11], $%# = -0.45 -0.64□ [-0.83, 0.44], kimp = 0 -0.72□ [1.17, -0.26] -0.83 [-1.12, -0.54] 36 chen et al. (2014) emdr vs. wl & active controls (including cbt, exposure) / ptsd symptoms (<60 min/session) / post g = -0.50 [-0.74, 0.27] (re) -0.5 [-0.74, 0.26], i2 = 35.6 [0, 78.7] 10c (4) τ = -0.2, z = -0.18 a = 0.01 -0.57□ [-1.02, 0.64], $%# = -0.27 -0.5□ [-0.74, 0.26], kimp = 0 -0.5□ [-1.4, 0.4] -0.48 [0.84, -0.13] 38 chen et al. (2014) emdr vs. wl & active controls (including ttp, sit/pe, sm) / depressive symptoms (with manual) / post g = -0.55 [-0.74, 0.36] (re) -0.55 [-0.74, 0.36], i2 = 35.7 [0, 66.9] 18c (5) τ = -0.08, z = 0.53 a = 1.39 -0.62□ [-1.26, 0.27], $%# = -0.13 -0.6□ [-0.79, 0.41], kimp = 2 -0.7□ [-1.14, -0.26] -0.75 [1.05, -0.44] 39 chen et al. (2014) emdr vs. 
wl & active controls / depressive symptoms (<60 min/session) / post g = -0.30 [-0.55, 0.04] (re) -0.3 [-0.55, 0.04], i2 = 0 [0, 44.8] 6c (0) τ = 0.47, z = 0.74 a = 1.04 no significant studies -0.4□ [-0.62, 0.18], kimp = 3 -0.59□ [1.13, -0.04] -0.45 [0.63, -0.27] 40 chen et al. (2014) emdr vs. wl & active controls / anxiety symptoms (<60 min/session) / post g = -0.35 [-0.58, 0.13] (re) -0.35 [-0.57, 0.13], i2 = 0 [0, 88.4] 6c (3) τ = -0.07, z = -0.18 a = 2.34 0.50,□ [-0.45, 3.24], $%# = 1.72* -0.35□ [-0.57, 0.13], kimp = 0 -0.31□ [1.41, 0.79] -0.08 [0.63, 0.47] 61 gerger et al. (2014) b other therapies within group / ptsd symptoms (non-complex problems) / post g = -0.71 [-1.02, 0.40] (re) -0.71 [-1.02, -0.4], i2 = 42.5 [0, 88.4] 6 (4) τ = -0.6, z = -1.97* a = 0.08 -0.65□ [-1.31, 0.13], $%# = 0.08 -0.71□ [-1.02, 0.4], kimp = 0 0.35□ [-1.16, 1.87] -0.62 [1.03, -0.2] 81 peleikis & dahl (2005) combined therapies (mostly other therapies, group) vs. wl / trauma symptoms / post d = 0.44 [0.25, 0.64] (w=n) 0.43 [0.23, 0.63], i2 = 0 [0, 0] 8 (2) τ = 0.14, z = 1.54 a = 0.13 0.91 [-0.58, 1.57], $%# = -0.98 0.43 [0.23, 0.63], kimp = 0 -0.18 [-1.41, 1.05] 0.66 [0.08, 1.24] publication bias in meta-analyses of posttraumatic stress disorder interventions 27 data set no. (id) author intervention / dependent measure / time of measurement original effect size [and 95% ci] replicated effect size, i2 [and 95% ci] no. of studies† (no. of sign. studies) begg (τ) and egger (z) tes p-uniform [and 95% ci], pub. bias test (!"#) trim and fill [and 95% ci], no. of studies imputed (kimp) petpeese [and 95% ci] selection model [and 95% ci] 87 sloan et al. (2013) combined therapies (mostly cbt, group) vs. wl / ptsd symptoms / post g = 0.56 [0.31, 0.82] (re) 0.56 [0.32, 0.79], i2 = 0 [0, 71] 6 (3) τ = 0.6, z = 1.27 a = 0.02 0.21 [-1.27, 0.9], $%# = 0.93 0.5 [0.28, 0.72], kimp = 1 -0.95 [2.85, 0.95 0.53 [0.26, 0.79] 89 taylor & harvey (2009) combined therapies (mostly tf-cbt) vs. 
wl / mixed outcome measures (7-9 sessions) / post g = 0.89 [0.58, 1.21] (fe) 0.89 [0.58, 1.21], i2 = 0 [0, 0] 6 (4) τ = -0.2, z = -1.29 a = 0.04 0.88 [-0.28, 1.46], $%# = 0.03 0.89 [0.58, 1.21], kimp = 0 3.12 [-3.23, 9.48] 0.72 [-0.01, 1.45] 90 taylor & harvey (2009) combined therapies (mostly tf-cbt) vs. wl / mixed outcome measures (practitioner as therapist) / post g = 0.98 [0.70, 1.26] (fe) 0.99 [0.71, 1.26], i2 = 0 [0, 0] 7 (5) τ = 0.14, z = -0.34 a = 0.02 1.04 [-0.18, 1.43], $%# = -0.24 0.99 [0.71, 1.26], kimp = 0 1 [-0.11, 2.11] 0.86 [0.2, 1.51] 78 lambert & alhassoon (2014) b combined therapies (mostly tf-cbt, individual & group) vs. active controls (including sit) / depressive symptoms / post g = 0.63 [0.35, 0.92] (re) 0.69 [0.36, 1.03], i2 = 28.7 [0, 86.3] 9 (2) τ = 0.56*, z = 2.23* a = 0.46 1.77 [0.6, 2.6], $%# = -1.87 0.69 [0.36, 1.03], kimp = 0 -0.36 [1.42, 0.69] 1.19 [0.2, 2.18] 88 sloan et al. (2013) combined therapies (mostly cbt, group) vs. active controls / ptsd symptoms / post g = 0.09 [-0.03, 0.22] (re) 0.15 [0, 0.3], i2 = 39.4 [0, 94.6] 10 (2) τ = 0.47, z = 2.66* a = 1.18 0.38 [-3.25, 1.59], $%# = -0.3 0.1 [-0.08, 0.28], kimp = 2 -0.18 [0.47, 0.1] 0.06 [0.06, 0.18] 79 nenova et al. (2013) combined therapies (including sn, individual & group, combined delivery) vs. wl & active controls / ptsd symptoms (intrusion) / post delta = -0.09 [0.41, 0.26] (bayesian re) -0.1 [-0.32, 0.11], i2 = 0 [0, 0] 13 (0) τ = -0.18, z = 0.11 a = 0.56 no significant studies -0.1 [-0.32, 0.11], kimp = 0 -0.1 [-0.21, 0.01] -0.1 [-0.26, 0.07] 80 nenova et al. (2013) combined therapies (including sn, individual & group, combined delivery) vs. wl & active controls / ptsd symptoms (avoidance) / post delta = 0.00 [0.37, 0.31] (bayesian re) 0.04 [-0.13, 0.22], i2 = 0 [0, 0] 13 (0) τ = -0.13, z = -0.66 a = 0.41 no significant studies 0.06 [-0.12, 0.23], kimp = 6 0.06 [0.02, 0.1] no convergence 84 sherman (1998) combined therapies (individual & group) vs. 
wl & active controls (one study removed) / mixed outcomes / post g = 0.52 [0.37, 0.67]b 0.52 [0.38, 0.66], i2 = 0 [0, 0] 23 (4) τ = 0.37*, z = 1.54 a = 2.06 0.62 [-0.35, 1.2], $%# = -0.28 0.42 [0.29, 0.55], kimp = 6 -0.11 [-0.71, 0.5] 0.72 [0.1, 1.35] 85 sherman (1998) combined therapies (individual & group) vs. wl & active controls / mixed g = 0.64 [0.47, 0.81]b 0.64 [0.48, 0.81], i2 = 0 [0, 0] 19 (10) τ = 0.3, z = 1.74 a = 0.97 0.7 [0.25, 1.03], $%# = -0.28 0.59 [0.43, 0.75], kimp = 2 0.02 [-0.84, 0.89] 0.49 [0.22, 0.76] niemeyer*, van aert*, schmid, uelsmann, knaevelsrud & schulte-herbruggen 28 data set no. (id) author intervention / dependent measure / time of measurement original effect size [and 95% ci] replicated effect size, i2 [and 95% ci] no. of studies† (no. of sign. studies) begg (τ) and egger (z) tes p-uniform [and 95% ci], pub. bias test (!"#) trim and fill [and 95% ci], no. of studies imputed (kimp) petpeese [and 95% ci] selection model [and 95% ci] outcome measures (one study removed) / follow-up 86 sloan et al. (2013) combined therapies (mostly cbt, group) vs. wl & active controls / ptsd symptoms / post g = 0.24 [0.09, 0.39] (re) 0.28 [0.13, 0.43], i2 = 48.8 [8, 87] 16 (5) τ = 0.65*, z = 4.33* a = 2.92 0.26 [-0.9, 0.89], $%# = -0.17 0.13 [-0.04, 0.3], kimp = 6 -0.25 [0.46, -0.04] 0.09 [-0.01, 0.19] 93 tol et al. (2011) combined therapies vs. wl & active controls / ptsd symptoms / post g = -0.38 [-0.55, 0.20] (re) -0.38 [-0.56, 0.2], i2 = 22.1 [0, 78.4] 9 (3) τ = -0.39, z = -1.77 a = 0.01 -0.56 [-0.9, 0.06], $%# = -0.99 -0.35 [-0.53, 0.17], kimp = 1 0.34 [-0.62, 1.29] -0.43 [0.75, -0.11] 46 dorrepaal et al. (2014) combined therapies (mostly cbt, individual & group) within group / ptsd symptoms (completer) / post d = 1.7 (fe) 1.65 [1.35, 1.95], i2 = 0 [0, 0] 9 (9) τ = 0.5, z = 2.78* a = 0.92 1.77 [1.3, 2.17], $%# = -0.54 1.34 [1.08, 1.6], kimp = 4 -0.6 [-2.14, 0.95] 1.51 [1.23, 1.79] 47 dorrepaal et al. 
(2014) combined therapies (mostly tf-cbt, individual & group) within group / ptsd symptoms (itt) / post d = 1.3 (fe) 1.28 [1.06, 1.51], i2 = 0 [, 0] 9 (8) τ = 0.44, z = 2.44* a = 0.5 1.41 [1.12, 1.69], $%# = -0.91 1.17 [0.96, 1.37], kimp = 2 -1.12 [-3.19, 0.96] 1.34 [1.06, 1.62] 50 dorrepaal et al. (2014) combined therapies (mostly cbt, individual & group) within group / ptsd symptoms (completer, complex ptsd) / post d = 1.6 (fe) 1.52 [1.19, 1.85], i2 = 0 [0, 0] 7 (7) τ = 0.62, z = 2.3* a = 0.96 1.56 [1.05, 2.06], $%# = -0.17 1.34 [1.04, 1.64], kimp = 2 -0.48 [2.38, 1.41] 1.33 [1.03, 1.63] 51 dorrepaal et al. (2014) combined therapies (mostly cbt) within group / ptsd symptoms (itt, complex ptsd) / post d = 1.2 (fe) 1.18 [0.93, 1.43], i2 = 0 [0, 0] 7 (6) τ = 0.52, z = 1.72 a = 0.4 1.28 [0.97, 1.6], $%# = -0.67 1.18 [0.93, 1.43], kimp = 0 -0.74 [3.55, 2.08] 1.24 [0.95, 1.53] 106 ehring et al. (2014) combined therapies (trauma-focused) within group / ptsd symptoms / follow-up g = 1.83 [1.60, 2.09] (re) 1.85 [1.53, 2.17], i2 = 22.6 [0, 84] 7 (7) τ = 0.43, z = 2.12* a = 0.1 1.89 [1.56, 2.26], $%# = -0.4 1.79 [1.46, 2.11] , kimp = 1 0.36 [-1.1, 1.83] 1.83 [1.55, 2.11] 91 taylor & harvey (2009) combined therapies (mostly tf-cbt) within group / mixed outcome measures / post g = 1.11 [0.90, 1.32] (fe) 1.08 [0.91, 1.25], i2 = 0 [0, 0] 6 (6) τ = 0.6, z = 2.49* a = 1.34 1.16 [0.97, 1.66], $%# = -0.75 1.02 [0.86, 1.19], kimp = 2 0.91 [0.77, 1.05] 1.05 [0.89, 1.21] 92 taylor & harvey (2009) combined therapies (mostly tf-cbt, individual & group) within group / mixed outcome measures (therapist as main contact for assessment and treatment) / post g = 1.03 [0.83, 1.23] (fe) 1 [0.84, 1.17], i2 = 0 [0, 0] 7 (6) τ = 0.14, z = 1.13 a = 0.06 1.06 [0.88, 1.35], $%# = -0.65 1 [0.84, 1.17], kimp = 0 0.89 [0.58, 1.2] 0.99 [0.82, 1.17] publication bias in meta-analyses of posttraumatic stress disorder interventions 29 data set no. 
(id) author intervention / dependent measure / time of measurement original effect size [and 95% ci] replicated effect size, i2 [and 95% ci] no. of studies† (no. of sign. studies) begg (τ) and egger (z) tes p-uniform [and 95% ci], pub. bias test (!"#) trim and fill [and 95% ci], no. of studies imputed (kimp) petpeese [and 95% ci] selection model [and 95% ci] 43 diehle et al. (2014) tf-cbt (cognitive restructuring + exposure) vs. tfcbt (exposure only) / trauma-related cognitions / post g = 0.27 [0.03, 0.50] (re) 0.26 [0.02, 0.5], i2 = 18.2 [0, 85.8] 7 (1) τ = -0.43, z = -1.37 a = 0 0.62 [-0.32, 1.18], $%# = -1.08 0.3 [0.05, 0.56], kimp = 1 0.59 [0.06, 1.25] 0.28 [-0.02, 0.58] 44 diehle et al. (2014) tf-cbt (cognitive restructuring + exposure) vs. tfcbt (exposure only) / trauma-related cognitions / follow-up g = 0.15 [-0.08, 0.39] (re) 0.15 [-0.08, 0.39], i2 = 16.1 [0, 87] 7 (0) τ = -0.14, z = -0.57 a = 0.58 no significant studies 0.19 [-0.05, 0.43], kimp = 1 0.44 [-1.08, 1.96] no convergence 77 kehleforbes et al. (2013) tf-cbt (exposure only) vs. tf-cbt (exposure plus) / dropout / post rr = 0.97 [0.66, 1.41] (re) 0.97 [0.66, 1.41], i2 = 40.5 [0, 92.2] 8 (1) τ = -0.36, z = -0.82 a = 0.3 00 [0, 1.91], $%# = 1.5 1.09 [0.73, 1.62], kimp = 2 1 [0.41, 2.4] 0.87 [0.71, 1.07] 18 bisson et al. (2013) tf-cbt vs. non-tf-cbt / leaving the study early / post mh-rr = 1.19 [0.71, 2.00] (fe) 1.12 [0.67, 1.89], i2 = 0 [0, 0] 7 (0) τ = -0.05, z = 0.67 a = 0.26 no significant studies 1.05 [0.63, 1.74], kimp = 1 0.76 [0.24, 2.41] 1.2 [0.5, 2.88] 19 bisson et al. (2013) tf-cbt vs. non-tf-cbt / depressive symptoms / post d = -0.27 [-0.56, 0.03] (fe) -0.27 [-0.56, 0.03], i2 = 0 [0, 0] 6 (0) τ = 0.2, z = 0.08 a = 0.72 no significant studies -0.27 [-0.56, 0.03], kimp = 0 -0.33 [2.85, 2.19] -0.26 [0.56, 0.03] 20 bisson et al. (2013) tf-cbt vs. 
non-tf-cbt / ptsd diagnosis / post mh-rr = 0.83 [0.60, 1.17] (re) 0.84 [0.61, 1.16], i2 = 36.9 [0, 97.6] 6 (1) τ = -0.47, z = -1.82 a = 0.83 0.6 [0.02, 3549.05], $%# = 0.18 0.84 [0.61, 1.16], kimp = 0 1.56 [0.55, 4.38] 0.93 [0.72, 1.22] 98 bisson et al. (2013) tf-cbt vs. non-tf-cbt / ptsd diagnosis clinician rated / post smd = -0.27 [0.63, 0.10] (re) -0.26 [-0.62, 0.1], i2 = 48.1 [0, 92.8] 7 (1) τ = 0.05, z = -0.82 a = 0.1 -1.19 [-2.3, 0.68], $%# = -1.3 -0.26 [-0.62, 0.1], kimp = 0 0.32 [-1.99, 2.63] -0.3 [-1.36, 0.75] 21 bisson et al. (2013) tf-cbt vs. other therapies (including tau) / leaving the study early / post mh-rr = 1.39 [1.01, 1.92] (fe) 1.36 [0.98, 1.87], i2 = 0 [0, 97.6] 11 (1) τ = -0.06, z = -0.03 a = 0.03 0.310 [0, 12.22], $%# = 0.69 1.38 [1.01, 1.9], kimp = 2 1.36 [0.76, 2.46] 1.33 [0.93, 1.9] 22 bisson et al. (2013) tf-cbt vs. other therapies (including tau) / depressive symptoms (self-rated) / post d = -0.37 [-0.63, 0.11] (re) -0.38 [-0.64, 0.11], i2 = 42.7 [0, 90.4] 9 (3) τ = -0.39, z = -1.73 a = 1.11 -0.33 [-1.04, 2.1], $%# = -0.07 -0.26 [-0.56, 0.03], kimp = 2 0.01 [-0.48, 0.49] -0.18 [0.39, 0.02] 23 bisson et al. (2013) tf-cbt vs. other therapies (including tau) / ptsd diagnosis / post mh-rr = 0.75 [0.59, 0.96] (re) 0.76 [0.6, 0.96], i2 = 34.3 [0, 91.5] 7 (1) τ = -0.52, z = -2.31* a = 0.03 0.71 [0.17, 25.4], $%# = -0.12 0.83 [0.64, 1.08], kimp = 2 1.53 [0.77, 3.03] 0.52 [0.18, 1.48] 57 gerger et al. (2014) b tf-cbt vs. other therapies / ptsd symptoms (structural equivalence) / post g = -0.17 [-0.39, 0.06] (re) -0.16 [-0.38, 0.06], i2 = 44.3 [0, 89.2] 7 (2) τ = -0.05, z = 0.18 a = 1.03 -0.34 [-0.71, 0.36], $%# = -1 -0.16 [-0.38, 0.06], kimp = 0 -0.26 [0.94, 0.41] -0.06 [-0.3, 0.18] niemeyer*, van aert*, schmid, uelsmann, knaevelsrud & schulte-herbruggen 30 data set no. (id) author intervention / dependent measure / time of measurement original effect size [and 95% ci] replicated effect size, i2 [and 95% ci] no. of studies† (no. of sign. 
publication bias in meta-analyses of posttraumatic stress disorder interventions (niemeyer, van aert, schmid, uelsmann, knaevelsrud & schulte-herbruggen), table (continued)

columns: data set no. (id) | author | intervention / dependent measure / time of measurement | original effect size [95% ci] | replicated effect size, i2 [95% ci] | no. of studies† (no. of sign. studies) | begg (τ) and egger (z) | tes (a) | p-uniform [95% ci], pub. bias test statistic | trim and fill [95% ci], no. of studies imputed (kimp) | pet-peese [95% ci] | selection model [95% ci]

58 | gerger et al. (2014) b | tf-cbt vs. other therapies / ptsd symptoms (complex problem & structural equivalence) / post | g = -0.11 [-0.32, 0.09] (re) | -0.11 [-0.31, 0.09], i2 = 33.5 [0, 83.6] | 6 (1) | τ = -0.07, z = 0.88 | a = 0.14 | -0.36 [-0.61, 0.02], -1.36 | -0.31 [-0.55, 0.06], kimp = 3 | -0.23 [-0.62, 0.15] | -0.08 [-0.34, 0.17]
11 | acpmh (2013) | tf-cbt vs. combined therapies / depressive symptoms / post | d = -0.12 [-0.38, 0.15] (fe) | -0.12 [-0.38, 0.15], i2 = 0 [0, 0] | 7 (0) | τ = -0.05, z = -0.3 | a = 0.37 | no significant studies | -0.12 [-0.38, 0.15], kimp = 0 | 0.08 [-1.83, 1.99] | -0.2 [-0.4, 0]
12 | acpmh (2013) | tf-cbt vs. combined therapies / anxiety symptoms / post | d = -0.09 [-0.39, 0.20] (fe) | -0.1 [-0.39, 0.2], i2 = 0 [0, 0] | 6 (0) | τ = -0.33, z = -0.22 | a = 0.28 | no significant studies | -0.1 [-0.39, 0.2], kimp = 0 | 0.07 [-1.81, 1.94] | -0.14 [-0.41, 0.13]
13 | acpmh (2013) | tf-cbt vs. combined therapies / attrition / post | mh-rr = 1.17 [0.69, 2.00] (fe) | 1.1 [0.64, 1.9], i2 = 0 [0, 0] | 6 (0) | τ = -0.07, z = 0.7 | a = 0.21 | no significant studies | 1.02 [0.6, 1.73], kimp = 1 | 0.71 [0.16, 3.13] | 1.07 [0.6, 1.9]
65 | imel et al. (2013) | tf-cbt vs. combined therapies (mostly emdr) / dropout / post | lor = -0.05‡ [-0.52, 0.62] (re) | 0.05 [-0.52, 0.62], i2 = 0 [0, 78.7] | 7 (0) | τ = 0.62, z = 0.97 | a = 0.21 | no significant studies | -0.08 [-0.63, 0.46], kimp = 2 | -0.56 [-1.93, 0.81] | 0.11 [-0.42, 0.64]
103 | powers et al. (2010) | tf-cbt vs. combined therapies / ptsd symptoms / post | g = -0.07 [-0.42, 0.28] (re) | -0.07 [-0.43, 0.29], i2 = 48.5 [0, 91.6] | 6 (1) | τ = -0.07, z = -0.04 | a = 2.87 | -0.46 [-1.3, 1.36], -0.78 | -0.07 [-0.43, 0.29], kimp = 0 | 0.23 [-2.22, 2.68] | 0.06 [-0.22, 0.34]
16 | bisson et al. (2007) b | emdr vs. tf-cbt / withdrawal rate / post | mh-rr = 0.87 [0.58, 1.30] (fe) | 0.87 [0.57, 1.32], i2 = 0 [0, 0] | 8 (0) | τ = 0.14, z = 0.87 | a = 0.34 | no significant studies | 0.8 [0.54, 1.2], kimp = 2 | 0.61 [0.21, 1.82] | 0.82 [nan, nan]
27 | bisson et al. (2013) | emdr vs. tf-cbt / leaving the study early / post | mh-rr = 1.00 [0.74, 1.35] (fe) | 1.02 [0.75, 1.39], i2 = 0 [0, 0] | 8 (0) | τ = -0.21, z = -0.38 | a = 0.23 | no significant studies | 1.02 [0.75, 1.39], kimp = 0 | 1.12 [0.51, 2.46] | 0.95 [0.75, 1.21]
28 | bisson et al. (2013) | emdr vs. tf-cbt / ptsd symptoms (self-rated) / post | d = -0.30 [-0.60, 0.01] (re) | -0.3 [-0.6, 0.01], i2 = 32.4 [0, 90.2] | 7 (1) | τ = 0.05, z = 0.11 | a = 0.02 | 2.94 [-0.78, 18.31], 1.44 | -0.3 [-0.6, 0.01], kimp = 0 | -0.34 [-1.39, 0.71] | -0.32 [-0.61, -0.03]
41 | chen et al. (2015) | emdr vs. tf-cbt / ptsd symptoms (intrusion) / post | d = -0.37 [-0.68, -0.06] (fe) | -0.38 [-0.69, -0.07], i2 = 0 [0, 0] | 6 (1) | τ = -0.47, z = -2.19* | a = 0 | -1.14 [-2.29, 0.85], -1.11 | -0.21 [-0.5, 0.07], kimp = 2 | 1.07 [-1.1, 3.25] | -0.92 [-2.24, 0.39]
42 | chen et al. (2015) | emdr vs. tf-cbt / ptsd symptoms (total, sensitivity analysis with 3 studies removed) / post | d = -0.83 [-1.08, -0.58] (fe) | -0.84 [-1.06, -0.61], i2 = 0 [0, 0] | 8 (3) | τ = -0.07, z = -0.07 | a = 1.46 | -0.98 [-1.48, 0.66], -1.02 | -0.84 [-1.06, -0.61], kimp = 0 | -0.8 [-1.18, -0.43] | -0.98 [-1.18, -0.77]
63 | ho et al. (2012) | emdr vs. tf-cbt / ptsd symptoms / post | g = 0.23 [-0.03, 0.49] (fe) | 0.23 [-0.03, 0.49], i2 = 0 [0, 0] | 8 (0) | τ = -0.14, z = -0.42 | a = 0.77 | no significant studies | 0.23□ [-0.03, 0.49], kimp = 0 | 0.57□ [-1.69, 2.83] | 0.22 [0.01, 0.44]
82 | seidler & wagner (2006) | emdr vs. tf-cbt / depressive symptoms / post | g = 0.40 [0.05, 0.76] (fe) | 0.39 [0.12, 0.66], i2 = 0 [0, 0] | 7 (2) | τ = -0.14, z = -1.14 | a = 0.44 | 0.66 [-0.29, 1.29], -0.81 | 0.39 [0.12, 0.66], kimp = 0 | 1.51 [-1.8, 4.81] | 0.27 [-0.14, 0.67]
83 | seidler & wagner (2006) | emdr vs. tf-cbt / depressive symptoms / follow-up | g = 0.12 [-0.24, 0.48] (fe) | 0.12 [-0.15, 0.4], i2 = 0 [0, 0] | 7 (0) | τ = -0.43, z = -1.62 | a = 0.38 | no significant studies | 0.28 [0.02, 0.53], kimp = 2 | 0.76 [-0.35, 1.87] | 0.21 [-0.25, 0.67]
54 | gerger et al. (2014) b | combined therapies (mostly cbt) vs. other therapies / ptsd symptoms (complex problems) / post | g = -0.23 [-0.42, -0.04] (re) | -0.25 [-0.46, -0.05], i2 = 49.1 [0, 87.5] | 12 (3) | τ = -0.15, z = -0.55 | a = 0.36 | -0.63 [-1.26, 0.31], -2.29 | -0.25 [-0.46, -0.05], kimp = 0 | -0.17 [-0.66, 0.31] | -0.22 [-0.48, 0.05]
55 | gerger et al. (2014) b | combined therapies (mostly emdr) vs. other therapies / ptsd symptoms (non-complex problems) / post | g = -0.87 [-1.20, -0.53] (re) | -0.88 [-1.21, -0.55], i2 = 40.6 [0, 95.5] | 6 (5) | τ = -0.2, z = -2.15* | a = 0.35 | -0.85 [-1.26, 0.45], -0.01 | -0.88 [-1.21, -0.55], kimp = 0 | 0.32 [-1.19, 1.83] | -0.76 [-0.96, -0.56]
56 | gerger et al. (2014) b | combined therapies (mostly tf-cbt) vs. other therapies / ptsd symptoms (without or unclear adequate credibility) / post | g = -0.40 [-0.59, -0.21] (re) | -0.42 [-0.61, -0.22], i2 = 44.3 [0, 82.8] | 15 (6) | τ = -0.03, z = -0.49 | a = 0.66 | -0.67 [-1.01, 0.35], -1.61 | -0.42 [-0.61, -0.22], kimp = 0 | -0.31 [-0.83, 0.21] | -0.32 [-0.58, 0.06]
59 | gerger et al. (2014) b | combined therapies (mostly tf-cbt) vs. other therapies / ptsd symptoms (outliers excluded) / post | g = -0.43 [-0.61, -0.25] (re) | -0.44 [-0.63, -0.26], i2 = 45.1 [0, 82.1] | 16 (7) | τ = -0.03, z = -0.3 | a = 0.61 | -0.68 [-0.95, 0.41], -1.75 | -0.44 [-0.63, -0.26], kimp = 0 | -0.39 [-0.71, -0.07] | -0.35 [-0.62, 0.08]
60 | gerger et al. (2014) b | combined therapies (mostly tf-cbt) vs. other therapies / ptsd symptoms (outliers excluded + complex problem) / post | g = -0.20 [-0.36, -0.04] (re) | -0.21 [-0.38, -0.03], i2 = 32.5 [0, 75.1] | 11 (2) | τ = 0.02, z = 0.36 | a = 0.02 | -0.48 [-0.91, 0.2], -1.85 | -0.37 [-0.57, -0.17], kimp = 4 | -0.25 [-0.52, 0.02] | -0.21 [-0.46, 0.04]
105 | gerger et al. (2014) b | combined therapies (structural equivalence no / unclear) vs. other therapies / ptsd symptoms / post | g = -0.68 [-0.96, -0.40] (re) | -0.69 [-0.97, -0.4], i2 = 47.9 [0, 89.5] | 11 (6) | τ = -0.29, z = -1.68 | a = 0.36 | -0.86 [-1.28, 0.49], -1.25 | -0.69 [-0.97, -0.4], kimp = 0 | 0.08 [-0.98, 1.14] | -0.56 [-0.92, -0.19]
94 | torchalla et al. (2012) | combined therapies (mostly other therapies) vs. other therapies / ptsd symptoms / follow-up | g = 0.08 [-0.03, 0.19] (re) | 0.08 [-0.03, 0.19], i2 = 0 [0, 63.2] | 8 (0) | τ = -0.29, z = -0.72 | a = 0.6 | no significant studies | 0.11 [0, 0.22], kimp = 2 | 0.1 [-0.03, 0.24] | 0.08 [0.03, 0.19]
66 | imel et al. (2013) | combined therapies (mostly tf-cbt) vs. combined therapies (mostly cbt) (individual & group) / dropout / post | lor = 0.27 [-0.34, 0.81] (re) | 0.27 [-0.28, 0.82], i2 = 0 [0, 15.2] | 9 (0) | τ = -0.11, z = -0.66 | a = 0.47 | no significant studies | 0.36 [-0.17, 0.89], kimp = 2 | 0.44 [-0.06, 0.94] | 0.38 [-0.23, 1]

note. column 1: the treatments are grouped by treatment category and therefore the id of the data sets is not ordered from 1 to 93. column 3: interventions in bold indicate the direction of the effect. parentheses specify the interventions. if not otherwise mentioned, interventions can be classified as individual therapy and face-to-face delivery. 95% confidence intervals for the i2-statistic were computed with the q-profile method (viechtbauer, 2007). a = test statistic of tes; acpmh = australian centre for posttraumatic mental health; amr = applied muscle relaxation; bdi = beck depression inventory; caps = clinician-administered ptsd scale; cbt = cognitive behavioral therapy; ci = confidence interval; d = cohen's d; delta = glass' delta; emdr = eye movement desensitization and reprocessing; exp = exposure; fe = fixed-effects model; g = hedges' g; itt = intention to treat; lor = log odds ratio; mh-rr = mantel-haenszel risk ratio; or = odds ratio; pct = present center therapy; pe = prolonged exposure; re = random-effects model with dersimonian and laird estimator for the between-study variance; rr = relative risk; sit = stress inoculation training; sm = x; sn = structured nursing; tau = treatment as usual; tf-cbt = trauma-focused cognitive behavior therapy; ttp = trauma treatment protocol; wl = wait list; wmd = weighted mean difference; w=n = effect sizes in meta-analysis weighted by sample size.
† number of studies or number of comparisons (if a study included more than one comparison). b the integration model was not specified in the paper. 0 effect size estimate of p-uniform was set equal to zero (if the effect size measure was relative risk or odds ratio, p-uniform's estimate was set equal to 1). c includes one to two children studies. ‡ sign in front of the original effect size appeared to be incorrect after contacting the corresponding author. □ not enough information to transform hedges' g to cohen's d. * p < .05, two-sided.

meta-psychology, 2022, vol 6, mp.2019.1615
https://doi.org/10.15626/mp.2019.1615
article type: original article
published under the cc-by4.0 license
open data: not applicable
open materials: yes
open and reproducible analysis: yes
open reviews and editorial process: yes
preregistration: no
edited by: felix d. schönbrodt
reviewed by: williams m., dienes, z.
analysis reproduced by: counsell a., batinović l.
all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/6wpn4

testing anova replications by means of the prior predictive p-value

m.a.j. zondervan-zwijnenburg, department of methodology & statistics, utrecht university, the netherlands
a.g.j. van de schoot, department of methodology & statistics, utrecht university, the netherlands; optentia research program, north-west university, south africa
h.j.a. hoijtink, department of methodology & statistics, utrecht university, the netherlands

abstract
in the current study, we introduce the prior predictive p-value as a method to test replication of an analysis of variance (anova). the prior predictive p-value is based on the prior predictive distribution. if we use the original study to compose the prior distribution, then the prior predictive distribution contains datasets that are expected given the original results. to determine whether the new data resulting from a replication study deviate from the data in the prior predictive distribution, we need to calculate a test statistic for each dataset. we propose to use f̄, which measures to what degree the results of a dataset deviate from an inequality constrained hypothesis capturing the relevant features of the original study: hrf. the inequality constraints in hrf are based on the findings of the original study and can concern, for example, the ordering of means and interaction effects. the prior predictive p-value consequently tests to what degree the new data deviate from predicted data given the original results, considering the findings of the original study. we explain the calculation of the prior predictive p-value step by step, elaborate on the topic of power, and illustrate the method with examples. the replication test and its integrated power and sample size calculator are made available in an r-package and an online interactive application.
as such, the current study supports researchers who want to adhere to the call for replication studies in the field of psychology.

keywords: anova, comparison of means, power analysis, prior predictive p-value, replication study

introduction

new studies conducted to replicate earlier original studies are often referred to as replication studies. after the latest "crisis in confidence" in the field of psychology, the call to conduct replication studies is stronger than ever (anderson & maxwell, 2016; asendorpf et al., 2013; cumming, 2014; earp & trafimow, 2015; ledgerwood, 2014; open science collaboration, 2012, 2015; pashler & wagenmakers, 2012; schmidt, 2009; verhagen & wagenmakers, 2014), and large replication projects such as the reproducibility project psychology (open science collaboration, 2015), reproducibility project: cancer biology (rp:cb) (errington et al., 2019), and many labs projects (ebersole et al., 2016; klein et al., 2014; klein et al., 2018) have been launched. as a result, methodology on conducting replication studies has received increasing attention (see for example anderson and maxwell, 2016; asendorpf et al., 2013; brandt et al., 2014; schmidt, 2009). there is, however, no standard methodology to determine whether a replication is successful or not (open science collaboration, 2015). the results of an original study are replicated when a new study corroborates the original findings. a common and intuitive method to assess whether a result is replicated is 'vote-counting': assessing whether the new effect is statistically significant and in the same direction as the significant effect in the original study (anderson & maxwell, 2016; simonsohn, 2015). vote-counting, however, has serious shortcomings.
first of all, it is a dichotomous evaluation that does not take into account the magnitude of differences between effect sizes of the original and new study (asendorpf et al., 2013; simonsohn, 2015). secondly, each of the effect sizes being significant does not imply that both effect sizes are the same, nor does one significant effect and one non-significant effect imply that both effects are different (gelman & stern, 2006; nieuwenhuis et al., 2011). stated otherwise, vote-counting does not formally test whether a result is replicated (anderson & maxwell, 2016; verhagen & wagenmakers, 2014). thirdly, underpowered replication studies are less likely to replicate significance, which can lead to misleading conclusions (asendorpf et al., 2013; cumming, 2008; hedges & olkin, 1980; simonsohn, 2015). in the current study, we address the following replication research question: "does the new study fail to replicate relevant features of the original study?". for example, the result of an original anova study is: group a > group b > group c. the finding can be: "group a performs better than group b, which performs better than group c"; "group a performs better than group b and c"; or "group a and b perform better than group c". the 'relevant features' subordinate to the replication test always have to be in line with the original result (i.e., group a > group b > group c) for the test to function properly. if the purpose of the replication test is to put the theory proclaimed by the original study to the test, then the claims of the original study determine the exact relevant features to be evaluated. however, if there is reason to test another feature, it is possible to let the relevant features deviate from the claims in the original study. the relevant features of original studies will be captured in the form of an informative hypothesis (hoijtink, 2012), which is specified using inequality constraints among the means of the anova model.
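the second shortcoming above (gelman & stern, 2006) is easy to make concrete. in the following python sketch (with invented summary statistics, purely for illustration), two studies observe exactly the same effect and differ only in sample size, yet vote-counting classifies only one of them as a successful replication:

```python
from scipy.stats import ttest_ind_from_stats

# identical observed effect (mean difference 0.4, sd 1.0) in two studies
# that differ only in the number of participants per group
pvals = {}
for n in (200, 25):
    res = ttest_ind_from_stats(mean1=0.4, std1=1.0, nobs1=n,
                               mean2=0.0, std2=1.0, nobs2=n)
    pvals[n] = res.pvalue
    print(f"n = {n} per group: p = {res.pvalue:.4f}")
# the effect size is the same in both studies, but vote-counting would count
# only the larger study as "significant" (p < .05) and call the smaller one
# a failed replication
```

the design choice is deliberate: using summary statistics rather than simulated raw data makes the example deterministic, so the contrast does not depend on a random seed.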
we propose to evaluate the replication of these hypotheses with the prior predictive p-value (box, 1980). the prior predictive p-value was not introduced to test replication. it was originally presented as a method to test whether the current data is unexpected given prior expectations concerning the parameter values of a statistical model. a disadvantage of the prior predictive check to test model fit is that it leaves undetermined whether the prior expectations about the parameter values or the model assumptions are incorrect. hence, as a model test the prior predictive check has been replaced by the posterior predictive check (gelman et al., 1996), which does not make prior assumptions about expected parameter values, but instead uses the posterior results given the current data. with respect to testing replication, however, the prior predictive check is a good method for three reasons. first, instead of non-empirical prior expectations, we use the posterior distribution of the model parameters given the original data as the prior distribution. consequently, we have a well-founded and clear-cut prior. second, the prior predictive check uses a distribution of datasets (i.e., the prior predictive distribution) that are expected given the prior (i.e., the posterior of the original study). in this manner, the prior predictive distribution takes into account that results in a new dataset resulting from a replication study may deviate from the original results because of random variation instead of meaningful differences. according to our definition, a study replicates if the new dataset is drawn from the same population as the original dataset. third, the prior predictive check uses a 'relevant checking function' for which we propose f̄ (silvapulle & sen, 2005, p. 38-39). the statistic f̄ captures the deviance from a constrained hypothesis that we base on the findings of the original study.
as a result, we can check whether the new study significantly fails to replicate relevant features of the original study, while taking variation into account.

table 1: replication research questions and methods to address them

current study and similar questions:
replication research question | method | setting | reference
does the new study fail to replicate relevant features of the original study? | prior predictive p-value | t-test, anova | current study
does the new study fail to replicate the effect size of the original study? | confidence interval for difference in effect sizes | t-test, correlation | anderson and maxwell (2016)
 | prediction interval | correlation | patil et al. (2016)
does the new study replicate the effect size of the original study? | equivalence test | t-test | anderson and maxwell (2016)
 | bayes factor | t-test | verhagen and wagenmakers (2014)
 | bayes factor | anova | harms (2018)
 | bayes factor | bf models a | ly et al. (2018)

other replication research questions:
replication research question | method | setting | reference
is the effect present or absent in the replication study? | bayes factor | t-test, correlation b | marsman et al. (2017)
is cohen's d in the population of a detectable size? | telescope test | t-test c | simonsohn (2015)
is the original effect size extreme in comparison to the new study? | confidence interval for difference in effect sizes | t-test, correlation | open science collaboration (2015)
what is cohen's d in the population? | confidence interval for average effect size | t-test | anderson and maxwell (2016)
what is the effect size (corrected for publication bias) in the population? | hybrid meta-analysis | t-test | van aert and van assen (2017)

a all models for which a bayes factor can be computed. b the reconceptualization by ly et al. (2018) generalizes to most common experimental designs. c the telescope test is explained in the t-test setting, but applicable to any model for which a power analysis can be conducted.

table 1 shows how our research question and proposed method relate to other replication research questions and associated methods that have been proposed. our method addresses a question similar to that in anderson and maxwell (2016), harms (2018), ly et al. (2018), verhagen and wagenmakers (2014) and patil et al. (2016), but now enables researchers to evaluate the replication of relevant features of an original anova study. the bottom panel of table 1 shows other replication research questions that will not be pursued in this paper. the reader interested in these questions should consult the given references. the goal of this paper is to introduce the prior predictive p-value as a method to test replication of relevant features of original anova studies. in the first section, we provide a step by step introduction of the prior predictive p-value as included in the anovareplication r-package and the online interactive application (see osf.io/6h8x3). in the second section, we discuss the statistical power of the prior predictive p-value. in the third section, we explain how to use and interpret the prior predictive p-value by means of a workflow.
in the fourth section, we use several studies from the reproducibility project psychology (open science collaboration, 2012) to demonstrate the use of the prior predictive p-value. the paper ends with a discussion and conclusion section.

prior predictive p-value

the evaluation of the replication of an anova study by means of the prior predictive p-value (box, 1980) consists of three steps that will be explained below.

step 1: prior predictive distribution of the data

the anova model is given by:

y_ijd = µ_jd + ε_ijd, with ε_ijd ~ n(0, σ²_d), (1)

where y_ijd is observation i = 1, ..., n_jd in group j = 1, ..., j for dataset d ∈ {o, r, sim}, where o denotes the original data, r denotes the new data, and sim denotes simulated data; the latter will be introduced towards the end of this section. furthermore, µ_jd is the mean of group j in dataset d, ε_ijd is the error term, and σ²_d is the pooled variance over all j groups. the original anova results can be summarized in the posterior distribution of the parameters: g(µ_o, σ²_o | y_o), where µ_o = [µ_1o, ..., µ_jo] and y_o includes all observations y_ijo:

g(µ_o, σ²_o | y_o) ∝ f(y_o | µ_o, σ²_o) h(µ_o, σ²_o), (2)

where the density of the data

f(y_o | µ_o, σ²_o) = ∏_{j=1}^{j} ∏_{i=1}^{n_jo} (1 / (√(2π) σ_o)) exp(-(y_ijo − µ_jo)² / (2σ²_o)) (3)

and the standard prior distribution

h(µ_o, σ²_o) ∝ 1/σ²_o, (4)

that is, a uniform prior on the means and jeffreys' prior on the variance. the prior distribution for the analysis of the original data is uninformative; that is, the posterior distribution is completely determined by the original data in order to match the results of the original study. if the original study used a bayesian analysis, the priors should match those of the original study in order to reproduce the original study results. given the observed original results, the prior distribution for future parameters h(µ_r, σ²_r) = h(µ_sim, σ²_sim) = g(µ_o, σ²_o | y_o). with the prior predictive p-value, we then test h0: µ_r, σ²_r ~ h(µ_r, σ²_r).
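the paper's implementation lives in the anovareplication r-package; as a rough self-contained sketch (python, with invented summary statistics), step 1 exploits the conjugacy implied by equations (2)-(4): under the prior h ∝ 1/σ², the posterior satisfies rss_o/σ² ~ χ²(n − j) and µ_j | σ² ~ n(ȳ_jo, σ²/n_jo), after which predicted datasets are drawn with the new study's group sizes:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2020)

# summary statistics of a hypothetical original study (j = 3 groups)
ybar_o = np.array([1.0, 2.0, 3.0])   # group means
n_o = np.array([50, 50, 50])         # group sizes
s2_o = 5.0                           # pooled variance
N, J = int(n_o.sum()), len(n_o)
rss_o = s2_o * (N - J)               # residual sum of squares of the original data

n_r = np.array([50, 50, 50])         # group sizes of the planned new study

def draw_prior_predictive(T):
    """draw T datasets from the prior predictive distribution, eq. (5)."""
    datasets = []
    for _ in range(T):
        # posterior of sigma^2 under h ∝ 1/sigma^2: rss_o / sigma^2 ~ chi2(N - J)
        sigma2 = rss_o / chi2.rvs(N - J, random_state=rng)
        # posterior of each group mean given sigma^2: normal around the observed mean
        mu = rng.normal(ybar_o, np.sqrt(sigma2 / n_o))
        # one predicted dataset, simulated with the replication's group sizes
        datasets.append([rng.normal(m, np.sqrt(sigma2), n) for m, n in zip(mu, n_r)])
    return datasets

sims = draw_prior_predictive(200)
```

note that the predicted datasets carry two layers of variation, parameter uncertainty from the posterior draw and sampling noise from the normal draws, which is exactly what lets the method tolerate random deviation in a replication.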
h0 states that µ_r, σ²_r follow the distribution of the prior for µ_r, σ²_r. loosely formulated, h0 states that the parameters in the new data are in line with our expectations given the original results. to test h0, we obtain datasets that are to be expected given the original data. using this prior we simulate data y_sim that are to be expected given the results of the original study:

f(y_sim) = ∫ f(y_sim | µ_sim, σ²_sim) h(µ_sim, σ²_sim) dµ_sim dσ²_sim, (5)

where f(y_sim) is the prior predictive distribution of the data. note that f(y_sim | µ_sim, σ²_sim) is the counterpart of equation 3 for dataset sim instead of o. datasets y^t_sim for t = 1, ..., t, where t denotes the number of samples from the prior predictive distribution, are obtained by sampling µ^t_sim, σ^t_sim from h(µ_sim, σ_sim) = g(µ_o, σ_o | y_o), and subsequently simulating y^t_sim from f(y_sim | µ^t_sim, σ^t_sim) (cf. equation 3). datasets y^t_sim have sample sizes n_1r, ..., n_jr, because the predicted data needs to be compared to the new data y_r that has sample sizes n_1r, ..., n_jr. the steps in the following sections elaborate how new data y_r can be compared to the t data matrices sampled from f(y_sim) that are to be expected given h0, using a test statistic that evaluates relevant features of the original data.

step 2: test statistic evaluating relevant features

we propose to use f̄ (silvapulle & sen, 2005, p. 38-39) as a test statistic to evaluate how much the predicted data and the observed data deviate from an inequality constrained hypothesis capturing the relevant features of the original study, hrf:

f̄_yd = (rss_d,hrf − rss_d,hu) / s²_d, (6)

where rss_d,hu denotes the residual sum of squares in dataset d ∈ {r, sim} for the unrestricted hypothesis hu: µ_1d, ..., µ_jd,

rss_d,hu = Σ_ij (y_ijd − ȳ_jd)², (7)

where ȳ_jd denotes the mean for group j in dataset d.
s²_d denotes the mean squared error,

s²_d = rss_d,hu / (n − j), (8)

where n = Σ_{j=1}^{j} n_jr, and

rss_d,hrf = Σ_ij (y_ijd − µ̃_jd)², (9)

where

µ̃_d = [µ̃_1d, ..., µ̃_jd] = argmin_{µ_d ∈ hrf} Σ_ij (y_ijd − µ_jd)². (10)

µ̃_d thus contains the set of parameter estimates that minimize the residual sum of squares for y_d under the constraints imposed by hrf. f̄_yd is the scaled difference between the residual sum of squares under the constraints imposed by hrf and the residual sum of squares for y_d under hu. as hu is unrestricted, f̄_yd quantifies the misfit of y_d with hrf. the hypothesis capturing the relevant features of the original data, hrf, is of the form rµ_d > 0, where r is a k × j restriction matrix, j denotes the number of groups in the anova study, and k the number of restrictions in hrf, while µ_d is the mean vector of length j. examples of constraints that can be applied under rµ_r > 0 are:

• simple order constraints: µ_jd > µ_j′d, or µ_jd < µ_j′d for a pair j, j′.
• interaction effects: (µ_abd − µ_ab′d) > (µ_a′bd − µ_a′b′d), for a 2 × 2 contingency table.

the constraints in hrf should be based on the findings of the original study, which implies and requires that hrf is always in agreement with the results of the original study (i.e., f̄_yo = 0). the results of the original study alone are usually not enough to determine which hrf is to be evaluated. for example, an original study shows that ȳ_1o < ȳ_2o < ȳ_3o. this finding may lead to hrf: µ_1d < µ_2d < µ_3d, but also to hrf: (µ_1d, µ_2d) < µ_3d or hrf: µ_1d < (µ_2d, µ_3d). which exact features should be covered in hrf can be guided by the conclusions of the original study. for example, if in the original study it is concluded that a treatment condition leads to better outcomes than two control conditions, the most logical specification of the relevant features is hrf: (µ_controla,d, µ_controlb,d) < µ_treatment,d.
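as an illustration (not the paper's r implementation), f̄ in equations (6)-(10) can be computed with a generic constrained optimizer. the restriction matrix r below encodes the ordering hrf: µ1 < µ2 < µ3 just discussed, i.e. rµ > 0, and the data are made up:

```python
import numpy as np
from scipy.optimize import minimize

def f_bar(groups, R):
    """F-bar (eq. 6): scaled misfit of the data with the constraints R @ mu > 0."""
    means = np.array([np.mean(g) for g in groups])
    ns = np.array([len(g) for g in groups])
    N, J = int(ns.sum()), len(groups)
    rss_u = float(sum(np.sum((np.asarray(g) - np.mean(g)) ** 2) for g in groups))  # eq. 7
    s2 = rss_u / (N - J)                                                           # eq. 8
    # eq. 10: group means minimizing the RSS subject to R @ mu >= 0
    # (eq. 9 expands to rss_u plus a weighted distance between means and mu)
    res = minimize(
        lambda mu: float(np.sum(ns * (means - mu) ** 2)) + rss_u,
        x0=means,
        constraints=[{"type": "ineq", "fun": lambda mu: R @ mu}],
    )
    return (res.fun - rss_u) / s2                                                  # eq. 6

# hypothetical replication data and the constraint mu1 < mu2 < mu3
rng = np.random.default_rng(1)
R = np.array([[-1.0, 1.0, 0.0],    # mu2 - mu1 > 0
              [0.0, -1.0, 1.0]])   # mu3 - mu2 > 0
ordered = [rng.normal(m, 1.0, 50) for m in (0.0, 1.0, 2.0)]
reversed_ = [rng.normal(m, 1.0, 50) for m in (2.0, 1.0, 0.0)]
fb_ordered = f_bar(ordered, R)
fb_reversed = f_bar(reversed_, R)
print(fb_ordered, fb_reversed)  # ~0 when the constraints hold, large otherwise
```

for simple orderings the constrained minimization in (10) is a weighted isotonic regression, so a pool-adjacent-violators routine would also do; the generic optimizer is used here because it handles any restriction matrix r.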
alternatively, if in the original study it is concluded that treatment a is better than treatment b, which is better than the control condition, a logical relevant feature hypothesis would be hrf: µ_treatmenta,d > µ_treatmentb,d > µ_control,d. it may also occur that the researcher conducting the replication test has an interest in evaluating a claim that is not in the original study, but could be made based on its results. in all cases, the researcher conducting the replication test should substantiate the choices made in the formulation of hrf with results from the original study. it is good practice to also pre-register hrf. in the examples section, we demonstrate for two studies how the original study is linked to hrf. first, however, we explain how the prior predictive p-value is calculated.

step 3: p-value

the third and final step is to compute the prior predictive p-value. when we calculate f̄_{y^t_sim} for each dataset y^t_sim obtained in step 1 with respect to f̄ as defined in step 2, a sampling-based representation of the prior predictive distribution of the test statistic, f(f̄_ysim), is obtained. consequently,

p = p(f̄_ysim ≥ f̄_yr | h0) = (1/t) Σ_{t=1}^{t} i(f̄_{y^t_sim} ≥ f̄_yr), (11)

where h0 denotes "replication", that is, h0: µ_r, σ²_r ~ h(µ_r, σ²_r). furthermore, i is an indicator function that takes on the value 1 if the argument is true and 0 otherwise. as illustrated in figure 1, the prior predictive p-value indicates how exceptional the observed statistic for the new data, f̄_yr, is compared to its prior predictive distribution f(f̄_ysim). the shaded area on the right side of f̄_yr is p(f̄_ysim ≥ f̄_yr | h0), that is, the prior predictive p-value. if the prior predictive p-value is significant, we reject replication of the relevant features of the original study by the new data. note that the focus is on rejecting replication of the original results and not on rejecting hrf in itself for the new study.¹

¹ to test hrf we recommend hoijtink et al. (2019) and vanbrabant et al. (2015).

figure 1. an illustration of the prior predictive p-value.

uniformity. to determine the significance of a p-value by comparing it to some preselected value α, the p-value needs to be uniformly distributed if h0 is true. only when the p-value is uniform is α equal to the nominal type i error. we will demonstrate that this is true for the prior predictive p-value if f(f̄_ysim) is continuous, and that it is true up to some α0 if f(f̄_ysim) is discrete. a p-value is uniform if:

p(p ≤ α | h0) ≤ α for all α ∈ [0, 1], (12)

where p denotes a p-value from f(p | h0), that is, the null distribution of the p-values. the following three steps prove that equation 12 holds for the prior predictive p-value when f(f̄_ysim) is continuous:

1. p(p < α | h0) = p(f̄_yr > f̄_ysim,1−α | h0), where f̄_yr is the test statistic rendering p via p = p(f̄_ysim > f̄_yr | h0) and f̄_ysim,1−α is the (1−α)th percentile of the distribution f(f̄_ysim | h0).

2. p(f̄_yr > f̄_ysim,1−α | h0) = ∫_{f̄_yr > f̄_ysim,1−α} f(f̄_yr | h0) df̄_yr, where f(f̄_yr | h0) denotes the distribution of f̄_yr under h0.

3. for the situations considered in this paper it holds that f(f̄_yr | h0) = f(f̄_ysim), therefore ∫_{f̄_yr > f̄_ysim,1−α} f(f̄_yr | h0) df̄_yr = ∫_{f̄_ysim > f̄_ysim,1−α} f(f̄_ysim) df̄_ysim = α, which completes the proof.

with constraints of the form rµ_r > 0, however, f(f̄_ysim) will often be discrete. when f(f̄_ysim) is discrete, the prior predictive p-value is not uniform for all α ∈ [0, 1]. for example, let us obtain g(µ_o, σ²_o | y_o) = h(µ_r, σ²_r) for an original study with ȳ_1o = 1, ȳ_2o = 2, ȳ_3o = 3, s²_o = 5, and n_jo = 50, with n_jr = 50 and hrf: µ_1r < µ_2r < µ_3r. subsequently, we simulate y^t_r for t = 1, ..., 100,000, and calculate the prior predictive p-value for each y^t_r. the result is f(p | h0), which is plotted in figure 2a.
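the simulation just described can be sketched end-to-end in python (a toy version with far fewer draws than the paper's 100,000, not the anovareplication code; a weighted pool-adjacent-violators projection stands in for the generic constrained fit because hrf here is a simple ordering):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
ybar_o, s2_o, n = np.array([1.0, 2.0, 3.0]), 5.0, 50   # the example's summary stats
J, N = 3, 3 * 50
rss_o = s2_o * (N - J)

def draw_params():
    # one draw from g(mu_o, sigma2_o | y_o) under the prior h ∝ 1/sigma^2
    sigma2 = rss_o / chi2.rvs(N - J, random_state=rng)
    return rng.normal(ybar_o, np.sqrt(sigma2 / n)), sigma2

def f_bar_ordered(data):
    # F-bar for HRF: mu1 < mu2 < mu3, via weighted pool-adjacent-violators
    means = data.mean(axis=1)
    s2 = data.var(axis=1, ddof=0).sum() * n / (N - J)
    blocks = []
    for m in means:
        blocks.append([m, n])
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            m2, w2 = blocks.pop(); m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    proj = np.concatenate([[m] * int(w // n) for m, w in blocks])
    return float(np.sum(n * (means - proj) ** 2) / s2)

def simulate(T):
    out = np.empty(T)
    for t in range(T):
        mu, sigma2 = draw_params()
        out[t] = f_bar_ordered(rng.normal(mu[:, None], np.sqrt(sigma2), (J, n)))
    return out

f_sim = simulate(2000)   # prior predictive distribution of F-bar
f_rep = simulate(1000)   # new studies for which H0 (replication) is true
p = (f_sim[None, :] >= f_rep[:, None]).mean(axis=1)   # eq. 11 for each new study
print((p == 1.0).mean())  # point mass at p = 1: every dataset with F-bar = 0 gets p = 1
```

whenever the replication's sample means happen to satisfy the ordering, f̄ is exactly 0 and equation (11) returns exactly 1, which is the spike the figure describes.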
in figure 2a, we see a thick vertical line that indicates a set of p-values with exactly the same value, namely 1.00. this set of equal p-values results from the fact that hrf: µ_1r < µ_2r < µ_3r is true for a substantial number of datasets y^t_r, causing the associated f̄_{y^t_r} to be exactly equal to 0 and the associated prior predictive p-values to be exactly equal to 1 (see figure 2b). generally, however, there exists an α0 for which f(p | h0) is uniform (meng, 1994), since all values in f(f̄_ysim) other than 0 will occur in a continuous fashion. thus, the p-value is uniform for α ∈ [0, α0]. if the preselected α < α0, α is equal to the nominal type i error. α0 can be computed as 1 − p(f(f̄_ysim) = 0). for example, α0 ≥ .05 if no more than 95% of f̄_ysim is exactly 0. it would be exceptional if more than 95% of f̄_ysim = 0, but it could occur with extremely low power in the original study and an unspecific hrf.

figure 2. uniformity of the prior predictive p-value for hrf: µ_1r < µ_2r < µ_3r. (a) f(p | h0); (b) f(f̄_ysim).

a visualization of f(f̄_ysim) can help to roughly estimate α0. for the discrete f(f̄_ysim) considered here, 53% of f(f̄_ysim) = 0 and α0 = .47 (figure 2b). in the next section, we deal with another important property of null hypothesis significance testing methods: power.

power

power is the probability to reject the null hypothesis (of replication) with a preselected α when not the null, but an alternative hypothesis is true. researchers typically pursue a power of .80. let us denote power by γ:

γ = p(p < α | ha) = p(f̄_yr > f̄_ysim,1−α | ha), (13)

where ha is the population under the alternative hypothesis for which replication is to be rejected. note that any population for which h0 is not true can qualify to reject replication.
the population used is determined by the theoretical context in which the replication test takes place. the population with µ_1a = ... = µ_ja is a special population that is generally considered to display a non-effect in anova studies. hence, µ_1a = ... = µ_ja seems a natural default choice for the population under the alternative hypothesis. as a best guess for µ_ja and σ²_a in a power analysis, the grand mean ȳ_o and variance σ²_o of the original study can be used. the population under the alternative hypothesis with µ_1a = ... = µ_ja is on the edge of hrf: it deviates minimally from hrf; hence, the associated γ will be a lower limit. power will increase when the population under the alternative hypothesis is more different from hrf than the population with equal means, for example, when the means are ordered differently.

simulation study

to illustrate the power of the prior predictive p-value, we conducted a simulation study in which we varied the effect size in the original study f_o, the sample size for the original study n_jo, the sample size for the new study n_jr, the relevant feature of interest hrf, and the population under the alternative hypothesis ha, as specified in table 2. for each cell in the simulation study, 10,000 samples were drawn from ha and power was calculated according to equation 13. the results of the simulation study are provided in table 3. as expected, power generally increases with increasing effect sizes, increasing sample sizes, and increasing deviation between y_o and ha. there are, however, some exceptions: with small f_o and low n_jo, larger n_jr only emphasize the noise in the original study more and do not lead to an increase in power. similarly, a more specific hrf does not always increase power. given original studies with smaller samples and smaller effect sizes, h(µ_r, σ²_r) is so uninformative that more specific hrf are only more inaccurate under h0, and f̄_yr needs to be extremely large to reject the null.
table 3 also shows that the power on the edge (i.e., the power for ha1) is insufficient for original studies with small and medium effect sizes (γ < .60 in all cells). with medium f_o, power is only sufficient if the new study originates from a population in which the means are ordered differently (e.g., ha2). for original studies with large effect sizes and group sample sizes in the original studies with at least 50 participants per group, power can be sufficient under ha1. power levels off, however, for hrf1 and hrf2 at .67 and .83, respectively. under µ_1a = µ_2a = µ_3a, hrf1: µ_1r < (µ_2r, µ_3r) is true in 1/3 of the situations by chance. consequently, power cannot exceed 1 − 1/3 = .67. for hrf2: µ_1r < µ_2r < µ_3r, 1/6 of the combinations under ha1 are in line with replication by chance. hence, power cannot exceed 1 − 1/6 = .83.

table 2: simulation sample statistics for the original study (y_o) and population values under ha

y_o: f_o = .10: ȳ_1o = -0.12, ȳ_2o = 0.00, ȳ_3o = 0.12, s²_o = 1.00 | f_o = .25: ȳ_1o = -0.31, ȳ_2o = 0.00, ȳ_3o = 0.31, s²_o = 1.00 | f_o = .40: ȳ_1o = -0.49, ȳ_2o = 0.00, ȳ_3o = 0.49, s²_o = 1.00
ha: f_a = 0: µ_1a = 0.00, µ_2a = 0.00, µ_3a = 0.00, σ²_a = 1.00 | f_a = .10: µ_1a = 0.00, µ_2a = -0.24, µ_3a = 0.00, σ²_a = 1.00
note. effect size f as introduced by cohen (1988, p. 274-275). other simulation factors: n_jd ∈ {20, 50, 100}. hrf1: µ_1d < (µ_2d, µ_3d); hrf2: µ_1d < µ_2d < µ_3d.

table 3: power

f_o | n_jo | hrf1|ha1 (n_jr = 20, 50, 100) | hrf2|ha1 (n_jr = 20, 50, 100) | hrf1|ha2 (n_jr = 20, 50, 100) | hrf2|ha2 (n_jr = 20, 50, 100)
.10 | 20 | .03 .01 .00 | .02 .00 .00 | .11 .08 .05 | .08 .04 .02
.10 | 50 | .08 .06 .03 | .06 .04 .02 | .21 .31 .36 | .19 .22 .28
.10 | 100 | .11 .12 .10 | .09 .10 .08 | .26 .45 .62 | .25 .39 .52
.25 | 20 | .13 .10 .06 | .09 .05 .01 | .33 .40 .45 | .25 .26 .23
.25 | 50 | .25 .32 .37 | .20 .26 .26 | .48 .73 .88 | .43 .62 .81
.25 | 100 | .30 .46 .57 | .29 .44 .55 | .54 .83 .96 | .53 .79 .94
.40 | 20 | .32 .41 .41 | .27 .26 .21 | .59 .78 .89 | .49 .64 .75
.40 | 50 | .49 .66 .67 | .45 .68 .83 | .74 .93 .98 | .69 .93 .99
.40 | 100 | .55 .66 .67 | .57 .83 .83 | .77 .93 .97 | .77 .98 .99
note. text in cells with γ ≥ .80 is boldface. text in cells with a maximum γ in relation to the specific hrf|ha is italic.
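the two ceilings follow from counting orderings: under equal population means, each of the 3! = 6 rankings of the three sample means is equally likely, so the chance that a given hrf holds by accident fixes an upper bound on power. a quick enumeration in python:

```python
from itertools import permutations
from fractions import Fraction

# all equally likely rankings of the three sample means under mu1 = mu2 = mu3;
# each tuple gives hypothetical values of (mean1, mean2, mean3)
rankings = list(permutations([1, 2, 3]))

# HRF1: mu1 < (mu2, mu3), i.e. the first mean is the smallest
chance_hrf1 = Fraction(sum(m[0] < m[1] and m[0] < m[2] for m in rankings),
                       len(rankings))
# HRF2: mu1 < mu2 < mu3, the full ordering
chance_hrf2 = Fraction(sum(m[0] < m[1] < m[2] for m in rankings),
                       len(rankings))

print(chance_hrf1, 1 - chance_hrf1)  # 1/3 by chance, power ceiling 2/3 ≈ .67
print(chance_hrf2, 1 - chance_hrf2)  # 1/6 by chance, power ceiling 5/6 ≈ .83
```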
if we move further from the edge of hrf, as we do with ha2, power increases. thus, the power of the prior predictive p-value considering an hrf with three or fewer order constraints will almost never be high if the true means are equal, but can be high if there is a different ordering in reality as compared to the one in hrf. the results demonstrate that imprecise estimates (i.e., large standard errors leading to a low-informative prior) in the original study lead to low power, especially on the edge of hrf. this is as true for the prior predictive p-value as it is for other approaches. for example, in a classical anova study with three groups of 20 participants each, power is <.10, <.40, and <.80 for small, medium, and large effect sizes respectively; a result that was already pointed out in cohen (1988, p. 313). zondervan-zwijnenburg and rijshouwer (2020) demonstrate the application of different methods to evaluate replication within the context of small samples. not a single method is unaffected by small sample sizes. as highlighted by morey and lakens (2019) and patil et al. (2016): replication can only be rejected based on the findings of the original study, and when these findings are highly imprecise due to large standard deviations and small sample sizes, rejecting them is hard or even impossible. underpowered original studies may result in nonsignificant prior predictive p-values that have a high probability of being type ii errors (morey & lakens, 2019). therefore, only reporting the prior predictive p-value is not enough; the probability of a type ii error (i.e., 1 − γ) given the population under the alternative hypothesis should be communicated to the reader as well. the next section elaborates on the computation of power and the required sample size for sufficient statistical power. the workflow and examples sections explain how researchers should incorporate prior predictive p-values and power.
one of the examples will also demonstrate rejected replication despite low power on the edge of h0.

power and sample size determination

as highlighted in the previous sections and in the literature (e.g., brandt et al., 2014; simonsohn, 2015), power is an important characteristic of a convincing replication study. it is thus important that researchers can calculate the power of the prior predictive check, and can determine the sample size for a new study such that the replication test has high statistical power. therefore, the anovareplication r-package and the online interactive application (see osf.io/6h8x3) include a power and sample size calculator. given the vector with group sample sizes in the new study nr, h(µr, σ²r), ha, hrf, and α, the power γ is calculated as follows:

1. following steps 1 and 2 of the prior predictive check, t = 1, ..., T datasets are simulated to obtain f(f̄ysim), and f̄ysim,1−α can be calculated.
2. given µa and σa, t = 1, ..., T datasets are simulated from f(f̄yr | ha) with sample sizes nr. following step 2 of the prior predictive check, f̄yr is calculated for each dataset.
3. γ = p(f̄yr > f̄ysim,1−α | ha) = (1/T) Σ_{t=1}^{T} I(f̄yrᵗ ≥ f̄ysim,1−α).

as default choice for µa, we recommend to use ȳo for each group. with this setting, the power is calculated to reject replication in case of equal group means. as default choice for σa, we recommend the pooled standard deviation of the original study. to determine the required sample size to reject replication with sufficient power, we use an iterative procedure. in addition to h(µr, σ²r), ha, hrf, and α, we use the following information to calculate the required sample size: a target power level γ̃; a small margin covering acceptable values around the target power, γmargin, because the calculated power may not be exactly equal to the target power; a starting value for the group sample size njr0; a maximum number of iterations qmax; and a maximum total sample size for the new study nrmax.
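the three power steps can be sketched in a few lines of python. this is an illustration only, not the authors' implementation (which is the anovareplication r-package): we substitute a plug-in normal predictive distribution for the exact prior predictive f(f̄ysim), and a simple scaled squared-violation statistic for the paper's f̄; all function and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def fbar(groups):
    """stand-in discrepancy for hrf1 (mu1 < mu2, mu1 < mu3): 0 when the sample
    means satisfy the constraints, a scaled squared violation otherwise."""
    m = [g.mean() for g in groups]
    s2 = np.mean([g.var(ddof=1) for g in groups])  # pooled variance
    n = len(groups[0])
    viol = max(0.0, m[0] - m[1]) ** 2 + max(0.0, m[0] - m[2]) ** 2
    return n * viol / s2

def power(mu_o, sd_o, mu_a, sd_a, n_r, T=2000, alpha=0.05):
    # step 1: simulate predicted data based on the original study and find the
    # (1 - alpha) quantile of the simulated f-bar values
    sim = [fbar([rng.normal(m, sd_o, n_r) for m in mu_o]) for _ in range(T)]
    crit = np.quantile(sim, 1 - alpha)
    # step 2: simulate new-study data under h_a and compute f-bar for each dataset
    rep = [fbar([rng.normal(m, sd_a, n_r) for m in mu_a]) for _ in range(T)]
    # step 3: power = proportion of h_a datasets exceeding the critical value
    return np.mean([f > crit for f in rep])

# original study with a clearly ordered large effect (table 2, fo = .40) versus
# h_a with equal means, i.e. the edge of hrf1
g = power(mu_o=[-0.49, 0.0, 0.49], sd_o=1.0, mu_a=[0.0, 0.0, 0.0], sd_a=1.0, n_r=100)
print(round(g, 2))  # close to the .67 ceiling derived for hrf1 on the edge
```

under equal means the statistic is nonzero whenever µ1 is not the smallest sample mean, which happens with probability 2/3, so the estimate hovers near the .67 ceiling discussed above.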
our default values are: γ̃ = .825, γmargin = .025, α = .05, njr0 = 20, qmax = 10, and nrmax = 600.

1. in every iteration q, γq is calculated given njrq.
2. when q > 1, njrq+1 is determined by regressing {γ1, ..., γq} on {njr1, ..., njrq} with a linear or quadratic (only if q = 3) function. in case of a linear regression, the regression coefficient β1 is the power increase per subject. subsequently, njrq+1 = (γ̃ − γq)/β1 + njrq. in case of regression with a quadratic function, njrq+1 is calculated by solving the polynomial γ̃ = β0 + β1·njrq+1 + β2·njrq+1².
3. repeat steps (1) and (2) until γq ∈ [γ̃ − γmargin, γ̃ + γmargin] (i.e., power is sufficient), or γq−1 ≈ γq (i.e., power no longer increases up to two decimal points), or njrq−1 = njrq (i.e., the sample size no longer changes), or q = qmax, or Σj njrq = nrmax.

workflow

to clarify the procedure to obtain the prior predictive p-value, the workflow is depicted in figure 3.

step 1. the first steps (1a-1c) only require the original study. step 1a is to derive the relevant feature to be evaluated in the test statistic from the findings of the original study. next (step 1b), the population for which replication should be rejected (i.e., ha) can be defined. what is the ordering of the means in this population and what is the effect size in that ordering? ha can be a population in which all means are equal, but it does not have to be. step 1c is to obtain the data of the original study, or to reconstruct the data based on reported means, standard deviations, and group sample sizes.

step 2. if the new study is not yet conducted, the second step is to calculate the required sample size per group for the new study to reject replication with sufficient power (i.e., γ). the sample size calculation can be conducted with the sample.size.calc function in the anovareplication package.
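the iterative search can be sketched as follows. this is our python paraphrase of the procedure above, not the sample.size.calc implementation: for simplicity it always uses the linear update, doubles the sample when no slope is available yet, and is demonstrated on a toy power curve rather than a monte carlo power routine.

```python
import numpy as np

def sample_size(power_at, target=0.825, margin=0.025, n0=20, q_max=10, n_max=600):
    """iteratively search for a group sample size whose power lands in
    [target - margin, target + margin], updating n via a linear fit of power on n."""
    ns, gammas = [n0], [power_at(n0)]
    for _ in range(q_max - 1):
        g, n = gammas[-1], ns[-1]
        if abs(g - target) <= margin or n >= n_max:
            break                                  # power sufficient or cap reached
        if len(ns) == 1:
            n_next = min(n_max, 2 * n)             # no slope estimate yet: double n
        else:
            b1, b0 = np.polyfit(ns, gammas, 1)     # b1 = power increase per subject
            if b1 <= 0:
                break                              # power not increasing any more
            n_next = int(round(n + (target - g) / b1))
            n_next = max(2, min(n_max, n_next))
        if n_next == n:
            break                                  # sample size no longer changes
        ns.append(n_next)
        gammas.append(power_at(n_next))
    return ns[-1], gammas[-1]

# toy smooth power curve, for illustration only
toy = lambda n: 1 - np.exp(-n / 80)
n, g = sample_size(toy)
print(n, round(g, 3))
```

a monte carlo power function (such as the sketch in the previous section) can be passed as `power_at` in place of the toy curve.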
if the function cannot find a (reasonable) group sample size for which γ is sufficient, this implies that the original study is not suited for replication testing with the prior predictive p-value for the specified ha: its conclusions are too vague (i.e., the standard errors are too wide) to reject replication if ha is true. there is still a chance that the prior predictive p-value turns out significant, especially if the observed data are more extreme than most samples from ha, but the researcher should consider whether collecting data with such a low probability of a meaningful result is ethically acceptable.

step 3. as a third step, the prior predictive p-value can be computed with the function prior.predictive.check. the power associated with the sample size of the new study can be calculated with power.calc. note that this is not a post-hoc power analysis, as the definition of ha is unrelated to the new study. hence, the power to reject replication for ha can be insufficient (i.e., lower than 1 − β, where β is the preset type ii error rate) while the prior predictive p-value is statistically significant, or vice versa. unless f̄yr is exactly 0, figure 3 assists in interpreting the resulting p-value in light of the statistical power to reject replication for ha. if the new study perfectly meets the features of the original study as described in hrf, f̄yr will be 0 and the prior predictive p-value 1.00. in such a case, we confirm replication of the relevant features of the original study as captured in hrf, irrespective of power. theoretically it is possible that f̄yr = 0 while the new study is an extreme sample from a population in which hrf is not true. that, however, is not under consideration here, as our question was whether the observed new study replicates, or fails to replicate, relevant features of the
original study.

figure 3 (the prior predictive p-value workflow) summarizes the decision process. if f̄yr = 0, the new study matches hrf and replication is confirmed. otherwise, when power is sufficient (γ ≥ 1 − β): if ppp < α, replication is rejected (report γ for ha); if not, replication is not rejected despite sufficient power to do so (report γ for ha). when power is insufficient: if ppp < α, replication is rejected despite low power, meaning the observed data are more extreme than ha; if not, replication is not rejected, and γ should be reported with the emphasis that the type ii error rate is 1 − γ for ha. in the latter case, one can optionally go back to step 2 to see whether a larger new study could resolve the power issue, or whether the original study and its conclusions are not specific enough to test replication at all (i.e., the required njr is unbounded, in which case the original study is not suited for replication testing with the prior predictive p-value).

in case of a non-significant result in combination with low power, the researcher should emphasize the probability that not rejecting replication is a type ii error, and it is advised to conduct a replication study with larger njr. the required sample size per group can again be calculated with the sample.size.calc function in the anovareplication package. if the required njr is excessive given ha, it may be an inevitable conclusion that the original study is not suited for replication testing by means of the prior predictive p-value. if replication is rejected despite low power, it implies that the observed new dataset deviates more from hrf than most datasets under ha. even with sufficient statistical power, it is still informative to notify the reader of the achieved power and/or the probability of a type ii error given the population under ha.

examples

to illustrate the use of the prior predictive check to assess whether relevant anova features are replicated, we selected two replication studies that were part of the reproducibility project: psychology initiated by the open science collaboration (2012, 2015).
all calculations can be performed with the anovareplication r-package (zondervan-zwijnenburg, 2018). the first study is fischer et al. (2008), who studied the impact of self-regulation resources on confirmatory information processing. according to the theory, people who have low self-regulation resources (i.e., depleted participants) will prefer information that matches their initial standpoint. an ego-threat condition was added because the literature proposes that ego-threat affects decision-relevant information processing, although the direction of this effect is not clear. to determine which relevant feature of the results (see table 4) should be tested for replication, we follow the original findings: "planned contrasts revealed that the confirmatory information processing tendencies of participants with reduced self-regulation resources [...] were stronger than those of nondepleted [...] and ego threatened participants [...]" (fischer et al., 2008, p. 387). this translates to hrf: µlow self-regulation,r > (µhigh self-regulation,r, µego-threatened,r) (workflow step 1a). we want to reject replication when all means in the population are equal, that is, ha: µlow self-regulation,r = µhigh self-regulation,r = µego-threatened,r (workflow step 1b). we simulate the original data based on the means, standard deviations, and sample sizes reported in fischer et al. (2008) (workflow step 1c). as the replication study was already conducted by galliani (2015) (see table 4 for results), we do not calculate the required sample size to test replication (workflow step 2), and proceed to calculate the prior predictive p-value and the power of the replication test

table 4
descriptive statistics for confirmatory information processing from the original study: fischer et al.
(2008), and the new study: galliani (2015)

            low self-regulation    high self-regulation   ego-threatened
study       n    m (sd)            n    m (sd)            n    m (sd)
original    28*  0.36 (1.08)       28*  -0.19 (0.53)      28*  -0.18 (0.81)
new         48   -0.07 (0.45)      47   -0.05 (0.47)      45   0.13 (0.64)

* only the total sample size of 85 was provided in fischer et al. (2008).

table 5
z-scores of participants' mean estimates from the original study: janiszewski and uy (2008), and the new study: chandler (2015)

            low motivation to adjust               high motivation to adjust
            precise anchor     rounded anchor      precise anchor     rounded anchor
study       n    m (sd)        n    m (sd)         n    m (sd)        n    m (sd)
original    14   -0.76 (0.17)  15   -0.23 (0.48)   15   -0.04 (0.28)  15   0.98 (0.41)
new         30   -0.35 (0.23)  30   -0.18 (0.37)   30   0.20 (0.34)   30   0.35 (0.44)

(workflow step 3). the resulting prior predictive p-value was .003 with γ = .66, indicating that we reject replication despite limited power. the ordering in the new data by galliani (2015) results in an extreme f̄ score compared to the predicted data. figure 4 illustrates this conclusion: over 90% of the predicted datasets score perfectly in line with hrf, but the new study by galliani (2015) deviates from hrf and scores in the extreme 0.3% of the predicted data. the replication of the original study conclusions is thus rejected. the second study is janiszewski and uy (2008), who studied numerical judgements with five experiments. more specifically, they studied the impact of the precision of an anchor, and of the motivation to adjust from the anchor, on judgement bias. the group means, standard deviations, and sample sizes of experiment 4a in the original study by janiszewski and uy (2008) and the replication study by chandler (2015) are provided in table 5. based on these results, janiszewski and uy (2008) draw two conclusions. "first, a precise anchor results in less adjustment than a rounded anchor" (p. 126).
for experiment 4a, which was replicated by chandler (2015), this conclusion translates to hrf: (µlow motivation,round,r > µlow motivation,precise,r) & (µhigh motivation,round,r > µhigh motivation,precise,r) (workflow step 1a). we want to reject replication when all means in the population are equal, that is, ha: µlow motivation,round,r = µlow motivation,precise,r = µhigh motivation,round,r = µhigh motivation,precise,r (workflow step 1b). we simulate the original data based on the means, standard deviations, and sample sizes reported in janiszewski and uy (2008) (workflow step 1c). as the replication study was already conducted by chandler (2015), we do not calculate the required sample size to test replication (workflow step 2), and proceed to calculate the prior predictive p-value and the power of the replication test (workflow step 3). the resulting prior predictive p-value is 1.00. the data obtained by chandler (2015) were perfectly in line with the hrf describing the effect as observed by janiszewski and uy (2008). therefore, we do not have further concerns about the obtained power. hence, we conclude that the results of janiszewski and uy (2008) with respect to hrf: (µlow motivation,round,r > µlow motivation,precise,r) & (µhigh motivation,round,r > µhigh motivation,precise,r) are replicated by chandler (2015). the other conclusion that janiszewski and uy (2008) draw concerns the presence of an interaction effect of adjustment motivation and anchor rounding: "the difference in the amount of adjustment between the rounded- and precise-anchor conditions increased as the motivation to adjust went from low [...] to high" (p. 125). the results and conclusions of janiszewski and uy with respect to experiment 4a translate to hrf: (µlow motivation,round,r > µlow motivation,precise,r) & (µhigh motivation,round,r > µhigh motivation,precise,r) & (µlow motivation,round,r − µlow motivation,precise,r) < (µhigh motivation,round,r − µhigh motivation,precise,r).
the prior predictive p-value related to this hrf is .014 with γ = .87. thus, we reject replication of the interaction effect.

figure 4. the prior predictive p-value for the replication of fischer et al. (2008) by galliani (2015). the histogram bars represent f̄ for the predicted data. the thick line on the left represents f̄ for the predicted data that are exactly 0 (i.e., over 90% of the total), whereas the red line represents f̄ for galliani (2015).

discussion & conclusion

the goal of the current paper was to introduce the prior predictive check as a means to test replication of anova features. with the prior predictive check, researchers can find an answer to the question: "does the new study fail to replicate relevant features of the original study?" identifying a non-replication may make us wonder about the representativeness of the original study, the new study, and the comparability of both studies. or, as stated by simonsohn (2015, p. 9): "statistical techniques help us identify situations in which something other than chance has occurred. human judgment, ingenuity, and expertise are needed to know what has occurred instead." in the current paper, we discussed the prior predictive p-value for the anova setting and explained how to test relevant features of the form Rµr > 0. technically, however, the relevant features evaluated by the anovareplication r-package can also be of the form Rµr > r and Sµr = s, where r and s are vectors of length k containing the constants in hrf, and S is a k × j restriction matrix like R. accordingly, minimum (effect size) differences between means can be evaluated, and means can be constrained equal to specific values.
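the constraint notation can be made concrete with small restriction matrices. the following python illustration is ours (the group ordering for the janiszewski and uy design is our assumption, following the column order of table 5: low/precise, low/round, high/precise, high/round):

```python
import numpy as np

# R for hrf: mu1 < mu2 < mu3, written as R mu > 0 (one row per inequality)
R_order = np.array([[-1, 1, 0],
                    [0, -1, 1]])

# the janiszewski & uy interaction hrf adds a difference-of-differences row
R_interaction = np.array([
    [-1,  1,  0,  0],   # mu_low,round  > mu_low,precise
    [ 0,  0, -1,  1],   # mu_high,round > mu_high,precise
    [ 1, -1, -1,  1],   # (mu_high,round - mu_high,precise) > (mu_low,round - mu_low,precise)
])

def satisfies(R, mu):
    """do the means satisfy every inequality constraint in R mu > 0?"""
    return bool(np.all(R @ np.asarray(mu) > 0))

print(satisfies(R_order, [-0.49, 0.0, 0.49]))                  # True: ordered means
print(satisfies(R_interaction, [-0.76, -0.23, -0.04, 0.98]))   # True: original study means
```

a discrepancy statistic such as f̄ is 0 exactly when the sample means satisfy every row of the restriction matrix.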
even though constraints of these forms can be evaluated with the r-package and in the online application, they are not emphasized in the current paper because they will less often directly relate to the findings of an original study. the prior predictive p-value is generalizable to statistical models other than the anova as well. that is, for any model a predictive distribution can be obtained, constrained hypotheses can be constructed, and a test statistic evaluating the constraints can be calculated. the test as currently provided can already be used for the repeated measures anova by means of contrast weights (see, for example, furr and rosenthal, 2003). with contrast weights, a score for each participant can be calculated indicating to what degree the participant follows the expected pattern. subsequently, the replication of relevant features of these contrast scores over groups can be tested. a pre-print introduction to testing replication with the prior predictive p-value for structural equation models has been published at https://psyarxiv.com/uvh5s. in the current paper, we introduced the prior predictive p-value, a new tool to quantify replication failure or success, to the meta-scientific toolbox. with the prior predictive p-value we test whether the new study significantly deviates from our expectations based on the original study. other methods to evaluate replication research questions are included in table 1 and demonstrated in zondervan-zwijnenburg and rijshouwer (2020). two features of the prior predictive p-value to test replication stand out. first, the prior predictive p-value makes use of a predictive distribution given the original study results. the new study results are compared to the predicted data. a bayes factor, on the other hand, weighs the evidence for two competing hypotheses in the new study as it actually occurred, but does not take study variation into account.
second, to compare the new study with the predicted data, we consider relevant features of the original study. while most other methods evaluate the replication of a simple effect size, relevant features can be any constraint or set of constraints of the form Rµr > 0, which seamlessly connects to the research objective of most anova studies. with the anovareplication r-package, which includes a vignette as a tutorial, and the interactive application (see osf.io/6h8x3), we provide researchers with an easy-to-use test for replication of anova features. the availability of the prior predictive p-value to test replications can further promote the trend to conduct more replication studies in the field of psychology.

author contact

correspondence concerning this article should be addressed to mariëlle zondervan-zwijnenburg, department of methods and statistics, utrecht university, padualaan 14, 3584ch utrecht. e-mail: m.a.j.zwijnenburg@uu.nl.

conflict of interest and funding

author contributions

mz and hh were involved in the initial research design. mz drafted and revised the article in collaboration with hh. mz developed the interactive application, conducted the simulation studies, and conducted the analyses. rs provided additional feedback and evaluated the interactive application. all authors approved the final manuscript. the first author (mz) was the main author and the last author (hh) was the main supervisor on this project.

acknowledgements

we would like to thank meta-psychology editor dr. felix schönbrodt, and reviewers dr. matt williams and dr. zoltan dienes for their helpful feedback on this manuscript.
the first and third author are supported by the consortium on individual development (cid), which is funded through the gravitation program of the dutch ministry of education, culture, and science and the netherlands organization for scientific research (nwo grant number 024.001.003). the second author is supported by a vidi grant from the netherlands organization for scientific research (nwo grant number 452.14.006).

open science practices

this article earned the open materials badge for making the materials openly available. this article is a methods article and did not include any new data, and it was not pre-registered. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

anderson, s. f., & maxwell, s. e. (2016). there's more than one way to conduct a replication study: beyond statistical significance. psychological methods, 21(1), 1–12. https://doi.org/10.1037/met0000051
asendorpf, j. b., conner, m., de fruyt, f., de houwer, j., denissen, j. j., fiedler, k., ..., & wicherts, j. m. (2013). recommendations for increasing replicability in psychology. european journal of personality, 27(2), 108–119. https://doi.org/10.1002/per.1919
box, g. e. (1980). sampling and bayes' inference in scientific modelling and robustness. journal of the royal statistical society. series a (general), 143(4), 383–430. https://doi.org/10.2307/2982063
brandt, m. j., ijzerman, h., dijksterhuis, a., farach, f. j., geller, j., giner-sorolla, r., grange, j. a., ..., & van't veer, a. (2014). the replication recipe: what makes for a convincing replication? journal of experimental social psychology, 50, 217–224.
https://doi.org/10.1016/j.jesp.2013.10.005
chandler, j. (2015). replication of janiszewski & uy (2008, ps, study 4b). open science framework. osf.io/aaudl
cohen, j. (1988). statistical power analysis for the behavioral sciences (2nd ed.). hillsdale, nj: lawrence erlbaum associates. https://doi.org/10.4324/9780203771587
cumming, g. (2008). replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. perspectives on psychological science, 3(4), 286–300. https://doi.org/10.1111/j.1745-6924.2008.00079.x
cumming, g. (2014). the new statistics: why and how. psychological science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
earp, b. d., & trafimow, d. (2015). replication, falsification, and the crisis of confidence in social psychology. frontiers in psychology, 6, 621. https://doi.org/10.3389/fpsyg.2015.00621
ebersole, c. r., atherton, o. e., belanger, a. l., skulborstad, h. m., allen, j. m., banks, j. b., baranski, e., bernstein, m. j., bonfiglio, d. b., boucher, l., brown, e. r., budiman, n. i., cairo, a. h., capaldi, c. a., chartier, c. r., chung, j. m., cicero, d. c., coleman, j. a., conway, j. g., . . . nosek, b. a. (2016). many labs 3: evaluating participant pool quality across the academic semester via replication. journal of experimental social psychology, 67, 68–82. https://doi.org/10.1016/j.jesp.2015.10.012
errington, t., tan, f., lomax, j., perfito, n., iorns, e., gunn, w., & lehman, c. (2019). reproducibility project: cancer biology. osf.io/e81xl
fischer, p., greitemeyer, t., & frey, d. (2008).
self-regulation and selective exposure: the impact of depleted self-regulation resources on confirmatory information processing. journal of personality and social psychology, 94(3), 382. https://doi.org/10.1037/0022-3514.94.3.382
furr, r. m., & rosenthal, r. (2003). repeated-measures contrasts for "multiple-pattern" hypotheses. psychological methods, 8(3), 275–293. https://doi.org/10.1037/1082-989x.8.3.275
galliani, e. (2015). replication report of fischer, greitemeyer, and frey (2008, jpsp, study 2). https://osf.io/j8bpa
gelman, a., meng, x.-l., & stern, h. (1996). posterior predictive assessment of model fitness via realized discrepancies. statistica sinica, 6(4), 733–760.
gelman, a., & stern, h. (2006). the difference between "significant" and "not significant" is not itself statistically significant. the american statistician, 60(4), 328–331. https://doi.org/10.1198/000313006x152649
harms, c. (2018). a bayes factor for replications of anova results. the american statistician. https://doi.org/10.1080/00031305.2018.1518787
hedges, l. v., & olkin, i. (1980). vote-counting methods in research synthesis. psychological bulletin, 88(2), 359. https://doi.org/10.1037/0033-2909.88.2.359
hoijtink, h. (2012). informative hypotheses: theory and practice for behavioral and social scientists. crc press. https://doi.org/10.1201/b11158
hoijtink, h., mulder, j., van lissa, c., & gu, x. (2019). a tutorial on testing hypotheses using the bayes factor. psychological methods, 24(5), 539–556. https://doi.org/10.1037/met0000201
janiszewski, c., & uy, d. (2008). precision of the anchor influences the amount of adjustment. psychological science, 19(2), 121–127. https://doi.org/10.1111/j.1467-9280.2008.02057.x
klein, r. a., ratliff, k. a., vianello, m., adams, r. b., bahnık, š., bernstein, m. j., bocian, k., brandt, m. j., brooks, b., brumbaugh, c. c., cemalcilar, z., chandler, j., cheong, w., davis, w. e., devos, t., eisner, m., frankowska, n., furrow, d., galliani, e. m., . . .
nosek, b. a. (2014). investigating variation in replicability: a 'many labs' replication project. social psychology, 45(3), 142–152. https://doi.org/10.1027/1864-9335/a000178
klein, r. a., vianello, m., hasselman, f., adams, b. g., adams, r. b., alper, s., aveyard, m., axt, j. r., babalola, m. t., bahnık, š., batra, r., berkics, m., bernstein, m. j., berry, d. r., bialobrzeska, o., binan, e. d., bocian, k., brandt, m. j., busching, r., . . . nosek, b. a. (2018). many labs 2: investigating variation in replicability across samples and settings. advances in methods and practices in psychological science, 1(4), 443–490. https://doi.org/10.1177/2515245918810225
ledgerwood, a. (2014). introduction to the special section on advancing our methods and practices. perspectives on psychological science, 9(3), 275–277. https://doi.org/10.1177/1745691613513470
ly, a., etz, a., marsman, m., & wagenmakers, e.-j. (2018). replication bayes factors from evidence updating. behavior research methods, 1–
11. https://doi.org/10.3758/s13428-018-1092-x
marsman, m., schönbrodt, f. d., morey, r. d., yao, y., gelman, a., & wagenmakers, e.-j. (2017). a bayesian bird's eye view of 'replications of important results in social psychology'. royal society open science, 4(1), 160426. https://doi.org/10.1098/rsos.160426
meng, x.-l. (1994). posterior predictive p-values. the annals of statistics, 22(3), 1142–1160. https://doi.org/10.1214/aos/1176325622
morey, r. d., & lakens, d. (2019). why most of psychology is statistically unfalsifiable. https://doi.org/10.5281/zenodo.838685
nieuwenhuis, s., forstmann, b. u., & wagenmakers, e.-j. (2011). erroneous analyses of interactions in neuroscience: a problem of significance. nature neuroscience, 14(9), 1105–1107. https://doi.org/10.1038/nn.2886
open science collaboration. (2012). an open, large-scale, collaborative effort to estimate the reproducibility of psychological science. perspectives on psychological science, 7(6), 657–660. https://doi.org/10.1177/1745691612462588
open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251). https://doi.org/10.1126/science.aac4716
pashler, h., & wagenmakers, e.-j. (2012). editors' introduction to the special section on replicability in psychological science: a crisis of confidence? perspectives on psychological science, 7(6), 528–530. https://doi.org/10.1177/1745691612465253
patil, p., peng, r. d., & leek, j. t. (2016). what should researchers expect when they replicate studies?
a statistical view of replicability in psychological science. perspectives on psychological science, 11(4), 539–544. https://doi.org/10.1177/1745691616646366
schmidt, s. (2009). shall we really do it again? the powerful concept of replication is neglected in the social sciences. review of general psychology, 13(2), 90–100. https://doi.org/10.1037/a0015108
silvapulle, m. j., & sen, p. k. (2005). constrained statistical inference: order, inequality, and shape constraints (vol. 912). john wiley & sons. https://doi.org/10.1002/9781118165614
simonsohn, u. (2015). small telescopes: detectability and the evaluation of replication results. psychological science, 26(5), 559–569. https://doi.org/10.1177/0956797614567341
van aert, r. c., & van assen, m. a. (2017). examining reproducibility in psychology: a hybrid method for combining a statistically significant original study and a replication. behavior research methods, 1–25. https://doi.org/10.3758/s13428-017-0967-6
vanbrabant, l., van de schoot, r., & rosseel, y. (2015). constrained statistical inference: sample-size tables for anova and regression. frontiers in psychology, 5, 1565. https://doi.org/10.3389/fpsyg.2014.01565
verhagen, j., & wagenmakers, e.-j. (2014). bayesian tests to quantify the result of a replication attempt. journal of experimental psychology: general, 143(4), 1457–1475. https://doi.org/10.1037/a0036731
zondervan-zwijnenburg, m. a. j. (2018). anovareplication: test anova replications by means of the prior predictive p-value [r package version 1.1.3]. https://cran.r-project.org/package=anovareplication
zondervan-zwijnenburg, m. a. j., & rijshouwer, d. (2020). testing replication with small samples: applications to anova. in r. van de schoot & m. miocevic (eds.), small sample size solutions: a guide for applied researchers and practitioners. routledge.
meta-psychology, 2019, vol 3, mp.2018.935 https://doi.org/10.15626/mp.2018.935 article type: commentary published under the cc-by4.0 license open data: not relevant open materials: not relevant open and reproducible analysis: yes open reviews and editorial process: yes preregistration: n/a edited by: rickard carlsson reviewed by: gustav nilsonne, guillaume rousselet analysis reproduced by: tobias mühlmeister all supplementary files can be accessed at osf: https://doi.org/10.17605/osf.io/ayuxz an extended commentary on post-publication peer review in organizational neuroscience guy a. prochilo* university of melbourne stefan bode university of melbourne winnifred r. louis university of queensland hannes zacher leipzig university pascal molenberghs institute for social neuroscience abstract while considerable progress has been made in organizational neuroscience over the past decade, we argue that critical evaluations of published empirical works are not being conducted carefully and consistently. in this extended commentary we take as an example waldman and colleagues (2017): a major review work that evaluates the state-of-the-art of organizational neuroscience. in what should be an evaluation of the field’s empirical work, the authors uncritically summarize a series of studies that: (1) provide insufficient transparency to be clearly understood, evaluated, or replicated, and/or (2) misuse inferential tests in ways that lead to misleading conclusions, among other concerns. these concerns have been ignored across multiple major reviews and citing articles. we therefore provide a post-publication review (in two parts) of one-third of all studies evaluated in waldman and colleagues’ major review work.
in part i, we systematically evaluate the field’s two seminal works with respect to their methods, analytic strategy, results, and interpretation of findings. and in part ii, we provide focused reviews of secondary works that each center on a specific concern we suggest should be a point of discussion as the field moves forward. in doing so, we identify a series of practices that we believe will improve the state of the literature. this includes: (1) evaluating the transparency and completeness of an empirical article before accepting its claims, (2) becoming familiar with common misuses or misconceptions of statistical testing, and (3) interpreting results with an explicit reference to effect size magnitude, precision, and accuracy, among other recommendations. we suggest that adopting these practices will motivate the development of a more replicable, reliable, and trustworthy field of organizational neuroscience moving forward. keywords: organizational neuroscience, confidence intervals, self-corrective science, effect sizes, null hypothesis significance testing (nhst), parameter estimation, pearson correlation, post-publication peer-review, reporting standards. note: correspondence concerning this article should be addressed to guy a. prochilo, melbourne school of psychological sciences, university of melbourne. e-mail: guy.prochilo@gmail.com

introduction

organizational neuroscience is a domain of research that draws heavily on social and cognitive neuroscience traditions, but which examines specifically how neuroscience can inform our understanding of people and organizing processes in the context of work (waldman, ward, & becker, 2017). marked progress has been made at the theoretical level in the decade since its inception.
for example, this has included a maturing discussion on the ethics, reliability, and interpretation of neuroscience data and how this applies to organizing behavior and the workplace (healey & hodgkinson, 2014; lindebaum, 2013, 2016; niven & boorman, 2016). however, the same level of progress has not been made with respect to careful and consistent critical evaluation of empirical works beyond the point of initial publication. the standards within psychological science (including organizational behavior research) are changing to reflect concerns over the transparency of reporting practices, appropriate use of inferential statistics, and the replicability of published findings (cumming, 2008, 2014; cumming & maillardet, 2006; nichols et al., 2016; nichols et al., 2017; simmons, nelson, & simonsohn, 2011; wicherts et al., 2016). in this extended commentary, we argue that scholars of organizational neuroscience are not considering these implications often enough, especially in major reviews of the literature. this commentary takes as an example the major review piece by waldman and colleagues (2017) published in annual review of organizational psychology and organizational behavior. in this article, the authors critically evaluate the state-of-the-art of organizational neuroscience, including its methods and findings, and provide recommendations for investing in neuroscience-informed practices in the workplace. however, in what should be an evaluation of the field’s empirical work, the authors uncritically summarize a series of studies that: (1) provide insufficient transparency to be clearly understood, evaluated, or replicated, and/or (2) misuse inferential tests in ways that lead to misleading conclusions, among other concerns.
it is customary for scientists and practitioners to cite information from the most recent review pieces, meaning that such reviews (especially annual reviews) and the references cited therein can exert a disproportionate influence on the future of a field of study. omission of satisfactory post-publication review in the above work is therefore unfortunate, and may motivate poor decisions that waste scarce time, effort, and financial resources for both researchers and organizational practitioners alike. this commentary will not be a systematic review of all studies conducted in organizational neuroscience. instead, to bring explicit attention to the concerns we raise above, we provide a focused post-publication review of five of the 15 empirical studies critically evaluated in waldman and colleagues (2017). our commentary therefore dissects a full one-third of the studies that were deemed methodologically and statistically sound as part of an evidence base for guiding organizational research and practice (see table 1 for a list of these studies and justification for their selection). our motivation for this format, in contrast to a general pooling of findings via systematic review, is threefold. first, at least two of these studies represent seminal works that are among the most influential and highly cited in the field (see figure 1 for a citation distribution). second, these studies present with critical methodological or interpretational concerns that have been overlooked in multiple major reviews of the literature. and third, on the basis of these concerns, it is not entirely clear that these studies are being evaluated beyond what is reported in their abstracts by those who cite them. these studies deserve close and detailed scrutiny, and we provide this here. the primary aim of our commentary is to push the field in a positive direction by encouraging a more critical review of research findings in organizational neuroscience.
in doing so, we seek to promote the development of a more replicable, reliable, and trustworthy literature moving forward.1 first, we contextualize our publication evaluation criteria by discussing what has come to be known as the replication crisis in psychological science. while many solutions to this crisis have been offered, we focus on two easily implementable criteria that are likely to have a broad impact: (1) complete and transparent reporting of empirical findings, and (2) statistical inference that considers the magnitude and precision of research findings beyond mere statistical significance. second, we conduct a post-publication review of selected empirical works in two parts. in part i, we comprehensively and systematically evaluate the field’s two seminal works with respect to our evaluation criteria. and in part ii, we provide focused reviews of secondary works that each center on a single specific methodological concern that we feel must be a point of discussion as the field moves forward. these concerns are: (1) fmri statistical analyses that preclude inferences from sample to population, (2) unsubstantiated claims of convergent validity between neuroscience and psychometric measures, and (3) the impact of researcher degrees of freedom on the inevitability of reporting statistically significant results.

1 note: the concerns we discuss in this commentary are in no way unique to organizational neuroscience. we single out this field, not because it represents a special case, but because we have contributed work to this field (molenberghs, prochilo, steffens, zacher, & haslam, 2017).

table 1. publications selected for post-publication peer review

seminal works

1. peterson et al. (2008)
• this study represents one of the earliest works to apply neuroscience methods to organizing phenomena.
it has also been described as the first study to do so following the first theoretical writings in organizational neuroscience (see ward, volk, & becker, 2015). it is one of the most highly cited publications of all those evaluated in waldman and colleagues’ (2017) review (n = 98) and is discussed in most reviews of the literature since its publication (e.g., butler, o’broin, lee, & senior, 2015; waldman & balthazard, 2015; waldman, balthazard, & peterson, 2011b; ward, volk, & becker, 2015). on the basis of precedence, citations, and inclusion in multiple reviews, this study would be considered seminal.

2. waldman et al. (2011a)
• this study is the most highly cited publication of all those evaluated in waldman and colleagues’ (2017) review (n = 177) and is included in most reviews of the literature (e.g., ashkanasy, becker, & waldman, 2014; becker & menges, 2013; becker, volk, & ward, 2015; waldman & balthazard, 2015; waldman, balthazard, & peterson, 2011b; waldman, wang, & fenters, 2016; ward, volk, & becker, 2015). it is also cited in several systematic reviews (e.g., butler, o’broin, lee, & senior, 2015; nofal, nicolaou, symeonidou, & shane, 2017). it is the seminal work of the field.

secondary works

1. boyatzis et al. (2012)
• this study represents one of the earliest fmri studies of the field, and is a highly cited work (n = 99). it has also been used as part of the evidence base for guiding research and organizational practice decisions in extended theory pieces (e.g., coaching; boyatzis & jack, 2018). this study raises an important methodological concern that we suggest must be a point of discussion among scholars: fmri analyses that preclude inferences from sample to population.

2. waldman et al.
(2013a, 2013b, 2015)
• this publication represents a single empirical study that has been reported through conference proceedings (waldman et al., 2013a), as an unpublished pre-print (waldman et al., 2013b), and within a textbook chapter that discusses it at length (waldman, stikic, wang, korszen, & berka, 2015). cumulatively, these publications have received 32 citations. we include this study for evaluation for several reasons. first, the reporting of this study across multiple venues makes it challenging for scholars to clearly understand and critically evaluate the work. second, this work is discussed in several major reviews of the literature, and within the textbook that adopts the field’s name: organizational neuroscience (waldman & balthazard, 2015). this gives the impression that the work is of high quality. and finally, this study raises an important methodological concern that we suggest must be a point of discussion among scholars: unsubstantiated claims of convergent validity between neuroscience and psychometric measures.

3. kim and james (2015)
• this study represents one of the most recent fmri studies conducted in the field and has received 6 citations. while this is low with respect to other works, it is discussed at length in waldman and colleagues’ (2017) major review, and is represented as high-quality work. it also raises a specific methodological concern that we suggest must be a point of discussion among scholars: the impact of researcher degrees of freedom on the inevitability of reporting statistically significant results.

note: citations were obtained from google scholar on jan 23, 2019. the above studies represent one-third of all studies critically evaluated in waldman and colleagues’ (2017) review of the state-of-the-art of organizational neuroscience.

figure 1.
dot plot with rotated probability distribution on each side of the data showing the number of google scholar citations for all 15 empirical studies evaluated in waldman and colleagues (2017). citations range from 6 to 177 (m = 74.7, sd = 49.62). the publications under review in this commentary are represented by filled dots and an associated label: (a) = kim and james (2015); citations = 6; (b) = waldman et al. (2013a, 2015); cumulative citations = 32; (c) = peterson et al. (2008); citations = 98; (d) = boyatzis et al. (2012); citations = 99; (e) = waldman et al. (2011a); citations = 177. the vertical line marks the mean. with respect to citation impact, these studies capture a cross-section of all studies that were subject to review by waldman and colleagues (2017). data were acquired from google scholar on jan 23, 2019.

publication evaluation criteria

criteria i: completeness and transparency of reporting practices

science has been described as a cumulative and self-corrective process (merton, 1973; popper, 1962). published empirical findings are not taken as unquestionable fact; rather, all findings are subject to verification through systematic critical evaluation and replication. these efforts may provide support that a finding is credible, or show that it is wrong and that the scientific record should be corrected. however, there are growing concerns that a large number of published empirical findings in psychological science are false or at least misleading (benjamin et al., 2018; button et al., 2013; cumming, 2014; ioannidis, 2005, 2012; leslie, george, & drazen, 2012; munafò et al., 2017; nelson, simmons, & simonsohn, 2018; simmons et al., 2011; wicherts et al., 2016). empirical findings are failing to replicate (camerer et al., 2018; klein et al., 2018; open science collaboration, 2015), and these failed replications appear to be invariant to the context and culture in which the replication is attempted (e.g., klein et al., 2018).
this alarming problem has become known as the replication crisis, and there is now open discussion that the self-correcting ideal is not performing as well as it should across different areas of science (ioannidis, 2012). there are many impediments to self-correction in psychological science, including publication bias, selective reporting of results, and fabrication of data, among others. however, one of the most basic impediments for evaluating published findings as part of a self-corrective science is that researchers do not consistently provide a complete and transparent report of how exactly their research has been conducted and analyzed (appelbaum et al., 2018). in this commentary we will argue that this has been true of at least some organizational neuroscience work, and that it is particularly prevalent in the seminal works that appear in multiple reviews of the literature. complete and transparent reporting is key to systematically communicating what was done in any empirical study. there are multiple systematized reporting standards in psychological science that target various subdisciplines and types of experimental design. for example, the consolidated standards of reporting trials (consort) is a 25-item checklist that standardizes the way in which authors prepare reports of randomized controlled trial findings (www.consort-statement.org). the ohbm committee on best practice in data analysis and sharing (cobidas) guidelines describe how to plan, execute, report, and share neuroimaging research in a transparent fashion (nichols et al., 2017; nichols et al., 2016). and the apa working group on journal article reporting standards (jars-quant) covers reporting of all forms of quantitative empirical work, regardless of subdiscipline (appelbaum et al., 2018). the jars-quant guidelines have been designed with the intent of being a gold standard for reporting quantitative research across all of the psychological sciences. this includes research incorporating neuroscience methods such as organizational neuroscience.

table 2. outline of publication evaluation criteria

method
• data collection: describe methods used to collect data.
• quality of measurements: describe the quality of the measurements (e.g., the training of data collectors).
• psychometric properties of instruments: describe the reliability and validity of the measurement instruments (e.g., internal consistency for composite scales; inter-rater reliability for subjectively scored ratings; etc.).
• data diagnostics: describe the methods used for processing the data (e.g., defining and dealing with outliers; determining if test assumptions are satisfied; and the use of data transformations, if required).

analytic strategy
• inferential statistics: describe the analytic strategy for how inferential statistics were used to test each hypothesis.

results
• statistics and data analysis: report descriptive statistics that are relevant to interpreting data (e.g., measures of central tendency and dispersion); report appropriate inferential statistics obtained from statistical tests of each hypothesis, including exact p values if null hypothesis significance testing (nhst) has been used; report effect size estimates and confidence intervals on estimates, where possible; report whether statistical assumptions for each test were satisfied.

interpretation of findings
• discussion: provide an interpretation of results that is substantiated by the data analysis strategy and other aspects of the study (e.g., adequacy of sample size; sampling variability; generalizability of results beyond the sample; etc.); consider specifically effect magnitude, accuracy, and precision when interpreting results.

note: these criteria were adapted from the jars-quant guidelines with a specific focus on methods, results, and interpretation of findings. the jars-quant guidelines can be found in full in appelbaum et al. (2018).
therefore, we have adapted a subset of these guidelines to systematically guide our post-publication review of seminal works (see table 2). these guidelines pertain to criteria that guide the reporting of methods, results, and interpretation of findings. these are elements of research work that are essential for enabling empirical claims to be clearly understood and evaluated by readers, and to allow findings to be replicated with reasonable accuracy.

criteria ii: appropriateness of statistical inferences

the cause of the replication crisis is multifaceted, and inadequate reporting practices are just a single factor among many contributing to the failure of self-correction in psychological science. a growing number of scholars are also raising concerns that a key theme in this crisis is an overreliance on the null hypothesis significance testing (nhst) approach when conducting research and interpreting results (e.g., calin-jageman & cumming, 2019; cumming, 2014; peters & crutzen, 2017). that is, researchers have traditionally prioritized all-or-none decisions (i.e., a finding is either statistically significant or non-significant) to the exclusion of information that describes the magnitude and precision of a finding, or whether that finding is likely to replicate. for these reasons, findings that are highly variable, imprecise, or which have been selectively reported (or manipulated) based on all-or-none decision criteria have flourished. and these findings are not replicating. we believe similar concerns regarding nhst are influencing the quality of organizational neuroscience. the nhst approach has been well described elsewhere (see frick, 1996; nickerson, 2000). briefly, an effect (or effect size) describes a quantification of some difference or a relationship that is computed on sample data. as one example, this may include the magnitude and direction of a pearson correlation coefficient (r).
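to make the effect-size language concrete, pearson's r and an approximate 95% confidence interval for it can be computed in a few lines. the sketch below is purely illustrative: the data, function names, and the use of the fisher z transformation are our own choices for demonstration, not taken from any of the reviewed studies.

```python
import math

def pearson_r(xs, ys):
    # pearson correlation coefficient computed on sample data
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_ci_95(r, n):
    # approximate 95% ci for r via the fisher z transformation
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    return math.tanh(z - 1.96 * se), math.tanh(z + 1.96 * se)

# hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 2.9, 4.2, 3.8, 5.1, 5.9, 6.8, 7.5]
r = pearson_r(xs, ys)
lo, hi = r_ci_95(r, len(xs))
print(round(r, 2), (round(lo, 2), round(hi, 2)))
```

reporting the interval alongside r conveys magnitude and precision in a single line, which is exactly the practice the parameter estimation literature presses for.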
in the nhst tradition, a researcher begins by stating a prediction regarding the direction of an effect that they believe to be true of the population from which they are sampling. this is then tested against a null hypothesis which specifies that the true population effect may actually be zero. this test yields a p value that quantifies the probability of obtaining a test statistic (e.g., t) of a given magnitude or greater when sampling from a population where the null hypothesis is true. in statistical terminology, rejecting the null hypothesis when it is in fact true is a type i error. in order to minimize such errors, a significance level called alpha (α) is used as the threshold for an all-or-none decision. if the obtained p value is less than a prespecified α level, we consider ourselves sufficiently confident to assert that an effect is statistically significant and different from zero. in psychological science this threshold by convention is .05, which entails that in the long run (i.e., after many replications of a study) we are willing to accept type i errors at most 5% of the time. one of the major criticisms of this approach is that it simply does not provide researchers with the full information they need to describe the relationship between an independent and dependent variable (calin-jageman & cumming, 2019; cumming, 2014; cohen, 1990). nhst and p values only provide evidence of whether an effect is statistically significant, and of the direction of an effect. scholars also cite concerns that nhst and its associated p values are too often misconstrued or misused by its practitioners, thereby leading to claims that are not substantiated by the data (e.g., gelman & stern, 2006; nickerson, 2000; nieuwenhuis, forstmann, & wagenmakers, 2011; mcshane, gal, gelman, robert, & tackett, 2019).
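the long-run reading of α can be checked by simulation: when the null hypothesis is true, a correctly calibrated test rejects at roughly the nominal 5% rate. the sketch below is illustrative only; it uses a one-sample z-test with known variance for simplicity, and none of it comes from the reviewed studies.

```python
import math
import random

def two_sided_p_from_z(z):
    # two-sided p value for a standard-normal test statistic
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
reps, alpha, n = 10_000, 0.05, 30
false_positives = 0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]  # null is true: mu = 0
    z = (sum(sample) / n) / (1 / math.sqrt(n))       # z-test with sigma known = 1
    false_positives += two_sided_p_from_z(z) < alpha
rate = false_positives / reps
print(rate)  # hovers near .05, the long-run rate alpha promises
```

note that this 5% guarantee holds only when the analysis is fixed in advance; the researcher degrees of freedom discussed later in this commentary inflate the realized error rate well beyond the nominal one.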
as an alternative (or adjunct) to nhst, proponents of what has been called parameter estimation (kelley & rausch, 2006; maxwell, kelley, & rausch, 2008; woodson, 1969) or the new statistics (calin-jageman & cumming, 2019; cumming, 2014) have argued that inference should focus on: (1) the magnitude of a finding through reporting of effect size, (2) the accuracy and precision of a finding through reporting of confidence intervals on an effect size, and (3) an explicit focus on aggregate evidence through meta-analysis of multiple studies. on an individual study basis, the parameter estimation approach yields an identical all-or-none decision to that provided by p values. however, the focus shifts from a dichotomous all-or-none decision to information regarding the magnitude of an effect, and its accuracy and precision as quantified by confidence intervals. accuracy refers to the long run probability that a confidence interval of a given length will contain the true population value. for example, a 95% confidence interval is an interval of values that, if a study were to be repeated many times with different samples from the same population and under the same conditions, the true population value would be included in this interval 95% of the time. it is therefore plausible (although, not certain) that any particular 95% confidence interval will contain the true population parameter. precision refers to a measure of the statistical variability of a parameter, and is quantified by the width of a confidence interval (or, alternatively, the half width of the confidence interval: the margin of error). for example, a narrow 95% confidence interval is said to have high precision in that there are a limited range of plausible values which the population parameter could take. conversely, a wide 95% confidence interval is not very precise because the population parameter can take on a very wide range of plausible values. 
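both properties can also be demonstrated by simulation: coverage stays near the nominal 95% regardless of sample size, while precision (interval width) improves as n grows. the sketch below is an illustrative toy, not code from any reviewed study; it uses a normal-approximation interval with z = 1.96, so a t-based critical value would make small-sample intervals slightly wider and coverage slightly closer to .95.

```python
import math
import random

def mean_ci_95(sample):
    # normal-approximation 95% ci for a mean (z = 1.96; illustrative)
    n = len(sample)
    m = sum(sample) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    half = 1.96 * sd / math.sqrt(n)  # margin of error = half-width
    return m - half, m + half

random.seed(2)
true_mu, reps = 0.5, 4000
results = {}
for n in (20, 200):
    covered, total_width = 0, 0.0
    for _ in range(reps):
        lo, hi = mean_ci_95([random.gauss(true_mu, 1) for _ in range(n)])
        covered += lo <= true_mu <= hi   # does the interval capture the true mean?
        total_width += hi - lo           # width indexes precision
    results[n] = (covered / reps, total_width / reps)
    print(n, results[n])
```

the larger sample does not make the interval more likely to contain the true value; it makes the interval narrower, shrinking the range of plausible population values.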
some scholars have advocated completely abandoning nhst and p values in favor of a parameter estimation approach to statistical inference (e.g., calin-jageman & cumming, 2019; cumming, 2014). we do not go so far. instead, in the style of abelson (1995), we believe that all statistics (including p values, confidence intervals, and bayesian statistics, among others) should be treated as aids to principled argument. however, to limit the scope of our commentary, our evaluations will have an explicit focus on effect size magnitude and, as an indication of accuracy and precision, the confidence intervals on these effects. in doing so, we will argue that nhst and p values have been misused across many organizational neuroscience works, and that reviewers of this literature too often accept statistical analyses and interpretations of data uncritically (see table 2 for our full evaluation criteria).

post-publication peer review

in the following sections we provide a concise overview of each study listed in table 1. in part i, we follow this with a systematic post-publication review of the methods, analytic strategy, results, and interpretation of findings of the field’s two seminal works. in part ii, our post-publication review is focused on (and restricted to) specific concerns in secondary works that we suggest must be a point of discussion as the field moves forward. a summary of recommendations for improving post-publication review based on this commentary is given in table 3.

part i: systematic post-publication review of seminal works

peterson et al. (2008). neuroscientific implications of psychological capital: are the brains of optimistic, hopeful, confident, and resilient leaders different? the purpose of peterson et al. (2008) was to examine the neural basis of psychological capital: a composite trait comprised of hope, resilience, self-esteem, and optimism, and which has been linked to effective leadership.
using a sample of 55 business and community leaders, participants were asked to engage in a ‘visioning task’, in which they were required to create a spoken vision for the future of their business or organization while eeg measures were recorded. as the authors describe, this visioning task was theorized to evoke an emotional response that is aligned with psychological capital. expert opinions on the affective behavior witnessed during the eeg task were combined with psychometric measures of psychological capital and leadership, and these measures were used to dichotomize participants as high or low on this trait. following this, differences in eeg activity were assessed between each group. the authors reported that analysis of their eeg data revealed that high psychological capital was correlated with greater activity in the left prefrontal cortex. this was interpreted as activity associated with greater happiness, as well as having successful interpretation, meaning construction, and sense-making skills. the authors further reported that low psychological capital was correlated with greater activity in the right frontal cortex and right amygdala. this was interpreted as activity associated with difficulty in displaying and interpreting emotions, as well as a greater likelihood to display negative affectivity or avoidance behaviors in social situations. a primary conclusion provided by the authors (which has been repeated in subsequent reviews) is that these findings are a demonstration of the importance of emotions in the study of psychological capital. the authors further suggest that future research should look more closely at the role of negative affect (e.g., fear) as a mechanism underlying low psychological capital. critical review. the critical evaluation of peterson et al. (2008) first requires qualification based on the venue in which it has been published.
organizational dynamics is a journal that publishes content primarily aimed at organizational practitioners (e.g., professional managers), and therefore restricts full and transparent reporting of methods, results, and analyses in favor of narrative readability for practitioner audiences (elsevier, 2018). for this reason, the journal encourages publication of supplementary material (which may include detailed methods and results), as well as sharing of data in data repositories that can be directly linked to the article itself. these latter standards may not have been in effect at the time of publication of this early work. in any case, peterson et al. (2008) does not report any data, or link to any external dataset or supplementary information that can be used to evaluate the content of what is reported. this is problematic because this study has been repeatedly and explicitly cited as an example of high-quality empirical work in almost every review of the literature since its publication (e.g., butler, o’broin, lee, & senior, 2015; waldman et al., 2017; ward, volk, & becker, 2015). because this study is so consistently raised to the status of a high-quality empirical study, peterson et al. (2008) must be evaluated according to the same standards as any other empirical publication. that is, with adequate post-publication review. methods. we first consider the psychometric measures used in this study. the authors claim to have assessed psychological capital using a self-report questionnaire, yet, no information is given regarding what psychological instrument was employed. furthermore, no information is given regarding the conditions under which this instrument was administered, the psychometric properties of the instrument, or how data acquired from this instrument were processed with respect to outliers or other test assumptions. these same concerns relate to the instrument which was used to assess appraisals of participant leadership characteristics. 
it is also unclear by what process scores on these instruments were combined to dichotomize participants into groups that were considered representative of high and low psychological capital. and following this process, it is also unclear by what method the dichotomization was performed. several possibilities include the mean, median, cut-points based on previous literature, or even selective testing of all quantiles and choosing those that yield the smallest p value in a subsequent inferential test, among others. a further complication is that, in addition to each psychometric measure, the dichotomization was also based on affective behavior demonstrated during a visioning task. no information is provided regarding how these ratings were determined, or whether this was implemented correctly. this includes no information on whether coders had the requisite expertise to perform this task, or to what extent dichotomization decisions were consistent across coders. and no information is provided on how this information was weighted alongside psychometric measures to perform the group dichotomization. we now turn our attention to the eeg measures. the authors provide no information regarding: (1) how eeg data were recorded (e.g., number of channels, electrode configuration, reference electrodes, and sampling rate), (2) how the data were pre-processed (i.e., how artefacts from eye movements, blinks, muscle artefacts, and sweating were identified and removed if necessary, or what filters were applied to remove frequencies of no interest), and (3) whether and how the experimenters controlled for typical artefacts resulting from bodily movements during the experiment. the latter is particularly important given that participants were instructed to talk while eeg recordings were obtained. movement during eeg recordings can create substantial artefacts (urigüen & garcia-zapirain, 2015).
altogether, it is extremely difficult for readers to evaluate whether any of the reported measures or methods of data processing were valid, reliable, or implemented correctly. the authors do not provide sufficient methodological detail to the standard that is required of scientific reporting. because of this, it is unlikely that peterson et al. (2008) could be replicated with any reasonable level of accuracy. analytic strategy. the authors describe that they compared the brain maps of participants who were categorized as high versus low psychological capital. however, they do not specify what analytic strategy was used to perform this test. it is therefore not possible for readers to evaluate whether this analytic strategy was appropriate or implemented correctly. a further concern relates to use of dichotomization itself. dichotomization of continuous data reduces the efficiency of experimental design, and can lead to biased conclusions that do not replicate across different samples (altman & royston, 2006; maccallum, zhang, preacher, & rucker, 2002; royston, altman, & sauerbrei, 2006; senn, 2005). results. the authors report no statistics. that is, the authors report no measures of central tendency, no measures of dispersion, no inferential statistics, no measures of effect size, no measures of accuracy or precision, and do not report on whether statistical assumptions were satisfied. it is therefore not possible for readers to evaluate any empirical claims on the basis of test statistics. interpretation of findings. peterson et al. (2008) report two main findings: (1) greater activity in the left prefrontal cortex of participants with high psychological capital was indicative of happiness, and (2) greater activity in the right prefrontal cortex and amygdala of participants with low psychological capital was indicative of negative affectivity. 
this interpretation, however, relies heavily on reverse inference and a highly modular interpretation of regional brain function (for discussion, see poldrack, 2006). the prefrontal cortex is an incredibly large and diverse region, and is involved in a variety of executive functions, including, but not limited to: top-down regulation of behavior, generating mental representations, goal-directed behavior, directing attention, reflecting on one’s intentions and the intentions of others, and regulation of the stress response (arnsten, raskind, taylor, & connor, 2015; blakemore & robbins, 2012; goldman-rakic, 1996; robbins, 1996). similarly, the amygdala is presently considered a complex and diverse structure that is involved in emotion regulation, motivation, and rapidly processing sensory information of both positive and negative valence (for review, see janak & tye, 2015). these regions support functions that lack the specificity to be decomposed into the interpretations provided by the authors, particularly with respect to the methods and analytic strategy that were employed. additionally, because eeg mainly detects signals originating from sources close to the scalp, activity of deep brain structures such as the amygdala cannot be detected without sophisticated source localization analysis (grech et al., 2008). it is not clear that this localization analysis was conducted, nor, had it been conducted, whether it would be possible to localize the signal to the amygdala specifically. and finally, effect size magnitude, accuracy, and precision are not given any consideration. altogether, the interpretation provided by peterson et al. (2008) is not substantiated by their methods and analytic strategy. subsequent claims regarding the importance of emotion and negative affect in psychological capital may therefore be misleading or entirely false. summary. peterson et al.
(2008) lacks an adequately transparent account of what was conducted in their empirical study for it to be clearly understood, evaluated, or replicated with reasonable accuracy. it is a sobering reflection on the field that this work has been cited 98 times in the google scholar database with little discussion of what are severe and extreme limitations. it is even more sobering that it has been referenced in almost every major review of the literature since its publication without consideration of these limitations. indeed, these limitations are so severe and manifest that it is difficult to believe that any reasonable scholar is reading this work closely before citing it. in the interest of a self-corrective and cumulative science, we recommend that the findings and conclusions by peterson et al. (2008) should not be repeated as part of the evidence base for organizational neuroscience in any future literature reviews. furthermore, given that this is a highly cited and discussed work in the current literature, we also call on the authors to amend their reports following the full jars-quant guidelines, and to publish their data and methods openly to allow for re-analysis. waldman et al. (2011a). leadership and neuroscience: can we revolutionize the way that inspirational leaders are identified and developed? waldman et al. (2011a) is an eeg study that investigated the neural basis of inspirational leadership, which is a form of leadership that is implicated in desirable organizational outcomes. in a sample of 50 business leaders, participants were asked to engage in a ‘visioning task’ while undergoing eeg assessment. vision statements articulated by each leader were coded on a continuum from non-socialized/personalized (rating of ‘1’: self-focused and self-aggrandizing) to socialized (rating of ‘3’: collective-oriented with a positive focus). visions higher in socialized content were considered to be demonstrative of inspirational leadership.
additionally, three to five followers of each leader (e.g., colleagues or employees) were asked to rate how inspirational their leader was based on two subscales of the multifactor leadership questionnaire (bass & avolio, 1990). the subsequent analysis was restricted to a measure of coherence in the high-frequency beta rhythm above the right frontal cortex. as the authors describe, this measure may have theoretical implications for emotion regulation, interpersonal communication, and social relationships. the obtained eeg and behavioral data were analyzed through a correlation analysis. right frontal coherence was positively correlated with socialized vision content coding (r = .36, p < .05), and follower perceptions of inspirational leadership were positively correlated with the socialized vision content coding (r = .39, p < .01). however, coherence was unrelated to follower perceptions of inspirational leadership (r = .26, p < .10). based on these data the authors make two main claims. first, they assert that these data indicate their neurophysiological measure of inspirational leadership was more strongly related to an explicit inspirational leadership behavior (i.e., socialized content in vision creation) than to an indirect measure made through follower perceptions of inspirational leadership. this difference in magnitude of correlations was considered indicative of a causal mechanistic chain: right frontal coherence forms the basis of socialized visionary communication, which in turn, builds follower perceptions of inspirational leadership. second, the authors claim that the correlation between coherence and socialized vision ratings represents a meaningful, neural distinction between leaders who espouse high versus low visionary content. specifically, they argue that this has implications for leadership development.
the particular example discussed by the authors relates to targeted training through eeg-based neurofeedback (i.e., use of an operant conditioning paradigm with real-time eeg feedback). here, they contend that neurofeedback may be used to enhance ideal brain states associated with effective leadership, such as right frontal coherence. critical review. as in peterson et al. (2008), the critical evaluation of waldman et al. (2011a) requires qualification based on the venue through which it has been published. the academy of management perspectives publishes empirical articles that are aimed at the non-specialist academic reader (academy of management, 2018). for this reason, full and transparent reporting of key aspects of empirical work are sometimes eschewed in favor of readability to the non-specialist audience. however, waldman et al. (2011a) is potentially the most influential work in all of organizational neuroscience (see table 1; figure 1). therefore, this publication deserves a comprehensive evaluation of its methods, results, analytic strategy, and claims. methods. we first consider the psychometric measures. the authors report that perceptions of inspirational leadership were obtained from three to six followers of each participant using the multifactor leadership questionnaire. the authors also describe that an overall measure of inspirational leadership was computed by summing these responses, a practice they describe as consistent with prior research. a measure of internal consistency is also provided, which demonstrated high scale reliability (i.e., α = .91). however, the authors do not provide a description of data diagnostics. for example, it is unknown how outliers in the data were identified, whether or not they were removed (and by what method), or how data were to be treated if they did not meet statistical test assumptions. the method by which participants were coded on the socialized vision rating scale is also unclear.
while the authors describe the criteria by which two expert coders categorized participants, no information is provided on how the coders were trained, or the extent to which there was inter-rater agreement between the coders. turning our attention to the eeg measures, the authors report that the 10/20 system was used, the number of electrodes, and the three electrode locations specific to their analysis. however, there is no information regarding the sampling rate, reference electrodes, or the general setup. this includes a lack of information on fixation and movement control, and if none were used, how the impact of potential artefacts on the eeg signals was accounted for. this issue is particularly important given that eeg was recorded while participants were engaged in an active task. as described previously, movement during eeg can cause substantial artefacts. there is also no information provided relating to pre-processing and the use of filters. altogether, waldman et al. (2011a) report a greater depth of information than peterson et al. (2008). however, peterson et al. (2008) sets a low standard. waldman et al. (2011a) requires further detail for an adequate evaluation of the validity and reliability of its reported methods. it may be possible to replicate this method, but the accuracy with which the replication could be conducted may be inadequate. analytic strategy. the authors describe that they focused on coherence between three electrodes in the right frontal region of the brain. however, the analytic strategy is not explicitly described and must be inferred from a summary of their findings. here, coherence data were extracted and subjected to a correlation analysis with ratings of inspirational leadership and socialized vision content coding. the authors do not report the specific correlation analysis that was conducted (i.e., pearson correlation, kendall rank correlation, or spearman correlation).
however, the authors use the notation for pearson correlation (r), and exact p values computed using fisher’s method (see interpretation) are consistent with those reported in their publication. it can therefore be inferred that the authors subjected their data to pearson correlation under the assumption of bivariate normality and no bivariate outliers, although the authors do not report whether these latter statistical assumptions were satisfied. altogether, the lack of transparency of the analytic strategy makes it difficult to evaluate whether it was appropriate or implemented correctly. results. the authors do not report descriptive statistics (i.e., central tendency or dispersion) required for interpretation of the psychometric data and socialized vision ratings. this makes it difficult to assess whether the distributions of these data were appropriate for the statistical tests that were performed, and their subsequent interpretation. for example, a restriction of range on either of these measures may influence the subsequent inferential test, the representativeness of the sample, or the generality of conclusions. the authors do report the mean coherence and its range. however, the range is a measure of dispersion that may not be typical of the dataset as a whole, and other measures would be more informative (e.g., standard deviation). measures of effect size magnitude are reported as pearson’s correlation coefficient, and are accompanied by inexact p values. reporting inexact p values makes it difficult to assess type i error probability (although these may be computed from the summary data; see interpretation). confidence intervals are also absent. finally, the authors do not report scatterplots of their data. this is problematic because summary correlation coefficients could have been generated by a variety of distributions of data, some of which may render the statistical test inappropriate.
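the point that many different data distributions can produce the same correlation coefficient can be demonstrated by constructing samples whose pearson correlation is exactly a target value. a python sketch of the principle (synthetic data; this is an illustration of the idea, not the r code used later in this review):

```python
import numpy as np

def sample_with_exact_r(n, r, seed=0):
    """generate one (x, y) sample whose sample pearson correlation is exactly r."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    e = rng.standard_normal(n)
    # standardize x, then remove the component of e collinear with x
    x_c = (x - x.mean()) / x.std()
    e_c = e - x_c * (e @ x_c) / (x_c @ x_c)
    e_c = e_c / e_c.std()
    # mix signal and orthogonal noise so the sample correlation equals r
    y = r * x_c + np.sqrt(1 - r ** 2) * e_c
    return x_c, y

x, y = sample_with_exact_r(50, 0.36)
# np.corrcoef(x, y)[0, 1] equals 0.36 to machine precision; different seeds
# (or different generating distributions) yield visually different
# scatterplots with the identical coefficient
```

varying the generating distribution of x or of the residuals produces the kinds of qualitatively distinct scatterplots discussed next, all sharing one r.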
for example, pearson’s correlation is not robust, meaning that a single extreme value can have a strong influence on the coefficient (pernet, wilcox, & rousselet, 2013). as a graphical demonstration of this, in figure 2 we provide examples of 8 distributions consistent with the correlation between frontal coherence and socialized vision content coding reported by the authors (i.e., r = .36, n = 50; plot_r function, cannonball package [v 0.0.0.9] in r; vanhove, 2018). altogether, the authors do not report their results in adequate detail to fully describe the data. interpretation. we consider the authors’ two main claims separately below. claim 1: a difference in correlation magnitudes. a critical conclusion in this publication relates to a comparison of effect size estimates, which in this case involves the pearson correlation coefficient (r). the authors suggest that the correlation between right frontal coherence and socialized vision content (r = .36, p < .05) is greater than the correlation between right frontal coherence and perceptions of inspirational leadership (r = .26, p < .10). based on this observation, the authors draw a theoretical conclusion relating to the mechanistic basis of inspirational leadership. this claim appears to be motivated on the basis of eyeballing a difference in the absolute magnitude between these correlations, as well as a difference in the all-or-none decision criteria based on the p values. however, the claim that an r of .36 is greater than .26 assumes that each r is equal to the correlations we would obtain if we were to sample the entire population of relevant business leaders and their followers, not just this sample. in statistical terminology, this is the assumption that each r is equal to the respective population effect size, rho (ρ). however, r is only the best estimate of ρ within a probability distribution of rs that lie below and above each r estimate (zou, 2007).
therefore, to determine if one r is greater than another, we must examine the distribution of probable values within which each ρ may plausibly fall. this can be assessed using a parameter estimation approach by computing a 100(1 − α)% confidence limit on each r. in psychological science, α by convention is .05, which necessitates a 95% confidence interval (95% ci). this interval can be obtained using fisher’s r to z transformation by first calculating the confidence limits for z(ρ) and then back-transforming the limits to obtain a confidence interval for ρ (zou, 2007).

figure 2. examples of 8 scatterplots consistent with r = .36 in a sample of n = 50 (panels: (1) normal x, normal residuals; (2) uniform x, normal residuals; (3) positively skewed x, normal residuals; (4) negatively skewed x, normal residuals; (5) normal x, positively skewed residuals; (6) normal x, negatively skewed residuals; (7) increasing spread; (8) decreasing spread; all correlations r(48) = .36). the correlation between right frontal coherence and the socialized vision scale could have been plausibly generated by any of these data distributions. this demonstrates the importance of reporting scatterplots in order to verify whether pearson’s correlation analysis was justified, and whether the correlation coefficient is representative of the data that generated it.

we conduct these analyses below². for completeness, we also report the exact p values given by the t statistic from each combination of r and n. for the correlation between right frontal coherence and socialized vision content we obtain: r(48) = .36, 95% ci [.09, .58], p = .010 (figure 3a; lower). for the correlation between right frontal coherence and follower perceptions of inspirational leadership we obtain: r(48) = .26, 95% ci [-.02, .50], p = .068 (figure 3a; upper). focusing on the 95% ci (and ignoring the α criterion for evaluating p values), it can be seen that the distributions of possible values of ρ overlap.
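these intervals and exact p values, as well as the difference test that follows, can be reproduced from the published summary statistics alone. a minimal python sketch (the analyses in this review were run in r; this reimplementation of fisher’s transformation, the t-based p value, and zou’s 2007 interval is for illustration only):

```python
import math
from statistics import NormalDist
from scipy import stats

Z975 = NormalDist().inv_cdf(0.975)  # 1.96, the 95% normal critical value

def fisher_ci(r, n):
    """95% ci for pearson's r via fisher's r-to-z transformation."""
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    return math.tanh(z - Z975 * se), math.tanh(z + Z975 * se)

def exact_p(r, n):
    """two-sided p value from the t statistic for pearson's r."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

def zou_diff_ci(r12, r13, r23, n):
    """zou's (2007) 95% ci for the difference between two overlapping
    dependent correlations r12 and r13, which share variable 1."""
    l1, u1 = fisher_ci(r12, n)
    l2, u2 = fisher_ci(r13, n)
    # correlation between the two sample correlations (zou, 2007)
    num = (r23 - 0.5 * r12 * r13) * (1 - r12**2 - r13**2 - r23**2) + r23**3
    c = num / ((1 - r12**2) * (1 - r13**2))
    d = r12 - r13
    lo = d - math.sqrt((r12 - l1)**2 + (u2 - r13)**2
                       - 2 * c * (r12 - l1) * (u2 - r13))
    hi = d + math.sqrt((u1 - r12)**2 + (r13 - l2)**2
                       - 2 * c * (u1 - r12) * (r13 - l2))
    return lo, hi

fisher_ci(0.36, 50)                # -> about (.09, .58); exact_p(0.36, 50) ~ .010
fisher_ci(0.26, 50)                # -> about (-.02, .50); exact_p(0.26, 50) ~ .068
zou_diff_ci(0.36, 0.26, 0.39, 50)  # -> about (-.19, .39)
```

the three correlations passed to `zou_diff_ci` are those reported in the study: coherence with vision content (.36), coherence with follower perceptions (.26), and the overlapping correlation between vision content and follower perceptions (.39).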
to test this statistically, however, we must conduct a statistical test of the difference between these correlations using the null hypothesis of a zero difference (i.e., the 95% ci on the difference contains zero). zou’s (2007) method has been recommended for testing the statistical difference between correlations² (cumming & calin-jageman, 2016, p. 320). in this case, zou’s method takes into consideration the overlapping dependent correlation between socialized vision content and ratings of inspirational leadership (r = .39). using this method we obtain: r1 − r2 = .10, 95% ci [-.19, .39] (cocor package for r [v1.1-3]; diedenhofen & musch, 2015). the best estimate of r1 − r2 is .10, however, the 95% ci on this effect is consistent with an interval of values ranging from -.19 to .39 (figure 3b). these analyses indicate that we are insufficiently confident to conclude that r = .36 is greater than r = .26 in these data, and that the difference between these correlations may be zero. any theoretical claim that relies on a difference between these correlations is therefore not an accurate reflection of the data. claiming that a difference between a statistically significant correlation and a non-significant correlation is itself statistically significant is a common misinterpretation of nhst. this is referred to as the interaction fallacy, and has been discussed comprehensively elsewhere (gelman & stern, 2006; nickerson, 2000; nieuwenhuis et al., 2011).

²note: these analyses assume bivariate normality, meaning robust alternatives may yield more accurate intervals (pernet et al., 2013). however, as the authors’ analyses and claims are performed under this assumption, we also proceed with an assessment of claims assuming this is satisfied.

figure 3.
plot (a): graphical representation of pearson’s r = .26 and r = .36 and their 95% confidence intervals (ci) with a sample of n = 50 (values are given in each label). the 95% ci has been computed using fisher’s r to z transformation under the assumption of bivariate normality. while r = .26 is not statistically different from zero, its 95% ci has considerable overlap with that of r = .36. plot (b): a statistical test of the difference between pearson’s r = .26 and r = .36 and its 95% ci, accounting for the overlapping dependent correlation of r = .39 (tails = two-sided, null hypothesis = zero) using zou’s (2007) method. the difference between the two correlations is compatible with zero. claim 2: the meaningfulness of a correlation magnitude. the second critical conclusion in this work is that there is a meaningful, neural distinction between leaders who were considered high versus low in espousing socialized visions. statistically speaking, this claim rests on the strong assumption that the correlation between right frontal coherence and the socialized vision scale, r = .36, is equal to the population effect size, ρ. however, as we have shown in our calculations above, r = .36 is consistent with an interval of values specified by its confidence interval. the 95% ci on a correlation coefficient tells us that in 95% of random samples from the same population, the 95% ci will contain the population parameter, ρ. it follows logically that in 5% of cases the 95% ci will miss this value. therefore, we can deduce that it is plausible (although not certain) that a 95% ci will contain the true value of ρ. in this case, ρ could plausibly range from .09 to .58 (see figure 3a; lower). there is indeed a statistically significant neural distinction between leaders who espouse different quantities of socialized content in their visions. however, the precision of the estimate of this association is extremely low.
that is, the 95% ci is so wide that the margin of error (quantified as the half-width of a confidence interval) is approaching the magnitude of the estimate itself (i.e., margin of error = .24). this means that there are a great many plausible values that ρ could take. these values may potentially be negligible for everyday purposes (.09) or even very large (.58). the claim that this correlation is sufficient evidence for use of neurofeedback to enhance coherence is therefore staggeringly disproportionate to the precision with which this effect has been measured. drawing substantive conclusions based on a dichotomous all-or-none decision in the absence of effect size magnitude, accuracy, and precision is one of the most widespread misuses of nhst (cumming, 2014). one of the main reasons that scholars conduct empirical studies is to learn about an effect of interest. when a p value describes an effect as statistically different from zero, yet the confidence interval is very wide, we understand very little about the effect beyond its direction (i.e., positive or negative). for this reason, there has been increasing interest in recent years in planning studies to estimate the magnitude of an effect within a confidence interval that is adequately narrow in width. this has been referred to as accuracy in parameter estimation (aipe; kelley & rausch, 2006; maxwell et al., 2008; peters & crutzen, 2017) or precision for planning (cumming, 2014). in this paradigm, a key question prior to conducting a study is: what sample size is required to provide a sufficiently precise estimation of an expected effect size of interest? what is sufficiently precise depends on one’s research objectives. however, one suggestion is to target a margin of error of no more than half of the expected effect size (although this is not always a practical solution; cumming & calin-jageman, 2016, p. 266). if waldman et al.
(2011a) considered r = .36 to be their best estimate of the expected population effect, ρ, they may consider designing a study that yields a margin of error of no more than .18. this would require a minimum of 92 participants to attain this level of precision with 95% confidence³ (confintr: userfriendlyscience package [v 0.7.2] in r; peters, verboon, & green, 2018). it has been suggested that taking a parameter estimation approach to research planning may assist in the production of empirical work that is accurate, precise, and more likely to replicate (peters & crutzen, 2017). summary. like peterson et al. (2008), waldman et al. (2011a) lacks an adequately transparent description of their study for it to be clearly understood, evaluated, or replicated with reasonable accuracy. furthermore, the level of description that can be extracted from this report reveals that nhst has been misused or misinterpreted, and has led to interpretations of findings that are not substantiated by the data. this is the seminal work in organizational neuroscience: it has been cited 177 times in the google scholar database and is discussed at length in most reviews of the literature. yet, little attention has been given to what are important limitations in methods, analytic strategy, and interpretation of results. on the basis of this review, we recommend that scholars familiarize themselves with the interaction fallacy and other misuses of nhst (see gelman & stern, 2006; nickerson, 2000; nieuwenhuis et al., 2011). we also recommend that the results of nhst be considered with explicit reference to effect size and precision to allow for a more informative judgment of research findings. and we further recommend that researchers consider precision for planning in order to attain sufficiently narrow confidence intervals that allow for meaningful conclusions to be drawn from findings.
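the 92-participant figure above can be checked with a simple search over sample sizes; a python sketch of the precision-for-planning calculation (the figure in the text was produced with the userfriendlyscience r package; this is an independent reimplementation):

```python
import math
from statistics import NormalDist

def moe(r, n, conf=0.95):
    """expected half-width of the fisher-z ci for pearson's r at sample size n."""
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return (math.tanh(z + crit * se) - math.tanh(z - crit * se)) / 2

def n_for_precision(r, target):
    """smallest n whose expected 95% ci half-width does not exceed target."""
    n = 5  # minimum n for the fisher approximation (requires n > 3)
    while moe(r, n) > target:
        n += 1
    return n

n_for_precision(0.36, 0.18)  # -> 92, matching the text
```

the same function can be used to explore how quickly the required n grows as the expected effect shrinks or the target margin of error narrows.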
finally, in the interest of a self-corrective and cumulative science, we suggest that scholars do not carelessly recite the contents of waldman et al. (2011a) in future reviews of the literature without sufficient critical evaluation. given that waldman et al. (2011a) is the seminal work of the field, we also call on the authors to amend their reports following the jars-quant guidelines and to publish their data openly for re-analysis.

part ii: focused post-publication review of secondary works

boyatzis et al. (2012). examination of the neural substrates activated in memories of experiences with resonant and dissonant leaders. boyatzis and colleagues (2012) is one of the earliest fmri studies conducted in organizational neuroscience. this study was an exploratory investigation into the neural basis of the personal and interpersonal consequences of interacting with resonant and dissonant leaders, with the implication that such knowledge may inform leadership training and practice. as the authors describe, resonant leaders are considered those whose relationships are characterized by mutual positive emotions, while dissonant leaders are those who invoke negative emotions. using a sample of eight individuals with extensive employment experience, participants were interviewed to describe two distinct interactions with two leaders they considered resonant or dissonant, respectively (i.e., four leaders, describing eight interactions total). audio statements based on each of these eight interactions were created for each participant (8 – 10 s) to be used as cues to recreate an emotional memory of the interaction while undergoing fmri (5 s). as a manipulation check, participants were also presented with a 4-item question that gauged the valence of their emotional response from strongly positive to strongly negative (2 – 3 s), where recall of resonant and dissonant leaders was expected to yield positive and negative affective responses, respectively.
using an event-related design, each of the eight different cues was randomly presented six times across three runs, which resulted in 48 trials in total. results of the manipulation check confirmed that emotional responses were all in the predicted direction. following preprocessing, fmri data were then analyzed using a fixed-effects analysis. for the contrast between the resonant and dissonant conditions (i.e., resonant > dissonant), the authors reported greater activation in seven regions of interest (rois): the left posterior cingulate, bilateral anterior cingulate, right hippocampus, left superior temporal gyrus, right medial frontal gyrus, left temporal gyrus, and left insula. because the authors tested no hypotheses in this exploratory study, results were interpreted through reverse inference based on existing social, cognitive, and affective neuroscience research. for example, some of these regions have been implicated in the putative mirror neuron system. this system comprises a class of neurons that modulate their activity when an individual executes motor and emotional acts, and when these acts are observed in other individuals (molenberghs, cunnington, & mattingley, 2012). as the authors describe, several regions implicated in this network were activated in response to memories of resonant and dissonant leaders.

³it is important to note from our discussion that r = .36 may not be the best estimate of ρ. the authors may therefore take a conservative approach and choose to plan for precision based on the lower limit of the plausible range of values that ρ could take (i.e., .09). to estimate ρ = .09 with a margin of error of no more than half of this expected effect size (i.e., .045) and with 95% confidence, a study would require 1867 participants (confintr: userfriendlyscience package in r). conducting studies with precision will require more resources than researchers are accustomed to, particularly when an expected effect size is very small.
however, some of these regions were less active during the dissonant memory task. the authors interpreted this as a pattern of avoidance of negative affect and discomfort that was experienced during moments with dissonant leaders, and which may indicate a desire to avoid these memories. critical review. boyatzis and colleagues (2012) describe this study as an exploratory study. therefore, we critically evaluate this publication as an example of pilot research and overlook limitations that characterize such works. such limitations may include (although not necessarily) a lack of directional a priori hypotheses and a strong reliance on reverse inference. here we focus specifically on the type of fmri statistical analysis that has been performed, and the implications this has for drawing inferences from a sample to the whole population. drawing inferences about the population from an fmri analysis. organizational behavior researchers are typically interested in what is common among a sample of participants in order to permit generalizability of an effect to the full population from which they are sampled. that is, scholars wish to predict and explain organizational behavior beyond the random sample that is included in their study in order to inform organizational theory and practice decisions. the same principle applies to fmri data analysis. in any fmri study, the blood oxygen-level dependent (bold) response to a task will vary within the same participant from trial-to-trial (within-participant variability) and from participant-to-participant (between-participant variability). therefore, in order to draw inference from a sample group to the full population of interest, a mass univariate fmri analysis must account for both within- and between-participant variability (penny & holmes, 2007). this is what is referred to as a random-effects (or mixed-effects) analysis, which allows for formal inference about the population from which participants have been drawn. boyatzis et al.
(2012) report that only a fixed-effects analysis was conducted on their fmri data. fixed-effects analyses account only for within-subject variability, and for this reason, inferences from such analyses are only relevant to the participants included in that specific fmri study. in this case, inferences therefore only describe the eight participants recruited in boyatzis et al. (2012). because between-participant variance is much larger than within-participant variance, fixed-effects analyses will typically yield smaller p values that overestimate the significance of effects. for this reason, fixed-effects analyses have not typically been reported in the absence of a corresponding random-effects analysis since the very early days of neuroimaging research (penny & holmes, 2007). the results of fixed-effects analyses are useful if a researcher is interested in the specific participants included in a sample (e.g., a case study), or if it can be justified that the sample represents the entire population of interest. however, because boyatzis et al. (2012) conducted only a fixed-effects analysis, it is uncertain whether the same pattern of activations would be observed if an additional participant were to be included in the study, or if a replication were to be performed. indeed, the authors report that the exclusion of the single female participant rendered eight regions of interest non-significant, demonstrating the instability of their reported effects and the strong influence of outliers when using fixed-effects analyses. a random-effects analysis is the appropriate analysis to perform if researchers seek to generalize their findings to the population at large. summary. boyatzis et al. (2012) aimed to explain the neuronal basis of interactions with dissonant and resonant leaders, with the implication that such knowledge could improve leadership training and practices.
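the inferential difference between the two analyses can be illustrated with a toy simulation in python (made-up numbers: eight "participants", 48 "trials" each, and no true population-level effect; this is a schematic of the statistical point, not an fmri analysis):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_subj, n_trials = 8, 48

# between-participant variability dominates; the true population mean is zero
subj_effects = rng.normal(loc=0.0, scale=1.0, size=n_subj)
trials = subj_effects[:, None] + rng.normal(0.0, 0.5, size=(n_subj, n_trials))

# fixed-effects style: pool all trials, ignoring who produced them
_, p_fixed = stats.ttest_1samp(trials.ravel(), 0.0)

# random-effects style: one summary statistic per participant
_, p_random = stats.ttest_1samp(trials.mean(axis=1), 0.0)

# p_fixed will typically be much smaller than p_random, overstating
# the evidence against the (true) null at the population level
```

the pooled test treats 384 correlated trials as independent observations, which is why it can declare a "significant" effect that the participant-level test, with only eight independent units, correctly treats as uncertain.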
the authors take this step in an extensive review piece describing the neural basis of leadership (boyatzis, rochford, & jack, 2014) and provide explicit recommendations on leadership practice on the basis of these exploratory findings (boyatzis & jack, 2018). this study has been cited 99 times in the google scholar database. however, almost no attention has been directed to the inadequacy of the analytic strategy, and the implications this has for generalizing findings from sample to population. while random-effects analyses are consistently reported in fmri work in the broader social and cognitive neurosciences, we recommend that scholars remain vigilant of this practice in organizational neuroscience. we also recommend that scholars be aware of this concern when discussing boyatzis et al. (2012) in future reviews of the literature. finally, in the interest of a self-corrective and cumulative science, we also call on the authors to repeat their analyses using a random-effects analysis. if these results do not replicate, we call on the authors to correct potentially misleading claims based on these data, and, if necessary, amend their recommendations for leadership practices accordingly. in any case, an adequately powered replication study is required to confirm these findings, given that the sample size was very small. waldman et al. (2013a). emergent leadership and team engagement: an application of neuroscience technology and methods. waldman et al. (2013a) is an empirical study that used real-time eeg recordings to examine emergent leadership and team engagement. the aim of this study was to investigate whether individual self-reports of engagement could predict whether that individual is likely to be appraised as an emergent leader in a team context.
a second aim was to examine whether fellow team members were likely to be more engaged when an emergent leader (compared to a non-leader) used verbal communication during a group-based problem-solving task. to assess these research questions, the authors used psychometric measures of engagement and leadership, and an eeg-based measure of engagement that could be determined in real-time and on a second-by-second basis. to this end, 146 business administration students were allocated to 31 teams of 4-5 individuals and given 45 minutes to solve a corporate social responsibility case problem in a team setting. during this task, eeg was measured continuously from each participant and time-matched to individual speaking times using video recordings. as the authors describe, the eeg measure was based on a discriminant function that has been used to classify an individual's cognitive state into different levels of engagement (berka et al., 2004; berka et al., 2005). at the conclusion of the task, participants were asked to assess their level of engagement retrospectively using the rich et al. (2010) job engagement measure, and to assess fellow team members' emergent leader behaviors using items from the multifactor leadership questionnaire (bass & avolio, 1990). in the analysis that followed, emergent and non-emergent leaders were identified in each of the 31 groups based on the extreme (i.e., highest and lowest) follower ratings of emergent leadership. using logistic regression (and controlling for gender, age, and the number of friends in each team), self-reports of individual engagement were found to be a significant predictor of categorization as an emergent leader by other team members (b = 0.97, p < .05).
having demonstrated that self-reports of engagement predicted emergent leader status, the authors conducted a test to determine if other team members were more engaged during periods of emergent leader verbal communication (compared to individuals who scored lowest on follower ratings of emergent leadership). in their methods section, the authors indicated that the team sample size was reduced from n = 31 to 26 in the following analyses, due to technical problems with eeg recordings. a pearson correlation analysis was performed between aggregate measures of team-level engagement using the self-report and the eeg measure, which revealed a positive relationship (r = .32, p < .05) that the authors interpreted as evidence of moderate convergent validity for their eeg measure. a one-tailed dependent-samples t-test revealed no difference in real-time (eeg-based) team engagement for the total time an emergent leader vs. non-leader was communicating during the task (t = 1.33, p > .05). however, when restricting the analysis to solely the final instance of emergent leader and non-leader communication, real-time (eeg-based) team engagement was found to be greater during emergent leader communication compared to non-leader communication (t = 2.24, p < .05). the authors concluded that individuals who are highly engaged (as measured by self-report) are likely to be appraised by fellow group members as an emergent leader. in turn, the claim is made that emergent leaders may be responsible, in part, for team engagement. this is because the eeg measure of engagement was greater during the last period of emergent leader versus non-leader communication. from these results, the authors also claim that real-time eeg recordings using their discriminant function represent a valid measure of engagement. it is asserted that such measures may be particularly useful to organizational behavior research investigating ongoing team processes. critical review. waldman et al.
(2013a) represents a single dataset that has been reported through multiple venues, where each venue provides a different level of detail regarding methods, analytic strategy, and results. our critical review is therefore guided by the initial work that was published through conference proceedings (waldman et al., 2013a), the unpublished preprint available on the researchgate repository (waldman et al., 2013b), and the published textbook chapter within which it is discussed at length (waldman, stikic, wang, korszen, & berka, 2015). in this commentary we focus on the neuroscience component of the study and examine the claim that the eeg measure represented a valid index of organizational engagement. a parameter estimation approach to assessing convergent validity. the claim that emergent leaders generate the greatest level of engagement during verbal communication among fellow team members requires an assessment of the validity of the eeg measure. the authors report that an r of .32 represents moderate convergent validity between the aggregate team-level eeg measure and self-report measures of engagement. however, this interpretation relies on the assumption that the effect size estimate, r, is equal to the population parameter, ρ. as demonstrated earlier in our review, it is plausible (though not certain) that a 95% ci will contain the true value of ρ. to estimate ρ in this study we therefore apply the fisher r-to-z transformation using r = .32 and n = 26 (noting that the sample size was reduced to 26 because of problems with the eeg recordings). we report the exact p values given by the t statistics in these calculations and, for completeness, conduct both a one-sided and a two-sided test because it is unclear which test was performed. for this assessment of convergent validity, we obtain: r(24) = .32, 95% ci [-.01, 1], p = .056 (one-sided test), and r(24) = .32, 95% ci [-.08, .63], p = .111 (two-sided test).
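as a check on these reported values, the interval and p value calculations above can be sketched as follows. this is a minimal sketch, not the authors' analysis script; the values r = .32 and n = 26 are taken from the text.

```python
# Sketch: reconstructing the convergent validity CI and p values reported
# in the text (r = .32, n = 26) via the Fisher r-to-z transformation.
import math
from scipy import stats

r, n = 0.32, 26
z = math.atanh(r)            # Fisher r-to-z transformation
se = 1 / math.sqrt(n - 3)    # standard error of z

# two-sided 95% CI, back-transformed to the r metric
lo2 = math.tanh(z - stats.norm.ppf(0.975) * se)   # approx. -.08
hi2 = math.tanh(z + stats.norm.ppf(0.975) * se)   # approx.  .63

# one-sided 95% CI: only the lower bound is estimated; the upper bound is 1
lo1 = math.tanh(z - stats.norm.ppf(0.95) * se)    # approx. -.01

# exact p values from the t statistic for a correlation coefficient
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p_one = stats.t.sf(t, df=n - 2)                   # approx. .056
p_two = 2 * p_one                                 # approx. .111
```

note that the standard error of z uses n − 3 rather than n − 1, a standard property of the fisher transformation.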
assuming that a one-sided test was reported, a nhst approach to inference tells us that a correlation of the magnitude of r = .32 is expected to occur 5.6% of the time when the population parameter, ρ, is actually zero. that is, we are insufficiently confident to comment on the direction and magnitude of this correlation coefficient, and cannot rule out that the true convergent validity is zero. reporting p = .056 as p < .05 appears to be a generous rounding of a p value to satisfy an all-or-none decision criterion for publication. beyond this observation, however, the 95% ci tells us there are a great many plausible values that ρ could take, including a value of zero. for validity claims, one suggestion has been that we should consider r magnitudes of .80 and .90 as good (rather than by cohen's classic definition as very large), and r magnitudes of .60 or .70 as small and inadequate (rather than large; cumming & calin-jageman, 2016, p. 319). on this basis, r = .32 may be quite problematic for claiming convergent validity. when we consider the precision with which this correlation has been measured, the plausible value of r = 0 (or even close to zero) makes any claim of convergent validity very unpersuasive. the calculations above also draw into question the claim that the real-time eeg measure used here represents a valid measure of organizationally-relevant engagement, as it has been described in multiple reviews of the literature (e.g., waldman, wang, & fenters, 2016; waldman et al., 2017). if we suspect that a measure does not have sufficiently high validity, we must be wary of any subsequent conclusions that have been made on the basis of this measure. in this case, this includes the claim that emergent leaders are responsible for generating team engagement. a possible explanation of the lack of convergent validity observed in this study may be that the eeg measure is simply tracking alertness. in their original validation study, berka et al.
(2004) describe that this eeg measure was developed for the purpose of monitoring mental workload during cognitive tasks. specifically, they describe that this measure can classify real-time eeg epochs into one of four states of alertness: 'sleepy', 'relaxed wakefulness', 'low vigilance', and 'high vigilance'. in contrast to this, rich et al.'s (2010) self-report measure defines engagement as an individual's complete physical, cognitive, and emotional investment into a work role or job. this includes items such as "i feel proud of my job" (emotional engagement), and "i devote a lot of attention to my job" (cognitive engagement). high self-report ratings on measures of pride and attention with respect to one's job or task may covary with moment-to-moment alertness or vigilance, but they do not necessarily need to do so. ongoing assessments of alertness may be of interest to organizational scholars. however, the relationship between this eeg measure of alertness and organizationally-relevant engagement (and consequently, emergent leadership) is not entirely clear in this study. summary. waldman et al. (2013a) and the book chapter in which it is discussed extensively (waldman et al., 2015) have been collectively cited a total of 32 times in the google scholar database. this work has also been given ample discussion space in several major reviews of the literature (e.g., ashkanasy, becker, & waldman, 2014; waldman & balthazard, 2015; waldman et al., 2017). however, these reviews provide little discussion of whether the authors' claims are a correct reflection of their analyses and results. we recommend that scholars carefully and consistently evaluate whether measures used in organizational neuroscience are methodologically and statistically valid. to do so, we recommend taking a parameter estimation approach to evaluating assessments of convergent validity by considering effect size magnitude and quantifying precision via confidence intervals. kim and james (2015).
neurological evidence for the relationship between suppression and aggressive behavior: implications for workplace aggression. the final organizational neuroscience study we critically evaluate is an fmri experiment conducted by kim and james (2015). in this study, the authors set out to examine whether brain regions differed in activity during suppression (i.e., a maladaptive emotion regulation strategy) and passive viewing of negatively valenced affective images. having established such regions, the primary aim of this study was to determine whether activity in these regions was associated with aggressive behaviors. as the authors discuss, such research may provide insight into factors leading to a reduction of workplace aggression. prior to scanning, ratings of aggressive behavior were obtained from two significant others of each participant and averaged, respectively, to give ratings of five different forms of aggression (i.e., physical, property, verbal, relational, and passive aggression). seventeen participants were then subjected to an fmri task in which they were exposed to negatively valenced or neutral images from the international affective picture system (iaps; lang, bradley, & cuthbert, 2008). negatively valenced images were preceded by instructions (4 s) to either suppress negative emotional reactions or passively watch each image, while neutral images were always preceded by instructions to passively watch. the latter condition served as a baseline. following this, four images were presented, and participants employed the instructed emotion regulation strategy (20 s). a manipulation check question was then asked to examine the intensity of negative emotions experienced on a 4-point scale (i.e., neutral to strong; 4 s). to recover from the potential effects of experiencing negative affect, a set of four grey-tone pattern images were presented (20 s), serving as a rest period.
this was followed by another manipulation check question, which again measured the intensity of negative emotions experienced (4 s). the task was repeated over 48 trials. results of the manipulation check confirmed that participants experienced greater negative emotions following negatively valenced images compared to the rest period. participants also reported greater negative emotions when using suppression compared to passive observation. according to the authors' discussion (which implicates the use of suppression in the intensification of negative affective experiences), this indicated that the manipulation was successful. fmri data were subsequently analyzed using a group-level analysis at the whole-brain level. compared to baseline, the authors reported broadly overlapping areas of activation for suppression and passive watching (e.g., the bilateral visual cortex and insula). for their primary contrasts of interest (i.e., the difference between experimental conditions), the authors reported greater activity in the cingulum and both the left and right insula when engaged in suppression (versus passive watching), and in the calcarine sulcus when engaged in passive watching (versus suppression). average t scores were extracted from each of these four regions for each participant, respectively. these values were then used in a pearson's correlation analysis against each of the five (psychometrically assessed) aggression ratings. the authors reported a significant negative correlation between average t scores in the calcarine sulcus for the passive watching > suppression contrast and property aggression (r = -.49, p < .05). however, the authors report there was no significant relationship with any other type of aggression.
based on these findings, the authors conclude there was a significant association between suppression (i.e., the neural substrate) and aggression (i.e., the psychometric measure), and that this association has implications for organizational practice and research. specifically, they suggest that use of suppression as an emotion regulation strategy in the workplace will be related to counterproductive behavior in the form of aggression, and that managers should focus on building an organizational climate or set of norms that precludes use of suppression. in their limitations section, the authors acknowledge that their small sample size provides only preliminary evidence for the relationship between suppression and aggression. however, they also claim that the correlation magnitude reported in this study indicates that an equivalent or even larger effect should be observed in studies with larger samples. critical review. kim and james (2015) presents several important limitations. these include no provision of reliability statistics for their aggression measures, and no description of how the fmri data were modeled (i.e., a fixed-effects versus a random-effects analysis), among other concerns. however, our focus here will be on a concern that has not yet received attention in our commentary, and which we feel must be a focal point of discussion in the field moving forward: researcher degrees of freedom. research designs that will inevitably yield statistical significance.
researcher degrees of freedom refers to the number of arbitrary decisions available to a researcher when formulating hypotheses, designing experiments, analyzing data, and reporting results (de winter & dodou, 2015; ioannidis, 2008; nelson et al., 2018; simmons et al., 2011; simonsohn, nelson, & simmons, 2014; wicherts et al., 2016). opportunistic discretion in the decisions that occur at each step of the research process can increase the probability of attaining sufficiently small p values in favor of the existence of an effect, even when none exists.

figure 4. plot demonstrating the multiple comparisons problem in kim and james (2015) by way of simulation. these data were generated from 10,000 replications of 50 studies that each computed between 1 and 50 correlation coefficients (r) in the same study, respectively. these correlations were computed from a sample of n = 17 and drawn from a bivariate normal distribution where the population parameter (ρ) is zero. because ρ = 0, any statistically significant correlation is a false positive (i.e., a type i error). the y axis quantifies the false positive rate (α), and the x axis quantifies the number of correlations performed in any individual study. the solid curved line represents the long-run false positive rate (uncorrected) in each of the 50 studies. the false positive rate increases with increasing numbers of comparisons. for example, when comparisons (hereafter: i) = 1, α = .05; i = 2, α = .10; i = 3, α = .14; i = 4, α = .19, and so on. in kim and james (2015) a minimum of 20 comparisons must be performed to test a single hypothesis, which increases α from .05 to .64 (see vertical line and upper α label). when the same simulation is performed with bonferroni corrections, the long-run α is restricted to approximately .05 regardless of how many comparisons are made (see dotted horizontal line).
that is, researcher degrees of freedom can substantially increase the probability of reporting false positives (i.e., type i errors). this phenomenon is therefore an important contributor to the production of research findings that do not replicate. in the present study, kim and james (2015) adopt a research design that has been described as highly prone to bias due to researcher degrees of freedom: collection of multiple dependent measures of the same construct. the decision to measure five alternate forms of aggression creates multiple opportunities for observing a statistical relationship between these data and the fmri data. in combination with two fmri contrasts yielding average t scores in four significantly active regions of interest, this study demands a minimum of 20 pearson correlation analyses to assess a single primary research question. on the basis of this chosen research design, scholars of organizational neuroscience should recognize that it is unsurprising that at least one of the 20 correlation tests would be significantly different from zero. when multiple opportunities exist to reject a single hypothesis of no relationship, the type i error under the null hypothesis is inflated from α = .05 to α = 1 − (1 − .05)^i, where i refers to the number of comparisons in a single hypothesis family (veazie, 2006). because 20 tests have been performed in kim and james (2015), the probability of incorrectly rejecting the null hypothesis was therefore approximately 64% instead of 5% (see figure 4 for a confirmation of this computation via a simulation approach). in this case, because a single positive finding was sufficient to reject the null, one option for returning the type i error rate to an acceptable 5% threshold would be to perform a bonferroni adjustment by dividing the α criterion by the number of tests performed (see figure 4).
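the α inflation formula above can be verified directly. a minimal sketch follows; the 20-comparison count is the one implied by the study design described above.

```python
# Sketch of the familywise error rate formula:
# alpha_family = 1 - (1 - alpha)^i for i independent comparisons
# tested within a single hypothesis family.
def familywise_alpha(i: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across i independent tests."""
    return 1 - (1 - alpha) ** i

print(round(familywise_alpha(1), 2))    # 0.05
print(round(familywise_alpha(4), 2))    # 0.19
print(round(familywise_alpha(20), 2))   # 0.64

# a Bonferroni adjustment (alpha / i per test) holds the familywise
# rate near the nominal alpha regardless of the number of comparisons:
print(round(familywise_alpha(20, alpha=0.05 / 20) , 3))  # approx. 0.049
```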
alternatively, if this study were construed as an exploratory investigation in which missing an effect was of prime concern (i.e., making a type ii error), no adjustment to α may be necessary. instead, reporting all tests alongside their 95% confidence intervals would be informative in providing a descriptive estimation of the range of possible effects (lee, whitehead, jacques, & julious, 2014, p. 41). this would then require assessment in a future confirmatory study of sufficient power with multiple comparison adjustment. even so, the informativeness of an exploratory correlational study with such a small sample size may still be questionable. for example, if we construct a 95% confidence interval on the correlation between average t scores extracted from the calcarine sulcus and property aggression, we obtain: r(15) = -.49, 95% ci [-0.79, -0.01], p = .046. the precision of this estimate is so poor that the plausible range for the population parameter covers almost all negative correlation values, and approaches zero. correlational analyses with such small samples (even in exploratory studies) are also rarely desirable for another reason: there is a high probability that the correlation coefficient will be inflated (button et al., 2013; ioannidis, 2008; yarkoni, 2009). in a correlation study with a sample of 17 participants and an α threshold of .05, the critical r value is ± .48. that is, any given correlation will only be statistically significant if the sample correlation is greater than r = .48 or less than r = -.48. if the population correlation between any two measures is ρ = .30, for example, a study of this sample size will systematically inflate any significant r estimate to a minimum of .48. this inflation may arise through any of a number of researcher degrees of freedom, but is also possible simply as a result of random sampling error.
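the critical r quoted above follows from the t distribution with n − 2 degrees of freedom. a minimal sketch of this computation (not taken from the original paper):

```python
# Sketch: the smallest |r| that reaches two-tailed p < .05 with n = 17.
# A sample correlation is significant only if its t statistic exceeds the
# critical t, so inverting t = r * sqrt(df) / sqrt(1 - r^2) gives r_crit.
import math
from scipy import stats

n = 17
df = n - 2
t_crit = stats.t.ppf(0.975, df)                  # approx. 2.13
r_crit = t_crit / math.sqrt(t_crit ** 2 + df)    # approx. 0.48
print(round(r_crit, 2))                          # 0.48
```

any study of this size can therefore only report significant correlations of magnitude .48 or greater, regardless of the true population effect.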
correlational studies in small samples can therefore expect massive inflations of statistically significant r values for all but very large population effects. this is sometimes referred to as the "winner's curse": scientists lucky enough to discover a statistically significant finding in a small sample study are likely to overestimate its magnitude by chance (button et al., 2013; ioannidis, 2008). indeed, artificial inflation of the correlation coefficient is a highly plausible explanation for kim and james' (2015) single significant correlation. the critical r for statistical significance in this study was ± .48 (p = .050) and the single reported significant r was .49 (p = .046). in contrast to the authors' claim that an equivalent or even larger effect should be observed with a larger sample, their correlation may be substantially inflated or simply an artefact of sampling error. summary. researchers can (and often do) make what appear to be reasonable design and analysis decisions that increase researcher degrees of freedom. we therefore have no reason to believe that kim and james (2015) was purposefully designed to draw out significant findings. however, it is unfortunate that this work is discussed at length in waldman and colleagues' (2017) major review piece with no consideration of the inevitability of reporting at least one statistically significant result. the decision to guide organizational research and practice on the basis of a study that had a 64% probability of falsely rejecting one or more tests is a subjective judgement. while this was deemed reasonable by waldman and colleagues (2017), it may be considered inappropriate by researchers and organizational practitioners who are risking scarce time, effort, and financial resources with respect to investment in organizational neuroscience practices. wicherts et al.
(2016) provide an extensive 34-item checklist of the different degrees of freedom that researchers have available in formulating hypotheses, and in the design, analysis, and reporting of research results. we recommend that scholars of organizational neuroscience familiarize themselves with these items and be vigilant of researcher degrees of freedom in their evaluations of the literature. we also recommend that scholars consider preregistering their own empirical work before conducting a study in order to avoid these problems themselves. preregistration is the specification of a research design, hypotheses, and analytic strategy ahead of conducting a study (munafò et al., 2017; nosek, ebersole, dehaven, & mellor, 2018), and is usually accomplished by posting these specifications to an independent registry that makes them discoverable (e.g., the open science framework: https://osf.io/). registered reports are a specific type of preregistration that is receiving increasing interest, where peer review occurs prior to data collection (chambers, dienes, mcintosh, rotshtein, & willmes, 2015; nosek & lakens, 2014). in registered reports, high quality protocols are assessed on their methods and analytic strategy, and are provisionally accepted for publication regardless of the magnitude, direction, and statistical significance of the experimental result (so long as the authors follow through with the registered methodology). the use of preregistration and registered reports is likely to substantially reduce questionable research practices in organizational neuroscience, and address many of the concerns we have raised throughout this review.
conclusion

in 2013, ashkanasy and colleagues appealed to scholars to not 'throw the baby out with the bathwater' in response to unfavorable critiques of how organizational neuroscience may impact organizational research and practice. six years on, there is now substantial concern that a number of studies with methodological and/or interpretational problems are being uncritically and habitually recited in multiple reviews of the literature.

table 3. summary of recommendations for improving post-publication review in organizational neuroscience.

1. evaluate the transparency and completeness of reporting in an empirical work before accepting its claims. a lack of transparency and completeness in reporting makes it difficult to evaluate whether aspects of the study were valid, reliable, or implemented correctly, and whether interpretations are substantiated by the data. consider the apa working group on journal article reporting standards (jars-quant; appelbaum et al., 2018) as a guide for best practices in reporting, among other existing systematized guidelines.

2. become familiar with common misuses and misinterpretations of null hypothesis significance testing (nhst). improper use of nhst can lead to misleading or erroneous conclusions that are not substantiated by the data. fallacies of nhst have been described in detail elsewhere (e.g., nickerson, 2000).

3. interpret results with an explicit reference to effect size magnitude, accuracy, and precision. nhst only provides information about whether an effect is statistically significant and its direction. a more informative interpretation of findings can be had through computation of effect sizes, and construction of confidence intervals to specify accuracy and precision.

4. consider planning for precision. when a p value is small but the confidence interval on an effect is wide, we know very little about the effect beyond its direction. consider planning studies to attain a pre-specified margin of error or confidence interval width, and evaluate whether existing studies have been designed in a way that yields sufficiently precise detail about an effect of interest.

5. ensure appropriate statistical tests have been employed if generalizations have been made from sample to population. if seeking to generalize from sample to population, ensure that an appropriate analysis has been conducted. for example, inference beyond the sample in mass univariate fmri analyses requires a random-effects analysis.

6. evaluate claims of convergent validity between neuroscience measures and psychometric constructs using a parameter estimation approach. the point estimate alone is not enough to specify our confidence in the convergence between a neuroscience and psychometric measure. by computing a confidence interval on measures of convergent validity (e.g., pearson's correlation) we can specify its accuracy and precision to make a more informed judgement.

7. consider researcher degrees of freedom in the evaluation of published findings and when designing empirical studies. arbitrary decisions when designing experiments, analysing data, and reporting results can increase the likelihood of false positives. checklists of researcher degrees of freedom are available for consultation when evaluating and designing studies (e.g., wicherts et al., 2016).

8. consider preregistration and registered reports in order to build a replicable and trustworthy organizational neuroscience. preregistration involves specification of a research design, hypotheses, and analytic strategy ahead of conducting a study, and posting these specifications to an independent registry. registered reports involve peer review of protocols ahead of data collection. these methods may substantially improve the replicability and trustworthiness of the literature.
whereas ashkanasy expressed worry that we may lose something valuable by dismissing organizational neuroscience altogether, the research climate may have reversed to the extent that all organizational neuroscience work is now considered valuable, indiscriminately. waldman and colleagues (2017) is a recent annual review that has been presented as a critical evaluation of the state-of-the-art in organizational neuroscience. for this reason, the references cited therein may wield a disproportionate impact on the future of the field and influence organizational practice and investment decisions. because we seek the development of a replicable, reliable, and trustworthy organizational neuroscience moving forward, in this commentary we have provided a comprehensive post-publication review of one-third of all works evaluated in waldman and colleagues (2017). in doing so, we have identified several research themes that we propose scholars must engage with in future evaluations of the literature. these include: (1) evaluation of the transparency and completeness of an empirical work before accepting its claims, (2) familiarization with misuses or misconceptions surrounding nhst, including the interaction fallacy, (3) interpreting results with explicit reference to effect size magnitude, accuracy, and precision, (4) planning analyses for precision so that we can make an informed judgement about an empirical finding, (5) using appropriate statistical tests that allow for generalizability from sample to population, (6) using parameter estimation to evaluate claims of convergent validity between neuroscience and psychometric measures, (7) considering researcher degrees of freedom when evaluating published findings and designing empirical studies, and (8) considering preregistration of studies and registered reports in the interest of a replicable and trustworthy organizational neuroscience. we summarize these recommendations in table 3.
organizational neuroscience has emerged as a field because it has been theorized that assessing organizing behavior at multiple levels of analysis, including the neural level, will be a valuable endeavor for organizational theory and practice (e.g., see healey & hodgkinson, 2014). we endorse this conclusion. however, a long-acknowledged concern in organizational behavior research is that theories based on studies with fundamental limitations can sometimes persist, propagate, and motivate organizational practice and the behavior of individuals for decades (ghoshal, 2005; lindebaum & zundel, 2013). it is beyond the scope of this commentary to detail examples of studies that would exemplify best practices, but we provide some recommendations and refer multiple times to comprehensive guidelines that describe what such studies would look like (e.g., cumming, 2008, 2014; cumming & maillardet, 2006; nichols et al., 2016, 2017; simmons et al., 2011; wicherts et al., 2016). science is a cumulative and self-corrective endeavor. as organizational neuroscience matures, the trustworthiness of its literature should gradually improve as findings are corroborated or refuted and the scientific record is corrected. this process is most efficient when empirical works are reported with sufficient transparency and completeness to allow critical evaluation, and when scholars consistently apply a critical eye to the existing literature. post-publication review, which in some cases challenges the conclusions of published work, will play an important part in the development of sound theory and of the organizational practice decisions that may emerge from organizational neuroscience.

conflict of interest

all authors declare they have no conflict of interest.

funding

this work was partly supported by a melbourne research scholarship awarded to g.a.p.
by the university of melbourne, and a heart foundation future leader fellowship (1000458) awarded to p.m.

author contributions

g.a.p. conceived the idea for this commentary, critically evaluated each publication, performed the analyses, generated the figures, scripted all analyses in r, wrote the paper, and compiled the paper in latex. w.r.l. provided guidance on framing the paper and cross-checking the analyses. w.r.l., s.b., h.z., and p.m. contributed to the development and editing of the paper and provided feedback.

open science practices

this type of article does not have any associated data or materials to be shared. the r statistical software analysis scripts and instructions for reproducing all analyses, simulations, and figures are available on github at https://github.com/gprochilo/org_neuro_com. it has been verified that the analysis reproduced the results presented in the article. the entire editorial process, including the open reviews, is published in the online supplement.

references

abelson, r. (1995). statistics as principled argument. taylor & francis.
academy of management. (2018). academy of management perspectives. retrieved may 9, 2018, from https://journals.aom.org/journal/amp
altman, d. g., & royston, p. (2006). the cost of dichotomising continuous variables. bmj, 332(7549), 1080. doi:10.1136/bmj.332.7549.1080
appelbaum, m., cooper, h., kline, r. b., mayo-wilson, e., nezu, a. m., & rao, s. m. (2018). journal article reporting standards for quantitative research in psychology: the apa publications and communications board task force report. american psychologist, 73(1), 3–25. doi:10.1037/amp0000191
arnsten, a. f. t., raskind, m. a., taylor, f. b., & connor, d. f. (2015). the effects of stress exposure on prefrontal cortex: translating basic research into successful treatments for post-traumatic stress disorder. neurobiology of stress, 1, 89–99. doi:10.1016/j.ynstr.2014.10.002
ashkanasy, n. m. (2013). neuroscience and leadership: take care not to throw the baby out with the bathwater. journal of management inquiry, 22(3), 311–313. doi:10.1177/1056492613478519
ashkanasy, n. m., becker, w. j., & waldman, d. a. (2014). neuroscience and organizational behavior: avoiding both neuro-euphoria and neuro-phobia. journal of organizational behavior, 35(7), 909–919. doi:10.1002/job.1952
bass, b. m., & avolio, b. j. (1990). transformational leadership development: manual for the multifactor leadership questionnaire. menlo park, ca: mind garden.
becker, w. j., & menges, j. i. (2013). biological implicit measures in hrm and ob: a question of how not if. human resource management review, 23(3), 219–228. doi:10.1016/j.hrmr.2012.12.003
becker, w. j., volk, s., & ward, m. k. (2015). leveraging neuroscience for smarter approaches to workplace intelligence. human resource management review, 25(1), 56–67. doi:10.1016/j.hrmr.2014.09.008
benjamin, d. j., berger, j. o., johannesson, m., nosek, b. a., wagenmakers, e.-j., berk, r., . . . johnson, v. e. (2018). redefine statistical significance. nature human behaviour, 2(1), 6–10. doi:10.1038/s41562-017-0189-z
berka, c., levendowski, d. j., cvetinovic, m. m., petrovic, m. m., davis, g., lumicao, m. n., . . . olmstead, r. (2004). real-time analysis of eeg indexes of alertness, cognition, and memory acquired with a wireless eeg headset. international journal of human–computer interaction, 17(2), 151–170. doi:10.1207/s15327590ijhc1702_3
berka, c., levendowski, d., westbrook, p., davis, g., lumicao, m. n., olmstead, r., . . . ramsey, c. k. (2005). eeg quantification of alertness: methods for early identification of individuals most susceptible to sleep deprivation. in proceedings of the spie defense and security symposium, biomonitoring for physiological and cognitive performance during military operations (vol. 5797, pp. 78–89). doi:10.1117/12.597503
blakemore, s.-j., & robbins, t. (2012). decision-making in the adolescent brain. nature neuroscience, 15, 1184–1191. doi:10.1038/nn.3177
boyatzis, r. e., & jack, a. i. (2018). the neuroscience of coaching. consulting psychology journal: practice and research, 70(1), 11–27. doi:10.1037/cpb0000095
boyatzis, r. e., passarelli, a. m., koenig, k., lowe, m., mathew, b., stoller, j. k., & phillips, m. (2012). examination of the neural substrates activated in memories of experiences with resonant and dissonant leaders. the leadership quarterly, 23(2), 259–272. doi:10.1016/j.leaqua.2011.08.003
boyatzis, r. e., rochford, k., & jack, a. i. (2014). antagonistic neural networks underlying differentiated leadership roles. frontiers in human neuroscience, 8(114), 1–15. doi:10.3389/fnhum.2014.00114
butler, m. j. r., o'broin, h. l. r., lee, n., & senior, c. (2015). how organizational cognitive neuroscience can deepen understanding of managerial decision-making: a review of the recent literature and future directions. international journal of management reviews, 1–18. doi:10.1111/ijmr.12071
button, k. s., ioannidis, j. p. a., mokrysz, c., nosek, b. a., flint, j., robinson, e. s. j., & munafò, m. r. (2013). power failure: why small sample size undermines the reliability of neuroscience. nature reviews neuroscience, 14(5), 365–376. doi:10.1038/nrn3475
calin-jageman, r. j., & cumming, g. (2019). the new statistics for better science: ask how much, how uncertain, and what else is known. the american statistician, 73(sup1), 271–280. doi:10.1080/00031305.2018.1518266
camerer, c. f., dreber, a., holzmeister, f., ho, t.-h., huber, j., johannesson, m., . . . wu, h. (2018). evaluating the replicability of social science experiments in nature and science between 2010 and 2015. nature human behaviour. doi:10.1038/s41562-018-0399-z
chambers, c. d., dienes, z., mcintosh, r. d., rotshtein, p., & willmes, k. (2015). registered reports: realigning incentives in scientific publishing. cortex, 66, a1–a2. doi:10.1016/j.cortex.2015.03.022
cohen, j. (1990). things i have learned (so far). american psychologist, 45(12), 1304–1312.
cumming, g. (2008). replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. perspectives on psychological science, 3(4), 286–300. doi:10.1111/j.1745-6924.2008.00079.x
cumming, g. (2014). the new statistics: why and how. psychological science, 25(1), 7–29. doi:10.1177/0956797613504966
cumming, g., & calin-jageman, r. (2016). introduction to the new statistics: estimation, open science, and beyond. london: taylor and francis.
cumming, g., & maillardet, r. (2006). confidence intervals and replication: where will the next mean fall? psychological methods, 11(3), 217–227. doi:10.1037/1082-989x.11.3.217
de winter, j. c., & dodou, d. (2015). a surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too). peerj, 3, 1–44. doi:10.7717/peerj.733
diedenhofen, b., & musch, j. (2015). cocor: a comprehensive solution for the statistical comparison of correlations. plos one, 10(4), 1–12. doi:10.1371/journal.pone.0121945
elsevier. (2018). organizational dynamics. retrieved may 9, 2018, from https://www.journals.elsevier.com/organizational-dynamics
frick, r. w. (1996). the appropriate use of null hypothesis testing. psychological methods, 1(4), 379–390. doi:10.1037/1082-989x.1.4.379
gelman, a., & stern, h. (2006). the difference between "significant" and "not significant" is not itself statistically significant. the american statistician, 60(4), 328–331. doi:10.1198/000313006x152649
ghoshal, s. (2005). bad management theories are destroying good management practices. academy of management learning & education, 4(1), 75–91.
goldman-rakic, p. s. (1996). the prefrontal landscape: implications of functional architecture for understanding human mentation and the central executive. philosophical transactions of the royal society of london. series b: biological sciences, 351(1346), 1445–1453. doi:10.1098/rstb.1996.0129
grech, r., cassar, t., muscat, j., camilleri, k. p., fabri, s. g., zervakis, m., . . . vanrumste, b. (2008). review on solving the inverse problem in eeg source analysis. journal of neuroengineering and rehabilitation, 5(25), 1–33. doi:10.1186/1743-0003-5-25
healey, m. p., & hodgkinson, g. p. (2014). rethinking the philosophical and theoretical foundations of organizational neuroscience: a critical realist alternative. human relations, 67(7), 765–792. doi:10.1177/0018726714530014
ioannidis, j. p. (2005). why most published research findings are false. plos medicine, 2(8), 0696–0701. doi:10.1371/journal.pmed.0020124
ioannidis, j. p. (2008). why most discovered true associations are inflated. epidemiology, 19(5), 640–648. doi:10.1097/ede.0b013e31818131e7
ioannidis, j. p. (2012). why science is not necessarily self-correcting. perspectives on psychological science, 7(6), 645–654. doi:10.1177/1745691612464056
janak, p. h., & tye, k. m. (2015). from circuits to behaviour in the amygdala. nature, 517(7534), 284–292. doi:10.1038/nature14188
john, l. k., loewenstein, g., & prelec, d. (2012). measuring the prevalence of questionable research practices with incentives for truth telling. psychological science, 23(5), 524–532. doi:10.1177/0956797611430953
kelley, k., & rausch, j. r. (2006). sample size planning for the standardized mean difference: accuracy in parameter estimation via narrow confidence intervals. psychological methods, 11(4), 363–385. doi:10.1037/1082-989x.11.4.363
kim, m. y., & james, l. r. (2015). neurological evidence for the relationship between suppression and aggressive behavior: implications for workplace aggression. applied psychology, 64(2), 286–307. doi:10.1111/apps.12014
klein, r. a., vianello, m., hasselman, f., adams, b. g., adams, r. b., alper, s., . . . lazarević, l. b. (2018). many labs 2: investigating variation in replicability across samples and settings. advances in methods and practices in psychological science, 1(4), 443–490. doi:10.1177/2515245918810225
lang, p., bradley, m., & cuthbert, b. (2008). international affective picture system (iaps): affective ratings of pictures and instruction manual. university of florida.
lee, e. c., whitehead, a. l., jacques, r. m., & julious, s. a. (2014). the statistical interpretation of pilot trials: should significance thresholds be reconsidered? bmc medical research methodology, 14, 1–8. doi:10.1186/1471-2288-14-41
lindebaum, d. (2013). pathologizing the healthy but ineffective: some ethical reflections on using neuroscience in leadership research. journal of management inquiry, 22(3), 295–305. doi:10.1177/1056492612462766
lindebaum, d. (2016). critical essay: building new management theories on sound data? the case of neuroscience. human relations, 69(3), 537–550. doi:10.1177/0018726715599831
lindebaum, d., & zundel, m. (2013). not quite a revolution: scrutinizing organizational neuroscience in leadership studies. human relations, 66(6), 857–877. doi:10.1177/0018726713482151
maccallum, r. c., zhang, s., preacher, k. j., & rucker, d. d. (2002). on the practice of dichotomization of quantitative variables. psychological methods, 7(1), 19–40. doi:10.1037/1082-989x.7.1.19
maxwell, s. e., kelley, k., & rausch, j. r. (2008). sample size planning for statistical power and accuracy in parameter estimation. annual review of psychology, 59, 537–563. doi:10.1146/annurev.psych.59.103006.093735
mcshane, b. b., gal, d., gelman, a., robert, c., & tackett, j. l. (2019). abandon statistical significance. the american statistician, 73(sup1), 235–245. doi:10.1080/00031305.2018.1527253
merton, r. (1973). the sociology of science: theoretical and empirical investigations. university of chicago press.
molenberghs, p., cunnington, r., & mattingley, j. (2012). brain regions with mirror properties: a meta-analysis of 125 human fmri studies. neuroscience & biobehavioral reviews, 36(1), 341–349. doi:10.1016/j.neubiorev.2011.07.004
molenberghs, p., prochilo, g., steffens, n. k., zacher, h., & haslam, s. a. (2017). the neuroscience of inspirational leadership: the importance of collective-oriented language and shared group membership. journal of management, 43(7), 2168–2194. doi:10.1177/0149206314565242
munafò, m. r., nosek, b. a., bishop, d. v. m., button, k. s., chambers, c. d., percie du sert, n., . . . ioannidis, j. p. a. (2017). a manifesto for reproducible science. nature human behaviour, 1(0021), 1–9. doi:10.1038/s41562-016-0021
nelson, l. d., simmons, j., & simonsohn, u. (2018). psychology's renaissance. annual review of psychology, 69(1), 511–534. doi:10.1146/annurev-psych-122216-011836
nichols, t. e., das, s., eickhoff, s. b., evans, a. c., glatard, t., hanke, m., . . . yeo, b. t. t. (2016). best practices in data analysis and sharing in neuroimaging using mri. biorxiv, 1–71. retrieved from https://www.biorxiv.org/node/14915.abstract
nichols, t. e., das, s., eickhoff, s. b., evans, a. c., glatard, t., hanke, m., . . . yeo, b. t. (2017). best practices in data analysis and sharing in neuroimaging using mri. nature neuroscience, 20(3), 299–303. doi:10.1038/nn.4500
nickerson, r. s. (2000). null hypothesis significance testing: a review of an old and continuing controversy. psychological methods, 5(2), 241–301. doi:10.1037/1082-989x.5.2.241
nieuwenhuis, s., forstmann, b. u., & wagenmakers, e.-j. (2011). erroneous analyses of interactions in neuroscience: a problem of significance. nature neuroscience, 14, 1105–1107. doi:10.1038/nn.2886
niven, k., & boorman, l. (2016). assumptions beyond the science: encouraging cautious conclusions about functional magnetic resonance imaging research on organizational behavior. journal of organizational behavior, 37(8), 1150–1177. doi:10.1002/job.2097
nofal, a. m., nicolaou, n., symeonidou, n., & shane, s. (2017). biology and management: a review, critique, and research agenda. journal of management, 44(1), 7–31. doi:10.1177/0149206317720723
nosek, b. a., ebersole, c. r., dehaven, a. c., & mellor, d. t. (2018). the preregistration revolution. proceedings of the national academy of sciences, 115(11), 2600–2606. doi:10.1073/pnas.1708274114
nosek, b. a., & lakens, d. (2014). registered reports: a method to increase the credibility of published results. social psychology, 45(3), 137–141. doi:10.1027/1864-9335/a000192
open science collaboration. (2015). estimating the reproducibility of psychological science. science, 349(6251), aac4716. doi:10.1126/science.aac4716
penny, w. d., & holmes, a. j. (2007). random effects analysis. in k. friston, j. ashburner, s. j. kiebel, t. e. nichols, & w. d. penny (eds.), statistical parametric mapping: the analysis of functional brain images. jordan hill, united kingdom: elsevier science & technology. doi:10.1016/b978-0-12-372560-8.x5000-1
pernet, c., wilcox, r., & rousselet, g. (2013). robust correlation analyses: false positive and power validation using a new open source matlab toolbox. frontiers in psychology, 3(606), 1–18. doi:10.3389/fpsyg.2012.00606
peters, g. y., & crutzen, r. (2017). knowing exactly how effective an intervention, treatment, or manipulation is and ensuring that a study replicates: accuracy in parameter estimation as a partial solution to the replication crisis. psyarxiv preprint, 1–31. retrieved from https://doi.org/10.31234/osf.io/cjsk2
peters, g. y., verboon, p., & green, j. (2018). userfriendlyscience: quantitative analysis made accessible (r package version 0.7.2) [computer program]. retrieved from https://cran.r-project.org/package=userfriendlyscience
peterson, s. j., balthazard, p. a., waldman, d. a., & thatcher, r. w. (2008). neuroscientific implications of psychological capital: are the brains of optimistic, hopeful, confident, and resilient leaders different? organizational dynamics, 37(4), 342–353. doi:10.1016/j.orgdyn.2008.07.007
poldrack, r. a. (2006). can cognitive processes be inferred from neuroimaging data? trends in cognitive sciences, 10(2), 59–63. doi:10.1016/j.tics.2005.12.004
popper, k. (1962). conjectures and refutations: the growth of scientific knowledge. basic books.
rich, b. l., lepine, j. a., & crawford, e. r. (2010). job engagement: antecedents and effects on job performance. academy of management journal, 53(3), 617–635. retrieved from http://amj.aom.org/content/53/3/617.abstract
robbins, t. w. (1996). dissociating executive functions of the prefrontal cortex. philosophical transactions of the royal society of london. series b: biological sciences, 351(1346), 1463–1471. doi:10.1098/rstb.1996.0131
royston, p., altman, d. g., & sauerbrei, w. (2006). dichotomizing continuous predictors in multiple regression: a bad idea. statistics in medicine, 25(1), 127–141. doi:10.1002/sim.2331
senn, s. (2005). dichotomania: an obsessive compulsive disorder that is badly affecting the quality of analysis of pharmaceutical trials. in international statistical institute 55th session.
simmons, j. p., nelson, l. d., & simonsohn, u. (2011). false-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. psychological science, 22(11), 1359–1366. doi:10.1177/0956797611417632
simonsohn, u., nelson, l. d., & simmons, j. p. (2014). p-curve: a key to the file-drawer. journal of experimental psychology: general, 143(2), 534–547. doi:10.1037/a0033242
urigüen, j. a., & garcia-zapirain, b. (2015). eeg artifact removal—state-of-the-art and guidelines. journal of neural engineering, 12(3), 1–43. doi:10.1088/1741-2560/12/3/031001
vanhove, j. (2018). cannonball: tools for teaching statistics (r package version 0.0.0.9000) [computer program]. retrieved from http://janhove.github.io/teaching/2018/09/26/cannonball
veazie, p. j. (2006). when to combine hypotheses and adjust for multiple tests. health services research, 41(3 pt 1), 804–818. doi:10.1111/j.1475-6773.2006.00512.x
waldman, d. a., balthazard, p., & peterson, s. (2011a). leadership and neuroscience: can we revolutionize the way that inspirational leaders are identified and developed? the academy of management perspectives, 25(1), 60–74. doi:10.5465/amp.25.1.60
waldman, d. a., balthazard, p., & peterson, s. (2011b). social cognitive neuroscience and leadership. the leadership quarterly, 1092–1106. doi:10.1016/j.leaqua.2011.09.005
waldman, d. a., & balthazard, p. a. (2015). organizational neuroscience. monographs in leadership and management. emerald group publishing limited. doi:10.1108/s1479-357120150000007017
waldman, d. a., stikic, m., wang, d., korszen, s., & berka, c. (2015). neuroscience and team processes. in organizational neuroscience (chap. 12, vol. 7, pp. 277–294). monographs in leadership and management. emerald group publishing limited. doi:10.1108/s1479-357120150000007012
waldman, d. a., wang, d., & fenters, v. (2016). the added value of neuroscience methods in organizational research. organizational research methods, 1–27. doi:10.1177/1094428116642013
waldman, d. a., wang, d., stikic, m., berka, c., balthazard, p. a., richardson, t., . . . maak, t. (2013a). emergent leadership and team engagement: an application of neuroscience technology and methods. in academy of management annual meeting proceedings (vol. 2013, pp. 632–637). academy of management. doi:10.5465/ambpp.2013.63
waldman, d. a., wang, d., stikic, m., berka, c., balthazard, p. a., richardson, t., . . . maak, t. (2013b). emergent leadership and team engagement: an application of neuroscience technology and methods. researchgate preprint, 1–33. retrieved from https://www.researchgate.net/publication/259678311_emergent_leadership_and_team_engagement_an_application_of_neuroscience_technology_and_methods
waldman, d. a., ward, m., & becker, w. j. (2017). neuroscience in organizational behavior. annual review of organizational psychology and organizational behavior, 4, 425–444. doi:10.1146/annurev-orgpsych-032516-113316
ward, m. k., volk, s., & becker, w. j. (2015). an overview of organizational neuroscience. in d. waldman & p. balthazard (eds.), organizational neuroscience (chap. 2, vol. 7, pp. 17–50). emerald group publishing limited. doi:10.1108/s1479-357120150000007001
wicherts, j. m., veldkamp, c. l. s., augusteijn, h. e. m., bakker, m., van aert, r. c. m., & van assen, m. a. l. m. (2016). degrees of freedom in planning, running, analyzing, and reporting psychological studies: a checklist to avoid p-hacking. frontiers in psychology, 7, 1832. doi:10.3389/fpsyg.2016.01832
woodson, m. i. c. e. (1969). parameter estimation vs. hypothesis testing. philosophy of science, 36(2), 203–204. doi:10.1086/288247
yarkoni, t. (2009). big correlations in little studies: inflated fmri correlations reflect low statistical power—commentary on vul et al. (2009). perspectives on psychological science, 4(3), 294–298. retrieved from http://pps.sagepub.com/content/4/3/294.abstract
zou, g. y. (2007). toward using confidence intervals to compare correlations. psychological methods, 12(4), 399–413. doi:10.1037/1082-989x.12.4.399