Meta-Psychology, 2022, vol 6, MP.2020.2573 https://doi.org/10.15626/MP.2020.2573 Article type: Tutorial Published under the CC-BY4.0 license Open data: Not Applicable Open materials: Yes Open and reproducible analysis: Yes Open reviews and editorial process: Yes Preregistration: No Edited by: Rickard Carlsson Reviewed by: Daniël Lakens, Brenton Wiernik Analysis reproduced by: Lucija Batinović All supplementary files can be accessed at OSF: https://doi.org/10.17605/OSF.IO/3C6HU Designing Studies and Evaluating Research Results: Type M and Type S Errors for Pearson Correlation Coefficient Giulia Bertoldo Department of Developmental Psychology and Socialisation, University of Padova, Padova, Italy Claudio Zandonella Callegher Department of Developmental Psychology and Socialisation, University of Padova, Padova, Italy Gianmarco Altoè Department of Developmental Psychology and Socialisation, University of Padova, Padova, Italy Abstract It is widely appreciated that many studies in psychological science suffer from low statistical power. One of the consequences of analyzing underpowered studies with thresholds of statistical significance is a high risk of finding exaggerated effect size estimates, in the right or the wrong direction. These inferential risks can be directly quan- tified in terms of Type M (magnitude) error and Type S (sign) error, which directly communicate the consequences of design choices on effect size estimation. Given a study design, Type M error is the factor by which a statistically significant effect is on average exaggerated. Type S error is the probability to find a statistically significant result in the opposite direction to the plausible one. Ideally, these errors should be considered during a prospective design analysis in the design phase of a study to determine the appropriate sample size. However, they can also be con- sidered when evaluating studies’ results in a retrospective design analysis. In the present contribution, we aim to facilitate the considerations of these errors in the research practice in psychology. For this reason, we illustrate how to consider Type M and Type S errors in a design analysis using one of the most common effect size measures in psychology: Pearson correlation coefficient. We provide various examples and make the R functions freely available to enable researchers to perform design analysis for their research projects. Keywords: Correlation coefficient, Type M error, Type S error, Design analysis, Effect size https://doi.org/10.15626/MP.2020.2573 https://doi.org/10.17605/OSF.IO/3C6HU 2 Introduction Psychological science is increasingly committed to scrutinizing its published findings by promoting large- scale replication efforts, where the protocol of a previ- ous study is repeated as closely as possible with a new sample (Camerer et al., 2016; Camerer et al., 2018; Ebersole et al., 2016; Klein et al., 2014; Klein et al., 2018; Open Science Collaboration, 2015). Interestingly, many replication studies found smaller effects than orig- inals (Camerer et al., 2018; Open Science Collabora- tion, 2015) and among many possible explanations, one relates to a feature of study design: statistical power. In particular, it is plausible for original studies to have lower statistical power than their replications. In the case of underpowered studies, we are usually aware of the lower probability of detecting an effect if this ex- ists, but the less obvious consequences on effect size estimation are often neglected. 
When underpowered studies are analyzed using thresholds, such as statisti- cal significance levels, effects passing such thresholds have to exaggerate the true effect size (Button et al., 2013; Gelman et al., 2017; Ioannidis, 2008; Ioannidis et al., 2013; Lane & Dunlap, 1978). Indeed, as will be extensively shown below, in underpowered studies only large effects correspond to values that can reject the null hypothesis and be statistically significant. As a consequence, if the original study was underpowered and found an exaggerated estimate of the effect, the replication effect will likely be smaller. The concept of statistical power finds its natural de- velopment in the Neyman-Pearson framework of statis- tical inference and this is the framework that we adopt in this contribution. Contrary to the Null Hypothesis Significance Testing (NHST), the Neyman-Pearson ap- proach requires to define both the Null Hypothesis (i.e., usually, but not necessarily, the absence of an effect) and the Alternative Hypothesis (i.e., the magnitude of the expected effect). Further discussion on the Ney- man and Pearson approach and a comparison with the NHST is available in Altoè et al. (2020) and Gigerenzer et al. (2004). When conducting hypothesis testing, we usually consider two inferential risks: the Type I error (i.e., the probability α of rejecting the Null Hypothe- sis if this is true) and the Type II error (i.e., the prob- ability β of not rejecting the Null Hypothesis if this is false). Then, statistical power is defined as the proba- bility 1-β of finding a statistically significant result if the Alternative Hypothesis is true. All this leads to a nar- row focus on statistical significance in hypothesis test- ing, overlooking another important aspect of statistical inference, namely, the effect size estimation. When effect size estimation is conditioned on the sta- tistical significance (i.e., effect estimates are evaluated only if their p-values are lower than α), effect size ex- aggeration is a corollary consequence of low statistical power that might not be evident at first. This point can be highlighted considering the Type M (magnitude) and Type S (sign) errors characterizing a study design (Gel- man & Carlin, 2014). Given a study design (i.e., sample size, statistical test directionality, α level and plausible effect size formalization), Type M error, also known as Exaggeration Ratio, indicates the factor by which a sta- tistically significant effect would be, on average, exag- gerated. Type S error indicates the probability to find a statistically significant effect in the opposite direction to the one considered plausible. The analysis that re- searchers perform to evaluate the Type M and Type S errors in their research practice is called design analy- sis, given the special focus posed into considering the design of a study (Altoè et al., 2020; Gelman & Carlin, 2014). Both errors are defined starting from a reasoned guess on the plausible magnitude and direction of the effect under study, which is called plausible effect size (Gelman & Carlin, 2014). A plausible effect size is an assumption the researchers make about which is the expected effect in the population. This should not be based on some noisy results from a pilot study but, rather, it could derive from an extensive evaluation of the literature (e.g., theoretical or literature reviews and meta-analyses). 
When considering the published liter- ature to define the plausible effect size, however, it is important to take into account the presence of publica- tion bias (Franco et al., 2014) and consider techniques for adjusting for the possible inflation of effect size esti- mates (Anderson et al., 2017). For example if, after tak- ing into account possible inflations, all the main results in a given topic, considering a specific experimental de- sign indicate that the correlation ranges between r = .15 and r = .25, we could reasonably choose as plausible ef- fect size a value within this range. Or even better, we could consider multiple values to evaluate the results in different scenarios. Note that the definition of the plau- sible effect size is inevitably highly context-dependent so any attempt to provide reference values would not be useful, instead, it would prevent researchers from rea- soning about the phenomenon of interest. Even in ex- treme cases where no previous information is available, which would question the exploratory/confirmatory na- ture of the study, researchers could still evaluate which effect would be considered relevant (e.g., from a clinical or economic perspective) and define the plausible effect size according to it. Why do these errors matter? The concepts of Type M and Type S errors allow enhancing researchers’ aware- ness of a complex process such as statistical inference. 3 Strictly speaking, Design Analysis used in the design phase of a study provides similar information as the classical power analysis, indeed, to a given level of power there is a corresponding Type M and Type S errors. However, it is a valuable conceptual frame- work that can help researchers to understand the im- portant role of statistical power both when designing a new study or when evaluating previous results from the literature. In particular, it highlights the unwanted (and often overlooked) consequences on effect estima- tion when filtering for statistical significance in under- powered studies. In these scenarios, there is not only a lower probability of rejecting the null when it is ac- tually false but, even more importantly, any significant result would most likely lead to a misleading overes- timation of the actual effect. The exaggeration of ef- fect sizes, in the right or the wrong direction, has im- portant implications on a theoretical and applied level. On a theoretical level, studies’ designs with high Type M and Type S errors can foster distorted expectations on the effect under study, triggering a vicious cycle for the planning of future studies. This point is relevant also for the design of replication studies, which could turn out to be underpowered if they do not take into account possible inflations of the original effect (Button et al., 2013). When studies are used to inform policy- making and real-world interventions, implications can go beyond the academic research community and can impact society at large. In these settings, we could assist to a “hype and disappointment cycle” (Gelman, 2019b), where true effects turn out to be much less impressive than expected. This can produce undesirable conse- quences on people’s lives, a consideration that invites researchers to assume responsibility in effectively com- municating the risks related to effects quantification. 
To our knowledge, Type M (magnitude) and Type S (sign) errors are not widely known in the psychological research community, but their consideration during the research process has the potential to improve current research practices, for example, by increasing awareness of the influence that design choices have on possible study results. In a previous work, we illustrated Type M and Type S errors using Cohen's d as a measure of effect size (Altoè et al., 2020). The purpose of the present contribution is to further increase familiarity with Type M and Type S errors by considering another common effect size measure in psychology: the Pearson correlation coefficient, ρ. We aim to provide an accessible introduction to the design analysis framework and enhance the understanding of Type M and Type S errors using several educational examples. The rest of this article is organized as follows: introduction to Type M and Type S errors; description of what a design analysis is and how to conduct one; analysis of Type S and Type M errors when varying alpha levels and hypothesis directionality. Moreover, the two appendices present further implications of design analysis for Pearson correlation (Appendix A) and an extensive illustration of the R functions for design analysis for Pearson correlation (Appendix B).

Type M and Type S errors

The Pearson correlation coefficient is a standardized effect size measure indicating the strength and the direction of the relationship between two continuous variables (Cohen, 1988; Ellis, 2010). Even though the correlation coefficient is widely known, we briefly go over its main features using an example. Imagine that we were interested in measuring the relationship between anxiety and depression in a population and we plan a study with n participants, where, for each participant, we measure the level of anxiety (i.e., variable X) and the level of depression (i.e., variable Y). At the end of the study, we will have n pairs of values X and Y. The correlation coefficient helps us answer the questions: how strong is the linear relationship between anxiety and depression in this population? Is the relationship positive or negative? Correlation ranges from -1 to +1, indicating respectively two extreme scenarios of perfect negative relationship and perfect positive relationship.1 The correlation coefficient is a dimensionless number: it is a signal-to-noise ratio where the signal is given by the covariance between the two variables (cov(x, y)) and the noise is expressed by the product of the standard deviations of the two variables (S_x S_y; see Formula 1). In this contribution, following conventional standards, we will use the symbol ρ to indicate the correlation in the population and the symbol r to indicate the value measured in a sample.

r = cov(x, y) / (S_x S_y).    (1)

Magnitude and sign are two important features characterizing the Pearson correlation coefficient and effect size measures in general, and it is precisely with respect to these two aspects that errors can be committed when estimating effect sizes. Gelman and Carlin (2014) introduced two indexes to quantify these risks:

• Type M error, where M stands for magnitude, is also called Exaggeration Ratio: the factor by which a statistically significant effect is on average exaggerated.

1 Correlation indicates a relationship between variables but does not imply causation. We do not discuss this relevant aspect here, but we refer the interested reader to Rohrer (2018).
• Type S error, where S stands for sign: the probability of finding a statistically significant result in the opposite direction to the plausible one.

Note that, unlike the other inferential errors, Type M error is not a probability but rather a ratio indicating the average percentage of inflation.

How are these errors computed? In the next paragraphs, we approach this question from an intuitive perspective. For a formal definition of these errors, we refer the reader to Altoè et al. (2020), Gelman and Carlin (2014), and Lu et al. (2018). Take as an example the previous fictitious study on the relationship between anxiety and depression and imagine we decide to sample 50 individuals (sample size, n = 50), to set the α level to 5%, and to perform a two-tailed test. Based on theoretical considerations, we expect the true correlation in the population to be plausibly quite strong and positive, which we formalize as ρ = .50. To evaluate the Type M and Type S errors in this research design, imagine repeating the same study many times with new samples drawn from the same population and, for each study, recording the observed correlation (r) and the corresponding p-value.

The first step to compute Type M error is to select only the observed correlation coefficients that are statistically significant, take them in absolute value (for the moment, we do not care about the sign), and calculate their mean. Type M error is given by the ratio between this mean (i.e., the mean of the statistically significant correlation coefficients in absolute value) and the plausible effect hypothesized at the beginning, which in this example is ρ = .50. Thus, given a study design, Type M error tells us the average overestimation of an effect that is statistically significant.

Type S error is computed as the proportion of statistically significant results that have the opposite sign compared to the plausible effect size. In the present example we hypothesized a positive relationship, specifically ρ = .50. Then, Type S error is the ratio between the number of times we observed a negative statistically significant result and the total number of statistically significant results. In other words, Type S error indicates the probability of obtaining a statistically significant result in the opposite direction to the one hypothesized.

The central and possibly most difficult point in this process is reasoning on what could be the plausible magnitude and direction of the effect of interest. This critical process, which is central also in traditional power analysis, is an opportunity for researchers to aggregate, formalize, and incorporate prior information on the phenomenon under investigation (Gelman & Carlin, 2014). What is plausible can be determined on theoretical grounds, using expert knowledge elicitation techniques (see for example O'Hagan, 2019) and consulting literature reviews and meta-analyses, always taking into account the presence of effect size inflation in the published literature (Anderson, 2019). Given these premises, it is important to stress that a plausible effect size should not be determined by considering the results of a single study, given the high level of uncertainty associated with this effect size estimate. The idea is that the plausible effect size should approximate the true effect, which - although never known - can be thought of as "that which would be observed in a hypothetical infinitely large sample" (Gelman & Carlin, 2014, p. 642).
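The repeated-sampling reasoning just described can be sketched directly in R. The code below is a minimal illustration, not the implementation behind the functions presented later in this paper: it assumes a bivariate normal population, uses MASS::mvrnorm() to draw the samples, and the number of simulated studies and the seed are arbitrary choices.

# Minimal simulation sketch of Type M and Type S errors for the example above
# (assumed scenario: rho = .50, n = 50, alpha = .05, two-tailed test).
set.seed(2020)                        # arbitrary seed, for reproducibility
rho <- .50; n <- 50; alpha <- .05
n_sim <- 1e4                          # number of hypothetical replications
sigma <- matrix(c(1, rho, rho, 1), nrow = 2)   # population correlation matrix
r_obs <- p_obs <- numeric(n_sim)
for (i in seq_len(n_sim)) {
  xy <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = sigma)   # draw a new sample
  test <- cor.test(xy[, 1], xy[, 2], alternative = "two.sided")
  r_obs[i] <- test$estimate
  p_obs[i] <- test$p.value
}
sig <- p_obs < alpha                             # keep only significant studies
type_m <- mean(abs(r_obs[sig])) / rho            # exaggeration ratio (Type M)
type_s <- mean(r_obs[sig] < 0)                   # wrong-sign proportion (Type S)
power  <- mean(sig)                              # proportion of significant studies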
For a more exhaustive description of plausible effect size, we refer the interested reader to Altoè et al. (2020) and Gelman and Carlin (2014). Before we proceed, it is worth noting that there are other recent valuable tools that start from different premises for designing and evaluating studies. Among others, we refer the interested reader to methods which start from the definition of the smallest effect size of interest (SESOI; for a tutorial, see Lakens, Scheel, et al., 2018). Design Analysis Researchers can consider Type M and Type S errors in their practice by performing a design analysis (Altoè et al., 2020; Gelman & Carlin, 2014). Ideally, a design analysis should be performed when designing a study. In this phase, it is specifically called prospective design analysis and it can be used as a sample size planning strategy where statistical power is considered together with Type M and Type S errors. However, design analy- sis can also be beneficial to evaluate the inferential risks in studies that have already been conducted and where the study design is known. In these cases, Type M and Type S errors can support results interpretation by com- municating the inferential risks in that research design. When design analysis happens at this later stage, it takes the name of retrospective design analysis. Note that ret- rospective design analysis should not be confused with post-hoc power analysis. A retrospective design analy- sis defines the plausible effect size according to previous results in the literature or other information external to the study, whereas the post-hoc power analysis defines the plausible effect size based on the observed results in the study and it is a widely-deprecated practice (Gel- man, 2019a; Goodman & Berlin, 1994). In the following sections, we illustrate how to per- form prospective and retrospective design analysis us- 5 ing some examples. We developed two R functions2 to perform design analysis for Pearson correlation, which are available at the page https://osf.io/9q5fr/. The function to perform a prospective design analysis is pro_r(). It requires as input the plausible effect size (rho), the statistical power (power), the directional- ity of the test (alternative) which can be set as: “two.sided”, “less” or “greater”. Type I error rate (sig_level) is set as default at 5% and can be changed by the user. The pro_r() function returns the necessary sample size to achieve the desired statistical power, Type M error rate, the Type S error probability, and the criti- cal value(s) above which a statistically significant result can be found. The function to perform retrospective de- sign analysis is retro_r(). It requires as input the plau- sible effect size, the sample size used in the study, and the directionality of the test that was performed. Also in this case, Type I error rate is set as default at 5% and can be changed by the user. The function retro_r() returns the Type M error rate, the Type S error probabil- ity, and the critical value(s)3. For further details regard- ing the R functions refer to Appendix B. All code and materials are also available in a CodeOcean Capsule at https://codeocean.com/capsule/7935517. Case Study To familiarize the reader with Type M and Type S er- rors, we start our discussion with a retrospective de- sign analysis of a published study. However, the ideal temporal sequence in the research process would be to perform a prospective design analysis in the planning stage of a research project. 
This is the time when the design is being laid out and useful improvements can be made to obtain more robust results. In this contri- bution, the order of presentation aims first, to provide an understanding of how to interpret Type M and Type S errors, and then discuss how they could be taken into account. The following case study was chosen for illus- trative purposes only and, by no means our objective is to judge the study beyond illustrating an application of how to calculate Type M and Type S errors on a pub- lished study. We consider the study published in Science by Eisen- berger et al. (2003) entitled: “Does Rejection Hurt? An fMRI Study of Social Exclusion”. The research question originated from the observation that the Anterior Cin- gulate Cortex (ACC) is a region of the brain known to be involved in the experience of physical pain. Could pain from social stimuli, such as social exclusion, share similar neural underpinnings? To test this hypothesis, 13 participants were recruited and each one had to play a virtual game with other two players while undergo- ing functional Magnetic Resonance Imaging (fMRI). The other two players were fictitious, and participants were actually playing against a computer program. Players had to toss a virtual ball among each other in three con- ditions: social inclusion, explicit social exclusion and implicit social exclusion. In the social inclusion condi- tion, the participant regularly received the ball. In the explicit social exclusion condition the participant was told that, due to technical problems, he was not going to play that round. In the implicit social exclusion con- dition, the participant experienced being intentionally left out from the game by the other two players. At the end of the experiment, each participant completed a self-report measure regarding their perceived distress when they were intentionally left out by the other play- ers. Considering only the implicit social exclusion con- dition, a correlation coefficient was estimated between the measure of distress and neural activity in the Ante- rior Cingulate Cortex. As suggested by the large and sta- tistically significant correlation coefficient between per- ceived distress and activity in the ACC, r = .88, p < .005 (Eisenberger et al., 2003, p. 291), authors concluded that social and physical pain seem to share similar neu- ral underpinnings. Before proceeding to the retrospective design analy- sis, we refer the interested reader to some background history regarding this study. This was one of the many studies included in the famous paper “Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition” (Vul et al., 2009) which raised im- portant issues regarding the analysis of neuroscientific data. In particular, this paper noted that the magnitude of correlation coefficients between fMRI measures and behavioural measures were beyond what could be con- sidered plausible. We refer the interested reader also to the commentary by Yarkoni (2009), who noted that the implausibly high correlations in fMRI studies could be largely explained by the low statistical power of experi- ments. A retrospective design analysis should start with thor- ough reasoning on the plausible size and direction of the effect under study. To produce valid inferences, a lot of attention should be devoted to this point by integrat- ing external information. 
For the sake of this example, we turn to the considerations made by Vul and Pashler (2017), who suggested that correlations between personality measures and neural activity are likely to be around ρ = .25. A correlation of ρ = .50 was deemed plausible but optimistic, and a correlation of ρ = .75 was considered theoretically plausible but unrealistic.

Retrospective Design Analysis

To perform a retrospective design analysis on the case study, we need information on the research design and the plausible effect size. Based on the previous considerations, we set the plausible effect size to be ρ = .25. Information on the sample size was not available in the original study (Eisenberger et al., 2003) and was retrieved from Vul et al. (2009) to be n = 13. The α level and the directionality of the test were not reported in the original study, so for the purpose of this example, we will consider α = .05 and a two-tailed test.

Given this study design, what are the inferential risks in terms of effect size estimation? We can use the R function retro_r(), whose inputs and outputs are displayed in Figure 1. In this study, the statistical power is .13, that is to say, there is a 13% probability of rejecting the null hypothesis if an effect of at least ρ = |.25| exists. Consider this point together with the results obtained in the experiment: r = .88, p < .005 (Eisenberger et al., 2003, p. 291). It is clear that, even though the probability of rejecting the null hypothesis is low (power of 13%), this event could happen. And when it does happen, it is tempting to believe that the results are even more remarkable (Gelman & Loken, 2014). However, this design comes with serious inferential risks for the estimation of effect sizes, which can be grasped by presenting Type M and Type S errors. A glance at their values communicates that it is not impossible to find a statistically significant result, but when it does happen, the effect size could be largely overestimated - Type M = 2.58 - and maybe even in the wrong direction - Type S = .03. The Type M error of 2.58 indicates that a statistically significant correlation is on average about two and a half times the plausible value. In other words, statistically significant results emerging in such a research design will on average overestimate the plausible correlation coefficient by about 160%. The Type S error of .03 suggests that there is a three percent probability of finding a statistically significant result in the opposite direction, in this example, a negative relationship. In this research design, the critical values above which a statistically significant result is declared correspond to r = ±.55 (Figure 1).

2 An R package was subsequently developed and is now available on CRAN, PRDA: Conduct a Prospective or Retrospective Design Analysis, https://cran.r-project.org/web/packages/PRDA/index.html. PRDA contains other features of design analysis that are beyond the aim of the present paper.

3 Critical value is the name usually employed in hypothesis testing within the Neyman-Pearson framework. In research practice, this is also known as the Minimal Statistically Detectable Effect (Cook et al., 2014; Phillips et al., 2001).
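For concreteness, the retrospective design analysis summarized in Figure 1 corresponds to a call of roughly the following form. This is a sketch: the argument names rho, alternative, sig_level, and seed follow the description given above, while the name of the sample-size argument (n) and the seed value are our assumptions; see Appendix B and the OSF material for the actual interface.

retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .05, seed = 2020)
# Expected output, up to simulation error (cf. Figure 1): power ≈ .13,
# Type M ≈ 2.58, Type S ≈ .03, critical values r ≈ ±.55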
The critical values r = ±.55 are highlighted in Figure 2 as the vertical lines in the sampling distribution of correlation coefficients under the null hypothesis. Notice that the plausible effect size lies in the region of acceptance of the null hypothesis. Therefore, it is impossible to simultaneously find a statistically significant result and estimate an effect close to the plausible one (ρ = .25). The figure represents the so-called Winner's curse: "the 'lucky' scientist who makes a discovery is cursed by finding an inflated estimate of that effect" (Button et al., 2013).

Figure 1. Input and output of the function retro_r() for retrospective design analysis. Case study: Eisenberger et al. (2003). The plausible correlation coefficient is ρ = .25, the sample size is 13, and the statistical test is two-tailed. The option seed allows setting the random number generator to obtain reproducible results.

Figure 2. Winner's curse. H0 = Null Hypothesis, H1 = Alternative Hypothesis. Once the sample size, the directionality of the test, and the Type I error probability are set, the smallest effect size above which it is possible to find a statistically significant result is also set. In this case, the plausible effect size, ρ = .25, lies in the region where it is not possible to reject H0 (the region delimited by the two vertical lines). Thus, it is impossible to simultaneously find a statistically significant result and an effect close to the plausible one. In other words, a statistically significant effect must exaggerate the plausible effect size.

Prospective Design Analysis

Ideally, Type M and Type S errors should be considered in the design phase of a study during the decision-making process regarding the experimental protocol. At this stage, prospective design analysis can be used as a sample size planning strategy which aims to minimize Type M and Type S errors in the upcoming study.

Imagine that we were part of the research team in the previous case study exploring the relationship between activity in the Anterior Cingulate Cortex and perceived distress. When drafting the research protocol, we face the inevitable discussion on how many participants we are going to recruit. This choice depends on available resources, the type of study design, constraints of various nature and, importantly, the plausible magnitude and direction of the phenomenon that we are going to study. As previously mentioned, deciding on a plausible effect size is a fundamental step which requires great effort and should not be done by trying different values only to obtain a more desirable sample size. Instead, proposing a plausible effect size is where the expert knowledge of the researcher can be formalized and can greatly contribute to the informativeness of the study that is being planned. For the sake of these examples, we adopt the previous considerations and suppose that common agreement is reached on a plausible correlation coefficient of around ρ = .25. Finally, we would like to leave open the possibility of exploring whether the relationship goes in the opposite direction to the one hypothesized, so we decide to perform a two-tailed test.

We can implement the prospective design analysis using the function pro_r(), whose inputs and outputs are displayed in Figure 3. About 125 participants are necessary to have an 80% probability of detecting an effect of at least ρ = ±.25 if it actually exists. With this sample size, the Type S error is minimized and approximates zero.
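The call corresponding to Figure 3 might look as follows. This is a sketch: the argument names rho, power, alternative, sig_level, and seed are those described earlier, while the seed value is an arbitrary choice made here only for reproducibility.

pro_r(rho = .25, power = .80, alternative = "two.sided", sig_level = .05, seed = 2020)
# Expected output, up to simulation error (cf. Figure 3): required sample size ≈ 125,
# Type M ≈ 1.11, Type S ≈ 0, critical values r ≈ ±.18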
In this study design, the Type M error is 1.11, indicating that statistically significant results are on average exaggerated by 11%. It is possible to notice that the critical values are r = ±.18, further highlighting that our plausible effect size is actually included among the values that lead to rejecting the null hypothesis.

Figure 3. Input and output of the function pro_r() for prospective design analysis. The plausible correlation coefficient is ρ = .25, statistical power is 80%, and the statistical test is two-tailed. The option seed allows setting the random number generator to obtain reproducible results.

In a design analysis, it is advisable to investigate how the inferential risks would change according to different scenarios in terms of statistical power and plausible effect size. Changes in both these factors impact Type M and Type S errors. For example, maintaining the plausible correlation of ρ = .25, if we decrease statistical power from .80 to .60, only 76 participants are required (see Table 1). However, this is associated with an increase in the Type M error from 1.11 to 1.28. That is to say, with 76 subjects the plausible effect size will on average be overestimated by 28%. Alternatively, imagine that we would like to maintain a statistical power of 80%: what happens if the plausible effect size is slightly larger or smaller? The necessary sample size would spike to 344 for ρ = .15 and decrease to 60 for ρ = .35. In both scenarios, the Type M error remains about 1.12, which reflects the more general point that, for 80% power, the Type M error is around 1.10. In all these scenarios, the Type S error is close to zero, hence not worrisome.

Table 1
Prospective design analysis in different scenarios of plausible effect size and statistical power.

ρ       Power   Sample Size   Type M   Type S   Critical r value
0.25    0.6     76            1.280    0        ±0.226
0.15    0.8     344           1.116    0        ±0.106
0.35    0.8     60            1.115    0        ±0.254

Note: In all cases, alternative = "two.sided" and sig_level = .05.

Figure 4. How Type M, Type S, and statistical power vary as a function of sample size in three different scenarios of plausible effect size (ρ = .25, ρ = .50, ρ = .75). Note that, for the sake of interpretability, we decided to use different scales for both the x-axis and the y-axis in the three scenarios of plausible effect size.

For completeness, Figure 4 summarizes the relationship between statistical power, Type M and Type S errors as a function of sample size in three scenarios of plausible correlation coefficients. We display the three values that Vul and Pashler (2017) considered for correlations between fMRI measures and behavioural measures with different degrees of plausibility. An effect of ρ = .75 was deemed theoretically plausible but unrealistic, ρ = .50 was more plausible but optimistic, and ρ = .25 was more likely. The curves illustrate a general point: Type M and Type S errors increase with smaller sample sizes, smaller plausible effect sizes, and lower statistical power.
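The scenarios of Table 1 can be explored by simply looping over the corresponding inputs, as in the sketch below. The use of print() assumes that pro_r() returns its results rather than printing them itself; adjust as needed for the actual interface described in Appendix B.

# Reproduce the three scenarios of Table 1 (two-tailed test, sig_level = .05).
scenarios <- data.frame(rho = c(.25, .15, .35), power = c(.60, .80, .80))
for (i in seq_len(nrow(scenarios))) {
  print(pro_r(rho = scenarios$rho[i], power = scenarios$power[i],
              alternative = "two.sided", sig_level = .05, seed = 2020))
}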
Also, the figure shows that statistical power, Type M and Type S errors are related to each other: as power increases, Type M and Type S errors decrease.

At first, it might seem that Type M and Type S errors are redundant with the information provided by statistical power. Even though they are related, we believe that Type M and Type S errors bring added value during the design phase of a research protocol because they facilitate a connection between how a study is planned and how results will actually be evaluated. That is to say, final results will comprise a test statistic with an associated p-value and an effect size measure. If the interest is in maximizing the accuracy with which effects will be estimated, then Type M and Type S errors directly communicate the consequences of design choices on effect size estimation.

Varying α levels and Hypotheses Directionality

So far, we did not discuss two other important decisions that researchers have to take when designing a study: the statistical significance threshold, or α level, and the directionality of the statistical test, one-tailed or two-tailed. In this section, we illustrate how different choices regarding these aspects impact Type M and Type S errors.

A lot has been written regarding the automatic adoption of a conventional α level of 5% (e.g., Gigerenzer et al., 2004; Lakens, Adolfi, et al., 2018). This practice is increasingly discouraged, and researchers are invited to think about the best trade-off between α level and statistical power, considering the aim of the study and the available resources. The α level impacts Type M and Type S errors as much as it impacts statistical power. Everything else being equal, Type M error increases with decreasing α level (i.e., a negative relationship), whereas Type S error decreases with decreasing α level (i.e., a positive relationship). To further illustrate the relation between Type M error and α level, let us take as an example the previous case study with a sample of 13 participants, a plausible effect size of ρ = .25, and a two-tailed test. Table 2 shows that by lowering the α level from 10% to 0.1%, the critical values move from r = ±.48 to r = ±.80. This suggests that, with these new higher thresholds, the exaggeration of effects will be even more pronounced, because effects have to be even larger to pass such higher critical values. Instead, the relationship between Type S error and α level can be clarified by noting that, by lowering the statistical significance threshold, we are being more conservative about falsely rejecting the null hypothesis in general, which implies that we are also being more conservative about falsely rejecting it in the wrong direction.

Table 2
How changes in α level impact Power, Type M error, Type S error, and critical values.

α-level   Power   Type M   Type S   Critical r value
0.100     0.212   2.369    0.040    ±0.476
0.050     0.127   2.583    0.028    ±0.553
0.010     0.035   2.977    0.011    ±0.684
0.005     0.021   3.088    0.014    ±0.726
0.001     0.005   3.340    0.000    ±0.801

Note: In all cases, ρ = .25, n = 13, and alternative = "two.sided".
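The α-level scenarios of Table 2 can be reproduced with the retrospective function by varying only the sig_level argument, as sketched below. As before, the sample-size argument name (n), the seed value, and the use of print() are assumptions about the interface.

# Retrospective design analysis for rho = .25, n = 13 at different α levels (Table 2).
for (alpha in c(.100, .050, .010, .005, .001)) {
  print(retro_r(rho = .25, n = 13, alternative = "two.sided",
                sig_level = alpha, seed = 2020))
}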
Another important choice in study design is the directionality of the test (i.e., one-tailed or two-tailed). Design analysis invites reasoning on the plausible effect size and hypothesizing the direction of the effect, not only its magnitude. So why should a researcher perform a non-directional statistical test when there is a hypothesized direction? Performing a two-tailed test leaves open the possibility of finding an unexpected result in the opposite direction (Cohen, 1988), a possibility which may be of special interest for preliminary exploratory studies. However, in more advanced stages of a research program (i.e., confirmatory studies), directional hypotheses benefit from higher statistical power and lower Type M error rates (Figure 5).

Figure 5. Comparison of Type M error rate and power level between a one-tailed and a two-tailed test with ρ = .25 and α = .05. n = sample size.

As an example, let us consider the differences between a two-tailed test and a one-tailed test in the previous case study. We can perform a new prospective design analysis (Figure 6) with a plausible correlation of ρ = .25 and 80% statistical power, but this time setting the argument alternative in the R function to "greater". A comparison of the two prospective design analyses, Figure 3 and Figure 6, suggests that the same Type M error rate of about 10% is guaranteed with 94 participants, instead of the 125 subjects necessary with a two-tailed test. Note that a Type S error is not possible in directional statistical tests. Indeed, all statistically significant results are obtainable only in the hypothesized direction, not the opposite one.

Figure 6. Input and output of the function pro_r() for prospective design analysis. The plausible correlation coefficient is ρ = .25, statistical power is 80%, and the statistical test is one-tailed.

Valid conclusions require decisions on test directionality and α level to be taken a priori, not while data are being analyzed (Cohen, 1988). These decisions can take place during a prospective design analysis, which aligns with the increasing interest in psychological science in transparently communicating and justifying design choices through the preregistration of studies in public repositories (e.g., the Open Science Framework; AsPredicted.com). Preregistration of a study's protocol is particularly valuable for researchers endorsing an error statistics philosophy of science, where the evaluation of research results takes into account the severity with which claims are tested (Lakens, 2019; Mayo, 2018). Severity depends on the degree to which a research protocol tries to falsify a claim. For example, a one-tailed statistical test provides greater severity than a two-tailed statistical test. As noted by Lakens (2019), preregistration is important to openly share a priori decisions, such as test directionality, providing valuable information for researchers interested in evaluating the severity of research claims.

Publication Bias and Significance Filter

On a concluding note, we would like to clarify the relationship of design analysis with publication bias and the statistical significance filter. While publication bias and Type M and Type S errors are related, they operate at two different levels. Publication bias refers to a publication system that favours statistically significant results over non-statistically significant findings. This phenomenon alone cannot explain the presence of exaggerated effects. Imagine if all studies in the literature were conducted with high statistical power; then statistically significant findings would probably not be so extreme. The problem of exaggerated effect sizes in the literature can be explained only by a combination of publication bias and studies with low statistical power. As previously shown, statistical power and Type M and Type S errors are related to each other: low statistical power corresponds to higher Type M and Type S errors.
The critical element is the application of the statistical significance filter without taking into account statistical power. Design analysis per se does not solve this issue; instead, it allows us to recognize its problematic consequences. In the same way as statistical power is a characteristic of a study design, so are Type M and Type S errors; however, the two are qualitatively different in terms of the kind of reasoning they favour. Statistical power is defined in terms of the probability of rejecting the null hypothesis and, even though this is based on an effect size of interest, the relationship "low power - high possibility of exaggeration" may not be straightforward for everyone. Instead, Type M and Type S errors directly quantify the possible exaggeration. Furthermore, their consideration protects against another possible pitfall. When a statistically significant result is found in a study and the associated effect size estimate is large, the finding could be interpreted as robust and impressive. However, this interpretation is not always appropriate. Here, the missing piece of information is statistical power. If power is considered, researchers would realize that a large effect was found in a context where there was a low probability of finding it. But even this interpretation does not explicitly state an important aspect: in these conditions, the only way to find a statistically significant result is by overestimating the true effect. By contrast, this consequence becomes immediately clear once Type M and Type S errors are considered retrospectively. Similarly, considering Type M and Type S errors prospectively favours reasoning in terms of effect size, rather than the probability of rejecting the null hypothesis, when setting the sample size in a design analysis.

Discussion and Conclusion

In the scientific community, the idea that the literature is affected by a problem of effect size exaggeration is quite widespread. This issue is usually explained in terms of studies' low statistical power combined with the use of thresholds of statistical significance (Button et al., 2013; Ioannidis, 2008; Ioannidis et al., 2013; Lane & Dunlap, 1978; Yarkoni, 2009; Young et al., 2008). Statistically significant results can be obtained even in underpowered studies, and it is precisely in these cases that we should worry the most about issues of overestimation. Type M and Type S errors quantify and highlight the inferential risks directly in terms of effect size estimation, risks which are implied by the concept of statistical power but might not be recognizable outright. So far, only a handful of papers have explicitly mentioned Type M and Type S errors (Altoè et al., 2020; Gelman, 2018; Gelman & Carlin, 2013, 2014; Gelman et al., 2017; Gelman & Tuerlinckx, 2000; Lu et al., 2018; Vasishth et al., 2018). With the broader goal of facilitating their consideration in psychological science, in the present contribution we illustrated how Type M and Type S errors are considered in a design analysis using one of the most common effect size measures in psychology, the Pearson correlation coefficient.
Peculiar to design analysis is the focus on the im- plications of design choices on effect sizes estimation rather than statistical significance only. We illustrated how Type M and Type S errors can be taken into ac- count with a prospective design analysis. In the planning stage of a research project, design analysis has the po- tential to increase researchers’ awareness of the conse- quences that their sample size choices have on uncer- tainty about final estimates of the effects. This favours reasoning in similar terms to those in which results will be evaluated, that is to say, effect size estimation. But understanding the inferential risks in a study design is also beneficial once results are obtained. We presented retrospective design analysis on a published study, and the same process can be useful for studies in general, especially those ending without the necessary sample size to maximize statistical power and minimize Type M and Type S errors. In all cases, presenting their values effectively communicates the uncertainty of the results. In particular, Type M and Type S errors put a red flag when results are statistically significant, but the effect size could be largely overestimated and in the wrong direction. Finally, both prospective and retrospective design analysis favours cumulative science encouraging the incorporation of expert knowledge in the definition of the plausible effect sizes. It is important to remark that even if Design Analysis is based on the definition of a plausible effect size, a best practice should be to conduct multiple Design Analyses by considering different scenarios which include differ- ent plausible effect sizes and levels of power to max- imize the informativeness of both a prospective and a retrospective analysis. To make design analysis accessible to the research community, we provide the R functions to perform prospective design analysis and retrospective design analysis for Pearson correlation coefficient https://osf. io/9q5fr/ together with a short guide on how to use the R functions and a summary of the examples presented in this contribution (Appendix B). Finally, prospective design analysis could contribute to better research design, however many other impor- tant factors were not considered in this contribution. For example, the validity and reliability of measure- ments should be at the forefront in research design, and careful planning of the entire research protocol is of ut- most importance. Future works could tackle some of these shortcomings for example, including an analysis of the quality of measurement on the estimates of Type M and Type S errors. Also, we believe that it would be valuable to provide extension of design analysis for other common effect size measures with the develop- ment of statistical software packages that could be di- rectly used by researchers. Moreover, design analysis on Pearson correlation can be easily extended to the multi- variate case where multiple predictors are considered. Lastly, design analysis is not limited to the Neyman- Pearson framework but can be considered also within other statistical approaches such as Bayesian approach. Future works could implement design analysis to evalu- ate the inferential risks related to the use of Bayes Fac- tors and Bayesian Credibility Intervals. Summarizing, choices regarding studies’ design im- pact effect size estimation and Type M (magnitude) er- ror and Type S (sign) error allow to directly quantify these inferential risks. 
Their consideration in a prospec- tive design analysis increases awareness of what are the consequences of sample size choice reasoning in similar terms to those used in results evaluation. Instead, ret- rospective design analysis provides further guidance on interpreting research results. More broadly, design anal- ysis reminds researchers that statistical inference should start before data collection and does not end when re- sults are obtained. Author Contact Giulia Bertoldo: 0000-0002-6960-3980 Claudio Zan- donella Callegher:0000-0001-7721-6318 Gianmarco Altoè: 0000-0003-1154-9528 Corresponding author: Gianmarco Altoè, Depart- ment of Developmental Psychology and Socialization, University of Padova, Via Venezia 8, 35131 Padova, Italy gianmarco.altoe@unipd.it Conflict of Interest and Funding We have no known conflict of interest to disclose. Author Contributions GB and GA conceived the original idea. GB drafted the paper. CZ contributed to the development of the original idea and drafted sections of the manuscript. CZ and GA wrote the R functions. All authors took care of the statistical analyses and contributed to the https://osf.io/9q5fr/ https://osf.io/9q5fr/ 12 manuscript revision, read, and approved the submitted version. Open Science Practices This article earned the Open Materials badge for making the data and materials openly available. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement. References Altoè, G., Bertoldo, G., Zandonella Callegher, C., Tof- falini, E., Calcagnì, A., Finos, L., & Pastore, M. (2020). Enhancing Statistical Inference in Psy- chological Research via Prospective and Retro- spective Design Analysis. Frontiers in Psychol- ogy, 10. https://doi.org/10.3389/fpsyg.2019. 02893 Anderson, S. F. (2019). Best (but oft forgotten) prac- tices: Sample size planning for powerful stud- ies. The American Journal of Clinical Nutrition, 110(2), 280–295. https : / / doi . org / 10 . 1093 / ajcn/nqz058 Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-Size Planning for More Accurate Statis- tical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty. Psy- chological Science, 28(11), 1547–1562. https:// doi.org/10.1177/0956797617723724 Button, K., Ioannidis, J., Mokrysz, C., Nosek, B., Flint, J., Robinson, E., & Munafò, M. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neu- roscience, 14(5), 365–376. https://doi.org/10. 1038/nrn3475 Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Hu- ber, J., Johannesson, M., Kirchler, M., Almen- berg, J., Altmejd, A., Chan, T., Heikensten, E., Holzmeister, F., Imai, T., Isaksson, S., Nave, G., Pfeiffer, T., Razen, M., & Wu, H. (2016). Eval- uating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436. https://doi.org/10.1126/science.aaf0918 Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Nave, G., Nosek, B. A., Pfeiffer, T., Altmejd, A., But- trick, N., Chan, T., Chen, Y., Forsell, E., Gampa, A., Heikensten, E., Hummer, L., Imai, T., . . . Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Sci- ence between 2010 and 2015. Nature Human Behaviour, 2(9), 637–644. https://doi.org/10. 1038/s41562-018-0399-z Cohen, J. (1988). Statistical power analysis for the be- havioral sciences. 
Lawrence Erlbaum Associates. https://doi.org/10.4324/9780203771587 Cook, J., Hislop, J., Adewuyi, T., Harrild, K., Altman, D., Ramsay, C., Fraser, C., Buckley, B., Fayers, P., Harvey, I., Briggs, A., Norrie, J., Fergusson, D., Ford, I., & Vale, L. (2014). Assessing methods to specify the target difference for a randomised controlled trial: DELTA (Difference ELicitation in TriAls) review. Health Technol Assess, 18(28). https://doi.org/10.3310/hta18280 Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skul- borstad, H. M., Allen, J. M., Banks, J. B., Baranski, E., Bernstein, M. J., Bonfiglio, D. B., Boucher, L., Brown, E. R., Budiman, N. I., Cairo, A. H., Capaldi, C. A., Chartier, C. R., Chung, J. M., Cicero, D. C., Coleman, J. A., Conway, J. G., . . . Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82. https://doi.org/10.1016/j.jesp.2015.10.012 Eisenberger, N. I., Lieberman, M. D., & Williams, K. D. (2003). Does rejection hurt? an fMRI study of social exclusion. Science, 302(5643), 290–292. https://doi.org/10.1126/science.1089134 Ellis, P. D. (2010). The Essential Guide to Effect Sizes. Cambridge University Press. https : / / doi . org / 10.1017/CBO9780511761676 Fisher, R. A. (1915). Frequency Distribution of the Values of the Correlation Coefficient in Sam- ples from an Indefinitely Large Population. Biometrika, 10(4), 507. https : / / doi . org / 10 . 2307/2331838 Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlock- ing the file drawer. Science, 345(6203), 1502. https://doi.org/10.1126/science.1255484 Gelman, A. (2018). The Failure of Null Hypothesis Sig- nificance Testing When Studying Incremental Changes, and What to Do About It. Personal- ity and Social Psychology Bulletin, 44(1), 16–23. https://doi.org/10.1177/0146167217729162 https://doi.org/10.3389/fpsyg.2019.02893 https://doi.org/10.3389/fpsyg.2019.02893 https://doi.org/10.1093/ajcn/nqz058 https://doi.org/10.1093/ajcn/nqz058 https://doi.org/10.1177/0956797617723724 https://doi.org/10.1177/0956797617723724 https://doi.org/10.1038/nrn3475 https://doi.org/10.1038/nrn3475 https://doi.org/10.1126/science.aaf0918 https://doi.org/10.1038/s41562-018-0399-z https://doi.org/10.1038/s41562-018-0399-z https://doi.org/10.4324/9780203771587 https://doi.org/10.3310/hta18280 https://doi.org/10.1016/j.jesp.2015.10.012 https://doi.org/10.1126/science.1089134 https://doi.org/10.1017/CBO9780511761676 https://doi.org/10.1017/CBO9780511761676 https://doi.org/10.2307/2331838 https://doi.org/10.2307/2331838 https://doi.org/10.1126/science.1255484 https://doi.org/10.1177/0146167217729162 13 Gelman, A. (2019a). Don’t calculate post-hoc power us- ing observed estimate of effect size. Annals of surgery, 269(1), e9–e10. https : / / doi . org / 10 . 1097/SLA.0000000000002908 Gelman, A. (2019b). From Overconfidence in Research to Over Certainty in Policy Analysis: Can We Es- cape the Cycle of Hype and Disappointment? New America. Retrieved May 29, 2020, from http : //newamerica.org/public-interest-technology/ blog/overconfidence- research- over- certainty- policy-analysis-can-we-escape-cycle-hype-and- disappointment/ Gelman, A., & Carlin, J. (2013). Retrospective de- sign analysis using external information (Un- published) [Unpublished]. Retrieved April 28, 2020, from http : / / www. stat . columbia . edu / ~gelman/research/unpublished/retropower5. pdf Gelman, A., & Carlin, J. (2014). 
Beyond Power Calcu- lations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychologi- cal Science, 9(6), 641–651. https://doi.org/10. 1177/1745691614551642 Gelman, A., & Loken, E. (2014). The statistical crisis in science. American scientist, 102(6), 460–466. https://doi.org/10.1511/2014.111.460 Gelman, A., Skardhamar, T., & Aaltonen, M. (2017). Type M Error Might Explain Weisburd’s Para- dox. Journal of Quantitative Criminology. https: //doi.org/10.1007/s10940-017-9374-5 Gelman, A., & Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statis- tics, 15(3), 373–390. https://doi.org/10.1007/ s001800000040 Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The Null Ritual: What You Always Wanted to Know About Significance Testing but Were Afraid to Ask. The SAGE Handbook of Quantitative Methodology for the Social Sciences (pp. 392– 409). SAGE Publications, Inc. https://doi.org/ 10.4135/9781412986311.n21 Goodman, S., & Berlin, J. (1994). The Use of Pre- dicted Confidence Intervals When Planning Ex- periments and the Misuse of Power When In- terpreting Results. Annals of internal medicine, 121(3), 200–206. https : / / doi . org / 10 . 7326 / 0003-4819-121-3-199408010-00008 Ioannidis, J. P. A. (2008). Why Most Discovered True Associations Are Inflated: Epidemiology, 19(5), 640–648. https : / / doi . org / 10 . 1097 / EDE . 0b013e31818131e7 Ioannidis, J. P. A., Pereira, T. V., & Horwitz, R. I. (2013). Emergence of Large Treatment Effects From Small Trials—Reply. JAMA, 309(8), 768–769. https://doi.org/10.1001/jama.2012.208831 Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Bahník, Š., Bernstein, M. J., Bocian, K., Brandt, M. J., Brooks, B., Brumbaugh, C. C., Cemalcilar, Z., Chandler, J., Cheong, W., Davis, W. E., De- vos, T., Eisner, M., Frankowska, N., Furrow, D., Galliani, E. M., . . . Nosek, B. A. (2014). Investi- gating Variation in Replicability. Social Psychol- ogy, 45(3), 142–152. https://doi.org/10.1027/ 1864-9335/a000178 Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Reginald B. Adams, J., Alper, S., Aveyard, M., Axt, J. R., Babalola, M. T., Bahník, Š., Batra, R., Berkics, M., Bernstein, M. J., Berry, D. R., Bialo- brzeska, O., Binan, E. D., Bocian, K., Brandt, M. J., Busching, R., . . . Nosek, B. A. (2018). Many labs 2: Investigating variation in replica- bility across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490. https : / / doi . org / 10 . 1177 / 2515245918810225 Kurkiewicz, D. (2017). Docstring: Provides docstring ca- pabilities to r functions. https : / / CRAN . R - project.org/package=docstring Lakens, D. (2019). The Value of Preregistration for Psychological Science: A Conceptual Analysis (preprint). PsyArXiv. https : / / doi . org / 10 . 31234/osf.io/jbh4w Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., Baguley, T., Becker, R. B., Benning, S. D., Bradford, D. E., Buchanan, E. M., Caldwell, A. R., Van Calster, B., Carlsson, R., Chen, S.-C., Chung, B., Colling, L. J., Collins, G. S., Crook, Z., . . . Zwaan, R. A. (2018). Jus- tify your alpha. Nature Human Behaviour, 2(3), 168–171. https : / / doi . org / 10 . 1038 / s41562 - 018-0311-x Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equiv- alence Testing for Psychological Research: A Tu- torial. Advances in Methods and Practices in Psy- chological Science, 1(2), 259–269. https://doi. org/10.1177/2515245918770963 Lane, D. 
Lane, D. M., & Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. British Journal of Mathematical and Statistical Psychology, 31(2), 107–112. https://doi.org/10.1111/j.2044-8317.1978.tb00578.x
Lu, J., Qiu, Y., & Deng, A. (2018). A note on Type S/M errors in hypothesis testing. British Journal of Mathematical and Statistical Psychology. https://doi.org/10.1111/bmsp.12132
Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (1st ed.). Cambridge University Press. https://doi.org/10.1017/9781107286184
O'Hagan, A. (2019). Expert Knowledge Elicitation: Subjective but Scientific. The American Statistician, 73, 69–81. https://doi.org/10.1080/00031305.2018.1518265
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Phillips, B. M., Hunt, J. W., Anderson, B. S., Puckett, H. M., Fairey, R., Wilson, C. J., & Tjeerdema, R. (2001). Statistical significance of sediment toxicity test results: Threshold values derived by the detectable significance approach. Environmental Toxicology and Chemistry, 20(2), 371–373. https://doi.org/10.1002/etc.5620200218
Rohrer, J. M. (2018). Thinking clearly about correlations and causation: Graphical causal models for observational data. Advances in Methods and Practices in Psychological Science, 1(1), 27–42. https://doi.org/10.1177/2515245917745629
Vasishth, S., Mertzen, D., Jäger, L. A., & Gelman, A. (2018). The statistical significance filter leads to overoptimistic expectations of replicability. Journal of Memory and Language, 103, 151–175. https://doi.org/10.1016/j.jml.2018.07.004
Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S. Springer. https://cran.r-project.org/web/packages/MASS/index.html
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3), 274–290. https://doi.org/10.1111/j.1745-6924.2009.01125.x
Vul, E., & Pashler, H. (2017). Suspiciously high correlations in brain imaging research. In Psychological Science Under Scrutiny (pp. 196–220). John Wiley & Sons, Ltd. https://doi.org/10.1002/9781119095910.ch11
Yarkoni, T. (2009). Big Correlations in Little Studies: Inflated fMRI Correlations Reflect Low Statistical Power—Commentary on Vul et al. (2009). Perspectives on Psychological Science, 4(3), 294–298. https://doi.org/10.1111/j.1745-6924.2009.01127.x
Young, N. S., Ioannidis, J. P. A., & Al-Ubaydli, O. (2008). Why current publication practices may distort science. PLOS Medicine, 5(10), 1–5. https://doi.org/10.1371/journal.pmed.0050201

Appendix A: Pearson Correlation and Design Analysis

To conduct a design analysis, it is necessary to know the sampling distribution of the effect of interest, that is, the distribution of effects we would observe if n observations were sampled over and over again from a population with a given effect. This allows us, in turn, to evaluate the sampling distribution of the test statistic of interest not only under the Null Hypothesis (H0) but also under the Alternative Hypothesis (H1), and thus to compute the statistical power and inferential risks of the study considered.

Regarding Pearson's correlation between two normally distributed variables, the sampling distribution is bounded between -1 and 1, and its shape depends on the values of ρ and n, respectively the population correlation and the sample size. The sampling distribution is approximately Normal if ρ = 0, whereas it is negatively skewed for positive values of ρ and positively skewed for negative values. Skewness is greater for higher absolute values of ρ but decreases when larger sample sizes are considered. In Figure 7, correlation sampling distributions are presented for increasing values of ρ and a fixed sample size (n = 30).
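The skewness just described can be checked directly by simulation. The following is a minimal sketch using our own naming (the helper simulate_r() is not part of the paper's materials): it repeatedly draws n = 30 observations from a bivariate Normal distribution with a given ρ and collects the observed correlation coefficients.

# Minimal sketch (hypothetical helper, not the authors' code): approximate the
# sampling distribution of r by repeated sampling from a bivariate Normal.
library(MASS)  # for mvrnorm()

simulate_r <- function(rho, n, B = 1e4) {
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  replicate(B, cor(MASS::mvrnorm(n, mu = c(0, 0), Sigma = Sigma))[1, 2])
}

set.seed(2020)
r_null  <- simulate_r(rho = 0,  n = 30)   # roughly symmetric around 0
r_large <- simulate_r(rho = .8, n = 30)   # clearly negatively skewed

hist(r_null,  breaks = 50, main = "rho = 0, n = 30",  xlab = "r")
hist(r_large, breaks = 50, main = "rho = .8, n = 30", xlab = "r")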
In the following paragraphs, we consider the consequences of Pearson's correlation sampling distribution for statistical inference, and the behaviour of Type M and Type S errors as a function of statistical power.

Statistical inference

To test a hypothesis or to derive confidence intervals, the sampling distribution of the test statistic of interest must follow a known distribution. In the case of H0: ρ = 0, the sample correlation is approximately normally distributed with standard error

$$SE(r) = \sqrt{\frac{1 - r^2}{n - 2}}.$$

Thus, statistical inference is performed considering the test statistic

$$t = \frac{r}{SE(r)} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}, \qquad (2)$$

which follows a t-distribution with df = n − 2.

However, in the case of ρ ≠ 0, the sample correlation is no longer normally distributed. As we have previously seen, the sampling distribution is skewed for large values of ρ and small sample sizes. Thus, the test statistic of interest does not follow a t-distribution.⁴ To overcome this issue, the Fisher transformation was introduced (Fisher, 1915):

$$F(r) = \frac{1}{2}\ln\frac{1 + r}{1 - r} = \operatorname{arctanh}(r). \qquad (3)$$

Applying this transformation, the resulting sampling distribution is approximately Normal with mean $F(\rho)$ and $SE = \frac{1}{\sqrt{n - 3}}$. Thus, the test statistic follows a standard Normal distribution, and statistical inference is performed considering the Z-scores.

⁴ Note that the t-distribution is defined as the distribution of a random variable $T = Z/\sqrt{V/df}$, where $Z$ is a standard Normal variable and $V$ is a Chi-squared variable with $df$ degrees of freedom. Thus, if the sample correlation is not approximately normally distributed, the test statistic is no longer t-distributed.

Alternatively, other methods can be used to obtain reliable results, for example, Monte Carlo simulation. Monte Carlo simulation is based on random sampling to approximate the quantities of interest. In the case of correlation, n observations are iteratively simulated from a bivariate Normal distribution with a given ρ, and the observed correlation is recorded. As the number of iterations increases, the distribution of simulated correlation values approximates the actual correlation sampling distribution, and it can be used to compute the quantities of interest.

Although Monte Carlo methods are more computationally demanding than analytic solutions, this approach allows us to obtain reliable results in a wider range of conditions, even when no closed-form solutions are available. For these reasons, the functions pro_r() and retro_r() presented in this paper are based on Monte Carlo simulation to compute power, Type M, and Type S error values. This guarantees a more general framework into which other future applications can easily be integrated.
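To make the Monte Carlo logic concrete, the following is a minimal sketch of a retrospective design analysis for Pearson's correlation. It is not the authors' implementation (retro_r(), described in Appendix B, offers more options and checks), and the function name design_analysis_mc() is ours.

# Minimal Monte Carlo sketch (hypothetical, simplified version of the approach
# described above): simulate B samples of size n from a bivariate Normal with
# correlation rho, test H0: rho = 0 with the t statistic of Equation 2, and
# summarise power, Type M, and Type S errors among the significant results.
library(MASS)

design_analysis_mc <- function(rho, n, sig_level = .05, B = 1e4) {
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  r_obs <- replicate(B, {
    x <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
    cor(x)[1, 2]
  })

  t_obs <- r_obs * sqrt(n - 2) / sqrt(1 - r_obs^2)            # Equation 2
  p_obs <- 2 * pt(abs(t_obs), df = n - 2, lower.tail = FALSE) # two-sided p-value
  sig   <- p_obs < sig_level

  c(power  = mean(sig),                            # P(statistically significant result)
    type_m = mean(abs(r_obs[sig])) / abs(rho),     # average exaggeration factor
    type_s = mean(sign(r_obs[sig]) != sign(rho)))  # wrong-sign significant results
}

set.seed(2020)
design_analysis_mc(rho = .25, n = 13)

The same logic, with the sample size treated as unknown and increased until the target power is reached, underlies a prospective analysis.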
Type M and Type S errors

Design Analysis was first introduced by Gelman and Carlin (2014) assuming that the sampling distribution of the test statistic of interest follows a t-distribution. This is the case, for example, of the Cohen's d effect size, which measures the mean difference between two groups on a continuous outcome. The behaviour of Type M and Type S errors as a function of statistical power in the case of Cohen's d is presented in Figure 8. For different values of the hypothetical population effect size (d = .2, .5, .7, .9), we can observe that, for high levels of power, Type S and Type M errors are low; conversely, for low levels of power, Type S and Type M errors are high. As expected, the relation between power and inferential errors is not influenced by the value of d (i.e., the four lines are overlapping). Limit cases are obtained for power = 1 and power = 0.05 (note that the lowest possible value of power is given by the alpha level chosen as the statistical significance threshold). In the former case, Type S error is 0 and Type M error is 1; in the latter case, Type S error is 0.5 and Type M error goes to infinity.

Figure 7. Pearson correlation coefficient sampling distributions for increasing values of ρ and fixed sample size (n = 30).

Figure 8. The behaviour of Type M and Type S errors as a function of statistical power in the case of Cohen's d. Note that the four lines are overlapping.

In the case of Pearson's correlation, we noted above that the sampling distribution is skewed for large values of ρ and small sample sizes. Moreover, its support is bounded between -1 and 1. Thus, the relations between power, Type M error, and Type S error are influenced by the value of the hypothetical population effect size (see Figure 9).

Figure 9. The behaviour of Type M and Type S errors as a function of statistical power in the case of Pearson's correlation ρ.

We can observe how, for different values of correlation (ρ = .2, .5, .7, .9), Type M error increases at different rates as power decreases, whereas Type S error follows a consistent pattern (the small differences are due to numerical approximation). We can intuitively explain this behaviour by considering that, for low levels of power, the sampling distribution includes a wider range of correlation values. However, correlation values cannot exceed 1, and therefore the distribution becomes progressively more skewed. This does not influence the proportion of statistically significant sampled correlations with the incorrect sign (Type S error), but it does affect the mean absolute value of statistically significant sampled correlations (used to compute Type M error). In particular, the sampling distribution for greater values of ρ becomes skewed more rapidly, and thus Type M error increases at a lower rate.

Finally, since correlation values are bounded, Type M error for a given value of ρ can hypothetically increase only up to a maximum of 1/ρ: even in the extreme case where every statistically significant sample correlation equals 1 (the maximum possible value), the average significant estimate cannot exceed 1, so the exaggeration factor cannot exceed 1/ρ. For example, for ρ = .5 the maximum Type M error is 2, as .5 × 2 = 1.

In this appendix, we discussed for completeness the implications of conducting a design analysis in the case of Pearson's correlation. We considered extreme scenarios that are unlikely to occur in real research settings. Nevertheless, we believe this is important for evaluating the statistical behaviour and properties of Type M and Type S errors in the case of Pearson's correlation, as well as for helping researchers to understand design analysis in depth.
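As a concrete check of one such extreme scenario, the hypothetical design_analysis_mc() sketch given earlier can be reused (again, a rough illustration rather than the paper's own code): with ρ = .5 and a very small sample, the estimated Type M error should approach, but never exceed, the bound 1/ρ = 2.

# Extreme, very low-power scenario: rho = .5 with only n = 5 observations.
# Since |r| cannot exceed 1, the estimated Type M error must remain below 1/.5 = 2.
set.seed(2020)
design_analysis_mc(rho = .5, n = 5)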
Appendix B: R Functions for Design Analysis with Pearson Correlation

Here we describe the R functions defined to perform a prospective or retrospective design analysis in the case of Pearson correlation. First, we give instructions on how to load and use the functions. Subsequently, we provide the code to reproduce the examples included in the article.

These functions can be used as a base to further develop design analysis in more complex scenarios that were beyond the aim of this paper.

R functions

The code of the functions is available in the file Design_analysis_r.R at https://osf.io/9q5fr/. After downloading the file Design_analysis_r.R, run the following line, indicating the correct path where the file was saved:

source("/Design_analysis_r.R")

The script will automatically load into your workspace the functions and two required R packages: MASS (Venables & Ripley, 2002) and docstring (Kurkiewicz, 2017). If you do not have them already installed, run the line install.packages(c("MASS", "docstring")).

The R functions are:

• retro_r() for retrospective design analysis. Given the hypothetical population correlation value and the sample size, this function performs a retrospective design analysis according to the defined alternative hypothesis and significance level. The power level, Type M error, and Type S error are computed together with the critical correlation value (i.e., the minimum absolute correlation value that would be statistically significant; see the illustrative sketch below).

retro_r(rho, n, alternative = c("two.sided", "less", "greater"), sig_level = .05, B = 1e4, seed = NULL)

• pro_r() for prospective design analysis. Given the hypothetical population correlation value and the required power level, this function performs a prospective design analysis according to the defined alternative hypothesis and significance level. The required sample size is computed together with the associated Type M error, Type S error, and critical correlation value.

pro_r(rho, power = .80, alternative = c("two.sided", "less", "greater"), sig_level = .05, range_n = c(1, 1000), B = 1e4, tol = .01, display_message = FALSE, seed = NULL)

For further details about the function arguments, run the line docstring(retro_r) or docstring(pro_r). This displays documentation similar to the help pages of R functions.

Note: two other functions are defined in the script and will be loaded into your workspace (i.e., compute_crit_r() and print.design_analysis). These are internal functions that should not be called directly by the user.
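For the test of H0: ρ = 0 described in Appendix A (Equation 2), the critical correlation value can also be obtained analytically. The sketch below is only meant to clarify the quantity reported by the functions; the internal compute_crit_r() may proceed differently, and the helper name critical_r() is ours.

# Hedged sketch: smallest absolute correlation that reaches two-sided
# significance when testing H0: rho = 0 with a t test on n - 2 degrees of freedom.
# Solving Equation 2 for r at the critical t value gives r = t / sqrt(t^2 + n - 2).
critical_r <- function(n, sig_level = .05) {
  t_crit <- qt(1 - sig_level / 2, df = n - 2)
  t_crit / sqrt(t_crit^2 + n - 2)
}

critical_r(n = 13)  # the sample size used in the retrospective examples below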
Examples code

Below we report the code to reproduce the examples included in the article.

# Example from Figure 1
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .05, seed = 2020)

# Example from Figure 3
pro_r(rho = .25, power = .8, alternative = "two.sided", sig_level = .05, seed = 2020)

# Example from Figure 6
pro_r(rho = .25, power = .8, alternative = "two.sided", sig_level = .05, seed = 2020)

# Examples from Table 1
pro_r(rho = .25, power = .6, alternative = "two.sided", sig_level = .05, seed = 2020)
pro_r(rho = .15, power = .8, alternative = "two.sided", sig_level = .05, seed = 2020)
pro_r(rho = .35, power = .8, alternative = "two.sided", sig_level = .05, seed = 2020)

# Examples from Table 2
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .100, seed = 2020)
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .050, seed = 2020)
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .010, seed = 2020)
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .005, seed = 2020)
retro_r(rho = .25, n = 13, alternative = "two.sided", sig_level = .001, seed = 2020)