Meta-Psychology, 2020, vol 4, MP.2018.874
https://doi.org/10.15626/MP.2018.874
Article type: Original Article
Published under the CC-BY4.0 license
Open data: N/A
Open materials: Yes
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: N/A
Edited by: Marcel van Assen
Reviewed by: Stephen Martin, Jack Davis, Donald Williams, Daniël Lakens and Rink Hoekstra
Analysis reproduced by: Erin Buchanan
All supplementary files can be accessed at OSF: https://doi.org/10.17605/OSF.IO/PEUMW

Estimating Population Mean Power Under Conditions of Heterogeneity and Selection for Significance

Jerry Brunner and Ulrich Schimmack
University of Toronto Mississauga

Abstract

In scientific fields that use significance tests, statistical power is important for successful replications of significant results because it is the long-run success rate in a series of exact replication studies. For any population of significant results, there is a population of power values of the statistical tests on which conclusions are based. We give exact theoretical results showing how selection for significance affects the distribution of statistical power in a heterogeneous population of significance tests. In a set of large-scale simulation studies, we compare four methods for estimating population mean power of a set of studies selected for significance (a maximum likelihood model, extensions of p-curve and p-uniform, and z-curve). The p-uniform and p-curve methods performed well with a fixed effect size and varying sample sizes. However, when there was substantial variability in effect sizes as well as sample sizes, both methods systematically overestimated mean power. With heterogeneity in effect sizes, the maximum likelihood model produced the most accurate estimates when the distribution of effect sizes matched the assumptions of the model, but z-curve produced more accurate estimates when the assumptions of the maximum likelihood model were not met. We recommend the use of z-curve to estimate the typical power of significant results, which has implications for the replicability of significant results in psychology journals.

Keywords: Power estimation, Post-hoc power analysis, Publication bias, Maximum likelihood, Z-curve, P-curve, P-uniform, Effect size, Replicability, Meta-analysis

The purpose of this paper is to develop and evaluate methods for predicting the success rate if sets of significant results were replicated exactly. We call this statistical property the average power of a set of studies. Average power can range from the criterion for a type-I error, if all significant results are false positives, to 100%, if the statistical power of the original studies approaches 1. Average power can be used to quantify the degree of evidential value in a set of studies (Simonsohn et al., 2014b). In the end, we estimate the mean power of studies that were used to examine the replicability of psychological research, and compare the results to actual replication outcomes (Open Science Collaboration, 2015). Estimating the average power of original studies is interesting because it is tightly connected with the outcome of replication studies (Greenwald et al., 1996; Yuan & Maxwell, 2005). To claim that a finding has been replicated, a replication study should reproduce a statistically significant result, and the probability of a successful replication is a function of statistical power.
Thus, if reproducibility is a requirement of good science (Bunge, 1998; Popper, 1959), it follows that high statistical power is a necessary condition for good science. Information about the average power of studies is also useful because selection for significance increases the type-I error rate and inflates effect sizes (Ioannidis, 2008). However, these biases are relatively small if the original studies had high power. Thus, knowledge about the average power of studies is useful for the planning of future studies. If average power is high, replication studies can use the same sample sizes as original studies, but if average power is low, sample sizes need to be increased to avoid false negative results.

Given the practical importance of power for good science, it is not surprising that psychologists have started to examine the evidential value of results published in psychology journals. At present, two statistical methods have been used to make claims about the average power of psychological research, namely p-curve (Simonsohn et al., 2017) and z-curve (Schimmack, 2015, 2018a), but so far neither method has been peer-reviewed.

Statistical Power Before and After A Study Has Been Conducted

Before we proceed, we would like to clarify that the statistical power of a statistical test is defined as the probability of correctly rejecting the null hypothesis (Neyman & Pearson, 1933). This probability depends on the sampling error of a study and the population effect size. The traditional definition of power does not consider effect sizes of zero (false positives) because the goal of a priori power planning is to ensure that a non-zero effect can be demonstrated. However, our goal is not to plan future studies, but to analyze results of existing studies. For post-hoc power analysis, it is impossible to distinguish between true positives and false positives and to estimate the average power conditional on the unknown status of hypotheses (i.e., whether the null hypothesis is true or false). Thus, we use the term average power for the probability of correctly or incorrectly rejecting the null hypothesis (Sterling et al., 1995). This definition of average power includes an unknown percentage of false positives, which have a probability equal to alpha (typically 5%) of reproducing a significant result in a replication attempt. At the same time, we believe that the strict null hypothesis is rarely true in psychological research (Cohen, 1994).

It would be ideal if it were possible to estimate the power of a single statistical test that supports a particular finding. Unfortunately, well-documented problems with the "observed power" method suggest that the goal of estimating the power of an individual test may be out of reach (Boos & Stefanski, 2012; Hoenig & Heisey, 2001). Often the main problem is that estimates for a single result are too variable to be practically useful (Yuan & Maxwell, 2005; but also see Anderson, Kelley, & Maxwell, 2017).

It is important to distinguish our undertaking from that of Cohen (1962) and the follow-up studies by Chase and Chase (1976) and Sedlmeier and Gigerenzer (1989). In Cohen's classic survey of power in the Journal of Abnormal and Social Psychology, the results of the studies were not used in any way. Power was never estimated. It was calculated exactly for a priori effect sizes deemed "small," "medium," and "large."
If a "medium" effect size referred to the population mean (which Cohen never claimed), power at the mean effect size is still not the same as mean power. In contrast, we aim to estimate the mean power given the actual population effect sizes in a set of studies.

Two Populations of Studies

We distinguish two populations of tests. One population contains all tests that have been conducted. This population contains significant and non-significant results. The other population contains the subset of studies that produced a significant result. We focus on the population of studies selected for significance for two reasons.

First, non-significant results are often not available because journal articles mostly report significant results (Rosenthal, 1979; Sterling, 1959; Sterling et al., 1995). Second, only significant results are used as evidence for a theoretical prediction. It is irrelevant how many tests produced non-significant results because these results are inconclusive. As psychological theories mainly rest on studies that produced significant results, only the evidential value of significant results is relevant for evaluations of the robustness of psychology as a science. In short, we are interested in statistical methods that can estimate the average power of a set of studies with significant results.

The Study Selection Model

We developed a number of theorems that specify how selection for significance influences the distribution of power. These theorems are very general. They do not depend on the particular population distribution of power, the significance tests involved, or the Type I error probabilities of those tests. The only requirement is that for every study with a specific population effect size, sample size, and statistical test, the probability of a result being selected is the true power of the study. We discuss the two most important theorems in detail. All six theorems are provided in the appendix, along with an illustration of the theorems by simulation.

Theorem 1 Population mean true power equals the overall probability of a significant result.

Theorem 1 establishes the central importance of population mean power after selection for significance for predicting replication outcomes. Think of a coin-tossing experiment in which a large population of coins is manufactured, each with a different probability of heads; that is, these coins are not fair coins with equal probabilities for both sides. Also consider heads to be successes or wins. Repeatedly tossing the set of coins and counting the number of heads produces an expected value of the number of successes. For example, the experiment may yield 60% heads and 40% tails. While the exact probabilities of heads for the individual coins are unknown, the observable success rate is equivalent to the mean power of all coins. Theorem 1 states that success rate and mean power are equivalent even if the set of coins is a subset of all coins. For example, assume all coins were tossed once and only coins showing heads were retained. Repeating the coin toss experiment, we would still find that the success rate for the set of selected coins matches the mean probability of the selected coins.

Theorem 2 The effect of selection for significance on power after selection is to multiply the probability of each power value by a quantity equal to the power value itself, divided by population mean power before selection.

If the distribution of power is continuous, this statement applies to the probability density function.
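The following small simulation sketch, written by us for illustration (it is not part of the paper's supplementary code), makes both theorems concrete for the uniform distribution of power before selection that is used in Figure 1 below.

```r
# Minimal simulation sketch (ours) illustrating Theorems 1 and 2 for power
# uniformly distributed on (.05, 1) before selection for significance.
set.seed(123)
k <- 1e6
power_before <- runif(k, min = 0.05, max = 1)                   # true power of each study
selected     <- rbinom(k, size = 1, prob = power_before) == 1   # selection for significance
power_after  <- power_before[selected]

mean(power_before)  # mean true power before selection (about .525)
mean(power_after)   # mean true power after selection: the size-biased mean E[P^2]/E[P] (Theorem 2)

# Theorem 1: the success rate of exact replications of the selected studies
# equals their mean true power after selection.
mean(rbinom(length(power_after), size = 1, prob = power_after))
```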
Figure 1 illustrates Theorem 2 for a simple, artificial example in which power before selection is uniformly distributed on the interval from 0.05 to 1.0. The corresponding distribution after selection for significance is triangular; now studies with more power are more likely to be selected.

Figure 1. Uniform distribution of power before selection. The plot shows the density of power before and after selection. Expected power = 0.525 before selection, 0.635 after selection.

In Figure 2, power before selection is less heterogeneous, and higher on average. Consequently, the distributions of power before selection and after selection are much more similar. In both cases, though, mean true power after selection for significance is higher than mean true power before selection for significance.

Figure 2. Example of higher power before selection. The plot shows the density of power before and after selection. Expected power = 0.700 before selection, 0.714 after selection. Note. Power before selection follows a beta distribution with a = 13 and b = 6, multiplied by .95 plus .05, so that it ranges from .05 to 1.

The coin-tossing selection model proposed here may seem overly simplistic and unrealistic. Few researchers conduct a study and give up after a first attempt produces a nonsignificant result. For example, Morewedge et al. (2014) disclosed that they did not report "some preliminary studies that used different stimuli and different procedures and that showed no interesting effects." From a theoretical perspective, it is important that all studies test the same hypothesis, but for our selection model it is not. Even if all studies used exactly the same procedures and had exactly the same power, the probability of being selected into the set of reported studies matches their power, and Theorem 2 holds. Each study that was conducted by Morewedge et al. has an unknown true power to produce a significant result, and Theorem 2 implies (via Theorem 5 in the appendix) that their selected studies with significant results have higher mean power than the full set of studies that were conducted. We are only interested in the statistical power and replicability of the published studies with significant results.

Estimation Methods

In this section, we describe four methods for estimating population mean power under conditions of heterogeneity, after selection for statistical significance.

Notation and statistical background

To present our methods formally, it is necessary to introduce some statistical notation. Rather than using traditional notation from statistics that might make it difficult for non-statisticians to understand our method, we follow Simonsohn et al. (2014a), who employed a modified version of the S syntax (Becker et al., 1988) to represent probability distributions. The S language is familiar to psychologists who use the R statistical software (R Core Team, 2017). The notation also makes it easier to implement our methods in R, particularly in the simulation studies.

The outcome of an empirical study is partially determined by random sampling error, which implies that statistical results will vary across studies. This variation is expected to follow a random sampling distribution. Each statistical test has its own sampling distribution.
We will use the symbol T to denote a general test statistic; it could be a t-statistic, F, chi-squared, Z, or something else. Assume an upper-tailed test, so that the null hypothesis will be rejected at significance level α (usually α = 0.05) when the continuous test statistic T exceeds a critical value c. Typically there is a sample of test statistic values T1, . . . , Tk, but when only one is being considered the subscript will be omitted.

The notation p(t) refers to the probability under the null hypothesis that T is less than or equal to the fixed constant t. The symbol p would represent pnorm if the test statistic were standard normal, pf if the test statistic had an F-distribution, and so on. While p(t) is the area under the curve, d(t) is the value on the y axis for a particular t, as in dnorm. Following the conventions of the S language, the inverse of p is q, so that p(q(t)) = q(p(t)) = t.

Sampling distributions when the null hypothesis is true are well known to psychologists because they provide the foundation of null-hypothesis significance testing. Most psychologists are less familiar with non-central sampling distributions (see Johnson et al., 1995, for a detailed and authoritative treatment). When the null hypothesis is false, the area under the curve of the test statistic's sampling distribution is p(t, ncp), representing particular cases like pf(t, df1, df2, ncp). The initials ncp stand for "non-centrality parameter." This notation applies directly when T has one of the common non-central distributions like the non-central t, F or chi-squared under the alternative hypothesis, but it can be extended to the distribution of any test statistic under any specific alternative, even when the distribution in question is technically not a non-central distribution. The non-centrality parameter is positive when the null hypothesis is false, and statistical power is a monotonically increasing function of the non-centrality parameter. This function is given explicitly by Power = 1 − p(c, ncp). For the most important non-central distributions (Z, t, chi-squared and F), the non-centrality parameter can be factored into the product of two terms. The first term is an increasing function of sample size, and the second term is an increasing function of effect size. In symbols,

ncp = f1(n) · f2(es). (1)

This formula is capable of accommodating different definitions of effect size (Cohen, 1988; Grissom & Kim, 2012) by making corresponding changes to the function f2 in f2(es).

As an example of Equation (1), consider a standard F-test for the difference between the means of two normal populations with a common variance. After some simplification, the non-centrality parameter of the non-central F may be written as

ncp = n ρ (1 − ρ) d²,

where n = n1 + n2 is the total sample size, ρ is the proportion of cases allocated to the first treatment, and d is Cohen's (1988) effect size for the two-sample problem. This expression for the non-centrality parameter can be factored in various ways to match Equation (1); for example, f1(n) = n ρ (1 − ρ) and f2(es) = es². Note that this is just an example; Equation (1) applies to the non-centrality parameters of the non-central Z, t, chi-squared and F distributions in general. Thus, for a given sample size and a given effect size, the power of a statistical test is

Power = 1 − p(c, f1(n) · f2(es)). (2)

In this formula, c is the criterion value for statistical significance; the test is significant if T > c.
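To make Equation (2) concrete, the following R sketch computes power for the two-sample F-test example above. The numerical inputs (total n = 86, equal allocation, Cohen's d = 0.4) are illustrative assumptions of ours, not values taken from the paper.

```r
# Power of a two-sample F-test (numerator df = 1) via Equations (1) and (2).
# The inputs below are illustrative assumptions, not values used in the paper.
n    <- 86                                     # total sample size
rho  <- 0.5                                    # proportion allocated to the first group
d    <- 0.4                                    # Cohen's d
ncp  <- n * rho * (1 - rho) * d^2              # Equation (1): ncp = f1(n) * f2(es)
crit <- qf(0.95, df1 = 1, df2 = n - 2)         # critical value c for alpha = .05
1 - pf(crit, df1 = 1, df2 = n - 2, ncp = ncp)  # Equation (2): power, roughly .45 here
```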
The function f2(es) can also be applied to sets of studies with different traditional effect sizes. For example, es could be Cohen's d, and the alternative effect size es′ could be the point-biserial correlation r (Cohen, 1988, p. 24). Symbolically, es′ = g(es). Since the function g(es) is monotone increasing, a corresponding inverse function exists, so that es = g⁻¹(es′). Then Equation (2) becomes

Power = 1 − p(c, f1(n) · f2(es))
      = 1 − p(c, f1(n) · f2(g⁻¹(es′)))
      = 1 − p(c, f1(n) · f2′(es′)),

where f2′ just means another function f2. That is, if the definition of effect size is changed (in a monotone way), the change is absorbed by the function f2, and Equation (2) still applies.

We are now ready to introduce our four methods for the estimation of mean power based on a set of studies that vary in power, with known sample sizes and unknown population effect sizes. The four methods are p-curve 2.1, p-uniform, the maximum likelihood model, and z-curve.

Estimation Methods

The first two estimation methods are based on methods that were developed for the estimation of effect sizes. Our use of these methods for the estimation of mean power is an extension of them. Our simulation studies should not be considered tests of these methods for the estimation of effect sizes. We developed these extensions simply because power is a function of effect size and sample size, and sample sizes are known. Thus, only estimation of the unknown effect sizes is needed to estimate power with these methods. Power estimation is then a simple additional step: compute power for each study as a function of the effect size estimate and the sample size of that study. These models should work well when all studies have the same effect size and heterogeneity in power is only a function of heterogeneity in sample size, as the models assume.

P-curve 2.1 and p-uniform

A p-curve method for estimation of mean power is available online (www.p-curve.com). It is important to point out that this method differs from the p-curve method that we developed. The online p-curve method is called p-curve 4.06. We built our p-curve method on the effect size p-curve method with the version code p-curve 2.0 (Simonsohn et al., 2014b). Hence, we refer to our p-curve method as p-curve 2.1.

P-uniform is very similar to p-curve (van Assen et al., 2014). Both methods aim to find an effect size that produces a uniform distribution of p-values between .00 and .05. After we developed our p-uniform method for power estimation, a new estimation method was introduced (van Aert et al., 2016). We conducted our studies with the original estimation method, and our results are limited to the performance of this implementation of p-uniform.

To find the best fitting effect size for a set of observed test statistics, p-curve 2.1 and p-uniform compute p-values for various effect sizes and choose the effect size that yields the best approximation of a uniform distribution. If the modified null hypothesis that effect size = es is true, the cumulative distribution function of the test statistic is the conditional probability

F0(t) = Pr{T ≤ t | T > c}
      = [p(t, ncp) − p(c, ncp)] / [1 − p(c, ncp)]
      = [p(t, f1(n) · f2(es)) − p(c, f1(n) · f2(es))] / [1 − p(c, f1(n) · f2(es))],

using ncp = f1(n) · f2(es) as given in Equation (1). The corresponding modified p-value is

1 − F0(T) = [1 − p(T, f1(n) · f2(es))] / [1 − p(c, f1(n) · f2(es))].
Note that since the sample sizes of the tests may differ, the symbols p, n and c as well as T may have different referents for the j = 1, . . . , k test statistics. The subscript j has been omitted to reduce notational clutter. If the modified null hypothesis were true, the modified p-values would have a uniform distribution. Both p-curve 2.1 and p-uniform choose as estimated effect size the value of es that makes the modified p-values most nearly uniform. They differ only in the criterion for deciding when uniformity has been reached. P-curve 2.1 is based on a Kolmogorov-Smirnov test for departure from a uniform distribution, choosing the es value yielding the smallest value of the test statistic. P-uniform is based on a different criterion. Denoting by Pj the modified p-value associated with test j, calculate

Y = − Σ_{j=1}^{k} ln(Pj),

where ln is the natural logarithm. If the Pj values were uniformly distributed, Y would have a Gamma distribution with expected value k, the number of tests. The p-uniform estimate is the modified null hypothesis effect size es that makes Y equal to k, its expected value under uniformity.

These methods are designed for heterogeneity in sample size only, and assume a common effect size for all the tests. Given an estimate of the common effect size, estimated power for each test varies only as a function of sample size and can be computed from Expression (2) because sample sizes are known. Population mean power can then be estimated by averaging the k power estimates.
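As an illustration of the two uniformity criteria, the sketch below implements the modified p-values and both criteria for one-df F-tests with equal allocation, so that f1(n) = n/4 and f2(es) = d². The function names, the search interval, and the use of optimize are our own choices and not the authors' implementation; Tstat and n stand for assumed vectors of significant F statistics and their sample sizes.

```r
# Sketch (ours) of the p-curve 2.1 and p-uniform criteria for one-df F-tests,
# assuming equal allocation so that ncp = (n / 4) * d^2.
cond_p <- function(es, Tstat, n) {                     # modified p-value 1 - F0(T)
  ncp  <- (n / 4) * es^2
  crit <- qf(0.95, 1, n - 2)
  (1 - pf(Tstat, 1, n - 2, ncp)) / (1 - pf(crit, 1, n - 2, ncp))
}
# p-curve 2.1: Kolmogorov-Smirnov distance of the modified p-values from uniformity.
ks_crit <- function(es, Tstat, n) ks.test(cond_p(es, Tstat, n), "punif")$statistic
# p-uniform (original criterion): squared distance of Y = -sum(log P_j) from k.
pu_crit <- function(es, Tstat, n) (-sum(log(cond_p(es, Tstat, n))) - length(Tstat))^2

# Given assumed data Tstat and n, the effect size estimates minimize these criteria, e.g.:
# es_pc <- optimize(ks_crit, c(0.01, 2), Tstat = Tstat, n = n)$minimum
# es_pu <- optimize(pu_crit, c(0.01, 2), Tstat = Tstat, n = n)$minimum
# Estimated mean power then averages Equation (2) over the k studies, e.g. for p-uniform:
# mean(1 - pf(qf(0.95, 1, n - 2), 1, n - 2, (n / 4) * es_pu^2))
```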
Maximum likelihood model

Our maximum likelihood (ML) model also first estimates effect sizes and then combines the effect size estimates with known sample sizes to estimate mean power. Unlike p-curve 2.1 and p-uniform, the ML model allows for heterogeneity in effect sizes. In this way, the model is similar to Hedges and Vevea's (1996) model for effect size estimation before selection for significance. To take selection for significance into account, the likelihood function of the ML model is a product of k conditional densities; each term is the conditional density of the test statistic Tj, given Nj = nj and Tj > cj, the critical value.

Likelihood function. The model assumes that sample sizes and effect sizes are independent before the selection for significance. Suppose that the distribution of effect size before selection is continuous with probability density gθ(es). This notation indicates that the distribution of effect size depends on an unknown parameter or parameter vector θ. In the appendix, it is shown that the likelihood function (a function of θ) is a product of k terms of the form

[ ∫₀^∞ d(tj, f1(nj) · f2(es)) gθ(es) des ] / [ ∫₀^∞ (1 − p(cj, f1(nj) · f2(es))) gθ(es) des ], (3)

where the integrals denote areas under curves that can be computed with R's integrate function. The maximum likelihood estimate is the parameter value yielding the highest product. To be applicable to actual data, the ML model has to make assumptions about the distribution of effect sizes. The ML model that was used in the simulation studies assumed a gamma distribution of effect sizes. A gamma distribution is defined by two parameters that need to be estimated from the data. The effect sizes based on the most likely distribution are then combined with information about sample sizes to obtain power estimates for each study. An estimate of population mean power is then produced by averaging estimated power for the k significance tests. As shown in the appendix, the terms to be averaged are

[ ∫₀^∞ (1 − p(cj, f1(nj) · f2(es)))² gθ̂(es) des ] / [ ∫₀^∞ (1 − p(cj, f1(nj) · f2(es))) gθ̂(es) des ]. (4)
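A minimal sketch of one study's contribution to the likelihood in Expression (3) and to the mean-power estimate in Expression (4) is given below, again assuming one-df F-tests with equal allocation and a gamma distribution of Cohen's d. The function names and the optimizer suggestion are ours, not the authors' implementation.

```r
# Sketch (ours): one study's likelihood term (Expression 3) and power term (Expression 4)
# for a one-df F-test with gamma-distributed effect sizes and ncp = (n / 4) * es^2.
ml_terms <- function(theta, Tstat, n) {        # theta = c(shape, rate) of the gamma density
  crit  <- qf(0.95, 1, n - 2)
  dens  <- function(es) dgamma(es, theta[1], theta[2])
  power <- function(es) 1 - pf(crit, 1, n - 2, ncp = (n / 4) * es^2)
  num_lik <- integrate(function(es) df(Tstat, 1, n - 2, ncp = (n / 4) * es^2) * dens(es),
                       0, Inf)$value
  den     <- integrate(function(es) power(es) * dens(es), 0, Inf)$value
  num_pow <- integrate(function(es) power(es)^2 * dens(es), 0, Inf)$value
  c(likelihood = num_lik / den,                # Expression (3)
    mean_power = num_pow / den)                # Expression (4)
}
# The ML estimate of theta maximizes the sum of log likelihood terms over the k studies
# (e.g. with optim() from several starting values); the mean-power estimate then averages
# the mean_power terms over the k studies at the fitted theta.
```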
Z-curve

Z-curve follows traditional meta-analyses that convert p-values into Z-scores as a common metric to integrate results from different original studies (Rosenthal, 1979; Stouffer et al., 1949). The use of Z-scores as a common metric makes it possible to fit a single function to p-values arising from different statistical methods and tests. The method is based on the simplicity and tractability of power analysis for Z-tests, in which the distribution of the test statistic under the alternative hypothesis is just a standard normal shifted by a fixed quantity that plays the role of a non-centrality parameter, and will be denoted by m. Input to z-curve is a sample of p-values, all less than α = 0.05. These p-values are processed in several steps to produce an estimate.

1. Convert p-values to Z-scores. The first step is to imagine, for simplicity, that all the p-values arose from two-tailed Z-tests in which results were in the predicted direction. This is equivalent to an upper-tailed Z-test. In our simulations, alpha was set to .05, which results in a selection criterion of z = 1.96. The conversion to Z-scores (Stouffer et al., 1949) consists of finding the test statistic Z that would have produced that p-value. The formula is

Z = qnorm(1 − p/2). (5)

2. Set aside Z > 6. We set aside extreme z-scores. This avoids fitting a large number of normal distributions to extremely small p-values. This step has no influence on the final result because all of these p-values have an observed power of 1.00 (rounded to the second decimal). It also avoids numerical problems that arise from small p-values being rounded to 0.

3. Fit a finite mixture model. Before selecting for significance and setting aside values above six, the distribution of the test statistic Z given a particular non-centrality parameter value m is normal with mean m. Afterwards, it is a normal distribution truncated on the left at the critical value c (usually 1.96), truncated on the right at 6, and re-scaled to have area one under the curve. Because of heterogeneity in sample size and effect size, the full distribution of Z is an average of truncated normals, with potentially a different value of m for each member of the population. As a simplification, heterogeneity in the distribution of Z is represented as a finite mixture with r components. The model is equivalent to the following two-stage sampling plan. First, select a non-centrality parameter m from m1, . . . , mr according to the respective probabilities w1, . . . , wr. Then generate Z from a normal distribution with mean m and standard deviation one. Finally, truncate and re-scale. Under this approximate model, the probability density function of the test statistic after selection for significance is

f(z) = Σ_{j=1}^{r} wj · dnorm(z − mj) / [pnorm(6 − mj) − pnorm(c − mj)]. (6)

The finite mixture model is only an approximation because it approximates k truncated normal distributions with a smaller set of such distributions. Preliminary studies showed negligible differences between models with three or more components. Thus, the z-curve method that was used in the simulation studies approximated the observed distribution of z-scores between 1.96 and 6 with three truncated standard normal distributions. The observed density was estimated from the observed z-scores using the kernel density estimate (Silverman, 1986) as implemented in R's density function, with the default settings. The default settings are a Gaussian kernel and 512 grid points. The most critical default parameter is the bandwidth. The default bandwidth is 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34, times the sample size raised to the power of minus one fifth (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/density.html). Specifically, the fitting step proceeds as follows. First, obtain the kernel density estimate based on the sample of significant Z values, re-scaling it so that the area under the curve between 1.96 and 6 equals one. To do so, all density values are divided by the sum of the density values times the bandwidth parameter of the density function. Then, numerically choose wj and mj values so as to minimize the sum of absolute differences between Expression (6) and the density estimate.

4. Estimate mean power for Z < 6. The estimate of the rejection probability upon replication for Z < 6 is the area under the curve above the critical value, with weights and non-centrality values from the curve-fitting step. The estimate is

ℓ = Σ_{j=1}^{r} ŵj (1 − pnorm(c − m̂j)), (7)

where ŵ1, . . . , ŵr and m̂1, . . . , m̂r are the values located in Step 3. Note that while the input data are censored both on the left and on the right as represented in Formula (6), there is no truncation in Formula (7) because it represents the distribution of Z upon replication.

5. Re-weight using Z > 6. Let q denote the proportion of the original set of Z statistics with Z > 6. Again, we assume that the probability of significance for those tests is essentially one. Bringing this in as one more component of the mixture estimate, the final estimate of the probability of rejecting the null hypothesis for an exact replication of a randomly selected test is

Zest = (1 − q) ℓ + q · 1 = q + (1 − q) Σ_{j=1}^{r} ŵj (1 − pnorm(c − m̂j)). (8)

By Theorem 1, this is also an estimate of population true mean power after selection. Unlike the other estimation methods, z-curve does not require information about sample size. Unlike p-curve 2.1 and p-uniform, z-curve does not assume a fixed effect size. Finally, z-curve does not make assumptions about the distribution of true effect sizes or true power, but approximates the actual distribution with a weighted combination of three standard normal distributions.
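The compact sketch below shows how steps 2 through 5 could be strung together for a vector z of significant z-scores. It is our simplified illustration of Equations (6) to (8), not the authors' z-curve code: the starting values, the optimizer (Nelder-Mead via optim), and the grid-spacing normalization are our own choices.

```r
# Simplified z-curve sketch (ours): fit Equation (6) to a kernel density estimate of
# significant z-scores and return the mean-power estimate of Equations (7)-(8).
zcurve_sketch <- function(z, crit = 1.96) {
  q <- mean(z > 6)                                    # Step 5: proportion with Z > 6
  z <- z[z > crit & z <= 6]                           # Step 2: keep 1.96 < Z <= 6
  dens <- density(z, from = crit, to = 6)             # Step 3: kernel density estimate
  dens$y <- dens$y / (sum(dens$y) * diff(dens$x)[1])  # rescale to area one on (1.96, 6)
  loss <- function(par) {                             # par = (m1, m2, m3, unnormalized w)
    m <- par[1:3]; w <- abs(par[4:6]); w <- w / sum(w)
    fit <- sapply(dens$x, function(x)
      sum(w * dnorm(x - m) / (pnorm(6 - m) - pnorm(crit - m))))  # Equation (6)
    sum(abs(fit - dens$y))
  }
  est <- optim(c(2, 3, 4, 1, 1, 1), loss)$par
  m <- est[1:3]; w <- abs(est[4:6]); w <- w / sum(w)
  ell <- sum(w * (1 - pnorm(crit - m)))               # Equation (7)
  q + (1 - q) * ell                                   # Equation (8): estimated mean power
}
```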
Simulations

The simulations reported here were carried out using the R programming environment (R Core Team, 2017), distributing the computation among 70 quad-core Apple iMac computers. The R code is available in the supplementary materials at https://osf.io/bvraz.

In the simulations, the four estimation methods (p-curve 2.1, p-uniform, maximum likelihood and z-curve) were applied to samples of significant chi-squared or F statistics, all with p < 0.05. This covers most cases of interest, since t statistics may be squared to yield F statistics, while Z may be squared to yield chi-squared with one degree of freedom.

Heterogeneity in Sample Size Only: Effect Size Fixed

Sample sizes after selection for significance were randomly generated from a Poisson distribution with mean 86, so that they were approximately normal, with population mean 86 and population standard deviation 9.3. Population mean power, the number of test statistics on which the estimates were based, the type of test (chi-squared or F) and the (numerator) degrees of freedom were varied in a complete factorial design. Within each combination, we generated 10,000 samples of significant test statistics and applied the four estimation methods to each sample. In these simulations, it was not necessary to simulate test statistic values and then literally select those that were significant. A great deal of computation was saved by using the R functions rsigF and rsigCHI (available from the supplementary materials) to simulate directly from the distribution of the test statistic after selection. A description of the simulation method and a proof of its correctness are given in the appendix.

The first simulation had a 4 × 5 × 3 design with true power after selection for significance (.05, .25, .50, and .75), number of test statistics k on which estimates were based (15, 25, 50, 100, and 250), and numerator degrees of freedom (just degrees of freedom for the chi-squared tests; 1, 3, and 5) as factors. To obtain the desired levels of power, we used the effect size metric f for F-tests and w for chi-squared tests (Cohen, 1988, p. 216). Because the pattern of results was similar for F-tests and chi-squared tests and for different degrees of freedom, we only report details for F-tests with one numerator degree of freedom; preliminary data mining of the psychological literature suggests that this is the case most frequently encountered in practice. Full results are given in the supplementary materials.

Average performance. Table 1 shows means and standard deviations of estimated mean power based on 10,000 simulations in each cell of the design. Differences between the estimates and the true values represent systematic bias in the estimates. The results show that all methods performed fairly well, with z-curve showing more bias than the other methods, especially for small sets of studies.

Absolute error of estimation. Although the standard deviations in Table 1 provide some information about estimation errors in individual simulations, we also computed mean absolute errors, abs(True Power − Estimated Power), to supplement this information (Table 2). With 50% power, at least 100 studies would be needed to reduce the mean absolute error to less than 6% for all methods. Thus, fairly large sets of studies are needed to obtain precise estimates of mean power.

Heterogeneity in Both Sample Size and Effect Size

The results of the first simulation study were reassuring in that our methods performed well under conditions that were consistent with model assumptions. P-curve, p-uniform and the ML model performed better than z-curve because they used information about sample sizes and correctly assumed that all studies have the same population effect size. However, our main goal was to test these methods under more realistic conditions where effect sizes vary across studies.

To model heterogeneity in effect size, we let effect size before selection vary according to a gamma distribution (Johnson et al., 1995), a flexible continuous distribution taking positive values. Sample size before selection remained Poisson distributed with a population mean of 86. For convenience, sample size and effect size were independent before selection for significance.
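For readers who want to reproduce this kind of heterogeneous condition without the supplementary rsigF function, the sketch below generates significant one-df F statistics by literal selection. It is a slow but transparent alternative of our own, under assumed gamma parameters; the shape and rate values in the usage comment are illustrative, not the paper's settings.

```r
# Sketch (ours): naive simulation of k significant one-df F statistics under heterogeneity
# in both sample size (Poisson) and effect size (gamma), by literal selection.
# The paper's rsigF samples from the post-selection distribution directly and is much faster.
sim_selected_F <- function(k, shape, rate, mean_n = 86) {
  Tstat <- n <- numeric(k)
  kept <- 0
  while (kept < k) {
    n_i  <- rpois(1, mean_n)
    es_i <- rgamma(1, shape, rate)                    # Cohen's d before selection
    T_i  <- rf(1, 1, n_i - 2, ncp = (n_i / 4) * es_i^2)
    if (T_i > qf(0.95, 1, n_i - 2)) {                 # keep only significant results
      kept <- kept + 1
      Tstat[kept] <- T_i
      n[kept] <- n_i
    }
  }
  data.frame(Tstat = Tstat, n = n)
}
# Example with illustrative parameters: a gamma with shape 4 and rate 8 gives effect sizes
# with mean .5 and SD .25 before selection.
# sig <- sim_selected_F(1000, shape = 4, rate = 8)
```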
The maximum likelihood model correctly assumed a gamma distribution for effect size, and the likelihood search was over the two parameters of the gamma distribution.

Table 1
Average estimated population mean power for heterogeneity in sample size only (SD in parentheses): F-tests with numerator df = 1

                          Number of Tests
                 15       25       50       100      250
Population Mean Power = .05
P-curve 2.1     .083     .073     .064     .059     .055
               (.059)   (.039)   (.024)   (.015)   (.007)
P-uniform       .076     .067     .061     .058     .054
               (.050)   (.032)   (.019)   (.012)   (.006)
ML-model        .076     .067     .061     .057     .054
               (.050)   (.033)   (.020)   (.012)   (.006)
Z-curve         .086     .071     .058     .049     .040
               (.088)   (.065)   (.044)   (.031)   (.019)
Population Mean Power = .25
P-curve 2.1     .269     .261     .256     .253     .251
               (.156)   (.128)   (.095)   (.069)   (.046)
P-uniform       .256     .253     .252     .251     .251
               (.147)   (.121)   (.089)   (.065)   (.042)
ML-model        .260     .255     .253     .251     .251
               (.146)   (.120)   (.087)   (.064)   (.042)
Z-curve         .314     .305     .293     .280     .268
               (.155)   (.127)   (.093)   (.068)   (.045)
Population Mean Power = .50
P-curve 2.1     .484     .491     .496     .497     .499
               (.175)   (.139)   (.102)   (.073)   (.046)
P-uniform       .473     .485     .493     .496     .499
               (.170)   (.132)   (.097)   (.070)   (.044)
ML-model        .479     .489     .495     .497     .499
               (.166)   (.130)   (.095)   (.068)   (.043)
Z-curve         .513     .516     .513     .508     .502
               (.151)   (.121)   (.091)   (.068)   (.045)
Population Mean Power = .75
P-curve 2.1     .728     .736     .742     .747     .749
               (.128)   (.098)   (.069)   (.048)   (.030)
P-uniform       .721     .732     .740     .746     .748
               (.126)   (.097)   (.067)   (.047)   (.029)
ML-model        .728     .736     .742     .747     .749
               (.121)   (.093)   (.065)   (.045)   (.028)
Z-curve         .704     .712     .717     .723     .728
               (.105)   (.084)   (.064)   (.048)   (.033)

The other three methods were not modified in any way. P-curve 2.1 and p-uniform continued to assume a fixed effect size, and z-curve continued to assume heterogeneity in the non-centrality parameter without distinguishing between heterogeneity in sample size and heterogeneity in effect size.

We used the same design as in Study 1 with one additional factor: the amount of heterogeneity in effect size, as represented by the standard deviation of the effect size distribution.

Table 2
Mean absolute error of estimation (in percentage points) for heterogeneity in sample size only: F-tests with numerator df = 1

                          Number of Tests
                 15       25       50       100      250
Population Mean Power = .05
P-curve 2.1     3.32     2.25     1.41     0.93     0.52
P-uniform       2.57     1.75     1.11     0.76     0.43
ML-model        2.59     1.74     1.09     0.73     0.39
Z-curve         6.53     4.90     3.38     2.44     1.79
Population Mean Power = .25
P-curve 2.1    12.94    10.49     7.69     5.53     3.64
P-uniform      12.11     9.87     7.17     5.18     3.38
ML-model       12.07     9.76     7.05     5.10     3.32
Z-curve        13.55    11.09     8.21     5.96     3.87
Population Mean Power = .50
P-curve 2.1    14.32    11.20     8.14     5.80     3.67
P-uniform      13.93    10.68     7.80     5.56     3.51
ML-model       13.61    10.41     7.60     5.39     3.41
Z-curve        12.42     9.91     7.44     5.48     3.59
Population Mean Power = .75
P-curve 2.1     9.77     7.59     5.38     3.72     2.35
P-uniform       9.79     7.59     5.34     3.71     2.32
ML-model        9.33     7.23     5.11     3.53     2.21
Z-curve         8.34     6.96     5.56     4.30     3.13

Figure 3 shows the distribution of effect sizes after selection for significance for three levels of heterogeneity, standard deviation of effect size after selection (0.10, 0.20 or 0.30), crossed with three levels of true population mean power (0.25, 0.50 or 0.75). Effect sizes were transformed into Cohen's d for ease of interpretation. We dropped the condition with 5% power because it implies a fixed effect size of 0. We also varied the number of test statistics in a simulation (k = 100, 250, 500, 1,000 or 2,000), the experimental degrees of freedom (1, 3 or 5), and the type of test (F or chi-squared).
Within each cell of the design, ten thousand significant test statistics were randomly generated, and population mean power was estimated using all four methods. For brevity, we only present results for F-tests with numerator df = 1. Full results are given in the supplementary materials.

In our simulations with heterogeneity in effect sizes, maximum likelihood is computationally demanding. Using R's integrate function, the calculation involves fitting a histogram to each curve and then adding the areas of the bars. Numerical accuracy is an issue, especially for ratios of areas when the denominators are very small. In addition, it is necessary to try more than one starting value to have a hope of locating the global maximum, because the likelihood function has many local maxima. In our simulations, we used three random starting points.

Figure 3. Distribution of effect sizes (Cohen's d) for the simulations in Study 2. Heterogeneity: black = .1, blue = .2, red = .3. Power: solid = 25%, dots = 50%, dashes = 75%.

The ML model benefited from the fact that it assumed a gamma distribution of effect sizes, which matched the simulated effect size distributions. In contrast, z-curve made no assumptions, and the other two methods falsely assumed a fixed effect size.

Average performance. Table 3 shows estimated population mean power as a function of true population mean power. Results were consistent with the differences in assumptions. P-curve 2.1 and p-uniform overestimated mean power, and this bias increased with increasing heterogeneity and increasing mean power. Z-curve estimates were actually better than in the previous simulations with fixed effect sizes. The maximum likelihood model had the best fit, presumably because it anticipated the actual effect size distribution.

Absolute error of estimation. Table 4 shows the mean absolute error of estimation. It confirms the pattern of results seen in Table 3. Most important are the large absolute errors for the two methods that assumed a fixed effect size. These large absolute mean differences are obtained despite small standard deviations because p-curve 2.1 and p-uniform systematically overestimate mean power. Large sample sizes cannot correct for systematic estimation errors.
These results show that fixed effect size models cannot be used for the estimation of mean power when there is substantial heterogeneity in power.

Table 3
Average estimated power (SD in parentheses) for heterogeneity in sample size and effect size based on k = 1,000 F-tests with numerator df = 1

                      Standard Deviation of es
                   0.1        0.2        0.3
Population Mean Power = 0.25
P-curve 2.1       .225       .272       .320
                 (.024)     (.033)     (.039)
P-uniform         .294       .694       .949
                 (.029)     (.056)     (.028)
MaxLike           .230       .269       .283
                 (.069)     (.016)     (.015)
Z-curve           .233       .225       .226
                 (.027)     (.026)     (.024)
Population Mean Power = 0.50
P-curve 2.1       .549       .679       .757
                 (.024)     (.027)     (.026)
P-uniform         .602       .913       .995
                 (.024)     (.019)     (.003)
MaxLike           .501       .502       .506
                 (.025)     (.019)     (.019)
Z-curve           .504       .492       .487
                 (.026)     (.026)     (.025)
Population Mean Power = 0.75
P-curve 2.1       .824       .928       .962
                 (.013)     (.009)     (.006)
P-uniform         .861       .992      1.000
                 (.012)     (.003)     (.000)
MaxLike           .752       .750       .750
                 (.022)     (.017)     (.014)
Z-curve           .746       .755       .760
                 (.021)     (.017)     (.016)

The results also show that the differences between z-curve and the ML model are slight and have no practical significance. The good performance of z-curve is encouraging because it does not require assumptions about the effect size distribution.

Violating the Assumptions of the ML Model

In the preceding simulation study, heterogeneity in effect size before selection was modeled by a gamma distribution, with effect size independent of sample size before selection. The maximum likelihood model had a substantial and arguably unfair advantage, since the simulation was consistent with the assumptions of the ML model. It is well known that maximum likelihood models are very accurate compared to other methods when their assumptions are met (Stuart & Ord, 1999, Ch. 18). We used a beta distribution of effect sizes to examine how the ML model performs when its assumption of a gamma distribution is violated.

Table 4
Mean absolute error of estimation in percentage points, for heterogeneity in sample size and gamma effect size based on k = 1,000 F-tests with numerator df = 1

                      Standard Deviation of es
                   0.1        0.2        0.3
Population Mean Power = 0.25
P-curve 2.1       2.87       3.16       7.08
P-uniform         4.50      44.38      69.90
MaxLike           3.55       2.06       3.34
Z-curve           2.59       3.08       2.90
Population Mean Power = 0.50
P-curve 2.1       4.93      17.86      25.70
P-uniform        10.21      41.28      49.54
MaxLike           1.80       1.49       1.50
Z-curve           2.12       2.19       2.23
Population Mean Power = 0.75
P-curve 2.1       7.45      17.75      21.23
P-uniform        11.08      24.17      24.99
MaxLike           1.42       1.18       1.16
Z-curve           1.69       1.42       1.55

In this simulation, z-curve may have the upper hand because it makes no assumptions about the distribution of effect sizes or the correlation between effect sizes and sample sizes. It is well known that selection for significance (e.g., publication bias) introduces a correlation between sample sizes and effect sizes. However, there might also be negative correlations between sample sizes and effect sizes before selection for significance if researchers conduct a priori power analyses to plan their studies or if researchers learn from non-significant results that they need larger samples to achieve significance.

The design of this simulation study was similar to the previous design, but we only simulated the most extreme heterogeneity condition (SD = .3) and added a factor for the correlation between sample size and effect size (r = 0, −.2, −.4, −.6, −.8). As before, we ran 10,000 simulations in each condition.
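The paper does not state how the negative correlation between sample size and effect size was induced before selection. One standard way to do it, shown below purely as an assumption-laden sketch of our own, is a Gaussian copula that couples a Poisson margin for sample size with a beta margin for effect size; the beta shape parameters are illustrative.

```r
# Sketch (ours): inducing a correlation r between Poisson sample sizes and
# beta-distributed effect sizes before selection, via a Gaussian copula.
# The beta parameters are illustrative; the paper does not report its mechanism.
sim_n_es <- function(k, r, mean_n = 86, shape1 = 2, shape2 = 4) {
  z1 <- rnorm(k)
  z2 <- r * z1 + sqrt(1 - r^2) * rnorm(k)        # bivariate normal with correlation r
  n  <- qpois(pnorm(z1), lambda = mean_n)        # Poisson(86) sample sizes
  es <- qbeta(pnorm(z2), shape1, shape2)         # beta effect sizes
  data.frame(n = n, es = es)
}
# cor(sim_n_es(1e5, r = -0.8)) gives a correlation between n and es of roughly -.8.
```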
To make the results comparable to those in Table 4, we show the results for the simulations with k = 1,000 per simulated meta-analysis.

Figure 4 shows the effect size distributions after selection for significance. As before, effect sizes were transformed into Cohen's d values so that they can be compared to the distributions in Figure 3. Only the most extreme correlations of 0 and −.8 are shown to avoid cluttering the figure. As shown in the figure, the correlation has relatively little impact on the distributions.

Figure 4. Effect size distribution for Study 3. Correlation: black = 0, red = −.8. Power: solid = 25%, dots = 50%, dashes = 75%.

Average performance. Table 5 shows average estimated population mean power as a function of the correlation between sample size and effect size and different levels of power. One interesting finding is that the correlation between effect size and sample size has virtually no influence on any of the four estimation methods. This is reassuring because the correlation before selection for significance is typically unknown.

P-curve 2.1 and p-uniform again overestimate mean power. More important is the comparison of the ML model and z-curve. Both methods perform reasonably well with a mean true power of 50%, although z-curve performs slightly better. With low or high power, however, the ML model overestimates mean power by 5 and 8 percentage points, respectively. The bias for z-curve is smaller, although even z-curve overestimates high power by 4 percentage points. We explored the cause of this systematic bias and found that it is caused by the default bandwidth method with smaller sets of studies. When we set the bandwidth to a value of 0.05, the z-curve estimates with a correlation of zero were .235, .492, and .743, respectively.
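For completeness, this is how a fixed bandwidth can be supplied to R's density function in place of the default rule described earlier; z stands for an assumed vector of significant z-scores, and bw = 0.05 is the value mentioned in the bandwidth check above.

```r
# z: assumed vector of significant z-scores between 1.96 and 6
dens_default <- density(z, from = 1.96, to = 6)             # bandwidth from the default nrd0 rule
dens_fixed   <- density(z, bw = 0.05, from = 1.96, to = 6)  # fixed bandwidth of 0.05
```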
Table 5
Average estimated power with beta effect size and sample size correlated with effect size: k = 1,000 F-tests with numerator df = 1

                    Correlation between n and es
                 −.8      −.6      −.4      −.2      .0
Population Mean Power = 0.25
P-curve         .407     .405     .403     .403     .402
               (.043)   (.044)   (.043)   (.044)   (.044)
P-uniform       .853     .852     .852     .852     .852
               (.003)   (.004)   (.003)   (.004)   (.004)
MaxLike         .302     .301     .300     .300     .300
               (.015)   (.015)   (.015)   (.015)   (.015)
Z-curve         .232     .231     .230     .231     .230
               (.015)   (.015)   (.015)   (.015)   (.015)
Population Mean Power = 0.50
P-curve         .839     .840     .841     .841     .841
               (.022)   (.022)   (.022)   (.022)   (.022)
P-uniform       .906     .906     .906     .906     .906
               (.004)   (.004)   (.004)   (.004)   (.004)
MaxLike         .532     .533     .533     .534     .534
               (.018)   (.018)   (.019)   (.019)   (.019)
Z-curve         .493     .494     .495     .495     .495
               (.023)   (.023)   (.023)   (.023)   (.023)
Population Mean Power = 0.75
P-curve         .990     .991     .992     .992     .992
               (.002)   (.002)   (.002)   (.002)   (.002)
P-uniform       .964     .966     .966     .967     .967
               (.003)   (.003)   (.003)   (.003)   (.003)
MaxLike         .826     .832     .836     .838     .840
               (.016)   (.016)   (.015)   (.015)   (.015)
Z-curve         .785     .790     .793     .794     .796
               (.013)   (.013)   (.013)   (.012)   (.012)

Discussion

In this paper, we have compared four methods for estimating the mean statistical power of a heterogeneous population of significance tests, after selection for significance. We have discovered and formally proved a set of theorems relating the distribution of power values before and after selection for significance.

Mean Power and Replicability

Several events in 2011 have triggered a crisis of confidence about the replicability and credibility of published findings in psychology journals. As a result, there have been various attempts to assess the replicability of published results. The most impressive evidence comes from the Open Science Reproducibility Project, which conducted 100 replication studies from articles published in 2008. The key finding was that 50% of significant results from cognitive psychology could be replicated successfully, whereas only 25% of significant results from social psychology could be replicated successfully (Open Science Collaboration, 2015).

Social psychologists have questioned these results. Their main argument is that the replication studies were poorly done: "Nosek's ballyhooed finding that most psychology experiments didn't replicate did enormous damage to the reputation of the field, and that its leaders were themselves guilty of methodological problems" (Nisbett, quoted in Bartlett, 2018).

Estimating mean power provides an empirical answer to the question whether replication failures are caused by problems with the original studies or with the replication studies. If the original studies achieved significance only by means of selection for significance or other questionable research practices, estimated mean power would be low. In contrast, if original studies had good power and replication failures are due to methodological problems of the replication studies, estimated mean power would be high.

We have applied z-curve to the original studies that were replicated in the Open Science project and found an estimate of 66% mean power (Schimmack & Brunner, 2016). This estimate is higher than the overall success rate of 37% for the actual replication studies. This suggests (but not conclusively) that problems with conducting exact replication studies contributed partially to the low success rate of 37%. At the same time, the estimate of 66% is considerably lower than the success rate of 97% for the original studies.
This discrepancy shows that success rates in journals are inflated by selection for significance, which partially explains replication failures in psychology, especially in social psychology.

This example shows that estimates of mean power provide useful information for the interpretation of replication failures. Without this information, precious resources might be wasted on further replication studies that fail simply because the original results were selected for significance.

Historic Trends in Power

Our statistical approach of estimating mean power is also useful for examining changes in statistical power over time. So far, power analyses of psychology have relied on fixed values of effect sizes that were recommended by Cohen (1962, 1988). However, actual effect sizes may change over time or from one field to another. Z-curve makes it possible to examine what the actual power in a field of study is and whether this power has changed over time. Despite much talk about improvement in psychological science in response to the replication crisis, mean power has increased by less than 5 percentage points since 2011, and improvements are limited to social psychology (Schimmack, 2018b).

Mean Power as a Quality Indicator

One problem in psychological science is the use of quantitative indicators like the number of publications or the number of studies per article to evaluate the productivity and quality of psychological scientists. We believe that mean power is an important additional indicator of good science.

A single study with good power provides more credible evidence and more sound theoretical foundations than three or more studies with low power that were selected from a larger population of studies with non-significant results (Schimmack, 2012). However, without quantitative information about power, it is unclear whether reported results are trustworthy or not. Reporting the mean power of studies from a lab or a particular field of research can provide this information. This information can be used by journalists or textbook writers to select articles that reported credible empirical evidence that is likely to replicate in future studies.

P-Curve Estimates of Mean Power

Simonsohn et al. (2017) provided users with a free online app to compute mean power. However, they did not report the performance of their method in simulation studies, and their method has not been peer-reviewed. We evaluated their online method and found that the current online method, p-curve 4.06, overestimates mean power under conditions of heterogeneity (Schimmack & Brunner, 2017). Moreover, even heterogeneity in sample sizes alone can produce biased estimates with p-curve 4.06 (Brunner, 2018).

However, we agree with Simonsohn et al. (2014b) that p-curve 2.0 can be used for the estimation of mean effect sizes and that these estimates are relatively bias-free even when there is moderate heterogeneity in effect sizes. Importantly, these estimates are only unbiased for the population of studies that produced significant results; they are inflated estimates for the population of studies before selection for significance.

Failing to distinguish these two populations of studies (i.e., before and after selection for significance) has produced a lot of confusion and unnecessary criticism of selection models in general (McShane et al., 2016).
While it is difficult to obtain accurate estimates of effect sizes or power before selection for significance from the subset of studies that were selected for significance, p-curve 2.0 provides reasonably good estimates of effect sizes after selection for significance, which is the reason we built p-curve 2.1 in the first place. However, p-curve 2.1, and especially p-curve 4.06, produce biased estimates of mean power even for the set of studies selected for significance. Therefore, we do not recommend using p-curve to estimate mean power.

P-uniform Estimation of Mean Power

Unlike p-curve, the authors of p-uniform limited their method to the estimation of effect sizes before selection for significance. We used their estimation method to create a method for the estimation of mean power after selection. Like p-curve, the method had problems with heterogeneity in effect sizes and performed even worse than p-curve. Recently, the developers of p-uniform changed the estimation method to make it more robust in the presence of heterogeneity and outliers (van Aert et al., 2016).

The new approach simply averages the rescaled p-values and finds the effect size that produces a mean p-value of 0.50. This method is called the Irwin-Hall method. We conducted new simulation studies with this method for the no-correlation condition in Table 5 for 25%, 50%, and 75% true power. We found that it performed much better (24%, 76%, 99%) than the old p-uniform method (85%, 91%, 97%), and slightly better than p-curve 2.1 (40%, 84%, 99%). However, the method still produces inflated estimates for medium and high mean power.

Maximum Likelihood Model

Our ML model is similar to Hedges and Vevea's (1996) ML method that corrects for publication bias in effect size meta-analyses. Although this model has rarely been used in actual applications, it received renewed attention during the current replication crisis. McShane et al. (2016) argued that p-curve and p-uniform produced biased effect size estimates, whereas a heterogeneous ML model produced accurate estimates. However, their focus was on estimating the average effect size before selection for significance. This aim is different from our aim to estimate mean power after selection for significance. Moreover, in their simulation studies the ML model benefited from the fact that the model assumed a normal distribution of effect sizes and this was the distribution of effect sizes in the simulation study. In our simulation studies, the ML model also performed very well when the simulation data met the model assumptions. However, estimates were biased when the model assumptions differed from the effect size distribution in the data.

Hedges and Vevea (1996) also found that their ML model is sensitive to the actual distribution of population effect sizes, which is unknown. The main advantage of z-curve over ML models is that it does not make any distributional assumptions about the data. However, this advantage is limited to the estimation of mean power. Whether it is possible to develop finite mixture models without distributional assumptions for the estimation of the mean effect size after selection for significance remains to be examined.

Future Directions

One concern about z-curve was its suboptimal performance when effect sizes were fixed. However, an improved z-curve method may be able to produce better estimates in this scenario as well.
Maximum Likelihood Model

Our ML model is similar to Hedges and Vevea's (1996) ML method that corrects for publication bias in effect size meta-analyses. Although this model has rarely been used in actual applications, it has received renewed attention during the current replication crisis. McShane et al. (2016) argued that p-curve and p-uniform produce biased effect size estimates, whereas a heterogeneous ML model produces accurate estimates. However, their focus was on estimating the average effect size before selection for significance. This aim is different from our aim of estimating mean power after selection for significance. Moreover, in their simulation studies the ML model benefited from the fact that the model assumed a normal distribution of effect sizes and this was also the distribution of effect sizes in the simulated data. In our simulation studies, the ML model likewise performed very well when the simulated data met the model assumptions. However, estimates were biased when the model assumptions differed from the effect size distribution in the data. Hedges and Vevea (1996) also found that their ML model is sensitive to the actual distribution of population effect sizes, which is unknown. The main advantage of z-curve over ML models is that it does not make any distributional assumptions about the data. However, this advantage is limited to the estimation of mean power. Whether it is possible to develop finite mixture models without distributional assumptions for the estimation of the mean effect size after selection for significance remains to be examined.

Future Directions

One concern about z-curve was its suboptimal performance when effect sizes were fixed. However, an improved z-curve method may be able to produce better estimates in this scenario as well. As most studies are likely to have some heterogeneity, we recommend using z-curve as the default method for estimating mean power.

Another issue is to examine the performance of z-curve when researchers have used questionable research practices (John et al., 2012). One questionable research practice is to include multiple dependent variables and to report only those that produced a significant result. This practice is no different from running multiple exact replication studies with the same dependent variable and reporting only the studies that produced significant results for the selected DV. The probability of this result being selected is the true power of the study with the chosen DV, and the probability of this finding being replicated equals the true power for the chosen DV. Power can vary across DVs, but the power of the DVs that were discarded is irrelevant. Things become more complicated, however, if multiple DVs are selected or if only the strongest result is selected among several significant DVs (van Aert et al., 2016).

Some questionable research practices may cause z-curve to underestimate mean power. For example, researchers who conduct studies with moderate power may deal with marginally significant results by removing a few outliers to obtain a just-significant result (John et al., 2012). This would create a pile of z-scores close to the critical value, leading z-curve to underestimate mean power. We recommend inspecting the z-curve plot for this QRP, which should produce a spike in z-scores just above 1.96.

Another issue is that studies may use different significance thresholds. Although most studies use p < .05 (two-tailed) as the criterion, some studies use more stringent criteria, for example to correct for multiple comparisons. Including these results would lead to an overestimation of mean power, just as using p < .05, one-tailed, as the criterion would lead to overestimation, because most studies used the more stringent two-tailed criterion to select for significance. One solution would be to exclude studies that did not use alpha = .05, or to run separate analyses for sets of studies with different criteria for significance. However, such studies are currently so rare that they have no practical consequences for mean power estimates.

Conclusion

Although this article is the first formal introduction of z-curve, we have been writing about z-curve and applications of z-curve on social media since 2015. Thus, there has already been peer-reviewed criticism of our aims and methods before we were able to publish the method itself. We would like to take this opportunity to correct some of these criticisms and to ask future critics to base their criticism on this article.

De Boeck and Jeon (2018) claim that estimation methods for mean power are problematic because they "aim at rather precise replicability inferences based on other not always precise inferences, without knowing the true values of the effect size and whether the effect is fixed or varies" (p. 769). Contrary to this claim, our simulations show that z-curve can provide precise estimates of replicability, that is, of the success rate in a set of exact replication studies, without information about population effect sizes. To do so, only test statistics or exact p-values are needed. If this statistical information (e.g., means, SDs, and N) is not reported, an article does not contain quantitative information.
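For readers who want to see what this input looks like, the following sketch converts reported results into the absolute z-scores that z-curve analyzes, assuming two-tailed p-values. The helper names and the example values are ours for illustration; the authors' own code is in the OSF materials linked below.

# Sketch (ours, not the z-curve code itself): convert reported test results to the
# absolute z-scores used as z-curve input, assuming two-tailed p-values.
z_from_p <- function(p) qnorm(1 - p / 2)                               # exact two-tailed p-value -> |z|
z_from_t <- function(t, df) z_from_p(2 * pt(abs(t), df, lower.tail = FALSE))

z <- c(z_from_p(c(.049, .003, .021)), z_from_t(c(2.10, 3.40), df = c(28, 61)))
round(z, 2)     # only values above qnorm(.975) = 1.96 enter the analysis
# With a large set of tests, hist(z) would also reveal a pile-up just above 1.96,
# the pattern discussed above as a sign of questionable research practices.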
We hope that researchers will use z-curve (https://osf.io/w8nq4) to estimate mean power when they conduct meta-analyses. Hopefully, the reporting of mean power will help researchers pay more attention to power when they plan future studies, and we might finally see an increase in statistical power, more than 50 years after Cohen (1962) pointed out the importance of power for good psychological science. More awareness of the actual power in psychological science could also be beneficial for grant applications, to fund research projects properly and to reduce the need for questionable research practices that boost apparent power at the cost of inflating the risk of type-I errors. Thus, we hope that estimation of mean power serves the most important goal in science, namely to reduce errors. Conducting studies with adequate power reduces type-II errors (false negatives), and in the presence of selection bias it also reduces type-I errors. The downside appears to be that fewer studies would be published, but underpowered studies selected for significance do not provide sound empirical evidence. Maybe reducing the number of published studies would be beneficial, or to paraphrase Cohen (1990), "Less is more, except for statistical power".

Author Contributions

Most of the ideas in this paper were developed jointly. An exception is the z-curve method, which is solely due to Schimmack. Brunner is responsible for the theorems.

Acknowledgements

We would like to thank Dr. Jeffrey Graham for providing remote access to the computers in the Psychology Laboratory at the University of Toronto Mississauga. Thanks to Josef Duchesne for technical advice.

Conflict of Interest and Funding

No conflict of interest to report. This work was not supported by a specific grant.

Contact Information

Correspondence regarding this article should be sent to: brunner@utstat.toronto.edu

Open Science Practices

This article earned the Open Materials badge for making the materials openly available. Preregistration and Data badges are not applicable for this type of research. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28, 640–646.

Bartlett, T. (2018). I want to burn things to the ground. Retrieved May 30, 2019, from https://www.chronicle.com/article/I-Want-to-Burn-Things-to/244488

Becker, R. A., Chambers, J. M., & Wilks, A. R. (1988). The new S language: A programming environment for data analysis and graphics. Pacific Grove, California, Wadsworth & Brooks/Cole.

Boos, D. D., & Stefanski, L. A. (2012). P-value precision and reproducibility. The American Statistician, 65, 213–221.

Brunner, J. (2018). An even better p-curve. Retrieved May 30, 2019, from https://replicationindex.wordpress.com/2018/05/10/an-even-better-p-curve

Bunge, M. (1998). Philosophy of science. New Brunswick, N.J., Transaction.
Chase, L. J., & Chase, R. B. (1976). Statistical power analysis of applied psychological research. Journal of Applied Psychology, 61, 234–237.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, New Jersey, Erlbaum.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

De Boeck, P., & Jeon, M. (2018). Perceived crisis and reforms: Issues, explanations, and remedies. Psychological Bulletin, 144, 757–777.

Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175–183.

Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate applications. New York, Routledge.

Hedges, L. V., & Vevea, J. L. (1996). Estimating effect size under publication bias: Small sample properties and robustness of a random effects selection model. Journal of Educational and Behavioral Statistics, 21, 299–332.

Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 19–24.

Ioannidis, J. P. (2008). Why most discovered true associations are inflated. Epidemiology, 19(5), 640–646.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 517–523.

Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions (2nd ed.). New York, Wiley.

McShane, B. B., Böckenholt, U., & Hansen, K. T. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11, 730–749.

Morewedge, C. K., Gilbert, D., & Wilson, T. D. (2014). Reply to Francis. Retrieved June 7, 2019, from https://www.semanticscholar.org/paper/REPLY-TO-FRANCIS-Morewedge-Gilbert/019dae0b9cbb3904a671bfb5b2a25521b69ff2cc

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society, Series A, 231, 289–337.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

Popper, K. R. (1959). The logic of scientific discovery. London, England, Hutchinson.

R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. https://www.R-project.org/

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638–641.

Schimmack, U. (2012).
The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.

Schimmack, U. (2015). Post-hoc power curves: Estimating the typical power of statistical tests (t, F) in Psychological Science and Journal of Experimental Social Psychology. Retrieved May 30, 2019, from https://replicationindex.com/2015/06/27/232/

Schimmack, U. (2018a). An introduction to z-curve: A method for estimating mean power after selection for significance (replicability). Retrieved May 30, 2019, from https://replicationindex.com/2018/10/19/an-introduction-to-z-curve

Schimmack, U. (2018b). Replicability rankings. Retrieved May 30, 2019, from https://replicationindex.com/2018/12/29/2018-replicability-rankings

Schimmack, U., & Brunner, J. (2016). How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies. Retrieved May 30, 2019, from http://www.utstat.toronto.edu/~brunner/papers/HowReplicable.pdf

Schimmack, U., & Brunner, J. (2017). Z-curve: A method for the estimation of replicability. Manuscript rejected from AMPPS. Retrieved May 30, 2019, from https://replicationindex.wordpress.com/2017/11/16/preprint-z-curve-a-method-for-the-estimating-replicability-based-on-test-statistics-in-original-studies-schimmack-brunner-2017

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316.

Silverman, B. W. (1986).
Density estimation. London, Chapman & Hall.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). P-curve: A key to the file drawer. Journal of Experimental Psychology: General, 143, 534–547.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2017). P-curve app 4.06. Retrieved May 30, 2019, from http://www.p-curve.com

Sterling, T. D. (1959). Publication decision and the possible effects on inferences drawn from tests of significance – or vice versa. Journal of the American Statistical Association, 54, 30–34.

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.

Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A., & Williams, R. M., Jr. (1949). The American soldier, vol. 1: Adjustment during army life. Princeton, Princeton University Press.

Stuart, A., & Ord, J. K. (1999). Kendall's advanced theory of statistics, vol. 2: Classical inference & the linear model (5th ed.). New York, Oxford University Press.

van Aert, R. C. M., Wicherts, J. M., & van Assen, M. A. L. M. (2016). Conducting meta-analyses based on p values: Reservations and recommendations for applying p-uniform and p-curve. Perspectives on Psychological Science, 11, 713–729.

van Assen, M. A. L. M., van Aert, R. C. M., & Wicherts, J. M. (2014). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20, 293–309.

Yuan, K. H., & Maxwell, S. (2005). On the post hoc power in testing mean differences. Journal of Educational and Behavioral Statistics, 30, 141–167.

Appendix

Proofs of the Theorems, with an example

We present proofs of six theorems about the relationship between power and the outcome of replication studies. The first two theorems are assumptions of z-curve. The other four theorems are theoretically interesting, very useful for simulation studies, and can be used to further develop z-curve in the future. The theorems are also illustrated with a numerical example. Consider a population of F-tests with 3 and 26 degrees of freedom, and varying true power values. Variation in power comes from variation in the non-centrality parameter, which is sampled from a chi-squared distribution with degrees of freedom chosen so that population mean power is very close to 0.80. Denoting a randomly selected power value by G and the non-centrality parameter by λ, population mean power is
\[
E(G) = \int_0^\infty \bigl(1 - \texttt{pf}(c,\, \mathrm{ncp} = \lambda)\bigr)\, \texttt{dchisq}(\lambda)\, d\lambda .
\]
To verify the numerical value of expected power for the example,

> alpha = 0.05; criticalvalue = qf(1-alpha,3,26)
> fun = function(ncp,DF)
+   (1 - pf(criticalvalue,df1=3,df2=26,ncp))*dchisq(ncp,DF)
> integrate(fun,0,Inf,DF=14.36826)
0.8000001 with absolute error < 5.9e-06

The strange fractional degrees of freedom were located with the R function uniroot, by solving for the degrees of freedom value at which the output of integrate equals 0.8. The solution was 14.36826.
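One way to reproduce this search (a sketch rather than the original code; the search interval is our choice) is to wrap the integral in a function of the degrees of freedom and hand it to uniroot:

# Reuses fun and criticalvalue from the code above.
meanpower <- function(DF) integrate(fun, 0, Inf, DF = DF)$value   # expected power for chi-squared df
uniroot(function(DF) meanpower(DF) - 0.80, interval = c(1, 50))$root
# returns a value very close to 14.36826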
Theorem 1 Population mean true power equals the overall probability of a significant result.

Proof. Suppose that the distribution of true power is discrete. Again denoting a randomly chosen power value by G, the probability of rejecting the null hypothesis is
\[
\Pr\{T > c\} = \sum_g \Pr\{T > c \mid G = g\}\Pr\{G = g\} = \sum_g g \Pr\{G = g\} = E(G), \tag{9}
\]
which is population mean power. If the distribution of power is continuous with probability density function \(f_G(g)\), the calculation is
\[
\Pr\{T > c\} = \int_0^1 \Pr\{T > c \mid G = g\}\, f_G(g)\, dg = \int_0^1 g\, f_G(g)\, dg = E(G). \qquad \square
\]

Continuing with the numerical example, we first sample one million non-centrality parameter values from the chi-squared distribution that yields an expected power of 80%. These values are in the vector NCP. We then calculate the corresponding power values, placing them in the vector Power. Next, we generate one million random F statistics from non-central F distributions, using the non-centrality parameter values in NCP. In the R output below, observe that mean power is very close to the proportion of F statistics exceeding the critical value. This illustrates Theorem 1 for the distribution of power before selection. Needless to say, Theorem 1 applies both before and after selection.

> popsize = 1000000; set.seed(9999)
> NCP = rchisq(popsize,df=14.36826)
> Power = 1 - pf(criticalvalue,df1=3,df2=26,NCP)
> mean(Power)
[1] 0.8002137
> Fstat = rf(popsize,df1=3,df2=26,NCP)
> sigF = subset(Fstat,Fstat>criticalvalue)
> length(sigF)/popsize # Proportion significant
[1] 0.800177

To show how Theorem 1 applies to the distribution of power after selection, the sub-population of power values corresponding to significant results is stored in SigPower. The tests that were significant are repeated (with the same non-centrality parameters), and the test statistics placed in Fstat2. The proportion of test statistics in Fstat2 that are significant is very close to the mean of SigPower. This gives empirical support to the statement that population mean power after selection for significance equals the probability of obtaining a significant result again.

> SigPower = subset(Power,Fstat>criticalvalue)
> mean(SigPower) # Mean power after selection
[1] 0.8274357
> # Replicate the tests that were significant.
> sigNCP = subset(NCP,Fstat>criticalvalue)
> Fstat2 = rf(length(sigF),df1=3,df2=26,ncp=sigNCP)
> # Proportion of replications significant
> length(subset(Fstat2,Fstat2>criticalvalue)) /
+   length(sigF)
[1] 0.827172

Theorem 2 The effect of selection for significance is to multiply the probability of each power value by a quantity equal to the power value itself, divided by population mean power before selection. If the distribution of power is continuous, this statement applies to the probability density function.

Proof. Suppose the distribution of power is discrete. Using Bayes' Theorem,
\[
\Pr\{G = g \mid T > c\} = \frac{\Pr\{T > c \mid G = g\}\Pr\{G = g\}}{\Pr\{T > c\}} = \frac{g \Pr\{G = g\}}{E(G)}. \tag{10}
\]
If the distribution of power is continuous with density \(f_G(g)\),
\[
\Pr\{G \le g \mid T > c\} = \frac{\Pr\{G \le g,\; T > c\}}{\Pr\{T > c\}}
= \frac{\int_0^g \Pr\{T > c \mid G = x\}\, f_G(x)\, dx}{E(G)}
= \frac{\int_0^g x\, f_G(x)\, dx}{E(G)}.
\]
By the Fundamental Theorem of Calculus, the conditional density of power given significance is
\[
\frac{d}{dg}\Pr\{G \le g \mid T > c\} = \frac{g\, f_G(g)}{E(G)}. \qquad \square \tag{11}
\]

For the numerical example we are pursuing by simulation, the density function of power before selection is a technical challenge and we will not attempt it. As a substitute, suppose that power before selection follows a beta distribution, a very flexible family on the interval from zero to one (Johnson et al., 1995).
If power before selection (denoted by G) has a beta distribution with parameters α and β, Theorem 2 says that the density of power after selection (a function of the power value g) is
\[
f(g \mid T > c) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, g^{\alpha-1}(1-g)^{\beta-1}\left(\frac{g}{E(G)}\right)
= \left(\frac{1}{\alpha/(\alpha+\beta)}\right)\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, g^{\alpha}(1-g)^{\beta-1}
\]
\[
= \frac{(\alpha+\beta)\,\Gamma(\alpha+\beta)}{\alpha\,\Gamma(\alpha)\,\Gamma(\beta)}\, g^{\alpha+1-1}(1-g)^{\beta-1}
= \frac{\Gamma(\alpha+1+\beta)}{\Gamma(\alpha+1)\,\Gamma(\beta)}\, g^{\alpha+1-1}(1-g)^{\beta-1},
\]
which is again a beta density, this time with parameters α + 1 and β. M.A.L.M. van Assen has pointed out the similarity of this result to conjugate prior-posterior updating in Bayesian statistics. Figure 5 shows how a beta with α = 2 and β = 4 is transformed into a beta with α = 3 and β = 4.

[Figure 5. Beta density of power before and after selection; the horizontal axis is the power value g, the vertical axis is the density, with curves labeled Before and After.]

Theorem 3 Population mean power after selection for significance equals the population mean of squared power before selection, divided by the population mean of power before selection.

Proof. Suppose that the distribution of power is discrete. Then using (10),
\[
E(G \mid T > c) = \sum_g g\, \frac{g\Pr\{G = g\}}{E(G)} = \frac{E(G^2)}{E(G)}. \tag{12}
\]
If the distribution of power is continuous, (11) is used to obtain
\[
E(G \mid T > c) = \int_0^1 g\, \frac{g\, f_G(g)}{E(G)}\, dg = \frac{E(G^2)}{E(G)}. \qquad \square \tag{13}
\]

In the example, SigPower contains the sub-population of power values corresponding to significant results. Observe the verification of Formula 13.

> # Repeating ...
> SigPower = subset(Power,Fstat>criticalvalue)
> mean(SigPower)
[1] 0.8274357
> mean(Power^2)/mean(Power)
[1] 0.8275373

Theorem 4 Population mean power before selection equals one divided by the population mean of the reciprocal of power after selection.

Proof. Using Formula 10,
\[
E\!\left(\frac{1}{G}\,\Big|\, T > c\right) = \sum_g \left(\frac{1}{g}\right) \frac{g\Pr\{G = g\}}{E(G)}
= \frac{1}{E(G)}\sum_g \Pr\{G = g\} = \frac{1}{E(G)}\cdot 1 = \frac{1}{E(G)},
\]
so that
\[
E(G) = 1 \Big/ E\!\left(\frac{1}{G}\,\Big|\, T > c\right).
\]
A similar calculation applies in the continuous case. \(\square\)

To illustrate Theorem 4, recall that the example was constructed so that mean power before selection was equal to 0.80.

> 1/mean(1/SigPower)
[1] 0.8000502

In the example, population mean power is 0.80, while population mean power given significance is roughly 0.83. It is reasonable that selecting significant tests would also tend to select higher power values on average, and in fact this intuition is correct. Since \(\operatorname{Var}(G) = E(G^2) - (E(G))^2 \ge 0\), we have \(E(G^2) \ge (E(G))^2\), and hence \(E(G^2)/E(G) \ge E(G)\). Theorem 3 says \(E(G^2)/E(G) = E(G \mid T > c)\), so that \(E(G \mid T > c) \ge E(G)\). That is, population mean power given significance is greater than the mean power of the entire population, except in the homogeneous case where \(\operatorname{Var}(G) = 0\). The exact amount of increase has a compact and somewhat surprising form.

Theorem 5 The increase in population mean power due to selection for significance equals the population variance of power before selection divided by the population mean of power before selection.

Proof.
\[
E(G \mid T > c) - E(G) = \frac{E(G^2)}{E(G)} - E(G) = \frac{E(G^2)}{E(G)} - \frac{(E(G))^2}{E(G)} = \frac{\operatorname{Var}(G)}{E(G)}. \qquad \square
\]

Illustrating Theorem 5 for the ongoing example,

> mean(SigPower) - mean(Power)
[1] 0.02722205
> var(Power)/mean(Power)
[1] 0.02732371
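As a further check that is not part of the original example, the beta result from Theorem 2 and the ratio in Theorem 3 can be verified together by a direct simulation. For power distributed as Beta(2, 4) before selection, the mean power after selection should equal the mean of a Beta(3, 4) distribution, namely 3/7.

# Sketch: verify Theorems 2 and 3 for power distributed as Beta(2, 4) before selection.
set.seed(123)
g   <- rbeta(1e6, 2, 4)       # power values before selection
sig <- runif(1e6) < g         # each test is significant with probability equal to its power
mean(g[sig])                  # mean power after selection
(2 + 1) / (2 + 1 + 4)         # mean of Beta(3, 4), the after-selection density from Theorem 2
mean(g^2) / mean(g)           # Theorem 3: E(G^2)/E(G) gives the same value

All three quantities are approximately 3/7 = 0.4286, and the difference between mean(g[sig]) and mean(g) also matches Var(G)/E(G), as Theorem 5 requires.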
Theorem 6 The effect of selection for significance is to multiply the joint distribution of sample size and effect size before selection by power for that sample size and effect size, divided by population mean power before selection.

Proof. Note that power for a given sample size and effect size is \(P\{T > c \mid X = es, N = n\}\). Suppose effect size is discrete. Then
\[
P\{X = es, N = n \mid T > c\} = \frac{P\{X = es, N = n, T > c\}}{P\{T > c\}}
= \frac{P\{T > c \mid X = es, N = n\}\, P\{X = es, N = n\}}{E(G)}
\]
\[
= \left(\frac{P\{T > c \mid X = es, N = n\}}{E(G)}\right) P\{X = es, N = n\},
\]
where E(G) is expected power before selection, equal to \(P\{T > c\}\) by Theorem 1.

Suppose that effect size is continuous with density g(es). The joint distribution of sample size and effect size before selection is determined by \(P\{N = n \mid X = es\}\, g(es)\). The joint distribution after selection is determined by
\[
P\{N = n \mid X = es, T > c\}\, g(es \mid T > c)
= \frac{P\{T > c \mid X = es, N = n\}\, P\{N = n \mid X = es\}\, g(es)}{g(es \mid T > c)\, P\{T > c\}}\; g(es \mid T > c)
\]
\[
= \left(\frac{P\{T > c \mid X = es, N = n\}}{E(G)}\right) P\{N = n \mid X = es\}\, g(es).
\]
It is also possible to write the joint distribution of sample size and effect size as the conditional density of effect size given sample size, times the discrete probability of sample size. That is, the joint distribution before selection is determined by \(g(es \mid N = n)\, P\{N = n\}\), and the joint distribution after selection is determined by
\[
g(es \mid N = n, T > c)\, P\{N = n \mid T > c\}
= \frac{d}{d\,es}\, P\{X \le es \mid N = n, T > c\}\, P\{N = n \mid T > c\}
\]
\[
= \frac{d}{d\,es}\, \frac{P\{X \le es, N = n, T > c\}}{P\{N = n, T > c\}} \cdot \frac{P\{N = n, T > c\}}{P\{T > c\}}
\]
\[
= \frac{1}{E(G)}\, \frac{d}{d\,es} \int_0^{es} P\{T > c \mid X = y, N = n\}\, g(y \mid N = n)\, P\{N = n\}\, dy
\]
\[
= \frac{P\{T > c \mid X = es, N = n\}\, g(es \mid N = n)\, P\{N = n\}}{E(G)}
= \left(\frac{P\{T > c \mid X = es, N = n\}}{E(G)}\right) g(es \mid N = n)\, P\{N = n\}. \qquad \square \tag{14}
\]

Theorem 6 cannot be illustrated for the ongoing numerical example, because the example employs a distribution of the non-centrality parameter, rather than of sample size and effect size jointly. As a substitute, consider that an observed distribution of sample size after selection must imply a distribution of sample size in the unpublished studies before selection. If that distribution is too outlandish (for example, implying an enormous "file drawer" of pilot studies with tiny sample sizes), we may be forced to another model of the research and publication process. Theorem 6 allows one to solve for P{N = n}, the unconditional probability distribution of sample size before selection, though an estimated or hypothesized distribution of effect size given sample size before selection is needed. When sample size and effect size are deemed independent before selection, this is not a serious obstacle. Expression (14) says that \(g(es \mid N = n, T > c)\, P\{N = n \mid T > c\}\) is equal to
\[
\left(\frac{P\{T > c \mid X = es, N = n\}}{E(G)}\right) g(es \mid N = n)\, P\{N = n\},
\]
so that integrating both sides with respect to es,
\[
\int g(es \mid N = n, T > c)\, P\{N = n \mid T > c\}\, d\,es
= P\{N = n \mid T > c\} \int g(es \mid N = n, T > c)\, d\,es
= P\{N = n \mid T > c\} \cdot 1
\]
\[
= \int \left(\frac{P\{T > c \mid X = es, N = n\}}{E(G)}\right) g(es \mid N = n)\, P\{N = n\}\, d\,es
= \left(\frac{P\{N = n\}}{E(G)}\right) \int P\{T > c \mid X = es, N = n\}\, g(es \mid N = n)\, d\,es,
\]
and we have
\[
P\{N = n\} = E(G)\left(\frac{P\{N = n \mid T > c\}}{\int P\{T > c \mid X = es, N = n\}\, g(es \mid N = n)\, d\,es}\right) \tag{15}
\]
The numerator of the fraction is the probability of observing a sample size of n after selection for significance. The denominator is expected power given that sample size, and could be calculated with R's integrate function. By Theorem 1, the quantity E(G) is both population mean power before selection and P{T > c}, the probability of randomly choosing a significant result from the population of tests before selection. In Equation 15, though, it is just a proportionality constant. In practice, one obtains P{N = n} by calculating the fraction in parentheses for each n, and then dividing by the total to obtain numbers that add to one.
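Equation (15) is easy to apply once a test family and an effect size distribution are assumed. The following sketch uses assumptions that are ours alone, for illustration: one-sample z-tests, so that power given n and es is approximately 1 − pnorm(qnorm(.975) − sqrt(n)·es) (ignoring the negligible lower tail), a Beta(2, 6) density of effect size taken to be independent of sample size before selection, and made-up observed sample sizes and frequencies.

# Sketch of Equation (15) under illustrative assumptions (not from the article).
obs_n   <- c(20, 40, 80)            # observed sample sizes after selection
obs_frq <- c(0.5, 0.3, 0.2)         # their relative frequencies, P{N = n | T > c}

exp_power <- function(n) {          # denominator of (15): expected power given n
  integrate(function(es) (1 - pnorm(qnorm(.975) - sqrt(n) * es)) * dbeta(es, 2, 6),
            0, 1)$value
}
w <- obs_frq / sapply(obs_n, exp_power)
w / sum(w)                          # P{N = n} before selection, normalized to sum to one

Because small samples have low expected power, their implied before-selection frequencies are inflated relative to the observed frequencies, which is exactly the "file drawer" reasoning described above.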
Maximum Likelihood

Even though sample size is a random variable, the quantities \(n_1, \ldots, n_k\) are treated as fixed constants. This is similar to the way that x values in normal regression and logistic regression are treated as fixed constants in the development of the theory, even though clearly they are often random variables in practice. Making the estimation conditional on the observed values \(n_1, \ldots, n_k\) allows it to be distribution free with respect to sample size, just as regression and logistic regression are distribution free with respect to x. This is preferable to adopting parametric assumptions about the joint distribution of sample size and effect size.

Suppose there is heterogeneity in both sample size and effect size, and that effect size is continuous. The likelihood function given significance is a product of conditional densities evaluated at the observed values of the test statistics. Each term is the conditional density of the test statistic given both the sample size and the event that the test statistic exceeds its respective critical value. The joint probability distribution of sample size and effect size before selection is determined by the marginal distribution of sample size P{N = n} and the conditional density of effect size given sample size \(g_\theta(es \mid n)\), where θ is a vector of unknown parameters. Denoting the random effect size by X, the conditional density of an observed test statistic T given significance and a particular sample size n is
\[
\frac{d}{dt}\, P\{T \le t \mid T > c, N = n\}
= \frac{d}{dt}\, \frac{P\{T \le t, T > c, N = n\}}{P\{T > c, N = n\}}
= \frac{d}{dt}\, \frac{P\{c < T \le t \mid N = n\}\, P\{N = n\}}{P\{T > c \mid N = n\}\, P\{N = n\}}
\]
\[
= \frac{\frac{d}{dt}\, P\{c < T \le t \mid N = n\}}{P\{T > c \mid N = n\}}
= \frac{\frac{d}{dt} \int_0^\infty P\{c < T \le t \mid N = n, X = es\}\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty P\{T > c \mid N = n, X = es\}\, g_\theta(es \mid n)\, d\,es}
\]
\[
= \frac{\frac{d}{dt} \int_0^\infty \bigl[\, p(t, f_1(n) f_2(es)) - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es}
\]
\[
= \frac{\int_0^\infty \frac{d}{dt}\, p(t, f_1(n) f_2(es))\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es}
= \frac{\int_0^\infty d(t, f_1(n) f_2(es))\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es},
\]
where moving the derivative through the integral sign is justified by dominated convergence. The likelihood function is a product of k such terms. In the main paper, the simplifying assumption that sample size and effect size are independent before selection means that \(g_\theta(es \mid n)\) is replaced by \(g_\theta(es)\), yielding Expression (3).

In the problem of estimating power under heterogeneity in effect size, the unknown parameter is the vector θ in the density of effect size. Let \(\hat\theta\) denote the maximum likelihood estimate of θ. This yields a maximum likelihood estimate of the true power of each individual test in the sample, and then the estimates are averaged to obtain an estimate of mean power. We now give details.

Consider randomly sampling a single test from the population of tests that were significant the first time they were carried out. Let \(T_1\) denote the value of the test statistic the first time a hypothesis is tested, and let \(T_2\) denote the value of the test statistic the second time that particular hypothesis is tested, under exact repetition of the experiment. Conditionally on fixed values of sample size n and effect size es, \(T_1\) and \(T_2\) are independent. By Theorem 1, population mean power after selection is
\[
P\{T_2 > c \mid T_1 > c\} = \sum_n P\{T_2 > c \mid T_1 > c, N = n\}\, P\{N = n \mid T_1 > c\} \tag{16}
\]
This is the expression we seek to estimate. Applying Theorem 3 to the sub-population of tests based on a sample of size n,
\[
P\{T_2 > c \mid T_1 > c, N = n\} = \frac{E(G^2 \mid N = n)}{E(G \mid N = n)}
= \frac{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]^2\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es}. \tag{17}
\]
Substituting (17) into (16) yields
\[
P\{T_2 > c \mid T_1 > c\} = \sum_n \frac{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]^2\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es}\; P\{N = n \mid T_1 > c\}. \tag{18}
\]
Expression (18) has two unknown quantities: the parameter θ of the effect size distribution and \(P\{N = n \mid T_1 > c\}\). For the former quantity we use the maximum likelihood estimate, while the \(P\{N = n \mid T_1 > c\}\) values are estimated by the empirical relative frequencies of sample size, which is the non-parametric maximum likelihood estimate. The result is a maximum likelihood estimate of population mean power given significance:
\[
\frac{1}{k}\sum_{j=1}^{k} \frac{\int_0^\infty \bigl[\, 1 - p(c_j, f_1(n_j) f_2(es)) \,\bigr]^2\, g_{\hat\theta}(es \mid n_j)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c_j, f_1(n_j) f_2(es)) \,\bigr]\, g_{\hat\theta}(es \mid n_j)\, d\,es}.
\]
In the simulations, the density g of effect size is assumed to be gamma, there is no dependence on n, and the parameter θ is the pair (a, b) that parameterizes the gamma distribution.

Simulation

Direct simulation from the distribution of the test statistic given significance. To study the behaviour of an estimation method under selection for significance, it is natural to simulate test statistics from the distribution that applies before selection, and then discard the ones that are not significant. But if one can simulate from the joint distribution of sample size and effect size after selection, the wasteful discarding of non-significant test statistics can be avoided. The idea is to do the simulation in two stages. First, simulate pairs from the joint distribution of sample size and effect size after selection, and calculate a non-centrality parameter using ncp = f1(n) f2(es). Then, using that ncp value, simulate from the distribution of the test statistic given significance. We will now show how to do the second step.

It is well known that if F(t) is the cumulative distribution function of a continuous random variable and U is uniformly distributed on the interval from zero to one, then the random variable \(T = F^{-1}(U)\) has cumulative distribution function F(t). In this case the cumulative distribution function from which we wish to simulate is
\[
P\{T \le t \mid T > c, X = es, N = n\}
= \frac{P\{T \le t,\, T > c \mid X = es, N = n\}}{P\{T > c \mid X = es, N = n\}}
= \frac{P\{c < T \le t \mid X = es, N = n\}}{P\{T > c \mid X = es, N = n\}}
= \frac{p(t, \mathrm{ncp}) - p(c, \mathrm{ncp})}{1 - p(c, \mathrm{ncp})}
\]
for t > c, where as usual ncp = f1(n) f2(es). To obtain the inverse, set u equal to the probability and solve for t, as follows. Denoting the power of the test by γ = 1 − p(c, ncp),
\[
u = \frac{p(t, \mathrm{ncp}) - p(c, \mathrm{ncp})}{1 - p(c, \mathrm{ncp})}
\;\Leftrightarrow\; u\,\bigl(1 - p(c, \mathrm{ncp})\bigr) = p(t, \mathrm{ncp}) - p(c, \mathrm{ncp})
\;\Leftrightarrow\; p(t, \mathrm{ncp}) = u\,\bigl(1 - p(c, \mathrm{ncp})\bigr) + p(c, \mathrm{ncp})
\]
\[
\;\Leftrightarrow\; p(t, \mathrm{ncp}) = \gamma u + 1 - \gamma
\;\Leftrightarrow\; t = q(\gamma u + 1 - \gamma,\, \mathrm{ncp}).
\]
Accordingly, let U be a Uniform(0,1) random variable. The significant test statistic is
\[
T = q(\gamma U + 1 - \gamma,\, \mathrm{ncp}) = q(1 + \gamma(U - 1),\, \mathrm{ncp}) = q(1 - \gamma(1 - U),\, \mathrm{ncp}).
\]
Since 1 − U also has a Uniform(0,1) distribution, one may proceed as follows. For a given sample size and effect size, first calculate the non-centrality parameter ncp = f1(n) f2(es), and use that to compute the power value γ = 1 − p(c, ncp). Then calculate the significant test statistic
\[
T = q(1 - \gamma U,\, \mathrm{ncp}), \tag{19}
\]
where U is a pseudo-random variate from a Uniform(0,1) distribution. In R, the process can be applied to a vector of ncp values and a vector of independent U values of the same length.
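The following sketch of Equation (19), written for the F(3, 26) example, is ours rather than the article's simulation code. Given a vector of non-centrality parameter values, it returns test statistics that are significant by construction, so nothing has to be discarded.

# Sketch of Equation (19) for the F(3, 26) example (ours; not the article's simulation code).
rsigF <- function(ncp, df1 = 3, df2 = 26, alpha = 0.05) {
  crit  <- qf(1 - alpha, df1, df2)
  gamma <- 1 - pf(crit, df1, df2, ncp)      # power for each ncp value
  u     <- runif(length(ncp))
  qf(1 - gamma * u, df1, df2, ncp)          # Equation (19); every value exceeds crit
}
set.seed(1)
ncp <- rchisq(1e5, df = 14.36826)           # any collection of ncp values will do
all(rsigF(ncp) > qf(0.95, 3, 26))           # TRUE: no simulated statistic is discarded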
Again, this is the second step. The first step is to simulate a collection of ncp values using the desired joint distribution of sample size and effect size after selection for significance. Naturally, simulation is easiest if sample size and effect size come from well-known distributions with built-in random number generation, and if sample size and effect size are specified to be independent after selection. In one of our simulations, sample size and effect size after selection were correlated. The next section describes how this was done.

Correlated sample size and effect size. Let effect size X have density \(g_\theta(es)\), where θ represents a vector of parameters for the distribution of effect size. Conditionally on X = es, let sample size be Poisson distributed with expected value \(\exp(\beta_0 + \beta_1 es)\). This is standard Poisson regression. Simulation from the joint distribution is easy. One simply simulates an effect size es according to the density g, computes the Poisson parameter \(\lambda = \exp(\beta_0 + \beta_1 es)\), and then samples a value n from a Poisson distribution with parameter λ. The challenge is to choose the parameters θ, \(\beta_0\) and \(\beta_1\) so that after selection, (a) the population mean power has a desired value, and at the same time (b) the population correlation between sample size and effect size has a desired value. Population mean power is
\[
\gamma = \int_0^\infty \sum_n \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, P\{N = n \mid X = es\}\, g_\theta(es)\, d\,es .
\]
Given values of θ, \(\beta_0\) and \(\beta_1\), this expression can be calculated by numerical integration; recall that \(P\{N = n \mid X = es\}\) is a Poisson probability. The population correlation between sample size and effect size is
\[
\rho = \frac{E(XN) - E(X)E(N)}{SD(X)\, SD(N)},
\]
where SD(·) refers to the population standard deviation of something. The quantities E(X) and SD(X) are direct functions of θ. The standard deviation of sample size is \(SD(N) = \sqrt{E(N^2) - [E(N)]^2}\), where
\[
E(N) = E\bigl(E[N \mid X]\bigr) = \int_0^\infty E[N \mid X = es]\, g_\theta(es)\, d\,es = \int_0^\infty e^{\beta_0 + \beta_1 es}\, g_\theta(es)\, d\,es
\]
and
\[
E(N^2) = E\bigl(E[N^2 \mid X]\bigr) = E\bigl(\operatorname{Var}(N \mid X) + E(N \mid X)^2\bigr)
= \int_0^\infty \bigl(e^{\beta_0 + \beta_1 es} + e^{2\beta_0 + 2\beta_1 es}\bigr)\, g_\theta(es)\, d\,es .
\]
Finally,
\[
E(XN) = \int_0^\infty \sum_n es\; n\, P\{N = n \mid X = es\}\, g_\theta(es)\, d\,es
= \int_0^\infty es\, E(N \mid X = es)\, g_\theta(es)\, d\,es
= \int_0^\infty es\, e^{\beta_0 + \beta_1 es}\, g_\theta(es)\, d\,es .
\]
All these expected values can be calculated by numerical integration using R's integrate function, so that the correlation ρ can be evaluated for any set of θ, \(\beta_0\) and \(\beta_1\) values.

In our simulation of correlated sample size and effect size, \(g_\theta(es)\) was a beta density, re-parameterized so that \(\theta = (\mu, \sigma^2)\) consisted of the mean µ and variance σ². Conditionally on effect size, sample size was Poisson distributed with expected value \(\exp(\beta_0 + \beta_1 es)\). We set the variance of effect size σ² to a fixed value of 0.09, so that the standard deviation of effect size after selection was 0.30, a high value. Given any mean effect size µ and slope \(\beta_1\), the parameter \(\beta_0\) (the intercept of the Poisson regression) was adjusted so that the expected sample size at the mean effect size was equal to 86: \(\beta_0 = \ln(86) - \beta_1 \mu\). With these constraints, the population mean power γ and the correlation ρ were functions of the two free parameters µ and \(\beta_1\). Let \(\gamma_0\) be a desired value of mean power, for example \(\gamma_0 = 0.5\), and let \(\rho_0\) be a desired value of the correlation between sample size and effect size, for example \(\rho_0 = -0.8\). Values of µ and \(\beta_1\) were located by numerically minimizing the function \(f(\mu, \beta_1) = |\gamma - \gamma_0| + |\rho - \rho_0|\). We used R's optim function.
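For completeness, here is a sketch of the simulation step just described. The numerical values of µ, σ², and β1 are placeholders for illustration, not the values located by the optimization, and the conversion from (mean, variance) to the usual beta shape parameters is the standard one.

# Sketch of the correlated sample-size / effect-size simulation step (illustrative parameter values).
mu <- 0.35; sigma2 <- 0.09           # mean and variance of the re-parameterized beta effect-size density
beta1 <- -5                          # Poisson regression slope (negative: larger effects go with smaller n)
beta0 <- log(86) - beta1 * mu        # intercept chosen so that E(N | X = mu) = 86, as described above

# Convert (mean, variance) to the usual beta shape parameters
k <- mu * (1 - mu) / sigma2 - 1
shape1 <- mu * k; shape2 <- (1 - mu) * k

set.seed(42)
es <- rbeta(1e5, shape1, shape2)             # effect sizes
n  <- rpois(1e5, exp(beta0 + beta1 * es))    # sample sizes from the Poisson regression on effect size
cor(n, es)                                   # empirical correlation, to be compared with the target rho0

Each simulated (n, es) pair then yields ncp = f1(n) f2(es), and Equation (19) converts it into a significant test statistic, completing the two-stage scheme.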