Meta-Psychology, 2020, vol 4, MP.2018.874
https://doi.org/10.15626/MP.2018.874
Article type: Original Article
Published under the CC-BY4.0 license
Open data: N/A
Open materials: Yes
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: N/A
Edited by: Marcel van Assen
Reviewed by: Stephen Martin, Jack Davis, Donald Williams, Daniël Lakens and Rink Hoekstra
Analysis reproduced by: Erin Buchanan
All supplementary files can be accessed at OSF: https://doi.org/10.17605/OSF.IO/PEUMW

Estimating Population Mean Power Under Conditions of Heterogeneity and Selection for Significance

Jerry Brunner and Ulrich Schimmack
University of Toronto Mississauga

Abstract

In scientific fields that use significance tests, statistical power is important for successful replications of significant results because it is the long-run success rate in a series of exact replication studies. For any population of significant results, there is a population of power values of the statistical tests on which conclusions are based. We give exact theoretical results showing how selection for significance affects the distribution of statistical power in a heterogeneous population of significance tests. In a set of large-scale simulation studies, we compare four methods for estimating population mean power of a set of studies selected for significance (a maximum likelihood model, extensions of p-curve and p-uniform, and z-curve). The p-uniform and p-curve methods performed well with a fixed effect size and varying sample sizes. However, when there was substantial variability in effect sizes as well as sample sizes, both methods systematically overestimated mean power. With heterogeneity in effect sizes, the maximum likelihood model produced the most accurate estimates when the distribution of effect sizes matched the assumptions of the model, but z-curve produced more accurate estimates when the assumptions of the maximum likelihood model were not met. We recommend the use of z-curve to estimate the typical power of significant results, which has implications for the replicability of significant results in psychology journals.

Keywords: Power estimation, Post-hoc power analysis, Publication bias, Maximum likelihood, Z-curve, P-curve, P-uniform, Effect size, Replicability, Meta-analysis

The purpose of this paper is to develop and evaluate methods for predicting the success rate if sets of significant results were replicated exactly. We call this statistical property the average power of a set of studies. Average power can range from the criterion for a type-I error, if all significant results are false positives, to 100%, if the statistical power of the original studies approaches 1. Average power can be used to quantify the degree of evidential value in a set of studies (Simonsohn et al., 2014b). In the end, we estimate the mean power of studies that were used to examine the replicability of psychological research, and compare the results to actual replication outcomes (Open Science Collaboration, 2015). Estimating the average power of original studies is interesting because it is tightly connected with the outcome of replication studies (Greenwald et al., 1996; Yuan & Maxwell, 2005). To claim that a finding has been replicated, a replication study should reproduce a statistically significant result, and the probability of a successful replication is a function of statistical power.
Thus, if reproducibility is a requirement of good science (Bunge, 1998; Popper, 1959), it follows that high statistical power is a necessary condition for good science. Information about the average power of studies is also useful because selection for significance increases the type-I error rate and inflates effect sizes (Ioannidis, 2008). However, these biases are relatively small if the original studies had high power. Thus, knowledge about the average power of studies is useful for the planning of future studies. If average power is high, replication studies can use the same sample sizes as original studies, but if average power is low, sample sizes need to be increased to avoid false negative results.

Given the practical importance of power for good science, it is not surprising that psychologists have started to examine the evidential value of results published in psychology journals. At present, two statistical methods have been used to make claims about the average power of psychological research, namely p-curve (Simonsohn et al., 2017) and z-curve (Schimmack, 2015, 2018a), but so far neither method has been peer-reviewed.

Statistical Power Before and After A Study Has Been Conducted

Before we proceed, we would like to clarify that the statistical power of a statistical test is defined as the probability of correctly rejecting the null hypothesis (Neyman & Pearson, 1933). This probability depends on the sampling error of a study and the population effect size. The traditional definition of power does not consider effect sizes of zero (false positives) because the goal of a priori power planning is to ensure that a non-zero effect can be demonstrated. However, our goal is not to plan future studies, but to analyze results of existing studies. For post-hoc power analysis, it is impossible to distinguish between true positives and false positives and to estimate the average power conditional on the unknown status of hypotheses (i.e., whether the null hypothesis is true or false). Thus, we use the term average power for the probability of correctly or incorrectly rejecting the null hypothesis (Sterling et al., 1995). This definition of average power includes an unknown percentage of false positives, which have a probability equal to alpha (typically 5%) of reproducing a significant result in a replication attempt. At the same time, we believe that the strict null hypothesis is rarely true in psychological research (Cohen, 1994).

It would be ideal if it were possible to estimate the power of a single statistical test that supports a particular finding. Unfortunately, well-documented problems with the "observed power" method suggest that the goal of estimating the power of an individual test may be out of reach (Boos & Stefanski, 2012; Hoenig & Heisey, 2001). Often the main problem is that estimates for a single result are too variable to be practically useful (Yuan & Maxwell, 2005; but also see Anderson, Kelley, & Maxwell, 2017).

It is important to distinguish our undertaking from that of Cohen (1962) and the follow-up studies by Chase and Chase (1976) and Sedlmeier and Gigerenzer (1989). In Cohen's classic survey of power in the Journal of Abnormal and Social Psychology, the results of the studies were not used in any way. Power was never estimated. It was calculated exactly for a priori effect sizes deemed "small," "medium," and "large."
If a "medium" effect size referred to the population mean (which Cohen never claimed), power at the mean effect size is still not the same as mean power. In contrast, we aim to estimate the mean power given the actual population effect sizes in a set of studies.

Two Populations of Studies

We distinguish two populations of tests. One population contains all tests that have been conducted. This population contains significant and non-significant results. The other population contains the subset of studies that produced a significant result. We focus on the population of studies selected for significance for two reasons.

First, non-significant results are often not available because journal articles mostly report significant results (Rosenthal, 1979; Sterling, 1959; Sterling et al., 1995). Second, only significant results are used as evidence for a theoretical prediction. It is irrelevant how many tests produced non-significant results because these results are inconclusive. As psychological theories mainly rest on studies that produced significant results, only the evidential value of significant results is relevant for evaluations of the robustness of psychology as a science. In short, we are interested in statistical methods that can estimate the average power of a set of studies with significant results.

The Study Selection Model

We developed a number of theorems that specify how selection for significance influences the distribution of power. These theorems are very general. They do not depend on the particular population distribution of power, the significance tests involved, or the Type I error probabilities of those tests. The only requirement is that for every study with a specific population effect size, sample size, and statistical test, the probability of a result being selected is the true power of the study. We discuss the two most important theorems in detail. All six theorems are provided in the appendix, along with an illustration of the theorems by simulation.

Theorem 1 Population mean true power equals the overall probability of a significant result.

Theorem 1 establishes the central importance of population mean power after selection for significance for predicting replication outcomes. Think of a coin-tossing experiment in which a large population of coins is manufactured, each with a different probability of heads; that is, these coins are not fair coins with equal probabilities for both sides. Also consider heads to be successes or wins. Repeatedly tossing the set of coins and counting the number of heads produces an expected value of the number of successes. For example, the experiment may yield 60% heads and 40% tails. While the exact probabilities of heads for the individual coins are unknown, the observable success rate is equivalent to the mean power of all coins. Theorem 1 states that success rate and mean power are equivalent even if the set of coins is a subset of all coins. For example, assume all coins were tossed once and only coins showing heads were retained. Repeating the coin toss experiment, we would still find that the success rate for the set of selected coins matches the mean probability of the selected coins.

Theorem 2 The effect of selection for significance on power after selection is to multiply the probability of each power value by a quantity equal to the power value itself, divided by population mean power before selection.

If the distribution of power is continuous, this statement applies to the probability density function.
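The following small simulation sketch, written by us for illustration (it is not part of the paper's supplementary code), makes both theorems concrete for the uniform distribution of power before selection that is used in Figure 1 below.

```r
# Minimal simulation sketch (ours) illustrating Theorems 1 and 2 for power
# uniformly distributed on (.05, 1) before selection for significance.
set.seed(123)
k <- 1e6
power_before <- runif(k, min = 0.05, max = 1)                   # true power of each study
selected     <- rbinom(k, size = 1, prob = power_before) == 1   # selection for significance
power_after  <- power_before[selected]

mean(power_before)  # mean true power before selection (about .525)
mean(power_after)   # mean true power after selection: the size-biased mean E[P^2]/E[P] (Theorem 2)

# Theorem 1: the success rate of exact replications of the selected studies
# equals their mean true power after selection.
mean(rbinom(length(power_after), size = 1, prob = power_after))
```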
Figure 1 illustrates Theorem 2 for a simple, artificial example in which power before selection is uniformly distributed on the interval from 0.05 to 1.0. The corresponding distribution after selection for significance is triangular; now studies with more power are more likely to be selected.

Figure 1. Uniform distribution of power before selection. The plot shows the density of power before and after selection. Expected power = 0.525 before selection, 0.635 after selection.

In Figure 2, power before selection is less heterogeneous, and higher on average. Consequently, the distributions of power before selection and after selection are much more similar. In both cases, though, mean true power after selection for significance is higher than mean true power before selection for significance.

Figure 2. Example of higher power before selection. The plot shows the density of power before and after selection. Expected power = 0.700 before selection, 0.714 after selection. Note. Power before selection follows a beta distribution with a = 13 and b = 6, multiplied by .95 plus .05, so that it ranges from .05 to 1.

The coin-tossing selection model proposed here may seem overly simplistic and unrealistic. Few researchers conduct a study and give up after a first attempt produces a nonsignificant result. For example, Morewedge et al. (2014) disclosed that they did not report "some preliminary studies that used different stimuli and different procedures and that showed no interesting effects." From a theoretical perspective, it is important that all studies test the same hypothesis, but for our selection model it is not. Even if all studies used exactly the same procedures and had exactly the same power, the probability of being selected into the set of reported studies matches their power, and Theorem 2 holds. Each study that was conducted by Morewedge et al. has an unknown true power to produce a significant result, and Theorem 2 implies (via Theorem 5 in the appendix) that their selected studies with significant results have higher mean power than the full set of studies that were conducted. We are only interested in the statistical power and replicability of the published studies with significant results.

Estimation Methods

In this section, we describe four methods for estimating population mean power under conditions of heterogeneity, after selection for statistical significance.

Notation and statistical background

To present our methods formally, it is necessary to introduce some statistical notation. Rather than using traditional notation from statistics that might make it difficult for non-statisticians to understand our method, we follow Simonsohn et al. (2014a), who employed a modified version of the S syntax (Becker et al., 1988) to represent probability distributions. The S language is familiar to psychologists who use the R statistical software (R Core Team, 2017). The notation also makes it easier to implement our methods in R, particularly in the simulation studies.

The outcome of an empirical study is partially determined by random sampling error, which implies that statistical results will vary across studies. This variation is expected to follow a random sampling distribution. Each statistical test has its own sampling distribution.
We will use the symbol T to denote a general test statistic; it could be a t-statistic, F, chi-squared, Z, or something else. Assume an upper-tailed test, so that the null hypothesis will be rejected at significance level α (usually α = 0.05) when the continuous test statistic T exceeds a critical value c. Typically there is a sample of test statistic values T1, . . . , Tk, but when only one is being considered the subscript will be omitted.

The notation p(t) refers to the probability under the null hypothesis that T is less than or equal to the fixed constant t. The symbol p would represent pnorm if the test statistic were standard normal, pf if the test statistic had an F-distribution, and so on. While p(t) is the area under the curve, d(t) is the value on the y axis for a particular t, as in dnorm. Following the conventions of the S language, the inverse of p is q, so that p(q(t)) = q(p(t)) = t.

Sampling distributions when the null hypothesis is true are well known to psychologists because they provide the foundation of null-hypothesis significance testing. Most psychologists are less familiar with non-central sampling distributions (see Johnson et al., 1995, for a detailed and authoritative treatment). When the null hypothesis is false, the area under the curve of the test statistic's sampling distribution is p(t, ncp), representing particular cases like pf(t, df1, df2, ncp). The initials ncp stand for "non-centrality parameter." This notation applies directly when T has one of the common non-central distributions like the non-central t, F or chi-squared under the alternative hypothesis, but it can be extended to the distribution of any test statistic under any specific alternative, even when the distribution in question is technically not a non-central distribution. The non-centrality parameter is positive when the null hypothesis is false, and statistical power is a monotonically increasing function of the non-centrality parameter. This function is given explicitly by Power = 1 − p(c, ncp). For the most important non-central distributions (Z, t, chi-squared and F), the non-centrality parameter can be factored into the product of two terms. The first term is an increasing function of sample size, and the second term is an increasing function of effect size. In symbols,

ncp = f1(n) · f2(es). (1)

This formula is capable of accommodating different definitions of effect size (Cohen, 1988; Grissom & Kim, 2012) by making corresponding changes to the function f2 in f2(es).

As an example of Equation (1), consider a standard F-test for the difference between the means of two normal populations with a common variance. After some simplification, the non-centrality parameter of the non-central F may be written as

ncp = n ρ (1 − ρ) d²,

where n = n1 + n2 is the total sample size, ρ is the proportion of cases allocated to the first treatment, and d is Cohen's (1988) effect size for the two-sample problem. This expression for the non-centrality parameter can be factored in various ways to match Equation (1); for example, f1(n) = n ρ (1 − ρ) and f2(es) = es². Note that this is just an example; Equation (1) applies to the non-centrality parameters of the non-central Z, t, chi-squared and F distributions in general. Thus, for a given sample size and a given effect size, the power of a statistical test is

Power = 1 − p(c, f1(n) · f2(es)). (2)

In this formula, c is the criterion value for statistical significance; the test is significant if T > c.
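To make Equation (2) concrete, the following R sketch computes power for the two-sample F-test example above. The numerical inputs (total n = 86, equal allocation, Cohen's d = 0.4) are illustrative assumptions of ours, not values taken from the paper.

```r
# Power of a two-sample F-test (numerator df = 1) via Equations (1) and (2).
# The inputs below are illustrative assumptions, not values used in the paper.
n    <- 86                                     # total sample size
rho  <- 0.5                                    # proportion allocated to the first group
d    <- 0.4                                    # Cohen's d
ncp  <- n * rho * (1 - rho) * d^2              # Equation (1): ncp = f1(n) * f2(es)
crit <- qf(0.95, df1 = 1, df2 = n - 2)         # critical value c for alpha = .05
1 - pf(crit, df1 = 1, df2 = n - 2, ncp = ncp)  # Equation (2): power, roughly .45 here
```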
The function f2(es) can also be applied to sets of studies with different traditional effect sizes. For example, es could be Cohen's d, and the alternative effect size es′ could be the point-biserial correlation r (Cohen, 1988, p. 24). Symbolically, es′ = g(es). Since the function g(es) is monotone increasing, a corresponding inverse function exists, so that es = g⁻¹(es′). Then Equation (2) becomes

Power = 1 − p(c, f1(n) · f2(es))
      = 1 − p(c, f1(n) · f2(g⁻¹(es′)))
      = 1 − p(c, f1(n) · f2′(es′)),

where f2′ just means another function f2. That is, if the definition of effect size is changed (in a monotone way), the change is absorbed by the function f2, and Equation (2) still applies.

We are now ready to introduce our four methods for the estimation of mean power based on a set of studies that vary in power, with known sample sizes and unknown population effect sizes. The four methods are p-curve 2.1, p-uniform, the maximum likelihood model, and z-curve.

Estimation Methods

The first two estimation methods are based on methods that were developed for the estimation of effect sizes. Our use of these methods for the estimation of mean power is an extension of them. Our simulation studies should not be considered tests of these methods for the estimation of effect sizes. We developed these extensions simply because power is a function of effect size and sample size, and sample sizes are known. Thus, only estimation of the unknown effect sizes is needed to estimate power with these methods. Power estimation is then a simple additional step: compute power for each study as a function of the effect size estimate and the sample size of that study. These models should work well when all studies have the same effect size and heterogeneity in power is only a function of heterogeneity in sample size, as the models assume.

P-curve 2.1 and p-uniform

A p-curve method for estimation of mean power is available online (www.p-curve.com). It is important to point out that this method differs from the p-curve method that we developed. The online p-curve method is called p-curve 4.06. We built our p-curve method on the effect size p-curve method with the version code p-curve 2.0 (Simonsohn et al., 2014b). Hence, we refer to our p-curve method as p-curve 2.1.

P-uniform is very similar to p-curve (van Assen et al., 2014). Both methods aim to find an effect size that produces a uniform distribution of p-values between .00 and .05. After we developed our p-uniform method for power estimation, a new estimation method was introduced (van Aert et al., 2016). We conducted our studies with the original estimation method, and our results are limited to the performance of this implementation of p-uniform.

To find the best fitting effect size for a set of observed test statistics, p-curve 2.1 and p-uniform compute p-values for various effect sizes and choose the effect size that yields the best approximation of a uniform distribution. If the modified null hypothesis that effect size = es is true, the cumulative distribution function of the test statistic is the conditional probability

F0(t) = Pr{T ≤ t | T > c}
      = [p(t, ncp) − p(c, ncp)] / [1 − p(c, ncp)]
      = [p(t, f1(n) · f2(es)) − p(c, f1(n) · f2(es))] / [1 − p(c, f1(n) · f2(es))],

using ncp = f1(n) · f2(es) as given in Equation (1). The corresponding modified p-value is

1 − F0(T) = [1 − p(T, f1(n) · f2(es))] / [1 − p(c, f1(n) · f2(es))].
Note that since the sample sizes of the tests may differ, the symbols p, n and c as well as T may have different referents for the j = 1, . . . , k test statistics. The subscript j has been omitted to reduce notational clutter. If the modified null hypothesis were true, the modified p-values would have a uniform distribution. Both p-curve 2.1 and p-uniform choose as estimated effect size the value of es that makes the modified p-values most nearly uniform. They differ only in the criterion for deciding when uniformity has been reached. P-curve 2.1 is based on a Kolmogorov-Smirnov test for departure from a uniform distribution, choosing the es value yielding the smallest value of the test statistic. P-uniform is based on a different criterion. Denoting by Pj the modified p-value associated with test j, calculate

Y = − Σ_{j=1}^{k} ln(Pj),

where ln is the natural logarithm. If the Pj values were uniformly distributed, Y would have a Gamma distribution with expected value k, the number of tests. The p-uniform estimate is the modified null hypothesis effect size es that makes Y equal to k, its expected value under uniformity.

These methods are designed for heterogeneity in sample size only, and assume a common effect size for all the tests. Given an estimate of the common effect size, estimated power for each test varies only as a function of sample size and can be computed from Expression (2) because sample sizes are known. Population mean power can then be estimated by averaging the k power estimates.
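As an illustration of the two uniformity criteria, the sketch below implements the modified p-values and both criteria for one-df F-tests with equal allocation, so that f1(n) = n/4 and f2(es) = d². The function names, the search interval, and the use of optimize are our own choices and not the authors' implementation; Tstat and n stand for assumed vectors of significant F statistics and their sample sizes.

```r
# Sketch (ours) of the p-curve 2.1 and p-uniform criteria for one-df F-tests,
# assuming equal allocation so that ncp = (n / 4) * d^2.
cond_p <- function(es, Tstat, n) {                     # modified p-value 1 - F0(T)
  ncp  <- (n / 4) * es^2
  crit <- qf(0.95, 1, n - 2)
  (1 - pf(Tstat, 1, n - 2, ncp)) / (1 - pf(crit, 1, n - 2, ncp))
}
# p-curve 2.1: Kolmogorov-Smirnov distance of the modified p-values from uniformity.
ks_crit <- function(es, Tstat, n) ks.test(cond_p(es, Tstat, n), "punif")$statistic
# p-uniform (original criterion): squared distance of Y = -sum(log P_j) from k.
pu_crit <- function(es, Tstat, n) (-sum(log(cond_p(es, Tstat, n))) - length(Tstat))^2

# Given assumed data Tstat and n, the effect size estimates minimize these criteria, e.g.:
# es_pc <- optimize(ks_crit, c(0.01, 2), Tstat = Tstat, n = n)$minimum
# es_pu <- optimize(pu_crit, c(0.01, 2), Tstat = Tstat, n = n)$minimum
# Estimated mean power then averages Equation (2) over the k studies, e.g. for p-uniform:
# mean(1 - pf(qf(0.95, 1, n - 2), 1, n - 2, (n / 4) * es_pu^2))
```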
Maximum likelihood model

Our maximum likelihood (ML) model also first estimates effect sizes and then combines the effect size estimates with known sample sizes to estimate mean power. Unlike p-curve 2.1 and p-uniform, the ML model allows for heterogeneity in effect sizes. In this way, the model is similar to Hedges and Vevea's (1996) model for effect size estimation before selection for significance. To take selection for significance into account, the likelihood function of the ML model is a product of k conditional densities; each term is the conditional density of the test statistic Tj, given Nj = nj and Tj > cj, the critical value.

Likelihood function. The model assumes that sample sizes and effect sizes are independent before the selection for significance. Suppose that the distribution of effect size before selection is continuous with probability density gθ(es). This notation indicates that the distribution of effect size depends on an unknown parameter or parameter vector θ. In the appendix, it is shown that the likelihood function (a function of θ) is a product of k terms of the form

[ ∫₀^∞ d(tj, f1(nj) · f2(es)) gθ(es) des ] / [ ∫₀^∞ (1 − p(cj, f1(nj) · f2(es))) gθ(es) des ], (3)

where the integrals denote areas under curves that can be computed with R's integrate function. The maximum likelihood estimate is the parameter value yielding the highest product. To be applicable to actual data, the ML model has to make assumptions about the distribution of effect sizes. The ML model that was used in the simulation studies assumed a gamma distribution of effect sizes. A gamma distribution is defined by two parameters that need to be estimated from the data. The effect sizes based on the most likely distribution are then combined with information about sample sizes to obtain power estimates for each study. An estimate of population mean power is then produced by averaging estimated power for the k significance tests. As shown in the appendix, the terms to be averaged are

[ ∫₀^∞ (1 − p(cj, f1(nj) · f2(es)))² gθ̂(es) des ] / [ ∫₀^∞ (1 − p(cj, f1(nj) · f2(es))) gθ̂(es) des ]. (4)
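A minimal sketch of one study's contribution to the likelihood in Expression (3) and to the mean-power estimate in Expression (4) is given below, again assuming one-df F-tests with equal allocation and a gamma distribution of Cohen's d. The function names and the optimizer suggestion are ours, not the authors' implementation.

```r
# Sketch (ours): one study's likelihood term (Expression 3) and power term (Expression 4)
# for a one-df F-test with gamma-distributed effect sizes and ncp = (n / 4) * es^2.
ml_terms <- function(theta, Tstat, n) {        # theta = c(shape, rate) of the gamma density
  crit  <- qf(0.95, 1, n - 2)
  dens  <- function(es) dgamma(es, theta[1], theta[2])
  power <- function(es) 1 - pf(crit, 1, n - 2, ncp = (n / 4) * es^2)
  num_lik <- integrate(function(es) df(Tstat, 1, n - 2, ncp = (n / 4) * es^2) * dens(es),
                       0, Inf)$value
  den     <- integrate(function(es) power(es) * dens(es), 0, Inf)$value
  num_pow <- integrate(function(es) power(es)^2 * dens(es), 0, Inf)$value
  c(likelihood = num_lik / den,                # Expression (3)
    mean_power = num_pow / den)                # Expression (4)
}
# The ML estimate of theta maximizes the sum of log likelihood terms over the k studies
# (e.g. with optim() from several starting values); the mean-power estimate then averages
# the mean_power terms over the k studies at the fitted theta.
```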
Z-curve

Z-curve follows traditional meta-analyses that convert p-values into Z-scores as a common metric to integrate results from different original studies (Rosenthal, 1979; Stouffer et al., 1949). The use of Z-scores as a common metric makes it possible to fit a single function to p-values arising from different statistical methods and tests. The method is based on the simplicity and tractability of power analysis for Z-tests, in which the distribution of the test statistic under the alternative hypothesis is just a standard normal shifted by a fixed quantity that plays the role of a non-centrality parameter, and will be denoted by m. Input to z-curve is a sample of p-values, all less than α = 0.05. These p-values are processed in several steps to produce an estimate.

1. Convert p-values to Z-scores. The first step is to imagine, for simplicity, that all the p-values arose from two-tailed Z-tests in which results were in the predicted direction. This is equivalent to an upper-tailed Z-test. In our simulations, alpha was set to .05, which results in a selection criterion of z = 1.96. The conversion to Z-scores (Stouffer et al., 1949) consists of finding the test statistic Z that would have produced that p-value. The formula is

Z = qnorm(1 − p/2). (5)

2. Set aside Z > 6. We set aside extreme z-scores. This avoids fitting a large number of normal distributions to extremely small p-values. This step has no influence on the final result because all of these p-values have an observed power of 1.00 (rounded to the second decimal). It also avoids numerical problems that arise from small p-values being rounded to 0.

3. Fit a finite mixture model. Before selecting for significance and setting aside values above six, the distribution of the test statistic Z given a particular non-centrality parameter value m is normal with mean m. Afterwards, it is a normal distribution truncated on the left at the critical value c (usually 1.96), truncated on the right at 6, and re-scaled to have area one under the curve. Because of heterogeneity in sample size and effect size, the full distribution of Z is an average of truncated normals, with potentially a different value of m for each member of the population. As a simplification, heterogeneity in the distribution of Z is represented as a finite mixture with r components. The model is equivalent to the following two-stage sampling plan. First, select a non-centrality parameter m from m1, . . . , mr according to the respective probabilities w1, . . . , wr. Then generate Z from a normal distribution with mean m and standard deviation one. Finally, truncate and re-scale. Under this approximate model, the probability density function of the test statistic after selection for significance is

f(z) = Σ_{j=1}^{r} wj · dnorm(z − mj) / [pnorm(6 − mj) − pnorm(c − mj)]. (6)

The finite mixture model is only an approximation because it approximates k truncated normal distributions with a smaller set of such distributions. Preliminary studies showed negligible differences between models with three or more components. Thus, the z-curve method that was used in the simulation studies approximated the observed distribution of z-scores between 1.96 and 6 with three truncated standard normal distributions. The observed density was estimated from the observed z-scores using the kernel density estimate (Silverman, 1986) as implemented in R's density function, with the default settings. The default settings are a Gaussian kernel and 512 grid points. The most critical default parameter is the bandwidth. The default bandwidth is 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34, times the sample size raised to the power of minus one fifth (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/density.html). Specifically, the fitting step proceeds as follows. First, obtain the kernel density estimate based on the sample of significant Z values, re-scaling it so that the area under the curve between 1.96 and 6 equals one. To do so, all density values are divided by the sum of the density values times the bandwidth parameter of the density function. Then, numerically choose wj and mj values so as to minimize the sum of absolute differences between Expression (6) and the density estimate.

4. Estimate mean power for Z < 6. The estimate of the rejection probability upon replication for Z < 6 is the area under the curve above the critical value, with weights and non-centrality values from the curve-fitting step. The estimate is

ℓ = Σ_{j=1}^{r} ŵj (1 − pnorm(c − m̂j)), (7)

where ŵ1, . . . , ŵr and m̂1, . . . , m̂r are the values located in Step 3. Note that while the input data are censored both on the left and on the right as represented in Formula (6), there is no truncation in Formula (7) because it represents the distribution of Z upon replication.

5. Re-weight using Z > 6. Let q denote the proportion of the original set of Z statistics with Z > 6. Again, we assume that the probability of significance for those tests is essentially one. Bringing this in as one more component of the mixture estimate, the final estimate of the probability of rejecting the null hypothesis for an exact replication of a randomly selected test is

Zest = (1 − q) ℓ + q · 1 = q + (1 − q) Σ_{j=1}^{r} ŵj (1 − pnorm(c − m̂j)). (8)

By Theorem 1, this is also an estimate of population true mean power after selection. Unlike the other estimation methods, z-curve does not require information about sample size. Unlike p-curve 2.1 and p-uniform, z-curve does not assume a fixed effect size. Finally, z-curve does not make assumptions about the distribution of true effect sizes or true power, but approximates the actual distribution with a weighted combination of three standard normal distributions.
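The compact sketch below shows how steps 2 through 5 could be strung together for a vector z of significant z-scores. It is our simplified illustration of Equations (6) to (8), not the authors' z-curve code: the starting values, the optimizer (Nelder-Mead via optim), and the grid-spacing normalization are our own choices.

```r
# Simplified z-curve sketch (ours): fit Equation (6) to a kernel density estimate of
# significant z-scores and return the mean-power estimate of Equations (7)-(8).
zcurve_sketch <- function(z, crit = 1.96) {
  q <- mean(z > 6)                                    # Step 5: proportion with Z > 6
  z <- z[z > crit & z <= 6]                           # Step 2: keep 1.96 < Z <= 6
  dens <- density(z, from = crit, to = 6)             # Step 3: kernel density estimate
  dens$y <- dens$y / (sum(dens$y) * diff(dens$x)[1])  # rescale to area one on (1.96, 6)
  loss <- function(par) {                             # par = (m1, m2, m3, unnormalized w)
    m <- par[1:3]; w <- abs(par[4:6]); w <- w / sum(w)
    fit <- sapply(dens$x, function(x)
      sum(w * dnorm(x - m) / (pnorm(6 - m) - pnorm(crit - m))))  # Equation (6)
    sum(abs(fit - dens$y))
  }
  est <- optim(c(2, 3, 4, 1, 1, 1), loss)$par
  m <- est[1:3]; w <- abs(est[4:6]); w <- w / sum(w)
  ell <- sum(w * (1 - pnorm(crit - m)))               # Equation (7)
  q + (1 - q) * ell                                   # Equation (8): estimated mean power
}
```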
Simulations

The simulations reported here were carried out using the R programming environment (R Core Team, 2017), distributing the computation among 70 quad-core Apple iMac computers. The R code is available in the supplementary materials at https://osf.io/bvraz.

In the simulations, the four estimation methods (p-curve 2.1, p-uniform, maximum likelihood and z-curve) were applied to samples of significant chi-squared or F statistics, all with p < 0.05. This covers most cases of interest, since t statistics may be squared to yield F statistics, while Z may be squared to yield chi-squared with one degree of freedom.

Heterogeneity in Sample Size Only: Effect Size Fixed

Sample sizes after selection for significance were randomly generated from a Poisson distribution with mean 86, so that they were approximately normal, with population mean 86 and population standard deviation 9.3. Population mean power, the number of test statistics on which the estimates were based, the type of test (chi-squared or F) and the (numerator) degrees of freedom were varied in a complete factorial design. Within each combination, we generated 10,000 samples of significant test statistics and applied the four estimation methods to each sample. In these simulations, it was not necessary to simulate test statistic values and then literally select those that were significant. A great deal of computation was saved by using the R functions rsigF and rsigCHI (available from the supplementary materials) to simulate directly from the distribution of the test statistic after selection. A description of the simulation method and a proof of its correctness are given in the appendix.

The first simulation had a 4 × 5 × 3 design with true power after selection for significance (.05, .25, .50, and .75), number of test statistics k on which estimates were based (15, 25, 50, 100, and 250), and numerator degrees of freedom (just degrees of freedom for the chi-squared tests; 1, 3, and 5) as factors. To obtain the desired levels of power, we used the effect size metric f for F-tests and w for chi-squared tests (Cohen, 1988, p. 216). Because the pattern of results was similar for F-tests and chi-squared tests and for different degrees of freedom, we only report details for F-tests with one numerator degree of freedom; preliminary data mining of the psychological literature suggests that this is the case most frequently encountered in practice. Full results are given in the supplementary materials.

Average performance. Table 1 shows means and standard deviations of estimated mean power based on 10,000 simulations in each cell of the design. Differences between the estimates and the true values represent systematic bias in the estimates. The results show that all methods performed fairly well, with z-curve showing more bias than the other methods, especially for small sets of studies.

Absolute error of estimation. Although the standard deviations in Table 1 provide some information about estimation errors in individual simulations, we also computed mean absolute errors, abs(True Power − Estimated Power), to supplement this information (Table 2). With 50% power, at least 100 studies would be needed to reduce the mean absolute error to less than 6% for all methods. Thus, fairly large sets of studies are needed to obtain precise estimates of mean power.

Heterogeneity in Both Sample Size and Effect Size

The results of the first simulation study were reassuring in that our methods performed well under conditions that were consistent with model assumptions. P-curve, p-uniform and the ML model performed better than z-curve because they used information about sample sizes and correctly assumed that all studies have the same population effect size. However, our main goal was to test these methods under more realistic conditions where effect sizes vary across studies.

To model heterogeneity in effect size, we let effect size before selection vary according to a gamma distribution (Johnson et al., 1995), a flexible continuous distribution taking positive values. Sample size before selection remained Poisson distributed with a population mean of 86. For convenience, sample size and effect size were independent before selection for significance.
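For readers who want to reproduce this kind of heterogeneous condition without the supplementary rsigF function, the sketch below generates significant one-df F statistics by literal selection. It is a slow but transparent alternative of our own, under assumed gamma parameters; the shape and rate values in the usage comment are illustrative, not the paper's settings.

```r
# Sketch (ours): naive simulation of k significant one-df F statistics under heterogeneity
# in both sample size (Poisson) and effect size (gamma), by literal selection.
# The paper's rsigF samples from the post-selection distribution directly and is much faster.
sim_selected_F <- function(k, shape, rate, mean_n = 86) {
  Tstat <- n <- numeric(k)
  kept <- 0
  while (kept < k) {
    n_i  <- rpois(1, mean_n)
    es_i <- rgamma(1, shape, rate)                    # Cohen's d before selection
    T_i  <- rf(1, 1, n_i - 2, ncp = (n_i / 4) * es_i^2)
    if (T_i > qf(0.95, 1, n_i - 2)) {                 # keep only significant results
      kept <- kept + 1
      Tstat[kept] <- T_i
      n[kept] <- n_i
    }
  }
  data.frame(Tstat = Tstat, n = n)
}
# Example with illustrative parameters: a gamma with shape 4 and rate 8 gives effect sizes
# with mean .5 and SD .25 before selection.
# sig <- sim_selected_F(1000, shape = 4, rate = 8)
```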
The maximum likelihood model correctly assumed a gamma distribution for effect size, and the likelihood search was over the two parameters of the gamma distribution.

Table 1
Average estimated population mean power for heterogeneity in sample size only (SD in parentheses): F-tests with numerator df = 1

                          Number of Tests
                 15       25       50       100      250
Population Mean Power = .05
P-curve 2.1     .083     .073     .064     .059     .055
               (.059)   (.039)   (.024)   (.015)   (.007)
P-uniform       .076     .067     .061     .058     .054
               (.050)   (.032)   (.019)   (.012)   (.006)
ML-model        .076     .067     .061     .057     .054
               (.050)   (.033)   (.020)   (.012)   (.006)
Z-curve         .086     .071     .058     .049     .040
               (.088)   (.065)   (.044)   (.031)   (.019)
Population Mean Power = .25
P-curve 2.1     .269     .261     .256     .253     .251
               (.156)   (.128)   (.095)   (.069)   (.046)
P-uniform       .256     .253     .252     .251     .251
               (.147)   (.121)   (.089)   (.065)   (.042)
ML-model        .260     .255     .253     .251     .251
               (.146)   (.120)   (.087)   (.064)   (.042)
Z-curve         .314     .305     .293     .280     .268
               (.155)   (.127)   (.093)   (.068)   (.045)
Population Mean Power = .50
P-curve 2.1     .484     .491     .496     .497     .499
               (.175)   (.139)   (.102)   (.073)   (.046)
P-uniform       .473     .485     .493     .496     .499
               (.170)   (.132)   (.097)   (.070)   (.044)
ML-model        .479     .489     .495     .497     .499
               (.166)   (.130)   (.095)   (.068)   (.043)
Z-curve         .513     .516     .513     .508     .502
               (.151)   (.121)   (.091)   (.068)   (.045)
Population Mean Power = .75
P-curve 2.1     .728     .736     .742     .747     .749
               (.128)   (.098)   (.069)   (.048)   (.030)
P-uniform       .721     .732     .740     .746     .748
               (.126)   (.097)   (.067)   (.047)   (.029)
ML-model        .728     .736     .742     .747     .749
               (.121)   (.093)   (.065)   (.045)   (.028)
Z-curve         .704     .712     .717     .723     .728
               (.105)   (.084)   (.064)   (.048)   (.033)

The other three methods were not modified in any way. P-curve 2.1 and p-uniform continued to assume a fixed effect size, and z-curve continued to assume heterogeneity in the non-centrality parameter without distinguishing between heterogeneity in sample size and heterogeneity in effect size.

We used the same design as in Study 1 with one additional factor: the amount of heterogeneity in effect size, as represented by the standard deviation of the effect size distribution.

Table 2
Mean absolute error of estimation (in percentage points) for heterogeneity in sample size only: F-tests with numerator df = 1

                          Number of Tests
                 15       25       50       100      250
Population Mean Power = .05
P-curve 2.1     3.32     2.25     1.41     0.93     0.52
P-uniform       2.57     1.75     1.11     0.76     0.43
ML-model        2.59     1.74     1.09     0.73     0.39
Z-curve         6.53     4.90     3.38     2.44     1.79
Population Mean Power = .25
P-curve 2.1    12.94    10.49     7.69     5.53     3.64
P-uniform      12.11     9.87     7.17     5.18     3.38
ML-model       12.07     9.76     7.05     5.10     3.32
Z-curve        13.55    11.09     8.21     5.96     3.87
Population Mean Power = .50
P-curve 2.1    14.32    11.20     8.14     5.80     3.67
P-uniform      13.93    10.68     7.80     5.56     3.51
ML-model       13.61    10.41     7.60     5.39     3.41
Z-curve        12.42     9.91     7.44     5.48     3.59
Population Mean Power = .75
P-curve 2.1     9.77     7.59     5.38     3.72     2.35
P-uniform       9.79     7.59     5.34     3.71     2.32
ML-model        9.33     7.23     5.11     3.53     2.21
Z-curve         8.34     6.96     5.56     4.30     3.13

Figure 3 shows the distribution of effect sizes after selection for significance for three levels of heterogeneity, standard deviation of effect size after selection (0.10, 0.20 or 0.30), crossed with three levels of true population mean power (0.25, 0.50 or 0.75). Effect sizes were transformed into Cohen's d for ease of interpretation. We dropped the condition with 5% power because it implies a fixed effect size of 0. We also varied the number of test statistics in a simulation (k = 100, 250, 500, 1,000 or 2,000), the experimental degrees of freedom (1, 3 or 5), and the type of test (F or chi-squared).
Within each cell of the design, ten thousand significant test statistics were randomly generated, and population mean power was estimated using all four methods. For brevity, we only present results for F-tests with numerator df = 1. Full results are given in the supplementary materials.

In our simulations with heterogeneity in effect sizes, maximum likelihood is computationally demanding. Using R's integrate function, the calculation involves fitting a histogram to each curve and then adding the areas of the bars. Numerical accuracy is an issue, especially for ratios of areas when the denominators are very small. In addition, it is necessary to try more than one starting value to have a hope of locating the global maximum, because the likelihood function has many local maxima. In our simulations, we used three random starting points.

Figure 3. Distribution of effect sizes (Cohen's d) for the simulations in Study 2. Heterogeneity: black = .1, blue = .2, red = .3. Power: solid = 25%, dots = 50%, dashes = 75%.

The ML model benefited from the fact that it assumed a gamma distribution of effect sizes, which matched the simulated effect size distributions. In contrast, z-curve made no assumptions, and the other two methods falsely assumed a fixed effect size.

Average performance. Table 3 shows estimated population mean power as a function of true population mean power. Results were consistent with the differences in assumptions. P-curve 2.1 and p-uniform overestimated mean power, and this bias increased with increasing heterogeneity and increasing mean power. Z-curve estimates were actually better than in the previous simulations with fixed effect sizes. The maximum likelihood model had the best fit, presumably because it anticipated the actual effect size distribution.

Absolute error of estimation. Table 4 shows the mean absolute error of estimation. It confirms the pattern of results seen in Table 3. Most important are the large absolute errors for the two methods that assumed a fixed effect size. These large absolute mean differences are obtained despite small standard deviations because p-curve 2.1 and p-uniform systematically overestimate mean power. Large sample sizes cannot correct for systematic estimation errors.
These results show that fixed effect size models cannot be used for the estimation of mean power when there is substantial heterogeneity in power.

Table 3
Average estimated power (SD in parentheses) for heterogeneity in sample size and effect size based on k = 1,000 F-tests with numerator df = 1

                      Standard Deviation of es
                   0.1        0.2        0.3
Population Mean Power = 0.25
P-curve 2.1       .225       .272       .320
                 (.024)     (.033)     (.039)
P-uniform         .294       .694       .949
                 (.029)     (.056)     (.028)
MaxLike           .230       .269       .283
                 (.069)     (.016)     (.015)
Z-curve           .233       .225       .226
                 (.027)     (.026)     (.024)
Population Mean Power = 0.50
P-curve 2.1       .549       .679       .757
                 (.024)     (.027)     (.026)
P-uniform         .602       .913       .995
                 (.024)     (.019)     (.003)
MaxLike           .501       .502       .506
                 (.025)     (.019)     (.019)
Z-curve           .504       .492       .487
                 (.026)     (.026)     (.025)
Population Mean Power = 0.75
P-curve 2.1       .824       .928       .962
                 (.013)     (.009)     (.006)
P-uniform         .861       .992      1.000
                 (.012)     (.003)     (.000)
MaxLike           .752       .750       .750
                 (.022)     (.017)     (.014)
Z-curve           .746       .755       .760
                 (.021)     (.017)     (.016)

The results also show that the differences between z-curve and the ML model are slight and have no practical significance. The good performance of z-curve is encouraging because it does not require assumptions about the effect size distribution.

Violating the Assumptions of the ML Model

In the preceding simulation study, heterogeneity in effect size before selection was modeled by a gamma distribution, with effect size independent of sample size before selection. The maximum likelihood model had a substantial and arguably unfair advantage, since the simulation was consistent with the assumptions of the ML model. It is well known that maximum likelihood models are very accurate compared to other methods when their assumptions are met (Stuart & Ord, 1999, Ch. 18). We used a beta distribution of effect sizes to examine how the ML model performs when its assumption of a gamma distribution is violated.

Table 4
Mean absolute error of estimation in percentage points, for heterogeneity in sample size and gamma effect size based on k = 1,000 F-tests with numerator df = 1

                      Standard Deviation of es
                   0.1        0.2        0.3
Population Mean Power = 0.25
P-curve 2.1       2.87       3.16       7.08
P-uniform         4.50      44.38      69.90
MaxLike           3.55       2.06       3.34
Z-curve           2.59       3.08       2.90
Population Mean Power = 0.50
P-curve 2.1       4.93      17.86      25.70
P-uniform        10.21      41.28      49.54
MaxLike           1.80       1.49       1.50
Z-curve           2.12       2.19       2.23
Population Mean Power = 0.75
P-curve 2.1       7.45      17.75      21.23
P-uniform        11.08      24.17      24.99
MaxLike           1.42       1.18       1.16
Z-curve           1.69       1.42       1.55

In this simulation, z-curve may have the upper hand because it makes no assumptions about the distribution of effect sizes or the correlation between effect sizes and sample sizes. It is well known that selection for significance (e.g., publication bias) introduces a correlation between sample sizes and effect sizes. However, there might also be negative correlations between sample sizes and effect sizes before selection for significance if researchers conduct a priori power analyses to plan their studies or if researchers learn from non-significant results that they need larger samples to achieve significance.

The design of this simulation study was similar to the previous design, but we only simulated the most extreme heterogeneity condition (SD = .3) and added a factor for the correlation between sample size and effect size (r = 0, −.2, −.4, −.6, −.8). As before, we ran 10,000 simulations in each condition.
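The paper does not state how the negative correlation between sample size and effect size was induced before selection. One standard way to do it, shown below purely as an assumption-laden sketch of our own, is a Gaussian copula that couples a Poisson margin for sample size with a beta margin for effect size; the beta shape parameters are illustrative.

```r
# Sketch (ours): inducing a correlation r between Poisson sample sizes and
# beta-distributed effect sizes before selection, via a Gaussian copula.
# The beta parameters are illustrative; the paper does not report its mechanism.
sim_n_es <- function(k, r, mean_n = 86, shape1 = 2, shape2 = 4) {
  z1 <- rnorm(k)
  z2 <- r * z1 + sqrt(1 - r^2) * rnorm(k)        # bivariate normal with correlation r
  n  <- qpois(pnorm(z1), lambda = mean_n)        # Poisson(86) sample sizes
  es <- qbeta(pnorm(z2), shape1, shape2)         # beta effect sizes
  data.frame(n = n, es = es)
}
# cor(sim_n_es(1e5, r = -0.8)) gives a correlation between n and es of roughly -.8.
```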
To make the results comparable to those in Table 4, we show the results for the simulations with k = 1,000 per simulated meta-analysis.

Figure 4 shows the effect size distributions after selection for significance. As before, effect sizes were transformed into Cohen's d values so that they can be compared to the distributions in Figure 3. Only the most extreme correlations of 0 and −.8 are shown to avoid cluttering the figure. As shown in the figure, the correlation has relatively little impact on the distributions.

Figure 4. Effect size distribution for Study 3. Correlation: black = 0, red = −.8. Power: solid = 25%, dots = 50%, dashes = 75%.

Average performance. Table 5 shows average estimated population mean power as a function of the correlation between sample size and effect size and different levels of power. One interesting finding is that the correlation between effect size and sample size has virtually no influence on any of the four estimation methods. This is reassuring because the correlation before selection for significance is typically unknown.

P-curve 2.1 and p-uniform again overestimate mean power. More important is the comparison of the ML model and z-curve. Both methods perform reasonably well with a mean true power of 50%, although z-curve performs slightly better. With low or high power, however, the ML model overestimates mean power by 5 and 8 percentage points, respectively. The bias for z-curve is smaller, although even z-curve overestimates high power by 4 percentage points. We explored the cause of this systematic bias and found that it is caused by the default bandwidth method with smaller sets of studies. When we set the bandwidth to a value of 0.05, the z-curve estimates with a correlation of zero were .235, .492, and .743, respectively.
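For completeness, this is how a fixed bandwidth can be supplied to R's density function in place of the default rule described earlier; z stands for an assumed vector of significant z-scores, and bw = 0.05 is the value mentioned in the bandwidth check above.

```r
# z: assumed vector of significant z-scores between 1.96 and 6
dens_default <- density(z, from = 1.96, to = 6)             # bandwidth from the default nrd0 rule
dens_fixed   <- density(z, bw = 0.05, from = 1.96, to = 6)  # fixed bandwidth of 0.05
```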
Table 5
Average estimated power with beta effect size and sample size correlated with effect size: k = 1,000 F-tests with numerator df = 1

                    Correlation between n and es
                 −.8      −.6      −.4      −.2      .0
Population Mean Power = 0.25
P-curve         .407     .405     .403     .403     .402
               (.043)   (.044)   (.043)   (.044)   (.044)
P-uniform       .853     .852     .852     .852     .852
               (.003)   (.004)   (.003)   (.004)   (.004)
MaxLike         .302     .301     .300     .300     .300
               (.015)   (.015)   (.015)   (.015)   (.015)
Z-curve         .232     .231     .230     .231     .230
               (.015)   (.015)   (.015)   (.015)   (.015)
Population Mean Power = 0.50
P-curve         .839     .840     .841     .841     .841
               (.022)   (.022)   (.022)   (.022)   (.022)
P-uniform       .906     .906     .906     .906     .906
               (.004)   (.004)   (.004)   (.004)   (.004)
MaxLike         .532     .533     .533     .534     .534
               (.018)   (.018)   (.019)   (.019)   (.019)
Z-curve         .493     .494     .495     .495     .495
               (.023)   (.023)   (.023)   (.023)   (.023)
Population Mean Power = 0.75
P-curve         .990     .991     .992     .992     .992
               (.002)   (.002)   (.002)   (.002)   (.002)
P-uniform       .964     .966     .966     .967     .967
               (.003)   (.003)   (.003)   (.003)   (.003)
MaxLike         .826     .832     .836     .838     .840
               (.016)   (.016)   (.015)   (.015)   (.015)
Z-curve         .785     .790     .793     .794     .796
               (.013)   (.013)   (.013)   (.012)   (.012)

Discussion

In this paper, we have compared four methods for estimating the mean statistical power of a heterogeneous population of significance tests, after selection for significance. We have discovered and formally proved a set of theorems relating the distribution of power values before and after selection for significance.

Mean Power and Replicability

Several events in 2011 have triggered a crisis of confidence about the replicability and credibility of published findings in psychology journals. As a result, there have been various attempts to assess the replicability of published results. The most impressive evidence comes from the Open Science Reproducibility Project, which conducted 100 replication studies from articles published in 2008. The key finding was that 50% of significant results from cognitive psychology could be replicated successfully, whereas only 25% of significant results from social psychology could be replicated successfully (Open Science Collaboration, 2015).

Social psychologists have questioned these results. Their main argument is that the replication studies were poorly done: "Nosek's ballyhooed finding that most psychology experiments didn't replicate did enormous damage to the reputation of the field, and that its leaders were themselves guilty of methodological problems" (Nisbett, quoted in Bartlett, 2018).

Estimating mean power provides an empirical answer to the question whether replication failures are caused by problems with the original studies or with the replication studies. If the original studies achieved significance only by means of selection for significance or other questionable research practices, estimated mean power would be low. In contrast, if original studies had good power and replication failures are due to methodological problems of the replication studies, estimated mean power would be high.

We have applied z-curve to the original studies that were replicated in the Open Science project and found an estimate of 66% mean power (Schimmack & Brunner, 2016). This estimate is higher than the overall success rate of 37% for the actual replication studies. This suggests (but not conclusively) that problems with conducting exact replication studies contributed partially to the low success rate of 37%. At the same time, the estimate of 66% is considerably lower than the success rate of 97% for the original studies.
This discrepancy shows that success rates in journals are inflated by selection for significance, which partially explains replication failures in psychology, especially in social psychology.

This example shows that estimates of mean power provide useful information for the interpretation of replication failures. Without this information, precious resources might be wasted on further replication studies that fail simply because the original results were selected for significance.

Historic Trends in Power

Our statistical approach of estimating mean power is also useful for examining changes in statistical power over time. So far, power analyses of psychology have relied on fixed values of effect sizes that were recommended by Cohen (1962, 1988). However, actual effect sizes may change over time or from one field to another. Z-curve makes it possible to examine what the actual power in a field of study is and whether this power has changed over time. Despite much talk about improvement in psychological science in response to the replication crisis, mean power has increased by less than 5 percentage points since 2011, and improvements are limited to social psychology (Schimmack, 2018b).

Mean Power as a Quality Indicator

One problem in psychological science is the use of quantitative indicators like the number of publications or the number of studies per article to evaluate the productivity and quality of psychological scientists. We believe that mean power is an important additional indicator of good science.

A single study with good power provides more credible evidence and more sound theoretical foundations than three or more studies with low power that were selected from a larger population of studies with non-significant results (Schimmack, 2012). However, without quantitative information about power, it is unclear whether reported results are trustworthy or not. Reporting the mean power of studies from a lab or a particular field of research can provide this information. This information can be used by journalists or textbook writers to select articles that reported credible empirical evidence that is likely to replicate in future studies.

P-Curve Estimates of Mean Power

Simonsohn et al. (2017) provided users with a free online app to compute mean power. However, they did not report the performance of their method in simulation studies, and their method has not been peer-reviewed. We evaluated their online method and found that the current online method, p-curve 4.06, overestimates mean power under conditions of heterogeneity (Schimmack & Brunner, 2017). Moreover, even heterogeneity in sample sizes alone can produce biased estimates with p-curve 4.06 (Brunner, 2018).

However, we agree with Simonsohn et al. (2014b) that p-curve 2.0 can be used for the estimation of mean effect sizes and that these estimates are relatively bias-free even when there is moderate heterogeneity in effect sizes. Importantly, these estimates are only unbiased for the population of studies that produced significant results; they are inflated estimates for the population of studies before selection for significance.

Failing to distinguish these two populations of studies (i.e., before and after selection for significance) has produced a lot of confusion and unnecessary criticism of selection models in general (McShane et al., 2016).
While it is difficult to obtain accurate estimates of effect sizes or power before selection for significance from the subset of studies that were selected for significance, p-curve 2.0 provides reasonably good estimates of effect sizes after selection for significance, which is the reason we built p-curve 2.1 in the first place. However, p-curve 2.1, and especially p-curve 4.06, produce biased estimates of mean power even for the set of studies selected for significance. Therefore, we do not recommend using p-curve to estimate mean power.

P-uniform Estimation of Mean Power

Unlike p-curve, the authors of p-uniform limited their method to the estimation of effect sizes before selection for significance. We used their estimation method to create a method for the estimation of mean power after selection. Like p-curve, the method had problems with heterogeneity in effect sizes and performed even worse than p-curve. Recently, the developers of p-uniform changed the estimation method to make it more robust in the presence of heterogeneity and outliers (van Aert et al., 2016).

The new approach simply averages the rescaled p-values and finds the effect size that produces a mean p-value of 0.50. This method is called the Irwin-Hall method. We conducted new simulation studies with this method for the no-correlation condition in Table 5 for 25%, 50%, and 75% true power. We found that it performed much better (24%, 76%, 99%) than the old p-uniform method (85%, 91%, 97%), and slightly better than p-curve 2.1 (40%, 84%, 99%). However, the method still produces inflated estimates for medium and high mean power.

Maximum Likelihood Model

Our ML model is similar to Hedges and Vevea's (1996) ML method that corrects for publication bias in effect size meta-analyses. Although this model has rarely been used in actual applications, it received renewed attention during the current replication crisis. McShane et al. (2016) argued that p-curve and p-uniform produced biased effect size estimates, whereas a heterogeneous ML model produced accurate estimates. However, their focus was on estimating the average effect size before selection for significance. This aim is different from our aim to estimate mean power after selection for significance. Moreover, in their simulation studies the ML model benefited from the fact that the model assumed a normal distribution of effect sizes and this was the distribution of effect sizes in the simulation study. In our simulation studies, the ML model also performed very well when the simulation data met the model assumptions. However, estimates were biased when the model assumptions differed from the effect size distribution in the data.

Hedges and Vevea (1996) also found that their ML model is sensitive to the actual distribution of population effect sizes, which is unknown. The main advantage of z-curve over ML models is that it does not make any distributional assumptions about the data. However, this advantage is limited to the estimation of mean power. Whether it is possible to develop finite mixture models without distributional assumptions for the estimation of the mean effect size after selection for significance remains to be examined.

Future Directions

One concern about z-curve was its suboptimal performance when effect sizes were fixed. However, an improved z-curve method may be able to produce better estimates in this scenario as well.
Maximum Likelihood Model

Our ML model is similar to Hedges and Vevea's (1996) ML method that corrects for publication bias in effect size meta-analyses. Although this model has rarely been used in actual applications, it has received renewed attention during the current replication crisis. McShane et al. (2016) argued that p-curve and p-uniform produce biased effect size estimates, whereas a heterogeneous ML model produces accurate estimates. However, their focus was on estimating the average effect size before selection for significance. This aim is different from our aim of estimating mean power after selection for significance. Moreover, in their simulation studies the ML model benefited from the fact that the model assumed a normal distribution of effect sizes and this was also the distribution of effect sizes in the simulated data. In our simulation studies, the ML model likewise performed very well when the simulated data met the model assumptions. However, estimates were biased when the model assumptions differed from the effect size distribution in the data. Hedges and Vevea (1996) also found that their ML model is sensitive to the actual distribution of population effect sizes, which is unknown. The main advantage of z-curve over ML models is that it does not make any distributional assumptions about the data. However, this advantage is limited to the estimation of mean power. Whether it is possible to develop finite mixture models without distributional assumptions for the estimation of the mean effect size after selection for significance remains to be examined.

Future Directions

One concern about z-curve was its suboptimal performance when effect sizes were fixed. However, an improved z-curve method may be able to produce better estimates in this scenario as well. As most studies are likely to have some heterogeneity, we recommend using z-curve as the default method for estimating mean power.

Another issue is to examine the performance of z-curve when researchers have used questionable research practices (John et al., 2012). One questionable research practice is to include multiple dependent variables and to report only those that produced a significant result. This practice is no different from running multiple exact replication studies with the same dependent variable and reporting only the studies that produced significant results for the selected DV. The probability of this result being selected is the true power of the study with the chosen DV, and the probability of this finding being replicated equals the true power for the chosen DV. Power can vary across DVs, but the power of the DVs that were discarded is irrelevant. Things become more complicated, however, if multiple DVs are selected or if only the strongest result is selected among several significant DVs (van Aert et al., 2016).

Some questionable research practices may cause z-curve to underestimate mean power. For example, researchers who conduct studies with moderate power may deal with marginally significant results by removing a few outliers to obtain a just-significant result (John et al., 2012). This would create a pile of z-scores close to the critical value, leading z-curve to underestimate mean power. We recommend inspecting the z-curve plot for this QRP, which should produce a spike in z-scores just above 1.96.

Another issue is that studies may use different significance thresholds. Although most studies use p < .05 (two-tailed) as the criterion, some studies use more stringent criteria, for example to correct for multiple comparisons. Including these results would lead to an overestimation of mean power, just as using p < .05, one-tailed, as the criterion would lead to overestimation, because most studies used the more stringent two-tailed criterion to select for significance. One solution would be to exclude studies that did not use alpha = .05, or to run separate analyses for sets of studies with different criteria for significance. However, such studies are currently so rare that they have no practical consequences for mean power estimates.

Conclusion

Although this article is the first formal introduction of z-curve, we have been writing about z-curve and applications of z-curve on social media since 2015. Thus, there has already been peer-reviewed criticism of our aims and methods before we were able to publish the method itself. We would like to take this opportunity to correct some of these criticisms and to ask future critics to base their criticism on this article.

De Boeck and Jeon (2018) claim that estimation methods for mean power are problematic because they "aim at rather precise replicability inferences based on other not always precise inferences, without knowing the true values of the effect size and whether the effect is fixed or varies" (p. 769). Contrary to this claim, our simulations show that z-curve can provide precise estimates of replicability, that is, of the success rate in a set of exact replication studies, without information about population effect sizes. To do so, only test statistics or exact p-values are needed. If this statistical information (e.g., means, SDs, and N) is not reported, an article does not contain quantitative information.
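For readers who want to see what this input looks like, the following sketch converts reported results into the absolute z-scores that z-curve analyzes, assuming two-tailed p-values. The helper names and the example values are ours for illustration; the authors' own code is in the OSF materials linked below.

# Sketch (ours, not the z-curve code itself): convert reported test results to the
# absolute z-scores used as z-curve input, assuming two-tailed p-values.
z_from_p <- function(p) qnorm(1 - p / 2)                               # exact two-tailed p-value -> |z|
z_from_t <- function(t, df) z_from_p(2 * pt(abs(t), df, lower.tail = FALSE))

z <- c(z_from_p(c(.049, .003, .021)), z_from_t(c(2.10, 3.40), df = c(28, 61)))
round(z, 2)     # only values above qnorm(.975) = 1.96 enter the analysis
# With a large set of tests, hist(z) would also reveal a pile-up just above 1.96,
# the pattern discussed above as a sign of questionable research practices.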
We hope that researchers will use z-curve (https://osf.io/w8nq4) to estimate mean power when they conduct meta-analyses. Hopefully, the reporting of mean power will help researchers pay more attention to power when they plan future studies, and we might finally see an increase in statistical power, more than 50 years after Cohen (1962) pointed out the importance of power for good psychological science. More awareness of the actual power in psychological science could also be beneficial for grant applications, to fund research projects properly and to reduce the need for questionable research practices that boost apparent power at the cost of inflating the risk of type-I errors. Thus, we hope that estimation of mean power serves the most important goal in science, namely to reduce errors. Conducting studies with adequate power reduces type-II errors (false negatives), and in the presence of selection bias it also reduces type-I errors. The downside appears to be that fewer studies would be published, but underpowered studies selected for significance do not provide sound empirical evidence. Maybe reducing the number of published studies would be beneficial, or to paraphrase Cohen (1990), "Less is more, except for statistical power".

Author Contributions

Most of the ideas in this paper were developed jointly. An exception is the z-curve method, which is solely due to Schimmack. Brunner is responsible for the theorems.

Acknowledgements

We would like to thank Dr. Jeffrey Graham for providing remote access to the computers in the Psychology Laboratory at the University of Toronto Mississauga. Thanks to Josef Duchesne for technical advice.

Conflict of Interest and Funding

No conflict of interest to report. This work was not supported by a specific grant.

Contact Information

Correspondence regarding this article should be sent to: brunner@utstat.toronto.edu

Open Science Practices

This article earned the Open Materials badge for making the materials openly available. Preregistration and Data badges are not applicable for this type of research. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28, 640–646.

Bartlett, T. (2018). I want to burn things to the ground. Retrieved May 30, 2019, from https://www.chronicle.com/article/I-Want-to-Burn-Things-to/244488

Becker, R. A., Chambers, J. M., & Wilks, A. R. (1988). The new S language: A programming environment for data analysis and graphics. Pacific Grove, California, Wadsworth & Brooks/Cole.

Boos, D. D., & Stefanski, L. A. (2012). P-value precision and reproducibility. The American Statistician, 65, 213–221.

Brunner, J. (2018). An even better p-curve. Retrieved May 30, 2019, from https://replicationindex.wordpress.com/2018/05/10/an-even-better-p-curve

Bunge, M. (1998). Philosophy of science. New Brunswick, N.J., Transaction.
Chase, L. J., & Chase, R. B. (1976). Statistical power analysis of applied psychological research. Journal of Applied Psychology, 61, 234–237.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, New Jersey, Erlbaum.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

De Boeck, P., & Jeon, M. (2018). Perceived crisis and reforms: Issues, explanations, and remedies. Psychological Bulletin, 144, 757–777.

Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175–183.

Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate applications. New York, Routledge.

Hedges, L. V., & Vevea, J. L. (1996). Estimating effect size under publication bias: Small sample properties and robustness of a random effects selection model. Journal of Educational and Behavioral Statistics, 21, 299–332.

Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 19–24.

Ioannidis, J. P. (2008). Why most discovered true associations are inflated. Epidemiology, 19(5), 640–646.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 517–523.

Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions (2nd ed.). New York, Wiley.

McShane, B. B., Böckenholt, U., & Hansen, K. T. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11, 730–749.

Morewedge, C. K., Gilbert, D., & Wilson, T. D. (2014). Reply to Francis. Retrieved June 7, 2019, from https://www.semanticscholar.org/paper/REPLY-TO-FRANCIS-Morewedge-Gilbert/019dae0b9cbb3904a671bfb5b2a25521b69ff2cc

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society, Series A, 231, 289–337.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

Popper, K. R. (1959). The logic of scientific discovery. London, England, Hutchinson.

R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. https://www.R-project.org/

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638–641.

Schimmack, U. (2012).
The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.

Schimmack, U. (2015). Post-hoc power curves: Estimating the typical power of statistical tests (t, F) in Psychological Science and Journal of Experimental Social Psychology. Retrieved May 30, 2019, from https://replicationindex.com/2015/06/27/232/

Schimmack, U. (2018a). An introduction to z-curve: A method for estimating mean power after selection for significance (replicability). Retrieved May 30, 2019, from https://replicationindex.com/2018/10/19/an-introduction-to-z-curve

Schimmack, U. (2018b). Replicability rankings. Retrieved May 30, 2019, from https://replicationindex.com/2018/12/29/2018-replicability-rankings

Schimmack, U., & Brunner, J. (2016). How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies. Retrieved May 30, 2019, from http://www.utstat.toronto.edu/~brunner/papers/HowReplicable.pdf

Schimmack, U., & Brunner, J. (2017). Z-curve: A method for the estimation of replicability. Manuscript rejected from AMPPS. Retrieved May 30, 2019, from https://replicationindex.wordpress.com/2017/11/16/preprint-z-curve-a-method-for-the-estimating-replicability-based-on-test-statistics-in-original-studies-schimmack-brunner-2017

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316.

Silverman, B. W. (1986).
Density estimation. London, Chapman & Hall.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). P-curve: A key to the file drawer. Journal of Experimental Psychology: General, 143, 534–547.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2017). P-curve app 4.06. Retrieved May 30, 2019, from http://www.p-curve.com

Sterling, T. D. (1959). Publication decision and the possible effects on inferences drawn from tests of significance – or vice versa. Journal of the American Statistical Association, 54, 30–34.

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.

Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A., & Williams, R. M., Jr. (1949). The American soldier, vol. 1: Adjustment during army life. Princeton, Princeton University Press.

Stuart, A., & Ord, J. K. (1999). Kendall's advanced theory of statistics, vol. 2: Classical inference & the linear model (5th ed.). New York, Oxford University Press.

van Aert, R. C. M., Wicherts, J. M., & van Assen, M. A. L. M. (2016). Conducting meta-analyses based on p values: Reservations and recommendations for applying p-uniform and p-curve. Perspectives on Psychological Science, 11, 713–729.

van Assen, M. A. L. M., van Aert, R. C. M., & Wicherts, J. M. (2014). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20, 293–309.

Yuan, K. H., & Maxwell, S. (2005). On the post hoc power in testing mean differences. Journal of Educational and Behavioral Statistics, 30, 141–167.

Appendix

Proofs of the Theorems, with an example

We present proofs of six theorems about the relationship between power and the outcome of replication studies. The first two theorems are assumptions of z-curve. The other four theorems are theoretically interesting, very useful for simulation studies, and can be used to further develop z-curve in the future. The theorems are also illustrated with a numerical example. Consider a population of F-tests with 3 and 26 degrees of freedom, and varying true power values. Variation in power comes from variation in the non-centrality parameter, which is sampled from a chi-squared distribution with degrees of freedom chosen so that population mean power is very close to 0.80. Denoting a randomly selected power value by G and the non-centrality parameter by λ, population mean power is
\[
E(G) = \int_0^\infty \bigl(1 - \texttt{pf}(c,\, \mathrm{ncp} = \lambda)\bigr)\, \texttt{dchisq}(\lambda)\, d\lambda .
\]
To verify the numerical value of expected power for the example,

> alpha = 0.05; criticalvalue = qf(1-alpha,3,26)
> fun = function(ncp,DF)
+   (1 - pf(criticalvalue,df1=3,df2=26,ncp))*dchisq(ncp,DF)
> integrate(fun,0,Inf,DF=14.36826)
0.8000001 with absolute error < 5.9e-06

The strange fractional degrees of freedom were located with the R function uniroot, by solving for the degrees of freedom value at which the output of integrate equals 0.8. The solution was 14.36826.
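One way to reproduce this search (a sketch rather than the original code; the search interval is our choice) is to wrap the integral in a function of the degrees of freedom and hand it to uniroot:

# Reuses fun and criticalvalue from the code above.
meanpower <- function(DF) integrate(fun, 0, Inf, DF = DF)$value   # expected power for chi-squared df
uniroot(function(DF) meanpower(DF) - 0.80, interval = c(1, 50))$root
# returns a value very close to 14.36826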
Theorem 1 Population mean true power equals the overall probability of a significant result.

Proof. Suppose that the distribution of true power is discrete. Again denoting a randomly chosen power value by G, the probability of rejecting the null hypothesis is
\[
\Pr\{T > c\} = \sum_g \Pr\{T > c \mid G = g\}\Pr\{G = g\} = \sum_g g \Pr\{G = g\} = E(G), \tag{9}
\]
which is population mean power. If the distribution of power is continuous with probability density function \(f_G(g)\), the calculation is
\[
\Pr\{T > c\} = \int_0^1 \Pr\{T > c \mid G = g\}\, f_G(g)\, dg = \int_0^1 g\, f_G(g)\, dg = E(G). \qquad \square
\]

Continuing with the numerical example, we first sample one million non-centrality parameter values from the chi-squared distribution that yields an expected power of 80%. These values are in the vector NCP. We then calculate the corresponding power values, placing them in the vector Power. Next, we generate one million random F statistics from non-central F distributions, using the non-centrality parameter values in NCP. In the R output below, observe that mean power is very close to the proportion of F statistics exceeding the critical value. This illustrates Theorem 1 for the distribution of power before selection. Needless to say, Theorem 1 applies both before and after selection.

> popsize = 1000000; set.seed(9999)
> NCP = rchisq(popsize,df=14.36826)
> Power = 1 - pf(criticalvalue,df1=3,df2=26,NCP)
> mean(Power)
[1] 0.8002137
> Fstat = rf(popsize,df1=3,df2=26,NCP)
> sigF = subset(Fstat,Fstat>criticalvalue)
> length(sigF)/popsize # Proportion significant
[1] 0.800177

To show how Theorem 1 applies to the distribution of power after selection, the sub-population of power values corresponding to significant results is stored in SigPower. The tests that were significant are repeated (with the same non-centrality parameters), and the test statistics placed in Fstat2. The proportion of test statistics in Fstat2 that are significant is very close to the mean of SigPower. This gives empirical support to the statement that population mean power after selection for significance equals the probability of obtaining a significant result again.

> SigPower = subset(Power,Fstat>criticalvalue)
> mean(SigPower) # Mean power after selection
[1] 0.8274357
> # Replicate the tests that were significant.
> sigNCP = subset(NCP,Fstat>criticalvalue)
> Fstat2 = rf(length(sigF),df1=3,df2=26,ncp=sigNCP)
> # Proportion of replications significant
> length(subset(Fstat2,Fstat2>criticalvalue)) /
+   length(sigF)
[1] 0.827172

Theorem 2 The effect of selection for significance is to multiply the probability of each power value by a quantity equal to the power value itself, divided by population mean power before selection. If the distribution of power is continuous, this statement applies to the probability density function.

Proof. Suppose the distribution of power is discrete. Using Bayes' Theorem,
\[
\Pr\{G = g \mid T > c\} = \frac{\Pr\{T > c \mid G = g\}\Pr\{G = g\}}{\Pr\{T > c\}} = \frac{g \Pr\{G = g\}}{E(G)}. \tag{10}
\]
If the distribution of power is continuous with density \(f_G(g)\),
\[
\Pr\{G \le g \mid T > c\} = \frac{\Pr\{G \le g,\; T > c\}}{\Pr\{T > c\}}
= \frac{\int_0^g \Pr\{T > c \mid G = x\}\, f_G(x)\, dx}{E(G)}
= \frac{\int_0^g x\, f_G(x)\, dx}{E(G)}.
\]
By the Fundamental Theorem of Calculus, the conditional density of power given significance is
\[
\frac{d}{dg}\Pr\{G \le g \mid T > c\} = \frac{g\, f_G(g)}{E(G)}. \qquad \square \tag{11}
\]

For the numerical example we are pursuing by simulation, the density function of power before selection is a technical challenge and we will not attempt it. As a substitute, suppose that power before selection follows a beta distribution, a very flexible family on the interval from zero to one (Johnson et al., 1995).
If power before selection (denoted by G) has a beta distribution with parameters α and β, Theorem 2 says that the density of power after selection (a function of the power value g) is
\[
f(g \mid T > c) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, g^{\alpha-1}(1-g)^{\beta-1}\left(\frac{g}{E(G)}\right)
= \left(\frac{1}{\alpha/(\alpha+\beta)}\right)\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, g^{\alpha}(1-g)^{\beta-1}
\]
\[
= \frac{(\alpha+\beta)\,\Gamma(\alpha+\beta)}{\alpha\,\Gamma(\alpha)\,\Gamma(\beta)}\, g^{\alpha+1-1}(1-g)^{\beta-1}
= \frac{\Gamma(\alpha+1+\beta)}{\Gamma(\alpha+1)\,\Gamma(\beta)}\, g^{\alpha+1-1}(1-g)^{\beta-1},
\]
which is again a beta density, this time with parameters α + 1 and β. M.A.L.M. van Assen has pointed out the similarity of this result to conjugate prior-posterior updating in Bayesian statistics. Figure 5 shows how a beta with α = 2 and β = 4 is transformed into a beta with α = 3 and β = 4.

[Figure 5. Beta density of power before and after selection; the horizontal axis is the power value g, the vertical axis is the density, with curves labeled Before and After.]

Theorem 3 Population mean power after selection for significance equals the population mean of squared power before selection, divided by the population mean of power before selection.

Proof. Suppose that the distribution of power is discrete. Then using (10),
\[
E(G \mid T > c) = \sum_g g\, \frac{g\Pr\{G = g\}}{E(G)} = \frac{E(G^2)}{E(G)}. \tag{12}
\]
If the distribution of power is continuous, (11) is used to obtain
\[
E(G \mid T > c) = \int_0^1 g\, \frac{g\, f_G(g)}{E(G)}\, dg = \frac{E(G^2)}{E(G)}. \qquad \square \tag{13}
\]

In the example, SigPower contains the sub-population of power values corresponding to significant results. Observe the verification of Formula 13.

> # Repeating ...
> SigPower = subset(Power,Fstat>criticalvalue)
> mean(SigPower)
[1] 0.8274357
> mean(Power^2)/mean(Power)
[1] 0.8275373

Theorem 4 Population mean power before selection equals one divided by the population mean of the reciprocal of power after selection.

Proof. Using Formula 10,
\[
E\!\left(\frac{1}{G}\,\Big|\, T > c\right) = \sum_g \left(\frac{1}{g}\right) \frac{g\Pr\{G = g\}}{E(G)}
= \frac{1}{E(G)}\sum_g \Pr\{G = g\} = \frac{1}{E(G)}\cdot 1 = \frac{1}{E(G)},
\]
so that
\[
E(G) = 1 \Big/ E\!\left(\frac{1}{G}\,\Big|\, T > c\right).
\]
A similar calculation applies in the continuous case. \(\square\)

To illustrate Theorem 4, recall that the example was constructed so that mean power before selection was equal to 0.80.

> 1/mean(1/SigPower)
[1] 0.8000502

In the example, population mean power is 0.80, while population mean power given significance is roughly 0.83. It is reasonable that selecting significant tests would also tend to select higher power values on average, and in fact this intuition is correct. Since \(\operatorname{Var}(G) = E(G^2) - (E(G))^2 \ge 0\), we have \(E(G^2) \ge (E(G))^2\), and hence \(E(G^2)/E(G) \ge E(G)\). Theorem 3 says \(E(G^2)/E(G) = E(G \mid T > c)\), so that \(E(G \mid T > c) \ge E(G)\). That is, population mean power given significance is greater than the mean power of the entire population, except in the homogeneous case where \(\operatorname{Var}(G) = 0\). The exact amount of increase has a compact and somewhat surprising form.

Theorem 5 The increase in population mean power due to selection for significance equals the population variance of power before selection divided by the population mean of power before selection.

Proof.
\[
E(G \mid T > c) - E(G) = \frac{E(G^2)}{E(G)} - E(G) = \frac{E(G^2)}{E(G)} - \frac{(E(G))^2}{E(G)} = \frac{\operatorname{Var}(G)}{E(G)}. \qquad \square
\]

Illustrating Theorem 5 for the ongoing example,

> mean(SigPower) - mean(Power)
[1] 0.02722205
> var(Power)/mean(Power)
[1] 0.02732371
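As a further check that is not part of the original example, the beta result from Theorem 2 and the ratio in Theorem 3 can be verified together by a direct simulation. For power distributed as Beta(2, 4) before selection, the mean power after selection should equal the mean of a Beta(3, 4) distribution, namely 3/7.

# Sketch: verify Theorems 2 and 3 for power distributed as Beta(2, 4) before selection.
set.seed(123)
g   <- rbeta(1e6, 2, 4)       # power values before selection
sig <- runif(1e6) < g         # each test is significant with probability equal to its power
mean(g[sig])                  # mean power after selection
(2 + 1) / (2 + 1 + 4)         # mean of Beta(3, 4), the after-selection density from Theorem 2
mean(g^2) / mean(g)           # Theorem 3: E(G^2)/E(G) gives the same value

All three quantities are approximately 3/7 = 0.4286, and the difference between mean(g[sig]) and mean(g) also matches Var(G)/E(G), as Theorem 5 requires.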
Theorem 6 The effect of selection for significance is to multiply the joint distribution of sample size and effect size before selection by power for that sample size and effect size, divided by population mean power before selection.

Proof. Note that power for a given sample size and effect size is \(P\{T > c \mid X = es, N = n\}\). Suppose effect size is discrete. Then
\[
P\{X = es, N = n \mid T > c\} = \frac{P\{X = es, N = n, T > c\}}{P\{T > c\}}
= \frac{P\{T > c \mid X = es, N = n\}\, P\{X = es, N = n\}}{E(G)}
\]
\[
= \left(\frac{P\{T > c \mid X = es, N = n\}}{E(G)}\right) P\{X = es, N = n\},
\]
where E(G) is expected power before selection, equal to \(P\{T > c\}\) by Theorem 1.

Suppose that effect size is continuous with density g(es). The joint distribution of sample size and effect size before selection is determined by \(P\{N = n \mid X = es\}\, g(es)\). The joint distribution after selection is determined by
\[
P\{N = n \mid X = es, T > c\}\, g(es \mid T > c)
= \frac{P\{T > c \mid X = es, N = n\}\, P\{N = n \mid X = es\}\, g(es)}{g(es \mid T > c)\, P\{T > c\}}\; g(es \mid T > c)
\]
\[
= \left(\frac{P\{T > c \mid X = es, N = n\}}{E(G)}\right) P\{N = n \mid X = es\}\, g(es).
\]
It is also possible to write the joint distribution of sample size and effect size as the conditional density of effect size given sample size, times the discrete probability of sample size. That is, the joint distribution before selection is determined by \(g(es \mid N = n)\, P\{N = n\}\), and the joint distribution after selection is determined by
\[
g(es \mid N = n, T > c)\, P\{N = n \mid T > c\}
= \frac{d}{d\,es}\, P\{X \le es \mid N = n, T > c\}\, P\{N = n \mid T > c\}
\]
\[
= \frac{d}{d\,es}\, \frac{P\{X \le es, N = n, T > c\}}{P\{N = n, T > c\}} \cdot \frac{P\{N = n, T > c\}}{P\{T > c\}}
\]
\[
= \frac{1}{E(G)}\, \frac{d}{d\,es} \int_0^{es} P\{T > c \mid X = y, N = n\}\, g(y \mid N = n)\, P\{N = n\}\, dy
\]
\[
= \frac{P\{T > c \mid X = es, N = n\}\, g(es \mid N = n)\, P\{N = n\}}{E(G)}
= \left(\frac{P\{T > c \mid X = es, N = n\}}{E(G)}\right) g(es \mid N = n)\, P\{N = n\}. \qquad \square \tag{14}
\]

Theorem 6 cannot be illustrated for the ongoing numerical example, because the example employs a distribution of the non-centrality parameter, rather than of sample size and effect size jointly. As a substitute, consider that an observed distribution of sample size after selection must imply a distribution of sample size in the unpublished studies before selection. If that distribution is too outlandish (for example, implying an enormous "file drawer" of pilot studies with tiny sample sizes), we may be forced to another model of the research and publication process. Theorem 6 allows one to solve for P{N = n}, the unconditional probability distribution of sample size before selection, though an estimated or hypothesized distribution of effect size given sample size before selection is needed. When sample size and effect size are deemed independent before selection, this is not a serious obstacle. Expression (14) says that \(g(es \mid N = n, T > c)\, P\{N = n \mid T > c\}\) is equal to
\[
\left(\frac{P\{T > c \mid X = es, N = n\}}{E(G)}\right) g(es \mid N = n)\, P\{N = n\},
\]
so that integrating both sides with respect to es,
\[
\int g(es \mid N = n, T > c)\, P\{N = n \mid T > c\}\, d\,es
= P\{N = n \mid T > c\} \int g(es \mid N = n, T > c)\, d\,es
= P\{N = n \mid T > c\} \cdot 1
\]
\[
= \int \left(\frac{P\{T > c \mid X = es, N = n\}}{E(G)}\right) g(es \mid N = n)\, P\{N = n\}\, d\,es
= \left(\frac{P\{N = n\}}{E(G)}\right) \int P\{T > c \mid X = es, N = n\}\, g(es \mid N = n)\, d\,es,
\]
and we have
\[
P\{N = n\} = E(G)\left(\frac{P\{N = n \mid T > c\}}{\int P\{T > c \mid X = es, N = n\}\, g(es \mid N = n)\, d\,es}\right) \tag{15}
\]
The numerator of the fraction is the probability of observing a sample size of n after selection for significance. The denominator is expected power given that sample size, and could be calculated with R's integrate function. By Theorem 1, the quantity E(G) is both population mean power before selection and P{T > c}, the probability of randomly choosing a significant result from the population of tests before selection. In Equation 15, though, it is just a proportionality constant. In practice, one obtains P{N = n} by calculating the fraction in parentheses for each n, and then dividing by the total to obtain numbers that add to one.
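Equation (15) is easy to apply once a test family and an effect size distribution are assumed. The following sketch uses assumptions that are ours alone, for illustration: one-sample z-tests, so that power given n and es is approximately 1 − pnorm(qnorm(.975) − sqrt(n)·es) (ignoring the negligible lower tail), a Beta(2, 6) density of effect size taken to be independent of sample size before selection, and made-up observed sample sizes and frequencies.

# Sketch of Equation (15) under illustrative assumptions (not from the article).
obs_n   <- c(20, 40, 80)            # observed sample sizes after selection
obs_frq <- c(0.5, 0.3, 0.2)         # their relative frequencies, P{N = n | T > c}

exp_power <- function(n) {          # denominator of (15): expected power given n
  integrate(function(es) (1 - pnorm(qnorm(.975) - sqrt(n) * es)) * dbeta(es, 2, 6),
            0, 1)$value
}
w <- obs_frq / sapply(obs_n, exp_power)
w / sum(w)                          # P{N = n} before selection, normalized to sum to one

Because small samples have low expected power, their implied before-selection frequencies are inflated relative to the observed frequencies, which is exactly the "file drawer" reasoning described above.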
Maximum Likelihood

Even though sample size is a random variable, the quantities \(n_1, \ldots, n_k\) are treated as fixed constants. This is similar to the way that x values in normal regression and logistic regression are treated as fixed constants in the development of the theory, even though clearly they are often random variables in practice. Making the estimation conditional on the observed values \(n_1, \ldots, n_k\) allows it to be distribution free with respect to sample size, just as regression and logistic regression are distribution free with respect to x. This is preferable to adopting parametric assumptions about the joint distribution of sample size and effect size.

Suppose there is heterogeneity in both sample size and effect size, and that effect size is continuous. The likelihood function given significance is a product of conditional densities evaluated at the observed values of the test statistics. Each term is the conditional density of the test statistic given both the sample size and the event that the test statistic exceeds its respective critical value. The joint probability distribution of sample size and effect size before selection is determined by the marginal distribution of sample size P{N = n} and the conditional density of effect size given sample size \(g_\theta(es \mid n)\), where θ is a vector of unknown parameters. Denoting the random effect size by X, the conditional density of an observed test statistic T given significance and a particular sample size n is
\[
\frac{d}{dt}\, P\{T \le t \mid T > c, N = n\}
= \frac{d}{dt}\, \frac{P\{T \le t, T > c, N = n\}}{P\{T > c, N = n\}}
= \frac{d}{dt}\, \frac{P\{c < T \le t \mid N = n\}\, P\{N = n\}}{P\{T > c \mid N = n\}\, P\{N = n\}}
\]
\[
= \frac{\frac{d}{dt}\, P\{c < T \le t \mid N = n\}}{P\{T > c \mid N = n\}}
= \frac{\frac{d}{dt} \int_0^\infty P\{c < T \le t \mid N = n, X = es\}\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty P\{T > c \mid N = n, X = es\}\, g_\theta(es \mid n)\, d\,es}
\]
\[
= \frac{\frac{d}{dt} \int_0^\infty \bigl[\, p(t, f_1(n) f_2(es)) - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es}
\]
\[
= \frac{\int_0^\infty \frac{d}{dt}\, p(t, f_1(n) f_2(es))\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es}
= \frac{\int_0^\infty d(t, f_1(n) f_2(es))\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es},
\]
where moving the derivative through the integral sign is justified by dominated convergence. The likelihood function is a product of k such terms. In the main paper, the simplifying assumption that sample size and effect size are independent before selection means that \(g_\theta(es \mid n)\) is replaced by \(g_\theta(es)\), yielding Expression (3).

In the problem of estimating power under heterogeneity in effect size, the unknown parameter is the vector θ in the density of effect size. Let \(\hat\theta\) denote the maximum likelihood estimate of θ. This yields a maximum likelihood estimate of the true power of each individual test in the sample, and then the estimates are averaged to obtain an estimate of mean power. We now give details.

Consider randomly sampling a single test from the population of tests that were significant the first time they were carried out. Let \(T_1\) denote the value of the test statistic the first time a hypothesis is tested, and let \(T_2\) denote the value of the test statistic the second time that particular hypothesis is tested, under exact repetition of the experiment. Conditionally on fixed values of sample size n and effect size es, \(T_1\) and \(T_2\) are independent. By Theorem 1, population mean power after selection is
\[
P\{T_2 > c \mid T_1 > c\} = \sum_n P\{T_2 > c \mid T_1 > c, N = n\}\, P\{N = n \mid T_1 > c\} \tag{16}
\]
This is the expression we seek to estimate. Applying Theorem 3 to the sub-population of tests based on a sample of size n,
\[
P\{T_2 > c \mid T_1 > c, N = n\} = \frac{E(G^2 \mid N = n)}{E(G \mid N = n)}
= \frac{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]^2\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es}. \tag{17}
\]
Substituting (17) into (16) yields
\[
P\{T_2 > c \mid T_1 > c\} = \sum_n \frac{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]^2\, g_\theta(es \mid n)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, g_\theta(es \mid n)\, d\,es}\; P\{N = n \mid T_1 > c\}. \tag{18}
\]
Expression (18) has two unknown quantities: the parameter θ of the effect size distribution and \(P\{N = n \mid T_1 > c\}\). For the former quantity we use the maximum likelihood estimate, while the \(P\{N = n \mid T_1 > c\}\) values are estimated by the empirical relative frequencies of sample size, which is the non-parametric maximum likelihood estimate. The result is a maximum likelihood estimate of population mean power given significance:
\[
\frac{1}{k}\sum_{j=1}^{k} \frac{\int_0^\infty \bigl[\, 1 - p(c_j, f_1(n_j) f_2(es)) \,\bigr]^2\, g_{\hat\theta}(es \mid n_j)\, d\,es}{\int_0^\infty \bigl[\, 1 - p(c_j, f_1(n_j) f_2(es)) \,\bigr]\, g_{\hat\theta}(es \mid n_j)\, d\,es}.
\]
In the simulations, the density g of effect size is assumed to be gamma, there is no dependence on n, and the parameter θ is the pair (a, b) that parameterizes the gamma distribution.

Simulation

Direct simulation from the distribution of the test statistic given significance. To study the behaviour of an estimation method under selection for significance, it is natural to simulate test statistics from the distribution that applies before selection, and then discard the ones that are not significant. But if one can simulate from the joint distribution of sample size and effect size after selection, the wasteful discarding of non-significant test statistics can be avoided. The idea is to do the simulation in two stages. First, simulate pairs from the joint distribution of sample size and effect size after selection, and calculate a non-centrality parameter using ncp = f1(n) f2(es). Then, using that ncp value, simulate from the distribution of the test statistic given significance. We will now show how to do the second step.

It is well known that if F(t) is the cumulative distribution function of a continuous random variable and U is uniformly distributed on the interval from zero to one, then the random variable \(T = F^{-1}(U)\) has cumulative distribution function F(t). In this case the cumulative distribution function from which we wish to simulate is
\[
P\{T \le t \mid T > c, X = es, N = n\}
= \frac{P\{T \le t,\, T > c \mid X = es, N = n\}}{P\{T > c \mid X = es, N = n\}}
= \frac{P\{c < T \le t \mid X = es, N = n\}}{P\{T > c \mid X = es, N = n\}}
= \frac{p(t, \mathrm{ncp}) - p(c, \mathrm{ncp})}{1 - p(c, \mathrm{ncp})}
\]
for t > c, where as usual ncp = f1(n) f2(es). To obtain the inverse, set u equal to the probability and solve for t, as follows. Denoting the power of the test by γ = 1 − p(c, ncp),
\[
u = \frac{p(t, \mathrm{ncp}) - p(c, \mathrm{ncp})}{1 - p(c, \mathrm{ncp})}
\;\Leftrightarrow\; u\,\bigl(1 - p(c, \mathrm{ncp})\bigr) = p(t, \mathrm{ncp}) - p(c, \mathrm{ncp})
\;\Leftrightarrow\; p(t, \mathrm{ncp}) = u\,\bigl(1 - p(c, \mathrm{ncp})\bigr) + p(c, \mathrm{ncp})
\]
\[
\;\Leftrightarrow\; p(t, \mathrm{ncp}) = \gamma u + 1 - \gamma
\;\Leftrightarrow\; t = q(\gamma u + 1 - \gamma,\, \mathrm{ncp}).
\]
Accordingly, let U be a Uniform(0,1) random variable. The significant test statistic is
\[
T = q(\gamma U + 1 - \gamma,\, \mathrm{ncp}) = q(1 + \gamma(U - 1),\, \mathrm{ncp}) = q(1 - \gamma(1 - U),\, \mathrm{ncp}).
\]
Since 1 − U also has a Uniform(0,1) distribution, one may proceed as follows. For a given sample size and effect size, first calculate the non-centrality parameter ncp = f1(n) f2(es), and use that to compute the power value γ = 1 − p(c, ncp). Then calculate the significant test statistic
\[
T = q(1 - \gamma U,\, \mathrm{ncp}), \tag{19}
\]
where U is a pseudo-random variate from a Uniform(0,1) distribution. In R, the process can be applied to a vector of ncp values and a vector of independent U values of the same length.
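The following sketch of Equation (19), written for the F(3, 26) example, is ours rather than the article's simulation code. Given a vector of non-centrality parameter values, it returns test statistics that are significant by construction, so nothing has to be discarded.

# Sketch of Equation (19) for the F(3, 26) example (ours; not the article's simulation code).
rsigF <- function(ncp, df1 = 3, df2 = 26, alpha = 0.05) {
  crit  <- qf(1 - alpha, df1, df2)
  gamma <- 1 - pf(crit, df1, df2, ncp)      # power for each ncp value
  u     <- runif(length(ncp))
  qf(1 - gamma * u, df1, df2, ncp)          # Equation (19); every value exceeds crit
}
set.seed(1)
ncp <- rchisq(1e5, df = 14.36826)           # any collection of ncp values will do
all(rsigF(ncp) > qf(0.95, 3, 26))           # TRUE: no simulated statistic is discarded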
Again, this is the second step. The first step is to simulate a collection of ncp values using the desired joint distribution of sample size and effect size after selection for significance. Naturally, simulation is easiest if sample size and effect size come from well-known distributions with built-in random number generation, and if sample size and effect size are specified to be independent after selection. In one of our simulations, sample size and effect size after selection were correlated. The next section describes how this was done.

Correlated sample size and effect size. Let effect size X have density \(g_\theta(es)\), where θ represents a vector of parameters for the distribution of effect size. Conditionally on X = es, let sample size be Poisson distributed with expected value \(\exp(\beta_0 + \beta_1 es)\). This is standard Poisson regression. Simulation from the joint distribution is easy. One simply simulates an effect size es according to the density g, computes the Poisson parameter \(\lambda = \exp(\beta_0 + \beta_1 es)\), and then samples a value n from a Poisson distribution with parameter λ. The challenge is to choose the parameters θ, \(\beta_0\) and \(\beta_1\) so that after selection, (a) the population mean power has a desired value, and at the same time (b) the population correlation between sample size and effect size has a desired value. Population mean power is
\[
\gamma = \int_0^\infty \sum_n \bigl[\, 1 - p(c, f_1(n) f_2(es)) \,\bigr]\, P\{N = n \mid X = es\}\, g_\theta(es)\, d\,es .
\]
Given values of θ, \(\beta_0\) and \(\beta_1\), this expression can be calculated by numerical integration; recall that \(P\{N = n \mid X = es\}\) is a Poisson probability. The population correlation between sample size and effect size is
\[
\rho = \frac{E(XN) - E(X)E(N)}{SD(X)\, SD(N)},
\]
where SD(·) refers to the population standard deviation of something. The quantities E(X) and SD(X) are direct functions of θ. The standard deviation of sample size is \(SD(N) = \sqrt{E(N^2) - [E(N)]^2}\), where
\[
E(N) = E\bigl(E[N \mid X]\bigr) = \int_0^\infty E[N \mid X = es]\, g_\theta(es)\, d\,es = \int_0^\infty e^{\beta_0 + \beta_1 es}\, g_\theta(es)\, d\,es
\]
and
\[
E(N^2) = E\bigl(E[N^2 \mid X]\bigr) = E\bigl(\operatorname{Var}(N \mid X) + E(N \mid X)^2\bigr)
= \int_0^\infty \bigl(e^{\beta_0 + \beta_1 es} + e^{2\beta_0 + 2\beta_1 es}\bigr)\, g_\theta(es)\, d\,es .
\]
Finally,
\[
E(XN) = \int_0^\infty \sum_n es\; n\, P\{N = n \mid X = es\}\, g_\theta(es)\, d\,es
= \int_0^\infty es\, E(N \mid X = es)\, g_\theta(es)\, d\,es
= \int_0^\infty es\, e^{\beta_0 + \beta_1 es}\, g_\theta(es)\, d\,es .
\]
All these expected values can be calculated by numerical integration using R's integrate function, so that the correlation ρ can be evaluated for any set of θ, \(\beta_0\) and \(\beta_1\) values.

In our simulation of correlated sample size and effect size, \(g_\theta(es)\) was a beta density, re-parameterized so that \(\theta = (\mu, \sigma^2)\) consisted of the mean µ and variance σ². Conditionally on effect size, sample size was Poisson distributed with expected value \(\exp(\beta_0 + \beta_1 es)\). We set the variance of effect size σ² to a fixed value of 0.09, so that the standard deviation of effect size after selection was 0.30, a high value. Given any mean effect size µ and slope \(\beta_1\), the parameter \(\beta_0\) (the intercept of the Poisson regression) was adjusted so that the expected sample size at the mean effect size was equal to 86: \(\beta_0 = \ln(86) - \beta_1 \mu\). With these constraints, the population mean power γ and the correlation ρ were functions of the two free parameters µ and \(\beta_1\). Let \(\gamma_0\) be a desired value of mean power, for example \(\gamma_0 = 0.5\), and let \(\rho_0\) be a desired value of the correlation between sample size and effect size, for example \(\rho_0 = -0.8\). Values of µ and \(\beta_1\) were located by numerically minimizing the function \(f(\mu, \beta_1) = |\gamma - \gamma_0| + |\rho - \rho_0|\). We used R's optim function.
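For completeness, here is a sketch of the simulation step just described. The numerical values of µ, σ², and β1 are placeholders for illustration, not the values located by the optimization, and the conversion from (mean, variance) to the usual beta shape parameters is the standard one.

# Sketch of the correlated sample-size / effect-size simulation step (illustrative parameter values).
mu <- 0.35; sigma2 <- 0.09           # mean and variance of the re-parameterized beta effect-size density
beta1 <- -5                          # Poisson regression slope (negative: larger effects go with smaller n)
beta0 <- log(86) - beta1 * mu        # intercept chosen so that E(N | X = mu) = 86, as described above

# Convert (mean, variance) to the usual beta shape parameters
k <- mu * (1 - mu) / sigma2 - 1
shape1 <- mu * k; shape2 <- (1 - mu) * k

set.seed(42)
es <- rbeta(1e5, shape1, shape2)             # effect sizes
n  <- rpois(1e5, exp(beta0 + beta1 * es))    # sample sizes from the Poisson regression on effect size
cor(n, es)                                   # empirical correlation, to be compared with the target rho0

Each simulated (n, es) pair then yields ncp = f1(n) f2(es), and Equation (19) converts it into a significant test statistic, completing the two-stage scheme.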