Meta-Psychology, 2021, vol 5, MP.2020.2506
https://doi.org/10.15626/MP.2020.2506
Article type: Original Article
Published under the CC-BY4.0 license
Open data: Not applicable
Open materials: Yes
Open and reproducible analysis: Yes
Open reviews and editorial process: Yes
Preregistration: No
Edited by: Henrik Danielsson
Reviewed by: Dixon, P., Buchanan, E., Magnusson, K.
Analysis reproduced by: Lucija Batinović
All supplementary files can be accessed at OSF: https://doi.org/10.17605/OSF.IO/CB78P

What to make of equivalence testing with a post-specified margin?

Harlan Campbell, University of British Columbia, Department of Statistics
Paul Gustafson, University of British Columbia, Department of Statistics

Abstract

In order to determine whether or not an effect is absent based on a statistical test, the recommended frequentist tool is the equivalence test. Typically, it is expected that an appropriate equivalence margin has been specified before any data are observed. Unfortunately, this can be a difficult task. If the margin is too small, then the test's power will be substantially reduced. If the margin is too large, any claims of equivalence will be meaningless. Moreover, it remains unclear how defining the margin afterwards will bias one's results. In this short article, we consider a series of hypothetical scenarios in which the margin is defined post-hoc or is otherwise considered controversial. We also review a number of relevant, potentially problematic actual studies from clinical trials research, with the aim of motivating a critical discussion as to what is acceptable and desirable in the reporting and interpretation of equivalence tests.

Keywords: equivalence testing, non-inferiority testing, confidence intervals, type 1 error, frequentist testing, clinical trials, negative studies, null results

"Facts do not accumulate on the blank slates of researchers' minds and data simply do not speak for themselves. [...] Interpretation can produce sound judgments or systematic error. Only hindsight will enable us to tell which has occurred." (Kaptchuk, 2003)

Introduction

Consider the following hypothetical situation. After having collected data, we want to determine whether or not an effect is absent based on a statistical test. All too often, in such a situation, non-significance (i.e., p > 0.05), or a combination of both non-significance and supposed high power (i.e., a large sample size), is used as the basis for a claim that the effect is null. Unfortunately, such an argument is logically flawed. As the saying goes, "absence of evidence is not evidence of absence" (Altman and Bland, 1995; Hartung et al., 1983). Instead, to correctly conclude the absence of an effect under the frequentist paradigm, the recommended tool is the equivalence test (also known as a "non-inferiority test" for one-sided testing (Wellek, 2010)).

Let θ be our parameter of interest. An equivalence test reverses the question that is asked in a null hypothesis significance test (NHST). Instead of asking whether we can reject the null hypothesis of no effect, e.g., H0: θ = 0, an equivalence test examines whether the magnitude of θ is at all meaningful: can we reject the possibility that θ is as large as, or larger than, our smallest effect size of interest, ∆? The null hypothesis for an equivalence test is defined as H0: θ ∉ (−∆, ∆).
In other words, equivalence implies that θ is small enough that any non-zero effect is at most of magnitude ∆. The interval (−∆, ∆) is known as the equivalence margin and represents a range of values for which θ can be considered negligible.

In psychology research and in the social sciences, where the practice of equivalence testing is relatively new, but now "rapidly expanding" (Koh and Cribbie, 2013), there are many questions about how to best conduct and interpret equivalence tests. For example, consider the question of a "post-specified" margin. It is generally accepted that one must specify the equivalence margin a priori, i.e., before any data have been observed (Wellek, 2010). However, in our hypothetical situation, suppose that we did not have the foresight needed to pre-specify this margin. Are we then simply out of luck?

It is worth noting that lack of foresight is only one reason we may have failed to pre-specify an appropriate equivalence margin. Defining and justifying the equivalence margin is one of the "most difficult issues" (Hung et al., 2005) for researchers. If the margin we define is deemed too large, then any claim of equivalence will be considered meaningless. If the margin we define is somehow too small, then the probability of declaring equivalence will be substantially reduced (Wiens, 2002). While the margin is ideally chosen as a boundary to objectively exclude the smallest effect size of interest (Lakens et al., 2017), these "ideal" boundaries can be difficult to define, and there is generally no clear consensus among stakeholders (Keefe et al., 2013). Furthermore, previously agreed-upon meaningful effect sizes may be difficult to ascertain, as they are rarely specified in protocols and published results (Djulbegovic et al., 2011).

Suppose now that, having failed to pre-specify an adequate equivalence margin, we define the equivalence margin post-hoc, having already collected and observed the data. Given the potential consequences of interpreting data based on post-hoc decisions, it is understandable that this idea may be alarming to some; see, e.g., the "Harkonen case" (as discussed in Lee and Rubin, 2016), in which the U.S. Department of Justice prosecuted drug-maker InterMune (United States v. Harkonen, 2013) for making claims based on post-hoc subgroup analyses.

In the biostatistics literature there are many warnings about how and when to specify the equivalence margin. Hung et al., 2005 note that: "If the margin can change depending on what has been observed [...] statistical testing of non-inferiority [or equivalence] may not be interpretable." And Wiens, 2002 observes that: "The potential biases of defining the margin after the study should be weighed against the cost and inconvenience of better understanding the differences [between study groups]." Finally, the Committee for Proprietary Medicinal Products (CPMP), 2001 (the EU scientific advisory organization dealing with the approval of new human pharmaceuticals) notes that: "it is prudent to specify a non-inferiority margin in the protocol in order to avoid the serious difficulties that can arise from later selection."

Statements such as these lead one to ask the following. Under what circumstances would equivalence testing with a data-dependent margin "not be interpretable"? What are the "potential biases" and "serious difficulties" we should consider in these, less than ideal, circumstances?
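To make the mechanics concrete, the sketch below shows one common way of carrying out an equivalence test with a pre-specified symmetric margin: two one-sided t-tests (TOST) for a difference in means. This is a minimal illustration of the general procedure, not code from the article or its supplementary materials; the simulated data, the margin of 0.5, and the helper function name are our own assumptions.

```python
import numpy as np
from scipy import stats

def tost_two_sample(x, y, delta, alpha=0.05):
    """Two one-sided t-tests (TOST) for equivalence of two means,
    using a symmetric, pre-specified margin (-delta, delta)."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    # Pooled standard error, as in the usual equal-variance two-sample t-test
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    # H0a: theta <= -delta  vs.  H1a: theta > -delta
    p_lower = stats.t.sf((diff + delta) / se, df)
    # H0b: theta >= +delta  vs.  H1b: theta < +delta
    p_upper = stats.t.cdf((diff - delta) / se, df)
    # Equivalence is declared only if BOTH one-sided nulls are rejected
    return p_lower, p_upper, (p_lower < alpha) and (p_upper < alpha)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 50)   # hypothetical group 1
y = rng.normal(0.0, 1.0, 50)   # hypothetical group 2
print(tost_two_sample(x, y, delta=0.5))
```

Declaring equivalence only when both one-sided p-values fall below α is the same as checking whether the 90% confidence interval for the difference lies entirely within (−∆, ∆); this correspondence is made explicit in the next section.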
Walker and Nowacki, 2011 stress that defining the equivalence margin before observing the data is "essential to maintain the type I error at the desired level," suggesting that potential type I error inflation is the issue of concern. Yet this too remains unclear. With equivalence testing becoming more and more common among psychology researchers, these are important matters to address.

In this article we will shed light on these curious questions by considering a series of rather confounding hypothetical scenarios (Sections 2 and 3) as well as a number of relevant case studies from biomedical research, where equivalence testing has been widely used for decades (Section 4). We conclude (Section 5) with an invitation for further discussion about how best to address the title question: what to make of equivalence testing with a post-specified margin?

The Pseudo-type I error and a pathological case

Before going forward, we would be wise to recall that, under the frequentist paradigm, hypotheses are statements about parameters and are therefore nonrandom quantities. Hence, each hypothesis is either true or false, irrespective of how the data are realized.

Let θ be the parameter of interest and let X represent the data. Borrowing from the notation of Wellek, 2017, let θ̲(X; α) be the lower bound of a one-sided 100(1 − α)% confidence interval (CI), and let θ̄(X; α) be the upper bound of a one-sided 100(1 − α)% CI. For example, a one-sided 95% CI for θ could be written out as (−∞, θ̄(X; 0.05)]; a two-sided 90% CI could be written as [θ̲(X; 0.05), θ̄(X; 0.05)].

Let us define a symmetric equivalence margin as (−∆, ∆). Then the standard equivalence testing hypotheses are defined as:

H0: θ ≤ −∆, or θ ≥ ∆, vs. H1: −∆ < θ < ∆.

Figure 1. The one-to-one correspondence between α and ∆. In this plot, an equivalence test is conducted on two-sample normally distributed data. The observed mean difference is θ̂ = 0.2, and the observed pooled standard deviation is equal to 1, with n1 = n2 = 50. The shape of this particular curve is specific to these particular data. However, in general, the smallest value of α needed to reject the null (x-axis) decreases as ∆ increases (y-axis). Furthermore, as the dashed lines indicate, when ∆ = θ̂, the corresponding value of α will be 0.5.

There is a one-to-one correspondence between symmetric confidence intervals and equivalence testing. The null hypothesis, H0, can be rejected whenever the realized confidence bounds satisfy [θ̲(X; α), θ̄(X; α)] ⊂ (−∆, ∆). Conversely, there will be insufficient evidence to reject the null hypothesis whenever [θ̲(X; α), θ̄(X; α)] ⊄ (−∆, ∆). For example, with the standard α = 0.05, we can reject H0 if and only if a 90% CI for θ fits entirely within the equivalence margin.

Equivalence testing provides the standard guarantee about type 1 error that Pr(reject H0 | H0 is true) ≤ α; see Wellek, 2017. If we reject the null hypothesis if and only if the 90% CI for θ fits within (−∆, ∆), we can rest assured that we will make a type 1 error in at most 5% of cases.

Should the equivalence margin not be specified a priori, and instead be defined based on the observed data, we have the following admittedly improper hypothesis test:

H̃0: θ ≤ −∆(X), or θ ≥ ∆(X), vs. H̃1: −∆(X) < θ < ∆(X).

In this case, we may not necessarily have that Pr(reject H̃0 | H̃0 is true) ≤ α.
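The simulation sketch below illustrates this last point numerically: with a fixed, pre-specified margin the rejection rate at the boundary of the null stays at or below α, whereas a margin chosen just wide enough to cover the observed confidence interval (the fully data-dependent "pathological" construction described next) leads to a claim of equivalence in every simulated dataset. This is our own illustrative code, with assumed values for n, ∆, and the true effect; it is not the simulation reported in the article's Appendix.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2021)
alpha, delta_fixed, n, n_sim = 0.05, 0.5, 50, 10_000
theta_true = delta_fixed      # place the true effect on the boundary of H0
eps = 0.01                    # small constant used by the data-dependent margin

reject_fixed = reject_data_dependent = 0
for _ in range(n_sim):
    x = rng.normal(theta_true, 1.0, n)
    y = rng.normal(0.0, 1.0, n)
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / n + y.var(ddof=1) / n)
    tcrit = stats.t.ppf(1 - alpha, df=2 * n - 2)
    lo, hi = diff - tcrit * se, diff + tcrit * se        # two-sided 90% CI
    # Pre-specified margin: reject H0 iff the 90% CI lies inside (-delta, delta)
    reject_fixed += (-delta_fixed < lo) and (hi < delta_fixed)
    # Data-dependent margin, chosen to just barely cover the observed CI
    delta_x = max(abs(lo), abs(hi)) + eps
    reject_data_dependent += (-delta_x < lo) and (hi < delta_x)

print("rejection rate, pre-specified margin:", reject_fixed / n_sim)            # stays <= alpha
print("rejection rate, data-dependent margin:", reject_data_dependent / n_sim)  # always 1
```

Setting the true effect exactly at the boundary of the null makes the first rejection rate an estimate of the worst-case type 1 error under the proper test; the second procedure "rejects" by construction, which is precisely why its pseudo-type I error cannot be controlled.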
To better understand, let us consider the following admittedly "pathological case." Let ∆(X) be chosen, based on the observed data, to be the smallest possible value for which one can claim equivalence (known in the literature as the "LEAD" boundaries; see Meyners, 2007). This is done by setting:

∆(X) = max(|θ̲(X; α)|, |θ̄(X; α)|) + ε,

where ε is a small positive real number. For example, if a 90% CI for θ is [−0.2, 0.5], the "pathological" equivalence margin might be defined as (−0.51, 0.51), with ∆(X) = 0.5 + 0.01.

Given the monotonic relationship between a confidence interval and an equivalence test, there is a one-to-one correspondence between α and ∆. For any given value of α, conditional on a fixed sample of data, there is a value of ∆ for which one can reject H0. Conversely, for any given value of ∆, there is a value of α for which one can reject H0; see Figure 1.

In our pathological case, we have that Pr(reject H̃0) = 1, i.e., we will always claim equivalence. In this situation, the margin is entirely "data-dependent." In other words, the data (as summarized by the confidence interval) and the margin are perfectly correlated. We write cor(f(X), ∆) = 1, where f(X) = max(|θ̲(X; α)|, |θ̄(X; α)|).

Figure 2 displays the relationship between type 1 error and cor(f(X), ∆); see details in the Appendix. In the pathological case, since Pr(reject H̃0) = 1, we also have that Pr(reject H̃0 | H̃0) = 1. As such, we have Pr(reject H̃0 | H̃0) > α, and therefore the "pseudo-type I error" is not controlled. When there is less correlation, i.e., when the margin is not entirely data-dependent, we can expect to see less type 1 error inflation. In order for the test to be valid, the key is independence between the margin and the data. In the case when the data and the margin are entirely independent, the type 1 error rate will be at most equal to α, as desired.

Figure 2. The relationship between type 1 error and the correlation between the margin and the data. The correlation measure, cor(f(X), ∆), is obtained by varying the probability of setting ∆(X) equal to the LEAD margin vs. setting ∆(X) equal to a value entirely independent of the data. The curve is the result of repeated simulations of two-sample data; see details in the Appendix. In order for the test to be valid, the key is independence between the margin and the data.

A somewhat less pathological case

Now let us consider a somewhat less pathological situation. The CPMP published an advisory report, "Points to consider on switching between superiority and non-inferiority" (Committee for Proprietary Medicinal Products (CPMP), 2001), in which they describe another hypothetical situation where the margin is determined after the data are observed:

"Let us suppose that a bioequivalence trial finds a 90% confidence interval for the relative bioavailability of a new formulation that ranges from 0.90 to 1.15. Can we only conclude that the relative bioavailability lies between the conventional limits of 0.80 and 1.25 because these were the predefined equivalence margins? Or can we conclude that it lies between 0.90 and 1.15? The narrower interval based on the actual data is the appropriate one to accept. Hence, if the regulatory requirement changed to +/- 15%, this study would have produced satisfactory results. There is no question here of a data-derived selection process.
However, if the trial had resulted in a confidence interval ranging from 0.75 to 1.20, then a post hoc change of equivalence margins to +/- 25% would not be acceptable because of the obvious conclusion that the equivalence margin was chosen to fit the data."

According to this recommendation, it seems that, without any scrutiny, we are free to shrink a pre-specified margin as needed. However, we may never widen the pre-specified margin, even if that is what would be necessary to claim equivalence. If this is the case, it would suggest that a prudent strategy would be to always pre-specify the largest possible margin before collecting data, and then shrink the margin as required. This may strike some as opportunistic and potentially problematic.

Ng, 2003 studies a similar hypothetical situation in which a large, possibly infinite, number of margins are all pre-specified and all the corresponding hypotheses are tested (without any Bonferroni-type adjustment for multiple comparisons). Equivalence is then claimed using the narrowest of all potential pre-specified margins for which equivalence is statistically significant. Ng, 2003 explains why this hypothetical strategy may be problematic: "Although there is no inflation of the type I error rate [due to the fact that all hypotheses are nested], simultaneous testing of many nested null hypotheses is problematic in a confirmatory trial because the probability of confirming the finding of such testing in a second trial would approach 0.5 as the number of nested null hypotheses approaches infinity."

To better understand Ng, 2003's concern, consider a similar setup where, for a standard null hypothesis significance test, a large, possibly infinite, number of pre-specified α-levels (allowable type I error rates) are defined. The null is then rejected using the smallest of all potential pre-specified α values. Under this procedure, the probability of confirming a statistically significant finding in a second trial (with identical sample size and α) approaches 0.5; see Hoenig and Heisey, 2001, who describe this (often unappreciated) property of "retrospective power." As such, it is always expected that one specifies (and justifies) a single α-level prior to observing any data; see the recent commentary of Lakens et al., 2018. (These two situations are in fact identical, due to the aforementioned one-to-one correspondence between a data-driven selection of α and a data-driven choice of ∆; see Figure 1.)

How hypothetical are situations like these?

While the cases described in the previous sections were purely hypothetical, similar situations do arise in practice. We consider a number of different clinical trial studies as examples, with the aim of motivating a critical discussion as to what is acceptable and desirable in the reporting and interpretation of equivalence tests.

First, consider the cases of post-hoc judgement that often arise in the regulatory approval of drugs seeking a designation of bio-equivalence. When the pre-specified margin is deemed too generous (i.e., too wide) by regulatory authorities only after the data have already been observed and analyzed, the regulator may decide that, for the purposes of approval, the drug does not meet an appropriate standard for equivalence. Consider two examples:
1. The SPORTIF III and SPORTIF V randomized controlled trials (RCTs) were designed to investigate the potential of ximelagatran as the first oral alternative to warfarin for reducing the risk of thromboembolic complications in patients with non-valvular atrial fibrillation. The primary end point in each study was the incidence of all strokes and systemic embolic events, and the primary objective was to establish the non-inferiority of ximelagatran relative to warfarin with a pre-specified margin of an absolute 2% difference in the event rate; see Halperin, 2003.

Both studies met the primary objectives of non-inferiority with the pre-specified margin. As such, upon completion, the studies were heralded as a "major breakthrough" (Albers et al., 2005; Kulbertus, 2003). However, upon regulatory review by the FDA Cardiovascular and Renal Drugs Advisory Committee (CRDAC), the pre-specified margin was judged to be "too generous" (Boudes, 2006). This post-hoc criticism of the "unreasonably generous" (Kaul et al., 2005) margin, along with concerns about potential liver toxicity, led to a unanimous decision by the CRDAC to conclude that the benefit of ximelagatran did not outweigh the risk. The FDA then refused to grant approval of ximelagatran for any of the proposed indications; see Head et al., 2012 and Boudes, 2006, who provide a detailed timeline and description of the approval process.

2. The EVEREST II study was an RCT designed to evaluate percutaneous mitral valve repair relative to mitral valve surgery (Mauri et al., 2010). The primary efficacy end point was defined as the proportion of patients free from death, from surgery for valve dysfunction, and from moderate-to-severe (3+) or severe (4+) mitral regurgitation at 12 months. Upon completion, researchers claimed success when the primary non-inferiority objective was achieved. However, the conclusion of non-inferiority was "difficult to accept due to unduly wide margins" (Head et al., 2012). Thus, the FDA determined that, despite the significant p-value, "non-inferiority is not implied due to the large margin" and therefore the data "did not demonstrate an appropriate benefit-risk profile when compared to standard mitral valve surgery and were inadequate to support approval" (FDA, 2013).

In other instances, the complete opposite has occurred. Despite the fact that the researchers failed to pre-specify a specific margin prior to observing the data, the regulatory agency will still accept a claim of equivalence/non-inferiority on the basis that, given some non-controversial post-hoc margin, there is sufficient evidence. Consider two examples:

1. The goal of MannKind's "Study 103" was to evaluate the inhaled insulin Afrezza for the treatment of diabetes mellitus in adults. Subjects were randomized to 12 weeks of continued treatment in one of three treatment arms. The pre-specified primary objective was to show superiority of the Afrezza TI+metformin arm relative to the secretagogue+metformin arm, with respect to change in HbA1c at 12 weeks. Upon completion, the superiority objective was not achieved, and a non-inferiority margin had not been pre-specified by the researchers. However, the regulators were able to accept a claim of non-inferiority. The FDA clinical review states: "The sponsor did not specify a non-inferiority margin.
However, the FDA statistical reviewer noted that Afrezza TI+metformin was non-inferior to secretagogue+metformin when the standard margin of 0.4% for insulins is used (the upper bound of the 95% confidence interval for the treatment difference in HbA1c is 0.3%)," (Yanoff, 2014).

2. The ALLY-3 trial was a one-arm phase 3 trial with the goal of evaluating the safety and efficacy of oral daclatasvir for chronic HCV genotype 3 infection (McCormack, 2015). There was no active or placebo control, and as such it was impossible to conduct a non-inferiority or equivalence test based only on the trial data. The FDA therefore looked to other trials to determine estimates of the effectiveness of competitor treatments. In addition, as noted by the Oregon Health Authority, "[t]he ALLY-3 trial [...] did not define a non-inferiority margin for determination of efficacy. The FDA analysis calculated it based on historical data and concluded that DCV [daclatasvir] with SOF [sofosbuvir] achieved non-inferiority compared to SOF [sofosbuvir] with RBV [ribavirin] for 24 weeks [...]," (Herink, 2016). In this case, the FDA reviewers "clinically justified" their choice of a post-specified non-inferiority margin based on historical data; see Struble, 2015.

These studies illustrate the fact that, in some fields, there may be well-established "standard" margins or sufficient "historical data." Such standards no doubt make post-specification less controversial for regulatory agencies. When it comes to peer-reviewed journals, researchers will often note that, while an equivalence margin was not pre-specified, a conclusion of equivalence can still be (cautiously) accepted. We consider two examples. In the first case, the margin was not pre-defined, yet claims of equivalence were nevertheless put forward. In the second case, while a margin was pre-defined, additional conclusions were made based on post-specified margins.

1. A. B. Chang et al., 2008 published the results of an RCT with the goal of evaluating a 5-day versus 3-day course of oral corticosteroids (CS) for non-hospitalised children with asthma exacerbations. The primary outcome was the 2-week morbidity of the children. The study did not show a statistically significant difference between the two treatment arms. In the interpretation of the results, Chang et al. (2008) note that: "It would have been ideal to define a non-inferiority or equivalence margin a priori on the basis of a minimally important effect or historical controls. Our study was designed as a superiority trial, and we did not define a non-inferiority margin a priori. Nevertheless, for the primary outcome measure, the chosen symptom score cut-off of 0.20 (i.e., chosen minimally important difference), the study shows equivalence." As such, the researchers concluded that the 3-day and 5-day treatment courses were "equally efficacious" in reducing the symptoms of asthma (A. Chang et al., 2007).

2. Jones et al., 2016 studied the efficacy of isoflurane relative to sevoflurane in cardiac surgery. When interpreting the results, the authors note that: "our choice of non-inferiority margin may seem to be overly generous; however, it is important to emphasize that, if the margin had been reduced to as low as 1.5%, the conclusions of this trial would not have changed," (Jones et al., 2016).

If, following a study's publication, other researchers take issue with how the study's equivalence margin was justified, they will often respond in a letter to the journal.
The post-hoc debate between Groenewoud et al., 2017 and Gupta et al., 2016 about the appropriateness of the pre-specified non-inferiority margin defined in Groenewoud et al., 2016's study on methods for embryo transfer is an excellent example of this. In the end, readers are left to judge for themselves.

Conclusion

Researchers advocate that equivalence testing has great potential to "facilitate theory falsification" (Quintana, 2018). By clearly distinguishing between what is "evidence of absence" versus what is an "absence of evidence," equivalence testing may facilitate the long "series of searching questions" necessary to evaluate a "failed outcome" (Pocock and Stone, 2016). As a result, it may encourage greater publication of null results, which is desperately needed (Fanelli, 2011). Yet, outside of health research, guidelines on how best to define and interpret margins are lacking. We hope that the question posed in the title of this article will motivate researchers to further consider the delicate issues involved.

In clinical trials research, expectations that a margin be pre-specified have been well established for quite some time (Piaggio et al., 2006). This is not the case in other disciplines. In psychology research and in the social sciences, discussions of how best to execute equivalence tests are underway, and appropriate recommendations are crucially needed.

One might argue that the pathological case of equivalence testing we considered does not actually qualify as testing per se, and is instead simply a tool for describing the data. This is the opinion of Meyners, 2007, who concludes that, as a descriptor of the data, the "LEAD boundaries," (−∆(X), ∆(X)), provide "useful information" and in some cases are "even more important than confidence intervals" for reporting results.

At the end of the day, everyone must arrive at their own conclusions as to whether or not a sufficient standard of evidence for equivalence has been demonstrated. Obviously, this is often easier said than done. As one final example from clinical trials, we turn to the infamous debate over using bevacizumab (Avastin) as a treatment for age-related macular degeneration. A non-inferiority study was conducted to investigate (Group, 2011). However, some considered the pre-specified non-inferiority margin of 5 letters (on the ETDRS visual acuity chart) as "generous" even before the results of the trial were announced (Hirschler, 2011). This suggests that, regardless of the results, some would have remained skeptical of any claim of non-inferiority with the 5-letter margin. In stark contrast, the standard of evidence for many healthcare providers was much weaker. Indeed, many doctors determined that the use of bevacizumab (Avastin) as a substitute for ranibizumab (Lucentis) was justified (particularly given the "too big to ignore" price difference) even before the completion of the non-inferiority trial, and were comfortable treating large numbers of patients with Avastin "off-label" (Steinbrook, 2006). In this situation, financial incentives clearly played a competing role with statistical considerations of clinical efficacy in what was to be considered "equivalent."

While the use of equivalence testing should be encouraged, caution is warranted.
In a review of equivalence and non-inferiority clinical trials, Le Henanff et al., 2006 find that studies often "reported margins [that] were so large that they were clearly unconvincing." Indeed, as Gøtzsche, 2006 concludes: "clinicians should especially bear in mind that noninferiority margins are often far too large to be clinically meaningful and that a claim of equivalence may also be misleading if a trial has not been conducted to an appropriately high standard." We conclude with the following general recommendations:

• If the parameter of interest is not measured in units that are interpretable, one should consider standardized effect sizes. Campbell, 2020 notes that: "equivalence tests for standardized effects may help researchers in situations when what is 'negligible' is particularly difficult to determine." For instance, if the outcome of interest is a depression scale, the clinical relevance of a given x-point improvement may not be intuitively meaningful, and it may be difficult to define what number of points can be considered "negligible." However, since a Cohen's d of 0.2 is widely interpreted to be a "small" sized effect (Cohen, 1977; Fritz et al., 2012), one could conclude, based on an equivalence test which rejects the null with ∆ = d = 0.2, that any effect, if it exists, is at most small.

• The validity of an equivalence test does not depend on the margin being pre-specified. Rather, the necessary requirement for a valid test is that the margin is completely independent of the data. In one of our biomedical examples (Afrezza TI + metformin), we described a situation where the researchers had not specified a margin but the FDA adopted a "standard margin of 0.4%." While there are no comparable independent agencies to regulate psychology research, peer-reviewed journals do possess substantial leverage and would be wise to consider adopting a set of "default margins" (based on standardized effect sizes). While "default equivalence margins" may not be appropriate for all studies, their use would be similar to that of "default priors" for Bayesian inference (Rouder et al., 2012) and offer the potential for more objective analyses.

• Simply because a margin has been pre-specified (and is therefore guaranteed to be independent of the data), it is not necessarily an appropriate choice. Regardless of whether the margin is pre-specified or defined post-hoc, we must acknowledge that a claim of "noninferiority [or equivalence] is almost certain with lenient noninferiority margins" (Flacco et al., 2016). One should always critically consider the practical implications of the given margin.

• If one is to suggest equivalence based on a post-hoc margin, one must, at the very least, be forthcoming and honest about the potential for bias. In such cases, every effort should be made to justify the appropriateness of the post-specified margin based on factors entirely independent of the observed data.

• In the absence of a pre-specified margin, one can always resort to simply reporting the associated confidence interval. If the confidence interval contains the null and is "narrow enough," the absence of an effect can be deemed likely. This tactic lacks the formalism of equivalence testing, yet avoids the difficulties of interpretation and justification that come with a post-hoc margin.

• Deliberate or not, questionable research practices cause major harm to the credibility of psychology research (Sijtsma, 2016).
With this in mind, researchers, given their incentive to publish (Nosek et al., 2012), are not in the best position to define their own margins. This is true whenever the margin is pre-specified, and especially true when a margin is suggested post-hoc. As such, in order to avoid any potential scrutiny, researchers would be wise to seek an independent party, void of any potential biases, to define an appropriate margin. This is already common practice in clinical trials research, where sponsors have undeniable incentives to further drug development and the FDA and other regulators will (ideally) set clear guidance for an acceptable margin. In other fields, such as psychology, the suggestion that an equivalence margin be defined/scrutinized by an independent party has recently been considered within the framework of a proposed publication policy. In the conditional equivalence testing (CET) publication policy, the independent journal editor/reviewers are tasked with critically evaluating a given margin prior to the start of a study (Campbell and Gustafson, 2018).

Author Contact

H. Campbell: https://orcid.org/0000-0002-0959-1594 and P. Gustafson: https://orcid.org/0000-0002-2375-5006. Please contact H. Campbell at harlan.campbell@stat.ubc.ca with any inquiries.

Conflict of Interest and Funding

We have no conflicts of interest to declare. The research was supported by NSERC Discovery Grant RGPIN-2019-03957.

Author Contributions

H. Campbell and P. Gustafson both contributed to the concept and writing of this article. H. Campbell drafted the original manuscript, and P. Gustafson provided critical revisions. Both authors approved the final version of the manuscript for submission.

Open Science Statement

This article earned the Open Materials badge for making the materials available. It was not pre-registered and had no collected data to share. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Albers, G. W., Diener, H.-C., Frison, L., Grind, M., Nevinson, M., Partridge, S., Halperin, J. L., Horrow, J., Olsson, S. B., Petersen, P., et al. (2005). Ximelagatran vs warfarin for stroke prevention in patients with nonvalvular atrial fibrillation: A randomized trial. JAMA, 293(6), 690–698.

Altman, D. G., & Bland, J. M. (1995). Statistics notes: Absence of evidence is not evidence of absence. The BMJ, 311(7003), 485.

Boudes, P. F. (2006). The challenges of new drugs benefits and risks analysis: Lessons from the ximelagatran FDA cardiovascular advisory committee. Contemporary Clinical Trials, 27(5), 432–440.

Campbell, H. (2020). Equivalence testing for standardized effect sizes in linear regression. arXiv preprint arXiv:2004.01757.

Campbell, H., & Gustafson, P. (2018). Conditional equivalence testing: An alternative remedy for publication bias. PLoS ONE, 13(4), e0195145.

Chang, A., Clark, R., Thearle, D., Stone, G., Petsky, H., Champion, A., Wheeler, C., & Acworth, J. (2007). Longer better than shorter? A multicentre randomised control trial (RCT) of 5 vs 3 days of oral prednisolone for acute asthma in children. Respirology, 12, A67.

Chang, A. B., Clark, R., Sloots, T. P., Stone, D. G., Petsky, H. L., Thearle, D., Champion, A. A., Wheeler, C., & Acworth, J. P. (2008). A 5- versus 3-day course of oral corticosteroids for children with asthma exacerbations who are not hospitalised: A randomised controlled trial.
Medical Journal of Australia, 189(6), 306–310.

Cohen, J. (1977). Statistical power analysis for the behavioral sciences. Academic Press.

Committee for Proprietary Medicinal Products (CPMP). (2001). Points to consider on switching between superiority and non-inferiority. British Journal of Clinical Pharmacology, 52(3), 223.

Djulbegovic, B., Kumar, A., Magazin, A., Schroen, A. T., Soares, H., Hozo, I., Clarke, M., Sargent, D., & Schell, M. J. (2011). Optimism bias leads to inconclusive results - an empirical study. Journal of Clinical Epidemiology, 64(6), 583–593.

Fanelli, D. (2011). Negative results are disappearing from most disciplines and countries. Scientometrics, 90(3), 891–904.

FDA. (2013). PMA P100009: FDA summary of safety and effectiveness data. accessdata.fda.gov.

Flacco, M. E., Manzoli, L., & Ioannidis, J. (2016). Noninferiority is almost certain with lenient noninferiority margins. Journal of Clinical Epidemiology, 71, 118.

Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141(1), 2.

Gøtzsche, P. C. (2006). Lessons from and cautions about noninferiority and equivalence randomized trials. JAMA, 295(10), 1172–1174.

Groenewoud, E., Cohlen, B., Al-Oraiby, A., Brinkhuis, E., Broekmans, F., De Bruin, J., Van Den Dool, G., Fleisher, K., Friederich, J., Goddijn, M., et al. (2016). A randomized controlled, non-inferiority trial of modified natural versus artificial cycle for cryo-thawed embryo transfer. Human Reproduction, 31(7), 1483–1492.

Groenewoud, E., Macklon, B. K. N., & Cohlen, B. (2017). Response to: The impact of an inappropriate non-inferiority margin in a non-inferiority trial. Endometrial preparation methods in frozen-thawed embryo transfer, 31, 93.

Group, C. R. (2011). Ranibizumab and bevacizumab for neovascular age-related macular degeneration. New England Journal of Medicine, 364(20), 1897–1908.

Gupta, R., Gupta, H., & Banker, M. (2016). The impact of an inappropriate non-inferiority margin in a non-inferiority trial. Human Reproduction, 1–2.

Halperin, J. L. (2003). Ximelagatran compared with warfarin for prevention of thromboembolism in patients with nonvalvular atrial fibrillation: Rationale, objectives, and design of a pair of clinical studies and baseline patient characteristics (SPORTIF III and V). American Heart Journal, 146(3), 431–438.

Hartung, J., Cottrell, J. E., & Giffin, J. P. (1983). Absence of evidence is not evidence of absence. Anesthesiology: The Journal of the American Society of Anesthesiologists, 58(3), 298–299.

Head, S. J., Kaul, S., Bogers, A. J., & Kappetein, A. P. (2012). Non-inferiority study design: Lessons to be learned from cardiovascular trials. European Heart Journal, 33(11), 1318–1324.

Herink, M. (2016). Class update with new drug evaluation: Direct antivirals for Hepatitis C. https://www.orpdl.org/durm/meetings/meetingdocs/2016_01_28/archives/2016_01_28_HepatitisCClassUpdate_FINAL.pdf

Hirschler, B. (2011). Head-to-head eye drug results tipped for early May. Reuters. https://www.reuters.com/article/novartis-roche-lucentis/head-to-head-eye-drug-results-tipped-for-early-may-idUSLDE72S1T620110330

Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55(1), 19–24.

Hung, H., Wang, S.-J., & O'Neill, R. (2005).
A regulatory perspective on choice of margin and statistical inference issue in non-inferiority trials. Biometrical Journal, 47(1), 28–36.

Jones, P. M., Bainbridge, D., Chu, M. W., Fernandes, P. S., Fox, S. A., Iglesias, I., Kiaii, B., Lavi, R., & Murkin, J. M. (2016). Comparison of isoflurane and sevoflurane in cardiac surgery: A randomized non-inferiority comparative effectiveness trial. Canadian Journal of Anesthesia/Journal canadien d'anesthésie, 63(10), 1128–1139.

Kaptchuk, T. J. (2003). Effect of interpretive bias on research evidence. The BMJ, 326(7404), 1453–1455.

Kaul, S., Diamond, G. A., & Weintraub, W. S. (2005). Trials and tribulations of non-inferiority: The ximelagatran experience. Journal of the American College of Cardiology, 46(11), 1986–1995.

Keefe, R. S., Kraemer, H. C., Epstein, R. S., Frank, E., Haynes, G., Laughren, T. P., Mcnulty, J., Reed, S. D., Sanchez, J., & Leon, A. C. (2013). Defining a clinically meaningful effect for the design and interpretation of randomized controlled trials. Innovations in Clinical Neuroscience, 10(5-6 Suppl A), 4S.

Koh, A., & Cribbie, R. (2013). Robust tests of equivalence for k independent groups. British Journal of Mathematical and Statistical Psychology, 66(3), 426–434.

Kulbertus, H. (2003). SPORTIF III and V trials: A major breakthrough for long-term oral anticoagulation. Revue médicale de Liège, 58(12), 770–773.

Lakens, D., Adolfi, F., Albers, C., Anvari, F., Apps, M., Argamon, S., Baguley, T., Becker, R., Benning, S., Bradford, D., et al. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171.

Lakens, D., Scheel, A. M., & Isager, P. M. (2017). Equivalence testing for psychological research: A tutorial. Pre-print retrieved from the Open Science Framework.

Le Henanff, A., Giraudeau, B., Baron, G., & Ravaud, P. (2006). Quality of reporting of noninferiority and equivalence randomized trials. JAMA, 295(10), 1147–1151.

Lee, J. J., & Rubin, D. B. (2016). Evaluating the validity of post-hoc subgroup inferences: A case study. The American Statistician, 70(1), 39–46.

Mauri, L., Garg, P., Massaro, J. M., Foster, E., Glower, D., Mehoudar, P., Powell, F., Komtebedde, J., McDermott, E., & Feldman, T. (2010). The EVEREST II trial: Design and rationale for a randomized study of the Evalve MitraClip system compared with mitral valve surgery for mitral regurgitation.
American Heart Journal, 160(1), 23–29.

McCormack, P. L. (2015). Daclatasvir: A review of its use in adult patients with chronic hepatitis C virus infection. Drugs, 75(5), 515–524.

Meyners, M. (2007). Least equivalent allowable differences in equivalence testing. Food Quality and Preference, 18(3), 541–547.

Ng, T.-H. (2003). Issues of simultaneous tests for noninferiority and superiority. Journal of Biopharmaceutical Statistics, 13(4), 629–639.

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631.

Piaggio, G., Elbourne, D. R., Altman, D. G., Pocock, S. J., Evans, S. J., Group, C., et al. (2006). Reporting of noninferiority and equivalence randomized trials: An extension of the CONSORT statement. JAMA, 295(10), 1152–1160.

Pocock, S. J., & Stone, G. W. (2016). The primary outcome fails - what next? New England Journal of Medicine, 375(9), 861–870.

Quintana, D. S. (2018). Revisiting non-significant effects of intranasal oxytocin using equivalence testing. Psychoneuroendocrinology, 87, 127–130.

Rouder, J. N., Morey, R. D., Speckman, P. L., & Province, J. M. (2012). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56(5), 356–374.

Sijtsma, K. (2016). Playing with data - or how to discourage questionable research practices and stimulate researchers to do things right. Psychometrika, 81(1), 1–15.

Steinbrook, R. (2006). The price of sight: Ranibizumab, bevacizumab, and the treatment of macular degeneration. New England Journal of Medicine, 355(14), 1409–1412.

Struble, K. (2015). Clinical review, cross discipline team leader review. Center for Drug Evaluation and Research, Application number: 206843Orig1s000.

Walker, E., & Nowacki, A. S. (2011). Understanding equivalence and noninferiority testing. Journal of General Internal Medicine, 26(2), 192–196.

Wellek, S. (2010). Testing statistical hypotheses of equivalence and noninferiority. CRC Press.

Wellek, S. (2017). A critical evaluation of the current "p-value controversy". Biometrical Journal.

Wiens, B. L. (2002). Choosing an equivalence limit for noninferiority or equivalence studies. Controlled Clinical Trials, 23(1), 2–14.

Yanoff, L. B. (2014). Clinical review, cross discipline team leader review. Center for Drug Evaluation and Research, Application number: 022472Orig1s000.