Meta-Psychology, 2022, vol 6, MP.2020.2577 https://doi.org/10.15626/MP.2020.2577 Article type: Original Article Published under the CC-BY4.0 license Open data: Not Applicable Open materials: Yes Open and reproducible analysis: Yes Open reviews and editorial process: Yes Preregistration: No Edited by: Danielsson, H., Carlsson, R. Reviewed by: Schönbrodt, F., Schmukle, S., Hedge, C. Analysis reproduced by: Batinović, L., Fust, J. All supplementary files can be accessed at OSF: https://doi.org/10.17605/OSF.IO/W7CP3 Exploring reliability heterogeneity with multiverse analyses: Data processing decisions unpredictably influence measurement reliability Sam Parsons University of Oxford Radboud University Medical Center Abstract Analytic flexibility is known to influence the results of statistical tests, e.g. effect sizes and p-values. Yet, the degree to which flexibility in data processing decisions influences measurement reliability is unknown. In this paper I attempt to address this question using a series of 36 reliability multiverse analyses, each with 288 data processing specifications, including accuracy and response time cut-offs. I used data from a Stroop task and Flanker task at two time points, as well as a Dot Probe task across three stimuli conditions and three timepoints. This allowed for broad overview of internal consistency reliability and test-retest estimates across a multiverse of data processing specifi- cations. Largely arbitrary decisions in data processing led to differences between the highest and lowest reliability estimate of at least 0.2, but potentially exceeding 0.7. Importantly, there was no consistent pattern in reliability estimates resulting from the data processing specifications, across time as well as tasks. Together, data processing decisions are highly influential, and largely unpredictable, on measure reliability. I discuss actions researchers could take to mitigate some of the influence of reliability heterogeneity, including adopting hierarchical modelling approaches. Yet, there are no approaches that can completely save us from measurement error. Measurement matters and I call on readers to help us move from what could be a measurement crisis towards a measurement revolution. Keywords: reliability, multiverse, analytic flexibility, data processing In this paper I was concerned with the influence an- alytic flexibility on measurement reliability, specifically in data processing or data cleaning. I took inspiration from numerous papers reporting the unsettlingly low re- liability of Dot Probe attention bias indices (e.g. Jones et al., 2018; Schmukle, 2005; Staugaard, 2009) and other work investigating alternative analyses and data processing strategies, with the intention of yielding a more reliable measurement (e.g. Jones et al., 2018; Price et al., 2015). When considering the impact of re- searcher degrees of freedom, focus is drawn to decisions made in the beginning (task design) or at the end (data analysis) of the research process. I was interested in the middle step: data processing and measure reliability. In this paper, I explore and visualise the influence of data processing steps on reliability using a series of reliability multiverse analyses. https://doi.org/10.15626/MP.2020.2577 https://doi.org/10.17605/OSF.IO/W7CP3 2 Getting up to speed with reliability The accuracy of our conclusions rests on the quality, and the strength, of our evidence. Our evidence rests on the bedrock of our measurements. 
The quality of our measures defines the quality of our results. Without adequate focus on the validity of our measures, how can we be assured that we are capturing the concept or process that we are interested in? Without any at- tention to the reliability of our measures, how can we be sure that we are capturing a phenomenon with any precision? Psychological science has a guilty habit of ne- glecting these foundations, though of course some areas fair better than others. In a recent paper, my colleagues and I argued for a widespread appreciation for the reliability of our cogni- tive measures (Parsons et al., 2019). Briefly, low reli- ability places doubt on the veracity of statistical anal- yses using that measure; measurement reliability re- stricts the observable range of effect sizes in simple cor- relational analyses, and unpredictably in more compli- cated models; and failing to correct for measurement er- ror makes comparing effect sizes between, and within, studies difficult. These issues are compounded by the sad observation that the reporting of reliability (and va- lidity) evidence is woefully poor. Scale validity and re- liability is not routinely examined, and many scales are adapted on an ad hoc basis with little or no validation (Flake et al., 2017). In other cases scales fail to pass deeper psychometric evaluation, including tests of mea- surement invariance (Hussey and Hughes, 2018). This likely reflects issues with more superficial approaches to establishing validity evidence - i.e. reporting Cronbach’s alpha, stating it is adequate, and moving on. Pockets of psychological science take a more enlightened ap- proach. However, I feel it is reasonable to argue that the field at large is not doing well in our measurement practices. Most relevant to this paper; it is the excep- tion rather than the norm to evaluate the psychometric properties of cognitive measurements (Gawronski et al., 2011). Strictly speaking, we cannot state that a task is unre- liable; although we might observe a consistent pattern of unreliability in measurements obtained that causes us to question further use of the task. An important reminder: estimates of reliability refer to the measure- ment obtained - in a specific sample and under particu- lar circumstances, including the task parameters. Relia- bility is therefore not fixed; it may differ between pop- ulations, samples, and testing conditions. Variations of a task may lead to the generation of more or less reli- able measurements. For example, the stimulus presen- tation duration will likely influence the cognitive pro- cesses involved in completing the task, perhaps leading participants to perform more consistently in one ver- sion, relative to another. Reliability is a property of the measurement, not of the task used to obtain it. In this study, we are concerned with the data processing steps researchers take and how these influence our measure- ment, and the resulting reliability estimates. To explore this, I invite you to join me, dear reader, on a walk through the garden of forking paths. Analytic flexibility and the garden of forking paths Every result presented in every research article is the culmination of many decisions made by one or more researchers; the sheer number of combinations of valid decisions is likely uncalculatable. The “garden of fork- ing paths” (Gelman and Loken, 2013) is a useful anal- ogy to illustrate this. 
With each decision that must be made, however arbitrary, the researcher comes to a fork in their research path and selects one branch. To add a little suspense, there will be many cases when the researcher does not notice a fork in the road. Perhaps the researcher unconsciously makes the same turn as always, their feet working of their own accord. These forks in the path, the decisions researchers make (whether they are aware or not), may be reasonably combined to make a near uncountable number of paths. Each path also leads to a location; some paths end close to one another, and other times the paths diverge wildly. We can think of the end of the path as the statistical result our researcher arrives at.

The researcher has to decide their path based on the soundest justifications they can make at each fork (e.g. Lakens et al., 2018). Of course, psychological science has become fully aware of the detrimental effects of selecting one's path retrospectively, based on where the path ends or the results most exciting to the researcher (read as: p < .05; e.g. Simmons et al., 2011). Analytic flexibility is not inherently bad. However, we must acknowledge the ramifications. The effects we observe, or do not, are potentially influenced by all of the decisions made to arrive at them. Thus, a range of possible effects may have been observed that could be more or less equally valid or justifiable based on the analytical decisions made.

In discussions of analytical flexibility, focus is usually given primarily to decisions made during statistical analysis. For example, should I control for age and gender? Do I reason that this model is more appropriate than that one? Or where should I set my alpha and how should I justify the decision? Discussions of analytical flexibility often concern issues around p-hacking and other QRPs (intended or unintended). However, as Leek and Peng (2015) note, p-values are the tip of the iceberg; not enough scrutiny is given to the impact of the many steps in the research pipeline that precede inference testing. I agree. In my estimation, flexibility in measurement and data handling does not receive the scrutiny it deserves. If the garden of forking paths concerns analytic flexibility, then measurement flexibility decides which gateway one enters the garden through in the first place. As an example, a recent review highlighted the lack of consensus around the processing of data from tasks in the attention control literature, including but not limited to the data pre-processing used in this paper (von Bastian et al., 2020, p. 47-48).

Mapping the garden of forking paths with multiverse analyses

Multiverse analyses (Steegen et al., 2016) offer us a "GPS in the garden of forking paths" (Quintana and Heathers, 2019). The process is simpler than one might expect. First, we define a set of reasonable data processing and analysis decisions. Second, we run the entire set of analyses. We can then examine the full range of results. Specification curve analysis (Simonsohn et al., 2015) adds a third step, allowing for inference tests across the distribution of results generated in the multiverse (for insightful applications of specification curve analyses, see Orben and Przybylski, 2019; Rohrer et al., 2017). In this paper I use 'specification' to refer to each combination of data processing decisions in the multiverse analysis.
Multiverse analyses enable us to explore how a re- searcher’s – sometimes arbitrary – choices in data pro- cessing (e.g. outlier removal) and analysis decisions (e.g. including covariates, splitting samples) influence statistical results, and the conclusions drawn from the analysis. From this we can examine which choices are more or less influential than others, as well as how ro- bust the result is across the full set of specifications. A reliability multiverse from many data processing decisions In this paper I report multiverse analyses exploring the influence of data processing specifications on the reliability of a calculated measurement. I used openly accessible Stroop task and Flanker task data generously shared by Hedge and colleagues (Hedge et al., 2018) and Dot Probe task data from the CogBIAS project (Booth et al., 2017; Booth et al., 2019). Following our previous work in this area (Parsons et al., 2019), I was interested in the stability and range of reliability estimates on cognitive-behavioural measures. Broadly, I was interested in the impact of data processing deci- sions on reliability. It is possible that certain analytic de- cisions tend to yield higher reliability estimates; it may be that particular combinations of decisions are also bet- ter, or worse, than others. Beyond that, I was inter- ested in the range of estimates. A small range would suggest that measure reliability is relatively stable as we make potentially arbitrary data processing decisions while walking the garden of forking paths. A large range suggests hidden measurement reliability hetero- geneity. This is potentially an important, and underap- preciated, contributor to the replicability crisis (Loken and Gelman, 2017). Alternatively, this could be a herald for a crisis of measurement. Methods Data Stroop and Flanker task data were obtained from the online repository for Hedge, Sumner, and Powell (Hedge et al., 2018, https://osf.io/cwzds/). Full de- tails of the data collection, study design, and procedure can be found in Hedge et al. (Hedge et al., 2018). These data are ideal for our purposes as they a) contain many trials, helping us obtain more precise estimates of reliability, and b) include two assessment time-points approximately 3-4 weeks apart, allowing us to explore both: internal consistency and test-retest reliability. The data were collected from different studies; for simplicity in this paper, the data across studies were pooled (n = 107 before any data processing – note that this may be different from the sample size presented by Hedge et al. due to differences in data processing). Dot Probe data were obtained from the CogBIAS project (Booth et al., 2017; Booth et al., 2019). Full details of the full study and data collection can be found in Booth et al. (2017; 2019). These data complement the Stroop and Flanker data as they provide a longer test-retest duration (approximately 1.5 years between repeated measures) across three timepoints. In addi- tion, the task incorporated three stimuli conditions, al- lowing us cross-sectional comparisons of reliability sta- bility within the same task. The Dot Probe data were pooled such that only a subset of participants complet- ing the task at all three timepoints were retained (n = 285). Interested readers can find the data and code used to perform the multiverse analyses and generate this manuscript in the Open Science Framework repository for this project (https://osf.io/haz6u/). 
¹ I used the following R packages for all analyses and figures, and to generate this document: R (Version 4.1.0; R Core Team, 2018) and the R packages Cairo (Version 1.5.12.2; Urbanek and Horner, 2019), dplyr (Version 1.0.9; Wickham, François, et al., 2019), forcats (Version 0.5.1; Wickham, 2019a), ggplot2 (Version 3.3.6; Wickham, 2016), gridExtra (Version 2.3; Auguie, 2017), papaja (Version 0.1.1; Aust and Barth, 2018), patchwork (Version 1.1.1; Pedersen, 2019), psych (Version 2.1.6; Revelle, 2019), purrr (Version 0.3.4; Henry and Wickham, 2019), readr (Version 1.4.0; Wickham et al., 2018), splithalf (Version 0.8.1; Parsons, 2021), stringr (Version 1.4.0; Wickham, 2019b), tibble (Version 3.1.7; Müller and Wickham, 2019), tidyr (Version 1.2.0; Wickham and Henry, 2019), tidyverse (Version 1.3.1; Wickham, Averick, et al., 2019), and tinylabels (Version 0.2.3; Barth, 2022).

Stroop task

Participants made keyed responses to the colour of a word presented in the centre of the screen. In congruent conditions the word was the same as the font colour, whereas, in incongruent trials, the word was a different colour from the font colour. In a neutral condition, the word was not a colour word. Participants completed 240 of each trial type. The outcome index we explore here is the RT cost, calculated as the average RT for incongruent trials minus the average RT for congruent trials.

Flanker task

Participants made keyed responses to the direction of a central arrow presented in the centre of the screen. In congruent conditions the flanking arrows pointed in the same direction as the central arrow, whereas, in incongruent trials, they pointed in the opposite direction. In a neutral condition, the flankers were straight lines. Participants completed 240 of each trial type. The outcome index we explore here is the RT cost, calculated as the average RT for incongruent trials minus the average RT for congruent trials.

Dot Probe task

Participants made keyed responses to the identity of a probe presented on screen. The probe was presented in the same location as one of the paired faces presented on screen for 500ms prior. The paired faces were an emotional face (angry, pained, or happy) paired with a neutral face (taken from the STOIC faces database, Roy et al., 2009). In congruent trials, the probe was presented in the same location as the emotional face. In incongruent trials, the probe was presented in the same location as the neutral face. Participants completed three blocks of 56 trials corresponding to the emotion presented. The 'attention bias' outcome index (MacLeod et al., 1986) was calculated as the average RT for incongruent trials minus the average RT for congruent trials.

Multiverse analysis

In a personal effort to make my research reproducible, and also to help others perform similar processes, I have developed simple functions to perform the multiverse analyses reported in this paper. Readers interested in performing similar analyses can find these functions within the splithalf package (Parsons, 2021) and tutorials on the related GitHub page (https://github.com/sdparsons/splithalf). The key functions are: splithalf.multiverse, testretest.multiverse, and multiverse.plot. Intraclass Correlation Coefficients (ICC2) were estimated using the psych R package (Revelle, 2019). Interested readers can also inspect the code used to perform the analyses in this paper (https://osf.io/haz6u/).

Step 1. Creating a list of all specifications. No data were removed before the multiverse analysis. To my knowledge, there are no fixed standards in the literature for processing data from any of the tasks. I identified six decisions common to processing RT data, though there are many more. For simplicity I stuck to RT difference scores as the outcome measure of interest. However, there are very different analytical techniques that might be applied to RT tasks such as this (for example, multilevel modelling and drift-diffusion modelling approaches).
The decisions were as follows:

• Total accuracy. Researchers may opt to remove participants with accuracy lower than a pre-specified cut-off; for example, 80 or 90 per cent. I used three options: no cut-off, 80%, and 90%.

• Absolute response time removals. Researchers will often remove trials faster than a minimum RT threshold and trials that exceed a maximum RT threshold. I use minimum RT cut-offs at 100ms and 200ms, as well as no cut-off. And I use two maximum RT cut-offs: 3000ms and 2000ms.

• Relative RT cut-offs. After absolute RT cut-offs, researchers can decide to remove trials with RTs greater than a number of standard deviations from the mean (sometimes called relative cut-offs or trimmed means). Three SDs from the mean would remove very extreme outliers; two SDs from the mean is common. I have not seen researchers use one SD from the mean as a cut-off, as it is likely too conservative a threshold. As I was interested in a wide range of possible specifications, I included one standard deviation. In the multiverse I use no relative cut-off, as well as cut-offs of one, two, and three SDs from the mean.

• Where to apply the relative cut-off. The decision to remove trials based on an SD cut-off comes with its own decision. Namely, at what granularity? We could remove trials with RTs greater than 2 SDs from the participant's average RT, for example. We could also remove trials with RTs greater than 2 SDs from the mean RT within each trial type (congruent and incongruent, for example). I included both options: participant level and trial-type level.

• Averaging. Most often the mean RT within each trial type is calculated, and may then be analysed directly, or a difference score calculated for analysis. Researchers may opt to use the median RT instead. I included both options.

The number of possible combinations (data processing specifications) quickly increases with every additional option. Here we have 3 × 2 × 3 × 4 × 2 × 2 = 288 possible specifications.

Step 2. Run all specifications and extract reliability estimates. From this decision list, we have a complete list of 288 data processing specifications. In the multiverse analysis the data are processed following each specification's parameters, before estimating the reliability of the resulting outcome measure. Internal consistency was estimated using 500 permutations of the splithalf (Parsons et al., 2019) procedure for each specification (5000 is standard, but 500 was selected to reduce processing time). Following Hedge et al. (2018), and because ICC relates to both the correlation and the agreement among repeated measures, test-retest reliability was estimated using ICC2k (Koo & Li, 2016).

Step 3. Visualising the multiverse. I find that one of the joys of multiverse analyses is the visualisations, because sometimes science is more art than science. I explain the visualisations in the results section.
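To make Steps 1 and 2 concrete, the sketch below shows one way the 288 specifications could be enumerated and applied to trial-level data. This is an illustration only, not the code used for the reported analyses (which relies on the splithalf package); the data frame and column names subject, congruency, rt, and correct are assumptions about the data layout.

```r
library(dplyr)
library(tidyr)

# Enumerate all 3 x 2 x 3 x 4 x 2 x 2 = 288 data processing specifications
specs <- expand.grid(
  acc_cutoff = c(0, 0.80, 0.90),              # participant accuracy threshold (0 = none)
  max_rt     = c(2000, 3000),                 # absolute maximum RT (ms)
  min_rt     = c(0, 100, 200),                # absolute minimum RT (ms; 0 = none)
  sd_cutoff  = c(Inf, 1, 2, 3),               # relative cut-off in SDs (Inf = none)
  sd_level   = c("participant", "trialtype"), # granularity of the relative cut-off
  average    = c("mean", "median"),           # averaging method
  stringsAsFactors = FALSE
)
nrow(specs)  # 288

# Apply one specification (a single row of `specs`) to trial-level data and
# return each participant's RT cost (incongruent minus congruent)
apply_spec <- function(trials, spec) {
  d <- trials %>%
    group_by(subject) %>%
    filter(mean(correct) >= spec$acc_cutoff) %>%       # accuracy-based participant removal
    ungroup() %>%
    filter(rt >= spec$min_rt, rt <= spec$max_rt)        # absolute RT cut-offs

  # relative (SD-based) trial removal at the chosen granularity
  d <- if (spec$sd_level == "trialtype") {
    group_by(d, subject, congruency)
  } else {
    group_by(d, subject)
  }
  d <- d %>%
    filter(abs(rt - mean(rt)) <= spec$sd_cutoff * sd(rt)) %>%
    ungroup()

  # mean or median RT per trial type, then the difference score
  avg <- if (spec$average == "mean") mean else median
  d %>%
    group_by(subject, congruency) %>%
    summarise(rt = avg(rt), .groups = "drop") %>%
    pivot_wider(names_from = congruency, values_from = rt) %>%
    mutate(rt_cost = incongruent - congruent)
}
```

Looping apply_spec() over the rows of specs (e.g. rt_costs <- apply_spec(stroop_trials, specs[1, ]) for a hypothetical stroop_trials data frame), and then passing each processed dataset to a permutation split-half estimator (as implemented in splithalf) or the scores from repeated sessions to psych::ICC() for the test-retest ICC2k, mirrors the logic of Steps 1 and 2.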
Analysis plan

For the core analysis I performed 18 multiverse analyses following the steps described above. Separately for each of the Stroop and Flanker task data, I examined internal consistency reliability at time 1 and at time 2, as well as test-retest reliability from time 1 to time 2. For the Dot Probe data, I examined internal consistency reliability at each of the three timepoints, separately for the three task conditions, as well as test-retest reliability across timepoints. For each multiverse I report the median estimate and its 95% confidence interval, the proportion of estimates exceeding 0.7, and the range of estimates in that multiverse. In addition to visualising each multiverse, I also include visualisations overlapping the internal consistency multiverses over time. These overlapped plots allow us to visually inspect whether the pattern of reliability estimates following the full range of data processing specifications is comparable across each time point.

Inferences from the multiverse

It is not my aim in this paper to make inferences from these reliability multiverse analyses as one would in a specification curve analysis (Simonsohn et al., 2015). One could use this method to perform inference testing against the curve of reliability estimates. However, it is not clear what this would add: testing whether the reliability estimates significantly differ from zero is a low bar for assessing the reliability of a measure.

Results

I include a visualisation for each multiverse analysis. The reliability estimates are presented on the y-axis at the top of the figure; each estimate is represented by a black dot and the 95% confidence interval is represented by the shaded band. The x-axis indicates each individual multiverse specification of processing decisions (288 total), displayed in the 'dashboard' at the bottom of the figure. The vertical dashed line running through the top panel and the bottom dashboard represents the median reliability estimate. This line is extended through the dashboard to demonstrate that the estimate is derived from the unique combination of data processing decisions, including (from top to bottom, in order of processing step): 1) participant removal below total accuracy threshold, 2) maximum RT cut-off, 3) minimum RT cut-off, 4) removal of RTs > this number of SDs from the mean, 5) whether this removal is at the trial or subject level, and 6) use of mean or median to derive averages.

Stroop Time 1: Internal Consistency

The median reliability estimate was 0.76, 95% CI [0.69, 0.92]. Estimates ranged from 0.68 to 0.92. 97% of the reliability estimates were > 0.7.

Stroop Time 2: Internal Consistency

The median reliability estimate was 0.66, 95% CI [0.61, 0.89]. Estimates ranged from 0.58 to 0.90. 25% of the reliability estimates were > 0.7.

Stroop: test-retest

The median reliability estimate was 0.56, 95% CI [0.50, 0.63]. Estimates ranged from 0.47 to 0.63. 0% of the reliability estimates were > 0.7.

Flanker Time 1: Internal Consistency

The median reliability estimate was 0.82, 95% CI [0.65, 0.92]. Estimates ranged from 0.62 to 0.93. 93% of the reliability estimates were > 0.7.

Figure 1. Internal consistency reliability multiverse for Stroop RT cost at time 1

Flanker Time 2: Internal Consistency

The median reliability estimate was 0.71, 95% CI [0.62, 0.91]. Estimates ranged from 0.59 to 0.91. 55% of the reliability estimates were > 0.7.

Flanker: test-retest

The median reliability estimate was 0.55, 95% CI [0.30, 0.69].
Estimates ranged from 0.29 to 0.72. 2% of the reliability estimates were > 0.7.

Overlapping time 1 and time 2 multiverses

In the next two figures I overlap the time 1 and time 2 multiverses, separately for the Stroop and Flanker data. The specifications are ordered by the reliability estimates at time 1 for each measure (Figures 1 and 3). These figures allow us to compare the patterns of reliability estimates following the same data processing decisions.

Dot Probe Task

For ease of presentation (and to reduce the total number of figures), we visualise the Dot Probe task reliability multiverses entirely as overlapping plots.

Angry faces

For the angry faces condition, the median and 95% CI for each wave of testing were: wave 1, -0.04, 95% CI [-0.17, 0.58]; wave 2, 0.01, 95% CI [-0.21, 0.61]; wave 3, 0.03, 95% CI [-0.08, 0.65].

Figure 2. Internal consistency reliability multiverse for Stroop RT cost at time 2

Happy faces

For the happy faces condition, the median and 95% CI for each wave of testing were: wave 1, -0.09, 95% CI [-0.19, 0.58]; wave 2, -0.01, 95% CI [-0.10, 0.65]; wave 3, 0.07, 95% CI [-0.04, 0.65].

Pained faces

For the pained faces condition, the median and 95% CI for each wave of testing were: wave 1, 0.04, 95% CI [-0.26, 0.65]; wave 2, -0.09, 95% CI [-0.17, 0.60]; wave 3, 0.15, 95% CI [-0.08, 0.68].

Dot Probe: test-retest

Test-retest reliability estimates (ICC2) for each condition were: angry, 0.04, 95% CI [0, 0.10]; happy, 0, 95% CI [0, 0.07]; pain, 0, 95% CI [0, 0.01].

Secondary analyses: reliability and number of trials

Increasing the number of trials typically increases reliability estimates (e.g. Hedge et al., 2018; von Bastian et al., 2020). A visual inspection of the multiverses suggests that specifications involving the removal of more trials (i.e. removing trials greater than 1 standard deviation from the average) lead to higher reliability estimates. Table 1 presents the Pearson correlations between the reliability estimates and the number of trials retained in each specification. For internal consistency reliability these correlations typically ran counter to expectations of reduced trials leading to reduced reliability. In most cases the association was negative: more trials removed during data processing was associated with higher reliability estimates. In contrast, for most of the test-retest reliability multiverses, removal of more trials led to lower reliability estimates.

Figure 3. Test-retest reliability multiverse for Stroop RT cost

To investigate this further, I reran the multiverses for the Stroop and Flanker data using only the first half of trials collected for each participant. I also reran the multiverses for the Dot Probe data using only the first 20 trials for each trial type (I attempted to rerun the Dot Probe data with only 14 trials for each trial type, but this led to errors under stricter specifications where there were too few trials to run the reliability estimation). To save the reader from viewing all 18 multiverses for a second time, the code and all outputs can be found in the supplementary materials. On visual inspection of the multiverse visualisations, the overall pattern of results is similar: specifications resulting in the removal of more trials tend to result in higher reliability estimates. The final column in Table 1 presents the mean difference in reliability estimates for each of the 18 multiverses (positive values indicate higher reliability estimates with the full number of trials).
For internal consistency estimates, multiverses with fewer trials had lower reliability estimates, on average, for the Stroop and Flanker tasks. But, against expectations, reliability estimates increased for the Dot Probe task when the number of trials was reduced. In contrast, almost all test-retest estimates were reduced in the reduced number of trials analyses. Figure 13 presents the difference between reliability estimates in the full vs reduced trials multiverses for all 18 multiverse analyses.

Discussion

Across 18 reliability multiverse analyses, and their colourful visualisations, we explored the influence of data pre-processing specifications on measure reliability. To briefly summarise: internal consistency reliability estimates ranged from 0.58 to 0.92 in the Stroop data, 0.59 to 0.93 in the Flanker data, and -0.28 to 0.68 in the Dot Probe data. Test-retest reliability estimates ranged from 0.47 to 0.63 in the Stroop data, 0.29 to 0.72 in the Flanker data, and 0 to 0.11 in the Dot Probe data.

Figure 4. Internal consistency reliability multiverse for Flanker RT cost at time 1

From the introduction we remember that reliability estimates are a product of: the sample and the population they are drawn from, the task (including any differences in implementation), and the circumstances in which the measurement was obtained; i.e. reliability is not an inherent quality of the task itself. The first conclusion we can draw from these multiverse analyses is that data processing specifications are also an integral part of this list. At the outset of this project, I thought it reasonable to assume that a particular feature of the data processing path might result in consistently higher (or lower) reliability estimates. The clearest indication we can take from these analyses is that there is no single set of data processing specifications, or combination of data processing decisions, that leads to improved reliability.

The wide ranges of estimates are an additional cause for concern. Seemingly arbitrary data processing decisions can lead to differences of more than .3 in the reliability of a measure. These decisions are equally reasonable and logical choices, and we should not expect them to have a meaningful impact on the theoretical questions being asked of the data. The reliability multiverse analyses presented here demonstrate this using data from a Stroop and a Flanker task. As well as across tasks, overlapping the time 1 and time 2 multiverses for both tasks highlights that even the same set of specifications does not lead to directly comparable internal consistency reliability estimates over time. Data processing decisions appear to be extremely important contributors to measure reliability, but their influence is unpredictable and arbitrary.

Figure 5. Internal consistency reliability multiverse for Flanker RT cost at time 2

The secondary analyses give us more insight into the relationship between the number of trials retained through data processing and the resultant reliability estimates. The picture is not a simple one. Figure 13 highlights the unpredictable influence of what is essentially another multiverse specification decision – do I remove half of the trials before any other data processing? While the underlying pattern of more data reduction leading to greater reliability generally holds across tasks, within tasks fewer trials led to lower reliability on average for the Stroop and Flanker tasks (as we should expect) but not the Dot Probe.
More work is needed to unravel these influences, but a take-home message may be: while administering more trials to participants is typically a good thing for reliability, there may be some benefit (in terms of reliability) to removing more trials. Though, as I discuss below, the pursuit of reliability alone should not be the goal. In the core of this discussion I raise several open questions and suggest some plausible actions that could be taken to mitigate some of the risk that reliability heterogeneity poses.

How do we guard against reliability heterogeneity?

In simple bivariate analyses, we usually think low reliability will simply attenuate estimated effect sizes (e.g. Spearman, 1904). But the influence can be far less predictable (the reader may be noticing a trend of unpredictability in this paper). Low reliability can lead to elevation of effect size estimates and even reversals in direction (for examples, see Brakenhoff et al., 2018; Segerstrom and Boggero, 2020), with the influence becoming more unpredictable in more complex models. It is therefore important to take reliability heterogeneity into account when comparing effect sizes (for several clear examples, see Cooper et al., 2017). It is plausible that some studies may have obtained smaller or larger effect sizes than others based, in part, on the reliability of the measurements taken. Similarly, identical observed effect sizes may represent very different 'true' effect sizes, if reliability is taken into account. Recently, Wiernik and Dahlke (2020) made a strong case for correcting for measurement error in meta-analyses and provide the necessary formula and code for doing so. There are several actions we can take to begin to account for reliability heterogeneity.

Figure 6. Internal consistency reliability multiverse for Flanker RT cost at time 2

Two simple recommendations

To briefly reiterate two recommendations my colleagues and I have made previously: a) report all data processing steps taken, and b) report the reliability of the measures analysed (Parsons et al., 2019). These recommendations will not 'fix' potential psychometric issues within one's study, or reliability heterogeneity across studies. However, complete reporting of data processing will assist in the computational reproducibility of one's results. Reporting psychometric information will assist in the interpretation of results, including comparisons of effect sizes, as well as provide useful information about the utility of a task in studies of individual differences.

Figure 7. Overlapped internal consistency reliability multiverse for Stroop RT cost at times 1 and 2

Multiverse analyses as a robustness check

One approach is running a multiverse across a justified set of data processing specifications (that yield the same theoretically justified construct of interest; see the section on validity below) and generating a distribution of effect sizes from the final analyses under these specifications. In principle this is the same as a sensitivity or robustness analysis, and acts as a check on the reliability heterogeneity introduced by different (but equally justifiable) data processing specifications.

Adopt a modelling approach

Incorporating trial-level variation into our analyses with hierarchical modelling approaches (aka mixed models, multilevel models) will likely be a vital step in protecting us against reliability heterogeneity.
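As a minimal sketch of what such a model can look like (an illustration only, not an analysis reported in this paper), a trial-level mixed model for the Stroop data could be specified as below; the data frame stroop_trials and its columns rt, congruency, and subject are assumed names.

```r
library(lme4)

# Fixed effect of congruency plus by-participant random intercepts and
# random congruency slopes. The random-slope variance reflects true
# individual differences in the RT cost, separated from trial-level noise.
fit <- lmer(rt ~ congruency + (1 + congruency | subject), data = stroop_trials)
summary(fit)

# Shrunken (empirical Bayes) estimates of each participant's congruency effect
coef(fit)$subject

# Response times are typically right-skewed, so a generalised or fully
# Bayesian model (e.g. a shifted lognormal likelihood in brms) may capture
# the data-generating process better than this Gaussian sketch.
```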
Psychological effects are often heterogeneous across individuals (Bolger et al., 2019), and factors within tasks have important effects (e.g. stimuli differences; DeBruine and Barr, 2021). It follows that our models should take trial-level variation into account. More than this, models that capture the theorized data-generating process, including relevant distributions (e.g. response time distributions are typically very right-skewed), likely have a better chance of capturing the process of interest in the first place. Using the Stroop and Flanker data from Hedge et al. (2018), Rouder, Kumar, and Haaf (2019; see also Rouder and Haaf, 2018) demonstrated that hierarchical models should be used to account for error in measurement (for additional guidance on applying this modelling, see Haines, 2019). Adopting this approach has the benefit of 'correcting' the effect size estimate (and standard error) for measurement error as part of the model, rather than as an additional step to aid in interpretations and effect size comparisons (a step that is often missed once reliability is deemed "acceptable", assuming that reliability is estimated in the first place). Rouder and colleagues demonstrate that this is also a more effective approach than 'correcting' the effect size estimate using, e.g., Spearman's correction for attenuation formula (Spearman, 1904). Yet, even better corrections cannot fully save us from measurement error.

Figure 8. Overlapped internal consistency reliability multiverse for Flanker RT cost at times 1 and 2

Hierarchical models do bring their own considerations and potential issues. Applied researchers, or those without training, may need further support to ensure the model specifications are appropriate. The model covariance structure, and appropriate priors in the case of Bayesian approaches, have the potential to introduce additional sources of bias/researcher degrees of freedom. But, given existing resources and a growing body of training materials and work in this area, it is my view that a modelling approach is likely the best next step (Haines et al., 2020; Rouder et al., 2019; Sullivan-Toole et al., 2021; DeBruine and Barr, 2021). An additional benefit of these approaches is that they typically avoid much of the data pre-processing discussed in this paper, and thus the reliability heterogeneity it generates.

Limitations and room for expansion

A small number of tasks. One limitation of this study is the focus on a small sample of tasks. It is possible that data from other tasks tend to yield more or less consistent patterns of reliability estimates across data processing specifications. Similarly, I have only examined RT costs (i.e. a difference score between two trial types) as the outcome measure. The analyses could have examined accuracy rates, RT averages, signal detection, and a wide variety of outcome measures. It is very possible that other outcome indices would be more or less consistently reliable across the range of data processing specifications. I opted for brevity in this paper by selecting only these tasks; I welcome future work seeking to examine a wider range of tasks and outcome indices.

Figure 9. Internal consistency reliability multiverse for Dot Probe attention bias (angry faces) at times 1, 2, and 3

Extracting the influence of individual decisions. The analyses here do not allow for an in-depth examination of the influence of specific data processing decisions.
Given lack of consistency across timepoints and measures, I am not confident that robust conclusions could be drawn about a specific decision compared to another. A plausible approach to examine this is a Vibra- tion of Effects analysis (e.g. Klau et al., 2021) in which the variance of the final distribution of estimates can be decomposed to examine the relative influence of dif- ferent categories of decisions, e.g. model specifications and data processing decisions. Using this information, we might be able to prioritise sources of measurement heterogeneity more accurately. Applicability to experimental vs correlational analyses. There is a paradox in measurement relia- bility (see Hedge et al., 2018): Experimental effects that are highly replicable (for example, the Stroop ef- fect) may also show low reliability. Homogeneity within groups or experimental conditions allows for larger and more robust effects; researchers can opt to develop tasks that capitalise on homogeneity. Unfortunately, reliability requires robust individual differences (and vice versa). Highly reliable measures by necessity show consistent, potentially large, individual differences and 15 Figure 10. Internal consistency reliability multiverse for Dot Probe attention bias (happy faces) at times 1, 2, and 3 would not be suitable for group differences or experi- mental research. As a result, measures tend to be more appropriate for questions of a) assessing differences be- tween groups or experimental conditions, or b) corre- lational or individual differences. I was primarily con- cerned with the use of these measures in individual dif- ferences research – hence the focus on reliability. Yet, it would be overly simplistic to assert that the discussions in this paper do not also relate to experimental differ- ences questions. Indeed, the data processing specifica- tions that maximise the measure’s utility in individual differences analyses can also hinder the measure’s util- ity in experimental questions. Further research would be needed to quantify the relative influences on correla- tional vs experimental analyses. Yet, large fluctuations in relative between-subjects vs within-subjects variance, due to data processing, holds importance for any re- search question. Simulation studies. Several valuable extensions to the current approach could be made via simulation ap- proaches. By simulating data with a known measure- ment structure, we could examine variance in reliability estimates that operates purely by chance: i.e. where no systematic differences in reliability exist across pre- processing decisions. Comparing the distributions to those observed in tasks such as those analysed here would offer insight into how severe reliability hetero- geneity is introduced in “real world” data. These simu- lations are beyond the scope of this initial paper; how- ever hold promise to detect variance and bias relative to a ‘true’ value of reliability in the simulated data. 16 Figure 11. Internal consistency reliability multiverse for Dot Probe attention bias (pain faces) at times 1, 2, and 3 What about validity? Others have previously demonstrated that measures are often used ad hoc or with little reported validation efforts (e.g. Flake et al., 2017; Hussey and Hughes, 2018). This study cannot begin to assess the influence of data processing flexibility on measure validity – nor did this paper attempt to address this question. Relia- bility is only one piece of evidence needed to demon- strate the validity of a measure. 
Yet, it is an impor- tant piece of evidence as “reliability provides an upper bound for validity” (Zuo et al., 2019, page 3). While we cannot directly conclude that flexibility in data pro- cessing influences measure validity, we should look to further research to investigate. One possibility would be to conduct a validity multiverse analysis similar to the “Many Analysts, One Data Set” project (Silberzahn et al., 2018). In this project, 29 teams (61 analysts to- tal) analysed the same dataset. The teams adopted a number of different analytic approaches which resulted in a range of results. The authors concluded that, “Un- certainty in interpreting research results is therefore not just a function of statistical power or the use of ques- tionable research practices; it is also a function of the many reasonable decisions that researchers must make in order to conduct the research” (page 354). Another important validity consideration is the rela- tionship between our data processing pipelines and the (latent) construct of interest. In questionnaire develop- ment, removing or adapting items might influence re- liability. But, more importantly, will give rise to a dif- ferent measure that may be more or less related to our latent construct of interest. For example, Fried (2017) found that several common depression questionnaires captured very different clusters of symptoms, which 17 Figure 12. Internal consistency reliability multiverse for Dot Probe attention bias (pain faces) at times 1, 2, and 3 should make us question what is meant by “depression” in the first place when using these measures. More relevant to task measures, to maximise reliabil- ity we might seek to develop a novel version of a task that relies on average response times, instead of a dif- ference score between average response times. While this would yield highly reliable measures, the purpose of the difference score is to isolate the process of inter- est. Therefore, while we have maximized reliability, we have also influenced both the construct of interest and the validity of the measure. Perhaps this more reliable measure fails to capture the effect we intended to mea- sure. For a more in depth discussion about balancing these theoretical, validity, and reliability considerations see von Bastian et al. (2020, Goodhew and Edwards, 2019). With respect to the data pre-processing steps taken in this paper, it could be reasonably argued that some pre- processing specifications yield different constructs of in- terest or could be more or less valid for the process of in- terest. Are we really interested in the construct includ- ing only very accurate participants and only 60% of tri- als close to the average response time? In this sense, the data pre-processing decisions a researcher might adopt are certainly not arbitrary from a validity standpoint. A reasonable approach in applied work would be to select a narrower set of processing specifications that the re- searcher believes are theoretically similar enough that the same construct is being measured. Returning to the garden My intention for this project was to provide some indication about the influence of data processing path- ways on the reliability of our cognitive measurements. 
Table 1
Correlations between reliability estimates and number of trials retained across specifications

Task      Time  Measure    Correlation  95% CI low  95% CI high  Difference
Stroop    1     splithalf  -0.38        -0.48       -0.28         0.13
Stroop    2     splithalf  -0.38        -0.48       -0.28         0.10
Flanker   1     splithalf  -0.61        -0.68       -0.53         0.12
Flanker   2     splithalf  -0.55        -0.63       -0.47         0.24
DPTangry  1     splithalf  -0.54        -0.62       -0.45        -0.03
DPTangry  2     splithalf  -0.66        -0.72       -0.58        -0.27
DPTangry  3     splithalf  -0.27        -0.37       -0.16        -0.23
DPThappy  1     splithalf  -0.58        -0.65       -0.50        -0.02
DPThappy  2     splithalf  -0.51        -0.59       -0.42        -0.22
DPThappy  3     splithalf  -0.42        -0.51       -0.32        -0.23
DPTpain   1     splithalf  -0.59        -0.66       -0.51        -0.06
DPTpain   2     splithalf  -0.39        -0.49       -0.29        -0.27
DPTpain   3     splithalf  -0.15        -0.26       -0.03        -0.20
Stroop    –     ICC         0.37         0.26        0.46         0.08
Flanker   –     ICC        -0.59        -0.66       -0.51         0.11
DPTangry  –     ICC         0.61         0.54        0.68         0.04
DPThappy  –     ICC         0.42         0.32        0.51        -0.01
DPTpain   –     ICC        -0.01        -0.12        0.11         0.00

The influence can be profound; the multiverse analyses show large differences between the highest and lowest reliability estimates. Yet, we see little consistency in the pattern of decisions leading to higher, or lower, estimates. We have the worst of both worlds: data processing decisions are largely arbitrary yet can have a large – and relatively unpredictable – impact on the resulting reliability estimates. Briefly returning to the garden of forking paths metaphor, I imagined that this project would help illuminate the point at which our hypothetical researcher would enter the garden, based on their data processing decisions. But our investigation has uncovered an unfortunate secret: our researcher's forking paths are almost entirely arbitrary and interwoven. Each path diverges wildly, leading to almost anywhere in the garden. It is as if our researcher is simply spinning in dizzy circles until they stumble somewhere along the fence of reliability. I discussed several actions researchers can collectively take to help with the issue. But these are by no means remedies for our reliability issues, nor would they directly help with the validity of our measurements.

Thankfully, there is a growing awareness that measurement matters (Fried & Flake, 2018). A valuable term, Questionable Measurement Practices (QMPs), was recently added to our vernacular by Flake and Fried (2020). QMPs describe "decisions researchers make that raise doubts about the validity of the measures used in a study, and ultimately the validity of the final conclusion" (p. 458). I hope that QMPs and the importance of measurement become as widely discussed as the parallel term, 'Questionable Research Practices' (QRPs). Most importantly, wider discussion of these practices should make it clear to all researchers that we make many potentially impactful decisions in the design of our measures, our data processing or cleaning, and our data analysis.

I am concerned that we sit on the precipice of a measurement crisis. The so-called replication crisis shook much of our field into widespread and ongoing reforms. Yet, much of the focus has been on improving methodological and statistical practices. This is undoubtedly worthwhile, but largely omits discussion of the reliability and validity of our measurements – despite our measurements forming the basis of any outcome or inference. This oversight feels like repairing a damaged wall at the same time as ignoring the shifting foundations under it.
I hope that this paper, along with other related work, highlights the issue and encourages researchers to place more emphasis on quality measurement. As a field, we can orchestrate a measurement revolution (cf. the "credibility revolution," Vazire, 2018) in which the quality and validity of our measurements are placed above obtaining desired results. If the reader takes home a single message from this paper, please let it be "measurement matters."

Figure 13. Difference in reliability estimates from all trials to reduced trials. Note: red = test-retest ICC2, blue = internal consistency estimate

Author Contact

Correspondence should be addressed to Sam Parsons, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands. Email: sam.parsons@radboudumc.nl. ORCID: 0000-0002-7048-4093.

Conflict of Interest

I declare no conflicts of interest.

Funding

SP is currently supported by a Radboud Excellence Fellowship. This work was initially supported by an ESRC grant [ES/R004285/1].

Acknowledgements

I would like to thank Ana Todorovic for her insightful feedback on an earlier version of this manuscript.

Author Contributions

SP was responsible for all aspects of this manuscript: data analysis, visualisations, writing, & revisions.

Open Science Practices

This article earned the Open Materials badge for making the materials openly available. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.

References

Auguie, B. (2017). gridExtra: Miscellaneous functions for "grid" graphics [R package version 2.3]. https://CRAN.R-project.org/package=gridExtra

Aust, F., & Barth, M. (2018). papaja: Create APA manuscripts with R Markdown [R package version 0.1.0.9842]. https://github.com/crsh/papaja

Barth, M. (2022). tinylabels: Lightweight variable labels [R package version 0.2.3]. https://cran.r-project.org/package=tinylabels

Bolger, N., Zee, K. S., Rossignac-Milon, M., & Hassin, R. R. (2019). Causal processes in psychology are heterogeneous. Journal of Experimental Psychology: General, 148(4), 601–618. https://doi.org/10.1037/xge0000558

Booth, C., Songco, A., Parsons, S., Heathcote, L., Vincent, J., Keers, R., & Fox, E. (2017). The CogBIAS longitudinal study protocol: Cognitive and genetic factors influencing psychological functioning in adolescence. BMC Psychology, 5(1). https://doi.org/10.1186/s40359-017-0210-3

Booth, C., Songco, A., Parsons, S., Heathcote, L. C., & Fox, E. (2019). The CogBIAS longitudinal study of adolescence: Cohort profile and stability and change in measures across three waves. BMC Psychology, 7(73). https://doi.org/10.1186/s40359-019-0342-8

Brakenhoff, T. B., van Smeden, M., Visseren, F. L. J., & Groenwold, R. H. H. (2018). Random measurement error: Why worry? An example of cardiovascular risk factors (R. Sichieri, Ed.). PLOS ONE, 13(2), e0192298. https://doi.org/10.1371/journal.pone.0192298

Cooper, S. R., Gonthier, C., Barch, D. M., & Braver, T. S. (2017). The role of psychometrics in individual differences research in cognition: A case study of the AX-CPT. Frontiers in Psychology, 8(SEP), 1–16. https://doi.org/10.3389/fpsyg.2017.01482

DeBruine, L., & Barr, D. J. (2021). Understanding Mixed-Effects Models Through Data Simulation. Advances in Methods and Practices in Psychological Science, 4(1), 1–15.
https://doi.org/ 10.1177/2515245920965119 Flake, J. K., & Fried, E. I. (2020). Measurement Schmea- surement: Questionable Measurement Practices and How to Avoid Them. Advances in Methods and Practices in Psychological Science, 3(456- 465), 10. Flake, J. K., Pek, J., & Hehman, E. (2017). Construct Validation in Social and Personality Research: Current Practice and Recommendations [ISBN: 1948-5506]. Social Psychological and Personal- ity Science, 8(4), 370–378. https://doi.org/10. 1177/1948550617693063 Fried, E. I. (2017). The 52 symptoms of major depres- sion: Lack of content overlap among seven com- mon depression scales. Journal of Affective Dis- orders, 208, 191–197. https : / / doi . org / 10 . 1016/j.jad.2016.10.019 Fried, E. I., & Flake, J. K. (2018). Measurement mat- ters. Observer. https : / / www . psychologi % 20calscience . org / observer / measurement - matters Gawronski, B., Deutsch, R., & Banse, R. (2011). Re- sponse Interference Tasks as Indirect Measures of Automatic Associations. Cognitive methods in social psychology (pp. 78–123). The Guilford Press. Gelman, A., & Loken, E. (2013). The garden of fork- ing paths: Why multiple comparisons can be a problem, even when there is no âï¬shing ex- peditionâ or âp-hackingâ and the research hy- pothesis was posited ahead of time, 17. https: //doi.org/dx.doi.org/10.1037/a0037714 Goodhew, S. C., & Edwards, M. (2019). Translat- ing experimental paradigms into individual- differences research: Contributions, challenges, and practical recommendations. Consciousness and Cognition, 69, 14–25. https://doi.org/10. 1016/j.concog.2019.01.008 Haines, N. (2019). Thinking generatively: Why do we use atheoretical statistical models to test sub- stantive psychological theories? http://haines- lab.com/post/thinking- generatively- why- do- we- use- atheoretical- statistical- models- to- test- substantive-psychological-theories/ Haines, N., Kvam, P. D., Irving, L. H., Smith, C., Beauchaine, T. P., Pitt, M. A., Ahn, W.-Y., & Turner, B. (2020). Theoretically Informed Gener- ative Models Can Advance the Psychological and Brain Sciences: Lessons from the Reliability Para- dox (preprint). PsyArXiv. https : / / doi . org / 10 . 31234/osf.io/xr7y3 Hedge, C., Powell, G., & Sumner, P. (2018). The reliabil- ity paradox: Why robust cognitive tasks do not produce reliable individual differences. Behav- ior Research Methods, 50(3), 1166–1186. https: //doi.org/10.3758/s13428-017-0935-1 Henry, L., & Wickham, H. (2019). Purrr: Functional programming tools [R package version 0.3.3]. 
https://CRAN.R-project.org/package=purrr https://CRAN.R-project.org/package=gridExtra https://CRAN.R-project.org/package=gridExtra https://github.com/crsh/papaja https://github.com/crsh/papaja https://cran.r-project.org/package=tinylabels https://cran.r-project.org/package=tinylabels https://doi.org/10.1037/xge0000558 https://doi.org/10.1037/xge0000558 https://doi.org/10.1186/s40359-017-0210-3 https://doi.org/doi.org/10.1186/s40359-019-0342-8 https://doi.org/doi.org/10.1186/s40359-019-0342-8 https://doi.org/10.1371/journal.pone.0192298 https://doi.org/10.1371/journal.pone.0192298 https://doi.org/10.3389/fpsyg.2017.01482 https://doi.org/10.3389/fpsyg.2017.01482 https://doi.org/10.1177/2515245920965119 https://doi.org/10.1177/2515245920965119 https://doi.org/10.1177/1948550617693063 https://doi.org/10.1177/1948550617693063 https://doi.org/10.1016/j.jad.2016.10.019 https://doi.org/10.1016/j.jad.2016.10.019 https://www.psychologi%20calscience.org/observer/measurement-matters https://www.psychologi%20calscience.org/observer/measurement-matters https://www.psychologi%20calscience.org/observer/measurement-matters https://doi.org/dx.doi.org/10.1037/a0037714 https://doi.org/dx.doi.org/10.1037/a0037714 https://doi.org/10.1016/j.concog.2019.01.008 https://doi.org/10.1016/j.concog.2019.01.008 http://haines-lab.com/post/thinking-generatively-why-do-we-use-atheoretical-statistical-models-to-test-substantive-psychological-theories/ http://haines-lab.com/post/thinking-generatively-why-do-we-use-atheoretical-statistical-models-to-test-substantive-psychological-theories/ http://haines-lab.com/post/thinking-generatively-why-do-we-use-atheoretical-statistical-models-to-test-substantive-psychological-theories/ http://haines-lab.com/post/thinking-generatively-why-do-we-use-atheoretical-statistical-models-to-test-substantive-psychological-theories/ https://doi.org/10.31234/osf.io/xr7y3 https://doi.org/10.31234/osf.io/xr7y3 https://doi.org/10.3758/s13428-017-0935-1 https://doi.org/10.3758/s13428-017-0935-1 https://CRAN.R-project.org/package=purrr 21 Hussey, I., & Hughes, S. (2018). Hidden invalidity among fifteen commonly used measures in so- cial and personality psychology [00000]. https: //doi.org/10.31234/osf.io/7rbfp Jones, A., Christiansen, P., & Field, M. (2018). Failed at- tempts to improve the reliability of the Alcohol Visual Probe task following empirical recom- mendations. Psychology of Addictive Behaviors, 32(8), 922–932. https : / / doi . org / 10 . 31234 / osf.io/4zsbm Klau, S., Hoffmann, S., Patel, C. J., Ioannidis, J. P., & Boulesteix, A.-L. (2021). Examining the ro- bustness of observational associations to model, measurement and sampling uncertainty with the vibration of effects framework. Interna- tional Journal of Epidemiology, 50(1), 266–278. https://doi.org/10.1093/ije/dyaa164 Koo, T. K., & Li, M. Y. (2016). A Guideline of Se- lecting and Reporting Intraclass Correlation Coefficients for Reliability Research [arXiv: PMC4913118 Publisher: Elsevier B.V. ISBN: 1556-3707]. Journal of Chiropractic Medicine, 15(2), 155–163. https : / / doi . org / 10 . 1016 / j . jcm.2016.02.012 Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., Baguley, T., Becker, R. B., Benning, S. D., Bradford, D. E., Buchanan, E. M., Caldwell, A. R., Van Calster, B., Carlsson, R., Chen, S.-C., Chung, B., Colling, L. J., Collins, G. S., Crook, Z., . . . Zwaan, R. A. (2018). Jus- tify your alpha. Nature Human Behaviour, 2(3), 168–171. https : / / doi . org / 10 . 
Leek, J. T., & Peng, R. D. (2015). P values are just the tip of the iceberg. Nature, 520, 612. https://doi.org/10.1038/520612a

Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585. https://doi.org/10.1126/science.aal3618

MacLeod, C., Mathews, A., & Tata, P. (1986). Attentional bias in emotional disorders. Journal of Abnormal Psychology, 95(1), 15–20. https://doi.org/10.1037//0021-843X.95.1.15

Müller, K., & Wickham, H. (2019). Tibble: Simple data frames [R package version 2.1.3]. https://CRAN.R-project.org/package=tibble

Orben, A., & Przybylski, A. K. (2019). The association between adolescent well-being and digital technology use. Nature Human Behaviour, 3(2), 173–182. https://doi.org/10.1038/s41562-018-0506-1

Parsons, S. (2021). Splithalf: Robust estimates of split half reliability. Journal of Open Source Software, 6(60), 3041. https://doi.org/10.21105/joss.03041

Parsons, S., Kruijt, A.-W., & Fox, E. (2019). Psychological Science Needs a Standard Practice of Reporting the Reliability of Cognitive-Behavioral Measurements. Advances in Methods and Practices in Psychological Science, 2(4), 378–395. https://doi.org/10.1177/2515245919879695

Pedersen, T. L. (2019). Patchwork: The composer of plots [R package version 1.0.0]. https://CRAN.R-project.org/package=patchwork

Price, R. B., Kuckertz, J. M., Siegle, G. J., Ladouceur, C. D., Silk, J. S., Ryan, N. D., Dahl, R. E., & Amir, N. (2015). Empirical recommendations for improving the stability of the dot-probe task in clinical research. Psychological Assessment, 27(2), 365–376. https://doi.org/10.1037/pas0000036

Quintana, D. S., & Heathers, J. (2019). A GPS in the Garden of Forking Paths (with Amy Orben). https://doi.org/10.17605/OSF.IO/38KPE

R Core Team. (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. https://www.R-project.org/

Revelle, W. (2019). Psych: Procedures for psychological, psychometric, and personality research [R package version 1.9.12]. Northwestern University. Evanston, Illinois. https://CRAN.R-project.org/package=psych

Rohrer, J. M., Egloff, B., & Schmukle, S. C. (2017). Probing Birth-Order Effects on Narrow Traits Using Specification-Curve Analysis. Psychological Science, 28(12), 1821–1832. https://doi.org/10.1177/0956797617723726

Rouder, J., & Haaf, J. M. (2018). A Psychometrics of Individual Differences in Experimental Tasks. https://doi.org/10.31234/osf.io/f3h2k

Rouder, J., Kumar, A., & Haaf, J. M. (2019). Why most studies of individual differences with inhibition tasks are bound to fail. https://doi.org/10.31234/osf.io/3cjr5

Roy, S., Roy, C., Éthier-Majcher, C., Fortin, I., Belin, P., & Gosselin, F. (2009). STOIC: A database of dynamic and static faces expressing highly recognizable emotions. http://mapageweb.umontreal.ca/gosselif/sroyetal_sub.pdf

Schmukle, S. C. (2005). Unreliability of the dot probe task. European Journal of Personality, 19(7), 595–605. https://doi.org/10.1002/per.554

Segerstrom, S. C., & Boggero, I. A. (2020). Expected Estimation Errors in Studies of the Cortisol Awakening Response: A Simulation. Psychosomatic Medicine, 82(8), 751–756. https://doi.org/10.1097/PSY.0000000000000850
Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E., Bahník, Š., Bai, F., Bannard, C., Bonnier, E., Carlsson, R., Cheung, F., Christensen, G., Clay, R., Craig, M. A., Dalla Rosa, A., Dam, L., Evans, M. H., Flores Cervantes, I., . . . Nosek, B. A. (2018). Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Advances in Methods and Practices in Psychological Science, 1(3), 337–356. https://doi.org/10.1177/2515245917747646

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2015). Specification Curve: Descriptive and Inferential Statistics on All Reasonable Specifications. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2694998

Spearman, C. (1904). The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 15(1), 72. https://doi.org/10.2307/1412159

Staugaard, S. R. (2009). Reliability of two versions of the dot-probe task using photographic faces. Psychology Science Quarterly, 51(3), 339–350.

Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing Transparency Through a Multiverse Analysis. Perspectives on Psychological Science, 11(5), 702–712. https://doi.org/10.1177/1745691616658637

Sullivan-Toole, H., Haines, N., Dale, K., & Olino, T. M. (2021). Enhancing the Psychometric Properties of the Iowa Gambling Task Using Full Generative Modeling (preprint). PsyArXiv. https://doi.org/10.31234/osf.io/yxbjz

Urbanek, S., & Horner, J. (2019). Cairo: R graphics device using cairo graphics library for creating high-quality bitmap (png, jpeg, tiff), vector (pdf, svg, postscript) and display (x11 and win32) output [R package version 1.5-10]. https://CRAN.R-project.org/package=Cairo
Vazire, S. (2018). Implications of the Credibility Revolution for Productivity, Creativity, and Progress. Perspectives on Psychological Science, 13(4), 411–417. https://doi.org/10.1177/1745691617751884

von Bastian, C. C., Blais, C., Brewer, G. A., Gyurkovics, M., Hedge, C., Kałamała, P., Meier, M. E., Oberauer, K., Rey-Mermet, A., Rouder, J. N., Souza, A. S., Bartsch, L. M., Conway, A. R. A., Draheim, C., Engle, R. W., Friedman, N. P., Frischkorn, G. T., Gustavson, D. E., Koch, I., . . . Wiemers, E. A. (2020). Advancing the understanding of individual differences in attentional control: Theoretical, methodological, and analytical considerations (preprint). PsyArXiv. https://doi.org/10.31234/osf.io/x3b9k

Wickham, H. (2016). Ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org

Wickham, H. (2019a). Forcats: Tools for working with categorical variables (factors) [R package version 0.4.0]. https://CRAN.R-project.org/package=forcats

Wickham, H. (2019b). Stringr: Simple, consistent wrappers for common string operations [R package version 1.4.0]. https://CRAN.R-project.org/package=stringr

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., . . . Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H., François, R., Henry, L., & Müller, K. (2019). Dplyr: A grammar of data manipulation [R package version 0.8.3]. https://CRAN.R-project.org/package=dplyr

Wickham, H., & Henry, L. (2019). Tidyr: Tidy messy data [R package version 1.0.0]. https://CRAN.R-project.org/package=tidyr

Wickham, H., Hester, J., & Francois, R. (2018). Readr: Read rectangular text data [R package version 1.3.1]. https://CRAN.R-project.org/package=readr

Wiernik, B. M., & Dahlke, J. A. (2020). Obtaining Unbiased Results in Meta-Analysis: The Importance of Correcting for Statistical Artifacts. Advances in Methods and Practices in Psychological Science. https://doi.org/10.1177/2515245919885611

Zuo, X.-N., Xu, T., & Milham, M. P. (2019). Harnessing reliability for neuroscience research. Nature Human Behaviour. https://doi.org/10.1038/s41562-019-0655-x