A Causally Naïve and Rigid Population Model of Disease Occurrence Given Two Non-Independent Risk Factors

Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org * 10(2):e216, 2018 OJPHI

Olaf Dammann,1,2 Kenneth Chui,1 and Anselm Blumer3

1. Department of Public Health and Community Medicine, Tufts University School of Medicine, Boston, MA
2. Department of Gynecology and Obstetrics, Hannover Medical School, 30623 Hannover, Germany
3. Department of Computer Science, Tufts University School of Engineering, Tufts University, Medford, MA

ABSTRACT

We describe a computational population model with two risk factors and one outcome variable. Its inputs are the prevalence (%) of all three variables, the association between each risk factor and the disease, and the association between the two risk factors. We briefly describe three examples: retinopathy of prematurity, diabetes in Panama, and smoking and obesity as risk factors for diabetes. We describe and discuss the simulation results in these three scenarios, including how the published information is used as input and how changes in risk factor prevalence change outcome prevalence.

DOI: 10.5210/ojphi.v10i2.9357

Copyright ©2018 the author(s)

This is an Open Access article. Authors own copyright of their articles appearing in the Online Journal of Public Health Informatics. Readers may copy articles without permission of the copyright owner(s), as long as the author and OJPHI are acknowledged in the copy and the copy is used for educational, not-for-profit purposes.

1. Introduction

In epidemiology, the concept of multi-causality holds that the occurrence of any disease depends on a set of risk factors, not just one. The generation of virtual databases that reflect the properties of populations is called microsimulation [1].
In their simplest form, such models require as input two risk factors and their association with one outcome variable. One example is SYNTHEA, a virtual population of individuals and their electronic health records (EHRs) [2]. Such an algorithm could simulate individuals with, say, three characteristics: a binary disease outcome (coded as yes/no) and two binary risk factors (yes/no). The algorithm uses as input parameters the population prevalence of the two risk factors and the outcome variable; the allocation of “yes” or “no” for each variable is done by applying a Monte-Carlo simulation that uses random numbers and the population prevalence as a threshold. This ensures that, for instance, on average 37% of the virtual population will have a certain disease if the real population prevalence of that disease is 37% and the threshold for “disease = yes” is set at 0.37.

These microsimulations have one particular disadvantage: if the presence or absence of each variable in the final database is based on separate yes/no attribution processes, the variables will be independent. This, of course, is highly unlikely in reality, because the very definition of a risk factor is that it is associated with the disease under investigation. Moreover, the two risk factors will be independent of each other, which is also rarely the case in real-life situations. This way of performing microsimulations will lead to populations that look like their real-life counterparts only with regard to the population average of risk factors and outcome. Such datasets cannot be used to simulate population-wide changes in risk factors with the goal of studying population-wide changes in the outcome (disease).
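The independence problem described above can be illustrated with a minimal Python sketch. It is not taken from SYNTHEA or from any published implementation; the function names, the seed, and the two risk factor prevalences (0.32 and 0.75) are illustrative assumptions, and only the 0.37 outcome threshold comes from the text. Each variable is assigned by comparing a fresh random number with its prevalence, and the odds ratios between the resulting variables are then computed.

```python
import random


def simulate_independent(n, prev_rf1, prev_rf2, prev_out, seed=1):
    """Assign each binary variable by thresholding its own random number."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    people = []
    for _ in range(n):
        rf1 = rng.random() < prev_rf1
        rf2 = rng.random() < prev_rf2
        out = rng.random() < prev_out
        people.append((rf1, rf2, out))
    return people


def odds_ratio(pairs):
    """Odds ratio AD/(BC) from an iterable of (exposure, outcome) booleans."""
    a = b = c = d = 0
    for exposed, diseased in pairs:
        if exposed and diseased:
            a += 1
        elif exposed:
            b += 1
        elif diseased:
            c += 1
        else:
            d += 1
    return (a * d) / (b * c)


# Illustrative prevalences; 0.37 is the disease threshold used in the text.
population = simulate_independent(200_000, 0.32, 0.75, 0.37)
prevalence = sum(out for _, _, out in population) / len(population)
or_rf1_out = odds_ratio((rf1, out) for rf1, _, out in population)
or_rf1_rf2 = odds_ratio((rf1, rf2) for rf1, rf2, _ in population)

print(f"outcome prevalence: {prevalence:.3f}")
print(f"OR risk factor 1 vs outcome: {or_rf1_out:.2f}")
print(f"OR risk factor 1 vs risk factor 2: {or_rf1_rf2:.2f}")
```

With a large virtual population, the simulated prevalence should sit close to the threshold while both odds ratios stay close to 1, which is the lack of association the paragraph above describes.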
Therefore, we wanted to design a model that requires as input the population prevalence of the outcome of interest and of two risk factors, as well as their three associations (Figure 1).

Figure 1. The associations among two non-independent risk factors and one outcome are quantified by three odds ratios.

In what follows, we describe a population model with two risk factors and one outcome variable. Its inputs are the prevalence (%) of all three variables, the association between each risk factor and the disease, and the association between the two risk factors. We briefly describe three examples: (#1) retinopathy of prematurity; (#2) diabetes in Panama; and (#3) smoking and obesity as risk factors for diabetes. Next, we describe the simulation results in these three scenarios, including how the published information is used as input (Step 1) and how changes in risk factor prevalence change outcome prevalence (Step 2).

2. METHODS

2.1 The Model

Suppose we have a standard 2 x 2 table for an outcome against a risk factor (Figure 2). Label the cells A, B, C, D, where A is the percent of the population for which both the risk factor and the outcome are positive, B is the percent where the risk factor is positive but the outcome is negative, C is the percent where the risk factor is negative but the outcome is positive, and D is the percent where both are negative. Then, if RF is the percent of the population with positive risk factor and OUT is the percent of the population with positive outcome, we have

$$B = \mathrm{RF} - A \tag{1}$$

$$C = \mathrm{OUT} - A \tag{2}$$

$$D = 100 - A - B - C \tag{3}$$

The equation for the odds ratio is based on the quantities depicted in Figure 2:

$$\mathrm{OR} = \frac{AD}{BC} \tag{4}$$
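Equations (1) to (4) determine the whole table once RF, OUT, and OR are given: substituting (1) to (3) into (4) and rearranging yields the quadratic (OR − 1)A² − (100 + (OR − 1)(RF + OUT))A + OR · RF · OUT = 0. The following Python sketch solves it for A. It is not the authors' JavaScript implementation; the function name `solve_table`, the input checks, and the choice of the root that keeps all four cells non-negative are assumptions of this sketch. The demonstration inputs (32%, 47%, OR 2.8) are the sepsis and retinopathy values reported for Example #1.

```python
import math


def solve_table(rf, out, odds_ratio):
    """Return cells (A, B, C, D) in percent for given margins and odds ratio.

    rf and out are prevalences in percent (strictly between 0 and 100).
    """
    if not (0 < rf < 100 and 0 < out < 100 and odds_ratio > 0):
        raise ValueError("need 0 < rf, out < 100 and odds_ratio > 0")
    k = odds_ratio - 1.0
    if abs(k) < 1e-12:
        # OR = 1: the quadratic degenerates to independence.
        a = rf * out / 100.0
    else:
        s = 100.0 + k * (rf + out)
        disc = s * s - 4.0 * k * odds_ratio * rf * out
        # The root with the minus sign keeps every cell non-negative.
        a = (s - math.sqrt(disc)) / (2.0 * k)
    b = rf - a                 # equation (1)
    c = out - a                # equation (2)
    d = 100.0 - a - b - c      # equation (3)
    return a, b, c, d


# Example #1 inputs from the text: sepsis 32%, ROP 47%, OR 2.8.
a, b, c, d = solve_table(32, 47, 2.8)
print(f"A={a:.1f} B={b:.1f} C={c:.1f} D={d:.1f} OR={a * d / (b * c):.2f}")
```

Recomputing AD/(BC) from the returned cells is a quick way to confirm that the table matches the requested odds ratio and margins.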
We can substitute for B, C, and D using the first three equations, giving a quadratic equation for A with coefficients in terms of RF, OUT, and OR:

$$(\mathrm{OR}-1)A^2 - \bigl(100 + (\mathrm{OR}-1)(\mathrm{RF}+\mathrm{OUT})\bigr)A + \mathrm{OR}\cdot\mathrm{RF}\cdot\mathrm{OUT} = 0 \tag{5}$$

Figure 2. Fourfold table depicting the four entities defined by the presence (+) or absence (-) of a binary risk factor and an outcome.

Solving this equation, and taking the root for which all four cells are non-negative, gives a 2 x 2 table that matches the given population values for RF and OUT and has the desired odds ratio. This much is calculated in "Step 1" in the JavaScript implementation of the model (available at http://www.cs.tufts.edu/~ablumer/PopStat.html).

We can also use this equation to model the effect of keeping the odds ratio fixed and changing the percentage of the population that has the risk factor. This can be done by replacing A and RF in the above equation with r·A and r·RF, where r is the factor by which the risk factor prevalence is scaled, and solving for the value of OUT that keeps the odds ratio constant. This assumes that the relative percentages of the population with positive and negative outcome among those with the risk factor (A relative to B) stay the same when the prevalence of the risk factor changes.

Since we have two risk factors, we can do identical calculations relating risk factor 1 to the outcome and relating risk factor 2 to the outcome. Similarly, we can find the entries for the 2 x 2 table relating risk factor 1 to risk factor 2.

2.2 Examples

2.2.1 Example #1: Retinopathy of prematurity

We previously analyzed a data set of 617 very preterm newborns [3]. In that project, we found that 47% of all babies developed retinopathy of prematurity (ROP), a serious eye disorder among extremely preterm infants [4]. Systemic inflammation [5] and oxygen exposure [6] are competing pathogenetic mechanisms that interfere with normal vasculogenesis [7].
The capability to simulate interventions on one or both of these pathomechanisms in order to study changes in ROP occurrence would be a groundbreaking step towards its prevention. In our data analysis, we also found that 32% of the infants had sepsis and 75% had been exposed to high levels of oxygen. The associations of sepsis and oxygen with ROP (measured as odds ratios, OR) were 2.8 and 3.6, respectively. The OR for the association between sepsis and oxygen was 2.6. Figure 3 shows how these data were then entered into the model.

2.2.2 Example #2: Diabetes in Panama

A second example is a study on diabetes in Panama (prevalence 5.4%) [8], with female sex (RF1: 60%) and age 50+ years (RF2: 31%) as risk factor exemplars. Female sex was associated with diabetes (OR = 1.4), as was age 50+ (OR = 5.1). The OR for the association between female sex and age 50+ was 0.85 (see Figure 4). Obviously, in this case, the risk factors are not to be modified to simulate a population intervention as in the previous example. Instead, we are interested in the effect on diabetes prevalence of the discrepancy between the observed age distribution described in [8] (50+ years = 31%) and national data published by the United Nations (20%) [9].

Figure 3. Simulation results of Step 1 in example #1, retinopathy of prematurity.

2.2.3 Example #3: Smoking, BMI, and Diabetes

A randomized controlled trial (RCT) of estrogen plus progestin (EP) versus placebo was conducted in the 1990s to explore the effect of EP on subsequent development of coronary heart disease (CHD) in postmenopausal women [10].
We wanted to use the publicly available data from this RCT to explore the influence of smoking and body mass on diabetes, and to use these data as input for a simulation of the effect of two interventions, smoking cessation and weight reduction, on diabetes occurrence.

3. Results

3.1 Example #1

In Step 1, we entered the population percentages for both risk factors and the outcome, as well as the three associations among them. The estimated four-fold tables provided by the model are depicted in Figure 3.

In Step 2, we proceeded to the simulation of risk factor modification. First, we reduced RF1 incrementally from 32% to 0% (Table 1). This resulted in a drop of RF2 from 75% to 70% and a reduction in outcome occurrence from 47% to 39%.

Table 1. Example #1. Risk factor (RF)2 and outcome (OUT) changes when RF1 declines (%).

RF1 (Sepsis)   RF2 (Oxygen)   Outcome (Retinopathy of Prematurity)
32             75             47
30             75             46
25             74             45
20             73             44
15             72             43
10             72             41
5              71             40
0              70             39

Second, we reduced RF2 incrementally from 75% to 0%. This resulted in a drop of RF1 from 32% to 18% and a reduction in outcome occurrence from 47% to 25%.

Third, we calculated that even if both RF were reduced to 0, we would still be left with a 21% outcome rate, which is probably attributable to other risk factors. It is also possible that the odds ratios change as the population statistics approach the extremes.

3.2 Example #2

The estimated four-fold tables provided by the model after Step 1 are depicted in Figure 4.

3.3 Example #3

In the publicly available HERS dataset (http://www.biostat.ucsf.edu/vgsm/data.html), we looked at diabetes (on oral medication or insulin) as the outcome, and at smoking and overweight/obesity as risk factors (Table 3).
In an exploratory data analysis, we found that in this cohort of postmenopausal women with an average age of 67 years, 26% had diabetes, 13% were smokers, and 34% were obese (defined as a BMI ≥30). Smoking was associated with a reduced risk for diabetes (OR 0.5; 95% CI 0.4, 0.7), obesity with a strong risk increase (3.3; 2.7, 3.9), and smoking had an inverse association with obesity (0.6; 0.4, 0.7) (Table 3).

Figure 4. Simulation results of Step 1 in example #2, diabetes in Panama.

In Step 2, simulating a reduction of the prevalence of age 50+ from the observed 31% down to the 20% estimated by the UN resulted in a decrease in the population prevalence of diabetes from 5.4% to 4.4% (data not shown).

Table 2. Example #1. Risk factor (RF)1 and outcome (OUT) changes when RF2 declines (%).

RF1 (Sepsis)   RF2 (Oxygen)   Outcome (Retinopathy of Prematurity)
32             75             47
31             70             46
29             60             43
27             50             40
26             40             37
24             30             34
22             20             31
20             10             28
18             0              25

Table 3. Diabetes among 2758 postmenopausal women, the association between risk factors (smoking and overweight/obese) and diabetes, and the association between risk factors. These data served as input for example #3.

                      Diabetes YES   Diabetes NO   OR (95% C.I.)
N (row %)             728 (26)       2030 (74)
Smoking, N (col %)    60 (8)         299 (15)      0.5 (0.4, 0.7)
Obese, N (col %)      397 (55)       545 (27)      3.3 (2.7, 3.9)

Association RF1/RF2
Smoking   N (row %)   Obese (BMI ≥30), N (col %)   OR (95% C.I.)
YES       359         85 (24)
NO        2399        857 (36)                     0.6 (0.4, 0.7)

We then simulated two interventions, smoking cessation and weight reduction.
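How such an intervention simulation might work can be sketched in Python under one reading of Step 2 in Section 2.1: the outcome risk among people with the risk factor (A relative to RF) and among people without it (C relative to 100 − RF) is held fixed, which keeps the odds ratio constant, and the outcome prevalence is recomputed for a new risk factor prevalence. This is an assumption of the sketch, not a transcription of the authors' JavaScript, and the function names `cell_a` and `shift_risk_factor` are hypothetical. The inputs are the starting smoking and diabetes percentages of the smoking cessation simulation (13% and 18%) with the reported smoking odds ratio of 0.5.

```python
import math


def cell_a(rf, out, odds_ratio):
    """Percent of the population with both risk factor and outcome."""
    k = odds_ratio - 1.0
    if abs(k) < 1e-12:
        return rf * out / 100.0  # OR = 1 means independence
    s = 100.0 + k * (rf + out)
    disc = s * s - 4.0 * k * odds_ratio * rf * out
    return (s - math.sqrt(disc)) / (2.0 * k)


def shift_risk_factor(rf, out, odds_ratio, new_rf):
    """Outcome prevalence (%) after moving the risk factor from rf to new_rf.

    Assumption: the outcome risk within the exposed and within the
    unexposed stays fixed, so the odds ratio is unchanged.
    """
    a = cell_a(rf, out, odds_ratio)
    risk_exposed = a / rf
    risk_unexposed = (out - a) / (100.0 - rf)
    return risk_exposed * new_rf + risk_unexposed * (100.0 - new_rf)


# Smoking cessation: smoking 13%, diabetes 18%, OR 0.5.
for new_rf in (13, 10, 8, 6, 4, 2, 0):
    new_out = shift_risk_factor(13, 18, 0.5, new_rf)
    print(f"smoking {new_rf:2d}% -> diabetes {new_out:.1f}%")
```

The sketch covers only the outcome column; the accompanying change in the second risk factor would need the same calculation applied to the table relating the two risk factors.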
We have to keep in mind that while obesity is associated with a risk increase, smoking is associated with a decreased risk for diabetes. The fact that the two risk factors are negatively associated (less obesity among smokers) might explain this “protective effect of smoking”. Reducing smoking to zero in this population led to a minuscule increase of diabetes occurrence from 18 to 19%, which we confirmed in a stratified analysis excluding smokers (Table 4). Among non-smokers, diabetes prevalence was 19.2%. Reducing obesity was associated with a prominent risk reduction for diabetes, from 18% down to 10%. At the same time, smoking increased from 13 to 17% (Table 5).

Table 4. Example #3. Risk factor (RF)2 and outcome (OUT) changes when RF1 declines (%), simulating smoking cessation intervention.

RF1 (Smoking)   RF2 (Overweight/Obesity)   Outcome (Diabetes)
13              56                         18
10              57                         18
8               57                         18
6               57                         19
4               58                         19
2               58                         19
0               58                         19

Table 5. Example #3. Risk factor (RF)1 and outcome (OUT) changes when RF2 declines (%), simulating weight reduction intervention.

RF1 (Smoking)   RF2 (Overweight/Obesity)   Outcome (Diabetes)
13              56                         18
13              50                         17
14              40                         16
15              30                         14
16              20                         13
17              10                         11
17              0                          10

5. DISCUSSION

5.1 Advantages

Our model has three prominent advantages. First, it is novel. To our knowledge, no other population model exists that accounts for the association between risk factors. Second, the model is relatively simple. With only one outcome and two risk factors, the inputs are limited to their population prevalence and their associations with each other. We are currently developing a tool that includes a third risk factor and that can be used for microsimulations, i.e., it outputs a data file of a virtual population, which can be used in further simulations.
Third, the model is freely available online for the community to use and explore.

5.2 Drawbacks

The model is currently limited to two-level exposures and outcomes. It is also limited to only two risk factors. We are currently developing a similar model for three predictors and their interrelations.

Perhaps the most prominent limitation of the model is that it is causally naïve and rigid. Much of the complex methodology toolbox of modern epidemiology is geared towards the identification of causal risk factors [11]. Our model is not helpful in this regard. The associations between risk factors and outcomes are modeled as odds ratios, which are simple measures of strength of association without implying causality or causal direction. The model is also rigid in that the input is reduced to population prevalence and association measures (odds ratios). Within the constraints of these values, the output is not probabilistic but deterministic. However, the model can be run multiple times with different odds ratios as input, drawn from within the range defined by the observed confidence interval.

5.3. Conclusion

In this paper, we present a simple model of disease occurrence in populations. Based on the prevalence of a disease and of two risk factors, and on their associations with the disease and with each other, the model calculates fourfold tables for these associations (Step 1). Thereafter, the population prevalence of either risk factor can be modified to simulate population risk factor increases or decreases, and changes in disease occurrence can be observed (Step 2). We will now develop this model further to include three risk factors and microsimulation capabilities.
In the meantime, we hope it will be helpful to others and would appreciate feedback, preferably in the form of constructive criticism.

Acknowledgements

The following colleagues have contributed to the development of earlier versions of this model: Benjamin Hescott, Inbar Fried, Sadchla Mathieu, Ryan Durgham, Yaa Konama Pokuaa, and Eva Chege. We acknowledge internal support from the TUFTS-Collaborates! Initiative 2014 and Tufts University School of Medicine Chairs’ Initiative for Strategic Research Collaborations 2016.

References

1. Rutter CM, Zaslavsky AM, Feuer EJ. 2011. Dynamic microsimulation models for health outcomes: a review. Med Decis Making. 31(1), 10-18. https://doi.org/10.1177/0272989X10369005
2. Walonoski J, et al. 2017. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J Am Med Inform Assoc.
3. Chen ML, et al. 2011. Infection, oxygen, and immaturity: interacting risk factors for retinopathy of prematurity. Neonatology. 99, 125-32. https://doi.org/10.1159/000312821
4. Hellstrom A, Smith LE, Dammann O. 2013. Retinopathy of prematurity. Lancet. 382(9902), 1445-57. https://doi.org/10.1016/S0140-6736(13)60178-6
5. Holm M, et al. 2017. Systemic Inflammation-Associated Proteins and Retinopathy of Prematurity in Infants Born Before the 28th Week of Gestation. Invest Ophthalmol Vis Sci. 58, 6419-28. https://doi.org/10.1167/iovs.17-21931
6. Hauspurg AK, et al. 2011. Blood gases and retinopathy of prematurity: the ELGAN Study. Neonatology. 99(2), 104-11. https://doi.org/10.1159/000308454
7. Rivera JC, et al. 2017. Retinopathy of prematurity: inflammation, choroidal degeneration, and novel promising therapeutic strategies. J Neuroinflammation. 14(1), 165. https://doi.org/10.1186/s12974-017-0943-1
8. Mc Donald Posso AJ, et al. 2015. Diabetes in Panama: Epidemiology, Risk Factors, and Clinical Management. Ann Glob Health.
81(6), 754-64. https://doi.org/10.1016/j.aogh.2015.12.014
9. United Nations, Department of Economic and Social Affairs, Population Division. 2017. World Population Prospects: The 2017 Revision, DVD Edition.
10. Hulley S, et al. 1998. Randomized trial of estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women. Heart and Estrogen/progestin Replacement Study (HERS) Research Group. JAMA. 280(7), 605-13. https://doi.org/10.1001/jama.280.7.605
11. Glass TA, et al. 2013. Causal inference in public health. Annu Rev Public Health. 34, 61-75.
https://doi.org/10.1146/annurev-publhealth-031811-124606