RESEARCH PAPER

Risk Identification and Prediction for COVID-19 Mortality

Hanh Nguyen a, Qin Shao a

Corresponding author(s): qin.shao@utoledo.edu

a Department of Mathematics and Statistics, College of Natural Sciences and Mathematics, Toledo, Ohio 43614

This paper studies several key metrics for COVID-19 using a public surveillance system data set. It compares two case fatality rates: the naive case fatality rate, which has been frequently mentioned in media outlets, and one which is the sample estimate for the mortality rate. A logistic regression model is applied to model the daily mortality rate. The conclusions are that time, gender, age, and some of their interactions appear to have a significant impact on the mortality rate; that the daily mortality rate has been decreasing since the outbreak; and that males aged 60 or older have been the most vulnerable group. The receiver operating characteristics curve and the area under the curve show that the proposed logistic model predicts the outcome of a reported case well, with an area under the curve as high as 89%. These findings are helpful in assessing the magnitude of the risk posed by the COVID-19 virus to certain groups, predicting outcome severity, and optimally allocating medical resources such as intensive care units and ventilators.

COVID-19 | fatality rate | mortality rate | Logistic regression | receiver operating characteristics curve

Since the outbreak of the coronavirus (COVID-19) pandemic in December 2019 in China, researchers all over the world have been working on understanding the transmission mechanism (6, 7, 11, 29), estimating key metrics for assessing the magnitude of the risk posed by this virus (2, 13, 24, 27), and obtaining information for policy making (5, 8, 17, 26). Case fatality rate (CFR) is one of the key indicators of the severity of an infectious disease.
However, it is challenging to obtain an accurate CFR, as the true case and death counts of an infectious disease are in general unknown. The simplest approach uses the daily naive CFR, which is the death count divided by the case count on day t. The daily naive CFR, denoted by rt, is one of the statistics that numerous organizations and media have been updating based on the latest COVID-19 data. An advantage of rt is that it is computationally straightforward, whereas its major disadvantage is that it is not accurate as a measure of disease severity and is sometimes even misleading. As Ritchie and Roser (21) pointed out, it ignores the time lag between a case being reported and the death occurring. Since the deaths in the numerator are not a subset of the cases in the denominator, the naive CFR does not accurately reflect the severity.

Another daily CFR, denoted by πt, is the ratio of the death count among the cases reported on day t to the case count on day t. Both rt and πt are relative frequencies of deaths and share the same denominator, the case count on day t, but they have different numerators: the numerator of rt is the death count on the same day, while that of πt is the death count among the cases in the denominator. The deaths in the numerator of πt are a subset of the cases in the denominator, although they can happen any time after the case onset dates. This fundamental disparity, which will be examined and elaborated, distinguishes πt from rt as a better description of the disease severity (22).

A daily mortality rate (MR), denoted by pt, is the probability of death from a disease and is another measure of severity. However, the true probability is not observable and is usually estimated by the CFR πt. The relationship between the daily COVID-19 mortality rate and several factors will be modeled using reported death and case counts as well as other relevant information provided by the public surveillance system of the State of Ohio. Shao et al.
(23) considered how much of the mortality rate can be explained by gender and age using the same surveillance system, but they treated MR as constant over time and did not take the change of MR into account. Several public policy measures could have had some impact on the daily counts since the outbreak of COVID-19. For example, the infectiousness of COVID-19 has been changing due to interventions (15), such as social distancing and curfews; it is possible that more and more easily accessible tests have led to large case counts recently; and more and more effective treatments could have been contributing to the reduction of death counts (9). Thus, in the model development for daily MR, time is considered as one of the covariates for the purpose of identifying statistically significant factors based on statistical modeling. In this paper, the proposed model will be utilized to predict the likelihood of mortality for a reported case.

Submitted: 03/23/2021, published: 08/31/2021. Translation@utoledo.edu UTJMS 2021 Vol. 9 39–49 https://orcid.org/0000-0002-9277-4243

There are three goals of this paper: comparing the daily naive CFR rt with the CFR πt, identifying risk factors that impact the daily MR pt using statistical inference, and predicting the probability of mortality for a COVID-19 patient based on the risk factors. All the data analysis is conducted using the State of Ohio COVID-19 surveillance data, which include information about each reported patient, in particular gender, age, onset date, death date, and outcome.
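The difference between the two daily rates described above can be illustrated with a minimal Python sketch. It assumes a line list in which each record holds an onset date and a death date (or none); the toy records, the name `RECORDS`, and the function `daily_rates` are hypothetical and are not part of the Ohio data or of the authors' analysis.

```python
from collections import Counter
from datetime import date

# Hypothetical line-list records: (onset date, death date or None).
# These toy values are illustrative only and are not Ohio surveillance data.
RECORDS = [
    (date(2020, 3, 10), date(2020, 3, 20)),
    (date(2020, 3, 10), date(2020, 3, 11)),
    (date(2020, 3, 10), None),
    (date(2020, 3, 10), None),
    (date(2020, 3, 11), date(2020, 3, 11)),
    (date(2020, 3, 11), None),
]


def daily_rates(records):
    """Return {day: (r_t, pi_t)} for every day with at least one case."""
    # Denominator shared by both rates: cases whose onset is on day t.
    cases = Counter(onset for onset, _ in records)
    # Numerator of r_t: deaths that occur on day t, whatever the onset day.
    deaths_on_day = Counter(death for _, death in records if death is not None)
    # Numerator of pi_t: deaths (at any later time) among the cases of day t.
    deaths_among_cases = Counter(
        onset for onset, death in records if death is not None
    )
    rates = {}
    for day in sorted(cases):
        n_cases = cases[day]
        r_t = deaths_on_day[day] / n_cases
        pi_t = deaths_among_cases[day] / n_cases
        rates[day] = (r_t, pi_t)
    return rates


if __name__ == "__main__":
    for day, (r_t, pi_t) in daily_rates(RECORDS).items():
        print(day, f"r_t={r_t:.2f}", f"pi_t={pi_t:.2f}")
```

In this toy example, rt equals 1.0 on March 11 because a death from a March 10 case is divided by the March 11 case count, whereas πt equals 0.5 on both days. Days with deaths but no new cases are skipped here, since neither rate has a denominator on such days.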
The paper is organized as follows: details about the data and statistical descriptions of several major characteristics are presented; rt and πt are examined and compared; the statistical inference based on logistic regression is described; the findings about the relationship between rt and πt, the statistical inference results about pt, and the application of the model to predicting the death likelihood of a case based on age, gender, and time are elaborated; finally, the paper concludes with a discussion.

Materials and Methods

Data

The raw daily data in the study period, March 10, 2020 to January 31, 2021 inclusive, were downloaded from the State of Ohio COVID-19 dashboard (18). The rows of the raw data set are the records of patients, and the final data set is obtained by deleting all the rows that contain "unknown". Figure 1 shows that the case count increased dramatically until November and then dropped off in the last two months, while the death count did not change much throughout. Table 1 lists the monthly summary for the daily counts. The maximum daily case count jumped suddenly from 4,094 in October to 13,523 in November.

Figure 1. Daily Death and Case Counts, March 10, 2020 - January 31, 2021

Table 1.
Summary of Daily Count Data by Month

| Month | Daily Max (Case) | Daily Max (Death) | Daily Median (Case) | Daily Median (Death) | Daily Mean (Case) | Daily Mean (Death) | Daily Min (Case) | Daily Min (Death) |
|---|---|---|---|---|---|---|---|---|
| March - 2020 | 383 | 39 | 280.5 | 18.0 | 256.0 | 18.9 | 81 | 4 |
| April - 2020 | 2181 | 76 | 475.0 | 49.0 | 583.7 | 50.1 | 281 | 28 |
| May - 2020 | 754 | 57 | 572.0 | 29.0 | 531.2 | 32.2 | 237 | 10 |
| June - 2020 | 1437 | 29 | 622.5 | 15.0 | 694.8 | 16.0 | 249 | 7 |
| July - 2020 | 1877 | 55 | 1329.0 | 27.0 | 1324.4 | 27.0 | 820 | 12 |
| August - 2020 | 1493 | 41 | 1034.0 | 23.0 | 1018.5 | 23.7 | 628 | 5 |
| September - 2020 | 1501 | 48 | 1078.0 | 19.5 | 1025.0 | 21.9 | 621 | 7 |
| October - 2020 | 4094 | 82 | 2286.0 | 44.0 | 2393.5 | 47.1 | 1089 | 15 |
| November - 2020 | 13523 | 280 | 8064.0 | 144.0 | 8009.8 | 148.3 | 3729 | 66 |
| December - 2020 | 11976 | 242 | 8028.0 | 136.0 | 8255.0 | 139.7 | 3057 | 62 |
| January - 2021 | 11038 | 87 | 5539.0 | 28.0 | 5597.4 | 31.2 | 2370 | 8 |

A threshold of 21 days is chosen as the cutoff for survival for two reasons: first, according to the State of Ohio dashboard, a positive case is considered "presumed recovered" once more than 21 days have passed since the symptom onset date; second, according to Figure 2, all the monthly medians of the number of days to death are less than 21 days. In other words, if a patient has not died of COVID-19 by February 21, 2021, he or she is considered to have survived.

For each reported positive case whose onset date is in the study period, define the dichotomous dependent variable y, which is the outcome indicator, as 1 if the patient has died of COVID-19 by February 21, 2021 and 0 otherwise. Age is divided into two groups: the older group (at least 60 years old) and the younger group (from 0 to 59 years old).

Time t is defined as the number of days between the beginning of the study and the onset date on the record of a case, counting the first day as day 1. For example, t = 1 for a case whose onset date was March 10, 2020. The final data set to be analyzed contains n = 898,228 rows, with each row being the record of a positive case, and four columns holding the outcome indicator y and the covariates. The column information is summarized in Table 2.

Table 2.
Columns of Data Set

| Column | Type | Values |
|---|---|---|
| Sex | factor with 2 levels | female, male |
| Age | factor with 2 levels | 0, 1 |
| Time | integer | 1, 2, . . . |
| Outcome | factor | 0, 1 |

Figure 2. Monthly Medians of Days for Deaths

Case Fatality Rates

Table 3 summarizes the total case count, death count, and overall naive CFR of each gender-by-age category up to and including January 31, 2021. The overall naive CFR is 0.0187, and the four gender-by-age groups have very different CFRs: the male older group has the largest CFR, which is 0.0089, and the female younger group has the smallest CFR, which is 0.0005. The odds ratio of these two groups is as large as 17.951. This naive CFR uses a possibly smaller numerator, as the outcomes of the most recent cases are ignored. Moreover, these CFRs are snapshots and do not take time into account.

Table 3. Case Fatality Rates (CFR = Death Count in Each Category / Total Case Count)

| Age | Female Case | Female Death | Female CFR | Male Case | Male Death | Male CFR | Total Case | Total Death | Total CFR |
|---|---|---|---|---|---|---|---|---|---|
| <60 | 364536 | 461 | 0.0005 | 315177 | 693 | 0.0008 | 679713 | 1154 | 0.0013 |
| ≥60 | 118749 | 7590 | 0.0085 | 99766 | 8037 | 0.0089 | 218515 | 15627 | 0.0174 |
| Total | 483285 | 8051 | 0.009 | 414943 | 8730 | 0.0097 | 898228 | 16781 | 0.0187 |

Given that many factors could have impacted the counts, it is reasonable to take t into consideration. The daily CFRs rt and πt are respectively calculated as follows:

$$r_t = \frac{\text{Death Count on Day } t}{\text{Case Count on Day } t}, \qquad \pi_t = \frac{\text{Death Count among Cases on Day } t}{\text{Case Count on Day } t}$$

Logistic Regression for Mortality Rate

Define pt = P(yt = 1), which is the probability of death of a reported case, or the reported case mortality rate, at time t. Hereafter pt and pt(x) will be used interchangeably, with the latter emphasizing the covariates x. The daily cases are separated into four groups according to age and gender.
The reference group includes all the cases who are females younger than 60, and three dummy variables are introduced for the other groups: x1 = 1 for a female case who is at least 60 years old and 0 otherwise; x2 = 1 for a male case who is younger than 60 and 0 otherwise; x3 = 1 for a male case who is at least 60 years old and 0 otherwise. Logistic regression, which is typically implemented to model the relationship between a dichotomous dependent variable and covariates, is applied to y, age, gender, and time. Interested readers can refer to (1) and (16) for comprehensive discussions about the theory and applications of logistic regression.

The full model that includes the covariates and all the interactions between time, age, and gender is considered. Data analysis for logistic regression is carried out using the function glm in R (19), which is a free software environment for statistical computing and graphics. According to the Akaike information criterion, the following model is a good compromise between simplicity and adequacy:

$$\log\frac{p_t(x)}{1 - p_t(x)} = \beta_0 + \sum_{i=1}^{3}\beta_i x_i + \beta_4 t + \beta_5 x_1 t + \beta_6 x_3 t, \qquad [1]$$

where x = (x1, x2, x3, t). For the female younger positive cases, which constitute the reference group, x1 = x2 = x3 = 0 and model [1] becomes:

$$\log\frac{p_t(x)}{1 - p_t(x)} = \beta_0 + \beta_4 t. \qquad [2]$$

It is straightforward to obtain the models for the other age-by-gender groups. For example, for the older female group, model [1] is rewritten as:

$$\log\frac{p_t(x)}{1 - p_t(x)} = \beta_0 + \beta_1 + (\beta_4 + \beta_5)\, t. \qquad [3]$$

From [2] and [3], β1 + β5t is the log odds ratio of the older and younger groups of female cases. Similarly, since the older and younger male groups have log odds β0 + β3 + (β4 + β6)t and β0 + β2 + β4t respectively, (β3 − β2) + β6t is the log odds ratio between the older and younger groups of male cases.

Results

The mathematical difference between rt and πt is the numerator. Unless a death from COVID-19 in the numerator of rt happens on the same day when it is reported as a case, it is not among the case counts in the denominator.
Thus, rt mismatches these counts, which introduces bias, whereas πt pairs the deaths with the cases and is a more reliable indicator of the severity or the death likelihood of a COVID-19 patient. Figure 3 illustrates the difference between the relative frequencies rt and πt.

Figure 3. Case Fatality Rates rt and πt

The distinction was more pronounced in the first 100 days, and πt reached a peak sooner than rt. Not only is there a time lag between rt and πt, but they also display different patterns. In particular, the surge of rt in December did not occur in πt. The peak of πt implies that the early cases were more likely to result in death. Table 4 lists the estimates β̂ = (β̂0, ..., β̂6) of the parameters in model [1].

Table 4. Generalized Linear Model Coefficient Estimates

| β | β̂ | Standard Error | 95% Confidence Interval | p-value |
|---|---|---|---|---|
| β0 | -4.288 | 0.098 | (-4.480, -4.097) | <0.001 |
| β1 | 3.530 | 0.106 | (3.323, 3.737) | <0.001 |
| β2 | 0.523 | 0.060 | (0.405, 0.641) | <0.001 |
| β3 | 3.633 | 0.105 | (3.427, 3.840) | <0.001 |
| β4 | -0.008 | 0.000 | (-0.009, -0.007) | <0.001 |
| β5 | 0.002 | 0.000 | (0.001, 0.002) | <0.001 |
| β6 | 0.002 | 0.000 | (0.001, 0.003) | <0.001 |

There are several interesting observations from Table 4: first, a negative β̂4 entails that the death probabilities of the two younger groups of both genders are decreasing functions of time t; secondly, β̂4 + β̂5 = β̂4 + β̂6 = -0.006 suggests that the death probabilities of the older groups of both genders are also decreasing functions of time t, but that the change is slower than that of the younger groups; thirdly, the MR of the male older group is the largest and that of the female younger group is the smallest; fourthly, for females, the log odds ratio of MR between older and younger is 3.530 + 0.002t, and for males the log odds ratio of MR between older and younger is 3.110 + 0.002t, which implies that the differences become larger and larger; finally, the odds ratio between the largest MR
of the male older group and the smallest MR of the female younger group is an increasing function of t, which is exp(3.633 + 0.002t), and changes, for example, from 37.902 at t = 1 to 68.924 at t = 300.

Model [1] is applied to predict the mortality risk of a case based on age and gender at time t. A large value of p̂t(x) is associated with greater risk. From model [1], it is straightforward to show that the estimate p̂t(x) can be calculated by:

$$\hat{p}_t(x) = \frac{\exp\{\hat{w}_t(x)\}}{1 + \exp\{\hat{w}_t(x)\}}, \qquad [4]$$

where:

$$\hat{w}_t(x) = \hat{\beta}_0 + \sum_{i=1}^{3}\hat{\beta}_i x_i + \hat{\beta}_4 t + \hat{\beta}_5 x_1 t + \hat{\beta}_6 x_3 t$$

with β̂ being the estimates in Table 4. The observed daily CFR πt and the predicted values p̂t in Figure 4 match each other well, and the male older group has had the greatest risk since the outbreak.

Figure 4. Case Fatality Rates πt and Model Predicted Mortality Rates p̂t

A receiver operating characteristics (ROC) curve measures the accuracy of prediction. The higher a ROC curve is above the reference line y = x, the larger the prediction power. In other words, the closer the curve bends toward the point (0, 1), the more accurate the prediction using the model is.

Figure 5. Receiver Operating Characteristics Curve Based on Equation [4]

Figure 5 is the receiver operating characteristics curve based on equation [4]. As early as 1966, Green and Swets (10) systematically introduced ROC curves and their applications. Figure 5 shows the prediction power of model [1]: the curve, shown with its 95% confidence interval, lies well above the reference line y = x.

Another measure of prediction power is given by the area under the curve (AUC). The AUC of model [1] is 89%, which is close to one (100%), the largest possible value of AUC, and its 95% confidence interval is (88.67%, 89.34%). Thus, both the ROC curve and the AUC indicate that the logistic regression model [1] is a powerful tool for prediction.
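As a rough numerical check of equation [4] and of the odds ratio exp(3.633 + 0.002t), the calculation can be sketched in Python. The sketch plugs in the rounded estimates printed in Table 4, so its results are approximate and may differ slightly from values obtained with the unrounded fitted coefficients; the function names are illustrative assumptions, not the authors' R code.

```python
import math

# Rounded estimates (beta_0, ..., beta_6) copied from Table 4.
# Using rounded values is an assumption of this sketch; the fitted
# model itself carries more decimal places.
BETA = (-4.288, 3.530, 0.523, 3.633, -0.008, 0.002, 0.002)


def dummies(male, older):
    """Dummy variables (x1, x2, x3); younger females are the reference."""
    x1 = 1 if (not male and older) else 0   # older female
    x2 = 1 if (male and not older) else 0   # younger male
    x3 = 1 if (male and older) else 0       # older male
    return x1, x2, x3


def linear_predictor(male, older, t):
    """w_t(x): the right-hand side of model [1] at the Table 4 estimates."""
    b0, b1, b2, b3, b4, b5, b6 = BETA
    x1, x2, x3 = dummies(male, older)
    return b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * t + b5 * x1 * t + b6 * x3 * t


def predicted_mortality(male, older, t):
    """Equation [4]: inverse logit of the linear predictor."""
    w = linear_predictor(male, older, t)
    return math.exp(w) / (1.0 + math.exp(w))


def odds_ratio_older_male_vs_younger_female(t):
    """exp(beta_3 + beta_6 * t), obtained as a difference of log odds."""
    diff = linear_predictor(True, True, t) - linear_predictor(False, False, t)
    return math.exp(diff)


if __name__ == "__main__":
    for t in (1, 300):
        print(t, round(odds_ratio_older_male_vs_younger_female(t), 3))
    print(predicted_mortality(male=True, older=True, t=1))
    print(predicted_mortality(male=False, older=False, t=1))
```

Computing the odds ratio as a difference of two linear predictors, rather than hard-coding β3 + β6t, keeps the sketch tied to model [1], and the same pattern gives the log odds ratio between any two groups.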
Discussion

The case fatality rate is one of the metrics that assess the severity of an infectious disease. The daily naive CFR rt is constantly updated despite the fact that it is biased. According to the comparison based on the State of Ohio COVID-19 surveillance data, although rt and πt are different at the beginning of the COVID-19 outbreak, they share a common declining overall trend and indicate the same most and least vulnerable groups. Therefore, rt is informative despite its bias.

In the study, age, gender, and time appear to be statistically significant in determining the likelihood of death for a case. In particular, the group of males aged 60 or older has been the most vulnerable, which is consistent with a CDC recommendation. Moreover, the model that includes time, age, and gender provides a relatively high prediction accuracy as measured by the ROC curve and AUC. These findings are helpful in predicting outcome severity of certain groups and optimally allocating medical resources such as ICUs and ventilators.

This study has several limitations. First, our study relies on the Ohio surveillance data, and thus ignores unreported counts, such as asymptomatic patients. Secondly, outbreaks in clusters could have exaggerated the contagiousness. For example, many reported cases in nursing homes could have resulted in an inflated total of reported case and death counts, given that deaths in nursing homes in Ohio were about 32% of the total deaths by January 28, 2021 (25). Some research has been conducted for the purpose of estimating the society CFR. For example, Reich et al. (20) estimated death counts using log linear models by taking an incomplete reporting system into account; the work of Bendavid et al. (3) and Havers et al. (12) attempted to estimate the society CFR, in particular for COVID-19, by sampling the population in certain geographical regions.
Thirdly, although the logistic model [1] can explain the data reasonably well and shows strong power for prediction, pre-existing health conditions or comorbidities may be linked to the mortality rate and could improve model performance if such information were included. For example, Xu et al. (28) and Li et al. (14) studied how comorbidity contributed to the severity of COVID-19 patients' outcomes in China. Lastly, the prediction power could be enhanced if the record of some typical symptoms of each patient were accessible (4).

Conclusion

The proposed analysis procedure can be applied to similar COVID-19 data. For example, the national counterpart of rt in Figure 6 exhibits the same changing pattern as that of the State of Ohio in Figure 3, and it is reasonable to conjecture that the proposed logistic regression is useful for modeling the national counterparts of πt and pt, which could be a future research project if such information were available.

Figure 6. Case Fatality Rates rt of the United States, March 10, 2020 - January 31, 2021

Conflict of interest

Authors declare no conflict of interest.

Authors' contributions

HN performed data processing and data analysis; QS reviewed literature and provided the significance of the research from the public health perspective. Both authors participated in revision of the manuscript, read and approved the final document.

References

1. Agresti A (2002) Categorical Data Analysis, 2nd ed. New Jersey: Wiley-Interscience.
2. Angelopoulos AN, Pathak R, Varma R, Jordan MI (2020) On identifying and mitigating bias in the estimation of the COVID-19 case fatality rate. Harvard Data Science Review [Internet] 2020 Jul 16; Available from: https://hdsr.mitpress.mit.edu/pub/y9vc2u36
3. Bendavid E, Mulaney B, Sood N, et al. (2020) COVID-19 Antibody Seroprevalence in Santa Clara County, California. medRxiv, 2020.04.14.20062463.
4. Bertsimas D, Boussioux L, Cory-Wright R, et al.
(2021) From predictions to prescriptions: A data-driven response to COVID-19. Health Care Management Science 24(2): 253-272.
5. Bundgaard H, Bundgaard JS, Raaschou-Pedersen DET, et al. (2021) Effectiveness of Adding a Mask Recommendation to Other Public Health Measures to Prevent SARS-CoV-2 Infection in Danish Mask Wearers: A Randomized Controlled Trial. Annals of Internal Medicine 174: 335-343.
6. Chen N, Zhou M, Dong X, et al. (2020) Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study. The Lancet 395: 507-513.
7. Chen Y, Liu Q, Guo D (2020) Emerging coronaviruses: genome structure, replication, and pathogenesis. Journal of Medical Virology 92: 418-423.
8. Chiu WA, Fischer R, Ndeffo-Mbah ML (2020) State-level needs for social distancing and contact tracing to contain COVID-19 in the United States. Nature Human Behaviour 4: 1080-1090.
9. Cortegiani A, Ingoglia G, Ippolito M, et al. (2020) A systematic review on the efficacy and safety of chloroquine for the treatment of COVID-19. Journal of Critical Care 57: 279-283.
10. Green DM, Swets JA (1966) Signal Detection Theory and Psychophysics. New York: John Wiley and Sons.
11. Guan W, Ni Z, Hu Y, et al. (2020) Clinical characteristics of coronavirus disease 2019 in China. The New England Journal of Medicine 382: 1708-1720.
12. Havers FP, Reed C, Lim T, et al. (2020) Seroprevalence of antibodies to SARS-CoV-2 in 10 sites in the United States, March 23-May 12, 2020. JAMA Intern Med 180: 1576-1586.
13. Kobayashi T, Jung S, Linton MN, et al. (2020) Communicating the risk of death from novel coronavirus disease (COVID-19). Journal of Clinical Medicine 9: 580.
14. Li X, Xu S, Yu M, et al. (2020) Risk factors for severity and mortality in adult COVID-19 inpatients in Wuhan. The Journal of Allergy and Clinical Immunology 146: 110-118.
15. Liu Y, Gayle AA, Wilder-Smith A, et al.
(2020) The reproductive number of COVID-19 is higher compared to SARS coronavirus. Journal of Travel Medicine 27: 1-4.
16. McCullagh P, Nelder JA (1989) Generalized Linear Models, 2nd ed. Boca Raton: CRC.
17. Mizumoto K, Chowell G (2020) Estimating risk for death from coronavirus disease, China, January-February 2020. Emerging Infectious Diseases 26: 1251-1256.
18. Ohio Department of Health COVID-19 Dashboard. https://coronavirus.ohio.gov/wps/portal/gov/covid-19/dashboards
19. R Core Team (2020) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
20. Reich NG, Lessler J, Cummings DAT, et al. (2012) Estimating absolute and relative case fatality ratios from infectious disease surveillance data. Biometrics 68: 598-606.
21. Ritchie H, Roser M (2020) What do we know about the risk of dying from COVID-19? https://ourworldindata.org/covid-mortality-risk
22. Ritchie H, Ortiz-Ospina E, Beltekian D, et al. (2020) Mortality risk of COVID-19. https://ourworldindata.org/mortality-risk-covid
23. Shao Q, Thompson G, Thompson A (2020) COVID-19 risk factor identification based on Ohio data. Translation: The University of Toledo Journal of Medical Sciences 8: 6-14.
24. Shen CY (2020) Logistic growth modelling of COVID-19 proliferation in China and its international implications. International Journal of Infectious Diseases 96: 582-589.
25. The COVID-19 Tracking Project. https://covidtracking.com/data/state/ohio/long-term-care
26. Zhou Y, Wang L, Zhang L, et al. (2020) A spatiotemporal epidemiological prediction model to inform county-level COVID-19 risk in the United States. Harvard Data Science Review [Internet] 2020 Aug 6; Available from: https://hdsr.mitpress.mit.edu/pub/qqg19a0r
27. Xu C, Dong Y, Yu X, et al. (2020) Estimation of reproduction numbers of COVID-19 in typical countries and epidemic trends under different prevention and control scenarios. Frontiers of Medicine 14: 613-622.
28.
Xu K, Zhou M, Yang D, et al. (2020) Application of ordinal logistic regression analysis to identify the determinants of illness severity of COVID-19 in China. Epidemiology and Infection 148(e146): 1-11.
29. Zhang Q, Bastard P, Liu Z, et al. (2020) Inborn errors of type I IFN immunity in patients with life-threatening COVID-19. Science 370(6515): eabd4570.