Using a Machine Learning Algorithm to Predict Online Patient Portal Utilization: A Patient Engagement Study

OJPHI

Ahmed U. Otokiti,1 Colleen M. Farrelly,2 Leyla Warsame,3 Angie Li4

1. Icahn School of Medicine at Mount Sinai Hospital, Internal Medicine and Informatics Department, New York, NY 10029, USA
2. Staticlysm, LLC, Palmetto Bay, FL 33157, USA
3. Geisinger Health Systems, Internal Medicine and Clinical Informatics Department, Danville, PA 17821, USA
4. University at Buffalo, Department of Biomedical Informatics, Buffalo, NY 14203, USA

Abstract

Objective: There is a low rate of online patient portal utilization in the U.S. This study aimed to utilize a machine learning approach to predict access to online medical records through a patient portal.

Methods: This is a cross-sectional predictive machine learning algorithm-based study of Health Information National Trends Survey (HINTS) datasets (Cycles 1 and 2; 2017-2018 samples). Survey respondents were U.S. adults (≥18 years old). The primary outcome was a binary variable indicating that the patient had or had not accessed online medical records in the previous 12 months. We analyzed a subset of independent variables using k-means clustering with replicate samples. A cross-validated random forest-based algorithm was utilized to select features for a Cycle 1 split training sample. A logistic regression and an evolved decision tree were trained on the rest of the Cycle 1 training sample. The Cycle 1 test sample and Cycle 2 data were used to benchmark algorithm performance.

Results: Lack of access to online systems was less of a barrier to online medical records in 2018 (14%) compared to 2017 (26%). Patients accessed medical records to refill medicines and message primary care providers more frequently in 2018 (45%) than in 2017 (25%).
Discussion: Privacy concerns, portal knowledge, and conversations between primary care providers and patients predict portal access.

Conclusion: Methods described here may be employed to personalize methods of patient engagement during new patient registration.

Abbreviations: American Medical Informatics Association (AMIA), area under the curve (AUC), body mass index (BMI), electronic health record (EHR), Health Information National Trends Survey (HINTS), information technology (IT), National Cancer Institute (NCI), Veteran health (VA)

Correspondence: ahmedotoks@yahoo.com*

DOI: 10.5210/ojphi.v14i1.12851

Copyright ©2022 the author(s)

This is an Open Access article. Authors own copyright of their articles appearing in the Online Journal of Public Health Informatics. Readers may copy articles without permission of the copyright owner(s), as long as the author and OJPHI are acknowledged in the copy and the copy is used for educational, not-for-profit purposes.

Introduction

Patient engagement is a set of behaviors that foster active patient involvement in care, thereby increasing motivation and self-determination to become an active player in the healthcare journey [1]. These behaviors increase compliance, improve health outcomes and overall public health, and reduce cost [1-3]. Health IT solutions can serve as a means to increase patient engagement, as online patient portals have been shown to increase patient engagement and personalized care [4,5]. Online patient portals are web-based applications tethered to a patient’s EHR that allow secure access to health data. Through this portal, patients can view lab results, medication history, and discharge summaries, and they can securely message their physicians, request prescription refills, and schedule appointments [6].
The meaningful use Stage 2 incentive mandated by the Health Information Technology for Economic and Clinical Health Act in 2009 was a significant driver of the increase in patient portal offerings by healthcare institutions across the nation [7]. Despite the significant investment in online portals, these sites continue to experience a low rate of adoption and use by patients, which hinders the potential benefits of patient engagement and its public health impact [8-10]. The most significant positive factors associated with higher portal use include higher education level, female gender, Caucasian ethnicity, Internet access, higher income, patients not on Medicaid insurance, and patient trust in the healthcare provider and system [7,8,11]. The most significant negative factors associated with lower patient portal use include privacy and security concerns and poor user friendliness [12,13].

Machine learning is gaining popularity in healthcare due to the ability of this method to process complex nonlinear relationships between predictors and yield stable predictions [14]. This approach has been used to predict outbreaks, suicide risk among Army personnel, and intrusion detection within EHR systems [15,16]. Several prior studies have analyzed patient behavior regarding health technology usage and its impact on patient health [17-21]. One study that employed the random forest algorithm found that health-related Internet searches predicted patient healthcare utilization [22]. These findings suggest that understanding patient interactions with medical technology may help providers offer better care and be proactive in making decisions about online patient engagement tools.

This study aimed to determine, at patient registration, which patients are most likely to utilize online portals and to build a predictive model that could be used to create a short survey for real-time decision support.
As interaction terms likely exist between factors and because the model is high dimensional, we chose to use machine learning models to parse out factors and groups of factors associated with online portal utilization. Patients who opt for a technology-based platform may benefit from other types of engagement with technology beyond patient portals, including text messaging or automated calls. Machine learning algorithms can identify important predictors of portal usage, as well as provide robust predictions to flag those most likely to benefit from portal usage versus those who may engage better with alternative channels. To our knowledge, no previous studies have utilized a machine learning algorithm to predict which patients will utilize patient portals as a patient engagement tool.

Data from HINTS were used for our analysis. HINTS is a nationally representative survey that has been administered by the NCI since 2003 [23]. The HINTS survey and data collection program was set up to monitor changes in the rapidly evolving field of health communication. It collects nationally representative data about the public's use of cancer-related information and serves as a test bed for researchers to evaluate new theories in health information and communication. The data can also be used to help understand how adults use different communication channels to obtain health information [23,24]. Two cycles of HINTS data were utilized in our analysis: HINTS-5 Cycle 1 (2017) and HINTS-5 Cycle 2 (2018). Although HINTS is funded by the NCI with the primary goal of evaluating health communication theories in cancer patients, only 504 individuals out of 3,285 survey participants (15.3%) in HINTS 5 Cycle 1 were diagnosed with cancer [25].

Materials and Methods

Study Design and Setting

This was a predictive analytic study using data from two iterations of the HINTS survey.
The survey for both HINTS cycles utilized in this study was disseminated via mail to the participants. More information on the survey mailing protocol, data collection, data cleaning/editing, and handling of incomplete/invalid data can be found on the NCI HINTS website [23].

Study Participants

Survey respondents were sampled from the U.S. population (≥18 years old). A two-stage sampling method was utilized: stage one was a stratified sample of residential addresses, and stage two was the selection of one adult from each sampled residential address. The same sampling methodology was utilized for HINTS 5 Cycles 1 and 2. More information about the sampling methodology of the HINTS survey can be found on the NCI HINTS website [23]. The sample sizes of both iterations were as follows: HINTS 5 Cycle 1: of the 3,285 respondents, 97% of the surveys were completely filled out (November 2017); HINTS 5 Cycle 2: of the 3,504 total respondents, 98% of the surveys were completely filled out (November 2018). These two iterations were chosen because they were the most recent at the time of our analysis, had uniformity of survey collection, and had captured similar variables of interest.

Study Variables

Target Variable/Outcome Variable

Access to online medical records or patient portals was the target or outcome variable. The survey question was, “How many times did you access your medical record in the past 12 months?” (HINTS 5 Cycle 1: question D4; HINTS 5 Cycle 2: question D6). We recoded the response as >1 for “accessed online medical record” and <1 for “did not access online medical record.”

Labels/Predictor Variables

A total of 51 initial predictor variables were included based on domain knowledge and a literature search of previously identified significant determinants of online portal use (Table 1) [5,7,13,26,27]. The following variables were re-coded:
a.
BMI re-coded to an ordinal variable from a continuous variable for clinical significance (BMI: <25, 25-30, 30-40, >40)
b. Chronic medical condition: any one of the following: hypertension, heart condition, lung disease, and arthritis
c. Anxiety/depression: re-coded as an independent variable

A random forest-based Boruta method was used for variable selection after initial variable inclusion based on domain knowledge and a literature review; a total of 39 variables were selected (Table 1).

Table 1: Variables

Initial variables before the Boruta algorithm

Demographics
1. Age
2. Education level
3. Race/ethnicity
4. Marital status
5. Occupation status
6. English language proficiency
7. Sexual orientation
8. Total persons in the household
9. Gender
10. Rent or own a house
11. Annual household income
Looking for Health Information
12. Trust health information from newspapers/magazines
13. Trust health information from the Internet
14. Trust health information from charitable organizations
15. Trust health information from television
16. Trust health information from religious organizations
17. If there were a strong need to get information about your health, where would you go first?
Overall Health
18. In general, what is your state of health?
19. Body mass index
20. Chronic medical conditions: diabetes mellitus, hypertension, heart disease, lung disease, rheumatologic
21. Chronic medical condition: depression/anxiety
Your Healthcare
22. Health insurance from employer?
23. Health insurance bought directly from insurance company?
24. Medicare
25. Medicaid
26. Military healthcare/TRICARE
27. VA
28. Indian health services
29. Health insurance other
Medical Research and Records
30. Who offered you online access to your medical records: healthcare provider?
31. Who offered you online access to your medical records: insurance company?
32. How many times have you accessed your online medical record in the last 12 months?
33. How confident are you about the safety and confidentiality of your electronic medical record?
34. Have you ever kept information from your healthcare provider because of privacy concerns?
Internet Use
35. Internet use through broadband
36. Internet use through a cellular network
37. Internet use through a wireless network
38. Internet use through a computer at home
39. Internet use through a computer at work
40. Internet use on a mobile device (cell phones, tablet, etc.)
41. In the past 12 months, have you looked for medical information for yourself?
42. In the past 12 months, have you used the Internet to communicate with a healthcare provider’s office?
43. In the past 12 months, have you used the Internet to view your test results?
44. Do you have a tablet?
45. Do you have a smart phone?
46. Do you have a wellness app on your phone or tablet?
47. Has your tablet or smartphone helped you make health decisions?
48. In the last 12 months, have you used other electronic devices to monitor your health?
49. Have you visited a social networking site in the last 12 months?
50. Have you watched a health-related video on YouTube in the last 12 months?
51. Have you sent or received a text message from a healthcare provider in the last 12 months?

Final variables analyzed after the Boruta algorithm

Demographics
1. Age
2. Education level
3. Race/ethnicity
4. Occupation status
5. English language proficiency
6. Sexual orientation
7. Annual household income
Looking for Health Information
8. Trust health information from newspapers/magazines
9. Trust health information from the Internet
10. Trust health information from charitable organizations
11. Trust health information from religious organizations
12. If there were a strong need to get information about your health, where would you go first?
Your Healthcare
13. Health insurance from employer?
14. Health insurance bought directly from insurance company?
15. Medicare
16. Medicaid
17. Military healthcare/TRICARE
18. Indian health services
Medical Research and Records
19. Who offered you online access to your medical records: healthcare provider?
20. Who offered you online access to your medical records: insurance company?
21. How many times have you accessed your online medical record in the last 12 months?
22. How confident are you about the safety and confidentiality of your electronic medical record?
23. Have you ever kept information from your healthcare provider because of privacy concerns?
Internet Use
24. Internet use through broadband
25. Internet use through a cellular network
26. Internet use through a wireless network
27. Internet use through a computer at home
28. Internet use through a computer at work
29. Internet use on a mobile device (cell phones, tablet, etc.)
30. In the past 12 months, have you looked for medical information for yourself?
31. In the past 12 months, have you used the Internet to communicate with a healthcare provider’s office?
32. In the past 12 months, have you used the Internet to view your test results?
33. Do you have a tablet?
34. Do you have a smart phone?
35. Do you have a wellness app on your phone or tablet?
36. Has your tablet or smartphone helped you make health decisions?
37. Have you visited a social networking site in the last 12 months?
38. Have you watched a health-related video on YouTube in the last 12 months?
39. Have you sent or received a text message from a healthcare provider in the last 12 months?
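The re-coding rules listed before Table 1 can be sketched in a few lines. This is a minimal illustration in Python (the study's own analysis was done in R), not the authors' code. The field names, the integer codes, and the handling of the boundary values 25, 30, and 40 are assumptions, since the text does not specify them.

```python
# Hypothetical field names for the conditions listed in re-coding rule (b).
CHRONIC_CONDITIONS = ("hypertension", "heart_condition", "lung_disease", "arthritis")


def recode_bmi(bmi):
    """Map a continuous BMI to the four ordinal bands named in the text.

    Assumption: 25 and 30 open the next band and 40 stays in the 30-40 band;
    the paper does not say where boundary values fall.
    """
    if bmi < 25:
        return 0  # <25
    if bmi < 30:
        return 1  # 25-30
    if bmi <= 40:
        return 2  # 30-40
    return 3      # >40


def chronic_flag(respondent):
    """Return 1 if the respondent reports any listed chronic condition, else 0."""
    return int(any(respondent.get(name, False) for name in CHRONIC_CONDITIONS))


# Toy respondent, invented for illustration only.
example = {"bmi": 31.4, "hypertension": False, "arthritis": True}
bmi_band = recode_bmi(example["bmi"])
has_chronic = chronic_flag(example)
```

Keeping the bands as ordered integers preserves the ordinal structure for tree-based models, which split on thresholds.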
Machine Learning Approach/Statistical Analysis

Since there are known limitations for some statistical algorithms and notable issues with the reproduction or generalization of clinical and social science study results, we decided to use more robust methodologies, including multiple supervised machine learning approaches. We also used Cycle 2 as a replication population against which to compare our initial Cycle 1 results to ensure replicability across populations [28,29]. Thus, Cycle 1 was partitioned for use in variable selection, model training, and initial testing of the trained models, and Cycle 2 was saved for replication of Cycle 1 test sample results. Often, especially with linear regression, either only one data collection step is used to validate a model, leading to generalization problems on other data sets collected from similar populations, or the model is trained on one population and tested on another. Both are statistically problematic ways to create a model [29]. One study applied multiple-sampling approaches with pooling (the approach adopted in this study) and was able to replicate >90% of the problematic samples noted in one of the prominent replication studies suggesting that most clinical paper results do not generalize properly [29].

Unsupervised Learning

To determine which subgroups of patients did not choose to access online health records, we clustered patients who did not access online health records on two variables (NotAccessed_ConcernedPrivacy and NotAccessed_NoInternet) using k-means clustering on the data from Cycle 1 and Cycle 2. The number of clusters was determined using the elbow method on both Cycles [30]. Results were compared between Cycles to understand how behaviors changed over time.
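The elbow criterion can be illustrated with a small, self-contained sketch. The code below is a plain Python implementation of Lloyd's k-means written for illustration, not the authors' R code. The toy data place 100 hypothetical respondents according to the Cycle 1 percentages reported in Table 2, with each respondent encoded as an assumed (privacy concern, no Internet access) pair of 0/1 flags. The function name and the seeding scheme are likewise assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_wss(points, k, n_iter=100, seed=0):
    """Run Lloyd's k-means and return the within-cluster sum of squares.

    Centroids are seeded from distinct observed points, so k must not
    exceed the number of distinct points.
    """
    rng = random.Random(seed)
    centroids = rng.sample(sorted(set(points)), k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Toy data: (privacy_concern, no_internet) flags for 100 hypothetical
# respondents, proportioned like the Cycle 1 column of Table 2.
points = [(1, 0)] * 17 + [(1, 1)] * 10 + [(0, 0)] * 57 + [(0, 1)] * 16

wss = {k: kmeans_wss(points, k) for k in range(1, 5)}
# Share of total variance explained by each k; the elbow is where gains flatten.
explained = {k: 1 - wss[k] / wss[1] for k in wss}
```

Because two binary variables yield at most four distinct points, k = 4 places one centroid on each combination and explains all of the variance, which is why a four-cluster solution on such data simply enumerates the combinations.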
To identify the types of records accessed by patients who did choose to access online health records, we clustered four variables (RecordsOnline_RefillMeds, RecordsOnline_RequestCorrection, RecordsOnline_MessageHCP, and RecordsOnline_AddHealthInfo) on record-accessing patients from Cycles 1 and 2. The main groups that appeared in both clustering results were compared across Cycle 1 and Cycle 2 to understand how usage changed over time.

Supervised Learning

We used stratified sampling to split Cycle 1 data into three parts: variable selection training sample, model training sample, and model test sample. To select variables, we used the Boruta algorithm, which statistically tests random forest variable importance to select statistically significant variables (threshold set to p<0.05). This allowed us to identify main effects as well as interaction terms related to our outcome. To identify main effects and interaction terms separately for clinical evaluation, we fit two supervised learning models in R (logistic regression for main effects and an evolved tree model for complex interaction effects) [32,33]. The evolved tree model, fit using evtree in R, allowed us to visualize complex interaction terms that are common in medical data. We then evaluated our logistic regression model and evolved tree model on the Cycle 1 test set by measuring the AUC, false positive and false negative rates, and accuracy. For the logistic regression model only, we also used the Akaike information criterion, which balances goodness of fit against the number of variables included in the model [34]. Evaluation was replicated on the Cycle 2 sample to assess reproducibility of model performance across time periods.
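The evaluation measures named above can be written out explicitly. The following is a minimal Python sketch (the study itself used R), with invented toy labels and scores and an assumed 0.5 classification threshold. It computes AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative case, along with accuracy, false positive rate, false negative rate, and the standard AIC formula, 2k minus twice the log-likelihood.

```python
def auc(y_true, scores):
    """AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_rates(y_true, y_pred):
    """Accuracy, false positive rate, and false negative rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


# Toy data, invented for illustration: 1 = accessed records, 0 = did not.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 threshold

test_auc = auc(y_true, scores)            # 0.9375
rates = confusion_rates(y_true, y_pred)   # accuracy 0.75, FPR 0.25, FNR 0.25
```

Computing the same measures on both the Cycle 1 test sample and the Cycle 2 sample is what allows a drop in AUC between cycles to be read as a loss of generalization rather than a change in metric.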
Results

Variable Selection

After the five runs of the Boruta algorithm on our first Cycle 1 sample, we looked at which variables were not selected by any of the selection runs and identified the following: MaritalStatus, TotalHousehold, SelfGender, RentOrOwn, TrustTelevision, GeneralHealth, BMIOver25, ChronicMedicalFlag, MedConditions_Depression, HealthIns_VA, HealthIns_Other, and OtherDevTrackHealth. These were discarded from the subsequent training and test sets (Figure 1).

Figure 1: Variable output of the Boruta algorithm

Unsupervised Learning Results

For patients who did not access their medical records, the k-means model for both Cycle 1 and Cycle 2 selected the optimum number of clusters as 4 (all possible combinations of the two variables, giving 100% of the variance accounted for in the k-means models). Access issues generally decreased between Cycle 1 and Cycle 2, suggesting that access to the Internet declined as a significant barrier to usage over time (Table 2).

Table 2: Unsupervised learning results: k-means model for both Cycle 1 and Cycle 2 for those who did not access their medical records

Not Accessed Subgroup | Cycle 1 Percent | Cycle 2 Percent
Privacy Only | 17% | 13%
Access and Privacy | 10% | 4%
Other | 57% | 74%
Access Only | 16% | 10%

For patients who accessed their online records, the optimal clustering for Cycle 1 included 5 clusters (~60% of variance accounted for), with major groups including a large subset of patients who mainly refilled medication and messaged primary care providers, a small subset who performed every task online, and a large subset that rarely used online portals for any tasks. The best k-means model for Cycle 2 comprised six cluster groups, including three groups of interest from Cycle 1 results.
The proportion of patients who refilled medications and messaged primary care providers increased dramatically between cycles, suggesting that this is a common use of online medical records (Table 3).

Table 3: Unsupervised learning results: k-means model for both Cycle 1 and Cycle 2 for those who accessed their online medical records

Main Accessed Subgroup | Cycle 1 Percent | Cycle 2 Percent
Rare Usage | 49% | 35%
Refill Meds and Message Primary Care | 25% | 45%
Every Task Online | 3% | 4%

Supervised Learning Results

For the logistic regression model, we found that the model selected in Cycle 1 training data did not generalize to Cycle 2 data (with the test set AUC falling from 84% to 55% between cycles). Thus, we discarded the logistic regression results as neither reproducible nor useful as a clinical decision model. However, the evolved tree model (Figure 2) was reproducible between cycles, with AUC falling marginally from 85% to 81% (Figure 3 shows the AUC for Cycle 1 data and Figure 4 the AUC for Cycle 2 data). Significant predictors of online portal usage, according to this model, included privacy concerns, a proactive offering of access to online portals by primary care providers, and prior use of the portal to check test results (Figure 2).

Figure 2: Decision tree diagram of the supervised learning method

Figure 3: Evolved tree model AUC for cycle 1

Figure 4: Evolved tree model AUC for cycle 2

Discussion

This study sought to identify predictive factors that determine online patient portal use using machine learning methodology.
We found that previous use of online portals is a positive predictor of online portal usage, as is the offering of online portals by primary care providers. The 2009 Health Information Technology for Economic and Clinical Health Act and meaningful use facilitated the creation and availability of online patient portals; however, there has been a low adoption rate among patients. Studies have shown that, although organizations have created portals and provided patients with log-in information, patients did not utilize the portals. However, providers who encourage portal use by tasking patients with items to complete or helping patients with the initial log-in improve usage [9,35]. Irizarry et al. (2015) found that provider endorsement and engagement with patient portals positively affected patient portal utilization [5,7].

Privacy concerns are a negative predictor of patient portal use [12,13]. News of recent data breaches does little to instill confidence in how institutions protect health information and how accessible it is to unauthorized entities [36]. Communicating institutional safety measures to secure patient privacy could improve patient trust [37]. Anthony et al. (2018) recommended that providers play a role in improving trust in portals by addressing privacy concerns directly with patients [9].

Our study indicates that access to the Internet is not as significant a barrier as described in previous studies. The AMIA released a statement in 2018 that “broadband access is or will become a social determinant of health” [38]; however, with greater access to smartphones, a socioeconomic divide in Internet access is no longer a strong predictor of portal use [8,39]. Additionally, other populations, such as seniors, now have improved Internet access [8]. Nambisan (2017) postulated that use of the Internet for health information seeking is a better predictor of portal use than access to the Internet [40].
However, even with the minimal digital divide, health literacy, computer literacy, and care preferences may continue to represent barriers to patient portal utilization [7,39].

According to our study, online portals are most commonly used to refill medications and message primary care providers. Patel et al. showed that more than half of patients who access their online portals use them to perform health-related tasks and to communicate with their healthcare providers [9]. The Institute of Medicine identified patient-provider communication as a core focus in improving patient outcomes. Secure messaging augments clinical encounters by providing asynchronous communication between providers and patients [41].

Our study has shown that the prospect of utilizing a machine learning model to predict patient engagement via patient portals is promising. This technique may be scaled up to a clinical decision support tool as a user-friendly web interface or app to predict IT engagement patterns for clinic registration of new patients. Further research and validation of the model in a real ambulatory setting are necessary prior to implementation of such a tool.

Limitations

Although this study is a novel attempt to implement a machine learning approach for patient portal utilization, including the clustering method that provided additional insight, it is not without limitations. First, the cross-sectional design of the HINTS survey does not allow inferences of causality. Second, the variables in the survey are subject to individual interpretations of the survey questions by the respondents in addition to any response bias that may be present. Limitations of k-means clustering include its sensitivity to outliers and its assumptions that groups are even-sized and non-overlapping. Most real-world data will violate these assumptions to some extent.
In addition, evolved trees are generally not the most stable learners; therefore, it is possible that other tree models can be used. However, our results were consistent across partitions of data, and statistical testing on the validation sample confirmed that the model was robust.

Conclusions

The tree model produced more consistent prediction accuracy across cycles than the regression model. It also identified privacy and data protection concerns (negative predictors) and proactive patient portal access offering by physicians (positive predictors) as the most significant determinants of patient portal use. Our unsupervised learning algorithm identified a fairly consistent cluster of patients who did not use online portals due to privacy concerns across both cycles of data. Among patients who used online portals, there was a consistent cluster of patients across cycles that used the online portal for medication refills and to message their primary care provider. Our results showed that machine learning algorithms can be used to identify factors associated with online portal use. These methods may be employed in a clinical decision support tool during new patient registration to personalize methods of patient engagement. The variables identified by our model corresponded with the characteristics of online portal users identified by previous studies [5,8]. We recommend asking patients about privacy concerns and proactively offering patients a way to access their records online or providing an alternative (text messaging, automated call, etc.) based on their response to questions asked during registration.

Financial Disclosure

No Financial Disclosures.

Competing Interests

No Competing Interests.

Data Availability

The data sets used and analyzed for the study are available for free on the U.S. Department of Health and National Cancer Institute website: https://hints.cancer.gov/

References

1. Graffigna G, Barello S, Bonanomi A, Riva G. 2017.
Factors affecting patients’ online health information-seeking behaviours: The role of the Patient Health Engagement (PHE) Model. Patient Educ Couns. 100(10), 1918-27. Epub May 2017. doi:https://doi.org/10.1016/j.pec.2017.05.033.
2. Laurance J, Henderson S, Howitt PJ, Matar M, Al Kuwari H, et al. 2014. Patient engagement: four case studies that highlight the potential for improved health outcomes and reduced costs. Health Aff (Millwood). 33(9), 1627-34. doi:https://doi.org/10.1377/hlthaff.2014.0375.
3. James J. "Health Policy Brief: Patient Engagement," Health Affairs [Internet]. 2013 February [cited April 23, 2021] Available from: https://www.healthaffairs.org/do/10.1377/hpb20130214.898775/full/healthpolicybrief_86.pdf
4. Reed ME, Huang J, Brand RJ, Neugebauer R, Graetz I, et al. 2019. Patients with complex chronic conditions: Health care use and clinical events associated with access to a patient portal. PLoS One. 14(6), e0217636. doi:https://doi.org/10.1371/journal.pone.0217636.
5. Irizarry T, DeVito Dabbs A, Curran CR. 2015. Patient portals and patient engagement: a state of the science review. J Med Internet Res. 17(6), e148. doi:https://doi.org/10.2196/jmir.4255.
6. Health IT.gov [Internet]. [cited 2020 January 26] Available from: https://www.healthit.gov/faq/what-patient-portal
7. Anthony DL, Campos-Castillo C, Lim PS. 2018. Who isn’t using patient portals and why? evidence and implications from a national sample of US adults. Health Aff (Millwood). 37(12), 1948-54. doi:https://doi.org/10.1377/hlthaff.2018.05117.
8. Hong YA, Jiang S, Liu PL. 2020. Use of patient portals of electronic health records remains low from 2014 to 2018: results from a national survey and policy implications. Am J Health Promot. 34(6), 677-80. Epub Feb 2020. doi:https://doi.org/10.1177/0890117119900591.
9. Patel V, Johnson C. "Individuals’ Use Of Online Medical Records And Technology For Health Needs," ONC Data Brief [Internet]. 2018 April [cited January 26, 2020] Available from: https://www.healthit.gov/sites/default/files/page/2018-03/HINTS-2017-Consumer-Data-Brief-3.21.18.pdf
10. Han HR, Gleason KT, Sun CA, Miller HN, Kang SJ, et al. 2019. Using patient portals to improve patient outcomes: systematic review. JMIR Human Factors. 6(4), e15038. doi:https://doi.org/10.2196/15038.
11. Lyles CR, Sarkar U, Ralston JD, Adler N, Schillinger D, et al. 2013. Patient-provider communication and trust in relation to use of an online patient portal among diabetes patients: The Diabetes and Aging Study. J Am Med Inform Assoc. 20(6), 1128-31. Epub May 2013. doi:https://doi.org/10.1136/amiajnl-2012-001567.
12. Baldwin JL, Singh H, Sittig DF, Giardina TD. 2017. Patient portals and health apps: Pitfalls, promises, and what one might learn from the other. Healthc (Amst). 5(3), 81-85. Epub Oct 2016. doi:https://doi.org/10.1016/j.hjdsi.2016.08.004.
13. Hoogenbosch B, Postma J, de Man-van Ginkel JM, Tiemessen NA, van Delden JJ, et al. 2018.
Use and the users of a patient portal: cross-sectional study. J Med Internet Res. 20(9), e262. doi:https://doi.org/10.2196/jmir.9418. PubMed 14. Kuhn M, Johnson K. Applied predictive modeling. Springer; 2013. 15. Gupta S, Hanson C, Gunter CA, Frank M, Liebovitz DM, et al. Modeling and detecting anomalous topic access [abstract]. 2013 IEEE International Conference on Intelligence and Security Informatics. doi:https://doi.org/10.1109/ISI.2013.6578795 16. Boxwala AA, Kim J, Grillo JM, Ohno-Machado L. 2011. Using statistical and machine learning to help institutions detect suspicious access to electronic health records. J Am Med Inform Assoc. 18(4), 498-505. doi:https://doi.org/10.1136/amiajnl-2011-000217. PubMed 17. Davis Giardina T, Menon S, Parrish DE, Sittig DF, Singh H. 2014. Patient access to medical records and healthcare outcomes: a systematic review. J Am Med Inform Assoc. 21(4), 737- 41. Epub Oct 2013. doi:https://doi.org/10.1136/amiajnl-2013-002239. PubMed 18. Mold F, Ellis B, de Lusignan S, Sheikh A, Wyatt JC, et al. 2012. The provision and impact of online patient access to their electronic health records (EHR) and transactional services on the quality and safety of health care: systematic review protocol. Inform Prim Care. 20(4), 271-82. doi:https://doi.org/10.14236/jhi.v20i4.17. PubMed 19. Ross SE, Lin CT. 2003. The effects of promoting patient access to medical records: a review [Corrected and republished from: J Am Med Inform Assoc. 2003 May-Jun;10] [3] [:294. doi:10.1197/jamia.m1147]. J Am Med Inform Assoc. 10(2), 129-38. PubMed https://doi.org/10.1197/jamia.M1147 20. Ross SE, Todd J, Moore LA, Beaty BL, Wittevrongel L, et al. 2005. Expectations of patients and physicians regarding patient-accessible medical records. J Med Internet Res. 7(2), e13. doi:https://doi.org/10.2196/jmir.7.2.e13. PubMed 21. Steele AJ, Denaxas SC, Shah AD, Hemingway H, Luscombe NM. 2018. 
Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease. PLoS One. 13(8), e0202344. doi:https://doi.org/10.1371/journal.pone.0202344. PubMed 22. Agarwal V, Zhang L, Zhu J, Fang S, Cheng T, et al. 2016. Impact of predicting health care utilization via web search behavior: a data-driven analysis. J Med Internet Res. 18(9), e251. doi:https://doi.org/10.2196/jmir.6240. PubMed 23. National Cancer Institute [Internet]. [cited 2020 January 26] Available from: https://hints.cancer.gov/about-hints/learn-more-about-hints.aspx https://doi.org/10.1016/j.hjdsi.2016.08.004 https://pubmed.ncbi.nlm.nih.gov/27720139 https://doi.org/10.2196/jmir.9418 https://pubmed.ncbi.nlm.nih.gov/30224334 https://doi.org/10.1109/ISI.2013.6578795 https://doi.org/10.1136/amiajnl-2011-000217 https://pubmed.ncbi.nlm.nih.gov/21672912 https://doi.org/10.1136/amiajnl-2013-002239 https://pubmed.ncbi.nlm.nih.gov/24154835 https://doi.org/10.14236/jhi.v20i4.17 https://pubmed.ncbi.nlm.nih.gov/23890339 https://pubmed.ncbi.nlm.nih.gov/12595402 https://doi.org/10.1197/jamia.M1147 https://doi.org/10.2196/jmir.7.2.e13 https://pubmed.ncbi.nlm.nih.gov/15914460 https://doi.org/10.1371/journal.pone.0202344 https://pubmed.ncbi.nlm.nih.gov/30169498 https://doi.org/10.2196/jmir.6240 https://pubmed.ncbi.nlm.nih.gov/27655225 Using a Machine Learning Algorithm to Predict Online Patient Portal Utilization: A Patient Engagement Study 16 OJPHI 24. Assari S, Khoshpouri P, Chalian H. 2019. Combined effects of race and socioeconomic status on cancer beliefs, cognitions, and emotions. Healthcare (Basel). 7(1), 17. doi:https://doi.org/10.3390/healthcare7010017. PubMed 25. Miri S. The target population for health IT solutions: The Health Information National Trends Survey (HINTS 2017) [abstract]. 2019 American Medical Informatics Association Clinical Informatics Conference. Atlanta, GA, May, 1 2019, AMIA. 26. 
Grossman LV, Masterson Creber RM, Benda NC, Wright D, Vawdrey DK, et al. 2019. Interventions to increase patient portal use in vulnerable populations: a systematic review. J Am Med Inform Assoc. 26(8-9), 855-70. doi:https://doi.org/10.1093/jamia/ocz023. PubMed 27. Zhao JY, Song B, Anand E, et al. Barriers, Facilitators, and Solutions to Optimal Patient Portal and Personal Health Record Use: A Systematic Review of the Literature. AMIA Annu Symp Proc 2017; 2017: 1913-1922. 2018/06/02. 28. Alsheikh-Ali AA, Qureshi W, Al-Mallah MH, Ioannidis JP. 2011. Public availability of published research data in high-impact journals. PLoS One. 6(9), e24357. doi:https://doi.org/10.1371/journal.pone.0024357. PubMed 29. Gilbert DT, King G, Pettigrew S, Wilson TD. 2016. Comment on “Estimating the reproducibility of psychological science”. Science. 351(6277), 1037. doi:https://doi.org/10.1126/science.aad7243. PubMed 30. Steinley D. 2006. K-means clustering: a half-century synthesis. Br J Math Stat Psychol. 59(Pt 1), 1-34. doi:https://doi.org/10.1348/000711005X48266. PubMed 31. Kursa MB, Rudnicki WR. 2010. Feature selection with the Boruta package. J Stat Softw. 36(11), 1-13. doi:https://doi.org/10.18637/jss.v036.i11. 32. Nelder JA, Wedderburn RW. 1972. Generalized linear models. J R Stat Soc [Ser A]. 135(3), 370-84. doi:https://doi.org/10.2307/2344614. 33. Papagelis A, Kalles DGA. Tree: genetically evolved decision trees [abstract]. 2000 Proceedings 12th IEEE Internationals Conference on Tools with Artificial Intelligence ICTAI. doi: https://doi.org/10.1109/TAI.2000.889871 34. Sakamoto Y, Ishiguro M, Kitagawa G. Akaike information criterion statistics. Springer Netherlands; 1986. 35. Powell KR. 2017. Patient-perceived facilitators of and barriers to electronic portal use: a systematic review. Comput Inform Nurs. 35(11), 565-73. doi:https://doi.org/10.1097/CIN.0000000000000377. PubMed 36. Hossain MM, Hong YA. 
Trends and characteristics of protected health information breaches in the United States. AMIA Annu Symp Proc. 2020;2019:1081-1090. Published 2020 Mar 4. https://doi.org/10.3390/healthcare7010017 https://pubmed.ncbi.nlm.nih.gov/30682822 https://doi.org/10.1093/jamia/ocz023 https://pubmed.ncbi.nlm.nih.gov/30958532 https://doi.org/10.1371/journal.pone.0024357 https://pubmed.ncbi.nlm.nih.gov/21915316 https://doi.org/10.1126/science.aad7243 https://pubmed.ncbi.nlm.nih.gov/26941311 https://doi.org/10.1348/000711005X48266 https://pubmed.ncbi.nlm.nih.gov/16709277 https://doi.org/10.18637/jss.v036.i11 https://doi.org/10.2307/2344614 https://doi.org/10.1109/TAI.2000.889871 https://doi.org/10.1097/CIN.0000000000000377 https://pubmed.ncbi.nlm.nih.gov/28723832 Using a Machine Learning Algorithm to Predict Online Patient Portal Utilization: A Patient Engagement Study 17 OJPHI 37. Goel MS, Brown TL, Williams A, Cooper AJ, Hasnain-Wynia R, Baker DW. Patient reported barriers to enrolling in a patient portal. J Am Med Inform Assoc. 2011 Dec;18 Suppl 1(Suppl 1):i8-12. doi: https://doi.org/10.1136/amiajnl-2011-000473. PMID: 22071530. 38.AMIA. AMIA Responds to FCC Notice on Broadband-Enabled Health Technology. American Medical Informatics Association; 2017 [cited 2022 December]; Available from: https://amia.org/public-policy/public-comments/amia-responds-fcc-notice-broadband- enabled-health-technology. 39. Graetz I, Huang J, Muelly ER, Fireman B, Hsu J, et al. 2020. Association of mobile patient portal access with diabetes medication adherence and glycemic levels among adults with diabetes. JAMA Netw Open. 3(2), e1921429. doi:https://doi.org/10.1001/jamanetworkopen.2019.21429. PubMed 40. Nambisan P. 2017. Factors that impact Patient Web Portal Readiness (PWPR) among the underserved. Int J Med Inform. 102, 62-70. Epub Mar 2017. doi:https://doi.org/10.1016/j.ijmedinf.2017.03.004. PubMed 41. Wallwiener M, Wallwiener CW, Kansy JK, Seeger H, Rajab TK. 2009. 
Impact of electronic messaging on the patient-physician interaction. J Telemed Telecare. 15(5), 243-50. doi:https://doi.org/10.1258/jtt.2009.090111. PubMed https://doi.org/10.1136/amiajnl-2011-000473 https://amia.org/public-policy/public-comments/amia-responds-fcc-notice-broadband-enabled-health-technology https://amia.org/public-policy/public-comments/amia-responds-fcc-notice-broadband-enabled-health-technology https://doi.org/10.1001/jamanetworkopen.2019.21429 https://pubmed.ncbi.nlm.nih.gov/32074289 https://doi.org/10.1016/j.ijmedinf.2017.03.004 https://pubmed.ncbi.nlm.nih.gov/28495349 https://doi.org/10.1258/jtt.2009.090111 https://pubmed.ncbi.nlm.nih.gov/19590030 Using a Machine Learning Algorithm to Predict Online Patient Portal Utilization: A Patient Engagement Study Abstract Introduction Materials and Methods Study Design and Setting Study Participants Study Variables Target Variable/Outcome Variable Labels/Predictor Variables Machine Learning Approach/Statistical Analysis Unsupervised Learning Supervised Learning Results Variable Selection Unsupervised Learning Results Supervised Learning Results Discussion Limitations Conclusions Financial Disclosure Competing Interests Data Availability References