JURNAL RISET INFORMATIKA, Vol. 5, No. 1, December 2022
P-ISSN: 2656-1743 | E-ISSN: 2656-1735
DOI: https://doi.org/10.34288/jri.v5i1.453
Accredited rank 3 (SINTA 3), excerpts from the decision of the Minister of RISTEK-BRIN No. 200/M/KPT/2020

Implementation of Machine Learning Algorithms for Early Detection of Cervical Cancer Based on Behavioral Determinants

Duwi Cahya Putri Buani-1*, Indah Suryani-2
Informatika, Universitas Nusa Mandiri, Jakarta, Indonesia
duwi.dcp@nusamandiri.ac.id, Indah.ihy@nusamandiri.ac.id
(*) Corresponding Author

Abstract
Cervical cancer is a disease that affects women and has the highest mortality rate after breast cancer. Early detection is critical for reducing the number of cervical cancer patients. Many women, especially in Indonesia, pay little attention to the dangers of cervical cancer, even though the disease is easier to treat when detected early. Machine learning algorithms offer one alternative for early detection. The algorithms used in this study are Naïve Bayes (NB), Logistic Regression (LR), Decision Tree (DT), SVM, and Random Forest (RF). This study also applies the Random Under Sampler technique, which was not used in prior research, and this technique improves the accuracy of the five algorithms. The results show accuracies of 91.67% for NB, 87.5% for LR, 81.81% for DT, and 75% for SVM, while RF has the highest accuracy at 94.45%. The best model in this study is therefore Random Forest.

Keywords: Cervical Cancer; Machine Learning; Random Forest

INTRODUCTION
GLOBOCAN (Global Cancer Observatory) stated that Asian countries, including Indonesia, contribute the largest share of cancer cases worldwide. Data from Dharmais Hospital in 2018 showed that the most common cancers were breast cancer at 19.18%, cervical cancer at 10.69%, and lung cancer at 9.89% (Agustyawati, Fauzi, & Pratondo, 2021; Pangribowo, 2019; Wongkar, Angka, & Angeline, 2022). The WHO (World Health Organization) states that cervical cancer is a deadly disease that ranks second only to breast cancer. About 50,000 women are diagnosed with cervical cancer annually (Sobar, Machmud, & Wijaya, 2016; Setyani, 2018). The high number of cervical cancer patients is influenced by a lack of knowledge among the public, especially women, about undergoing early detection before the cancer spreads (Aisah, Hafiyusholeh, & Ulinnuha, 2022; Winarni & Suratih, 2020).
These data show that cervical cancer is one of the most common cancers in Indonesia, so it needs to be detected early (Arifin, Siregar, Ratna, & Mudzakir, 2021; Hidayah, Cholissodin, & Adikara, 2019). Early detection can be performed using machine learning. Machine learning is used as a classifier to estimate the probability of cervical cancer risk based on behavior and its determinants (Feblian & Daihani, 2017). Previous research using the Naïve Bayes (NB) and Logistic Regression (LR) algorithms produced the results shown in Table 1.

Table 1. Previous Research
Algorithm   Accuracy   AUC
NB          91.67%     0.96
LR          87.5%      0.97
Source: (Sobar et al., 2016)

Table 1 shows the previous study by Sobar et al. (2016), which used the same data but compared only two algorithms that were popular at the time, NB and LR. This study compares NB and LR with three additional algorithms, Decision Tree (DT), SVM, and Random Forest (RF), to find the best model for the early detection of cervical cancer. These three additional algorithms have their respective advantages that can cover the weaknesses of the Naïve Bayes algorithm, which had the highest accuracy in the previous study, namely 91.67%. In addition to adding algorithms for comparison, this study uses the Random Under Sampler technique to overcome class imbalance, a method that had not been employed in earlier investigations.
Previous research applying the SVM algorithm to the same data reported an accuracy of 87%; that study used a sample of 59 records and four attributes, did not use the Random Under Sampler technique, and processed the data with Python (Arifin et al., 2021). Another study classified cervical cancer risk data with feature selection based on expert interviews, used the Extreme Learning Machine algorithm, and evaluated the classifier with a confusion matrix, resulting in an accuracy of 91.76% (Hidayah et al., 2019). A further study used the Decision Tree algorithm on patient data from Dr. Wahidin Sudirohusodo Hospital, Makassar, with 19 attributes, and reported an error rate of 0%; it used symptoms and signs to determine the stage of cervical cancer suffered by patients (Irmayani, 2017).

Cervical cancer studies typically take a medical perspective rather than a lifestyle perspective. In this study, the authors used data on habits carried out in everyday life, collected from people both with and without cervical cancer.

RESEARCH METHODS
This study used the CRISP-DM (Cross Industry Standard Process for Data Mining) model, which consists of six stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment (Firqiani, Kustyo, & Giri, 2008; Hasanah, Soim, & Handayani, 2021; Matovani & Hadiono, 2018).

A. Stages of business understanding
The data come from the UCI Machine Learning Repository and cover 72 respondents, of whom 22 were cervical cancer patients and 50 did not have cervical cancer. All respondents were residents of Jakarta, Indonesia. Screening must be carried out so that the disease can be detected early and the risk of cervical cancer reduced.
Data mining with classification algorithms that have high predictive accuracy can help address this problem so that the diagnostic results obtained are accurate. This study compared several algorithms to obtain high accuracy: Logistic Regression, Naïve Bayes, SVM, Decision Tree, and Random Forest.

B. Data understanding stage
The data are secondary data obtained from a questionnaire survey distributed to 72 respondents, of whom 22 are cervical cancer patients and 50 do not have cervical cancer. All respondents are residents of Jakarta, Indonesia, and the data can be accessed publicly through the UCI Machine Learning Repository. The data consist of 19 attributes and one class attribute.

Source: (Buani & Suryani, 2022)
Figure 1. Visualization of Ca Cervix Variables

The Ca Cervix variable is the label variable, which classifies respondents as having or not having cervical cancer.

Source: (Buani & Suryani, 2022)
Figure 2. Visualization of Eating Behavior Data

The Eating Behavior variable describes food consumption. The food consumed strongly influences the health of the body.

Source: (Buani & Suryani, 2022)
Figure 3. Visualization of Behavioral Sexual Risk Variable

Behavioral Sexual Risk is the variable most likely to determine whether a person has cervical cancer.

C. Stages of data preparation
The data in this study comprise 72 records, each already labeled as a respondent with or without a risk of cancer. However, the data still contain duplicates, outliers, and anomalous or inconsistent values. This stage is therefore necessary to obtain quality data and produce a more effective and efficient model.
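The duplicate and outlier checks named in this stage can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the paper does not say how the checks were implemented, the rows below are hypothetical toy values rather than the survey data, the helper names drop_duplicates and iqr_outliers are invented for illustration, and the 1.5 × IQR rule is one common outlier convention that the authors may or may not have used.

```python
from statistics import quantiles


def drop_duplicates(rows):
    """Return the rows with exact duplicates removed, keeping first occurrences."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical toy rows: (behavior_eating, behavior_sexualRisk, ca_cervix).
rows = [
    (13, 10, 0), (15, 10, 0), (13, 10, 0),  # the third row repeats the first
    (11, 9, 1), (12, 10, 0), (14, 10, 0),
    (10, 10, 1), (12, 9, 0), (14, 2, 1),
]

clean = drop_duplicates(rows)
eating_outliers = iqr_outliers([r[0] for r in clean])
risk_outliers = iqr_outliers([r[1] for r in clean])

print(f"rows before/after de-duplication: {len(rows)}/{len(clean)}")
print(f"behavior_eating outliers: {eating_outliers}")
print(f"behavior_sexualRisk outliers: {risk_outliers}")
```

In this toy data the eating column yields no flagged values while the sexual risk column flags the single low score, which mirrors the kind of per-variable inspection a box plot provides.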
An example of an outlier check can be seen in Figure 4.

Source: (Buani & Suryani, 2022)
Figure 4. Eating Behavior Variable

Figure 4 shows that the Eating Behavior variable has no outliers: no value lies far from the other observations.

Source: (Buani & Suryani, 2022)
Figure 5. Distribution of Ca-Cervix Data

In Figure 5, the blue bar is the data labeled no cervical cancer and the orange bar is the data labeled cervical cancer. The figure shows that the two classes are unequal in size and must first be balanced. Here the authors use the Random Under Sampler technique, with the result shown in Figure 6.

Source: (Buani & Suryani, 2022)
Figure 6. Data Distribution after Random Under Sampler

Figure 6 shows the data distribution after applying the Random Under Sampler technique. The Random Under Sampler deals with classes/labels that are not equal in number. In this process, the data are divided into training data and testing data, with 70% for training and 30% for testing. Once the Random Under Sampler has been applied and the classes contain the same number of records, the next step is to examine the correlation between attributes or variables using a heatmap, shown in Figure 7.

Source: (Buani & Suryani, 2022)
Figure 7. HeatMap Correlation Between Variables

Figure 7 shows that the darker the heatmap color, the stronger the association between an attribute or variable and the data class or label. According to the heatmap, the most strongly associated variables/attributes are Behavioral Sexual Risk, Commitment of Intentions, Aggregation of Intentions, Norm Significant People, Appreciative Social Support, Self-Hygiene Behavior, Perceptual Susceptibility, and Fulfillment of Norms. These eight variables/attributes most influence the cervical cancer class/label.

A. Dataset
The dataset used in this study is public data from the UCI Machine Learning Repository with a total of 72 records. It consists of 20 attributes, one of which is the label, as shown in Table 2.

Table 2. Data Description
Variable
behavior_sexualRisk
behavior_eating
behavior_personalHygine
intention_aggregation
intention_commitment
attitude_consistency
attitude_spontaneity
norm_significantPerson
norm_fulfillment
perception_vulnerability
perception_severity
motivation_strength
motivation_willingness
socialSupport_emotionality
socialSupport_appreciation
socialSupport_instrumental
empowerment_knowledge
empowerment_abilities
empowerment_desires
ca_cervix (this is a class attribute, 1=have cervical cancer, 0=no cervical cancer)
Source: (Sobar et al., 2016)

Table 2 shows the variables or attributes used in the study: Behavioral Sexual Risk, Behavior Eating, Behavior Personal Hygiene, Aggregation of Intentions, Commitment of Intentions, Attitude Consistency, Attitude Spontaneity, Norm Significant People, Norm Fulfillment, Perceptual Vulnerability, Perceived Severity, Motivation Strength, Motivation Willingness, Emotional Social Support, Appreciative Social Support, Instrumental Social Support, Empowerment Knowledge, Empowerment Ability, Empowerment Desire, and ca_cervix (the class attribute, 1=cervical cancer, 0=no cervical cancer).

B. Research Methods
The model used in this study is the application of machine learning algorithms for the early detection of cervical cancer. The algorithms used are Decision Tree (DT), SVM, Random Forest (RF), and the two algorithms from the previous study, NB and LR. The best of the five models is then selected based on its accuracy.
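The comparison described in this section can be sketched in Python with scikit-learn. This is a minimal sketch under stated assumptions, not the authors' code: the data are a synthetic stand-in generated with make_classification (72 rows, 19 features), random_under_sample and positive_scores are hypothetical helpers written for this example, under-sampling is applied before the 70/30 split as the data preparation text appears to describe, and every hyperparameter is a library default or an arbitrary illustrative value. The numbers it prints therefore say nothing about the results reported in this paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def random_under_sample(X, y, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]


def positive_scores(model, X):
    """Score the positive class with probabilities if available, else decision values."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    return model.decision_function(X)


# Synthetic stand-in for the survey data (assumption: NOT the real dataset).
X, y = make_classification(n_samples=72, n_features=19, n_informative=8,
                           weights=[0.7, 0.3], random_state=0)

# Balance the classes, then split 70% training / 30% testing.
X_bal, y_bal = random_under_sample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.30, stratify=y_bal, random_state=0)

models = {
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, positive_scores(model, X_test))
    results[name] = (accuracy, auc)
    print(f"{name}: accuracy={accuracy:.2%}, AUC={auc:.2f}")
```

The imbalanced-learn package provides a RandomUnderSampler class for the same purpose; the hand-written helper is used here only to keep the sketch dependent on NumPy and scikit-learn alone. A common alternative design is to split first and under-sample only the training portion, so that the test set keeps the original class distribution.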
Figure 8 shows the flow chart of this study.

Source: (Buani & Suryani, 2022)
Figure 8. Research Flowchart

Figure 8 shows the research flow, starting from data preparation, such as checking for blank data, duplicate data, and class imbalance.

RESULTS AND DISCUSSION
This study compared five algorithms: Logistic Regression, Naïve Bayes, SVM, Decision Tree, and Random Forest. For LR and NB, the accuracy results come from the previous study. SVM is a classification algorithm whose accuracy depends on the kernel function and parameters used; its advantage is that it can handle both classification and regression, for linear and non-linear problems (Dasmasela, Tomasouw, & Leleury, 2021; Parapat, Furqon, & Sutrisno, 2018). Decision trees are predictive modeling techniques used for classification and prediction tasks (Bahri & Lubis, 2020; Wijaya, Bahtiar, Kaslani, & R, 2021; Wuryani & Agustiani, 2021; Schonlau & Zou, 2020). The results of the study can be seen in Table 3.

Table 3. Predicted Results
Algorithm   Accuracy   AUC
NB          91.67%     0.96
LR          87.5%      0.97
DT          81.81%     0.81
RF          94.45%     0.75
SVM         75%        0.75
Source: (Buani & Suryani, 2022)

Table 3 presents the accuracy results of the experiments: Naïve Bayes (NB) has an accuracy of 91.67%, Logistic Regression (LR) 87.5%, Decision Tree (DT) 81.81%, SVM 75%, and Random Forest (RF) 94.45%. The table shows that the highest accuracy in this study is achieved by RF.

Source: (Buani & Suryani, 2022)
Figure 9. Graph of Prediction Results

Figure 9 visualizes the prediction results of NB, LR, DT, RF, and SVM and shows that Random Forest is the best algorithm for making predictions, with an accuracy of 94.45%.

CONCLUSIONS AND SUGGESTIONS
The results of this research using the Random Under Sampler technique show that the model using the SVM algorithm reaches an accuracy of 75%, while the model using the Decision Tree algorithm reaches 82%.
The model using the Random Forest algorithm reaches 94%, which indicates that Random Forest is the best method for the early detection of cervical cancer based on behavioral determinants.

REFERENCES
Agustyawati, D. N., Fauzi, H., & Pratondo, A. (2021). Perancangan Aplikasi Deteksi Kanker Serviks Menggunakan Metode Convolutional Neural Network. EProceedings of Engineering, 8(4), 3908–3924.
Aisah, S. N., Hafiyusholeh, M., & Ulinnuha, N. (2022). Klasifikasi Kanker Serviks Menggunakan Metode Extreme Learning Machine (ELM). Komputek, 6(3), 68–75. Retrieved from https://studentjournal.umpo.ac.id/index.php/komputek/article/view/68
Arifin, S. S., Siregar, A. M., Ratna, A., & Mudzakir, T. Al. (2021). Klasifikasi Penyakit Kanker Serviks Menggunakan Algoritma Support Vector Machine (SVM). (Ciastech), 521–528.
Bahri, S., & Lubis, A. (2020). Metode Klasifikasi Decision Tree Untuk Memprediksi Juara English Premier League. Jurnal Sintaksis, 2(1), 63–70. Retrieved from http://www.jurnal.stkipalmaksum.ac.id/index.php/Sintaksis/article/view/47
Buani, D. C. P., & Suryani, I. (2022). Independent Research Report. Jakarta.
Dasmasela, R., Tomasouw, B. P., & Leleury, Z. A. (2021). Penerapan Metode Support Vector Machine (SVM) untuk Mendeteksi Penyalahgunaan Narkoba. Matematika, Statistik Dan Terapannya, 1(02), 93–101.
Feblian, D., & Daihani, D. U. (2017). Implementasi Model Crisp-Dm Untuk Menentukan Sales Pipeline Pada Pt X. Jurnal Teknik Industri, 6(1). https://doi.org/10.25105/jti.v6i1.1526
Firqiani, H. N., Kustyo, A., & Giri, E. P. (2008). Seleksi Fitur Menggunakan Fast Correlation Based Filter pada Algoritma Voting Feature Intervals 5. Jurnal Ilmiah Ilmu Komputer, 6(2), 245184.
Hasanah, M. A., Soim, S., & Handayani, A. S. (2021). Implementasi CRISP-DM Model Menggunakan Metode Decision Tree dengan Algoritma CART untuk Prediksi Curah Hujan Berpotensi Banjir. Journal of Applied Informatics and Computing, 5(2), 103–108. https://doi.org/10.30871/jaic.v5i2.3200
Hidayah, U. R., Cholissodin, I., & Adikara, P. P. (2019). Klasifikasi Penyakit Kanker Serviks dengan Extreme Learning Machine. Jurnal Pengembangan Teknologi Informasi Dan Ilmu Komputer, 3(7), 6575–6582. Retrieved from http://j-ptiik.ub.ac.id
Irmayani, B. A. (2017). Klasifikasi Stadium Kanker Serviks Menggunakan Sistem Pengambilan Keputusan Decision Tree. Prosiding Seminar Nasional, 04(1), 455–464. Retrieved from http://journal.uncp.ac.id/index.php/proceding/article/view/1281
Matovani, D., & Hadiono, K. (2018). Implementasi Algoritma Apriori Untuk Membantu Proses Persediaan Barang. Jurnal Dinamika Informatika, 10(2), 53–59. https://doi.org/10.35315/informatika.v10i2.8133
Pangribowo, S. (2019). Beban Kanker di Indonesia. Pusat Data Dan Informasi Kesehatan Kementerian Kesehatan RI, 1–16.
Parapat, I. M., Furqon, M. T., & Sutrisno. (2018). Penerapan Metode Support Vector Machine (SVM) Pada Klasifikasi Penyimpangan Tumbuh Kembang Anak. Jurnal Pengembangan Teknologi Informasi Dan Ilmu Komputer, 2(10), 3163–3169.
Schonlau, M., & Zou, R. Y. (2020). The random forest algorithm for statistical learning. Stata Journal, 20(1), 3–29. https://doi.org/10.1177/1536867X20909688
Setyani, R. A. (2018). Penerapan Program Deteksi Dini Kanker Serviks Sebagai Upaya Pemberdayaan Wanita Di Sleman Yogyakarta. Kebidanan, Fakultas Ilmu Kesehatan Universitas Respati Yogyakarta, III(2), 12.
Sobar, Machmud, R., & Wijaya, A. (2016). Behavior determinant based cervical cancer early detection with machine learning algorithm. Advanced Science Letters, 22(10), 3120–3123. https://doi.org/10.1166/asl.2016.7980
Wijaya, Y. A., Bahtiar, A., Kaslani, & R, N. (2021). Analisa Klasifikasi menggunakan Algoritma Decision Tree pada Data Log Firewall. Jurnal Sistem Informasi Dan Manajemen, 9(3), 256–264. https://doi.org/10.47024/JS.V9I3.303
Winarni, W., & Suratih, K. (2020). Mengenal Lebih Dini Kanker Leher Rahim Bersama Forum Kajian Dan Komunikasi Muslimah. GEMASSIKA: Jurnal Pengabdian Kepada Masyarakat, 4(2), 186. https://doi.org/10.30787/gemassika.v4i2.569
Wongkar, R., Angka, R. N., & Angeline, R. (2022). Karakteristik Pasien Kanker Stadium 4 yang Mendapatkan Perawatan Paliatif di Rumah Sakit X. Jurnal Kedokteran Meditek, 28(2), 126–132. https://doi.org/10.36452/jkdoktmeditek.v28i2.2235
Wuryani, N., & Agustiani, S. (2021). Random Forest Classifier untuk Deteksi Penderita COVID-19 berbasis Citra CT Scan. Jurnal Teknik Komputer, 7(2), 187–193. https://doi.org/10.31294/jtk.v4i2