Vol. 4, No.2 July 2023 | 85 J.Valarmathi1, V.T.Kruthika2

Liver Disease Prediction Model Based on Oversampling Dataset with RFE Feature Selection using ANN and AdaBoost algorithms

Ahmed Sami Jaddoa1, Samah J. Saba2, Elaf A. Abd Al-Kareem3
1 Business Informatics College, University of Information Technology and Communications, Iraq
2 Department of Computer Science, College of Science, University of Diyala, Iraq
3 Department of Sharia, College of Islamic Sciences, University of Diyala, Iraq
ahmed.sami@uoitc.edu.iq 1, Samah.j.saba@gmail.com 2, elaaf.ali1989@gmail.com 3

Abstract
Liver diseases are among the most prevalent diseases worldwide; they are becoming increasingly common and can be dangerous. Their incidence is rising due to factors such as excess alcohol consumption, drinking contaminated water, eating contaminated food, and exposure to polluted air. The liver is involved in many functions of the human body, and if it does not function properly, other parts of the body can be affected too. Prediction of the disease at an early stage can help reduce the risk of severity. This paper implements dataset oversampling, feature selection, and performance analysis to improve the accuracy of liver patient classification in 3 phases. In the first phase, z-score normalization is applied to the original liver patient dataset collected from the UCI repository, and the dataset is then balanced by oversampling. In the second phase, the most important attributes are selected using RFE feature selection. In the third phase, classification algorithms are applied to the dataset. Finally, evaluation is performed based on accuracy. The results of the proposed classification implementations indicate that the ANN algorithm performs better than the AdaBoost algorithm with the help of feature selection, reaching an accuracy of 92.77%.
Keywords: Machine learning, Classification, Feature selection, RFE, ANN, AdaBoost, Liver.

P-ISSN : 2715-2448 | E-ISSN : 2715-7199 Vol.4 No.2 July 2023 Buana Information Technology and Computer Sciences (BIT and CS)

I. INTRODUCTION
Liver disease can be defined as inflammation of the liver caused by bacteria or toxic materials, to the point where the liver no longer operates properly.
According to World Health Organization (WHO) reports from 2005, an estimated 7.6 million patients died from cancer, and 84 million individuals were expected to die over the following decade. These data showed that liver cancer is the 6th most widespread cancer type worldwide and the 3rd-largest cause of cancer death. Technological development and easier access to the internet have made it easier to identify liver disease and have become major supports in dealing with illnesses that require special care [1].

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that allows a system to acquire knowledge without being explicitly programmed. Supervised algorithms use human-provided inputs and outputs during training to improve prediction accuracy, which is why they are used in a variety of classification applications. ML applications have thus extended to health care as well. A very significant problem in health care is the rising number of liver disease patients. The liver is one of the most vital organs, with functions such as detoxification of chemicals, bile production, and production of the proteins needed for blood clotting [2].

Feature selection has also been referred to as Instance Selection, Attribute Selection, Variable Selection, Data Selection, Feature Construction, or Feature Extraction. It is used for data reduction by removing redundant and irrelevant data in order to increase data mining accuracy. Feature selection chooses a subset of relevant features from the original features [3].

Classification is one of the crucial tasks in data mining (DM) and ML, because it aims to assign every instance in the dataset to a distinct group on the basis of the information carried by its features.
Previous work has attempted to create classifiers that identify diabetes at minimal cost and with optimal performance [4][5].

II. LITERATURE REVIEW
In recent years, various studies have been carried out to classify liver patients.

S. Jain et al. [6] proposed a paper based on the Indian Liver Patient Dataset, which records a variety of symptoms for around 600 patients. That work evaluated the outputs of several intelligent techniques, namely K-NN, XGBoost, support vector machines (SVM), and decision tree, with a training-to-testing ratio of 80% to 20%. The results showed an accuracy of 64% for K-NN, 66% for SVM, 81% for the decision tree, and 91% for XGBoost.

G. Jamila et al. [7] proposed a model for the prediction of liver cirrhosis that employed Naive Bayes, Classification and Regression Tree (CART), and SVM with 10-fold cross-validation. Accuracy, recall, precision, and F1 score were used to evaluate the model's performance. Among the strategies used in that study, SVM produced the best results, with an accuracy of 73%, precision of 73%, recall of 100%, and F1 score of 84%.

G. S. Harshpreet Kaur [8] studied the prediction of liver diseases using ML algorithms. The prediction involves several steps, such as preprocessing, feature extraction, and classification. The paper suggested a hybrid classification approach for the prediction of liver diseases, with data collected from the Kaggle database of Indian liver patient records. The suggested model achieved an accuracy of 77.58%.

M. Ghosh et al.
[9] aimed at evaluating a number of ML algorithms, namely random forest, logistic regression, XGBoost, SVM, AdaBoost, decision tree, and K-NN, for the prediction and diagnosis of chronic liver disease. The classification algorithms were assessed on several measurement criteria: accuracy, F1 score, precision, recall, area under the curve (AUC), and specificity. Among the algorithms, random forest exhibited the best performance in the prediction of liver diseases, with 83.7% accuracy.

N. Nahar et al. [10] analyzed ensemble learning methods for the classification of liver diseases. Five ensemble algorithms, namely AdaBoost, BeggRep, LogitBoost, Begg-J48, and Random Forest, were implemented and compared based on accuracy, FPR, RMSE, TPR, and ROC curve. LogitBoost outperformed the other ensemble methods, with an accuracy of 71.53%.

A further paper codified a process for diagnosing liver disease using deep learning in a web-based approach. The model attained an accuracy of 67.6% and predicts whether or not the user has liver disease.

III. MATERIALS AND METHOD
Fig. 1. Overall Process of Liver Disease Model

A. Dataset and Attributes
At present, a wide range of datasets related to liver diseases is available. This paper uses the Indian Liver Patient Dataset (ILPD), which includes 583 rows and 2 classes. The 1st class contains the patient records (PRs) of liver disease and includes 416 records; the 2nd class contains the non-liver PRs and consists of 167 records, as determined from the selector field. Fig. 1 illustrates the distribution of the data in the dataset. In general, the dataset includes 11 columns for 142 female and 441 male patients. Details are listed in Table 1.

Table 1.
Attributes of the Dataset

| No | Attribute | Type | Range |
|---|---|---|---|
| 1 | Age: Patient Age | Interval | [4-90] |
| 2 | Gender: Patient Gender | Nominal | [Female-Male] |
| 3 | TB: Total Bilirubin | Interval | [0.40-75] |
| 4 | DB: Direct Bilirubin | Interval | [0.10-19.70] |
| 5 | Alkphos: Alkaline Phosphatase | Interval | [63-2,110] |
| 6 | Sgpt: Alanine Aminotransferase | Interval | [10-2,000] |
| 7 | Sgot: Aspartate Aminotransferase | Interval | [10-4,929] |
| 8 | TP: Total Proteins | Interval | [2.70-9.60] |
| 9 | ALB: Albumin | Interval | [0.90-5.50] |
| 10 | A/G Ratio: Ratio of Albumin and Globulin | Interval | [0.30-2.80] |
| 11 | Selector field * | Binary | [1-2] |

Fig. 1. The number of patients in the dataset

B. Dataset Pre-Processing
Pre-processing is a vital stage in ML classification: the cleaner the data, the better the classification results tend to be [11]. The preprocessing methods applied in the model are as follows:
a. Reducing noisy data: There are 2 types of data noise in ML, class noise and attribute noise. To maximize the accuracy of the suggested model, attribute noise is reduced using the pandas library.
b. Data transformation: the process of reorganizing or restructuring the raw data. It is used to transform the raw data into a proper format so that data mining can obtain strategic information faster and more effectively.
c. Standard scaler: transforms the data so that its distribution has a mean of 0 and a standard deviation of 1. The aggregate functions operate on the column values and then return a single value.

C. Oversampling
Oversampling refers to the random duplication of minority class samples. As noted above, the ILPD dataset has 167 non-liver samples and 416 liver samples. It may therefore suffer from an imbalanced class distribution, in which the majority class can bias the prediction.
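The random duplication described here can be illustrated with a minimal sketch. It assumes NumPy and a synthetic feature matrix that only mimics the ILPD class counts (416 versus 167); the function name `random_oversample`, the seed, and the 10 random feature columns are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            # Sample with replacement from the existing rows of this class
            extra = rng.choice(idx, size=target - count, replace=True)
            X_parts.append(X[extra])
            y_parts.append(y[extra])
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Synthetic stand-in (assumption): labels follow the selector field,
# 1 = liver patient (416 rows), 2 = non-liver patient (167 rows).
rng = np.random.default_rng(42)
y = np.array([1] * 416 + [2] * 167)
X = rng.normal(size=(y.size, 10))

X_bal, y_bal = random_oversample(X, y)
labels, counts = np.unique(y_bal, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))  # both classes now have 416 rows
```

The original rows are kept unchanged at the start of the balanced arrays, and only duplicated minority rows are appended.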
To overcome this problem, Random Oversampling is used to increase the number of minority class samples.

D. Feature Scaling
Feature scaling normalizes feature values to a pre-defined range. It is a vital step in building a machine learning model: it reduces training time and sometimes helps many machine learning algorithms converge faster. Scaling using the mean and standard deviation may suffer if a dataset contains too many outliers. We used the Z-score outlier detection technique to detect the outliers and handled them using Robust Scaling.

E. Feature Selection
Feature selection is the process of selecting, from the dataset, the significant characteristics that are strongly associated with the output, for faster model training, decreased dimensionality, reduced complexity, improved accuracy, and more straightforward interpretation. Significant biomarkers and variables have been obtained from clinical records and patient lab tests using ML and statistical data mining algorithms. The preprocessing tools mentioned above include packages that support feature selection [12].

Recursive Feature Elimination (RFE)
RFE is a feature selection approach of the wrapper type. Internally it uses filter-based approaches, but it nonetheless differs from filter methods. It has 2 significant configuration options: i. the number of features to be selected, and ii. the ML algorithm used in the feature selection. For the first option, it searches for a subset of features by starting from all features present in the training dataset and eliminating features until the required number is left. For the second, it fits the ML algorithm and ranks the features by their importance. It discards the least important features and then repeats the model-fitting step.
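The elimination loop can be sketched with scikit-learn's `RFE` class. This is a minimal illustration under stated assumptions: the paper does not say which estimator or how many features were used, so the logistic regression estimator, the choice of 6 features, and the synthetic data from `make_classification` are all placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 10 ILPD input attributes (assumption)
X, y = make_classification(
    n_samples=583, n_features=10, n_informative=4,
    n_redundant=2, random_state=0,
)

# Option ii: the ML algorithm used to rank features (assumed here)
estimator = LogisticRegression(max_iter=1000)

# Option i: the number of features to keep (assumed here);
# step=1 removes one feature per refit
selector = RFE(estimator, n_features_to_select=6, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the retained features
print(selector.ranking_)   # 1 = retained, larger = eliminated earlier

X_reduced = selector.transform(X)  # keeps only the selected columns
```

Any estimator that exposes `coef_` or `feature_importances_` after fitting can be substituted for the logistic regression.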
The entire process is repeated until the stated number of features is left [13].

F. Dataset Splitting
The dataset is split into training and testing data for the analysis: 80% of the data is used for training and 20% for testing.

G. Classification Techniques
Classification is a modelling task used to predict the future behavior of data by assigning records to predefined classes. Here, classification is used for precise disease detection using the training and testing datasets [14]. This work proposes 2 ML models for prediction. First, the training data is used to train the 2 ML models, Neural Network and AdaBoost, one after the other; predictions are then made with each trained model on the test data. Finally, measures including precision, accuracy, and recall are compared across the algorithms.

a. Artificial Neural Network Classifier
An ANN [15] is a simulation of the working of biological neural networks. Each node is modeled after a neuron, which is why nodes are also referred to as artificial neurons. An NN is made up of several layers, each of which has a number of nodes. A typical NN is shown in Fig. 2.

Fig. 2. Diagram of a typical ANN

A typical ANN has 3 components:
• Input Layer - a single layer whose number of nodes depends on the input dimensions. The input layer applies a transform to the NN's input and passes the result as input to the hidden layers.
• Output Layer - the last layer of an NN, whose dimensions are determined by the output. This layer applies a function to the hidden layers' output before the results are produced.
• Hidden Layer - these layers are the crux of the algorithm. They carry out all of the calculations on the input in order to produce the output.
The inner workings of those layers are not directly observable, which is why only the weights and parameters supplied to them can be tuned to produce the required results. A network becomes deeper as the number of hidden layers increases.

Each node in a network is referred to as a perceptron, depicted in Fig. 3. A perceptron is made up of 2 parts: a summation of inputs and an activation function applied to that sum. A node takes the weighted sum of its inputs and passes it to a linear or nonlinear activation function.

Fig. 3. A Diagram of the Perceptron

The equation for a perceptron is given in (1). The weighted sum of the inputs (x·w), together with a bias value (b), is passed through the activation function (f). The sum can be written as a vector dot product, where n is the number of inputs to the node. The activation function produces the output prediction for the given set of inputs. The bias term is added to the computation to help improve the learning of the perceptron.

$$z = f(b + x \cdot w) = f\left(b + \sum_{i=1}^{n} x_i w_i\right) \qquad (1)$$

Each perceptron uses a step function as its activation function. In a set of perceptrons forming an ANN (also referred to as a Multi-Layer Perceptron), each layer may have a separate activation function.

b. AdaBoost Algorithm
The AdaBoost algorithm uses very short (1-level) decision trees as weak learners, which are added sequentially to the ensemble. Each subsequent model tries to correct the predictions made by the model before it in the sequence. In this way it combines several average or weak predictors to build a strong predictor [16].

H. Performance measure
The performance of the different machine learning algorithms is analyzed using the following measures [17]:
• Confusion Matrix - a table used in performance measurement that makes it easy to visualize and distinguish true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
• Accuracy - the ratio of correctly predicted observations to the total number of observations.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

• Precision - the percentage of true positives out of all positive predictions.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (3)$$

• Sensitivity - the percentage of all actual positives that are predicted positive.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (4)$$

• Specificity - the true negative rate, i.e. the proportion of negative tuples that have been identified correctly.

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (5)$$

IV. RESULTS AND DISCUSSION
Implementing the algorithms described in the previous section produced the following results:

Table 1. Confusion matrix

| Actual / Predicted | Normal | Abnormal |
|---|---|---|
| Normal | TP | FN |
| Abnormal | FP | TN |

Table 2. Confusion matrix of ANN without RFE feature selection

| Actual / Predicted | Normal | Abnormal |
|---|---|---|
| Normal | 15 | 3 |
| Abnormal | 5 | 60 |

Table 3. Confusion matrix of ANN with RFE feature selection

| Actual / Predicted | Normal | Abnormal |
|---|---|---|
| Normal | 16 | 1 |
| Abnormal | 5 | 61 |

Table 4. Performance measure of ANN model without and with RFE feature selection

| Model | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|
| ANN without RFE | 90.36% | 75% | 83.3% | 92.3% |
| ANN with RFE | 92.77% | 76.1% | 94.1% | 92.4% |

Table 5. Confusion matrix of AdaBoost without RFE feature selection

| Actual / Predicted | Normal | Abnormal |
|---|---|---|
| Normal | 12 | 5 |
| Abnormal | 6 | 60 |

Table 6. Confusion matrix of AdaBoost with RFE feature selection

| Actual / Predicted | Normal | Abnormal |
|---|---|---|
| Normal | 13 | 4 |
| Abnormal | 5 | 61 |

Table 7. Performance measure of AdaBoost model without and with RFE feature selection

| Model | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|
| AdaBoost without RFE | 86.74% | 66.6% | 70.5% | 90.9% |
| AdaBoost with RFE | 89.15% | 72.2% | 76.4% | 92.4% |

Fig. 4.
Accuracy of ANN and AdaBoost models without and with RFE feature selection

V. CONCLUSIONS
This work presented a model for predicting the probability of liver disease occurrence. The analyses and evaluations of the suggested model have shown that it is effective and easy to use and implement. Two ML algorithms were applied to the ILPD dataset to classify liver patients. During data preprocessing, the issue of imbalanced class distribution was addressed with an oversampling technique (Random Over Sampling), and the Z-score outlier detection technique was used to detect outliers, which were then handled using Robust Scaling. RFE feature selection, which specifies the number of features to be chosen, was then applied to achieve better performance, and the ANN and AdaBoost algorithms were applied to obtain the final results. The analysis of the experimental results shows that the ANN algorithm achieved the highest accuracy, 92.77%.

References
[1] H. Hartatik, M. B. Tamam, and A. Setyanto, “Prediction for Diagnosing Liver Disease in Patients using KNN and Naïve Bayes Algorithms,” 2020 2nd Int. Conf. Cybern. Intell. Syst. ICORIS 2020, pp. 1–5, 2020, doi: 10.1109/ICORIS50180.2020.9320797.
[2] M. A. Kuzhippallil, C. Joseph, and A. Kannan, “Comparative Analysis of Machine Learning Techniques for Indian Liver Disease Patients,” 2020 6th Int. Conf. Adv. Comput. Commun. Syst. ICACCS 2020, pp. 778–782, 2020, doi: 10.1109/ICACCS48705.2020.9074368.
[3] M. A. Khadija and N. A. Setiawan, “Detecting Liver Disease Diagnosis by Combining SMOTE, Information Gain Attribute Evaluation, and Ranker,” ITSMART J. Teknol. dan Inf., vol. 9, no. 1, pp. 13–17, 2020.
[4] A. S. Jaddoa, Z. Tariq, and M. Al-ta, “Comparison of Data Mining Algorithms for Diagnosis of Diabetes Mellitus,” vol. 10, no. 2, pp.
1–8, 2021.
[5] R. Ahmed, S. Jaddoa, P. Ziyad, and T. Mustafa, “Diagnosis of Diabetes Mellitus using Hybrid Techniques for Feature Selection and Classification,” pp. 1650–1663, 2021.
[6] S. Jain, R. Sharma, and R. Rajkamal, “Classification of Liver Diseases Using Intelligent Techniques,” EasyChair Preprint, 2021.
[7] G. Jamila, G. M. Wajiga, Y. M. Malgwi, and A. H. Maidabara, “A Diagnostic Model for the Prediction of Liver Cirrhosis using Machine Learning Techniques,” Comput. Sci. IT Res. J., vol. 3, no. 1, pp. 36–51, 2022, doi: 10.51594/csitrj.v3i1.296.
[8] G. S. Harshpreet Kaur, “The Diagnosis of Chronic Liver Disease using Machine Learning Techniques,” Inf. Technol. Ind., vol. 9, no. 2, pp. 554–564, 2021, doi: 10.17762/itii.v9i2.382.
[9] M. Ghosh et al., “A comparative analysis of machine learning algorithms to predict liver disease,” Intell. Autom. Soft Comput., vol. 30, no. 3, pp. 917–928, 2021, doi: 10.32604/iasc.2021.017989.
[10] N. Nahar, F. Ara, M. A. I. Neloy, V. Barua, M. S. Hossain, and K. Andersson, “A Comparative Analysis of the Ensemble Method for Liver Disease Prediction,” ICIET 2019 - 2nd Int. Conf. Innov. Eng. Technol., pp. 23–24, 2019, doi: 10.1109/ICIET48527.2019.9290507.
[11] S. Afrin et al., “Supervised machine learning based liver disease prediction approach with LASSO feature selection,” Bull. Electr. Eng. Informatics, vol. 10, no. 6, pp. 3369–3376, 2021, doi: 10.11591/eei.v10i6.3242.
[12] N. Tanwar and K. F. Rahman, “Machine learning in liver disease diagnosis: Current progress and future opportunities,” IOP Conf. Ser. Mater. Sci. Eng., vol. 1022, no. 1, 2021, doi: 10.1088/1757-899X/1022/1/012029.
[13] R. C. Poonia et al., “Intelligent Diagnostic Prediction and Classification Models for Detection of Kidney Disease,” Healthc., vol. 10, no. 2, 2022, doi: 10.3390/healthcare10020371.
[14] S.
Kefelegn, “Prediction and Analysis of Liver Disorder Diseases by using Data Mining Technique: Survey,” vol. 118, no. 9, pp. 765–770, 2017, [Online]. Available: http://www.ijpam.eu.
[15] S. Gupta, G. Karanth, N. Pentapati, and V. R. B. Prasad, “A Web Based Framework for Liver Disease Diagnosis using Combined Machine Learning Models,” Proc. - Int. Conf. Smart Electron. Commun. ICOSEC 2020, no. Icosec, pp. 421–428, 2020, doi: 10.1109/ICOSEC49089.2020.9215454.
[16] A. Khatavkar, P. Potpose, and P. Pandey, “Smart Health Prediction System,” vol. 5, no. 02, pp. 1550–1552, 2017.
[17] B. K. Mengiste, H. K. Tripathy, and J. K. Rout, “Analysis and Prediction of Cardiovascular Disease Using Machine Learning Techniques,” Lect. Notes Electr. Eng., vol. 708, no. 2, pp. 133–141, 2021, doi: 10.1007/978-981-15-8685-9_13.