Science and Technology Indonesia
e-ISSN: 2580-4391, p-ISSN: 2580-4405
Vol. 6, No. 4, October 2021

Research Paper

Handling Missing Data Using Combination of Deletion Technique, Mean, Mode and Artificial Neural Network Imputation for Heart Disease Dataset

Anita Desiani1*, Novi Rustiana Dewi1, Annisa Nur Fauza1, Naufal Rachmatullah2, Muhammad Arhami3, Muhammad Nawawi4

1 Mathematics Department, Mathematics and Natural Science Faculty, Sriwijaya University, Palembang, 30862, Indonesia
2 Informatics Technique Department, Informatics Faculty, Sriwijaya University, Palembang, 30862, Indonesia
3 Informatics Technique Department, Lhokseumawe State Polytechnic, Aceh, 24301, Indonesia
4 Mechanical Engineering Department, Graduate School of Science, Engineering and Technology, Istanbul Technical University, Maslak Sarıyer, 34467, Turkey
*Corresponding author: anita_desiani@unsri.ac.id

Abstract
The University of California Irvine (UCI) heart disease dataset has missing data on several attributes. Missing data can cause the loss of important information carried by an attribute, but the affected data cannot simply be deleted from the dataset. There are several ways to handle missing data, including deletion and imputation by the mean, the mode, or a prediction method. In this study, an attribute was handled by the deletion technique if more than 70% of its data were missing. Otherwise, the mean and mode methods were used to impute attributes with 1% or less missing data, and an artificial neural network was used for attributes with more than 1% missing data. The techniques and methods used to handle missing data were evaluated through the performance of classification methods applied to the dataset after the missing data had been handled. The classification methods used in this study were Artificial Neural Network, Naïve Bayes, Support Vector Machine, and K-Nearest Neighbor.
The performance of the classification methods without handling missing data was compared with their performance after imputation of the missing data in terms of accuracy, sensitivity, specificity and ROC. In addition, the Mean Squared Error was compared to see how close the predicted labels were to the original labels. The lowest Mean Squared Error was obtained by the Artificial Neural Network, which means that the Artificial Neural Network worked very well on the dataset after missing data handling compared with the other methods. The accuracy, specificity and sensitivity of each classification method showed that imputing missing data could increase classification performance, especially for the Artificial Neural Network method.

Keywords
Missing Data, Artificial Neural Network, Imputation, Mean-mode, Deletion, Heart Disease

Received: 10 June 2021, Accepted: 18 September 2021
https://doi.org/10.26554/sti.2021.6.4.303-312

1. INTRODUCTION
The heart disease dataset is often used in classification or prediction to determine the pattern of factors that affect heart disease. Heart disease is one of the biggest causes of death in the world (Rahakbauw et al., 2016). According to Stewart et al. (2017), nearly one billion people worldwide suffer strokes caused by hypertension and heart attacks. Some data from heart disease diagnoses are published as datasets to support research. One data repository that provides a heart attack dataset is the University of California Irvine (UCI) Machine Learning Repository.
The heart attack dataset provided by UCI is heart disease diagnostic data collected from four sources, namely the Cleveland Clinic Foundation (Cleveland data), the Hungarian Institute of Cardiology, Budapest (Hungarian data), the VA Medical Center, Long Beach, CA (long-beach-va data), and University Hospital, Zurich, Switzerland (Switzerland data). Data from these sources were published by UCI as a dataset of diagnoses of heart disease patients which can be used for prediction of heart disease (Jasoni and Steinbrunn, 2013). The heart disease patient dataset contains 76 attributes, but only 14 attributes affect heart disease (Zriqat et al., 2016). The heart attack dataset is incomplete, with about 491 missing data across several attributes (Misir and Samanta, 2017). Despite these weaknesses in the completeness of its data and attributes, the heart attack dataset has been widely used in studies to diagnose patients with heart disease.

Desiani et al. Science and Technology Indonesia, 6 (2021) 303-312

Missing data problems occur in many datasets besides the UCI heart disease dataset, such as gene and microarray data (Moorthy et al., 2014), medical data (Karim et al., 2017; Purwar and Singh, 2015), credit data (Crone and Finlay, 2012; Lan et al., 2020), software quality datasets (Huang et al., 2017; Jing et al., 2016), etc. Missing data are a common problem in real-world data classification. Therefore, a robust classification method is needed when classifying data whose dataset domain contains missing data (Somasundaram and Nedunchezhian, 2012). Missing data are caused by several things, including errors in manual data entry procedures, equipment errors, or wrong measurements (Purwar and Singh, 2015).
An incomplete dataset affects the accuracy of the data mining model, can give biased results, and reduces the efficiency of the computation because information is missing from the dataset (Choudhury and Pal, 2019). Missing data can significantly reduce the accuracy and usefulness of the assessment model, especially when the missing cases vary widely, and can also cause errors and confusion in interpreting the data. Missing data compromise the quality of the data and in turn affect the accuracy of the model derived from the data (Karim et al., 2017; Silva-Ramírez et al., 2011). Missing data in a dataset have a negative impact on estimation accuracy and hence may lead to inconsistent results. Many estimation models cannot directly handle missing data values; therefore, the preprocessing stage becomes indispensable for modern estimation processes in software engineering (Huang et al., 2017).

Preprocessing is the process needed to clean and filter the target data, because data collection is rarely complete and perfect (Salleh and Samat, 2017). Preprocessing therefore plays an important role in data mining: the dataset is filtered and cleaned before it is trained at the data mining stage so that the data used are of the best quality (Crone and Finlay, 2012). The ability to handle missing data has become a fundamental requirement for pattern classification, because improper treatment of missing data can lead to misclassification (García et al., 2015). Most data mining algorithms are still unable to handle missing data directly. According to Eekhout et al. (2014), Poolsawad et al. (2012) and Vazifehdan et al. (2019), there are several techniques for handling missing data. The first is deletion, namely deleting an incomplete instance (record) or attribute so that classification uses only the complete part of the data.
The second is imputation, in which the missing data are estimated and the estimates are used in classification. The third is to ignore the problem, that is, to use the data directly without handling the missing data in the dataset.

The simplest way to deal with the problem is deletion, i.e. directly deleting the data that have missing values. However, this is only suitable for very small missing rates of 1-5% (Vazifehdan et al., 2019). On the other hand, if so much data is missing on an attribute that it carries little information, the attribute can be removed from the dataset because its information is incomplete (Shah et al., 2017). Removing missing data from a dataset sometimes has a negative impact on the accuracy of estimates and hence gives inconsistent results (Lan et al., 2020). Deletion has been widely adopted to handle missing data during data preprocessing (Huang et al., 2017; Malarvizhi and Thanamani, 2012). Imputation replaces missing values with estimates so that data completeness is maintained (Choudhury and Pal, 2019). It handles the problem by estimating or replacing the missing attribute values using various methods, including statistical methods such as the mean or mode, machine learning, and others (García et al., 2015; Luengo et al., 2011; Tsai et al., 2018).

Several methods and techniques can be used to impute missing data. The simplest statistical methods are the mean for numeric attributes and the mode, the most frequently occurring value of an attribute, for categorical attributes (Eekhout et al., 2014; Mehrotra et al., 2017; Nishanth and Ravi, 2016; Silva-Ramírez et al., 2011). According to Eekhout et al. (2014), mean imputation can lead to biased estimates for every data scenario when the incidence of missing data in a domain is more than 10%.
Mean imputation can also produce biased results when the analysis depends on relationships between variables, because it does not take those relationships into account (Pedersen et al., 2017). Another shortcoming is that the mean does not represent the data well when the attribute contains extreme values. The mode, in turn, is the easiest way to impute categorical data, but the result is biased if the attribute with missing data has more than one mode or no mode at all. Another disadvantage of mode imputation is that it ignores the variance of the population or sample (Nishanth and Ravi, 2016).

The mean and mode methods are well suited to imputing a single value when the percentage of missing data is not too large. More complex imputation approaches currently use machine learning, such as Random Forest (Stekhoven and Bühlmann, 2012), Neural Networks (Nishanth and Ravi, 2016; Rahman and Davis, 2012), K-Nearest Neighbor (KNN) (Manimekalai and Kavitha, 2018), K-Means (Poolsawad et al., 2012), Decision Tree (Chauhan et al., 2013), and Deep Learning (Ting et al., 2020). Machine learning approaches such as Neural Networks are an alternative that gives the best imputation results, but they take more time than the statistical approach and are not effective for small amounts of missing data (Tsai et al., 2018). Apart from deleting and imputing missing data, attribute selection also influences the classification or prediction results. Irrelevant attributes do not affect the description of the target class. Redundant attributes contribute nothing but create bias in the classification results (Shilaskar and Ghatol, 2013).

According to Pedro, at least 45% of the datasets provided by UCI have a missing data problem, including the heart disease dataset.
© 2021 The Authors.

Several studies that predict or classify heart disease on the heart disease dataset deal with missing data by ignoring it, by deleting data, by imputing data, or by eliminating attributes through attribute selection. Al Khaldy and Kambhampati (2016) applied machine learning to predict heart disease regardless of the missing data in the dataset. Salleh and Samat (2017) applied Fuzzy C-Means and Particle Swarm Optimization to impute missing data on a heart disease dataset regardless of the percentage of missing data in each attribute. Choudhury and Pal (2019) showed that the Neural Network method had stable performance for attributes with 1-10% missing data, and that its performance improved greatly on attributes with 50% missing data. However, the study did not explain how more than 50% missing data should be handled. Tsai et al. (2018) imputed missing data with a class center method, namely by finding the center of each class and measuring the distance between classes to estimate a threshold for the missing data in the dataset. The accuracy in that study was only 78%, and the classification or prediction method used to detect cardiac disorders was not explained. Silva-Ramírez et al. (2011) applied a Neural Network to impute missing data amounting to no more than 5%, but did not explain how the results differ when the missing data amount to only 1% or to more than 5%. Subbalakshmi et al. (2011) combined several imputation methods, namely LOCF, Mean-Mode and IV, but did not explain which imputation method was used for each attribute. Hernández-Pereira et al. (2015) compared Mean or Mode Imputation, Multiple Linear Regression, Hot-deck, K-NN, and Neural Network (NN) to handle missing data.
The results of that study showed that NN gave the best performance for handling missing data. Several studies deleted attributes that were deemed to have a less significant influence on the heart disease dataset (El-Bialy et al., 2015; Long et al., 2015).

The heart disease dataset has several attributes with different amounts of missing data. This study focused on how to handle the missing data in the heart disease dataset using more than one approach. Rather than applying a single technique or method to all attributes, it combined the Deletion technique with the Mean, Mode and Artificial Neural Network methods for imputation, in order to gain the advantages of each. The Deletion technique was used for attributes with more than 70% missing data, because the information available for such an attribute was considered insufficient. The Mean and Mode methods were used for attributes with 1% or less missing data, because they are suitable for imputing a single value when the amount of missing data is not too large (Nishanth and Ravi, 2016). For attributes with more than 1% missing data, the study used an Artificial Neural Network (ANN), because previous research has shown that ANN performs well in imputing missing data amounting to more than 1% (Choudhury and Pal, 2019; Tsai et al., 2018). The technique or method applied to each attribute was thus chosen at the preprocessing step according to the percentage of missing data in that attribute, in order to obtain the best data quality and the best classification results. Performance measurements commonly used for a classification method include accuracy, specificity, sensitivity and ROC (Desiani et al., 2021; Resti et al., 2021).
This study used accuracy, specificity, sensitivity and ROC to measure the classification performance of the proposed method.

2. EXPERIMENTAL SECTION
2.1 Materials
This research used secondary data, namely data on patients with heart disease obtained from the University of California Irvine (UCI) Machine Learning Repository, which can be downloaded from its official website (Statlog, 2004). Data were obtained from 294 patients suspected of having heart disease. The data contain 14 attributes that are influential in diagnosing heart disease: age, gender, type of chest pain, blood pressure, cholesterol, sugar levels, electrocardiography, maximal heart rate, induced angina, old peak, slope, fluoroscopy, heart rate, and a label attribute containing the categories healthy and sick. All attributes with their values and types can be seen in Table 1.

2.2 Methods
All stages of the proposed method can be seen in Figure 2. The stages are as follows.

2.2.1 Handling Missing Data
Deletion: attributes with more than 70% missing data are deleted from the dataset because they are considered unrepresentative and carry insufficient information. Attributes with 70% or less missing data are imputed.

Mean and Mode Imputation: categorical attributes with 1% or less missing data are imputed with the mode of the attribute. Numeric (integer or real) attributes with 1% or less missing data are imputed with the mean of the attribute.

Artificial Neural Network Imputation: attributes with more than 1% missing data are imputed using an Artificial Neural Network (ANN). In this step, the data are split into two groups, training data and testing data.
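The attribute-level rule of Section 2.2.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `choose_strategy` and `impute_mean_mode`, the use of `None` to mark a missing entry, and the extra `"none"` outcome for complete attributes are assumptions made for this example. The boundary at exactly 1% follows the abstract, which assigns "less or equal 1%" to the mean and mode methods.

```python
from collections import Counter
from statistics import mean


def choose_strategy(values):
    """Pick a handling strategy from the share of missing (None) entries."""
    missing_pct = 100.0 * sum(v is None for v in values) / len(values)
    if missing_pct == 0.0:
        return "none"        # nothing to handle (assumed extra case)
    if missing_pct > 70.0:
        return "delete"      # drop the whole attribute
    if missing_pct <= 1.0:
        return "mean_mode"   # simple statistical imputation
    return "ann"             # predict the missing values with an ANN


def impute_mean_mode(values, categorical):
    """Fill None with the mode (categorical) or the mean (numeric)."""
    observed = [v for v in values if v is not None]
    if categorical:
        # On ties, Counter keeps the value that was seen first.
        fill = Counter(observed).most_common(1)[0][0]
    else:
        fill = mean(observed)
    return [fill if v is None else v for v in values]


# Exang-like column: 204 zeros, 89 ones and one missing entry.
exang = [0] * 204 + [1] * 89 + [None]
print(choose_strategy(exang))
print(impute_mean_mode(exang, categorical=True)[-1])
```

Keeping the decision (`choose_strategy`) separate from the filling step makes it easy to plug a different imputer into the `"ann"` branch.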
The ANN used in this step had three layers, namely an input layer, one hidden layer and an output layer (Figure 1). The first step of the training stage is to normalize the input data. Only the continuous (numeric) attributes of the dataset are normalized, because these attributes have different ranges. The data are normalized by

$$x'_i = \frac{x_i - a}{b - a} \tag{1}$$

where $a$ and $b$ are the lower and upper bounds of the attribute's range.

Figure 1. Artificial Neural Network with 3 Layers

Next, initialize the weights and biases that connect each input in Figure 1 to the hidden layer. Then calculate the input value of each hidden unit using Equation (2), where $z_{\mathrm{in},j}$ is the input of the j-th neuron in the hidden layer.

$$z_{\mathrm{in},j} = \sum_{i=1}^{n} w_{ji} x'_i + b_j \tag{2}$$

Here $w_{ji}$ is the weight between the i-th neuron in the input layer and the j-th neuron in the hidden layer, and $b_j$ is the bias used to calculate $z_j$. After that, calculate the activation value of each hidden neuron, $z_j = \phi_h$, which is passed on to the output layer ($y$), with the sigmoid function as the activation function in Equation (3). The next step is to calculate the input value $y_{\mathrm{in},k}$ of the output layer using Equation (4).

$$z_j = \phi_h = \frac{1}{1 + e^{-z_{\mathrm{in},j}}} \tag{3}$$

$$y_{\mathrm{in},k} = \sum_{j=1}^{p} v_{kj} z_j + b_k \tag{4}$$

Here $v_{kj}$ is the weight between the j-th neuron in the hidden layer and the k-th neuron in the output layer, and $b_k$ is the bias used to calculate $y_k$. After that, apply the activation function $\phi_o$ to the output-layer input $y_{\mathrm{in},k}$ to obtain the output $y_k$ using Equation (5).

$$y_k = \phi_o = \frac{1}{1 + e^{-y_{\mathrm{in},k}}} \tag{5}$$

Perform the backpropagation of error stage by calculating the unit error factor $\delta_k$ from the error at the output layer for each output $y_k$ with Equation (6), where $f'$ is the derivative of the activation function.

$$\delta_k = (t_k - y_k)\, f'(y_{\mathrm{in},k}) = (t_k - y_k)\, y_k (1 - y_k) \tag{6}$$

$\delta_k$ is the error unit that is used to change the layer weights, with $t_k$ being the k-th output target.
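The second equality in Equation (6) relies on a property of the sigmoid function that the text does not derive. A short derivation, assuming $f$ denotes the sigmoid used in Equations (3) and (5), is:

```latex
\begin{align*}
f(x)  &= \frac{1}{1 + e^{-x}}, \\
f'(x) &= \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}
       = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
       = f(x)\,\bigl(1 - f(x)\bigr), \\
f'(y_{\mathrm{in},k}) &= \phi_o\,(1 - \phi_o).
\end{align*}
```

The same identity gives $f'(z_{\mathrm{in},j}) = z_j (1 - z_j)$ for the hidden-layer error in Equation (10), so no derivative has to be evaluated separately once the activations are known.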
The next step is to calculate the weight change $\Delta v_{kj}$ in Equation (7), which is used to update the weight $v_{kj}$ between the hidden layer and the output layer from the unit error $\delta_k$ and the hidden activation $z_j$ calculated previously, with learning rate $\alpha = 0.1$.

$$\Delta v_{kj} = \alpha \delta_k z_j \tag{7}$$

Calculate the bias change $\Delta b_k$, which is used to update the bias $b_k$ at the output layer, from the learning rate and the unit error $\delta_k$ (Equation (8)), and calculate the unit error $\delta_{\mathrm{in},j}$ that is propagated from the output layer to the hidden layer using Equation (9).

$$\Delta b_k = \alpha \delta_k \tag{8}$$

$$\delta_{\mathrm{in},j} = \sum_{k=1}^{m} \delta_k v_{kj} \tag{9}$$

Next, calculate the hidden unit error $\delta_j$ in the hidden layer using Equation (10) and the weight change $\Delta w_{ji}$ in Equation (11), which is then used to update the weight $w_{ji}$.

$$\delta_j = \delta_{\mathrm{in},j}\, f'(z_{\mathrm{in},j}) \tag{10}$$

$$\Delta w_{ji} = \alpha \delta_j x'_i \tag{11}$$

For the bias, calculate the change $\Delta b_j$ from the unit error $\delta_j$ of the hidden layer (Equation (12)). After that, update the weights of the hidden layer with Equation (13) and the weights of the output layer with Equation (14).

$$\Delta b_j = \alpha \delta_j \tag{12}$$

$$w_{ji}(\mathrm{new}) = w_{ji}(\mathrm{old}) + \Delta w_{ji} \tag{13}$$

$$v_{kj}(\mathrm{new}) = v_{kj}(\mathrm{old}) + \Delta v_{kj} \tag{14}$$

Update the biases by using Equation (15) to obtain the new bias value $b_j(\mathrm{new})$ in the hidden layer and Equation (16) to obtain the new bias value $b_k(\mathrm{new})$ in the output layer.

$$b_j(\mathrm{new}) = b_j(\mathrm{old}) + \Delta b_j \tag{15}$$

$$b_k(\mathrm{new}) = b_k(\mathrm{old}) + \Delta b_k \tag{16}$$

The calculation steps of the training phase are carried out on all input data until the weights no longer change significantly or until the number of epochs specified for each attribute is reached.

In the testing phase, the weights generated by the ANN were applied to the testing data to test the performance of the ANN.
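The training pass described by Equations (2) to (16) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' code: the names `sigmoid`, `forward` and `train_step` are hypothetical, weights are stored as nested lists `w[j][i]` and `v[k][j]`, one record is processed per update, and both bias updates use the standard form Δb = αδ, as in Equation (12). The input `x` is assumed to be already normalized with Equation (1).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w, b_hidden, v, b_out):
    """Equations (2)-(5): hidden activations z and outputs y."""
    z = [sigmoid(sum(w[j][i] * x[i] for i in range(len(x))) + b_hidden[j])
         for j in range(len(w))]
    y = [sigmoid(sum(v[k][j] * z[j] for j in range(len(z))) + b_out[k])
         for k in range(len(v))]
    return z, y


def train_step(x, t, w, b_hidden, v, b_out, alpha=0.1):
    """One backpropagation update in place; returns the squared error
    measured before the update."""
    z, y = forward(x, w, b_hidden, v, b_out)
    # Equation (6): output-layer error, using f'(y_in) = y (1 - y).
    delta_k = [(t[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(len(y))]
    # Equations (9)-(10): hidden-layer error, computed with the old v.
    delta_j = [sum(delta_k[k] * v[k][j] for k in range(len(y)))
               * z[j] * (1.0 - z[j]) for j in range(len(z))]
    # Equations (7), (14) and (16): output-layer weights and biases.
    for k in range(len(y)):
        for j in range(len(z)):
            v[k][j] += alpha * delta_k[k] * z[j]
        b_out[k] += alpha * delta_k[k]
    # Equations (11), (12), (13) and (15): hidden-layer weights and biases.
    for j in range(len(z)):
        for i in range(len(x)):
            w[j][i] += alpha * delta_j[j] * x[i]
        b_hidden[j] += alpha * delta_j[j]
    return sum((t[k] - y[k]) ** 2 for k in range(len(y))) / len(y)


# Toy run: 2 inputs, 2 hidden neurons, 1 output, arbitrary start weights.
w = [[0.1, -0.2], [0.3, 0.4]]
b_hidden = [0.0, 0.0]
v = [[0.2, -0.1]]
b_out = [0.0]
errors = [train_step([0.2, 0.8], [1.0], w, b_hidden, v, b_out)
          for _ in range(200)]
print(errors[0], errors[-1])
```

Computing `delta_j` before `v` is modified matters: Equation (9) propagates the error through the weights that produced the current output, not through the freshly updated ones.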
The required steps were to take the final weights from the training stage, classify the testing data with Equations (2) to (5), and compare the resulting labels with the original labels of the data. The comparison is used to measure the accuracy, specificity (SP) and sensitivity (SN) of the ANN. As a further measure of ANN performance for each attribute, the predicted label was compared with the original label to calculate the error, which is measured by the Mean Squared Error (MSE). The smaller the MSE, the better the network architecture is considered to be (Saputra et al., 2017). The MSE is calculated using Equation (17).

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{17}$$

Here $y_i$ is the original label of the i-th of the $n$ records in the dataset and $\hat{y}_i$ is the value predicted by the model.

2.2.2 Filling Imputation Result
The imputation results were put back into each attribute that had missing data, so that the dataset had no more missing data; in other words, the dataset was complete.

2.2.3 Analysis
At this stage, the complete dataset is tested by calculating the performance of classification methods in terms of accuracy, specificity (SP), sensitivity (SN) and ROC. The classification methods used on the complete dataset were Artificial Neural Network (ANN), Naïve Bayes, Support Vector Machine (SVM) and K-Nearest Neighbor (KNN). The completed data were divided by a percentage split, with 80% as training data and 20% as testing data. For Naïve Bayes, Gaussian Naïve Bayes was used because the dataset consists of categorical and continuous data. For SVM, the One-Against-One method was used because the label of the data was binary (2 labels). For KNN, k = 3 was used, a value chosen after trials with different k values (1-8).
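The evaluation measures named above can be computed from the original and predicted labels as in the sketch below. The paper does not spell out the formulas for accuracy, SN and SP, so the usual confusion-matrix definitions are assumed here. The function name `classification_metrics` is hypothetical, label 1 is assumed to be the positive class, both classes are assumed to occur in `y_true`, and the MSE is taken as the mean of the squared differences.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (SN), specificity (SP) and MSE for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    n = len(pairs)
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),   # share of positives that are found
        "specificity": tn / (tn + fp),   # share of negatives that are found
        "mse": sum((t - p) ** 2 for t, p in pairs) / n,
    }


# Eight toy records with two misclassifications.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
```

For 0/1 labels the MSE equals the misclassification rate, which is why a low MSE and a high accuracy go together in the binary case.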
At k = 3, KNN gave its highest performance, and the performance decreased for k > 3. The ANN at this stage used an architecture similar to that of the ANN at the missing data imputation stage (Figure 1).

Figure 2. The Stages in Proposed Method

3. RESULTS AND DISCUSSION
3.1 Attributes Deletion
For each of the 14 attributes in the dataset, the percentage of missing data relative to the total number of records was calculated. The attributes that had missing data and their percentages can be seen in Table 1. Table 1 shows that two attributes, Ca and Thal, had a very large amount of missing data. The Ca attribute had 98.97% missing data and Thal had 90.47% missing data. This means that these two attributes provided very little information, so they were dropped from the dataset. Meanwhile, 7 other attributes had missing data at lower levels, below 70%.

3.2 Imputation of Missing Data
3.2.1 Attributes with less than 1% Missing Data
From Table 1, the attributes with very little missing data (under 1%) were Trestbps, Restecg, Thalach and Exang, so a simple method was sufficient to impute their missing data, namely the mean (for continuous or numeric data) and the mode (for categorical data). Trestbps and Thalach were imputed with the mean method because each had only 1 missing value out of the 294 available records. By the mean method, the imputed value for Trestbps was x̂ = 132.58 and the imputed value for Thalach was x̂ = 139.13. For categorical attributes such as Restecg and Exang, the mode was used to impute the missing data. The Restecg attribute had 3 types of data, namely 0 for normal, 1 for ST-T wave abnormalities and 2 for left ventricular hypertrophy.
In the Restecg attribute, the most frequent value was the normal label (0), so the missing data in Restecg were filled with the label 0. The Exang attribute had two labels, 0 for no and 1 for yes. There were 204 records with label 0 and 89 records with label 1. The mode of Exang was therefore label 0, and the missing data were imputed with label 0.

3.2.2 Attributes with more than 1% Missing Data
For imputation with ANN, an attribute of numeric or continuous type must be converted to categorical form, because its values are used as classification labels at the training stage. There were three attributes with missing data above 1%, namely fasting blood sugar (Fbs), cholesterol (Chol) and Slope, so ANN imputation was applied to these attributes.

Table 1. Attributes, Values, Total and Percentage of Missing Data of Heart Disease Dataset

| Attributes | Values | Number of Missing Data | Percentage of Missing Data (%) |
|---|---|---|---|
| Age | In years | 0 | 0 |
| Sex | Male, Female | 0 | 0 |
| Chest Pain Type (cp) | Abnang, Angina, Asympt, Notang | 0 | 0 |
| Trestbps | 94.0 - 200.0 | 1 | 0.34 |
| Cholesterol (chol) | 126.0 - 564.0 | 23 | 7.82 |
| Fasting Blood Sugar (Fbs) | True, False | 8 | 2.72 |
| Resting ECG (restecg) | Norm, Hyp, Abn | 1 | 0.34 |
| Max Heart Rate (thalach) | 99.0 - 103.0 | 1 | 0.34 |
| Exercise Induced Angina (exang) | True, False | 1 | 0.34 |
| Oldpeak | 0.0 - 6.2 | 0 | 0 |
| Slope | Down, Flat, Up | 190 | 64.4 |
| Number of Vessels Colored (ca) | 0.0 - 3.0 | 291 | 98.97 |
| Thal | Normal, Rever, Fixed | 266 | 90.47 |
| Diagnosis of heart disease (Num) | Healthy, Sick | 0 | 0 |

The Fbs attribute had 8 missing data and two categories: 1 to state that the patient had blood sugar > 120 mg/dl and 0 to state that the patient's blood sugar was < 120 mg/dl. The cholesterol (Chol) attribute was the patient's cholesterol in mg/dl, and the attribute had 23 missing data.
Chol had a continuous data type, so before its missing values could be predicted with the Artificial Neural Network (ANN), its data had to be converted into categories that serve as label data. The Chol attribute was categorized into 3 labels: very high (0): > 200 mg/dl, high (1): 160-200 mg/dl, normal (2): < 160 mg/dl (Rahakbauw et al., 2016). The Slope attribute represents the slope of the ST segment at peak. It had three labels, namely 1: up, 2: flat, 3: down. Missing data on the Slope attribute amounted to 114 data.

ANN imputation was applied to each attribute (Fbs, Chol and Slope) to predict its missing data. The imputation results were measured by the accuracy, SP, SN and MSE obtained for each attribute. The results of these measurements can be seen in Figure 3. Figure 3 shows that ANN worked very well in imputing missing data on the Fbs attribute, with an accuracy of 95.46%, SP of 97.69%, SN of 62.5% and MSE of 0.06. On the Chol attribute the results were quite good, with an accuracy of 75.75% and an MSE of 0.0765, but the specificity and sensitivity were still low, below 60%. The results of ANN imputation on the Slope attribute were quite low: the accuracy and sensitivity were still below 60%, although the specificity was quite good at 74.62%.

Figure 3. Comparison of Accuracy, Specificity (SP), Sensitivity (SN) and MSE for Imputation

The results of attribute deletion and imputation were then applied to the heart disease dataset. The data obtained from handling the missing data were returned to the initial dataset. Thus, the new dataset had no missing data, and the total data used for classification were 294 patient records with 12 attributes, of which 11 attributes were inputs and 1 attribute (Num) was the label. The label Num contained 2 categories, 0 for healthy and 1 for sick.
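The conversion of the continuous Chol attribute into class labels, as described in Section 3.2.2, can be sketched as below. The function name `chol_label` is hypothetical, and the handling of the exact boundary values (160 and 200 mg/dl are both placed in the "high" class) is an assumption read from the ranges "> 200", "160-200" and "< 160" quoted in the text.

```python
def chol_label(chol_mg_dl):
    """Map a cholesterol value in mg/dl to the label used for ANN imputation."""
    if chol_mg_dl > 200:
        return 0  # very high
    if chol_mg_dl >= 160:
        return 1  # high
    return 2      # normal


# Labels for a few example cholesterol values.
print([chol_label(c) for c in (126.0, 160.0, 200.0, 564.0)])
```

Only records with an observed cholesterol value would be labeled this way for training; the records with missing values are the ones whose label the network has to predict.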
The imputed data were then entered into the dataset. The next process was to apply classification methods to the heart disease dataset to see the effect of handling missing data with deletion and imputation on the classification process.

3.3 Filling in Imputation Missing Data to Dataset
The data obtained from the missing data handling techniques and methods were returned to the initial dataset. Thus, the new dataset had no missing values. After the imputation results had been filled into the dataset, the next process was to classify heart disease on the new dataset to see the effect of handling missing data on the classification of heart disease. The total data used for classification were 294 patient records with 12 attributes, of which 11 attributes were inputs and 1 attribute, namely Num, was the label. Num contained 2 categories, 0 for healthy and 1 for sick.

Table 2. Comparison of Research Results on The Proposed Method with Previous Research

| The Handle Missing Data Method | Data set | Prediction Method | SN | SP | Accuracy | ROC |
|---|---|---|---|---|---|---|
| A hybrid Bayesian network and tensor factorization approach (Vazifehdan et al., 2019) | Breast Cancer | C45 | 78.55 | 92.83 | 89.29 | |
| Fuzzy C-Means and Particle Swarm Optimization (Salleh and Samat, 2017) | Framingham Heart | Decision Tree | - | 0.846 | 86.3 | 0.83 |
| Not handling missing data (Apurb et al., 2020) | Heart Disease | Naïve Bayes | 82.3 | 0.845 | 81.97 | - |
| ANN and T-test attribute selection (Poolsawad et al., 2012) | Heart Failure | ANN | 76.8 | 0.803 | - | - |
| ANN and nonlinear gain analysis attribute selection (Poolsawad et al., 2012) | Heart Failure | ANN | 69.5 | 0.769 | - | - |
| SVM Imputation (Al Khaldy and Kambhampati, 2016) | Cleveland Heart Failure | Random Forest | 97.1 | 48.51 | 84.97 | - |
| Chaos firefly algorithm and rough sets based attribute reduction (Long et al., 2015) | Heart disease dataset | Interval Type-2 Fuzzy Logic System | 84.9 | 93.3 | 88.3 | |
| K-means Clustering (Purwar and Singh, 2015) | Wisconsin Breast Cancer | ANN | 99.91 | 99.54 | 99.39 | 1 |
| Artificial Neural Network (Hernández-Pereira et al., 2015) | Respiratory MIASOFT | Neural Network feed | 69.63 | 84.44 | 79.03 | 80 |
| Maximum Likelihood (Misir and Samanta, 2017) | Hungarian data set | Batch backpropagation | 36.51 | 99.86 | 99.86 | - |
| Fuzzy K-Mean Clustering (Rahman and Davis, 2012) | Cardiovascular | Decision tree | 30 | 0.7 | 0.64 | - |
| A combination of split data and FKmeans (Vangipuram et al., 2020) | Internet of Thing | Random forest | 99 | 1 | 99.43 | 0 |
| Refined Mean Substitution (Somasundaram and Nedunchezhian, 2012) | Breast Cancer | Fuzzy C-means | 95.29 | 85.75 | 91.73 | |
| Proposed Method | Heart disease | Neural Network | 94.2 | 94.2 | 94.23 | 0 |
| Proposed Method | Heart disease | Naïve Bayes | 87.76 | 87.6 | 87.5 | 0 |
| Proposed Method | Heart disease | SVM | 90 | 90 | 90 | 84 |
| Proposed Method | Heart disease | KNN | 90.48 | 0.905 | 90 | 90 |

3.4 Discussion
In this study, the dataset produced by the proposed method was applied to several classification methods. The results of handling missing data with the Deletion, Mean, Mode and ANN techniques were analyzed to see whether they improved classification performance. To analyze the effect of missing data imputation, the new dataset was tested using the ANN, Naïve Bayes, Support Vector Machine (SVM) and KNN methods. The classification tests showed that missing data imputation increased the accuracy, sensitivity (SN) and specificity (SP) of each method. Figure 4 shows the increase in accuracy, SN and SP on the new dataset. From Figure 4, it can be seen that the highest increase was obtained in classification using the ANN.

Figure 4. Comparison of Accuracy, Specificity (SP) and Sensitivity (SN) in Prediction of Heart Disease Dataset

Comparing the classification results before and after handling missing data was not sufficient to evaluate the success of the proposed method without also comparing it with methods from other studies. The results of this research were therefore compared with several other studies. Table 2 shows several studies that used various techniques to deal with missing data on various datasets, either the UCI heart disease dataset or other datasets. In Table 2, the highest accuracy, SN and SP values were obtained by Purwar and Singh (2015), whose ROC value was also the highest, but the missing data in that study were only 16 out of 569 data. The study by Vangipuram et al. (2020) also had among the highest test results, but its total missing data were only 0.072% of 2050 total data from 12 attributes. The results of Al Khaldy and Kambhampati (2016) also showed a higher sensitivity than the proposed method, but their specificity was lower than that of the proposed method. The accuracy in Subbalakshmi et al. (2011) was also very high, but unfortunately the SN value obtained was very small. Table 2 shows that the accuracy of the proposed method was better than that of several other studies. The sensitivity and specificity of the proposed method were also good, being higher and more balanced than in the other studies. Several previous studies did not report the sensitivity obtained, and some others did not report the accuracy or ROC. From this comparison, it could be concluded that the proposed method was well suited to imputing missing data and could raise accuracy, sensitivity and specificity to very good values, above 85%, with different classification methods.

4. CONCLUSIONS
Missing data in this study were handled in 3 ways. The first was the deletion technique for attributes with more than 70% missing data. The second was the mean method for numeric or continuous data and the mode method for categorical data, applied to attributes with 1% or less missing data, namely Trestbps and Thalach by the mean method and Restecg and Exang by the mode method. The third was the Artificial Neural Network (ANN) for attributes with more than 1% missing data, namely the Fbs, Chol and Slope attributes. The resulting MSE shows that ANN was very good at imputing missing data on the Fbs and Chol attributes, where the MSE was relatively small. For the Slope attribute, however, ANN was less suitable because the MSE was still relatively large. The performance measurements showed good imputation results for the Fbs and Chol attributes by ANN, with accuracy above 75%. Unfortunately, the accuracy obtained for the Slope attribute was not very satisfying, at only 52%. Although the performance on the Slope attribute was not satisfactory in terms of either MSE or confusion matrix measurements, the imputation carried out with the proposed method improved the accuracy, sensitivity (SN) and specificity (SP) of classification on the UCI heart disease dataset. The classification performance of ANN, Naïve Bayes, SVM and KNN increased when the methods worked on the new dataset, in which the missing data problem had been handled, compared with their performance on the original dataset. This research can be developed further by applying other imputation methods for missing data, especially for the Slope attribute, which has a low accuracy value and a large MSE.

5.
ACKNOWLEDGEMENT

The authors thank the deep learning discussion group for all their support of this research.

REFERENCES

Al Khaldy, M. and C. Kambhampati (2016). Performance analysis of various missing value imputation methods on heart failure dataset. Proceedings of SAI Intelligent Systems Conference; 415–425
Apurb, Rajdhan, S. Milan, A. Avi, and R. Dundigalla (2020). Heart Disease Prediction Using Machine Learning Classifiers. International Journal of Advanced Science and Technology, 29(6); 1700–1707
Chauhan, H., V. Kumar, S. Pundir, and E. S. Pilli (2013). A comparative study of classification techniques for intrusion detection. Proceedings - 2013 International Symposium on Computational and Business Intelligence; 40–43
Choudhury, S. J. and N. R. Pal (2019). Imputation of missing data with neural networks for classification. Knowledge-Based Systems, 182; 1–9
Crone, S. F. and S. Finlay (2012). Instance sampling in credit scoring: An empirical study of sample size and balancing. International Journal of Forecasting, 28(1); 224–238
Desiani, A., S. Yahdin, A. Kartikasari, and I. Irmeilyana (2021). Handling the imbalanced data with missing value elimination SMOTE in the classification of the relevance education background with graduates employment. IAES International Journal of Artificial Intelligence, 10(2); 346–354
Eekhout, I., H. C. de Vet, J. W. Twisk, J. P. Brand, M. R. de Boer, and M. W. Heymans (2014). Missing data in a multi-item instrument were best handled by multiple imputation at the item score level. Journal of Clinical Epidemiology, 67(3); 335–342
El-Bialy, R., M. A. Salamay, O. H. Karam, and M. E. Khalifa (2015). Feature analysis of coronary artery heart disease data sets. Procedia Computer Science, 65; 459–468
García, S., J. Luengo, and F. Herrera (2015). Data Preprocessing in Data Mining. Springer
Hernández-Pereira, E. M., D. Álvarez-Estévez, and V. Moret-Bonillo (2015). Automatic classification of respiratory patterns involving missing data imputation techniques. Biosystems Engineering, 138; 65–76
Huang, J., J. W. Keung, F. Sarro, Y. F. Li, Y. T. Yu, W. Chan, and H. Sun (2017). Cross-validation based K nearest neighbor imputation for software quality datasets: an empirical study. Journal of Systems and Software, 132; 226–252
Jasoni, A. and W. Steinbrunn (2013). Heart Disease Data Set. UCI Machine Learning Repository
Jing, X. Y., F. Qi, F. Wu, and B. Xu (2016). Missing data imputation based on low-rank recovery and semi-supervised regression for software effort estimation. Proceedings - International Conference on Software Engineering; 607–618
Karim, M. N., C. M. Reid, L. Tran, A. Cochrane, and B. Billah (2017). Missing value imputation improves mortality risk prediction following cardiac surgery: an investigation of an Australian patient cohort. Heart, Lung and Circulation, 26(3); 301–308
Lan, Q., X. Xu, H. Ma, and G. Li (2020). Multivariable data imputation for the analysis of incomplete credit data. Expert Systems with Applications, 141; 1–12
Long, N. C., P. Meesad, and H. Unger (2015). A highly accurate firefly based algorithm for heart disease prediction. Expert Systems with Applications, 42(21); 8221–8231
Luengo, J., A. Fernández, S. García, and F. Herrera (2011). Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Computing, 15(10); 1909–1936
Malarvizhi, M. and A. Thanamani (2012). K-NN classifier performs better than K-means clustering in missing value imputation. IOSR Journal of Computer Engineering, 6(5); 12–15
Manimekalai, K. and A. Kavitha (2018). Missing value imputation and normalization techniques in myocardial infarction. ICTACT Journal on Soft Computing, 8(3); 1655–1662
Mehrotra, D. V., F. Liu, and T. Permutt (2017). Missing data in clinical trials: control-based mean imputation and sensitivity analysis. Pharmaceutical Statistics, 16(5); 378–392
Misir, R. and R. K. Samanta (2017). A Study on Performance of UCI Hungarian Dataset Using Missing Value Management Techniques. International Journal of Computer Sciences and Engineering, 5(3); 40–44
Moorthy, K., M. Saberi Mohamad, and S. Deris (2014). A review on missing value imputation algorithms for microarray gene expression data. Current Bioinformatics, 9(1); 18–22
Nishanth, K. J. and V. Ravi (2016). Probabilistic neural network based categorical data imputation. Neurocomputing, 218; 17–25
Pedersen, A. B., E. M. Mikkelsen, D. Cronin-Fenton, N. R. Kristensen, T. M. Pham, L. Pedersen, and I. Petersen (2017). Missing data and multiple imputation in clinical epidemiological research. Clinical Epidemiology, 9; 157–166
Poolsawad, N., L. Moore, C. Kambhampati, and J. G. Cleland (2012). Handling missing values in data mining - A case study of heart failure dataset. Proceedings - International Conference on Fuzzy Systems and Knowledge Discovery; 2934–2938
Purwar, A. and S. K. Singh (2015). Hybrid prediction model with missing value imputation for medical data. Expert Systems with Applications, 42(13); 5621–5631
Rahakbauw, D., F. K. Lembang, and Y. Taihuttu (2016). Analisis dan Prediksi Penyakit Jantung Koroner di Kota Ambon Menggunakan Jaringan Saraf Tiruan. Barekeng: Jurnal Ilmu Matematika dan Terapan, 10(2); 97–105
Rahman, M. M. and D. N. Davis (2012). Fuzzy unordered rules induction algorithm used as missing value imputation methods for k-mean clustering on real cardiovascular data. Lecture Notes in Engineering and Computer Science, 2197(1); 391
Resti, Y., E. S. Kresnawati, N. R. Dewi, N. Eliyati, et al. (2021). Diagnosis of Diabetes Mellitus in Women of Reproductive Age using the Prediction Methods of Naive Bayes, Discriminant Analysis, and Logistic Regression. Science and Technology Indonesia, 6(2); 96–104
Salleh, M. N. M. and N. A. Samat (2017). FCMPSO: An imputation for missing data features in heart disease classification. IOP Conference Series: Materials Science and Engineering, 226(1); 1–8
Saputra, W., M. Zarlis, R. W. Sembiring, D. Hartama, et al. (2017). Analysis resilient algorithm on artificial neural network backpropagation. Journal of Physics: Conference Series, 930(1); 12035
Shah, S. M. S., S. Batool, I. Khan, M. U. Ashraf, S. H. Abbas, and S. A. Hussain (2017). Feature extraction through parallel probabilistic principal component analysis for heart disease diagnosis. Physica A: Statistical Mechanics and its Applications, 482; 796–807
Shilaskar, S. and A. Ghatol (2013). Feature selection for medical diagnosis: Evaluation for cardiovascular diseases. Expert Systems with Applications, 40(10); 4146–4153
Silva-Ramírez, E.-L., R. Pino-Mejías, M. López-Coello, and M.-D. Cubiles-de-la Vega (2011). Missing value imputation on missing completely at random data using multilayer perceptrons. Neural Networks, 24(1); 121–129
Somasundaram, R. and R. Nedunchezhian (2012). Missing value imputation using refined mean substitution. International Journal of Computer Science Issues, 9(4); 306–313
Statlog (2004). Heart Data Set. UCI Machine Learning Repository
Stekhoven, D. J. and P. Bühlmann (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1); 112–118
Stewart, J., G. Manmathan, and P. Wilkinson (2017). Primary Prevention of Cardiovascular Disease: A Review of Contemporary Guidance and Literature. JRSM Cardiovascular Disease
Subbalakshmi, G., K. Ramesh, and M. C. Rao (2011). Decision support in heart disease prediction system using naive bayes. Indian Journal of Computer Science and Engineering, 2(2); 170–176
Ting, P. Y., T. Wada, Y. L. Chiu, M. T. Sun, K. Sakai, W. S. Ku, A. A. K. Jeng, and J. S. Hwu (2020). Freeway Travel Time Prediction Using Deep Hybrid Model–Taking Sun Yat-Sen Freeway as an Example. IEEE Transactions on Vehicular Technology, 69(8); 8257–8266
Tsai, C. F., M. L. Li, and W. C. Lin (2018). A class center based approach for missing value imputation. Knowledge-Based Systems, 151; 124–135
Vangipuram, R., R. K. Gunupudi, V. K. Puligadda, and J. Vinjamuri (2020). A machine learning approach for imputation and anomaly detection in IoT environment. Expert Systems, 37(5); 1–16
Vazifehdan, M., M. H. Moattar, and M. Jalali (2019). A hybrid Bayesian network and tensor factorization approach for missing value imputation to improve breast cancer recurrence prediction. Journal of King Saud University-Computer and Information Sciences, 31(2); 175–184
Zriqat, I. A., A. M. Altamimi, and M. Azzeh (2016). A comparative study for predicting heart diseases using data mining classification methods. International Journal of Computer Science and Information Security, 14(12); 868–879
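
As a supplementary illustration, the three-way handling rule summarized in the Conclusions (deletion above 70% missing data, mean or mode imputation at or below 1%, a predictive model otherwise) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the use of None to mark a missing value, and the toy columns are all hypothetical, and the ANN step is only flagged as "model" rather than implemented.

```python
from statistics import mean, mode
from typing import Dict, List, Optional, Set

# A column is a non-empty list of values; None marks a missing entry (assumption).
Column = List[Optional[float]]

DELETE_ABOVE = 0.70  # more than 70% missing: delete the attribute
SIMPLE_UP_TO = 0.01  # at most 1% missing: mean or mode imputation


def missing_fraction(values: Column) -> float:
    """Share of entries in the column that are missing."""
    return sum(v is None for v in values) / len(values)


def choose_strategy(values: Column, categorical: bool) -> str:
    """Pick a handling strategy from the column's share of missing data."""
    frac = missing_fraction(values)
    if frac == 0:
        return "keep"
    if frac > DELETE_ABOVE:
        return "delete"
    if frac <= SIMPLE_UP_TO:
        return "mode" if categorical else "mean"
    # The paper imputes these attributes with an ANN; this sketch only flags them.
    return "model"


def impute_simple(values: Column, strategy: str) -> Column:
    """Fill missing entries with the mean or the mode of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


def handle_missing(table: Dict[str, Column], categorical: Set[str]) -> Dict[str, Column]:
    """Apply the three-way rule column by column and return a new table."""
    result: Dict[str, Column] = {}
    for name, values in table.items():
        strategy = choose_strategy(values, name in categorical)
        if strategy == "delete":
            continue  # the attribute is dropped entirely
        if strategy in ("mean", "mode"):
            result[name] = impute_simple(values, strategy)
        else:
            # "keep" or "model": copied unchanged; "model" columns still
            # contain None and wait for a separate predictive imputer.
            result[name] = list(values)
    return result


if __name__ == "__main__":
    # Hypothetical toy columns with 100 rows each, not the UCI data.
    toy: Dict[str, Column] = {
        "trestbps": [120.0] * 99 + [None],              # 1% missing
        "restecg": [0.0] * 60 + [1.0] * 39 + [None],    # 1% missing, categorical
        "chol": [200.0] * 90 + [None] * 10,             # 10% missing
        "mostly_empty": [1.0] * 20 + [None] * 80,       # 80% missing
    }
    for name, values in toy.items():
        print(name, choose_strategy(values, name in {"restecg"}))
    print(sorted(handle_missing(toy, categorical={"restecg"})))
```

Keeping the thresholds as named constants makes the boundary cases explicit: an attribute with exactly 1% missing data goes to mean or mode imputation, and only attributes strictly above 70% are deleted, matching the wording of the rule.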