 Kurdistan Journal of Applied Research (KJAR) | Print-ISSN: 2411-7684 – Electronic-ISSN: 2411-7706 | kjar.spu.edu.iq Volume 2 | Issue 3 | August 2017 | DOI: 10.24017/science.2017.3.30 Improving accuracy of missing data imputation in data mining Nzar A. Ali Computer Dept. Cihan University Sulaimani, Iraq Nzar@mail.com Zhyan M. Omer Statistics and Informatics University of Sulaimani Sulaimani, Iraq zhyan.omar16@gmail.com Abstract: In fact, raw data in the real world is dirty. Each large data repository contains various types of anomalous values that influence the result of the analysis, since in data mining, good models usually need good data, databases in the world are not always clean and includes noise, incomplete data, duplicate records, inconsistent data and missing values. Missing data is a common drawback in many real-world data sets. In this paper, we proposed an algorithm depending on improving (MIGEC) algorithm in the way of imputation for dealing missing values. We implement grey relational analysis (GRA) on attribute values instead of instance values, and the missing data were initially imputed by mean imputation and then estimated by our proposed algorithm (PA) used as a complete value for imputing next missing value. We compare our proposed algorithm with several other algorithms such as MMS, HDI, KNNMI, FCMOCS, CRI, CMI, NIIA and MIGEC under different missing mechanisms. Experimental results demonstrate that the proposed algorithm has less RMSE values than other algorithms under all missingness mechanisms. Keywords: Data mining; Missing value; Missing value ; Data preprocessing , 1. INTRODUCTION Data mining (DM) known as Knowledge Discovery in Databases (KDD) is "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [1]. Databases in the world includes big amounts of data, every day millions of peoples send data over the web through social media, banks applications, governmental offices, mobile applications, university portals, etc. today with powerful storage devices and large network for computer connections, data growth exponentially. The organization of this big data and preprocessing them so that useful knowledge extracted automatically from them are lead to new branch of science known as Data Mining (DM). In data mining process the quality of results depends on the quality of the data, for that the data pre-processing is one of the important steps to reach clean and quality data and has grate effect on the success of the mining results Data pre-processing is one of the main step in the the Knowledge discovery in databases (KDD) process that decreases the complexity of the data and gives better conditions to subsequent analysis. Through this process the nature of the data is understood and the analysis of the data is performed more accurately and efficiently. The next important step is the data itself. Input data must be suitable in structure and format that suit each DM task perfectly. It is unrealistic to expect that data will be perfect after they have been extracted. Since good models usually need good data, a thorough cleansing of the data is an important step to improve the quality of data mining methods. Not only is the correctness, also the consistency of values important. Missing data can also be a particularly pernicious problem. Especially when the number of missing data is large, not all attributes (instances) with missing values can be deleted from the sample [3] 2.LITERATURE REVIEW  J. Tian , B. Yu , D. Yu , Sh. Ma , (2014) [4] ,In this paper a hybrid missing data completion method is proposed named “Multiple Imputation using Gray- system-theory and Entropy based on Clustering (MIGEC)”. First step the non-missing data records are distributed on several clusters. Then, the imputed missing value is estimated after multiple calculations by utilizing the information entropy of the proximal category for each incomplete records in terms of the similarity metric based on Gray System Theory (GST).  X.Y. Zhou , J. S. Lim , (2014) [5] , they studied a new method, the NB-EM (Naïve Bayesian-Expectation Maximization) algorithm, for handling missing values .The comparison between their method and traditional EM(Expectation Maximization) and non-substitution approaches for dealing with datasets containing randomly missing value is performed. They proof the most effective method, compared with the traditional EM algorithm, the NB-EM algorithm has a higher accuracy rate, which suggests that the NB-EM algorithm can obtain a better results on missing values in practice.  O. B. Shukur , M.H. Lee , (2015 ) [6] , In this paper, the hybrid artificial neural network (ANN) and autoregressive (AR) method is studied for finding the missing values. They use ANN for finding the missing values in wind speed data with nonlinear characteristic and they use AR model for determining the structure of the input layer for the ANN. They use Lisewise deletion https://www.omicsonline.org/searchresult.php?keyword=Data%20mining https://www.omicsonline.org/searchresult.php?keyword=Treatment%20of%20missing%20value before AR modeling to handle the missing values. A case study is carried out using daily Iraqi and Malaysian wind speed data. They compare their prosed imputation method with linear, nearest neighbor, and state space methods. The comparison has shown that AR-ANN outperformed the classical methods. As a result they conclude that the missing values in wind speed data with nonlinear characteristic can be impute more accurately using AR-ANN. Therefore, imputing the missing values using their algorithm tends to more accurate performance of time series modeling and analysis. 3.Missing Value Many databases in the world such as governmental and non- governmental contain missing values (MVs) in their attribute values. MVs is a value for attribute that was lost in the recording process. There are various reasons for their lost, such as manual data entry errors, equipment fail and incorrect measurements. The process of preparing clean data usually requires a preprocessing stage in which the data is prepared and cleaned, in order to be useful for the knowledge extraction process. The simplest way of dealing with MVs is to delete the record that contains them from the data set. However, this method is not practical when the data contains a large number of records with MVs which make bias during the inference. MVs make data analysis difficult. The occurrence of MVs can also lead to serious problems for researchers. In fact, unsuitable handling of the MVs in data analysis may found bias and can result in ambiguous conclusions being drawn from a research study, and can also limit the generalizability of the research findings [5] . Three types of problems are usually associated with MVs in DM [7] : a) Efficiency loss. b) Complexity in handling and analyzing the data. c) Unfairness resulting from differences between missing and complete data. 2.1: Missing Data Mechanisms The algorithm of missingness describes the relationship between the likelihood of a value being missing and the other variables in the data set. If Y perform the complete data that can be partitioned as (Yobs, Ymis) where Yobs is the observed part of Y and Ymis is the missing part of Y, and R be an indicator random variable (or matrix) indicating whether or not Y is observed or missing. Let R = 1 present a value which is observed and let R = 0 present a value which is missing. The statistical model for missing data is P (R\Y, Ø) where Ø is the parameter for the missing data process. The mechanism of missingness is determined by the dependency of R on the variables in the data set [8] . The following are different mechanisms of missingness [9] . i- Missing Completely At Random (MCAR) The first mechanism of missingness is a special case of MAR known as missing completely at random (MCAR). In this case, the mechanism of missingness is given by: P (R\Y, Ø) = P (R, Ø) ………………... (1) That is, the probability of missingness is not conditional on any observed or unobserved values in Y. One example of MCAR might be a computer malfunction that arbitrarily deletes some of the data values. ii- Missing At Random (MAR) The second mechanism of missingness is missing at random (MAR), this mechanism of missingness is given by: P (R\Y, Ø) = P (R\Yobs, Ø) ……….… (2) That is, the probability of missingness is only conditional on observed values in Y and not on any unobserved values in Y. A simple example of MAR is a survey where subjects over a certain age refuse to answer a particular survey question and age is an observed covariate. iii- Not Missing At Random (NMAR) The third mechanism of missingness is referred to as missing not at random (MNAR). This mechanism of missingness is given by: P (R\Y, Ø) = P (R\Yobs,Ymis, Ø) …….. (3) This mechanism observed when the conditions of MAR are break so that the likelihood of missingness is dependent on Ymis or some unobserved covariate. One instance of MNAR might be subjects who have an income above a certain value refusing to report an income in the survey. Here the missingness is dependent on the unobserved response, income. 2.2- Methods for Handling Incomplete Data (Missing Data) There are different methods and strategies exists to handle missing-data. Managing missing data can be classified into three categories: tolerance, ignoring and imputation-based procedures. a. Tolerance The simplest method point to preserve the source entries in the incomplete mode. It may be a functional and computationally low cost solution, intime it requires the techniques to work strongly even if the data quality stays low . b. Ignoring Missing data obscurity often refers to “Case Deletion”. It is the most repeatedly applied procedure nowadays, this method undergo from a loss of information in the insufficient cases and risk of alignment if the missing data is not MCAR and it is lying down to reduce the data quality. The strength lies in the ease of application: deleting the records with missing values is done in two ways [10] (i) List-wise/Case-wise Deletion(complete-case analysis): Delete the entire records including missing values. The main hitch of this method is that the dataset may lead to large loss of data, which may result in high inexactness in particular if the main dataset is itself too small or the MIGEC Algorithm Improved Steps Incomplete dataset Complete dataset Raw dataset preparation GSA- based on attribute Clusters Imputation via mean & classification All missing values imputation finished Full Datasets with non- missing values Rank instances by missing amount in decreasing order Convert incomplete dataset to binary dataset Combining complete dataset with imputed dataset FCM NO Yes number of records that contain missing value is too large. (i) Pairwise Deletion (available-case analysis): insufficient records are deleted on an analysis-by- analysis basis, Unlike list wise deletion which deletes records that have missing values on any of the variables under analysis, pair wise deletion only deletes the specific missing values from the analysis (not the entire records) such that any given record may participate to some analyses but not to others. c. Imputation “Imputation is the process of replacing missing data with substituted values”. Missing data create problems for analyzing data; imputation is seen as a way to avoid difficulty involved with list wise deletion of records that have missing values. That is, when one or more observations are missing for a case, statistical applications default to discard any records that has a missing value, which may introduce partiality in the results. Imputation preserves all records by replacing missing values with an estimated value based on other information. When all missing values have been imputed, the data set can then be analyzed using standard techniques for complete data 4: Proposed Algorithm Our proposed algorithm depends on improving existing algorithm (MIGEC) proposed by [4] after adding the following steps: 1- Converting incomplete data set to binary dataset. 2- GRA based on attribute instead of instance. 3- Attribute merging instead of instance merging. 4- After each missing elements of attributes imputed By mean imputation, next times we use the result of new imputation (imputation by PA) instead of mean to calculate imputation of reminders missing values of specific attribute. The procedure of the proposed algorithm is schematized in Figure (1). Figure (1): The Diagram of the Proposed Algorithm 4.1 Steps of Algorithm Let denote an incomplete dataset with attribute { } and instances. For each elements of incomplete dataset is defined by , it contains two parts: { } where is observed values and is missing values. [ ] https://en.wikipedia.org/wiki/Missing_data https://en.wikipedia.org/wiki/Listwise_deletion https://en.wikipedia.org/wiki/List_of_statistical_packages A binary matrix from incomplete dataset ( ) in which converting each observed values ( ) to one and each missing values ( ) to zero is produced, in this case becomes a matrix of missing data indicators, when this R matrix has the same number of rows and columns as the data matrix . { For example: A= [ ]  R= [ ] • NA=Missing Value (Not Available) After each time that one attribute has been assigned to the most nearest cluster via (GST) in our proposed improvement, finally one instance inserted to binary matrix and called it class (target): ( ) associates with the data matrix of the cluster. Then imputation technique starts as follows: Step 1. Calculate Expected information (Entropy) after partitioning each instance due to class. ∑ ………. (4) Where : • is the likelihood of event occurring . • is number of records • is number of clusters ( ) Information needed after split due to j ∑ | | (5) • • Step 2. The coefficient of difference for the records computed : ……….. (6) Performs the attendant contrast intensity of the f th parameter. The greater value of express the more important parameter. Step 3. Extract the coefficient of weight for the f th copy: ∑ …………………..… (7) Step 4.The mean mode substitution (MMS) is appoint to set missing values in the first imputation. The straightforward technique could implement well only when the data is normally separated. Then, estimate the j th attributive missing value of : ∑ After each missing elements of attributes imputed by mean imputation, next times the result of new imputation (imputation by PA algorithm) used instead of mean to calculate imputation of reminders missing values of specific attribute. 5: Experimental Results In this section Experimental results of proposed algorithm for both Wine and simulated dataset are displayed and discussed, also comparison between both proposed algorithms with other previous techniques for dealing with missing data is described. 5.1 :Wine Data Set The first dataset that we depend on to implement our algorithm is the Wine dataset that we used in this paper is achieved from The UCI (University of California, Irvine) Machine Learning Repository database, this dataset contain 13 attributes and 178 instances. The purpose of selecting this dataset is to compare the efficiency of our algorithm with previous (MIGEC) algorithm implemented by (J. Tian , B. Yu , D. Yu , Sh. Ma) [4] . 5.2: Simulated data Data mining already work with massive quantities of data. For this reason we simulated data to know the performance of (proposed algorithm) with large amount of data. We used simple random samples of size 1000. We consider simulations under a normal distribution, this dataset contain 13 attributes and 1000 instances and all of them are numeric. The computer limitation (Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz) for implementing our algorithm doesn’t allow us to increase the amount of simulated data. We used (rnorm) command from (R programing language) to generate 1000 sample of data under normal distribution based on mean and standard deviation of Wine dataset . 5.3 Generating missingness: To introduce artificial missingness, we look for two important factors which may affect the imputation results: missing rate and missing data mechanism. Three different levels of missing rate were considered, i.e., 5%, 10% and 20% and three missing mechanisms were taken into consideration, namely MCAR, MAR and NMAR. For MCAR , In order to simulate missing values on attributes, the original datasets are run using a random generator and every data in the dataset have the same likelihood α to be missing, where α was the missing rate . “Nonparametric Missing Value Imputation using Random Forest” package from R programing language used to generate MCAR. http://uci.edu/ http://uci.edu/ http://uci.edu/ 0.13 0.14 0.15 0.16 0.17 5 10 15 20 25 30 35 40 45 50 R M S E Iterations 0.13 0.14 0.15 0.16 0.17 5 10 15 20 25 30 35 40 45 50 R M S E Iterations Simulating MAR was more challenging and it worked as follow: In case there is a complete dataset with two attributes ( ), where was the attribute in to which missing values were introduced, and was the attribute that affected the missingness of . Given a pair of attributes ), and missing rate α , First split the instances into two equal-sized subsets according to their values at and find the median of and then assigned all the instances into two subsets according to weather the instances have lower (or higher) values than the median at ( ) . After the splitting of instances, randomly selected one subset of instances and let their values at to be missing with the probability of 4α. The probability of 4α will result in a missing rate of 2α on the whole variable which is equivalent to have a missing rate of α on the two variables ) . For multi-attributes pair selection of attributes was based on high correlations among the attributes, different pairs of attributes were used to generate the missingness. Each attribute is paired with the one it is highly correlated to. The process of generating missing values by NMAR was similar to MAR. The only difference was that there was no need to split variables into pairs, NMAR produced missingness on every variable directly. For a given variable and specified missing rate α , first calculated the median of and then randomly let the values that are lower (or higher) than to be missing with probability of 2α . 5.4 :Performance Measure To evaluate the precision of various data imputation algorithms the Root Mean Square Error (RMSE) used in this paper . √ ∑ ̂ ……….…… (9) Where is the original value, ̂ is the predicted plausible value, is the total number of estimations and is standard deviation. The larger value of RMSE suggests the less accuracy that the algorithm holds. Before the comparative demonstrations, to capture the result of the data imputation accurately, it is requisite to select the optimum values for number of iteration (number of imputations) and clusters, by another expression mean that which clusters or iteration gives us minimum RMSE. 5.4.1 Number of Iteration Firstly we tested for number of iterations for each missingness mechanism (MCAR, MAR, NMAR) with 10 % missing rate and five clusters as initial value (cluster), for wine dataset. For MCAR: As seen in Figure (2), the RMSE declines to the least which is (0.1418) when number of iteration is 5 times. So it is the optimum iteration for MCAR. Figure (2): Checking optimality by number of Iterations (for MCAR) For MAR Figure (3), illustrate that the best number of iteration for MAR also is 5 which give minimum RMSE which is (0.1388). Figure (3): Checking optimality by number of Iterations (for MAR) For NMAR: Finally, for NMAR as appeared in Figure (4), Iteration (10) give us lower RMSE which is (0.1347) and the worst iteration, which yield the maximum RMSE is iteration (15). 0.13 0.14 0.15 0.16 0.17 5 10 15 20 25 30 35 40 45 50 R M S E Iterations 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 R M S E 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 R M S E Figure (4): Checking optimality by number of Iterations (for NMAR) 5.4.2 Number of Clusters Second step for checking optimality is number of clusters, we obtained it via (FCM) by applied it to complete dataset. we can match number of clusters that is affect the accuracy of our algorithm for imputation directly, since we used it for calculating classification therefore classification is a main part of imputing incomplete data in proposed algorithm. For each three types of missing mechanism (MCAR, MAR, NMAR), we checked for optimal number of clusters from 2 to 10 by using %10 missing rate. For MCAR: Figure (5), shows that when the whole data is mass into 7 groups, the RMSE fall to the minimum that is (0.1330). In contrast the worst value obtained when 2 clusters exist. Figure (5): Checking optimality by number of clusters (for MCAR) For MAR: In Figure (6), MAR performs best when data partitions into 2 groups, it`s RMSE is (0.1321). Figure (6): Checking optimality by number of clusters (for MAR) For NMAR: From Figure (7), results of RMSE for NMAR are between [0.12 - 0.15], minimum RMSE (0.12) yield when number of clusters is 8. Figure (7): Checking optimality by number of clusters (for NMAR) 6 Comparative experiments In investigation of making comparisons as extensively as possible, we select eight other approaches, which are MMS (Mean Mode Substitution), HDI (Hot Deck Imputation), KNNMI (K Nearest Neighbour Imputation with Mutual Information), FCMOCS (Fuzzy C-Mean based on Optimal Completion Strategy ), CRI (Clustering-based Random Imputation) , CMI (Clustering-based Multiple Imputation) , NIIA (The Non Parametric Iterative Imputation) and MIGEC (Multiple Imputation algorithm using Gray System Theory and Entropy based on Clustering) . After selecting optimum number of clusters and iterations for all three types of missingness, we compared proposed algorithm with various methods with varying missing rates by using RMSE (average of each RMSE) as displayed in Figures (8), (9) and (10). 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 R M S E 0.1 0.15 0.2 0.25 5% 10% 15% 20% R M S E Missing rate MMS HDI KNNMI FCMOCS NIIA CMI CRI MIGEC PA Figure (8): Comparison between proposed algorithm (PA) and other methods for imputation (MCAR) Figure (9): Comparison between proposed algorithm (PA) and other methods for imputation (MAR) Figure (10): Comparison between proposed algorithm (PA) and other methods for imputation (NMAR) Figures (8), (9) and (10) show some results that we would like to discuss as follows: i- Outcome results demonstrate that the proposed algorithm performs better than the other eight approaches under all missingness mechanisms at varying missing rates. ii- Different missing rate have different impacts on imputation accuracy. The RMSE increases with increasing missing proportion for all the methods approximately. This is understandable because with more missing rate introduced into the datasets, more information of data will be loss. However sometimes the nature of data, outlier and noise also effect on accuracy of imputation. iii-The worse RMSE achieved by methods are for MCAR mechanism, followed by MAR and MCAR mechanism.. 6: Conclusions and Future works 6-1: Conclusions The problem of incomplete data is one which researchers must handle it. Many researchers fail to consider missing values of varying natures in their analyses, treating them as a singular type or not considering the impact of the missing values at all. In this paper an extension algorithm based on MIGEC for dealing with incomplete data has been proposed. The experimental results show 1- Experimental results on wine dataset from University of California Irvine (UCI) repository illustrate the superiority of proposed algorithm to other imputation methods on accuracy of imputing missing data on three different missing types MCAR, MAR and NMAR. 2- The RMSE shows that our proposed algorithm has better results (namely, the minimal value of RMSE) than MIGEC algorithm, with average absolute difference beyond (0.025108). 3- When calculating GRA on attributes instead of instances we work with more homogeneous values in comparison with calculating GRA based on instances and as a result the attribute belong to proper cluster. 4- Increasing rate of missing records suffer the precision of the fulfillment in RMSE. It states that incomplete values negatively impact on the completion. 5- Proposed algorithm can handle missing values and perform better either with small or huge amount of the raw data, we can conclude that proposed algorithm remain stable with increasing the size of dataset which means our proposed algorithm is suitable for large data repositories. 6- Proposed algorithm reach results with less imputed iterations in comparison with other algorithms which means less run time needed in case of huge amount of data in data repositories. 7- The drawback of our proposed algorithm on MIGEC is appeared in cases when there is large amount of 0.1 0.15 0.2 0.25 5% 10% 15% 20% R M S E Missing Rate MMS HDI KNNMI FCMOCS NIIA CMI CRI MIGEC PA 0.1 0.15 0.2 0.25 5% 10% 15% 20% R M S E Missing Rate MMS HDI KNNMI FCMOCS NIIA CMI CRI MIGEC PA heterogeneity inside the attributes since GRA in our proposed algorithm depends on attribute values instead of instance values. This conclusion appears when we run the algorithm on difference simulated data. 6.2: Future Works 1- Working with data mining techniques need powerful computer to implement our work speedily and not restrict us. In this paper because of computer limitation we cannot increase the size of simulated dataset because it needs days to get results of proposed algorithm with vast size of dataset. 2- Hybrid proposed algorithm with another data mining or statistical techniques like (Neural Network, Nearest Neighbor, …). 3- Extending proposed algorithm to work with categorical attributes. 4- Distortion and noise has a great effect on the imputation techniques, while real-world data often contain much noise, therefore, another preprocessing algorithm can be implemented to clean the data before implementing PA. 5- Implementing different data mining algorithms such as association rule mining on PA and compare the results with other existing algorisms. Biography Nzar Ali: is an Associate Professor of Computer Science at the Cihan University, and his Ph.D. degree from the Sulaimaniya University. His research interests in data mining, spatial database and database indexing. He has published over 9 research papers and supervise one Phd and 6 Msc Students. . Zhyan Omer: is an Associate Lecture of Statistics and Informatic at the Sulaimaniya University and her MSc. degree from the Sulaimaniya University Her research interests include statistical data mining and missing value estimation . REFERENCE [1] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth ,(1996) “From data mining to knowledge discovery”, American Association for Artificial Intelligence, San Francisco, Vol. 17, No. 3. [2] R. Nisbet , J. Elder, G. Miner , (2009) “Handbook of Statistical Analysis and Data Mining Applications” . Academic Press, Boston. [3] J. Han and M. Kamber, (2011) “Data Mining: Concepts and Techniques”, Morgan Kaufmann,San Francisco . [4] J. Tian , B. Yu , D. Yu , Sh. Ma , (2014) “A hybrid multiple imputation algorithm using Gray System Theory and entropy based on clustering” Appl Intell 40:376–388, DOI 10.1007/s10489-013-0469- x, Springer Science+Business Media New York. [5] X.Y. Zhou , J. S. Lim , (2014) “Replace Missing Values with EM algorithm based on GMM and Naïve Bayesian” International Journal of Software Engineering and Its Applications Vol.8, No.5, pp.177-188. [6] O. B. Shukur, M.H. Lee , (2015) “Imputation of Missing Values in Daily Wind Speed Data Using Hybrid AR-ANN Method ”, Published by Canadian Center of Science and Education ,Modern, ISSN 1913-1844 E-ISSN 1913-1852, Applied Science; Vol. 9, No. 11 . [7] J.Barnard, X.Meng, (1999) “Applications of multiple imputation in medical studies: from aids to nhanes”, Stat. Methods Med. Res. 8(1), 17–36 . [8] J.A. Boyko, (2013) “Handling Data with Three Types of Missing Values” , Ph.D. Thesis , University of Connecticut . [9] R.J.A. Little, D.B. Rubin, (1987) “Statistical Analysis with Missing Data” , 1st edn. Wiley Series in Probability and Statistics, New York. [10] S .Zhang , (2011) “Shell-neighbor method and its application in missing data imputation ”, Appl Intell 35(1):123–133. http://www.aaai.org/ojs/index.php/aimagazine/issue/view/120