Microsoft Word - 07_IS_Albert_Hendro_Analisa Cluster K-means Kemiskinan Revised - KP-- 2 Application of K-Means Algorithm … (Albert V. Dian Sano; Hendro Nindito) 141  APPLICATION OF K-MEANS ALGORITHM FOR CLUSTER ANALYSIS ON POVERTY OF PROVINCES IN INDONESIA Albert V. Dian Sano1; Hendro Nindito2 1,2 Information Systems Department, School of Information Systems, Bina Nusantara University Jln. K.H. Syahdan No 9, Palmerah, Jakarta Barat, 11480 1albert_vds@yahoo.com; 2h_nindito@yahoo.com ABSTRACT The objective of this study was to apply cluster analysis or also known as clustering on poverty data of provinces all over Indonesia.The problem was that the decision makers such as central government, local government and non-government organizations, which involved in poverty problems, needed a tool to support decision-making process related to social welfare problems. The method used in the cluster analysis was k- means algorithm. The data used in this study were drawn from Badan Pusat Statistik (BPS) or Central Bureau of Statistics on 2014.Cluster analysis in this study took characteristics of data such as absolute poverty of each province, relative number or percentage of poverty of each province, and the level of depth index poverty of each province in Indonesia. Results of cluster analysis in this study are presented in the form of grouping of clusters' members visually. Cluster analysis in the study can be used to identify more quickly and efficiently on poverty chart of all provinces all over Indonesia. The results of such identification can be used by policy makers who have interests of eradicating the problems associated with poverty and welfare distribution in Indonesia, ranging from government organizations, non-governmental organizations, and also private organizations. Keywords: cluster analysis, k-means, poverty INTRODUCTION Poverty is one of the social problems and also become a challenge for many communities around the world to always find a solution. At the global level, data on poverty regarding the number of poor people is dominated by developing countries. However, in developed countries like the United States as well, there are still poor people. So poverty is everywhere universally to be a problem with the community and the world. In the national context, Indonesia, in the New Order era that began in the mid or late 1960s to 1996, the poverty rate in Indonesia decreased drastically due to strong economic growth with poverty alleviation programs. Unfortunately, the economic crisis in 1997/1998 hit the Indonesian economy and raised the amount of poverty in Indonesia sharply. The experience of poverty reduction in the past have shown many weaknesses, such as: (1) macro growth orientation without considering aspects of equity, (2) centralized policy, (3) more caricature than transformative, (4) positioning communities as objects rather than subjects, (5) orientation poverty alleviation tends to caricature and instantaneous than sustainable productivity, and (6) perspectives and solutions that are generic to the problems of poverty that exist regardless of diversity which exists. Because it is so diverse nature of the challenges that exist, the handling of the problem of poverty must touch the bottom of the source and root of the real problem, either directly or indirectly (Multifiah, 2011). 142   ComTech Vol. 7 No. 2 June 2016: 141-150 Inequality or inequality in poverty reduction is not an easy problem. Reducing inequality is more complex than just reducing poverty. In addition, it also has the potential imbalances bigger problem than the problem of poverty itself. Figures economic inequality is measured using the Gini index. Gini index figures are presented in the form of a number between 0 and 1. The higher the number of Gini index, the higher the degree of economic inequality. Gini index numbers are high is a serious social threat and could lead to social unrest. In some countries that have a Gini index above 0.5 tends to be social unrest and disintegration. Table 1 Gini Index Indonesia (BPS, 2015) INDONESIA 1996 1999 2002 2005 2007 2008 2009 2010 2011 2012 2013 0.355 0.308 0.329 0.363 0.364 0.35 0.37 0.38 0.41 0.41 0.413 To reduce drawbacks of the problems related to inequality in poverty reduction, then this study aims to represent the poverty map based on provinces groupings all over Indonesia. This study will apply cluster analysis or commonly called clustering. Results of this study are expected to facilitate the stakeholders to see the poverty map visually so that the decision makers associated with poverty alleviation programs can more quickly and easily set policies for provinces deserving to get higher priorities and to receive major attention in poverty alleviation programs. The problem is that the decision makers such as central government, local government and non-government organizations, which involve in poverty problems, need a tool to support decision- making process related to social welfare problems. Cluster analysis in this study will try to group provinces all over Indonesia into four clusters. This study will consist of two types of cluster analysis. The first is to perform cluster analysis based on a variable of the percentage of the number of poor people and a variable of depth's level of poverty. The second is to perform cluster analysis based on a variable of an absolute number of poor people and a variable of depth's level of poverty. Cluster analysis is an analysis aimed at grouping the data objects into clusters based on similar variables or characteristics. Data objects having high similarities would be put in the same cluster while those having low similarities or a big difference will be put into different clusters. Poverty has a diverse concept. World Bank defines poverty using purchasing power, i.e., US $2 per capita per day. Meanwhile, Badan Pusat Statistik or BPS (BPS-Statistics Indonesia) defines poverty based on the poverty line. According to BPS (2015), the poor are people who have an average monthly expenditure below the poverty line per capita. The poverty line, according to BPS, is the sum of the food poverty line (FPL) and Non-Food Poverty Line (NFPL). FPL is the spending on minimum food needs equivalent to 2100 calories per day per capita. Basic needs package of food commodities is represented by 52 kinds of commodities, such as grains, tubers, fish, meat, eggs, milk, vegetables, legumes, fruits, oils, fats, and others. NFPL is the minimum requirement for housing, clothing, health and education. Another definition of poverty by Bappenas (Purwanto, 2007) is a condition in which a person or a group of men and women who are unable to meet their basic rights to maintain and develop a dignified life. The basic rights of human beings include the fulfillment for food, clothing, health, education, employment, housing, clean water, land, natural resources and the environment, safe treatment or safe from the threat of violence and the right to participate in political and social life. Because this study will use a database of BPS in 2014, then it will automatically refer to the concept of poverty by BPS. Cluster analysis technique in this study will apply to some specific characteristics or attributes of data, that is, the absolute number of poverty, the relative number or percentage of poverty, and the Application of K-Means Algorithm … (Albert V. Dian Sano; Hendro Nindito) 143  level of poverty's depth index. The absolute number of poverty is the number of people living below the poverty line. The relative number is the percentage of population below the poverty line. The poverty's depth index is the average gap of expenditure of the poor to the poverty line. The higher the index value, the higher the gap between the average expenditure from the poverty line. "We are living in the information age" is a popular saying; however, we are living in an era of data. The data in terabytes or petabytes poured into our computer network, World Wide Web (www), and various data storage devices each day ranging from world business, community, science and engineering, medicine, and almost every other aspect of daily life. The explosive growth of the volume of existing data is the result of the process of computerization of our society and the rapid development of various devices the collection and storage of data which is terrific (Han & Kamber, 2012). The explosive widely available growth of data really makes us aware that we are in the era of data. Various reliable and versatile tools are needed to automatically reveal valuable information from the large-volume data and transform it into the organized knowledge. This need has led to the birth of data mining. The field is still young, dynamic and promising. Data mining has been and will continue to make great strides in our journey from the era of data into the information age to come (Han & Kamber, 2012). Data mining is the process of finding previously unknown patterns and trends in databases and using that information to build predictive models. Data mining provides a set of tools and techniques that can be applied to this processed data to discover hidden patterns and also provides healthcare professionals an additional source of knowledge for making decisions (Hossain et al., 2013) Data mining is a fun way to extract various kinds of patterns, which presents knowledge stored in large data sets implicitly and focuses on matters related to its feasibility, usefulness, effectiveness and scalability. Data mining can also be seen as a very important step in the process to find knowledge. Data is normally done through a pre-process data cleansing, data integration, selection and transformation of data and prepared for mining. Data mining can also be done on different types of databases and data storage, but the type of pattern is found determined by different types of functionality mining data such as descriptions, association, correlation analysis, classification, prediction, analysis of clusters, and so on (Tajunisha, 2010). The concept of data mining involves three steps, i.e., capturing and storing the data, converting the raw data into information and converting the information into knowledge. Data in this context comprises all the raw material an institution collects via normal operation. Capturing and storing the data is the first phase that is the process of applying mathematical and statistical formulas to "mine" the data warehouse (Kumar, 2011). Figure 1 Data mining and Knowledge Discovery Process of Database (Sources: Fayyad in Silwattananusarn & Tuamsuk, 2012) 144   B iterative Selection invalid d the data algorithm / Evaluat same pat ordinary C object to from the algorithm (Gothai C observat the objec set of clu & Kamb C 2011). A problems techniqu structure O data poin the simp other. T Euclidia I the other Based on Fi methods su n: Choosing data and inco into a form m that match tion: Interpre ttern and rep people. Clustering is ogether in a e clustering ms, two of w & Balasubra Cluster analy tion) into sev cts that are si usters resultin ber, 2012). Cluster analy Analysis of th s that are un ue used to sol e of the data t One importa nts. If a com ple Euclidean The distance an and City B In addition t r measureme T Mink Eucli City- Maha gure 1 above uch as the fo yes ng Data onsistent data m suitable to es the pattern et various pat petitive; trans s an importa cluster (or c process. Bu which most im amanie, 2012 ysis also calle veral subsets. imilar to each ng from the c ysis offers a he cluster ca nsupervised lve problems that has not b ant compone mponent of th n distance m between th Block or Man to the similar ents are show Table 2 Size of Measure kowski distance idean distance -block distance alanobis distanc e, the knowl ollowing (Fa relevant to th a; combine m o perform da n in d nature tterns into kn slating a varie ant method clusters) and ut there are mportant is, c 2). ed clustering Each of the h other, but v cluster analys useful way t an be regarde learning or t s with techniq been labeled ( ent of the clu he vector sam metric is suffi he two grou nhattan. rity and dissi wn in Table 2 f Similarity an es e ce (Sources ledge discov ayyad, Han, he task of a d multiple sour ata mining. of the data; nowledge by ety of pattern of data ware dissimilar o some specia clustering pe g is the proces ese subsets is very different sis such as cl to organize a ed as the mo the learning ques like this (Tayal & Ra ustering algo mple data is i icient to clas ups can be m imilarity of t 2 (Xu & Wun nd Dissimilari , where S i s:Xu & Wunsc ComTech ery process in Silwattan database ana ces of data. ( (4) Data Mi extract vario eliminating ns useful in t ehousing an object in oth al requireme erformance a ss of dividing a cluster, su t from the ob lustering can and present a ost popular t process und s, will certain aghuwanshi, orithm is a m in the same p ssify the data measured by the two types nsch, 2005). ity for Quantit Form is the within gro ch, 2005) h Vol. 7 No. 2 consists of s nanusarn & alyst. (2) Prep (3) Transform ining: Choos ous data patte irrelevant va terms that co nd data minin er cluster (o ents for sear and meaningf g a set of dat ch that the o bjects that are be referred t a complex da techniques a directed or u nly find a wa 2011). measure of t physical unit a instants tha y (Tayal & s of measure tative Variabl ms oup covariance 2 June 2016: several seque Tuamsuk, 2 processing: D mation: Tran sing the dat erns. (5) Inter arious pattern ould be under ng. It group r clusters) o rch results c ful cluster de ta objects (or objects in a c e in another c to as a cluster ataset (Wang and foremost unsupervised ay of dealing the distance t, then it is li at are simila Raghuwansh ement above les matrix 141-150 ential and 012): (1) Delete the nsforming ta mining rpretation ns and the rstood by ps similar r remove clustering escription object of cluster are cluster. A ring (Han & Song, t to solve . So each g with the between ikely that ar to each hi, 2011) , some of Application of K-Means Algorithm … (Albert V. Dian Sano; Hendro Nindito) 145  K-means is one of the learning algorithms undirected/unsupervised learning, the simplest used to solve various problems of the grouping. The procedure is by applying a simple and easy way to classify data that has been given to some clusters (such as clusters k) predefined (Tayal & Raghuwanshi, 2011). K-means algorithm will define the midpoint of the cluster from the average value of the points in the cluster. Steps in k -means algorithm can be explained as follows. Per all, the algorithm will select k (central cluster) at random from various objects in D (dataset), which respectively represent the center of the cluster at the beginning or the first time. For any other object, each object is assigned or grouped into clusters that are most similar or the most closely based on the Euclidean distance between the object and the center cluster. K-means clustering algorithm then iterates to improve or increase the separation distances or similarities in the cluster. For each cluster, this algorithm will calculate a new average using the objects are grouped into a cluster in the previous iteration. All objects will then be regrouped by using the average of the newly updated as the new cluster center. The iterations will continue until it reaches a stable grouping, which means that the clusters formed in the latest iteration are the same as the clusters formed in the previous iteration. K-means clustering procedure is summarized in Figure 2 below (Han & Kamber, 2012). Figure 2 Summary of Procedure Algorithm K-Means (Source: Han & Kamber, 2012) As with any other algorithms, k-means also has some advantages and disadvantages. Here are the advantages and disadvantages of k-means algorithm according to Tayal and Raghuwanshi (2011). The advantages are (1) k-means is a simple algorithm that has been adapted to many domains. (2) More automated than making threshold imply a manual of an image or images. (3) This is an algorithm that can be good candidates for use as a continuation of the work relates to vectors that have the characteristics feature or vague (fuzzy). On the other hand, the disadvantages are (1) though it can be demonstrated that the procedure will always end, k-means clustering algorithm does not always find the most optimal configuration, which is related to the global objective function. (2) This algorithm is also very sensitive to cluster centers randomly selected at the beginning. K-means algorithm can be run several times to reduce the impact on this problem. Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster. Input:  k: the number of clusters,  D: a data set containing n objects. Output: A set of k clusters. Method: 1) Arbitrarily choose k objects from D as the initial cluster centers; 2) Repeat 3) (re)assign each object to the cluster to which the object is the most similar, 4) based on the mean value of the objects in the cluster; 5) Update the cluster means, that is, calculate the mean value of the objects for 6) Each cluster; 7) Until no changes; 146   ComTech Vol. 7 No. 2 June 2016: 141-150 Figure 3 Traditional k-means Algorithm (Source: Oyelade, et al., 2010) METHODS The method applied in this study generally includes three main stages: (1) data collection, (2) data pre-processing, and (3) data mining. In data collection, data collected in this study was taken from BPS-Statistics Indonesia's (Badan Pusat Statistik) website (www.bps.go.id). Next, data pre-processing is the most important task in data mining. This stage is often said to take almost 80% of the total time or task in data mining. Techniques and methods to be applied in this stage must be precise and correct. Data pre-processing used in this study is based on the theory by Jiawei Han and Michelin which includes: first, data cleaning: filling in the missing values, repairing data errors, identify or remove outliers, and fixing inconsistent data. Second, data integration: merging related data from tables, databases, cube, or files. Third, data selection: Selecting data only related to the process of analysis. The benefit of this step is to reduce less important or less relevant data in data mining processes. Fourth, data transformation: Transforming data to support the process of analyzing the data that will be used. Fifth, data mining: This stage is the primary stage of the entire task in this study. As with the data collection as well as data pre-processing, this stage also applies the theory by Jiawei Han and Michelin which include: (1) Data Mining, this stage is the stage of the implementation of the modeling used in data mining. In this study, the model applied is k-means cluster analysis. (2) Pattern Evaluation, this is an evaluation of the pattern that has been processed. (3) Knowledge Presentation, this is a presentation of the results of the data mining process. RESULTS AND DISCUSSIONS The application of cluster analysis in this study applied four numbers of clusters. The analysis applies to two types of measurement of the number of poor people in all provinces all over Indonesia. The first one is a measurement based on the percentage of poor people or also known as the relative 1. MSE = largenumber; 2. Select initial cluster centroids {m j }j K = 1; 3. Do 4. OldMSE = MSE; 5. MSE1 = 0; 6. For j = 1 to k 7. mj = 0; nj = 0; 8. Endfor 9. For i = 1 to n 10. For j = 1 tok 11. Compute squared Euclidean distance d2(xi, mj); 12. Endfor 13. Find the closest centroid mj to xi 14. mj = mj + xi, nj= nj +1; 15. MSE1 = MSE1 + d2(xi, mj); 16. Endfor 17. For j = 1 to k 18. nj = max ( nj , 1) ; mj = mj / nj ; 19. Endfor 20. MSE = MSE1; while (MSE