CAUCHY –Jurnal Matematika Murni dan Aplikasi 
Volume 5(1)(2017), Pages 1-7 
p-ISSN: 2086-0382; e-ISSN: 2477-3344 
 

Submitted: 21 December 2016 Reviewed: 31 January 2017 Accepted: 2 November 2017 
DOI: http://dx.doi.org/10.18860/ca.v5i1.3862 
 

Applied Hierarchical Cluster Analysis with Average Linkage Algoritm 

Cindy Cahyaning Astuti1, Rahmania Sri Untari1 

1Muhammadiyah University of Sidoarjo, Sidoarjo, Indonesia 
 

Email: cindy.cahyaning@umsida.ac.id, rahmania.sriuntari@umsida.ac.id  

ABSTRACT 

This research was conducted in Sidoarjo District where source of data used from secondary data contained in 
the book "Kabupaten Sidoarjo Dalam Angka 2016" .In this research the authors chose 12 variables that can 
represent sub-district characteristics in Sidoarjo. The variable that represents the characteristics of the sub-
district consists of four sectors namely geography, education, agriculture and industry. To determine the 
equitable geographical conditions, education, agriculture and industry each district, it would require an 
analysis to classify sub-districts based on the sub-district characteristics. Hierarchical cluster analysis is the 
analytical techniques used to classify or categorize the object of each case into a relatively homogeneous group 
expressed as a cluster. The results are expected to provide information about dominant sub-district 
characteristics and non-dominant sub-district characteristics in four sectors based on the results of the cluster 
is formed. 
 

Keywords: hierarchical cluster analysis, sub-district  

INTRODUCTION 

Sub-district characteristics is  general overview of sub-district that need to be developed 

optimally, thus providing a positive impact on the sub-district progress. Sub-district characteristics 

divided into several sectors, geography, education, government, social, agriculture, industry, 

commerce, communications, finance and prices and regional income. Sidoarjo District is divided into 

18 sub-districts that are Buduran Sub-district, Candi Sub-district, Porong Sub-district, Krembung 

Sub-district, Tulangan Sub-district, Tanggulangin Sub-district, Jabon Sub-district, Krian Sub-district, 

Balongbendo Sub-district, Wonoayu Sub-district, Tarik Sub-district, Prambon Sub-district, Taman 

Sub-district, Waru Sub-district, Gedangan Sub-district, Sedati Sub-district and Sukodono Sub-

district. Potential Sidoarjo District evenly spread over the 18 Sub-districts and is reflected on the 

sub-district characteristics. In order equitable development to improve people's welfare,  goverment 

of Sidoarjo Ditrict collaboration with BPS Sidoarjo District  published "Sidoarjo Dalam Angka  2016" 

which contains sub-district characteristics in Sidoarjo. This book is expected to provide benefits for 

the implementation of development as well as helping to evaluate and supervise development 

outcomes of Sidoarjo District. Data of  sub-district characteristics in "Sidoarjo Dalam Angka 2016"  

has been analyzed only use descriptive analysis, therefore the authors consider that these data have 

a lot of information if further analysis. The focus of research using hierarchical cluster analysis are 

mailto:cindy.cahyaning@umsida.ac.id
mailto:rahmania.sriuntari@umsida.ac.id


Applied Hierarchical Cluster Analysis with Average Linkage Algoritm 

Cindy Cahyaning Astuti 2 

the four sectors that represent sub-district characteristics, namely geography, education, agriculture 

and industry. The results are expected to provide information about dominant sub-district 

characteristics and non-dominant sub-district characteristics in four sectors based on the results of 

the cluster is formed. 
 

METHODS 

Cluster analysis is a technique used to classify objects into relatively homogeneous groups, 

called clusters. Objects in each group tend to resemble each other and differ greatly with objects from 

other clusters. Cluster analysis using the principal components analysis can use interval and ratio-

scaled data. Cluster analysis is also called classification analysis or taxonomy numerical analysis 

because it deals with clustering procedure where each object is only fit into one cluster only, to avoid 

overlapping [1]. 

There are several terms used dapam cluster analysis. The terms include the following [2]: 

• Aglomeration Schedule, is to schedule that provides information about the object or the case will be 

merged or entered in clusters on each stage, in a process of hierarchical cluster analysis. 

• Cluster Centroid, is the average value of all variable objects or cases in a particular cluster. 

• Cluster centers, is the starting point of the start of the grouping in non-hierarchical cluster analysis. 

• Cluster Membership, Membership is showing the clusters, where each object or a case of being 

members. 

• Dendogram, is a graphical tool to present the results of cluster analysis, or upright vertical lines 

represent the merged cluster together. Line position on the scale indicates the distance which were 

merged cluster. Dendogram should be read from left to right. 

Terms of normality, linearity, and homoscedasticity highly considered in the multivariate 
analysis, but not in the cluster analysis. In cluster analysis, researchers should be more concerned 
with how large a sample representative in population and the presence or absence of 
multicollinearity. The first step in formulating cluster analysis of the problem of defining the 
variables used for basic grouping. Then measure the exact distance should be selected. The distance 
measure determine similarity or dissimilarity of the object to be grouped. To determine the number 
of clusters requires subjective judgment of the researchers, in addition based on the calculation 
results objectively. Cluster obtained should be interpreted and expressed in the variables used for 
the basic formation of clusters. The equation commonly used for calculating the distance between 
the item X to item Y is a Euclidean distance. The equation used to calculate the Euclidean distance is 
as follows [3]: 
 

𝑑(𝑥, 𝑦) = √(𝑥1 − 𝑦1)
2 + (𝑥2 − 𝑦2)

2 + ⋯ + (𝑥𝑝 − 𝑦𝑝)
2 

 
            = √(𝑋 − 𝑌)′(𝑋 − 𝑌)                 (1) 

 
There are two types of cluster analysis is hierarchical cluster analysis and non-hierarchical 
cluster analysis. In the method of hierarchical cluster there are two basic types namely agglomerative 
(concentration) and divisive (the spread). In agglomerative method, any object or observation is 
considered as a separate cluster. In the next stage, the two clusters which has some similarities are 
combined into a new cluster and so on. Instead, the divisive methods, from a large cluster consisting 
of all objects or observation. Furthermore, the object or observation that the highest value does not 
resemble separated and so on [4]. 

There are five kinds of algorithms to form a group with a hierarchical method, namely [5]: 
• Single-Linkage  


Applied Hierarchical Cluster Analysis with Average Linkage Algoritm 

Cindy Cahyaning Astuti 3 

Single linkage method defines the similarity between clusters based on the shortest distance from 
any object in one cluster to any other object. If there is a third object which has the closest distance 
to one of the objects in the group that has been formed, then the object can be merged into the 
group. This process continues to form a single group. This method is the most flexible method 
aglomeratif. 

• Complete Linkage  

This method is basically the same as the single linkage method. It's just the distance used is the 
maximum distance. Reasons have the maximum distance is that objects that have little in common 
can be connected. 

• Average-Linkage  

Average linkage method also has similarities with two single linkage method. Only the distance 
used is the average distance of all objects in a group with other objects outside the group. 
Grouping objects with one another based on the average minimum. Because using the average, 
then this method is considered more stable, and no bias. 

• Centroid method  

The distance used in this method is the distance between the center point of the two groups. 
Where is the center point of the group is the middle value of each variable object in one group. In 
this method each time a new group is formed, then the center point changes. The advantage of 
this method is the small effect of outliers in the formation of the group. 

• Ward's Method 

In the Ward method, distance calculations based on the sum of the squares between the two 
groups for all variables. This method can be used if the number of observations is not too large. 
In general, the distance used is a Euclidean distance squared. 

The opposite of hierarchy cluster analysis method is non-hierarchy cluster analysis. In this 
method does not include the "treelike construction" but through the process by placing objects into 
the cluster at once, forming a particular cluster. The first step in the method is to choose a cluster 
nonhirarki as initial cluster centers, and all objects within a certain distance placed on cluster 
formation. Then select the next cluster and the placement of objects continued until all are placed. 
The objects can be placed again if the distance was closer to the other cluster than the cluster of 
origin.Non-hierarchy cluster analysis methods associated with the K-means cluster, and there are 
three approaches used to place each observation on a single cluster. Such approaches include the 
following [6]: 
• Sequential Threshold, Threshold Sequential Method start by selecting one cluster and placing all 

the objects that are at a certain distance into it. If all objects that are at a certain distance has been 

entered, then the second cluster selected and put all objects within a certain thereto. Then the 

third cluster is selected and the process continues as before. 

• Parallel Threshold, Threshold Parallel method is the opposite of the first approach by selecting a 

number of clusters simultaneously and placing objects into clusters that have the distance 

between the nearest face. In the process, the distance between the face can be specified to include 

some objects into clusters -cluster. Also some variation on this method, the rest of the objects are 

not grouped if it is outside a certain distance of a cluster. 

• Optimization, the third method is similar to the previous method except that this method makes 

it possible to put objects back into the cluster closer.  

In this research used secondary data sourced from “Kabupaten Sidoarjo Dalam Angka 2016”. 
Unit of observation in this research was 18 sub-district in Sidoarjo District. In this research the 
authors chose 12 variables that can represent sub-district characteristics. Operational definition of 
each variable will be described as 
𝑋1: Surface area (km2) 
𝑋2: Total population 
𝑋3: The number of national and private elementary school 
𝑋4: The number of national and private elementary school students 


Applied Hierarchical Cluster Analysis with Average Linkage Algoritm 

Cindy Cahyaning Astuti 4 

𝑋5: The number of national and private junior high school 
𝑋6: The number of national and private junior high school student 
𝑋7: The number of national and private high school 
𝑋8: The number of national and private high school students 
𝑋9: Harvest land area (Ha) 
𝑋10: Paddy Production (Kw) 
𝑋11: The number of large and small industries 
𝑋12: The number of workers in large and small industries  
The method of analysis in this research is. 
1. Standardization of data that have variability research unit 
2. Correlation Analysis and Principal Component Analysis on Research Variables 
2. Classify sub-district in Sidoarjo with Hierarchical cluster analysis with average linkage algorithm 
Processing of data by hierarchical cluster analysis with average linkage algorithm performed with 
SPSS 20, Before the cluster analysis, principal component analysis is done to overcome the 
correlation between variables, as well as the standardization of variables in order to obtain a variable 
with the same unit, making it eligible cluster analysis. The results of the cluster analysis then be 
concluded and interpreted. 
 

RESULTS AND DISCUSSION 

Sidoarjo District consists of 18 Sub-districts formed four clusters, details of four clusters with 

members that include within each cluster can be seen in Table 1. Grouping four clusters based on the 

data that provide a general overview of the sub-district characteristics in Sidoarjo, represented by 

the four sectors, geography, education, agriculture and industry. 

 
Table 1. Clusters Membership 

Case 4 Clusters 

1:Sidoarjo 1 

2:Buduran 2 

3:Candi 2 

4:Porong 2 

5:Krembung 2 

6:Tulangan 2 

7:Tanggulangin 2 

8:Jabon 3 

9:Krian 2 

10:Balongbendo 2 

11:Wonoayu 2 

12:Tarik 2 

13:Prambon 2 

14:Taman 2 

15:Sukodono 2 

16:Gedangan 2 

17:Waru 4 

18:Sedati 3 

   
 Cluster 1 consists of one sub-district, cluster 2 consists of 14 sub-district, cluster consists of 

two sub-districts and cluster 4 consists of one sub-district. To identify the characteristics of each 

cluster conducted a descriptive analysis. Members of the group can be described as follows: 


Applied Hierarchical Cluster Analysis with Average Linkage Algoritm 

Cindy Cahyaning Astuti 5 

1. Members of Cluster 1: Sidoarjo Sub-district 

2. Members of the Cluster 2: Buduran Sub-district, Candi Sub-district, Porong Sub-district, Krembung 

Sub-district, Tulangan Sub-district, Tanggulangin Sub-district, Krian Sub-district, Balongbendo Sub-

district, Wonoayu Sub-district, Tarik Sub-district, Prambon Sub-district, Taman Sub-district, 

Gedangan Sub-district  and Sukodono Sub-district. 

3. Members of the Cluster 3: Jabon Sub-district  and Sedati Sub-district 

4. Members of Cluster 4: Waru Sub-district  

Once known the number of clusters formed and the members of each cluster then performed a 

descriptive analysis of each cluster. To search for characteristics which are most dominant in each 

cluster, then look for the highest average of the variables for each cluster. Summary of average value 

in each cluster can be seen in Table 2. 

Table 2. Summary of Average Value in Each Custer 

Variabel  Cluster  1 Cluster  2 Cluster  3 Cluster  4 

𝑋1 62,56 32,9257 80,215 30,32 

𝑋2 194051 98389,00 71228,5 231298 

𝑋3 52 29,71 20 38 

𝑋4 22379 7812,43 5427,5 18476 

𝑋5 21 7,64 6 20 

𝑋6 10813 3755,36 2934,5 8475 

𝑋7 13 2,86 2 6 

𝑋8 7691 1441,29 504 1876 

𝑋9 633 1871,29 1663 109 

𝑋10 42825 141781,79 124964,5 7445 

𝑋11 42 29,29 12,5 151 

𝑋12 5268 5820,64 1892,5 35770 

 
  : Dominant Variable                     

 
: Non-Dominant Variable               

 
Based on Table 3 can known dominant characteristics and non-dominant characteristics of the four 

clusters are formed with the following description : 

1. Cluster 1 consist of  Sidoarjo Sub-district has dominant variables are  

𝑋2, 𝑋3, 𝑋4, 𝑋5, 𝑋6, 𝑋7 and 𝑋8, this indicates that the average number of national and private 

elementary school sector, the average number of national and private elementary school 

students, the average number of national and private junior high school, the average number of 

national and private junior high school students, the average number of national and private 

high school and the average number of national and private high school students most are in 

Sidoarjo Sub-district. Sixth dominant variables are 𝑋2, 𝑋3, 𝑋4, 𝑋5, 𝑋6, 𝑋7 and 𝑋8 represents of 

education condition, shows that education condition in the Sidoarjo Sub-district has the most 

good progress compared to 17 other sub-districts. 


Applied Hierarchical Cluster Analysis with Average Linkage Algoritm 

Cindy Cahyaning Astuti 6 

2. Cluster 2 consist of Buduran Sub-district, Candi Sub-district, Porong Sub-district, Krembung 

Sub-district, Tulangan Sub-district, Tanggulangin Sub-district, Krian Sub-district, Balongbendo 

Sub-district, Wonoayu Sub-district, Tarik Sub-district, Prambon Sub-district, Taman Sub-

district, Gedangan Sub-district  and Sukodono Sub-district has dominant variable is 𝑋9 and  𝑋10, 

this indicates that 14 sub-district in cluster 2 have an the average number of harvest land area 

and the average number of Paddy Production more than four other sub-districts. Both the 

dominant variables are 𝑋9 and  𝑋10 represent agricultural condition, it indicates that the 

agricultural condition in 14 sub-districts incorporated in cluster 2 has the most good progress 

compared to four other sub-districts in Sidoarjo. 

3. Cluster 3 consist of Jabon Sub-district  and Sedati Sub-district has dominant variable is 𝑋1, this 

indicates Jabon Sub-district  and Sedati Sub-district had an the number of average surface area 

(km2) which wider than the 16 other sub-districts. While non-dominant variables are 

𝑋2, 𝑋3, 𝑋4, 𝑋5, 𝑋6, 𝑋7, 𝑋8, 𝑋11 and 𝑋12. Six variables are non-dominant are 𝑋2, 𝑋3, 𝑋4, 𝑋5, 𝑋6, 𝑋7 and 

𝑋8 represents education condition while the two others variables that are non-dominant are 𝑋11 

and 𝑋12 represents industry condition, this shows that education and industry condition in the 

two sub-districts has less well development compared to 16 other sub-districts. 

4. Cluster 4 consist of Waru Sub-district has dominant variable are 𝑋2, 𝑋11 and 𝑋12, this indicates 

that the average number of total population, the average number of large and small industries 

the average number of workers in large and small industries most are in Waru Sub-district. Both 

the dominant variables are 𝑋11 and 𝑋12 represent industry condition, it indicates that industry 

condition in Waru Sub-district has the most good progress compared to 17 other sub-districts 

in Sidoarjo. While non-dominant variables are 𝑋1, 𝑋9 and 𝑋10. Both the non-dominant variables 

are 𝑋9 and 𝑋10 represent agricultural condition this shows that agricultural condition in Waru 

sub-districts has less well development compared to 17 other sub-districts. 

CONCLUSION 

 Based on the results of hierarchical cluster analysis with average linkage algorithm is known 
that there are four clusters were formed. In general, cluster 1 is a sub-district with good education 
condition, in cluster 2 is a sub-district with good agricultural condition are very good, in cluster 3 is 
necessary to the increase the education condition, utilizing dominant land area as well as in cluster 
4 is a sub-district with good industrial condition. Based on the results of the analysis, the government 
may retain the dominant characteristics and increase non-dominant characteristics in each cluster 
to perform construction and development in their respective sub-districts. Besides equitable 
development must also consider the potential possessed by each sub-district, so that the education 
condition, the agricultural condition and the industrial condition has good development. 

REFERENCES 

 
[1]  O. Yim dan K. T. Ramdeen, “Hierarchical Cluster Analysis: Comparison of Three Linkage Measures and 
Application to Psychological Data,” Journal of Tutorials in Quantitative Methods for Psychology, vol. 11, no. 
1, pp. 1-4, 2015.  

[2]  J. Supranto, Analisis Multivariat : Arti & Interpretasi, Jakarta: Rineka Cipta, 2010.  

[3]  R. A. Johnson dan D. W. Winchern, Applied Multivariate Statistical Analysis, New Jersey: Prentice Hall 
International Inc., 2007.  

[4]  S. Saracli, N. Dogan dan I. Dogan, “Comparison of Hierarchical Cluster Analysis Methods by Cophenetic 
Correlation,” Journal of Inequalities and Applications 2013, vol. 1, pp. 203-210, 2013.  

[5]  J. F. J. Hair, R. E. Anderson, R. L. Thatham dan W. C. Black, Multivariate Data Analysis, New Jersey: Prentice 
Hall International Inc., 2009.  


Applied Hierarchical Cluster Analysis with Average Linkage Algoritm 

Cindy Cahyaning Astuti 7 

[6]  P. Trebuna dan J. Halcinova, “Mathematical Tools of Cluster Analysis,” Journal of Applied Mathematics, vol. 
4, pp. 814-816, 2013.  

 
	ABSTRACT
	INTRODUCTION
	mETHODs
	1. Standardization of data that have variability research unit
	2. Correlation Analysis and Principal Component Analysis on Research Variables
	2. Classify sub-district in Sidoarjo with Hierarchical cluster analysis with average linkage algorithm
	Processing of data by hierarchical cluster analysis with average linkage algorithm performed with SPSS 20, Before the cluster analysis, principal component analysis is done to overcome the correlation between variables, as well as the standardization ...
	RESULTS AND DISCUSSION
	CONCLUSION
	REFERENCES