Microsoft Word - ETASR_7-1_1420-1424.doc


Engineering, Technology & Applied Science Research Vol. 7, No. 1, 2017, 1420-1424 1420  
  

www.etasr.com Bafghi: Clustering of Customers Based on Shopping Behavior and Employing Genetic Algorithms 
 

Clustering of Customers Based on Shopping Behavior 
and Employing Genetic Algorithms 

 
Elham Photoohi Bafghi  
Department of Computer Engineering 
Bafgh branch, Islamic Azad University 

Bafgh, Iran 
 

Abstract—Clustering of customers is a vital case in marketing 
and customer relationship management. In traditional 
marketing, a market seller is categorized based on general 
characteristics like clients’ statistical information and their 
lifestyle features. However, this method seems unable to cope 
with today’s challenges. In this paper, we present a method for 
the classification of customers based on variables such as 
shopping cases and financial information related to the 
customers’ interactions. One measure of similarity was defined as 
clustering and clustering quality function was further defined. 
Genetic algorithms been used to ensure the accuracy of 
clustering. 

Keywords-classification; customers; shopping; genetic algorithm    

I. INTRODUCTION  
Traditional methods of mass marketing are no longer 

responsive to customers' needs diversity. Such diversity must 
be managed by clustering methods which puts the customers 
with the same need and similar shopping behavior in the same 
clusters [1]. Using the proper clustering, companies can deliver 
the customers’ goods, services and specific resources via close 
relationship. Customer clustering is one of the components in 
modern and successful marketing leading to improved 
customer relationship management (CRM) is [2]. Clustering is 
an important issue in selecting the appropriate variables. 
Clustering variables have two local variables and parameters 
are based on the product [3]. General variables are included of 
statistical information (age, gender, etc.), life-style and includes 
variables based on the buying patterns of customers. Different 
works have been taken in the field of customer clustering based 
on variables [4, 5]. In a study described in [6], the SOM (Self-
Organizing Map) algorithm and K-Means were used for 
clustering of customers based on public information. In this 
method, first SOM helped to determine the number of clusters 
and then clustering was performed using the K-Means. 
Clustering based on general variables is more understandable, 
but the assumption that the customers have the same statistical 
information (age, etc.) as well as a similar life style and the 
same shopping habits, to some extent are doubtful. Today, 
customers can purchase from different parts of the planet, and 
this makes clustering partly more difficult. On the other hand 
some information about the local variables may not be 
provided by the customer. Although information may be 

available they may vary over time. For example, income, 
marital status, occupation and the similar cases makes the 
clustering through the use of public variables more doubtful 
[7]. In this paper we propose a method of clustering of 
customers based on information about the goods. Genetic 
algorithms were also used in order to enhance the quality of 
these clusters which also have been used to determine the 
cluster centers.  

II. CLUSTERING METHOD BASED ON CUSTOMER PURCHASES 
In this section, methods of clustering of customers have 

been provided by the help of customers' purchase behavior. 
This method involves pre-processing core, similarity criteria 
based on purchase, clustering algorithm and the quality 
function of the cluster. 

A. Data pre-processing 
The purpose of this selection process is to transform and 

integrate data from one or more datasets into one data set. 
Suppose ‘I’ as an entire collection of items and T0 a 
transaction. T0 transaction includes fields such as client ID, 
time of transaction, items purchased and financial information. 
If idi is the customer ID number of ci customer, and 
itemseti={iia|iia Î I} are goods purchased by ci, as well as 
financial information moneyseti={mia|a=1,…,||itemseti||}, a 
record of tci=(idi,itemseti,moneyseti)  describes ci customer’s 
behavior based product and which will be displayed. 

B. Criteria of similarity based on purchase 
After preparing the data, it is necessary to specify the 

similarity criteria of users or to each other and goods. In this 
respect, there are two famous criteria of matching coefficient 
and Jaccard's coefficient [8, 9]. But these two criteria are not 
most appropriate because: 

• Existence of a small number of goods compared to the total 
of goods in a user set items 

• Inequality of goods (considered equal in the above two 
methods) 

• Imbalance in the importance of customers for the 
organization 


Engineering, Technology & Applied Science Research Vol. 7, No. 1, 2017, 1420-1424 1421  
  

www.etasr.com Bafghi: Clustering of Customers Based on Shopping Behavior and Employing Genetic Algorithms 
 

So in this article we have introduced a new similarity 
measure that has overcome some of these shortcomings. 
Similarity measures presented here considered goods 
purchased together and the profitability of customers. The 
criteria of goods purchased together were inspired by the 
concept of Support provided in [10]. Suppose Supp ({ii, ij)} of 
transactions involving {ii, ij} is the total transactions: 

Suppose Supp({ii,ij)} as the ratio of transactions involving 
{ii,ij} to the total transactions: 
 

0

000 },{
}),({

T

iicontainstTt
iiSupp

ji

ji


                 (1) 

 
In the above equation ii,ij Î I. It should be noted that if the 
items are rarely purchased together, support value has a low 
amount [11] and to solve the problem we use intimacy criteria 
as follows: 
 

}),{()()(

}),{(
}),({

jiji

ji

ji
iiSuppiSuppiSupp

iiSup
iiInt


    (2) 

 
In the above equation numerical value of Int was between 0 
and 1. This solves the problem of backup. 
Finally, the proposed formula to measure the similarity of two 
customers is presented in the following form: 

   

 



 

 






s

a

t

b
jb

i
a

s

a

t

b
jbibjbia

ji

mm

iiIntmm
ccsim

1 1

1 1

,
).(     (3) 

C. Clustering algorithm based on purchase 
The proposed purchase algorithm of the customers is based 

on (3). In the algorithm first customers are classified into K 
clusters. The value of K can be determined by the help of 
marketers or the method proposed in this paper (see section 
4.2). After determining K, initial cluster centers are chosen 
from Tc records. These centers may also be selected randomly 
or based on heuristic method (GA). Suppose G={cn|n=1,…,K} 
is a total cluster centers where cn is the center of nth cluster, so 
that cn Î Tc. So set(Tc-G)={ci|i=1,…,||Tc-G||} includes 
customers not selected as the center of the cluster. Then for all 
customers who are not cluster center, their similarity to all 
cluster centers are measured and added to the clusters with the 
maximum similarity to the center. The problem can be seen in 
the equation below:  

    GTwherecccsimMax Cinisnk .1        (4) 
 
After each customer was placed in clusters, cluster centers are 
calculated again. Let ci and cj are two customers in the G

n 
cluster, and the center of Gn cluster is equal to cn. In this case 
priority of ci customer is calculated based on following 
relationship: 










nmGc

m
i

ijGc
ji

i

m

m
j

ccSim

ccSim

cPio

,

,

),(

),(

)(     (5) 

Where cm is the center of Gm cluster and 
 ijGc

ji
m

j

ccSim
,

),(
 

represents the sum of similarities between the ci customer and 

other customers in cluster Gn. While 
 nmGc

m
i

m

ccSim
,

),(  

represents the sum of similarities between the client ci and the 
other cluster centers except Gn. For all customers existed in a 
Gn cluster, customer is selected as the center of the cluster 
with the highest priority value. This concept is presented in the 

following equation:  )(arg icn cpioMaxc nGi      (6) 
After specifying the new centers, algorithm has done a 
repetition pattern. The algorithm repeats this process until no 
clustering center change. The overall process of algorithm is 
shown in Figure 1. 

 
Fig. 1.  Flowchart of Clustering using Genetic Algorithms 

D. Clustering quality function 
The purpose of clustering is to maximize the similarity 
between the centers of the clusters with the cluster's customers 
as the similarities between the centers of the clusters are 


Engineering, Technology & Applied Science Research Vol. 7, No. 1, 2017, 1420-1424 1422  
  

www.etasr.com Bafghi: Clustering of Customers Based on Shopping Behavior and Employing Genetic Algorithms 
 

minimal. So the quality of clustering based on the following 
equation can be expressed as: 




 
















 


k

n nm

mn
nmkmMin

k
K

1
,1

1
)(




    (7) 





n

i Gc

n
inn

ccSim
G

),(
1

       (8) 





m

j Gc

m
jmm

ccSim
G

),(
1

   (9) 

),( mnnm ccSim     (10) 
Equation (8) considers η

n as the average similarity between the 

center of the cn cluster and existing customers in the cluster 
Gn. Equation (9) shows that η

m determines the relationship 

between the average similarity between the center of the 
cluster cm and existing customers in the cluster Gm. Equation 
10 determines the similarity between cn and cm. 
Using the clustering quality defined in (7), we can define the 
appropriate value for K in accordance with the following 
formula: 

  kMaxk tskss  argˆ       (11) 
In the above equation K value is determined between the 
lower bound and upper bound of t 

III. DETERMINING THE CENTER OF CLUSTERS USING 
GENETIC ALGORITHM 

Genetic Algorithms [12] can be used to select primary 
centers of clusters. A genetic algorithm is an abstract 
computational models of biological evolution that is used in 
optimization problems, in which the mutation operation, 
crossover and production of new population are used. 

A. Encoding the chromosome 
Each chromosome in the genetic algorithm represents a 
solution to the problem that is investigated. In this studied 
case, each chromosome represents a set of K centers of initial 
cluster. If fi is a chromosome, fi=[y1,…,yj,…,yk] where yj is the  
jth gene and K is the total number of genes (In Figure 2).  

 
Fig. 2.  Coding of chromosomes 

B. set values to the initial population 
Let Pe represents the population in the eth replication, and 

0<=e<=E, where E is the maximum number of times to repeat 

the genetic algorithm. The number of chromosomes is constant 
in all occurrences. So Pe={fi|i=1,…,L} where fi is the i

th 
chromosome and L is the total number of chromosomes in the 
population. L number is an even number specified by the user. 

C. The fitness value of a chromosome 
Fitness value of each chromosome means to determine the 

suitability of chromosomes in order to survive it. The fitness of 
each cluster is determined based on the equation 7. So the 
formula for calculating the fitness of each fi chromosome is 
calculated as follows: 

Fitness(fi)=ρ(K) , fi=[y1,…,yj,…,yK]    (12) 

In (12) chromosomes existed in the population Pe into two 
categories: good (Pebad) and bad (P

e
good). Good chromosomes 

are the sets of chromosomes with high fitness value and bad 
chromosomes are the sets of chromosomes with low fitness 
value as their numbers are equal||Pegood||=||P

e
bad||=I/2. 

D. Production of the new population 
The purpose of generating new populations is to remove the 

chromosomes with low fitness value and copy the 
chromosomes with high fitness value in the new population. So 
Pebad is removed and P

e
good chromosomes take their place. The 

higher the fitness value of a chromosome, the higher the 
probability of selection. The formula for calculating the 
probability of selection of chromosomes as shown below: 


 agoodi pgf

ii ffitnessfob )(/)(Pr (13) 

The production process of new population has ensured that 
the new generations are created of the parents with high fitness 
value. This process is shown in the following figure. 

 
Fig. 3.  The process of creating a new population 


Engineering, Technology & Applied Science Research Vol. 7, No. 1, 2017, 1420-1424 1423  
  

www.etasr.com Bafghi: Clustering of Customers Based on Shopping Behavior and Employing Genetic Algorithms 
 

E. Mutate and crossover 
After generating a new population, mutation and crossover 

operations are applied over the population. First, a fi 
chromosome and from Pegood and fj from P

e
bad set are elected. If 

the genes of fi and fj are not equal, the crossover operation is 
applied on them. How to crossover them is shown in Figure 4. 
If the two genes are equal, then the gene mutation operation is 
performed on them. For this purpose, created chromosome is 
placed instead of one of the two chromosomes. This operation 
is shown in Figure 5. 

 
Fig. 4.  Crossover operation 

 
Fig. 5.  Mutuation  operation  

IV. RESULTS 

A. Assessment of the proposed method 
In order to evaluate the proposed method, the information 

collection of shopping store with 9729 customer transaction 
information relating to 4223 customers and 1560 item were 
applied. The purpose of the proposed algorithm is that 
customers in each cluster have the highest similarity with each 
other. So items purchased by the customer in a cluster are 
similar. Although these items are distinct, customers’ buying 
patterns are definable by the help of the frequency of each item. 
This concept is known by the Support, as shown in Equation 1. 
To assess this concept, we compared the K-means clustering 
methods [13] with the proposed method and achieved to the 
following results. In the results value of the K was set at 30 and 
calculated the mean support in each cluster of five maximum 
support besides standard deviation for all items. Results of K-
means algorithm and the proposed method are shown in Tables 
I and II, respectively. The ANOVA test was used to express the 
differences of obtained results. The test results are shown in 
Table III. It can be seen that the mean value of support for 5 
supported items were significantly higher than the results of K-
means. Therefore, it can be argued that in the proposed 
approach, customers in each cluster have more similar buying 
pattern than the K-means algorithm. Also the standard 
deviation of proposed method is higher than the K-means 
which also confirms the previous term. 

TABLE I.  PRODUCT PURCHASE PATTERN WITH THE HELP OF K-MEANS METHOD 

Five maximum support values of all items Standard deviation 
of support for all 

items 

Mean of support for all 
items mean 5 4 3 2 1 

Customer’s ID 

0.0025 0.0063 0.0189 0.0142 0.0189 0.0189 0.0189 0.0236 01 
0.0027 0.0063 0.0172 0.0256 0.0156 0.0156 0.0156 0.0208 02 
0.0025 0.0047 0.0138 0.0138 0.0138 0.0138 0.0138 0.0138 03 
0.0028 0.0056 0.0175 0.0175 0.0175 0.0175 0.0175 0.0175 04 
0.0025 0.0054 0.0185 0.0148 0.0148 0.0185 0.0185 0.0259 05 
0.0018 0.0035 0.0111 0.0098 0.0098 0.0114 0.0114 0.0131 06 
0.0033 0.0084 0.0236 0.0221 0.0221 0.0221 0.0221 0.0294 07 
0.0017 0.0035 0.0104 0.0092 0.0107 0.0107 0.0107 0.0107 08 

... ... ... ... ... ... ... ... ... 
0.0026 0.0054 0.0175 0.0167 0.0167 0.0167 0.0167 0.0209 28 
0.0035 0.0094 0.0244 0.0244 0.0244 0.0244 0.0244 0.0244 29 
0.0023 0.0044 0.0152 0.0126 0.0158 0.0158 0.0158 0.0158 30 

TABLE II.  PRODUCT PURCHASE PATTERN WITH THE HELP OF PROPOSED METHOD 

Five maximum support values of all items Standard deviation 
of support for all 

items 

Mean of support for all 
items mean 5 4 3 2 1 

Customer’s ID 

0.0058 0.0052 0.0568 0.0381 0.450 0.484 0.0588 0.0934 01 
0.0070 0.0060 0.0574 0.0302 0.0302 0.0340 0.0906 0.1019 02 
0.0051 0.0050 0.0453 0.0275 0.0275 0.0303 0.0523 0.0799 03 
0.0045 0.0040 0.0344 0.0297 0.0297 0.0361 0.0361 0.0403 04 
0.0028 0.0061 0.0583 0.0379 0.0455 0.0492 0.0606 0.0985 05 
0.0064 0.0052 0.0518 0.0350 0.0420 0.0455 0.0455 0.0909 06 
0.0055 0.0042 0.0392 0.0326 0.0350 0.0373 0.0443 0.0466 07 
0.0070 0.0039 0.0275 0.0295 0.0385 0.0379 0.0379 0.0463 08 

... ... ... ... ... ...  ... ... ... 
0.0039 0.0047 0.0463 0.0360 0.0411 0.0411 0.0540 0.0591 28 
0.0043 0.0080 0.0759 0.0452 0.0565 0.0621 0.0734 0.1412 29 
0.0041 0.0048 0.0400 0.0303 0.0303 0.0379 0.0404 0.0606 30 


Engineering, Technology & Applied Science Research Vol. 7, No. 1, 2017, 1420-1424 1424  
  

www.etasr.com Bafghi: Clustering of Customers Based on Shopping Behavior and Employing Genetic Algorithms 
 

TABLE III.  EVALUATION USING ANOVA TEST 

 
Sum of 
Squares 

DF Mean square F-value P-Value =5%α 

Variance among groups 1.938*10-2 1 1.938*10-2 
Variance inside the groups 4.572*10-3 58 

Mean of 5 
maximum supports 

 Total variance 2.395*10-2 59 
7.883*10-5 

245.882 0.000 P<α 

Variance among groups 1.402*10-7 1 1.402*10-7 
Variance inside the groups 1.113*10-4 58 

Mean of support 
for all items 

Total variance 1.114*10-4 59 
1.918*10-6 

0.073 0.788 P>α 

Variance among groups 1.176*10-4 1 1.176*10-6 
Variance inside the groups 6.607*10-5 58 

Standard deviation 
of support for all 

items Total variance 1.837*10-4 59 
1.139*10-6 

103.242 0.000 P<α 

 
The proposed method has a good convergence rate. As you 
can see in the following figure, when the replication frequency 
raised the total similarities between customers, cluster centers 
was also increased and number of change of cluster centers 
were also reduced. It was shown after the 9 replications, 
cluster centers did not change and the algorithm was finished. 

 
Fig. 6.  Convergence in proposed method 

B. Evaluating the performance of Genetic Algorithm 
As mentioned, the initial cluster centers can be determined 

randomly or using genetic algorithm. Some experiments were 
taken in order to show the usefulness of genetic algorithms in 
which the number of clusters was changed from 30 to 80. In the 
following figure the effectiveness of using genetic algorithms 
was shown compared to the initial cluster centers randomly. 

 
Fig. 7.  Comparison of genetic algorithm and random method 

V. CONCLUSION 
Mass marketing no longer meets the diverse needs of 

customers. Customer clustering techniques have to be 
employed so that customers with the same buying patterns are 
put in a similar cluster. In this paper we presented a method of 
clustering customers employing genetic algorithms. Results are 
presented and further discussed. As shown, an improved 
performance is achieved through the discussed technique. 

REFERENCES 
[1] S. Dibb, L. Simkin, The market segmentation workbook: target 

marketing for marketing managers, Routledge, London, 1996 

[2] A. Berson, S. Smith, K. Thearling, Building data mining applications for 
CRM, New York: McGraw-Hill, 2000 

[3] S. Wedel, W. Kamakura, Market segmentation: Conceptual and 
methodological foundations, Boston: Kluwer, 1997 

[4] T. P. Beane, D. M. Ennis, “Market segmentation: a review”, European 
Journal of Marketing, Vol. 21, No. 5, pp. 20–42, 1987 

[5] K. Hammond, A. S. C. Ehrenberg, G. J. Goodhardt, “Market 
segmentation for competitive brands”, European Journal of Marketing, 
Vol. 30, No. 12, pp. 39–49, 1996 

[6] R. J. Kuo, L. M. Ho, C. M. Hu, “Integration of self-organizing feature 
map and K-means algorithm for market segmentation”, Computers and 
Operations Research, Vol. 29, No. 11, pp. 1475–1493l, 2002 

[7] R. G. Drozdenko, P. D. Drake, Optimal database marketing: Strategy, 
development, and data mining, London, Sage, 2002 

[8] C. D. Manning, H. Schutze, Foundations of statistical natural language 
processing, Cambridge, MA, MIT Press, 1999 

[9] H. C. Romesburg, Clustering analysis for researchers, Belmont Lifetime 
Learning Publications, 1984 

[10] R. Agrawal, R. Srikant, “Fast algorithms for mining association rules”, 
20th International Conference on Very Large Databases, pp. 487–499, 
1994 

[11] H. Mannila, Database methods for data mining, 4th International 
Conference on Knowledge Discovery and Data Mining, New York, 
1998 

[12] J. H. Holland, Adaptation in natural and artificial systems, Ann Arbor, 
MI: The University of Michigan Press, 1975 

[13] J. MacQueen, “Some methods for classification and analysis of 
multivariate observations”, 5th Conference on Mathematical Statistics 
and Probability, Vol. 1, pp. 281–297, 1967