ComTech: Computer, Mathematics and Engineering Applications, 13(1), June 2022, 1−10
DOI: 10.21512/comtech.v13i1.7270
P-ISSN: 2087-1244; E-ISSN: 2476-907X

Discovering the Optimal Number of Crime Cluster Using Elbow, Silhouette, Gap Statistics, and NbClust Methods

Noviyanti T. M. Sagala1* and Alexander Agung Santoso Gunawan2
1Statistics Department, School of Computer Science, Bina Nusantara University, Jln. K. H. Syahdan No. 9, Jakarta Barat 11480, Indonesia
2Computer Science Department, School of Computer Science, Bina Nusantara University, Jln. K. H. Syahdan No. 9, Jakarta Barat 11480, Indonesia
1noviyanti.sagala@binus.edu; 2aagung@binus.edu
*Corresponding Author

Received: 31st March 2021 / Revised: 24th May 2021 / Accepted: 24th May 2021

How to Cite: Sagala, N. T. M., & Gunawan, A. A. S. (2022). Discovering the optimal number of crime cluster using Elbow, Silhouette, Gap Statistics, and NbClust methods. ComTech: Computer, Mathematics and Engineering Applications, 13(1), 1−10. https://doi.org/10.21512/comtech.v13i1.7270

Abstract - In recent years, it has become critical to analyze and track crime to identify trends and associations among crime patterns and activities. Generally, the analysis is conducted to discover the areas or locations where crime is high or low by using different clustering methods, including k-means clustering. Even though the k-means algorithm is commonly used in clustering because of its simplicity, convergence speed, and high efficiency, finding the optimal number of clusters is difficult. Determining the correct clusters for crime analysis is critical to enhancing current crime resolution rates, avoiding future incidents, reducing the preparation time of newly assigned officers, and increasing the quality of police activity. To address the problem of estimating the number of clusters in the crime domain without human intervention, the research carried out the Elbow, Silhouette, Gap Statistics, and NbClust methods on the Major Crime Indicators (MCI) 2014−2019 datasets. Several stages were performed to process the crime datasets: data understanding, data preparation, cluster modelling, and cluster validation. The first two phases were performed in the R Studio environment and the last two stages in Azure Studio. From the experimental results, the Elbow, Silhouette, and NbClust methods suggest the same optimal number of clusters, namely two. After validating the result using the average Silhouette method, the research considers two clusters the best grouping for the dataset. The Silhouette visualization shows an average coefficient of 0.73, indicating that the observations are well grouped and placed in the correct clusters.

Keywords: Crime Clustering, Elbow method, Silhouette method, Gap Statistics method, NbClust method

I. INTRODUCTION

In recent years, crime has been one of the major problems whose intensity and complexity have grown exponentially. With the increasing number of crimes and the availability of information technologies, crime analysis is required to reduce the probability of crime. Crime analysis can provide significant data about crime patterns and identify locations where criminal activities are high or low (Prabakaran & Mitra, 2018). Crime must be analyzed and tracked to identify trends and associations among crime patterns and activities. Because of the large amount of data involved, the appropriate technique must be chosen to solve crime cases quickly.
Previous research claims that crime analysis can be carried out with a clustering-based model (Hajela, Chawla, & Rasool, 2020). The right crime patterns must be identified by conducting proper clustering approaches. Identifying the right clusters for crime analysis is essential because it allows people to examine past crime patterns and improve current crime resolution rates. In addition, preventive measures based on observed trends can be taken to avoid potential incidents. Furthermore, officers who are assigned to a new position with no previous knowledge of site-specific crimes will spend less time in preparation. Finally, operating efficiency increases when limited resources are redeployed to the most effective locations at the most effective times.

Clustering is a data mining technique that groups objects based on the closeness and difference between them. In short, clustering is similar to classification; the only difference is that the classes are not defined and determined in advance, and the grouping of the data is done without supervision (Han, Kamber, & Pei, 2012). Because of its simplicity, convergence speed, and high efficiency, the k-means algorithm is the most popular clustering method (Berkhin, 2006; Li, Yu, Lei, & Tang, 2017; Jain & Dubes, 1988). Partitioning clustering, such as k-means clustering, requires users to specify the number of clusters k to be created. However, determining the optimal number of clusters in a dataset is a fundamental problem, and the question has no definitive answer. There is no prior information on the correct value of k. The best choice of k depends on the features of the dataset, such as its size, variance, and characteristics. The optimal number of clusters is somewhat subjective and determined by the method for calculating similarity and the partitioning parameters. Furthermore, the acceptable number of clusters for a specific type of dataset is rarely known in practice. Thus, to solve the problem of determining the number of clusters without human intervention in the crime domain and to help crime analysis, the researchers conduct four different approaches: the Elbow, Silhouette coefficient, Gap Statistics, and NbClust methods. These methods are then tested with a k-means algorithm to determine the optimal number of crime clusters in the Toronto crime dataset. To the best of the researchers' knowledge, this is the first research that works with four different approaches to select the best value of k for k-means on the Major Crime Indicators (MCI) 2014−2019 datasets.

Kingrani, Levene, and Zhang (2018) researched estimating the number of clusters using diversity and demonstrated that the proposed technique was powerful for groups of diverse sizes, variances, and shapes. Then, Maheswari (2019) proposed a k-means algorithm to build clusters on customer datasets and validated them using the Elbow, Silhouette, and Gap Statistics methods. Yuan and Yang (2019) implemented k-means on the iris datasets. From their experimental results on small datasets, the four methods (Elbow, Silhouette coefficient, Gap Statistics, and the Canopy algorithm) met the research requirements; however, the Canopy algorithm was the best choice for large and complex datasets.
Nath (2006) proposed treating crime detection as a machine learning task by implementing a k-means clustering algorithm on real crime data from a sheriff's office, covering the analysis, investigation, and discovery of patterns in the occurrence of distinct crimes. Then, Bokde, Kakade, Tumsare, and Wadhai (2018) utilized data mining extensively, applying a theoretical model, based on clustering and classification of crimes recorded by the police in England and Wales between 1990 and 2011, to real crime datasets. Joshi, Sabitha, and Choudhury (2017) also used k-means clustering to identify areas with excessive crime rates and the most frequent types of crime. Moreover, Subbalakshmi, Krishna, Rao, and Rao (2015) proposed a Fuzzy Silhouette method to discover the optimal number of clusters. In contrast, Saleh and Khan (2019) analyzed the prediction and visualization of the patterns and trends of different crimes in Chicago by using the k-means algorithm in a Python environment. They discovered that robberies were at their peak and that most criminals were not arrested for their crimes.

Considering the results of previous works using the Elbow, Gap Statistics, and Silhouette methods, the researchers propose those approaches together with NbClust on the MCI datasets. Since the experiment is conducted in an R environment, the research also examines the performance of R clustering packages, such as NbClust, on the datasets. The results are expected to help determine which method is the best at finding the optimal number of clusters for the dataset.

II. METHODS

The datasets used in the research are open-source and updated data from the Toronto Police Service Public Safety Data Portal (n.d.). The datasets include MCI occurrences from 2014 to 2019 by reported date and related offences and comprise 206,435 rows and 29 columns. Table 1 displays the description of the attributes used. It should be noted that the descriptions of the attributes are taken from other resources, such as the Chicago and San Francisco datasets and crime articles, because no description is available in the data portal. The locations of crime incidents are intentionally offset to the nearby highway intersection node to ensure the safety of the parties involved in the case. The location information should therefore be regarded as an approximate location of the event, and users are advised not to interpret any of these locations as directly linked to a particular address or person.

Table 1 Attributes Description

No  Name                  Description
1   I..X                  Unknown
2   Y                     Unknown
3   Index                 Unknown
4   event_unique_id       Unique identifier for the record
5   Occurrence date       Date when the incident occurs
6   reporteddate          Date when the incident is reported
7   premisetype           Location of incident
8   ucr_code              Uniform Crime Reporting code
9   ucr_ext               Uniform Crime Reporting extension (ext. number to report the incident)
10  offence               Incident subcategory
11  reportedyear          Year when the incident is reported
12  reportedmonth         Month when the incident is reported
13  reportedday           Day when the incident is reported
14  reporteddayofyear     Day of the year of the reported incident
15  reporteddayofweek     Day of the week of the reported incident
16  reportedhour          Time of the reported incident
17  Occurrenceyear        Year when the incident occurs
18  Occurrencemonth       Month when the incident occurs
19  Occurrenceday         Day when the incident occurs
20  Occurrencedayofyear   Day of the year the incident occurs
21  Occurrencedayofweek   Day of the week the incident occurs
22  Occurrence hour       Time the incident occurs
23  MCI                   Major Crime Indicator / incident category
24  Division              Police division code that handles the incident
25  Hood_ID               City of Toronto's neighbourhood identifier
26  Neighborhood          City of Toronto's neighbourhood
27  Lat                   Latitude of the location where the incident occurs (approximate location of the occurrence)
28  Long                  Longitude of the location where the incident occurs (shifted from the actual coordinate to protect the privacy of parties involved in the occurrence)
29  ObjectId              Internal feature number

The framework employed in the research is visualized in Figure 1.

Figure 1 The Proposed Framework: Data Understanding → Data Preparation (Data Cleaning, Feature Selection, Data Standardization) → K-means Clustering (with the Elbow, Silhouette, Gap Statistics, and NbClust Methods) → Cluster Evaluation

The experiment is conducted in an R environment and starts with data understanding. The data understanding process involves gathering information about where the data come from, how they are handled, what choices are taken, where they are stored, and how they flow to downstream systems. The process also consists of inspecting the missing values, how they are displayed, and how common they are. The researchers load the data into an R Studio environment as the tool for data preprocessing and then examine the data in depth to see whether additional recognized external data sources are needed to strengthen decision-making or conclusions about the clustering models. A minimal sketch of this loading and inspection step is given below.
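As an illustration only, the loading and inspection step might look as follows in R; the file name is hypothetical, and the checks mirror the inspection described above:

```r
# Hypothetical file name; the actual export from the Toronto Police
# Service Public Safety Data Portal may be named differently.
mci <- read.csv("MCI_2014_to_2019.csv", stringsAsFactors = FALSE)

dim(mci)                              # expected: 206,435 rows, 29 columns
str(mci)                              # attribute names and types (see Table 1)
colSums(is.na(mci))                   # how common the missing values are
sum(duplicated(mci$event_unique_id))  # duplicated incident identifiers
```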
Next, the method of cleaning and converting raw data prior to processing and interpretation is known as data preparation. It is a critical stage before processing that typically includes reformatting data, making data corrections, and merging datasets to enrich the data. Data preparation may be time-consuming, but transforming data into usable observations and removing bias caused by low data quality are necessary to bring the data into action. Data preparation consists of three phases: data cleaning, feature selection, and data standardization. The result of this phase is an acceptable dataset with enhanced quality for the cluster modelling stage.

Cleaning up the data takes the most time, but it is essential for eliminating inaccurate data and filling in gaps. The task commences by removing or modifying incomplete, incorrect, duplicated, or irrelevant data. After the data have been cleansed, they must be checked for errors introduced during the data preparation phase up to this stage. Sometimes an error in the method is found during this stage and needs to be corrected before continuing.

The process then moves to the feature selection phase. The objective of selecting features is to remove non-informative or redundant features' contribution to the target variable or output, to achieve better performance for the clustering model. The data features used to train machine learning models have a significant impact on the achieved results. The presence of irrelevant features in the datasets will reduce model accuracy and cause the built model to learn from irrelevant features. For the research, features are selected by specifying the unimportant features and implementing the Boruta package. The researchers check the description and format of the features and remove any clearly unimportant ones. After that, the Boruta package is run in R Studio to remove the remaining uninformative features completely. Boruta is a wrapper built around the random forest classification method. It manages to capture all the crucial, unique features that the datasets contain about an outcome variable. A hedged sketch of this selection step follows.
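The paper does not list the exact Boruta call, so the sketch below is an assumption: it treats the MCI incident category as the outcome variable and uses a hypothetical cleaned data frame named mci_clean as input.

```r
library(Boruta)

set.seed(123)
# 'mci_clean' is a hypothetical cleaned data frame; using the MCI
# incident category as the outcome variable is an assumption.
boruta_out <- Boruta(as.factor(MCI) ~ ., data = mci_clean, doTrace = 2)
print(boruta_out)                  # tags: Confirmed / Tentative / Rejected
getSelectedAttributes(boruta_out)  # features confirmed as important
```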
A significant problem is that the ranges of the variables can be very different. When the original scale is used, variables with a wide range receive more weight. Hence, the research uses a feature rescaling technique during the data preprocessing stage to scale the independent variables or data features. Feature scaling or standardization ensures that the features are on a nearly equal scale, that each feature is equally important, and that the data are easier for most clustering algorithms to process. The model built in this work is a distance-based algorithm that measures similarities between observations; therefore, features with wider ranges would have a greater impact on the clustering, and variables measured at various scales would contribute to the analysis unequally. The research utilizes the Z-score normalization technique to standardize or rescale the values, which ensures that the mean and the standard deviation become 0 and 1, respectively. The Z-score formula is described in Equation (1), where $x_i$ is the raw score, $\mu$ is the mean, and $\sigma$ is the standard deviation:

$z_i = \frac{x_i - \mu}{\sigma}$ (1)

A k-means algorithm is conducted on the cleaned data in the cluster modelling phase. The clustering process starts by randomly choosing valid initial cluster centers $C_i$ ($1 \le i \le k$) from the dataset. Each data point is then assigned to a cluster by calculating the distance between the remaining data items and the cluster centers, finding the cluster center $C_i$ closest to the target data item, and designating the data item to the class of cluster $i$. The following step determines the mean value of the data items of each cluster class as the new cluster center and repeats the process until no improvements are possible, which indicates that the objective function has converged. The Euclidean distance between a data item $x_i$ and a cluster center $C_j$ in a $p$-dimensional space is shown in Equation (2):

$d(x_i, C_j) = \sqrt{\sum_{m=1}^{p} (x_{im} - C_{jm})^2}$ (2)

Then, the minimum error sum of squares is adopted by k-means as the objective function $f$, described in Equation (3), where $k$ is the number of cluster classes, the $x_i$ are the points in the sample, and $C_j$ is the center of cluster $j$:

$f = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - C_j \rVert^2$ (3)

A minimal sketch of the standardization and clustering steps is given below.
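The following is a minimal R sketch of Equations (1)−(3) in practice, assuming a hypothetical numeric data frame crime_df of crime counts; base R's scale() performs the Z-score step, and kmeans() minimizes the within-cluster sum of squares of Equation (3):

```r
set.seed(123)                 # reproducible random initialization
# 'crime_df' is a hypothetical numeric data frame of crime counts
# (e.g., Assault, Auto Theft, Break and Enter, Robbery, Theft Over).
scaled_df <- scale(crime_df)  # Z-score of Equation (1): mean 0, sd 1

km <- kmeans(scaled_df, centers = 2, nstart = 25)  # k = 2, 25 random starts
km$size                       # number of observations per cluster
km$tot.withinss               # objective of Equation (3): total within-cluster SSE
```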
The researchers carry out three kinds of approaches to find the best possible k for the k-means algorithm: direct methods, statistical testing methods, and R packages. Direct approaches maximize a criterion, such as the within-cluster sums of squares or the average Silhouette; the Elbow and Silhouette methods belong to this group. Statistical testing approaches compare evidence against a null hypothesis; Gap Statistics is an example. Meanwhile, R offers the NbClust function for determining the best number of clusters. To apply the direct and statistical methods in the R environment, the researchers need to install the factoextra package.

The underlying principle of the Elbow rule is to use the squared distances between the sample points in each cluster and the cluster center over a set of k-values. The Sum of Squared Errors (SSE) is used as the performance indicator. The researchers iterate over the k-values and compute the SSE; smaller values indicate that the clusters are more convergent.

The Silhouette method was first proposed by Kaufman and Rousseeuw (1990). It merges cohesion and separation factors. Cohesion is the closeness between an item and its own group; the corresponding closeness to other clusters is called separation. Comparing the two yields the Silhouette value, which lies in the range of -1 to 1. A Silhouette value near 1 indicates that the item has a strong relationship with its group. If a model generates data groups with high Silhouette values, the model is appropriate and justifiable.

Gap Statistics is an algorithm proposed by Tibshirani, Walther, and Hastie (2001) to determine the number of clusters for datasets whose number of classes is unknown. The principal idea of Gap Statistics is to construct reference metrics, which can be acquired using the Monte Carlo sampling method, and to compute the squared Euclidean distances between the two measurements in each class (Xiao & Yu, 2007). The clustering outcomes are compared with the constructed reference null distribution to determine the optimal number of clusters in the datasets.

The NbClust method determines the number of clusters based on varying combinations of numbers of clusters, distance measures, and clustering algorithms (Charrad, Ghazzali, Boiteau, & Niknafs, 2014). In short, NbClust computes about 30 indices at once to find the optimal number of clusters. The researchers adopt the function because a single call can generate the individual indices and the recommended numbers of clusters, and the combined results give the best clustering scheme. Before using it, the researchers need to install the NbClust package. The differences between the four approaches are shown in Table 2, and a hedged code sketch of the four approaches follows at the end of this section.

Table 2 Comparison of the Proposed Approaches

Elbow: direct method; works by computing the Within-Cluster Sum of Squares (WSS); the optimal number of clusters corresponds to the smallest value of the total WSS measurements.
Silhouette: direct method; works by computing the average Silhouette of the observations; the optimal number of clusters is based on the Silhouette score.
Gap Statistics: statistical method; works by comparing the total within-cluster variation with a reference; the optimal number of clusters is the value at which cluster compactness on the original data falls farthest below the reference curve.
NbClust function: direct method; works by providing 30 indices; the optimal number of clusters is found by varying all combinations of numbers of clusters, distance measures, and clustering methods.

In unsupervised learning, cluster evaluation, also known as cluster validation, is not as well developed because of the very nature of the task (Palacio-Niño & Berzal, 2019), which has given rise to various assessment methods. The research focuses on internal validation methods that check the quality of the clustering structure without access to external information. For this purpose, the researchers also utilize the Silhouette coefficient.
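A minimal sketch of how the four approaches can be invoked, reusing the hypothetical scaled_df from the earlier sketch; the factoextra and NbClust calls are standard, while the seed and bootstrap count are illustrative. The 2−15 cluster range matches the setting reported in the results.

```r
# install.packages(c("factoextra", "NbClust"))  # one-time installation
library(factoextra)
library(NbClust)

set.seed(123)
# Elbow method: total within-cluster sum of squares versus k
fviz_nbclust(scaled_df, kmeans, method = "wss")
# Silhouette method: average silhouette width versus k
fviz_nbclust(scaled_df, kmeans, method = "silhouette")
# Gap Statistics: gap between observed and reference compactness
fviz_nbclust(scaled_df, kmeans, method = "gap_stat", nboot = 50)
# NbClust: about 30 indices, k from 2 to 15, decided by majority rule
nb <- NbClust(scaled_df, distance = "euclidean",
              min.nc = 2, max.nc = 15, method = "kmeans")
```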
III. RESULTS AND DISCUSSIONS

After checking the dataset, a considerable amount of duplication is found in the "event_unique_id" feature. Therefore, the researchers remove 26,622 duplicated event ids, leaving 179,813 rows. In addition, there are 48 rows with standard missing values. They are simply removed from the dataset, since the number of cases is less than 5% of the dataset and plenty of records remain. The total number of cleaned records is 179,765.

The feature selection is first carried out manually. Several features turn out to have identical descriptions, and some of them are unimportant to the output. Therefore, features 1 to 6, 8, 9, 11 to 16, 20, 25, and 29 (numbered according to Table 1) are deleted from the datasets.

R also offers a way to automatically remove unimportant features for clustering tasks by implementing the Boruta function. According to Kursa and Rudnicki (2010), the top two reasons for using Boruta are as follows. First, it considers multi-variable relationships. Second, it improves on the random forest variable importance measure, which is a very popular method for variable selection. Boruta works by duplicating all independent variables, shuffling the copies to create shadow features, combining the original variables with the shuffled copies, running a random forest classifier, computing Z-scores, finding the maximum Z-score among the shadow features, tagging the variables by their level of importance, removing unimportant variables, and repeating the whole process until all variables have been tagged as either important or unimportant. Running Boruta as an automatic feature selector flags the occurrence month feature as unimportant to the target variable; thus, the variable is eliminated.

The following phase applies data standardization. The data summary before and after Z-score standardization is presented in Table 3. As Table 3 shows, after standardization the range between the minimum and maximum values of each feature is small and acceptable.

Table 3 Standardization Using Z-Score

          Assault    Auto Theft   Break and Enter   Robbery   Theft Over
BEFORE
Min       98.0       14.0         62.0              10.0      7.00
Max       4160.0     1982.0       1461.0            691.0     326.00
AFTER
Min       -0.8884    -0.7232      -1.1150           -0.9679   -0.72835
Max       5.5197     9.6640       5.3209            4.9959    5.08959

With the Elbow method, the k-value can be determined by plotting the k-SSE curve and finding the inflection point. As shown in Figure 2, the location of a knee in the plot is usually considered an indicator of the appropriate number of clusters; beyond it, adding another cluster does not improve the partition much. The method suggests two clusters for the dataset. From Figure 2, the larger the number of clusters, the lower the total WSS, although there is a slight increase at cluster 7.

Figure 2 Result of Elbow Method

The result of the Silhouette method in Figure 3 also suggests two clusters. The average Silhouette width changes rapidly: it trends upward from 1 to 2, from 4 to 5, and from 7 to 8 clusters, and downward from 2 to 4, from 5 to 7, and from 8 to 10 clusters.

Figure 3 Result of Silhouette Method

With Gap Statistics, the optimal number of clusters is the one that maximizes the gap. The method advises clustering the data into three clusters, as visualized in Figure 4. The Gap Statistics coefficient rises significantly to reach a high of 1.9 at cluster 3, falls to 1.0 at cluster 4, and then steadily increases again to about 1.12 at cluster 10.

Figure 4 Result of Gap Statistics
The NbClust function works by providing 30 indices for choosing the best number of clusters. The minimum and maximum numbers of clusters are set to 2 and 15, respectively, and the function runs the k-means algorithm to check which number of clusters is the best for the dataset. The results are displayed in Figure 5 and Figure 6.

Figure 5 Result of NbClust (Hubert Statistics)

Figure 6 Result of NbClust (Dindex)

The Hubert index and the Dindex are graphical methods for determining the number of clusters. In both plots, one looks for a significant knee that corresponds to a significant increase in the value of the measure. The summary of the NbClust results is as follows: among all indices, 10 indices propose 2 as the best number of clusters, 2 indices propose 3, 2 indices propose 4, 5 indices propose 5, 1 index proposes 6, 3 indices propose 12, and 1 index proposes 15. According to the majority rule, the best number of clusters is two. The optimal number of clusters from the NbClust method is also visualized as a bar plot in Figure 7, which clearly shows that the optimal number of clusters is two.

Figure 7 Bar Plot of NbClust Result

It is interesting to note that three approaches (Elbow, Silhouette, and NbClust) suggest two as the optimal number of clusters for the crime dataset, while Gap Statistics proposes three. Gap Statistics works quite differently from the remaining approaches because it prepares its own expected values and compares them with the total within-cluster variation over the defined range of k. On the other hand, Elbow, Silhouette, and NbClust work by computing only the total or average within-cluster variation, without the need to construct expected values. With the "best" set of relevant features resulting from the feature selection phase, uncovering interesting natural groupings (clusters) in the data according to the chosen criterion is easier; in other words, relevant features help in finding clusters efficiently.

To confirm that the suggested number of classes is optimal, the clustering quality can be evaluated via the Silhouette plot, which shows the Silhouette coefficient on the y-axis. The researchers therefore draw the Silhouette plot for two clusters, as displayed in Figure 8.

Figure 8 Internal Cluster Validation Using Silhouette Method

The interpretation of the Silhouette coefficient is as follows: a coefficient above 0 means that the observation is well grouped, and the closer the coefficient is to 1, the better the observation is grouped; a coefficient below 0 means that the observation has been placed in the wrong cluster; and a coefficient equal to 0 means that the observation lies between two clusters. Since a large majority of the Silhouette coefficients are positive, the observations are placed in the correct groups. A hedged sketch of this validation step follows.
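For illustration, a plot like Figure 8 can be produced along the following lines, reusing the hypothetical km and scaled_df objects from the earlier sketches; this is a minimal sketch, not the authors' exact code:

```r
library(cluster)     # silhouette()
library(factoextra)  # fviz_silhouette()

# Silhouette width of each observation under the k = 2 assignment
sil <- silhouette(km$cluster, dist(scaled_df))
mean(sil[, "sil_width"])  # average coefficient; the paper reports 0.73
fviz_silhouette(sil)      # Silhouette plot as in Figure 8
```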
IV. CONCLUSIONS

In recent years, it has become important to analyze and track crime to identify trends and associations among crime patterns and activities. To address the problem of estimating the number of clusters in the crime domain, the researchers study the Elbow, Silhouette, Gap Statistics, and NbClust methods on the Major Crime Indicators (MCI) 2014−2019 datasets. The primary contribution of the research is finding the optimal number of clusters for Toronto's MCI datasets by applying the Elbow, Silhouette, and NbClust approaches. Cluster validation is performed to validate the recommended number of clusters. As a result, the best number of clusters for the MCI datasets is two, with a 0.73 Silhouette coefficient. To the best of the researchers' knowledge, the research literature has not previously reported the use of these approaches on this crime dataset. Obtaining the optimal number of clusters may ease crime analysis, which can then focus on these two clusters or on the cluster that has a large number of criminal cases. The research limitation is that k-means has trouble clustering data whose clusters have varying sizes and densities; as shown in the cluster validation, most observations belong to cluster 2. Hence, for future research, it will also be interesting to compare the approaches for estimating the number of clusters together with other clustering algorithms.

REFERENCES

Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping multidimensional data (pp. 25−71). Springer.

Bokde, K. A., Kakade, T. P., Tumsare, D. S., & Wadhai, C. G. (2018). Crime detection technique using data mining and K-means. International Journal of Engineering Research & Technology (IJERT), 7(02), 223−226.

Charrad, M., Ghazzali, N., Boiteau, V., & Niknafs, A. (2014). NbClust: An R package for determining the relevant number of clusters in a data set. Journal of Statistical Software, 61(6), 1−36.

Hajela, G., Chawla, M., & Rasool, A. (2020). A clustering based hotspot identification approach for crime prediction. Procedia Computer Science, 167, 1462−1470.

Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques. Elsevier.

Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice-Hall, Inc.

Joshi, A., Sabitha, A. S., & Choudhury, T. (2017). Crime analysis using K-means clustering. In 2017 3rd International Conference on Computational Intelligence and Networks (CINE) (pp. 33−39). IEEE.

Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.

Kingrani, S. K., Levene, M., & Zhang, D. (2018). Estimating the number of clusters using diversity. Artificial Intelligence Research, 7(1), 15−22.

Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1−13.

Li, X. Y., Yu, L. Y., Lei, H., & Tang, X. F. (2017). The parallel implementation and application of an improved K-means algorithm. Journal of University of Electronic Science and Technology of China, 46(1), 61−68.

Maheswari, K. (2019). Finding best possible number of clusters using K-means algorithm. International Journal of Engineering and Advanced Technology, 9(1S3), 533−538.

Nath, S. V. (2006). Crime pattern detection using data mining. In 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops (pp. 41−44). IEEE.
Palacio-Niño, J. O., & Berzal, F. (2019). Evaluation metrics for unsupervised learning algorithms. arXiv preprint arXiv:1905.05667.

Prabakaran, S., & Mitra, S. (2018). Survey of analysis of crime detection techniques using data mining and machine learning. Journal of Physics: Conference Series, 1000, 1−10.

Saleh, M. A., & Khan, I. R. (2019). Crime data analysis in Python using K-means clustering. International Journal for Research in Applied Science & Engineering Technology (IJRASET), 7(IV), 151−155.

Subbalakshmi, C., Krishna, G. R., Rao, S. K. M., & Rao, P. V. (2015). A method to find optimum number of clusters based on fuzzy silhouette on dynamic data set. Procedia Computer Science, 46, 346−353.

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411−423.

Toronto Police Service Public Safety Data Portal. (n.d.). Major Crime Indicators. Retrieved from https://data.torontopolice.on.ca/datasets/TorontoPS::major-crime-indicators-1/about

Xiao, Y., & Yu, J. (2007). Gap statistic and K-means algorithm. Journal of Computer Research and Development, 44, 176−180.

Yuan, C., & Yang, H. (2019). Research on K-value selection method of K-means clustering algorithm. J: Multidisciplinary Scientific Journal, 2(2), 226−235.