SJET | P-ISSN: 2616-7069 | E-ISSN: 2617-3115 | Vol. 6, No. 1 | January – June 2023

Harris’ Hawks Optimization-Tuned Density-based Clustering

Muhammad Shoaib Omar1, Syed Muhammad Waqas2, Kashif Talpur3*, Sumra Khan4, Shakeel Ahmad3

1 Department of Computer Science, Bahria University, Karachi, Pakistan
2 Department of Computer Science, Muhammad Ali Jinnah University, Karachi, Pakistan
3 Department of Science and Engineering, Solent University, Southampton, United Kingdom
4 Department of Information Technology, Salim Habib University, Karachi, Pakistan
* Corresponding Author: kashif.talpur@solent.ac.uk

Abstract: Clustering is a machine learning technique that groups data samples based on similarity and identifies outliers with distinct features. Density-based clustering outperforms other methods because it can handle arbitrarily shaped cluster distributions. However, it has the limitation of requiring empirical values for the cluster center and for the nominal distance between the cluster center and other data points. These values affect the accuracy and the number of clusters obtained by the algorithm. This paper proposes a solution that optimizes these parameters using Harris’ hawks optimization (HHO), an efficient optimization technique that balances exploration and exploitation and avoids stagnation in later iterations. The proposed HHO-tuned density-based clustering achieves better performance than the other optimizers used in this work. This research also provides a reference for designing efficient clustering techniques for complex-shaped datasets.

Keywords: Machine learning; density-based clustering; metaheuristic algorithm; Harris’ hawks optimization; clustering.

1. Introduction

Unlabeled data is being generated exponentially by today’s information systems, which demands accurate and highly efficient analysis in order to draw logical conclusions. Clustering is the most studied and most broadly used learning technique of its kind. It endeavors to divide a dataset into several, usually disjoint, subsets, each of which is called a cluster. Through this partition, each cluster may correspond to some prospective category that the clustering algorithm is not aware of before working on the data. Based on their diverse learning approaches, researchers have devised many types of clustering algorithms with different characteristics. These methods can be categorized into density-based, partitioning-based, hierarchy-based, model-based, and grid-based approaches [1].

A density-based clustering algorithm groups data points based on a cut-off distance and a minimum number of points within that radius. In 1996, Ester et al. [2] proposed a clustering algorithm that handles arbitrary cluster shapes without requiring the number of clusters to be specified beforehand, and named it density-based spatial clustering of applications with noise (DBSCAN). DBSCAN, by design, has several advantages over its counterparts: it does not need the number of clusters as prior information, and it can work with datasets of varying densities. The DBSCAN algorithm is thus a promising technique for dealing with unlabeled data.
Initially, several past works used DBSCAN to cluster spatial data; however, its success has since been noted on various other data complexities [3, 4, 5]. Because it groups data points with respect to density, clusters in DBSCAN are regions of densely packed data points, separated by regions of low density, or noise. This core functionality makes the algorithm effective in identifying clusters of unusual, arbitrary shapes as well as in detecting noise or outliers [6]. DBSCAN performs clustering with the help of two user-defined parameters: epsilon (ɛ), the cluster radius, and the minimum number of points (minPts) that must be present within that radius.

Despite its efficacy in clustering applications, DBSCAN faces the same limitation as other clustering methods: determining optimal values for the user-defined parameters (ɛ and minPts) is difficult and problem-specific. This demands a significant amount of personal experience or several experimental trials, hence limiting its extensive use. To address this problem, different approaches have been proposed in the literature. Lai et al. [7] used the multi-verse optimizer (MVO) algorithm for automatic selection of the ɛ and minPts parameters. The authors first improved MVO and then used it to find the optimal range of values for the DBSCAN parameters, and claimed improved clustering performance of DBSCAN on three public datasets. Researchers in [8] proposed mining unsupervised patterns in multi-dimensional data, also known as data cubes, using a genetic algorithm (GA). For tuning the GA parameters (mutation and crossover), they utilized fuzzy inference mechanisms: Mamdani’s rules and Takagi–Sugeno’s rules. Compared with the existing DBSCAN, the GA-tuned DBSCAN achieved better clustering quality on six data-cube datasets intended for OLAP (online analytical processing) purposes. Another successful application of metaheuristic algorithms to DBSCAN parameter optimization was presented by Zhu et al. [9], where the authors used the harmony search (HS) optimization algorithm for a similar purpose. Experiments with this new clustering scheme on six clustering problems of varying complexity showed superior performance over the canonical DBSCAN in terms of the Rand index (RI) and Jaccard coefficient (JC) evaluation metrics. These research efforts have extended the applicability of DBSCAN by exploiting the optimization capability of metaheuristic algorithms, which select more reasonable values of the clustering parameters from an optimal range.

Metaheuristic algorithms have proved to be powerful optimization techniques when applied to a wide range of optimization problems. By mimicking intelligent behaviors found in nature, researchers have developed efficient search mechanisms. Well-established metaheuristic algorithms include particle swarm optimization (PSO) [10], artificial bee colony (ABC) [11], genetic algorithms (GA) [12], and ant colony optimization (ACO) [13]. Some of the more recent counterparts have also proved effective, such as grey wolf optimization (GWO) [14], the Archimedes optimization algorithm (AOA) [15], the honey badger algorithm (HBA) [16], and chaos game optimization (CGO) [17]. These and many others have been used in data mining tasks [18], wireless sensor node localization [19], power flow optimization in smart grids [20], stock price prediction [21], etc.
Most of these metaheuristic algorithms are based on the collective intelligence displayed by a group of search agents, which makes them simple, efficient, and ready to deploy on any optimization problem. Harris’ hawks optimization (HHO) is a recent addition to this paradigm of optimization methods, developed by Heidari et al. [22]. HHO maintains a robust global search strategy by integrating balanced exploration and exploitation tools. Its performance has been validated by several research works, such as engineering design optimization [23], satellite imaging breakdown threshold optimization [24], drug design and discovery [25], wireless sensor node localization [26], and many other constrained and unconstrained problems [7, 27, 28, 29].

Keeping in view the outstanding global search ability of HHO and its effective exploration and exploitation strategies, we utilize it to improve the clustering performance of DBSCAN. We optimize the DBSCAN parameters (ɛ and minPts) using HHO, since the manual parameter selection process is based on trial and error, which is an ineffective approach. The optimized parameters result in high clustering accuracy.

The rest of the paper is organized as follows. Section 2 presents the materials and methods, covering the fundamentals of DBSCAN and HHO as well as the methodology of HHO-based DBSCAN parameter optimization. The experimental settings and results are discussed in Section 3, whereas Section 4 concludes the findings of this research.

2. Materials and Methods

2.1. DBSCAN Clustering Algorithm

DBSCAN (density-based spatial clustering of applications with noise), introduced by Ester et al. in 1996 [2], is a density-based clustering algorithm that identifies clusters as dense regions of objects. The benefits of this technique are that it can identify clusters of arbitrary shape, clusters within a cluster (nested clusters), as well as outliers (points not belonging to any cluster). The algorithm is mainly governed by two parameters, ɛ and minPts, where ɛ is the maximum radius of the neighborhood and minPts is the minimum number of points required within that radius. A user has to provide the values of these two parameters based on past experience or on a trial-and-error basis, and they play an important role in the accuracy of the clusters identified by the algorithm. If ɛ is too large, a cluster will absorb many points, some of which may not actually belong to it. On the contrary, if ɛ is initialized with a small value, the algorithm may create more clusters than the correct number of required clusters. The parameter minPts acts in a similar manner. DBSCAN also identifies outliers, i.e., the points that do not belong to any cluster. Some core definitions related to DBSCAN are:

• ɛ-neighborhood: The set of points lying within radius ɛ of a given point.
• Core point: A point that has at least minPts points within its ɛ-neighborhood.
• Boundary point: A point that has fewer than minPts points in its ɛ-neighborhood (so it is not a core point) but has at least one core point within its radius.
• Noise points: Points that are neither core points nor boundary points and do not belong to any cluster.
• Directly density-reachable: If point q lies within the ɛ-radius of a core point p, then q is directly density-reachable from p.
• Density-reachable: Point q is density-reachable from point p if there is a chain of points (q1, q2, ..., qn), with q1 = p and qn = q, such that each point in the chain is directly density-reachable from the previous one.
• Density-connected: If points p and q are both density-reachable from a core point o, then p and q are density-connected with respect to o.
• Cluster: Defined as the maximal set of density-connected points.

The algorithmic steps of DBSCAN are illustrated in Algorithm 1.

Algorithm 1: Pseudo-code of DBSCAN
Procedure DBSCAN(Dataset X, ɛ, minPts)
  Step-1: Find all core points in the data space X using ɛ and minPts and add them to the set of core objects. A point xi is a core point when |Nɛ(xi)| ≥ minPts.
  Step-2: Select any core object from the set and grow a cluster with other core points using ɛ and minPts until all reachable core points have been visited and no more core points are left.
  Step-3: Add unvisited border points to the current cluster one by one. A border point cannot extend the cluster further, as it is not a core point.
  Step-4: If a point is neither a core nor a border point, mark it as noise, put it in a separate set, and label it.
  Step-5: When all border points have been added to the cluster, pick another core point from the set and repeat the above steps until no unvisited core point remains.
  return Labeled data space of cluster indices
End procedure
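To make the sensitivity to ɛ and minPts concrete, the following minimal sketch is illustrative only: it is not the paper’s experimental setup and assumes scikit-learn’s DBSCAN and a synthetic two-moons dataset. It runs DBSCAN with a few hand-picked parameter pairs and reports how the number of discovered clusters and noise points changes.

```python
# Minimal illustration (assumption: scikit-learn is available; this is not the
# paper's experimental setup) of how manually chosen eps / minPts values change
# the clustering that DBSCAN reports on a simple synthetic dataset.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters with a small amount of noise.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

for eps, min_pts in [(0.05, 5), (0.20, 5), (0.50, 5)]:
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps:.2f}, minPts={min_pts}: {n_clusters} clusters, {n_noise} noise points")
```

On such data, a very small ɛ typically fragments each crescent into several clusters and labels many points as noise, while a very large ɛ merges both crescents into one cluster; only a narrow intermediate range recovers the two intended clusters. Automating the search for that range is precisely the role given to HHO in this work.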
2.2. Harris’ Hawks Optimization

In 2019, Heidari et al. [22] presented an optimization technique inspired by the hunting behavior of Harris’ hawks, called Harris’ hawks optimization (HHO). The algorithm mimics the hunting behavior of these hawks, whose primary tactic for hunting prey is the “raid”, also known as the “seven-kill” strategy. In this strategy, the surprised prey is sometimes caught within a few seconds; otherwise, the attacks are continued until the prey is caught. The pursuit may include several brief and smart dives from different directions, over seconds to minutes, to confuse the exhausted prey until it is eventually caught by an attacking hawk. Harris’ hawks can apply a variety of attacking styles depending on the hunting environment and the escaping behavior and energy of their prey. Hard and soft besieges may be applied while attacking the prey until it is finally caught. This abruptly changing attacking behavior is beneficial in that it brings the rabbit to complete exhaustion and leaves it defenseless. The HHO algorithm consists of three phases in which these behaviors take place:

2.2.1. Exploration Phase

In the exploration phase, hawks wait in the hunting field and use their keen eyesight to locate prey (a rabbit). Sometimes a rabbit is spotted very easily; at other times, the hawks have to wait for hours to locate one. If the hawks are unable to locate any rabbit for a certain period of time, they change their place and perch randomly at other locations to look for prey. This strategy, in mathematical form, is depicted in Eq. 1:

\[
X(t+1) =
\begin{cases}
X_{rand}(t) - r_1 \left| X_{rand}(t) - 2 r_2 X(t) \right|, & q \ge 0.5 \\
\left( X_{rabbit}(t) - X_m(t) \right) - r_3 \left( LB + r_4 (UB - LB) \right), & q < 0.5
\end{cases}
\tag{1}
\]

where Xrabbit(t) is the current optimal solution (the rabbit’s position) and X(t+1) denotes the hawk’s position in the next iteration. X(t), Xrand(t), and Xm(t) denote the position of the current hawk, the position of a randomly selected hawk, and the average position of all hawks (the current solution set), respectively. The lower and upper bounds of the search space are given by LB and UB, whereas q and r1–r4 represent random variables in [0, 1].
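As a concrete illustration of Eq. 1, the sketch below is an illustrative reading only (it assumes NumPy and that each hawk is stored as a row of a 2-D array; it is not code from the paper). It updates every hawk’s position during the exploration phase:

```python
# Sketch of the exploration-phase update of Eq. 1 (assumptions: NumPy; hawks are
# rows of a 2-D array; lb and ub are the scalar search-space bounds LB and UB).
import numpy as np

def explore(X, X_rabbit, lb, ub, rng):
    """Return the exploration-phase positions X(t+1) for all hawks (Eq. 1)."""
    n, _ = X.shape
    X_mean = X.mean(axis=0)                   # Xm(t): average position of all hawks
    X_new = np.empty_like(X)
    for i in range(n):
        q, r1, r2, r3, r4 = rng.random(5)     # random numbers in [0, 1)
        X_rand = X[rng.integers(n)]           # Xrand(t): a randomly selected hawk
        if q >= 0.5:                          # perch relative to a random hawk
            X_new[i] = X_rand - r1 * np.abs(X_rand - 2.0 * r2 * X[i])
        else:                                 # perch relative to the rabbit and the mean
            X_new[i] = (X_rabbit - X_mean) - r3 * (lb + r4 * (ub - lb))
    return np.clip(X_new, lb, ub)             # keep the hawks inside [LB, UB]
```

Here rng is a numpy.random.Generator (e.g., numpy.random.default_rng()); the final clipping step is a common safeguard rather than part of Eq. 1 itself.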
2.2.2. The Transition from Exploration to Exploitation

This is the second phase of the HHO algorithm, which mediates between exploration and exploitation. When prey is found, different exploitation behaviors are used to reduce its escaping energy so that some of the hawks are able to catch it easily. In HHO, this is modeled by the energy component formulated in Eq. 2:

\[
E = 2 E_0 \left( 1 - \frac{t}{t_{max}} \right)
\tag{2}
\]

where E0 denotes the initial energy of the rabbit, and t and tmax represent the current and maximum number of iterations, respectively.

2.2.3. Exploitation Phase

The hawks pounce on their target using different attacking techniques as required by the situation, which depends on the escaping pattern of the prey, the attacking style of the hawks, and the remaining energy of the rabbit. There are four possible strategies in the HHO algorithm, depicted in Eqs. 3–6:

• Soft besiege: if the chance of escape ≥ 0.5 and the absolute energy |E| ≥ 0.5:

\[
X(t+1) = \Delta X(t) - E \left| J X_{rabbit}(t) - X(t) \right|, \quad
\Delta X(t) = X_{rabbit}(t) - X(t), \quad J = 2(1 - r_5)
\tag{3}
\]

where ΔX(t) is the difference between the rabbit’s position Xrabbit(t) and the current position of the hawk X(t), J is the rabbit’s random jump strength, and r5 is a random variable in [0, 1].

• Hard besiege: if the chance of escape ≥ 0.5 and the absolute energy |E| < 0.5:

\[
X(t+1) = X_{rabbit}(t) - E \left| \Delta X(t) \right|
\tag{4}
\]

• Soft besiege with rapid progressive dives: if the chance of escape < 0.5 and the absolute energy |E| ≥ 0.5:

\[
X(t+1) =
\begin{cases}
Y, & \text{if } F(Y) < F(X(t)) \\
Z, & \text{if } F(Z) < F(X(t))
\end{cases}, \quad
Y = X_{rabbit}(t) - E \left| J X_{rabbit}(t) - X(t) \right|, \quad
Z = Y + S \times LevyFlight(D)
\tag{5}
\]

where D and S denote the problem dimension and a random vector of size D, respectively, F(·) is the fitness function, and LevyFlight(D) is the Lévy flight function.

• Hard besiege with rapid progressive dives: if the chance of escape < 0.5 and the absolute energy |E| < 0.5:

\[
X(t+1) =
\begin{cases}
Y, & \text{if } F(Y) < F(X(t)) \\
Z, & \text{if } F(Z) < F(X(t))
\end{cases}, \quad
Y = X_{rabbit}(t) - E \left| J X_{rabbit}(t) - X_m(t) \right|, \quad
Z = Y + S \times LevyFlight(D)
\tag{6}
\]

The step-by-step procedure of the HHO algorithm is given in Algorithm 2.

Algorithm 2: Pseudo-code of the HHO algorithm
Procedure HHO(Population size N, Maximum iterations tmax)
  Initialize the population Xi (i = 1, 2, ..., N)
  While t < tmax
    Calculate the fitness values of the hawks and the best location of the rabbit Xrabbit
    For each hawk Xi(t)
      Update the escaping energy E via Eq. 2
      If |E| ≥ 1 Then
        Exploration and position update via Eq. 1
      Else
        Exploitation and position update via Eqs. 3–6
      End If
    End For
  End While
  return Optimum solution Xrabbit
End procedure

As can be seen in Algorithm 2, the algorithm first initializes the population and then explores the search space using Eq. 1 until it locates prey; it keeps searching and waiting until a prey is found. After locating the prey, it transitions from exploration to exploitation based on the energy computed in Eq. 2. It then makes attacking movements using the four strategies of Eqs. 3–6, namely the soft and hard besieges and the soft/hard besieges with progressive dives.
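To show how the control flow of Algorithm 2 fits together, the following compact sketch implements the loop under stated assumptions (NumPy; a box-constrained minimization problem; the rapid-dive moves of Eqs. 5–6 are reduced to a simple heavy-tailed perturbation instead of a full Lévy flight). It is an illustration of the control flow, not a faithful reimplementation of [22]:

```python
# Simplified HHO loop (assumptions: NumPy; minimize f over the box [lb, ub]^dim).
# |E| from Eq. 2 switches between exploration (Eq. 1) and the besieges (Eqs. 3-6);
# the rapid-dive variants are approximated by a heavy-tailed random perturbation.
import numpy as np

def hho_minimize(f, lb, ub, dim, n_hawks=20, t_max=100, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(n_hawks, dim))        # initial hawk positions
    fit = np.array([f(x) for x in X])
    best, best_fit = X[fit.argmin()].copy(), fit.min()  # rabbit = best solution so far

    for t in range(t_max):
        for i in range(n_hawks):
            E = 2.0 * rng.uniform(-1, 1) * (1 - t / t_max)       # escaping energy, Eq. 2
            if abs(E) >= 1:                                       # exploration (Eq. 1)
                q, r1, r2, r3, r4 = rng.random(5)
                X_rand = X[rng.integers(n_hawks)]
                if q >= 0.5:
                    X_new = X_rand - r1 * np.abs(X_rand - 2 * r2 * X[i])
                else:
                    X_new = (best - X.mean(axis=0)) - r3 * (lb + r4 * (ub - lb))
            else:                                                 # exploitation (Eqs. 3-6)
                r = rng.random()                                  # chance of escape
                J = 2 * (1 - rng.random())                        # rabbit jump strength
                if r >= 0.5 and abs(E) >= 0.5:                    # soft besiege (Eq. 3)
                    X_new = (best - X[i]) - E * np.abs(J * best - X[i])
                elif r >= 0.5:                                    # hard besiege (Eq. 4)
                    X_new = best - E * np.abs(best - X[i])
                else:                                             # dives (Eqs. 5-6, simplified)
                    Y = np.clip(best - E * np.abs(J * best - X[i]), lb, ub)
                    Z = np.clip(Y + 0.01 * rng.standard_cauchy(dim), lb, ub)
                    X_new = Y if f(Y) <= f(Z) else Z
            X[i] = np.clip(X_new, lb, ub)
            fit[i] = f(X[i])
            if fit[i] < best_fit:                                 # update the rabbit
                best, best_fit = X[i].copy(), fit[i]
    return best, best_fit
```

For instance, hho_minimize(lambda x: float(np.sum(x**2)), lb=-5.0, ub=5.0, dim=10) should drive the sphere function close to zero, reflecting the exploration-to-exploitation hand-off described above.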
2.3. HHO-Tuned DBSCAN

In this research, HHO and DBSCAN are integrated in order to enhance the clustering capability of the DBSCAN algorithm. DBSCAN is combined with HHO in such a way that the most accurate values of ɛ and minPts are obtained. As can be seen from Algorithm 3, the DBSCAN algorithm is called within HHO. HHO provides the values of the best-fit rabbit and its energy as candidate values of ɛ and minPts, which are evaluated by DBSCAN and then passed back to HHO. The provided values are compared with each other and the convergence of HHO is analyzed. The point at which it converges to the global minimum, with the lowest clustering error, yields the desired values of ɛ and minPts. HHO is thus used to search the DBSCAN parameters for optimal clustering accuracy, as shown in Algorithm 3.

Algorithm 3: Pseudo-code of the HHO-DBSCAN algorithm
Procedure HHO-DBSCAN(Solution size N, Maximum iterations tmax)
  Initialize the population Xi (i = 1, 2, ..., N)
  While t