FACTA UNIVERSITATIS Series: Electronics and Energetics Vol. 33, No 4, December 2020, pp. 631 - 653 https://doi.org/10.2298/FUEE2004631A Elham Amouee1, Morteza Mohammadi Zanjireh1 , Mahdi Bahaghighat1, Mohsen Ghorbani2 Received April 15, 2020; received in revised form August 10, 2020 Corresponding author: Morteza Mohammadi Zanjireh Computer Engineering Department, Imam Khomeini International University, Qazvin, Iran E-mail: Zanjireh@eng.ikiu.ac.ir FACTA UNIVERSITATIS Series: Electronics and Energetics Vol. 28, No 4, December 2015, pp. 507 - 525 DoI: 10.2298/FUEE1504507S HORIZONTAL CURRENT BIPOLAR TRANSISTOR (HCBT) – A LOW-COST, HIGH-PERFORMANCE FLEXIBLE BICMOS TECHNOLOGY FOR RF COMMUNICATION APPLICATIONS Tomislav Suligoj1, Marko Koričić1, Josip Žilak1, Hidenori Mochizuki2, So-ichi Morita2, Katsumi Shinomura2, Hisaya Imai2 1University of Zagreb, Faculty of Electrical Engineering and Computing, Department of Electronics, Micro- and Nano-electronics Laboratory, Croatia 2Asahi Kasei Microdevices Co. 5-4960, Nobeoka, Miyazaki, 882-0031, japan Abstract. In an overview of Horizontal Current Bipolar Transistor (HCBT) technology, the state-of-the-art integrated silicon bipolar transistors are described which exhibit fT and fmax of 51 GHz and 61 GHz and fTBVCEO product of 173 GHzV that are among the highest-performance implanted-base, silicon bipolar transistors. HBCT is integrated with CMOS in a considerably lower-cost fabrication sequence as compared to standard vertical-current bipolar transistors with only 2 or 3 additional masks and fewer process steps. Due to its specific structure, the charge sharing effect can be employed to increase BVCEO without sacrificing fT and fmax. Moreover, the electric field can be engineered just by manipulating the lithography masks achieving the high-voltage HCBTs with breakdowns up to 36 V integrated in the same process flow with high-speed devices, i.e. at zero additional costs. Double-balanced active mixer circuit is designed and fabricated in HCBT technology. The maximum IIP3 of 17.7 dBm at mixer current of 9.2 mA and conversion gain of -5 dB are achieved. Key words: BiCMOS technology, Bipolar transistors, Horizontal Current Bipolar Transistor, Radio frequency integrated circuits, Mixer, High-voltage bipolar transistors. 1. INtRoDUCtIoN In the highly competitive wireless communication markets, the RF circuits and systems are fabricated in the technologies that are very cost-sensitive. In order to minimize the fabrication costs, the sub-10 ghz applications can be processed by using the high-volume silicon technologies. It has been identified that the optimum solution might Received March 9, 2015 Corresponding author: tomislav Suligoj University of Zagreb, Faculty of Electrical Engineering and Computing, Department of Electronics, Micro- and Nano-electronics Laboratory, Croatia (e-mail: tom@zemris.fer.hr) A NEW ANOMALOUS TEXT DETECTION APPROACH USING UNSUPERVISED METHODS 1Computer Engineering Department Imam Khomeini International University Qazvin, Iran 2Department of Electrical Engineering Raja University Qazvin, Iran Abstract. Increasing size of text data in databases requires appropriate classification and analysis in order to acquire knowledge and improve the quality of decision-making in organizations. The process of discovering the hidden patterns in the data set, called data mining, requires access to quality data in order to receive a valid response from the system. Detecting and removing anomalous data is one of the pre-processing steps and cleaning data in this process. Methods for anomalous data detection are generally classified into three groups including supervised, semi-supervised, and unsupervised. This research tried to offer an unsupervised approach for spotting the anomalous data in text collections. In the proposed method, a combination of two approaches (i.e., clustering-based and distance-based) is used for detecting anomaly in the text data. In order to evaluate the effciency of the proposed approach, this method is applied on four labeled data sets. The accuracy of Naïve Bayes classification algorithms and decision tree are compared before and after removal of anomalous data with the proposed method and some other methods such as Density-based spatial clustering of applications with noise (DBSCAN). Our proposed method shows that accuracy of more than 92.39% can be achieved. In general, the results revealed that in most cases the proposed method has a good performance. Key words: Anomaly detection, text mining, unsupervised learning, clustering, pre-processing, DBSCAN algorithm. © 2020 by University of Niš, Serbia | Creative Commons License: CC BY-NC-ND 2 proposed method shows that accuracy of more than 92.39% can be achieved. In general, the results revealed that in most cases the proposed method has a good performance. Keywords: Anomaly detection, text mining, unsupervised learning, cluster- ing, pre-processing, DBSCAN algorithm 1 Introduction The current age is called Information Age. Since the organizations and in- stitutions record a huge amount of data daily, data recovery alone is not enough to make decisions. So automatic classification and analysis of data is very important. Data mining is the process of identifying valid patterns and relationships among the high volume of data which have so far been unknown [1]. Intelligent data exploring helps organizations to discover and predict system behaviors, and patterns to make better and faster decisions. Besides, the Machine learning (ML) is the science that deals with the devel- opment of algorithms and statistical models. In machine learning, the goal is to enable computer systems to perform particular tasks without using ex- plicit instruction and merely using patterns and inference instead of being able to perform their functions. Nowadays, this science is wildly used in broad fields such as image processing, machine vision, audio signal process- ing, natural language processing (NLP), communication networks, financial areas, and so on [2–10]. In many topics of data mining, the data is classified into structured, semi- structured, and unstructured [11]. Data mining and machine learning are strong tools to handle all of these problems. Structured data is that which has a predictable and regular format such as the structure of the tables in relational databases. In contrast, unstructured data is that which does not a specific structure and its analysis is not so easy. The significant growth and diversity of text data can be considered as an example of this data type. Volume and speed of unstructured data are several times more than those of structured type. Therefore, one of the applied areas in data mining is the concept of text mining and natural language processing (NLP). Be- fore starting data mining, some steps should be taken in order to prepare data. The steps for data mining include selecting data, initial cleaning and pre-processing, discovering patterns, and interpreting and displaying them. Diagnosis of anomalous data can be considered as a pre-processing step in the data mining path [12]. Anomaly is a pattern that differs from the other patterns existing in the 632 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 633 2 proposed method shows that accuracy of more than 92.39% can be achieved. In general, the results revealed that in most cases the proposed method has a good performance. Keywords: Anomaly detection, text mining, unsupervised learning, cluster- ing, pre-processing, DBSCAN algorithm 1 Introduction The current age is called Information Age. Since the organizations and in- stitutions record a huge amount of data daily, data recovery alone is not enough to make decisions. So automatic classification and analysis of data is very important. Data mining is the process of identifying valid patterns and relationships among the high volume of data which have so far been unknown [1]. Intelligent data exploring helps organizations to discover and predict system behaviors, and patterns to make better and faster decisions. Besides, the Machine learning (ML) is the science that deals with the devel- opment of algorithms and statistical models. In machine learning, the goal is to enable computer systems to perform particular tasks without using ex- plicit instruction and merely using patterns and inference instead of being able to perform their functions. Nowadays, this science is wildly used in broad fields such as image processing, machine vision, audio signal process- ing, natural language processing (NLP), communication networks, financial areas, and so on [2–10]. In many topics of data mining, the data is classified into structured, semi- structured, and unstructured [11]. Data mining and machine learning are strong tools to handle all of these problems. Structured data is that which has a predictable and regular format such as the structure of the tables in relational databases. In contrast, unstructured data is that which does not a specific structure and its analysis is not so easy. The significant growth and diversity of text data can be considered as an example of this data type. Volume and speed of unstructured data are several times more than those of structured type. Therefore, one of the applied areas in data mining is the concept of text mining and natural language processing (NLP). Be- fore starting data mining, some steps should be taken in order to prepare data. The steps for data mining include selecting data, initial cleaning and pre-processing, discovering patterns, and interpreting and displaying them. Diagnosis of anomalous data can be considered as a pre-processing step in the data mining path [12]. Anomaly is a pattern that differs from the other patterns existing in the A New Anomalous Text Detection Approach Using Unsupervised Methods ...3 data set. Anomaly was first defined by Grubbs (1969): Anomaly is data which dramatically deviates from other available samples in the series [13]. The term ‘anomaly in text data’ is referred to texts which are abnormal or are significantly different from the other texts in terms of concept. In the text data, anomaly can be investigated in terms of a difference in the text author, the subject, the genre, the style of text, and the emotional tone of the text [14]. The main reason for the development of text mining systems is the in- creasing volume of textual data in organizations and businesses. One of the challenges in monitoring infectious diseases, such as COVID-19, is that large volumes of textual data are produced continuously. In a pandemic, this value can be far greater than a human being can process [15, 16]. Among the applications of this field, the following can be mentioned [17,18] • 1. Diagnosis of anomalies in safety reports sent from space stations • 2. Tracing the subject of news • 3. Abnormal in web content • 4. Identify significant patterns in annual financial reports • 5. Identify abnormal data in news reports • 6. Discover knowledge of medical records There are several anomaly detection systems. These systems are com- prised of three parts. The first phase is the pre-processing step which in- cludes removing unwanted words through stemming [19]. In the second phase, text display (e.g., displaying text sentences for vector) is carried out. And the third phase includes text processing for detecting anomalies and comparing between documents. Anomaly detection is not an easy challenge. So far researchers devel- oped many anomaly detection methods using statistical methods, machine learning, and data mining but the problem is still open and in its progress. Several approaches are shown in the following Figure 1: The methods of anomaly detection are widespread, and each is used based on input data type and its application. In one approach, the methods are classified based on access to the labelled data. Accordingly, the methods are categorized into three main categories [20]: 1. Supervised anomaly detection 632 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 633 4 Fig. 1: The key components of anomaly detection methods [12] 2. Semi-supervised anomaly detection 3. Unsupervised anomaly detection In the supervised method, both normal and abnormal data are labeled in the training dataset. Typical approach in such a method is to create a predictive model for both normal and abnormal data. After comparing each test data sample with the model, it is determined to which class this data belongs [12]. In the semi-supervised method, it is supposed that only normal samples are labelled. Since this method does not require anomalous data labelling, this method is more applicable than the supervised method [11]. In comparison, the methods which are run based on unsupervised method do not require the training data; so, they are more applicable than the two previous approaches. The most important advantage of this method is that it does not need to access the labelled data. Usually, this group of methods is known as clustering solutions [11]. In the Figure 2, a summary of a set of supervised and unsupervised anomaly detection methods is shown. Text clustering refers to the process of dividing a text group into similar subgroups based on content. Semantic clustering refers to cluster texts based 634 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 635 4 Fig. 1: The key components of anomaly detection methods [12] 2. Semi-supervised anomaly detection 3. Unsupervised anomaly detection In the supervised method, both normal and abnormal data are labeled in the training dataset. Typical approach in such a method is to create a predictive model for both normal and abnormal data. After comparing each test data sample with the model, it is determined to which class this data belongs [12]. In the semi-supervised method, it is supposed that only normal samples are labelled. Since this method does not require anomalous data labelling, this method is more applicable than the supervised method [11]. In comparison, the methods which are run based on unsupervised method do not require the training data; so, they are more applicable than the two previous approaches. The most important advantage of this method is that it does not need to access the labelled data. Usually, this group of methods is known as clustering solutions [11]. In the Figure 2, a summary of a set of supervised and unsupervised anomaly detection methods is shown. Text clustering refers to the process of dividing a text group into similar subgroups based on content. Semantic clustering refers to cluster texts based A New Anomalous Text Detection Approach Using Unsupervised Methods ...5 Fig. 2: Various categories of anomalous data detection methods: The super- vised and unsupervised approaches [20], [11]. I: The assumption of these methods is that the normal data belong to at least one cluster while the anomalous data do not belong to any cluster [11]. II: In this method, the normal instance locates near the centroid of its nearest cluster, while the anomalous sample is in a long distance of the nearest cluster gravity center. III: Normal data belongs to high density clusters while anomalous data is distributed in low density clusters. on their contents or meaning [21,22]. The remainder of this paper is organized as follows. In Section 2, we review some related works. In Section 3, we present the methodology of our proposed method. The simulations and experimental results of the proposed algorithm are presented in Section 4. Finally, in Section 5, we conclude the paper. 634 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 635 6 2 Related works There are many studies in the research literature that try to address the anomaly detection issues. Consequently, different algorithms are provided to diagnose anomaly in multidimensional data sets. The key methods used in this area include the distance-based approach, the density-based approach, and their subset methods [23]. Many researchers have worked to detect anomalies in textual documents. Hence, different aspects of text attributes are considered. A number of researchers changed the text to numbers and used algorithms that are suitable for numerical data. Others chose a limited number of documents, such as document titles, to detect anomalies and find a pattern for the dataset [24]. In [25], the authors used the conceptual graph method in order to iden- tify anomalous data in the text. Two-way graph that includes two different types of nodes (i.e., concepts and relations). This method differs from classi- cal statistical approaches and distance-based approaches. In this approach, a data deviation is identified based on the concept of regularity. Using a conceptual graph and the relationships between the entities (concepts), the template pattern is identified, and patterns differing from those that are rare are deemed as an anomaly [25]. Sumithiradevi et al. used clustering methods for anomalous data detection. Initially using the greedy method, they improved the k-means algorithm and clustered the data set. Then, all records were read, and a flag with the initial value of zero was attributed to them. Later on, one sample was considered as outlier and removed them from the data set. In the next step, the amount of entropy changes in the remaining set was calculated. If the entropy of the remaining set increases by removing data, the deleted data is anomaly, and the value of its flag is changed to one [26]. Juntao Wang et al. made use of the density-based approach in order to re- move anomalous data. First, by clustering approach of Fast k-means, they classified data sets. Then, for each data in the cluster, the degree of anomaly was calculated, and any data whose anomalous degree is much larger than one is removed from the data set. In the next step, the average of the re- maining set of a cluster is selected as the new center gravity of the cluster. This trend continues until converging the clusters so that all of the anoma- lous data is removed from the data set [27]. In [28], the authors applied a similar method to the Local Outlier Factor (LOF) algorithm in order to de- termine the degree of anomaly based on distance from centroid. In the first step, by improving the K-means algorithm using genetic algorithm, data set 636 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 637 6 2 Related works There are many studies in the research literature that try to address the anomaly detection issues. Consequently, different algorithms are provided to diagnose anomaly in multidimensional data sets. The key methods used in this area include the distance-based approach, the density-based approach, and their subset methods [23]. Many researchers have worked to detect anomalies in textual documents. Hence, different aspects of text attributes are considered. A number of researchers changed the text to numbers and used algorithms that are suitable for numerical data. Others chose a limited number of documents, such as document titles, to detect anomalies and find a pattern for the dataset [24]. In [25], the authors used the conceptual graph method in order to iden- tify anomalous data in the text. Two-way graph that includes two different types of nodes (i.e., concepts and relations). This method differs from classi- cal statistical approaches and distance-based approaches. In this approach, a data deviation is identified based on the concept of regularity. Using a conceptual graph and the relationships between the entities (concepts), the template pattern is identified, and patterns differing from those that are rare are deemed as an anomaly [25]. Sumithiradevi et al. used clustering methods for anomalous data detection. Initially using the greedy method, they improved the k-means algorithm and clustered the data set. Then, all records were read, and a flag with the initial value of zero was attributed to them. Later on, one sample was considered as outlier and removed them from the data set. In the next step, the amount of entropy changes in the remaining set was calculated. If the entropy of the remaining set increases by removing data, the deleted data is anomaly, and the value of its flag is changed to one [26]. Juntao Wang et al. made use of the density-based approach in order to re- move anomalous data. First, by clustering approach of Fast k-means, they classified data sets. Then, for each data in the cluster, the degree of anomaly was calculated, and any data whose anomalous degree is much larger than one is removed from the data set. In the next step, the average of the re- maining set of a cluster is selected as the new center gravity of the cluster. This trend continues until converging the clusters so that all of the anoma- lous data is removed from the data set [27]. In [28], the authors applied a similar method to the Local Outlier Factor (LOF) algorithm in order to de- termine the degree of anomaly based on distance from centroid. In the first step, by improving the K-means algorithm using genetic algorithm, data set A New Anomalous Text Detection Approach Using Unsupervised Methods ...7 is clustered. In the next step, the data set is filtered, and by defining the threshold limit, the data items whose degree of anomaly is more than a spe- cific value is considered as anomalous data. In this method, for each vector a degree of anomaly is determined based on the distance from the centroid of the cluster. Lei et al. using the subtractive clustering algorithm estimated the potential of each data to be as the initial seeds according to the neigh- borhood’s radius of samples. In the next step, by combining the Silhouette index with the K-means algorithm, they improved the estimated number of clusters. the Silhouette index is a criterion for measuring the amount of the desirability of the data assignment to the cluster. This means that each data is closer to the samples in its cluster or to data from other clusters. If the number obtained is closer to one, the assignment of data to the cluster is desirable but if the number is closer to 0.5, this means that it is likely that data belongs to another cluster. Finally, the improved Cluster-Based Local- Outlier Factor (CBLOF) algorithm is used to identify anomalous data [29]. In [30], the authors put their work on the basis of the improvement of the K- means algorithm clustering and established their method in parallel. Firstly, using Principal Component Analysis (PCA), they decreased dimensions of the problem. Then, by applying the DD Algorithm [31], they improved K- means’ performance. This algorithm selects initial seeds according to the method of distribution of data and improves their choice quality. Also, in- stead of using a certain point as the initial seeds, it uses the average value of some points as the centroid of the cluster. Through a number of tests, Yin et al. were able to define a threshold in order to determine the number of clusters in the K-means algorithm clustering so that there is no need for it to be determined by the user [30]. Identifying items or events that do not match the expected patterns or other items in the data set is called an anomaly detection. These anomalies items cause problems such as structural defects, errors, credit card fraud, and a cyber-attack and etc. The ability to detect anomaly behavior can provide very useful insights into various industries and be an important key to solving these problems. Machine learning algorithms make processing faster and more efficient for detecting anomalies. These algorithms can learn from data and predictions based on that data. In [32], they examined the issue of discovering emerging relations from news using machine learning. These relations can help with news-related tasks, such as retrieving the news, discovering events, ranking, and more, which is a challenging task. In this research, a novel Heteroge- neous graph Embedding framework for Emerging Relation detection (HEER) 636 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 637 8 and a global graph perspective was presented. HEER can embed words and entities by learning from the heterogeneous textual graph and the knowledge graph and predicts the emerging relations via a positive and unlabeled learn- ing (PU) classifier. In [33], the authors presented a kernel-based ensemble clustering approach and used a prototype reduction scheme to decrease the time required to generate the ensemble members. They showed that the reduction method could improve the results. The method they used was a learning process for documents clustering that correspondence-based aggre- gation in conjunction with kernel clustering on a matrix constructed using density-biased prototype selection. 3 Proposed Method Similar to other existing studies such as [32–35], we deal with numerical data in this research. A combination of clustering-based and distance-based methods was used by us. At the first, it is required to convert text data into an understandable format for the system. To this end, text documents were converted into vectors. anomalous data is detected in two phases. In the first phase, the K-means algorithm is used for clustering of data items in the k clusters. In the second phase, anomalous data in each cluster is detected based on the similarity comparison of each data item with the document of the centroid of the cluster. In clustering phase, probably some clusters of empty values might be created, and/or one of the anomalous data items is selected as the initial seeds and forming a cluster. Therefore, the clustering stage is carried out several times in order to get a more desirable result. In the step after clustering, centroid of each cluster is considered as the repre- sentative of that cluster and since the text data is displayed in the vector space, using the Cosine Similarity (CS) formula, the angle between the data within the cluster is compared to its center. It should be compared with the threshold limit. If the similarity rate is less than the threshold, the data is considered as an anomaly. It should be noted that the number of abnormal data is negligible in comparison with the total data in the set. In this ap- proach, the method of K-means algorithm clustering was used to divide data set to a few smaller parts based on the criterion of Cosine Similarity which will result in a decrease in the number of comparison of documents. In other words, instead of calculating distance (similarity) of each and every docu- ment in the whole set of data, we first divide the set according to the most similar documents in the K-means algorithm clustering so that the number of comparison between documents within each cluster with the cluster centroid 638 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 639 8 and a global graph perspective was presented. HEER can embed words and entities by learning from the heterogeneous textual graph and the knowledge graph and predicts the emerging relations via a positive and unlabeled learn- ing (PU) classifier. In [33], the authors presented a kernel-based ensemble clustering approach and used a prototype reduction scheme to decrease the time required to generate the ensemble members. They showed that the reduction method could improve the results. The method they used was a learning process for documents clustering that correspondence-based aggre- gation in conjunction with kernel clustering on a matrix constructed using density-biased prototype selection. 3 Proposed Method Similar to other existing studies such as [32–35], we deal with numerical data in this research. A combination of clustering-based and distance-based methods was used by us. At the first, it is required to convert text data into an understandable format for the system. To this end, text documents were converted into vectors. anomalous data is detected in two phases. In the first phase, the K-means algorithm is used for clustering of data items in the k clusters. In the second phase, anomalous data in each cluster is detected based on the similarity comparison of each data item with the document of the centroid of the cluster. In clustering phase, probably some clusters of empty values might be created, and/or one of the anomalous data items is selected as the initial seeds and forming a cluster. Therefore, the clustering stage is carried out several times in order to get a more desirable result. In the step after clustering, centroid of each cluster is considered as the repre- sentative of that cluster and since the text data is displayed in the vector space, using the Cosine Similarity (CS) formula, the angle between the data within the cluster is compared to its center. It should be compared with the threshold limit. If the similarity rate is less than the threshold, the data is considered as an anomaly. It should be noted that the number of abnormal data is negligible in comparison with the total data in the set. In this ap- proach, the method of K-means algorithm clustering was used to divide data set to a few smaller parts based on the criterion of Cosine Similarity which will result in a decrease in the number of comparison of documents. In other words, instead of calculating distance (similarity) of each and every docu- ment in the whole set of data, we first divide the set according to the most similar documents in the K-means algorithm clustering so that the number of comparison between documents within each cluster with the cluster centroid A New Anomalous Text Detection Approach Using Unsupervised Methods ...9 will be fewer. To cluster text data and to determine their similarity rate, the bag of words approach and Cosine Similarity were used, respectively. In or- der to identify anomalous data in each cluster, we are looking for data items that differ from the behavioral pattern of other members or their differences from the cluster centroid is much. After the clustering phase, the weight of each cluster will be calculated using the following formula: W(k) = n∑ x=1 Sx (1) where, k is the cluster number (K = 1, 2, . . . , n) and given that the text attributes are moved to the vector, Sx is the level of similarity of each data to the cluster centroid. In the next step, the average similarity of documents in each cluster is calculated by the following formula: M(k) = n∑ x=1 Sx/Tn (2) Thus, the total amount of document similarity relative to the centroid in each cluster is divided by the total number of documents Tn, and a numeric value is obtained as the average similarity of documents in each cluster. In the next step, the maximum and minimum amount of similarity in each cluster is calculated based on the following formulas: S(k−max) = Max(S1, S2, ..., Sn) (3) S(k−min) = Min(S1, S2, ..., Sn) (4) Avg(k) = (Sk−max + Sk−min)/2 (5) Diff(k) = M(k) − Avg(k) (6) Finally, the threshold limit of similarity is obtained for each cluster by this formula: Threshold(k) = ∣∣Diff(k) ∣∣ (7) The similarity of each data item in the cluster is compared to the centroid with the threshold value. If the similarity rate of the document to the 638 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 639 10 centroid is less than the threshold, it is considered as outlier in the cluster. As it was determined, to calculate the threshold limit, the difference between the median values of similarity and the average of similarities was used. Since the average of each cluster is obtained by dividing the sum of the values by the number of them, the existence of anomaly in the cluster leads to exceeding of standard deviation from the average value and increases the variance. As a result, mean similarity value of documents alone cannot be a good indicator of threshold limit. Therefor, the median amount of similarities, that is, the average of the similarity values of the most similar and most different documents are also entered into the threshold formula and its difference from the average values of the similarity will result in modulating the threshold limit. The pseudo-code of proposed algorithm is shown as below: Algorithm 1 The proposed algorithm Require: Input ⇒ Data set D = {d1, d2.....dn}, where n is the number of Documents (k the number of clusters) Ensure: output⇒ A set of K-clusters without outliers Require: Choose k objects from D as initial cluster centers repeat 1. Calculate distance of each data instances to centroid using CS 2. Reassign objects to the cluster with the most similarity 3. Update the cluster centroid due to the CS until Until no changes Calculate the weight based center Wk repeat 1. Calculate the Mean CS as M(k) 2. Calculate the Max(S1, S2, ..., Sn) and Min(S1, S2, ..., Sn) 3. Calculate the threshold limit of similarity for each cluster if dn < d(k) then Delete dn from Cluster k end if until End of Clusters 640 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 641 10 centroid is less than the threshold, it is considered as outlier in the cluster. As it was determined, to calculate the threshold limit, the difference between the median values of similarity and the average of similarities was used. Since the average of each cluster is obtained by dividing the sum of the values by the number of them, the existence of anomaly in the cluster leads to exceeding of standard deviation from the average value and increases the variance. As a result, mean similarity value of documents alone cannot be a good indicator of threshold limit. Therefor, the median amount of similarities, that is, the average of the similarity values of the most similar and most different documents are also entered into the threshold formula and its difference from the average values of the similarity will result in modulating the threshold limit. The pseudo-code of proposed algorithm is shown as below: Algorithm 1 The proposed algorithm Require: Input ⇒ Data set D = {d1, d2.....dn}, where n is the number of Documents (k the number of clusters) Ensure: output⇒ A set of K-clusters without outliers Require: Choose k objects from D as initial cluster centers repeat 1. Calculate distance of each data instances to centroid using CS 2. Reassign objects to the cluster with the most similarity 3. Update the cluster centroid due to the CS until Until no changes Calculate the weight based center Wk repeat 1. Calculate the Mean CS as M(k) 2. Calculate the Max(S1, S2, ..., Sn) and Min(S1, S2, ..., Sn) 3. Calculate the threshold limit of similarity for each cluster if dn < d(k) then Delete dn from Cluster k end if until End of Clusters A New Anomalous Text Detection Approach Using Unsupervised Methods ...11 3.1 Data Set In this study, two data sets (i.e., BBC and BBC sports News) were used [35]. BBC News contains 2225 documents from news articles on the BBC website in five news groups between 2004 and 2005. These five news groups were labeled under the title of Business, Entertainment, Politics, Sport, and Tech. The BBC sports Collection also contains 737 documents from the BBC sports website articles in five sports areas between 2004 and 2005, labeled as Athletics, Cricket, Football, Rugby, and Tennis. Each news data set contains a large number of text files from the broadcasted news text in several newsgroups with different topics on the BBC website. Since news topics differ in these texts, the words used in the text will also vary according to the type of news. According to the news genre, similar words are used in political news which are not used in sports news. Consequently, the words used in the sports news genre is similar, but it differs from the words of the business news genre. As a result, the similarity or difference of documents is characterized after conversion to the vector space. Documents in the sports news genre is placed in the same category at the clustering time. If business news documents are placed in this cluster, they are considered as outliers. The purpose of this approach is to find irrelevant documents that should be placed in a different cluster with regard to their subject. To identify anomaly, the algorithm is performed on the text data set with and without pre-processing. The goal of pre-processing of texts is stemming, removing of stop words, and weighing by term frequency–inverse document frequency (TF-IDF) method. Table 1 shows the details of the BBC data sets. Table 1: Summarization of the data sets Dataset Descriptions Documents Class Labels Classes BBC News News articles from BBC 2225 5 Business, Entertainment, Politics, Sport, Tech BBC Sports Sports news articles from BBC 737 5 Athletics, Cricket, Football, Rugby, Tennis 3.2 Text pre-processing The implementation of various operations on the text, including classification and clustering, requires the conversion of it into an understandable format for the system. As mentioned earlier, text documents are of an unstructured data type, and to perform the calculations it is necessary to convert them to a structured way. Stemming: The stemming process will convert words to their root form. For 640 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 641 12 example, the words ‘apply,’ ‘applied,’ and ‘application’ have the same root and they are all converted to the word ‘apply’ [36]. Stop words removal: In this step, a batch of worthless words like conjunc- tions and prepositions that are repeated alternately and do not have certain semantic meaning are deleted [37]. Bag-of-words model: it is a simple demonstration of text documents that are used in Natural Language Processing. In this model, each document is displayed regardless of grammar and how words are shown, but the number of words’ repetition matters. The result obtained from this model, will be a word-document matrix in which every row represents each document and every column represents each word. If there is a word in the document, in the corresponding column in the matrix 1 will be inserted otherwise 0 will be inserted. The first reference to ”bag-of-words model” in a linguistic context can be found in Zellig Harris’s article on Distributional Structure [38]. In this study, to display the texts of the bag-of-words model and conversion to the vector space model were used. Vector space model is an algebraic model for displaying text documents in the vector space. Weighing words by TF-IDF: In this method, words are assigned a weight based on its frequency in the text relative to their frequency in other texts. This weighing system shows how important a word is for a document. The first form of term weighting is due to Hans Peter Luhn (1957) which may be summarized as [39]: The weight of a term that occurs in a document is simply proportional to the term frequency. IDF was introduced by Karen Spärck Jones as ”term specificity” in [40, 41]. Although it has worked well as a heuristic, many researchers trying to find information theoretic justifications for it [41]. This criterion is made up of two functions of the TF (Term Frequency function) and IDF (Inverse Document Frequency function), and that means that if the number of repetitions of a specific word in the document is more and in other documents under investigation is less, this word is very impor- tant. This criterion is derived from the multiplication of two values (i.e., TF * IDF). TF equals the number of word repeats divided by the total number of words contained in the document [42,43]. tf − idft.d = tft.d × idft (8) tft,d = { log(1 + ft,d), ifft,d > 0 0, otherwise (9) 642 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 643 12 example, the words ‘apply,’ ‘applied,’ and ‘application’ have the same root and they are all converted to the word ‘apply’ [36]. Stop words removal: In this step, a batch of worthless words like conjunc- tions and prepositions that are repeated alternately and do not have certain semantic meaning are deleted [37]. Bag-of-words model: it is a simple demonstration of text documents that are used in Natural Language Processing. In this model, each document is displayed regardless of grammar and how words are shown, but the number of words’ repetition matters. The result obtained from this model, will be a word-document matrix in which every row represents each document and every column represents each word. If there is a word in the document, in the corresponding column in the matrix 1 will be inserted otherwise 0 will be inserted. The first reference to ”bag-of-words model” in a linguistic context can be found in Zellig Harris’s article on Distributional Structure [38]. In this study, to display the texts of the bag-of-words model and conversion to the vector space model were used. Vector space model is an algebraic model for displaying text documents in the vector space. Weighing words by TF-IDF: In this method, words are assigned a weight based on its frequency in the text relative to their frequency in other texts. This weighing system shows how important a word is for a document. The first form of term weighting is due to Hans Peter Luhn (1957) which may be summarized as [39]: The weight of a term that occurs in a document is simply proportional to the term frequency. IDF was introduced by Karen Spärck Jones as ”term specificity” in [40, 41]. Although it has worked well as a heuristic, many researchers trying to find information theoretic justifications for it [41]. This criterion is made up of two functions of the TF (Term Frequency function) and IDF (Inverse Document Frequency function), and that means that if the number of repetitions of a specific word in the document is more and in other documents under investigation is less, this word is very impor- tant. This criterion is derived from the multiplication of two values (i.e., TF * IDF). TF equals the number of word repeats divided by the total number of words contained in the document [42,43]. tf − idft.d = tft.d × idft (8) tft,d = { log(1 + ft,d), ifft,d > 0 0, otherwise (9) A New Anomalous Text Detection Approach Using Unsupervised Methods ...13 where t is the term, d is a bag of words (a document in IR terms), and ft,d is a frequency of the term in a bag. tft.d = ft.d Max(ft′.d : t ′ ∈ d) (10) IDF is the logarithm of the total number of documents divided by the total number of documents containing the target word [44]. idft,D = log |D| |d ∈ D : t ∈ d| = log N dft (11) where N is the cardinality of a corpus D (the total number of classes) and the denominator dft is a number of bags where the term t appears. idft = log N dft (12) Then, tf ∗ idf weight value for a term t in the bag d of a corpus D is defined as: tf ∗ idf(t, d, D) = tft,d × idft,D (13) tf ∗ idf(t, d, D) = log(1 + ft,d) × log N dft , forft,d > 0 (14) for all cases where ft,d > 0 and dft > 0, or zero otherwise. Once all frequency values are computed, term frequency matrix becomes the term weight matrix, whose columns used as class’ term weight vectors that facilitate the classification using Cosine Similarity. In accordance with the equation, the less the number of word repetition in the documents containing the target word, the more important. 3.3 Criterion for Assessing the Similarity of Two Documents The Cosine Similarity is the similarity criterion between the two vectors that calculates the cosine of the angle between the two vectors. A zero cosine is equal to 1, as a result, if two vectors coincide each other, their similarity is equal to one. It is obvious that this amount will show the highest possible similarity between vectors [45]. after preparing the words bag, the document will be displayed in the vector space. Then, the angle between the two vectors (the similarity of two documents) is calculated from the following formula: 642 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 643 14 For two vectors a and b Cosine Similarity is based on their inner product and defined as: similarity(a, b) = cos(θ) (15) cos(θ) = a · b ||a|| · ||b|| (16) a · b = n∑ i=1 aibi (17) ||a|| , ||b|| = √√√√ n∑ i=1 a2i , √√√√ n∑ i=1 b2i (18) similarity(a, b) = n∑ i=1 aibi √ n∑ i=1 a2i √ n∑ i=1 b2i (19) 4 Experimental Results In the following, the results of using the proposed approach in order to identify the abnormal data in the text will be investigated. To implement and evaluate the proposed approach, the following procedures are implemented: To evaluate the accuracy of the proposed approach, the data set is first divided into two parts. A part of data is considered as training data and other part as test data to measure the accuracy of the proposed algorithm. Also, in order to weigh the text keywords, the TF-IDF coefficient was used. Since five news genres exists in the data set, the number of clusters in the K-means algorithm is pre-determined and is equal to five. Since the used data set is labeled, the training data is used in order to learn K-means algorithm. Then, the number of documents placed in each cluster will be counted. By appointing a two-dimensional array, the index of each document with a Cosine Similarity relative to the centroid of each cluster (distance criterion) is stored in the array. According to the proposed formula in order to determine the threshold limit, the lowest and highest similarity values in each cluster relative to the centroid as well as the average spacing values are calculated. In the following, the results of the accuracy of decision tree classification 644 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 645 14 For two vectors a and b Cosine Similarity is based on their inner product and defined as: similarity(a, b) = cos(θ) (15) cos(θ) = a · b ||a|| · ||b|| (16) a · b = n∑ i=1 aibi (17) ||a|| , ||b|| = √√√√ n∑ i=1 a2i , √√√√ n∑ i=1 b2i (18) similarity(a, b) = n∑ i=1 aibi √ n∑ i=1 a2i √ n∑ i=1 b2i (19) 4 Experimental Results In the following, the results of using the proposed approach in order to identify the abnormal data in the text will be investigated. To implement and evaluate the proposed approach, the following procedures are implemented: To evaluate the accuracy of the proposed approach, the data set is first divided into two parts. A part of data is considered as training data and other part as test data to measure the accuracy of the proposed algorithm. Also, in order to weigh the text keywords, the TF-IDF coefficient was used. Since five news genres exists in the data set, the number of clusters in the K-means algorithm is pre-determined and is equal to five. Since the used data set is labeled, the training data is used in order to learn K-means algorithm. Then, the number of documents placed in each cluster will be counted. By appointing a two-dimensional array, the index of each document with a Cosine Similarity relative to the centroid of each cluster (distance criterion) is stored in the array. According to the proposed formula in order to determine the threshold limit, the lowest and highest similarity values in each cluster relative to the centroid as well as the average spacing values are calculated. In the following, the results of the accuracy of decision tree classification A New Anomalous Text Detection Approach Using Unsupervised Methods ...15 algorithms and Näıve Bayes method on two data sets before and after pre- processing by removing the anomalous data by the proposed method and the DBSCAN method was presented with different neighborhood distances. Fig. 3: Accuracy diagram of the decision tree after removing the anomalous data using the proposed method and DBSCAN with different neigh- borhood distance on the BBC data set The results in figures 3, and 4 showed that the proposed method on the BBC data set before the pre-processing phase was not improved in com- parison with the accuracy of Näıve Bayes method, but the accuracy of the decision tree using the proposed method increased in comparison with the DBSCAN method. The results in figures 5, and 6 proved that the accuracy of the pro- posed method like previous results on a pre-processed set of BBC has been improved compared to the non-pre-processed set so that the accuracy of the Näıve Bayes method after eliminating the anomalous data by the proposed method increased compared to the DBSCAN method. Also, the accuracy of decision tree after eliminating the anomalous data by the proposed method has increased on the pre-processed data set. The results in figures 7, and 8 disclosed that the proposed method on the BBC sports data set before the pre-processing phase in comparison with the accuracy of the Näıve Bayes method was not improved, but the accuracy of decision tree after removal of the anomalous data using the proposed method increased in comparison with the DBSCAN method. 644 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 645 16 Fig. 4: Accuracy diagram of the Näıve Bayes algorithm after removing the anomalous data using the proposed method and DBSCAN with dif- ferent neighborhood distance on the BBC data set Fig. 5: Accuracy diagram of the decision tree after removing the anomalous data using the proposed method and DBSCAN with different neigh- borhood distance on the pre-processed BBC data set The results in figures 9, and 10 revealed that the accuracy of the proposed method like previous results on a pre-processed set of BBC sports 646 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 647 16 Fig. 4: Accuracy diagram of the Näıve Bayes algorithm after removing the anomalous data using the proposed method and DBSCAN with dif- ferent neighborhood distance on the BBC data set Fig. 5: Accuracy diagram of the decision tree after removing the anomalous data using the proposed method and DBSCAN with different neigh- borhood distance on the pre-processed BBC data set The results in figures 9, and 10 revealed that the accuracy of the proposed method like previous results on a pre-processed set of BBC sports A New Anomalous Text Detection Approach Using Unsupervised Methods ...17 Fig. 6: Accuracy diagram of the Näıve Bayes algorithm after removing the anomalous data using the proposed method and the DBSCAN method with different neighborhood distance on the pre-processed BBC data set Fig. 7: Accuracy diagram of the decision tree after removing the anomalous data using the proposed method and DBSCAN method with different neighborhood distance on the sports data set 646 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 647 18 Fig. 8: Accuracy diagram of the Näıve Bayes algorithm after removing the anomalous data using the proposed method and DBSCAN method with different neighborhood distances on the BBC Sport data set Fig. 9: Accuracy diagram of the decision tree after removing the anomalous data using the proposed method and DBSCAN with different neigh- borhood distance on the pre-processed BBC Sport data set news has been improved compared to the non-pre-processed set so that the accuracy of the Näıve Bayes method after eliminating the anomalous data by 648 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 649 18 Fig. 8: Accuracy diagram of the Näıve Bayes algorithm after removing the anomalous data using the proposed method and DBSCAN method with different neighborhood distances on the BBC Sport data set Fig. 9: Accuracy diagram of the decision tree after removing the anomalous data using the proposed method and DBSCAN with different neigh- borhood distance on the pre-processed BBC Sport data set news has been improved compared to the non-pre-processed set so that the accuracy of the Näıve Bayes method after eliminating the anomalous data by A New Anomalous Text Detection Approach Using Unsupervised Methods ...19 Fig. 10: Accuracy diagram of the Näıve Bayes algorithm after removing the anomalous data using the proposed method and DBSCAN with dif- ferent neighborhood distance on the pre-processed BBC data set the proposed method increased compared to the DBSCAN method. Also, the accuracy of decision tree after removal of the anomalous data by the proposed method increased on the pre-processed data set. Table 2, compares two similar works with our proposed method on the same data set. Table 2: Results on the BBC News data set Authors Descriptions Accuracy Zhang et al. Heterogeneous graph embedding for emerging relation detection from news 64.4% Greene et al. Kernel-based ensemble clustering approach 88.0% Our Proposed Method A combination of two approaches (clustering-based and distance-based) 92.39% 5 Conclusion and Future Works In this research, a novel approach for identifying the anomalous text data using unsupervised methods was proposed well. The advantage of using our proposed model as an unsupervised method is that there is no need for prior knowledge and training data. In this research, we assumed that the number 648 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 649 20 of anomalous data is negligible compared to normal data. The documents studied were also in English and the Cosine Similarity (CS) criterion was used to compare the distance between the documents. Therefore, a given document which is the least similar to others is considered as an anoma- lous document. In the proposed method, a combination of two approaches clustering-based and distance-based is used for detecting anomaly in the text data. In order to evaluate the efficiency of the proposed approach, this method is applied on four labeled data sets. In general, the obtained re- sults show that the classification accuracy of the documents after applying the anomalous outlier detection algorithm and removing them from the pre- processed data set is always improved and performs well in non-pre-processed data sets. In order to determine the threshold, our model iteratively runs some al- gorithms with high mathematical calculation. Consequently computational complexity increases in our approach. Besides, it should be noted that the user must specify k (the number of clusters) at the beginning. So an im- proved k-means would be used as a solution. We have to repeat the k-means algorithm several times to fix the best clustering and prevent the selection of outliers as initial seeds. Furthermore, we should point out that based on the achieved results the pre-processing step can affect the accuracy. In the fu- ture, we are going to use improved k-means algorithm, evaluate other types of clustering algorithms, apply the model to other languages, investigate different distance thresholds (similarity) to tackle these issues. References [1] Z. A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris, “A comparative study for outlier detection techniques in data mining,” in 2006 IEEE conference on cybernetics and intelligent systems. IEEE, 2006, pp. 1–6. [2] A. Esmaeili Kelishomi, A. Garmabaki, M. Bahaghighat, and J. Dong, “Mo- bile user indoor-outdoor detection through physical daily activities,” Sensors, vol. 19, no. 3, p. 511, 2019. [3] M. Ghorbani, M. Bahaghighat, Q. Xin, and F. Özen, “Convlstmconv network: a deep learning approach for sentiment analysis in cloud computing,” Journal of Cloud Computing, vol. 9, no. 1, pp. 1–12, 2020. [4] M. Bahaghighat, L. Akbari, and Q. Xin, “A machine learning-based approach for counting blister cards within drug packages,” IEEE Access, vol. 7, pp. 83 785–83 796, 2019. 650 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 651 20 of anomalous data is negligible compared to normal data. The documents studied were also in English and the Cosine Similarity (CS) criterion was used to compare the distance between the documents. Therefore, a given document which is the least similar to others is considered as an anoma- lous document. In the proposed method, a combination of two approaches clustering-based and distance-based is used for detecting anomaly in the text data. In order to evaluate the efficiency of the proposed approach, this method is applied on four labeled data sets. In general, the obtained re- sults show that the classification accuracy of the documents after applying the anomalous outlier detection algorithm and removing them from the pre- processed data set is always improved and performs well in non-pre-processed data sets. In order to determine the threshold, our model iteratively runs some al- gorithms with high mathematical calculation. Consequently computational complexity increases in our approach. Besides, it should be noted that the user must specify k (the number of clusters) at the beginning. So an im- proved k-means would be used as a solution. We have to repeat the k-means algorithm several times to fix the best clustering and prevent the selection of outliers as initial seeds. Furthermore, we should point out that based on the achieved results the pre-processing step can affect the accuracy. In the fu- ture, we are going to use improved k-means algorithm, evaluate other types of clustering algorithms, apply the model to other languages, investigate different distance thresholds (similarity) to tackle these issues. References [1] Z. A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris, “A comparative study for outlier detection techniques in data mining,” in 2006 IEEE conference on cybernetics and intelligent systems. IEEE, 2006, pp. 1–6. [2] A. Esmaeili Kelishomi, A. Garmabaki, M. Bahaghighat, and J. Dong, “Mo- bile user indoor-outdoor detection through physical daily activities,” Sensors, vol. 19, no. 3, p. 511, 2019. [3] M. Ghorbani, M. Bahaghighat, Q. Xin, and F. Özen, “Convlstmconv network: a deep learning approach for sentiment analysis in cloud computing,” Journal of Cloud Computing, vol. 9, no. 1, pp. 1–12, 2020. [4] M. Bahaghighat, L. Akbari, and Q. Xin, “A machine learning-based approach for counting blister cards within drug packages,” IEEE Access, vol. 7, pp. 83 785–83 796, 2019. A New Anomalous Text Detection Approach Using Unsupervised Methods ...21 [5] M. Bahaghighat, S. A. Motamedi, and Q. Xin, “Image transmission over cog- nitive radio networks for smart grid applications,” Applied Sciences, vol. 9, no. 24, p. 5498, 2019. [6] F. Abedini, M. Bahaghighat, and M. S’hoyan, “Wind turbine tower detec- tion using feature descriptors and deep learning,” Facta Universitatis, Series: Electronics and Energetics, vol. 33, no. 1, pp. 133–153, 2019. [7] M. Bahaghighat, F. Abedini, M. S’hoyan, and A.-J. Molnar, “Vision inspection of bottle caps in drink factories using convolutional neural networks,” in 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP). IEEE, 2019, pp. 381–385. [8] S. Hasani, M. Bahaghighat, and M. Mirfatahia, “The mediating effect of the brand on the relationship between social network marketing and consumer behavior,” Acta Technica Napocensis, vol. 60, no. 2, pp. 1–6, 2019. [9] M. Bahaghighat, Q. Xin, S. A. Motamedi, M. M. Zanjireh, and A. Vacavant, “Estimation of wind turbine angular velocity remotely found on video mining and convolutional neural network,” Applied Sciences, vol. 10, no. 10, p. 3544, 2020. [10] M. Bahaghighat and S. A. Motamedi, “Vision inspection and monitoring of wind turbine farms in emerging smart grids,” Facta universitatis-series: Elec- tronics and Energetics, vol. 31, no. 2, pp. 287–301, 2018. [11] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM computing surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009. [12] J. D. Parmar and J. T. Patel, “Anomaly detection in data mining: A review,” International Journal, vol. 7, no. 4, 2017. [13] R. Kaur and S. Singh, “A survey of data mining and social network analysis based anomaly detection techniques,” Egyptian informatics journal, vol. 17, no. 2, pp. 199–216, 2016. [14] D. Guthrie, “Unsupervised detection of anomalous text,” Ph.D. dissertation, Citeseer, 2008. [15] J. Samuel, G. Ali, M. Rahman, E. Esawi, Y. Samuel et al., “Covid-19 public sentiment insights and machine learning for tweets classification,” Information, vol. 11, no. 6, p. 314, 2020. [16] S. Latif, M. Usman, S. Manzoor, W. Iqbal, J. Qadir, G. Tyson, I. Castro, A. Razi, M. N. K. Boulos, A. Weller et al., “Leveraging data science to combat covid-19: A comprehensive review,” 2020. [17] A. Hotho, A. Nürnberger, and G. Paaß, “A brief survey of text mining.” in Ldv Forum, vol. 20, no. 1. Citeseer, 2005, pp. 19–62. [18] A. K. Nassirtoussi, S. Aghabozorgi, T. Y. Wah, and D. C. L. Ngo, “Text mining for market prediction: A systematic review,” Expert Systems with Ap- plications, vol. 41, no. 16, pp. 7653–7670, 2014. 650 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 651 22 [19] A. Mahapatra, N. Srivastava, and J. Srivastava, “Contextual anomaly detec- tion in text data,” Algorithms, vol. 5, no. 4, pp. 469–489, 2012. [20] M. Goldstein and S. Uchida, “A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data,” PloS one, vol. 11, no. 4, p. e0152173, 2016. [21] T.-E. Lin, H. Xu, and H. Zhang, “Discovering new intents via constrained deep adaptive clustering with cluster refinement.” in AAAI, 2020, pp. 8360–8367. [22] I. Aalto et al., “Discovering topics in slack message streams,” 2020. [23] R. Kannan, H. Woo, C. C. Aggarwal, and H. Park, “Outlier detection for text data,” in Proceedings of the 2017 siam international conference on data mining. SIAM, 2017, pp. 489–497. [24] M. T. Sereshki and M. M. Zanjireh, “Outlier detection in text data: An un- supervised method based on text similarity and density peak,” 2020. [25] M. Montes-y Gómez, A. Gelbukh, and A. López-López, “Detecting deviations in text collections: An approach using conceptual graphs,” in Mexican Inter- national Conference on Artificial Intelligence. Springer, 2002, pp. 176–184. [26] S. Chellamuthu and M. Punithavalli, “Enhanced k-means with greedy algo- rithm for outlier detection,” International Journal of Advanced Research in Computer Science, vol. 3, no. 3, 2012. [27] J. Wang and X. Su, “An improved k-means clustering algorithm,” in 2011 IEEE 3rd International Conference on Communication Software and Net- works. IEEE, 2011, pp. 44–46. [28] M. Marghny and A. I. Taloba, “Outlier detection using improved genetic k- means,” arXiv preprint arXiv:1402.6859, 2014. [29] D. Lei, Q. Zhu, J. Chen, H. Lin, and P. Yang, “Automatic k-means clustering algorithm for outlier detection,” in Information engineering and applications. Springer, 2012, pp. 363–372. [30] C. Yin and S. Zhang, “Parallel implementing improved k-means applied for image retrieval and anomaly detection,” Multimedia Tools and Applications, vol. 76, no. 16, pp. 16 911–16 927, 2017. [31] X.-j. Tong, F.-R. Meng, and Z.-x. Wang, “Optimization to k-means initial cluster centers,” Computer Engineering and Design, vol. 32, no. 8, pp. 2721– 2723, 2011. [32] J. Zhang, C.-T. Lu, M. Zhou, S. Xie, Y. Chang, and S. Y. Philip, “Heer: Heterogeneous graph embedding for emerging relation detection from news,” in 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016, pp. 803–812. [33] D. Greene and P. Cunningham, “Efficient ensemble methods for document clustering,” Department of Computer Science, Trinity College Dublin, Tech. Rep., 2006. 652 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 653 22 [19] A. Mahapatra, N. Srivastava, and J. Srivastava, “Contextual anomaly detec- tion in text data,” Algorithms, vol. 5, no. 4, pp. 469–489, 2012. [20] M. Goldstein and S. Uchida, “A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data,” PloS one, vol. 11, no. 4, p. e0152173, 2016. [21] T.-E. Lin, H. Xu, and H. Zhang, “Discovering new intents via constrained deep adaptive clustering with cluster refinement.” in AAAI, 2020, pp. 8360–8367. [22] I. Aalto et al., “Discovering topics in slack message streams,” 2020. [23] R. Kannan, H. Woo, C. C. Aggarwal, and H. Park, “Outlier detection for text data,” in Proceedings of the 2017 siam international conference on data mining. SIAM, 2017, pp. 489–497. [24] M. T. Sereshki and M. M. Zanjireh, “Outlier detection in text data: An un- supervised method based on text similarity and density peak,” 2020. [25] M. Montes-y Gómez, A. Gelbukh, and A. López-López, “Detecting deviations in text collections: An approach using conceptual graphs,” in Mexican Inter- national Conference on Artificial Intelligence. Springer, 2002, pp. 176–184. [26] S. Chellamuthu and M. Punithavalli, “Enhanced k-means with greedy algo- rithm for outlier detection,” International Journal of Advanced Research in Computer Science, vol. 3, no. 3, 2012. [27] J. Wang and X. Su, “An improved k-means clustering algorithm,” in 2011 IEEE 3rd International Conference on Communication Software and Net- works. IEEE, 2011, pp. 44–46. [28] M. Marghny and A. I. Taloba, “Outlier detection using improved genetic k- means,” arXiv preprint arXiv:1402.6859, 2014. [29] D. Lei, Q. Zhu, J. Chen, H. Lin, and P. Yang, “Automatic k-means clustering algorithm for outlier detection,” in Information engineering and applications. Springer, 2012, pp. 363–372. [30] C. Yin and S. Zhang, “Parallel implementing improved k-means applied for image retrieval and anomaly detection,” Multimedia Tools and Applications, vol. 76, no. 16, pp. 16 911–16 927, 2017. [31] X.-j. Tong, F.-R. Meng, and Z.-x. Wang, “Optimization to k-means initial cluster centers,” Computer Engineering and Design, vol. 32, no. 8, pp. 2721– 2723, 2011. [32] J. Zhang, C.-T. Lu, M. Zhou, S. Xie, Y. Chang, and S. Y. Philip, “Heer: Heterogeneous graph embedding for emerging relation detection from news,” in 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016, pp. 803–812. [33] D. Greene and P. Cunningham, “Efficient ensemble methods for document clustering,” Department of Computer Science, Trinity College Dublin, Tech. Rep., 2006. A New Anomalous Text Detection Approach Using Unsupervised Methods ...23 [34] J. Manoharan, S. H. Ganesh, and J. Sathiaseelan, “Outlier detection using enhanced k-means clustering algorithm and weight-based center approach,” Int. J. Comput. Sci. Mobile Comput., vol. 5, no. 4, pp. 453–464, 2016. [35] D. Greene and P. Cunningham, “Practical solutions to the problem of diag- onal dominance in kernel document clustering,” in Proc. 23rd International Conference on Machine learning (ICML’06). ACM Press, 2006, pp. 377–384. [36] F. N. Flores and V. P. Moreira, “Assessing the impact of stemming accuracy on information retrieval–a multilingual perspective,” Information Processing & Management, vol. 52, no. 5, pp. 840–854, 2016. [37] W. J. Wilbur and K. Sirotkin, “The automatic identification of stop words,” Journal of information science, vol. 18, no. 1, pp. 45–55, 1992. [38] Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2-3, pp. 146–162, 1954. [39] H. P. Luhn, “A statistical approach to mechanized encoding and searching of literary information,” IBM Journal of research and development, vol. 1, no. 4, pp. 309–317, 1957. [40] K. S. Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of documentation, 1972. [41] S. Robertson, “Understanding inverse document frequency: on theoretical ar- guments for idf,” Journal of documentation, 2004. [42] G. Salton and C. Buckley, “Term-weighting approaches in automatic text re- trieval,” Information processing & management, vol. 24, no. 5, pp. 513–523, 1988. [43] C. D. Manning, P. Raghavan, and H. Schütze, “Scoring, term weighting and the vector space model,” Introduction to information retrieval, vol. 100, pp. 2–4, 2008. [44] K. Church and W. Gale, “Inverse document frequency (idf): A measure of de- viations from poisson,” in Natural language processing using very large corpora. Springer, 1999, pp. 283–295. [45] A. Huang, “Similarity measures for text document clustering,” in Proceedings of the sixth new zealand computer science research student conference (NZC- SRSC2008), Christchurch, New Zealand, vol. 4, 2008, pp. 9–56. 652 E. AMoUEE, M. M. ZANjIREh , M. BAhAghIghAt, M. ghoRBANI A New Anomalous text Detection Approach Using Unsupervised Methods... 653