Proceedings of Engineering and Technology Innovation, vol. 3, 2016, pp. 28-30

Using Unsupervised Machine Learning to Detect Peer-to-Peer Botnet Flows

Andrea E. Medina Paredes 1, Yuan-Yuan Su 2, Wei Wu 3, Hung-Min Sun 4,*

1 Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu, Taiwan.
2 Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan.
3 Fujian Provincial Key Laboratory of Network Security and Cryptology, School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, Fujian, China.
4 Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan.

Received 22 February 2016; received in revised form 19 March 2016; accepted 06 April 2016

Abstract

The war against botnet infection is fought every day by users who want to feel safe against the threat of compromised hosts. In this paper we focus on the behavior of peer-to-peer (P2P) botnets, which, along with hybrid botnets, are a growing trend among attackers. The main approach consists of a behavior comparison among features extracted from network flows, focusing only on flows from P2P applications, including P2P botnets.

Keywords: clusters, network flows, P2P botnets, unsupervised learning

1. Introduction

Malicious software such as botnets has been around for quite some time and keeps improving, evolving, and growing. Detection systems try to keep track of emerging botnet trends, but some fail to provide a definite and accurate solution to this problem. For example, in May 2015 “8chan”, a website composed of user-created boards, reported a series of DDoS attacks coming from the “Hola!” network, a popular virtual private network used for viewing blocked videos and TV shows from other countries, which has a pool of 6 million IP addresses [1].
Activities such as adding signatures to databases, protecting servers against hackers, using anti-virus software to protect computers from infection, tracking C&C server activities, and many other actions are taken into consideration, but cybercriminals still find ways around these security measures. Supervised learning models are one of many approaches that can be used to deal with botnets. Classifiers such as support vector machines (SVM) have shown great accuracy in separating botnet network flows from normal flows [2], and other methods such as decision tree algorithms have also been put to the test by measuring how accurately the decision tree classifies the data [3]. The drawback of these methods is that most of them need labelled data in order to function and yield better results only when the botnet signature is already known.

In the hope of contributing to these efforts, in this paper we propose the use of unsupervised machine learning algorithms for botnet detection. A comparison among three clustering algorithms, using a set of features extracted from network flows, is carried out throughout this paper. The rest of the paper is organized as follows: the method used, the experiment design with the respective observations, and finally the analysis of the results.

* Corresponding author. Email: hmsun@cs.nthu.edu.tw
Copyright © TAETI

2. Method

The structural synthesis of CCPGTs will be performed based on the creative design methodology process [7-8]. The design requirements and design constraints are summarized based on the characteristics of the mechanism.

2.1. Approach using Unsupervised Learning

Traffic can be classified by selecting the attributes that distinguish its behavior; we want the unsupervised learning algorithm to find the patterns hidden among the P2P flows.
To facilitate detection, a preliminary process selects the most relevant features; these inputs are then fed to the clustering algorithms in order to compare their overall performance in creating clusters based on the characteristics of those features. The resulting clusters are then cross-validated to ensure the legitimacy of the outputs. Fig. 1 shows a flowchart of the overall process.

Fig. 1 Overall process of the approach using Unsupervised Learning

2.2. Clusters Evaluation Measurements

In this paper we choose three clustering algorithms, Simple K-means, Farthest First [4], and DB Scan [5], and compare which is the best fit to classify P2P botnet traces using network flows extracted from the packets sent within a network [6]. The data is later divided into subsets for evaluating the machine learning algorithms; the sets contain both kinds of P2P traffic, malicious flows and non-malicious flows. Performance statistics are calculated for all the trials. The class is ignored during all the cluster evaluations. Validation measures are used to evaluate the credibility of the clusters. Because the classes are highly imbalanced, which preserves the real scenario of a network, we cannot rely only on the accuracy of the clusters. The Classification Oriented Measures of Cluster Validity are described below:

• The number of instances correctly classified as malicious is referred to as True Positive (TP).
• The number of instances classified as malicious that are actually normal, and therefore should have been rejected, is referred to as False Positive (FP).
• The number of instances classified as normal that are actually malicious is referred to as False Negative (FN).
• The number of normal instances correctly rejected is referred to as True Negative (TN).

3. Results and Discussion

We need to utilize clustering algorithms which can handle such differences in the data, along with the top 10 ranked attributes from the feature extraction. As mentioned before, we used WEKA [6] to run these three clustering algorithms.

3.1. Dataset Assembly for Testing

The dataset used in this paper was obtained from a previous research group that made its datasets publicly available; their paper describes a P2P traffic categorization system called “PeerRush” [7]. The labeled data of all four P2P applications (Emule, UTorrent, Vuze and FrostWire), along with Zeus and Waledac, were used for testing purposes. The dataset was divided into 3 combinations of subsets, containing both kinds of flows, labeled either Non-Malicious or Malicious:

• Dataset 1 (Zeus traces): A total of 17,940 flows are contained in the dataset, 95% non-malicious traces and 5% malicious traces.
• Dataset 2 (Waledac traces): A total of 12,310 flows are contained in the dataset, 93% non-malicious traces and 7% malicious traces.
• Dataset 3 (Zeus and Waledac traces): A total of 12,334 flows are contained in the dataset, 92% non-malicious traces and 8% malicious traces.

3.2. Unsupervised Learning Comparison

The accuracy of all the algorithms is shown in Fig. 2. DB Scan performed well in all the situations assigned; each change of dataset slightly diminishes its accuracy, but in general it maintains the highest percentage. Simple K-means improved in the last test but still had some imbalance in the number of correctly classified malicious instances that were retrieved.

Fig. 2 Accuracy of Unsupervised Algorithms for all Datasets

In Table 1 we can compare the real performance values of each algorithm.
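To make concrete why accuracy alone is not enough under the class imbalance noted in Section 2.2, the following minimal sketch derives accuracy, precision, recall, and F1 from the TP, FP, FN, and TN counts. Several things here are assumptions rather than part of the paper: the precision, recall, and F1 formulas are the standard textbook definitions; the names `ConfusionCounts` and `safe_div` are illustrative; the malicious-flow count is estimated from the rounded 5% figure reported for Dataset 1; and the "label everything as normal" baseline is a hypothetical detector, not one of the three algorithms evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for the malicious (positive) class, following Section 2.2."""
    tp: int  # malicious flows classified as malicious
    fp: int  # normal flows classified as malicious
    fn: int  # malicious flows classified as normal
    tn: int  # normal flows classified as normal


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def accuracy(c: ConfusionCounts) -> float:
    return safe_div(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)


def precision(c: ConfusionCounts) -> float:
    return safe_div(c.tp, c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    return safe_div(c.tp, c.tp + c.fn)


def f1_score(c: ConfusionCounts) -> float:
    p, r = precision(c), recall(c)
    return safe_div(2 * p * r, p + r)


# Dataset 1 reports 17,940 flows with about 5% malicious traces.
# The exact count is not given, so it is estimated from the rounded share.
total_flows = 17_940
malicious = round(0.05 * total_flows)
normal = total_flows - malicious

# Hypothetical baseline (an assumption): every flow is labelled normal.
baseline = ConfusionCounts(tp=0, fp=0, fn=malicious, tn=normal)

print(f"accuracy  = {accuracy(baseline):.3f}")
print(f"precision = {precision(baseline):.3f}")
print(f"recall    = {recall(baseline):.3f}")
print(f"F1        = {f1_score(baseline):.3f}")
```

Under these assumptions the baseline reaches about 95% accuracy while retrieving no malicious flow at all, which is why precision and recall for the malicious class are worth reading alongside accuracy.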
DB Scan proves to be a worthy contender against P2P botnet flows and may be used to improve the precision of detection systems along with other security tools. We believe that the performance of each algorithm depends on the quality of the data and on its preprocessing.

Table 1 Average measurement values for each algorithm

4. Conclusions

Unsupervised learning was proposed for classifying P2P traffic flows, in contrast to previous methods that use supervised learning. The results lead us to believe the data is suitable for density-based clustering, since the DB Scan algorithm performed well in every situation, obtaining high precision in classifying P2P botnet flows and retrieving most of these malicious flows from the normal P2P traffic.

Acknowledgement

This research was supported in part by the Ministry of Science and Technology, Taiwan, under the Grants MOST 104-3115-E-007-004 and MOST 103-2221-E-007-073-MY3.

References

[1] R. Price, “Business Insider,” http://www.businessinsider.com/hola-used-for-botnet-on-chrome, May 2015.
[2] P. Barthakur, M. Dahal, and M. K. Ghose, “A framework for P2P botnet detection using SVM,” in International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, Sanya, 2012.
[3] W. H. Liao and C. C. Chang, “Peer to peer botnet detection using data mining scheme,” International Conference on Internet Technology and Applications, Wuhan, 2010.
[4] S. Hochbaum, “A best possible heuristic for the k-center problem,” in Mathematics of Operations Research, 1985.
[5] P. N. Tan, M. Steinbach, and V. Kumar, “Cluster analysis: basic concepts and algorithms,” Introduction to Data Mining, Pearson, pp. 487-559, 2005.
[6] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” SIGKDD Explorations, 2009.
[7] B. Rahbarinia, R. Perdisci, A. Lanzi, and K.
Li, “PeerRush: mining for unwanted P2P traffic,” Detection of Intrusions and Malware, and Vulnerability Assessment, Springer, vol. 7967, pp. 62-82, 2013.