


Proceedings of Engineering and Technology Innovation, vol. 3, 2016, pp. 28 - 30 


Using Unsupervised Machine Learning to Detect Peer-to-Peer Botnet Flows

Andrea E. Medina Paredes 1, Yuan-Yuan Su 2, Wei Wu 3, Hung-Min Sun 4,*
 

1 Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu, Taiwan.

2 Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan.

3 Fujian Provincial Key Laboratory of Network Security and Cryptology, School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, Fujian, China.

4 Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan.

Received 22 February 2016; received in revised form 19 March 2016; accepted 06 April 2016

 
Abstract 

The war against botnet infection is fought every day by users who want to feel safe against any threat of compromised hosts. In this paper we focus on the behavior of Peer-to-Peer (P2P) botnets, which, along with hybrid botnets, are a growing trend among attackers. The main approach consists of a behavior comparison among features extracted from network flows, focusing only on the flows from P2P applications, including P2P botnets.

Keywords: clusters, network flows, P2P botnets, unsupervised learning

1. Introduction 

Malicious software such as botnets has been around for quite some time, and it keeps improving, evolving and growing. Detection systems try to keep track of these emerging botnet trends, but some fail to provide a definite and accurate solution to this problem. For example, this past May “8chan”, a website composed of user-created boards, reported a series of DDoS attacks coming from the “Hola!” network, a popular virtual private network used for viewing blocked videos and TV shows from other countries, which has a pool of 6 million IP addresses [1].

Activities such as adding signatures to databases, protecting servers against hackers, using anti-virus software to protect computers from getting infected, tracking C&C server activities and many other actions are taken into consideration, but cybercriminals still find a way around the security measures. The use of supervised learning models is one of many approaches that can be used to deal with botnets. Classifiers like support vector machines (SVM) have shown great accuracy in separating botnet network flows from normal flows [2]; other methods like decision tree algorithms have been put to the test as well, measuring how accurately the decision tree classifies the data [3]. The drawback of the previously mentioned methods is that most of them need labelled data in order to function and only yield better results when the botnet signature is already known.

In hopes of contributing to these efforts, in this paper we propose the use of unsupervised machine learning algorithms for botnet detection. A comparison among three clustering algorithms, using a set of features extracted from network flows, is carried out throughout this paper. The rest of this paper is organized as follows: the method used, the experiment design with the respective observations, and finally the analysis of the results.

2. Method 


2.1. Approach using Unsupervised Learning  

Traffic can be classified by selecting the attributes that distinguish its behavior; we want the unsupervised learning algorithm to find the patterns hidden among the P2P flows. To facilitate detection, a preliminary process selects the most relevant features, and these inputs are then fed to the clustering algorithms in order to compare their overall performance in creating clusters based on the characteristics of those features. The resulting clusters are then cross-validated in order to ensure the legitimacy of the outputs. Fig. 1 shows a flowchart of the overall process:

* Corresponding author. Email: hmsun@cs.nthu.edu.tw

 
Fig. 1 Overall process of the approach using Unsupervised Learning

2.2. Clusters Evaluation Measurements 

In this paper we choose Simple K-means, Farthest First [4] and DB Scan [5] as the clustering algorithms, comparing which is the best fit to classify P2P botnet traces using network flows extracted from the packets sent within a network [6]. The data is later divided into subsets for evaluating the machine learning algorithms; the sets contain P2P traffic of both kinds, Malicious flows and Non-Malicious flows. Performance statistics are calculated for all the trials. The class is ignored during all the evaluations for the clusters. The validation measures are used to evaluate the credibility of the clusters; in this case, because of the high imbalance between classes, which is kept to reflect the real scenario of a network, we cannot rely only on the accuracy measurement of the clusters.
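To make one of the three clustering algorithms concrete, the following is a minimal sketch of the farthest-first traversal idea behind Farthest First [4]. It is not WEKA's implementation: the function names, the fixed starting point and the toy two-feature flow vectors are assumptions made only for illustration.

```python
import math


def farthest_first_centers(points, k, start=0):
    """Pick k centers with the farthest-first traversal heuristic."""
    centers = [points[start]]
    # Distance from every point to its nearest chosen center so far.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        idx = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centers


def assign(points, centers):
    """Assign every point to the index of its closest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


# Hypothetical two-feature flow vectors (illustration only, not paper data).
flows = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centers = farthest_first_centers(flows, k=2)
labels = assign(flows, centers)
print(centers, labels)
```

Unlike Simple K-means, this heuristic makes a single pass to choose the centers and does not iteratively move them, which is one reason the two algorithms can produce different clusters on the same flows.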

The Classification Oriented Measures of Cluster Validity are described below:

• The number of malicious instances correctly classified as malicious is referred to as the True Positive (TP).

• The number of instances classified as malicious that are actually normal, and should therefore have been rejected, is referred to as the False Positive (FP).

• The number of instances classified as normal that are actually malicious is referred to as the False Negative (FN).

• The number of normal instances correctly rejected is referred to as the True Negative (TN).
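The four counts above can be sketched in code as follows. This is an illustrative sketch, not the evaluation procedure used in the experiments: it assumes that each cluster is mapped to its majority class (a simplification), and the function names and the toy data are hypothetical. Accuracy, precision and recall use their standard definitions.

```python
from collections import Counter


def confusion_counts(cluster_ids, true_labels, positive="Malicious"):
    """Map each cluster to its majority class, then count TP, FP, FN, TN."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    majority = {cid: counts.most_common(1)[0][0]
                for cid, counts in by_cluster.items()}

    tp = fp = fn = tn = 0
    for cid, label in zip(cluster_ids, true_labels):
        predicted = majority[cid]
        if predicted == positive and label == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif label == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall; undefined ratios become 0.0."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical toy run: two clusters over ten flows.
# The labels are used only for scoring, never for clustering.
clusters = [0] * 7 + [1] * 3
labels = (["Non-Malicious"] * 6 + ["Malicious"]
          + ["Malicious", "Malicious", "Non-Malicious"])
counts = confusion_counts(clusters, labels)
accuracy, precision, recall = metrics(*counts)
print(counts, accuracy, precision, recall)
```

In this toy run the accuracy is 0.8 while precision and recall are both about 0.67, which shows how accuracy alone can overstate detection quality when malicious flows are the minority.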

3. Results and Discussion 

We need to utilize clustering algorithms that can handle such differences in the data, along with the top 10 ranked attributes from the feature extraction. As mentioned before, we used WEKA [6] to run these three clustering algorithms.

3.1. Dataset Assembly for Testing

The dataset used in this paper was obtained from a previous research group that made its datasets publicly available; their paper describes a P2P traffic categorization system called “PeerRush” [7]. The labeled data of all four P2P applications (Emule, UTorrent, Vuze and FrostWire), along with Zeus and Waledac, were used for testing purposes. The dataset was divided into 3 combinations of subsets, containing both kinds of flows, labeled either Non-Malicious or Malicious:

• Dataset 1 (Zeus traces): A total of 17,940 flows are contained in the dataset, 95% non-malicious traces and 5% malicious traces.

• Dataset 2 (Waledac traces): A total of 12,310 flows are contained in the dataset, 93% non-malicious traces and 7% malicious traces.

• Dataset 3 (Zeus and Waledac traces): A total of 12,334 flows are contained in the dataset, 92% non-malicious traces and 8% malicious traces.
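A quick worked calculation makes the imbalance in these subsets concrete. The flow totals and percentages are the ones listed above; the per-class counts are only approximations, since the percentages are rounded, and the helper name `baseline` is our own.

```python
# Flow totals and malicious percentages as reported for the three datasets.
datasets = {
    "Dataset 1 (Zeus)": (17940, 5),
    "Dataset 2 (Waledac)": (12310, 7),
    "Dataset 3 (Zeus and Waledac)": (12334, 8),
}


def baseline(total_flows, malicious_pct):
    """Approximate class sizes and the accuracy of an 'all non-malicious' guess."""
    malicious = round(total_flows * malicious_pct / 100)
    benign = total_flows - malicious
    return malicious, benign, benign / total_flows


summary = {name: baseline(total, pct)
           for name, (total, pct) in datasets.items()}
for name, (malicious, benign, acc) in summary.items():
    print(f"{name}: ~{malicious} malicious, ~{benign} non-malicious, "
          f"trivial accuracy {acc:.1%}")
```

A detector that labels every flow as non-malicious would already reach roughly 92% to 95% accuracy on these subsets while retrieving no malicious flows at all, which is why accuracy cannot be the only measure.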

3.2. Unsupervised Learning Comparison  

The accuracy of all the algorithms is shown in Fig. 2.

DB Scan performed well in all the assigned situations; each change of dataset slightly diminishes its accuracy, but in general it maintains the highest percentage. Simple K-means improved in the last test but still showed some imbalance in the number of correctly classified malicious instances that were retrieved.

 
Fig. 2 Accuracy of Unsupervised Algorithms for all Datasets

In Table 1 we can compare the real performance values of each algorithm. DB Scan proves to be a worthy contender against P2P botnet flows and may be used to improve the precision of detection systems along with other security tools. We believe that the performance of each algorithm depends on the quality of the data and on its preprocessing.
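For readers unfamiliar with density-based clustering, the following is a minimal sketch of the DB Scan idea [5]. It is not the WEKA implementation used in the experiments; the parameter names `eps` and `min_pts`, their values and the toy points are assumptions chosen only for illustration.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # core point: keep growing the cluster
    return labels


# Hypothetical two-feature points: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
result = dbscan(points, eps=1.5, min_pts=3)
print(result)
```

Because clusters grow from dense neighborhoods and isolated points are left as noise, the number of clusters does not have to be fixed in advance, unlike Simple K-means and Farthest First.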

Table 1 Average measurement values for each algorithm

 
4. Conclusions 

The use of unsupervised learning was proposed for classifying P2P traffic flows, in comparison to previous methods that use supervised learning. The results lead us to believe the data is suitable for density-based clustering, since the DB Scan algorithm performed well in every situation, obtaining high precision in classifying P2P botnet flows and retrieving most of these malicious flows from the normal P2P traffic.

Acknowledgement 

This research was supported in part by the Ministry of Science and Technology, Taiwan, under the Grants MOST 104-3115-E-007-004 and MOST 103-2221-E-007-073-MY3.

References 

[1] R. Price, “Business Insider,” http://www.businessinsider.com/hola-used-for-botnet-on-chrome, May 2015.

[2] P. Barthakur, M. Dahal, and M. K. Ghose, “A framework for P2P botnet detection using SVM,” in International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, Sanya, 2012.

[3] W. H. Liao and C. C. Chang, “Peer to peer botnet detection using data mining scheme,” International Conference on Internet Technology and Applications, Wuhan, 2010.

[4] S. Hochbaum, “A best possible heuristic for the k-center problem,” in Mathematics of Operations Research, 1985.

[5] P. N. Tan, M. Steinbach, and V. Kumar, “Cluster analysis: basic concepts and algorithms,” Introduction to Data Mining, Pearson, pp. 487-559, 2005.

[6] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” SIGKDD Explorations, 2009.

[7] B. Rahbarinia, R. Perdisci, A. Lanzi, and K. Li, “PeerRush: mining for unwanted P2P traffic,” Detection of Intrusions and Malware, and Vulnerability Assessment, Springer, vol. 7967, pp. 62-82, 2013.

 