
 
COMPARATIVE ANALYSIS OF THE K-NEAREST NEIGHBOR ALGORITHM 
  ON VARIOUS INTRUSION DETECTION DATASETS 

 
Andri Agung Riyadi 1), Fachri Amsury 2), Irwansyah Saputra 3), Tiska Pattiasina 4), Jupriyanto 5) 

 
Sains Data1, Sistem Informasi2,3,5 

Universitas Nusa Mandiri 
andriagu1603@nusamandiri.ac.id1, fachri.fcy@nusamandiri.ac.id2, irwansyah.iys@nusamandiri.ac.id3, 

jupriyanto.kahar@gmail.com5 
 

Teknologi Informasi 
Universitas Bina Sarana Informatika 

elleoratiska07@gmail.com4 
 

Abstrak 

Keamanan di dalam jaringan komputer dapat menjadi rentan, hal ini disebabkan kita memiliki kelemahan 
dalam membuat kebijakan keamanan, konfigurasi sistem komputer yang lemah atau bug pada perangkat 
lunak. Intrusion detection adalah mekanisme mengamankan jaringan komputer dengan cara mendeteksi, 
mencegah, dan menghalangi usaha ilegal untuk mengakses informasi yang bersifat rahasia. Mekanisme IDS 
dirancang untuk dapat melindungi sistem dan mengurangi dampak kerusakan yang ada dari setiap serangan 
di dalam jaringan komputer karena melanggar kebijakan keamanan komputer meliputi ketersediaan, 
kerahasiaan, dan integritas. Teknik data mining telah digunakan untuk memperoleh pengetahuan yang 
berguna dari penggunaan dataset-dataset IDS. Beberapa dataset IDS yang umum digunakan adalah NSL-KDD, 
10% KDD, Full KDD, Corrected KDD99, UNSW-NB15, ADFA Windows, Caida, dan UNM telah digunakan untuk 
mendapatkan tingkat akurasi menggunakan algoritma k-Nearest Neighbors (k -NN). Dataset IDS terbaru 
yang disediakan oleh Canadian Institute of Cybersecurity yang berisi sebagian besar skenario serangan 
terbaru bernama dataset CICIDS2017. Eksperimen pendahuluan menunjukkan bahwa pendekatan 
menggunakan metode k-NN pada dataset CICIDS2017 berhasil menghasilkan nilai rata-rata akurasi deteksi 
intrusi tertinggi dibandingkan dataset IDS lainnya. 
 
Kata kunci: Intrusion Detection System, k-Nearest Neighbors, Machine Learning, Network Security 
 

Abstract 
Computer networks can become vulnerable because of weaknesses in security policies, poorly
configured computer systems, or software bugs. Intrusion detection is a mechanism for securing
computer networks by detecting, preventing, and blocking unauthorized attempts to access
confidential information. An Intrusion Detection System (IDS) is designed to protect the system and
limit the damage caused by any attack on a computer network that violates the security policies of
availability, confidentiality, and integrity. Data mining techniques have been used to extract
useful knowledge from IDS datasets. Several widely used IDS datasets, namely NSL-KDD, 10% KDD,
Full KDD, Corrected KDD99, UNSW-NB15, ADFA Windows, Caida, and UNM, have been used to measure
accuracy with the k-Nearest Neighbors (k-NN) algorithm. A more recent IDS dataset, CICIDS2017,
provided by the Canadian Institute for Cybersecurity, contains most of the latest attack scenarios.
Preliminary experiments show that the k-NN method on the CICIDS2017 dataset produces the highest
average intrusion detection accuracy among the IDS datasets compared.
 
Keywords: Intrusion Detection System, k-Nearest Neighbors, Machine Learning, Network Security 
 
 
INTRODUCTION 
 
The number of internet users around the world has grown explosively over the
past two decades. Hundreds of thousands of institutions and millions of people
communicate with each other over the internet every day. As a result, the
number of attacks on internet networks continues to rise daily, and data
integrity and privacy have become significant concerns. The three principles of
network security are confidentiality, integrity, and availability, and network
security aims to defend the network against attacks on these three principles.
An attempt to violate any of them is referred to as a network attack (Bace &
Mell, 2001).

Many software tools protect data and networks from incoming threats, such as
firewalls, antivirus software, data encryption, and user authentication, but
they cannot protect against all attacks. Many studies have addressed this
problem. Intrusion Detection Systems (IDS) were created to monitor and filter
network activity by detecting threats and alerting network administrators
(Chung & Wahid, 2012). Misuse detection and anomaly detection are the two basic
approaches to IDS. Neither is effective against all forms of threats, and each
has its own strengths and weaknesses (Lin, Ke, & Tsai, 2015). Misuse detection
identifies an attack on a computer network by comparing activity against
patterns that have previously been designated as attack signatures. It is
effective for detecting known attacks, but it cannot detect new ones (Zhang, Li,
Gao, Wang, & Luo, 2015). Anomaly detection is useful for identifying novel
attacks, but it is less effective on known attacks and tends to produce a high
false positive rate (FPR) (Kim, Lee, & Kim, 2014).

Data mining techniques have been used to obtain useful knowledge from IDS
datasets. Several commonly used IDS datasets, namely NSL-KDD, 10% KDD, Full KDD,
Corrected KDD99, UNSW-NB15, ADFA Windows, Caida, and UNM, have been used to
measure accuracy with the k-NN algorithm (Hamid, et al., 2018). CICIDS2017, one
of the latest IDS datasets from the Canadian Institute for Cybersecurity (CIC)
at the University of New Brunswick (UNB), was analyzed for research purposes
(Sharafaldin, Habibi Lashkari, & Ghorbani, 2018). The CICIDS2017 dataset was
created using a modern framework that takes into account organizational
policies and conditions and uses coefficients that can be individually
determined for each criterion (Gharib, Sharafaldin, Lashkari, & Ghorbani,
2016).

One way to overcome the limitations of misuse detection and anomaly detection
while making the most of both is to use a hybrid approach (Depren, Topallar,
Anarim, & Ciliz, 2005). Three hybrid arrangements are recommended for IDS:
misuse detection followed by anomaly detection, anomaly detection followed by
misuse detection, or misuse detection and anomaly detection at the same time. A
hybrid IDS combines the results of independently trained misuse detection and
anomaly detection models. For example, if network traffic is treated as an
attack whenever at least one of the two methods classifies it as an attack, the
detection rate is high, but the FPR of the IDS remains high. Conversely, if
network traffic is treated as an attack only when both methods classify it as
an attack, the FPR is low, but many attacks in the network traffic are missed
(Kim et al., 2014). A false positive occurs when the IDS classifies benign or
normal activity on the computer network as a dangerous attack; the False
Positive Rate (FPR) is the proportion of benign activity misclassified in this
way.
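
The trade-off between the two combination rules can be illustrated with a small sketch. The detector outputs below are invented toy values, and the function names (`combine_or`, `combine_and`, `rates`) are hypothetical helpers for illustration only; they are not taken from the cited works.

```python
# Toy illustration of combining a misuse detector and an anomaly detector.
# All values are invented for illustration; True means "flagged as attack".

def combine_or(misuse, anomaly):
    """Flag traffic as an attack if at least one detector flags it."""
    return [m or a for m, a in zip(misuse, anomaly)]

def combine_and(misuse, anomaly):
    """Flag traffic as an attack only if both detectors flag it."""
    return [m and a for m, a in zip(misuse, anomaly)]

def rates(predicted, actual):
    """Return (detection rate, false positive rate) for two boolean lists."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    attacks = sum(actual)
    benign = len(actual) - attacks
    return tp / attacks, fp / benign

# Four attack flows followed by four benign flows.
actual  = [True, True, True, True, False, False, False, False]
misuse  = [True, True, False, False, False, False, False, False]
anomaly = [False, True, True, False, True, True, False, False]

print("OR rule :", rates(combine_or(misuse, anomaly), actual))
print("AND rule:", rates(combine_and(misuse, anomaly), actual))
```

On this toy data the OR rule detects more attacks but also raises more false alarms, while the AND rule raises no false alarms but misses most attacks, which mirrors the behavior described above.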

This research uses the k-Nearest Neighbors (k-NN) algorithm to measure attack
detection accuracy on the CICIDS2017 dataset. This algorithm has not previously
been applied to the CICIDS2017 dataset.
 

RESEARCH METHODS 
 
This research follows the Knowledge Discovery in Databases (KDD) process, which
consists of five stages: Data Selection, Preprocessing, Transformation, Data
Mining, and Interpretation/Evaluation (Fayyad, 1997). The CICIDS2017 dataset is
used as the latest standard dataset for research and evaluation in the field of
IDS. The data is first analyzed to understand it, form initial findings, and
assess its quality. The CICIDS2017 dataset consists of 3.1 million records with
85 attributes, one of which is used as the label. The label distinguishes seven
attack categories and one normal category. Preprocessing includes removing
duplicate data, checking for inconsistent data, removing low-value or
completely useless features, converting the labels of all attack types to a
single ATTACK label, and fixing data errors. Feature selection is used to
determine which features are important and to discard low-quality and
uncorrelated features. Given the number of records in the CICIDS2017 dataset,
data sampling is performed for efficiency; in this study, a 1% sample of the
dataset is used. The fitted model is then evaluated using the k-NN algorithm
with k = 5, 6, 7, 8, 9. The resulting accuracy, precision, and recall values
are compared with those obtained on other IDS datasets.
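
The preprocessing and sampling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: the function name `preprocess`, the record layout, the treatment of "inconsistent data" as records with missing values, and the use of simple random sampling are all assumptions, since the text does not specify them.

```python
import random

def preprocess(records, sample_fraction=0.01, seed=0):
    """Deduplicate, drop incomplete records, binarize labels, and sample.

    `records` is a list of (features, label) pairs, where `features` is a
    tuple of numbers (None marks a missing value) and `label` is a string
    such as "BENIGN" or "DDoS". This layout is assumed for illustration.
    """
    # 1. Remove exact duplicates while preserving the original order.
    unique = list(dict.fromkeys(records))
    # 2. Drop records that contain a missing feature value.
    clean = [(f, l) for f, l in unique if None not in f]
    # 3. Collapse every attack type into a single ATTACK label.
    binary = [(f, "BENIGN" if l == "BENIGN" else "ATTACK") for f, l in clean]
    # 4. Draw a simple random sample (at least one record).
    size = max(1, round(len(binary) * sample_fraction))
    return random.Random(seed).sample(binary, size)

# Invented toy records: one duplicate and one record with a missing value.
records = [
    ((1.0, 2.0), "BENIGN"),
    ((1.0, 2.0), "BENIGN"),
    ((3.0, None), "DoS"),
    ((4.0, 5.0), "DDoS"),
    ((6.0, 7.0), "Bot"),
]
sample = preprocess(records, sample_fraction=1.0)
print(sorted(sample))
```

With `sample_fraction=0.01` the same function would draw the 1% sample mentioned in the text; a stratified sample per class would be a reasonable alternative if the class balance must be preserved.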
 
Literature Study 
A. Intrusion Detection Systems and CICIDS2017 Dataset
Intrusion Detection Systems (IDS) are a very important part of protecting
information systems in computer networks. The initial concept of IDS came from
a research report whose purpose was to improve the security auditing and
surveillance capabilities of computer systems (Anderson, 1980). There are three
commonly used approaches to IDS: misuse detection, anomaly detection, and
hybrid detection (McHugh, Christie, & Allen, 2000). Hybrid detection combines
misuse detection and anomaly detection to improve on the detection ability of
either method alone. There are three ways to implement a hybrid IDS: misuse
detection followed by anomaly detection, anomaly detection followed by misuse
detection, or misuse detection and anomaly detection applied at the same time
(Kim et al., 2014).

CICIDS2017 is a dataset made by the 
University of New Brunswick's (UNB) Canadian 
Institute for Cybersecurity (CIC). CICIDS2017 was 
created using a modern framework that takes into 
account organizational policies and conditions with 
coefficients that can be individually determined for 
each criterion. This dataset consists of 
approximately 3.1 million records with more than 
80 attributes, where 1 attribute is used as a label. 
The attributes in the dataset have 7 attack 
categories and 1 benign or normal category. The 7 
categories of attacks in this dataset are Heartbleed 
Attack, Botnet, DoS Attack, Brute Force Attack,  
DDoS Attack, Infiltration Attack, and Web Attack. 
 
B. Methodology 

Data mining is the application of specific algorithms to extract patterns from
data (Fayyad, 1997). Data mining is about solving problems by analyzing data
already present in a database (Witten et al., 2005). There are numerous
techniques and methods for carrying out various types of data mining tasks.
These methods are classified into three major data mining paradigms: Predictive
Modeling, Discovery, and Deviation Detection. The terms data mining and
Knowledge Discovery in Databases (KDD) are frequently used interchangeably to
describe the process of uncovering hidden information in a large database
(Agushinta & Irfan, 2008). The two concepts are distinct but related: data
mining is one of the stages in the KDD process.
One of the ten best data mining techniques is the k-Nearest Neighbors (k-NN)
classification algorithm. The k-NN method follows Cicero's famous principle
pares cum paribus facillime congregantur (birds of a feather flock together, or
equals with equals easily associate) (Mucherino, Papajorgji, & Pardalos, 2009).
The accuracy, precision, and recall of eight different IDS datasets were
compared using the k-NN algorithm. On the NSL-KDD dataset, the k-NN algorithm
outperforms other algorithms in terms of accuracy, precision, and recall (Hamid
et al., 2018).
The k-NN method does not build a classifier from the training set in advance;
instead, it consults the training set every time a classification is required,
which is why k-NN is often referred to as a lazy classifier. Classification is
analogy-based: a test record is compared with the training records that are
most similar to it. In other words, k-NN classifies an object based on the
learning data closest to that object. The technique is simple and
straightforward to use. Like clustering techniques, it groups new data
according to its distance from existing data, that is, its nearest neighbors.
The first step is to calculate the distance from the new record to the training
records so that the nearest neighbors can be found. The distance between two
points, one from the training data and one from the testing data, is defined by
the Euclidean formula in equation (1), as follows:
 

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$

Explanation:
x: data 1
y: data 2
i: feature index, from 1 to n
d(x, y): Euclidean distance between x and y
n: number of features
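
The classification procedure described above, using the Euclidean distance of equation (1), can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the study; the names `euclidean` and `knn_predict` and the toy training points are hypothetical.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs.
    """
    # Lazy learning: no model is built, the training set is scanned per query.
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Invented two-feature training points for illustration.
train = [
    ((0.0, 0.0), "BENIGN"), ((0.1, 0.2), "BENIGN"), ((0.2, 0.1), "BENIGN"),
    ((1.0, 1.0), "ATTACK"), ((0.9, 1.1), "ATTACK"), ((1.1, 0.9), "ATTACK"),
]
print(knn_predict(train, (0.15, 0.15), k=3))
print(knn_predict(train, (0.95, 1.0), k=3))
```

With an even k, a two-class vote can end in a tie; this sketch does not handle ties specially.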
 

In data mining, a confusion matrix is commonly used to calculate accuracy. If
the dataset has only two classes, one is considered positive and the other
negative.
 

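Table 1. Confusion Matrix

| Actual class | Predicted Positive  | Predicted Negative  |
|--------------|---------------------|---------------------|
| Positive     | true positive (TP)  | false negative (FN) |
| Negative     | false positive (FP) | true negative (TN)  |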

 
Accuracy is the degree of agreement between predicted and actual values.
Precision is the degree of agreement between the information requested by the
user and the response provided by the system; it is calculated by dividing the
number of correctly classified positive examples by the total number of
examples the system labeled as positive. Recall is the system's success rate in
retrieving the relevant information; it is calculated by dividing the number of
correctly classified positive examples by the total number of positive examples
in the data.
 

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{2}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$
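
As a worked illustration of equations (2) to (4), the sketch below counts the four confusion-matrix cells from a list of actual and predicted labels and derives the three metrics. The helper names and the toy labels are hypothetical, and treating ATTACK as the positive class is an assumption, since the text does not state which class is positive.

```python
def confusion_counts(actual, predicted, positive="ATTACK"):
    """Return (TP, FN, FP, TN), treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    """Accuracy, precision, and recall as in equations (2), (3), and (4)."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Invented labels: 4 attacks and 6 benign flows.
actual = ["ATTACK"] * 4 + ["BENIGN"] * 6
predicted = ["ATTACK", "ATTACK", "ATTACK", "BENIGN",
             "ATTACK", "BENIGN", "BENIGN", "BENIGN", "BENIGN", "BENIGN"]

counts = confusion_counts(actual, predicted)
print(counts)
print(metrics(*counts))
```

Here three of the four attacks are caught (TP = 3, FN = 1) and one benign flow is misclassified (FP = 1, TN = 5), giving an accuracy of 0.8 and a precision and recall of 0.75 each.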

 
Research Framework Development 

The ideas for this research draw on previous evaluations of data collection
across several IDS datasets, beginning with the use of the CICIDS2017 dataset
as a standard dataset for researchers in the field of Intrusion Detection
Systems (IDS) (Sharafaldin et al., 2018). The data is preprocessed by removing
duplicated features and discarding records that are erroneous, have missing
values, or are incomplete (Alshammari & Nur Zincir-Heywood, 2007; Radford,
Richardson, & Davis, 2018). The dataset is labeled with two labels: BENIGN and
ATTACK. BENIGN represents normal network traffic, while everything else is an
attack. Data cleansing and feature selection eliminate features that are of
little or no value, which could otherwise make the research results less
relevant. To prepare for the data mining process, data reduction and data
splitting are performed. The k-Nearest Neighbors (k-NN) approach is then used
as the data mining algorithm to improve intrusion detection accuracy, with
values of k = 5, 6, 7, 8, 9.
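
The data splitting step can be sketched as a hold-out split that keeps the BENIGN/ATTACK proportions the same in both parts. This is only an illustration: the text does not state the split ratio or whether the split was stratified, so the 30% test fraction, the stratification, and the name `stratified_split` are all assumptions.

```python
import random
from collections import defaultdict

def stratified_split(records, test_fraction=0.3, seed=0):
    """Split (features, label) pairs into train and test sets, class by class."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[1]].append(record)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)  # randomize order within each class
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Invented data: 80 benign and 20 attack records with one feature each.
records = [((float(i),), "BENIGN") for i in range(80)] + \
          [((float(i),), "ATTACK") for i in range(20)]
train, test = stratified_split(records)
print(len(train), len(test))
```

The training part would then serve as the neighbor pool for k-NN, and the test part would be classified once for each value of k.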

 
RESULTS AND DISCUSSION 

 
At the classification stage, the k-NN algorithm with values of k = 5, 6, 7, 8,
9 is used to determine the accuracy of attack detection in the CICIDS2017
dataset.
A. K-Nearest Neighbour 

Figure 1 depicts the accuracy, precision, 
and recall values in the CICIDS2017 dataset using 
the k-NN algorithm with values of k =  5, 6, 7, 8, 9. 
 

Figure 1. Classification Results Using the k-NN Algorithm
 
 
B. Comparison With Other Datasets

Table 2 compares the accuracy, precision, 
and recall values in several other IDS datasets using 
the k-NN algorithm with the value of k =  5, 6, 7, 8, 9. 
 
 
Table 2. Comparison of the results of the k-NN algorithm on various IDS datasets
(column groups are the neighborhood size k; Acc = accuracy, Prec = precision, Rec = recall)

| Dataset       | k=5 Acc | k=5 Prec | k=5 Rec | k=6 Acc | k=6 Prec | k=6 Rec | k=7 Acc | k=7 Prec | k=7 Rec | k=8 Acc | k=8 Prec | k=8 Rec | k=9 Acc | k=9 Prec | k=9 Rec |
|---------------|---------|----------|---------|---------|----------|---------|---------|----------|---------|---------|----------|---------|---------|----------|---------|
| Full KDD99    | 0.7342  | 0.722    | 0.734   | 0.70979 | 0.721    | 0.71    | 0.72028 | 0.736    | 0.72    | 0.65734 | 0.624    | 0.657   | 0.6958  | 0.633    | 0.696   |
| Corrected KDD | 0.6682  | 0.67     | 0.668   | 0.71495 | 0.707    | 0.715   | 0.48598 | 0.496    | 0.486   | 0.71962 | 0.722    | 0.72    | 0.71028 | 0.701    | 0.71    |
| NSLKDD        | 0.7853  | 0.677    | 0.785   | 0.92    | 0.92     | 0.92    | 0.97592 | 0.959    | 0.976   | 0.9875  | 0.977    | 0.988   | 0.77193 | 0.77     | 0.772   |
| 10% KDD       | 0.8421  | 0.874    | 0.842   | 0.57142 | 0.571    | 0.571   | 0.64285 | 0.629    | 0.643   | 0.71428 | 0.706    | 0.714   | 0.5     | 0.521    | 0.5     |
| UNSW          | 0.4285  | 0.351    | 0.429   | 0.57142 | 0.571    | 0.571   | 0.66083 | 0.655    | 0.661   | 0.8421  | 0.874    | 0.842   | 0.82456 | 0.83     | 0.825   |
| Caida         | 0.6428  | 0.413    | 0.643   | 0.42857 | 0.762    | 0.351   | 0.5     | 0.521    | 0.5     | 0.71428 | 0.706    | 0.714   | 0.64285 | 0.413    | 0.643   |
| ADFA Windows  | 0.7142  | 0.714    | 0.714   | 0.64285 | 0.413    | 0.643   | 0.82456 | 0.83     | 0.825   | 0.91228 | 0.92     | 0.912   | 0.85308 | 0.858    | 0.853   |
| UNM Dataset   | 0.6382  | 0.626    | 0.638   | 0.79906 | 0.794    | 0.799   | 0.66822 | 0.67     | 0.668   | 0.72429 | 0.712    | 0.724   | 0.57943 | 0.53     | 0.579   |
| CICIDS2017    | 0.9688  | 0.9172   | 0.926   | 0.9697  | 0.919    | 0.929   | 0.9676  | 0.9157   | 0.922   | 0.9683  | 0.916    | 0.925   | 0.9664  | 0.9113   | 0.921   |

 

 
Figure 2 shows the average accuracy, precision, and recall obtained with the
k-NN algorithm over the values k = 5, 6, 7, 8, 9 for each dataset.
 

Figure 2. Comparison of the Average Values of Accuracy, Precision, and Recall
of the k-NN Algorithm on Various IDS Datasets

 
CICIDS2017 was evaluated for comparison with several other existing IDS
datasets. As shown in Figure 1, the highest accuracy on the CICIDS2017 dataset
was obtained using the k-NN algorithm with k = 6, at 96.97%. Table 2 lists the
accuracy, precision, and recall values for the other IDS datasets; among all
entries, the highest single accuracy value (98.75%) belongs to the NSLKDD
dataset with k = 8.
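
The CICIDS2017 figures can be checked directly against Table 2. The sketch below copies the CICIDS2017 row, finds the k with the highest accuracy, and averages each metric over k = 5, 6, 7, 8, 9; the variable and function names are hypothetical. The average recall computed from the rounded table entries is 92.46%, which agrees with the reported 92.4640% up to the three-decimal rounding of the table.

```python
# CICIDS2017 row of Table 2: (accuracy, precision, recall) for each k.
cicids2017 = {
    5: (0.9688, 0.9172, 0.926),
    6: (0.9697, 0.919, 0.929),
    7: (0.9676, 0.9157, 0.922),
    8: (0.9683, 0.916, 0.925),
    9: (0.9664, 0.9113, 0.921),
}

def column_means(rows):
    """Average each metric over all values of k."""
    columns = zip(*rows.values())
    return tuple(sum(column) / len(rows) for column in columns)

# k whose accuracy (first tuple element) is largest.
best_k = max(cicids2017, key=lambda k: cicids2017[k][0])
accuracy, precision, recall = column_means(cicids2017)

print("best k by accuracy:", best_k)
print(f"average accuracy:  {accuracy:.4%}")
print(f"average precision: {precision:.4%}")
print(f"average recall:    {recall:.4%}")
```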
 

CONCLUSIONS AND SUGGESTIONS 
 
Conclusion 

The goal of this study is to detect network anomalies using machine learning
methods. The CICIDS2017 dataset was used because it is up to date and covers a
wide diversity of attacks and various network protocols. Using the k-NN
algorithm with k = 5, 6, 7, 8, 9, the average accuracy, precision, and recall
on the CICIDS2017 dataset are 96.8160%, 91.5840%, and 92.4640% respectively,
higher than on the other IDS datasets, as seen in Figure 2.
 
Suggestion 
Based on these conclusions, future research could use a wider variety of
algorithms and more up-to-date IDS datasets.
 

REFERENCES 
 
Agushinta, D. (2008, August). Perancangan Aplikasi
Data Mining Untuk Memrediksi Permintaan
Customer Pada Perusahaan Persewaan Mobil.
In Proceeding, Seminar Ilmiah Nasional
Komputer dan Sistem Intelijen (KOMMIT 2008).

Alshammari, R., & Nur Zincir-Heywood, A. (2007). A
flow-based approach for SSH traffic detection.
In 2007 IEEE International Conference on
Systems, Man and Cybernetics (pp. 296–301).
IEEE. doi:10.1109/ICSMC.2007.4414006

Anderson, J. P. (1980). Computer security threat 
monitoring and surveillance. Technical 
Report, James P. Anderson Company. 

Chung, Y. Y., & Wahid, N. (2012). A hybrid network 
intrusion detection system using simplified 
swarm optimization (SSO). Applied Soft 
Computing, 12(9), 3014–3022. 
doi:10.1016/j.asoc.2012.04.020 

Data Mining: Practical Machine Learning Tools and 
Techniques. (2011). Elsevier. 
doi:10.1016/C2009-0-19715-5 

Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J.
(2005). Data Mining: Practical machine learning
tools and techniques (Vol. 2, p. 4).

Depren, O., Topallar, M., Anarim, E., & Ciliz, M. K. 
(2005). An intelligent intrusion detection 
system (IDS) for anomaly and misuse 
detection in computer networks. Expert 
Systems with Applications, 29(4), 713–722. 
doi:10.1016/j.eswa.2005.05.002 

Fayyad, U. (1997). Data mining and knowledge 
discovery in databases: implications for 
scientific databases. In Proceedings. Ninth 
International Conference on Scientific and 
Statistical Database Management (Cat. 
No.97TB100150) (pp. 2–11). IEEE Comput. 
Soc. doi:10.1109/SSDM.1997.621141 

Gharib, A., Sharafaldin, I., Lashkari, A. H., & 
Ghorbani, A. A. (2016). An Evaluation 
Framework for Intrusion Detection Dataset. In 
2016 International Conference on Information 
Science and Security (ICISS) (pp. 1–6). IEEE. 
doi:10.1109/ICISSEC.2016.7885840 

Kim, G., Lee, S., & Kim, S. (2014). A novel hybrid 
intrusion detection method integrating 
anomaly detection with misuse detection. 
Expert Systems with Applications, 41(4), 1690–1700.
doi:10.1016/j.eswa.2013.08.066

Lin, W.-C., Ke, S.-W., & Tsai, C.-F. (2015). CANN: An 
intrusion detection system based on 
combining cluster centers and nearest 
neighbors. Knowledge-Based Systems, 78, 13–21.
doi:10.1016/j.knosys.2015.01.009

McHugh, J., Christie, A., & Allen, J. (2000). Defending 
Yourself: The Role of Intrusion Detection 
Systems. IEEE Software, 17(5), 42–51. 
doi:10.1109/52.877859 

Bace, R., & Mell, P. (2001). NIST special publication
on intrusion detection systems. Booz Allen
Hamilton Inc., McLean, VA.

Mucherino, A., Papajorgji, P. J., & Pardalos, P. M. 
(2009). k-Nearest Neighbor Classification (pp. 
83–106). doi:10.1007/978-0-387-88615-2_4 

Radford, B. J., Richardson, B. D., & Davis, S. E. (2018).
Sequence aggregation rules for anomaly
detection in computer network traffic. arXiv
preprint arXiv:1805.03735.

Sharafaldin, I., Habibi Lashkari, A., & Ghorbani, A. A. 
(2018). Toward Generating a New Intrusion 
Detection Dataset and Intrusion Traffic 
Characterization. In Proceedings of the 4th 
International Conference on Information 
Systems Security and Privacy (pp. 108–116). 
SCITEPRESS - Science and Technology 
Publications. 
doi:10.5220/0006639801080116 

Hamid, Y., Balasaraswathi, V. R., Journaux, L., & 
Sugumaran, M. (2018). Benchmark Datasets 
for Network Intrusion Detection: A 
Review. Int. J. Netw. Secur., 20(4), 645-654. 

Zhang, J., Li, H., Gao, Q., Wang, H., & Luo, Y. (2015). 
Detecting anomalies from big network traffic 
data using an adaptive detection approach. 
Information Sciences, 318, 91–110. 
doi:10.1016/j.ins.2014.07.044