Microsoft Word - Final NETSYS2021.doc


Electronic Communications of the EASST 
Volume 080 (2021) 

Guest Editors: Andreas Blenk, Mathias Fischer, Stefan Fischer, Horst Hellbrück, Oliver 
Hohlfeld, Andreas Kassler, Koojana Kuladinithi, Winfried Lamersdorf, Olaf Landsiedel, 
Andreas Timm-Giel, Alexey Vinel 

ECEASST Home Page: http://www.easst.org/eceasst/ ISSN 1863-2122 

 
Conference on Networked Systems 2021 
(NetSys 2021) 

Modular Platform for Detecting and Classifying Phishing Websites 
Using Cyber Threat Intelligence 

 
Ahmed M. Elmisery and Mirela Sertovic 

 
4 Pages 


 ECEASST 

1 / 4 Volume XXX (2020) 
 

 Modular Platform for Detecting and Classifying Phishing 
Websites Using Cyber Threat Intelligence 

 
Ahmed M. Elmisery1, and Mirela Sertovic2  

 
1amelmesiry@gmail.com  

Faculty of Computing, Engineering and Science, University of South Wales, Pontypridd, UK 
 

2 msertovic@yahoo.com 
Threat Defense Unit, Concept Tech Int. Ltd, Belfast, UK 

 
Abstract: Phishing attacks are deceptive types of social engineering techniques that 
attackers use to imitate genuine websites in order to steal the login credentials and private 
data of the end-users. The continued success of these attacks is heavily attributed to the 
prolific adoption of online services and the lack of proper training to foster a security 
awareness mindset of online users. In addition to the financial and reputational damages 
caused by data breaches of individual users and businesses, cyber adversaries can further 
use the leaked data for various malicious purposes. In this work, a modular platform was 
introduced that facilitates accurate detection and automatic evaluation of websites visited 
by employees of a company or organization. The basis for this approach is a preceding 
website analysis, which is essential when hunting for potential threats from proxy logs. 
The platform contains three modules. Characterization of suspicious websites relies on a 
set of pre-defined features and a multi-stage threat intelligence technique, the functionality 
of which has been ascertained in initial tests on real data sets. 
 
Keywords: Phishing, Threat intelligence, Cybersecurity, Malicious URL detection 

1 Introduction 
The widespread adoption of online services by public and private organizations facilitates 
conducting various business operations with their clients in an easy manner regardless of their 
physical location and time availability [Koufaris'02]. Conducting transactions and operations 
in this way has proven to be more convenient to most clients. However, there are serious risks 
associated with this model as demonstrated in [Lampe, Wenge, Müller, & Schaarschmidt'13]. 
Internet users face a variety of security threats while using online services, one of which is 
phishing attacks, where attackers use various tools and techniques to trick internet users into 
visiting malicious websites that mimic real ones to steal their login credentials and private 
data. The recent worldwide shutdown due to the novel coronavirus pandemic has forced 
millions, if not billions, to stay and work at home, attackers are finding this an incredible 
opportunity to look for new ways to commit cybercrime [Plachkinova'21]. In the latest report 
published in [AntiPhishing-Working-Group'21], January 2021 was a high in the trend of 
phishing attacks with an unprecedented 245,771 attacks in one month. Business e-mail 
compromise has caused aggregate losses in the billions of dollars, at large and small 
companies. Phishing attacks are frequently used as a vehicle for the rapid dissemination of 
malware and ransomware, where cyberattacks directed against a specific individual, 
organization, or company may rely on redirecting the victims to external malicious websites 
after receiving an email to evade anti-malware and anti-spam programs that stop spear-
phishing emails containing malicious attachments at the email security gateway.  


 ECEASST 

2 / 4 Volume XXX (2020) 

This paper introduces CTIP: " Cyber Threat Intelligence Platform". The purpose of CTIP is to 
build intelligence about web surfing activities, transforming raw proxy logs into substantial 
knowledge to uncover the suspicious websites visited by employees of a company or 
organization. This information proves valuable for long-term security posture management, in 
particular, for understanding what actually happened after a certain incident. CTIP enables a 
proper assessment and consistent recognition of known as well as unknown suspicious 
websites based on a multi-stage threat intelligence technique. Using a visual workflow for the 
entire process makes the platform more user-friendly. CTIP has its own infrastructure to 
automatically capture and analyze all visited websites within the perimeter of the enterprise 
network, making it easy to identify and investigate a potential threat in a timely manner. It 
should be possible to share the visual workflow of any multi-stage threat intelligence technique 
and replicate results, reducing the time required to triage and analyze data by prioritizing any 
visited suspicious websites. The proposed platform enables a typical monitoring infrastructure 
for the collection, storage, and analysis of proxy logs in a security posture management of an 
operational environment. 

2 Related Work 
Most of the literature deals with the problem of detecting suspicious websites using different 
methods and approaches. Signature-based detection is one of the widely used methods for the 
characterization of suspicious websites. as such, a database of blacklisted URLs is maintained 
to contain a list of malicious websites that were previously detected [Boddy'18]. Any newly 
requested website is matched against this database to determine if it has previously been 
flagged as malicious. The most common databases employed for this task are Google® Safe 
Browsing and PhishTank®. In principle, Signature-based detection usually fails to detect newly 
created malicious websites that have not been previously reported. More recent research has 
begun to use computational intelligence methods for the characterization of suspicious 
websites. as such, a set of common features relevant to malicious and genuine websites were 
elicited and then utilized for building classification models. The research work in [Corona et 
al.'17] trained support vector machines classifier on a multiple set of features. The experiment 
used an evaluation dataset containing 200 genuine and 325 suspicious websites. In [Ma, Saul, 
Savage, & Voelker'09] the authors utilized the properties of URL and domain name extracted 
from external sources to train naïve Bayes and linear regression classifiers to detect suspicious 
websites. In [Abdelhamid, Ayesh, & Thabtah'14] the authors created a set of 16 features 
derived from URL, content, and external sources, which was used to train some sort of multi-
label classifier to recognize suspicious websites. A similar study was conducted in 
[Mohammad, Thabtah, & McCluskey'12], in which the authors designed a set of 30 features, 
which were categorized into four main features groups related to URL, abnormality, content, 
and the website domain. In [Rieck, Krueger, & Dewald'10], the authors developed Cujo as a 
web proxy component between a web client and a web service. Cujo intercepts and analyzes 
the raw data of a website before sending it to the web client. The solution in [Gastellier-
Prevost, Granadillo, & Laurent'11] implemented an anti-phishing toolbar that utilizes 20 
heuristics tests to analyze the properties of URLs and webpages. The evaluation results 
demonstrate that the combination of URL and HTML heuristics is an efficient way to 
distinguish websites. The growing number of proxy logs inevitably requires automatic models 
to efficiently detect, classify and share suspicious websites, which is also required for any 
threat hunting activities. 


Modular Platform for Detecting and Classifying Phishing Websites Using Cyber Threat Intelligence 

NetSys 2021 3 / 4 

3 Methodology 

 
The proposed multi-stage threat intelligence technique consists of four main modules, which 
are interconnected together. Specifically, these modules are data preparation, modeling, 
evaluation, and visualization. Figure 1 depicts the basic workflow between these modules. The 
raw web surfing is collected using browser-based open-source intelligence plugin, which 
allows large-scale harvesting of URLs and metadata from external sources. This will cover all 
the possible features that are required for the later stages. This data is provided to the data 
preparation module for filtering and pre-processing stages. The data preparation module 
continues to read all datasets, then filters the features into categories. Besides the properties of 
the URLs, metadata related to requested websites is analysed in relation to the content of web 
pages. Existing features are gathered and stored locally. Depending on the type, they can be 
stored as a separate value or combined with other sub-values to form a new feature set. In 
addition, if the correlations between the characteristics of these features are identified, the 
different categories are linked together. The results of this module are further processed using 
the modeling module. The robust features entries from different users are combined into the 
centralized log server, fed by pre-existing profiles that already exist for different websites, and 
form the basis for the modeling stage. Each logged feature is weighted statistically according 
to its importance and frequency in characterizing suspicious websites. The features deduced in 
this work are based on the phishing websites features [Mohammad et al.'12]. These features 
are utilized in the model building process, we implemented the C4.5 algorithm [Quinlan'93] to 
build a decision tree classifier based on a labeled data set using entropy. The overall result of 
the detection depends on the label assigned by the classifier to a new instance (recently 
requested website). With the proposed methodology, feature vectors of different categories can 
be assigned to each website. A chain of features can be a determinant for suspicious websites. 
The classification task is carried out using a decision tree classifier (rule-based tree), which is 
built in a top-down or general-to-specific manner based on the training stage. The generated 
model will then be used to perform the classification task on new instances. The process of 
building the model is initiated with a root node to label all the logged records. If the root node 
is enough to label all the instances, then the task is finished. Otherwise, more nodes and leaves 
are added to the tree recursively until all the instances belong to one of the existing classes. 
The evaluation module is responsible for evaluating the accuracy of the detection model, the 
convenient precision, and F-measure rates have been used as metrics. The number of 
misclassified websites along with the F-measure rate were used to evaluate the accuracy 
between the generated model and a baseline model. The two metrics TN (True Negatives) and 
TP (True Positives) respectively indicate the number of websites that were correctly identified 
as suspicious, and genuine. The results obtained from the evaluation module are passed to the 
visualization module, which is responsible for presenting the detection results in a human-
readable form. The detailed information from the analysis of websites is stored in a machine-
readable form, which can be used to further prioritize threat hunting activities.  

                Figure 1: Multi-stage Threat Intelligence Workflow 
 

 ECEASST 

4 / 4 Volume XXX (2020) 

4 Results and Discussion 
This work presents a modular platform 
for detecting and classifying phishing 
websites using a multi-stage threat 
intelligence technique. The chosen 
approach offers a clear structure and 
comprehensive results. The approach is 
designed so that the statistically 
determined features of suspicious 
websites are scored higher than those of 
the original websites. Using CTIP, 
unknown suspicious websites can 
easily be detected. The proposed multi-stage threat intelligence technique is currently in the 
training and testing phases, TP rates (96%, 98%) were obtained on a real data set, by correctly 
identifying suspicious and genuine websites respectively, with an overall accuracy of 97% and 
more than 90% confidence. As shown in Figure 2, the developed cyber threat intelligence 
platform can be extended with various open-source intelligence extensions, daemons and 
services to ensure proper logging of web access traffic. It is planned to simulate a technical 
network of different nodes and to explore a covert cyberattack of APT on the infrastructure of 
this network. For this purpose, interfaces were developed in the modular platform. The final 
stage will utilize holistic techniques and field tests of the platform. 

Bibliography 
[Abdelhamid, N, Ayesh, A, & Thabtah, F2014] Abdelhamid, N, Ayesh, A, & Thabtah, F. (2014). Phishing detection 

based associative classification data mining. Expert Systems with Applications, 41(13), 5948-5959.  
[AntiPhishing-Working-Group2021] AntiPhishing-Working-Group. (2021). Phishing activity trends report. 

Retrieved from https://docs.apwg.org/reports/apwg_trends_report_q1_2021.pdf 
[Boddy, M2018] Boddy, M. (2018). Phishing 2.0: The new evolution in cybercrime. Computer Fraud & Security, 

2018(11), 8-10.  
[Corona, I, Biggio, B, Contini, M, Piras, L, Corda, R, Mereu, M, . . . Roli, FYear] Corona, I, Biggio, B, Contini, M, 

Piras, L, Corda, R, Mereu, M, . . . Roli, F. (2017). Deltaphish: Detecting phishing webpages in compromised 
websites. Paper presented at the European Symposium on Research in Computer Security. 

[Gastellier-Prevost, S, Granadillo, GG, & Laurent, MYear] Gastellier-Prevost, S, Granadillo, GG, & Laurent, M. 
(2011). Decisive heuristics to differentiate legitimate from phishing sites. Paper presented at the 2011 Conference 
on Network and Information Systems Security. 

[Koufaris, M2002] Koufaris, M. (2002). Applying the technology acceptance model and flow theory to online 
consumer behavior. Information systems research, 13(2), 205-223.  

[Lampe, U, Wenge, O, Müller, A, & Schaarschmidt, RYear] Lampe, U, Wenge, O, Müller, A, & Schaarschmidt, R. 
(2013). On the relevance of security risks for cloud adoption in the financial industry. Paper presented at the 
AMCIS. 

[Ma, J, Saul, LK, Savage, S, & Voelker, GMYear] Ma, J, Saul, LK, Savage, S, & Voelker, GM. (2009). Beyond 
blacklists: Learning to detect malicious web sites from suspicious urls. Paper presented at the Proceedings of the 
15th ACM SIGKDD international conference on Knowledge discovery and data mining. 

[Mohammad, RM, Thabtah, F, & McCluskey, LYear] Mohammad, RM, Thabtah, F, & McCluskey, L. (2012). An 
assessment of features related to phishing websites using an automated technique. Paper presented at the 2012 
International Conference for Internet Technology and Secured Transactions. 

[Plachkinova, M2021] Plachkinova, M. (2021). Exploring the shift from physical to cybercrime at the onset of the 
covid-19 pandemic. International Journal of Cyber Forensics and Advanced Threat Investigations, 2(1), 13. 
doi:10.46386/ijcfati.v2i1.29 

[Quinlan, JR1993] Quinlan, JR. (1993). C 4.5: Programs for machine learning. The Morgan Kaufmann Series in 
Machine Learning.  

[Rieck, K, Krueger, T, & Dewald, AYear] Rieck, K, Krueger, T, & Dewald, A. (2010). Cujo: Efficient detection 
and prevention of drive-by-download attacks. Paper presented at the Proceedings of the 26th Annual Computer 
Security Applications Conference.  

Figure 2: Schematic Structure of Cyber Threat Intelligence Platform