FACTA UNIVERSITATIS Series: Electronics and Energetics Vol. 34, No 2, June 2021, pp. 239 - 257 https://doi.org/10.2298/FUEE2102239C Vladimir Ćirić, Dušan Cvetković, Nadja Gavrilović, Natalija Stojanović, Ivan Milentijević2 Received September 16, 2020; received in revised form October 28, 2020 Corresponding author: Vladimir M. Ćirič Faculty of Electronic Engineering, Computer Science Department, Aleksandra Medvedeva 14, 18000 Niš, Serbia E-mail: vladimir.ciric@elfak.ni.ac.rs FACTA UNIVERSITATIS Series: Electronics and Energetics Vol. 28, No 4, December 2015, pp. 507 - 525 DOI: 10.2298/FUEE1504507S HORIZONTAL CURRENT BIPOLAR TRANSISTOR (HCBT) – A LOW-COST, HIGH-PERFORMANCE FLEXIBLE BICMOS TECHNOLOGY FOR RF COMMUNICATION APPLICATIONS Tomislav Suligoj1, Marko Koričić1, Josip Žilak1, Hidenori Mochizuki2, So-ichi Morita2, Katsumi Shinomura2, Hisaya Imai2 1University of Zagreb, Faculty of Electrical Engineering and Computing, Department of Electronics, Micro- and Nano-electronics Laboratory, Croatia 2Asahi Kasei Microdevices Co. 5-4960, Nobeoka, Miyazaki, 882-0031, Japan Abstract. In an overview of Horizontal Current Bipolar Transistor (HCBT) technology, the state-of-the-art integrated silicon bipolar transistors are described which exhibit fT and fmax of 51 GHz and 61 GHz and fTBVCEO product of 173 GHzV that are among the highest-performance implanted-base, silicon bipolar transistors. HBCT is integrated with CMOS in a considerably lower-cost fabrication sequence as compared to standard vertical-current bipolar transistors with only 2 or 3 additional masks and fewer process steps. Due to its specific structure, the charge sharing effect can be employed to increase BVCEO without sacrificing fT and fmax. Moreover, the electric field can be engineered just by manipulating the lithography masks achieving the high-voltage HCBTs with breakdowns up to 36 V integrated in the same process flow with high-speed devices, i.e. at zero additional costs. Double-balanced active mixer circuit is designed and fabricated in HCBT technology. The maximum IIP3 of 17.7 dBm at mixer current of 9.2 mA and conversion gain of -5 dB are achieved. Key words: BiCMOS technology, Bipolar transistors, Horizontal Current Bipolar Transistor, Radio frequency integrated circuits, Mixer, High-voltage bipolar transistors. 1. INTRODUCTION In the highly competitive wireless communication markets, the RF circuits and systems are fabricated in the technologies that are very cost-sensitive. In order to minimize the fabrication costs, the sub-10 GHz applications can be processed by using the high-volume silicon technologies. It has been identified that the optimum solution might Received March 9, 2015 Corresponding author: Tomislav Suligoj University of Zagreb, Faculty of Electrical Engineering and Computing, Department of Electronics, Micro- and Nano-electronics Laboratory, Croatia (e-mail: tom@zemris.fer.hr) INPUT SPLITS DESIGN TECHNIQUES FOR NETWORK INTRUSION DETECTION ON HADOOP CLUSTER University of Niš, Faculty of Electronic Engieering, Niš, Serbia Abstract. Intrusion detection system (IDS) is one of the most important components being used to monitor network for possible cyber-attacks. However, the amount of data that should be inspected imposes a great challenge to IDSs. With recent emerge of various big data technologies, there are ways for overcoming the problem of the increased amount of data. Nevertheless, some of this technologies inherit data distribution techniques that can be a problem when splitting a sensitive data such as network data frames across a cluster nodes. The goal of this paper is design and implementation of Hadoop based IDS. In this paper we propose different input split techniques suitable for network data distribution across cloud nodes and test the performances of their Apache Hadoop implementation. Four different data split techniques will be proposed and analysed. The techniques will be described in detail. The system will be evaluated on Apache Hadoop cluster with 17 slave nodes. We will show that processing speed can differ for more than 30% depending on chosen input split design strategy. Additionally, we’ll show that malicious level of network traffic can slow down the processing time, in our case, for nearly 20%. The scalability of the system will also be discussed. Key words: Network Intrusion Detection, Cloud Computing, Apache Hadoop. © 2021 by University of Niš, Serbia | Creative Commons License: CC BY-NC-ND Original scientific paper FACTA UNIVERSITATIS Series: Electronics and Energetics Vol. 34, No 2, June 2021, pp. 239 - 257 https://doi.org/10.2298/FUEE2102239C Vladimir Ćirić, Dušan Cvetković, Nadja Gavrilović, Natalija Stojanović, Ivan Milentijević Received September 16, 2020; received in revised form October 28, 2020 Corresponding author: Vladimir M. Ćirič Faculty of Electronic Engineering, Computer Science Department, Aleksandra Medvedeva 14, 18000 Niš, Serbia E-mail: vladimir.ciric@elfak.ni.ac.rs FACTA UNIVERSITATIS Series: Electronics and Energetics Vol. 28, No 4, December 2015, pp. 507 - 525 DOI: 10.2298/FUEE1504507S HORIZONTAL CURRENT BIPOLAR TRANSISTOR (HCBT) – A LOW-COST, HIGH-PERFORMANCE FLEXIBLE BICMOS TECHNOLOGY FOR RF COMMUNICATION APPLICATIONS Tomislav Suligoj1, Marko Koričić1, Josip Žilak1, Hidenori Mochizuki2, So-ichi Morita2, Katsumi Shinomura2, Hisaya Imai2 1University of Zagreb, Faculty of Electrical Engineering and Computing, Department of Electronics, Micro- and Nano-electronics Laboratory, Croatia 2Asahi Kasei Microdevices Co. 5-4960, Nobeoka, Miyazaki, 882-0031, Japan Abstract. In an overview of Horizontal Current Bipolar Transistor (HCBT) technology, the state-of-the-art integrated silicon bipolar transistors are described which exhibit fT and fmax of 51 GHz and 61 GHz and fTBVCEO product of 173 GHzV that are among the highest-performance implanted-base, silicon bipolar transistors. HBCT is integrated with CMOS in a considerably lower-cost fabrication sequence as compared to standard vertical-current bipolar transistors with only 2 or 3 additional masks and fewer process steps. Due to its specific structure, the charge sharing effect can be employed to increase BVCEO without sacrificing fT and fmax. Moreover, the electric field can be engineered just by manipulating the lithography masks achieving the high-voltage HCBTs with breakdowns up to 36 V integrated in the same process flow with high-speed devices, i.e. at zero additional costs. Double-balanced active mixer circuit is designed and fabricated in HCBT technology. The maximum IIP3 of 17.7 dBm at mixer current of 9.2 mA and conversion gain of -5 dB are achieved. Key words: BiCMOS technology, Bipolar transistors, Horizontal Current Bipolar Transistor, Radio frequency integrated circuits, Mixer, High-voltage bipolar transistors. 1. INTRODUCTION In the highly competitive wireless communication markets, the RF circuits and systems are fabricated in the technologies that are very cost-sensitive. In order to minimize the fabrication costs, the sub-10 GHz applications can be processed by using the high-volume silicon technologies. It has been identified that the optimum solution might Received March 9, 2015 Corresponding author: Tomislav Suligoj University of Zagreb, Faculty of Electrical Engineering and Computing, Department of Electronics, Micro- and Nano-electronics Laboratory, Croatia (e-mail: tom@zemris.fer.hr) INPUT SPLITS DESIGN TECHNIQUES FOR NETWORK INTRUSION DETECTION ON HADOOP CLUSTER University of Niš, Faculty of Electronic Engieering, Niš, Serbia Abstract. Intrusion detection system (IDS) is one of the most important components being used to monitor network for possible cyber-attacks. However, the amount of data that should be inspected imposes a great challenge to IDSs. With recent emerge of various big data technologies, there are ways for overcoming the problem of the increased amount of data. Nevertheless, some of this technologies inherit data distribution techniques that can be a problem when splitting a sensitive data such as network data frames across a cluster nodes. The goal of this paper is design and implementation of Hadoop based IDS. In this paper we propose different input split techniques suitable for network data distribution across cloud nodes and test the performances of their Apache Hadoop implementation. Four different data split techniques will be proposed and analysed. The techniques will be described in detail. The system will be evaluated on Apache Hadoop cluster with 17 slave nodes. We will show that processing speed can differ for more than 30% depending on chosen input split design strategy. Additionally, we’ll show that malicious level of network traffic can slow down the processing time, in our case, for nearly 20%. The scalability of the system will also be discussed. Key words: Network Intrusion Detection, Cloud Computing, Apache Hadoop. © 2021 by University of Niš, Serbia | Creative Commons License: CC BY-NC-ND Original scientific paper 2 V.Ciric et al. 1 Introduction The complexity of the Internet, diversity of available services, and the de- sire to expand applications of the global network contribute to its increased insecurity. Even with decades of research, and a lot of available security products, the internet has steadily become more and more dangerous [1,2]. Living in the era when everything is connected to the internet requires a different security strategy. When the attack begins, it is irrelevant how the network is configured or what kind of “boxes” network has, or how many security devices are installed. The only thing that matters is who is defending the network. The only way to stay ahead of new vulnerabilities and attacks is through vivid detection and response [3]. Unfortunately, constant security monitoring is a key component missing in most networks [4,5]. Intrusion detection system (IDS) is one of the most important compo- nents used to detect attacks in monitored network traffic [6]. Intrusion de- tection is broadly considered to be a classification problem. Based on their classification model IDSs are classified into signature (or pattern) matching and anomaly based IDS. The signature matching IDS monitors the network activity for a known misuse pattern that was previously identified as a ma- licious attempt [6]. Having in mind typical bandwidths on the network boundaries, the amount of data that need to be analyzed for malicious signatures becomes challeng- ing. There are IDS implementations available that tend to speed up network packet analysis [7–12]. Different approaches to task and data parallelism were exploited [9,10,12]. Some implementations use multi-core software de- velopment frameworks to parallelize the execution on CPU [11], while some utilize GPUs [8]. The Apache Hadoop is a framework for distributed processing of large amount of data on clusters of computers (nodes) using MapReduce pro- gramming model, where each node offers local computation and storage [13]. Hadoop Distributed File System (HDFS) is used for distributed data storage, and it represents a layer above existing file system of every node in cluster used to store input files or parts of them. Large files are split into a group of smaller blocks. Size of these blocks is fixed, so it is easy for Hadoop to index any block within the file [7]. However, this data distribution technique can introduce problems when splitting a sensitive data such as network data frames across a cluster nodes. Due to the fixed size of the block, one part of the network packet can end up on one node, while the other part is on the other, making malicious pattern matching challenging [14,15]. 240 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 241240 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 241 2 V.Ciric et al. 1 Introduction The complexity of the Internet, diversity of available services, and the de- sire to expand applications of the global network contribute to its increased insecurity. Even with decades of research, and a lot of available security products, the internet has steadily become more and more dangerous [1,2]. Living in the era when everything is connected to the internet requires a different security strategy. When the attack begins, it is irrelevant how the network is configured or what kind of “boxes” network has, or how many security devices are installed. The only thing that matters is who is defending the network. The only way to stay ahead of new vulnerabilities and attacks is through vivid detection and response [3]. Unfortunately, constant security monitoring is a key component missing in most networks [4,5]. Intrusion detection system (IDS) is one of the most important compo- nents used to detect attacks in monitored network traffic [6]. Intrusion de- tection is broadly considered to be a classification problem. Based on their classification model IDSs are classified into signature (or pattern) matching and anomaly based IDS. The signature matching IDS monitors the network activity for a known misuse pattern that was previously identified as a ma- licious attempt [6]. Having in mind typical bandwidths on the network boundaries, the amount of data that need to be analyzed for malicious signatures becomes challeng- ing. There are IDS implementations available that tend to speed up network packet analysis [7–12]. Different approaches to task and data parallelism were exploited [9,10,12]. Some implementations use multi-core software de- velopment frameworks to parallelize the execution on CPU [11], while some utilize GPUs [8]. The Apache Hadoop is a framework for distributed processing of large amount of data on clusters of computers (nodes) using MapReduce pro- gramming model, where each node offers local computation and storage [13]. Hadoop Distributed File System (HDFS) is used for distributed data storage, and it represents a layer above existing file system of every node in cluster used to store input files or parts of them. Large files are split into a group of smaller blocks. Size of these blocks is fixed, so it is easy for Hadoop to index any block within the file [7]. However, this data distribution technique can introduce problems when splitting a sensitive data such as network data frames across a cluster nodes. Due to the fixed size of the block, one part of the network packet can end up on one node, while the other part is on the other, making malicious pattern matching challenging [14,15]. 240 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 241 2 V.Ciric et al. 1 Introduction The complexity of the Internet, diversity of available services, and the de- sire to expand applications of the global network contribute to its increased insecurity. Even with decades of research, and a lot of available security products, the internet has steadily become more and more dangerous [1,2]. Living in the era when everything is connected to the internet requires a different security strategy. When the attack begins, it is irrelevant how the network is configured or what kind of “boxes” network has, or how many security devices are installed. The only thing that matters is who is defending the network. The only way to stay ahead of new vulnerabilities and attacks is through vivid detection and response [3]. Unfortunately, constant security monitoring is a key component missing in most networks [4,5]. Intrusion detection system (IDS) is one of the most important compo- nents used to detect attacks in monitored network traffic [6]. Intrusion de- tection is broadly considered to be a classification problem. Based on their classification model IDSs are classified into signature (or pattern) matching and anomaly based IDS. The signature matching IDS monitors the network activity for a known misuse pattern that was previously identified as a ma- licious attempt [6]. Having in mind typical bandwidths on the network boundaries, the amount of data that need to be analyzed for malicious signatures becomes challeng- ing. There are IDS implementations available that tend to speed up network packet analysis [7–12]. Different approaches to task and data parallelism were exploited [9,10,12]. Some implementations use multi-core software de- velopment frameworks to parallelize the execution on CPU [11], while some utilize GPUs [8]. The Apache Hadoop is a framework for distributed processing of large amount of data on clusters of computers (nodes) using MapReduce pro- gramming model, where each node offers local computation and storage [13]. Hadoop Distributed File System (HDFS) is used for distributed data storage, and it represents a layer above existing file system of every node in cluster used to store input files or parts of them. Large files are split into a group of smaller blocks. Size of these blocks is fixed, so it is easy for Hadoop to index any block within the file [7]. However, this data distribution technique can introduce problems when splitting a sensitive data such as network data frames across a cluster nodes. Due to the fixed size of the block, one part of the network packet can end up on one node, while the other part is on the other, making malicious pattern matching challenging [14,15]. Input Splits Design Techniques for IDS on Hadoop Cluster 3 Several authors already dealt with the problem of IDS implementation on Hadoop. However, for the best of our knowledge there is no solution that implements IDS on Hadoop without support of other software tools. In [16,17] the authors used Hadoop to analyse logs gathered from well-known Snort IDS. In [18] the authors proposed Hadoop as a distributed database manager, but the main processing isn’t performed by Hadoop. The goal of this paper is design and implementation of IDS based on Apache Hadoop, with focus on data splitting and distribution techniques to cluster nodes. In this paper we propose different input split techniques suitable for network data distribution across cloud nodes and test the perfor- mances of their Apache Hadoop implementations. Four different data split techniques will be proposed and analysed. The techniques will be described in detail. The IDS will be implemented using Myers pattern search algo- rithm as a core for signature-based packet analysis and evaluated on Apache Hadoop cluster with 17 slave nodes. We will show that processing speed can differ for more than 30% depending on chosen input split design strat- egy. Additionally, we’ll show that malicious level of network traffic can slow down the processing time, in our case, for nearly 20%. The scalability of the system will also be discussed. The paper is organized as follows. Section 2 gives a brief introduction to IDS. Section 3 is devoted to the MapReduce framework, as a basis for the proposed Apache Hadoop implementation. Section 4 is the main section and presents the design of the IDS workflow on the Hadoop framework. In this section we will discuss the design of data input split techniques, as well. Section 5 is devoted to the system evaluation, while in Section 6 concluding remarks are given. 2 Intrusion Detection System Background IDS monitors network traffic and deploys various techniques in order to provide security services. Based on the technique used to assess the network packets as regular or malicious, IDSs are classified into signature (or pattern) matching and anomaly based IDSs [11,19,20]. The signature matching IDS searches the network traffic for a known misuse pattern that was previously identified as a malicious attempt [7,8]. A database with malicious signatures is prepared in advance. This leads to fast and reliable operation, but these IDSs are not able to detect new attacks that have not been seen before. The anomaly based detection IDSs make the decision based on a profile of a normal network behavior, and they are capable of detecting zero day 240 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 241240 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 241 4 V.Ciric et al. attacks, with a drawback of possible false positives [20,21]. In this paper we will focus on pattern matching based IDS. The workflow of pattern matching based IDS is shown in Fig. 1 [12,19]. Intrusion detection starts with network monitoring, followed by network packet preparation for efficient pattern matching, which is based on the pre- defined signature database (Fig. 1). Network monitoring can be performed as packet capture, deep packet inspection and flow-based monitoring. Packet capture intercepts a data packet that is crossing over a specific computer net- work, but it focuses only on packet headers. Deep packet inspection (DPI) is an advanced method of packet filtering, which inspects at the application layer of the OSI (Open Systems Interconnection) reference model. Fig. 1: The typical IDS workflow. Any signature based IDS checks the presence of a malicious signature in the incoming packet sequence and act as instructed by the corresponding rule. Snort is a widely used open-source IDS based on pattern matching [11]. The pattern matching algorithm must be fast enough in order to support the network link speed. There are various implementations of pattern match- ing algorithms [7–12]. We will use Myers pattern search algorithm for DPI packets inspection, with rules in Snort syntax as proposed in [12]. In order to speed up pattern matching, in this paper we choose Apache Hadoop dis- tributed environment, with focus on network traffic data distribution across the nodes. 3 Apache Hadoop HDFS and MapReduce The Apache Hadoop is a framework for distributed computing based on MapReduce programming model, where each computer in a Hadoop cluster (node) offers local computation and storage [13]. The Apache Hadoop cluster consists of one master and many slave nodes. The Apache Hadoop is avail- able in versions 1.x and 2.x. There are two main components of Hadoop 1.x system: Hadoop Distributed File System (HDFS), used for distributed data 242 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 243242 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 243 4 V.Ciric et al. attacks, with a drawback of possible false positives [20,21]. In this paper we will focus on pattern matching based IDS. The workflow of pattern matching based IDS is shown in Fig. 1 [12,19]. Intrusion detection starts with network monitoring, followed by network packet preparation for efficient pattern matching, which is based on the pre- defined signature database (Fig. 1). Network monitoring can be performed as packet capture, deep packet inspection and flow-based monitoring. Packet capture intercepts a data packet that is crossing over a specific computer net- work, but it focuses only on packet headers. Deep packet inspection (DPI) is an advanced method of packet filtering, which inspects at the application layer of the OSI (Open Systems Interconnection) reference model. Fig. 1: The typical IDS workflow. Any signature based IDS checks the presence of a malicious signature in the incoming packet sequence and act as instructed by the corresponding rule. Snort is a widely used open-source IDS based on pattern matching [11]. The pattern matching algorithm must be fast enough in order to support the network link speed. There are various implementations of pattern match- ing algorithms [7–12]. We will use Myers pattern search algorithm for DPI packets inspection, with rules in Snort syntax as proposed in [12]. In order to speed up pattern matching, in this paper we choose Apache Hadoop dis- tributed environment, with focus on network traffic data distribution across the nodes. 3 Apache Hadoop HDFS and MapReduce The Apache Hadoop is a framework for distributed computing based on MapReduce programming model, where each computer in a Hadoop cluster (node) offers local computation and storage [13]. The Apache Hadoop cluster consists of one master and many slave nodes. The Apache Hadoop is avail- able in versions 1.x and 2.x. There are two main components of Hadoop 1.x system: Hadoop Distributed File System (HDFS), used for distributed data 242 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 243 4 V.Ciric et al. attacks, with a drawback of possible false positives [20,21]. In this paper we will focus on pattern matching based IDS. The workflow of pattern matching based IDS is shown in Fig. 1 [12,19]. Intrusion detection starts with network monitoring, followed by network packet preparation for efficient pattern matching, which is based on the pre- defined signature database (Fig. 1). Network monitoring can be performed as packet capture, deep packet inspection and flow-based monitoring. Packet capture intercepts a data packet that is crossing over a specific computer net- work, but it focuses only on packet headers. Deep packet inspection (DPI) is an advanced method of packet filtering, which inspects at the application layer of the OSI (Open Systems Interconnection) reference model. Fig. 1: The typical IDS workflow. Any signature based IDS checks the presence of a malicious signature in the incoming packet sequence and act as instructed by the corresponding rule. Snort is a widely used open-source IDS based on pattern matching [11]. The pattern matching algorithm must be fast enough in order to support the network link speed. There are various implementations of pattern match- ing algorithms [7–12]. We will use Myers pattern search algorithm for DPI packets inspection, with rules in Snort syntax as proposed in [12]. In order to speed up pattern matching, in this paper we choose Apache Hadoop dis- tributed environment, with focus on network traffic data distribution across the nodes. 3 Apache Hadoop HDFS and MapReduce The Apache Hadoop is a framework for distributed computing based on MapReduce programming model, where each computer in a Hadoop cluster (node) offers local computation and storage [13]. The Apache Hadoop cluster consists of one master and many slave nodes. The Apache Hadoop is avail- able in versions 1.x and 2.x. There are two main components of Hadoop 1.x system: Hadoop Distributed File System (HDFS), used for distributed data Input Splits Design Techniques for IDS on Hadoop Cluster 5 storage, and MapReduce computing framework for data manipulation. The architecture of Hadoop 2.x adds YARN (Yet Another Resource Negotiator) as an extension for resource management. The HDFS is an abstraction of all file systems of cluster nodes, which creates an illusion of common data file storage. Large files are split into a group of smaller parts called blocks (default block size is 64MB) [13]. The size of blocks is fixed, due to the simplification of indexing. The HDFS is master-slave architecture, based on the existence of two types of (Linux) deamons: DataNode and NameNode (Fig. 2). NameNode is executed on the master node and it is responsible for managing DataNodes (slaves) [13]. Fig. 2: HDFS components and their communication The NameNode is also responsible for taking care of the replication factor of data blocks. The replication factor contributes to data fault tolerance by creating a several copies of each block across the cluster. In Fig. 2 the repli- cation factor is 2 (default replication factor is 3). In case of the DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes [13]. Typical Hadoop workflow has 4 parts: (1) transferring input data from Client host to HDFS, (2) processing data using MapReduce framework on the slave nodes, (3) storing results on HDFS, and (4) reading data by Client host from HDFS. MapReduce is programming model for distributed data processing, where the map function is applied on every data element in parallel, followed by the reduce function that summarize the collections of intermediate results produced by the map functions. MapReduce paradigm assumes that there 242 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 243242 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 243 6 V.Ciric et al. are no data dependencies between any given instance of the map functions. The map and reduce functions are implemented in Hadoop as follows. Before the beginning of execution, the input data files must be added to the HDFS. The beginning of the data processing itself is the determination of the logical units that will be processed - Input Splits. The most common case is that one Input Split corresponds to one block on HDFS, but it is not necessary to be so. In case the data requires it, the partitioning of data to Input Splits can be done differently, through a special implementation of the InputFormat class that will create them [14,15]. All input and output data are given in key-value pairs < k, v >. The default behavior is to use TextInputFormat, where the key is an offset in bytes from the beginning of file, and the value is the content of one line of the file. The binary files can be used as well. One map task processes one input split (Fig. 3). Each Input Split is divided into records, which are represented as key-value pairs < ki, vj >. Each pair is processed by a map task with one call of the map function. The map function takes one key- value pair < ki, vj > and executes given operations on them. It produces the intermediate results also in the form of key-value pairs < kn, vm > (Fig. 3). Those results are then grouped in such manner that all pairs having the same key are sent to the same reducer. Reducer summarizes all data with the same key in order to get the final result (Fig. 3). Fig. 3: The MapReduce execution In order to design an efficient Hadoop based IDS, due to the fixed size nature of HDFS data blocks, in this paper we’ll focus on experimenting with 244 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 245244 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 245 6 V.Ciric et al. are no data dependencies between any given instance of the map functions. The map and reduce functions are implemented in Hadoop as follows. Before the beginning of execution, the input data files must be added to the HDFS. The beginning of the data processing itself is the determination of the logical units that will be processed - Input Splits. The most common case is that one Input Split corresponds to one block on HDFS, but it is not necessary to be so. In case the data requires it, the partitioning of data to Input Splits can be done differently, through a special implementation of the InputFormat class that will create them [14,15]. All input and output data are given in key-value pairs < k, v >. The default behavior is to use TextInputFormat, where the key is an offset in bytes from the beginning of file, and the value is the content of one line of the file. The binary files can be used as well. One map task processes one input split (Fig. 3). Each Input Split is divided into records, which are represented as key-value pairs < ki, vj >. Each pair is processed by a map task with one call of the map function. The map function takes one key- value pair < ki, vj > and executes given operations on them. It produces the intermediate results also in the form of key-value pairs < kn, vm > (Fig. 3). Those results are then grouped in such manner that all pairs having the same key are sent to the same reducer. Reducer summarizes all data with the same key in order to get the final result (Fig. 3). Fig. 3: The MapReduce execution In order to design an efficient Hadoop based IDS, due to the fixed size nature of HDFS data blocks, in this paper we’ll focus on experimenting with 244 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 245 6 V.Ciric et al. are no data dependencies between any given instance of the map functions. The map and reduce functions are implemented in Hadoop as follows. Before the beginning of execution, the input data files must be added to the HDFS. The beginning of the data processing itself is the determination of the logical units that will be processed - Input Splits. The most common case is that one Input Split corresponds to one block on HDFS, but it is not necessary to be so. In case the data requires it, the partitioning of data to Input Splits can be done differently, through a special implementation of the InputFormat class that will create them [14,15]. All input and output data are given in key-value pairs < k, v >. The default behavior is to use TextInputFormat, where the key is an offset in bytes from the beginning of file, and the value is the content of one line of the file. The binary files can be used as well. One map task processes one input split (Fig. 3). Each Input Split is divided into records, which are represented as key-value pairs < ki, vj >. Each pair is processed by a map task with one call of the map function. The map function takes one key- value pair < ki, vj > and executes given operations on them. It produces the intermediate results also in the form of key-value pairs < kn, vm > (Fig. 3). Those results are then grouped in such manner that all pairs having the same key are sent to the same reducer. Reducer summarizes all data with the same key in order to get the final result (Fig. 3). Fig. 3: The MapReduce execution In order to design an efficient Hadoop based IDS, due to the fixed size nature of HDFS data blocks, in this paper we’ll focus on experimenting with Input Splits Design Techniques for IDS on Hadoop Cluster 7 different techniques of dividing the input data into Input Splits. 4 Design of Hadoop based IDS The architecture of the proposed IDS is shown in Fig. 4. The proposed ar- chitecture uses available Snort rules database and distributes pattern search across the Hadoop cluster. Due to the default behavior of HDFS to split the data into a fixed size blocks, and the nature of network protocols to have a packets of different sizes, the crucial design decision is how the packets on block boundaries will be handled. Having this in mind, we introduced pcap input format packet in the architecture from Fig. 4, which will allow us to experiment with different approaches by abstracting the InputFormat class mentioned in the previous Section. Fig. 4: The architecture of the proposed IDS The IDS packet from Fig. 4 is a central part that implements MapRe- duce pattern search through captured network traffic using Myers algo- rithm [12]. We chose the standard pcap format for capturing and storing the network traffic [3]. The architecture’s packet pcap input format from Fig. 4 is specialized for controlling the boundaries of the Input Splits, while pcap input counter tests the validity of its execution. The Snort rules parser creates a distributed cache out of Snort rules that will be used as an input in the pattern search algorithm. The pattern search algorithm itself is imple- mented in the tests packet, while utils provide pcap network traffic decoding functionalities. 244 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 245244 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 245 8 V.Ciric et al. 4.1 The input file format De facto standard for network traffic capture and storage is pcap file format, which is used by various well known network tools such as Wireshark, tcp- dump, libcap, etc. [3]. The internal structure of pcap file is given in Fig. 5. The file begins with global header, after which the particular network traffic packets follow. Global header contains, among the others, two important information (Fig. 5): the network protocol of the stored packets (network), and the maximum length of the stored packets (snaplen). The network protocol is in the most of cases Ethernet protocol, but it can be IP or any other. The maximum length of the stored packets is the feature that enables storage of the beginning of the packets only, for the sake of efficiency, in the cases when only headers are required. In such cases snaplen value is less then the value indicating original packet len in the actual header of the packet, showing that only the first snaplen bytes of the packet are stored. Each packet header from Fig. 5 contains the information about the stored network packet and should not be confused with actual network pro- tocol header. The packet header from Fig. 5 contains pcap information about the time when the packet is captured (ts sec and ts usec), and its length (incl len and orig len). Fig. 5: The internal structure of pcap file As network protocol packets can have variable length (from few bytes to several tens of kB, depending on protocol), and HDFS blocks are of the fixed size, the fields incl len and orig len are of the great importance for the proposed system. The incl len field represents the length of the packet in bytes as it is stored in the pcap file, while the orig len field gives its original length in bytes as seen on the network. For each packet the following relation stands incl len ≤ snap len ≤ orig len, (1) where only the first incl len bytes of each packet are captured in the pcap file. Here we will demonstrate and compare several techniques for input splits design, having in mind variable nature of network traffic packets. 246 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 247246 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 247 8 V.Ciric et al. 4.1 The input file format De facto standard for network traffic capture and storage is pcap file format, which is used by various well known network tools such as Wireshark, tcp- dump, libcap, etc. [3]. The internal structure of pcap file is given in Fig. 5. The file begins with global header, after which the particular network traffic packets follow. Global header contains, among the others, two important information (Fig. 5): the network protocol of the stored packets (network), and the maximum length of the stored packets (snaplen). The network protocol is in the most of cases Ethernet protocol, but it can be IP or any other. The maximum length of the stored packets is the feature that enables storage of the beginning of the packets only, for the sake of efficiency, in the cases when only headers are required. In such cases snaplen value is less then the value indicating original packet len in the actual header of the packet, showing that only the first snaplen bytes of the packet are stored. Each packet header from Fig. 5 contains the information about the stored network packet and should not be confused with actual network pro- tocol header. The packet header from Fig. 5 contains pcap information about the time when the packet is captured (ts sec and ts usec), and its length (incl len and orig len). Fig. 5: The internal structure of pcap file As network protocol packets can have variable length (from few bytes to several tens of kB, depending on protocol), and HDFS blocks are of the fixed size, the fields incl len and orig len are of the great importance for the proposed system. The incl len field represents the length of the packet in bytes as it is stored in the pcap file, while the orig len field gives its original length in bytes as seen on the network. For each packet the following relation stands incl len ≤ snap len ≤ orig len, (1) where only the first incl len bytes of each packet are captured in the pcap file. Here we will demonstrate and compare several techniques for input splits design, having in mind variable nature of network traffic packets. 246 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 247 8 V.Ciric et al. 4.1 The input file format De facto standard for network traffic capture and storage is pcap file format, which is used by various well known network tools such as Wireshark, tcp- dump, libcap, etc. [3]. The internal structure of pcap file is given in Fig. 5. The file begins with global header, after which the particular network traffic packets follow. Global header contains, among the others, two important information (Fig. 5): the network protocol of the stored packets (network), and the maximum length of the stored packets (snaplen). The network protocol is in the most of cases Ethernet protocol, but it can be IP or any other. The maximum length of the stored packets is the feature that enables storage of the beginning of the packets only, for the sake of efficiency, in the cases when only headers are required. In such cases snaplen value is less then the value indicating original packet len in the actual header of the packet, showing that only the first snaplen bytes of the packet are stored. Each packet header from Fig. 5 contains the information about the stored network packet and should not be confused with actual network pro- tocol header. The packet header from Fig. 5 contains pcap information about the time when the packet is captured (ts sec and ts usec), and its length (incl len and orig len). Fig. 5: The internal structure of pcap file As network protocol packets can have variable length (from few bytes to several tens of kB, depending on protocol), and HDFS blocks are of the fixed size, the fields incl len and orig len are of the great importance for the proposed system. The incl len field represents the length of the packet in bytes as it is stored in the pcap file, while the orig len field gives its original length in bytes as seen on the network. For each packet the following relation stands incl len ≤ snap len ≤ orig len, (1) where only the first incl len bytes of each packet are captured in the pcap file. Here we will demonstrate and compare several techniques for input splits design, having in mind variable nature of network traffic packets. Input Splits Design Techniques for IDS on Hadoop Cluster 9 4.2 Input split techniques We performed experiments with four different Input Split designs. Within the first design technique we use textual file as an input, while in the next three techniques we use binary file format with different network packet and HDFS data block aligning techniques. Technique 1 - tshark packet pre-decoding The simplest solution regarding the implementation of the map and reduce functions is to use input file in textual format, and to pre-decode captured network traffic prior to placing the input file on the HDFS. The paper [12] deals with this particular type of implementation. As malicious attempts can be recognized from their signatures in the form of character or byte arrays, pcap file should be decoded in order to obtain data in plain text from all headers of encapsulating network protocols (i.e. Ethernet, IP, and TCP), including data carried by the application layer. In this case it is not necessary to implement custom Hadoop InputFormat, but the TextInputFormat can be used instead. We use tshark Linux command line tool for network traffic decoding. The example of tshark tool usage is: tshark -r -T fields -E separator=, -e ip.addr -e ws.col.Protocol -e tcp.port -e udp.port -e data > output.txt Each line in the output.txt contains the information from one fetched network packet, which now represents the input file for HDFS. The input file is divided in Input Splits, and each map task is fed by one Input Split. The map function implemented to support this technique takes one line at the time, and executes the pattern search algorithm. If some of the Snort rules match the malicious network packet, the mapper emits < key, value > pair, where the key stands for the attack identification, while the value is constant 1. Having the same key, the results from the same malicious flow go to the same reducer, which counts the malicious packets in the flow and outputs the result. The advantage of this approach is ease of Hadoop implementation, with the drawback that packet decoding have to be done prior the beginning of Hadoop program and pattern search. Technique 2 - Custom InputFormat for pcap input In order to overcome disadvantages of the previous technique, pcap file have to be used in the original binary format, without pre-decoding. This can 246 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 247246 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 247 10 V.Ciric et al. be achieved by implementation of custom Hadoop InputFormat that will process pcap files. Two considerations should be taken into account: (1) How to divide the input file into Input Splits? (2) How to read records from an Input Split and feed them into the map function? During the input file processing and creation of each Input Split, in this technique we ensured that the boundary between each two adjacent Input Splits is exactly at the boundary between two packets (Fig. 6). We crawled the pcap file package by package until the configured block size limit (Fig. 6). At that point, a new Input Split is created. Since the division is performed at the packet boundaries, the sizes of the obtained Input Splits can differ from each other, but no more than the maximum length of one packet, which is 1536 bytes for Ethernet). Fig. 6: Dividing the pcap file into Input Splits: (1) correct boundary, (2) incorrect boundary In order to perform pattern search within the map function, the map function requires both the packet header and the packet data. The output from the mapper is in the form < key, value >, where the key is the offset of the beginning of the packet header, and the value is the whole packet in its original binary format. In this approach the Mapper itself decodes the binary packet and locates all required data (IP addresses, ports, payload, etc.), making the decoding distributed operation, as well. As Myers algorithm natively works with bytes, Mapper only have to decode the packet up to the application layer (to find IP addresses and port numbers), but not the application layer payload itself, which, compared to tshark, reduces the number of operations required for decoding. 248 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 249248 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 249 10 V.Ciric et al. be achieved by implementation of custom Hadoop InputFormat that will process pcap files. Two considerations should be taken into account: (1) How to divide the input file into Input Splits? (2) How to read records from an Input Split and feed them into the map function? During the input file processing and creation of each Input Split, in this technique we ensured that the boundary between each two adjacent Input Splits is exactly at the boundary between two packets (Fig. 6). We crawled the pcap file package by package until the configured block size limit (Fig. 6). At that point, a new Input Split is created. Since the division is performed at the packet boundaries, the sizes of the obtained Input Splits can differ from each other, but no more than the maximum length of one packet, which is 1536 bytes for Ethernet). Fig. 6: Dividing the pcap file into Input Splits: (1) correct boundary, (2) incorrect boundary In order to perform pattern search within the map function, the map function requires both the packet header and the packet data. The output from the mapper is in the form < key, value >, where the key is the offset of the beginning of the packet header, and the value is the whole packet in its original binary format. In this approach the Mapper itself decodes the binary packet and locates all required data (IP addresses, ports, payload, etc.), making the decoding distributed operation, as well. As Myers algorithm natively works with bytes, Mapper only have to decode the packet up to the application layer (to find IP addresses and port numbers), but not the application layer payload itself, which, compared to tshark, reduces the number of operations required for decoding. 248 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 249 10 V.Ciric et al. be achieved by implementation of custom Hadoop InputFormat that will process pcap files. Two considerations should be taken into account: (1) How to divide the input file into Input Splits? (2) How to read records from an Input Split and feed them into the map function? During the input file processing and creation of each Input Split, in this technique we ensured that the boundary between each two adjacent Input Splits is exactly at the boundary between two packets (Fig. 6). We crawled the pcap file package by package until the configured block size limit (Fig. 6). At that point, a new Input Split is created. Since the division is performed at the packet boundaries, the sizes of the obtained Input Splits can differ from each other, but no more than the maximum length of one packet, which is 1536 bytes for Ethernet). Fig. 6: Dividing the pcap file into Input Splits: (1) correct boundary, (2) incorrect boundary In order to perform pattern search within the map function, the map function requires both the packet header and the packet data. The output from the mapper is in the form < key, value >, where the key is the offset of the beginning of the packet header, and the value is the whole packet in its original binary format. In this approach the Mapper itself decodes the binary packet and locates all required data (IP addresses, ports, payload, etc.), making the decoding distributed operation, as well. As Myers algorithm natively works with bytes, Mapper only have to decode the packet up to the application layer (to find IP addresses and port numbers), but not the application layer payload itself, which, compared to tshark, reduces the number of operations required for decoding. Input Splits Design Techniques for IDS on Hadoop Cluster 11 Technique 3 - Custom InputFormat with probabilistic packet bound- ary detection Although the previous technique is much better than tshark packets decod- ing due to its distributed packet decoding, it still crawls from packet to packet through the pcap file with aim to align the boundary of the Input Split with the boundary of the packet. In order to do so, it loads each net- work packet into the memory. This takes time before the start of ”useful” distributed processing, and slows down whole processing. In order to avoid this bottleneck we propose the third technique - probabilistic packet bound- ary detection, where we assume that pcap file contains the network packets captured only on the Data Link layer of the OSI reference model, i.e. that each ”packet” in the pcap file is Ethernet frame. Let the HDFS block size be z bytes. Here we propose not to load all z bytes into the memory in order to find the boundary between the Input Splits, but rather to skip the first x (x < z) bytes (Fig. 7). The question now is whether the boundary between the network packets lies on the chosen offset of x bytes? If so, then for the next package in the pcap file Eq. (1) should stand. This practically means that the value on the position of the orig len field should be greater than zero, that the value of the inc len field should be within the valid limits of the Ethernet frame size, and that the value of the inc len field should be less than or equal to the value of the orig len field. Other fields in the package itself must be valid, too. A suitable place for additional check is where the ethertype field in the header of the Ethernet frame should be. This value should be compared with the value that indicates the Ethernet protocol. This is highly probabilistic and fuzzy way of boundary detection, and a few verified fields can mislead us by giving us a false positive answer. Thus, we check the same conditions for the following k packets (up to the packet denoted as PN+k in Fig. 7). If all conditions stand for the next k packets, we declare the boundary between the packets found, and create the Input Split. In order to perform mentioned additional checks on the next k packets, after skipping the offset of x bytes we load the following y bytes, as it is shown in Fig. 7. Let us note that y ≪ x. This can be used to form a probabilistic algorithm as follows: (1) if the verified conditions stand for the next k packets, we assume that the correct boundary between the Input Splits is on the x-th offset; (2) if the conditions don’t stand at least for one of the k packages, the offset x is not the correct limit, and the offset of x + 1-st byte should be examined (Fig. 7). In the 248 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 249248 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 249 12 V.Ciric et al. Fig. 7: Probabilistic method for packets boundaries detection worst case scenario the number of offsets that should be examined is equal to the maximum length of Ethernet frame, which is not a problem having in mind that we already have y bytes from pcap loaded and available. Even in this case, it is much less than loading and processing of all z bytes. With the careful selection of the parameters x, y and k, high degree of certainty of the proposed technique can be reached. In our implementation we selected the following parameters: z = 128MB, x = 123MB and y = 5MB. In the portion of y = 5MB there are more than 3.000 Ethernet frames, more then enough for us not to obtain any false positive, while loading only 4% of the pcap file. Technique 4 - Custom InputFormat with aligned blocks and Input Splits The previous technique has an important drawback regarding the way how the HDFS operates in the case of unequal sizes of blocks and Input Splits. In any case, the HDFS block size is constant. If the Input Split size is less than the block size, as it is in the previous technique, the boundary of the input split will not be aligned with the block boundary, forcing the HDFS to fill the remaining space with the next Input Split. That Input Split will be divided having a small portion in one block and a larger portion in the next block. The case when one Input Split resides in two blocks will force the Hadoop to copy both blocks on the node where the mapper who processes the particular Input Split is executed. This can cause large and unnecessary network traffic while copying the blocks. To overcome this issue and prevent unnecessary blocks copying, we pro- pose the fourth technique where we have a custom InputFormat and the exact same sizes of blocks and Input Splits (Fig. 8). Now, the boundaries 250 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 251250 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 251 12 V.Ciric et al. Fig. 7: Probabilistic method for packets boundaries detection worst case scenario the number of offsets that should be examined is equal to the maximum length of Ethernet frame, which is not a problem having in mind that we already have y bytes from pcap loaded and available. Even in this case, it is much less than loading and processing of all z bytes. With the careful selection of the parameters x, y and k, high degree of certainty of the proposed technique can be reached. In our implementation we selected the following parameters: z = 128MB, x = 123MB and y = 5MB. In the portion of y = 5MB there are more than 3.000 Ethernet frames, more then enough for us not to obtain any false positive, while loading only 4% of the pcap file. Technique 4 - Custom InputFormat with aligned blocks and Input Splits The previous technique has an important drawback regarding the way how the HDFS operates in the case of unequal sizes of blocks and Input Splits. In any case, the HDFS block size is constant. If the Input Split size is less than the block size, as it is in the previous technique, the boundary of the input split will not be aligned with the block boundary, forcing the HDFS to fill the remaining space with the next Input Split. That Input Split will be divided having a small portion in one block and a larger portion in the next block. The case when one Input Split resides in two blocks will force the Hadoop to copy both blocks on the node where the mapper who processes the particular Input Split is executed. This can cause large and unnecessary network traffic while copying the blocks. To overcome this issue and prevent unnecessary blocks copying, we pro- pose the fourth technique where we have a custom InputFormat and the exact same sizes of blocks and Input Splits (Fig. 8). Now, the boundaries 250 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 251 12 V.Ciric et al. Fig. 7: Probabilistic method for packets boundaries detection worst case scenario the number of offsets that should be examined is equal to the maximum length of Ethernet frame, which is not a problem having in mind that we already have y bytes from pcap loaded and available. Even in this case, it is much less than loading and processing of all z bytes. With the careful selection of the parameters x, y and k, high degree of certainty of the proposed technique can be reached. In our implementation we selected the following parameters: z = 128MB, x = 123MB and y = 5MB. In the portion of y = 5MB there are more than 3.000 Ethernet frames, more then enough for us not to obtain any false positive, while loading only 4% of the pcap file. Technique 4 - Custom InputFormat with aligned blocks and Input Splits The previous technique has an important drawback regarding the way how the HDFS operates in the case of unequal sizes of blocks and Input Splits. In any case, the HDFS block size is constant. If the Input Split size is less than the block size, as it is in the previous technique, the boundary of the input split will not be aligned with the block boundary, forcing the HDFS to fill the remaining space with the next Input Split. That Input Split will be divided having a small portion in one block and a larger portion in the next block. The case when one Input Split resides in two blocks will force the Hadoop to copy both blocks on the node where the mapper who processes the particular Input Split is executed. This can cause large and unnecessary network traffic while copying the blocks. To overcome this issue and prevent unnecessary blocks copying, we pro- pose the fourth technique where we have a custom InputFormat and the exact same sizes of blocks and Input Splits (Fig. 8). Now, the boundaries Input Splits Design Techniques for IDS on Hadoop Cluster 13 of Input Splits are not aligned with the boundaries of network packets in most cases, and they split some packets into two parts (Fig. 8). There- fore, we will ignore the split packets as invalid. They intentionally will not be processed further through the MapReduce framework, leaving the small chance of false negative response of our network intrusion detection system for the sake of speed gain by avoiding of unnecessary block copying. In the worst case, the number of packets that will not be processed can be equal to the number of Input Splits, i.e. one invalid packet per Input Split (roughly one Ethernet frame per 100.000 frames will be ignored). The problem that remains is finding valid beginning of the first packet within the Input Split, and this is a reason for having a custom InputFormat within this technique, too. To find the first valid packet in Input Split, we use the same probabilistic algorithm as in the previous technique, and we search through the first y bytes of the Input Split for the valid beginning of Ethernet frame (Fig. 8). Fig. 8: Aligned blocks and Input Splits 5 Implementation results The proposed techniques are suitable for implementation in both Hadoop 1.x and 2.x without any restrictions. In order to evaluate the proposed tech- niques, the IDS is implemented in Apache Hadoop 1.x and 2.x, and tested on a cluster with 18 commodity nodes, where 1 node is a master while the rest 17 nodes are slaves. The nodes are equipped with Intel(R)Core(TM)2 Duo, CPU E4600@2.40GHz, and 1GB of RAM. In order to compare the perfor- mances of the proposed Hadoop IDS with the reference Snort IDS, evaluation of the technique 1 is done in single-processor environment, because the Snort IDS doesn’t support distributed execution. For that purpose we used an en- vironment with i3 6006U CPU and 8GB RAM. The processor E4600 has 2 cores and operates at 2.4GHz, with Whetstone benchmark results 2.25 FLOPS per core, i.e. 4.50 FLOPS in total. The processor i3 6006U has 4 250 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 251250 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 251 14 V.Ciric et al. cores at 2.00GHz, with Whetstone benchmark results 2.08 per core, i.e. 8.31 in total [22]. We used input pcap files of variable sizes with 1, 2, 3, 4, and 5GB of network traffic data, having ”low”, ”medium”, and ”high” number of mali- cious packets (less than 1%, 30-40%, and more than 70%, respectively). We also varied the number of Snort rules and the number of slave nodes in the cluster. The proposed Hadoop IDS with technique 1 for Input Splits prepara- tion (tshark packet pre-decoding) is evaluated in Hadoop 2.9.0, in pseudo- distributed environment on the previously mentioned single-processor sys- tem, along with the Snort IDS. The results are given in Table 1. input file size [GB] preprocessing time [s] pattern search time [s] total process- ing time [s] Snort IDS 1 0 120 120 Hadoop IDS 1.6 211 94 305 Table 1: Hadoop IDS with tshark packet pre-decoding vs. Snort IDS The used input pcap file size is 1GB, and it contained about 2 million network packets with ”medium” number of malicious packets. As it can be seen from Table 1, packet pre-decoding increased the file size from 1GB to 1.6GB. Nevertheless, the proposed IDS has 21% faster pattern search time than the Snort (94s vs. 120s). This is due to the fact that the execution of Hadoop IDS takes advantage of multi-core CPU, while the Snort doesn’t. However, the time required for tshark packet pre-decoding took more than 3 minutes (211 seconds), giving the total processing time for Hadoop 2.5 times slower than the Snort IDS. For the job execution purposes Hadoop 2.x requests three different kinds of containers from YARN: the application master container, map containers, and reduce containers. Application master itself requires 1.5GB or RAM by default, making Hadoop 2.x suitable for large clusters with a lot of resources. The proposed techniques 2, 3 and 4 are evaluated in Hadoop 1.2.1 environ- ment, due to the lower resource requirements. For the independent variables in the experiment we chose input pcap file size, the level of malicious packets, the number of Snort rules in the database, and the number of slave nodes. As dependant variables we obtained the total execution time, the total num- ber of map tasks, as well as the number of data local and rack local1 map 1Data local map task is a map task which has data block already locally available on the node where it executes prior to the execution, while rack local map task needs to fetch the data block from the other slave node. 252 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 253252 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 253 14 V.Ciric et al. cores at 2.00GHz, with Whetstone benchmark results 2.08 per core, i.e. 8.31 in total [22]. We used input pcap files of variable sizes with 1, 2, 3, 4, and 5GB of network traffic data, having ”low”, ”medium”, and ”high” number of mali- cious packets (less than 1%, 30-40%, and more than 70%, respectively). We also varied the number of Snort rules and the number of slave nodes in the cluster. The proposed Hadoop IDS with technique 1 for Input Splits prepara- tion (tshark packet pre-decoding) is evaluated in Hadoop 2.9.0, in pseudo- distributed environment on the previously mentioned single-processor sys- tem, along with the Snort IDS. The results are given in Table 1. input file size [GB] preprocessing time [s] pattern search time [s] total process- ing time [s] Snort IDS 1 0 120 120 Hadoop IDS 1.6 211 94 305 Table 1: Hadoop IDS with tshark packet pre-decoding vs. Snort IDS The used input pcap file size is 1GB, and it contained about 2 million network packets with ”medium” number of malicious packets. As it can be seen from Table 1, packet pre-decoding increased the file size from 1GB to 1.6GB. Nevertheless, the proposed IDS has 21% faster pattern search time than the Snort (94s vs. 120s). This is due to the fact that the execution of Hadoop IDS takes advantage of multi-core CPU, while the Snort doesn’t. However, the time required for tshark packet pre-decoding took more than 3 minutes (211 seconds), giving the total processing time for Hadoop 2.5 times slower than the Snort IDS. For the job execution purposes Hadoop 2.x requests three different kinds of containers from YARN: the application master container, map containers, and reduce containers. Application master itself requires 1.5GB or RAM by default, making Hadoop 2.x suitable for large clusters with a lot of resources. The proposed techniques 2, 3 and 4 are evaluated in Hadoop 1.2.1 environ- ment, due to the lower resource requirements. For the independent variables in the experiment we chose input pcap file size, the level of malicious packets, the number of Snort rules in the database, and the number of slave nodes. As dependant variables we obtained the total execution time, the total num- ber of map tasks, as well as the number of data local and rack local1 map 1Data local map task is a map task which has data block already locally available on the node where it executes prior to the execution, while rack local map task needs to fetch the data block from the other slave node. 252 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 253 14 V.Ciric et al. cores at 2.00GHz, with Whetstone benchmark results 2.08 per core, i.e. 8.31 in total [22]. We used input pcap files of variable sizes with 1, 2, 3, 4, and 5GB of network traffic data, having ”low”, ”medium”, and ”high” number of mali- cious packets (less than 1%, 30-40%, and more than 70%, respectively). We also varied the number of Snort rules and the number of slave nodes in the cluster. The proposed Hadoop IDS with technique 1 for Input Splits prepara- tion (tshark packet pre-decoding) is evaluated in Hadoop 2.9.0, in pseudo- distributed environment on the previously mentioned single-processor sys- tem, along with the Snort IDS. The results are given in Table 1. input file size [GB] preprocessing time [s] pattern search time [s] total process- ing time [s] Snort IDS 1 0 120 120 Hadoop IDS 1.6 211 94 305 Table 1: Hadoop IDS with tshark packet pre-decoding vs. Snort IDS The used input pcap file size is 1GB, and it contained about 2 million network packets with ”medium” number of malicious packets. As it can be seen from Table 1, packet pre-decoding increased the file size from 1GB to 1.6GB. Nevertheless, the proposed IDS has 21% faster pattern search time than the Snort (94s vs. 120s). This is due to the fact that the execution of Hadoop IDS takes advantage of multi-core CPU, while the Snort doesn’t. However, the time required for tshark packet pre-decoding took more than 3 minutes (211 seconds), giving the total processing time for Hadoop 2.5 times slower than the Snort IDS. For the job execution purposes Hadoop 2.x requests three different kinds of containers from YARN: the application master container, map containers, and reduce containers. Application master itself requires 1.5GB or RAM by default, making Hadoop 2.x suitable for large clusters with a lot of resources. The proposed techniques 2, 3 and 4 are evaluated in Hadoop 1.2.1 environ- ment, due to the lower resource requirements. For the independent variables in the experiment we chose input pcap file size, the level of malicious packets, the number of Snort rules in the database, and the number of slave nodes. As dependant variables we obtained the total execution time, the total num- ber of map tasks, as well as the number of data local and rack local1 map 1Data local map task is a map task which has data block already locally available on the node where it executes prior to the execution, while rack local map task needs to fetch the data block from the other slave node. Input Splits Design Techniques for IDS on Hadoop Cluster 15 tasks. Fig. 9 shows the evaluation results of the techniques 2, 3 and 4 for 1GB input file size with low level of malicious packets, on a cluster with 17 slave nodes, and the Snort database with 1000 rules. The results are as expected: the Hadoop IDS with input split technique 4 has the best execution time (Fig. 9a). In this case it performs 32% faster then the proposed technique 2 (170 vs. 250 in Fig. 9a). The number of DataLocal and RackLocal blocks confirm the design hypothesis about additional block copying (Fig. 9b). The techniques 2 and 4 have the same number of map tasks due to the fact that both techniques force the size of the input split to be exact (Tcq4) or very close to the size of a block (Tcq2), while Tcq3 introduces the greatest deviation between the size of an input split and a block. Tcq2 Tcq3 Tcq4 0 100 200 250 187 170 a) [s] MapTasks DataLocal RackLocal 0 10 20 17 13 4 20 15 5 17 13 4 b) [#] Tcq2 Tcq3 Tcq4 Fig. 9: Evaluation results of different Input Split design techniques: a) total processing time for techniques 2, 3 and 4, b) map tasks and blocks distribution across the cluster We also evaluated how the technique 4 performs with variable malicious level of input pcap file and variable file size, how it performs with variable number of Snort rules in database, and how it performs in the clusters with different number of slave nodes. The evaluation results are given in Fig. 10. For better introspection, we used the same parameters for the starting points of graphics in Figs. 10 a) and b) as in Fig. 9 a): Tcq4, Snort database with 1000 rules, input file size 1GB, and low pcap malicious level. Figs. 10 b), c), and d) have one common point, as well. From Fig. 10 a) it can be seen that the total execution time strongly depends on the number of malicious attempts in the network traffic flow. In this case the total execution time differs for 18% (202 vs. 170 in Fig. 10a). This is not a consequence of the choice of pattern search algorithm, but 252 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 253252 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 253 16 V.Ciric et al. low medium high 100 200 300 170 182 202 a) malicious level of pcap to ta l ex ec u ti o n ti m e [s ] f=1GB;sr=1000 1 2 3 4 5 100 200 300 170 182 230 280 335 b) input pcap file size [GB] to ta l ex ec u ti o n ti m e [s ] m=low;sr=1000 200 400 600 800 1000 100 200 300 127 158 190 255 335 c) the number of snort rules to ta l ex ec u ti o n ti m e [s ] f=5GB;m=low 5 9 13 17 200 400 600 800 807 461 343 335 d) the number of slave nodes to ta l ex ec u ti o n ti m e [s ] f=5GB;m=low Fig. 10: Evaluation of the Tcq4 with variable parameters: a) variable num- ber of malicious packets in pcap, b) variable input pcap file size, c) variable number of snort rules, d) variable size of the cluster. rather the consequence of Snort rules database structure. The used Myers algorithm has a stabile execution time which is not affected by the contents of neither text nor pattern [12]. The Snort database is hierarchically organized, having rules categorized in levels from general to specific. For example, if the protocol is not http, the SQL injection rules are not going to be examined. Thus, for the low malicious pcap a lot of packets are simply skipped after ports and protocols check, and the Myers search algorithm is not started for them. If the monitored traffic contains a lot of packets that fall into ”suspicious” category, one or more additional pattern searches are going to be performed, depending on the number of specific rules bound to matched general rule. This directly reflects the results from Fig. 10a. Let us note that IDS process in general can be strongly affected with the choice of the pattern search algorithm, as well as with the specifics of 254 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 255254 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 255 16 V.Ciric et al. low medium high 100 200 300 170 182 202 a) malicious level of pcap to ta l ex ec u ti o n ti m e [s ] f=1GB;sr=1000 1 2 3 4 5 100 200 300 170 182 230 280 335 b) input pcap file size [GB] to ta l ex ec u ti o n ti m e [s ] m=low;sr=1000 200 400 600 800 1000 100 200 300 127 158 190 255 335 c) the number of snort rules to ta l ex ec u ti o n ti m e [s ] f=5GB;m=low 5 9 13 17 200 400 600 800 807 461 343 335 d) the number of slave nodes to ta l ex ec u ti o n ti m e [s ] f=5GB;m=low Fig. 10: Evaluation of the Tcq4 with variable parameters: a) variable num- ber of malicious packets in pcap, b) variable input pcap file size, c) variable number of snort rules, d) variable size of the cluster. rather the consequence of Snort rules database structure. The used Myers algorithm has a stabile execution time which is not affected by the contents of neither text nor pattern [12]. The Snort database is hierarchically organized, having rules categorized in levels from general to specific. For example, if the protocol is not http, the SQL injection rules are not going to be examined. Thus, for the low malicious pcap a lot of packets are simply skipped after ports and protocols check, and the Myers search algorithm is not started for them. If the monitored traffic contains a lot of packets that fall into ”suspicious” category, one or more additional pattern searches are going to be performed, depending on the number of specific rules bound to matched general rule. This directly reflects the results from Fig. 10a. Let us note that IDS process in general can be strongly affected with the choice of the pattern search algorithm, as well as with the specifics of 254 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 255 16 V.Ciric et al. low medium high 100 200 300 170 182 202 a) malicious level of pcap to ta l ex ec u ti o n ti m e [s ] f=1GB;sr=1000 1 2 3 4 5 100 200 300 170 182 230 280 335 b) input pcap file size [GB] to ta l ex ec u ti o n ti m e [s ] m=low;sr=1000 200 400 600 800 1000 100 200 300 127 158 190 255 335 c) the number of snort rules to ta l ex ec u ti o n ti m e [s ] f=5GB;m=low 5 9 13 17 200 400 600 800 807 461 343 335 d) the number of slave nodes to ta l ex ec u ti o n ti m e [s ] f=5GB;m=low Fig. 10: Evaluation of the Tcq4 with variable parameters: a) variable num- ber of malicious packets in pcap, b) variable input pcap file size, c) variable number of snort rules, d) variable size of the cluster. rather the consequence of Snort rules database structure. The used Myers algorithm has a stabile execution time which is not affected by the contents of neither text nor pattern [12]. The Snort database is hierarchically organized, having rules categorized in levels from general to specific. For example, if the protocol is not http, the SQL injection rules are not going to be examined. Thus, for the low malicious pcap a lot of packets are simply skipped after ports and protocols check, and the Myers search algorithm is not started for them. If the monitored traffic contains a lot of packets that fall into ”suspicious” category, one or more additional pattern searches are going to be performed, depending on the number of specific rules bound to matched general rule. This directly reflects the results from Fig. 10a. Let us note that IDS process in general can be strongly affected with the choice of the pattern search algorithm, as well as with the specifics of Input Splits Design Techniques for IDS on Hadoop Cluster 17 topology and network organization, as well. If anomaly based approach is chosen instead of pattern search, the search performances can be significantly affected also. From Fig. 10 b) it can be seen that the total execution time linearly depends on the size of the input files. The very slow growth in the beginning of Fig. 10b between execution times for 1GB and 2GB inputs is explained by the fact that the cluster consists of 17 nodes, where each node can execute 2 map tasks on two separate cores in parallel, giving the maximum of 34 simultaneously executed map task on the cluster. A 1GB file is presented on HDFS with 16 blocks of 64MB, while a 2GB file is presented with 32 blocks. This means that all necessary tasks can be run at the same time for both 1GB and 2GB files. The rise in the execution time between these two is due to the increased number of rack local tasks for the larger file. For larger files, more than 34 map tasks are required for processing, which means that not all of them can be started immediately, and they need to wait for the previously started map tasks to finish execution. However, the dependency once the cluster boundary is reached linearly increases (Fig. 10b). The total processing time depends linearly on the number of Snort rules (Fig. 10c), while it has an asymptotic decline depending on the number of nodes (Fig. 10d). It can be noticed in Fig. 10d that for 17 nodes the graph enters saturation and the processing speed remains slightly under 0.15 Gb/sec. As the number of nodes in the cluster grows, each of them stores a smaller number of blocks on average. This leads to an increase in rack local tasks. Copying of remote blocks during the execution is a limiting factor that leads to the saturation in this case. 6 Conclusion In this paper the design and implementation of IDS using Apache Hadoop is proposed. Four different input data split techniques are proposed and anal- ysed. The techniques are described in detail. The IDS is implemented using Myers pattern search algorithm as a core for signature-based packet analysis. We showed the suitability of Hadoop environment for the implementation of network IDS and discussed inherited problem from Hadoop that relates to splitting sensitive data across cluster nodes. The system is evaluated on Apache Hadoop cluster with 17 slave nodes. The implementation and eval- uation results are given and discussed in detail. We showed that processing speed can differ for more than 30% depending on chosen input split design strategy. Additionally, we showed that malicious level of network traffic can 254 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 255254 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 255 18 V.Ciric et al. slow down the processing time, in our case, for nearly 20%. The scalability of the system was also discussed. The proposed techniques deal with spe- cific type of input data, i.e. network traffic packets, but they can be easily generalized to deal with any type of sensitive data which need a special at- tention before it can be split into pieces and scatter onto different nodes in distributed environment. Acknowledgments This work was supported by the Serbian Ministry of Education, Science and Technological Development [grant number TR32012]. References [1] L. A. Maglaras, K.-H. Kim, H. Janicke, M. A. Ferrag, S. Rallis, P. Fragkou, A. Maglaras, and T. J. Cruz, “Cyber security of critical infrastructures,” Ict Express, vol. 4, no. 1, pp. 42–45, 2018. [2] M. A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, “Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study,” Journal of Information Security and Applications, vol. 50, pp. 1–19, 2020. [3] J. Svoboda, I. Ghafir, V. Prenosil et al., “Network monitoring approaches: An overview,” Int J Adv Comput Netw Secur, vol. 5, no. 2, pp. 88–93, 2015. [4] I. Ghafir, V. Prenosil, J. Svoboda, and M. Hammoudeh, “A survey on network security monitoring systems,” in 2016 IEEE 4th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW). IEEE, 2016, pp. 77–82. [5] B. Schneier, “Managed security monitoring: Network security for the 21st century,” Computers & Security, vol. 20, no. 6, pp. 491–503, 2001. [6] G. Kumar, K. Kumar, and M. Sachdeva, “The use of artificial intelligence based techniques for intrusion detection: a review,” Artificial Intelligence Re- view, vol. 34, no. 4, pp. 369–387, 2010. [7] M. Aldwairi and D. Alansari, “Exscind: Fast pattern matching for intrusion detection using exclusion and inclusion filters,” in 2011 7th International Con- ference on Next Generation Web Services Practices. IEEE, 2011, pp. 24–30. [8] D. Xu, H. Zhang, and Y. Fan, “The gpu-based high-performance pattern- matching algorithm for intrusion detection,” Journal of computational infor- mation systems, pp. 3791–3800, 2013. 256 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 257256 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 257 18 V.Ciric et al. slow down the processing time, in our case, for nearly 20%. The scalability of the system was also discussed. The proposed techniques deal with spe- cific type of input data, i.e. network traffic packets, but they can be easily generalized to deal with any type of sensitive data which need a special at- tention before it can be split into pieces and scatter onto different nodes in distributed environment. Acknowledgments This work was supported by the Serbian Ministry of Education, Science and Technological Development [grant number TR32012]. References [1] L. A. Maglaras, K.-H. Kim, H. Janicke, M. A. Ferrag, S. Rallis, P. Fragkou, A. Maglaras, and T. J. Cruz, “Cyber security of critical infrastructures,” Ict Express, vol. 4, no. 1, pp. 42–45, 2018. [2] M. A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, “Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study,” Journal of Information Security and Applications, vol. 50, pp. 1–19, 2020. [3] J. Svoboda, I. Ghafir, V. Prenosil et al., “Network monitoring approaches: An overview,” Int J Adv Comput Netw Secur, vol. 5, no. 2, pp. 88–93, 2015. [4] I. Ghafir, V. Prenosil, J. Svoboda, and M. Hammoudeh, “A survey on network security monitoring systems,” in 2016 IEEE 4th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW). IEEE, 2016, pp. 77–82. [5] B. Schneier, “Managed security monitoring: Network security for the 21st century,” Computers & Security, vol. 20, no. 6, pp. 491–503, 2001. [6] G. Kumar, K. Kumar, and M. Sachdeva, “The use of artificial intelligence based techniques for intrusion detection: a review,” Artificial Intelligence Re- view, vol. 34, no. 4, pp. 369–387, 2010. [7] M. Aldwairi and D. Alansari, “Exscind: Fast pattern matching for intrusion detection using exclusion and inclusion filters,” in 2011 7th International Con- ference on Next Generation Web Services Practices. IEEE, 2011, pp. 24–30. [8] D. Xu, H. Zhang, and Y. Fan, “The gpu-based high-performance pattern- matching algorithm for intrusion detection,” Journal of computational infor- mation systems, pp. 3791–3800, 2013. 256 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 257 18 V.Ciric et al. slow down the processing time, in our case, for nearly 20%. The scalability of the system was also discussed. The proposed techniques deal with spe- cific type of input data, i.e. network traffic packets, but they can be easily generalized to deal with any type of sensitive data which need a special at- tention before it can be split into pieces and scatter onto different nodes in distributed environment. Acknowledgments This work was supported by the Serbian Ministry of Education, Science and Technological Development [grant number TR32012]. References [1] L. A. Maglaras, K.-H. Kim, H. Janicke, M. A. Ferrag, S. Rallis, P. Fragkou, A. Maglaras, and T. J. Cruz, “Cyber security of critical infrastructures,” Ict Express, vol. 4, no. 1, pp. 42–45, 2018. [2] M. A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, “Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study,” Journal of Information Security and Applications, vol. 50, pp. 1–19, 2020. [3] J. Svoboda, I. Ghafir, V. Prenosil et al., “Network monitoring approaches: An overview,” Int J Adv Comput Netw Secur, vol. 5, no. 2, pp. 88–93, 2015. [4] I. Ghafir, V. Prenosil, J. Svoboda, and M. Hammoudeh, “A survey on network security monitoring systems,” in 2016 IEEE 4th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW). IEEE, 2016, pp. 77–82. [5] B. Schneier, “Managed security monitoring: Network security for the 21st century,” Computers & Security, vol. 20, no. 6, pp. 491–503, 2001. [6] G. Kumar, K. Kumar, and M. Sachdeva, “The use of artificial intelligence based techniques for intrusion detection: a review,” Artificial Intelligence Re- view, vol. 34, no. 4, pp. 369–387, 2010. [7] M. Aldwairi and D. Alansari, “Exscind: Fast pattern matching for intrusion detection using exclusion and inclusion filters,” in 2011 7th International Con- ference on Next Generation Web Services Practices. IEEE, 2011, pp. 24–30. [8] D. Xu, H. Zhang, and Y. Fan, “The gpu-based high-performance pattern- matching algorithm for intrusion detection,” Journal of computational infor- mation systems, pp. 3791–3800, 2013. Input Splits Design Techniques for IDS on Hadoop Cluster 19 [9] M. Kharbutli, M. Aldwairi, and A. Mughrabi, “Function and data paralleliza- tion of wu-manber pattern matching for intrusion detection systems.” Netw. Protoc. Algorithms, vol. 4, no. 3, pp. 46–61, 2012. [10] X. Su, Z. Ji, and X. Lian, “A parallel ac algorithm based on spmd for intru- sion detection system,” in Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering. Atlantis Press, 2013. [11] M. Aldwairi, A. M. Abu-Dalo, and M. Jarrah, “Pattern matching of signature- based ids using myers algorithm under mapreduce framework,” EURASIP Journal on Information Security, vol. 2017, no. 1, pp. 1–11, 2017. [12] V. Ciric, D. Cvetkovic, and I. Milentijevic, “Design and implementation of net- work intrusion detection system on the apache hadoop platform,” in Proceed- ings on 5th International Conference on Electrical, Electronic, and Computer Engineering (IcETRAN 2018), Palic, Serbia, 2018, pp. 1102–1105. [13] C. Lam, Hadoop in action. Manning Publications Co., 2010. [14] M. Y. Eltabakh, Y. Tian, F. Özcan, R. Gemulla, A. Krettek, and J. McPher- son, “Cohadoop: flexible data placement and its exploitation in hadoop,” Pro- ceedings of the VLDB Endowment, vol. 4, no. 9, pp. 575–585, 2011. [15] A. Sayar et al., “Hadoop optimization for massive image processing: case study face detection,” International Journal of Computers Communications & Control, vol. 9, no. 6, pp. 664–671, 2014. [16] J. Cheon and T.-Y. Choe, “Distributed processing of snort alert log using hadoop,” International Journal of Engineering and Technology, vol. 5, no. 3, pp. 2685–2690, 2013. [17] P. Prathibha and E. Dileesh, “Design of a hybrid intrusion detection system us- ing snort and hadoop,” International journal of computer applications, vol. 73, no. 10, 2013. [18] K. Kato and V. Klyuev, “Development of a network intrusion detection system using apache hadoop and spark,” in 2017 IEEE Conference on Dependable and Secure Computing. IEEE, 2017, pp. 416–423. [19] C. F. Endorf, E. Schultz, and J. Mellander, Intrusion detection & prevention. McGraw-Hill Osborne Media, 2004. [20] H.-D. J. Jeong, W. Hyun, J. Lim, and I. You, “Anomaly teletraffic intrusion detection systems on hadoop-based platforms: A survey of some problems and solutions,” in 2012 15th International Conference on Network-Based Informa- tion Systems. IEEE, 2012, pp. 766–770. [21] A. Khraisat, I. Gondal, P. Vamplew, and J. Kamruzzaman, “Survey of intru- sion detection systems: techniques, datasets and challenges,” Cybersecurity, vol. 2, no. 1, p. 20, 2019. [22] U. of Washington, “Cpu performance,” https://boinc.bakerlab.org/rosetta/cpu list.php, accessed: 2020-10-22. 256 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 257256 V. ĆIRIĆ, D. CVETKOVIĆ, N. GAVRILOVIĆ, N. STOJANOVIĆ, I. MILENTIJEVIĆ Input Splits Design Techniques for IDS on Hadoop Cluster 257