Mathematical Problems of Computer Science 59, 69–81, 2023.
doi: 10.51408/1963-0103
UDC 004.725, 004.852

Research of Model Increasing Reliability Intrusion Detection Systems

Timur V. Jamgharyan
National Polytechnic University of Armenia, Yerevan, Armenia
e-mail: t.jamgharyan@yandex.ru

Abstract
The paper presents the results of using a recurrent neural network to detect malicious software as part of the Snort intrusion detection system. The research was conducted on datasets generated on the basis of the athena, dyre, engrat, grum, mimikatz, and surtr malware exploiting vulnerability CVE-2022-20685 in the Snort intrusion detection system. Input traffic data was processed before the frag-3 and modbus preprocessors. The method of k nearest neighbors was used as the mathematical apparatus. The developed software was simulated at different iterations. All research results are available at https://github.com/T-JN
Keywords: Machine learning, Dataset, Malware, Preprocessor, Metasploit, k nearest neighbors method, Intrusion detection system.
Article info: Received 8 January 2023; sent to review 7 February 2023; accepted 7 March 2023.

1. Introduction
Intrusion detection systems (IDS) include many different software components designed to detect various types of traffic with an embedded malicious component. Detection is carried out according to a set of rules that are configured based on the threat model and security policies. The security architecture of the Network Infrastructure (NI) is built taking into account possible attacks according to various models: the CIA triad (Confidentiality, Integrity, Availability) and Parker's hexad [1]. Network IDS, unlike host IDS, detect attacks directed at the network segment and contain a set of complementary rules and security scripts that can neutralize an attack on the network.
Unlike host-based IDS, network-based IDS require more computing resources because a larger set of rules and detectors is activated during their operation [2]. When host IDS are used in an Infrastructure for a fleet of computing systems running Linux, the rules for Windows (or another OS) can be disabled, but this is hardly possible for a network IDS, since different operating systems are used in the Infrastructure. Modern IDS are able to detect various types of attacks at different levels of the OSI (Open Systems Interconnection) model: bad traffic, system scanning, the use of known exploits to attack over various protocols, various backdoors, and various known malware [3]. A significant limitation of systems for analyzing network traffic and the state of the NI is the algorithmic and functional determinism inherent in them.

An important issue of Infrastructure security is the reliability of the data processed by the IDS itself (data reliability is the property of the processed data not to have hidden errors [4]). The processing of data streams in the IDS is determined by the functioning algorithms, the data presentation formats, and the formalization of signature classifiers. Protecting the IDS signature database (both remote and local) is also one of the most important tasks. If the availability of the signature database has been attacked, then when a new vulnerability appears, the IDS will not receive the necessary signature and the Infrastructure perimeter will become vulnerable [5].

The development of M2M (Machine-to-Machine) and ML (Machine learning) technologies has increased the capabilities of both attack and defense tools. Various researchers are working on improving various parameters of IDS with ML [6, 7, 8]. One of the parameters that improves when ML modules are used as part of a standard IDS is its variability.
Unlike deterministic IDS, IDS with ML are capable of forming a multi-criteria sample on the basis of which the detector operation scheme is formed within the given constraints. But IDS with ML have certain limitations when they are integrated into the NI architecture. In particular, ML IDS are very sensitive to various implementations of «noise attacks» (a «noise attack» is a variant of an availability attack in which a large number of random and meaningless fragmented packets are sent to the attacked system, some of which contain malware [9]). A dangerous consequence of a «noise attack» on an ML network IDS is that attackers, by «attacking» it for a long time with streams of datasets that cause false positives, «teach» the ML IDS discriminator to be immune to this type of traffic. This creates a cyclic chain of operations (false positive, true negative, false negative, true positive) that overloads both the IDS itself and the SIEM (Security information and event management) system.

Various manufacturers combine IDS modules into different classes, which makes it possible to quickly reconfigure the IDS for specific tasks. In particular, for the open source Snort IDS, there are many different types of preprocessors (frag-3, stream, performance monitor, SMTP, POP, IMAP, SSH, DNS, DCE/RPC, SIP preprocessors, reputation preprocessor, modbus preprocessor), each of which is responsible for handling a given protocol and/or data type. An IDS preprocessor is a software module that receives data from the network traffic decoding module and passes it to the input of the intrusion detection modules. As stated in the article «Attacks on Machine Learning Systems» [10], the most vulnerable part of the ML IDS is the traditional IDS component (the deterministic part of the IDS). ML systems, like any other, will be hacked using vulnerabilities in these traditional components.
ML is used at the preprocessor level because, when developing an IDS with ML, it is not enough to create a functioning model that can detect a threat not described in the set of rules (signatures) or generate new signatures based on «known» ones; it is also necessary to protect the IDS itself from probable infection with malware that can compromise the reliability of the results issued by the IDS. The choice of a neural network at the preprocessor level is also due to the fact that an IDS with a neural network placed after the preprocessor is able to protect the NI, since malware not covered by standard datasets (those described in the signature/rule database) will be detected, with varying probability, by the neural network. But in a «noise attack» the target is the IDS itself, which, once taken out of its reliable functioning mode, will no longer detect malware. Malicious data embedded in the IDS that is not described at the preprocessor level can be detected using performance preprocessors that evaluate various kinds of statistics. The problem is that, having determined the type of network IDS, attackers can design an attack that takes the work of the preprocessors into account, so that malware embedded in the IDS itself does not go beyond the allowable statistical deviations.

A lot of research has been devoted to applying machine learning as part of IDS, but only a small part of it explores the use of machine learning at the preprocessor level. This limitation is due, in particular, to the fact that the «response» of a neural network is probabilistic in nature, so clear boundaries must be introduced for the neural network itself. Otherwise, the neural network will become an event generator whose output is classified as an attack by the IDS detection modules. Thus, the problem of the stability and integrity of both the IDS and the NI as a whole recurs [11].
This research explores the potential of a recurrent neural network (RNN) to detect malware at the preprocessor level. An RNN was chosen from the entire set of neural networks because RNNs form a directed sequence between elements, which allows a series of events to be processed in time (this characteristic allows granular processing of fragmented datasets). The relevance of the work lies in the ever-increasing role of IDS with ML in the NI security architecture and the increasing security requirements for the IDS itself. The use of a neural network at the preprocessor level will increase the reliability of malware detection results without affecting the main IDS signature database, which will reduce the attack surface of the IDS itself.

The novelty of the research lies in the application of the k nearest neighbors (kNN) method to detect malware in the IDS before the preprocessors. The k nearest neighbors method is a metric algorithm for classifying objects. The malicious software athena, dyre, engrat, grum, mimikatz, and surtr, obtained from publicly available sources, was used as calibration data [12-15]. The kNN method was chosen because the preprocessor error must be minimized, which requires a preliminary grouping and classification of unknown input datasets in normalized traffic. Traffic normalization is the modification of transport-level and network-level protocol packets for their subsequent processing by the IDS detection modules.

2. Formulation and Description of the Problem
It is necessary to detect a malicious dataset in normalized traffic. The mathematical model was constructed on the basis of the formulas obtained in the sources [16, 17]. There are network traffic 𝑋 inputs that contain malware fragments (1).
$$X^m = \{(x_1, y_1), \ldots, (x_m, y_m)\}, \tag{1}$$
where $x_m$ are network traffic datasets that do not contain malicious components, $y_m$ are network traffic datasets containing malicious components, and $m$ is the number of the analyzed packet of the input dataset.

On the set of input traffic datasets, the distance function $\rho(y, y')$ is given. The greater the value of the distance function, the less similar the entities $y$ and $y'$ are, where $y'$ is the minimum size of a malware dataset that can be uniquely identified and classified with respect to $y$. For any entity $\upsilon$ in the data packet, arrange the objects $x_i$ in ascending order of distance to $\upsilon$ (2):
$$\rho(\upsilon, x_{1;\upsilon}) \le \rho(\upsilon, x_{2;\upsilon}) \le \cdots \le \rho(\upsilon, x_{m;\upsilon}), \tag{2}$$
where $x_{i;\upsilon}$ is the network traffic dataset that is the $i$-th neighbor of the entity $\upsilon$. Similarly, $y_{i;\upsilon}$ denotes the dataset for the $i$-th neighbor of the entity $\upsilon$. Using formula (3) from the source [17], we determine the malicious components by the kNN method for the traffic arriving in the NI:
$$\alpha(\upsilon) = \arg\max_{y \in Y} \sum_{i=1}^{m} \left[ y(x_{i;\upsilon}) = y \right] \omega(i, \upsilon), \tag{3}$$
where $\omega(i, \upsilon)$ is a given weight function that evaluates the degree of importance of the $i$-th neighbor for the classification of the entity $\upsilon$. Different choices of $\omega(i, \upsilon)$ give different versions of the k nearest neighbors method (4):
$$\omega(i, \upsilon) = [i \le k]. \tag{4}$$
When $\omega(i, \upsilon) = [i = 1]$, malware is detected only for the given single value of $\omega$. That is, the RNN is only able to detect the malware datasets it was trained on. A graphical representation of an RNN is shown in Fig. 1.

Fig. 1. Recurrent neural network.

Since attackers can load malware into the IDS itself not in a single packet but in fragments (using the built-in frag-3 preprocessor as an internal attack tool), the research task of grouping and classifying malware fragments arises. Standard IDS do not cope with this task very effectively, but ML IDS, given a training set, are able to solve this problem.
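The voting rule in (2)-(4) can be illustrated with a minimal Python sketch. It covers only the kNN vote, not the RNN or the software published by the author. The feature vectors, the Euclidean distance, the `"benign"`/`"malicious"` labels, and the names `knn_classify` and `TRAINING` are all assumptions made for the illustration. The paper does not specify how traffic is turned into features or which metric is used.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Assumed distance function rho; the paper does not fix a metric."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_classify(sample, training_set, k, distance=euclidean):
    """Classify `sample` by the vote in (3) with weights omega(i, v) = [i <= k].

    training_set is a list of (feature_vector, label) pairs (hypothetical layout).
    """
    m = len(training_set)
    if k < 1 or k > m:
        # Mirrors the cases k = 0 and k > m, in which nothing is classified.
        raise ValueError("k must satisfy 1 <= k <= m")

    # (2): order the training objects by increasing distance to the sample.
    ranked = sorted(training_set, key=lambda pair: distance(sample, pair[0]))

    # (3) with (4): each of the k nearest neighbors contributes weight 1,
    # every other neighbor contributes weight 0.
    votes = defaultdict(float)
    for i, (_, label) in enumerate(ranked, start=1):
        votes[label] += 1.0 if i <= k else 0.0

    # arg max over labels; ties go to the label seen first (nearest neighbor).
    return max(votes, key=votes.get)


# Toy data: two well-separated clusters (illustrative values only).
TRAINING = [
    ((0.0, 0.0), "benign"),
    ((0.1, 0.2), "benign"),
    ((0.2, 0.1), "benign"),
    ((5.0, 5.0), "malicious"),
    ((5.1, 4.9), "malicious"),
    ((4.8, 5.2), "malicious"),
]

print(knn_classify((4.9, 5.0), TRAINING, k=3))
print(knn_classify((0.1, 0.1), TRAINING, k=1))
```

With `k=1` the weight function reduces to $[i = 1]$, so the sample simply inherits the label of its single nearest training object.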
The disadvantage of ML IDS is that they can produce unreliable results if the preprocessor responsible for a particular type of traffic/protocol is «damaged» as a result of a «noise attack». A particular danger lies in the fact that traffic entering the IDS preprocessors (both ML and deterministic) is not checked for malicious components, since the task of the preprocessor is to «reformat» traffic for processing by the detectors.

3. Task Statement
It is necessary to develop an algorithm and, based on it, implement software that integrates an RNN capable of solving the problem of grouping and classification with the IDS preprocessor.

4. Boundary Conditions
1. The smallest fragment of the malware file (𝜉) that can be classified is 𝜉 = 20 bytes (detection was carried out using Context Triggered Piecewise Hashing (CTPH), which is discussed in detail in [18]).
2. The delay in the processing module must not cause a «signal race». Traffic from the output of the preprocessor module to the input of the detection modules must be sent synchronously. As part of this condition, an additional restriction has been introduced: only UDP (User Datagram Protocol) traffic is processed.
3. The hardware must support the parallel computing mode.

The developed software connects the RNN to the frag-3 and modbus preprocessors (frag-3 is the preprocessor for defragmenting IP packets; modbus is the preprocessor for processing data from a variety of devices operating in SCADA (Supervisory Control And Data Acquisition) networks). Since the frag-3 preprocessor is designed to reassemble packets, using a trained RNN can neutralize the process of «assembling» malicious packets inside the IDS, increasing the reliability of its functioning. Fig. 2 shows a diagram of the Snort IDS with the proposed data processing software implemented on an RNN.

Fig. 2. Snort IDS with developed data processing software.
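Boundary conditions 1 and 2 can be read as an admission check applied before a fragment is handed to the classifier. The sketch below is one possible reading under stated assumptions, not the author's implementation. The `Packet` type, its fields, and the function name `eligible_for_module` are hypothetical. Only the 20-byte minimum and the UDP restriction come from the text.

```python
from dataclasses import dataclass

MIN_FRAGMENT_BYTES = 20   # boundary condition 1: smallest classifiable fragment
ALLOWED_PROTOCOL = "UDP"  # boundary condition 2: only UDP traffic is processed


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified view of a decoded packet."""
    protocol: str
    payload: bytes


def eligible_for_module(packet: Packet) -> bool:
    """Return True if the packet satisfies boundary conditions 1 and 2."""
    is_udp = packet.protocol.upper() == ALLOWED_PROTOCOL
    long_enough = len(packet.payload) >= MIN_FRAGMENT_BYTES
    return is_udp and long_enough


# A 20-byte UDP payload is admitted; shorter or non-UDP packets are not.
print(eligible_for_module(Packet("UDP", b"\x00" * 20)))
print(eligible_for_module(Packet("UDP", b"\x00" * 19)))
print(eligible_for_module(Packet("TCP", b"\x00" * 64)))
```

Boundary condition 3 (parallel computing support) is a hardware requirement and is therefore not expressed in this check.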
5. Description of the Module
The network traffic coming from the decoders is directed to the preprocessor processing module (standard operation of the Snort IDS). The traffic that should be processed by the frag-3 and modbus preprocessors is sent to the developed RNN-based module. After processing according to the developed algorithm, this traffic is then passed to the standard detection modules. The task of the module is to carry out the primary «cut-off» of possible malware and to protect the IDS itself from being modified by malware. The developed algorithm is shown in Fig. 3.

Fig. 3. Developed algorithm.

Algorithm operation
The software that searches for fragmented malware receives network traffic datasets from the decoder (the low-level interceptor of the Snort IDS) as input. Only traffic that must be processed by the frag-3 and modbus preprocessors is subject to processing.

Step 1. Converting the received datasets to a «Data Frame». This conversion is necessary to speed up the work of the RNN, since the traffic not processed by the developed module goes directly to the preprocessor module, and the processing delay must not exceed the boundary conditions (boundary condition 2).

Step 2, phase 1. Calculating the distance from the target object to be classified to each of the sample objects (traffic), that is, computing a distance metric between likely malware datasets. All calculations are performed in parallel mode (boundary condition 3),
2.1 k=0: the distance metric is not calculated and malicious datasets are not detected, since the classification of malicious and non-malicious datasets is impossible,
2.2 k=1: the distance between malicious and non-malicious datasets is constant (k=const). Only those malicious datasets that fall within the specified distance metric are detected,
2.3 k=m: continuous detection mode. Upper limit: the value of m that the hardware can handle,
2.4 k>m: malicious datasets are not detected,
2.5 k