J. Nig. Soc. Phys. Sci. 4 (2022) 832
Journal of the Nigerian Society of Physical Sciences

Countermeasure to Structured Query Language Injection Attack for Web Applications using Hybrid Logistic Regression Technique

Shehu Magawata Shagari^{a,∗}, Danlami Gabi^{a}, Nasiru Muhammad Dankolo^{a}, Noah Ndakotsu Gana^{b}

^{a} Department of Computer Science, Kebbi State University of Science and Technology, Aliero, Nigeria
^{b} Department of Cyber Security Science, Federal University of Technology, Minna, Nigeria

Abstract

The new generation of security threats has been promoted by real-time applications, where users develop new ways to communicate on the internet via web applications. Structured Query Language injection Attacks (SQLiAs) are one of the major threats to web application security: through them, unauthorised users gain access to the database via web applications. Despite the giant strides made in the detection and prevention of SQLiAs, an ideal approach has yet to be achieved, as most existing techniques still require improvement, especially in addressing the weak characterisation of input vectors, which often leads to low prediction accuracy. To deal with this concern, this paper puts forward a hybrid optimised Logistic Regression (LR) model with Improved Term Frequency-Inverse Document Frequency (ITFIDF-LR). To show the effectiveness of the proposed approach, an attack dataset is used and the model is evaluated using selected performance metrics, i.e., accuracy, recall, specificity and False Positive Rate (FPR). In simulation, the proposed approach, compared with the benchmarked techniques, achieved 0.99781 for accuracy, recall and F1-score, as well as 0.99782, 0.99409 and 0.00591 for precision, specificity and FPR respectively. This indicates that the proposed approach is efficient and, when deployed, is capable of detecting SQLiAs on web applications.
DOI:10.46481/jnsps.2022.832

Keywords: Database management system, Logistic regression, SQL injection attack

Article History: Received: 25 May 2022; Received in revised form: 07 August 2022; Accepted for publication: 08 August 2022; Published: 01 October 2022

© 2022 The Author(s). Published by the Nigerian Society of Physical Sciences under the terms of the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0). Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI. Communicated by: J. Ndam

∗Corresponding author. Tel. no: +234 8036256995. Email address: shagari1978@gmail.com (Shehu Magawata Shagari)

1. Introduction

The Internet has witnessed tremendous growth over the past decade, making the web and various internet products of great research interest [1]. With this growth, the Netcraft web server survey as at January 2019 estimated that there were over 1.8 billion sites and over 4.39 billion internet users. This increase in the number of websites brings increased security risks for websites and other web applications. Top of the list of security risks to web applications and sites are Structured Query Language Injection Attacks (SQLiAs), which are lethal and account for 51% of attacks [2]. SQLiAs are attacks in which an attacker finds a weakness in a web application and exploits it by executing malicious statements through the Internet to access private data from the database [3]. A database is a collection of interrelated and organised data [4]. Database management is central to keeping information secret while granting access only to those with permission to use resources within the database. To interact with the database, a query language such as Structured Query Language (SQL) is required.
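To make the interaction between an application and its database concrete, the following minimal sketch uses Python's built-in sqlite3 module with an in-memory database. The table, the user record and the two helper functions are hypothetical and are not taken from this paper; they only illustrate, under those assumptions, how a malicious statement can exploit a query built by string concatenation, and how passing the same input as a bound parameter avoids the problem.

```python
import sqlite3

def make_db():
    # In-memory database holding one hypothetical user record.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")
    return conn

def login_unsafe(conn, name, password):
    # Vulnerable: user input is concatenated directly into the SQL text.
    query = ("SELECT name FROM users WHERE name = '" + name +
             "' AND password = '" + password + "'")
    return conn.execute(query).fetchall()

def login_safe(conn, name, password):
    # Parameterised: the input is bound as data and never parsed as SQL.
    query = "SELECT name FROM users WHERE name = ? AND password = ?"
    return conn.execute(query, (name, password)).fetchall()

conn = make_db()
payload = "' OR '1'='1"  # classic always-true injection string
unsafe_rows = login_unsafe(conn, "alice", payload)
safe_rows = login_safe(conn, "alice", payload)
```

Here the injected payload turns the password check into a condition that is always true, so the concatenated query returns the stored user while the parameterised query returns nothing.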
A Database Management System (DBMS) is software that provides the functionality for creating a database and for maintaining or updating the data in it [5]. The common types of DBMS are the hierarchical, network, object-oriented and relational database models [6]. Unauthorised access to the database greatly endangers web applications, as the database is a key component used by all web applications to store the data they require [2]. SQLiAs occur when user inputs, cookies and input parameters are not validated before they are passed to SQL queries that will be executed on the database. There are three basic types of SQLiA: Union Based, Error Based and Blind SQL injection attacks [7]. Union Based SQLiAs use the UNION statement: the attacker joins two statements to get information from the database. Error Based SQLiAs, on the other hand, are perpetrated by querying the server with a view to causing an error and using the error message to determine vulnerabilities that can be exploited to attack the system. Blind SQLiAs are dubbed the hardest form of SQLiA, as the attackers can only extract data by querying the server. Blind SQLiAs can be either Boolean based or time based [8-11]. When successful, these attacks enable attackers to bypass system authentication and gain control of the database and, consequently, of private information [12, 13]. This enables an attacker to change users' passwords, retrieve user data, make illegal transactions, delete tables or damage the database, as well as perpetrate several other illegalities.

The widespread use of SQLiAs has led to the development of several methods for the detection and prevention of such attacks. Some of these methods can be seen, e.g., in [14-22].
Although these solutions have contributed immensely to the understanding of how SQLiAs occur and how they are addressed, an ideal solution is still far from being achieved. Given the negative impact of SQLiAs, several researchers [23-29] have tried to address the concerns arising from such attacks by providing techniques to serve as potential solutions. Despite the giant strides made in the detection and prevention of SQLiAs, the following limitations still exist:
1. Weak characterisation of the input vector, as witnessed in most machine learning based methods, leading to low accuracy and an unsatisfactory recall rate.
2. Weak representation of attributes, as witnessed in text vectorisation-based algorithms, which leads to inaccurate description of keyword weights and consequently low prediction accuracy.
3. The use of SVM by some of the existing schemes reduces accuracy, as they inherit the weaknesses of SVM, which performs poorly when the dataset is noisy and when the number of features for each data point exceeds the number of training data samples.

Therefore, in this paper, we propose an optimised logistic regression model based on Improved Term-Frequency-Inverse-Document-Frequency (ITFIDF-LR) to serve as a potential solution. Simulation results show that the proposed approach achieved 0.99781 for accuracy, recall and F1-score, as well as 0.99782, 0.99409 and 0.00591 for precision, specificity and False Positive Rate (FPR) respectively. This indicates that the proposed approach is efficient and, when deployed, is capable of detecting and preventing SQLiAs in DBMS. The contributions of this paper are as follows:
1. Determination of existing gaps by exploring existing techniques for countering SQL injection attacks in the literature,
2. An optimised logistic regression model for the prevention of SQL injection attacks in DBMS,
3.
An improved ITFIDF-LR approach for the detection of SQLiA in DBMS.

The rest of this article is organised as follows. Section 2 discusses related work. IDF, TF-IDF and Logistic Regression are discussed in Section 3. Section 4 discusses the Experimental Model with Proposed Testbed and Simulation. Performance Evaluation Metrics are presented in Section 5. Section 6 presents the results and discussion, while Section 7 concludes the paper.

2. Related Work

Given the negative impact of SQLiAs, several researchers have tried to address the concerns arising from such attacks by providing techniques to serve as solutions. Some of these researchers and their techniques are discussed as follows. Hassan et al. [23] proposed a deep neural network-based technique for the detection of SQL injection vulnerability. The method, which targets the financial loss, application compromise and administrative exploits associated with such attacks, was claimed to outperform existing methods in detecting SQL injection attacks, with an accuracy of 98.04% over 1850 dataset records. However, improved performance could be recorded if a significantly larger dataset were used for training the machine learning algorithms. Yu et al. [24] proposed a novel technique for the detection of SQL injection attacks that combines tokenization, the skip-gram model in word2vec and an SVM algorithm with eigenvectors. The model was said to detect SQL injection attacks effectively, with a claimed but unspecified good accuracy and with 0.35% and 1.9% for FPR and FNR respectively. Although the accuracy score was undisclosed, the reported FPR and FNR leave no doubt that performance can be improved upon. Pan et al.
[25] proposed a robust software modelling tool capable of detecting threats from SQL injection attacks and cross-site scripting. Naïve Bayes, Random Forest and SVM were the machine learning models used in the experimental analysis for SQL injection attacks. The study claimed to have achieved an efficient and accurate method of detecting and checkmating threats from SQL injection and cross-site scripting attacks, with the following performance scores: 0.941, 1.00 and 0.933 for precision and 0.800 each for recall, with F-scores of 0.865 and 0.889, for Naïve Bayes, Random Forest and SVM respectively. Nevertheless, a considerably larger sample dataset could aid the effective detection of SQL injection attacks, as the dataset used for the study has only slightly above 200 sample points. Krishnan et al. [26] studied SQL injection detection based on machine learning, evaluating the effectiveness of selected machine learning algorithms in a quest to establish the most robust learning model. The models explored were Naïve Bayes, Logistic Regression, CNN, SVM and Passive-Aggressive algorithms, with CNN proving the best model for SQLi attack detection after outperforming the other models with 97%, 0.92 and 0.96 for accuracy, precision and recall respectively. However, the study further proposed research into other machine learning models and their optimisation to enhance performance.
Farooq [27] employed ensemble machine learning models to detect SQLi attacks. The Gradient Boosting Machine, Adaptive Boosting, Extended Gradient Boosting Machine and Light Gradient Boosting Machine learning algorithms were analysed. The best result was obtained from the Light Gradient Boosting Machine, with 0.9934 each for accuracy, precision, recall and F1 score, and 0.0093, 0.0146, 0.1208 and 0.007 for MAE, MSE, RMSE and FPR respectively. However, tuning the model could improve its performance in detecting SQLi attacks. To address the challenges associated with machine learning-based detection of SQLi attacks, a novel semantic query-featured ensemble learning model was proposed in [28]. It was claimed that the model achieved 98%, 0.989 and 0.999 for accuracy, F-score and AUC respectively. However, with an optimised model, better performance could be achieved using contemporary techniques of model optimisation and dataset enhancement. A Random Forest-based SQLi attack detection approach was introduced in [29] to address the challenge of detection efficiency. The scores achieved in the experiment were 97% and 95% for precision and accuracy respectively. However, contemporary research indicates that more effective means of detecting malware attacks exist that can offer more robust performance.

Despite the giant strides made in the detection and prevention of SQLiAs, the following limitations still exist in the existing schemes:
1. Weak characterisation of the input vector, as witnessed in most machine learning based methods, leading to low accuracy and an unsatisfactory recall rate.
2. Weak representation of attributes, as witnessed in text vectorisation-based algorithms, which leads to inaccurate description of keyword weights and consequently low prediction accuracy.
3. The use of SVM by some of the existing schemes reduces accuracy, as they inherit the weaknesses of SVM, which performs poorly when the dataset is noisy and when the number of features for each data point exceeds the number of training data samples.

3. Inverse Document Frequency (IDF), Term Frequency-Inverse Document Frequency (TF-IDF) and Logistic Regression Models

This section discusses three existing approaches used in the proposed technique. These approaches are later improved, as shown in Section 4, to achieve better results when compared to the benchmarked scheme.

3.1. Inverse Document Frequency (IDF)

The Inverse Document Frequency (IDF) is the logarithm of the number of queries in the corpus divided by the number of queries in which a specific term appears [30]. IDF measures how important a term is: it weighs down the frequent terms and scales up the rare ones. It is given as:

$$\mathrm{IDF}(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents with term } t \text{ in it}}\right) \tag{1}$$

3.2. Term Frequency-Inverse Document Frequency (TF-IDF)

The TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. It is often used as a central tool in scoring and ranking a document’s relevance given a user query. TF-IDF can be successfully used for stop-word filtering in text summarisation and classification [31].
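As a small worked illustration of Equation 1, the sketch below computes the IDF of two tokens over a toy corpus of SQL queries. The corpus, the whitespace tokenisation and the use of the natural logarithm are assumptions made here for illustration; the paper does not state the logarithm base or give sample queries.

```python
import math

def idf(term, corpus):
    # Equation 1: log(total queries / queries that contain the term).
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)

# Hypothetical corpus: each query is split on whitespace into tokens.
corpus = [
    "select name from users where id = 1".split(),
    "select * from users where id = 1 or 1 = 1".split(),
    "select price from items where id = 7 union select password from users".split(),
    "select title from posts where id = 3".split(),
]

idf_select = idf("select", corpus)  # token present in every query
idf_union = idf("union", corpus)    # token present in a single query
```

A token such as `select`, which appears in every query, receives an IDF of zero, while a rare token such as `union` receives a larger weight, which is the weighing-down and scaling-up behaviour described in Section 3.1.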
The TF-IDF weight comprises two elements: the first is the normalised Term Frequency (TF), that is, the number of times a word appears in a query divided by the total number of words in that query; the second is the Inverse Document Frequency (IDF), which is the logarithm of the number of queries in the corpus divided by the number of queries in which the specific term appears [30]. Term Frequency measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length, that is, the total number of terms in the document, as a way of normalisation:

$$\mathrm{TF}(t) = \frac{\text{number of times term } t \text{ appears in a document}}{\text{total number of terms in the document}} \tag{2}$$

When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of" and "that", may appear many times but have little importance. Thus, the commonly used formula for the TF-IDF is shown in Equation 3 [32]:

$$W(t, d) = \mathrm{TF}(t, d) \, \log\left(\frac{N}{N_t}\right) \tag{3}$$

where $\mathrm{TF}(t, d)$ is the term frequency of the keyword $t$ in the text $d$, $N$ represents the total number of texts and $N_t$ represents the number of texts that contain the word $t$.

3.3. Logistic Regression Model

The Logistic Regression (LR) model is used to solve classification problems. It is an extension of Linear Regression [33-35] in which the dependent variable is categorical and is modelled through the concept of probability. In logistic regression, the dependent variable is a binary variable coded as 1 (yes, success, spam) or 0 (no, failure, not-spam). Binary logistic regression maps the regression line onto the interval [0,1], which is compatible with the logical range of probabilities.
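The mapping onto [0, 1] mentioned above is commonly realised with the logistic (sigmoid) function. The following minimal sketch shows this for a single predictor; the coefficient values `alpha` and `beta` and the input `x` are hypothetical, chosen only to illustrate the mapping and its inverse, the log-odds.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function: maps any real z into [0, 1].
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(x, alpha, beta):
    # pi = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))
    return sigmoid(alpha + beta * x)

def logit(p):
    # Log-odds ln(p / (1 - p)), the inverse of the sigmoid.
    return math.log(p / (1.0 - p))

alpha, beta = -1.0, 2.0  # hypothetical intercept and slope
x = 1.5
p = predict_probability(x, alpha, beta)
```

Applying `logit` to `p` recovers the linear predictor `alpha + beta * x`, which is why the model can be written either as a linear equation in the log-odds or as a probability.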
The core function of the model is the logistic function, or sigmoid function, which takes any real value and maps it to a value between 0 and 1. A simple logit model is shown in Equation 4:

$$\ln\left(\frac{\pi}{1-\pi}\right) = \log(\text{odds}) = \operatorname{logit} = \alpha + \beta x \tag{4}$$

$$\pi = \mathrm{Probability}(Y = \text{outcome of interest} \mid X = x) = \frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}} \tag{5}$$

where $\pi$ is the probability of the outcome of interest, or the event, under variable $Y$, $\alpha$ is the $Y$ intercept and $\beta$ is the slope parameter. Within the inferential framework, the null hypothesis states that $\beta$ is equal to zero in the population; rejecting such a null hypothesis signifies that a relation exists between $X$ and $Y$. $X$ can be categorical or continuous, whereas $Y$ is always categorical. By taking the antilog of Equation 4 on both sides, an equation to predict the probability of the occurrence of the outcome of interest can be derived, as shown in Equation 5.

4. Experimental Model with Proposed Testbed and Simulation

This section discusses the experimental model with the proposed testbeds, as well as the simulation environment used in the implementation.

Table 1. Sample dataset
| Sample (Sentence) | Percentage (%) |
|---|---|
| SQLi attack query | 62 |
| Normal SQL query | 38 |

4.1. Proposed ITFIDF-LR Model

The proposed Improved Term Frequency-Inverse Document Frequency with Optimised Logistic Regression (ITFIDF-LR) model employs both ITFIDF, for vectorisation, and the optimised LR model. In the default TF-IDF equation shown in Equation 3, $\sqrt[n]{\mathrm{TF}}$ is substituted for $\mathrm{TF}$ to eliminate the excessive influence of term frequency and reflect the balance of weight [36].
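The effect of substituting the n-th root of TF for TF can be sketched as follows. The toy corpus, the whitespace tokenisation, the natural logarithm and the choice n = 2 are assumptions made for illustration only; the paper does not state the value of n.

```python
import math

def tf(term, tokens):
    # Equation 2: occurrences of the term divided by the query length.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # log(N / N_t); assumes the term occurs in at least one query.
    n_t = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_t)

def tfidf(term, tokens, corpus):
    # Equation 3: TF(t, d) * log(N / N_t).
    return tf(term, tokens) * idf(term, corpus)

def damped_tfidf(term, tokens, corpus, n=2):
    # Same weight with the n-th root of TF substituted for TF.
    return tf(term, tokens) ** (1.0 / n) * idf(term, corpus)

# Hypothetical corpus; the second query repeats "or" three times.
corpus = [
    "select name from users where id = 1".split(),
    "select a from t where x = 1 or 1 = 1 or 2 = 2 or 3 = 3 union select b from u".split(),
    "select title from posts where id = 3".split(),
]
attack = corpus[1]

raw_ratio = tfidf("or", attack, corpus) / tfidf("union", attack, corpus)
damped_ratio = damped_tfidf("or", attack, corpus) / damped_tfidf("union", attack, corpus)
```

In this example `or` occurs three times and `union` once in the same query, so the plain TF-IDF weight of `or` is three times that of `union`, whereas with the square root the ratio shrinks to about 1.73. This is one way to read the claim that the root reduces the excessive influence of term frequency.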
Similarly, the word class weight coefficient $P_t$ and the position weight coefficient $b_t$ are incorporated into the traditional TF-IDF formula, leading to the ITF-IDF shown in Equation 6:

$$W(t, d) = \sqrt[n]{\mathrm{TF}(t, d)} \times \log\left(\frac{N}{N_t}\right) \times P_t \times b_t \tag{6}$$

An optimised logistic regression model is then put forward to accommodate several predictors and to maximise the likelihood of obtaining the data given its parameter estimates. The optimised LR model is shown in Equations 7 and 8:

$$\ln\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k \tag{7}$$

$$\pi = \mathrm{Probability}(Y = \text{outcome of interest} \mid X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k) = \frac{e^{\alpha+\beta_1 x_1+\beta_2 x_2+\ldots+\beta_k x_k}}{1+e^{\alpha+\beta_1 x_1+\beta_2 x_2+\ldots+\beta_k x_k}} \tag{8}$$

4.2. Datasets

This research makes use of an SQLi attack dataset [37] from a popular dataset repository known as Kaggle. The SQLi attack dataset contains 34,084 instances. The sample dataset is shown in Table 1. Two features from the historic dataset were chosen, namely sentence and class. The sentence is to be detected as either a normal or an SQLi attack query, while the second feature, the class, is a numeric value that indicates whether it is a normal sentence or an SQLi query. In our experiment, the value 1 represents a sentence that is an SQLi query and 0 represents a normal statement.

4.3. Trained Model

The proposed SQLiA detection method is based on ITFIDF-LR and comprises a data pre-processing phase and an optimised logistic regression model training and detection phase. The pre-processing phase first obtains the dataset. Each record in the dataset is then marked: a statement is marked as −1 when it is an attack sample and as +1 when it is not. Then, a word segmentation tool (happierfuntokenising) is used to segment the SQL statements in the dataset. In addition, to reduce the differences between features during the experiment, normalisation is carried out to bring all values into a limited range.
That is, the Min-Max method is used to normalise the data using Equation 9:

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{9}$$

where $X$ denotes the current sample data value, $X_{min}$ denotes the minimum of the sample data, $X_{max}$ denotes the maximum of the sample data, and $X_{norm}$ denotes the normalised value. In the model training and detection phase, the feature dataset is then grouped and divided into a training set and a testing set in the ratio 70:30, that is, 70% for training and 30% for testing the model. The Grid Search optimisation algorithm was deployed to tune the parameters of the Logistic Regression model. The training set is used to train the optimised logistic regression classifier, after which the testing set is used to verify the generated model, and the classification results are then evaluated. The flowchart for the proposed algorithm is shown in Figure 1.

Figure 1. Flowchart of the proposed algorithm

4.4. Simulation Environment

The details of the experimental hardware and software environment are shown in Table 2. Likewise, the Grid Search optimisation algorithm was deployed to tune the parameters of the Logistic Regression model over defined ranges of the Cost (C), Random State, Penalty and Solver parameters; the optimised and default Logistic Regression parameters are shown in Table 3.

Figure 2. Comparative Analysis of Optimised and Default LR based ITFIDF

5. Performance Evaluation Metrics

After the data pre-processing and implementation stages, the proposed approach is evaluated against the benchmarked scheme based on the metrics shown in Table 4, that is, Accuracy, Precision, Sensitivity, Specificity, F1 and FPR, as used in [38, 39].

5.1.
Results and Discussion

This section presents an analysis of the optimised LR based ITFIDF and the default LR based ITFIDF. Table 5 gives the results and Figure 2 presents them graphically. The analysis gives a clear view of the enhanced performance of an optimised model compared with a model using default parameters: in the developed model for the detection and prevention of SQLiA, the optimised version outperforms the default across all the performance metrics evaluated, as outlined below.

The accuracy difference between the optimised and default LR based ITFIDF is 0.04847, which is quite significant in this analysis: 0.99781 was achieved with the optimised LR based ITFIDF against 0.94934 with the default LR based ITFIDF. This indicates the superiority of the optimised LR based ITFIDF model over the model trained with default parameters.

Table 2. Properties of the experimental environment
| Items | Properties |
|---|---|
| Window | Windows 10 operating system |
| Processor | Intel(R) Core(TM) i7-7300HQ CPU @ 2.5 GHz |
| Software | Visual Studio 2016, Programming Language Python |

Table 3. Defined LR Parameter for Optimisation
| Parameter | Optimised LR Parameter | Default LR Parameter |
|---|---|---|
| Cross Validation (CV) | 5 | 5 |
| Cost (C) | 100.0 | 100.0 |
| Random State | 7 | 7 |
| Penalty | 'L1' | 'L2' |
| Solver | 'liblinear' | 'lbfgs' |

Table 4. Performance evaluation metrics
| Metric | Formula |
|---|---|
| Accuracy | (True positives + True negatives) / Total number of examples |
| Precision | True positives / (True positives + False positives) |
| Sensitivity / Recall | True positives / (True positives + False negatives) |
| Specificity | True negatives / (True negatives + False positives) |
| False Positive Rate (FPR) | FP / (TN + FP) |
| F1 Score | 2 × TP / (2 × TP + FN + FP) |

For precision, the developed model for the detection and prevention of SQLiA shows a definitive improvement of 0.04623, with 0.99782 and 0.95159 for the optimised and default LR based ITFIDF models respectively. Recall rates of 0.99781 and 0.94956 were recorded for the optimised and default LR based ITFIDF models, an improvement of 0.04825 for the optimised model. The F-scores were 0.99781 and 0.94866 for the optimised and default LR based ITFIDF models, so the optimised model outperforms the default by 0.04915. Specificity scores of 0.99409 and 0.87659 were achieved, a margin of about 0.12 that reflects the benefit of the tuned parameters of the LR based ITFIDF in this research. Finally, the optimised LR based ITFIDF achieved an FPR of 0.00591 against 0.12341 for the default LR based ITFIDF, a wide margin that shows the efficiency of the enhanced LR based ITFIDF.
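The evaluation metrics can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions (precision as TP / (TP + FP), specificity as TN / (TN + FP)); the counts passed in are hypothetical and are not the confusion matrix of the experiment reported here.

```python
def evaluate(tp, tn, fp, fn):
    # Standard confusion-matrix metrics for a binary detector.
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
        "fpr": fp / (tn + fp),
        "f1": 2 * tp / (2 * tp + fn + fp),
    }

# Hypothetical counts, for illustration only.
scores = evaluate(tp=90, tn=80, fp=20, fn=10)
```

Because specificity and FPR share the denominator TN + FP, they always sum to one, which matches the reported pair 0.99409 and 0.00591 for the optimised model.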
The developed detection and prevention model for SQLiA is also compared with a few other approaches in the literature to establish its strength in detecting and preventing SQLiAs. To this end, the performance of the developed model was compared against existing SQLiA detection techniques on the relevant evaluation metrics found in the literature, namely accuracy, precision, recall, F1-score, specificity and FPR. Table 6 presents the scores achieved in this research alongside those of the baseline literature; the developed model outperformed the other techniques in respect of accuracy, precision, recall, F1-score, specificity and FPR.

In addition, as can be deduced from Table 6, accuracy is the most employed performance evaluation metric in the field of SQLiA detection with machine learning based models, since it is the metric most commonly reported in the reviewed baseline literature. The developed model for the detection and prevention of SQLiA achieved an accuracy of 0.99781, followed by the techniques of [27], [28] and [29] with accuracies of 0.9934, 0.98 and 0.95 respectively, [29] having the lowest accuracy. The performance recorded in this research reflects how correctly the developed model can detect and prevent SQLiAs in a DBMS environment. A low FPR of 0.00591 was recorded for the developed model. Although the baseline literature reviewed in this research did not capture FPR, it is an important evaluation metric in the detection of attacks in computing environments and is recorded in literature earlier than the 2021 articles reviewed.
Both the developed model and that of [27] achieved a precision of about 0.99. However, the precision of the developed model, which measures the exactness of its detection of SQLiA in a DBMS environment, is 0.99782, exceeding the 0.9934 and 0.97 reported by [27] and [29] respectively. The recall of the developed model, which measures the completeness of its detection of SQLiA, is 0.99781 against 0.9934 for [27]. Its F1-score of 0.99781 likewise exceeds those of [27] and [28]: [27] scored 0.9934, slightly above 0.99, while [28] had the lowest score of 0.989. The developed detection and prevention model for SQLiA in a DBMS environment has thus established its robustness across all performance metrics relevant to this research area.

Table 5. Results of the optimised LR based ITFIDF and the default LR based ITFIDF
| Metrics / Approach | Accuracy | Precision | Recall | F-Score | Specificity | False Positive Rate |
|---|---|---|---|---|---|---|
| Default LR Based ITFIDF | 0.94934 | 0.95159 | 0.94956 | 0.94866 | 0.9487 | 0.1234 |
| Optimised LR Based ITFIDF | 0.99781 | 0.99782 | 0.99781 | 0.99781 | 0.99409 | 0.00591 |

Table 6. Comparative Analysis of Optimised LR based ITFIDF with Baseline Literatures
| Reference | Approach | Accuracy | Precision | Recall | F1-score | Specificity | FPR |
|---|---|---|---|---|---|---|---|
| Developed Detection and Prevention Model | Optimised LR based ITFIDF | 0.99781 | 0.99782 | 0.99781 | 0.99781 | 0.99409 | 0.00591 |
| [28] | Ensemble Learning Model | 0.98 | - | - | 0.989 | - | - |
| [29] | Random Forest Model | 0.95 | 0.97 | - | - | - | - |
| [27] | Ensemble Machine Learning Model | 0.9934 | 0.9934 | 0.9934 | 0.9934 | - | - |

N.B: (-) means the metric value is not reported in the existing approach.
Although specificity and FPR were not recorded for the baseline models, this research used these metrics because they are commonly employed in analyses of detection-based machine learning models. The developed detection and prevention model achieved 0.99409 and 0.00591 for specificity and FPR respectively, showing its capability to detect SQLiA efficiently in a DBMS environment.

6. Conclusion

The new generation of security threats has been promoted by real-time applications, where users develop new ways to communicate on the internet via web applications. The internet is growing, with several threats to web applications emerging daily. SQLiA is one such threat, and it serves as a medium for several other severe attacks. SQLiA, currently tagged as one of the most notorious means of attacking the database of a system, continues to haunt the security of websites from multiple angles. This can only be checked by consistently deploying evolving defence techniques. Hence, in this paper, we proposed an ITFIDF-LR model to serve as a potential solution. The developed model for the detection and prevention of SQLiA employs ITFIDF with an optimised Logistic Regression model to address threats that spring up from SQLiA in web applications. The analysis performed in this paper shows 0.99781 for accuracy, recall and F1-score, as well as 0.99782, 0.99409 and 0.00591 for precision, specificity and FPR respectively, as compared to the benchmarked approaches. Future research will focus on the hybridisation of learning models for further improved performance, alongside other vectorising techniques.

Acknowledgment

The authors would like to thank the handling editor and the reviewers for their valuable comments, which improved the quality of this paper.

References

[1] Z. Chen & M.
Guo, “Research on SQL injection detection technology based on SVM”, International Conference on Smart Materials, Intelligent Manufacturing and Automation (2018) 1.
[2] S. O. Uwagbole, W. J. Buchanan & L. Fan, “Applied machine learning predictive analytics to SQL injection attack detection and prevention”, IFIP/IEEE Symposium on Integrated Network and Service Management (IM) (2017) 1087.
[3] R. Chandrashekhar, M. Mardithaya, S. Thilagam & D. Saha, “SQL injection attack mechanisms and prevention techniques”, International Conference on Advanced Computing, Networking and Security (2011) 524.
[4] A. Dasgupta, V. Narasayya & M. Syamala, “A static analysis framework for database applications”, IEEE 25th International Conference on Data Engineering (2009) 1403.
[5] C. S. Kumar, J. Seetha & S. R. Vinotha, “Security implications of distributed database management system models”, International Journal of Soft Computing and Software Engineering 2 (2012) 20.
[6] S. O. Uwagbole, W. J. Buchanan & L. Fan, “Applied machine learning predictive analytics to SQL injection attack detection and prevention”, IFIP/IEEE Symposium on Integrated Network and Service Management (IM) (2017) 1087.
[7] C. Anley, “Advanced SQL injection in SQL server applications”, https://crypto.stanford.edu/cs155old/cs155-spring09/papers/sql_injection.pdf. Accessed 14 December, 2021.
[8] J. Abirami, R. Devakunchari & C. Valliyammai, “A top web security vulnerability SQL injection attack - survey”, Seventh International Conference on Advanced Computing (2015) 1.
[9] D. Gabi, N. M. Dankolo & D. Muhammed, “Towards the use of new forensic approach as a panacea in investigation of cybercrime”, International Journal of Scientific & Engineering Research 5 (2014) 942.
[10] B. Yusuf, R. M. Dima & S. K. Aina, “Optimized breast cancer classification using feature selection and outliers detection”, J. Nig. Soc. Phys. Sci. 3 (2021) 298.
[11] R. O. Oveh, O. Efevberha-Ogodo & F. A.
Egbokhare, “Software process ontology: a case study of software organisations software process sub domains”, J. Nig. Soc. Phys. Sci. 1 (2019) 122.
[12] O. E. Ojo, M. K. Kareem, O. Samuel & C. O. Ugwunna, “An internet-of-things based real-time monitoring system for smart classroom”, J. Nig. Soc. Phys. Sci. 4 (2022) 297.
[13] D. Gabi, “Surveillance on security issues in cloud computing: a view on forensic perspective”, International Journal of Scientific & Engineering Research 5 (2014) 1246.
[14] K. C. Rajeswari, “SQL injection attack prevention using 448 blowfish encryption standard”, International Journal of Computer Science Trends and Technology (IJCST) 4 (2016) 325.
[15] M. Qbea’h, M. Alshraideh & K. E. Sabri, “Detecting and preventing SQL injection attacks: a formal approach”, Cybersecurity and Cyberforensics Conference (CCC) (2016) 123.
[16] L. Xiao, S. Matsumoto, T. Ishikawa & K. Sakurai, “SQL injection attack detection method using expectation criterion”, 2016 Fourth International Symposium on Computing and Networking (CANDAR) (2016) 649.
[17] B. Aziz, M. Bader & C. Hippolyte, “Search-based SQL injection attacks testing using genetic programming”, European Conference on Genetic Programming (2016) 183.
[18] Q. Temeiza, M. Temeiza & J. Itmazi, “A novel method for preventing SQL injection using SHA-1 algorithm and syntax-awareness”, Joint International Conference on Information and Communication Technologies for Education and Training and International Conference on Computing in Arabic (2017) 1.
[19] M. Sood & S. Singh, “SQL injection prevention technique using encryption”, International Journal of Advanced Computational Engineering and Networking 5 (2017) 4.
[20] L. Bossi, E. Bertino & S. R. Hussain, “A system for profiling and monitoring database access patterns by application programs for anomaly detection”, IEEE Transactions on Software Engineering (2017) 415.
[21] S. N. Raj & E.
Sherly, “SQL injection attack prevention by direct reverse resemblance technique”, International Journal of Pure and Applied Mathematics 118 (2018) 599.
[22] Y. Li & B. Zhang, “Detection of SQL injection attacks based on improved TFIDF algorithm”, Journal of Physics: Conference Series 1395 (2019) 012013.
[23] M. M. Hassan, R. B. Ahmad & T. Ghosh, “SQL injection vulnerability detection using deep learning: a feature-based approach”, Indonesian Journal of Electrical Engineering and Informatics (IJEEI) 9 (2021) 702.
[24] L. Yu, S. Luo & L. Pan, “Detecting SQL injection attacks based on text analysis”, 3rd International Conference on Computer Engineering, Information Science and Application Technology (ICCIA 2019) (2019) 95.
[25] Y. Pan, F. Sun, Z. Teng, J. White, D. C. Schmidt, J. Staples & L. Krause, “Detecting web attacks with end-to-end deep learning”, Journal of Internet Services and Applications 10 (2019) 1.
[26] S. A. Krishnan, A. N. Sabu, P. P. Sajan & A. L. Sreedeep, “SQL injection detection using machine learning”, Revista Geintec-Gestao Inovacao E Tecnologias 11 (2021) 300.
[27] U. Farooq, “Ensemble machine learning approaches for detection of SQL injection attack”, Tehnički glasnik 15 (2021) 112.
[28] M. Gowtham & H. B. Pramod, “Semantic query-featured ensemble learning model for SQL-injection attack detection in IoT-ecosystems”, IEEE Transactions on Reliability (2021) 1.
[29] P. Aggarwal, A. Kumar, K. Michael, J. Nemade & S. Sharma, “Random decision forest approach for mitigating SQL injection attacks”, IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT) (2021) 1.
[30] H. C. Wu, R. W. P. Luk, K. F. Wong & K. L. Kwok, “Interpreting tf-idf term weights as making relevance decisions”, ACM Transactions on Information Systems (TOIS) 26 (2008) 1.
[31] V. N. Gudivada, Computational analysis and understanding of natural languages: principles, methods and applications (1st edition), Elsevier (2018).
[32] A. C.
Finkelstein, G. Kappel & W. Retschitzegger, “Ubiquitous web application development-a framework for understanding”, 6th World Multi-conference on Systemics, Cybernetics and Informatics (2002) 1.
[33] J. Y.-C. Peng, L. K. Lee & M. G. Ingersoll, “An introduction to logistic regression analysis and reporting”, Journal of Educational Research 91 (2002) 3.
[34] G. A. Seber & A. J. Lee, Linear regression analysis (Vol. 329), John Wiley & Sons (2012).
[35] D. W. Hosmer Jr, S. Lemeshow & R. X. Sturdivant, Applied logistic regression, John Wiley & Sons (2013).
[36] W. Wang & Y. Tang, “Improvement and application of TF-IDF algorithm in text orientation analysis”, Proceedings of the International Conference on Advanced Material Science and Environmental Engineering (2016) 230.
[37] S. Syed & H. Hussain, “SQL injection dataset”, https://www.kaggle.com/syedsaqlainhussain/sql-injection-dataset. Accessed 10 December 2021.
[38] S. Abaimov & G. Bianchi, “CODDLE: Code-injection detection with deep learning”, IEEE Access 7 (2019) 128617.
[39] L. Wahab & H. Jiang, “A comparative study on machine learning based algorithms for prediction of motorcycle crash severity”, PLoS ONE 14 (2019) 1.