International Journal of Interactive Mobile Technologies (iJIM) – eISSN: 1865-7923 – Vol. 15, No. 17, 2021 – http://www.i-jim.org

Heterogeneous Ensemble with Combined Dimensionality Reduction for Social Spam Detection

https://doi.org/10.3991/ijim.v15i17.19915

Abdulfatai Ganiyu Oladepo, Amos Orenyi Bajeh, Abdullateef Oluwagbemiga Balogun, Hammed Adeleye Mojeed, Abdulsalam Abiodun Salman, Abdullateef Iyanda Bako
University of Ilorin, Ilorin, Nigeria
abdulfataig@gmail.com

Abstract—Spamming is one of the challenging problems within social networks. It involves spreading malicious or scam content on a network; this often leads to a huge loss in the value of real-time social network services, compromises the user and system reputation, and jeopardizes users' trust in the system. Existing methods in spam detection still suffer from misclassification caused by redundant and irrelevant features in the dataset as a result of high dimensionality. This study presents a novel framework based on a heterogeneous ensemble method and a hybrid dimensionality reduction technique for spam detection in micro-blogging social networks. A hybrid of Information Gain (IG) and Principal Component Analysis (PCA) (dimensionality reduction) was implemented for the selection of important features, and a heterogeneous ensemble consisting of Naïve Bayes (NB), K Nearest Neighbor (KNN), Logistic Regression (LR) and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) classifiers based on Average of Probabilities (AOP) was used for spam detection. To empirically investigate its performance, the proposed framework was applied to the MPI_SWS and SAC'13 Tip spam datasets, and the developed models were evaluated based on accuracy, precision, recall, f-measure, and area under the curve (AUC). From the experimental results, the proposed framework (Ensemble + IG + PCA) outperformed the other experimented methods on the studied spam datasets. Specifically, the proposed framework had an average accuracy value of 87.5%, an average precision score of 0.877, an average recall value of 0.845, an average F-measure value of 0.872 and an average AUC value of 0.943. Also, the proposed framework had better performance than some existing approaches. Consequently, this study has shown that addressing high dimensionality in spam datasets, in this case with a hybrid of IG and PCA combined with a heterogeneous ensemble method, can produce a more effective model for detecting spam content.

Keywords—high dimensionality, ensemble, spam detection

1 Introduction

An increase in penetration and access to the Internet, along with developments in mobile technology in recent years, has enhanced the popularity of Online Social Networks (OSNs) among Internet users. OSNs such as Twitter, Facebook, Sina Weibo, Instagram and so on now have about 2.62 billion users across the globe, a figure expected to reach an estimated 3.02 billion by 2021 [1, 2]. Users on these networks communicate with one another by sharing and discussing both personal and public issues and events. This helps to build an intrinsic trust relationship among cyber friends (followers/followees) even though they may not know each other in person. Users usually feel more confident to read messages or even visit links from their cyber friends [3–5].

Micro-blogging Social Networks (MSNs) are also OSNs, with specific characteristics such as (i) use of short messages composed of a limited number of characters; (ii) use of domain-specific words; (iii) high content of noisy data.
MSN users can share short messages called micro-post(s), along with images and multimedia content, with other users [6]. They connect through a follower-followee relationship. For instance, as illustrated in Figure 1, user A initiates a friendship connection with user B without user B acknowledging in return; hence user A is user B's follower and user B is followee to user A, while user B and user C are both follower and followee to each other.

Fig. 1. User follower/followee relationship in MSNs

A Location-Based Social Network (LBSN) is a type of micro-blogging social network where users share their geographic location, search for interesting places and post tips about existing locations. Examples of LBSNs include Apontador, Gowalla and Foursquare. Apontador, a popular Brazilian LBSN system, has features that allow users to search for places, register new locations, check in at locations and post tips or comments about these locations using smartphones. These tips help users not only to find nearby and interesting places, but also to read suggestions about what to order, what to buy or even what to avoid in specific places. However, allowing users to post tips and comment on places exposes the platform and other genuine users to spammers, who then post unsolicited messages in tips and comments about locations [7–9].

Due to the popularity of micro-blogging social networks and the trust relationship built amongst cyber friends, MSNs such as Twitter have become a veritable platform for spammers to abuse and post malicious or spam content. Spam involves the spreading of phishing, malicious, or scam content on a network. Spamming attacks not only lead to a loss in the value of real-time search services, but also interfere with statistics presented by tweet mining tools, consume additional resources from users and systems (such as network bandwidth, leading to significant revenue loss for organizations), compromise the user and system reputation, and may jeopardize users' trust in the existing tips in the system [7, 10].

The alarming rate at which spamming activities take place on social networks, and the inherent consequences, make spam worrisome and challenging to both users and providers of online social networks. According to Nexgate's 2013 report on the state of social media spam, during the first half of 2013 the growth of social spam was 355%, much higher than the growth rate of accounts and messages on branded social networks [11].

As is evident from the foregoing, there is a need for research into methods of identifying spammers and spam content on micro-blogging social networks. Adewole, et al. [12] asserted that the majority of studies on spam detection have been on detecting spammers' accounts and only a few have focused on spam message detection. Many spam/spammer detection methods have been proposed in several studies, most of which are based on content analysis of users' data interaction, or on learning classifications that use topological features and sociological/behavioural characteristics of nodes within and across the social structure. Few studies on social spam detection and classification have used content-based and social structure analysis. Benevenuto, et al. [10] and Zheng, et al. [13] in their respective studies used a support vector machine (SVM) based algorithm for spammer classification. Barushka and Hajek [14], Abulaish and Bhat [15] and Bhat, et al. [16] evaluated the performance of some ensemble learning methods using topology-based learning for social spam detection.
However, redundant and irrelevant features resulting from high dimensionality are still a long-standing problem for social spam detection. The misclassification that results from the low detection accuracy caused by this problem can be very risky. The removal of such features with spectral information enhances the classification process and leads to more accurate classification decisions [17–19]. Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation still retains the meaningful properties of the original data. It applies specific techniques for reducing the number of input variables in training data for predictive models. Fewer input dimensions often mean correspondingly fewer parameters or a simpler structure in the machine learning model.

The integration of dimensionality reduction methods can further enhance the performance of classifiers and ensemble methods in spam detection. This study investigates spam detection in micro-blogging social networks using content and behavioural features from a hybrid dimensionality reduction technique, with a heterogeneous ensemble learning method on Apontador (a location-based social network) datasets. The specific objectives are as follows:

i. To design a hybrid dimensionality reduction method for spam detection in micro-blogging social networks
ii. To develop a spam detection framework that integrates the hybrid dimensionality reduction method designed in (i) with heterogeneous ensemble models
iii. To investigate the performance of the developed framework empirically on publicly available spam detection datasets
iv. To validate the usefulness of the framework by comparing its performance with existing methods proposed in the literature

In summary, the main contributions of this study are:

i. A novel spam detection framework based on a heterogeneous ensemble and a combination of dimensionality reduction techniques.
ii. An empirical study showing the impact of dimensionality reduction techniques on ensemble methods in spam detection.

The rest of this paper is organized as follows: Section 2 reviews and analyzes existing related studies. Section 3 presents the research methods, which include the classifiers, datasets, the experimental framework and the performance evaluation metrics used in this study. Section 4 presents the experimental results and a discussion of our findings. Section 5 presents the conclusion and highlights the future works of this study.

2 Related works

A lot of research has been conducted on spam detection in domains such as email, short message service (SMS), webpages, and social networks. More studies are still needed in these areas, especially in the social network domain vis-à-vis micro-blogging social networks and location-based social networks. Generally, spam detection methods have focused on various characteristics or features of the messages and/or users via two main approaches: content-based and user/behavioural-based learning. Furthermore, spam detection in online social networks (OSNs) has explored the following techniques: blacklist, graph-based, and Machine Learning (ML), all of which can adopt either or both content and user/behavioural-based learning.

Grier, et al. [20], as cited in Adewole, et al. [12], applied a blacklist-based approach to detect malicious tweets on the Twitter network.
They investigated users' click-through data generated from clicks on phishing URLs to study the effectiveness of using malicious URLs to launch large-scale phishing attacks. They further analyzed the capability of the blacklist-based approach in spam detection, but their findings suggested that the approach is very slow in protecting users from being compromised.

For the graph-based approach, Ahmed and Abulaish [21] proposed a Markov clustering algorithm (MCL) to classify a set of profiles on the social network as spam and non-spam. They applied the majority vote technique to examine the overlapping clusters generated using the MCL algorithm. Ghosh, et al. [22] analyzed link farming activities on Twitter and proposed a CollusionRank algorithm to penalize users that connect with spammers on the network, thereby discouraging link farming by lowering users' scores for connecting with spammers [20].

For the Machine Learning approach, Adewole, et al. [12] proposed an ensemble streaming framework based on classification and clustering for spam detection and risk assessment. They used a combination of Multinomial Naïve Bayes (MNB) and modified K-Nearest Neighbour (KNN) classifiers, with the majority vote technique as the ensemble method for classifying messages. The risk assessment function was then computed from the risk score obtained from the outputs of the MNB and KNN algorithms. The streaming K-means algorithm was used for clustering to detect campaigns of spam messages. They were, however, constrained to use the SMS spam dataset for training their classifiers via transfer learning due to the non-availability of real-life micro-blogging datasets.
A considerable number of studies have been conducted using the machine learning (ML) approach for spam detection in OSNs and other domains such as SMS, email and so on. By combining graph-based and ML approaches, Abulaish and Bhat [15] proposed an ensemble of classifier algorithms, namely J48 (a variant of the C4.5 Decision Tree) and Naïve Bayes (NB), using bagging and boosting methods to identify spam in an OSN (Facebook dataset) based on topological and community features from users' interaction network. They observed that the performance of NB and J48 using bagging or boosting ensemble learning methods is better than their respective individual performances. However, the ensemble method using the J48 classifier showed a better performance than that of NB.

Two different works, by Benevenuto, et al. [10] and Zheng, et al. [13], considered content and user/behavioural attributes of their datasets, and both applied a non-linear SVM classifier with a Radial Basis Function (RBF) kernel to control overfitting of the model and the degree of nonlinearity. Benevenuto, et al. [10] worked on a crawled Twitter dataset; their model identified spammers with 70.1% accuracy and non-spammers with 96.4% accuracy. Out of the 96 features used to train the SVM model, only 10 were found to be discriminatory. On the other hand, Zheng, et al. [13] performed their experiment on a crawled, manually labeled Sina Weibo dataset and obtained 99.1% spammer detection accuracy and 99.9% non-spammer detection accuracy. The SVM model was found to perform better than NB and Bayesian Networks (BN) in the comparison.

While most of the previous studies have approached social/microblogging spam detection as a classification problem, Miller, et al. [23] viewed it as an anomaly detection problem. They proposed modified StreamKM++ and DenStream clustering algorithms for spam detection on Twitter. Their model achieved 99% recall and a 6.4% false-positive rate (FPR) using StreamKM++, and 99% recall and a 2.8% FPR with DenStream.
When used together, the two algorithms achieved 100% recall (meaning the model identified all spammers in the test data) and a 2.2% FPR (meaning it incorrectly flagged just 2.2% of normal users as spammers).

The motivation for this study was derived from the aforementioned studies, as they identified the need for research involving ensemble methodology and the selection of important features for the task of spam detection.

3 Methodology

This section presents the baseline classifiers, dimensionality reduction methods, spam datasets, performance evaluation metrics and experimental framework used in this study.

3.1 Classification algorithm

This sub-section presents the baseline classification algorithms used in this study. These classifiers were selected based on their differing computational complexity, which is intended to introduce diversity into the classification process; hence the heterogeneity of the ensemble method.

Naïve Bayes (NB). This machine-learning algorithm is derived from the Bayes rule, and it assumes that the attributes of an observation are conditionally independent of each other given the dependent (class) variable [24]. According to Mitchell [25], when X contains n attributes that are conditionally independent of one another given Y, the Naïve Bayes assumption is expressed as

$$P(X_1, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y) \tag{1}$$

Consider training a classifier that outputs the probability distribution over the possible values of Y for a new instance X = (X_1, …, X_n) that is to be classified. Applying the Bayes rule and assuming that the X_i are conditionally independent given Y (Equation 1), the posterior probability becomes:

$$P(Y = y_k \mid X_1, \ldots, X_n) = \frac{P(Y = y_k) \prod_{i} P(X_i \mid Y = y_k)}{\sum_{j} P(Y = y_j) \prod_{i} P(X_i \mid Y = y_j)} \tag{2}$$

The fundamental equation of the Naïve Bayes classifier is expressed in Equation 2.
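As a minimal sketch of how Equation 2 is evaluated, the following Python snippet computes the posterior for every class from a prior and per-attribute likelihood tables. The function name `naive_bayes_posterior`, the two binary attributes and all probability values are hypothetical and chosen only for illustration; they are not taken from the study's datasets.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods, x):
    """Return P(Y = y_k | x) for every class y_k, following Equation 2.

    priors:      {class: P(Y = class)}
    likelihoods: {class: [{value: P(X_i = value | Y = class)}, ...]}, one dict per attribute
    x:           observed attribute values, one per attribute
    """
    # Numerator of Equation 2: prior times the product of per-attribute likelihoods.
    joint = {
        y: priors[y] * prod(likelihoods[y][i][v] for i, v in enumerate(x))
        for y in priors
    }
    # Denominator of Equation 2: the same quantity summed over all classes.
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}


# Hypothetical toy model with two binary attributes (illustrative values only).
priors = {"spam": 0.5, "non-spam": 0.5}
likelihoods = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.6, "no": 0.4}],
    "non-spam": [{"yes": 0.1, "no": 0.9}, {"yes": 0.3, "no": 0.7}],
}
posterior = naive_bayes_posterior(priors, likelihoods, ["yes", "yes"])
print(posterior)
```

Because the denominator is shared by all classes, the posteriors sum to one, and the ordering of the classes depends on the numerators alone.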
However, the quantity of actual interest is the most probable value of Y; thus, the Naïve Bayes classification rule is expressed in Equation 3 below. First, from Equation 2, we derive

$$Y \leftarrow \arg\max_{y_k} \frac{P(Y = y_k) \prod_{i} P(X_i \mid Y = y_k)}{\sum_{j} P(Y = y_j) \prod_{i} P(X_i \mid Y = y_j)}$$

This is then simplified to Equation 3, as the denominator does not depend on y_k.

$$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_{i} P(X_i \mid Y = y_k) \tag{3}$$

Equation 3 is the simplified Naïve Bayes classification rule, which outputs the most probable value of Y under the assumption that all of X_1, …, X_n are conditionally independent of each other given Y.

K-nearest neighbor (KNN). KNN is an example of an instance-based learner (it is referred to interchangeably as IBK in this study). Nearest-neighbour classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbour classifier searches the pattern space for the k training tuples that are closest to the unknown tuple [26]. These k training tuples are the k "nearest neighbours" of the unknown tuple. KNN is a type of lazy learning where the function is only approximated locally and all computations are deferred until classification. An object is classified by a majority vote of its neighbours. K is always a positive integer, and the neighbours are selected from a set of objects for which the correct classification is known [27, 28].

RIPPER. RIPPER is the short form of Repeated Incremental Pruning to Produce Error Reduction (it is referred to interchangeably as JRip in this study). It is an optimized version of Incremental Reduced Error Pruning (IREP) based on association rules with reduced error pruning [29]. This algorithm is a rule induction method that implements a propositional rule learner. It greedily learns rules from a given dataset by employing a divide-and-conquer strategy [30]. The classes in the training data are sorted in ascending order of class frequency. Beginning from the smallest class, rules are then generated and learned for n–1 classes. Instances covered by the learned rules are removed from the data repeatedly until all instances have been removed.

Logistic regression (LR). Logistic Regression (LR) is a discriminative ML method that is based on the logistic function. LR focuses only on the posterior probability of each class. It is a generalized linear model, mapping the output of linear multiple regression to the posterior probability of each class [31].

3.2 Dimensionality reduction technique

This sub-section presents the dimensionality reduction techniques deployed in this study. Specifically, Information Gain (IG) and Principal Component Analysis (PCA) were selected as the dimensionality reduction techniques.

Information gain (IG). Information Gain (IG) is a feature selection (FS) method for selecting relevant features from the available features of any given data. According to Jain and Bhupendra [32], IG generates the best subset of features among the original features based on 'Entropy'. Usually, the entropy of each feature of the data is computed and the features are arranged in descending order. Hence, features with lower entropy scores are discarded while those with high entropy scores are selected to create a subset of the original data to be used for model development.
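The entropy-based ranking can be illustrated with a short Python sketch. It assumes discrete (already binned) feature values and scores each feature as the class entropy minus the class entropy remaining after splitting on that feature; the feature names `has_url` and `length_bucket` and the four toy records are hypothetical. This is not the WEKA implementation used in the study, only a small illustration of the same idea.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Class entropy minus the expected class entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remaining = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remaining


def rank_features(columns, labels):
    """Return (feature name, information gain) pairs sorted in descending order of gain."""
    scores = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one perfectly informative feature, one uninformative feature.
labels = ["spam", "spam", "non-spam", "non-spam"]
columns = {
    "has_url": [1, 1, 0, 0],
    "length_bucket": ["short", "long", "short", "long"],
}
ranked = rank_features(columns, labels)
print(ranked)
```

In this toy example `has_url` separates the two classes perfectly and is ranked first, while `length_bucket` carries no class information and would be a candidate for removal.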
To compute IG, the expected information required to categorize a record in a data table is first computed, after which the expected information required for each attribute is also computed. To obtain the IG of each feature, the information score for each attribute is subtracted from the expected information of the given data table.

Principal component analysis (PCA). Principal Component Analysis (PCA) is a multivariate statistical method for analyzing several variables to reduce high-dimensional data to a relatively small number of dimensions or components [33]. As a tool, PCA is applicable in several use-cases, such as the visualization of genetic distance or relatedness between populations [34]. However, in this research, PCA is used for dimensionality reduction (as a method for feature extraction). It is done using the eigenvalue decomposition of a data correlation (or covariance) matrix after executing the normalization phase on the original data.

Algorithmically, PCA orthogonally transforms collections of observations of possibly correlated features into another set of linearly uncorrelated values (i.e. principal components), using the following steps as shown in [35]:

1. Collect the original data having d-dimensional observations, ignoring the class label.
2. Standardize the d-dimensional observations.
3. Compute the mean vector of the d-dimensional data.
4. Compute the covariance matrix of the whole data set.
5. Compute the eigenvectors and the corresponding eigenvalues.
6. Sort the eigenvectors and select the k eigenvectors with the largest eigenvalues to form a d × k matrix W (where every column represents an eigenvector).
7. Use the obtained d × k eigenvector matrix to transform the observations onto a new subspace.

3.3 Spam dataset

In this study, two spam datasets from Costa, et al. [7] and Dutta, et al. [36] are used for training and testing the proposed models. These datasets (herein referred to as Dataset 1 and Dataset 2) are about "Tip Spam" in location-based social networks. Specifically, Dataset 1 is based on Apontador and consists of 60 attributes, 7076 instances and 2 class labels (spam or non-spam) [7]. Dataset 2 was created by Costa, et al. [37]. It consists of 41 attributes and 2762 instances with 2 class labels (spam or non-spam). Both datasets are balanced, with an equal number of spam and non-spam instances.

3.4 Performance evaluation metrics

For comprehensive performance evaluation, accuracy, precision, recall, f-measure and area under the curve (AUC) values are used to measure the efficacy of the spam detection models developed in this study. Our preference for these evaluation metrics is based on their wide usage in existing studies on social spam detection [6, 7, 17, 36–38] and their suitability for achieving the objectives of this study.

I. Accuracy measures the percentage of correctly classified instances out of the total number of instances [26], and its value is calculated thus:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{4}$$

II. Precision measures the proportion of instances detected as spam that are actually spam. It is represented as:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{5}$$

III. Recall measures the proportion of actual spam instances that are correctly classified as spam. Its formula is given as:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$

IV. F-measure is the harmonic mean of precision and recall.

$$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$

In Equations (4), (5), (6) and (7), TP = True Positive, a spam instance correctly classified as spam; FP = False Positive, a non-spam instance incorrectly classified as spam; TN = True Negative, a non-spam instance correctly classified as non-spam; and FN = False Negative, a spam instance incorrectly classified as non-spam.

V. The area under the curve (AUC), also known as the area under the ROC (Receiver Operating Characteristic) curve, shows the trade-off between the TP rate and the FP rate [26, 39]. It provides an aggregate measure of performance across all possible classification thresholds.

3.5 Experimental framework

Fig. 2. Experimental framework

To validate the efficacy of the proposed framework for spam detection, an experimental framework, as shown in Figure 2, is developed. The experimental process is divided into two phases:

1. Pre-process Phase: In this phase, the spam datasets are pre-processed by a hybrid IG+PCA dimensionality reduction method to reduce the dimensionality of each spam dataset. Specifically, IG based on the Ranker search method is used to select the top-ranked $\log_2 N$ relevant features (where N is the total number of features in the dataset). Thereafter, the features selected by IG are passed through PCA to assess and generate the optimum subset of features. The purpose of passing the IG-selected features through PCA for further processing is to address the bias of IG towards features with a large range of values [40, 41]. The output from this phase is the pre-processed features from each dataset, which are passed into the next phase (the model construction phase) for the development of spam detection models.
2. Model Construction Phase: Optimal feature subsets from the pre-processing phase are used for spam detection model construction.
In this study, a heterogeneous ensemble method based on the average of probabilities rule is developed for spam detection. The goal of the heterogeneous ensemble method is to harness and aggregate the performance of individual baseline classifiers for classification processes [18, 29, 42]. Specifically, the NB, IBK, LR and JRip baseline classifiers are used to develop a heterogeneous ensemble framework. Each of the baseline classifiers is based on different computational characteristics; hence the heterogeneity. The detection models are developed using the 10-fold cross-validation (CV) technique. The preference for 10-fold CV is based on its ability to produce models with low bias and variance [43–46]. Also, spam detection models with and without dimensionality reduction were developed to allow an unprejudiced comparison and to evaluate the effect of dimensionality reduction and ensemble methods in spam detection.

In the end, the performance of the ensuing spam detection models is evaluated and analyzed based on accuracy, precision, recall, f-measure and AUC. All experiments were carried out using the WEKA machine learning tool [47].

3.6 Research method

The research method adopted in this study is the quantitative empirical method. In the empirical research method, the investigation is based on the observation and measurement of phenomena as derived from direct real-life experience. In this study, real-life datasets based on "Tip Spam" in location-based social networks were used to perform several experimental investigations to evaluate and validate the suitability, effectiveness and significance of the proposed approach.

4 Results and discussion

This section presents the experimental results of the baseline classifiers and ensembles on the two datasets used in this study. Tables 1–6 present the experimental results of the classifiers and ensembles on Dataset 1 and Dataset 2 based on accuracy, precision, recall, f-measure and area under the curve (AUC).
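As a reading aid for the "Ensemble" rows in the tables that follow, the average of probabilities rule described in Section 3.5 can be sketched in a few lines of Python. This is a minimal illustration and not the WEKA implementation used in the experiments; the function name `average_of_probabilities` and the four probability estimates, standing in for the outputs of NB, IBK, LR and JRip on a single tip, are hypothetical.

```python
def average_of_probabilities(per_classifier_probs):
    """Average the class-probability estimates of several classifiers.

    per_classifier_probs: list of {class: probability} dicts, one per base classifier.
    Returns the class with the highest averaged probability and the averaged estimates.
    """
    n = len(per_classifier_probs)
    classes = per_classifier_probs[0].keys()
    averaged = {c: sum(p[c] for p in per_classifier_probs) / n for c in classes}
    label = max(averaged, key=averaged.get)
    return label, averaged


# Hypothetical outputs of four base classifiers for one instance (illustrative values only).
probs = [
    {"spam": 0.90, "non-spam": 0.10},
    {"spam": 0.40, "non-spam": 0.60},
    {"spam": 0.70, "non-spam": 0.30},
    {"spam": 0.45, "non-spam": 0.55},
]
label, averaged = average_of_probabilities(probs)
print(label, averaged)
```

In this example two classifiers favour each class, so a plain majority vote would be tied, whereas averaging the probabilities lets the more confident estimates decide the outcome.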
Table 1. Experimental results of spam models on Dataset 1 without dimensionality reduction

| Classifiers | Accuracy (%) | Precision | Recall | F-Measure | AUC |
|---|---|---|---|---|---|
| NB | 78.20 | 0.806 | 0.782 | 0.778 | 0.837 |
| IBK | 79.22 | 0.792 | 0.792 | 0.792 | 0.799 |
| LR | 82.33 | 0.827 | 0.823 | 0.823 | 0.892 |
| JRip | 83.92 | 0.845 | 0.839 | 0.839 | 0.878 |
| Ensemble | 84.71 | 0.847 | 0.847 | 0.846 | 0.913 |

Table 2. Experimental results of spam models on Dataset 2 without dimensionality reduction

| Classifiers | Accuracy (%) | Precision | Recall | F-Measure | AUC |
|---|---|---|---|---|---|
| NB | 62.54 | 0.708 | 0.625 | 0.584 | 0.858 |
| IBK | 79.89 | 0.799 | 0.799 | 0.799 | 0.789 |
| LR | 85.40 | 0.856 | 0.854 | 0.854 | 0.926 |
| JRip | 87.08 | 0.872 | 0.871 | 0.871 | 0.907 |
| Ensemble | 87.92 | 0.873 | 0.879 | 0.879 | 0.933 |

From Table 1 and Table 2, it can be observed that the heterogeneous ensemble of the baseline classifiers (NB, IBK, LR, and JRip) based on average of probabilities (AOP) outperforms all the considered baseline classifiers on all performance metrics on both datasets. Specifically, the heterogeneous ensemble method had the highest accuracy value (84.71%), precision (0.847), recall (0.847), f-measure (0.846), and AUC (0.913) on Dataset 1, and the highest accuracy value (87.92%), precision (0.873), recall (0.879), f-measure (0.879), and AUC (0.933) on Dataset 2 when compared with the baseline classifiers, as presented in Table 1 and Table 2. Amongst the baseline classifiers, JRip had the highest accuracy, precision, recall and f-measure, with accuracy value (83.92%), precision (0.845), recall (0.839), f-measure (0.839), and AUC (0.878) on Dataset 1, and accuracy value (87.08%), precision (0.872), recall (0.871), f-measure (0.871), and AUC (0.907) on Dataset 2; LR had the highest AUC among the baseline classifiers on both datasets (0.892 and 0.926). Although the margin (in terms of performance metric values) between the heterogeneous ensemble method and the baseline classifier models may not be statistically significant, the adverse effect of accepting such a predictive margin could be dangerous if single classifiers are used instead of ensemble methods.
Besides, these results give further credence to the application and adoption of ensemble methods for prediction processes, as ensemble methods have been proven to be better than single classifiers [6, 29].

Table 3. Experimental results of spam models on Dataset 1 with dimensionality reduction (IG)

| Classifiers | Accuracy (%) | Precision | Recall | F-Measure | AUC |
|---|---|---|---|---|---|
| NB+IG | 79.69 | 0.813 | 0.797 | 0.794 | 0.859 |
| IBK+IG | 82.33 | 0.824 | 0.823 | 0.823 | 0.880 |
| LR+IG | 81.24 | 0.816 | 0.812 | 0.812 | 0.874 |
| JRip+IG | 82.69 | 0.835 | 0.827 | 0.826 | 0.848 |
| Ensemble+IG | 85.04 | 0.850 | 0.847 | 0.850 | 0.918 |

Table 4. Experimental results of spam models on Dataset 2 with dimensionality reduction (IG)

| Classifiers | Accuracy (%) | Precision | Recall | F-Measure | AUC |
|---|---|---|---|---|---|
| NB+IG | 78.50 | 0.799 | 0.785 | 0.783 | 0.870 |
| IBK+IG | 79.48 | 0.795 | 0.795 | 0.795 | 0.824 |
| LR+IG | 81.84 | 0.818 | 0.815 | 0.815 | 0.892 |
| JRip+IG | 81.77 | 0.830 | 0.818 | 0.816 | 0.862 |
| Ensemble+IG | 88.79 | 0.875 | 0.878 | 0.881 | 0.942 |

Table 3 and Table 4 present the experimental results of spam models with one of the feature selection methods, Information Gain, which is a form of dimensionality reduction technique. This is intended to further improve the performance of the spam models (ensemble and base classifiers), as feature selection has been known to improve prediction models [48–51]. The heterogeneous ensemble method still outperforms the baseline classifiers on all performance metrics on both datasets. On Dataset 1, the heterogeneous ensemble had the highest accuracy value (85.04%), precision (0.85), recall (0.847), f-measure (0.85), and AUC (0.918). On Dataset 2, the heterogeneous ensemble had the highest accuracy value (88.79%), precision (0.875), recall (0.878), f-measure (0.881), and AUC (0.942).
There was a slight improvement in the accuracy values (+0.33; +0.87) and AUC values (+0.005; +0.009) of the heterogeneous ensemble method with IG when compared with the corresponding values without IG on Dataset 1 and Dataset 2 respectively. IG improved the predictive performance of the heterogeneous ensemble method.

Table 5. Experimental results of prediction models on Dataset 1 with dimensionality reduction (IG+PCA)

| Classifiers | Accuracy (%) | Precision | Recall | F-Measure | AUC |
|---|---|---|---|---|---|
| NB+IG+PCA | 80.77 | 0.819 | 0.808 | 0.806 | 0.868 |
| IBK+IG+PCA | 82.44 | 0.826 | 0.824 | 0.824 | 0.883 |
| LR+IG+PCA | 81.72 | 0.821 | 0.817 | 0.817 | 0.880 |
| JRip+IG+PCA | 82.44 | 0.834 | 0.824 | 0.823 | 0.852 |
| Ensemble+IG+PCA | 85.82 | 0.869 | 0.862 | 0.862 | 0.928 |

Table 6. Experimental results of prediction models on Dataset 2 with dimensionality reduction (IG+PCA)

| Classifiers | Accuracy (%) | Precision | Recall | F-Measure | AUC |
|---|---|---|---|---|---|
| NB+IG+PCA | 79.72 | 0.819 | 0.797 | 0.794 | 0.879 |
| IBK+IG+PCA | 79.46 | 0.795 | 0.795 | 0.795 | 0.824 |
| LR+IG+PCA | 80.52 | 0.808 | 0.805 | 0.805 | 0.885 |
| JRip+IG+PCA | 82.28 | 0.829 | 0.823 | 0.822 | 0.871 |
| Ensemble+IG+PCA | 89.18 | 0.885 | 0.828 | 0.881 | 0.957 |

Table 5 and Table 6 present the experimental results of the proposed framework (Ensemble+IG+PCA) in comparison with the base classifiers. In this experiment, the Ensemble+IG+PCA models outperformed the base classifier models on all performance metrics. On Dataset 1, the Ensemble+IG+PCA model had the highest accuracy value (85.82%), a precision score of 0.869, an F-Measure value of 0.862 and an AUC value of 0.928. A similar case was observed in the performance of the proposed framework on Dataset 2, where Ensemble+IG+PCA outperforms all other methods, as presented in Table 6. Figures 3 and 4 show the graphical illustration of the accuracy values of the Ensemble+IG+PCA model and the other models on Dataset 1 and Dataset 2 respectively.
Figures 5 and 6 present the remaining performance metric values (AUC, F-measure, precision, and recall) of the heterogeneous ensemble model and the other experimented methods on Dataset 1 and Dataset 2, respectively.

Fig. 3. Performance accuracies of all models on Dataset 1 [bar chart; y-axis: accuracy (%); x-axis: the 15 models from NB to Ensemble+IG+PCA]

Fig. 4. Performance accuracies of all models on Dataset 2 [bar chart; y-axis: accuracy (%); x-axis: the 15 models from NB to Ensemble+IG+PCA]

Fig. 5. Performances of the models on Dataset 1 [chart of AUC, F-measure, recall, and precision per model]

Fig. 6. Performances of the models on Dataset 2 [chart of AUC, F-measure, recall, and precision per model]

Table 7. Average experimental results for the spam models on Dataset 1 and Dataset 2

| Models | Average Accuracy (%) | Average Precision | Average Recall | Average F-Measure | Average AUC |
|---|---|---|---|---|---|
| NB | 70.37 | 0.757 | 0.703 | 0.681 | 0.8475 |
| IBK | 79.56 | 0.795 | 0.7955 | 0.7955 | 0.794 |
| LR | 83.87 | 0.841 | 0.8385 | 0.8385 | 0.909 |
| JRip | 85.5 | 0.858 | 0.855 | 0.855 | 0.8925 |
| Ensemble | 86.31 | 0.860 | 0.863 | 0.8625 | 0.923 |
| NB+IG | 79.09 | 0.806 | 0.791 | 0.7885 | 0.8645 |
| IBK+IG | 80.91 | 0.809 | 0.809 | 0.809 | 0.852 |
| LR+IG | 81.54 | 0.817 | 0.8135 | 0.8135 | 0.883 |
| JRip+IG | 82.23 | 0.832 | 0.8225 | 0.821 | 0.855 |
| Ensemble+IG | 86.91 | 0.862 | 0.8625 | 0.8655 | 0.9300 |
| NB+IG+PCA | 80.24 | 0.819 | 0.8025 | 0.800 | 0.8735 |
| IBK+IG+PCA | 80.95 | 0.810 | 0.8095 | 0.8095 | 0.8535 |
| LR+IG+PCA | 81.12 | 0.814 | 0.811 | 0.811 | 0.8825 |
| JRip+IG+PCA | 82.36 | 0.831 | 0.8235 | 0.8225 | 0.8615 |
| *Ensemble+IG+PCA | 87.50 | 0.877 | 0.845 | 0.8715 | 0.9425 |

Note: *Indicates proposed method.

Table 7 presents the performance metric values of the spam models averaged over Dataset 1 and Dataset 2. As shown in Table 7, NB as a single classifier produced an average accuracy of 70.37%, which improved to 79.09% with the IG feature selection method (NB+IG) and increased slightly further to 80.24% when combined with both IG and PCA (NB+IG+PCA). The IBK classifier produced an average accuracy of 79.56%, which increased to 80.91% with IG (IBK+IG) and marginally to 80.95% with IBK+IG+PCA. For the LR algorithm, the initial average accuracy was 83.87%, but it dropped to 81.54% when combined with IG (LR+IG) and dropped further to 81.12% when PCA feature extraction was added (LR+IG+PCA). The JRip algorithm had an initial average accuracy of 85.5%, which dropped to 82.23% with the IG feature selection technique and rose slightly to 82.36% when PCA feature extraction was added. Finally, the ensemble method had an initial average accuracy of 86.31%, which increased to 86.91% when combined with IG feature selection. The proposed framework (Ensemble+IG+PCA) had the highest average accuracy (87.5%), average precision (0.877), average F-measure (0.8715), and average AUC (0.9425). Its average recall (0.845) was lower than that of the ensemble without dimensionality reduction (0.863).

Table 8.
Performance comparison of the proposed method with existing methods on Dataset 1

| Method | Accuracy (%) | Precision | Recall | F-Measure | AUC |
|---|---|---|---|---|---|
| *Ensemble+IG+PCA | 85.82 | 0.869 | 0.862 | 0.862 | 0.928 |
| Dutta, et al. [36] | 81.04 | – | – | 0.809 | – |

Note: *Indicates proposed method.

Table 9. Performance comparison of the proposed method with existing methods on Dataset 2

| Method | Accuracy (%) | Precision | Recall | F-Measure | AUC |
|---|---|---|---|---|---|
| *Ensemble+IG+PCA | 89.18 | 0.885 | 0.828 | 0.881 | 0.957 |
| Costa, et al. [7] | 87.8 | – | – | 0.873 | – |
| Agrawal and Velusamy [52] | 82.5 | – | – | – | – |

Note: *Indicates proposed method.

Furthermore, Tables 8 and 9 compare the proposed method (Ensemble+IG+PCA) with some recent approaches from existing studies on Dataset 1 and Dataset 2. The proposed method outperformed these approaches on the metrics they report (accuracy and F-measure). This indicates that the proposed method can detect spam messages more effectively than some existing methods.

5 Conclusions and future works

This study proposed a machine-learning-based spam message detection framework that combines machine learning classifiers (KNN, LR, RIPPER, and NB), dimensionality reduction (feature selection with IG and feature extraction with PCA), and an ensemble method (the AOP technique). Specifically, a spam message detection framework based on a heterogeneous ensemble and a combination of dimensionality reduction techniques was proposed and implemented. The experimental results show that removing redundant and irrelevant features from spam datasets using a hybrid feature selection and feature extraction method, in conjunction with the heterogeneous ensemble method, provides an effective method for detecting social spam content.
These results indicate that better methods for spam detection can be developed by addressing underlying issues such as the high dimensionality of datasets. Consequently, it is recommended that further studies combine other dimensionality reduction techniques and other forms of ensemble methods to provide generalizable social spam message detection models with effective detection rates.

6 References

[1] D. V. Dimitrova and J. Matthes, “Social media in political campaigning around the world: Theoretical and methodological challenges,” SAGE Publications, Los Angeles, CA, 2018. https://doi.org/10.1177/1077699018770437
[2] H. Shen and X. Liu, “Detecting spammers on Twitter based on content and social interaction,” presented at the 2015 International Conference on Network and Information Systems for Computers, 2015. https://doi.org/10.1109/ICNISC.2015.82
[3] M. Singh, A. Singh, D. Bansal, and S. Sofat, “An analytical model for identifying suspected users on Twitter,” Cybernetics and Systems, vol. 50, no. 4, pp. 383–404, 2019. https://doi.org/10.1080/01969722.2019.1588968
[4] M. Almseidin, A. A. Zuraiq, M. Al-Kasassbeh, and N. Alnidami, “Phishing detection based on machine learning and feature selection methods,” International Journal of Interactive Mobile Technologies, vol. 13, no. 12, 2019. https://doi.org/10.3991/ijim.v13i12.11411
[5] L. F. Hussein, A. B. Aissa, I. A. Mohamed, S. Alruwaili, and A. Alanzi, “Development of a secured vehicle spot detection system using GSM,” International Journal of Interactive Mobile Technologies, vol. 15, no. 4, 2021.
[6] K. S. Adewole, N. B. Anuar, A. Kamsin, and A. K. Sangaiah, “SMSAD: a framework for spam message and spam account detection,” Multimedia Tools and Applications, vol. 78, no. 4, pp. 3925–3960, 2019. https://doi.org/10.1007/s11042-017-5018-x
[7] H. Costa, F. Benevenuto, and L. H. Merschmann, “Detecting tip spam in location-based social networks,” presented at the Proceedings of the 28th Annual ACM Symposium on Applied Computing, 2013. https://doi.org/10.1145/2480362.2480501
[8] D. Ibrahim and N. Alruhaily, “Anomaly detection in wireless sensor networks: a proposed framework,” International Journal of Interactive Mobile Technologies, vol. 14, no. 10, 2020. https://doi.org/10.3991/ijim.v14i10.14261
[9] A. Odeh, I. Keshta, and E. Abdelfattah, “Efficient detection of phishing websites using multilayer perceptron,” International Journal of Interactive Mobile Technologies, vol. 14, no. 11, 2020. https://doi.org/10.3991/ijim.v14i11.13903
[10] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida, “Detecting spammers on Twitter,” presented at the Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS), 2010.
[11] H. Nguyen, “2013 state of social media spam,” NexGate, USA, 2013. http://nexgate.com/wpcontent/uploads/2013/09/Nexgate-2013-State-of-Social-Media-Spam-Research-Report.pdf
[12] K. S. Adewole, N. B. Anuar, A. Kamsin, K. D. Varathan, and S. A. Razak, “Malicious accounts: dark of the social networks,” Journal of Network and Computer Applications, vol. 79, pp. 41–67, 2017. https://doi.org/10.1016/j.jnca.2016.11.030
[13] X. Zheng, Z. Zeng, Z. Chen, Y. Yu, and C. Rong, “Detecting spammers on social networks,” Neurocomputing, vol. 159, pp. 27–34, 2015. https://doi.org/10.1016/j.neucom.2015.02.047
[14] A. Barushka and P.
Hajek, “Spam filtering in social networks using regularized deep neural networks with ensemble learning,” presented at the IFIP International Conference on Artificial Intelligence Applications and Innovations, 2018. https://doi.org/10.1007/978-3-319-92007-8_4
[15] M. Abulaish and S. Y. Bhat, “Classifier ensembles using structural features for spammer detection in online social networks,” Foundations of Computing and Decision Sciences, vol. 40, no. 2, pp. 89–105, 2015. https://doi.org/10.1515/fcds-2015-0006
[16] S. Y. Bhat, M. Abulaish, and A. A. Mirza, “Spammer classification using ensemble methods over structural social network features,” presented at the Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)-Volume 02, 2014. https://doi.org/10.1109/WI-IAT.2014.133
[17] A. G. Akintola, A. O. Balogun, F. Lafenwa-Balogun, and H. A. Mojeed, “Comparative analysis of selected heterogeneous classifiers for software defects prediction using filter-based feature selection methods,” FUOYE Journal of Engineering and Technology, vol. 3, no. 1, pp. 133–137, 2018. https://doi.org/10.46792/fuoyejet.v3i1.178
[18] A. O. Ameen, A. O. Balogun, G. Usman, and G. S. Fashoto, “Heterogeneous ensemble methods based on filter feature selection,” Computing, Information Systems, Development Informatics & Allied Research Journal, vol. 7, no. 4, pp. 63–78, 2016.
[19] P. Petrov, S. Ivanov, P. Dimitrov, G. Dimitrov, and O. Bychkov, “Projects management in technology start-ups for mobile software development,” International Journal of Interactive Mobile Technologies, vol. 15, no. 7, 2021. https://doi.org/10.3991/ijim.v15i07.19291
[20] C. Grier, K. Thomas, V. Paxson, and M. Zhang, “@ spam: the underground on 140 characters or less,” presented at the Proceedings of the 17th ACM conference on Computer and communications security, 2010. https://doi.org/10.1145/1866307.1866311
[21] F. Ahmed and M.
Abulaish, “An MCL-based approach for spam profile detection in online social networks,” presented at the 2012 IEEE 11th International Conference on Trust, Security and Privacy in Computing and Communications, 2012. https://doi.org/10.1109/TrustCom.2012.83
[22] S. Ghosh et al., “Understanding and combating link farming in the Twitter social network,” presented at the Proceedings of the 21st international conference on World Wide Web, 2012. https://doi.org/10.1145/2187836.2187846
[23] Z. Miller, B. Dickinson, W. Deitrick, W. Hu, and A. H. Wang, “Twitter spammer detection using data stream clustering,” Information Sciences, vol. 260, pp. 64–73, 2014. https://doi.org/10.1016/j.ins.2013.11.016
[24] A. O. Balogun, S. Basri, S. J. Abdulkadir, V. E. Adeyemo, A. A. Imam, and A. O. Bajeh, “Software defect prediction: analysis of class imbalance and performance stability,” Journal of Engineering Science and Technology, vol. 14, no. 6, pp. 3294–3308, 2019.
[25] T. M.
Mitchell, “Generative and discriminative classifiers: Naive Bayes and logistic regression,” Machine Learning, pp. 1–17, 2010.
[26] J. Han and M. Kamber, “Data Mining: Concepts and Techniques,” 2012.
[27] P. Cunningham and S. J. Delany, “k-Nearest neighbour classifiers,” Multiple Classifier Systems, vol. 34, no. 8, pp. 1–17, 2007.
[28] M. A. Mabayoje, A. O. Balogun, H. A. Jibril, J. O. Atoyebi, H. A. Mojeed, and V. E. Adeyemo, “Parameter tuning in KNN for software defect prediction: an empirical analysis,” Jurnal Teknologi dan Sistem Komputer, vol. 7, no. 4, pp. 121–126, 2019. https://doi.org/10.14710/jtsiskom.7.4.2019.121-126
[29] A. O. Balogun, A. M. Balogun, P. O. Sadiku, and V. E. Adeyemo, “Heterogeneous ensemble models for generic classification,” Scientific Annals of Computer Science, vol. 15, no. 1, pp. 92–98, 2017.
[30] Y. K. Jain, “Upendra: an efficient intrusion detection based on decision tree classifier using feature reduction,” International Journal of Scientific and Research Publications, vol. 2, no. 1, 2012.
[31] P. Ranganathan, C. Pramesh, and R. Aggarwal, “Common pitfalls in statistical analysis: logistic regression,” Perspectives in Clinical Research, vol. 8, no. 3, p. 148, 2017. https://doi.org/10.4103/picr.PICR_123_17
[32] A. Jain and L. Bhupendra, “Classifier selection models for intrusion detection system (IDS),” Informatics Engineering, an International Journal (IEIJ), vol. 4, no. 1, pp. 1–11, 2016.
[33] R. d. O. Santos, B. M. Gorgulho, M. A. d. Castro, R. M. Fisberg, D. M. Marchioni, and V. T. Baltar, “Principal component analysis and factor analysis: differences and similarities in nutritional epidemiology application,” Revista Brasileira de Epidemiologia, vol. 22, p. e190041, 2019. https://doi.org/10.1590/1980-549720190041
[34] P. E. Jorgensen, S. Kang, M.-S. Song, and F. Tian, “Dimension reduction and kernel principal component analysis,” arXiv preprint arXiv:1906.06451, 2019.
[35] S.
Raschka, “Implementing a Principal Component Analysis (PCA) in Python step by step,” 2014.
[36] S. Dutta, S. Ghatak, R. Dey, A. K. Das, and S. Ghosh, “Attribute selection for improving spam classification in online social networks: a rough set theory-based approach,” Social Network Analysis and Mining, vol. 8, no. 1, p. 7, 2018. https://doi.org/10.1007/s13278-017-0484-8
[37] H. Costa, L. H. Merschmann, F. Barth, and F. Benevenuto, “Pollution, bad-mouthing, and local marketing: the underground of location-based social networks,” Information Sciences, vol. 279, pp. 123–137, 2014. https://doi.org/10.1016/j.ins.2014.03.108
[38] K. S. Adewole, T. Han, W. Wu, H. Song, and A. K. Sangaiah, “Twitter spam account detection based on clustering and classification methods,” The Journal of Supercomputing, vol. 76, no. 7, pp. 4802–4837, 2020. https://doi.org/10.1007/s11227-018-2641-x
[39] S. Whalen and G. Pandey, “A comparative analysis of ensemble classifiers: case studies in genomics,” presented at the 2013 IEEE 13th International Conference on Data Mining, 2013. https://doi.org/10.1109/ICDM.2013.21
[40] F. Salo, A. B. Nassif, and A. Essex, “Dimensionality reduction with IG-PCA and ensemble classifier for network intrusion detection,” Computer Networks, vol. 148, pp. 164–175, 2019. https://doi.org/10.1016/j.comnet.2018.11.010
[41] P. Nskh, M.
N. Varma, and R. R. Naik, “Principle component analysis based intrusion detection system using support vector machine,” presented at the 2016 IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), 2016. https://doi.org/10.1109/RTEICT.2016.7808050
[42] A. O. Balogun et al., “Rank Aggregation Based Multi-filter Feature Selection Method for Software Defect Prediction,” in International Conference on Advances in Cyber Security, 2020, pp. 371–383: Springer. https://doi.org/10.1007/978-981-33-6835-4_25
[43] Y. A. Alsariera, V. E. Adeyemo, A. O. Balogun, and A. K. Alazzawi, “AI meta-learners and extra-trees algorithm for the detection of phishing websites,” IEEE Access, vol. 8, pp. 142532–142542, 2020. https://doi.org/10.1109/ACCESS.2020.3013699
[44] Y. A. Alsariera, A. V. Elijah, and A. O. Balogun, “Phishing website detection: forest by penalizing attributes algorithm and its enhanced variations,” Arabian Journal for Science and Engineering, pp. 1–12, 2020. https://doi.org/10.1007/s13369-020-04802-1
[45] A. O. Balogun et al., “Empirical analysis of rank aggregation-based multi-filter feature selection methods in software defect prediction,” Electronics, vol. 10, no. 2, p. 179, 2021. https://doi.org/10.3390/electronics10020179
[46] V. E. Adeyemo, A. O. Balogun, H. A. Mojeed, N. O. Akande, and K. S. Adewole, “Ensemble-Based Logistic Model Trees for Website Phishing Detection,” in International Conference on Advances in Cyber Security, 2020, pp. 627–641: Springer. https://doi.org/10.1007/978-981-33-6835-4_41
[47] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009. https://doi.org/10.1145/1656274.1656278
[48] M. A. Mabayoje, A. O. Balogun, A. O. Bajeh, and B. A.
Musa, “Software defect prediction: effect of feature selection and ensemble methods,” FUW Trends in Science & Technology Journal, vol. 3, no. 2, pp. 518–522, 2018.
[49] A.-B. Verónica Bolón, M. Amparo, and C. N. Sánchez, Feature Selection for High-Dimensional Data (Artificial Intelligence: Foundations, Theory, and Algorithms). Springer, 2017.
[50] M. A. Mabayoje, A. O. Balogun, S. M. Bello, J. O. Atoyebi, H. A. Mojeed, and A. H. Ekundayo, “Wrapper feature selection based heterogeneous classifiers for software defect prediction,” Adeleke University Journal of Engineering and Technology, vol. 2, no. 1, pp. 1–11, 2019.
[51] A. O. Balogun et al., “Impact of feature selection methods on the predictive performance of software defect prediction models: an extensive empirical study,” Symmetry, vol. 12, no. 7, p. 1147, 2020. https://doi.org/10.3390/sym12071147
[52] M. Agrawal and R. L. Velusamy, “PRISMO: priority based spam detection using multi optimization,” presented at the International Conference on Big Data Analytics, 2018. https://doi.org/10.1007/978-3-030-04780-1_27

7 Authors

Abdulfatai Ganiyu Oladepo is an IT Service Management practitioner with a keen interest in Data Science, Machine Learning, and IT Project Management. He can be reached via his email address (abdulfataig@gmail.com).
Amos Orenyi Bajeh holds BSc and MSc degrees in Computer Science from the University of Ilorin, where he is currently a Senior Lecturer in the Department of Computer Science. He has a PhD in Information Technology from Universiti Teknologi PETRONAS. His research interests are software measurement, software maintenance, machine learning, and fuzzy inference systems. He can be reached via his email address (bajehamos@unilorin.edu.ng).

Abdullateef Oluwagbemiga Balogun received his B.Sc. and M.Sc. degrees in Computer Science from the University of Ilorin, Nigeria. He is currently pursuing a PhD in Information Technology at the Universiti Teknologi PETRONAS, Perak, Malaysia. He is a member of the academic staff of the Department of Computer Science, Faculty of Communication and Information Sciences, University of Ilorin, Nigeria. His research interests include Search-Based Software Engineering, Software Quality Assurance, Machine Learning, and Data Science. He can be reached via his email address (balogun.ao1@unilorin.edu.ng).

Hammed Adeleye Mojeed is a Lecturer in the Department of Computer Science, University of Ilorin, Ilorin, Nigeria.
He received a Master of Science in Computer Science with distinction from the University of Ilorin, Ilorin, Nigeria in 2019, a Diploma in Computer Networking from SIIT Global, New Delhi, India in 2014, and a Bachelor of Science in Computer Science with First Class Honors from the University of Ilorin, Ilorin, Nigeria in 2013. His research interests are in Empirical Search-Based Software Engineering, Software Project Planning and Management, Machine Learning, Optimization, and Text Mining. He has authored or co-authored over 20 publications in reputable outlets. He is a member of the IEEE Nigeria Computer Chapter and a Graduate Member of Computer Professionals of Nigeria (GMCPN). He can be reached via his email address (mojeed.ha@unilorin.edu.ng).

Abdulsalam Abiodun Salman is an Associate Professor and Head of the Department of Library and Information Science, Faculty of Communication and Information Sciences, University of Ilorin, Ilorin, Nigeria. He can be reached via his email address (salman.aa@unilorin.edu.ng).

Abdullateef Iyanda Bako is an Associate Professor and Dean of the Faculty of Environmental Sciences, University of Ilorin, Ilorin, Nigeria. He can be reached via his email address (bako.ai@unilorin.edu.ng).

Article submitted 2020-11-18. Resubmitted 2021-07-02. Final acceptance 2021-07-03. Final version published as submitted by the authors.