J. Nig. Soc. Phys. Sci. 3 (2021) 298–307

Journal of the Nigerian Society of Physical Sciences

Optimized Breast Cancer Classification using Feature Selection
and Outliers Detection

A. B. Yusuf^a, R. M. Dima^b,∗, S. K. Aina^c

^a Department of Information and Communication Technology, Usmanu Danfodiyo University, Sokoto State
^b Department of Computer Science, Federal University Dutsinma, Katsina State
^c Department of Computer Science, Federal University Gashua, Yobe State

Abstract

Breast cancer is the second most commonly diagnosed cancer in women throughout the world. It is on the rise, especially in developing countries,
where the majority of cases are discovered late. Breast cancer develops when cancerous tumors form on the surface of the breast cells. The
absence of accurate prognostic models to help physicians recognize symptoms early makes it difficult to develop a treatment plan that would help
patients live longer. However, machine learning techniques have recently been used to improve the accuracy and speed of breast cancer diagnosis.
The more accurate the model, the more useful it is for breast cancer diagnosis. Nevertheless, the primary
difficulty for systems developed to detect breast cancer using machine learning models is attaining the greatest classification accuracy and picking
the most predictive features for increasing accuracy. As a result, breast cancer prognosis remains a challenge. This research
seeks to address a flaw in an existing technique that is unable to enhance the classification of continuous-valued data, particularly its accuracy and the
selection of optimal features for breast cancer prediction. To address these issues, this study examines the impact of outliers and feature
reduction on the Wisconsin Diagnostic Breast Cancer Dataset, which was tested using seven different machine learning algorithms. The results
show that the Logistic Regression, Random Forest, and Adaboost classifiers achieved the greatest accuracy of 99.12% on removal of outliers from
the dataset. The filtered dataset with feature selection, in turn, achieved accuracies of 100% and 99.12% with the Random Forest
and Gradient Boost classifiers, respectively. When compared to other state-of-the-art approaches, the two suggested strategies outperformed the
unfiltered data in terms of accuracy. The suggested architecture might be a useful tool for radiologists to reduce the number of false negatives and
positives, and thereby increase the efficiency of breast cancer diagnosis.

DOI:10.46481/jnsps.2021.331

Keywords: Breast cancer, Machine learning, Accuracy, Feature selection and outliers

Article History :
Received: 05 August 2021
Received in revised form: 10 October 2021
Accepted for publication: 01 November 2021
Published: 29 November 2021

© 2021 Journal of the Nigerian Society of Physical Sciences. All rights reserved.
Communicated by: T. Latunde

∗Corresponding author.
Email address: rocinta976@gmail.com (R. M. Dima)

1. Introduction

Cancer is a group of disorders characterized by the growth
of abnormal cells that have the ability to infiltrate or spread
throughout the body [1]. Breast cancer is the world’s second
most common disease and public health concern, especially
among women, with high mortality and substantial morbidity[2].
Breast cancer is on the rise in emerging countries, and a five-
year study indicated that it is the most common malignancy[3].
According to the American Cancer Society, 93,600 new instances
of breast cancer are diagnosed each year in Africa, with
around 50,000 fatalities. Breast cancer was diagnosed in 2.3
million women, resulting in 685,000 deaths in 2020, according
to the World Health Organization (WHO, 2021).

It is a disease characterized by aberrant cell proliferation
in the breast[4], which is caused largely by DNA mutations.
Breast cancer tumors are classified as either malignant or be-
nign [5]. This classification is applied in the analysis of breast
tumors, lumps, or any other abnormal development in the breast
tissue. Cancer that is classified as benign is typically not life-
threatening and has a greater chance of survival, whereas cancer
that is classified as malignant is life-threatening [6]. A malig-
nant tumor can develop fast, infiltrating the lymph system and
encroaching on other healthy tissues in the surrounding area,
causing disastrous effects; on the other hand, a benign tumor
cannot grow beyond a specific size and remains confined in-
side its bulk. Early cancer identification guarantees successful
therapy and enhances the likelihood of survival [7].

Scientists have attempted to pinpoint the specific cause of
breast cancer since there are only a few risk factors that pro-
mote a woman’s chances of developing the disease. Breast can-
cer risk factors include age, genetic risk, family history, obesity,
gene variation, smoking and alcohol consumption. Due to the
small size of the cancer cell as seen from the outside, it is nearly
impossible to identify breast cancer in its early stages. Mammography,
ultrasound [8], dynamic MRI [9], and elastography are the
only ways to detect cancer at an early stage[10]. In many cases,
clinicians would be required to read a large amount of imag-
ing data, which would compromise accuracy. This method is
extremely time-consuming and, in some cases, it incorrectly di-
agnoses the cancer.

Medical professionals continue to make this sort of diag-
nosis in order to see which one has the most impact. In re-
cent years, however, machine learning (ML) [11]–[14], deep
learning [15], [16], and bio-inspired computing [17] approaches
have been employed in a variety of medical diagnoses. Ma-
chine learning in the detection of breast cancer has been the
subject of several studies [18]–[21]. Several studies use differ-
ent datasets collected from the University of California-Irvine
(UCI) repository for clinical prediction of this disease. Among
these are the Wisconsin Breast Cancer Dataset (WBCD), Wisconsin
Diagnostic Breast Cancer (WDBC) and Wisconsin Prognostic
Breast Cancer (WPBC) datasets, to mention a few. Regardless
of the nature of the dataset, the focus of the research is
always on enhancing the accuracy of the prediction in order
to correctly diagnose the cancer. However, despite the popularity
of ML algorithms proven on different breast
cancer datasets, they still cannot offer accurate and consistent
outcomes in diagnosis unless improved with some data mining
techniques [22]. Among the popular datasets is WDBC, which has a
continuous-valued data problem [23]; hence finding the linearity
among the features poses a difficulty [24], [25], leading
to poor accuracy with some algorithms.

Recently, in the research work carried out on WDBC dataset,
study and analysis of ML algorithms were reported along with
the approaches used in improving the performance of ML algorithms
[25]–[27]. In their work, [25] proposed a method using
clustering and noise removal on the WDBC dataset before it
was applied on some ML algorithms. In the research, Expec-
tation Maximization (EM) was used for data clustering, Classi-
fication and Regression Trees (CART) automatically generated
the fuzzy rules from the data hence causing removal of noise
while the Principal Component Analysis (PCA) as a dimen-
sionality reduction technique was used to overcome the multi-
collinearity issue in the data. The proposed technique was eval-
uated with WDBC and Mammographic mass datasets; then its
effectiveness was demonstrated. When compared to PCA-Support
Vector Machine (PCA-SVM), PCA-K Nearest Neighbour (PCA-KNN)
and Decision Tree (DT), EM-PCA-CART-Fuzzy Rule-Based
had the greatest accuracy of 93.2%, whereas PCA-SVM,
PCA-KNN and DT had an accuracy of 86.7%, 82.3% and 92.9%,
respectively.

In order to improve classification accuracy of breast can-
cer disease, [27] applied preprocessing step on WDBC dataset.
Here, features were selected using gain ratio and modeled with
six algorithms using 10-fold cross-validation method. The ac-
curacy of these algorithms was: SVM-Linear (98.07%), KNN
at k=3 (97.36%), Naïve Bayes (95.08%), J48 (98.07%), Multilayer
Perceptron (98.41%) and Random Forest (98.77%). The
results demonstrated a considerable improvement above the state-
of-the-art since RF performed the best.

Similarly, [26] focused on integrating ML algorithms with
different feature selection methods and compared their performances
to identify the most suitable approach. The feature selection
methods were Correlation based Feature Selection (CFS), Recursive
Feature Elimination (RFE), Linear Discriminant Analysis
(LDA) and PCA. The ML algorithms tested were SVM
(using radial basis kernel), Neural Networks (NN) and Naïve
Bayes (NB), carried out on the WDBC dataset. It was observed that
SVM-LDA combination and NN-LDA combination obtained
the best performance in terms of accuracy (98.82%).

In the comparative study carried out by [18] to determine
how to improve classification algorithms on the WDBC dataset,
the investigation was conducted with different levels of cross-validation
and different percentages for splitting the training dataset. When the NB,
J48, Random Forest (RF), SMO and Multilayer Perceptron algorithms
were trained with 85.5% of the dataset at 10-fold cross-validation,
the accuracies were 97.28% for NB, 94.27% for J48,
95.56% for RF, 96.13% for SMO and 96.13% for Multilayer
Perceptron. NB, having the highest accuracy,
was further investigated with 5-, 10- and 15-fold cross-validation at
66.6% and 85.5% splits. The result improved to 99% accuracy
when trained with the 85.5% training set. This improvement
may, however, be a result
of overfitting to the training set.

In a similar investigation conducted on three distinct datasets
(WBCD, WDBC and Coimbra) by [23], the study proposed a
fuzzy technique for improving the ML algorithms. To
resolve the limitation of an existing method, in which the ID3 algorithm
was unable to classify continuous-valued data, and to
increase the classification accuracy of the decision tree, the
FUZZY-DBD method, an automatic fuzzy database technique, was used to design
the fuzzy database for fuzzification of the data in the FID3 algorithm.
It was used to generate a predefined fuzzy database
before the generation of the fuzzy rule base. The fuzzified
dataset was applied to fuzzy-ID3 algorithm. The accuracy of
fuzzy-ID3 applied to fuzzified dataset was 94.362% when com-
pared with non-fuzzy WDBC dataset applied to ID3 (91.059%),
SVM (86.1%), C4.5(92.97%), NB(91.81%), RF(91.66%) and
KNN(92.57%).

The study conducted by [28] in order to improve accuracy
of ML algorithms, investigated the best features suitable for
WDBC dataset. Light Gradient Boosting Model (LGBM), Cat-
boost, and Extreme Gradient Boosting (XGB) were applied as
the feature selection approaches tested on Naive Bayes algo-
rithm. The findings revealed best accuracy with LGBM (97%),
followed by Catboost (96%) and XGB (96%).

Again, [29] experimented with feature selection techniques
of Correlation based Feature Selection(CFS), univariate selec-
tion (selectKBest), and Recursive Feature Elimination (RFE).
The RF was applied on these feature selection methods and
evaluated on the WDBC dataset. It was shown that when 5 features
were picked, the RF model had the best accuracy with
CFS (95.32%), compared with selectKBest (94.15%) and RFE (94.15%).

As reviewed from the literature on the state-of-the-art approaches
employed, the problem of multicollinearity in the WDBC
dataset still exists, and to the best of our knowledge, none
of the existing work has investigated the presence of outliers
in the WDBC dataset. In addition, this research investigates
the extent to which the removal of outliers, when combined
with a feature selection method, can improve the accuracy of
otherwise weak algorithms. Hence, this study analyzes the improvement of
ML algorithms for detecting the disease based on outlier detection
and Pearson correlation-based
feature selection. The goal of this research was to create an
optimal model that would fill this knowledge gap. In this study,
the characteristics of the Wisconsin Diagnostic Breast Cancer
(WDBC) dataset were examined in depth for the presence or
absence of outliers. When outliers were discovered, some of
the instances were dropped. The filtered dataset was further re-
fined as the classification system’s inputs using Correlation Fea-
ture Selection (CFS). The Pearson correlation technique was
used to find the appropriate continuous features and their as-
sociated weight (importance) in order to discover variables that
are relevant for prediction. This method assists in the resolution
of overfitting and underfitting difficulties in ML. The accuracy
was used to evaluate the performance of the unfiltered (con-
ventional technique), filtered (outliers approach), and Outliers
Correlation Feature Selection (OCFS) datasets. The findings
were assessed and compared using seven classifiers: Logistic
Regression (LR), K-Nearest Neighbor (KNN), Support Vector
Machines (SVM), Decision Tree (DT), Random Forest (RF),
Gradient Boost (GB), and Adaboost (AB).

2. Materials and Method

The research architecture for predicting the presence of breast
cancer disease is shown in Figure 1. The approach includes
acquiring a breast cancer disease dataset and preprocessing it to
remove missing values and outliers. Then, using the
processed dataset, an algorithm to discover strongly correlated
features was applied, and the results were fed into ML techniques
to predict whether a patient had a Benign or Malignant
tumor. Finally, the outcomes were compared using a performance
score based on the confusion matrix.

2.1. Data Description

The data for this study is acquired from the UCI repository.
This dataset, identified as the WDBC dataset, has 569 cases that
are either Benign or Malignant. Of these, 357 cases
(62.74%) are Benign and 212 cases (37.26%) are Malignant.
The distribution of the number of Benign and Malignant classes
in the dataset is displayed in Figure 2.

The dataset contains 33 attributes: the class label (diagnosis:
B = Benign, M = Malignant), id, and 31 real-valued attributes.
These attributes are derived from a digitized image
of a biopsy procedure for a breast mass and are used to describe
the characteristics of the cell nuclei in the image. The
WDBC dataset is built on ten real-valued features of the
cell nucleus: radius, texture, perimeter, area, smoothness,
compactness, concavity, concave points, symmetry, and fractal
dimension. For each of these features, the mean, standard error,
and worst value were computed, resulting in a total
of 30 attributes. The attributes of the WDBC dataset, as well as
their datatypes, are listed in Table 1.

The unique id numbers of the instances and the accompanying
class label (diagnosis: M = Malignant, B = Benign) are stored in
the first two columns of the dataset, respectively. Columns 3-32
contain 30 real-valued features derived from digitized images
of cell nuclei, which can be used to create a model to predict
whether a tumor is benign (i.e., non-cancerous) or malignant (i.e.,
cancerous).

2.2. Data Pre-Processing

Cleaning and modification of the dataset are required before
applying ML algorithms; pre-processing the data is therefore a necessary
step. Performance and accuracy of the
predictive model are not only affected by the algorithms used
but also by the quality of the dataset and pre-processing. The
phases of pre-processing used in this investigation are as fol-
lows:

2.2.1. Missing Values Checking
The dataset contains 569 instances of 33 variables. However,
it was discovered that the variable id had no effect on the
dataset description or on disease prediction because it merely
keeps a serial record of the instances. As a result, the dataset’s
id feature was removed. Additionally, while conducting further
preprocessing operations on the dataset, it was discovered
that the last feature, unnamed:32, was null for
all instances. This might be a mistake in the data collection
process; because of this, the feature was also removed from the
dataset.


Figure 1. Proposed Architecture

Table 1. Attributes and their Description on WDBC Dataset
S/n  Attributes               Datatypes   S/n  Attributes               Datatypes
1    Id                       numeric     18   compactness se           numeric
2    Diagnosis                nominal     19   concavity se             numeric
3    radius mean              numeric     20   concave points se        numeric
4    texture mean             numeric     21   symmetry se              numeric
5    perimeter mean           numeric     22   fractal dimension se     numeric
6    area mean                numeric     23   radius worst             numeric
7    smoothness mean          numeric     24   texture worst            numeric
8    compactness mean         numeric     25   perimeter worst          numeric
9    concavity mean           numeric     26   area worst               numeric
10   concave points mean      numeric     27   smoothness worst         numeric
11   symmetry mean            numeric     28   compactness worst        numeric
12   fractal dimension mean   numeric     29   concavity worst          numeric
13   radius se                numeric     30   concave points worst     numeric
14   texture se               numeric     31   symmetry worst           numeric
15   perimeter se             numeric     32   fractal dimension worst  numeric
16   area se                  numeric     33   unnamed:32               numeric
17   smoothness se            numeric

2.2.2. Encoding data
The performance of machine learning models depends on various
aspects. One element that influences the performance of the models
is the method used to prepare the data and feed it to the model.
As such, a vital step is encoding, which turns categorical
variables into a numeric form understood by ML models. Encoding data
elevates model quality and helps in feature engineering. The
class label "diagnosis" was expressed as strings (B = Benign,
M = Malignant). This categorical attribute must be converted
to a finite set of numbers so that ML algorithms can process it.
Label encoding was used to encode the diagnosis values in this
study, and the result was (M = 1, B = 0).
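
The label encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the mapping (M = 1, B = 0) comes from the text, while the helper name `encode_diagnosis` and the sample labels are hypothetical and not taken from the study's code.

```python
# Mapping stated in the text: Malignant -> 1, Benign -> 0.
DIAGNOSIS_CODES = {"M": 1, "B": 0}

def encode_diagnosis(labels):
    """Map each string label to its integer code; fail loudly on unknown labels."""
    try:
        return [DIAGNOSIS_CODES[label] for label in labels]
    except KeyError as exc:
        raise ValueError(f"unexpected diagnosis label: {exc.args[0]!r}") from exc

# Hypothetical sample of diagnosis values (not real WDBC rows).
sample = ["M", "B", "B", "M", "B"]
encoded = encode_diagnosis(sample)
print(encoded)
```

Rejecting unknown labels, rather than silently assigning them a code, keeps a malformed row from being treated as a valid class.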

2.2.3. Outliers Checking
An outlier is an observation that deviates from a
distribution’s overall pattern. If a few data points are significantly
different from, or outside the range of, the main trend, they are termed outliers.
Outliers skew the distribution, affecting its mean and standard
deviation. As shown in Figure 3, outliers are present in the dataset.
They were therefore
identified in their respective features and eliminated.
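
The text does not state which rule was used to flag outliers. As one common possibility, the sketch below applies Tukey's 1.5 × IQR fences to a single feature and drops the rows that fall outside them. The rule, the multiplier `k`, the column name `area_mean` and the toy values are all assumptions made for illustration.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) Tukey fences for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outlier_rows(rows, column, k=1.5):
    """Keep only the rows whose value in `column` lies within the fences."""
    lower, upper = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if lower <= row[column] <= upper]

# Toy data with one obviously extreme value (hypothetical, not WDBC values).
rows = [{"area_mean": v} for v in (10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 100.0)]
filtered = drop_outlier_rows(rows, "area_mean")
print(len(rows), "->", len(filtered))
```

Applying the same function once per affected feature would drop every instance that is extreme in at least one of them.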

2.2.4. Data Transformation
Data must be normalized or standardized before ML algorithms
can be applied. The data is standardized to have a mean (µ)
of 0 and a standard deviation (σ) of 1. Equation 1 gives the
conversion formula:

$$X' = \frac{X - \mu}{\sigma} \qquad (1)$$

where X is the original value, X' is the standardized value, and µ and σ
are the mean and standard deviation of the feature.

Figure 2. Dataset’s Class Level Showing Malignant and Benign in
WDBC Dataset

Figure 3. Plots of data points to show presence of outliers
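
Equation 1 can be illustrated with a short standard-library sketch. It assumes the population standard deviation is used for σ (the text does not say whether the population or sample form was applied), and the feature values are made up.

```python
import statistics

def standardize(values):
    """Apply (x - mean) / std to every value; std is the population form."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(x - mu) / sigma for x in values]

# Hypothetical feature column with mean 5 and population std 2.
radius = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(radius)
print(z)
```

In practice the mean and standard deviation would be estimated on the training data only and then reused for the test data, so that no test information leaks into the model.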

2.3. Dimension Reduction
Dimension reduction is defined as the mapping of data
to a lower-dimensional space by removing irrelevant variance
in the data, resulting in the detection of a subspace in which the
data exist [30]. Feature extraction and feature selection are two
types of dimension reduction [30]. Feature extraction derives a
smaller set of new features from the original ones, whereas feature
selection is the process of identifying and discarding irrelevant,
less relevant or duplicated features in a dataset. When creating
strong learning models, feature selection may be used to detect
and remove as much unnecessary and redundant information as
feasible. As a result, feature selection not only decreases
computational and processing costs, but also improves the model
created from the chosen data [31], [32].

On healthcare data, feature selection has been
used in a number of previous studies [27]–[29], [33]. Some previous
studies are partly relevant to this study and deal with
the dataset utilized here; nevertheless, in most cases, the
performance of such systems was not as expected. One of the
reasons for some systems’ poor performance is their inability
to recognize the most important and highly correlated features.

The goal of this research is to devise a method for identi-
fying the best set of features and then investigate which algo-
rithms work best with those features.

Filter, wrapper, and embedded techniques are the three
types of algorithms that may be used to select features [34]. The
Correlation Feature Selection method is used in this study to
identify the best predictive features. Correlation feature selection
is a technique that uses the filter approach. The relationship
between the independent and dependent variables is determined
using a mathematical function. The features are chosen based
on the values of their correlation coefficients. Features that are
highly correlated with the class variable are considered the most
predictive and are included in the final feature set.

Pearson’s Correlation: Consider a dataset D having feature set F

$$F = \{x_1, x_2, x_3, \ldots, x_n\} \qquad (2)$$

and classes C with values c, where X, C are treated as random
variables, Pearson’s linear correlation coefficient is defined as

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(c_i - \bar{c})}{\sqrt{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]\left[\sum_{i=1}^{n} (c_i - \bar{c})^2\right]}} \qquad (3)$$

where xi and ci are the ith values of X and C, respectively. The value
of r is ±1 if X and C are linearly dependent and zero
if they are completely uncorrelated. The values of the correlation
coefficients between the independent features and the dependent
class variable in the WDBC data are shown in Figure 4.
A filter approach was used to identify the most predictive
features. The correlation between all features is calculated
and displayed in this article. The correlation threshold utilized
is 0.6: features having a correlation of less than 0.6 with the class
variable are removed from the training dataset, while the features
at or above the threshold are retained. Figure 5 depicts
the 10 features that are highly correlated with the predicted
attribute (diagnosis).
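
The filter step can be sketched as follows: compute Pearson's r between each feature and the encoded class label, then keep the features that reach the 0.6 threshold. Whether the study compared the signed or the absolute correlation is not stated, so the use of `abs(r)` here is an assumption, as are the function names and the toy columns.

```python
import math

def pearson_r(xs, cs):
    """Pearson's linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_c = sum(cs) / n
    cov = sum((x - mean_x) * (c - mean_c) for x, c in zip(xs, cs))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_c = sum((c - mean_c) ** 2 for c in cs)
    if var_x == 0 or var_c == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / math.sqrt(var_x * var_c)

def select_features(features, target, threshold=0.6):
    """Return {name: r} for features whose |r| with the target meets the threshold."""
    scores = {name: pearson_r(column, target) for name, column in features.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}

# Toy columns (hypothetical values); target uses the encoding M=1, B=0.
target = [1, 1, 1, 0, 0, 0]
features = {
    "radius_mean": [20.0, 18.0, 19.0, 12.0, 11.0, 13.0],
    "texture_se": [1.0, 2.0, 1.5, 1.2, 1.9, 1.4],
}
selected = select_features(features, target)
print(selected)
```

Because the filter scores each feature independently of any classifier, the same selected subset can be passed to all seven models.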

2.4. Data Splitting
The goal of dividing the data is to detect overfitting by
evaluating the model on data that was not used for training.
The dataset for this
study was split into two parts: training data (80%) and test data
(20%).
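
A minimal sketch of an 80/20 split is shown below. The shuffling, the fixed seed and the function name are assumptions for illustration; the text does not say whether the split was stratified or how it was seeded. The 569 rows correspond to the unfiltered dataset; the count after outlier removal is not given in the text.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into train and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test

# Stand-in rows: one integer per instance of the unfiltered dataset.
rows = list(range(569))
train, test = train_test_split(rows)
print(len(train), len(test))
```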

2.5. Trained Model
Based on the seven classifiers, two intelligent systems were
built. Different forms of ML algorithms have previously been
the subject of numerous studies. Five of the most prevalent
approaches (LR, KNN, SVM, DT, and RF) were chosen, as well
as two less frequently used techniques (AB and GB). Several prior
studies have demonstrated that, when combined
with feature selection approaches, the predictive accuracy of the LR, KNN, SVM,
DT [25], [26] and RF [27] algorithms was fairly high. Furthermore,
to the best of our knowledge, no research in this field has
shown that AB and GB can perform well with a high
degree of accuracy. These methods were investigated in this study
using hyperparameter tuning to improve the proposed model’s
efficiency. The algorithms are discussed below:


Figure 4. Correlation Coefficient Values between Independent Features and Dependent Class Variable

Figure 5. Plot of Highly Correlated Features

2.5.1. Logistic Regression
LR is a simple supervised learning method for modelling
the relationship between explanatory variables and a dependent
variable by fitting a linear equation to observed data. The linear
model can be written as shown in equation 4 [35]:

$$y = \beta_0 + \beta_1 x_1 + e \qquad (4)$$

where y is the response variable, β0 and β1 are the model
coefficients, and e is the error term. The intercept and
slope are represented by these unknown constants, which
are learnt during the training phase.
Equation 5 is used to predict after the model has been fitted in
the training phase:

$$\hat{y} = \beta_0 + \beta_1 x_1 \qquad (5)$$

where ŷ is the predicted value based on x1, and the error is
calculated as shown in equation 6:

$$e_i = y_i - \hat{y}_i \qquad (6)$$
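
Equations 4 to 6 can be made concrete with a small worked example. The sketch below estimates the intercept and slope by ordinary least squares, which is an assumption (the text does not say how the coefficients are estimated), and then computes the predictions and residuals. The data points are made up, and the step that turns a linear output into a class label is not covered here because the text does not describe it.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

b0, b1 = fit_line(xs, ys)                          # learnt coefficients
preds = [b0 + b1 * x for x in xs]                  # equation 5
residuals = [y - p for y, p in zip(ys, preds)]     # equation 6
print(b0, b1, residuals)
```

With an intercept in the model, the least squares residuals sum to zero (up to rounding), which is a quick sanity check on the sign convention in equation 6.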

In this study, the solver and the maximum number of iterations
were the two LR parameters tuned. The solver is the algorithm
used to solve the optimization problem. The solver
options include “newton-cg”, “sag”, “saga”, “liblinear”
and “lbfgs”. The maximum number of iterations allowed for the solver
to converge was set between 100 and 10000.

2.5.2. K-Nearest Neighbor
KNN is a supervised classifier that learns from labeled data
samples. It is a lazy method since it makes
no generalizations about the sample data points, and all
calculations are deferred until classification time.
KNN works by determining the K closest neighbors of a query point among n
training vectors. KNN maps the training dataset into a
multi-dimensional feature space and divides it into regions
based on the training dataset’s classes. The concept
of the number of neighbors is fundamental to this method: it
specifies how many neighbors should be
checked when an item is classified. In this study, the number of
neighbors was tuned by random search between 1 and 30, with each case
being assigned to the most frequent class among its K nearest
neighbors, as determined by a distance function.
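
The voting rule described above can be sketched as below. Euclidean distance is assumed as the distance function (the text does not name one), and the training points are toy values rather than WDBC rows.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the most frequent label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters: label 0 near (1, 1), label 1 near (3, 3).
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))
print(knn_predict(train_points, train_labels, (3.0, 3.0), k=3))
```

All the work happens inside `knn_predict` at query time, which is what makes the method "lazy": nothing is fitted in advance.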

2.5.3. Support Vector Machine
SVM is a classification approach that involves projecting
input data points into an n-dimensional vector space and determining
the optimal hyperplane that maximizes the margin between
the two classes. The choice of parameters such as kernel,
C, and gamma has a significant impact on SVM performance.
A kernel is a function that implicitly maps a low-dimensional
space into a higher-dimensional one, making the classes easier to
separate. The non-linearity is controlled by the kernel; “rbf”, “poly”
and “sigmoid” are all possible kernel choices. The kernel
parameters were tuned in this investigation.
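
To make the three kernel names concrete, the sketch below writes each one out in its commonly used form. The parameter names `gamma`, `coef0` and `degree` and their default values are illustrative assumptions; the text does not report the values explored in the study.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=0.5):
    """exp(-gamma * ||u - v||^2): equals 1 for identical points, decays with distance."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def poly_kernel(u, v, gamma=0.5, coef0=1.0, degree=3):
    """(gamma * <u, v> + coef0) ** degree."""
    return (gamma * dot(u, v) + coef0) ** degree

def sigmoid_kernel(u, v, gamma=0.5, coef0=0.0):
    """tanh(gamma * <u, v> + coef0)."""
    return math.tanh(gamma * dot(u, v) + coef0)

a, b = (0.0, 0.0), (1.0, 1.0)
print(rbf_kernel(a, b), poly_kernel(a, b), sigmoid_kernel(a, b))
```

A larger `gamma` makes the RBF kernel fall off faster with distance, which is one reason the parameter has a strong effect on how non-linear the decision boundary becomes.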

2.5.4. Decision Tree Algorithm
DT is a strong predictive learning tool used to solve
classification and regression problems. It uses a tree-based, top-down
progression method. It employs a tiered splitting method to
divide data into two or more groups at each level, ensuring that
the data in each group are similar. Every internal node of a DT
corresponds to a test on an attribute, every branch to a test result, and each
leaf node to a class. The tree develops from the root node by
selecting, before each split, a ‘best attribute’
from the set of available attributes using
entropy and information gain measures. The ‘best attribute’ is the one
that provides the most useful information. Entropy reflects how
homogeneous the dataset is, and information gain is the amount by
which entropy decreases after a split on an attribute. The key
parameters that are optimized are the maximum depth of the
tree, the number of features to consider when looking for the best
split, the minimum number of samples required to split an
internal node, and the criterion used for splitting.
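
Entropy and information gain can be illustrated with a small calculation. The sketch below uses Shannon entropy in bits and measures the gain of a candidate split as the parent entropy minus the size-weighted entropy of the child groups. The label lists are toy examples, not WDBC data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["M"] * 4 + ["B"] * 4

# A split that separates the classes perfectly versus one that separates nothing.
perfect = information_gain(parent, [["M"] * 4, ["B"] * 4])
useless = information_gain(parent, [["M", "M", "B", "B"], ["M", "M", "B", "B"]])
print(perfect, useless)
```

The attribute whose split yields the largest gain is the 'best attribute' at that node.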

2.5.5. Random Forest
RF is an ensemble of several separate randomized decision
trees that work together. Bootstrap sampling of the data is used
to create the trees. Based on the set of predictor values entered,
each individual tree in the random forest casts a unit vote, and
the class with the most votes becomes the model’s prediction
for an input vector [36]. Each tree is built by recursively
partitioning the data at binary nodes into increasingly similar
child nodes, so the composition of a child node depends on its
parent node.

2.5.6. Gradient Boosting
GB is a boosting strategy for improving models that have poor
predictive power. The goal of GB is to combine numerous
weak predictive models, typically decision trees,
into a single model with considerably greater predictive power.
The approach is especially effective with large datasets whose
components have little in common.
Search engine ranking is one of the most popular uses of this
technology: it must filter a large number
of possible queries, some of which may or may not be related,
into a limited number of rankable terms.

2.5.7. AdaBoost
AB is a boosting method that uses weight modification to
solve classification problems without requiring any prior
knowledge of the weak learner’s performance. The goal of AdaBoost is
to enhance classification performance by combining different
weak learners or classifiers. A basic collection of training
examples is used to train each weak learner. Each sample has a
weight, which is adjusted iteratively across all samples. The
robustness of the weak learner is represented by this weight.
The AdaBoost algorithm consists of the following main steps:
(i) sampling, which involves selecting some samples from the
training set in each iteration; (ii) training, in which the sample data
is used to train different classifiers and the error rate of each classifier is
computed; and (iii) combination, in which all trained
models are combined.
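
The text does not give the weight-update formulas, so the sketch below uses the classic discrete AdaBoost update as an assumption: labels are coded as +1/-1, the learner weight is alpha = 0.5 * ln((1 - err) / err), and each sample weight is multiplied by exp(-alpha * y * h(x)) and then renormalized. The function name and the toy predictions are hypothetical, and library implementations may differ in detail.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One boosting round: return (alpha, new normalized sample weights)."""
    # Weighted error rate of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are up-weighted, correct ones down-weighted.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Four samples with equal initial weights; the weak learner gets the second one wrong.
weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right; the final model combines the learners using their alpha values as voting weights.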

3. Performance evaluation metric

The accuracy metric was calculated using the 2 × 2 confusion
matrix to test the validity of the prediction models, as
shown in Table 2. Accuracy is the proportion of correct
predictions among the total number of predictions [37].
Equation 7 gives the formula used to quantify
this measurement, where TP, TN, FP, and FN
stand for True Positive (number of positive data correctly labeled
by the classifier), True Negative (number of negative data
correctly labeled by the classifier), False Positive (number of
negative data incorrectly labeled as positive), and False Negative
(number of positive data incorrectly labeled as negative),
respectively.

Table 2. Illustration of Confusion Matrix Table
                              Actual Values
                              Positive   Negative
Predicted Values   Positive   TP         FP
                   Negative   FN         TN

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (7)$$
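
Equation 7 and the confusion matrix in Table 2 can be sketched as follows. The labels use the encoding M = 1 (positive) and B = 0 (negative); the example predictions are invented for illustration and are not results from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Proportion of correct predictions among all predictions (equation 7)."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical labels: one malignant case missed, one benign case flagged.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn, accuracy(tp, fp, tn, fn))
```

Accuracy treats both error types alike, so a missed malignant case (FN) and a false alarm (FP) lower the score by the same amount.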


Table 3. Comparison Table between the Accuracy of the Proposed Models and Existing Techniques
Authors                                                    Dataset  Techniques                     Accuracy (%)
[25]                                                       WDBC     DT                             92.9
                                                                    PCA-SVM                        86.7
                                                                    PCA-KNN                        82.3
                                                                    EM-PCA-CART-Fuzzy Rule-Based   93.2
[27]                                                       WDBC     RF                             98.7
[26]                                                       WDBC     SVM                            96.47
                                                                    CFS-SVM                        96.47
                                                                    RFE-SVM                        96.47
                                                                    LDA-SVM                        98.82
[23]                                                       WDBC     SVM                            61.96
                                                                    RF                             89.37
                                                                    KNN                            92.77
                                                                    Fuzzy-ID3                      94.53
[29]                                                       WDBC     CFS-RF                         95.32
                                                                    UFS-RF                         94.15
                                                                    RFE-RF                         94.15
Proposed work with Outliers                                WDBC     LR                             99.1
                                                                    KNN                            96.5
                                                                    SVM                            95.6
                                                                    DT                             96.5
                                                                    RF                             99.1
                                                                    GB                             98.3
                                                                    AB                             99.1
Proposed work with Pearson Correlation Feature Selection   WDBC     OCFS-LR                        96.5
                                                                    OCFS-KNN                       96.5
                                                                    OCFS-SVM                       95.6
                                                                    OCFS-DT                        94.7
                                                                    OCFS-RF                        100
                                                                    OCFS-GB                        99.1
                                                                    OCFS-AB                        98.3

4. Result and Discussion

The two proposed approaches were subjected to various ML
techniques. In order to create the performance statistic, a 2 x 2
confusion matrix was created, which allowed all of the algo-
rithms to be compared. The suggested models were evaluated
using the performance indicator “accuracy”.

4.1. Comparison between Different Machine Learning Algo-
rithms Based on Accuracy

Accuracy is the most essential metric for evaluating ML
algorithms. Three approaches were compared. First, seven
classifiers were applied to the WDBC features after missing
values were resolved and the instances were scaled, as previously
indicated; this is called the Conventional approach. Second,
outlier instances were detected in five input features and
dropped from their respective features; this is termed the
Outliers approach. Finally, the instances remaining after outlier
removal were subjected to Pearson Correlation Feature Selection,
which selected 10 features; this is known as the OCFS approach.
The accuracy of each classifier was computed for each of these
three approaches, and the results are plotted in Figure 6.
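The Outliers and OCFS steps described above can be sketched as follows. This section does not state the outlier rule or the selection criterion, so the sketch assumes a 1.5 × IQR fence for outliers and ranks features by the absolute Pearson correlation with the class label; both choices, the helper names, and the toy values are assumptions for illustration only.

```python
import math
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper fences under the (assumed) k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_outlier_rows(rows, columns):
    """Keep rows whose value lies inside the fences of every listed column."""
    fences = {c: iqr_bounds([r[c] for r in rows]) for c in columns}
    return [r for r in rows
            if all(fences[c][0] <= r[c] <= fences[c][1] for c in columns)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_top_k(rows, features, target, k):
    """Rank features by |r| against the target and keep the k strongest."""
    y = [r[target] for r in rows]
    score = {f: abs(pearson([r[f] for r in rows], y)) for f in features}
    return sorted(features, key=score.get, reverse=True)[:k]


# Toy rows (hypothetical values, not WDBC records); label 1 = malignant.
radius = [10, 11, 12, 13, 14, 15, 16, 100]
texture = [5, 3, 6, 2, 7, 1, 8, 4]
label = [0, 0, 0, 0, 1, 1, 1, 1]
rows = [{"radius": r, "texture": t, "label": l}
        for r, t, l in zip(radius, texture, label)]

clean = drop_outlier_rows(rows, ["radius"])                       # Outliers step
kept = select_top_k(clean, ["radius", "texture"], "label", k=1)   # OCFS step
print(len(rows), len(clean), kept)
```

In this toy run the row with radius 100 is dropped and `radius` is retained as the single most correlated feature. A constant column would need a guard in `pearson`, since its standard deviation is zero.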

Figure 6. Comparison between Different Machine Learning Algorithms
Based on Accuracy

As demonstrated in Figure 6, the conventional approach performs
the poorest of the three approaches overall, owing to the
presence of outliers, which increase data variability and reduce
statistical power. KNN, RF, and GB all have the same accuracy
(96.5%) under the conventional approach; however, none of them
exceeds its accuracy under the other approaches. The conventional
approach is exceptionally accurate with the SVM classifier
(98.2%) and least accurate with DT (93.9%). The Outliers approach
produces considerably higher results for most of its classifiers,
with three classifiers (LR, RF, and AB) reaching the maximum
accuracy of 99.1%. Lastly, when outlier removal is combined with
the 10 features selected by the Pearson Correlation technique, RF
and GB reach their highest accuracies (100% and 99.1%, Table 3),
whereas the remaining classifiers match or fall below their
Outliers-approach scores. The gain is feasible because feature
selection removes noise from the data while retaining the most
valuable features; paired with outlier removal, this strategy
yields the best overall result. Figure 6 shows that RF under the
OCFS approach has the best accuracy (100%). This is attributed to
the additional randomness RF introduces when growing its trees:
instead of searching for the most important feature among all
features when splitting a node, it searches for the best feature
among a random subset of features.
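The random-subset split that distinguishes RF can be illustrated with a small sketch. It assumes Gini impurity as the split criterion and uses hypothetical helper names and toy data; it shows only the node-level idea, not the classifier configuration used in this study.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(X, y, feature_indices):
    """Return (weighted impurity, feature, threshold) of the best split
    found among the given candidate features, or None if none is valid."""
    best = None
    n = len(y)
    for f in feature_indices:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def random_subset_split(X, y, max_features, rng):
    """Random-forest style: search only a random subset of the features."""
    candidates = rng.sample(range(len(X[0])), max_features)
    return best_split(X, y, candidates), candidates


# Toy data (hypothetical): feature 0 separates the two classes perfectly.
X = [[1, 5, 9], [2, 7, 3], [8, 6, 4], [9, 4, 8]]
y = [0, 0, 1, 1]

print(best_split(X, y, [0, 1, 2]))   # exhaustive search over all features
split, candidates = random_subset_split(X, y, 2, random.Random(0))
print(candidates, split)             # search restricted to two random features
```

When feature 0 is not among the sampled candidates, the node settles for a weaker split; across many trees this decorrelates the ensemble members.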

4.2. Comparison Table between the Accuracy of the Proposed
Models and Existing Techniques

The WDBC dataset was used for this research. The architecture
of our suggested system is depicted in Figure 1. Table 3 shows
the Outliers and OCFS results alongside previous work on the
WDBC dataset, for which a range of results has been published in
the literature. Those studies reported significant improvements
in accuracy after changing the number of selected features with
selection algorithms such as Principal Component Analysis (PCA),
Correlation Feature Selection (CFS), Recursive Feature
Elimination (RFE), Linear Discriminant Analysis (LDA), Univariate
Feature Selection (UFS), Fuzzy Rule-Based, and Expectation
Maximization (EM)-PCA-CART-Fuzzy Rule-Based. This work therefore
compares the accuracy of our two approaches with those results.

With outliers removed, the LR model had the greatest accuracy
(99.1%, shared with RF and AB), while the SVM model had the
lowest accuracy score (95.6%). Applying OCFS results in some
significant changes: the RF model achieved the best accuracy
(100%), whereas the SVM model again scored among the lowest
(95.6%), above only DT (94.7%). Table 3 compares our findings
with existing models on the same dataset. The table’s
“Techniques” column lists the methods used in the previous
studies and in our own work, and the “Accuracy (%)” column gives
the reported results, so the table depicts the overall
performance of the algorithms in our study relative to comparable
works. Among previous RF results, [29] reports 94.15% to 95.32%
and [27] reports 98.7%, whereas our OCFS-RF reaches 100%.

5. Conclusion

This research focuses primarily on improving ML models in
order to increase accuracy in predicting breast cancer outcomes.
The results show that outlier detection and OCFS methods, in
combination with various classification algorithms, might provide
useful tools for inference in this area. Further study is needed
to improve the classification systems’ performance with diverse
feature selection approaches so that they can predict on more
variables.

Acknowledgments

The authors would like to thank the handling editor and the
anonymous referees for their contributions to the success of
this research.

References

[1] M. R. Mohebian, H. R. Marateb, M. Mansourian, M. A. Mañanas & F.
Mokarian, “A hybrid computer-aided-diagnosis system for prediction of
breast cancer recurrence (HPBCR) using optimized ensemble learning,”
Computational and Structural Biotechnology Journal 15 (2017) 75.

[2] S. Amin, H. S. Ewunonu, E. Oguntebi & I. Liman, “Breast cancer mor-
tality in a resource-poor country: a 10-year experience in a tertiary insti-
tution,” Sahel Medical Journal 20 (2017) 9.

[3] M.W. Huang, C.W. Chen, W.C. Lin, S.W. Ke & C.F. Tsai, “SVM and
SVM ensembles in breast cancer prediction,” PLoS ONE 12 (2017)
161501.

[4] CDC, “What is breast cancer?” (2021).

[5] R. J. Oskouei, N. M. Kor & S. A. Maleki, “Data mining and medical
world: breast cancers’ diagnosis, treatment, prognosis and challenges,”
American Journal of Cancer Research 7 (2017) 610.

[6] L. A. Aaltonen, R. Salovaara, P. Kristo, F. Canzian, A. Hemminki, P. Pel-
tomäki, R. B. Chadwick, H. Kääriäinen, M. Eskelinen, H. Järvinen, J. P.
Mecklin, & A. De la Chapelle, “Incidence of hereditary nonpolyposis col-
orectal cancer and the feasibility of molecular screening for the disease,”
New England Journal of Medicine 338 (1998) 1481.

[7] A. Khamparia, S. Bharati, P. Podder, D. Gupta, A. Khanna, T. K. Phung
& D. N. H. Thanh, “Diagnosis of breast cancer based on modern mam-
mography using hybrid transfer learning,” Multidimensional Systems and
Signal Processing 32 (2021) 747.

[8] H. Kurihara, C. Shimizu, Y. Miyakita, M. Yoshida, A. Hamada, Y.
Kanayama, K. Yonemori, J. Hashimoto, H. Tani, M. Kodaira, M.
Yunokawa, H. Yamamoto, Y. Watanabe, Y. Fujiwara & K. Tamura,
“Molecular imaging using PET for breast cancer,” The Japanese Breast
Cancer Society 23 (2016) 24.

[9] T. Nagashima, M. Suzuki, H. Yagata, H. Hashimoto, T. Shishikura, N.
Imanaka, T. Ueda & M. Miyazaki, “Dynamic-enhanced MRI predicts
metastatic potential of invasive ductal breast cancer,” Breast Cancer 9
(2002) 226.

[10] C. S. Park, S. H. Kim, N. Y. Jung, J. J. Choi, B. J. Kang & H. S. Jung,
“Interobserver variability of ultrasound elastography and the ultrasound
BI-RADS lexicon of breast lesions,” Breast Cancer 22 (2015) 153.

[11] S. I. Ayon, M. Islam & M. R. Hossain, “Coronary artery heart disease pre-
diction: A comparative study of computational intelligence techniques,”
IETE Journal of Research (2020) 1.

[12] M. M. Islam, H. Iqbal, R. Haque & K. Hasan, “Prediction of breast cancer
using support vector machine and k-nearest neighbors,” IEEE Region 10
Humanitarian Technology Conference (R10-HTC) (2017) 226.

[13] L. J. Muhammad, M. M. Islam, S. S. Usman & S. I. Ayon, “Predictive
data mining models for novel coronavirus (covid-19) infected patients’
recovery,” SN Computer Science 1 (2020) 206.

[14] A. Yusuf & O. Akande, “Hyper-parameter optimization and evaluation
on selected machine learning algorithm using hepatitis dataset,” FUDMA
Journal of Sciences 5 (2021) 447.

[15] S. I. Ayon & M. Islam, “Diabetes prediction: a deep learning approach,”
International Journal of Information Engineering and Electronic Business
11 (2019) 2.

[16] Z. Islam, M. Islam & A. Asraf, “A combined deep CNN-LSTM network
for the detection of novel coronavirus (covid-19) using x-ray images,”
Informatics in Medicine Unlocked 20 (2020) 100412.


[17] K. Hasan, M. Islam & M. M. A. Hashem, “Mathematical model devel-
opment to detect breast cancer using multigene genetic programming,”
International Conference on Informatics, Electronics and Vision (2016)
574.

[18] M. T. Ahmed, M. N. Imtiaz & A. Karmakar, “Analysis of wisconsin
breast cancer original dataset using data mining and machine learning
algorithms for breast cancer prediction,” Journal of Science Technology
and Environment Informatics 9 (2020) 665.

[19] M. M. Islam, Md. R. Haque, H. Iqbal, Md. M. Hasan, M. Hasan & M.
N. Kabir, “Breast cancer prediction: A comparative study using machine
learning techniques,” SN Computer Science 1 (2020) 290.

[20] N. Khuriwal & N. Mishra, “Breast cancer diagnosis using deep learning
algorithm,” International Conference on Advances in Computing, Com-
munication Control and Networking (2018) 98.

[21] C. Shah & A. G. Jivani, “Comparison of data mining classification algo-
rithms for breast cancer prediction,” Fourth International Conference on
Computing, Communications and Networking Technologies (ICCCNT)
(2013) 1.

[22] F. A. Muhammet, “A comparative analysis of breast cancer detection and
diagnosis using data visualization and machine learning applications,”
Healthcare 8 (2020) 111.

[23] N. F. Idris & M. A. Ismail, “Breast cancer disease classification us-
ing Fuzzy-ID3 algorithm with FUZZYDBD method: automatic fuzzy
database definition,” PeerJ Computer Science 7 (2021) 427.

[24] R. Harikumar & C. Sannasi, “Effective classification framework for breast
tumors using optimized multi-kernel SVM with controlled skewness,” In-
ternational Journal of Aquatic Science 12 (2021) 1604.

[25] M. Nilashi, O. Ibrahim, H. Ahmadi & L. Shahmoradi, “A knowledge-
based system for breast cancer classification using fuzzy logic method,”
Telematics and Informatics 34 (2017) 133.

[26] D. A. Omondiagbe, S. Veeramani & A. S. Sidhu, “Machine learning
classification techniques for breast cancer diagnosis,” IOP Conference
Series: Materials Science and Engineering 495 (2019) 012033.

[27] A. Saygılı, “Classification and diagnostic prediction of breast cancers
via different classifiers,” International Scientific and Vocational Studies
Journal 2 (2018) 56.

[28] A. Derangula, S. Edara & P. K. Karri, “Feature selection of breast cancer
data using gradient boosting techniques of machine learning,” Clinical
Medicine 7 (2020) 17.

[29] S. Raj, S. Singh, A. Kumar, S. Sarkar & C. Pradhan, “Feature selection
and random forest classification for breast cancer disease,” Data Analytics
in Bioinformatics (2021) 191.

[30] T. H. Cheng, C. P. Wei & V. S. Tseng, “Feature selection for medical data
mining: comparisons of expert judgment and automatic approaches,” 19th
IEEE Symposium on Computer-Based Medical Systems (2006) 165.

[31] S. N. Ghazavi & T. W. Liao, “Medical data mining by fuzzy modeling
with selected features,” Artificial Intelligence in Medicine 43 (2008) 195.

[32] S. M. Vieira, J. M. C. Sousa & U. Kaymak, “Fuzzy criteria for feature
selection,” Fuzzy Sets and Systems 189 (2012) 1.

[33] S. B. Sakri, N. B. Abdul Rashid & Z. Muhammad Zain, “Particle swarm
optimization feature selection for breast cancer recurrence prediction,”
IEEE Access 6 (2018) 29637.

[34] E. E. Bron, M. Smits, W. J. Niessen & S. Klein, “Feature selection based
on the SVM weight vector for classification of dementia,” IEEE Journal
of Biomedical and Health Informatics 19 (2015) 1617.

[35] M. Kumari, V. Singh & P. Ahlawat, “Automated decision support system
for breast cancer prediction,” International Journal on Emerging Tech-
nologies 11 (2020) 193.

[36] L. Breiman, “Random forests: random features,” Technical Report 567,
Statistics Department, University of California, Berkeley (1999) 29.

[37] S. V. Stehman, “Selecting and interpreting measures of thematic classifi-
cation accuracy,” Remote Sensing of Environment 62 (1997) 77.
