Al-Khwarizmi Engineering Journal, Vol. 16, No. 3, September, (2020) P.P. 1-10

Two-Stage Classification of Breast Tumor Biomarkers for Iraqi Women

Iyden Kamil Mohammed*   Ali Hussein Al-Timemy**   Javier Escudero***

*,** Department of Biomedical Engineering / Alkhwarizmi College of Engineering / University of Baghdad / Baghdad / Iraq
*** School of Engineering / Institute for Digital Communications / The University of Edinburgh / Alexander Graham Bell Building / EH9 3FG / UK
*Email: aydenel_1969@yahoo.com
**Email: ali.altimemy@kecbu.uobaghdad.edu.iq
***Email: javier.escudero@ed.ac.uk

(Received 18 November 2019; accepted 26 April 2020)
https://doi.org/10.22153/kej.2020.04.003

Abstract

Objective: Breast cancer is a deadly disease that causes many deaths among women. Early diagnosis of breast cancer with appropriate tumor biomarkers may facilitate early treatment of the disease, thus reducing the mortality rate. The purpose of the current study is to improve the early diagnosis of breast tumors by proposing a two-stage classification of breast tumor biomarkers for a sample of Iraqi women.

Methods: In this study, a two-stage classification system is proposed and tested with four machine learning classifiers. In the first stage, breast features (demographic, blood and salivary-based attributes) are classified into normal or abnormal cases, while in the second stage the abnormal breast cases are further classified into either malignant or benign. The 20 collected breast cancer features are used to test the performance of the proposed classification system with Leave-One-Out (LOO) cross validation, with the Synthetic Minority Over-Sampling Technique (SMOTE) applied to balance the classes. Furthermore, correlation-based feature selection (CFS) was employed in an exploratory analysis to find the best features for the two-stage classification system.
Results: A classification accuracy of 94% for stage-1 and 100% for stage-2 was achieved with a Naïve Bayes classifier, which outperformed the other three methods. In addition, CFS identified a small subset of five features, out of all 20 features, as the best features for both stage-1 and stage-2.

Conclusion: We achieved a high classification accuracy, which is promising for improving the early diagnosis of breast tumors. The outcome of this study also shows the importance of CA15-3 protein in saliva and blood, carcinoembryonic antigen level and total protein in blood, and Estrogen hormone level in saliva for predicting breast tumors.

Keywords: Breast cancer, correlation-based feature selection, decision tree, machine learning, OneR algorithm, two-stage classification.

1. Introduction

In Iraq, breast cancer is regarded as the most common type of malignancy [1], [2]. Among all malignant diseases, breast cancer is assessed as one of the main causes of death in post-menopausal women, accounting for 23% of all cancer deaths in 2017 [3]. In 2010, breast cancer was regarded as the number one cause of cancer mortality among women [4].

Biomarkers have many potential applications in oncology, including screening, risk assessment, determination of prognosis, prediction of response to treatment, differential diagnosis, and monitoring of progression of disease. Due to the major role that biomarkers may play at all stages of disease, they should undergo rigorous evaluation, including clinical validation, analytical validation, and assessment of clinical utility, prior to incorporation into regular clinical care [5]. A tumor biomarker is a molecular or process-based change that discloses the status of an underlying malignancy. A tumor biomarker may be detected and assessed via one or more biomarker assays or tests.
Patient management is increasingly being driven by tumor biomarker tests. Such tests can identify patients who do not require any treatment, as well as patients whose tumors are so unlikely to respond to a given type of treatment that it would do more harm than good. A tumor biomarker test should therefore guide patient management only if it has analytical validity, which means it is accurate, reliable and reproducible [6].

Machine learning techniques have been used to classify cancer attributes and biomarkers with the aim of improving the diagnosis rate. For instance, a classifier-based expert system was proposed in [7] for early diagnosis of prostate cancer with Artificial Neural Network (ANN) and Support Vector Machines (SVM). Thirteen attributes were acquired from 300 men to classify benign and malignant tumors. Classification accuracies of 79.3% and 80.1% were obtained with ANN and polynomial SVM, respectively.

Other researchers employed machine learning and data mining techniques to investigate breast tumors. Three popular machine learning classifiers (Naive Bayes, Radial Basis Function Neural Network (RBFNN) and J48 decision tree) were used in [8] to develop prediction models for 683 breast cancer cases. Classification accuracies of 97.36%, 96.77% and 93.41% were obtained for the Naive Bayes, RBFNN and J48 classifiers, respectively. In another study, Behadili et al. [9] analyzed 42 attributes of Iraqi women and selected 26 attributes for the classification of three classes with the J48 decision tree algorithm, reaching 98% accuracy. Other researchers tried to reduce the size of the feature set used to detect breast cancer with Independent Component Analysis (ICA) [10]. A publicly available data set, the Wisconsin diagnostic breast cancer (WDBC) dataset, was used to test the proposed algorithm, where the 30 attributes were reduced to only one feature (IC).
Then, the reduced feature was used to evaluate diagnostic accuracy with multiple classifiers: k-nearest neighbor (k-NN), ANN, RBFNN, and SVM.

In some cases, three-class classification problems have been tackled with a two-stage classification, as in [11], where a computer-aided diagnosis (CAD) system was proposed to classify brain tumors. The system classified brain tumor MRI images into normal and abnormal images in the first stage and, if the output was abnormal, the image was further classified as a malignant or benign tumor.

In this paper, we propose a two-stage classification for breast cancer classification with four machine learning classifiers. In the first stage, 20 breast attributes are classified into normal or abnormal cases, whereas in the second stage the abnormal breast cases are further classified into malignant or benign. Furthermore, correlation-based feature selection is employed to find the best attributes out of the 20 attributes.

The main contribution of this paper is a two-stage classification system for the classification of breast markers with machine learning classifiers. The attributes that are most influential in the prediction of breast tumors are investigated with correlation-based feature selection on a data set of 181 samples.

2. Materials and Methods

A. Details of Breast Cancer Data Collection

The data set used in this study was acquired from the center for early breast cancer detection and Elwiya Oncology teaching hospital in 2013 and 2014. Ethical approval to conduct the data collection was obtained from the Ministry of Health. In addition, data collection was performed in accordance with the Declaration of Helsinki 1964 and its later amendments. The data set consisted of 181 subjects (111 malignant, 50 normal control and 20 benign cases).
The oncologist examined subjects with suspected breast tumors and approved their inclusion in the study, confirmed with the following: 1) clinical examination, 2) breast biopsy, 3) mammogram and 4) Ultrasound (US) scan. Before the data collection, subjects were briefed and consented to participate in the study.

The data set has 20 attributes, which can be grouped into 3 main categories: i) demographic information (9 attributes), ii) attributes derived from blood samples (5 attributes) and iii) attributes derived from saliva (6 attributes).

The nine demographic attributes are age, body mass index (BMI), total body fat index (TBF), Waist Hip Ratio (WHR), number of menstruation (Mens.) cycles per year, duration of contraceptive intake per year, menstruation (Mens.) cycle status (normal or abnormal), type of Fucosyl transferase 2 (FUT2) gene (secretor or non-secretor), and type of Lewis blood group.

The five biomarkers derived from blood samples are the level of Estrogen hormone (Es-B), the level of progesterone hormone (Pg-B), the level of CA15-3 protein (CA15-3 B), carcinoembryonic antigen level (CEA-B) and Total Protein (TP-B).

The six biomarkers obtained from salivary samples are the pH level (PH-S), salivary Total Protein (TP-S), Estrogen hormone level (Es-S), the level of CA15-3 protein (CA15-3 S), saliva progesterone hormone level (Pg-S) and salivary carcinoembryonic antigen level (CEA-S).

B. Details of the Pattern Classification

In this study, a two-stage classification system is proposed in which the breast attributes are classified as either normal or abnormal in the first stage. In the second stage, the abnormal instances are further classified into malignant and benign cases. The general block diagram of the two-stage classification system is shown in Fig. 1.

Fig. 1.
Block diagram of the proposed breast attribute classification method.

How to select the best machine learning classifier for breast cancer detection can be considered an open research question. Therefore, in this study we investigate the performance of the proposed two-stage classification system with four different classifiers: logistic regression [12], Naïve Bayes [13], decision tree [12] and OneR [14]. These classifiers were chosen because they are relatively straightforward and simple to implement, and they have been used in the previous literature. Weka, an open source software package [12], was used to perform classification with the four classifiers in the current study.

Logistic regression is a traditional classification method in which class probabilities are estimated by applying the logit transformation to a linear regression model [3]. It provides a mechanism for applying linear regression to classification [15]. For more details about the mathematical derivation, the reader is referred to [15]. The Logistic function in Weka was used for logistic regression [12].

The Naïve Bayes classifier is a simple probabilistic classifier based on Bayes' theorem [16]. It naïvely assumes that the attributes are independent [12]: multiplying the probabilities of the individual features is valid only if the features are independent. Although attributes are rarely independent in real-life data, Naïve Bayes works very effectively when used to classify real-life datasets [12]. The pseudocode for the Naïve Bayes classifier [17] is displayed in Fig. 2.

For the decision tree, DecisionStump, a fast decision tree learner which uses reduced-error pruning, was used. It builds a one-level binary decision tree with a categorical or numeric class to perform the classification. The DecisionStump implementation in Weka was used.
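To illustrate what a one-level binary tree amounts to, the following is a minimal sketch and not the Weka implementation. It assumes a single numeric attribute and chooses the split threshold by counting training errors, whereas a full stump learner would also search over all attributes and may use a different split criterion. The function names and the toy values are hypothetical and are not taken from the study data.

```python
from collections import Counter


def fit_stump(values, labels):
    """Find the threshold on one numeric attribute that minimises training errors.

    Returns (threshold, left_label, right_label, errors). Instances with
    value <= threshold are assigned left_label, the others right_label.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    # Candidate thresholds lie midway between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2.0
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        # Each branch predicts its majority class.
        left_label, left_hits = Counter(left).most_common(1)[0]
        right_label, right_hits = Counter(right).most_common(1)[0]
        errors = len(pairs) - left_hits - right_hits
        if best is None or errors < best[3]:
            best = (threshold, left_label, right_label, errors)
    return best


def predict_stump(stump, value):
    """Apply the single split to a new attribute value."""
    threshold, left_label, right_label, _ = stump
    return left_label if value <= threshold else right_label


# Hypothetical toy values for one biomarker; not the study data.
values = [4.0, 6.5, 9.0, 12.0, 25.0, 40.0, 55.0, 70.0]
labels = ["normal"] * 4 + ["abnormal"] * 4
stump = fit_stump(values, labels)
```

On these toy values the stump separates the two groups with one threshold, so `predict_stump(stump, 10.0)` falls on the "normal" side of the split.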
For more details about the implementation of Decision Stump, the reader is referred to [18].

The OneR classifier is a simple but accurate classifier. The OneR algorithm generates one rule for each predictor in the data by forming a frequency table for each predictor against the target [19]. Afterwards, the rule with the minimal total error is selected. Fig. 3 illustrates the pseudocode for the OneR algorithm [12].

Fig. 2. Algorithm for Naïve Bayes classifier.

Fig. 3. The pseudocode for OneR algorithm.

The dataset used in this study has an imbalanced number of instances across the three classes. Class imbalance may limit the performance of machine learning classifiers [20]. In addition, different classification techniques are sensitive to imbalanced data when the samples of one class in a dataset outnumber the samples of the other class. This may lead to biased models due to overfitting. To tackle the issue of class imbalance, we used the Synthetic Minority Over-Sampling Technique (SMOTE), proposed by Chawla et al. [21], to balance the number of instances per class in both classification stages. In SMOTE, k-nearest neighbor (kNN) is used to generate synthetic instances to oversample the minority classes, while the size of the majority class is kept the same [21]. The Weka filter ‘SMOTE’ was used for both classification stages, with the number of nearest neighbors set to the default value of 5. After applying SMOTE, the total number of instances became 256 for stage 1 and 221 for stage 2.

To evaluate the performance of the breast classification, exhaustive Leave-One-Out (LOO) Cross-Validation (CV) was used in each stage of the classification. LOO CV helps to prevent overfitting and bias in the evaluation of classification performance, although it requires more computation than 10-fold CV since every instance in the dataset is held out once.
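The oversampling and evaluation steps described above can be sketched as follows. This is a simplified illustration under stated assumptions and not the Weka ‘SMOTE’ filter: the synthetic points are obtained by linear interpolation between a minority instance and one of its k nearest minority neighbors, the 1-nearest-neighbor classifier is only a placeholder (it is not one of the four classifiers used in this study), and the two-attribute toy data are hypothetical. The sketch follows the order described in the text, oversampling first and then running LOO over the balanced set.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    instance and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: euclidean(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def predict_1nn(train_x, train_y, query):
    """Placeholder classifier: label of the closest training instance."""
    nearest = min(range(len(train_x)),
                  key=lambda i: euclidean(train_x[i], query))
    return train_y[nearest]


def loo_accuracy(xs, ys):
    """Leave-one-out: hold out each instance once, predict it from the rest."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict_1nn(train_x, train_y, xs[i]) == ys[i]:
            hits += 1
    return hits / len(xs)


# Hypothetical two-attribute toy data; not the study data.
majority = [[50.0 + i, 8.0 + 0.1 * i] for i in range(12)]
minority = [[6.0 + i, 1.5 + 0.1 * i] for i in range(6)]
new_points = smote(minority, n_new=len(majority) - len(minority))
minority_balanced = minority + new_points

xs = majority + minority_balanced
ys = ["abnormal"] * len(majority) + ["normal"] * len(minority_balanced)
accuracy = loo_accuracy(xs, ys)
```

Because each synthetic point lies on a segment between two existing minority instances, the oversampled class stays within the range already covered by the original minority data.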
Classification accuracy, precision and recall were calculated from the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

Furthermore, we calculated the Matthews correlation coefficient (MCC) [22], given in Eq. 4, which is an indicator of classification quality. Its value lies in the range -1 to 1, and an MCC of 1 indicates an excellent classification.

$$\text{MCC} = \frac{TP/N - S \times P}{\sqrt{P \, S \, (1 - S)(1 - P)}} \qquad (4)$$

where N = TP + FP + TN + FN and the values of P and S are given below:

$$P = \frac{TP + FP}{TP + FP + TN + FN} \qquad (5)$$

$$S = \frac{TP + FN}{TP + FP + TN + FN} \qquad (6)$$

C. Selecting the Best Attributes for Breast Cancer Classification

In this study, Correlation-based Feature Subset Selection (CFS) [23] was used, for exploratory and interpretability purposes, to select the best attributes for each stage of the breast cancer classification. The underlying idea is that features which are highly correlated with the class label but uncorrelated with each other form the basis of a good feature set. A feature evaluation formula is developed from this idea. CFS then combines the evaluation formula with a heuristic search strategy and a suitable correlation measure. The best classifier from the previous analysis (Section 2.B) was used with the Weka function AttributeSelectedClassifier and CFS (Weka evaluator ‘CfsSubsetEval’), with the search method set to ‘Best first’. It should be noted that the ‘SMOTE’ filter in Weka was also used in this part of the analysis to oversample the minority class for both stage-1 and stage-2.

3. Results and Discussion

The minimum (Min), maximum (Max), mean and standard deviation (STD) of the 20 attributes in this study are shown in Table 1 for the normal subjects, Table 2 for the malignant patients and Table 3 for the benign patients.
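Before turning to the attribute statistics, the metric definitions in Section 2.B can be checked numerically. The sketch below applies them to the stage-1 confusion matrix counts reported in Table 6 (116, 15, 0 and 125). Two points are assumptions on our part and are not stated in the text: ‘abnormal’ is taken as the positive class for TP, FN, FP and TN, and the precision and recall reported per classifier are taken to be class-size-weighted averages of the per-class values. The function names are hypothetical.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and MCC for one choice of positive class."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    p = (tp + fp) / n  # fraction of instances predicted positive
    s = (tp + fn) / n  # fraction of instances that are actually positive
    mcc = (tp / n - s * p) / math.sqrt(p * s * (1 - s) * (1 - p))
    return accuracy, precision, recall, mcc


def weighted_average(per_class_values, class_sizes):
    """Average of per-class values weighted by the number of actual instances."""
    total = sum(class_sizes)
    return sum(v * size for v, size in zip(per_class_values, class_sizes)) / total


# Stage-1 counts from Table 6, with 'abnormal' assumed to be the positive class.
acc, prec_abn, rec_abn, mcc = binary_metrics(tp=116, fn=15, fp=0, tn=125)

# The same matrix read with 'normal' as the positive class.
_, prec_nor, rec_nor, _ = binary_metrics(tp=125, fn=0, fp=15, tn=116)

# Actual class sizes after SMOTE: 131 abnormal and 125 normal instances.
sizes = [116 + 15, 0 + 125]
weighted_precision = weighted_average([prec_abn, prec_nor], sizes)
weighted_recall = weighted_average([rec_abn, rec_nor], sizes)
```

Under these assumptions the computed accuracy, MCC and weighted precision and recall round to the Naïve Bayes values listed for stage-1 in Table 4.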
There are large differences between the three groups for some attributes (such as CA15-3 B, CEA-B and CA15-3 S), while other attributes show smaller differences (such as TBF and PH-S).

Table 1, The Min, Max, Mean and STD values of the 20 attributes for normal subjects (N=50).

Attribute                   Min      Max      Mean     STD
Age                         25       67       44.26    11.44
BMI                         21.2     26.4     23.70    1.35
TBF                         25.91    46.9     33.67    4.48
WHR                         0.63     0.95     0.78     0.08
Mens. cycles/year           10       14       12.52    0.74
Dur. of contracept./year    0        9        0.50     1.66
Mens. cycle status          0        1        0.06     0.24
FUT2 gene type              0        1        0.72     0.45
Lewis blood type            0        2        1.24     0.52
Es-B                        11       97       50.48    25.70
Pg-B                        0.73     4        1.81     0.99
CA15-3 B                    3.11     15.3     6.61     2.74
CEA-B                       0.59     2.94     1.78     0.57
TP-B                        6.02     7.3      6.47     0.35
Es-S                        3.18     28.1     14.62    7.43
Pg-S                        0.24     1.8      0.68     0.41
CA15-3 S                    0.5      2.68     0.79     0.41
CEA-S                       0.5      0.5      0.50     0.00
TP-S                        0.05     0.28     0.14     0.04
PH-S                        6.9      7.4      7.23     0.12

Table 2, The Min, Max, Mean and STD values of the 20 attributes for malignant patients (N=111).

Attribute                   Min      Max      Mean     STD
Age                         26       73       54.21    8.14
BMI                         21.4     34.4     26.90    2.86
TBF                         27.34    53.5     39.53    4.16
WHR                         0.67     1.1      0.88     0.11
Mens. cycles/year           10       15       12.66    0.79
Dur. of contracept./year    0        18       3.52     5.08
Mens. cycle status          0        1        0.31     0.46
FUT2 gene type              0        1        0.39     0.49
Lewis blood type            0        2        1.30     0.75
Es-B                        182      384      270.86   50.27
Pg-B                        0.62     4.9      1.42     1.16
CA15-3 B                    18       79       52.44    18.79
CEA-B                       2.7      14.3     8.71     2.82
TP-B                        7        10.8     8.76     1.22
Es-S                        52       111.2    77.68    14.78
Pg-S                        0.2      1.8      0.54     0.46
CA15-3 S                    2.57     13.57    8.98     3.27
CEA-S                       0.5      2.7      1.21     0.56
TP-S                        0.61     1.52     0.91     0.27
PH-S                        5.4      7.2      6.06     0.58

Table 3, The Min, Max, Mean and STD values of the 20 attributes for benign patients (N=20).

Attribute                   Min      Max      Mean     STD
Age                         28       58       45.15    7.86
BMI                         21.1     32       25.63    3.18
TBF                         26.36    44.91    35.17    5.01
WHR                         0.69     0.92     0.81     0.07
Mens. cycles/year           11       13       12.10    0.45
Dur. of contracept./year    0        5        0.55     1.39
Mens. cycle status          0        1        0.20     0.41
FUT2 gene type              0        1        0.60     0.50
Lewis blood type            0        2        1.25     0.55
Es-B                        18       96       59.75    30.89
Pg-B                        0.65     4.2      2.29     1.32
CA15-3 B                    6.59     13.6     10.49    1.80
CEA-B                       1.02     5.71     2.46     1.15
TP-B                        6.1      7.6      6.78     0.55
Es-S                        5.3      27.8     17.24    8.93
Pg-S                        0.2      1.6      0.84     0.54
CA15-3 S                    0.69     2.35     1.51     0.47
CEA-S                       0.5      0.68     0.51     0.04
TP-S                        0.12     0.28     0.17     0.04
PH-S                        6.4      7.3      7.00     0.27

To investigate the performance of the four machine learning classifiers, Tables 4 and 5 show the results of the stage-1 and stage-2 classification in terms of precision, recall, accuracy and MCC for the logistic regression, Naïve Bayes, decision tree and OneR classifiers. Naïve Bayes outperformed the other classifiers in stage-1, which separates normal subjects from abnormal subjects who had tumors. In stage-2, which separates malignant from benign patients, it achieved 100% accuracy, as did OneR and the decision tree.

Table 4, The results of stage-1 classification using different classifiers with LOO cross validation. The best performer is Naïve Bayes.

Classifier       Precision   Recall   Accuracy (%)   MCC
OneR             0.914       0.914    91.4           0.828
Logistic         0.934       0.934    93.4           0.867
Naïve Bayes      0.948       0.941    94.1           0.889
Decision Tree    0.930       0.930    93             0.860

Table 5, The results of stage-2 classification using different classifiers with LOO cross validation.

Classifier       Precision   Recall   Accuracy (%)   MCC
OneR             1           1        100            1
Logistic         0.996       0.995    99.5           0.991
Naïve Bayes      1           1        100            1
Decision Tree    1           1        100            1

The confusion matrices (CM) for the classification of breast cancer attributes with the Naïve Bayes classifier (the best performer) are given in Tables 6 and 7. The diagonal of each CM shows the correctly classified cases, while the misclassifications are shown off the diagonal. Only 15 of the 256 cases (with SMOTE) were misclassified in stage 1 (Table 6), which gives an accuracy of (116 + 125)/256 = 94.1%, while all 221 malignant and benign cases (with SMOTE) were classified correctly in stage 2, as shown in Table 7.

Table 6, Confusion matrix of the Naïve Bayes classifier for stage 1 after using SMOTE.

                 Predicted class
Actual group     Abnormal    Normal
Abnormal         116         15
Normal           0           125

Overall accuracy = 94.1%

Table 7, Confusion matrix of the Naïve Bayes classifier for stage 2 of the classification after using SMOTE.

                 Predicted class
Actual group     Malignant   Benign
Malignant        111         0
Benign           0           110

Overall accuracy = 100%

The study in [9] investigated 42 attributes and selected 26 of them for the classification of three classes of breast tumors. Unlike [9], where the classes were imbalanced, we used SMOTE to balance the classes. Imbalanced classes may cause overfitting, despite the high accuracy of 98% obtained in that work.

To find the breast attributes that have the most influence on the classification of breast cancer in stages 1 and 2, we used CFS [23]. Table 8 shows the best ranked selected attributes. CFS selected 10 attributes for stage-1 and 11 attributes for stage-2. It is worth noting that the classification accuracy with the best selected attributes with LOO was 94.9%, slightly higher than that of the full set of attributes (94.1%). In stage-1, for the classification of normal or abnormal cases, BMI, menstrual cycle status and FUT2 gene type were the 3 highest ranked attributes selected by CFS. In stage-2, for the classification of benign and malignant cases, the highest ranked attributes were age, WHR and number of menstrual cycles per year. It can also be noted that five attributes are shared between stage-1 and stage-2: CEA-B, CA15-3 B, CA15-3 S, TP-B and Es-S. CEA-B is a tumor biomarker derived from blood; it is not specific to breast tumors. CA15-3 B, however, is regarded as a tumor biomarker and is a protein biomarker specific to breast cancer. Moreover, CA15-3 S and Es-S are newly derived breast tumor biomarkers obtained from saliva. They are highly promising since they are non-invasive and based on an easy-to-acquire salivary sample.

Table 8, The best ranked attributes selected with CFS with the Naïve Bayes classifier. The attributes that are common to the two classification stages are marked with *.

Rank   Stage-1               Stage-2
1      BMI                   Age
2      Mens. cycle status    WHR
3      FUT2 gene type        Mens. cycles/year
4      CA15-3 B *            Lewis blood type
5      CEA-B *               Es-B
6      TP-B *                CA15-3 B *
7      Es-S *                CEA-B *
8      Pg-S                  TP-B *
9      CA15-3 S *            Es-S *
10     PH-S                  CA15-3 S *
11                           TP-S

4. Conclusion

In this paper, 20 breast cancer attributes were collected from Iraqi women and used to test the performance of a two-stage classification with four classifiers. In the first stage, the attributes are classified into normal and abnormal cases. If the case is abnormal, the second stage of the classification predicts whether the patient has a malignant or benign tumor. The Synthetic Minority Over-Sampling Technique (SMOTE) was used to deal with the problem of class imbalance. A classification accuracy of 94% for stage-1 and 100% for stage-2 was achieved with the Naïve Bayes classifier and LOO cross-validation. The level of CA15-3 protein in blood (CA15-3 B) and saliva (CA15-3 S), carcinoembryonic antigen level (CEA-B) and total protein (TP-B) in blood, and Estrogen level (Es-S) in saliva were the best features selected with CFS for both classification stages, which shows the importance of these parameters in predicting malignant breast tumors.

Acknowledgments

The first author would like to thank Dr Khalid Mahdi Salah, University of Mustansiriyah, for his comments on the data collection.
The authors are grateful to the reviewers for their insightful comments.

Conflicts of Interest Disclosure

Authors declare that there are no conflicts of interest.

5. References

[1] N. A. S. Alwan, “Breast cancer: demographic characteristics and clinico-pathological presentation of patients in Iraq,” 2010.
[2] M. S. Dawood and A. A. Mohammed, “Breast Tumor Diagnosis Using Diode Laser in Near Infrared Region,” Al-Khwarizmi Eng. J., vol. 5, no. 2, pp. 20–31, 2009.
[3] M. Akram, M. Iqbal, M. Daniyal, and A. U. Khan, “Awareness and current knowledge of breast cancer,” Biol. Res., vol. 50, no. 1, p. 33, 2017.
[4] G. N. Sharma, R. Dave, J. Sanadya, P. Sharma, and K. K. Sharma, “Various types and management of breast cancer: an overview,” J. Adv. Pharm. Technol. Res., vol. 1, no. 2, p. 109, 2010.
[5] N. L. Henry and D. F. Hayes, “Cancer biomarkers,” Mol. Oncol., vol. 6, no. 2, pp. 140–146, 2012.
[6] D. F. Hayes, “Biomarker validation and testing,” Mol. Oncol., vol. 9, no. 5, pp. 960–966, 2015.
[7] M. Çınar, M. Engin, E. Z. Engin, and Y. Ziya Ateşçi, “Early prostate cancer diagnosis by using artificial neural networks and support vector machines,” Expert Syst. Appl., vol. 36, no. 3, Part 2, pp. 6357–6361, 2009.
[8] V. Chaurasia, S. Pal, and B. B. Tiwari, “Prediction of benign and malignant breast cancer using data mining techniques,” J. Algorithm. Comput. Technol., vol. 12, no. 2, pp. 119–126, 2018.
[9] S. F. Behadili, M. S. Abd, I. K. Mohammed, and M. M. Al-Sayyid, “Analyzing Breast Cancer Data for Iraqi Women using Data Mining Techniques,” in 3rd International Medical Education CONGRESS, 2018.
[10] A. Mert, N. Kılıç, E. Bilgili, and A. Akan, “Breast cancer detection with reduced feature set,” Comput. Math. Methods Med., vol. 2015, 2015.
[11] M. K. Abd-Ellah, A. I. Awad, A. A. M. Khalaf, and H. F. A. Hamed, “Design and implementation of a computer-aided diagnosis system for brain tumor classification,” in 2016 28th International Conference on Microelectronics (ICM), 2016, pp. 73–76.
[12] I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
[13] G. H. John and P. Langley, “Estimating Continuous Distributions in Bayesian Classifiers,” in Eleventh Conference on Uncertainty in Artificial Intelligence, 1995, pp. 338–345.
[14] R. C. Holte, “Very simple classification rules perform well on most commonly used datasets,” Mach. Learn., vol. 11, pp. 63–91, 1993.
[15] C. Sammut and G. I. Webb, Eds., “Logistic Regression BT - Encyclopedia of Machine Learning,” Boston, MA: Springer US, 2010, p. 631.
[16] M. Karabatak, “A new classifier for breast cancer detection based on Naïve Bayesian,” Measurement, vol. 72, pp. 32–36, 2015.
[17] M. F. A. Saputra, T. Widiyaningtyas, and A. P. Wibawa, “Illiteracy Classification Using K Means-Naïve Bayes Algorithm,” JOIV Int. J. Informatics Vis., vol. 2, no. 3, pp. 153–158, 2018.
[18] C. Sammut and G. I. Webb, Eds., “Decision Stump BT - Encyclopedia of Machine Learning,” Boston, MA: Springer US, 2010, pp. 262–263.
[19] S. Sayad, “Tutorial on OneR classifier.” [Online]. Available: http://www.saedsayad.com/oner.htm. [Accessed: 28-Jan-2019].
[20] N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intell. data Anal., vol. 6, no. 5, pp. 429–449, 2002.
[21] N. V Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.
[22] P. Baldi, S. Brunak, Y. Chauvin, C. A. F. Andersen, and H. Nielsen, “Assessing the accuracy of prediction algorithms for classification: an overview,” Bioinformatics, vol. 16, no. 5, pp. 412–424, 2000.
[23] M. A. Hall, “Correlation-based feature selection for machine learning,” 1999.