JOURNAL OF ENGINEERING RESEARCH AND TECHNOLOGY, VOLUME 4, ISSUE 1, MARCH 2017

A New Model in Arabic Text Classification Using BPSO/REP-Tree

Hamza Naji 1, Wesam Ashour 2 and Mohammed Alhanjouri 3
1, 2, 3 Department of Computer Engineering, Islamic University of Gaza, Palestine.

Abstract—Assigning a title or a specific category to a single page of text is a fairly easy task, but when many such pages accumulate into a huge collection of documents, the process becomes difficult and exhausting for the human mind. Automatic text classification solves this problem by identifying a category for each document automatically. This can be achieved with machine learning, by building a model over the candidate attributes (features) of the text. As the number of attributes grows into the thousands, however, we must pick out the distinguishing features. To deal with the high dimensionality of the original dataset, we use a feature selection process that deletes the irrelevant attributes (words) while the remaining features still carry the information needed for classification. In this research, a new approach, Binary Particle Swarm Optimization (BPSO) combined with the Reduced Error Pruning Tree (REP-Tree), is proposed to select the feature subset for the Arabic classification process. We compare the proposed approach with two existing approaches: BPSO with K-Nearest Neighbor (KNN) and BPSO with Support Vector Machine (SVM). After obtaining the attribute subset produced by feature selection, we use three common classifiers, the J48 decision tree, SVM, and the prepared algorithm REP-Tree (as a classifier), to build the classification model. We created our own Arabic dataset, the BBC Arabic News dataset, collected from the BBC Arabic website, and we also use one existing dataset in our experiments, the Alkhaleej News dataset. Finally, we present the experimental results, which show that the proposed algorithm is promising in this area of research.

Index Terms—Text classification, BPSO, REP-Tree, Binary Particle Swarm Optimization.

I. INTRODUCTION

The huge increase of text in electronic devices, and on web sites in particular, is a motivation for categorizing these texts automatically, since human capacity is insufficient to handle them manually. The core task is called Text Categorization or Text Classification (TC): classifying huge groups of texts, each group called a text dataset or corpus, into some predefined classes. For a news dataset, for example, the classes can be Sport, Health, and other classes based on the documents' contents.

Text classification in general consists of two phases. The first is the preprocessing phase, which operates on the collection of texts to make improvements that reduce the unnecessary terms. The preprocessing phase also reduces the extra forms of one term through a process called stemming.
Stemming eliminates the derived forms of one base word, such as turning the words "making" and "makes" into their root "make". As another example, the words (argue, argued, argues, arguing) are turned into the stem "argu", while (argument, arguments) are turned into the stem "argument". In practice, the preprocessing phase removes certain prefixes and suffixes from the word instead of extracting the original root.

The second phase of the text classification process is the classification step: classifying the text preprocessed in the previous phase and representing the corpus with a mechanism called a classifier. To apply these two phases, we need to convert each dataset into a term vector, the basic representation in text processing [1]. But how many terms each dataset needs, and which terms, is a question to be answered. That question leads us to add a further step to the text classification process, applied here to Arabic text classification.

There is a middle step between preprocessing and classification called "feature selection" [2]. It is a complementary process performed after preprocessing to reduce the redundant terms (features) while keeping terms sufficient to continue the classification process [3]. We demonstrate a combination of Binary Particle Swarm Optimization (BPSO) and the Reduced Error Pruning Tree (REP-Tree) for this step of selecting good feature sets for the Arabic TC task. We then use the second half of the hybridized approach, the REP-Tree, as a classifier, as mentioned above.

Text classification can be done comparatively easily for English because of the language's smooth morphology. In contrast, Arabic is a complex language that contains many formations and many different forms of the same word. This difficulty requires greater effort when classifying Arabic texts. This paper focuses on the classification of Arabic text; a further complication is that the Arabic script and expressive style are also employed in other languages such as Persian and Urdu and in regional languages of Pakistan and Afghanistan. Arabic content constitutes about 3% of web text and ranks fourth among languages used online [4]. That amount of content needs accurate and effective classification to be easily usable, so over the last ten years the need for effective and accurate classification has grown quickly.

Several classification algorithms used in general text classification can be applied to Arabic, such as Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbor (KNN), Maximum Entropy (ME), Artificial Neural Network (ANN), Decision Tree (DT), and the Rocchio feedback algorithm. More recently, the Reduced Error Pruning tree (REP-Tree) has been investigated for Arabic TC. REP-Tree is a fast decision tree learner that builds a decision/regression tree using information gain (or variance reduction) as the splitting criterion and prunes it using reduced error pruning [5].
REP-Tree was first used in Indian and English text classification in 2015 and 2012 [6], [7]. The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 explains BPSO concepts. Section 4 explains the second part of the proposed approach, REP-Tree. Section 5 presents the proposed work. Section 6 presents the results, and finally Section 7 concludes the paper.

II. RELATED WORKS

In the discussion below, we focus on works addressing Arabic TC. Since the number and quality of features used to represent texts has a direct effect on classification algorithms, the following also discusses the main goal of feature reduction and selection and their impact on TC.

(Brahimi, Touahria and Tari, 2016) [8] addressed sentiment analysis for tweets in the Arabic language using two freely available datasets (2000 tweets). They applied light and root stemmers in the preprocessing phase and investigated the impact of reducing the dataset size, by selecting the most relevant features, on the efficiency and accuracy of three widely used machine learning algorithms: SVM, NB, and KNN.

(Oraby, El-Sonbaty and El-Nasr, 2013) [9] studied the impact of stemming by applying the Khoja stemmer [10], the Information Science Research Institute (ISRI) stemmer [11], and the Tashaphyne light Arabic stemmer [12] to two datasets for the opinion classification problem; the results show that the Khoja stemmer performs best.

(Shoukry and Rafea, 2012) [13] applied the SVM and NB classifiers to a dataset collected from Twitter. They ran the experiments on two collections of Arabic tweets, and the results showed that SVM outperformed NB.

(Al-Thwaib, 2014) [14] used the Sakhr summarizer (Sakhr company website, 2016) as a feature selector to choose the best words of each document instead of using all words, together with the TF feature. After TF-based feature selection, documents were classified with an SVM classifier. The dataset consists of 800 Arabic text documents, a subset of a 60913-document corpus collected from many newspapers and other web sites. Accuracy increased when the summarized corpus was used as input to the SVM classifier.

(Al-Hindi and Al-Thwaib, 2013) [15] compared two datasets of 1000 Arabic documents each, applying text summarization to one but not the other. Accuracy did not improve much, but there was a difference in time: with summarized documents, less time was needed to build the learning model.

(Abu-Errub, 2014) [16] proposed a method that classifies an Arabic document by comparing it with predefined document categories based on its contents using the Term Frequency Times Inverse Document Frequency (TF.IDF) measure; the document is then classified into the appropriate sub-category using the Chi-square measure. The dataset contained 1090 training documents and 500 testing documents in ten main categories. The results show that the proposed algorithm can classify Arabic text datasets into the predefined categories.

(Goweder, Elboashi and Elbekai, 2013) [17] used their developed Centroid-based technique to classify Arabic text.
The proposed algorithm was evaluated on a dataset of 1400 Arabic documents collected from 7 different classes. The results show that the adapted Centroid-based algorithm can classify Arabic documents without problems. The micro-averaged recall, precision, F-measure, accuracy, and error rate were 90.7%, 87.1%, 88.9%, 94.8%, and 5.2% respectively.

(Abidi and Elberrichi, 2012) [18] presented a comparative study assessing the effect of a conceptual representation of the text. K-Nearest Neighbor was used, and feature extraction was performed under three preprocessing schemes: bag of words, N-grams, and a conceptual representation. The F-measures were 64% for bag of words, 68% for N-grams, and 74% for the conceptual representation; the conceptual representation thus gave the best results.

(Raho, Al-Shalabi, Kanaan and Nassar, 2015) [19] investigated the importance of feature selection in Arabic corpus classification by comparing the performance of different classifiers with and without stemming. The dataset was collected from the BBC Arabic website, and the classifiers were DT, KNN, the Naive Bayesian Model (NBM), and NB; the evaluation measures were precision, recall, F-measure, accuracy, and time. The accuracies were: DT 99.4%, KNN 66.3%, NBM 92%, and NB 91.9%.

(Mohammad, Al-Momani and Alwada, 2016) [20] provided a comparative study of Arabic text classification with three classifiers (K-Nearest Neighbor, C4.5 decision trees, and the Rocchio classifier) applied to a collected Arabic dataset of 1400 documents in 8 categories. Measured by precision and recall, K-Nearest Neighbor averaged 80% recall and 83% precision, while the Rocchio classifier averaged 88% recall and 82% precision; both outperformed C4.5, which averaged 64% recall and 67% precision.

(Kanan and Fox, 2015) [21] introduced a new stemming approach for Arabic text classification: a tailored stemming model, a new Arabic light stemmer, used with an SVM classifier. The experiments, run with 10-fold cross-validation, gave the following per-class results after applying SVM: Art and Culture 91.8%, Economics 93.5%, Politics 91.5%, and Society 99.1%.

(Al-Anzi and Abuzeina, 2016) [22] grouped similar unlabeled documents into a pre-specified number of topics using Latent Semantic Indexing (LSI) and Singular Value Decomposition (SVD). Their corpus contains 1000 documents in 10 topics, 100 documents per topic. The results showed that the EM method outperformed the other methods, with an average categorization accuracy of 89%.
(Zubi, 2009) [23] applied Arabic classification techniques to web content, with the general purpose of comparing two classifiers: KNN and NB. As described by the author, a corpus of Arabic text documents was collected from online Arabic newspaper archives, including Al-Jazeera, Al-Nahar, Al-Hayat, Al-Ahram, and Al-Dostor, as well as a few other specialized websites: 1562 documents classified into 6 categories. The comparison showed that KNN, with an average accuracy of 86.02%, outperformed NB with 77.03%.

(Zrigui, Ayadi, Mars and Maraoui, 2012) [24] developed a new model based on Latent Dirichlet Allocation (LDA) and SVM; they used LDA to sample "topics" from groups of texts. The results showed that the proposed LDA-SVM algorithm achieves high effectiveness for the Arabic text classification task (macro-averaged F1 88.1% and micro-averaged F1 91.4%).

III. BINARY PARTICLE SWARM OPTIMIZATION (BPSO)

Before discussing BPSO as a feature selection algorithm, we first explain the word "swarm" in the full name of the PSO algorithm, "Particle Swarm Optimization". What is a swarm, and where did the name come from? Collective behavior in some organisms has inspired researchers to develop successful problem-solving theories based on such seemingly random life, including DNA computing, membrane algorithms, the Particle Swarm Optimization algorithm, artificial immune systems, and Ant Colony Optimization.

Particle Swarm Optimization was developed in 1995 by Eberhart and Kennedy [25]. The idea is built on the collective behavior of flocks of birds. PSO is a stochastic optimization algorithm whose candidate solutions, called particles, occupy positions in the search space. Each particle holds an initial random velocity within the search space, denoted $V_i = (V_{i1}, V_{i2}, \dots, V_{iN})$, and each particle is denoted $P_i = (P_{i1}, P_{i2}, \dots, P_{iN})$. A particle updates its velocity according to its own experience and the experiences of the other particles. The best particle in the search space (the swarm) is called the global best, denoted g. Once the velocity has been updated, the particle finds its new position using the latest velocity according to the following equations [26]. The main equation is:

$X_{id} = X_{id} + V_{id}$   (1)

(new position = current position + new velocity)

$V_{id} = \omega V_{id} + c_1 \, \mathrm{rand}() \, (P_{id} - X_{id}) + c_2 \, \mathrm{rand}() \, (P_{gd} - X_{id})$   (2)

where rand() is a random number in (0, 1) [27]; $c_1$ and $c_2$ are acceleration factors, usually $c_1 = c_2 = 2$; $P_{gd}$ is the global best; and $V_{id}$ is the velocity of the particle [28]. $X_i$ is the current position of the particle, initialized with random binary values, where 0 means the corresponding feature is not selected and 1 means the feature is selected. $P_i$ is the best previous position of the particle, initialized to the same value as $X_i$.
$V_i$ is the velocity of $P_i$. If there were no previous velocity, the particles would navigate to the same (current) position, which amounts to a local search; with a new velocity, a particle extends its search globally. Problems arise from this trade-off, and the inertia weight $\omega$ solves them by balancing local and global search. A sequence of experiments reported in the literature [27] gives 1.2 as the best value of $\omega$.

In Binary PSO, the particle position is a binary vector, so how do binary vectors interact with velocities? [29] provided equations that treat the velocity, a real-valued vector whose values are mapped into (0, 1), as a set of probabilities. On this basis we can use BPSO to select the relevant features in Arabic text classification. As given in [30], the probability of a bit changing is determined by the following:

$S(V_{id}) = \dfrac{1}{1 + e^{-V_{id}}}$   (3)

if $\mathrm{rand}() < S(V_{id})$ then $X_{id} = 1$; else $X_{id} = 0$   (4)

where rand() is a random number in (0, 1) [27]; $c_1$ and $c_2$ are acceleration factors, usually $c_1 = c_2 = 2$; $P_{gd}$ is the global best; and $V_{id}$ is the velocity of the particle [29].

IV. REDUCED ERROR PRUNING TREE (REP-TREE)

More recently, the Reduced Error Pruning tree (REP-Tree) has been investigated for Arabic TC [31]. REP-Tree is a fast decision tree learner that builds a decision/regression tree using information gain (or variance reduction) as the splitting criterion and prunes it using reduced error pruning. REP-Tree was first used in Indian and English text classification in 2015 [32] and 2012 [33].

REP-Tree first runs the training process on the existing dataset and builds the training model by decisions; it then combines results for instances from the first learning step with those from the pruning set, a held-out part of the dataset used for post-pruning the tree, before performing the test process. For any sub-tree of the tree, if replacing it by a node or leaf does not produce more prediction errors on the pruning set than the original sub-tree, the sub-tree is replaced by a leaf. That means REP-Tree prunes each node after the natural classification, provided the misclassification error determined for the instances of the pruning set is not larger than the misclassification error rate computed on the original training data. This is illustrated in Figure (1) below.

Figure 1. Misclassification detection in the pruning set of REP-Tree (binary sample) [34], using the pruning set shown in the following table:

TABLE 1. A sample pruning set

Category  X  Y  Z
A         0  0  1
B         0  1  1
B         1  1  0
B         1  0  0
A         1  1  1
B         0  0  0

Figure 2. The final REP-tree

The REP-tree works bottom-up, beginning at node three. Node three can be turned into a leaf, which makes fewer errors on the pruning set than it does as a sub-tree. As a sub-tree (before pruning), classification occurs at nodes four and five; one error happens at node five, but no errors happen when node three is a leaf. The same holds for node six and node nine. However, node two cannot be made into a leaf, since as a leaf it makes one error, while as a sub-tree, with the newly created leaves three and six, it makes no errors, as shown in Figure (2).
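To make this pruning rule concrete, the following is a minimal sketch of reduced error pruning on a toy tree over the attributes of Table (1). The dict-based tree representation and all function names are illustrative assumptions, not REP-Tree's actual Weka implementation; a real REP-Tree also grows the tree by information gain, which is omitted here.

```python
def classify(node, x):
    """Route an instance down the tree until a leaf label is reached."""
    while isinstance(node, dict):          # internal node: test one attribute
        node = node["yes"] if x[node["feature"]] == 1 else node["no"]
    return node                            # leaf: a class label

def errors(node, pruning_set):
    """Count misclassifications of a (sub)tree on the pruning set."""
    return sum(1 for x, y in pruning_set if classify(node, x) != y)

def majority_label(pruning_set):
    labels = [y for _, y in pruning_set]
    return max(set(labels), key=labels.count)

def prune(node, pruning_set):
    """Bottom-up: replace a sub-tree by a leaf whenever the leaf makes no
    more errors on the pruning set than the sub-tree does."""
    if not isinstance(node, dict) or not pruning_set:
        return node
    yes_set = [(x, y) for x, y in pruning_set if x[node["feature"]] == 1]
    no_set = [(x, y) for x, y in pruning_set if x[node["feature"]] == 0]
    node["yes"] = prune(node["yes"], yes_set)
    node["no"] = prune(node["no"], no_set)
    leaf = majority_label(pruning_set)
    if errors(leaf, pruning_set) <= errors(node, pruning_set):
        return leaf                        # pruning does not hurt: collapse
    return node

# Pruning set from Table 1: attributes X, Y, Z and categories A/B.
pruning_set = [({"X": 0, "Y": 0, "Z": 1}, "A"), ({"X": 0, "Y": 1, "Z": 1}, "B"),
               ({"X": 1, "Y": 1, "Z": 0}, "B"), ({"X": 1, "Y": 0, "Z": 0}, "B"),
               ({"X": 1, "Y": 1, "Z": 1}, "A"), ({"X": 0, "Y": 0, "Z": 0}, "B")]
tree = {"feature": "Z",
        "yes": {"feature": "X",
                "yes": {"feature": "Y", "yes": "A", "no": "B"},
                "no": {"feature": "Y", "yes": "B", "no": "A"}},
        "no": "B"}
print(prune(tree, pruning_set))
```

Running the sketch collapses exactly those nodes whose replacement leaf makes no more errors on the pruning set, mirroring the node-by-node decisions described for Figure (2).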
Pruning also solves the sub-tree replication problem that arises once a decision tree starts splitting, defined as: "When sub tree replication occurs, identical sub trees can be found at several different places in the same tree structure" [28].

V. PROPOSED WORK

In this section, the whole Arabic text classification process is explained; the work is then divided into a collection of systems, each with its own combination of the components explained in the previous sections, to produce the final classification process after preparing the dataset.

Arabic Text Datasets

This subsection presents the datasets used in the experiments of our paper.

BBC-Arabic News Dataset

The first dataset contains 4680 BBC-Arabic news documents, classified into the predefined categories {'Middle East', 'World News', 'Business', 'Sport', 'Newspapers', 'Science', 'Misc.'}. We manually chose a random set of 3000 of the existing documents; classification in all documents is of the "single label" type. Table (2) shows the division of the documents into the seven preset categories. Note that the BBC-Arabic dataset was collected during our work, while the other dataset used, Alkhaleej News, already exists in the literature (Arabic Corpora - Mourad Abbas; Arabic Corpora - Alj-News).

TABLE 2. The division of the BBC-Arabic news dataset based on a 60% training set

#  Class        Training Set  Testing Set  Full Dataset
1  Middle East  630           420          1050
2  World News   222           148          370
3  Business     124           82           206
4  Sport        348           232          580
5  Newspapers   234           155          389
6  Science      141           94           235
7  Misc.        102           68           170
   Total        1801          1199         3000

Alkhaleej News Dataset

The second dataset contains 5690 documents from the Alkhaleej News dataset (Arabic Corpora - Mourad Abbas; Arabic Corpora - Alj-News), classified into the predefined categories {'International News', 'Local News', 'Sport', 'Economy'}. We chose a random set of 2770 documents, again with single-label classification in all documents (Abbas, Smaili 2005). Table (3) shows the division of the documents into the four preset categories.

TABLE 3. The division of the Alkhaleej News dataset based on a 60% training set

#  Class               Training Set  Testing Set  Full Dataset
1  Local News          630           400          1030
2  International News  480           320          800
3  Economy             264           176          440
4  Sport               300           200          500
   Total               1674          1096         2770

The tables above show that the data is partitioned into a learning part and a testing part based on 60% for learning; this split is available in the Weka tool, which offers many options for the purpose.

The Proposed Systems

In this section, we define a set of systems combining the processes listed in the previous section; a comparison is then performed between all the existing combinations, treated as independent systems, and the results are extracted in the next section.

System A: Binary Particle Swarm Optimization and K-Nearest Neighbor

System A is the first proposed system. It classifies Arabic documents using the three main processes mentioned above: preprocessing, feature selection, and classification.
This system comprises the three process pipelines shown in Figure (3):

(1) Tokenization, stop-word removal; (2) BPSO/KNN; (3) J48.
(1) Tokenization, stop-word removal; (2) BPSO/KNN; (3) SVM.
(1) Tokenization, stop-word removal; (2) BPSO/KNN; (3) REP-Tree.

Figure 3. System A

Figure 3 shows the processes of System A using the BBC-Arabic dataset with the pipelines above.

BPSO+KNN Experiment Steps

Step 1. Prepare a population of particles in the feature space and spread the particles randomly. $X_i$ is the current position of the particle, initialized with random binary values, where 0 means the corresponding feature is not selected and 1 means it is selected. $P_i$ is the best previous position of the particle, initialized to the same value as $X_i$. $V_i$ is the velocity of $P_i$. According to the evaluation of each particle in the swarm, gbest (the global best) is initialized with the best fitness value of any particle.

Step 2 (determining the fitness). The subset produced by each particle is evaluated after each feature selection iteration. The best fitness is the best accuracy achieved when the selected feature subset is evaluated by the classifier algorithm (KNN here), according to the following equation [27]:

$\mathrm{Fitness} = \alpha \cdot \mathrm{Acc} + \beta \cdot \dfrac{N - T}{N}$   (5)

where:
- Acc is the classification accuracy of the particle using the chosen classifier.
- α and β balance classification accuracy against the dimension of the feature subset selected by the particle, with α in [0, 1] and β = 1 − α.
- N is the number of all features.
- T is the number of features selected by particle P.

The fitness is then updated, and the personal best of each particle is updated.

Step 3 (updating gbest). The gbest is updated.

Step 4 (updating positions). Using the BPSO velocity equation from Section 3, we update both velocity and position for all particles (Mendes, Kennedy and Neves, 2004), equations (1) and (2). As given in [30], the probability of a bit changing is determined by equations (3) and (4), where rand() is a random number in (0, 1) [27], $c_1$ and $c_2$ are acceleration factors (usually $c_1 = c_2 = 2$), $P_{gd}$ is the global best, and $V_{id}$ is the velocity of the particle [28].

Step 5. If the fitness value is better than the best fitness value in history (gbest), set the current value as the new gbest.

Step 6. For evaluation, in our case KNN, we use the Euclidean distance (ED) to measure the relevancy between the current instance and the other instances in the dataset.

Step 7. Define the repository R. If the predicted classification of an instance matches the predefined classification, increase R by 1.

Step 8. We can now measure the classification accuracy of particle P by [27]:

$\mathrm{ClassificationAccuracy} = \dfrac{R}{N}$   (6)

where R is the number of correct results after testing the features over the whole training set N.

The Experiment Parameters (BPSO+KNN)

(1) The inertia weight ω in equation (2) balances the local search and the global search [27]; from the literature the best value of ω is 1.2.
(2) The swarm dimension is 50 particles.
(3) The number of iterations is 200.
(4) α is in [0, 1] and β = 1 − α. Using α = 1 gives β = 0, which means the dimension of the feature subset is neglected, so we choose α = 0.70, and β = 1 − 0.70 = 0.30.
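The steps above can be summarized in code. The following is a minimal sketch of the BPSO+KNN loop under the stated parameters (ω = 1.2, c1 = c2 = 2, 50 particles, 200 iterations, α = 0.70); the toy data, array layout, and function names are our own assumptions, a velocity clamp is added as a common BPSO practice not described in the paper, and a simple 1-NN evaluator stands in for the Weka KNN used in the actual experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
W, C1, C2 = 1.2, 2.0, 2.0            # inertia weight and acceleration factors
ALPHA, BETA = 0.70, 0.30             # accuracy vs. subset-size trade-off, Eq. (5)
N_PARTICLES, N_ITER = 50, 200

def knn_accuracy(X_train, y_train, X_test, y_test, mask):
    # Step 6: 1-NN by Euclidean distance, restricted to the selected features.
    sel = mask.astype(bool)
    if not sel.any():
        return 0.0
    A, B = X_train[:, sel], X_test[:, sel]
    d = ((B[:, None, :] - A[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    pred = y_train[d.argmin(axis=1)]
    return float((pred == y_test).mean())               # R / N, Eq. (6)

def fitness(mask, data):
    # Eq. (5): alpha * Acc + beta * (N - T) / N.
    acc = knn_accuracy(*data, mask)
    n, t = mask.size, mask.sum()
    return ALPHA * acc + BETA * (n - t) / n

def bpso_select(data, n_features):
    X = rng.integers(0, 2, (N_PARTICLES, n_features))   # Step 1: random 0/1 positions
    V = rng.uniform(-1.0, 1.0, (N_PARTICLES, n_features))
    P = X.copy()                                        # personal bests
    p_fit = np.array([fitness(x, data) for x in X])
    g = P[p_fit.argmax()].copy()                        # gbest
    for _ in range(N_ITER):
        r1, r2 = rng.random(V.shape), rng.random(V.shape)
        V = W * V + C1 * r1 * (P - X) + C2 * r2 * (g - X)   # Eq. (2)
        V = np.clip(V, -6.0, 6.0)                           # assumed Vmax clamp
        X = (rng.random(V.shape) < 1.0 / (1.0 + np.exp(-V))).astype(int)  # Eqs. (3)-(4)
        f = np.array([fitness(x, data) for x in X])         # Step 2
        better = f > p_fit
        P[better], p_fit[better] = X[better], f[better]     # update personal bests
        g = P[p_fit.argmax()].copy()                        # Steps 3 and 5
    return g                                                # best feature subset found

# Toy usage: 40 features, of which only two carry class information.
X_all = rng.random((120, 40))
y_all = (X_all[:, 3] + X_all[:, 17] > 1.0).astype(int)
data = (X_all[:80], y_all[:80], X_all[80:], y_all[80:])
print("selected features:", np.flatnonzero(bpso_select(data, 40)))
```

Systems B and C below reuse the same loop and only swap the fitness evaluator (SVM or REP-Tree in place of the 1-NN accuracy).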
System B: Binary Particle Swarm Optimization and Support Vector Machine

The second system in this study also involves inserting the middle phase (feature selection). In this system we use BPSO with SVM, and then classify the resulting features with the J48 decision tree, SVM, and REP-Tree, as shown in Figure (4).

Figure 4. System B

Figure (4) shows the processes of System B using the BBC-Arabic dataset, with BPSO+SVM added as the feature selector; the resulting features are classified for Arabic words using the three classifiers SVM (as classifier), J48, and REP-Tree.

BPSO+SVM Experiment Steps

Step 1. The same as in System A.
Step 2 (determining the fitness). We use equation (5) from System A, with SVM measuring the classification accuracy instead of KNN.
Step 3 (updating gbest). The same as in System A.
Step 4 (updating positions). The same as in System A, using equations (2), (3), and (4).
Step 5. The same as in System A.
Step 6. For evaluation, in our case SVM, we use the SVM classifier in the Weka tool to measure the relevancy between the current instance and the other instances in the dataset. Then repeat steps 7 and 8 as in System A, with the same experiment parameters as in System A.

System C: Binary Particle Swarm Optimization and Reduced Error Pruning Tree

The last system in this study also involves inserting the middle feature selection phase, including the processes used in Systems A and B. In this system we use BPSO with REP-Tree, which had not yet been used in the Arabic text classification field and was recently used in English news classification. Finally, we classify the resulting features with the J48 decision tree, SVM, and REP-Tree (as a classifier), as shown in Figure (5).

Figure 5. System C

Figure (5) shows System C, with BPSO+REP-Tree added as the feature selection step (REP-Tree acting as the evaluator); the resulting features are classified for Arabic words using the three classifiers SVM, J48, and REP-Tree (as classifier).

BPSO+REP-Tree Experiment Steps

Step 1. The same as in System A.
Step 2 (determining the fitness). We use REP-Tree as the feature evaluator to measure the classification accuracy of the particle on the training set, instead of KNN as in System A.
Step 3 (updating gbest). The same as in System A.
Step 4 (updating positions). The same as in System A, using equations (2), (3), and (4).
Step 5. The same as in System A.
Step 6. For evaluation, in our case REP-Tree, we use the REP-Tree classifier in the Weka tool to measure the relevancy between the current instance and the other instances in the dataset. Then repeat steps 7 and 8 as in System A, with the same experiment parameters as in System A. The last three steps can alternatively estimate the classification accuracy by measuring the F-measure.

We can summarize the above in short, general steps as follows:

(1) First, after preparing the feature (term) space and spreading the particles randomly, determine the classification accuracy (Acc) of a particle P on the training dataset using REP-Tree.
(2) Extract and filter the feature subset of the training set selected by the particle.
(3) Evaluate the extracted feature set with REP-Tree using a 60% training-set validation.
(4) Determine the F-measure resulting from the REP-Tree experiment, which determines the fitness of the particle.

VI. EXPERIMENTAL RESULTS

In this section, the experimental results of the systems described in the last section are reported. We executed our experiments on two datasets, the BBC-Arabic news dataset and the Alkhaleej News dataset. As mentioned in the previous section, we split the data into 60% for training and 40% for testing, and display the results in tables and figures; we then compare the systems with each other in a dedicated graph. We start with the results of System A using the three classifiers described earlier, then gradually review the results of System B, and finally end with System C.

6.1 System A.A ("BPSO+KNN"/J48)

The experimental results of System A with the J48 tree are shown in Tables (4) and (5) for the two datasets:

TABLE 4. System A with the J48 tree applied to the BBC-Arabic dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  67.3        69.7     68.4
World News   81.5        85.4     83.4
Business     72.4        73.4     72.8
Sport        84.2        79.7     81.8
Newspapers   87.3        88.9     88.0
Science      62.7        86.1     72.5
Misc.        83.9        89.6     86.6
Average      77          81.8     79

Table (4) shows the classification of BBC-Arabic documents using BPSO+KNN as the feature selector and the J48 decision tree as the classifier. The best classification is in the "Newspapers" class, with precision 87.3, recall 88.9, and F1-measure 88.0. Second is "Misc.", with precision 83.9, recall 89.6, and F1-measure 86.6. The outcomes of "World News" and "Sport" are close, with "World News" slightly ahead with a recall of 85.4. The worst two classes are "Science" and "Middle East", with precision 62.7, recall 86.1, and F-measure 72.5 for "Science", and the worst precision, 67.3, and F-measure, 68.4, for "Middle East".

We then ran the same experiment on the second dataset (Alkhaleej News); Table (5) shows the results:

TABLE 5. System A with the J48 tree applied to the Alkhaleej News dataset

Class               Precision%  Recall%  F1-Measure%
Local News          75.8        78.4     77
International News  74.6        72.3     73.4
Economy             65.2        60.7     62.8
Sport               81.3        87.5     84.2
Average             74.2        74.7     74.3

Table (5) shows the classification of Alkhaleej News documents using BPSO+KNN as the feature selector and the J48 decision tree as the classifier. The best F-measure is for the "Sport" class with 84.2, and the worst is for the "Economy" class with 62.8.

6.2 System A.B ("BPSO+KNN"/SVM)

The experimental results of System A with the SVM classifier are shown in Tables (6) and (7) for the BBC-Arabic and Alkhaleej datasets:

TABLE 6. System A with the SVM classifier applied to the BBC-Arabic dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  88.3        79.7     83.7
World News   81.7        87.3     84.4
Business     84.5        92.4     88.2
Sport        87.2        79.7     83.2
Newspapers   86.4        88.2     87.3
Science      81.4        85.6     83.4
Misc.        89.4        95.6     92.3
Average      85.5        86.9     86

Table (6) shows the classification of BBC-Arabic documents using BPSO+KNN as the feature selector and SVM as the classifier. The best classification is for the "Misc." class, with precision 89.4, recall 95.6, and F1-measure 92.3. Second is "Business", with precision 84.5, recall 92.4, and F1-measure 88.2. The F1-measures of "Middle East" and "Science" are close, at 83.7 and 83.4 respectively. The worst class is "Sport", with precision 87.2, recall 79.7, and F-measure 83.2.
We then applied System A (the same experiment with SVM) to the second dataset (Alkhaleej News); Table (7) shows the results:

TABLE 7. System A with the SVM classifier applied to the Alkhaleej News dataset

Class               Precision%  Recall%  F1-Measure%
Local News          86.1        90.4     88.1
International News  82.4        81.7     82
Economy             91.6        87.8     89.6
Sport               95.3        89.5     92.3
Average             88.8        87.3     88

Table (7) shows the classification of Alkhaleej News documents using BPSO+KNN as the feature selector and SVM as the classifier. The best F-measure is for the "Sport" class with 92.3, and the worst is for the "International News" class with 82.

6.3 System A.C ("BPSO+KNN"/REP-Tree)

The third combination of System A uses our proposed classifier REP-Tree, which, as mentioned in past sections, has recently been used in English text classification. Here, REP-Tree is a classifier used to classify the group of features resulting from feature selection by BPSO+KNN. The experimental results of System A with the REP-Tree classifier are shown in Tables (8) and (9) for the BBC-Arabic and Alkhaleej datasets:

TABLE 8. System A with the REP-Tree classifier applied to the BBC-Arabic dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  87.7        91.5     89.5
World News   85.9        85.7     85.7
Business     86.1        90.6     88.2
Sport        80.3        72.2     76
Newspapers   89.2        88.7     88.9
Science      83.8        87.8     85.7
Misc.        79.2        72.3     75.5
Average      84.6        84.1     84.2

Table (8) shows the classification of BBC-Arabic documents using BPSO+KNN as the feature selector and REP-Tree as the classifier. The best classification is for the "Middle East" class, with precision 87.7, recall 91.5, and F1-measure 89.5. Second is "Newspapers", with precision 89.2, recall 88.7, and F1-measure 88.9, closely followed by "Business" with precision 86.1, recall 90.6, and F1-measure 88.2. The worst performance is the "Misc." class, with precision 79.2, recall 72.3, and F-measure 75.5. As in all previous experiments, we also applied the REP-Tree classifier to the other dataset. Table (9) shows the results of System A (the same experiment with REP-Tree) on the second dataset (Alkhaleej News):

TABLE 9. System A with the REP-Tree classifier applied to the Alkhaleej News dataset

Class               Precision%  Recall%  F1-Measure%
Local News          88.4        91.5     89.9
International News  93.2        85.2     89
Economy             80.1        83.6     81.8
Sport               92.7        82.7     87.4
Average             88.6        85.7     87

Accuracy was comparable between REP-Tree and SVM, with average F1-measures of 87% for REP-Tree and 88% for SVM. In more detail, the best F-measure is for the "Local News" class with 89.9, and the worst is for the "Economy" class with 81.8.

6.4 System B.A ("BPSO+SVM"/J48)

The experimental results of System B with the J48 tree are shown in Tables (10) and (11) for the BBC-Arabic news and Alkhaleej News datasets:

TABLE 10. System B with the J48 tree applied to the BBC-Arabic dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  70.4        72.6     71.4
World News   88.3        83.1     85.6
Business     77.5        71.2     74.2
Sport        87.7        78.5     82.8
Newspapers   85.2        87.3     86.2
Science      61          77.4     68.2
Misc.        82.5        87       84.6
Average      78.9        79.5     79

Table (10) shows the classification of BBC-Arabic documents using BPSO+SVM as the feature selector and the J48 decision tree as the classifier.
As the table shows, the best classification performance is the "Newspapers" class, with precision 85.2, recall 87.3, and F1-measure 86.2. Second is "World News", with precision 88.3, recall 83.1, and F1-measure 85.6. The worst classes are "Middle East" and "Science", with precision 70.4, recall 72.6, and F-measure 71.4 for "Middle East", and the worst precision, 61.0, and F-measure, 68.2, for "Science". Here we can say with confidence that the J48 tree missed 31.8% of the classification accuracy of the "Science" class according to its F-measure.

We then ran the same experiment on the second dataset (Alkhaleej News); Table (11) shows the results:

TABLE 11. System B with the J48 tree applied to the Alkhaleej News dataset

Class               Precision%  Recall%  F1-Measure%
Local News          49.8        52.4     51
International News  93.3        62.4     74.7
Economy             67.1        77.5     71.9
Sport               85.3        69.8     76.7
Average             73.8        65.5     68.5

Table (11) shows the classification accuracy of Alkhaleej News documents using BPSO+SVM as the feature selector and the J48 decision tree as the classifier. The best F-measure is for the "Sport" class with 76.7, and the worst is for the "Local News" class with 51. Here, too, we can say that the J48 tree missed 49% of the classification accuracy of the "Local News" class according to its F-measure.

6.5 System B.B ("BPSO+SVM"/SVM)

The experimental results of System B with the SVM classifier are shown in Tables (12) and (13) for the BBC-Arabic and Alkhaleej datasets:

TABLE 12. System B with the SVM classifier applied to the BBC-Arabic dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  67.9        88.7     76.9
World News   98.7        90.3     94.3
Business     87.9        89.3     88.5
Sport        60.3        80.7     69
Newspapers   79.8        84.2     81.9
Science      99.2        85.6     91.8
Misc.        90.4        98.8     94.4
Average      83.4        88.2     85.2

Table (12) shows the classification of BBC-Arabic documents using BPSO+SVM as the feature selector and SVM as the classifier. As Table (12) shows, the best classification is for the "Misc." class, with precision 90.4, recall 98.8, and F1-measure 94.4. Second is "World News", with precision 98.7, recall 90.3, and F1-measure 94.3. The worst class is "Sport", with precision 60.3, recall 80.7, and F-measure 69.

We then applied System B (the same experiment with SVM) to the second dataset (Alkhaleej News); Table (13) shows the results:

TABLE 13. System B with the SVM classifier applied to the Alkhaleej News dataset

Class               Precision%  Recall%  F1-Measure%
Local News          83.2        88.6     85.8
International News  88.5        85.7     87
Economy             96.6        90.9     93.6
Sport               90.3        89.7     89.9
Average             89.6        88.7     89

Table (13) shows the classification of Alkhaleej News documents using BPSO+SVM as the feature selector and SVM as the classifier. The best accuracy (F-measure) is for the "Economy" class with 93.6, and the worst is for the "Local News" class with 85.8.
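For reference, the per-class precision, recall, and F1-measure figures reported in these tables, together with their averages, follow the standard definitions. The following is a minimal sketch of their computation; the label lists are illustrative only, not the experiments' actual predictions over the 40% test split.

```python
def per_class_scores(y_true, y_pred):
    # Precision, recall, and F1 for each class, from true/predicted labels.
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores

# Illustrative labels only.
y_true = ["Sport", "Economy", "Sport", "Local News", "Economy", "Sport"]
y_pred = ["Sport", "Sport", "Sport", "Local News", "Economy", "Economy"]
scores = per_class_scores(y_true, y_pred)
for c, (p, r, f) in scores.items():
    print(f"{c:20s} P={p:.2f} R={r:.2f} F1={f:.2f}")
macro = [sum(v[i] for v in scores.values()) / len(scores) for i in range(3)]
print("{:20s} P={:.2f} R={:.2f} F1={:.2f}".format("Average", *macro))
```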
6.6 System B.C ("BPSO+SVM"/REP-Tree)

The third combination of System B uses our proposed classifier REP-Tree, which, as mentioned in the previous experiments, has recently been used by (Kalmegh, 2015) and (Patel and Upadhyay, 2012) in English text classification and by (Naji and Ashour, 2016) in Arabic text classification (a previous paper related to the present one), as noted in the first section. Here, REP-Tree is a classifier used to classify the group of features resulting from feature selection by BPSO+SVM. The experimental results of System B with the REP-Tree classifier are shown in Tables (14) and (15) for the BBC-Arabic and Alkhaleej datasets:

TABLE 14. System B with the REP-Tree classifier applied to the BBC-Arabic dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  77          89.4     82.7
World News   98.3        96.1     97.1
Business     87.2        78.5     82.6
Sport        79.5        75.8     77.6
Newspapers   88.2        88.9     88.5
Science      85.4        87.1     86.2
Misc.        89          69.4     77.9
Average      86.3        83.6     84.6

Table (14) shows the classification of BBC-Arabic documents using BPSO+SVM as the feature selector and REP-Tree as the classifier. As Table (14) shows, the best classification is for the "World News" class, with precision 98.3, recall 96.1, and F1-measure 97.1. Second is "Newspapers", with precision 88.2, recall 88.9, and F1-measure 88.5. The performances of "Middle East" and "Business" are close, with F1-measures of 82.7 and 82.6. The worst performance is the "Sport" class, with precision 79.5, recall 75.8, and F-measure 77.6. As in all previous experiments, we also applied the REP-Tree classifier to the other dataset. Table (15) shows the results of System B (the same experiment with REP-Tree) on the second dataset (Alkhaleej News):

TABLE 15. System B with the REP-Tree classifier applied to the Alkhaleej News dataset

Class               Precision%  Recall%  F1-Measure%
Local News          72          78.3     75
International News  89.6        92.2     90.8
Economy             87.3        88.3     87.7
Sport               95.4        87.5     91.2
Average             86          86.5     86.1

From Table (15), the best REP-Tree F1-measure is 91.2 for the "Sport" class, and the worst F-measure is for the "Local News" class with 75. The results are comparable with the SVM classifier. Next, we apply REP-Tree to another combination.

6.7 System C.A ("BPSO+REP-Tree"/J48)

System C consists of Binary PSO as the feature selector with the proposed REP-Tree as the evaluator that checks for the best group of features; we then use the three previous classifiers (J48, SVM, and REP-Tree) to build the classification model. Classification is performed on the resulting group of features in the training set, reducing the dimension of the original dataset, and the classifiers are then applied to the test dataset.
We have previously noted that REP-Tree has recently been used by (Kalmegh, 2015) and (Patel and Upadhyay, 2012) to classify English text and by (Naji and Ashour, 2016) in Arabic text classification.

The experimental results of System C with the J48 tree are shown in Tables (16) and (17) for the BBC-Arabic news and Alkhaleej News datasets:

TABLE 16. System C with the J48 tree applied to the BBC-Arabic dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  88.7        83.3     85.9
World News   90.4        87.4     88.8
Business     75.2        70.5     72.7
Sport        84.8        74.2     79.1
Newspapers   80.1        83.8     81.9
Science      79.8        78.3     79
Misc.        77.6        85.7     81.4
Average      82.3        80.4     81.2

Table (16) shows the classification of BBC-Arabic documents using BPSO+REP-Tree as the feature selector and the J48 decision tree as the classifier. As the table shows, the best classification performance is the "World News" class, with precision 90.4, recall 87.4, and F1-measure 88.8. Second is "Middle East", with precision 88.7, recall 83.3, and F1-measure 85.9. The worst class is "Business", with precision 75.2, recall 70.5, and F-measure 72.7; the J48 tree thus missed 27.3% of the classification accuracy of the "Business" class according to its F-measure.

We then ran the same experiment on the second dataset (Alkhaleej News); Table (17) shows the results:

TABLE 17. System C with the J48 tree applied to the Alkhaleej News dataset

Class               Precision%  Recall%  F1-Measure%
Local News          60.3        56.8     58.4
International News  68.6        70.9     69.7
Economy             90.4        75.9     82.5
Sport               84.8        72.5     78.1
Average             73.5        69       72.1

Table (17) shows the classification accuracy of Alkhaleej News documents using BPSO+REP-Tree as the feature selector and the J48 decision tree as the classifier. The best F-measure is for the "Economy" class with 82.5, and the worst is for the "Local News" class with 58.4; the J48 tree thus missed 41.6% of the classification accuracy of the "Local News" class according to its F-measure.

6.8 System C.B ("BPSO+REP-Tree"/SVM)

The experimental results of System C with the SVM classifier are shown in Tables (18) and (19) for the BBC-Arabic and Alkhaleej datasets:

TABLE 18. System C with the SVM classifier applied to the BBC-Arabic dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  98.6        94.4     96.4
World News   68.2        88.9     77.1
Business     82.3        85.7     83.9
Sport        64.6        78.5     70.8
Newspapers   81.4        82.8     82
Science      97.2        87.1     91.8
Misc.        92.5        96.9     94.6
Average      83.5        87.7     85.2

Table (18) shows the classification of BBC-Arabic documents using BPSO+REP-Tree as the feature selector and SVM as the classifier. From Table (18) we note the near-equality of the average F-measure when the same SVM classifier is used with a different feature selection combination (BPSO+REP-Tree); the current results were compared with Tables (12) and (13) (BPSO+SVM feature selection). Here we get average F-measures of 85.2 and 89.05 for SVM (the same classifier with a different feature selector).
As usual, we also applied System C (the same experiment with SVM) to the second dataset (Alkhaleej News); Table (19) shows the results:

TABLE 19. System C with the SVM classifier applied to the Alkhaleej News dataset

Class               Precision%  Recall%  F1-Measure%
Local News          97.2        93.7     95.4
International News  94.5        82.9     88.3
Economy             90.3        95.5     92.8
Sport               79.5        80       79.7
Average             90.3        88       89.05

Table (19) shows the classification of Alkhaleej News documents using BPSO+REP-Tree as the feature selector and SVM as the classifier. The best accuracy (F-measure) is for the "Local News" class with 95.4, and the worst is for the "Sport" class with 79.7. In this experiment we again note the equality and convergence of the classification results when the same SVM classifier is used with a different feature selection combination (BPSO+REP-Tree).

6.9 System C.C ("BPSO+REP-Tree"/REP-Tree)

The third combination of System C consists of Binary PSO as the feature selector with the proposed REP-Tree as the evaluator, and REP-Tree again as the classifier, as described in the System C subsection above. The experimental results of System C with the REP-Tree classifier are shown in Tables (20) and (21) for the BBC-Arabic and Alkhaleej datasets:

TABLE 20. System C with the REP-Tree classifier applied to the BBC-Arabic dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  97.2        95.3     96.2
World News   88.6        78.5     83.2
Business     87.3        88.6     87.9
Sport        79.9        75.9     77.8
Newspapers   86.1        98.4     91.8
Science      80          86.9     83.3
Misc.        82.5        92       86.9
Average      85.9        87.9     86.7

Table (20) shows that REP-Tree is effective in classifying BBC-Arabic documents using BPSO+REP-Tree as the feature selector and REP-Tree as the classifier. The best classification is for the "Middle East" class, with precision 97.2, recall 95.3, and F1-measure 96.2. Second is "Newspapers", with precision 86.1, recall 98.4, and F1-measure 91.8. Third is "Business", with an F-measure of 87.9. The performances of "Science" and "World News" are close, with F1-measures of 83.3 and 83.2. The worst performance is the "Sport" class, with an F-measure of 77.8.

As usual, we also applied the REP-Tree classifier to the other dataset. Table (21) shows the results of System C (the same experiment with REP-Tree) on the second dataset (Alkhaleej News):

TABLE 21. System C with the REP-Tree classifier applied to the Alkhaleej News dataset

Class               Precision%  Recall%  F1-Measure%
Local News          98          97.4     97.6
International News  91.3        92.5     91.8
Economy             85.7        87.1     86.3
Sport               93.8        89.6     91.6
Average             92.2        91.6     91.8

From Table (21), the best REP-Tree accuracy (F1-measure) is 97.6 for the "Local News" class, and the worst F-measure is for the "Economy" class with 86.3. The average accuracy of REP-Tree in this experiment is 91.8.

6.10 Performance of the Three Systems

In this subsection, we compare the previous results on the two datasets (BBC-Arabic and Alkhaleej) before adding any enhancements to the systems in the preprocessing phase. Table (22) and Figure (6) show the results of this comparison.

TABLE 22. Comparison between the F-measure averages of the three systems

Datasets         System A (BPSO+KNN)%  System B (BPSO+SVM)%  System C (BPSO+REP-Tree)%
BBC-Ar (J48)     79                     79                     81.2
BBC-Ar (SVM)     86                     85.2                   85.2
BBC-Ar (REP)     84.2                   84.6                   86.7
Alkhaleej (J48)  74.3                   68.5                   72.1
Alkhaleej (SVM)  88                     89                     89
Alkhaleej (REP)  87                     86.1                   91.8
Figure 6. Comparison between the accuracy of the three systems

From Table (22) and Figure (6), we draw together the overall results of all the experiments, compute the average F1-measure values, and compare the systems with each other.

VII. CONCLUSION

This paper proposed a new feature selection approach to select the best subset of features from the original Arabic documents. We showed through the experimental results that the proposed approach works well in this area. The approach can be used in Arabic search engines and for classifying huge numbers of Arabic web pages into hierarchical classes (labels).

We proposed the Reduced Error Pruning Tree classifier, not previously used in Arabic text classification, for two purposes. The first is as an evaluator of the feature subsets produced by the Binary Particle Swarm Optimization (BPSO) feature selection algorithm. To evaluate this approach (BPSO+REP-Tree), we used two Arabic datasets, the BBC-Arabic News dataset and the Alkhaleej News dataset. The second purpose of REP-Tree is as a classifier to build the learning model. We compared the first purpose (BPSO+REP-Tree) with two existing approaches, BPSO+KNN and BPSO+SVM, and the second purpose (the REP-Tree classifier) with two well-known classifiers, J48 and SVM. We named the three feature selection approaches A for BPSO+KNN, B for BPSO+SVM, and C for BPSO+REP-Tree. From the experimental results, we concluded that the proposed approach, System C, is effective. We chose the F1-measure, derived from the precision and recall factors, to estimate the accuracy of the classification process.

The F1-measure values of System A are in the range 73%-79% with the J48 classifier, 86%-88% with SVM, and 84%-87% with the proposed REP-Tree classifier. For the second system (B), with the same classifiers, the F1-measures are in the range 60.9%-84.6% with J48 and 85.2%-89.6% with SVM, and 84.6%-89.5% with the proposed REP-Tree classifier; the last two algorithms are comparable in accuracy. Finally, our proposed feature selection approach, System C, gave accuracies in the range 69.5%-79.6% with J48, 87%-89.8% with SVM, and 86.7%-91.8% with the proposed REP-Tree classifier.

REFERENCES

[1] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval", Information Processing & Management, vol. 24, no. 5, (1988), pp. 513-523. doi:10.1016/0306-4573(88)90021-0
[2] S. Li, R. Xia, C. Zong and C. Huang, "A framework of feature selection methods for text categorization", Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL-IJCNLP '09, (2009). doi:10.3115/1690219.1690243
[3] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics",
Bioinformatics, vol. 23, no. 19, (2007), pp. 2507-2517. doi:10.1093/bioinformatics/btm344
[4] M. M. Al-Tahrawi and S. N. Al-Khatib, "Arabic text classification using Polynomial Networks", Journal of King Saud University - Computer and Information Sciences, vol. 27, no. 4, (2015), pp. 437-449. doi:10.1016/j.jksuci.2015.02.003
[5] T. Aimunandar and E. Winarko, "Regional Development Classification Model using Decision Tree Approach", International Journal of Computer Applications (IJCA), vol. 114, no. 8, (2015), pp. 29-34. doi:10.5120/20000-1755
[6] S. Kalmegh, "Analysis of WEKA Data Mining Algorithm REP-Tree, Simple Cart and Random Tree for Classification of Indian News", PARIPEX - Indian Journal of Research, vol. 2, no. 2, (2015), pp. 438-446. doi:10.15373/22501991/feb2015
[7] N. Patel and S. Upadhyay, "Study of Various Decision Tree Pruning Methods with their Empirical Comparison", International Journal of Computer Applications, vol. 60, no. 12, (2012), pp. 20-25. doi:10.5120/9744-4304
[8] M. Brahimi, M. Touahria, and A. Tari, "Data and Text Mining Techniques for Classifying Arabic Tweet Polarity", Journal of Digital Information Management, vol. 14, no. 1, (2016), pp. 12-19.
[9] S. M. Oraby, Y. El-Sonbaty, and M. A. El-Nasr, "Exploring the Effects of Word Roots for Arabic Sentiment Analysis", International Joint Conference on Natural Language Processing, Nagoya, Japan, (2013), pp. 471-479.
[10] S. Khoja, "Stemming Arabic Text", Lancaster, U.K., Computing Department, Lancaster University, (1999).
[11] K. Taghva, R. Elkhoury, and J. Coombs, "Arabic stemming without a root dictionary", International Conference on Information Technology: Coding and Computing (ITCC'05), (2005). doi:10.1109/itcc.2005.90
[12] Tashaphyne, Arabic light stemmer, 0.2, (2010). Available at https://pypi.python.org/pypi/Tashaphyne
[13] A. Shoukry and A. Rafea, "Sentence-level Arabic sentiment analysis", International Conference on Collaboration Technologies and Systems (CTS), (2012). doi:10.1109/cts.2012.6261103
[14] E. Al-Thwaib, "Text Summarization as Feature Selection for Arabic Text Classification", World of Computer Science and Information Technology Journal (WCSIT), vol. 4, no. 7, (2014), pp. 101-104.
[15] K. Al-Hindi and E. A. Al-Thwaib, "A Comparative Study of Machine Learning Techniques in Classifying Full-Text Arabic Documents versus Summarized Documents", World of Computer Science and Information Technology Journal (WCSIT), vol. 3, no. 7, (2013), pp. 126-129.
[16] A. Abu-Errub, "Arabic Text Classification Algorithm using TF.IDF and Chi Square Measurements", International Journal of Computer Applications (IJCA), vol. 93, no. 6, (2014), pp. 40-45. doi:10.5120/16223-5674
[17] A. Goweder, M. Elboashi, and A. Elbekai, "Centroid-Based Arabic Classifier", The International Arab Conference on Information Technology (ACIT'2013), (2013). http://acit2k.org/ACIT/2013Proceedings/108.pdf
[18] K. Abidi and Z. Elberrichi, "Arabic Text Categorization: A Comparative Study of Different Representation Modes",
Journal of Theoretical and Applied Information Technology, vol. 38, no. 1, (2012), pp. 465-470. Retrieved May 21, 2016, from http://ccis2k.org/iajit/PDF/vol.9,no.5/2983-10.pdf
[19] G. Raho, R. Al-Shalabi, G. Kanaan, and A. Nassar, "Different Classification Algorithms Based on Arabic Text Classification: Feature Selection Comparative Study". International Journal of Advanced Computer Science and Applications (IJACSA), vol. 6, no. 2, (2015), pp. 23-28. doi:10.14569/ijacsa.2015.060228
[20] A. H. Mohammad, O. Al-Momani, and T. Alwada, "Arabic Text Categorization using k-nearest neighbor, Decision Trees (C4.5) and Rocchio Classifier: A Comparative Study". International Journal of Current Engineering and Technology, vol. 6, no. 2, (2016), pp. 477-482. Retrieved May 29, 2016, from http://inpressco.com/wp-content/uploads/2016/03/Paper16477-482.pdf
[21] T. Kanan and E. A. Fox, "Automated Arabic text classification with P-Stemmer, machine learning, and tailored news article taxonomy". Journal of the Association for Information Science and Technology, (2015). doi:10.1002/asi.23609
[22] F. S. Al-Anzi and D. Abuzeina, "Toward an enhanced Arabic text classification using cosine similarity and Latent Semantic Indexing". Journal of King Saud University - Computer and Information Sciences, (2016). doi:10.1016/j.jksuci.2016.04.001
[23] Z. S. Zubi, "Using Some Web Content Mining Techniques for Arabic Text Classification". Recent Advances on Data Networks, Communications, Computers, (2009), pp. 73-84. doi:10.1109/mmcs.2009.5256696
[24] M. Zrigui, R. Ayadi, M. Mars, and M. Maraoui, "Arabic Text Classification Framework Based on Latent Dirichlet Allocation". Journal of Computing and Information Technology (CIT), vol. 20, no. 2, (2012), pp. 11-14. doi:10.2498/cit.1001770
[25] J. Kennedy and R. Eberhart, "Particle swarm optimization". Proceedings of ICNN'95 - International Conference on Neural Networks, vol. 4, (1995), pp. 1942-1948. doi:10.1109/icnn.1995.488968
[26] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization". Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), (1997), pp. 412-420. Morgan Kaufmann Publishers, Inc.
[27] H. K. Chantar and D. W. Corne, "Feature subset selection for Arabic document categorization using BPSO-KNN". Third World Congress on Nature and Biologically Inspired Computing, (2011). doi:10.1109/nabic.2011.6089647
[28] M. Tsai, C. Su, K. Chen, and H. Lin, "An Application of PSO Algorithm and Decision Tree for Medical Problem". Neural Computing and Applications, vol. 21, no. 8, (2012), pp. 124-126. Retrieved September 7, 2016, from http://psrcentre.org/images/extraimages/31012565.pdf
[29] Y. Shi and R. Eberhart, "A modified particle swarm optimizer". Proceedings of the 1998 IEEE World Congress on Computational Intelligence, (1998), pp. 69-73. doi:10.1109/icec.1998.699146
[30] J. Kennedy and R. Eberhart, "A discrete binary version of the particle swarm algorithm".
Proceedings of the 1997 IEEE International Conference on Systems, Man, and Cybernetics (Computational Cybernetics and Simulation), (1997), pp. 4104-4108.
[31] H. Naji and W. Ashour, "Text Classification for Arabic Words Using Rep-Tree". International Journal of Computer Science and Information Technology (IJCSIT), vol. 8, no. 2, (2016), pp. 101-108. doi:10.5121/ijcsit.2016.8208
[32] S. Kalmegh, "Analysis of WEKA Data Mining Algorithm REP-Tree, Simple Cart and Random Tree for Classification of Indian News". Paripex - Indian Journal of Research, vol. 2, no. 2, (2015), pp. 438-446. doi:10.15373/22501991/feb2015
[33] N. Patel and S. Upadhyay, "Study of Various Decision Tree Pruning Methods with their Empirical Comparison". International Journal of Computer Applications, vol. 60, no. 12, (2012), pp. 20-25. doi:10.5120/9744-4304
[34] T. M. Mitchell, Machine Learning. McGraw Hill, (1997). Retrieved May 3, 2016, from http://personal.disco.unimib.it/Vanneschi/McGrawHill_-_Machine_Learning_-Tom_Mitchell.pdf