JOURNAL OF ENGINEERING RESEARCH AND TECHNOLOGY, VOLUME 4, ISSUE 1, MARCH 2017

A New Model in Arabic Text Classification Using BPSO/REP-Tree

Hamza Naji (1), Wesam Ashour (2) and Mohammed Alhanjouri (3)

(1, 2, 3) Department of Computer Engineering, Islamic University of Gaza, Palestine.

 
Abstract—Assigning a title or a specific category to a single page of text is a fairly easy task, but when many such pages accumulate into a huge collection of documents, the process becomes difficult and exhausting for the human mind. Automatic text classification solves this problem by identifying a category for each document automatically. This can be achieved by machine learning, by building a model that contains all possible attributes (features) of the text. As the number of attributes grows into the thousands, we must pick out the distinguishing features. To deal with the high dimensionality of the original dataset, we use a feature selection process that reduces it by deleting the irrelevant attributes (words), while the remaining features still contain the relevant information needed for the classification process. In this research, a new approach combining Binary Particle Swarm Optimization (BPSO) with the Reduced Error Pruning Tree (REP-Tree) is proposed to select the subset of features for the Arabic classification process. We compare the proposed approach with two existing approaches: BPSO with K-Nearest Neighbor (KNN) and BPSO with Support Vector Machine (SVM). After obtaining the subset of attributes that results from the feature selection process, we use three common classifiers, Decision Tree J48, SVM, and the proposed algorithm REP-Tree (as a classifier), to build the classification model. We created our own Arabic dataset, the BBC Arabic News dataset, collected from the BBC Arabic website, and we use one existing dataset in our experiments, the Alkhaleej News dataset. Finally, we present the experimental results, which show that the proposed algorithm is promising in this area of research.

Index Terms—Text classification, BPSO, REP-Tree, Binary Particle Swarm Optimization. 

I INTRODUCTION

The huge increase in the use of text on electronic devices, and on web sites in particular, is a motivation for categorizing these texts in an automatic manner, because human ability is insufficient to handle them manually. The core task is called Text Categorization, or Classification (TC): classifying a huge collection of texts, each called a text dataset or corpus, into some predefined classes. In the case of a news dataset, for example, the classes can be Sport, Health, etc., and other various classes based on their contents.

The text classification process in general consists of two phases. The first is the preprocessing phase, defined as the process applied to the collection of texts to make some improvements by reducing unnecessary terms. The preprocessing phase also reduces the extra forms of one term by a process called stemming. Stemming eliminates the derived words of one basic word, such as the words "making, makes", and turns them into their root, the word "make". Another example of the stemming process is the words (argue, argued, argues, arguing), which are turned into the stem "argu"; on the other hand, (argument and arguments) are turned into the stem "argument". The preprocessing phase removes some prefixes and suffixes from the word instead of extracting the original root. A small stemming sketch follows below.
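As an illustration (not part of this paper's Weka-based pipeline), the English examples above can be reproduced with the Porter stemmer from NLTK, and NLTK's ISRI stemmer plays a similar role for Arabic; this is a minimal sketch assuming NLTK is installed.

# Minimal stemming sketch (illustrative; assumes NLTK is installed).
from nltk.stem import PorterStemmer
from nltk.stem.isri import ISRIStemmer  # root-based Arabic stemmer

porter = PorterStemmer()
for word in ["argue", "argued", "argues", "arguing", "argument", "arguments"]:
    # Porter maps argue/argued/argues/arguing -> "argu"
    # and argument/arguments -> "argument", as in the text above.
    print(word, "->", porter.stem(word))

isri = ISRIStemmer()
print(isri.stem("يكتبون"))  # strips affixes to approximate the Arabic root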

The second phase of the text classification process is the classification step: classifying the text preprocessed in the previous phase and representing the corpus using a mechanism called a classifier. To apply these two phases, we need to convert each dataset to a term vector, which is the basis of text processing [1]. But how many terms we need in each dataset, and which terms, is a question to be answered. This question leads us to add a new step to the text classification process, Arabic Text Classification in this paper. A term-vector sketch follows below.
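As a hedged illustration of the term-vector representation in [1] (not the authors' exact tooling, which is Weka), scikit-learn's CountVectorizer builds such vectors:

# Minimal term-vector sketch (illustrative; assumes scikit-learn is installed).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the match ended with a late goal",      # sport-like document
    "the market closed higher on oil news",  # business-like document
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())
print(X.toarray())  # each row is the document's term vector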

There is a middle step between the preprocessing and classification processes called "feature selection" [2]. It is a complementary process performed after the preprocessing stage to reduce the redundant terms (features) and to keep the sufficient terms needed to continue the classification process [3]. We demonstrate a combination of Binary Particle Swarm Optimization (BPSO) and the Reduced Error Pruning Tree (REP-Tree) for this process of selecting good sets of features for the Arabic TC task. Then we use the second half of the hybridized approach, the REP-Tree, as a classifier, as mentioned above.

The text classification process can be done easily for the English language due to its smooth environment. In contrast, Arabic is considered a complex language that contains many formations and many different forms of a word. This difficulty requires greater effort in the classification of Arabic texts. A further difficulty of the Arabic expressive style is that it is also employed in alternative languages such as Persian, Urdu, and other regional languages of Pakistan, Afghanistan, and Persia. Arabic content constitutes 3% of web text content, ranking fourth among languages online [4]. This amount of content needs accurate and effective classification to help humans use it easily. Thus, in the last ten years the need for effective and accurate classification has grown quickly.

There are some classification algorithms that can be used in general text classification and can be applied to Arabic, such as: Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbor (KNN), Maximum Entropy (ME), Artificial Neural Network (ANN), Decision Tree (DT), and the Rocchio Feedback Algorithm. More recently, the Reduced Error Pruning tree (REP-Tree) has been investigated in Arabic TC. REP-Tree is a fast decision tree learner which builds a decision/regression tree using information gain (or variance reduction) as the splitting criterion, and prunes it using reduced error pruning [5]. REP-Tree was first used in Indian and English text classification in 2015 and 2012 [6], [7]. The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 explains BPSO concepts. Section 4 explains the second term of the proposed approach, REP-Tree. Section 5 presents the proposed work. Section 6 presents the results, and finally we conclude the paper in Section 7.

II RELATED WORKS

In the discussion below, we focus on works addressing Arabic TC. Since the number and quality of features used to express texts has a direct effect on classification algorithms, the following discusses the main goal of feature reduction and selection and their impact on TC.

 

(Brahimi, Touahria and Tari, 2016) [8] addressed sentiment analysis for tweets in the Arabic language using several approaches with two freely available datasets (2000 tweets). They applied the light and root stemmers as a preprocessing phase and investigated the impact of reducing the size of the dataset, by selecting the most relevant features, on the classification efficiency and accuracy of three widely used machine learning algorithms: Support Vector Machine (SVM), Naïve Bayes (NB), and K-Nearest Neighbor (KNN).

 

(Oraby, El-Sonbaty and El-Nasr, 2013) [9] worked on the impact of stemming by applying the Khoja stemmer [10], the Information Science Research Institute (ISRI) stemmer [11], and the Tashaphyne Light Arabic Stemmer [12] on two datasets of the opinion classification problem; the results show that the Khoja stemmer is the best one.

 

(Shoukry and Rafea, 2012) [13] applied the Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers to a dataset collected from the Twitter website. They ran the experiments on 2 collections of Arabic tweet documents, and the results showed that SVM performed better than NB.

 

(Al-Thwaib, 2014) [14] used the Sakhr summarizer (Sakhr company website, 2016) as a feature selector to choose the best words of documents instead of using all words, together with the TF feature. Documents, after using TF for feature selection, are classified using the SVM classifier. The dataset used consists of 800 Arabic text documents; it is a subset of a 60913-document corpus collected from many newspapers and other web sites. The author succeeded in increasing the accuracy by using the summarized corpus as input for the SVM classifier.

 

(Al-Hindi and Al-Thwaib, 2013) [15] made a comparison between two datasets, each containing 1000 Arabic documents; text summarization was applied to one of them and not the other. Accuracy did not improve much, but there was a difference in time: when they used summarized documents, less time was needed to build the learning model.

 

(Abu-Errub, 2014) [16] proposed a method to classify Arabic text by comparing a document with predefined document categories based on its contents using the Term Frequency Times Inverse Document Frequency (TF.IDF) measure. After that, the document is classified into the appropriate sub-category using the Chi Square measure. The dataset used in this study contained 1090 documents for training and 500 documents for testing, categorized into ten main categories. The results show that the proposed algorithm can classify Arabic text datasets into predefined categories.

 

(Goweder, Elboashi and Elbekai, 2013) [17] used their developed Centroid-based technique to classify Arabic text. The proposed algorithm is evaluated using a dataset containing 1400 Arabic documents collected from 7 different classes. The results show that the adapted Centroid-based algorithm can classify Arabic documents without problems. The micro-averaged recall, precision, F-measure, accuracy, and error rates were 90.7%, 87.1%, 88.9%, 94.8%, and 5.2%, respectively.

(Abidi and Elberrichi, 2012) [18] presented a comparative study to assess the effect of a conceptual representation of text. The K-Nearest Neighbor classifier was used, and feature extraction was performed via three preprocessing schemes: Bag of Words, N-grams, and a conceptual representation. The F-measure was 64% for Bag of Words, 68% for N-grams, and 74% for the conceptual representation. Thus the conceptual representation was the best, as the results show.

 

(Raho, Al-Shalabi, Kanaan and Nassar, 2015) [19] investigated the importance of feature selection in Arabic corpus classification by comparing the performance of different classifiers in different situations, using feature selection with and without stemming. The dataset was collected from the BBC Arabic website, and the classifiers they used are DT, K-Nearest Neighbors (KNN), the Naïve Bayesian Model (NBM), and Naïve Bayes (NB); they also used measurements such as precision, recall, F-measure, accuracy, and time. The results showed the accuracy of each classifier as follows: DT 99.4%, KNN 66.3%, NBM 92%, and NB 91.9%.

 

(Mohammad, Al-Momani and Alwada, 2016) [20] provided a comparative study of Arabic text classification between three types of classifiers (K-Nearest Neighbor, Decision Trees C4.5, and the Rocchio classifier). These well-known algorithms were applied to a collected Arabic dataset of 1400 documents belonging to 8 categories, all of which were used in the study's experiments. Using precision and recall as measurements, the results showed that K-Nearest Neighbor records an average of 80% recall and 83% precision, while the Rocchio classifier records an average of 88% recall and 82% precision. Both of these classifiers are better than C4.5, which averages 64% recall and 67% precision.

 

(Kanan and Fox, 2015) [21] addressed a new approach to stemming in Arabic text classification; they developed a new model called tailored stemming, a new Arabic light stemmer, used with the Support Vector Machine (SVM) classifier. The experiments were performed under 10-fold cross-validation and gave the following results for the predefined classes using SVM: Art and Culture 91.8%, Economics 93.5%, Politics 91.5%, and Society 99.1%.

 

(Al-Anzi and Abuzeina, 2016) [22] grouped similar unlabeled documents into a pre-specified number of topics using Latent Semantic Indexing (LSI) and Singular Value Decomposition (SVD) methods. The corpus they used contains 1000 documents on 10 topics, 100 documents per topic. The results showed that the EM method outperformed the other methods, with an average categorization accuracy of 89%.

 

(Zubi, 2009) [23] studied applying Arabic classification techniques to web content, with the general purpose of comparing two classifiers: the K-Nearest Neighbor (KNN) classifier and the Naïve Bayes (NB) classifier. A corpus of Arabic text documents was collected from online Arabic newspaper archives, including Al-Jazeera, Al-Nahar, Al-Hayat, Al-Ahram, and Al-Dostor, as well as a few other specialized websites: 1562 documents classified into 6 different categories. After the comparison experiment finished, the results showed that KNN, with an average accuracy of 86.02%, was better than the Naïve Bayes classifier with an accuracy of 77.03%.

 

(Zrigui, Ayadi, Mars and Maraoui, 2012) [24] developed a new model based on Latent Dirichlet Allocation (LDA) and the Support Vector Machine (SVM); they used LDA to sample "topics" from groups of texts. The results showed that the proposed LDA-SVM algorithm achieves high effectiveness for the Arabic text classification task (macro-averaged F1 88.1% and micro-averaged F1 91.4%).

III BINARY PARTICLE SWARM OPTIMIZATION (BPSO)

Before discussing BPSO as a feature selection algorithm, we first describe what the word "swarm" means in the full name of the PSO (Particle Swarm Optimization) algorithm. What is the swarm, and where did this name come from? Many forms of collective life in some organisms inspired researchers and invited them to develop successful theories for solving problems based on this seemingly random life. There is a group of successful theories based on this mode of thinking, including DNA computing, the membrane algorithm, the Particle Swarm Optimization algorithm, artificial immune systems, and the Ant Colony Optimization algorithm. Particle Swarm Optimization was developed in 1995 by Eberhart and Kennedy [25]. The idea is built on the collective behavior of flocks of birds. PSO is a random optimization algorithm that proposes solutions, called particles, at positions in the search space. Each particle holds an initial random velocity within the search space, denoted V_i = (V_i1, V_i2, ..., V_iN), and each particle position is denoted P_i = (P_i1, P_i2, ..., P_iN). A particle updates its velocity according to its own experience and the experience of other particles. The best particle in the search space (swarm) is called the global best, denoted g. When the velocity has been updated, the particle finds its new position using the latest velocity according to the following equations [26].

The main equation is:

X_id = X_id + V_id                                                    (1)

New position = Current position + New velocity.

V_id = ω * V_id + C1 * rand() * (P_id − X_id) + C2 * rand() * (P_gd − X_id)        (2)

where rand() is a random number between (0, 1) [27]; c1, c2 are acceleration factors, usually c1 = c2 = 2; P_gd is the global best; and V_id is the velocity of the particle [28].

 

X_i is the current position of the particle, initialized with random binary values, where 0 means that the corresponding feature is not selected and 1 means that the feature is selected. P_i is the best previous position of the particle, initialized with the same value as X_i. V_i is the velocity of P_i.

If there were no previous velocity, particles would navigate to the same (current) position, and that is the local search. But with a new velocity, a particle extends its search; that is the global search. Some problems result from this trade-off. The inertia weight ω solves these problems by balancing the local and global search. [ ] performed a sequence of experiments to find the best value of ω, which is 1.2. In Binary Particle Swarm Optimization (BPSO), the particle position is a binary vector, but how do binary vectors deal with velocities? [29] provided an equation that maps the velocity (a real-valued vector whose values are kept between (0, 1)) to a set of probabilities. Accordingly, we can use BPSO to select the relevant features in Arabic Text Classification. As mentioned in [30], the probability of a bit changing is determined by the following:

S(V_id) = 1 / (1 + e^(−V_id))                                         (3)

If (rand() < S(V_id)) then X_id = 1; else X_id = 0                    (4)

where rand() is a random number between (0, 1) [27].
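To make equations (2)-(4) concrete, the following is a minimal BPSO update sketch in Python. It is an illustrative implementation under the stated parameter choices (ω = 1.2, c1 = c2 = 2), not the authors' Weka-based code; the fitness evaluation is left outside this step.

# One BPSO iteration over a swarm of binary feature masks (illustrative sketch).
import numpy as np

def bpso_step(X, V, P, g, w=1.2, c1=2.0, c2=2.0):
    # X: (n_particles, n_features) current binary positions
    # V: (n_particles, n_features) real-valued velocities
    # P: personal-best positions, g: global-best position
    r1 = np.random.rand(*X.shape)
    r2 = np.random.rand(*X.shape)
    # Equation (2): inertia + cognitive + social terms.
    V = w * V + c1 * r1 * (P - X) + c2 * r2 * (g - X)
    # Equations (3)-(4): the sigmoid maps velocity to a bit-setting probability.
    S = 1.0 / (1.0 + np.exp(-V))
    X = (np.random.rand(*X.shape) < S).astype(int)
    return X, V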

IV REDUCED ERROR PRUNING TREE (REP-TREE)

More recently, the Reduced Error Pruning tree (REP-Tree) has been investigated in Arabic TC [31]. REP-Tree is a fast decision tree learner which builds a decision/regression tree using information gain (or variance reduction) as the splitting criterion, and prunes it using reduced error pruning. REP-Tree was first used in Indian and English text classification in 2015 [32] and 2012 [33]. REP-Tree first runs the training process on the existing dataset and builds the training model of decisions; it then takes instances from a pruning set, a held-out part of the dataset reserved for post-pruning of the tree, and performs the test process. For a sub-tree of the tree, if replacing it by a node (leaf) does not produce more prediction errors on the pruning set than the original sub-tree, the sub-tree is replaced by a leaf. That means that REP-Tree prunes each node after the natural classification, if the misclassification error determined for the instances from the pruning set is not larger than the misclassification error rate computed on the original training data. The misclassification detection is presented in Figure (1) below.

[Figure 1: The misclassified detection in the pruning set of REP-Tree (binary sample) [34].]

by using the pruning set shown in the following table:

TABLE 1
Some pruning-set samples

Category  X  Y  Z
A         0  0  1
B         0  1  1
B         1  1  0
B         1  0  0
A         1  1  1
B         0  0  0

[Figure 2: The final REP-tree.]


The REP-tree pruning begins from the bottom, at node three. We can see that turning node three into a leaf makes fewer errors on the pruning set than keeping it as a sub-tree. As a sub-tree (before pruning), the classification occurs at nodes four and five; one error happens at node five, but no errors happen at node three as a leaf. The same happens at nodes six and nine. However, node two cannot be made into a leaf, since as a leaf it makes one error, while as a sub-tree with the newly-created leaves three and six it makes no errors, as shown in Figure (2). Pruning is a solution to the sub-tree replication problem that occurs when the decision tree starts splitting: "When sub tree replication occurs, identical sub trees can be found at several different places in the same tree structure" [28].
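A minimal sketch of the reduced-error-pruning rule described above (assuming a simple node structure of my own; this is illustrative, not Weka's REPTree implementation):

# Illustrative bottom-up reduced-error pruning over a simple tree type.
class Node:
    def __init__(self, children=None, majority_label=None):
        self.children = children or []   # empty list means this node is a leaf
        self.majority_label = majority_label

def errors_as_leaf(node, pruning_set):
    # Errors if this node predicted its majority label for every instance.
    return sum(1 for x, y in pruning_set if y != node.majority_label)

def errors_as_subtree(node, pruning_set, route):
    # 'route(node, x)' returns the child that instance x is sent to.
    if not node.children:
        return errors_as_leaf(node, pruning_set)
    total = 0
    for child in node.children:
        subset = [(x, y) for x, y in pruning_set if route(node, x) is child]
        total += errors_as_subtree(child, subset, route)
    return total

def prune(node, pruning_set, route):
    for child in node.children:
        subset = [(x, y) for x, y in pruning_set if route(node, x) is child]
        prune(child, subset, route)
    # Replace the sub-tree by a leaf if that makes no more errors on the pruning set.
    if node.children and errors_as_leaf(node, pruning_set) <= errors_as_subtree(node, pruning_set, route):
        node.children = []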

V PROPOSED WORK

In this section, the whole Arabic text classification process is explained; the work is then divided into a collection of systems, each with a special combination of the processes explained in previous sections, to produce the final classification after preparing the dataset.

Arabic Text Datasets

In this subsection, we present the datasets used in the experiments of this paper. The datasets are as follows:

BBC-Arabic News Dataset

The first dataset contains 4680 BBC-Arabic news documents, classified into the following predefined categories: {'Middle East', 'World News', 'Business', 'Sport', 'Newspapers', 'Science', 'Misc.'}. We manually chose a random set of 3000 of the existing documents, noting that the classification type of all documents is "single label" classification. Table (2) shows the division of the documents into the seven preset categories.

TABLE 2
The division of the BBC-Arabic News Dataset based on a 60% training set.

#  Class        Training Set  Testing Set  Full Dataset
1  Middle East  630           420          1050
2  World News   222           148          370
3  Business     124           82           206
4  Sport        348           232          580
5  Newspapers   234           155          389
6  Science      141           94           235
7  Misc.        102           68           170
   Total        1801          1199         3000

Note that the BBC-Arabic dataset was collected during our work, while the other datasets already exist in the literature (Arabic Corpora - Mourad Abbas) and (Arabic Corpora - Alj-News).

 

Alkhaleej News Dataset

The second dataset contains 5690 documents of the Alkhaleej News Dataset (Arabic Corpora - Mourad Abbas), (Arabic Corpora - Alj-News), classified into the following predefined categories: {'International News', 'Local News', 'Sport', 'Economy'}. We chose a random set of 2770 documents, noting that the classification type of all documents is single label classification (Abbas, Smaili 2005). Table (3) shows the division of the documents into the four preset categories.

TABLE 3
The division of the Alkhaleej News Dataset based on a 60% training set.

#  Class               Training Set  Testing Set  Full Dataset
1  Local News          630           400          1030
2  International News  480           320          800
3  Economy             264           176          440
4  Sport               300           200          500
   Total               1674          1096         2770
The tables above show that the data is partitioned into two parts, data for learning and data for testing, based on 60% for learning; this split style exists in the Weka tool, which offers many options for this purpose. A sketch of such a split is shown below.
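As a hedged illustration of this 60/40 split (the paper uses Weka's built-in percentage split; this sketch uses scikit-learn instead, with toy data):

# Illustrative 60/40 stratified split (the paper itself uses Weka's percentage split).
from sklearn.model_selection import train_test_split

docs = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1], [1, 2], [0, 2], [2, 0], [1, 3], [3, 1]]
labels = ["sport", "economy"] * 5
# 60% for learning, 40% for testing, keeping class proportions (stratified).
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, train_size=0.60, stratify=labels, random_state=0)
print(len(X_train), len(X_test))  # 6 4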

 

The Proposed Systems

In this section, we give a set of system configurations containing some of the processes listed in the previous section; a comparison is then performed between all the existing combinations, in the form of independent systems, and the results are extracted in the next section.

 

System A: Binary Particle Swarm Optimization and K-Nearest Neighbor

System A is the first proposed system. It classifies Arabic documents using the three main processes mentioned above: preprocessing, feature selection, and classification. This system contains three process pipelines, shown in Figure (3):

(1- Tokenization, stop-word discarding; 2- BPSO/KNN; 3- J48)
(1- Tokenization, stop-word discarding; 2- BPSO/KNN; 3- SVM)
(1- Tokenization, stop-word discarding; 2- BPSO/KNN; 3- REP-Tree)

[Figure 3: System A.]

Figure 3 shows the processes of system A using the BBC-Arabic dataset with the previous processes.

 

BPSO+KNN Experiment Steps

Step 1. We prepare a population of particles in the feature space and spread the particles randomly. X_i is the current position of the particle, initialized with random binary values, where 0 means that the corresponding feature is not selected and 1 means that it is selected. P_i is the best previous position of the particle, initialized with the same value as X_i. V_i is the velocity of P_i. According to the evaluation of each particle in the swarm, gbest (the global best) is initialized with the best fitness value of a particle.

Step 2. (Determining the fitness.) The fitness of the subset produced by a particle is evaluated after each feature selection iteration. The best fitness is the best accuracy in the evaluation of the selected subset of features, measured by the classifier algorithm (KNN) according to the following equation [27]:

Fitness = (α * Acc) + (β * ((N − T) / N))                            (5)

where:
- Acc refers to the classification accuracy of the particle using the chosen classifier.
- To balance classification accuracy against the dimension of the feature subset selected by particles, we use the α and β parameters, with α in the range [0, 1] and β = 1 − α.
- N refers to all features.
- T refers to the features selected by particle P.

The fitness is then updated, and the personal best of each particle is updated.

Step 3. (Updating gbest.) The gbest is now updated.

Step 4. (Updating positions.) According to the BPSO velocity equation from Section 3, we alter and update both velocity and position for all particles (Mendes, Kennedy and Neves, 2004), Equations (1) and (2). As mentioned in [25], the probability of a bit changing is determined by Equations (3) and (4), where rand() is a random number between (0, 1) [27], c1, c2 are acceleration factors (usually c1 = c2 = 2), P_gd is the global best, and V_id is the velocity of the particle [28].

Step 5. If the fitness value is better than the best fitness value in history (gbest), set the current value as the new gbest.

Step 6. For evaluation, in our case KNN, we use the Euclidean Distance (ED) to measure the relevancy between the current instance and the other instances in the dataset.

Step 7. Define the repository R. If the predicted classification of an instance is the same as its predefined classification, increase the repository R by 1.

Step 8. Now we can measure the classification accuracy of particle P by [27]:

Classification Accuracy = R / N                                      (6)

where R is the count of correct results after testing the features on the whole training set N. A sketch of this fitness evaluation is given below.
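A minimal sketch of Steps 2, 6 and 8 (illustrative; scikit-learn's KNN stands in for the paper's Weka KNN, the helper names are my own, and the inputs are assumed to be numpy arrays):

# Fitness evaluation sketch for BPSO+KNN (equations (5) and (6)).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(X_train, y_train, X_test, y_test, mask):
    # Steps 6-8: accuracy R/N of KNN (Euclidean distance) on the masked features.
    cols = np.flatnonzero(mask)
    knn = KNeighborsClassifier(metric="euclidean")
    knn.fit(X_train[:, cols], y_train)
    R = np.sum(knn.predict(X_test[:, cols]) == y_test)  # repository of correct predictions
    return R / len(y_test)                              # equation (6)

def fitness(acc, mask, alpha=0.70):
    # Equation (5): trade accuracy against subset size; beta = 1 - alpha.
    N, T = mask.size, mask.sum()
    beta = 1.0 - alpha
    return alpha * acc + beta * ((N - T) / N)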

 

The Experiment Parameters (BPSO+KNN)

(1) Inertia weight (ω): in equation (2) above, it balances the local search and the global search [27]; from the literature, the best value of ω is 1.2.
(2) The swarm dimension is 50 particles.
(3) The number of iterations is 200.
(4) α is in [0, 1] and β = 1 − α. If we use α = 1, then β = 0, which means that the dimension of the feature subset is neglected; so we choose a value in [0, 1] for α (0.70), and β = 1 − 0.70 = 0.30. These settings are gathered in the sketch below.
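The experiment settings above, gathered as a simple configuration (the dict name is illustrative, not from the paper):

# BPSO experiment settings as reported in the paper.
BPSO_PARAMS = {
    "inertia_weight": 1.2,   # omega, balances local vs. global search
    "swarm_size": 50,        # number of particles
    "iterations": 200,
    "alpha": 0.70,           # weight of classification accuracy in the fitness
    "beta": 0.30,            # 1 - alpha, weight of the subset-size term
}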

 

System B: Binary Particle Swarm Optimization and Support Vector Machine

The second system in this study also inserts the middle phase (feature selection). In this system we use BPSO with SVM, and then classify the resultant features with the three classifiers (Decision Tree J48, Support Vector Machine SVM, and Reduced Error Pruning Tree REP-Tree), as shown in Figure (4).

[Figure 4: System B.]

Figure (4) shows the processes of system B using the BBC-Arabic dataset with the previous processes, adding BPSO+SVM as the feature selection; the resultant features are classified for Arabic words using the three classifiers: SVM (as classifier), J48, and REP-Tree.

 

BPSO+SVM Experiment Steps

Step 1. The same as in system A.
Step 2. (Determining the fitness.) Here we use the fitness equation from system A, Equation (5), but with SVM instead of KNN to measure the classification accuracy.
Step 3. (Updating gbest.) The same as in A.
Step 4. (Updating positions.) The same as in A, using Equations (2), (3), and (4).
Step 5. The same as in A.
Step 6. For evaluation, in our case SVM, we use the SVM classifier in the Weka tool to measure the relevancy between the current instance and the other instances in the dataset.
Then repeat steps 7 and 8 as in system A, with the same parameters as in the system A experiments.

 

System C: Binary Particle Swarm Optimization and Reduced Error Pruning Tree

The last system in this study also inserts the middle feature selection phase, including the previous processes and contents of systems A and B. In this system we use BPSO with the Reduced Error Pruning Tree (REP-Tree), which had not been used in the Arabic text classification field before and was recently used in English news classification. Finally, we classify the resultant features with Decision Tree (J48), Support Vector Machine (SVM), and Reduced Error Pruning Tree (REP-Tree, as a classifier), as shown in Figure (5).

[Figure 5: System C.]

Figure (5) shows system C, adding BPSO+REP-Tree as the feature selection (REP-Tree here is the evaluator); the resultant features are classified for Arabic words using the three classifiers (SVM, J48, and REP-Tree as classifier).

 

BPSO+REP-Tree Experiment Steps

Step 1. The same as in system A.
Step 2. (Determining the fitness.) Here we use the Reduced Error Pruning Tree (REP-Tree) as the feature evaluator to measure the classification accuracy of the particle on the training set, instead of KNN in system A.
Step 3. (Updating gbest.) The same as in A.
Step 4. (Updating positions.) The same as in A, using Equations (2), (3), and (4).
Step 5. The same as in A.
Step 6. For evaluation, in our case REP-Tree, we use the REP-Tree classifier in the Weka tool to measure the relevancy between the current instance and the other instances in the dataset.
Then repeat steps 7 and 8 as in system A, with the same parameters as in the system A experiments.

We can alternatively replace the last three steps by measuring the F-measure to estimate the classification accuracy.

 

We can list the previous steps in short, general points as follows (an F-measure sketch follows the list):

(1) First, after preparing the feature (term) space and spreading particles randomly, we determine the classification accuracy (Acc) of a particle P on the training dataset using the Reduced Error Pruning Tree (REP-Tree).
(2) Start extracting and filtering the feature subset of the training set selected by the particle.
(3) Evaluate the extracted-features dataset with REP-Tree using 60% training-set validation.
(4) Determine the F-measure that results from the REP-Tree experiment to determine the fitness of the particle.
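The F-measure used here and throughout the results combines precision and recall; a minimal worked sketch (illustrative):

# F1 = 2PR / (P + R), the harmonic mean of precision and recall.
def f1_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Example with the "Newspapers" row of Table (4): P = 87.3, R = 88.9
# gives about 88.1, close to the 88.0 reported there (table values are rounded).
print(round(f1_measure(87.3, 88.9), 1))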

 

 

VI. EXPERIMENTAL RESULTS

In this section, the experimental results of the systems described in the last section are presented. We executed our experiments on two datasets, the BBC-Arabic News dataset and the Alkhaleej News dataset. As mentioned in the previous section, we split the data into 60% for training and 40% for testing, and we display the results in tables and figures. After that, we compare the systems with each other in a dedicated graph. We start by presenting the results of system A using the three classifiers described previously, then gradually review the results of system B, and finally end with system C.

6.1   System A.A ("BPSO+KNN"/J48)

The experimental results of system A with the J48 tree are shown in Tables (4) and (5) using the previous two datasets.

TABLE 4
System A with J48 tree applied on the BBC-Arabic Dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  67.3        69.7     68.4
World News   81.5        85.4     83.4
Business     72.4        73.4     72.8
Sport        84.2        79.7     81.8
Newspapers   87.3        88.9     88.0
Science      62.7        86.1     72.5
Misc.        83.9        89.6     86.6
Average      77          81.8     79

Table (4) shows the classification of BBC-Arabic documents using BPSO+KNN as a feature selector and the J48 decision tree as a classifier. As is clear from the table, the best classification is in the "Newspapers" class, with precision of 87.3, recall of 88.9 and F1-measure of 88.0. The second performance rank is the "Misc." class, with precision of 83.9, recall of 89.6 and F1-measure of 86.6. There is a convergence in the outcomes of "World News" and "Sport", with slightly better recall of 85.4 for the "World News" class. The worst two classes were "Science" and "Middle East", with precision of 62.7, recall of 86.1 and F-measure of 72.5 for "Science", and the worst precision of 67.3 and F-measure of 68.4 for "Middle East". Then we have the second dataset (Alkhaleej News Dataset) with the same experiment; Table (5) shows the results:

 

 

 

TABLE 5
System A with J48 tree applied on the Alkhaleej News Dataset

Class               Precision%  Recall%  F1-Measure%
Local News          75.8        78.4     77
International News  74.6        72.3     73.4
Economy             65.2        60.7     62.8
Sport               81.3        87.5     84.2
Average             74.2        74.7     74.3

Table (5) shows the classification of Alkhaleej News Dataset documents using BPSO+KNN as a feature selector and the J48 decision tree as a classifier. The best F-measure is for the "Sport" class with 84.2, and the worst F-measure is for the "Economy" class with 62.8.

 

6.2   System A.B ("BPSO+KNN"/SVM)

The experimental results of system A with the SVM classifier are shown in Tables (6) and (7) using the previous two datasets (BBC-Arabic and Alkhaleej):

TABLE 6
System A with SVM classifier applied on the BBC-Arabic Dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  88.3        79.7     83.7
World News   81.7        87.3     84.4
Business     84.5        92.4     88.2
Sport        87.2        79.7     83.2
Newspapers   86.4        88.2     87.3
Science      81.4        85.6     83.4
Misc.        89.4        95.6     92.3
Average      85.5        86.9     86

Table (6) shows the classification of BBC-Arabic documents using BPSO+KNN as a feature selector and SVM as a classifier. As is clear from Table (6), the best classification is for the "Misc." class, with precision of 89.4, recall of 95.6 and F1-measure of 92.3. The second performance rank is the "Business" class, with precision of 84.5, recall of 92.4 and F1-measure of 88.2. There is a convergence in the F1-measure outcomes of "Middle East" and "Science", with F1-measures of 83.7 and 83.4 respectively. The worst class is "Sport", with precision of 87.2, recall of 79.7 and F-measure of 83.2. Now we apply system A (the same experiment with SVM) on the second dataset (Alkhaleej News Dataset); Table (7) shows the results:


TABLE 7
System A with SVM classifier applied on the Alkhaleej News Dataset

Class               Precision%  Recall%  F1-Measure%
Local News          86.1        90.4     88.1
International News  82.4        81.7     82
Economy             91.6        87.8     89.6
Sport               95.3        89.5     92.3
Average             88.8        87.3     88

Table (7) shows the classification of Alkhaleej News Dataset documents using BPSO+KNN as a feature selector and SVM as a classifier. The best F-measure is for the "Sport" class with 92.3, and the worst F-measure is for the "International News" class with 82.

 
6.3   System A.C ("BPSO+KNN"/REP-Tree)

The third combination of system A uses our proposed classifier REP-Tree, which has recently been used in English text classification, as mentioned in the previous sections. Here, REP-Tree is a classifier used to classify the group of features resulting from feature selection by BPSO+KNN. The experimental results of system A with the REP-Tree classifier are shown in Tables (8) and (9) using the previous two datasets (BBC-Arabic and Alkhaleej):

TABLE 8
System A with REP-Tree classifier applied on the BBC-Arabic Dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  87.7        91.5     89.5
World News   85.9        85.7     85.7
Business     86.1        90.6     88.2
Sport        80.3        72.2     76
Newspapers   89.2        88.7     88.9
Science      83.8        87.8     85.7
Misc.        79.2        72.3     75.5
Average      84.6        84.1     84.2

Table (8) shows the classification of BBC-Arabic documents using BPSO+KNN as a feature selector and REP-Tree as a classifier. As is clear from Table (8), the best classification is for the "Middle East" class, with precision of 87.7, recall of 91.5 and F1-measure of 89.5. The second rank of performance is the "Newspapers" class, with precision of 89.2, recall of 88.7 and F1-measure of 88.9. We can detect the convergence between that class's performance and the "Business" class's performance, with precision of 86.1, recall of 90.6 and F1-measure of 88.2. The worst performance was the "Misc." class, with precision of 79.2, recall of 72.3 and F-measure of 75.5. As in all previous experiments, we apply the REP-Tree classifier on the other dataset: we now apply system A (the same experiment with REP-Tree) on the second dataset (Alkhaleej News Dataset); Table (9) shows the results:

 
TABLE 9
System A with REP-Tree classifier applied on the Alkhaleej News Dataset

Class               Precision%  Recall%  F1-Measure%
Local News          88.4        91.5     89.9
International News  93.2        85.2     89
Economy             80.1        83.6     81.8
Sport               92.7        82.7     87.4
Average             88.6        85.7     87

Accuracy results were comparable between REP-Tree and SVM, with average F1-measures of 87% for REP-Tree and 88% for SVM. In more detail, the best F-measure is for the "Local News" class with 89.9, and the worst F-measure is for the "Economy" class with 81.8.

 
6.4   System B.A ("BPSO+SVM"/J48)

The experimental results of system B with the J48 tree are shown in Tables (10) and (11) using the previous two datasets (the BBC-Arabic News dataset and the Alkhaleej News dataset):

TABLE 10
System B with J48 tree applied on the BBC-Arabic Dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  70.4        72.6     71.4
World News   88.3        83.1     85.6
Business     77.5        71.2     74.2
Sport        87.7        78.5     82.8
Newspapers   85.2        87.3     86.2
Science      61          77.4     68.2
Misc.        82.5        87       84.6
Average      78.9        79.5     79

Table (10) shows the classification of BBC-Arabic documents using BPSO+SVM as a feature selector and the J48 decision tree as a classifier. As is clear from the table, the best classification performance is the "Newspapers" class, with precision of 85.2, recall of 87.3 and F1-measure of 86.2. The second rank of classification performance is the "World News" class, with precision of 88.3, recall of 83.1 and F1-measure of 85.6. We can see that the worst classes are "Middle East" and "Science", with precision of 70.4, recall of 72.6 and F-measure of 71.4 for "Middle East", and the worst precision of 61.0 and F-measure of 68.2 for "Science". Here we can be quite sure that the J48 tree failed in the classification of the "Science" class, missing 31.8% according to its F-measure. Now we have the second dataset (Alkhaleej News Dataset) with the same experiment; Table (11) shows the results:

 
TABLE 11
System B with J48 tree applied on the Alkhaleej News Dataset

Class               Precision%  Recall%  F1-Measure%
Local News          49.8        52.4     51
International News  93.3        62.4     74.7
Economy             67.1        77.5     71.9
Sport               85.3        69.8     76.7
Average             73.8        65.5     68.5

Table (11) shows the classification accuracy of Alkhaleej News Dataset documents using BPSO+SVM as a feature selector and the J48 decision tree as a classifier. The best F-measure is for the "Sport" class with 76.7, and the worst F-measure is for the "Local News" class with 51. Here too we can be quite sure that the J48 tree failed in the classification of the "Local News" class, missing 49% according to its F-measure.

 
6.5   System B.B ("BPSO+SVM"/SVM)

The experimental results of system B with the SVM classifier are shown in Tables (12) and (13) using the previous two datasets (BBC-Arabic and Alkhaleej):

TABLE 12
System B with SVM classifier applied on the BBC-Arabic Dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  67.9        88.7     76.9
World News   98.7        90.3     94.3
Business     87.9        89.3     88.5
Sport        60.3        80.7     69
Newspapers   79.8        84.2     81.9
Science      99.2        85.6     91.8
Misc.        90.4        98.8     94.4
Average      83.4        88.2     85.2

Table (12) shows the classification of BBC-Arabic documents using BPSO+SVM as a feature selector and SVM as a classifier. As is clear from Table (12), the best classification is for the "Misc." class, with precision of 90.4, recall of 98.8 and F1-measure of 94.4. The second performance rank is the "World News" class, with precision of 98.7, recall of 90.3 and F1-measure of 94.3. The worst class is "Sport", with precision of 60.3, recall of 80.7 and F-measure of 69. Now we apply system B (the same experiment with SVM) on the second dataset (Alkhaleej News Dataset); Table (13) shows the results:

 

TABLE 13
System B with SVM classifier applied on the Alkhaleej News Dataset

Class               Precision%  Recall%  F1-Measure%
Local News          83.2        88.6     85.8
International News  88.5        85.7     87
Economy             96.6        90.9     93.6
Sport               90.3        89.7     89.9
Average             89.6        88.7     89

Table (13) shows the classification of Alkhaleej News Dataset documents using BPSO+SVM as a feature selector and SVM as a classifier. The best accuracy (F-measure) is for the "Economy" class with 93.6, and the worst F-measure is for the "Local News" class with 85.8.

 
6.6   System B.C ("BPSO+SVM"/REP-Tree)

The third combination of system B uses our proposed classifier REP-Tree, which, as mentioned in the previous experiments, has recently been used by (Kalmegh, 2015) and (Patel and Upadhyay, 2012) in English text classification, and by (Naji and Ashour, 2016) in Arabic text classification (a previous paper related to this one), as mentioned in the first section. Here REP-Tree is a classifier used to classify the group of features resulting from feature selection by BPSO+SVM. The experimental results of system B with the REP-Tree classifier are shown in Tables (14) and (15) using the previous two datasets (BBC-Arabic and Alkhaleej):

TABLE 14
System B with REP-Tree classifier applied on the BBC-Arabic Dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  77          89.4     82.7
World News   98.3        96.1     97.1
Business     87.2        78.5     82.6
Sport        79.5        75.8     77.6
Newspapers   88.2        88.9     88.5
Science      85.4        87.1     86.2
Misc.        89          69.4     77.9
Average      86.3        83.6     84.6

Table (14) shows the classification of BBC-Arabic documents using BPSO+SVM as a feature selector and REP-Tree as a classifier. As is clear from Table (14), the best classification is for the "World News" class, with precision of 98.3, recall of 96.1 and F1-measure of 97.1. The second rank of performance is the "Newspapers" class, with precision of 88.2, recall of 88.9 and F1-measure of 88.5. We can detect the convergence between the "Middle East" class performance and the "Business" class performance, with F1-measures of 82.7 and 82.6. The worst performance was the "Sport" class, with precision of 79.5, recall of 75.8 and F-measure of 77.6. As in all previous experiments, we apply the REP-Tree classifier on the other dataset: we now apply system B (the same experiment with REP-Tree) on the second dataset (Alkhaleej News Dataset); Table (15) shows the results:

 
TABLE 15
System B with REP-Tree classifier applied on the Alkhaleej News Dataset

Class               Precision%  Recall%  F1-Measure%
Local News          72          78.3     75
International News  89.6        92.2     90.8
Economy             87.3        88.3     87.7
Sport               95.4        87.5     91.2
Average             86          86.5     86.1

From Table (15) we see that the best REP-Tree accuracy (F1-measure) is 91.2 for the "Sport" class, and the worst F-measure is for the "Local News" class with 75. We note that the results were comparable with the SVM classifier.

 
 
6.7   System C.A ("BPSO+REP-Tree"/J48)

System C consists of Binary PSO as a feature selector with the proposed REP-Tree as an evaluator to check the best group of features; we then use the three previous classifiers (J48, SVM, and REP-Tree) to build the classification model. The classification of the resultant group of features in the training set reduces the dimension of the original dataset, and the classifiers are then applied to the test dataset. We have previously noted that REP-Tree has recently been used by (Kalmegh, 2015) and (Patel and Upadhyay, 2012) to classify English text, and by (Naji and Ashour, 2016) in Arabic text classification.

The experimental results of system C with the J48 tree are shown in Tables (16) and (17) using the previous two datasets (the BBC-Arabic News dataset and the Alkhaleej News dataset):

TABLE 16
System C with J48 tree applied on the BBC-Arabic Dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  88.7        83.3     85.9
World News   90.4        87.4     88.8
Business     75.2        70.5     72.7
Sport        84.8        74.2     79.1
Newspapers   80.1        83.8     81.9
Science      79.8        78.3     79
Misc.        77.6        85.7     81.4
Average      82.3        80.4     81.2

 
Table (16) shows the classification of BBC-Arabic documents using BPSO+REP-Tree as a feature selector and the J48 decision tree as a classifier. As is clear from the table, the best classification performance is the "World News" class, with precision of 90.4, recall of 87.4 and F1-measure of 88.8. The second rank of classification performance is the "Middle East" class, with precision of 88.7, recall of 83.3 and F1-measure of 85.9. We can note that the worst class was the "Business" class, with precision of 75.2, recall of 70.5 and F-measure of 72.7. Here we can be quite sure that the J48 tree failed in the classification of the "Business" class, missing 27.3% according to its F-measure.

Now we have the second dataset (Alkhaleej News Dataset) with the same experiment; Table (17) shows the results:

 
TABLE 17
System C with J48 tree applied on the Alkhaleej News Dataset

Class               Precision%  Recall%  F1-Measure%
Local News          60.3        56.8     58.4
International News  68.6        70.9     69.7
Economy             90.4        75.9     82.5
Sport               84.8        72.5     78.1
Average             73.5        69       72.1

Table (17) shows the classification accuracy of Alkhaleej News Dataset documents using BPSO+REP-Tree as a feature selector and the J48 decision tree as a classifier. The best F-measure is for the "Economy" class with 82.5, and the worst F-measure is for the "Local News" class with 58.4. Here too we can be quite sure that the J48 tree failed in the classification of the "Local News" class, missing 41.6% according to its F-measure.

 
6.8   System C.B ("BPSO+REP-Tree"/SVM)

The experimental results of system C with the SVM classifier are shown in Tables (18) and (19) using the previous two datasets (BBC-Arabic and Alkhaleej):

TABLE 18
System C with SVM classifier applied on the BBC-Arabic Dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  98.6        94.4     96.4
World News   68.2        88.9     77.1
Business     82.3        85.7     83.9
Sport        64.6        78.5     70.8
Newspapers   81.4        82.8     82
Science      97.2        87.1     91.8
Misc.        92.5        96.9     94.6
Average      83.5        87.7     85.2

Table (18) shows the classification of BBC-Arabic documents using BPSO+REP-Tree as a feature selector and SVM as a classifier. From Table (18) we note the equality in the average F-measure using the same SVM classifier with a different feature selection combination (BPSO+REP-Tree); the current results can be compared with Tables (12) and (13) (BPSO+SVM feature selection). Here we get average F-measures of 85.2 and 89.05 for SVM (the same classifier but a different feature selector). As usual, we apply system C (the same experiment with SVM) on the second dataset (Alkhaleej News Dataset); Table (19) shows the results:

TABLE 19
System C with SVM classifier applied on the Alkhaleej News Dataset

Class               Precision%  Recall%  F1-Measure%
Local News          97.2        93.7     95.4
International News  94.5        82.9     88.3
Economy             90.3        95.5     92.8
Sport               79.5        80       79.7
Average             90.3        88       89.05

Table (19) shows the classification of Alkhaleej News Dataset documents using BPSO+REP-Tree as a feature selector and SVM as a classifier. The best accuracy (F-measure) is for the "Local News" class with 95.4, and the worst F-measure is for the "Sport" class with 79.7. In this experiment, we again note the equality and convergence of the classification results using the same SVM classifier with a different feature selection combination (BPSO+REP-Tree).

6.9   System C.C ("BPSO+REP-Tree"/REP-Tree)

The third combination of system C consists of Binary PSO as a feature selector with the proposed REP-Tree as an evaluator, and then REP-Tree as a classifier, as mentioned in the System C subsection above. The experimental results of system C with the REP-Tree classifier are shown in Tables (20) and (21) using the previous two datasets (BBC-Arabic and Alkhaleej):

 
 

TABLE 20
System C with REP-Tree classifier applied on the BBC-Arabic Dataset

Class        Precision%  Recall%  F1-Measure%
Middle East  97.2        95.3     96.2
World News   88.6        78.5     83.2
Business     87.3        88.6     87.9
Sport        79.9        75.9     77.8
Newspapers   86.1        98.4     91.8
Science      80          86.9     83.3
Misc.        82.5        92       86.9
Average      85.9        87.9     86.7

Table (20) shows that REP-Tree has been effective enough in the classification of BBC-Arabic documents using BPSO+REP-Tree as a feature selector and REP-Tree as a classifier. The results are as follows: the best classification is for the "Middle East" class, with precision of 97.2, recall of 95.3 and F1-measure of 96.2. Next we have the second classification performance, the "Newspapers" class, with precision of 86.1, recall of 98.4 and F1-measure of 91.8. The third classification accuracy is the "Business" class, with an F-measure of 87.9. We can detect the convergence between the "Science" class performance and the "World News" class performance, with F1-measures of 83.3 and 83.2. The worst performance was the "Sport" class, with an F-measure of 77.8.

As usual, we apply the REP-Tree classifier on the other dataset: we now apply system C (the same experiment with REP-Tree) on the second dataset (Alkhaleej News Dataset); Table (21) shows the results:

 
TABLE 21
System C with REP-Tree classifier applied on the Alkhaleej News Dataset

Class               Precision%  Recall%  F1-Measure%
Local News          98          97.4     97.6
International News  91.3        92.5     91.8
Economy             85.7        87.1     86.3
Sport               93.8        89.6     91.6
Average             92.2        91.6     91.8

From Table (21), we see that the best REP-Tree accuracy (F1-measure) is 97.6 for the "Local News" class, and the worst F-measure is for the "Economy" class with 86.3. The average accuracy of REP-Tree in this experiment was 91.8.

 
6.10   Performance of the Three Systems

In this subsection, we make a comparison between the previous results on the two datasets (BBC-Arabic and Alkhaleej), before adding any enhancements to the systems in the preprocessing phase. Table (22) and Figure (6) show the results of this comparison.

TABLE 22
Comparison between the F-measure averages of the three systems

Datasets         System A (BPSO+KNN)%  System B (BPSO+SVM)%  System C (BPSO+REP-Tree)%
BBC-Ar (J48)     79                    79                    81.2
BBC-Ar (SVM)     86                    85.2                  85.2
BBC-Ar (REP)     84.2                  84.6                  86.7
Alkhaleej (J48)  74.3                  68.5                  72.1
Alkhaleej (SVM)  88                    89                    89
Alkhaleej (REP)  87                    86.1                  91.8

[Figure 6: Comparison between the accuracy of the three systems.]

 
From Table (22) and Figure (6), we draw the overall results of all the experiments, calculate the averages of the F1-measure values, and compare all the systems with each other.

 

VII. CONCLUSION

This paper proposed a new feature selection approach to select the best subset of features from the original Arabic documents. We showed that the proposed approach works well in this area, as the experimental results demonstrate. The proposed approach can be used in the field of Arabic search engines and for classifying huge numbers of Arabic website pages into hierarchical classes (labels).

 

We proposed the Reduced Error Pruning Tree (REP-Tree) classifier, which had not been used for Arabic text classification before, for two purposes. The first is as an evaluator of the feature subsets produced by the Binary Particle Swarm Optimization (BPSO) feature selection algorithm. To evaluate this approach (BPSO+REP-Tree), we used two Arabic datasets, the BBC Arabic News dataset and the Alkhaleej News dataset. The second purpose is to use REP-Tree as a classifier to build the learning model. We compared the first purpose (BPSO+REP-Tree) with two existing approaches, (BPSO+KNN) and (BPSO+SVM), and the second purpose (the REP-Tree classifier) with two well-known classifiers, J48 and SVM. We named the three feature selection approaches A for (BPSO+KNN), B for (BPSO+SVM), and C for (BPSO+REP-Tree). From the experimental results, we concluded that the proposed approach, System C, is effective. We chose the F1-Measure, which combines the two factors of precision and recall, to estimate the accuracy of the classification process.
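
To make the first purpose concrete, here is a minimal Python sketch (our own illustration, not the original implementation) of the BPSO wrapper loop; evaluate_subset is a hypothetical stand-in for training a REP-Tree on the selected features and returning its F1-Measure:

```python
# Minimal sketch (for illustration only) of the BPSO wrapper loop.
import math
import random

N_FEATURES, N_PARTICLES, N_ITER = 50, 10, 30

def evaluate_subset(mask):
    # Hypothetical stand-in: the real fitness trains a REP-Tree (System C)
    # on the features where mask[d] == 1 and returns its F1-Measure.
    return random.random() if any(mask) else 0.0

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

swarm = [[random.randint(0, 1) for _ in range(N_FEATURES)]
         for _ in range(N_PARTICLES)]
velocity = [[0.0] * N_FEATURES for _ in range(N_PARTICLES)]
pbest = [list(p) for p in swarm]
pbest_fit = [evaluate_subset(p) for p in swarm]
g = pbest_fit.index(max(pbest_fit))
gbest, gbest_fit = list(pbest[g]), pbest_fit[g]

w, c1, c2 = 0.9, 2.0, 2.0  # inertia weight and acceleration constants
for _ in range(N_ITER):
    for i, particle in enumerate(swarm):
        for d in range(N_FEATURES):
            # Continuous velocity update, clamped to a Vmax, then the
            # discrete (sigmoid) position rule of the binary PSO variant.
            v = (w * velocity[i][d]
                 + c1 * random.random() * (pbest[i][d] - particle[d])
                 + c2 * random.random() * (gbest[d] - particle[d]))
            velocity[i][d] = max(-6.0, min(6.0, v))
            particle[d] = 1 if random.random() < sigmoid(velocity[i][d]) else 0
        fit = evaluate_subset(particle)
        if fit > pbest_fit[i]:
            pbest[i], pbest_fit[i] = list(particle), fit
            if fit > gbest_fit:
                gbest, gbest_fit = list(particle), fit

# gbest now holds the best feature mask found; in the paper's pipeline the
# selected features feed the final classification model.
```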

 

The F1-Measure values for System A are in the range of 73%-79% with the J48 classifier, 86%-88% with SVM, and 84%-87% with the proposed REP-Tree classifier. For the second system (B), with the same classifiers, the values are in the range of 60.9%-84.6% with J48, 85.2%-89.6% with SVM, and 84.6%-89.5% with REP-Tree; the last two classifiers are comparable in accuracy. Finally, applying the experiments to our proposed feature selection approach, System C, gives accuracies in the range of 69.5%-79.6% with J48, 87%-89.8% with SVM, and 86.7%-91.8% with the proposed REP-Tree classifier.
