

JOURNAL OF ENGINEERING RESEARCH AND TECHNOLOGY, VOLUME 4, ISSUE 3, SEPTEMBER 2017

 

 
    Arabic Text Genre Classification 
Alaa M. El-Halees 

Faculty of Information Technology, Islamic University of Gaza, Gaza, Palestine, email alhalees@iugaza.edu.ps 
 

Abstract— A text genre is a type of written text. Arabic text genre classification predicts the genre of a document written in Arabic, independent of its topic. This paper proposes an approach that takes an Arabic document and classifies it into one of four genres: advertisements, news, subjective documents, and scientific documents. Because the word-frequency approach performs poorly for genre classification, attributes were generated from the style of the text instead. The approach was evaluated on a corpus collected for this purpose. Using four machine learning methods, our approach was compared with the word-frequency approach and was found to perform better than this mainstream approach. It was also found that predicting the subjective and scientific genres is more accurate than predicting advertisements and news.

Index Terms— Text genre, text genre classification, Arabic language processing, text mining, machine learning 
methods. 

 
I INTRODUCTION

 
Text genre classification is concerned with correctly predicting the type of an unknown text, independent of its topic [1]. Genre means the kind of text: it is the functional role of the text, not its topic. Examples of text genres are scientific articles, news reports, reviews, and advertisements. Text genre matters because users often want a specific type of text. The typical example is information retrieval and search engines, where the user may want documents for a specific purpose, such as a review of some object (e.g., people's opinions of a product) or a scientific article on some subject [2]. Text genre classification differs from traditional text classification, which is based on the frequency of certain words in the document using the TFIDF representation. In genre classification, text style is used instead.

Most research on text genre classification deals with English text. Some works deal with other languages, but for Arabic, a language spoken by millions of people, there is no work on text genre classification. Arabic is a challenging language for several reasons. It has a complex morphology compared with languages such as English. This is due to the nature of the Arabic language: it is an inflectional and derivational language, which makes morphological analysis a very complex task [3].

The first and most important task in genre classification is to choose the genre types. In linguistics, three abstract and very general classes are used, namely expressive, appellative, and informative text [4]. Accordingly, texts were tagged as subjective (expressive), advertisement (appellative), and scientific papers and news (informative). Next, cues that identify the genre, such as the structure of the sentence, the length of the sentence, the characters used, and punctuation, are used to generate the features. Then, machine learning methods are used to classify the genre. Four machine learning methods were used: Support Vector Machine, Naïve Bayes, k-Nearest Neighbors, and Decision Trees.

To evaluate this approach, a corpus was collected from many Arabic websites, since no other work had been done on this topic. Finally, our method was compared with the traditional TFIDF method that is used in topic classification.

The remainder of the paper is organized as follows: Section II reviews related work in this area, Section III describes genre classification, Section IV presents our methodology, Section V reports the experiments and results, and Section VI closes the paper with a conclusion and an outlook for future work.

II RELATED WORKS 

For English, text genre classification has been addressed by many works, such as Kessler et al. [1], who proposed a theory of genres as bundles of facets, which correlate with various surface cues. They argued that genre detection based on surface cues is as successful as detection based on deeper structural properties. They developed a taxonomy of genres and facets. They also found an effective strategy for variable selection that avoids overfitting during training with neural networks, which had higher performance on average.

Karlgren and Cutting [5] used discriminant analysis to categorize texts into pre-determined genre categories. They argued that discriminant analysis makes it possible to use a large number of parameters that may be specific to a certain corpus and combine them into a small number of functions, with the parameters weighted by how useful they are for discriminating text genres. Liu et al. [6] discussed automatic genre classification and its applications. They argued that word-level features and sentence-level features are two important measures that vary in number among different genres. Based on these two views, they explored an approach in which the co-training method is employed to obtain the genre classification. Stamatatos et al. [7] took full advantage of existing natural language processing tools to propose style markers, including analysis-level measures that characterize the way in which the input text has been analyzed and capture valuable stylistic information. They presented a set of small-scale experiments in text genre detection, author identification, and author verification tasks. They showed that the proposed method performs better than most distributional lexical measures, that is, functions of vocabulary richness and the frequencies of occurrence of the most frequent words. Galitsky et al. [8] proposed using methods based on deep textual parsing, which depend on finding complex features such as the syntactic and discourse structures of the text, to improve the quality of genre classification. Their paper presented three experiments on style and genre classification. For the genre classification task, they adopted a corpus annotated with seven genres and conducted a series of pairwise classifications between two genres. Melissourgou and Frantzi [9] investigated a range of genres involved in writing tasks presented in English language teaching material. They explained how they identified genres based on Systemic Functional Linguistics (SFL) principles. They added another stage, the 'naming' of genre categories, based mainly on purpose and mode, to guide anyone who needs to understand genre requirements.

In multi-language text genre classification, Petrenz [10] described a new approach to classifying text genres across languages. It can bring the benefits of genre classification to the target language and achieve good results without the cost of manual annotation. In his experiments, he considered English and Chinese because these languages are linguistically very dissimilar. He expected the approach to work at least equally well for more closely related language pairs.

III GENRE CLASSIFICATION 

Genre classification is different from the topic classification that most classification research has dealt with. From an information retrieval point of view, a query about a certain topic retrieves many documents related to that topic, but they may be of different genres [11]. For example, if someone searches for a certain product, the retrieved pages will be any documents that contain the name of that product. Genre, however, lets the user specify whether he or she wants, for example, news, an advertisement, or a critical review about that product [2].

Genres give a way to describe the nature of a text, which allows documents to be assigned to classes. Arabic genre classification is concerned with correctly predicting the genre class of an unknown Arabic document, independent of its topic. Formally, let C = {c1, c2, ..., cm} be a set of genre classes and D = {d1, d2, ..., dn} a set of Arabic documents. The task of Arabic genre classification consists of assigning a class label ci to each document dj such that dj belongs to ci, where exactly one class must be assigned to each dj.

In linguistics, text genres can be classified into three general classes, namely expressive, appellative, and informative [4]. Expressive means that the text aims to express the feelings, attitudes, and opinions of a person. According to this definition, an opinion mining corpus maps to the expressive genre. Appellative means appealing to the receiver's experience, feelings, knowledge, and sensibility to make him or her react in a specific way [12]. The texts that best map to this genre are advertisements, which are used in this research. Finally, informative text provides information about any topic of knowledge. Such texts have an impersonal, objective, non-emotive style [4]. Two classes were mapped to this genre: scientific papers and news.

IV METHODOLOGY 

Our methodology consists of the following steps:

A. Generate Features 
A text genre is mostly characterized by its text style. To generate features, this work concentrated on two levels of text style: the token level and the lexicon level. The token level considers the text as a set of tokens grouped in sentences. At this level, features were generated from each document, such as the average number of words per sentence, the average number of short words in a document (where short words are words with fewer than six characters), the average number of words per phrase, and the average number of characters per word. At the lexicon level, features were generated from each document, such as the average number of nouns, adjectives, and verbs per word. Features such as the average number of pronouns, coordinating conjunctions, cardinal numbers, and determiners per document were also added.
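As an illustration of the token-level features described above, the following is a minimal sketch and not the implementation used in the experiments. The function name, the delimiter sets used to split sentences and phrases, and the choice to express short words as a ratio of all words are assumptions made for this example. Words are matched with `\w+`, so diacritized Arabic text would need to be normalized first.

```python
import re

# Delimiter choices are assumptions; they include Arabic punctuation marks.
SENTENCE_END = re.compile(r"[.!?\u061F]+")                # . ! ? and Arabic question mark
PHRASE_END = re.compile(r"[.!?\u061F,;:\u060C\u061B]+")   # adds Latin/Arabic comma and semicolon
SHORT_WORD_MAX_LEN = 5  # "short" means fewer than six characters


def _split(pattern, text):
    """Split text on a delimiter pattern and drop empty pieces."""
    return [piece.strip() for piece in pattern.split(text) if piece.strip()]


def token_level_features(text):
    """Return four token-level style features for one document."""
    words = re.findall(r"\w+", text)
    if not words:
        return {
            "avg_words_per_sentence": 0.0,
            "short_word_ratio": 0.0,
            "avg_words_per_phrase": 0.0,
            "avg_chars_per_word": 0.0,
        }
    sentences = _split(SENTENCE_END, text)
    phrases = _split(PHRASE_END, text)
    short_words = sum(len(w) <= SHORT_WORD_MAX_LEN for w in words)
    return {
        "avg_words_per_sentence": len(words) / len(sentences),
        "short_word_ratio": short_words / len(words),
        "avg_words_per_phrase": len(words) / len(phrases),
        "avg_chars_per_word": sum(len(w) for w in words) / len(words),
    }
```

Each document is thus reduced to a small numeric vector that does not depend on which words occur, only on how the text is laid out.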

 
B. Corpus 
As stated above, this research used four text genres: subjective, advertisements, news, and scientific. Since there is no other work on Arabic genre classification, no corpus exists in the literature. Therefore, our own corpus was collected. As shown in Table 1, 7825 documents were used for the four genre types, where each genre type contains more than one topic. For example, the subjective genre has positive and negative reviews on topics such as movies, hotels, and books. Advertisements cover many products, such as electronics, furniture, and medical and sports equipment. News covers culture, economy, international, and sports topics. Finally, scientific papers cover medicine, science, economy, and literature.

 

TABLE 1
Corpus used in the experiments

Genre           Type                            No. of Documents   Total No. of Documents
Subjective      1. Positive                     1430               2860
                2. Negative                     1430
Advertisement   1. Computers and Electronics    340                1456
                2. Furniture                    522
                3. Medical Equipment            254
                4. Sports Equipment             342
News            1. Culture                      500                2000
                2. Economy                      500
                3. International                500
                4. Sports                       500
Scientific      1. Medicine                     512                1509
                2. Science                      327
                3. Economy                      378
                4. Literature                   292

 
C. Methods 
In our experiments, documents were classified into genres using two approaches, TFIDF representation and text style extraction. Four classifiers were applied: Naïve Bayes, k-Nearest Neighbors, Support Vector Machine, and Decision Trees.

The Naïve Bayes classifier is widely used because of its simplicity and computational efficiency. The model assigns a class label to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. For text, it is trained by estimating the relative frequencies of words in documents as word probabilities and uses these probabilities to assign a class to the document. To estimate the term P(d | c), where d is the document and c is the class, Naïve Bayes decomposes it by assuming that the features are conditionally independent [13].
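The decomposition of P(d | c) into a product of per-word probabilities can be sketched as follows. This is a minimal multinomial variant for illustration only; the function names, the toy interface of tokenized documents, and the use of add-one (Laplace) smoothing are assumptions and are not taken from the experiments.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Estimate log P(c) and log P(w | c) from tokenized documents.

    Add-one smoothing is an assumption made for this sketch.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict_naive_bayes(model, tokens):
    """Pick the class maximizing log P(c) + sum of log P(w | c).

    Summing log-probabilities is the conditional-independence assumption;
    words never seen in training are skipped.
    """
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.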

k-Nearest Neighbors is a method used in classification. The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. For text, the training documents have to be indexed and converted to a vector representation. To classify a new document d, the similarity of its document vector to each document vector in the training set has to be computed. Then its k nearest neighbors are determined by a similarity measure such as the Euclidean distance [14].
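The procedure above can be sketched in a few lines. This is an illustrative sketch, not the classifier used in the experiments: the function names, the default k = 3, and the majority-vote rule for combining the neighbors' labels are assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    neighbours = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Because training only stores the vectors, all of the cost is paid at prediction time, when the query is compared with every stored document.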

Support Vector Machine is a classification algorithm proposed in [15]. In its simplest linear form, it is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin. For text, test documents are classified according to their positions relative to the hyperplanes.

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. A decision tree text classifier consists of a tree in which internal nodes are labeled by words, branches departing from them are labeled by tests on the weight that the words have in the representation of the test document, and leaf nodes are labeled by categories ci. Such a classifier categorizes a test document dj by recursively testing the weights that the words labeling the internal nodes have in the representation of dj, until a leaf node ci is reached; the label of this leaf node is then assigned to dj [16].
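The traversal described above can be sketched as follows. The tree is hypothetical: its feature names echo the style features of this paper, but the thresholds and leaf labels are invented for illustration and are not the tree learned in the experiments. The sketch uses a loop rather than explicit recursion, which visits the same nodes.

```python
# Hypothetical tree: thresholds and leaf labels are invented for illustration.
# Internal nodes are dicts; leaves are plain genre labels.
EXAMPLE_TREE = {
    "feature": "avg_words_per_phrase",
    "threshold": 6.0,
    "left": "advertisement",
    "right": {
        "feature": "avg_chars_per_word",
        "threshold": 5.0,
        "left": "news",
        "right": "scientific",
    },
}


def classify(tree, document):
    """Follow the test at each internal node until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):
        value = document[node["feature"]]
        # Values at or below the threshold go left, all others go right.
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node
```

For example, `classify(EXAMPLE_TREE, {"avg_words_per_phrase": 4.0, "avg_chars_per_word": 4.2})` stops at the first test and returns the left leaf.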

V EXPERIMENT AND RESULTS 

A. Experiments
Two sets of experiments were conducted: the first set performed topic-based classification as a baseline, and the second set evaluated the generated features.

For the first set of experiments, our baseline used the corpus described above with topic-based classification. Before classification, some pre-processing was done, such as tokenization, stop word removal, and Arabic light stemming. Then, vector representations of the terms were obtained from their textual representations by applying TFIDF weighting, a well-known term weighting scheme often used in text mining. Some terms with a low frequency of occurrence were removed. For classification, the four methods described above were used: Naïve Bayes, k-Nearest Neighbors, Support Vector Machine, and Decision Trees.
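The baseline representation can be illustrated with a small sketch. The exact TFIDF variant used in the experiments is not specified, so the formula below (relative term frequency multiplied by log(N / df)) is an assumption, as are the function name and the `min_doc_freq` parameter, which stands in for the removal of low-frequency terms. Documents are assumed to be already tokenized, stop-word filtered, and stemmed.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs, min_doc_freq=1):
    """Return (vocabulary, vectors) with weights tf * log(N / df).

    Terms appearing in fewer than `min_doc_freq` documents are dropped.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append(
            [(counts[t] / len(doc)) * idf[t] if doc else 0.0 for t in vocab]
        )
    return vocab, vectors
```

A term that occurs in every document receives weight zero under this variant, which is why topic words shared across genres contribute little or, worse, mislead the classifier when genres share a topic.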

In the second set of experiments, using the same corpus, features were generated based on the nature and lexicon of the documents in the corpus. Part-of-speech (POS) tagging was used to generate word classes, such as nouns, adjectives, and verbs. Then the four machine learning methods were applied.
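Once a document has been POS-tagged, the lexicon-level features reduce to counting tags. The sketch below assumes the tagger's output is already available as (word, tag) pairs; the coarse tag names, the function name, and the decision to report every feature as a share of all tokens are assumptions made for illustration, and no particular Arabic tagger is implied.

```python
from collections import Counter

# Coarse tag names are assumptions; map the tagger's own tag set onto them.
TRACKED_TAGS = ("NOUN", "ADJ", "VERB", "PRON", "CCONJ", "NUM", "DET")


def lexicon_level_features(tagged_tokens):
    """Return the share of tokens carrying each tracked POS tag.

    `tagged_tokens` is a list of (word, tag) pairs from any POS tagger.
    """
    total = len(tagged_tokens)
    counts = Counter(tag for _, tag in tagged_tokens)
    return {
        f"{tag.lower()}_ratio": (counts[tag] / total if total else 0.0)
        for tag in TRACKED_TAGS
    }
```

These ratios can be concatenated with the token-level features to form the final feature vector for each document.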

Our experiments were evaluated using 10-fold cross-validation, and then the F-measure was computed, which is a combined metric that takes both precision and recall into consideration.
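The evaluation protocol can be sketched as follows. The sketch assumes the balanced F-measure, F = 2PR / (P + R), computed per genre, and a simple unshuffled assignment of documents to folds; both choices, and the function names, are assumptions for illustration, since the paper does not state them.

```python
def f_measure(true_labels, predicted_labels, target):
    """Balanced F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices); item i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test
```

Each document is used for testing exactly once and for training in the other nine folds, so the reported score reflects performance on documents the classifier has not seen.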

 
B. Results

Table 2 and Figure 1 show the F-measure for the baseline classification, which is based on TFIDF, and for classification based on features generated from the text, using the four machine learning methods in both cases. It is clear that the generated features give better results than the baseline for all machine learning methods. However, there is little difference in performance when using Naïve Bayes. The biggest difference occurs with Decision Trees, where the F-measure is 48.87% using the baseline and 100% using the generated features. That is mainly because the baseline depends on the frequency of words, and a word may be frequent in more than one genre if the documents are on the same topic (e.g., a word from the sports topic can appear in news, advertisements, subjective, or scientific documents). That is not the case for the generated features, which depend on style rather than frequency.

 

 
Figure 1: F-measure for Arabic  genre classification 

 
TABLE 2
F-measure for baseline and generated features
Arabic genre classification

Method                    Baseline %   Generated Features %
Support Vector Machine    90.89        98.07
k-Nearest Neighbors       85.42        95.58
Naïve Bayes               80.37        80.47
Decision Trees            48.87        100

Table 3 shows the F-measure for the four selected corpus genres, where B.L. stands for baseline and G.F. for generated features. With the generated features, all methods accurately recognized the subjective and scientific genres. There is some confusion between advertisements and news, which is natural because the two genres share many characteristics.

TABLE 3
F-measure for four Arabic text genres

                 Support Vector     k-Nearest          Naïve Bayes %      Decision
                 Machine %          Neighbors %                           Trees %
Genre            B.L.     G.F.      B.L.     G.F.      B.L.     G.F.      B.L.     G.F.
Advertisements   84.9     96.3      79.6     90.32     90.7     68.57     28.7     100
News             94.4     96.0      80       88.0      81.2     53.3      67.4     100
Subjective       86.2     100       85.1     100       76.25    100       0        100
Scientific       98       100       97       100       97.23    100       99.3     100

From the generated decision tree, shown in Figure 2, it can be seen that the average number of words per phrase, the average number of characters per word, and the average number of words per sentence are the most important attributes that distinguish genre.

Figure 2: Decision tree for generated features

VI CONCLUSION AND FUTURE WORKS

There are some significant differences between text topic classification and text genre classification. Text topic classification depends mainly on the frequency of some words in a document to recognize that document. This does not work for text genre classification because words may be frequent in multiple genres. In this paper, Arabic documents were classified into genres. Four genre types were chosen: advertisements, news, subjective, and scientific. We generated attributes based on Arabic language style. The work was evaluated using a manually collected corpus. Using four machine learning methods, we found that our generated features perform better than the TFIDF method with the same machine learning methods and the same corpus. We also concluded that the subjective and scientific genres are predicted more accurately than news and advertisements.

In future work, other Arabic genres may be used, such as poems, Islamic scripts, events, and biographies. Other attributes that can recognize the genre, such as the syntactic level of the Arabic language, may also be explored. In addition, the features are currently generated manually; with techniques such as deep learning, they could be generated automatically.

REFERENCES

[1] B. Kessler, G. Nunberg, and H. Schütze, "Automatic detection of text genre," In the proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, 7–12 July 1997.

[2] Y. B. Lee and S. H. Myaeng, "Text genre classification with genre-revealing and subject-revealing features," In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 145-150, 2002.

[3] B. Hammo and S. Lytinen, "QARAB: A Question Answering System to Support the Arabic Language," In the proceedings of Computational Approaches to Semitic Languages, p. 11, 2002.

[4] H. Wachsmuth and K. Bujna, "Back to the Roots of Genres: Text Classification by Language Function," In the proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, November 8-13, 2011.

[5] J. Karlgren and D. Cutting, "Recognizing Text Genres With Simple Metrics Using Discriminant Analysis," In the proceedings of the 15th Conference on Computational Linguistics, vol. 2, pp. 1071–1075, 1994.

[6] R. Liu, M. Jiang, and Z. Tie, "Automatic genre classification by using co-training," In the proceedings of the 6th International Conference on Fuzzy Systems and Knowledge Discovery, vol. 1, pp. 129–132, 2009.

[7] E. Stamatatos, N. Fakotakis, and G. Kokkinakis, "Automatic Text Categorization in Terms of Genre and Author," Computational Linguistics, vol. 26, pp. 471–495, 2000.

[8] B. A. Galitsky, D. A. Ilvovsky, E. L. Chernyak, and S. O. Kuznetsov, "Style and Genre Classification by Means of Deep Textual Parsing," In the Proceedings of the International Conference on Computational Linguistics and Intellectual Technologies "Dialogue 2016", Moscow, June 1–4, 2016.

[9] M. Melissourgou and K. Frantzi, "Representation of Text Types and Genres in English Language Teaching Material," Corpus Pragmatics, April 2017, Springer International Publishing.

[10] P. Petrenz, "Cross-Lingual Genre Classification," Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 11–21, April 2012.

[11] K. Crowston and B. H. Kwasnik, "Can Document-Genre Metadata Improve Information Access to Large Digital Collections," Library Trends, vol. 1, no. 315, pp. 1–29, 2003.

[12] J. Vaičenonienė, "The Language of Advertising: Analysis of English and Lithuanian Advertising Texts," Studies About Languages, no. 9, pp. 43–55, 2006.

[13] S. L. Ting, W. H. Ip, and A. H. C. Tsang, "Is Naïve Bayes a good classifier for document classification?" International Journal of Software Engineering and Its Applications, vol. 5, no. 3, pp. 37–46, 2011.

[14] B. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, 1991.

[15] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, 1995.

[16] E. Gabrilovich and S. Markovitch, "Feature Generation for Text Categorization Using World Knowledge," In the Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30-August 5, 2005, pp. 1048-1053.

 
Alaa M. El-Halees is a professor of computing in the Faculty of Information Technology at the Islamic University of Gaza, Palestine. He holds a PhD degree in data mining from Leeds Metropolitan University, UK (2004), an MSc degree in Software Engineering from Leeds Metropolitan University, UK (1998), and a BSc in Computer Engineering from the University of Arizona, USA. Alaa has more than 24 years of experience, including leading a range of IT-related projects. Prof. Alaa supervises MSc students in Information Technology. He also leads and teaches modules at both BSc and MSc levels in Information Technology. His research activities are in the area of data mining, in particular text mining, machine learning, e-learning, software engineering, and computer ethics.