

Engineering, Technology & Applied Science Research Vol. 11, No. 2, 2021, 6845-6848 6845 
 

www.etasr.com Yafooz et al.: Arabic Sentiment Analysis on Chewing Khat Leaves using Machine Learning and … 

 
Arabic Sentiment Analysis on Chewing Khat Leaves using Machine Learning and Ensemble Methods
 

Wael M.S. Yafooz 

Computer Science Department  
Taibah University 

Madinah Munawarah, Saudi Arabia  

wyafooz@taibahu.edu.sa 

Essa Abdullah Hezzam 

Information Systems Department  
Taibah University 

Madinah Munawarah, Saudi Arabia  

essa_alkadasi@yahoo.com 

Waseem Alromema 

Information Systems Department 
Taibah University 

Madinah Munawarah, Saudi Arabia  

wromema@taibahu.edu.sa 
 

Abstract-Sentiment analysis plays an important role in obtaining
speakers' opinions or feelings towards events, products, topics, or
services, helping businesses to improve their products. Moreover,
governments and organizations investigate and solve current
social issues by analyzing perspectives and feelings. This study
evaluated the habit of chewing Khat (qat) leaves in Yemeni
society. Chewing Khat plant leaves is a common habit in
Yemen and East Africa. This paper proposes a model to detect
information about the Khat chewing habit, how people explore it,
and the preference for Khat leaves among Arabic people. A
dataset consisting of user comments on 18 YouTube videos was
prepared through several natural language processing
techniques. Several experiments were conducted using six
machine learning classifiers and four ensemble methods. Support
Vector Machine and Linear Regression had almost 80%
accuracy, whereas XGBoost was the most accurate ensemble
method, reaching 77%.

Keywords-sentiment analysis; machine learning; classification; 
ensemble methods 

I. INTRODUCTION  

Nowadays, the study of user opinions has attracted
substantial attention from a social perspective, focusing on
services, products, and habits through many data mining
applications, recommender systems, and business intelligence
applications. The analysis and interpretation of user opinions
is known as sentiment analysis, an area of natural language
processing that is also known as the voice of the customer in
business intelligence [1, 2]. Business owners need to be aware
of feedback to improve future performance. Obtaining it
requires analyzing huge amounts of unstructured data gathered
through social media or internet comments, which is a
time-consuming and difficult task. Several studies have been
conducted, classifying sentiments as positive, negative, or
neutral [3, 4, 6]. More complex sentiment analysis [7-10], often
referred to as fine-grained, classifies datasets into five
classes, namely very positive, positive, neutral, negative, and
very negative. Moreover, aspect-based sentiment analysis [11-14]
classifies datasets by extracting entities from text.

Users' comments often reflect their opinions and can be
considered the main factor in evaluating services or products.
Some studies focused on education [15, 16], while others
focused on detecting health misinformation among social media
users [18, 19]. Khat is a plant with pleasing and stimulating
effects, and chewing Khat leaves is a common habit in Yemen and
East Africa [21-23]. Although it is customary in these
countries, several experiments showed its direct impact on
human organs. This paper presents a model to study consumers'
opinions regarding the habit of chewing Khat leaves in Yemeni
and East African society. First, the dataset was collected by
extracting user comments from 18 YouTube videos. The annotation
process classified the data into positive and negative classes.
Several NLP processes were then executed to prepare the data
for Machine Learning Classifiers (MLCs) and Ensemble Methods
(EMs).

II. RELATED WORKS 

Several studies utilized sentiment analysis in different
ways. Multilingual student comments, obtained through student
feedback, were used to evaluate the effectiveness of online
courses and teachers' performance in [3, 15-17]. In [3], the
dataset consisted of approximately 4000 student comments
collected through surveys on 25 university courses to evaluate
the performance of a professor who had been teaching for 10
years, and the sentiment analysis covered positive and negative
sentiment as well as eight emotions. Similarly, the authors in
[15] proposed a system to evaluate a lecturer's performance by
collecting numerical rating data through student surveys. The
Naïve Bayes classifier was employed to predict the students'
positive and negative sentiments toward the lecturers. A Long
Short-Term Memory (LSTM) recurrent neural network was utilized
in [16]. The dataset was collected from 3000 positive,
negative, and neutral student comments on 30 courses. The
performance improved when using the softmax activation
function, reaching 89%, 99%, and 90% during training, testing,
and validation, respectively. Deep learning was applied to a
course evaluation dataset with 3000 student comments using
three predefined classes in [17], and the results showed that
the ReLU and softmax activation functions performed better.

Sentiment analysis is used to identify the main factors
affecting the success of businesses, particularly start-ups. In
[1], user comments were extracted from Twitter using topic
modeling, and supervised vector machine learning was applied to
divide the comments into three main classes. The textual
analysis was applied based on the entities trained in the
previous phase using NVivo software. In [4], approximately 1.6
million user comments from the Yelp Challenge Dataset were
analyzed. The dataset was split into 80% for training and 20%
for testing, and four machine learning classifiers were used.
The best accuracy rates were 92.6% and 92.3%, obtained with
Stochastic Gradient Descent and Linear Support Vector
Classification, respectively. Similarly, the same dataset was
utilized in [2] to analyze restaurant reviews through a hybrid
classifier ensemble method using Naïve Bayes, Support Vector
Machines, and Genetic Algorithms.

Corresponding author: Wael M.S. Yafooz

Some studies have also been conducted in the health sector
[18, 19]. The authors in [18] focused on tweets about breast
cancer, collecting approximately 48,000 posts from about 845
cancer patient accounts. A logistic regression classifier and a
Convolutional Neural Network were utilized, and the model's
accuracy was 97.6%. Moreover, it was found that positive
experiences were shared more, raising awareness among the
general public. Descriptive statistics of text mining and topic
modeling were utilized in [19]. Unstructured data from 3
million news articles on Reuters assisted in identifying the 10
major health issues published in news articles from 2007 to
2017. In contrast, the analysis of user reviews on mobile
health applications was the focus of [8], which collected
approximately 88,125 user reviews from 104 mobile health
applications. The data were categorized based on the
functionality each comment addressed (such as usability,
content, customer support, and ethics), polarity was divided
into three classes, and five machine learning classifiers were
applied. The best accuracy, 89.42%, was achieved with
Stochastic Gradient Descent.

III. METHODS 

This section describes the main model phases, as shown in 
Figure 1. There are four phases: data acquisition, pre-
processing, machine learning classifiers, and model evaluation. 

 
Fig. 1.  Model architecture. 

A. Phase 1: Data Acquisition 

The dataset was collected using the Python 3.8 programming
language and the YouTube API (googleapiclient package) to
extract information from 18 videos related to chewing Khat.
The criteria for selecting videos were: publication date
between 2015 and 2020, more than 50K views, more than 10K
likes, and a focus on Arabic speakers. Moreover, keywords such
as Khat, Khat is dangerous, and disadvantages of Khat were used
to locate the videos. The main attributes of the extracted
video information were: commenter_id, commenter_name,
comment, video_id, number of views, number of likes, and
date. Table I shows the dataset description and the minimum
and maximum length of user comments.
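
The selection criteria above can be expressed as a simple filter. The following is a minimal sketch, assuming the video metadata has already been retrieved; the `Video` record, its field names, and the `"ar"` language code are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Video:
    # Hypothetical record; the field names are illustrative only.
    video_id: str
    published_year: int
    views: int
    likes: int
    language: str  # assumed language code of the target audience


def meets_criteria(video: Video) -> bool:
    """Apply the four selection criteria listed in the text."""
    return (
        2015 <= video.published_year <= 2020  # published between 2015 and 2020
        and video.views > 50_000              # more than 50K views
        and video.likes > 10_000              # more than 10K likes
        and video.language == "ar"            # focus on Arabic speakers
    )


def select_videos(videos):
    """Keep only the videos that satisfy every criterion."""
    return [v for v in videos if meets_criteria(v)]
```

A video failing any single criterion (for example, one published in 2014) is excluded even if it satisfies the other three.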

TABLE I.  DATASET DESCRIPTION

Items      Number of comments
Negative   1436
Positive   1296
Total      2732

Comment length (all comments): Max length 427, Min length 1, Average 17

 
The initial step of data preprocessing removed English and
duplicate comments. The next step, data annotation, was a
manual process conducted by three annotators who were Ph.D.
holders, native Arabic speakers, and computer science
specialists. Data annotation classified the comments as
negative or positive. Some unrelated, unclear, or ambiguous
comments were removed. If at least two annotators assigned the
same label (positive or negative) to a comment, the comment
received that label; otherwise, it was removed.
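
The agreement rule described above amounts to a majority vote among the annotators. The sketch below is one possible reading of that rule, not the authors' implementation; the function name `resolve_label` and the use of `None` for an "unclear" vote are assumptions made for illustration.

```python
from collections import Counter

VALID_LABELS = ("positive", "negative")


def resolve_label(votes, min_agreement=2):
    """Return the label shared by at least `min_agreement` annotators.

    `votes` holds one entry per annotator; anything other than
    "positive" or "negative" (e.g. None for an unclear comment) is ignored.
    Returns None when the comment should be removed from the dataset.
    """
    counts = Counter(v for v in votes if v in VALID_LABELS)
    if not counts:
        return None
    label, n = counts.most_common(1)[0]
    return label if n >= min_agreement else None
```

With three annotators and two labels, a comment is dropped only when at least two annotators decline to label it or when the labeled votes split one against one.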

B. Phase 2: Pre-processing 

The natural language pre-processing steps were: data
cleaning, tokenization, normalization of Arabic words,
lemmatization, deletion of special characters, and removal of
repeating characters. The three annotators then labeled the
comments as positive or negative. These pre-processing steps
increased accuracy by removing "TASHKEEL" (diacritics),
"TATWEEL" (elongation), and "HAMZAH", using Python 3.6 and a
package called "tashaphyne".
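
To make the normalization steps concrete, the following sketch reproduces some of them with the standard `re` module only. It is a stand-in for illustration and does not use the tashaphyne package named in the text; the function names, the Unicode ranges chosen, and the rule of collapsing runs of three or more identical characters are assumptions. Lemmatization is not covered.

```python
import re

# Arabic diacritics (tashkeel): fathatan (U+064B) through sukun (U+0652).
TASHKEEL = re.compile(r"[\u064B-\u0652]")
# Tatweel (kashida), the elongation character.
TATWEEL = "\u0640"
# Map alef with hamza above/below and alef with madda to bare alef.
HAMZA_FORMS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})
# Any character repeated three or more times in a row.
REPEATS = re.compile(r"(.)\1{2,}")
# Anything that is neither a word character nor whitespace.
SPECIAL = re.compile(r"[^\w\s]")


def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, unify hamza forms, collapse repeats."""
    text = TASHKEEL.sub("", text)
    text = text.replace(TATWEEL, "")
    text = text.translate(HAMZA_FORMS)
    text = REPEATS.sub(r"\1", text)
    text = SPECIAL.sub(" ", text)
    return " ".join(text.split())


def tokenize(text: str) -> list:
    """Whitespace tokenization of the normalized text."""
    return normalize_arabic(text).split()
```

Collapsing only runs of three or more characters is a design choice that leaves legitimate doubled letters untouched.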

C. Phase 3: Machine Learning Classifiers 

Two types of classifiers were used: classic MLCs and
Ensemble Methods (EMs). The MLCs were Linear Regression
(LR), Naïve Bayes (NB), Support Vector Machine (SVM),
K-nearest Neighbor (KNN), Stochastic Gradient Descent (SGD),
and Decision Tree (DT). The EMs were Random Forest (RF),
AdaBoost (ADA), Gradient Boosting (GB), and XGBoost (XG).

D. Phase 4: Model Evaluation 

The model's performance was verified using Precision (1), 
Recall (2), F-Score (3), Accuracy (4), and 5-fold cross-
validation on the dataset. 

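$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{1}$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{2}$$

$$\text{F-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \tag{4}$$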
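
The four measures can be computed directly from prediction counts. The sketch below is illustrative only; the function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here rather than something stated in the paper.

```python
def precision(tp: int, fp: int) -> float:
    """True positives over all predicted positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """True positives over all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(correct: int, total: int) -> float:
    """Share of predictions that were correct."""
    return correct / total if total else 0.0
```

As a consistency check against Table II, a precision of 84% and a recall of 78% (the LR Negative row) give `f_score(0.84, 0.78)` of about 0.809, which rounds to the reported 81%.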

IV. RESULTS AND DISCUSSION 

This section presents the experiments and the results. Two
experiments were conducted on the dataset, one with classic
MLCs and one with EMs. In both experiments, the dataset was
divided into 70% for training and 30% for testing, and 5-fold
cross-validation was applied.
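
A 70/30 split and a 5-fold partition can be sketched with the standard library as follows. This is an illustration under assumptions: the function names, the fixed random seed, and the rounding up of the test-set size are choices made here, and the text does not say how the hold-out split and the cross-validation were combined.

```python
import math
import random


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n * test_fraction)  # test size rounded up (assumption)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n, k=5, seed=0):
    """Yield (training, validation) index lists for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]
```

For the 2732 comments in Table I, this split yields 1912 training and 820 test items.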

A. Classic Machine Learning Classifiers 

As mentioned above, six classic MLCs were used. Unigram,
Bigram, and Trigram features were used with all six classifiers
to examine their performance. Table II shows the results of the
six classifiers using Unigram. It can be observed that the SVM
classifier had the highest accuracy (80.12%), while the lowest
accuracy was noted for KNN (65.00%).
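
For readers unfamiliar with n-gram features, the sketch below shows how unigrams, bigrams, and trigrams are derived from a tokenized comment. It is illustrative only: the function names are hypothetical, the English tokens stand in for Arabic ones, and the text does not specify whether the Bigram and Trigram settings also included the shorter n-grams.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, n):
    """Bag-of-n-grams representation: n-gram -> occurrence count."""
    return Counter(ngrams(tokens, n))
```

For the tokens `["khat", "is", "dangerous"]`, `ngrams(tokens, 2)` gives `["khat is", "is dangerous"]`, and a comment shorter than `n` tokens yields no n-grams at all.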

TABLE II.  CLASSIFIERS' PERFORMANCE USING UNIGRAM

MLCs  Class     Precision  Recall  F-score  Accuracy
LR    Negative  84%        78%     81%      79.39%
      Positive  75%        81%     78%
NB    Negative  63%        78%     70%      71.95%
      Positive  81%        68%     74%
SVM   Negative  84%        79%     81%      80.12%
      Positive  76%        82%     79%
KNN   Negative  94%        60%     73%      65.00%
      Positive  34%        84%     49%
SGD   Negative  81%        77%     79%      77.80%
      Positive  74%        79%     76%
DT    Negative  74%        71%     73%      71.34%
      Positive  68%        72%     70%

 
Table III shows the MLCs' performance using Bigram. SVM
had the highest accuracy (79.76%), whereas the lowest
was noted for KNN (65.24%).

TABLE III.  CLASSIFIERS' PERFORMANCE USING BIGRAM

MLCs  Class     Precision  Recall  F-score  Accuracy
LR    Negative  83%        78%     80%      78.90%
      Positive  75%        80%     77%
NB    Negative  63%        78%     70%      72.07%
      Positive  82%        68%     74%
SVM   Negative  83%        79%     81%      79.76%
      Positive  76%        81%     78%
KNN   Negative  91%        61%     73%      65.24%
      Positive  37%        81%     51%
SGD   Negative  79%        78%     79%      77.80%
      Positive  76%        78%     77%
DT    Negative  77%        69%     73%      70.49%
      Positive  64%        72%     68%

 
Table IV shows the MLCs' performance using Trigram. The
highest accuracy was 79.51% using SVM, whereas the lowest
was noted again for KNN (65.12%). Figure 2 depicts the
overall MLC results for Unigram, Bigram, and Trigram.
Although SVM had the highest accuracy, it was followed
closely by both LR and SGD at almost 80%. NB's and DT's
accuracies were near 70%, whereas KNN was the least accurate
at about 65%.

TABLE IV.  CLASSIFIERS' PERFORMANCE USING TRIGRAM

MLCs  Class     Precision  Recall  F-score  Accuracy
LR    Negative  83%        78%     80%      78.90%
      Positive  75%        80%     77%
NB    Negative  63%        78%     69%      71.59%
      Positive  81%        67%     73%
SVM   Negative  83%        78%     81%      79.51%
      Positive  76%        81%     78%
KNN   Negative  91%        61%     73%      65.12%
      Positive  37%        80%     51%
SGD   Negative  76%        78%     77%      76.71%
      Positive  77%        75%     76%
DT    Negative  78%        67%     72%      69.15%
      Positive  59%        72%     65%

 
Fig. 2.  Accuracy comparison for 6 MLCs on three n-grams. 

B. Ensemble Methods 

The four EMs mentioned above were used. Table V shows
their performance. The highest accuracy was recorded for XG
(77%), while the lowest was noted for GB (74%). Figure 3
compares the accuracy of the DT classifier using Unigram,
Bigram, and Trigram with that of RF. As can be noted, RF
outperformed DT.

TABLE V.  ENSEMBLE CLASSIFIERS' PERFORMANCE

EMs   Class     Precision  Recall  F-score  Accuracy
RF    Negative  83%        73%     78%      75%
      Positive  68%        79%     73%
ADA   Negative  80%        75%     77%      75%
      Positive  71%        77%     74%
GB    Negative  81%        73%     77%      74%
      Positive  68%        77%     72%
XG    Negative  82%        76%     79%      77%
      Positive  72%        79%     75%

 
Fig. 3.  Accuracy comparison between Decision Tree and Random Forest. 

V. CONCLUSION 

This paper presented a study on users' opinions on chewing
Khat in Yemen and East Africa, using a dataset collected from
YouTube comments. Several natural language processing steps
were carried out on the dataset to get the best performance
from the classifiers. Classic MLCs and EMs were applied. The
best performance in terms of accuracy was recorded when using
SVM (80.12% with Unigram), followed by Linear Regression. The
best accuracy using EMs was recorded for XG (77%).

REFERENCES 

[1] J. R. Saura, P. Palos-Sanchez, and A. Grilo, “Detecting Indicators for
Startup Business Success: Sentiment Analysis Using Text Data Mining,”
Sustainability, vol. 11, no. 3, Jan. 2019, Art. no. 917,
https://doi.org/10.3390/su11030917.

[2] M. Govindarajan, “Sentiment analysis of restaurant reviews using hybrid
classification method,” in Proceedings of 2nd IRF International
Conference, Chennai, India, Feb. 2014, pp. 127–133.

[3] S. Rani and P. Kumar, “A Sentiment Analysis System to Improve
Teaching and Learning,” Computer, vol. 50, no. 5, pp. 36–43, May
2017, https://doi.org/10.1109/MC.2017.133.

[4] A. Salinca, “Business Reviews Classification Using Sentiment
Analysis,” in 2015 17th International Symposium on Symbolic and
Numeric Algorithms for Scientific Computing (SYNASC), Timisoara,
Romania, Sep. 2015, pp. 247–250,
https://doi.org/10.1109/SYNASC.2015.46.

[5] U. P. Gurav and S. Kotrappa, “Sentiment Aware Stock Price Forecasting
using an SA-RNN-LBL Learning Model,” Engineering, Technology &
Applied Science Research, vol. 10, no. 5, pp. 6356–6361, Oct. 2020,
https://doi.org/10.48084/etasr.3805.

[6] J. Carrillo-de-Albornoz, J. R. Vidal, and L. Plaza, “Feature engineering
for sentiment analysis in e-health forums,” PLOS ONE, vol. 13, no. 11,
2018, Art. no. e0207996, https://doi.org/10.1371/journal.pone.0207996.

[7] M. Madhukar and S. Verma, “Hybrid Semantic Analysis of Tweets: A
Case Study of Tweets on Girl-Child in India,” Engineering, Technology
& Applied Science Research, vol. 7, no. 5, pp. 2014–2016, Oct. 2017,
https://doi.org/10.48084/etasr.1246.

[8] O. Oyebode, F. Alqahtani, and R. Orji, “Using Machine Learning and
Thematic Analysis Methods to Evaluate Mental Health Apps Based on
User Reviews,” IEEE Access, vol. 8, pp. 111141–111158, 2020,
https://doi.org/10.1109/ACCESS.2020.3002176.

[9] S. Angelidis and M. Lapata, “Multiple Instance Learning Networks for
Fine-Grained Sentiment Analysis,” Transactions of the Association for
Computational Linguistics, vol. 6, pp. 17–31, Aug. 2018,
https://doi.org/10.1162/tacl_a_00002.

[10] Z. Wang, C. S. Chong, L. Lan, Y. Yang, S. B. Ho, and J. C. Tong,
“Fine-grained sentiment analysis of social media with emotion sensing,”
in 2016 Future Technologies Conference (FTC), Dec. 2016, pp. 1361–1364,
https://doi.org/10.1109/FTC.2016.7821783.

[11] J. Luo, S. Huang, and R. Wang, “A fine-grained sentiment analysis of
online guest reviews of economy hotels in China,” Journal of
Hospitality Marketing & Management, vol. 30, no. 1, pp. 71–95, Jan.
2021, https://doi.org/10.1080/19368623.2020.1772163.

[12] C. Yang, H. Zhang, B. Jiang, and K. Li, “Aspect-based sentiment
analysis with alternating coattention networks,” Information Processing
& Management, vol. 56, no. 3, pp. 463–478, May 2019,
https://doi.org/10.1016/j.ipm.2018.12.004.

[13] M. Song, H. Park, and K. Shin, “Attention-based long short-term
memory network using sentiment lexicon embedding for aspect-level
sentiment analysis in Korean,” Information Processing & Management,
vol. 56, no. 3, pp. 637–653, May 2019,
https://doi.org/10.1016/j.ipm.2018.12.005.

[14] W. Xue and T. Li, “Aspect Based Sentiment Analysis with Gated
Convolutional Networks,” in Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics, Melbourne, Australia,
Jul. 2018, vol. 1, pp. 2514–2523, https://doi.org/10.18653/v1/P18-1234.

[15] F. F. Balahadia, M. C. G. Fernando, and I. C. Juanatas, “Teacher’s
performance evaluation tool using opinion mining with sentiment
analysis,” in 2016 IEEE Region 10 Symposium (TENSYMP), May 2016,
pp. 95–98, https://doi.org/10.1109/TENCONSpring.2016.7519384.

[16] I. A. Kandhro et al., “Sentiment Analysis of Students’ Comment by
using Long-Short Term Model,” Indian Journal of Science and
Technology, vol. 12, no. 8, pp. 1–16, Feb. 2019,
https://doi.org/10.17485/ijst/2019/v12i8/141741.

[17] I. A. Kandhro, S. Z. Jumani, F. Ali, Z. U. Shaikh, M. A. Arain, and A.
A. Shaikh, “Performance Analysis of Hyperparameters on a Sentiment
Analysis Model,” Engineering, Technology & Applied Science
Research, vol. 10, no. 4, pp. 6016–6020, Aug. 2020,
https://doi.org/10.48084/etasr.3549.

[18] E. M. Clark et al., “A Sentiment Analysis of Breast Cancer Treatment
Experiences and Healthcare Perceptions Across Twitter,” arXiv e-prints,
vol. 1805, p. arXiv:1805.09959, May 2018.

[19] M. Zolnoori et al., “Mining news media for understanding public health
concerns,” Journal of Clinical and Translational Science, pp. 1–10, Oct.
2019, https://doi.org/10.1017/cts.2019.434.

[20] F. Saeed, W. M.S. Yafooz, M. Al-Sarem, and E. A. Hezzam, “Detecting
Health-Related Rumors on Twitter using Machine Learning Methods,”
International Journal of Advanced Computer Science and Applications,
vol. 11, no. 8, 2020, https://doi.org/10.14569/IJACSA.2020.0110842.

[21] A. Al-Alimi, E. Halboub, A. K. Al-Sharabi, T. Taiyeb-Ali, N. Jaafar,
and N. N. Al-Hebshi, “Independent determinants of periodontitis in
Yemeni adults: A case-control study,” International Journal of Dental
Hygiene, vol. 16, no. 4, pp. 503–511, 2018,
https://doi.org/10.1111/idh.12352.

[22] M. Hijazi, H. Jentsch, J. Al-Sanabani, M. Tawfik, and T. W.
Remmerbach, “Clinical and cytological study of the oral mucosa of
smoking and non-smoking qat chewers in Yemen,” Clinical Oral
Investigations, vol. 20, no. 4, pp. 771–779, May 2016,
https://doi.org/10.1007/s00784-015-1569-2.

[23] M. A. Al-Duais and Y. S. Al-Awthan, “Association between qat
chewing and dyslipidaemia among young males,” Journal of Taibah
University Medical Sciences, vol. 14, no. 6, pp. 538–546, Dec. 2019,
https://doi.org/10.1016/j.jtumed.2019.09.008.

[24] B. Kalakonda, S. A. Al-Maweri, H.-M. Al-Shamiri, A. Ijaz, S. Gamal,
and E. Dhaifullah, “Is Khat (Catha edulis) chewing a risk factor for
periodontal diseases? A systematic review,” Journal of Clinical and
Experimental Dentistry, vol. 9, no. 10, pp. e1264–e1270, Oct. 2017,
https://doi.org/10.4317/jced.54163.