Engineering, Technology & Applied Science Research Vol. 8, No. 3, 2018, 2954-2957 2954  
  

www.etasr.com Khan et al.: Analysis of Children’s Prosodic Features Using Emotion Based Utterances in Urdu Language 
 

Analysis of Children’s Prosodic Features Using 
Emotion Based Utterances in Urdu Language 

 
Sallar Khan 
Department of Computer Science 

Sir Syed University of Engineering and 
Technology 

Karachi, Pakistan 
Sallarkhan_92@yahoo.com 

Syed Abbas Ali 
Department of Computer Information 

System, NED University of Engineering 
and Technology 

Karachi, Pakistan 
Saaj.research@yahoo.com 

Jawaria Sallar  
Department of Computer Science 

Sir Syed University of Engineering and 
Technology 

Karachi, Pakistan 
H.jawaria@yahoo.com 

 
Abstract—Emotion plays a significant role in identifying the 
states of a speaker using spoken utterances. Prosodic features 
add sense in spoken utterances providing speaker emotions. The 
objective of this research is to analyze the behavior of prosodic 
features (individual and in combination with others’ prosodic 
features) with different learning classifiers on emotion based 
utterances of children in the Urdu language. In this paper, three 
different prosodic features (intensity, pitch, formant and their 
combinations) with five different learning classifiers(ANN, J-48, 
K-star, Naïve Bayes, decision stump) and four basic emotions 
(happy, sad, angry, and neutral) were used to develop the 
experimental framework. Demonstrative experiments expressed 
that, in terms of classification accuracy, artificial neural 
networks show significant results with both individual and 
combination of prosodic features in comparison with other 
learning classifiers. 

Keywords-speech; emotion; recognition; learning; classifiers; 
prosodic; features; language; Urdu; Pakistan 

I. INTRODUCTION AND RELATED WORK 
Speech is commonly known as an effective way of 

communication between human beings [1]. Emotion is an 
application which is very much concerned with the speech in 
which we can perceive the speaker’s mental state through 
his/her spoken utterances and this term is known as speech 
emotion recognition (SER). Using machine learning systems 
with partial computational resources we can identify speech 
emotion by the usage of a few emotions (happy, sad, angry, 
and neutral) [2]. The efficiency of human machine interface in 
the field of human machine interaction can significantly be 
improved by the help of automatic SER with the assistance of 
learning machines [3]. Speech signal and its acoustic features 
such as timing, intensity, voice quality, pitch and articulation 
are highly associated with fundamental emotion [5]. Speech 
emotion recognition systems have numerous applications 
which include: 1) telephonic conversations and their emotion 
analysis 2) psychiatric patient’s medical diagnosis 3) student 
emotional condition e-learning system 4) analysis of mental 
stress level during an exchange of conversations. There are 3 
core phases leading SER as a pattern recognition statistical 
problem: 1) feature extraction from speech, 2) feature selection, 

3) pattern classification [5]. For the evaluation and impact of 
spectral and prosodic features of emotional speech on 
classification, Gaussian mixture models (GMM) were used in 
[6]. Three fundamental parts which could enhance the design 
of a speech emotion recognition system are discussed in [7]: 1) 
Selection of suitable features for speech, 2) appropriate 
classification scheme’s design and 3) system performance has 
been evaluated through the presentation of a database designed 
for emotional speech. Speech emotion recognition review and 
analysis are presented in [8] by the use of different learning 
classifiers. Extraction of both local and global prosodic features 
from sentences are addressed in [9], furthermore, different 
words and syllables are suggested for analyzing the speech 
recognition or affect recognition. The technology of emotion as 
a crucial component of artificial intelligence is advised in [10]. 
Furthermore, distinct context must be considered by artificial 
intelligence for emotion recognition. Five different emotions 
have been investigated in [11] which are associated with 
acoustic properties of the prosody of speech. This investigation 
comes to a result that those speeches which are associated with 
emotion “love” and “sad” are identified by higher pitch and 
utterances with lengthier duration. Similar to [11], prosody is 
recognized as the most fundamental feature of emotional 
expression in any specific speech in [12]. 

II. RESEARCH METHODOLOGY 

A. Corpus Collection and Specification of Recording 
The structure for this research is to conduct interviews from 

primary level school going children, while the medium chosen 
will be the regional language of Pakistan Urdu. The same 
sentence will be asked from all of the participated children. 
ITU recommendations which were chosen for the recording of 
the corpus with following specifications are a) 24120 bps and 
b) SNR>=45 dB [13]. For the experiment analysis, a noise free 
room will be utilized for taking recording samples by using a 
microphone, and the entire speech emotion utterances are 
recorded in the recording format of 2.4 Ohms; sensitivity and 
48 kHhz; sampling rate. With four different emotions (happy, 
sad, angry, and neutral) the Urdu language sentence which is 
used for experiment is: “I want to play” ("ميں کهيلنا چاہتا ہوں”) 


Engineering, Technology & Applied Science Research Vol. 8, No. 3, 2018, 2954-2957 2955  
  

www.etasr.com Khan et al.: Analysis of Children’s Prosodic Features Using Emotion Based Utterances in Urdu Language 
 

B. Population 
Children were chosen randomly from a primary level 

school with the age group of the population of this research 
being 5-10 years. 

C. Sampling Frame 
Sufficient ratio of sampling frame is chosen which can 

produce the desired result. It is selected from past studies from 
the realm of SER. For analysis purpose in this research, the 
data will consist of approximately 10 participants with 40 
speech samples (Consisting all four emotions). 

D. Procedure 

1) Feature Extraction 
The extraction process of prosodic features (pitch, intensity 

and formant) is done through testing of these samples on 
PRAAT, while only mean values of prosodic features were 
extracted in the experimental section. 

2) Learning Classification 
We have used 5 different machine learning classifiers for 

classification purpose which are Artificial Neural Network 
(ANN), Naïve Bayes (NB), Decision Stump (DS), J-48, and K-
star. 

3) Pilot Study of the Proposed Research 
For getting familiarized with the software/tools which we 

have utilized for this study, we have collected the speech 
samples. The .wav file format was used for the recorded 
samples, while PRAAT will generate the desired result 
accordingly. Whereas, WEKA classifier will process these 
values to generate the classification result. 

III. EXPERIMENTAL RESULTS AND DISCUSSION 

A. Extracting Prosodic Features 
To analyze the overall impact of prosodic features 

(intensity, pitch, and formant) on learning classifiers the 
experimental phase of this research is divided into two 
portions. In the first portion, the four emotions (happy, sad, 
angry, and neutral) which are present in the utterances of the 
speaker have been observed by using the PRAAT software 
[14]. Speech emotion corpuses which are used in the 
experimental study consist of initially 40 samples appropriated 
from a children recording in the age group of 5-10 years in 
regional language Urdu with four different emotions. To 
evaluate the dependency of emotions on each prosodic feature 
(pitch, intensity, and formant) is the primary motivation 
through these experiments. The extraction process of prosodic 
features from spoken speech emotion utterances in regional 
langue Urdu are shown in Figures 1 to 3. These observations 
are demonstrating the behavior of all three prosodic features 
which are extracted by using PRAAT software [1]. Not all the 
total 40 samples are being demonstrated in this section but just 
a portion to show an aroma that how these features were 
extracted with the use of PRAAT. 

 
Fig. 1.  Extracting prosodic feature (pitch) using PRAAT. 

 
Fig. 2.  Extracting prosodic feature (intensity) using PRAAT. 

B. Classification and precision accuracy 
The second portion is further divided into two parts, in first 

part we will discuss the overall calculation of classification 
accuracy of each learning classifier (ANN, NB, DS, J-48, and 
K-star) against each prosodic feature individually as shown in 
Table I and in combination as shown in Table II. In the second 
part, we will discuss the precision accuracy against each 
emotion (happy, sad, angry, and neutral) which are produced 
by different learning classifiers. 

1) Overall classification accuracy 
The overall result is satisfactory in terms of classification 

accuracy. It has been observed (Figure 5) that learning 
classifiers performed well when we jointly analyzed prosodic 
features and can classify correctly up to the accuracy of 45% 
meanwhile separately they produce accuracy of 40% which is 
shown in Figure 4. Furthermore in terms of learning classifiers, 
J-48 gave best classification accuracy of 35% for the prosodic 
feature pitch, while on the other hand ANN and NB both held 
their classification accuracy higher up to 40% for the prosodic 
feature intensity and for the third prosodic feature, formant, 
ANN and DS both have produced classification accuracy 
around 33%. During the process of analyzing prosodic features 
jointly, it has been recognized that the combination of intensity 


Engineering, Technology & Applied Science Research Vol. 8, No. 3, 2018, 2954-2957 2956  
  

www.etasr.com Khan et al.: Analysis of Children’s Prosodic Features Using Emotion Based Utterances in Urdu Language 
 

and formant produced the finest classification accuracy of 45% 
with ΑΝΝ, while combinations of pitch and formant against J-
48, as well as pitch, intensity, and formant with ANN came 
second, both with accuracy around 43%. Lastly the 
combination of pitch and intensity produced classification 
accuracy around 38% with ANN classifier which comes third.  

 
Fig. 3.  Extacting prosodic feature (formant) using PRAAT. 

 
Fig. 4.  Classification accuracy of emotions on pitch, intensity and 
formant. 

 
Fig. 5.  Classification accuracy of emotions on combination of pitch, 
intensity and formant. 

 
TABLE I.  COMPARATIVE ANALYSIS OF INDIVIDUAL PROSODIC 
FEATURES USING MACHINE LEARNING CLASSIFIERS 

 Classifier Precision Accuracy 
Happy Sad Angry Neutral 

Pi
tc

h 

ANN 0.333 0 0 0.316 20% 
NB 0.125 0 0.222 0.417 12.50% 
DS 0.217 0 0.25 0.273 22.50% 
J-48 0.231 0 0.455 0.429 35% 

K-Star 0.2 0 0.222 0.417 22.50% 

In
te

ns
ity

 ANN 0.286 0 0.545 0.4 40% 
NB 0.25 0 0.5 0.375 40% 
DS 0.143 0 0.545 0.273 32.50% 
J-48 0.125 0 0.462 0.263 30% 

K-Star 0.111 0.2 0.545 0.333 32.50% 

Fo
rm

an
t ANN 0.4 0.75 0.182 0.2 32.50% 

NB 0.385 0.385 0.267 0 30% 
DS 0 1 0.27 0 32.50% 
J-48 0 0 0.242 0 20% 

K-Star 0.231 0.5 0.308 0.125 27.50% 

TABLE II.  COMPARATIVE ANALYSIS OF PROSODIC FEATURE 
COMBINATIONS USING MACHINE LEARNING CLASSIFIERS 

 Classifier Precision Accuracy 
Happy Sad Angry Neutral 

P+
I 

ANN 0.385 0 0.455 0.333 37.50% 
NB 0.25 0 0.467 0.333 35% 
DS 0.143 0 0.545 0.273 32.50% 
J-48 0.133 0 0.556 0.188 25% 

K-Star 0.083 0.2 0.5 0.2 22.50% 

P+
F 

ANN 0.25 0.429 0 0.412 30% 
NB 0.182 0.429 0.154 0.222 22.50% 
DS 0.25 0 0.25 0.25 25% 
J-48 0.417 0.333 0.444 0.462 42.50% 

K-Star 0.375 0.429 0.143 0.273 27.50% 

I+
F 

ANN 0.375 0.667 0.545 0.417 45% 
NB 0.375  0.333 0.5 0.429 42.50% 
DS 0.143 0 0.545 0.273 32.50% 
J-48 0.083 0 0.462 0.2 25% 

K-Star 0.154 0.2 0.5  0.167 25% 

P+
I+

F 

ANN 0.385 0.5 0.444 0.429 42.50% 
NB 0.25 0.286 0.455 0.357 35% 
DS 0.143 0 0.545 0.273 32.50% 
J-48 0.182 0 0.5 0.125 22.50% 

K-Star 0.222 0.222 0.5 0.333 32.50% 

P: Pitch, I: Intensity, F: Formant 

2) Precision Accuracy 
The plot of precision against each emotion (happy, sad, 

angry, and neutral) is shown in Figures 6 to 9. Analyzing 
process shows that J-48 kept the highest precision rate of 
around 0.45 regarding the happy emotion with the combination 
of pitch and formant features, while classifiers didn’t produce 
well enough results for the emotion sad except ANN which 
produced precision rate of around 0.8. For the angry emotion, 
every classifier produced their best possible results against 
almost every feature (individually or jointly) but J-48 produced 
a slightly better precision rate of 0.6 for the combination of 
pitch and intensity. In the end, for neutral emotion, again J-48 
achieved the highest precision rate of around 0.5 with the 
combination of pitch and formant. 

IV. CONCLUSION 
Speech emotion corpuses were recorded in the Urdu 

language with separate regard of four basic emotions. The main 
motivation behind this research is to analyze the impact of 
prosodic features (pitch, intensity and formant) on five 
different learning classifiers. The PRAAT software WEKA 
tool was used in the experimental framework for emotion 


Engineering, Technology & Applied Science Research Vol. 8, No. 3, 2018, 2954-2957 2957  
  

www.etasr.com Khan et al.: Analysis of Children’s Prosodic Features Using Emotion Based Utterances in Urdu Language 
 

observation in spoken utterances and as learning classifier for 
the classification purpose. Experimental results made evident 
that by combining prosodic features (intensity and formant) we 
can achieve a classification rate up to 45% while 40% 
classification accuracy can be achieved from an individual 
prosodic feature (intensity). In terms of classification accuracy, 
the ANN has been proved overall to perform better than others 
with a classification accuracy of 45%.  

 
Fig. 6.  Precision of emotion ‘happy’ on all three prosodic features. 

 
Fig. 7.  Precision of emotion ‘sad’ on all three prosodic features 

 
Fig. 8.  Precision of emotion ‘angry’ on all three prosodic features. 

 
Fig. 9.   Precision of emotion ‘neutral’ on all three prosodic features. 

REFERENCES 
[1] S. A. Ali, A. Khan, N. Bashir, “Analyzing the Impact of Prosodic 

Feature (Pitch) on Learning Classifiers for Speech Emotion Corpus”, 
International Journal of Information Technology and Computer Science, 
Vol. 2, pp. 54-59, 2015 

[2] P. Ekman, “An argument for basic emotions”, Cognition & Emotion, 
Vol. 6, No. 3. pp. 169–200, 1992 

[3] I. Chiriacescu, Automatic Emotion Analysis Based on Speech, MSc 
Thesis, Delft University of Technology, 2009 

[4] M. B. Mustafa, R. N. Ainon, R. Zainuddin, Z. M. Don, G. Knowles, S. 
Mokhtar, “Prosodic Analysis and Modelling for Malay”, Malaysian 
Journal of Computer Science, Vol. 23, No. 2, pp. 102–110, 2010 

[5] J. Rong, G. Li, Y. P. P. Chen, “Acoustic feature selection for automatic 
emotion recognition from speech”, Information Processing & 
Management, Vol. 45, No. 3, pp. 315–328, 2009 

[6] J. Pribil, A. Pribilova, “Determination of formant features in Czech and 
Slovak for GMM emotional speech classifier”, Radioengineering, Vol. 
22, No. 1, pp. 52–59, 2013 

[7] M. El Ayadi, M. S. Kamel, F. Karray, “Survey on speech emotion 
recognition: Features, classification schemes, and databases”, Pattern 
Recognition, Vol. 44, No. 3, pp. 572–587, 2011 

[8] A. Utane, S. Nalbalwar, “Emotion recognition through Speech”, 2nd 
National Conference on Innovative Paradigms in Engineering & 
Technology, International Journal of Applied Information Systems, pp. 
5-8, 2013 

[9] K. S. Rao, S. G. Koolagudi, R. R. Vempada, “Emotion recognition from 
speech using global and local prosodic features”, International Journal of 
Speech Technology, Vol. 16, No. 2, pp. 143–160, 2013 

[10] P. Olivier, J. Wallace, “Digital technologies and the emotional family”, 
International Journal of Human-Computer Studies, Vol. 67, No. 2, pp. 
204–214, 2009 

[11] P. Pattnaik, “Impact of Emotion on Prosody Analysis”, IOSR Journal of 
Computer Engineering, Vol. 5, No. 4, pp. 10–15, 2012 

[12] W. L. Jarrold, Towards a theory of affective mind: Computationally 
modeling the generativity of goal appraisal, PhD Thesis, University of 
Texas at Austin, 2004 

[13] S. A. Ali, M. Andleeb, N. G. Haider, D. R. Khan, “Evaluating the 
Performance of Learning Classifiers and Effect of Emotions and 
Spectral Features on Speech Utterances”, International Journal of 
Computer Science and Information Security, Vol. 14, No. 10, pp. 406–
412, 2016 

[14] P. Boersma, D. Weenink, Praat: doing phonetics by computer, available 
at:  http://www.fon.hum.uva.nl/praat/