

FACTA UNIVERSITATIS  

Series: Electronics and Energetics Vol. 27, No 3, September 2014, pp. 425-433

DOI: 10.2298/FUEE1403425B 

RELEVANCE OF THE TYPES AND THE STATISTICAL 

PROPERTIES OF FEATURES IN THE RECOGNITION 

OF BASIC EMOTIONS IN SPEECH

 

Milana Bojanić, Vlado Delić, Milan Sečujski 

Faculty of Technical Sciences, University of Novi Sad, Serbia 

Abstract. Due to the advance of speech technologies and their increasing usage in 

various applications, automatic recognition of emotions in speech represents one of the 

emerging fields in human-computer interaction. This paper deals with several topics 

related to automatic emotional speech recognition, most notably with the improvement 

of recognition accuracy by lowering the dimensionality of the feature space and 

evaluation of the relevance of particular feature types. The research is focused on the 

classification of emotional speech into five basic emotional classes (anger, joy, fear, 

sadness and neutral speech) using a recorded corpus of emotional speech in Serbian. 

Key words: emotional speech recognition, acoustic features, basic emotions 

1. INTRODUCTION 

Basic emotion is a term used in categorical emotion models, among which Ekman's

concept of six basic emotions is the most prominent one. His theory of basic emotions, 

which are “psychological universals and constitute a set of basic, evolved functions that 

are shared by all humans”, is supported with experimental findings of cross-culturally 

recognized emotions from vocal signals and facial expressions [1]. 

From the beginning of its development, Emotional Speech Recognition (ESR) studies 

have used corpora of acted emotional speech since those corpora were easy to collect. 

Such corpora usually contained several basic emotions reproduced by actors [2]. 

Reasonable objections have been raised against acted speech corpora, namely that

acting emotions is not the same as producing 'spontaneous' emotions and

that within human-machine interaction emotion-related states are much more common 

than prototypical full-blown emotions (such as those represented in acted speech corpora) 

[3]. Still, recent research has shown that the relationships between the acted emotions and 

their acoustic correlates and between real life emotions and their acoustic correlates do 

not necessarily contradict each other [4].

                                                           

 
 Received February 10, 2014; received in revised form March 13, 2014 

Corresponding author: Milana Bojanić 

University of Novi Sad, Faculty of Technical Sciences, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia 

(milana.bojanic@uns.ac.rs) 



A more flexible solution to the problem of the representation of emotional states is to 

represent them as points in the continuous 2D space whose co-ordinates are the activation 

and evaluation involved in the emotional state [5]. Such dimensional models also allow 

for the mapping of basic emotions into the continuous 2D emotional space [5, 6], thus 

enabling a broad field of application of the recognition of basic emotions in speech. 

The paper summarizes our approach to the recognition of basic emotions in speech, 

focusing particularly on the improvement of recognition accuracy by lowering the 

dimensionality of the feature space. Additionally, a feature selection procedure has been

performed in order to rank the feature types and the statistical functionals used. The presented

research has been conducted on a corpus of acted emotional speech in Serbian. 

The paper is organized as follows. Aspects of the proposed approach that are relevant 

to the recognition of basic emotions, including acoustic modeling, classification scheme 

and speech corpus, are presented in Section 2. In Section 3, theoretical background about 

feature dimensionality reduction techniques is given and their possible benefits are pointed 

out. Experimental results are shown and discussed in Section 4. Finally, the conclusions are 

given in Section 5. 

2. THE PROPOSED APPROACH 

2.1 The proposed approach to acoustic modeling 

The proposed approach to acoustic modeling is based on the statistical analysis of 

acoustic feature contours [7, 8] and it is performed in three stages, as shown in Fig. 1.  

The first stage includes the extraction of acoustic features on a frame basis. These

features belong to two acoustic feature sets, namely the prosodic and the spectral feature set. In

the prosodic feature set, pitch and energy are extracted. As to the spectral feature set, only the

first 12 MFCCs (mel-frequency cepstral coefficients) are taken into account in our analysis, since

they correspond to slow changes in the spectrum, i.e., the spectrum envelope. The feature contours

which correspond to the pitch contour, energy contour and MFCC contours are, respectively, sequences of

short-term pitch, energy and MFCC values extracted on a frame basis.

 
Fig. 1 Feature extraction process in three stages 



The extracted features are forwarded to the second stage, in which the first derivative 

of the acoustic features is calculated in order to model the dynamics of speech. The first 

derivative carries the information about the dynamics of emotional speech, which is useful 

in emotional speech classification [4]. 
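As an illustration of the second stage, the first derivative of a contour can be approximated by a frame-to-frame difference. The paper does not state which derivative formula was used, so the backward-difference scheme, the zero padding of the first frame and the name `delta_contour` are assumptions made for this sketch only.

```python
def delta_contour(contour):
    """Approximate the first derivative of a frame-level contour.

    Assumption: a backward difference, with the first frame padded by 0.0
    so that the delta contour has the same length as the input contour.
    """
    if not contour:
        return []
    deltas = [0.0]
    for previous, current in zip(contour, contour[1:]):
        deltas.append(current - previous)
    return deltas


# Hypothetical pitch contour in Hz, one value per frame.
pitch = [120.0, 125.0, 131.0, 128.0]
pitch_delta = delta_contour(pitch)
```

The original contour and its delta contour are then both passed to the third stage, which doubles the number of contours per utterance.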

The third stage of the feature extraction process involves a statistical analysis of the 

feature contours. The final feature set is obtained from the feature contours by applying 

so-called static modeling through functionals [9]. 

In the literature, larger numbers of statistical features are analyzed [10, 11]. Our 

selection of statistical functionals was guided by the principle that chosen statistical features 

should describe the variations and follow the trend of changes of acoustic features correlated 

with different types of emotional speech. At the same time, since it was impossible to predict 

which statistical characteristics would be the most effective, the proposed set of statistical 

features included 12 features, bearing in mind that if particular information in the feature

vector proved to be redundant and detrimental to classification, an efficient subset of

features would be extracted using a dimensionality reduction technique.

The proposed set of 12 statistical functionals has been chosen from three groups of 

functionals which are the most frequently used [9]. These groups and their corresponding 

functionals are [7]: 

1. The first four moments (mean, standard deviation, skewness and kurtosis), 

2. Extrema and their positions (minimum, maximum, range, relative position of 

minimum and relative position of maximum), 

3. Regression coefficients (the slope and the offset of the linear regression of the 

contour) and regression error (the mean squared error between the regression curve 

and the original contour). 
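The three groups of functionals listed above can be sketched for a single contour as follows. This is a minimal illustration, not the authors' implementation: the function name `functionals`, the use of population (1/n) moments, the non-excess form of kurtosis and the scaling of extremum positions to the interval [0, 1] are all assumptions.

```python
import math


def functionals(contour):
    """Return 12 statistical functionals of one contour as a dict."""
    n = len(contour)
    if n < 2:
        raise ValueError("need at least two frames")

    # Group 1: the first four moments (population form, an assumption).
    mean = sum(contour) / n
    centered = [v - mean for v in contour]
    std = math.sqrt(sum(c * c for c in centered) / n)
    # Standardized third and fourth moments; 0.0 for a constant contour.
    skew = sum(c ** 3 for c in centered) / (n * std ** 3) if std > 0 else 0.0
    kurt = sum(c ** 4 for c in centered) / (n * std ** 4) if std > 0 else 0.0

    # Group 2: extrema and their relative positions in [0, 1].
    minimum, maximum = min(contour), max(contour)
    pos_min = contour.index(minimum) / (n - 1)
    pos_max = contour.index(maximum) / (n - 1)

    # Group 3: least-squares line y = slope * t + offset over frame index t.
    t_mean = (n - 1) / 2
    t_var = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * c for t, c in enumerate(centered)) / t_var
    offset = mean - slope * t_mean
    mse = sum((v - (slope * t + offset)) ** 2
              for t, v in enumerate(contour)) / n

    return {
        "mean": mean, "std": std, "skewness": skew, "kurtosis": kurt,
        "min": minimum, "max": maximum, "range": maximum - minimum,
        "pos_min": pos_min, "pos_max": pos_max,
        "slope": slope, "offset": offset, "regression_mse": mse,
    }


# Hypothetical linearly rising contour used only to exercise the function.
example = functionals([1.0, 2.0, 3.0, 4.0, 5.0])
```

Applying such a function to every contour and every delta contour of an utterance yields one fixed-length feature vector per utterance, regardless of the utterance duration.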

By applying the proposed procedure, three sets of features have been extracted [7]. 

The first feature set includes only prosodic features (pitch and energy) and it will be 

referred to as prosodic feature set (P-FS). The second feature set includes only spectral 

features (12 MFCC); this set will be referred to as spectral feature set (S-FS). Finally, the 

third feature set includes both prosodic and spectral features, and additionally the voicing

probability and the zero crossing rate. For these 16 frame-level features (pitch, energy, voicing

probability, zero crossing rate and 12 MFCCs), the first derivative is calculated, and then the 12

functionals are applied to all 32 resulting contours, giving 16 × 2 × 12 = 384 features

for each utterance. The third feature set will be referred to as the prosodic-spectral

feature set (PS-FS).
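The dimensionality of PS-FS follows directly from the counts given in the text. The short sketch below only reproduces that arithmetic; the variable names are illustrative.

```python
# Frame-level features of PS-FS as listed in the text.
low_level = {
    "pitch": 1,
    "energy": 1,
    "voicing probability": 1,
    "zero crossing rate": 1,
    "MFCC": 12,
}
n_contours = sum(low_level.values())      # original contours
n_with_deltas = 2 * n_contours            # contours plus first derivatives

# Functionals per group: moments, extrema (with positions), regression.
group_sizes = {"moments": 4, "extrema": 5, "regression": 3}
n_functionals = sum(group_sizes.values())

ps_fs_size = n_with_deltas * n_functionals

# Number of PS-FS features contributed by each feature type and each group.
per_type = {name: count * 2 * n_functionals for name, count in low_level.items()}
per_group = {name: size * n_with_deltas for name, size in group_sizes.items()}
```

The per-type and per-group totals obtained this way agree with the `#Total` rows reported later in Tables 4 and 5, where pitch and voicing probability are counted together.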

2.2 Classification scheme 

For the purpose of emotional speech classification, we have considered the linear

discriminant classifier (LDC) and the k-nearest neighbours classifier (kNN), as they

are well-known, simple classifiers that have been used by other researchers

for this purpose and have proved to be successful for both acted and spontaneous

emotional speech [9]. As for LDC, two classification schemes have been considered. The

first one is the linear Bayes classifier with the underlying assumption that classes have 

Gaussian densities and equal covariance matrices. The second one is the derivation of 

linear discriminant functions via the perceptron rule [12]. In the latter case, no assumptions 

have been made about the underlying class densities. 
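For illustration, the kNN decision rule can be sketched in a few lines. The Euclidean distance, the function name `knn_predict` and the toy two-dimensional data are assumptions of this sketch; the paper does not specify the distance measure that was used.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, query, k=9):
    """Label `query` by majority vote among its k nearest training vectors."""
    # Sort training indices by Euclidean distance to the query vector.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D feature vectors for two emotional classes.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["anger"] * 3 + ["sadness"] * 3

label_a = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
label_b = knn_predict(train_x, train_y, (4.9, 5.0), k=3)
```

Because every distance sums over all feature dimensions, uninformative dimensions can dominate the vote, which is one reason why kNN tends to be sensitive to the dimensionality of the feature space.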



2.3 Emotional speech corpus 

The research was conducted on the Corpus of Emotional and Attitude Expressive Speech 

(GEES, according to the Serbian acronym), which is the first speech corpus recorded in 

Serbian for the purpose of research on acoustic manifestations of emotions in human speech 

in the context of speech technology [13]. It contains recordings of acted speech-based 

emotional expressions corresponding to five basic emotional states: anger, joy, fear, sadness, 

and neutral, reproduced by six actors (3 female, 3 male). The underlying textual material is 

emotionally neutral with respect to lexical content and for the purpose of this study a section 

of the corpus including 30 short and 30 long sentences was used. The reported human 

recognition accuracy for this corpus is 94.7%. To avoid an imbalance between male and 

female speakers, an equal portion of the material from each emotional class belonging to 

each speaker was chosen and a total of 1740 sentences (75 minutes of speech) have been 

processed. Both training and test sets included utterances from all speakers. Therefore, these 

experiments belong to the case of speaker dependent emotion recognition. 

3. DIMENSIONALITY REDUCTION  

Dimensionality reduction can be performed through feature extraction or feature 

selection. While feature extraction employs a mapping (usually linear) of a given feature 

space onto a lower dimensional space, creating a feature subset which is a combination of 

existing features, feature selection involves a selection of a subset from the existing 

features without any transformation. 

3.1 Linear Discriminant Analysis 

Linear Discriminant Analysis (LDA) is a linear feature extraction technique whose 

goal is the enhancement of the class-discriminatory information in a lower dimensional 

feature space. Fisher's LDA for a two-class problem is based on a search for a projection

that maximizes the ratio of between-class to within-class scatter. The solution is in a 

specific choice of direction for the projection of the data where the examples from the 

same class are projected so as to be very close to each other and, at the same time, the 

projected class means are projected so as to be as far from each other as possible [14]. 
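For the two-class case, the Fisher direction has the well-known closed form w = S_W^{-1}(mu_a - mu_b). The sketch below evaluates it for two-dimensional data using only the standard library; the function names and the toy samples are assumptions introduced for illustration.

```python
def mean_vector(samples):
    """Mean of a list of 2-D samples."""
    n = len(samples)
    return [sum(s[d] for s in samples) / n for d in range(2)]


def scatter(samples, mu):
    """2x2 scatter matrix sum((x - mu)(x - mu)^T) of one class."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in samples:
        d = [x[0] - mu[0], x[1] - mu[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += d[r] * d[c]
    return s


def fisher_direction(class_a, class_b):
    """Direction w = S_W^{-1} (mu_a - mu_b) for two classes in 2-D."""
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    s_a, s_b = scatter(class_a, mu_a), scatter(class_b, mu_b)
    sw = [[s_a[r][c] + s_b[r][c] for c in range(2)] for r in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    # Multiply the explicit 2x2 inverse of S_W by the mean difference.
    return [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
            (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]


# Hypothetical samples: class_b is class_a shifted along the first axis.
class_a = [(4.0, 2.0), (5.0, 3.0), (6.0, 2.5), (5.0, 1.5)]
class_b = [(1.0, 2.0), (2.0, 3.0), (3.0, 2.5), (2.0, 1.5)]

w = fisher_direction(class_a, class_b)
proj_a = [w[0] * x + w[1] * y for x, y in class_a]
proj_b = [w[0] * x + w[1] * y for x, y in class_b]
```

On this toy data the projected samples of the two classes do not overlap, although the direction w is not parallel to the line joining the class means, because the within-class scatter is taken into account.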

Fisher's LDA generalizes easily to a C-class problem (in our case C = 5, since we deal

with 5 emotional classes). Since the projection is no longer a scalar (it has C−1 dimensions),

the determinants of the scatter matrices are used to obtain an objective function. The

between-class scatter matrix represents the scatter of the class mean vectors around the mixture

mean, and is defined as:

 
$$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^{\mathrm{T}}, \qquad (1)$$

where $\mu_i = \frac{1}{N_i} \sum_{x \in C_i} x$ is the mean vector of each class in the original feature space x, and

$\mu = \frac{1}{N} \sum_{\forall x} x$ is the mean vector of the mixture distribution.

A within-class scatter matrix shows the scatter of samples around their respective class 



mean vectors, and is expressed by: 

 



$$S_W = \sum_{i=1}^{C} S_i, \qquad (2)$$

where

$$S_i = \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^{\mathrm{T}}. \qquad (3)$$

It can be shown that the optimal projection matrix $W^* = [w_1^* | w_2^* | \dots | w_{C-1}^*]$ is the one

whose columns are the eigenvectors corresponding to the largest eigenvalues of the

following generalized eigenvalue problem [14]:

$$(S_B - \lambda_i S_W) w_i^* = 0. \qquad (4)$$

The projections with maximum class separability information are the eigenvectors corresponding

to the largest eigenvalues $\lambda_i$ of the matrix $S_W^{-1} S_B$.
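The scatter matrices and the generalized eigenvalue problem can be illustrated on a small example. The sketch below is restricted to two-dimensional features so that the largest eigenvalue of S_W^{-1} S_B can be obtained in closed form with the standard library; the function names and the toy three-class data are assumptions, and a real system with hundreds of features would rely on a numerical eigensolver instead.

```python
import math


def outer_add(matrix, d, weight=1.0):
    """Add weight * d d^T to a 2x2 matrix in place."""
    for r in range(2):
        for c in range(2):
            matrix[r][c] += weight * d[r] * d[c]


def scatter_matrices(classes):
    """Return (S_B, S_W) for a list of classes of 2-D samples."""
    all_samples = [x for samples in classes for x in samples]
    n = len(all_samples)
    mu = [sum(x[d] for x in all_samples) / n for d in range(2)]
    sb = [[0.0, 0.0], [0.0, 0.0]]
    sw = [[0.0, 0.0], [0.0, 0.0]]
    for samples in classes:
        n_i = len(samples)
        mu_i = [sum(x[d] for x in samples) / n_i for d in range(2)]
        # Between-class term: N_i (mu_i - mu)(mu_i - mu)^T.
        outer_add(sb, [mu_i[0] - mu[0], mu_i[1] - mu[1]], weight=n_i)
        # Within-class term: sum over the class of (x - mu_i)(x - mu_i)^T.
        for x in samples:
            outer_add(sw, [x[0] - mu_i[0], x[1] - mu_i[1]])
    return sb, sw


def leading_direction(sb, sw):
    """Largest eigenvalue and an eigenvector of S_W^{-1} S_B (2x2 case)."""
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    m = [[sum(inv[r][k] * sb[k][c] for k in range(2)) for c in range(2)]
         for r in range(2)]
    trace = m[0][0] + m[1][1]
    det_m = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    lam = (trace + math.sqrt(max(trace * trace - 4.0 * det_m, 0.0))) / 2.0
    if m[0][1] == 0.0 and m[1][0] == 0.0:
        # Diagonal matrix: the eigenvector is a coordinate axis.
        w = [1.0, 0.0] if m[0][0] >= m[1][1] else [0.0, 1.0]
    elif abs(m[0][1]) > abs(m[1][0]):
        w = [m[0][1], lam - m[0][0]]
    else:
        w = [lam - m[1][1], m[1][0]]
    return lam, w


# Hypothetical three-class data with identical within-class shape.
classes = [
    [(1.0, 2.0), (2.0, 3.0), (3.0, 2.5), (2.0, 1.5)],
    [(6.0, 2.0), (7.0, 3.0), (8.0, 2.5), (7.0, 1.5)],
    [(4.0, 6.0), (5.0, 7.0), (6.0, 6.5), (5.0, 5.5)],
]
sb, sw = scatter_matrices(classes)
lam, w = leading_direction(sb, sw)
```

The returned pair can be checked against Eq. (4) by verifying that (S_B - lam * S_W) w is numerically zero.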

3.2 Feature selection 

The drawback of feature extraction methods is that they are not very appropriate for 

feature mining, as the original features are not retained after the transformation [9]. In 

order to gain an insight into the significance of particular features, feature selection was 

used. We adopted Sequential Forward Feature Selection (SFFS) as the search strategy and 

wrapper based evaluation as the objective function. SFFS starts the selection with an 

empty set and sequentially adds the feature that results in the highest value of the 

objective function when combined with the already selected features [15]. In the case of 

wrappers the objective function is a classifier which evaluates feature subsets by their 

recognition rate on test data employing cross-validation. In our case, linear Bayes 

classifier was selected as the wrapper as it had shown the best performance in previous 

recognition tests [7]. Ideally, feature selection methods should not only reveal the single 

most relevant attribute (or groups thereof), but they should also decorrelate the feature 

space [9]. Feature selection results in a reduced, interpretable set of significant features; 

their counts and weights in the selection set allow us to draw conclusions on the relevance 

of the feature types they belong to [16]. The feature set used in our feature selection 

experiments was PS-FS. Since it is a combination of both prosodic and spectral features, 

the relevance of particular feature types within PS-FS was expected to be evaluated. 
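The greedy search described above can be sketched as follows. The function `sequential_forward_selection` takes the wrapper objective as a parameter. In this illustration the wrapper is a leave-one-out 1-nearest-neighbour accuracy, which is only a compact stand-in for the cross-validated linear Bayes classifier used in the paper; all names and the toy data are assumptions.

```python
import math


def sequential_forward_selection(n_features, score, n_select):
    """Greedily add the feature that maximizes score(selected + [feature])."""
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def loo_accuracy(data, labels, subset):
    """Leave-one-out accuracy of a 1-NN classifier on the given features.

    Stand-in wrapper objective (assumption); the paper uses a linear Bayes
    classifier evaluated with cross-validation.
    """
    correct = 0
    for i, x in enumerate(data):
        nearest = min(
            (j for j in range(len(data)) if j != i),
            key=lambda j: math.dist([x[f] for f in subset],
                                    [data[j][f] for f in subset]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(data)


# Hypothetical data: feature 0 separates the classes, features 1 and 2 do not.
data = [(0.0, 5.0, 1.0), (0.2, 1.0, 4.0), (0.1, 3.0, 2.0),
        (1.0, 4.9, 3.9), (1.2, 1.1, 1.1), (1.1, 3.1, 2.1)]
labels = ["a", "a", "a", "b", "b", "b"]

selected = sequential_forward_selection(
    n_features=3,
    score=lambda subset: loo_accuracy(data, labels, subset),
    n_select=2,
)
```

Each iteration evaluates the wrapper once per remaining feature, so selecting 35 features out of 384 requires several thousand classifier evaluations, which is why a fast classifier is a practical choice for the wrapper.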

4. EXPERIMENTAL RESULTS 

The focus of the research was on the investigation of a possible improvement of 

recognition accuracy in the case of a reduced feature space in the task of basic emotions 

classification. Therefore, the performances of each classifier were tested in two ways: 

(1) using 3 extracted feature sets (P-FS, S-FS, PS-FS), and (2) using 3 feature sets 

obtained after LDA feature reduction has been applied on the 3 initial feature sets. The 

experiments were carried out using 3 classification techniques (the kNN classifier, the 

linear Bayes classifier and the perceptron rule). 



Table 1 shows the class and average recognition rate of the kNN classifier (k = 9) in case 

of 3 feature sets, before LDA (originally extracted feature sets) and after LDA (original 

feature space reduced to 4 projection vectors). It can be observed that rather poor 

performance of kNN in the case of all three original sets has been significantly improved in 

the reduced feature space. The highest improvement has been achieved in the case of 

prosodic-spectral feature set (an increase from 39.9% to 91.3% average recognition rate), 

which could be explained by the fact that the performance of the kNN classifier is affected 

by the high dimensionality, which is particularly apparent in case of PS-FS. 

Table 1 Recognition accuracy of the kNN classifier using 3 feature sets (before and after feature reduction using LDA)

Class recognition rate [%]

Feature set              Anger  Fear  Joy   Neutral  Sadness  Average
P-FS                     44.3   23.9  39.1  25       44.5     35.4
P-FS reduced with LDA    53.7   51.2  52.3  53.2     61.2     54.3
S-FS                     73.9   56.9  35.1  58.1     37.1     52.2
S-FS reduced with LDA    81.3   92.8  81.6  95.7     93.9     89.1
PS-FS                    57.8   37.9  32.2  23.6     33.3     39.9
PS-FS reduced with LDA   86.8   93.7  83.6  95.9     96.3     91.3

Table 2 shows the class and average recognition rate of the linear Bayes classifier in 

case of three feature sets, before LDA (initially extracted feature sets) and after LDA 

(original feature space reduced to 4 projection vectors). An improvement of recognition 

accuracy is obtained only in the case of the prosodic feature set (P-FS). This improvement

amounts to 5.7 percentage points (from 49.9% to 55.6%), which is a rather moderate increase

compared to the results in Table 1, where the improvement for P-FS is 18.9 percentage points

(from 35.4% to 54.3%). As to S-FS and PS-FS, there were no notable

improvements, which is probably due to good linear separability in the original feature

space (resulting in high recognition rates using non-reduced S-FS and PS-FS).

Table 2 Recognition accuracy obtained with 3 feature sets (before and after feature reduction using LDA) and with the linear Bayes classifier

Class recognition rate [%]

Feature set              Anger  Fear  Joy   Neutral  Sadness  Average
P-FS                     51.4   43.7  46.8  45.4     62.4     49.9
P-FS reduced with LDA    51.4   53.4  46.8  56.6     69.8     55.6
S-FS                     85.1   91.7  81    95.9     93.9     89.5
S-FS reduced with LDA    85.1   91.4  80.7  96.5     94.3     89.6
PS-FS                    88.8   92.5  84.2  97.1     94.8     91.5
PS-FS reduced with LDA   88.2   92.5  85.3  95.9     95.7     91.5

The class and average recognition rate of the perceptron rule in two test conditions 

(3 feature sets before LDA and 3 feature sets after LDA) are given in Table 3. Slight 

improvements of recognition accuracy are noticeable in the case of all three reduced 

feature sets. The improvement is the lowest in case of P-FS.  



When these three classifiers are compared, it can be noted that a substantial improvement

of recognition accuracy has been achieved for the simplest classifier, namely kNN. Using

the PS-FS reduced with LDA, kNN achieves an accuracy (91.3%) almost equal to the best result

in our experiments (91.5%). The same holds for the perceptron (90.5%), although the

relative improvement of the average performance of the perceptron is much smaller.

Table 3 Recognition accuracy using 3 feature sets (before and after feature reduction using LDA) and with the perceptron rule as the classifier

Class recognition rate [%]

Feature set              Anger  Fear  Joy   Neutral  Sadness  Average
P-FS                     34.8   29.3  36.2  21.3     56.9     35.7
P-FS reduced with LDA    16.9   33.1  42.2  33       62.6     37.6
S-FS                     79.9   81.9  72.1  89.7     87.9     82.3
S-FS reduced with LDA    78.2   90.5  80.5  91.1     93.4     86.7
PS-FS                    83.9   88.2  77.1  91.4     93.7     86.9
PS-FS reduced with LDA   86.8   94.2  82.8  93.9     94.5     90.5

Employing LDA, the original feature space is transformed to a new one, making it 

impossible to interpret the relevance of particular feature types. For an insight into the list 

of the most relevant features in the original (untransformed) feature space, SFFS 

(Sequential Forward Feature Selection) has been applied. The wrapper for SFFS is the 

linear Bayes classifier since it had the best recognition results. The number of selected 

features has been preset to 35. For the interpretation of results, three indicators have been 

used. The first indicator of the relevance of a feature type is the number (#) of the features 

selected by SFFS. The other two indicators are the so-called 'share' and 'portion', as

described in [16]. With 'share', the count of the selected feature type is normalized by the

total number of features in the reduced set (#/35 in our experiment). With 'portion', the

same number is normalized by the cardinality of a feature type in the original feature set

(#/#total). For each feature type, the 'share' indicator displays its percentage in modeling

our 5-class problem, while the indicator 'portion' gives the percentage of the total number

of the feature type which contributes to the modeling of the problem. 

The results of the selection of 35 features from PS-FS and the effectiveness of each 

feature type are displayed through 3 indicators in Table 4. The observed feature types 

from PS-FS are: zero crossing rate (ZCR), energy, pitch (plus voicing probability) and 

MFCC. Rows '#Total' and '#' show the total number and the number of selected

features per each feature type, respectively. 

From Table 4 it can be observed that the most selected features ('share'=77.1%)

belong to the MFCC type. The second important feature type is energy ('share'=11.4%).

The third and the fourth feature type are ZCR and pitch, respectively.

As regards the indicator 'portion', the list of feature types can be arranged in the

following way: from the total feature set energy is selected with the highest percentage 

(16.7%), followed by ZCR (12.5%). Although the MFCC feature type is the most frequent 

one in the selected feature set, only 9.4% of the total number of MFCC is selected. The 

pitch feature type is selected by the lowest rate (2.1%). 

 

Table 4 Summary of feature selection results (35 features selected using SFFS), displayed with respect to feature types

                     ZCR    Energy  Pitch  MFCC
#Total               24     24      48     288
SFFS: #              3      4       1      27
SFFS: share [%]      8.6    11.4    2.9    77.1
SFFS: portion [%]    12.5   16.7    2.1    9.4
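The 'share' and 'portion' indicators in Table 4 follow directly from the selected and total counts per feature type. The short sketch below recomputes them; the dictionary names are illustrative.

```python
# Counts taken from Table 4.
totals = {"ZCR": 24, "Energy": 24, "Pitch": 48, "MFCC": 288}
selected_counts = {"ZCR": 3, "Energy": 4, "Pitch": 1, "MFCC": 27}

n_selected = sum(selected_counts.values())  # size of the reduced set

# 'share': selected count relative to the size of the reduced set.
share = {t: 100.0 * selected_counts[t] / n_selected for t in totals}
# 'portion': selected count relative to the size of the feature type.
portion = {t: 100.0 * selected_counts[t] / totals[t] for t in totals}

for t in totals:
    print(f"{t:7s} share = {share[t]:5.1f} %   portion = {portion[t]:5.1f} %")
```

The two indicators answer different questions: 'share' shows which type dominates the selected set, whereas 'portion' corrects for the fact that MFCC-based features make up three quarters of PS-FS to begin with.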

Table 5 summarizes the results of the feature selection distributed across the groups of

statistical functionals used: moments, extrema and regression coefficients. The features derived

via moments are the most frequent among the selected features ('share'=57.1%), followed

by the features derived via extrema (22.9%) and the features derived via linear regression

(20%). With respect to the 'portion' of the total number of features in each group of functionals,

moments are ranked highest (15.6%), followed by regression functionals (7.3%) and extrema (5%).

Table 5 Summary of feature selection results, distributed along groups of used statistical functionals

                     Moments  Extrema  Regression
#Total               128      160      96
SFFS: #              20       8        7
SFFS: share [%]      57.1     22.9     20
SFFS: portion [%]    15.6     5        7.3

5. CONCLUSION 

The paper gives an outline of a system for the recognition of basic emotions in speech, 

with particular emphasis on the extracted acoustic feature sets, classification schemes and 

emotional speech corpus. The paper discusses the obtained improvement of the recognition 

accuracy in a lower dimensional feature space obtained by applying Linear Discriminant 

Analysis. The most substantial improvement of the recognition accuracy has been achieved 

for the simplest classifier in our experiments, namely the kNN classifier. A combination of 

kNN with a reduced prosodic-spectral feature set nearly approaches the best results obtained 

in the experiments (the accuracy of 91.5%).  

A feature selection algorithm has been employed in order to evaluate the relevance of

the feature types and their statistical properties in the given task of the recognition of 5 

basic emotions. In descending order of relevance, the features are: MFCC, energy, zero 

crossing rate and pitch. With respect to the ratio of selected features to the total number of

features in each feature type, features related to energy are the most frequently selected.

The results of the feature selection distributed across the groups of statistical functionals

imply that moments are the most relevant statistical features, although the extrema,

regression coefficients and regression error also play notable roles.



By combining selected prosodic and spectral features, represented by appropriate statistical

features, recognition results comparable with those of more complex systems can be achieved

even with a very simple classification scheme (such as kNN).

Acknowledgement: The research presented in this paper has been carried out within the project 

"The development of dialogue systems for Serbian and other south Slavic languages" (TR32035), 

supported by the Ministry of Education, Science and Technological Development of the Republic 

of Serbia. 

REFERENCES 

[1] D.A. Sauter, F. Eisner, P. Ekman, S. Scott, "Cross-cultural recognition of basic emotions through

nonverbal emotional vocalizations", Proceedings of the National Academy of Sciences of the USA, vol. 107(6),

pp. 2408-2412, 2010.

[2] D. Ververidis, C. Kotropoulos, "Emotional speech recognition: Resources, features and methods", Speech 

Communication, vol. 48, pp. 1162-1181, 2006. 

[3] S.L. Lutfi, F. Fernandez-Martinez, J.M. Lucas-Cuesta, L. Lopez-Lebon, J.M. Montero, "A

satisfaction-based model for affect recognition from conversational features in spoken dialog systems", Speech

Communication, vol. 55, pp. 825-840, 2013. 

[4] M.E. Ayadi, M.S. Kamel, F. Karray, "Survey on speech emotion recognition: Features, classification 

schemes and databases", Pattern Recognition, vol. 44, pp. 572-587, 2011. 

[5] N. Fragopanagos, J.G. Taylor, "Emotion recognition in human-computer interaction", Neural Networks, 

vol. 18, pp. 389-405, 2005.

[6] B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth, "Acoustic emotion recognition: a benchmark 

comparison of performances", IEEE Workshop on Automatic Speech Recognition and Understanding, 

ASRU 2009, Italy, 2009, pp. 552-557. 

[7] V. Delić, M. Bojanić, M. Gnjatović, M. Sečujski, S.T. Jovičić, "Discrimination capability of prosodic and

spectral features for emotional speech recognition", Electronics and Electrical Engineering, Kaunas 

Technologija, vol. 18, no. 9, pp. 51-54, 2012. 

[8] M. Bojanić, Extraction and selection of feature set for automatic emotional speech recognition. Ph.D. 

dissertation, Dept. Elect. Eng., Faculty of Technical Sciences, University of Novi Sad, 2013. 

[9] B. Schuller, A. Batliner, S. Steidl, D. Seppi, "Recognising realistic emotions and affect in speech: state of

the art and lessons learnt from the first challenge", Speech Communication, vol. 53, pp. 1062-1087, 2011. 

[10] C.M. Lee, S.S. Narayanan, "Toward detecting emotions in spoken dialogs", IEEE Transactions Speech 

Audio Processing, vol. 13, no. 2, pp. 293-303, 2005. 

[11] H. Altun, G. Polat, "New frameworks to boost feature selection algorithms in emotion detection for 

improved human computer interaction", LNCS, vol. 4729, Berlin-Heidelberg: Springer, pp. 533-541, 2007. 

[12] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd edition. Wiley, New York, 2000.

[13] S.T. Jovičić, Z. Kašić, M. Djordjević, M. Rajković, "Serbian emotional speech database: design,

processing and evaluation", Proceedings of International Conference on Speech and Computer

(SPECOM 2004), St. Petersburg, 2004, pp. 77-81.

[14] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, 1990. 

[15] P. Pudil, J. Novovicova, J. Kittler, "Floating search methods in feature selection", Pattern Recognition 

Lett., vol. 15, pp. 1119-1125, 1994. 

[16] A. Batliner et al., "Whodunnit – Searching for the most important feature types signalling emotion-related 

user states in speech", Computer Speech and Language, vol. 25, pp. 4-28, 2011.