Designing an Intelligent System for Detecting a Sense of Wonder in the English Speech Signal Using an Adaptive Neuro-Fuzzy Inference System (ANFIS)

Sakine Tashakori
University of Sistan and Baluchestan
Sistan and Baluchestan Province, Zahedan, Daneshgah Boulevard, Iran
Phone: +98 54 3113 2505
sakine.tashakori@yahoo.com

Salman Haghighat
Andisheh University
Fars Province, Jahrom, Andishe St, Iran
salman.haghighat110@gmail.com

Abstract
The purpose of this research is to design an intelligent diagnostic system for detecting a sense of wonder in the English speech signal using an Adaptive Neuro-Fuzzy Inference System (ANFIS). For English, some emotions such as anger, grief, joy and hatred have already been recognized, but because of the difficulty of creating a speech database in a state of wonder and the shortage of resources on this subject, even in other languages, a sense of wonder has not yet been detected in English speech. In the absence of a suitable English database for emotion identification, a wonder-neutral database was first created, containing 30 sentences expressed with a sense of surprise and in a neutral (emotionless) state. Then, LPC coefficients and frequency characteristics of the speech signals, such as the maximum, minimum, median and mean (obtained by FFT), were extracted. Finally, an Adaptive Neuro-Fuzzy Inference System (ANFIS) was used to detect the sense of wonder with an average accuracy of about 94.23%.

Keywords: Emotion Detection; English Speech Signal; Adaptive Neuro-Fuzzy Inference System (ANFIS).

1. Introduction
One of the important functions of speech is the transfer of the speaker's emotional state to the listener. Speech conveys various emotional states of an individual, including anger, happiness, surprise, fear, and so on. Understanding the feeling in speech gives the listener additional information beyond its lexical meaning.
Therefore, the listener attends not only to what the speaker says, but also to the feeling that accompanies it. With the increasing interaction between man and machine, automatic conversation between the two, without a human operator, has become particularly important, and much research aims to make communication between them easier. Understanding human emotions by the machine and providing an appropriate response is one of the areas that helps to reach this goal. A system for recognizing the feeling in a speech signal therefore plays an important role in everyday life, and this need will only grow. In this regard, research is required to detect the sense of wonder in the English speech signal.

In Ghaderian and Ahadi's article (2008), the formant parameters and pitch frequency for the emotional states of anger and grief are extracted in the Persian language, and the speech mode is determined by decision tree and GMM methods. Mousavian, Nourest, and Rahati (2007) investigated the influence of culture and social norms on anger, happiness, sadness and neutrality during data collection, recognizing emotions with respect to the characteristics of local culture in the Persian language.

BRAIN – Broad Research in Artificial Intelligence and Neuroscience, Volume 10, Issue 1 (January - February, 2019), ISSN 2067-3957

The length and LPC coefficients are used along with the pitch and frequency characteristics, and a combined ANFIS method is applied for recognition. Nourest et al. (2009) investigated the feelings of hatred, anger, fear, sadness, happiness and the neutral state in the Persian language. To recognize the emotion, various sound features such as the waveform, jitter, shimmer, sound intensity and, finally, the fractal dimension of the sound signal were used.
Anagnostopoulos et al.'s paper (2012) surveys the emotional speech recognition systems developed between 2000 and 2011, together with the features they extract and the databases and studies involved (in German, Chinese, Mandarin, Hindi, French, Slavic and Spanish). All linguistic, non-verbal, linear and non-linear features are defined and classified, covering the emotional states of hatred, anger, fear, sadness, happiness, surprise, fatigue, interest, anxiety, hostility, vanity, satisfaction, hope and humor; the variety of methods suggested so far for analyzing these characteristics and recognizing the feeling is likewise described and compared by Shashidhar and his colleagues (2012). Hamidi and Mansoorizad (2012) recognized feelings of anger, hatred, fear, sadness and joy for the Persian language. Speech signal features such as pitch, intensity, energy and MFCC coefficients were applied to detect feelings using an MLP neural network, with an accuracy of 78%. Pathak (2011) studied feelings of grief, anger, happiness, hatred and fear, and used neural networks to obtain a better result in recognizing emotion in the speech signal. Staroniewicz (2009, 2011) studied six feelings of hatred, anger, fear, sadness, joy and surprise, along with the neutral state, in Polish. Different emotional features such as intensity, formants and LPC coefficients were analyzed with the help of neural networks, support vector machines and decision trees. Vogt et al. (2008) studied the various databases created in English and categorized the extracted speech features, such as pitch, energy and formants; finally, methods for analyzing emotional features and recognizing emotion, such as neural networks, support vector machines, decision trees and hidden Markov models, were evaluated.
According to this research, some emotions, including anger, joy and grief, have been recognized for the Persian speech signal, but the difficulty of creating a speech database in a state of wonder means that a sense of wonder has not yet been detected in English speech. Therefore, with the aim of expanding the information in this field, the purpose of this research is to find an effective method for detecting surprise by analyzing the LPC coefficients and frequency characteristics of the signal.

Yousefinejad et al. (2015) conducted a study titled "Detecting the sense of surprise in the English speech signal". According to this research, it is challenging for computers to detect feelings; the main reason is the computer's inability to understand the user's feelings. The purpose of that paper is to design a speech recognition system and provide a new method for improving it. So far, different features have been used in this regard, but none has practically captured the relationship between the sound range and emotional states. Because the bionic wavelet is more closely tied to this connection, it appears able to help separate emotional states. For this purpose, that study used the bionic wavelet to extract features of audio signals for automatic recognition of emotions from speech. The structure of the bionic wavelet is consistent with the structure of the human ear, and since human beings have a good understanding of speech sentiment, a bionic wavelet can be useful for automatically detecting emotions from speech. The proposed structure was evaluated on the Berlin database and on Persian emotional speech data, which contain short sentences and expressions of the negative emotions of fear, anger and discomfort, together with the normal state. The results of the experiments show that the proposed algorithm offers acceptable performance compared to existing automatic emotion recognition from speech.
Pouroohid and Ayat (2012) conducted a study titled "Detection of the sensation of surprise in the Persian speech signal using a neural network". In that paper, attempts were made to design and implement a neural-network-based system to recognize the sensation of wonder in Persian speech. For Persian, some emotions such as anger, grief, joy and hatred have been recognized, but due to the difficulty of creating a spoken-word database in a state of wonder and the shortage of resources on this subject, even in other languages, no detection of the sensation of wonder in the Persian language had been done so far. Due to the inaccessibility of a suitable Persian database for emotion detection, a wonder-neutral database (with and without feeling) was created in Persian, consisting of 260 sentences with surprise and neutral sensations.

S. Tashakori, S. Haghighat - Designing the Intelligent System Detecting a Sense of Wonder in English Speech Signal Using an Adaptive Neuro-Fuzzy Inference System (ANFIS)

Marwi and Ismailian (2014) conducted research entitled "Persian database for detecting feeling from speech". Today, one of the issues that plays an important role in human-machine interaction is the recognition of emotion from the speech signal, so the use of a comprehensive database in an emotion recognition system is important. So far, various databases have been provided in English, Danish and other languages, but Persian databases have not been seen. Therefore, that article presents a Persian emotional speech database for emotion recognition from speech. The database contains 748 sentences with 8 feelings: anger, fatigue, hatred, fear, the neutral state, discomfort, surprise and joy. The sentences are expressed by 33 speakers (18 men and 15 women). In order to evaluate and compare the proposed database with the well-known Berlin database, various features were extracted from the sentences of these two databases.
Ebrahimpour and Mahmoudian (2014) conducted research entitled "Detection of speech emotions using feature selection based on recursive models". Today, recognizing emotion in speech is considered important wherever humans and machines interact. Despite many efforts in this field, there is still a long way between natural human feelings and the perception of the computer. That article uses the Berlin database, the most famous database available, with 550 sentences created by professional actors in a laboratory environment, of which 61 sentences were used, covering feelings such as happiness, hatred, neutrality, fear, discomfort, anger and fatigue. Various features were extracted from the sentences of this database individually and, because of the large number of features, a method was needed to reduce the feature space before applying the classification algorithm. To this end, a recursive feature selection method based on support vector machines (SVM) was developed to extract the features effective in recognizing emotion in the data. The average recognition rate obtained using only eight features was better than that obtained from the 75 existing features.

This article contains five sections. The second section describes how the database was collected in English. The third section shows how the appropriate features were extracted. In the fourth section, the results obtained in this paper are presented and analyzed. In the fifth section, conclusions and future work are discussed.

2. Database
One of the problems in the field of emotional speech processing in English is the lack, or limited availability, of an emotional database in English. Unfortunately, there is no standard, well-known database in the English language as there is for other languages (Anagnostopoulos, 2012; Shashidhar, 2012; Lugger and Yang, 2007).
The following measures were taken to prepare the database:
- Designing the English speech signal data
- Preparation of the database in two emotional states: wonder and neutral

2.1. Designing the English speech signal data
For the design of the data, a set of sentences prepared by the researcher was extracted and used among professors and learners. 12 sentences were chosen randomly.

2.2. Preparation of the database in two emotional states: wonder and neutral
With the help of 10 professors and EFL learners (5 professors and 5 EFL learners), the selected sentences were expressed in two emotional states: with a sense of wonder and without any feeling (neutral). Each person expressed the twelve sentences twice, trying to produce the best recordings; altogether 240 sentences were captured in the two states, wonder and neutral. Then the silence between the words in the sentences was deleted using the PRAAT software, and each of the sentences was stored in a separate file.

2.3. Quality assessment of the database
To ensure the high reliability and naturalness of the sounds recorded for the database, a listening test was performed. After preliminary validation (each recording was evaluated by 3 listeners), suspicious recordings (not in the intended emotional state) or recordings with low acoustic quality were omitted, and eventually a wonder-neutral database with 200 recordings was obtained.

3. Feature extraction
There is a very wide range of suitable and efficient features, such as energy, intensity, pitch, MFCC coefficients (Shashidhar, 2012) and the fractal dimension (Nourest et al., 2009). In this study, LPC coefficients and frequency characteristics of the speech signals, such as the maximum, minimum, median and mean, are used to detect the emotion.
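The extraction pipeline described here and detailed in the next section (normalization to [0, 1], division into subframes, LPC coefficients per subframe, and statistics of the FFT magnitude spectrum) can be sketched in Python. The paper itself used MATLAB; in this sketch the frame length, the LPC order and the averaging of subframe features are assumed values for illustration, not the parameters that produced the paper's 47 features.

```python
import numpy as np

def normalize(signal):
    """Scale a sentence signal into [0, 1] before framing."""
    s = signal - signal.min()
    return s / s.max()

def levinson_durbin(r, order):
    """LPC coefficients from an autocorrelation sequence (Levinson-Durbin)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k
    return a

def lpc(frame, order):
    """Autocorrelation-method LPC on one subframe."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return levinson_durbin(r[:order + 1], order)

def fft_features(frame):
    """Maximum, minimum, median and mean of the FFT magnitude spectrum."""
    mag = np.abs(np.fft.rfft(frame))
    return [mag.max(), mag.min(), float(np.median(mag)), mag.mean()]

def sentence_features(signal, frame_len=400, lpc_order=10):
    """Normalize, split into subframes, and pool per-subframe features."""
    signal = normalize(signal)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        feats.append(np.concatenate([lpc(frame, lpc_order)[1:],
                                     fft_features(frame)]))
    return np.mean(feats, axis=0)  # average subframe features per sentence
```

The Levinson-Durbin recursion is the standard autocorrelation method behind MATLAB's lpc(); only the first lpc_order + 1 autocorrelation lags are needed, and the four spectral statistics per subframe correspond to the maximum, minimum, median and mean named in the text.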
In the presented method, each of the signal sentences in the database is first normalized between zero and one, and then each sentence is divided into several subframes. For each subframe, a suitable number of LPC coefficients was calculated using MATLAB software. Then, by applying the Fourier transform to the subframes, frequency characteristics such as the maximum, minimum, median, mean, mode, middle of the frequency range and amplitude were extracted for each one. In total, 47 features were used to recognize the emotion in the speech.

4. Detection of feeling
To identify emotions, an appropriate classification method must be used. If an adaptive neuro-fuzzy network is chosen, selecting a suitable ANFIS structure and appropriate training affects the system's performance. The multi-layer adaptive neuro-fuzzy network, one of the most powerful techniques for the classification of information, was selected in this research. Several parameters, such as the number of hidden layers, the number of neurons in each hidden layer and the training algorithm, affect the training of a multilayer perceptron neural network. After examination, a multilayer perceptron network with two hidden layers was chosen, and the 47 features extracted from the speech signal were given as input; its first hidden layer contained 24 neurons and the second 12. An emotional database of the 200 existing sentences was used: 160 sentences for training and the other 40 for testing the network. Since the dispersion of the data is very high, it was first normalized to minimize the error rate.

Initial implementation:

Figure 1.
Initial implementation of the application

Created model structure: the created neural network has one input which, depending on its nature, has a number of different membership functions; these direct it to one output, and eventually 12 rules are derived.

Figure 2. Network structure created for system training

In the provided ANFIS model, the error rate reaches zero for the given input data after 10 epochs. This level of confidence is obtained with an ANFIS model that uses 70% of the input data for training the network and keeps the remaining data for testing the suggested model. The training of the network proceeds to the point where the error reaches zero or a small value tending towards zero. In our proposed network, the error rate reached the ideal value of zero after these epochs, from which 12 fuzzy rules could be concluded. The overall conclusion is that with more features, a more accurate estimate of the degree of emotion detection can be achieved. According to the results, the following 12 rules are obtained:

- If (input1 is in1mf1) then (output is out1mf1) (1)
- If (input1 is in1mf2) then (output is out1mf2) (1)
- If (input1 is in1mf3) then (output is out1mf3) (1)
- If (input1 is in1mf4) then (output is out1mf4) (1)
- If (input1 is in1mf5) then (output is out1mf5) (1)
- If (input1 is in1mf6) then (output is out1mf6) (1)
- If (input1 is in1mf7) then (output is out1mf7) (1)
- If (input1 is in1mf8) then (output is out1mf8) (1)
- If (input1 is in1mf9) then (output is out1mf9) (1)
- If (input1 is in1mf10) then (output is out1mf10) (1)
- If (input1 is in1mf11) then (output is out1mf11) (1)
- If (input1 is in1mf12) then (output is out1mf12) (1)

In order to compute a precise output value, the set of fuzzy rules trained with the help of ANFIS needs to be defuzzified.
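As an illustration of how such single-input rules produce a crisp output, the following sketch evaluates twelve rules of the form "If (input1 is in1mf_i) then (output is out1mf_i)" as a zero-order Sugeno system with weighted-average defuzzification. The Gaussian membership functions, their centers and widths, and the 0/1 consequents are assumed values chosen for the sketch; they are not the parameters trained in the paper.

```python
import numpy as np

def gaussmf(x, c, sigma):
    """Gaussian membership function, of the kind used in ANFIS input layers."""
    return np.exp(-0.5 * ((x - c) / sigma) ** 2)

def anfis_predict(x, centers, sigmas, consequents):
    """Evaluate the single-input rules with weighted-average (zero-order
    Sugeno) defuzzification: each rule fires with the degree of membership
    of x in its input MF, and the output is the normalized weighted sum."""
    w = gaussmf(x, centers, sigmas)                 # firing strength of each rule
    return float(np.dot(w, consequents) / w.sum())  # weighted-average output

# Illustrative parameters only: 12 membership functions spread over [0, 1],
# with consequents mapping low inputs to 0 (neutral) and high to 1 (wonder).
centers = np.linspace(0.0, 1.0, 12)
sigmas = np.full(12, 0.05)
consequents = (centers > 0.5).astype(float)

print(anfis_predict(0.9, centers, sigmas, consequents))  # close to 1.0
print(anfis_predict(0.1, centers, sigmas, consequents))  # close to 0.0
```

An input near the top of the range fires mainly the high-indexed membership functions, so the normalized weighted average approaches the "wonder" consequent of 1; inputs near the bottom approach the "neutral" consequent of 0.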
Designed with the help of a GUI, this is easily done: for any number placed in the desired input range, the corresponding output is displayed.

Figure 3. Shape of the rules for estimating

Figure 4. Trend line forecast for detecting a sense of wonder in the English speech signal

In order to evaluate the network, all the utterances in the database (both wonder and neutral) were given to the ANFIS network and the result was compared with the input. 10 errors were identified on the training set given to the system, and 5 errors after giving the test set. On this basis, the average accuracy of the system is about 94.23%. The results presented in Table 1 indicate that the proposed algorithm performs well on both the training and the test set. In Table 2, the results of the proposed method are compared with other studies; a glance at its contents shows that the proposed method offers far better results than the methods used so far for other languages. It should be noted that enlarging the set of utterances in the database could play a significant role in improving the results, since the number of ANFIS training samples would increase.

Table 1. The results of the proposed algorithm for the created database

  Set of speech signals                       Errors in detection   Set size   Correct answer percentage
  Training set of emotional speech signals    10                    200        95.15%
  Whole emotional speech database             15                    240        94.23%

Table 2. Comparing the results with similar research

  Research                     Answer percentage
  Hamidi et al. (2012)         60%
  Nourest et al. (2009)        57.7%
  Suggested method             94.23%

Table 3.
Comparing the results with other learning machines

  Learning machine         Correct answer percentage
  ANFIS                    94.23%
  Fuzzy                    84.13%
  ANN                      91.49%
  Deep learning            96.63%
  Reinforcement learning   95.86%

The model of fuzzy systems is built on a knowledge base constructed from expert knowledge. However, in some processes where human experience is not available, it is difficult to construct this knowledge base; if the initial parameters of the adaptive system are properly set by an expert, the convergence rate of the parameters to their optimal values, as well as the convergence of the output response to the optimal path, increases significantly. One of the most suitable methods for setting these parameters is the adaptive neuro-fuzzy network. Finally, the results of the ANFIS model were compared with those of the ANN model, and it was observed that the ANFIS model, owing to its use of fuzzy rules, is more capable than ANN models of predicting the sensation of surprise in the English speech signal. The results of this research are consistent with the results of the above-mentioned studies, which also concluded that the neuro-fuzzy inference method has a higher accuracy than the neural network method, although other machine learning techniques, such as deep learning, provide better results in training.

5. Conclusion
In this article, we have tried to find a suitable way to detect a sense of wonder in the English speech signal. The LPC coefficients, together with the maximum, minimum, median, mean and middle of the frequency range, were extracted from the created database, and the use of the adaptive neuro-fuzzy network led to wonder detection with high accuracy and speed. Finally, the proposed system reached an accuracy of 94.23% on the database. For future research, the sense of wonder can be detected in the noise-free speech signal to obtain better results.
Also, in addition to the extracted features, other sound features such as formants and pitch can be used to detect a sense of wonder.

References
Anagnostopoulos, Ch. N., Iliou, Th., & Giannoukos, I. (2012). Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Springer Science+Business Media.
Ebrahimpour, B., & Mahmoudian, H. (2014). Speech emotion detection using feature selection based on recursive models. 7th National Conference on Electrical and Electronic Engineering, Islamic Azad University, Gonabad, Iran.
Gharayan, D., & Ahadi, M. (2008). Recognition of emotional speech and speech mode identification in Persian language. Modares Technical and Engineering Magazine, Special Issue of Electrical Engineering, No. 34.
Hamidi, M., & Mansoorizad, M. (2012). Emotion recognition from Persian speech with neural network. International Journal of Artificial Intelligence & Applications (IJAIA), 2(5), 107–112.
Lugger, M., & Yang, B. (2007). The relevance of voice quality features in speaker independent emotion recognition. IEEE ICASSP, IV, 17–20.
Marwi, H., & Ismailian, Z. (2014). Introduction of Persian databases to detect feeling of speech. Shahrood University of Technology, Iran.
Mousavian, E., Nourest, R., & Rahati, S. (2007). Recognition of human emotions using a neuro-fuzzy network. Eighth Conference of Intelligent Systems, Ferdowsi University of Mashhad.
Nourest, R., Rahati, S., Sharifi, Sh., & Mousavian, E. (2009). Recognition of sentiment in Persian speech using fractal subjects. Proceedings of the 17th Iranian Conference on Electrical Engineering, Iran University of Science and Technology, Vol. 8, 348-348.
Pathak, S. (2011). Recognizing emotions from speech. Electronics Computer Technology (ICECT), 4, 107–109.
Pouroohid, M., & Ayat, S. (2012).
Detection of the sensation of surprise in Persian speech signal using neural network. Second National Computer Conference, Sama Faculty of Engineering, Sanandaj.
Shashidhar, G., Koolagudi, K., & Sreenivasa, R. (2012). Emotion recognition from speech: a review. Springer Science+Business Media, 15, 99–117.
Staroniewicz, P. (2009). Recognition of emotional state in Polish speech – comparison between human and automatic efficiency. Springer, Heidelberg, 5707, 33–40.
Staroniewicz, P. (2011). Automatic recognition of emotional state in Polish. Springer, Heidelberg, 347–353.
Vogt, T., Andre, E., & Wagner, J. (2008). Automatic recognition of emotions from speech: a review of the literature and recommendations for practical realization. Springer, Heidelberg, 4868, 75–91.
Yousefinejad, R., Haji Bagher Naeini, B., & Shafieian, M. (2015). Detection of the sensation of speech signal using the bionic wavelet. Scientific Journal of Sound and Vibration, 5(9).