Microsoft Word - ETASR_V13_N4_pp11166-11169


Engineering, Technology & Applied Science Research Vol. 13, No. 4, 2023, 11166-11169 11166  
 

www.etasr.com Venkateswarlu et al.: Emotion Recognition From Speech and Text using Long Short-Term Memory 

 
Emotion Recognition From Speech and Text 

using Long Short-Term Memory 
 

Sonagiri China Venkateswarlu 

Dept. of Electronics and Communication Engineering, Institute of Aeronautical Engineering, India 

cvenkateswarlus@gmail.com  

 
Siva Ramakrishna Jeevakala 

Dept. of Electronics and Communication Engineering, Institute of Aeronautical Engineering, India 

jsrkrishna3@gmail.com (corresponding author) 

 
Naluguru Udaya Kumar 

Dept. of Electronics and Communication Engineering, Marri Laxman Reddy Institute of Technology and 

Management, India 

joyudaya@gmail.com 

 
Pidugu Munaswamy 

Dept. of Electronics and Communication Engineering, Institute of Aeronautical Engineering, India 

sidduvamsi@gmail.com 

 
Dhanalaxmi Pendyala  

Dept. of Electronics and Communication Engineering, Institute of Aeronautical Engineering, India 

nanisp197@gmail.com 

Received: 2 May 2023 | Revised: 22 May 2023 | Accepted: 23 May 2023 

Licensed under a CC-BY 4.0 license | Copyright (c) by the authors | DOI: https://doi.org/10.48084/etasr.6004 

ABSTRACT 

Everyday interactions depend on more than just rational discourse; they also depend on emotional 

reactions. Having this information is crucial to making any kind of practical or even rational decision, as it 

can help to better understand one another by sharing our responses and providing recommendations on 

how they may feel. Several studies have recently begun to focus on emotion detection and labeling, 

proposing different methods for organizing feelings and detecting emotions in speech. Determining how 

emotions are conveyed through speech has been given major emphasis in social interactions during the last 

decade. However, the real efficiency of identification needs to be improved because of the severe lack of 

data on the primary temporal link of the speech waveform. Currently, a new approach to speech 

recognition is recommended, which couples structured audio information with long-term neural networks 

to fully take advantage of the shift in emotional content across phases. In addition to time series 

characteristics, structural speech features taken from the waveforms are now in charge of maintaining the 

underlying connection between layers of the actual speech. There are several Long-Short-Term Memory 

(LSTM) based algorithms for identifying emotional focus over numerous blocks. The proposed method (i) 

reduced overhead by optimizing the standard forgetting gate, reducing the amount of required processing 

time, (ii) applied an attention mechanism to both the time and feature dimension in the LSTM's final 

output to get task-related information, rather than using the output from the prior iteration of the 

standard technique, and (iii) employed a powerful strategy to locate the spatial characteristics in the final 

output of the LSTM to gain information, as opposed to using the findings from the prior phase of the 

regular method. The proposed method achieved an overall classification accuracy of 96.81%. 

Keywords-emotion recognition; speech recognition; MFCC; LSTM; deep learning  

I. INTRODUCTION  

Applications that rely on human-machine connections have 
made emotional reactions a priority since they are such an 

integral part of human interactions. Reactions can be analyzed 
and interpreted scientifically in a variety of ways, including 
facial characteristics, bodily indicators, and language. To 


Engineering, Technology & Applied Science Research Vol. 13, No. 4, 2023, 11166-11169 11167  
 

www.etasr.com Venkateswarlu et al.: Emotion Recognition From Speech and Text using Long Short-Term Memory 

 
achieve more natural and open interactions between humans 
and computers, it is necessary to regularly recognize and 
appropriately retain emotions represented by audio signals [1, 
2]. In the last two decades, several studies have developed and 
refined several Machine-Learning (ML) approaches to the 
problem of emotion interpretation, such as Speech Emotion 
Recognition (SER). There is a wide variety of uses for speech 
recognition technology. The effectiveness of audio interfaces 
and advisory services is measured by frustration identification. 
Online businesses try to tailor their offerings to the unique 
needs of their customers on an emotional level [3]. Monitoring 
the stress levels of flight crews has been shown to reduce the 
number of aircraft accidents. Many studies used facial-
expression recognition tools in their products to improve the 
user experience of people interacting with computers and 
increase user engagement [4]. Web-based interfaces have 
improved the precision of emotion detection using real-time 
facial identification and emotion prediction, intending to attract 
as many customers as possible by making changes based on 
their preferences [5]. 

The primary emphasis is on evaluating input data, audio, 
and video evidence to determine the subject's state of mind and 
provide advice. There are two main factors to consider while 
designing an SER system: (i) locating and extracting relevant 
features from an effective emotional text database, and (ii) 
constructing a trustworthy LSTM model using ML techniques. 
In actual use, the difficulty with which an SER program 
extracts emotional features is a major issue. Several studies 
have defined basic speech characteristics that convey speech 
content, such as power, tone, amplitude intensity, time domain 
power spectrum values, Mel-Frequency Cepstrum Coefficients 
(MFCC), and amplification features. For this reason, the vast 
majority of specialists favor using mixed features, which are 
made up of various characteristics that together convey more 
information. In addition, using composite features increases the 
likelihood of mistakes, as it makes training more difficult for 
most deep learning algorithms due to the large size and 
repetition of voice signals. Therefore, eliminating high-level 
speech duplication requires careful feature selection. Feature 
extraction or feature selection may help improve the precision 
and efficiency with which an ML model is trained, decrease 
unnecessary effort, focus on a specific area, and minimize 
internal requirements. Expressions of emotion in spoken 
language are ultimately classified. Emotion recognition is 
achieved by applying energy spectrum features to real-world 
voice data. Emotional nuances are reflected in the vocal signal 
in a multitude of ways. One of the trickiest parts of emotion 
analysis is deciding which features to use. 

Several conversation recognition methods have been 
proposed, and a variety of Deep Neural Networks have been 
introduced to facilitate automated discourse recognition with 
low power consumption [6]. The deep spectrum characteristic 
performs a comprehensive analysis of a novel acoustic 
classification produced by running data through a neural 
network audio classification and building a feature map from 
the activation of the final fully connected layer [7]. The 
features metrics for the identification of level 2 and level 5 
speech-based emotions were compared with the traditional 
acoustic representation. It can be beneficial for people with 

autism who can use portable devices to understand their own 
feelings and emotions and possibly adjust their social behavior 
accordingly [8]. In [9], an MFCC was used to analyze spectral 
characteristics in audio data to classify the 7 emotions using the 
Logistic Model Tree (LMT) algorithm, showing an accuracy of 
70%. In [10-12], ML classifiers were used to improve the 
classification accuracy of fNIRS signals, decoding cognitive 
states, and classification of power system stability. These 
approaches focus on some features and neglect others, while 
their accuracy cannot exceed 70%, which can influence 
performance in recognizing emotion in speech. This study used 
the LSTM parameter to extract features from a speech and text 
dataset. 

II. METHODOLOGY 

This kind of technology is called voice emotion 
recognition, as talking is the way people communicate with 
computers and convey their feelings [13]. Emotional content in 
spoken language may be extracted by combining several 
approaches to signal analysis. There are many models for 
analyzing speech signals to predict and determine the 
underlying mood. This study used a recurrent neural network 
model with LSTM, for training and analysis of audio files with 
sequential data. This study aimed to design a system that can 
achieve good accuracy in detecting embedded emotions in 
speech. 

A. Data Collection 

Data collection is fundamental to the development of data-
related activities. Overfitting is a key issue that has to be 
mitigated when using a large dataset and deep learning 
methods. There is a plethora of speech recognition datasets 
accessible to download from a variety of websites online, and 
most studies use freely accessible resources. This study used a 
text-based dataset [14] that contained text-based emotion 
details for various emotions. 

 
Fig. 1.  The LSTM model for speech recognition. 


Engineering, Technology & Applied Science Research Vol. 13, No. 4, 2023, 11166-11169 11168  
 

www.etasr.com Venkateswarlu et al.: Emotion Recognition From Speech and Text using Long Short-Term Memory 

 
Participants were given the option to record audio that 
expressed a variety of feelings, including happiness, calmness, 
sadness, anger, surprise, fear, and disgust. Using the Python 
library Keras, LSTM models were constructed sequentially. 
Making this model requires only a few simple procedures, as 
shown in Figure 1. The layers of a neural network are the basic 
building blocks, and the sequential class provides the structure 
for this model. At first, a new sequential class instance was 
created. Next, a stack of layers was constructed and set to 
communicate with each other. The memory cells in the 
recurrent neural layers were denoted by the symbol LSTM(). 
The dense layer was a fully linked layer used to generate 
results before the LSTM stack of layers. After a network has 
been constructed, it must be compiled. One of the benefits of 
compilation is the reduction in time it takes to complete. 

Figure 2 shows a block diagram of a continuous speech 
recognition system based on the pattern recognition paradigm. 
The speech signal is analyzed as a resulting sequence of feature 
vectors grouped in speech unit (phonemes or triphones) 
patterns. Each obtained pattern is compared with reference 
patterns, pre-trained, and stored with class identities. These 
pre-trained patterns, obtained in a learning process, are the 
acoustical models for speech units. The outcome of the speech 
recognition stage is the recognized word sequence. The series 
of primitive layers is transformed into a highly optimized 
collection of matrix transformation values. Due to the way 
Keras is configured, the syntax of this transformation must 
typically be one that can be executed by a CPU. Additionally, 
the optimizer and error functions must be defined before the 
model is compiled. 

 
Fig. 2.  The LSTM model for speech-to-text recognition. 

The datasets, consisting of a matrix of input patterns X and 
an array of output patterns y, must be specified before training 
the network. In addition, the constructed network was trained 
with a backpropagation method and optimized using a loss 
function and optimization strategy supplied by the model's 
construction. This approach requires training over a certain 
number of epochs, as shown in Figure 3. Once training was 
complete, the network was tested using non-overlapping 
training data. The accuracy of the prediction was used as a 
statistic because it helps in forecasting the performance of a 
constructed model. Data predictions were made using the 
predict() command after the model's efficacy was assessed. 

B. Model Training 

The model was trained using the fit() function with the 
following parameters: train X, target X, validation data, and a 
number of epochs. The test set included in the dataset was 

partitioned into X_test and y_test for validation purposes. The 
model iterated the data a certain number of times, as defined by 
the epochs parameter. Up to a point, the more the epochs, the 
better the model will become. From that point on, the model 
will no longer progress with each epoch. The model was 
trained for 31 epochs. 

 
Fig. 3.  Model training for LSTM. 

Keras defines neural networks as a series of layers. The 
sequential class serves as a framework for these levels of 
layers. The first step is to build a sequential instance of a class, 
followed by building a stack of layers and arranging them in 
the sequence in which they should be interlinked. The LSTM 
recurrent neural layers are made up of memory cells which are 
known as LSTM () cells. The dense layer is a fully connected 
layer that frequently precedes the LSTM stack of layers and is 
used to produce a result. \ 

III. RESULTS 

The results of the proposed LSTM ML model. were 
compared with those of [12], which used SVM and auto-
encoder and acquired a 74.07% accuracy rate. The network was 
evaluated once it was trained with a separate set of training 
data. After evaluating the performance of the model, it was 
used to make predictions. This action was carried out using the 
predict() function, and the output format was the same as 
specified by the output layer. 

The results of the model show its effectiveness in achieving 
good results compared to other emotion recognition studies, as 
it achieved an accuracy of 96.82%. A web app was created for 
the real-time application of the proposed speech emotion 
recognition model. A user can enter the text in the prompt 
displayed to predict and display the output emotion. The user 
can select a speech input file from the required location and 
press transcribe. Then, the speech will be converted to text and 
the emotion is displayed as output. The user can also select to 
use his microphone to record the voice input, convert it into 
text, and display the emotion as output. Table I shows the 
possible texts and emotions considered in this study. 


Engineering, Technology & Applied Science Research Vol. 13, No. 4, 2023, 11166-11169 11169  
 

www.etasr.com Venkateswarlu et al.: Emotion Recognition From Speech and Text using Long Short-Term Memory 

 
There are three types of inputs; the first is by entering the 
text message and then displaying the predicted emotion for the 
given text. The second is by uploading a voice file and then 
displaying the predicted emotion output, and the third is by 
allowing the mic to record the voice, analyzing it as a text, and 
then displaying the predicted emotion output. Every text, 
speech, and voice has an emotion. Table II shows the overall 
accuracy of the emotional classification model. 

TABLE I.  TEXT AND ITS CONSIDERED EMOTIONS  

Text Emotion 

I didn’t feel humiliated Sadness 

I am feeling grouchy Anger 

I am grabbing a minute to post I feel greedy wrong Anger 

I am ever feeling nostalgic about the fireplace Love 

TABLE II.  LSTM OVERALL ACCURACY 

Testing Speech Text Voice 

Accuracy 96.81% 96.81% 96.81% 

 
IV. CONCLUSION AND FUTURE SCOPE 

The primary goal of this study was to employ recurrent 
neural networks with LSTM to determine a person's emotional 
state. This study used a dataset of text files that depict a wide 
range of emotions, such as happiness, sadness, fear, contempt, 
surprise, and apathy. As most ML models take numeric values 
as input, the data was transformed into arrays to use them for 
feature extraction. The Libros package was used to extract the 
file and MFCC features were used in this model. The collected 
values were then fed into an LSTM model, which used these 
characteristics to provide an overall anticipated emotion. The 
model achieved an overall accuracy of 96.81% for emotion 
recognition using speech, text, and voice data. This model was 
used to enhance a real-time speaker identification system using 
a digital signal processor. The volume of the system may be 
adjusted according to the environment. This system can be 
used for aid to disabled persons. Applications and websites can 
use this approach to gauge user sentiment and decide how to 
best tailor their offerings to their audience. Additionally, this 
approach can be used in voice-based virtual assistants, 
chatboxes, and call centers for handling customer complaints. 

REFERENCES 

[1] Mustaqeem and S. Kwon, "A CNN-Assisted Enhanced Audio Signal 
Processing for Speech Emotion Recognition," Sensors, vol. 20, no. 1, 
Jan. 2020, Art. no. 183, https://doi.org/10.3390/s20010183. 

[2] A. M. Badshah et al., "Deep features-based speech emotion recognition 
for smart affective services," Multimedia Tools and Applications, vol. 
78, no. 5, pp. 5571–5589, Mar. 2019, https://doi.org/10.1007/s11042-
017-5292-7. 

[3] R. A. Khalil, E. Jones, M. I. Babar, T. Jan, M. H. Zafar, and T. 
Alhussain, "Speech Emotion Recognition Using Deep Learning 
Techniques: A Review," IEEE Access, vol. 7, pp. 117327–117345, 2019, 
https://doi.org/10.1109/ACCESS.2019.2936124. 

[4] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks 
for Large-Scale Image Recognition." arXiv, Apr. 10, 2015, 
https://doi.org/10.48550/arXiv.1409.1556. 

[5] T. Hussain, K. Muhammad, A. Ullah, Z. Cao, S. W. Baik, and V. H. C. 
de Albuquerque, "Cloud-Assisted Multiview Video Summarization 
Using CNN and Bidirectional LSTM," IEEE Transactions on Industrial 

Informatics, vol. 16, no. 1, pp. 77–86, Jan. 2020, https://doi.org/10. 
1109/TII.2019.2929228. 

[6] B. Liu, H. Qin, Y. Gong, W. Ge, M. Xia, and L. Shi, "EERA-ASR: An 
Energy-Efficient Reconfigurable Architecture for Automatic Speech 
Recognition With Hybrid DNN and Approximate Computing," IEEE 
Access, vol. 6, pp. 52227–52237, 2018, https://doi.org/10.1109/ 
ACCESS.2018.2870273. 

[7] J. Huang, B. Chen, B. Yao, and W. He, "ECG Arrhythmia Classification 
Using STFT-Based Spectrogram and Convolutional Neural Network," 
IEEE Access, vol. 7, pp. 92871–92880, 2019, https://doi.org/ 
10.1109/ACCESS.2019.2928017. 

[8] E. Sucksmith, C. Allison, S. Baron-Cohen, B. Chakrabarti, and R. A. 
Hoekstra, "Empathy and emotion recognition in people with autism, 
first-degree relatives, and controls," Neuropsychologia, vol. 51, no. 1, 
pp. 98–105, Jan. 2013, https://doi.org/10.1016/j.neuropsychologia. 
2012.11.013. 

[9] A. A. A. Zamil, S. Hasan, S. MD. Jannatul Baki, J. MD. Adam, and I. 
Zaman, "Emotion Detection from Speech Signals using Voting 
Mechanism on Classified Frames," in 2019 International Conference on 
Robotics,Electrical and Signal Processing Techniques (ICREST), 
Dhaka, Bangladesh, Jan. 2019, pp. 281–285, https://doi.org/ 
10.1109/ICREST.2019.8644168. 

[10] M. M. H. Milu, M. A. Rahman, M. A. Rashid, A. Kuwana, and H. 
Kobayashi, "Improvement of Classification Accuracy of Four-Class 
Voluntary-Imagery fNIRS Signals using Convolutional Neural 
Networks," Engineering, Technology & Applied Science Research, vol. 
13, no. 2, pp. 10425–10431, Apr. 2023, https://doi.org/10.48084/ 
etasr.5703. 

[11] S. R. Jeevakala and H. Ramasangu, "Classification of Cognitive States 
using Task-Specific Connectivity Features," Engineering, Technology & 
Applied Science Research, vol. 13, no. 3, pp. 10675–10679, Jun. 2023, 
https://doi.org/10.48084/etasr.5836. 

[12] N. A. Nguyen, T. N. Le, and H. M. V. Nguyen, "Multi-Goal Feature 
Selection Function in Binary Particle Swarm Optimization for Power 
System Stability Classification," Engineering, Technology & Applied 
Science Research, vol. 13, no. 2, pp. 10535–10540, Apr. 2023, 
https://doi.org/10.48084/etasr.5799. 

[13] S. R. Bandela and T. K. Kumar, "Emotion Recognition of Stressed 
Speech Using Teager Energy and Linear Prediction Features," in 2018 
IEEE 18th International Conference on Advanced Learning 
Technologies (ICALT), Mumbai, India, Jul. 2018, pp. 422–425, 
https://doi.org/10.1109/ICALT.2018.00107. 

[14] "Emotion Detection from Text." https://www.kaggle.com/datasets/ 
pashupatigupta/emotion-detection-from-text.