International Journal of Informatics, Information System and Computer Engineering 3(2) (2022) 21-30

2-D Attention-Based Convolutional Recurrent Neural Network for Speech Emotion Recognition

Akalya Devi C, Karthika Renuka D, Aarshana E Winy, P C Kruthikkha, Ramya P, Soundarya S

Assistant Professor, UG Scholar, Department of Information Technology, PSG College of Technology, Coimbatore, India

*Corresponding Email: cad.it@psgtech.ac.in

ABSTRACT

Recognizing speech emotions is a formidable challenge due to the complexity of emotions. The performance of Speech Emotion Recognition (SER) is significantly affected by the emotional signals retrieved from speech. The majority of emotional traits, however, are sensitive to emotionally neutral factors such as the speaker, speaking style, and gender. In this work, we postulate that computing deltas for individual features retains information that is mainly relevant to emotional traits while reducing the influence of emotionally irrelevant components, thus leading to fewer misclassifications. Additionally, SER commonly has to deal with silent and emotionally irrelevant frames. The proposed technique is effective at learning feature representations that capture emotion-relevant information. We therefore present a two-dimensional attention-based convolutional recurrent neural network that learns discriminative features and predicts emotions. The Mel-spectrogram is used for feature extraction. The proposed technique is evaluated on the IEMOCAP dataset and shows better performance, with an accuracy of 68%.

ARTICLE INFO

Article History: Received 18 Dec 2022; Revised 20 Dec 2022; Accepted 25 Dec 2022; Available online 26 Dec 2022

Keywords: 2-D, Attention-Based, Convolutional Recurrent Neural Network, Speech Emotion Recognition

1. INTRODUCTION

The significance of human speech emotion recognition has increased recently as a means to improve the quality and efficiency of interaction between machines and humans (Khalil et al., 2019). Because both natural and artificial emotions are difficult to define, recognizing human emotions is a challenging task in its own right. Extraction of the spectral and prosodic elements that lead to an accurate assessment of emotions has been the subject of numerous investigations (Tzirakis et al., 2018).

Speech emotion recognition is a technique that uses a processor to extract emotional information from speech signals (Chen et al., 2018). It then compares and analyzes the extracted emotional information together with its distinctive factors. Once the emotional information is extracted, various techniques and concepts are used to predict the emotion of the speech signal (Khalil et al., 2019). Speech emotion detection is now a rapidly developing discipline bridging the interaction between robots and humans, and it is also a popular study area in signal processing and pattern recognition. Emotions are incredibly important to human mental health; they are a way of expressing one's thoughts or state of mind to others. The major objective of SER is to improve human-machine interaction (HMI).
It can also be used with lie detectors to monitor a subject's psychophysical state (Lalitha et al., 2015). Onboard car driving systems, spoken dialogue systems used in call center conversations, and the use of speech emotion patterns in medical applications are a few instances of SER. HMI systems still have many problems that need to be resolved, especially when they move from being tested in laboratories to being deployed in actual operations. Therefore, efforts are required to effectively resolve these problems and enhance machine emotion perception.

Recently, Deep Neural Networks (DNNs) have gained popularity and made revolutionary strides in a number of machine learning fields, including continuous affect recognition. In most studies, hand-crafted features are used to feed the DNN architectures. Many DNN architectures have been put forth in that manner, including Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. Mao et al. (2014) first used CNNs to learn affect-salient features for SER and demonstrated strong scores on numerous benchmark datasets. Lee and Tashev (2015) used recurrent neural networks (RNNs) to model long-range temporal dependencies for SER. Trigeorgis et al. (2016) directly used raw audio data to train a convolutional recurrent neural network (CRNN) to predict the continuous valence space. Additionally, attention-mechanism-based RNNs have learned the structures connecting output and input segments with significant effectiveness, and they are ideally suited to SER tasks. First, speech is essentially sequential data of varying length. Moreover, most speech corpora annotate emotion labels at the utterance level, even though utterances sometimes contain lengthy pauses and frequently have a short word count. Selecting emotion-relevant frames for SER is therefore crucial. In this paper, we extend our model with a CRNN-based attention mechanism to yield affect-salient characteristics for the final emotion categorization.

In this study, we combine a CRNN and an attention model to create a unique architecture for SER dubbed the 2-D attention-based convolutional recurrent neural network (ACRNN). The main contributions of this paper are summarized as follows:

1) We propose a novel 2-D CRNN for SER that enhances the ability to capture the time-frequency relationship.

2) We employ an additional attention model to automatically concentrate on the emotion-relevant frames and provide discriminative utterance-level characteristics for SER, in order to cope with silent frames and emotion-irrelevant frames.

3) Experimental results report the accuracy, recall, precision, and confusion matrix of our proposed model.

It is well known that most speech emotion datasets only have utterance-level class labels. Most sentences, however, contain silent regions, short pauses, transitions between phonemes, unvoiced phonemes, and so on. It is clear that not all parts of a sentence are emotionally relevant. Unfortunately, LSTM does not handle this situation well when analyzing acoustic characteristics extracted from voice. In the current study, distinguishing between emotionally relevant and emotionally irrelevant frames is useful for emotion classification.
In emotion classification, it is useful to know whether a speech frame is voiced or unvoiced. Currently, two types of methods are commonly used: manually extracting emotionally relevant speech frames, and using models that learn to distinguish them automatically. However, because manual extraction requires different thresholds on different datasets, it has some limitations in terms of feasibility. Human emotional expression is often gradual, so each voiced frame is useful for emotion classification. Attention mechanisms can better match human emotional expression by capturing only the affective frames. Local attention was added to the LSTM, and different weights were assigned to each frame according to its emotional intensity.

2. LITERATURE SURVEY

On the IEMOCAP dataset, Sarthak Tripathi and Homayoon Beigi performed multimodal emotion detection and determined the best individual architectures for classifying each modality using data from speech, text, and motion capture. The design of their merged model is modular, which makes it possible to upgrade any individual model without affecting the other modalities. They utilized motion-capture data and 2-D convolutions in place of video recordings and 3-D convolutions (Tripathi et al., 2018).

For the Arabic dataset KSUEmotions, Mohammed Zakariah and Yaser Mohammad Seddiq performed speech emotion recognition. The feature extraction method made use of the time-frequency information in the spectrogram, along with numerous modification and filtering techniques. Although the system was tested at both the file and segment levels, it was trained at the segment level (Maji & Swain, 2022).

To automatically extract affect-salient features from raw spectral data, Yawei Mu and Luis A. Hernandez Gomez presented a distributed convolutional neural network (CNN). From the CNN output, they then applied a bidirectional recurrent neural network (BRNN) to capture temporal information. Finally, they used an attention mechanism on the BRNN output sequence to target the emotion-relevant portions of the utterance (Jiang et al., 2021).

A Convolutional-Recurrent Neural Network with Multiple Attention Mechanisms (CRNN-MA) was proposed by P. Jiang, X. Xu, H. Tao, L. Zhao, and C. Zou for SER. It feeds extracted Mel-spectrums and frame-level features into parallel Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) modules, respectively. The strategies they established for the proposed CRNN-MA include a multi-dimensional attention layer in the CNN module and multiple self-attention layers on the frame-level weight components (Yadav et al., 2021).

Yadav, O. P., Bastola, L. P., and Sharma, J. presented a Convolutional Recurrent Neural Network (CRNN), which combines a Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM), to learn emotional features from log-Mel-scaled spectrograms of spoken utterances. The convolution kernels of the CNN learn local features, and a BiLSTM layer is chosen to learn the temporal dependencies from the learned local features. Speech utterances are pre-processed to cut out distracting sounds and unnecessary information. Additionally, methods for increasing the number of data samples were researched, and the best methods were chosen to improve the model's recognition rate (Lim et al., 2016).
Without employing any conventional hand-crafted features, Wootaek Lim, Daeyoung Jang, and Taejin Lee developed a SER approach based on concatenated CNNs and RNNs. Convolutional Neural Networks (CNNs) have exceptional recognition ability, particularly for computer vision tasks, and recurrent neural networks (RNNs) handle sequential data processing with a high degree of success. By applying the proposed methods to an emotional speech database, the classification result was shown to be more accurate than that attained using traditional classification methods (Gayathri et al., 2020).

Silent frames and emotionally irrelevant frames are frequent problems for Speech Emotion Recognition (SER). Meanwhile, the attention mechanism has proved to be exceptionally effective at learning relevant feature representations for particular tasks. Using the Mel-spectrogram with deltas and delta-deltas as input, Gayathri, P., Priya, P. G., Sravani, L., Johnson, S., and Sampath, V. presented an attention-based Convolutional Recurrent Neural Network (ACRNN) to learn discriminative features for SER. Their test results demonstrated the viability of the approach and achieved state-of-the-art performance in terms of unweighted average recall (Gayathri et al., 2020).

2.1. Proposed Model and Experimental Setup

A 2-D attention-based convolutional recurrent neural network serves as the proposed model for speech emotion recognition.

2.2. Speech Emotion Recognition

This section explains the proposed 2-D attention-based convolutional recurrent neural network. A convolutional neural network (CNN, or ConvNet) is particularly adept at processing input with a grid-like structure, such as an image; a digital image is a binary representation of visual data. Recurrent neural networks (RNNs) are a type of neural network in which the results of one step are fed into the next step's computations; they operate on sequential or time-series data. The Convolutional Recurrent Neural Network (CRNN) model extracts features from successive windows by feeding each window frame by frame into the recurrent layer and using the outputs and hidden states of the recurrent units in each frame. Here we combine an attention mechanism with the CNN and RNN, which enables easier and higher-quality learning by concentrating on certain portions of the input sequence in order to predict a particular portion of the output sequence.

Feature extraction is a process that converts raw data into manageable numerical features while preserving the information in the original data. Compared with applying machine learning or deep learning models to the raw data directly, feature extraction produces better outcomes. The log Mel-spectrogram is used for feature extraction. The ACRNN architecture, which combines a CRNN with an attention model, is used. Then, as depicted in Fig. 1, a fully connected layer and a softmax layer for SER are introduced.

Fig. 1. ACRNN architecture

CNNs have recently demonstrated remarkable accomplishments in the SER field. The time domain and frequency domain are equally important, and 2-dimensional convolution performs better with less data than 1-dimensional convolution. The SER results, however, vary greatly between speakers because of the large variation in tone, voice, and other individual characteristics.
The log-Mels with deltas and delta-deltas act as the ACRNN input to handle this variation, where the deltas and delta-deltas describe the emotional transformation process. The mel scale is a range of pitches that, to the human ear, appear to be equally distant from one another. The distance in hertz between mel scale values, often known as "mels," increases as the frequency increases. Mel, which stands for melody, indicates that the scale is based on pitch comparisons.

Extensive tests have shown that the Mel spectrum better matches the characteristics of human auditory perception, which is approximately linear below 1000 Hz and logarithmic above 1000 Hz; this property is used to obtain the static log-Mel spectrum. Frequency and the Mel spectrum are therefore closely related. A mel spectrogram renders frequencies above a specific threshold (the corner frequency) logarithmically. For instance, in a spectrogram with a linear scale, the vertical space between 1,000 and 2,000 Hz is half that between 2,000 and 4,000 Hz, whereas in the mel spectrogram the distance between these ranges is almost the same. Similar to how we hear, this scaling makes similar low-frequency sounds easier to distinguish than similar high-frequency sounds. The output of a mel spectrogram is created by multiplying frequency-domain values by a filter bank.

The speech signal is first normalized to zero mean and unit variance to minimize the differences between speakers. The signal is then divided into small frames using Hamming windows of 25 ms duration with a shift of 10 ms. The power spectrum of each frame is calculated using the Discrete Fourier Transform (DFT), and the power spectrum is then passed through the Mel-filter banks to produce the outputs p_i. The logarithm of p_i is then used to produce the log-Mels m_i, as shown in (1). To determine the deltas of the log-Mels, we use formula (2), where N is often selected as 2. Similarly, the delta-delta features are calculated as the time derivative of the deltas, as shown in (3).

$$m_i = \log(p_i) \quad (1)$$

$$m_i^{d} = \frac{\sum_{n=1}^{N} n\,(m_{i+n} - m_{i-n})}{2\sum_{n=1}^{N} n^{2}} \quad (2)$$

$$m_i^{dd} = \frac{\sum_{n=1}^{N} n\,(m_{i+n}^{d} - m_{i-n}^{d})}{2\sum_{n=1}^{N} n^{2}} \quad (3)$$

Computing the log-Mels with deltas and delta-deltas generates a 3-D feature representation for the CNN input, X ∈ R^(t×f×c), where t is the time (frame) length, f is the number of Mel-filter banks, and c is the number of feature channels. As in speech recognition [17], we set f to 40 and c to 3 in this task, corresponding to the static, delta, and delta-delta features, respectively.
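The paper reports that TensorFlow and Keras are used for the implementation but does not include the feature-extraction code. The following is a minimal sketch of the log-Mel, delta, and delta-delta computation described above; the use of the librosa library, the helper name extract_log_mel_3d, and the wav_path argument are assumptions introduced here for illustration, while the 16 kHz sampling rate, 25 ms Hamming window, 10 ms shift, 40 Mel-filter banks, and three channels follow the text.

```python
import numpy as np
import librosa


def extract_log_mel_3d(wav_path, sr=16000, n_mels=40):
    """Build the t x 40 x 3 ACRNN input described above: static
    log-Mels plus deltas and delta-deltas. `wav_path` is a
    hypothetical file path used only for illustration."""
    # Load the audio at 16 kHz and normalize to zero mean, unit variance
    y, sr = librosa.load(wav_path, sr=sr)
    y = (y - np.mean(y)) / (np.std(y) + 1e-8)

    # 25 ms Hamming windows with a 10 ms shift
    n_fft = int(0.025 * sr)       # 400 samples
    hop_length = int(0.010 * sr)  # 160 samples

    # DFT power spectrum -> Mel-filter banks -> logarithm (Eq. 1)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=n_mels, window="hamming", power=2.0)
    log_mel = np.log(mel + 1e-6)

    # Deltas and delta-deltas (Eqs. 2 and 3); N = 2 gives a 5-frame window
    delta = librosa.feature.delta(log_mel, width=5, order=1)
    delta2 = librosa.feature.delta(log_mel, width=5, order=2)

    # Stack into a (time, n_mels, 3) tensor for the 2-D CNN input X
    return np.stack([log_mel.T, delta.T, delta2.T], axis=-1)


# Example (hypothetical file): features = extract_log_mel_3d("utterance.wav")
```

In line with the experimental setup described later, each such feature tensor would then be cut or zero-padded to a fixed 3 s length before being fed to the network.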
2.3. ACRNN architecture

In this part, we integrate the CRNN with an attention model, taking the 2-D log-Mels as input. A 2-D CNN is used to perform convolution over patches that contain only a few frames of the entire log-Mels. The sequential 2-D CNN features are then fed into a long short-term memory (LSTM) network for temporal summarization. The resulting sequence of high-level features is passed to the attention layer, which outputs utterance-level features. Finally, the utterance-level features are used as the input of a fully connected layer to obtain higher-level features for SER.

1) CRNN model: High-level features for SER are extracted by the CRNN from the given 2-D log-Mels. The CRNN used here consists of several 2-D convolution layers, one 2-D max-pooling layer, one linear layer, and one LSTM layer. Each convolutional layer has a 5 x 2 filter size, with the first convolutional layer having 128 feature maps and the subsequent convolutional layers having 256 feature maps. Only one max-pooling layer, with a pooling size of 2 x 2, is used after the first convolutional layer. The number of model parameters can be effectively reduced without compromising accuracy by adding a linear layer before feeding the 2-D CNN features into the LSTM layer; we find that a linear layer with 768 output units is appropriate as a dimension-reduction layer after the 2-D CNN. We then feed the 2-D CNN sequence features through a bidirectional RNN with 128 cells in each direction for temporal summarization. As a result, a sequence of 256-dimensional high-level feature representations is obtained.

2) Attention layer: Because not all frame-level CRNN features contribute equally to the representation of speech emotion, an attention layer is employed to focus on emotion-relevant sections and produce discriminative utterance-level representations for SER. Instead of simple mean/max pooling across time, the significance of the high-level representations to the utterance-level emotion representation is weighted using an attention model. In particular, we first determine the normalized weight a_t from the LSTM output h_t at time step t using a softmax function, as shown in (4). Then we compute a weighted sum over h_t using these weights to obtain the utterance-level representation c, as shown in (5). Finally, the utterance-level representation is fed through a fully connected layer with 64 output units to obtain a higher-level representation that helps the softmax classifier map the utterance representations into N different classes, where N is the number of emotion classes. Batch normalization is applied to the fully connected layer (Gayathri et al., 2020) to expedite training and enhance generalization performance.

$$a_t = \frac{\exp(W \cdot h_t)}{\sum_{\tau=1}^{T} \exp(W \cdot h_\tau)} \quad (4)$$

$$c = \sum_{t=1}^{T} a_t h_t \quad (5)$$
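The paper does not provide code for the network itself. The sketch below shows one way the stack described above could be assembled in Keras: 5 x 2 convolutions with 128 and then 256 feature maps, a single 2 x 2 max-pooling layer, a 768-unit linear (dimension-reduction) layer, a bidirectional LSTM with 128 cells per direction, the attention pooling of Eqs. (4)-(5), and a 64-unit fully connected layer with batch normalization before the softmax. The exact number of convolutional layers, the input frame length, and the reshaping strategy are assumptions, since the text only says the CRNN contains "several" convolution layers.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


class FrameAttention(layers.Layer):
    """Attention pooling following Eqs. (4)-(5):
    a_t = softmax(W . h_t), c = sum_t a_t * h_t."""
    def build(self, input_shape):
        # One weight vector W scoring each 256-dim frame feature
        self.w = self.add_weight(shape=(input_shape[-1], 1),
                                 initializer="glorot_uniform", name="w")

    def call(self, h):                               # h: (batch, T, 256)
        scores = tf.tensordot(h, self.w, axes=1)     # W . h_t -> (batch, T, 1)
        a = tf.nn.softmax(scores, axis=1)            # normalized weights a_t, Eq. (4)
        return tf.reduce_sum(a * h, axis=1)          # utterance vector c, Eq. (5)


def build_acrnn(n_frames=300, n_mels=40, n_channels=3, n_classes=4,
                n_conv_layers=4):                    # "several" conv layers: assumed 4
    inp = layers.Input(shape=(n_frames, n_mels, n_channels))
    x = layers.Conv2D(128, (5, 2), padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D((2, 2))(x)               # single 2 x 2 max-pooling layer
    for _ in range(n_conv_layers - 1):
        x = layers.Conv2D(256, (5, 2), padding="same", activation="relu")(x)
    # Flatten the frequency and channel axes, keeping time as the sequence axis
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.Dense(768)(x)                         # linear dimension-reduction layer
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    c = FrameAttention()(x)                          # utterance-level representation
    c = layers.Dense(64)(c)                          # fully connected layer, 64 units
    c = layers.BatchNormalization()(c)
    c = layers.Activation("relu")(c)
    out = layers.Dense(n_classes, activation="softmax")(c)
    return models.Model(inp, out)


model = build_acrnn()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

The model would be trained on the stacked log-Mel features from the previous sketch, with one-hot labels for the four emotion classes.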
We conduct SER experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database to assess the performance of the proposed model. IEMOCAP contains five sessions; the utterances last 4.5 seconds on average and are sampled at 16 kHz. Each session is performed by two speakers (one male and one female) in both scripted and improvised scenes. Only four emotions are considered here: angry, sad, happy, and neutral. Ten-fold cross-validation is used for evaluation: of the ten speakers, eight are chosen for training the model, one for testing, and the remaining one for validation. We perform each evaluation multiple times with different random seeds in order to obtain more reliable findings. Each signal is divided into three equal-length segments for improved parallel acceleration, and speech utterances lasting less than 3 s are zero-padded. The global mean and standard deviation of the training set are used to normalize the log-Mels of the training and testing data, with a window size of 25 ms and a shift of 10 ms. The TensorFlow and Keras libraries are used for the implementation (see Figs. 2-4).

Fig. 2. Workflow for Azure Machine Learning

Fig. 3. Workflow for Azure Machine Learning

Fig. 4. Classification report of 2-D attention-based CRNN

Fig. 1 represents the classification report of the 1D CNN LSTM, which has an accuracy of 56%, a precision of 59%, and a recall of 56%. Fig. 2 represents the classification report of the Temporal 2-D CNN, which has an accuracy of 58%, a precision of 59%, and a recall of 58%. Fig. 3 represents the classification report of the 2-D ACRNN, which has an accuracy of 68%, a precision of 67%, and a recall of 68%. Thus, our ACRNN model's performance is superior when compared with the other models (see Fig. 5).

Fig. 5. Classification report of 2-D attention-based CRNN

Fig. 4 displays the confusion matrix of the ACRNN model. There are four emotions: 0 represents angry, 1 represents sad, 2 represents happy, and 3 represents neutral. The diagonal values represent the correctly predicted samples. The accuracy of our proposed 2-D attention-based CRNN is 68%, which is higher than the accuracy of the 1D CNN LSTM and the T-2D CNN. The weighted precision of our model is 0.67, the weighted recall is 0.68, and the weighted F1 score is 0.67. All of these values are higher than the corresponding values for the 1D CNN LSTM and the T-2D CNN. Thus, our model outperforms similar SER models with greater values for all metrics. The 3-D attention-based CRNN implemented by Chen et al. (2018) has an average recall of 64.74%; our 2-D attention-based CRNN outperforms it with a recall of 68% (see Fig. 6).

Fig. 6. Comparison of models and their evaluation metrics

Fig. 5 shows the plot of the models against their evaluation metrics. Our model comes out best on all metrics when compared with the other two models.

REFERENCES

Chen, M., He, X., Yang, J., & Zhang, H. (2018). 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters, 25(10), 1440-1444.

Gayathri, P., Priya, P. G., Sravani, L., Johnson, S., & Sampath, V. (2020). Convolutional recurrent neural networks based speech emotion recognition. Journal of Computational and Theoretical Nanoscience, 17(8), 3786-3789.

Huang, C. W., & Narayanan, S. S. (2016, September). Attention assisted discovery of sub-utterance structure in speech emotion recognition. In Interspeech (pp. 1387-1391).

Huang, C., Gong, W., Fu, W., & Feng, D. (2014). A research of speech emotion recognition based on deep belief network and SVM. Mathematical Problems in Engineering, 2014.

Jiang, P., Xu, X., Tao, H., Zhao, L., & Zou, C. (2021). Convolutional-recurrent neural networks with multiple attention mechanisms for speech emotion recognition. IEEE Transactions on Cognitive and Developmental Systems.

Khalil, R. A., Jones, E., Babar, M. I., Jan, T., Zafar, M. H., & Alhussain, T. (2019). Speech emotion recognition using deep learning techniques: A review. IEEE Access, 7, 117327-117345.

Lalitha, S., Mudupu, A., Nandyala, B. V., & Munagala, R. (2015, December). Speech emotion recognition using DWT. In 2015 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC) (pp. 1-4). IEEE.

Lee, J., & Tashev, I. (2015, September). High-level feature representation using recurrent neural network for speech emotion recognition. In Interspeech 2015.
Lim, W., Jang, D., & Lee, T. (2016, December). Speech emotion recognition using convolutional and recurrent neural networks. In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) (pp. 1-4). IEEE.

Maji, B., & Swain, M. (2022). Advanced fusion-based speech emotion recognition system using a dual-attention mechanism with Conv-Caps and Bi-GRU features. Electronics, 11(9), 1328.

Mao, Q., Dong, M., Huang, Z., & Zhan, Y. (2014). Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, 16(8), 2203-2213.

Nwe, T. L., Foo, S. W., & De Silva, L. C. (2003, December). Detection of stress and emotion in speech using traditional and FFT based log energy features. In Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint (Vol. 3, pp. 1619-1623). IEEE.

Trigeorgis, G., Ringeval, F., Brueckner, R., Marchi, E., Nicolaou, M. A., Schuller, B., & Zafeiriou, S. (2016, March). Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5200-5204). IEEE.

Tripathi, S., Tripathi, S., & Beigi, H. (2018). Multi-modal emotion recognition on IEMOCAP dataset using deep learning. arXiv preprint arXiv:1804.05788.

Tzirakis, P., Zhang, J., & Schuller, B. W. (2018, April). End-to-end speech emotion recognition using deep neural networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5089-5093). IEEE.

Yadav, O. P., Bastola, L. P., & Sharma, J. (2021). Speech emotion recognition using convolutional recurrent neural network.