International Journal of Informatics, Information System and Computer Engineering 1 (2020) 91-102
Journal homepage: http://ejournal.upi.edu/index.php/ijost/

Dimensional Speech Emotion Recognition from Acoustic and Text Features using Recurrent Neural Networks

Bagus Tris Atmaja 1,2, Reda Elbarougy 3, Masato Akagi 2
1 Sepuluh Nopember Institute of Technology, Surabaya, Indonesia
2 Japan Advanced Institute of Science and Technology, Nomi, Japan
3 Damietta University, New Damietta, Egypt
E-mail: bagus@ep.its.ac.id

ABSTRACT

Emotion can be inferred from tonal and verbal information, and both kinds of features can be extracted from speech. While most researchers have studied categorical emotion recognition from a single modality, this research presents dimensional emotion recognition combining acoustic and text features. Thirty-one acoustic features are extracted from speech, while word vectors are used as text features. The initial results on single-modality emotion recognition serve as a cue for combining both features to improve the recognition result. The latter result shows that a combination of acoustic and text features decreases the error of dimensional emotion score prediction by about 5% compared to the acoustic system and 1% compared to the text system. The smallest error is achieved by modeling the text with Long Short-Term Memory (LSTM) networks and the acoustics with bidirectional LSTM networks, and concatenating both systems with dense networks.

ARTICLE INFO

Article History:
Received 17 Nov 2020
Revised 20 Nov 2020
Accepted 25 Nov 2020
Available online 26 Dec 2020

Keywords: Speech Emotion, Neural Network, LSTM.

1. INTRODUCTION

The demand for recognizing emotion in speech has grown, as human emotion can be expressed via speech, and many applications, such as call centers, telephone communication, and voice messaging, can benefit from speech emotion recognition. The study of speech emotion recognition was established decades ago using unsupervised learning and small amounts of data. Advancements in computation hardware and the development of larger speech corpora have enabled us to analyze emotion in speech on a larger scale.

Detecting emotion is useful for investigating whether a student is confused, engaged, or certain when interacting with a tutoring system, or whether a caller to a help line is frustrated (Jurafsky & Martin, 2014). By knowing the emotion of the student and the caller in these cases, proper action can be taken to avoid a worse outcome. In both cases, the degree of emotion (as a numeric score) is more relevant than the category of emotion (joy or sadness, for example). These are two examples in which dimensional emotion is more informative than categorical emotion.

Although research on emotion recognition has progressed steadily, most research has focused on the recognition of categorical emotion, as in (Griol et al., 2019; Chen et al., 2018; Atmaja et al., 2019). As shown by the two examples above, a dimensional approach to emotion recognition is more informative in such cases. Recognizing the degree of emotion is a more challenging task, as it requires predicting a numerical score rather than a category.
This type of task is a regression problem rather than a classification one. Research investigating dimensional emotion recognition in text was reported by Calvo and Mac Kim (2013). They found that, using the same classifier, i.e., non-negative matrix factorization (NMF), both categorical and dimensional emotion recognition obtain similar results. They used emotional terms from an affective dictionary as text features for the dimensional task. In speech emotion recognition, a study of dimensional emotion recognition was reported by Giannakopoulos et al. (2009) using a small dataset from videos, a ten-dimensional acoustic feature set, and k-Nearest Neighbor (kNN) to estimate the degree of emotion. The results indicate that the proposed architecture can estimate the emotional state of speech from movies with sufficient accuracy (valence: 24%, arousal: 35%, in terms of the R² statistic). Both of the dimensional text and speech emotion recognition studies above used non-deep-neural-network (non-DNN) methods because of computation time and data size.

Another challenge in speech emotion recognition, besides the dimensional approach, is the strategy for extracting features. The features are the input of an emotion recognition system, and the performance of the system depends on them. An issue to consider when extracting features for speech emotion recognition is the necessity of combining speech (acoustic features) with other types of features (El Ayadi et al., 2011). We choose text features because they can be extracted from speech via automatic speech recognition (ASR). The combination of these acoustic and text features is expected to improve recognition performance compared to the use of a single modality, i.e., acoustic or text features only.

This paper presents dimensional speech emotion recognition from a multimodal dataset. The purposes of this work are (1) to examine whether the fusion of two related features can decrease the error of dimensional emotion recognition and (2) to find the best DNN architecture from a list of DNN layer combinations. A deep learning-based classifier from the family of recurrent neural networks has been built for this purpose. Two types of features are used: acoustic and text features. For each feature, a set of layers is stacked. The two networks for acoustic and text features are then concatenated in a late-fusion architecture. The results show that the proposed method improves performance compared to methods that use acoustic or text features only. The evaluation is presented in terms of mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). To extend this work, a discussion evaluating the metrics used in this research is given at the end of the paper.

2. DATASET

The IEMOCAP (interactive emotional dyadic motion capture) database, developed by the University of Southern California, is used in this research (Busso et al., 2008). A total of 10,039 turns (utterances) were recorded, including speech, video, text transcription, and motion capture (face, head, and hand movement). From those modalities, the speech signal and the text transcription are used. Dimensional labels are given for valence, arousal, and dominance (VAD) on a scale of 1 to 5 via self-assessment manikins (SAMs). All utterances in the dataset are used in this research. From these data, 80% are used for training and 20% for testing, and twenty percent of the training data is held out for validation.
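For concreteness, the following is a minimal sketch of this 80/20/20 split. The feature shapes follow Section 3.1; the arrays are placeholders only, and a random utterance-level split is assumed here since the partitioning strategy is not specified in the paper.

```python
# Minimal sketch of the 80/20/20 split described above.
# The array shapes follow Section 3.1; a random split is assumed here,
# since the paper does not specify how utterances are partitioned.
import numpy as np
from sklearn.model_selection import train_test_split

n_utterances = 10039
x_acoustic = np.zeros((n_utterances, 100, 31))      # placeholder acoustic features
x_text = np.zeros((n_utterances, 554), dtype=int)   # placeholder token sequences
y = np.zeros((n_utterances, 3))                     # valence, arousal, dominance

# 80% train, 20% test
(xa_train, xa_test, xt_train, xt_test,
 y_train, y_test) = train_test_split(x_acoustic, x_text, y,
                                     test_size=0.2, random_state=42)

# 20% of the training data for validation
(xa_train, xa_val, xt_train, xt_val,
 y_train, y_val) = train_test_split(xa_train, xt_train, y_train,
                                    test_size=0.2, random_state=42)
```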
Fig. 1. Proposed dimensional speech emotion recognition from acoustic and text features. The dashed line between the label and the dataset indicates that the label is obtained directly from the dataset.

3. PROPOSED METHOD

The proposed method can be split into two parts: feature extraction and a dimensional emotion classifier. A block diagram of the proposed system is shown in Fig. 1. From the dataset, two kinds of features are extracted: acoustic and text features. The extracted features are then fed into a classifier, where the regression is performed by combining the two features with a late-fusion method. Finally, the classifier produces the predicted emotion dimensions, which are compared to the true labels. The difference between the true label and the predicted emotion dimension is the error, which is measured in three different ways.

3.1 Feature Extraction

Two sets of features, acoustic and text, are used to recognize emotion from speech. The following is a description of those two feature sets.

3.1.1 Acoustic Feature Extraction

A total of 31 acoustic features are used in this research:

• three time-domain features: zero-crossing rate (ZCR), energy, and entropy of energy;
• five spectral-domain features: spectral centroid, spectral spread, spectral entropy, spectral flux, and spectral roll-off;
• 13 MFCC coefficients;
• five fundamental frequency values (for each window);
• five formant values (for each window).

We limit the number of windows per utterance to 100, with a 20 ms window length and 10 ms overlap. The resulting size of the acoustic features is therefore (100, 31) for a single utterance, and (10039, 100, 31) for all utterances in the dataset.

3.1.2 Text Feature Extraction

Text features can be obtained in many ways. One simple yet powerful method is word embedding (Pennington et al., 2014). A word embedding is a vector representation of a word: each word is mapped to numerical values so that a computer, which only processes numbers, can handle text data. In the original space, every word is represented by a one-hot vector of vocabulary size, with a value of 1 for the corresponding word and 0 elsewhere. The word representation embeds these points in a feature space of lower dimension (Goodfellow et al., 2016). The element with the value of 1 is converted into an index in the range of the vocabulary size.

To obtain a vector for each word in an utterance, the utterance must first be tokenized. Tokenization is the process of dividing an utterance into its constituent words. The following is an example of a single utterance from the IEMOCAP dataset, its tokenization, and the resulting text vector:

text = "Excuse me."
tokenized_text = ["Excuse", "me"]
text_vector = [832, 18]

To obtain a fixed-length vector for each utterance, a sequence of zeros can be padded before or after the obtained vector. The length of this zero sequence is the length of the longest sequence in the dataset (i.e., the utterance with the most words) minus the length of the current utterance's vector. For the IEMOCAP dataset, the longest sequence is 554 tokens.

Studies on vectorizing words have been carried out by several researchers (Mikolov et al., 2013; Pennington et al., 2014; Mikolov et al., 2017). These pre-trained word vectors can be used to weight the word indices obtained previously. The dimension of each word in the pre-trained word vectors is 300 (in the example above it is one), giving a text feature of size (554, 300) for each utterance, or (10039, 554, 300) for all utterances in the IEMOCAP dataset.
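As an illustration of the steps above, the following is a minimal sketch of tokenization, padding to 554 tokens, and building a 300-dimensional embedding matrix from pre-trained word vectors. It is not the authors' code; the Keras preprocessing utilities and the GloVe file name are assumptions made for this sketch.

```python
# Illustrative sketch of the tokenization, padding, and embedding-weighting steps.
# The GloVe file and variable names are assumptions for illustration only.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

transcripts = ["Excuse me.", "The craziest thing just happened to me."]  # utterance texts

tokenizer = Tokenizer()
tokenizer.fit_on_texts(transcripts)
sequences = tokenizer.texts_to_sequences(transcripts)   # word -> integer index
x_text = pad_sequences(sequences, maxlen=554)           # zero-pad to 554 tokens

# Build a (vocab_size, 300) matrix of pre-trained vectors, e.g., GloVe;
# words not found in the pre-trained file keep zero vectors.
embedding_dim = 300
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))
with open("glove.6B.300d.txt", encoding="utf-8") as f:   # assumed embedding file
    for line in f:
        values = line.rstrip().split(" ")
        word, vector = values[0], np.asarray(values[1:], dtype="float32")
        if word in tokenizer.word_index:
            embedding_matrix[tokenizer.word_index[word]] = vector
```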
4. DIMENSIONAL EMOTION CLASSIFIER

The recurrent neural network (RNN) is a variant of neural networks designed to handle sequential information. These networks introduce state variables to store past information and determine the current output from the current input together with that state. Let X be the input, W the layer weights, and b the bias. The output of a hidden layer is then

H = \phi(X W_{xh} + b_h),    (1)

where \phi is a non-linear activation function. With a recurrent hidden state H_t, whose activation at each time step depends on that of the previous step H_{t-1}, the output of the current hidden layer becomes

H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h).    (2)

The problem with this plain RNN is that it always weighs the whole past in the same way. A situation may be encountered in which an early observation is more (or less) significant for predicting the future. To tackle this situation and add further enhancements, several methods have been proposed (Cho et al., 2014; Hochreiter & Schmidhuber, 1997). In this paper, these two RNN variants are implemented as dimensional emotion classifiers.

4.1 Gated Recurrent Unit

The gated recurrent unit (GRU) introduces gating of the hidden state in an RNN: a learned mechanism that decides when the hidden state should be updated and when it should be reset. It addresses some limitations of the plain RNN, e.g., the case where an early observation is highly significant for predicting all future observations. If the first observation is of great importance, the network learns not to update the hidden state after that observation. Likewise, it learns to skip irrelevant temporary observations and to reset the latent state whenever needed (Zhang et al., 2019). The reset gate R_t and the update gate Z_t are the additional units in the GRU. Together with the candidate hidden state \tilde{H}_t, the GRU state is updated in the following order:

R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r),    (3)
Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z),    (4)
\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h),    (5)
H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t,    (6)

where \odot denotes element-wise multiplication and H_t is now the final GRU output rather than the plain RNN hidden-layer output of Eq. (2).

4.2 Long Short-Term Memory

While the GRU uses two additional gates (reset and update), the long short-term memory (LSTM) network uses three gates to control the flow of information from the current input X_t and the previous state H_{t-1}: the input, forget, and output gates. These three gates are defined as

I_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i),    (7)
F_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f),    (8)
O_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o),    (9)

where W_x \in \mathbb{R}^{d \times h} and W_h \in \mathbb{R}^{h \times h} are the weight parameters and b \in \mathbb{R}^{1 \times h} are the biases. The complete sequence of updates of the hidden state is

\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c),    (10)
C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t,    (11)
H_t = O_t \odot \tanh(C_t),    (12)

where \tilde{C}_t and C_t are the candidate memory cell and the memory cell, respectively.

GRU and LSTM are very similar in both implementation and results. The GRU is faster because it has fewer gates, while the LSTM is, in many cases, slightly better than the GRU because its additional gating gives finer control over the flow of information.
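To make the update order of Eqs. (3)–(6) concrete, the following is a small NumPy sketch of a single GRU step. The weights are random placeholders and the function is for illustration only; the experiments in this paper use standard Keras GRU/LSTM layers, not this hand-written cell.

```python
# A single GRU forward step, following Eqs. (3)-(6).
# Weights are random placeholders; the experiments rely on Keras GRU/LSTM layers.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    W_xr, W_hr, b_r, W_xz, W_hz, b_z, W_xh, W_hh, b_h = params
    r_t = sigmoid(x_t @ W_xr + h_prev @ W_hr + b_r)             # reset gate, Eq. (3)
    z_t = sigmoid(x_t @ W_xz + h_prev @ W_hz + b_z)             # update gate, Eq. (4)
    h_cand = np.tanh(x_t @ W_xh + (r_t * h_prev) @ W_hh + b_h)  # candidate state, Eq. (5)
    return z_t * h_prev + (1.0 - z_t) * h_cand                  # new hidden state, Eq. (6)

d, h = 31, 40                       # e.g., 31 acoustic features, 40 hidden units
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=s) for s in
          [(d, h), (h, h), (h,), (d, h), (h, h), (h,), (d, h), (h, h), (h,)]]
h_t = gru_step(rng.normal(size=(1, d)), np.zeros((1, h)), params)
```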
4.3 Model Architecture

The regression model for dimensional emotion recognition is built by stacking RNN layers over the acoustic and text inputs and merging both branches to obtain the final dimensional emotion prediction. For each modality, acoustic and text, we varied two dense, two GRU, and two LSTM layers. For the RNNs (GRU and LSTM), we also implemented bidirectional versions to allow information to flow from both past and future time steps (unidirectional GRU and LSTM only carry information from the current and past time steps; see Eqs. 3–6 and 10–12). The dense, bidirectional GRU (BGRU), or bidirectional LSTM (BLSTM) layers from each modality are then joined using two dense layers. Fig. 2 shows one of the architectures for combining acoustic and text features to obtain the three emotion dimensions.

To minimize the risk of overfitting, dropout is used with a rate of 0.4 in each acoustic and text branch and 0.3 in the final dense network. Rectified linear unit (ReLU) activation is used for both dense layers in the combined network. The final dense layer, with three nodes, uses a linear activation function to obtain the scores of valence, arousal, and dominance. The whole network is trained with the RMSprop optimizer (Dauphin et al., 2015) with mean squared error (MSE) as the loss function. Besides MSE, we use mean absolute error (MAE) and mean absolute percentage error (MAPE) as evaluation metrics. The implementation of this deep learning architecture is available in a public repository for research reproducibility: https://github.com/bagustris/dimensional_ser_rnn.

Fig. 2. Architecture of the deep learning system combining acoustic and text features. The number in brackets shows the number of units/nodes in each layer.

5. RESULTS

5.1 Comparison of Acoustic, Text, and Combined Systems

To begin our discussion, we present the results for each modality separately. Table 1 shows the performance of dimensional speech emotion recognition from the acoustic features. Two layers of the same model are stacked and followed by a final dense layer. For each model, the value of each metric is the average of five runs (to minimize the effect of randomness in the computation). The dense network is chosen as the baseline model.

For this speech emotion recognition, the LSTM model shows a modest improvement over the dense baseline in terms of MSE and MAE. However, the remaining metric, MAPE, shows a different result, in which GRU/BGRU obtain better scores. As MSE is used as the loss function, the MSE result is the most relevant in this context, and the MAE metric is consistent with MSE. The MAPE metric can be used for comparison with other datasets, as it has the same scale from 0 to 100.

Table 1. Performance comparison of dimensional speech emotion recognition from acoustic features (in terms of MSE, MAE, and MAPE) among different models.

Model*   MSE     MAE     MAPE (%)
Dense    0.652   0.660   24.188
GRU      0.648   0.655   23.690
LSTM     0.636   0.651   24.014
BGRU     0.647   0.653   23.675
BLSTM    0.656   0.660   24.109

*Each model is a stack of two layers of the same type.
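As an illustration of the setup behind Table 1, the following is a minimal Keras sketch of one unimodal acoustic model: two stacked layers of the same type followed by a final dense layer. The number of recurrent units is an assumption for illustration, as Table 1 does not list layer sizes.

```python
# Sketch of one unimodal entry in Table 1: two stacked LSTM layers over the
# (100, 31) acoustic input, followed by a dense output for the three dimensions.
# The number of recurrent units here is an illustrative assumption.
from tensorflow.keras import layers, models

acoustic_model = models.Sequential([
    layers.Input(shape=(100, 31)),
    layers.LSTM(40, return_sequences=True),   # first LSTM layer
    layers.LSTM(40),                          # second LSTM layer of the same type
    layers.Dense(3, activation="linear"),     # valence, arousal, dominance
])
acoustic_model.compile(optimizer="rmsprop", loss="mse", metrics=["mae", "mape"])
```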
For text-based emotion recognition, the results are shown in Table 2. As reported by other researchers (Poria et al., 2017; Tripathi et al., 2018), we obtained better performance on emotion recognition for the IEMOCAP dataset by utilizing text features. In this text-based emotion recognition, the LSTM model shows a modest improvement over the baseline and the other models. These single-modality results can be used as a cue when combining acoustic and text features in the fusion of the two networks. For this text emotion recognition, all metrics are nearly consistent with one another (except for GRU and BGRU, whose scores are almost identical). MAE and MAPE are consistent in the ordering of the models.

Table 2. Performance comparison of dimensional speech emotion recognition from text features (in terms of MSE, MAE, and MAPE) among different models.

Model*   MSE     MAE     MAPE (%)
Dense    0.493   0.559   20.436
GRU      0.480   0.549   19.888
LSTM     0.465   0.538   19.554
BGRU     0.482   0.548   19.881
BLSTM    0.487   0.550   19.588

*Each model is a stack of two layers of the same type.

Finally, we present the results of the fusion of acoustic and text features in Table 3. Clearly, the error decreases for all of the MSE, MAE, and MAPE metrics. For example, using the same dense layers, the error (MAPE) decreases from 24.188% (acoustic) and 20.436% (text) to 19.970% for the combined acoustic and text system. To further reduce the error, not only the architecture of each modality network but also the strategy for combining the modalities is important (Poria et al., 2017). Although we tried several different layers after the concatenation of the two networks (acoustic and text), we focused on selecting the combination for each modality while keeping dense layers after the concatenation. This focus is based on our experiments: simple dense layers after concatenation perform better than more sophisticated layers (GRU, LSTM, and attention models).

Table 3. Performance comparison of dimensional speech emotion recognition from the combination of acoustic and text features (in terms of MSE, MAE, and MAPE) among different models.

Model (acoustic + text)   MSE     MAE     MAPE (%)
Dense + Dense             0.457   0.546   19.970
Dense + BGRU              0.440   0.533   19.585
BGRU + Dense              0.454   0.543   19.810
LSTM + LSTM               0.428   0.525   18.929
BLSTM + BLSTM             0.438   0.531   19.423
BLSTM + LSTM              0.419   0.517   18.713
BGRU + GRU                0.429   0.527   19.139

5.2 Design of System Architecture and Its Result

In designing the system architecture for dimensional speech emotion recognition, we relied on the initial experiments with unimodal features and on the results of other researchers (Atmaja et al., 2019; Tripathi et al., 2018). Tables 1 and 2 show that the text features give better results for dimensional emotion recognition than the acoustic features, and that for both features LSTM performs better than the other models. Using this result, we built LSTM-based networks for the two modalities and combined them with dense layers.

In choosing the hyperparameters, we manually added more units to the text network, as this gives a better result. The choice of 50 and 40 units for the LSTM layers of the two modalities was also obtained from experimentation: we started with larger numbers of units and decreased them as long as the performance (error metrics) did not degrade. For the dense layers, the choice of 30 units per layer is also based on experiments. The ReLU and tanh activation functions in those layers performed similarly.
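Putting Sections 4.3 and 5.2 together, the following is a sketch of the best-performing combination: a BLSTM acoustic branch and an LSTM text branch joined by two dense layers. Assigning 50 units to the text LSTMs and 40 to the acoustic BLSTMs, the dropout placement, and the frozen embedding are assumptions of this sketch; the authors' repository contains the exact implementation.

```python
# Sketch of the combined architecture: BLSTM over acoustic features and LSTM over
# word vectors, concatenated and followed by two dense layers (Sections 4.3, 5.2).
# Unit assignment per branch, dropout placement, and the frozen embedding are
# assumptions; see https://github.com/bagustris/dimensional_ser_rnn for the original.
import numpy as np
from tensorflow.keras import layers, models, optimizers

vocab_size, embedding_dim = 4000, 300                     # placeholder vocabulary size
embedding_matrix = np.zeros((vocab_size, embedding_dim))  # e.g., pre-trained GloVe vectors

# Acoustic branch: (100 frames, 31 features) -> two BLSTM layers
acoustic_in = layers.Input(shape=(100, 31), name="acoustic")
a = layers.Bidirectional(layers.LSTM(40, return_sequences=True))(acoustic_in)
a = layers.Bidirectional(layers.LSTM(40))(a)
a = layers.Dropout(0.4)(a)

# Text branch: 554 token indices -> 300-d embedding -> two LSTM layers
text_in = layers.Input(shape=(554,), name="text")
t = layers.Embedding(vocab_size, embedding_dim,
                     weights=[embedding_matrix], trainable=False)(text_in)
t = layers.LSTM(50, return_sequences=True)(t)
t = layers.LSTM(50)(t)
t = layers.Dropout(0.4)(t)

# Late fusion: concatenation, two dense layers with ReLU, linear VAD output
x = layers.concatenate([a, t])
x = layers.Dense(30, activation="relu")(x)
x = layers.Dense(30, activation="relu")(x)
x = layers.Dropout(0.3)(x)
vad_out = layers.Dense(3, activation="linear", name="vad")(x)

model = models.Model(inputs=[acoustic_in, text_in], outputs=vad_out)
model.compile(optimizer=optimizers.RMSprop(), loss="mse", metrics=["mae", "mape"])
```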
To avoid overfitting, besides placing dropout layers in each network branch (acoustic, text, and the combination layers), we use a callback strategy. Two callback methods are used to stop the training iterations: early stopping and model checkpointing. For early stopping, we use a patience of 10 epochs, monitoring the validation loss: if the validation loss (MSE) does not decrease for ten epochs, the training process stops and the best weights are used for evaluation/prediction. Model checkpointing is a similar method that additionally saves the model (it can be omitted if the model does not need to be saved).

Finally, although we obtain the best prediction of the emotion dimensions with the BLSTM and LSTM networks, there is room for improvement in experimenting with and designing a better model architecture. In some runs, a GRU-based combination performs better; however, on average, the LSTM-based combination is the best one. In future research, hyperparameter optimization will be performed on a training and development set rather than by manual hand-crafting.

Regarding the obtained improvement, the decrement of MAPE from acoustic-feature-based emotion recognition reaches up to 5.5% when using the combined features; for MSE and MAE, the decrements are in the ranges of 0.14–0.17 and 0.09–0.11, respectively. From the text features, the decrement of error ranges from 0.046 to 0.08 for MSE, up to 0.02 for MAE, and up to 0.84% for MAPE. An excerpt of the VAD scores predicted by the model using BLSTM and LSTM networks is presented in Table 4.

Table 4. Sample of true and predicted VAD scores from the model using BLSTM and LSTM networks.

Utterance                                                             True VAD          Predicted VAD
Oh, totally. Yeah.                                                    [4, 3, 2.5]       [3.21, 2.67, 2.60]
The craziest thing just happened to me.                               [4, 3, 2.5]       [3.35, 3.02, 2.97]
This girl; she just offered me fifty thousand dollars to marry her.   [3.5, 3.5, 3]     [3.32, 3.30, 3.34]

5.3 Evaluation of Loss Function and Metrics

One of the challenging problems in dimensional emotion recognition is choosing the proper metrics for evaluation. In this paper, we use standard regression metrics, i.e., MSE, MAE, and MAPE. However, when running several experiments under the same conditions (system architecture), one metric may decrease while another increases. Table 5 shows the raw results behind Table 1 for the dense acoustic network. As shown in that table, the consistency of the metrics changes when re-running the experiment: in the second experiment, the MSE score decreases while the MAPE score increases; in the last experiment, the MSE score increases while MAE and MAPE decrease. To evaluate the metrics, we perform a simple analysis by changing the loss function from MSE (the default) to MAE and MAPE. Table 6 shows that changing the loss function from MSE to MAE decreases the error slightly.

Table 5. Results of five experiments on the same model (dense layers) from speech features.

Experiment   MSE     MAE     MAPE (%)
1            0.659   0.663   24.12
2            0.644   0.660   24.41
3            0.645   0.651   23.44
4            0.652   0.663   24.94
5            0.656   0.662   24.01
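For reference, the following is a small NumPy sketch of the three metrics compared above, together with the concordance correlation coefficient (CCC) suggested below as a scale-independent alternative. It is illustrative only and not the evaluation code used in the experiments.

```python
# Regression metrics used in this paper (MSE, MAE, MAPE), plus the concordance
# correlation coefficient (CCC) discussed below as a scale-independent alternative.
# Illustrative only; not the authors' evaluation code.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def ccc(y_true, y_pred):
    # Lin's concordance correlation coefficient, bounded in [-1, 1]
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

# Example with the first row of Table 4
y_true = np.array([4.0, 3.0, 2.5])
y_pred = np.array([3.21, 2.67, 2.60])
print(mse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```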
If we compare this result (our best MAE) with other research that also used acoustic and linguistic information but a different approach (Karadoğan & Larsen, 2012), our MAE is better (their best MAE is 1.28 for arousal). However, comparing the same metric across datasets is not entirely fair, as the upper bound of MAE differs for each dataset. In this case, MAPE may be more useful than MSE and MAE. Moreover, using another metric such as the concordance correlation coefficient (ρc), as used in (Tzirakis et al., 2017), is more relevant, as it has the same bounded scale for any dataset and directly measures agreement.

Table 6. Results of the BLSTM and LSTM networks from acoustic and text features with different loss functions.

Loss function   MSE     MAE     MAPE (%)
MSE             0.430   0.523   18.870
MAE             0.420   0.519   18.631
MAPE            0.469   0.543   18.835

6. CONCLUSION

We have presented our work on dimensional speech emotion recognition by combining acoustic and text features using recurrent neural networks. Thirty-one acoustic features are used as input to the acoustic network, and sequences of 554 word vectors are fed to the text network. The unimodal results show that text-based emotion recognition performs better on the IEMOCAP dataset than acoustic-based emotion recognition. The combination of acoustic and text features decreases the MAPE by up to 5% compared to acoustic features only and by nearly 1% compared to text features only. Among the evaluated DNN layer combinations, using BLSTM for the acoustic network and LSTM for the text network, with dense layers concatenating the two branches, performs best. The choice of more advanced metrics for the loss function and evaluation in dimensional emotion recognition should be considered in future research, for consistency and for benchmarking against other dimensional emotion recognition studies.

REFERENCES

Atmaja, B. T., Shirai, K., & Akagi, M. (2019, November). Speech emotion recognition using speech feature and word embedding. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 519-523.

Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., ... & Narayanan, S. S. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335.

Calvo, R. A., & Mac Kim, S. (2013). Emotions in text: Dimensional and categorical models. Computational Intelligence, 29(3), 527-543.

Chen, M., He, X., Yang, J., & Zhang, H. (2018). 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters, 25(10), 1440-1444.

Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Dauphin, Y., De Vries, H., & Bengio, Y. (2015). Equilibrated adaptive learning rates for non-convex optimization. In Advances in Neural Information Processing Systems, 1504-1512.

El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572-587.

Giannakopoulos, T., Pikrakis, A., & Theodoridis, S. (2009, April). A dimensional approach to emotion recognition of speech from movies. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 65-68.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Cambridge, MA: MIT Press.
Griol, D., Molina, J. M., & Callejas, Z. (2019). Combining speech-based and linguistic classifiers to recognize emotion in user spoken utterances. Neurocomputing, 326, 132-140.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (3rd ed.).

Karadoğan, S. G., & Larsen, J. (2012, May). Combining semantic and acoustic features for valence and arousal recognition in speech. In 2012 3rd International Workshop on Cognitive Information Processing (CIP), 1-6.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111-3119.

Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405.

Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.

Poria, S., Cambria, E., Hazarika, D., Majumder, N., Zadeh, A., & Morency, L. P. (2017, July). Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 873-883.

Poria, S., Cambria, E., Hazarika, D., Mazumder, N., Zadeh, A., & Morency, L. P. (2017, November). Multi-level multiple attentions for contextual multimodal sentiment analysis. In 2017 IEEE International Conference on Data Mining (ICDM), 1033-1038.

Tripathi, S., Tripathi, S., & Beigi, H. (2018). Multi-modal emotion recognition on IEMOCAP dataset using deep learning. arXiv preprint arXiv:1804.05788.

Tzirakis, P., Trigeorgis, G., Nicolaou, M. A., Schuller, B. W., & Zafeiriou, S. (2017). End-to-end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1301-1309.

Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2019). Dive into Deep Learning. http://www.d2l.ai.