Mathematical Problems of Computer Science 57, 7–17, 2022. doi:10.51408/1963-0082 UDC 004.934

Emotion Classification of Voice Recordings Using Deep Learning

Narek T. Tumanyan

Weizmann Institute of Science, Rehovot, Israel
e-mail: narek.tumanyan@weizmann.ac.il

Abstract

In this work, we present methods for voice emotion classification using deep learning techniques. To process audio signals, our method leverages spectral features of voice recordings, which are known to serve as powerful representations of temporal signals. To tackle the classification task, we consider two approaches to processing spectral features: as temporal signals and as spatial/2D signals. For each processing method, we use a neural network architecture that fits the approach. Classification results are analyzed and insights are presented.

Keywords: Voice sentiment detection, Mood recognition, Speech emotion recognition, Cepstral features.

1. Introduction

The problem addressed in this work is emotion classification from voice recordings. Formally, given some representation X of voice recording data and a set of n emotion labels/classes {y1, y2, ..., yn}, the aim is to come up with a classifier F(X) = yi that maps X to a label yi ∈ {y1, ..., yn}. Practically, having such a classifier F can have a wide range of applications, such as movie or music recommendation systems driven by users' mood, systems for tracking the emotional state and satisfaction of clients over time, security systems for preventing harmful actions based on emotion, and so on.

Previous attempts to tackle the voice emotion classification problem include an SVM-based algorithm that classifies voice into 5 categories (angry, happy, neutral, sad, or excited) [1] and also considers the facial expression of the speaker during speech as an additional signal. Glüge et al. [2] propose a Deep Neural Network Extreme Learning method with efficient performance on small datasets. Eskimez et al. [3] tackle the speech emotion recognition problem through an unsupervised approach, by which they come up with meaningful speech representations by learning the underlying structure of the data, which aids in solving the main task. Bertero et al. [4] introduce a Convolutional Neural Network (CNN)-based approach to 3-label ("angry", "happy", "sad") emotion recognition of speech, where they use the standard pulse-code modulation (PCM) temporal representation of the audio signal as input. Mirsamadi et al. [5] propose a 4-label ("angry", "happy", "sad", "neutral") speech emotion recognition model based on the Long Short-Term Memory (LSTM) architecture and local attention, and base their model on Mel-Frequency Cepstral Coefficients (MFCC), Fast Fourier Transform (FFT), fundamental frequency and zero-crossing rate features of the audio. In our setups, we experiment with both CNN-based and LSTM-based architectures and consider 8 emotion labels for classification, which are described in Section 2.

In this paper, we use cepstral features as representations of voice data; particularly, we utilize Mel-Frequency Cepstral Coefficients (MFCC) for representing the audio signal. We experiment with two views for processing MFCCs: processing them as sequential data in the time domain, and processing them as spatial data. For each of the approaches, we use the appropriate neural network architecture.
Specifically, for processing MFCCs as temporal data, we utilize Long Short-Term Memory networks (LSTM), and for processing MFCCs as spatial/2D data, we make use of Convolutional Neural Networks (CNN).

2. Datasets

In our setup, we consider 8 emotion labels for classification. The databases used in the paper are as follows: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [6], the Surrey Audio-Visual Expressed Emotion (SAVEE) database [7] and the Toronto Emotional Speech Set (TESS) [8]. Each item in the datasets is a recording of an actor who pronounces some statement with a certain expressed emotion. Voice recordings in the databases come in the .wav format, which describes the amplitude of air pressure oscillations in the temporal domain. Each voice recording has an emotion label attached to it.

The RAVDESS database has 24 actors who pronounce 2 phrases, "Kids are talking by the door" and "Dogs are sitting by the door", with 2 intensities, normal and high, each repeated twice. The neutral emotion has no high intensity, so it is only repeated twice. The emotion labels are: "neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised". The TESS dataset has 2 actors, one young and one old, both female. There are 2800 recordings in total, with each phrase being of the form "Say the word x", where x stands for some word. Recordings in the TESS dataset have the same labeled emotions as in RAVDESS, except for the calm label, which is absent in this dataset. The SAVEE dataset has 4 English male actors with 480 voice recordings. 7 emotions are present, with the calm emotion missing. In total, there are 4720 samples. The distribution of samples and classes is summarised in Table 1 and in Table 2.

Table 1: Summary of datasets used.

Database   Num of Recordings   Num of Actors   Emotion Labels
RAVDESS    1440                24              8
SAVEE      480                 4               7
TESS       2800                2               7

Table 2: Number of voice recordings per emotion label across all databases.

Neutral   Calm   Sad   Fear   Anger   Surprise   Happiness   Disgust
616       192    652   652    652     652        652         652

3. Method

3.1 Feature Extraction

To extract audio features from voice recordings, we use the librosa library for Python [9]. It handles most of the transformations applied to voice recordings to obtain the final features used for classification. The first step before extracting features is to resample the voice recording files to obtain their time-domain amplitude representation. Voice recordings from our databases have different original sampling rates, ranging from 22 kHz to 48 kHz. However, the content that we are trying to analyze in those recordings is the human voice itself. Normally, the human voice ranges from low frequencies of around 300 Hz to higher ranges of 4-10 kHz. This means that we can use lower sampling rates to resample our voice recordings. We chose a 22.05 kHz sampling rate, which preserves the full human voice range present in the original audio recordings and also preserves some possible frequency deviations from the normal range, which can be caused by pronouncing high-frequency tones, e.g., fricatives. The result is a floating-point time series describing the amplitude of air pressure oscillations around a mean value of 0 at each time point. Thus, we obtain a time-domain representation of the signal. An example is illustrated in Fig. 1.

Fig. 1. Sample waveform representation of a voice recording signal.
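As a minimal sketch of this loading and resampling step (the file path below is hypothetical, and the exact array length depends on the recording), the librosa call can look as follows:

```python
import librosa

# hypothetical path to a single .wav recording from one of the datasets
path = "data/ravdess/Actor_01/example_recording.wav"

# librosa.load resamples the file to the requested rate and returns a
# floating-point time series of air-pressure amplitudes centered around 0
signal, sr = librosa.load(path, sr=22050)

print(sr)            # 22050
print(signal.shape)  # number of samples, e.g. roughly 77000 for a ~3.5 s clip
print(signal.dtype)  # float32
```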
Having the temporal representation of the voice signal, we then process it to obtain its spectral features, which serve as the main data representation for our models.

3.2 Spectral Features Extraction

Conceptually, given a temporal signal $x(t)$, we can represent it as a combination of periodic functions of varying frequencies [10]:

$$x(t) = \int_{-\infty}^{\infty} X(\omega)\, e^{j\omega t}\, d\omega,$$

where $\omega$ is the frequency of the corresponding periodic function. Thus, having the coefficients $X(\omega)$ is equivalent to having the original signal $x(t)$, and we can use these coefficients as a representation of the temporal signal in frequency space. To achieve such a representation, the Fourier Transform operation is used [10]. Since we are dealing with discrete data, the equivalent operation used is the Discrete Fourier Transform (DFT), which converts a discrete temporal signal $x[n]$ of length $K$ to a representation of this signal in frequency space by obtaining the coefficients/intensities $X[k]$ for each frequency $k$ [11]:

$$X[k] = \sum_{n=1}^{K} x[n]\, e^{-i 2\pi k n / K}, \quad 1 \le k \le K.$$

In signal processing, frequency decomposition is often performed by dividing the signal into time intervals of a specified window size and performing the DFT on each windowed segment, thus obtaining frequency components in multiple time intervals. Such a representation of a signal is called the Short-Time Fourier Transform (STFT) [10]. For audio signals, in some cases, more sophisticated representations based on the STFT are necessary for higher efficiency. Mel-frequency cepstral coefficients (MFCCs) are features that represent a given signal by cepstral energy coefficients over specific short intervals of time. The advantage of MFCC features is that they represent the signal in a way that is close to how the human ear perceives it, which is intuitively achieved by applying cepstral filters with smaller window sizes at low frequencies and increasing the window size of the filters as the considered frequency increases. The reason behind this intuition is that the human ear perceives frequencies in lower ranges much better than in higher ones. Hence, higher resolution at lower-range frequencies is used while computing MFCCs [12]. In its final form, the MFCC of a signal can be represented simply as a function $P_i(k)$, whose value is the intensity of the $k$-th cepstral coefficient in the $i$-th temporal frame. An example of an extracted MFCC feature is shown in Fig. 2.

Fig. 2. Sample MFCC representation of a voice recording signal.
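To make the STFT and MFCC computations concrete, below is a minimal sketch using librosa. The file name is hypothetical, and the parameter choices (40 coefficients, a 4096-sample window) anticipate the settings reported in Section 3.3.2 rather than being mandated by the definitions above.

```python
import librosa

# hypothetical recording, resampled to 22.05 kHz as in Section 3.1
signal, sr = librosa.load("recording.wav", sr=22050)

# Short-Time Fourier Transform: complex frequency content per windowed frame
stft = librosa.stft(signal, n_fft=4096)
print(stft.shape)   # (1 + 4096 // 2, num_frames) = (2049, num_frames)

# Mel-frequency cepstral coefficients: 40 coefficients per temporal frame,
# i.e. a 2D array that can be read either as a sequence or as an "image"
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40, n_fft=4096)
print(mfcc.shape)   # (40, num_frames)
```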
3.3 Architectures and Results

3.3.1 Long Short-Term Memory Networks (LSTMs)

Considering the temporal nature of the data at hand, i.e., voice recordings represented as magnitudes of air pressure (amplitude) across time, and the computed MFCCs, which form a time series of energy coefficient values, it is sensible to use architectures that are by design intended for processing sequential data and have the appropriate inductive bias. One example of such an architecture is the Long Short-Term Memory network (LSTM) [13], a variant of Recurrent Neural Networks (RNNs). The main idea behind the LSTM is the use of gating mechanisms in its recurrent (feedback) connections, which helps prevent the vanishing gradient problem. The architecture of the LSTM used is summarized in Fig. 3.

Fig. 3. The architecture of the trained LSTM model.

Fig. 4. ROC curves of the trained LSTM model. Each curve corresponds to an emotion label.

MFCC sequences are fed into an LSTM recurrent layer with a hidden dimension of size 1024. There are 2 LSTM layers stacked on top of each other, meaning that the outputs of the first layer are processed by the second one. This increases the capacity of the network to capture the features present in the sequence. Due to the small size of the dataset, we used dropout with high probability (p = 0.5) on the outputs of the first LSTM layer to prevent overfitting. The output of the last LSTM layer is then passed to a Multilayer Perceptron (MLP), which outputs an 8-dimensional vector representing the logits of each emotion label.

The network was trained using only the RAVDESS dataset. The recordings of the 1st and 2nd actors (one male and one female) were used as the test set, while the rest of the recordings were used for training the network. The Adam optimizer with a learning rate of 0.0005 was used, and the loss function to minimize was the cross-entropy loss given by:

$$l(\hat{y}_i) = \log\left(\frac{\exp(\hat{y}_i)}{\sum_j \exp(\hat{y}_j)}\right), \qquad L(\hat{y}) = -\sum_i y_i\, l(\hat{y}_i),$$

where $\hat{y}_i$ are the predicted class scores (logits) and $y_i$ are the ground-truth labels (one-hot encoded).
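A minimal PyTorch sketch of the described setup is given below. The MLP head width (256), the use of the final time step as the sequence summary, and the dummy batch shapes are illustrative assumptions; the exact implementation used in the paper may differ.

```python
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    """Sketch of the 2-layer LSTM classifier described in Section 3.3.1."""
    def __init__(self, n_mfcc=40, hidden=1024, n_classes=8):
        super().__init__()
        # two stacked LSTM layers; dropout=0.5 is applied to the outputs
        # of the first layer, as described in the text
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden,
                            num_layers=2, batch_first=True, dropout=0.5)
        # small MLP head mapping the last hidden state to 8 emotion logits
        self.head = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(),
                                  nn.Linear(256, n_classes))

    def forward(self, x):              # x: (batch, time, n_mfcc)
        out, _ = self.lstm(x)          # out: (batch, time, hidden)
        return self.head(out[:, -1])   # logits from the final time step

model = EmotionLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.CrossEntropyLoss()      # combines log-softmax and NLL

# one training step on a dummy batch: 8 recordings, 160 frames, 40 MFCCs
x = torch.randn(8, 160, 40)
y = torch.randint(0, 8, (8,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

Here, nn.CrossEntropyLoss applies the log-softmax and negative log-likelihood terms of the loss written above in a single call.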
The classification results and a comparison to the existing relevant method are presented in Table 3. The Receiver Operating Characteristic (ROC) curves of the results are shown in Fig. 4.

3.3.2 Convolutional Neural Networks (CNNs)

As stated in Subsection 3.2, the MFCC of a recording can be viewed as a 2D feature map of a signal, with one dimension being the temporal dimension and the other being the cepstral coefficient dimension. Thus, a possible approach to working with MFCCs is processing them as spatial signals. Convolutional Neural Networks (CNNs) are one of the most prominent architectures used for processing spatial data, due to their shift equivariance, their inductive bias towards local patterns, and other inherent benefits. Thus, we consider solving the voice emotion classification task by training a CNN on the extracted MFCC data.

For the MFCC calculation, a window size of 4096 samples with overlapping subsequent windows was chosen. Decreasing the window size by half degraded the performance of the network; on average, these settings produced better results. A window size of 4096 is suitable because it allows computing an FFT of length 4096 on each window, capturing the frequency spectrum of up to 4 kHz, which means that the majority of human speech in those recordings is captured in each window. After calculating the MFCCs for every recording and padding sequences shorter than the longest sequence, we obtain input matrices to our network of size (40 x 160), where at each sequence point we have 40 MFCCs.

The architecture of the CNN used is depicted in Fig. 5, and the method is summarized as follows. There are 3 convolutional layers in the network, followed by average pooling layers of size (2x2). The last layer is a fully connected layer that maps the output of the convolutional layers to a vector of length 8, and a log-softmax activation is applied in order to use the cross-entropy loss. Each convolutional layer has 32 kernels. The first layer has kernels of size (10x3), deliberately chosen to be narrow and tall in order to capture features from the change of MFCCs through the sequence. Between layers, the leaky rectified linear unit (leaky ReLU) activation function, given as $h(x) = \max(x, 0) + 0.01 \min(0, x)$, is used both to enable fast training and to prevent neurons from dying. Leaky ReLU adds a small slope for non-activated neurons, thus preventing them from becoming 0 and no longer contributing to backpropagation in later epochs [14]. Since our dataset is very small, we used dropout with high probability (p = 0.5) as well as L2 regularization, which penalizes the sum of squares of the model weights, to prevent overfitting.

Fig. 5. The architecture of the trained CNN model.

All recordings of the 1st and 2nd actors (one male and one female) from the RAVDESS database were used for testing; the neural network was not trained on them. All remaining recordings were used for training. We used the Adam optimizer with a learning rate of 0.00005 and L2 regularization with a decay of $10^{-4}$. The final loss function becomes:

$$l(\hat{y}_i) = \log\left(\frac{\exp(\hat{y}_i)}{\sum_j \exp(\hat{y}_j)}\right), \qquad L(\hat{y}) = -\sum_i y_i\, l(\hat{y}_i) + \lambda \sum_{w \in W} w^2,$$

where $W$ is the set of all trainable weights of the CNN.
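A minimal PyTorch sketch of this CNN is shown below. The kernel sizes of the second and third convolutional layers, the placement of dropout, the assumption of one pooling layer after each convolution, and the lazily initialized final linear layer are illustrative choices; only the overall structure (3 convolutional layers with 32 kernels each, a (10x3) first kernel, (2x2) average pooling, leaky ReLU, a fully connected output with log-softmax, and Adam with weight decay as the L2 penalty) follows the description above.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Sketch of the CNN classifier described in Section 3.3.2."""
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            # 1st layer: 32 kernels of size (10, 3) - tall along the MFCC
            # axis, narrow along time, as described in the text
            nn.Conv2d(1, 32, kernel_size=(10, 3)), nn.LeakyReLU(0.01),
            nn.AvgPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3), nn.LeakyReLU(0.01),  # assumed size
            nn.AvgPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3), nn.LeakyReLU(0.01),  # assumed size
            nn.AvgPool2d(2),
            nn.Dropout(0.5),               # assumed placement of dropout
        )
        self.classifier = nn.LazyLinear(n_classes)  # flattened features -> 8 logits

    def forward(self, x):                  # x: (batch, 1, 40, 160)
        h = self.features(x).flatten(1)
        return torch.log_softmax(self.classifier(h), dim=-1)

model = EmotionCNN()
# weight_decay applies the L2 penalty on the weights through the update rule
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=1e-4)
criterion = nn.NLLLoss()                   # log-softmax output + NLL = cross entropy
loss = criterion(model(torch.randn(8, 1, 40, 160)), torch.randint(0, 8, (8,)))
```

Note that passing weight_decay to Adam realizes the $\lambda \sum_{w \in W} w^2$ term through the gradient update rather than by adding it to the reported loss value, which is how this regularizer is typically applied in practice.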
The classification results and a comparison to the existing related method are summarized in Table 3. The average ROC Area Under the Curve (AUC) over all classes was 0.927. ROC curves for all classes are shown in Fig. 6.

Table 3: Classification results.

Architecture   Train Accuracy   Test Accuracy   Mirsamadi et al. [5] Test Accuracy   Bertero et al. [4] Test Accuracy
LSTM           93.58%           65%             63.5%                                -
CNN            96%              67.5%           -                                    66.1%

As can be observed, the network captures some emotions more easily than others. For instance, Neutral, Calm, Angry and Surprise were captured better than the rest. The ROC-AUC metric also suggests that the model learned meaningful representations for the task.

Fig. 6. ROC curves of the trained CNN model. Each curve corresponds to an emotion label.

Overall, the results show that the models managed to learn meaningful representations from the training procedure. In Table 3, we compare our results to the LSTM-based method of Mirsamadi et al. [5], which was trained and tested on the IEMOCAP benchmark [16] with a 4-label ("angry", "happy", "sad", "neutral") classification setting, as well as to the CNN-based method of Bertero et al. [4], which was trained and tested on the TED-LIUM benchmark [15] with a 3-label ("angry", "happy", "sad") classification setting. As can be observed, our method achieves superior results despite our more challenging 8-label classification setting. In contrast to these 2 methods, we leverage only the MFCC representation of the signal, which highlights the efficiency of the MFCC representation and its usage with deep learning methods for the task.

4. Discussion and Conclusion

This paper proposes deep learning approaches for the voice emotion classification problem. In particular, CNN and LSTM architectures were trained on MFCC features of voice recordings, processing the MFCCs either as a spatial signal or as a sequential signal. The results indicate that the networks have learned meaningful representations from the training data. A possible future direction for improving the classification performance of the proposed models could be adding augmentations to the audio data. The recent advancements in using transformers [17] for multi-modal representation learning [18] and the expressiveness of the resulting feature space can also be a promising direction for solving the speech emotion recognition task.

References

[1] E. Mower, M. J. Mataric and S. Narayanan, "A framework for automatic human emotion classification using emotion profiles", IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1057–1070, 2010.

[2] S. Glüge, R. Böck and T. Ott, "Emotion recognition from speech using representation learning in extreme learning machines", Proceedings of the 9th International Joint Conference on Computational Intelligence, Funchal, Portugal, pp. 179–185, 2017.

[3] S. E. Eskimez, Z. Duan and W. Heinzelman, "Unsupervised learning approach to feature analysis for automatic speech emotion recognition", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, pp. 5099–5103, 2018.

[4] D. Bertero and P. Fung, "A first look into a convolutional neural network for speech emotion detection", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, pp. 5115–5119, 2017.

[5] S. Mirsamadi, E. Barsoum and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, pp. 2227–2231, 2017.

[6] S. R. Livingstone and F. A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English", PLoS ONE, vol. 13, no. 5, 2018.

[7] P. Jackson and S. Haq, "Surrey Audio-Visual Expressed Emotion (SAVEE) database", University of Surrey, Guildford, UK, 2014.

[8] M. K. Pichora-Fuller and K. Dupuis, "Toronto Emotional Speech Set (TESS)", Scholars Portal Dataverse, 2020.

[9] B. McFee, A. Metsai, M. McVicar, S. Balke, C. Thomé, C. Raffel, F. Zalkow, A. Malek, Dana, K. Lee, O. Nieto, D. Ellis, J. Mason, E. Battenberg and S. Seyfarth, librosa/librosa: 0.9.0 (0.9.0), Zenodo, 2022, https://doi.org/10.5281/zenodo.5996429

[10] K. Gröchenig, Foundations of Time-Frequency Analysis, First Edition, Birkhäuser, Boston, MA, 2001.

[11] A. Kulkarni, M. F. Qureshi and M. Jha, "Discrete Fourier transform: approach to signal processing", International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, vol. 3, pp. 12341–12348, 2014.

[12] M. Sahidullah and G. Saha, "Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition", Speech Communication, vol. 54, no. 4, pp. 543–565, 2012.

[13] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[14] B. Xu, N. Wang, T. Chen and M. Li, "Empirical evaluation of rectified activations in convolutional network", CoRR, vol. abs/1505.00853, 2015.

[15] A. Rousseau and P. Deleglise, "Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks", International Conference on Language Resources and Evaluation, Reykjavik, Iceland, pp. 3935–3939, 2014.

[16] C. Busso, M. Bulut, Chi-Chun Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database", Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin, "Attention is all you need", CoRR, vol. abs/1706.03762, 2017.

[18] H. Akbari, L. Yuan, R. Qian, W.-H. Chuang, S.-F. Chang, Y. Cui and B. Gong, "VATT: Transformers for multimodal self-supervised learning from raw video, audio and text", Advances in Neural Information Processing Systems, 2021.

Submitted 18.02.2022, accepted 27.05.2022.