Performance Evaluation of Learning Classifiers of Children Emotions using Feature Combinations in the Presence of Noise

Abdul Samad
Department of Computer Science, Hamdard University, Karachi, Pakistan
asamad23@gmail.com

Aqeel-ur-Rehman
Department of Computer Science, Hamdard University, Karachi, Pakistan
aqeelrehman1972@gmail.com

Syed Abbas Ali
Department of Computer & Information Systems Engineering, NED University of Engineering & Technology, Pakistan
saaj@neduet.edu.pk

Corresponding author: Abdul Samad

Abstract—Recognition of emotions from spoken utterances has been studied in a number of languages and utilized in various applications. This paper makes use of a corpus of spoken utterances recorded in Urdu, expressing different emotions of normal and special children. The performance of learning classifiers is evaluated with prosodic and spectral features and their combinations, treating children with autism spectrum disorder (ASD) as noise, in terms of classification accuracy. The experimental results reveal that the prosodic features achieve significant classification accuracy in comparison with the spectral features for ASD children across the classifiers, whereas combinations of prosodic features show substantial accuracy for ASD children with the J48 and rotation forest classifiers. Pitch and formant achieve considerable classification accuracy in combination with MFCC and LPCC for special (ASD) children across the classifiers.

Keywords-spoken utterances; special children; learning classifiers; noise; features

I. INTRODUCTION AND RELATED WORK

In the modern era of human-computer interaction, speech emotion recognition (SER) is a field of broad interest. SER has a great influence on human behavior and is a key element in building relations. Different emotions have distinct characteristics that make them recognizable in their own way [1]. "EmoChildRu", the first emotional child speech corpus for Russian, was introduced in [2] to support the recognition of emotions from children's speech and vocal behavior. Two perception experiments on child emotional speech are reported for this corpus: one with adult listeners and one with automatic classifiers. The automatic classification results are fundamentally similar to human perception, although the accuracy remains below 55% for both, demonstrating the difficulty of recognizing child emotions from speech under natural conditions. To improve the condition of people with ASD, several computer-assisted learning (CAL) techniques have been implemented. The authors in [3] review various CAL techniques implemented to enhance the everyday life of such individuals, and outline the CAL techniques applied to improve the communication, behavioral, and social abilities of such special children. In [4], it was noted that mental and emotional states are not easy to observe in individuals with autism spectrum conditions (ASC). A speaker-independent technique to recognize the speech of children with hearing impairment was proposed in [5].
In this technique, Mel- and Bark-scale and Equivalent Rectangular Bandwidth (ERB) filterbanks along with gammatone features were utilized at the front end, whereas at the back end, Fuzzy C-Means (FCM), Multivariate Hidden Markov Model (MHMM), and Vector Quantization (VQ) approaches were applied. Isolated words and short sentences in the Tamil language were used to evaluate the performance of the three variants, with the data of two speakers tested against the features of eight speakers. A database of real situations was organized in [6] to detect fear from speech in extreme emotional, real-world emergencies, using interjections as a behavioral cue. MFCCs along with a Support Vector Machine over the interjection variants were utilized to categorize speech emotions. In [7], the Urdu language was used to recognize the emotions of primary-age children. The authors used three prosodic features with five classifiers and four emotions, and reported that the J48 classifier achieved the highest accuracy. A deep architecture was utilized in [8], which uses a convolutional network for extracting domain-shared features and a long short-term memory network for classifying emotions using domain-specific features. A complete cross-corpus experiment over various speech emotion domains revealed that transferable features provide gains ranging from 4.3% to 18.4% in speech emotion recognition. In [9], Deep Neural Networks (DNNs) were applied to recognize human emotions from the speech signal. Mel-Frequency Cepstral Coefficients (MFCCs), among the most frequently used speech features, were extracted from the raw audio data and fed to the DNN to train the system, and a handcrafted database was also presented to improve the system's applicability. Work related to the recognition, classification, and detection of the emotions of children with ASD is still an open research topic, and researchers are increasingly concerned with helping these children by making them realize the emotions of the real world. Under this scope, the autism-oriented game "Emotify" has been developed [10]. It comprises two levels of difficulty and attempts to teach children the emotions of neutrality, anger, sadness, and happiness. At the second level, children are helped to express their feelings, which are then evaluated and examined. Machine learning approaches are exploited to develop a multilingual emotion recognition system. This paper evaluates the performance of learning classifiers with prosodic and spectral features and their combinations, treating special (ASD) children as noise, in terms of classification accuracy.

II. LEARNING CLASSIFIERS AND FEATURES

Prosodic features play a vital role in daily routine conversations [11]. The parameters used to perceive the feelings of speakers from speech include speech rate, duration, pitch, formant, intensity, Mel-Frequency Cepstral Coefficients (MFCC), and Linear Prediction Cepstral Coefficients (LPCC) [12]. Two spectral features (MFCC and LPCC) and three prosodic features (intensity, pitch, and formant), along with their potential combinations, are utilized in this research:

• Pitch: Pitch is correlated with frequency. Every speech frame is analyzed through statistical values of pitch computed over the whole sample; these values give a clear picture of the properties of the audio signal [13].

• Intensity: Intensity encodes prosodic information and the expression of emotion in spoken utterances [12].

• Formant: A formant is a critical frequency component of speech that provides measurable characteristics of the consonants and vowels of the speech signal [12].
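For concreteness, the sketch below extracts per-utterance statistics for these five features from a WAV file. It is only an illustration: the paper does not specify its extraction toolchain, so the use of librosa, the pYIN pitch range, the LPC-root formant estimate, the raw LPC coefficients standing in for LPCC, and the mean/standard-deviation summaries are all assumptions.

```python
# Illustrative feature extraction (not the paper's pipeline): per-utterance
# statistics of pitch, intensity, formants, MFCC, and LPC from one WAV file.
import numpy as np
import librosa

def extract_features(path):
    y, sr = librosa.load(path, sr=None)   # corpus WAVs are 16-bit mono, 48 kHz

    # Pitch: F0 contour via pYIN, summarized over the voiced frames only
    f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=600, sr=sr)
    f0 = f0[voiced]

    # Intensity: frame-wise RMS energy, expressed in dB
    rms = librosa.feature.rms(y=y)[0]
    intensity = 20.0 * np.log10(rms + 1e-10)

    # Formants: rough F1-F3 estimate from the angles of the LPC roots
    lpc = librosa.lpc(y, order=12)
    roots = [r for r in np.roots(lpc) if np.imag(r) > 0]
    freqs = sorted(np.angle(r) * sr / (2.0 * np.pi) for r in roots)
    formants = ([f for f in freqs if f > 90.0] + [0.0] * 3)[:3]  # pad if < 3

    # Spectral features: 13 MFCCs; raw LPC coefficients stand in for LPCC
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    return np.hstack([f0.mean(), f0.std(),
                      intensity.mean(), intensity.std(),
                      formants,
                      mfcc.mean(axis=1), mfcc.std(axis=1),
                      lpc[1:]])
```

Applied to each of the 200 utterances, this yields one fixed-length feature vector per sample, from which the individual feature blocks and their combinations can be selected.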
Four learning classifiers were used in the experimental framework, and their performance was evaluated with the spectral and prosodic features and their combinations. A comprehensive description of the learning classifiers can be found in [14]. The classifiers are:

• J48: A decision tree algorithm that builds a tree over the feature vectors of the training examples; the classes of newly presented instances are then predicted on the basis of these examples. The tree structure also makes the underlying distribution of the data easy to interpret [15].

• Multi-Layer Perceptron (MLP): A class of artificial neural network that contains at least three layers. The first is the input layer and the last is the output layer; the layers in between are hidden layers, and different MLPs can have different numbers of hidden layers [16].

• Rotation forest: Rotation forest [15] randomly eliminates subsets of classes, bootstraps the remaining data, and finally applies Principal Component Analysis (PCA) and builds independent decision trees.

• LogitBoost: Boosting [16] works on the principle that a set of weak learners can be combined to create a strong classifier. LogitBoost assigns higher weights to misclassified instances.

III. CORPUS COLLECTION AND RECORDING SPECIFICATION

The corpus was collected in Urdu from both categories of children (normal and with ASD) and comprises 200 samples, equally divided between the two cases. As per the research methodology, the ASD children were considered as noise in the experimental framework. The recordings were made under standard conditions with a signal-to-noise ratio ≥ 45 dB, using the Microsoft Windows 7 sound recorder to record the emotion-based spoken utterances of normal and special (ASD) children. The configuration is 16-bit mono PCM with a sampling rate of 48 kHz; the microphone noise and sensitivity ratings are 54 dB ± 2 dB and 2.2 W respectively, with a 3.5 mm stereo jack and a cable length of 1.8 m. The spoken utterance was chosen to be: a) semantically neutral, b) simple to analyze, c) consistent with any presented situation, and d) of comparable meaning in every language. The sentence was "Mujhe Khelna Hai" ("I have to play").

IV. EXPERIMENTAL RESULTS AND DISCUSSION

The performance of the learning classifiers is evaluated in the experimental framework for normal and special children, making use of prosodic and spectral features and their combinations on spoken utterances recorded in Urdu. The corpus comprises 200 spoken utterance samples in different emotions, equally distributed between normal and special children. The experimental framework considers inter and intra feature combinations with four different classifiers (LogitBoost, MLP, J48, and rotation forest) under the following feature configurations: 1) each prosodic and spectral feature separately, and 2) combinations of the three prosodic features (intensity, formant, and pitch) with the two spectral features (LPCC and MFCC).
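As a rough illustration of this evaluation step, the sketch below reports cross-validated accuracy for the four classifiers over a feature matrix X (one row per utterance, e.g., as produced by the extraction sketch in Section II) with emotion labels y. J48, rotation forest, and LogitBoost are best known from Weka; since the paper does not name its toolchain, the scikit-learn models used here are rough stand-ins rather than the paper's implementations, and the 10-fold protocol is an assumption.

```python
# Cross-validated classification accuracy for four classifiers; scikit-learn
# models serve as stand-ins for J48, rotation forest, and LogitBoost.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "J48 (CART stand-in)": DecisionTreeClassifier(),
    "MLP": make_pipeline(StandardScaler(),          # MLPs need scaled inputs
                         MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)),
    "Rotation forest (RF stand-in)": RandomForestClassifier(),
    "LogitBoost (GB stand-in)": GradientBoostingClassifier(),
}

def evaluate(X, y, folds=10):
    """Print the mean cross-validated classification accuracy per classifier."""
    for name, clf in classifiers.items():
        acc = cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean()
        print(f"{name:32s} {100 * acc:5.1f}%")
```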
The objective of the proposed framework is to identify the behavior of the four classifiers with single features and with different combinations of spectral and prosodic features on the spoken utterances of special (ASD) and normal children in terms of classification accuracy. The experimental results for both corpora, taken from normal and special (ASD, treated as noise) children in Urdu, follow.

A. Prosodic Features

Pitch demonstrates good accuracy in characterizing the states of children under all four classifiers, classifying special children more accurately than normal children (Table I). The classification accuracy of rotation forest with the prosodic feature pitch on the spoken utterances of ASD (noisy) children was significantly better than that of the other classifiers. Intensity also shows good classification accuracy for ASD children with all classifiers except rotation forest. All learning classifiers demonstrate higher classification accuracy with formant for ASD children than for normal children.

B. Spectral Features

MFCC achieves significant accuracy with the MLP and LogitBoost classifiers in the case of normal children; on the other hand, only rotation forest shows considerable classification accuracy for ASD children. LPCC achieves very good accuracy only with the LogitBoost classifier for both normal and special children. This outcome suggests that the LogitBoost classifier should be used in any study involving LPCC.

Fig. 1. Prosodic features

TABLE I. CLASSIFICATION ACCURACY FOR PROSODIC FEATURES

| Feature | Classifier | ASD children accuracy (%) | Normal children accuracy (%) |
|---|---|---|---|
| Intensity | MLP | 81.3 | 65.6 |
| | LogitBoost | 81.5 | 90.5 |
| | Rotation forest | 76.5 | 64.5 |
| | J48 | 81.3 | 65.6 |
| Pitch | MLP | 80 | 71.4 |
| | LogitBoost | 93.8 | 72 |
| | Rotation forest | 94.4 | 76.7 |
| | J48 | 80 | 71.4 |
| Formant | MLP | 50 | 11 |
| | LogitBoost | 90 | 78.6 |
| | Rotation forest | 99 | 63.2 |
| | J48 | 98 | 60 |

TABLE II. CLASSIFICATION ACCURACY FOR SPECTRAL FEATURES

| Feature | Classifier | ASD children accuracy (%) | Normal children accuracy (%) |
|---|---|---|---|
| MFCC | MLP | 64.7 | 85.7 |
| | LogitBoost | 64.7 | 85.7 |
| | Rotation forest | 83.3 | 61.1 |
| | J48 | 10 | 50 |
| LPCC | MLP | 50 | 10 |
| | LogitBoost | 84.6 | 62.9 |
| | Rotation forest | 9 | 50 |
| | J48 | 11 | 50 |

Fig. 2. Spectral features

C. Inter Combination of Prosodic and Spectral Features

The prosodic feature pitch shows significant accuracy in combination with each of the two other prosodic features. In combination with intensity, pitch achieves considerable classification accuracy with LogitBoost and J48 for ASD (special) children. Pitch with formant performs well in classifying special (ASD) children with all classifiers except LogitBoost. For the combination of intensity and formant, MLP and rotation forest show significant classification accuracy in comparison with the other two classifiers for ASD children, while J48 achieves comparable accuracy in classifying special and normal children.

Fig. 3. Inter and intra combinations of features

D. Intra Combination of Prosodic and Spectral Features

In the Intensity-MFCC combination, MLP performs better in classifying the normal children, while LogitBoost is considerably good at classifying the noisy spoken utterances of the special children.
In this case, rotation forest and J48 perform averagely. Similarly, in the Intensity-LPCC combination, MLP is good and accurate, although LogitBoost classifies the normal children's spoken utterances more accurately. Rotation forest and J48 again perform averagely.

TABLE III. CLASSIFICATION ACCURACY FOR PROSODIC AND SPECTRAL FEATURES

| Feature combination | Classifier | ASD children accuracy (%) | Normal children accuracy (%) |
|---|---|---|---|
| Pitch, intensity | MLP | 88 | 91.3 |
| | LogitBoost | 90 | 88 |
| | Rotation forest | 87.5 | 87.5 |
| | J48 | 80 | 71.4 |
| Pitch, formant | MLP | 90.5 | 81.5 |
| | LogitBoost | 88 | 91.3 |
| | Rotation forest | 90.9 | 84.6 |
| | J48 | 98 | 60 |
| Intensity, formant | MLP | 85.7 | 77.8 |
| | LogitBoost | 92 | 95.7 |
| | Rotation forest | 89.5 | 76 |
| | J48 | 88.3 | 88.3 |
| MFCC, LPCC | MLP | 56.5 | 56 |
| | LogitBoost | 93.8 | 71.9 |
| | Rotation forest | 98 | 57.1 |
| | J48 | 11 | 50 |
| Pitch, MFCC | MLP | 95.2 | 85.2 |
| | LogitBoost | 77.8 | 85.7 |
| | Rotation forest | 94.7 | 79.3 |
| | J48 | 92.3 | 65.7 |
| Pitch, LPCC | MLP | 84.2 | 72.4 |
| | LogitBoost | 89.5 | 75.9 |
| | Rotation forest | 85.7 | 77.8 |
| | J48 | 92.3 | 65.7 |
| Intensity, MFCC | MLP | 98 | 80 |
| | LogitBoost | 82.1 | 95 |
| | Rotation forest | 94.4 | 76.7 |
| | J48 | 69.7 | 93.3 |
| Intensity, LPCC | MLP | 80.8 | 86.4 |
| | LogitBoost | 87 | 84 |
| | Rotation forest | 85.7 | 77.8 |
| | J48 | 69.7 | 93.3 |
| Formant, MFCC | MLP | 53.5 | 80 |
| | LogitBoost | 98.4 | 75 |
| | Rotation forest | 98 | 57.1 |
| | J48 | 97 | 60 |
| Formant, LPCC | MLP | 46.4 | 45 |
| | LogitBoost | 94.1 | 74.2 |
| | Rotation forest | 97 | 61.5 |
| | J48 | 98 | 60 |

Table III provides the results of the learning classifiers with combinations of prosodic and spectral features in classifying the noisy spoken utterances in terms of classification accuracy. The most significant results are observed for the combination of LPCC with pitch and with formant in classifying the noisy (special children) utterances, while pitch and formant combined with MFCC also show substantial classification accuracy for special (noisy) children. Other combinations, such as intensity with LPCC and intensity with MFCC, also provide considerable classification results.
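The ten feature combinations of Table III are exactly the pairwise subsets of the five feature blocks, so the sweep can be generated mechanically by concatenating per-feature column blocks. The sketch below illustrates this; the block widths, the random placeholder data, and the single decision tree used as the scorer are assumptions, standing in for the real feature matrices and the four classifiers above.

```python
# Sweeping the ten pairwise feature combinations of Table III by column-wise
# concatenation; random placeholder data stands in for the real corpus.
import itertools
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200                                     # 200 utterances, as in the corpus
feature_blocks = {                          # block widths are illustrative
    "pitch":     rng.normal(size=(n, 2)),
    "intensity": rng.normal(size=(n, 2)),
    "formant":   rng.normal(size=(n, 3)),
    "MFCC":      rng.normal(size=(n, 26)),
    "LPCC":      rng.normal(size=(n, 12)),
}
y = rng.integers(0, 4, size=n)              # four emotion labels (illustrative)

for pair in itertools.combinations(feature_blocks, 2):  # 10 rows of Table III
    X = np.hstack([feature_blocks[name] for name in pair])
    acc = cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean()
    print(f"{' + '.join(pair):22s} {100 * acc:5.1f}%")
```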
V. CONCLUSION

In this paper, the performance of learning classifiers has been evaluated with prosodic and spectral features and their combinations for children with ASD in terms of classification accuracy. The experimental framework comprises four different classifiers with different inter and intra combinations of prosodic and spectral features. The experiments were conducted on 200 samples taken equally from normal children and children with ASD, the latter being considered as noise. The conclusions of the experimental results are:

• The spectral features show significant classification accuracy in combination with the prosodic features pitch and formant under the rotation forest and J48 classifiers.

• Separate analysis of the spectral and prosodic features reveals that the classification accuracy of the prosodic features is considerably better than that of the spectral features.

• The intra combinations of the spectral features with pitch and formant demonstrate better classification accuracy across the classifiers.

The authors are now focusing on developing an experimental framework that applies the same methodology to different emotions, in order to evaluate the performance of the classifiers for special (ASD) children in terms of classification accuracy.

REFERENCES

[1] S. Ramakrishnan, "Recognition of emotion from speech: A review", in: Speech Enhancement, Modeling and Recognition: Algorithms and Applications, InTech, 2012, pp. 121-137
[2] E. Lyakso, O. Frolova, E. Dmitrieva, A. Grigorev, H. Kaya, A. A. Salah, A. Karpov, "EmoChildRu: Emotional child Russian speech corpus", Lecture Notes in Computer Science, Vol. 9319, Springer, Cham, 2015
[3] S. Dewan, A. Singh, L. Singh, S. Gautam, "Role of emotion recognition in computive assistive learning for autistic person", Indian Journal of Science and Technology, Vol. 9, No. 48, 2016
[4] O. Golan, Y. Sinai-Gavrilov, S. Baron-Cohen, "The Cambridge mindreading face-voice battery for children (CAM-C): complex emotion recognition in children with and without autism spectrum conditions", Molecular Autism, Vol. 6, No. 1, Article ID 22, 2015
[5] R. Arunachalam, Revathi, "A strategic approach to recognize the speech of the children with hearing impairment: different sets of features and models", Multimedia Tools and Applications, Springer, 2019
[6] S. A. Yoon, G. Son, S. Kwon, "Fear emotion classification in speech by acoustic and behavioral cues", Multimedia Tools and Applications, Vol. 78, No. 2, pp. 2345-2366, 2019
[7] S. Khan, S. A. Ali, J. Sallar, "Analysis of children's prosodic features using emotion based utterances in Urdu language", Engineering, Technology & Applied Science Research, Vol. 8, No. 3, pp. 2954-2957, 2018
[8] A. Marczewski, A. Veloso, N. Ziviani, "Learning transferable features for speech emotion recognition", Thematic Workshops of ACM Multimedia, Mountain View, USA, October 23-27, 2017
[9] M. F. Alghifari, T. S. Gunawan, M. Kartiwi, "Speech emotion recognition using deep feedforward neural network", Indonesian Journal of Electrical Engineering and Computer Science, Vol. 10, No. 2, pp. 554-561, 2018
[10] A. Rouhi, M. Spitale, F. Catania, G. Cosentino, M. Gelsomini, F. Garzotto, "Emotify: emotional game for children with autism spectrum disorder based-on machine learning", 24th International Conference on Intelligent User Interfaces Companion, New York, USA, March 16-20, 2019
[11] K. S. Rao, S. G. Koolagudi, Emotion Recognition using Speech Features, Springer, 2013
[12] P. Shen, C. Zhou, X. Chen, "Automatic speech emotion recognition using support vector machine", International Conference on Electronic, Mechanical Engineering and Information Technology, Harbin, China, August 12-14, IEEE, 2011
[13] A. S. Utane, S. L. Nalbalwar, "Emotion recognition through speech", 2nd National Conference on Innovative Paradigms in Engineering & Technology, Nagpur, India, February 17, 2013
[14] S. A. Ali, S. Zehra, M. Khan, F. Wahab, "Development and analysis of speech emotion corpus using prosodic features for cross linguistic", International Journal of Scientific & Engineering Research, Vol. 4, No. 1, pp. 1-8, 2013
[15] S. A. Ali, A. Khan, N. Bashir, "Analyzing the impact of prosodic feature (pitch) on learning classifiers for speech emotion corpus", International Journal of Information Technology and Computer Science, Vol. 2, pp. 54-59, 2015
[16] M. Swain, A. Routray, P. Kabisatpathy, "Databases, features and classifiers for speech emotion recognition: a review", International Journal of Speech Technology, Vol. 21, No. 1, pp. 93-120, 2018