Microsoft Word - ETASR_V12_N6_pp9532-9535 Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9532-9535 9532 www.etasr.com Azmat et al.: Environmental Noise Reduction based on Deep Denoising Autoencoder Environmental Noise Reduction based on Deep Denoising Autoencoder Received: 4 August 2022 | Revised: 31 August 2022 | Accepted: 1 September 2022 Abstract-Speech enhancement plays an important role in Automatic Speech Recognition (ASR) even though this task remains challenging in real-world scenarios of human-level performance. To cope with this challenge, an explicit denoising framework called Deep Denoising Autoencoder (DDAE) is introduced in this paper. The parameters of DDAE encoder and decoder are optimized based on the backpropagation criterion, where all denoising autoencoders are stacked up instead of recurrent connections. For better speech estimation in real and noisy environments, we include matched and mismatched noisy and clean pairs of speech data to train the DDAE. The DDAE has the ability to achieve optimal results even for a limited amount of training data. Our experimental results show that the proposed DDAE outperformed the baseline algorithms. The DDAE shows superior performances based on three-evaluation metrics in noisy and clean pairs of speech data compared to three baseline algorithms. Keywords- DDAE; limited data; noise reduction; autoencoders I. INTRODUCTION Speech enhancement is a key part of Automatic Speech Recognition (ASR) [1]. It has been implemented in many commercial products for decades, but in order to achieve human-level performance, it requires further improvement in real-world scenarios [2], e.g. when we talk to a person or a voice recognition system like an ATM verification call, many environmental noises are added and transmitted, posing difficulties to the system. To deal with such noises, speech signal processing algorithms have been developed to improve the quality and intelligibility. Neural network-based algorithms are more efficient in learning high-order statistical information automatically by using nonlinear processing units and are efficient for noise reduction [3, 4]. Many algorithms are proposed to train deep neural networks efficiently for speech enhancement and noise reduction [5-7]. Deep learning algorithms have been used to extract speech features and for acoustic modeling [2]. The denoising autoencoder has been used for image processing and other classification applications to extract robust features when the input is a binary masked feature for each autoencoder. For speech feature extraction, a recurrent denoising autoencoder for ASR was proposed for noise reduction in [8]. There are many conventional methods for speech signal processing. For speech enhancement, the most used algorithms are Log Minimum Mean Square Error (logMMSE) [9], Karhunen-Loeve Transforms (KLT) [10], and Robust Principal Component Analysis (RPCA) [11], which are designed with specific additive background noise based on statistical speech properties and noisy signals. However, these methods do not work in real settings when people talk in a noisy environment. Consequently, it is essential to build a framework to deal with the various types of noise in real-world environments. This article introduces an explicit denoising framework called the Deep Denoising Autoencoder (DDAE), a variant of the denoising autoencoder. However, unlike previous works, where the input is noisy speech and the output is clean speech Aneeka Azmat Institute of Human-Centered Computing, National Tsinghua University, Taiwan and Institute of Information Science, Academia Sinica, Taiwan aneekaazmat89@iis.sinica.edu.tw Imad Ali Department of Computer Science University of Swat, Pakistan imadali@uswat.edu.pk M. Gilvy Langgawan Putra National Taiwan University of Science and Technology, Taiwan and Institut Teknologi Kalimantan, Indonesia gilvy.langgawan@lecturer.itk.ac.id Whenty Ariyanti National Taiwan University of Science and Technology, Taiwan and Institute of Information Science, Academia Sinica, Taiwan whenty@iis.sinica.ac.id Talha Nadeem COMSATS University Islamabad, Pakistan talhanadeem2397@gmail.com Corresponding author: Aneeka Azmat Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9532-9535 9533 www.etasr.com Azmat et al.: Environmental Noise Reduction based on Deep Denoising Autoencoder for training, the DDAE is trained on datasets of matched noisy and clean pairs of speech and mismatched noisy and clean pairs of speech for speech estimation in real and live environments. In the DDAE, the parameters in the encoder and decoder (formed by multiple layers of neural networks) are optimized based on the backpropagation criterion, where all denoising autoencoders are stacked up instead of recurrent connections. The DDAE has the ability to achieve optimal results even with a limited amount of training data. Our experimental results of DDAE were compared with the speech enhancement algorithms RPCA, KLT, and logMMSE. Based on three evaluation metrics, the DDAE outperforms these three baseline algorithms in matched and mismatched noisy and clean speech. II. THE DDAE FRAMEWORK This article uses the DDAE framework presented in [12], which removes the noise quickly and efficiently. The DDAE can achieve optimal results even with inadequate training data. It uses the noisy signal to produce clean speech features efficiently, even under mismatched testing conditions. Unlike [13], where the denoising autoencoder was trained using clean speech only, we trained the denoising autoencoder with both noisy and clean speech data. Figure 1 shows a single-layer hidden neural autoencoder, trained on clean speech with noisy and speech data as input. It consists of one nonlinear and one linear encoding and decoding stage: ℎ��� � = ��� �� + �� � � = ��ℎ���� + �� � (1) where � and �� are the weight matrices for encoding and decoding neural network connections respectively. Typically, regularization takes place when � = ��� = �. Also, � is the input layer’s bias vector, whereas � represents the output layer’s bias vector. Similarly, �� is the noisy speech signal equivalent to the clean signal �� . The nonlinear logistic function utilized by the hidden neuron is ���� = �1 + � ���� . By optimizing the objective function in (2), the parameters are determined as follows: ���� = ∑ ||�� − � � ||��� (2) where the set of the parameters is defined as � = {�, �, �}. In addition to using � = ��� = �, we apply regularization on weights and hidden neural output. This improves the generalization and prevents overfitting, which is defined as follows: ��� = ���� + ! ||�||�� + "#�ℎ��� )) (3) where ||�||�� = ∑ $�%� . &'ℎ��� �(�% � �% is a regularization function defined on the output neurons of the hidden layers, and ! and " are its weight coefficients. Thus, the parameters may be derived as follows: � ∗ ≜ arg min1 ��� (4) There are numerous unconstrained optimization algorithms to solve (4) for the estimation of �∗, �∗, �∗. In this framework, the linear search-based algorithm of quasi-Newton is applied. Figure 1 depicts the DDAE training procedure using noisy and clean voice samples. The DDAE can be constructed by stacking up several autoencoders. For the training of the DDAE, a dense layer of wised pretraining and fine-tuning is employed. When another hidden layer is introduced during the pretraining phase, the output of the previous hidden layer is used as the input of the subsequent autoencoder. The transformed noisy and clean speech data are used for training in denoising. As illustrated in Figure 1, the training pair for the first autoencoder is � and �, and followed by ℎ�� and ℎ�� or the subsequent autoencoder. During the fine-tuning step, the initial network parameters are fixed as those obtained from the pretraining step. These training stages may produce a better overall result than training the DDAE with random initialization. Fig. 1. The training process of the DDAE with noisy and clean speech data. III. EXPERIMENTAL SETTING This section evaluates the proposed DDAE framework and the baseline algorithms for speech enhancement tasks. A clean, continuous Taiwan Mandarin Hearing In Noise Test (TMHINT) data set with 100 utterances was used for training by adding 5 noise types (machine, babble, party crowd, restaurant, vacuum cleaner) and 5 signal-to-noise ratios (SNR). Therefore, the total training dataset for noisy speech contains 2500 utterances. The first testing scenario is a matching case in which 40 utterances are added with the same two types of noises (babble and party crowd) with 5 SNR levels. Thus, the total number of match case utterances are 400. More details of the dataset can be found in [14]. We have investigated the accuracy of noise reduction, speech enhancement, and intelligibility using 3 different metrics with a defined standard range set. The range for PESQ is between –0.5 to 4.5. As the value of PESQ increases, the quality of speech enhancement will increase. Similarly, the range for Short-Time Objective Intangibility (STOI) is between 0 and 1. As the value of STOI increases, the number of words identified by the listener will increase. The Speech Distortion Index (SDI) is the opposite of the first two metrics, ranging between 0 and 1. As the value of SDI decreases, so does the quality of speech enhancement. Most conventional speech enhancement algorithms are designed to filter out noisy speech Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9532-9535 9534 www.etasr.com Azmat et al.: Environmental Noise Reduction based on Deep Denoising Autoencoder using the gain function estimation. This article compares 3 traditional algorithms, logMMSE, KLT, and RPCA, with the proposed DDAE. The unprocessed speech is named Noisy. The experimental results were evaluated using 2 matched noise types with two SNR conditions (6dB, -6dB) and 4 mismatched noise types with 3 SNR conditions (-2, 0, 2), as shown in Table I. The DDAE was trained using 5 layers, each containing 2048 neurons.. TABLE I. EXPERIMENTAL SETUP OF THE TMHINT DATASET Train set 2500 utterances Noise type: Machine, cafeteria babble, crowd party, restaurant, vacuum cleaner SNR levels: -6dB, -3dB, 3dB, 6dB, 10dB Test set Matched condition 400 utterances Noise type: Cafeteria babble, crowd party SNR levels: -6dB, 6dB Mismatched condition 800 utterances Noise type: Applause, baby cry, grocery store, pink noise SNR levels: -2dB, 0dB, 2dB IV. RESULTS In this section, we evaluate and compare the results of the baseline algorithms with DDAE based on 3 evaluation criteria using different scenarios and SNR levels as mentioned in Table I. Table II shows the average PESQ results, normalized across the different SNRs. Compared to the baseline algorithms, the DDAE framework has a better PESQ score, i.e. improved speech quality, for both matched and mismatched noise types. Table II demonstrates that the DDAE framework (5 hidden layer configuration, each layer containing 2048 hidden neurons) outperformed the 3 baselines algorithms, with the exception of most pink noise, where the logMMSE utperformed the DDAE framework and the other algorithms under mismatched stationery noise conditions. In addition, the DDAE exhibited a better PESQ score of speech enhancement than logMMSE, RPCA, and KLT by maintaining high scores for STOI. For instance, the average PSESQ score of the DDAE is 2.55, while the PESQ scores of RPCA, logMMSE, and KLT are 2.18, 2.15, and 1.90 respectively. Compared to the baseline algorithms, on average, the PESQ score of the DDAE is higher than RPCA, logMMSE, and KLT by 16.97 %, 18.60%, and 34.21%, respectively. TABLE II. DDAE COMPARISON WITH THE BASELINE ALGORITHMS BASED ON THE AVERAGE PESQ SCORE Noise type Framework Noisy KLT logMMSE RPCA DDAE Matched noise (seen noise) Babble 2.26 1.89 2.21 2.25 2.58 Crowd party 2.26 1.85 2.15 2.24 2.61 Mismatched noise (unseen noise) Applause 2.13 1.76 1.89 1.99 2.95 Baby cry 2.17 1.77 1.97 2.02 2.50 Grocery 2.19 1.80 2.09 2.18 2.26 Pink noise 2.32 2.38 2.58 2.40 2.38 AVG 2.23 1.90 2.15 2.18 2.55 Table III illustrates the average STOI score for the DDAE and baseline algorithms, with 6 noise types with different SNRs. The Table shows that the DDAE has a better STOI score for all types of noise. Moreover, the DDAE demonstrated better average speech enhancement capabilities by maintaining high STOI scores. Compared to the baseline algorithms, the average STOI score of the DDAE is higher than RPCA, logMMSE, and KLT by 16.39%, 65.11%, and 10.93% respectively. TABLE III. ALGORITHM COMPARISON BASED ON AVERAGE STOI SCORE Noise Type Framework Noisy KLT logMMSE RPCA DDAE Matched Noise (seen noise) Babble 0.58 0.58 0.38 0.55 0.68 Crowd party 0.63 0.61 0.46 0.66 0.71 Mismatched Noise (unseen noise) Applause 0.69 0.68 0.57 0.697 0.73 Baby cry 0.71 0.71 0.54 0.71 0.73 Grocery 0.45 0.58 0.29 0.45 0.60 Pink noise 0.52 0.68 0.31 0.52 0.74 AVG 0.59 0.64 0.43 0.61 0.71 Table IV shows the average SDI results averaged over the different SNRs for the 6 considered noise types. The Table shows that the DDAE performs better than the other algorithms for all types of noise. In addition, the DDAE demonstrated better speech enhancement capabilities. Compared to the baseline algorithms, the SDI score of the DDAE is 22.64%, 24.07%, and 12.96%, lower than the RPCA, logMMSE, and the performance of DDAE significantly outperformed the them. Figure 2 depicts the spectrogram of a test utterance contaminated with non-stationary noise applause at SNR = -2dB achieved by different models. For comparison, the spectrograms of clean and noisy speech signals are also presented in Figure 2(a)-(b). The spectrograms of the test utterance enhanced by logMMSE, KLT, and RPCA algorithms are shown in Figures 2(c)-(e), while Figure 2(f) depicts the spectrogram of the enhanced speech signals produced by the DDAE. Figure 2 clearly demonstrates that the DDAE framework efficiently restores clean speech under very challenging conditions (non-stationery noise, -2dB SNR) and effectively suppresses noise components from the noisy signal (Figure 2(b)). DDAE suppresses noise components more effectively and produces better speech quality than the previous approaches. TABLE IV. ALGORITHM COMPARISON BASED ON THE AVERAGE SDI SCORE Noise Type Framework Noisy KLT logMMSE RPCA DDAE Matched noise (seen noise) Babble 0.61 0.45 0.48 0.50 0.40 Crowd party 0.61 0.48 0.48 0.50 0.38 Mismatched noise (unseen noise) Applause 0.61 0.55 0.585 0.58 0.41 Baby cry 0.61 0.56 0.59 0.56 0.38 Grocery 0.61 0.52 0.55 0.53 0.50 Pink noise 0.61 0.25 0.57 0.53 0.39 AVG 0.61 0.47 0.54 0.53 0.41 Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9532-9535 9535 www.etasr.com Azmat et al.: Environmental Noise Reduction based on Deep Denoising Autoencoder (a) (b) (c) (d) (e) (f) Fig. 2. Spectrograms of (a) clean, (b) noisy, (c) logMMSE, (d) KLT, (e) RPCA, and (f) DDAE. A noise of -2dB was added to the test utterance. V. DISCUSSION AND CONCLUSION In this study, the DDAE framework for speech enhancement of environmental noise is presented. The proposed framework provides a more understandable speech to the listener. The major contributions of this study are the introduction of the DDAE framework, which enhances speech intelligibility over 4 different baseline methods, and the confirmation, based on experimental data, that this proposed framework can improve speech performance. These contributions allow our proposed architecture to serve as a straightforward and efficient means of bridging the acoustic mismatch condition. The DDAE framework provides better generalization and a universal approximation capability. For mismatch and non- stationery conditions, DDAE has better generalization performance compared to the conventional speech enhancement techniques. The experimental results demonstrate that even with insufficient training data, our proposed framework outperformed KLT, logMMSE, and RPCA under both matched and mismatched noise conditions at varying SNR levels. This is confirmed by the evaluation results, which show that DDAE outperforms the other algorithms on all levels, except for stationary noise where the logMMSE approach performs better. In addition, the comparative findings demonstrate that our method provides superior enhancement performance based on 3 widely used objective metrics. It is essential to find the ideal compromise between the parameter efficiency and the speech enhancement performance of the model. However, the majority of the baseline models cannot deliver great parameter efficiency. The experimental results suggest that the proposed model is superior to other speech enhancement frameworks in terms of trade-off. REFERENCES [1] W. Helali, Ζ. Hajaiej, and A. Cherif, "Real time speech recognition based on PWP thresholding and MFCC using SVM,” Engineering, Technology & Applied Science Research, vol. 10, no. 5, pp. 6204-6208, Oct., 2020, https://doi.org/10.48084/etasr.3759. [2] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-Dependent Pre- Trained Deep Neural Networks for Large-Vocabulary Speech Recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012, https://doi.org/ 10.1109/TASL.2011.2134090. [3] X. Lu, M. Unoki, S. Matsuda, C. Hori, and H. Kashioka, "Controlling Tradeoff Between Approximation Accuracy and Complexity of a Smooth Function in a Reproducing Kernel Hilbert Space for Noise Reduction,” IEEE Transactions on Signal Processing, vol. 61, no. 3, pp. 601–610, Oct. 2013, https://doi.org/10.1109/TSP.2012.2229991. [4] Y. Bengio, "Learning Deep Architectures for AI,” Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, Nov. 2009, https://doi.org/10.1561/2200000006. [5] A. A. Alasadi, T. H. Aldhayni, R. R. Deshmukh, A. H. Alahmadi, and A. S. Alshebami, "Efficient Feature Extraction Algorithms to Develop an Arabic Speech Recognition System,” Engineering, Technology & Applied Science Research, vol. 10, no. 2, pp. 5547–5553, Apr. 2020, https://doi.org/10.48084/etasr.3465. [6] A. Samad, A. U. Rehman, and S. A. Ali, "Performance Evaluation of Learning Classifiers of Children Emotions using Feature Combinations in the Presence of Noise,” Engineering, Technology & Applied Science Research, vol. 9, no. 6, pp. 5088–5092, Dec. 2019, https://doi.org/ 10.48084/etasr.3193. [7] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun, "Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, Jun. 2007, https://doi.org/10.1109/CVPR.2007.383157. [8] A. L. Maas, Q. V. Le, T. M. O’Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent Neural Networks for Noise Reduction in Robust ASR,” in Proceedings of The International Conference on Acoustics, Speech, & Signal Processing, Dec. 2012. [9] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443– 445, Apr. 1985, https://doi.org/10.1109/TASSP.1985.1164550. [10] U. Mittal and N. Phamdo, "Signal/noise KLT based approach for enhancing speech degraded by colored noise,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 2, pp. 159–167, Mar. 2000, https://doi.org/10.1109/89.824700. [11] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?,” Journal of the ACM, vol. 58, no. 3, pp. 11:1-11:37, Mar. 2011, https://doi.org/10.1145/1970392.1970395. [12] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder,” in Interspeech 2013, Aug. 2013, pp. 436– 440, https://doi.org/10.21437/Interspeech.2013-130. [13] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion,” The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, Sep. 2010. [14] L. L. N. Wong, S. D. Soli, S. Liu, N. Han, and M.-W. Huang, "Development of the Mandarin Hearing in Noise Test (MHINT),” Ear and Hearing, vol. 28, no. 2 Suppl, pp. 70S-74S, Apr. 2007, https://doi.org/10.1097/AUD.0b013e31803154d0.