INT J COMPUT COMMUN, ISSN 1841-9836 Vol.7 (2012), No. 3 (September), pp. 574-585 Bionic Wavelet Based Denoising Using Source Separation M. Talbi, A.B. Aicha, L. Salhi, A. Cherif Mourad Talbi, Lotfi Salhi, Adnene Cherif Faculty of Sciences of Tunis, Laboratory of Signal Processing, University Campus, 2092 El Manar II, Tunis, Tunisia E-mail: mouradtalbi196@yahoo.fr, lotfi.salhi@laposte.net, adnane.cher@fst.rnu.tn Anis Ben Aicha Université de Carthage, Ecole Supérieure des Communications Laboratoire de recherche COSIM Route de Raoued 3.5 Km, Cité El Ghazala, Ariana, 2083, Tunisie, Tél. : +216 71 857 000 - Fax : +216 71 856 829 E-mail: ben.aicha.anis@gmail.com Abstract: We consider the problem of speech denoising using source separation. In this study we have proposed a hybrid technique that consists in applying in the first step, the Bionic Wavelet Transform (BWT) to two different mixtures of the same speech sig- nal with noise. This speech signal is corrupted by a Gaussian white noise with two different values of the Signal to Noise Ratio (SNR) in order to obtain those two mix- tures. The second step consists in computing the entropy of each bionic wavelet coefficient and finds the two subbands having the minimal entropy. Those two sub- bands are used to estimate the separation matrix of the speech signal from noise by using the source separation. Our proposed technique is evaluated by comparing it to the denoising technique based on source separation in time domain. Keywords: Bionic wavelet transform, Blinde Source Separation, entropy, speech enhancement. 1 Introduction In signal processing, the source separation constitutes an attractive problem. Its goal is to extract from many signals mixture, the meaningful signals. This is performed with minimum a priori information on the mixture process. In the case of instantaneous mixture, many approaches employing the ICA algorithm can solve the problem of the source separation. One of those approaches permits to estimate the unmixing matrix by minimizing the mutual information between the separated sources [1, 2]. Others exploit the non-Gaussianity of the source signals and perform separation by maximizing this non-Gaussianity [2]. For example, a technique using a subband decomposing in combination with ICA, has been developed by Tanaka et al [3]. Kisilev et al [4] have employed geometric algorithms for separating mixed signals. Rachid Moussaoui et al [5] have proposed an algorithm using the idea of applying a preprocessing in the transformed domain but the separation is performed in the time domain. In this paper we have used the source separation with Bionic Wavelet Transform (BWT) for enhancement of speech signal corrupted by white noise. The source separation is performed with ICA and instead of using the wavelet packet transform as used in the technique proposed by Rachid et al [5], we have used in this work the BWT. Copyright c⃝ 2006-2012 by CCC Publications Bionic Wavelet Based Denoising Using Source Separation 575 2 Restrictions of ICA The ICA standard formulation needs at least as many sensors as sources. Therefore, we suppose in this paper that the source number is equals to the sensor number. In the instantaneous mixture case, the sources are not directly observed but as a linear combination such as: xi(t) = j=N∑ j=1 ai jS j(t), (1) where x are the observed signals, s are the source signals and A = [ai j] is unknown full rank mixing matrix. Figure 1: ICA Principle. Figure 1 shows the ICA principle. The ICA aim is practically to find the inverse matrix of A, which is the unmixing matrix W = A−1. To make an estimation of W, certain assumptions have to be made and some restrictions have to be imposed [2]: we assume that the individual components si(t) are statistically independent over the observation time and the individual components must have non Gaussian distribu- tions. In comparison to previous work, the novelty of the approach of Rachid Moussaoui et al [5] resides in the preprocessing implementation before the source separation process in to: - Relax the previous restrictions by increasing the non-Gaussanity which is a pre-requirement for ICA. - Initiate a preliminary separation by decreasing the mutual information between the resultant sig- nals from the preprocessing. The preprocessing transforms the observed signals to find an adequate representation where the sig- nals distributions are non-Gaussian. For this reason, the wavelet transform is used in order to emphasize the non Gaussian nature of the observed signals. Once we have found the inverse matrix W with the wavelet packets based ICA then, the separation is performed in the time domain [5]. Figure 2 illustrated an overview of the system proposed by Rachid Moussaoui et al [5]. In this paper we have chosen s2(t) to be a white noise that corrupted the clean speech signal s1(t) with two different values of the Signal to Noise Ratio (SNR). 576 M. Talbi, A.B. Aicha, L. Salhi, A. Cherif Figure 2: Overview of the source separation system proposed by Rachid Moussaoui et al [5]. 3 The proposed technique The proposed speech enhancement system, illustrated in Figure 4, is inspired from that of Rachid Moussaoui et al [5]. The latter is conceived for multi-channel source separation and is based on wavelet based independent component analysis. It comprises two modules shown in dotted boxes in Figure 4. The first module (pre-processing) extracts appropriate signals from the observed signals in order to facilitate the separation of the speech and noise signals. For this, the observed signals are projected on suitable bases, more specifically on bionic wavelet bases. The second module (speech and noise separation) performs the source separation using standard ICA [1]. The input of this module is the extracted signals from module 1 and the observed signals. Its output is the cleaned or the enhanced speech signal ŝ1(t). Figure 4 illustrated an overview of the proposed speech enhancement system which is summarized by the following steps: i Decompose the observed signals (the two noisy speech signals) into bionic wavelet subbands by applying the BWT. ii Compute the entropy value of each subband and select the two subbands having the minimum entropy. iii Use those two subbands as the inputs of the ICA system in order to estimate the separation matrix, A−1. iv Estimate the enhanced speech signal ŝ1(t) by applying A−1 to the temporal mixtures x1(t) and x2(t) . The used entropy in this work, is the Shannon entropy which is defined for each subband w·, j, 1 ≤ j ≤ 30 as: H( j) =− ∑ pilog( pi). (2) Bionic Wavelet Based Denoising Using Source Separation 577 Note that in the expression w·, j, 1 ≤ j ≤ 30, · is replaced by 1 if we apply the BWT to x1(t) and is replaced by 2 if we apply the BWT to x2(t). The probability pi is expressed as: pi = w·, j(i)2 ∥W∥2 , 1 ≤ i ≤ N and N is the number of samples in the subband w·, j and W is obtained by concatenating all the subbands w·, j, 1 ≤ j ≤ 30. 4 The bionic wavelet transform J. Yao and Y. T. Zhang have proposed the bionic wavelet transform (BWT) as a new time-frequency technique by referring to the perceptual model [6]. The term "bionic" means that the BWT is guided by an active biological mechanism [7]. Moreover, the BWT decomposition is both perceptually scaled and adaptive [8]. The initial perceptual aspect of the transform comes from the logarithmic spacing of the baseline scale variables which are designed to match basilar membrane spacing [8]. Then, two adapta- tion factors control the time-support employed at each scale, based on a non-linear perceptual model of the auditory system [8]. The basis of this transform is the Giguerre -Woodland non-linear transmission line model of the auditory system [9, 10], an active-feedback electro-acoustic model incorporating the auditory canal, middle ear, and cochlea [8]. The model yields estimates of the time-varying acoustic compliance and resistance along the displaced basilar membrane, as a physiological acoustic mass func- tion, cochlear frequency-position mapping, and feedback factors representing the active mechanisms of outer hair cells. The net result can be seen as a technique for the estimation of the time-varying quality factor Qeq of the cochlear filter banks as the input sound waveform function [8]. The references [6–9] give the complete details on the elements of this model. The BWT adaptive nature is ensured by a time- varying linear factor T (a,τ) which represents the scaling of the cochlear filter bank quality factor Qeq at each scale over time [8]. For each scale and time, the adaptation factor T (a,τ) of BWT is computed by using the update equation [8]: T (a,τ+∆τ) = 1[ 1 − G1 GsGs+|XBWT (a,τ)| ] [ 1 + G2| ∂∂t XBWT (a,τ)| ] (3) where Cs is a constant (typically Cs = 0.8) that represents non linear saturation effects in the cochlear model [6, 8]. The quantities G1 and G2 are respectively the active gain factor, which represents the outer hair cell active resistance function, and the active gain factor representing the time-varying compliance of the basilar membrane [8]. Practically speaking, the partial derivative in equation (3) can be approximated by using the first difference of the previous points of the BWT at that scale [8]. XBWT (a,τ) represents the bionic wavelet transform (BWT) of the signal x(t) and it is given by: XBWT (a,τ) = 1 T (a,τ) √ a ∫ x(t) · φ̃∗ [ t −τ a · T (a,τ) ] · e− jω0( t−τ a )dt, (4) where a denotes the parameter of scale, τ is the shifting parameter in time and φ̃ is the mother wavelet envelop given by [7]: φ(t) = 1 T (a,τ) √ a φ̃ [ t T (a,τ) ] · e jω0t (5) where ω0 is the base fundamental frequency of the unscaled mother wavelet. In practice ω0 is equals to 15165.4 for the human auditory system [6]. The discretization of the scale a is achieved by employing a pre-determined logarithmic spacing across the desired frequency range, so that at each scale the center frequency is expressed by [8]: ωm = ω0 (1.1623)m ,m = 0,1,2, . . . . (6) 578 M. Talbi, A.B. Aicha, L. Salhi, A. Cherif Based on Yao and Zhang’s original work for cochlear implant coding [9], coefficients at 22 scales, m = 7,. . . ,28, are calculated employing numerical integration of the continuous wavelet transform. These 22 scales correspond to center frequencies logarithmically spaced from 225 Hz to 5300 Hz. (Although the scales used here match those from Yao and Zhang’s original work, empirical variation of the number of scales and frequency placement showed minimal effect on the overall enhancement results). For this implementation, we have used coefficients at 30 scales. In the formula (4), the role of first factor T (a,τ) multiplying √ a is to ensure that the energy remains the same for each mother wavelet. The role of second factor T (a,τ) is to adjust the envelop φ̃(t) without adjusting the central frequency of φ(t) [7]. Thus, the main difference between (BWT) and the continuous wavelet transform (CWT) is based on the fact that the time-frequency resolution achieved by (BWT) can be adjusted in an adaptive manner not only by frequency variation of the signal but also by instantaneous amplitudes of this signal. It is the mother wavelet which makes the continuous wavelet transform adaptive, while the adaptive characteristic of the BWT comes from the mechanism of active control in the human auditory model. which adjusts the mother wavelet associated to (BWT) according to the analyzed signal. Basically, the idea of the (BWT) is inspired from the fact that we need to make the mother wavelet envelop variable in time according to the signal characteristics. The employed mother wavelet φ(t) in [7] is the Morlet wavelet and its envelop φ̃(t) is given by [8]: φ̃(t) = e [ −( tT0 ) 2 ] (7) where T0 denotes the initial time-support. Figure 3 illustrated the real and the imaginary parts of the complex Morlet mother wavelet. Figure 3: The Morlet wavelet. It can be shown [7, 9] that the obtained BWT coefficients, XBWT (a,τ) are derived by using the fol- lowing formula [8]: XBWT (a,τ) = K(a,τ)XWT (a,τ), (8) where K(a,τ) is given by: K(a,τ) = √ π C T0√ 1 + T 2(a,τ) (9) where C represents a normalizing constant calculated from the squared mother wavelet integral. Bionic Wavelet Based Denoising Using Source Separation 579 This representation yields to an effective computational technique for calculating in direct manner, the BWT coefficients from those of the wavelet transform. This is performed without using the BWT definition given by equation (4). There are some key differences between the discretized CWT employing the Morlet wavelet used for the BWT and a filterbank based WPT using an orthonormal wavelet. One of them is that the WPT provides a perfect reconstruction, while the discretized CWT is an approximation whose exactness depends on the number and placement of frequency bands selected [8]. Figure 4: Overview of the proposed system. 5 Criterion of evaluation For evaluating our proposed technique, we have compared it to the temporal technique based on runica. The evaluation is based on SNR, SSNR, ISd and PESQ computation. These parameters are defined as follow: - Signal-to-noise ratio The signal-to-noise ratio (SNR) of the enhanced speech signal is defined by: S NRdB = 10Log10  ∑N−1 n=0 x(n) 2∑N−1 n=0 (x(n) − x̂(n)) 2  , (10) where x(n) and x̂(n) represent respectively the original and the enhanced speech signals, and N is the samples number per signal. - Segmental signal to noise ratio The segmental signal-to-noise ratio (segSNR) is calculated by averaging the frame based SNRs over the signal: segS NRdB = 1 M M−1∑ m=0 10Log10  ∑Nm+N−1 n=Nm x(n)2∑Nm+N−1 n=Nm (x(n) − x̂(n))2  , (11) 580 M. Talbi, A.B. Aicha, L. Salhi, A. Cherif where M designates the number of frames, N is the size of frame, and Nm is the beginning of the m− th frame. As the SNR can become negative and very small during silence periods, the segSNR values are limited to the range of [-10dB, 35dB] as per [10]. - Itakura-Saito distance The distance of Itakura-Saito (ISd) measures the spectrum changes and can be computed employ- ing the coefficients of linear prediction (LPC) and this according to the following equation: IS D(a,b) = (a − b)T R(a,b) aT Ra , (12) where a represents the LPC vector of the original speech signal x(n). R is the matrix of autocorre- lation and b is the LPC coefficients vector of the enhanced speech signal x̂(n). In this work, a 10th order LPC based measure is employed. - Perceptual evaluation of speech quality The perceptual evaluation of speech quality (PESQ) algorithm [11, 12] is an objective quality measure, that is approved as the ITU-T recommendation P.862. It is a tool of objective measure- ment conceived to predict the results of a subjective Mean Opinion Score (MOS) test. It was proved [13, 14] that the PESQ is more reliable and correlated better with MOS than the traditional objective speech measures. 6 Results and discussion From Table 1 to Table 8, we report the obtained results from the application of our proposed speech enhancement technique and the temporal technique based on runica on eight noisy sentences taken from the Timit database. Table 1: Sentence 1. Parameters Proposed method Temporal method SNRi of the first mixture 0.7672 0.7672 SNRi of the second mixture -3.4638 -3.4638 SNRf(dB) 69.5528 58.7147 SSNRi of the first mixture -3.5715 -3.5715 SSNRi of the second mixture -6.3509 -6.3509 SSNRf(dB) 34.8736 34.1366 PESQi of the first mixture 1.3110 1.3110 PESQi of the second mixture 1.0983 1.0983 PESQf 4.4989 4.4936 ISdi of the first mixture 2.4029 2.4029 ISdi of the second mixture 3.9475 3.9475 ISdf 4.181 10−10 4.948 10−8 These results show clearly that our proposed technique outperforms the temporal technique of source separation using standard ICA [1]. Fig. 5 and Fig. 6 illustrate an example of speech enhancement using our proposed technique. Bionic Wavelet Based Denoising Using Source Separation 581 Table 2: Sentence 2. Parameters Proposed method Temporal method SNRi of the first mixture -2.1038 -2.1038 SNRi of the second mixture -6.6325 -6.6325 SNRf(dB) 66.5296 48.9184 SSNRi of the first mixture -5.8749 -5.8749 SSNRi of the second mixture -7.8815 -7.8815 SSNRf(dB) 34.2126 30.7198 PESQi of the first mixture 1.4577 1.4577 PESQi of the second mixture 1.2445 1.2445 PESQf 4.4981 4.4535 ISdi of the first mixture 2.8500 2.8500 ISdi of the second mixture 4.6523 4.6523 ISdf 1.4043 10−9 3.1847 10−6 Table 3: Sentence 3. Parameters Proposed method Temporal method SNRi of the first mixture 4.2400 4.2400 SNRi of the second mixture -0.2606 -0.2606 SNRf(dB) 77.4719 49.5281 SSNRi of the first mixture -1.4316 -1.4316 SSNRi of the second mixture -4.6465 -4.6465 SSNRf(dB) 34.7279 31.6486 PESQi of the first mixture 1.9873 1.9873 PESQi of the second mixture 1.7460 1.7460 PESQf 4.4999 4.4621 ISdi of the first mixture 1.8622 1.8622 ISdi of the second mixture 3.1204 3.1204 ISdf 2.5837 10−10 8.2210 10−6 7 Conclusion In this paper, we have proposed a new speech enhancement technique that consists in applying in the first step, the Bionic Wavelet Transform (BWT) to two different mixtures of the same speech signal with gaussian white noise with two different values of Signal to Noise Ratio (SNR). The second step consists in computing the entropy of each bionic wavelet coefficient and finds the two subbands having the minimal entropy. Those two subbands are used to estimate the separation matrix of the speech signal from noise by employing the source separation. The obtained results from the SNR, SSNR, ISd and PESQ computation, show clearly that the proposed speech enhancement technique outperforms the temporal technique of source separation using standard ICA. 582 M. Talbi, A.B. Aicha, L. Salhi, A. Cherif Table 4: Sentence 4. Parameters Proposed method Temporal method SNRi of the first mixture 1.6366 1.6366 SNRi of the second mixture -2.7143 -2.7143 SNRf(dB) 61.8042 55.4325 SSNRi of the first mixture -4.3027 -4.3027 SSNRi of the second mixture -6.3345 -6.3345 SSNRf(dB) 31.4223 29.4299 PESQi of the first mixture 1.5414 1.5414 PESQi of the second mixture 1.2476 1.2476 PESQf 4.4816 4.4506 ISdi of the first mixture 2.5180 2.5180 ISdi of the second mixture 4.0632 4.0632 ISdf 7.4591 10−8 8.9142 10−7 Table 5: Sentence 5. Parameters Proposed method Temporal method SNRi of the first mixture -0.7340 -0.7340 SNRi of the second mixture -5.7034 -5.7034 SNRf(dB) 59.5033 57.0129 SSNRi of the first mixture -5.6047 -5.6047 SSNRi of the second mixture -7.6901 -7.6901 SSNRf(dB) 31.4463 30.8201 PESQi of the first mixture 1.3907 1.3907 PESQi of the second mixture 1.1391 1.1391 PESQf 4.4814 4.4748 ISdi of the first mixture 2.6384 2.6384 ISdi of the second mixture 4.2292 4.2292 ISdf 2.4528 10−8 1.0487 10−7 Table 6: Sentence 6. Parameters Proposed method Temporal method SNRi of the first mixture 2.5544 2.5544 SNRi of the second mixture -2.1241 -2.1241 SNRf(dB) 62.4473 62.2521 SSNRi of the first mixture -3.3643 -3.3643 SSNRi of the second mixture -5.8605 -5.8605 SSNRf(dB) 31.4816 31.4398 PESQi of the first mixture 1.3805 1.3805 PESQi of the second mixture 1.0564 1.0564 PESQf 4.4929 4.4926 ISdi of the first mixture 2.6047 2.6047 ISdi of the second mixture 4.1036 4.1036 ISdf 9.3330 10−8 1.0324 10−7 Bionic Wavelet Based Denoising Using Source Separation 583 Table 7: Sentence 7. Parameters Proposed method Temporal method SNRi of the first mixture -0.0411 -0.0411 SNRi of the second mixture -4.9428 -4.9428 SNRf(dB) 69.0167 67.9430 SSNRi of the first mixture -5.7362 -5.7362 SSNRi of the second mixture -7.8201 -7.8201 SSNRf(dB) 33.9383 33.8069 PESQi of the first mixture 1.6015 1.6015 PESQi of the second mixture 1.4326 1.4326 PESQf 4.4978 4.4974 ISdi of the first mixture 2.5867 2.5867 ISdi of the second mixture 4.1275 4.1275 ISdf 4.6205 10−10 6.7472 10−10 Table 8: Sentence 8. Parameters Proposed method Temporal method SNRi of the first mixture -1.7018 -1.7018 SNRi of the second mixture -6.5995 -6.5995 SNRf(dB) 68.1109 52.2155 SSNRi of the first mixture -5.8031 -5.8031 SSNRi of the second mixture -7.8610 -7.8610 SSNRf(dB) 33.2544 29.8972 PESQi of the first mixture 1.3814 1.3814 PESQi of the second mixture 1.1889 1.1889 PESQf 4.4968 4.4504 ISdi of the first mixture 3.3273 3.3273 ISdi of the second mixture 5.1147 5.1147 ISdf 3.1514 10−9 5.4382 10−6 584 M. Talbi, A.B. Aicha, L. Salhi, A. Cherif Figure 5: (a) clean speech signal, (b) enhanced speech signal. Figure 6: (c) first mixture, (d) second mixture. Bionic Wavelet Based Denoising Using Source Separation 585 Bibliography [1] A. J. Bell, T. J. Sejnowski, An information maximization approach to blind separation and blind deconvolution, Neural Computation, Vol.7, pp.1004-1034, 1995. [2] A Hyvarinen, J. Karhunen, E. Oja, Independent component analysis, Wiley and Sons, 2001. [3] T. Tanaka, A. Cichocki, Subband decomposition independent component analysis and new perfor- mance criteria, ICASSP, pp.541-544, 2004. [4] P. Kisilev, M. Zibulevsky, Blind source separation using multinode sparse representation, ICIP, 2001. [5] R. Moussaoui, J. Rouat, R. Lefebvre, Wavelet Based Independent Component Analysis for Multi- Channel Source Separation, ICASSP, pp.645-648, 2006. [6] J. Yao, Y. T. Zhang, Bionic wavelet transform: a new timefrequency method based on an auditory model, IEEE Trans. on Biomedical Engineering Vol.48, No.8, pp.856-863, 2001. [7] Xiaolong Yuan, B.S.E.E. A THESIS, Ť Auditory Model-based Bionic Wavelet Transform for speech enhancement. Electrical and computer engineering. [8] M. T. Johnsona, X. Yuanb, Y. Rena, Speech signal enhancement through adaptive wavelet threshold- ing. in conference Elsevier, pp.123-133, 2007. [9] J. Yao, Y. T. Zhang, The application of bionic wavelet transform to speech signal processing in cochlear implants using neural network simulations, IEEE Trans. Biomed. Eng., Vol.49, No.11, pp. 1299-1309, 2002. [10] B. Chen, P. C. Loizou, A Laplacian-based MMSE estimator for speech enhancement, Speech Com- munication, Vol.49, No.2, pp.134-143, 2007. [11] ITU-T P.862. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs, ITU Recommen- dation P.862, 2001. [12] A. W. Rix, J. G. Beerends, M. P. Hollier, A. P. Hekstra, Perceptual evaluation of speech quality (pesq) - a new method for speech quality assessment of telephone networks and codecs, ICASSP, pp.749-752, 2001. [13] Y. Hu, P. C. Loizou, Evaluation of objective measures for speech enhancement, IEEE Trans. Speech, Audio Processing, Vol.16, No.1, pp.229-238, 2008. [14] E. Zavarehei, S. Vaseghi, Q. Yan. Inter- frame modeling of DFT trajectories of speech and noise for speech enhancement using Kalman filters, Speech Communication, Vol.48, No.11, pp.1545-1555, 2006.