Microsoft Word - 10-3759_s_oninefirst_ETASR_V10_N5_pp6204-6208


Engineering, Technology & Applied Science Research Vol. 10, No. 5, 2020, 6204-6208 6204 
 

www.etasr.com Helali et al.: Real Time Speech Recognition based on PWP Thresholding and MFCC using SVM  

 
Real Time Speech Recognition based on PWP 

Thresholding and MFCC using SVM  
 

Wafa Helali 

Faculty of Sciences of Tunis 
University Tunis El-Manar 

Tunis, Tunisia 

Zied Hajaiej 

Faculty of Sciences of Tunis 
University Tunis El-Manar,  

Tunis, Tunisia 

Adnen Cherif  

Faculty of Sciences of Tunis 

University Tunis El-Manar 

Tunis, Tunisia 
 

Abstract−The real-time performance of Automatic Speech 

Recognition (ASR) is a big challenge and needs high computing 

capability and exhaustive memory consumption. Getting a robust 

performance against inevitable various difficult situations such as 
speaker variations, accents, and noise is a tedious task. It’s 

crucial to expand new and efficient approaches for speech signal 

extraction features and pre-processing. In order to fix the high 

dependency issue related to processing succeeding steps in ARS 

and enhance the extracted features’ quality, noise robustness can 

be solved within the ARS extraction block feature, removing 

implicitly the need for further additional specific compensation 
parameters or data collection. This paper proposes a new robust 

acoustic extraction approach development based on a hybrid 

technique consisting of Perceptual Wavelet Packet (PWP) and 

Mel Frequency Cepstral Coefficients (MFCCs). The proposed 

system was implemented on a Rasberry Pi board and its 

performance was checked in a clean environment, reaching 99% 

average accuracy. The recognition rate was improved (from 80% 
to 99%) for the majority of Signal-to-Noise Ratios (SNRs) under 

real noisy conditions for positive SNRs and considerably 
improved results especially for negative SNRs. 

Keywords-automatic speech recognition; perceptual wavelet 

packet transform; Mel frequency cestrum coefficients; SVM; 

Raspberry Pi 3 

I. INTRODUCTION  

Speech recognition technology is a widespread dynamic 
research area. Automatic Speech Recognition (ASR) has been 
vastly used in many human–machine interaction applications, 
such as electronics [1], mobile robots [2-4], car audio systems 
[5], manipulators in industrial assembly lines [6], and security 
systems [7]. Nonetheless, robust performance constitutes 
obviously a real concern for any real-time application due to 
various difficult conditions such as noisy background, accents, 
and speaker variations. As a result, the need for accuracy, high 
performance and fast embedded ASR are growing 
continuously. Many projects have been invested in ASR 
techniques in order to achieve proficient embedded systems 

that are able to imitate human behavior at all levels. The ASR 
accuracy obtained in laboratory environments is quite high, but 
once the recognition system is placed in a real background, the 
recognition rate gets roughly low. Several embedded voice 
recognition systems have been reported and some of them are 
implemented in Field Programmable Gate Arrays (FPGAs) [8-
10] or in Digital Signal Processors (DSPs) [11, 12], all of them 
with a modest accuracy rate. ASR state-of-the-art systems are 
linking the performance to reasonable and controlled training 
conditions. Considering the noise impact, the system accuracy 
may become unacceptably low in some sensitive environments. 
Several researchers have shown their interests on speech 
feature extraction methods such as Linear Prediction 
Coefficients (LPC) [13], Perceptual Linear Predictive (PLP) 
[14] and Linear Predictive Cepstral Coefficients (LPCC) which 
are used due to their effectiveness and simplicity in 
speech/speaker recognition [15-16]. 

Mel Frequency Cepstral Coefficients (MFCCs) constitute 
feature parameters that present widely popular acoustic features 
mostly used in speech recognition [17]. In spite of its good 
performance achieved in clean background, the MFCCs feature 
extraction for speech recognition has been used to enhance 
speech recognition system performance in noisy environments. 
The most cited methods are the Cepstral Mean Subtraction 
(CMS) [18], the Power-Normalized Cepstral Coefficients 
(PNCCs) [19], and the Cepstral Mean Normalization (CMN) 
[20] which is a popular feature compensation method dealing 
with convolutional noise. In this same context, the majority of 
the published works demonstrated that the wavelet-based 
feature extraction [21-24] has better performance improvement 
than traditional Cepstral features in noisy environments. The 
already presented wavelet-based techniques rely on the multi-
resolution PWP properties and combine the extracted MFCC 
features from various frequency sub bands to a unique feature 
vector. 

In this paper, a new method for real-time speech 
recognition is proposed under both clean and noisy 

Corresponding author: Wafa Helali (wafa.helali@yahoo.fr) 


Engineering, Technology & Applied Science Research Vol. 10, No. 5, 2020, 6204-6208 6205 
 

www.etasr.com Helali et al.: Real Time Speech Recognition based on PWP Thresholding and MFCC using SVM  

 
environments, and it is presented and implemented on a 
Raspberry Pi3 board. The proposed method is based on MFCC 
extraction from speech signal after applying wavelet 
thresholding. The main idea relies on obtaining coefficient 
exploitation which represents the wavelet transform 
decomposition after eliminating the small coefficients 
associated with the noise usually located in high frequencies. 
Then, the MFCCs method is applied to the signal. Finally, a 
feature vector is acquired by the obtained MFCC concatenation 
that constitutes one input parameters of the SVM used for 
classification. 

Our main contribution resides in ensuring a good 
recognition rate, close to 100%, for positive SNRs. In real 
noisy areas, particularly within the range of [0, -10db], 
challenging results have been reached using the proposed real 
time approach. Obviously, real time implementation with 
Raspberry Pi gives excellent recognition performance in clean 
and noisy states. 

II. FEATURE EXTRACTION 

Feature extraction is the process of retaining useful 
information within a speech signal when rejecting the 
redundant and unwanted information. It represents merely the 
speech signal parameterization. This process includes: 

• Segmentation of the speech signal into windows. 

• Speech signal frequency decomposition into critical bands 
by transforming it into PWP. 

• Parameter extraction.  

• Coefficient calculation. 

The feature extraction is mostly used, thanks to its better 
performance for ASR and low computational complexity under 
standard environment. MFCC and its hybrid feature extraction 
technique with PWP will be employed. A brief outline of the 
proposed method is described in Figure 1. 

 
Inverse 

cosine 

transform

IDFT

Speech 

signal
• Pre-emphasis

• Elimination of CC

• Hamming Windows 

DFT PWP

Coefficient

s 

PWTFCC

Log (.)
Threshold 

Coefficients

2

 
Fig. 1.  PWTFCC algorithm.  

A. Mel Frequency Cepstral Coefficients Meaning  

MFCCs are frequency field features based on the human ear 
scale. The scale [25] is approximately linear until 1kHz and 
logarithmic at higher frequencies. These frequency domain 
features [26] offer more accuracy than time domain ones. In 
this technique, the same information can be incorporated in less 
coefficients, making it more compact. The calculation proceeds 
as described in our previous work [27]. Afterwards, FFT is 
computed for each speech frame so that signal frequency 

components could be extracted in time-domain. Then, the 
logarithmic Mel scaled filter bank is applied to the FFT frame. 
The log filter bank energies are calculated using the DCT. Only 
the first thirteen DCT coefficients are kept and the rest are 
discarded. These DCT coefficients decorrelate the features as 
well they arrange them in decreasing information order. 

B. Perceptual Wavelet Packet 

The wavelets offer a technique that represents the time-
frequency domain. It has usually been used for signal 
decomposition into high and low frequency components. Its 
coefficients depict frequency content similarity measured 
between a chosen wavelet function and a given signal. These 
coefficients are calculated as a convolution of the signal and 
the scaled wavelet function, which can be explained as an 
expanded band-pass filter due to its band-pass spectrum [27-
28]. Subsequently, the resulted wavelet transforms are 
exploited as a filter bank named Perceptual Wavelet Packet 
(PWP). The PWP results to a non-redundant restoration, which 
gives better spectral and spatial localization of signal 
configuration. Compared with other multi-scale representations 
such as Gaussian and Laplacian pyramid the PWP represents 
the privilege of multilevel decomposition, where the signal is 
decomposed in ‘approximation’ and ‘detail’ coefficients at 
each level [29], through an equivalent process to high-pass and 
low pass filtering components. As mentioned above, the 
wavelet transform was introduced for time and frequency 
analysis of transient signals and it was extended to multi-
resolution wavelet transform theory via a Finite Impulse 
Response (FIR) filter approximation. The discrete wavelets 
used in multi-resolution analysis constitute an orthonormal 
basis. The PWP decomposition steps are explicated taking into 
account details and approximation coefficients. 

III. REAL TIME IMPLEMENTATION SLANT 

The proposed speech recognition system’s block diagram is 
illustrated in Figure 2. The various system steps are explained 
in this section. The microphone input speech is sampled at 
16kHz. First of all, we mention that a Voice Activity Detector 
(VAD) is used as a noise estimator. The VAD’s output presents 
the binary signal resulting of the comparison between the 
speech input signal and the threshold value. Thus, VAD value 
is either true (VAD=1) when the measured input is greater than 
the threshold and the signal is considered as a voiced frame, or 
the VAD value is false (VAD=0) and the signal frame is 
considered as a noisy frame. The second approach step consists 
on speech signal decomposition with the PWP. The PWP 
outcome is a multilevel decomposition, in which the signal is 
divided into ‘approximation’ and ‘detail’ coefficients at every 
stage. This process is similar to low-pass and high pass 
filtering. The simplest way to remove noise is by using the 
wavelet coefficients, which are the result of the wavelet 
transform decomposition. The small coefficients associated 
with the noise through the threshold step are eliminated. 
Indeed, the threshold purpose offers the ideal components from 
the noisy signal giving the noise level estimation. There are 
various threshold methods. Between the most commonly used 
are the hard and soft threshold. They are used and adopted in 
this work and modeled by: 


Engineering, Technology & Applied Science Research Vol. 10, No. 5, 2020, 6204-6208 6206 
 

www.etasr.com Helali et al.: Real Time Speech Recognition based on PWP Thresholding and MFCC using SVM  

 
( )( )y sign x x λ= −
 
   (1) 

where x, y and �  present respectively the input signal, the 
threshold signal, and the threshold value. The MFCCs are 
applied to the signal after the threshold and concatenation 
steps. The signal is filtered and windowed by the hamming 
window for FFT transformation. Next, the signal passes 
through a Mel-filter to obtain the twelve Cepstral coefficients. 
Finally, the resulted Cepstral coefficients are concatenated to 
construct the SVM classifier input. Similarly, this technique is 
applied also to our proper training speech database containing 
spoken words which are recorded by a mono-speaker.  

 
VAD

Approximation 

coefficients CD

PWP

Thresholding

concatenation

MFCC

SVM classifier

Decision

Future 

extraction 

Trainin Model

Training set 

« .wav »

Approximation 

coefficients CA

Thresholding

Speech

 
Fig. 2.  The PWP decomposition steps of a 1D signal for three levels. 

In order to increase the performance of our proposed speech 
recognition algorithm, a denoising module was added to the 
proposed system to enhance its robustness. This denoising 
module relies on Adaptive Median Filtering (AMF) [30] which 
is able to eliminate the data speckle noise without harming the 
embedded sharp contrasts. It’s noticeable that the noise impact 
can be significantly reduced by applying the AMF to the 
temporal modulation spectrum, which is the Fourier transform 
for either real or imaginary acoustic spectrograms along the 
time axis. Thus, the resulting speech features can be more 
noise-robust and give better speech recognition performance. 
Figure 3 represents the modified speech recognition system in 
which the AMF is introduced as a speech denoising module. 

 
Learning phase

Learning

(Words, 

phonomenon)

Acoustic 

vector 

extraction

Calculation of 

models

(Words, 

phonomenon)

Learning 

base

Speech 

acquisition

Acoustic 

vector 

extraction

Decoding

Recognition
Recognized 

word

 
Fig. 3.  The modified speech recognition system. 

IV. TEST PERFORMING AND OBTAINED RESULTS 

In order to build the speech recognition system, voice 
commands and speech models have to be optimized based on a 
solid training database. In this experiment, the training database 
contained eleven commands recorded five times by a mono-
speaker for a voice command under a silent environment. Each 
recorded data consisted of up to 4s of utterance. The speech 
recognition application needs more than just the simulation and 
the proposed algorithm was tested on a particularly suitable 
flexible platform. The complete setup has been implemented 
and tested on a Raspberry Pi 3 board. 

A. Used Raspberry Pi Card Synopsis and System Pattern 

The Raspberry Pi 3 is simply a performed sized card 
processor [31], containing a micro-controller and a CPU. The 
Raspberry Pi processor core system is a Broadcom BCM2837 
System-on-Chip (SoC) multimedia processor, which has 64-bit 
quad-core ARMv8 Cortex A53 with 1GB of RAM. Besides, 
it’s equipped with 16GB expandable to 128GB. An SD card 
slot ,1.2GHz SoC processor, Video Core IV GPU, 4 USB ports, 
1 HDMI port, 40 GPIO pins which could be configured as 
digital output or input and a jack audio output. The Raspberry 
Pi is controlled by an amended version of Linux (Raspbian) 
optimized for the ARM architecture. As Raspbian is built based 
on Debian, it implicitly has all the compatibilities and features 
required for the program. Python 2.7 or 3.5 is already installed 
in the Raspbian operating system and therefore a new 
installation is not compulsory. Python 2.7 was selected because 
it owns more store community support accessible contrary to 
Python 3.5. The project requires some external Python 
packages that need to be separately installed. We have also 
installed some other measurement packages in order to 
evaluate the program performance. All specifications are 
mentioned in Table I. 

TABLE I.  RASPBERRY PI’S SOFTWARE SPECIFICATION FOR THE 
PROPOSED FRAMEWORK [30] 

Name Configuration 

OS Noobs (Rasbian) 

Programming language Python 2.7 

Libraries 
Numpy, SciPy, PyLab, 

Matplotlib, RPI.GPIO 

Audio libraries Pyaudio, Pydub, Wave 

Performance Monitoring Utilities BCMStat, TIME, htop 

 
B. Real-time Performance 

In order to validate the proposed speech recognition, based 
on MF-PWP/MFCC, algorithm’s performance, a comparison 
was made of the improvements in speech recognition accuracy 
that can be obtained through the use of several types of features 
such as MFCC, PWP/MFCC and MF-MFCC. The recognition 
experiments were performed using noisy testing data with 
various noisy conditions: white Gaussian and babble noise, 
with a noise ratio (SNR) from -10db to +10db. Figure 4 
compares the results for speech in white (Figure4(a)) and 
babble noise (Figure 4(b)) under different SNRs for several 
methods. Specifically, the recognition accuracy percent was 
compared for PWP/MFCC and MF-PWP/MFCC methods as 
described above, along with baseline MFCC and MF-MFCC. It 


Engineering, Technology & Applied Science Research Vol. 10, No. 5, 2020, 6204-6208 6207 
 

www.etasr.com Helali et al.: Real Time Speech Recognition based on PWP Thresholding and MFCC using SVM  

 
can be seen that the PWP/MFCC processing provides better 
accuracy than MFCC features for all the tested noises, although 
improvements are small in high SNRs. The lack of 
improvement observed for clean speech and high SNRs is a 
common observation for many approaches to robust speech 
recognition. It is also noted that the denoising module provides 
a trivial improvement in recognition accuracy expressly in 
lower SNRs.  

 
(a) 

 
(b) 

Fig. 4.  Comparison of recognition accuracy in: (a) white noise and (b) 

babble noise for several feature extraction methods. 

Finally, it can be observed that the proposed features based 
on MF-PWP/MFCC perform better than other features under 
all test conditions. With clean and noisy data testing, we can 
obtain a great and expectant recognition rate with the MF- 
PWP/MFCC for real-time speech recognition system. Aiming 
to validate the proposed speech recognition algorithm 
performance with several feature extractions we have measured 
the memory use, the CPU use, and the execution time. Table II 
presents the CPU and memory use. This verification is 
obtained using htop, a popular Linux text mode utility, which is 
ideal for monitoring system processes and performance 
metrics. 

TABLE II.  COMPARISON OF RESOURCE CONSUMPTION AND 
EXECUTION TIME FOR DIFFERENT FEATURE EXTRACTION METHODS 

Algorithm 

Memory 

consumption 

(bytes) 

CPU 

usage 

(%) 

Memory 

usage 

(%) 

Extraction 

time (ms) 

MFCC 8460 9.2 5.8 670 

PWP/MFCC 8486 9.2 5.8 675 

MF-MFCC 8500 10.6 6.4 678 

MF-PWP/MFCC 8580 10.9 6.4 680 

 
Table II shows that the average CPU usage is 10.9% in MF- 
PWP/MFCC. On the other hand, in MFCC, PWP/MFCC and 

MF-MFCC, it is around 9.2%, 10.6% and 10.9% respectively. 
In addition, the maximum time execution difference of the 
proposed algorithm to the other algorithms doesn’t exceed 
15ms. It was noticed that this low difference in time execution 
and resources consumption did not affect the proposed 
algorithm’s robustness. 

C. Recognition Rate Comparison  

The negative recognition rate part was given much attention 
and it represents the main contribution of this study. The 
comparison with the work in [30]is shown in Table III: 

TABLE III.  RECOGNITION RATE RESULT COMPARISON IN BABBLE 
NOISE 

SNR 

(db) 
MFCC 

MFCC 

[30] 

PWP-

MFCC 

PWP-

MFCC 

[30] 

MF-

MFCC 

MF-

MFCC 

[30] 

MF-

PWP-

MFCC 

MF-

PWP-

MFCC 

[30] 

-10 19 10.9 64 61.81 38 10.09 80 65.45 

-5 38 30.9 84 78.18 40 37.27 91 80.09 

0 59 59.09 98 85.45 60 68.15 99 97.27 

5 87 68.15 100 90 77 70.9 100 100 

10 90 76.36 100 93.65 84 85.45 100 100 

 
Generally, the published works do not take into account the 
range [0,-10db]. Recognized words in this noisy area are very 
hard to extract. Although, the recognition rate within the range 
[0, 10db] reaches nearly 100% which is also the current case. 

V. CONCLUSION 

A new real-time speech recognition algorithm has been 
proposed in this paper. The proposed algorithm exploits the 
PWP combined with MFCC in order to match speech features 
in addition to the SVM classification block. The proposed 
method proves its effectiveness to pick up an ideal recognition 
rate of about 100% in clean environment. The recognition rate 
ranges from 98.18% to 100%, even in noisy environments from 
0db to10db with the use of adaptive median filter as a 
denoising module. In the real noisy part, principally inside the 
range [0, -10db], good results have been reached with the 
proposed real time method. For real-time experimentation a 
Raspberry Pi has been used as the hardware platform. The 
proposed system’s performance was sufficient for a wide range 
of speech-controlled applications. As future work, resource 
consumption and its impact of speech embedded applications 
in addition to accuracy and timing will be investigated. 

REFERENCES 

[1] D. Karaboga and E. Kaya, “Adaptive network based fuzzy inference 
system (ANFIS) training approaches: a comprehensive survey,” 

Artificial Intelligence Review, vol. 52, no. 4, pp. 2263–2293, Dec. 2019, 
doi: 10.1007/s10462-017-9610-2. 

[2] H. A. Yanco, A. Norton, W. Ober, D. Shane, A. Skinner, and J. Vice, 

“Analysis of Human-robot Interaction at the DARPA Robotics 
Challenge Trials,” Journal of Field Robotics, vol. 32, no. 3, pp. 420–

444, May 2015, doi: 10.1002/rob.21568. 

[3] A. Pereira, C. Oertel, L. Fermoselle, J. Mendelson, and J. Gustafson, 
“Responsive Joint Attention in Human-Robot Interaction,” in 2019 

IEEE/RSJ International Conference on Intelligent Robots and Systems 
(IROS), Nov. 2019, pp. 1080–1087, doi: 

10.1109/IROS40897.2019.8968130. 

[4] I. Tiddi, E. Bastianelli, E. Daga, M. d’Aquin, and E. Motta, “Robot–City 

Interaction: Mapping the Research Landscape—A Survey of the 


Engineering, Technology & Applied Science Research Vol. 10, No. 5, 2020, 6204-6208 6208 
 

www.etasr.com Helali et al.: Real Time Speech Recognition based on PWP Thresholding and MFCC using SVM  

 
Interactions Between Robots and Modern Cities,” International Journal 
of Social Robotics, vol. 12, no. 2, pp. 299–324, May 2020, doi: 

10.1007/s12369-019-00534-x. 

[5] Y. Zheng, Y. Liu, and J. H. L. Hansen, “Navigation-orientated natural 
spoken language understanding for intelligent vehicle dialogue,” in 2017 

IEEE Intelligent Vehicles Symposium (IV), Jun. 2017, pp. 559–564, doi: 
10.1109/IVS.2017.7995777. 

[6] T. Hino, S. Ito, T. Liu, and M. Maeda, “Set-based particle swarm 

optimization with status memory for knapsack problem,” Artificial Life 
and Robotics, vol. 21, no. 1, pp. 98–105, Mar. 2016, doi: 

10.1007/s10015-015-0253-6. 

[7] A. Koduru, H. B. Valiveti, and A. K. Budati, “Feature extraction 

algorithms to improve the speech emotion recognition rate,” 
International Journal of Speech Technology, vol. 23, no. 1, pp. 45–55, 

Mar. 2020, doi: 10.1007/s10772-020-09672-4. 

[8] S. Zhu, C. Xu, J. Wang, Y. Xiao, and F. Ma, “Research and application 
of combined kernel SVM in dynamic voiceprint password authentication 

system,” in 2017 IEEE 9th International Conference on Communication 
Software and Networks (ICCSN), May 2017, pp. 1052–1055, doi: 

10.1109/ICCSN.2017.8230271. 

[9] E. Rodríguez-Orozco et al., “FPGA-based Chaotic Cryptosystem by 
Using Voice Recognition as Access Key,” Electronics, vol. 7, no. 12, p. 

414, Dec. 2018, doi: 10.3390/electronics7120414. 

[10] Q. Li et al., “MSP-MFCC: Energy-Efficient MFCC Feature Extraction 
Method With Mixed-Signal Processing Architecture for Wearable 

Speech Recognition Applications,” IEEE Access, vol. 8, pp. 48720–
48730, 2020, doi: 10.1109/ACCESS.2020.2979799. 

[11] P. J. Dugan, H. Klinck, J. A. Zollweg, and C. W. Clark, “Data Mining 

Sound Archives: A New Scalable Algorithm for Parallel-Distributing 
Processing,” in 2015 IEEE International Conference on Data Mining 

Workshop (ICDMW), Nov. 2015, pp. 768–772, doi: 
10.1109/ICDMW.2015.235. 

[12] K. Gupta and D. Gupta, “An analysis on LPC, RASTA and MFCC 

techniques in Automatic Speech recognition system,” in 2016 6th 
International Conference - Cloud System and Big Data Engineering 

(Confluence), Jan. 2016, pp. 493–497, doi: 
10.1109/CONFLUENCE.2016.7508170. 

[13] S. P. Panda, A. K. Nayak, and S. C. Rai, “A survey on speech synthesis 
techniques in Indian languages,” Multimedia Systems, vol. 26, no. 4, pp. 

453–478, Aug. 2020, doi: 10.1007/s00530-020-00659-4. 

[14] V. M. Patel, N. K. Ratha, and R. Chellappa, “Cancelable Biometrics: A 
review,” IEEE Signal Processing Magazine, vol. 32, no. 5, pp. 54–65, 

Sep. 2015, doi: 10.1109/MSP.2015.2434151. 

[15] V. M. Patel, N. K. Ratha, and R. Chellappa, “Cancelable Biometrics: A 
review,” IEEE Signal Processing Magazine, vol. 32, no. 5, pp. 54–65, 

Sep. 2015, doi: 10.1109/MSP.2015.2434151. 

[16] L. Jiao et al., “A Survey of Deep Learning-Based Object Detection,” 
IEEE Access, vol. 7, pp. 128837–128868, 2019, doi: 

10.1109/ACCESS.2019.2939201. 

[17] R. Chakroun and M. Frikha, “Efficient text-independent speaker 
recognition with short utterances in both clean and uncontrolled 

environments,” Multimedia Tools and Applications, vol. 79, no. 29, pp. 
21279–21298, Aug. 2020, doi: 10.1007/s11042-020-08824-7. 

[18] C. Kim and R. M. Stern, “Power-Normalized Cepstral Coefficients 

(PNCC) for Robust Speech Recognition,” IEEE/ACM Transactions on 
Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1315–1329, 

Jul. 2016, doi: 10.1109/TASLP.2016.2545928. 

[19] S.-S. Wang, P. Lin, Y. Tsao, J.-W. Hung, and B. Su, “Suppression by 

Selecting Wavelets for Feature Compression in Distributed Speech 
Recognition,” IEEE/ACM Transactions on Audio, Speech, and 

Language Processing, vol. 26, no. 3, pp. 564–579, Mar. 2018, doi: 
10.1109/TASLP.2017.2779787. 

[20] M. A. Islam, W. A. Jassim, N. S. Cheok, and M. S. A. Zilany, “A Robust 

Speaker Identification System Using the Responses from a Model of the 
Auditory Periphery,” PLoS One, vol. 11, no. 7, p. e0158520, Jul. 2016, 

doi: /10.1371/journal.pone.0158520. 

[21] N. Das, S. Chakraborty, J. Chaki, N. Padhy, and N. Dey, “Fundamentals, 
present and future perspectives of speech enhancement,” International 

Journal of Speech Technology, Jan. 2020, doi: 10.1007/s10772-020-
09674-2. 

[22] C. Jiang, L. Ba, X. Tang, and D. Wen, “Speaker Verification Using 

IMNMF and MFCC with Feature Warping Under Noisy Environment,” 
in 2018 Chinese Automation Congress (CAC), Nov. 2018, pp. 2583–

2588, doi: 10.1109/CAC.2018.8623278. 

[23] A. K. H. Al-Ali, V. Chandran, and G. R. Naik, “Enhanced forensic 
speaker verification performance using the ICA-EBM algorithm under 

noisy and reverberant environments,” Evolutionary Intelligence, May 
2020, doi: 10.1007/s12065-020-00406-8. 

[24] O. Mamyrbayev, A. Toleu, G. Tolegen, and N. Mekebayev, “Neural 
architectures for gender detection and speaker identification,” Cogent 

Engineering, vol. 7, no. 1, p. 1727168, Jan. 2020, doi: 
10.1080/23311916.2020.1727168. 

[25] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, 1 

edition. Englewood Cliffs, N.J: Pearson, 1993. 

[26] N. Holighaus, G. Koliander, Z. Průša, and L. D. Abreu, 
“Characterization of Analytic Wavelet Transforms and a New Phaseless 

Reconstruction Algorithm,” IEEE Transactions on Signal Processing, 
vol. 67, no. 15, pp. 3894–3908, Aug. 2019, doi: 

10.1109/TSP.2019.2920611. 

[27] W. Helali, Z. Hajaiej, and A. Cherif, “Automatic Speech Recognition 
System Based on Hybrid Feature Extraction Techniques Using TEO-

PWP for in Real Noisy Environment,” IJCSNS - International Journal of 
Computer Science and Network Security, vol. 19, no. 10, pp. 118–124, 

Oct. 2019. 

[28] A. Rinoshika and H. Rinoshika, “Application of multi-dimensional 
wavelet transform to fluid mechanics,” Theoretical and Applied 

Mechanics Letters, vol. 10, no. 2, pp. 98–115, Jan. 2020, doi: 
10.1016/j.taml.2020.01.017. 

[29] D. G. Manolakis and V. K. Ingle, Applied Digital Signal Processing: 

Theory and Practice, 1 edition. New York: Cambridge University Press, 
2011. 

[30] A. Mnassri, M. Bennasr, and C. Adnane, “A Robust Feature Extraction 
Method for Real-Time Speech Recognition System on a Raspberry Pi 3 

Board,” Engineering, Technology & Applied Science Research, vol. 9, 
no. 2, pp. 4066–4070, Apr. 2019. 

[31] S. N. Truong, “A Low-cost Artificial Neural Network Model for 

Raspberry Pi,” Engineering, Technology & Applied Science Research, 
vol. 10, no. 2, pp. 5466–5469, Apr. 2020.