Engineering, Technology & Applied Science Research, Vol. 10, No. 2, 2020, pp. 5547-5553

Efficient Feature Extraction Algorithms to Develop 
an Arabic Speech Recognition System 

 

Abdulmalik A. Alasadi
Dept. of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
dba.ora10g@gmail.com

Theyazn H. H. Adhyani (corresponding author)
Community College in Abqaiq, King Faisal University, Saudi Arabia
taldhyani@kfu.edu.sa

Ratnadeep R. Deshmukh
Dept. of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
rrdeshmukh.csit@bamu.ac.in

Ahmed H. Alahmadi
Department of Computer Science, Taibah University, Saudi Arabia
aahmadio@taibahu.edu.sa

Ali Saleh Alshebami
Community College in Abqaiq, King Faisal University, Saudi Arabia
aalshebami@kfu.edu.sa
 

 

Abstract—This paper studies three feature extraction methods, Mel-Frequency Cepstral Coefficients (MFCC), Power-Normalized Cepstral Coefficients (PNCC), and the Modified Group Delay Function (ModGDF), for the development of an Automatic Speech Recognition (ASR) system for Arabic. The Support Vector Machine (SVM) algorithm processed the obtained features. These feature extraction algorithms capture characteristics of the speech signal, ModGDF doing so through a group delay function computed directly from the voice signal, and were deployed to extract features from recordings of Arabic speakers. PNCC provided the best recognition results for Arabic speech in comparison with the other methods, and simulation results showed that both PNCC and ModGDF were more accurate than MFCC.

Keywords-speech recognition; feature extraction; PNCC; ModGDF; MFCC; Arabic speech recognition

I. INTRODUCTION  

Speech is the most commonly and widely used form of communication, and much research focuses on developing reliable systems that can understand and accept commands through speech. Nowadays computers are involved in almost every aspect of our lives, and as communication between people is mostly vocal, people anticipate the same kind of interaction with computers [1]. Speech has the capacity to be an important mode of human-computer interaction, and the interest in developing computers that can accept speech as input is growing. The substantial research effort in global speech recognition and the increasing computational power at lower cost could result in more speech recognition applications in the near future [3]. Arabic is the most popular language in the Arab world, and the Arabic alphabet is used in some other languages such as Persian, Urdu, and Malay [2].

Research in human-computer speech interaction has focused mostly on developing better technical speech recognition systems and on gains in precision and productivity [4]. This research applied three distinct feature extraction methods to an Arabic speech dataset, namely Mel-Frequency Cepstral Coefficients (MFCC), Power-Normalized Cepstral Coefficients (PNCC), and the Modified Group Delay Function (ModGDF). The extracted features were classified by a Support Vector Machine (SVM), and the results of the three feature extraction techniques were compared in order to find the most efficient and accurate one. Each feature extraction technique has its own properties: ModGDF, for instance, is additive and offers high resolution. The additive property allows different components to combine in the group delay domain, while the high-resolution property sharpens the peaks of the group delay spectrum [5].

II. BACKGROUND

Speech awareness and evaluation have captivated researchers from Fletcher's early works [6] and the first voice identification devices [7] to the present day. Nevertheless, high-precision machine speech recognition is achieved mostly in quiet settings, as the efficiency of a typical speech recognizer drops significantly in noisy settings [8]. Environmental influence and other variables were explored in [9]. As technology progresses, speech recognition will be embedded in more devices used in everyday activities, where environmental variables play a major part, such as mobile phone voice recognition applications [10], cars [11], integrated access control and information systems [12], emotion identification systems [13], application monitoring [14], disabled assistance [15], and intelligent technology. In addition to voice, many acoustic applications are also essential in diverse engineering problems [16-22]. Noise reduction methods can be deployed to enhance efficiency in real-world noisy settings [23-26]. Machine performance degrades far more than human performance under noise, channel variance, and spontaneous expression [27]. Automatic Speech Recognition (ASR) has not surpassed human precision and robustness, but we continue to benefit from it by studying the central principles behind the recognition of Human Speech (HS) [28]. Despite the advancements in auditory processing and popular front-ends for ASR devices, only a few elements of noise handling in the auditory periphery are modeled and simulated [29]. For instance, common methods such as MFCC use auditory features like a filter bank of varying bandwidths and amplitude compression. Perceptual Linear Prediction (PLP) coefficients focus on perceptual processing by applying critical-band resolution curves, equal-loudness scaling, and the cube-root energy law of hearing to Linear Prediction Coefficients (LPC) [30]. Synaptic adaptation is an instance of auditory-motivated enhancements of voice representation: standard MFCC or PLP coefficients can be substituted by coefficients based on a cochlear model in order to better represent the human auditory periphery. The model of synaptic adaptation proposed in [31] showed important improvements in speech recognition efficiency. The PNCC features proposed in [32] were based on auditory processing and included new characteristics: a power-law nonlinearity, a noise-suppression algorithm relying on asymmetric filtering, and temporal masking. The experimental findings exhibited enhanced recognition precision compared to MFCC and PLP. Another strategy for feature extraction is based on Deep Neural Networks (DNN): the noise robustness of acoustic models relying on DNNs was evaluated in [33], Recurrent Neural Networks (RNN) for cleaning distorted input features were applied in [34], and the use of LSTM-RNNs was suggested in [35] to manage extremely non-stationary additive noise. A comprehensive overview of deep learning for robust voice recognition was presented in [36]. Many studies have utilized PNCC and MFCC to extract the most significant features from speech signals [37-39], and the Modified Group Delay Function (ModGDF) has been used to extract features from speech signals more efficiently than MFCC.

III. METHOD

Figure 1 shows the developed recognition system for evaluating the identification of Arabic speech.

Fig. 1.  Proposed speech recognition system

Audio from Arabic speakers was given as input to the system, and three feature extraction techniques, MFCC, PNCC, and ModGDF, were applied to extract significant features of Arabic speech. The SVM algorithm was used for training and classification, and performance measures were employed to evaluate these algorithms.

IV. DATABASE

A speech database was created, populated with utterances from volunteer Yemeni students studying at Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India. Tables I and II give the demographic information of the volunteers and the basic parameters of the recordings.

TABLE I.  DEMOGRAPHICS OF VOLUNTEERS

Parameter       Value
Speaker type    Students (BSc, MSc, PhD)
Gender          35 male, 15 female
Basic language  Arabic
Accent          Standard and Yemeni
Age group       20-35
Country         Yemen
Environment     Dept. of CS & IT

TABLE II.  BASIC RECORDING PARAMETERS

Parameter         Value
Sampling rate     16000 Hz
Speakers          Dependent
Noise condition   Normal
Accent            Arabic
Pre-emphasis      H(z) = 1 - 0.9z^(-1)
Window type       Hamming, 25 ms
Window step size  20 ms
 

A. Recording Procedure

The database was recorded using high-quality headsets (Sennheiser PC360) and the PRAAT software, in a quiet environment. Speech samples were recorded in mono mode at a 16000 Hz sampling rate. The microphone was placed at a distance of about 3 cm from the volunteer's mouth. Table III displays the hardware and software used during the recording of the speech samples.

TABLE III.  HARDWARE AND SOFTWARE DETAILS

Hardware                                   Software
Laptop: HP EliteBook (Core i7, 5th gen,    Windows 10
8GB RAM, 500GB SSD)
Headset and microphone: Sennheiser PC360   PRAAT 6102_win64

 

B. Isolated Digits 

Table IV shows the recorded Arabic digits. 

C. Isolated Words 

Isolated Arabic words of the speech corpus were used. 
Table V shows the Arabic words related to learning. 

D. Continuous Sentences 

Table VI shows the continuous sentence text corpus. Five 
utterances were collected for each sentence. 




TABLE IV.  ARABIC DIGITS

Digit  Pronunciation  Arabic writing
0      Safer          صفر
1      Wahed          وَاحِد
2      Ethnan         اِثْنَان
3      Thlathah       ثَلاثَة
4      Arbaah         أَرْبَعَة
5      Khamsah        خَمْسَة
6      Settah         سِتَّة
7      Sabaah         سَبْعَة
8      Thamaneyah     ثَمَانِيَة
9      Tesaah         تِسْعَة

TABLE V.  ARABIC WORDS

Arabic word  Pronunciation  English word
جامعة        Jameaah        University
كلية         Koleyah        College
قسم          Kesm           Department
تعليم        Taaleem        Education
محاضر        Mauhader       Lecture
مدرس         Modares        Teacher
معمل         Maamal         Lab
مادة         Madah          Course

TABLE VI.  ARABIC SENTENCES RELATED TO GREETINGS

English language                                  Arabic language
When does registration begin at the university?   متى يبدأ التسجيل في الجامعة؟
Is there a graduate studies department?           هل يوجد قسم للدراسات العليا؟
What are the admission requirements?              ما هي شروط القبول؟
Is there a university website?                    هل يوجد موقع الكتروني للجامعة؟
What are the available majors?                    ما هي التخصصات المتوفرة؟
The university has modern programs.               الجامعة لديها برامج حديثة
The mission of the university is ambitious.       رسالة الجامعة طموحة

 

V. FEATURE EXTRACTION ALGORITHMS 

Feature extraction is vital for developing a speech 
recognition system. Its main objective is to extract the most 
significant features for identifying Arabic speakers. Three 
feature extraction algorithms were applied: PNCC, ModGDF, 
and MFCC.  

A. Power Normalized Cepstral Coefficients (PNCC)

The PNCC feature extraction algorithm for speech recognition is described in [3]. PNCC has two components: initial processing, and temporal integration for environmental analysis.

1) Initial Processing 

This processing uses a pre-emphasis filter of the form:

$$H(z) = 1 - 0.97z^{-1}$$    (1)
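For illustration, a minimal Python sketch of the filter in (1) follows; the paper's experiments were run in MATLAB, so the function name and NumPy formulation here are our own:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Apply the pre-emphasis filter H(z) = 1 - alpha*z^{-1} of (1).

    y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged.
    """
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```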
Subsequently, a Short-Time Fourier Transform (STFT) is computed using Hamming windows. A DFT size of 1024 is used, producing a window length of 25.6 ms with 10 ms between frames. By weighting the magnitude-squared STFT outputs for positive frequencies, the spectral power in 40 analysis bands is obtained. The center frequencies of the gammatone filters are linearly spaced on the Equivalent Rectangular Bandwidth (ERB) scale between 200 Hz and 8000 Hz [3].
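The band spacing can be reproduced as in the sketch below, which assumes the common Glasberg-Moore ERB-number mapping (the paper does not state which ERB formula was used, and the gammatone filters themselves are omitted):

```python
import numpy as np

def erb_center_freqs(f_low: float = 200.0, f_high: float = 8000.0,
                     n_bands: int = 40) -> np.ndarray:
    """Center frequencies (Hz) linearly spaced on the ERB-number scale.

    Assumes the Glasberg-Moore mapping ERB(f) = 21.4*log10(1 + 0.00437*f).
    """
    hz_to_erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    erb_to_hz = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return erb_to_hz(np.linspace(hz_to_erb(f_low), hz_to_erb(f_high), n_bands))
```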

2) Temporal Integration for Environmental Analysis 

Most speech recognition systems use analysis frames of between 20 and 30 ms. It is often found that longer analysis windows deliver greater noise modeling efficiency and environmental normalization [6], because most background conditions change more slowly than the instantaneous power of speech. In PNCC processing, an estimate is made of a quantity referred to as the "medium-time power" Q[m,l] by calculating the running average of P[m,l], the power observed in a single frame of analysis, according to:

$$Q[m,l] = \frac{1}{2M+1} \sum_{m'=m-M}^{m+M} P[m',l]$$    (2)

where m is the index of the frame, and l is the index of the channel.
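A direct Python rendering of (2) is sketched below; the half-width M = 2 is the value used in the original PNCC paper, and the edge handling (averaging only over the frames that exist) is our assumption:

```python
import numpy as np

def medium_time_power(P: np.ndarray, M: int = 2) -> np.ndarray:
    """Medium-time power Q[m, l]: running average of the frame power
    P[m, l] over 2M+1 frames, as in (2). P has shape (frames, channels)."""
    Q = np.empty_like(P, dtype=float)
    for m in range(P.shape[0]):
        lo, hi = max(0, m - M), min(P.shape[0], m + M + 1)
        Q[m] = P[lo:hi].mean(axis=0)  # edge frames average over fewer neighbors
    return Q
```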

B. Modified Group Delay Function (ModGDF) 

This method is discussed in detail in [7-15]. It should be noted that the group delay function is different from the phase spectrum: it is defined as the negative derivative of the phase, and it can be used effectively to extract the parameters of a system when the signal is a minimum phase signal. This is mainly because, for a minimum phase signal, the magnitude spectrum and the group delay function resemble each other. Figure 2 shows the process of the ModGDF algorithm for extracting speech features. The algorithm is described below.

Algorithm: ModGDF feature extraction pseudocode
Input: speech x(n)
Output: ModGDF feature vector c(n)
Begin
Initialize parameters;
Apply the DFT to the speech x(n), giving X[k];
Apply the DFT to the speech n·x(n), giving Y[k];
Calculate the group delay function, where R and I denote the real and imaginary parts;
Compute the spectrally smoothed spectrum of X[k] and designate it S[k];
Compute the modified group delay, where S[k] is the smoothed version of X[k] and two new parameters α and γ are used to regulate the dynamic range of ModGDF;
Apply the DCT to get the ModGDF features;
Obtain the ModGDF feature vector (13 coefficients per frame);
End.

 

 
Fig. 2.  Feature extraction process of ModGDF 
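A compact Python sketch of the pseudocode above, for one windowed frame, is given below. The values α = 0.4 and γ = 0.9 are typical choices from the ModGDF literature, not values reported by this paper, and a simple moving average stands in for the spectral smoothing that produces S[k]:

```python
import numpy as np
from scipy.fft import dct
from scipy.ndimage import uniform_filter1d

def modgdf_frame(x: np.ndarray, n_fft: int = 512, alpha: float = 0.4,
                 gamma: float = 0.9, n_ceps: int = 13) -> np.ndarray:
    """ModGDF features for one windowed speech frame x(n)."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)        # DFT of x(n)
    Y = np.fft.rfft(n * x, n_fft)    # DFT of n*x(n)
    S = uniform_filter1d(np.abs(X), size=11) + 1e-10  # smoothed spectrum S[k]
    # Group delay from real (R) and imaginary (I) parts, with S[k]^(2*gamma)
    # in the denominator regulating the dynamic range
    tau = (X.real * Y.real + X.imag * Y.imag) / S ** (2.0 * gamma)
    tau = np.sign(tau) * np.abs(tau) ** alpha         # alpha compresses peaks
    return dct(tau, type=2, norm='ortho')[:n_ceps]    # 13 coefficients per frame
```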

C. Mel Frequency Cepstral Coefficients (MFCC) 

MFCC is the most widely used method in speech technology development, as it mimics the human auditory system and takes its characteristics into account [16]. Moreover, these coefficients are robust and reliable under variations between speakers and recording conditions. Figure 3 shows the processing steps of MFCC feature extraction.

 

 
Fig. 3.  Processes in MFCC feature extraction method 

Pre-emphasis is the first step of MFCC; it restores the high-frequency energy that was attenuated during sound generation. Framing cuts the sound signal into short segments. Windowing is used to avert the discontinuities introduced by the framing step. The Fast Fourier Transform (FFT) converts each frame from the time domain to the frequency domain. The filter bank is a set of overlapping band-pass filters. The final step is the Discrete Cosine Transform (DCT), which produces the MFCC coefficients [18]. MFCC is computed from the speech signal in the following three steps:

• Compute the FFT power spectrum of the speech signal.

• Apply a Mel-spaced filter bank to the power spectrum to get the band energies.

• Compute the DCT of the log filter-bank energies to get uncorrelated MFCCs.

The speech signal is first divided into time frames comprising a fixed number of samples. In most systems, overlapping frames are used to smooth the transition from frame to frame. Each time frame is then windowed with a Hamming window to eliminate discontinuities at the edges [17]. The filter coefficients w(n) of a Hamming window of length N are computed according to:

$$w(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$    (3)

where N is the total number of samples and n is the current sample. The Mel scale relates the perceived frequency or pitch of a pure tone to its actual measured frequency. Humans discern small changes in pitch better at lower frequencies, so incorporating this scale makes the features match more closely what humans hear. The formula for converting from frequency to the Mel scale is:

$$m = 1125 \ln\left(1 + \frac{f}{700}\right)$$    (4)

while the formula for converting from the Mel scale back to frequency is:

$$f = 700\left(\exp\left(\frac{m}{1125}\right) - 1\right)$$    (5)
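Putting the steps together, a self-contained Python sketch of the MFCC pipeline using (3)-(5) follows; the 25 ms frames and 20 ms step follow Table II, while the filter count and FFT size are our assumptions:

```python
import numpy as np
from scipy.fft import dct

def mfcc(signal: np.ndarray, fs: int = 16000, frame_ms: int = 25,
         step_ms: int = 20, n_filt: int = 26, n_ceps: int = 13,
         n_fft: int = 512) -> np.ndarray:
    """MFCCs per frame: pre-emphasis, framing, Hamming window (3),
    FFT power spectrum, Mel filter bank via (4)/(5), log, DCT."""
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    flen, fstep = fs * frame_ms // 1000, fs * step_ms // 1000
    starts = range(0, len(sig) - flen + 1, fstep)
    frames = np.stack([sig[s:s + flen] for s in starts]) * np.hamming(flen)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft      # power spectrum
    mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)             # eq. (4)
    imel = lambda m: 700.0 * (np.exp(m / 1125.0) - 1.0)          # eq. (5)
    edges = imel(np.linspace(0.0, mel(fs / 2.0), n_filt + 2))    # Mel-spaced edges
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):                                      # triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_e = np.log(power @ fbank.T + 1e-10)                      # log band energies
    return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]  # uncorrelated MFCCs
```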

VI. CLASSIFICATION 

SVM is principally a binary classifier, but it can be extended to multi-class tasks with two approaches: one-vs-all, which compares each class against all the others together, and one-vs-one, which compares each pair of classes separately [20]. In this study, one-vs-all was used, consisting of as many binary SVMs as there are classes. Each SVM is trained with one class against the rest, and all of them are taken into consideration when testing: the decision is eventually made based on the distances between the test data and the hyperplanes of all the SVMs.
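As a sketch, the one-vs-all scheme can be reproduced with scikit-learn as follows; the RBF kernel and its parameters are assumptions, since the paper does not report the kernel settings, and X_train, y_train, X_test are placeholders for the extracted feature vectors and labels:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One binary SVM per class, each trained as "its class vs. all the rest";
# prediction picks the class whose decision function (signed distance to
# the separating hyperplane) is largest for the test sample.
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)   # X_train: feature vectors, y_train: class labels
y_pred = clf.predict(X_test)
```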

VII. SIMULATION RESULTS  

Several experiments were conducted on the speech database for classification and recognition, using MFCC, PNCC, and ModGDF for feature extraction. The training procedure used 60% of the data, while 40% was used for testing. The test procedure was implemented in Matlab 2016, and screenshots are shown in Figures 4 and 5. Evaluation and testing were performed using accuracy rate, specificity, sensitivity, precision, and execution time.
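The paper does not spell out how the per-class measures are averaged; a plausible reading, macro-averaged one-vs-rest statistics computed from a confusion matrix such as those in Tables VIII, X, and XII, is sketched here:

```python
import numpy as np

def ovr_metrics(cm: np.ndarray) -> dict:
    """Macro-averaged one-vs-rest metrics from a confusion matrix
    (rows = true classes, columns = predicted classes)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp        # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp        # missed instances of the class
    tn = cm.sum() - tp - fp - fn
    return {
        "accuracy":    np.mean((tp + tn) / cm.sum()),
        "specificity": np.mean(tn / (tn + fp)),
        "sensitivity": np.mean(tp / (tp + fn)),
        "precision":   np.mean(tp / (tp + fp)),
    }
```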

 

 
Fig. 4.  Layout of the main system 

 
Fig. 5.  Implementation 

A. Analysis for Arabic Digits 

The feature extraction methods were applied to the digit samples, and the results are shown in Table VII.

TABLE VII.  SVM RESULTS ON DIGITS

Feature extraction technique  Accuracy rate (%)  Specificity (%)  Sensitivity (%)  Precision (%)  Execution time (s)
ModGDF                        90.3               94.5             50.5             72.7           16.39
PNCC                          97.5               98.6             87.6             88.7           54.8
MFCC                          88.3               93.5             41.7             53.7           87.5




Figure 6 illustrates the methods' performance. As can be observed, ModGDF with SVM obtained better results regarding time cost. PNCC and MFCC with SVM obtained good results, but their execution times were much higher. It is concluded that ModGDF had the lowest time cost, as it reduced execution time complexity. Table VIII shows the confusion matrix of PNCC for the recognition of Arabic digits. Figure 7 displays a sample of ModGDF with SVM recognizing the Arabic digit "Khamsah".

 

 
Fig. 6.  Methods' performance on the recognition of Arabic digits 

 
Fig. 7.  ModGDF sample recognizing the Arabic digit “Khamsah” 

TABLE VIII.  CONFUSION MATRIX OF DIGITS USING PNCC/SVM

19 0 0 0 0 0 0 0 0 0 
0 19 0 0 0 0 0 0 0 0 
0 2 17 0 0 0 0 0 0 0 
1 0 1 17 0 0 0 0 0 0 
0 1 0 0 18 0 0 0 0 0 
0 1 1 0 0 17 0 0 0 0 
0 0 0 2 0 1 16 0 0 0 
2 0 1 1 0 1 1 13 0 0 
1 0 0 3 0 1 0 1 13 0 
0 0 0 1 0 0 1 0 0 17 

 

B. Analysis for Arabic Words 

Table IX shows the results on the recognition of Arabic words. The results of ModGDF with SVM were not satisfactory, but its time cost was much lower than that of the other feature extraction methods. PNCC with SVM performed better, but its time cost turned out to be significantly higher. The results are also shown in Figure 8. Table X shows the confusion matrix of PNCC/SVM for the recognition of Arabic words, attesting that PNCC is more robust at identifying Arabic words. Figure 9 illustrates the performance of PNCC on the recognition of an Arabic word ("Dirham").

TABLE IX.  RESULTS ON WORDS

Feature extraction technique  Accuracy rate (%)  Specificity (%)  Sensitivity (%)  Precision (%)  Execution time (s)
ModGDF                        89.3               94.1             46.8             58.6           12.3
PNCC                          95.15              97.3             75.8             79.2           49.5
MFCC                          88.6               93.6             43.1             51.8           99.5

 

 
Fig. 8.  Performance on the recognition of Arabic words 

 
Fig. 9.  Sample of PNCC with the SVM recognizing the Arabic word 
“Dirham” 

TABLE X.  CONFUSION MATRIX OF WORDS USING PNCC/SVM

19 0 0 0 0 0 0 0 0 0 
2 17 0 0 0 0 0 0 0 0 
2 2 15 0 0 0 0 0 0 0 
2 1 1 15 0 0 0 0 0 0 
2 0 3 2 12 0 0 0 0 0 
1 1 2 2 0 13 0 0 0 0 
1 3 1 0 0 0 14 0 0 0 
0 0 0 1 2 0 0 16 0 0 
0 1 1 1 0 0 0 0 16 0 
2 0 0 0 0 0 2 2 1 12 

 

C.  Analysis for Arabic Sentences 

Table XI shows the performance results on the recognition of Arabic sentences. As can be observed, PNCC with SVM performed better, but had a greater execution time. PNCC again had the highest accuracy and a lower execution time than MFCC. ModGDF had the lowest execution time (18.9 s) and an accuracy of 88.2%, while MFCC again showed the lowest accuracy. The accuracy of PNCC/SVM is also confirmed by its confusion matrix for sentences in Table XII, which shows that PNCC/SVM is capable of recognizing sentences with satisfactory results. The results are also shown in Figure 10, while Figure 11 illustrates the performance of PNCC on the recognition of an Arabic sentence ("What are the available majors?").

TABLE XI.  RESULTS ON SENTENCES

Feature extraction technique  Accuracy rate (%)  Specificity (%)  Sensitivity (%)  Precision (%)  Execution time (s)
ModGDF                        88.2               93.5             41.2             45.3           18.9
PNCC                          93.05              96.14            65.26            71.04          70.0
MFCC                          86.0               92.2             30.0             49.48          125.0

 

 
Fig. 10.  Feature extraction performance on Arabic sentences 

 
Fig. 11.  Sample of PNCC/SVM recognizing the Arabic sentence “What are 
the available majors?” 

TABLE XII.  CONFUSION MATRIX OF SENTENCES USING PNCC/SVM 

19 0 0 0 0 0 0 0 0 0 
2 17 0 0 0 0 0 0 0 0 
1 7 11 0 0 0 0 0 0 0 
0 0 1 18 0 0 0 0 0 0 
1 0 0 8 10 0 0 0 0 0 
0 0 0 0 3 16 0 0 0 0 
0 0 0 0 0 7 12 0 0 0 
0 0 0 1 0 0 1 17 0 0 
0 0 0 0 0 0 0 9 10 0 
6 0 0 0 0 0 0 0 3 10 

VIII. CONCLUSION 

In this paper, a speech recognition system for the Arabic language was presented, evaluating three feature extraction algorithms, namely MFCC, PNCC, and ModGDF, while an SVM was used for classification. Results showed that PNCC was the most efficient, while ModGDF had moderate accuracy, and both achieved greater accuracy than MFCC: PNCC had a 93-97% accuracy rate, ModGDF about 90%, and MFCC about 88%.

REFERENCES 

[1] P. P. Shrishrimal, R. R. Deshmukh, V. B. Waghmare, “Indian language 
speech database: A review”, International Journal of Computer 
Applications, Vol. 47, No. 5, pp. 17-21, 2012 

[2] S. K. Gaikwad, B. W. Gawali, P. Yannawar, “A review on speech 
recognition technique”, International Journal of Computer Applications, 
Vol. 10, No. 3, pp. 16-24, 2010 

[3] C. Huang, T. Chen, E. Chang, “Accent issues in large vocabulary 
continuous speech recognition”, International Journal of Speech 
Technology, Vol. 7, No. 2-3, pp. 141-153, 2004 

[4] M. A. Anasuya, S. K. Katti, “Speech recognition by machine: A 
review”, International Journal of Computer Science and Information 
Security, Vol. 6, No. 3, pp. 181-205, 2009 

[5] P. L. Garvin, P. Ladefoged, “Speaker identification and message 
identification in speech recognition”, Phonetica, Vol. 9, No. 4, pp. 193-
199, 1963 

[6] G. Ceidaite, L. Telksnys, “Analysis of factors influencing accuracy of 
speech recognition”, Elektronika ir Elektrotechnika, Vol. 105, No. 9, pp. 
69-72, 2010 

[7] Z. H. Tan, B. Lindberg, “Speech recognition on mobile devices”, in: 
Mobile Multimedia Processing – WMMP 2008, Lecture Notes in 
Computer Science, Vol. 5960, Springer, 2010 

[8] W. Li, K. Takeda, F. Itakura, “Robust in-car speech recognition based 
on nonlinear multiple regressions”, EURASIP Journal on Advances in 
Signal Processing, 2007 

[9] W. Ou, W. Gao, Z. Li, S. Zhang, Q. Wang, “Application of keywords 
speech recognition in agricultural voice system”, Second International 
Conference on Computational Intelligence and Natural Computing, 
Wuhan, China, September 13-14, 2010 

[10] L. Zhu, L. Chen, D. Zhao, J. Zhou, W. Zhang, “Emotion recognition 
from Chinese speech for smart affective services using a combination of 
SVM and DBN”, Sensors, Vol. 17, No. 7, 2017 

[11] J. E. Noriega-Linares, J. M. Navarro Ruiz, “On the application of the 
raspberry Pi as an advanced acoustic sensor network for noise 
monitoring”, Electronics, Vol. 5, No. 4, 2016 

[12] M. Al-Rousan, K. Assaleh, “A wavelet-and neural network-based voice 
system for a smart wheelchair control”, Journal of the Franklin Institute, 
Vol. 348, No. 1, pp. 90-100, 2011 

[13] I. V. McLoughlin, H. R. Sharifzadeh, “Speech recognition for smart 
homes”, in: Speech Recognition, Technologies and Applications, Intech, 
2008 

[14] A. Glowacz, “Diagnostics of rotor damages of three-phase induction 
motors using acoustic signals and SMOFS-20-EXPANDED”, Archives 
of Acoustics, Vol. 41, No. 3, pp. 507-515, 2016 

[15] A. Glowacz, “Fault diagnosis of single-phase induction motor based on 
acoustic signals”, Mechanical Systems and Signal Processing, Vol. 117, 
pp. 65-80, 2019 

[16] M. Kunicki, A. Cichon, “Application of a phase resolved partial 
discharge pattern analysis for acoustic emission method in high voltage 
insulation systems diagnostics”, Archives of Acoustics, Vol. 43, No. 2, 
pp. 235-243, 2018 

[17] D. Mika, J. Jozwik, “Advanced time-frequency representation in voice 
signal analysis”, Advances in Science and Technology Research Journal, 
Vol. 12, No. 1, pp. 251-259, 2018 




[18] L. Zou, Y. Guo, H. Liu, L. Zhang, T. Zhao, “A method of abnormal 
states detection based on adaptive extraction of transformer vibro-
acoustic signals”, Energies, Vol. 10, No. 12, 2017 

[19] H. Yang, G. Wen, Q. Hu, Y. Li, L. Dai, “Experimental investigation on 
influence factors of acoustic emission activity in coal failure process”, 
Energies, Vol. 11, No. 6, Article ID 1414, 2018 

[20] L. Mokhtarpour, H. Hassanpour, “A self-tuning hybrid active noise 
control system”, Journal of the Franklin Institute, Vol. 349, No. 5, pp. 
1904-1914, 2012 

[21] S. C. Lee, J. F. Wang, M. H. Chen, “Threshold-based noise detection 
and reduction for automatic speech recognition system in human-robot 
interactions”, Sensors, Vol. 18, No. 7, Article ID 2068, 2018 

[22] S. M. Kuo, W. M. Peng, “Principle and applications of asymmetric 
crosstalk-resistant adaptive noise canceler”, Journal of the Franklin 
Institute, Vol. 337, No. 1, pp. 57-71, 2000 

[23] J. W. Hung, J. S. Lin, P. J. Wu, “Employing robust principal component 
analysis for noise-robust speech feature extraction in automatic speech 
recognition with the structure of a deep neural network”, Applied 
System Innovation, Vol. 1, No. 3, Article ID 28, 2018 

[24] R. P. Lippmann, “Speech recognition by machines and humans”, Speech 
Communication, Vol. 22, No. 1, pp. 1-15, 1997 

[25] J. B. Allen, “How do humans process and recognize speech?”, IEEE 
Transactions on Speech and Audio Processing, Vol. 2, No. 4, pp. 567-
577, 1994 

[26] S. Haque, R. Togneri, A. Zaknich, “Perceptual features for automatic 
speech recognition in noisy environments”, Speech Communication, 
Vol. 51, No. 1, pp. 58-75, 2009 

[27] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech”, 
The Journal of the Acoustical Society of America, Vol. 87, No. 4, pp. 
1738-1752, 1990 

[28] M. Holmberg, D. Gelbart, W. Hemmert, “Automatic speech recognition 
with an adaptation model motivated by auditory processing”, IEEE 
Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 
1, pp. 43-49, 2005 

[29] C. Kim, R. M. Stern, “Power-normalized Cepstral Coefficients (PNCC) 
for robust speech recognition”, 2012 IEEE International Conference on 
Acoustics, Speech and Signal Processing,  Kyoto, Japan, March 25-30, 
2012 

[30] M. L. Seltzer, D. Yu, Y. Wang, “An investigation of deep neural 
networks for noise robust speech recognition”, 2013 IEEE International 
Conference on Acoustics, Speech and Signal Processing, Vancouver, 
Canada, May 26-31, 2013 

[31] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, A. Y. Ng, 
“Recurrent neural networks for noise reduction in robust ASR”, 13th 
Annual Conference of the International Speech Communication 
Association, Portland, USA, September 9-13, 2012 

[32] M. Wollmer, B. Schuller, F. Eyben, G. Rigoll, “Combining long short-
term memory and dynamic bayesian networks for incremental emotion-
sensitive artificial listening”, IEEE Journal of Selected Topics in Signal 
Processing, Vol. 4, No. 5, pp. 867-881, 2010 

[33] Z. Zhang, J. Geiger, J. Pohjalainen, A. E. D. Mousa, W. Jin, B. Schuller, 
“Deep learning for environmentally robust speech recognition: An 
overview of recent developments”, ACM Transactions on Intelligent 
Systems and Technology, Vol. 9, No. 5, pp. 1-28, 2018 

[34] E. Principi, S. Squartini, F. Piazza, “Power normalized cepstral 
coefficients based supervectors and i-vectors for small vocabulary 
speech recognition”, 2014 International Joint Conference on Neural 
Networks, Beijing, China, July 6-11, 2014 

[35] E. Loweimi, S. M. Ahadi, “A new group delay-based feature for robust speech recognition”, 2011 IEEE International Conference on Multimedia and Expo, Barcelona, Spain, July 11-15, 2011

[36] B. Kurian, K. T. Shanavaz, N. G. Kurup, “PNCC based speech 
enhancement and its performance evaluation using SNR Loss”, 2017 
International Conference on Networks & Advances in Computational 
Technologies, Thiruvanthapuram, India, July 20-22, 2017 

[37] T. Fux, D. Jouvet, “Evaluation of PNCC and extended spectral subtraction methods for robust speech recognition”, 23rd European Signal Processing Conference, Nice, France, August 31 – September 4, 2015

[38] A. Kaur, A. Singh, “Power-Normalized Cepstral Coefficients (PNCC) 
for Punjabi automatic speech recognition using phone based modelling 
in HTK”, 2nd International Conference on Applied and Theoretical 
Computing and Communication Technology, Bangalore, India, July 21-
23, 2016 

[39] C. Kim, R. M. Stern, “Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring”, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, USA, March 14-19, 2010

[40] D. S. Kim, S. Y. Lee, R. M. Kil, “Auditory processing of speech signals 
for robust speech recognition in real-world noisy environments”, IEEE 
Transactions on Speech and Audio Processing, Vol. 7, No. 1, pp. 55-69, 
1999