Engineering, Technology & Applied Science Research, Vol. 10, No. 2, 2020, pp. 5547-5553

Efficient Feature Extraction Algorithms to Develop 
an Arabic Speech Recognition System 

 

Abdulmalik A. Alasadi
Dept. of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
dba.ora10g@gmail.com

Theyazn H. H. Adhyani (corresponding author)
Community College in Abqaiq, King Faisal University, Saudi Arabia
taldhyani@kfu.edu.sa

Ratnadeep R. Deshmukh
Dept. of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
rrdeshmukh.csit@bamu.ac.in

Ahmed H. Alahmadi
Department of Computer Science, Taibah University, Saudi Arabia
aahmadio@taibahu.edu.sa

Ali Saleh Alshebami
Community College in Abqaiq, King Faisal University, Saudi Arabia
aalshebami@kfu.edu.sa
 

 

Abstract—This paper studies three feature extraction methods, Mel-Frequency Cepstral Coefficients (MFCC), Power-Normalized Cepstral Coefficients (PNCC), and the Modified Group Delay Function (ModGDF), for the development of an Automatic Speech Recognition (ASR) system for Arabic. The Support Vector Machine (SVM) algorithm processed the obtained features. These feature extraction algorithms capture characteristics of the speech signal, ModGDF doing so through a group delay function computed directly from the voice signal, and were deployed to extract features from recordings of Arabic speakers. PNCC provided the best recognition results for Arabic speech in comparison with the other methods, and simulation results showed that both PNCC and ModGDF were more accurate than MFCC.

Keywords-speech recognition; feature extraction; PNCC; ModGDF; MFCC; Arabic speech recognition

I. INTRODUCTION  

Speech is the most commonly and widely used form of communication, and much research focuses on developing reliable systems that can understand and accept commands through speech. Nowadays computers are involved in almost every aspect of our lives, and as communication between people is mostly vocal, people anticipate the same kind of interaction with computers [1]. Speech has the capacity to be an important mode of human-computer interaction, and the interest in developing computers that can accept speech as input is growing. The substantial research effort in global speech recognition and the increasing computational power at lower cost could result in more speech recognition applications in the near future [3]. Arabic is the most popular language in the Arab world, and the Arabic alphabet is used in some other languages such as Persian, Urdu, and Malay [2].

Research in human-computer speech interaction has focused mostly on developing better technical speech recognition systems and on gains in precision and productivity [4]. This research applied three distinct feature extraction methods to an Arabic speech dataset, namely Mel-Frequency Cepstral Coefficients (MFCC), Power-Normalized Cepstral Coefficients (PNCC), and the Modified Group Delay Function (ModGDF). The extracted features were classified by a Support Vector Machine (SVM), and the results of the three feature extraction techniques were compared in order to find the most efficient and accurate one. Each feature extraction technique has its own properties: ModGDF, for instance, is additive and offers high resolution. The additive property allows different components to combine in the group delay domain, while the high-resolution property sharpens the peaks of the group delay spectrum [5].

II. BACKGROUND

Speech awareness and evaluation have captivated researchers from Fletcher's early works [6] and the first voice identification devices [7] to the present day. Nevertheless, high-precision machine speech recognition is achieved mostly in quiet settings, as the efficiency of a typical speech recognizer drops significantly in noisy settings [8]. Environmental influence and other variables were explored in [9]. As technology progresses, speech recognition will be embedded in more devices used in everyday activities, where environmental variables play a major part, such as mobile phone voice recognition applications [10], cars [11], integrated access control and information systems [12], emotion identification systems [13], application monitoring [14], disabled assistance [15], and intelligent technology. In addition to voice, many acoustic applications are also essential in diverse engineering problems [16-22]. Noise reduction methods can be deployed to enhance efficiency in real-world noisy settings [23-26]. Machine performance degrades far more than human performance under noise, channel variance, and spontaneous expression [27]. Automatic Speech Recognition (ASR) has not surpassed human precision and robustness, but we continue to benefit from it by studying the central principles behind the recognition of Human Speech (HS) [28]. Despite the advancements in auditory processing and popular front-ends for ASR devices, only a few elements of noise handling in the auditory periphery are modeled and simulated [29]. For instance, common methods such as MFCC use auditory features like a filter bank of varying bandwidths and amplitude compression. Perceptual Linear Prediction (PLP) coefficients focus on perceptual processing by applying critical-band resolution curves, equal-loudness scaling, and the cube-root energy law of hearing to Linear Prediction Coefficients (LPC) [30]. Synaptic adaptation is an instance of auditory-motivated enhancements of voice representation: standard MFCC or PLP coefficients can be substituted by coefficients based on a cochlear model in order to better represent the human auditory periphery. The model of synaptic adaptation proposed in [31] showed important improvements in speech recognition efficiency. The PNCC features proposed in [32] were based on auditory processing and included new characteristics: a power-law nonlinearity, a noise-suppression algorithm relying on asymmetric filtering, and temporal masking. The experimental findings exhibited enhanced recognition precision compared to MFCC and PLP. Another strategy for feature extraction is based on Deep Neural Networks (DNN): the noise robustness of acoustic models relying on DNNs was evaluated in [33], Recurrent Neural Networks (RNN) for cleaning distorted input features were applied in [34], and the use of LSTM-RNNs was suggested in [35] to manage extremely non-stationary additive noise. A comprehensive overview of deep learning for robust voice recognition was presented in [36]. Many studies have utilized PNCC and MFCC to extract the most significant features from speech signals [37-39], and the Modified Group Delay Function (ModGDF) has been used to extract features from speech signals more efficiently than MFCC.

III. METHOD

Figure 1 shows the developed recognition system for evaluating the identification of Arabic speech.

Fig. 1.  Proposed speech recognition system

Audio from Arabic speakers was given as input to the system, and three feature extraction techniques, MFCC, PNCC, and ModGDF, were applied to extract significant features of Arabic speech. The SVM algorithm was used for training and classification, and performance measures were employed to evaluate these algorithms.

IV. DATABASE

A speech database was created, populated with utterances from volunteer Yemeni students studying at Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India. Tables I and II give the demographic information of the volunteers and the basic parameters of the recordings.

TABLE I.  DEMOGRAPHICS OF VOLUNTEERS

Parameter       Value
Speaker type    Students (BSc, MSc, PhD)
Gender          35 male, 15 female
Basic language  Arabic
Accent          Standard and Yemeni
Age group       20-35
Country         Yemen
Environment     Dept. of CS & IT

TABLE II.  BASIC RECORDING PARAMETERS

Parameter         Value
Sampling rate     16000 Hz
Speakers          Dependent
Noise condition   Normal
Accent            Arabic
Pre-emphasis      H(z) = 1 - 0.9z^(-1)
Window type       Hamming, 25 ms
Window step size  20 ms
 

A. Recording Procedure

The database was recorded using high-quality headsets (Sennheiser PC360) and the PRAAT software, in a quiet environment. Speech samples were recorded in mono mode at a 16000 Hz sampling rate. The microphone was placed at a distance of about 3 cm from the volunteer's mouth. Table III displays the hardware and software used during the recording of the speech samples.

TABLE III.  HARDWARE AND SOFTWARE DETAILS

Hardware                                   Software
Laptop: HP EliteBook (Core i7, 5th gen,    Windows 10
8GB RAM, 500GB SSD)
Headset and microphone: Sennheiser PC360   PRAAT 6102_win64

 

B. Isolated Digits 

Table IV shows the recorded Arabic digits. 

C. Isolated Words 

Isolated Arabic words of the speech corpus were used. 
Table V shows the Arabic words related to learning. 

D. Continuous Sentences 

Table VI shows the continuous sentence text corpus. Five 
utterances were collected for each sentence. 




TABLE IV.  ARABIC DIGITS

Digit  Pronunciation  Arabic writing
0      Safer          صفر
1      Wahed          وَاحِد
2      Ethnan         اِثْنَان
3      Thlathah       ثَلاثَة
4      Arbaah         أَرْبَعَة
5      Khamsah        خَمْسَة
6      Settah         سِتَّة
7      Sabaah         سَبْعَة
8      Thamaneyah     ثَمَانِيَة
9      Tesaah         تِسْعَة

TABLE V.  ARABIC WORDS

Arabic word  Pronunciation  English word
جامعة        Jameaah        University
كلية         Koleyah        College
قسم          Kesm           Department
تعليم        Taaleem        Education
محاضر        Mauhader       Lecture
مدرس         Modares        Teacher
معمل         Maamal         Lab
مادة         Madah          Course

TABLE VI.  ARABIC SENTENCES RELATED TO GREETINGS

English language                                  Arabic language
When does registration begin at the university?   متى يبدأ التسجيل في الجامعة؟
Is there a graduate studies department?           هل يوجد قسم للدراسات العليا؟
What are the admission requirements?              ما هي شروط القبول؟
Is there a university website?                    هل يوجد موقع الكتروني للجامعة؟
What are the available majors?                    ما هي التخصصات المتوفرة؟
The university has modern programs.               الجامعة لديها برامج حديثة
The mission of the university is ambitious.       رسالة الجامعة طموحة

 

V. FEATURE EXTRACTION ALGORITHMS 

Feature extraction is vital for developing a speech 
recognition system. Its main objective is to extract the most 
significant features for identifying Arabic speakers. Three 
feature extraction algorithms were applied: PNCC, ModGDF, 
and MFCC.  

A. Power Normalized Cepstral Coefficients (PNCC)

The PNCC feature extraction algorithm for speech recognition is described in [3]. PNCC has two components: initial processing, and temporal integration for environmental analysis.

1) Initial Processing 

This processing uses a pre-emphasis filter of the form:

$$H(z) = 1 - 0.97z^{-1}$$    (1)
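For illustration, a minimal Python sketch of the filter in (1) follows; the paper's experiments were run in MATLAB, so the function name and NumPy formulation here are our own:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Apply the pre-emphasis filter H(z) = 1 - alpha*z^{-1} of (1).

    y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged.
    """
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```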
Subsequently, a Short-Time Fourier Transform (STFT) is computed using Hamming windows. A DFT size of 1024 is used, producing a window length of 25.6 ms with 10 ms between frames. By weighting the magnitude-squared STFT outputs for positive frequencies, the spectral power in 40 analysis bands is obtained. The center frequencies of the gammatone filters are linearly spaced on the Equivalent Rectangular Bandwidth (ERB) scale between 200 Hz and 8000 Hz [3].
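The band spacing can be reproduced as in the sketch below, which assumes the common Glasberg-Moore ERB-number mapping (the paper does not state which ERB formula was used, and the gammatone filters themselves are omitted):

```python
import numpy as np

def erb_center_freqs(f_low: float = 200.0, f_high: float = 8000.0,
                     n_bands: int = 40) -> np.ndarray:
    """Center frequencies (Hz) linearly spaced on the ERB-number scale.

    Assumes the Glasberg-Moore mapping ERB(f) = 21.4*log10(1 + 0.00437*f).
    """
    hz_to_erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    erb_to_hz = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return erb_to_hz(np.linspace(hz_to_erb(f_low), hz_to_erb(f_high), n_bands))
```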

2) Temporal Integration for Environmental Analysis 

Most speech recognition systems use analysis frames of between 20 and 30 ms. It is often found that longer analysis windows deliver greater noise modeling efficiency and environmental normalization [6], because most background conditions change more slowly than the instantaneous power of speech. In PNCC processing, an estimate is made of a quantity referred to as the "medium-time power" Q[m,l] by calculating the running average of P[m,l], the power observed in a single frame of analysis, according to:

$$Q[m,l] = \frac{1}{2M+1} \sum_{m'=m-M}^{m+M} P[m',l]$$    (2)

where m is the index of the frame, and l is the index of the channel.
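A direct Python rendering of (2) is sketched below; the half-width M = 2 is the value used in the original PNCC paper, and the edge handling (averaging only over the frames that exist) is our assumption:

```python
import numpy as np

def medium_time_power(P: np.ndarray, M: int = 2) -> np.ndarray:
    """Medium-time power Q[m, l]: running average of the frame power
    P[m, l] over 2M+1 frames, as in (2). P has shape (frames, channels)."""
    Q = np.empty_like(P, dtype=float)
    for m in range(P.shape[0]):
        lo, hi = max(0, m - M), min(P.shape[0], m + M + 1)
        Q[m] = P[lo:hi].mean(axis=0)  # edge frames average over fewer neighbors
    return Q
```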

B. Modified Group Delay Function (ModGDF) 

This method is discussed in detail in [7-15]. It should be noted that the group delay function is different from the phase spectrum: it is defined as the negative derivative of the phase, and it can be used effectively to extract the parameters of a system when the signal is a minimum phase signal. This is mainly because, for a minimum phase signal, the magnitude spectrum and the group delay function resemble each other. Figure 2 shows the process of the ModGDF algorithm for extracting speech features. The algorithm is described below.

Algorithm: ModGDF feature extraction pseudocode
Input: speech x(n)
Output: ModGDF feature vector c(n)
Begin
Initialize parameters;
Apply the DFT to the speech x(n), giving X[k];
Apply the DFT to the speech n·x(n), giving Y[k];
Calculate the group delay function, where R and I denote the real and imaginary parts;
Compute the spectrally smoothed spectrum of X[k] and designate it S[k];
Compute the modified group delay, where S[k] is the smoothed version of X[k] and two new parameters α and γ are used to regulate the dynamic range of ModGDF;
Apply the DCT to get the ModGDF features;
Obtain the ModGDF feature vector (13 coefficients per frame);
End.

 

 
Fig. 2.  Feature extraction process of ModGDF 
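A compact Python sketch of the pseudocode above, for one windowed frame, is given below. The values α = 0.4 and γ = 0.9 are typical choices from the ModGDF literature, not values reported by this paper, and a simple moving average stands in for the spectral smoothing that produces S[k]:

```python
import numpy as np
from scipy.fft import dct
from scipy.ndimage import uniform_filter1d

def modgdf_frame(x: np.ndarray, n_fft: int = 512, alpha: float = 0.4,
                 gamma: float = 0.9, n_ceps: int = 13) -> np.ndarray:
    """ModGDF features for one windowed speech frame x(n)."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)        # DFT of x(n)
    Y = np.fft.rfft(n * x, n_fft)    # DFT of n*x(n)
    S = uniform_filter1d(np.abs(X), size=11) + 1e-10  # smoothed spectrum S[k]
    # Group delay from real (R) and imaginary (I) parts, with S[k]^(2*gamma)
    # in the denominator regulating the dynamic range
    tau = (X.real * Y.real + X.imag * Y.imag) / S ** (2.0 * gamma)
    tau = np.sign(tau) * np.abs(tau) ** alpha         # alpha compresses peaks
    return dct(tau, type=2, norm='ortho')[:n_ceps]    # 13 coefficients per frame
```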

C. Mel Frequency Cepstral Coefficients (MFCC) 

MFCC is the most widely used method in speech technology development, as it mimics the human auditory system and takes its characteristics into account [16]. Moreover, these coefficients are robust and reliable under variations between speakers and recording conditions. Figure 3 shows the processing steps of MFCC feature extraction.

 

 
Fig. 3.  Processes in MFCC feature extraction method 

Pre-emphasis is the first step of MFCC; it restores the high-frequency energy that was attenuated during sound generation. Framing cuts the sound signal into short segments. Windowing is used to avert the discontinuities introduced by the framing step. The Fast Fourier Transform (FFT) converts each frame from the time domain to the frequency domain. The filter bank is a set of overlapping band-pass filters. The final step is the Discrete Cosine Transform (DCT), which produces the MFCC coefficients [18]. MFCC is computed from the speech signal in the following three steps:

• Compute the FFT power spectrum of the speech signal.

• Apply a Mel-spaced filter bank to the power spectrum to get the band energies.

• Compute the DCT of the log filter-bank energies to get uncorrelated MFCCs.

The speech signal is first divided into time frames comprising a fixed number of samples. In most systems, overlapping frames are used to smooth the transition from frame to frame. Each time frame is then windowed with a Hamming window to eliminate discontinuities at the edges [17]. The filter coefficients w(n) of a Hamming window of length N are computed according to:

$$w(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$    (3)

where N is the total number of samples and n is the current sample. The Mel scale relates the perceived frequency or pitch of a pure tone to its actual measured frequency. Humans discern small changes in pitch better at lower frequencies, so incorporating this scale makes the features match more closely what humans hear. The formula for converting from frequency to the Mel scale is:

$$m = 1125 \ln\left(1 + \frac{f}{700}\right)$$    (4)

while the formula for converting from the Mel scale back to frequency is:

$$f = 700\left(\exp\left(\frac{m}{1125}\right) - 1\right)$$    (5)
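Putting the steps together, a self-contained Python sketch of the MFCC pipeline using (3)-(5) follows; the 25 ms frames and 20 ms step follow Table II, while the filter count and FFT size are our assumptions:

```python
import numpy as np
from scipy.fft import dct

def mfcc(signal: np.ndarray, fs: int = 16000, frame_ms: int = 25,
         step_ms: int = 20, n_filt: int = 26, n_ceps: int = 13,
         n_fft: int = 512) -> np.ndarray:
    """MFCCs per frame: pre-emphasis, framing, Hamming window (3),
    FFT power spectrum, Mel filter bank via (4)/(5), log, DCT."""
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    flen, fstep = fs * frame_ms // 1000, fs * step_ms // 1000
    starts = range(0, len(sig) - flen + 1, fstep)
    frames = np.stack([sig[s:s + flen] for s in starts]) * np.hamming(flen)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft      # power spectrum
    mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)             # eq. (4)
    imel = lambda m: 700.0 * (np.exp(m / 1125.0) - 1.0)          # eq. (5)
    edges = imel(np.linspace(0.0, mel(fs / 2.0), n_filt + 2))    # Mel-spaced edges
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):                                      # triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_e = np.log(power @ fbank.T + 1e-10)                      # log band energies
    return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]  # uncorrelated MFCCs
```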

VI. CLASSIFICATION 

SVM is principally a binary classifier, but it can be extended to multi-class tasks with two approaches: one-vs-all, which compares each class against all the others together, and one-vs-one, which compares each pair of classes separately [20]. In this study, one-vs-all was used, consisting of as many binary SVMs as there are classes. Each SVM is trained with one class against the rest, and all of them are taken into consideration when testing: the decision is eventually made based on the distances between the test data and the hyperplanes of all the SVMs.
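As a sketch, the one-vs-all scheme can be reproduced with scikit-learn as follows; the RBF kernel and its parameters are assumptions, since the paper does not report the kernel settings, and X_train, y_train, X_test are placeholders for the extracted feature vectors and labels:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One binary SVM per class, each trained as "its class vs. all the rest";
# prediction picks the class whose decision function (signed distance to
# the separating hyperplane) is largest for the test sample.
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)   # X_train: feature vectors, y_train: class labels
y_pred = clf.predict(X_test)
```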

VII. SIMULATION RESULTS  

Several experiments were conducted on the speech database for classification and recognition, using MFCC, PNCC, and ModGDF for feature extraction. The training procedure used 60% of the data, while 40% was used for testing. The test procedure was implemented in Matlab 2016, and screenshots are shown in Figures 4 and 5. Evaluation and testing were performed using accuracy rate, specificity, sensitivity, precision, and execution time.
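The paper does not spell out how the per-class measures are averaged; a plausible reading, macro-averaged one-vs-rest statistics computed from a confusion matrix such as those in Tables VIII, X, and XII, is sketched here:

```python
import numpy as np

def ovr_metrics(cm: np.ndarray) -> dict:
    """Macro-averaged one-vs-rest metrics from a confusion matrix
    (rows = true classes, columns = predicted classes)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp        # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp        # missed instances of the class
    tn = cm.sum() - tp - fp - fn
    return {
        "accuracy":    np.mean((tp + tn) / cm.sum()),
        "specificity": np.mean(tn / (tn + fp)),
        "sensitivity": np.mean(tp / (tp + fn)),
        "precision":   np.mean(tp / (tp + fp)),
    }
```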

 

 
Fig. 4.  Layout of the main system 

 
Fig. 5.  Implementation 

A. Analysis for Arabic Digits 

The feature extraction methods were applied to the digit samples, and the results are shown in Table VII.

TABLE VII.  SVM RESULTS ON DIGITS

Feature extraction technique  Accuracy rate (%)  Specificity (%)  Sensitivity (%)  Precision (%)  Execution time (s)
ModGDF                        90.3               94.5             50.5             72.7           16.39
PNCC                          97.5               98.6             87.6             88.7           54.8
MFCC                          88.3               93.5             41.7             53.7           87.5




Figure 6 illustrates the methods' performance. As can be observed, ModGDF with SVM obtained better results regarding time cost. PNCC and MFCC with SVM obtained good results, but their execution times were much higher. It is concluded that ModGDF had the lowest time cost, as it reduced execution time complexity. Table VIII shows the confusion matrix of PNCC for the recognition of Arabic digits. Figure 7 displays a sample of ModGDF with SVM recognizing the Arabic digit "Khamsah".

 

 
Fig. 6.  Methods' performance on the recognition of Arabic digits 

 
Fig. 7.  ModGDF sample recognizing the Arabic digit “Khamsah” 

TABLE VIII.  CONFUSION MATRIX OF DIGITS USING PNCC/SVM

19 0 0 0 0 0 0 0 0 0 
0 19 0 0 0 0 0 0 0 0 
0 2 17 0 0 0 0 0 0 0 
1 0 1 17 0 0 0 0 0 0 
0 1 0 0 18 0 0 0 0 0 
0 1 1 0 0 17 0 0 0 0 
0 0 0 2 0 1 16 0 0 0 
2 0 1 1 0 1 1 13 0 0 
1 0 0 3 0 1 0 1 13 0 
0 0 0 1 0 0 1 0 0 17 

 

B. Analysis for Arabic Words 

Table IX shows the results on the recognition of Arabic words. The results of ModGDF with SVM were not satisfactory, but its time cost was much lower than that of the other feature extraction methods. PNCC with SVM performed better, but its time cost turned out to be significantly higher. The results are also shown in Figure 8. Table X shows the confusion matrix of PNCC/SVM for the recognition of Arabic words, attesting that PNCC is more robust at identifying Arabic words. Figure 9 illustrates the performance of PNCC on the recognition of an Arabic word ("Dirham").

TABLE IX.  RESULTS ON WORDS

Feature extraction technique  Accuracy rate (%)  Specificity (%)  Sensitivity (%)  Precision (%)  Execution time (s)
ModGDF                        89.3               94.1             46.8             58.6           12.3
PNCC                          95.15              97.3             75.8             79.2           49.5
MFCC                          88.6               93.6             43.1             51.8           99.5

 

 
Fig. 8.  Performance on the recognition of Arabic words 

 
Fig. 9.  Sample of PNCC with the SVM recognizing the Arabic word 
“Dirham” 

TABLE X.  CONFUSION MATRIX OF WORDS USING PNCC/SVM

19 0 0 0 0 0 0 0 0 0 
2 17 0 0 0 0 0 0 0 0 
2 2 15 0 0 0 0 0 0 0 
2 1 1 15 0 0 0 0 0 0 
2 0 3 2 12 0 0 0 0 0 
1 1 2 2 0 13 0 0 0 0 
1 3 1 0 0 0 14 0 0 0 
0 0 0 1 2 0 0 16 0 0 
0 1 1 1 0 0 0 0 16 0 
2 0 0 0 0 0 2 2 1 12 

 

C.  Analysis for Arabic Sentences 

Table XI shows the performance results on the recognition of Arabic sentences. As can be observed, PNCC with SVM performed better, but had a greater execution time. PNCC again had the highest accuracy and a lower execution time than MFCC. ModGDF had the lowest execution time (18.9 s) and an accuracy of 88.2%, while MFCC again showed the lowest accuracy. The accuracy of PNCC/SVM is also confirmed by its confusion matrix for sentences in Table XII, which shows that PNCC/SVM is capable of recognizing sentences with satisfactory results. The results are also shown in Figure 10, while Figure 11 illustrates the performance of PNCC on the recognition of an Arabic sentence ("What are the available majors?").

TABLE XI.  RESULTS ON SENTENCES

Feature extraction technique  Accuracy rate (%)  Specificity (%)  Sensitivity (%)  Precision (%)  Execution time (s)
ModGDF                        88.2               93.5             41.2             45.3           18.9
PNCC                          93.05              96.14            65.26            71.04          70.0
MFCC                          86.0               92.2             30.0             49.48          125.0

 

 
Fig. 10.  Feature extraction performance on Arabic sentences 

 
Fig. 11.  Sample of PNCC/SVM recognizing the Arabic sentence “What are 
the available majors?” 

TABLE XII.  CONFUSION MATRIX OF SENTENCES USING PNCC/SVM 

19 0 0 0 0 0 0 0 0 0 
2 17 0 0 0 0 0 0 0 0 
1 7 11 0 0 0 0 0 0 0 
0 0 1 18 0 0 0 0 0 0 
1 0 0 8 10 0 0 0 0 0 
0 0 0 0 3 16 0 0 0 0 
0 0 0 0 0 7 12 0 0 0 
0 0 0 1 0 0 1 17 0 0 
0 0 0 0 0 0 0 9 10 0 
6 0 0 0 0 0 0 0 3 10 

VIII. CONCLUSION 

In this paper, a speech recognition system for the Arabic language was presented, evaluating three feature extraction algorithms, namely MFCC, PNCC, and ModGDF, while an SVM was used for classification. Results showed that PNCC was the most efficient, while ModGDF had moderate accuracy, and both achieved greater accuracy than MFCC: PNCC had a 93-97% accuracy rate, ModGDF about 90%, and MFCC about 88%.

REFERENCES 

[1] P. P. Shrishrimal, R. R. Deshmukh, V. B. Waghmare, “Indian language 
speech database: A review”, International Journal of Computer 
Applications, Vol. 47, No. 5, pp. 17-21, 2012 

[2] S. K. Gaikwad, B. W. Gawali, P. Yannawar, “A review on speech 
recognition technique”, International Journal of Computer Applications, 
Vol. 10, No. 3, pp. 16-24, 2010 

[3] C. Huang, T. Chen, E. Chang, “Accent issues in large vocabulary 
continuous speech recognition”, International Journal of Speech 
Technology, Vol. 7, No. 2-3, pp. 141-153, 2004 

[4] M. A. Anasuya, S. K. Katti, “Speech recognition by machine: A 
review”, International Journal of Computer Science and Information 
Security, Vol. 6, No. 3, pp. 181-205, 2009 

[5] P. L. Garvin, P. Ladefoged, “Speaker identification and message 
identification in speech recognition”, Phonetica, Vol. 9, No. 4, pp. 193-
199, 1963 

[6] G. Ceidaite, L. Telksnys, “Analysis of factors influencing accuracy of 
speech recognition”, Elektronika ir Elektrotechnika, Vol. 105, No. 9, pp. 
69-72, 2010 

[7] Z. H. Tan, B. Lindberg, “Speech recognition on mobile devices”, in: 
Mobile Multimedia Processing – WMMP 2008, Lecture Notes in 
Computer Science, Vol. 5960, Springer, 2010 

[8] W. Li, K. Takeda, F. Itakura, “Robust in-car speech recognition based 
on nonlinear multiple regressions”, EURASIP Journal on Advances in 
Signal Processing, 2007 

[9] W. Ou, W. Gao, Z. Li, S. Zhang, Q. Wang, “Application of keywords 
speech recognition in agricultural voice system”, Second International 
Conference on Computational Intelligence and Natural Computing, 
Wuhan, China, September 13-14, 2010 

[10] L. Zhu, L. Chen, D. Zhao, J. Zhou, W. Zhang, “Emotion recognition 
from Chinese speech for smart affective services using a combination of 
SVM and DBN”, Sensors, Vol. 17, No. 7, 2017 

[11] J. E. Noriega-Linares, J. M. Navarro Ruiz, “On the application of the 
raspberry Pi as an advanced acoustic sensor network for noise 
monitoring”, Electronics, Vol. 5, No. 4, 2016 

[12] M. Al-Rousan, K. Assaleh, “A wavelet-and neural network-based voice 
system for a smart wheelchair control”, Journal of the Franklin Institute, 
Vol. 348, No. 1, pp. 90-100, 2011 

[13] I. V. McLoughlin, H. R. Sharifzadeh, “Speech recognition for smart 
homes”, in: Speech Recognition, Technologies and Applications, Intech, 
2008 

[14] A. Glowacz, “Diagnostics of rotor damages of three-phase induction 
motors using acoustic signals and SMOFS-20-EXPANDED”, Archives 
of Acoustics, Vol. 41, No. 3, pp. 507-515, 2016 

[15] A. Glowacz, “Fault diagnosis of single-phase induction motor based on 
acoustic signals”, Mechanical Systems and Signal Processing, Vol. 117, 
pp. 65-80, 2019 

[16] M. Kunicki, A. Cichon, “Application of a phase resolved partial 
discharge pattern analysis for acoustic emission method in high voltage 
insulation systems diagnostics”, Archives of Acoustics, Vol. 43, No. 2, 
pp. 235-243, 2018 

[17] D. Mika, J. Jozwik, “Advanced time-frequency representation in voice 
signal analysis”, Advances in Science and Technology Research Journal, 
Vol. 12, No. 1, pp. 251-259, 2018 




[18] L. Zou, Y. Guo, H. Liu, L. Zhang, T. Zhao, “A method of abnormal 
states detection based on adaptive extraction of transformer vibro-
acoustic signals”, Energies, Vol. 10, No. 12, 2017 

[19] H. Yang, G. Wen, Q. Hu, Y. Li, L. Dai, “Experimental investigation on 
influence factors of acoustic emission activity in coal failure process”, 
Energies, Vol. 11, No. 6, Article ID 1414, 2018 

[20] L. Mokhtarpour, H. Hassanpour, “A self-tuning hybrid active noise 
control system”, Journal of the Franklin Institute, Vol. 349, No. 5, pp. 
1904-1914, 2012 

[21] S. C. Lee, J. F. Wang, M. H. Chen, “Threshold-based noise detection 
and reduction for automatic speech recognition system in human-robot 
interactions”, Sensors, Vol. 18, No. 7, Article ID 2068, 2018 

[22] S. M. Kuo, W. M. Peng, “Principle and applications of asymmetric 
crosstalk-resistant adaptive noise canceler”, Journal of the Franklin 
Institute, Vol. 337, No. 1, pp. 57-71, 2000 

[23] J. W. Hung, J. S. Lin, P. J. Wu, “Employing robust principal component 
analysis for noise-robust speech feature extraction in automatic speech 
recognition with the structure of a deep neural network”, Applied 
System Innovation, Vol. 1, No. 3, Article ID 28, 2018 

[24] R. P. Lippmann, “Speech recognition by machines and humans”, Speech 
Communication, Vol. 22, No. 1, pp. 1-15, 1997 

[25] J. B. Allen, “How do humans process and recognize speech?”, IEEE 
Transactions on Speech and Audio Processing, Vol. 2, No. 4, pp. 567-
577, 1994 

[26] S. Haque, R. Togneri, A. Zaknich, “Perceptual features for automatic 
speech recognition in noisy environments”, Speech Communication, 
Vol. 51, No. 1, pp. 58-75, 2009 

[27] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech”, 
The Journal of the Acoustical Society of America, Vol. 87, No. 4, pp. 
1738-1752, 1990 

[28] M. Holmberg, D. Gelbart, W. Hemmert, “Automatic speech recognition 
with an adaptation model motivated by auditory processing”, IEEE 
Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 
1, pp. 43-49, 2005 

[29] C. Kim, R. M. Stern, “Power-normalized Cepstral Coefficients (PNCC) 
for robust speech recognition”, 2012 IEEE International Conference on 
Acoustics, Speech and Signal Processing,  Kyoto, Japan, March 25-30, 
2012 

[30] M. L. Seltzer, D. Yu, Y. Wang, “An investigation of deep neural 
networks for noise robust speech recognition”, 2013 IEEE International 
Conference on Acoustics, Speech and Signal Processing, Vancouver, 
Canada, May 26-31, 2013 

[31] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, A. Y. Ng, 
“Recurrent neural networks for noise reduction in robust ASR”, 13th 
Annual Conference of the International Speech Communication 
Association, Portland, USA, September 9-13, 2012 

[32] M. Wollmer, B. Schuller, F. Eyben, G. Rigoll, “Combining long short-
term memory and dynamic bayesian networks for incremental emotion-
sensitive artificial listening”, IEEE Journal of Selected Topics in Signal 
Processing, Vol. 4, No. 5, pp. 867-881, 2010 

[33] Z. Zhang, J. Geiger, J. Pohjalainen, A. E. D. Mousa, W. Jin, B. Schuller, 
“Deep learning for environmentally robust speech recognition: An 
overview of recent developments”, ACM Transactions on Intelligent 
Systems and Technology, Vol. 9, No. 5, pp. 1-28, 2018 

[34] E. Principi, S. Squartini, F. Piazza, “Power normalized cepstral 
coefficients based supervectors and i-vectors for small vocabulary 
speech recognition”, 2014 International Joint Conference on Neural 
Networks, Beijing, China, July 6-11, 2014 

[35] E. Loweimi, S. M. Ahadi, “A new group delay-based feature for robust speech recognition”, 2011 IEEE International Conference on Multimedia and Expo, Barcelona, Spain, July 11-15, 2011

[36] B. Kurian, K. T. Shanavaz, N. G. Kurup, “PNCC based speech 
enhancement and its performance evaluation using SNR Loss”, 2017 
International Conference on Networks & Advances in Computational 
Technologies, Thiruvanthapuram, India, July 20-22, 2017 

[37] T. Fux, D. Jouvet, “Evaluation of PNCC and extended spectral subtraction methods for robust speech recognition”, 23rd European Signal Processing Conference, Nice, France, August 31 – September 4, 2015

[38] A. Kaur, A. Singh, “Power-Normalized Cepstral Coefficients (PNCC) 
for Punjabi automatic speech recognition using phone based modelling 
in HTK”, 2nd International Conference on Applied and Theoretical 
Computing and Communication Technology, Bangalore, India, July 21-
23, 2016 

[39] C. Kim, R. M. Stern, “Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring”, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, USA, March 14-19, 2010

[40] D. S. Kim, S. Y. Lee, R. M. Kil, “Auditory processing of speech signals 
for robust speech recognition in real-world noisy environments”, IEEE 
Transactions on Speech and Audio Processing, Vol. 7, No. 1, pp. 55-69, 
1999