Microsoft Word - ETASR_V12_N6_pp9570-9578 Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9570-9578 9570 www.etasr.com Amraoui & Saadi: A Novel Approach on Speaker Gender Identification and Verification Using DWT … A Novel Approach on Speaker Gender Identification and Verification Using DWT First Level Energy and Zero Crossing Abdelkader Amraoui Laboratory of Applied Automation and Industrial Diagnostics, Faculty of Sciences and Technology, Ziane Achour University of Djelfa, Djelfa, Algeria kader2717@yahoo.fr Slami Saadi Department of Computer Sciences, Faculty of Exact Sciences and Informatics, Ziane Achour University of Djelfa, Djelfa, Algeria s.saadi@univ-djelfa.dz Received: 18 August 2022 | Revised: 3 September 2022 | Accepted: 5 September 2022 Abstract-The aim of this work is to find a new criterion for determining a range of values in order to determine the gender of a speaker. The use of the Discrete Wavelet Transform (DWT) of the Daubechies db7 parent wavelet and the computation of the zero crossing energy from the first level of the DWT was followed by computation of the values of the criterion for both genders and comparison with the value of the speech basic frequency for both genders for the same sign or sentence. The standard has a limited range of values close to the basic frequency range of the same speaker through which we can determine gender. This criterion has been tested on several men and women databases with different repeated sentences for the same person or for both genders and it gives acceptable results that can be worked on. Keywords-speaker gender; DWT; energy; zero crossing I. INTRODUCTION The difference in the linguistic characteristics of humans is characterized by the variance in frequencies of the two genders. Generally, the frequencies of women are smaller than those of men. Usually, the sound consists of vibrations with different lengths and heights. As the human ear cannot hear all the vibrations, some of them are audible, i.e. from 20Hz to 20kHz, and some are inaudible. Frequencies smaller than 20Hz are inaudible and frequencies larger than 20kHz are inaudible and painful to the human ear. Of course, as we get older, the limit of hearing drops, sometimes to 17kHz [1]. The speech signal is non-stationary, complex, and variable with time [2]. In addition, the highest frequency of the speech signal is in the order of 5kHz. The operations done on this signal, such as sampling, require frequencies according to Shannon’s law that the sampling frequency is more than twice the frequency of the original signal under study. Many research works have formerly addressed the study and determination of the range of f0 values [3], which dealt with extracting the basic frequency in the domains of time and frequency as well as wavelets. In addition, some research works deal with a comparison between estimation methods for computing the fundamental for a newborn. In [2], the authors presented techniques for determining the frequency position in time. Other research works [4-6] were interested by this subject, on which researches are still developing since it is not possible to confirm definitively the criterion that determines the gender of the speaker according to the basic frequency domain. Authors in [7] present an efficient approach for automatic speaker identification based on cepstral features, the Normalized Pitch Frequency (NPF) with Discrete Cosine Transform (DCT) and wavelet de-noising pre-processing. On the other hand, wavelet based features in combination with Spectral-Subtraction (SS) were proposed in [8] for speaker identification in clean and noisy environment. Using neural networks for speech recognition tasks, where words are constructed from sequential individual and text-independent sound segments was addressed in [9] for speaker identification. To reduce memory utilization for speaker audio segment identification, authors in [10] proposed a new on line speaker identifier model using short input audio segments. Extracting features from raw speech that captures the unique characteristics of each speaker is accomplished by the filter bank-based Mel Frequency Cepstral Coefficients (MFCC) approach [11] using Discrete Wavelet Transform (DWT). The Average Framing Linear Prediction Coding (AFLPC) technique combined with wavelet transform for text-independent speaker identification systems was presented in [12]. Extracting features from raw speech that capture the unique characteristics of a particular individual using Wavelet Packet Transform (WPT) is presented in [13]. Authors in [14] identify speakers basing on cepstral feature strategy and the NPF as a new feature, for enhancing accuracy using Neural Networks (NN). A motivating application of speech signals identification, for the detection of people with heart failure by glottal features, was presented in [15]. A robust feature extraction method for a real-time speech recognition hardware system was presented in [16]. In the current work, a novel approach for speaker gender identification and verification is introduced. This new criterion is based on DWT and the intersections with zero of the DWT first level for both genders with comparison to speech fundamental frequency. The simulation results prove that the proposed approach enhances the performance of speaker Corresponding author: Slami Saadi Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9570-9578 9571 www.etasr.com Amraoui & Saadi: A Novel Approach on Speaker Gender Identification and Verification Using DWT … identification, especially with the DWT and the de-noising pre- processing step. II. SPEECH SIGNAL CHARACTERISTICS The speech signal is an audio carrier of complex and diverse information and has features such as the fundamental frequency characteristic, which is often denoted by f0. Other characteristics are the energy and the frequency spectrum [17]. The vibration or the cycle of opening/closing of the vocal cords represents f0. This frequency characterizes only the voiced segments and evolves slowly over time [17]. The speaker frequency varies according to age and gender. It extends approximately from 60Hz to 150Hz for men and from 150Hz to 450Hz for women. The fundamental frequency extraction is not an easy task since the periodicity of vibration of the vocal cords is not always perfect [6]. The amplitude of the speech signal changes significantly over time and in the audible speech. This amplitude is much greater than in the inaudible speech and reflects to us the energy changes in the signal. One of the characteristics of the speech signal is the energy feature, usually symbolized by E, computed according to (2). This energy is very high in audible sounds compared to inaudible sounds in which the energy is weak [18]. E� � ∑ �X�n �� � � (1) According to the scheme shown in Figure 1, we symbolize the standard called Wavelet Energy Rate (WER), which is applied to the first DWT level of the speech signal under study. This signal is a spoken sentence by some speakers of different genders containing repeated and different sentences. These are samples taken from a database that was prepared specifically for this study: videos of sentences in mp4 format were taken from people of both genders. These videos are downloaded and the spoken sentences were converted to wav format. Each sentence was sampled separately according to the sampling frequency fe =11025hz, which is a frequency greater than double the frequency of the speech signal (5khz) on the basis of Shannon's law. III. FILTERING The filtering step is accomplished through a mathematical transformation applied on the speech signal under study using a low pass filter with cutoff frequency f� � 600Hz. This low pass filter reduces the signal high frequency to �10log10 (2)dB in the order of -3dB, which means decreasing the signal output energy to the 71% of the original speech signal [19]. Among the most widely used filters are the Butterworth linear filters, which are similar in shape with a difference in the cut-off frequency range. This filter has the largest amplitude and is more stable with frequency in the pass range and in the transition region with moderate reduction and is selected based on amplitude accuracy [20]. The filter order represents the number of columns in the filter pass region. For example, a filter of order n has a reduction rate of 6×n dB/decade to 20×n/decade. When n=8 its reduction is 48dB/ decade [20]. Fig. 1. General scheme. Fig. 2. Filtered speech signal. IV. DISCRETE WAVELET TRANSFORM DWT is based on sub-band coding, and developed after that to be a technique similar to sub-band coding called hierarchical coding [21]. It is identified by the following relationship [22]: X����n� � x�n� ∗ h�n� � ∑ x�k� ∗ h�n � k� � �� � (2) DWT analyzes the original signal x(n) in different frequency bands with different degrees of accuracy by evaluating the signal into approximate and detailed coefficients where the approximate coefficients have the highest amplitude, the lowest frequencies, and possess most of the energy. The components of the highest frequencies are mixed with the noise existing in the original signal. Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9570-9578 9572 www.etasr.com Amraoui & Saadi: A Novel Approach on Speaker Gender Identification and Verification Using DWT … (a) (b) Fig. 3. First level signal DWT decomposition after filtering. (a) Approximation part a1 of speech signal, (b) detail part of speech signal d1. As previously mentioned, the DWT uses digital filtering techniques, in which the signal to be analyzed passes through filters with different cut-off frequencies at different scales. The original signal is divided into details coefficients and approximation coefficients, which are often denoted by ak and dk respectively, where d denotes the detailed coefficients and a denotes the approximate coefficients, while k denotes the analytical level. The two parameters are the result of the convolution of the original signal x(n) with the pass filter, meaning that the approximate parameters are the result of the low pass filter convolution with the original signal and the detailed parameters are the convolution of the high pass filter with the original signal [24]. a1 � ∑ X�k y�2n � k���� � (3) d1 � ∑ X�k z�2n � k�∞�� ∞ (4) In the waveform analysis, the signal is analyzed and synthesized according to stages starting from dividing the signal x(n) to the end of the required level and rebuilding it from the last level we stopped to the end of the first level from which we started. For example, if the original signal x(n) contains 512 samples and a frequency from 0 to π, then each of the approximate and detailed coefficients in the first level has 256 samples and one half(1/2) the frequency of the signal, i.e. π/2, and in the second level there are 128 samples and the half frequency of the signal level, i.e. f = π/4 or ¼ of the original signal frequency for both approximate and detailed parameters. As mentioned above, the wavelet transform has more flexibility in designing the pulse shape and less sensitivity to signal distortion. The DWT analyzes the original signal x(n) in different frequency bands with different degrees of accuracy by analyzing the signal into approximate coefficients and detailed coefficients. The approximate coefficients have the highest amplitude and components of low frequencies and possess most of the signal energy. It is known that the speech signal is continuous and the fundamental frequency of a man’s voice is around 50Hz (with period T = 20ms) whereas the f0 of a woman’s voice is near 180Hz, with a period T = 250ms. Also, the basic frequency for the voice of a woman and of a man rises to 500Hz and to 200Hz respectively. V. INTERSECTIONS WITH ZERO Intersections with zero is the number of the times the signal x(n) passes through zero in a certain period of time. It is an indication of the frequency at which the energy is concentrated in the signal spectrum, as the energy is concentrated at low frequencies in audible speech, while in inaudible speech most of it is found in high frequencies. From this, the highest number of intersections with zero is at high frequencies and the lowest is at low frequencies [18]. We can say that the intersection of the signal with zero has a strong relationship with the distribution of energy with frequency. The number of intersections with zero is computed for the signal according to [25]: npz = ��� ∑ |sgn�x(n)� − sgn�x(n − 1)�|� ���* (5) sgn�x(n)� = + +1x(n) ≥ 0 −1x(n) < 0 (6) VI. THE PROPOSED WAVELET ENERGY RATE (WER) The WER criterion is calculated according to (7). This new criterion allows us to determine the gender of the speaker based on the studied values. WER is the quotient of the energy Ea1 obtained from the approximation coefficient of the first level of the DWT applied on the signal x(n) according to (8) divided by npz: WER = 234�56 × Tpz (7) E9� = ∑ �a1(n) �� �� (8) where a� is the DWT first level, Ea1 the DWT first signal approximation of coefficients energy, npz the DWT first level signal approximation of coefficient intersections with zero, and Tpz the intersection ratio with zero for the original signal. The value of the WER criterion is distributed over a defined range from 50 to 180Hz for men and from 180 to 500Hz for women. These values are similar to the fundamental frequency of the speech signal f0, which determines the gender of the speaker as a first process, before identifying the speaking person by comparison with other values in our previously prepared database. VII. USED COMPUTATION TOOLS Ιn our investigation, we used KVideo Downloader4 program to download English learning videos in mp4 format 0 0.5 1 1.5 2 2.5 x 10 4 -8 -6 -4 -2 0 2 4 6 8 x 10 -4 0 0.5 1 1.5 2 2.5 x 10 4 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9570-9578 9573 www.etasr.com Amraoui & Saadi: A Novel Approach on Speaker Gender Identification and Verification Using DWT … from an Internet Database. Then, these clips were converted to wav format using on line Covertio and on line Cloud Convert. The obtained signal was sampled into some specific sentences using the Audacity program by a sampling frequency fe=11025Hz to be analyzed and studied by the Matlab A13 toolbox on a compute with the following specifications: PC Acer based on x67, processor: Intel® Core™ i3-2348M, CPU: n2.30GHz, Version SMBIOS 2.7, Operating system: Windows 10 VIII. EXPERIMENTAL RESULTS Through the obtained experimental results presented below, the developed approaches were applied on some different repeated sentences from male and female speakers. We run 200 experiments on the speech signal to extract about 4500 different values, after filtering the speech signal with a Butterworth filter of degree n=8 and sampling frequency fe=11025Hz. This signal was analyzed to the first level by DWT seventh partition (db7). The number of intersections with zero to be multiplied by the ratio of the original signal intersections with zero, according to (7) and (8) are shown in Table I. Through Tables I and II, it was found that the energy of the original signal has different values in the two genders. The energy of the original signal for men is much greater than for women and this happens for the same repeated sentence, several different sentences for the same woman or the same man, or using the same sentence for several women and several men, as shown in Figure 9. E:; ≫ E:= (9) where E:; is the energy of the original signal for men and E:= the energy of the original signal for women. TABLE I. A SENTENCE REPEATED FOR THE SAME PERSON AND FOR SEVERAL PEOPLE (4 WOMEN AND 3 MEN) A sentence repeated for the same person and for several people (4 women and 3 men) "do you speak english" Signal Sample number Energy signal Energy >?@ ZCN signal ABC?@ ABD?@ Women Woman A 601 22243 879380 28077000 1268 4405.3 0.0570 Woman A 602 21060 833610 25851000 1224 4282.3 0.0581 Woman B 201 12487 271940 9406900 740 2623.8 0.0592 Woman B 203 12820 275960 10579000 811 2874.1 0.0632 Woman D 301 12449 1795300 9581200 768 2701.7 0.0617 Woman D 302 12793 2313000 10444000 815 2885.4 0.0637 Woman M 1000 15493 577710 11501000 723 2600.5 0.0467 woman M 1003 15988 583980 12943000 795 2833.3 0.0497 Men Man B 101 13292 5851300 6844900 526 1790.1 0.0396 Man B 102 13667 6016100 7346700 538 1864.2 0.0394 Man K 100 11341 317560 4442000 391 1371.3 0.0345 Man W 400 11685 773940 4254200 364 1285.6 0.0311 Man W 401 11638 425690 4075200 354 1245.3 0.0304 Fig. 4. Man B101 speech signal analysis (original, filtered, DWT decomposed, and energy spectrum), sentence: "do you speak English"? 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -1 0 1 time(s) A m p li tu d e wav B 101 for a man voice original signal 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -1 0 1 time(s) A m p li tu d e wav B 101 for a man voice filtered signal 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -2 -1 0 1 time(s) A m p li tu d e wav B 101 for a man voice nv1 wavelet db7 0 2000 4000 6000 8000 10000 12000 14000 0 0.5 1 1.5 wav B 101 for a man voice Energie Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9570-9578 9574 www.etasr.com Amraoui & Saadi: A Novel Approach on Speaker Gender Identification and Verification Using DWT … Fig. 5. Woman B601 speech signal analysis (original, filtered, DWT decomposed, and energy spectrum), sentence: "do you speak English"? Fig. 6. Woman B201 speech signal analysis (original, filtered, DWT decomposed, and energy spectrum), sentence: "do you speak English"? 0 0.5 1 1.5 2 2.5 -0.5 0 0.5 time(s) A m p li tu d e wav A601 for a woman voice original signal 0 0.5 1 1.5 2 2.5 -0.5 0 0.5 time(s) A m p li tu d e wav A601 for a woman voice filtered signal 0 0.5 1 1.5 2 2.5 -0.5 0 0.5 time(s) A m p li tu d e wav A601 for a woman voice nv1 wavelet db7 0 0.5 1 1.5 2 2.5 x 10 4 0 0.05 0.1 wav A601 for a woman voice Energie 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -0.5 0 0.5 time(s) A m p li tu d e wav B 201 for a woman voice original signal 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -0.2 0 0.2 time(s) A m p li tu d e wav B 201 for a woman voice filtered signal 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -0.5 0 0.5 time(s) A m p li tu d e wav B 201 for a woman voice nv1 wavelet db7 0 2000 4000 6000 8000 10000 12000 14000 0 0.05 0.1 wav B 201 for a woman voice Energie Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9570-9578 9575 www.etasr.com Amraoui & Saadi: A Novel Approach on Speaker Gender Identification and Verification Using DWT … Fig. 7. Comparison between woman a601 and woman b201 speech signal analysis (original, filtered, DWT decomposed, and energy spectrum), sentence: "do you speak English"? Fig. 8. Comparison between woman b201 and man b101 speech signal analysis (original, filtered, DWT decomposed, and energy spectrum), sentence: "do you speak English"? In the approximate coefficient a1 signal, in the first level of the DWT, we find that the energy Ea1 for women is greater than the energy for men and this is for the same repeated sentence or several different sentences for the same woman or the same man as shown in Tables I and II and Figure 10, or for a repetitive sentence or several different sentences for several women and several men: E9�; ≫ E9�= (10) 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -0.5 0 0.5 time(s) A m p li tu d e wav B 201 for a woman voice original signal 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -0.2 0 0.2 time(s) A m p li tu d e wav B 201 for a woman voice filtered signal 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -0.5 0 0.5 time(s) A m p li tu d e wav B 201 for a woman voice nv1 wavelet db7 0 2000 4000 6000 8000 10000 12000 14000 0 0.05 0.1 woman A 601 wav B 201 for a woman voice Energie 0 0.5 1 1.5 2 2.5 -0.5 0 0.5 time(s) A m p li tu d e wav A601 for a woman voice original signal 0 0.5 1 1.5 2 2.5 -0.5 0 0.5 time(s) A m p li tu d e wav A601 for a woman voice filtered signal 0 0.5 1 1.5 2 2.5 -0.5 0 0.5 time(s) A m p li tu d e wav A601 for a woman voice nv1 wavelet db7 0 0.5 1 1.5 2 2.5 x 10 4 0 0.05 0.1 wav A601 for a woman voice Energie 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -0.5 0 0.5 time(s) A m p li tu d e wav B 201 for a woman voice original signal 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -0.2 0 0.2 time(s) A m p li tu d e wav B 201 for a woman voice filtered signal 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -0.5 0 0.5 time(s) A m p li tu d e wav B 201 for a woman voice nv1 wavelet db7 0 2000 4000 6000 8000 10000 12000 14000 0 0.05 0.1 man B 101 wav B 201 for a woman voice Energie 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -1 0 1 time(s) A m p li tu d e wav B 101 for a man voice original signal 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -1 0 1 time(s) A m p li tu d e wav B 101 for a man voice filtered signal 0 0.2 0.4 0.6 0.8 1 1.2 1.4 -2 0 2 time(s) A m p li tu d e wav B 101 for a man voice nv1 wavelet db7 0 2000 4000 6000 8000 10000 12000 14000 0 1 2 wav B 101 for a man voice Energie Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9570-9578 9576 www.etasr.com Amraoui & Saadi: A Novel Approach on Speaker Gender Identification and Verification Using DWT … where E9�; E9�= represent the energy of the approximate coefficient a1 for men and women respectively. Fig. 9. Comparison of the signal energy between a man and a woman. Fig. 10. Comparison of the Energy of the DWT first level signal between a man and a woman. It is noticeable that the number of intersections with zero (Zero Crossing Number-ZCN) in the original signal differs for the two genders with diverse values. The ZCN of the original signal for men is much greater than for women, and for the same repeated sentence, several different sentences for the same woman or the same man, and for several different sentences and several men/women as shown in Tables I-II and Figure 11. npz:= ≫ npz:; (11) where npz:; is the intersection of the original signal with zero for men and npz:= for women. Fig. 11. ZCN comparison of for the signal between a man and a woman. Fig. 12. ZCN comparison of in the DWT first level for the signals of a man and a moman. On the other hand, for the approximate coefficient a1 signal in the first level of the DWT, we find that the zero crossing npza1F and the Zero Crossing Ratio (ZCR) for women is greater than that for men, for the same repeated sentence, several different sentences for the same woman, or for a repeated sentence or for different sentences for a number of men and women, as shown in Tables I-II and Figure 12. Through Figure13, we conclude that the crossing ratio with zero for the same level is greater for women than for men. npz9�= ≫ npz9�; (12) where EFGH�I and EFGH�J represent the crossing with zero of a1 signal for men and women respectively. Fig. 13. Comparison of ZCR in the DWT first level for the signal of a man and a woman. TABLE II. WER COMPARISON FOR MEN AND WOMEN Signal Sample number Energy signal Energy signal KH� LMNH� LMOH� WER 4 women and 2 men with the same sentence: "do you speak English"? repeated 3 times Woman A 601 22243 879380 28077000 4405.3 0.0570 363.3295 Woman A 602 21060 833610 25851000 4282.3 0.0581 350.8538 Woman B 201 12487 271940 9406900 2623.8 0.0592 212.4698 Woman B 202 12645 274830 9855600 2715.3 0.0606 219.8762 Woman D 301 12449 1795300 9581200 2701.7 0.0617 218.7851 Woman D 302 12793 2313000 10444000 2885.4 0.0637 230.6052 Woman m1000 15493 577710 11501000 2600.5 0.0467 206.3805 Woman m1003 15988 583980 12943000 2833.3 0.0497 227.1477 Man B 101 13292 5851300 6844900 1790.1 0.0396 151.3138 Man B 104 13666 6009400 7216300 1827.3 0.0391 154.3159 Man k 100 11341 317560 4442000 1373.3 0.0345 111.5123 Man w 400 11685 773940 4254200 1285.6 0.0311 103.0833 Man w 401 11638 425690 4075200 1245.3 0.0304 99.5563 4 women with the same sentence: "see you later" Woman A 2001 13595 420980 12289000 3129.3 0.0664 260.8490 Woman B 2003 9503 311870 6561400 2401.4 0.0721 197.2349 Woman D 402 5277 5571700 1748100 1171.4 0.0627 93.6040 Woman m 3002 10365 322280 5379100 1826.2 0.0506 149.1963 Man C201 12686 5501500 7000400 1923.6 0.0436 158.6357 Women and men with different sentences Woman A 401 20629 699960 24785000 4187.6 0.0583 345.1550 Woman A 901 12242 306010 9660000 2726 0.0649 230.1245 Woman A 803 9077 237420 5131100 1984.3 0.0622 160.9541 Woman A 202 13809 374400 12119000 3059.5 0.0632 250.4241 Woman A 101 12444 388450 10553000 2946.0 0.0676 242.3755 Woman B 402 12880 486000 8577500 2319.3 0.0497 183.7694 Woman B 501 8002 132600 5120800 2172.6 0.0813 191.7493 Woman B 802 20612 662150 23670000 3982.7 0.0552 328.4199 Woman B 901 11119 258280 7549900 2346.7 0.0620 199.6459 WomanB 3003 13112 247580 11385000 3006.3 0.0665 251.8563 Woman D 502 10428 2916500 7576900 2519.8 0.0696 209.3417 Woman D 202 11089 1078200 7591400 2390.4 0.0617 195.8928 Woman D 001 12320 1556200 9288400 2666.1 0.0613 213.5040 Man B 501 13160 2597500 8359000 2153.7 0.0487 189.0446 Man B 904 10492 2106500 3325500 1081.8 0.0299 92.0017 Man B 144 11025 2578600 5503800 1718.6 0.0455 145.8184 Man B1004 10227 2481400 4480800 1520.2 0.0434 127.9606 Man B193 10998 2365600 5382100 1678.9 0.0446 143.1177 Man X102 8405 391600 5291400 2182.7 0.0748 181.4226 Man X 100 14152 1428800 13458000 3296.2 0.0668 272.9188 Man C100 11710 2518300 3975700 1196.7 0.0292 97.0313 Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9570-9578 9577 www.etasr.com Amraoui & Saadi: A Novel Approach on Speaker Gender Identification and Verification Using DWT … Fig. 14. Comparison of WER for different speakers of both sexes. The values shown in Table II give us the WER values for the studied speech signals x(n). The signal is a repeated sentence for the same woman or the same man, the same sentence for several men and several women, or a speech signal containing different sentences for several men and several women, as shown in Figure 14. A comparison of these values in relation to the range of the fundamental frequency f0 values in recently published research works was made. The result is that the WER criterion provides values approximately close to the values of f0 for speakers of different sentences and gender. Accordingly, we can through WER specify values in a range from 50 to 180Hz for male speakers and values in a range from 180 to 504Hz for female speakers, which are almost similar to the range of fundamental frequency values which determine the speaker gender according to the value of his/her fundamental frequency using different methods. The direct computational speed of the WER is a faster and easier way to determine the gender of a speaker. Table II illustrates the comparison between WER values and Figure 14 shows their progress. After this stage, which determines the gender of the speaker, the next stage of verifying the speaker is followed by reference to our stored database, according to the comparison process based on a suitable algorithm to authenticate and confirm the speaker as shown by the flowchart in the general scheme of Figure 1. IX. CONCLUSION In speech investigation, gender identification is usually performed by extracting the information from the speech signals. In this paper, we propose a novel criterion for determining the speaker gender by defining a range of sampled values using the Discrete Wavelet Transform (DWT), and computing the energy in addition to the intersections with zero (ZCR) of the first level DWT followed by estimating the introduced criterion (WER) values for both genders. Comparison was made with the value of the speech fundamental frequency for speakers of both sexes. The proposed criterion was tested on a large database prepared for this research, containing many repeated sentences from the same person and from persons of both sexes and gives acceptable results, proving its suitability for speaker gender identification. ACKNOWLEDGMENT The authors would like to thank the LAADI Research Laboratory for the technical support. REFERENCES [1] L. Jeancolas, "Détection précoce de la maladie de Parkinson par l’analyse de la voix et corrélations avec la neuroimagerie," Ph.D. dissertation, Paris-Saclay University, Paris, France, 2019. [2] R. Ajgou, "Techniques De Détection De La Période Du Pitch Par Les Méthodes Temps Fréquence Et Temps Échelle.," M.S. thesis, University of Biskra, Biskra, Algeria, 2010. [3] F. Bahja, Détection du fondamental de la parole en temps-réel: Application aux voix pathologiques. Presses Académiques Francophones, 2014. [4] R. Ajgou, S. Sbaa, S. Aouragh, and A. Taleb, "Détection Du Pitch Par Les Ondelettes Continues En Temps Réel Pour Un Signal Parole Basée Sur Un Seuil Adaptatif Pour Une Détermination V/Nv," Courrier du Savoir Scientifique et Technique, vol. 12, no. 12, pp. 21–26, May 2014. [5] M. A. Ben Messaoud, A. Bouzid, and N. Ellouze, "Estimation du pitch et décision de voisement par compression spectrale de l’autocorrélation du produit multi-échelle (Pitch estimation and voiced decision by spectral autocorrelation compression of multi-scale product) [in French]," in Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, Grenoble, France, Mar. 2012, vol. 1, pp. 201–208. [6] Y. Fayçal, R. Amiar, S. Hecini, W. Benzaba, and L. Bendaouia, "Etude Comparative des Performances de Plusieurs Techniques de Détection de la Fréquence Fondamentale des Signaux Vocaux.," in Proceedings of the 2nd Conférence Internationale sur l’Informatique et ses Applications (CIIA’09), Saida, Algeria, Jan. 2009. [7] M. A. Nasr, M. Abd-Elnaby, A. S. El-Fishawy, S. El-Rabaie, and F. E. Abd El-Samie, "Speaker identification based on normalized pitch frequency and Mel Frequency Cepstral Coefficients," International Journal of Speech Technology, vol. 21, no. 4, pp. 941–951, Dec. 2018, https://doi.org/10.1007/s10772-018-9524-7. [8] M. Chandra, P. Nandi, A. kumari, and S. Mishra, "Spectral-Subtraction Based Features for Speaker Identification," in Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA), 2015, pp. 529–536, https://doi.org/ 10.1007/978-3-319-12012-6_58. [9] S. R. Shahamiri and F. Thabtah, "An investigation towards speaker identification using a single-sound-frame," Multimedia Tools and Applications, vol. 79, no. 41, pp. 31265–31281, Nov. 2020, https://doi.org/10.1007/s11042-020-09580-4. [10] I. Vélez, C. Rascon, and G. Fuentes-Pineda, "Lightweight speaker verification for online identification of new speakers with short segments," Applied Soft Computing, vol. 95, Oct. 2020, Art. no. 106704, https://doi.org/10.1016/j.asoc.2020.106704. [11] W. Helali, Ζ. Hajaiej, and A. Cherif, "Real Time Speech Recognition based on PWP Thresholding and MFCC using SVM," Engineering, Technology & Applied Science Research, vol. 10, no. 5, pp. 6204–6208, Oct. 2020, https://doi.org/10.48084/etasr.3759. [12] K. Daqrouq and K. Y. Al Azzawi, "Average framing linear prediction coding with wavelet transform for text-independent speaker identification system," Computers & Electrical Engineering, vol. 38, no. 6, pp. 1467–1479, Nov. 2012, https://doi.org/10.1016/j.compeleceng. 2012.04.014. [13] C. Turner and A. Joseph, "A Wavelet Packet and Mel-Frequency Cepstral Coefficients-Based Feature Extraction Method for Speaker Identification," Procedia Computer Science, vol. 61, pp. 416–421, Jan. 2015, https://doi.org/10.1016/j.procs.2015.09.177. [14] M. A. Nasr, M. Abd-Elnaby, A. S. El-Fishawy, S. El-Rabaie, and F. E. Abd El-Samie, "Speaker identification based on normalized pitch frequency and Mel Frequency Cepstral Coefficients," International Journal of Speech Technology, vol. 21, no. 4, pp. 941–951, Dec. 2018, https://doi.org/10.1007/s10772-018-9524-7. [15] M. Kiran Reddy et al., "The automatic detection of heart failure using speech signals," Computer Speech & Language, vol. 69, Sep. 2021, Art. no. 101205, https://doi.org/10.1016/j.csl.2021.101205. [16] A. Mnassri, M. Bennasr, and C. Adnane, "A Robust Feature Extraction Method for Real-Time Speech Recognition System on a Raspberry Pi 3 Board," Engineering, Technology & Applied Science Research, vol. 9, no. 2, pp. 4066–4070, Apr. 2019, https://doi.org/10.48084/etasr.2533. Engineering, Technology & Applied Science Research Vol. 12, No. 6, 2022, 9570-9578 9578 www.etasr.com Amraoui & Saadi: A Novel Approach on Speaker Gender Identification and Verification Using DWT … [17] A. Amehraye and S. Saoudi, Débruitage perceptuel de la parole. 2009. [18] R. Narayanam, "Voiced and Unvoiced Separation in Speech Auditory Brainstem Responses of Human Subjects Using Zero Crossing Rate (ZCR) and Energy of the Speech Signal," International Journal of Engineering Sciences & Research Technology, vol. 4, no. 9, pp. 370– 380, Jun. 2017, https://doi.org/10.5281/zenodo.892088. [19] "Fréquence de coupure," Wikipédia. Feb. 11, 2022, [Online]. Available: https://fr.wikipedia.org/w/index.php?title=Fr%C3%A9quence_de_coupu re&oldid=190757368. [20] M. V. Daithankar and S. D. Ruikar, "Analysis of the Wavelet Domain Filtering Approach for Video Super-Resolution," Engineering, Technology & Applied Science Research, vol. 11, no. 4, pp. 7477–7482, Aug. 2021, https://doi.org/10.48084/etasr.4262. [21] A. Pini, "Notions de base sur les filtres passe-bas antirepliement (et pourquoi ils doivent être adaptés au CAN)," Digi-Key Electronics, Mar. 24, 2020. https://www.digikey.fr/fr/articles/the-basics-of-anti-aliasing- low-pass-filters. [22] D. Sripath, "Efficient Implementations of Discrete Wavelet Transforms Using FPGAs," Jan. 2003. [23] E. Hostalkova, "Wavelet Transform," Athens, Greece, Nov. 2009. [24] A. Sumithra and B. Thanushkodi, "Performance Evaluation of Different Thresholding Methods in Time Adaptive Wavelet Based Speech Enhancement," International Journal of Engineering and Technology, vol. 1, no. 5, pp. 439–447, 2009, https://doi.org/10.7763/IJET.2009. V1.82. [25] K. Tajane, R. Pitale, and J. Umale, "Review Paper :Comparative Analysis Of Mother Wavelet Functions With The ECG Signals," International Journal of Engineering Research and Applications, vol. 4, no. 1, pp. 38–41, Jan. 2014.