ANALYSIS AND VOICE RECOGNITION IN INDONESIAN LANGUAGE USING MFCC AND SVM METHOD

Harvianto1; Livia Ashianti2; Jupiter3; Suhandi Junaedi4
1,2,3,4 Computer Science Department, School of Computer Science, Bina Nusantara University
Jl. K.H. Syahdan No. 9, Palmerah, Jakarta Barat, 11480
1harvianto@binus.ac.id; 2liviaashianti@gmail.com; 3jupiterc@gmail.com; 4soe.xoe@gmail.com

ABSTRACT

Voice recognition technology is a form of biometric technology. The voice is a unique human trait that makes it possible to distinguish one individual from another, and it also carries information such as the gender, emotion, and identity of the speaker. This research records human voices pronouncing the digits 0 to 9, with and without noise. Features of these recordings are extracted using Mel Frequency Cepstral Coefficients (MFCC). The mean, standard deviation, minimum, maximum, and their combinations are used to construct the feature vectors. These feature vectors are then classified using a Support Vector Machine (SVM). Two classification models are built: one based on the speaker and one based on the digit pronounced. Each classification model is validated with 10-fold cross-validation. The best average accuracy over the two classification models is 91.83%, achieved using Mean + Standard deviation + Min + Max as the features.

Keywords: voice recognition, MFCC, SVM, cross validation

INTRODUCTION

The human voice contains a lot of information, such as the gender, emotion, and identity of the speaker (Lindasalwa et al., 2010). The purpose of voice recognition is to identify the speaker or the words pronounced by the individual (Yee & Ahmad, 2008). Many techniques have been proposed to reduce the mismatch between testing and training environments. Most of these methods operate in the spectral domain (Lockwood & Boudy, 1992; Rosenberg, Lee, & Soong, 1994) or the cepstral domain.

Gracieth et al. (2014) implemented a Support Vector Machine (SVM) for automated spoken-digit recognition. The digits were limited to '0' through '9' in Portuguese. The features were extracted using Mel Frequency Cepstral Coefficients, and a Discrete Cosine Transform (DCT) was used to produce a two-dimensional matrix that became the input of the SVM. The mean and variance were chosen as the features. The study produced excellent digit classification except for the digit '9'; digits '1' to '8' had the best accuracy.

Fokoué and Ma (2013) demonstrated that the combination of MFCC and SVM is a powerful tool for identifying the gender of the speaker. Both the RBF kernel and the polynomial kernel give accurate results under cross-validation. However, MFCC requires more computation time because of the complexity of its calculations.

Putra and Resmawan (2011) classified gender based on speech in Bahasa Indonesia, also using MFCC as the feature extraction method and DTW as the classification method. They collected speech from 27 men and 8 women; each person spoke five words and repeated each word seven times. For the evaluation, they used 7-fold cross-validation. Based on the results, the best accuracy was 93.254% and the worst accuracy was 59.664%.
This paper discusses the recognition of the spoken digits '0' through '9' in Indonesian. The human voice is converted into a digital signal, producing digital data that represents the level of the signal at each point in time. The digital sound is then processed using MFCC to extract the voice features. After that, a Support Vector Machine (SVM) is used as the classification method to determine which features and combinations of features generate the smallest error. The validation process uses 10-fold cross-validation. The rest of this paper is organized as follows: the principles of voice recognition, the methodology, the results, and finally the conclusions.

After the voice input is captured from the speaker with a microphone, the sound is analyzed. System design involves the manipulation of the audio signal. The operations applied to the input signal are pre-emphasis, framing, windowing, Mel cepstrum analysis, and recognition of the spoken words. The voice recognition algorithm includes two distinct phases, as shown in Figure 1. The first phase is the training phase, in which each speaker provides samples of their voice so that a reference template model can be built. The second phase is the testing phase, in which the input test voice is matched against the stored reference template model and a recognition decision is made.

Figure 1 Voice Recognition Algorithms

The Mel Frequency Cepstral Coefficients (MFCC) algorithm is a feature extraction technique. MFCC is one of the most popular feature extraction techniques used in voice recognition; it operates in the frequency domain and uses the Mel scale, which is based on the scale of the human ear. MFCC features, being frequency-domain features, are much more accurate than time-domain features. The simplicity and ease of implementation make MFCC the most favored technique for speech recognition. MFCC also considers the sensitivity of human perception of frequency, which makes it well suited to voice recognition. Figure 2 shows the steps used in MFCC.

Figure 2 Block Diagram to Get the MFCC Coefficients (Voice Input → Pre-Emphasis → Sampling and Windowing → Fast Fourier Transform → Mel Filter Bank → Discrete Cosine Transform → Output Mel Coefficients)

In feature extraction using MFCC, the pre-emphasis block filters the voice signal with a high-pass filter. Pre-emphasis improves the voice signal and compensates for the part of the signal that is suppressed during voice production. The pre-emphasized signal is then segmented into frames with an optional overlap of 1/3 to 1/2 of the frame size. This step is important for good results because the variation of amplitude is greater in larger signals than in smaller ones. Each frame is then multiplied by a Hamming window to keep the continuity of the first and last points of the frame. The windowed signal is converted into a frequency-domain signal using the Fast Fourier Transform. The output of the Fast Fourier Transform block is multiplied by triangular band-pass filters to obtain the log energy of each filter. The Mel scale used by MFCC is defined as follows:

F_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right)     (1)

F_{mel} is a logarithmic scale of the normal frequency scale f (in Hz). Mel-cepstral features can be illustrated by MFCCs, which are calculated from the Fast Fourier Transform (FFT) power coefficients. The power coefficients are filtered by a triangular band-pass filter bank.
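To make the MFCC pipeline above concrete, the following is a minimal Python sketch. It assumes the librosa library as one possible implementation (the paper does not state which toolkit was used); the file path, coefficient count, and helper names are illustrative only.

```python
import numpy as np
import librosa  # assumed implementation; any MFCC toolkit would do

def hz_to_mel(f_hz):
    # Mel scale of equation (1): F_mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def extract_mfcc_matrix(wav_path, n_mfcc=12):
    """Return an (n_mfcc x n_frames) MFCC matrix for one WAV recording."""
    y, sr = librosa.load(wav_path, sr=None)        # keep the original 44.1 kHz rate
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])     # pre-emphasis high-pass filter
    # librosa internally performs framing, windowing, FFT, the Mel filter bank,
    # and the DCT, i.e. the pipeline of Figure 2.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

if __name__ == "__main__":
    print(hz_to_mel(1000.0))   # about 1000 mel, by construction of the scale
```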
When c(5) is in the range of 250-350, the number of triangular filters that fall in the frequency range of 200-1200 Hz (the frequency range in which the dominant audio information lies) is higher than for other values of c. Therefore, it is efficient to set the value of c in this range when calculating the MFCCs. Given the output of the filter bank S_k (k = 1, 2, ..., K), the MFCCs are calculated as follows:

C_n = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} (\log S_k) \cos\left[\frac{n (k - 0.5) \pi}{K}\right], \quad n = 1, 2, \ldots, K     (2)

Support Vector Machine is a statistical machine learning technique that has been applied successfully in pattern recognition. The SVM classification method is based on the Structural Risk Minimization principle from computational learning theory. Assume first that the data can be separated linearly. The data are denoted x_i \in R^d, and each class label is denoted y_i \in \{+1, -1\} for i = 1, 2, \ldots, N, where N is the number of data points. SVM looks for the best hyperplane that separates the data of the two classes by measuring the margin of the hyperplane and looking for the largest margin. The margin is the distance between the hyperplane and the nearest data point of each class. The subset of the data set at this nearest distance is called the support vectors.

Classes -1 and +1 can be completely separated by a hyperplane in the d-dimensional space, defined by the following equation:

\mathbf{w} \cdot \mathbf{x} + b = 0     (3)

Data x_i belonging to class -1 (negative samples) satisfy the inequality \mathbf{w} \cdot \mathbf{x}_i + b \le -1 for y_i = -1, while data belonging to class +1 (positive samples) satisfy \mathbf{w} \cdot \mathbf{x}_i + b \ge +1 for y_i = +1. \mathbf{w} is the normal of the hyperplane, and b is the position of the hyperplane relative to the origin. The margin is defined as

\frac{2}{\|\mathbf{w}\|}     (4)

where

\|\mathbf{w}\| = \sqrt{\mathbf{w} \cdot \mathbf{w}}     (5)

The maximum margin is obtained when \|\mathbf{w}\| is minimal for the hyperplane equation \mathbf{w} \cdot \mathbf{x} + b = 0. Therefore, finding the largest margin can be formulated as the following constrained optimization problem:

\min_{\mathbf{w}} \frac{1}{2} \|\mathbf{w}\|^2     (6)

subject to y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0. One method for solving constrained optimization problems is the Lagrange multiplier method. The problem can thus be formulated as follows:

\min_{\mathbf{w}, b} L(\mathbf{w}, b, \alpha) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]     (7)

subject to \alpha_i \ge 0. This formula (the primal problem) is then converted into its dual problem as follows:

\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)     (8)

subject to \alpha_i \ge 0 and \sum_{i=1}^{N} \alpha_i y_i = 0. Solving this problem yields non-negative values \alpha_i. The value of \mathbf{w} is then obtained by the following formula:

\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i     (9)

Data points whose \alpha_i is greater than zero are called support vectors. Knowing the support vectors, the value of b can be obtained from the set S of support vectors as follows:

b = \frac{1}{N_S} \sum_{s \in S} \left( y_s - \mathbf{w} \cdot \mathbf{x}_s \right)     (10)

By knowing the values of \mathbf{w} and b, the hyperplane equation (3) is obtained. After the hyperplane equation is found, a data point \mathbf{x} can be classified into class \{+1, -1\} as follows:

f(\mathbf{x}) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} + b) = \begin{cases} +1, & \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1, & \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases}     (11)

The SVM formulation for linearly separable data cannot be used for non-linearly separable data. In that case, the best hyperplane can be found by transforming the data from the input space x into a feature space \phi(x), so that the data can be separated linearly in the feature space. The dimension of the data in the feature space is higher than in the input space, which can make computation in the feature space very expensive. This problem can be solved by using the kernel trick: with a kernel function, the transformation function \phi does not need to be known explicitly.
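As a small illustration of the decision rule in equations (3)-(11), the sketch below, assuming scikit-learn, fits a linear SVM on toy two-dimensional data and then applies sgn(w · x + b) with the fitted w and b. The data and variable names are illustrative and not from the paper.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data with labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # normal vector of the hyperplane, as in eq. (9)
b = clf.intercept_[0]               # offset of the hyperplane
print("support vectors:\n", clf.support_vectors_)

# Decision rule of eq. (11): sgn(w . x + b)
x_new = np.array([0.5, 1.0])
print("predicted class:", int(np.sign(np.dot(w, x_new) + b)))   # expected: -1
```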
Kernel functions that are often used are the Linear kernel, the Polynomial kernel (of degree D), and the Radial Basis Function (RBF) kernel. The equation of the Linear kernel is as follows:

K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j     (12)

Next, the following is the equation of the Polynomial kernel (of degree D):

K(\mathbf{x}_i, \mathbf{x}_j) = \left(1 + \mathbf{x}_i \cdot \mathbf{x}_j\right)^{D}     (13)

The equation of the Radial Basis Function (RBF) kernel is

K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\right), \quad \gamma = \frac{1}{2\sigma^2} > 0     (14)

The variable \gamma is a hyperparameter.

Cross-validation is a method to assess the accuracy and validity of statistical models. The available dataset is divided into two parts. The first part is used for building the model (Payam, Lei, & Huan, 2009), and the model built on the first part is then used to predict the values in the second part. A valid model should show good prediction accuracy. The procedure of cross-validation is as follows. First, the data are divided into three sets: Training, Testing, and Validation.

Figure 3 First Step of Cross Validation

Second, find the optimal model on the training set; the testing set is used to examine its predictive ability.

Figure 4 Second Step of Cross Validation

Third, see how well the model predicts the validation set. The validation error provides an estimate of the predictive power of the model.

Figure 5 Third Step of Cross Validation

METHODS

This research is conducted in several stages. For a better understanding, see Figure 6.

Figure 6 Diagram of Research Methods (Data Collection → Feature Extraction → Data Normalization → Classification → Evaluation)

Data are collected by recording the voices of 6 participants. Each participant pronounces the numbers '0' to '9'. The recording process is repeated 5 times for each number: 3 recordings without noise and 2 recordings with noise. In the end, there are 300 voice recordings (6 participants x 10 digits x 5 repetitions). The voices are recorded using various devices such as smartphones, PCs, and laptops with different types and specifications. Each recording is saved at a sampling frequency of 44.1 kHz with a 16-bit depth in the Wave Audio (WAV) file format. Each voice recording is named by the number pronounced, the recording order, and the participant's name.

The features of each voice recording are extracted using MFCC (Mel-frequency cepstral coefficients). With the MFCC method, each voice recording produces a matrix in which one dimension is the MFCC index and the other is the time frame. For the experiment, MFCC indices up to 12 are used. The mean, standard deviation, min, and max of each MFCC index are calculated, and the result is stored as a matrix in the format shown in Figure 7.

Figure 7 Feature Matrix Format

A Support Vector Machine (SVM) with a linear kernel function is used as the classification method. There are two types of classification to be performed. The first type is classification based on the speaker; the second type is classification based on the number that the speaker has pronounced. A classification model is built using every feature or combination of features from the feature matrix. The classification result is evaluated using 10-fold cross-validation for every experiment in order to have better confidence in the prediction accuracy; all voice recordings are separated into ten parts, each part consisting of 30 voice recordings. A sketch of this procedure is given below.
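As a rough illustration of this procedure, the following Python sketch, assuming scikit-learn, pools a 12-row MFCC matrix into a mean/standard deviation/min/max feature vector and evaluates a linear-kernel SVM with 10-fold cross-validation. The normalization step is represented by a standard scaler as an assumption, since the paper does not name the exact normalization used, and all names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def pool_features(mfcc, parts=("mean", "std", "min", "max")):
    """Collapse a (12 x n_frames) MFCC matrix into one fixed-length vector
    by taking the chosen statistics of each MFCC index over time."""
    stats = {"mean": mfcc.mean(axis=1), "std": mfcc.std(axis=1),
             "min": mfcc.min(axis=1), "max": mfcc.max(axis=1)}
    return np.concatenate([stats[p] for p in parts])

def average_accuracy(feature_vectors, labels):
    """10-fold cross-validated accuracy of a linear-kernel SVM."""
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    return cross_val_score(model, feature_vectors, labels, cv=10).mean()

# Usage (illustrative): mfcc_matrices is a list of (12 x n_frames) arrays,
# digit_labels and speaker_labels the two label sets described above.
# X = np.vstack([pool_features(m) for m in mfcc_matrices])
# print(average_accuracy(X, digit_labels), average_accuracy(X, speaker_labels))
```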
RESULTS AND DISCUSSIONS

For the experiment, 300 voice recordings have been collected. Feature extraction is applied to every voice recording, and a new matrix is obtained from this feature extraction. The resulting matrix is then classified using the SVM method. There are two classification types to be performed: the first type is classification based on the speaker, and the second type is classification based on the number that the speaker has pronounced. For each classification type, 15 experiments have been performed, one for every feature or combination of features from the feature extraction. The detail of the experiments is listed in Table 1. Each experiment is evaluated using 10-fold cross-validation. The accuracy of each classification type for every feature combination is shown in Table 2.

Table 1 Experiment

Experiment  Feature
1           Mean
2           Standard Deviation
3           Min
4           Max
5           Mean + Standard Deviation
6           Mean + Min
7           Mean + Max
8           Standard Deviation + Min
9           Standard Deviation + Max
10          Min + Max
11          Mean + Standard Deviation + Min
12          Mean + Standard Deviation + Max
13          Mean + Min + Max
14          Standard Deviation + Min + Max
15          Mean + Standard Deviation + Min + Max

Table 2 Experiment Result

Experiment  Feature                                  Speaker (%)  Pronounced Number (%)  Average (%)
1           Mean                                     59.00        90.33                  74.67
2           Standard deviation                       72.67        60.67                  66.67
3           Min                                      58.33        85.33                  71.83
4           Max                                      52.67        67.67                  60.17
5           Mean + Standard deviation                86.67        95.67                  91.17
6           Mean + Min                               69.33        95.33                  82.33
7           Mean + Max                               73.00        92.67                  82.83
8           Standard deviation + Min                 80.00        92.67                  86.33
9           Standard deviation + Max                 82.33        89.67                  86.00
10          Min + Max                                74.67        92.00                  83.33
11          Mean + Standard deviation + Min          87.00        96.00                  91.50
12          Mean + Standard deviation + Max          85.33        96.00                  90.67
13          Mean + Min + Max                         79.00        95.33                  87.17
14          Standard deviation + Min + Max           82.33        95.67                  89.00
15          Mean + Standard deviation + Min + Max    86.67        97.00                  91.83

The best accuracy for classification based on the speaker is 87.00%, achieved using Mean + Standard deviation + Min as the features. The worst accuracy for classification based on the speaker is 52.67%, obtained using Max as the feature. The best accuracy for classification based on the number pronounced is 97.00%, achieved using Mean + Standard deviation + Min + Max as the features. The worst accuracy for classification based on the number pronounced is 60.67%, obtained using Standard deviation as the feature. The best average accuracy over both classifications is 91.83%, achieved using Mean + Standard deviation + Min + Max as the features. The worst average accuracy over both classifications is 60.17%, obtained using Max as the feature.

CONCLUSIONS

The experimental results show that the feature combination with the highest accuracy in classification based on the speaker is Mean + Standard Deviation + Min (87.00%), while the feature combination with the highest accuracy in classification based on the numbers spoken is Mean + Standard Deviation + Min + Max (97.00%). The best overall result, an average accuracy of 91.83%, is obtained by using the combination of Mean + Standard Deviation + Min + Max.
REFERENCES

Fokoué, E., & Ma, Z. (2013). Speaker Gender Recognition via MFCCs and SVMs. RIT Scholar Works.

Gracieth, B., Washington, S., & Filho, O. (2014). Classification of Pattern using Support Vector Machines: An Application for Automatic. The Eighth International Conference on Advanced Engineering Computing and Applications in Sciences. Rome, Italy: IARIA.

Lindasalwa, M., Mumtaj, B., & Elamvazuthi, I. (2010). Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques. Journal of Computing, 2, 138-143.

Lockwood, P., & Boudy, J. (1992). Experiments with a Nonlinear Spectral Subtractor (NSS), Hidden Markov Models and the Projection, for Robust Speech Recognition in Cars. Speech Communication, 11(2-3), 215-228.

Payam, R., Lei, T., & Huan, L. (2009). Cross-Validation. In Encyclopedia of Database Systems.

Putra, D., & Resmawan, A. (2011). Verifikasi Biometrika Suara Menggunakan Metode MFCC dan DTW [Voice biometric verification using the MFCC and DTW methods]. Lontar Komputer, 2, 8-21.

Rosenberg, A., Lee, C. H., & Soong, F. (1994). Cepstral Channel Normalization Techniques for HMM-based Speaker Verification. Proc. Int. Conf. on Spoken Language Processing, 1835-1838.

Yee, C. S., & Ahmad, A. M. (2008). Malay Language Text-Independent Speaker Verification using NN-MLP Classifier with MFCC. In International Conference on Electronic Design (ICED 2008). Penang: IEEE.