Development of Feature Set, Classification Implementation and Applications for Vowel Migration/Modification in Sung Filipino (Tagalog) Texts and Perceived Intelligibility

Virginia B. Bustos1, Triah Joyce G. Dela Cruz1, Ramon Maria G. Acoymo2 and Rowena Cristina L. Guevara1*

1Electrical and Electronics Engineering Institute, College of Engineering, University of the Philippines Diliman, 1101 Quezon City
2Voice and Music Theater/Dance Department, College of Music, University of the Philippines Diliman, 1101 Quezon City
*Corresponding author: gev@eee.upd.edu.ph

Received: 5 November 2009; Revised: 18 February 2010; Accepted: 24 June 2010

ABSTRACT

With the emergence of research on real-time visual feedback to supplement vocal pedagogy, the use of technology in music is now seen to accelerate skills learning and enhance cognitive development. The researchers of this project aim to further analyze vowel intelligibility and to develop software applications intended for use not only by professional singers but also by individuals who wish to improve their singing capability. Data in the form of sung vowels and song pieces were obtained from 46 singers. A Listening Test was then conducted on these samples to obtain the ground truth for vowel classification based on human perception. Human auditory perception of sung Filipino vowels was simulated using formant frequencies and Mel-frequency cepstral coefficients as feature vector inputs to a two-stage Discriminant Analysis classifier. The setup resulted in an overall Training Set accuracy of 89.4% and an overall Test Set accuracy of 90.9%. The accuracy of the classifier, measured as the correspondence between the vowel classifications produced by the classifier and the results of the Listening Test, reached 92.3%. Using information obtained from the classifier, offline and online/real-time software applications were developed.
The main application features include the display of the spectral envelope and spectrogram, pitch and vibrato analysis, and direct feedback on the classification of the sung vowel. These features were recommended by the singers who were surveyed and were incorporated in the applications to help singers adjust formant locations, directly determine listeners' perception of sung vowels, perform modeling effectively and carry out vowel migration.

Keywords: Filipino, Vowel Migration, Intelligibility

Science Diliman 21(2):13-24

INTRODUCTION

One of the objectives of bel canto singers is the development of a vocal scale without interruption throughout its length (Appelman, 1986). This demands vowel modification in the upper notes to preserve the true vowel sound and to prevent notes from becoming disagreeable or harsh. The technique has served as a means of transition to the upper voice for many centuries. The singer, when modifying a vowel, is actually causing the vowel to migrate in the direction of another recognizable vowel. Vowel migration causes the vowel to lose its integrity for the sake of enhancing musical aesthetics. Thus intelligibility, the degree to which the vowel intended by the singer is perceived correctly by the listeners, is diminished. For example, when the vowel /a/ sung in its unmigrated or unmodified form is correctly perceived as /a/ by all the listeners, the intelligibility is 100%. However, when the same vowel is migrated toward another vowel, fewer listeners might perceive the vowel correctly. If 52 out of 100 listeners correctly perceive the intended vowel, the resulting intelligibility is 52%.

A more quantitative and objective measure of these vowel modifications can be achieved through software applications that offer real-time visual feedback on vowel intelligibility, together with the analysis of pitch and frequency spectra.
The assessment is helpful in determining the extent to which the acoustical demands can be met with minimal compromise in intelligibility.

The classification of sung vowels is usually performed by listeners. Research in vowel classification generally applies to spoken vowels through the use of Automatic Speech Recognition (ASR) systems. These systems can be implemented using numerous feature extraction algorithms and classification models, as shown by the studies presented in Table 1.

Table 1. Studies in Vowel Classification for Spoken Vowels

| Researcher/s | Data (number of speech samples) | Features used | Classification Model | Achieved Accuracy |
|---|---|---|---|---|
| Merkx and Miles | 17213 | MFCCs | Single Layer Feed-forward ANN | 91.50% |
| Dumitru and Gavat | 145 | MFCCs | 3-Layer MLP | 96.4% for male speakers; 77.9% for female speakers |
| Schmid and Barnard | 58268 | Formant Features and MFCCs | MLP | 73.40% |

The study made by Merkx and Miles (2005) utilized thirteen Mel-Frequency Cepstral Coefficients (MFCCs) as feature vectors. MFCCs are coefficients derived from a logarithmic scaling of audio frequencies using Mel filter banks. For pattern classification, the study used a feed-forward Artificial Neural Network (ANN) with 28 internal nodes. An ANN is an adaptive, often nonlinear, system that is trained to perform an input/output mapping using given input and target data. The classifier reached a recognition accuracy of 91.5% on a subset of 5 vowel phonemes. Another study, made by Dumitru and Gavat (2007), used twelve MFCCs, a 3-layer Multilayer Perceptron (MLP) classifier and 145 speech samples as input data. An MLP is a feedforward ANN that uses three or more layers of nodes with nonlinear activation functions. Vowel recognition rates reached 96.4% for male speakers and 77.9% for female speakers. Schmid and Barnard (1997) tested the efficiencies of cepstral-based features, MFCCs and formant features including formant trajectory, amplitude, bandwidth, pitch and segment duration in vowel classification.
Formants are resonant frequencies of the vocal tract that show up as peaks in the sound spectrum. Results showed that formant features alone reached an accuracy rate of 71.8%, while MFCCs alone reached an accuracy rate of 71.6%. The combination of the two feature sets proved optimal, reaching an accuracy rate of 73.4%.

This project is part of a research track at the UP Digital Signal Processing Laboratory. The preceding study (Dimaculangan & Felias, 2008) applied strategies used in ASR systems to the classification of sung vowels and determined the relationship between vowel migration in sung Filipino text and perceived intelligibility from the perspective of the audience. Several parameters including formants, spectral envelopes, vowel triangles and MFCCs were investigated to identify the parameter most appropriate for simulating vowel perception. Results showed that nine MFCCs, extracted from each of the 281 sung unmigrated vowel samples, best discriminated the sung vowels. These features were used as inputs to a Linear Discriminant Analysis (LDA) classifier that proved optimal compared to an ANN. LDA computes a linear predictor from two sets of normally distributed data to allow for classification of new observations. The accuracies achieved were 68.1% for the Training Set and 52.9% for the Test Set, based on the correspondence between the vowel classification determined by the classifier and the results of the Intelligibility Tests that were conducted. In the Intelligibility Tests, the ground truth for vowel classification was based on the perception of 45 listeners.

Wilson et al. (2008) studied the effects of real-time visual feedback on teaching pitch accuracy in singing. They investigated whether the style of feedback affects the amount of learning achieved and whether the provision of concurrent visual feedback hampers the simultaneous performance of the singing task.
Using a baseline-intervention-post test between-groups design, they determined that real-time visual feedback to the learner promotes the acquisition of the neuromuscular skills underlying the task of singing the correct pitch. Moreover, different styles of visual feedback did not produce differences in the amount and rate of learning.

In this paper, the objective of the researchers is to develop software applications that provide an objective assessment of the intelligibility of sung vowels, based on the perception of listeners, through real-time visual feedback. Furthermore, the applications should help singers readily assimilate feedback and improve their singing ability. The novelty of this project is the application of ASR system techniques to sung vowels in unmigrated and migrated forms, with emphasis on intelligibility based on the perception of listeners.

METHODOLOGY

This section describes the data, the extracted features, the pattern classification methods, the development of the applications and the testing of these applications.

Data

Audio recording of the data was held inside a WhisperRoom®, a sound isolation enclosure. The recording equipment included a TASCAM DV-RA1000 High Definition Audio Master Recorder, a Sennheiser Ew100 G2 wireless microphone and a Behringer Ultravoice XM8500 wired microphone. All audio data samples were recorded in stereo wav format with the following attributes: PCM signed 24-bit, 44.1 kHz sampling rate and 2116 kbps bit rate. Video recordings were also made using a Sony Handycam SR220 to document the singers' mouth shapes.

Data in the form of sung vowels and song pieces were obtained from 46 singers. The audio recording consisted of two parts: vocalization and singing of folk songs. Vocalization was divided into six sets: unmigrated, migrated to [a], migrated to [e], migrated to [i], migrated to [o] and migrated to [u].
Each set comprised vocalises for each of the five Filipino vowels in increasing pitch, spanning the singer's stable frequency range. The second part of the recording involved singing of three folk songs: "Si Pilemon", "Lubi-Lubi" and "Neneng at Nonoy". Each of the songs was sung unmigrated, migrated to [a], migrated to [e], migrated to [i], migrated to [o] and migrated to [u].

A survey regarding application design was conducted among 38 of the singers who participated in the project. Singers were asked which software features would help them improve their singing ability and how they envisioned these features appearing in the applications. Information gathered in this stage served as the basis for the design of the applications that were developed.

An Intelligibility/Listening Test was done to establish the ground truth for the intelligibility of the recorded vocalises. A total of 100 listeners, composed of 51 non-music major students and 49 students from the UP College of Music, participated in this test. A total of 1350 sliced vowels were included in the test, utilizing the vowel preceding the last vowel note for singers who chose to vocalize within an octave, while the second to the last vowel note was used for singers whose vocalises exceeded an octave. Data obtained from the Intelligibility Test served as the ground truth for vowel classification based on human perception.

Feature Extraction

Before extracting the features from the data, signal pre-processing was done to remove the DC offset of the signal and to normalize the average audio volume level to -18 dB. Vowel detection was implemented by computing the short-time energy and short-time zero crossing rate of each signal frame and then comparing these parameters to set thresholds. Several features were extracted from the vowels and were used as inputs to different pattern classifiers. Below is a summary of the extracted features from the vowels.
• Mel-Frequency Cepstral Coefficients (MFCC) are a representation of cepstral coefficients, taken from the Fourier transform of the decibel spectrum, in which the analysis is done on a non-linear frequency scale known as the Mel scale. The MFCC extraction algorithm used was developed by Slaney (1998). The computation proceeds as follows. The signal is divided into short time windows, and the Discrete Fourier Transform (DFT) of each time window of the discrete-time signal x(n) with length N is computed using

$$X(k) = \sum_{n=0}^{N-1} w(n)\, x(n) \exp\left(-j 2\pi k n / N\right) \tag{1}$$

for k = 0, 1, ..., N − 1, where k corresponds to the frequency f(k) = k f_s/N, f_s is the sampling frequency in Hertz and w(n) is a Hamming window, given by

$$w(n) = 0.54 - 0.46 \cos\left(\pi n / N\right). \tag{2}$$

The magnitude spectrum |X(k)| is scaled in both frequency and magnitude. The frequency is scaled logarithmically using the Mel filter bank H(k,m):

$$X'(m) = \ln\left(\sum_{k=0}^{N-1} |X(k)| \cdot H(k,m)\right) \tag{3}$$

for m = 1, 2, ..., M, where M is the number of filter banks. The Mel filter bank is a collection of triangular filters defined by the center frequencies f_c(m), as shown in Eq. 4.

$$H(k,m) = \begin{cases} 0 & \text{for } f(k) < f_c(m-1) \\ \dfrac{f(k) - f_c(m-1)}{f_c(m) - f_c(m-1)} & \text{for } f_c(m-1) \le f(k) < f_c(m) \\ \dfrac{f(k) - f_c(m+1)}{f_c(m) - f_c(m+1)} & \text{for } f_c(m) \le f(k) < f_c(m+1) \\ 0 & \text{for } f(k) \ge f_c(m+1) \end{cases} \tag{4}$$

The center frequencies of the filter banks are computed by approximating the Mel scale using Eq. 5.

$$\phi = 2595 \log_{10}\left(\frac{f}{700} + 1\right) \tag{5}$$

A fixed frequency resolution in the Mel scale is computed, corresponding to a logarithmic scaling of the repetition frequency, using Eq. 6, where $\phi_{\max}$ is the highest frequency of the filter bank on the Mel scale and $\phi_{\min}$ is the lowest.

$$\Delta\phi = \frac{\phi_{\max} - \phi_{\min}}{M + 1} \tag{6}$$

The center frequencies on the Mel scale are given by Eq. 7 for m = 1, 2, ..., M. The center frequencies in Hertz can be obtained using Eq. 8.

$$\phi_c(m) = m \cdot \Delta\phi \tag{7}$$

$$f_c(m) = 700\left(10^{\phi_c(m)/2595} - 1\right) \tag{8}$$

The MFCCs are obtained by computing the Discrete Cosine Transform (DCT) of X′(m) from Eq. 3 using Eq. 9 for l = 1, 2, ..., M, where c(l) is the l-th MFCC.

$$c(l) = \sum_{m=1}^{M} X'(m) \cos\left(l\,\frac{\pi}{M}\left(m - \frac{1}{2}\right)\right) \tag{9}$$

• F1+F2+MFCC is a combination of the first and second formants, F1 and F2, of the vowels and the extracted MFCCs. Formant frequencies characterize the acoustic structure of each vowel, enabling listeners to perceptually identify the vowel. The vowel formants were appended to the MFCCs in two ways: on a formants per frame basis (F1f+F2f+MFCC) as shown in Figure 1 and on a formants per vowel basis (F1v+F2v+MFCC) as shown in Figure 2. The extraction of formant frequencies involves determining the resonance peaks from the filter coefficients obtained through Linear Prediction Coding (LPC) analysis of the signal (Makhoul, 1972). Once the prediction polynomial A(z), shown in Eq. 10, has been calculated, the formant parameters are determined by solving for the roots of the equation A(z) = 0.

$$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_p z^{-p} \tag{10}$$

Figure 1. Formants computed per frame appended to MFCCs

• MFCC+ΔMFCCs is a combination of the MFCCs and their derivatives, calculated using a simple linear slope.

• Line Spectral Frequencies (LSF), or Line Spectral Pairs, are a representation of the LPC coefficients that represents glottal activity. To determine the LSFs, the Linear Prediction polynomial shown in Eq. 10 is decomposed into P(z) and Q(z) as shown in Eq. 11, where P(z) corresponds to the vocal tract with the glottis closed and Q(z) with the glottis open. The roots of both polynomials represent the Line Spectral Pairs.

$$P(z) = A(z) + z^{-(p+1)} A(z^{-1}), \qquad Q(z) = A(z) - z^{-(p+1)} A(z^{-1}) \tag{11}$$

• MFCC-RASTA is the addition of a RASTA (Relative Spectral) filter block after MFCC extraction. The RASTA filter was approximated by a simple fourth order Butterworth bandpass filter (Slaney, 1998).

Pattern Classification

The features extracted from the audio signals were then used as feature vector inputs to pattern classifiers. Below is a summary of the pattern classification methods that were implemented.

• Discriminant Analysis (DA) is used to find a number of projection directions that are efficient in separating the features into classes. The process involves maximizing the ratio of between-class variance to within-class variance so that adequate class discrimination is obtained. The types of DA that were implemented were Linear (LDA), Quadratic (QDA), Mahalanobis (MDA), Diagonal Linear and Diagonal Quadratic.

• Feedforward Backpropagation Artificial Neural Network is a type of ANN that is trained using input vectors and corresponding target output vectors until it can approximate a function and associate input vectors with specific output vectors. This is done by reducing the calculated errors between the network outputs and the target outputs and adjusting the weights of the network's forward-connected layers accordingly.

• Classification Tree (CT) is a type of machine learning algorithm used for non-parametric data classification. A classification tree is a structural mapping of binary decisions that lead to a decision about the class (interpretation) of an object.

• Support Vector Machines (SVM) are decision-based prediction algorithms that classify data into two groups. The training data are mapped to a higher dimensional space and separated by a hyperplane defining the two classes of data. Input data are classified based on the side of the hyperplane they fall on.

Cross Validation

A comparison of the performance of the classifiers was done to determine the optimal sung Filipino vowel discriminator.
The classifiers were compared using the Training Set, consisting of sung vowels from 80% of the singers, and the Test Set, consisting of sung vowels from the remaining 20%. The singers belonging to each set were randomly chosen. The Listening Test vowels were also tested to determine the perception accuracy.

Development of Applications

In this project, two software applications were developed using MATLAB, a numerical computing and programming environment. One software application runs offline and processes audio files, while the other runs in real-time/online. A screen display interface was developed for both software applications. The following features were incorporated in the applications based on the recommendations of the 38 singers who were surveyed.

Figure 2. Formants computed per vowel appended to MFCCs

a. Pitch Detection

Two algorithms, the average magnitude difference function - autocorrelation function (AMDF-ACF) and the correlogram model of pitch perception (Slaney, 1998), were tested to develop an accurate pitch estimator. The AMDF-ACF algorithm extracts the pitch period of signals from the short-term autocorrelation of computed AMDF values, as shown in Eq. 12.

$$R(k) = \sum_{n=0}^{N-k-1} x(n)\, x(n+k) \tag{12}$$

The correlogram model of pitch perception uses the largest peak of a summarized autocorrelation plot, a plot of sample autocorrelations versus time lags, as the pitch estimate.

b. Calculation of Vibrato Parameters

Vibrato was represented by three parameters: intonation, rate and extent. The intonation curve was obtained by passing the instantaneous pitch frequency curve, or pitch contour, of a signal through a moving average filter. Vibrato rate was estimated as the reciprocal of the maximum period of the pitch contour. Vibrato extent was estimated as the mean amplitude difference between the pitch and intonation contours.
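The pitch and vibrato computations above can be illustrated with a short Python sketch. It is not the authors' MATLAB implementation: it applies the autocorrelation of Eq. 12 directly to the signal (neither the AMDF-ACF combination nor the correlogram model), and it assumes that a vibrato period is the time between upward zero crossings of the pitch-minus-intonation curve and that the extent is the mean absolute difference between the two contours. The function names, the lag search range and the shrinking-edge moving average are illustrative choices.

```python
import math

def autocorrelation(x, k):
    """Short-term autocorrelation R(k) = sum_{n=0}^{N-k-1} x(n) x(n+k) (Eq. 12)."""
    return sum(x[n] * x[n + k] for n in range(len(x) - k))

def estimate_pitch(x, fs, fmin=60.0, fmax=1000.0):
    """Return fs / lag for the lag with the largest R(k) in the search range.

    The search range [fs/fmax, fs/fmin] is an assumption of this sketch.
    """
    lag_min = max(1, int(fs / fmax))
    lag_max = min(len(x) - 1, int(fs / fmin))
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda k: autocorrelation(x, k))
    return fs / best_lag

def moving_average(values, width):
    """Centered moving average; the window shrinks near the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def vibrato_parameters(pitch_contour, contour_rate, window):
    """Return (intonation, rate_hz, extent_hz) for a pitch contour in Hz.

    contour_rate is the number of pitch estimates per second.
    """
    intonation = moving_average(pitch_contour, window)
    deviation = [p - i for p, i in zip(pitch_contour, intonation)]
    # Assumed period definition: time between upward zero crossings.
    starts = [i for i in range(1, len(deviation))
              if deviation[i - 1] < 0 <= deviation[i]]
    periods = [(b - a) / contour_rate for a, b in zip(starts, starts[1:])]
    rate = 1.0 / max(periods) if periods else 0.0
    # Assumed extent definition: mean absolute pitch-intonation difference.
    extent = sum(abs(d) for d in deviation) / len(deviation)
    return intonation, rate, extent
```

For a contour that oscillates sinusoidally around a steady pitch, the intonation returned by this sketch stays close to the steady pitch and the rate close to the oscillation frequency. Because the extent is computed as a mean absolute deviation, it comes out smaller than the peak deviation of the oscillation.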
Testing of Developed Software Applications

For preliminary testing, Dean Ramon Acoymo of the College of Music assessed the performance and usefulness of the applications' features. Further testing was carried out using vocalises of singers. Vocalises were recorded while singers tested the Online Application. The recorded vocalises were then run on the Offline Application to compare the results of both applications. Singers were also asked to fill out a questionnaire assessing the performance and usefulness of the application.

RESULTS AND ANALYSIS

Optimal Unmigrated Vowel Classifier

In order to develop an effective vowel classifier, initial tests on unmigrated vowels were conducted. The samples were assumed to be perceived correctly by the listeners, so the accuracy of these tests was measured against the intended vowel classification. Several feature extraction and pattern classification methods were tested on the vowels to determine the optimal vowel classifier elements. The features that were extracted were MFCCs, F1f+F2f+MFCC, F1v+F2v+MFCC and MFCC+ΔMFCCs. The implemented pattern classification methods were Discriminant Analysis, Classification Tree and Feedforward Backpropagation Neural Network. This phase was done in parallel with the Listening Test. Table 2 shows the top 5 classifiers ranked by Training Set and Test Set accuracies, computed as the average accuracy per vowel.

Table 2. Top 5 classifiers based on Training Set and Test Set accuracies

| Number of Coefficients | Features Extracted | Pattern Classification Method | Training Set Accuracy | Test Set Accuracy |
|---|---|---|---|---|
| 15 | F1f+F2f+MFCC | QDA | 85.90% | 89.40% |
| 21 | F1f+F2f+MFCC | QDA | 87.80% | 86.50% |
| 15 | F1v+F2v+MFCC | QDA | 86.40% | 89.10% |
| 21 | F1v+F2v+MFCC | QDA | 87.80% | 88.60% |
| 21 | F1v+F2v+MFCC | CT | 100.00% | 84.90% |

Development of Data Set based on Listener Perception

The Listening Test resulted in 1041 consistently perceived vowels out of the 1350 sung vowels that were presented to the listeners.
A consistently perceived vowel has the same classification for 60% or more of the listeners in the Listening Test. To increase the data used in the development of the classifier based on listener perception, k-means clustering based on formants was implemented on the consistently perceived vowels. K-means clustering is a partitioning method that finds a partition in which objects within each cluster are as close to each other as possible, and as far from objects in other clusters as possible. The resulting clusters of vowels were labeled with vowel classifications similar to those of the consistently perceived vowels from the Listening Test and were included in the data set. Unless stated otherwise, this is the developed data set cited in this paper.

The numbers of vowels included in the Training Set and Test Set drawn from the data set are shown in Figures 3 and 4, respectively.

Figure 3. Vowel distribution of training set.

Figure 4. Vowel distribution of test set.

Test on Project 1 Listening Test Data

To determine the optimal classifier based on listener perception, the top 5 unmigrated vowel classifiers from Table 2 were tested on the Listening Test data gathered from Project 1 (Dimaculangan & Felias, 2008). Project 1 data were recorded in a large hall. The classifiers were trained using data from 80% of the singers included in the developed data set based on listener perception. The results are summarized in Table 3. The trained classifier with the highest accuracy is the 21 F1v+F2v+MFCC QDA.

Classifier Optimizations

The trained classifier with the highest accuracy, 21 F1v+F2v+MFCC QDA, was used to test the remaining data from 20% of the singers in the developed data set. The resulting accuracies, computed based on the correspondence of the classifier outputs to the labeled vowel classifications based on listener perception, are shown for each vowel in Table 4.

Table 4. Accuracies for each Vowel using the Developed Data Set

| Vowel | Training Set Accuracy | Test Set Accuracy | Perception Accuracy |
|---|---|---|---|
| /a/ | 80.00% | 91.60% | 87.20% |
| /e/ | 91.40% | 90.40% | 94.70% |
| /i/ | 93.10% | 96.90% | 94.00% |
| /o/ | 81.60% | 63.40% | 84.30% |
| /u/ | 86.90% | 89.50% | 86.60% |

To address the speed constraints of the software applications, the Training Set size was minimized. Feature vectors from the fifteen middle frames of each Training Set vowel were taken to form a new Training Set. The minimization reduced classification time from 1245.5 ms to 131.5 ms. Furthermore, most accuracies improved, as shown in Table 5. However, with the new Training Set, the perception accuracy for the vowel /o/ decreased due to confusion with the vowel /a/.

Table 5. Accuracies for the Minimized Training Set

| Vowel | Training Set Accuracy | Test Set Accuracy | Perception Accuracy |
|---|---|---|---|
| /a/ | 82.30% | 92.10% | 88.90% |
| /e/ | 91.90% | 92.50% | 94.70% |
| /i/ | 95.50% | 98.00% | 96.50% |
| /o/ | 80.60% | 68.30% | 81.90% |
| /u/ | 89.50% | 90.40% | 91.10% |

A second-stage classifier was developed to address the confusion between the vowels /a/ and /o/, with the goal of increasing the accuracies for both vowels. The features that were extracted were LSF, Formants, MFCC and MFCC-RASTA. The pattern classification methods that were implemented were Discriminant Analysis, Support Vector Machine and Classification Tree. Comparing the results of testing the combinations of features and pattern classification methods showed that the F1v+F2v+MFCC and LDA combination had the highest accuracy in classifying the two vowels. The second-stage classifier takes the first-stage classifier /a/ and /o/ outputs whenever the classification accuracy is below 50% for /a/ and below 80% for /o/, following computed optimal thresholds.
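The routing between the two stages can be sketched in Python as follows. This is an illustration only, not the authors' implementation. It assumes that the per-vowel "classification accuracy" compared against the thresholds is the fraction of a vowel's frames that agree with the first-stage majority label, which the text does not specify. `first_stage` and `second_stage` are hypothetical callables standing in for the trained QDA and LDA classifiers, and `middle_frames` only illustrates how the fifteen middle frames of a vowel could be selected.

```python
from collections import Counter

# Thresholds quoted in the text: /a/ below 50%, /o/ below 80%.
SECOND_STAGE_THRESHOLDS = {"a": 0.50, "o": 0.80}

def middle_frames(frames, count=15):
    """Keep the `count` frames centred in a vowel (all frames if shorter)."""
    if len(frames) <= count:
        return list(frames)
    start = (len(frames) - count) // 2
    return list(frames[start:start + count])

def vote(frame_labels):
    """Return the majority label and the fraction of frames that carry it."""
    label, hits = Counter(frame_labels).most_common(1)[0]
    return label, hits / len(frame_labels)

def classify_vowel(frames, first_stage, second_stage):
    """Classify one vowel from its per-frame feature vectors.

    first_stage and second_stage are hypothetical per-frame classifiers
    (stand-ins for the trained QDA and LDA models) that return a label.
    """
    label, confidence = vote([first_stage(f) for f in frames])
    threshold = SECOND_STAGE_THRESHOLDS.get(label)
    # Only low-confidence /a/ and /o/ decisions are passed to stage two.
    if threshold is not None and confidence < threshold:
        label, _ = vote([second_stage(f) for f in frames])
    return label
```

Under these assumptions, a vowel whose frames are labeled /o/ by only 60% of the first-stage decisions would be re-examined by the second stage, while an /i/ decision is returned unchanged regardless of its confidence.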
The resulting accuracies for the vowels /a/ and /o/ after adding the second-stage classifier are shown in Table 6. The perception accuracy improved for both vowels.

Table 6. Accuracies for Vowels /a/ and /o/ After Adding the Second Stage Classifier

| Vowel | Training Set Accuracy | Test Set Accuracy | Perception Accuracy |
|---|---|---|---|
| /a/ | 84.40% | 92.50% | 89.20% |
| /o/ | 79.70% | 70.30% | 83.10% |

Final Classifier

The final vowel classifier based on listener perception consists of a first-stage 21 F1v+F2v+MFCC QDA classifier with a second-stage 21 F1v+F2v+MFCC LDA classifier for the vowels /a/ and /o/. The Training Set size was minimized by using only the 15 middle frames of the Training Set vowels. Confusion matrices of the final classifier for the Training Set, the Test Set and the Listening Test vowels used to compute the perception accuracy are shown in Tables 7, 8 and 9, respectively. The diagonal elements of a confusion matrix represent the correctly classified vowels. Off-diagonal elements denote the confusion of one vowel for another vowel. Each row of the confusion matrix sums to 100%. The first row in Table 7 is interpreted as follows: the vowel /a/ in the Training Set is classified as /a/ 84.4% of the time, as /e/ 2.4% of the time, as /i/ 0.3% of the time, as /o/ 11.8% of the time and as /u/ 1% of the time. It can be observed that there is confusion among the back and central vowels /a/, /o/ and /u/, which is mainly caused by the overlapping formant values (F1 and F2) of the three vowels. Moreover, among all the vowels, /i/ is the most accurately classified.

Table 7. Confusion Matrix for the Training Set

| Training Set | /a/ | /e/ | /i/ | /o/ | /u/ |
|---|---|---|---|---|---|
| /a/ | 84.40% | 2.40% | 0.30% | 11.80% | 1.00% |
| /e/ | 2.80% | 91.90% | 4.70% | 0.10% | 0.50% |
| /i/ | 0.20% | 3.60% | 95.50% | 0.00% | 0.70% |
| /o/ | 14.40% | 0.00% | 0.00% | 79.70% | 5.90% |
| /u/ | 1.70% | 0.70% | 2.80% | 5.40% | 89.50% |

Table 8. Confusion Matrix for the Test Set

| Test Set | /a/ | /e/ | /i/ | /o/ | /u/ |
|---|---|---|---|---|---|
| /a/ | 92.50% | 1.30% | 0.00% | 5.40% | 0.80% |
| /e/ | 0.40% | 92.50% | 7.00% | 0.00% | 0.00% |
| /i/ | 0.50% | 1.50% | 98.00% | 0.00% | 0.00% |
| /o/ | 19.80% | 1.00% | 0.00% | 70.30% | 8.90% |
| /u/ | 4.40% | 1.80% | 1.80% | 1.80% | 90.40% |

Table 9. Confusion Matrix for the Listening Test Vowels

| Listening Test Vowels | /a/ | /e/ | /i/ | /o/ | /u/ |
|---|---|---|---|---|---|
| /a/ | 89.20% | 2.00% | 0.70% | 8.10% | 0.00% |
| /e/ | 1.90% | 94.70% | 3.00% | 0.00% | 0.40% |
| /i/ | 0.00% | 2.80% | 96.50% | 0.00% | 0.70% |
| /o/ | 10.80% | 0.00% | 0.00% | 83.10% | 6.00% |
| /u/ | 0.00% | 0.90% | 4.50% | 3.60% | 91.10% |

The final classifier was used in the developed offline application. The online application, on the other hand, used a further minimized classifier that utilizes only the five middle frames of the Training Set vowels to address the greater need for classification speed. Overall accuracies for the offline application vowel classifier are 89.4% for the Training Set and 90.9% for the Test Set. Overall accuracies for the online application vowel classifier are 89.4% for the Training Set and 89.7% for the Test Set. The overall perception accuracy for both classifiers is 92.3%. It was observed that the Test Set had a higher accuracy than the Training Set. The same result was observed after changing the singers included in both sets.

The improvement over the Project 1 baseline accuracy (Dimaculangan & Felias, 2008) for vowel classification is 21.3 percentage points for the Training Set and 38.0 percentage points for the Test Set using the offline classifier. The improvement for the online classifier is 21.3 percentage points for the Training Set and 36.8 percentage points for the Test Set. The objective of improving the accuracy of the classifier was achieved.

Pitch Estimation Algorithms

Two algorithms, AMDF-ACF and the correlogram model of pitch perception, were tested to develop an accurate pitch estimator.
Sinusoids with fundamental frequencies ranging from 65 to 932 Hz (C2 to A#5) were used as inputs to the pitch estimation algorithms. For the AMDF-ACF algorithm, large deviations of the pitch estimates from the test fundamental frequencies, averaging 23.86 Hz, were observed, especially at high frequencies. The correlogram model, on the other hand, produced more accurate pitch estimates with smaller deviations from the fundamental frequencies, averaging 4.95 Hz and occurring only at very high frequencies. Based on this result, the correlogram model of pitch perception was chosen as the pitch estimation algorithm for the software applications.

Vibrato Rate and Extent Estimation

The performance of the vibrato rate and extent estimation was tested using synthesized vowels with a duration of 2 seconds and formant frequencies of 300 and 870 Hz. The synthesized vowels had pitch ranging from 65 to 835 Hz. Vibrato extent was set to 3% of the pitch value of the vowel, and the vibrato rates ranged from 4 Hz to 7 Hz. The average deviation of the estimated vibrato rates from the set vibrato rates is 0.42 Hz, while the average deviation of the estimated vibrato extents from the values computed as 3% of the pitch frequencies is 4.77 Hz. Inconsistencies in the estimation, especially at high frequencies, are attributed to deviations in the pitch estimate and to ripple effects from the pitch interpolation and the moving average filter.

Offline Application

The screen display of the offline application with the corresponding feature labels is shown in Figure 5. The offline application takes transcribed audio files as input and displays the following features: spectral envelope, pitch contour, intonation contour, estimates for pitch, vibrato rate and vibrato extent, formant frequencies, spectrogram and vowel classification. The vowel classification uses the 21 F1v+F2v+MFCC QDA-LDA classifier with 15 frames per Training Set vowel.
Options for vowel detection using short-time energy, transcription file reading and vowel synthesis are also included in the application.

Online/Real-time Application

The screen display of the online application with the corresponding feature labels is shown in Figure 6. The application continuously takes 250 ms of audio input from a microphone. Vowels are detected using short-time energy and zero crossing rate (ZCR). The following features are included in the application and are displayed: spectral envelope, pitch contour, pitch estimate, formants, vibrato rate, vibrato extent, spectrogram and vowel classification. The vowel classification uses the 21 F1v+F2v+MFCC QDA-LDA classifier with 5 frames per Training Set vowel. Options for audio logging, audio exporting and accompaniment playback are also included in the application. The average processing time for each detected vowel was computed to be 193 ms.

Testing of Software Applications

Testing with Dean Ramon Acoymo of the UP College of Music was conducted to gauge the performance and usefulness of the features included in the applications. According to Dean Acoymo, the vowel classifications made by the classifier were consistent with human perception. The human ear usually has trouble discriminating among the vowels /a/, /o/ and /u/ due to the intersection of the formant values of these vowels. The display of the vowel classification and spectral envelope was advantageous in the assessment of the trade-off between vocal color and quality. Moreover, the features are applicable in singing pedagogy, wherein preserving both vocal color and quality is important. The display of the pitch and vibrato parameters, on the other hand, is especially useful for singers who perform different styles of singing.

Three other singers tested the software using vocalises.
The correspondence between the intended vowels of the singers and the vowel classifications made by the online application was observed before and after the singers had a chance to use the software. The results showed an average increase of 7.0% in this correspondence. Moreover, the singers gave positive feedback on the usefulness, ease of use, layout and performance of the online application.

Figure 5. Graphical user interface of the offline application.

CONCLUSION

The developed vowel classifier, which is based on listener perception, uses the formant frequencies F1 and F2 and the MFCCs as features. Vowel classification is made by a first-stage QDA classifier and a second-stage LDA classifier for the vowels /a/ and /o/. The classifier was optimized for speeds suitable for the offline and online applications by reducing the Training Set size while preserving the integrity of the data. The resulting overall accuracies for the classifier used in the offline application are 89.4% and 90.9% for the Training Set and Test Set, respectively. The overall accuracies for the classifier used in the online application are 89.4% and 89.7% for the Training Set and Test Set, respectively. With the Listening Test vowels as inputs to the classifiers, the overall perception accuracy for both the offline and online classifiers is 92.3%.

Aside from vowel classification for assessing the intelligibility of sung vowels, additional features have been incorporated in the developed software applications. The added features are spectral envelope and spectrogram displays, pitch estimation and computation of vibrato parameters; these were chosen based on the suggestions of the singers who were recorded. Vowel classification and spectral envelope displays help singers assess
their vocal color and quality, which are important elements in singing and are given significant consideration in vowel pedagogy. Moreover, the display of pitch, vibrato parameters and spectrogram will help singers improve their vocal tonalities and assess their adherence to the musical style that they are performing. Other features such as voice recording and accompaniment playback were included in the applications to further enhance usage.

In conclusion, the researchers have developed a novel approach for assessing the intelligibility of sung vowels that performs with an accuracy exceeding 89% and effectively emulates human auditory vowel perception. The implementation of the algorithm was based on the vowel classifications made by 100 listeners. Moreover, software applications were developed based on this algorithm. Initial tests show that these applications have potential use in vocal pedagogy and that their added features are beneficial to the intended users.

Figure 6. Graphical user interface of the online application.

RECOMMENDATIONS

The inclusion of a lip-shape detector and online video feedback of the user's lips should be part of the next version of the software. It would also be interesting to study the pedagogical impact of the software on both students at the College of Music and pop singers. It is expected that an increase in the database of recorded sung vowels and song pieces, as well as in the number of listeners in the Listening Test, will lead to a higher accuracy in the vowel classifier.

ACKNOWLEDGEMENT

This research project is funded by the Office of the Vice-Chancellor for Research and Development Open Grant of the University of the Philippines Diliman and the Office of the Vice-President for
Academic Affairs Emerging Fields Grant, and is part of the Interdisciplinary Signal Processing for Pinoys (ISIP) Program.

REFERENCES

D.R. Appelman. 1986. The Science of Vocal Pedagogy (Theory and Application), Indiana University Press.

J. Dimaculangan, and R. Felias. 2008. "Vowel Migration in Sung Filipino Text and Perceived Intelligibility," Undergraduate Student Project, Department of Electrical and Electronics Engineering, University of the Philippines, Diliman.

C. Dumitru, and I. Gavat. 2007. "Vowel, Digit and Continuous Speech Recognition based on Statistical, Neural and Hybrid Modelling by using ASRS_RL," EUROCON, The International Conference on "Computer as a Tool", Warsaw, Poland, 856-863.

P. Merkx, and J. Miles. 2005. Automatic Vowel Classification in Speech, Duke Project Paper in Math 196S, Duke University, Durham, NC, USA.

P. Schmid, and E. Barnard. 1997. "Explicit, N-Best Formant Features for Vowel Classification," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, Munich, Germany, 991-994.

M. Slaney. 1998. Auditory Toolbox for Matlab Technical Report, Interval Research Technical Report.

P. Wilson, K. Lee, J. Callaghan, and C. W. Thorpe. 2008. "Learning to sing in tune: Does real-time visual feedback help?" Journal of Interdisciplinary Music Studies 2(12):157-172.

J. Makhoul. 1972. "Linear prediction: A tutorial review," in Proceedings of the IEEE, pp. 1973-1986.