ap-5-11.dvi Acta Polytechnica Vol. 51 No. 5/2011 A Pitch Detection Algorithm for Continuous Speech Signals Using Viterbi Traceback with Temporal Forgetting J. Bartošek Abstract This paper presents a pitch-detection algorithm (PDA) for application to signals containing continuous speech. The core of the method is based on merged normalized forward-backward correlation (MNFBC) working in the time domain with the ability to make basic voicing decisions. In addition, the Viterbi traceback procedure is used for post-processing the MNFBC output considering the three best fundamental frequency (F0) candidates in each step. This should make the final pitch contour smoother, and should also prevent octave errors. In transition probabilities computation between F0 candidates, two major improvements were made over existing post-processing methods. Firstly, we compare pitch distance in musical cent units. Secondly, temporal forgetting is applied in order to avoid penalizing pitch jumps after prosodic pauses of one speaker or changes in pitch connected with turn-taking in dialogs. Results computed on a pitch- reference database definitely show the benefit of the first improvement, but they have not yet proved any benefits of temporal modification. We assume this only happened due to the nature of the reference corpus, which had a small amount of suprasegmental content. Keywords: PDA, fundamental frequency, Viterbi, temporal forgetting, MNFBC, speech processing. 1 Introduction Almost every audible sound tends to have a fun- damental frequency. This is the lowest frequency on which the signal is periodic, and we sense this frequency as the height (pitch) of the sound. Hu- man speech perception is partly based on intonation (changes of pitch), which is an aspect of prosody. Thanks to this we can distinguish whether a person is making a statement or a question [1]. Prosodic in- formation also enables us to recognize the emotions of a speaker. A motivation for finding a precise and robust PDAcould be to track the intonation contour in continuous speech. This is a crucial step for the proper function e.g. of a punctuation detector [2] or an emotion classifier of the speaker. There are nowadays several known pitch detec- tionmethods. They can generally be divided accord- ing to the domain in which they operate (time, fre- quency, cepstrum, etc.) An overview of some basic methods can be found in [12]. The most widely used methods are probably various forms of autocorrela- tion algorithms (time and frequency domain), based on similarity of the signal itself after some time pe- riod (time domain) or periodicity in the spectrum (frequency domain). AMDF [5] (time domain), the cepstralmethod [4] (modificationof the spectrumdo- main) and sub-harmonic summation (SHS) [3] are well described and widely used methods. 2 A description of PDA using MNFBC A critical aspect of PDAs used for speech analysis is that there are fast transitions between articulated phonemes. For this reason, the best result for a speech signal (in contrast to a singing signal) is ob- tained by methods that work in the time domain. The PDA that is used and improved in this pa- per emerges fromthe very complexPDAdescribed in detail in [9]. The core of the pitch-detection method is a merged normalized forward-backward correla- tion. Equation(1)presents thebasic correlationterm that is used in equations (2) and (3) to compute for- ward and backward correlations, where the constant MAX PERrefers to the time period of the lowest de- tectable frequency. These two correlations combined together lead to the MNFBC function (4), which is then half-way rectified. Combining two opposed correlations into a final correlation should improve the precision for frames with problematic content in terms of the different nature of the beginning and ending parts (transitions). 〈xwk[n], xwl[n]〉 = 2∗MAX P ER−1∑ n=0 xw[n+k]xw[n+l] (1) N F C[t] = 〈xw0[n], xwt[n]〉√ 〈xw0[n], xw0[n]〉〈xwt[n], xwt[n]〉 (2) 8 Acta Polytechnica Vol. 51 No. 5/2011 N BC[t] = 〈xw2M AX P ER[n], xw2M AX P ER−t[n]〉√ 〈xw2M AX P ER[n], xw2M AX P ER[n]〉〈xw2M AX P ER−t[n], xw2M AX P ER−t[n]〉 (3) M N F BC[t] = 〈xw0[n]], xw0[n]]〉(N F C′[t])2 + 〈xw2M AX P ER[n], xw2M AX P ER[n]〉(N BC′[t])2 〈xw0[n], xw0[n]〉 + 〈xw2M AX P ER[n], xw2M AX P ER[n]〉 (4) In contrast to [9], there is neither a signal pre- processing section in our algorithm nor a special block determining whether the segment of speech is voiced or unvoiced. This decision is made by thresh- olding theMNFBCvalue itself. There is also no spe- cial block ensuring correct pitchwith a sub-harmonic summation (SHS) function [3] to prevent halving er- rors. With the improvements suggested in this paper, this is not needed (see section 3 for details). The computational requirements are determined by the complexity of the correlation operation done in the time domain, which is N2, where N is the length of the processed window in the samples. This isworse thanthe complexity N log(N) of fasterPDAs operating in the frequency domain and using FFT. Although real-time use in the final implementation is wanted, we assume that the algorithm will not be used in any time-critical or embedded system where computation complexity is a critical issue. 3 Viterbi post-processing Post-processing using the Viterbi algorithm [6] is ap- plied to find the optimal track of the pitch. The powerof theViterbi procedure lies in its ability to ap- ply user-defined rules for comparing the candidates. However, for proper use of the algorithm some re- quirements have to be met. Each candidate needs to haveassigned its “emission”probability bk (the prob- ability that candidate k is F0 for the current frame, without considering any history) and also its “tran- sition” probability akl, which denotes how probable it is that candidate k in the current frame will be followed by candidate l in the next frame. Having these context-independent values, we can gradually compute the values of function δm,l which tells the final probability of candidate l being F0 for frame m considering the results from the previous frames. Function ψm,l designates the index of themost prob- able candidate in frame m − 1. The equations can then be expressed as (5) and (6). δ[m, l] = max k [δ[m −1, k]a[k, l]]b[l] (5) ψ[m, l] = argmax k [δ[m −1, k]a[k, l]] (6) The algorithm starts by assigning for the first frame δ1,i = bi and ψ1,i = 0. In the current imple- mentation, the three best candidates (the three high- est peaks) come from the MNFBC function. Note that these candidates often (but not always) corre- spond to the harmonic content of a speech signal [11]. Thismeans that inmost cases the candidatewith the highestMNFBCvalue is reallyF0, and the twoother highest values are harmonics of F0 (its natural mul- tiples). However, there could also be cases when the harmonics are “stronger” than fundamental. In this case, theViterbi procedure should preventhalving or doubling errors. The basic scheme of the algorithm is depicted in Figure 1. Fig. 1: The trellis of the Viterbi algorithm The emission probability bk is implemented di- rectly as the value of the MNFBC function for each F0 candidate. The transitionprobability canbe com- puted according to [9] as the decreasing exponential of the frequency difference. This could work well for some range of low fundamental frequencies with a suitablemultiplying constant for thedifference. How- ever, our perception of pitch is not linear with grow- ing frequencies, but is logarithmic [11]. This means that the same difference in frequency in a lower fre- quencybandcausesagreaterpitchchangeperception than the same difference in a higher frequency band. For this reason, the difference in frequency in Hz is converted in our algorithm to the difference in pitch in musical manners — semitones and cents. Let variable x be the difference of frequencies for consequent candidates converted tomusical cents (100 cents=1 semitone) according to equation (7). The resulting transition probability function a(x) is then expressed as (8), and the function is visualized 9 Acta Polytechnica Vol. 51 No. 5/2011 in Figure 2. Multiplying constant 0.0012 was ex- perimentally found to give the best results. Overall results that improve the precision of the algorithm with this modification can be found in section 4. x = 1200 ∣∣∣∣log2 ( f1 f2 )∣∣∣∣ (7) a(x) = 1 e0.0012x (8) Fig. 2: Transition probabilities function depending only on the difference in cents Now let us consider a situationwhen it is possible to have a jump in the pitch of speech in the place of the border of neighbouring prosodic units [1] (with unvoiced segments between them, so that the first voiced segment of the new prosodic unit is the next voiced segment for the last prosodic unit in terms of the Viterbi algorithm). The previous probability function will not allow the change to be immediately applied to the pitch track, and needs some time to “adopt” the new pitch level. Most utterances take place in the range of the musical fourth (which is 5 semitones=500 cents in terms of explicit musical distance). This is not the overall pitch range, but it is the common range that we use across prosodic units, sometimes referred to in the literature as the “pitch sigma”. It is prob- able that the biggest jump in pitch of 5 semitones will not occur very often, and only on the bound- aries ofprosodicunits. However,wepermit this jump to be possible without any penalization after a long enough prosodic pause. For this reason, a difference of 500 cents is a limit, and higher differences will be penalized by a linear decrease. The behaviour of the temporal probability func- tion of two variables (cent difference x and time t) can thus be expressed in the cent difference interval x ∈ 〈0,500〉 as: a(x, t)= e−0.0012x + (1−e−0.0012x)t Tthr (9) and on the interval x ∈ 〈500,1200〉 as: a(x, t)= e−0.0012x + ( 1200−x 700 −e −0.0012x) t Tthr (10) where Tthr is the time forgetting threshold. When the prosodicpause length reaches this value, all pitch changes in the rangeof 500Hzhavea transitionprob- ability of 1. Fig. 3: Transition probability function depending on dif- ference in cents and on time 4 Results 4.1 Test conditions All the results were computed using a manually la- beled pitch-reference database as a part of Spanish SPEECON [10], with the use of a pitch evaluation framework [7]. All parts of the proposed algorithm were implemented in the MATLAB environment. 4.2 Evaluation criteria The results section uses the evaluation criteria sug- gested in [7]. The voiced error VE (unvoiced er- ror UE) rate is the proportion of voiced (unvoiced) frames misclassified as unvoiced (voiced). The gross error high GEH (gross error low GEL) is the rate of F0 estimates (correctly classified as voiced) which does not meet the 20 % upper (lower) tolerance of frequency in Hz. The GEH and GEL 20 % toler- ance range is quite broad, and thus cannot distin- guish clearly between two precise PDAs. For this reason,GEH10andGEL10were established by anal- ogy with GEH and GEL, but with only 10 % toler- ance ranges. These new criteria are also expected to result in higher error rates than the older criteria, but might be useful in applications where precision matters. UE+VE and GEH+GEL criteria are some- times used to summarize PDAerrors. Halving errors (HE — the estimated frequency is half of the refe- 10 Acta Polytechnica Vol. 51 No. 5/2011 Table 1: Channel 0 overall results PDA VE UE VE+UE GEH GEL GEH10 GEL10 DE HE [%] [%] [%] [%] [%] [%] [%] [%] [%] ACF freq 44.4 23.5 31.6 1.2 0.1 1.5 0.18 0.4 0.06 DFE 26.6 15.5 20.4 8.4 4.2 16.5 8.9 0.2 1.3 MNBFCv1 22 12.7 16.3 0.4 21.2 1.5 22.1 0.06 19.5 MNBFCv2 22 10.7 15 0.4 1.1 1.8 2.3 0.03 0.8 MNBFCv3 18.5 13.3 15.3 0.5 1.2 1.9 2.6 0.05 0.8 MNBFCv4 15.6 16.3 16 0.6 1.3 2 2.8 0.05 0.9 MNBFCv5 22 10.7 15 0.7 1.7 2.1 2.9 0.14 1 Table 2: Gross errors (GEL+GEH [%]) in 2/3 octave frequency bands on Channel 0 PDA 57–88 88–141 141–225 225–353 353–565 [Hz] [Hz] [Hz] [Hz] [Hz] ACF freq 92.7 5.2 0.6 1.1 17.7 DFE 26.4 12.1 12.3 13.0 53.4 MNBFCv1 2 1.9 28.3 49.7 73.7 MNBFCv2 1.1 0.7 1.7 2.2 32.1 MNBFCv3 1.7 0.9 2 2.3 33 MNBFCv4 2.3 1 2.3 2.7 34.4 MNBFCv5 2.2 1.6 2.8 2.9 32.1 rence) and doubling errors (DE) were also brought in with a tolerance of 1 semitone range from half or double the referenceF0. Errorsof this kindarea spe- cial type of gross errors and often occur on real PDA outputs for noisy signals or transitions fromvoiced to unvoiced speech elements. We may sometimes need to observe the errorsnot in the entire frequencyband but e.g. within 5 smaller frequency sub-bands indi- vidually (2/3 octave bands were used to cover the range from 60 to 560 Hz). 4.3 Results and discussion Table 1 shows the overall results for the highest signal-to-noise (SNR) ratio channel 0 of the ref- erence database. MNBFCv1 is the basic variant with the voiced/unvoiced (V/UV) decision thresh- old set to value 0.5 and with the transition proba- bility of the Viterbi procedure computed from the direct frequency difference. MNBFCv2 improves the first variant with the conversion difference to cents. MNBFCv3 is almost the same as MNBFCv2, but has the V/UV threshold set to 0.45, whereas MNBFCv4 has the threshold value set to 0.4. The final MNBFCv5 involves adding the temporal do- main to the transition probability function with the time forgetting threshold set to 2 seconds. Table 2 presents a comparison of the precision over five dis- tinct frequency bands. To compare our method with other widely used methods, we added the results for autocorrelation in the frequency domain (ACF freq, a very goodmethod for tracking singing) and theDi- rect Frequency Estimation method (DFE) [8], which is currently used for evaluating Parkinson’s disease at FEE CTU in Prague. The results show that MNFBC is better than DFE in V/UV detection and also in precision. The VE+UE parameter is the best for MNBFCv2, but we can achieve the bestVE ratio forMNBFCv4 (but with aworseUE rate). The choice of variantdepends on the target application—whetherweneed tomini- mizevoicederrorsorunvoicederrors. For example, in the case of the planned punctuation detector we are trying tominimize the unvoiced error rate in order to obtain only confident F0 estimates. The results also show a big increase in precision with frequency dif- ference (MNBFCv1) to cent conversion (MNBFCv2). Progress can be seenmainly inGEL and in the halv- ing error rate. Table 2 shows that most errors for MNBFCv1 occur in the highest band, where the dif- 11 Acta Polytechnica Vol. 51 No. 5/2011 ferences from the current frequency aremuch greater in Hz units than for lower bands. Thus these transi- tions are evaluatedwith very lowprobability, leading to these errors. The table also shows that the ACF method can provide the best results for the highest frequency band, but is very poor in the lowest band. MNBFCv5 with temporal forgetting could probably not show its strength on the reference corpus due to lack of suprasegmental prosodic phrases (the cor- pus consists mainly of isolated words). In compari- sonwithMNBFCv2, however, there is only a slightly higher GEH rate. Globally, MNFBC with the addi- tion of the Viterbi traceback procedure outperforms DFE on close talk channel 0. Note that it has much lower GEH and GEL even for lower VE. This is not easy to achieve for PDA, because lower VE means that more uncertain segments (which other PDAs with higher VE have considered as unvoiced) pass to computation of the precision of F0 detection (gross errors). Other resultsnotpresented in this paperhavealso shownanoticeable decrease in precision on channel 1 with the algorithm presented here. To get good re- sults even innoisy environments, higher noise robust- ness is needed. This couldbeaccomplishedbyadding a pre-processing stage with noise reduction (not im- plemented yet). 5 Conclusion Wehavedescribedapitch-detectionalgorithmpurely based on merged normalized forward-backward cor- relation (MNFBC) with an advanced Viterbi post- processing procedure for finding the most probable pitch track. The optimal range of the voicing thresh- old was found for the MNFBC function. The results confirm that computing the transition probabilities with the pitch difference measured in semitones sig- nificantly improves the gross error rates (especially frequencyhalving)over the caseofdirectdifferenceof frequencies. We have also tried to extend the transi- tion probability functionwith a temporal dimension. This enhancement should lead to fewer errors occur- ring on the edges of prosodic pauses, but this has not been proven in experiments performed on a pitch reference database. This could be due to the very limited presence of supra-segmental prosodic pauses in the corpus. More experiments on suitable utter- ances need to be performed in order to evaluate this hypothesis. Acknowledgement The research presented in this paper was super- vised by Ing. Václav Hanžl, FEE CTU in Prague. It has been supported by the Czech Grant Agency under grant No. 102/08/0707 “Speech Recogni- tion under Real-World Conditions” and by grant No. 102/08/H008 “Analysis and modelling biomedi- cal and speech signals”. References [1] Palková, Z.: Fonetika a fonologie češtiny. Praha : Karolinum, 1994. [2] Kim, J., Woodland, P. C.: The use of prosody in a combined system for punctuation genera- tion and speech recognition. In Proceedings Eu- rospeech (2001), 2757–2760. [3] Hermes, D. J.: Measurement of pitch by sub- harmonic summation. J. Acoust. Soc. Am. 83 (1988), 257–264. [4] Noll, A. M.: Cepstrum pitch determination. J. Acoust. Soc. Am. 41 (2) (August 1966), 293–309. [5] Ross, M. J., Shaffer, H. L., Cohen, A., Freud- berg, R., Manley, H. J.: Average magnitude dif- ference function pitch extractor. IEEE Trans. Acoust. Speech Signal Process. ASSP-22 (5) (October 1974), 353–361. [6] Viterbi, A. J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEETrans. Inform.Theory, vol. IT- 13, (April 1967), 260–269. [7] Bartošek, J.: Pitch detection algorithm evalua- tion framework. 20th Czech-German Workshop on Speech Processing, Prague (2010), 118–123. [8] Bořil, H., Pollák, P.: Direct time domain funda- mental frequency estimation of speech in noisy conditions. In the Proceedings of EUSIPCO 2004 (European Signal Processing Conference, Vol. 1) (2004), 1003–1006. [9] Kotnik, B., et al.: Noise robust F0 determina- tion and epoch-marking algorithms. Signal Pro- cessing 89(2009), 2555–2569. [10] Kotnik, B., Höge, H., Kacic, Z.: Evaluation of pitchdetection algorithms in adverse conditions. Proc. 3rd International Conference on Speech Prosody, Dresden, Germany (2006), 149–152. [11] Syrový, V.: Hudební akustika, 2nd ed. Praha : HAMU, 2008. [12] Uhlíř, J.: Technologie hlasových komunikací. Praha : ČVUT, 2007. 12 Acta Polytechnica Vol. 51 No. 5/2011 About the author Jan BARTOŠEK was born in 1984 in Litoměřice (CZ) and attended the primary and the secondary school there. He completed his bachelor degree (2007— Informatics andComputer Science) and his master degree (2009— Software Engineering) at the department of Computer Science, Faculty of Electri- calEngineering,CTU inPrague. He is nowa student on the doctoral study programme at CTU Prague, Dept. of Circuit Theory, and is interested in voice and music technologies. Jan Bartošek E-mail: bartoj11@fel.cvut.cz Dept. of Circuit Theory Faculty of Electrical Engineering Czech Technical University in Prague Technická 2, 166 27 Praha, Czech Republic 13