Speaker Non-speech Event Recognition with Standard Speech Datasets

J. Rajnoha

A non-speech event modelling approach to speech recognition is presented in this paper. A speaker-independent spoken Czech digit recogniser is used for this purpose, and speaker-generated non-speech events are modelled. Because it is important for the recogniser to be trained on suitable data, the paper examines several factors that influence the occurrence of the modelled non-speech events in the training database. Results achieved on the analysed training database are then presented. In the forced-alignment experiments the recogniser eliminates almost all of the insertion errors, which is a promising property for subsequent training. However, experiments with different bases for the non-speech event models give almost the same results, so this choice seems less significant for recognition.

Keywords: speech recognition, digit recognition, non-speech events, training database, forced alignment.

1 Introduction

Automatic speech recognition has become a very popular field of research, and its results enter our lives in many forms, e.g. in voice-controlled machines such as PCs and mobile phones. These command-controlled systems often work with real, spontaneous speech, which differs from the read speech recorded under the clean laboratory conditions of a research centre. Spontaneous speech is recorded in a real environment, which adds noise to the speech signal. The speech is also created "on the fly", and the speaker has to think of the next word while speaking. All this causes many non-speech events to be present in such speech. It is therefore important to make the recogniser robust against these events, to take them into account, and to ensure that they are not recognised incorrectly.

The long-term goal of our work is to create a hidden Markov model (HMM) based digit recogniser that suppresses the influence of non-speech events. One solution is to model the non-speech events that affect speech. For this purpose it is important to have a large training database with many occurrences of each modelled item, e.g. each phoneme, in order to obtain a general description of the item. Therefore, in the first part of this work, two Czech speech databases are analysed for the presence and quality of speaker non-speech events. At this early stage of our work, only speaker-generated non-speech events are taken into account, because of their position between the words of recognised speech. This allows us to treat each event as another word and to model it easily. For this purpose, the databases were inspected for the presence of the events in different situations.

The second part of the work is concerned with speaker non-speech event modelling. A robust Czech digit sequence recogniser based on HMMs of Czech phonemes is trained on the analysed database, and the results obtained with the non-speech event recognition feature are presented. The HMM Toolkit (HTK) [2] is used for this purpose.

Based on these results, forced-alignment experiments are presented in the third part of the work. The recogniser is used to re-recognise the training data. This removes unsuitable event marks in the transcription and also makes it possible to re-mark events, which should improve the recognition results.

2 The database for the non-speech event recognition task

As noted above, it is important for automatic speech recogniser training to have a speech database with enough occurrences of each type of modelled unit, e.g. phonemes. In the non-speech event modelling task this calls for a sufficient number of non-speech events in the database, which is the way to obtain a general description of the modelled event. This work deals with speaker non-speech events that appear between particular words, so the databases are analysed for the presence of these events.

In most cases, speech recognition systems are trained on a database of mainly read speech. This helps to complete the database, because it is known approximately what was said. However, read speech differs from the spontaneous speech used in voice communication with a machine. Read speech is low in speaker non-speech events: unlike a spontaneous speaker, a reader is not forced to think while speaking, so the occurrence of hesitation is low. The same applies to other possible non-speech events, such as lip-smacking, because the reader has better control over his mouth.
The recogniser presented here was trained on two different sets of speech. The first database (SPEE) is a collection of recordings of Czech speakers in different environments; for training purposes only items recorded in a silent environment were selected from it. The SPEE set divides speaker non-speech events into two classes: filled pauses (FIL), which are pauses in speech filled with some sound (common in hesitation), and other speaker non-speech events (SPK).

To increase the number of non-speech events present in the training set, one other dataset was used. It is a collection of recordings made in a car (TEM), likewise restricted to silent items from a standing car. This database divides the events into several classes, which helps to create more distinct models of non-speech events and to describe them more accurately.

Table 1 shows the number of events marked in the transcription of the whole dataset and of the selected clean training subset, for both the SPEE and the TEM dataset.

Table 1: Number of non-speech events

                  utterances   SPK events   FIL events
    SPEE-all         180213       108474         7382
    SPEE-train        63024        33138         1856
    TEM-all          221318        46742          691
    TEM-train         38391        11532          153

It can be seen that the training selections (clean) contain notably fewer speaker non-speech events marked in the transcription. The analysis in Section 2.2 shows that higher SNR (a lower noise level) in speech leads to a lower occurrence of the events. A compromise is therefore needed between clean speech (more intelligible for the recogniser) and a sufficient number of non-speech events for training the non-speech event robust recogniser.

The reason for the low number of non-speech events in the TEM training subset is that this subset is only a small fragment of the whole TEM dataset: only the standing-car items were taken for training purposes.

The datasets also include some records that are not suitable for standard phoneme training, e.g. web-page addresses or spelled utterances. While the average occurrence of non-speech events is 0.64 events per utterance in the whole SPEE dataset and 0.56 events per utterance in the training subset, the average rate for web-page utterances is 0.89. However, these records were not used in the training phase, which decreases the number of non-speech events in the training subset.

The analyses that follow try to find groups of records sharing some property that influences the speaker non-speech event distribution in the training subset. This can help to discover inefficiencies in the non-speech event training process. A sketch of the underlying per-utterance rate computation is given below.
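The per-utterance rates quoted above can be obtained by a simple token count over the transcriptions. The following minimal sketch assumes a hypothetical annotation in which events appear as [SPK] and [FIL] tokens inside the word string; the actual mark-up conventions of the SPEE and TEM transcriptions may differ.

    # Average number of speaker non-speech event marks per utterance.
    # The [SPK]/[FIL] token convention is an assumption for illustration.
    EVENT_MARKS = {"[SPK]", "[FIL]"}

    def event_rate(transcriptions):
        """Return the mean event count over a list of utterance strings."""
        if not transcriptions:
            return 0.0
        total = sum(
            sum(1 for token in utt.split() if token in EVENT_MARKS)
            for utt in transcriptions
        )
        return total / len(transcriptions)

    # Example: three utterances, two of them containing event marks.
    utts = [
        "JEDNA DVA [FIL] TRI",
        "CTYRI PET",
        "[SPK] SEST SEDM [SPK]",
    ]
    print(event_rate(utts))  # -> 1.0 events per utterance

Applying the same function to different subsets of the data (e.g. the web-page utterances versus the clean training items) yields the rate differences discussed above.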
2.1 Event distribution by speaker age

The datasets include some basic information about each speaker. This can help us to find out whether some group of people influences the training dataset in some way. One such piece of information is the age-class distribution, together with the number of speaker non-speech events for the different age classes.

[Fig. 1: Age-class distribution in the datasets]
[Fig. 2: Distribution of non-speech events over speaker age for the speech datasets]

Fig. 2 shows that the presence of speaker non-speech events is not much influenced by the speaker's age up to about 65 years; only at higher ages does the amount rise. There are not many speakers from the 65+ age class in the datasets (Fig. 1), and this increase is not large enough to cause harmful effects, such as training the recogniser on a specific kind of event only.

2.2 Event distribution at different noise levels

The SPEE database includes an SNR estimate for each record. This makes it possible to analyse the influence of a noisy environment on the occurrence of non-speech events in speech. There is no SNR estimate for the TEM database, but the information on car type and engine state can also partially describe the environment.

[Fig. 3: Non-speech events in the SPEE training dataset in different noise environments]
[Fig. 4: Non-speech events in the whole (filled bar) and training (lined bar) TEM dataset in different noise environments]

The graph in Fig. 3 shows that the distribution of the average non-speech event rate has its maximum in the SNR band between 10 and 15 dB. For lower SNR the rate falls, mainly because the noise covers the events. A similar effect can be seen in TEM (Fig. 4), where the items without a running engine in the background contain more non-speech event marks. For higher SNR ranges the rate also falls: in such an environment the speaker is not disturbed by environmental noise, so he pronounces the words properly, and this leads to a lower non-speech event rate. A sketch of this SNR-band grouping is given at the end of Section 2.

2.3 Event distribution for different workgroups

The team that created the two databases was divided into two parts with different supervisors and workplaces within the Czech Republic. This led to a different dialect distribution within the groups, as the records from the second group were spoken mainly by speakers with a Moravian dialect. As shown in Fig. 5, this separation also led to a difference in the annotations: the workers in the first group used fewer non-speech event marks in the transcription than the second group, and their items have a notably lower rate of non-speech event occurrence.

[Fig. 5: Non-speech events in the whole (filled bar) and training (dashed bar) datasets for different dialect/workgroup]

As a consequence, some of the non-speech events marked by the second group may be less loud or less prominent. Because such events do not decrease the recognition score, it is quite undesirable to train the recogniser to separate them from common background noise. If these unimportant event marks were not there, the recogniser would take only significant events into account and would be able to model them more properly. This is one reason for using forced alignment on the training dataset, see below.
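The grouping behind Fig. 3 can be sketched as follows. The record layout is hypothetical (an estimated SNR in dB and an event count per utterance); only the banding logic itself reflects the analysis above.

    # Average event rate per SNR band (e.g. the 10-15 dB band).
    # The (snr_db, n_events) record format is an assumption for illustration.
    from collections import defaultdict

    def rate_by_snr_band(records, band_width=5):
        """Map each SNR band's lower edge to its mean event count."""
        counts = defaultdict(lambda: [0, 0])  # band -> [event total, utterances]
        for snr_db, n_events in records:
            band = int(snr_db // band_width) * band_width
            counts[band][0] += n_events
            counts[band][1] += 1
        return {band: ev / n for band, (ev, n) in sorted(counts.items())}

    records = [(8.2, 1), (12.5, 2), (13.1, 1), (14.9, 2), (22.0, 0), (27.3, 1)]
    print(rate_by_snr_band(records))
    # -> {5: 1.0, 10: 1.67, 20: 0.0, 25: 1.0} (values rounded)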
3 Previous tests

The analysed databases were used to create a Czech digit sequence recogniser that models speaker non-speech events [1]. In the first step, only two classes of speaker non-speech events were modelled, and both datasets could be used. In subsequent processing, the SPK event was divided into separate plosive and fricative events. From that point on, only TEM was used for training, because there was no information on these properties in the SPEE dataset. The recogniser was tested on selections from both datasets (because of the different environments).

Fig. 6 shows the recognition results for the test data derived from SPEE and TEM in terms of the word error rate

    WER = ((D + S + I) / N) * 100 %        (1)

where N is the number of recognised words, D the number of deleted words, S the number of substituted words, and I the number of incorrectly inserted words.

[Fig. 6: Word error rate (dashed bar) and insertion error rate (filled bar) on the SPEE test data (top) and on the TEM test data (bottom)]

At the beginning of non-speech event modelling there was a recogniser trained on SPEE only (col. a). In the first step, TEM was added to the training set and two basic non-speech event marks were added to the set of models. The first retraining decreased the error rate even without using non-speech events (col. b), but the recogniser that takes the events into account gives better results (col. c). Additional retraining brought no improvement, so only TEM was used for the subsequent phases, and the general SPK mark was divided into two classes: the fricative (BRE) and the plosive (PL) event. After two retraining steps, the results show that without the additional models the word error rate and the insertion error rate increase rapidly (col. d), unlike in the case of non-speech event modelling (cols. e, f). The experiment therefore shows that the non-speech event modelling approach helps to improve recognition.

4 Experiments on forced alignment

As noted above, the SPEE database divides speaker non-speech events into two classes only, but it is better for the recogniser to have more classes of these events. The more classes are used, the more accurately the events can be described and modelled. Therefore three models of non-speech events were used in the recogniser above, which meant using only TEM for subsequent retraining, so the training dataset size decreased.

Using the SPEE database for subsequent training again requires a decision on whether a given SPK mark in the transcription is closer to a BRE or to a PL event. The HTK system provides a feature called forced alignment, which re-recognises the training database and puts the most fitting (most probable) form of the given word into the record. This is normally used for deciding between pronunciation variants of one word, and in this work it was used in a similar way for speaker non-speech events.

In this experiment the SPEE dataset was re-recognised using the recogniser above. Because the recogniser is able to classify the SPK event, the result of this recognition is an SPEE database with three classes of speaker non-speech events, as in the TEM database. A deliberately simplified sketch of this variant selection is given below.
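Real forced alignment in HTK performs a Viterbi pass over the full phoneme and event HMMs; in the reduced sketch below, each event class is collapsed into a single univariate Gaussian over one invented acoustic feature, so only the "choose the better-scoring interpretation" step remains. All model parameters are assumptions for illustration.

    # Choosing between the BRE and PL interpretation of an SPK mark by
    # scoring the segment under each event model and keeping the best.
    import math

    def log_gauss(x, mean, var):
        """Log density of a univariate Gaussian."""
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

    # Hypothetical 1-D models: parameters invented purely for illustration.
    MODELS = {"BRE": (0.2, 0.05), "PL": (0.8, 0.05)}

    def classify_spk(feature):
        """Return the event label whose model scores the segment higher."""
        return max(MODELS, key=lambda label: log_gauss(feature, *MODELS[label]))

    # Example: re-mark two SPK segments by their (hypothetical) feature values.
    for feat in (0.25, 0.9):
        print(feat, "->", classify_spk(feat))
    # 0.25 -> BRE, 0.9 -> PL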
4.1 Breath-like noise

The recogniser above uses two types of BRE model. One is based on the phonetically similar phoneme "f", while the second takes advantage of the length of the silence model. It was therefore necessary to decide which should be used for subsequent processing.

The recognition results (Fig. 6, cols. e, f) show that the recognition score is almost the same in the two cases. It seems that both models have the same ability to describe the event, but forced alignment on the SPEE dataset revealed the quality of these models. This was done by comparing the results of re-aligning SPEE with BRE models built on the two different base models.

The recogniser was trained in several retraining steps. In the last step (phase 2) and in one preceding step (phase 1), a forced alignment of the SPEE database was performed. Table 2 shows the difference between the "f"-based and silence-based forced alignment in those two succeeding retraining phases, which leads to four comparisons. We analysed whether one recogniser marked a non-speech event with the same mark (BRE vs. PL) as another recogniser (substitution), and we counted as a deletion the situation when one of them placed no mark where the other marked an event.

Table 2: Comparison of re-aligned BRE non-speech events in different retraining phases

                                          silence-based,   'f'-based,
                                          phase 1          phase 2
    'f'-based, phase 1      Subst.        24.18 %          10.44 %
                            Deletion       5.34 %           4.14 %
    silence-based, phase 2  Subst.         4.44 %          18.77 %
                            Deletion       3.86 %           7.19 %

In the case of the "f"-based BRE models, an event was marked in the same way by the other recognisers in at most 89.5 % of cases. This shows that the recogniser does not have stable models; they continue to change their properties notably during training. As a result, an event may serve to train the BRE model for a while and later be used to train the PL model, because of the instability of the forced-alignment results.

On the other hand, the comparison of the silence-based models in the two succeeding phases shows that these models behave in a rather stable way. Such an event will therefore be used to train only one type of non-speech event model, which seems more suitable for subsequent processing.

4.2 Aligned data-based recogniser

The first analysis (a listening test) of the re-aligned data revealed faults in the ability of the recogniser to decide whether an SPK event is of BRE or PL type. In some cases the event was too close to the beginning of a word, in others it was too quiet, so high accuracy could not be expected. Nevertheless, the results expanded the training database, which could help recognition.

Based on the results above, retraining was performed using the re-aligned SPEE training dataset. Both kinds of BRE model were used, to check whether the comparison experiment has any effect on recognition accuracy.

Table 3 shows that using a different base model for the BRE event has no significant influence on recognition accuracy. After two retraining steps, the recogniser was able to eliminate all insertions except one. And even though the accuracy does not reach the value of the original recogniser, the re-alignment seems to bring an improvement for subsequent training.

Table 3: Recognition results after re-alignment with a different BRE basis

                            Acc [%]   Insertions
    'f'-based, original      96.44        3
    sil-based, original      96.44        2
    'f'-based, phase 1       95.96        3
    sil-based, phase 1       95.96        2
    'f'-based, phase 2       96.20        1
    sil-based, phase 2       96.32        1

4.3 Alignment against listening

As noted above, the TEM dataset divides non-speech events into several classes. These events were marked by a human annotator, so the original transcription can be used as a good basis for assessing the re-alignment quality. This transcription can be considered a good estimate of the non-speech event class; if the re-alignment phase marks some plosive event as BRE in many cases (or vice versa), the recogniser is unable to describe the difference between these events and needs to be trained better. A sketch of such a mark-by-mark comparison is given below.
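The comparison used for Table 2 (and, against the human annotation, for Table 4 below) can be sketched as follows. Pairing the i-th event mark of one version with the i-th of the other is a simplification introduced here; the real comparison can rely on the time stamps produced by the alignment.

    # Counting substituted and deleted event marks between two versions of
    # the same transcriptions (per-utterance lists of BRE/PL labels).
    def compare_marks(reference, hypothesis):
        """Return (substitutions, deletions) between two label sequences."""
        subst = deleted = 0
        for ref_utt, hyp_utt in zip(reference, hypothesis):
            for ref_mark, hyp_mark in zip(ref_utt, hyp_utt):
                if ref_mark != hyp_mark:
                    subst += 1        # BRE marked as PL, or vice versa
            deleted += abs(len(ref_utt) - len(hyp_utt))  # unmatched marks
        return subst, deleted

    ref = [["BRE"], ["PL", "BRE"], []]
    hyp = [["PL"], ["PL"], []]
    print(compare_marks(ref, hyp))  # -> (1, 1): one substitution, one deletion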
Section 2.3 shows that not all the non-speech events marked in the original transcription of the speech datasets can be considered a suitable pattern for training the event model. The re-alignment can therefore help to reduce the differences between the marked events by removing quiet events, which can be modelled by the silence model.

Table 4 shows the number of non-speech event marks (BRE and PL) that were deleted from the original training subset (annotated by a human) in the different training phases, and the number that were substituted for each other. For the SPEE dataset there was no human-aligned transcription for these non-speech event classes, so there is no information about substituted marks.

Table 4: Comparison of non-speech events marked in the original and re-aligned training subsets

              phase                     Deleted   Substituted
    SPEECON   phase 1 before re-align     1116        –
              phase 2 before re-align     1723        –
              phase 1 after re-align      2169        –
              phase 2 after re-align      2282        –
    TEMIC     phase 1 before re-align     1381        424
              phase 2 before re-align     1381        566
              phase 1 after re-align      1296        458
              phase 2 after re-align      1286        448

The number of deleted non-speech event marks rises for the SPEE subset, but the difference between the numbers of deleted marks in successive phases decreases. The recogniser therefore tends towards some final form of the non-speech event models, which considers about 2300 marks in the original transcription to be too quiet or inappropriate in some other way. For the TEM subset this number does not change notably. This may be because the recogniser was trained on the TEM subset in all four phases, while SPEE was used only in the last two training phases.

The substitutions in Table 4 show that training leads to a decreasing number of substituted marks. This means that the recogniser classifies the non-speech events more and more like the human annotator, and so it can re-align the data more precisely. The substitutions are more often caused by marking a PL event as a BRE event. A simple listening test showed that plosive non-speech events are in some cases followed by or mixed with breath while only the PL event is marked, so a substitution does not necessarily indicate a bad model of a non-speech event.

5 Conclusion

This paper describes some analyses and tests in a non-speech event modelling task. The analyses of the training datasets show some properties that can influence recognition accuracy. Recognition tests were then performed to find the best way to model speaker non-speech events. A spoken Czech digit sequence recogniser based on phoneme HMMs was used for this purpose.

The speech databases used for the experiments were analysed, and it was found that parts of both sets contain notably different non-speech event rates. This was caused by different supervision in the annotation phase of database creation. The distribution over different noise backgrounds supports the intuitive conclusion that a high noise level covers non-speech events, so the occurrence rate decreases. For very quiet environments the rate is also lower, so the cleanest items are not automatically the best for the non-speech event recognition task.
The analysed datasets were used for training the recogniser, and although they were not checked to ensure that they contained only suitable non-speech items, using the non-speech event modelling feature brought a notable improvement. This recogniser was used to re-align one of the training datasets to obtain a more accurate description of the non-speech events. This reduced the insertion error. The choice of the model that serves as a basis for the non-speech events seems to be of less importance, because the difference slowly disappears during retraining.

Acknowledgments

The work presented here was supported by GAČR 102/05/0278 "New trends in voice technologies research and usage", AVČR 1ET201210402 "Voice technologies in information systems", IGA MZ ČR NR8287-3/2005, and research activity MSM 6840770014 "Research in the Area of the Prospective Information and Navigation Technologies".

References

[1] Rajnoha, J.: Modeling of Speaker Non-Speech Events in Robust Speech Recognition. Proceedings of the 16th Czech-German Workshop on Speech Processing, Prague: Academy of Sciences of the Czech Republic, Institute of Radioengineering and Electronics, 2006, p. 149–155.
[2] Young, S. et al.: The HTK Book (for HTK Version 3.2.1). Cambridge University Engineering Department, 2002.
[3] Gajic, B., Markhus, V., Pettersen, S. G., Johnsen, M. H.: Automatic Recognition of Spontaneously Dictated Medical Records for Norwegian. COST278 and ISCA Tutorial and Research Workshop – ROBUST 2004, 2004.
[4] Shriberg, E. E.: Phonetic Consequences of Speech Disfluency. Proceedings of the International Congress of Phonetic Sciences, San Francisco, 1999, p. 619–622.
[5] SPEECON project webpage: http://www.speechdat.org/speecon

Josef Rajnoha
e-mail: rajnoj1@fel.cvut.cz

Dept. of Circuit Theory
Czech Technical University in Prague
Faculty of Electrical Engineering
Technická 2
166 27 Praha, Czech Republic