AP06_6.vp


1 Introduction
The recognition of clean speech recorded in quiet condi-

tions can be addressed quite successfully with the widely used
Mel-Frequency Cepstral Coefficients – MFCC [1] and Percep-
tual Linear Predictive Coefficients – PLP [2]. In this work, the
behavior of these features is examined in the case of changes
in talking style (neutral speech, speech under Lombard effect
– LE) and in the case of speech bandwidth limitation intro-
duced by telephone filter. LE introduces changes in speech
production due to the speaker’s effort to increase communi-
cation intelligibility in noise [3]. Bandwidth limitation of
telephone filter introduces changes in the spectral content of
the signal which may lead to loss of the fourth speech formant
and may thus degrade the recognizer performance.

In addition, two HMM training strategies were compared
with respect to convergence speed and best achievable perfor-
mance. The strategies differed only in the way in which the
initial HMMs containing one Gaussian mixture component
per HMM state were enhanced to contain the final 32 mix-
tures per HMM state. This was done either by direct splitting
and cloning the only mixture to 32 mixtures and reestimating
them many times (one-shot approach), or by gradually dou-
bling the number of mixtures and reestimating after each
split until there were 32 mixtures (progressive propagation).

The recognition experiments presented in this paper
were carried out on Czech SPEECON [4] and CLSD’05 [5]
databases. Czech SPEECON comprises recordings in public,
office, car and entertainment scenarios. For the purpose of
HMM training, office data representing neutral speech with
high SNR was chosen. For the tests, the CLSD’05 database
was used. This consists of neutral speech and speech uttered
in various types of simulated noisy backgrounds (CAR2E car
noise [6] artificial band-noises). Since the noises were repro-
duced to the speakers through closed headphones, only clean
Lombard speech was captured, benefiting from similar SNR
to the SPEECON office recordings.

2 Feature extraction techniques
Two widely used feature extraction methods were ex-

amined in this work, MFCC and PLP. Both methods use
Mel-Frequency warping with the same number of frequency

subbands, and the estimates of the spectral envelopes were
described by the same number of cepstral coefficients, so that
the two methods were comparable. The most important dif-
ference in the otherwise rather similar approaches is the way
in which the cepstral coefficients are obtained from the spec-
tra: through the DCT transform in the case of MFCC, and
through linear prediction in the case of PLP. The settings were
as follows:
� preemphasis with � � 0.97
� 100 Hz frame rate, frame length 25 ms
� 26 Mel-frequency bands
� LPC of order 12 (for PLP)
� energy normalization per utterance
� cepstral liftering
� 39 features per frame (12 cepstral coeffs � frame energy, �

and �� coeffs)

The input speech data was either sampled at 16 kHz or
resampled at 8 kHz. As the feature extraction settings were
not dependent on the sampling rate, the frequency subbands
were effectively wider for 16 kHz than for 8 kHz.

3 Telephone channel simulation
One of the main goals of the study was to investigate the

influence of limiting the bandwidth of the input speech, par-
ticularly by simulating a standard telephone channel. The

32 ©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/

Acta Polytechnica Vol. 46  No. 6/2006

Influence of Different Speech
Representations and HMM Training
Strategies on ASR Performance
H. Bořil, P. Fousek

This work studies the influence of various speech signal representations and speaking styles on the performance of automatic speech
recognition (ASR).  The efficiency of two approaches to hidden Markov model (HMM) training are compared.
Common MFCC and PLP features were exposed to two sources of disturbance applied to the original wide-band speech: (i) stress (Lombard
effect) and (ii) transfer channel distortion (simulated telephone line). Subsequently, the efficiencies of the two training strategies were
evaluated. Finally, a study of the optimal number of training iterations is introduced.

Keywords: PLP, MFCC, Lombard effect, CLSD’05.

This text was a part of the International Conference POSTER 2006 which was held in Faculty of Electrical Engineering CTU in Prague.

Fig. 1.: Transfer function of the simulated telephone channel


original 16 kHz, 16 bit PCM speech was processed in two
steps. First, the signal was resampled using the sox tool with a
polyphase filter to 8 kHz [7]. Then the telephone channel
was emulated by applying a G.712 standard IIR filter of the
4th order [8] using the FaNT tool [9]. Superposition of the
processing steps as described above leads to an effective band-
width of 300 Hz–3400 Hz (see Fig. 1).

4 Recognition experiments

4.1 Experimental setup
All experiments were carried out on the Czech SPEECON

and CLSD’05 databases. From both databases only the
close-talk microphone channel was used. The training set
consisted of SPEECON office recordings, which represent
neutral speech in a quiet environment. This set contained
general speech pronounced by both genders, and comprised
about 15 hours of speech. There were four independent
test sets covering examples of gender-dependent neutral or
Lombard speech: neutral-male (1423 words), neutral-female
(4930 words), LE-male (6303 words) and LE-female (5360
words), containing continuously pronounced digits from “nu-
la” to “devět”. Though the neutral and Lombard utterances
differed in the prompt texts, speakers were the same for both
sets.

The recognizer was a gender-independent HTK-based
HMM system with 43 context-independent phoneme models
� 2 silences, each with 3 emitting states and 32 Gaussian mix-
tures per state. The task was to recognize 10 Czech digits in 16
pronunciation variants.

4.2 Effect of Lombard speech and resampling
on MFCC and PLP features

The aim of this experiment was to show how the Lombard
effect and a narrow bandwidth can affect recognition perfor-
mance. In all cases, the training and testing conditions were
the same.

First, the baseline performance of MFCC and PLP fea-
tures on neutral wide-band speech was evaluated in terms of
Word Error rate (WER), see rows 1–2, columns 1–2 in Table 1.
Also, similar narrow-band systems were tested (rows 3–4, col-
umns 1–2 of Table 1). Then all four systems were exposed to
Lombard speech (columns 3–4 of Table 1). The observations
are:
� MFCC and PLP features display comparable performance

in all conditions.
� The Lombard effect leads to severe but consistent degrada-

tion: the relative drop from neutral to Lombard speech
is comparable for male and female (about 800 %) and al-
most independent of bandwidth and features. However,
the absolute errors indicate that for female speakers the
recognizer is almost useless.

� Narrowing the bandwidth to the telephone introduces a
degradation which is consistent over gender, features and
speaking style. On average there is a relative drop of
around 30 %. Note the special case of PLP features and
neutral female speech, when the relative drop is only 12 %.
To help in interpreting the above observations, two phe-

nomena should be mentioned. First, a known property of the

Lombard effect is a significant shift of the first two formant
frequencies [10], see Fig. 2. This may cause inability of the
Gaussian mixtures to match the testing data and thus failure
of the system. Avoiding this can be helped by appropriate
front-end processing (equalization of LE, robust features),
multi-style training (including Lombard speech in the train-
ing data) or back-end processing (changes of HMM structure)
[3].

Second, the formants carry important information for
monophone identification [11]. Narrowing the bandwidth to
the telephone channel causes a loss of the 4th formant, which
is close to 4 kHz (see Table 2). This can contribute to a perfor-
mance drop in narrow band systems.

©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 33

Acta Polytechnica Vol. 46  No. 6/2006

Features bandwidth Word Error Rate (%)

Neutral Lombard

Male Female Male Female

MFCC wide 2.4 4.9 18.8 43.9

PLP wide 2.5 5.0 18.6 44.9

MFCC telephone 3.0 6.2 24.2 62.2

PLP telephone 3.2 5.6 23.1 62.5

Table 1: Gender-dependent recognition results with neutral and
Lombard speech at different bandwidths

Neutral

LE

1000

1200

1400

1600

1800

2000

2200

2400

300 400 500 600 700 800 900

F1 (Hz)

F
2

(H
z
)

/u/

/i/ /i'/

/u'/

/e/

/e'/

/o/

/o'/

/a/

/a'/

CLSD - Female Vowel Formants

Fig. 2: CLSD’05 – vowel formant shifts

Vowel Neutral Lombard

F4Male (Hz) F4Female
(Hz)

F4Male (Hz) F4Female
(Hz)

/a/ 3834 3934 3713 4012

/e/ 3696 4181 3728 4196

/i/ 3661 4170 3683 4218

/o/ 3916 3880 3711 4042

/u/ 3738 3939 3661 4001

Table 2: CLSD’05 – average positions of 4th formants in neutral
and Lombard speech


4.3 Comparing training strategies
All the recognizers mentioned up to now were trained

using the progressive propagation method: initial HMMs
containing one Gaussian mixture (GM) per state were re-
estimated using the Baum-Welch procedure and then each
GM was split into two GMs and reestimated. After 5 cycles
there were 32 GMs, which were further trained. This experi-
ment compares such an approach with the one-shot strategy,
where the initial mixture was cloned 32 times in each HMM to
create 32 GMs directly. The HMMs were then reestimated.

The performance of both strategies was tested on a set
comprising 8279 digits from the SPEECON office and
CLSD’05 neutral sessions, see Fig. 3.

To complete the picture about the training process, the
evolution of insertions and deletions is shown in Fig. 4. No
word insertion penalty was used.

4.4 When to stop training

The last experiment attempts to answer the following
questions: How many training iterations should be per-
formed in order to get the best models? Are the best models for
neutral speech also the best for Lombard speech? The wide-
-band MFCC system trained with progressive propagation was
used to recognize neutral and Lombard speech in each train-
ing epoch. Fig. 5 shows the performance evolution.

HMMs tested with neutral speech appear to converge
much earlier than with Lombard speech. Excessive reesti-
mations improve the performance on Lombard speech and
do not seem to harm neutral speech. This suggests that many

iterations do not lead to loss of the essential generalization
properties of HMMs.

5 Conclusions
The aim of the paper was to study the effect of narrowing

the speech bandwidth and the effect of stressed speech on the
performance of the HMM recognizer based on MFCC and
PLP features. Experiments were carried out with the Czech
SPEECON and CLSD’05 corpora.

MFCC and PLP features displayed similar behavior in all
conditions. No fundamental differences were observed.

Narrowing the bandwidth to the telephone channel
brought performance deterioration, which was consistent
over gender, features and speaking style. A possible explana-
tion is the loss of the 4th speech formant.

A consequence of the Lombard effect was a severe drop in
performance, common to both features. Though the relative
drop was comparable for both genders and bandwidths, in
the female case it led to a failure of the recognizer. Without
appropriate modifications, an HMM recognizer is almost use-
less when exposed to Lombard speech.

A comparison of the two training strategies showed their
similar behavior and thus there is no need for further
exploration.

An experiment with a higher number of HMM training
iterations indicated that in order to achieve better recogni-
tion ccuracy on stressed speech, more training epochs are
needed. Fortunately, these iterations do not damage the nec-
essary generalization properties of HMMs.

Acknowledgments
This work was supported by GAČR 102/05/0278 “New

Trends in Research and Application of Voice Technology”,
GAČR 102/03/H085 “Biological and Speech Signals Mo-
deling”, and research activity MSM 6840770014 “Research
in the Area of Prospective Information and Navigation
Technologies”.

34 ©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/

Acta Polytechnica Vol. 46  No. 6/2006

0

10

20

30

40

50

60

70

80

0 10 20 30 40 50 60

W
E

R
(%

)

2 4 8 16 32

1 32

1

MFCC 16 kHz 1 � 2 � … � 32 mix

MFCC 16 kHz 1 � 32 mix.

Training epoch

Fig. 3: Comparing training strategies – progressive propagation
(blue) vs. one-shot (red). The top axes show the number of
mixtures in an epoch.

-2400

-2000

-1600

-1200

-800

-400

0

400

800

1200

1600

0 10 20 30 40 50 60

Training evolution (-)

#
In

s
e

rt
io

n
s

(-
)

#
D

e
le

ti
o

n
s

(-
)

1 2 4 8 16 32

1 32

MFCC 16 kHz 1 � 2 � … � 32 mix

MFCC 16 kHz 1 � 32 mix.

Fig. 4: Comparing training strategies – convergence of word in-
sertions and deletions

0

10

20

30

40

50

60

70

80

90

100

0 10 20 30 40 50 60

LE_F

LE_M

Neutral Office + CLSD

W
E

R
(%

)

2 4 8 16 321

43.9 %

18.8 %

3.8 %

Training epoch

Fig. 5: Progressive propagation training – evolution of WER on
male/female Lombard speech and both-genders neutral
speech


References
[1] Young, S. et al.: The HTK Book ver. 2.2. Entropic Ltd

1999.
[2] Hermansky, H.: Perceptual linear predictive (PLP) anal-

ysis of speech, J. Acoust. Soc. Am., Vol. 87, No. 4, April
1990, p. 1738–1752.

[3] Hansen, J. H. L.: Analysis and compensation of speech
under stress and noise for environmental robustness in
speech recognition. Speech Communications, Special Issue
on Speech under Stress, Vol. 20 (1996), No.2, p. 151–170.

[4] SPEECON, http://www.speechdat.org/speecon.
[5] Bořil, H., Pollák, P.: Design and collection of Czech

Lombard Speech Database. In: Proc. INTERSPEECH
’05. Lisboa (Portugal), 2005, p. 1577–1580.

[6] Pollák, P., Vopička, J., Sovka, P.: Czech language data-
base of car speech and environmental noise. In Proc.
EUROSPEECH ’99. Budapest (Hungary) 1999. Vol. 5,
p. 2263–6.

[7] SOX – Sound Exchange Tool manual,
http://sox.sourceforge.net.

[8] The International Telegraph and Telephone Consulta-
tive Committee (CCITT), International Telecommuni-
cation Union (ITU). CCITT G.712: General Aspects of
Digital Transmission Systems; Terminal Equipments.

Transmission Performance Characteristics of Pulse Code Mod-
ulation, 1992.

[9] FaNT – Filtering and Noise Adding Tool.
http://dnt.kr.hsnr.de/download.html.

[10] Bořil, H., Pollák, P.: Comparison of three Czech speech
databases from the standpoint of Lombard effect ap-
pearance. In: ASIDE 2005 – Applied Spoken Language In-
teraction in Distributed Environments. Aalborg (Denmark),
2005. International Speech Communication Associa-
tion. Book of abstracts [CD-ROM].

[11] Rabiner, L.R., Schafer, R. W.: Digital Processing of Speech
Signals. Prentice Hall, New Jersey, 1978.

Ing. Hynek Bořil
e-mail: borilh@gmail.com

Petr Fousek
e-mail: p.fousek@gmail.com

Department of Circuit Theory

Czech Technical University
Faculty of Electrical Engineering
Technická 2
166 27 Prague, Czech Republic

©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 35

Acta Polytechnica Vol. 46  No. 6/2006