Acta Polytechnica Vol. 50 No. 4/2010

Voice Activity Detection for Speech Enhancement Applications

E. Verteletskaya, K. Sakhnov

Abstract

This paper describes a study of noise-robust voice activity detection (VAD) that uses the periodicity of the signal, the full-band signal energy, and the ratio of high-band to low-band signal energy. Conventional VADs are sensitive to variable noise environments, especially at low SNR; they also tend to cut off unvoiced regions of speech and to produce randomly oscillating output decisions. To overcome these problems, the proposed algorithm first identifies voiced regions of speech and then differentiates unvoiced regions from silence or background noise using the energy ratio and the total signal energy. The performance of the proposed VAD algorithm is tested on real speech signals. Comparisons confirm that the proposed VAD algorithm outperforms the conventional VAD algorithms, especially in the presence of background noise.

Keywords: voice activity detection, periodicity measurement, voiced/unvoiced classification, speech analysis.

1 Introduction
An important problem in speech processing applications is the determination of active speech periods within a given audio signal. Speech can be characterized as a discontinuous signal, since information is carried only when someone is speaking. The regions where voice information exists are referred to as ‘voice-active’ segments, and the pauses between talking are called ‘voice-inactive’ or ‘silence’ segments. The decision on the class to which an audio segment belongs is based on an observation vector. This is commonly referred to as a ‘feature’ vector. One or many different features may serve as the input to a decision rule that assigns the audio segment to one of these two classes. An algorithm employed to detect the presence or absence of speech is referred to as a voice activity detector (VAD).
VAD is an important component of speech processing techniques such as speech enhancement, speech coding, and automatic speech recognition. In speech enhancement applications, for example in spectral subtractive type noise reduction algorithms, VAD is used for noise estimation, which is then used in the noise reduction process. Speech/silence detection is necessary in order to determine which frames of noisy speech contain noise only. These speech pauses, or noise-only frames, are essential because they allow the noise estimate to be updated, thereby making the estimation more accurate. In speech coding, the purpose is to encode the input audio signal in such a way that the overall transferred data rate is reduced. Since information is only carried when someone is speaking, knowing when this occurs can greatly aid in data reduction. Another example is speech recognition. In this case, a clear indication of active speech periods is critical: false detection of active speech periods directly degrades the performance of the recognition algorithm. Other examples include audio conferencing, echo cancellation, VoIP applications, cellular radio systems (GSM and CDMA based) [1] and hands-free telephony [2].
Generating an accurate indication of the presence or absence of speech is generally difficult, especially when the speech signal is corrupted by background noise or by unwanted impulse noise. The performance of a voice activity detection algorithm is a trade-off between maximizing the detection rate of active speech and minimizing the false detection rate for inactive segments. Various techniques for VAD have been proposed [3, 4, 5, 6, 7]. In the early VAD algorithms, short-time energy, zero-crossing rate and linear prediction coefficients were among the features commonly used in the detection process [3]. Cepstral coefficients [4], spectral entropy [5], a least-square periodicity measure [6], and wavelet transform coefficients [7] are examples of more recently proposed VAD features. Signal energy remains one of the basic components of the feature vector. Most of the standardized algorithms use signal energy and other parameters to make a decision. For voice activity detection, the proposed algorithm utilizes the total signal energy, which is compared with a dynamically calculated threshold. Besides the total energy measure, the algorithm uses a signal periodicity measure and a high-frequency to low-frequency signal energy ratio for more accurate decisions on voice presence.

2 Voice activity detection
principle

The basic principle of a VAD device is that it extracts measured features or quantities from the input signal and then compares these values with thresholds, usually extracted from noise-only periods. Voice activity (VAD=1) is declared if the measured values exceed the thresholds. Otherwise, no speech activity is declared and the frame is treated as noise or silence (VAD=0). A general block diagram of a VAD design is shown in Fig. 1.
VAD design involves extracting acoustic features that can appropriately indicate the probability that target speech signals exist in the observed signals. Based on these acoustic features, the decision stage determines whether the target speech signals are present in the observed signals, using a computed, well-adjusted threshold value. Most VAD algorithms output a binary decision on a frame-by-frame basis, where a frame of the input signal is a short unit of time, 5–40 ms in length. The accuracy and reliability of a VAD algorithm depend heavily on the decision thresholds. Adapting the threshold value helps to track time-varying changes in the acoustic environment, and hence provides a more reliable voice detection result.

2.1 VAD algorithms based on energy
thresholding

In energy-based VAD, the energy of the signal is compared with a threshold that depends on the noise level. Speech is detected when the energy estimate lies above the threshold.

$$
\text{frame } j \text{ is }
\begin{cases}
\text{ACTIVE}, & E_j > k \cdot E_r, \\
\text{INACTIVE}, & \text{otherwise},
\end{cases}
\qquad k > 1 \tag{1}
$$

In this rule, Er represents the energy of the noise frames, while k · Er is the threshold used in the decision-making. The scaling factor k provides a safety margin while Er, and therefore the threshold, is adapted. Energy-based VADs differ in the way the thresholds are updated. The simplest energy-based method, the Linear Energy-Based Detector (LED), was first described in [8]. The rule for updating the threshold value was specified as

$$E_{r,\mathrm{new}} = (1 - p) \cdot E_{r,\mathrm{old}} + p \cdot E_{\mathrm{silence}} \tag{2}$$

Here, Er new is the updated value of the threshold, Er old is the previous energy threshold, and Esilence is the energy of the most recent frame classified as silence (inactive). The reference Er is thus updated as a convex combination of the old threshold and the current noise update. The parameter p is a constant (0 < p < 1).
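
As an illustration of rules (1) and (2), the following minimal Python sketch applies the LED decision to a list of frame energies. The function name `led_vad`, the default values k = 2 and p = 0.2, and the choice to initialise Er from the first frame are assumptions made for this example; they are not taken from [8].

```python
def led_vad(frame_energies, k=2.0, p=0.2, initial_noise_energy=None):
    """Linear energy-based detector sketch following rules (1) and (2).

    frame_energies: per-frame energies E_j.
    k: threshold scaling factor (k > 1).
    p: update constant (0 < p < 1).
    Returns a list of 0/1 decisions, one per frame.
    """
    if not (k > 1 and 0 < p < 1):
        raise ValueError("require k > 1 and 0 < p < 1")
    if not frame_energies:
        return []

    # Assumption: without a supplied reference, the first frame is noise-only.
    e_r = frame_energies[0] if initial_noise_energy is None else initial_noise_energy

    decisions = []
    for e_j in frame_energies:
        if e_j > k * e_r:
            decisions.append(1)               # rule (1): frame is ACTIVE
        else:
            decisions.append(0)               # rule (1): frame is INACTIVE
            e_r = (1 - p) * e_r + p * e_j     # rule (2): update on silence frames only
    return decisions
```

In this sketch the reference is updated only on frames classified as inactive, so a burst of speech does not raise the noise estimate.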

2.2 Energy of a frame

The most common way to calculate the full-band en-
ergy of a speech signal is a short-time energy calcula-
tion. If x(i) is the i-th sample of speech, N is the num-
ber of samples in a frame, then the short-time energy
of the j-th frame of a speech signal can be represented
as

$$E_j = \frac{1}{N} \cdot \sum_{i=(j-1) \cdot N + 1}^{j \cdot N} x^2(i). \tag{3}$$

Another common way to calculate the energy of a speech signal is the root mean square energy (RMSE), which is the square root of the average of the squared amplitudes of the signal samples, i.e. the square root of (3).

$$E_j = \left[ \frac{1}{N} \cdot \sum_{i=(j-1) \cdot N + 1}^{j \cdot N} x^2(i) \right]^{\frac{1}{2}} \tag{4}$$

Fig. 2 shows that the power estimate of a speech signal exhibits distinct peaks and valleys. While the peaks correspond to speech activity, the valleys can be used to obtain a noise power estimate. RMSE is therefore more appropriate for thresholding, because it displays the valleys in greater detail.
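
Equations (3) and (4) can be sketched in Python as follows. The function name `frame_energies`, the use of non-overlapping frames, and the dropping of an incomplete trailing frame are assumptions of this example.

```python
import math


def frame_energies(x, frame_len):
    """Return (short_time, rms) energy lists for consecutive frames of x.

    short_time[j] follows (3): the mean of the squared samples of frame j.
    rms[j] follows (4): the square root of short_time[j].
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")

    short_time, rms = [], []
    # Step through complete, non-overlapping frames; a partial last frame is dropped.
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        e = sum(s * s for s in frame) / frame_len
        short_time.append(e)
        rms.append(math.sqrt(e))
    return short_time, rms
```

For instance, at an assumed sampling rate of 8 kHz, a 20 ms frame corresponds to `frame_len = 160` samples.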

Fig. 1: Block diagram of a basic VAD design

Fig. 2: Short-time vs. root mean square energy


Fig. 3: Logic flowchart of the proposed VAD

3 The proposed voice activity
detector

For voice/silence detection, the proposed algorithm
uses a periodicity measure of the signal, as well as
the high-frequency versus low-frequency signal energy
ratio and full-band energy computation. A simplified
flowchart of the whole algorithm is given in Fig. 3.

3.1 Feature extraction

Signal periodicity C is determined by estimating the
pitch period of the signal. To reduce the compu-
tational complexity, the input signal is first center
clipped [9], then the normalized autocorrelation func-
tion R(τ) given by (5) is used for pitch estimation.

$$R(\tau) = \frac{\sum_{n=0}^{N-m-1} x(n) \cdot x(n+\tau)}{\sqrt{\sum_{n=0}^{N-m-1} x^2(n+\tau)}}, \qquad T_{\min} \le \tau \le T_{\max} \tag{5}$$

where x(n), n = 0, 1, ..., N, is the input signal frame. The autocorrelation function is calculated for values of lag τ from Tmin to Tmax. The constants Tmin and Tmax are the lower and upper limits of the pitch period, respectively. The pitch period of a voiced frame is equal to the value of τ that maximizes the normalized autocorrelation function. The periodicity C of the frame is given by the maximum value of R(τ).
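
The periodicity measure can be sketched in Python as shown below. Several details are assumptions of this example rather than statements from the paper: the centre clipper uses a fixed level of 30 % of the frame peak (the paper refers to [9] for the clipping method), the upper summation limit in (5) is taken as N − τ − 1 so that all indices stay inside the frame, and the names `center_clip` and `periodicity` are hypothetical.

```python
import math


def center_clip(frame, ratio=0.3):
    """Centre-clip a frame: zero samples below the clip level, shift the rest."""
    peak = max((abs(s) for s in frame), default=0.0)
    level = ratio * peak  # assumed clipping level: fixed fraction of the frame peak
    clipped = []
    for s in frame:
        if s > level:
            clipped.append(s - level)
        elif s < -level:
            clipped.append(s + level)
        else:
            clipped.append(0.0)
    return clipped


def periodicity(frame, t_min, t_max, clip_ratio=0.3):
    """Return (C, pitch_lag): the maximum of R(tau) from (5) and the lag reaching it.

    pitch_lag is None when no lag gives a positive correlation.
    """
    x = center_clip(frame, clip_ratio)
    n_samples = len(x)
    best_r, best_tau = 0.0, None
    for tau in range(t_min, min(t_max, n_samples - 1) + 1):
        span = range(n_samples - tau)  # n = 0 .. N - tau - 1 (assumed limit)
        num = sum(x[n] * x[n + tau] for n in span)
        den = math.sqrt(sum(x[n + tau] ** 2 for n in span))
        r = num / den if den > 0 else 0.0
        if r > best_r:
            best_r, best_tau = r, tau
    return best_r, best_tau
```

Note that (5) normalizes only by the energy of the shifted segment, so R(τ) in this sketch is not bounded by one; any threshold applied to C has to be chosen with that scaling in mind.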
The total voice band energy Ef is computed for the voice band frequency range from 0 Hz to 4 kHz, using (4). The computation of the threshold for the total voice band energy is based on the energy levels Emin and Emax, obtained from the sequence of incoming frames. These values are stored in memory and the threshold is calculated as

$$\mathrm{Threshold} = (1 - \lambda) \cdot E_{\max} + \lambda \cdot E_{\min} \tag{6}$$

$$\lambda = \frac{E_{\max} - E_{\min}}{E_{\max}}. \tag{7}$$

Here, λ is a scaling factor controlling the estimation process. The voice detector performs reliably when λ is in the range [0.950, 0.999]. Because a single fixed value of λ does not suit all types of signals, it must be set appropriately for each one. Computing λ from (7) adapts it to the observed energy range, which makes the threshold robust to a varying background environment.
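
A minimal Python sketch of (6) and (7) is given below. The function name `energy_threshold` is hypothetical, and the way Emin and Emax are tracked over the incoming frames is not covered by the sketch; they are simply passed in as arguments.

```python
def energy_threshold(e_min, e_max):
    """Return (threshold, lam) computed from (6) and (7).

    e_min, e_max: minimum and maximum frame energies observed so far.
    """
    if e_max <= 0 or e_min < 0 or e_min > e_max:
        raise ValueError("require 0 <= e_min <= e_max and e_max > 0")
    lam = (e_max - e_min) / e_max                  # scaling factor, equation (7)
    threshold = (1 - lam) * e_max + lam * e_min    # threshold, equation (6)
    return threshold, lam
```

Substituting (7) into (6) gives Threshold = Emin · (1 + λ), so the threshold always lies between Emin and 2 · Emin. The range of λ from 0.950 to 0.999 quoted above corresponds to a ratio Emax/Emin between 20 and 1000.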

Fig. 4: Threshold computation for total band signal energy

Energy ratio Er is computed as the ratio of the energy above 2 kHz to the energy below 2 kHz in the input voice band signal. To obtain the high-frequency signal, the input signal is passed through a high-pass filter with a cut-off frequency of 2 kHz. The high-frequency to low-frequency energy ratio Er is calculated as

$$E_r = \frac{E_h}{E_f - E_h} \tag{8}$$

where Ef and Eh are the full-band and high-band signal energy, respectively, calculated by (4) and expressed in dB.
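
The band split behind (8) can be illustrated with the sketch below. It deliberately departs from the text in two ways, both of which are assumptions of the example: instead of a high-pass filter it sums the power of the DFT bins above the cut-off frequency, and it works with linear energies rather than values in dB. The function name `band_energy_ratio` and the 8 kHz default sampling rate are also assumed.

```python
import cmath
import math


def band_energy_ratio(frame, sample_rate=8000, cutoff_hz=2000.0):
    """Return E_h / (E_f - E_h) for one frame, using a direct DFT band split."""
    n = len(frame)
    if n == 0:
        raise ValueError("frame must not be empty")

    e_full = 0.0
    e_high = 0.0
    for k in range(n // 2 + 1):  # one-sided spectrum, bins 0 .. N/2
        coeff = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        power = abs(coeff) ** 2
        # Every bin except DC and (for even N) Nyquist has a mirrored twin.
        if k != 0 and not (n % 2 == 0 and k == n // 2):
            power *= 2
        e_full += power
        if k * sample_rate / n > cutoff_hz:
            e_high += power

    e_low = e_full - e_high  # denominator of (8)
    return e_high / e_low if e_low > 0 else float("inf")
```

The direct DFT costs on the order of N² operations per frame, which is acceptable for a short illustration; a filter-based implementation as described in the text would be the natural choice in practice.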


Fig. 5: Detailed flowchart of the proposed VAD

3.2 Thresholding and the hang-over
algorithm

After feature extraction, the parameters are compared with several thresholds to generate an initial VAD decision (IVAD) (see Fig. 5). Once the threshold comparisons have determined the value of IVAD, a final output decision is made according to the lower part of the algorithm flowchart. The output decision FVAD is made anew for each value of IVAD produced by the threshold comparison. The final output decision uses a smoothing hang-over algorithm to ensure that detection of either the presence or the absence of speech lasts for at least a minimum period of time and does not oscillate on and off. Upon startup of the VAD, the hang-over flag HVAD and the final VAD flag FVAD are initialized to zero. The output decision block checks whether the received IVAD value is one. If so, speech has been detected, and the output decision block sets HVAD and FVAD to one. If the value of IVAD is zero, speech has not been detected. In that case, the output decision block checks whether HVAD was set to one in the previous frame. If so, it checks whether the smoothed energy Efs exceeds Emin by more than 8 dB. If it does, hang-over is indicated, and the output decision block keeps FVAD set to one even though speech has not been detected.
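
The output decision logic described above can be sketched as a small state machine. The function name `hangover_smoothing` is hypothetical, Emin is held constant over the sequence for simplicity, and the branch that clears both flags when the 8 dB condition fails is an assumption, since the text only describes the case in which hang-over is indicated.

```python
def hangover_smoothing(initial_decisions, smoothed_energy_db, e_min_db, margin_db=8.0):
    """Turn initial decisions (IVAD) into final decisions (FVAD) with hang-over.

    initial_decisions: per-frame IVAD values (0 or 1).
    smoothed_energy_db: per-frame smoothed full-band energy Efs in dB.
    e_min_db: minimum energy level Emin in dB (assumed constant here).
    """
    h_vad = 0  # hang-over flag, initialized to zero at startup
    f_vad = 0  # final VAD flag, initialized to zero at startup
    final = []
    for i_vad, e_fs in zip(initial_decisions, smoothed_energy_db):
        if i_vad == 1:
            # Speech detected: set both flags.
            h_vad = 1
            f_vad = 1
        elif h_vad == 1 and (e_fs - e_min_db) > margin_db:
            # No speech detected, but energy is still well above the floor: hold over.
            f_vad = 1
        else:
            # Assumption: otherwise both flags are cleared.
            h_vad = 0
            f_vad = 0
        final.append(f_vad)
    return final
```

With this logic an isolated zero in IVAD does not switch the output off as long as the smoothed energy stays more than 8 dB above Emin.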

4 Experimental results

The MATLAB environment was used to test the algorithms on thirty speech signals from the Czech Speech database. The test templates varied in loudness, speech continuity, background noise and accent. Both male and female speech in the Czech language were used for the experiments. Fig. 6 shows the voice/silence classification results of the proposed VAD algorithm. The performance of the algorithm is compared with the performance of the LED algorithm [8]. The comparison is performed on real clean speech and on speech degraded by additive noise. The figures show that the proposed VAD produces fewer misdetections than the LED algorithm. In contrast to the LED algorithm, the proposed VAD correctly detects unvoiced speech regions. The proposed algorithm is able to detect the beginnings and ends of active speech segments accurately, even on noisy speech signals.

Fig. 6: Performance comparison of VAD algorithms:
(a) LED algorithm clean speech, (b) proposed algo-
rithm clean speech, (c) LED algorithm noisy speech
(SNR=5 dB), (d) proposed algorithm noisy speech
(SNR=5 dB)

5 Conclusion
This paper has presented voice activity detection algorithms employed to detect the presence or absence of speech components in an audio signal. An alternative VAD based on periodicity detection and the high-frequency to low-frequency signal energy ratio has been presented. The aim of the paper was to show the principle of the proposed VAD algorithm, and to compare it with the known linear energy-based detector (LED). The results consistently show the superiority of the proposed VAD scheme over the LED algorithm. The algorithm has low computational complexity, and can be easily integrated into speech coders and other speech enhancement systems.

Acknowledgement

The research described in this paper was supervised by Prof. Ing. B. Simak, CSc., FEL CTU in Prague and was supported by Czech Technical University grant SGS No. OHK3-108/10 and by the Ministry of Education, Youth and Sports of the Czech Republic research program MSM 6840770014.

References

[1] ETSI TS 126 094 V3.0.0 (2000-01), 3G TS 26.094 version 3.0.0 Release 1999, Universal Mobile Telecommunications System (UMTS); Mandatory Speech Codec speech processing functions AMR speech codec; Voice Activity Detector (VAD), 2000.

[2] Benyassine, A., Shlomot, E., Su, H.-Y.: ITU-T recommendation G.729 annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data application, IEEE Commun. Mag., 1997, Vol. 35, p. 64–73.

[3] Atal, B. S., Rabiner, L. R.: A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition, IEEE Trans. Acoustics, Speech, Signal Processing, Vol. 24, p. 201–212, June 1976.

[4] Haigh, J. A., Mason, J. S.: Robust voice activity detection using cepstral features, in Proc. of IEEE Region 10 Annual Conf. Speech and Image Technologies for Computing and Telecommunications, (Beijing), p. 321–324, Oct. 1993.

[5] McClellan, S. A., Gibson, J. D.: Spectral entropy: An alternative indicator for rate allocation, in IEEE Int. Conf. on Acoustics, Speech, Signal Processing, (Adelaide, Australia), p. 201–204, Apr. 1994.

[6] Tucker, R.: Voice activity detection using a periodicity measure, IEE Proc.–I, Vol. 139, p. 377–380, Aug. 1992.

[7] Stegmann, J., Schroder, G.: Robust voice-activity detection based on the wavelet transform, in Proc. IEEE Workshop on Speech Coding for Telecommunications, (Pocono Manor, PN), p. 99–100, Sept. 1997.

[8] Pollak, P., Sovka, P., Uhlir, J.: Noise System for a Car, Proc. of the Third European Conference on Speech, Communication and Technology – EUROSPEECH’93, (Berlin, Germany), p. 1073–1076, Sept. 1993.

[9] Verteletskaya, E., Šimák, B.: Performance Evaluation of Pitch Detection Algorithms. Access server [online]. 2009, Vol. 7, No. 200906, p. 0001. ISSN 1214-9675.

About the authors

Ekaterina VERTELETSKAYA was born in Uzbekistan. She was awarded an MSc degree in Telecommunication and Radio Engineering from the Czech Technical University, Prague in 2008. She is currently a PhD student at the Department of Telecommunication Engineering of CTU in Prague. Her current activities are in the area of digital signal processing, focused on speech coding algorithms for mobile communications.

Kirill SAKHNOV was born in Uzbekistan. He was awarded an MSc degree from the Czech Technical University in Prague in 2008. He is currently a PhD student at the Department of Telecommunication Engineering of CTU in Prague. His current activities are in the area of adaptive digital signal processing, focused on problems of acoustical and network echo cancellation in telecommunication devices.

Ekaterina Verteletskaya
Kirill Sakhnov
E-mail: verteeka@fel.cvut.cz,
sakhnkir@.fel.cvut.cz
Czech Technical University in Prague
Technická 2, 166 27 Praha, Czech Republic
