Acta Polytechnica Vol. 51 No. 5/2011

PhpHMM Tool for Generating Speech Recogniser Source Codes
Using Web Technologies

R. Krejčí

Abstract

This paper deals with the “phpHMM” software tool, which facilitates the development and optimisation of speech
recognition algorithms. This tool is being developed in the Speech Processing Group at the Department of Circuit
Theory, CTU in Prague, and it is used to generate the source code of a speech recogniser by means of the PHP scripting
language and the MySQL database. The input of the system is a model of speech in a standard HTK format and a list
of words to be recognised. The output consists of the source codes and data structures in the C programming language,
which are then compiled into an executable program. This tool is operated via a web interface.

Keywords: speech recognition, DSP, PHP, MySQL, OMAP, TMS320C674x, ARM.

1 Introduction
An automatic speech recogniser is a computer program consisting of
interconnected algorithms whose input is human speech converted from a
microphone into digital form, and whose output is a text transcription
of this speech. The development of a speech recogniser consists of two
main phases. In the first phase, so-called “training” is carried out,
resulting in the creation and filling of data structures that describe
a speech model. In the second phase, decoding algorithms are developed
that perform the speech recognition itself, using the speech models
obtained in the training phase.
Since huge amounts of data are needed to create a speech recogniser,
and huge amounts of data are processed, many activities are performed
automatically using scripts. This facilitates the work and eliminates
the need for repeated manual data processing. This is usually done
using the HTK Toolkit [1], with which a complete speech recogniser for
the PC platform can be created.
However, when creating a speech recogniser to be run on other hardware
platforms, e.g. digital signal processors, no such public tool is
available, and thus proprietary software has to be written. In this
case, the speech models trained using the HTK Toolkit can be utilised,
but they have to be processed with totally different algorithms and
optimisation methods than those used on the PC platform. To test the
optimisation methods, it is often necessary to change the data
structures and convert their parameters. For this purpose, the Speech
Processing Group at the Department of Circuit Theory, CTU in Prague has
been developing the “phpHMM” tool, which facilitates and integrates the
development of speech recognition algorithms for alternative hardware
platforms.

2 PhpHMM tool

The phpHMM tool is a set of scripts in the PHP scripting language [2]
using the MySQL database server [3]. This technology has become one of
the standards for generating web pages, but it is also useful for
generating other texts, such as source code in any programming
language. The basis of the phpHMM tool is a class of functions that can
be easily included into a higher-level system written in PHP. The
scripts are run on a server (either on a local computer configured as a
server or on a publicly accessible web server), and their output is
visible via a user-friendly graphical web interface. The source code of
the speech recogniser is generated in a sequence of individual steps,
which are discussed in the following text.

2.1 Speech model

The result of the training phase of the speech recogniser using the HTK
Toolkit is a text file in a defined format that describes a general
model of speech, created on the basis of the utterances of a training
database. Speech models may vary in a large number of ways, e.g. in the
type of parametrisation (extraction of speech features), the number of
HMM (Hidden Markov Model) states, streams and mixtures, the number of
coefficients in each mixture, etc. During recognition, these parameters
enter the output probability density function b(o) [4]:

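$$
b_j(\bar{o}_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}(\bar{o}_{st}; \bar{\mu}_{jsm}, \Sigma_{jsm}) \right]^{\gamma_s};
$$

$$
\mathcal{N}(\bar{o}_{st}; \bar{\mu}_{jsm}, \Sigma_{jsm}) = \frac{1}{\sqrt{(2\pi)^{n_s} |\Sigma_{jsm}|}}\, e^{-\frac{1}{2} (\bar{o}_{st} - \bar{\mu}_{jsm})^{T} \Sigma_{jsm}^{-1} (\bar{o}_{st} - \bar{\mu}_{jsm})}, \tag{1}
$$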


~h "a"
<BEGINHMM>
<NUMSTATES> 5
<STATE> 2
<MEAN> 39
1.437809e+00 -6.805577e+00 -8.517246e+00 -9.976683e+00 ...
<VARIANCE> 39
2.393653e+01 4.407170e+01 3.864353e+01 4.710320e+01 ...
<GCONST> 1.341746e+02
<STATE> 3
<MEAN> 39
2.916575e+00 -8.322930e+00 -1.077090e+01 -9.984103e+00 ...
<VARIANCE> 39
1.245955e+01 3.486024e+01 3.388573e+01 4.059823e+01 ...
<GCONST> 1.130805e+02
<STATE> 4
<MEAN> 39
4.856239e-01 -1.422903e+00 -6.716645e+00 -3.694754e+00 ...
<VARIANCE> 39
1.848022e+01 2.745304e+01 3.125877e+01 4.468990e+01 ...
<GCONST> 1.222291e+02
<TRANSP> 5
0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
0.000000e+00 6.224011e-01 3.775989e-01 0.000000e+00 0.000000e+00
0.000000e+00 0.000000e+00 7.666833e-01 2.333166e-01 0.000000e+00
0.000000e+00 0.000000e+00 0.000000e+00 5.902151e-01 4.097848e-01
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
<ENDHMM>

Fig. 1: Example of simple hidden Markov model of “a” phoneme in text form

where S is the number of streams, γ_s is the stream weight, M_s is the
number of mixtures in stream s, c_jsm is the weight of the m-th
mixture, n_s is the dimension of the observation vector in stream s,
and N(ō; μ̄, Σ) is a multivariate Gaussian distribution with a vector of
mean values μ̄ and a covariance matrix Σ. This function represents the
acoustic similarity of the input signal to the reference models of
speech units (phonemes).
All these parameters enter the phpHMM tool when the text file with the
speech model is uploaded.
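
As a minimal sketch of how one mixture component of equation (1) might be evaluated from the parameters shown in Fig. 1, the C function below assumes a single stream, a diagonal covariance matrix stored as a variance vector, and the HTK convention that <GCONST> holds ln((2π)^n |Σ|). The function name and argument layout are illustrative and are not taken from phpHMM.

```c
#include <stddef.h>

/* Log of N(o; mean, diag(var)) for one mixture component.
 * Assumes gconst = ln((2*pi)^n * prod(var)), as stored in <GCONST>. */
double log_gauss_diag(const double *o, const double *mean,
                      const double *var, double gconst, size_t n)
{
    double dist = 0.0;              /* sum of squared, variance-scaled differences */
    for (size_t i = 0; i < n; i++) {
        double d = o[i] - mean[i];
        dist += d * d / var[i];
    }
    return -0.5 * (gconst + dist);
}
```

Working in the log domain turns the product over streams and the exponent γ_s into sums and multiplications, which is why the constant part of the density can be stored once per state.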

2.2 Parsing and storing into the database

After a text file with hidden Markov models is uploaded, it is parsed
and converted from text form into data structures in the memory of the
server. At the same time, some basic integrity checks of the file are
carried out. Database tables are then created in the MySQL database and
populated with the relevant data from the uploaded file. A server-based
(MySQL) database is convenient, inter alia, because it enables easy
selection of data by means of (even complicated) SQL queries. Selecting
and processing data using a server-based database is significantly
faster and more convenient than searching in a text file. For our
current experiments, it is advantageous to store the data in a “MEMORY”
table type, as this storage allows faster access than the commonly used
“MyISAM” type. There are also many techniques for optimising the
performance of the database, such as the use of keys and indexes [3].
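
The parsing step with a basic integrity check can be sketched as follows. This is only an illustration written in C; the actual phpHMM parser is a PHP script whose code is not shown here, and the function name `parse_vector` is hypothetical. The sketch reads one vector such as `<MEAN> 39` followed by its values and rejects input that declares more values than it contains.

```c
#include <stdlib.h>
#include <string.h>

/* Parses "<TAG> n" followed by n numbers from text into out[].
 * Returns the number of values read, or -1 on a format error. */
int parse_vector(const char *text, const char *tag, double *out, int max)
{
    const char *p = strstr(text, tag);
    if (p == NULL)
        return -1;
    p += strlen(tag);

    char *end;
    long n = strtol(p, &end, 10);       /* declared vector length */
    if (end == p || n < 0 || n > max)
        return -1;
    p = end;

    for (long i = 0; i < n; i++) {
        out[i] = strtod(p, &end);
        if (end == p)                   /* fewer values than declared */
            return -1;
        p = end;
    }
    return (int)n;
}
```

Once a vector has passed this check, its values can be written to a database table row by row.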

2.3 Glossary of words

Our goal is to create a speech recogniser that will be able to handle
continuous speech in real time, but currently we are dealing with
recognition of individual words and short phrases. In this step, we
simply specify all the words which the recogniser will be able to
recognise, either by typing them into a text box or by uploading a text
file. The more words are to be recognised, the greater the demands on
the recogniser hardware, and hence on optimising the algorithms.

2.4 Phonetic transcription

In all languages, there are differences between the written language
and the spoken form of speech. This step automatically creates a
phonetic transcription of the words entered in the previous step. E.g.
the Czech word “zpěv” is transcribed as “spjef”.


Fig. 2: Graphically expressed database structure for
speech model

2.5 Selection of hardware platform

We work on optimising speech recogniser algorithms for multi-core
digital signal processors of the TMS320C6000 family from Texas
Instruments. The intention of phpHMM is to create a general tool for a
large number of hardware and software platforms. Currently, this step
offers a choice between a “general” platform and the “OMAP-L137”
platform. OMAP-L137 is a dual-core heterogeneous processor from Texas
Instruments with both a 32-bit ARM9 core and a TMS320C674x DSP core.

2.6 Selection of optimisation methods

If a speech recogniser is to be run on a system with limited hardware
resources, it is necessary to optimise computationally intensive
algorithms. In this step, a combination of optimisation methods can be
chosen for testing. The optimisation is done at all levels of the
design of the speech recogniser, from the layout of the data structures
up to modifying the algorithms so that they run faster on the chosen
hardware platform.

2.7 Creating word models

Depending on the optimisation method, models of the words are created
as sequences of states with which the Viterbi algorithm [1] works. For
each word, the phoneme models are chained into a sequence of states.
E.g. the Czech word “zpěv” [spjef] creates the following sequence of
states:

Fig. 3: Sequence of states of the word “zpěv” [spjef]
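
The chaining of phoneme models into a word model can be sketched in C as follows. The sketch assumes that every phoneme uses a 5-state model like the one in Fig. 1, that only its three emitting states (2 to 4) are kept, and that each phoneme has a one-character label; the structure and function names are hypothetical and the sequence actually produced by phpHMM may differ.

```c
#include <stddef.h>

#define EMITTING_STATES 3   /* states 2..4 of a 5-state HTK model */

/* One state of a word model: which phoneme and which of its states. */
struct word_state {
    char phoneme;   /* one-character phoneme label (simplification) */
    int  state;     /* HTK state index, 2..4 */
};

/* Chains the emitting states of each phoneme in `phones` into `seq`.
 * Returns the number of states written, or 0 if `seq` is too small. */
size_t build_word_model(const char *phones, struct word_state *seq,
                        size_t capacity)
{
    size_t count = 0;
    for (const char *p = phones; *p != '\0'; p++) {
        for (int s = 0; s < EMITTING_STATES; s++) {
            if (count == capacity)
                return 0;
            seq[count].phoneme = *p;
            seq[count].state = s + 2;
            count++;
        }
    }
    return count;
}
```

Under these assumptions, the transcription “spjef” with five phonemes yields a chain of 15 states.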

2.8 Assembling the source code and data structures

The main task of the phpHMM tool is to set up the source code and data
structures on the basis of the input data, the specification of which
has just been described. Depending on the type of parametrisation, the
structure of the models and the required optimisations, the system
generates the sources of the speech recogniser with the relevant data.
The source code must be generated anew for each selection of hardware
platform and optimisation before it is compiled. The source code can be
set up very effectively using PHP. Code in the PHP scripting language
can be inserted directly into the source code in C. As described in
[5], PHP can be used as a preprocessor with many more possibilities
than the standard C preprocessor. For example, it can create loops or
compute trigonometric functions. A Hamming window lookup table can be
generated as follows:
<?php $N = 512; // window length in samples ?>
const float hamming_ar[<?php echo $N; ?>] = {
<?php
// Emit one precomputed Hamming coefficient per sample.
for ($n = 0; $n < $N; $n++) {
    $w = 0.54 - 0.46 * cos(2 * M_PI * $n / $N);
    echo "$w,";
}
?>
};

The generated code is subsequently compiled by
the appropriate compiler. However, this is already
beyond the function of the current phpHMM tool,
although in future it may be possible, after generat-
ing the source code, just to run the compiler and get
the program in an executable format.

3 Results
Although the phpHMM tool is used to generate the
entire speech recogniser, in the following text we dis-
cuss some examples of using the generated code for
faster calculations.


3.1 MFCC optimisations

One of the optimisation methods calculates results in advance, if all
operands are known at compile time. This avoids computing the same
results repeatedly during recognition, and it speeds up the
calculation.
This so-called “lookup table” method was used to generate the Hamming
window coefficients, which are applied at the beginning of the signal
parametrisation by mel-frequency cepstral coefficients (MFCC) [1]. The
parametrisation method never changes during the recognition process,
and therefore the Hamming window coefficients do not change either.
Their calculation then reduces to reading a coefficient from a
one-dimensional array.
A part of the parametrisation block, where speech features are
extracted from the input signal, is the calculation of the Discrete
Cosine Transform (DCT) [6]. Using the standard method for calculating
the DCT, which evaluates trigonometric functions at run time, a
parametrisation calculation time of approximately 55 ms per segment was
achieved on the tested digital signal processor. The numbers of input
and output DCT coefficients are constants known at compile time and do
not change during recognition, so the concrete cosine values can be
calculated in advance and stored in a data structure. When the DCT
algorithm runs in real time, the cosine is (paradoxically) not
calculated; instead, the pre-calculated cosine value for the
appropriate arguments is used. The calculation of each coefficient is
thus reduced to reading its value from the pre-calculated table. With
this optimisation, a calculation time lower than 6 ms was achieved,
i.e. approximately a ninefold acceleration.
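
A minimal sketch of the table-based DCT follows. It assumes an unnormalised DCT-II and deliberately tiny, illustrative sizes (4 inputs, 3 outputs); the sizes, names and table layout are not those of the generated recogniser. In a generated source file the cosine literals would be emitted by the code generator rather than typed by hand.

```c
#define N_IN  4   /* number of DCT inputs (illustrative) */
#define N_OUT 3   /* number of DCT outputs (illustrative) */

/* dct_cos[k][n] = cos(pi * k * (n + 0.5) / N_IN), computed in advance. */
static const float dct_cos[N_OUT][N_IN] = {
    { 1.0000000f,  1.0000000f,  1.0000000f,  1.0000000f },
    { 0.9238795f,  0.3826834f, -0.3826834f, -0.9238795f },
    { 0.7071068f, -0.7071068f, -0.7071068f,  0.7071068f },
};

/* Unnormalised DCT-II: no cosine is evaluated at run time,
 * each term is one table read and one multiply-accumulate. */
void dct_lookup(const float in[N_IN], float out[N_OUT])
{
    for (int k = 0; k < N_OUT; k++) {
        float acc = 0.0f;
        for (int n = 0; n < N_IN; n++)
            acc += in[n] * dct_cos[k][n];
        out[k] = acc;
    }
}
```

The table costs N_IN × N_OUT constants of memory, which is the price paid for removing the trigonometric calls from the real-time path.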

Fig. 4: Computation time vs. optimisation methods for
MFCC parametrisation

3.2 Output probability density optimisations

Some of our proposed optimisation methods use transformed parameters,
which are obtained by converting the original model parameters. E.g. a
modified algorithm for calculating the output probability density
function b(o), based on the A = A + B × C dot-product operation
(“Multiply and Accumulate”, “MAC”), requires recalculation of the
original coefficients by a simple transformation [4]. This
transformation is performed while generating the source code, i.e. at
compile time. The calculation without optimisations on the dual-core
TMS320C674x DSP architecture lasted 1477 ms/segment. After applying
appropriate optimisations by recomputing the data structures, the best
time of 52 ms/segment was achieved when using the modified MAC
algorithm [4].
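
The exact transformation is not given in this text, so the following C sketch shows only one plausible variant, under the assumption of a single diagonal-covariance Gaussian evaluated in the log domain. Each variance σ² is replaced in advance by the factor −0.5/σ², and the constant term is replaced by k = −0.5 · GCONST, so that the run-time loop contains no division and each coefficient contributes one multiply-accumulate. All names are hypothetical.

```c
#include <stddef.h>

/* Done once, when the model data are generated:
 * scale[i] = -0.5 / var[i], so that no division remains at run time. */
void precompute_scale(const float *var, float *scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        scale[i] = -0.5f / var[i];
}

/* Run-time part: log b(o) = k + sum_i d_i^2 * scale[i],
 * where d_i = o[i] - mean[i] and k = -0.5 * gconst. */
float log_gauss_mac(const float *o, const float *mean,
                    const float *scale, float k, size_t n)
{
    float acc = k;
    for (size_t i = 0; i < n; i++) {
        float d = o[i] - mean[i];
        acc += (d * d) * scale[i];      /* A = A + B * C */
    }
    return acc;
}
```

In a generated recogniser, `precompute_scale` would correspond to work done by the generator, and only the transformed `scale` values would appear in the emitted data structures.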

Fig. 5: Computation time vs. optimisation methods of
b(o) function

Fig. 6: Computation time of maximum of neighboring
values

3.3 Viterbi algorithm optimisation

The Viterbi algorithm, which evaluates the most probable path through
the model, contains a part which compares adjacent values in the vector
of results of previous operations. Various methods have been tried, and
the “Loop Unroll” method has proved to be the fastest in this case. The
code that was originally executed repeatedly in a loop is broken down
into multiple individual operations without the loop. This not only
reduces the loop overhead, but also provides an opportunity for greater
use of the hardware architecture. In our case, instead of 32 passes
through the loop, a sequence of 32 individual operations with directly
addressed operands was created. This loop unrolling made it possible to
use the “MAX2” instruction of the TMS320C6000 architecture, which is a
SIMD (single instruction, multiple data) instruction that
simultaneously compares two pairs of 16-bit operands and returns two
results. Fig. 6 shows the effectiveness of this optimisation for
different numbers of test vectors, compared with the best time achieved
without using the loop unroll method.
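
The difference between the rolled and the unrolled form can be sketched in portable C as follows. The sketch assumes that the compared part computes the maximum of each pair of neighbouring 16-bit values, and it uses 8 values instead of 32 to stay short; it does not use the MAX2 instruction itself, and the function names are hypothetical.

```c
#include <stdint.h>

#define N_STATES 8   /* illustrative; the text above uses 32 */

static int16_t max16(int16_t a, int16_t b)
{
    return a > b ? a : b;
}

/* Rolled reference: out[i] = max(v[i], v[i + 1]) for i = 0..N_STATES-2. */
void max_neighbours_loop(const int16_t *v, int16_t *out)
{
    for (int i = 0; i < N_STATES - 1; i++)
        out[i] = max16(v[i], v[i + 1]);
}

/* Fully unrolled: no loop counter or branch, and every operand is
 * addressed directly, so a compiler is free to pair the comparisons. */
void max_neighbours_unrolled(const int16_t *v, int16_t *out)
{
    out[0] = max16(v[0], v[1]);
    out[1] = max16(v[1], v[2]);
    out[2] = max16(v[2], v[3]);
    out[3] = max16(v[3], v[4]);
    out[4] = max16(v[4], v[5]);
    out[5] = max16(v[5], v[6]);
    out[6] = max16(v[6], v[7]);
}
```

Writing such an unrolled sequence by hand is tedious and error-prone, which is one reason to let a code generator emit it for the chosen number of states.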

4 Conclusion
The phpHMM software tool for developing speech recognition algorithms
focuses on applications for digital signal processors. The advantages
of this tool include easy comparison of optimisation methods, easily
changeable parameters, and a user-friendly graphical environment. It is
used for generating source code and data structures tailored to the
application.

Acknowledgement

This research was supported by grants GAČR
102/08/0707 “Speech Recognition under Real-World
Conditions”, GAČR 102/08/H008 “Analysis and
modelling of biomedical and speech signals”, and
by research activity MSM 6840770014 “Perspective
Informative and Communications Technicalities Re-
search”.

References

[1] Young, S., et al.: The HTK Book. Cambridge
University Engineering Department, 2006.
[online] http://htk.eng.cam.ac.uk/ftp/software/htkbook.pdf.zip.

[2] PHP [online]. 2011 [cit. 2011–03–12].
http://www.php.net/.

[3] MySQL. The world’s most popular open source
database [online]. 2011 [cit. 2011–03–12].
http://www.mysql.com/.

[4] Krejčí, R.: Optimization of Computationally
Intensive Part of Speech Recognizer. In 19th
Czech-German Workshop on Speech Processing
[CD-ROM]. Praha : Institute of Photonics
and Electronics AS CR, 2009, p. 22–26. ISBN
978-80-86269-18-4.

[5] Krejčí, R.: Use PHP preprocessor for generating
source codes in C programming language. In
Králíky 2010. Brno : Brno University of Technology,
2010, p. 84–87. ISBN 978-80-214-4139-2.

[6] Uhlíř, J., et al.: Technologie hlasových komunikací.
Praha : Nakladatelství ČVUT, 2007.
276 p. ISBN 978-80-01-03888-8.

About the author

Robert Krejčí deals with digital signal processing
and speech recognition focusing on optimisation of
speech recogniser algorithms for systems with limited
hardware resources.

Robert Krejčí
E-mail: robert.krejci@centrum.cz
Department of Circuit Theory
Czech Technical University
Technická 2, 166 27 Praha, Czech Republic
