Acta Polytechnica Vol. 51 No. 5/2011

PhpHMM Tool for Generating Speech Recogniser Source Codes Using Web Technologies

R. Krejčí

Abstract

This paper deals with the “phpHMM” software tool, which facilitates the development and optimisation of speech recognition algorithms. The tool is being developed in the Speech Processing Group at the Department of Circuit Theory, CTU in Prague, and it generates the source code of a speech recogniser by means of the PHP scripting language and the MySQL database. The input of the system is a model of speech in the standard HTK format and a list of words to be recognised. The output consists of source code and data structures in the C programming language, which are then compiled into an executable program. The tool is operated via a web interface.

Keywords: speech recognition, DSP, PHP, MySQL, OMAP, TMS320C674x, ARM.

1 Introduction

An automatic speech recogniser is a computer program consisting of interconnected algorithms whose input is human speech converted from a microphone into digital form, and whose output is a text transcription of this speech. The development of a speech recogniser consists of two main phases. In the first phase, so-called “training” is carried out, resulting in the creation and filling of data structures that describe a speech model. In the second phase, decoding algorithms are developed that provide the speech recognition itself, using the speech models obtained in the training phase.

Since huge amounts of data are needed to create a speech recogniser, and huge amounts of data are processed, many activities are performed automatically using scripts. This facilitates the work and eliminates the need for repeated manual data processing. This is usually done using the HTK Toolkit [1], with which a complete speech recogniser for the PC platform can be created.

However, when creating a speech recogniser to be run on other hardware platforms, e.g. digital signal processors, no such public tool is available, and proprietary software has to be written. In this case, the speech models trained using the HTK Toolkit can still be utilised, but their processing requires totally different algorithms and optimisation methods from those used on the PC platform. To test the optimisation methods, it is often necessary to change the data structures and convert their parameters. For this purpose, the Speech Processing Group at the Department of Circuit Theory, CTU in Prague, has been developing the “phpHMM” tool, which facilitates and integrates the development of speech recognition algorithms for alternative hardware platforms.

2 PhpHMM tool

The PhpHMM tool is a set of scripts in the PHP scripting language [2] using the MySQL database server [3]. This technology has become one of the standards for generating web pages, but it is also useful for generating other texts, such as source code in any programming language. The basis of the phpHMM tool is a class of functions that can easily be included into a superior system written in PHP. The scripts are run on a server (either a local computer configured as a server or a publicly accessible web server), and their output is visible via a user-friendly graphical web interface. The source code of the speech recogniser is generated in a sequence of individual steps, which are discussed in the following text.

2.1 Speech model

The result of the training phase of the speech recogniser using the HTK Toolkit is a text file in a defined format that describes a general model of speech, created on the basis of the utterances of a training database. The models of speech may have a huge number of variations, e.g. the type of parametrisation (extraction of speech features), the number of HMM (Hidden Markov Model) states, streams and mixtures, the number of coefficients in each mixture, etc. During recognition, these parameters enter the output probability density function b(o) [4]:

$$
b_j(\bar{o}_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm} \, \mathcal{N}(\bar{o}_{st}; \bar{\mu}_{jsm}, \Sigma_{jsm}) \right]^{\gamma_s},
\qquad
\mathcal{N}(\bar{o}_{st}; \bar{\mu}_{jsm}, \Sigma_{jsm}) = \frac{1}{\sqrt{(2\pi)^{n_s} \, |\Sigma_{jsm}|}} \, \exp\!\left( -\frac{1}{2} (\bar{o}_{st} - \bar{\mu}_{jsm})^{T} \, \Sigma_{jsm}^{-1} \, (\bar{o}_{st} - \bar{\mu}_{jsm}) \right),
\tag{1}
$$

where $S$ is the number of streams, $\gamma_s$ is the stream weight, $M_s$ is the number of mixtures in stream $s$, $c_{jsm}$ is the weight of the $m$-th mixture, $n_s$ is the dimension of the observation vector in stream $s$, and $\mathcal{N}(\bar{o}; \bar{\mu}, \Sigma)$ is the multivariate Gaussian distribution with a vector of mean values $\bar{\mu}$ and a covariance matrix $\Sigma$. This function represents the acoustic similarity of the input signal to the reference models of speech units (phonemes). All these parameters enter the phpHMM tool through the uploaded text file with the speech model.

~h "a"
<BEGINHMM>
<NUMSTATES> 5
<STATE> 2
<MEAN> 39
 1.437809e+00 -6.805577e+00 -8.517246e+00 -9.976683e+00 ...
<VARIANCE> 39
 2.393653e+01 4.407170e+01 3.864353e+01 4.710320e+01 ...
<GCONST> 1.341746e+02
<STATE> 3
<MEAN> 39
 2.916575e+00 -8.322930e+00 -1.077090e+01 -9.984103e+00 ...
<VARIANCE> 39
 1.245955e+01 3.486024e+01 3.388573e+01 4.059823e+01 ...
<GCONST> 1.130805e+02
<STATE> 4
<MEAN> 39
 4.856239e-01 -1.422903e+00 -6.716645e+00 -3.694754e+00 ...
<VARIANCE> 39
 1.848022e+01 2.745304e+01 3.125877e+01 4.468990e+01 ...
<GCONST> 1.222291e+02
<TRANSP> 5
 0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
 0.000000e+00 6.224011e-01 3.775989e-01 0.000000e+00 0.000000e+00
 0.000000e+00 0.000000e+00 7.666833e-01 2.333166e-01 0.000000e+00
 0.000000e+00 0.000000e+00 0.000000e+00 5.902151e-01 4.097848e-01
 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
<ENDHMM>

Fig. 1: Example of a simple hidden Markov model of the “a” phoneme in text form

2.2 Parsing and storing into the database

After a text file with hidden Markov models is uploaded, it is parsed and converted from text form into data structures in the memory of the server. At the same time, some basic integrity checks of the file are carried out.
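
The individual checks are not enumerated here. One plausible check of this kind concerns the transition matrix: in the HTK format, every row except that of the final (exit) state should sum to one, and the exit-state row should contain only zeros. The following C sketch illustrates the idea under stated assumptions. The function name `transp_rows_ok` and the tolerance parameter are hypothetical and not part of phpHMM, which performs its checks in PHP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical integrity check (not taken from phpHMM).
 * `a` is an n x n transition matrix stored row by row.
 * Every row but the last must sum to 1 within `tol`;
 * the last (exit-state) row must sum to 0.
 * Returns 1 if the matrix is consistent, 0 otherwise. */
int transp_rows_ok(const double *a, size_t n, double tol)
{
    assert(a != NULL && n >= 2);

    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;

        for (size_t j = 0; j < n; j++) {
            if (a[i * n + j] < 0.0)
                return 0;               /* a probability cannot be negative */
            sum += a[i * n + j];
        }

        double expected = (i + 1 < n) ? 1.0 : 0.0;
        double diff = sum - expected;
        if (diff < 0.0)
            diff = -diff;               /* absolute deviation from the expected sum */
        if (diff > tol)
            return 0;
    }
    return 1;
}
```

For the 5 × 5 transition matrix of the model in Fig. 1, the rows of states 1 to 4 sum to one within rounding error, so the check passes with a tolerance of 1e-5.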
Database tables are then created in the MySQL database and populated with the relevant data from the uploaded file. A server-based (MySQL) database is convenient, inter alia, because it enables easy selection of data by means of (even complicated) SQL queries. Selecting and processing data using a server-based database is significantly faster and more convenient than searching in a text file. For our current experiments, it is advantageous to store the data in the “MEMORY” table type, as this storage allows faster access than the commonly used “MyISAM” type. There are also many techniques for optimising the performance of the database, such as the use of keys and indexes [4].

Fig. 2: Graphically expressed database structure for speech model

2.3 Glossary of words

Our goal is to create a speech recogniser that will be able to handle continuous speech in real time, but currently we are dealing with the recognition of individual words and short phrases. In this step, we simply specify all the words which the recogniser will be able to recognise, either by typing them into a text box or by uploading a text file. The more words are to be recognised, the greater the demands on the recogniser hardware, and hence on the optimisation of the algorithms.

2.4 Phonetic transcription

In all languages, there are differences between the written language and the spoken form of speech. This step automatically creates a phonetic transcription of the words entered in the previous step. E.g. the Czech word “zpěv” is transcribed as “spjef”.

2.5 Selection of hardware platform

We work on optimising the algorithms of speech recognisers for multi-core digital signal processor platforms of the TMS320C6000 family from Texas Instruments. The intention of phpHMM is to create a general tool for a large number of hardware and software platforms.
Currently, this step offers a choice between a “general” platform and the “OMAP-L137” platform. OMAP-L137 is a dual-core heterogeneous processor from Texas Instruments with both a 32-bit ARM9 core and a TMS320C674x DSP core.

2.6 Selection of optimisation methods

If a speech recogniser is to be run on a system with limited hardware resources, it is necessary to optimise computationally intensive algorithms. In this step, a combination of optimisation methods can be chosen for testing. The optimisation is done at all levels of the design of the speech recogniser, from the layout of the data structures up to modifying the algorithms so that they run faster on the chosen hardware platform.

2.7 Creating word models

Depending on the optimisation method, models of the words are created as sequences of states on which the Viterbi algorithm [1] works. For each word, the phoneme models are chained into a sequence of states. E.g. the Czech word “zpěv”, transcribed as “spjef”, creates the following sequence of states:

Fig. 3: Sequence of states of the word “zpěv” [spjef]

2.8 Assembling the source code and data structures

The main task of the phpHMM tool is to set up the source code and data structures on the basis of the input data, the specification of which has just been described. Depending on the type of parametrisation, the structure of the models and the required optimisations, the system generates the sources of the speech recogniser with the relevant data. The source code must be generated anew for each selection of hardware platform and optimisation.

The source code can be set up very effectively using PHP. Code in the PHP scripting language can be inserted directly into the source code in C. As described in [5], PHP can be used as a preprocessor with many more possibilities than the standard C preprocessor. For example, it can create loops or compute with trigonometric functions.
A Hamming window lookup table can be generated as follows:

const float hamming_ar[]={
};

The generated code is subsequently compiled by the appropriate compiler. This is beyond the function of the current phpHMM tool, although in future it may be possible, after generating the source code, to run the compiler directly and obtain the program in an executable format.

3 Results

Although the phpHMM tool is used to generate the entire speech recogniser, in the following text we discuss some examples of using the generated code for faster calculations.

3.1 MFCC optimisations

One of the optimisation methods calculates results in advance if all operands are known at compile time. This avoids computing the same results repeatedly during recognition, and it speeds up the calculation.

This so-called “lookup table” method was used to generate the Hamming window coefficients, which are applied at the beginning of the signal parametrisation by mel-frequency cepstral coefficients (MFCC) [1]. The parametrisation method never changes during the recognition process, and therefore the Hamming window coefficients do not change either. The calculation then reduces to reading the coefficient from a one-dimensional array.

A part of the parametrisation block, where speech features are extracted from the input signal, is the calculation of the Discrete Cosine Transform (DCT) [6]. Using the standard method for calculating the DCT, which evaluates trigonometric functions, a parametrisation calculation time of approximately 55 ms per segment was achieved on the tested digital signal processor. The numbers of input and output DCT coefficients are constants known at compile time that do not change during recognition, so the concrete cosine values are calculated in advance and stored in a data structure.
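
A minimal C sketch of such a pre-calculated cosine table is given here under stated assumptions. The sizes `N_IN` and `N_OUT`, the names `dct_cos` and `dct_lookup`, and the unscaled DCT-II form are illustrative and are not taken from the generated phpHMM code. The table holds the values cos(π(i+1)(j+0.5)/N_IN), which a generator would emit as constants.

```c
#include <assert.h>
#include <stddef.h>

#define N_IN  4   /* hypothetical number of DCT inputs  */
#define N_OUT 2   /* hypothetical number of DCT outputs */

/* Constants a generator could emit:
 * dct_cos[i][j] = cos(pi * (i + 1) * (j + 0.5) / N_IN). */
static const float dct_cos[N_OUT][N_IN] = {
    { 0.9238795f,  0.3826834f, -0.3826834f, -0.9238795f },
    { 0.7071068f, -0.7071068f, -0.7071068f,  0.7071068f },
};

/* Unscaled DCT-II: out[i] = sum over j of in[j] * dct_cos[i][j].
 * The cosine is only read from the table, never evaluated. */
void dct_lookup(const float in[N_IN], float out[N_OUT])
{
    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < N_OUT; i++) {
        float acc = 0.0f;

        for (size_t j = 0; j < N_IN; j++)
            acc += in[j] * dct_cos[i][j];
        out[i] = acc;
    }
}
```

With a constant input vector both outputs are zero up to rounding, as expected for the non-zero DCT orders. A real recogniser would use larger tables and would normally also fold the scaling factor into the stored constants.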
When the DCT algorithm runs in real time, the cosine is (paradoxically) not calculated; instead, the pre-calculated cosine value for the appropriate arguments is used. The calculation of the coefficient is thus reduced to reading its value from the pre-calculated table. With this optimisation, a calculation time below 6 ms was achieved, i.e. approximately a ninefold acceleration.

Fig. 4: Computation time vs. optimisation methods for MFCC parametrisation

3.2 Output probability density optimisations

Some of our proposed optimisation methods use transformed parameters, which arise by converting the original model parameters. E.g. a modified algorithm for calculating the output probability density function b(o), based on the dot-product operation of the type A = A + B × C (“Multiply and Accumulate”, “MAC”), requires recalculation of the original coefficients by a simple transformation [3]. This transformation is performed while generating the source code, i.e. at compile time. The calculation without optimisations on the dual-core TMS320C674x DSP architecture took 1477 ms/segment. After applying appropriate optimisations by recomputing the data structures, the best time of 52 ms/segment was achieved when using the modified MAC algorithm [3].

Fig. 5: Computation time vs. optimisation methods of b(o) function

Fig. 6: Computation time of maximum of neighbouring values

3.3 Viterbi algorithm optimisation

The Viterbi algorithm, which evaluates the most probable path through the model, contains a part which compares adjacent values in the vector of results of previous operations. Various methods have been tried, and the “Loop Unroll” method has proved to be the fastest in this case. The code that was originally performed repeatedly in a loop is broken down into multiple individual operations without the loop. This not only reduces the loop overhead, but also provides an opportunity for greater use of the hardware architecture.
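
The transformation can be sketched in C as follows. This is an illustration under assumptions, not the generated phpHMM code. The function names, the 16-bit element type and the reading of “compares adjacent values” as out[i] = max(v[i], v[i+1]) are hypothetical, and the fixed length of 8 outputs is chosen only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static inline int16_t max16(int16_t a, int16_t b)
{
    return (a > b) ? a : b;
}

/* Reference version: one comparison per pass through the loop. */
void neighbour_max_loop(const int16_t *v, int16_t *out, size_t n_out)
{
    assert(v != NULL && out != NULL);

    for (size_t i = 0; i < n_out; i++)
        out[i] = max16(v[i], v[i + 1]);
}

/* Unrolled version for a fixed length of 8 outputs: every operand
 * is addressed directly and there is no loop counter or branch. */
void neighbour_max_unrolled8(const int16_t v[9], int16_t out[8])
{
    assert(v != NULL && out != NULL);

    out[0] = max16(v[0], v[1]);
    out[1] = max16(v[1], v[2]);
    out[2] = max16(v[2], v[3]);
    out[3] = max16(v[3], v[4]);
    out[4] = max16(v[4], v[5]);
    out[5] = max16(v[5], v[6]);
    out[6] = max16(v[6], v[7]);
    out[7] = max16(v[7], v[8]);
}
```

Both functions produce the same result. The unrolled form is tedious to write by hand, but a code generator can emit it for any fixed length, and the straight-line sequence gives the compiler more freedom to schedule and pair the comparisons.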
In our case, instead of 32 passes through the loop, a sequence of 32 individual operations with directly addressed operands was created. This loop unrolling made it possible to use the “MAX2” instruction of the TMS320C6000 architecture, which is a SIMD (single instruction, multiple data) instruction that simultaneously compares two pairs of 16-bit operands and returns two results. Fig. 6 shows the effectiveness of this optimisation for different numbers of test vectors compared with the best time achieved without using the loop unroll method.

4 Conclusion

The phpHMM software tool for developing speech recognition algorithms focuses on applications for digital signal processors. The advantages of this tool include easy comparison of optimisation methods, easily changeable parameters, and a user-friendly graphical environment. It is used for generating source code and data structures tailored to the application.

Acknowledgement

This research was supported by grants GAČR 102/08/0707 “Speech Recognition under Real-World Conditions”, GAČR 102/08/H008 “Analysis and modelling of biomedical and speech signals”, and by research activity MSM 6840770014 “Perspective Informative and Communications Technicalities Research”.

References

[1] Young, S., et al.: The HTK Book. Cambridge University Engineering Department, 2006. [online] http://htk.eng.cam.ac.uk/ftp/software/htkbook.pdf.zip.

[2] PHP [online]. 2011 [cit. 2011-03-12]. http://www.php.net/.

[3] MySQL. The world’s most popular open source database [online]. 2011 [cit. 2011-03-12]. http://www.mysql.com/.

[4] Krejčí, R.: Optimization of Computationally Intensive Part of Speech Recognizer. In 19th Czech-German Workshop on Speech Processing [CD-ROM]. Praha: Institute of Photonics and Electronics AS CR, 2009, p. 22–26. ISBN 978-80-86269-18-4.

[5] Krejčí, R.: Use PHP preprocessor for generating source codes in C programming language. In Králíky 2010. Brno: Brno University of Technology, 2010, p. 84–87. ISBN 978-80-214-4139-2.

[6] Uhlíř, J., et al.: Technologie hlasových komunikací. Praha: Nakladatelství ČVUT, 2007. 276 p. ISBN 978-80-01-03888-8.

About the author

Robert Krejčí deals with digital signal processing and speech recognition, focusing on the optimisation of speech recogniser algorithms for systems with limited hardware resources.

Robert Krejčí
E-mail: robert.krejci@centrum.cz
Department of Circuit Theory
Czech Technical University
Technická 2, 166 27 Praha, Czech Republic