Development of an Automatic Speech to Facial Animation Conversion to Improve Deaf Lives

S. Hamidreza Kasaei, S. Mohammadreza Kasaei, S. Alireza Kasaei
Young Researchers Club, Isfahan Branch (Khurasgan), Islamic Azad University, Isfahan, Iran
hamidreza_kasaee@yahoo.com

Abstract
In this paper, we propose the design and initial implementation of a robust system that automatically translates voice into text and text into sign language animations. Sign language translation systems could significantly improve deaf lives, especially in communication, exchange of information, and employment, by using machines to translate conversations from one language to another. Considering these points, it seems necessary to study speech recognition. Voice recognition algorithms usually address three major challenges: the first is extracting features from speech, the second is recognition when only a limited sound gallery is available, and the final challenge is moving from speaker-dependent to speaker-independent voice recognition. Extracting features from speech is an important stage in our method. Different procedures are available for this purpose; one of the most common in speech recognition systems is Mel-Frequency Cepstral Coefficients (MFCCs). The algorithm starts with preprocessing and signal conditioning. Next, features are extracted from the speech using cepstral coefficients. The result of this process is then sent to the segmentation part. Finally, the recognition part recognizes the words and converts the recognized words to facial animation. The project is still in progress and some new interesting methods are described in the current report.

Keywords: Deaf Human, Sign Language Translation Systems, Humatronics, Automatic Speech Recognition

1. Introduction
Today, about one in 1000 people becomes deaf before acquiring speech and may always have a low reading age for written Persian. Sign is their natural language, and Persian Sign Language has its own grammar and linguistic structure that is not based on Persian. Voice recognition systems therefore play a very significant role in the field of human electronics and have wide applications in deaf lives. This research study started with several speech-to-text experiments to measure the communication skills of deaf people and to better understand their everyday problems. The primary aim of our project was to develop a communication aid for deaf persons which can be implemented in a mobile telephone. In our system a partially animated face is displayed in interaction with deaf users; such systems are useful in many applications. Our system starts with preprocessing and signal conditioning. Next, features are extracted from the voice using cepstral coefficients. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each word. The result of this process is then sent to feature matching, which involves the actual procedure to identify the unknown word by comparing the features extracted from the voice input with those from a set of known words. Finally, the recognition part recognizes the words and converts the recognized words to facial animation. The overall structure of the automatic voice to sign language animation translation system is shown in Fig. 1.
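As a rough, non-authoritative sketch of this processing chain, the following Python fragment strings the stages together. Every helper here (the pre-emphasis preprocessing, the stand-in feature extractor, and the nearest-reference matcher) is a hypothetical placeholder introduced only for illustration; the actual front end (MFCC) and matcher (VQ) used in the paper are described in Section 2.

```python
import numpy as np

# Hypothetical placeholder stages; values and helpers are illustrative only.

def preprocess(signal):
    # Simple pre-emphasis filter as an example of signal conditioning.
    return np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

def extract_features(signal, frame_len=256):
    # Stand-in for the cepstral front end: one dummy feature per frame.
    usable = len(signal) // frame_len * frame_len
    frames = signal[:usable].reshape(-1, frame_len)
    return frames.mean(axis=1, keepdims=True)

def match_word(features, references):
    # Stand-in for feature matching: pick the known word whose reference
    # vector is closest (on average) to the extracted features.
    distances = {w: float(np.mean((features - ref) ** 2)) for w, ref in references.items()}
    return min(distances, key=distances.get)

def recognize(signal, references):
    """Overall chain: conditioning -> feature extraction -> matching -> word.
    The recognized word would then be mapped to a sign-language animation."""
    return match_word(extract_features(preprocess(signal)), references)

# Example call with toy data and two hypothetical word references.
word = recognize(np.random.randn(8000), {"salam": np.array([0.0]), "khoda": np.array([0.5])})
```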
Some of the related research in the field of automatic translation of voice to sign language animation is as follows. Attila Andics and James M. McQueen [1] proposed neural mechanisms for voice recognition. M. Benzeghiba and R. De Mori [2] studied automatic speech recognition and speech variability. Ramin Halavati and Saeed Bagheri Shouraki [3] proposed the recognition of human speech phonemes using a novel fuzzy approach. This paper is organized as follows: Section 2 describes an overview of our system, and Section 3 concludes the paper. The project is still in progress and some new interesting methods are described in the current report.

2. The proposed method
All voice recognition, speaker identification, and speaker verification technologies have their own advantages and disadvantages and may require different treatments and techniques; the choice of technology is application-specific. At the highest level, all voice recognition systems contain two main modules: feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each word. Feature matching involves the actual procedure to identify the unknown word by comparing the features extracted from the voice input with those from a set of known words. A wide range of possibilities exists for parametrically representing the speech signal for the voice recognition task, such as Linear Prediction Coding (LPC), RASTA-PLP, and Mel-Frequency Cepstrum Coefficients (MFCC).

Figure 1. The structure of the automatic voice to sign language animation translation system

LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue [4]. Another popular speech feature representation is RASTA-PLP, an acronym for Relative Spectral Transform - Perceptual Linear Prediction. PLP was originally proposed by Hynek Hermansky as a way of warping spectra to minimize the differences between speakers while preserving the important speech information. RASTA is a separate technique that applies a band-pass filter to the energy in each frequency subband in order to smooth over short-term noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel, e.g. from a telephone line [5].

MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the Mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [3]. MFCC is the best known and most popular representation, so we decided to use it in our project. The process of computing MFCCs is described in more detail in the first part of the proposed method (feature extraction); in the second part, the feature matching algorithm is discussed.
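As a concrete illustration of the Mel-frequency scale mentioned above (roughly linear below 1000 Hz and logarithmic above it), the sketch below uses the common analytic approximation mel(f) = 2595 * log10(1 + f/700). The paper does not state which mel formula, filter count, or band edges it uses, so those values are assumptions made only for this example.

```python
import numpy as np

def hz_to_mel(f):
    # Widely used approximation of the mel scale: nearly linear below
    # about 1000 Hz and logarithmic above it.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_centers(num_filters=20, low_hz=0.0, high_hz=4000.0):
    # Center frequencies equally spaced on the mel scale, which places
    # them densely at low frequencies and sparsely at high frequencies.
    mels = np.linspace(hz_to_mel(low_hz), hz_to_mel(high_hz), num_filters + 2)
    return mel_to_hz(mels)[1:-1]

print(np.round(mel_filter_centers(), 1))  # dense below 1 kHz, sparse above
```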
The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns and in our case are sequences of acoustic vectors extracted from the input speech using the techniques described above (feature extraction). The classes here refer to individual words. Since the classification procedure in our case is applied to extracted features, it can also be referred to as feature matching. Feature matching techniques used in voice recognition include Dynamic Time Warping (DTW) [6], Hidden Markov Modelling (HMM), Support Vector Machines (SVM) [7], and Vector Quantization (VQ) [8]; another technique is Artificial Neural Networks (ANN) [9]. In this project, the Vector Quantization approach is used, due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The automatic voice recognition system compares the input voice with the codewords of the training data, and the best matching codeword identifies the recognized word.

2.1. Feature Extraction
Preprocessing is usually necessary to facilitate high-performance recognition. A wide range of possibilities exists for parametrically representing the speech signal for the voice recognition task; we chose the MFCC algorithm for our project.

Mel Frequency Cepstral Coefficients
Mel Frequency Cepstral Coefficients (MFCC) are coefficients that represent audio based on perception. They are derived from the Fourier Transform (FFT) or the Discrete Cosine Transform (DCT) of the audio clip. The basic difference between the FFT/DCT and the MFCC is that in the MFCC the frequency bands are positioned logarithmically (on the Mel scale), which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT or DCT. This allows for better processing of the data. The main purpose of the MFCC processor is to mimic the behavior of the human ear. Overall, the MFCC process has five steps, shown in Figure 2.

Figure 2. MFCC block diagram

Step 1 – Frame Blocking
Step 2 – Windowing
Step 3 – Fast Fourier Transform (FFT)
Step 4 – Mel-frequency Wrapping
Step 5 – Cepstrum Coefficient

At the frame blocking step, a continuous speech signal is divided into frames of N samples. Adjacent frames are separated by M (M
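To make the frame blocking step (and the subsequent windowing of Step 2) concrete, here is a minimal sketch that splits a speech signal into overlapping frames of N samples with a shift of M samples and applies a Hamming window to each frame. The values N = 256 and M = 100 are common textbook choices used only as an example; they are not taken from this section.

```python
import numpy as np

def frame_and_window(signal, n=256, m=100):
    """Split a speech signal into overlapping frames of n samples,
    adjacent frames being shifted by m samples (m < n), then apply a
    Hamming window to each frame to reduce edge discontinuities.
    n = 256 and m = 100 are illustrative values, not taken from the paper."""
    num_frames = 1 + (len(signal) - n) // m
    frames = np.stack([signal[i * m : i * m + n] for i in range(num_frames)])
    return frames * np.hamming(n)

# Example: one second of a synthetic 440 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
frames = frame_and_window(np.sin(2 * np.pi * 440.0 * t))
print(frames.shape)  # (78, 256): overlapping, windowed frames
```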