International Journal of Applied Sciences and Smart Technologies


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
11 
 

Artificial Generation of Realistic Voices 

 
Dhruva Mahajan1,*, Ashish Gapat1, Lalita Moharkar1, 

Prathamesh Sawant1, Kapil Dongardive1 

 
1Department of Electronics and Telecommunication Engineering,  

Xavier Institute of Engineering, Mahim, Mumbai, Maharashtra, India 
*Corresponding Author: dhruvam17@gmail.com 

 
(Received 17-07-2020; Revised 27-08-2020; Accepted 15-09-2020) 

 
Abstract 

In this paper, we propose an end-to-end text-to-speech system deployment 

wherein a user feeds input text data which gets synthesized, variated, and 

altered into artificial voice at the output end. To create a text-to-speech 

model, that is, a model capable of generating speech with the help of trained 

datasets. It follows a process which organizes the entire function to present 

the output sequence in three parts. These three parts are Speaker Encoder, 

Synthesizer, and Vocoder. Subsequently, using datasets, the model 

accomplishes generation of voice with prior training and maintains the 

naturalness of speech throughout. For naturalness of speech we implement a 

zero-shot adaption technique. The primary capability of the model is to 

provide the ability of regeneration of voice, which has a variety of 

applications in the advancement of the domain of speech synthesis. With the 

help of speaker encoder, our model synthesizes user generated voice if the 

user wants the output trained on his/her voice which is feeded through the 

mic, present in GUI. Regeneration capabilities lie within the domain Voice 

Regeneration which generates similar voice waveforms for any text. 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
12 
 

Keywords: Speech synthesis, speaker encoder, synthesizer, Text-to-Speech, 

vocoder  

 
1 Introduction 

In the near future, everything around us will be voice operated. With the growing 

trends of Alexa and Google home, advancements are being made to create an 

environment of Artificial Intelligence and its operating medium would be Voice. With 

the advent of signal processing, voice signals have extensive upgradation in terms of 

global standards and kept on challenging for better platform in all its forms of uses. 

Right from voice screeching out of airhorn to voice search engines in smart phones, 

voice applications have always been vital in any or all of their feats. Artificial 

generation itself is a form of voice cloning that is implemented with the help of neural 

nets that help with generating the Mel-spectrograms. This process intensifies on the 

input, where the characters present are synthesized into signal waveforms which are 

digitally spectograms. These spectograms are then coupled and then linked through a 

vocoder, which generates voice corresponding to the characters that are given as input. 

The goal of the this paper is to build a TTS system which can generate natural speech 

for a wide variety of speakers which are absent throughout the process. Our model can 

run in real time by implementing the mic function. This is possible by achieving a 

powerful form of voice cloning. The output must be aligned in such a manner that runs 

on correct lines to provide a clone of the dataset that is trained within the system. The 

output speech must allign with the exact speaker voice picked from the dataset. This 

voice gets matched with the help of RNN, which cycles the implementation process 

wherein the dataset voice gets linked with input text. This implementation works 

throughout for all the speakers whether present or absent. When a speaker implements 

mic function, the speaker encoder does not operate. 

Uncertainties do arise as we train the model with the speaker’s reference speech 

(trained dataset): 

• The output utterance is slightly composed. 

• Dataset voices could be identical, depending on the dataset picked (VCTK and 

LibriSpeech). 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
13 
 

• Voice output obtained, when implementing through the mic tends to be rough. This 

is because while recording in real time, it is very important to check the 

environment around. Mic integrated with system is highly sensitive and tends to 

pick smallest utterance. 

• Recording a large amount of high-quality data for many speakers seems rather im-

practical. 

The approach which we deploy, decouples speaker modeling from speech synthesis 

by independently training a speaker effective embedding network that traces and 

sequentially captures the character space of speaker characteristics and training a high-

quality TTS [1], [2]. Taking findings from a 2017 research paper, Google- Tacotron, 

where we figure out the synthesis and voice generation process [3]. We took 

implementation steps from a deep learning architecture that we researched and coupled 

it with WaveNet [3], [4]. WaveNet is a neural network which acts as a vocoder, to 

convert mel-spectograms into corresponding voice signals. We now had to check the 

process and operation for which we used two public datasets that are Librispeech and 

VCTK. After the respective implementation we got our process confirmed for synthesis 

and started the model. To run with time and technology, we implement a neural network 

architecture of our research pertaining to synthesis of voice. This research model is a 

deep learning text-to-speech model. Text-to-Speech describes that a string of text would 

be converted to speech output. There are three major operations. 

• Speaker Encoder- Where the embedded text feeded in is sent to a convolution bank 

highway network. This convolution bank is nothing but a deep learning tool which 

breaks our text in separate characters. Breaking the string of characters allows the 

network to work on each character modulation, making it more in tune with voice 

rather sounding modulated. i.e. For example- GOOD MORNING in the convolution 

bank will be presented as G-O-O-D M-O-R-N-I-N-G. 

• Synthesis- We implement trained datasets in our paper. These trained datasets 

arefeeded and loaded before operating the model. During the synthesis our character 

embedding which is disintegrated character by character is sent to attention, where 

the character is coupled with trained dataset voice signal. 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
14 
 

• Vocoder- This is the final part of our model where the signals which are bindedas 

spectrogram is converted into voice. This operation uses WaveNet neural network 

[5]. (WaveNet was developed in deepmind labs, which is a AI research wing of 

google.) WaveNet as a vocoder acts as a binder, where all end results 

(melspectrograms obtained) are combined. These results are the voice signal 

modulation that take place along with each character given as input. Explaining 

mathematically, in theory, our input text gets converted into an algebraic equation 

(which consists are input characters), this equation is then simplified by taking K 

constant. (K constant here is the dataset selected). This simplified equation is then 

collected and generated into series function equation (This is the desired output). 

With spectrograms for these character strings, this waveform passes through the 

WaveNet which perform natural language processing, giving voice as the end result. 

 
2 Methodology 

Text-to-Speech implementation for a speaker model to pick up character 

embeddings, the model needs to be well disintegrated in three subsets as clarified. These 

subsets work in a sequential manner providing speech in a uniform and desired manner. 

Putting light on each model now will magnify each role clearly with technical 

overviews and details. See Figure 1 for the block diagram. 

 
Figure 1. Block diagram 

 
Speaker encoder illustration is shown in Figure 2. Speaker encoder plays an 

important role by estimating the embedding from various text characters that is feeded 

as input. While implementing Text-to-Speech model, it’s up to the developer by what 

means it should be operated. To cut the unnecessary word-word translation by using a 

conventional Text-to-Speech system, where the text is sent in a string and the output 

obtained has passive speech, we use deep learning architecture, where the input text is 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
15 
 

feeded as the prenet (pre-feeded information/local information). This prenet is then sent 

to 1D-CBHG. This is one-dimensional convolution bank highway network. This block 

is the place where the string is broken in separate character by character formation [5]. 

Example: GOOD G-O-O-D. 

 
Figure 2. Speaker encoder 

 
Synthesizer is illustrated in Figure 3. Functions of daily text-to-speech is evident in 

our voice search engines, smart electronics and various voice modulating devices that 

inherit the use of synthesizer. Text-to-speech control system employs an easy-to-use 

transport medium from which a user can control most text-to-speech variating functions 

without any prior training. Now, the synthesizer basically works in the areas of 

translating a plurality of discontinuous user-selected part of text in an independent form 

of target application into an audio output that resembles the sound of human speech. To 

generalize, the synthesizer is the core of the Text-to-Speech engine, where it mainly 

focuses on tracing and adapting the text input in a sequence and delivers audio sample 

of it [5]. 

 
Figure 3. Synthesizer block 

 
In the block diagram represented above, the speaker encoder gets linked to the 

Concat block. In this Concat block, the character broken down in the CBHG, gets 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
16 
 

forwarded as a string with intervals between each character to the attention. Attention is 

a deep learning tool where each character is concentrated and constant is removed, and 

binded with the trained dataset. This trained dataset is forwarded from the Decoder, 

which acts as a gate for audio signal and couples it with the data in Concat. The constant 

removed is called ‘k-constant’, which is used for signal processing. 

Vocoder: Simplicity in the output should be in a form such that all the speech is 

synthesized in the manner that the corresponding input is heard in tone that is tuned to 

match the tempo of the unseen speakers’ voice. For this purpose, a vocoder is used 

where it captures the characteristic elements of the audio signal and then uses this 

characteristic signal to affect other audio signals. It is also dubbed as a “talking 

synthesizer” for its ability to fine-tune the synthesized signal in accordance to vocal 

frequency. See Figure 4 for Log Mel-waveforms. 

 
Figure 4. Log Mel-waveforms 

 
The vocoder provides a bank of multiple bandpass filters which dissociate the input 

signal into narrow spectral slices. Consider, we excite channel ‘k’ of vocoder with the 

input signal 𝑎(𝑛𝑇)cos(𝑤𝑘𝑛𝑇) for 𝑛 = 0,1,2,3,4,5, … where 𝑤𝑘 is the centre frequency 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
17 
 

of the channel in radians per second, 𝑇 is the sampling interval in seconds and 

bandwidth of 𝑎(𝑛𝑇) is smaller than channel bandwidth. We regard this input signal as 

an amplitude modulated sinusoid. The component cos(𝑤𝑘𝑛𝑇) can be called the carrier 

wave, while 𝑎(𝑛𝑇)  > 0 is the amplitude envelope. 

If the phase of each channel filter is linear in frequency within the passband (or at 

least across the width of the spectrum) and if each channel filter has a flat amplitude 

response in its passband, then the filter output will be, by the analysis of the previous 

section. 𝑌𝑘(𝑛)  ∼ 𝑎[𝑛𝑇 − 𝐷(𝑤𝑘)]cos(𝑤𝑘[𝑛𝑇 − 𝑃(𝑤𝑘)]) 

Creating a GUI is illustrated in Figure 5 as follows. With the reach of applications 

with software, the main over layer and the visible interaction is the Graphical User 

Interface (GUI) which allows dynamic ability to the user for its functioning [6]. For our 

model we intended the working on an interface to allow the user to interact with the 

model. Creation of GUI also implements the ease of functions laid out at disposal of the 

user within a fixed framework. GUI requires the interaction to be in a flow that does not 

hamper the user and neither causes any sort of imbalance within the process. Taking for 

consideration, our model is heavily based on synthesis, making it completely oriented to 

user interaction (input/character embedding). So, with this, the GUI should be in a 

manner that allows easy flow of task within the same framework. To create the GUI for 

our model, we implement tkinter library of python. Within the tkinter package, there are 

many functions that are used to make things more organized and presentable. Tkinter 

allows us to make various frameworks, buttons and organizes the functions 

systematically. Major properties to include in the GUI is illustrated in Figure 5. 

 
Figure 5. Main GUI components 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
18 
 

3 Implementation 

Real time Voice Cloning (Using SV2TTS): The entire approach to real time voice 

cloning is adapted on the basis of Transfer learning from speaker. i.e. dubbed as 

prosody transfer (voice styling implementation). It is a speaker verification to multi-

speaker text-to-speech synthesis. It essentially defines the framework for voice cloning 

that barely requires 4-6 seconds of reference speech. It is majorly dependent on three 

early works from Google: the GE2E loss (Wan et al., 2017), Tacotron (Wang et al., 

2017) and WaveNet (van den Oord et al., 2016). This proposed model is three-stage 

pipeline, as listed above in order. The google cloud services, google search engine, or 

google assistant make use of these same models. 

Model Architecture of Speaker encoder, Synthesizer and Vocoder are shown in 

Figure 6. 

 
Figure 6. Three-stage pipeline for Text-to-Speech 

 
Model Architecture of Speaker Encoder: It is a three-layer LSTM with 768 hidden 

nodes which is followed up by a projection layer comprising of 256 units. Although 

there is no reference in any of references present, as to what the projection layer defines. 

Hence, we round up to consider the overall function of the projection layer as a 256-

output fully connected layer (per LTSM) which is repeatedly applied to each and every 

output of the LTSM. The inputs to this model are 40-channels log-mel spectrograms 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
19 
 

with 25ms window width and a 10ms step. The desired output is the L2 normalized 

hidden state of the last layer, which works as a vector of 256 elements. 

Model Architecture of Synthesizer: In the synthesizer implementation, the target Mel 

spectrograms present for the synthesizer provide more features than those used for 

speaker encoder which are computed from a 50ms window with a 12.5ms step and have 

80 channels. We use a python implementation of LogMMSE algorithm, which is used 

for filtering the audio speech by erasing the noise in the early frameworks. 

Consequently, we train the synthesizer for 150k steps, comprising of a batch size of 

144. The outputs per second is set to 2 for the decoder. While implementing, the 

architecture tends to provide speech synthesis to attain identical cloning to the unseen 

speaker. During this process there are losses that are accounted in the verge predicted 

and ground truth mel spectrograms. This is the L2 loss function. While training, the 

model is set to ground truth aligned (GTA). The reason for this is that if we do not set 

the model to GTA, then the synthesizer would produce different variations of the same 

utterance (text or embedding). Implementation of neural nets are shown in Figure 7. 

 
Figure 7. Implementation of neural nets 

 
 Model Architecture of vocoder The vocoder implemented in the model is 

WaveNet [1, 4]. WaveNet produces naturalness in TTS. This is the primary reason it is 

fairly used in Tacotron and SV2TTS. However, the efficiency of this neural net is too 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
20 
 

good coming at the cost of the speed. It is very slow and the slowest deep learning 

architecture at interference time. Even if this is a matter of stress on the implementation 

part, there are improvements that can be initiated on it. Googles own vocoder works at 

the rate of output of 8000 samples per second, which is by far not bad for a neural net. 

The model implemented is an open-source Pytorch implementation that is based on an 

RNN model deployed by Github user fatchord. 

Zero-shot speaker adaption the speech characteristics to be synthesized are picked up 

from the audio signal. Zero-shot adaption is the ease with which the model gets linked 

with the training data from the unseen speaker which is not present during the training. 

It just requires time of 4-6 seconds for the model to generate new speech by 

synthesizing the speaker characteristics. Interference can cause the speaker information 

to get synthesized without knowledge of the input fed through. However, in our model, 

interference occurs with arbitrary untranscribed speech audio which does need the text 

to match with the synthesizer, thereby making the implementation hassle free and 

comparatively quick. 

Dataset Used Two datasets which are public by nature for synthesis and speech 

training for vocoder network is used for implementation. VCTK comprises of 43 hours 

of clean speech from 109 speakers, among which most have British accents. We down 

sampled the audio to 25 kHz, trimmed leading and trailing silence (reducing the median 

duration from 3.3 seconds to 1.8 seconds). It’s been split into three subsets: train, 

validation which has the same speakers as the train set and test which has 11 speakers 

held out from the train and validation sets. LibriSpeech comprises of the union of the 

two “clean” training sets, consisting 436 hours of speech from 1,172 speakers, sampled 

at 16 kHz. The speech is US English majorly, however since it is sourced from audio 

books, the vocals and style of speech can differ significantly between utterances from 

the same speaker. We reassembled the data into shorter utterances by force aligning the 

audio to the transcript using an automatic speech recognition (ASR) model and breaking 

segments on silence, reducing the median duration from 14 to 5 seconds. 

Naturalness of Speech Clearer the person’s voice is, crisper is the audibility factor 

that follows. This might tend to differ in cases of speech synthesis, where natural speech 

gets encoded and decoded with help of speech synthesizers and in our case deep 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
21 
 

learning networks too. So, to arrange it cordially, there is prenet data, also termed as 

local information that gets coupled with character embeddings from the input. There is 

interference that plays a major role and thereby noise gets added on while processing 

through the highway networks. With subsequent processing and ability to trace 

character by character (using Fourier transform) we get the log-mel spectrogram 

through vocoder. After attaining the desired output, the major question is how much of 

the part is natural speech. As we implement two public datasets, we tried comparing the 

VCTK and LibriSpeech datasets to check the speech synthesis through synthesizers and 

vocoder. We tried comparing by using 11 unseen and seen speakers for VCTK and 10 

unseen and seen speakers for LibriSpeech. Each of the comparison was conducted 

independently [2]. Comparison of time taken for synthesis is listed in Table 1. 

 
Table 1. Comparison of time taken for synthesis 

System Speaker Information VCTK LibriSpeech 

Ground truth Same Speaker 4.67 ± 0.04 4.33 ± 0.08 

Ground truth Same Gender 2.25 ± 0.07 1.83 ± 0.07 

Ground truth Different Gender 1.15 ± 0.04B 1.04 ± 0.03 

Embedding Table Seen 4.17 ± 0.06 3.70 ± 0.08 

Proposed Model Seen 4.22 ± 0.06 3.28 ± 0.08 

Proposed Model Unseen 3.28 ± 0.07 3.03 ± 0.09 

 
Speaker Similarity and Verification To check the speech having cleaner detail, we 

check the results of the above comparison of the two datasets. This is done to check 

whether the desired output is identical to the input given. From the comparison table, we 

get to know that the values delivered by VCTK tend to be an edge above the 

LibriSpeech dataset. The speech from the VCTK dataset is cleaner. This can be also 

understood by higher ground truth baselines in the VCTK. That makes the VCTK 

better, but on using LibriSpeech on VCTK model, it was visible that the output was 

better than that of VCTK model. This means that depending on the dataset and type of 

groundtruth, the similarity can be accomplished. For the part of speaker verification, on 

LibriSpeech, the synthesized speech is at most similar to the ground truth voices. The 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
22 
 

LibriSpeech synthesizer obtains similar EERs of 5-6 % using reference speakers from 

both datasets, whereas the one trained on VCTK performs much worse, especially on 

out-of-domain LibriSpeech speakers. 

 
4 Results and Discussion 

 
Figure 8. Proposed-method 

 
Having all the details and implementation incorporated, we got our proposed design 

(see Figure 8) that would be implemented by us. We had to take in consideration every 

part as sequential support that will provide proportional synthesis and deliver the 

desired output. We had two procedures followed to obtain results. 

Method 1: In the first method we used the train datasets for speech synthesis, which 

were LibriSpeech and VCTK respectively. The output obtained was clear and audible 

with the help of headphones. In open environment, the voice felt a little light, and 

required a speaker with an equalizer. 

Method 2: This method is an advancement to our paper which could be extended 

further successfully with correct implementations. We tried feeding in our voice through 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
23 
 

the mic, and tried cloning. The speech did get synthesized but the output obtained 

wasn’t as clear as the previous case and was little gibberish in nature. 

Audio Quality Analysis: The quality audio can be better heard by using speakers with 

better bass and equalizer in an open environment. There is a 2-3 seconds lag in setting 

the audio sample from the dataset which, again can be improved by trimming the audio 

samples. Within all testing periods, the model was reliable and highly efficient in terms 

of user interface. Further extension is possible by implementing user generated 

voice (refined), and miscellaneous dataset implementation of various accents and 

dialects. 

Now, in accordance to the above model, we had to implement a GUI (see Figure 9), 

that can provide subsequent functioning of the above model and also should be user 

friendly to operate. We implemented Python tkinter package in accordance to the 

system model and created the model that equips and provides most of the proposed 

system through easy flow of framework design. 

 
Figure 9. Output GUI 

                                                                                                                                                                                                      
5 Conclusion 

The entire concept of having voice cloning has always had some advancements 

tending to always grow for the better. Incorporating all the factors, we try instill the 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
24 
 

model with all datasets and pretrained data through which the model infers and tries to 

pullout the desired output claimed. To attain the desired output, we need to verify the 

speech to be identical to the trained unseen speech, so as to be sure. If the speech 

synthesis is not identical, we cannot term the output as the desired output. Such models 

have implications and applications mostly in the armed forces for providing stealth 

mode a new edge. Voice cloning can help make communication between borders, across 

seas and even on telephone a rounded-up mystery, which is the reason it benefits the 

armed forces greatly. The application also lies with media. People tend to use voice 

cloning for media concentration applications or entertainment media such as Dubsmash. 

This model also helps with regeneration of voice wherein, a certain person can 

communicate if he/she is disabled or lost ability of speech under certain circumstances. 

The voice cloning works for the future, wherein with on-time upgradation, and correct 

use we can use regeneration factor for taking over music and pop culture with a wave. 

With a sytem having a multi-area application domain. It is very important to place 

certain regulations and boundaries bounded by legal clauses for implementation. Major 

part of the model lies within the use of the Text-to-Speech system, which is the 

backbone of the model. Many systems integrated in artificial intelligence pick Text-to-

Speech system as prime domain because of its broad area of implementation, right from 

voice assist, voice detection to voice cloning. Our model works with voice cloning and 

moves in direction of voice regeneration, which will be a major breakthrough in the 

near future. The proposed model does not attain human-like naturalness, despite the use 

of a WaveNet vocoder (along with its very high inference cost), in contrast to the single 

speaker results. This is a consequence of the additional difficulty of generating speech 

for a variety of speakers given significantly less data per speaker, as well as the use of 

datasets with lower data quality. 

 
Acknowledgements 

The corresponding author acknowledges all the co-authors and group members for 

co-operating and working with an optimistic mindset. Every work related to the paper 

required dedicated and devoted attention from the department in association, and 


International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
25 
 

personal guidance from the project guide, Prof. Lalita Moharkar who stood by all along 

the buildup of the research model. Trying to walk the entire path from scratch required 

detailed reference help which acted as a walking stick, which were journal papers and 

technical papers present in international journals and publications. 

References 

[1] L. Wan, Q. Wang, A. Papir and I. L. Moreno, “Generalized end-to-end loss for 

speaker verification.” Acoustics, Speech and Signal Processing (ICASSP), IEEE 

International Conference, 2018. 

[2] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, 

R. Pang, I. L. Moreno and Y. Wu, “Transfer Learning from Speaker Verification to 

Multispeaker Text-To-Speech Synthesis.” Advances in neural information 

processing systems, 31, 4485–4495, 2018.  

[3] Artificial Intelligence at Google – Our Principles. https://ai.google/principles/, 

2018.  

[4] A.V.D. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, 

N. Kalchbrenner, A. Senior and K. Kavukcuoglu. “Wavenet: A generative model 

for raw audio.” arXiv preprint 1609.03499, (2016). 

[5] J. Shen, R. Pang, Ron J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y.  

Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis and 

Y. Wu. “Natural TTS synthesis by conditioning WaveNet on mel spectrogram 

predictions.” Proceedings of the IEEE International Conference on Acoustics, 

Speech, and Signal Processing (ICASSP), 2018. 

[6] M. Grechanik, Q. Xie and C. Fu, “Creating GUI testing tools using accessibility 

technologies.” Proceedings of the IEEE International Conference on Software 

Testing, Verification, and Validation Workshops, 243–250, 2009. 

  
International Journal of Applied Sciences and Smart Technologies 

Volume 3, Issue 1, pages 11–26 

p-ISSN 2655-8564, e-ISSN 2685-9432 

 
26 
 

This page intentionally left blank