FACTA UNIVERSITATIS
Series: Electronics and Energetics, Vol. 27, No. 3, September 2014, pp. 375-387
DOI: 10.2298/FUEE1403375D

USER-AWARENESS AND ADAPTATION IN CONVERSATIONAL AGENTS

Vlado Delić1, Milan Gnjatović1,2, Nikša Jakovljević1, Branislav Popović1, Ivan Jokić1, Milana Bojanić1
1 Faculty of Technical Sciences, University of Novi Sad, Serbia
2 Graduate School of Computer Sciences, Megatrend University, Belgrade, Serbia

Received April 30, 2014. Corresponding author: Vlado Delić, Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia (e-mail: vlado.delic@uns.ac.rs)

Abstract: This paper considers the research question of developing user-aware and user-adaptive conversational agents. A conversational agent is user-aware to the extent that it recognizes the user's identity and his/her emotional states that are relevant in a given interaction domain, and it is user-adaptive to the extent that it dynamically adapts its dialogue behavior according to the user and his/her emotional state. The paper summarizes some aspects of our previous work and presents work in progress in the field of speech-based human-machine interaction. It focuses particularly on the development of speech recognition modules in cooperation with modules for emotion recognition and speaker recognition, as well as with the dialogue management module. Finally, it proposes an architecture of a conversational agent that integrates these modules and improves each of them through the synergies among them.

Key words: conversational agent, user-awareness, adaptation, speech recognition, emotion recognition, speaker recognition, dialogue management

1. INTRODUCTION

Context-awareness is certainly one of the most fundamental requirements for advanced conversational agents. Recognition and interpretation of the user's dialogue acts and dialogue management are always situated in a particular context. This is primarily due to the fact that many inherently present dialogue phenomena are context-dependent. Thus, nonlinguistic contexts shared between the user and the system (e.g., graphical displays) may strongly influence the language of the user with respect to the frequency of "irregular" utterances (elliptical and minor utterances, utterances containing anaphora and exophora, etc.) [1]. In addition, the user's dialogue acts may fall outside the system's domain, scope and semantic grammar, or contradict his/her earlier dialogue acts. This is even more the case when we consider users in non-neutral emotional states. Forcing users to follow a preset grammar or interaction scenario is too restrictive, if possible at all, and would not be well accepted [2]. In such cases, the system needs a considerable amount of stored contextual knowledge to enable it to advance the conversation in spite of miscommunication and to maintain the dialogue's consistency.

However, the requirement for habitable natural language interfaces goes beyond pragmatics. Another reason relates to the technology. Speech recognition technology is still not accurate enough to deal with flexible, unrestricted language. In realistic settings, average word recognition error rates are 20-30%, and they go up to 50% for non-native speakers [3]. In certain conditions, speech recognition accuracy may degrade dramatically, to an extent that systems become unusable even for cooperative users [4].
Researchers generally agree that conversational agents need to incorporate dialogue context models in order to maintain a consistent dialogue and overcome technical deficiencies. Yet, context is a complex construct and can be considered from different aspects. In this paper, we consider the restricted research question of how user-awareness may help to improve dialogue management. This paper summarizes some aspects of our previous work and presents work in progress. In the reported approach, we differentiate between two research lines:

- User-awareness. The system is user-aware to the extent that it recognizes the user and his/her emotional states that are relevant in a given interaction domain.
- User-adaptation. The system is user-adaptive to the extent that it dynamically adapts its dialogue behavior according to the user and his/her emotional state.

At the methodological level, these two lines of research are fundamentally different. The first line relates to a statistical approach to the research problems of automatic speech recognition (ASR), emotional speech recognition (ESR), and speaker recognition. The speech signal encodes not only information about the lexical content of the speaker's dialogue act, but also information about the speaker's voice characteristics that may be used for recognition of the speaker and his/her emotional state [5], [6]. The basic idea is to use data derived from both speech and language corpora, and to apply automated analysis methods. Although speech, speaker and emotion recognition technologies have a common foundation, they are usually developed and applied separately. We build upon our previous work [7]-[13], and investigate the possibilities of combining these technologies rather than applying them separately. Sections 2 and 3 discuss this in more detail.

The second research line relates to a representational approach to natural language processing and dialogue management. In previous work, we introduced a representational model of attentional information in human-machine interaction that provides a framework for more robust natural language understanding and for designing adaptive dialogue strategies [2], [14]-[17]. Section 4 discusses the application of this model to designing user-adaptive conversational agents.

2. ACOUSTIC INFORMATION-BASED APPROACH TO USER-AWARENESS

2.1. Speech recognition

The task of automatic speech recognition is to translate spoken words into text. In order to accomplish this task, the reported speech recognizer exploits information about acoustic representations of phonemes, encapsulated in an acoustic model, and information about syntactic rules, encapsulated in a language model. The relation between words and phonemes is captured in a pronunciation dictionary, where each word is segmented into at least one sequence of phonemes. Since each phoneme has several acoustic representations, we use a context-dependent phone, referred to as a triphone, as the basic modeling unit. The acoustic model is based on hidden Markov models and Gaussian mixture models. In order to reduce the computational complexity of the model and to achieve robust parameter estimation, similar states of triphones share parameters. The tree-based clustering procedure presented in [18] is performed to find those similar states. The Gaussians are modeled using full covariance matrices, since they provide a more accurate acoustic representation than models with diagonal covariance matrices [19].
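For illustration, the following minimal sketch (illustrative only, not the implementation used in the reported system; the parameter values are random placeholders) evaluates the per-frame log-likelihood of a full-covariance Gaussian mixture with NumPy/SciPy; note that every mixture component involves a full D x D covariance matrix:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(x, weights, means, covariances):
    """Log-likelihood of one feature vector x under a full-covariance GMM.

    weights:     (M,)       mixture weights, summing to one
    means:       (M, D)     component mean vectors
    covariances: (M, D, D)  full covariance matrices
    """
    # Per-component log densities; each evaluation works with a D x D covariance,
    # which is why full-covariance models are costlier than diagonal ones.
    log_comp = np.array([
        np.log(w) + multivariate_normal.logpdf(x, mean=m, cov=c)
        for w, m, c in zip(weights, means, covariances)
    ])
    # Numerically stable log-sum-exp over the mixture components
    m_max = log_comp.max()
    return m_max + np.log(np.exp(log_comp - m_max).sum())

# Toy usage with random parameters (illustrative only)
rng = np.random.default_rng(0)
D, M = 32, 4                      # e.g. a 32-dimensional feature vector, 4 mixtures
weights = np.full(M, 1.0 / M)
means = rng.normal(size=(M, D))
covs = np.stack([np.eye(D) + 0.1 * np.outer(v, v) for v in rng.normal(size=(M, D))])
print(gmm_log_likelihood(rng.normal(size=D), weights, means, covs))
```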
However, in this variant the computational complexity of the log-likelihood evaluation is significantly increased. To overcome this problem, several approaches have been developed and applied [20]-[22]. The system uses feature vectors consisting of 15 mel-frequency cepstral coefficients (MFCC), normalized energy, and their first derivatives. The feature vectors are extracted from 30 ms speech segments, every 10 ms. The training set for the acoustic model contains recordings of both scripted and spontaneous utterances produced by several dozen speakers, with a total duration of about 200 hours [23].

Language modeling is a special issue for highly inflected languages, since language models have to cover a range of grammatical categories (including tense, aspect, mood, case, etc.) and morphological derivations that involve the addition of prefixes and suffixes. In the currently predominant statistically-based approach to ASR, language models are trained on large text corpora. However, simple N-gram based language models do not suffice for morphologically more complex languages without significant modifications [24]. Our language model is a combination of three N-gram models: the first is based on tokens (surface forms), the second on lemmata, and the third on classes [23]. The size of the vocabulary causes data sparsity problems, resulting in the need for significantly larger language corpora, sufficient for obtaining a robust language model. The training set for the language model consists of text from various newspapers, scientific articles and books, with a total volume of about 16 million words (178,865 lemmata).

Splitting words into phoneme sequences is relatively simple for the Serbian language, due to the fact that it has a phonemic orthography. However, there are some exceptions in word pronunciation (e.g., dvanaest is usually pronounced as dvanajst), and our phonetic inventory distinguishes stressed and unstressed variants of vowels; therefore, the system uses the pronunciation dictionary developed for speech synthesis to map words into phones [23].

The size of the search space is determined by the following factors: the number of words which are expected to be recognized, the number of their pronunciation variants, and the number of hidden Markov model states in the acoustic model. For real-time recognition, it is important to reduce the search space, which can be a significant problem for highly inflected languages, where many derived forms may exist for a single lemma. The standard way to cope with this problem is pruning, i.e., discarding the less probable hypotheses. For this purpose, a system should rely not only on an acoustic model, but also on a language model and information about word pronunciation. Our system uses a decoder based on the token-passing algorithm (a variant of the Viterbi algorithm in which the information about the path and score is stored at the word level instead of at the trellis state level). A detailed description of the decoder can be found in [25].

2.2. Emotion recognition

Emotional speech recognition is concerned with the task of automatically identifying the emotional states of the speaker, based on the analysis of his/her speech. Prosodic and spectral features are the most frequently used features for this task, while less frequently used features include voice quality features (e.g., harmonic-to-noise ratio, jitter, shimmer).
Prosodic features, also referred to as paralinguistic features, include specific changes in pitch patterning, the energy of the voice signal, and changes in speech rate. The positions and bandwidths of formants, and a cepstral representation of the spectrum, are usually selected as spectral features for emotional speech recognition. This is in line with the findings that the distribution of spectral energy across the speech frequency range is a possible measure of the emotional content of speech. In [11], we show that a feature set containing both prosodic and spectral features achieves high recognition accuracy (91.5%) for the basic emotional states (anger, joy, fear, sadness, and neutral). The feature vector was obtained by applying statistical functionals to the spectral/prosodic feature contours, where the most relevant functionals, ranked in descending order, are: moments, extrema, and regression coefficients [12].

In many speech-based applications, it is beneficial to conceptualize the user's emotional states in a given interaction domain as positive or negative (e.g., for the purpose of detecting a frustrated or satisfied call-centre customer). Therefore, in our previous work, we also investigated the perspective of dimensional emotion models that describe emotional content in terms of valence (positive/negative emotion) and arousal (active/passive emotion). We conducted a comparative study of two acted emotion corpora to investigate possibilities for classification of discrete basic emotions in the valence-arousal space [26]. The first conclusion of this study was that the prosodic-spectral feature set proposed in [11] is almost equally effective in modeling emotions in the valence-arousal space as in modeling discrete emotional states. The second conclusion was that the discrimination of emotional states according to the arousal level is more successful than their discrimination according to the valence level [26].

Our research on acoustic information-based emotion recognition was primarily supported by the GEES corpus of emotional and attitude-expressive speech in Serbian [27]. It contains recordings of acted speech-based emotional expressions. Six drama students (3 female, 3 male) were engaged to produce emotionally colored utterances. They were given a set of textual entries (32 isolated words, 30 short sentences, 30 long sentences, and one passage of 79 words) and asked to express each entry in five emotional states (anger, joy, fear, sadness and neutral). The perception test demonstrated that the corpus contains acoustic variations that are indicative of emotional expression of the five target emotional states.

2.3. Speaker recognition

Our research on speaker recognition centers on text-independent speaker recognition based on a feature set that contains mel-frequency cepstral coefficients (MFCC) and their first and second derivatives. The research was primarily supported by a corpus containing recordings of 121 native Serbian speakers (61 female, 60 male). Each speaker produced 14 audio recordings: one recording of the speaker uttering his/her first name and family name, two recordings of the speaker uttering a sequence of digits, and eleven recordings of the speaker uttering a sequence of syntactically unrelated words. To reduce the dimensionality of the standard MFCC features, we applied the technique of Principal Component Analysis (PCA).
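As a rough illustration of this step (a sketch under simplified assumptions, not the exact procedure from [9]; the training matrix below is a random stand-in for real MFCC vectors), PCA can be fitted on training feature vectors and then used to project 39-dimensional MFCC-based vectors onto a lower-dimensional subspace:

```python
import numpy as np

def fit_pca(features, n_components):
    """Fit PCA on a matrix of feature vectors (n_frames x n_dims)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Eigen-decomposition of the covariance matrix of the training features
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
    basis = eigvecs[:, order[:n_components]]    # keep the leading components
    return mean, basis

def project(features, mean, basis):
    """Project feature vectors onto the retained principal components."""
    return (features - mean) @ basis

# Toy usage: reduce hypothetical 39-dimensional MFCC(+derivatives) vectors to 14 dimensions
rng = np.random.default_rng(1)
train = rng.normal(size=(1000, 39))             # stand-in for real MFCC features
mean, basis = fit_pca(train, n_components=14)
reduced = project(train, mean, basis)
print(reduced.shape)                            # (1000, 14)
```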
The reported experimental results [9] suggest that this technique is appropriate for reducing the dimensionality without reducing the recognition accuracy: already in a 14-dimensional PCA feature space, the applied automatic speaker recognizer reaches the accuracy obtained in the 39-dimensional MFCC feature space.

MFCCs depend on the energy in the observed speech frame. Therefore, the distribution of a speaker's feature vectors depends on the lexical content and the expressed emotions. To decrease the text dependency of the covariance matrices used for speaker modeling, we apply an algorithm for weighting model elements, introduced in [10]. The basic idea may be formulated as follows: the importance of an element of the speaker model in the decision-making process decreases as its time variability increases. In accordance with this, the element of the speaker model that has the highest time variability will be assigned the smallest value. In real applications, it can be the case that, for some speakers, the automatic speaker recognizer has only one model determined during the training phase. Thus, the recognizer cannot observe the time variability of model elements. The time variability of speaker models depends primarily on the largest model elements. By applying a nonlinear function, such as the sigmoid function, to the largest model elements, the time variability of the speaker models is decreased and, consequently, the recognition accuracy is increased. Also, MFCCs depend on the assumed shape of the auditory critical bands. When the MFCCs are determined under the assumption that the auditory critical bands have an exponential shape based on the lower part of the exponential function, the automatic speaker recognizer performs more accurately than when rectangular or triangular auditory critical bands are applied [10].

It should be noted that emotional speech may significantly affect the accuracy of speaker recognition. However, not all emotions are equally critical for speaker recognition. Preliminary experiments conducted on the GEES database confirmed that, e.g., the emotion of anger changes the speaker's voice (i.e., timbre) to a greater extent than the emotion of sadness. In the next sections, we discuss how a combination of different knowledge sources may improve the recognition accuracy.

2.4. Interplay between speech, emotion and speaker recognition

Acoustic features and language information contained within the acoustic, pronunciation and language models may be efficiently combined and used for speech, emotion and speaker recognition [5]. High-level features, e.g., phones, idiolect, semantics, accent and pronunciation, reveal speaker characteristics such as socio-economic status, language background, personality type, and environmental influence [6]. For speech recognition systems based on hidden Markov models in combination with Gaussian mixture models, numerous techniques have been developed for adapting the models to a specific speaker and acoustic condition [28]-[31]. They can be grouped into two classes, based on maximum a posteriori (MAP) estimation and maximum likelihood (ML) estimation, respectively. MAP-based adaptation interpolates the original prior parameter values with parameters obtained from the adaptation data, and thus the estimated parameters converge asymptotically to the adaptation domain as the amount of adaptation data increases [28]. However, in the case of sparse adaptation data, many model parameters remain unchanged [32].
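For the case of a Gaussian mean, a commonly used form of the MAP update (a standard textbook formulation in the spirit of [28], not a verbatim reproduction; here $\tau$ is a prior weighting constant, $\mu_0$ the prior mean, $x_t$ the adaptation frames, and $\gamma_t$ their occupation probabilities) is

$$\hat{\mu}_{\mathrm{MAP}} = \frac{\tau \mu_0 + \sum_t \gamma_t x_t}{\tau + \sum_t \gamma_t},$$

so the estimate moves from the prior mean toward the sample mean of the adaptation data as $\sum_t \gamma_t$ grows, which matches the asymptotic convergence described above.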
ML-based methods assume that there is a set of linear transformations which map the existing model parameters into new, adapted model parameters. Since they use linear transformations to map parameters, these methods are referred to as maximum likelihood linear regression (MLLR). MLLR can be applied only to the Gaussian mean vectors, or to both the mean vectors and the covariance matrices. A special case of MLLR in which the mean vector and the covariance matrix of a Gaussian share the same transformation matrix is called constrained MLLR. While the use of mean MLLR adaptation has the greatest positive impact, the use of variance MLLR adaptation may also bring a slight improvement in recognition accuracy [29]. The major advantage of MLLR over MAP adaptation is evident in the case of sparse adaptation data, where the same transformation can be applied to all Gaussians in the same acoustic class [32].

Alternatively, speaker adaptation can be achieved by transforming the features instead of the model parameters. The common procedure is vocal tract length normalization [33], [34]. The basic idea is to find, for each speaker, warp scales of the frequency axis such that the spectrum fits the spectrum of a universal speaker with a standard vocal tract length, and to apply that transformation to the features. In this way, within-class scattering and the overlapping between classes are reduced. It is interesting to note that constrained MLLR can be treated as a feature transformation, and that it is commonly used for speaker adaptive training. Models trained in this way may achieve higher recognition accuracy [35]. Additionally, the accuracy of an ASR system can be improved by adaptation of the language model, in terms of reducing the search space and the confusion between words [36], [37].

It is widely acknowledged that the speaker's emotional states affect the speech production system at several levels, from the higher levels of linguistic coding (word selection and sentence structure) to the lower levels of articulator movements (phoneme/word production). This, in turn, may significantly degrade the performance of ASR systems. In general, ASR performance and the prosodic properties of an utterance are related. Variations in speaking style and speaking rate, relative to the ASR training conditions, may have a negative impact on the performance of an ASR system [38]. Prosodic features reflect those variations, and some studies show that prosody itself is capable of re-ranking ASR hypotheses so as to separate correctly recognized utterances from incorrectly recognized ones [39], [40]. It can be expected that an ASR system using acoustic models trained on neutral speech will show reduced performance when it operates on emotional speech. Reference [41] shows that training ASR models on neutral speech and subsequently adapting them on emotional speech samples does have a positive impact on the recognition performance under such conditions.

In [11] and [42], we discuss how the same prosodic and spectral features can be employed for the purposes of speech recognition, emotion recognition and speaker recognition. Fig. 1 illustrates how knowledge from different sources is intended to be used in the reported speech processing module. The relationship between these technologies goes beyond prosodic and spectral features.
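The intended synergy, in which a single acoustic front-end serves all three recognizers and each recognizer may reuse the others' hypotheses, can be sketched as follows (a schematic illustration with hypothetical stub classes; these are not the actual interfaces of the reported system):

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the real recognizers; each consumes the same
# front-end features, and later stages may reuse earlier hypotheses.
class StubASR:
    def decode(self, features, speaker=None):
        return ["dobar", "dan"]          # dummy word hypothesis

class StubEmotionRecognizer:
    def classify(self, features, transcript=None):
        return "neutral"                 # dummy emotion label

class StubSpeakerRecognizer:
    def identify(self, features):
        return "speaker-01"              # dummy speaker identity

@dataclass
class SpeechProcessingModule:
    asr: StubASR
    emotion: StubEmotionRecognizer
    speaker: StubSpeakerRecognizer

    def process(self, features):
        spk = self.speaker.identify(features)
        # Speaker identity may select speaker-adapted acoustic/language models
        words = self.asr.decode(features, speaker=spk)
        # Emotion recognition may combine acoustic features with the lexical hypothesis
        emo = self.emotion.classify(features, transcript=words)
        return {"speaker": spk, "words": words, "emotion": emo}

module = SpeechProcessingModule(StubASR(), StubEmotionRecognizer(), StubSpeakerRecognizer())
print(module.process(features=[[0.0] * 48]))     # one dummy feature vector
```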
In the next section, we discuss how emotion recognition can employ lexical and discourse information provided by an ASR system.

Fig. 1 Combining knowledge from different sources in the speech processing module

3. EMOTION RECOGNITION BASED ON LINGUISTIC INFORMATION

Emotion recognition can also be based on lexical and discourse information [43], e.g., on a semantic analysis of an output hypothesis of an ASR engine [44]. In line with this, one line of our research focuses on recognition and tracking of the emotional states of the user from lexical information and other linguistic features.

As part of previous work [1], a substantial refinement of the Wizard-of-Oz technique was proposed so that a scenario designed to elicit affected speech in human-machine interaction could result in realistic and useful data. The NIMITEK corpus of affected speech in human-machine interaction was produced during a refined Wizard-of-Oz simulation. Ten healthy native German speakers participated in the study (7 female, 3 male, ages 18 to 27, mean 21.7). The corpus contains 15 hours of audio and video recordings. The number of the subjects' dialogue turns is 1,847, the average number of words per turn is 17.19 (with a standard deviation of 24.37), and the subjects' lexicon contains about 900 lemmata. The evaluation of the corpus with respect to the perception of its emotional content demonstrated that it contains recordings of emotions that were overtly signaled, and that the subjects' utterances are indicative of the way in which untrained, nontechnical users probably like to converse with conversational agents [1].

The transcribed version of the NIMITEK corpus was used to conduct a corpus-based examination of various linguistic features that may carry affect information [13]. For the purpose of this contribution, we illustrate the following linguistic features: key words and phrases, lexical cohesive agencies, dialogue act sequences, and negations. The most obvious way of recognizing an emotional state is to detect key words and phrases in users' utterances. Examples from the NIMITEK corpus are given in Table 1. However, expressions of emotions are not necessarily limited to a single dialogue act, but can also map over a range of mutually related dialogue acts. For example, the choice of lexical items made to create cohesion in the dialogue can signal an emotion-related state, both at the lexical level (e.g., repetitions) and at the semantic level (e.g., reformulations). Table 2 contains examples of repetitions and reformulations that signal negative emotional states. In contrast to this, another form of anaphoric cohesion in a dialogue is achieved by ellipsis-substitutions. The typical meaning of ellipsis-substitutions is not one of co-reference: there is always some significant difference between the second instance and the first [45]. To illustrate this, let us observe a typical example from the NIMITEK corpus: Please do it! (Bitte tu das!). This utterance does not explicitly provide information about what the system is expected to do, but it contains an ellipsis-substitution (the verb do) which is used to signal that the action the system performed is not the same as the action instructed by the user (indicated by the anaphoric reference it). In general, ellipsis-substitutions may signal a potential problem in communication.
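As a simplified illustration of keyword-based recognition (a sketch only, with a tiny hypothetical lexicon drawn from the examples in Table 1 below; the actual annotator from [13] considers further linguistic features), an ASR transcript can be matched against emotion-labeled key words and phrases:

```python
import re

# Tiny hypothetical emotion lexicon (German key words/phrases from Table 1)
EMOTION_LEXICON = {
    "annoyed": ["scheiße", "blöd", "tu was ich sage"],
    "retiring": ["ich versteh das nicht", "das geht doch gar nicht"],
    "indisposed": ["ich geh gleich", "oh man", "kein bock"],
    "satisfied": ["super", "geil", "bin gut"],
}

def annotate_utterance(transcript):
    """Return the set of emotion labels whose key words/phrases occur in the transcript.
    Zero, one or more labels may be attributed, mirroring mixed-emotion annotation."""
    text = re.sub(r"[^\w ]", "", transcript.lower())
    labels = set()
    for label, phrases in EMOTION_LEXICON.items():
        if any(phrase in text for phrase in phrases):
            labels.add(label)
    return labels

print(annotate_utterance("Oh man, das geht doch gar nicht!"))
# e.g. {'indisposed', 'retiring'}
```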
Table 1 Examples of key words and phrases that relate to emotional states (adopted and adjusted from [13]). Each row lists an emotional state and examples of the subjects' key words and phrases.

Annoyed: Sh*t (Sche*ße), stupid (blöd), Do what I say (Tu was ich sage), Oh ... something like this I hate just like the plague. (Oh... so was hasse ich doch wie die Pest.)
Retiring: I don't understand it (Ich versteh' das nicht), It's not working at all (Das geht doch gar nicht).
Indisposed: I am going now (ich geh' gleich), Oh man (Oh man), God (Gott), I don't feel like doing any more. (Ich hab' kein' Bock mehr.)
Offending: You think, doll. (Denkst du, Puppe)
Satisfied: Super (Super), awesome (geil), I am good, am I not? (Bin gut, was?)

Table 2 Examples of lexical cohesive agencies that relate to negative emotional states (adopted and adjusted from [13]). Each row lists a lexical cohesive agency and examples of the subjects' dialogue acts.

Repetition: It just cannot be. It just ... It just cannot be. (Das kann doch nicht sein. Das ist doch ... das kann doch nicht sein.)
Reformulation: Not true at all. That's definitely wrong. (Gar nicht wahr. Das stimmt gar nicht.)
Ellipsis-substitution: Please do it.

Based on this study, a prototypical automatic annotator for recognition and tracking of the user's emotional states from linguistic information was implemented [13]. It should be noted that the emotional states in the NIMITEK corpus [1] were conceptualized within the data-driven 6-class emotion model ARISEN (annoyed, retiring, indisposed, satisfied, engaged, neutral). In addition, the subjects' expressions in the NIMITEK corpus often contain mixed emotions, and the human evaluators were allowed to assign more than one emotion label to each subject's utterance. Thus, the automatic annotator was implemented to annotate mixed emotions, i.e., to attribute zero, one or more labels from the ARISEN model to each subject's utterance.

The results of the automatic annotation were compared with the results of the human evaluators. For the given 6-class emotion model ARISEN, the annotator showed the following performance: 31.70% of the subjects' emotional states were correctly recognized, 34.35% were not recognized, and 33.92% were incorrectly recognized. Furthermore, the ARISEN model was down-sampled to a model that differentiates between 3 emotional states, i.e., negative (including the annoyed, retiring and indisposed emotional states), neutral, and positive (including the satisfied and engaged emotional states). For this 3-class problem, the annotator showed the following performance: 51.20% of the subjects' emotional states were correctly recognized, 33.67% were not recognized, and 17.26% were incorrectly recognized. When interpreting these results, it should be kept in mind that the automatic annotation was based only on lexical information, while the human evaluators were influenced by prosody as well.

4. USER-ADAPTIVE DIALOGUE MANAGEMENT

The main idea underlying the conversational agent's adaptation is that its dialogue behavior is dynamically adapted according to the user and his/her emotional state. In this respect, the dialogue management module is the central component of the conversational agent. It consists of two components: the dialogue context model and adaptive dialogue control [46].

4.1. Dialogue context model

The dialogue context model keeps track of information relevant to the dialogue.
For the purpose of this contribution, it includes the following knowledge sources:

- the lexical and propositional content of the user's dialogue act,
- the attentional state,
- the emotional state of the user,
- information about the user.

Among these sources, the attentional state deserves further discussion. At the conceptual level, the attentional state contains information about the dialogue entities that are most salient at any given point. Its purpose is twofold [2], [47]. First, it summarizes information from previous dialogue acts that is necessary for processing subsequent ones, and it allows for processing spontaneously produced users' dialogue acts. This is an important characteristic of the system, not just because it enables a more natural dialogue, but also because forcing users to follow a preset grammar or interaction scenario is hardly acceptable for users in negative emotional states. The second purpose of the attentional state is that it allows for predicting the dialogue behavior of the user, i.e., it forms the basis for expectations about the succeeding dialogue acts. This information is important both for automatic speech recognition, as a means of reducing the set of ASR hypotheses, and for adaptive dialogue control, for taking initiative in a dialogue.

In [2], we introduced a representational model of attentional information in human-machine interaction that provides a framework for more robust natural language processing and dialogue management. This model integrates a neurocognitive understanding of the focus of attention in working memory, the notion of attention related to the theory of discourse structure in the field of computational linguistics, and an investigation of the NIMITEK corpus. To the extent that it is computationally appropriate, it was successfully adapted and applied in several prototypical conversational agents with diverse domains of interaction [14], including the dialogue management module reported in this paper.

Fig. 2 The intended architecture of a conversational agent

4.2. Adaptive dialogue control

The dialogue control component implements the dialogue strategies of the conversational agent. In general, a dialogue strategy involves deciding what to do next once the user's input has been received and interpreted, e.g., prompting the user for more input, clarifying the user's previous input, outputting information to the user, etc. [46]. We recall that the reported conversational agent is adaptive to the extent that it dynamically adapts its dialogue strategies according to the current user and his/her emotional state. Therefore, an adaptive dialogue strategy is specified by means of a set of rules that take information about the current dialogue context into account. We build upon previous work on emotion-adaptive dialogue strategies and on end-user design of adaptive dialogue strategies. It is important to note that the reported dialogue management module allows the end-user to design dialogue strategies. This makes two levels of adaptation possible: the dialogue behavior is not only dynamically adapted according to the current dialogue strategy, but the dialogue strategy itself can also be redefined by the user. For a detailed discussion, the reader may consult [16], [17].

5. CONCLUDING REMARKS

This paper summarized some aspects of our previous work and presented work in progress on developing user-aware and adaptive conversational agents. The intended architecture of a conversational agent is given in Fig. 2.
The speech recognition module and the dialogue management module (integrated with the natural language processing modules) are fully implemented, while the emotion recognition and speaker recognition modules are implemented at a prototype level.

Current and future prospects of our research in this field include (but are not limited to): further investigation of the interplay between speech recognition, emotion recognition and speaker recognition; investigation of linguistic cues for early recognition of negative dialogue developments; further development of dialogue strategies for preventing and handling negative dialogue developments; and investigation of more complex user models and alternative models of emotions.

Acknowledgement: The presented study was performed as part of the project "Development of Dialogue Systems for Serbian and Other South Slavic Languages" (TR32035), funded by the Ministry of Education, Science and Technological Development of the Republic of Serbia.

REFERENCES

[1] M. Gnjatović and D. Rösner, "Inducing Genuine Emotions in Simulated Speech-Based Human-Machine Interaction: The NIMITEK Corpus". IEEE Transactions on Affective Computing, vol. 1, no. 2, pp. 132-144, July-Dec. 2010, DOI: 10.1109/T-AFFC.2010.14
[2] M. Gnjatović, M. Janev, V. Delić, "Focus Tree: Modeling Attentional Information in Task-Oriented Human-Machine Interaction". Applied Intelligence, vol. 37, no. 3, pp. 305-320, 2012, DOI: 10.1007/s10489-011-0329-5
[3] D. Bohus and A. Rudnicky, "Sorry, I Didn't Catch That! An Investigation of Non-Understanding Errors and Recovery Strategies". In Recent Trends in Discourse and Dialogue, vol. 39 of Text, Speech and Language Technology, pp. 123-154, Springer, 2008.
[4] C.H. Lee, "Fundamentals and Technical Challenges in Automatic Speech Recognition". In Proc. of the 12th International Conference Speech and Computer, SPECOM 2007, pp. 25-44, Moscow, Russia, 2007.
[5] B. Schuller, G. Rigoll, M. Lang, "Speech Emotion Recognition Combining Acoustic Features and Linguistic Information in a Hybrid Support Vector Machine-Belief Network Architecture". In Proc. of ICASSP 2004, vol. 1, pp. I-577-580, 2004, DOI: 10.1109/ICASSP.2004.1326051
[6] T. Kinnunen and L. Haizhou, "An Overview of Text-Independent Speaker Recognition: From Features to Supervectors". Speech Communication, vol. 52, pp. 12-40, 2010, DOI: 10.1016/j.specom.2009.08.009
[7] V. Delić, M. Sečujski, N. Jakovljević, M. Gnjatović, I. Stanković, "Challenges of Natural Language Communication with Machines". Chap. 19 in DAAAM International Scientific Book 2013, pp. 371-388, 2013, DOI: 10.2507/daaam.scibook.2013.19
[8] N. Jakovljević, D. Mišković, M. Janev, M. Sečujski, V. Delić, "Comparison of Linear Discriminant Analysis Approaches in Automatic Speech Recognition". Electronics and Electrical Engineering, vol. 19, no. 7, pp. 76-79, 2013, DOI: 10.5755/j01.eee.19.7.5167
[9] I. Jokić, S. Jokić, Z. Perić, M. Gnjatović, V. Delić, "Influence of the Number of Principal Components used to the Automatic Speaker Recognition Accuracy". Electronics and Electrical Engineering, vol. 18, no. 7, pp. 83-86, 2012, DOI: 10.5755/j01.eee.123.7.2379
[10] I. Jokić, S. Jokić, V. Delić, Z. Perić, "Towards a Small Intra-Speaker Variability Models". Electronics and Electrical Engineering, vol. 20, 2014 (in press).
[11] V. Delić, M. Bojanić, M. Gnjatović, M. Sečujski, S.T. Jovičić, "Discrimination Capability of Prosodic and Spectral Features for Emotional Speech Recognition". Electronics and Electrical Engineering, vol. 18, no. 9, pp. 51-54, 2012, DOI: 10.5755/j01.eee.18.9.2806
[12] M. Bojanić, V. Delić, M. Sečujski, "Relevance of the types and the statistical properties of features in the recognition of basic emotions in speech". Facta Universitatis, Series: Electronics and Energetics, vol. 27, 2014 (in press).
[13] M. Gnjatović, M. Kunze, X. Zhang, J. Frommer, D. Rösner, "Linguistic Expression of Emotion in Human-Machine Interaction: The NIMITEK Corpus as a Research Tool". In Proceedings of the 4th Int. Workshop on Human-Computer Conversation, Bellagio, Italy, no pagination, 2008.
[14] M. Gnjatović and V. Delić, "A Cognitively-Inspired Method for Meaning Representation in Dialogue Systems". In Proc. of the 3rd IEEE Int. Conf. CogInfoCom 2012, Košice, Slovakia, pp. 383-388, 2012.
[15] M. Gnjatović and V. Delić, "Electrophysiologically-Inspired Evaluation of Dialogue Act Complexity". In Proc. of the 4th IEEE Int. Conf. CogInfoCom 2013, Budapest, Hungary, pp. 167-172, 2013.
[16] M. Gnjatović and V. Delić, "Cognitively-inspired representational approach to meaning in machine dialogue". Knowledge-Based Systems, DOI: 10.1016/j.knosys.2014.05.001, 2014.
[17] M. Gnjatović, "Therapist-Centered Design of a Robot's Dialogue Behavior". Cognitive Computation, Special issue: The quest for modeling emotion, behavior and context in socially believable Robots and ICT interfaces, Springer, DOI: 10.1007/s12559-014-9272-1 (in press).
[18] S.J. Young, J. Odell, P.C. Woodland, "Tree-based state tying for high accuracy acoustic modelling". In Proceedings of the Workshop on Human Language Technology, pp. 307-312, 1994, DOI: 10.3115/1075812.1075885
[19] N. Jakovljević, D. Mišković, E. Pakoci, T. Grbić and V. Delić, "Poređenje performansi nekoliko varijanata GMM u sistemima za prepoznavanje govora" [Comparison of the performance of several GMM variants in speech recognition systems]. In Proc. of the 21st Telecommunications Forum, TELFOR 2013, Belgrade, Serbia, pp. 466-469, 2013.
[20] M. Janev, D. Pekar, N. Jakovljević, V. Delić, "Eigenvalues driven Gaussian selection in continuous speech recognition using HMMs with full covariance matrices". Applied Intelligence, vol. 33, no. 2, pp. 107-116, 2010, DOI: 10.1007/s10489-008-0152-9
[21] B. Popović, M. Janev, D. Pekar, N. Jakovljević, M. Gnjatović, M. Sečujski, V. Delić, "A novel split-and-merge algorithm for hierarchical clustering of Gaussian mixture models". Applied Intelligence, vol. 37, no. 3, pp. 377-389, 2012, DOI: 10.1007/s10489-011-0333-9
[22] N. Jakovljević, Primena retke reprezentacije na modelima Gausovih mešavina koji se koriste za automatsko prepoznavanje govora [Application of sparse representation to Gaussian mixture models used for automatic speech recognition], PhD thesis, University of Novi Sad, March 2014.
[23] V. Delić, M. Sečujski, N. Jakovljević, D. Pekar, D. Mišković, B. Popović, S. Ostrogonac, M. Bojanić, D. Knežević, "Speech and Language Resources within Speech Recognition and Synthesis Systems for Serbian and Kindred South Slavic Languages". In Proc. of SPECOM 2013, Pilsen, Czech Republic, LNCS, vol. 8113, Springer, pp. 319-326, 2013, DOI: 10.1007/978-3-319-01931-4_42
[24] S. Ostrogonac, M. Sečujski, V. Delić, D. Mišković, N. Jakovljević, N. Vujnović Sedlar, A Mixed-Structure N-gram Language Model, Axon - inteligentni sistemi, Novi Sad, Serbia. International patent pending: PCT/RS2013/000009
[25] N. Jakovljević, D. Mišković, M. Janev, D. Pekar, "A Decoder for Large Vocabulary Speech Recognition". In Proc. of the 18th International Conference on Systems, Signals and Image Processing, IWSSIP 2011, Sarajevo, Bosnia and Herzegovina, pp. 287-290, 2011.
[26] M. Bojanić, M. Gnjatović, M. Sečujski, V. Delić, "Application of dimensional emotion model in automatic emotional speech recognition". In Proc. of the 11th IEEE Int. Symp. on Intelligent Systems and Informatics, SISY 2013, Subotica, Serbia, pp. 353-356, 2013, DOI: 10.1109/SISY.2013.6662601
[27] S.T. Jovičić, Z. Kašić, M. Djordjević, M. Rajković, "Serbian emotional speech database: design, processing and evaluation". In Proc. of SPECOM 2004, St. Petersburg, pp. 77-81, 2004.
[28] J. Gauvain and C.H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains". IEEE Trans. on Speech and Audio Process., vol. 2, no. 2, pp. 291-298, Apr. 1994, DOI: 10.1109/89.279278
[29] M.J.F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition". Computer Speech & Language, vol. 12, no. 2, pp. 75-98, 1998, DOI: 10.1006/csla.1998.0043
[30] M.J.F. Gales and P.C. Woodland, "Mean and variance adaptation within the MLLR framework". Computer Speech & Language, vol. 10, no. 4, pp. 249-264, 1996, DOI: 10.1006/csla.1996.0013
[31] D. Povey and G. Saon, "Feature and model space speaker adaptation with full covariance Gaussians". In Proc. Interspeech 2006, paper 2050-Tue2BuP.14, 2006.
[32] M.J.F. Gales and S. Young, "The application of hidden Markov models in speech recognition". Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195-304, 2008, DOI: 10.1561/2000000004
[33] N. Jakovljević, D. Mišković, M. Sečujski, D. Pekar, "Vocal tract normalization based on formant positions". In Proc. Inter. Language Technologies Conference IS-LTC 2006, Ljubljana, pp. 40-43, 2006.
[34] N. Jakovljević, M. Sečujski, V. Delić, "Vocal tract length normalization strategy based on maximum likelihood criterion". In Proc. EUROCON 2009, St. Petersburg, pp. 417-420, 2009, DOI: 10.1109/EURCON.2009.5167662
[35] G. Saon and J.T. Chien, "Large-vocabulary continuous speech recognition systems". IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 12-33, Nov. 2012, DOI: 10.1109/MSP.2012.2197156
[36] J.M. Lucas-Cuesta, J. Ferreiros, F. Fernandez-Martinez, J.D. Echeverry, S. Lutfi, "On the dynamic adaptation of language models based on dialogue information". Expert Systems with Applications, vol. 40, no. 4, pp. 1069-1085, 2013, DOI: 10.1016/j.eswa.2012.08.029
[37] W. Kim, Language model adaptation for automatic speech recognition and statistical machine translation, PhD thesis, Johns Hopkins University, 2005.
[38] L. ten Bosch, "Emotions: what is possible in the ASR framework". In Proc. of the ITRW on Speech and Emotion, Northern Ireland, UK, pp. 189-194, 2000.
[39] J. Hirschberg, D. Litman, M. Swerts, "Prosodic and other cues to speech recognition failures". Speech Communication, vol. 43, pp. 155-175, 2004.
[40] D. Litman, J. Hirschberg, M. Swerts, "Predicting automatic speech recognition performance using prosodic cues". In Proc. of the 1st North American Chapter of the Association for Computational Linguistics, NAACL, Seattle, pp. 218-225, 2000.
[41] B. Vlasenko, D. Prylipko, A. Wendemuth, "Towards robust spontaneous speech recognition with emotional speech adapted acoustic models". In S. Wölfl (ed.), Poster and Demo Track of the 35th German Conference on Artificial Intelligence, KI-2012, Saarbrücken, Germany, pp. 103-107, 2012.
[42] B. Popović, I. Stanković, S. Ostrogonac, "Temporal Discrete Cosine Transform for Speech Emotion Recognition". In Proc. of the 4th IEEE Int. Conf. CogInfoCom 2013, Budapest, Hungary, pp. 87-90, 2013.
[43] C.M. Lee and S.S. Narayanan, "Toward detecting emotions in spoken dialogs". IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, pp. 293-303, 2005, DOI: 10.1109/TSA.2004.838534
[44] R. Müller, B. Schuller, G. Rigoll, "Enhanced Robustness in Speech Emotion Recognition Combining Acoustic and Semantic Analyses". In Proc. of the Workshop From Signals to Signs of Emotion and Vice Versa, Santorini, Greece, 2004.
[45] M. Halliday, An Introduction to Functional Grammar, Second edition, Edward Arnold, London/New York, 1994.
[46] K. Jokinen and M. McTear, Spoken Dialogue Systems. Synthesis Lectures on Human Language Technologies, Morgan and Claypool, 2009.
[47] B. Grosz and C. Sidner, "Attention, intentions, and the structure of discourse". Computational Linguistics, vol. 12, no. 3, pp. 175-204, 1986.