FMIS2011Huang_et_al_REF Electronic Communications of the EASST Volume 45 (2011) Guest Editors: Judy Bowen, Steve Reeves Managing Editors: Tiziana Margaria, Julia Padberg, Gabriele Taentzer ECEASST Home Page: http://www.easst.org/eceasst/ ISSN 1863-2122 Proceedings of the Fourth International Workshop on Formal Methods for Interactive Systems (FMIS 2011) Capturing the distinction between task and device errors in a formal model of user behaviour H. Huang, R. Rukšėnas, M.G.A. Ament, P. Curzon, A.L. Cox, A. Blandford, D. Brumby 16 pages ECEASST 2 / 17 Volume 45 (2011) Capturing the distinction between task and device errors in a formal model of user behaviour H. Huang1 (huayih@eecs.qmul.ac.uk), R. Rukšėnas1, M.G.A. Ament2, P. Curzon1, A.L. Cox2, A. Blandford2, D. Brumby2 1Queen Mary University of London School of Electronic Engineering and Computer Science 2University College London, UCL Interaction Centre Abstract: In any complex interactive human-computer system, people are likely to make errors during its operation. In this paper, we describe a validation study of an existing generic model of user behaviour. The study is based on the data and conclusions from an independent prior experiment. We show that the current model does successfully capture the key concepts investigated in the experiment, particularly relating to results to do with the distinction between task and device-specific errors. However, we also highlight some apparent weaknesses in the current model with respect to initialisation errors, based on comparison with previously unpublished (and more detailed) data from the experiment. The differences between data and observed model behaviour suggest the need for new empirical research to determine what additional factors are at work. We also discuss the potential use of formal models of user behaviour in both informing, and generating further hypotheses about the causes of human error. Keywords: human error, formal models of human behaviour, cognition. 1 Introduction In the complex daily working environment of medical professionals, errors are occasionally made. Often these have only minor consequences, but sometimes much more costly outcomes result. Recent research, particularly in the field of human-computer interaction, has promoted a shift away from the perspective where blame is attached entirely to the individual when mistakes are inevitably made. This change in perspective has been partly motivated by the realisation that people are not, and cannot reasonably be expected to be, completely infallible. This is especially true in high-stress environments such as the hospital workplace. It is also in these situations that the costs of errors, both financially and in human terms, are often the highest. One way of systematically detecting the potential for these somewhat rare but costly errors is by using plausible models of human cognition. However, the frame problem [MH69] must be dealt with when modelling people. In terms of logical models, this relates to the need to include additional ‘common-sense’ axioms in order to generate plausibly realistic patterns of inference. In the context of modelling cognition in general, it relates to the precise scope, form and amount of common-sense knowledge necessary for a cognitively plausible model. Also, when constructing a model of reality hard decisions have to be taken about which aspects of reality are abstracted over and which are represented with relatively high fidelity. To enable the construction of a useful model of human cognition, we need to be clear from the outset about the problem we are trying to solve with the model. Here we try to detect systematic medical device errors. In particular the class of errors that may be rare, but are also not completely random. The errors of interest are ones which at least in principle could be Capturing the distinction between task and device errors Proc. FMIS 2011 3 / 17 predicted, a priori, to plausibly occur or reoccur in the future due to systematic cognitive causes. We are concerned most with those errors that have high-cost consequences associated with them. Our ultimate aim is to have a system that can highlight design flaws that may lead to such preventable errors. For the purposes of this paper, we define an error simply as an action that deviates from some prescribed sequence for achieving the intended outcome. However, this is only one of several plausible definitions. Hollnagel [Hol05] gives a more detailed discussion of some of these alternative definitions of error. 1.1 Related work In the past, a variety of experimental work has identified some of the causes of human error, such as Byrne and Bovair’s work demonstrating the effect of working memory on post completion errors (Byrne & Bovair, [BB97]). Here we focus on task and device errors based on the experiments of Ament et al. [ACBB10]. This particular distinction dates back to various earlier works such as that of Cox & Young [CY00], Kirschenbaum et al. [KGEM96] and Hiltz et al. [HBB10]. The work presented in this paper is not specifically concerned with generating new experimental results but in supporting such work, as well as creating predictive formal tools that allow deeper explanations for observed human behaviour. In terms of using formal modelling to support empirical work we build on Ruksenas et al. [RBCB09] where model checking was used to help understand experimental results about human error and whether all relevant concepts were included in the explanations given. Su et al. [SBB07, SBBW09] adopt a similar idea using formal models of low level cognitive concepts related to the attentional blink phenomena. Their work differs to ours in being based on lower level aspects of human cognition and on simulation-based model exploration. A simple approach to the modelling of plausible user behaviours involves writing both a formal specification of the device and task models for that device, to support reasoning about the behaviour of the interactive system [MD95, Fie01]. Task models, however, describe only a small fraction of real user behaviour - in particular they do not really deal with human fallibility. An alternative approach is to specify users as they are (Butterworth et al. [BBD00]). This is the idea underlying alternative approaches to formal user modelling [DD99, DBMD95]. By modelling the user ‘as is’, we can gain many insights into how and why specific user behaviour are generated/observed. To reason about the behaviour of an interactive system, a formal user model is combined with a formal specification of the device. Both models are then considered as central components of the whole integrated system in an approach known as syndetic modelling [DBDM98]. Ruksenas et al. [RBCB09] present one example of this approach. They model human cognition as a set of production rules expressed in higher-order logic. The generation of the next plausible user-behaviours then involves computing an ‘overall salience value’ for each possible user-action expressed in the model. A key difference between their model and most simulation-based models of human cognition is the underlying assumption of non- deterministic user behaviour throughout. Their model operates based upon sets of cognitively plausible user-actions at each point in the interaction. This allows for simultaneous exploration of the consequences of all of the plausible user-actions that could be taken at each point. This contrasts with most simulation-based approaches where only a single path of potential user- actions is explored at once. ECEASST 4 / 17 Volume 45 (2011) Ruksenas et al. [RBCB09] also describe a case study based on a fire engine dispatch scenario using their generic user model (GUM). We present here a follow-on case study to further validate the model, investigating the behaviour and performance of the model using a different scenario and therefore under a different instantiation. As part of the process, we also gained valuable practical experience and insight into the utility of the iterative model-refinement method suggested by Ruksenas et al. [RBCB09]. 1.2 Choice of scenario We based this study on the experiment of Ament et al. [ACBB10]. It presents experimental data about the effect of memory load on device and task based errors. It is based on an experimental micro-world concerned with a doughnut-making task. This context provides a good proving ground for validation and further refinement of the GUM as it involves both an independent setting, and the investigation of a different aspect of human error. Furthermore the kind of errors considered (device and task based slip errors) fall within the intended scope of the generic model, but were not explicitly designed into it. As noted earlier, the class of errors that we are interested in are systematic deviations from a prescribed action-sequence as a result of slips, rather than say lack of skill or knowledge. In the case of Ament et al. [ACBB10], the intended outcome was to fulfil orders for doughnuts through following a single prescribed sequence of actions. The errors investigated were deviations from that sequence despite the participants being trained and having demonstrated their knowledge of, as well as ability to do the task. 1.3 Research questions and contribution The first question that we address in this paper is whether the GUM is currently expressive enough to encapsulate all the concepts relating to human cognition as presented in Ament et al. [ACBB10]. We investigate the completeness and appropriateness of the current concept- space as defined in the GUM with respect to the concepts addressed in the experiment. Our second question concerns the model-checking approach adopted for the GUM implementation and whether it agrees with the results of the experiment, assuming the particular conceptual mappings modelled. Our initial hypothesis was that the model would be able to replicate the results of the experiment as published concerning the link between device/task-errors and load. The initial aim was thus to provide evidence to further validate the model against these new results, and if it could not, to investigate how the model needed to be improved to capture the underlying behaviour. Finally we consider the potential of this model-checking approach to generate both suggestions for further refinement of the model, and to motivate further empirical investigations. The contributions of this paper are essentially two-fold. We explore the results of validating the generic user model on an independent set of experiments not directly designed for this validation. We also demonstrate how the use of formal methods, and model checking in particular, can generate ideas for model refinement and further experimental investigation. 1.4 Overview of the paper Here we give a brief overview of the rest of the paper. We start by describing the experiment used for validation (Section 2) as well as the key concepts of the generic user model (Section 3). We then, in Section 4, specify the way in which we interpreted the experimental conditions outlined in Ament et al. [ACBB10]. In particular, we note the assumptions made as part of this Capturing the distinction between task and device errors Proc. FMIS 2011 5 / 17 interpretation process. In Section 5, we compare the behaviour of the model with the findings in Ament et al. [ACBB10], and show that the model demonstrates good agreement with the experiment in terms of distinguishing between task and device errors, as well as distinguishing between initialisation and post completion errors (a sub-categorisation of device errors). We then show, based on previously unpublished results from the experiment, that the model does not however predict initialisation errors correctly at the level of the individual steps. In Section 6, model checking is then used to further explore the discrepancies to understand the reasons for the observed model behaviour, leading to suggestions for further experiments. Finally, we draw conclusions in Section 7. 2 The Doughnut Machine Experiment We first present a brief overview of the experiment used in this validation. Due to space constraints we only include the minimum details for understanding the rest of this paper. Interested readers should refer to Ament et al. [ACBB10] for more details. They investigated how working memory load effects task and device-specific actions by using a doughnut-making task. In their paper, a task-specific action is defined as one that is central to the task. In particular it is an action that is perceived to move the participant closer to the completion of the main task. In contrast, a device-specific action is defined as specific to a particular device, and is not typically common to different devices used for carrying out the same task; Device actions are not perceived to move participants closer to main-task completion. There were two independent variables in the experiment, memory load which can take values of high or low, and type of action which can be either device or task specific. The four combinations of these two independent variables affected the dependent variable – which was the rate of error for each of the different steps in the action sequence. Memory load was varied between participants, and the step-type, either device or task, was varied within participants. Each step in the sequence was pre-defined as either a task or device action. On the next page, a screenshot of the main interface is given in Figure 2.1, with the corresponding hierarchical task decomposition of the correct sequence of actions for the doughnut task given in Figure 2.2. Figure 2.2 also indicates the distinction between task and device specific actions used in the experiment. With the coloured rectangles and arrows (6 in all) denoting the steps of the task which were designated device actions, with the rest designated as task actions. The device actions consist of the five initialisation steps (i.e. steps 2.2.1.1.1, 2.2.2.1.1, 2.2.3.1.1, 2.2.4.1.1, 2.2.5.1.1) and a final post-completion step (2.4.1). In this figure, rectangles represent actual actions taken on the device. To successfully complete the task of making the doughnut, the user had to work through the five data entry areas in the prescribed order, shown by subtasks 2.2.1 to 2.2.5 - with similar sub-steps for each (corresponding to the five outer areas in Figure 2.1, with the user having to select the appropriate radio button from the ‘selector’ area on the right hand side of Figure 2.1 before being able to enter the relevant data for each of these subtasks each time). Apart from step 2.4.1 (click ‘clean’), mistakes at any of the other steps were immediately pointed out and corrected by the experimenter. The results from the experiments suggest that error rates for device actions are significantly higher than for task actions under both load conditions. Additionally, while error rates on the ECEASST 6 / 17 Volume 45 (2011) task-steps remained low under a high memory load, the error-rates for device-steps increased significantly under a higher memory load - as shown in Figure 2.3. Figure 2.1: The main interface of the doughnut machine. Figure 2.2: A hierarchical task decomposition of the actions involved in fulfilling an order (taken from Ament et al. [ACBB10]). Capturing the distinction between task and device errors Proc. FMIS 2011 7 / 17 Figure 2.3: Error rates across different working memory load and type of step conditions. Error bars represent the standard error of the mean (taken from Ament et al. [ACBB10]). 3 The Generic User Model – Key Concepts In this section we present a short overview of the key concepts in the generic user model of Ruksenas et al. [RBCB09]. The idea behind this approach is to try to encapsulate cognitive principles that are common to all device interactions within a generic parameterisable framework, which can be instantiated for particular scenarios. The ultimate aim is to create a predictive model that enables us to reason about plausible user behaviour with any arbitrary device. The generic user model consists of five principal parameters, as described below. Procedural cueing: This parameter primarily deals with the idea of habitual cueing, i.e. that of a particular action following from the previous one. It also includes other kinds of unconscious or ‘learnt’ sequences of behaviour due to past experience. In general, this kind of cue could be either due to prior instruction from a third party, and/or due to past usage or experimentation by the user; it deals primarily with the (unconscious) sequential ordering of task-steps. Cognitive cueing: This parameter deals with actions that spring to mind due to the importance of the action for successful task completion. For example, an action central to the successful completion of the main task would typically have a relatively high cognitive cue. Sensory cueing: This parameter represents cueing derived from sensory sources. It currently represents the combined influence of all of the five senses. Examples of relevant sensory cues include visual interface aspects such as using colour to highlight particular actions (like a big red button for aborting the task for example), or the use of sound to draw attention to an action. Intrinsic load: This parameter is related to the inherent ‘difficulty’ of a task or action. For example, doing difficult mathematical calculations as a task-step would have a much higher intrinsic load than multiple-choice check boxes. Extraneous load: This parameter covers ‘external’ influences of the context in which the task is taking place. This includes concepts relating to memory load, and other interfering aspects of the external environment, such as from a visually or acoustically complex environment. ECEASST 8 / 17 Volume 45 (2011) Salience To determine the next set of actions taken by the model at some specific point in an unfolding scenario, the five ‘raw’ parameters outlined previously are ‘summed’ via a production-rule system to derive an associated salience value for each of the potential next user-actions. For details of the actual rules see Curzon et al. [CRB10]. Figure 3.1 shows the functional relationship between these cues and loads, and the overall salience. For example, in the current model, procedural salience is affected only by procedural cueing and the intrinsic load of a task or action. Intermediate ‘grouping’ concepts (procedural, cognitive and sensory salience) give a way for the model to be at a relatively high level of abstraction. It also improves the ease with which the model may be understood conceptually. In addition, these three intermediate concepts are used as a means for the model to be context-sensitive. Figure 3.1: How the ‘raw’ parameters influence the overall salience of a user-action, via the three intermediate concepts. The arrows show the direction of influence. 4 Interpreting the Experimental Conditions in terms of the GUM In this section, the way in which we interpreted the experimental conditions is described. We discuss the rationale behind our particular interpretation, as well as stating the assumptions made. 4.1 Concept mapping of the five principal parameters In order to carry out a detailed comparison between the experimental data and the GUM instantiation, it was necessary to decide precisely how to map concepts from the experiment to our instantiation. We now present in detail our interpretation of the concepts and experimental context discussed in Ament et al. [ACBB10]. Note that these five principal parameters in the GUM are currently assigned values from a binary range. Procedural cueing was assumed to be initially present (i.e. set with a ‘high’ value) between all pairs of actions in the prescribed (‘correct’) action sequence only. This reflected the fact that each of the users had prior instruction, as well as a period of guided exploration before the actual experiment. Data from a small number of users was also excluded from the analysis presented in Ament et al. because they had not learnt the task well enough, reinforcing the fact that those whose data was used were indeed trained to follow the correct sequence. In both the paper and the modelling presented here, we are interested only in slip errors with an Capturing the distinction between task and device errors Proc. FMIS 2011 9 / 17 identifiable underlying cognitive cause, not ones occurring as a result of lack of knowledge or skill. All other pairs of actions were assigned ‘low’ procedural cueing values. The task/device-action distinction of Ament et al. was represented by the cognitive cueing parameter of the generic user model. A weak value was given to the device-specific actions whilst a strong one was given to task-specific actions; actions relating specifically to the task at hand were deemed to have a higher cognitive salience than ones that are only incidental actions required by the specific device. We mapped this concept according to their task/device classification - observed in the hierarchical task decomposition presented in Figure 2.2. Due to the relatively bland nature of the doughnut machine interface, the values set for sensory cueing were based essentially on only positioning and size. Only two kinds of actions were defined with strong sensory cues. The steps that involved filling in data (Steps 2.2.1.2.1 in Figure 2.2, for example) were given strong sensory cueing due to their relatively large size. We, and Ament et al. both group the individual interface elements where the data was actually entered holistically into one atomic action. Secondly, the final confirmation step (after the doughnuts were made) was given high sensory cueing, as this was a typical popup modal dialog box, which allowed the user to proceed with other actions only after its dismissal. All other actions were defined with a weak sensory cue. For intrinsic load, only the subtasks involving some mental arithmetic were assigned a high load value. This was due to the relatively more complex data entry steps within the dough-port and fryer-port subtasks (see Figure 2.1). All other subtasks were defined with a low intrinsic load. The variation in memory load was reflected in the model instantiation by varying the values for extraneous load. A high value (rather than a low one) corresponded to the high memory- load situation, where the user was required to actively monitor additional secondary information fading in and out on the horizontal panel near the bottom of the screen whilst carrying out the main task (see bottom of Figure 2.1). 4.2 Concept mapping – task grouping The GUM provides a facility for grouping individual user-goals into larger collections of mini-tasks. For our instantiation, the three steps taken within each port (i.e. activate / fill / confirm) were grouped into mini-tasks, one for each of the five ports. The remaining user actions were grouped into a sixth collection. These groupings have some influence on the selection process for the next set of actions at each step – mainly by affecting cognitive cue levels. However this is not a major aspect of the model. Although we had decided on the groupings reasonably independently, the configuration eventually chosen do roughly match the hierarchical task-decomposition presented in Figure 2.2. 4.3 Other assumptions Several other assumptions were made. We modelled the system as though there was no potential for errors or mismatches relating to the user’s direct perception of the effects of visible device actions. We also assumed that there were no misinterpretations in their perception of the effects of their actions on the device. This does not rule out all misunderstandings of a user, it simply means that we model that when they press a button, that button is always actually pressed according to expectations, and perceived as such. However, the user model can still potentially end up with ‘misconceptions’ relating to the internal state of the device. The assumption about the user’s perception was deemed appropriate because the ECEASST 10 / 17 Volume 45 (2011) doughnut machine is quite a simple device and the participants were trained in its use, so there is unlikely to be misunderstandings about the role of each interface element. A further assumption was that once defined for the instantiation, the particular pattern of parameter assignments would not be subject to arbitrary dynamic alterations. This reflects the fact that the experiment was carried out under strictly controlled circumstances – so there were no unexpected mid-experiment perturbations to the environment, caused by events such as disruptions etc. We also assumed that in general, users would try to make full use of what few interface cues were available. We therefore took a strictly minimalist position with respect to the necessity of adding extra ‘memory variables’ for the user model, assuming no additional complexity regarding user-memory unless absolutely necessary. 5 Results In this section, an analysis of the behaviour obtained with this instantiation of the GUM is presented. We start in Section 5.1 by comparing with the experimental data in terms of task and device-specific actions only, and determine whether the results of the model match the conclusions of Ament et al. [ACBB10]. In Section 5.2 we look in more detail at the two specific kinds of device errors observed, notably initialisation and post-completion errors. This is followed by a more detailed step-by-step analysis in Section 5.3 comparing the results from the model against more detailed, previously unpublished data from the experiment. This allowed a more fine-grained analysis of the model’s performance. All results presented here were obtained by following the interpretation of experimental concepts and conditions outlined in Section 4. 5.1 Task versus device errors The original research question we set out to answer was whether our model replicates the behaviour as exemplified in the conclusions of Ament et al. [ACBB10]. Their primary conclusion was that error rates were significantly higher for device actions than task actions. We address this first. The GUM is intended to predict only errors that are systematic, i.e. errors that have some underlying cognitive cause, and are not random, one-off errors. Furthermore the model is an abstraction of the actual underlying causes. It therefore works at a certain level of detail. The idea is for the model to highlight plausible situations where design error is likely to be the cause. It is essentially a binary threshold model however, so does not rank erroneous actions in terms of probabilistic estimates of likelihood. Rather, either the model can make an error in a given situation or it cannot. Ament et al. [ACBB10] found that under both memory load conditions, device errors were significantly higher than task errors. Therefore the model ought to (in the first instance) be able to correctly predict that these errors are possible in systems such as the doughnut machine. The results of the model checking are summarised in Table 5.1. We can see that the model makes device errors under both load conditions and makes task errors in neither. Thus it predicts that device errors will be made, and predicts that a significant number of task step errors will not be made. This matches the main conclusion of Ament et al. [ACBB10]. The threshold nature of the model means that it does not make the further distinction as drawn by Ament et al. that a higher memory load leads to more errors. Whatever the load, devices errors Capturing the distinction between task and device errors Proc. FMIS 2011 11 / 17 are made, so a predictive tool needs to highlight this. For a real system (rather than an experimental one as in this case), the system design ought to then be fixed accordingly. Figure 5.1: Our interpretation of the behaviour to expect from the model. Task-specific actions Device-specific actions Low load No error made by model √ Some errors made by model √ High load No error made by model √ Some errors made by model √ Table 5.1: Comparison of the model’s behaviour with the experimental results (√ indicates model behaviour matches experimental result) 5.2 Initialisation errors and post completion errors The previous section shows that our model can indeed capture the device and task distinction, and make the appropriate predictions at this level of granularity. However, this categorisation potentially groups together a number of different kinds of errors. Those errors are actually made due to different rules of the model activating. In particular, both initialisation errors and post completion errors fall under the grouping of ‘device-specific’ errors. Ideally, as well as predicting the potential for error, the model should suggest how the system design might be fixed. It is therefore important not just that the model can match the results in general, but that it is able to also accurately predict the particular kind of device errors made. We therefore now focus on a more fine-grained analysis with respect to the specific kinds of device errors made, to gain a more detailed understanding of the validity of the behaviour of our model. The device errors identified by Ament et al. [ACBB10] compose of post completion, as well as initialisation steps. They put the overall error rate for initialisation steps at just over 27%; the post completion error rate was also found to be high at over 21%. Ideally the model should therefore predict both post completion errors and initialisation errors, not just one or the other, allowing them both to be predicted and fixed. Post-completion step Initialisation steps All load conditions Error made by model √ Some errors made by model √ Table 5.2: The model’s behaviour for post completion errors and initialisation errors (√ indicates model behaviour matches experimental result) As indicated in Table 5.2 the model does predict both that post completion error and initialisation errors will be made. Thus at the level of the kind of device error made, the model’s behaviour does match the behaviour as seen in the experimental data. However, as ECEASST 12 / 17 Volume 45 (2011) will be discussed in the next section, the model does not predict initialisation errors in detail as might be expected. Further work is needed to investigate the underlying causes of those errors and how they may be modelled. It is also important to note here that what is meant by an ‘initialisation’ step in Table 5.2 is any one of the five selection steps before filling in the data for each of the five areas shown in Figure 2.1. This is the same concept as the ‘selector errors’ described in Hiltz et al. [HBB10]. 5.3 A step-by-step analysis Whilst the previous results show the model does match the experiment at the level of device- error types, there are actually several points in the experiment for initialisation errors to be made. In the paper a distinction is made for the error rate for the first initialisation step (with an error rate of about 27% and the later ones with a combined error rate of about 7%). This does not correspond to the pattern predicted by the model checker as we now discuss. The analysis in this section is based on previously unpublished data from the experiment that gives error rates for each individual step in the Doughnut scenario. Whilst conducting this more detailed analysis, it was observed that the model demonstrated the same pattern of behaviour under both low and high extraneous load, given the particular way in which we interpreted the experimental conditions for our model. We therefore do not further consider the effect of load here. More detailed investigation of this is left for further work. Figure 5.2 gives the step-by-step error rates, showing the correspondence between the steps and the specific actions available on the device as given in the hierarchical task-decomposition of Figure 2.2. The distinction between device and task errors here is that adopted by Ament et al. [ACBB10]. The data available groups some steps. For example the four individual actions in the task hierarchy relating to obtaining a new order are grouped together (i.e. subtask 1 in Figure 5.2). Figure 5.2: Step-by-step error rates for the task. This shows a much more nuanced picture of the position and magnitude of initialisation errors. All task errors have very low error rates largely justifying the results of the previous Capturing the distinction between task and device errors Proc. FMIS 2011 13 / 17 sections. Three goal steps are, while still low, noticeably higher. The first step to get the order for example is just below 5%. However, this may be explained by the fact that it combines four steps. The step to enter the data when doing the ‘Operate doughport’ part of the task also has an error rate of around 5% as does the step between subtasks 2.2.1 and 2.2.2 (this step is simply being patient enough to wait for a progress bar to fill). Looking at the initialisation errors a more intriguing pattern emerges. There are 5 opportunities for initialisation errors. However the errors are not spread evenly between them. The initialisation step for the first ‘Operate doughport’ activity is much higher than the others with a 35% error rate. The other initialisation steps all do have errors but these are at much lower levels of between 5% and 10%. Thus in some cases the error rate is barely above the level that might be expected if the errors were stochastic. Clearly, from error rates alone, there is little to distinguish between the later initialisation steps (which the model predicts to be error prone) and the other task based steps (which the model does not predict to be error prone). Furthermore the model, with the settings we gave, only predicts initialisation errors on the first step of subtask 2.2.2. It does not predict errors on the first or later initialisation steps despite the first initialisation step having an extremely large error rate in the experiment. This suggests either problems in the model or our understanding of the experimental conditions and how they should be mapped to our model. The model ought to at least predict that the first initialisation step is error-prone, and possibly the later ones too. Given the discrepancy with the data in Figure 5.2, perhaps the model needs a more sophisticated mechanism to determine when initialisation errors occur. Further investigation is needed to better understand the underlying causes of these error steps. We use the formal user model to explore these issues in more detail in the next section. 6 Further Exploration of the Results In this section, we further explore the formal model and its relation to the experimental results. The particular examples investigated were motivated by the mismatch in behaviour between model and experimental data, as described at the end of the previous section. We deal first with the mismatch at the second of the five initialisation steps (Section 6.1), and then investigate the mismatch at the first of these steps (Section 6.2). Finally we discuss some issues raised by this extended investigation (Section 6.3). 6.1 Presence of errors at the activate-puncher step but not at the other initialisation steps First we investigate the reasons behind the model making an omission error at the activate- puncher step (the ‘Operate puncher’ activity’s initialisation step) but not at the other initialisation steps. The activate-puncher step corresponds to the first step of Subtask 2.2.2 in Figure 2.2. Some errors were made at all initialisation steps but mostly at a low level, suggesting that the underlying causes for an error here were weak compared to the first initialisation step, at least within the conditions of the experiment as modelled. We suspected that the reason that the model demonstrated this behaviour was due to the high level of intrinsic load at the previous step prior to the activate puncher step – i.e. when the confirm-doughport step was executed by the user. The previous subtask was set to a high intrinsic load as the user had to do a series of simple, but different arithmetical calculations to determine the values to actually enter. It was not entirely clear at the outset whether this was complex enough to be considered as a high load or not, though in the original instantiation of ECEASST 14 / 17 Volume 45 (2011) the scenario we decided to set these kinds of steps to high. Intrinsic load for the other subtasks were low. Giving the previous subtask a high intrinsic load caused a reduction in the procedural cueing for the ‘activate-puncher’ step, and therefore reduced the overall salience of this initialisation step. This meant that both the ‘activate’ and ‘fill’ steps of the puncher port ended up with the same level of overall salience in the model. As such they were both plausible actions for the model to choose next at this point of the doughnut task. This understanding was confirmed formally by using Linear Temporal Logic properties to model-check a slightly adjusted instantiation. The only change from the initial interpretation of settings was the assignment of a low rather than a high intrinsic load value to the subtask prior to activating the puncher-port. As expected, there were now no errors for this step of the model under the new setting. This same model behaviour was observed under both high and low extraneous loads. 6.2 No error at the activate-doughport step The second, and more major issue given that it had by far the highest error rate, is why the model did not predict errors at the activate-doughport step. This was the very first sub-activity initialisation step (the first step of Subtask 2.2.1). This seemed likely to be due to the strong influence of procedural cueing upon whether an action is considered plausible or not when, as in this scenario, the sensory cueing for most cues is set to neutral values. In this case sensory cueing effectively did not contribute towards determining model behaviour. When we verified this formally by removing the procedural cueing between the initial button- press to get the next order, and the activate-doughport step, we found that the model demonstrated the same kind of behaviour as initially seen for the activate-puncher step (again under both high and low extraneous load, refer to Section 6.1 for details). Without procedural cueing the model predicts the potential for an omission error at this step. This would bring the model-behaviour into closer correspondence with the experimental results. 6.3 Discussion From the more detailed results described in Sections 6.1 and 6.2, we see that the model does not exactly match the experimental data with respect to initialisation errors with the initial settings we chose, though the model could have done so with different settings. In the experiment initialisation errors were made at every initialisation step, though the error rates differed. Except at the first, rates were relatively low. The original experiment was not explicitly set up to explore the causes behind differing error rates at the initialisation steps of the activities, so this was an unexpected result. Given the low error rates, further experimentation explicitly designed to determine the causes of such errors is needed. The model checking work however suggests interesting areas for further work. The most obvious failure of the model checking was that it did not predict an error on the first initialisation step when in fact the error rate at this point was extremely high. Further exploration and verification of the model shows that by removing the procedural cueing between each of the major activities (as indicated in the task hierarchy) the model would predict initialisation errors at each step. If the task hierarchy is an arbitrary structure imposed as a way to describe the task, there doesn’t seem to be any strong motivation to do this instead of the original interpretation of a single procedural chain for the prescribed sequence. However, the fact that errors are made at each step, even if at a low level for most, suggests that the way a person mentally breaks a task into activities as guided by training and/or the Capturing the distinction between task and device errors Proc. FMIS 2011 15 / 17 interface may play a role. Including procedural cueing only within major activities according to likely conceptual models of users would be one solution. However, there is clearly something distinctly different about the first activation step compared with the other four activation steps. There is clearly some kind of discrepancy here. Amongst other possibilities, it could mean there is some sort of concept representing inter- activity cueing that is not expressed in either the current model or the experimental concepts presented in Ament et al. [ACBB10]. In fact, in parallel to the work described here, we have been developing a more nuanced version of the GUM based around activities and resources, where a new kind of cueing between activities, weaker than the procedural cueing within activities, is explicitly modelled. This will allow us to explore the potential for such a mechanism to model these results. Further experimental evidence to give more detailed insight into the precise scope and nature of the phenomena observed is needed too. As the model checking shows, the model predicts initialisation errors when the previous activity was of high intrinsic load, but not, all else being equal, when the previous load was low. An issue was how complex the activity needed to be for the load to trigger initialisation errors. For this kind of modelling to be useful in a purely predictive sense, guidance on how to make such decisions would be needed. This suggests an alternative way to use the model, however. Where one could take a scenario-based approach – modelling various load levels and exploring, for a given design, the effects of those different load levels on the errors that occur in the model. This could suggest crunch points where designers should aim to keep intrinsic load levels low, or the design changed to avoid offering device-specific actions at those points. The results of this study suggest that the abstraction needed for modelling of intrinsic loads of various actions is perhaps not quite as straightforward as thought from previous empirical data. Experiments to investigate this kind of load in more detail and its interaction with procedural cueing are needed to clarify its effect in a wider variety of situations. In addition, for the model having a high (instead of low) intrinsic load at a prior step with procedural cueing from it has the same effect as having a low intrinsic load without procedural cueing. It would be interesting to find out whether this correlation does indeed hold in reality. Further experimentation is also needed to explore the causes behind the relatively small number of task step errors. If these also had a common cognitive cause that could be determined, then the model could be expanded accordingly to also take those causes into account. 7 Conclusions We have presented a study that aimed to further validate an existing generic user model, based on new experimental data that was not related to the original development of the model. We also aimed to investigate how model checking can help explore issues that arise, helping to suggest areas for further empirical data collection. The work in this paper suggests that the generic user model is conceptually complete with respect to the experiment as presented in Ament et al. [ACBB10]. We were able to map naturally all of the concepts presented in the empirical investigation into concepts in the GUM. There were no ‘leftover concepts’ from the empirical experiment that needed to be somehow ‘fitted’ to a concept in the model in an awkward or superficial way. Secondly, taking the initial interpretation of the experimental settings, we arrive at a situation where the model also demonstrates a definite distinction between task and action steps. In terms of the GUM, we obtain errors on device-specific actions and no errors for task-specific ECEASST 16 / 17 Volume 45 (2011) actions. In terms of the empirical experiments, we see a result that demonstrates both the predominance of device errors over task-specific ones, as well as a greatly increased sensitivity to memory load for device-specific error rates. The presence of a clear difference between task and device-actions in both cases provides some positive evidence for the general approach assumed for the GUM, and also gives some additional justification for the utility of classifying user-actions according to task and device-specific actions. More detailed analysis however suggests that some concepts are perhaps missing from the description in Ament et al. [ACBB10] and the generic user model. In particular, the specific pattern of errors observed on a step-by-step basis, especially with respect to initialisation errors, is apparently subtler than currently expressible using only concepts from the experiment. Further Work The results from our study indicate that it would be useful to further investigate the precise relationship between intrinsic load and procedural cueing. In particular, in the generic user model they have opposite effects. The same local space of plausible actions could be achieved either by inhibiting the effect of procedural cueing from the previous step (with a high intrinsic load from the previous step), or by simply having no procedural cueing at all at that point in the action sequence. Further collaboration between experimentalists and modellers to investigate this issue is needed. For example, interesting plausible parameter sets could be determined initially by experimenting with the generic user model, and then used for further investigation via empirical experimentation. Alternatively, if the experimenters were unclear about whether to investigate some particular value of a parameter, the GUM could be used to offer some rational suggestions about whether those values are likely to have a significant effect on the overall user-behaviour. While the classification of actions into task and device-specific types is relatively uncontested in this paper, the particular choice of assignments to be used still depends very much on context. For example, Hiltz et al. [HBB10] explores how giving them an alternative initial brief can change people’s perceptions about which actions are task-orientated and which are more device-orientated. Instead of talking about making doughnuts as their main task, another group of subjects were told that their main job was to instead test the virtual doughnut machine. As a result, these participants demonstrated significantly less errors on the initialisation steps compared with the group operating under the original ‘making a doughnut’ brief, most probably due to their reconceptualising of previous ‘device-specific’ steps now as task-specific. A potentially very useful follow-up to the study presented in our paper would be to investigate the degree of agreement between model and data under these two alternative perceptual schemas based on concepts and data from Hiltz et al. [HBB10]. In summary, this paper has investigated the validity of an existing generic formal model of cognitively plausible behaviour against data from an independent experiment that was not designed with our model in mind. We have shown that the model does sufficiently capture the concepts described in Ament et al. [ACBB10], and agrees with its conclusions with respect to device and task errors in general. At a more detailed level however, some behaviours demonstrated by the model do not match the experimental results regarding initialisation errors and the effect of procedural cueing and load. The resulting analysis suggests that additional concepts are needed in the model, as well as further experimental work to determine more precisely the fundamental principles behind the behaviours observed. Our work also suggests Capturing the distinction between task and device errors Proc. FMIS 2011 17 / 17 that model checking based on a formal model of cognitively plausible behaviour may help both to explore the results from empirical investigations, as well as generate further research questions of interest. Acknowledgements This work is supported by the EPSRC grants on ‘Extreme Reasoning’ (EP/F02309X/1) and ‘CHI+MED: Safer Medical Devices’ (EP/G059063/1). References [ACBB10] M. Ament, A. Cox, A. Blandford, D. Brumby. Working Memory Load Affects Device-Specific but not Task-Specific Error Rates. In Ohlsson and (eds.) Catrambone (eds.), CogSci10, 32nd Annual Conference of the Cognitive Science Society. pp 91–96, 2010. [BB97] M. D. Byrne, S. Bovair. A Working Memory Model of a Common Procedural Error. Cognitive Science 21(1):31–61, 1997. [BBD00] R. Butterworth, A. Blandford, D. Duke. Demonstrating the cognitive plausibility of interactive system specifications. Formal Aspects of Computing 12(4):237–259, 2000. http://eprints.ucl.ac.uk/5127/(accessed 22-9-2011) [CRB10] P. Curzon, R. Rukšėnas, J. Back. The HUM generic user model: An informal overview of the main features. CHI+MED Working Paper no. 9, 2010. http://dms.chi-med.ac.uk/knowledgetree/browse.php?fFolderId=30(accessed 22-9-2011) [CY00] A. L. Cox, R. M. Young. Device-Oriented and Task-Oriented Exploratory Learning of Interactive Devices. In Proceedings of ICCM 2000: Third International Conference on Cognitive Modeling. pp 70–77, Universal Press, 2000. [DBDM98] D. Duke, P. Barnard, D. Duce, J. May. Syndetic modelling. Human-Computer Interaction 13(4):337–393, 1998. DOI: 10.1207/s15327051hci1304_1 [DBMD95] D. Duke, P. Barnard, J. May, D. Duce. Systematic Development of the Human Interface. Second Asia-Pacific Software Engineering Conference, pp 313-321, Computer Society Press, 1995. DOI: 10.1109/APSEC.1995.496980 [DD99] D. Duke, D. Duce. The Formalization of a Cognitive Architecture and its Application to Reasoning About Human Computer Interaction. Formal Aspects of Computing 11(6):665-689, 1999. http://dblp.uni-trier.de/db/journals/fac/fac11.html#DukeD99(accessed 22-9-2011) [Fie01] R. E. Fields. Analysis of Erroneous Actions in the Design of Critical Systems. PhD thesis, University of York, York, 2001. [HBB10] K. Hiltz, J. Back, A. Blandford. The roles of conceptual device models and user goals in avoiding device initialization errors. Interacting with Computers 22(5):363-374, 2010. DOI: 10.1016/j.intcom.2010.01.001 [Hol05] E. Hollnagel. The Elusiveness of ‘Human Error’. Technical report, based on Hollnagel, E. & Amalberti, R. The Emperors New Clothes, or whatever happened to “human error”? Invited keynote at the 4th International Workshop on Human Error, Safety and System Development, 2005. http://www.ida.liu.se/~eriho/HumanError_M.htm(accessed 22- 9-2011) [KGEM96] S. S. Kirschenbaum, W. D. Gray, B. D. Ehret, S. L. Miller. When using the tool interferes with doing the task. In Conference Companion on Human Factors in Computing Systems: Common Ground. CHI ’96, pp. 203–204. ACM, New York, USA, 1996. DOI: 10.1145/257089.257281 [MD95] T. Moher, V. Dirda. Revising mental models to accommodate expectation failures in human-computer dialogues. In Design, specification and verification of interactive systems (DSV-IS’95). pp 76-92, 1995. [MH69] J. McCarthy, P. J. Hayes. Some Philosophical Problems from the Standpoint of Artificial Intelligence. In Machine Intelligence. Volume 4, pp 463–502, Edinburgh University Press, 1969. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.5082(accessed 22-9-2011) [RBCB09] R. Rukšėnas, J. Back, P. Curzon, A. Blandford. Verification-guided modelling of salience and cognitive load. Formal Aspects of Computing 21(6):541-569, 2009. DOI: 10.1007/s00165-008-0102-7 [SBB07] L. Su, H. Bowman, P. Barnard. Attentional capture by meaning: A multi-level modelling study. In Proceedings of the 29th annual meeting of the cognitive science society (CogSci 2007), Lawrence Erlbaum Associates, NJ, pp 1521-1526, 2007. http://www.cs.kent.ac.uk/pubs/2007/2594(accessed 22-9-2011) [SBBW09] L. Su, H. Bowman, P. Barnard, B. Wyble. Process algebraic modelling of attentional capture and human electrophysiology in interactive systems. Formal Aspects of Computing 21(6):513-539, 2009. DOI: 10.1007/s00165- 008-0094-3.