Eye tracking and early detection of confusion in digital learning environments: Proof of concept
Australasian Journal of Educational Technology, 2016, 32(6). 58

Mariya Pachman, Amaël Arguel
Macquarie University, Australia

Lori Lockyer
Macquarie University, Australia; University of Technology Sydney, Australia

Gregor Kennedy, Jason M. Lodge
University of Melbourne, Australia

Research on the incidence of, and changes in, confusion during complex learning and problem-solving calls for advanced methods of confusion detection in digital learning environments (DLEs). In this study we attempt to address this issue by investigating the use of multiple measures, including psychophysiological indicators and self-ratings, to detect confusion in DLEs. Participants were presented with two intrinsically confusing insight problems in the form of visual digital puzzles. They were asked to solve the problems while their gaze trajectories were recorded, and these data were triangulated with self-ratings of confusion and cued retrospective verbal reports. All participants showed a significant increase in fixations on relevant (i.e., related to the solution) and not-relevant areas at an early stage of the problem-solving process. However, only fixations on not-relevant areas were positively correlated with confusion ratings. Moreover, participants who successfully solved the problem differed from non-solvers in their fixation durations on relevant and not-relevant areas. The importance of early detection of confusion and the affordances of emerging technologies for this purpose are discussed.

Introduction

Increasingly, higher education learning activities are delivered online or in technology-enabled formats.
As students in this context are often considered to be self-directed learners who independently time their learning activities, such activities tend to include minimal guidance and support (VanLehn, Siler, Murray, Yamauchi, & Baggett, 2003; Yamagata-Lynch, Do, Skutnik, Thompson, Stephens, & Tays, 2015). Students often encounter difficulties in this context, in particular in interpreting tasks set by their teachers and in maintaining their engagement with online tasks (Waycott, Dalgarno, Kennedy, & Bishop, 2012). In fact, confusion is quite a common state when learning about complex topics in digital learning environments (Baker, D'Mello, Rodrigo, & Graesser, 2010). There are a variety of reasons why online or technology-enabled tasks could be confusing: no timely intervention from a teacher may be available; learners may have difficulty understanding the content or solving a problem; or they may have difficulty following the optimal learning trajectory. If confusion lasts too long, learners often go on to experience frustration or total disengagement (D'Mello & Graesser, 2014). In some cases, however, confusion is also deemed to offer expanded learning opportunities (Lehman, D’Mello, & Graesser, 2012). That is why the moment when confusion is first experienced is often considered a turning point in the learning process: from this point, confusion can develop in a detrimental fashion, for the reasons mentioned above, if it is not resolved (VanLehn et al., 2003), or it can benefit learning because it prompts learners to deliberate on the content (D'Mello, Lehman, Pekrun, & Graesser, 2014). One key to keeping confusion from contributing negatively to learning may therefore be to detect it early, before it becomes non-constructive. Much of the research on confusion detection is focused on improving intelligent tutoring systems (ITS) by embedding confusion, frustration, and boredom detection features.
Less is known about confusion in non-ITS digital learning environments. Further, existing detection systems are only partially automated and often require costly human intervention for classification purposes (see D’Mello & Graesser, 2010). Avenues for creating fully automated confusion detection systems that are suitable for more traditional, non-ITS digital learning environments need to be explored.

In this paper we aim to extend the understanding of changes in levels of confusion in digital learning environments by deriving a set of parameters for the early detection of confusion, using eye tracking together with materials that are known to be confusing. Further, we make the case for developing an early confusion detection system based on the capacities of modern educational technologies, underpinned by learning analytics and user gaze recognition features. We start with a review of confusion research, research on eye movements in situations of cognitive disequilibrium, and confusion detection models in ITS. Then, we investigate learners’ behaviour in a confusing problem-solving situation, employing eye tracking as a way to extend our understanding of the changes in levels of confusion. The results shed light on which parameters should be considered for the early detection of confusion in digital learning environments and lead to recommendations for particular types of technologies to be employed for this purpose.

Background

Cognitive disequilibrium and the types of confusion

Confusion is triggered by cognitive disequilibrium (D’Mello & Graesser, 2014). Cognitive disequilibrium can be defined as a state experienced by an individual during learning when obstacles to the normal flow of the learning process are encountered.
These obstacles might include uncertainties, errors, anomalous information, or simply new information that contradicts an individual’s prior knowledge (D’Mello & Graesser, 2012). In this state an individual is often unsure what to do next. At the same time, the arising contradictions have been found to make individuals elaborate on, and engage in deeper cognitive processing of, the learning material (D’Mello et al., 2014). The resulting outcomes help promote conceptual change and transfer of learning (D’Mello & Graesser, 2012, 2014; Limón, 2001). An exploration of cognitive-affective states (emotions related to learning) generated by cognitive disequilibrium served as a starting point for research on confusion (see D'Mello & Graesser, 2012; Graesser, Lu, Olde, Cooper-Pye, & Whitten, 2005; Graesser & Olde, 2003; Lehman et al., 2012). Simply put, confusion is seen as an affective expression of cognitive disequilibrium (Lehman et al., 2012): when contradictory information leads to uncertainty and results in cognitive disequilibrium, learners feel confused. Learners, therefore, need to manage their confusion and to engage in confusion resolution activities, such as problem-solving, before they can move on with their learning (D’Mello & Graesser, 2014). The question of confusion management, however, is not a straightforward one. The literature identifies no single pathway or method for managing confusion, but rather a collection of methods that could help learners return to a state of equilibrium, or smooth processing of new information (D’Mello & Graesser, 2014). One explanation for why there is no unified perspective on how learners manage their confusion is that individual differences, such as age, motivation, personality, and prior knowledge, all have an impact on confusion regulation (Lehman et al., 2012).
Thus, without addressing individual differences, it would be hard to talk about an instructional formula for the regulation of confusion. At the same time, several common strategies, such as greater use of scaffolding, prompting self-regulation, and providing feedback, are mentioned in the literature as prospective generic ways to manage confusion and cognitive disequilibrium (D’Mello et al., 2014; Lehman et al., 2012). Designing digital learning environments containing these features would allow learners to work on the combination of methods they need to return to equilibrium. Properly managing and resolving one’s confusion is crucial for further learning. Only when confusion is properly managed and resolved can its benefits be harvested, leading to learning gains (D’Mello & Graesser, 2014; Lehman et al., 2012). Confusion that is deemed beneficial for learning is also called constructive confusion. Conversely, unresolved confusion may hinder learning (D’Mello & Graesser, 2014; VanLehn et al., 2003). In particular, D’Mello and Graesser (2014) found that learners who managed to at least partially resolve their confusion had significant learning gains in comparison with a group who left their confusion unresolved. Confusion that lasts too long has also been found to have undesirable consequences for learning (Liu, Pataranutaporn, Ocumpaugh, & Baker, 2013). In cases of long-lasting confusion, and in the absence of adequate scaffolds, learners were found to give up, or to become frustrated and disengage from the task (Baker et al., 2010; D'Mello & Graesser, 2014). This type of detrimental confusion is often referred to as non-constructive confusion. Early detection of confusion in digital learning environments could thus be used as a basis for offering a combination of strategies to help learners manage confusion when they are in danger of following a non-constructive pathway.
Such possibilities to empower learners with confusion management strategies have been discussed in the literature. For example, Lehman et al. (2012) suggested using direct hints and explanations, or even interventions to help learners understand the benefits of confusion. D’Mello and Graesser (2014) noted that learners’ scholastic aptitude was an important factor influencing confusion resolution. Thus, individualised learning paths through material based on learners’ scholastic aptitude could be a good way to help learners manage their confusion. However, few of the researchers who have considered confusion management strategies have paid attention to the prerequisite question of how confusion can be detected so that hints and support can be provided. Another known caveat for the early detection of confusion is its dynamic nature. Confusion is a process rather than an instantaneous occurrence (D’Mello et al., 2014), and this process can produce different outcomes depending on the timing of the measurement. For example, partially resolved confusion results in a lower overall mean confusion rating than unresolved confusion (see D’Mello & Graesser, 2014, Fig. 2, p. 111). At the same time, the authors found that partially resolved confusion peaks relatively early in the learning episode compared with unresolved confusion. If their learners had been compared at the beginning of the learning process, the difference would probably have been the opposite of the overall result: learners who partially resolved their confusion would have scored higher on confusion than those whose confusion remained unresolved. Thus, choosing suitable measurement windows, and understanding how confusion ratings within these windows differ from overall averages, is an important question to consider.
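The reversal described above can be made concrete with a small numerical sketch. The per-minute ratings below are invented for illustration and are not taken from D’Mello and Graesser (2014) or from this study; they simply mimic an early peak for partially resolved confusion and a slower, sustained rise for unresolved confusion.

```python
# Hypothetical per-minute confusion ratings (1-10 scale) for two illustrative
# learners. These numbers are invented for illustration; they are not data
# from this study or from D'Mello and Graesser (2014).
PARTIALLY_RESOLVED = [9, 8, 5, 3, 2, 2]  # early peak, then decline
UNRESOLVED = [5, 6, 6, 7, 7, 7]          # slower rise that stays high


def window_mean(ratings, start, end):
    """Mean rating over minutes [start, end), using 0-based minute indices."""
    window = ratings[start:end]
    return sum(window) / len(window)


def overall_mean(ratings):
    """Mean rating over the whole episode."""
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    for label, series in (("partially resolved", PARTIALLY_RESOLVED),
                          ("unresolved", UNRESOLVED)):
        print(f"{label}: first 2 min = {window_mean(series, 0, 2):.2f}, "
              f"overall = {overall_mean(series):.2f}")
    # partially resolved: first 2 min = 8.50, overall = 4.83
    # unresolved: first 2 min = 5.50, overall = 6.33
```

With these hypothetical series, the partially resolved learner scores higher in the first 2 minutes (8.50 vs 5.50) but lower over the whole episode (4.83 vs 6.33), which is why the choice of measurement window matters.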
Early detection of confusion

Given that confusion can potentially become non-constructive if not properly regulated, early detection of incidents of confusion could help prevent this transformation. The importance of early detection of confusion and other cognitive-affective states, such as frustration and boredom, has been widely discussed in ITS research (Craig, D’Mello, Witherspoon, & Graesser, 2008; D’Mello, Craig, Witherspoon, McDaniel, & Graesser, 2008; D’Mello & Graesser, 2010; Rodrigo & Baker, 2011; Zeng, Pantic, Roisman, & Huang, 2009). Initially, researchers who considered detection linked it to a single source of data, such as facial expressions or dialogue utterances (Zeng et al., 2009). For example, cognitive-affective states were classified by evaluating learners’ postures (D'Mello & Graesser, 2009), monitoring facial expressions (McDaniel et al., 2007), and analysing vocabulary and semantics in tutorial dialogues (D’Mello, Dowell, & Graesser, 2009). Detection models based on data from a single source had high error rates (D’Mello & Graesser, 2010). In later studies, researchers pooled data from several sources, such as conversational cues, gross body language, and facial features, which represented a promising new avenue for ITS research (see D’Mello & Graesser, 2010). However, the detection processes described in these studies were not completely automated. In fact, ITS research typically relies on costly methods in which external judges classify cognitive-affective states, including confusion (D’Mello & Graesser, 2010; Rodrigo & Baker, 2011). Even though D’Mello and Graesser’s (2010) multimodal system was quite effective in detecting confusion based on facial expressions and discourse correlates (also see Craig et al., 2008; D'Mello et al., 2008), human judges were still needed to classify these states. The authors suggest a fully automated system as a direction for future research.
The question is how to extend the findings from this existing ITS research to more generic digital learning environments. Digital learning environments such as learning management systems usually do not require learners to interact verbally with the system, so the verbal cues that have proved so valuable in ITS-based studies cannot be used to detect confusion. There are clear opportunities to develop simpler and more cost-effective ways to detect learner confusion in traditional learning management systems and other modern educational technologies. The emergence and maturity of innovative research technologies, such as eye tracking and learning analytics, represent one such opportunity. Eye or gaze tracking refers to recording learners’ eye movements using a simple web camera connected to the computer screen, and then filtering these data using sophisticated data-sorting algorithms. Learning analytics refers to the affordances of contemporary educational technologies to compile and analyse learners’ mouse clicks, interactions, time on task, and other non-verbal interactions within digital learning environments. Combined, these two techniques could be developed and implemented in DLEs for early confusion detection.

Using eye tracking for detecting confusion

Eye tracking has been used successfully for in-depth investigations of general problem-solving as well as situations involving cognitive disequilibrium (Graesser et al., 2005; Knoblich, Ohlsson, & Raney, 2001). Below, we present the findings from several studies that have used eye tracking for investigations of this sort, focusing on the eye movement parameters used. Specifically, we focus on total fixation duration.
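To make the total-fixation-duration parameter concrete, the following Python sketch sums fixation durations per category of area of interest (AOI). It is a minimal illustration under stated assumptions: fixations have already been detected and reduced to a centroid and a duration, AOIs are static rectangles, and the names `Fixation`, `AOI`, and `total_fixation_duration` are hypothetical rather than part of any eye-tracking package. AOIs on an interactive display whose layout changes would need time-varying geometry instead.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixation:
    """A detected fixation: onset, duration, and centroid in screen pixels."""
    start_ms: float
    duration_ms: float
    x: float
    y: float


@dataclass(frozen=True)
class AOI:
    """A rectangular area of interest tagged with a category label."""
    name: str
    category: str  # e.g. "relevant", "not-relevant", "neutral", "instructions"
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def total_fixation_duration(fixations, aois):
    """Sum fixation durations (ms) per AOI category.

    Each fixation is credited to the first AOI whose rectangle contains its
    centroid; fixations outside every AOI are counted under "outside".
    """
    totals = defaultdict(float)
    for fix in fixations:
        category = next(
            (aoi.category for aoi in aois if aoi.contains(fix.x, fix.y)),
            "outside",
        )
        totals[category] += fix.duration_ms
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical layout: two AOIs share the "relevant" label.
    aois = [
        AOI("piece-a", "relevant", 0, 0, 100, 100),
        AOI("piece-b", "relevant", 200, 0, 300, 100),
        AOI("background", "not-relevant", 0, 150, 300, 300),
    ]
    fixations = [
        Fixation(0, 250, 50, 50),       # inside piece-a
        Fixation(300, 400, 250, 50),    # inside piece-b
        Fixation(800, 150, 100, 200),   # inside background
        Fixation(1000, 120, 500, 500),  # outside every AOI
    ]
    print(total_fixation_duration(fixations, aois))
    # {'relevant': 650.0, 'not-relevant': 150.0, 'outside': 120.0}
```

Several AOIs can share a category label, so their durations are pooled; fixations that fall outside every AOI are reported separately rather than silently discarded.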
In problem-solving research, increased gaze fixation time on the parts of the problem relevant to the solution has been found to occur at the moment directly preceding successful problem solution (Ellis, Glaholt, & Reingold, 2011; Knoblich et al., 2001). In studies of cognitive disequilibrium, the most successful learners were found to have longer overall fixation durations on device components relevant to the solution (Graesser et al., 2005). It should be noted, however, that, first, in the Graesser et al. (2005) study learners’ confusion was not measured directly. It could only be inferred from their success in resolving the cognitive disequilibrium introduced through a device breakdown scenario. Second, cognitive disequilibrium resolution (performance) was not measured directly either, but rather inferred from the quality of the questions asked by participants. Since neither confusion nor performance was measured directly, it is difficult to judge whether the most successful learners would have experienced overall higher or lower levels of confusion than unsuccessful learners. It might have been that the most successful learners had higher confusion at the early stages of problem-solving and lower overall confusion than unsuccessful learners, as in D’Mello and Graesser (2014). Finally, DeLucia, Preddy, Derby, Tharanathan, and Putrevu (2014) investigated participants’ eye movements when participants remotely operated two different devices. They measured confusion using a subjective Likert-type measure (e.g., I was confused) and did not find a consistent pattern of correlations between the variables across both devices; correlations emerged only for several tasks performed with the second device. In particular, they found that higher confusion ratings were positively correlated with total fixation time on the whole screen, mean fixation duration (longer fixations), and task completion time (longer completion).
Unfortunately, these researchers did not report correlations between confusion ratings and total fixation duration separately for relevant areas and other areas of the screen, because their analysis did not make this distinction. Thus, it is not clear whether any significant correlations would remain if confusion ratings were linked with fixations on relevant parts of the screen. Although eye tracking has emerged as an effective technique for gathering detailed data related to cognitive disequilibrium and incidents of confusion, the existing findings are limited. We believe that exploring correlations between self-rated confusion and fixations on relevant parts of the screen could benefit not only research on confusion but also the identification of parameters for automated methods of early confusion detection.

The present study

This study investigated a set of parameters for the early detection of confusion in non-ITS digital learning environments. It is expected that using eye tracking will broaden our understanding of confusion and of potential parameters associated with its early detection. Thus, exploring eye-tracking behaviour should help us (1) find correlates of confusion and, based on these data, (2) derive measures for the early detection of confusion in digital learning environments. Following Graesser et al. (2005) and DeLucia et al. (2014), we adopted a working assumption that longer overall fixation durations denote a greater amount of cognitive processing: the longer people fixate in total on solution-relevant areas of the problem, the higher their chance of resolving cognitive disequilibrium. We should be cautious, however, about hypothesising a relationship between longer total fixation durations and confusion ratings: although Graesser et al. (2005) used eye tracking in situations involving cognitive disequilibrium, they did not include measurements of confusion in their design.
DeLucia et al. (2014), on the other hand, did not differentiate between relevant and not-relevant areas of the screen when reporting a positive correlation between total fixation durations and confusion ratings. Finally, D’Mello and Graesser (2014) used the same experimental materials as Graesser et al. (2005) and found that partially resolved confusion led to higher problem-solving/troubleshooting performance than unresolved confusion. The partially resolved confusion group also rated their confusion higher than the unresolved confusion group for the first half of the problem-solving process. However, D’Mello and Graesser (2014) did not use eye tracking in their study. Thus, our hypothesis is based partly on confusion research and partly on problem-solving research using eye tracking. Based on these previous studies, we expect the level of learners’ self-reported confusion to be positively correlated with fixations on relevant areas of the problem. In other words, high total fixation durations on relevant areas will be associated with high confusion ratings.

Method

Participants

Fourteen young adults (students from a large metropolitan Australian university) volunteered to participate in the study. Recruitment was conducted via an internal university employment website. All participants were novices with regard to the insight problems used in the experiment, as confirmed by their statements during cued-retrospective reporting. Participants were compensated at a rate of $15 per hour.

Materials

We used multimedia-based insight problems to generate confusion in this study. An insight problem is one that requires the learner to shift his or her perspective and view the problem in a novel way in order to reach the solution (Dow & Mayer, 2004). In lay terms, learners need to have an “Aha!” moment to solve such a problem.
Insight problems are considered inherently challenging (Knoblich et al., 2001) and serve as good instructional material for the exploration of confusion (see Andres, Andres, Rodrigo, Baker, & Beck, 2015). The insight problems presented to our participants were transformation puzzles, in which the pieces could form two different layouts showing different pictures (the missing square puzzle, Figure 1, and the 13 crystal skulls puzzle, Figure 2). Both problems were developed using Mathematica 10 (Wolfram Research, 2014) and presented as on-screen simulations with learner control: participants were encouraged to manipulate a scrollbar to transform the problem from its initial state to its final state and to draw comparisons when needed.

Figure 1. The initial and the final positions of the missing square puzzle with areas of interest (AOIs) marked in grey. The initial question was: “Where does the white square appear from?” (Picture by Krauss / CC BY-SA 4.0) https://commons.wikimedia.org/wiki/File:Missing_Square_Animation.gif%23/media/File:Missing_Square_Animation.gif

Figure 2. The initial and the final positions of the 13 crystal skulls puzzle. The initial question was: “Where does the 13th skull disappear to?” (Picture by Gianni Sarcone / CC BY-NC-ND 3.0 US)

Paper-based materials consisted of a laminated A4 sheet of instructions, six 4.5” x 3” laminated feedback cards, and four laminated A4 sheets representing the initial and final positions of each problem. Feedback card 1 (hint 1) used a questioning technique, prompting participants to compare specific parts of the puzzle. Feedback card 2 posed a rhetorical question and immediately answered it with regard to other specific parts of the puzzle. Computer-based materials consisted of a visual-spatial abilities test and a self-rated confusion scale.
The materials were administered via a secure Qualtrics™ sequence. A Tobii T120 Eye Tracker integrated into a 17” TFT monitor, with an angular resolution of less than 0.5°, was used to record participants’ eye movements and to replay their gaze trajectories to them at the cued retrospective reporting stage. A sampling frequency of 60 Hz was used for the current study. A laptop running Tobii Studio 2.3 software handled calibration of the eye-tracking system and data acquisition. All computer-based materials were presented on the eye tracker monitor, and a mouse connected to the laptop was used to manipulate the interactive materials. A Sony ICD-series audio recorder was used to capture participants’ verbalisations at the cued retrospective reporting stage.

Data sources and measures

The Card Rotation test, with a 3-minute time limit (Ekstrom, French, & Harman, 1976), was used to measure learners’ visual-spatial abilities. Visual-spatial abilities tests such as the Card Rotation test and the Paper Folding test (Ekstrom et al., 1976) are routinely used in research on multimedia learning, since visual-spatial abilities mediate the effects of learning with instructional multimedia (Mayer & Sims, 1994; but also see Paik & Schraw, 2013). Specifically, learners with low visual-spatial abilities experience greater difficulty in completing a task containing visual and verbal information (Mayer & Sims, 1994). Since our tasks contained both visual and verbal information, we wanted to be able to estimate the influence of this factor on final performance. Participants’ behaviours (e.g., wrong answers, early feedback requests) and the timing of their responses during the problem-solving task were recorded by the experimenters on an observation sheet. Participants’ gaze trajectories and fixations were recorded with the eye tracker. Audio recordings of participants’ verbal reports were collected.
A self-rated confusion measure (“Please, select the number below that best represents your level of confusion as experienced in this time point”, Likert scale 1 to 10), similar to the measure used by D’Mello and Graesser (2014), was administered to the participants, who rated their confusion for each 1-minute interval of the problem-solving phase. The measure was considered a viable method for documenting changes in confusion levels because previous research has established that retrospective confusion ratings of this sort correlate with online recordings of facial expressions and body language (D'Mello & Graesser, 2010; McDaniel et al., 2007). Finally, participants’ problem-solving success (solved/did not solve) served as a performance measure.

[Footnote to Figure 2: http://www.archimedes-lab.org/workshop13skulls.html]

The study used cued-retrospective reporting (Van Gog, Paas, Van Merriënboer, & Witte, 2005) to collect self-rated confusion measures and verbal protocols. After the problem-solving task, participants were presented with their own gaze trajectories as a cue to retrospectively describe the thoughts they had during the problem-solving process; they also rated their confusion level for each 1-minute interval of the task (see D’Mello et al., 2014). In their study, Van Gog et al. (2005) gathered participants’ verbalisations in concurrent, retrospective, and cued-retrospective reporting conditions, while participants’ eye movements were recorded during problem-solving. Both concurrent and cued-retrospective reporting resulted in more detailed descriptions of the learning process than simple retrospective reporting (Van Gog et al., 2005). Since Van Gog et al. (2005) collected only participants’ verbalisations, whereas we were collecting confusion measures in addition to verbal protocols, concurrent reporting risked introducing substantial interruptions into the learning process.
Thus, we employed cued-retrospective reporting to gather detailed verbal data while minimising the risk of interrupting the learning process with frequent confusion measures. Eye-tracking data and confusion self-ratings were analysed for the first 2 minutes of the problem-solving period. Independent problem-solving took place within these first 2 minutes; after that, learners’ gaze and confusion ratings could have been influenced by feedback. Our goals in this study were educational, and we wanted to ensure that learners did not feel helpless when facing unfamiliar complex problems. Thus, all learners were provided with feedback (hint 1 after 2 minutes and hint 2 after 4 minutes of the overall problem-solving period).

Procedure

The experimental sessions took place in a laboratory setting. All participants were tested individually in sessions of approximately 45-55 minutes. First, participants were asked to read the paper-based instructions explaining the context of the study, were given the opportunity to ask study-related questions, and were then asked to read and sign consent forms. Participants were then pre-tested on their visual-spatial abilities, seated in front of the eye tracker, and, after calibration, asked to solve their first problem. If they neither produced a solution (correct or incorrect) nor asked for a hint within the first 2 minutes, hint 1 was offered to them. Alternatively, if they asked for a hint before 2 minutes had elapsed, they were prompted to elaborate further on the task and were given the hint after 2 minutes if they still needed it. Based on pilot sessions, this 2-minute interval was considered the minimum time sufficient for successful problem-solving to take place. Hint 2 was offered under similar circumstances after 4 minutes.
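Because hint 1 could arrive at the 2-minute mark, the gaze analysis window ends there, and it is split into the same 1-minute intervals as the confusion self-ratings. That windowing step can be sketched as follows. This is an illustration only: it assumes each fixation has already been labelled with an AOI category and timed from the start of the problem, the function name `per_minute_totals` is hypothetical, and crediting a fixation that straddles a minute boundary entirely to the minute in which it begins is a simplifying choice, not a documented step of the study.

```python
from collections import defaultdict

MINUTE_MS = 60_000


def per_minute_totals(fixations, n_minutes=2):
    """Total fixation duration (ms) per (minute, AOI category) pair.

    `fixations` is an iterable of (onset_ms, duration_ms, category) tuples,
    with onsets measured from the start of the problem. Only fixations whose
    onset falls within the first `n_minutes` are kept, and each is credited
    wholly to the minute in which it starts.
    """
    totals = defaultdict(float)
    window_end_ms = n_minutes * MINUTE_MS
    for onset_ms, duration_ms, category in fixations:
        if 0 <= onset_ms < window_end_ms:
            minute = int(onset_ms // MINUTE_MS) + 1  # 1-based: minute 1, minute 2
            totals[(minute, category)] += duration_ms
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical, already-labelled fixations (onset ms, duration ms, category).
    fixations = [
        (1_000, 300, "relevant"),
        (59_900, 200, "not-relevant"),
        (61_000, 500, "relevant"),
        (130_000, 400, "relevant"),  # after the 2-minute window: ignored
    ]
    print(per_minute_totals(fixations))
    # {(1, 'relevant'): 300.0, (1, 'not-relevant'): 200.0, (2, 'relevant'): 500.0}
```

Keying the totals by (minute, category) keeps them aligned with the per-minute confusion ratings, so the two can be compared interval by interval.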
When responding, participants were instructed to turn away from the eye tracker and explain their proposed solution using the paper sheets representing the initial and final positions of each problem. If a participant failed to produce a correct solution within 10 minutes, the solution was given to them. During the problem-solving process, an experimenter recorded participants’ questions, their timing, and the timing of the solution on an observation sheet. After problem 1 was resolved, the same procedure was repeated with problem 2. Participants were then shown the recordings of their eye movements during both problem-solving activities and were asked to think aloud and explain their thought processes. They also rated their past confusion levels on the self-rated measure in 1-minute intervals. Finally, participants were debriefed, compensated, and discharged.

Visual-spatial abilities and problem 2

Visual-spatial abilities scores (M = 4.88; SD = 2.04) were not significantly correlated with confusion ratings for either problem 1 (r = .05) or problem 2 (r = .05). Nor did they influence problem-solving outcomes: Wald (1, 13) = 2.64; p = .10, Exp(b) = 1.88 for problem 1, with 7 solvers and 7 non-solvers (problem 2 had only three non-solvers, so logistic regression could not be calculated). Since the results were not significant, we did not include visual-spatial abilities in any further calculations.

Problem 2 was included for a possible replication of the trends found with problem 1. However, participants spent significantly less time on problem 2 than on problem 1 (M2 = 5 min 25 sec; M1 = 7 min 40 sec; Wilks’ lambda = 0.69, F(1, 13) = 5.90, p = .030) and were significantly less confused with problem 2 (M2 = 5.17; M1 = 7.23; Wilks’ lambda = 0.41, F(1, 13) = 18.43, p = .001).
Participants’ verbalisations suggest that problem 1 served as a kind of pre-training for solving problem 2 (e.g., “The first one gives you a hint for the second one”). Thus, only problem 1 was considered for data analysis.

Results

Several pre-processing steps were required to arrive at the variables used in this investigation. To calculate total fixation durations, the Tobii default fixation filter was applied to the raw data, so that gaze samples remaining within a radius of 35 pixels for more than 75 milliseconds were classified as fixations. An initial look at participants’ problem-solving times revealed that they were uneven, ranging from 1 minute 40 seconds to more than 10 minutes for some of the unsuccessful problem-solvers (M = 7 min 40 sec; SD = 2 min 57 sec).

The main set of analyses addressed the hypothesis that the level of learners’ self-reported confusion would be positively correlated with fixations on relevant areas of the problem. First, we describe the changes in self-reported confusion ratings. There was a significant decrease in confusion ratings for all participants between minute 1 (M = 8.69; SD = 1.93) and minute 2 (M = 6.85; SD = 3.05), Wilks’ lambda = 0.39, F(1,13) = 19.20, p < .01. All participants seemed to believe they understood the problem well enough and were on the right track to a solution. Second, we report on eye fixations, particularly as they related to the instructions and to the not-relevant, neutral, and relevant areas of interest (AOIs). Data from all four relevant AOIs (see Figure 1) were combined for this analysis; multiple relevant AOIs had originally been drawn to avoid misinterpretations due to the dynamic nature of the environment. Thus, only the areas that could be unequivocally related to the problem solution at the various scrollbar positions were considered relevant AOIs.
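The Tobii default filter is not described in detail here, so the sketch below does not reproduce it. It is a simplified, dispersion-style stand-in that reuses the two thresholds reported above (35 pixels, 75 ms) and the 60 Hz sampling rate, under the assumption that raw samples arrive as time-ordered (time, x, y) tuples. It groups consecutive samples lying within 35 pixels of their running centroid and keeps groups lasting more than 75 ms. Handling of missing samples (blinks, track loss) is omitted, and all names are hypothetical.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE_HZ = 60   # eye tracker sampling frequency reported for the study
RADIUS_PX = 35        # spatial threshold reported for the fixation filter
MIN_DURATION_MS = 75  # temporal threshold reported for the fixation filter


@dataclass(frozen=True)
class Fixation:
    start_ms: float
    duration_ms: float
    x: float  # centroid of the grouped samples, in pixels
    y: float


def _to_fixation(group, period_ms, min_duration_ms):
    """Return a Fixation for `group` if it lasts long enough, else None."""
    if not group:
        return None
    # Each sample is taken to cover one sampling period.
    duration = group[-1][0] - group[0][0] + period_ms
    if duration <= min_duration_ms:
        return None
    cx = sum(s[1] for s in group) / len(group)
    cy = sum(s[2] for s in group) / len(group)
    return Fixation(group[0][0], duration, cx, cy)


def detect_fixations(samples, radius_px=RADIUS_PX,
                     min_duration_ms=MIN_DURATION_MS,
                     sample_rate_hz=SAMPLE_RATE_HZ):
    """Group consecutive (t_ms, x_px, y_px) samples into fixations.

    A sample joins the current group while it lies within `radius_px` of the
    group's running centroid; otherwise the group is closed and a new one
    begins. Groups lasting more than `min_duration_ms` become fixations.
    """
    period_ms = 1000.0 / sample_rate_hz
    fixations, group = [], []
    for t, x, y in samples:
        if group:
            cx = sum(s[1] for s in group) / len(group)
            cy = sum(s[2] for s in group) / len(group)
            if math.hypot(x - cx, y - cy) > radius_px:
                fix = _to_fixation(group, period_ms, min_duration_ms)
                if fix is not None:
                    fixations.append(fix)
                group = []
        group.append((t, x, y))
    fix = _to_fixation(group, period_ms, min_duration_ms)
    if fix is not None:
        fixations.append(fix)
    return fixations
```

Because each sample is treated as covering one sampling period (about 16.7 ms at 60 Hz), a group needs at least five consecutive samples to exceed 75 ms; shorter groups, such as brief glances during saccades, are dropped.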
The results showed that the whole group had longer total fixation times on relevant AOIs at minute 2 than at minute 1 (Wilks’ lambda = 0.57, F(1,13) = 9.95, p = .01). While participants spent on average about 8 seconds fixating on relevant areas at minute 1, this figure doubled at minute 2 (Mr1 = 7.71; SDr1 = 6.19; Mr2 = 15.41; SDr2 = 10.67). Simultaneously, the whole group also had longer total fixation times on not-relevant AOIs between minutes 1 and 2 (Wilks’ lambda = 0.65, F(1,13) = 6.89, p = .02). At minute 1, participants spent on average about 8 seconds fixating on not-relevant areas, and this figure increased at minute 2 (Mn1 = 7.65; SDn1 = 4.17; Mn2 = 11.49; SDn2 = 6.32). Finally, regarding correlations, comparing participants’ average fixation times and average confusion ratings over the first 2 minutes revealed a medium-sized positive correlation between fixation time on not-relevant AOIs and confusion ratings (Pearson’s r = .59, p = .03).

To investigate more closely the finding that participants fixated significantly more on both not-relevant and relevant AOIs between minutes 1 and 2, we introduced outcome (problem-solvers vs non-solvers) as a between-subjects factor in the analysis. The results show that solvers fixated more on relevant information than non-solvers: F(1,12) = 5.38, MSE = 82.62, p = .04, although both groups had an overall increase in fixations on relevant information between minutes 1 and 2: F(1,12) = 5.38, MSE = 82.62, p = .04 (Figure 3). Non-solvers also fixated somewhat more on not-relevant information than solvers: F(1,12) = 4.75, MSE = 32.94, p = .05, although both groups had an overall increase in fixations on not-relevant information between minutes 1 and 2: Wilks’ lambda = 0.65, F(1,12) = 6.47, p = .03 (Figure 3).

Figure 3. Comparison of solvers’ and non-solvers’ total fixations on not-relevant and relevant AOIs during the first 2 minutes of problem solving.

The confusion self-ratings did not show the group differences that these fixation results might lead one to expect. Although participants as a whole rated their confusion lower at minute 2 than at minute 1, neither the group difference nor the interaction was significant for changes in confusion self-ratings: F(1,12) = 1.58, MSE = 11.33, p = .23 for group, and Wilks’ lambda = 0.89, F(1,11) = 1.3, p = .28 for the interaction term.

Discussion

This study combined potentially confusing digital material (insight problems) with eye-tracking and self-report measures to extend understanding of how levels of confusion change. The ultimate aim was to derive a set of parameters for the early detection of confusion in digital learning environments. We hypothesised that learners’ self-reported confusion would correlate positively with fixations on relevant areas of the problem. The results did not support this hypothesis. Rather, they supported an alternative hypothesis: self-reported confusion was positively correlated with fixations on not-relevant elements of the problem, a finding that has received little discussion in the literature. One possible explanation for this discrepancy is that we focused on the first 2 minutes of problem solving rather than the whole solution period, because of the uneven solution times and the feedback provided after the first 2 minutes. In Graesser et al.’s (2005) study, the troubleshooting period was limited to 90 seconds, and participants were aware of this limit and able to adjust their strategies accordingly. Another point, mentioned earlier, is that Graesser et al. (2005) did not measure confusion. Their inferences about whether participants could resolve an emerging cognitive disequilibrium were based on the quality of participants’ questions, which served as indicators of performance.
Thus, no confusion measurements are available from their study. We can only assume that learners’ confusion grew at the beginning of the learning process until some learners were able to identify the potential problem area on the screen, that is, until they reached an “aha” moment. At the same time, our finding that all participants fixated more on relevant areas from minute 1 to minute 2 potentially extends Graesser et al.’s (2005) results for effective learners during the 90-second troubleshooting sessions. Graesser et al. (2005) found that only effective learners fixated more on relevant areas; ineffective learners fixated on relevant areas at chance level. Our findings may also align with DeLucia et al.’s (2014) results (i.e., that confusion ratings are positively correlated with fixations), but because those authors did not report correlations between confusion and fixations on specific areas of the screen, this conclusion cannot be drawn.

In a broader sense, our finding that self-reported confusion is positively correlated with fixations on not-relevant areas links to findings from problem-solving research. In particular, Hodgson, Bajwa, Owen, and Kennard (2000) found that poor problem-solvers attempting a logical puzzle fixate more on irrelevant units of the puzzle than good problem-solvers do. In our study, all participants fixated significantly more on not-relevant AOIs at minute 2 than at minute 1, but non-solvers fixated on them more than solvers did (similar to Hodgson et al., 2000). However, these longer fixations were not accompanied by a significant increase in confusion in either group (7 participants each); the correlation held only for the whole sample. It is quite possible that the small size of the solver and non-solver groups left insufficient power to detect an effect.
Another possibility is that non-solvers were still in a state of cognitive disequilibrium with no idea of how to resolve it, whereas solvers could foresee a potential solution (they had longer total fixations on relevant areas) and rated their confusion somewhat lower than non-solvers during minute 2 (Ms = 5.86; SDs = 3.67; Mns = 8.00; SDns = 1.79). As reported in the Results, however, the group difference in confusion self-ratings was not statistically significant. Moreover, the solvers’ confusion ratings for minute 2 were not very homogeneous, as their large standard deviation shows.

Our findings have potential theoretical implications for confusion research, specifically for the early detection of confusion. It is possible that, instead of tracking whether learners fixate on areas relevant to task completion or to the problem solution, confusion researchers should first evaluate learners’ fixations on not-relevant parts of the screen coupled with relatively high confusion ratings. While fixating on relevant areas is conducive to successful problem solving and higher performance, fixations on not-relevant parts combined with relatively high confusion ratings may help detect potential cases of non-constructive confusion. In practice, learners who fixate on not-relevant elements of the material and rate their confusion relatively high at the beginning of the learning process could be offered a combination or choice of confusion-management strategies. As discussed earlier, feedback, advanced scaffolding, and other methods could potentially help such learners resolve their cognitive disequilibrium in time and avoid non-constructive confusion. Moreover, the practical implications of our findings are relatively easy to implement with existing technologies.
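The detection rule proposed above, fixations on not-relevant parts of the screen combined with relatively high confusion ratings early in the learning process, needs very little logic to check. The Python sketch below is a hypothetical illustration only. The class name, the 60-second look-back window, the 50% share of fixation time, and the rating threshold of 7 are placeholder assumptions, not parameters established by this study. It also assumes that a gaze tracker has already labelled each fixation as falling on a not-relevant AOI or elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class EarlyWarningMonitor:
    """Hypothetical early-warning check for one learner working on one resource.

    Times are in seconds since the learner opened the resource. Thresholds are
    illustrative placeholders, not values derived from this study.
    """
    confusion_threshold: float = 7.0           # assumes higher rating = more confused
    not_relevant_share_threshold: float = 0.5  # share of fixation time on not-relevant AOIs
    window_s: float = 60.0                     # look-back window for both signals
    fixations: list = field(default_factory=list)  # (t_s, duration_s, on_not_relevant)
    ratings: list = field(default_factory=list)    # (t_s, rating), appended in time order

    def add_fixation(self, t_s, duration_s, on_not_relevant):
        self.fixations.append((t_s, duration_s, on_not_relevant))

    def add_rating(self, t_s, rating):
        self.ratings.append((t_s, rating))

    def not_relevant_share(self, now_s):
        """Share of in-window fixation time spent on not-relevant AOIs."""
        start = now_s - self.window_s
        recent = [(d, nr) for t, d, nr in self.fixations if start <= t <= now_s]
        total = sum(d for d, _ in recent)
        if total == 0:
            return 0.0
        return sum(d for d, nr in recent if nr) / total

    def latest_rating(self, now_s):
        """Most recent rating inside the window, or None if there is none."""
        start = now_s - self.window_s
        recent = [r for t, r in self.ratings if start <= t <= now_s]
        return recent[-1] if recent else None

    def should_flag(self, now_s):
        """Flag only when both signals point towards confusion."""
        rating = self.latest_rating(now_s)
        return (rating is not None
                and rating >= self.confusion_threshold
                and self.not_relevant_share(now_s) >= self.not_relevant_share_threshold)
```

Requiring both signals follows the argument above: relevant fixations alone indicate progress, whereas the combination of not-relevant fixations and high self-rated confusion is the candidate marker of non-constructive confusion. Any real deployment would need to calibrate the thresholds.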
Early fixations on not-relevant areas of the screen could be assessed with one of the low-cost, webcam-based gaze-recognition applications, such as xLabs (https://xlabsgaze.com/), PyGaze (http://www.pygaze.org/), or GazeHawk (http://gazehawk.com/), while regular confusion ratings could be embedded within a learning management system. Combined in a learning analytics engine, these data could serve as the basis for a fully automated early warning system in existing digital learning environments that are not ITSs. Specifically, learners could be prompted to rate their confusion at 30-second to 1-minute intervals after accessing a particular resource, or simply to click an emoticon-based button when they feel uncertain about the presented information. At the same time, the system would detect whether they fixate on not-relevant features from the moment they access the resource. While our finding suggests a relatively simple, clear-cut confusion detection method, the method must be further tested and fine-tuned in real technology-enabled classrooms. The implementation of the confusion management strategies mentioned above also requires further research. Overall, however, the multimodal method of early confusion detection presented in this paper could serve as an example of using simple recognition parameters to work towards fully automated confusion detection.

Limitations

One limitation of the current study is the post-factum division of participants into successful and unsuccessful problem-solvers. While the outcome of the insight problem-solving process cannot be predicted (Knoblich et al., 2001), it is safe to assume that some people will solve such problems quickly, some slowly, and others not at all within the allotted time.
Although post-factum division into groups complicates inference from the obtained results, such divisions are quite common in problem-solving and confusion research alike (see D’Mello & Graesser, 2014; Graesser et al., 2005; Knoblich et al., 2001). A second limitation is that problem 1 seemed to serve as pre-training for problem 2, which affected our within-subjects analysis. A design with independent problems would allow detailed repeated-measures comparisons and a discussion of the role of task features in confusion detection. Third, individual differences in problem solving could have influenced participants’ problem-solving trajectories, but we did not collect the demographic data needed to assess this influence. Fourth, although think-aloud data were collected, they were not systematically analysed but rather used to triangulate the eye-tracking data (e.g., a participant gazing at not-relevant areas while talking about an incorrect solution). Fifth, we included a pre-test of participants’ visual-spatial abilities, but its results did not significantly influence subsequent problem-solving success. Possibly, a more extensive assessment of visual-spatial abilities in future studies could help uncover the influence of this factor on final performance. Finally, the dynamic nature of the on-screen stimuli and the limits of the technology did not allow a more detailed analysis of learning trajectories in relation to the moving puzzle pieces.

Future directions

While this study provides a proof of concept for early confusion detection, it does not test the validity of generic confusion management methods (e.g., advanced scaffolding, introduction of self-regulatory techniques) for resolving confusion. Further research could shed light on the effectiveness of these interventions once confusion is detected.
First, learners could receive information about confusion and how to manage it in digital learning environments. Second, they could be pre-trained in self-regulatory techniques. Third, they could be shown a video or a simulated example of a peer managing confusion in a similar situation. Fourth, further research should seek to replicate the results of this study and possibly investigate additional indicators for confusion detection, such as changes in posture, pulse, and facial muscle activity. A word of caution is warranted, however, regarding the implementation of these potentially invasive methods in educational settings. Finally, the results of our study could be evaluated in a realistic higher education context once an early warning system based on the confusion detection parameters discussed here is implemented within a learning management system. Overall, implementing such a system could help promote learning and avoid the detrimental outcomes of non-constructive confusion.

Acknowledgements

This research is funded by the Science of Learning Research Centre - A Special Research Initiative of the Australian Research Council (SR120300015).

References

Andres, J. M. A. L., Andres, J. M. L., Rodrigo, M. M. T., Baker, R. S., & Beck, J. B. (2015). An investigation of eureka and the affective states surrounding eureka moments. In H. Ogata et al. (Eds.), Proceedings of the 23rd International Conference on Computers in Education. China: Asia-Pacific Society for Computers in Education.
Baker, R. S. J. d., D’Mello, S., Rodrigo, M., & Graesser, A. (2010). Better to be frustrated than bored: The incidence and persistence of affect during interactions with three different computer-based learning environments. International Journal of Human-Computer Studies, 68(4), 223–241. http://dx.doi.org/10.1016/j.ijhcs.2009.12.003
Craig, S., D’Mello, S., Witherspoon, A., & Graesser, A. (2008). Emote-aloud during learning with AutoTutor: Applying the facial action coding system to cognitive-affective states during learning. Cognition and Emotion, 22(5), 777-788. http://dx.doi.org/10.1080/02699930701516759
DeLucia, P., Preddy, D., Derby, P., Tharanathan, A., & Putrevu, S. (2014). Eye movement behavior during confusion: Toward a method. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, September 2014, 58(1), 1300-1304. http://dx.doi.org/10.1177/1541931214581271
D’Mello, S., Craig, S. D., Witherspoon, A. W., McDaniel, B. T., & Graesser, A. C. (2008). Automatic detection of learner’s affect from conversational cues. User Modeling and User-Adapted Interaction, 18(1-2), 45-80. http://dx.doi.org/10.1007/s11257-007-9037-6
D’Mello, S., Dowell, N., & Graesser, A. C. (2009). Cohesion relationships in tutorial dialogue as predictors of affective states. In V. Dimitrova, R. Mizoguchi, B. du Boulay, & A. Graesser (Eds.), Proceedings of 14th International Conference on Artificial Intelligence in Education (pp. 9-16). Amsterdam: IOS Press.
D’Mello, S., & Graesser, A. (2009). Automatic detection of learners’ emotions from gross body language. Applied Artificial Intelligence, 23(2), 123-150. http://dx.doi.org/10.1080/08839510802631745
D’Mello, S., & Graesser, A. (2010). Multimodal semi-automated affect detection from conversational cues, gross body language, and facial features. User Modeling and User-Adapted Interaction, 20(2), 147-187. http://dx.doi.org/10.1007/s11257-010-9074-4
D’Mello, S., & Graesser, A. (2012). Dynamics of affective states during complex learning. Learning and Instruction, 22(2), 145-157. http://dx.doi.org/10.1016/j.learninstruc.2011.10.001
D’Mello, S., & Graesser, A. (2014). Confusion and its dynamics during device comprehension with breakdown scenarios. Acta Psychologica, 151, 106-116. http://dx.doi.org/10.1016/j.actpsy.2014.06.005
D’Mello, S., Lehman, B., Pekrun, R., & Graesser, A. C. (2014). Confusion can be beneficial for learning. Learning and Instruction, 29(1), 153-170. http://dx.doi.org/10.1016/j.learninstruc.2012.05.003
Dow, G. T., & Mayer, R. E. (2004). Teaching students to solve insight problems: Evidence for domain specificity in training. Creativity Research Journal, 16(4), 389-402. http://dx.doi.org/10.1080/10400410409534550
Ekstrom, R., French, J., & Harman, D. (1976). Manual for kit of factor-referenced cognitive tests. Princeton, NJ: ETS.
Ellis, J., Glaholt, M., & Reingold, E. (2011). Eye movements reveal solution knowledge prior to insight. Consciousness and Cognition, 20(3), 768-776. http://dx.doi.org/10.1016/j.concog.2010.12.007
Graesser, A., Lu, S., Olde, B., Cooper-Pye, E., & Whitten, S. (2005). Question asking and eye tracking during cognitive disequilibrium: Comprehending illustrated texts on devices when the devices break down. Memory & Cognition, 33(7), 1235-1247. http://dx.doi.org/10.3758/BF03193225
Graesser, A. C., & Olde, B. A. (2003). How does one know whether a person understands a device? The quality of the questions the person asks when the device breaks down. Journal of Educational Psychology, 95(3), 524–536. http://dx.doi.org/10.1037/0022-0663.95.3.524
Hodgson, T., Bajwa, A., Owen, A., & Kennard, C. (2000). The strategic control of gaze direction in the Tower-of-London task. Journal of Cognitive Neuroscience, 12(5), 894-907. http://dx.doi.org/10.1162/089892900562499
Knoblich, G., Ohlsson, S., & Raney, G. E. (2001). An eye movement study of insight problem solving. Memory & Cognition, 29(7), 1000-1009. http://dx.doi.org/10.3758/BF03195762
Lehman, B., D’Mello, S. K., & Graesser, A. C. (2012). Confusion and complex learning during interactions with computer learning environments. The Internet and Higher Education, 15(3), 184-194. http://dx.doi.org/10.1016/j.iheduc.2012.01.002
Limón, M. (2001). On the cognitive conflict as an instructional strategy for conceptual change: A critical appraisal. Learning and Instruction, 11(4-5), 357-380. http://dx.doi.org/10.1016/S0959-4752(00)00037-2
Liu, Z., Pataranutaporn, V., Ocumpaugh, J., & Baker, R. (2013). Sequences of frustration and confusion, and learning. In S. K. D’Mello, R. A. Calvo, & A. Olney (Eds.), Proceedings of the 6th International Conference on Educational Data Mining (pp. 114-120). Memphis, TN.
Mayer, R. E., & Sims, V. K. (1994). For whom is a picture worth a thousand words? Extensions of a dual-coding theory of multimedia learning. Journal of Educational Psychology, 86(3), 389-401. http://dx.doi.org/10.1037/0022-0663.86.3.389
McDaniel, B. T., D’Mello, S. K., King, B. G., Chipman, P., Tapp, K., & Graesser, A. C. (2007). Facial features for affective state detection in learning environments. In D. S. McNamara & J. G. Trafton (Eds.), Proceedings of the 29th Annual Cognitive Science Society (pp. 467-472). Austin, TX: Cognitive Science Society.
Paik, E. S., & Schraw, G. (2013). Learning with animation and illusions of understanding. Journal of Educational Psychology, 105(2), 278-289. http://dx.doi.org/10.1037/a0030281
Rodrigo, M. M. T., & Baker, R. S. J. d. (2011). Comparing learners’ affect while using an intelligent tutor and an educational game. Research and Practice in Technology Enhanced Learning, 6(1), 43–66. http://dx.doi.org/10.1007/978-3-540-69132-7_9
Van Gog, T., Paas, F., Van Merriënboer, J. J. G., & Witte, P. (2005). Uncovering the problem-solving process: Cued retrospective reporting versus concurrent and retrospective reporting. Journal of Experimental Psychology: Applied, 11(4), 237-244. http://dx.doi.org/10.1037/1076-898X.11.4.237
VanLehn, K., Siler, S., Murray, C., Yamauchi, T., & Baggett, W. (2003). Why do only some events cause learning during human tutoring? Cognition and Instruction, 21(3), 209-249. http://www.jstor.org/stable/3233810
Waycott, J., Dalgarno, B., Kennedy, G., & Bishop, A. (2012). Making science real: Photo-sharing in biology and chemistry. Research in Learning Technology, 20. http://dx.doi.org/10.3402/rlt.v20i0.16151
Wolfram Research (2014). Mathematica 10 [computer program]. Champaign, IL: Author.
Yamagata-Lynch, L. C., Do, J., Skutnik, A. L., Thompson, D. J., Stephens, A. F., & Tays, C. A. (2015). Design lessons about participatory self-directed online learning in a graduate-level instructional technology course. Open Learning, 30(2), 178-189. http://dx.doi.org/10.1080/02680513.2015.1071244
Zeng, Z., Pantic, M., Roisman, G., & Huang, T. (2009). A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1), 39-58. http://dx.doi.org/10.1109/TPAMI.2008.52

Corresponding author: Mariya Pachman, korotenkom@yahoo.com

Australasian Journal of Educational Technology © 2016.

Please cite as: Pachman, M., Arguel, A., Lockyer, L., Kennedy, G., & Lodge, J. M. (2016). Eye tracking and early detection of confusion in digital learning environments: Proof of concept. Australasian Journal of Educational Technology, 32(6), 58-71. http://dx.doi.org/10.14742/ajet.3060