Journal of Effective Teaching in Higher Education, vol. 1, no. 1 Testing Effect: A Further Examination of Open-book and Closed-book Test Formats Olesya Senkova, Hajime Otani, Reid L. Skeel, and Renée L. Babcock Central Michigan University, Mount Pleasant, MI Abstract. If assessment is the purpose of testing, open-book tests may defeat the purpose. However, a goal of education is to build knowledge, and based on the literature, open-book tests may not be inferior to closed-book tests in promoting long-term retention of information. Participants studied Swahili-English pairs and either re-studied or took an initial quiz, which was cued recall or recognition in an open-book or closed-book format. One week later, the final closed-book recognition test showed higher performance in the quizzed conditions than in the study-twice condition, replicating the testing effect. However, performance was similar across the quizzed conditions, indicating that testing promoted long-term retention regardless of test format (open-book versus closed-book) and test type (cued recall versus recognition). Open-book tests are not inferior to closed-book tests in building knowledge and can be particularly useful in online classes because preventing cheating is difficult when closed-book tests are administered online. Keywords: testing effect; open-book tests; long-term retention; learning In educational settings, tests are used as assessment tools because testing students is the primary method of determining how much students have learned. Traditionally, tests are divided into two categories, recall-based and recognition- based, with essay and short-answer questions representing the former and multiple-choice questions representing the latter. Furthermore, tests can be administered with closed-book or open-book formats, with the former requiring students to rely entirely on memory and the latter allowing students to look up what they did not commit to their memory. Choosing the right format for testing students presents a challenge particularly because of the recent popularity of online classes. In most online classes, it is difficult to know whether students are looking up answers to test questions. In fact, there is a debate over whether cheating on the test is more prevalent in online classes than in face-to-face classes, and if so how to prevent it (e.g., Alessio, Malay, Maurer, Bailer, & Rubin, 2017; Christe, 2003; Cluskey, Ehlen, & Raiborn, 2011; Grijalva, Nowell, & Kerkvliet, 2006; Michael & Williams, 2013; Owens, 2016; Rowe, 2004). Although measures can be implemented to minimize cheating such as using a lock-down web browser with a monitor and imposing a time limit, one must realize that no method is completely foolproof. Because a traditional closed-book test is difficult to implement in an online environment, the instructor may adopt an open-book test. However, the consequences of adopting open-book tests, instead of using traditional closed-book tests, are still not clear. Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 21 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 In the present study, the two formats, closed-book and open-book, were compared to examine a commonly held assumption that the latter format is inferior to the former in achieving the goal of promoting long-term retention of studied materials. Additionally, the type of the test (i.e., cued recall and recognition) was manipulated to investigate whether test format would interact with test type. If the purpose of testing is to assess learning, one may argue that an open-book test would defeat the purpose because when one is allowed to look up answers, it would be difficult to assess what one knows and does not know. However, because one of the goals of education is to develop knowledge (see Bloom’s Taxonomy of Educational Objectives, Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956; also see Anderson et al., 2001, for a revised version), it is important to determine whether using open- book tests, rather than the traditional closed-book tests, would defeat this goal. Reasons that Open-book Tests May Be Inferior to Closed-book Tests There are several reasons to assume that open-book tests are inferior to closed- book tests in building knowledge. First, when a test is offered in an open-book format, students may view the impending test as easy because the answers will be available during the test, and thus, they may devote less effort in studying. In fact, Sparrow, Liu, and Wegner (2011) showed that when participants were informed that they would be allowed to look up the answers when they take the test later on, they tended to show on the test lower recall of the target information but enhanced recall of where to find the target information. Accordingly, expecting an open-book test may discourage students from putting enough effort to build their own knowledge. The second reason that open-book tests might be inferior to closed-book tests in building knowledge is based on the notion that retrieving information from memory enhances learning and facilitates future retrieval. According to Bjork and Bjork (1992), the likelihood of remembering depends on both storage strength and retrieval strength. The storage strength, commonly referred to as memory strength, reflects the amount of learning and determines the availability of information in memory. The storage strength can be increased by repeatedly studying the material. Another component of remembering is retrieval strength, which is how easily memory can be accessed. Bjork and Bjork assumed that building strong memory (storage strength) by repeatedly studying is not sufficient to guarantee successful memory retrieval because in addition to creating strong memory, one needs to practice retrieving memory in order to make it easy to access. Take an old telephone number for example. It may be still available in memory due to repeated use in the past; however, one may experience difficulty retrieving it because it has not been used recently. Another aspect of this theory is that these two types of strength are related such that increasing retrieval strength by repeatedly retrieving memory would also increase storage strength. In effect, retrieving memory acts as another opportunity to learn. What is critical to the issue of test format and building knowledge is that there is an inverse relationship between the ease of retrieval and the amount of increment in storage strength, such that easy retrieval would result in a small increment in storage strength, whereas, difficult retrieval would result in a large increment in storage strength. In Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 22 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 line with this principle, Bjork (1999) argued that training procedures with easy access to correct responses would slow down the growth of storage strength. Based on these notions, it is reasonable to assume that open-book tests, with information readily available during retrieval, would be less efficient in building knowledge because looking up an answer is easier than retrieving an answer from memory, resulting in a smaller increment in storage strength. Reasons that Open-Book Tests May Not Be Inferior to Closed-Book Tests There are indications that making a test open-book may not defeat the purpose of building knowledge. First, the format of the test, open-book or closed-book, may not matter as long as the test questions are sufficiently difficult to promote elaborative or deep-levels of processing. Based on the literature on the levels of processing model (Craik, 2002; Craik & Lockhart, 1972; Craik & Tulving, 1975), it is clear that regardless of one’s intention to learn, processing information at a deep level (i.e., semantic level) would produce durable memory compared to processing information at a shallow level (i.e., perceptual level). For instance, recognition performance was higher when participants were asked to think whether a given word would fit into a sentence frame (e.g., “He met _____ in the street?”) than when participants were asked to judge whether a word was in capital letters (Craik & Tulving, 1975). These orienting questions can be conceptualized as questions on a test; and based on the results of these studies, it is reasonable to assume that if the questions on the open-book test direct one to process the answers at a deep level, durable memories can be formed. Research on the testing effect also supports the notion that open-book tests may not be inferior to closed-book tests with regard to long-term retention of information. The testing effect is a phenomenon that simply taking tests increases long-term retention of information better than re-studying does. There is substantial evidence showing that the testing effect is a robust phenomenon (see Roediger & Karpicke, 2006a, for an extensive review) that can be observed with a variety of tests (such as free recall, cued recall, and recognition), materials (such as word lists, lists of paired-associates, and prose materials), and settings (such as laboratory and educational settings). There are at least three recently published meta-analyses on the testing effect (Adesope, Trevisan, & Sundararajan, 2017; Pan & Rickard, 2018; Rowland, 2014), and all confirmed that the testing effect is a powerful method of increasing learning. Furthermore, Pan and Rickard (2018) showed that the testing effect is robust even when the format and information being tested on the initial test are different from those on the final test, showing the transfer of learning effect. In addition, some studies have shown that the testing effect is similar between open-book and closed-book tests, even though contrary results have also been reported. In a review of the literature comparing open-book and closed-book test formats, Durning and colleagues (2016) reported that among the five studies that examined the testing effect between these formats, four showed that the testing effect was similar between these test formats (Agarwal, Karpicke, Kang, Roediger, & McDermott, 2008; Agarwal & Roediger, 2011; Gharib, Phillips, & Mathew, 2012; Pauker, 1974), whereas, one showed that the testing effect was lower in the open-book format than in the closed book format Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 23 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 (Moore & Jensen, 2007). However, the results are not as straight forward as it appears. Although Pauker (1974) reported that the testing effect was similar between the open-book and closed book conditions, for the students who started out the semester at the bottom third of the class, open-book tests resulted in lower final test scores than closed-book tests. Agarwal and Roediger (2011) also showed that expecting an open-book test reduced time spent on learning, resulting in lower performance on transfer questions on the final test that probed deep learning. Furthermore, Moore and Jensen (2007) showed that performance on the final test was lower when previous tests were open-book relative to closed-book, and that expecting open-book tests resulted in poor academic behaviors, such as skipping lectures and help-sessions as well as not completing extra assignments. In sum, the literature on testing effect showed mixed results that indicate a complex interaction of various factors (such as motivation) that goes beyond the issue of test format. Another reason that open-book tests may not be inferior to closed-book tests is based on the issue of information re-exposure. Open-book tests re-expose students to information successfully retrieved as well as information that was not retrieved, providing opportunities for additional learning. Moreover, an open-book test allows students to access correct answers, which would, in turn, minimize commission errors. Butler, Marsh, Goode, and Roediger (2006) showed that when participants made commission errors on the first test and did not receive feedback with the correct answers, they made the same errors on the final test. It is, therefore, possible that open-book tests are superior to closed-book tests by limiting the number of commission errors, preventing long-term retention of incorrect information. Note, however, that a recent meta-analysis by Adesope et al. (2017) showed that providing or not providing feedback was not a significant moderator of the effect size associated with the testing effect. Their explanation was that taking a test itself is cognitively challenging enough to produce the benefit of testing, and therefore, availability of feedback may not produce an additional benefit. Nevertheless, these researchers also cautioned that their finding may be due to unidentified confounding variables given that there are studies that showed the benefit of providing feedback in addition to testing (e.g., Metcalfe, Kornell, & Finn, 2009; Pashler, Cepeda, Wixted, & Rohrer, 2005). With this caveat, Adesope et al. concluded that further research is needed to investigate the effect of providing feedback on the testing effect. Rationale and Methods of Present Study It is intriguing to consider a possibility that open-book tests are as effective as closed-book tests in building knowledge, particularly for traditional educators who regard tests only as assessment tools. Because this issue has an important implication for educational practice, it warrants further investigation. As noted above, the results of research are conflicted as to whether open-book and closed- book tests differ in promoting long-term retention of studied material, with some studies showing that these two formats are similar (e.g., Agarwal et al., 2008; Agarwal & Roediger, 2011; Pauker, 1974; Gharib et al., 2012) and the other studies showing that open-book tests are less effective than closed-book tests, Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 24 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 particularly, in promoting deep learning (e.g., Agarwal & Roediger, 2011; Pauker, 1974). However, not all learning situations require deep learning, such as during the early stage of learning. The question, then, is whether the lack of difference between open-book and closed-book tests can be generalized to non-text-based materials. Although text-based materials clearly have educational relevance, building knowledge also involves acquiring facts and building vocabularies. Accordingly, in the present study, we decided to investigate the effect of test format using Swahili-English word pairs in a testing effect paradigm similar to Agarwal et al. (2008). It is possible that the lack of difference between open-book and closed-book tests reported by Agarwal et al. and other researchers in the past was based on the use of text materials. Because text materials are well-organized and meaningful, it is possible that participants could easily engage in elaborative processing regardless of whether the initial test was open-book or closed-book, thereby reducing the difference between the two formats. The question is whether the effect of test format would emerge when the study material is less organized and less meaningful. The present study consisted of three phases. During Phase I, participants were asked to study a list of Swahili-English word pairs. During Phase II, participants were provided with either additional opportunity to study (re-study) or they received an initial test of the material. During Phase III, which occurred one week later, participants received a final test that assessed their memory for the word pairs. The critical manipulation was that for half of the participants, the initial test was an open-book test, and for the other half of the participants, the initial test was a closed-book test. Finally, in addition to the format of the test, the type of the test (cued recall versus recognition) was manipulated during Phase II when the initial test was administered. This manipulation was based on the assumption that the difference between open-book and closed-book tests may appear when an initial test is not sufficiently difficult to promote a deep level of processing (i.e., recognition). That is, when a test is sufficiently challenging to promote long-term learning, the test format may not matter, whereas when a test is not challenging enough, a closed- book test may show superiority over an open-book test. In sum, participants in the present study were asked to learn a list of Swahili-English word pairs followed by a cued-recall or recognition initial test, which was administered in an open-book or closed-book format. Also, there was a control condition in which participants were asked to re-study the list instead of taking the initial test. The final recognition test was administered one week later to examine whether the test format (open-book versus closed-book) as well as the test type (cued recall versus recognition) made a difference in long-term retention of the study material. Methods Participants Participants were 39 male and 136 female undergraduate students attending introductory psychology courses at a public university in the Midwestern region of Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 25 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 the United States. They were offered extra course credit for their participation. An equal number (n = 35) of participants were randomly assigned to five between- subjects conditions based on the format (i.e., open-book versus closed-book) and type (i.e., cued recall versus recognition) of the initial quiz, plus a study twice control condition: (1) closed-book recognition, (2) closed-book cued recall, (3) open-book recognition, (4) open-book cued recall, and (5) study 2x (twice). Table 1 summarized the conditions and procedure. Note that for the rest of this paper, the initial test will be referred to as the ‘initial quiz’ and the final test will be referred to as the ‘final test’ to make it easy to differentiate these tests. The study was conducted in accordance with approval given by the Institutional Review Board at the university where participants were tested. Materials Fifty Swahili-English word pairs were selected from the Nelson and Dunlosky (1994) norms (see Appendix for examples). Based on the normative proportion of correct recall, these pairs were high in difficulty (ranged from .07 to .18). A PowerPoint presentation was used to present these pairs, one at a time in the middle of the computer screen in lowercase letters at the rate of one pair per 5 s. The order of the pairs was randomized once, and the same order was used for all participants. The initial quiz was either cued recall or recognition. These quizzes were constructed by randomly selecting 35 Swahili words from the study list. Using 35 words rather than 50 words left 15 words for assessing performance on the one- week delayed final test when there was no initial quiz. For the cued recall quiz, these words were presented on a sheet of paper in a random order with a blank space next to each word for a response (e.g., theluji - _____). For the recognition quiz, these words were presented with four alternative choices of possible English translation for each Swahili word. The distractor choices were randomly selected from English translation of the other words in the study list, making it associative- recognition rather than item-recognition. Associative recognition was used to increase retrieval effort because unlike item-recognition, associative recognition depends more on retrieval than familiarity (e.g., Hockley & Consoli, 1999; Westerman, 2001). Each Swahili word was presented with four choices next to it, randomly ordered. Table 1 Conditions and Procedure Conditions Session 1 Session 2 (One week later) Phase I Phase II (Initial Quiz or Study) Phase III (Final Test) Closed-book Cued Recall Study Cued Recall (Closed-book) Recognition (Closed-Book) Closed-book Recognition Study Recognition (Closed-book) Recognition (Closed-Book) Open-book Cued recall Study Cued Recall (Open-book) Recognition (Closed-Book) Open-book Recognition Study Recognition (Open-book) Recognition (Closed-Book) Study 2x Study Study Recognition (Closed-Book) Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 26 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 Note: During Phase I, participants studied 50 Swahili-English word pairs. During Phase II, participants were quizzed on 35 word pairs or re-studied 50 word pairs. During Phase III, participants were tested on all 50 word pairs. The one-week delayed final test consisted of all 50 Swahili words from the study list, each presented with four alternative choices of possible English translation. A recognition test was used as the final test based on the assumption that recognition would provide more sensitive assessment of learning than recall. For the 35 Swahili words that were initially quizzed, the recognition items were the same as those in the initial recognition quiz. For the 15 Swahili words that were not initially quizzed, recognition items were constructed using English translation of other studied words as distractors. These items were randomly ordered once, and the same order was used across participants. In addition, a sheet of paper with the study list (in the order of presentation) was used for the open-book quiz. A sheet of paper with random two-digit number was used for the filler task (see below), with a stopwatch used to time the duration of the filler task. Procedure Small groups up to four individuals were tested in two sessions with one-week delay between the sessions. During Session 1, Phase I and II of the study were administered. During Phase I, participants were instructed to study a list of 50 Swahili-English word pairs. The presentation of the study list was repeated three times to ensure that participants learned the list at a sufficiently high level to avoid a floor effect. They were not informed about the format of the initial quiz nor the final test they took after one-week delay. Following the study phase, participants were asked to perform a filler task for 2 minutes, crossing out the numbers divisible by three. The filler task was administered to eliminate a recency effect. Following the filler task, Phase II commenced, and participants in the initially quizzed condition completed the self-paced initial quiz. Participants in the cued recall condition were asked to write an English equivalent of each Swahili word, and participants in the recognition condition were asked to select the correct English equivalent of each Swahili word among the four alternatives. These quizzes were administered in a closed-book or open-book format; that is, participants in the closed-book condition were asked to take the quiz without looking up the answers whereas participants in the open-book condition were given a sheet of paper with the study list and were allowed to look up the answers. No feedback was given after completing the quiz. In the study 2x condition, instead of taking the initial quiz, participants re-studied all 50 Swahili-English word pairs printed on a sheet of paper one more time. At the end of the first session, participants in all conditions were told that at the second session seven days later, they would be asked about the Swahili words they studied during the first session. Phase III of the study was administered in Session 2, which was scheduled one week after Session 1. During Phase III, participants took the self-paced final recognition test, with instruction to select the correct English equivalent for each Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 27 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 Swahili word. However, before taking the test, participants were asked to make a global judgment of learning (JOL), predicting how many words they would be able to correctly recognize among 50 Swahili words. Participants were asked to make a JOL rating because it is possible that the format and type of the initial quiz would influence their metacognitive judgments. Results The significance level was set at .05, and unless otherwise specified, two-tailed tests were performed. The dependent measures were the proportion of correct responses on the initial quiz, the proportion of correct responses on the final recognition test, and JOL, which was converted to a proportion. Note that on the initial quiz, participants were quizzed on 35 word pairs out of 50 word pairs they studied whereas on the final recognition test and JOL, participants were tested on all 50 word pairs. Because the goal of the study was to investigate the effect of the initial quiz format (i.e., open-book versus closed-book) as well as the initial quiz type (i.e., cued recall versus recognition) on final test performance, the proportion of correct responses on the final recognition test was compared across the conditions. A one-way analysis of variance (ANOVA) on the final test performance for all 50 word pairs indicated that the difference among the conditions was significant, F (4, 170) = 2.95, MSE = 0.03, p = .02, ηp2 = .07. As shown in Table 2, least significant difference (LSD) tests revealed that the study 2x condition (M = .45, SD = .13) showed significantly lower performance than the open-book cued recall (M = .57, SD = .18), open-book recognition (M = .54, SD = .19), and closed-book cued recall (M = .57, SD = .17) conditions. The difference between the study 2x condition (M = .45, SD = .13) and the closed-book recognition condition (M = .52, SD = .14) did not reach statistical significance with a two-tailed test (p = .09); however, based on a priori hypothesis that the initial testing would produce a testing effect, the difference was significant with a one-tailed test (p = .04). No other difference was significant, indicating that the testing effect was similar across the quizzed conditions.1 In order to gain insight as to how the testing effect had occurred, different groups of word pairs were analyzed. Because 35 word pairs out of 50 studied word pairs were quizzed on the initial quiz, these quizzed word pairs should show a testing effect on the final test, and that the effect should be similar across the quizzed conditions. This expectation was confirmed. A one-way ANOVA on the final test performance for 35 words that were quizzed on the initial quiz indicated that the difference among the conditions was significant, F (4, 170) = 4.36, MSE = 0.03, p = .002, ηp2 = .09. As shown in Table 2, LSD tests indicated that the study 2x condition (M = .44, SD = .14) showed significantly lower performance than the open-book cued recall (M = .59, SD = .18), open-book recognition (M = .55, SD = .19), closed-book cued recall (M = .57, SD = .18), and closed-book recognition (M = .53, SD = .15) conditions. No other comparison was significant, indicating that the testing effect was similar across the quizzed conditions. Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 28 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 Next, 15 word pairs that were not quizzed on the initial quiz were analyzed to test a possibility that these non-quizzed words also showed a testing effect. This expectation was not confirmed. A one-way ANOVA on final recognition performance for 15 word pairs that were not quizzed on the initial quiz indicated that the difference among the conditions was not significant, F (4, 170) = 0.83, MSE = 0.04, p = .51, indicating that the testing effect was not observed with these words (see Table 2). Table 2 Mean Proportion of Correct Responses on the Initial Quiz and on the Final Recognition Test as a Function of Initially Quizzed Conditions and Different Groups of Words on Final Test Initial Quiz Final Test Conditions (35 word pairs) All 50 Word Pairs 35 Quizzed Word Pairs 15 Not Quizzed Word Pairs Correct on Initial Quiz Incorrect on Initial Quiz JOL Closed-book Cued Recall M SD .36a .19 .58a .18 .57a .18 .58a .19 .82a .16 .48a .18 .29a .19 Closed-book Recognition M SD .73b .18 .52a .14 .53a .15 .51a .14 .61b .14 .28b .21 .33a .20 Open-book Cued Recall M SD .99c .01 .57a .18 .59a .18 .53a .24 .59b .18 - - .28a .14 Open-book Recognition M SD .98c .04 .54a .19 .55a .19 .52a .24 .55b .19 - - .30a .17 Study 2x M - .45b .44b .50a - - .29a SD - .14 .14 .18 - - .19 Note: For each column, significant differences (p < .05) are indicated by different superscript letters. All comparisons were based on two-tailed tests except for the comparison between the closed-book recognition and study 2x conditions for the final test with all 50 word pairs, which was based on a one-tailed test. To investigate whether correctly responding on the initial quiz influenced the final test performance, the next analysis examined what proportion of the correct responses on the initial quiz was also correct on the final test. Because the initial quiz was not administered in the study 2x condition, this condition was excluded from the analysis. A one-way ANOVA indicated that the difference among the conditions was significant, F (3, 136) = 19.23, MSE = 0.03, p < .001, ηp2 = .30. As shown in Table 2, LSD tests revealed that the closed-book cued condition (M = .82, SD = .16) showed significantly higher performance than the open-book cued recall (M = .59, SD = .18), open-book recognition (M = .55, SD = .18), and closed-book recognition (M = .61, SD = .13) conditions, indicating that successfully recalling without looking up answers on the initial quiz (i.e., closed-book cued recall) led to a higher success on the final test. No other comparison was significant. An analysis was also performed to investigate whether the initial quiz had a positive Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 29 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 effect on the final test performance for those word pairs for which incorrect responses (i.e., omission and commission errors) were given on the initial quiz. However, in the open-book cued recall and recognition conditions, the number of such errors was close to zero, and therefore, a t-test was conducted to compare the closed-book cued recall and recognition conditions. Final recognition performance on the words participants failed to remember on the initial quiz indicated that the closed-book cued recall condition (M = .48, SD = .18) produced higher performance than the closed-book recognition condition (M = .28, SD = .21), t (66) = 4.23, p < .001, d = 1.02, on the final test. This finding shows that the participants were able to recognize items on the final tests that they did not remember on the initial quiz, and that cued recall on the initial quiz showed higher likelihood of such success than recognition. Next, JOL was analyzed to investigate whether metacognition was influenced by the format and type of the initial quiz. A one-way ANOVA on JOL indicated that the difference among the conditions was not significant, F (4, 170) = 0.45, MSE = 0.03, p = .77. In all conditions, participants predicted that they would be able to correctly recognize 30 percent from 50 Swahili words they studied one week earlier (M = .30, SD = .18). Lastly, a one-way ANOVA on the initial quiz performance showed that the difference among the conditions was significant, F (3, 136) = 175.01, MSE = 0.02, p < .001, ηp2 = .79. As mentioned, the number of errors was closed to zero for the open-book cured recall (M = .99, SD = .01) and recognition tests (M = .98, SD = .04), and LSD tests showed that the difference was non-significant between these two conditions. All the other comparisons were significant, indicating that recognition was easier than cued recall when the initial quiz was closed-book. Discussion The present study examined whether open-book and closed-book formats of an initial quiz would influence performance on a delayed final recognition test when Swahili-English word pairs, as opposed to text materials, are used as study material. As mentioned in the introduction section, in the online learning environment, testing is challenging because it is difficult, if not impossible, to make a test cheat proof.2 However, such a concern may only arise when a test is considered only as an assessment tool, consistent with the traditional view of education. In contrast, if the focus is shifted toward a long-term goal of education (i.e., building knowledge), it may not matter whether a test is open-book or closed- book because what matters most is whether students will develop knowledge of whatever they are learning. In fact, the past studies investigating the effect of test format using the testing effect paradigm showed that both open-book and closed- book formats produced similar performance on the final closed-book test, indicating that both formats would promote long-term memory (e.g., Agarwal et al., 2008; Agarwal & Roediger, 2011; Pauker, 1974; Gharib et al., 2012). However, a concern with these studies is that they used text-based materials, and therefore, it is possible that the lack of difference between open-book and closed-book formats was simply reflecting the fact that the materials were well-organized and Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 30 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 meaningful, thereby conducive to elaborative processing regardless of the format of the initial quiz. The present study, therefore, used a simple material (i.e., Swahili- English word pairs) in order to reduce the possible influence of other factors (such as organization, meaningfulness, and familiarity). Furthermore, the present study investigated the type of initial quiz (cued recall versus recognition) because the difference between the test formats might emerge when the initial quiz is not sufficiently difficult to induce a deep level of processing (i.e., recognition). There were several major findings. First, the final test performance was similar among the four quizzed conditions, regardless of whether all 50 words from the study list (i.e., the whole list) or 35 words that were quizzed on the initial quiz were examined. The results, therefore, indicated that neither the initial quiz format (i.e., open-book versus closed-book) nor the initial quiz type (i.e., cued recall versus recognition) influenced the final test performance. Second, the study 2x control condition produced significantly lower final recognition performance than the four quizzed conditions for both the whole list and the quizzed words. This finding is consistent with a phenomenon referred to as the testing effect (e.g., Roediger & Karpicke, 2006), indicating that taking a quiz increases long-term retention of information relative to re-studying. This notion is further supported by the finding that the testing effect was only found with the quizzed words, as opposed to the non-quizzed words (i.e., 15 control words that were not quizzed on the initial quiz). Third, when the final test performance was conditionalized on correct responses on the initial quiz, the closed-book cued recall condition showed higher performance than the other quizzed conditions. This finding indicates that there is an advantage in making the initial quiz closed-book cued recall in line with the notion of desirable difficulty by Bjork (1994, 1999), which contends that difficult processing, whether it is at encoding or retrieval, benefits long-term retention. Finally, JOLs of the final test was similar across conditions: In all conditions, including the study 2x condition, participants predicted that they would be able to correctly recognize about 30% of 50 Swahili words on the final test. Because actual performance was higher than 30% in all conditions, participants underestimated their performance. Although JOL ratings were not accurate in all conditions, it is important to note that JOLs were similar between the study 2x and quizzed conditions, indicating that JOLs did not show a testing effect. This finding is consistent with the result of other studies (e.g., Agarwal et al., 2008; Roediger & Karpicke, 2006b) that participants are not sensitive to the beneficial effect of testing. Furthermore, note that in many studies, participants show a tendency to overestimate rather than underestimate their performance (e.g., Agarwal et al., 2008; Ayton & McClelland, 1997). It is not clear the reason that participants underestimated their performance in the present study. A possibility is that the material they learned in this study was unfamiliar foreign vocabularies, and therefore, participants did not have high confidence. Overall, the results of the present study showed that although performance on the Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 31 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 initial quiz was predictably higher in the open-book condition than in the closed- book condition, this advantage vanished after one-week delay, resulting in similar final test performance between the open-book and closed-book conditions. Furthermore, the final test performance did not show the effect of initial quiz type (i.e., cued recall and recognition). Taken together, an open-book test format is as effective as a closed-book test format in promoting long-term retention, even when the study material is not text materials. In addition, the initial quiz could be either cued recall or recognition, indicating that in building knowledge, the act of being quizzed is the critical factor across quiz formats and quiz types. Why is it that the initial quiz format did not make a difference in the amount of testing effect? It is possible that even when the quiz was open-book, participants took it as if it was a closed-book quiz. If so, there was no functional difference between the two quiz formats. Although this is plausible, it is unlikely based on the initial quiz performance, showing that performance was almost 100% on the open-book quiz whereas performance was much lower on the closed-book quiz. Alternatively, it is possible that what is critical is the opportunity to process the words deeply regardless of whether it is done by open-book or closed-book quizzes. It appears that both open-book and closed-book quizzes acted as orienting questions in an incidental learning study (e.g., Craik & Tulving, 1975) inducing a deep level of processing. Further studies are needed to examine the processes that are induced by open-book and closed-book formats, which led to higher final test performance for the quizzed conditions relative to the study 2x condition. Although the final test performance was similar across the quizzed conditions, there was an indication that a closed-book format with cued recall on the initial quiz may yield some advantage. As mentioned, consistent with the notion of desirable difficulty (Bjork, 1994, 1999), the final test performance was higher in the closed- book cued recall condition than in other quizzed conditions when the words that were correctly responded on the initial quiz were examined. This finding indicated that difficult retrieval would produce greater increment in storage strength than easy retrieval, in line with the notion that there is a negative correlation between storage and retrieval strength (Bjork & Bjork, 1992). It is possible that the difficulty of retrieval may ultimately prevail when memory is tested after a retention interval longer than one week. In fact, meta-analyses by Pan and Rickard (2018) and Rowland (2014) showed that retrieval effort or elaborative retrieval was a moderating variable that increased the testing effect. Note however that a meta- analysis by Adesope et al. (2017) showed that the testing effect was greater with less effort such that the testing effect was greater with recognition tests than with cued-recall tests. Accordingly, the role of retrieval difficulty in the testing effect is not clear, and therefore, further investigation is needed. Another interesting finding was that when the words that were not correctly responded on the initial quiz (i.e., omission and commission errors) were examined, the closed-book cued recall condition showed higher final recognition performance than the closed-book recognition condition. However, this finding may simply reflect the fact that recognition is easier than recall, such that the increase in performance Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 32 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 from the initial quiz to the final test was as a result of comparing a difficult initial quiz (i.e., cued recall) and an easy final test (i.e., recognition). Another issue that needs to be investigated in the future is the nature of the final test. In the present study, a recognition test was used to test final performance. However, it is possible that the results might be different when a cued recall test is used as the final test. The type of final test might be important because in the present study, the advantage of initial cued recall over recognition may have been masked due to a mismatch between the initial and final tests in the cued recall conditions. That is, a transfer appropriate processing (Morris, Bransford, & Franks, 1977) is confounded between the initial cued recall and recognition quiz conditions. However, ultimately how knowledge will be tested would be critically dependent on the type of knowledge and how it is used. For instance, learning a foreign language would require more than just recognizing vocabulary words and their English equivalents. In this sense, a decision to adopt a particular test type may require a domain specific approach. Nevertheless, the present study showed that it is premature to assume that a test is inferior just because it is administered using an open-book test format. In conclusion, the results of the present study showed that the testing effect is similar between open-book and closed-book quizzes, even when the study material is an unrelated set of Swahili-English pairs, as opposed to well-organized and meaningful text materials. Furthermore, initial quiz type, cued recall or recognition, did not make a difference. These results, therefore, supported the notion that an open-book test is not necessarily inferior to a closed-book test in promoting long- term retention. However, there was an indication that making the initial quiz difficult, as in the closed-book cued recall condition, has an added advantage, which needs to be investigated in future research. Practical Implications for Classroom Practice Given that testing promotes long-term retention of studied material, coupled with the present result that there is no difference between open-book and closed-book formats, how can these research findings be translated to classroom practice? On the one hand, it seems to be safe to replace traditional closed-book tests with open-book tests if the purpose of education is to build knowledge. On the other hand, such practice would represent a radical departure from the traditional method, and as such, it may be difficult to convince teachers in traditional face-to- face classrooms to adopt such new practice. However, the situation may be different for teachers in online classes because these teachers may be more experienced with non-traditional methods. Nevertheless, doing away with closed- book tests entirely may not be practical because it would be difficult if not impossible to document the outcome of education unless there is an assessment.3 Based on these considerations, a preferable approach would be to mix open-book tests and closed-book tests within a particular course (or a curriculum) with the former being used for building knowledge and the latter being used for assessment. In line with this recommendation, the second author of this paper began using this hybrid approach in a 300-level course at his university. In this face-to-face class, Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 33 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 he implemented open-book quizzes (multiple-choice questions), which he allowed students to take multiple times. He used these open-book quizzes for building knowledge, but for assessment, he used closed-book exams. The results were encouraging. For the semester when he implemented the quizzes, there was a modest increase in the average score of the closed-book exams (9% improvement overall) compared to prior semesters. Although the increase was not dramatic, the results were encouraging, given that the quizzes were not mandatory and did not contribute to the course grade. With this modest success, this hybrid approach should be explored further, particularly in online classes. We argue that this hybrid approach is especially relevant for online classes for at least two reasons. First, in these classes, in-person test proctoring can be expensive and time-consuming, and second, as mentioned earlier, other methods of minimizing the potential for cheating on online closed-book tests have limitations. Accordingly, by incorporating open-book quizzes, the number of closed-book tests can be reduced without jeopardizing the development of long-term knowledge. In conclusion, any laboratory finding requires extensive translational research before it becomes useful in practice. However, the results of the present study showed that open-book tests are a viable method of building knowledge, which we regard as one of the important goals of education. Conflicts of Interest The authors declare that there is no conflict of interest regarding the publication of this article. References Adesope, O. O., Trevisan, D. A., & Sundararajan, N. (2017). Rethinking the use of tests: A meta-analysis of practice testing. Review of Educational Research, 87, 659-701. Anderson, L. W., Krathwohl, D. R., Airasian, P. W., Cruikshank, K. A., Mayer, R. E., Pintrich, P. R., Raths, J., & Wittrock, M. C. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s Taxonomy of Educational Objectives (Complete edition). New York: Longman. Alessio, H. M., Malay, N., Maurer, K., Bailer, A. J., & Rubin, B. (2017). Examining the effect of proctoring on online test scores. Online Learning, 21, 146-161. Agarwal, P. K., Karpicke, J. D., Kang, S. H. K., Roediger, H. L. III, & McDermott, K. B. (2008). Examining the testing effect with open- and closed-book tests. Applied Cognitive Psychology, 22, 861-876. Agarwal, P. K., & Roediger III, H. L. (2011). Expectancy of an open-book test decreases performance on a delayed closed-book test. Memory, 19, 836-852. Ayton, P., & McClelland, A. G. R. (1997). How real is overconfidence? Journal of Behavioral Decision Making, 10, 279–285. Bjork, R. A. (1999). Assessing our own competence: Heuristics and illusions. In D. Gopher & A. Koriat (Eds.), Attention and performance XVII. Cognitive regulation of performance: Interaction of theory and application (pp. 435- 459). Cambridge, MA: MIT Press. Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 34 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe & A. P Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 185-205). Cambridge, MA: MIT Press. Bjork, R. A., & Bjork, E. L. (1992). A new theory of disuse and an old theory of stimulus fluctuation. In A. F. Healy, S. M. Kosslyn, & R. M. Shiffrin (Eds.), From learning processes to cognitive processes: Essays in honor of William K. Estes (pp. 35-67). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc. Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives. The classification of educational goals. Handbook 1 Cognitive domain. New York: Logmans, Green and Co. Butler, A. C., Marsh, E. J., Goode, M. K., & Roediger, H. L., III. (2006). When additional multiple-choice lures aid versus hinder later memory. Applied Cognitive Psychology, 20, 941-956. Christe, B. (2003). Designing online courses to discourage dishonesty. Educause Quarterly, 26, 54-58. Cluskey Jr, G. R., Ehlen, C. R., & Raiborn, M. H. (2011). Thwarting online exam cheating without proctor supervision. Journal of Academic and Business Ethics, 4, 1-7. Craik, F. I. (2002). Levels of processing: Past, present... and future? Memory, 10, 305-318. Craik, F. I., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. Journal of Verbal Learning & Verbal Behavior, 11, 671- 684. Craik, F. I., & Tulving, E. (1975). Depth of processing and the retention of words in episodic memory. Journal of Experimental Psychology: General, 104, 268- 294. Durning, S. J., Dong, T., Ratcliffe, T., Schuwirth, L., Artino, A. R., Boulet, J. R., & Eva, K. (2016). Comparing open-book and closed-book examinations: A systematic review. Academic Medicine, 91, 583-599. Gharib, A., Phillips, W., & Mathew, N. (2012). Cheat sheet or open-book? A comparison of the effects of exam types on performance, retention, and anxiety. Psychology Research, 2, 469-478. Grijalva, T. C., Nowell, C., & Kerkvliet, J. (2006). Academic honesty and online courses. College Student Journal, 40, 180-185. Hockley, W. E., & Consoli, A. (1999). Familiarity and recollection in item and associative recognition. Memory & Cognition, 27, 657-664. Metcalfe, J., Kornell, N., & Finn, B. (2009). Delayed versus immediate feedback in children’s and adults’ vocabulary learning. Memory & Cognition, 37, 1077- 1087. Michael, T. B., & Williams, M. A. (2013). Student equity: Discouraging cheating in online courses. Administrative Issues Journal, 3(2). Retrieved from https://www.swosu.edu/academics/aij/2013/v3i2/michael-williams.pdf Moore, R., & Jensen, P. A. (2007). Do open-book exams impede long-term learning in introductory biology courses? Journal of College Science Teaching, 36, 46- 49. Morris, C. D., Bransford, J. D., & Franks, J. J. (1977). Levels of processing versus transfer appropriate processing. Journal of Verbal Learning and Verbal Behavior, 16, 519-533. https://www.swosu.edu/academics/aij/2013/v3i2/michael-williams.pdf Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 35 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 Nelson, T. O., & Dunlosky, J. (1994). Norms of paired-associate recall during multitrial learning of Swahili-English translation equivalents. Memory, 2, 325- 335. Owens, H. S. (2016). Cheating within online assessments: A comparison of cheating behaviors in proctored and unproctored environments (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 3737143). Pan, S. C., & Rickard, T. C. (2018). Transfer of test-enhanced learning: Meta- analytic review and synthesis. Psychological Bulletin, 144, 710-756. Pashler, H., Cepeda, N. J., Wixted, J. T., & Rohrer, D. (2005). When does feedback facilitate learning of words? Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 3-8. Pauker, J. D. (1974). Effect of open book examinations on test performance in an undergraduate child psychology course. Teaching of Psychology, 1, 71-73. Roediger, H. L. III, & Karpicke, J. D. (2006a). The power of testing memory: Basic research and implications for educational practice. Perspectives on Psychological Science, 1, 181-210. Roediger, H. L. III, & Karpicke, J. D. (2006b). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17, 249- 255. Rowe, N. C. (2004). Cheating in online student assessment: Beyond plagiarism. Online Journal of Distance Learning Administration, 7. Retrieved from http://www.westga.edu/~distance/ojdla/summer72/rowe72.html Rowland, C. A. (2014). The effect of testing versus restudy on retention: A meta- analytic review of the testing effect. Psychological Bulletin, 140, 1432-1463. Sparrow, B., Liu, J., & Wegner, D. M. (2011). Google effects on memory: Cognitive consequences of having information at our fingertips. Science, 333, 476-478. Westerman, D. L. (2001). The role of familiarity in item recognition, associative recognition, and plurality recognition on self-paced and speeded tests. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27, 723-732. Footnotes 1 We also conducted a 2 (quiz type: cued recall and recognition) x 2 (quiz format: open- book and closed-book) ANOVA without the study 2x condition. The results showed that no effect was significant. 2 We acknowledge that the problem of cheating is not limited to online tests. 3 We also acknowledge that assessments can be conducted in a variety of ways including (and not limited to) using closed-book tests. http://www.westga.edu/%7Edistance/ojdla/summer72/rowe72.html Testing Effect: A Further Examination of Open-book and Closed-book Test Formats 36 Journal of Effective Teaching in Higher Education, vol. 1, no. 1 Appendix Examples of Swahili-English Word Pairs Swahili Word English Equivalent Jani Leaf Chura Frog Lozi Almond Nira Yoke Wakili Agent Yatima Orphan Bahasha Envelope Chaza Oyster Fumbo Mystery