Informal Logic, VII.2&3, Spring & Fall 1985

Teaching Critical Thinking at the University Level: A Review of Some Empirical Evidence*

LEONARD E. GIBBS
University of Wisconsin-Eau Claire

This review was conducted specifically to help us plan a critical thinking program for faculty at the University of Wisconsin-Eau Claire, and to evaluate its effects on faculty and students in their classrooms. It seemed appropriate first to weigh evidence before beginning such a program. Thus, the questions listed in the abstract concern measures of critical thinking, the effectiveness of conventional curricula, the effectiveness of curricula designed specifically to teach critical thinking, and factors associated with successful learning by participants. Because we thought empirical studies would provide the clearest answers to our questions, studies of critical thinking in universities are the only evidence included.

A few clarifications may be helpful. Some otherwise excellent studies were excluded because they evaluated critical thinking at the pre-university level (Noyce, 1970; Smith & Tyler, 1942).

The review begins with two tables. The first summarizes methodological features of the studies reviewed; the second summarizes findings and the measures used to quantify those findings. Readers who want a detailed overview of each study and a feature-by-feature comparison of it with the other studies may find the tables and their explanations helpful. Readers who want a quick overview of the evidence may prefer to skip the tables and read the discussion, which hits the high points in the tables and follows the sequence of questions posed in the abstract.

Criteria for Inclusion

When choosing evidence, I intended to cast a net with an aperture wide enough to catch the best empirical evidence, but not so wide that it caught a confusing mixture of weak and strong evidence. Ideally, the criteria for inclusion would be: random selection of subjects, measures of proven validity and reliability, random assignment of subjects to alternate programs for teaching critical thinking or to a control, specific hypotheses tested by appropriate inferential statistics, and sufficient follow-up to measure strength of effect over time. These criteria proved too rigorous: the first sweep of the net caught nothing.

Nine studies did meet the following criteria: their authors studied the effects of university-level programs for teaching critical thinking; the authors stated specifically that they were evaluating critical thinking; they used at least one measure of critical thinking to evaluate the effects of teaching; they made some comparison (either pretest against posttest or across groups); and they used descriptive or inferential statistics.

Because none of the authors had sufficient control over their experiments to randomly assign subjects to experimental conditions, the designs summarized here are all quasi-experimental (Campbell and Stanley, 1963). Though there is no question that random assignment to alternate programs would enable more powerful causal inferences, random assignment is not the all-inclusive facilitator of high-quality research (Cook and Campbell, 1979). Thus, the studies reported here, from various contexts and often using different measures, can still provide tentative answers to our questions.
We located studies by asking experts for references, by reading reviews on the subject (Norris, 1985; Baker, 1979), and by searching DIALOG's ERIC and other files for the intersect "Critical Thinking" and "Higher Education." The review reflects monthly searches of ERIC files through February, 1986.

Explanations of Tables

Table 1 shows how well each study meets several criteria for methodological precision. It describes each study's merits and allows a quick comparison across studies by criterion. The first column identifies each study by author and year. The second column gives the location of the study, where possible, and identifies the type of class or setting for subjects. Column three describes the study design according to Cook and Campbell's (1979) terminology. The fourth and fifth columns give the number of subjects pretested and the number posttested, providing a quick reference to the number of subjects involved in the experiment and to any subject attrition. The symbols Rs and Ra, in columns six and seven respectively, denote whether subjects were randomly selected for inclusion in the study or were randomly assigned to alternate treatments or to a control; a slash (/) through these symbols means the randomization criterion was not met. Column 8 lists the period of follow-up, or the interval between pretest and posttest if such a design is used. Column 9 lists the Credibility Index (CI) for each study.

The Credibility Index is based on a Quality of Study Rating Form that lists nine criteria for a good evaluation study, with accompanying instructions for identifying and weighting those criteria (Gibbs, 1985). Thirty-nine raters have used the form to rate two studies, agreeing with the keyed criteria an average of 95% and 93% of the time. Stronger randomized trials generally score above 70 points on the form. CI is computed by adding weights for the following criteria: random selection of subjects (10 points), random assignment (20 points), a nontreated control or comparison group (10 points), more than twenty subjects in the largest treatment group (10 points), a check of validity by correlating the principal outcome measure with another similar measure (16 points), a reliability coefficient for the principal measure of critical thinking (15 points), a reliability coefficient of at least .70 or 70% agreement between raters (9 points), follow-up longer than six months (4 points), and use of an inferential statistic to test comparisons for statistical significance (6 points). The CI can range from zero, in a study where none of the criteria are met, to one hundred, where all criteria are met; a worked example follows.
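To illustrate the arithmetic (the study here is hypothetical, not one of the nine reviewed): an evaluation with a nontreated comparison group (10 points), more than twenty subjects in its largest treatment group (10 points), a reliability coefficient for its principal measure (15 points), and an inferential statistic (6 points), but meeting none of the other criteria, would receive

     CI = 10 + 10 + 15 + 6 = 41.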
Table 2 lists measures of critical thinking, criteria for evaluating those measures, and summaries of study results. The first column identifies each study by author and year. The second describes the location and type of university class providing subjects. Columns 3 through 5 list, respectively, the name of the measure or measures used to quantify critical thinking, the reliability coefficient or percent of inter-rater agreement for each measure, and information relevant to validity. Column 6 lists principal hypotheses; these may be explicitly stated by the author or implicit. Column 7 lists the statistical test and "p" (significance) level related to each hypothesis. (Here the "p" level generally means the probability that a given result could be found due to chance alone, so the smaller the "p" level, the greater our confidence in the difference reported.) Column 8 lists the strength of treatment effect (SE) in standard deviation units (Glass, 1972; Hedges, 1984). This index is usually the mean of the experimental group minus the mean of the control group, all divided by the standard deviation of the control group, or the difference between treatment means divided by a pooled estimate of their standard deviation; the formula below restates this definition.
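In symbols (my restatement of the verbal definition above, with M denoting a group mean and s a standard deviation):

     SE = (M_experimental - M_control) / s_control,   or   SE = (M_1 - M_2) / s_pooled.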
Especially pertinent comments are in column 9.

Findings

Which kinds of instruments have been used most frequently by evaluators to measure university-level critical thinking? Column 3 of Table 2 shows that the Watson-Glaser Critical Thinking Appraisal, a test whose forms A and B were copyrighted in 1951, is the most popular: three authors used it. The eighty-item Watson-Glaser is a multiple-choice test of the ability to discriminate among degrees of support for inferences, to recognize unstated assumptions, to make logical deductions, to interpret evidence to see whether generalizations or conclusions are warranted, and to judge the relevance of arguments to particular questions (Watson and Glaser, 1980).

Two studies used a procedure for grading essay tests developed by Browne, Haas and Keeley (1978). Their rubric scores the following elements in student essays: identifying a controversy and conclusions regarding that controversy, identifying major arguments, identifying and analyzing implicit premises, recognizing language difficulties (e.g., ambiguity and vagueness), evaluating the validity of individual arguments and the truth of individual premises, formulating a conclusion from premises, and recognizing alternative inferences. Using this rubric, it takes an hour to score a single essay.

Each of the following tests was used in one study only: the American Council on Education's Test of Critical Thinking, Inclination toward Methodological Criticism, Ability at Methodological Criticism, the Creative Reasoning Test, the Florida Taxonomy of Cognitive Behavior, and the Cornell Critical Thinking Test.

Just as a wide variety of instruments has been used to measure critical thinking, the evaluations come from a wide range of disciplines and locations (see column 2 of Table 2). Four authors evaluate classes across disciplines. Others study the effects of critical thinking programs on students from a single discipline, including classes in mass communication, business, biology, and sociology.

What are the relative merits of essay versus multiple-choice tests of critical thinking? Some argue that essay tests are more valid because essay tests measure the application of critical thinking skill, not merely knowledge of principles (Browne, Haas, Vogt, & West, 1977). While using the Watson-Glaser as their principal measure in a program at Bowling Green State University, Browne and his colleagues found that students, though able to demonstrate knowledge of critical thinking on the Watson-Glaser, still had trouble critically evaluating essays and other examples of thinking (Browne, Haas and Keeley, 1978). They argue that the multiple-choice Watson-Glaser may measure the ability to recognize a valid syllogism, but may not test the ability of students to apply valid deductive reasoning to a problem (Browne, Haas & Keeley, 1978). Evenhandedly, Browne and his associates concede that multiple-choice tests are easy to use, have national norms, and take less time to score than do essay tests (Keeley, Browne, & Kreutzer, 1982).

How reliable are tests of critical thinking? Reliability is vital to any evaluation, because consistent measures help to rule out sources of variation that can obscure real effects of educational programs. A rough rule of thumb for interpreting reliability coefficients is that the closer they approach one, the better; values equal to or exceeding .70 are generally acceptable. Evaluators using multiple-choice tests did not measure the reliability of either the Watson-Glaser or the Cornell using data from subjects participating in their evaluations. However, the manual for the Watson-Glaser (1980) reports test-retest reliability (r = .75), alternate-forms reliability (r = .75 for Form A and Form B), and split-half reliability coefficients.
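For readers unfamiliar with these coefficients: the r values cited above are product-moment correlations between pairs of scores. By the standard definition (not specific to this review), for paired scores x_i and y_i, such as a subject's scores on two administrations of the same test,

     r = SUM[(x_i - mean x)(y_i - mean y)] / SQRT[SUM(x_i - mean x)^2 * SUM(y_i - mean y)^2],

so a coefficient near one indicates that subjects keep nearly the same relative standing from one measurement to the other.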
Table 1
Credibility of Studies

(Columns: 1, author and year; 2, type of subjects; 3, study design; 4, number in pretest; 5, number in posttest; 6, random selection, Rs; 7, random assignment, Ra; 8, period of follow-up; 9, Credibility Index.)

Baker, P.J. & Anderson, L.E., 1983. Students in three sections of a Social Problems course at Illinois State University. Three-group pretest-posttest. Rs (classes selected randomly); treatment assigned randomly to classes. Follow-up: one semester of class. CI = 34.

Browne, M.N., Haas, P.F., Vogt, K.E. & West, J.S., 1977. Treatment group was freshmen in a special course; comparison group was seniors in the business major. Two-group pretest-posttest. Pretest: treatment = 21, comparison = 40; posttest: treatment = 21, comparison = 40. Follow-up: academic quarter. CI = 26.

Givens, C.F., 1976. 40 randomly selected faculty and their students in classes at 4 universities. One-group posttest only. Pretest: none; posttest: 40 classes of students in 4 universities. Rs. No follow-up. CI = 50.

Keeley, S.M., Browne, M.N., & Kreutzer, J.S., 1982. Students at a midwestern university. Posttest only with nonequivalent groups. Pretest column: 500 freshmen, 500 seniors; posttest: 155 freshmen, 145 seniors. Follow-up: none. CI = 41.

Lehmann, I.J. & Dressel, P.L., 1963. Students at Michigan State University. One-group pretest-posttest. Pretest: freshmen, 590 M, 461 F; posttest: freshmen, 590 M, 461 F; sophomores, 235 M, 189 F; juniors, 179 M, 144 F. Follow-up: 1, 2, 3 years. CI = 30.

Logan, C.H., 1976. Students at 8 levels in a large university, and one course in critical thinking (all in sociology). Two-group pretest-posttest with nonequivalent groups; five groups posttested only. Pretest: E group = 84, comparison groups = 102, 30; posttest: E group = 67, comparison groups = 144, 32, 36, 42, 18. Follow-up: one semester in the experimental group. CI = 26.

Meiss, G.T. & Bates, G.W., 1984. Students in an introductory class in mass communication. Three-group pretest-posttest. Pretest: N = 102; posttest: N1 = 27, N2 = 26, N3 = 30. Follow-up: 15 weeks. CI = 46.

Smith, D.G., 1977; Smith, D.G., 1983 (for a more detailed description of the study). Students in 12 classes, where teaching critical thinking was not a specific goal, at a small liberal arts college. One-group pretest-posttest (12 classes combined). Pretest: N = 210; posttest: N = 138. Follow-up: one semester. CI = 16.

Statkiewicz, W.R. & Allen, R.D., 1983. One section of 112 General Biology students at West Virginia University. One-group repeated-measures design (measures made at three times during the semester). N = 48 at pretest and at posttest. Follow-up: one semester. CI = 42.

Table 2
Measures of Critical Thinking and Findings

(Columns: 1, author and year; 2, type of subjects; 3, measures of critical thinking; 4, reliability of the critical thinking test; 5, validity of the critical thinking test; 6, hypotheses tested; 7, statistical test and "p" level.)

Baker, P.J. & Anderson, L.E., 1983. Students in 3 sections of a social problems course at Illinois State University. Measure: Creative Reasoning Test. Reliability: inter-rater r = .70, .93, .96, .74. Hypothesis: students will improve their critical thinking skills pre- to posttest in three classes. Statistic: percent improved.

Browne, M.N., Haas, P.F., Vogt, K.E., & West, J.S., 1977. Treatment group was freshmen in a special course; comparison group was seniors in the business major. Measures: the principal measure was a rubric devised by the authors to grade essay tests, plus the Watson-Glaser and Cornell. Reliability: graders scored 122 essay tests and agreed within one letter grade on all but 5 of the 122 tests. Validity: the authors argue that the essay test measures applied critical thinking. Hypotheses: those in a business and society cluster course will score higher on an essay test of critical thinking ability than will seniors in a comparison group at posttest; those in a business and society cluster will score significantly higher at posttest than they did at pretest. Statistics: F test for homogeneity of variance, t test for difference of means; p < .005, p < .005 (see Comments).

Givens, C.F., 1976. 40 randomly selected faculty and their students in classes at 4 universities. Measure: Florida Taxonomy of Cognitive Behavior (FTCB). Reliability: 85% agreement on items for independent raters. Validity: items based on Bloom's Taxonomy of Educational Objectives. Hypotheses: (1) the average level of classroom discourse is on the lowest cognitive level (FTCB); (2) there is no difference in professors' or students' FTCB scores by course level (basic/advanced), subject area, or time the class had been in session; (3) students in small classes had a higher cognitive level (FTCB) than students in larger classes.

Keeley, S.M., Browne, M.N., & Kreutzer, J.S., 1982. Students at a midwestern university. Measure: rubric developed by the authors for grading essay tests. Reliability: interrater reliability = .90. Validity: the authors think multiple-choice tests fail to measure the ability to identify arguments and to generate criticism. Hypothesis: seniors will score higher on forms P and C of an essay test of critical thinking than will freshmen. Statistic: t comparison of average medians.

Lehmann, I.J. & Dressel, P.L., 1963. Students at Michigan State University. Measure: American Council on Education's Test of Critical Thinking.

Smith, D.G., 1977; Smith, D.G., 1983 (for a more detailed description of the study). Students in 12 classes, where critical thinking was not a specific goal, at a small liberal arts college. Measure: Watson-Glaser Critical Thinking Appraisal.

Statkiewicz, W.R. & Allen, R.D., 1983. One section of 112 General Biology students at West Virginia University. Measure: Practice Exercises (a forced-choice, "defend your choice" test developed by the authors). Validity: exercise scores correlated (r = .56, .59, and .71) with course examination grade.

Logan, C.H., 1976. Students at 8 levels in a large university, and one course in critical thinking (all in sociology). Measures: Inclination Toward Methodological Criticism; Ability at Methodological Criticism. Validity: items exemplify common fallacies.

Meiss, G.T. & Bates, G.W., 1984. Students in an introductory class in mass communication. Measure: Watson-Glaser Critical Thinking Appraisal.