Problems in Testing Informal Logic, Critical Thinking, Reasoning Ability1

Robert H. Ennis
University of Illinois

To my knowledge there are now five English-language machine-gradeable tests readily available on the North American continent that are, or might be construed as, critical thinking tests: the two Cornell critical thinking tests (Ennis & Millman, 1982a, 1982b); the New Jersey Test of Reasoning Skills (Shipman, 1983); the Ross Test of Higher Cognitive Processes (Ross & Ross, 1976); and the Watson-Glaser Critical Thinking Appraisal (Watson & Glaser, 1980).2 But there will be more critical thinking tests, because of the greatly increased emphasis critical thinking and informal logic are receiving these days. More instructors will want a quick way to compare their students with others, and will turn to machine-gradeable tests. Furthermore, many of us are engaged in discussion (sometimes dispute) over what gets taught under a label like "critical thinking", and how long it continues to be taught. One instrument in such discussion or dispute is the machine-gradeable multiple-choice critical thinking test.

Since I use the terms "critical thinking", "informal logic", and "reasoning" roughly interchangeably as labels for an area of concern, I economize here by using only the term "critical thinking". It is the term that appears in the names of three of the five cited tests, the terms "reasoning" and "cognitive process" appearing once each. I know of no widely available test containing the term "informal logic" in its title.

It is largely in preparation for this expected increase in critical thinking tests that I attempt in this paper to share my experience in critical thinking testing by noting some problems and by offering possible resolutions. My hopes are 1) that both consumers and developers of critical thinking tests will profit from the sharing; 2) that some members of my audience will help me in my attempts to deal with the problems; and 3) that some will even become deeply enough involved in the problems to work on them. Although the problems are practical, all have philosophical foundations.

The problems with which I shall deal are concerned with testing for students' value judgments, their induction ability, and their assumption-identification ability; and with the "reliability" (read "consistency") and validity of critical thinking tests. The problems are broad and varied. I shall not do them full justice.

VALUE JUDGMENTS

Although making value judgments strikes me as an aspect of critical thinking, I do not think that it is fair for the keying of an answer to depend on a value judgment about which there is possible disagreement, unless the value judgment is constitutive of critical thinking, such as the judgment that it is generally good to be ready and willing to consider open-mindedly points of view with which one disagrees. I realize that the tone of these remarks might suggest a greater clarity and precision for the concepts value judgment, openminded, etc. than they have. But in the context of my comments they seem precise enough.

Consider for example two items from a section of the Watson-Glaser Critical Thinking Appraisal, Form A (Watson & Glaser, 1980), "Test 5: Evaluation of Arguments". The question is whether a strong labor party would promote the general welfare of the people of the United States. For each of the following items the student is to take what is offered as a reason as true, and must decide whether the argument is strong or weak.
(To be strong, the reason must be both important and directly related to the question.)

65. No; a strong labor party would make it unattractive for private investors to risk their money in business ventures, thus causing sustained large-scale unemployment.

67. No; labor unions have called strikes in a number of important industries.

Item 65 is keyed "strong"; Item 67 is keyed "weak". However, a good Marxist might well regard Item 65 as weak on the ground that sustained large-scale unemployment would be a good thing because it would awaken the proletariat. Item 67, on the other hand, might well be regarded as strong by many conservatives who believe that a strong labor party would encourage labor unions, and that strikes in important industries are bad things. It does not seem fair to mark such people wrong in their evaluations of these arguments, so I urge that such items not appear on critical thinking tests.3 The keying depends on value judgments about which there is possible disagreement and which are not constitutive of critical thinking. The concept value judgment seems clear enough in this context for me to make this recommendation.

INDUCTION

Under the label "induction" I include generalizing from a number of particular instances to a broad statement using the same concepts (for example, inferring from "One railroad tie burned with a foul smell", "Another railroad tie burned with a foul smell", etc. to "Railroad ties burn with a foul smell."). I also include best-explanation inference.

Basically the problem is that induction requires background assumptions about the way the world is and works, not all of which can be explicitly specified in a set of directions. A related problem is that people with different levels of sophistication justifiably give different levels of endorsement to a conclusion. Labeling a conclusion "probably true" instead of "true" constitutes a lesser level of endorsement. Saying that the data are insufficient is no endorsement.

Consider Item 6 in the Watson-Glaser test, preceded by a description of a situation.

Description: Mr. Brown, who lives in the town of Salem, was brought before the Salem municipal court for the sixth time in the past month on a charge of keeping his pool hall open after 1 a.m. He again admitted his guilt and was fined the maximum, $500, as in each earlier instance.

Item: 6. On some nights it was to Mr. Brown's advantage to keep his pool hall open after 1 a.m., even at the risk of paying a $500 fine.

Since the proposed conclusion is a possible explanation of the facts, this seems to be a case of induction. The choices are "True", "Probably true", "Insufficient data", "Probably false", and "False". The keyed answer is "Probably true."

A very sophisticated person might well adopt the position that we do not know enough about the situation even to say "Probably true." Perhaps Mr. Brown had put his son in charge and thought this was a small price to pay for all the years he had neglected his son. Perhaps, in spite of his admission of guilt, he had not kept the pool hall open after 1 a.m., but this was a way to pay off the municipal authorities for granting him a license. If so many possibilities occur to someone, that person, if sophisticated and cautious, might well justifiably decide to mark "Insufficient data".
On the other hand, imagine a less sophisticated student who has learned in civics class that people often find it profitable to violate the law and pay the resulting fines, but that fines of that magnitude would deter someone unless it were to the person's advantage to be an offender. Such a student could justifiably mark the item "True". To mark the answer incorrect would be to penalize the student for having empirical beliefs about the way the world works that are different from those of the test authors.4

In designing the Cornell critical thinking tests, I attempted to deal with the problem by avoiding the distinction between degrees of endorsement (for example, between "True" and "Probably true"), asking only in which direction, if any, the evidence points, and by seeking topics and items about which I thought it most likely that there would be no significant differences in background knowledge. However, our interviews with respondents have made clear that this program was not totally successful.

For example, some of the items in the Cornell Level X critical thinking test ask about the bearing of certain information on the hypothesis that some missing explorers on the newly-discovered planet, Nicoma, are dead. One piece of information is that the blankets and sheets of the explorers' huts are all found neatly folded in the closets. The intended answer is that this information goes against the hypothesis, because the folding and putting away are not things that would have been done in emergency or disaster. On the other hand, someone with a belief that it is standard practice, even in emergencies, to clean up immediately after the dead and to fold their sheets and blankets, might think that this information neither supports nor goes against the hypothesis. Background beliefs influence the answer here.

Different people do sometimes bring different background assumptions and different levels of sophistication to our induction items. It seems unfair to mark them down for so doing. Accordingly, the stance that I have adopted is that we cannot expect 100% agreement with the key on all of these items, but that the best critical thinkers will agree at least 85% of the time.

I do not see this problem as merely a testing problem. It is a problem for anyone who tries to develop a system of rules for judging inductive conclusions. There is always the possibility that something else will turn up that has not yet been figured into the decision. If I am wrong about this, I hope to be so instructed, perhaps so that my "85% stance" can be replaced by a "100% stance".

In his Critical Thinking and Education, John McPeck (1981, p. 149) makes some suggestions that seem aimed in part at this induction-testing problem:

"1. That the test be subject-specific in an area (or areas) of the test taker's experience or preparation. This is required because knowledge and information are necessary ingredients of critical thinking."

"2. That the answer format permit more than one justifiable answer. Thus an essay might better fit the task, awkward and time consuming as this might be ... "

"3. That good answers are not predicated on being right, in the sense of true, but on the quality of justification given for a response."

McPeck's second and third suggestions call for essay tests that are graded by human experts. I do not see how computers can do it.
At the Illinois Thinking Project, Eric Weir and I developed an essay test (Ennis & Weir, 1983) that does call for appraisal of the justification offered. This, we feel, requires trained appraisers, but it is time consuming, as McPeck suggests. Given an average of six minutes per grading, 500 tests would take 50 hours. I like the idea of essay tests, but have not found them heavily used.

Furthermore, the problem still exists to some extent. Even if the subject matter of the item be within the test-taker's experience, there will be differences in the unstated background beliefs of the test-taker and evaluator. The evaluator can make allowances for explicit differences in background beliefs, but not always for implicit ones of which the grader is unaware.

If McPeck means by his first suggestion that critical thinking tests must be in a given subject as taught in schools and colleges, then I must demur. Consider the criterion that a hypothesis is justified only to the extent that plausible alternatives have been ruled out. Not only does this criterion apply very widely (for example, educational research, Shakespearean interpretation), but it applies in areas that are not subjects taught in schools, such as figuring out why there is water in the basement, deciding whether the defendant knew that her act created a strong probability of great bodily harm, and judging whether Ernie stole the cookies. These last three are enterprises that call for critical thinking and are not subject specific, if one thinks only of subjects as taught in schools. But we do want to teach people how to operate in such areas and we do want to test for competence to do so.

ASSUMPTION IDENTIFICATION

Testing for assumption-identification ability faces several problems: 1) a variety of things are called assumptions; 2) assumptions that are significant are not (logically) necessarily made; and 3) the role of background information often makes it unfair to ask whether some particular proposition is an assumption.

The Variety of Things Called Assumptions.

Often the word "assumption" is a pejorative term, so that in an open-ended test, if asked to find an assumption, a student, unless warned to do otherwise, will usually pick something that the student believes to be dubious, rather than only something that is a crucial support. Furthermore, students often pick dubious conclusions as assumptions (doing so is not a violation of standard usage). There are also Strawsonian presuppositions, unstated gap-filling premises, and unstated back-ups for other premises. (See Ennis, 1982, for further explanation, if these labels do not communicate.) If the test is open-ended, and we do not want conclusions or merely dubious statements, we should say so.

Another choice is between used and needed assumptions. If the context is such that we want to know what the assumer was actually thinking, we search for used assumptions (assumptions that were actually used, consciously or perhaps subconsciously, by the thinker). A claim that something is a used assumption is an empirical claim about a mental event, and thus, by my way of thinking, is to be judged on the inference-to-best-explanation model. On the other hand, if the context is such that we want to know what the assumer needs to add to the argument to make it least weak, then we look for a needed assumption. If we are trying to decide whether the conclusion is true, we have this sort of context.
Here we employ the principle of maximum charity, because we want to give the conclusion its best chance.

In an open-ended test of assumption-identification ability, we should make clear to our students whether the context is one calling for figuring out what the person was thinking, or for figuring out whether to believe the conclusion. Different contexts often call for different assumptions. Multiple-choice critical thinking tests that I know about offer a context in which the purpose is to decide whether to believe the conclusion of an argument for which the assumption is sought. There the basic question is whether the assumption is needed by the argument. If it is not needed, then it would be unfair to attribute the assumption to the argument. This brings us to the second and third problems that I mentioned.

Logical Necessity.

As I have argued elsewhere (Ennis, 1982), assumptions that are significant are not logically necessary to an argument.5 There is always a way around them. Consider this example from the Watson-Glaser test:

"I'm travelling to South America. I want to be sure that I do not get typhoid fever, so I shall go to my physician and get vaccinated against typhoid fever before I begin my trip."

Proposed assumption: 28. Typhoid fever is more common in South America than it is where I live.

The key claims that this proposed assumption is "made". The directions say:

If you think the assumption is not necessarily taken for granted in the statement, blacken the space under "ASSUMPTION NOT MADE."

It is logically possible that typhoid fever is more common where the speaker lives, but that its consequences are more serious if contracted in South America, perhaps because of the climate or differences in typhoid-care facilities. So the proposed assumption is "not necessarily taken for granted" if the necessity in question is logical necessity. Those students who give such an interpretation to "necessarily" will get this item wrong, and others like it.

But even if the instructions are not given this interpretation, there would be the problem of background information in this and similar items. If I believe the suggested possibility to be a plausible alternative (background information), then again I would be justified in marking "Not made" (contrary to the key) on the basis of my background information.

In order to identify assumptions, whether used or needed, background information is always relevant. Hence it is dangerous to ask in a multiple-choice test whether a particular assumption is made. Rather it seems safer to give a choice of several alternatives, including one and only one gap filler that makes (or easily helps make) a deductively valid argument from the given premise to the given conclusion. This is then the reasonable choice for the answer, so long as it is not less plausible than the other choices (background knowledge sneaking in again), and if the context is one in which the truth of the conclusion is the concern. Here is an example from the Cornell Critical Thinking Test, Level X (Ennis & Millman, 1982a):

69. "The shorter of the two people wearing green hats is a female. I know because I saw her long hair when she removed her hat."

Which is probably taken for granted?

A. All females have long hair.
B. Only females have long hair.
C. A person wearing a green hat is likely to be female.

The keyed answer, B, makes the argument deductively valid (or does so with minor adjustments if one wants to be strict about it).
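To make the gap-filling role of choice B concrete, it can be formalized roughly as follows (an illustrative schematization only; the test itself presents no symbolic apparatus):

\[
\forall x\,\bigl(\mathrm{LongHair}(x) \rightarrow \mathrm{Female}(x)\bigr),\qquad \mathrm{LongHair}(s)\;\therefore\;\mathrm{Female}(s)
\]

where $s$ denotes the shorter of the two people wearing green hats, and choice B is read as the universal conditional "whoever has long hair is female". Choice A, read as $\forall x\,(\mathrm{Female}(x) \rightarrow \mathrm{LongHair}(x))$, would not license the inference, which is why B rather than A closes the deductive gap.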
The word "probably" has been included in the question asked in deference to the fact that full context is not specified, and that background knowledge does matter. Somewhat in between is the following item from the New Jersey test: 8. Josie said, ""This paper must have been written by a boy, because the handwriting is so bad." Josie must be assuming that a. some boys have poor handwriting. b. only boys have poor handwriting. c. all boys have poor handwriting. Although the key is not distributed, presumably the keyed answer is b. At least b would transform the argument into a deductively valid one. But a careful thinker might leave it blank on the ground that there is no right answer. That is, Josie does not have be assuming that only boys have poor hand- writing, which attribution is actually uncharitable, since it is so obviously false. Josie's argument works if we add the proposi- tion that only the boys in that class group have poor hand- writing and that all the papers being considered are from that class group. The lead-in would probably be better stated as follows: "Josie is probably assuming that, in this group ... ". In sum I recommend that the type of assumption should be made clear, that we not ask for logically necessary assump- tions, that in multiple-choice tests we ask test-takers to choose among several candidates, rather than decide for each whether it is assumed, and that among the choices there be one and only one that contributes readily to the deductive validity of the argument-and that this one not be more in- herently implausible than the other choices. CONSISTENCY The testing establishment defines "reliability" as consisten- cy of measurement, and pays much attention to the "reliabili- ty" of tests, partly because one can obtain "objective" numbers that indicate consistency, partly because these numbers are generally higher than other numbers one obtains about tests, and partly because it seems like a good idea for a test to be consistent from one administration to the next, though it is somewhat misleading to the public to label con- sistency in measurement by the term "reliability". The term "reliable" is defined in my Webster's as "trustworthy", which suggests that a reliable test tells us what we want to know, not merely that it gives the same result each time regardless of whether it gives me any information about critical thinking ability, for example. So we could have a reliable test that is called a critical thinking test, according to this technical sense of "reliable", even though it does not test for critical thinking at all. However if we remember that "reliability" in the technical sense means consistency of measurement, not validity, then this problem will not cause trouble. But the situation is more serious, because the most frequently-used indicators of "reliability" are the Kuder- Richardson formulas, which tell only the degree of internal consistency of a test; that is, the degree to which the items in- tercorrelate with each other. This is not consistency from one test administration to the next; it is item homogeneity. If critical thinking is a heterogeneous concept, then a good com- prehensive critical thinking test would probably not do as well on such so-called "reliability" measures as a critical thinking test for only one aspect of critical thinking, say deduction. There are other indicators of reliability, I should note, in- cluding test-retest correlations and correlations between sup- posedly parallel forms. 
Thus we must remember that several readily-obtainable apparent indicators of quality are indicators of internal consistency. The difficult question then is, "How important is internal consistency?" An important part of this question is the question, "To what extent is critical thinking ability a homogeneous ability?" I am puzzled by this question, but my inclination is to say that critical thinking ability is fairly heterogeneous, consisting of such diverse elements as open-mindedness, ability to see other alternatives, experience and background knowledge, knowledge of criteria to apply in thinking critically, ability to handle complexity in an orderly fashion, and some others. All of this is quite speculative. I invite you to join me in the attempt to deal with this question, the answer to which has instructional and curricular implications, in addition to its relevance to the question of how to treat internal-consistency data about critical thinking tests.

VALIDITY

The problem of determining the validity of a critical thinking test is a difficult one. Standard approaches to validity include criterion-related validity, content validity (old and new), and construct validity. There is discussion of these in Standards for Educational and Psychological Tests (Joint Committee, 1974), but there are problems and the booklet is being extensively revised. (The distinction between old and new content validity I introduce here to give greater coherence to the discussion in the light of tradition.) After considering the standard approaches to validity, I shall look at one pesky validity question, "What does the test really test?"

Criterion-Related Validity.

Criterion-related validity is the extent to which the test correlates with an outside pre-established criterion, already accepted as valid. But there really is no outside pre-established criterion for critical thinking ability. I am regretfully suspicious even of teachers' ratings of students, even my own ratings of my own students.

Content Validity, Old and New.

Content validity of the older type depended upon the following of a careful plan to cover the area to be tested, and agreement among experts that the test (with its accompanying answers) does in fact reasonably cover the content. This approach seems the best to me, though securing agreement on anything of this nature is difficult, especially among philosophers. Needed is agreement about what constitutes critical thinking, about the appropriateness of some particular coverage, and about the answers to the items. All of this is good practical epistemology, so I hope that more philosophers can be persuaded to think of working in the area of critical thinking testing as more than only a fulfillment of their teaching responsibility.

The five tests that I mentioned earlier differ markedly in their content.
The three that are actually called "critical thinking" tests (the two Cornell tests and the Watson-Glaser test) all include sections on deduction, induction, and assumption identification. In addition, the Watson-Glaser test includes a section on strong and weak arguments (the one to which I earlier objected because of its testing for a person's value judgments); Cornell Level X has a section on credibility and observation; and Cornell Level Z has sections on credibility, fallacies (especially equivocation), experimental planning and reasoning, and definition.

The New Jersey test (called a "reasoning" test) emphasizes deduction quite heavily, with over half its items on deduction. Assumption identification receives some attention and a variety of other critical thinking aspects are touched upon. It seems to fit the curriculum it was presumably designed to test, that of the Institute for the Advancement of Philosophy for Children, an advantage or a disadvantage, depending on the extent to which the things emphasized in the curriculum actually reflect critical thinking in a balanced manner.

The Ross test (called a "cognitive processes" test), although it includes sections on deduction and assumption identification, also includes six other sections, some of which one might have trouble calling "critical thinking", for example, a section on verbal analogies.

A sixth test, though its catalogue calls it a critical thinking test, I have not included in my listing, because it contains only deduction items. It is Logical Reasoning (Hertzka & Guilford, 1955). There are other deduction tests available (including some other Cornell tests), but since they do not claim to be critical thinking tests, I shall not discuss them here.

A clear implication of this brief commentary on content is that people vary in their judgments about the appropriate content for a critical thinking test. At least some experts thus are in disagreement, requiring a test consumer to choose among the different conceptualizations of critical thinking.

But there is more. One must not only look at, but look beyond, the names of the tests and the sections of the tests. One must also look at the items and their keyed answers. For example, the heading "Strong and Weak Arguments" does not reveal all that is going on in that section of the Watson-Glaser test to which I earlier objected. In the Ross test, the given heading, "Questioning Strategies", fails to reveal that the test-takers do not choose among or devise questioning strategies. Rather they choose among interpretations of information secured by questioning strategies devised by the test authors.

Since critical thinking testing is very difficult, I am not here urging critical-thinking-test consumers to demand perfection. Rather I am urging them to take the trouble to pay close attention to the actual content of an alleged critical thinking test. Although expert opinion is relevant to a content validity judgment of the old type, since the "experts" disagree, a test consumer must look at the content as well as the statistics.

New.

Content validity (new type) has the appearance of behavioral-scientific objectivity, because it calls for random sampling from some universe, but it seems crippled by deep problems, as Thomas Tomko has argued (1981). Often called "criterion-referenced testing" or "domain-referenced testing", its idea is that there is some total universe that is the content.
A random sample drawn from this universe should surely be a scientifically-objective representation of the content. The problem is to identify a set of sampleable units that are in fact the content of the field. Candidates for the types of units include "behaviors", responses, test items, and situations. Items, and more broadly, situations that call for responses, are at least plausibly identifiable and selectable units, but I cannot imagine an exhaustive comprehensive depiction of the content of critical thinking that proceeds by listing such things. There is an infinite number of such situations or possible items. In order to assure a random sample we must provide that each unit in the universe have an equal chance of being selected. I cannot imagine an exhaustive set of critical thinking situations such that one can give each an equal chance of being selected. Hence new-type content validity seems an inappropriate approach for judging critical thinking tests.

Construct Validity.

The theory of the third type of validity, construct validity, is still being developed (see Cronbach, 1971; Norris, 1981). The motivation is the perceived difficulties with the other types of validity, in particular the lack of a pre-established outside criterion with which to correlate the results of a test under investigation. Roughly speaking, the idea here is that a test is justifiably believed valid to the extent that information about it fits with other information we have. This is a vague notion, ripe for sharpening and investigation by philosophers of science, as Norris is doing. However, regardless of the outcome of the investigation, it will at least for a long time be difficult to claim for any critical thinking test that it is valid in this way, because of the looseness of the concept critical thinking, and because our scientific knowledge about the human activity of critical thinking is at the air-earth-fire-water stage, and perhaps will be there for a long time.

In sum, those who develop critical thinking tests will find it difficult to make a convincing case for their validity. Describing the structure of the test and inviting people, including experts, to look it over seems like the best approach now. Correspondingly, a person trying to determine whether a test is valid should be cautious in judging scientific-appearing claims for validity, and should look carefully at the items and the proposed answers, paying heed to the structure and basis of construction of the test.

"What does the test really test?"

Often a claim is made that a test really tests for something other than what its name suggests. Earlier I was implicitly suggesting that for some people Test 5 of the Watson-Glaser test really tests in part for their values. McPeck (1981, p. 146) suggested that the induction items in Level Z of the Cornell critical thinking tests (Ennis & Millman, 1982b) "are clearly questions of reading comprehension more than anything else".

Another thing that critical thinking tests are claimed to really be testing is general intelligence, on the ground that they correlate substantially with intelligence tests. Michael Scriven (personal communication) ascribed such a claim to significant figures in Educational Testing Service. McPeck (1981, p. 142) made such a claim about the Watson-Glaser test. Are these "really-tests-for" claims testable?
If such claims are, as I think, responsibility-ascribing causal claims, McPeck's claim about reading might then be translatable into the following: "The cause of significant variation in Level Z scores among the members of the population being tested is variation in reading ability." (The reduction, based on correlation, of critical thinking to intelligence in addition seems to assume a strongly positivist principle of parsimony and reduction.)

If I am correct about the causal part of my suggestion, then by my non-reductive analysis (Ennis, 1973) of effect-explaining causal statements, the McPeck reading claim is that variations in reading ability 1) are sufficient, given the circumstances, to produce the variations we get in test scores, and 2) are responsible for them. If this is so, one prediction might be that there would be high correlations between reading scores and induction items at all levels of the population in question. Another prediction is that attempts by informal logicians to teach induction skill to college students would fail to produce improvement in their Level Z induction scores, unless the instruction is instruction in reading, or at least improves their reading ability. These predictions suggest at least partial testability of McPeck's claim. (A rough sketch of how the first prediction might be checked appears at the end of this section.)

I am trying to do several things here: to suggest a way of understanding the charge that a test really tests for something else, to warn critical thinking testers that the charge might well be leveled at them, to suggest ways of responding to the charge, and to show again the relevance of traditional philosophical concerns (in this case concerns with causation and testability) for the informal logic movement.
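As a rough illustration, and not anything drawn from the tests or from McPeck's text, the first prediction could be checked along the following lines: divide the population into strata by reading ability and see whether the reading-induction correlation stays high within every stratum. The variable names and the simulated scores below are hypothetical stand-ins.

```python
# Hypothetical sketch: check whether reading scores and Level Z induction
# subscores correlate highly within every reading-ability stratum, as the
# "really tests reading" claim would predict.  All data here are simulated
# stand-ins; real score files would be substituted for them.
import numpy as np

def within_stratum_correlations(reading, induction, n_strata=4):
    """Return (low, high, r, n) for each reading-score stratum."""
    reading = np.asarray(reading, dtype=float)
    induction = np.asarray(induction, dtype=float)
    cuts = np.quantile(reading, np.linspace(0, 1, n_strata + 1))
    results = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        mask = (reading >= lo) & (reading <= hi)
        if mask.sum() > 2:  # need at least a few test-takers per stratum
            r = np.corrcoef(reading[mask], induction[mask])[0, 1]
            results.append((float(lo), float(hi), float(r), int(mask.sum())))
    return results

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reading = rng.normal(50, 10, 500)                  # simulated reading scores
    induction = 0.5 * reading + rng.normal(0, 8, 500)  # simulated induction subscores
    for lo, hi, r, n in within_stratum_correlations(reading, induction):
        print(f"reading {lo:5.1f}-{hi:5.1f}: r = {r:+.2f}  (n = {n})")
```

Low within-stratum correlations, or gains in induction scores after instruction that leaves reading ability untouched, would count against the claim; uniformly high correlations would support it.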
SUMMARY

In my remarks I have covered much ground, but have neglected many possible refinements. I have tried to share my experience in facing the practical and philosophical dimensions of some critical thinking testing problems in the hope that this would be of help to consumers and developers of critical thinking tests, in the hope that my audience could help me deal with these problems, and in the hope that more specialists of many sorts, including philosophers, would devote their talents to these practical problems.

I have broken the problems into two groups: critical thinking content (value judgments, induction, and assumption identification) and testing concerns (internal consistency and validity), but remember that one of the testing concerns is critical thinking content. These are not the only problems; I have picked some that have particularly stimulated me.

In the area of value judgments I suggested that we not allow a student's score to depend on the student's agreement with our value judgments in controversial areas, except for values constitutive of critical thinking. In the area of induction I claimed the dependence of judgments to some extent on background beliefs and level of sophistication, and suggested that we try to seek items that required background beliefs on which there would be heavy agreement, and that we not ask students to make the distinction among degrees of endorsement ("True", "Probably true", etc.). I do not recommend that all critical thinking testing should be in specific subjects as taught in the schools and colleges, because so much critical thinking in real life is not thus artificially delimited.

In the area of assumption identification I recommended that when asking for open-ended assumption identification we be clear about the kind of assumptions we are seeking, and that we not ask for logically-necessary assumptions; and that for multiple-choice testing we not ask whether an assumption is made, but rather ask which of several candidates is probably assumed, given the choices and given some situation. I also suggested that one acceptable item-type would have as one and only one of its choices a statement that would fill the gap in (or best help to fill the gap in) a deductive argument. I have focused on multiple-choice tests, because they have certain practical advantages, but I do think that some of the problems I mentioned can be handled by essay testing, with grading by people who are good at critical thinking and are flexible enough to adjust their scoring to accommodate good arguments and insights that are different from those expected.

In the area of internal consistency of tests, I noted the use of the word "reliability" to refer to consistency of repeated measurement and the use of internal consistency as an indicator of this "reliability". A problem here is the extent to which critical thinking is a homogeneous concept. I warily suggest that it is not.

In the area of validity, I suggested the inapplicability of criterion-related validity and new-type content validity, and the difficulty of application of old-type content validity and construct validity, but did suggest initial emphasis on old-type content validity, and warned of the differences in content among existing tests. I also suggested a causal interpretation of claims that a test really tests for something else, and interpreted such claims in terms of a non-reductive responsibility analysis of effect-explaining causal claims. From this I suggested some possible predictions that might be generated 1) to test claims about what a test really tests, and 2) to explore the testability and meaning of such claims.

Some philosophical questions that are foundational here include the following: What is critical thinking? In what way are value judgments different from empirical judgments? Can there be rules for induction, the application of which does not depend on unspecifiable outside knowledge? Is "probably" a degree-of-endorsement specifier? What is the role of deduction in real arguments? How do you tell what is assumed? Is critical thinking ability a homogeneous trait? How does one judge the fittingness of a test into an array of information and beliefs? What do effect-explaining causal claims mean? What constitutes a check on testability?

I have not tried to discourage you by suggesting these difficulties and questions. The situation, although imperfect, is not a disaster. Actually, I believe that any one of the five tests listed is worth using and could be quite helpful. In a perverse way, I am trying to encourage by provocation.

NOTES

1. This essay was partially prepared while I was a Fellow at the Center for Advanced Study in the Behavioral Sciences. Earlier versions were presented to the Second International Symposium on Informal Logic, University of Windsor, Windsor, Ontario, June 22, 1983, and to a colloquium at Sacramento State University, October 27, 1983. I appreciate helpful suggestions from Peter Gray Whiteley, Robert Linn, William Rapaport, and Andrea Schnall, and am grateful for financial support provided by the Spencer Foundation.
2. One of these is aimed at undergraduate and graduate students (Cornell Level Z); one at secondary and college (Watson-Glaser); and three at grade four through college (Cornell Level X, New Jersey, and Ross). The Ross test seems to emphasize critical thinking less than the others, as I shall later suggest, but realize that this judgment is based on my conception of critical thinking.

3. Edward Glaser, who has seen these comments and recommendation, was kind enough to provide me with his reaction to them:

"My reaction to your specific questions regarding our scoring key for items 65 and 67 on the Evaluation of Arguments subtest is:

#65. Accepting the argument as true for the purpose of this test, if the actions of a strong labor party would 'cause sustained large-scale unemployment,' that would be a disastrous consequence for all citizens adversely affected and for our democratic form of government in general. It would seem that only 'good' Marxists or other types of revolutionaries who wanted to bring down our form of government and supplant it with their form of dictatorship would consider #65 to be a weak argument in relation to the question posed.

#67. The fact that 'labor unions have called strikes in a number of important industries' is weak because no information is given about the employees' (or union's) grievances, why they chose to withhold their labor (strike) in those instances, whether their actions were peaceful and legal during the strike period, what results or consequences followed their strike action, etc. A strike in and of itself is not necessarily 'bad' in given instances; the net balance of consequences might be 'good' for the country as a whole, for the strikers, and even for the owners of the plants in a given industry over the long run. If we were rewriting or revising #67, however, I would recommend that the wording be changed to a number of employer sites (or companies) rather than 'important industries.'

I agree with you that selecting any particular value position is 'bound to be in conflict with a number of (other possible) value positions.' In a test that explicitly accepts (starts from) the values expressed in our Constitution, Bill of Rights and Declaration of Independence, we do espouse free speech, etc., but what is judged 'good' or 'bad,' or 'strong' or 'weak' arguments with reference to a given issue should (as I see it) be judged from the Judeo-Christian and democratic value orientation underlying our society."

REFERENCES

Cronbach, Lee J. Test validation. In R.L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.

Ennis, Robert H. The responsibility of a cause. In B. Crittenden (Ed.), Philosophy of education 1973. Edwardsville, IL: Philosophy of Education Society, 1973.

Ennis, Robert H. Identifying implicit assumptions. Synthese, 1982, 51, 61-86.

Ennis, Robert H. & Millman, Jason. Cornell critical thinking test, level X. Champaign, IL: Illinois Thinking Project, 1982a.

Ennis, Robert H. & Millman, Jason. Cornell critical thinking test, level Z. Champaign, IL: Illinois Thinking Project, 1982b.

Ennis, Robert H. & Weir, Eric. The Ennis-Weir critical thinking essay test. Champaign, IL: Illinois Thinking Project, 1983.

Hertzka, Alfred E. & Guilford, J.P. Logical reasoning. Orange, CA: Sheridan Psychological Services, Inc., 1955.

McPeck, John E. Critical thinking and education. New York: St. Martin's Press, 1981.

Norris, Stephen P. A pitfall in the construct validation of ability tests. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign, 1981.
Ross, John D. & Ross, Catherine M. Ross test of higher cognitive processes. Novato, CA: Academic Therapy Publications, 1976.

Shipman, Virginia. New Jersey test of reasoning skills. Upper Montclair, NJ: Institute for the Advancement of Philosophy for Children, 1983.

Tomko, Thomas N. The logic of criterion-referenced testing. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign, 1981.

Watson, Goodwin & Glaser, Edward M. Watson-Glaser critical thinking appraisal. New York: The Psychological Corporation, 1980.

Robert H. Ennis, Center for Advanced Study in the Behavioral Sciences, 202 Junipero Serra Blvd., Stanford, CA 94305.