TEF Special Edition
Compass: Journal of Learning and Teaching, Vol 10, No 2, 2017

Evidence does not support the rationale of the TEF

Graham Gibbs

Abstract

The Teaching Excellence Framework (TEF) has evolved since it was first announced, and HEFCE guidance to institutions on its implementation reveals a number of significant concessions to evidence, common sense and fairness. Institutions may well implement useful teaching improvement mechanisms in response, as they have always done, regardless of the nature of external quality assurance demands. However, the rationale of the TEF remains – and it is deeply flawed. It is this rationale on which the present paper focuses. It is argued here that the TEF's interpretation of the evidence about educational quality, employability and value for money ratings, used to justify its introduction, is irrational and is not supported by evidence. Making fine judgements about institutional rankings (and hence fee levels) on the basis of metrics is likely to be thwarted by the very small differences in scores between institutions. Some of its proposed metrics are invalid. Its belief in the ability of a small panel of experts to make sound quality judgements is not well founded, given the poor record of past attempts to make such judgements about teaching quality in higher education. The higher education market is very complex and perhaps only a minority of institutions will be able to benefit in the way the TEF intends. The TEF seems unlikely to be perceived, by most, as rewarding.

The Teaching Excellence Framework's underlying assumptions

However unfit for purpose past teaching quality regimes have been, they have often resulted in institutions' putting more effort into improving teaching than previously, because the risks of not doing so have been perceived to be significant. Most institutions have markedly improved their National Student Survey (NSS) scores since the NSS and metric-based league tables were introduced, under a regime that has focused on quality assurance rather than on quality and under which fees have not been linked to quality. Institutions seem likely to take the TEF extremely seriously, whatever they think of it. The TEF is built on a number of explicit assumptions stated in the Green and then White Papers. It is argued here that these assumptions are unfounded. If teaching quality does improve, and it might, it will not be because Government policy is soundly based.

Current teaching metrics do not indicate that there is a substantial teaching quality problem that needs an urgent solution

The government argues that the TEF is necessary because teaching quality is unacceptably low. However, the NSS, which provides one of the only ways currently available to monitor quality over time, reveals a completely different picture. Levels of satisfaction and judgements about teaching are high and have gone up every year but one since the survey was introduced. Those scores that were initially lower (such as for assessment and feedback) have shown the largest improvements. The rate of improvement shows little sign of slowing, even though there is a ceiling effect: the scores are often so high that there is little room for further improvement. It is true that a few institutions (especially elite research universities such as Imperial College and the LSE) have performed less well recently.
It is also true that students are very generous judges of teaching: about three quarters of all teachers are usually considered 'above average'. The picture the NSS provides is probably too rosy. However, the overall trend is inescapable. A more credible interpretation of the available quality data is that existing teaching metrics, however flawed, have been surprisingly successful in levering institutional efforts to improve teaching quality, particularly outside the research elite, even in the absence of variable fees. It might be the case that collating the data differently (for example, not bundling together the top two ratings on rating scales, which tends to exaggerate how good things are) or adding new and more valid quality data (such as data concerning students' level of engagement) would provide even more effective leverage for institutional efforts to improve. But there is nothing in the existing evidence that points to a pressing need for varied fees as a lever on the grounds that otherwise institutions will do nothing to improve.

Poor 'value for money' is not caused by poor quality

It is argued that alarmingly low 'value for money' ratings justify a strong emphasis on improving teaching quality. But satisfaction and teaching quality ratings are very much higher than value for money ratings! Low value for money ratings are to do with high cost, not low quality. Whatever the level of quality, the cost of higher education is perceived as too high because the much-cited 'graduate premium' (the additional income graduates can expect simply as a result of being a graduate) is unrealistic. It is based on historical data from a time when there were fewer graduates, the economy was expanding and wages were higher in real terms. It is not at all clear that the current economy needs the current number of graduates each year, and this is reflected in the proportion of graduates who, at least initially, undertake non-graduate-level jobs with low wages. The cause of perceived low value for money is that students, quite realistically, are worried that they may not be able to recover their very substantial investment. The experience of the USA, with many graduates never repaying debts caused by ever-higher college tuition fees, provides a perfect example of how it can all go wrong. Higher fees reflect higher reputations, but, in the USA, reputation predicts almost nothing about teaching quality, student learning gains or the extent of use of educational practices known to improve student learning gains. High fees have become a proxy for high quality – but they are a thoroughly misleading proxy.

If 'value for money' is low, it makes no sense to put fees up

Faced with ample evidence of perceived poor value for money, the rational thing to do (if you are incapable of improving the economy and the employment market) is to lower the investment students need to make – to reduce fees. Instead, the government says it will improve value for money by increasing fees, at least for most institutions. This is Alice in Wonderland logic.

The higher education market does not work perfectly or uniformly

The TEF naïvely assumes a perfect and uniform market. It assumes that all institutions would seek to raise fees, would raise them if they were allowed to, and would automatically benefit as a result, increasing both their attractiveness to consumers and their income.
This ignores the reality that many institutions operate in local or not very flexible markets, in which prospective students may have little choice about where to study and little flexibility over how much they can afford to pay. There are already examples of institutions which have increased fees, to take advantage of excellent NSS scores, only to find that they cannot fill their places and have had to put their fees down again. Some institutions, even with comparatively low fee levels and/or perfectly respectable teaching quality metrics, are currently not filling their places. Those institutions that recruit nationally and internationally may benefit from the increase in perceived reputation that comes with higher fees and be able to exploit their market; many others cannot do so, however good they are at teaching. These assumptions about the market make some sense for the elite but not for others.

Increased income from raised fees may have little impact on teaching quality

There is an assumption that increased income from increased fees would be spent on further improving teaching. The overall evidence about the relationship between income and teaching quality suggests that the link is weak, at best. In the USA, tuition costs have doubled, and doubled again, with no improvement in class sizes or other valid teaching quality indicators, and current tuition costs are unconnected with educational quality. In the UK, the comparatively richer Russell Group universities actually have larger cohorts and larger classes than do poorer 'teaching-intensive' universities. Russell Group students experience a smaller proportion of academic-led small-group teaching, because graduate teaching assistants are so often used to save academics' time for research activities. Cohort size, class size and the proportion of teaching undertaken by people other than academics are all good negative predictors of student learning. The research elite do not, in the main, spend their money on teaching students if they can help it, and there seems little prospect of a change in their policy, wherever the money may come from.

Distinguishing appropriate fee levels for institutions is unreliable and, in the homogeneous middle range, impossible

There is an assumption that it is possible, safely and fairly, to make fine-grained distinctions between institutions, so that a range of fees can be fixed in precise relationship to a range of teaching quality. Three significant problems prevent this assumption from being remotely reasonable, the first two being associated with the two forms of evidence that will determine decisions: qualitative judgements by panels and quantitative metrics.

In the TEF's first stage, there are due to be qualitative judgements (basically a 'yes' or a 'no'), made by some kind of expert (or inexpert) panel, about whether institutions deserve to be allowed to put their fees up. Those who are as long in the tooth as I am will remember Teaching Quality Assessment (TQA). Every subject in every institution in England was allocated a score out of 24 as a result of qualitative judgements made by a large panel of 'subject experts'. The process involved visits, observation of teaching and meetings, often with teachers and students, and the collation and examination of truly vast piles of documentation. It took six years to implement.
Despite the enormous cost in time and effort, the extensive evidence base, the visits, the training of assessors and so on, the outcomes were highly unreliable. Some subjects were allocated much higher average scores than others, with no discernible justification. Later scores were higher than early scores. Most scores were so high as to be indistinguishable for the vast majority of institutions. But, more worryingly, there were substantial systematic biases. There was a strong positive correlation between research strengths and TQA scores, despite its being known that research strengths do not predict teaching quality. What is more, larger departments and institutions gained higher scores than did small ones, despite the fact that size was known to be a negative predictor of educational quality. It seems that assessment panels were dazzled by reputation and were incapable of making reliable judgements about teaching.

The TEF's qualitative judgements are intended to be made extraordinarily quickly, by small panels without the benefit of visits, observation, meetings or even detailed evidence about educational quality. Instead, they will largely be looking at a very short text prepared by institutions themselves. The chance of their making sound and precise judgements seems negligible; the chance of their being dazzled by reputation seems somewhat higher.

The second problem facing the TEF relates to attempts to make distinctions between institutions, on the basis of teaching metrics, about what level of fees they will be allowed to charge. I can still visualise the graphs I was shown twenty-five years ago, when the Course Experience Questionnaire (CEQ) was first used in Australia to provide public comparative quality data about every subject in every university. The CEQ is (or at least was, until the Australian Government turned it into a 'happiness' questionnaire) a valid instrument for judging educational quality: it produces scores on a range of credible variables and is adequately reliable. The graphs I saw took one scale on the questionnaire at a time (such as 'deep approach' – the extent to which students attempted to make sense of subject matter rather than only to memorise it) and ranked every department in the country in a particular subject. What was immediately clear was that a couple of departments at the bottom were measurably worse than the rest and a couple at the top were measurably better; everyone else was pretty much indistinguishable in the middle. This was true for every scale on the questionnaire, for every subject. Statistically, this is an inevitable consequence of the measured variable being more or less normally distributed with a small standard deviation, a phenomenon apparent for virtually every quality variable one can think of when comparing institutions. The same is true of NSS scores. If you look at national rankings for 'satisfaction', you find the vast majority of institutions in an indistinguishable middle, with adjacent institutions having almost identical scores, and even blocks of ten institutions not differing significantly from adjacent blocks of ten. No fewer than forty-three institutions shared NSS satisfaction scores of 85-87% in 2016. You can tell an institution ranked 120 from an institution ranked 20, but not one ranked 50 from one ranked 60.
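The scale of this problem is easy to demonstrate with a simple simulation. The sketch below is purely illustrative and is not a model of the NSS itself: the number of institutions, the spread of 'true' satisfaction levels and the survey sample size are assumptions chosen only to show what happens when a narrowly distributed quantity is measured with ordinary sampling error and then turned into a ranking.

    import numpy as np

    rng = np.random.default_rng(42)

    # Illustrative assumptions (not real NSS parameters): 120 institutions whose
    # 'true' satisfaction levels sit in a narrow band around 85%, each surveyed
    # with a finite random sample of respondents.
    n_institutions = 120
    sample_size = 2000
    true_scores = rng.normal(loc=0.85, scale=0.03, size=n_institutions).clip(0.0, 1.0)

    def simulate_ranks(true_scores, sample_size, rng):
        """Simulate one survey 'year': each institution's observed satisfaction
        is the proportion of satisfied respondents in its sample, and the
        ranking (1 = best) is based on those observed scores."""
        observed = rng.binomial(sample_size, true_scores) / sample_size
        order = np.argsort(-observed)  # institution indices, best first
        ranks = np.empty(len(true_scores), dtype=int)
        ranks[order] = np.arange(1, len(true_scores) + 1)
        return ranks

    rank_year1 = simulate_ranks(true_scores, sample_size, rng)
    rank_year2 = simulate_ranks(true_scores, sample_size, rng)

    # How far do institutions move between the two 'years', even though their
    # underlying quality has not changed at all?
    moves = np.abs(rank_year1 - rank_year2)
    mid_table = (rank_year1 > 20) & (rank_year1 <= 100)
    print(f"Median rank change, mid-table institutions: {np.median(moves[mid_table]):.0f} places")
    print(f"Largest single rank change: {moves.max()} places")

In runs of this kind, mid-table institutions commonly move ten or more places between the two simulated years purely through sampling noise, while the handful of outliers at the very top and bottom stay put: exactly the pattern seen in the CEQ graphs and in the NSS.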
The differences are so small, and so volatile from one year to the next, that overall rankings can change markedly, year on year, without any change in the underlying phenomenon. Such variations are picked up by 'The Times' and trumpeted in emotive headlines such as "University X crashes down quality league", when in fact the change in score has been random and not statistically significant. It is rarely possible to distinguish one institution from the next in a reliable and safe way using such metrics, because the differences are, in most cases, simply too small. Yet that is exactly what the TEF has to do: say that one institution deserves to charge higher fees whilst the next one down the rankings does not, even though the two are statistically utterly indistinguishable. Adding together scores from a bunch of varied, and often invalid, metrics actually makes this problem worse and produces a grey muddle. The HEFCE guidelines now suggest that no more than 20% of institutions might be identified at the top and bottom of the rankings and distinguished from the middle-ranked institutions. But even that is bound to create unfairness for the institutions just below the boundaries, which will be, in any statistical sense, indistinguishable from those just above them.

Institutional average scores on teaching metrics usually hide wide departmental differences

The third problem facing the TEF in making the required fine-grained distinctions is that it intends to rank and distinguish institutions. Institutions are made up of departments (or subjects) that very often differ widely from each other on a whole range of metrics. These internal differences can be so large that an institution may have the top-ranked department in the country in one subject and the bottom-ranked department in a different subject. These departmental scores are then averaged, and the institution as a whole might end up looking average (as most in fact do). This averaging of varied departments helps to produce the lack of distinction between institutions highlighted above. Students need to know about the subject they are interested in and to be able to compare that subject across institutions. The current TEF mechanism will not allow them to do this; it could even trick them into paying a higher fee to study in a lower-quality department. If students are interested in their likely employability, the problem is even more acute, as national differences between subjects are gross and institutional employability averages are, at least in part, a consequence of their subject mix. If an institution taught just Nursing and Cultural Studies, it might look average for employability, but that average would look comparatively bad to a student wishing to study Nursing and surprisingly good to a student wishing to study Cultural Studies. This problem would be partly solved if the TEF operated at the level of subjects (or departments) rather than institutions, which is a development being considered for the future. But even then, there would be significant difficulties in identifying what a 'subject' is. I once helped a Sports Science department collect a good deal of data about students' experience of assessment, using the Assessment Experience Questionnaire (AEQ).
There were seven degree programmes within 'Sports Science' and, in terms of students' experience, they differed from one another to a considerable extent, ranging from rather good to pretty awful. As there were no NSS categories to differentiate between these degree programmes, for NSS data collection and reporting purposes the seven were simply aggregated into a single undifferentiated muddle. Standard NSS subject categories might work for traditional academic departments with one degree programme and large cohort sizes, but they may be less than helpful as a means of distinguishing the more unconventional and varied subject groupings usually found in modern teaching institutions. Again, this suits traditional research universities best.

Employability has little to do with teaching quality

There is a misinformed and confused conflation of employability with quality. Quality, according to the TEF, is apparently all about employability. Students don't think so. Those responding to a HEPI survey asking what best indicated the quality of a course had some perhaps surprisingly conventional ideas about teachers and teaching; employability came nearly bottom of their reckoning in terms of telling them anything useful about quality. If the government had bothered to look at national rankings of universities' teaching and employability performance, it would have discovered that its assumption is complete nonsense. The table below shows the top ten institutions ranked by 2016 NSS satisfaction scores, alongside the top ten for graduate employability ('The Times', 2016) and the 2016 NSS ranking of each of those ten.

Table 1: Institutional teaching and employability rankings

Rank  NSS 2016                    Graduate employability ('The Times', 2016)  NSS rank 2016
1     Buckingham                  Cambridge                                   20
2     University of Law           Oxford                                      20
3     St Mary's College Belfast   LSE                                         155
4     Courtauld Institute of Art  Manchester                                  87
5     Keele                       Imperial College                            116
6     St Andrews                  King's College London                       129
7     Bishop Grosseteste          Edinburgh                                   145
8     Harper Adams                University College London                   102
9     Liverpool Hope              London Business School                      155
10    Aberystwyth                 Bristol                                     76

It will be noticed that most of the top ten institutions for NSS satisfaction are neither prestigious nor research giants. The second set of rankings is from 'The Times' 2016 data collection about graduate employability. There is no overlap at all with the top ten for NSS satisfaction. The right-hand column lists the NSS rankings of the top ten institutions for employability: with the exception of Oxford and Cambridge, they are considered by students to be amongst the worst in the country. Imperial College led the clamour for higher fees; it is currently ranked 116th for student satisfaction and dropping like a stone, but its reputation guarantees effortlessly high graduate employability metrics. It took about ten minutes to compile this table from data easily available on the internet.

Employability is largely a product of reputation, which follows research performance, overall income and visibility. Graduate employability has almost nothing to do with teaching quality, and most institutions are not in a position to do much about the employability of their students, which is largely determined by employers' notions about reputation and by the employment market (often the local employment market). The government could do something about that; universities cannot. It is also the case that size helps visibility and reputation, and hence employability, but hinders teaching quality.
It is rare, in the research literature about good teaching departments, to find one that is even medium-sized, let alone large. Top research universities are mainly large and tend to keep students' choice of courses down, thereby creating large cohorts and large classes in order to reduce teaching loads. The consequences are there for all to see.

The TEF's proposed teaching metrics have limited validity

The TEF rests on its teaching metrics' being valid. If they are not, in the sense that they do not predict student learning, then orienting institutions towards improving them may distract them from actual efforts to improve student learning. The government is fully aware of the contents of 'Dimensions of Quality' (Gibbs, 2010) and its identification of which metrics are valid and which are not, and so the proposals in the original Green Paper about the metrics the TEF would use were guaranteed to dismay. By the time details of the implementation of the TEF were made public, the situation had improved. Nevertheless, 'satisfaction' is not a valid measure of learning gains or of teaching quality. Outcome measures (including retention and employability) are significantly determined by student selectivity, and so indicate reputation rather than teaching quality, and reputation does not predict learning gains or the extent of use of pedagogic practices that lead to learning gains. The introduction of benchmarks that take the nature of the student intake into account will help here, and 'The Times' modelling of institutional rankings based on benchmarked TEF metrics, using 2015 data, produced somewhat inverted rankings compared with the newspaper rankings we are used to seeing (which have been created using almost entirely invalid metrics). This cannot have been what was originally intended. It is possible that the TEF's benchmarked metrics, even if some of them are invalid, will create quite a shock to the system. Increasing the role played by valid measures, such as measures of student engagement, will help in the future, and it is to be hoped that there will continue to be pragmatic changes in implementation in the pursuit of validity and fairness. The first attempt to produce rankings and associated varied fee levels is unlikely to get it right, and decisions about institutions' futures based on the current form of implementation are likely to be dangerously unsound. It would be prudent to wait until some of the problems identified above have been tackled more satisfactorily and to treat the rankings of the first year or two as a wake-up call. It is not as if students are impatiently pushing the government to increase fees.

The TEF is unlikely to be perceived by most as a reward

The government argues that, just as strong research performance is rewarded by the Research Excellence Framework (REF), strong teaching performance should be rewarded by the TEF. But the majority of institutions have seen their research income decline dramatically over several rounds of research selectivity. The REF and its predecessors were designed explicitly to allocate research funding to fewer institutions (and fewer researchers) and to take funding away altogether from most. Careers, working lives and institutional reputations have been blighted by the REF. For most, it has been experienced as a punishment. Similarly, the TEF is likely to be perceived as offering brickbats and an uncertain future to perhaps thirty institutions, and as damning with faint praise perhaps a hundred more.
Only those institutions that are allowed to charge top whack, and the subset of these for which this is actually useful and welcome, are likely to feel rewarded: big sticks and small carrots, again.

Conclusion

A national policy with this degree of leverage over institutional behaviour risks causing damage if the assumptions on which it is built are wrong and the measures it uses are invalid. Institutions may feel obliged to play the system and try to improve their metrics even if they do not believe in them and even if this has no useful impact on student learning. But perhaps institutions will become more sophisticated about using appropriate metrics in sensible ways. The demands of the TEF for evidence of 'impact' are already stimulating fresh thinking. If that prompts new evidence-based approaches to enhancement, then the TEF might even improve students' learning gains, despite its rationale and design. As students will be paying even more for their education, let us hope so.

Reference list

Gibbs, G. (2010) Dimensions of Quality. York: Higher Education Academy.