TEF Special Edition 

Compass: Journal of Learning and Teaching, Vol 10, No 2, 2017 

Evidence does not support the rationale of the TEF 

Graham Gibbs 

 

Abstract 

The Teaching Excellence Framework (TEF) has evolved since it was first announced, and 

HEFCE guidance to institutions on its implementation reveals a number of significant 

concessions to evidence, common sense and fairness. Institutions may well implement useful 

teaching improvement mechanisms in response, as they have always done, regardless of the 

nature of external quality assurance demands. However, the rationale of the TEF remains – and 

it is deeply flawed. It is the rationale that this paper focuses on. It is argued here that the TEF's interpretation of evidence about educational quality, employability and value for money ratings, used to justify it, is irrational and not supported by that evidence. Making fine judgements

about institutional rankings (and hence fee levels) on the basis of metrics is likely to be thwarted 

by the very small differences in scores between institutions. Some of its proposed metrics are 

invalid. Its belief in the ability of a small panel of experts to make sound quality judgements is not

well founded, given the poor record of past attempts to make such judgements about teaching 

quality in higher education. The higher education market is very complex and perhaps only a 

minority of institutions will be able to benefit in the way the TEF intends. The TEF seems 

unlikely to be perceived, by most, as rewarding. 

 

The Teaching Excellence Framework’s underlying assumptions 

However unfit for purpose past teaching quality regimes have been, they have often resulted in 

institutions’ putting more effort into improving teaching than previously, because the risks of not 

doing so have been perceived to be significant. Most institutions have markedly improved their 

National Student Survey (NSS) scores since the NSS and metric-based league tables were 

introduced, under a regime that has focused on quality assurance rather than on quality and 

under which fees have not been linked to quality. Institutions seem likely to take the TEF 

extremely seriously, whatever they think of it. The TEF is built on a number of explicit 

assumptions stated in the Green and then White papers. It is argued here that these 

assumptions are unfounded. If teaching quality does improve, and it might, it will not be because 

Government policy is soundly based. 




Current teaching metrics do not indicate that there is a substantial teaching 

quality problem that needs an urgent solution 

The government argues that the TEF is necessary because teaching quality is unacceptably 

low. However, the NSS, which provides one of the few ways currently available to monitor quality over time, reveals a completely different picture. Levels of satisfaction and judgements about

teaching are high and have gone up every year but one since it was introduced. Those scores 

that were initially lower (such as for assessment and feedback) have shown the largest 

improvements. The rate of improvement shows little sign of slowing even though there is a 

ceiling effect – the scores are often so high there is little room for further improvement. It is true 

that a few institutions (especially elite research universities such as Imperial College and the 

LSE) have performed less well recently. It is also true that students are very generous judges of 

teaching: about three quarters of all teachers are usually considered ‘above average’. The 

picture the NSS provides is probably too rosy. However, the overall trend is inescapable. A 

more credible interpretation of the available quality data is that existing teaching metrics, 

however flawed, have been surprisingly successful in levering institutional efforts to improve 

teaching quality, particularly outside the research elite, even in the absence of variable fees. It 

might be the case that collating the data differently (for example, not bundling together the top 

two ratings on rating scales, which tends to exaggerate how good things are) or adding new and 

more valid quality data (such as concerning students’ level of engagement) would provide even 

more effective leverage to institutional efforts to improve. But there is nothing in the existing 

evidence that points to the pressing need for varied fees as a lever on the grounds that 

otherwise institutions will do nothing to improve. 

Poor ‘value for money’ is not caused by poor quality 

It is argued that alarmingly low ‘value for money’ ratings justify a strong emphasis on improving 

teaching quality. But satisfaction and teaching quality ratings are very much higher than value 

for money ratings! Low value for money ratings are to do with high cost, not low quality. 

Whatever the level of quality, the cost of higher education is perceived as too high because the 

much cited ‘graduate premium’ (the additional income graduates can expect simply as a result 

of being a graduate) is unrealistic. It is based on historical data from a period when there were fewer

graduates, the economy was expanding and wages were higher in real terms. It is not at all 

clear that the current economy needs the current number of graduates each year and this is 

reflected in the proportion of graduates who, at least initially, undertake non-graduate-level jobs 




with low wages. The cause of perceived low value for money is that students, quite realistically, 

are worried that they may not be able to recover their very substantial investment. The 

experience of the USA, with many graduates never repaying debts caused by ever-higher 

college tuition fees, provides a perfect example of how it can all go wrong. Higher fees reflect 

higher reputations, but, in the USA, reputation predicts almost nothing about teaching quality or 

student learning gains or the extent of use of educational practices known to improve student 

learning gains. High fees have become a proxy for high quality – but they are a thoroughly 

misleading proxy. 

If ‘value for money’ is low, it makes no sense to put fees up 

Faced with ample evidence of perceived poor value for money, the rational thing to do (if you 

are incapable of improving the economy and the employment market) is to lower the investment 

students need to make – to reduce fees. Instead, the government says it will improve value for money by increasing fees, at least for most. This is Alice in Wonderland logic.

The higher education market does not work perfectly or uniformly 

The TEF naïvely assumes a perfect and uniform market. It assumes that all institutions would raise fees if they were allowed to, and that by doing so they would automatically benefit, increasing both their attractiveness to consumers and their income. This ignores the reality that many institutions operate in local or inflexible markets, in which prospective students may have little choice about where to study and little flexibility over how much they can afford to pay. There are already examples of

institutions which have increased fees, to take advantage of excellent NSS scores, only to find 

that they cannot fill their places and have had to put their fees down again. Some institutions, 

even with comparatively low fee levels and/or perfectly respectable teaching quality metrics, are 

currently not filling their places. Those institutions that recruit nationally and internationally may 

benefit from an increase in perceived reputation that comes with higher fees and be able to 

exploit their market; many others cannot do so, however good they are at teaching. There are 

assumptions about the market that make some sense for the elite but not for others. 

Increased income from raised fees may have little impact on teaching quality 

There is an assumption that increased income from increased fees would be spent on further 

improving teaching. The overall evidence about the relationship between income and teaching 

quality suggests that the link is weak, at best. In the USA, tuition costs have doubled, and 




doubled again, with no improvement in class sizes or other valid teaching quality indicators, and 

current tuition costs are unconnected with educational quality. In the UK, the comparatively 

richer Russell Group universities actually have larger cohorts and larger classes than do poorer

‘teaching-intensive’ universities. Russell Group students experience a smaller proportion of 

academic-led small group teaching, because graduate teaching assistants are so often used to 

save academics’ time for research activities. Cohort size, class size and the proportion of 

teaching undertaken by people other than academics are all good negative predictors of student 

learning. The research elite do not, in the main, spend their money on teaching students if they 

can help it and there seems little prospect of a change in their policy, wherever the money may 

come from. 

Distinguishing appropriate fee levels for institutions is unreliable, and in the 

homogeneous middle range, impossible 

There is an assumption that it is possible, safely and fairly, to make fine-grained distinctions 

between institutions, so that a range of fees can be fixed in precise relationship to a range of 

teaching quality. Three significant problems prevent this assumption from being remotely 

reasonable, the first two being associated with the two forms of evidence that will determine 

decisions: qualitative judgements by panels and quantitative metrics. 

In the TEF’s first stage, there are due to be qualitative judgements (basically a ‘yes’ or a ‘no’), 

made by some kind of expert (or inexpert) panel, about whether institutions deserve to be 

allowed to put their fees up. Those who are as long in the tooth as I am will remember Teaching 

Quality Assessment. Every subject in every institution in England was allocated a score out of 

24 as a result of qualitative judgements made by a large panel of ‘subject experts’. The process 

involved visits, observation of teaching and meetings, often with teachers and students, and the 

collation and examination of truly vast piles of documentation. It took six years to implement. 

Despite the enormous cost in time and effort, the extensive evidence base, the visits, the 

training of assessors and so on, the outcomes were highly unreliable. Some subjects were allocated much higher average scores than others, with no discernible justification. Later scores were

higher than early scores. Most scores were so high as to be indistinguishable for the vast 

majority of institutions. But, more worryingly, there were substantial systematic biases. There 

was a strong positive correlation between research strengths and TQA scores, despite its being 

known that research strengths do not predict teaching quality. What is more, larger departments 

and institutions gained higher scores than did small ones, despite the fact that size was known 




to be a negative predictor of educational quality. It seems that assessment panels were dazzled 

by reputation and were incapable of making reliable judgements about teaching. The TEF’s 

qualitative judgements are intended to be made extraordinarily quickly, by small panels without 

the benefit of visits, observation, meetings or even detailed evidence about educational quality. 

Instead, they will largely be looking at a very short text prepared by institutions themselves. The 

chance of their making sound and precise judgements seems negligible; the chance of their 

being dazzled by reputation seems somewhat higher. 

The second problem facing the TEF relates to attempts to make distinctions between institutions 

about what level of fees they will be allowed to charge, on the basis of teaching metrics. I can 

still visualise the graphs I was shown twenty-five years ago, when the Course Experience 

Questionnaire (CEQ) was first used in Australia to provide public comparative quality data about 

every subject in every university. The CEQ is (or at least was, until the Australian Government 

turned it into a ‘happiness’ questionnaire) a valid instrument for judging educational quality. It 

produces scores on a range of credible variables and is adequately reliable and valid. The 

graphs I saw took one scale on the questionnaire at a time (such as ‘deep approach’ – the 

extent to which students attempted to make sense of subject matter rather than only to 

memorise it) and ranked every department in the country in a particular subject. What was 

immediately clear was that a couple of departments were measurably worse than the rest at the 

bottom and a couple were measurably better at the top; everyone else was pretty much 

indistinguishable in the middle. This was true for every scale on the questionnaire, for every 

subject. Statistically, this is an inevitable consequence of the measured variable being more or less normally distributed with a small standard deviation - a phenomenon apparent

for virtually every variable about quality one can think of when comparing institutions. With NSS 

scores, the same is true. If you look at national rankings for ‘satisfaction’, you find the vast 

majority of institutions in an indistinguishable middle, with adjacent institutions having almost 

identical scores, and even blocks of ten institutions not differing significantly from adjacent 

blocks of ten. No fewer than forty-three institutions shared NSS satisfaction scores of 85-87% in

2016. You can tell an institution ranked 120 from an institution ranked 20, but not one ranked 50 

from one ranked 60. The differences are so small, and so volatile from one year to the next, that

overall rankings can change markedly, year on year, without any change in the underlying 

phenomenon. Such variations are picked up by ‘The Times’ and trumpeted in such emotive 

headlines as “University X crashes down quality league”, when in fact the change in score has 

been random and statistically insignificant. It is rarely possible to distinguish one institution from 




the next in a reliable and safe way using such metrics because the differences are, in most 

cases, simply too small. Yet that is exactly what the TEF has to do – say that one institution 

deserves to charge higher fees whilst the next one down the rankings does not, even though 

they are statistically utterly indistinguishable. Adding scores together from a bunch of varied, 

and often invalid, metrics actually makes this problem worse and produces a grey muddle. The 

HEFCE guidelines now suggest that no more than 20% of institutions might be identified at the top and bottom of the rankings and distinguished from the middle-ranked institutions. But even that is bound to create unfairness for institutions just below the boundaries, which will be, in any statistical sense, indistinguishable from those just above them.
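The point can be illustrated with a small simulation. This is a minimal sketch only: the number of institutions, the mean, the spread and the respondent numbers are illustrative assumptions, not actual NSS parameters.

```python
# Why mid-table rankings shuffle even when nothing real changes.
# All numbers (130 institutions, mean 0.85, SD 0.02, 1,500 respondents)
# are illustrative assumptions, not actual NSS parameters.
import random
import statistics

random.seed(1)

N_INSTITUTIONS = 130
RESPONDENTS = 1500  # survey respondents per institution per 'year'

# 'True' underlying satisfaction: roughly normal, with a small spread.
true_scores = [random.gauss(0.85, 0.02) for _ in range(N_INSTITUTIONS)]

def observed_ranks(true_scores):
    """One survey round: add sampling noise to each 'true' score, then rank (1 = top)."""
    observed = []
    for p in true_scores:
        se = (p * (1 - p) / RESPONDENTS) ** 0.5  # standard error of a proportion
        observed.append(p + random.gauss(0, se))
    order = sorted(range(len(observed)), key=lambda i: observed[i], reverse=True)
    ranks = [0] * len(observed)
    for position, inst in enumerate(order, start=1):
        ranks[inst] = position
    return ranks

year_one = observed_ranks(true_scores)
year_two = observed_ranks(true_scores)  # identical underlying quality, fresh noise

shifts = [abs(a - b) for a, b in zip(year_one, year_two)]
print("Median year-on-year rank shift with no real change:", statistics.median(shifts))
print("Largest year-on-year rank shift with no real change:", max(shifts))
```

Under these assumptions the few institutions at the extremes remain identifiable, while those in the crowded middle shuffle places from one 'year' to the next even though nothing real has changed - which is precisely the distinction the TEF is being asked to make reliably.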

Institutional average scores on teaching metrics usually hide wide departmental 

differences 

The third problem facing the TEF in making the required fine-grained distinctions is that it 

intends to rank and distinguish institutions. Institutions are made up of departments (or subjects) 

that very often differ widely from each other in terms of a whole range of metrics. These internal 

differences can be so large that an institution may have the top-ranked department in the 

country in one subject and the bottom-ranked department in a different subject. These 

departmental scores are then averaged and the institution as a whole might end up looking 

average (as most in fact do). This averaging of varied departments helps to produce the 

problem of lack of distinction between institutions highlighted above. Students need to know 

about the subject they are interested in and to be able to compare that subject across 

institutions. The current TEF mechanism will not allow them to do this - it could even trick them 

into paying a higher fee to study in a lower quality department. If students are interested in their 

likely employability, the problem is even more acute, as national differences between subjects 

are gross, and institutional employability averages are, at least in part, a consequence of their 

subject mix. If an institution taught just Nursing and Cultural Studies, then it might look average 

for employability, but comparatively bad for a student wishing to study Nursing and surprisingly 

good for a student wishing to study Cultural Studies. This problem would be partly solved if the 

TEF operated at the level of subjects (or departments) rather than institutions, which is a 

development being considered for the future. 
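The Nursing and Cultural Studies point can be made concrete with a crude worked example; the figures below are invented purely for illustration and do not come from any real employability dataset.

```python
# Invented figures only: they illustrate how averaging across a subject mix can
# hide large subject-level differences from prospective students.

# National graduate-level employment rate by subject (illustrative).
national_rate = {"Nursing": 0.94, "Cultural Studies": 0.55}

# A hypothetical institution: weaker than the national rate in Nursing,
# stronger in Cultural Studies, with equal cohort sizes in each.
institution_rate = {"Nursing": 0.88, "Cultural Studies": 0.62}
cohort_size = {"Nursing": 200, "Cultural Studies": 200}

def weighted_average(rates, weights):
    """Cohort-weighted average employment rate across subjects."""
    total = sum(weights.values())
    return sum(rates[s] * weights[s] for s in rates) / total

print(f"Institutional average:      {weighted_average(institution_rate, cohort_size):.1%}")
print(f"National average, same mix: {weighted_average(national_rate, cohort_size):.1%}")
# Output: 75.0% vs 74.5% - the headline verdict is 'about average', yet the
# institution is a poor choice for a Nursing applicant and a good one for
# Cultural Studies.
```

The same arithmetic explains why an institution's headline employability figure can owe more to its subject mix than to anything its departments actually do.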

But even then, there would be significant difficulties in identifying what a ‘subject’ is. I once 

helped a Sports Science department collect a good deal of data about students’ experience of 

assessment, using the Assessment Experience Questionnaire (AEQ). There were seven degree 




programmes within ‘Sports Science’ and, in terms of students’ experience, they differed from 

one another to a considerable extent, ranging from rather good to pretty awful. As there were no 

NSS categories to differentiate between these degree programmes, for NSS data collection and 

reporting purposes the seven were simply aggregated into a single undifferentiated muddle. 

Standard NSS subject categories might work for traditional academic departments with one 

degree programme and large cohort sizes, but they may be less than helpful as a means of 

distinguishing the more unconventional and varied subject groupings usually found in modern 

teaching institutions. Again, this suits traditional research universities best. 

Employability has little to do with teaching quality 

There is a misinformed and confused conflation of employability with quality. Quality, according 

to the TEF, is apparently all about employability. Students don't think so. When a HEPI survey asked them what best indicated the quality of a course, they gave perhaps surprisingly conventional answers about teachers and teaching; employability came near the bottom of the list of things they felt told them anything useful about quality.

If the government had bothered to look at national rankings of universities’ teaching and 

employability performance, it would have discovered that its assumption is complete nonsense. 

The table below ranks institutions according to 2016 NSS scores.  

Table 1: Institutional teaching and employability rankings

Rank | Top ten by NSS satisfaction, 2016 | Top ten by graduate employability (The Times, 2016) | NSS rank, 2016, of the employability top ten
1 | Buckingham | Cambridge | 20
2 | University of Law | Oxford | 20
3 | St Mary's College Belfast | LSE | 155
4 | Courtauld Institute of Art | Manchester | 87
5 | Keele | Imperial College | 116
6 | St Andrews | King's College | 129
7 | Bishop Grosseteste | Edinburgh | 145
8 | Harper Adams | University College | 102
9 | Liverpool Hope | London Business School | 155
10 | Aberystwyth | Bristol | 76

 

It will be noticed that most of the top ten institutions are neither prestigious nor research giants. 

The second set of rankings is from ‘The Times’ 2016 data collection about graduate 

employability. There is no overlap at all with the top ten for NSS satisfaction. The right-hand 




column lists the NSS rankings for the top ten institutions for employability. With the exception of 

Oxford and Cambridge, they are considered by students to be amongst the worst in the country. 

Imperial College led the clamour for higher fees; it is currently ranked 116th for student 

satisfaction and dropping like a stone, but its reputation guarantees effortlessly high graduate

employability metrics. 

It took about ten minutes to compile this table from data easily available on the internet. 

Employability is largely a product of reputation, which follows research performance, overall

income and visibility. Graduate employability has almost nothing to do with teaching quality and 

most institutions are not in a position to do much about the employability of their students, which 

is largely determined by employers’ notions about reputation and the employment market - often 

the local employment market. The government could do something about that; universities cannot.

It is also the case that size helps visibility and reputation, and hence employability, but hinders 

teaching quality. It is rare in research literature about good teaching departments to discover 

one that is even medium-sized, let alone large. Top research universities are mainly large and 

tend to keep students’ choice of courses down, so creating large cohorts and large classes in 

order to reduce teaching loads. The consequences are there for all to see. 

The TEF’s proposed teaching metrics have limited validity 

The TEF rests on teaching metrics’ being valid. If they are not, in the sense that they do not 

predict student learning, then orienting institutions towards improving them may distract them

from actual efforts to improve student learning. The government is fully aware of the contents of 

‘Dimensions of Quality’ (Gibbs, 2010) and its identification of which metrics are valid and which 

are not, and so the proposals in the original Green Paper about the metrics the TEF would use 

were guaranteed to dismay. By the time details of the implementation of the TEF were made 

public, the situation had improved. Nevertheless, ‘satisfaction’ is not a valid measure of learning 

gains or of teaching quality. Outcome measures (including retention and employability) are 

significantly determined by student selectivity, and so indicate reputation rather than teaching 

quality, and reputation does not predict learning gains or the extent of use of pedagogic 

practices that lead to learning gains. The introduction of benchmarks that take the nature of 

student intake into account will help here, and ‘The Times’ modelling of institutional rankings 

based on benchmarked TEF metrics, using 2015 data, produced somewhat inverted rankings 




compared with the newspaper rankings we are used to seeing (that have been created by using 

almost entirely invalid metrics). This cannot have been what was originally intended. It is 

possible that the TEF’s benchmarked metrics, even if some of them are invalid, will create quite 

a shock to the system. Increasing the role played by valid measures, such as of student 

engagement, will help in the future and it is to be hoped that there will continue to be pragmatic 

changes in implementation in the pursuit of validity and fairness. The first attempt to produce 

rankings and associated varied fee levels is unlikely to get it right and decisions about 

institutions’ futures based on the current form of implementation are likely to be dangerously 

unsound. It would be prudent to wait until some of the problems identified above have been 

tackled more satisfactorily and to treat the rankings of the first year or two as a wake-up call. It 

is not as if students are impatiently pushing the government hard to increase fees. 
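To illustrate what the benchmarking mentioned above involves, the sketch below judges an institution on the gap between its observed outcome and the outcome expected for its particular intake, rather than on the raw figure alone. It is a deliberately simplified illustration with invented numbers, not HEFCE's actual benchmarking methodology.

```python
# Simplified illustration of intake benchmarking with invented numbers; the
# real HEFCE/TEF benchmarking methodology is considerably more elaborate.

def benchmark_flag(observed, expected, threshold=0.02):
    """Flag an institution by how far its observed rate sits from its benchmark."""
    difference = observed - expected
    if difference > threshold:
        return "+"   # materially above benchmark
    if difference < -threshold:
        return "-"   # materially below benchmark
    return "="       # indistinguishable from benchmark

# (observed retention rate, rate expected given the intake profile) - illustrative.
institutions = {
    "Highly selective university": (0.96, 0.97),
    "Widening-participation university": (0.88, 0.84),
}

for name, (observed, expected) in institutions.items():
    print(f"{name}: raw {observed:.0%}, benchmarked flag {benchmark_flag(observed, expected)}")
# The selective institution has the higher raw score but is flagged '=',
# while the widening-participation institution is flagged '+': benchmarking
# can invert the familiar raw rankings, which is why the benchmarked
# modelling looked so unfamiliar.
```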

The TEF is unlikely to be perceived by most as a reward 

The government argues that, just as strong research performance is rewarded by the REF, 

strong teaching performance should be rewarded by the TEF. But the majority of institutions 

have seen their research income decline dramatically over several rounds of research 

selectivity. The REF and its predecessors were designed explicitly to allocate research funding 

to fewer institutions (and fewer researchers) and to take funding away altogether from most. 

Careers, working lives and institutional reputations have been blighted by the REF. For most, it 

has been experienced as a punishment. Similarly, the TEF is likely to be perceived as offering 

brickbats and an uncertain future to perhaps thirty institutions, and as damning with faint praise

perhaps 100 more. Only those institutions that are allowed to charge top whack, and the sub-set 

of these for whom this is actually useful and welcome, are likely to feel rewarded: big sticks and 

small carrots, again. 

Conclusion 

A national policy with this degree of leverage over institutional behaviour risks causing damage 

if the assumptions on which it is built are wrong and the measures it uses are invalid. Institutions

may feel obliged to play the system and try to improve their metrics even if they do not believe 

in them and even if this has no useful impact on student learning. But perhaps institutions will 

become more sophisticated about using appropriate metrics in sensible ways. The demands of 

the TEF for evidence of ‘impact’ are already stimulating fresh thinking. If that prompts new 

evidence-based approaches to enhancement, then the TEF might even improve students’ 




learning gains despite its rationale and design. As students will be paying even more for their 

education, let us hope so. 

 

Reference list 

Gibbs, G. (2010) Dimensions of Quality. York: Higher Education Academy.