Italian Political Science, Volume 12, Issue 1, June 2017, pp. 46–54. © 2017 Italian Political Science. ISSN 2420-8434.

The Italian Research Assessment Exercises

Daniele Checchi
UNIVERSITY OF MILAN

1. The Italian experience

Italian universities have so far experienced three assessment exercises (2001-03, 2004-10 and 2011-14), which are described in detail in Table 1. The fiscal law approved in December 2016 dictates that from now on the reference periods will be quinquennial, reducing the discretionary power so far exercised by the Ministry of Education in designing the exercise.

Table 1. The three research assessment exercises

After an initial trial-and-error approach, the second and third exercises have been rather similar, thus consolidating a standard of evaluation whose principles are the following:

• each assessment is intended to evaluate groups (universities, research agencies, down to departments and institutes) and not individuals (individual assessments are revealed to each researcher, but not to heads of departments, deans or chancellors);
• the assessment considers a fixed number of products per capita per year, which should capture the best production: as such, it is closer to a monitoring exercise than to a quality assessment revealing excellence in a given research field;
• under current standards (half a product per year per university professor – currently around 52,000 professors – and one product per year per researcher working in a research agency – currently around 10,000 researchers), approximately 35,000 products are expected per year; over a 5-year interval this sums to 175,000 products, making some sort of automatic (bibliometric) assessment unavoidable;
• the process has been managed by groups of experts, defined according to predefined research areas (Italian professors are pigeon-holed into 371 research fields, which are grouped into 14 research areas, known as Aree CUN). Each group was composed of a variable number of experts (from 20 to 60, depending on the expected number of products); the experts were selected by ANVUR from a list of applicants, according to their publication records and their area of expertise. In turn, these experts relied on 14,500 external peer reviewers working in domestic and foreign institutions;
• in the last two exercises, the evaluating agency (ANVUR) required the experts to classify journals according to a preassigned distribution, mirroring the world distribution of impact. As a consequence, the top tier of journals should correspond to the best 10% of world production; nevertheless, more than 30% of the products submitted to the last exercise ended up in this category (because the exercise considers only each author's best products);
• depending on the research area, two assessment procedures have been followed (a schematic sketch follows this list):
– the bibliometric assessment combined the ranking of the journal, according to its Impact Factor, with the citations obtained by the specific article; articles in highly ranked journals with few citations and/or highly cited articles published in low-ranked journals were sent to peer review;
– the peer review assessment assigned each product separately to two experts, who independently selected an external peer reviewer; once the reviews were returned, a consensus report was drafted by the experts. In case of significant disagreement, a third reviewer was introduced, and the final assessment had to be approved by the coordinator of the group of experts.
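Schematically, the bibliometric route amounts to a simple decision rule. The sketch below is a minimal illustration, not ANVUR's actual algorithm: the percentile bands borrow the quality classes defined later for the VQR (see section 3), while the concordance test and all names are assumptions made for exposition.

```python
# Minimal sketch of the bibliometric routing rule described above.
# NOT ANVUR's actual algorithm: the class bands follow the VQR weight
# classes (top 10%, 70-90%, 50-70%, 20-50%, bottom 20%), but the
# concordance logic is an illustrative assumption.

CLASSES = [  # (lower percentile bound, label)
    (90, "Excellent"),
    (70, "Good"),
    (50, "Fair"),
    (20, "Acceptable"),
    (0, "Limited"),
]

def percentile_class(percentile: float) -> str:
    """Map a world-distribution percentile (0-100) to a quality class."""
    for lower, label in CLASSES:
        if percentile >= lower:
            return label
    return "Limited"

def bibliometric_assessment(journal_pct: float, citation_pct: float) -> str:
    """Return a class when journal rank and citations agree,
    otherwise route the product to (informed) peer review."""
    j_class = percentile_class(journal_pct)
    c_class = percentile_class(citation_pct)
    if j_class == c_class:
        return j_class
    # e.g. a highly ranked journal with few citations, or a highly
    # cited article in a low-ranked journal
    return "peer review"

print(bibliometric_assessment(95, 93))  # Excellent
print(bibliometric_assessment(95, 30))  # peer review
```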
In both cases, the submissions to the experts were non-blind, and the evaluators may have formed their opinion by looking at the place of publication, in what has been called "informed peer review".

2. The impact of the research assessment

The evaluation of each product is normalised by the mean of its research area, leading to an indicator which combines a quality and quantity assessment of a research field in a university.1 This indicator counts for three-fourths of the funds allocation, and is then complemented with other indicators (PhDs, foreign students, external funding) in order to obtain the summary indicator applied to the funding scheme for universities. The most recent exercise led to the distribution of one fourth of total public funding to universities in Italy (1.4 billion euros in 2016). Approximately 15% of total funding relies on the evaluation of research products proper.2

1 From a technical point of view, the indicator consists of the share of scores attained by a single university/department over the total scores achieved at the national level by all institutions. That share is then applied to the distribution of funds. If a university/department performs above the average, it will obtain a funding share which exceeds the corresponding share computed on personnel heads.

2 To be honest, the impact on funding is less dramatic in the short run, because of high persistence of historical values: each university cannot receive more or less than ±2% of what it received the previous year, thus strongly attenuating whatever result could be obtained from the research assessment.
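To make footnotes 1 and 2 concrete, here is a minimal sketch of a proportional allocation with the ±2% safeguard clamp. It is an illustration under stated assumptions, not the ministerial formula; all institution names and figures are invented.

```python
# Minimal sketch of the allocation mechanism in footnotes 1 and 2.
# All names and figures are invented for illustration.

def allocate(scores: dict, budget: float, previous: dict) -> dict:
    """Distribute `budget` proportionally to each institution's share
    of total scores, clamping each grant to +/-2% of last year's grant."""
    total = sum(scores.values())
    grants = {}
    for uni, score in scores.items():
        raw = budget * score / total   # footnote 1: share of national scores
        floor = 0.98 * previous[uni]   # footnote 2: safeguard clause
        ceiling = 1.02 * previous[uni]
        grants[uni] = min(max(raw, floor), ceiling)
    return grants

scores = {"Uni A": 120.0, "Uni B": 80.0}    # hypothetical VQR scores
previous = {"Uni A": 95.0, "Uni B": 105.0}  # last year's grants (mEUR)
print(allocate(scores, budget=200.0, previous=previous))
# Uni A: raw 120 clamped down to 96.9; Uni B: raw 80 clamped up to 102.9
```

Note that clamped grants need not sum exactly to the budget, so the actual rules must redistribute the residual in some way not modelled here.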
With this mechanism, Italy belongs to the group of evaluation-based systems (with the UK, Australia and New Zealand), to be contrasted with indicator-based systems (Norway, Denmark, the Czech Republic). However, the 5-year interval is long enough to call for alternative methods of evaluation in the intermediate years. In addition, the results of the evaluation have trickled down, directly or indirectly, to many other dimensions of the life of university departments. Many universities have used the scores obtained by their departments in the internal allocation of funds and promotions; the current accreditation of PhD programmes is based on the research assessment of the teaching staff; newspaper articles have widely disseminated the results of the research assessment with reference to local universities, in order to drive the choices of students and their families.

Even though it is formally independent, the process of selecting new academics has been significantly influenced by the research assessment exercises. Selection in hard-science research fields makes large use of bibliometric methods, while in the soft sciences journal rankings have been adopted. Though I would not dare to claim that the introduction of assessment exercises has raised hiring standards in most disciplines, as a matter of fact in the most recent VQR the average score of newly hired/promoted researchers is higher than the average of permanent staff (the indicator called IRAS2). This implies that new entrants to academia have internalised the assessment approach in shaping the way in which they publish their research outputs.

While the VQR asks for the assessment of "originality, relevance, exposure to international debate", what is more perceivable (and perceived) is the internationalisation of domestic production. Publishing in a foreign language (notably English) has become the dominant strategy in several fields. As a consequence, many Italian journals which used to publish in Italian have opted for English. A related issue is the multiplication of the number of papers via the diffusion of co-authorship. Since the VQR rules allow the same product to be submitted by more than one author (as long as they belong to different research institutions), many authors have followed a strategy of risk diversification, developing joint research projects in the expectation that at least one of them would obtain publication in a highly ranked journal.

3. The recent VQR (2011-14)

The most recent research assessment exercise ended in February 2017, with the official presentation of a global report on Italian research activity, accompanied by specific reports for each research area and for social impact activities. 96 universities participated in the exercise, together with 12 PROs (Public Research Organisations) and 26 other institutions on a voluntary basis. The distribution of the 118,036 products received for evaluation is reported in Table 2, where one can easily detect a few regularities. Compliance rates vary across research areas, oscillating between 90% and 97%.3 Journal articles represent the dominant submission in the hard sciences (reaching 98% in Biology and Medicine), while papers collected in edited volumes prevail in the social sciences and humanities. Books have almost never been submitted in bibliometric areas, while they represent one fourth of all submissions in some non-bibliometric areas. The residual category (including musical compositions, designs, architectural projects, performances, exhibitions, art objects, databases and software) accounts for a small fraction of the total output submitted to the assessment. This does not produce a representative snapshot of the research activity of universities (PROs have a similar composition), because of the limit of two products per researcher. Rather, it allows the monitoring of what can be considered the relevant scientific productivity of the entire research community.4

3 It is important to recall that a protest organised in some universities led a fraction of university professors to refuse to submit their required output. However, while in the first VQR the submission rate for universities was 95.09% of the expected output, it went down to 93.82% during the second one.

4 The rules prevented the submission of textbooks, working papers and self-publications.

Table 2. Distribution of products by research area and type of output – Italy VQR 2011-14

The assessment of each product was conducted according to three criteria:
1. Originality, intended as the degree to which the publication introduces a new way of thinking about the object of the research;
2. Methodological accuracy, intended as the degree to which the publication adopts an appropriate methodology and is able to present its results to peers;
3. Actual or potential impact, intended as the level of influence – current or potential – that the research exerts on the relevant scientific community.

Each publication was attributed a quality profile:
• Excellent (weight 1) if it falls in the top decile of the world distribution of publications in the research area;
• Good (weight 0.7) if it falls in the 70-90% segment of the distribution;
• Fair (weight 0.4) if it falls in the 50-70% segment of the international distribution;
• Acceptable (weight 0.1) if it falls in the 20-50% segment of the distribution;
• Limited (weight 0) if it belongs to the 0-20% lowest segment of the distribution;
• Impossible to evaluate (weight 0), assigned to missing publications or publications that could not be evaluated.
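Under these weights, an institution's raw score is simply the weight-sum over its submitted products; footnote 1 then converts such scores into national shares. A minimal sketch, with the weights taken from the list above and invented product counts:

```python
# Weight-sum of a department's products under the VQR quality classes.
# The weights are those listed above; the product counts are invented.

WEIGHTS = {
    "Excellent": 1.0,
    "Good": 0.7,
    "Fair": 0.4,
    "Acceptable": 0.1,
    "Limited": 0.0,
    "Impossible to evaluate": 0.0,
}

def department_score(counts: dict) -> float:
    """Sum of class weights over all submitted products."""
    return sum(WEIGHTS[label] * n for label, n in counts.items())

counts = {"Excellent": 10, "Good": 15, "Fair": 8, "Acceptable": 5, "Limited": 2}
print(department_score(counts))  # 10*1 + 15*0.7 + 8*0.4 + 5*0.1 + 2*0 = 24.2
```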
As one can easily expect, any evaluation of a product following the above-mentioned criteria contains some degree of arbitrariness. One can initially consider the language of publication as a proxy for exposure to the international debate. An inspection of Table 3 suggests that the so-called bibliometric sectors (in light grey) are largely open to the international debate. From this perspective, research area 13 (Economics and statistics) could be considered equally open to internationalisation. These areas have mostly relied on the automatic assignment of products to the evaluation categories, following the principle that journals with high impact factors are, generally speaking, more selective in acceptance and therefore impose higher standards of quality. This principle is complemented by the use of papers' citations, which should capture the relevance of the contents for the scientific debate. The evaluation in non-bibliometric areas relied on peer review (with the exception of research area 13, which adopted a ranking of journals based on impact factors). While the replacement of an algorithm with human reviewers may be welcome in terms of adherence to the suggested evaluation principles, it introduces the problem of potential disagreement among reviewers, which is likely to explain the lower fraction of "excellent" and "good" evaluations recorded in the non-bibliometric areas (see Table 4).

Table 3. Language of the products submitted to VQR 2011-14

Table 4. Distribution of products by research area and received evaluation – VQR 2011-14

4. The reception of the research assessment exercises in the academic community

These exercises have generated enthusiasm and collaboration as well as suspicion and resistance. A large fraction of academics fully cooperated with the exercise, organising the submission within each department and agreeing to review products. A smaller fraction opposed it, on the grounds that these exercises were steering Italian research towards irrelevant topics, promoting harmful competition among research institutions and destroying the weakest segment of academia (very often located in southern universities).5

5 Perhaps the most representative instances of this aversion towards the evaluation performed by ANVUR can be found on the following websites (unfortunately, all in Italian): www.roars.it (usually covering topics related to assessment methods); http://www.flcgil.it/universita/ (the website of the main union of university workers); http://firmiamodimissionianvur.org/ (more than 2,000 researchers signed a petition asking for the dismissal of the board of ANVUR, the evaluation agency).

My impression is that the main argument against the research assessment exercise runs as follows: "the assessment legitimises budget cuts, especially against southern universities. If we want to preserve equal opportunity in access to universities, we should oppose any assessment which links funding to results". This argument has some plausibility, especially when looking at Figure 1, which shows the trends in state funding to Italian universities in nominal and real (i.e. deflated by price variation) terms.
Recall that the first exercise with an impact on funding was launched in 2011, when the decline in resources became more pronounced. Although the actual impact was not disruptive (due to the safeguard clauses – see above), the linkage of resources to assessment opened up the risk of "poverty traps": a poorly performing university received fewer resources and was therefore less likely to improve its performance in the next round of assessment. Budget cuts curtailed hiring possibilities, which were only later eased in correlation with performance. Thus poorly performing universities were supposedly prevented from hiring better researchers who could reverse their ranking.

Figure 1. Total public revenues accrued to Italian public universities (2000=100)

Despite its simplicity, this line of argument is substantially flawed. During the first decade of the present century, the hiring procedures of Italian universities were reformed, moving from a format of centralised competitions to one of local competitions. Each department was left almost free to hire or promote whomever it deemed worthy. The first exercise (VTR) did not provide a clear picture of average performance, because it was designed to assess excellence within each university, without considering who wrote what. The second exercise (VQR 2004-10) revealed for the first time that a non-negligible fraction of researchers was unable to submit any research product at all. The third exercise (VQR 2011-14) provided evidence of some convergence of universities towards the mean, thanks to a change in the grading procedure (missing submissions were no longer penalised with a negative grade) but also to the injection of new resources that made it possible for all universities to hire new scholars.

5. Open issues for future assessment exercises

In the immediate aftermath of the publication of the results of the third exercise, several suggestions emerged in the press as well as in official forums. Some of them were mainly technical, others more philosophical. In the following I review them briefly.

The first concerns the potential bias contained in the evaluation. Given the existing rules, co-authored papers submitted to foreign journals have the highest probability of receiving a high grade. This implicitly "delegates" to foreign editors (and publishers) the choice of what is to be considered relevant for international debates. Topics that are outside the mainstream, or that are simply concerned with national debates, are likely to appear at best in local journals, which then receive lower evaluations even from referees. Moreover, most Italian journals do not yet have standard double-blind reviewing procedures, inducing the suspicion that the quality of their articles may be lower.

The absence of domestic databases on publications and citations makes it impossible to introduce a dual-layer system in which articles and books in Italian could gain more visibility.

The use of peer reviewers is not a panacea, for various reasons. Especially in the social sciences, where the ideological content of the arguments is important, the judgment of the reviewer may be biased by strategic concerns (by attributing a lower score to an author, one may be tempted to alter the competition among different schools of thought). In addition, the peer review of papers that have already undergone a genuine blind review process represents an inherent contradiction: suppose that the final reviewer spots an evident error; who is to be blamed, the author, the journal referees, or the editor of the journal? Finally, peer review is expensive. Consider the following back-of-the-envelope calculation: in the most recent exercise 52,060 products (corresponding to 44.1% of total production) underwent a double review; each reviewer received 30 euros per review, leading to a total cost above 3 million euros, which cannot be afforded frequently.
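The arithmetic can be checked in a few lines; the figures are those quoted above, while the extrapolation to all 118,036 products is merely illustrative:

```python
# Back-of-the-envelope reviewer costs, using the figures quoted in the text.
FEE_EUR = 30           # fee paid per review
REVIEWS_PER_PRODUCT = 2  # two independent reviewers per product

def review_cost(n_products: int) -> int:
    """Total fees for double-reviewing n_products."""
    return n_products * REVIEWS_PER_PRODUCT * FEE_EUR

print(review_cost(52_060))   # 3,123,600 euro: the products actually peer reviewed
print(review_cost(118_036))  # 7,082,160 euro: if every product went to peer review
```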
The second aspect concerns the different publication strategies of different research communities. On average, applied physics scholars publish more than 30 papers per year, because the number of co-authors can easily exceed one hundred. The corresponding figure for a theorist in mathematics may not reach one paper per year. To partially account for these differences the scores are normalised by research area, but this does not reduce the evident advantage of sectors where scholars may select their best production from a larger set of papers.

A related issue concerns the weighting of different products. The most recent exercise introduced for the first time a different weighting for books vis-à-vis journal articles: at the specific request of the author, a book could be counted as equivalent to two articles, thus satisfying the submission requirement. But the principle could be extended to other categories of products, since an article collected in a book is probably subject to less scrutiny than an article in a journal. Articles and/or books could also be weighted by the number of co-authors. And so on.

A further issue that has been raised concerns the boundaries of research areas. So far the assessment exercises have considered aggregations of the research fields (settori scientifico-disciplinari) under which academics are hired to teach. These have no correspondence to other classification criteria (such as the ERC classification) and tend to penalise cross-disciplinary research. In principle, nothing prevents a redesign of the evaluation areas, but this interferes with academic careers, which represent the strongest incentive to publish (at least for academics). Thus, a clear separation between research assessment and promotion criteria would be required before addressing this problem.

A final point concerns the potential trade-off between teaching and research. The assessment is conducted without any reference to the resources available for, or invested in, research, including the time absorbed by teaching. Most universities in peripheral areas lament the excess burden of teaching created by a chronic lack of staff. Intuitively, a proper assessment should correct for differences in starting conditions. Otherwise, stressing research results as the unique measure of scholars' quality is detrimental to the effectiveness of teaching, because scholars will devote their best energies to article writing.
There are possible solutions to avoid this trade-off: if each academic could choose from a menu of different combinations of teaching loads and publication commitments, we could observe a sorting of scholars according to their preferences and abilities. This would require a revision of the assessment procedure, because scholars would then have to be weighted, or converted into full-time research equivalents.

Overall, the unsolved issue for the Italian research assessment exercises seems to be whether the results should be interpreted as monitoring the system (in order to ensure accountability vis-à-vis taxpayers) or rather as a research quality assessment (intended to promote excellence). The Ministry of Education oscillates between these two interpretations, which however lead to alternative policy suggestions. According to the former perspective, uniformity of performance is the goal, and the weakest universities should be sustained in order to guarantee a common standard of tertiary education across the country. According to the latter, the best universities/departments should obtain even greater resources, given the good evaluations they obtained in the assessment.