Italian Political Science, Volume 12, Issue 1, June 2017, pp. 46–54. © 2017 Italian Political Science. ISSN 2420-8434.

The Italian Research Assessment Exercises

Daniele Checchi
UNIVERSITY OF MILAN

1. The Italian experience

Italian universities have so far experienced three assessment exercises (2001-03, 2004-10 and 2011-14), which are described in detail in Table 1. The fiscal law approved in December 2016 dictates that from now on the reference periods will be quinquennial, reducing the discretionary power so far exercised by the Ministry of Education in designing the exercise.

Table 1. The three research assessment exercises

After an initial trial-and-error approach, the second and third exercises have been rather similar, thus consolidating a standard of evaluation whose principles are the following:

• each assessment is intended to evaluate groups (universities, research agencies, down to departments and institutes) and not individuals (individual assessments are revealed to each researcher, but not to heads of departments, deans or chancellors);
• the assessment considers a fixed number of products per capita per year, which should capture the best production: as such, it is closer to a monitoring exercise than to a quality assessment revealing excellence in a given research field;
• under current standards (half a product per year per university professor – currently around 52,000 professors – and one product per year per researcher working in a research agency – currently around 10,000 researchers), approximately 35,000 products are expected per year; over a 5-year interval this sums to 175,000 products, making some sort of automatic (bibliometric) assessment unavoidable;
• the process has been managed by groups of experts, defined according to predefined research areas (Italian professors are pigeon-holed into 371 research fields, which are grouped into 14 research areas, known as Aree CUN). Each group was composed of a variable number of experts (from 20 to 60, depending on the expected number of products); the experts were selected by ANVUR from a list of applicants, according to their publication records and their area of expertise. In turn, these experts relied on 14,500 external peer reviewers working in domestic and foreign institutions;
• in the last two exercises, the evaluating agency (ANVUR) required the experts to classify journals according to a preassigned distribution, mirroring the world distribution of impact. As a consequence, the top tier of journals should correspond to the best 10% of world production; nevertheless, more than 30% of the products submitted to the last exercise ended up in this category (because the exercise considers only each author's best products);
• depending on the research area, two assessment procedures have been followed (a schematic sketch follows this list):
– the bibliometric assessment combined the ranking of the journal, according to its Impact Factor, with the citations obtained by the specific article; articles in highly ranked journals with few citations and/or highly cited articles published in low-ranked journals were sent to peer review;
– the peer review assessment assigned each product separately to two experts, who independently selected an external peer reviewer; once the reviews were returned, a consensus report was drafted by the experts. In case of significant disagreement, a third reviewer was introduced, and the final assessment had to be approved by the coordinator of the group of experts.
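Schematically, the bibliometric route amounts to a simple decision rule. The sketch below is a minimal illustration, not ANVUR's actual algorithm: the percentile bands borrow the quality classes defined later for the VQR (see section 3), while the concordance test and all names are assumptions made for exposition.

```python
# Minimal sketch of the bibliometric routing rule described above.
# NOT ANVUR's actual algorithm: the class bands follow the VQR weight
# classes (top 10%, 70-90%, 50-70%, 20-50%, bottom 20%), but the
# concordance logic is an illustrative assumption.

CLASSES = [  # (lower percentile bound, label)
    (90, "Excellent"),
    (70, "Good"),
    (50, "Fair"),
    (20, "Acceptable"),
    (0, "Limited"),
]

def percentile_class(percentile: float) -> str:
    """Map a world-distribution percentile (0-100) to a quality class."""
    for lower, label in CLASSES:
        if percentile >= lower:
            return label
    return "Limited"

def bibliometric_assessment(journal_pct: float, citation_pct: float) -> str:
    """Return a class when journal rank and citations agree,
    otherwise route the product to (informed) peer review."""
    j_class = percentile_class(journal_pct)
    c_class = percentile_class(citation_pct)
    if j_class == c_class:
        return j_class
    # e.g. a highly ranked journal with few citations, or a highly
    # cited article in a low-ranked journal
    return "peer review"

print(bibliometric_assessment(95, 93))  # Excellent
print(bibliometric_assessment(95, 30))  # peer review
```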
In both cases, the submissions to the experts were non-blind, and the evaluators may have formed their opinion by looking at the place of publication, in what has been called "informed peer review".

2. The impact of the research assessment

The evaluation of each product is normalised by the mean of its research area, leading to an indicator which combines a quality and quantity assessment of a research field in a university.1 This indicator counts for three-fourths of the funds allocation, and is then complemented with other indicators (PhDs, foreign students, external funding) in order to obtain the summary indicator applied to the funding scheme for universities. The most recent exercise led to the distribution of one fourth of total public funding to universities in Italy (1.4 billion euros in 2016). Approximately 15% of total funding relies on the evaluation of research products proper.2

1 From a technical point of view, the indicator consists of the share of scores attained by a single university/department over the total scores achieved at the national level by all institutions. That share is then applied to the distribution of funds. If a university/department performs above the average, it will obtain a funding share which exceeds the corresponding share computed on personnel heads.

2 To be honest, the impact on funding is less dramatic in the short run, because of high persistence of historical values: each university cannot receive more or less than ±2% of what it received the previous year, thus strongly attenuating whatever result could be obtained from the research assessment.
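To make footnotes 1 and 2 concrete, here is a minimal sketch of a proportional allocation with the ±2% safeguard clamp. It is an illustration under stated assumptions, not the ministerial formula; all institution names and figures are invented.

```python
# Minimal sketch of the allocation mechanism in footnotes 1 and 2.
# All names and figures are invented for illustration.

def allocate(scores: dict, budget: float, previous: dict) -> dict:
    """Distribute `budget` proportionally to each institution's share
    of total scores, clamping each grant to +/-2% of last year's grant."""
    total = sum(scores.values())
    grants = {}
    for uni, score in scores.items():
        raw = budget * score / total   # footnote 1: share of national scores
        floor = 0.98 * previous[uni]   # footnote 2: safeguard clause
        ceiling = 1.02 * previous[uni]
        grants[uni] = min(max(raw, floor), ceiling)
    return grants

scores = {"Uni A": 120.0, "Uni B": 80.0}    # hypothetical VQR scores
previous = {"Uni A": 95.0, "Uni B": 105.0}  # last year's grants (mEUR)
print(allocate(scores, budget=200.0, previous=previous))
# Uni A: raw 120 clamped down to 96.9; Uni B: raw 80 clamped up to 102.9
```

Note that clamped grants need not sum exactly to the budget, so the actual rules must redistribute the residual in some way not modelled here.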
With this mechanism, Italy belongs to the group of evaluation-based systems (with the UK, Australia and New Zealand), to be contrasted with indicator-based systems (Norway, Denmark, the Czech Republic). However, the 5-year interval is long enough to call for alternative methods of evaluation in the intermediate years. In addition, the results of the evaluation have trickled down, directly or indirectly, to many other dimensions of the life of university departments. Many universities have used the scores obtained by their departments in the internal allocation of funds and promotions; the current accreditation of PhD programmes is based on the research assessment of the teaching staff; newspaper articles have widely disseminated the results of the research assessment with reference to local universities, in order to drive the choices of students and their families.

Even though it is formally independent, the process of selecting new academics has been significantly influenced by the research assessment exercises. Selection in hard-science research fields makes large use of bibliometric methods, while in the soft sciences journal rankings have been adopted. Though I would not dare to claim that the introduction of assessment exercises has raised hiring standards in most disciplines, as a matter of fact in the most recent VQR the average score of newly hired/promoted researchers is higher than the average of permanent staff (the indicator called IRAS2). This implies that new entrants to academia have internalised the assessment approach in shaping the way in which they publish their research outputs.

While the VQR asks for the assessment of "originality, relevance, exposure to international debate", what is more perceivable (and perceived) is the internationalisation of domestic production. Publishing in a foreign language (notably English) has become the dominant strategy in several fields. As a consequence, many Italian journals which used to publish in Italian have opted for English. A related issue is the multiplication of the number of papers via the diffusion of co-authorship. Since the VQR rules allow the same product to be submitted by more than one author (as long as they belong to different research institutions), many authors have followed a strategy of risk diversification, developing joint research projects in the expectation that at least one of them would obtain publication in a highly ranked journal.

3. The recent VQR (2011-14)

The most recent research assessment exercise ended in February 2017, with the official presentation of a global report on Italian research activity, accompanied by specific reports for each research area and for social impact activities. 96 universities participated in the exercise, together with 12 PROs (Public Research Organisations) and 26 other institutions on a voluntary basis. The distribution of the 118,036 products received for evaluation is reported in Table 2, where one can easily detect a few regularities. Compliance rates vary across research areas, oscillating between 90% and 97%.3 Journal articles represent the dominant submission in the hard sciences (reaching 98% in Biology and Medicine), while papers collected in edited volumes prevail in the social sciences and humanities. Books have almost never been submitted in bibliometric areas, while they represent one fourth of all submissions in some non-bibliometric areas. The residual category (including musical compositions, designs, architectural projects, performances, exhibitions, art objects, databases and software) accounts for a small fraction of the total output submitted to the assessment. This does not produce a representative snapshot of the research activity of universities (PROs have a similar composition), because of the limit of two products per researcher. Rather, it allows the monitoring of what can be considered the relevant scientific productivity of the entire research community.4

3 It is important to recall that a protest organised in some universities led a fraction of university professors to refuse to submit their required output. However, while in the first VQR the submission rate for universities was 95.09% of the expected output, it went down to 93.82% during the second one.

4 The rules prevented the submission of textbooks, working papers and self-publications.

Table 2. Distribution of products by research area and type of output – Italy VQR 2011-14

The assessment of each product was conducted according to three criteria:
1. Originality, intended as the degree to which the publication introduces a new way of thinking about the object of the research;
2. Methodological accuracy, intended as the degree to which the publication adopts an appropriate methodology and is able to present its results to peers;
3. Actual or potential impact, intended as the level of influence – current or potential – that the research exerts on the relevant scientific community.

Each publication was attributed a quality profile:
• Excellent (weight 1) if it falls in the top decile of the world distribution of publications in the research area;
• Good (weight 0.7) if it falls in the 70-90% segment of the distribution;
• Fair (weight 0.4) if it falls in the 50-70% segment of the international distribution;
• Acceptable (weight 0.1) if it falls in the 20-50% segment of the distribution;
• Limited (weight 0) if it belongs to the 0-20% lowest segment of the distribution;
• Impossible to evaluate (weight 0), assigned to missing publications or publications that could not be evaluated.
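Under these weights, an institution's raw score is simply the weight-sum over its submitted products; footnote 1 then converts such scores into national shares. A minimal sketch, with the weights taken from the list above and invented product counts:

```python
# Weight-sum of a department's products under the VQR quality classes.
# The weights are those listed above; the product counts are invented.

WEIGHTS = {
    "Excellent": 1.0,
    "Good": 0.7,
    "Fair": 0.4,
    "Acceptable": 0.1,
    "Limited": 0.0,
    "Impossible to evaluate": 0.0,
}

def department_score(counts: dict) -> float:
    """Sum of class weights over all submitted products."""
    return sum(WEIGHTS[label] * n for label, n in counts.items())

counts = {"Excellent": 10, "Good": 15, "Fair": 8, "Acceptable": 5, "Limited": 2}
print(department_score(counts))  # 10*1 + 15*0.7 + 8*0.4 + 5*0.1 + 2*0 = 24.2
```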
As one can easily expect, any evaluation of a product following the above-mentioned criteria contains some degree of arbitrariness. One can initially consider the language of publication as a proxy for exposure to the international debate. An inspection of Table 3 suggests that the so-called bibliometric sectors (in light grey) are largely open to the international debate. From this perspective, research area 13 (Economics and statistics) could be considered equally open to internationalisation. These areas have mostly relied on the automatic assignment of products to the evaluation categories, following the principle that journals with high impact factors are, generally speaking, more selective in acceptance and therefore impose higher standards of quality. This principle is complemented by the use of papers' citations, which should capture the relevance of the contents for the scientific debate. The evaluation in non-bibliometric areas relied on peer review (with the exception of research area 13, which adopted a ranking of journals based on impact factors). While the replacement of an algorithm with human reviewers may be welcome in terms of adherence to the suggested evaluation principles, it introduces the problem of potential disagreement among reviewers, which is likely to explain the lower fraction of "excellent" and "good" evaluations recorded in the non-bibliometric areas (see Table 4).

Table 3. Language of the products submitted to VQR 2011-14

Table 4. Distribution of products by research area and received evaluation – VQR 2011-14

4. The reception of the research assessment exercises in the academic community

These exercises have generated enthusiasm and collaboration as well as suspicion and resistance. A large fraction of academics fully cooperated with the exercise, organising the submission within each department and agreeing to review products. A smaller fraction opposed it, on the grounds that these exercises were steering Italian research towards irrelevant topics, promoting harmful competition among research institutions and destroying the weakest segment of academia (very often located in southern universities).5

5 Perhaps the most representative instances of this aversion towards the evaluation performed by ANVUR can be found on the following websites (unfortunately, all in Italian): www.roars.it (usually covering topics related to assessment methods); http://www.flcgil.it/universita/ (the website of the main union of university workers); http://firmiamodimissionianvur.org/ (more than 2,000 researchers signed a petition asking for the dismissal of the board of ANVUR, the evaluation agency).

My impression is that the main argument against the research assessment exercise runs as follows: "the assessment legitimises budget cuts, especially against southern universities. If we want to preserve equal opportunity in access to universities, we should oppose any assessment which links funding to results". This argument has some plausibility, especially when looking at Figure 1, which shows the trends in state funding to Italian universities in nominal and real (i.e. deflated by price variation) terms.
Recall that the first exercise with an impact on funding was launched in 2011, when the decline in resources became more pronounced. Although the actual impact was not disruptive (due to the safeguard clauses – see above), the linkage of resources to assessment opened up the risk of "poverty traps": a poorly performing university received fewer resources and was therefore less likely to improve its performance in the next round of assessment. Budget cuts curtailed hiring possibilities, which were only later eased in correlation with performance. Thus poorly performing universities were supposedly prevented from hiring better researchers who could reverse their ranking.

Figure 1. Total public revenues accrued to Italian public universities (2000=100)

Despite its simplicity, this line of argument is substantially flawed. During the first decade of the present century, the hiring procedures of Italian universities were reformed, moving from a format of centralised competitions to one of local competitions. Each department was left almost free to hire or promote whomever it deemed worthy. The first exercise (VTR) did not provide a clear picture of average performance, because it was designed to assess excellence within each university, without considering who wrote what. The second exercise (VQR 2004-10) revealed for the first time that a non-negligible fraction of researchers was unable to submit any research product at all. The third exercise (VQR 2011-14) provided evidence of some convergence of universities towards the mean, thanks to a change in the grading procedure (missing submissions were no longer penalised with a negative grade) but also to the injection of new resources that made it possible for all universities to hire new scholars.

5. Open issues for future assessment exercises

In the immediate aftermath of the publication of the results of the third exercise, several suggestions emerged in the press as well as in official forums. Some of them were mainly technical, others more philosophical. In the following I review them briefly.

The first concerns the potential bias contained in the evaluation. Given the existing rules, co-authored papers submitted to foreign journals have the highest probability of receiving a high grade. This implicitly "delegates" to foreign editors (and publishers) the choice of what is to be considered relevant for international debates. Topics that are outside the mainstream, or that are simply concerned with national debates, are likely to appear at best in local journals, which then receive lower evaluations even from referees. Moreover, most Italian journals do not yet have standard double-blind reviewing procedures, inducing the suspicion that the quality of their articles may be lower.

The absence of domestic databases on publications and citations makes it impossible to introduce a dual-layer system in which articles and books in Italian could gain more visibility.

The use of peer reviewers is not a panacea, for various reasons. Especially in the social sciences, where the ideological content of the arguments is important, the judgment of the reviewer may be biased by strategic concerns (by attributing a lower score to an author, one may be tempted to alter the competition among different schools of thought). In addition, the peer review of papers that have already undergone a genuine blind review process represents an inherent contradiction: suppose that the final reviewer spots an evident error; who is to be blamed, the author, the journal referees, or the editor of the journal? Finally, peer review is expensive. Consider the following back-of-the-envelope calculation: in the most recent exercise 52,060 products (corresponding to 44.1% of total production) underwent a double review; each reviewer received 30 euros per review, leading to a total cost above 3 million euros, which cannot be afforded frequently.
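The arithmetic can be checked in a few lines; the figures are those quoted above, while the extrapolation to all 118,036 products is merely illustrative:

```python
# Back-of-the-envelope reviewer costs, using the figures quoted in the text.
FEE_EUR = 30           # fee paid per review
REVIEWS_PER_PRODUCT = 2  # two independent reviewers per product

def review_cost(n_products: int) -> int:
    """Total fees for double-reviewing n_products."""
    return n_products * REVIEWS_PER_PRODUCT * FEE_EUR

print(review_cost(52_060))   # 3,123,600 euro: the products actually peer reviewed
print(review_cost(118_036))  # 7,082,160 euro: if every product went to peer review
```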
The second aspect concerns the different publication strategies of different research communities. On average, applied physics scholars publish more than 30 papers per year, because the number of co-authors can easily exceed one hundred. The corresponding figure for a theorist in mathematics may not reach one paper per year. To partially account for these differences the scores are normalised by research area, but this does not reduce the evident advantage of sectors where scholars may select their best production from a larger set of papers.

A related issue concerns the weighting of different products. The most recent exercise introduced for the first time a different weighting for books vis-à-vis journal articles: at the specific request of the author, a book could be counted as equivalent to two articles, thus satisfying the submission requirement. But the principle could be extended to other categories of products, since an article collected in a book is probably subject to less scrutiny than an article in a journal. Articles and/or books could also be weighted by the number of co-authors. And so on.

A further issue that has been raised concerns the boundaries of research areas. So far the assessment exercises have considered aggregations of the research fields (settori scientifico-disciplinari) under which academics are hired to teach. These have no correspondence to other classification criteria (such as the ERC classification) and tend to penalise cross-disciplinary research. In principle, nothing prevents a redesign of the evaluation areas, but this interferes with academic careers, which represent the strongest incentive to publish (at least for academics). Thus, a clear separation between research assessment and promotion criteria would be required before addressing this problem.

A final point concerns the potential trade-off between teaching and research. The assessment is conducted without any reference to the resources available for, or invested in, research, including the time absorbed by teaching. Most universities in peripheral areas lament the excess burden of teaching created by a chronic lack of staff. Intuitively, a proper assessment should correct for differences in starting conditions. Otherwise, stressing research results as the unique measure of scholars' quality is detrimental to the effectiveness of teaching, because scholars will devote their best energies to article writing.
There are possible solutions to avoid this trade-off: if each academic could choose from a menu of different combinations of teaching loads and publication commitments, we could observe a sorting of scholars according to their preferences and abilities. This would require a revision of the assessment procedure, because scholars would then have to be weighted, or converted into full-time research equivalents.

Overall, the unsolved issue for the Italian research assessment exercises seems to be whether the results should be interpreted as monitoring the system (in order to ensure accountability vis-à-vis taxpayers) or rather as a research quality assessment (intended to promote excellence). The Ministry of Education oscillates between these two interpretations, which however lead to alternative policy suggestions. According to the former perspective, uniformity of performance is the goal, and the weakest universities should be sustained in order to guarantee a common standard of tertiary education across the country. According to the latter, the best universities/departments should obtain even greater resources, given the good evaluations they obtained in the assessment.