The paper reviews the Italian experience of research evaluation in the period 2000-2020. The initial exercise (VTR 2000-2003) did not involve all researchers and had no impact on funding. After a long political and cultural debate, it was decided to create an independent agency in charge of periodic research assessment, involving all researchers and feeding into performance-based funding. The legislation was approved in 2006 and the Agency was created in 2010-2011. In parallel, a major reform of academic promotion was approved in 2010. The Agency (ANVUR) has launched three exercises, two of which have been completed and published, while the third is in progress.
This paper is the first part of a two-part essay in which I try to offer a complete and critical view of the Italian experience of research assessment. Part I is dedicated to a detailed description of the experience, while in Part II I will try to do justice to the criticisms and controversies generated by research assessment.
Italy is an interesting case study for the international community working on science policy and research evaluation, on the one hand, and on informetrics and bibliometrics, on the other. It is the only large Continental European country in which research assessment is mandatory, has implications for university funding, and is carried out on a large scale at regular intervals. With more than 180,000 research products evaluated, the VQR 2004-2010 was the largest institutional exercise ever carried out. With more than 40,000 titles and 15,000 journals rated, the journal rating system is one of the largest available and has survived the criticisms that, in other countries such as France and Australia, led to its cancellation.
Another reason of interest is that in the Italian context evaluative informetrics, in particular the use of bibliometric indicators, was introduced suddenly and rapidly, generating within a few years considerable controversy, but also large opportunities for institutional learning and adaptation.
There is an important caveat to my analysis: I was a member of the Board of the Italian Agency (ANVUR) during the startup phase (2011-2015) and was personally responsible for some of the procedures, and collectively responsible for all decisions made in that period. In the current and the companion paper I will try to examine the experience in a professional way, using the available evidence systematically and balancing the arguments. The reader will judge whether my account is worthy of attention.
In this paper I first describe the events and decisions that led to the various research assessment exercises (Section 2). I then work backward, from the legislation and the administration to the main principles, objectives, purposes and criteria of the evaluative framework, as reconstructed in Section 3. The reasons for this inversion of the logical flow (not from principles to execution but the other way round) will become clear to the reader only after reading these sections. Section 4 discusses the reception of the research assessment in the university landscape, and Section 5 extends the description to another assessment activity carried out by the Agency in the context of the National Scientific Habilitation of candidates to the academic career.
In the companion paper I will examine the criticisms that have been raised against the research assessment in the peer-reviewed literature and weigh the various arguments, proposing at the end a balanced judgment of the overall experience.
The current experience of research assessment goes back to the year 2000 and covers three completed exercises and a work-in-progress. I summarize the main technical elements of the four exercises (from VTR 2000-2003 to VQR 2015-2019) over two decades in Table 1.
Table 1. Synopsis of the main features of research assessment exercises in Italy, 2000-2020.
| | VTR | VQR I | VQR II | VQR III |
|---|---|---|---|---|
| Period covered | 2000-2003 | 2004-2010 | 2011-2014 | 2015-2019 |
| Year of start | 2003 | 2011 | 2015 | 2020 (a) |
| Year of publication | 2004 | 2013 | 2017 | 2022 |
| Organization | CIVR | ANVUR | ANVUR | ANVUR |
| Subjects evaluated | 77 universities | 96 universities | 96 universities | n.a. |
| Evaluation method | Peer review | Peer review + bibliometrics | Peer review + bibliometrics | Informed peer review |
| Bibliometric indicator | None | Normalized number of citations until 2011 + journal impact factor | Normalized number of citations until 2015 + journal impact factor | To be decided by GEVs |
| Bibliometric source | None | WoS, Scopus, MathSciNet | WoS, Scopus, MathSciNet | To be decided by GEVs |
| Submission decision | Department | University or PRO, based on individual proposals for submission by researchers (n = 61,822) | University or PRO, based on individual proposals for submission by researchers (n = 52,677) | University or PRO, based on individual proposals for submission by researchers |
| Type of products | Journal article | Journal article | Same as 2004-2010 | Same as 2004-2010 |
| Number of products per capita | At least one product per 4 researchers (universities) or per 2 researchers (PROs) | 3 products per university staff member | 2 products per university staff member | 2 products per university staff member |
| Total number of products evaluated | 17,329 | 184,878 | 118,036 | n.a. |
| Expert panels | 20 expert panels | 14 GEV (Gruppi di esperti della valutazione) | 16 GEV | 17 GEV + GEV Third mission |
| Choice of experts | Call + nomination by CIVR | List of experts from previous call (n > 3,000) + nomination by ANVUR | Public call + nomination by ANVUR | Public call + random extraction with quotas |
| Number of referees | 6,661, of which 1,465 from abroad | >14,000 | >13,000 | n.a. |
| Main quality criteria | Relevance to the field | Originality | Originality | |
| Classes of merit | Excellent (top 20%) | Excellent (top 20%) | Excellent (top 10%) | A. Excellent and extremely relevant |
| Score | Excellent 1 | Excellent 1 | Excellent 1 | Not defined in the Ministerial decree |
| Penalty | Not applicable | Proven cases of plagiarism or fraud (-2) | Not assessable 0 | Not assessable 0 |
| Aggregation | Scientific structure | Department | Department | Department |
| Cost | 3.55 million euro | 10.57 million euro (including CINECA costs) | 14.7 million euro (including costs at university level) | n.a. |
| Impact on funding | No (b) | 13% of total funding (2013) | Approx. 25% of total funding (1.5 bn euro) | n.a. |
| Additional information | Human resources | Overall indicator IRFS1 | Same as 2004-2010 | n.a. |
Legend
(a) The start of the process was postponed due to the Covid-19 crisis.
(b) As a matter of fact, the Ministry used the VTR data to allocate a small share of funding (2%) in 2009. The move was criticized for using obsolete data, dating back to the start of the decade. This criticism increased the pressure to launch a new exercise and use fresh data for the allocation of funding.
As in other advanced countries, Italy experienced a shift from a centralized model of university administration to a model based on autonomy, although the process started only in the 1990s. After organizational and financial autonomy was granted to universities, there was general agreement on the need to provide the higher education system with advanced instruments for steering, accountability and responsibility.
This orientation led to the creation of CIVR (Comitato di Indirizzo per la Valutazione della Ricerca), the committee that carried out the first exercise, the VTR 2000-2003.
According to many observers, the main limitations of the VTR were the procedure for the selection of products and the lack of impact on funding. The selection of products to be submitted was the responsibility of departments, which in theory should have followed criteria of research quality. As a matter of fact, given the lack of experience in research assessment, but also the lack of consequences of inappropriate choices, products were selected using a variety of criteria, often correlated not with the quality of the products but with academic rank or other considerations. Abramo, D’Angelo and Caprasecca (
The initial experience led to a large debate, which identified several policy priorities for the future. First, the evaluation of research should become a permanent activity, allocated to a professional structure with a clear long-term mandate. Second, research assessment should have an impact on the funding of universities. Third, in order to create a diffused evaluation culture, all researchers should be subject to evaluation and should be responsible for the selection of their products. In terms of the methodological choices of the VTR, several contributions pointed to the need to enlarge the assessment toolbox to include bibliometrics (
Given these orientations, a long parliamentary debate led to the creation, in 2006, of a national agency called ANVUR (Agenzia Nazionale di Valutazione del Sistema Universitario e della Ricerca).
The legislation designed a complex procedure for the nomination of the members of the Board.
This complex selection and nomination procedure was designed to establish the independence of the Agency from the Ministry. The lack of formal separation from the budget of the Ministry, however, made this provision weaker, insofar as the allocation of resources depended on budgetary decisions of a political nature. This financial subordination was solved only later (see below). At the same time, the nomination procedure was aimed at giving the Agency an authoritative and legitimate role, based on professionalism, independence and transparency.
Given this political and administrative complexity, the Agency, whose legislative creation dates back to 2006, was established in 2010 and put into action only in May 2011. This may explain why the Agency started work on research assessment immediately, without a long consultation with the scientific communities, universities, and public research organizations (PROs). This can be considered a violation of the recommendations that are usually formulated for a sound research assessment process, that is, a broad discussion with the academic community.
A Ministerial decree dictated the main objectives of the research assessment exercise but also defined several technical details.
A clear divide was established between those fields in which bibliometric criteria were adopted (STEM disciplines, which fall in CUN areas 1-9) and those in which peer review was adopted (SSH, in CUN areas 10-14). Remarkable exceptions were Architecture (which adopted peer review although it is included in the Engineering area) and Psychology (which adopted bibliometrics although it is included in the Humanities). Another exception was Economics, illustrated below.
In order to put this approach into context a number of remarks should be made (see also
In those GEVs in which bibliometric analysis was implemented, two indicators were used: the normalized number of citations of the individual article, and the journal impact factor or an equivalent journal metric. The GEVs had the mandate to publish evaluation criteria tailored to their scientific areas. In practice this meant two main decisions: the choice of the bibliometric database, and the algorithm for the aggregation of the indicators. While ANVUR left the panels free to choose, most panels adopted the Web of Science (WoS), while a few used journal indicators from both WoS and Scopus. The issue of algorithm selection is more complex, and lies at the origin of controversies, as we will see.
Let us illustrate the issue. The number of citations received by an article in the relevant citation window was normalized against the distribution of citations of all articles of the same year in the same Subject Category at the world level. This resulted in a number representing a percentile of the world distribution. The journal impact factor (whether from WoS or Scopus) was likewise normalized against the distribution of all journals in the Subject Category, again resulting in a percentile.
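To make the normalization concrete, the following is a minimal Python sketch of how a raw value (a citation count or a journal impact factor) can be converted into a percentile of a reference distribution. The function name, the toy distributions and the tie-handling rule are illustrative assumptions of mine, not ANVUR's actual implementation.

```python
from bisect import bisect_right

def percentile_rank(value, reference_distribution):
    """Return the percentile (0-100) of `value` within a reference distribution.

    The reference distribution stands for the citation counts (or impact factors)
    of all world publications (or journals) of the same year and Subject Category.
    """
    ranked = sorted(reference_distribution)
    # Share of reference items whose value does not exceed the target value.
    position = bisect_right(ranked, value)
    return 100.0 * position / len(ranked)

# Hypothetical example: an article with 12 citations, compared against a toy
# world distribution of citations for its Subject Category and year.
world_citations = [0, 0, 1, 2, 2, 3, 5, 7, 9, 12, 15, 20, 34, 56, 80]
article_percentile = percentile_rank(12, world_citations)

# The journal indicator is normalized in the same way against all journals
# in the Subject Category.
world_impact_factors = [0.4, 0.8, 1.1, 1.5, 2.0, 2.6, 3.4, 4.9, 7.2]
journal_percentile = percentile_rank(2.6, world_impact_factors)

print(round(article_percentile, 1), round(journal_percentile, 1))
```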
The technical issue was that the Ministerial decree mandated the classification of all products into ordinal categories (classes of merit), defining the range of quantiles of the overall score distribution for each class (see the classes of merit in Table 1).
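A sketch of the mapping from percentiles to classes of merit follows. Only the top-20% boundary of the "Excellent" class in the first VQR is documented in Table 1; the remaining class names and boundaries below are illustrative placeholders, whereas the decree defined them precisely.

```python
def merit_class(percentile, boundaries):
    """Map a percentile (0-100) to an ordinal class of merit.

    `boundaries` lists (class_name, lower_percentile) pairs from best to worst;
    a product falls in the first class whose lower bound it reaches.
    """
    for name, lower in boundaries:
        if percentile >= lower:
            return name
    return boundaries[-1][0]

# For VQR 2004-2010 "Excellent" was the top 20% (Table 1); the other class
# names and boundaries below are purely illustrative placeholders.
vqr1_boundaries = [
    ("Excellent", 80),     # top 20% of the world distribution
    ("Good", 60),          # hypothetical
    ("Acceptable", 50),    # hypothetical
    ("Limited", 0),        # hypothetical
]

print(merit_class(86.7, vqr1_boundaries))   # -> "Excellent"
print(merit_class(55.0, vqr1_boundaries))   # -> "Acceptable"
```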
The work of the GEVs produced a score for each of the products submitted. All Italian researchers have access to a personal page on a platform (loginmiur) managed by a large IT consortium of universities (CINECA) working on behalf of the Ministry. The scores were transmitted confidentially to all researchers via their personal page on loginmiur. This means that all researchers received a number for each of the products submitted. No individual information was disclosed. ANVUR made clear that the evaluation referred only to products and had a statistical meaning, hence it was not intended to evaluate researchers. The Agency explicitly warned universities against asking researchers to disclose their personal scores on a voluntary basis. At the same time, it is clear that the impact of research assessment in the Italian context has been very deep, and emotionally charged, given the personal engagement of all those involved.
After receiving the scores, ANVUR aggregated the indicators at three levels: (a) discipline; (b) department; (c) university.
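The three-level aggregation can be sketched as follows. The records and the simple averaging are my own illustrative assumptions: the official VQR indicators (such as the overall IRFS1 mentioned in Table 1) combine product scores and weights in more elaborate ways.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical product records: (university, department, CUN area, product score).
products = [
    ("Univ A", "Physics Dept", "Area 02", 1.0),
    ("Univ A", "Physics Dept", "Area 02", 0.8),
    ("Univ A", "History Dept", "Area 11", 0.5),
    ("Univ B", "Physics Dept", "Area 02", 0.7),
]

def aggregate(records, key):
    """Average product score per aggregation unit (a deliberate simplification)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[3])
    return {unit: mean(scores) for unit, scores in groups.items()}

by_university = aggregate(products, key=lambda r: r[0])            # level (c)
by_department = aggregate(products, key=lambda r: (r[0], r[1]))    # level (b)
by_discipline = aggregate(products, key=lambda r: (r[0], r[2]))    # level (a)

print(by_university)   # e.g. {'Univ A': 0.766..., 'Univ B': 0.7}
```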
After the integration of these indicators it was possible to produce several rankings of universities, using various weighting schemes. ANVUR prepared several variants on behalf of the Ministry, but then transmitted to the media the single ranking that was selected
The results of the first VQR were presented in the summer of 2013 and obtained large media coverage. They showed a large divide between universities in the South and those in Central and Northern regions. This opened a broad debate on whether research assessment might produce unintended effects of deprivation for Southern universities, which may suffer from external diseconomies and a lack of attractiveness due to the lower level of socio-economic development. At the same time, it was shown that Southern universities performed well in the hard sciences, in which the scientific communities are international by nature, while the large gap in research quality was mostly concentrated in SSH and, in part, in Medicine. The debate went on. The law adopted a cautious approach to the allocation of performance-based funding, implementing a funding formula that prevented large reductions in overall funds as a direct consequence of the VQR (see Section 4 for details).
The large media coverage of the VQR reflected the novelty of the exercise. The final ranking was unanticipated, with several surprising results, such as the relatively poor performance of many historical and large universities and of some established PROs, against the dynamism of a few medium-sized young universities.
The second VQR started in 2015, with several modifications:
- Two new groups of experts were created, separating Psychology from History, Philosophy and Pedagogy (since in Psychology it was decided to adopt bibliometrics) and Architecture from Civil Engineering (since in Architecture there was a need to adopt peer review for the evaluation of books, graphical works and projects).
- The classes of merit were modified, with a more uniform definition of the boundaries of the classes.
- The lowest score was no longer zero, a score which created a feeling of discomfort in many scholars.
- The algorithm for the aggregation of bibliometric indicators was modified by introducing a more general nonlinear transformation (see
In addition, with a separate provision, the use of ranking for performance-based funding was modified, by shifting part of the allocation of funding from universities to departments, and selecting a number of excellent departments scattered across all universities.
The results of the second VQR were anticipated in late 2016 at aggregate level and were published in detail in 2017. Overall they confirmed the North-South divide, although some convergence towards a higher average level of research quality was evident.
The new rules for the allocation of performance-based funding were implemented immediately, following the Budget law 2017 (Law 232/2016).
After the second VQR a legislative change was introduced, by establishing that the time window of the exercise should be five years.
Following this provision, the third VQR covers the period 2015-2019. It was launched in 2019 but postponed to the end of 2020 due to the Covid crisis. In the third exercise the following modifications have been introduced:
- The composition of GEVs must follow a set of rules, with a minimum number by categories (e.g. by academic rank, type of institution, or gender).
- The members of GEVs will not be nominated by ANVUR but extracted at random from the lists of applicants, with quotas.
- All products will be peer-reviewed using an informed peer review methodology, associated in some cases to citation indicators.
- The number of products per researcher will be variable, allowing some researchers to compensate for others.
- The number of co-authors will be considered in the evaluation.
- Classes of merit no longer have predefined quantitative score levels.
- A new GEV has been created, separating Management from Economics and Statistics.
- The exercise will also include the evaluation of third mission, with a dedicated GEV.
In the companion paper I will examine carefully the main criticisms that have been raised against VQR. My preliminary statement is that several weaknesses have been addressed in moving from VQR 2004-2010 to the latest editions. On other points there is still significant discussion.
After the detailed description of the research assessment exercises carried out and in-progress, it is important to step back and ask whether there is an overall conceptual framework guiding the actions of the Agency.
In recent contributions Henk Moed has emphasized the need to place evaluative activities in an evaluative framework (Moed 2020a; 2020b). Research assessment has four interdependent dimensions: Policy and management, Evaluation, Analytics, and Data collection. The former two dimensions formulate the main issues underlying the research assessment, establish the objectives, and define a set of evaluation criteria. Analytics and Data collection, on the other hand, are the domain of evaluative informetrics, or the study of evaluative aspects of science and scholarship using informetric data and methodologies. Policy objectives and evaluation criteria cannot be founded or demonstrated informetrically but must be defined at higher levels by taking a non-neutral responsibility (ultimately a political responsibility).
In the Italian context, it is possible to reconstruct the overall evaluative framework by inspecting the legislative and ministerial documents that have shaped the research assessment activities. It must be emphasized that these documents define the operational aspects of the activities of the Agency and of the specific exercises without a formal and extended definition of the evaluative framework. Nevertheless, by reading available documents systematically and comparing them with the previous experience (VTR 2000-2003) it is possible to obtain a reasonably clear picture. I build up the reconstruction following the items suggested by Moed (2020a), who cites the
By combining the provisions of the Law that created the Agency and the ministerial decrees that gave origin to the various VQR the following principles emerge.
The law that has regulated the Agency (DPR 1 febbraio 2010 n. 76) clearly states that the object of the activity is the quality of processes, results and products of activities of universities and Public Research Organisations (PROs), considering their internal configuration (art. 2, comma 1, lettera a). The units of assessment are clearly identified at institution level, i.e. universities and PROs.
The same text argues that the assessment must take disciplinary differences into consideration (art. 2 comma 3). This amounts to a call for institution-level assessment with a breakdown by discipline and/or internal administrative structure. In practice, as described above, the evaluation is carried out at different institutional levels for universities and PROs. In universities, given that all researchers are by necessity classified by discipline and are affiliated to departments, the assessment is carried out at (i) discipline, (ii) department, and (iii) university level. In PROs the assessment is carried out at the institutional level and with a breakdown by internal organization (e.g. department or institute) but not by discipline, since researchers at PROs are not enrolled using the same classification system as universities (SSD).
It is important to remark that in all legislative and ministerial documents there is no mention at all of the evaluation of individual researchers.
This orientation is clear when compared to other provisions of the general legislative framework. For example, according to the university reform contained in Law 240/2010, universities can indeed evaluate their affiliates at the individual level, and use the results of assessment for monetary or other incentives, because they are the direct employers. Another provision that reflects the prohibition on using the research assessment at the individual level is that the scores assigned to individual research products are communicated privately and electronically to their authors, while members of the Agency and experts are obliged to observe the strictest confidentiality. Finally, individual scores are not disclosed to anybody, following the general legislation on privacy.
While researchers are not evaluated individually, with the VQR they must deliver the products to be evaluated to their institution (university or PRO) on the basis of an autonomous decision. This is in sharp opposition to the practice of the VTR 2000-2003. This principle is evident from the provision that all in-service researchers (that is, all researchers having an employment contract with a university or PRO) are required to submit their products.
Summing up, the Italian research assessment is clearly oriented towards the institutional level (universities and PROs) but achieves this objective through the involvement of all researchers. The use of the results of research assessment, as well as the use of metric indicators, for the evaluation of individual researchers is not allowed.
This principle results from the law that created the Agency and is evident in the parliamentary debate that addressed the issue before its approval. As a matter of fact, shortly after the first VQR some Rectors asked their professors, formally or informally, to disclose voluntarily their individual scores, or made some internal procedures conditional on individual scores. The Agency has systematically criticized this approach, stating explicitly that individuals cannot be evaluated with the research assessment methods used for large-scale exercises. In the Report that presented the results of the VQR 2004-2010 it was firmly stated that “the results of VQR cannot and should not be used to evaluate individual researchers”. To the best of my knowledge these isolated malpractices have disappeared, an example of institutional learning.
The overall orientation of the legislative and ministerial documents is towards a definition of research performance in terms of the quality of research products.
In order to interpret this term appropriately we should refer to other principles in the law. A prominent one is the reference to the best available practice at the international level regarding the “evaluation of results” (Art. 1, comma 1). Art. 3 (comma 2, lettera b) states that the Agency evaluates “the quality of the products of research”. It is clear that the focus of interest is not on the dimension of process (e.g. organization of research, management, resources, personnel, quality procedures), as is done for education in the context of quality assurance, but on the dimension of products, or results. The assessment of research is mainly an assessment of the final results of an (unobserved) research process. These results must be materialized in specific objects, called research products.
At the same time, other dimensions of the research process are mentioned in the law but with a non-systematic approach.
First, there is an explicit mention of technology transfer as a dimension of performance of research (Art. 3, comma 1, lettera a). As a matter of fact, the first ministerial decree launching the VQR included specific indicators of technology transfer. The decree (art. 6) asked institutions to provide data on:
- Patents and spinoffs of which the institution is owner or co-owner, with separate specification of the age and performance of spinoffs;
- Cash inflows from sales and licensing of patents, with indication of the nature and characteristics of the acquirers.
It also dictated (art. 8) that the evaluation of patents should include transfer, development and socio-economic impact, including potential impact (Ministerial Decree 15 July 2011).
The decree did not add any other specification of indicators associated to third mission. ANVUR therefore added to the collection of institutional data an open section labelled Public engagement, inviting universities and PROs to give voluntary evidence of activities. The results were not used for the construction of indicators, given their lack of normalization.
As already stated, the role of the social benefit or third mission of research suffered from a lack of formalization until the latest VQR, launched in 2019. The nature of the evaluation of third mission in Italy and the methodological choices made for it will require a dedicated study.
Second, the final score attributed to universities includes not only the assessment of research products, but also the internationalization, the doctoral education and the quality of recruitment. In this way, several dimensions of the research process (somewhat similar to the concept of Infrastructure in the REF) are included. These indicators are evaluated at university or department level.
Summing up, there is a focus on the scientific-scholarly impact of research, with some consideration for the research process. The role of societal impact has been latent for some years but will be fully examined in 2021.
Specific evaluation criteria are not listed in the law, although it refers to the best international experience as a source. They are however clearly established in the ministerial decrees that have opened the VQR exercises. In this sense they satisfy the principle put forward by Moed (2020b) according to which “evaluation criteria and policy objectives cannot be founded informetrically”. They have been defined in a policy-oriented document, put forward by the Ministry. As such, they work as constraints to the discretionary work of the Agency.
While they have been slightly changed across the three VQRs, they can be summarized as follows:
- Novelty or originality
- Relevance to the field, or impact
- Methodological rigor
The
It is interesting to comment on the third criterion, methodological rigor.
While the law regulating the Agency (DPR 1 febbraio 2010 n. 76) is silent on specific evaluation criteria and delegates them to ministerial decrees, it is prescriptive on issues of methodology. This rather surprising aspect of the evaluative framework can be explained with reference to the opening part of the law, which is part of its interpretive apparatus. In this section the law mentions all the other laws that must be considered in order to interpret the text, as well as the positions of the legislative bodies that had a role in its approval. In one of these positions the members of Parliament (House of Representatives) of Commission VII (Culture and Education) asked to modify the text of Art. 3 (comma 2 lettera b) according to which the evaluation of research products is carried out “mainly (
Interestingly, the law rejects this position, arguing that there might be sectors and cases in which evaluative informetrics is admissible (“è ammissibile la valutazione metrica”). Assuming the peer review as the only admissible methodology, according to the law, would have limited the discretionary appreciation of the Agency in the selection of the best methodology. The law therefore establishes that the Agency “utilizes the criteria, the methods and the indicators which are considered more appropriate for any type of evaluation” (art. 3 comma 3).
Again, I find that this provision is consistent with the requirement that the two levels of research assessment (Policy and management, and Evaluation) should define the objectives, purposes and criteria but not the methodologies, while the research assessment exercise (Analytics and Data collection) should be responsible for the methodological and technical choices.
It is interesting to place this legislative text in the context of the large debate on peer review vs informetrics in research assessment, ignited by
The Italian law took a balanced position. Both methodologies are admissible (peer review and metrics), so that any position forbidding one of the two is not acceptable. Peer review must prevail in aggregate terms, however. At the same time, they must be adopted with specific reference to the needs of disciplines. And the final choice must be left to the independent Agency, not the government or the Parliament.
It is possible to derive a number of principles and statements about the objectives and purposes of research assessment.
The law creating the Agency (Legge 24 novembre 2006, n. 286) and the law regulating its activities (DPR 1 febbraio 2010, n. 76) are clear in stating that research assessment should be external, impartial, independent, professional, transparent and public. It must be directed towards the evaluation of the effectiveness and efficiency of all research activities funded with public money. These formulations clearly point to a major objective of accountability: institutions that receive public money must be able to justify their results.
This formulation, which is in some sense a standard tenet of public policy towards accountability, had a peculiar meaning in the context of Italian policy making. It came after several years of debate and policy efforts to reform the Public Administration and was introduced into one of the sectors of the administration in which accountability was traditionally very low, given the power of academic staff. It was therefore associated with large expectations from public opinion and the media.
The two laws mentioned are also explicit about the use of research assessment results as an input for the allocation of funding to universities. The Legge 24 novembre 2006, n. 286 states that “the results of the evaluation activities of ANVUR constitute the reference criteria for the allocation of public funds to universities and research institutes” (comma 139). This statement is repeated in DPR n. 76 (art. 4, comma 1).
As a matter of fact, the results of the evaluation are used only for a small part of the overall funding, the so-called premial quota (quota premiale).
At the same time, the
In addition, the provision that the results of research assessment must enter into the allocation of funding has the consequence that they must be formulated in a quantitative way. Consequently, the ministerial decrees that have launched the various VQRs have defined a quantitative framework and the Agency has published results accordingly.
As a corollary of the transparency and public nature of the overall evaluation, there is also an explicit objective of improved communication with stakeholders, in particular students.
This dimension of the evaluative framework is difficult to examine using the official documents. I can mention a systemic issue that is peculiar to the Italian economic and social background, i.e. the North-South divide. As we will see below and in the companion paper, there are large differences between universities located in the Northern and Southern regions of the country. Given that the research assessment has implications for funding, this issue has attracted considerable attention.
My reading of the legislation is that this problem is not addressed in the context of the traditional notion of
The first VQR was certainly a shock for the academic system. Not only were all researchers asked for the first time to submit their products, but the final results also showed several unexpected facts with respect to the ranking of universities. Moreover, the financial impact of the evaluation was significant, and academic bodies and university administrators started to worry about the implications for their annual budgets.
In particular, Law 98/2013 stated that the performance-based funding of universities should be an increasing share of the overall funding (FFO, Fondo di Finanziamento Ordinario).
In subsequent years the impact of research assessment percolated down to all layers of the academic world. From this perspective the impact of the VQR on university administrators and the overall academic body has been deep and pervasive.
In 2013 universities were unprepared to manage the sudden visibility given by the VQR. When the results were made public, the media started a campaign which lasted several weeks and involved hundreds of headlines, with a coverage much larger than for any academic topic. In a couple of papers we have examined the impact of the VQR on universities from the perspective of the media coverage of the overall exercise and of the ranking of individual institutions. Blasi et al. (
Interestingly, the situation changed significantly after the second VQR. When the data were presented in 2017, universities had enough experience to anticipate the main results. In most cases they had invested in communication offices that were prepared to exploit the news in a self-interested way. Bonaccorsi et al. (
There is a large international debate about the impact of university rankings and the risks associated with the creation of vertically stratified higher education systems. The intriguing findings about categorization tactics by Italian universities show that counterbalancing moves are actively implemented. Universities try to prevent the emergence of a higher education system perceived as stratified in terms of research excellence by focusing on other dimensions of performance and communicating systematically with their stakeholders. In other words, those universities that know they cannot compete on research excellence try to differentiate their offering. It is probably too early to evaluate whether this differentiation increased after the introduction of research assessment.
After the legislative initiative for the creation of the Agency, in 2010 the Parliament approved a major reform of the academic recruitment system. The history of the various solutions experimented with in the Italian system is quite complex and cannot be outlined here.
It is however useful to briefly reconstruct the cultural climate that led to the 2010 reform and its implementation. Complaints about the lack of transparency and meritocracy in academic promotions have repeatedly been made in the scientific literature (
This background helps to understand the origins of the legislation that came into force in 2010, introducing a radical departure from tradition (
There were several new provisions. The committee was formed by five Full professors, one of whom had to be affiliated not with an Italian university but with a university or equivalent institution in an OECD country. This provision was intended to increase transparency and to mitigate the supposedly deep propensity of Italian academics to form collusive agreements within the committees.
In fact, the most radical innovation was that not all Full professors were admitted to the competition for becoming members of the Habilitation committees, but only a selection of them: as a matter of fact, only 50% of them, as we shall see. For the first time in the history of the academic community in Italy, not all Full professors were considered legitimate members of committees for the selection and cooptation of colleagues. Full professors submitted applications to be included in the list of candidates, and the committees were formed by random sampling from these lists. To be included in the list, however, applicants had to satisfy a number of requisites established in the decree. The general principle was that members of the Habilitation committees should be at least of equal standing to the potential candidates. There were two types of requisites: one based on the qualitative appreciation of a number of activities (receipt of research grants, project coordination, editorial work and the like), the other based on quantitative indicators. The indicators were established in detail in the decree. They were divided into two groups: bibliometric and non-bibliometric indicators. The former were applied to STEM disciplines, the latter to SSH. All indicators were to be computed over the entire scientific career of the candidates.
The bibliometric indicators were: (1) number of articles in indexed journals; (2) number of citations received; (3) h-index. The non-bibliometric indicators were: (1) number of books; (2) number of book chapters and articles; (3) number of articles in A-rated journals. All indicators were normalized by academic age, defined as the number of years since the first publication. Given the quantitative nature of these indicators, there was a need to specify threshold values. The decree established that each threshold was to be computed as the median value of the distribution of the corresponding indicator across the entire academic community. In particular, candidates for an Associate professorship had to reach at least the median value of the indicators of those already serving as Associate professors; candidates for a Full professorship had to reach the median value of those already in that position. In bibliometric sectors candidates had to exceed at least two of the three indicators, while in non-bibliometric sectors the requirement was only one.
To clarify the meaning of the median: researchers who wanted to be recognized as Associate professors in STEM disciplines had to have more citations than 50% of the current Associate professors in the same discipline. The median has robust statistical properties, insofar as it is not influenced by extreme values. Establishing the median value as the threshold had two dramatic consequences: first, it was a severe requirement; second, it was dynamically adjusted upwards, since newly admitted professors would by definition have better indicators than the existing population, so that recomputed medians would rise over time. ANVUR strongly defended in official documents the use of the median value as a device for inducing qualified recruitment.
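The threshold logic can be sketched as follows. The reference population and the applicant below are invented for illustration; the actual indicator definitions, citation sources and reference populations were those fixed in the decree and computed by ANVUR on CINECA data.

```python
from statistics import median

def age_normalized(value, first_publication_year, reference_year=2012):
    """Normalize an indicator by academic age (years since first publication)."""
    academic_age = max(reference_year - first_publication_year + 1, 1)
    return value / academic_age

# Hypothetical Full professors of one bibliometric discipline:
# (articles in indexed journals, citations received, h-index, year of first publication).
full_professors = [
    (60, 900, 18, 1995),
    (45, 400, 12, 1998),
    (80, 1500, 25, 1990),
    (30, 250, 9, 2001),
    (55, 700, 16, 1996),
]

def median_thresholds(population):
    """Median of each age-normalized indicator across the reference population."""
    normalized = [
        [age_normalized(v, first_year) for v in (articles, citations, h_index)]
        for articles, citations, h_index, first_year in population
    ]
    return [median(values) for values in zip(*normalized)]

def eligible(candidate, thresholds, required=2):
    """Bibliometric sectors: at least two of the three age-normalized indicators
    must exceed the corresponding median (only one in non-bibliometric sectors)."""
    articles, citations, h_index, first_year = candidate
    values = [age_normalized(v, first_year) for v in (articles, citations, h_index)]
    return sum(v > t for v, t in zip(values, thresholds)) >= required

thresholds = median_thresholds(full_professors)
print(eligible((50, 650, 15, 1999), thresholds))   # hypothetical applicant
```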
The median value of these indicators was to be computed by ANVUR. The Agency published detailed criteria
The production of indicators in SSH was more complex. To start with, there was no definition of a scientific publication. The list of journals was extracted from a dataset managed by the university consortium CINECA, originating from loginmiur, the self-managed personal page on which all academic staff affiliated to universities record their own publications. Preliminary analysis, in fact, showed that researchers filled in metadata for all sorts of publications, from newspaper articles to local bulletins, from grey literature to fiction. No definition of eligible journals was included in the dataset, and the metadata are not curated according to library criteria. In the summer of 2012 CINECA collaborated with ANVUR in extracting all the metadata and examining their distributions.
For non-indexed journals, ANVUR was in charge of classifying them as scientific or non-scientific.
On top of this, the Agency was asked to rate scientific journals in order to obtain a category, called A-rated journals, on which the third indicator was based. The total number of journals initially rated as A-class was 3,676. This indicator (number of articles in A-rated journals) was in fact the only one in non-bibliometric sectors explicitly based on research quality criteria, while the other two (number of books; number of book chapters and journal articles) were mainly volume-based. Remember that there was no restriction on the type of books (hence, of book chapters) to be included in the indicator.
For the sake of the current discussion, it must be remembered that the classification was carried out in only two months (
The lists of indicators were published between the end of August and the start of September 2012, following the deadline set by the Ministerial decree. They were published separately for candidates to the Habilitation, based only on the decade 2002-2012 and distinguished between Associate and Full professorships, and for candidates to the Habilitation committees, based on the entire scientific career of Full professors. For each of these three categories, the lists included the three bibliometric and the three non-bibliometric indicators, represented by the median values of their distributions. Full professors applying to a Habilitation committee in bibliometric sectors had to meet at least two of the three indicators (i.e. exceed the median value of their peers), while in non-bibliometric sectors the requirement was only one indicator.
After the publication of the indicators, Full professors submitted their candidatures as members of the Habilitation committees, together with their CVs and full lists of publications. Based on these data, the Agency informed them about their admission to the final list for the random extraction. In order to facilitate the procedure, the Agency had confidentially informed all Full professors of their position with respect to the indicators: each received, on their personal page, a message stating whether their publications placed them above or below the three median values. This information was provided in a traffic-light form, where “green” meant that the indicator was satisfied and “red” that it was not.
The traffic light notwithstanding, several Full professors submitted their candidatures irrespective of the indicators. The number of applications was 7,325, a very large number. Of these, 1,468, or 20%, were rejected. The Agency sent a personal mail to all candidates, stating whether each of the indicators was satisfied or not, and concluding on whether the candidature was eligible or not (that is, whether at least two indicators in bibliometric sectors, or one in non-bibliometric sectors, were satisfied).
In order to take into account possible mistakes and to manage controversies, rejected candidates could submit an appeal. After these cases were checked manually, a few exclusions were corrected. Summing up, 20% of Full professors were told they were not in a position to serve as members of the committees for the cooptation of colleagues into the academic community. This was a shocking novelty: it had never been done before in the history of Italian universities. An external authority, the Agency, was made responsible for entering the sacred domain of academic autonomy and self-reference in order to admit or reject professors who were at the acme of their scientific careers.
It is useful to examine the composition of the 1,468 rejected candidatures. The largest group comes from the life sciences: 305 in Medicine, 151 in Biology, 137 in Veterinary sciences. Almost 600 Full professors in these disciplines were gently told they were not legitimated to sit on a Habilitation committee. Next comes Engineering, with 163 rejected candidatures, followed by Arts and Humanities with 104, Law with 97 and Architecture with 95. The hard sciences are less represented in the top list.
By and large, rejected candidates were those with poor scientific production compared with national colleagues in the same discipline. In some cases, however, they were scientifically active but published in non-eligible sources. For example, medical researchers publishing books or articles in Italian journals suddenly found that their production was not eligible, because in bibliometric sectors only indexed journals were relevant. A similar pattern was found in traditional areas of engineering, in which textbooks are considered scholarly products but were deliberately ignored in the calculation of indicators. It is also interesting that a large proportion of rejected professors came from applied disciplines, in which academics are also involved in professional and consulting activities, as happens in Medicine, Engineering, Law or Architecture. Broadly speaking, we may expect that rejected candidates were: (a) no longer active researchers; (b) mostly traditional scholars, publishing in non-eligible sources; or (c) less productive researchers, either due to heavy engagement in external, non-academic activities or due to intrinsically lower productivity. In all three cases the reasons for exclusion seemed well grounded in the legislative mandate, which explicitly aimed at establishing international standards and transparent procedures in recruitment.
While the rationale for this provision is clear, the technique used was a radical departure from tradition in the history of the relations between the State and academic communities (
The procedures were modified in 2016.
The most important changes are as follows:
- The Habilitation procedure has been made permanent: Habilitation committees are formed every two years, candidates may submit their candidature at any time, and the Habilitation is valid for six years;
- the committee is formed by five full professors (no foreign member);
- the majority requested is three votes out of five;
- it is possible to establish the maximum number of publications to be submitted for evaluation;
- the criteria and quantitative parameters should be defined by a Ministerial Decree, after proposals from ANVUR and CUN (Consiglio Universitario Nazionale).
The most important changes concern the majority vote (three members of the committee instead of four) and the decision-making process for the criteria and parameters. In the 2012 application of the law, the responsibility for the definition of criteria and parameters was entirely delegated to ANVUR, an independent technical body. In the 2016 reform the procedure for the publication of indicators changed: instead of being delegated to ANVUR, as in 2012, the numerical values of the indicators are decided by the Ministry after consultation with ANVUR, but also with an elective body (CUN) in which all categories of academic staff (researchers, associate and full professors, as well as students) are represented.
As a matter of fact, the median values were eliminated. They were replaced by thresholds, i.e. quantitative values for each of the indicators representing minimum requirements. The statistical foundations of the thresholds have not been published. They are based, as in 2012, on the distribution of indicators across the population of university researchers, as maintained by CINECA. At the same time, no statistical concept applied in a uniform way can be identified.
The Ministry published tables with the thresholds in 2016 and 2018. By inspecting these tables and comparing them with the median values published in 2012, it appears that the thresholds are much lower, perhaps placed around the 30th percentile of the distribution. It is too early to compare the effects of the 2012-2013 procedures and the 2016-2020 ones. I have an obvious conflict of interest in arguing in favour of the median approach, but I must observe that the wave of criticism against the Habilitation settled down after 2016 and the procedures are now running smoothly. In the companion paper I will discuss some recent studies that have evaluated the ASN procedure.
From the Italian experience I think that a number of policy highlights can be derived.
First, it is important that research assessment is prepared through an extensive period of discussion with scientific communities and institutions. This discussion should cover not only the broad policy goals, but also research assessment criteria and methodologies. Most likely academia will not have developed a detailed understanding of the methodological and technical issues at stake; what is required is not deliberation, but involvement. In the Italian experience this open discussion was made almost impossible by the extremely long delay in implementing the legislative provision (from 2006 to 2011) and the need to bootstrap the implementation. As a general recommendation, however, an adequate preparation period should be planned.
Second, if research assessment takes place in a period of financial restrictions on research and/or higher education, it is perceived as a policy instrument for reducing resources while avoiding political responsibility. In other European countries, namely the United Kingdom and the Netherlands, the introduction of research assessment was associated with an increase in financial resources in real terms; performance-based funding is then perceived as an incentive to perform better. Again, the Italian experience shows the negative impact of implementing research assessment in a period of budget crisis. The share of funding based on research performance is not perceived as a prize, but as a reduction of a penalty.
Third, linking research assessment to the allocation of resources ensures a fast and pervasive impact. In a few years academic leaders and administrators have learnt to take the results of the assessment seriously, and this awareness percolates rapidly through the academic community. There is an international debate on the pros and cons of performance-based funding in which the Italian experience has not yet been examined, due to the short time window of the implementation. However, the comparison between an exercise without financial implications (VTR 2000-2003) and one with implications for the funding formula (VQR 2004-2010) clearly shows the superiority of the latter.
Fourth, quantitative indicators help a lot. They work as focusing devices and approximate the underlying complex constructs (such as research quality) in such a way as to generate attention and awareness. We must be aware, at the same time, that a number of researchers simply do not have the training or the attitude to reason comfortably in quantitative terms.
Fifth, policy makers and research assessors should be prepared to manage conflicts. Although the normative foundations of research assessment may actively embrace ideals of neutrality and professionalism, there is always room for conflict. Conflicts come from a variety of sources, some open (e.g. disputes over assessment criteria, or political commitment against assessment), some covert (e.g. protection of vested academic interests). The opposition against research assessment merges these sources, which makes it extremely difficult to interpret the intentions of opponents. My recommendation is to put in place systematic and transparent procedures of hearings and discussions: the arguments of opponents should be spelled out and rationally debated. Conflicts must be managed with open listening and an open assumption of responsibility.
Finally, policy makers should be keenly aware of the unintended consequences of research assessment and prepare counterbalancing actions. Perhaps the riskiest consequence is bureaucratization: the slow transformation of research assessment into a routine administrative activity is dangerous. Another unintended consequence is the adoption of strategic or gaming behaviors to the point where academic values are placed at risk. Research assessment policies and practices should be periodically re-opened and re-motivated in order to prevent or mitigate these outcomes.
Legge 24 novembre 2006, n. 286. All documentation on the legislation and the composition of the Board is available at
“I risultati delle attività di valutazione dell’ANVUR costituiscono criterio di riferimento per l’allocazione dei finanziamenti statali alle università e agli enti di ricerca” (The results of the evaluation activities of ANVUR constitute reference criteria for the allocation of government funding to universities and PROs). Legge 24 novembre 2006, no. 286, art. 2, comma 139.
Decreto del Presidente della Repubblica 1 febbraio 2010, n.76.
Decreto del Presidente della Repubblica 22 febbraio 2011.
The documentation on the three VQR exercises is available at
Decreto Ministeriale 15 luglio 2011 no. 17.
While scientific fields (SSD) at more granular level are subject to periodic revision and update, following scientific developments, the main broad areas are relatively stable. They are labelled
Decreto Ministeriale 27 giugno 2015 no. 458.
The overall procedure is described in
The committee was formed by seven members: one nominated by the Prime Minister, two nominated by the Minister of University and Research, and four nominated by the same Minister from a short list of candidates proposed by ANVUR and by the National Committee of Research Guarantors (Ministerial Decree n. 262, 11 May 2017).
Legge 11 dicembre 2016, n. 232, art. 1, co. 339.
The history of evaluation of third mission is very interesting but cannot be discussed here. See Blasi et al. (
This definition includes, for universities, Full professors, Associate professors, Researchers, both full time and part-time, and does not include post-doc researchers, doctoral students, and contract teachers. For PROs it includes researchers from all ranks and does not include technicians.
Decreto del Presidente della Repubblica 14 settembre 2011, n. 222. The Law 240/2010 was subsequently modified by Law 114/2014. The original decree, which came into force in 2012, was deeply modified in 2016 by the Decreto del Presidente della Repubblica 4 aprile 2016, n. 95.
This provision was eliminated in 2016 due to various administrative problems (e.g. lack of comparability across disciplines between national and foreign academics).
All documentation available at
Consiglio Direttivo ANVUR,
As a matter of fact, the same journals appeared repeatedly across many scientific areas. While the judgment of lack of scientific nature had to be valid universally (although there were cases of controversies across experts for journals of high culture), the A-class rating was clearly specific to the disciplines and might lead to different decisions. This means that the same journal was subject to many evaluations. My own reconstruction from internal data is that the total number of items evaluated was 67,038 titles.
Decreto del Presidente della Repubblica 4 aprile 2016, n. 95; Decreto Ministeriale 7 giugno 2016, n. 120; Decreto Ministeriale 29 luglio 2016, n. 602.
This was clear after the introduction of the notion of “median” in the National Scientific Habilitation. It took the academic community by surprise. I have an entire collection of anecdotes of respected authorities that made declarations in which the difference between the median and the mean of a distribution was ignored. They simply did not know it.
The author has no competing interests to declare.