Research data in all its diversity—instrument readouts, observations, images, texts, video and audio files, and so on—is the basis for most advancement in the sciences. Yet the assessment of most research programmes happens at the publication level, and data has yet to be treated like a first-class research object.
How can and should the research community use indicators to understand the quality and many potential impacts of research data? In this article, we discuss the research into research data metrics, these metrics’ strengths and limitations with regard to formal evaluation practices, and the possible meanings of such indicators. We acknowledge the dearth of guidance for using altmetrics and other indicators when assessing the impact and quality of research data, and suggest heuristics for policymakers and evaluators interested in doing so, in the absence of formal governmental or disciplinary policies.
Research data is an important building block of scientific production, but efforts to develop a framework for assessing data’s impacts have had limited success to date.
Indicators like citations, altmetrics, usage statistics, and reuse metrics highlight the influence of research data upon other researchers and the public, to varying degrees.
In the absence of a shared definition of “quality”, varying metrics may be used to measure a dataset’s accuracy, currency, completeness, and consistency.
Policymakers interested in setting standards for assessing research data using indicators should take into account indicator availability and disciplinary variations in the data when creating guidelines for explaining and interpreting research data’s impact.
Quality metrics are context-dependent: they may vary based upon discipline, data structure, and repository. For this reason, there is no agreed-upon set of indicators that can be used to measure quality.
Citations are well-suited to showcase research impact and are the most widely understood indicator. However, efforts to standardize and promote data citation practices have seen limited success, leading to varying rates of citation data availability across disciplines.
Altmetrics can help illustrate public interest in research, but availability of altmetrics for research data is very limited.
Usage statistics are typically understood to showcase interest in research data, but infrastructure to standardize these measures has only recently been introduced, and not all repositories report their usage metrics to centralized data brokers like DataCite.
Reuse metrics vary widely in terms of what kinds of reuse they measure (e.g. educational, scholarly, etc.). This category of indicator has the fewest heuristics for collection and use associated with it; where possible, explain and interpret reuse with supporting qualitative data.
All research data impact indicators should be interpreted in line with the Leiden Manifesto’s principles, including accounting for disciplinary variation and data availability.
Assessing research data impact and quality using numeric indicators is not yet widely practiced, though there is general support for the practice amongst researchers.
Research data is the foundation upon which innovation and discovery are built. Any researcher who uses the scientific method—whether scientist or humanist—typically generates data. Much of this data is analyzed by computational means and archived and shared with other researchers to ensure the reproducibility of their work and to allow others to repurpose the data in other research contexts.
Given the importance of data, there have increasingly been calls for research data to be recognized as a first-class research object (
In this article, we discuss how research indicators can be used in evaluation scenarios to understand research data’s impact. We begin with an overview of the research evaluation landscape with respect to data: why policymakers and evaluators should care about research data’s impact; the challenges of measuring research data’s quality and impact; and indicators available for measuring data’s impact. We then discuss how to use data-related indicators in evaluation scenarios: specifically, how data metrics are being used in current evaluation scenarios, and guidelines for using these metrics responsibly.
The goals of this article are to provide the community with a summary of evidence-backed approaches to using indicators in assessment of research data and suggest heuristics that can guide evaluators in using these metrics in their own work.
Merriam-Webster defines data as “factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation”.
“Research data means data in the form of facts, observations, images, computer program results, recordings, measurements or experiences on which an argument, theory, test or hypothesis, or another research output is based. Data may be numerical, descriptive, visual or tactile. It may be raw, cleaned or processed, and may be held in any format or media.”
Crucially, research data is analyzed as the basis for new discovery and is not the result of an analysis (e.g. figures or tables). Data can be observational (unique data collected in real time), experimental (data collected under controlled conditions, often using lab equipment), simulation (data generated from test models), compiled (data combined from existing analyses or data sources like texts), or reference (curated, peer-reviewed data compiled by organizations) (
Research data exists in many different digital formats. These can include photographs, geospatial data, videos, text, audio files, databases, and instrumentation outputs.
Specific definitions for what constitutes “data” and popular formats tend to vary by field (Table
Examples of research data, by discipline.
Discipline | Data types | Data formats |
---|---|---|
Biology | DNA sequences, microscopy images, morphological data, images of specimens | Images (.jpg, .tiff), FASTQ (.fq), SAM (.sam) |
Sociology | Survey responses, digitized videos, labor statistics, economic indicators | Spreadsheets (.csv, .xlsx), videos (.mov, .mp4, .avi), text files (.txt, .xml, .docx), databases (.sql, .csv) |
History | Speech transcripts, digitized photographs and videos, journals, newspaper clippings, diaries | Text files (.txt, .xml, .docx), images (.png, .jpg, .tiff), videos (.mov, .mp4, .avi) |
Overview of major dataset repositories and services.
Service | Repository | Description |
---|---|---|
Repository (Generalist) | Zenodo | A nonprofit, open access repository serving all subject areas. Administered by CERN. |
Repository (Subject) | Dryad | A nonprofit, open access data repository serving all subject areas, with special emphasis on evolutionary biology, genetics, and ecology. |
Repository (Generalist) | Figshare | A commercial, open access repository serving all subject areas. A subsidiary of Digital Science. |
Repository (Subject) | Inter-university Consortium for Political and Social Research (ICPSR) | A nonprofit data archive and repository serving the social sciences. Administered by the Institute for Social Research at the University of Michigan. |
Metadata; Metrics Provider (Citations, Usage) | DataCite | A nonprofit dataset registry, DOI service, metadata index, and metrics provider. |
Metrics Provider (Citations) | Data Citation Index | A commercial service that indexes datasets, data publications, and citations to data in the research literature. A subsidiary of Clarivate Analytics. |
Metrics Provider (Altmetrics) | Altmetric | A commercial service that tracks research shared online, including datasets, and makes both quantitative and qualitative engagement and discussion data (“altmetrics”) available. A subsidiary of Digital Science. |
Metrics Provider (Altmetrics) | PlumX Metrics | A commercial service that tracks research shared online, including datasets, and makes both quantitative and qualitative engagement and discussion data (“altmetrics”) available. A subsidiary of Elsevier. |
Repository Registry | Registry of Research Data Repositories | A service operated by the nonprofit DataCite, offering information on more than 200 research data repositories worldwide. |
At a high level, datasets are defined by the COUNTER Code of Practice for Research Data as “an aggregation of data, published or curated by a single agent, and available for access or download in one or more formats, with accompanying metadata” (
How one defines research data has a bearing on how data is stored and accessed. For example, the contents of historical speech transcripts would not change over time and might be used regularly in teaching, whereas a collection of bird species images could be updated regularly to reflect new discoveries and shared widely by birding enthusiasts on social media. Differences in data storage and access can affect data’s use and the related metrics that can evaluate the data’s impacts (
There have been a number of changes in recent years that shape why and how evaluators may assess research data:
The volume of research data available worldwide has grown with advances in the ease of creating “born digital” research and using high-capacity computing resources (
Developments in “citizen science” have raised concerns about dataset quality for data collected, maintained, and described by members of the public (
The research community has increasingly shown support for Open Access, and along with it “Open Research” practices like data sharing (
The development of the “data publication”, an article-like format for sharing descriptions of datasets, which may make datasets more “citable” (
The recent rise of altmetrics, a class of scientometric indicators that help evaluators understand the attention that research has received online (
It is in this context that the evaluation community has been considering how to use indicators to understand the quality and impact of research data. There is also an interest in using indicators to incentivize data sharing—the idea being that the increased use of data-related indicators in evaluation practices would encourage researchers to begin sharing their data more often (
Before we discuss the many indicators that can be used to understand the impacts of research data, it is important to consider the challenges that exist to developing useful and sound metrics, given various socio-technical limitations. Note that throughout this article, we differentiate between
Perhaps the biggest challenge to understanding the impact of research data is that data can change. As new observations are gathered, data may be updated and “versioned”, making a dataset at one point in time potentially very different from the same dataset a year later. Moreover, one version of a dataset may be substantially different in content from others, in terms of the volume of data collected or the features included in the dataset (
Where data is in flux, it can be difficult to compare citation rates or altmetrics for research data generated in the same year, or even to compare a dataset’s citations from one year to the next. Thus, time-bound normalized indicators (e.g. comparing all datasets published in the same subject area and same year), which are typically used to account for variance, should not necessarily be used to make comparisons against changing data.
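To make the idea of a time-bound, field-normalized indicator concrete, the minimal sketch below computes a normalized citation score for a hypothetical cohort of datasets published in the same field and year; the DOIs and counts are invented for illustration, and the caveat above about changing data still applies.

```python
from statistics import mean

# Hypothetical citation counts for datasets published in the same field and year.
# Field-normalized scores divide each dataset's citations by the cohort mean,
# so a score of 1.0 means "cited about as often as its peers".
cohort = {
    "10.5061/dryad.example1": 14,  # hypothetical DOIs and counts
    "10.5061/dryad.example2": 3,
    "10.5061/dryad.example3": 0,
}

baseline = mean(cohort.values())

normalized = {doi: (count / baseline if baseline else 0.0)
              for doi, count in cohort.items()}

for doi, score in normalized.items():
    print(f"{doi}: {score:.2f}")
```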
Metrics providers themselves also introduce challenges, by way of their product design choices and how they represent different versions of a single research output like a dataset. For example, Figshare, Altmetric, and PlumX Metrics all report metrics for various versions of a dataset in a single item record, with no differentiation made between attention received by different versions of the dataset. These design choices make it difficult to tease out the reuse and attention for a dataset at a particular point in time.
Data is often shared with the express purpose of allowing others to reuse and repurpose the data to fuel new discoveries. This purposeful atomization of research objects makes it difficult to track the cumulative impact of a single dataset (
These challenges are due to a number of technical barriers. In our current research environment, data citation standards vary from field to field (Table
Examples of data sharing norms, by discipline.
Discipline | Shares data? | Explanation |
---|---|---|
Astronomy | Usually | “The fact that astronomical data from large surveys are publicly available is remarkable, but by no means surprising. Astronomers collect data about the Universe, and thus, they may feel a moral obligation to share collected data openly.” ( |
Political Science | Sometimes | Concerns over sharing personally identifiable or sensitive information, particularly for qualitative data (e.g. interviews) ( |
Medical Sciences | Sometimes | HIPAA protections of patient data ( |
Though many data providers share metrics and indicators that can be used to track the impacts of data, these providers often rely upon data sources that are not directly comparable, or collect metrics from the same sources in different ways.
Citation counts can vary between providers, due to differences in where the services track for citations. The Data Citation Index offers a comprehensive view on citations to research data and data publications within the peer-reviewed research literature, sourced from Web of Science, the Chinese Science Citation Database, and other Clarivate Analytics-owned citation databases, in addition to data from partners like SciELO (
Similarly, altmetrics services treat research data differently from one another, and often index different attention sources. Altmetric treats datasets as they do any other research object, indexing files that are assigned persistent identifiers and shared in one of the 17 sources they track. Plum Analytics also takes this approach, and supplements it by providing data-specific metrics from data sharing platforms. Comparative analyses have shown differences in altmetrics between these two providers (
Moreover, metrics services sometimes collect data from the same sources in different ways. For example, two providers may have different standards for what “counts” as a citation to a dataset, with one tracking any link in a document as a citation, and another only counting a citation if it appears in the references list of a peer-reviewed journal article. Similarly, altmetrics services may track a source like Facebook very differently: for the sake of “auditability,” Altmetric tracks only links to research that appear in posts on public Facebook pages, while Plum Analytics counts Facebook likes, comments, and shares across the entire platform, including for private posts (
Research data is shared across hundreds of repositories worldwide, each capturing and reporting metrics with varying degrees of transparency, granularity, and completeness. Reporting usage statistics like downloads to DataCite is voluntary, and not all repositories choose to do so.
Overall, the prevalence of platform-specific metrics and the dearth of reporting to and tracking by centralized metrics providers like DataCite and Plum Analytics make it difficult to accurately benchmark or create normalized metrics for dataset usage.
Despite these challenges, there are a number of research data indicators that evaluators can use to support their assessment of the quality and impact of datasets. These indicators can be categorized into five main classes: quality indicators, citation-based indicators, altmetrics, usage statistics, and reuse indicators. Here, we will describe each class of indicator, paying special attention to their strengths and limitations for evaluation scenarios.
A dataset’s quality can be understood as its accuracy, currency, completeness, and consistency (
Quality dimensions typically have associated measures, from which indicators of a dataset’s quality can be derived. For example, Fox
Data quality dimensions and indicators, adapted from
Dimensions | Possible Measures |
---|---|
Accuracy | Syntactic accuracy = number of correct values / total number of values
Currency | Currency = time at which data are stored in the system - time at which data are updated in the real world
Completeness | Completeness = number of not-null values / total number of values
Consistency | Consistency = number of consistent values / total number of values
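As an illustration only, the sketch below computes two of the measures from the table above (completeness and syntactic accuracy) for a toy set of records; the column names and validity rules are assumptions for the example, not part of any repository's actual checks.

```python
from datetime import datetime

# Toy records with deliberate problems: a missing value, a negative count,
# and a malformed date. Column names and validity rules are illustrative only.
records = [
    {"species": "Parus major",    "count": 12, "observed": "2020-03-01"},
    {"species": None,             "count": -4, "observed": "2020-03-02"},
    {"species": "Sitta europaea", "count": 7,  "observed": "not a date"},
]

def is_valid(record):
    """Syntactic accuracy rule (applied per record): count must be a
    non-negative integer and observed must parse as an ISO date."""
    try:
        datetime.fromisoformat(record["observed"])
    except (TypeError, ValueError):
        return False
    return isinstance(record["count"], int) and record["count"] >= 0

total_values = sum(len(r) for r in records)
non_null_values = sum(1 for r in records for v in r.values() if v is not None)

completeness = non_null_values / total_values                 # not-null values / total values
accuracy = sum(is_valid(r) for r in records) / len(records)   # valid records / total records

print(f"completeness: {completeness:.2f}, syntactic accuracy: {accuracy:.2f}")
```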
Pipino et al. (
Indicators can be helpful in statistical monitoring for data quality, especially for clinical trials and other disciplines that produce highly structured data (
However, it is unclear to what extent automated statistical monitoring is helpful in fields that produce unstructured and semi-structured data. In such disciplines, manual data quality checks are an important safeguard against fraud. Some repositories like ICPSR perform manual data quality checks as part of the data deposit process;
It is important to note that the context of a dataset can shape its data quality dimensions and related indicators. For example, open geospatial data quality has distinct dimensions regarding openness, mapping capabilities, and other features that factor into concepts of a dataset’s quality; these dimensions would not necessarily appear in other kinds of datasets like polling data (
Dataset quality measurement is an ongoing, iterative endeavor. As Pipino et al. (2009) write, “assessing data quality is an ongoing effort that requires awareness of the fundamental principles underlying the development of subjective and objective data quality metrics.”
Citations are used widely to understand the scholarly impact of research publications like journal articles, monographs, and edited volumes, especially in the sciences (
Citations to research data can be interpreted in a number of ways:
A signal for
An indicator of
A measure of
For a complete overview of the many possible meanings of data citations, consult Silvello’s “Theory and practice of data citation” (
Data citations function similarly to traditional citations in academic texts, providing a way to acknowledge intellectual debt to scholarly forebears. They also allow scientists to connect research articles to the data they are based upon and give credit to others who have shared their data openly (
Datasets can be cited in texts much as monographs and journal articles are, typically by noting the authors and year of publication. The specific formatting of a dataset reference will differ depending upon the publisher’s preferred writing style (Table
Examples of a dataset citation, by writing style.
Writing style | Reference |
---|---|
APA (6th Edition) | Powers, J. et al. (2020). |
Chicago (16th Edition) | Powers, Jennifer et al. 2020. |
DataCite | Powers, Jennifer et al. (2020), A catastrophic tropical drought kills hydraulically vulnerable tree species, v4, Dryad, Dataset, |
The Data Citation Index (Clarivate Analytics) and DataCite are two of the best-known providers of data citation indicators. The Data Citation Index (DCI) has indexed more than 2.6 million records, tracking over 400,000 citations to datasets, data papers, and data repositories (
DataCite is a content registration service, and its Event Data API provides citation indicators for research outputs that link to DataCite records (
Citation-based indicators are well-suited for evaluation scenarios for a few reasons. The practice of citing—whereby an author references publications or other resources that have influenced their own study—is a concept that is understood and valued by many within academia, and thus is relatively easy for evaluators to interpret.
Given their legibility, citations can also help evaluators and decision-makers better understand the value of “Open” research practices like sharing data for others to reuse (
However, data citations should be interpreted carefully, due to the lack of coverage across disciplines (
Technical barriers also exist to the widespread use of data citation in evaluation. The proliferation of scholarly identifiers is an ongoing challenge that can make it difficult to reference datasets in a consistent and stable manner (
With these strengths and limitations in mind, evaluators and policymakers who are highly attuned to disciplinary data citation practices can potentially use data citations to understand the impacts of the research they are evaluating.
Altmetrics are data that highlight how research has been shared, discussed, or otherwise engaged with online. They are collected when a research output is linked to or mentioned in a source that altmetrics aggregators track. In recent years, altmetrics have increasingly been suggested as a means of understanding the broader impacts of research data. However, these highly heterogeneous data can mean many different things, and should be interpreted carefully, usually in tandem with other indicators (
For the purposes of this article, we define altmetrics as links to research from online platforms and documents that add commentary and value when research is shared, e.g. public policy, social media, peer reviews, patents, and software sharing platforms. Sugimoto
Altmetrics are distinct from usage statistics and other webometrics like referral links, and from citation-based indicators (
Altmetrics are typically interpreted as a proxy for:
In rare cases,
Altmetrics are often praised for their ability to reflect engagement with research on a much faster timescale than citations can—it can take hours for the first mentions of a dataset to be tracked by an altmetrics aggregator, while it typically takes months for citations to appear in the peer-reviewed literature. For example, altmetrics for coronavirus-related research increased rapidly starting in mid-February 2020, shortly after the virus became a global public health concern (Figure
Online attention for research outputs with “covid-19” in the title or abstract (n = 27,824), January 2020-July 2020. A majority of online attention originates from Twitter. (data source: Altmetric Explorer).
Altmetrics can also highlight engagement with research from a broad set of stakeholders, including members of the public, science communicators, policymakers, educators, and researchers.
However, much of what is known about altmetrics and their meanings, stakeholders, and time-bound usage comes from research that examines engagement with journal articles. Far less is known about the communities that engage with research data on the web, beyond overall rates of engagement for cited data (
Altmetric and PlumX Metrics are two of the best-known services that track altmetrics for research data. These services incorporate different data sources.
Altmetric indexes more than 30,000 datasets sourced from repositories and data journals like Figshare, Dryad, and Gigascience. Altmetric treats research data similarly to all other research outputs: it tracks links to datasets that are mentioned in any of the 17 sources that the company tracks, and does not track data-specific indicators. Though Altmetric indexes research of all disciplines, it has higher rates of coverage overall in the sciences, and primarily indexes multidisciplinary and science-specific data repositories. Of Altmetric-indexed content labeled as a “data set”, around 92.9% of attention occurs on Twitter, followed by the now-defunct Google+ platform (2.1%) and blogs (2%). Around 41% of dataset tweets (n = 65,909) are original (i.e. not retweets).
PlumX Metrics has collected metrics for over 450,000 datasets sourced from repositories like Dryad and Figshare, as well as data sets indexed in various institutional repositories such as Digital Commons and DSpace (
Independent studies have found that PlumX Metrics has indexed 16% of datasets shared on Zenodo (
The company recently shared that it will soon index over 13 million datasets from more than 1,700 repositories that are shared in Mendeley Data Search (
Though altmetrics are promising, there are limitations to their use in research data evaluation scenarios. The largest challenge is that altmetrics are not yet widely used in research assessment, and as such literacy surrounding their interpretation and responsible use is low (
The relative ease with which online attention for research can be fabricated or “gamed” is another concern often expressed by evaluators and researchers alike (
For example, from 2010 to 2016 the Twitter account @datadryadnew automatically tweeted all new datasets added to the Dryad repository, accounting for over 7,200 mentions of research over six years. Adie (
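For evaluators who want to audit altmetrics for this kind of automated attention, a minimal sketch of the idea follows; the tweet records and the list of bot accounts (other than the @datadryadnew example above) are hypothetical, and a real audit would work from an aggregator's exported mention data.

```python
# Hypothetical tweet records, as might be exported from an altmetrics
# aggregator. Only @datadryadnew is taken from the example above; the other
# accounts are invented for illustration.
tweets = [
    {"account": "datadryadnew",   "is_retweet": False},
    {"account": "some_ecologist", "is_retweet": False},
    {"account": "science_fan_42", "is_retweet": True},
]

KNOWN_AUTOMATED_ACCOUNTS = {"datadryadnew"}  # extend with other repository bots

def is_organic(tweet):
    """Treat a mention as 'organic' if it does not come from a known bot account."""
    return tweet["account"] not in KNOWN_AUTOMATED_ACCOUNTS

organic_mentions = [t for t in tweets if is_organic(t)]
original_mentions = [t for t in organic_mentions if not t["is_retweet"]]

print(f"raw mentions: {len(tweets)}")
print(f"excluding known bots: {len(organic_mentions)}")
print(f"original (non-retweet) mentions: {len(original_mentions)}")
```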
Altmetrics aggregators’ data coverage is affected by their organizational contexts. PlumX Metrics, owned by Elsevier (and by EBSCO before that), tracks mentions from both open and proprietary data sources like SSRN, bepress, and EBSCO, and has exclusive access to the latter. Figshare, a sister company to Altmetric under the Digital Science banner, and the Figshare-hosted repository ChemRxiv account for more than twice as many datasets in Altmetric as independent repositories like Dryad.
Altmetrics aggregators cannot yet offer useful disciplinary benchmarking for research data performance, due to the limited number of repositories they index. While the Data Citation Index includes hundreds of data repositories by design, Altmetric and PlumX index a much smaller number of repositories. Moreover, while research data is often still shared as supplementary files accompanying journal articles, no major altmetrics service currently records altmetrics for these files (with the exception of those files hosted by Figshare, in partnership with publishers).
Evaluators interested in understanding the social impacts of research data would be well-served by altmetrics. However, any assessment programme that incorporates these indicators should plan to develop an adaptive and iterative evaluation strategy that can address the above-mentioned caveats, because the use of research data altmetrics is still relatively new.
Data usage is “counted as the accesses of a dataset or its associated metadata” (
Typically, usage statistics are interpreted as:
A proxy for
In some cases, an indicator for data’s
Usage statistics are often reported through a standard called COUNTER, which ensures consistency in how these data are tracked and reported (
Usage statistics are often reported within data repositories, at the item record level. The extent to which these usage statistics are COUNTER-compliant varies, and is dependent upon the data repository.
DataCite is the largest centralized source for usage statistics for data repositories, indexing 6.7 million datasets from over 1,900 repositories worldwide. However, not all repositories register their data with DataCite, and those that do may not share usage reports with DataCite. DataCite usage statistics are freely available via the DataCite Search and DataCite’s Event Data API. As of February 2020, DataCite reports over 17.6 million views of dataset documentation, and 2.4 million downloads of the datasets themselves.
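For readers who prefer to query these figures programmatically, the sketch below retrieves view, download, and citation counts for a single DOI from DataCite's REST API; the attribute names reflect our reading of DataCite's public documentation, the DOI is a placeholder, and the counts may be absent for repositories that do not report usage to DataCite.

```python
# A minimal sketch of retrieving usage counts for one dataset DOI from the
# DataCite REST API. The viewCount/downloadCount/citationCount attribute names
# are based on DataCite's public API and may be missing for some repositories.
import requests

def datacite_usage(doi: str) -> dict:
    response = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=30)
    response.raise_for_status()
    attributes = response.json()["data"]["attributes"]
    return {
        "views": attributes.get("viewCount"),
        "downloads": attributes.get("downloadCount"),
        "citations": attributes.get("citationCount"),
    }

if __name__ == "__main__":
    # Hypothetical DOI used purely for illustration.
    print(datacite_usage("10.5061/dryad.example"))
```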
Usage statistics are generally thought to help end users understand the overall interest or readership in research (
However, usage statistics are limited in their ability to help evaluators understand who is accessing content, or how the downloaded content is being used (
Perhaps the biggest challenge for evaluators interested in using usage statistics to understand research impact is that benchmarking is not yet widely used to make comparisons between datasets or across disciplines.
Overall, usage statistics can provide a good indicator of absolute reach and interest in a dataset, but they are of limited use for making comparisons between datasets or for understanding how research data is being used.
Data citations are the most commonly accepted metric for understanding scholarly data reuse, though usage statistics and altmetrics have also been suggested as solutions (
Reuse metrics can help evaluators understand:
The
The Data Citation Index is the most comprehensive and user-friendly source of citation information for research data. Using the DCI, evaluators can retrieve citation statistics for datasets and data papers that are cited by other research teams.
Citations to data can also be found in abstracting and indexing services like Dimensions, which reports over 176,000 citations to more than 22,000 Figshare-hosted datasets as of March 2020.
DataCite’s Event Data API is a source for other reuse metrics that more technologically adept users might consider. As of February 2020, the API reports over 107,000 “events” where a dataset has been identified as the source or derivative of another dataset. However, the technical mastery required to retrieve data via the API would be a barrier for the average evaluator.
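To illustrate the kind of query involved, here is a rough sketch of counting those “source” and “derivative” events for a single dataset; the endpoint, parameter names, and relation-type values are assumptions based on DataCite's public Event Data documentation and should be verified before use, and the DOI is a placeholder.

```python
# A rough sketch of counting Event Data "events" in which a dataset appears as
# the source or derivative of another object. Parameter names ("doi",
# "relation-type-id") and relation-type values are assumptions drawn from
# DataCite's public documentation; verify them against the live API.
import requests

EVENTS_URL = "https://api.datacite.org/events"

def count_events(doi: str, relation_type: str) -> int:
    params = {"doi": doi, "relation-type-id": relation_type, "page[size]": 1}
    response = requests.get(EVENTS_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["meta"]["total"]

if __name__ == "__main__":
    example_doi = "10.5061/dryad.example"  # hypothetical DOI
    for relation in ("is-source-of", "is-derived-from"):
        print(relation, count_events(example_doi, relation))
```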
Certain kinds of platform-specific altmetrics may be useful reuse indicators. Software sharing site GitHub includes metrics that allow users to measure educational reuse and the diversity of contexts in which data may be reused. The platform’s native “fork” feature (where users can copy software and data for their own reuse and adaptation) provides reuse metrics. Moreover, reporting for a project’s “contributors” can potentially be traced to find who has reused the data (as users who suggest revisions to code and data can be credited as project contributors on the platform).
While the vast majority of content shared on GitHub is software and not necessarily research-related, there are high-profile examples of research data and related software being shared and reused via GitHub. For example, coronavirus data shared on GitHub
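Where a dataset is hosted on GitHub, these reuse-related signals can be gathered from GitHub's public REST API, as in the minimal sketch below; the repository owner and name are placeholders rather than any specific project mentioned in this article.

```python
# A minimal sketch of pulling reuse-related signals (forks and contributors)
# for a data repository hosted on GitHub, using GitHub's public REST API.
# The owner/repository names below are placeholders for illustration.
import requests

API = "https://api.github.com"

def reuse_signals(owner: str, repo: str) -> dict:
    repo_info = requests.get(f"{API}/repos/{owner}/{repo}", timeout=30)
    repo_info.raise_for_status()
    contributors = requests.get(f"{API}/repos/{owner}/{repo}/contributors", timeout=30)
    contributors.raise_for_status()
    return {
        "forks": repo_info.json()["forks_count"],
        "contributors": [c["login"] for c in contributors.json()],
    }

if __name__ == "__main__":
    print(reuse_signals("example-org", "example-dataset"))  # placeholder names
```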
Perhaps the biggest limitation to the available reuse metrics is the uncertainty about whether these data actually point to substantive data reuse, rather than mere interest in the data or perfunctory citations for related (but not dependent) research. Though the indicators suggested above are more precise proxies for reuse, they are not direct measurements of reuse, nor of the value of reuse. For example, it is impossible to tell by looking at the numbers whether a “forked” coronavirus dataset on GitHub has resulted in breakthrough treatments for the illness—that is, whether the reuse of the data has resulted in so-called “real world” impacts.
This challenge is not limited to data reuse indicators; it is shared by all research indicators, including citation-based indicators. Indeed, the grand challenge of research evaluation is that all research indicators are mere proxies for impact, and not direct measures (though the terms “metrics”, “measures”, and “indicators” are often used interchangeably).
Despite the challenges described above, evaluators can use research indicators to better understand the impact and quality of research data. Expert advice suggests that indicators be used to supplement, not supplant, expert peer review (
The use of quality and impact indicators in research data evaluation scenarios does not appear to be widespread. High-profile, regional initiatives such as the EU Open Science Monitor
Indeed, this is a common theme in research data and evaluation policy among funding agencies, scholarly societies, and universities: while data is increasingly acknowledged as a building block of high-quality research, data-related measures are typically concerned with the “openness” or availability of the data, rather than with its quality, impact, or reuse.
Though precedents are lacking for indicator usage, peer review guidelines (such as those summarized by Mayernik et al. (
The Leiden Manifesto (
In cases where data is the primary focus of an evaluation, the data should be evaluated by subject area experts who are equipped to assess the data’s quality and impact on its own merits. Indicators can play a part in assessment activities but should not take the place of expert review.
The objectives of your organization should guide the indicators you use when assessing data’s research impact, and not the other way around. Too often, evaluations are subject to the “streetlight effect” (
When comparing data from multiple disciplines, keep in mind that each field has its own data citation and data sharing norms, and as such average citation rates may vary between fields. When setting assessment practices, it is important to document these differences in data citation and sharing norms and provide guidance for evaluators on how to compare and interpret the data.
The technologies used to share and cite data are changing rapidly, and with these new technologies may come new indicators. Evaluators should periodically survey the data sharing, data citation, and altmetrics landscapes to determine if new indicators are useful to their own assessment practices. If so, data assessment guidelines should be updated regularly to reflect best practices for using these new indicators.
The Leiden Manifesto primarily advocates that metrics providers be transparent in their data collection and aggregation processes. Those who use data metrics to explain their data’s impact likewise need to choose indicators that are auditable and transparent. Evaluators can help researchers by writing guidelines that clearly require metric auditability and transparency.
For example, in reporting scenarios, self-collected data should be clearly explained and shared: What date did you collect these metrics? What databases did you use to collect the metrics? If you are reporting specialized or opaque indicators (e.g. the Altmetric Attention Score), have you linked to documentation that clearly explains how the indicator is calculated?
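One lightweight way to keep those provenance details alongside a self-collected metric is sketched below; the field names and values are illustrative, not a formal reporting standard.

```python
# One way to record the provenance details suggested above for any
# self-collected metric. Field names and values are illustrative only.
from datetime import date

metric_report = {
    "indicator": "dataset downloads",
    "value": 1243,                           # hypothetical figure
    "collected_on": date.today().isoformat(),
    "source": "repository usage dashboard",  # which database or service was queried
    "query": "DOI:10.5061/dryad.example",    # hypothetical query
    "methodology_url": "https://example.org/how-this-metric-is-calculated",
}

for field, value in metric_report.items():
    print(f"{field}: {value}")
```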
In addition to the Leiden Manifesto’s guidelines, a final important heuristic is to know the strengths and limitations of your data sources. As explained above, seemingly similar data sources (e.g. altmetrics aggregators, citation databases, etc) may collect data differently from the same source, aggregate the data according to in-house preferences, or may incorporate entirely different data sources altogether. Additionally, platform specificities like user interface design, platform searchability, or search engine optimization may affect download rates, citation practices, and so on.
In this article, we discussed how and why evaluators can use indicators to better understand the quality and impacts of research data. We provided a critical overview of the known scientometric research on research data impact indicators, including quality metrics, altmetrics, citations, usage statistics, and reuse indicators.
Current data assessment practices are primarily concerned with the openness of data, rather than with measuring quality or impact. This is an area with few precedents. We suggest that the Leiden Manifesto, originally developed for evaluation practices based on publications and their citations, may guide evaluators who seek to develop in-house heuristics for evaluating research data.
Ultimately, research data can be assessed using research indicators, like any scholarly work. Those considering adopting indicators to understand the quality and impact of research data should start from a place of expert peer review, using indicators to supplement their interpretation of the importance of a work.
A list of repositories that offer data quality checks can be found on
For more information on Altmetric’s data sources, see
For more information on PlumX Metrics’ data sources, see
Statistics retrieved from the DataCite Event Data API
Statistics retrieved from app.dimensions.ai <
The author wishes to thank Christine Burgess, Dan Valen, Mike Taylor, and Stephanie Faulkner for their input on this manuscript’s early drafts. The author also thanks the anonymous reviewers for their helpful feedback.
The author is employed by Altmetric.
SK is solely responsible for the contents of this article.