Robyn Price examines the data and methods of the PLOS Open Science Indicators dataset to reflect on the questions that institutions might need to ask themselves before looking to expand open research metrics
As research organisations, funders and publishers pursue open research agendas amidst a broader ‘datafication’ of the sector, interest is likely only to grow in metrics that go beyond open access status or compliance and into more complex understandings of the relationships between different versions and different output types. From aggregate country or institution level all the way down to individual authors, open research metrics will be wanted by many organisations trying to understand, measure and incentivise open research.
The PLOS Open Science Indicators is a free dataset created by PLOS using a combination of techniques to detect connections between published articles and preprints, shared datasets and shared code. It uses the Crossref and DataCite APIs to determine similarities that suggest a preprint is related to a published article. The PLOS method also uses a Natural Language Processing (NLP) model by DataSeer (available open source) to identify whether data or code was generated by the research, and searches the article to indicate the shared dataset or code location. Some of these methods have been used before by PLOS or by reproducibility studies, but this is possibly the first time they have been combined and applied at this scale. The dataset released in December 2022 is not intended to be comprehensive: it captures articles from PLOS journals (around 60,000 articles) and a smaller comparison group of articles from non-PLOS journals (6,000 articles) published between January 2019 and June 2022 (I understand future data releases are expected).
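The persistent-identifier part of this matching can be illustrated with a short sketch. Crossref works records can carry a `relation` field asserting preprint links; the function below extracts them from a parsed record. This is only a hedged illustration of one ingredient of the PLOS method (it omits the title/author similarity step), and the sample record and DOIs are invented for demonstration.

```python
# Hedged sketch: extract preprint links from a Crossref works record.
# The PLOS pipeline combines relations like these with title/author
# similarity; this shows only the persistent-identifier ingredient.

def preprint_dois(message: dict) -> list[str]:
    """Return DOIs that the record asserts are preprints of this article."""
    relations = message.get("relation", {})
    return [
        rel["id"]
        for rel in relations.get("has-preprint", [])
        if rel.get("id-type") == "doi"
    ]

# Illustrative record shaped like an api.crossref.org/works/{doi} response;
# both DOIs below are hypothetical.
sample = {
    "DOI": "10.1371/journal.pone.0000001",
    "relation": {
        "has-preprint": [
            {"id-type": "doi",
             "id": "10.1101/2021.01.01.000001",
             "asserted-by": "object"}
        ]
    },
}

print(preprint_dois(sample))  # ['10.1101/2021.01.01.000001']
```

A record with no `relation` field simply yields an empty list, which is where similarity-based matching would have to take over.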
To examine their method and try to understand the potential it might have for bibliometrics and open research at institutions, I downloaded and constructed a sample of outputs produced by Imperial College London authors from 2019 to 2021. This sample represents only 1.7% of total Imperial articles published in the period (605 articles compared to 36,102 articles found on Scopus). The set is biased, as 89% of the articles come from PLOS journals and the entire set is exclusively from open access journals. Because of these limitations, the findings should not be interpreted as representative of open research behaviour at Imperial, but rather as starting points for reflection on how open research indicators of the near future might be constructed and the implications this might have for institutions preparing for them.
Around 80% of the papers in the Imperial set were identified as associated with a publicly shared dataset. The PLOS method uses NLP to search supplementary files and the article text for data shared in an online repository. The 80% shared-data finding makes sense for two reasons – firstly, most Imperial authors are subject to funder data sharing policies, and secondly, we know that the majority of the papers are from PLOS journals, which have comprehensive data sharing policies.
The PLOS fields of ‘Data Generated’ and ‘Data Shared’, shown in Figure 2, are really useful – they could be applied to distinguish between creation of data, sharing of data, and, from the NLP analysis of the methods section, re-use of existing data. This hints at a technical possibility of creating indicators that can intelligently identify when a paper has created data and shared it or not, or re-used existing data and linked to it, and when a paper did not interact with data at all. Using a combination of these fields would make a ‘Data Shared’ indicator more meaningful. However, can a binary ‘Yes’ or ‘No’ response to a Data Sharing indicator ever really be meaningful? Hypothetically, a paper that has engaged as fully as possible with the FAIR principles, and another that has linked to a file or repository that is technically ‘shared’ but where access to the data is still by request only, would both receive a ‘Yes’ on a ‘Data Shared’ indicator. It is important that the Findability or Accessibility dimensions of the FAIR data principles that this metric might represent do not become misconstrued as a proxy for the impact or quality of the data. If an institution can determine its own values and intentions around data sharing behaviour, and what it intends to do with this data, it will be easier to understand whether this kind of indicator is useful or not.
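The combination of fields described above can be sketched as a simple derived category. This is a hedged illustration: the column names follow the post's description of the dataset, but the exact values and category labels are my own assumptions, not the PLOS schema.

```python
import pandas as pd

# Hedged sketch: combine 'Data Generated' and 'Data Shared' into one
# richer category. Column names and 'Yes'/'No' values are assumptions
# based on the dataset description in the text, not the actual schema.

def data_category(row) -> str:
    generated = row["Data Generated"] == "Yes"
    shared = row["Data Shared"] == "Yes"
    if generated and shared:
        return "generated and shared"
    if generated:
        return "generated but not shared"
    if shared:
        return "shared/re-used, not generated"
    return "no data detected"

# Four illustrative rows covering each combination
df = pd.DataFrame({
    "Data Generated": ["Yes", "Yes", "No", "No"],
    "Data Shared":    ["Yes", "No", "Yes", "No"],
})
df["data_category"] = df.apply(data_category, axis=1)
print(df["data_category"].tolist())
```

Note that even this richer category still collapses the FAIR spectrum into a handful of labels; a repository link behind request-only access and a fully FAIR deposit would land in the same bucket.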
The PLOS method of searching for author and title similarities between preprints and peer-reviewed articles could produce more comprehensive identification of connected outputs than commonly used methods that rely on persistent identifiers such as DOIs to link two outputs. 37% of the Imperial papers in this set were identified as having an associated preprint, with, as demonstrated in Figure 3, an average of 241 days between preprint sharing and publication. Although the difference between preprints and peer-reviewed research must be acknowledged, particularly the value of peer review to research integrity, with access to high-quality and large-scale data that connects different output types, institutions stand to better understand diverse routes of research dissemination and the variety of output types and versions, rather than valuing only the published version of record.
However, defining what an institution might or should do with this data is difficult because, unlike data sharing, there are fewer institutional and funder preprint policies to guide how these indicators should be constructed or applied. Seeing preprint-to-article data available at this scale prompts me to think that in the near future a distinction between preprints that have ‘resulted’ in a peer-reviewed published article versus preprints that have not, or journal articles that ‘originated’ in a preprint versus those that did not, might not be uncommon. Maybe this mindset already exists amongst journal publishers, but in the university sector this is not a metric that I have encountered. Will bibliometric analytics tools start providing institutions with a ‘preprint conversion to article’ rate or ‘speed’ metric? How will this be treated by institutions trying to quantify or incentivise open research-aligned behaviour? Respect for disciplinary norms as well as author freedom must be retained. This dataset, as well as my own interpretation of it, supposes linear relationships between singular preprints, data, and code and the creation of published articles. In trying to find data to express open research behaviour, this still centres published articles as the primary unit through which we understand and organise a complexity of multiple versions of connected but independent output types.
Identification of whether code was generated and/or shared is constructed by applying the DataSeer NLP model to the paper. This is potentially very interesting, as institutions like Imperial have limited existing sources of information on the existence, location and connection of code to papers, because we generally have to rely on the authors or creators of code to engage with depositing information about, or archiving versions of, code in repositories. Practices such as citing papers that describe software, rather than citing the software output itself, are both a cause and an outcome of a broader lack of recognition for the contribution that software engineering makes to so many research projects.
Using the PLOS dataset I found that 22% of Imperial papers were associated with publicly shared code outputs, and of those the majority were shared online in repositories. Using the distinction between the ‘Code Generated’ and ‘Code Shared’ fields we can see that, in context, only 394 of the 605 papers in the set generated or used code, so this could also be represented as 34% of code-using papers sharing their code. The same need for institutions to be precise about what we want or expect these terms to mean exists for code as discussed above for data and preprints. Is a study that wrote code to process data equivalent to a study that uses a common code environment or package to carry out the study, or the same as a study that has created novel software that will be re-used by others? If ‘Code Shared’ metrics are used by organisations, likely with the positive intention of measuring and incentivising reproducibility, how will we protect author discretion over what does not need to be shared, such as toy or test scripts from development that could reasonably be retained privately with no harm to research integrity?
I think it is important for the sector to experiment, share and discuss openly, exactly as this PLOS initiative has done, with a priority on applying what we are learning from responsible metrics so as to avoid simply replicating traditional metrics methods and mindsets in open research. The PLOS Open Science Indicators are not without limitations, but they seem to me a new attempt at methods for defining relationships between research outputs that might be scalable or interoperable with research organisations' data needs. However, an understanding of complex and diverse research processes, alongside our own organisational intentions and values, must guide the construction of any future open research metrics. The availability of this dataset and the open source scripts is an invitation to institutions interested in investigating what this type of data might mean for our own open research environments and goals, as well as a prompt to look for other adaptable methods.
Note on data compilation
How I compiled the data for Imperial College London outputs and added a count of days between preprint and published article
- Download PLOS-Dataset_v1_Dec22.csv and Comparator-Dataset_v1_Dec22.csv files from https://plos.figshare.com/articles/dataset/PLOS_Open_Science_Indicators/21687686 and combine records into one file
- Download records of all journal article outputs published by Imperial College London in 2019-2021 from bibliographic database such as Scopus
- Create a column to indicate Imperial-authored outputs by matching on DOI. I used VLOOKUP in Excel
- Create a count of number of days between ‘Publication Date’ column and ‘Preprint Date’ column. I used DATEDIFF in Power BI
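The steps above can also be sketched in pandas rather than Excel and Power BI. This is a hedged equivalent, not the method I used: the PLOS file names are from the list above, but the Scopus export structure and the small inline frames standing in for the downloaded CSVs are illustrative assumptions.

```python
import pandas as pd

# Hedged pandas equivalent of the compilation steps. Column names
# ("DOI", "Publication Date", "Preprint Date") follow the post; the
# Scopus export structure is an assumption.

def compile_sample(plos: pd.DataFrame, comparator: pd.DataFrame,
                   scopus: pd.DataFrame) -> pd.DataFrame:
    """Combine the PLOS files, flag institution-authored outputs by DOI,
    and add a days-between-preprint-and-article column."""
    combined = pd.concat([plos, comparator], ignore_index=True)
    # Case-insensitive DOI match replaces the Excel VLOOKUP step
    combined["imperial"] = combined["DOI"].str.lower().isin(
        scopus["DOI"].str.lower())
    # Date difference replaces the Power BI DATEDIFF step
    combined["days_preprint_to_article"] = (
        pd.to_datetime(combined["Publication Date"])
        - pd.to_datetime(combined["Preprint Date"])
    ).dt.days
    return combined[combined["imperial"]]

# Tiny illustrative frames standing in for the downloaded CSVs
plos_df = pd.DataFrame({"DOI": ["10.1371/a"],
                        "Publication Date": ["2021-06-01"],
                        "Preprint Date": ["2020-10-01"]})
comp_df = pd.DataFrame({"DOI": ["10.1000/b"],
                        "Publication Date": ["2021-01-01"],
                        "Preprint Date": ["2020-12-01"]})
scopus_df = pd.DataFrame({"DOI": ["10.1371/a"]})

sample = compile_sample(plos_df, comp_df, scopus_df)
print(sample["days_preprint_to_article"].iloc[0])  # 243
```

In practice the two PLOS CSVs and the Scopus export would be read with `pd.read_csv` from the files listed in the steps above.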
The PLOS datasets are available here and my extracted Imperial dataset is available here.
Thanks to Wayne Peters and Jeremy Cohen (Imperial College London), Alice Gibson (JISC) and Iain Hrynaszkiewicz (PLOS) for providing useful discussion whilst writing and/or review of the text.
 See ‘Note on data compilation’ for description of how I compiled the data for Imperial College London outputs
 Spot checking the PLOS data I identified potential errors in some of the classifications of preprint related to paper. In these cases, the outputs were similar in title and author composition but reading the contents suggests they were different works. This mislabelling generally occurred with the outputs that have a negative days-between-preprint-and-article-publication count, and so the true average days-between count is likely different from that calculated here
 See ‘Note on data compilation’ for description of how I added a count of days between preprint and published article
Robyn Price is the Bibliometrics Manager at Imperial College London where she established a responsible bibliometric analysis and education service. Previously, Robyn worked in the editorial teams of open access and subscription journals.
Unless it states otherwise, the content of the Bibliomagician is licensed under a Creative Commons Attribution 4.0 International License.