AI-based citation evaluation tools: good, bad or ugly?

Lizzie Gadd gets all fancy talking about algorithms, machine learning and artificial intelligence. And how tools using these technologies to make evaluative judgements about publications are making her nervous.

A couple of weeks ago, The Bibliomagician posted an interesting piece by Josh Nicholson introducing scite. scite is a new Artificial Intelligence (AI) enabled tool that seeks to go beyond citation counting to citation assessment, recognising that it’s not necessarily the number of citations that is meaningful, but whether they support or dispute the paper they cite.

scite is one of a range of new citation-based discovery and evaluation tools on the market. Some, like Citation Gecko, Connected Papers and CoCites, use the citation network in creative ways to help identify papers that might not appear in your results list through simple keyword matching. They use techniques like co-citation (where two papers appear together in the same reference list) or bibliographic coupling (where two papers cite the same paper) as indicators of similarity. This enables them to provide “if you like this you might also like that” type services.
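To make those two techniques concrete, here is a minimal sketch of how co-citation and bibliographic coupling counts can be derived from a citation graph. The papers and reference lists below are invented for illustration; this is not the implementation of any of the tools named above.

```python
# Toy illustration of co-citation and bibliographic coupling counts.
# The papers and reference lists are made up; no real tool's data or API is used.

from itertools import combinations
from collections import Counter

# citing paper -> list of papers it references
references = {
    "paper_A": ["P1", "P2", "P3"],
    "paper_B": ["P2", "P3", "P4"],
    "paper_C": ["P1", "P4"],
}

# Co-citation: two papers appear together in the same reference list.
cocitation = Counter()
for refs in references.values():
    for pair in combinations(sorted(set(refs)), 2):
        cocitation[pair] += 1

# Bibliographic coupling: two citing papers share at least one reference.
coupling = Counter()
for (a, refs_a), (b, refs_b) in combinations(references.items(), 2):
    shared = set(refs_a) & set(refs_b)
    if shared:
        coupling[(a, b)] = len(shared)

print("Co-citation counts:", dict(cocitation))
print("Bibliographic coupling counts:", dict(coupling))
```

In this toy graph, paper_A and paper_B are bibliographically coupled (they share two references), and P2 and P3 are co-cited twice – the kind of similarity signal a discovery tool can use for “you might also like” recommendations.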

Other tools, like scite and Semantic Scholar, go one step further and employ technologies like Natural Language Processing (NLP), Machine Learning (ML) and Artificial Intelligence (AI) to start making judgements about the papers they index. In Semantic Scholar’s case it seeks to identify whether a paper is ‘influential’, and in scite’s case, whether citations are ‘supporting’ or ‘disputing’.

And this is where I start to twitch.

The Good

I mean, there is an obvious need to understand the nuance of the citation network more fully. The main criticism of citation-based evaluation has always been that citations are wrongly treated as always a good thing. In fact, the Citation Typing Ontology lists 43 different types of citation (including my favourite, ‘is-ridiculed-by’). That said, the fact that the vast majority are positive (fewer than 0.6% of citations are negative, by scite’s calculations) may itself indicate a skewing of the scholarly record. Why cite work you don’t rate, knowing it will lead to additional glory for that paper? So if we can use new technologies to provide more insight into the nature of citation, this is a positive thing. If it’s reliable. And this is where I have questions. And although I’ve dug into this a bit, I freely admit that some of my questions might be borne of ignorance. So feel free to use the comments box liberally to supplement my thinking.

The main criticism of citation-based evaluation has always been that citations are wrongly treated as always a good thing.

A bit about the technologies

All search engines use algorithms (sets of human-encoded instructions) to return the results that match our search terms. Some, like Google Scholar, use the citedness of papers as one element of their algorithm to sort the results in an order that may give you a better chance of finding the paper you’re looking for. And we already know that this is problematic in that it compounds the Matthew Effect: the more cited a paper is, the more likely it is to surface in your search results, thereby increasing its chances of getting read and further cited. And of course, the use of more complex citation network analysis for information discovery can contribute to the same problem: by definition, less-cited works are going to be less well-connected and thus returned less often by the algorithm.

Even their developers might not ever really understand what characteristics the AI is identifying in the data as ultimately contributing to the desired outcome.


But it’s the use of natural language processing (NLP) to ‘read’ the full text of papers, and artificial intelligence or machine learning to find patterns in the data, that concerns me more. Whereas historically humans might provide a long list of instructions to tell computers how to identify an influential paper, ML works by providing a shed load of examples of what an influential paper might look like and leaving the AI to learn for itself. When the AI gets it right, it gets rewarded (reinforcement learning) and so it goes on to achieve greater levels of accuracy and sophistication. So much so that even their developers might not ever really understand what characteristics the AI is identifying in the data as ultimately contributing to the desired outcome.
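To illustrate the ‘learning from examples’ point, here is a deliberately tiny sketch of a text classifier trained on hand-labelled citation contexts. The sentences, labels and model choice are illustrative assumptions on my part; this is not scite’s or Semantic Scholar’s actual pipeline, and real systems train on vastly larger annotated corpora.

```python
# A tiny sketch of 'learning from examples' rather than hand-written rules.
# Sentences, labels and model are invented for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labelled examples of citation contexts (the 'shed load of examples').
contexts = [
    "Our results confirm the findings of Smith et al.",
    "These data are consistent with the model proposed in [12].",
    "We were unable to replicate the effect reported by Jones.",
    "This conclusion contradicts earlier work by Lee and colleagues.",
]
labels = ["supporting", "supporting", "disputing", "disputing"]

# The model learns which word patterns distinguish the labels;
# no human writes an explicit rule such as "'contradicts' means disputing".
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(contexts, labels)

# With such tiny data the prediction is unreliable; the point is the workflow.
print(model.predict(["Our experiments support the original hypothesis."]))
```

In a toy example like this a human could easily spot that ‘contradicts’ signals a disputing citation; at the scale these tools operate, the learned patterns are far harder to inspect – which is precisely the worry.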

Can you see why I am twitching?

The (potentially) bad

Shaky foundations

The obvious problem is that the assumptions we draw from these data are inherently limited by the quality of the data themselves. We know that the literature is already hugely biased towards positive studies over null and negative results, and towards journal-based STEM over monograph-based AHSS. So the literature is, in this way, already a biased sample of the scholarship it seeks to represent.

We also know that within the scholarship the literature does represent, all scholars are not represented equally. We know that women are less well cited than men, that they self-cite less and are less well-connected. We know the scholarship of the Global South is under-represented, as is scholarship in languages other than English. And whilst a tool may be able to accurately identify positive and negative citations, it can’t (of course) assess whether those positive and negative citations were justified in the first place.

But of course these tools aren’t just indexing the metadata but the full text. So the question I have here is whether Natural Language Processing works equally well on language that isn’t ’natural’ – i.e., where it’s the second language of the author. And what about cultural differences in the language of scholarship, where religious or cultural beliefs make expressions of confidence in the results less certain, less self-aggrandising? And I’ll bet you a pound that there are disciplinary differences in the way that papers are described when being cited.

So we know that scholarship isn’t fully represented by the literature. The literature isn’t fully representative of the scholars. The scholars don’t all write in the same way. And of course, some of these tools are only based on a subset of the literature anyway.

At best, this seems unreliable, at worst, discriminatory?

Who makes the rules?

Of course, you may well argue that this is a problem we already face with bibliometrics, as recently asserted by Robyn Price.  I guess my particular challenge with some of these tools is that they go beyond simply making data and their inter-relationships available for human interpretation, to actually making explicit value judgements about those data themselves. And that’s where I think things start getting sticky because someone has to decide what that value (known as the target variable) looks like. And it’s not always clear who is doing it, and how.

If you think about it, being the one who gets to declare what an influential paper looks like, or what a disputing citation looks like, is quite a powerful position. Oh, not right now maybe, when these services are in start-up and some products are in Beta. But eventually, if they get to be used for evaluative purposes, you might end up with power over someone’s career trajectory. And what qualifies them to make these decisions? Who appointed them? Who do they answer to? Are they representative of the communities they evaluate? And what leverage do the community have over their decisions?

If you think about it, being the one who gets to declare what an influential paper looks like, or what a disputing citation looks like, is quite a powerful position.

When I queried scite’s CEO, Josh Nicholson, about all this, he confirmed that a) folks were already challenging their definitions of supporting and disputing citations; b) these challenges were currently being arbitrated by just two individuals; and c) they currently had no independent body (e.g. an ethics committee) overseeing their decision-making – although they were open to this.

And this is where I find myself unexpectedly getting anxious about the birth of the free/freemium-type services based on open citations and open text that we’ve all been calling for. Because at least if a commercial product is bad, no-one need buy it, and if you do, as a paying customer you have some* leverage. But I’m not sure the community will have the same leverage over open products, because, well, they’re free, aren’t they? You take them or leave them. And because they’re free, someone, somewhere, will take them. (Think Google Scholar.)

*Admittedly not a lot in my experience.

Are the rules right?

Of course, it’s not just who defines our target variable but how they do it that matters. What exactly are these algorithms being trained to look for when they seek out ’influential’, ‘supporting’ or ’disputing’ citations? And does the end user know that? More pertinently, does the developer know that? Because by definition, AI is trained by examples of what is being sought, rather than by human-written rules around how to find it. (There are some alarming stories about early AI-based cancer detection algorithms getting near 100% hit rates on identifying cancerous cells, before the developers realised that the AI was taking the presence of a ruler in the training images – used by doctors to measure the size of tumours – as an indicator that the image showed a cancerous cell.)

I find myself asking: if someone else developed an algorithm to make the same judgement, would it make the same judgement? And when companies like scite talk about their precision statistics (0.8, 0.85, and 0.97 for supporting, contradicting, and mentioning, respectively, if you’re interested), to what are they comparing their success rates? Because if it’s the human judgement of the developer, I’m not sure we’re any further forward.
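For reference, precision figures of this kind are normally calculated by comparing the model’s labels against a human-annotated ‘gold standard’, which is exactly why the question of whose judgement sets that standard matters. A minimal sketch, with invented labels rather than scite’s data:

```python
# Illustrative only: per-class precision measured against human-annotated labels.
# The labels below are invented; they are not scite's evaluation data.

from sklearn.metrics import precision_score

human_labels = ["supporting", "mentioning", "disputing", "mentioning", "supporting", "mentioning"]
model_labels = ["supporting", "mentioning", "supporting", "mentioning", "supporting", "disputing"]

classes = ["supporting", "disputing", "mentioning"]
precision = precision_score(human_labels, model_labels, labels=classes, average=None, zero_division=0)

for cls, p in zip(classes, precision):
    # Precision = of the citations the model labelled as this class,
    # what fraction did the human annotators also label as this class?
    print(f"{cls}: {p:.2f}")
```

If the human labels come from the developers themselves, a high precision score only tells you how well the model agrees with the developers’ own judgement.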

I also wonder whether these products are in danger of obscuring the fact that papers can be ‘influential’ in ways that are not documented by the citation network, or whether these indicators will become the sole proxy for influence – just as the Journal Impact Factor became the sole proxy for impact? And what role should developers play in highlighting this important point – especially when it’s not really in their interests to do so?

The ugly?

Who do the rules discriminate against?

The reason these algorithms need to be right, as I say, is that researcher careers are at stake. If you’ve only published one paper, and its citing papers are wrongly classified as disputing that paper, this could have a significant impact on your reputation. The reverse is true of course – if you’re lauded as a highly cited academic but all your citations dispute your work, surfacing this would be seen as a service to scholarship.

What I’m not clear on is how much of a risk the former is, and whether the risk falls disproportionately on members of particular groups. We’ve established that the scientific system is biased against participation by some groups, and that the literature is biased against representation of some groups. So, if those groups (women, AHSS scholars, the Global South, ESL authors) are under-represented in the training data that identifies what an ‘influential’ paper looks like, or what a ‘supporting’ citation looks like, it seems to me that there’s a pretty strong chance they are going to be further disenfranchised by these systems. This really matters.

Masking

I’m pretty confident that any such biases would not be deliberately introduced into these systems, but the fear, of course, is that systems which inadvertently discriminate against certain groups might be used to legitimise their deliberate discrimination. One group feeling particularly nervous at the moment, given the apparent lack of value placed on their work, is the Arts and Humanities. Citation counting tools already discriminate against these disciplines due to the lack of coverage of their outputs and the relative scarcity of citations in their fields. However, we also know that citations are more likely to be used to dispute than to support a cited work in these fields. I can imagine a scenario where an ignorant third party seeking evidence to support financial cuts to these disciplines could use the apparently high levels of disputing citations to justify their actions.

But it doesn’t stop here. In their excellent paper, Big Data’s Disparate Impact, Barocas and Selbst discuss the phenomenon of masking, where features used to define a target group (say less influential articles) also define another group with protected characteristics (e.g., sex). And of course, the scenario I envisage is a good example of this, as the Arts & Humanities are dominated by women. Discriminate against one and you discriminate against the other.

The thin end of the wedge

All this may sound a bit melodramatic at the moment. After all, these are pretty fledgling services, and what harm can they possibly do if no-one’s even heard of them? I guess my point is that the Journal Impact Factor and the h-index were also fledgling once. And if we’d taken the time as a community to think through the possible implications of those developments at the outset, then we might not be in the position we are in now, trying to extract each mention of the JIF and the h-index from the policies, practices and psyches of every living academic.

I guess my point is that the Journal Impact Factor and the h-index were also fledgling once.

Indeed, the misuse of the JIF is particularly pertinent to these cases. Because this was a ‘technology’ designed with good intentions – to help identify journals for inclusion in the Science Citation Index – just as scite and Semantic Scholar are designed to aid discovery and to analyse citation sentiment. But it was a very small step between the development of that technology and its ultimate use for evaluation purposes. We just can’t help ourselves. And we are naïve to think that just because a tool was designed for one purpose, it won’t be used for another.

This is why the INORMS SCOPE model insists that evaluation approaches ‘Probe deeply’ for unintended consequences, gaming possibilities and discriminatory effects. It’s critical. And it’s so easy to gloss over when we as evaluation ‘designers’ know that our intentions are good. I’ve heard that scite are now moving on to provide supporting and disputing citation counts for journals, which we’ll no doubt see on journal marketing materials soon. How long before these citations start getting aggregated at the level of the individual?

Of course, the other thing that AI is frequently used for, once it has been trained to accurately identify a target variable, is to predict where that variable might occur in future. Indeed, we are already starting to see this with AI-driven tools like Meta Bibliometric Intelligence and UNSILO Evaluate, which use the citation graph to predict which papers may go on to be highly cited and are therefore a good choice for a particular journal. To me, this is hugely problematic and a further example of the Matthew Effect: rewarding science that looks like existing science rather than ground-breaking work on new topics by previously unknown authors. Do AI-based discovery and evaluation tools have the potential to go the same way, predicting, based on past performance, the more influential scholars of the future?

Summary

I don’t want to be a hand-wringing naysayer, like an old horse-and-cart driver declaring the automobile the end of all that is holy. But I’m not alone in my handwringing. The big AI developer DeepMind are taking this all very seriously. A key element of their work is around Ethics & Society, including a pledge to use their technologies for good. They were one of the co-founders of the Partnership on AI initiative, which gives those involved in developing AI, along with members of the public, an open forum to discuss the potential impacts of AI and how to ensure those impacts are positive. The Edinburgh Futures Institute have identified Data & AI Ethics as a key concern and are running free short courses in Data Ethics, AI & Responsible Research & Innovation. There are also initiatives such as Explainable AI, which recognise the need for humans to understand the process and outcomes of AI developments.

I’ve no doubt that AI can do enormous good in the world, and equally in the world of information discovery and evaluation. I feel we just need to have conversations now about how we want this to pan out, fully cognisant of how it might pan out if left unsupervised. It strikes me that we might do well to develop a community-agreed, voluntary Code of Practice for working with AI and citation data. This would ensure that we get to extract all the benefits from these new technologies without finding them over-relied upon for inappropriate purposes. And whilst such services are still in their infancy, I think it might be a good time to have this conversation. What do you think?

Elizabeth Gadd is the Research Policy Manager (Publications) at Loughborough University. She is the chair of the Lis-Bibliometrics Forum and co-Champions the ARMA Research Evaluation Special Interest Group. She also chairs the INORMS International Research Evaluation Working Group.

Acknowledgements

I am grateful to Rachel Miles, Josh Nicholson and David Pride for conversations and input to this piece, and am especially thankful to Aaron Tay, who indulged me with a long and helpful exchange that made this a much improved offering.

Unless it states otherwise, the content of The Bibliomagician is licensed under a Creative Commons Attribution 4.0 International License.

11 Replies to “AI-based citation evaluation tools: good, bad or ugly?”

  1. Nice article – layering sentiment analysis on a biased representation of scholarship fraught with data quality issues is challenging to say the least and will inevitably lead to misuse and unthinking use.


  2. Hi Lizzie,

    Thanks for writing this important blog post. I indeed think we need to have this conversation, and probably also a code of practice or something similar. The questions you are raising don’t have easy answers. This clearly requires a community effort.

    I believe we also need to think carefully about how we draw conclusions about biases and inequalities. Understanding biases is very important, but the study of biases is also highly challenging from a methodological point of view, and drawing clear conclusions from the literature is far from straightforward. For instance, you say that “all scholars are not represented equally. We know that women are less well cited than men, that they self-cite less”. However, the studies that you refer to do not correct for level of experience or career stage. Therefore it is not clear whether we can really say that women are less well cited than men. This is true if we compare the entire population of female researchers with the entire population of male researchers, but it is not clear whether this still holds true if we compare populations of researchers that have a similar level of experience or that are in a similar career stage (for an attempt to compare ‘like with like’, see https://doi.org/10.1073/pnas.1914221117). In the case of self-citations, it has been shown that “self-citation is the hallmark of productive authors, of any gender” (see https://doi.org/10.1371/journal.pone.0195773). In order to have a proper discussion about biases and inequalities, in the research system in general and in AI tools in particular, I think one of the first steps we need to take is to develop a shared view on what it actually means to say something is ‘biased’. At first sight this might seem intuitively clear, but this is actually a highly complex question!

    Ludo


    1. Thanks for your helpful comments, Ludo. Both of these papers shed useful clarifying light on the issue of gender citation and self-citation. Whilst questioning whether women are less well/self-cited, it’s interesting that both seem to agree that they experience shorter, more disrupted and diverse career lengths – which I guess leads to under-representation in the citation graph being used to train AI. But I agree this is important to understand and I very much hope the domain expertise of CWTS Leiden might be represented on any kind of Code of Practice group that might be forthcoming!


  3. Thank you Lizzie for such an informative post. I have great regard for your work as you know, as well as for that of Josh whom I have been following since well before Scite!

    The questions you raise are important, and the analogy of JIF and H-Index enlightening. I have often wondered whether we are using Artificial Intelligence where real intelligence might be better suited. Let me explain…

    The author is by definition the person best placed to decide whether a reference is, say, “supported” or “disputed”. If this can be captured during authoring, then we do not need AI and ML to “guess” retrospectively. Would it not be good to have an easy-to-use semantic mark-up tool in an authoring tool, inviting (or even mandating) relevant mark-up by the author? For instance, as soon as a citation is entered, a drop-down appears, inviting the author to select from “Support, Dispute, Ridicule”, etc. This semantic information can be used in any way in future. Then there is no subsequent “reverse engineering” needed.

    If I might digress and generalize, I have always felt that the ubiquitous word processor (let’s face it, there is only one that has had a monopoly for decades!) is far too dumb for scholarly communication. Each scholarly field uses type styles (bold, italic etc) to code their information in a way that can be interpreted only by an informed human reading the text. Before publication many of these need to be manually coded into XML by typesetters. Of course there are automated tools that typesetters have built, but they have to be used carefully for similar reasons you have described. Would it not be nice if the authoring system could guess the semantics according to the field of the author and offered, say, “Math, Person, Date, Chemical formula” for convenient and painless semantic tagging by the author.


    1. Hi Kaveh,

      In 2013, I tried an approach similar to the one you’ve outlined via an initiative called SocialCite. It was adopted as a pilot by Rockefeller University Press, the Genetics Society of America, and a couple of other publishers, but failed to gain traction. The notion was largely the same from a UX perspective — drop-downs users could access to categorize citations. We didn’t trust authors to categorize them, because they are often the source of citation misdirection or errors. Our approach was to let readers — who our research showed do look hard at citations when going deep into articles — categorize the citations, and to use the pooled data to describe them. We did a lot of work, built and deployed a product, had a robust roadmap, but failed to gain traction for a few reasons mainly linked to asking users to go “off task” to do other work. We also noticed that with a human-based approach, we’d have “hot spots” — articles everyone was reading — and “cold spots” — articles hardly anybody read. That’s not going to give you complete coverage, nor is a gradually adopted author-based approach. Also, an author-based approach is prospective — it only works going forward. That’s a major limitation.

      I’m an advisor to Scite, and was excited to help because the solution we couldn’t scale in 2013 — affordable machine learning and AI — is the one that will work. It can scale quickly. It can work on papers as they’re published, and on papers that already exist (archives). It can be comprehensive and quick. Is it going to be perfect right off the bat? No. Nothing is. We know from software that generally v3.0 is where the magic happens (or, as grandma used to say, “Third time’s the charm”). I think Scite is off to a strong start, and thoughtful posts and comments like these will only help them get better faster.

      Thanks.

      Kent


  4. A semantic mark-up tool would be nice, but I am at least a bit doubtful about whether researchers would take the time to accurately assign classifications to their citations, especially if it wasn’t required. And if it was required, would a certain percentage of them just skip through it and blindly assign classifications (especially if they just weren’t sure because they didn’t write that portion of the paper)? I think it would probably be best if it was optional, and then the classifications that were received would be more accurate.

    I do have to admit that I like the idea. Someone recently came up with such a platform, called “ACT: An Annotation Platform for Citation Typing at Scale.” The platform allowed authors to classify their citations as the following: background, uses (methodology or tools), compare / contrast, motivation, extension, or future. The researchers / developers of this tool found that only first authors, in almost all cases, “remember their reasons for citing a particular paper without prompting and can therefore complete the process quickly and with confidence.” They also found that with their platform, they would be able to create a data set of annotated citations on a previously unseen scale. For reference, it is the largest collection of citations annotated according to both type and influence. http://oro.open.ac.uk/60670/1/JCDL2019_submission199_Pride_Harag_Knoth.pdf
    And their newest paper on author-led citation classification – http://oro.open.ac.uk/70520/1/Pride_Knoth___An_Authoritative_Approach_to_Citation_Classification_JCDL2020_Submission_181.pdf


  5. This is a very interesting piece that highlights the myriad ways in which organizations are deploying AI and machine learning in the scholarly communication space. One note of clarification – Meta is not “using the citation graph to predict which papers may go on to be highly cited and therefore a good choice for a particular journal”. The link in the post refers to a 2016 experiment we ran with a handful of publishers. These pilots concluded several years ago, and we have not pursued them further. We still use AI and machine learning, but this technology is deployed to quickly analyze, map and cluster tens of millions of articles and preprints so researchers can easily follow developments, intersections, or emerging trends.


  6. Hi Liz, great post, which is worth responding to in detail. But just on a detailed point, I work with UNSILO and I can assure you that UNSILO is not predicting which papers will be highly cited. We don’t do anything of the kind! We do use AI tools such as semantic concept matching to identify potential reviewers for submitted articles, and to identify the most relevant journals for a submission, but we don’t use citations or impact factor for any of this, just concepts in the article. We identify an article is about XYZ and we then find other articles about XYZ, or reviewers who have written about XYZ, or journals that have published papers about XYZ. We leave to humans all judgements about the relative worth of papers. It’s nice that people think highly of what we do, but we have never claimed to remove the human decision-maker. All we claim is to provide indicators that enable the human to make decisions in a more informed and effective way.


  7. Thank you for such a nice, informative post. Your concerns about the implications of AI for scholarship, especially for the evaluation of research, are very true. The technology, even though it has great potential, may jeopardize the careers of many if applied uncritically.


  8. Thanks for this blogpost, Lizzie. Your comments are potentially applicable across systems (of AI and of people), as biases (including blind spots) arise because of context, whether of the developer, the subject, or the society.

    A quick observation. You say that “historically humans might provide a long list of instructions to tell computers how to identify an influential paper”, in comparison to the use of ML and NLP. Earlier generations of AI were often rule-based systems, i.e. long lists of instructions, not necessarily sequenced. The challenge of building this type of system is that one needs to know and articulate all of the rules. And the rule sets would typically vary between sub-domains, perhaps only subtly. The same can be true of human systems: how does one make sure that the human assessor has the full rule set, and does not try to apply it inappropriately between sub-domains?

