Lizzie Gadd gets all fancy talking about algorithms, machine learning and artificial intelligence. And how tools using these technologies to make evaluative judgements about publications are making her nervous.
A couple of weeks ago, The Bibliomagician posted an interesting piece by Josh Nicholson introducing scite. scite is a new Artificial Intelligence (AI)-enabled tool that seeks to go beyond citation counting to citation assessment, recognising that it’s not necessarily the number of citations that is meaningful, but whether they support or dispute the paper they cite.
scite is one of a range of new citation-based discovery and evaluation tools on the market. Some, like Citation Gecko, Connected Papers and CoCites, use the citation network in creative ways to help identify papers that might not appear in your results list through simple keyword matching. They use techniques like co-citation (where two papers appear together in the same reference list) or bibliographic coupling (where two papers cite the same paper) as indicators of similarity. This enables them to provide “if you like this you might also like that” type services.
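To make that concrete, here is a minimal sketch of how co-citation and bibliographic coupling counts can be computed from a citation graph. It is not how Citation Gecko, Connected Papers or CoCites are actually implemented; the toy graph and paper identifiers below are invented purely for illustration.

```python
from collections import defaultdict

# Toy citation graph: each paper maps to the set of papers it cites.
# The paper IDs are invented for illustration only.
citations = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y"},
    "C": {"Z"},
}

def bibliographic_coupling(p1, p2, graph):
    """Count the references two papers share (they cite the same works)."""
    return len(graph.get(p1, set()) & graph.get(p2, set()))

def co_citation_counts(graph):
    """Count how often each pair of papers appears in the same reference list."""
    counts = defaultdict(int)
    for refs in graph.values():
        refs = sorted(refs)
        for i in range(len(refs)):
            for j in range(i + 1, len(refs)):
                counts[(refs[i], refs[j])] += 1
    return counts

print(bibliographic_coupling("A", "B", citations))  # 2 shared references
print(co_citation_counts(citations)[("X", "Y")])    # X and Y co-cited twice
```

The higher either count, the more ‘similar’ two papers are assumed to be, which is essentially all a “you might also like” recommendation needs.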
Other tools, like scite and Semantic Scholar, go one step further and employ technologies like Natural Language Processing (NLP), Machine Learning (ML) and Artificial Intelligence (AI) to start making judgements about the papers they index. In Semantic Scholar’s case it seeks to identify where a paper is ‘influential’, and in scite’s case, where citations are ‘supporting’ or ‘disputing’.
And this is where I start to twitch.
I mean, there is an obvious need to understand the nuance of the citation network more fully. The main criticism of citation-based evaluation has always been that citations are wrongly treated as always a good thing. In fact, the Citation Typing Ontology lists 43 different types of citation (including my favourite, ‘is-ridiculed-by’). The fact that the vast majority are positive (fewer than 0.6% of citations are negative, by scite’s calculations) may itself indicate a skewing of the scholarly record: why cite work you don’t rate, knowing it will lead to additional glory for that paper? So if we can use new technologies to provide more insight into the nature of citation, that is a positive thing. If it’s reliable. And this is where I have questions. Although I’ve dug into this a bit, I freely admit that some of my questions might be born of ignorance, so feel free to use the comments box liberally to supplement my thinking.
The main criticism of citation-based evaluation has always been that citations are wrongly treated as always a good thing.
A bit about the technologies
All search engines use algorithms (sets of human-encoded instructions) to return the results that match our search terms. Some, like Google Scholar, use the citedness of papers as one element of their algorithm to sort the results in an order that may give you a better chance of finding the paper you’re looking for. And we already know that this is problematic in that it compounds the Matthew Effect: the more cited a paper is, the more likely it is to surface in your search results, thereby increasing its chances of getting read and further cited. And of course, the use of more complex citation network analysis for information discovery can contribute to the same problem: by definition the less cited works are going to be less well-connected and thus returned less often by the algorithm.
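To see how that plays out, here is a deliberately crude, entirely hypothetical ranking formula (not Google Scholar’s actual algorithm, whose details are not public) that blends keyword relevance with the log of a paper’s citation count; the papers, scores and weighting below are made up.

```python
import math

# Hypothetical score: blend keyword relevance with (log-damped) citedness.
# Because citations feed directly into the rank, already well-cited papers
# surface first, get read more, and so accrue yet more citations.
def rank_score(keyword_relevance, citation_count, weight=0.5):
    return (1 - weight) * keyword_relevance + weight * math.log1p(citation_count)

papers = [
    ("well-cited paper", 0.6, 500),  # (title, relevance, citations)
    ("new paper",        0.7, 2),
]
papers.sort(key=lambda p: rank_score(p[1], p[2]), reverse=True)
print([title for title, *_ in papers])  # well-cited paper ranks first despite lower relevance
```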
Even their developers might not ever really understand what characteristics the AI is identifying in the data as ultimately contributing to the desired outcome.
But it’s the use of natural language processing (NLP) to ‘read’ the full text of papers, and of artificial intelligence or machine learning to find patterns in the data, that concerns me more. Whereas historically humans might provide a long list of instructions to tell computers how to identify an influential paper, ML works by providing a shed-load of examples of what an influential paper might look like and leaving the AI to learn for itself. When the AI gets it right, it gets rewarded (reinforcement learning), and so it goes on to achieve greater levels of accuracy and sophistication. So much so that even their developers might not ever really understand what characteristics the AI is identifying in the data as ultimately contributing to the desired outcome.
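To give a feel for the ‘learning from examples’ idea, here is a generic sketch of a supervised text classifier trained on a handful of hand-labelled citation sentences. It is emphatically not how scite or Semantic Scholar actually work: the sentences and labels are invented, and real systems train far more sophisticated models on vastly larger corpora.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented citation sentences, hand-labelled as 'supporting' or 'disputing'.
sentences = [
    "Our results confirm the findings of Smith et al.",
    "These data are consistent with the model proposed in [3].",
    "We were unable to reproduce the effect reported by Jones.",
    "This contradicts the conclusions drawn in the earlier study.",
]
labels = ["supporting", "supporting", "disputing", "disputing"]

# No hand-written rules: the model infers which word patterns tend to
# accompany each label purely from the labelled examples.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, labels)

print(model.predict(["Our experiment replicates the original result."]))
```

The point to notice is that whoever assembles and labels those training examples is, in effect, deciding what ‘supporting’ and ‘disputing’ mean.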
Can you see why I’m twitching?
The (potentially) bad
The obvious problem is that the assumptions we draw from these data are inherently limited by the quality of the data themselves. We know that the literature is already hugely biased towards positive studies over null and negative results, and towards journal-based STEM over monograph-based AHSS. The literature is, in this way, already a biased sample of the scholarship it seeks to represent.
We also know that within the scholarship the literature does represent, all scholars are not represented equally. We know that women are less well cited than men, that they self-cite less and are less well-connected. We know the scholarship of the Global South is under-represented, as is scholarship in languages other than English. And whilst a tool may be able to accurately identify positive and negative citations, it can’t (of course) assess whether those positive and negative citations were justified in the first place.
But of course these tools aren’t just indexing the metadata but the full text. So the question I have here is whether Natural Language Processing works equally well on language that isn’t ’natural’ – i.e., where English is the second language of the author. And what about cultural differences in the language of scholarship, where religious or cultural beliefs make expressions of confidence in the results less certain, less self-aggrandising? And I’ll bet you a pound that there are disciplinary differences in the way that papers are described when being cited.
So we know that scholarship isn’t fully represented by the literature. The literature isn’t fully representative of the scholars. The scholars don’t all write in the same way. And of course, some of these tools are only based on a subset of the literature anyway.
At best this seems unreliable; at worst, discriminatory.
Who makes the rules?
Of course, you may well argue that this is a problem we already face with bibliometrics, as recently asserted by Robyn Price. I guess my particular challenge with some of these tools is that they go beyond simply making data and their inter-relationships available for human interpretation, to actually making explicit value judgements about those data themselves. And that’s where I think things start getting sticky because someone has to decide what that value (known as the target variable) looks like. And it’s not always clear who is doing it, and how.
If you think about it, being the one who gets to declare what an influential paper looks like, or what a disputing citation looks like, is quite a powerful position. Oh, not right now maybe, when these services are in start-up and some products are in Beta. But eventually, if they get to be used for evaluative purposes, you might end up with power over someone’s career trajectory. And what qualifies them to make these decisions? Who appointed them? Who do they answer to? Are they representative of the communities they evaluate? And what leverage does the community have over their decisions?
If you think about it, being the one who gets to declare what an influential paper looks like, or what a disputing citation looks like, is quite a powerful position.
When I queried scite’s CEO, Josh Nicholson, about all this, he confirmed that a) folks were already challenging their definitions of supporting and disputing citations; b) these challenges were currently being arbitrated by just two individuals; and c) they currently had no independent body (e.g. an ethics committee) overseeing their decision-making – although they were open to this.
And this is where I find myself unexpectedly getting anxious about the birth of the free/freemium-type services based on open citations and open text that we’ve all been calling for. Because at least if a commercial product is bad, no-one need buy it, and if you do, as a paying customer you have some* leverage. But I’m not sure the community will have the same leverage over open products, because, well, they’re free, aren’t they? You take them or leave them. And because they’re free, someone, somewhere, will take them. (Think Google Scholar.)
*Admittedly not a lot in my experience.
Are the rules right?
Of course, it’s not just who defines our target variable that matters, but how they do it. What exactly are these algorithms being trained to look for when they seek out ’influential’, ‘supporting’ or ’disputing’ citations? And does the end user know that? More pertinently, does the developer know that? Because by definition, AI is trained by examples of what is being sought, rather than by human-written rules around how to find it. (There are some alarming stories about early AI-based cancer detection algorithms achieving near-100% hit rates on identifying cancerous cells, before the developers realised that the system was taking the presence of a ruler in the training images – used by doctors to measure the size of tumours – as an indicator that this was a cancerous cell.)
I find myself asking: if someone else developed an algorithm to make the same judgement, would it make the same judgement? And when companies like scite talk about their precision statistics (0.8, 0.85, and 0.97 for supporting, contradicting, and mentioning, respectively, if you’re interested), to what are they comparing their success rates? Because if it’s the human judgement of the developer, I’m not sure we’re any further forward.
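For what it’s worth, precision here just means: of the citations the model labelled, say, ‘supporting’, what share does the reference judgement also call ‘supporting’? Easy enough to compute, as the toy sketch below shows (both sets of labels are made up), but it begs the question of whose judgement counts as the reference.

```python
# Toy illustration of precision for one class of citation label.
human_labels = ["supporting", "disputing", "supporting", "mentioning", "supporting"]
model_labels = ["supporting", "supporting", "supporting", "mentioning", "disputing"]

def precision(model, human, target):
    """Of the items the model labelled `target`, what fraction do the
    reference (human) labels agree with?"""
    agreed = [h == target for m, h in zip(model, human) if m == target]
    return sum(agreed) / len(agreed) if agreed else 0.0

print(precision(model_labels, human_labels, "supporting"))  # 2 of 3, i.e. ~0.67
```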
I also wonder whether these products are in danger of obscuring the fact that papers can be ‘influential’ in ways that are not documented by the citation network, or whether these indicators will become the sole proxy for influence – just as the Journal Impact Factor became the sole proxy for impact? And what role should developers play in highlighting this important point – especially when it’s not really in their interests to do so?
Who do the rules discriminate against?
The reason these algorithms need to be right, as I say, is that researcher careers are at stake. If you’ve only published one paper, and its citing papers are wrongly classified as disputing that paper, this could have a significant impact on your reputation. The reverse is true of course – if you’re lauded as a highly cited academic but all your citations dispute your work, surfacing this would be seen as a service to scholarship.
What I’m not clear on is how much of a risk the former is, and whether the risk falls disproportionately on members of particular groups. We’ve established that the scientific system is biased against participation by some groups, and that the literature is biased against representation of some groups. So, if those groups (women, AHSS scholars, the Global South, authors writing in English as a second language) are under-represented in the training data that identifies what an ‘influential’ paper looks like, or what a ‘supporting’ citation looks like, it seems to me that there’s a pretty strong chance they are going to be further disenfranchised by these systems. This really matters.
I’m pretty confident that any such biases would not be deliberately introduced into these systems, but the fear, of course, is that systems which inadvertently discriminate against certain groups might be used to legitimise deliberate discrimination against them. One group feeling particularly nervous at the moment, given the apparent lack of value placed on their work, is the Arts and Humanities. Citation counting tools already discriminate against these disciplines due to the lack of coverage of their outputs and the relative scarcity of citations in their fields. However, we also know that citations are more likely to be used to dispute than to support a cited work in these fields. I can imagine a scenario where an ignorant third party seeking evidence to support financial cuts to these disciplines could use the apparently high levels of disputing citations to justify their actions.
But it doesn’t stop here. In their excellent paper, Big Data’s Disparate Impact, Barocas and Selbst discuss the phenomenon of masking, where features used to define a target group (say less influential articles) also define another group with protected characteristics (e.g., sex). And of course, the scenario I envisage is a good example of this, as the Arts & Humanities are dominated by women. Discriminate against one and you discriminate against the other.
The thin end of the wedge
All this may sound a bit melodramatic at the moment. After all, these are pretty fledgling services, and what harm can they possibly do if no-one’s even heard of them? I guess my point is that the Journal Impact Factor and the h-index were also fledgling once. And if we’d taken the time as a community to think through the possible implications of those developments at the outset, then we might not be in the position we are in now, trying to extract each mention of the JIF and the h-index from the policies, practices and psyches of every living academic.
I guess my point is that the Journal Impact Factor and the h-index were also fledgling once.
Indeed, the misuse of the JIF is particularly pertinent to these cases. Because this was a ‘technology’ designed with good intentions – to help identify journals for inclusion in the Science Citation Index – just as scite and Semantic Scholar are designed to aid discovery and assess citation sentiment. But it was a very small step between the development of that technology and its ultimate use for evaluation purposes. We just can’t help ourselves. And we are naïve to think that just because a tool was designed for one purpose, it won’t be used for another.
This is why the INORMS SCOPE model insists that evaluation approaches ‘Probe deeply’ for unintended consequences, gaming possibilities and discriminatory effects. It’s critical. And it’s so easy to gloss over when we as evaluation ‘designers’ know that our intentions are good. I’ve heard that scite are now moving on to provide supporting and disputing citation counts for journals, which we’ll no doubt see on journal marketing materials soon. How long before these citations start getting aggregated at the level of the individual?
Of course, the other thing that AI is frequently used for, once it has been trained to accurately identify a target variable, is to predict where that variable might occur in future. Indeed, we are already starting to see this with AI-driven tools like Meta Bibliometric Intelligence and UNSILO Evaluate, which use the citation graph to predict which papers may go on to be highly cited and would therefore be a good choice for a particular journal. To me, this is hugely problematic and a further example of the Matthew Effect, rewarding science that looks like existing science rather than ground-breaking work on new topics written by previously unknown authors. Do AI-based discovery and evaluation tools have the potential to go the same way, predicting, on the basis of past performance, the more influential scholars of the future?
I don’t want to be a hand-wringing naysayer, like an old horse-and-cart driver declaring the automobile the end of all that is holy. But I’m not alone in my hand-wringing. Big AI developer DeepMind are taking this all very seriously. A key element of their work is around Ethics & Society, including a pledge to use their technologies for good. They were one of the co-founders of the Partnership on AI initiative, which gives those involved in developing AI an open discussion forum, including members of the public, around the potential impacts of AI and how to ensure they have positive effects. The Edinburgh Futures Institute have identified Data & AI Ethics as a key concern and are running free short courses in Data Ethics, AI & Responsible Research & Innovation. There are also initiatives such as Explainable AI, which recognise the need for humans to understand the process and outcomes of AI developments.
I’ve no doubt that AI can do enormous good in the world, and equally in the world of information discovery and evaluation. I just feel we need to have conversations now about how we want this to pan out, fully cognisant of how it might pan out if left unsupervised. It strikes me that we might do well to develop a community-agreed, voluntary Code of Practice for working with AI and citation data. This would help ensure that we extract all the benefits from these new technologies without finding them over-relied upon for inappropriate purposes. And whilst such services are still in their infancy, I think it might be a good time to have this conversation. What do you think?
Elizabeth Gadd is the Research Policy Manager (Publications) at Loughborough University. She is the chair of the Lis-Bibliometrics Forum and co-Champions the ARMA Research Evaluation Special Interest Group. She also chairs the INORMS International Research Evaluation Working Group.
I am grateful to Rachel Miles, Josh Nicholson and David Pride for conversations and input to this piece, and am especially thankful to Aaron Tay, who indulged me with a long and helpful exchange that made this a much improved offering.
Unless it states otherwise, the content of The Bibliomagician is licensed under a Creative Commons Attribution 4.0 International License.