Data-driven anti-racism in genomics research: a funder's perspective

Bilal Mateen & Talia Caplan, Wellcome Trust

When the Wellcome Trust announced its new anti-racism policies, we acknowledged an uncomfortable but incontrovertible truth – Wellcome has played a part in perpetuating racism through barriers to inclusive research, and this must change.

Racism is an issue across the higher education and research sector, and there is unfortunately no single solution to it in health research because it is baked into the way institutions operate. Many institutions don't even recognize the role that they are playing. Meaningful change therefore requires us to take an actively anti-racist role.

As an independent global funder, with the resources to influence the health research landscape, we are committed to addressing this issue. This means changing research processes and practices that systematically exclude or disadvantage people based on disability, gender and race.

Acknowledging the problem

One of the key challenges in achieving a genuinely equitable research landscape is first understanding what datasets are available and how they are used. The genomics community has done a substantial amount of analysis revealing the lack of diverse data, and how this limits existing genetic research. In 2009, less than 5% of all genomics studies were carried out using data from populations that were not of European descent. By 2016, that figure had improved to 19%, largely driven by the inclusion of Asian ancestries. Another way of thinking about this is that the vast majority (78%) of research in the field today is based on groups that represent only a small minority (16%) of the global population (that is, individuals of white European ancestry).

This lack of diversity both limits the potential of research in the field and restricts who benefits from it. There are numerous examples of polygenic risk scores derived from European-ancestry data that do not transfer to other ancestry groups. And there is another very real issue: without the richness of diverse datasets, we are failing to pick up variants that are rare in white European populations but abundant in others. One of the more famous examples is that of variants of the PCSK9 gene that cause substantial decreases in cholesterol. These variants are rare in individuals of European ancestry (approximately 0.006% of individuals are carriers), whereas in groups of African ancestry the prevalence can be as high as 2.6%. Thus, finding these variants in most existing datasets, let alone analyzing them, is impossible. These aren't theoretical benefits. In the space of a few short years, we have already developed drugs that inhibit PCSK9, which are proven to reduce cardiovascular disease risk in practice. There are likely many similar stories waiting to be told, but until sufficient data are available to unearth them, many people will continue to suffer from diseases that are treatable – if only we had the genetic understanding.

The obvious systems-level intervention for funders is to direct resources towards this gap in global genomics research, such as through the Lacuna Fund and H3Africa, which seek to support data collection and sharing from underrepresented communities. But is making more and better data available enough? The UK Biobank is a telling example of why we need to think about the problem of diverse data more holistically. Despite the availability of genomic data from non-European and mixed-ancestry participants in the UK Biobank, only 7.3% of the research carried out using the Biobank since 2008 has used data from these diverse samples. That's not to suggest that the other 93% isn't relevant to those of non-European ancestries, but that there's a potential missed opportunity to do more good, for more people, by effectively using the available data.

Answering the obvious ‘why?’ that this statistic elicits isn’t a simple task. An educated guess might link it to a lack of diversity within the teams conducting this research or those planning and prioritizing the research questions.

Thus, it is imperative that we invest in solutions that directly address this skewed utilization of available data – for example, by creating novel incentive structures that encourage the use of diverse data, such as Wellcome's data prizes, which have been set up to facilitate greater (trustworthy) use of under-utilized data from low- and middle-income countries.

At their core, the prizes and the Lacuna Fund are examples of exploring a variety of different approaches to improving research culture. The next wave of genomics research requires a similarly bold approach, to ensure that we do not accidentally recreate historical biases by relying on pre-existing infrastructure (see Box 1) and processes that have allowed, if not facilitated, our arrival at where we are today: a research landscape addressing the needs of the few, not the many.

Box 1: Ethnicity classification – an imperfect tool or an impediment to effective research?

A key piece of infrastructure at the heart of this problem is the standard library of ethnicities that research and healthcare services rely on. Stripping away the veneer, it is unclear just how useful this classification system really is. For example, 'White Eastern European', 'Ashkenazi Jewish' or 'Polish' could all refer to the same person, and yet that person may identify themselves as British. In essence, the terms we use to identify ourselves are fluid and very context dependent. This is in part why it is so hard to agree on universal definitions for each ethnicity, and on protocols for supporting individuals in selecting the 'labels' that facilitate research whilst not undermining their sense of individual identity. The Global Alliance for Genomics and Health (GA4GH), of which the Wellcome Sanger Institute is a founding member, funder, and host institution, is one of the main vehicles by which Wellcome is supporting the creation of frameworks and standards in this space. There is much more left to be done, but reflecting on these processes – which many probably didn't think twice about whilst filling out the census – is central to driving progress in the field.

Inclusive research requires diverse teams

Catalyzing the next generation of genomics research based on diverse data requires both creating new data assets and facilitating the use of these resources through the right incentive structures. But it also requires building more diverse teams to take advantage of those new opportunities.

Science and technology roles, both in academia and outside it, demonstrate a remarkable lack of representation. The idea that ethnically diverse groups produce more creative solutions than homogeneous teams dates back to the 1960s, but the consequences of this lack of diversity go beyond lost opportunity. Some of the most ubiquitous technologies built in the last few decades (by a workforce that is known to lack diversity) have recently had the systemic racism they propagate exposed. From Twitter's image-cropping algorithm's preference for white faces to facial recognition technologies' remarkably poor performance at recognizing anyone who is not white, there is an insidious downstream impact of a lack of workforce diversity that urgently needs to be addressed.

In research, even challenges that are less overt can still create an inhospitable environment. The accessibility of work visas, or even just knowledge of opportunities, are in truth quite profound barriers to the goal of a more diverse research community – yet they are readily addressable by funders. A fundamental difficulty preventing the funding community from addressing these causal determinants is our reliance on imperfect information (i.e., data that are out of date, or results that are not generalizable).

Altering this perverse and pervasive pattern requires funders to apply the same academic rigour we expect of those we fund. First, we need investment in shared infrastructure to facilitate a transition from retrospective to near-real-time analysis of proposals during the evaluation stages, giving us the opportunity to diagnose these problems on a timescale that allows for immediate remediating action.

Next, we need to experiment with novel funding mechanisms; for example, we could consider lotteries instead of panels to select grants from pools of competitive submissions. Finally, all of this needs embedded evaluation schemes to determine whether any of it improves performance indicators illustrative of equitable funding. Even with the more than £1 billion that Wellcome invests in research, we often don't make enough similar grants in any given year to conduct such experiments. As such, collaboration amongst funders, of the sort facilitated by the Research on Research Institute, is key to answering these questions on the timescales necessary to address the lack of diverse teams throughout the research community.

Involving the public

Finally, diversity and representation among the people engaged with research is critical. Far too often, public involvement is seen as a mechanism for seeking a licence to undertake hypothesis-driven research. Instead, it should be used to challenge researchers to look beyond their preconceived notions and help them understand what is important more broadly, rather than simply rubber-stamping a good idea.

Drawing on the current discussions about regulation of artificial intelligence (AI), a field with a similarly poor history of utilizing diverse data, it's remarkably easy to see how we got here when you consider how little the public has been involved in decision-making. A recent analysis of the comments submitted as part of the U.S. Food and Drug Administration's (FDA) public consultation on proposed updates to the regulation of AI-based tools in healthcare found that less than 5% of submissions were from individual consumers, whereas 50% were from industry, and almost two-thirds of all comments declared some form of financial interest. One might argue that calling it a public consultation almost seems farcical. The question we all ought to reflect on is whether genomics, or really any other field of scientific endeavour, is doing that much better.

As a funder, we can encourage meaningful public participation in research. However, we also have a duty to engage in it ourselves. It is relatively trivial to include considerations around public involvement and engagement in our judging criteria when awarding grants, and to facilitate connections between researchers and experts. Similarly, in private, we can challenge stewards of genomic data to do better. But until we are seen to model best practices, how can we expect the same from our grantees?

Possibly one of the most important lessons we must all learn is to spend more time listening, rather than immediately jumping to action. At the start of this year, we began a public involvement project with Black and South Asian people to understand their views on health data collection. After years of doing research on representative samples of the public, it is apparent that we need to do more to capture the views of minority groups if we are to effectively incorporate that knowledge into our funding. Importantly, the project is about facilitating a better understanding of what best practices look like, rather than seeking to persuade anyone of the value of a given approach. The genomics community is no stranger to such dialogues, but as we look towards making data more diverse, there is clearly more to be done to understand what the communities reflected in that data would like us to do with the information.

Driving an agenda that prioritizes diverse data in genomics requires more than just funding datasets. Defining the role of funders specifically requires an appreciation of our individual and collective ability to exert influence on the community at large. The questions we prioritize, the processes that we ask researchers to engage with when seeking funding, and the experts we convene to review proposals all have an impact on who ends up doing what research.

In essence, all of this contributes to the role we play in shaping research environments. Without paying due attention to those essential questions of who does the research, and why, history suggests that we are unlikely to make any real progress. Acknowledging there's a problem is only the first step. For that acknowledgement to open a productive discourse that addresses these problems, funders must commit to doing something about it. For Wellcome, that something is reflection, engagement, and experimentalism in how we approach being a funder. But to really shift the needle, we need all funders to commit to the same.

Dr. Bilal Mateen is a clinician by training, with an academic interest in health-related applications of data science and machine learning. Bilal is currently the clinical technology lead and senior manager for digital technology at the Wellcome Trust. He also moonlights as a lecturer in clinical data science at University College London, and as a clinical data science fellow at the Alan Turing Institute (the UK's national institute for data science and AI).