Generating insights into disease, patient, and outcome heterogeneity: The Turing-Roche Partnership

Vicky Hellon, Dr Alisha Davies, Dr Robin Mitra, The Alan Turning Institute

Background on The Alan Turing Institute

Established in 2015, The Alan Turing Institute is the national institute for data science and artificial intelligence, advancing research in this area to change the world for the better. We undertake research ourselves, collaborate with universities, businesses and public and third sector organisations, with the aim to lead the public conversation through agenda-setting research and public engagement.

We recently established a five-year strategic partnership with the pharmaceutical company Roche, who have a longstanding commitment to personalised healthcare, to make advances in treatment heterogeneity. Here, we explore what this term means, current research issues in this area and what we’ve been up to as a partnership to start tackling this. 

Patient and Treatment Heterogeneity

“I tried this, and it worked for me”. Why is it that a drug or treatment can work better in some people than others? This is a very difficult question to answer as there are many reasons why - one of which is that humans can vary in many ways.

This difference, or heterogeneity, in the human population represents a key challenge in medical research as we seek to ensure advances in the prevention, diagnosis and treatment of disease have the greatest benefit to each and every one of us.

Differences between individuals in the development and progression of poorer mental and physical health, and responses to treatment, are likely to be due to many different factors and the interactions between them. These include patient factors such as age, sex, molecular and genomic factors, exposure to health harming risks across the life course - including smoking, alcohol, socio-economic and place effects; disease factors such as the etiology (or the set of causes) and severity of disease, or the presence of comorbidities; and health intervention factors such as the modality of treatment, or management of multiple treatments.

In medical research, such heterogeneity in patients and treatments is usually adjusted when using statistical and analytical methods to determine the average effect across populations.  Whilst this is appropriate to inform decisions at a population level, it does mean that we often do not look to better understand why some treatments are more effective for some groups of patients and not others. Understanding and exploring patient and treatment heterogeneity, rather than adjusting the differences away, could lead towards more individualised medicine, selecting the treatments achieving better clinical outcomes for patients in different groups, and potentially preventing the use of treatments which are likely to be ineffective in others - reducing waste. 

In order to develop a comprehensive understanding of patient and treatment heterogeneity, it is important to be able to distinguish between any differences or effects observed in the data that may be real, from those that are artefacts. However, our ability to do this can be significantly impeded due to limitations on the quality and information available in the data, including:

  • Comprehensive data across all factors is often not know, or collated,

  • Data is incomplete e.g., follow up data on patient outcomes,

  • Groups underrepresented in research/routine data - certain groups are less likely to participate in clinical trials or appear in healthcare data e.g., minority ethnic groups, unregistered (homeless) populations.

These limitations on data quality/information can all be viewed within the context of ‘missing data problems’, i.e., they represent challenges where we do not have all the data we would ideally like, impeding our ability to perform analyses, and thus make decisions.

“Heterogeneity of treatment effects reflects patient diversity in risk of disease, responsiveness to treatment, vulnerability to adverse effects, and utility for different outcomes. Recognizing these factors, researchers can design studies that better characterize who will benefit from medical treatments, and clinicians and policymakers can make better use of the results." Kravitz, 2004

Missing Data

Missing data is an inherent problem associated with analysis of large complex data, such as those that combine multiple sources of information. We often apply standard statistical procedures to the data to make inferences, build prediction models, or address causal questions of interest, but these procedures inherently assume that the data is complete, and so cannot be applied without some reasonable adjustments made to the data to deal with the missing values. Simple adjustments, such as subsetting on only the complete part of the data, can severely reduce the available information in the data as well as compromise the validity of any analyses.

Missing values can arise for different reasons, often referred to as missing data mechanisms. In complex data sets, there are likely to be a combination of different mechanisms that explain the missingness seen in the data. For example:

  • Missing data could be due to purely random chance, e.g., the data collector forgetting to take a measurement,

  • Missing values could be likely amongst a certain demographic group, e.g., older males are less likely to turn up to regular follow up appointments in a clinical trial,

  • Missing values could be directly associated with their underlying value, e.g., the patients with the most negative outcomes following treatment are less able or willing to participate in follow up visits to report these outcomes.

Structured Missingness

A particular missing data problem encountered in this setting is that of Structured Missingness. This is where the missing values arise with an associated structure and is almost inevitable when combining data across several different sources. There are various reasons for this including:

  • Information collected across different (potentially overlapping) sets of individuals,

  • Information collected across different (potentially overlapping) sets of variables,

  • Information collected across different sources on the same variables but following different measurement practices.

Such missing data is thus likely to result in blocks of missing values present in the data and could present a substantial loss in information. In addition, clinico-genomic data can contain other types of Structured Missingness with complex dependencies and underlying causes. For example, patients may receive a battery of tests to help with a diagnosis, but these tests may only be prescribed for certain patients due to practitioner choices or patient profiles. This results in some patients’ records containing detailed information from a set of tests, while other patients’ records contain none of this information.

The missing data, and importantly the inherent underlying structure associated with it, presents important challenges that need to be addressed in order to make the most effective use of the data, and in doing so advance our understanding of outcome heterogeneity, ultimately helping more patients respond better to treatments. 

The Turing-Roche Partnership

Motivated by the issues above, the Turing-Roche strategic partnership has set its North Star as ‘generating insights through advanced analytics to better understand patient and disease heterogeneity and its relevance to clinical outcomes at an unprecedented level of precision to improve clinical care’.

The partnership is at the start of its journey (we launched formally in summer 2021) and one of our first projects was exploring this area of Structured Missingness. To launch our activities in this area we ran three virtual workshops across November 2021 where we convened a diverse group of around 40 researchers to discuss challenges in this area. 

What immediately became apparent were the research gaps in this area. Many researchers had encountered this missingness problem, but it was clear that there wasn’t any uniquely optimal statistical approach that could address the different challenges associated with Structured Missingness, with researchers having adopted a range of approaches to ‘deal’ with it. 

The workshops culminated in attendees presenting potential research ideas in groups. It was great to see the breadth of ideas that had been developed through the workshops and the collaboration and synergies in and between groups. The presentation titles are below:

  • Develop and evaluate a method to comprehensively find & explain missingness in multi-modality data

  • The bidirectional influence between Structured Missingness and Adaptive Designs

  • Visualisation to Support Investigation of Structured Missingness in Health Data

  • Personalised Cancer Treatment with Missing Data: An Optimization Approach

  • Characterising and Investigating Structured Missingness

  • Imputation incorporating external information

  • Characterising structured missingness for optimising future designs

  • Quantifying and detecting (lack of) diversity

  • Using networks to address Structured Missingness

Ultimately the workshops resulted in several themes emerging that are important to consider when dealing with Structured Missingness. These include building models for such complex incomplete data, visualisation approaches, as well as characterising Structured Missingness from a fundamental mathematical perspective. 

Next steps

After the workshops, we launched a formal funding call inviting proposals in the area of Structured Missingness and we will be announcing the successful applications in the next few months. We envisage the resulting methodological developments will help make better use of clinico-genomic data to address the applied questions of interest, taking us a step closer to our North Star - better treatment for all. 

You can keep up to date with this and other updates from the partnership here - we’re keen to build a collaborative community in this area of treatment heterogeneity, so do get in touch if this sounds of interest!


Vicky Hellon is the Community Manager for the Turing-Roche Partnership- she works closely with both organisations to build an engaged and sustainable community around the partnership, with a focus on supporting researchers to embrace and embed Open Science practices. Vicky has a BSc in Biomedical Science from the University of Sheffield and previously had roles in Open Access publishing at Springer Nature and F1000Research. More recently she worked at Health Data Research UK, building a community around the newly launched Health Data Research Innovation Gateway, a portal to help researchers discover and request access to UK health datasets. She is passionate about supporting forward-thinking changes to the academic publishing system and engaging with researchers from a range of backgrounds.

Dr Alisha Davies is the Health Theme Lead for the AI for science and government (ASG) programme at the Turing. She has a PhD in epidemiology from the London School of Hygiene and Tropical Medicine, and holds an Hon Professorship in the Faculty of Health and Life Sciences at Swansea University. Alisha is also an NHS Consultant in Public Health and Head of Research and Evaluation at Public Health Wales where she leads a multi-disciplinary division working alongside academic partners to address knowledge gaps in public health policy and practice. Alisha is also a member of the NIHR Public Health Research Prioritisation Committee, Deputy Director of the Centre for Population Health Research in Wales and leads the Health Foundation funded Wales Networked Data Lab.

Dr Robin Mitra is an ONS Senior Lecturer in Statistics at Cardiff University and Research Theme Lead in Structured Missingness for the Turing-Roche Partnership. Prior to his appointment at Cardiff University he was a Lecturer at Lancaster University and prior to that a Lecturer at the University of Southampton.

His main research areas are dealing with problems arising due to missing data and data confidentiality. He also has interests in Bayesian methods more generally. Some of his previous collaborations have included working with the Office for National Statistics and the National Health Service Blood and Transplant as well as the Institute of Employment Research in Germany. He is an active member of the Royal Statistical Society (RSS) and was the Chair of the Medical Section from January 2017- January 2020 and is currently a RSS Council member.