By: Alice Richardson

Think piece: Statistics at the Interface of Health and AI


As Lead of the Statistical Support Network at ANU, I spend much of my time as a statistical consultant gently drawing out from researchers what it is that they actually want to know. Much of my collaborative research is focused on Sustainable Development Goal 3: Good health and well-being. In recent years this has extended from supporting research into childhood developmental vulnerabilities in Australia to drinking water availability in Fiji and childhood stunting in Bangladesh. It’s not until the research question is clear that any advance can be made on choosing the right methods to answer it. You would be forgiven for thinking that in my first presentation to the ANU Integrated AI Network I would come out swinging, fighting hard for statistical methods against the onslaught of machine learning that forms such a huge part of artificial intelligence.

But no. I’m not going to set up in the Statistics corner and fight that particular battle. I’m passionate about the right method to answer the right question, and I see statistical thinking as the most direct route to both.

So what goes into statistical thinking? How can you tell when you’re doing it? I couldn’t do much better than go to Frank Harrell, one of the best statistical thinkers of the century, and take a look at his list of the fundamental principles of statistics (https://www.fharrell.com/post/principles/).

I’d like to highlight five of those principles across three key domains: the design, analysis and presentation of research data. I chose these principles because they are areas where machine learning has a lot to offer, but also where there are traps that nobody wants to fall into.

  1. In terms of Design, you should design experiments to maximise information. All the sources of variation that impact on your outcome need to be identified and either controlled or measured. There’s no substitute for careful thinking about the context of the experiment and all the likely sources of variation that impact upon it. Statisticians have years of experience in doing this, and those who work closely with experimental researchers in health, computer science or any other domain become very skilled at spotting sources of variation that haven’t been controlled for. They’ve seen it all, and you should lean into their expertise.
  2. In terms of Analysis, there are three points I’d like to make.

    a. You should incorporate uncertainty into models. Identification of uncertainty or variability in research data is one of the superpowers of statisticians, and one every researcher should be aware of. It could be that your data contains batch effects from gene sequencing, interviewer effects from a virtual reality experiment, or missing data. Statistical thinking can help to identify these sources of uncertainty and model them appropriately. This could be as simple as incorporating appropriate terms in a model (the first sketch after this list shows one such term) or as complex as a fully Bayesian framework with parameters nested within hyper-parameters: the sky’s the limit!

    b. You should aim to live within the confines of the information content of your data. Just because your data set is bigger than everyone else’s doesn’t mean that it contains more information than anyone else’s. This applies in particular to non-probability samples where, for instance, you put an ad on Facebook with a link to a survey and wait while the responses roll in. Without the structure of a sampling design, it can be hard to know what population your sample of respondents is drawn from, and therefore hard to know whether your results generalise to the population you were originally interested in. The second sketch after this list shows one simple way to begin realigning such a sample with a known population.

    c. You should use all the information content during analysis. This applies particularly to multilevel or longitudinal structure in a data collection. If certain observations were clustered together into a household, then let’s quantify the correlation between household members and use that in the analysis rather than assuming that everyone responds independently (the first sketch after this list does exactly this). If certain observations were made three (or three thousand) times, then let’s quantify the variability between those measurements rather than simply taking the average of them and analysing that. If measurements were made at repeated timepoints, then there are some wonderful new statistical methods for accounting for that structure that are less clunky than previous approaches. Many machine learning methods do not handle multilevel structure well, and while this is an active area of research, a classical statistical method combined with some variable selection may be your best way forward.

  3. And to make up my top five, in terms of Presentation I have one point to make. You should present results in ways that are intuitive, information-rich and correctly perceived. There’s a huge literature on data visualisation which you can delve into. There’s also very powerful graphical software sitting on many researchers’ laptops, capable of producing elegant graphics that highlight important findings from the analysis without distortion. There’s no need to stick with tired and easily misinterpreted graphics like pie charts; the final sketch after this list shows one simple alternative.
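To make principles 2(a) and 2(c) concrete, here is a minimal sketch in Python of a random-intercept model for household-clustered data, fitted with the statsmodels library. Everything in it is invented for illustration: the dataset is simulated and the variable names are hypothetical. The household term both models a source of variation and quantifies the correlation between household members that an independence assumption would ignore.

```python
# A minimal sketch: a random-intercept (mixed-effects) model for
# individuals nested within households. All data are simulated and
# all names (outcome, age, household) are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Simulate 200 households of 3 members each, with a shared
# household-level effect creating within-household correlation.
n_households, members = 200, 3
household = np.repeat(np.arange(n_households), members)
household_effect = rng.normal(0.0, 1.0, n_households)[household]
age = rng.uniform(18, 80, n_households * members)
outcome = 2.0 + 0.03 * age + household_effect + rng.normal(0.0, 1.0, len(age))

df = pd.DataFrame({"outcome": outcome, "age": age, "household": household})

# Fixed effect for age, random intercept for each household.
result = smf.mixedlm("outcome ~ age", df, groups=df["household"]).fit()
print(result.summary())

# Intraclass correlation: the share of variance attributable to
# households, i.e. the correlation that an independent-observations
# analysis would throw away.
var_household = result.cov_re.iloc[0, 0]
icc = var_household / (var_household + result.scale)
print(f"Estimated intraclass correlation: {icc:.2f}")
```

The intraclass correlation printed at the end is exactly the quantity that averaging over household members, or treating them as independent, would discard.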
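For principle 2(b), here is a minimal sketch of post-stratification, one simple way of reweighting a convenience sample towards known population shares. The age groups, counts and population figures are all invented. Reweighting like this can only partially repair a non-probability sample: no weight can reach the people who never saw the ad in the first place.

```python
# A minimal sketch of post-stratification weights for a convenience
# sample. The age groups and all figures are hypothetical.
import pandas as pd

# A skewed sample: young respondents are heavily over-represented.
sample = pd.DataFrame(
    {"age_group": ["18-34"] * 60 + ["35-54"] * 30 + ["55+"] * 10}
)

# Known population shares, e.g. from a census.
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Weight = population share / sample share, per stratum.
sample_share = sample["age_group"].value_counts(normalize=True)
sample["weight"] = sample["age_group"].map(
    lambda g: population_share[g] / sample_share[g]
)
print(sample.groupby("age_group")["weight"].first())
```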
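Finally, for the Presentation principle, a minimal sketch of one alternative to the pie chart: a sorted horizontal bar chart drawn with matplotlib. The clinics and percentages are invented. The design rationale is that readers judge aligned bar lengths far more accurately than pie-slice angles.

```python
# A minimal sketch: shares shown as a sorted horizontal bar chart,
# which is read more accurately than a pie chart. Data are invented.
import matplotlib.pyplot as plt

categories = ["Clinic A", "Clinic B", "Clinic C", "Clinic D", "Clinic E"]
shares = [34, 27, 19, 12, 8]  # hypothetical percentages

# Sort so the ranking can be read straight down the axis.
order = sorted(range(len(shares)), key=lambda i: shares[i])
fig, ax = plt.subplots(figsize=(6, 3))
ax.barh([categories[i] for i in order], [shares[i] for i in order])
ax.set_xlabel("Share of referrals (%)")
ax.set_title("Referrals by clinic (hypothetical data)")
fig.tight_layout()
plt.show()
```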

I’d now like to turn to two projects I’m involved with, which allow me to apply those principles in the pursuit of better health and well-being.

The first is titled “Harnessing machine learning in post-viral fatigue research to complement traditional scientific research methods”. At the core of this project is the diagnosis and monitoring of chronic fatigue syndrome (CFS), also known as myalgic encephalomyelitis. The symptoms of this condition have been studied for decades, and when COVID came along, followed by the emergence of long COVID, it quickly became clear that long COVID was not some new-fangled thing but bore all the hallmarks of another post-viral fatigue syndrome. This project therefore intends to leverage the knowledge gained from studying CFS to enhance our understanding of what might work for long COVID. Machine learning will be important here in providing explainable algorithms that healthcare practitioners can use in their practice, similar to the one my PhD student recently proposed for hepatitis.

The second project is titled “Prediction models in infectious diseases”, or “Why are there 238 different algorithms for prediction already available in one software package alone?” This project aims to take the laboratory diagnosis of infectious diseases to the next level through a careful and principled comparison of classical statistical and modern machine learning methods, applied to an important disease that has a disproportionate effect on health outcomes in low-income settings. We have access to a freshly collected dataset of close to 1000 hepatitis patients in Nigeria. We’ll be using InteKRator, a newly developed knowledge representation tool (Apeldoorn et al., 2021). I met the lead author at an outdoor science fair in the main marketplace of the lovely city of Mainz in the German summer of 2023. It just goes to show that you never know what you might find or who you might meet when travelling overseas!

In conclusion, I bring my experience in all the statistical thinking principles espoused by Frank Harrell to all the interactions I have with health researchers. My pragmatic approach doesn’t remove my responsibility to ensure that the right method is used to answer the right question, whether that involves machine learning or something a bit more classical. 


This article was first presented on 1 August 2024 to the ANU Integrated AI Network workshop on 'Health and AI'.
