The Need for Individualized Approaches to Medicine
The Flaw of Averages
There's a statistic used widely in the world of clinical investigation called the number needed to treat (NNT), which in plain language is the number of individuals who would need to receive a given therapy to prevent one bad outcome. If a treatment has an NNT of 20, then only 1 in 20 people actually derives a benefit from that treatment; the other 19 undergo it without any impact on their health. Of course, the problem is that it isn't necessarily possible to distinguish the one person from the other 19, and so if a population as a whole wants to receive the benefit of a given treatment, the best approach is to treat everyone who could potentially benefit.
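To make the arithmetic concrete, the NNT is just the reciprocal of the absolute risk reduction. Here's a minimal sketch in Python (the event rates are invented for illustration, not taken from any real trial):

```python
# Sketch of how NNT relates to event rates in a trial.
# The rates below are illustrative, not from any real study.
def number_needed_to_treat(control_event_rate: float, treated_event_rate: float) -> float:
    """NNT is the reciprocal of the absolute risk reduction (ARR)."""
    arr = control_event_rate - treated_event_rate
    return 1.0 / arr

# If 10% of untreated patients have the bad outcome vs. 5% of treated
# patients, the ARR is 0.05 and the NNT is 20: treat 20 people to
# prevent one outcome.
print(number_needed_to_treat(0.10, 0.05))  # → 20.0
```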
If this approach sounds imprecise, it's because it is. And yet, an NNT of 20 is actually not far off what some of our best treatments provide. As outlined in this great article about NNTs, arguably the greatest cardiac medications of the past 40 years, the statins, have an NNT of about 50 on average. Other less effective medications, such as aspirin, have an NNT of 1,667 to prevent a single stroke or heart attack. With such large numbers, one has to ask why people would be willing to subject themselves to a treatment with possible side effects (if it's a drug) or complications (if it's a procedure) when it has such a small probability of actually benefiting them. The answer is evidence-based medicine.
On a conceptual level, evidence-based medicine is simply the application of a structured statistical approach to the study of a given treatment or disease risk factor. In other words, statistics are applied to a subset of the population in whom the treatment is being studied, and if there is a statistical difference (conventionally defined as a less than 5% probability that the observed difference is due to random chance--a p value less than 0.05) between the group who received the treatment and the group who did not, we conclude that there is 'evidence' that the treatment worked. What constitutes an appropriately designed study (i.e., one lacking bias) that meets the threshold of 'evidence' for or against a treatment is controversial, although most agree that the most definitive study design (i.e., the 'gold standard') is the randomized controlled trial, or RCT. In the ideal RCT, patients are enrolled based on a set of common criteria (e.g., age over 18, history of high blood pressure, no history of heart disease), assigned at random to a treatment or a placebo, and then followed over time until the onset of a predefined outcome. A trial is 'double-blind' (also ideal) if neither the patients nor the investigators know who received which treatment until after the study is completed.
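For the curious, the statistical comparison at the heart of an RCT report can be sketched in a few lines. Here's a hand-rolled two-proportion z-test with invented counts (a simplification; real trials use more careful methods, but the p < 0.05 logic is the same):

```python
import math

# A basic two-proportion z-test, the kind of comparison an RCT report
# boils down to. The event counts below are invented for illustration.
def two_proportion_p_value(events_a, n_a, events_b, n_b):
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# 100/1000 events on placebo vs. 70/1000 on treatment:
p = two_proportion_p_value(100, 1000, 70, 1000)
print(p < 0.05)  # the conventional 'evidence' threshold
```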
Without getting too far into the weeds, the philosophical goal of RCTs and other study designs is to use counterfactuals to draw causal inference about a treatment's effect on a condition. The counterfactual is the outcome that 'would have occurred' in the absence of the exposure or treatment, and is based on the notion that in an ideal world we could go back in time, apply the opposite intervention to what a person actually received, and see what happens. In the absence of time travel, the alternative is to find two people who are as closely matched as possible, so that the only difference between them is the treatment received, and see how a given outcome differs based on the treatment. If it were possible to measure everything that could impact a treatment's effect on a person, and to find another person with the exact same profile, then that's what we would do. Unfortunately, we can't measure everything, and even worse, the unmeasured factors can often impact the effect of the treatment on the outcome, a situation called unmeasured confounding. The alternative, then, is to take a set of the population with a few common broad characteristics (enrollment criteria) and randomly assign them to a 'treatment' or 'placebo' group, with the idea that the unmeasured confounders will also be distributed randomly (and, if a study has a large enough enrollment, evenly), and so should not bias the assessment of the treatment on the outcome. Because there are many unmeasured confounders that must be balanced out, RCTs often require large numbers of people to examine a treatment's impact. And because RCTs include large numbers, it is possible to find a treatment that is statistically more effective than placebo and yet must be used in a large number of people to see any effect at all; hence a large NNT.
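The claim that randomization evens out unmeasured confounders is easy to demonstrate by simulation. In this sketch, each simulated person carries a hidden 'confounder' value that the trial never measures, and with a large enough enrollment the two arms end up nearly identical on it (all numbers here are invented):

```python
import random

random.seed(0)

# Why randomization works: an 'unmeasured confounder' (here, a hidden
# number attached to each person) ends up nearly evenly distributed
# between arms once enrollment is large, even though it is never measured.
N = 100_000
confounder = [random.gauss(0, 1) for _ in range(N)]
arm = [random.choice(("treatment", "placebo")) for _ in range(N)]

def mean(xs):
    return sum(xs) / len(xs)

treat_mean = mean([c for c, a in zip(confounder, arm) if a == "treatment"])
placebo_mean = mean([c for c, a in zip(confounder, arm) if a == "placebo"])

# The arms differ only slightly on the confounder.
print(abs(treat_mean - placebo_mean) < 0.05)
```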
The problem with applying the results of a large population study to an individual is that it can produce the 'flaw of averages'. In simple terms, the flaw of averages occurs when someone applies population-average statistics to an individual situation. The 'flaw' is that at the individual level, population-level statistics lose relevance and can describe a situation that is not at all reflective of reality (see cartoon below). As the saying goes, if you eat a sandwich and I eat nothing, then on average we have each eaten half a sandwich, and yet I'm still hungry...
The flaw of averages can wreak havoc on attempts to extrapolate information obtained at the population level to individual-level decision making. "Take this medication because it lowers blood pressure and prevents heart disease," is how a provider might explain the need to start a new blood pressure medication for an individual patient. However, this explanation is flawed: studies have shown that not all medications that lower blood pressure prevent heart disease (see the ALLHAT study, for one example), and the only reason the provider is prescribing the medication is that when investigators studied tens of thousands of people, the group receiving a given blood-pressure-lowering medication had less cardiac disease than the group who did not. What the provider should say is, "take this medication because studies show that if you and 200 other people take it, at least one of you will be less likely to have heart disease." Not exactly compelling.
A Matter of Power
It turns out that the cause of the flaw of averages, as it occurs in clinical decisions like starting blood pressure medication, boils down to nothing more than simple numbers. Statistical inference requires numbers--enough that any signal in the data (if it is there) will be evident over the random noise. In statistical terminology, this is called power, and it refers to the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. Power is determined by the number of observations, or, for a binary outcome like development of heart disease, the number of outcome events. More events, more power to examine a given risk factor or treatment.
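Power itself can be estimated by simulation: run many hypothetical trials of a given size and count how often the result reaches p < 0.05. This sketch (with invented event rates of 10% vs. 7%) shows power rising with the number of people enrolled, and hence with the number of events:

```python
import math
import random

random.seed(1)

# Monte Carlo sketch of statistical power for a binary outcome: the
# fraction of simulated trials reaching p < 0.05 when the treatment
# truly cuts the event rate from 10% to 7%. All rates are invented.
def significant(events_a, n_a, events_b, n_b, alpha=0.05):
    pooled = (events_a + events_b) / (n_a + n_b)
    if pooled in (0, 1):
        return False
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(events_a / n_a - events_b / n_b) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    return p < alpha

def estimated_power(n_per_arm, p_control=0.10, p_treated=0.07, sims=500):
    hits = 0
    for _ in range(sims):
        e_c = sum(random.random() < p_control for _ in range(n_per_arm))
        e_t = sum(random.random() < p_treated for _ in range(n_per_arm))
        hits += significant(e_c, n_per_arm, e_t, n_per_arm)
    return hits / sims

# More people → more outcome events → more power.
print(estimated_power(200), estimated_power(2000))
```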
Some outcomes, like calories burned or steps taken, can be measured on a single person every day. A study correlating some outside factor, such as the weather, with calories or steps could easily be conducted on a single person over a reasonable time period and obtain enough outcomes, and power, to draw a statistical conclusion about a correlation (more on this later). However, other, rarer outcomes, such as development of heart disease or cancer, a given person may experience only once, if at all, in his or her lifetime. An investigator could not possibly study the impact of weather on heart disease in a single person with enough power to make any statistical conclusion. Any correlation they suspected would be called nothing more than an anecdote.
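As a sketch of what such a single-person analysis might look like, here are 90 simulated days of one person's data, where daily steps are constructed to rise with temperature, and a simple Pearson correlation recovers the relationship (all numbers are invented):

```python
import math
import random

random.seed(2)

# One person, 90 days of data: a simulated daily temperature and a
# step count that (by construction here) rises on warmer days.
days = 90
temperature = [random.gauss(15, 8) for _ in range(days)]
steps = [6000 + 150 * t + random.gauss(0, 1500) for t in temperature]

def pearson_r(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Daily measurements give this n-of-1 analysis enough observations
# to detect the built-in association.
print(round(pearson_r(temperature, steps), 2))
```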
When Can Medicine Be Individualized?
Obviously, none of us wants to develop cancer or heart disease or any bad outcome that happens once or twice in a lifetime. And medicine and medical research have responded through the development of RCTs and other large research studies in thousands of people, with the power to identify treatments and associated risk factors. These studies are one of the great triumphs of the past 40 years of medical science, and they continue to identify risk factors for disease in our lifestyle, our environment, and more recently, our DNA (genetics). However, despite being called 'personalized medicine' by many researchers and clinicians, this application of genetics is no more personalized or individualized than our blood pressure medication example above. The ONLY reason we know that certain genetic, lifestyle, or environmental risk factors are associated with a disease is that they have been studied in hundreds to thousands of people. Nothing personal about that. Further, any treatment identified from these big studies that seems to work in patients with a specific genetic makeup (aka genotype) will only be identified because it worked in many people with a similar genetic makeup, many of whom will actually not obtain any reduction in disease risk (NNT greater than 1). It might sound good to say that we've personalized the prescription of a medication because of your genetic makeup, but in reality it is no different than stating that we've personalized the prescription of a blood pressure medication because we found you had high blood pressure.
So when can medicine be individualized? The answer lies in the numbers. For situations in which the outcome of interest occurs often enough in a single person that a statistical investigation is adequately powered, the process of examining associations with other factors can and should be attempted on an individual level. If I'm interested in finding an association with the amount of sleep I get each night, then the best way to find something modifiable, such as reading before bed, is to study data collected on me over a period of time with different amounts of reading. Collecting data on myself and 5000 other people is probably not going to help, and may lead to the flaw of averages, in which I might conclude that 'on average' there is an association, and yet find that when I stop reading before bed it has no effect on the amount of sleep I get each night.
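A sleep experiment like this is sometimes called an 'n-of-1 trial': one person randomizes each night to the intervention or not, logs the outcome, and compares the two sets of nights. A toy version, with an assumed effect size and invented numbers throughout, might look like:

```python
import random

random.seed(3)

# Sketch of an n-of-1 experiment: one person randomizes each night to
# 'read before bed' or not, logs minutes slept, and compares the two
# sets of nights. The +25-minute effect and the noise are invented.
nights = 120
log = []
for _ in range(nights):
    read = random.random() < 0.5
    minutes = random.gauss(420, 40) + (25 if read else 0)
    log.append((read, minutes))

read_nights = [m for r, m in log if r]
other_nights = [m for r, m in log if not r]
diff = sum(read_nights) / len(read_nights) - sum(other_nights) / len(other_nights)
print(f"average extra sleep on reading nights: {diff:.0f} minutes")
```

Because the outcome recurs nightly, a few months of self-tracking yields enough observations for the comparison; adding 5,000 other people's nights would only blur the estimate for this one person.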
It turns out that despite the successful application of evidence-based medicine and statistics to the study of rare, bad diseases, medicine has done very poorly in applying statistics and evidence to the study of recurrent, individualized disease. In some cases, investigators have tried applying the same population-level approaches to these conditions (such as the sleep study above), finding that a given medication works 'on average' and yet fails to work for many of the patients they treat. In other cases, the treating clinician ignores the data and any statistical method and uses a trial-and-error approach to treatment, which could stumble upon a causative agent or effective therapy, but could also fail to identify any true effect due to problems with recall bias and confounding--the very reasons we conduct clinical trials in the first place. The latter situation is even more frustrating in a world of emerging monitors and sensors that can actually quantify these recurrent outcomes, such as sleep at night. In this world, many patients find themselves alone, without subject-matter experts (i.e., doctors) to guide them toward factors that could be causative and away from factors that are more likely confounders. This world also lacks refined and accepted (within the medical community) methods and tools to analyze individual data and make inferences about individual outcomes. The methods exist, but we have no idea how to judge power, or when to decide whether a given treatment is simply not working or more time is needed to collect data before drawing conclusions. It is into this space that our organization seeks to expand medicine and medical research, and to start the conversation about individualized data analysis.
[Note: There are a couple of caveats to the discussion above worth highlighting. First, it is important not to confuse an outcome of interest, such as minutes of sleep or number of headaches, with a surrogate outcome, such as blood pressure. The former is directly relevant; if we can increase the minutes of sleep at night, or reduce the number of headaches each month, then that is a direct benefit to the patient. The latter is not directly relevant, and is only measured because it is a risk factor for something we do care about, such as heart disease. In many cases, improving the surrogate outcome should indeed reduce the risk of the actual outcome of interest--lowering blood pressure should decrease the risk of heart disease; however, there are plenty of cautionary tales from medical research in which a treatment directed at a surrogate outcome actually worsened the outcome of interest--see the CAST study, for one example. A second caveat is that a number of outcomes fall somewhere on the spectrum between occurring rarely (like cancer) and often (like sleep) in an individual. In these situations, certain statistical approaches can combine outcomes from a single person with those from across a population in a way that allows the population to 'inform' the estimates for a given person. One class of these statistical methods is called mixed-effects models, so named because they combine population-level effects (fixed effects) with individual-level effects (random effects). Without going into too much detail, one characteristic of these models is that they weight estimates toward population-average information or toward individual information depending on how much variation there is within and between subjects. When population-level data should be collected to help inform individual conclusions is not well known for many recurrent outcomes.
Finally, although individualized approaches are (by definition) applied at the individual level, in reality the only way to know whether a given algorithm or analytical approach is working across a population is to conduct population-level studies. In that case, the unit of study is the algorithm, and the outcome is singular for each person: improvement in the recurrent condition. To date, there is little agreement about how to compare and validate individualized approaches in a population-testing manner, and you can probably see that the lines begin to blur, especially when we start including deep learning methods like recurrent neural networks.]
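As a toy illustration of the partial pooling that mixed-effects models perform (mentioned in the note above), an individual's estimate can be written as a precision-weighted average of their own data and the population mean. All of the means and variances below are invented for illustration:

```python
# Toy version of the partial pooling done by mixed-effects models:
# an individual's estimate is pulled toward the population mean, with
# the weight set by within-person noise vs. between-person variation.
def partial_pool(individual_mean, n_obs, within_var, population_mean, between_var):
    # Weight on the individual's own data grows with more observations
    # (within_var / n_obs shrinks) and with larger between-person spread.
    weight = between_var / (between_var + within_var / n_obs)
    return weight * individual_mean + (1 - weight) * population_mean

# A person averaging 300 minutes of sleep vs. a population mean of 420.
# With only 2 nights of data, the estimate leans on the population...
few = partial_pool(300, 2, within_var=3600, population_mean=420, between_var=900)
# ...with 50 nights, it trusts the individual's own average.
many = partial_pool(300, 50, within_var=3600, population_mean=420, between_var=900)
print(round(few), round(many))  # → 380 309
```

The model 'shrinks' the noisy two-night estimate most of the way back toward the population, but largely leaves the well-measured fifty-night estimate alone, which is exactly the within- vs. between-subject weighting described in the note.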