4. Analytics
Figure 1. Data Model for an individualized analysis approach.
Big Picture Analytical Approach
Multiple previous posts and sections have delved into the nitty-gritty of individualized data analysis (this whole site, in many ways, is built around it), so in this section we won't wander into the weeds so much as provide a broad overview, highlighting specific issues that arise in the analysis of individualized data. As we will see, these issues are not unique to our analyses and exist in data analysis at all levels; however, many are less well worked out at the individual level.
As a start, Figure 1 provides the overall data model for the individualized data analysis approach. This model includes several levels of analysis, with each level interacting with the others to inform analyses and provide feedback. We will begin by walking through this figure to obtain a broad understanding of how individualized data analysis fits into the larger picture, and then examine a few additional issues in data analysis.
Individual Level
The entire process is essentially driven by the data collection and analysis that take place on the individual level. It is important to recognize that this makes the approach entirely distinct from standard medical research, in which data collection takes place at the individual level but analysis occurs only at the aggregate level. Data is collected by a log, journal, monitor, or sensor device (see Section #1), and is then analyzed in the context of a clinical question (see Section #3). Based on the analysis, an intervention is recommended, with additional data collected afterwards charting the improvement, or lack thereof, in the outcome of interest following the intervention. This analysis does not require any information beyond what is collected for that individual, and thus can be done entirely on the edge of the interface if desired (see Section #2).
This characteristic cannot be overstated. The entire individualized data analysis process can be done for a single person, without ever sharing data between people or with a third-party provider for storage, additional analysis, and so on. The goal of individualized data analysis is to provide guidance for individuals, and so if there is enough information collected for a single person to draw a statistical inference, then no additional aggregation of data is needed. Of course, how much data we need to draw a statistical inference, or to 'power' our study, is not well defined. As discussed elsewhere, this is primarily determined by the signal-to-noise ratio of the particular outcome we're studying, and can differ between people and between outcomes. If there is not enough data to sort the signal from the noise at the individual level, then we must move to the aggregate level if we hope to draw any inferences.
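As a rough sketch of how the signal-to-noise ratio drives data requirements, the example below uses the standard normal-approximation sample-size formula for detecting a mean shift. The `days_needed` helper and the specific alpha and power values are illustrative assumptions, not part of our process.

```python
# Illustrative only: approximate number of observations an individual must
# collect before a mean shift of `snr` noise-standard-deviations is detectable,
# using the normal-approximation sample-size formula
#     n ~ ((z_{1-alpha/2} + z_{power}) / SNR)^2
import math
from scipy.stats import norm

def days_needed(snr, alpha=0.05, power=0.80):
    """Hypothetical helper: observations needed at a given signal-to-noise ratio."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil((z / snr) ** 2)

for snr in (1.0, 0.5, 0.2):
    print(f"SNR {snr}: ~{days_needed(snr)} observations")
# SNR 1.0: ~8, SNR 0.5: ~32, SNR 0.2: ~197 -- noisier outcomes quickly require
# more data than one person can reasonably collect, which is what pushes the
# analysis toward the aggregate level.
```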
Aggregate Level
There are several reasons to move from the individual level to the aggregate level in individualized data analysis. The first, highlighted above, is the situation in which we have insufficient information from individual data collection to draw a statistical inference. As we reviewed in a previous post, we make the move from individual data collection to aggregate, or population-level, data collection to increase the power to draw statistical inferences. Work with mixed-effects models has also demonstrated that for outcomes with wider within-individual variance than between-individual variance, analysis at the population level is more likely to be informative.
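To make the within- versus between-individual comparison concrete, here is a minimal sketch on simulated tracking data; the variable names and simulated variances are assumptions for illustration, not results from any real dataset.

```python
# Illustrative sketch: compare day-to-day (within-individual) variance with the
# variance between people's averages on simulated tracking data.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_days = 20, 60
person_means = rng.normal(7.0, 0.3, size=n_people)            # small between-person spread
daily = person_means[:, None] + rng.normal(0.0, 1.2, size=(n_people, n_days))  # noisy daily values

within_var = daily.var(axis=1, ddof=1).mean()   # average variance of each person's own data
between_var = daily.mean(axis=1).var(ddof=1)    # variance of the per-person averages

print(f"within-individual variance:  {within_var:.2f}")
print(f"between-individual variance: {between_var:.2f}")
# When within-individual variance dominates (as simulated here), a single
# person's record is too noisy on its own, and pooling across people adds power.
```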
There are other reasons to move data to the aggregate level, however. For one, population-level inferences can serve as priors for individual-level analyses until enough individual data can be collected to overcome the prior expectation. This analytical approach is called Bayesian because it applies Bayes' theorem, combining the likelihood of the observed data with prior information to identify a posterior distribution of outcome probability. This approach benefits individualized data analysis in two ways: 1) it allows us to incorporate prior information from other people with similar outcomes until enough data is collected, and 2) it allows a seamless transition to individualized predictions as enough data accumulates over time. Figure 2 shows how a prior can be used in a Bayesian analysis to dynamically adjust the estimated probability of heads for a biased coin.
Figure 2. Example of Bayesian analysis applied to a coin flip. In both examples, the prior probability distribution (top) favors a biased coin that comes up heads 75% of the time (mode = 0.75). Initially (middle left), after only 5 flips (2 heads), the posterior distribution (bottom) retains a mode of 0.69, since it is still dominated by the prior. However, after 500 flips (200 heads; middle right), the posterior distribution favors tails, with a mode of 0.415 for heads.
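As a sketch of the update behind Figure 2, the beta-binomial model below uses an assumed Beta(19, 7) prior; the exact prior used to generate the figure is not stated, and these parameters are chosen only because their mode is 0.75 and they reproduce numbers close to those in the caption.

```python
# Beta-binomial sketch of the coin-flip example in Figure 2. The Beta(19, 7)
# prior is an assumption (its mode is 0.75, matching the caption); the figure's
# actual prior parameters are not given.

def beta_mode(a, b):
    """Mode of a Beta(a, b) distribution, valid for a, b > 1."""
    return (a - 1) / (a + b - 2)

def update(a, b, heads, tails):
    """Conjugate update: Beta prior + binomial data -> Beta posterior."""
    return a + heads, b + tails

prior_a, prior_b = 19, 7
print(round(beta_mode(prior_a, prior_b), 2))            # 0.75 (prior mode)

a, b = update(prior_a, prior_b, heads=2, tails=3)       # 5 flips
print(round(beta_mode(a, b), 2))                        # ~0.69, the prior still dominates

a, b = update(prior_a, prior_b, heads=200, tails=300)   # 500 flips
print(round(beta_mode(a, b), 3))                        # ~0.416, the data now dominate
```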
Bayesian approaches are only possible if data can be collected at the aggregate level to create the priors that inform individualized analyses. Of course, how these methods are best applied across outcome types and data collection patterns will likely require more study.
Finally, the aggregate level also allows us to answer broader 'scientific' questions in the standard population-level approach, with data moved up to a research level.
Research Level
The research level is where our standard research approaches are applied, and where additional insight about science and behavior can be studied. Figure 1 identifies three potential areas of research that can specifically be applied to individual-level data, although there are plenty of others. '-omics' research refers to the study of genomics, proteomics, or metabolomics, which are large-scale biological data streams based on genes, circulating proteins, and circulating metabolites, respectively. These studies require large numbers of subjects to identify patterns and trends, owing to the inherent intra- and inter-individual variance in the data.
Outcomes, or comparative-effectiveness, research studies how particular people respond to a certain treatment. It can be applied to subsets of individuals with particular patterns of disease, and is thus much enhanced in individuals in whom large amounts of data can be collected. Inferences from these studies can identify patients more or less likely to respond to a given treatment, as well as patients more likely to be given a specific treatment (a behavioral study).
Finally, machine learning, which can be applied at any level, generally requires large numbers of subjects and data points to train and test models, and thus occurs at the research level. Through aggregation of high-density individual data, more information is available for model training and testing. However, how this type of data is best engineered to extract the largest amount of information remains less well defined. In a recent publication (currently in submission), we explored this issue using cardiac device data.
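As one hedged illustration of what 'engineering' high-density individual data can mean, the sketch below collapses each person's raw time series into per-person summary features before modeling; the column names and summary statistics are hypothetical and are not those used in the publication mentioned above.

```python
# Hypothetical example of engineering high-density individual data: collapse
# each person's raw time series into per-person summary features suitable for
# model training. Column names and summaries are illustrative only.
import pandas as pd

raw = pd.DataFrame({
    "person_id":  [1, 1, 1, 2, 2, 2],
    "heart_rate": [62, 75, 80, 55, 58, 90],
})

features = (
    raw.groupby("person_id")["heart_rate"]
       .agg(hr_mean="mean", hr_std="std", hr_max="max")
       .reset_index()
)
print(features)
# Whether summaries like these, or raw sequences fed to a more flexible model,
# extract more information is exactly the open question described above.
```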
Interpretability of Analysis
A final broad categorization of challenges on the analytical level concerns interpretability. There is a famous aphorism, usually attributed to the statistician George Box, that goes, "All models are wrong, but some are useful." Implicit in this statement is that the point of an analytical model is to provide useful information to the user, not to provide a statistical p value so that the analytical team can publish results. Nowhere is this principle more relevant than on the individual level. As we discuss analytical approaches to individualized data, there are several places where interpretability of the analysis will be critical, which we discuss below.
The black box of machine learning
One of the strengths of machine learning models, particularly deep learning models, is that they are able to expand the space of variables in a way that captures the various nonlinearities that exist among predictive factors. To put it more plainly, machine learning models are vastly more flexible than standard regression approaches for modeling a complex system, and one of the drawbacks of that added flexibility is that we often lose track of which information is most critical to capturing the underlying features of the system. The result is that we are often left with a 'black box' situation in which we know that a model is capable of successfully predicting outcomes in a system, but we do not understand what information it is using to make those predictions. The black box grows blacker the more complex the model, starting with random forests and similar ensemble approaches, followed by support vector machines, and culminating in deep learning models built from hundreds of thousands of parameters. While there are methods to obtain some understanding of which features are more or less important in a model (most use a jackknife-style approach of systematically leaving out variables and measuring predictive performance without each one), these approaches still provide little more than a general idea of how important a variable is.
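To make the leave-one-variable-out idea concrete, here is a minimal sketch on synthetic data; the random forest, dataset, and accuracy scoring are illustrative assumptions rather than any specific pipeline discussed here.

```python
# Minimal sketch of leave-one-variable-out importance: refit the model without
# each feature and see how much cross-validated accuracy drops. Synthetic data
# and model choices are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)

baseline = cross_val_score(model, X, y, cv=5).mean()
for j in range(X.shape[1]):
    score = cross_val_score(model, np.delete(X, j, axis=1), y, cv=5).mean()
    print(f"feature {j}: accuracy drop {baseline - score:+.3f}")
# A large drop says a feature matters, but not how it combines with the others
# inside the model -- the box stays black.
```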
For example, I can create a deep learning model to determine the probability of having a headache (see #3 for details), and I might be able to determine that sleep quantity is an influential factor in the model's predictions. However, I will have no way of knowing whether it is sleep quantity itself that is predictive, or sleep quantity combined with water intake and alcohol consumption. From an analytical standpoint, we may find that the deep learning model is a better predictor of headache frequency, but that an inferior prediction model based on time series regression, in which we can explicitly understand these relationships, is a better model for our purposes (modifying behavior). How much predictive performance we are willing to give up is an open question, but one that we must consider in any analytical approach. It also likely depends on the goal of the analysis: the simpler regression model may be more useful for understanding risk factors with the goal of modifying lifestyle, whereas a more complex, black-box model may be more useful if we want to develop a learning data entry user interface that enables more consistent entry by the user.
Analytical trade-offs: precision vs. accuracy, the bias-variance trade-off, and sensitivity vs. specificity
Precision is generally defined as the ability to make reproducible predictions given the same data multiple times. Accuracy is the ability of those predictions to closely model reality. Ideally, we want a model that can do both, but oftentimes we must choose one over the other. In machine-learning terminology, we also have to consider the bias-variance trade-off, in which a model that fits the training data well (low bias) is often less accurate at predicting future data (high variance); conversely, a model that does not fit a specific dataset as closely (high bias) may be better at predicting future data (low variance). Finally, for predictive models, particularly when choosing a decision cutoff, we often have to choose between a model that is sensitive, meaning it is unlikely to miss true events (low false-negative rate), and a model that is specific, meaning the events it predicts are likely to be true events (low false-positive rate). All of these are analytical trade-offs that we must make in creating a model, and they depend on the specific use of the model.
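The cutoff trade-off is easy to see numerically; the sketch below uses simulated risk scores (the score distributions and cutoffs are assumptions for illustration) to show sensitivity falling as specificity rises.

```python
# Illustrative only: how moving a decision cutoff trades sensitivity against
# specificity, using simulated risk scores for 1,000 people.
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)                 # 1 = event occurred, 0 = no event
scores = np.where(y_true == 1,
                  rng.normal(0.65, 0.15, 1000),        # events tend to score higher
                  rng.normal(0.40, 0.15, 1000))

for cutoff in (0.3, 0.5, 0.7):
    pred = scores >= cutoff
    sensitivity = (pred & (y_true == 1)).sum() / (y_true == 1).sum()    # true-positive rate
    specificity = (~pred & (y_true == 0)).sum() / (y_true == 0).sum()   # true-negative rate
    print(f"cutoff {cutoff}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
# Lower cutoffs catch nearly every event (high sensitivity) at the cost of false
# alarms; higher cutoffs reverse the trade, as in the examples that follow.
```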
For example, if our goal is to create a model predicting a bad outcome, such as a heart attack, we generally want a model that is accurate, with low variance and high sensitivity, since the consequences of missing a heart attack are much greater than the consequences of overcalling heart attacks in people who later turn out to be fine. In contrast, a model for predicting nightly sleep is probably better off being precise, with low bias and greater specificity, since it is likely to be applied by an individual to identify factors that can be tested to improve sleep. Obviously, in both cases we would like the best of all worlds, and we certainly work toward this in our model design, but the trade-offs we accept are ultimately questions of interpretability.
User understanding
Finally, we cannot forget about our users, who, be they providers or patients, are unlikely to have a deep understanding of the analytical methods and the nuances inherent in each approach. For obvious reasons, a perfectly predictive model that cannot be understood by its users is utterly useless. Of course, this does not mean that we should limit ourselves to the lowest common denominator of users and apply only the simplest models to our data. It does, however, mean that for each approach we need a companion method for explaining the model and how it can be interpreted in simple, nontechnical terms.
Education of users would appear to be at the forefront of this issue, which merges with the issues discussed in the user interface section (see #2) of our process for delivering individualized medicine to a population. As with scientific and medical knowledge in general, there are multiple levels to the information contained in each analytical approach, and therefore multiple levels at which we should develop our educational methods. For example, we might develop a model to predict weight loss for an individual patient receiving care from a provider. On the patient level, we may need to convey simply that daily exercise of over 30 minutes and calorie intake of fewer than 2,000 calories per day were the most predictive of weight loss for that patient, so that they can adjust their lifestyle to meet these goals. On the provider level, we will probably also want to include that the relative effect of exercise accounts for 5 lbs lost or gained and caloric intake accounts for 2 lbs lost or gained, so that the provider can create a weight loss plan emphasizing exercise first, then caloric intake. The provider's understanding should also include features of the model, such as a correlation with serum thyroid hormone levels, which may be monitored in addition to the above to guide medication use if relevant. Of course, some of the 'provider' information may also be useful to the patient, and only through feedback (discussed next, #5) can we modify the information to make it most useful.