Machine Learning Teaser
Machine learning is everywhere. It tells us the fastest way to get somewhere in Google Maps. It determines which videos Netflix recommends to us. It determines which ads are presented to us on our Facebook page. It seems like these days everything will eventually be going the way of the machines. Of course, in reality there are many other places where either machine learning, or its dystopian alter ego, artificial intelligence, just hasn't quite made it to prime time (often with the hype exceeding the reality). And regardless of your long-term views on machine learning and a happy society, it is coming soon to an easily automated task near you. In this teaser, we'll give an overview of some basic machine learning approaches that can be applied to individualized data. Unlike past posts, we'll keep this one at a high level, with a plan to dive into code and specific applications in future posts. Hopefully you will be able to appreciate that for all the fancy names, like boosted trees, lasso regression, and deep convolutional networks with early stopping, these methods rely on a shared set of concepts that apply to all machine learning processes (even if it does mean we'll soon be replacing ourselves...).
What is Machine Learning?
Before we dive into complex algorithms, it's worth mentioning a couple of key concepts around what constitutes machine learning. For an excellent introduction to these concepts, I'd highly recommend the textbook An Introduction to Statistical Learning, by James, Witten, Hastie, and Tibshirani.
At a basic level, learning takes place when we compare the predicted values from a model against a known outcome, and then adjust the model based on the difference. The model might be something as simple as predicting today's temperature based on yesterday's temperature (obviously, we'd learn pretty quickly that this model isn't very good), or it could be a prediction from a model that includes a range of predictors (also called features) such as cloud patterns, season, barometric pressure, etc. Either way, learning occurs when we identify a difference between the expected (modeled) value and the actual value. This difference is called the cost (or error), and it can be measured using a variety of functions, such as mean squared error, mean absolute error, or cross-entropy, to give a few examples.
Cost (error) = function(actual value - predicted value)
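To make this concrete, here is a minimal sketch in Python (using NumPy and made-up temperature values) showing how the same set of residuals can be scored with two different cost functions:

```python
import numpy as np

# Hypothetical actual and predicted temperatures (degrees C)
actual = np.array([21.0, 23.5, 19.0, 25.0])
predicted = np.array([20.0, 24.0, 18.0, 26.5])

residuals = actual - predicted

# Two common cost functions applied to the same residuals
mse = np.mean(residuals ** 2)      # mean squared error
mae = np.mean(np.abs(residuals))   # mean absolute error

print(f"MSE: {mse:.2f}, MAE: {mae:.2f}")
```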
Of course, for learning to take place, we must have a method for modifying our model based on the feedback provided by the cost. The modifications can range from switching to an entirely different model to making minor changes to the values of the model's parameters or hyper-parameters. One could make the distinction that what makes it 'machine' learning is that an algorithm implements the changes, rather than our making them manually, although as we'll see shortly, for most of these approaches we end up using some hybrid of automated optimization and manual adjustment.
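As a toy example of automated adjustment, here is a sketch (again with made-up numbers) of a one-parameter temperature model whose weight is repeatedly nudged in the direction that reduces the mean squared error -- the basic idea behind gradient descent, one common automated optimization approach:

```python
import numpy as np

# Hypothetical data: yesterday's and today's temperatures (degrees C)
x = np.array([18.0, 21.0, 23.0, 19.5, 22.0])   # yesterday
y = np.array([19.0, 22.5, 22.0, 20.0, 23.5])   # today

w = 0.0      # the single model parameter: today ~ w * yesterday
lr = 0.001   # learning rate, a hyper-parameter we still choose by hand

for step in range(200):
    error = y - w * x                     # actual minus predicted
    gradient = -2 * np.mean(error * x)    # derivative of the MSE with respect to w
    w -= lr * gradient                    # automated adjustment of the model

print(f"learned weight: {w:.3f}, final MSE: {np.mean((y - w * x) ** 2):.3f}")
```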
The other key concept of learning that might be different from standard statistical approaches is that it almost always requires some form of data splitting. What this refers to is taking the full dataset and breaking it into separate parts for building and testing our models, called (obviously) training, testing, and validation sets. The distinction between testing and validation is a bit semantic, but the main idea is that we build and develop our model using one part of the data, and then test it on another part. There are a variety of approaches to data splitting, including k-fold cross-validation, leave-one-out validation (also a method of determining 'leverage', a key concept in regression), and simple hold-out splits. The differences are not important now, but we'll encounter a number of these approaches going forward. The key thing to keep in mind is that unlike a lot of statistical models that use internal statistics to describe the fit (such as the R-squared value), most machine learning requires matching a forecasted or predicted value against a known value to determine validity.
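For example, assuming scikit-learn and a simulated dataset standing in for a real one, a basic train/test split and a 5-fold cross-validation might look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Simulated data: 200 observations, three predictors (features)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("held-out R-squared:", model.score(X_test, y_test))

# 5-fold cross-validation: every observation is used for both training and testing
print("cross-validation scores:", cross_val_score(LinearRegression(), X, y, cv=5))
```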
Variance and Bias
To most of us trained in epidemiology or basic statistics, the terms 'variance' and 'bias' refer to somewhat different concepts than they do to a data scientist applying machine learning: variance refers to the squared deviation from the mean, and bias refers to the difference between the true population measure and the sample measure (for example, selection bias occurs when that difference is due to selecting a sample that differs from the underlying population being assessed).
Variance as applied in machine learning essentially asks, "how well will this model hold up when applied to new data?" A model with high variance (a bad thing) will tend to perform poorly when exposed to new data, a well-recognized phenomenon called overfitting. Overfit models are generally more complex because they fit noise, although it is not necessarily true that a more complex model is overfit (and has high variance). Bias here essentially asks, "how well does the model fit the data?" A model with high bias will tend to fit the data poorly, while a model with low bias will tend to fit the data well. A linear regression model is a good example of a model with high bias and low variance -- it is unlikely to fit the data all that well, but it can be applied to similar data without getting much worse. On the other hand, a complex nonlinear model with close to as many parameters as measurements may provide an excellent fit to the data (low bias), but because of overfitting it will not transfer well to a new dataset (high variance). The trade-off between these two concepts provides the key constraint on finding the best models using learning methods, and many approaches are designed explicitly to balance them.
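One way to see the trade-off is to fit models of increasing complexity to the same noisy data and compare the error on the training data with the error on held-out data. A rough sketch, assuming scikit-learn and simulated data, is below: the straight line underfits (high bias), while the high-degree polynomial fits the training data closely but does worse on the held-out points (high variance).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated noisy nonlinear data
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=60)
y = np.sin(x) + rng.normal(scale=0.3, size=60)
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, held-out MSE {test_err:.3f}")
```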
The Machine Learning Toolbox
It is important to understand that, like other statistical approaches, the various methods of machine learning are best thought of as a 'toolbox' of approaches rather than one particular method (although most people would probably consider the term 'deep learning' to imply multi-layer neural networks). One of the challenges when I was learning about these approaches was understanding that no single method is best for every problem, and one of our jobs as users of these methods is figuring out which method(s) we should consider.
Like standard statistical approaches, the first step is to understand what kind of outcome we're looking to predict (binary, continuous, categorical, etc.), although an additional consideration in machine learning is the amount of data available. For example, artificial neural networks (ANN) are amazingly accurate when there are large amounts of data available for analysis. The classic ANN dataset (used in most tutorials) is the MNIST dataset of handwritten digits, which contains 70,000 labeled examples. On the other hand, ANN will generally perform poorly on a smaller, limited dataset, as one might obtain for an epidemiological study with hundreds of individuals and 5-10 measures each (although efforts to apply ANN to these smaller datasets are ongoing).
Keeping this in mind, I've listed below a few methods that should probably be part of most machine learning toolboxes. Bear in mind, any model can be taught to 'learn' in an automated fashion, and can thus constitute a machine learning approach.
Machine Learning Methods
Ridge/Lasso Regression
One of the first problems that machine learning seeks to address is feature selection. For many machine learning datasets, there are a large number of possible predictors (also called features--think 'x' variables) that could potentially be used to model the predicted variable (think 'y' variable). Many features are either poor predictors or collinear with other predictors, and thus should be either removed or down-weighted in a model. Most of us with some statistics training have used forward or backward stepwise regression to select predictors for a model, although these approaches have limitations for large sets of predictors. Ridge and lasso (least absolute shrinkage and selection operator) regression are also called shrinkage approaches, as they apply a penalty term to the model fit that 'shrinks' the regression weights (also called beta coefficients) toward zero according to how much they contribute to the fit. Ridge regression shrinks the weights of less important predictors, but does not generally drive them all the way to zero, which would effectively remove those predictors from the model. Lasso regression, in addition to shrinking the weights, can actually set the weights of unimportant predictors to exactly zero, removing them from the model entirely. Lasso regression is thus similar to stepwise regression in that it removes unimportant predictors from the final model, although the method is different.
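A small illustration, assuming scikit-learn and simulated data in which only the first two of ten features actually matter:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Simulated data: ten candidate features, only two with real effects
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha controls the strength of the penalty
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge weights:", np.round(ridge.coef_, 2))  # shrunk, but rarely exactly zero
print("lasso weights:", np.round(lasso.coef_, 2))  # unimportant features set to zero
```

In practice the penalty strength (alpha here) is itself usually chosen by cross-validation rather than set by hand.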
Decision Trees
Decision trees are generally applicable in situations where we have a binary or categorical outcome and want to identify the factors that push the prediction one way or another. These methods are part of the broad category of classification techniques, which includes logistic regression, linear discriminant analysis, naive Bayes classifiers, and support vector machines (see below). Decision tree learning methods, which also include the more specialized methods of bagging (short for bootstrap aggregation), random forests, and boosting, use the features as decision points in a process that ultimately leads to the outcome. The more sophisticated variants (bagging, boosting, and random forests) use repeated resampling or iteration to build multiple decision trees that are then averaged together in order to decrease variance (reduce overfitting). These latter methods are less interpretable because several trees are averaged, but they are very useful for feature selection.
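As a sketch of what this looks like in practice, here is a random forest fit to one of scikit-learn's built-in binary classification datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Built-in binary outcome dataset (malignant vs. benign tumors)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A random forest averages many decision trees grown on bootstrap samples
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

print("held-out accuracy:", forest.score(X_test, y_test))
# The importance scores are what make these methods useful for feature selection
print("top feature importances:", sorted(forest.feature_importances_, reverse=True)[:3])
```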
Support Vector Machines
Despite the fancy high-tech name, the 'support vectors' in support vector machines (SVM) are simply a group of points in multi-dimensional parameter space that define a decision boundary for classifying a predicted outcome value. Like trees, SVM are used for classification and are most easily applied to categorical or binary data, although unlike trees, interpretation often requires understanding the parameter space in multiple dimensions. (Note: there is a version called support vector regression in which the vectors are used to identify the best model for a continuous outcome--this is very similar to leverage points in linear regression.) The concept behind SVM comes from merging two ideas: the maximal margin classifier, a line (or hyperplane) that divides groups in parameter space, and kernel functions, which transform the predictors nonlinearly. Although the mathematics are more complex, the idea behind SVM is that by transforming the predictors into a higher-dimensional parameter space with nonlinear kernels, boundaries can be identified that separate categories of outcomes. SVM are generally thought of as an out-of-the-box approach that works well for classification without a ton of manipulation.
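A minimal sketch, assuming scikit-learn and a simulated 'two moons' dataset that no straight line can separate:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a linear boundary
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the predictors into a higher-dimensional space
# where a separating boundary can be found
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)

print("held-out accuracy:", svm.score(X_test, y_test))
```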
Artificial Neural Networks
Neural networks were built from the idea that the human nervous system can be conceptualized as a series of weighted decision points which, in sum, can be used to predict an outcome. The power of neural networks is that they can 'learn' to solve very complex problems using fairly simple algorithms, in which the cost or error is backpropagated through the network to update the weights of each 'neuron'. Because the number of neurons can be in the hundreds or thousands, these approaches tend to be computationally demanding, and they also tend to require large amounts of data to train successfully. The most straightforward neural networks are feedforward networks, in which the input signal is progressively weighted as it passes through the network, ultimately reaching the output layer, where it can be used to predict any variable type, from binary and categorical to continuous, as long as the cost can be calculated. Shallow (also called vanilla) neural networks contain only a single middle layer of neurons (called the hidden layer), while networks with more than one hidden layer are generally called deep neural networks. There are many variations, from deep convolutional neural networks to long short-term memory networks, and many of these are so complex that they require computing clusters or GPUs (graphics processing units) to train.
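As a small sketch, here is a shallow feedforward network trained with scikit-learn's MLPClassifier on its built-in digits dataset (a much smaller stand-in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small built-in dataset of 8x8 handwritten digit images
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 64 neurons: a shallow (vanilla) network.
# Adding more entries to hidden_layer_sizes would make it 'deep'.
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X_train, y_train)   # weights are updated by backpropagating the error

print("held-out accuracy:", net.score(X_test, y_test))
```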
Machine Learning and Individualized Data Analysis
As we move ahead, we will plan to examine many of these approaches in the analysis of individual-level data. Key issues with individualized data analysis include the heavy computing power needed to train some of these approaches, as well as the aspects of time series data, such as seasonality and autocorrelation, that we've encountered thus far. In some cases we may try applying out-of-the-box approaches like SVM to our data; in others we might simply use the concepts of cost calculation, the variance/bias trade-off, and feature selection within our basic ARIMA and HMM models. We will plan to address these issues in future posts. Nonetheless, the tools of machine learning hold much potential for scaling methods of individualized data analysis.