Recurrent Neural Networks
"My life seemed to be a series of events and accidents. Yet when I look back, I see a pattern." -- Benoit B. Mandelbrot
Deep Learning
Although, technically speaking, deep learning simply refers to fitting a neural network with more than one hidden layer, for many of us the term 'deep learning' beckons toward an age when computers and data will reveal the inner workings of our lives. 'Deep learning' implies that we can go beyond the 'shallow' world of statistics and data analysis to understand how and why processes in nature take place as they do. To understand, and more importantly, to predict. As we will see in this post, deep learning in many ways simply takes us further into the black box of model fitting; but before we venture downward, let's start with some basics.
Neural Networks
A neural network (or artificial neural network, to distinguish the computer version from the biological one) is basically a collection of nodes, each of which processes an input or group of inputs; when these nodes are combined into multiple layers and connections, the network as a whole can capture nonlinear relationships between inputs that lie beyond the reach of standard statistical models. Like other machine-learning approaches discussed previously, such as support vector machines, neural networks almost always end up fitting more parameters than data inputs, and yet are somehow capable of predicting future data successfully, often without extensive overfitting (which does occur...).
Figure 1. Diagram of a two-layer deep neural network
The concept of neural networks has been around since the 1960s; more recently, however, they have come back with a vengeance thanks to bigger and faster computers, GPUs applied to number crunching, and cloud computing. In addition, a number of tricks have been found for optimization, avoiding overfitting (regularization), and avoiding the mysterious 'vanishing gradient' problem. A full discussion of neural networks is beyond the scope of this post, although there are some excellent resources I have used personally to get up to speed.
Figure 2. Fitting a neural network. Adapted from https://www.embedded-vision.com/platinum-members/cadence/embedded-vision-training/documents/pages/neuralnetworksimagerecognition
As shown in Figure 2, fitting a neural network basically entails adjusting the weights at each neuron so that the loss, defined as how far the network's predicted value is from the label or target, is minimized. The loss is passed back through the network via backpropagation, and the weights are then adjusted in proportion to their impact on the loss, typically with an approach called stochastic gradient descent. This adjustment can be made after each example, or after a batch of several examples averaged together, and is ultimately how a neural net 'learns'. All neural networks, from single-layer fully connected models to reinforcement learning models, involve making a prediction, comparing it with a known label, finding the loss/error, and then using that error to improve the model.
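To make this concrete, here is a minimal sketch in plain numpy of the predict-compare-update loop for a single linear neuron with a mean squared error loss. The data, learning rate, and number of epochs are made up purely for illustration; a real network repeats this same idea across many neurons and layers, and stochastic gradient descent uses one example or a small batch per update rather than the full dataset as done here.

```python
import numpy as np

# Toy data for a single linear neuron: one input feature, one target per example
x = np.array([0.5, 2.0, 1.0])
y = np.array([1.0, 3.0, 2.0])

w, b = 0.0, 0.0   # the weights to be learned
lr = 0.1          # learning rate

for epoch in range(100):
    y_hat = w * x + b                 # make a prediction (forward pass)
    error = y_hat - y                 # compare with the known labels
    loss = np.mean(error ** 2)        # mean squared error loss
    grad_w = np.mean(2 * error * x)   # gradient of the loss with respect to w
    grad_b = np.mean(2 * error)       # gradient of the loss with respect to b
    w -= lr * grad_w                  # adjust the weights to reduce the loss
    b -= lr * grad_b
```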
Figure 3. Analogy of deep learning model to visual recognition of a car. From https://www.slideshare.net/grigorysapunov/deep-learning-and-the-state-of-ai-2016.
So what are deep neural networks learning? The short answer is that it is not always intuitive that neural networks are learning what we think they are. Like SVMs, most classification models learn to identify the decision boundary between two groups, which may or may not correspond to features that a subject-matter expert would identify as characteristic of the groups being separated. In other cases, such as the convolutional neural networks used in image recognition (Figure 3), there appears to be a closer connection to human vision.
Recurrent Neural Networks
Among the more interesting and widely applicable types of neural networks are recurrent neural networks, which have found uses in everything from natural language processing and machine translation to sentiment analysis. These models, unlike static neural networks, are trained on a dynamic label, which can be the future input, some transformed version of the future input (see below), or a label that occurs in the future. These models (Figure 4) come in various configurations, depending on the use and the goal of the model. This dynamic nature makes recurrent neural networks well suited to modeling time-series data, which is the application we will pursue here.
Figure 4. Top: A single recurrent neuron, with input (x) and output (y); with unfolding over time. Bottom: Various configurations of recurrent neural networks, depending on the task. From the excellent blog: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
For reasons that we will not discuss here in detail, it turns out that simple recurrent neural networks applied to many tasks suffer from the problem of vanishing or exploding gradients. One approach investigators have developed to overcome this limitation, while also providing a mechanism for memory of past states within a recurrent network, is the long short-term memory (LSTM) network.
Figure 5. A long short-term memory (LSTM) neuron.
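In Keras, swapping a plain recurrent layer for an LSTM is essentially a one-line change. Below is a minimal sketch; the layer size and input shape are arbitrary values chosen only for illustration.

```python
from keras.models import Sequential
from keras.layers import SimpleRNN, LSTM, Dense

timesteps, n_features = 7, 4   # arbitrary example: 7 time steps, 4 inputs per step

# A plain recurrent network, which can suffer from vanishing/exploding gradients
simple_rnn = Sequential([
    SimpleRNN(32, input_shape=(timesteps, n_features)),
    Dense(1),
])

# The same architecture with an LSTM layer, which adds gated memory of past states
lstm_net = Sequential([
    LSTM(32, input_shape=(timesteps, n_features)),
    Dense(1),
])
```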
LSTM on Fitbit data
There are several excellent online blogs and resources for learning the details of LSTM neural networks. The details of how and why an LSTM network improves on earlier recurrent architectures are beyond the scope of this post; however, we will be using this model in the examples below, applied to our Fitbit data. The code in Python (Jupyter Notebook), as well as summary files as HTML, are available on our Github page. These models were created using the Keras wrapper for TensorFlow, Google's deep learning platform. Keras is an amazing resource for building deep learning models, and I would highly recommend starting with this package when learning deep learning.
The data we will examine, as in prior posts, are from my Fitbit device, and include nightly sleep minutes, steps, exercise (Minutes.Very.Active), floors, and distance. Some of these series are probably derived from others, but they are different enough to include in this analysis. Much of the code for this analysis is based on an amazing post by Jason Brownlee, located here.
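As a rough sketch of the setup, the daily series can be loaded into a pandas DataFrame along these lines. The file name and column names below are assumptions for illustration; the actual notebook on the Github page may use different ones.

```python
import pandas as pd

# Hypothetical file and column names -- adjust to match the actual export
df = pd.read_csv('fitbit_daily.csv', parse_dates=['Date'], index_col='Date')
cols = ['Minutes.Asleep', 'Steps', 'Minutes.Very.Active', 'Floors', 'Distance']
df = df[cols].astype('float32')
print(df.head())
```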
The outcome we will be examining is minutes of sleep, and we will be building a model to predict nightly sleep from steps, exercise (Minutes.Very.Active), floors, and distance (see Figure 6). Keep in mind that a night of sleep precedes the activity recorded for that day: minutes of sleep at time t refers to the sleep that comes before the activity on day t (e.g., steps on day t). This matters because, to predict the amount of sleep in a given night, we need to know steps, etc., on day t-1; steps on day t cannot predict sleep on night t, since they occur afterward.
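One way to line the series up this way, continuing the hypothetical DataFrame above, is to shift all of the measurements forward by one day so that each row pairs a night of sleep with the previous day's data (including, as in the Brownlee-style framing, the previous night's sleep); this is just an illustrative sketch.

```python
# Predict sleep on night t from the measurements on day t-1
supervised = df.shift(1).add_suffix('(t-1)')          # all series lagged by one day
supervised['Minutes.Asleep'] = df['Minutes.Asleep']   # target: that night's sleep
supervised = supervised.dropna()                      # drop the first row, which has no prior day
```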
Figure 6. Time series for each parameter from Fitbit data
As with all machine-learning approaches, the first step is to split our data into training and testing sets. Here we will use the first 75% of the data for training and test the model on the last 25%. Our model will have a single LSTM layer of 50 neurons, Adam optimization, and a mean absolute error (MAE) loss function, and we will use a batch size of 72 trained over 50 epochs (these are simply the values used in the MachineLearningMastery post). Figure 7 shows the decrease in error on both sets with each training epoch. Figure 8 shows the predicted sequence, which has a root mean squared error (RMSE) of 86.7. Although we do not know exactly how much the model weights each input, we can infer from the graph that it is using the prior night of sleep to some degree, since the predictions appear shifted relative to the prior night.
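A sketch of this model in Keras, continuing from the hypothetical `supervised` frame above, might look like the following. Scaling of the inputs (often applied in such pipelines) is omitted here for brevity, and the variable names are illustrative rather than taken from the actual notebook.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

values = supervised.values
X = values[:, :-1].reshape(len(values), 1, values.shape[1] - 1)  # (samples, 1 time step, features)
y = values[:, -1]                                                # minutes of sleep

split = int(len(X) * 0.75)              # first 75% for training, last 25% for testing
train_X, test_X = X[:split], X[split:]
train_y, test_y = y[:split], y[split:]

model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))  # one LSTM layer of 50 neurons
model.add(Dense(1))                                                    # predicted minutes of sleep
model.compile(loss='mae', optimizer='adam')                            # MAE loss, Adam optimizer

history = model.fit(train_X, train_y, epochs=50, batch_size=72,
                    validation_data=(test_X, test_y), shuffle=False, verbose=2)
```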
Figure 7. Training error over training epoch.
Figure 8. Actual (blue) and predicted (orange) sleep predicted by the model.
As above, this simple model tells us how well a night of sleep can be predicted from the prior day's activity. However, we can just as easily build a model with multiple days leading up to the night of sleep. The model shown in Figure 9 uses the prior 7 days of data to predict sleep. As we can see, the model seems to bounce around a bit before settling into a minimum (Figure 9).
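One way to build these lagged inputs, again as an illustrative sketch on the hypothetical `df` above, is to stack the previous 7 days of all series into each sample, so that the LSTM's input shape becomes (7, 5) instead of (1, 5).

```python
import numpy as np

n_days = 7
values = df.values                      # columns: sleep, steps, exercise, floors, distance

X, y = [], []
for t in range(n_days, len(values)):
    X.append(values[t - n_days:t, :])   # the 7 days leading up to night t
    y.append(values[t, 0])              # minutes of sleep on night t (first column)
X, y = np.array(X), np.array(y)         # X has shape (samples, 7, 5)

# The rest of the model is unchanged apart from input_shape=(7, 5)
```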
Figure 9. Fitting a model with 7 days of lagged values.
Figure 10. Predicted minutes of sleep from model with 7 days of lagged data
The RMSE for the 7-day model is 75.2, which is slightly better than the model with only 1 day. As shown in Figure 10, it appears less influenced by the prior night of sleep than the 1-day model, with less fluctuation in general. Of course, this is interesting only until we consider that the RMSE from simply assuming the long-term average sleep (440 min/night) over the test set is 70.1. In fairness to our LSTM model, we did not perform any tuning of the hyperparameters (learning rate, number of layers, batch size, etc.) in training it, so let's not throw deep learning out the window just yet. In future posts, we'll examine some ways to tune these networks, as well as try to understand what's in the 'black box' that the model is learning from our data.
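As a sanity check on numbers like these, the constant baseline is easy to compute. Here is a sketch reusing the hypothetical `model`, `test_X`, and `test_y` from the earlier snippets; the exact values will of course depend on the data.

```python
import numpy as np

# Naive baseline: predict the long-term average (~440 min/night) for every test night
baseline = np.full_like(test_y, 440.0)
baseline_rmse = np.sqrt(np.mean((test_y - baseline) ** 2))

# The LSTM's RMSE on the same test set
preds = model.predict(test_X).ravel()
model_rmse = np.sqrt(np.mean((test_y - preds) ** 2))

print('Baseline RMSE: %.1f   LSTM RMSE: %.1f' % (baseline_rmse, model_rmse))
```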