Geometric Brownian Motion (GBM) is a stochastic process that can be used to model stock prices. It's related to random walks and Markov chains. This is a very different way to look at time series than what I explored in my Time Series Predictions post, and given the recent market volatility it seems especially timely to take a closer look at it.
There are detailed explanations of the math and theory elsewhere. I've just written some code to see it in action. This generates 100 simulations for the S&P 500 ETF with the ticker SPY, modeling the past year:
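A minimal sketch of such a simulation is below; the starting price, drift, and volatility are placeholder values rather than estimates fit from an actual year of SPY data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder parameters -- in practice these would be estimated
# from a year of SPY daily log returns.
s0 = 290.0      # assumed starting price
mu = 0.05       # assumed annualized drift
sigma = 0.35    # assumed annualized volatility
n_days = 252    # one trading year
n_sims = 100
dt = 1 / n_days

rng = np.random.default_rng(0)

# GBM step: S_{t+dt} = S_t * exp((mu - sigma^2/2) * dt + sigma * sqrt(dt) * Z)
z = rng.standard_normal((n_sims, n_days))
log_returns = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
paths = s0 * np.exp(np.cumsum(log_returns, axis=1))

plt.plot(paths.T, linewidth=0.5)
plt.title("100 GBM simulations of SPY (placeholder parameters)")
plt.xlabel("Trading day")
plt.ylabel("Price")
plt.show()
```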
What's especially interesting today is that the current value of SPY is on the extreme lower end of what we see simulated. Even after increasing the number of simulations from 100 to 1,000 or more, we are still on a rare path.
There are some known problems with the model that may help explain this. First, GBM assumes constant volatility, and volatility in financial markets isn't really constant. Second, the randomness in GBM is normally distributed, but we know that stock returns are not: they have fatter tails, or higher kurtosis. Third, stock prices react to specific geopolitical events that are anything but random, sometimes even opening at a different level than the previous day's close.
Out of curiosity, and because the second issue is the easiest to tweak, I replaced the normal distribution with a Laplace distribution and saw a slightly wider dispersion of results (in both directions). In reality, though, the 52-week range is 218.26 - 339.08, so we still aren't capturing the extremes we've witnessed.
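Swapping in the Laplace distribution is a small change to the sketch above (it reuses the same setup); the scale of 1/sqrt(2) keeps the variance at 1 so the draws are comparable to the standard normal ones:

```python
# Same setup as the GBM sketch above, but with Laplace-distributed shocks.
# Scale b = 1/sqrt(2) gives unit variance, matching the standard normal.
z = rng.laplace(loc=0.0, scale=1 / np.sqrt(2), size=(n_sims, n_days))
log_returns = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
paths = s0 * np.exp(np.cumsum(log_returns, axis=1))
```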
Any ideas on what's going on here? It must have something to do with the massive increase in volatility at the end of what was otherwise a calm year. Please comment.
Saturday, March 28, 2020
Saturday, March 14, 2020
Time Series Predictions
Time series data is just a series of observations ordered in time. As simple as it sounds, there are important differences when analyzing time series data vs. cross-sectional data. This post will attempt to cover enough basics, from statistics and machine learning, to get to a point where we can forecast future observations.
First, some terminology. Data is autocorrelated when there are similarities between an observation and previous observations. Seasonality is when the similarities occur at regular intervals. Trend is a long-term upward or downward movement. And data is stationary when its statistical properties, like mean and variance, do not change over time.
I'll generate data with these characteristics to use for the rest of the post:
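A sketch of what that generation could look like, using a linear trend, a sine wave for seasonality, and an AR(1)-style noise term for autocorrelation (the exact coefficients are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

n = 365
index = pd.date_range("2019-01-01", periods=n, freq="D")

trend = 0.05 * np.arange(n)                               # long-term upward movement
seasonality = 5 * np.sin(2 * np.pi * np.arange(n) / 30)   # roughly monthly cycle

# AR(1)-style noise so consecutive observations are autocorrelated
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.7 * noise[t - 1] + rng.normal(scale=1.0)

data = pd.Series(trend + seasonality + noise, index=index)
```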
Detecting stationarity
While time series data is usually not stationary, stationarity is important because most statistical models and tests assume it. The Augmented Dickey-Fuller (ADF) test can be used on normally distributed data to detect stationarity. The null hypothesis is that the data is not stationary, so you are looking to reject it with a certain level of confidence. There are other (non-parametric) stationarity tests without the normally distributed data assumption that are beyond the scope of this post.
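With statsmodels, running the ADF test on the `data` series generated above looks something like this:

```python
from statsmodels.tsa.stattools import adfuller

# Null hypothesis: the series is NOT stationary (it has a unit root).
# A p-value below 0.05 lets us reject the null at the 95% level.
result = adfuller(data)
print(f"ADF statistic: {result[0]:.3f}")
print(f"p-value:       {result[1]:.3f}")
```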
Transformations
By applying different transformations to our data we can make non-stationary data stationary. One approach is to subtract the rolling mean, or a weighted rolling mean that favors more recent observations, from the data. Another approach is called differencing: subtract the value from some time period ago, like a month or a week, from the data.
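Both transformations are short with pandas; the window and lag lengths here are just examples:

```python
# Subtract a 30-day rolling mean, or an exponentially weighted mean
# that favors more recent observations.
detrended = data - data.rolling(window=30).mean()
detrended_ewm = data - data.ewm(span=30).mean()

# Differencing: subtract the value from some period ago,
# e.g. 7 days for weekly or 30 days for monthly patterns.
differenced = data - data.shift(7)
```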
Forecasting
Special care must be taken when splitting time series data into a training set and a test set. The order must be preserved; the data cannot be reshuffled. For cross-validation, it is also important to evaluate the model only on future observations, so a variation of k-fold is needed.
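One such variation is scikit-learn's TimeSeriesSplit, which always validates on observations that come after the training fold. A sketch of both the simple chronological split and the cross-validation folds:

```python
from sklearn.model_selection import TimeSeriesSplit

# Simple chronological split: the last 20% of observations are the test set.
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

# Cross-validation where each fold is evaluated only on observations
# that come after its training data.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(data):
    print(f"train: {train_idx[0]}-{train_idx[-1]}, test: {test_idx[0]}-{test_idx[-1]}")
```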
SARIMA
Seasonal autoregressive integrated moving average (SARIMA) is a model that can be fitted via a Kalman filter to time series data. It accounts for seasonality and trend by differencing the data; however, it is a linear model, so an observation needs to be a linear combination of past observations. A log or square root transform, for example, might help make the time series linear.
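A sketch of fitting this with statsmodels' SARIMAX on the training set from above; the (p, d, q) and seasonal orders below are placeholders rather than tuned values:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Placeholder orders -- (p, d, q) and seasonal (P, D, Q, s) would normally
# be chosen by inspecting ACF/PACF plots or by a grid search over AIC.
# s=30 matches the roughly monthly cycle in the generated data.
model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 30))
fit = model.fit(disp=False)

# Forecast as many steps ahead as there are test observations.
forecast = fit.forecast(steps=len(test))
```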
RNN
A recurrent neural network (RNN) with long short-term memory (LSTM) is an alternative to SARIMA for modeling time series data. At the cost of complexity, it can handle non-linear data or data that isn't normally distributed.
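A minimal LSTM sketch with Keras, assuming the training series is windowed into fixed-length input sequences; the window length and layer size are arbitrary, and in practice the data would also be scaled first:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Window the series into (lookback, 1) input sequences and next-step targets.
lookback = 30
values = train.values.astype("float32")
X = np.array([values[i:i + lookback] for i in range(len(values) - lookback)])
y = values[lookback:]
X = X.reshape((-1, lookback, 1))

# Arbitrary layer size; in practice this would be tuned.
model = Sequential([
    LSTM(32, input_shape=(lookback, 1)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=16, verbose=0)
```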
I didn't put a lot of effort into tuning these models, or coming up with additional features, and they aren't perfect, but we can start to get a feel for how they work. The SARIMA model looks underfit. It did, however, nicely ignore the randomness in the data. The RNN model clearly overfits the data and more work would be needed to get a smoother curve.
This was my first attempt at working with SARIMAX and RNNs so any feedback is appreciated.
Labels: feature engineering, machine learning, python, time series