Time Series notes

I have completed this course offered by Kaggle, and I would like to take some notes.

We mainly have:

  • Time-dependent properties: trends and seasonality.
  • Serially dependent properties: cycles and lagged series.

1. Trends

The trend component of a time series represents a persistent, long-term change in the mean of the series.

Polynomials can be used to predict trends. Choosing the right polynomial order is key, and using a high-order polynomial carries risks.

A high-order polynomial tends to diverge rapidly outside of the training period, making forecasts very unreliable.
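
A minimal sketch with numpy (the series and the polynomial degree are illustrative); note that the fitted polynomial is evaluated outside the training period, which is exactly where it can diverge:

    import numpy as np

    t = np.arange(100)                       # time dummy for the training period
    y = 0.05 * t + np.random.randn(100)      # synthetic series with a linear trend

    coefs = np.polyfit(t, y, deg=2)          # fit a quadratic trend
    t_future = np.arange(100, 130)           # time steps outside the training period
    trend_forecast = np.polyval(coefs, t_future)   # extrapolated trend (use with care)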

Splines are a nice alternative to polynomials when you want to fit a trend. The Multivariate Adaptive Regression Splines (MARS) algorithm in the pyearth library is powerful and easy to use. There are a lot of hyper-parameters you may want to investigate.
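
A minimal sketch, assuming the py-earth package is installed (the series is synthetic and the hyper-parameters are left at their defaults):

    import numpy as np
    from pyearth import Earth

    t = np.arange(100).reshape(-1, 1)        # time dummy as the only feature
    y = np.sin(t.ravel() / 10) + 0.02 * t.ravel()   # synthetic series

    model = Earth()                          # default hyper-parameters
    model.fit(t, y)
    trend = model.predict(t)                 # piecewise fit of the trend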

2. Seasonality

We say that a time series exhibits seasonality whenever there is a regular, periodic change in the mean of the series. Seasonal changes generally follow the clock and calendar.

For simple situations, we only needed eight features (four sine/cosine pairs) to get a good estimate of the annual seasonality. Compare this to the seasonal-indicator method, which would have required hundreds of features (one for each day of the year). By modeling only the “main effect” of the seasonality with Fourier features, you’ll usually need to add far fewer features to your training data, which means reduced computation time and less risk of overfitting.
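
A minimal sketch of building the four sine/cosine pairs by hand with numpy and pandas (statsmodels also ships ready-made helpers for this); the dates and period are illustrative:

    import numpy as np
    import pandas as pd

    index = pd.date_range("2020-01-01", periods=730, freq="D")
    t = np.arange(len(index))
    period = 365.25                          # annual seasonality on daily data

    features = {}
    for k in range(1, 5):                    # four sine/cosine pairs -> eight features
        features[f"sin_{k}"] = np.sin(2 * np.pi * k * t / period)
        features[f"cos_{k}"] = np.cos(2 * np.pi * k * t / period)
    X = pd.DataFrame(features, index=index)  # Fourier features for a regression model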

What is Serial Dependence?

In time series, we have seen properties that are most easily modeled as time-dependent properties, that is, with features we could derive directly from the time index (for instance, trend and seasonality).

Some time series properties, however, can only be modeled as serially dependent properties, that is, using past values of the target series as features. The structure of these time series may not be apparent from a plot over time; plotted against past values, however, the structure becomes clear. Cycles are one example.

3. Cycles

Cycles are patterns of growth and decay in a time series associated with how the value in a series at one time depends on values at previous times, but not necessarily on the time step itself.

What distinguishes cyclic behavior from seasonality is that cycles are not necessarily time dependent, as seasons are.

Lagged Series and Lag Plots

To investigate possible serial dependence (like cycles) in a time series, we need to create “lagged” copies of the series. Lagging a time series means shifting its values forward one or more time steps, or equivalently, shifting the times in its index backward one or more steps. In either case, the effect is that the observations in the lagged series will appear to have happened later in time.
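
A minimal sketch of lagging with pandas (the toy series is illustrative):

    import pandas as pd

    s = pd.Series([10, 20, 30, 40], index=pd.RangeIndex(4, name="time"))
    lag_1 = s.shift(1)    # time 0: NaN, time 1: 10, time 2: 20, time 3: 30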

Choosing lags

When choosing lags to use as features, it generally won’t be useful to include every lag with a large autocorrelation.

The partial autocorrelation tells you the correlation of a lag after accounting for all of the previous lags — the amount of “new” correlation the lag contributes, so to speak. Plotting the partial autocorrelation can help you choose which lag features to use; statsmodels provides plot_pacf(data, lags=12) for this.
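
A minimal sketch, assuming statsmodels and matplotlib are installed (the series is synthetic):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_pacf

    data = pd.Series(np.random.randn(200)).cumsum()   # synthetic series
    plot_pacf(data, lags=12)                          # partial autocorrelations up to lag 12
    plt.show()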

When using lag features, however, we are limited to forecasting time steps whose lagged values are available. Using a lag-1 feature on Monday, we can’t make a forecast for Wednesday, because the lag-1 value needed would be Tuesday’s, which hasn’t happened yet.

4. Hybrid models

Linear regression excels at extrapolating trends but can’t learn interactions. XGBoost excels at learning interactions but can’t extrapolate trends. Combining them in the right way can yield better results.

The difference between the target series and the predictions gives the series of residuals, which a second model can then learn to predict.

There are generally two ways a regression algorithm can make predictions:

  1. By transforming the features: learn some mathematical function that takes features as input, then combines and transforms them to produce an output that matches the target values in the training set.
  2. By transforming the target: use the features to group the target values in the training set and make predictions by averaging values within a group; a set of features just indicates which group to average. Decision trees and nearest neighbors are of this kind.

For instance:

  1. Linear regression and neural nets work by transforming the features.
  2. Decision trees and nearest neighbors work by transforming the target.

The important thing is this: feature transformers generally can extrapolate target values beyond the training set given appropriate features as inputs, but the predictions of target transformers will always be bound within the range of the training set. If the time dummy continues counting time steps, linear regression continues drawing the trend line. Given the same time dummy, a decision tree will predict the trend indicated by the last step of the training data into the future forever. Decision trees cannot extrapolate trends. Random forests and gradient boosted decision trees (like XGBoost) are ensembles of decision trees, so they also cannot extrapolate trends.

A decision tree will fail to extrapolate a trend beyond the training set.
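
A quick way to see this, as a sketch with scikit-learn on a synthetic trend:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    t = np.arange(100).reshape(-1, 1)        # time dummy
    y = 2.0 * t.ravel()                      # a pure upward trend

    t_future = np.arange(100, 120).reshape(-1, 1)
    linear_fc = LinearRegression().fit(t, y).predict(t_future)    # keeps rising
    tree_fc = DecisionTreeRegressor().fit(t, y).predict(t_future) # flat at the last training value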

This drives us to hybrid models: use linear regression to extrapolate the trend, transform the target to remove the trend, and apply XGBoost to the detrended residuals.
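
A minimal sketch of such a boosted hybrid, assuming scikit-learn and xgboost are available (the series, the lag choices, and the model settings are all illustrative):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from xgboost import XGBRegressor

    # Synthetic series: a linear trend plus a short cycle and noise
    t = np.arange(200)
    y = 0.1 * t + np.sin(t / 5) + 0.1 * np.random.randn(200)

    # Step 1: linear regression learns (and can extrapolate) the trend
    X_trend = t.reshape(-1, 1)
    trend_model = LinearRegression().fit(X_trend, y)
    residuals = y - trend_model.predict(X_trend)

    # Step 2: XGBoost learns the detrended residuals from lag features
    res = pd.Series(residuals)
    X_lags = pd.DataFrame({"lag_1": res.shift(1), "lag_2": res.shift(2)}).dropna()
    xgb = XGBRegressor().fit(X_lags, residuals[2:])

    # Final prediction = trend + predicted residuals
    y_pred = trend_model.predict(X_trend[2:]) + xgb.predict(X_lags)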

To hybridize a neural net (a feature transformer), you could instead include the predictions of another model as a feature, which the neural net would then include as part of its own predictions. The method of fitting to residuals is actually the same method the gradient boosting algorithm uses, so we will call these boosted hybrids; the method of using predictions as features is known as “stacking”, so we will call these stacked hybrids.

5. Multistep Forecasting Strategies

Let’s start with some definitions:

  • The forecast origin is the time at which you are making a forecast. Practically, you might consider the forecast origin to be the last time for which you have training data for the time being predicted. Everything up to the origin can be used to create features.
  • The forecast horizon is the time for which you are making a forecast. We often describe a forecast by the number of time steps in its horizon: a “1-step” forecast or “5-step” forecast, say. The forecast horizon describes the target.
  • The time between the origin and the horizon is the lead time (or sometimes latency) of the forecast.

There are a number of strategies for producing the multiple target steps required for a forecast:

  • Multioutput model: Use a model that produces multiple outputs naturally. Linear regression and neural networks can both produce multiple outputs. This strategy is simple and efficient, but not possible for every algorithm you might want to use.
  • Direct strategy: Train a separate model for each step in the horizon: one model forecasts 1-step ahead, another 2-steps ahead, and so on. Forecasting 1-step ahead is a different problem than 2-steps ahead (and so on), so it can help to have a different model make forecasts for each step. The downside is that training lots of models can be computationally expensive.
  • Recursive strategy: Train a single one-step model and use its forecasts to update the lag features for the next step. With the recursive method, we feed a model’s 1-step forecast back into that same model to use as a lag feature for the next forecasting step. We only need to train one model, but since errors will propagate from step to step, forecasts can be inaccurate for long horizons. (A sketch of this strategy follows the list.)
  • DirRec strategy: A combination of the direct and recursive strategies: train a model for each step and use forecasts from previous steps as new lag features. Step by step, each model gets an additional lag input. Since each model always has an up-to-date set of lag features, the DirRec strategy can capture serial dependence better than Direct, but it can also suffer from error propagation like Recursive.
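
A sketch of the recursive strategy (synthetic data; the lag choices and horizon are illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # A one-step model trained on lag-1 and lag-2 features (synthetic series)
    y = np.sin(np.arange(200) / 8)
    X = np.column_stack([y[1:-1], y[:-2]])   # columns: lag 1, lag 2
    model = LinearRegression().fit(X, y[2:])

    # Recursive forecast: feed each 1-step prediction back in as a lag feature
    history = list(y[-2:])
    forecasts = []
    for _ in range(5):                       # a 5-step horizon
        x_next = np.array([[history[-1], history[-2]]])   # [lag 1, lag 2]
        y_next = model.predict(x_next)[0]
        forecasts.append(y_next)
        history.append(y_next)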
