ML Methods
Time Series Forecasting Basics
Stationarity, autocorrelation, AR, MA, ARIMA, exponential smoothing, modern architectures (PatchTST, iTransformer, TSMixer), and time-series foundation models (TimesFM, Chronos, Moirai).
Why This Matters
Time series data is everywhere: stock prices, server metrics, weather, sensor readings, demand forecasting. The temporal structure (ordering, autocorrelation, trends, seasonality) makes time series different in kind from i.i.d. tabular data. Methods that ignore this structure fail. Methods that exploit it, even simple classical ones, often outperform complex deep learning approaches on standard benchmarks.
Fundamental Concepts
Stationarity
A time series $\{X_t\}$ is (weakly) stationary if:
- $\mathbb{E}[X_t] = \mu$ for all $t$ (constant mean)
- $\operatorname{Var}(X_t) = \sigma^2 < \infty$ for all $t$ (constant variance)
- $\operatorname{Cov}(X_t, X_{t-k}) = \gamma_k$ depends only on the lag $k$, not on $t$
Stationarity means the statistical properties do not change over time. Most forecasting methods assume stationarity or require transforming the data to achieve it.
Autocorrelation Function
The autocorrelation function (ACF) at lag $k$ is:
$$\rho_k = \frac{\operatorname{Cov}(X_t, X_{t-k})}{\operatorname{Var}(X_t)} = \frac{\gamma_k}{\gamma_0}$$
The ACF measures the linear dependence between a time series and its lagged values. The partial autocorrelation function (PACF) at lag $k$ measures the correlation between $X_t$ and $X_{t-k}$ after removing the effect of the intermediate lags $1, 2, \dots, k-1$.
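As a concrete illustration, here is a minimal sketch of the sample ACF in pure NumPy (the function name `acf` and the AR(1) sanity check are ours, not from any library):

```python
import numpy as np

def acf(x, max_lag):
    """Sample ACF: rho_k = gamma_k / gamma_0, using the conventional
    divide-by-n autocovariance estimator."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    gamma0 = np.dot(xc, xc) / n
    return np.array([np.dot(xc[k:], xc[:n - k]) / n / gamma0
                     for k in range(max_lag + 1)])

# Sanity check on a simulated AR(1) with phi = 0.7:
# the theoretical ACF is rho_k = 0.7**k.
rng = np.random.default_rng(0)
x = np.zeros(20000)
for t in range(1, len(x)):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()
rho = acf(x, 3)   # rho[0] is exactly 1; rho[1] near 0.7; rho[3] near 0.343
```

The divide-by-$n$ (rather than $n-k$) estimator is the standard choice because it guarantees a positive semidefinite autocovariance sequence.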
Autoregressive Models
AR(p) Model
An autoregressive model of order $p$ is:
$$X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + \varepsilon_t$$
where $c$ is a constant, $\phi_1, \dots, \phi_p$ are parameters, and $\varepsilon_t$ is white noise. The current value is a linear combination of past values plus noise.
AR(p) Stationarity Condition
Statement
An AR($p$) process is stationary if and only if all roots of the characteristic polynomial
$$\phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p$$
lie outside the unit circle in the complex plane ($|z_i| > 1$ for all roots $z_i$).
Intuition
For AR(1), the condition reduces to $|\phi_1| < 1$. If $|\phi_1| \ge 1$, shocks accumulate rather than decay, and the process drifts or explodes. For higher orders, the characteristic polynomial encodes how past values combine; roots inside the unit circle correspond to exponentially growing components.
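The root condition is easy to check numerically. A minimal sketch (the helper name `is_stationary_ar` is ours), assuming the convention $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ used above:

```python
import numpy as np

def is_stationary_ar(phis):
    """True if all roots of 1 - phi_1 z - ... - phi_p z^p lie
    strictly outside the unit circle."""
    # np.roots expects coefficients ordered from highest degree down
    coeffs = [-phi for phi in reversed(phis)] + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

is_stationary_ar([0.5])   # AR(1), |phi_1| < 1: stationary
is_stationary_ar([1.0])   # unit root (random walk): not stationary
```

For `[1.0]` the single root sits exactly on the unit circle, so the strict inequality correctly rejects the random walk.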
Proof Sketch
Write the AR($p$) in lag operator notation: $\phi(L) X_t = c + \varepsilon_t$, where $L X_t = X_{t-1}$. The process has a causal (one-sided, backward-looking) representation $X_t = \mu + \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$ if and only if $\phi(z) \neq 0$ for $|z| \le 1$. The coefficients $\psi_j$ decay geometrically, ensuring finite variance.
Why It Matters
Before fitting an AR model, you must check stationarity. If the series has a unit root ($\phi(1) = 0$; for AR(1), $\phi_1 = 1$), the process is a random walk and standard inference on the AR coefficients breaks down. Differencing the series (ARIMA) addresses this. The Augmented Dickey-Fuller test checks for unit roots.
Failure Mode
Fitting AR to a non-stationary series gives misleading results. Under a unit root, the OLS coefficient converges to its true value at rate $T$ instead of $\sqrt{T}$ (superconsistency), but inference (confidence intervals, tests) is invalid. Standard t-statistics do not follow the t-distribution; you need the Dickey-Fuller distribution instead.
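To see the superconsistency point numerically, a small simulation (our sketch, with a fixed seed and no intercept for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000
rw = np.cumsum(rng.standard_normal(T))   # random walk: true phi = 1

# OLS fit of X_t = phi * X_{t-1} + e_t, no intercept
y, x = rw[1:], rw[:-1]
phi_hat = np.dot(x, y) / np.dot(x, x)
# phi_hat sits extremely close to 1: the deviation is of order 1/T,
# not 1/sqrt(T). The catch is inference, not the point estimate --
# the t-statistic for phi = 1 does not follow the t-distribution,
# so naive confidence intervals are invalid.
```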
Moving Average Models
MA(q) Model
A moving average model of order $q$ is:
$$X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$
where $\varepsilon_t \sim \mathrm{WN}(0, \sigma^2)$. The current value depends on past error terms. MA models are always stationary (a finite linear combination of white-noise terms is stationary).
The Wold Decomposition
Wold Decomposition Theorem
Statement
Any covariance-stationary process $X_t$ can be written as:
$$X_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j} + \eta_t$$
where $\psi_0 = 1$, $\sum_{j=0}^{\infty} \psi_j^2 < \infty$, $\varepsilon_t$ is white noise, and $\eta_t$ is a deterministic component (perfectly predictable from its own past). The MA($\infty$) part is unique.
Intuition
Every stationary process is, in a precise sense, an infinite moving average plus a deterministic component. This justifies the use of ARMA models: AR and MA are two different finite-parameter approximations to the general MA($\infty$) representation.
Proof Sketch
Project $X_t$ onto the closed linear span of its own past innovations. The projection gives the MA($\infty$) component. The residual, being orthogonal to all past innovations, is deterministic (perfectly predictable from past values). The coefficients $\psi_j$ are the Wold representation coefficients.
Why It Matters
The Wold theorem provides the theoretical foundation for ARMA modeling. It says that ARMA is not just a convenient parametric family but a natural finite-parameter approximation to the true data-generating process.
Failure Mode
The Wold decomposition assumes stationarity. Non-stationary processes (trending, unit root) must be transformed (differenced) first. Also, the decomposition is linear. Nonlinear dependencies (volatility clustering, regime switching) are invisible to Wold and require GARCH or regime-switching models.
ARIMA
ARIMA($p, d, q$) combines autoregression, differencing, and moving average:
- Difference the series $d$ times to achieve stationarity: $Y_t = (1 - L)^d X_t$
- Fit an ARMA($p, q$) model to the differenced series
For seasonal data, SARIMA adds seasonal terms: SARIMA($p,d,q$)($P,D,Q$)$_s$ with seasonal period $s$.
Model selection: use the ACF to identify $q$ (the ACF cuts off after lag $q$ for an MA($q$) process) and the PACF to identify $p$ (the PACF cuts off after lag $p$ for an AR($p$) process). In practice, use AIC or BIC to select among candidate models.
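The "ACF cuts off after lag $q$" heuristic can be verified on simulated data. A sketch for MA(1) with $\theta_1 = 0.8$, whose theoretical ACF is $\rho_1 = \theta_1/(1+\theta_1^2) \approx 0.488$ and $\rho_k = 0$ for $k \ge 2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50000
eps = rng.standard_normal(n + 1)
x = eps[1:] + 0.8 * eps[:-1]          # MA(1): X_t = e_t + 0.8 e_{t-1}

def sample_acf(x, k):
    """Sample autocorrelation at a single lag k."""
    xc = x - x.mean()
    return np.dot(xc[k:], xc[:len(x) - k]) / np.dot(xc, xc)

rho1 = sample_acf(x, 1)               # near 0.8 / (1 + 0.64) = 0.488
rho2 = sample_acf(x, 2)               # near 0: the ACF has cut off
```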
Exponential Smoothing
Simple exponential smoothing forecasts using a weighted average of all past observations with exponentially decaying weights:
$$\hat{X}_{t+1} = \alpha X_t + (1 - \alpha)\hat{X}_t = \alpha \sum_{k=0}^{\infty} (1 - \alpha)^k X_{t-k}, \qquad 0 < \alpha \le 1$$
Holt's method adds a trend component. Holt-Winters adds seasonality. The ETS (Error-Trend-Seasonal) framework encompasses all exponential smoothing variants with automatic model selection.
Exponential smoothing is competitive with ARIMA on many datasets and is computationally trivial. It has a state-space representation that provides prediction intervals.
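The recursive form above is a few lines of code. A minimal sketch (initializing the level at the first observation is one common convention; production libraries also optimize $\alpha$ and the initial level):

```python
def ses_forecast(x, alpha):
    """One-step-ahead simple exponential smoothing forecast:
    level_t = alpha * x_t + (1 - alpha) * level_{t-1}."""
    level = x[0]                        # initialize at first observation
    for obs in x[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

ses_forecast([10.0, 12.0, 11.0, 13.0], alpha=0.5)   # → 12.0
```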
Modern Approaches
Prophet (Taylor & Letham, 2018): decomposable model with trend, seasonality, and holidays. Designed for business forecasting with irregular holidays and missing data. Uses Stan for Bayesian inference.
N-BEATS (Oreshkin et al. 2020): deep learning architecture with backward and forward residual links. Interpretable variant decomposes forecasts into trend and seasonality.
Temporal Fusion Transformer (Lim et al. 2021): attention-based model handling multiple time series with static covariates, known future inputs, and observed past inputs. Competitive on several multi-horizon benchmarks.
Patch-based and inverted transformers (2023-2024): PatchTST (Nie et al. 2023) segments a series into patches before attention, reducing token count and improving local structure capture. iTransformer (Liu et al. 2024) inverts the attention axis, attending over variates rather than time steps, which helps multivariate settings with many correlated channels. TSMixer (Chen et al. 2023) drops attention entirely, showing that mixing MLPs across time and features can match or beat transformer models at lower cost.
Time-series foundation models (2024): a newer paradigm sidesteps the classical single-series fitting regime entirely. TimesFM (Das et al. 2024) is Google's time-series foundation model trained on a large corpus of real-world series, enabling zero-shot forecasting without any target-domain fitting. Chronos (Ansari et al. 2024) tokenizes numerical values and applies language-model-style pretraining on diverse time series. Moirai (Woo et al. 2024) uses a masked encoder trained on Salesforce's LOTSA dataset and handles any number of variates without architectural changes. The Makridakis comparison ("classical wins on short univariate series") described the fit-one-series setting; foundation models operate in a different regime, leveraging cross-series pretraining. Their zero-shot advantage is strongest when the target series resembles the pretraining distribution; the advantage is less clear on domain-specific or structurally unusual series.
Classical vs. Deep Learning
The Makridakis competitions (M3, M4, M5) and subsequent studies consistently show that simple methods (exponential smoothing, ARIMA, theta method) match or beat complex deep learning methods on univariate forecasting. Deep learning methods excel when: the dataset has many related time series (enabling cross-series learning), rich exogenous variables are available, or the series is long enough to train large models.
The failure of deep learning on short univariate series is not surprising: an ARIMA($p,d,q$) model has on the order of $p + q + 1$ parameters, while a transformer has millions. With 100 observations, the classical model wins by not overfitting.
Common Confusions
Stationarity does not mean constant
A stationary series fluctuates around a fixed mean with constant variance. It can have substantial variation. Stationarity means the statistics of the fluctuations do not change over time, not that the series itself is flat.
Differencing is not detrending
Differencing removes stochastic trends (unit roots). Detrending removes deterministic trends (fitting and subtracting a trend line). Applying the wrong one gives incorrect results: detrending a unit root process leaves residuals that are still non-stationary; differencing a trend-stationary process introduces an unnecessary MA unit root.
Good in-sample fit does not mean good forecasts
Overfitting is particularly dangerous in time series because the effective sample size is smaller than the number of observations (autocorrelation reduces information content). Always evaluate forecasts on a held-out future period, not on the training period.
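The held-out-future principle can be enforced with a rolling-origin (expanding window) backtest. A minimal sketch (the helper name and the toy forecaster are ours):

```python
def rolling_origin_mae(series, fit_predict, min_train):
    """Expanding-window one-step-ahead evaluation: refit on series[:t],
    forecast series[t], and score only points never seen in training."""
    errors = [abs(fit_predict(series[:t]) - series[t])
              for t in range(min_train, len(series))]
    return sum(errors) / len(errors)

# naive last-value forecaster on a toy series with unit increments
mae = rolling_origin_mae([1.0, 2.0, 3.0, 4.0, 5.0],
                         lambda history: history[-1], min_train=2)
# mae == 1.0: each one-step-ahead error is exactly 1
```

The naive last-value forecast is a standard baseline; a candidate model that cannot beat it under this protocol is adding no predictive value.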
Summary
- Stationarity (constant mean, variance, autocovariance) is the core assumption; test for it before modeling
- AR($p$) captures dependence on past values; MA($q$) captures dependence on past errors; ARIMA combines both with differencing
- The Wold theorem justifies ARMA as a universal approximation for stationary processes
- Exponential smoothing is simple, effective, and has a rigorous state-space formulation
- Classical methods beat deep learning on many univariate forecasting benchmarks, especially with short series
Exercises
Problem
An AR(1) model $X_t = \phi_1 X_{t-1} + \varepsilon_t$ has coefficient $\phi_1$. Express the autocorrelation at lag 3 in terms of $\phi_1$. Under what condition on $\phi_1$ is the process stationary?
Problem
You observe a time series that appears non-stationary. You difference it once ($Y_t = X_t - X_{t-1}$) and the resulting series passes the ADF test for stationarity. The ACF of the differenced series cuts off after lag 1, and the PACF decays gradually. What ARIMA model should you fit? Justify your choice.
References
Canonical:
- Box, Jenkins, Reinsel, Time Series Analysis (5th ed.), Chapters 3-5
- Hamilton, Time Series Analysis (1994), Chapters 3-4
- Engle, R. F. "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation," Econometrica 50(4), 1982 (GARCH)
- Dickey, D. A. & Fuller, W. A. "Distribution of the Estimators for Autoregressive Time Series with a Unit Root," Journal of the American Statistical Association 74(366), 1979
Current:
- Hyndman & Athanasopoulos, Forecasting: Principles and Practice (3rd ed.), Chapters 8-9
- Makridakis et al., "The M4 Competition" (2020)
- Taylor & Letham, "Forecasting at Scale" (Prophet, 2018)
- Oreshkin et al., "N-BEATS: Neural basis expansion analysis for interpretable time series forecasting," arXiv:1905.10437, 2020
- Lim et al., "Temporal Fusion Transformers for interpretable multi-horizon time series forecasting," International Journal of Forecasting 37(4), 2021
- Nie et al., "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers (PatchTST)," arXiv:2211.14730, 2023
- Liu et al., "iTransformer: Inverted Transformers Are Effective for Time Series Forecasting," arXiv:2310.06625, 2024
- Chen et al., "TSMixer: An All-MLP Architecture for Time Series Forecasting," arXiv:2303.06053, 2023
- Das et al., "A decoder-only foundation model for time-series forecasting (TimesFM)," arXiv:2310.10688, 2024
- Ansari et al., "Chronos: Learning the Language of Time Series," arXiv:2403.07815, 2024
- Woo et al., "Unified Training of Universal Time Series Forecasting Transformers (Moirai)," arXiv:2402.02592, 2024
Next Topics
- Gaussian processes for ML: a nonparametric approach to time series and regression with uncertainty
Last reviewed: April 14, 2026