ML Methods
Time Series Forecasting Basics
Stationarity, autocorrelation, AR, MA, ARIMA, exponential smoothing, modern architectures (PatchTST, iTransformer, TSMixer), and time-series foundation models (TimesFM, Chronos, Moirai).
Why This Matters
Time series data is everywhere: stock prices, server metrics, weather, sensor readings, demand forecasting. The temporal structure (ordering, autocorrelation, trends, seasonality) makes time series different in kind from i.i.d. tabular data. Methods that ignore this structure fail. Methods that exploit it, even simple classical ones, often outperform complex deep learning approaches on standard benchmarks.
Fundamental Concepts
Stationarity
A time series $\{X_t\}$ is (weakly) stationary if:
- $\mathbb{E}[X_t] = \mu$ for all $t$ (constant mean)
- $\operatorname{Var}(X_t) = \sigma^2 < \infty$ for all $t$ (constant variance)
- $\operatorname{Cov}(X_t, X_{t-k}) = \gamma_k$ depends only on the lag $k$, not on $t$
Stationarity means the statistical properties do not change over time. Most forecasting methods assume stationarity or require transforming the data to achieve it.
Autocorrelation Function
The autocorrelation function (ACF) at lag $k$ is:
$$\rho_k = \frac{\operatorname{Cov}(X_t, X_{t-k})}{\operatorname{Var}(X_t)} = \frac{\gamma_k}{\gamma_0}$$
The ACF measures the linear dependence between a time series and its lagged values. The partial autocorrelation function (PACF) at lag $k$ measures the correlation between $X_t$ and $X_{t-k}$ after removing the effect of the intermediate lags $1, 2, \dots, k-1$.
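As a concrete illustration, here is a minimal sketch of the sample ACF in pure NumPy (the function name `acf` and the AR(1) sanity check are ours, not from any library):

```python
import numpy as np

def acf(x, max_lag):
    """Sample ACF: rho_k = gamma_k / gamma_0, using the conventional
    divide-by-n autocovariance estimator."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    gamma0 = np.dot(xc, xc) / n
    return np.array([np.dot(xc[k:], xc[:n - k]) / n / gamma0
                     for k in range(max_lag + 1)])

# Sanity check on a simulated AR(1) with phi = 0.7:
# the theoretical ACF is rho_k = 0.7**k.
rng = np.random.default_rng(0)
x = np.zeros(20000)
for t in range(1, len(x)):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()
rho = acf(x, 3)   # rho[0] is exactly 1; rho[1] near 0.7; rho[3] near 0.343
```

The divide-by-$n$ (rather than $n-k$) estimator is the standard choice because it guarantees a positive semidefinite autocovariance sequence.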
Autoregressive Models
AR(p) Model
An autoregressive model of order $p$ is:
$$X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + \varepsilon_t$$
where $c$ is a constant, $\phi_1, \dots, \phi_p$ are parameters, and $\varepsilon_t$ is white noise. The current value is a linear combination of past values plus noise.
AR(p) Stationarity Condition
Statement
An AR($p$) process is stationary if and only if all roots of the characteristic polynomial
$$\phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p$$
lie outside the unit circle in the complex plane ($|z_i| > 1$ for all roots $z_i$).
Intuition
For AR(1), the condition reduces to $|\phi_1| < 1$. If $|\phi_1| \ge 1$, shocks accumulate rather than decay, and the process drifts or explodes. For higher orders, the characteristic polynomial encodes how past values combine; roots inside the unit circle correspond to exponentially growing components.
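The root condition is easy to check numerically. A minimal sketch (the helper name `is_stationary_ar` is ours), assuming the convention $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ used above:

```python
import numpy as np

def is_stationary_ar(phis):
    """True if all roots of 1 - phi_1 z - ... - phi_p z^p lie
    strictly outside the unit circle."""
    # np.roots expects coefficients ordered from highest degree down
    coeffs = [-phi for phi in reversed(phis)] + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

is_stationary_ar([0.5])   # AR(1), |phi_1| < 1: stationary
is_stationary_ar([1.0])   # unit root (random walk): not stationary
```

For `[1.0]` the single root sits exactly on the unit circle, so the strict inequality correctly rejects the random walk.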
Proof Sketch
Write the AR($p$) in lag operator notation: $\phi(L) X_t = c + \varepsilon_t$, where $L X_t = X_{t-1}$. The process has a causal (one-sided, backward-looking) representation $X_t = \mu + \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$ if and only if $\phi(z) \neq 0$ for $|z| \le 1$. The coefficients $\psi_j$ decay geometrically, ensuring finite variance.
Why It Matters
Before fitting an AR model, you must check stationarity. If the series has a unit root ($\phi(1) = 0$; for AR(1), $\phi_1 = 1$), the process is a random walk and standard inference on the AR coefficients breaks down. Differencing the series (ARIMA) addresses this. The Augmented Dickey-Fuller test checks for unit roots.
Failure Mode
Fitting AR to a non-stationary series gives misleading results. Under a unit root, the OLS coefficient converges to its true value at rate $T$ instead of $\sqrt{T}$ (superconsistency), but inference (confidence intervals, tests) is invalid. Standard t-statistics do not follow the t-distribution; you need the Dickey-Fuller distribution instead.
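To see the superconsistency point numerically, a small simulation (our sketch, with a fixed seed and no intercept for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000
rw = np.cumsum(rng.standard_normal(T))   # random walk: true phi = 1

# OLS fit of X_t = phi * X_{t-1} + e_t, no intercept
y, x = rw[1:], rw[:-1]
phi_hat = np.dot(x, y) / np.dot(x, x)
# phi_hat sits extremely close to 1: the deviation is of order 1/T,
# not 1/sqrt(T). The catch is inference, not the point estimate --
# the t-statistic for phi = 1 does not follow the t-distribution,
# so naive confidence intervals are invalid.
```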
Moving Average Models
MA(q) Model
A moving average model of order $q$ is:
$$X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$
where $\varepsilon_t \sim \mathrm{WN}(0, \sigma^2)$. The current value depends on past error terms. MA models are always stationary (a finite linear combination of white-noise terms is stationary).
The Wold Decomposition
Wold Decomposition Theorem
Statement
Any covariance-stationary process $X_t$ can be written as:
$$X_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j} + \eta_t$$
where $\psi_0 = 1$, $\sum_{j=0}^{\infty} \psi_j^2 < \infty$, $\varepsilon_t$ is white noise, and $\eta_t$ is a deterministic component (perfectly predictable from its own past). The MA($\infty$) part is unique.
Intuition
Every stationary process is, in a precise sense, an infinite moving average plus a deterministic component. This justifies the use of ARMA models: AR and MA are two different finite-parameter approximations to the general MA($\infty$) representation.
Proof Sketch
Project $X_t$ onto the closed linear span of its own past innovations. The projection gives the MA($\infty$) component. The residual, being orthogonal to all past innovations, is deterministic (perfectly predictable from past values). The coefficients $\psi_j$ are the Wold representation coefficients.
Why It Matters
The Wold theorem provides the theoretical foundation for ARMA modeling. It says that ARMA is not just a convenient parametric family but a natural finite-parameter approximation to the true data-generating process.
Failure Mode
The Wold decomposition assumes stationarity. Non-stationary processes (trending, unit root) must be transformed (differenced) first. Also, the decomposition is linear. Nonlinear dependencies (volatility clustering, regime switching) are invisible to Wold and require GARCH or regime-switching models.
ARIMA
ARIMA($p, d, q$) combines autoregression, differencing, and moving average:
- Difference the series $d$ times to achieve stationarity: $Y_t = (1 - L)^d X_t$
- Fit an ARMA($p, q$) model to the differenced series
For seasonal data, SARIMA adds seasonal terms: SARIMA($p,d,q$)($P,D,Q$)$_s$ with seasonal period $s$.
Model selection: use the ACF to identify $q$ (the ACF cuts off after lag $q$ for an MA($q$) process) and the PACF to identify $p$ (the PACF cuts off after lag $p$ for an AR($p$) process). In practice, use AIC or BIC to select among candidate models.
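The "ACF cuts off after lag $q$" heuristic can be verified on simulated data. A sketch for MA(1) with $\theta_1 = 0.8$, whose theoretical ACF is $\rho_1 = \theta_1/(1+\theta_1^2) \approx 0.488$ and $\rho_k = 0$ for $k \ge 2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50000
eps = rng.standard_normal(n + 1)
x = eps[1:] + 0.8 * eps[:-1]          # MA(1): X_t = e_t + 0.8 e_{t-1}

def sample_acf(x, k):
    """Sample autocorrelation at a single lag k."""
    xc = x - x.mean()
    return np.dot(xc[k:], xc[:len(x) - k]) / np.dot(xc, xc)

rho1 = sample_acf(x, 1)               # near 0.8 / (1 + 0.64) = 0.488
rho2 = sample_acf(x, 2)               # near 0: the ACF has cut off
```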
Exponential Smoothing
Simple exponential smoothing forecasts using a weighted average of all past observations with exponentially decaying weights:
$$\hat{X}_{t+1} = \alpha X_t + (1 - \alpha)\hat{X}_t = \alpha \sum_{k=0}^{\infty} (1 - \alpha)^k X_{t-k}, \qquad 0 < \alpha \le 1$$
Holt's method adds a trend component. Holt-Winters adds seasonality. The ETS (Error-Trend-Seasonal) framework encompasses all exponential smoothing variants with automatic model selection.
Exponential smoothing is competitive with ARIMA on many datasets and is computationally trivial. It has a state-space representation that provides prediction intervals.
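The recursive form above is a few lines of code. A minimal sketch (initializing the level at the first observation is one common convention; production libraries also optimize $\alpha$ and the initial level):

```python
def ses_forecast(x, alpha):
    """One-step-ahead simple exponential smoothing forecast:
    level_t = alpha * x_t + (1 - alpha) * level_{t-1}."""
    level = x[0]                        # initialize at first observation
    for obs in x[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

ses_forecast([10.0, 12.0, 11.0, 13.0], alpha=0.5)   # → 12.0
```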
Modern Approaches
Prophet (Taylor & Letham, 2018): decomposable model with trend, seasonality, and holidays. Designed for business forecasting with irregular holidays and missing data. Uses Stan for Bayesian inference.
N-BEATS (Oreshkin et al. 2020): deep learning architecture with backward and forward residual links. Interpretable variant decomposes forecasts into trend and seasonality.
Temporal Fusion Transformer (Lim et al. 2021): attention-based model handling multiple time series with static covariates, known future inputs, and observed past inputs. Competitive on several multi-horizon benchmarks.
Patch-based and inverted transformers (2023-2024): PatchTST (Nie et al. 2023) segments a series into patches before attention, reducing token count and improving local structure capture. iTransformer (Liu et al. 2024) inverts the attention axis, attending over variates rather than time steps, which helps multivariate settings with many correlated channels. TSMixer (Chen et al. 2023) drops attention entirely, showing that mixing MLPs across time and features can match or beat transformer models at lower cost.
Time-series foundation models (2024): a newer paradigm sidesteps the classical single-series fitting regime entirely. TimesFM (Das et al. 2024) is Google's time-series foundation model trained on a large corpus of real-world series, enabling zero-shot forecasting without any target-domain fitting. Chronos (Ansari et al. 2024) tokenizes numerical values and applies language-model-style pretraining on diverse time series. Moirai (Woo et al. 2024) uses a masked encoder trained on Salesforce's LOTSA dataset and handles any number of variates without architectural changes. The Makridakis comparison ("classical wins on short univariate series") described the fit-one-series setting; foundation models operate in a different regime, leveraging cross-series pretraining. Their zero-shot advantage is strongest when the target series resembles the pretraining distribution; the advantage is less clear on domain-specific or structurally unusual series.
Classical vs. Deep Learning
The Makridakis competitions (M3, M4, M5) and subsequent studies consistently show that simple methods (exponential smoothing, ARIMA, theta method) match or beat complex deep learning methods on univariate forecasting. Deep learning methods excel when: the dataset has many related time series (enabling cross-series learning), rich exogenous variables are available, or the series is long enough to train large models.
The failure of deep learning on short univariate series is not surprising: an ARIMA($p,d,q$) model has on the order of $p + q + 1$ parameters, while a transformer has millions. With 100 observations, the classical model wins by not overfitting.
Common Confusions
Stationarity does not mean constant
A stationary series fluctuates around a fixed mean with constant variance. It can have substantial variation. Stationarity means the statistics of the fluctuations do not change over time, not that the series itself is flat.
Differencing is not detrending
Differencing removes stochastic trends (unit roots). Detrending removes deterministic trends (fitting and subtracting a trend line). Applying the wrong one gives incorrect results: detrending a unit root process leaves residuals that are still non-stationary; differencing a trend-stationary process introduces an unnecessary MA unit root.
Good in-sample fit does not mean good forecasts
Overfitting is particularly dangerous in time series because the effective sample size is smaller than the number of observations (autocorrelation reduces information content). Always evaluate forecasts on a held-out future period, not on the training period.
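The held-out-future principle can be enforced with a rolling-origin (expanding window) backtest. A minimal sketch (the helper name and the toy forecaster are ours):

```python
def rolling_origin_mae(series, fit_predict, min_train):
    """Expanding-window one-step-ahead evaluation: refit on series[:t],
    forecast series[t], and score only points never seen in training."""
    errors = [abs(fit_predict(series[:t]) - series[t])
              for t in range(min_train, len(series))]
    return sum(errors) / len(errors)

# naive last-value forecaster on a toy series with unit increments
mae = rolling_origin_mae([1.0, 2.0, 3.0, 4.0, 5.0],
                         lambda history: history[-1], min_train=2)
# mae == 1.0: each one-step-ahead error is exactly 1
```

The naive last-value forecast is a standard baseline; a candidate model that cannot beat it under this protocol is adding no predictive value.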
Summary
- Stationarity (constant mean, variance, autocovariance) is the core assumption; test for it before modeling
- AR($p$) captures dependence on past values; MA($q$) captures dependence on past errors; ARIMA combines both with differencing
- The Wold theorem justifies ARMA as a universal approximation for stationary processes
- Exponential smoothing is simple, effective, and has a rigorous state-space formulation
- Classical methods beat deep learning on many univariate forecasting benchmarks, especially with short series
Exercises
Problem
An AR(1) model $X_t = \phi_1 X_{t-1} + \varepsilon_t$ has coefficient $\phi_1$. Express the autocorrelation at lag 3 in terms of $\phi_1$. Under what condition on $\phi_1$ is the process stationary?
Problem
You observe a time series that appears non-stationary. You difference it once ($Y_t = X_t - X_{t-1}$) and the resulting series passes the ADF test for stationarity. The ACF of the differenced series cuts off after lag 1, and the PACF decays gradually. What ARIMA model should you fit? Justify your choice.
References
Canonical:
- Box, Jenkins, Reinsel, Time Series Analysis (5th ed.), Chapters 3-5
- Hamilton, Time Series Analysis (1994), Chapters 3-4
- Engle, R. F. "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation," Econometrica 50(4), 1982 (GARCH)
- Dickey, D. A. & Fuller, W. A. "Distribution of the Estimators for Autoregressive Time Series with a Unit Root," Journal of the American Statistical Association 74(366), 1979
Current:
- Hyndman & Athanasopoulos, Forecasting: Principles and Practice (3rd ed.), Chapters 8-9
- Makridakis et al., "The M4 Competition" (2020)
- Taylor & Letham, "Forecasting at Scale" (Prophet, 2018)
- Oreshkin et al., "N-BEATS: Neural basis expansion analysis for interpretable time series forecasting," arXiv:1905.10437, 2020
- Lim et al., "Temporal Fusion Transformers for interpretable multi-horizon time series forecasting," International Journal of Forecasting 37(4), 2021
- Nie et al., "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers (PatchTST)," arXiv:2211.14730, 2023
- Liu et al., "iTransformer: Inverted Transformers Are Effective for Time Series Forecasting," arXiv:2310.06625, 2024
- Chen et al., "TSMixer: An All-MLP Architecture for Time Series Forecasting," arXiv:2303.06053, 2023
- Das et al., "A decoder-only foundation model for time-series forecasting (TimesFM)," arXiv:2310.10688, 2024
- Ansari et al., "Chronos: Learning the Language of Time Series," arXiv:2403.07815, 2024
- Woo et al., "Unified Training of Universal Time Series Forecasting Transformers (Moirai)," arXiv:2402.02592, 2024
Next Topics
- Gaussian processes for ML: a nonparametric approach to time series and regression with uncertainty
Last reviewed: April 14, 2026