ML Methods
Fourier Neural Operator
The Fourier Neural Operator (Li, Kovachki, Anandkumar et al., ICLR 2021) parameterizes the kernel of an integral operator directly in Fourier space, giving a resolution-invariant architecture for learning maps between function spaces. Canonical baseline for data-driven PDE solvers and the architectural backbone of FourCastNet weather prediction.
Prerequisites
Complex numbers and the Fourier transform, the fast Fourier transform, and basic spectral theory of operators.
Why This Matters
Classical neural networks learn maps between finite-dimensional vector spaces. A PDE solver, by contrast, computes a map between function spaces: from an initial condition to the solution at a later time, or from a coefficient field to the corresponding solution of an elliptic boundary-value problem. Discretize on a grid of points and a CNN can imitate this map, but the learned weights are tied to that grid: train at one resolution and the network has no principled meaning at any other.
The Fourier Neural Operator (FNO) of Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, Stuart, and Anandkumar (ICLR 2021) was the first architecture to break this discretization tie convincingly. The construction is direct: parameterize the kernel of an integral operator in Fourier space, truncate to a fixed number of low-frequency modes, and evaluate with an FFT. The learned weights live in spectral space and are independent of the spatial resolution, so the same trained network applies to any sufficiently fine grid.
On the standard 2D Navier-Stokes benchmark of Li et al. (2021), FNO matched a pseudospectral solver to roughly 1% relative error at sub-second inference, three orders of magnitude faster than the solver it was trained against. This result rearranged the data-driven PDE landscape: FNO became the default baseline that every subsequent neural operator (DeepONet, Geo-FNO, U-Net operators, transformer operators) is benchmarked against. The architecture also scaled to global weather: Pathak et al. (2022) trained an adaptive FNO variant (FourCastNet) on ERA5 reanalysis data and matched the ECMWF IFS forecast on key surface variables at orders-of-magnitude higher throughput.
The conceptual contrast with physics-informed neural networks is sharp. PINNs solve a single PDE instance by encoding the residual into the loss; every new initial condition demands a new optimization. FNO amortizes the cost: train once on a distribution of initial conditions, then evaluate the learned operator instantly on any new instance from that distribution. The tradeoff is that FNO needs labeled solution data (typically from a conventional solver), while PINNs need only the PDE.
Mental Model
Many useful operators on functions can be written as integral operators of the form $(\mathcal{K}v)(x) = \int k(x,y)\,v(y)\,dy$. When the kernel is shift-invariant, $k(x,y) = \kappa(x-y)$, this integral is a convolution, and the convolution theorem turns it into pointwise multiplication in the Fourier domain: $\widehat{\kappa * v}(\xi) = \hat{\kappa}(\xi)\,\hat{v}(\xi)$. FNO takes this identity as a design principle. Rather than parameterize $\kappa$ in physical space (as a CNN kernel does), it parameterizes the spectral multiplier $\hat{\kappa}$ directly, learns a different multiplier for each Fourier mode up to a truncation cutoff $k_{\max}$, and applies the operator with two FFTs.
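The convolution-theorem identity above can be checked numerically in a few lines. A minimal NumPy sketch (variable names are illustrative, not from any library):

```python
import numpy as np

# Circular convolution of a kernel kappa with a signal v on a periodic
# 1D grid, done two ways: directly in physical space, and as a
# pointwise product in Fourier space.
rng = np.random.default_rng(0)
n = 64
kappa = rng.standard_normal(n)   # shift-invariant kernel sampled on the grid
v = rng.standard_normal(n)       # input function sampled on the grid

# Physical space: (kappa * v)[i] = sum_j kappa[(i - j) mod n] * v[j]
direct = np.array([sum(kappa[(i - j) % n] * v[j] for j in range(n))
                   for i in range(n)])

# Fourier space: transform, multiply pointwise, transform back
spectral = np.fft.ifft(np.fft.fft(kappa) * np.fft.fft(v)).real

assert np.allclose(direct, spectral)
```

The direct sum costs $O(n^2)$; the FFT route costs $O(n \log n)$, which is the same trick the Fourier layer exploits.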
The truncation is the crucial engineering choice. Without it, the spectral multiplier on a grid of $n$ points has $n$ independent values per channel pair, so the parameter count grows with resolution. With it, the parameter count is pinned to $O(k_{\max}^d\,c^2)$ for $d$ spatial dimensions and $c$ channels, independent of $n$. The same trained weights can be evaluated on any grid of size $n' \ge 2k_{\max}$ by zero-padding or truncating the FFT output appropriately.
Formal Statement
Fourier Layer
Let $v_t : D \to \mathbb{R}^c$ be a function-valued hidden state on a domain $D$ with periodic boundary, sampled on an $n$-point grid. A Fourier layer maps $v_t$ to $v_{t+1}$ via
$$v_{t+1}(x) = \sigma\!\left( W v_t(x) + \mathcal{F}^{-1}\big( R \cdot \mathcal{F} v_t \big)(x) \right),$$
where $\mathcal{F}$ is the discrete Fourier transform applied channelwise, $R = \{R_k \in \mathbb{C}^{c \times c}\}$ is a learned complex tensor of spectral multipliers indexed by wavenumber $k$ with $|k| \le k_{\max}$, $W \in \mathbb{R}^{c \times c}$ is a pointwise linear map (a $1 \times 1$ convolution), and $\sigma$ is a pointwise nonlinearity such as GELU. Modes outside the truncation are zeroed before the inverse FFT.
The pointwise term $W v_t$ is the residual channel: it carries information across layers without filtering through the spectral cutoff and is essential for representing high-frequency content that the truncation would otherwise discard. Without it, the network can only output linear combinations of the lowest $k_{\max}$ Fourier modes, which is a strictly bandlimited family.
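As a concreteness check, here is a minimal NumPy sketch of a single 1D Fourier layer with randomly initialized weights. The function names and shapes are illustrative, not from any FNO library:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def fourier_layer(v, R, W):
    """One Fourier layer: sigma(W v + F^{-1}(R . F v)).

    v: (n, c) real hidden state on an n-point periodic grid
    R: (k_max, c, c) complex spectral multipliers for the lowest modes
    W: (c, c) pointwise linear map (the residual channel)
    """
    n, c = v.shape
    k_max = R.shape[0]
    v_hat = np.fft.rfft(v, axis=0)                 # (n//2 + 1, c) complex
    out_hat = np.zeros_like(v_hat)
    # multiply each retained mode by its own c x c matrix; zero the rest
    out_hat[:k_max] = np.einsum("kij,kj->ki", R, v_hat[:k_max])
    spectral = np.fft.irfft(out_hat, n=n, axis=0)  # back to physical space
    return gelu(v @ W.T + spectral)

rng = np.random.default_rng(0)
n, c, k_max = 128, 8, 12
R = rng.standard_normal((k_max, c, c)) + 1j * rng.standard_normal((k_max, c, c))
W = rng.standard_normal((c, c))
v = rng.standard_normal((n, c))
out = fourier_layer(v, R, W)
assert out.shape == (n, c)
```

Note that `R` has `k_max * c * c` complex entries regardless of `n`, which is the truncation-pins-the-parameter-count point made above.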
The FNO Architecture
A complete FNO is a sandwich. Input functions are first lifted to the hidden width $c$ via a pointwise MLP $P$, applied independently at each grid point. The lifted function $v_0$ then passes through $T$ Fourier layers, producing $v_T$. Finally, a pointwise projection $Q$ collapses the hidden width to the output dimension. Typical hyperparameters in the original paper: $T = 4$ Fourier layers, a hidden width of 32-64 channels, and $k_{\max} = 12$-$16$ modes per spatial direction.
The compute cost per evaluation on an $n$-point grid is $O(T\,n \log n)$, dominated by the FFTs (one forward, one inverse per layer). Compare this to a standard CNN with kernel size $s$ in $d$ dimensions: a single layer costs $O(s^d n)$ per channel pair. For modest $s$ and large $n$, FNO is asymptotically more expensive than a small-stencil CNN; the win is not asymptotic complexity but the much larger effective receptive field. A single Fourier layer sees the entire domain, while a CNN needs depth proportional to the domain diameter divided by the kernel size.
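The receptive-field contrast can be seen directly: perturb a single grid point and look at where each layer type's output changes. A minimal single-channel NumPy sketch (the stencil and the spectral weights are illustrative):

```python
import numpy as np

# Perturb one grid point; a 3-point local stencil propagates one cell
# per layer, while a (linear, single-channel) Fourier layer touches
# the whole domain at once.
n = 64
v = np.zeros(n)
v_pert = v.copy()
v_pert[n // 2] = 1.0

def stencil(u):
    # local layer: 3-point circular moving average
    return (np.roll(u, -1) + u + np.roll(u, 1)) / 3

R = np.arange(1, 9, dtype=float)  # fixed weights on the lowest 8 modes
def spectral(u):
    u_hat = np.fft.rfft(u)
    out = np.zeros_like(u_hat)
    out[:8] = R * u_hat[:8]
    return np.fft.irfft(out, n=n)

local_diff = stencil(v_pert) - stencil(v)
spectral_diff = spectral(v_pert) - spectral(v)

assert np.count_nonzero(local_diff) == 3            # 3-point neighborhood only
assert np.count_nonzero(np.abs(spectral_diff) > 1e-12) > n // 2  # global support
```

One spectral layer has global support; the stencil would need about $n/2$ layers to match it.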
Universal Approximation
FNO Universal Approximation (Kovachki, Lanthaler, Mishra 2021)
Statement
Let $\mathcal{G} : H^s(\mathbb{T}^d) \to H^{s'}(\mathbb{T}^d)$ be a continuous operator on Sobolev functions over the torus, and let $K \subset H^s(\mathbb{T}^d)$ be compact. For every $\epsilon > 0$ there exists an FNO $\mathcal{N}$ with finitely many Fourier layers, finite hidden width, and finite mode cutoff such that
$$\sup_{a \in K} \left\| \mathcal{G}(a) - \mathcal{N}(a) \right\|_{H^{s'}} \le \epsilon.$$
Intuition
The compact-set restriction is doing the work. On a compact subset of $H^s(\mathbb{T}^d)$, input functions have uniformly controlled spectral content, and a continuous operator into $H^{s'}(\mathbb{T}^d)$ outputs functions whose high-frequency tail decays uniformly over that set. So a finite mode cutoff is enough to capture the relevant frequencies, and a finite-depth FNO can interpolate the truncated map on the compact set.
Proof Sketch
The argument has two parts. First, by density of trigonometric polynomials in $H^s$ and continuity of $\mathcal{G}$, both inputs and outputs can be approximated to error $\epsilon/2$ by their truncations to the lowest $k_{\max}$ Fourier modes for $k_{\max}$ large enough. Second, the truncated operator is a continuous map between two finite-dimensional spaces (the trigonometric polynomial spaces of degree $k_{\max}$), and the Chen-Chen (1995) universal approximation theorem for operator networks adapts to this finite-dimensional case: a sufficiently wide pointwise MLP composed with the spectral multiplier and inverse FFT realizes any continuous map up to error $\epsilon/2$. The triangle inequality closes the argument.
Why It Matters
This is the operator-learning analog of the Hornik-Cybenko universal approximation theorem for finite-dimensional networks. It justifies FNO as an architecture: in principle, any continuous operator on Sobolev functions can be approximated arbitrarily well. The result tells you what to expect from infinite-data, infinite-compute training; it says nothing about generalization from finite samples or about which operators are efficiently approximable.
Failure Mode
The constants in the approximation bound are exponential in the spatial dimension $d$, and the cutoff needed to reach error $\epsilon$ scales like $k_{\max} \sim \epsilon^{-1/s}$ for targets of Sobolev order $s$. Targets with shocks, discontinuities, or heavy high-wavenumber content (compressible flows with shocks, fully developed 3D turbulence below the Kolmogorov scale) require an impractically large $k_{\max}$, and the FFT-based architecture cannot represent them efficiently. Empirically, FNO degrades on advection-dominated problems where the solution sharpens rather than smooths over time.
Resolution Invariance
The headline property of FNO is that the same learned weights apply at any spatial resolution above the mode cutoff. Concretely: train on a grid of $n$ points with $n \ge 2k_{\max}$, then evaluate on a grid of $n' \ne n$ points by computing the FFT at the new resolution, multiplying the lowest $k_{\max}$ modes by $R$ (and zeroing the rest), and inverse-transforming. No retraining, no interpolation of weights, no architecture change. This is a genuine property of operators rather than discretizations and is the cleanest argument for treating FNO as something more than a CNN with a different kernel parameterization.
In practice the picture is more nuanced. Kovachki et al. (2023, §3.4) and Bartolucci et al. (2024) document that empirical resolution invariance degrades when training data is single-resolution: the network learns spectral biases tied to the training grid (aliasing artifacts, mode-coupling errors at the Nyquist frequency, nonlinearity behavior on under-resolved features) that do not transfer cleanly to other resolutions. The fix is either multi-resolution training (mixing grids during training) or alias-free architectural variants. Treat the resolution-invariance claim as a property of the operator the architecture is capable of representing, not as a guarantee about any particular trained instance.
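The zero-shot evaluation recipe can be demonstrated directly: fix a set of spectral multipliers, sample the same bandlimited function at two resolutions, and check that the outputs agree on shared grid points. A minimal single-channel NumPy sketch (the `norm="forward"` convention and all names are our choices for illustration):

```python
import numpy as np

def spectral_apply(v, R):
    """Apply fixed spectral multipliers R (one complex scalar per mode,
    single channel) to a signal sampled on any grid of size n >= 2*len(R)."""
    n = len(v)
    v_hat = np.fft.rfft(v, norm="forward")  # mode-k coefficient, n-independent
    out_hat = np.zeros_like(v_hat)
    out_hat[:len(R)] = R * v_hat[:len(R)]
    return np.fft.irfft(out_hat, n=n, norm="forward")

rng = np.random.default_rng(1)
k_max = 8
R = rng.standard_normal(k_max) + 1j * rng.standard_normal(k_max)

# The same bandlimited "function", sampled at two resolutions
coeffs = rng.standard_normal(k_max) + 1j * rng.standard_normal(k_max)
def sample(n):
    x = np.arange(n) / n
    return sum((coeffs[k] * np.exp(2j * np.pi * k * x)).real
               for k in range(k_max))

coarse_out = spectral_apply(sample(64), R)
fine_out = spectral_apply(sample(256), R)

# Same weights R, two grids: outputs agree on shared grid points
assert np.allclose(coarse_out, fine_out[::4])
```

The `norm="forward"` scaling makes the retained DFT coefficients of a bandlimited signal independent of $n$, so the weights transfer between grids without rescaling.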
Worked Example: 2D Darcy Flow
Consider the steady-state Darcy equation on the unit square with periodic boundary:
$$-\nabla \cdot \big( a(x)\, \nabla u(x) \big) = f(x),$$
with a fixed forcing $f$ and a spatially varying diffusion coefficient $a$ drawn from a Gaussian random field. The solution operator $a \mapsto u$ is nonlinear in $a$ (it is the inverse of an $a$-dependent elliptic operator), continuous on suitable coefficient classes, and exactly the kind of map FNO targets.
In Li et al. (2021, §5.1), training data is generated by solving the PDE with a second-order finite-difference method on a fine grid for on the order of a thousand samples of $a$. An FNO with four Fourier layers and a 12-mode cutoff per direction is trained for 500 epochs with Adam. The learned operator achieves a relative test error of about 1%, comparable to the discretization error of the numerical solver itself. Inference takes around 5 ms per sample on a single GPU; the conventional solver takes tens of seconds. The cost asymmetry is the entire point of operator learning: pay an upfront training cost to amortize a large number of downstream evaluations.
Common Confusions
FNO is not just a CNN in the Fourier domain
A standard CNN convolves with a fixed-size spatial kernel: $O(s^d)$ parameters per channel pair, a receptive field that grows linearly with depth, and a kernel that is local in space. FNO learns a dense (up to truncation) spectral multiplier of size $O(k_{\max}^d)$ per channel pair: each retained output mode is a learned mixture of the input channels at the corresponding mode, and each Fourier mode aggregates the entire grid, so the receptive field is the whole domain in a single layer. The two architectures encode different inductive biases: CNNs are biased toward local features; FNO is biased toward smooth global structure with a controlled bandwidth.
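The contrast in parameter counts can be made concrete with a back-of-the-envelope calculation (illustrative widths; weight tensors only, and the simplification of counting $k_{\max}^d$ retained modes):

```python
# Per-layer weight counts for the two parameterizations, at an
# illustrative width of c = 64 channels in d = 2 spatial dimensions.
c, d = 64, 2
s = 3          # CNN kernel size
k_max = 12     # FNO mode cutoff per direction

cnn_params = (s ** d) * c * c          # local s x s kernel per channel pair
fno_params = 2 * (k_max ** d) * c * c  # complex multiplier per retained mode
                                       # (factor 2: real + imaginary parts)

# Both counts are independent of the grid size n: the CNN by locality,
# the FNO by spectral truncation.
assert cnn_params == 36_864
assert fno_params == 1_179_648
```

The FNO layer carries far more weights per layer but needs far fewer layers, since its receptive field is already global.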
Resolution invariance is a property of the architecture, not a guarantee about a trained model
The FNO weights are independent of the grid, so the same parameters can be applied at any resolution above the mode cutoff. This is a real architectural property and distinguishes FNO from CNN-based operators. It is not a guarantee that a network trained on one resolution will generalize cleanly to a much higher resolution: training data at a single grid size induces spectral biases tied to that discretization, and zero-shot transfer to dramatically finer grids often degrades. Multi-resolution training or alias-aware variants are the standard remedy.
Plain FNO requires function values on a regular periodic grid
The FFT in the Fourier layer is a discrete Fourier transform on a uniform grid with periodic boundary; FNO out of the box is not mesh-free, not adaptive, and not directly applicable to functions sampled on irregular point clouds or in domains with non-trivial boundary. Extensions exist: Geo-FNO (Li et al. 2023) learns a coordinate diffeomorphism that maps a non-trivial geometry to a periodic reference domain; Graph Neural Operators (Anandkumar et al. 2020) replace the FFT with message-passing on irregular graphs; spherical FNO (Bonev et al. 2023) replaces the planar FFT with a spherical harmonic transform for global atmospheric data. All of these are responses to the periodic-grid restriction of the original architecture.
Exercises
Problem
Consider 1D periodic functions sampled on $n$ grid points and a single Fourier layer with hidden width $c$ and mode cutoff $k_{\max}$. Drop the nonlinearity $\sigma$ and the residual term $W$, so the layer is $v \mapsto \mathcal{F}^{-1}(R \cdot \mathcal{F} v)$. Write out the matrix form of this map and show that the parameter count of $R$ is $O(k_{\max}\,c^2)$, independent of $n$.
Problem
A linear operator $T$ is shift-invariant if it commutes with all translations: $T \tau_h = \tau_h T$ for every shift $h$. Show that any bounded shift-invariant linear operator on $L^2(\mathbb{T})$ can be represented exactly by a single Fourier layer with no truncation (all modes retained), no nonlinearity, and no residual term ($W = 0$).
References
- Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, Stuart, and Anandkumar, Fourier Neural Operator for Parametric Partial Differential Equations (ICLR 2021, arXiv:2010.08895). The original FNO paper. Section 4 defines the Fourier layer; Section 5 reports the Burgers, Darcy, and 2D Navier-Stokes benchmarks where FNO first beat CNN, U-Net, and DeepONet baselines on resolution-invariant operator learning.
- Kovachki, Li, Liu, Azizzadenesheli, Bhattacharya, Stuart, and Anandkumar, Neural Operator: Learning Maps Between Function Spaces with Applications to PDEs (JMLR 2023, vol 24, arXiv:2108.08481). The framework paper. Sections 4-5 develop the integral-operator view of neural operators in Banach spaces; Section 6 unifies FNO, graph neural operators, and low-rank operators under a single formulation.
- Kovachki, Lanthaler, and Mishra, On Universal Approximation and Error Bounds for Fourier Neural Operators (JMLR 2021, vol 22). Proves the universal approximation theorem cited above and gives quantitative error bounds in terms of network width, depth, and mode cutoff for target operators with prescribed Sobolev regularity.
- Pathak, Subramanian, Harrington, Raja, Chattopadhyay, Mardani, Kurth, Hall, Li, Azizzadenesheli, et al., FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators (arXiv:2202.11214, 2022). Trains an FNO variant on ERA5 reanalysis at 0.25° global resolution; matches IFS forecast skill on key surface variables out to several days at orders-of-magnitude higher throughput.
- Li, Huang, Liu, and Anandkumar, Fourier Neural Operator with Learned Deformations for PDEs on General Geometries (Geo-FNO, JMLR 2023). Extends FNO to non-rectangular domains by learning a coordinate transformation that maps the input domain to a periodic reference grid where the standard FFT-based layer applies.
- Bartolucci, de Bezenac, Raonić, Molinaro, Mishra, and Alaifari, Are Neural Operators Really Neural Operators? Frame Theory Meets Neural Operators (NeurIPS 2024). Shows that FNO and similar architectures fail strict discretization invariance under aliasing; introduces Convolutional Neural Operators with alias-free design and characterizes when discretization invariance actually holds.
- Lanthaler, Mishra, and Karniadakis, Error estimates for DeepONets: A deep learning framework in infinite dimensions (Transactions of Mathematics and Its Applications 2022). The matched analysis on the DeepONet side of operator learning; useful as a comparison for what error bounds look like in a different operator architecture.
- Chen and Chen, Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems (IEEE Transactions on Neural Networks 1995, vol 6 no 4). The 1995 result that all subsequent operator-learning universal approximation theorems (DeepONet, FNO) extend. Defines the branch-trunk decomposition and proves that continuous operators on compact subsets of spaces of continuous functions are uniformly approximable by shallow operator networks.
Next Topics
- Navier-Stokes for ML: the canonical PDE benchmark FNO is evaluated against and the setting where operator learning competes head-on with PINN approaches.
- Physics-Informed Neural Networks: the per-instance optimization alternative to FNO; compare data-driven operator learning against residual-loss training on a single PDE solve.
- Spectral Theory of Operators: the Hilbert-space machinery underlying the integral-operator view; eigenfunction expansions are the abstract version of the Fourier basis FNO uses concretely.
- PDE Fundamentals for ML: elliptic, parabolic, and hyperbolic classification and what each implies for the regularity of the solution operator FNO is trying to learn.
- Fast Fourier Transform: the algorithm that makes the Fourier layer practical at scale.
Summary. FNO parameterizes the kernel of an integral operator directly in Fourier space, truncates to low-frequency modes, and applies the operator with two FFTs per layer. The architecture is resolution-invariant by construction: the same learned weights apply to any grid above the mode cutoff. Universal approximation holds for continuous operators on Sobolev spaces over the torus, but the constants are exponential in dimension, the architecture struggles on shock-dominated and high-wavenumber problems, and empirical resolution invariance degrades under single-resolution training without alias-free variants.
Last reviewed: April 18, 2026