Beyond LLMs
Physics-Informed Neural Networks
Embedding PDE constraints directly into the neural network loss function via automatic differentiation. When physics-informed learning works, when it fails, and what alternatives exist.
Prerequisites
Why This Matters
Interactive module
The same live PDE explorer appears here as optional context. Hide it if you want a pure reading path through PINNs.
The explorer above is the setup a PINN tries to replicate. The spectral solver on the left is the exact closed-form ground truth for four PDE archetypes (heat, advection, Schrödinger, Poisson): one Fourier multiplier per archetype, no time-stepping. The "Neural-net function fit" panel at the bottom trains a small SIREN MLP to memorize that field via supervised MSE. A true PINN keeps the same kind of network but replaces the supervised target with the PDE residual itself; the net never sees ground-truth values and instead learns to satisfy the equation pointwise. The sections below describe exactly what that substitution buys you and where it breaks.
Most of science is governed by partial differential equations (PDEs). Classical numerical solvers (finite elements, finite differences, spectral methods) work well but struggle with high-dimensional problems, complex geometries, and inverse problems. PINNs propose a different approach: use a neural network as a function approximator and enforce the PDE through the loss function.
The idea is seductive. The reality is more nuanced. PINNs work well in certain regimes and fail badly in others. Understanding where the boundary lies is critical for anyone applying ML to scientific problems.
Mental Model
A standard neural network learns from data alone. A PINN adds a second source of supervision: the governing equations. You do not need as much data because the physics constrains the space of acceptable solutions.
Think of it as regularization by physical law. Instead of an L2 penalty that pushes weights toward zero, you have a PDE residual penalty that pushes the solution toward physical consistency.
Core Definitions
Physics-Informed Neural Network (PINN)
A neural network trained to approximate the solution of a PDE by minimizing a composite loss that includes both data fidelity and PDE residual terms. The PDE residual is computed via automatic differentiation of the network output with respect to its inputs.
PDE Residual
For a PDE of the form $\mathcal{N}[u](x) = 0$, where $\mathcal{N}$ is a differential operator, the PDE residual at a collocation point $x_j$ is:

$$r_\theta(x_j) = \mathcal{N}[u_\theta](x_j)$$

This is computed by differentiating $u_\theta$ with respect to its inputs using automatic differentiation. A perfect solution has zero residual everywhere.
The PINN Loss Function
The central construct of PINNs is the composite loss:
PINN Loss Decomposition
Statement
The PINN loss is:

$$\mathcal{L}(\theta) = \lambda_{\text{data}}\,\mathcal{L}_{\text{data}} + \lambda_{\text{PDE}}\,\mathcal{L}_{\text{PDE}} + \lambda_{\text{BC}}\,\mathcal{L}_{\text{BC}} + \lambda_{\text{IC}}\,\mathcal{L}_{\text{IC}}$$

where:
- $\mathcal{L}_{\text{data}} = \frac{1}{N_d}\sum_{i=1}^{N_d} |u_\theta(x_i, t_i) - u_i|^2$ (data fidelity)
- $\mathcal{L}_{\text{PDE}} = \frac{1}{N_c}\sum_{j=1}^{N_c} |\mathcal{N}[u_\theta](x_j, t_j)|^2$ (PDE residual at collocation points)
- $\mathcal{L}_{\text{BC}}$ and $\mathcal{L}_{\text{IC}}$ enforce boundary and initial conditions
The weights $\lambda_{\text{data}}, \lambda_{\text{PDE}}, \lambda_{\text{BC}}, \lambda_{\text{IC}}$ balance the different loss terms.
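As a concrete sketch, the composite loss is just a weighted sum of mean-squared terms. The function below is a minimal numpy illustration (the name `pinn_loss` and the default weights are ours, not from any library); in a real PINN the residual arrays would come from autodiff rather than being passed in precomputed.

```python
import numpy as np

def pinn_loss(u_pred, u_obs, pde_res, bc_res, ic_res,
              lam_data=1.0, lam_pde=1.0, lam_bc=1.0, lam_ic=1.0):
    """Composite PINN loss: weighted sum of mean-squared terms."""
    L_data = np.mean((u_pred - u_obs) ** 2)  # data fidelity at observation points
    L_pde  = np.mean(pde_res ** 2)           # PDE residual at collocation points
    L_bc   = np.mean(bc_res ** 2)            # boundary-condition mismatch
    L_ic   = np.mean(ic_res ** 2)            # initial-condition mismatch
    return lam_data * L_data + lam_pde * L_pde + lam_bc * L_bc + lam_ic * L_ic
```

Everything downstream of this definition is a balancing act between the four terms, which is why the weights dominate the engineering discussion below.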
Intuition
The network is pulled in multiple directions: fit the observed data, satisfy the PDE everywhere in the domain, and respect boundary/initial conditions. The physics term acts as an infinite-dimensional regularizer, constraining the solution to the manifold of physically plausible functions even where no data exists.
Proof Sketch
There is no convergence "proof" in the classical sense for general PINNs. Theoretical results (Shin, Darbon, Karniadakis 2020) show that as the number of collocation points and network capacity grow, minimizers of the PINN loss converge to the PDE solution under regularity assumptions. The rate of convergence is generally worse than classical solvers for smooth problems.
Why It Matters
This decomposition is the entire PINN methodology. The key engineering decisions are: (1) the architecture of $u_\theta$, (2) the placement of collocation points, (3) the relative weights $\lambda_{\text{data}}, \lambda_{\text{PDE}}, \lambda_{\text{BC}}, \lambda_{\text{IC}}$, and (4) the optimizer and training schedule. Getting these wrong leads to solutions that satisfy neither the data nor the physics.
Failure Mode
The multi-objective nature of the loss creates optimization difficulties. The PDE residual and data terms can have vastly different scales and gradients, leading to one dominating the other. Adaptive weighting schemes (e.g., learning rate annealing, neural tangent kernel-based weighting) partially address this but do not fully solve it.
How PINNs Use Automatic Differentiation
The key enabling technology is automatic differentiation (autodiff). To compute terms like $\partial u_\theta/\partial t$ and $\partial^2 u_\theta/\partial x^2$, you differentiate the neural network output with respect to its inputs (not its parameters). Modern frameworks (PyTorch, JAX) compute these derivatives exactly and efficiently.
For example, if the PDE is the heat equation $u_t = \alpha u_{xx}$, the residual at a point $(x, t)$ is:

$$r_\theta(x, t) = \frac{\partial u_\theta}{\partial t}(x, t) - \alpha\,\frac{\partial^2 u_\theta}{\partial x^2}(x, t)$$

Both derivatives are computed by autodiff through the network graph.
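A quick sanity check of the residual idea: $u(x,t) = e^{-t}\sin x$ solves the heat equation with $\alpha = 1$ exactly, so its residual should vanish everywhere. The sketch below substitutes central finite differences for autodiff so it runs with numpy alone; a real PINN would differentiate the network graph with `torch.autograd.grad` or `jax.grad` instead.

```python
import numpy as np

def u(x, t):
    # Analytic solution of u_t = u_xx (heat equation with alpha = 1).
    return np.exp(-t) * np.sin(x)

def heat_residual(x, t, alpha=1.0, h=1e-4):
    # Central finite differences stand in for autodiff in this sketch.
    u_t  = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h**2
    return u_t - alpha * u_xx

r = heat_residual(0.7, 0.3)   # ~0 up to O(h^2) discretization error
```

Evaluating the residual with the wrong $\alpha$ gives a visibly nonzero value, which is exactly the gradient signal a PINN uses in inverse problems.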
When PINNs Work
Smooth Solutions to Known PDEs
PINNs perform well on problems with smooth solutions where the governing PDE is known exactly. Classic demonstrations include the Burgers equation (before shock formation), the Schrödinger equation, and steady-state heat conduction. In these settings, the physics loss provides strong regularization, and the network can represent the solution accurately with moderate capacity.
Inverse Problems
PINNs are particularly attractive for inverse problems: given sparse noisy observations, infer unknown PDE parameters. For example, estimating the diffusion coefficient from temperature measurements. The physics constraint regularizes the inverse problem and shrinks the feasible solution set; it does not, on its own, make the problem well-posed. Identifiability and stability still depend on the observation operator, the boundary and initial conditions, the parameterization of the unknowns, the noise model, and whether distinct parameters can produce indistinguishable observations. Many inverse PDE problems remain genuinely ill-posed even with the correct governing equation, and a successful PINN run on such a problem is fitting one solution from a non-trivial null space.
When PINNs Fail
The honest-survey failure modes (discontinuities, forward-solve cost, stiffness) are the surface story. The deeper, structural failures have names attached to them now and are where the interesting work is happening.
The three tabs above are the live, in-browser versions of the three failure modes described below. The code is tiny, runs as you watch, and cites its source paper on each tab.
Spectral bias: plain MLPs cannot represent high frequencies
Rahaman et al. (2019) and Tancik et al. (arXiv:2006.10739) showed that an MLP with standard activations is biased toward low-frequency components and provably slow to fit oscillatory targets. A vanilla PINN on a solution containing a high-frequency component such as $\sin(\omega x)$ with large $\omega$ will converge to the low-frequency bulk and never resolve the detail. Two practical fixes: Fourier feature encodings (Tancik et al. 2020; Mildenhall et al. 2020 for NeRF) preprocess inputs through $\gamma(x) = [\sin(2\pi B x), \cos(2\pi B x)]$ with random frequencies $B$, and sinusoidal representations like SIREN (Sitzmann et al. 2020, arXiv:2006.09661) replace activations with $\sin(\omega_0 z)$. Either transforms the NTK into one that is no longer frequency-biased. The "Neural-net function fit" panel in the explorer above uses SIREN for exactly this reason.
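The Fourier feature encoding itself is a few lines. This is a hedged numpy sketch of the Tancik et al. mapping; the frequency scale `sigma` is a free hyperparameter (our choice here, not a canonical value), and the downstream MLP that would consume `phi` is omitted.

```python
import numpy as np

def fourier_features(x, B):
    """gamma(x) = [sin(2*pi*B*x), cos(2*pi*B*x)] for frequencies B (Tancik et al. 2020)."""
    proj = 2.0 * np.pi * np.outer(x, B)                           # (n, m)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=1)   # (n, 2m)

rng = np.random.default_rng(0)
B = rng.normal(0.0, 10.0, size=16)        # B ~ N(0, sigma^2), sigma = 10 here
phi = fourier_features(np.linspace(0.0, 1.0, 8), B)
```

The choice of `sigma` sets the highest frequency the downstream network can resolve easily, which is why it must be matched to the expected frequency content of the PDE solution.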
NTK imbalance: the data and PDE losses train at wildly different rates
Wang, Teng, Perdikaris (arXiv:2001.04536, 2021) gave the foundational analysis: the neural tangent kernels of the data-fitting loss and the PDE-residual loss have eigenvalue spectra separated by orders of magnitude. In effect, one loss dominates every gradient step and the other stagnates, regardless of the manually-chosen weights. The paper proposes an NTK-balanced weighting scheme that measures the kernels online and rescales. Calling it "weighting" understates the result: it is the first mechanistic explanation of why PINNs are so brittle to loss coefficients.
The demo above recomputes both kernels on a fixed 10-point sample every 30 training steps and plots their sorted eigenvalues side by side. Watch how the PDE kernel's spectrum widens across decades as training progresses while the data kernel stays bounded. Toggle NTK-balanced weighting and the gap closes.
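What "recomputing the kernel" means mechanically: stack per-sample gradients of the network output into a Jacobian $J$ and form the Gram matrix $K = J J^T$. The sketch below does this for the data-fitting loss of a tiny hand-differentiated tanh network (all names and sizes are ours); the PDE kernel is the same construction with gradients of the residual instead of gradients of $u$, omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8                                         # hidden width (arbitrary)
w1, b1 = rng.normal(size=m), rng.normal(size=m)
w2 = rng.normal(size=m) / np.sqrt(m)

def per_sample_grad(x):
    # Gradient of u(x) = w2 . tanh(w1*x + b1) w.r.t. (w1, b1, w2), flattened.
    z = np.tanh(w1 * x + b1)
    dz = 1.0 - z ** 2                         # tanh'
    return np.concatenate([w2 * dz * x,       # du/dw1
                           w2 * dz,           # du/db1
                           z])                # du/dw2

xs = np.linspace(-1.0, 1.0, 10)               # fixed 10-point sample, as in the demo
J = np.stack([per_sample_grad(x) for x in xs])  # (10, 3m) per-sample Jacobian
K = J @ J.T                                   # empirical NTK Gram matrix of the data loss
eigs = np.sort(np.linalg.eigvalsh(K))[::-1]   # sorted spectrum, largest first
```

Comparing `eigs` for the data kernel against the analogous spectrum of the residual kernel is the measurement the NTK-balanced weighting scheme performs online.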
Curriculum and causality: naive simultaneous training fails on time-dependent PDEs
Krishnapriyan et al. (arXiv:2109.01050, NeurIPS 2021) built an explicit failure-mode taxonomy for PINNs and showed that simultaneous training on all collocation points (the standard setup) is wrong for evolution equations: information from the initial condition at $t = 0$ must propagate causally through time, but vanilla training mixes early- and late-time residuals as if they were independent. A sequence-to-sequence curriculum (train to some cutoff time, then extend) recovers much of the accuracy gap. Wang, Sankaran, Perdikaris (arXiv:2203.07404) formalized this as causal time-weighting: weight each collocation point at time $t_i$ by $w_i = \exp(-\epsilon \sum_{k < i} \mathcal{L}_r(t_k))$, a product of earlier residual factors, so the network must solve time $t_k < t_i$ before it is asked to solve $t_i$. The improvement is often orders of magnitude, which is why this result is structurally surprising.
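The causal weighting rule itself is nearly a one-liner. A minimal numpy sketch, assuming the time domain has been split into slices with a mean squared residual per slice (the function name and the `eps` default are ours):

```python
import numpy as np

def causal_weights(slice_residuals, eps=1.0):
    """w_i = exp(-eps * sum_{k<i} L_r(t_k))  (Wang-Sankaran-Perdikaris 2022).

    slice_residuals[i] is the mean squared PDE residual on time slice t_i."""
    cum = np.concatenate([[0.0], np.cumsum(slice_residuals[:-1])])
    return np.exp(-eps * cum)

# Early slices still unsolved (large residual) => later slices get ~zero weight,
# so gradient signal is forced to flow forward in time.
w = causal_weights(np.array([5.0, 5.0, 0.1, 0.1]))
```

Once the early-time residuals shrink, the weights of later slices recover toward one and training advances down the time axis automatically.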
Discontinuities and sharp gradients
Shock waves, contact discontinuities, and thin boundary layers cause PINNs to fail or converge extremely slowly. Neural networks with smooth activations represent discontinuities poorly; the PDE residual near a shock is large and noisy, destabilizing training. Classical shock-capturing schemes (WENO, Godunov) handle these cases far better. Variational PINNs (Kharazmi et al., arXiv:2003.00596) sidestep some of this by minimizing a weak form of the residual against test functions, reducing the required derivative order.
PINNs can be slower than classical solvers for forward problems
For a well-posed forward PDE with known coefficients, a finite element solver with adaptive meshing is typically faster and more accurate than a PINN. PINNs pay the cost of neural network training (thousands of gradient descent steps) for a single PDE instance. Classical solvers amortize their cost better. PINNs become competitive when you need to solve many related PDEs (amortized inference), when the problem is inverse (unknown coefficients to infer from sparse data), or when classical meshing is impractical.
The demo above shows the honest crossover. At 3 observations the classical FD grid search is under-constrained and its α estimate varies wildly with sample luck; the PINN's physics residual stabilizes the estimate. Drag n upward — above roughly 20 observations the classical estimate sharpens and matches or beats the PINN, because the data now constrains u directly.
Provable impossibility: some PDE classes cannot be PINN-approximated
de Ryck and Mishra (2024) gave explicit PDE classes where no PINN training trajectory can converge, no matter the architecture or weighting. This is a healthy counterweight to the "universal approximation" optimism of the early literature: PINNs are a method, not a solver, and the method has provable limits.
Stiff systems
PDEs with widely separated time scales create loss landscapes with pathological curvature. Fast dynamics produce large PDE residuals early in training; slow dynamics require long training to resolve. Multiscale architectures, hard-constraint formulations (Lu, Pestourie et al., arXiv:2102.04626) that enforce boundary conditions by architectural construction rather than loss penalty, and gradient-enhanced residual losses (gPINN, Yu et al., arXiv:2111.02801) mitigate but do not eliminate the problem.
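The hard-constraint idea mentioned above is easy to see in one dimension. A minimal sketch (our ansatz, in the spirit of Lu, Pestourie et al. 2021): multiply the network output by a factor that vanishes on the boundary and add an interpolant of the boundary data, so the boundary conditions hold for any network weights and the boundary loss term disappears entirely.

```python
import numpy as np

def hard_constrained_u(x, net, g0=0.0, g1=0.0):
    """u(x) = (1-x)*g0 + x*g1 + x*(1-x)*net(x) on [0, 1].

    Satisfies u(0) = g0 and u(1) = g1 exactly for ANY network output,
    so no boundary-condition loss term is needed."""
    return (1.0 - x) * g0 + x * g1 + x * (1.0 - x) * net(x)

net = lambda x: np.sin(3.0 * x) + 2.0   # stand-in for a trained MLP
```

For complicated geometries the factor $x(1-x)$ generalizes to a learned or analytic approximate distance function to the boundary, which is where the construction gets harder.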
Extensions
Biologically-Informed Neural Networks (BINNs)
BINNs apply the PINN framework to biological systems where the governing equations are partially known. Instead of enforcing a fully specified PDE, BINNs learn unknown terms in the equations from data while enforcing known structural constraints (conservation laws, positivity, symmetries).
Neural Operators: A Different Paradigm
Instead of learning a single PDE solution, neural operators learn the solution operator: a mapping from PDE parameters, initial conditions, or forcing terms to solutions.
Fourier Neural Operator (FNO): learns in the frequency domain, applying learned filters to the Fourier coefficients of the input function. Resolution invariant and fast at inference (one forward pass per new PDE instance).
DeepONet: uses a branch-trunk architecture. The branch network encodes the input function, the trunk network encodes the evaluation point, and their dot product gives the solution value. Theoretically grounded in the universal approximation theorem for operators (Chen & Chen, 1995).
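The branch-trunk contraction can be sketched in a few lines. Everything below is a random-weight stand-in (untrained, names ours): the point is the shape of the computation, namely that the solution value at a query point is a dot product between a latent code for the input function and a latent code for the point.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 10, 6                                       # sensor count, latent dimension
B_w = rng.normal(size=(p, m)) / np.sqrt(m)         # "branch" weights (random stand-in)
T_w, T_b = rng.normal(size=p), rng.normal(size=p)  # "trunk" weights

def branch(f_samples):
    # Encode the input function from its m sensor samples.
    return B_w @ f_samples

def trunk(y):
    # Encode the evaluation point y via cosine features.
    return np.cos(T_w * y + T_b)

def deeponet(f_samples, y):
    # Solution value = branch(f) . trunk(y)
    return float(branch(f_samples) @ trunk(y))

out = deeponet(np.sin(np.linspace(0.0, 1.0, m)), 0.5)
```

Because the input function enters only through its sensor samples and the query point only through the trunk, a trained DeepONet evaluates a new PDE instance with one cheap forward pass per query.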
Neural operators amortize the cost of training over many PDE instances. Once trained, solving a new PDE instance requires only a forward pass, not retraining. This makes them far more practical than PINNs for applications requiring repeated solves (design optimization, uncertainty quantification).
Bayesian PINNs
Yang, Meng, Karniadakis (arXiv:2003.06097, 2021) replace the point-estimate network with a posterior over weights, giving calibrated uncertainty quantification on both the solution and any inferred coefficients. The resulting B-PINN is more honest about its extrapolations than a vanilla PINN and connects directly to standard Bayesian deep learning.
Structural Connection: PINNs and RLHF
PINN training and RLHF are the same abstract problem with a cosmetic change of clothes, and noticing this is the single most useful way to transfer intuition between the two.
In RLHF, the reward model $r_\phi$ is a learned proxy for human judgment. Optimizing a policy against it aggressively (small KL coefficient $\beta$) produces reward hacking: the policy finds degenerate behaviors that score highly under $r_\phi$ without actually being good. The fix is a KL regularizer against the SFT policy $\pi_{\text{ref}}$:

$$\max_\pi\; \mathbb{E}_{y \sim \pi}\left[r_\phi(x, y)\right] - \beta\, \mathrm{KL}\left(\pi \,\|\, \pi_{\text{ref}}\right)$$
In a PINN, the PDE residual is a computed proxy for PDE satisfaction. Optimizing it aggressively relative to the data loss produces the mirror pathology: the network finds functions with small residual that ignore the observational data (or vice versa, dominated-by-data fits that violate physics away from measurement points). The effective multi-objective is:

$$\mathcal{L} = \mathcal{L}_{\text{data}} + \lambda\, \mathcal{L}_{\text{PDE}}$$

and Wang, Teng, Perdikaris's NTK analysis shows that the "correct" $\lambda$ is not a hyperparameter to tune; it is determined by the ratio of kernel traces, which is the PINN analog of selecting $\beta$ from a target KL budget in RLHF.
The parallels line up term by term:
| Structural role | RLHF | PINN |
|---|---|---|
| Ground-truth objective | Human judgment (unobserved) | PDE satisfaction (infeasible to compute exactly) |
| Learned proxy | Reward model | PDE residual |
| What "hacking" looks like | Policy scores high, humans disagree | Residual small, data fit poor (or vice versa) |
| Implicit trust region | KL against $\pi_{\text{ref}}$ | Data-loss weight $\lambda_{\text{data}}$ pins $u_\theta$ near observations |
| Ratio-balancing rule | Target KL budget | NTK eigenvalue ratio (Wang-Teng-Perdikaris 2021) |
| Non-stationary proxy | Reward drifts across distribution | PDE residual landscape stiffens during training |
Both are proxy optimization problems where the proxy is differentiable but the true objective is not; both need a regularizer to prevent the optimizer from exploiting the gap; and in both, the optimal regularizer strength is structural rather than tuned.
This is not merely an analogy. The algorithmic fixes transfer: causal time-weighting (Wang-Sankaran-Perdikaris 2022) is a curriculum on collocation points, which is the PINN analog of process-reward shaping in RLHF. Hard-constraint PINNs (Lu-Pestourie 2021) enforce boundary conditions architecturally instead of through a loss term — the PINN version of constrained decoding. The open question in both fields is whether proxy-hacking is fundamental or an artifact of current architectures.
Open Questions
Two questions where the existing TheoremPath machinery meets the PINN literature at a productive boundary. Framed deliberately as open so the reader can enter the research trajectory rather than just consume a survey.
Finite-sample generalization bounds for PINNs
Current PINN theory (Shin-Darbon-Karniadakis 2020) gives asymptotic consistency: as the number of collocation points and training iterations tend to infinity, the PINN converges to a PDE solution. This is the wrong kind of result for practice: real PINNs are trained on a finite, often modest, number of collocation points, not infinitely many.
A natural program: treat the PINN loss as an empirical risk $\mathcal{L}(\theta) = \frac{1}{N}\sum_{j=1}^{N} \ell(\theta, x_j)$, where $\ell$ depends on $u_\theta$ through its autodiff derivatives. Under Lipschitz-derivative regularity, apply the vector contraction inequality (Maurer 2016) to bound the empirical process terms. For a Sobolev or Barron-norm hypothesis class $\mathcal{F}$, the Rademacher complexity $\mathfrak{R}_N(\mathcal{F})$ would give the PINN analog of a VC-style generalization bound.
The hard technical question is whether the residual class $\{\mathcal{N}[u_\theta] : u_\theta \in \mathcal{F}\}$ has a Rademacher complexity controllable by that of $\mathcal{F}$, i.e., whether the derivative operator acts on the hypothesis class benignly under the parameter norm. This is open and genuinely research-grade.
PINN scaling laws
Kaplan-style scaling laws have reshaped language modeling: given a compute budget, there is a compute-optimal trade-off between parameter count, dataset size, and training steps. For PINNs, no equivalent study exists. Ad-hoc empirical surveys report that "bigger networks help until they do not," but there is no Chinchilla-style joint fit across problem classes.
A specific formulation: for a PDE whose solution lies in a Barron-norm class of complexity $C$, does the PINN training loss follow a joint power law $\mathcal{L}(P, N) \approx a\,P^{-\alpha} + b\,N^{-\beta}$ in parameter count $P$ and collocation-point count $N$, for exponents $\alpha, \beta$ determined by the PDE's conditioning rather than by generic architecture scaling? If yes, compute-optimal allocation should trade parameters against collocation-point density the way language models trade parameters against tokens. If no, PINN scaling is limited by PDE conditioning and the comparison to LMs is misleading.
This is a natural target for a workshop note or a short empirical paper. The TheoremPath curriculum has the pieces (scaling laws, Barron norms, Rademacher bounds) already assembled elsewhere; the PINN literature does not.
Connections to Mechanistic Interpretability
A possibly-naive observation, flagged here as a working hypothesis rather than a result. Mechanistic interpretability looks inside a trained network to recover human-readable circuits. The NTK spectrum view embedded above is doing something adjacent: looking inside an actively training network to see which directions in parameter space are responsible for which components of the loss.
Three specific ways the two fields touch, none of them fully worked out:
- The eigenvectors of the data NTK and the PDE NTK are literal directions in parameter space. When the gap between their spectra closes under NTK-balanced weighting, it means the same parameter directions start absorbing gradient signal from both losses. That is a mechanistic statement about the training dynamics, not just a convergence-rate statement, and it suggests a way to ask "which layer learned which part of the equation" by projecting gradients onto layer-restricted Gram matrices.
- SIREN and Fourier-feature networks (Tancik 2020, Sitzmann 2020) fix spectral bias by construction. The mechanistic question — which parts of the network represent high-frequency components and which represent low-frequency ones — is open. The superposition work on sparse features in LLMs has a natural analog for PINNs: are high-frequency Fourier modes "in superposition" with low-frequency ones in the hidden layer, and does spectral bias correspond to high-frequency directions being the ones most contested by superposition?
- The PINN–RLHF isomorphism in the section above implies that proxy hacking in RLHF and loss imbalance in PINNs share a mechanistic signature. A reward model has an NTK. Measuring its eigenvalue spread vs the KL regularizer's NTK spread during PPO updates is directly analogous to what the demo above does for PINN training. I have not seen this connection in either literature. That is either because it is wrong or because no one has written it yet; I think it is the second and the two-page version would be worth writing.
None of the above is a claim that any of this is solved. It is a list of the questions I would actually work on if I had the time, and it is the honest research trajectory TheoremPath is meant to support rather than decorate. If any of this is already established in a paper I have not read, I would like to hear about it.
Summary
- PINN loss = data fidelity + PDE residual + boundary/initial conditions
- Autodiff computes exact spatial/temporal derivatives of the network
- PINNs work best for smooth solutions, inverse problems, and data-sparse regimes
- PINNs fail on discontinuities, stiff systems, and problems where classical solvers already excel
- Loss balancing between physics and data terms is a critical engineering challenge
- Neural operators (FNO, DeepONet) learn solution operators and amortize cost across PDE instances
Exercises
Problem
Write the PINN loss for the 1D heat equation $u_t = \alpha u_{xx}$ on $x \in [0, 1]$, $t \in [0, T]$, with initial condition $u(x, 0) = u_0(x)$ and boundary conditions $u(0, t) = g_0(t)$, $u(1, t) = g_1(t)$. Identify all collocation point sets needed.
Problem
Why do PINNs struggle with the Burgers equation at small viscosity $\nu$? What happens to the PDE residual near a shock?
References
Foundational:
- Raissi, Perdikaris, Karniadakis, "Physics-informed neural networks," Journal of Computational Physics 378 (2019), pp. 686–707
- Karniadakis, Kevrekidis, Lu, Perdikaris, Wang, Yang, "Physics-informed machine learning," Nature Reviews Physics 3 (2021)
Failure-mode analysis (read these before applying PINNs):
- Wang, Teng, Perdikaris, "When and why PINNs fail to train: a neural tangent kernel perspective," arXiv:2001.04536 (2021). The foundational NTK-imbalance analysis.
- Krishnapriyan, Gholami, Zhe, Kirby, Mahoney, "Characterizing possible failure modes in physics-informed neural networks," NeurIPS 2021, arXiv:2109.01050
- Wang, Sankaran, Perdikaris, "Respecting causality is all you need for training PINNs," arXiv:2203.07404 (2022)
- de Ryck, Mishra, "Error analysis for PINNs and related models approximating PDEs," 2024 survey
Spectral bias / implicit neural representations:
- Rahaman et al., "On the spectral bias of neural networks," ICML 2019
- Tancik et al., "Fourier features let networks learn high frequency functions in low dimensional domains," NeurIPS 2020, arXiv:2006.10739
- Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi, Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," ECCV 2020
- Sitzmann, Martel, Bergman, Lindell, Wetzstein, "Implicit neural representations with periodic activation functions" (SIREN), NeurIPS 2020, arXiv:2006.09661
Theory:
- Shin, Darbon, Karniadakis, "On the convergence of PINNs" (2020). Asymptotic consistency; compare with the finite-sample program sketched in Open Questions above.
- Maurer, "A vector-contraction inequality for Rademacher complexities," ALT 2016
Architectural extensions:
- Lu, Pestourie, Yao, Wang, Verdugo, Johnson, "Physics-informed neural networks with hard constraints for inverse design," arXiv:2102.04626 (2021). Hard-constraint PINNs.
- Kharazmi, Zhang, Karniadakis, "Variational physics-informed neural networks (VPINN)," arXiv:2003.00596 (2020)
- Yu, Lu, Meng, Karniadakis, "Gradient-enhanced physics-informed neural networks (gPINN)," arXiv:2111.02801 (2021)
- Yang, Meng, Karniadakis, "B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy data," arXiv:2003.06097 (2021)
- Jagtap, Karniadakis, "Extended PINNs (XPINNs): a generalized space-time domain decomposition based deep learning framework" (2020)
Neural operators (alternative paradigm):
- Li et al., "Fourier Neural Operator for parametric PDEs," ICLR 2021, arXiv:2010.08895
- Lu, Jin, Pang, Zhang, Karniadakis, "Learning nonlinear operators via DeepONet," Nature Machine Intelligence (2021)
Next Topics
Explore neural operators and scientific ML applications as this rapidly evolving field develops.
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- The Jacobian Matrix (layer 0 · tier 1)
- Automatic Differentiation (layer 1 · tier 1)
- Classical ODEs: Existence, Stability, and Numerical Methods (layer 1 · tier 1)
- Gradient Descent Variants (layer 1 · tier 1)
- Feedforward Networks and Backpropagation (layer 2 · tier 1)
Derived topics
- Lyapunov-Based Machine Learning for Chaos (layer 4 · tier 3)