Beyond LLMs
Physics-Informed Neural Networks
Embedding PDE constraints directly into the neural network loss function via automatic differentiation. When physics-informed learning works, when it fails, and what alternatives exist.
Prerequisites
Why This Matters
Interactive module
The same live PDE explorer appears here as optional context. Hide it if you want a pure reading path through PINNs.
The explorer above is the setup a PINN tries to replicate. The spectral solver on the left is the exact closed-form ground truth for four PDE archetypes (heat, advection, Schrödinger, Poisson): one Fourier multiplier per archetype, no time-stepping. The "Neural-net function fit" panel at the bottom trains a small SIREN MLP to memorize that field via supervised MSE. A true PINN keeps the same kind of network but replaces the supervised target with the PDE residual itself; the net never sees ground-truth values and instead learns to satisfy the equation pointwise. The sections below describe exactly what that substitution buys you and where it breaks.
Most of science is governed by partial differential equations (PDEs). Classical numerical solvers (finite elements, finite differences, spectral methods) work well but struggle with high-dimensional problems, complex geometries, and inverse problems. PINNs propose a different approach: use a neural network as a function approximator and enforce the PDE through the loss function.
The idea is seductive. The reality is more nuanced. PINNs work well in certain regimes and fail badly in others. Understanding where the boundary lies is critical for anyone applying ML to scientific problems.
Mental Model
A standard neural network learns from data alone. A PINN adds a second source of supervision: the governing equations. You do not need as much data because the physics constrains the space of acceptable solutions.
Think of it as regularization by physical law. Instead of an L2 penalty that pushes weights toward zero, you have a PDE residual penalty that pushes the solution toward physical consistency.
Core Definitions
Physics-Informed Neural Network (PINN)
A neural network trained to approximate the solution of a PDE by minimizing a composite loss that includes both data fidelity and PDE residual terms. The PDE residual is computed via automatic differentiation of the network output with respect to its inputs.
PDE Residual
For a PDE of the form $\mathcal{N}[u](x) = 0$, where $\mathcal{N}$ is a differential operator, the PDE residual at a collocation point $x_j$ is:

$$r_\theta(x_j) = \mathcal{N}[u_\theta](x_j)$$

This is computed by differentiating $u_\theta$ with respect to its inputs using automatic differentiation. A perfect solution has zero residual everywhere.
The PINN Loss Function
The central construct of PINNs is the composite loss:
PINN Loss Decomposition
Statement
The PINN loss is:

$$\mathcal{L}(\theta) = \lambda_{\text{data}}\,\mathcal{L}_{\text{data}} + \lambda_{\text{PDE}}\,\mathcal{L}_{\text{PDE}} + \lambda_{\text{BC}}\,\mathcal{L}_{\text{BC}} + \lambda_{\text{IC}}\,\mathcal{L}_{\text{IC}}$$

where:
- $\mathcal{L}_{\text{data}} = \frac{1}{N_d}\sum_{i=1}^{N_d} |u_\theta(x_i, t_i) - u_i|^2$ (data fidelity)
- $\mathcal{L}_{\text{PDE}} = \frac{1}{N_c}\sum_{j=1}^{N_c} |\mathcal{N}[u_\theta](x_j, t_j)|^2$ (PDE residual at collocation points)
- $\mathcal{L}_{\text{BC}}$ and $\mathcal{L}_{\text{IC}}$ enforce boundary and initial conditions
The weights $\lambda_{\text{data}}, \lambda_{\text{PDE}}, \lambda_{\text{BC}}, \lambda_{\text{IC}}$ balance the different loss terms.
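As a concrete sketch, the composite loss is just a weighted sum of mean-squared terms. The function below is a minimal numpy illustration (the name `pinn_loss` and the default weights are ours, not from any library); in a real PINN the residual arrays would come from autodiff rather than being passed in precomputed.

```python
import numpy as np

def pinn_loss(u_pred, u_obs, pde_res, bc_res, ic_res,
              lam_data=1.0, lam_pde=1.0, lam_bc=1.0, lam_ic=1.0):
    """Composite PINN loss: weighted sum of mean-squared terms."""
    L_data = np.mean((u_pred - u_obs) ** 2)  # data fidelity at observation points
    L_pde  = np.mean(pde_res ** 2)           # PDE residual at collocation points
    L_bc   = np.mean(bc_res ** 2)            # boundary-condition mismatch
    L_ic   = np.mean(ic_res ** 2)            # initial-condition mismatch
    return lam_data * L_data + lam_pde * L_pde + lam_bc * L_bc + lam_ic * L_ic
```

Everything downstream of this definition is a balancing act between the four terms, which is why the weights dominate the engineering discussion below.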
Intuition
The network is pulled in multiple directions: fit the observed data, satisfy the PDE everywhere in the domain, and respect boundary/initial conditions. The physics term acts as an infinite-dimensional regularizer, constraining the solution to the manifold of physically plausible functions even where no data exists.
Proof Sketch
There is no convergence "proof" in the classical sense for general PINNs. Theoretical results (Shin, Darbon, Karniadakis 2020) show that as the number of collocation points and network capacity grow, minimizers of the PINN loss converge to the PDE solution under regularity assumptions. The rate of convergence is generally worse than classical solvers for smooth problems.
Why It Matters
This decomposition is the entire PINN methodology. The key engineering decisions are: (1) the architecture of $u_\theta$, (2) the placement of collocation points, (3) the relative weights $\lambda_{\text{data}}, \lambda_{\text{PDE}}, \lambda_{\text{BC}}, \lambda_{\text{IC}}$, and (4) the optimizer and training schedule. Getting these wrong leads to solutions that satisfy neither the data nor the physics.
Failure Mode
The multi-objective nature of the loss creates optimization difficulties. The PDE residual and data terms can have vastly different scales and gradients, leading to one dominating the other. Adaptive weighting schemes (e.g., learning rate annealing, neural tangent kernel-based weighting) partially address this but do not fully solve it.
How PINNs Use Automatic Differentiation
The key enabling technology is automatic differentiation (autodiff). To compute terms like $\partial u_\theta/\partial t$ and $\partial^2 u_\theta/\partial x^2$, you differentiate the neural network output with respect to its inputs (not its parameters). Modern frameworks (PyTorch, JAX) compute these derivatives exactly and efficiently.
For example, if the PDE is the heat equation $u_t = \alpha u_{xx}$, the residual at a point $(x, t)$ is:

$$r_\theta(x, t) = \frac{\partial u_\theta}{\partial t}(x, t) - \alpha\,\frac{\partial^2 u_\theta}{\partial x^2}(x, t)$$

Both derivatives are computed by autodiff through the network graph.
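A quick sanity check of the residual idea: $u(x,t) = e^{-t}\sin x$ solves the heat equation with $\alpha = 1$ exactly, so its residual should vanish everywhere. The sketch below substitutes central finite differences for autodiff so it runs with numpy alone; a real PINN would differentiate the network graph with `torch.autograd.grad` or `jax.grad` instead.

```python
import numpy as np

def u(x, t):
    # Analytic solution of u_t = u_xx (heat equation with alpha = 1).
    return np.exp(-t) * np.sin(x)

def heat_residual(x, t, alpha=1.0, h=1e-4):
    # Central finite differences stand in for autodiff in this sketch.
    u_t  = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h**2
    return u_t - alpha * u_xx

r = heat_residual(0.7, 0.3)   # ~0 up to O(h^2) discretization error
```

Evaluating the residual with the wrong $\alpha$ gives a visibly nonzero value, which is exactly the gradient signal a PINN uses in inverse problems.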
When PINNs Work
Smooth Solutions to Known PDEs
PINNs perform well on problems with smooth solutions where the governing PDE is known exactly. Classic demonstrations include the Burgers equation (before shock formation), the Schrödinger equation, and steady-state heat conduction. In these settings, the physics loss provides strong regularization, and the network can represent the solution accurately with moderate capacity.
Inverse Problems
PINNs are particularly attractive for inverse problems: given sparse noisy observations, infer unknown PDE parameters. For example, estimating the diffusion coefficient from temperature measurements. The physics constraint regularizes the inverse problem and shrinks the feasible solution set; it does not, on its own, make the problem well-posed. Identifiability and stability still depend on the observation operator, the boundary and initial conditions, the parameterization of the unknowns, the noise model, and whether distinct parameters can produce indistinguishable observations. Many inverse PDE problems remain genuinely ill-posed even with the correct governing equation, and a successful PINN run on such a problem is fitting one solution from a non-trivial null space.
When PINNs Fail
The honest-survey failure modes (discontinuities, forward-solve cost, stiffness) are the surface story. The deeper, structural failures have names attached to them now and are where the interesting work is happening.
The three tabs above are the live, in-browser versions of the three failure modes described below. The code is tiny, runs as you watch, and cites its source paper on each tab.
Spectral bias: plain MLPs cannot represent high frequencies
Rahaman et al. (2019) and Tancik et al. (arXiv:2006.10739) showed that an MLP with standard activations is biased toward low-frequency components and provably slow to fit oscillatory targets. A vanilla PINN on a solution containing a high-frequency component such as $\sin(\omega x)$ with large $\omega$ will converge to the low-frequency bulk and never resolve the detail. Two practical fixes: Fourier feature encodings (Tancik et al. 2020; Mildenhall et al. 2020 for NeRF) preprocess inputs through $\gamma(x) = [\sin(2\pi B x), \cos(2\pi B x)]$ with random frequencies $B$, and sinusoidal representations like SIREN (Sitzmann et al. 2020, arXiv:2006.09661) replace activations with $\sin(\omega_0 z)$. Either transforms the NTK into one that is no longer frequency-biased. The "Neural-net function fit" panel in the explorer above uses SIREN for exactly this reason.
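The Fourier feature encoding itself is a few lines. This is a hedged numpy sketch of the Tancik et al. mapping; the frequency scale `sigma` is a free hyperparameter (our choice here, not a canonical value), and the downstream MLP that would consume `phi` is omitted.

```python
import numpy as np

def fourier_features(x, B):
    """gamma(x) = [sin(2*pi*B*x), cos(2*pi*B*x)] for frequencies B (Tancik et al. 2020)."""
    proj = 2.0 * np.pi * np.outer(x, B)                           # (n, m)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=1)   # (n, 2m)

rng = np.random.default_rng(0)
B = rng.normal(0.0, 10.0, size=16)        # B ~ N(0, sigma^2), sigma = 10 here
phi = fourier_features(np.linspace(0.0, 1.0, 8), B)
```

The choice of `sigma` sets the highest frequency the downstream network can resolve easily, which is why it must be matched to the expected frequency content of the PDE solution.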
NTK imbalance: the data and PDE losses train at wildly different rates
Wang, Teng, Perdikaris (arXiv:2001.04536, 2021) gave the foundational analysis: the neural tangent kernels of the data-fitting loss and the PDE-residual loss have eigenvalue spectra separated by orders of magnitude. In effect, one loss dominates every gradient step and the other stagnates, regardless of the manually-chosen weights. The paper proposes an NTK-balanced weighting scheme that measures the kernels online and rescales. Calling it "weighting" understates the result: it is the first mechanistic explanation of why PINNs are so brittle to loss coefficients.
The demo above recomputes both kernels on a fixed 10-point sample every 30 training steps and plots their sorted eigenvalues side by side. Watch how the PDE kernel's spectrum widens across decades as training progresses while the data kernel stays bounded. Toggle NTK-balanced weighting and the gap closes.
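What "recomputing the kernel" means mechanically: stack per-sample gradients of the network output into a Jacobian $J$ and form the Gram matrix $K = J J^T$. The sketch below does this for the data-fitting loss of a tiny hand-differentiated tanh network (all names and sizes are ours); the PDE kernel is the same construction with gradients of the residual instead of gradients of $u$, omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8                                         # hidden width (arbitrary)
w1, b1 = rng.normal(size=m), rng.normal(size=m)
w2 = rng.normal(size=m) / np.sqrt(m)

def per_sample_grad(x):
    # Gradient of u(x) = w2 . tanh(w1*x + b1) w.r.t. (w1, b1, w2), flattened.
    z = np.tanh(w1 * x + b1)
    dz = 1.0 - z ** 2                         # tanh'
    return np.concatenate([w2 * dz * x,       # du/dw1
                           w2 * dz,           # du/db1
                           z])                # du/dw2

xs = np.linspace(-1.0, 1.0, 10)               # fixed 10-point sample, as in the demo
J = np.stack([per_sample_grad(x) for x in xs])  # (10, 3m) per-sample Jacobian
K = J @ J.T                                   # empirical NTK Gram matrix of the data loss
eigs = np.sort(np.linalg.eigvalsh(K))[::-1]   # sorted spectrum, largest first
```

Comparing `eigs` for the data kernel against the analogous spectrum of the residual kernel is the measurement the NTK-balanced weighting scheme performs online.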
Curriculum and causality: naive simultaneous training fails on time-dependent PDEs
Krishnapriyan et al. (arXiv:2109.01050, NeurIPS 2021) built an explicit failure-mode taxonomy for PINNs and showed that simultaneous training on all collocation points (the standard setup) is wrong for evolution equations: information from the initial condition at $t = 0$ must propagate causally through time, but vanilla training mixes early- and late-time residuals as if they were independent. A sequence-to-sequence curriculum (train to some cutoff time, then extend) recovers much of the accuracy gap. Wang, Sankaran, Perdikaris (arXiv:2203.07404) formalized this as causal time-weighting: weight each collocation point at time $t_i$ by $w_i = \exp(-\epsilon \sum_{k < i} \mathcal{L}_r(t_k))$, a product of earlier residual factors, so the network must solve time $t_k < t_i$ before it is asked to solve $t_i$. The improvement is often orders of magnitude, which is why this result is structurally surprising.
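The causal weighting rule itself is nearly a one-liner. A minimal numpy sketch, assuming the time domain has been split into slices with a mean squared residual per slice (the function name and the `eps` default are ours):

```python
import numpy as np

def causal_weights(slice_residuals, eps=1.0):
    """w_i = exp(-eps * sum_{k<i} L_r(t_k))  (Wang-Sankaran-Perdikaris 2022).

    slice_residuals[i] is the mean squared PDE residual on time slice t_i."""
    cum = np.concatenate([[0.0], np.cumsum(slice_residuals[:-1])])
    return np.exp(-eps * cum)

# Early slices still unsolved (large residual) => later slices get ~zero weight,
# so gradient signal is forced to flow forward in time.
w = causal_weights(np.array([5.0, 5.0, 0.1, 0.1]))
```

Once the early-time residuals shrink, the weights of later slices recover toward one and training advances down the time axis automatically.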
Discontinuities and sharp gradients
Shock waves, contact discontinuities, and thin boundary layers cause PINNs to fail or converge extremely slowly. Neural networks with smooth activations represent discontinuities poorly; the PDE residual near a shock is large and noisy, destabilizing training. Classical shock-capturing schemes (WENO, Godunov) handle these cases far better. Variational PINNs (Kharazmi et al., arXiv:2003.00596) sidestep some of this by minimizing a weak form of the residual against test functions, reducing the required derivative order.
PINNs can be slower than classical solvers for forward problems
For a well-posed forward PDE with known coefficients, a finite element solver with adaptive meshing is typically faster and more accurate than a PINN. PINNs pay the cost of neural network training (thousands of gradient descent steps) for a single PDE instance. Classical solvers amortize their cost better. PINNs become competitive when you need to solve many related PDEs (amortized inference), when the problem is inverse (unknown coefficients to infer from sparse data), or when classical meshing is impractical.
The demo above shows the honest crossover. At 3 observations the classical FD grid search is under-constrained and its α estimate varies wildly with sample luck; the PINN's physics residual stabilizes the estimate. Drag n upward — above roughly 20 observations the classical estimate sharpens and matches or beats the PINN, because the data now constrains u directly.
Provable impossibility: some PDE classes cannot be PINN-approximated
de Ryck and Mishra (2024) gave explicit PDE classes where no PINN training trajectory can converge, no matter the architecture or weighting. This is a healthy counterweight to the "universal approximation" optimism of the early literature: PINNs are a method, not a solver, and the method has provable limits.
Stiff systems
PDEs with widely separated time scales create loss landscapes with pathological curvature. Fast dynamics produce large PDE residuals early in training; slow dynamics require long training to resolve. Multiscale architectures, hard-constraint formulations (Lu, Pestourie et al., arXiv:2102.04626) that enforce boundary conditions by architectural construction rather than loss penalty, and gradient-enhanced residual losses (gPINN, Yu et al., arXiv:2111.02801) mitigate but do not eliminate the problem.
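The hard-constraint idea mentioned above is easy to see in one dimension. A minimal sketch (our ansatz, in the spirit of Lu, Pestourie et al. 2021): multiply the network output by a factor that vanishes on the boundary and add an interpolant of the boundary data, so the boundary conditions hold for any network weights and the boundary loss term disappears entirely.

```python
import numpy as np

def hard_constrained_u(x, net, g0=0.0, g1=0.0):
    """u(x) = (1-x)*g0 + x*g1 + x*(1-x)*net(x) on [0, 1].

    Satisfies u(0) = g0 and u(1) = g1 exactly for ANY network output,
    so no boundary-condition loss term is needed."""
    return (1.0 - x) * g0 + x * g1 + x * (1.0 - x) * net(x)

net = lambda x: np.sin(3.0 * x) + 2.0   # stand-in for a trained MLP
```

For complicated geometries the factor $x(1-x)$ generalizes to a learned or analytic approximate distance function to the boundary, which is where the construction gets harder.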
Extensions
Biologically-Informed Neural Networks (BINNs)
BINNs apply the PINN framework to biological systems where the governing equations are partially known. Instead of enforcing a fully specified PDE, BINNs learn unknown terms in the equations from data while enforcing known structural constraints (conservation laws, positivity, symmetries).
Neural Operators: A Different Paradigm
Instead of learning a single PDE solution, neural operators learn the solution operator: a mapping from PDE parameters, initial conditions, or forcing terms to solutions.
Fourier Neural Operator (FNO): learns in the frequency domain, applying learned filters to the Fourier coefficients of the input function. Resolution invariant and fast at inference (one forward pass per new PDE instance).
DeepONet: uses a branch-trunk architecture. The branch network encodes the input function, the trunk network encodes the evaluation point, and their dot product gives the solution value. Theoretically grounded in the universal approximation theorem for operators (Chen & Chen, 1995).
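The branch-trunk contraction can be sketched in a few lines. Everything below is a random-weight stand-in (untrained, names ours): the point is the shape of the computation, namely that the solution value at a query point is a dot product between a latent code for the input function and a latent code for the point.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 10, 6                                       # sensor count, latent dimension
B_w = rng.normal(size=(p, m)) / np.sqrt(m)         # "branch" weights (random stand-in)
T_w, T_b = rng.normal(size=p), rng.normal(size=p)  # "trunk" weights

def branch(f_samples):
    # Encode the input function from its m sensor samples.
    return B_w @ f_samples

def trunk(y):
    # Encode the evaluation point y via cosine features.
    return np.cos(T_w * y + T_b)

def deeponet(f_samples, y):
    # Solution value = branch(f) . trunk(y)
    return float(branch(f_samples) @ trunk(y))

out = deeponet(np.sin(np.linspace(0.0, 1.0, m)), 0.5)
```

Because the input function enters only through its sensor samples and the query point only through the trunk, a trained DeepONet evaluates a new PDE instance with one cheap forward pass per query.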
Neural operators amortize the cost of training over many PDE instances. Once trained, solving a new PDE instance requires only a forward pass, not retraining. This makes them far more practical than PINNs for applications requiring repeated solves (design optimization, uncertainty quantification).
Bayesian PINNs
Yang, Meng, Karniadakis (arXiv:2003.06097, 2021) replace the point-estimate network with a posterior over weights, giving calibrated uncertainty quantification on both the solution and any inferred coefficients. The resulting B-PINN is more honest about its extrapolations than a vanilla PINN and connects directly to standard Bayesian deep learning.
Structural Connection: PINNs and RLHF
PINN training and RLHF are the same abstract problem with a cosmetic change of clothes, and noticing this is the single most useful way to transfer intuition between the two.
In RLHF, the reward model $r_\phi$ is a learned proxy for human judgment. Optimizing a policy against it aggressively (small KL coefficient $\beta$) produces reward hacking: the policy finds degenerate behaviors that score highly under $r_\phi$ without actually being good. The fix is a KL regularizer against the SFT policy $\pi_{\text{ref}}$:

$$\max_\pi\; \mathbb{E}_{y \sim \pi}\left[r_\phi(x, y)\right] - \beta\, \mathrm{KL}\left(\pi \,\|\, \pi_{\text{ref}}\right)$$
In a PINN, the PDE residual is a computed proxy for PDE satisfaction. Optimizing it aggressively relative to the data loss produces the mirror pathology: the network finds functions with small residual that ignore the observational data (or vice versa, dominated-by-data fits that violate physics away from measurement points). The effective multi-objective is:

$$\mathcal{L} = \mathcal{L}_{\text{data}} + \lambda\, \mathcal{L}_{\text{PDE}}$$

and Wang, Teng, Perdikaris's NTK analysis shows that the "correct" $\lambda$ is not a hyperparameter to tune; it is determined by the ratio of kernel traces, which is the PINN analog of selecting $\beta$ from a target KL budget in RLHF.
The parallels line up term by term:
| Structural role | RLHF | PINN |
|---|---|---|
| Ground-truth objective | Human judgment (unobserved) | PDE satisfaction (infeasible to compute exactly) |
| Learned proxy | Reward model | PDE residual |
| What "hacking" looks like | Policy scores high, humans disagree | Residual small, data fit poor (or vice versa) |
| Implicit trust region | KL against $\pi_{\text{ref}}$ | Data-loss weight $\lambda_{\text{data}}$ pins $u_\theta$ near observations |
| Ratio-balancing rule | Target KL budget | NTK eigenvalue ratio (Wang-Teng-Perdikaris 2021) |
| Non-stationary proxy | Reward drifts across distribution | PDE residual landscape stiffens during training |
Both are proxy optimization problems where the proxy is differentiable but the true objective is not; both need a regularizer to prevent the optimizer from exploiting the gap; and in both, the optimal regularizer strength is structural rather than tuned.
This is not merely an analogy. The algorithmic fixes transfer: causal time-weighting (Wang-Sankaran-Perdikaris 2022) is a curriculum on collocation points, which is the PINN analog of process-reward shaping in RLHF. Hard-constraint PINNs (Lu-Pestourie 2021) enforce boundary conditions architecturally instead of through a loss term — the PINN version of constrained decoding. The open question in both fields is whether proxy-hacking is fundamental or an artifact of current architectures.
Open Questions
Two questions where the existing TheoremPath machinery meets the PINN literature at a productive boundary. Framed deliberately as open so the reader can enter the research trajectory rather than just consume a survey.
Finite-sample generalization bounds for PINNs
Current PINN theory (Shin-Darbon-Karniadakis 2020) gives asymptotic consistency: as the number of collocation points and training iterations tend to infinity, the PINN converges to a PDE solution. This is the wrong kind of result for practice: real PINNs are trained on a finite, often modest, number of collocation points, not infinitely many.
A natural program: treat the PINN loss as an empirical risk $\mathcal{L}(\theta) = \frac{1}{N}\sum_{j=1}^{N} \ell(\theta, x_j)$, where $\ell$ depends on $u_\theta$ through its autodiff derivatives. Under Lipschitz-derivative regularity, apply the vector contraction inequality (Maurer 2016) to bound the empirical process terms. For a Sobolev or Barron-norm hypothesis class $\mathcal{F}$, the Rademacher complexity $\mathfrak{R}_N(\mathcal{F})$ would give the PINN analog of a VC-style generalization bound.
The hard technical question is whether the residual class $\{\mathcal{N}[u_\theta] : u_\theta \in \mathcal{F}\}$ has a Rademacher complexity controllable by that of $\mathcal{F}$, i.e., whether the derivative operator acts on the hypothesis class benignly under the parameter norm. This is open and genuinely research-grade.
PINN scaling laws
Kaplan-style scaling laws have reshaped language modeling: given a compute budget, there is a compute-optimal trade-off between parameter count, dataset size, and training steps. For PINNs, no equivalent study exists. Ad-hoc empirical surveys report that "bigger networks help until they do not," but there is no Chinchilla-style joint fit across problem classes.
A specific formulation: for a PDE whose solution lies in a Barron-norm class of complexity $C$, does the PINN training loss follow a joint power law $\mathcal{L}(P, N) \approx a\,P^{-\alpha} + b\,N^{-\beta}$ in parameter count $P$ and collocation-point count $N$, for exponents $\alpha, \beta$ determined by the PDE's conditioning rather than by generic architecture scaling? If yes, compute-optimal allocation should trade parameters against collocation-point density the way language models trade parameters against tokens. If no, PINN scaling is limited by PDE conditioning and the comparison to LMs is misleading.
This is a natural target for a workshop note or a short empirical paper. The TheoremPath curriculum has the pieces (scaling laws, Barron norms, Rademacher bounds) already assembled elsewhere; the PINN literature does not.
Connections to Mechanistic Interpretability
A possibly-naive observation, flagged here as a working hypothesis rather than a result. Mechanistic interpretability looks inside a trained network to recover human-readable circuits. The NTK spectrum view embedded above is doing something adjacent: looking inside an actively training network to see which directions in parameter space are responsible for which components of the loss.
Three specific ways the two fields touch, none of them fully worked out:
- The eigenvectors of the data NTK and the PDE NTK are literal directions in parameter space. When the gap between their spectra closes under NTK-balanced weighting, it means the same parameter directions start absorbing gradient signal from both losses. That is a mechanistic statement about the training dynamics, not just a convergence-rate statement, and it suggests a way to ask "which layer learned which part of the equation" by projecting gradients onto layer-restricted Gram matrices.
- SIREN and Fourier-feature networks (Tancik 2020, Sitzmann 2020) fix spectral bias by construction. The mechanistic question — which parts of the network represent high-frequency components and which represent low-frequency ones — is open. The superposition work on sparse features in LLMs has a natural analog for PINNs: are high-frequency Fourier modes "in superposition" with low-frequency ones in the hidden layer, and does spectral bias correspond to high-frequency directions being the ones most contested by superposition?
- The PINN–RLHF isomorphism in the section above implies that proxy hacking in RLHF and loss imbalance in PINNs share a mechanistic signature. A reward model has an NTK. Measuring its eigenvalue spread vs the KL regularizer's NTK spread during PPO updates is directly analogous to what the demo above does for PINN training. I have not seen this connection in either literature. That is either because it is wrong or because no one has written it yet; I think it is the second and the two-page version would be worth writing.
None of the above is a claim that any of this is solved. It is a list of the questions I would actually work on if I had the time, and it is the honest research trajectory TheoremPath is meant to support rather than decorate. If any of this is already established in a paper I have not read, I would like to hear about it.
Summary
- PINN loss = data fidelity + PDE residual + boundary/initial conditions
- Autodiff computes exact spatial/temporal derivatives of the network
- PINNs work best for smooth solutions, inverse problems, and data-sparse regimes
- PINNs fail on discontinuities, stiff systems, and problems where classical solvers already excel
- Loss balancing between physics and data terms is a critical engineering challenge
- Neural operators (FNO, DeepONet) learn solution operators and amortize cost across PDE instances
Exercises
Problem
Write the PINN loss for the 1D heat equation $u_t = \alpha u_{xx}$ on $x \in [0, 1]$, $t \in [0, T]$, with initial condition $u(x, 0) = u_0(x)$ and boundary conditions $u(0, t) = g_0(t)$, $u(1, t) = g_1(t)$. Identify all collocation point sets needed.
Problem
Why do PINNs struggle with the Burgers equation at small viscosity $\nu$? What happens to the PDE residual near a shock?
References
Foundational:
- Raissi, Perdikaris, Karniadakis, "Physics-informed neural networks," Journal of Computational Physics 378 (2019), pp. 686–707
- Karniadakis, Kevrekidis, Lu, Perdikaris, Wang, Yang, "Physics-informed machine learning," Nature Reviews Physics 3 (2021)
Failure-mode analysis (read these before applying PINNs):
- Wang, Teng, Perdikaris, "When and why PINNs fail to train: a neural tangent kernel perspective," arXiv:2001.04536 (2021). The foundational NTK-imbalance analysis.
- Krishnapriyan, Gholami, Zhe, Kirby, Mahoney, "Characterizing possible failure modes in physics-informed neural networks," NeurIPS 2021, arXiv:2109.01050
- Wang, Sankaran, Perdikaris, "Respecting causality is all you need for training PINNs," arXiv:2203.07404 (2022)
- de Ryck, Mishra, "Error analysis for PINNs and related models approximating PDEs," 2024 survey
Spectral bias / implicit neural representations:
- Rahaman et al., "On the spectral bias of neural networks," ICML 2019
- Tancik et al., "Fourier features let networks learn high frequency functions in low dimensional domains," NeurIPS 2020, arXiv:2006.10739
- Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi, Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," ECCV 2020
- Sitzmann, Martel, Bergman, Lindell, Wetzstein, "Implicit neural representations with periodic activation functions" (SIREN), NeurIPS 2020, arXiv:2006.09661
Theory:
- Shin, Darbon, Karniadakis, "On the convergence of PINNs" (2020). Asymptotic consistency; compare with the finite-sample program sketched in Open Questions above.
- Maurer, "A vector-contraction inequality for Rademacher complexities," ALT 2016
Architectural extensions:
- Lu, Pestourie, Yao, Wang, Verdugo, Johnson, "Physics-informed neural networks with hard constraints for inverse design," arXiv:2102.04626 (2021). Hard-constraint PINNs.
- Kharazmi, Zhang, Karniadakis, "Variational physics-informed neural networks (VPINN)," arXiv:2003.00596 (2020)
- Yu, Lu, Meng, Karniadakis, "Gradient-enhanced physics-informed neural networks (gPINN)," arXiv:2111.02801 (2021)
- Yang, Meng, Karniadakis, "B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy data," arXiv:2003.06097 (2021)
- Jagtap, Karniadakis, "Extended PINNs (XPINNs): a generalized space-time domain decomposition based deep learning framework" (2020)
Neural operators (alternative paradigm):
- Li et al., "Fourier Neural Operator for parametric PDEs," ICLR 2021, arXiv:2010.08895
- Lu, Jin, Pang, Zhang, Karniadakis, "Learning nonlinear operators via DeepONet," Nature Machine Intelligence (2021)
Next Topics
Explore neural operators and scientific ML applications as this rapidly evolving field develops.
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- The Jacobian Matrix (layer 0 · tier 1)
- Automatic Differentiation (layer 1 · tier 1)
- Classical ODEs: Existence, Stability, and Numerical Methods (layer 1 · tier 1)
- Gradient Descent Variants (layer 1 · tier 1)
- Feedforward Networks and Backpropagation (layer 2 · tier 1)
Derived topics
- Lyapunov-Based Machine Learning for Chaos (layer 4 · tier 3)