
Scientific ML

Neural ODEs and Continuous-Depth Networks

Treating neural network depth as a continuous variable: the ODE formulation of residual networks, the adjoint method for memory-efficient backpropagation, the duality with PINNs, the SDE bridge to diffusion models, and the open research frontier.

Advanced · Tier 3 · Current · Frontier watch · ~60 min

Why This Matters

A ResNet block has the form $h_{\ell+1} = h_\ell + f_\ell(h_\ell)$. This matches an Euler update only when the residual is scaled by a step size $\Delta t = 1/L$ for network depth $L$ (Haber and Ruthotto 2017; Chen et al. 2018). Standard unscaled ResNets are not formal Euler discretizations of an ODE; the continuous limit requires this $1/L$ rescaling. With that rescaling, the $L \to \infty$ limit gives a continuous dynamical system:

$$\frac{dh}{dt} = f_\theta(h(t), t)$$

This reframing is not just mathematical elegance. It gives you: constant memory backpropagation (via the adjoint method), adaptive computation depth (the ODE solver decides how many steps to take), and a bridge between deep learning and dynamical systems theory. The tradeoffs are real: training is slower, and the expressiveness is constrained by ODE theory (no crossing trajectories). Neural ODEs connect to DEQ models (which solve for the ODE's fixed point directly) and continuous thought machines (which use ODE dynamics for adaptive-depth reasoning).

The ResNet-ODE Connection

Proposition

ResNet as Discretized ODE

Statement

A residual network with update $h_{t+1} = h_t + \frac{1}{L} f_\theta(h_t, t/L)$ for $t = 0, 1, \ldots, L-1$ is the Euler discretization of the initial value problem:

$$\frac{dh}{dt} = f_\theta(h(t), t), \quad h(0) = x, \quad t \in [0, 1]$$

with step size $\Delta t = 1/L$. In the limit $L \to \infty$, the discrete trajectory $\{h_0, h_1, \ldots, h_L\}$ converges to the continuous solution $h(t)$ (under Lipschitz conditions on $f_\theta$).

The output of the network is $h(1) = h(0) + \int_0^1 f_\theta(h(t), t) \, dt$.
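A minimal numerical check of this proposition, using an arbitrary fixed tanh vector field in place of a trained $f_\theta$ (the field, dimensions, and step counts below are illustrative assumptions, not taken from the original papers): the $1/L$-scaled ResNet is exactly Euler with step $1/L$, and its output approaches the continuous solution $h(1)$ as $L$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W, b = rng.normal(size=(d, d)) / np.sqrt(d), rng.normal(size=d)

def f(h, t):
    # A fixed, Lipschitz vector field standing in for a trained f_theta (assumption).
    return np.tanh(W @ h + b)

def resnet_forward(x, L):
    # h_{t+1} = h_t + (1/L) * f(h_t, t/L): L residual blocks with 1/L scaling,
    # i.e. exactly Euler integration of dh/dt = f(h, t) with step 1/L.
    h = x.copy()
    for t in range(L):
        h = h + (1.0 / L) * f(h, t / L)
    return h

x = rng.normal(size=d)
reference = resnet_forward(x, 100_000)          # very fine Euler grid, stands in for h(1)
for L in (4, 16, 64, 256):
    err = np.linalg.norm(resnet_forward(x, L) - reference)
    print(f"L={L:4d}  ||h_L - h(1)|| ~= {err:.2e}")   # error shrinks roughly like O(1/L)
```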

Intuition

Each ResNet layer adds a small correction to the hidden state. In the continuous limit, these corrections become a vector field that flows the input through a smooth trajectory. The network's "depth" becomes a continuous time variable. Deeper networks correspond to longer integration times, and the network learns the vector field $f_\theta$ that transforms inputs into useful representations.

Why It Matters

This perspective explains why ResNets work: the skip connection $h_{t+1} = h_t + f(h_t)$ is not just a gradient-flow trick. It makes each layer an incremental transformation, and in the continuous limit these incremental transformations compose into a smooth flow that is invertible (under standard well-posedness conditions on $f_\theta$, by uniqueness of ODE solutions). The discrete finite-step ResNet map $h \mapsto h + f(h)$ is not invertible in general; invertibility of a single block requires extra conditions such as a Lipschitz constraint $\|\nabla f\| < 1$ (the basis of invertible ResNets, Behrmann et al. 2019). The takeaway is that small-step composition is the regime where forward dynamics behave benignly, which is why deep ResNets train well.

It also enables replacing the fixed $L$-layer architecture with an adaptive ODE solver that chooses its own step size based on local numerical error and stiffness of the dynamics, not on semantic input difficulty. Inputs whose trajectories pass through stiff regions of $f_\theta$ get more steps; smooth trajectories get fewer. This is data-dependent compute, but the criterion is numerical (truncation error tolerance), not a learned "is this input hard" signal.
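To make the error-driven criterion concrete, here is a small step-doubling adaptive Euler loop (a toy scheme written for this page, not the algorithm inside any particular solver; the vector field and tolerance are illustrative assumptions). Two inputs to the same dynamics receive very different numbers of accepted steps purely because of local truncation error:

```python
import numpy as np

def adaptive_euler(f, h0, t0=0.0, t1=1.0, tol=1e-4):
    # Step-doubling adaptive Euler: compare one step of size dt with two of size dt/2,
    # accept if the discrepancy (a local-error estimate) is below tol, otherwise retry.
    t, h, dt = t0, np.asarray(h0, dtype=float), 0.05
    steps = nfe = 0
    while t < t1:
        dt = min(dt, t1 - t)
        k1 = f(h, t); nfe += 1
        full = h + dt * k1                                   # one Euler step of size dt
        mid = h + 0.5 * dt * k1                              # two Euler steps of size dt/2
        half = mid + 0.5 * dt * f(mid, t + 0.5 * dt); nfe += 1
        if np.linalg.norm(full - half) < tol:                # accept: advance, grow step
            t, h, steps = t + dt, half, steps + 1
            dt *= 1.5
        else:                                                # reject: shrink step, retry
            dt *= 0.5
    return h, steps, nfe

# A smooth vector field (illustrative assumption): mild decay for small h,
# fast (stiff) decay once h rises above ~1.
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
f = lambda h, t: -(0.5 + 49.5 * sig(10.0 * (h - 1.0))) * h

for h0 in (0.5, 5.0):
    _, steps, nfe = adaptive_euler(f, [h0])
    print(f"h(0) = {h0}: accepted steps = {steps}, function evaluations = {nfe}")
```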

Failure Mode

The continuous limit requires the dynamics $f_\theta$ to be Lipschitz continuous. If $f_\theta$ is not Lipschitz (e.g., if it has sharp discontinuities or unbounded gradients), the ODE may not have a unique solution, and the convergence of Euler's method is not guaranteed. ReLU is globally 1-Lipschitz despite being non-differentiable at zero (Lipschitz continuity does not require differentiability), so standard ReLU networks with bounded weights satisfy this condition. The places where Lipschitz fails in practice are unbounded activations like exp without clipping, or architectures whose Jacobian norm grows without bound during training.

The Adjoint Method

Theorem

Adjoint Sensitivity Method

Statement

To compute $\frac{dL}{d\theta}$ for a neural ODE, define the adjoint state $a(t) = \frac{dL}{dh(t)}$. The adjoint satisfies a backward ODE:

$$\frac{da}{dt} = -a(t)^\top \frac{\partial f}{\partial h}(h(t), t, \theta)$$

integrated backwards from $t = T$ to $t = 0$ with initial condition $a(T) = \frac{dL}{dh(T)}$. This follows the convention of Chen et al. (2018). The sign depends on the time-direction convention: reversing time or redefining $a(t)$ flips the sign, so check the convention before comparing references.

The parameter gradient is:

$$\frac{dL}{d\theta} = \int_0^T a(t)^\top \frac{\partial f}{\partial \theta}(h(t), t, \theta) \, dt$$

(equivalently $-\int_T^0 a(t)^\top \frac{\partial f}{\partial \theta} \, dt$, which is how it is accumulated while integrating backwards from $T$ to $0$).

Memory cost: $O(1)$ in depth (constant, regardless of the number of ODE solver steps), compared to $O(L)$ for standard backpropagation through $L$ layers. The $O(1)$ memory cost is asymptotic; in practice adjoint solvers may require more function evaluations than discrete backprop, and reconstructed gradients can drift for stiff dynamics (Gholaminejad et al. 2019).

Intuition

Standard backprop stores all intermediate activations $h_0, h_1, \ldots, h_L$ to compute gradients, costing $O(L)$ memory. The adjoint method avoids this by solving the adjoint ODE backwards in time, recomputing $h(t)$ on the fly by integrating the forward ODE backwards. This trades memory for compute: you solve two ODEs (forward and backward) instead of storing all activations.

This is the continuous analog of activation checkpointing, taken to its logical extreme.
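A compact numpy sketch of the whole procedure under the simplest possible assumptions (scalar state, scalar parameter, fixed-step Euler in both directions; an illustration of the equations above, not a production implementation): only the final state is stored, and the backward pass re-derives $h(t)$ while accumulating the adjoint and the parameter gradient.

```python
import numpy as np

# Dynamics dh/dt = f(h, theta) = tanh(theta * h); loss L = 0.5 * (h(T) - y)^2.
theta, y, T, N = 0.8, 2.0, 1.0, 1000
dt = T / N

def f(h, th):          return np.tanh(th * h)
def df_dh(h, th):      return th * (1.0 - np.tanh(th * h) ** 2)
def df_dtheta(h, th):  return h * (1.0 - np.tanh(th * h) ** 2)

def forward(h0, th):
    h = h0
    for _ in range(N):                  # no intermediate states are stored: O(1) memory
        h = h + dt * f(h, th)
    return h

def adjoint_grad(h0, th):
    hT = forward(h0, th)
    a, h, dL_dtheta = hT - y, hT, 0.0   # a(T) = dL/dh(T)
    for _ in range(N):                  # integrate h, a, and the theta-integral backwards
        dL_dtheta += dt * a * df_dtheta(h, th)   # dL/dtheta = int_0^T a * df/dtheta dt
        a = a + dt * a * df_dh(h, th)            # da/dt = -a * df/dh, run in reverse time
        h = h - dt * f(h, th)                    # recompute h(t) by reversing the dynamics
    return dL_dtheta

# Sanity check against a central finite difference of the discrete forward pass.
h0, eps = 0.5, 1e-5
loss = lambda th: 0.5 * (forward(h0, th) - y) ** 2
fd = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
print(adjoint_grad(h0, theta), fd)      # agree up to O(dt) discretization error
```

The backward recomputation of $h(t)$ only approximately inverts the discrete forward pass, which is exactly the gradient-accuracy caveat discussed under Failure Mode below.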

Why It Matters

Constant-memory backpropagation enables training of very deep (or continuous-depth) networks without running out of GPU memory. For standard ResNets with $L = 100$ layers, the memory saving is $\sim 100\times$. For neural ODEs where the solver may take thousands of steps, the saving is even larger.

The adjoint method is not new. It was developed in optimal control theory in the 1960s (Pontryagin's maximum principle). Neural ODEs brought it to deep learning.

Failure Mode

The backward ODE recomputation of $h(t)$ introduces numerical error. If the forward and backward solvers use different discretizations or if the dynamics are chaotic, the recomputed $h(t)$ can diverge from the original, causing gradient inaccuracy. In practice, this is mitigated by using the same adaptive solver in both directions, but it remains a source of subtle bugs. Checkpointed approaches (solving forward, saving a few checkpoints, recomputing between them) offer a middle ground.

Why ODEs Constrain Expressiveness

ODE trajectories cannot cross. If $h_a(0) \neq h_b(0)$, then $h_a(t) \neq h_b(t)$ for all $t$ (by uniqueness of ODE solutions under Lipschitz conditions). This means the map $h(0) \mapsto h(T)$ is a homeomorphism: it is continuous, invertible, and its inverse is continuous.

This is a limitation. A homeomorphism cannot change the topology of the data. If one class is enclosed by a ring of the other class (nested annuli), a neural ODE in the plane cannot pull the enclosed class out past the ring to make the two classes linearly separable, because doing so would require trajectories to cross. A standard ResNet (with finite step size) can, because discrete maps are not constrained by ODE uniqueness.
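A small numerical illustration of the non-crossing property (the 1D vector field below is arbitrary and chosen only for this demo): trajectories started at different points keep their ordering for all time, so the flow map is monotone and can never realize a map like $x \mapsto -x$.

```python
import numpy as np

def flow(h0, f, n_steps=1000):
    # Fine-step Euler solve of dh/dt = f(h, t) from t = 0 to t = 1.
    h, dt = float(h0), 1.0 / n_steps
    for i in range(n_steps):
        h = h + dt * f(h, i * dt)
    return h

f = lambda h, t: np.sin(3.0 * h) + np.cos(5.0 * t)   # arbitrary smooth 1D dynamics (assumption)
starts = [-1.0, -0.3, 0.2, 1.5]
ends = [flow(s, f) for s in starts]
print(ends)                                          # same ordering as `starts`
assert all(a < b for a, b in zip(ends, ends[1:]))    # trajectories never cross
```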

In practice, this limitation is addressed by:

  • Augmented neural ODEs (Dupont et al. 2019): concatenate $k$ extra zero-initialized dimensions to the state, so the flow acts on $\mathbb{R}^{d+k}$ instead of $\mathbb{R}^d$. The homeomorphism constraint still applies in the lifted space, but in higher dimensions it can unknot data that is topologically stuck in $\mathbb{R}^d$. Dupont et al. show that this both fixes the topology obstruction for tasks like nested spheres and reduces solver NFE at training time, because the learned vector field can be simpler. A structural sketch follows this list.
  • Using neural ODEs as components in a larger architecture, not as the entire model.
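A structural sketch of the augmentation trick from the first bullet (dimensions, the random vector field, and the fixed-step Euler solver are placeholder assumptions; a real model would train $f_\theta$ and add a readout layer):

```python
import numpy as np

d, k = 2, 3                                   # data dimension and number of augmented dims
rng = np.random.default_rng(1)
W = rng.normal(size=(d + k, d + k)) / np.sqrt(d + k)

def f_aug(z, t):
    # Untrained placeholder vector field on the lifted space R^(d+k).
    return np.tanh(W @ z)

def augmented_node_forward(x, n_steps=100):
    z = np.concatenate([x, np.zeros(k)])      # lift: append k zero-initialized dimensions
    dt = 1.0 / n_steps
    for i in range(n_steps):                  # Euler solve of the lifted ODE on [0, 1]
        z = z + dt * f_aug(z, i * dt)
    return z                                  # a readout/classifier layer would act on z

print(augmented_node_forward(np.array([0.3, -1.2])))
```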

Common Confusions

Watch Out

Neural ODEs are not just deep ResNets

The continuous formulation gives qualitatively different properties: adaptive depth, constant-memory training, invertibility constraints, and connections to physics. A 100-layer ResNet with ReLU and batch norm does not behave like a neural ODE. The ODE perspective is most useful when you actually use an ODE solver (adaptive step size, error control), not when you just view a discrete ResNet "as if" it were continuous.

Watch Out

Constant memory does not mean free computation

The adjoint method saves memory by recomputing activations during the backward pass. This doubles the computational cost (two ODE solves instead of one forward pass). For some applications, this tradeoff is worth it (when memory is the bottleneck). For others, standard backprop with gradient checkpointing is more practical.

Watch Out

Neural ODEs are not universally better than discrete networks

Neural ODEs are slower to train (ODE solver overhead), harder to parallelize (sequential integration), and more constrained in expressiveness (no trajectory crossing). They are most useful when continuous dynamics are a natural fit: time series, physical systems, normalizing flows, and problems where adaptive computation depth matters.

Exercises

ExerciseCore

Problem

A ResNet has 50 layers with hidden dimension 256. Standard backprop stores all 50 intermediate activations ($50 \times 256$ floats). The neural ODE adjoint method stores only the initial and final states. Compute the memory ratio. Under what conditions is the ODE approach worth the extra compute?

ExerciseAdvanced

Problem

Explain why the non-crossing property of ODE trajectories limits the expressiveness of neural ODEs. Give a specific example of a classification problem in $\mathbb{R}^2$ that a neural ODE (without augmentation) cannot solve but a standard 2-layer network can.

Duality with Physics-Informed Neural Networks

Neural ODEs and PINNs are dual approaches to the same class of dynamical-system problems. The duality is sharp:

  • PINN: parameterizes the solution $u_\theta(t, x)$; enforces the PDE residual $\mathcal{N}[u_\theta] = 0$ at collocation points (no integrator); the loss penalizes $\|\mathcal{N}[u_\theta]\|^2$ plus boundary/data terms.
  • Neural ODE: parameterizes the vector field $f_\theta(h, t)$; solves the IVP $\dot{h} = f_\theta(h, t)$ with a numerical solver; the loss penalizes $\|h_\theta(T) - y\|^2$ at observed endpoints.

PINNs are mesh-free function approximators of the solution; they evaluate $u_\theta$ at points and demand the differential equation be satisfied there. Neural ODEs are learned right-hand sides; they specify the dynamics and let an ODE solver produce the trajectory.
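The duality is easy to see in code. The sketch below uses the toy problem $du/dt = -u$, $u(0) = 1$ (network sizes, the fixed-step Euler solver, and the observed endpoint are illustrative assumptions): the PINN penalizes the equation residual of a parameterized solution, while the neural ODE penalizes the endpoint error of a trajectory produced from a parameterized vector field.

```python
import torch
import torch.nn as nn

y_obs = torch.tensor([[0.3679]])                     # observed endpoint u(1) = e^{-1} (assumed data)

# PINN: parameterize the solution u_theta(t); the loss is the residual of du/dt = -u
# at collocation points plus an initial-condition term. No integrator is involved.
u_net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
t = torch.rand(128, 1, requires_grad=True)           # collocation points in [0, 1]
u = u_net(t)
du_dt = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
pinn_loss = ((du_dt + u) ** 2).mean() + (u_net(torch.zeros(1, 1)) - 1.0).pow(2).mean()

# Neural ODE: parameterize the vector field f_theta(h, t); a solver (here fixed-step Euler)
# produces the trajectory, and the loss is the endpoint error against the observation.
f_net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
h, n_steps = torch.ones(1, 1), 50                    # h(0) = 1
for i in range(n_steps):
    t_i = torch.full((1, 1), i / n_steps)
    h = h + (1.0 / n_steps) * f_net(torch.cat([h, t_i], dim=1))
node_loss = ((h - y_obs) ** 2).mean()

print(pinn_loss.item(), node_loss.item())            # each would be minimized by its own optimizer
```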

Each has the failure mode the other avoids. PINNs struggle with sharp gradients and stiff multi-scale problems because the solution itself has structure that smooth networks resolve poorly. Neural ODEs struggle when the dynamics themselves have intrinsic complexity that requires fine-grained vector fields, but they handle stiff regions automatically through adaptive solvers. PINNs amortize across initial conditions only by retraining; neural ODEs handle new initial conditions via a fresh forward solve at no retraining cost.

In scientific ML the choice is usually dictated by what is known: if you trust the PDE and want to solve it once well, PINNs (or classical FEM) are appropriate; if you have observations of trajectories and need to learn the unknown dynamics from them, neural ODEs are appropriate. Hybrid approaches like neural operators (FNO, DeepONet) split the difference by amortizing the solution map across PDE instances.

Connection to Diffusion Models: The SDE Bridge

Diffusion models are not neural ODEs; they are neural SDEs (stochastic differential equations). The forward noising process

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\, dt + g(t)\, d\mathbf{w}$$

is an SDE with drift $\mathbf{f}$ and diffusion $g$. The reverse-time SDE (Anderson 1982),

$$d\mathbf{x} = [\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})]\, dt + g(t)\, d\bar{\mathbf{w}},$$

is the generative direction. The neural network learns the only unknown, the score $\nabla_{\mathbf{x}} \log p_t$. The full theory requires stochastic calculus: Itô's lemma, the existence and uniqueness theory for SDEs (built on the Itô isometry), and the Fokker-Planck equation that links the SDE to the time evolution of $p_t$.

The deterministic counterpart of the diffusion reverse SDE is the probability flow ODE:

$$\frac{d\mathbf{x}}{dt} = \mathbf{f}(\mathbf{x}, t) - \frac{1}{2} g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x}).$$

This ODE has the same marginals $p_t$ as the SDE at every $t$ (Song et al. 2021, Theorem 2). Sampling from the diffusion model by integrating this ODE backward (the standard trick used by DPM-Solver, DDIM, and EDM samplers) is literally neural ODE inference, with $f_\theta$ given by the score model. The bridge between neural ODEs and diffusion is not metaphorical; the modern fast samplers are neural ODE solvers operating on a learned vector field.
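A toy demonstration of probability-flow ODE sampling where the score is known in closed form, so no network is needed (the variance-exploding schedule $\sigma(t) = t$ and the 1D Gaussian "data" are assumptions chosen to keep the score analytic): integrating the ODE backward from the prior recovers the data distribution's scale.

```python
import numpy as np

s, T, n_steps, n_samples = 0.5, 10.0, 500, 10_000
rng = np.random.default_rng(0)

def score(x, t):
    # Closed-form score of p_t = N(0, s^2 + t^2) for Gaussian data N(0, s^2)
    # under the VE noising x_t = x_0 + t * eps.
    return -x / (s ** 2 + t ** 2)

# For this schedule g(t)^2 = d(sigma^2)/dt = 2t, so the probability-flow ODE reads
# dx/dt = -(1/2) * g(t)^2 * score(x, t) = -t * score(x, t).
x = rng.normal(scale=np.sqrt(s ** 2 + T ** 2), size=n_samples)   # samples from the prior p_T
dt = T / n_steps
for i in range(n_steps):
    t = T - i * dt
    x = x - dt * (-t * score(x, t))         # Euler step backwards in time, from T toward 0
print(x.std())                              # should be close to the data scale s = 0.5
```

Swapping the analytic score for a trained score network turns this loop into a DDIM/DPM-Solver-style sampler of the kind described above (those methods use more refined discretizations than plain Euler).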

This connects all the way back to energy-based models: the score $\nabla_{\mathbf{x}} \log p_t$ is the negative gradient of the EBM energy $E_t(\mathbf{x}) = -\log p_t(\mathbf{x})$ (modulo the partition function). The probability-flow ODE includes a $-\frac{1}{2}g(t)^2 \nabla \log p_t$ term that pushes toward higher density (lower energy of $p_t$), but it also carries the full SDE drift $\mathbf{f}(\mathbf{x}, t)$, so it is not pure gradient flow on a single energy. It is the marginal-preserving deterministic dynamics that share marginals with the reverse SDE; the score appears as one term, not as the entire vector field. Calling PF-ODE inference "energy descent" is correct only in the special case where $\mathbf{f} \equiv 0$ (e.g., variance-exploding schedules at the SDE level). The neural-ODE / neural-SDE / EBM trio is one mathematical object viewed three ways, with the score being the bridge term rather than the whole story.

For the SDE-specific machinery — adjoint method extended to SDEs, generative neural SDEs as infinite-dimensional GANs, Latent SDEs for time series — see the dedicated neural SDEs page.

Open Questions

Neural ODEs occupy an unusual position in modern ML. The framework is mathematically clean and connects to control theory, dynamical systems, and statistical physics, but it has not displaced discrete networks for most applications. The interesting questions are about why.

  1. Adjoint accuracy under stiff dynamics. The vanilla adjoint method recomputes forward states by integrating backward. For stiff or chaotic dynamics, recomputed states diverge from the true forward path, corrupting gradients (Gholaminejad et al. 2019). The "discretize-then-optimize" alternative (backprop through the solver) gives exact gradients but loses the constant-memory property. The right tradeoff between memory, compute, and gradient accuracy is unsettled.

  2. The expressiveness gap. The non-crossing property of ODE flows means a Neural ODE in $\mathbb{R}^d$ is a homeomorphism. Augmented Neural ODEs lift to $\mathbb{R}^{d+k}$, which provably fixes the topology obstruction (Dupont et al. 2019), but the expressiveness gap with discrete ResNets in finite-data regimes remains poorly characterized. When does continuity actually help, and when does it just slow you down?

  3. Why diffusion samplers work despite high-dimensional non-Lipschitz score networks. Probability-flow ODE samplers integrate score networks that are not Lipschitz-regularized in any explicit way, in very high dimension and over anywhere from tens to thousands of function evaluations. Standard ODE theory says this should be unstable; in practice EDM and DPM-Solver work robustly. The implicit regularization of the training objective is suspected to play a role, but no clean theorem exists.

  4. Neural ODEs for sequence modeling against transformers. Mamba and other state-space models are continuous-time linear systems with selective parameterization, conceptually adjacent to Neural ODEs but with hand-crafted dynamics. A nonlinear Neural-ODE-style sequence model that competes with Mamba or transformers on language has not been demonstrated. Whether the limitation is fundamental or engineering is open.

  5. Symplectic and structure-preserving Neural ODEs. For physical systems with conservation laws (Hamiltonian dynamics, energy conservation), generic Neural ODEs do not preserve the conserved quantity. Hamiltonian Neural Networks (Greydanus et al. 2019) and Lagrangian Neural Networks (Cranmer et al. 2020) constrain the architecture to enforce structure, but at a cost in flexibility. The right balance for general scientific ML applications is unsettled.

References

Canonical:

  • Chen, Rubanova, Bettencourt, Duvenaud, "Neural Ordinary Differential Equations" (NeurIPS 2018, Best Paper; arXiv:1806.07366). The foundational paper.
  • Haber and Ruthotto, "Stable Architectures for Deep Neural Networks" (Inverse Problems 2017; arXiv:1705.03341). ResNet as discretized ODE, with the $1/L$ step-size rescaling.
  • Pontryagin, Boltyanskii, Gamkrelidze, Mishchenko, The Mathematical Theory of Optimal Processes (1962). Original adjoint method, 56 years before Chen et al. (2018).

Current:

  • Dupont, Doucet, Teh, "Augmented Neural ODEs" (NeurIPS 2019; arXiv:1904.01681). Fixes the topology-preserving expressiveness limitation.
  • Gholaminejad, Keutzer, Biros, "ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs" (IJCAI 2019; arXiv:1902.10298). Documents gradient-accuracy and NFE issues in the vanilla adjoint method and proposes a checkpointed alternative.
  • Onken, Ruthotto, "Discretize-Optimize vs. Optimize-Discretize for Time-Series Regression and Continuous Normalizing Flows" (arXiv:2005.13420, 2020). Empirical comparison of the two adjoint paradigms.
  • Grathwohl, Chen, Bettencourt, Sutskever, Duvenaud, "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models" (ICLR 2019; arXiv:1810.01367). Neural ODEs for normalizing flows via Hutchinson trace estimator.
  • Li, Wong, Chen, Duvenaud, "Scalable Gradients for Stochastic Differential Equations" (AISTATS 2020; arXiv:2001.01328). Adjoint method extended to SDEs.
  • Kidger, Foster, Li, Lyons, "Neural SDEs as Infinite-Dimensional GANs" (ICML 2021; arXiv:2102.03657). Generative modeling with neural SDEs.
  • Kidger, Morrill, Foster, Lyons, "Neural Controlled Differential Equations for Irregular Time Series" (NeurIPS 2020; arXiv:2005.08926). CDEs for irregularly sampled sequences.
  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021; arXiv:2011.13456). The probability flow ODE (Section 4.3 / Appendix D.1), the explicit Neural-ODE / diffusion bridge.
  • Bai, Kolter, Koltun, "Deep Equilibrium Models" (NeurIPS 2019; arXiv:1909.01377). Related infinite-depth approach that solves for a fixed point instead of integrating.

Reference / Survey:

  • Kidger, "On Neural Differential Equations" (PhD thesis, Oxford, 2022; arXiv:2202.02435). The standard modern reference covering ODEs, SDEs, and CDEs in one volume.
  • Greydanus, Dzamba, Yosinski, "Hamiltonian Neural Networks" (NeurIPS 2019; arXiv:1906.01563). Structure-preserving architecture for physical systems.


Last reviewed: April 26, 2026
