
Graph Neural Networks

Message passing on graphs: GCN, GAT, GraphSAGE, the WL isomorphism test as an expressivity ceiling, over-smoothing in deep GNNs, and applications to molecules, social networks, and knowledge graphs.

Advanced · Tier 2 · Current · Supporting · ~55 min

Why This Matters

Many important data structures are graphs: molecules (atoms as nodes, bonds as edges), social networks, citation networks, knowledge graphs, protein interaction networks. Standard neural networks operate on fixed-size vectors or grids. Graph Neural Networks (GNNs) operate directly on graph-structured data, respecting the topology.

The central idea is message passing: each node updates its representation by aggregating information from its neighbors. This is analogous to how CNNs aggregate information from spatial neighborhoods, but generalized to irregular graph structures. Gilmer et al. (2017) unified GCN, MPNN, interaction networks, and neural fingerprints under the single message passing neural network (MPNN) framework below.

Formal Setup

Let $G = (V, E)$ be a graph with node set $V$ ($|V| = n$) and edge set $E$. Each node $v$ has a feature vector $h_v^{(0)} \in \mathbb{R}^d$. Let $\mathcal{N}(v)$ denote the neighbors of $v$.

Definition

Message Passing Neural Network

A message passing layer updates node representations as:

$$h_v^{(\ell+1)} = \text{UPDATE}^{(\ell)}\left(h_v^{(\ell)},\ \text{AGGREGATE}^{(\ell)}\left(\{ h_u^{(\ell)} : u \in \mathcal{N}(v) \}\right)\right)$$

The AGGREGATE function collects neighbor information (it must be permutation-invariant: the output does not depend on the ordering of neighbors). The UPDATE function combines the node's current representation with the aggregated message. After $L$ layers, each node's representation captures information from its $L$-hop neighborhood.
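
To make the definition concrete, here is a minimal NumPy sketch of one message-passing layer. The mean aggregator, the ReLU, and the two-weight-matrix UPDATE are illustrative choices (not part of the definition); any permutation-invariant AGGREGATE and any UPDATE fit the framework.

```python
import numpy as np

def message_passing_layer(h, adj_list, W_self, W_neigh):
    """One message-passing layer (illustrative sketch, not a specific published model).

    h:        (n, d) node features at layer l
    adj_list: dict mapping node index -> list of neighbor indices
    W_self:   (d, d_out) weight applied to the node's own features
    W_neigh:  (d, d_out) weight applied to the aggregated message
    """
    n, d = h.shape
    out = np.zeros((n, W_self.shape[1]))
    for v in range(n):
        neigh = adj_list.get(v, [])
        # AGGREGATE: mean over neighbors -- permutation-invariant by construction
        m_v = h[neigh].mean(axis=0) if neigh else np.zeros(d)
        # UPDATE: combine own representation with the aggregated message, then ReLU
        out[v] = np.maximum(h[v] @ W_self + m_v @ W_neigh, 0.0)
    return out

# Tiny usage example: a path graph 0 - 1 - 2 with 4-dimensional features
rng = np.random.default_rng(0)
h0 = rng.standard_normal((3, 4))
adj = {0: [1], 1: [0, 2], 2: [1]}
h1 = message_passing_layer(h0, adj, rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
print(h1.shape)  # (3, 8)
```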

Definition

Adjacency and Degree Matrices

The adjacency matrix $A \in \{0,1\}^{n \times n}$ has $A_{ij} = 1$ if and only if $(i,j) \in E$. The degree matrix $D$ is diagonal with $D_{ii} = \sum_j A_{ij}$. The normalized adjacency is $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, where $\tilde{A} = A + I$ (self-loops added) and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.

Graph Convolutional Network (GCN)

Proposition

GCN as First-Order Spectral Approximation

Statement

The GCN layer (Kipf and Welling, 2017) computes:

$$H^{(\ell+1)} = \sigma\left(\hat{A}\, H^{(\ell)} W^{(\ell)}\right)$$

where $H^{(\ell)} \in \mathbb{R}^{n \times d_\ell}$ is the matrix of node features at layer $\ell$, $W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell+1}}$ is a learnable weight matrix, and $\sigma$ is a nonlinearity (typically ReLU).

This is derived from spectral graph convolution $g_\theta \star x = U g_\theta(\Lambda) U^\top x$ (where $L = U\Lambda U^\top$ is the eigendecomposition of the graph Laplacian) by approximating $g_\theta(\Lambda)$ with a first-order Chebyshev polynomial and simplifying.

Intuition

Multiplication by $\hat{A}$ averages each node's features with its neighbors' features (with symmetric normalization to prevent scale issues). The weight matrix $W$ then transforms this averaged representation. One GCN layer is equivalent to: (1) smooth features over the graph, (2) apply a linear transform, (3) apply a nonlinearity. Stacking $L$ layers propagates information $L$ hops.
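
A dense-matrix sketch of one GCN layer, assuming a small 0/1 adjacency without self-loops as input; real implementations use sparse matrices and precompute $\hat{A}$ once rather than per layer.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer, H' = ReLU(A_hat H W), with A a dense 0/1 adjacency (no self-loops)."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1)                       # D~_ii = sum_j A~_ij
    A_hat = A_tilde / np.sqrt(np.outer(d, d))     # elementwise D~^{-1/2} A~ D~^{-1/2}
    return np.maximum(A_hat @ H @ W, 0.0)         # smooth -> linear transform -> ReLU
```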

Proof Sketch

Start with spectral convolution $g_\theta \star x = U g_\theta(\Lambda) U^\top x$, where $L = I - D^{-1/2} A D^{-1/2} = U \Lambda U^\top$ is the symmetric normalized Laplacian. Truncating the Chebyshev expansion at first order (and taking $\lambda_{\max} \approx 2$) gives $g_\theta \star x \approx \theta_0 x + \theta_1 (L - I)x = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x$. Setting $\theta = \theta_0 = -\theta_1$ (parameter sharing) yields $g_\theta \star x \approx \theta\left(I + D^{-1/2} A D^{-1/2}\right) x$. The renormalization trick replaces $I + D^{-1/2} A D^{-1/2}$ with $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, which keeps the operator's eigenvalues in $(-1, 1]$ and stabilizes repeated application.

Why It Matters

GCN provides a simple, efficient message passing operation grounded in spectral graph theory. The spectral derivation clarifies what GCN does: it is a low-pass filter on the graph (smoothing node features along edges). This connection to spectral theory also explains GCN's limitations, particularly over-smoothing.

Failure Mode

GCN treats all neighbors equally (after degree normalization). It cannot learn to weight some neighbors more than others. For heterogeneous graphs where some edges are more informative, this uniform aggregation is suboptimal. GAT addresses this by learning attention weights per edge.

Graph Attention Network (GAT)

GAT (Velickovic et al., 2018) replaces the fixed normalization in GCN with learned attention weights:

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(\text{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_k]\right)\right)}$$

$$h_i^{(\ell+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W h_j^{(\ell)}\right)$$

where $a \in \mathbb{R}^{2d'}$ is a learned attention vector and $\|$ denotes concatenation. GAT typically uses multi-head attention: $K$ independent attention heads whose outputs are concatenated.

The advantage: GAT can learn which neighbors are more relevant for each node. The cost: attention computation adds $O(|E| \cdot d)$ overhead per layer, and the attention mechanism introduces more parameters.
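
A single-head sketch of the GAT computation for one node in NumPy; the function names and the per-node loop are illustrative, and real implementations batch the score computation over all edges.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_update(h, i, neighbors, W, a):
    """Single-head GAT update for node i (sketch).

    h:         (n, d) node features
    neighbors: indices in N(i) (GAT conventionally includes i itself)
    W:         (d, d') shared linear transform
    a:         (2*d',) attention vector
    """
    Wh = h @ W                                             # transform all nodes once
    scores = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
                       for j in neighbors])
    scores -= scores.max()                                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()          # softmax over N(i)
    out = sum(a_ij * Wh[j] for a_ij, j in zip(alpha, neighbors))
    return np.maximum(out, 0.0)                            # sigma = ReLU
```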

GraphSAGE

GraphSAGE (Hamilton et al., 2017) uses sampling and aggregation for scalability:

  1. For each node, sample a fixed-size subset of neighbors (e.g., 25 from 1-hop, 10 from 2-hop)
  2. Aggregate sampled neighbors using mean, LSTM, or max-pool
  3. Concatenate the node's own features with the aggregated neighbor features
  4. Apply a linear transform and nonlinearity

The sampling makes GraphSAGE scalable to large graphs (millions of nodes) because the computation per node is bounded regardless of the node's degree. GCN and GAT aggregate over all neighbors, which is expensive for high-degree nodes.
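
A sketch of one GraphSAGE layer with mean aggregation and fixed-size neighbor sampling. The sample size, sampling without replacement, and final L2 normalization follow the spirit of the original paper, but the exact details here are illustrative assumptions.

```python
import numpy as np

def graphsage_layer(h, adj_list, W, num_samples=10, rng=None):
    """One GraphSAGE layer: sample neighbors, mean-aggregate, concat with self, transform.

    h:        (n, d) node features
    adj_list: dict node -> list of neighbor indices
    W:        (2*d, d_out) weights applied to [own features || aggregated neighbors]
    """
    rng = rng or np.random.default_rng(0)
    n, d = h.shape
    out = np.zeros((n, W.shape[1]))
    for v in range(n):
        neigh = adj_list.get(v, [])
        if neigh:
            # fixed-size sample bounds the cost per node regardless of degree
            k = min(num_samples, len(neigh))
            sampled = rng.choice(neigh, size=k, replace=False)
            agg = h[sampled].mean(axis=0)
        else:
            agg = np.zeros(d)
        z = np.concatenate([h[v], agg]) @ W        # concat self + aggregated, then transform
        out[v] = np.maximum(z, 0.0)                # ReLU
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)          # L2-normalize embeddings, as in the paper
```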

WL Isomorphism Test and GNN Expressivity

Theorem

GNN Expressivity Upper Bound via WL Test

Statement

The Weisfeiler-Leman (WL) graph isomorphism test iteratively refines node labels: at each step, a node's label is updated based on its current label and the multiset of its neighbors' labels. Two graphs are "WL-equivalent" if the WL test cannot distinguish them.

Upper bound: No message passing GNN can distinguish two graphs that the 1-WL test cannot distinguish.

Achievability: A GNN with injective AGGREGATE and UPDATE functions (specifically, the Graph Isomorphism Network (GIN)) is as powerful as the 1-WL test.

Intuition

Each layer of a message passing GNN refines node representations using neighbor information, exactly mirroring one iteration of the WL test. If the WL test produces identical label sequences for two nodes (or two graphs), no amount of message passing can tell them apart. The GNN's aggregation function determines whether it achieves this upper bound: sum aggregation (GIN) is maximally expressive; mean and max aggregation lose information about neighbor multisets.

Proof Sketch

The proof by Xu et al. (2019) proceeds in two steps. (1) If a GNN's aggregation maps two different multisets of neighbor features to the same output, it is strictly less powerful than 1-WL. Sum aggregation preserves the multiset: sums of an injective per-element encoding are distinct for distinct multisets over a countable domain. Mean and max lose information (e.g., mean cannot distinguish $\{1, 1, 1\}$ from $\{1\}$). (2) With an injective update function (realizable by an MLP via the universal approximation theorem), the GNN matches 1-WL step for step.
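
A two-line illustration of the multiset argument: mean aggregation collapses $\{1,1,1\}$ and $\{1\}$ to the same message, while sum does not. (The scalar features are an illustrative toy case.)

```python
import numpy as np

a = np.array([1.0, 1.0, 1.0])   # neighbor multiset {1, 1, 1}
b = np.array([1.0])             # neighbor multiset {1}

print(a.mean(), b.mean())       # 1.0 1.0  -> mean aggregation cannot tell them apart
print(a.sum(), b.sum())         # 3.0 1.0  -> sum aggregation can
```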

Why It Matters

This theorem sets a hard ceiling on what standard message passing GNNs can compute. The cycle-counting consequence is sharper than it sounds: Chen et al. (2020, Can Graph Neural Networks Count Substructures?) prove that 1-WL MPNNs cannot count induced cycles of any length $\geq 3$ as graph-level features in the worst case. They cannot reliably count triangles, 4-cycles, or any longer cycle on adversarially chosen graphs, and they cannot detect arbitrary subgraph patterns of size $\geq 4$. This motivates higher-order GNN architectures that go beyond 1-WL ($k$-WL GNNs, subgraph GNNs, GNNs with positional encodings).

Failure Mode

The 1-WL test cannot distinguish certain non-isomorphic regular graphs (e.g., some pairs of 3-regular graphs on 8 nodes). Standard GNNs inherit this limitation. For tasks that require distinguishing such graphs (e.g., certain molecular properties), higher-order methods or augmentations (random features, positional encodings) are needed.
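
The failure mode is easy to reproduce. Below is a small sketch of 1-WL color refinement run on an even simpler pair than the 3-regular example: two non-isomorphic 2-regular graphs, a 6-cycle versus two disjoint triangles, for which the test produces identical color histograms. (The graphs and the fixed number of refinement rounds are illustrative choices.)

```python
from collections import Counter

def wl_histogram(adj, rounds=3):
    """1-WL color refinement (sketch). adj: dict node -> list of neighbors.
    Returns the graph-level histogram of refined node colors."""
    colors = {v: 0 for v in adj}                          # uniform initial labels
    for _ in range(rounds):
        # new label = (own color, sorted multiset of neighbor colors)
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: palette[sigs[v]] for v in adj}       # compress signatures to integers
    return Counter(colors.values())

# Non-isomorphic 2-regular graphs on 6 nodes: a 6-cycle vs. two disjoint triangles.
c6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_histogram(c6) == wl_histogram(triangles))        # True: 1-WL cannot distinguish them
```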

Over-Smoothing

As GNN depth increases, node representations converge to the same vector. After $L$ message passing layers, each node's representation depends on its $L$-hop neighborhood. For a connected graph with diameter $d$, once $L \geq d$, every node sees every other node. The repeated averaging (in GCN) drives all representations toward the leading Perron eigenvector of the normalized adjacency.

Formally, for a symmetric-normalized GCN without nonlinearity, $H^{(L)} = \hat{A}^L H^{(0)} W^{(0)} \cdots W^{(L-1)}$. For a connected graph, $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ has leading eigenvalue 1 with Perron eigenvector $v_1 = \tilde{D}^{1/2} \mathbf{1} / \lVert \tilde{D}^{1/2} \mathbf{1} \rVert$. Subleading eigenvalues satisfy $|\lambda_i| < 1$, so $\hat{A}^L \to v_1 v_1^\top$ at the rate $|\lambda_2|^L$ (Oono-Suzuki 2020). All node features collapse onto the subspace spanned by $v_1$, which depends only on the degree sequence, not on the input features.

(The random-walk normalization $P = D^{-1} A$ is a different operator whose leading left eigenvector is the stationary distribution $\pi$. The symmetric normalization used in GCN is related to $P$ by a similarity transform, but its Perron eigenvector is not $\pi$ itself.)
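
A quick numerical check of the collapse, under an assumed toy setup (a random connected graph with a ring backbone, random input features): the average pairwise cosine similarity of the rows of $\hat{A}^L H^{(0)}$ approaches 1 as $L$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

# Random connected graph: ring backbone plus a few random chords
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
chords = np.triu(rng.random((n, n)) < 0.1, 1)
A = np.maximum(A, (chords | chords.T).astype(float))

A_tilde = A + np.eye(n)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))                  # D~^{-1/2} A~ D~^{-1/2}

H = rng.standard_normal((n, 8))                            # random input features
for L in (1, 2, 4, 8, 16, 32):
    HL = np.linalg.matrix_power(A_hat, L) @ H
    U = HL / np.linalg.norm(HL, axis=1, keepdims=True)
    print(L, round(float((U @ U.T).mean()), 3))            # mean pairwise cosine -> 1
```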

Practical consequence: GNNs with more than 4-8 layers often perform worse than shallower ones. This limits the effective receptive field and distinguishes GNNs from CNNs, where additional depth typically keeps helping up to much larger layer counts.

Mitigations:

  • Residual / initial-residual connections (GCNII, Chen et al. 2020)
  • Jumping Knowledge (Xu et al. 2018): aggregate representations from all layers via max, concat, or LSTM pooling
  • PairNorm (Zhao-Akoglu 2020): per-layer centering + pairwise distance normalization
  • DropEdge (Rong et al. 2020): randomly mask edges at each layer to slow mixing
  • GraphNorm (Cai et al. 2021): batch-level normalization adapted to graph batches
  • Graph transformers (Ying et al. 2021, Kim et al. 2022): replace local message passing with global attention plus structural encodings

(Hierarchical pooling methods such as DiffPool (Ying et al. 2018) are often cited alongside these, but they address graph-level classification rather than over-smoothing of per-node representations.)

Over-Squashing

A complementary pathology, identified by Alon and Yahav (2021): in long-range tasks, information from distant nodes must pass through bottlenecks in the graph, and the message space at each node has fixed capacity. Exponentially many paths fold into a $d$-dimensional message, destroying signal before it reaches the target node.

Topping et al. (2022) formalize this via graph curvature: negatively-curved edges (bottlenecks) cause exponential signal attenuation, and graph rewiring (Ricci flow) can provably reduce over-squashing. This motivates graph transformers and other architectures that let distant nodes communicate directly without traversing bottlenecks.

Beyond 1-WL

The $k$-WL hierarchy refines the 1-WL test by tracking labels of $k$-tuples instead of single nodes (Morris et al. 2019 introduced $k$-GNNs to match this power). Beyond $k = 2$, each increment strictly increases distinguishing power:

  • 1-WL ≡ standard MPNN / GIN (Xu et al. 2019): cannot count any cycle of length $\geq 3$, cannot distinguish 3-regular graphs on 8 nodes
  • 2-WL = 1-WL (Cai-Fürer-Immerman 1992): no extra power over 1-WL despite operating on pairs
  • 3-WL ≡ 2-FWL (folklore): strictly more powerful than 1-WL; can count cycles up to length 7 (Arvind et al. 2020) but still misses some structural properties
  • $k$-WL for $k \geq 3$: strictly increasing power, but $O(n^k)$ memory and time

Practical alternatives that exceed 1-WL without $k$-tuples:

  • Subgraph-aware GNNs (Zhao et al. 2022, Bevilacqua et al. 2022): mark nodes by subgraph identity
  • Random features (Abboud et al. 2021): adding i.i.d. random features to nodes breaks symmetries
  • Positional encodings (Dwivedi-Bresson 2021): Laplacian eigenvectors, random-walk landing probabilities, distance encodings
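
A sketch of the Laplacian-eigenvector positional encodings from the last bullet, assuming a dense symmetric 0/1 adjacency; note that each eigenvector's sign is arbitrary, so practical implementations typically randomize signs during training.

```python
import numpy as np

def laplacian_pe(A, k=4):
    """First k nontrivial eigenvectors of the symmetric normalized Laplacian (sketch).

    A: (n, n) dense symmetric 0/1 adjacency. Returns an (n, k) positional encoding,
    one k-dimensional coordinate per node.
    """
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    L = np.eye(A.shape[0]) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]                    # drop the trivial lowest-eigenvalue vector
```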

Heterogeneous and Temporal Graphs

Real-world graphs frequently have multiple node types, multiple edge types, or evolve over time. Plain MPNNs ignore this structure.

Heterogeneous graphs (typed nodes/edges):

  • R-GCN (Schlichtkrull et al. 2018) introduces per-relation weight matrices: $h_v^{(\ell+1)} = \sigma\left(\sum_{r} \sum_{u \in \mathcal{N}_r(v)} \frac{1}{c_{v,r}} W_r^{(\ell)} h_u^{(\ell)}\right)$. Used for knowledge graphs (entity classification, link prediction); a minimal sketch follows after this list.
  • HAN (Wang et al. 2019) defines metapaths over node types and applies attention at the metapath level.
  • HGT (Hu et al. 2020): per-edge-type attention with relation-aware key/value projections; the typical baseline for ogbn-mag.
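
A minimal sketch of one R-GCN layer matching the update above, with mean normalization $c_{v,r} = |\mathcal{N}_r(v)|$ and an added self-loop term as in the original paper; the dictionary-based edge representation and function names are illustrative assumptions.

```python
import numpy as np

def rgcn_layer(h, edges_by_relation, W_rel, W_self):
    """One R-GCN layer with per-relation weights (sketch).

    h:                  (n, d) node features
    edges_by_relation:  dict relation name -> list of (src, dst) directed edges
    W_rel:              dict relation name -> (d, d_out) weight matrix
    W_self:             (d, d_out) self-loop weight matrix
    """
    n = h.shape[0]
    out = h @ W_self                                   # self-loop term
    for r, edges in edges_by_relation.items():
        msg = np.zeros_like(out)
        count = np.zeros(n)
        for src, dst in edges:
            msg[dst] += h[src] @ W_rel[r]              # message along a type-r edge
            count[dst] += 1
        out += msg / np.maximum(count, 1.0)[:, None]   # normalize by c_{v,r} = |N_r(v)|
    return np.maximum(out, 0.0)                        # ReLU
```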

Temporal graphs (edges with timestamps):

  • TGN (Rossi et al. 2020): memory modules per node updated by message passing on the streaming edge sequence.
  • TGAT (Xu et al. 2020): time-aware self-attention with functional encodings of time deltas.
  • CAW (Wang et al. 2021): causal anonymous walks for inductive temporal link prediction.

The choice between static-with-features (encoding time as a node feature) and a true temporal architecture matters for tasks where the order of edge arrivals carries signal — fraud detection, recommendation, social-event forecasting.

Graph Transformers

Local message passing has structural limits (1-WL ceiling, over-smoothing, over-squashing). Graph transformers replace or augment local aggregation with global self-attention while injecting structural information through positional/structural encodings.

  • Graphormer (Ying et al. 2021): transformer with shortest-path distance as an attention bias and centrality encodings as node features. Won the OGB-LSC PCQM4M benchmark.
  • GraphGPS (Rampasek et al. 2022): a recipe combining MPNN layers, transformer layers, and positional encodings (Laplacian eigenvectors, random-walk landing). The current default for "general-purpose graph transformer."
  • Exphormer (Shirzad et al. 2023): sparse global attention via expander graphs, scaling to graphs with $> 10^4$ nodes per pass.
  • Graphormer-style models are the current state of the art on PCQM4Mv2 and Long Range Graph Benchmark (LRGB).

Benchmarks

  • OGB (Hu et al. 2020) is the de facto graph-learning benchmark suite: ogbn (node prediction), ogbg (graph prediction), ogbl (link prediction), and OGB-LSC for million-to-billion-scale tasks. Use OGB rather than legacy datasets (Cora, Citeseer) when reporting numbers, since the small datasets are saturated and have well-known split issues.
  • Long Range Graph Benchmark (Dwivedi et al. 2022) targets tasks where the answer requires combining information from $> 5$ hops; it is the standard setting for measuring whether a graph transformer or rewiring scheme actually helps with long-range dependencies.

Applications

Molecular property prediction. Atoms are nodes, bonds are edges. GNNs predict molecular properties (toxicity, solubility, binding affinity) from the molecular graph. The Open Catalyst Project uses GNNs for catalyst discovery.

Social network analysis. Users are nodes, connections are edges. GNNs predict user behavior, detect communities, and identify influential nodes.

Recommendation systems. Users and items form a bipartite graph. GNNs propagate preferences through the graph to predict ratings. PinSage (Ying et al., 2018) deployed a GNN for Pinterest recommendations at scale.

Knowledge graph completion. Entities are nodes, relations are edges. GNNs predict missing links (e.g., given "X is a protein" and "X interacts with Y," predict Y's function).

Common Confusions

Watch Out

GNN invariance does not mean GNNs can distinguish all graphs

Message passing GNNs are permutation-invariant, so isomorphic graphs always map to the same output; the limitation runs the other way: many non-isomorphic graphs also map to the same output. A GNN with mean aggregation cannot distinguish certain non-isomorphic graphs, and even GNNs with injective aggregation (GIN), which are maximally expressive among message passing architectures, are capped at the 1-WL test. Claims that "GNNs respect graph structure" must be qualified by this expressivity bound.

Watch Out

Graph convolution is not the same as image convolution

In images, the grid structure is fixed and uniform (every pixel has the same number of neighbors in the same relative positions). In graphs, nodes have varying degrees and no canonical ordering of neighbors. Graph convolution must be permutation-invariant with respect to neighbor ordering, while image convolution exploits the fixed grid.

Watch Out

Over-smoothing is not the same as oversmoothing in statistics

Over-smoothing in GNNs refers to node representations converging to the same vector with increasing depth. This is a structural property of iterated message passing on graphs, not a bias-variance issue. It has no analogue in standard feedforward networks, where depth generally increases representational power.

Canonical Examples

Example

GCN on a citation network

In the Cora citation network (2,708 papers, 5,429 citation edges, 7 classes), a 2-layer GCN with 16 hidden units achieves approximately 81% node classification accuracy using only 140 labeled nodes (20 per class). The GCN propagates label information through the citation graph: if paper A cites papers B and C which are both about "neural networks," the GCN infers A is likely about neural networks too. A 3-layer GCN performs similarly; a 10-layer GCN drops to about 60% due to over-smoothing.

Exercises

ExerciseCore

Problem

Write the GCN update rule for a single node $v$ with neighbors $\mathcal{N}(v) = \{u_1, u_2, u_3\}$ in a graph with self-loops, using symmetric normalization. If all four nodes (including $v$) have degree 3 (after adding self-loops: degree 4), what is the coefficient on each neighbor's features?

ExerciseAdvanced

Problem

Prove that mean aggregation cannot distinguish the multisets $\{1, 1, 1, 1\}$ and $\{1, 1\}$ but sum aggregation can. Give an example of two non-isomorphic graphs where this distinction matters for node classification.

ExerciseResearch

Problem

Why does over-smoothing occur in GCNs but not in standard deep feedforward networks? Relate your answer to the spectral properties of the normalized adjacency matrix $\hat{A}$.

References

Canonical:

  • Gilmer, Schoenholz, Riley, Vinyals, Dahl, "Neural Message Passing for Quantum Chemistry" (ICML 2017) -- MPNN framework
  • Kipf, Welling, "Semi-Supervised Classification with Graph Convolutional Networks" (ICLR 2017) -- GCN
  • Hamilton, Ying, Leskovec, "Inductive Representation Learning on Large Graphs" (NeurIPS 2017) -- GraphSAGE
  • Velickovic et al., "Graph Attention Networks" (ICLR 2018) -- GAT
  • Xu, Hu, Leskovec, Jegelka, "How Powerful are Graph Neural Networks?" (ICLR 2019), Sections 3-5 -- GIN and 1-WL bound

Current:

  • Morris et al., "Weisfeiler and Leman Go Neural" (AAAI 2019) -- $k$-GNNs
  • Oono, Suzuki, "Graph Neural Networks Exponentially Lose Expressive Power for Node Classification" (ICLR 2020) -- over-smoothing rate
  • Alon, Yahav, "On the Bottleneck of Graph Neural Networks and its Practical Implications" (ICLR 2021) -- over-squashing
  • Topping et al., "Understanding Over-Squashing and Bottlenecks on Graphs via Curvature" (ICLR 2022)
  • Bronstein, Bruna, Cohen, Velickovic, "Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges" (2021). The unifying framing for GNNs and other equivariant architectures. arXiv:2104.13478
  • Hamilton, Graph Representation Learning (2020), Chapters 5-7
  • Schlichtkrull et al., "Modeling Relational Data with Graph Convolutional Networks" (ESWC 2018). R-GCN. arXiv:1703.06103
  • Hu et al., "Heterogeneous Graph Transformer" (WWW 2020). HGT. arXiv:2003.01332
  • Rossi et al., "Temporal Graph Networks for Deep Learning on Dynamic Graphs" (ICML GRL+ 2020). TGN. arXiv:2006.10637
  • Ying et al., "Do Transformers Really Perform Bad for Graph Representation?" (NeurIPS 2021). Graphormer. arXiv:2106.05234
  • Rampasek, Galkin, Dwivedi, Luu, Wolf, Beaini, "Recipe for a General, Powerful, Scalable Graph Transformer" (NeurIPS 2022). GraphGPS. arXiv:2205.12454
  • Hu et al., "Open Graph Benchmark: Datasets for Machine Learning on Graphs" (NeurIPS 2020). OGB. arXiv:2005.00687
  • Dwivedi et al., "Long Range Graph Benchmark" (NeurIPS Datasets 2022). arXiv:2206.08164

Next Topics

GNNs connect to spectral graph theory, geometric deep learning, and the study of equivariant neural networks.

Last reviewed: April 26, 2026
