
Graph Neural Networks

Message passing on graphs: GCN, GAT, GraphSAGE, the WL isomorphism test as an expressivity ceiling, over-smoothing in deep GNNs, and applications to molecules, social networks, and knowledge graphs.

Advanced · Tier 2 · Current · Supporting · ~55 min

Why This Matters

Many important data structures are graphs: molecules (atoms as nodes, bonds as edges), social networks, citation networks, knowledge graphs, protein interaction networks. Standard neural networks operate on fixed-size vectors or grids. Graph Neural Networks (GNNs) operate directly on graph-structured data, respecting the topology.

The central idea is message passing: each node updates its representation by aggregating information from its neighbors. This is analogous to how CNNs aggregate information from spatial neighborhoods, but generalized to irregular graph structures. Gilmer et al. (2017) unified GCN, MPNN, interaction networks, and neural fingerprints under the single message passing neural network (MPNN) framework below.

Formal Setup

Let $G = (V, E)$ be a graph with node set $V$ ($|V| = n$) and edge set $E$. Each node $v$ has a feature vector $h_v^{(0)} \in \mathbb{R}^d$. Let $\mathcal{N}(v)$ denote the neighbors of $v$.

Definition

Message Passing Neural Network

A message passing layer updates node representations as:

$$h_v^{(\ell+1)} = \text{UPDATE}^{(\ell)}\left(h_v^{(\ell)},\ \text{AGGREGATE}^{(\ell)}\left(\{ h_u^{(\ell)} : u \in \mathcal{N}(v) \}\right)\right)$$

The AGGREGATE function collects neighbor information (it must be permutation-invariant: the output does not depend on the ordering of neighbors). The UPDATE function combines the node's current representation with the aggregated message. After $L$ layers, each node's representation captures information from its $L$-hop neighborhood.
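
To make the definition concrete, here is a minimal NumPy sketch of one message-passing layer. The mean aggregator, the ReLU, and the two-weight-matrix UPDATE are illustrative choices (not part of the definition); any permutation-invariant AGGREGATE and any UPDATE fit the framework.

```python
import numpy as np

def message_passing_layer(h, adj_list, W_self, W_neigh):
    """One message-passing layer (illustrative sketch, not a specific published model).

    h:        (n, d) node features at layer l
    adj_list: dict mapping node index -> list of neighbor indices
    W_self:   (d, d_out) weight applied to the node's own features
    W_neigh:  (d, d_out) weight applied to the aggregated message
    """
    n, d = h.shape
    out = np.zeros((n, W_self.shape[1]))
    for v in range(n):
        neigh = adj_list.get(v, [])
        # AGGREGATE: mean over neighbors -- permutation-invariant by construction
        m_v = h[neigh].mean(axis=0) if neigh else np.zeros(d)
        # UPDATE: combine own representation with the aggregated message, then ReLU
        out[v] = np.maximum(h[v] @ W_self + m_v @ W_neigh, 0.0)
    return out

# Tiny usage example: a path graph 0 - 1 - 2 with 4-dimensional features
rng = np.random.default_rng(0)
h0 = rng.standard_normal((3, 4))
adj = {0: [1], 1: [0, 2], 2: [1]}
h1 = message_passing_layer(h0, adj, rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
print(h1.shape)  # (3, 8)
```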

Definition

Adjacency and Degree Matrices

The adjacency matrix $A \in \{0,1\}^{n \times n}$ has $A_{ij} = 1$ if and only if $(i,j) \in E$. The degree matrix $D$ is diagonal with $D_{ii} = \sum_j A_{ij}$. The normalized adjacency is $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, where $\tilde{A} = A + I$ (self-loops added) and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.

Graph Convolutional Network (GCN)

Proposition

GCN as First-Order Spectral Approximation

Statement

The GCN layer (Kipf and Welling, 2017) computes:

$$H^{(\ell+1)} = \sigma\left(\hat{A}\, H^{(\ell)} W^{(\ell)}\right)$$

where $H^{(\ell)} \in \mathbb{R}^{n \times d_\ell}$ is the matrix of node features at layer $\ell$, $W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell+1}}$ is a learnable weight matrix, and $\sigma$ is a nonlinearity (typically ReLU).

This is derived from spectral graph convolution $g_\theta \star x = U g_\theta(\Lambda) U^\top x$ (where $L = U\Lambda U^\top$ is the eigendecomposition of the graph Laplacian) by approximating $g_\theta(\Lambda)$ with a first-order Chebyshev polynomial and simplifying.

Intuition

Multiplication by $\hat{A}$ averages each node's features with its neighbors' features (with symmetric normalization to prevent scale issues). The weight matrix $W$ then transforms this averaged representation. One GCN layer is equivalent to: (1) smooth features over the graph, (2) apply a linear transform, (3) apply a nonlinearity. Stacking $L$ layers propagates information $L$ hops.
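
A dense-matrix sketch of one GCN layer, assuming a small 0/1 adjacency without self-loops as input; real implementations use sparse matrices and precompute $\hat{A}$ once rather than per layer.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer, H' = ReLU(A_hat H W), with A a dense 0/1 adjacency (no self-loops)."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1)                       # D~_ii = sum_j A~_ij
    A_hat = A_tilde / np.sqrt(np.outer(d, d))     # elementwise D~^{-1/2} A~ D~^{-1/2}
    return np.maximum(A_hat @ H @ W, 0.0)         # smooth -> linear transform -> ReLU
```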

Proof Sketch

Start with spectral convolution $g_\theta \star x = U g_\theta(\Lambda) U^\top x$, where $L = I - D^{-1/2} A D^{-1/2} = U \Lambda U^\top$ is the symmetric normalized Laplacian. Truncating the Chebyshev expansion at first order (and taking $\lambda_{\max} \approx 2$) gives $g_\theta \star x \approx \theta_0 x + \theta_1 (L - I)x = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x$. Setting $\theta = \theta_0 = -\theta_1$ (parameter sharing) yields $g_\theta \star x \approx \theta\left(I + D^{-1/2} A D^{-1/2}\right) x$. The renormalization trick replaces $I + D^{-1/2} A D^{-1/2}$ with $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, which keeps the operator's eigenvalues in $(-1, 1]$ and stabilizes repeated application.

Why It Matters

GCN provides a simple, efficient message passing operation grounded in spectral graph theory. The spectral derivation clarifies what GCN does: it is a low-pass filter on the graph (smoothing node features along edges). This connection to spectral theory also explains GCN's limitations, particularly over-smoothing.

Failure Mode

GCN treats all neighbors equally (after degree normalization). It cannot learn to weight some neighbors more than others. For heterogeneous graphs where some edges are more informative, this uniform aggregation is suboptimal. GAT addresses this by learning attention weights per edge.

Graph Attention Network (GAT)

GAT (Velickovic et al., 2018) replaces the fixed normalization in GCN with learned attention weights:

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(\text{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_k]\right)\right)}$$

$$h_i^{(\ell+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W h_j^{(\ell)}\right)$$

where $a \in \mathbb{R}^{2d'}$ is a learned attention vector and $\|$ denotes concatenation. GAT typically uses multi-head attention: $K$ independent attention heads whose outputs are concatenated.

The advantage: GAT can learn which neighbors are more relevant for each node. The cost: attention computation adds $O(|E| \cdot d)$ overhead per layer, and the attention mechanism introduces more parameters.
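
A single-head sketch of the GAT computation for one node in NumPy; the function names and the per-node loop are illustrative, and real implementations batch the score computation over all edges.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_update(h, i, neighbors, W, a):
    """Single-head GAT update for node i (sketch).

    h:         (n, d) node features
    neighbors: indices in N(i) (GAT conventionally includes i itself)
    W:         (d, d') shared linear transform
    a:         (2*d',) attention vector
    """
    Wh = h @ W                                             # transform all nodes once
    scores = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
                       for j in neighbors])
    scores -= scores.max()                                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()          # softmax over N(i)
    out = sum(a_ij * Wh[j] for a_ij, j in zip(alpha, neighbors))
    return np.maximum(out, 0.0)                            # sigma = ReLU
```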

GraphSAGE

GraphSAGE (Hamilton et al., 2017) uses sampling and aggregation for scalability:

  1. For each node, sample a fixed-size subset of neighbors (e.g., 25 from 1-hop, 10 from 2-hop)
  2. Aggregate sampled neighbors using mean, LSTM, or max-pool
  3. Concatenate the node's own features with the aggregated neighbor features
  4. Apply a linear transform and nonlinearity

The sampling makes GraphSAGE scalable to large graphs (millions of nodes) because the computation per node is bounded regardless of the node's degree. GCN and GAT aggregate over all neighbors, which is expensive for high-degree nodes.
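
A sketch of one GraphSAGE layer with mean aggregation and fixed-size neighbor sampling. The sample size, sampling without replacement, and final L2 normalization follow the spirit of the original paper, but the exact details here are illustrative assumptions.

```python
import numpy as np

def graphsage_layer(h, adj_list, W, num_samples=10, rng=None):
    """One GraphSAGE layer: sample neighbors, mean-aggregate, concat with self, transform.

    h:        (n, d) node features
    adj_list: dict node -> list of neighbor indices
    W:        (2*d, d_out) weights applied to [own features || aggregated neighbors]
    """
    rng = rng or np.random.default_rng(0)
    n, d = h.shape
    out = np.zeros((n, W.shape[1]))
    for v in range(n):
        neigh = adj_list.get(v, [])
        if neigh:
            # fixed-size sample bounds the cost per node regardless of degree
            k = min(num_samples, len(neigh))
            sampled = rng.choice(neigh, size=k, replace=False)
            agg = h[sampled].mean(axis=0)
        else:
            agg = np.zeros(d)
        z = np.concatenate([h[v], agg]) @ W        # concat self + aggregated, then transform
        out[v] = np.maximum(z, 0.0)                # ReLU
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)          # L2-normalize embeddings, as in the paper
```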

WL Isomorphism Test and GNN Expressivity

Theorem

GNN Expressivity Upper Bound via WL Test

Statement

The Weisfeiler-Leman (WL) graph isomorphism test iteratively refines node labels: at each step, a node's label is updated based on its current label and the multiset of its neighbors' labels. Two graphs are "WL-equivalent" if the WL test cannot distinguish them.

Upper bound: No message passing GNN can distinguish two graphs that the 1-WL test cannot distinguish.

Achievability: A GNN with injective AGGREGATE and UPDATE functions (specifically, the Graph Isomorphism Network (GIN)) is as powerful as the 1-WL test.

Intuition

Each layer of a message passing GNN refines node representations using neighbor information, exactly mirroring one iteration of the WL test. If the WL test produces identical label sequences for two nodes (or two graphs), no amount of message passing can tell them apart. The GNN's aggregation function determines whether it achieves this upper bound: sum aggregation (GIN) is maximally expressive; mean and max aggregation lose information about neighbor multisets.

Proof Sketch

The proof by Xu et al. (2019) proceeds in two steps. (1) If a GNN's aggregation maps two different multisets of neighbor features to the same output, it is strictly less powerful than 1-WL. Sum aggregation preserves the multiset: sums of an injective per-element encoding are distinct for distinct multisets over a countable domain. Mean and max lose information (e.g., mean cannot distinguish $\{1, 1, 1\}$ from $\{1\}$). (2) With an injective update function (realizable by an MLP via the universal approximation theorem), the GNN matches 1-WL step for step.
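
A two-line illustration of the multiset argument: mean aggregation collapses $\{1,1,1\}$ and $\{1\}$ to the same message, while sum does not. (The scalar features are an illustrative toy case.)

```python
import numpy as np

a = np.array([1.0, 1.0, 1.0])   # neighbor multiset {1, 1, 1}
b = np.array([1.0])             # neighbor multiset {1}

print(a.mean(), b.mean())       # 1.0 1.0  -> mean aggregation cannot tell them apart
print(a.sum(), b.sum())         # 3.0 1.0  -> sum aggregation can
```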

Why It Matters

This theorem sets a hard ceiling on what standard message passing GNNs can compute. The cycle-counting consequence is sharper than it sounds: Chen et al. (2020, Can Graph Neural Networks Count Substructures?) prove that 1-WL MPNNs cannot count induced cycles of any length $\geq 3$ as graph-level features in the worst case. They cannot reliably count triangles, 4-cycles, or any longer cycle on adversarially chosen graphs, and they cannot detect arbitrary subgraph patterns of size $\geq 4$. This motivates higher-order GNN architectures that go beyond 1-WL ($k$-WL GNNs, subgraph GNNs, GNNs with positional encodings).

Failure Mode

The 1-WL test cannot distinguish certain non-isomorphic regular graphs (e.g., some pairs of 3-regular graphs on 8 nodes). Standard GNNs inherit this limitation. For tasks that require distinguishing such graphs (e.g., certain molecular properties), higher-order methods or augmentations (random features, positional encodings) are needed.
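
The failure mode is easy to reproduce. Below is a small sketch of 1-WL color refinement run on an even simpler pair than the 3-regular example: two non-isomorphic 2-regular graphs, a 6-cycle versus two disjoint triangles, for which the test produces identical color histograms. (The graphs and the fixed number of refinement rounds are illustrative choices.)

```python
from collections import Counter

def wl_histogram(adj, rounds=3):
    """1-WL color refinement (sketch). adj: dict node -> list of neighbors.
    Returns the graph-level histogram of refined node colors."""
    colors = {v: 0 for v in adj}                          # uniform initial labels
    for _ in range(rounds):
        # new label = (own color, sorted multiset of neighbor colors)
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: palette[sigs[v]] for v in adj}       # compress signatures to integers
    return Counter(colors.values())

# Non-isomorphic 2-regular graphs on 6 nodes: a 6-cycle vs. two disjoint triangles.
c6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_histogram(c6) == wl_histogram(triangles))        # True: 1-WL cannot distinguish them
```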

Over-Smoothing

As GNN depth increases, node representations converge to the same vector. After $L$ message passing layers, each node's representation depends on its $L$-hop neighborhood. For a connected graph with diameter $d$, once $L \geq d$, every node sees every other node. The repeated averaging (in GCN) drives all representations toward the leading Perron eigenvector of the normalized adjacency.

Formally, for a symmetric-normalized GCN without nonlinearity, $H^{(L)} = \hat{A}^L H^{(0)} W^{(0)} \cdots W^{(L-1)}$. For a connected graph, $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ has leading eigenvalue 1 with Perron eigenvector $v_1 = \tilde{D}^{1/2} \mathbf{1} / \lVert \tilde{D}^{1/2} \mathbf{1} \rVert$. Subleading eigenvalues satisfy $|\lambda_i| < 1$, so $\hat{A}^L \to v_1 v_1^\top$ at the rate $|\lambda_2|^L$ (Oono-Suzuki 2020). All node features collapse onto the subspace spanned by $v_1$, which depends only on the degree sequence, not on the input features.

(The random-walk normalization $P = D^{-1} A$ is a different operator whose leading left eigenvector is the stationary distribution $\pi$. The symmetric normalization used in GCN is related to $P$ by a similarity transform, but its Perron eigenvector is not $\pi$ itself.)
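
A quick numerical check of the collapse, under an assumed toy setup (a random connected graph with a ring backbone, random input features): the average pairwise cosine similarity of the rows of $\hat{A}^L H^{(0)}$ approaches 1 as $L$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

# Random connected graph: ring backbone plus a few random chords
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
chords = np.triu(rng.random((n, n)) < 0.1, 1)
A = np.maximum(A, (chords | chords.T).astype(float))

A_tilde = A + np.eye(n)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))                  # D~^{-1/2} A~ D~^{-1/2}

H = rng.standard_normal((n, 8))                            # random input features
for L in (1, 2, 4, 8, 16, 32):
    HL = np.linalg.matrix_power(A_hat, L) @ H
    U = HL / np.linalg.norm(HL, axis=1, keepdims=True)
    print(L, round(float((U @ U.T).mean()), 3))            # mean pairwise cosine -> 1
```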

Practical consequence: GNNs with more than 4-8 layers often perform worse than shallower ones. This limits the effective receptive field and distinguishes GNNs from CNNs, where additional depth typically keeps helping up to much larger layer counts.

Mitigations:

  • Residual / initial-residual connections (GCNII, Chen et al. 2020)
  • Jumping Knowledge (Xu et al. 2018): aggregate representations from all layers via max, concat, or LSTM pooling
  • PairNorm (Zhao-Akoglu 2020): per-layer centering + pairwise distance normalization
  • DropEdge (Rong et al. 2020): randomly mask edges at each layer to slow mixing
  • GraphNorm (Cai et al. 2021): batch-level normalization adapted to graph batches
  • Graph transformers (Ying et al. 2021, Kim et al. 2022): replace local message passing with global attention plus structural encodings

(Hierarchical pooling methods such as DiffPool (Ying et al. 2018) are often cited alongside these, but they address graph-level classification rather than over-smoothing of per-node representations.)

Over-Squashing

A complementary pathology, identified by Alon and Yahav (2021): in long-range tasks, information from distant nodes must pass through bottlenecks in the graph, and the message space at each node has fixed capacity. Exponentially many paths fold into a $d$-dimensional message, destroying signal before it reaches the target node.

Topping et al. (2022) formalize this via graph curvature: negatively-curved edges (bottlenecks) cause exponential signal attenuation, and graph rewiring (Ricci flow) can provably reduce over-squashing. This motivates graph transformers and other architectures that let distant nodes communicate directly without traversing bottlenecks.

Beyond 1-WL

The $k$-WL hierarchy refines the 1-WL test by tracking labels of $k$-tuples instead of single nodes (Morris et al. 2019 introduced $k$-GNNs to match this power). Beyond $k = 2$, each increment strictly increases distinguishing power:

  • 1-WL ≡ standard MPNN / GIN (Xu et al. 2019): cannot count any cycle of length $\geq 3$, cannot distinguish 3-regular graphs on 8 nodes
  • 2-WL = 1-WL (Cai-Fürer-Immerman 1992): no extra power over 1-WL despite operating on pairs
  • 3-WL ≡ 2-FWL (folklore): strictly more powerful than 1-WL; can count cycles up to length 7 (Arvind et al. 2020) but still misses some structural properties
  • $k$-WL for $k \geq 3$: strictly increasing power, but $O(n^k)$ memory and time

Practical alternatives that exceed 1-WL without $k$-tuples:

  • Subgraph-aware GNNs (Zhao et al. 2022, Bevilacqua et al. 2022): mark nodes by subgraph identity
  • Random features (Abboud et al. 2021): adding i.i.d. random features to nodes breaks symmetries
  • Positional encodings (Dwivedi-Bresson 2021): Laplacian eigenvectors, random-walk landing probabilities, distance encodings
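
A sketch of the Laplacian-eigenvector positional encodings from the last bullet, assuming a dense symmetric 0/1 adjacency; note that each eigenvector's sign is arbitrary, so practical implementations typically randomize signs during training.

```python
import numpy as np

def laplacian_pe(A, k=4):
    """First k nontrivial eigenvectors of the symmetric normalized Laplacian (sketch).

    A: (n, n) dense symmetric 0/1 adjacency. Returns an (n, k) positional encoding,
    one k-dimensional coordinate per node.
    """
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    L = np.eye(A.shape[0]) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]                    # drop the trivial lowest-eigenvalue vector
```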

Heterogeneous and Temporal Graphs

Real-world graphs frequently have multiple node types, multiple edge types, or evolve over time. Plain MPNNs ignore this structure.

Heterogeneous graphs (typed nodes/edges):

  • R-GCN (Schlichtkrull et al. 2018) introduces per-relation weight matrices: $h_v^{(\ell+1)} = \sigma\left(\sum_{r} \sum_{u \in \mathcal{N}_r(v)} \frac{1}{c_{v,r}} W_r^{(\ell)} h_u^{(\ell)}\right)$. Used for knowledge graphs (entity classification, link prediction); a minimal sketch follows after this list.
  • HAN (Wang et al. 2019) defines metapaths over node types and applies attention at the metapath level.
  • HGT (Hu et al. 2020): per-edge-type attention with relation-aware key/value projections; the typical baseline for ogbn-mag.
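
A minimal sketch of one R-GCN layer matching the update above, with mean normalization $c_{v,r} = |\mathcal{N}_r(v)|$ and an added self-loop term as in the original paper; the dictionary-based edge representation and function names are illustrative assumptions.

```python
import numpy as np

def rgcn_layer(h, edges_by_relation, W_rel, W_self):
    """One R-GCN layer with per-relation weights (sketch).

    h:                  (n, d) node features
    edges_by_relation:  dict relation name -> list of (src, dst) directed edges
    W_rel:              dict relation name -> (d, d_out) weight matrix
    W_self:             (d, d_out) self-loop weight matrix
    """
    n = h.shape[0]
    out = h @ W_self                                   # self-loop term
    for r, edges in edges_by_relation.items():
        msg = np.zeros_like(out)
        count = np.zeros(n)
        for src, dst in edges:
            msg[dst] += h[src] @ W_rel[r]              # message along a type-r edge
            count[dst] += 1
        out += msg / np.maximum(count, 1.0)[:, None]   # normalize by c_{v,r} = |N_r(v)|
    return np.maximum(out, 0.0)                        # ReLU
```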

Temporal graphs (edges with timestamps):

  • TGN (Rossi et al. 2020): memory modules per node updated by message passing on the streaming edge sequence.
  • TGAT (Xu et al. 2020): time-aware self-attention with functional encodings of time deltas.
  • CAW (Wang et al. 2021): causal anonymous walks for inductive temporal link prediction.

The choice between static-with-features (encoding time as a node feature) and a true temporal architecture matters for tasks where the order of edge arrivals carries signal — fraud detection, recommendation, social-event forecasting.

Graph Transformers

Local message passing has structural limits (1-WL ceiling, over-smoothing, over-squashing). Graph transformers replace or augment local aggregation with global self-attention while injecting structural information through positional/structural encodings.

  • Graphormer (Ying et al. 2021): transformer with shortest-path distance as an attention bias and centrality encodings as node features. Won the OGB-LSC PCQM4M benchmark.
  • GraphGPS (Rampasek et al. 2022): a recipe combining MPNN layers, transformer layers, and positional encodings (Laplacian eigenvectors, random-walk landing). The current default for "general-purpose graph transformer."
  • Exphormer (Shirzad et al. 2023): sparse global attention via expander graphs, scaling to graphs with $> 10^4$ nodes per pass.
  • Graphormer-style models are the current state of the art on PCQM4Mv2 and Long Range Graph Benchmark (LRGB).

Benchmarks

  • OGB (Hu et al. 2020) is the de facto graph-learning benchmark suite: ogbn (node prediction), ogbg (graph prediction), ogbl (link prediction), and OGB-LSC for million-to-billion-scale tasks. Use OGB rather than legacy datasets (Cora, Citeseer) when reporting numbers, since the small datasets are saturated and have well-known split issues.
  • Long Range Graph Benchmark (Dwivedi et al. 2022) targets tasks where the answer requires combining information from $> 5$ hops; it is the standard setting for measuring whether a graph transformer or rewiring scheme actually helps with long-range dependencies.

Applications

Molecular property prediction. Atoms are nodes, bonds are edges. GNNs predict molecular properties (toxicity, solubility, binding affinity) from the molecular graph. The Open Catalyst Project uses GNNs for catalyst discovery.

Social network analysis. Users are nodes, connections are edges. GNNs predict user behavior, detect communities, and identify influential nodes.

Recommendation systems. Users and items form a bipartite graph. GNNs propagate preferences through the graph to predict ratings. PinSage (Ying et al., 2018) deployed a GNN for Pinterest recommendations at scale.

Knowledge graph completion. Entities are nodes, relations are edges. GNNs predict missing links (e.g., given "X is a protein" and "X interacts with Y," predict Y's function).

Common Confusions

Watch Out

GNN invariance does not mean GNNs can distinguish all graphs

Message passing GNNs are permutation-invariant, so isomorphic graphs always map to the same output; the limitation runs the other way: many non-isomorphic graphs also map to the same output. A GNN with mean aggregation cannot distinguish certain non-isomorphic graphs, and even GNNs with injective aggregation (GIN), which are maximally expressive among message passing architectures, are capped at the 1-WL test. Claims that "GNNs respect graph structure" must be qualified by this expressivity bound.

Watch Out

Graph convolution is not the same as image convolution

In images, the grid structure is fixed and uniform (every pixel has the same number of neighbors in the same relative positions). In graphs, nodes have varying degrees and no canonical ordering of neighbors. Graph convolution must be permutation-invariant with respect to neighbor ordering, while image convolution exploits the fixed grid.

Watch Out

Over-smoothing is not the same as oversmoothing in statistics

Over-smoothing in GNNs refers to node representations converging to the same vector with increasing depth. This is a structural property of iterated message passing on graphs, not a bias-variance issue. It has no analogue in standard feedforward networks, where depth generally increases representational power.

Canonical Examples

Example

GCN on a citation network

In the Cora citation network (2,708 papers, 5,429 citation edges, 7 classes), a 2-layer GCN with 16 hidden units achieves approximately 81% node classification accuracy using only 140 labeled nodes (20 per class). The GCN propagates label information through the citation graph: if paper A cites papers B and C which are both about "neural networks," the GCN infers A is likely about neural networks too. A 3-layer GCN performs similarly; a 10-layer GCN drops to about 60% due to over-smoothing.

Exercises

ExerciseCore

Problem

Write the GCN update rule for a single node $v$ with neighbors $\mathcal{N}(v) = \{u_1, u_2, u_3\}$ in a graph with self-loops, using symmetric normalization. If all four nodes (including $v$) have degree 3 (after adding self-loops: degree 4), what is the coefficient on each neighbor's features?

ExerciseAdvanced

Problem

Prove that mean aggregation cannot distinguish the multisets $\{1, 1, 1, 1\}$ and $\{1, 1\}$ but sum aggregation can. Give an example of two non-isomorphic graphs where this distinction matters for node classification.

ExerciseResearch

Problem

Why does over-smoothing occur in GCNs but not in standard deep feedforward networks? Relate your answer to the spectral properties of the normalized adjacency matrix $\hat{A}$.

References

Canonical:

  • Gilmer, Schoenholz, Riley, Vinyals, Dahl, "Neural Message Passing for Quantum Chemistry" (ICML 2017) -- MPNN framework
  • Kipf, Welling, "Semi-Supervised Classification with Graph Convolutional Networks" (ICLR 2017) -- GCN
  • Hamilton, Ying, Leskovec, "Inductive Representation Learning on Large Graphs" (NeurIPS 2017) -- GraphSAGE
  • Velickovic et al., "Graph Attention Networks" (ICLR 2018) -- GAT
  • Xu, Hu, Leskovec, Jegelka, "How Powerful are Graph Neural Networks?" (ICLR 2019), Sections 3-5 -- GIN and 1-WL bound

Current:

  • Morris et al., "Weisfeiler and Leman Go Neural" (AAAI 2019) -- $k$-GNNs
  • Oono, Suzuki, "Graph Neural Networks Exponentially Lose Expressive Power for Node Classification" (ICLR 2020) -- over-smoothing rate
  • Alon, Yahav, "On the Bottleneck of Graph Neural Networks and its Practical Implications" (ICLR 2021) -- over-squashing
  • Topping et al., "Understanding Over-Squashing and Bottlenecks on Graphs via Curvature" (ICLR 2022)
  • Bronstein, Bruna, Cohen, Velickovic, "Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges" (2021). The unifying framing for GNNs and other equivariant architectures. arXiv:2104.13478
  • Hamilton, Graph Representation Learning (2020), Chapters 5-7
  • Schlichtkrull et al., "Modeling Relational Data with Graph Convolutional Networks" (ESWC 2018). R-GCN. arXiv:1703.06103
  • Hu et al., "Heterogeneous Graph Transformer" (WWW 2020). HGT. arXiv:2003.01332
  • Rossi et al., "Temporal Graph Networks for Deep Learning on Dynamic Graphs" (ICML GRL+ 2020). TGN. arXiv:2006.10637
  • Ying et al., "Do Transformers Really Perform Bad for Graph Representation?" (NeurIPS 2021). Graphormer. arXiv:2106.05234
  • Rampasek, Galkin, Dwivedi, Luu, Wolf, Beaini, "Recipe for a General, Powerful, Scalable Graph Transformer" (NeurIPS 2022). GraphGPS. arXiv:2205.12454
  • Hu et al., "Open Graph Benchmark: Datasets for Machine Learning on Graphs" (NeurIPS 2020). OGB. arXiv:2005.00687
  • Dwivedi et al., "Long Range Graph Benchmark" (NeurIPS Datasets 2022). arXiv:2206.08164

Next Topics

GNNs connect to spectral graph theory, geometric deep learning, and the study of equivariant neural networks.

Last reviewed: April 26, 2026
