ML Methods
Convolutional Neural Networks
How weight sharing and local connectivity exploit spatial structure: convolution as cross-correlation, translation equivariance, pooling for approximate invariance, and the conv-pool-fc architecture.
Prerequisites
Why This Matters
CNNs were the first deep learning architecture to achieve superhuman performance on a major benchmark (ImageNet, 2015). Their design hard-codes two assumptions: spatial structure matters, and the same pattern can appear anywhere in the image. Understanding why CNNs work — weight sharing, equivariance, local connectivity — gives you the template for designing architectures that exploit other symmetries.
Mental Model
A fully connected layer treats each input pixel as an independent feature with its own weight. A convolutional layer instead slides a small filter across the input, applying the same weights everywhere. This has two consequences: far fewer parameters (weight sharing), and the output shifts when the input shifts (translation equivariance). Pooling then coarsens the representation, providing approximate translation invariance.
The Convolution Operation
Discrete Convolution (Cross-Correlation)
In deep learning, "convolution" is technically cross-correlation. For a 2D input $I$ and a kernel $K$ of size $k \times k$:

$$(I \star K)(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m,\, j+n)\, K(m, n)$$

True convolution flips the kernel ($(I * K)(i, j) = \sum_m \sum_n I(i-m,\, j-n)\, K(m, n)$), but since the kernel is learned, the optimizer simply absorbs the flip. The deep learning literature uses "convolution" to mean cross-correlation.
Modern CNNs use small kernels ($3 \times 3$, $5 \times 5$) where direct summation is faster than FFT-based convolution. The FFT advantage only matters for kernel sizes comparable to the input, which is rare in practice.
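A minimal NumPy sketch of this cross-correlation, looping over output positions (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def cross_correlate2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2D cross-correlation: slide the kernel over the image without flipping it."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are applied at every (i, j): this is weight sharing.
            out[i, j] = np.sum(image[i:i + k_h, j:j + k_w] * kernel)
    return out

# Example: a hand-crafted horizontal-edge detector applied to a random "image"
image = np.random.rand(8, 8)
kernel = np.array([[ 1.0,  1.0,  1.0],
                   [ 0.0,  0.0,  0.0],
                   [-1.0, -1.0, -1.0]])
feature_map = cross_correlate2d(image, kernel)   # shape (6, 6)
```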
Feature Map
Applying a kernel $K$ to input $I$ produces a feature map (or activation map). A convolutional layer has $C_{\text{out}}$ kernels, each of size $k \times k \times C_{\text{in}}$, producing $C_{\text{out}}$ feature maps. Each kernel detects a different pattern (edge, texture, shape) across all spatial locations.
Weight Sharing and Parameter Counting
Weight Sharing
In a convolutional layer, the same kernel is applied at every spatial position. This is weight sharing: all positions share the same parameters.
For a layer with $C_{\text{in}}$ input channels, $C_{\text{out}}$ output channels, and kernel size $k \times k$, the parameter count is:

$$k \cdot k \cdot C_{\text{in}} \cdot C_{\text{out}} + C_{\text{out}}$$

where the $+\,C_{\text{out}}$ accounts for the bias per output channel. Compare this to a fully connected layer mapping the same spatial dimensions ($H \times W$), which would have $(H \cdot W \cdot C_{\text{in}}) \cdot (H \cdot W \cdot C_{\text{out}})$ parameters, orders of magnitude more.
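A quick sanity check of these counts in Python (a sketch; the helper names are illustrative):

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Conv layer: one k x k x c_in kernel per output channel, plus one bias each."""
    return k * k * c_in * c_out + c_out

def fc_params(h: int, w: int, c_in: int, c_out: int) -> int:
    """Fully connected layer mapping an H x W x C_in input to an H x W x C_out output."""
    n_in, n_out = h * w * c_in, h * w * c_out
    return n_in * n_out + n_out

print(conv_params(3, 16, 3))       # 448
print(fc_params(32, 32, 3, 16))    # 50348032 for a 32x32 RGB input
```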
Translation Equivariance
Translation Equivariance of Convolution
Statement
Let $T_\delta$ denote translation by $\delta$: $(T_\delta x)(i) = x(i - \delta)$. Then convolution commutes with translation:

$$(T_\delta x) \star K = T_\delta\,(x \star K)$$
That is, shifting the input and then convolving gives the same result as convolving and then shifting the output.
Intuition
If a cat detector fires at position $(i, j)$, moving the cat by $\delta$ moves the detection to $(i, j) + \delta$. The network does not need to relearn the cat detector for every position: one set of weights works everywhere.
Proof Sketch
$((T_\delta x) \star K)(i) = \sum_m (T_\delta x)(i + m)\, K(m) = \sum_m x(i + m - \delta)\, K(m) = (x \star K)(i - \delta) = (T_\delta (x \star K))(i)$. The sum is over the kernel indices, and the shift passes through linearly.
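The identity can also be checked numerically. A small sketch using a circular (wrap-around) cross-correlation in NumPy, so that boundary effects do not enter (names are illustrative):

```python
import numpy as np

def circ_corr(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Circular 1D cross-correlation: y[i] = sum_m x[(i + m) % N] * k[m]."""
    n = len(x)
    return np.array([sum(x[(i + m) % n] * k[m] for m in range(len(k))) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)          # 1D "image"
k = rng.standard_normal(3)           # arbitrary kernel (it would normally be learned)

shift = 5
lhs = circ_corr(np.roll(x, shift), k)    # translate, then convolve
rhs = np.roll(circ_corr(x, k), shift)    # convolve, then translate
print(np.allclose(lhs, rhs))             # True: exact equivariance with circular boundaries
```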
Why It Matters
Equivariance is the key inductive bias of CNNs. It means the network automatically generalizes across spatial positions without needing to see examples at every location. This is why CNNs need far less data than fully connected networks for image tasks.
Failure Mode
Equivariance is exact for continuous convolution but only approximate for discrete convolution with striding or padding. Stride-2 convolutions are equivariant only to even shifts. Boundary effects from padding also break exact equivariance at the edges.
Pooling
Pooling Layers
Pooling reduces spatial dimensions by summarizing local regions:
- Max pooling: $y_{ij} = \max_{(m, n) \in R_{ij}} x_{mn}$ takes the maximum activation in each region $R_{ij}$
- Average pooling: $y_{ij} = \frac{1}{|R_{ij}|} \sum_{(m, n) \in R_{ij}} x_{mn}$ takes the mean
Pooling provides approximate translation invariance: small shifts in the input do not change the pooled output (as long as the maximum stays within the same pooling window). Note: pooling gives invariance, while convolution gives equivariance; these are different properties.
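A minimal sketch of non-overlapping $2 \times 2$ max pooling in NumPy (stride equal to the window size; names are illustrative):

```python
import numpy as np

def max_pool2d(x: np.ndarray, window: int = 2) -> np.ndarray:
    """Non-overlapping max pooling: each window x window block is replaced by its maximum."""
    h, w = x.shape
    h, w = h - h % window, w - w % window                  # drop any ragged edge
    blocks = x[:h, :w].reshape(h // window, window, w // window, window)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))
# [[ 5.  7.]
#  [13. 15.]]
```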
Receptive Field
Receptive Field
The receptive field of a neuron is the region of the input that can influence its value. In a CNN with $L$ layers of kernel size $k \times k$ and stride 1:

$$\text{receptive field} = 1 + L\,(k - 1)$$

Stacking small kernels (e.g., two $3 \times 3$ layers have receptive field $5 \times 5$) is preferred over one large kernel ($5 \times 5$) because it uses fewer parameters and adds more nonlinearity.
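A small sketch computing the receptive field of stacked stride-1 layers, together with the weight comparison behind the "fewer parameters" claim (single channel assumed; names are illustrative):

```python
def receptive_field(kernel_sizes) -> int:
    """Receptive field of stacked stride-1 conv layers: 1 + sum of (k - 1)."""
    return 1 + sum(k - 1 for k in kernel_sizes)

print(receptive_field([3, 3]))   # 5  -- two 3x3 layers
print(receptive_field([5]))      # 5  -- one 5x5 layer
# Weights per output channel (single input channel): 2 * 3 * 3 = 18 vs. 5 * 5 = 25
```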
The Conv-Pool-FC Architecture
The classical CNN architecture follows a pattern:
- Convolutional blocks: alternating conv layers (with ReLU) and pooling layers. Spatial dimensions decrease while channel depth increases.
- Flatten: reshape the final feature maps into a 1D vector.
- Fully connected layers: one or more dense layers for classification.
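A minimal conv-pool-fc classifier in this style, sketched with PyTorch (the layer widths and the 32x32 RGB input are illustrative choices, not from a specific paper):

```python
import torch
import torch.nn as nn

# Conv-pool blocks shrink spatial dimensions while growing channel depth; FC layers classify.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 16x16 -> 8x8
    nn.Flatten(),                         # -> 32 * 8 * 8 = 2048 features
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),                   # 10-class logits
)

logits = model(torch.randn(1, 3, 32, 32))  # a batch of one 32x32 RGB image
print(logits.shape)                        # torch.Size([1, 10])
```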
Architecture lineage
- LeNet-5 (LeCun 1998): two conv + two pool + two FC. Trained on MNIST.
- AlexNet (Krizhevsky, Sutskever, Hinton 2012): 8 layers, ReLU, dropout, GPU training. Won ImageNet 2012 by a 10-point margin over the best hand-crafted feature pipeline.
- VGG (Simonyan-Zisserman 2014): exclusively $3 \times 3$ kernels, 16-19 layers deep, uniform design.
- GoogLeNet / Inception (Szegedy 2014): parallel kernel sizes ($1 \times 1$, $3 \times 3$, $5 \times 5$) inside each "inception" block, $1 \times 1$ convolutions for channel reduction.
- ResNet (He et al. 2015): residual connections enable 50-152 layer training; won ImageNet 2015 with 3.57% top-5 error.
- DenseNet (Huang 2016): every layer receives feature maps from all preceding layers in its block.
- MobileNet / EfficientNet (Howard 2017, Tan-Le 2019): depthwise separable convolutions and compound scaling for mobile / compute budgets.
- ConvNeXt (Liu 2022): modernized CNN matching Swin transformer accuracy with pure convolution, using LayerNorm, GELU, inverted bottlenecks.
Convolution variants
- $1 \times 1$ convolution: pointwise channel mixing; no spatial extent, but real representational power. Used for channel reduction (Inception) and as a cheap projection in bottleneck blocks.
- Dilated (atrous) convolution: kernel with gaps between taps; a $k \times k$ kernel with dilation rate $d$ samples the input at spacing $d$, so the receptive field grows without extra parameters. Used in semantic segmentation (DeepLab) and WaveNet.
- Depthwise separable convolution: factorize a standard $k \times k$ conv into a depthwise (per-channel $k \times k$) step followed by a pointwise ($1 \times 1$) step. Parameter count drops from $k^2 C_{\text{in}} C_{\text{out}}$ to $k^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}$ (see the sketch after this list). Core to MobileNet and Xception.
- Transposed convolution ("deconvolution"): upsamples spatial dimensions by spreading each input pixel over a kernel-sized output patch. Used in autoencoders and segmentation decoders.
Training ingredients
- Batch normalization (Ioffe-Szegedy 2015): normalizes per-channel activations across the batch dimension, reducing internal covariate shift and enabling larger learning rates.
- Residual connections (He 2015): identity shortcuts let gradients flow through deep networks without vanishing; they also smooth the loss landscape (Li et al. 2018).
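A minimal NumPy sketch of batch normalization at training time for activations shaped (N, C, H, W); running statistics and the inference path are omitted, and the parameter names are illustrative:

```python
import numpy as np

def batch_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each channel over the batch and spatial dimensions, then scale and shift."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)    # per-channel mean
    var = x.var(axis=(0, 2, 3), keepdims=True)      # per-channel variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 16, 32, 32)                  # batch of 8 images, 16 channels
out = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(round(out.mean(), 3), round(out.std(), 3))    # approximately 0.0 and 1.0
```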
Why CNNs Work for Images
CNNs succeed on images because of two structural assumptions about natural images that are approximately true:
- Locality: nearby pixels are more related than distant ones. Small kernels exploit this by looking at local neighborhoods.
- Translation symmetry: the same pattern (edge, texture, object part) can appear anywhere. Weight sharing exploits this.
These are inductive biases: assumptions baked into the architecture. When the assumptions match the data (images, audio spectrograms), CNNs dominate. When they do not (tabular data, graphs), CNNs offer no advantage. For graph-structured data, graph neural networks generalize the convolution idea to irregular topologies.
Connection to Group Equivariant Convolutions
Standard CNNs are equivariant to translations but not to rotations or reflections. Group equivariant CNNs generalize by replacing the translation group with a larger symmetry group $G$:

$$(f \star \psi)(g) = \sum_{h} f(h)\, \psi(g^{-1} h)$$

This gives equivariance to the full group $G$ (e.g., rotations, reflections) by construction. Standard CNNs are the special case where $G$ is the translation group $\mathbb{Z}^2$.
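As a concrete illustration, here is a sketch of the lifting layer of a p4 group CNN (translations plus 90-degree rotations): correlate the image with each of the four rotated copies of the filter, producing one output map per rotation. This follows the Cohen-Welling construction in spirit; the code and names are illustrative, not the paper's implementation:

```python
import numpy as np
from scipy.signal import correlate2d

def p4_lifting_conv(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Output indexed by (rotation, i, j): one cross-correlation per 90-degree kernel rotation."""
    maps = [correlate2d(image, np.rot90(kernel, r), mode='valid') for r in range(4)]
    return np.stack(maps)                 # shape (4, H - k + 1, W - k + 1)

image = np.random.rand(8, 8)
kernel = np.random.rand(3, 3)
out = p4_lifting_conv(image, kernel)
print(out.shape)                          # (4, 6, 6)
# Rotating the input permutes (and spatially rotates) the four output maps rather than
# changing them arbitrarily: rotation equivariance by construction.
```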
Canonical Examples
Parameter count: conv vs. fully connected
Consider a $32 \times 32 \times 3$ input (RGB image). A conv layer with 16 filters of size $5 \times 5$ has $5 \cdot 5 \cdot 3 \cdot 16 + 16 = 1{,}216$ parameters. A fully connected layer mapping the same input to 16 outputs would have $32 \cdot 32 \cdot 3 \cdot 16 + 16 = 49{,}168$ parameters, roughly 40 times more, and without translation equivariance.
Common Confusions
Convolution in deep learning is actually cross-correlation
Mathematical convolution flips the kernel before sliding; deep learning convolution does not. Since the kernel is learned, this distinction is irrelevant in practice. But it matters if you compare CNN operations to signal processing formulas.
Equivariance is not invariance
Convolution is translation equivariant: the output shifts with the input. Pooling provides approximate translation invariance: the output stays the same under small shifts. A classifier needs invariance (the label should not change); the internal representations should be equivariant (preserving spatial information until the final layers).
Summary
- CNN "convolution" is technically cross-correlation. kernel is not flipped
- Weight sharing means the same kernel is applied at every position, drastically reducing parameters
- Convolution is translation equivariant: $(T_\delta x) \star K = T_\delta\,(x \star K)$
- Pooling provides approximate translation invariance
- Receptive field grows with depth: stacking small kernels is more efficient than large kernels
- CNNs encode the inductive bias that images have local structure and translation symmetry
Exercises
Problem
A convolutional layer has 64 input channels, 128 output channels, and $3 \times 3$ kernels. How many parameters does it have (including biases)?
Problem
Prove that a stride-2 convolution is equivariant to translations by even numbers but not by odd numbers. What does this imply about the information lost by striding?
References
Canonical:
- LeCun, Bottou, Bengio, Haffner, "Gradient-Based Learning Applied to Document Recognition" (1998), Sections II-IV -- LeNet-5
- Krizhevsky, Sutskever, Hinton, "ImageNet Classification with Deep Convolutional Neural Networks" (NeurIPS 2012) -- AlexNet
- Simonyan, Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition" (ICLR 2015) -- VGG
- He, Zhang, Ren, Sun, "Deep Residual Learning for Image Recognition" (CVPR 2016) -- ResNet
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 9 -- CNNs
Current:
- Cohen, Welling, "Group Equivariant Convolutional Networks" (ICML 2016)
- Ioffe, Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (ICML 2015)
- Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" (2017), Section 3 -- depthwise separable convolutions
- Liu et al., "A ConvNet for the 2020s" (CVPR 2022) -- ConvNeXt
Next Topics
The natural next steps from CNNs:
- Recurrent neural networks: handling sequential instead of spatial data
- Transformers: attention-based architectures that replace convolution
- Group equivariant networks: generalizing the equivariance principle
Last reviewed: April 13, 2026