ML Methods
Convolutional Neural Networks
How weight sharing and local connectivity exploit spatial structure: convolution as cross-correlation, translation equivariance, pooling for approximate invariance, and the conv-pool-fc architecture.
Prerequisites
Why This Matters
CNNs were the first deep learning architecture to achieve superhuman performance on a major benchmark (ImageNet, 2015). Their design hard-codes two assumptions: spatial structure matters, and the same pattern can appear anywhere in the image. Understanding why CNNs work — weight sharing, equivariance, local connectivity — gives you the template for designing architectures that exploit other symmetries.
Mental Model
A fully connected layer treats each input pixel as an independent feature with its own weight. A convolutional layer instead slides a small filter across the input, applying the same weights everywhere. This has two consequences: far fewer parameters (weight sharing), and the output shifts when the input shifts (translation equivariance). Pooling then coarsens the representation, providing approximate translation invariance.
The Convolution Operation
Discrete Convolution (Cross-Correlation)
In deep learning, "convolution" is technically cross-correlation. For a 2D input $I$ and a kernel $K$ of size $k \times k$:

$$(I \star K)(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m,\, j+n)\, K(m, n)$$

True convolution flips the kernel ($(I * K)(i, j) = \sum_m \sum_n I(i-m,\, j-n)\, K(m, n)$), but since the kernel is learned, the optimizer simply absorbs the flip. The deep learning literature uses "convolution" to mean cross-correlation.
Modern CNNs use small kernels ($3 \times 3$, $5 \times 5$) where direct summation is faster than FFT-based convolution. The FFT advantage only matters for kernel sizes comparable to the input, which is rare in practice.
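A minimal NumPy sketch of this cross-correlation, looping over output positions (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def cross_correlate2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2D cross-correlation: slide the kernel over the image without flipping it."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are applied at every (i, j): this is weight sharing.
            out[i, j] = np.sum(image[i:i + k_h, j:j + k_w] * kernel)
    return out

# Example: a hand-crafted horizontal-edge detector applied to a random "image"
image = np.random.rand(8, 8)
kernel = np.array([[ 1.0,  1.0,  1.0],
                   [ 0.0,  0.0,  0.0],
                   [-1.0, -1.0, -1.0]])
feature_map = cross_correlate2d(image, kernel)   # shape (6, 6)
```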
Feature Map
Applying a kernel $K$ to input $I$ produces a feature map (or activation map). A convolutional layer has $C_{\text{out}}$ kernels, each of size $k \times k \times C_{\text{in}}$, producing $C_{\text{out}}$ feature maps. Each kernel detects a different pattern (edge, texture, shape) across all spatial locations.
Weight Sharing and Parameter Counting
Weight Sharing
In a convolutional layer, the same kernel is applied at every spatial position. This is weight sharing: all positions share the same parameters.
For a layer with $C_{\text{in}}$ input channels, $C_{\text{out}}$ output channels, and kernel size $k \times k$, the parameter count is:

$$k \cdot k \cdot C_{\text{in}} \cdot C_{\text{out}} + C_{\text{out}}$$

where the $+\,C_{\text{out}}$ accounts for the bias per output channel. Compare this to a fully connected layer mapping the same spatial dimensions ($H \times W$), which would have $(H \cdot W \cdot C_{\text{in}}) \cdot (H \cdot W \cdot C_{\text{out}})$ parameters, orders of magnitude more.
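A quick sanity check of these counts in Python (a sketch; the helper names are illustrative):

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Conv layer: one k x k x c_in kernel per output channel, plus one bias each."""
    return k * k * c_in * c_out + c_out

def fc_params(h: int, w: int, c_in: int, c_out: int) -> int:
    """Fully connected layer mapping an H x W x C_in input to an H x W x C_out output."""
    n_in, n_out = h * w * c_in, h * w * c_out
    return n_in * n_out + n_out

print(conv_params(3, 16, 3))       # 448
print(fc_params(32, 32, 3, 16))    # 50348032 for a 32x32 RGB input
```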
Translation Equivariance
Translation Equivariance of Convolution
Statement
Let $T_\delta$ denote translation by $\delta$: $(T_\delta x)(i) = x(i - \delta)$. Then convolution commutes with translation:

$$(T_\delta x) \star K = T_\delta\,(x \star K)$$
That is, shifting the input and then convolving gives the same result as convolving and then shifting the output.
Intuition
If a cat detector fires at position $(i, j)$, moving the cat by $\delta$ moves the detection to $(i, j) + \delta$. The network does not need to relearn the cat detector for every position: one set of weights works everywhere.
Proof Sketch
$((T_\delta x) \star K)(i) = \sum_m (T_\delta x)(i + m)\, K(m) = \sum_m x(i + m - \delta)\, K(m) = (x \star K)(i - \delta) = (T_\delta (x \star K))(i)$. The sum is over the kernel indices, and the shift passes through linearly.
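The identity can also be checked numerically. A small sketch using a circular (wrap-around) cross-correlation in NumPy, so that boundary effects do not enter (names are illustrative):

```python
import numpy as np

def circ_corr(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Circular 1D cross-correlation: y[i] = sum_m x[(i + m) % N] * k[m]."""
    n = len(x)
    return np.array([sum(x[(i + m) % n] * k[m] for m in range(len(k))) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)          # 1D "image"
k = rng.standard_normal(3)           # arbitrary kernel (it would normally be learned)

shift = 5
lhs = circ_corr(np.roll(x, shift), k)    # translate, then convolve
rhs = np.roll(circ_corr(x, k), shift)    # convolve, then translate
print(np.allclose(lhs, rhs))             # True: exact equivariance with circular boundaries
```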
Why It Matters
Equivariance is the key inductive bias of CNNs. It means the network automatically generalizes across spatial positions without needing to see examples at every location. This is why CNNs need far less data than fully connected networks for image tasks.
Failure Mode
Equivariance is exact for continuous convolution but only approximate for discrete convolution with striding or padding. Stride-2 convolutions are equivariant only to even shifts. Boundary effects from padding also break exact equivariance at the edges.
Pooling
Pooling Layers
Pooling reduces spatial dimensions by summarizing local regions:
- Max pooling: $y_{ij} = \max_{(m, n) \in R_{ij}} x_{mn}$ takes the maximum activation in each region $R_{ij}$
- Average pooling: $y_{ij} = \frac{1}{|R_{ij}|} \sum_{(m, n) \in R_{ij}} x_{mn}$ takes the mean
Pooling provides approximate translation invariance: small shifts in the input do not change the pooled output (as long as the maximum stays within the same pooling window). Note: pooling gives invariance, while convolution gives equivariance; these are different properties.
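A minimal sketch of non-overlapping $2 \times 2$ max pooling in NumPy (stride equal to the window size; names are illustrative):

```python
import numpy as np

def max_pool2d(x: np.ndarray, window: int = 2) -> np.ndarray:
    """Non-overlapping max pooling: each window x window block is replaced by its maximum."""
    h, w = x.shape
    h, w = h - h % window, w - w % window                  # drop any ragged edge
    blocks = x[:h, :w].reshape(h // window, window, w // window, window)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))
# [[ 5.  7.]
#  [13. 15.]]
```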
Receptive Field
Receptive Field
The receptive field of a neuron is the region of the input that can influence its value. In a CNN with $L$ layers of kernel size $k \times k$ and stride 1:

$$\text{receptive field} = 1 + L\,(k - 1)$$

Stacking small kernels (e.g., two $3 \times 3$ layers have receptive field $5 \times 5$) is preferred over one large kernel ($5 \times 5$) because it uses fewer parameters and adds more nonlinearity.
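A small sketch computing the receptive field of stacked stride-1 layers, together with the weight comparison behind the "fewer parameters" claim (single channel assumed; names are illustrative):

```python
def receptive_field(kernel_sizes) -> int:
    """Receptive field of stacked stride-1 conv layers: 1 + sum of (k - 1)."""
    return 1 + sum(k - 1 for k in kernel_sizes)

print(receptive_field([3, 3]))   # 5  -- two 3x3 layers
print(receptive_field([5]))      # 5  -- one 5x5 layer
# Weights per output channel (single input channel): 2 * 3 * 3 = 18 vs. 5 * 5 = 25
```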
The Conv-Pool-FC Architecture
The classical CNN architecture follows a pattern:
- Convolutional blocks: alternating conv layers (with ReLU) and pooling layers. Spatial dimensions decrease while channel depth increases.
- Flatten: reshape the final feature maps into a 1D vector.
- Fully connected layers: one or more dense layers for classification.
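A minimal conv-pool-fc classifier in this style, sketched with PyTorch (the layer widths and the 32x32 RGB input are illustrative choices, not from a specific paper):

```python
import torch
import torch.nn as nn

# Conv-pool blocks shrink spatial dimensions while growing channel depth; FC layers classify.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 16x16 -> 8x8
    nn.Flatten(),                         # -> 32 * 8 * 8 = 2048 features
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),                   # 10-class logits
)

logits = model(torch.randn(1, 3, 32, 32))  # a batch of one 32x32 RGB image
print(logits.shape)                        # torch.Size([1, 10])
```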
Architecture lineage
- LeNet-5 (LeCun 1998): two conv + two pool + two FC. Trained on MNIST.
- AlexNet (Krizhevsky, Sutskever, Hinton 2012): 8 layers, ReLU, dropout, GPU training. Won ImageNet 2012 by a 10-point margin over the best hand-crafted feature pipeline.
- VGG (Simonyan-Zisserman 2014): exclusively $3 \times 3$ kernels, 16-19 layers deep, uniform design.
- GoogLeNet / Inception (Szegedy 2014): parallel kernel sizes ($1 \times 1$, $3 \times 3$, $5 \times 5$) inside each "inception" block, $1 \times 1$ convolutions for channel reduction.
- ResNet (He et al. 2015): residual connections enable 50-152 layer training; won ImageNet 2015 with 3.57% top-5 error.
- DenseNet (Huang 2016): every layer receives feature maps from all preceding layers in its block.
- MobileNet / EfficientNet (Howard 2017, Tan-Le 2019): depthwise separable convolutions and compound scaling for mobile / compute budgets.
- ConvNeXt (Liu 2022): modernized CNN matching Swin transformer accuracy with pure convolution, using LayerNorm, GELU, inverted bottlenecks.
Convolution variants
- $1 \times 1$ convolution: pointwise channel mixing; no spatial extent, but real representational power. Used for channel reduction (Inception) and as a cheap projection in bottleneck blocks.
- Dilated (atrous) convolution: kernel with gaps between taps; a $k \times k$ kernel with dilation rate $d$ samples the input at spacing $d$, so the receptive field grows without extra parameters. Used in semantic segmentation (DeepLab) and WaveNet.
- Depthwise separable convolution: factorize a standard $k \times k$ conv into a depthwise (per-channel $k \times k$) step followed by a pointwise ($1 \times 1$) step. Parameter count drops from $k^2 C_{\text{in}} C_{\text{out}}$ to $k^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}$ (see the sketch after this list). Core to MobileNet and Xception.
- Transposed convolution ("deconvolution"): upsamples spatial dimensions by spreading each input pixel over a kernel-sized output patch. Used in autoencoders and segmentation decoders.
Training ingredients
- Batch normalization (Ioffe-Szegedy 2015): normalizes per-channel activations across the batch dimension, reducing internal covariate shift and enabling larger learning rates.
- Residual connections (He 2015): identity shortcuts let gradients flow through deep networks without vanishing; they also smooth the loss landscape (Li et al. 2018).
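A minimal NumPy sketch of batch normalization at training time for activations shaped (N, C, H, W); running statistics and the inference path are omitted, and the parameter names are illustrative:

```python
import numpy as np

def batch_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each channel over the batch and spatial dimensions, then scale and shift."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)    # per-channel mean
    var = x.var(axis=(0, 2, 3), keepdims=True)      # per-channel variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 16, 32, 32)                  # batch of 8 images, 16 channels
out = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(round(out.mean(), 3), round(out.std(), 3))    # approximately 0.0 and 1.0
```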
Why CNNs Work for Images
CNNs succeed on images because of two structural assumptions about natural images that are approximately true:
- Locality: nearby pixels are more related than distant ones. Small kernels exploit this by looking at local neighborhoods.
- Translation symmetry: the same pattern (edge, texture, object part) can appear anywhere. Weight sharing exploits this.
These are inductive biases: assumptions baked into the architecture. When the assumptions match the data (images, audio spectrograms), CNNs dominate. When they do not (tabular data, graphs), CNNs offer no advantage. For graph-structured data, graph neural networks generalize the convolution idea to irregular topologies.
Connection to Group Equivariant Convolutions
Standard CNNs are equivariant to translations but not to rotations or reflections. Group equivariant CNNs generalize by replacing the translation group with a larger symmetry group $G$:

$$(f \star \psi)(g) = \sum_{h} f(h)\, \psi(g^{-1} h)$$

This gives equivariance to the full group $G$ (e.g., rotations, reflections) by construction. Standard CNNs are the special case where $G$ is the translation group $\mathbb{Z}^2$.
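As a concrete illustration, here is a sketch of the lifting layer of a p4 group CNN (translations plus 90-degree rotations): correlate the image with each of the four rotated copies of the filter, producing one output map per rotation. This follows the Cohen-Welling construction in spirit; the code and names are illustrative, not the paper's implementation:

```python
import numpy as np
from scipy.signal import correlate2d

def p4_lifting_conv(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Output indexed by (rotation, i, j): one cross-correlation per 90-degree kernel rotation."""
    maps = [correlate2d(image, np.rot90(kernel, r), mode='valid') for r in range(4)]
    return np.stack(maps)                 # shape (4, H - k + 1, W - k + 1)

image = np.random.rand(8, 8)
kernel = np.random.rand(3, 3)
out = p4_lifting_conv(image, kernel)
print(out.shape)                          # (4, 6, 6)
# Rotating the input permutes (and spatially rotates) the four output maps rather than
# changing them arbitrarily: rotation equivariance by construction.
```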
Canonical Examples
Parameter count: conv vs. fully connected
Consider a $32 \times 32 \times 3$ input (RGB image). A conv layer with 16 filters of size $5 \times 5$ has $5 \cdot 5 \cdot 3 \cdot 16 + 16 = 1{,}216$ parameters. A fully connected layer mapping the same input to 16 outputs would have $32 \cdot 32 \cdot 3 \cdot 16 + 16 = 49{,}168$ parameters, roughly 40 times more, and without translation equivariance.
Common Confusions
Convolution in deep learning is actually cross-correlation
Mathematical convolution flips the kernel before sliding; deep learning convolution does not. Since the kernel is learned, this distinction is irrelevant in practice. But it matters if you compare CNN operations to signal processing formulas.
Equivariance is not invariance
Convolution is translation equivariant: the output shifts with the input. Pooling provides approximate translation invariance: the output stays the same under small shifts. A classifier needs invariance (the label should not change); the internal representations should be equivariant (preserving spatial information until the final layers).
Summary
- CNN "convolution" is technically cross-correlation. kernel is not flipped
- Weight sharing means the same kernel is applied at every position, drastically reducing parameters
- Convolution is translation equivariant: $(T_\delta x) \star K = T_\delta\,(x \star K)$
- Pooling provides approximate translation invariance
- Receptive field grows with depth: stacking small kernels is more efficient than large kernels
- CNNs encode the inductive bias that images have local structure and translation symmetry
Exercises
Problem
A convolutional layer has 64 input channels, 128 output channels, and $3 \times 3$ kernels. How many parameters does it have (including biases)?
Problem
Prove that a stride-2 convolution is equivariant to translations by even numbers but not by odd numbers. What does this imply about the information lost by striding?
References
Canonical:
- LeCun, Bottou, Bengio, Haffner, "Gradient-Based Learning Applied to Document Recognition" (1998), Sections II-IV -- LeNet-5
- Krizhevsky, Sutskever, Hinton, "ImageNet Classification with Deep Convolutional Neural Networks" (NeurIPS 2012) -- AlexNet
- Simonyan, Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition" (ICLR 2015) -- VGG
- He, Zhang, Ren, Sun, "Deep Residual Learning for Image Recognition" (CVPR 2016) -- ResNet
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 9 -- CNNs
Current:
- Cohen, Welling, "Group Equivariant Convolutional Networks" (ICML 2016)
- Ioffe, Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (ICML 2015)
- Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" (2017), Section 3 -- depthwise separable convolutions
- Liu et al., "A ConvNet for the 2020s" (CVPR 2022) -- ConvNeXt
Next Topics
The natural next steps from CNNs:
- Recurrent neural networks: handling sequential instead of spatial data
- Transformers: attention-based architectures that replace convolution
- Group equivariant networks: generalizing the equivariance principle
Last reviewed: April 13, 2026