Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM
Dosovitskiy et al. (2021) showed with ViT that a plain transformer applied to image patches matches CNN backbones once given enough pretraining data. The follow-on lineage (DeiT, Swin, MAE, DINOv2, SAM, register tokens, NaViT, ViT-22B, SigLIP-2) varies three design axes: how much locality bias to keep, what supervision signal to train against, and whether tokenization is spatially uniform or content-adaptive. This page covers the complexity of patch embedding, Swin's shifted-window argument, MAE's 75% masking ratio, DINO's self-distillation with centering, and the 2024 register-token finding that resolves the long-standing problem of high-norm artifact tokens in CLIP- and DINOv2-style ViT feature maps.
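As a concrete anchor for the patch-embedding step, here is a minimal sketch in PyTorch-style Python. The class name `PatchEmbed` and the defaults (224px input, 16px patches, 768-dim tokens, matching ViT-B/16) are illustrative choices, not code from any of the cited papers:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style patch embedding: split an image into non-overlapping
    P x P patches and linearly project each to a D-dim token.
    A stride-P convolution is mathematically identical to
    flatten-each-patch-then-matmul, and is the common implementation."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196 for ViT-B/16
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, N, D) token sequence

x = torch.randn(2, 3, 224, 224)
tokens = PatchEmbed()(x)
print(tokens.shape)  # torch.Size([2, 196, 768])
```

This is where the complexity tradeoff lives: an H x W image becomes N = (H/P)^2 tokens, and global self-attention then costs O(N^2 * D), so halving the patch size quadruples N and multiplies attention cost roughly 16x. That quadratic pressure is exactly what Swin's windowed attention is designed to relieve.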