Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM
Dosovitskiy et al. (2021) showed with ViT that a plain transformer applied to image patches matches CNN backbones once given enough pretraining data. The follow-on lineage (DeiT, Swin, MAE, DINOv2, SAM, register tokens, NaViT, ViT-22B, SigLIP-2) varies three design axes: how much locality bias to keep, what supervision signal to train against, and whether tokenization is spatially uniform or content-adaptive. This page covers the complexity of patch embedding, Swin's shifted-window argument, MAE's 75% masking ratio, DINO's self-distillation with centering, and the 2024 register-token finding that resolves the long-standing problem of high-norm artifact tokens in CLIP- and DINOv2-style ViT feature maps.
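As a concrete anchor for the patch-embedding step, here is a minimal sketch in PyTorch-style Python. The class name `PatchEmbed` and the defaults (224px input, 16px patches, 768-dim tokens, matching ViT-B/16) are illustrative choices, not code from any of the cited papers:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style patch embedding: split an image into non-overlapping
    P x P patches and linearly project each to a D-dim token.
    A stride-P convolution is mathematically identical to
    flatten-each-patch-then-matmul, and is the common implementation."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196 for ViT-B/16
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, N, D) token sequence

x = torch.randn(2, 3, 224, 224)
tokens = PatchEmbed()(x)
print(tokens.shape)  # torch.Size([2, 196, 768])
```

This is where the complexity tradeoff lives: an H x W image becomes N = (H/P)^2 tokens, and global self-attention then costs O(N^2 * D), so halving the patch size quadruples N and multiplies attention cost roughly 16x. That quadratic pressure is exactly what Swin's windowed attention is designed to relieve.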