
Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM

Dosovitskiy et al. (2021) showed with ViT that a plain transformer applied to image patches matches CNN backbones once given enough pretraining data. The follow-on lineage (DeiT, Swin, MAE, DINOv2, SAM, register tokens, NaViT, ViT-22B, SigLIP-2) varies the design along three axes: how much locality bias to keep, what supervision signal to train against, and whether tokens are spatially uniform or content-adaptive. This page covers the patch-embedding step and the attention cost it sets, Swin's shifted-window argument, MAE's 75% masking ratio, DINO's self-distillation with centering, and the 2024 register-token finding that explains and removes the high-norm artifact tokens long observed in CLIP- and DINO-family ViT feature maps; each topic is illustrated with a minimal sketch below.
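The patch-embedding step is just a strided convolution: a P×P patch grid turns an H×W image into N = HW/P² tokens, and self-attention then costs O(N²) in that sequence length. A minimal sketch, assuming PyTorch and the standard ViT-B/16 configuration (224×224 input, 16×16 patches, width 768):

```python
import torch
import torch.nn as nn

# Sketch of ViT patch embedding (assumption: PyTorch, ViT-B/16 sizes).
# A stride-P convolution is equivalent to cutting non-overlapping P x P
# patches and applying one shared linear projection to each.
class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.num_patches = (img_size // patch) ** 2   # N = HW / P^2 = 196

    def forward(self, x):                             # (B, 3, H, W)
        x = self.proj(x)                              # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)           # (B, N, dim)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)   # torch.Size([1, 196, 768]); attention over N tokens is O(N^2)
```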
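Swin's shifted-window argument is that restricting attention to M×M windows drops the cost from O(N²) to O(N·M²), and cyclically shifting the windows by M/2 every other block restores cross-window communication. A sketch of the partition-and-shift step, assuming PyTorch; the attention mask that hides wrapped-around regions is omitted for brevity:

```python
import torch

# Sketch of Swin's window partition + cyclic shift (assumption: PyTorch;
# the mask for regions wrapped around by the shift is omitted).
def window_partition(x, M):
    """(B, H, W, C) feature map -> (num_windows*B, M*M, C) for windowed attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

M = 4                                               # window size
x = torch.randn(1, 8, 8, 96)                        # (B, H, W, C)
windows = window_partition(x, M)                    # attend within each window: O(N * M^2)
shifted = torch.roll(x, shifts=(-M // 2, -M // 2), dims=(1, 2))
windows_shifted = window_partition(shifted, M)      # next block: windows straddle old borders
print(windows.shape, windows_shifted.shape)         # torch.Size([4, 16, 96]) twice
```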
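MAE masks 75% of patches at random and runs the encoder only on the visible 25%, which is what makes the pretraining cheap; a lightweight decoder reconstructs the pixels of the rest. A sketch of the masking step, assuming PyTorch; `random_masking` is a hypothetical helper name:

```python
import torch

# Sketch of MAE's random masking (assumption: PyTorch; ratio 0.75 as in
# the MAE paper; `random_masking` is a hypothetical helper).
def random_masking(tokens, ratio=0.75):
    B, N, D = tokens.shape
    keep = int(N * (1 - ratio))                     # 49 of 196 patches survive
    noise = torch.rand(B, N)                        # one random score per patch
    ids_keep = noise.argsort(dim=1)[:, :keep]       # random subset = lowest scores
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_keep

visible, ids_keep = random_masking(torch.randn(2, 196, 768))
print(visible.shape)                                # torch.Size([2, 49, 768])
```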
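DINO's self-distillation avoids collapse by centering the teacher's logits (subtracting a running mean, which pushes outputs toward uniform) while sharpening them with a low temperature (which pushes toward one-hot); the two forces balance each other. A sketch assuming PyTorch, with the paper's head dimension (65,536) and temperatures (teacher 0.04, student 0.1):

```python
import torch
import torch.nn.functional as F

# Sketch of DINO's teacher-side centering and sharpening (assumption:
# plain PyTorch; 65536 is the prototype dimension from the DINO paper).
center = torch.zeros(1, 65536)                      # running mean of teacher logits

def teacher_targets(t_logits, temp=0.04, momentum=0.9):
    """Center (pushes toward uniform), then sharpen (pushes toward one-hot)."""
    global center
    probs = F.softmax((t_logits - center) / temp, dim=-1)
    center = momentum * center + (1 - momentum) * t_logits.mean(0, keepdim=True)
    return probs.detach()                           # no gradient through the teacher

t_logits = torch.randn(8, 65536)                    # teacher head outputs
s_logits = torch.randn(8, 65536)                    # student head outputs
# Cross-entropy from sharpened teacher targets to student predictions.
loss = -(teacher_targets(t_logits) * F.log_softmax(s_logits / 0.1, -1)).sum(-1).mean()
print(loss.item())
```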
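The register-token fix is small: append a few extra learnable tokens to the patch sequence so global computation has somewhere to live other than low-information patches, then discard them at the output. A minimal sketch, assuming PyTorch and omitting the CLS token:

```python
import torch
import torch.nn as nn

# Sketch of register tokens (assumption: PyTorch; CLS token omitted).
# A few learnable tokens ride along with the patches through every
# transformer block, then are dropped before downstream use.
num_registers, dim = 4, 768
registers = nn.Parameter(torch.zeros(1, num_registers, dim))

patches = torch.randn(2, 196, dim)                  # embedded patch tokens
x = torch.cat([registers.expand(2, -1, -1), patches], dim=1)  # (2, 200, 768)
# ... transformer blocks over all 200 tokens ...
features = x[:, num_registers:]                     # registers discarded at output
print(features.shape)                               # torch.Size([2, 196, 768])
```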
