CLIP, OpenCLIP, and SigLIP: Contrastive Language-Image Pretraining
Radford et al. 2021 (CLIP) trained two encoders, one for images and one for text, with a symmetric InfoNCE objective on 400M web image-text pairs. The result is a shared embedding space that powers zero-shot classification and retrieval and serves as the visual backbone of most modern vision-language models. This page covers the contrastive objective as a mutual-information bound, the OpenCLIP scaling laws (Cherti et al. 2023), the SigLIP pairwise-sigmoid alternative (Zhai et al. 2023), the modality gap (Liang et al. 2022), and the practical pipeline from training corpus to LLaVA-style VLM backbone.
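A minimal PyTorch sketch of the two objectives mentioned above, assuming pre-computed, paired image and text embeddings. The function names, fixed temperature and bias values, and batch sizes are illustrative, not the papers' exact setup: CLIP learns the temperature, and SigLIP learns both a temperature and a bias.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings (CLIP-style)."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


def siglip_pairwise_sigmoid_loss(image_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 10.0,
                                 bias: float = -10.0) -> torch.Tensor:
    """Pairwise sigmoid loss in the spirit of SigLIP: every (image, text) pair
    is an independent binary match/non-match classification, so no batch-wide
    softmax normalization is needed. Temperature and bias are fixed here for
    illustration; SigLIP learns both."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * temperature + bias

    # +1 on the diagonal (true pairs), -1 off the diagonal (negatives).
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0

    # -log sigmoid(label * logit), summed over pairs and averaged over the batch.
    return -F.logsigmoid(labels * logits).sum(dim=-1).mean()


if __name__ == "__main__":
    batch, dim = 8, 512  # illustrative sizes
    img = torch.randn(batch, dim)
    txt = torch.randn(batch, dim)
    print(clip_contrastive_loss(img, txt))
    print(siglip_pairwise_sigmoid_loss(img, txt))
```

The practical difference: the softmax in InfoNCE couples every example in the batch (and so rewards very large batches), while the sigmoid form scores each pair independently, which is part of SigLIP's motivation.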