CLIP, OpenCLIP, and SigLIP: Contrastive Language-Image Pretraining
Radford et al. 2021 (CLIP) trained two encoders, one for images and one for text, with a symmetric InfoNCE objective on 400M web image-text pairs. The result is a shared embedding space that powers zero-shot classification and retrieval and serves as the visual backbone of most modern vision-language models. This page covers the contrastive objective as a mutual-information bound, the OpenCLIP scaling laws (Cherti et al. 2023), the SigLIP pairwise-sigmoid alternative (Zhai et al. 2023), the modality gap (Liang et al. 2022), and the practical pipeline from training corpus to LLaVA-style VLM backbone.
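A minimal PyTorch sketch of the two objectives mentioned above, assuming pre-computed, paired image and text embeddings. The function names, fixed temperature and bias values, and batch sizes are illustrative, not the papers' exact setup: CLIP learns the temperature, and SigLIP learns both a temperature and a bias.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings (CLIP-style)."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


def siglip_pairwise_sigmoid_loss(image_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 10.0,
                                 bias: float = -10.0) -> torch.Tensor:
    """Pairwise sigmoid loss in the spirit of SigLIP: every (image, text) pair
    is an independent binary match/non-match classification, so no batch-wide
    softmax normalization is needed. Temperature and bias are fixed here for
    illustration; SigLIP learns both."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * temperature + bias

    # +1 on the diagonal (true pairs), -1 off the diagonal (negatives).
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0

    # -log sigmoid(label * logit), summed over pairs and averaged over the batch.
    return -F.logsigmoid(labels * logits).sum(dim=-1).mean()


if __name__ == "__main__":
    batch, dim = 8, 512  # illustrative sizes
    img = torch.randn(batch, dim)
    txt = torch.randn(batch, dim)
    print(clip_contrastive_loss(img, txt))
    print(siglip_pairwise_sigmoid_loss(img, txt))
```

The practical difference: the softmax in InfoNCE couples every example in the batch (and so rewards very large batches), while the sigmoid form scores each pair independently, which is part of SigLIP's motivation.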