Beyond LLMs
Florence and Vision Foundation Models
Vision foundation models that unify classification, detection, segmentation, captioning, and VQA under a single pretrained backbone. Florence, Florence-2, and the path toward GPT-4V-style multimodal understanding.
Why This Matters
For decades, computer vision was a collection of separate problems solved by separate models: one model for classification, another for detection, another for segmentation, another for captioning. Each required its own architecture, its own training data format, and its own evaluation protocol.
Vision foundation models collapse this into a single pretrained backbone that handles all vision tasks. Train once on massive data, adapt to any task with minimal fine-tuning or just a text prompt. This parallels how large language models unified NLP tasks, and it represents the same kind of consolidation happening in vision.
Mental Model
Think of a vision transformer pretrained on billions of image-text pairs as having learned a general-purpose "visual language." Given an image, it produces rich representations that encode objects, their locations, their relationships, and their semantic meaning. Different tasks are just different questions asked of these representations: "What is in this image?" (classification), "Where are the objects?" (detection), "Describe this image" (captioning).
Florence-2 takes this further by formulating every vision task as a sequence-to-sequence problem: image in, text out, where the text format encodes the task-specific answer.
Formal Setup and Notation
Vision Foundation Model
A vision foundation model is a pretrained model $f_\theta$ that maps images to representations useful across multiple downstream tasks. Formally, for a set of vision tasks $\{T_1, \dots, T_K\}$ with task-specific heads $h_{\phi_1}, \dots, h_{\phi_K}$, the foundation model minimizes:

$$\min_{\theta,\, \phi_1, \dots, \phi_K} \; \sum_{k=1}^{K} \lambda_k \, \mathcal{L}_k\big(h_{\phi_k}(f_\theta(x)),\, y\big)$$

where $\mathcal{L}_k$ is the loss for task $T_k$ and $\lambda_k$ is a task weight. The key property: $f_\theta$ is shared across all tasks and pretrained on data far larger than any single task dataset.
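In code, the weighted multi-task objective is just a loop over tasks that accumulates weighted per-task losses through a shared backbone. This is a minimal sketch; the callables standing in for the backbone, heads, and losses are made up for illustration:

```python
def multitask_loss(backbone, heads, weights, batches, losses):
    """Weighted sum of per-task losses over a shared backbone.

    backbone: shared encoder f_theta (callable)
    heads:    dict task -> task-specific head h_phi_k
    weights:  dict task -> scalar lambda_k
    batches:  dict task -> (inputs, targets)
    losses:   dict task -> loss function L_k(pred, target)
    """
    total = 0.0
    for task, (x, y) in batches.items():
        pred = heads[task](backbone(x))   # shared f_theta, per-task head
        total += weights[task] * losses[task](pred, y)
    return total
```

In a real training loop the backbone gradients accumulate contributions from every task, which is exactly where the shared-representation benefit comes from.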
Sequence-to-Sequence Vision (Florence-2)
Florence-2 formulates all vision tasks as text generation. Given image $x$ and task prompt $p$ (e.g., "Detect all objects"), the model generates output text $y$ that encodes the answer:

$$y = \arg\max_{y} \; P_\theta(y \mid x, p)$$
Task-specific output formats:
- Classification: "cat" (a single label)
- Detection: "cat [x1,y1,x2,y2]; dog [x1,y1,x2,y2]" (labels with coordinates)
- Captioning: "A cat sitting on a red couch" (natural language)
- Segmentation: polygon coordinate sequences encoding region boundaries
All tasks share the same encoder-decoder architecture with no task-specific heads.
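A tiny parser shows how the serialized detection format above can be decoded back into structured boxes. The exact pattern is an assumption based on the "label [x1,y1,x2,y2]" convention shown in this section, not Florence-2's real token stream:

```python
import re

def parse_detections(text):
    """Parse 'cat [x1,y1,x2,y2]; dog [...]' into (label, box) pairs."""
    results = []
    for part in text.split(";"):
        # Lazy label match so multi-word labels ("traffic light") work.
        m = re.match(r"\s*(.+?)\s*\[(\d+),(\d+),(\d+),(\d+)\]\s*$", part)
        if m:
            label = m.group(1)
            box = tuple(int(m.group(i)) for i in range(2, 6))
            results.append((label, box))
    return results
```

The inverse direction (serializing boxes to text for training targets) is a string join over the same format, which is what makes the seq2seq framing so cheap to extend to new tasks.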
Core Definitions
Florence (Microsoft, 2021) is a vision foundation model pretrained on 900M image-text pairs using a contrastive objective similar to CLIP. It uses a hierarchical ViT (CoSwin Transformer) as the image encoder. Florence demonstrated strong transfer to classification, retrieval, detection, and segmentation using the same pretrained backbone with task-specific adapters.
Florence-2 (2024) reformulates all vision tasks as sequence-to-sequence generation. The architecture is a vision encoder (DaViT) plus a text decoder (transformer). The training data consists of 5.4 billion annotations across 126 million images, generated by a combination of specialized models and human annotation. Florence-2 comes in two sizes: 0.23B and 0.77B parameters.
Multimodal foundation models (GPT-4V, Gemini Vision, Claude with vision capability) extend this concept by integrating the vision encoder into a large language model. The LLM serves as both the task decoder and the reasoning engine. These models can handle open-ended visual questions that Florence-2 cannot, but they are much larger and more expensive to run.
Main Theorems
Unified Task Formulation via Sequence Generation
Statement
Let $\mathcal{T} = \{T_1, \dots, T_K\}$ be a set of vision tasks. If each task $T_k$ has an output that can be serialized as a text sequence (using coordinate tokens for spatial outputs), then a single encoder-decoder model $P_\theta$ can be trained on all tasks jointly:

$$\min_\theta \; \sum_{k=1}^{K} \mathbb{E}_{(x, y) \sim D_k}\big[-\log P_\theta(y \mid x, p_k)\big]$$

where $D_k$ is the dataset for task $T_k$ and $p_k$ is the task-specific text prompt. The model shares all parameters across tasks. The prompt alone determines the output format.
Florence-2 empirically shows that this joint training improves performance on each individual task compared to training separate models on the same data, suggesting positive transfer across vision tasks.
Intuition
Detection requires understanding what objects look like and where they are. Captioning requires understanding what objects look like and how they relate. Segmentation requires precise spatial understanding. By training on all tasks simultaneously, the shared encoder learns representations that capture appearance, location, and relationships jointly. Each task provides a different supervisory signal that enriches the shared representation.
Proof Sketch
No formal proof of positive transfer. Xiao et al. (2024) report transfer gains for Florence-2 over single-task baselines on several vision benchmarks, though the magnitude of improvement varies substantially by task and evaluation protocol. The hypothesized mechanism is multi-task regularization: detection annotations provide localization signal that helps segmentation, and captioning annotations provide semantic signal that helps classification.
Why It Matters
This formulation eliminates task-specific architecture design. Adding a new vision task requires only defining its text output format and collecting training data. No new model heads, no architecture changes. This is the same simplification that sequence-to-sequence brought to NLP (translation, summarization, and QA all became text generation problems).
Failure Mode
Serializing spatial outputs as text token sequences introduces quantization error. Bounding box coordinates are discretized to a fixed vocabulary (e.g., 1000 location bins), limiting spatial precision. For tasks requiring sub-pixel accuracy (medical imaging, satellite imagery), this quantization can be a bottleneck. The approach also scales poorly with output complexity: generating polygon coordinates for dense segmentation produces very long output sequences.
Florence-2 Architecture Details
The Florence-2 encoder (DaViT) processes the input image at multiple scales, producing a feature pyramid. These features are flattened into a sequence of visual tokens. The decoder is a standard transformer that generates output tokens with cross-attention to the visual token sequence.
Coordinate representation: spatial coordinates are encoded as discrete tokens from a vocabulary of 1000 bins per axis. A bounding box is encoded as four tokens (x1, y1, x2, y2). This means the model's spatial resolution is limited to $1/1000$ of the image dimension.
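The bin arithmetic can be sketched directly. This assumes simple uniform binning with bin-center dequantization, which is one plausible reading of the scheme described above rather than Florence-2's exact implementation:

```python
BINS = 1000  # coordinate vocabulary size per axis

def quantize(coord, size):
    """Map a pixel coordinate in [0, size) to a bin index in [0, BINS)."""
    return min(int(coord / size * BINS), BINS - 1)

def dequantize(bin_idx, size):
    """Map a bin index back to the pixel at the bin center."""
    return (bin_idx + 0.5) * size / BINS

# On a 2000-pixel-wide image each bin spans 2 px, so the round-trip
# error is bounded by half a bin width, i.e. 1 px.
err = abs(dequantize(quantize(1001.7, 2000), 2000) - 1001.7)
```

For a 512-pixel image the bins are sub-pixel, but for high-resolution inputs (e.g., 4000-pixel satellite tiles) each bin spans several pixels, which is the precision bottleneck the failure-mode discussion above refers to.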
Florence-2 is pretrained in two phases:
- Image-level pretraining: contrastive learning on image-text pairs (similar to CLIP) plus image captioning
- Region-level pretraining: detection, region captioning, and referring expression grounding on 5.4B region annotations
From Florence to GPT-4V
Vision foundation models exist on a spectrum of capability and cost:
- Florence-2 (0.23B-0.77B params): Fast, cheap, strong on standard vision tasks. Cannot do open-ended reasoning.
- LLaVA, InternVL (7B-34B params): Open-source vision-language models. Can reason about images and follow complex instructions.
- GPT-4V, Gemini, Claude (parameter counts not publicly disclosed): Full multimodal reasoning. Can handle arbitrary visual questions, multi-step reasoning, and tool use. Expensive per image.
The trade-off is clear: larger models handle more open-ended tasks but cost orders of magnitude more per inference. For structured extraction (detect objects, read text, segment regions), Florence-2 is sufficient and runs on a single consumer GPU. For "what is the architectural style of this building and how does it relate to the surrounding neighborhood," you need a multimodal LLM.
Recent Foundation Models (2023-2024)
The vision foundation model landscape expanded rapidly after Florence-2. Key canonical entries:
- SAM (Kirillov et al. 2023, arXiv:2304.02643). "Segment Anything" introduces a promptable segmentation foundation model trained on 1B masks across 11M images. Points, boxes, or text prompts condition a lightweight decoder on a shared ViT image encoder.
- SAM 2 (Ravi et al. 2024, arXiv:2408.00714). Extends SAM to video with a memory attention module that propagates masks across frames. Unifies image and video segmentation under one model.
- DINOv2 (Oquab et al. 2023, arXiv:2304.07193). Self-supervised visual features trained on a curated 142M image dataset. Produces dense features strong enough for downstream tasks without any labeled data or text supervision, in contrast to CLIP-style contrastive pretraining.
- InternVL (Chen et al. 2023, arXiv:2312.14238). Scales the vision encoder to 6B parameters and aligns it with an LLM. Designed to close the gap between open vision-language models and proprietary systems.
- Qwen-VL / Qwen2-VL (Bai et al. 2023, arXiv:2308.12966). Vision-language model with dynamic resolution support, grounding, and multilingual OCR. Qwen2-VL extends this with native dynamic resolution and longer context.
- PaLI-X (Chen et al. 2023, arXiv:2305.18565). A 55B encoder-decoder vision-language model from Google, scaling both the ViT encoder and the language decoder jointly.
- Chameleon (Meta 2024, arXiv:2405.09818). Early-fusion mixed-modal model: images and text share a single token vocabulary and a single transformer. Removes the separate vision encoder.
- GPT-4o native vision (OpenAI 2024). Single-model multimodal system trained end-to-end over text, image, and audio tokens. No public architecture details; referenced here as a design point for native multimodality rather than a replicable baseline.
These models span two trends. First, specialized vision foundation models (SAM, DINOv2) keep a fixed output modality and scale data and self-supervision. Second, vision-language models (InternVL, Qwen-VL, PaLI-X, Chameleon, GPT-4o) fold perception into a language model, either via a separate encoder or via early fusion.
Canonical Examples
Florence-2 multi-task inference
Given a single image of a street scene with the prompt "Detect all objects": the model outputs "car [120,340,450,520]; person [500,200,580,480]; traffic light [250,50,290,120]". With the prompt "Caption this image": the model outputs "A person crossing a street near a parked car with a traffic light overhead." Same encoder, same decoder, same weights. Only the prompt changes.
Common Confusions
Florence-2 is not a multimodal LLM
Florence-2 is a vision model that outputs text-formatted answers to predefined task types. It cannot hold a conversation, reason about abstract concepts, or follow arbitrary instructions. It is a vision specialist, not a general-purpose assistant. The text decoder is small (a few hundred million parameters) and optimized for structured outputs, not open-ended language generation.
More pretraining data does not always help
Florence-2 uses 5.4 billion annotations, but many are machine-generated by specialist models. These pseudo-labels contain errors that the foundation model can memorize. The quality of pretraining data matters at least as much as quantity. Noisy labels on rare object categories can actually hurt performance on those categories compared to smaller, cleaner datasets.
Zero-shot does not mean no training data is needed
Florence-2 can perform tasks it was trained on without task-specific fine-tuning (zero-shot with respect to downstream data). But it was trained on massive annotated data for those exact task types. It cannot perform genuinely novel tasks it has never seen during pretraining. The "zero-shot" label refers to the downstream dataset, not the pretraining data.
Exercises
Problem
Florence-2 uses 1000 coordinate bins per axis. What is the maximum localization error (in pixels) for a bounding box prediction on an image of side length $W$ pixels? Express your answer in terms of $W$.
Problem
Explain why multi-task pretraining (detection + captioning + segmentation jointly) might improve detection accuracy compared to single-task detection pretraining, even when evaluated only on detection.
References
Canonical:
- Yuan et al., Florence: A New Foundation Model for Computer Vision (2021), Microsoft Research, arXiv:2111.11432
- Xiao et al., Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (2024), CVPR
Current:
- Liu et al., Visual Instruction Tuning (LLaVA) (2023), NeurIPS
- OpenAI, GPT-4V Technical Report (2023), for comparison with LLM-based multimodal models
- Radford et al., Learning Transferable Visual Models from Natural Language Supervision (CLIP) (2021), arXiv:2103.00020
- Li et al., BLIP: Bootstrapping Language-Image Pre-training (2022), arXiv:2201.12086
- Kirillov et al., Segment Anything (SAM) (2023), arXiv:2304.02643
- Ravi et al., SAM 2: Segment Anything in Images and Videos (2024), arXiv:2408.00714
- Oquab et al., DINOv2: Learning Robust Visual Features without Supervision (2023), arXiv:2304.07193
- Chen et al., Pix2Seq: A Language Modeling Framework for Object Detection (2022), ICLR, arXiv:2109.10852 -- the first clean formulation of detection as autoregressive coordinate-token generation; the conceptual ancestor of Florence-2's serialization scheme
- Lee et al., Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (2023), ICML, arXiv:2210.03347 -- screenshot-to-HTML pretraining objective for documents and UIs; directly comparable region-level pretraining to Florence-2 phase 2
- Chen et al., InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (2023), arXiv:2312.14238
- Bai et al., Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (2023), arXiv:2308.12966
- Chen et al., PaLI-X: On Scaling up a Multilingual Vision and Language Model (2023), arXiv:2305.18565
- Meta, Chameleon: Mixed-Modal Early-Fusion Foundation Models (2024), arXiv:2405.09818
- Deitke et al. (AI2), Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (2024), arXiv:2409.17146 -- open-weight VLM family released with the PixMo dataset; the strongest open answer to GPT-4V at release
- OpenAI, GPT-4o System Card (2024), for the native multimodal design point
Last reviewed: April 26, 2026