Beyond LLMs
Florence and Vision Foundation Models
Vision foundation models that unify classification, detection, segmentation, captioning, and VQA under a single pretrained backbone. Florence, Florence-2, and the path toward GPT-4V-style multimodal understanding.
Why This Matters
For decades, computer vision was a collection of separate problems solved by separate models: one model for classification, another for detection, another for segmentation, another for captioning. Each required its own architecture, its own training data format, and its own evaluation protocol.
Vision foundation models collapse this into a single pretrained backbone that handles all vision tasks. Train once on massive data, adapt to any task with minimal fine-tuning or just a text prompt. This parallels how large language models unified NLP tasks, and it represents the same kind of consolidation happening in vision.
Mental Model
Think of a vision transformer pretrained on billions of image-text pairs as having learned a general-purpose "visual language." Given an image, it produces rich representations that encode objects, their locations, their relationships, and their semantic meaning. Different tasks are just different questions asked of these representations: "What is in this image?" (classification), "Where are the objects?" (detection), "Describe this image" (captioning).
Florence-2 takes this further by formulating every vision task as a sequence-to-sequence problem: image in, text out, where the text format encodes the task-specific answer.
Formal Setup and Notation
Vision Foundation Model
A vision foundation model is a pretrained model $f_\theta$ that maps images to representations useful across multiple downstream tasks. Formally, for a set of vision tasks $\{T_1, \dots, T_K\}$ with task-specific heads $h_{\phi_1}, \dots, h_{\phi_K}$, the foundation model minimizes:

$$\min_{\theta,\, \phi_1, \dots, \phi_K} \; \sum_{k=1}^{K} \lambda_k \, \mathcal{L}_k\big(h_{\phi_k}(f_\theta(x)),\, y\big)$$

where $\mathcal{L}_k$ is the loss for task $T_k$ and $\lambda_k$ is a task weight. The key property: $f_\theta$ is shared across all tasks and pretrained on data far larger than any single task dataset.
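In code, the weighted multi-task objective is just a loop over tasks that accumulates weighted per-task losses through a shared backbone. This is a minimal sketch; the callables standing in for the backbone, heads, and losses are made up for illustration:

```python
def multitask_loss(backbone, heads, weights, batches, losses):
    """Weighted sum of per-task losses over a shared backbone.

    backbone: shared encoder f_theta (callable)
    heads:    dict task -> task-specific head h_phi_k
    weights:  dict task -> scalar lambda_k
    batches:  dict task -> (inputs, targets)
    losses:   dict task -> loss function L_k(pred, target)
    """
    total = 0.0
    for task, (x, y) in batches.items():
        pred = heads[task](backbone(x))   # shared f_theta, per-task head
        total += weights[task] * losses[task](pred, y)
    return total
```

In a real training loop the backbone gradients accumulate contributions from every task, which is exactly where the shared-representation benefit comes from.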
Sequence-to-Sequence Vision (Florence-2)
Florence-2 formulates all vision tasks as text generation. Given image $x$ and task prompt $p$ (e.g., "Detect all objects"), the model generates output text $y$ that encodes the answer:

$$y = \arg\max_{y} \; P_\theta(y \mid x, p)$$
Task-specific output formats:
- Classification: "cat" (a single label)
- Detection: "cat [x1,y1,x2,y2]; dog [x1,y1,x2,y2]" (labels with coordinates)
- Captioning: "A cat sitting on a red couch" (natural language)
- Segmentation: polygon coordinate sequences encoding region boundaries
All tasks share the same encoder-decoder architecture with no task-specific heads.
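A tiny parser shows how the serialized detection format above can be decoded back into structured boxes. The exact pattern is an assumption based on the "label [x1,y1,x2,y2]" convention shown in this section, not Florence-2's real token stream:

```python
import re

def parse_detections(text):
    """Parse 'cat [x1,y1,x2,y2]; dog [...]' into (label, box) pairs."""
    results = []
    for part in text.split(";"):
        # Lazy label match so multi-word labels ("traffic light") work.
        m = re.match(r"\s*(.+?)\s*\[(\d+),(\d+),(\d+),(\d+)\]\s*$", part)
        if m:
            label = m.group(1)
            box = tuple(int(m.group(i)) for i in range(2, 6))
            results.append((label, box))
    return results
```

The inverse direction (serializing boxes to text for training targets) is a string join over the same format, which is what makes the seq2seq framing so cheap to extend to new tasks.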
Core Definitions
Florence (Microsoft, 2021) is a vision foundation model pretrained on 900M image-text pairs using a contrastive objective similar to CLIP. It uses a hierarchical ViT (CoSwin Transformer) as the image encoder. Florence demonstrated strong transfer to classification, retrieval, detection, and segmentation using the same pretrained backbone with task-specific adapters.
Florence-2 (2024) reformulates all vision tasks as sequence-to-sequence generation. The architecture is a vision encoder (DaViT) plus a text decoder (transformer). The training data consists of 5.4 billion annotations across 126 million images, generated by a combination of specialized models and human annotation. Florence-2 comes in two sizes: 0.23B and 0.77B parameters.
Multimodal foundation models (GPT-4V, Gemini Vision, Claude with vision capability) extend this concept by integrating the vision encoder into a large language model. The LLM serves as both the task decoder and the reasoning engine. These models can handle open-ended visual questions that Florence-2 cannot, but they are much larger and more expensive to run.
Main Theorems
Unified Task Formulation via Sequence Generation
Statement
Let $\mathcal{T} = \{T_1, \dots, T_K\}$ be a set of vision tasks. If each task $T_k$ has an output that can be serialized as a text sequence (using coordinate tokens for spatial outputs), then a single encoder-decoder model $P_\theta$ can be trained on all tasks jointly:

$$\min_\theta \; \sum_{k=1}^{K} \mathbb{E}_{(x, y) \sim D_k}\big[-\log P_\theta(y \mid x, p_k)\big]$$

where $D_k$ is the dataset for task $T_k$ and $p_k$ is the task-specific text prompt. The model shares all parameters across tasks. The prompt alone determines the output format.
Florence-2 empirically shows that this joint training improves performance on each individual task compared to training separate models on the same data, suggesting positive transfer across vision tasks.
Intuition
Detection requires understanding what objects look like and where they are. Captioning requires understanding what objects look like and how they relate. Segmentation requires precise spatial understanding. By training on all tasks simultaneously, the shared encoder learns representations that capture appearance, location, and relationships jointly. Each task provides a different supervisory signal that enriches the shared representation.
Proof Sketch
No formal proof of positive transfer. Xiao et al. (2024) report transfer gains for Florence-2 over single-task baselines on several vision benchmarks, though the magnitude of improvement varies substantially by task and evaluation protocol. The hypothesized mechanism is multi-task regularization: detection annotations provide localization signal that helps segmentation, and captioning annotations provide semantic signal that helps classification.
Why It Matters
This formulation eliminates task-specific architecture design. Adding a new vision task requires only defining its text output format and collecting training data. No new model heads, no architecture changes. This is the same simplification that sequence-to-sequence brought to NLP (translation, summarization, and QA all became text generation problems).
Failure Mode
Serializing spatial outputs as text token sequences introduces quantization error. Bounding box coordinates are discretized to a fixed vocabulary (e.g., 1000 location bins), limiting spatial precision. For tasks requiring sub-pixel accuracy (medical imaging, satellite imagery), this quantization can be a bottleneck. The approach also scales poorly with output complexity: generating polygon coordinates for dense segmentation produces very long output sequences.
Florence-2 Architecture Details
The Florence-2 encoder (DaViT) processes the input image at multiple scales, producing a feature pyramid. These features are flattened into a sequence of visual tokens. The decoder is a standard transformer that generates output tokens with cross-attention to the visual token sequence.
Coordinate representation: spatial coordinates are encoded as discrete tokens from a vocabulary of 1000 bins per axis. A bounding box is encoded as four tokens (x1, y1, x2, y2). This means the model's spatial resolution is limited to $1/1000$ of the image dimension.
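The bin arithmetic can be sketched directly. This assumes simple uniform binning with bin-center dequantization, which is one plausible reading of the scheme described above rather than Florence-2's exact implementation:

```python
BINS = 1000  # coordinate vocabulary size per axis

def quantize(coord, size):
    """Map a pixel coordinate in [0, size) to a bin index in [0, BINS)."""
    return min(int(coord / size * BINS), BINS - 1)

def dequantize(bin_idx, size):
    """Map a bin index back to the pixel at the bin center."""
    return (bin_idx + 0.5) * size / BINS

# On a 2000-pixel-wide image each bin spans 2 px, so the round-trip
# error is bounded by half a bin width, i.e. 1 px.
err = abs(dequantize(quantize(1001.7, 2000), 2000) - 1001.7)
```

For a 512-pixel image the bins are sub-pixel, but for high-resolution inputs (e.g., 4000-pixel satellite tiles) each bin spans several pixels, which is the precision bottleneck the failure-mode discussion above refers to.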
Florence-2 is pretrained in two phases:
- Image-level pretraining: contrastive learning on image-text pairs (similar to CLIP) plus image captioning
- Region-level pretraining: detection, region captioning, and referring expression grounding on 5.4B region annotations
From Florence to GPT-4V
Vision foundation models exist on a spectrum of capability and cost:
- Florence-2 (0.23B-0.77B params): Fast, cheap, strong on standard vision tasks. Cannot do open-ended reasoning.
- LLaVA, InternVL (7B-34B params): Open-source vision-language models. Can reason about images and follow complex instructions.
- GPT-4V, Gemini, Claude (parameter counts not publicly disclosed): Full multimodal reasoning. Can handle arbitrary visual questions, multi-step reasoning, and tool use. Expensive per image.
The trade-off is clear: larger models handle more open-ended tasks but cost orders of magnitude more per inference. For structured extraction (detect objects, read text, segment regions), Florence-2 is sufficient and runs on a single consumer GPU. For "what is the architectural style of this building and how does it relate to the surrounding neighborhood," you need a multimodal LLM.
Recent Foundation Models (2023-2024)
The vision foundation model landscape expanded rapidly after Florence-2. Key canonical entries:
- SAM (Kirillov et al. 2023, arXiv:2304.02643). "Segment Anything" introduces a promptable segmentation foundation model trained on 1B masks across 11M images. Points, boxes, or text prompts condition a lightweight decoder on a shared ViT image encoder.
- SAM 2 (Ravi et al. 2024, arXiv:2408.00714). Extends SAM to video with a memory attention module that propagates masks across frames. Unifies image and video segmentation under one model.
- DINOv2 (Oquab et al. 2023, arXiv:2304.07193). Self-supervised visual features trained on a curated 142M image dataset. Produces dense features strong enough for downstream tasks without any labeled data or text supervision, in contrast to CLIP-style contrastive pretraining.
- InternVL (Chen et al. 2023, arXiv:2312.14238). Scales the vision encoder to 6B parameters and aligns it with an LLM. Designed to close the gap between open vision-language models and proprietary systems.
- Qwen-VL / Qwen2-VL (Bai et al. 2023, arXiv:2308.12966). Vision-language model with dynamic resolution support, grounding, and multilingual OCR. Qwen2-VL extends this with native dynamic resolution and longer context.
- PaLI-X (Chen et al. 2023, arXiv:2305.18565). A 55B encoder-decoder vision-language model from Google, scaling both the ViT encoder and the language decoder jointly.
- Chameleon (Meta 2024, arXiv:2405.09818). Early-fusion mixed-modal model: images and text share a single token vocabulary and a single transformer. Removes the separate vision encoder.
- GPT-4o native vision (OpenAI 2024). Single-model multimodal system trained end-to-end over text, image, and audio tokens. No public architecture details; referenced here as a design point for native multimodality rather than a replicable baseline.
These models span two trends. First, specialized vision foundation models (SAM, DINOv2) keep a fixed output modality and scale data and self-supervision. Second, vision-language models (InternVL, Qwen-VL, PaLI-X, Chameleon, GPT-4o) fold perception into a language model, either via a separate encoder or via early fusion.
Canonical Examples
Florence-2 multi-task inference
Given a single image of a street scene with the prompt "Detect all objects": the model outputs "car [120,340,450,520]; person [500,200,580,480]; traffic light [250,50,290,120]". With the prompt "Caption this image": the model outputs "A person crossing a street near a parked car with a traffic light overhead." Same encoder, same decoder, same weights. Only the prompt changes.
Common Confusions
Florence-2 is not a multimodal LLM
Florence-2 is a vision model that outputs text-formatted answers to predefined task types. It cannot hold a conversation, reason about abstract concepts, or follow arbitrary instructions. It is a vision specialist, not a general-purpose assistant. The text decoder is small (a few hundred million parameters) and optimized for structured outputs, not open-ended language generation.
More pretraining data does not always help
Florence-2 uses 5.4 billion annotations, but many are machine-generated by specialist models. These pseudo-labels contain errors that the foundation model can memorize. The quality of pretraining data matters at least as much as quantity. Noisy labels on rare object categories can actually hurt performance on those categories compared to smaller, cleaner datasets.
Zero-shot does not mean no training data is needed
Florence-2 can perform tasks it was trained on without task-specific fine-tuning (zero-shot with respect to downstream data). But it was trained on massive annotated data for those exact task types. It cannot perform genuinely novel tasks it has never seen during pretraining. The "zero-shot" label refers to the downstream dataset, not the pretraining data.
Exercises
Problem
Florence-2 uses 1000 coordinate bins per axis. What is the maximum localization error (in pixels) for a bounding box prediction on an image of side length $W$ pixels? Express your answer in terms of $W$.
Problem
Explain why multi-task pretraining (detection + captioning + segmentation jointly) might improve detection accuracy compared to single-task detection pretraining, even when evaluated only on detection.
References
Canonical:
- Yuan et al., Florence: A New Foundation Model for Computer Vision (2021), Microsoft Research, arXiv:2111.11432
- Xiao et al., Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (2024), CVPR
Current:
- Liu et al., Visual Instruction Tuning (LLaVA) (2023), NeurIPS
- OpenAI, GPT-4V Technical Report (2023), for comparison with LLM-based multimodal models
- Radford et al., Learning Transferable Visual Models from Natural Language Supervision (CLIP) (2021), arXiv:2103.00020
- Li et al., BLIP: Bootstrapping Language-Image Pre-training (2022), arXiv:2201.12086
- Kirillov et al., Segment Anything (SAM) (2023), arXiv:2304.02643
- Ravi et al., SAM 2: Segment Anything in Images and Videos (2024), arXiv:2408.00714
- Oquab et al., DINOv2: Learning Robust Visual Features without Supervision (2023), arXiv:2304.07193
- Chen et al., Pix2Seq: A Language Modeling Framework for Object Detection (2022), ICLR, arXiv:2109.10852 -- the first clean formulation of detection as autoregressive coordinate-token generation; the conceptual ancestor of Florence-2's serialization scheme
- Lee et al., Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (2023), ICML, arXiv:2210.03347 -- screenshot-to-HTML pretraining objective for documents and UIs; directly comparable region-level pretraining to Florence-2 phase 2
- Chen et al., InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (2023), arXiv:2312.14238
- Bai et al., Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (2023), arXiv:2308.12966
- Chen et al., PaLI-X: On Scaling up a Multilingual Vision and Language Model (2023), arXiv:2305.18565
- Meta, Chameleon: Mixed-Modal Early-Fusion Foundation Models (2024), arXiv:2405.09818
- Deitke et al. (AI2), Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (2024), arXiv:2409.17146 -- open-weight VLM family released with the PixMo dataset; the strongest open answer to GPT-4V at release
- OpenAI, GPT-4o System Card (2024), for the native multimodal design point
Last reviewed: April 26, 2026