Modal: Serverless GPU Platform

Reference for Modal Labs: function decorators, container builds, deployment patterns, GPU types, and how it compares to RunPod, Replicate, Beam, and Lambda Labs.

What It Is

Modal is a serverless cloud platform for Python, founded by Erik Bernhardsson (formerly of Spotify) in 2021. Instead of provisioning servers or Kubernetes pods, the user writes Python functions decorated with @app.function(...) and Modal builds the container, schedules a worker, and runs the function. The platform spans CPU jobs, GPU inference, batch processing, and HTTPS endpoints behind one SDK.

The execution model is function-first: every unit of work is a Python function. Functions can be called like local functions (f.remote(x)), spawned async (f.spawn(x)), mapped over an iterable (f.map(args)), or exposed as a web endpoint (@modal.web_endpoint). A collection of related functions plus shared resources (volumes, secrets, scheduled triggers) is an app.
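The call styles above can be sketched as follows. This is a minimal illustration assuming the current modal SDK; the app name and the square function are made-up examples, and actually running it requires a Modal account.

```python
import modal

app = modal.App("example-app")  # an app groups functions plus shared resources

@app.function()  # default CPU image; the decorator makes this remotely callable
def square(x: int) -> int:
    return x * x

@app.local_entrypoint()
def main():
    # The four invocation styles described above:
    print(square.remote(4))             # synchronous remote call
    handle = square.spawn(5)            # async fire-and-forget; returns a handle
    print(list(square.map(range(10))))  # fan out one worker per input
    # A web endpoint would instead add a web-serving decorator to the function.
```

Invoked locally with modal run, the entrypoint runs on your machine while each decorated call executes in a Modal container.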

Container images are built declaratively in Python: modal.Image.debian_slim().pip_install("torch", "transformers").apt_install("git"). The image spec is hashed, so identical specs reuse cached layers. For more control, Image.from_dockerfile(...) ingests a normal Dockerfile.
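As a concrete sketch of the declarative builder (the package list, Python version, and Dockerfile path are illustrative, not prescriptions):

```python
import modal

# The method chain below is hashed as a spec; an unchanged spec
# reuses cached layers on later builds.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install("torch", "transformers")
)

# Escape hatch for an existing build (path is illustrative):
# image = modal.Image.from_dockerfile("./Dockerfile")

app = modal.App("image-demo", image=image)
```

Every function registered on this app then runs inside the declared image unless it overrides it.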

The category is "serverless GPU." Direct competitors:

  • RunPod: cheaper raw GPU rental, less Pythonic, exposes Docker containers and a serverless layer; better for users who already have container infrastructure.
  • Replicate: deploy-a-model service built around the Cog packaging format; better for shipping a single model behind an API, weaker for general Python jobs.
  • Beam Cloud: similar function-first model to Modal, smaller team and ecosystem.
  • Lambda Labs: traditional GPU rental (1-click instances or reserved clusters), no serverless layer; best for long-running training, not bursty inference.
  • Hyperbolic, Fly.io GPUs, Cloudflare Workers AI: smaller niches.

When You'd Use It

Use Modal for inference endpoints with bursty traffic (the platform spins workers up and down so you do not pay for idle GPUs), batch jobs that need 100-10000 parallel workers (the .map API plus per-second billing makes this trivial), and quick experiments where setting up RunPod or AWS feels heavyweight.

Anti-patterns: do not run multi-day distributed training jobs on Modal; the per-second billing is convenient but ends up more expensive than a reserved Lambda Labs or CoreWeave cluster for sustained workloads. Do not use Modal for state-heavy services that need a long-lived database connection; the serverless model fits stateless functions best.

Deployment patterns: modal run for one-off invocations, modal deploy to create a persistent app with a stable URL, and @app.function(schedule=modal.Cron("0 9 * * *")) for cron-style triggers. Volumes (modal.Volume) and network file systems (modal.NetworkFileSystem) provide persistent storage between invocations.
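A deploy-ready sketch that combines a cron trigger with a volume; the app name, volume name, and file path are made up, and running it assumes a Modal account plus modal deploy:

```python
import modal

app = modal.App("nightly-report")
vol = modal.Volume.from_name("report-cache", create_if_missing=True)

@app.function(
    schedule=modal.Cron("0 9 * * *"),  # fires daily at 09:00 UTC
    volumes={"/cache": vol},           # mount persistent storage across runs
)
def daily_job():
    # Anything written under /cache survives between invocations
    with open("/cache/last_run.txt", "w") as f:
        f.write("ok")
    vol.commit()  # flush writes back to the shared volume
```

After modal deploy, the schedule fires without any machine of yours being online.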

GPU types as of 2026: T4, L4, A10G, A100 (40 GB and 80 GB), H100, H200, and B200 on a waitlist. Cold start latency for an unloaded container ranges from ~5 seconds (small CPU image) to 30+ seconds (PyTorch + 10 GB model weights). Setting keep_warm=N keeps N containers always-on at full per-second cost; this is the single most common bill surprise.

Definition

Serverless GPU Function

A serverless GPU function is a callable unit whose image, accelerator request, secrets, timeout, concurrency policy, and endpoint exposure are declared in code while the platform schedules the underlying worker on demand.

Proposition

Burstiness Fit Principle

Statement

Serverless GPU platforms fit workloads where utilization arrives in bursts and the idle cost of reserved accelerators would dominate.

Intuition

The platform trades some startup latency and abstraction constraints for the ability to scale down to zero. That is valuable for demos, batch bursts, and low-duty-cycle inference.

Failure Mode

Long training jobs, always-on services, large warm pools, and stateful workloads can cost more or become more fragile than a reserved instance or cluster.

Exercise (Core)

Problem

A model endpoint receives traffic for ten minutes after a newsletter goes out, then sits idle for two days. Should you start by evaluating serverless GPU or a reserved GPU VM?
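One way to work the exercise is to compute the duty cycle and compare per-cycle cost under each model. The hourly rates below are made-up assumptions for illustration, not Modal or vendor quotes:

```python
# Illustrative duty-cycle comparison; both hourly prices are assumptions.
ACTIVE_MINUTES = 10
CYCLE_HOURS = 48            # ten busy minutes, then two idle days
RESERVED_PER_HOUR = 2.00    # hypothetical reserved-GPU rate
SERVERLESS_PER_HOUR = 4.00  # hypothetical serverless rate (premium over reserved)

duty_cycle = (ACTIVE_MINUTES / 60) / CYCLE_HOURS
reserved_cost = RESERVED_PER_HOUR * CYCLE_HOURS  # pays for idle time too
serverless_cost = SERVERLESS_PER_HOUR * (ACTIVE_MINUTES / 60)

print(f"duty cycle: {duty_cycle:.3%}")       # ~0.347%
print(f"reserved:   ${reserved_cost:.2f}")   # $96.00 per cycle
print(f"serverless: ${serverless_cost:.2f}") # $0.67 per cycle
```

With utilization this low, scale-to-zero wins by two orders of magnitude even at a steep per-hour premium, which is the Burstiness Fit Principle in numbers.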

Notable Gotchas

Watch Out

keep_warm pools bill 24x7

A function with keep_warm=4 on an A100 80 GB runs four GPUs around the clock at full price. Forgetting to remove or downscale a warm pool after a launch can multiply a monthly bill by 10x. Audit keep_warm settings before any deploy and prefer min_containers=0 with longer cold starts for low-traffic apps.
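The arithmetic behind the surprise is worth writing out; the hourly rate below is a placeholder assumption, not Modal's published A100 price:

```python
# Rough warm-pool bill estimate; HOURLY_RATE is a hypothetical placeholder.
KEEP_WARM = 4            # containers held always-on
HOURLY_RATE = 2.50       # $/GPU-hour, illustrative only
HOURS_PER_MONTH = 24 * 30

monthly = KEEP_WARM * HOURLY_RATE * HOURS_PER_MONTH
print(f"warm pool: ${monthly:,.2f}/month")  # $7,200.00/month at these assumptions
```

A pool that exists only for launch day but is never scaled back keeps accruing this every month, which is why auditing warm-pool settings before deploys matters.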

Watch Out

Container builds are not Dockerfiles

Modal's Image API looks similar to Dockerfile syntax but is not Docker. Layer caching, build context, and base-image selection are all handled by Modal's builder. A Dockerfile that works locally may produce a different image when wrapped via Image.from_dockerfile. For maximum portability, write the image as a .py spec, not a Dockerfile.


Last reviewed: April 18, 2026
