Should I use spot, on-demand, or reserved GPUs?

Use spot when checkpointing makes the job resilient to interruption — it is the cheapest, but recovery means downtime. Use on-demand while usage is unpredictable or the job cannot tolerate interruption. Once usage is predictable, "reserved" is really two products: a committed discount that lowers the bill, or a capacity reservation that holds GPUs for you.

A rough price map

On-demand: 1.00x — flexible capacity, no commitment.
Capacity Blocks / reservations: ~0.35-0.75x — predictable GPU window, paid whether the GPUs are busy or idle.
Spot: ~0.10-0.50x when healthy — more resumable work per dollar, but interruptions can delay completion.

These are planning ranges, not guarantees. Actual prices vary by instance type, region, date, capacity, and operating system.

The decision in three questions

Question 1: Can the job resume and tolerate delayed completion? Checkpoints survive interruption, and finish time can move.

Yes: Spot. The biggest discount. Test resume before a long run.

No, deadline-bound, or latency-sensitive: go to question 2.

Question 2: Can you keep a reserved GPU window busy? Idle reserved time is still paid.

Not yet: On-demand. Flexible and commitment-free, but still capacity-dependent.

Yes: go to question 3.

Question 3: Do you need guaranteed capacity? Discounts and capacity assurance are different products.

No, discount only: Committed discount. Cuts the bill for matching usage; may not guarantee launch capacity.

Yes: Capacity reservation. Holds real capacity in a specific zone and VM type, billed whether you use it or not.
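The three questions above can be written as a small function. This is a minimal sketch: the function name and its boolean parameters are hypothetical, chosen only to mirror the questions, and the returned strings are labels rather than provider product names.

```python
def recommend_gpu_pricing(can_resume_and_wait: bool,
                          can_keep_window_busy: bool,
                          needs_guaranteed_capacity: bool) -> str:
    """Walk the three questions in order and return the first matching option."""
    # Question 1: resumable jobs with a flexible finish time go to spot.
    if can_resume_and_wait:
        return "spot"
    # Question 2: without steady usage, idle reserved time would still be paid.
    if not can_keep_window_busy:
        return "on-demand"
    # Question 3: discounts and capacity holds are different products.
    if needs_guaranteed_capacity:
        return "capacity reservation"
    return "committed discount"


if __name__ == "__main__":
    # A resumable sweep with no deadline.
    print(recommend_gpu_pricing(True, False, False))
    # A deadline-bound job with steady usage that must be able to launch.
    print(recommend_gpu_pricing(False, True, True))
```

The order matters: the later questions are only asked when the earlier answer rules out the cheaper option.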

What one batch workload showed

In one anonymized month-long GPU batch workload, we measured 3,771 jobs, 50,707 VM attempts, and 6,500 preempted VM attempts. Preemption touched 42.7% of all jobs.

There were two separate penalties:

Extra billable setup — repeated setup after interruption was about 0.74% of estimated compute cost, using recorded provider cost where available and hourly estimates elsewhere. About 76% of preempted jobs had no measured repeated setup; among the rest, the median extra setup was 9 minutes. This excludes any extra data-transfer or egress charges from repeated download or sync work.
Job latency — wall-clock waiting for a preempted job to get back to running. Across 537 recovery events, recovery added 5,234.1 wall-hours of waiting: median 37.1 minutes, average 9.75 hours, and p95 77.71 hours. This is user-visible delay, not necessarily extra billed compute time, and it can overlap across concurrent jobs.
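As a quick arithmetic check, the reported average can be recomputed from the totals above and compared with the median. The sketch below uses only the figures quoted in this section; the variable names are illustrative.

```python
# Figures copied from the measured workload described above.
VM_ATTEMPTS = 50_707
PREEMPTED_VM_ATTEMPTS = 6_500
RECOVERY_EVENTS = 537
RECOVERY_WALL_HOURS = 5_234.1
MEDIAN_RECOVERY_MINUTES = 37.1

# Share of individual VM attempts that were preempted (not the share of jobs).
preempted_attempt_share = PREEMPTED_VM_ATTEMPTS / VM_ATTEMPTS

# Mean wait per recovery event, derived from the total wall-hours.
mean_recovery_hours = RECOVERY_WALL_HOURS / RECOVERY_EVENTS

# How far the mean sits above the median; a large ratio indicates a long tail.
mean_to_median = (mean_recovery_hours * 60) / MEDIAN_RECOVERY_MINUTES

print(f"Preempted share of VM attempts: {preempted_attempt_share:.1%}")
print(f"Mean recovery wait: {mean_recovery_hours:.2f} hours")
print(f"Mean is about {mean_to_median:.0f}x the median")
```

The mean comes out at roughly 16 times the median, so a small number of very long recoveries account for most of the waiting. For planning, the median describes a typical recovery and the p95 describes the risk.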

Spot usually wins on price when jobs can absorb interruptions. Capacity Blocks trade flexibility for assured capacity in a paid window: they help when launch risk costs more than idle reserved time.

The four options

Spot — discounted capacity the provider can reclaim at any time. Best when checkpointing makes the workload resilient to losing the VM: resumable training, sweeps, batch inference, experiments. Not suitable for hard deadlines or realtime latency requirements — recovery means downtime while a replacement VM starts.
On-demand — full-price capacity with no commitment. Best for first runs, jobs without checkpointing, and latency-sensitive work while usage is still unpredictable. It does not guarantee capacity: a region can still be out of the GPU you want.
Committed discount — AWS Reserved Instances, Google Cloud committed use discounts, Azure Reserved VM Instances. A billing agreement: commit to sustained usage and matching consumption costs less. It holds no GPUs for you — a discounted launch can still fail on capacity.
Capacity reservation — AWS On-Demand Capacity Reservations, AWS Capacity Blocks, Google Cloud reservations, Azure capacity reservations. An inventory hold: the provider keeps specific capacity (zone, VM type, or block shape) for you, billed whether you use it or not. It does not lower the price by itself; combine it with a discount for that. AWS zonal Reserved Instances are the exception that does both at once.

A practical rollout

Start with on-demand until the container, data path, and GPU fit are known.
Move resumable jobs to spot after checkpoint and resume have been tested.
Measure real GPU-hour usage for a few weeks before buying a commitment.
Separate discount commitments from capacity reservations in your cost model.
Re-check quota, live capacity, and provider terms before assuming a reservation applies to a new region or GPU type.
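Testing checkpoint and resume before moving to spot can be as simple as interrupting a short run on purpose and confirming the second run finishes with the same result. The sketch below is one way to do that with the standard library only; the file layout, the state fields, and the `stop_after` parameter (which simulates a preemption) are all assumptions for illustration, and the summing loop stands in for real training work.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temp file and rename it, so an interruption mid-write
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(path, total_steps, stop_after=None):
    """Resume from the last checkpoint and work until total_steps is reached.

    stop_after simulates a preemption after that many steps in this call.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated preemption
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state


if __name__ == "__main__":
    checkpoint_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    print(run(checkpoint_path, 10, stop_after=3))  # interrupted run
    print(run(checkpoint_path, 10))                # resumed run
```

Checkpointing every step keeps the example short; a real job would usually checkpoint on an interval and write to storage that outlives the VM.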

Where anycloud fits

anycloud can run the same submitted workload on spot or on-demand targets depending on the job configuration. It can also choose among valid unpinned targets when credentials and region are left open.

For reserved or capacity-specific setups, be more explicit. Pin the provider, credential, region, or VM type when the reservation only applies there; otherwise anycloud may pick another valid target that does not use the commitment.

Related questions

When should I use spot GPUs?

Use spot GPUs for jobs that can resume from checkpoints, tolerate variable capacity, and have no hard deadline or realtime latency requirement. Recovery after an interruption means downtime while a replacement VM starts, so latency-sensitive serving does not fit spot even with checkpointing.

When should I use on-demand GPUs?

Use on-demand GPUs when you need flexibility, do not want a long-term commitment, are still learning the workload shape, or cannot tolerate spot interruption because of deadlines or realtime latency.

Does reserved GPU capacity always guarantee availability?

No. Discount products and capacity products are different. For example, AWS Reserved Instances are billing discounts (only the zonal kind also reserves capacity), while AWS Capacity Blocks and On-Demand Capacity Reservations reserve capacity. Azure Reserved VM Instances offer prioritized capacity but do not guarantee capacity in every placement scenario.

Are reserved GPUs always cheaper?

Only when the committed usage actually happens. A discount commitment can cost more than on-demand if the team does not use enough matching GPU hours.
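A rough break-even check makes this concrete. The sketch below assumes a simplified model in which the commitment is billed for every committed hour at a fixed fraction of the on-demand rate, and usage never exceeds the commitment. The hourly rate, the 0.60x multiplier, and the 720-hour month are hypothetical numbers, not provider prices.

```python
def committed_cost(on_demand_rate, committed_multiplier, committed_hours):
    """Cost of the commitment: every committed hour is billed, used or not."""
    return on_demand_rate * committed_multiplier * committed_hours


def on_demand_cost(on_demand_rate, used_hours):
    """Cost of paying on-demand only for the hours actually used."""
    return on_demand_rate * used_hours


def commitment_is_cheaper(on_demand_rate, committed_multiplier,
                          committed_hours, used_hours):
    """True when the commitment costs less than on-demand for the same usage."""
    return (committed_cost(on_demand_rate, committed_multiplier, committed_hours)
            < on_demand_cost(on_demand_rate, used_hours))


if __name__ == "__main__":
    rate = 2.00        # hypothetical on-demand price per GPU-hour
    multiplier = 0.60  # hypothetical committed price as a fraction of on-demand
    committed = 720    # hypothetical commitment: one GPU for a 30-day month

    for used in (360, 600):
        cheaper = commitment_is_cheaper(rate, multiplier, committed, used)
        print(f"{used} used hours of {committed}: commitment cheaper = {cheaper}")
```

Under this model the break-even utilization equals the multiplier: a 0.60x commitment only pays off once more than 60% of the committed hours are actually used.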

Is spot still cheaper after interruptions?

Yes, when the workload is checkpointed, resumable, and can tolerate slower completion. As a planning heuristic, if on-demand is 1.00x, Capacity Blocks and reservations are often about 0.35-0.75x, while Spot is often about 0.10-0.50x when capacity is healthy. In the measured workload, preemption touched 42.7% of jobs, but repeated setup was still small: about 0.74% of estimated compute cost. About 76% of preempted jobs had no measured repeated setup; among the rest, the median extra setup was 9 minutes. This excludes any extra data-transfer or egress charges from repeated download or sync work. The bigger tradeoff was recovery latency and finish-time uncertainty.
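To see why the repeated setup barely moves the price comparison, the sketch below adds the measured 0.74% overhead on top of the planning multipliers. Treating the overhead as a simple surcharge on spot compute cost is an assumption made for illustration, and the function name is hypothetical. Recovery latency and data-transfer charges are not modeled.

```python
SETUP_OVERHEAD = 0.0074  # repeated setup, about 0.74% of estimated compute cost


def effective_multiplier(price_multiplier, setup_overhead_fraction):
    """Price multiplier after adding repeated-setup cost as a surcharge."""
    return price_multiplier * (1.0 + setup_overhead_fraction)


if __name__ == "__main__":
    # The low and high ends of the spot planning range quoted above.
    for spot_multiplier in (0.10, 0.50):
        adjusted = effective_multiplier(spot_multiplier, SETUP_OVERHEAD)
        print(f"spot at {spot_multiplier:.2f}x -> {adjusted:.4f}x with repeated setup")
```

Even at the expensive end of the spot range, the adjustment changes the multiplier in the third decimal place, which is why finish-time uncertainty, not billed setup, dominates the decision.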

Can anycloud choose between spot and on-demand?

Yes. Set spot for interruptible runs and leave spot off for on-demand runs. Reserved or capacity-reserved usage may require pinning the credential, region, VM type, or provider setup that owns the reservation.