
Should I use spot, on-demand, or reserved GPUs?

Updated June 2026

Use spot when checkpointing makes the job resilient to interruption — it is the cheapest, but recovery means downtime. Use on-demand while usage is unpredictable or the job cannot tolerate interruption. Once usage is predictable, "reserved" is really two products: a committed discount that lowers the bill, or a capacity reservation that holds GPUs for you.

The decision in three questions

  • Can the job restart from durable checkpoints? Training state, outputs, and the startup command are resumable.
      • Yes: Spot. The biggest discount. Test resume before a long run.
      • No, deadline-bound, or latency-sensitive: go to the next question.
  • Is steady GPU usage predictable? Enough matching GPU hours to justify a commitment.
      • Not yet: On-demand. Flexible and commitment-free, but still capacity-dependent.
      • Yes: go to the next question.
  • Do you need guaranteed capacity? Discounts and capacity assurance are different products.
      • No, discount only: Committed discount. Cuts the bill for matching usage; may not guarantee launch capacity.
      • Yes: Capacity reservation. Holds real capacity in a specific zone and VM type, billed whether you use it or not.
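
The three questions can also be read as a short function. The sketch below is only an illustration of the order in which the questions are asked; the `Workload` fields and the `choose_purchase_option` name are hypothetical and do not correspond to any provider or anycloud API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Illustrative field names, not part of any provider API.
    resumable: bool                  # restarts cleanly from durable checkpoints
    interruption_ok: bool            # no hard deadline or realtime latency requirement
    usage_predictable: bool          # enough steady GPU hours to justify a commitment
    needs_guaranteed_capacity: bool  # a failed launch is not acceptable


def choose_purchase_option(w: Workload) -> str:
    """Walk the three questions in order and return the first match."""
    if w.resumable and w.interruption_ok:
        return "spot"
    if not w.usage_predictable:
        return "on-demand"
    if w.needs_guaranteed_capacity:
        return "capacity reservation"
    return "committed discount"


# A restartable sweep with no deadline lands on spot.
print(choose_purchase_option(Workload(True, True, False, False)))  # spot
```

A job that is resumable but deadline-bound skips the first branch and is decided by the second and third questions.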

The four options

  • Spot — discounted capacity the provider can reclaim at any time. Best when checkpointing makes the workload resilient to losing the VM: restartable training, sweeps, batch inference, experiments. Not suitable for hard deadlines or realtime latency requirements — recovery means downtime while a replacement VM starts.
  • On-demand — full-price capacity with no commitment. Best for first runs, jobs without checkpointing, and latency-sensitive work while usage is still unpredictable. It does not guarantee capacity: a region can still be out of the GPU you want.
  • Committed discount — AWS Reserved Instances, Google Cloud committed use discounts, Azure Reserved VM Instances. A billing agreement: commit to sustained usage and matching consumption costs less. It holds no GPUs for you — a discounted launch can still fail on capacity.
  • Capacity reservation — AWS On-Demand Capacity Reservations, Google Cloud reservations, Azure capacity reservations. An inventory hold: the provider keeps specific capacity (zone, VM type) for you, billed whether you use it or not. It does not lower the price by itself; combine it with a discount for that. AWS zonal Reserved Instances are the exception that does both at once.
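
Spot only pays off if the job really does resume after a reclaim. The following is a minimal sketch of a checkpoint-and-resume loop, assuming the checkpoint path points at storage that outlives the VM. The function names, the JSON state format, and the `stop_after` parameter (which simulates a reclaim) are assumptions made for illustration; a real training job would save model and optimizer state with its framework's own tools.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then atomically rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a reclaim mid-write never leaves a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state on the first run."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def run(path: str, total_steps: int, checkpoint_every: int, stop_after=None) -> int:
    """Resume from the last checkpoint; stop_after simulates the VM being reclaimed."""
    step = load_checkpoint(path)["step"]
    while step < total_steps:
        if stop_after is not None and step >= stop_after:
            return step  # simulated reclaim: work since the last checkpoint is lost
        step += 1  # stand-in for one unit of real work
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, {"step": step})
    return step
```

Calling `run(path, 100, 10, stop_after=25)` and then `run(path, 100, 10)` resumes the second call from step 20, the last checkpoint written before the simulated reclaim. The five lost steps are the cost of the interruption, which is why the checkpoint interval matters.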

A practical rollout

  • Start with on-demand until the container, data path, and GPU fit are known.
  • Move restartable jobs to spot after checkpoint and resume have been tested.
  • Measure real GPU-hour usage for a few weeks before buying a commitment.
  • Separate discount commitments from capacity reservations in your cost model.
  • Re-check quota, live capacity, and provider terms before assuming a reservation applies to a new region or GPU type.
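
The measurement step feeds a simple break-even check. The sketch below assumes a deliberately simplified billing model: committed hours are billed at the discounted rate whether used or not, usage beyond the commitment is billed at the on-demand rate, and an unused capacity reservation is billed at the on-demand rate. Real provider terms differ, and all rates and discounts here are made-up inputs, not quoted prices.

```python
def on_demand_cost(used_hours: float, on_demand_rate: float) -> float:
    """Pay only for the hours actually used."""
    return used_hours * on_demand_rate


def committed_cost(used_hours: float, committed_hours: float,
                   on_demand_rate: float, discount: float) -> float:
    """Committed hours are billed in full at the discounted rate; overflow is on-demand."""
    committed_rate = on_demand_rate * (1.0 - discount)
    overflow = max(0.0, used_hours - committed_hours)
    return committed_hours * committed_rate + overflow * on_demand_rate


def break_even_hours(committed_hours: float, discount: float) -> float:
    """Usage below this many hours makes plain on-demand the cheaper choice."""
    return committed_hours * (1.0 - discount)


def idle_reservation_cost(reserved_hours: float, used_hours: float,
                          on_demand_rate: float) -> float:
    """Separate line item: reserved capacity that sat unused is still billed."""
    return max(0.0, reserved_hours - used_hours) * on_demand_rate
```

With a hypothetical rate of 2.00 per GPU hour, a 25% discount, and 720 committed hours, the break-even point is 540 hours: at 400 measured hours on-demand costs 800 against 1080 for the commitment, while at 600 hours the commitment wins at 1080 against 1200. Keeping `idle_reservation_cost` as its own function mirrors the advice to model discounts and capacity holds separately.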

Where anycloud fits

anycloud can run the same submitted workload on spot or on-demand targets depending on the job configuration. It can also choose among valid unpinned targets when credentials and region are left open.

For reserved or capacity-specific setups, be more explicit. Pin the provider, credential, region, or VM type when the reservation only applies there; otherwise anycloud may pick another valid target that does not use the commitment.
