Use spot when checkpointing makes the job resilient to interruption; it is the cheapest option, but recovery means downtime. Use on-demand while usage is unpredictable or the job cannot tolerate interruption. Once usage is predictable, "reserved" is really two products: a committed discount that lowers the bill, or a capacity reservation that holds GPUs for you.
The decision in three questions
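One way to read the summary as three yes/no questions is sketched below in Python. The function name, the parameter names, and the order of the questions are assumptions made for illustration; the returned labels are the four options this page describes.

```python
def choose_pricing(tolerates_interruption: bool,
                   usage_is_predictable: bool,
                   needs_held_capacity: bool) -> str:
    """Map three yes/no answers to one of the four purchasing options."""
    # Question 1: does checkpointing make the job resilient to losing its VM?
    if tolerates_interruption:
        return "spot"
    # Question 2: is usage steady enough to commit to?
    if not usage_is_predictable:
        return "on-demand"
    # Question 3: is the goal held GPUs rather than a lower bill?
    if needs_held_capacity:
        return "capacity reservation"
    return "committed discount"


if __name__ == "__main__":
    print(choose_pricing(True, False, False))
    print(choose_pricing(False, True, True))
```

The sketch returns a single answer for simplicity. In practice a capacity reservation and a committed discount can be combined, so the last two branches are not mutually exclusive.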
The four options
- Spot: discounted capacity the provider can reclaim at any time. Best when checkpointing makes the workload resilient to losing the VM: restartable training, sweeps, batch inference, experiments. Not suitable for hard deadlines or real-time latency requirements, because recovery means downtime while a replacement VM starts.
- On-demand: full-price capacity with no commitment. Best for first runs, jobs without checkpointing, and latency-sensitive work while usage is still unpredictable. It does not guarantee capacity: a region can still be out of the GPU you want.
- Committed discount: AWS Reserved Instances, Google Cloud committed use discounts, Azure Reserved VM Instances. A billing agreement: you commit to a sustained level of usage, and consumption that matches the commitment is billed at a lower rate. It holds no GPUs for you, so a discounted launch can still fail on capacity.
- Capacity reservation: AWS On-Demand Capacity Reservations, Google Cloud reservations, Azure capacity reservations. An inventory hold: the provider keeps specific capacity (zone, VM type) for you, billed whether you use it or not. It does not lower the price by itself; combine it with a discount for that. AWS zonal Reserved Instances are the exception: they do both at once.
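The spot option depends on the job being able to resume from a checkpoint. The following is a minimal sketch of that pattern, not a production implementation. It assumes the state fits in a small JSON file, uses a local path as a stand-in for storage that survives the VM, and simulates a reclaim with a hypothetical `stop_after` parameter. All function names are illustrative.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state atomically so an interruption cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic when both paths are on one filesystem


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path: str, total_steps: int, stop_after: Optional[int] = None) -> dict:
    """Resume from the checkpoint and work until done or 'reclaimed'."""
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated spot reclaim
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    print(run(ckpt, 10, stop_after=3))  # interrupted after three steps
    print(run(ckpt, 10))                # a replacement VM picks up at step 3
```

Running the interrupted job and then the resumed job should produce the same final state as one uninterrupted run, which is the property worth testing before moving a workload to spot.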
A practical rollout
- Start with on-demand until the container, data path, and GPU fit are known.
- Move restartable jobs to spot after checkpoint and resume have been tested.
- Measure real GPU-hour usage for a few weeks before buying a commitment.
- Separate discount commitments from capacity reservations in your cost model.
- Re-check quota, live capacity, and provider terms before assuming a reservation applies to a new region or GPU type.
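Keeping the two kinds of "reserved" apart in a cost model can be sketched as two independent calculations. The version below is a simplified model: it assumes the commitment is paid in full whether or not it is used, that usage above the commitment is billed at the on-demand rate, and that an unused capacity reservation is billed at a single hourly rate. All rates and hour counts are hypothetical placeholders, not provider prices.

```python
def discount_commitment_cost(used_hours: float, committed_hours: float,
                             committed_rate: float, on_demand_rate: float) -> float:
    """Bill under a discount commitment: the commitment plus any overflow."""
    overflow = max(0.0, used_hours - committed_hours)
    return committed_hours * committed_rate + overflow * on_demand_rate


def break_even_hours(committed_hours: float, committed_rate: float,
                     on_demand_rate: float) -> float:
    """Usage below this many hours makes plain on-demand cheaper."""
    return committed_hours * committed_rate / on_demand_rate


def idle_reservation_cost(used_hours: float, reserved_hours: float,
                          hourly_rate: float) -> float:
    """Cost of capacity that was held but not used."""
    return max(0.0, reserved_hours - used_hours) * hourly_rate


if __name__ == "__main__":
    # Hypothetical figures: 720 committed GPU-hours, 500 measured GPU-hours.
    used, committed = 500.0, 720.0
    od_rate, commit_rate = 4.0, 2.5  # $/GPU-hour, made up for illustration
    print("on-demand:", used * od_rate)
    print("with commitment:",
          discount_commitment_cost(used, committed, commit_rate, od_rate))
    print("break-even hours:", break_even_hours(committed, commit_rate, od_rate))
    print("idle reservation:", idle_reservation_cost(used, committed, od_rate))
```

Feeding measured GPU-hours into `break_even_hours` shows why the measurement step comes before the purchase: a commitment sized above real usage can cost more than staying on-demand.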
Where anycloud fits
anycloud can run the same submitted workload on spot or on-demand targets depending on the job configuration. It can also choose among valid unpinned targets when credentials and region are left open.
For reserved or capacity-specific setups, be more explicit. Pin the provider, credential, region, or VM type when the reservation only applies there; otherwise anycloud may pick another valid target that does not use the commitment.
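A pre-submit check can catch a job that leaves open a field the reservation depends on. The sketch below uses plain dictionaries with hypothetical field names (`provider`, `region`, `vm_type`); it is not anycloud's configuration format or API, only an illustration of the comparison.

```python
from typing import Dict, List, Optional


def unpinned_fields(job: Dict[str, Optional[str]],
                    reservation_scope: Dict[str, str]) -> List[str]:
    """List the scope fields that the job leaves open or pins elsewhere."""
    problems = []
    for field, expected in reservation_scope.items():
        actual = job.get(field)
        if actual is None:
            problems.append(f"{field}: not pinned (reservation needs {expected!r})")
        elif actual != expected:
            problems.append(
                f"{field}: pinned to {actual!r}, reservation needs {expected!r}")
    return problems


if __name__ == "__main__":
    # Hypothetical reservation scope and job description.
    scope = {"provider": "aws", "region": "us-east-1", "vm_type": "p4d.24xlarge"}
    job = {"provider": "aws", "vm_type": "p4d.24xlarge"}  # region left open
    for problem in unpinned_fields(job, scope):
        print(problem)
```

An empty result means every field the reservation is scoped to is pinned to the matching value; anything else is a job that could land on a target the commitment does not cover.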