Can you train on spot GPUs without losing progress?

Updated June 2026

Yes, if the training job can restart from durable checkpoints. Spot and preemptible GPUs can be reclaimed by the cloud provider, so anything held only in process memory can disappear with the VM. The safe pattern is to write enough state to durable storage that losing the VM does not mean losing the run.

What spot interruption really means

Spot capacity is discounted because the provider can reclaim it. AWS Spot Instances can be stopped, hibernated, or terminated, and AWS issues a two-minute interruption notice, on a best-effort basis, before stop or terminate events. Google Cloud Spot VMs can be preempted when capacity is needed. Azure Spot VMs can be evicted when Azure needs capacity or when the current price exceeds a configured max price.

The important part: the cloud does not preserve your Python process. A training run resumes only if your code or framework writes restartable state and reads it on startup.

What to save

For a serious training job, save more than model weights. A useful checkpoint usually includes:

  • Model weights
  • Optimizer state
  • Learning-rate scheduler state
  • Global step or epoch
  • Random number generator state and data sampler position when reproducibility matters
  • Mixed-precision gradient scaler state, if the run uses loss scaling
  • Configuration needed to validate the checkpoint against the run

PyTorch, TensorFlow, Lightning, and Hugging Face Trainer all support checkpoint-based resume patterns. The exact API differs, but the infrastructure rule is the same: write checkpoint files somewhere that survives VM loss.
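
As a framework-agnostic illustration of the list above, the sketch below bundles those fields into one payload and refuses to resume when the configuration does not match. It uses only the Python standard library; the function names (`config_fingerprint`, `make_checkpoint`, `load_checkpoint`) and the field names are hypothetical, and the state arguments stand in for whatever your framework exports.

```python
import hashlib
import json
import pickle
import random


def config_fingerprint(config: dict) -> str:
    """Stable hash of the run configuration, used to reject mismatched checkpoints."""
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def make_checkpoint(global_step, model_state, optimizer_state,
                    scheduler_state, scaler_state, config) -> bytes:
    """Serialize everything needed to continue the run, not just the weights."""
    payload = {
        "global_step": global_step,
        "model_state": model_state,          # stand-in for framework weights
        "optimizer_state": optimizer_state,  # stand-in for optimizer buffers
        "scheduler_state": scheduler_state,  # stand-in for LR scheduler state
        "scaler_state": scaler_state,        # None if no loss scaling is used
        "rng_state": random.getstate(),      # Python RNG only, in this sketch
        "config_hash": config_fingerprint(config),
    }
    return pickle.dumps(payload)


def load_checkpoint(blob: bytes, config: dict) -> dict:
    """Validate the checkpoint against the current run, then restore RNG state."""
    payload = pickle.loads(blob)  # only unpickle files your own job wrote
    if payload["config_hash"] != config_fingerprint(config):
        raise ValueError("checkpoint was written with a different configuration")
    random.setstate(payload["rng_state"])
    return payload
```

In a PyTorch job, the state arguments would typically come from the `state_dict()` methods of the model, optimizer, scheduler, and gradient scaler, and the framework's own RNG state would need to be captured alongside Python's.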

How often to checkpoint

Checkpoint frequency is a cost tradeoff. Frequent checkpoints reduce lost work but add I/O and storage overhead. Large checkpoints can also take long enough to write that they interfere with training.
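
A rough way to reason about this tradeoff is sketched below. It assumes checkpoint writes are synchronous (training pauses while the file is written) and that an interruption is equally likely at any moment; both are simplifications, and the function name `checkpoint_tradeoff` is hypothetical.

```python
def checkpoint_tradeoff(interval_s: float, write_s: float) -> dict:
    """Estimate the cost of checkpointing every `interval_s` seconds of training.

    interval_s: training time between two checkpoints
    write_s:    time one checkpoint write blocks training (assumed synchronous)
    """
    if interval_s <= 0 or write_s < 0:
        raise ValueError("interval_s must be positive and write_s non-negative")
    return {
        # Share of wall-clock time spent writing checkpoints instead of training.
        "overhead_fraction": write_s / (interval_s + write_s),
        # Under a uniform interruption time, about half an interval is lost on average.
        "expected_lost_s": interval_s / 2,
        # An interruption just before a save completes loses the whole interval.
        "worst_case_lost_s": interval_s,
    }
```

For example, a checkpoint every 10 minutes that takes 30 seconds to write costs just under 5% of wall-clock time and loses about 5 minutes of work per interruption on average.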

For spot training, aim to lose minutes, not hours. Write atomically when possible: save to a temporary path, flush the file, then rename it into place so a restart does not read a half-written checkpoint.
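
The write-then-rename pattern can be sketched with the Python standard library as follows. The helper name `atomic_write_bytes` is hypothetical, and the sketch assumes a local POSIX-style filesystem where a rename within one directory is atomic; network or object-store-backed mounts may behave differently.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(final_path: Path, payload: bytes) -> None:
    """Write payload so readers see either the old file or the complete new one."""
    final_path = Path(final_path)
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=final_path.parent, prefix=final_path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push it to the device
        os.replace(tmp_name, final_path)  # atomically swap into place
    except BaseException:
        # Leave no half-written temp file behind if anything failed.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

Because the temporary name ends in `.tmp`, startup code that looks only for final checkpoint names will never pick up a partial file.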

A restart checklist

  • The container can start from scratch with the same command.
  • Startup code checks for an existing checkpoint before training begins.
  • Checkpoints are written to durable storage, not only the VM boot disk.
  • Outputs are separated from checkpoints so partial artifacts do not look final.
  • Provider-specific interruption hooks, shutdown scripts, or scheduled events save a final checkpoint when advance notice is available.
  • You have tested resume by killing a small job and restarting it.
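
Two items from this checklist, checking for a checkpoint on startup and saving a final one when notice arrives, can be sketched as below. This is a minimal illustration, not a definitive implementation: it assumes the shutdown reaches the training process as SIGTERM, which depends on the provider and container runtime, and the names `latest_checkpoint`, `StopFlag`, `train`, the `save_fn` callback, and the `step_N.ckpt` file naming are all hypothetical.

```python
import re
import signal
from pathlib import Path

_STEP_FILE = re.compile(r"^step_(\d+)\.ckpt$")


def latest_checkpoint(ckpt_dir: Path):
    """Return (step, path) for the highest-numbered checkpoint, or None."""
    best = None
    if ckpt_dir.is_dir():
        for path in ckpt_dir.iterdir():
            match = _STEP_FILE.match(path.name)  # ignores temp and unrelated files
            if match and (best is None or int(match.group(1)) > best[0]):
                best = (int(match.group(1)), path)
    return best


class StopFlag:
    """Records that a shutdown was requested so the loop can save and exit."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle)

    def handle(self, signum, frame):
        self.requested = True  # keep the handler tiny; save from the loop instead


def train(ckpt_dir: Path, total_steps: int, save_every: int, stop: StopFlag, save_fn):
    """Resume from the newest checkpoint if one exists, then train to total_steps."""
    found = latest_checkpoint(ckpt_dir)
    step = found[0] if found else 0  # a real job would also load state from found[1]
    while step < total_steps:
        step += 1  # stand-in for one real training step
        if step % save_every == 0 or step == total_steps or stop.requested:
            save_fn(ckpt_dir / f"step_{step}.ckpt", step)
        if stop.requested:
            break  # final checkpoint is written; exit before the VM goes away
    return step
```

Killing a small job partway through and running the same command again, as the last checklist item suggests, is the quickest way to confirm that the second run starts from the saved step rather than from zero.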

Where anycloud fits

anycloud's spot recovery path gives jobs a checkpoint folder at /mnt/checkpoint. Your training code writes checkpoint files there; anycloud uploads checkpoint changes and restores them before restarting the container after preemption.

anycloud moves files and restarts the job. It does not invent framework checkpoints for you.
