Do spot GPUs keep running until my job finishes?

No. The cloud provider can reclaim spot and preemptible VMs at any time. Treat them as temporary compute and design the training job to resume from checkpoint files.

Can a checkpoint resume optimizer state?

Yes, if you save it. Frameworks such as PyTorch and TensorFlow can save more than model weights, including optimizer state and training step metadata.

How much progress can be lost after interruption?

You lose whatever work was done after the last durable checkpoint. If checkpoints reach durable storage every few minutes, the rerun window is usually a few minutes rather than the whole job.

Are spot GPUs safe for production training?

They can be, if the job is checkpointed and resumable and the business can tolerate delayed completion. For non-resumable jobs or fixed deadlines, on-demand or reserved capacity may be safer.

Does anycloud automatically checkpoint my model?

No. anycloud syncs and restores checkpoint files, but your code or ML framework still needs to write those files.

Can you train on spot GPUs without losing progress?

Yes, if the training job can resume from durable checkpoints. Spot and preemptible GPUs can be reclaimed by the cloud provider, so the process in memory can disappear. The safe pattern is to write enough state that losing the VM does not mean losing the run.

What spot interruption really means

Spot capacity is discounted because the provider can reclaim it. AWS Spot Instances can be stopped, hibernated, or terminated, and AWS provides an interruption notice before many stop or terminate events. Google Cloud Spot VMs can be preempted when capacity is needed. Azure Spot VMs can be evicted when Azure needs capacity or when the current price exceeds a configured max price.

The important part: the cloud does not preserve your Python process. A training run resumes only if your code or framework writes resumable state and reads it on startup.
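The "reads it on startup" half is the part most often skipped. As a minimal sketch, assuming a hypothetical naming scheme of zero-padded `step-XXXXXXXX.json` files and a toy loop in place of real training, startup could look like this. The function names and file layout are illustrative, not part of any framework, and the write is deliberately naive (a real job would write atomically).

```python
import json
from pathlib import Path
from typing import Optional


def load_latest_checkpoint(ckpt_dir: Path) -> Optional[dict]:
    """Return the newest checkpoint in ckpt_dir, or None on a fresh start."""
    # Hypothetical naming scheme: step-00001200.json, zero-padded so that
    # lexicographic order matches step order.
    candidates = sorted(ckpt_dir.glob("step-*.json"))
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text())


def train(ckpt_dir: Path, total_steps: int, save_every: int) -> int:
    """Toy loop: resume from the newest checkpoint, then keep counting steps."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    state = load_latest_checkpoint(ckpt_dir) or {"step": 0}
    step = state["step"]
    while step < total_steps:
        step += 1  # stand-in for one real training step
        if step % save_every == 0:
            path = ckpt_dir / f"step-{step:08d}.json"
            path.write_text(json.dumps({"step": step}))
    return step
```

Because the same `train(...)` call handles both a fresh start and a restart, the container can be relaunched with an unchanged command after the VM is lost.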

What to save

For a serious training job, save more than model weights. A useful checkpoint usually includes:

Model weights
Optimizer state
Learning-rate scheduler state
Global step or epoch
Random seeds or sampler state when reproducibility matters
Mixed-precision scaler state
Configuration needed to validate the checkpoint against the run

PyTorch, TensorFlow, Lightning, and Hugging Face Trainer all support checkpoint-based resume patterns. The exact API differs, but the infrastructure rule is the same: write checkpoint files somewhere that survives VM loss.
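A framework-agnostic sketch of such a checkpoint payload follows. It assumes the framework-specific pieces arrive as an opaque `framework_state` dictionary (in PyTorch these would typically come from the `state_dict()` methods of the model, optimizer, scheduler, and gradient scaler). The helper names and the fingerprint field are hypothetical, chosen only to illustrate validating a checkpoint against the run.

```python
import hashlib
import json
import random


def config_fingerprint(config: dict) -> str:
    """Stable hash of the run configuration (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def build_checkpoint(step: int, framework_state: dict, config: dict) -> dict:
    """Bundle everything needed to resume, not just the weights."""
    return {
        "step": step,
        # Placeholder for model, optimizer, scheduler, and mixed-precision
        # scaler state produced by your framework.
        "framework_state": framework_state,
        # Python's RNG state is a tuple: fine for pickle, not for JSON.
        "python_rng_state": random.getstate(),
        "config_fingerprint": config_fingerprint(config),
    }


def validate_checkpoint(checkpoint: dict, config: dict) -> None:
    """Refuse to resume when the checkpoint came from a different configuration."""
    if checkpoint["config_fingerprint"] != config_fingerprint(config):
        raise ValueError("checkpoint does not match the current run configuration")
```

Calling `validate_checkpoint` before loading weights turns a silent mismatch (for example, a changed batch size) into an explicit error at startup.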

How often to checkpoint

Checkpoint frequency is a cost tradeoff. Frequent checkpoints reduce lost work but add I/O and storage overhead. Large checkpoints can also take long enough to write that they interfere with training.
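The tradeoff can be put into rough numbers. The sketch below assumes blocking checkpoint writes and interruptions that land uniformly at random within a checkpoint interval. The third function is Young's classical first-order approximation for the cost-minimizing interval, included as a starting point rather than a rule. All function names are illustrative.

```python
import math


def checkpoint_overhead(interval_s: float, write_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints (blocking writes)."""
    return write_s / (interval_s + write_s)


def expected_rerun_s(interval_s: float) -> float:
    """Average work redone per interruption, assuming uniform interruption timing."""
    return interval_s / 2.0


def young_interval_s(write_s: float, mean_time_between_interruptions_s: float) -> float:
    """Young's approximation: sqrt(2 * write cost * mean time between interruptions)."""
    return math.sqrt(2.0 * write_s * mean_time_between_interruptions_s)
```

For example, with a 30-second checkpoint write and an interruption roughly every 4 hours, `young_interval_s(30, 4 * 3600)` suggests an interval of about 15 minutes. A 10-minute interval with the same write time costs just under 5% overhead and an average rerun of 5 minutes per interruption.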

For spot training, aim to lose minutes, not hours. Write atomically when possible: save to a temporary path, flush the file, then rename it into place so a restart does not read a half-written checkpoint.
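The temporary-path, flush, rename sequence can be sketched with the Python standard library. The function name is hypothetical, and the payload is assumed to be already-serialized bytes. The sketch relies on `os.replace` being atomic, which holds on local POSIX filesystems; network or FUSE mounts may behave differently, so treat that as an assumption to verify for your storage.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(final_path: Path, payload: bytes) -> None:
    """Write payload so readers see either the old file or the complete new one."""
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(final_path.parent), prefix=final_path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, final_path)  # swap the complete file into place
    except BaseException:
        # Remove the partial temp file; the previous checkpoint is untouched.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

If the VM disappears mid-write, the worst case is a stray `.tmp` file next to the last complete checkpoint, which startup code can ignore.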

A resume checklist

The container can start from scratch with the same command.
Startup code checks for an existing checkpoint before training begins.
Checkpoints are written to durable storage, not only the VM boot disk.
Outputs are separated from checkpoints so partial artifacts do not look final.
Provider-specific interruption hooks, shutdown scripts, or scheduled events save a final checkpoint when advance notice is available.
You have tested resume by killing a small job and restarting it.
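The final-checkpoint item on this list can be sketched as a signal handler. This assumes the advance notice reaches the training process as `SIGTERM` (as happens when a container is stopped). Some provider notices arrive through other channels and would need their own hook that sets the same flag. The names `request_stop`, `install_stop_handler`, and `run` are illustrative.

```python
import signal
import threading

stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only set a flag; the training loop does the actual save."""
    stop_requested.set()


def install_stop_handler() -> None:
    """Register the handler; call this once from the main thread at startup."""
    signal.signal(signal.SIGTERM, request_stop)


def run(total_steps: int, save_checkpoint) -> int:
    """Toy loop that saves a final checkpoint and exits once a stop is requested."""
    step = 0
    while step < total_steps:
        step += 1  # stand-in for one real training step
        if stop_requested.is_set():
            save_checkpoint(step)
            break
    return step
```

Keeping the handler to a single flag avoids doing file I/O inside a signal handler. The save happens at a step boundary, where the training state is consistent.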

Where anycloud fits

anycloud's spot recovery path gives jobs a checkpoint folder at /mnt/checkpoint. Your training code writes checkpoint files there; anycloud uploads checkpoint changes and restores them before restarting the container after preemption.

anycloud moves files and restarts the job. It does not invent framework checkpoints for you.
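On the training-code side, the only anycloud-specific detail is the path. A minimal sketch, assuming you also want to run the same script off-platform for testing: the fallback directory and the function name are hypothetical and not part of anycloud.

```python
import os
from pathlib import Path

ANYCLOUD_CHECKPOINT_DIR = Path("/mnt/checkpoint")  # path named in the text above


def resolve_checkpoint_dir(local_fallback: Path) -> Path:
    """Use /mnt/checkpoint when it exists and is writable, else a local directory."""
    if ANYCLOUD_CHECKPOINT_DIR.is_dir() and os.access(ANYCLOUD_CHECKPOINT_DIR, os.W_OK):
        return ANYCLOUD_CHECKPOINT_DIR
    # Fallback intended for local testing only; it is not durable storage.
    local_fallback.mkdir(parents=True, exist_ok=True)
    return local_fallback
```

On a real spot run it may be safer to fail loudly when the folder is missing than to fall back, since a local directory disappears with the VM.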

Preemption recovery

[Diagram: a job moving across VM 1, VM 2, and VM 3, with markers for checkpoint, preemption, and resume on a new VM.]

For the buying decision, see when spot, on-demand, or reserved GPUs make sense.
