Spot Instances
Cloud providers sell spare compute capacity at discounts of up to 90%. The tradeoff: the provider can reclaim your VM at any time.
anycloud handles preemption automatically — if your VM gets taken back, it provisions a new one and restores your job.
Enable Spot
Python:
from anycloud.types import CloudConfig

# `ac` is assumed to be defined already (the anycloud handle used to submit jobs)
job = ac.submit(
    "ghcr.io/acme/my-training:latest",
    gpu="a100:all",
    cloud_config=CloudConfig(spot=True),
)
CLI:
anycloud submit ghcr.io/acme/my-training:latest \
    --credentials my-aws \
    --gpu-type a100 \
    --spot
♻️ Preemption Recovery
When a spot VM is preempted:
- anycloud detects the VM is gone (preemption monitor polls every 30s)
- Terminates the old VM's cloud resources
- Provisions a new VM
- Restores checkpoint data from cloud storage
- Restarts your container
Your job resumes from its last checkpoint — no manual intervention needed.
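The recovery sequence above can be pictured as a simple polling loop. The sketch below is only an illustration of that order of operations, not anycloud's actual implementation: `FakeCloud`, its methods, and `monitor` are hypothetical names invented for this example, and the 30-second interval is taken from the description above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FakeCloud:
    """Hypothetical stand-in for a cloud provider; NOT anycloud's real API."""
    alive: bool = True
    log: List[str] = field(default_factory=list)

    def vm_exists(self) -> bool:
        return self.alive

    def terminate_resources(self) -> None:
        self.log.append("terminate")

    def provision_vm(self) -> None:
        self.log.append("provision")
        self.alive = True

    def restore_checkpoint(self) -> None:
        self.log.append("restore")

    def restart_container(self) -> None:
        self.log.append("restart")


def monitor(
    cloud: FakeCloud,
    polls: int,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `polls` times and run the recovery steps whenever the VM is gone.

    Returns the number of recoveries performed.
    """
    recoveries = 0
    for _ in range(polls):
        if not cloud.vm_exists():
            # Same order as the list above: clean up, provision, restore, restart.
            cloud.terminate_resources()
            cloud.provision_vm()
            cloud.restore_checkpoint()
            cloud.restart_container()
            recoveries += 1
        sleep(interval_s)
    return recoveries
```

Passing `sleep=lambda s: None` lets you step through the loop without waiting. With a polling monitor like this, detection can lag the actual preemption by up to one polling interval.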
💾 Checkpointing
Every spot deployment gets an automatic checkpoint bucket mounted at /mnt/checkpoint. Write your state there and it syncs to cloud storage every ~60 seconds.
import json
import os

CHECKPOINT = '/mnt/checkpoint/state.json'

# Resume from checkpoint if it exists
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        state = json.load(f)
    start_epoch = state['epoch']
else:
    start_epoch = 0

# Training loop (train_one_epoch is your own training function)
for epoch in range(start_epoch, 100):
    train_one_epoch(epoch)
    # Save checkpoint; survives preemption
    with open(CHECKPOINT, 'w') as f:
        json.dump({'epoch': epoch + 1}, f)
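The loop above rewrites the checkpoint file in place, so a sync or preemption that lands mid-write could in principle capture a truncated file. One common way to guard against that is to write to a temporary file and rename it over the target. The sketch below assumes the checkpoint mount supports atomic renames within a directory, which this page does not confirm; `save_checkpoint` and `load_checkpoint` are hypothetical helper names, not part of anycloud.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write `state` as JSON so that `path` is never left half-written."""
    directory = os.path.dirname(path) or "."
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Atomic on POSIX filesystems; assumed (not verified) for the checkpoint mount.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or `default` if the file is missing or unreadable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
```

With these helpers, startup becomes `state = load_checkpoint(CHECKPOINT, {'epoch': 0})`, which also covers the first run, when no checkpoint exists yet.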
Best Practices
- Checkpoint frequently: after a preemption you lose any work done since the last synced checkpoint (sync interval is ~60s)
- Make startup idempotent: your code should handle resuming from any checkpoint
- Keep checkpoints small: they upload every ~60s
- Use /mnt/checkpoint: this path is always available on spot deployments
The checkpoint bucket is part of anycloud's Bucket Sync system — see that page for more on input, output, and checkpoint buckets.
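If a single epoch takes much longer than the sync interval, checkpointing only at epoch boundaries can lose a lot of work. One option is to checkpoint on a wall-clock timer inside the loop. The helper below is a minimal sketch of that idea; `CheckpointTimer` is a hypothetical name, and the 60-second default simply mirrors the sync interval mentioned above.

```python
import time
from typing import Callable


class CheckpointTimer:
    """Reports when at least `interval_s` seconds have passed since the last save."""

    def __init__(
        self,
        interval_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.interval_s = interval_s
        self.clock = clock
        self.last = clock()

    def due(self) -> bool:
        """Return True (and reset the timer) once the interval has elapsed."""
        now = self.clock()
        if now - self.last >= self.interval_s:
            self.last = now
            return True
        return False
```

In a training loop you would call `timer.due()` after each step and write the checkpoint only when it returns True, so that writes happen roughly once per sync interval instead of on every step.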
Retaining Checkpoints Across Runs
By default the checkpoint bucket is deleted when a spot job completes. Pass --persist-bucket to keep it around so a later run can resume from the same checkpoint state without keeping a VM alive in between:
anycloud submit ghcr.io/acme/training:latest --spot --persist-bucket -- python train.py
The VM is still torn down on completion (the flag is independent of --persist); only the bucket sticks around. To resume on a fresh VM, use anycloud resubmit with the same deployment ID:
anycloud resubmit <previous-id>
The new VM mounts the retained bucket at /mnt/checkpoint exactly as the previous run left it. Cleanup is manual when you're done — aws s3 rb s3://<id> --force (or the GCS / Azure equivalent). The flag is a no-op on non-spot deployments since they have no checkpoint bucket.