Spot Instances

Cloud providers sell spare compute capacity at discounts of up to 90%. The tradeoff: they can reclaim your VM at any time.

anycloud handles preemption automatically — if your VM gets taken back, it provisions a new one and restores your job.

Enable Spot

from anycloud.types import CloudConfig

# `ac` is assumed to be your already-configured anycloud client handle.
job = ac.submit(
    "ghcr.io/acme/my-training:latest",
    gpu="a100:all",
    cloud_config=CloudConfig(spot=True),  # request spot capacity
)

♻️ Preemption Recovery

When a spot VM is preempted:

  1. anycloud detects the VM is gone (preemption monitor polls every 30s)
  2. Terminates the old VM's cloud resources
  3. Provisions a new VM
  4. Restores checkpoint data from cloud storage
  5. Restarts your container

Your job resumes from its last checkpoint — no manual intervention needed.
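
The recovery sequence above can be pictured as a small control loop. The sketch below is purely illustrative and is not anycloud's actual implementation: FakeCloud, recover_if_preempted, monitor, and every method name are hypothetical stand-ins, and the "cloud" is an in-memory object so the example runs anywhere. Only the step order and the 30-second polling interval come from the list above.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud provider API."""
    alive: set = field(default_factory=set)
    log: list = field(default_factory=list)
    _next_id: int = 0

    def provision(self) -> str:
        self._next_id += 1
        vm_id = f"vm-{self._next_id}"
        self.alive.add(vm_id)
        self.log.append(f"provision {vm_id}")
        return vm_id

    def is_alive(self, vm_id: str) -> bool:
        return vm_id in self.alive

    def terminate(self, vm_id: str) -> None:
        # Safe to call for a VM that is already gone.
        self.alive.discard(vm_id)
        self.log.append(f"terminate {vm_id}")

    def restore_checkpoint(self, vm_id: str) -> None:
        self.log.append(f"restore {vm_id}")

    def start_container(self, vm_id: str) -> None:
        self.log.append(f"start {vm_id}")


def recover_if_preempted(cloud: FakeCloud, vm_id: str) -> str:
    """Run one monitor tick and return the ID of the VM now hosting the job."""
    if cloud.is_alive(vm_id):         # step 1: detect whether the VM is gone
        return vm_id
    cloud.terminate(vm_id)            # step 2: clean up the old VM's resources
    new_vm = cloud.provision()        # step 3: provision a replacement
    cloud.restore_checkpoint(new_vm)  # step 4: restore checkpoint data
    cloud.start_container(new_vm)     # step 5: restart the container
    return new_vm


def monitor(cloud, vm_id, ticks, poll_interval_s=30.0, sleep=time.sleep):
    """Poll a fixed number of times; `sleep` is injectable so tests need not wait."""
    for _ in range(ticks):
        vm_id = recover_if_preempted(cloud, vm_id)
        sleep(poll_interval_s)
    return vm_id
```

The point for job authors is that steps 4 and 5 hand control back to your container with only the checkpoint data restored, so anything not written to the checkpoint is gone after a recovery.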

💾 Checkpointing

Every spot deployment gets an automatic checkpoint bucket mounted at /mnt/checkpoint. Write your state there and it syncs to cloud storage every ~60 seconds.

import json
import os

CHECKPOINT = '/mnt/checkpoint/state.json'

# Resume from the checkpoint if one exists
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        state = json.load(f)
    start_epoch = state['epoch']
else:
    start_epoch = 0

# Training loop (train_one_epoch is your own training function)
for epoch in range(start_epoch, 100):
    train_one_epoch(epoch)

    # Record the next epoch to run; this file survives preemption
    with open(CHECKPOINT, 'w') as f:
        json.dump({'epoch': epoch + 1}, f)
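
Because the sync runs on its own schedule, it could in principle coincide with a checkpoint write that is still in progress. This page does not say whether anycloud guards against that, so a common defensive pattern is to write to a temporary file and then rename it over the real one. The sketch below assumes the checkpoint mount supports atomic rename via os.replace, which may not hold for every bucket-backed filesystem; save_checkpoint_atomic is a hypothetical helper, not part of anycloud.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(path, state):
    """Write `state` as JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(path) or "."
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to storage before renaming
        os.replace(tmp_path, path)  # assumption: rename is atomic on this mount
    except BaseException:
        # Do not leave a stray temp file behind if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In the training loop above, the final two lines would become a single call such as save_checkpoint_atomic(CHECKPOINT, {'epoch': epoch + 1}).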

Best Practices

  • Checkpoint frequently: on preemption you lose any work done since the last synced checkpoint (syncs run every ~60s)
  • Make startup idempotent: your code should handle resuming from any checkpoint
  • Keep checkpoints small: they are uploaded every ~60s
  • Use /mnt/checkpoint: this path is always available on spot deployments
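
One way to make startup idempotent is to load state through a function that falls back to defaults whenever the checkpoint is missing, empty, or unreadable, so the first launch and every restart go through the same code path. This is a minimal sketch under assumptions: load_state and DEFAULT_STATE are hypothetical names, and the "step" field is only there to show how fields absent from an older checkpoint get a default.

```python
import json

# Hypothetical defaults; use whatever fields your job actually tracks.
DEFAULT_STATE = {"epoch": 0, "step": 0}


def load_state(path, default=DEFAULT_STATE):
    """Return saved state merged over defaults; never raise on a bad checkpoint."""
    state = dict(default)  # copy so the shared defaults are never mutated
    try:
        with open(path) as f:
            loaded = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # First run, or a truncated/corrupt file: start from defaults.
        return state
    if isinstance(loaded, dict):
        state.update(loaded)  # fields missing from the file keep their defaults
    return state
```

Silently falling back to defaults trades safety for convenience: a corrupt checkpoint restarts training from scratch rather than failing loudly, so consider logging when the fallback path is taken.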

The checkpoint bucket is part of anycloud's Bucket Sync system — see that page for more on input, output, and checkpoint buckets.

Retaining Checkpoints Across Runs

By default the checkpoint bucket is deleted when a spot job completes. Pass --persist-bucket to keep it around so a later run can resume from the same checkpoint state without keeping a VM alive in between:

anycloud submit ghcr.io/acme/training:latest --spot --persist-bucket -- python train.py

The VM is still torn down on completion (the flag is independent of --persist); only the bucket sticks around. To resume on a fresh VM, use anycloud resubmit with the same deployment ID:

anycloud resubmit <previous-id>

The new VM mounts the retained bucket at /mnt/checkpoint exactly as the previous run left it. Cleanup is manual when you're done — aws s3 rb s3://<id> --force (or the GCS / Azure equivalent). The flag is a no-op on non-spot deployments since they have no checkpoint bucket.