Spot Instances

Cloud providers sell spare compute capacity at discounts of up to 90%. The tradeoff: they can reclaim your VM at any time.

anycloud handles preemption automatically — if your VM gets taken back, it provisions a new one and restores your job.

Enable Spot

Python

from anycloud.types import CloudConfig

job = ac.submit(
    "ghcr.io/acme/my-training:latest",
    gpu="a100:8",
    cloud_config=CloudConfig(spot=True),
)

CLI

anycloud submit ghcr.io/acme/my-training:latest \
  --credentials my-aws \
  --gpu-type a100 \
  --spot

♻️ Preemption Recovery

When a spot VM is preempted:

1. anycloud detects that the VM is gone (the preemption monitor polls every 30 seconds)
2. Terminates the old VM's cloud resources
3. Provisions a new VM
4. Restores checkpoint data from cloud storage
5. Restarts your container

Your job resumes from its last checkpoint — no manual intervention needed.
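
The recovery sequence above can be pictured as a polling loop. The following is a minimal sketch, not anycloud's actual implementation: `FakeCloud`, `recover`, and `monitor` are hypothetical names, and the cloud is simulated in memory so the example runs on its own. Only the 30 second poll interval and the order of the steps come from the description above.

```python
import time
from dataclasses import dataclass, field

POLL_INTERVAL_S = 30  # poll interval stated in the text above


@dataclass
class FakeCloud:
    """In-memory stand-in for a cloud provider (hypothetical, illustration only)."""

    alive: set = field(default_factory=set)
    log: list = field(default_factory=list)
    next_id: int = 0

    def provision(self) -> str:
        self.next_id += 1
        vm_id = f"vm-{self.next_id}"
        self.alive.add(vm_id)
        self.log.append(f"provision {vm_id}")
        return vm_id

    def preempt(self, vm_id: str) -> None:
        # Simulates the provider reclaiming the VM.
        self.alive.discard(vm_id)

    def is_alive(self, vm_id: str) -> bool:
        return vm_id in self.alive

    def cleanup(self, vm_id: str) -> None:
        self.log.append(f"cleanup {vm_id}")

    def restore_checkpoint(self, vm_id: str) -> None:
        self.log.append(f"restore {vm_id}")

    def start_container(self, vm_id: str) -> None:
        self.log.append(f"start {vm_id}")


def recover(cloud: FakeCloud, dead_vm: str) -> str:
    """Run the recovery steps in the order listed above and return the new VM id."""
    cloud.cleanup(dead_vm)            # terminate the old VM's resources
    new_vm = cloud.provision()        # provision a replacement
    cloud.restore_checkpoint(new_vm)  # pull checkpoint data back from storage
    cloud.start_container(new_vm)     # restart the job container
    return new_vm


def monitor(cloud: FakeCloud, vm_id: str, polls: int, sleep=time.sleep) -> str:
    """Poll `polls` times, recovering whenever the tracked VM has disappeared."""
    for _ in range(polls):
        if not cloud.is_alive(vm_id):
            vm_id = recover(cloud, vm_id)
        sleep(POLL_INTERVAL_S)
    return vm_id
```

Passing `sleep=lambda s: None` lets you step through the loop without waiting. The practical takeaway is that detection can lag a preemption by up to one poll interval, on top of the time needed to provision and restore.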

💾 Checkpointing

Every spot deployment gets an automatic checkpoint bucket mounted at /mnt/checkpoint. Write your state there and it syncs to cloud storage every ~60 seconds.

import json, os

CHECKPOINT = '/mnt/checkpoint/state.json'

# Resume from the last checkpoint if one exists
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        state = json.load(f)
    start_epoch = state['epoch']
else:
    start_epoch = 0

# Training loop (train_one_epoch is your own training function)
for epoch in range(start_epoch, 100):
    train_one_epoch(epoch)

    # Save checkpoint — survives preemption
    with open(CHECKPOINT, 'w') as f:
        json.dump({'epoch': epoch + 1}, f)
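
Because the bucket syncs on a timer, a sync or a preemption could land while state.json is only half written. One common guard is to write to a temporary file and rename it into place. The sketch below assumes the checkpoint mount supports atomic renames through `os.replace` and accepts `os.fsync`, neither of which is confirmed here and may not hold for every storage backend. `save_checkpoint` and `load_checkpoint` are hypothetical helper names, not part of anycloud.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to storage before renaming
        # Assumption: the rename is atomic on this filesystem, so readers see
        # either the old file or the complete new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temp file behind if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or a copy of `default` when no checkpoint exists."""
    if not os.path.exists(path):
        return dict(default)
    with open(path) as f:
        return json.load(f)
```

In the training loop above, this would replace the final `with open(...)` block with a call such as `save_checkpoint(CHECKPOINT, {'epoch': epoch + 1})`.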

Best Practices

Checkpoint frequently: after a preemption you lose whatever work was done since the last synced checkpoint (the sync runs about every 60 seconds)
Make startup idempotent: your code should resume correctly from any checkpoint, or start fresh when none exists
Keep checkpoints small: they upload about every 60 seconds
Use /mnt/checkpoint: this path is always available on spot deployments
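
As a rough illustration of the first three practices, the sketch below saves a small JSON state at most once per interval and loads it in a way that tolerates a missing or unreadable file. `load_state`, `ThrottledCheckpointer`, and `DEFAULT_STATE` are hypothetical names, and the 60 second default simply mirrors the sync interval mentioned above; tune it to your own workload.

```python
import json
import time

DEFAULT_STATE = {"epoch": 0, "step": 0}


def load_state(path: str) -> dict:
    """Return saved state, or a fresh default if the file is missing or unreadable."""
    try:
        with open(path) as f:
            state = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return dict(DEFAULT_STATE)
    if not isinstance(state, dict):
        return dict(DEFAULT_STATE)
    # Fill in any keys that an older checkpoint may lack.
    return {**DEFAULT_STATE, **state}


class ThrottledCheckpointer:
    """Save small JSON state at most once every `min_interval_s` seconds."""

    def __init__(self, path: str, min_interval_s: float = 60.0, clock=time.monotonic):
        self.path = path
        self.min_interval_s = min_interval_s
        self.clock = clock
        self._last_save = None  # time of the last successful save, if any

    def maybe_save(self, state: dict) -> bool:
        """Write `state` if enough time has passed; return True when a save happened."""
        now = self.clock()
        if self._last_save is not None and now - self._last_save < self.min_interval_s:
            return False
        with open(self.path, "w") as f:
            json.dump(state, f)
        self._last_save = now
        return True
```

Calling `maybe_save` after every training step keeps the amount of work at risk close to one interval without rewriting the file on each step. Because `load_state` always returns a complete state, the same startup path works for a first run and for a restart after preemption.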

The checkpoint bucket is part of anycloud's Bucket Sync system — see that page for more on input, output, and checkpoint buckets.

Retaining Checkpoints Across Runs

By default the checkpoint bucket is deleted when a spot job completes. Pass --persist-bucket to keep it around after the VM is torn down:

anycloud submit ghcr.io/acme/training:latest --spot --persist-bucket -- python train.py

The VM is still torn down on completion (the flag is independent of --persist); only the bucket sticks around for inspection or manual reuse. Cleanup is manual when you're done — aws s3 rb s3://<id> --force (or the GCS / Azure equivalent). The flag is a no-op on non-spot deployments since they have no checkpoint bucket.
