Train and Run MACE on Cloud GPUs

MACE is a fast, accurate machine-learning interatomic potential widely used in computational chemistry and materials science.

This tutorial walks through training a MACE model on a cloud GPU using spot instances, then running batch inference — all with automatic checkpointing and preemption recovery.

What you'll need

🐳 MACE image

You can use the pre-built public image, which includes mace-torch and CUDA 12.4:

ghcr.io/anycloud-sh/mace:latest

Or build your own custom image. Create a Dockerfile:

FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
RUN pip install mace-torch

Build and push to GitHub Container Registry:

anycloud build

This tags and pushes the image to ghcr.io/<your-github-user>/mace:latest.

🚀 Train with spot preemption recovery

Upload your training data to a cloud storage bucket, then launch training. anycloud mounts your buckets directly into the container — no setup code needed:

import anycloud

ac = anycloud.Client()

# Upload training data
data = ac.bucket("my-training-data")
data.upload("./train.xyz")
data.upload("./test.xyz")

# Launch training
job = ac.submit(
    "ghcr.io/anycloud-sh/mace:latest",
    gpu="a100:all",
    input=data,
    output=ac.bucket("my-results"),
    cloud_config=anycloud.CloudConfig(spot=True, disk_size_gb=200),
    command=[
        "mace_run_train",
        "--name=my_model",
        "--train_file=/mnt/input/train.xyz",
        "--valid_fraction=0.1",
        "--test_file=/mnt/input/test.xyz",
        "--model=MACE",
        "--hidden_irreps=128x0e+128x1o",
        "--r_max=6.0",
        "--batch_size=32",
        "--max_num_epochs=500",
        "--device=cuda",
        "--checkpoints_dir=/mnt/checkpoint",
        "--restart_latest",
        "--results_dir=/mnt/output",
    ],
)
job.wait()

See Bucket Sync — Combining Buckets for what each mount path does.
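A truncated or malformed training file would typically only surface as an error once the VM is up, so a quick local check before uploading can save a round trip. The sketch below is a minimal, standard-library-only example and not part of anycloud or MACE: count_xyz_frames and split_sizes are hypothetical helper names, and the split estimate assumes the validation set is roughly valid_fraction of the frames in train.xyz.

```python
from pathlib import Path


def count_xyz_frames(path):
    """Count frames in an (extended) XYZ file.

    Each frame is one line holding the atom count N, one comment line,
    then N atom lines.
    """
    lines = Path(path).read_text().splitlines()
    frames = 0
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate blank lines between frames
            continue
        n_atoms = int(lines[i].split()[0])  # ValueError on a malformed header
        if i + 2 + n_atoms > len(lines):
            raise ValueError(f"frame {frames} is truncated")
        i += 2 + n_atoms
        frames += 1
    return frames


def split_sizes(n_frames, valid_fraction):
    """Approximate (train, validation) sizes for a given --valid_fraction."""
    n_valid = int(n_frames * valid_fraction)
    return n_frames - n_valid, n_valid
```

For example, split_sizes(count_xyz_frames("./train.xyz"), 0.1) gives a rough idea of how many structures end up in the validation set with the --valid_fraction=0.1 setting used above.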

The --restart_latest flag tells MACE to resume from the most recent checkpoint. Combined with anycloud's automatic preemption recovery, this means:

  1. Spot VM gets preempted
  2. anycloud provisions a new VM and restores /mnt/checkpoint
  3. MACE picks up training from the last checkpoint

No manual intervention needed.
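To make the recovery sequence concrete, here is a toy simulation of resume-from-latest-checkpoint. It is only an illustration of the pattern, not MACE's or anycloud's actual implementation: the epoch-N.json file layout, the train function, and the preempt_at parameter are all made up for this sketch.

```python
import json
import tempfile
from pathlib import Path


def latest_checkpoint(ckpt_dir):
    """Return (epoch, path) of the newest checkpoint, or None if there is none."""
    found = [
        (int(path.stem.split("-")[1]), path)
        for path in Path(ckpt_dir).glob("epoch-*.json")
    ]
    return max(found, key=lambda item: item[0]) if found else None


def train(ckpt_dir, max_epochs, preempt_at=None):
    """Toy training loop that resumes from the newest checkpoint.

    preempt_at simulates a spot preemption right after that epoch is saved.
    Returns the last epoch completed by this call.
    """
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    latest = latest_checkpoint(ckpt_dir)
    start = latest[0] + 1 if latest else 0
    state = json.loads(latest[1].read_text()) if latest else {"steps": 0}
    last = start - 1
    for epoch in range(start, max_epochs):
        state["steps"] += 1  # stand-in for one epoch of real work
        (ckpt_dir / f"epoch-{epoch}.json").write_text(json.dumps(state))
        last = epoch
        if preempt_at is not None and epoch == preempt_at:
            break  # the VM disappears here
    return last


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        train(shared, max_epochs=5, preempt_at=2)  # first VM, preempted
        train(shared, max_epochs=5)                # replacement VM resumes at epoch 3
```

The second call does no repeated work because the checkpoint directory outlives the first "VM". That is the role /mnt/checkpoint plays in the real setup.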

If you don't need a specific VM type, you can use --gpu-type to let anycloud pick the best available instance:

anycloud submit ghcr.io/anycloud-sh/mace:latest \
    --credentials my-aws \
    --gpu-type a100 \
    --spot \
    --disk-size 200 \
    --input-bucket my-training-data \
    --output-bucket my-results \
    --gpus all \
    -- mace_run_train ...

If you run the same configuration often, save it as a reusable profile. See Configuration.
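On the Python side, one lightweight way to reuse a training configuration is to keep the hyperparameters in a dict and build the command list from it. The helper below is a hypothetical convenience function (mace_train_command and BASE are names invented here, not part of anycloud or MACE); it only assembles the same mace_run_train flags used in the training example above.

```python
def mace_train_command(name, train_file, **options):
    """Build a mace_run_train argument list.

    Keyword options become --key=value. A value of True becomes a bare
    flag (restart_latest=True -> --restart_latest); None or False drops
    the option.
    """
    command = ["mace_run_train", f"--name={name}", f"--train_file={train_file}"]
    for key, value in options.items():
        if value is None or value is False:
            continue
        if value is True:
            command.append(f"--{key}")
        else:
            command.append(f"--{key}={value}")
    return command


# Shared settings, copied from the training example in this tutorial.
BASE = dict(
    model="MACE",
    hidden_irreps="128x0e+128x1o",
    r_max=6.0,
    batch_size=32,
    max_num_epochs=500,
    device="cuda",
    checkpoints_dir="/mnt/checkpoint",
    restart_latest=True,
    results_dir="/mnt/output",
)

command = mace_train_command(
    "my_model",
    "/mnt/input/train.xyz",
    valid_fraction=0.1,
    test_file="/mnt/input/test.xyz",
    **BASE,
)
```

The resulting list can be passed as command= to ac.submit. To try a variation, override a single entry, for example **{**BASE, "batch_size": 16}.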

📊 Monitor training

# List running deployments
anycloud list

# Stream logs
anycloud logs <deployment-id> --follow

# SSH into the VM
anycloud ssh <deployment-id>

⚡ Batch inference

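Once you have a trained model, run it on new structures to predict energies, forces, and stresses. The command below reads both structures.xyz and my_model.model from /mnt/input, so both files must be present in the input bucket: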

import anycloud

ac = anycloud.Client()

job = ac.submit(
    "ghcr.io/anycloud-sh/mace:latest",
    gpu="a100:all",
    input=ac.bucket("my-training-data"),
    output=ac.bucket("my-results"),
    command=[
        "mace_eval_configs",
        "--configs=/mnt/input/structures.xyz",
        "--model=/mnt/input/my_model.model",
        "--output=/mnt/output/predictions.xyz",
        "--device=cuda",
    ],
)
job.wait()

Results appear in your output bucket as each job completes.
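Once predictions.xyz has been copied from the output bucket to your machine, the per-structure energies can be pulled out of the comment line of each frame. The sketch below uses only the standard library and rests on one assumption worth checking against your own file: that the predicted energy is stored under the key MACE_energy. The read_energies helper is a hypothetical name, not a MACE or anycloud function.

```python
import re
from pathlib import Path

ENERGY_KEY = "MACE_energy"  # assumed key name; check your predictions.xyz


def read_energies(path, key=ENERGY_KEY):
    """Return one energy per frame from an extended XYZ file."""
    pattern = re.compile(rf"(?:^|\s){re.escape(key)}=(\S+)")
    lines = Path(path).read_text().splitlines()
    energies = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate blank lines between frames
            continue
        n_atoms = int(lines[i].split()[0])
        match = pattern.search(lines[i + 1])  # comment line of this frame
        if match is None:
            raise KeyError(f"{key} not found in frame {len(energies)}")
        energies.append(float(match.group(1)))
        i += 2 + n_atoms  # skip header, comment, and atom lines
    return energies
```

For anything beyond a quick look, such as forces or stresses, a full extended-XYZ reader like ASE is the more robust choice.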

Next steps

  • Bucket Sync — details on input, output, and checkpoint buckets
  • Spot Instances — more on preemption recovery and checkpointing best practices
  • CLI Reference — full list of commands and flags