Train and Run MACE on Cloud GPUs
MACE is a fast, accurate interatomic model widely used in computational chemistry and materials science.
This tutorial walks through training a MACE model on a cloud GPU using spot instances, then running batch inference — all with automatic checkpointing and preemption recovery.
What you'll need
- anycloud installed and credentials configured (Getting Started)
- Training data in extended XYZ format
- A cloud storage bucket with your training data uploaded
🐳 MACE image
You can use the pre-built public image, which includes mace-torch and CUDA 12.4:
ghcr.io/anycloud-sh/mace:latest
Or build your own custom image. Create a Dockerfile:
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
RUN pip install mace-torch
Build and push to GitHub Container Registry:
anycloud build
This tags and pushes the image to ghcr.io/<your-github-user>/mace:latest.
🚀 Train with spot preemption recovery
Upload your training data to a cloud storage bucket, then launch training. anycloud mounts your buckets directly into the container — no setup code needed:
- Python
- CLI
import anycloud
ac = anycloud.Client()
# Upload training data
data = ac.bucket("my-training-data")
data.upload("./train.xyz")
data.upload("./test.xyz")
# Launch training
job = ac.submit(
"ghcr.io/anycloud-sh/mace:latest",
gpu="a100:all",
input=data,
output=ac.bucket("my-results"),
cloud_config=anycloud.CloudConfig(spot=True, disk_size_gb=200),
command=[
"mace_run_train",
"--name=my_model",
"--train_file=/mnt/input/train.xyz",
"--valid_fraction=0.1",
"--test_file=/mnt/input/test.xyz",
"--model=MACE",
"--hidden_irreps=128x0e+128x1o",
"--r_max=6.0",
"--batch_size=32",
"--max_num_epochs=500",
"--device=cuda",
"--checkpoints_dir=/mnt/checkpoint",
"--restart_latest",
"--results_dir=/mnt/output",
],
)
job.wait()
# Upload training data
aws s3 cp train.xyz s3://my-training-data/
aws s3 cp test.xyz s3://my-training-data/
# Launch training
anycloud submit ghcr.io/anycloud-sh/mace:latest \
--credentials my-aws \
--region us-east-1 \
--vm-type g6e.xlarge \
--spot \
--disk-size 200 \
--input-bucket my-training-data \
--output-bucket my-results \
--gpus all \
-- mace_run_train \
--name=my_model \
--train_file=/mnt/input/train.xyz \
--valid_fraction=0.1 \
--test_file=/mnt/input/test.xyz \
--model=MACE \
--hidden_irreps='128x0e+128x1o' \
--r_max=6.0 \
--batch_size=32 \
--max_num_epochs=500 \
--device=cuda \
--checkpoints_dir=/mnt/checkpoint \
--restart_latest \
--results_dir=/mnt/output
See Bucket Sync — Combining Buckets for what each mount path does.
The --restart_latest flag tells MACE to resume from the most recent checkpoint. Combined with anycloud's automatic preemption recovery, this means:
- Spot VM gets preempted
- anycloud provisions a new VM and restores
/mnt/checkpoint - MACE picks up training from the last checkpoint
No manual intervention needed.
If you don't need a specific VM type, you can use --gpu-type to let anycloud pick the best available instance:
anycloud submit ghcr.io/anycloud-sh/mace:latest \
--credentials my-aws \
--gpu-type a100 \
--spot \
--disk-size 200 \
--input-bucket my-training-data \
--output-bucket my-results \
--gpus all \
-- mace_run_train ...
If you run the same configuration often, save it as a reusable profile. See Configuration.
📊 Monitor training
# List running deployments
anycloud list
# Stream logs
anycloud logs <deployment-id> --follow
# SSH into the VM
anycloud ssh <deployment-id>
⚡ Batch inference
Once you have a trained model, run it on new structures to predict energies, forces, and stresses:
- Python
- CLI
import anycloud
ac = anycloud.Client()
job = ac.submit(
"ghcr.io/anycloud-sh/mace:latest",
gpu="a100:all",
input=ac.bucket("my-training-data"),
output=ac.bucket("my-results"),
command=[
"mace_eval_configs",
"--configs=/mnt/input/structures.xyz",
"--model=/mnt/input/my_model.model",
"--output=/mnt/output/predictions.xyz",
"--device=cuda",
],
)
job.wait()
anycloud submit ghcr.io/anycloud-sh/mace:latest \
--credentials my-aws \
--region us-east-1 \
--vm-type g6e.xlarge \
--disk-size 200 \
--input-bucket my-training-data \
--output-bucket my-results \
--gpus all \
-- mace_eval_configs \
--configs=/mnt/input/structures.xyz \
--model=/mnt/input/my_model.model \
--output=/mnt/output/predictions.xyz \
--device=cuda
Results appear in your output bucket as each job completes.
Next steps
- Bucket Sync — details on input, output, and checkpoint buckets
- Spot Instances — more on preemption recovery and checkpointing best practices
- CLI Reference — full list of commands and flags