Train and Run MACE on Cloud GPUs

MACE is a fast, accurate machine-learning interatomic potential widely used in computational chemistry and materials science.

This tutorial walks through building a MACE image, training a model on a cloud GPU with spot instances, then running batch inference — all with automatic checkpointing and preemption recovery. It leads with the @anycloud.function() decorator (your code synced from git, no rebuild between runs) and shows the equivalent CLI command alongside.

What you'll need

anycloud installed and credentials configured (Getting Started)
For the decorator path, your training code in a GitHub repo (committed and pushed)
Training data in extended XYZ format
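
Before uploading, it can help to confirm that the training file is structurally sound. The sketch below is a rough sanity check, not a full extended XYZ parser: it assumes each frame is an atom-count line, one comment line, then one line per atom with at least a species and three coordinates. The function name count_xyz_frames is made up for this example.

```python
from typing import Iterable, List


def count_xyz_frames(lines: Iterable[str]) -> List[int]:
    """Return the atom count of each frame, raising ValueError on a malformed frame."""
    rows = [line.rstrip("\n") for line in lines]
    # Ignore trailing blank lines, which many writers leave at end of file.
    while rows and not rows[-1].strip():
        rows.pop()

    counts = []
    i = 0
    while i < len(rows):
        try:
            n_atoms = int(rows[i].strip())
        except ValueError:
            raise ValueError(f"line {i + 1}: expected an atom count, got {rows[i]!r}")
        # A frame is: count line, comment line, then n_atoms atom lines.
        end = i + 2 + n_atoms
        if end > len(rows):
            raise ValueError(
                f"line {i + 1}: frame declares {n_atoms} atoms but the file ends early"
            )
        for j in range(i + 2, end):
            # Each atom line needs at least a species and three coordinates.
            if len(rows[j].split()) < 4:
                raise ValueError(f"line {j + 1}: expected species and x y z")
        counts.append(n_atoms)
        i = end
    return counts
```

Passing an open file handle for train.xyz returns one entry per frame, so the length of the result is the number of structures in the file.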

🐳 Build the MACE image

The image holds MACE and CUDA. With the @anycloud.function() decorator your code is synced separately from git at run time, so you only rebuild when dependencies change. Create a Dockerfile:

FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
# git is required for the decorator's code sync
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
RUN pip install mace-torch

Build and push it to GitHub Container Registry. If Docker was installed when you ran anycloud login, your local Docker CLI is already logged in to GHCR:

docker buildx build --platform linux/amd64 \
  -t ghcr.io/<your-github-user>/mace:latest \
  --push .

--platform linux/amd64 matters on Apple Silicon — the cloud GPU VMs run x86_64. See Docker for more on building and pushing images.

Prefer not to build? The prebuilt ghcr.io/anycloud-sh/mace:latest (includes mace-torch + CUDA) works with the CLI path below. The decorator path needs git in the image, so build your own for that.
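
To see whether a given image has what the decorator path relies on, a small script run inside the container can report what is visible. This is a sketch using only the standard library; the function names are made up for this example, and it checks only for presence, not for versions or a working GPU.

```python
import importlib.util
import shutil
from typing import Dict, List


def check_image_requirements() -> Dict[str, bool]:
    """Report which pieces this tutorial relies on are visible in this environment."""
    return {
        # git on PATH: needed for the decorator's code sync
        "git": shutil.which("git") is not None,
        # training entry point installed by the mace-torch package
        "mace_run_train": shutil.which("mace_run_train") is not None,
        # importable packages (find_spec locates them without importing)
        "mace": importlib.util.find_spec("mace") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
    }


def missing_requirements(report: Dict[str, bool]) -> List[str]:
    """Names from the report that were not found, sorted alphabetically."""
    return sorted(name for name, ok in report.items() if not ok)
```

An empty list from missing_requirements(check_image_requirements()) suggests the image is usable for both paths; a list containing only "git" matches the situation described for the prebuilt image.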

📤 Upload your training data

Upload your training data to a bucket — anycloud mounts it read-only at /mnt/input in the container. The bucket is created on first upload.

The CLI commands are shown first, followed by the Python equivalent:

anycloud bucket upload my-training-data ./train.xyz train.xyz --credentials my-aws
anycloud bucket upload my-training-data ./test.xyz  test.xyz  --credentials my-aws

import anycloud

data = anycloud.Client().bucket("my-training-data")
data.upload("./train.xyz", remote_path="train.xyz")
data.upload("./test.xyz", remote_path="test.xyz")

🚀 Train with spot preemption recovery

anycloud mounts your buckets directly into the container — input at /mnt/input, results at /mnt/output, and a checkpoint bucket at /mnt/checkpoint. MACE's --restart_latest resumes from the latest checkpoint, so a preempted spot VM picks up where it left off.

The decorator (Python) version is shown first, followed by the equivalent CLI command.

With @anycloud.function(), your repo is cloned onto the VM at the current commit — pass hyperparameters as function arguments and change them between runs without rebuilding the image (just commit, push, and resubmit):

import anycloud

@anycloud.function(
    image="ghcr.io/<your-github-user>/mace:latest",
    gpu="a100:8",
    cloud_config=anycloud.CloudConfig(
        credentials="my-aws",
        spot=True,
        disk_size_gb=200,
        disk_tier="high",
        input_bucket="my-training-data",
        output_bucket="my-results",
    ),
)
def train(max_epochs: int = 500, batch_size: int = 32):
    import subprocess

    subprocess.run(
        [
            "mace_run_train",
            "--name=my_model",
            "--train_file=/mnt/input/train.xyz",
            "--valid_fraction=0.1",
            "--test_file=/mnt/input/test.xyz",
            "--model=MACE",
            "--hidden_irreps=128x0e+128x1o",
            "--r_max=6.0",
            f"--batch_size={batch_size}",
            f"--max_num_epochs={max_epochs}",
            "--device=cuda",
            "--checkpoints_dir=/mnt/checkpoint",
            "--restart_latest",
            "--results_dir=/mnt/output",
        ],
        check=True,
    )

job = train.submit(max_epochs=500)
job.wait()

The image only needs git and MACE — your train function comes from git. See Deploying Jobs for how the decorator works.
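
Because hyperparameters are plain function arguments, trying several settings amounts to calling submit once per combination. The helper below is a sketch of that idea; submit_grid is a hypothetical name, and it takes the submit callable as a parameter so it makes no assumptions about anycloud beyond the train.submit(...) call shown above.

```python
import itertools
from typing import Any, Callable, Dict, Iterable, List


def submit_grid(submit: Callable[..., Any], grid: Dict[str, Iterable[Any]]) -> List[Any]:
    """Call submit(**params) once per combination in grid and collect the results."""
    names = list(grid)
    jobs = []
    # itertools.product yields every combination of the listed values.
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        jobs.append(submit(**params))
    return jobs
```

With the function above this would be submit_grid(train.submit, {"max_epochs": [200, 500], "batch_size": [16, 32]}), followed by job.wait() on each returned job. Note that train hardcodes --name=my_model, so runs submitted side by side would probably need the name (and possibly the buckets) turned into parameters to avoid overwriting each other.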

anycloud submit ghcr.io/anycloud-sh/mace:latest \
  --credentials my-aws \
  --gpu-type a100 \
  --gpus all \
  --spot \
  --disk-size 200 \
  --disk-tier high \
  --input-bucket my-training-data \
  --output-bucket my-results \
  -- mace_run_train \
    --name=my_model \
    --train_file=/mnt/input/train.xyz \
    --valid_fraction=0.1 \
    --test_file=/mnt/input/test.xyz \
    --model=MACE \
    --hidden_irreps='128x0e+128x1o' \
    --r_max=6.0 \
    --batch_size=32 \
    --max_num_epochs=500 \
    --device=cuda \
    --checkpoints_dir=/mnt/checkpoint \
    --restart_latest \
    --results_dir=/mnt/output

With --restart_latest and anycloud's preemption recovery, the cycle needs no manual intervention: when a spot VM is preempted, anycloud provisions a new one and restores /mnt/checkpoint, and MACE resumes from the last checkpoint. See Spot Instances and the Combining Buckets section of Bucket Sync.
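
The resume step can be pictured as "find the newest checkpoint in the directory and continue from the epoch after it". The sketch below illustrates that pattern only; it is not MACE's restart code, and the "<name>_epoch-<N>.pt" filename scheme is an assumption made for the example.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme for this sketch only: "<name>_epoch-<N>.pt".
_EPOCH = re.compile(r"_epoch-(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, or None if there is none."""
    best, best_epoch = None, -1
    for path in Path(checkpoint_dir).glob("*.pt"):
        match = _EPOCH.search(path.name)
        if match and int(match.group(1)) > best_epoch:
            best, best_epoch = path, int(match.group(1))
    return best


def starting_epoch(checkpoint_dir: str) -> int:
    """Epoch to start from: 0 on a fresh run, last saved epoch + 1 after a restart."""
    ckpt = latest_checkpoint(checkpoint_dir)
    if ckpt is None:
        return 0
    return int(_EPOCH.search(ckpt.name).group(1)) + 1
```

On a fresh VM with an empty directory this yields epoch 0; once the checkpoint directory has been restored after a preemption, the same call picks up after the last saved epoch, which is why the identical command can be rerun unchanged.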

📊 Monitor training

anycloud list                              # running deployments
anycloud status <deployment-id> --verbose  # state machine + captured output
anycloud exec <deployment-id> "<command>"  # run a command on the VM
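
For long runs, the status call can be repeated on a timer instead of typed by hand. The wrapper below is a sketch that shells out to the status command shown above; the function names are made up for this example, and it prints the output as-is because this page does not describe the output format.

```python
import subprocess
import time
from typing import List


def status_command(deployment_id: str) -> List[str]:
    """Argument list for the verbose status call shown above."""
    return ["anycloud", "status", deployment_id, "--verbose"]


def watch_status(deployment_id: str, interval_s: float = 60.0, rounds: int = 10) -> None:
    """Print the status output every interval_s seconds, rounds times in total."""
    for i in range(rounds):
        result = subprocess.run(
            status_command(deployment_id), capture_output=True, text=True
        )
        # Show stdout when there is any, otherwise whatever was written to stderr.
        print(result.stdout or result.stderr)
        if i < rounds - 1:
            time.sleep(interval_s)
```

Calling watch_status("<deployment-id>") requires the anycloud CLI on PATH; the fixed number of rounds keeps the loop from running forever if the job has already finished.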

⚡ Batch inference

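Once you have a trained model, run it on new structures to predict energies, forces, and stresses. The commands below read both /mnt/input/structures.xyz and /mnt/input/my_model.model, so upload the structures and a copy of the trained model to the input bucket first: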

The decorator (Python) version is shown first, followed by the equivalent CLI command.

import anycloud

@anycloud.function(
    image="ghcr.io/<your-github-user>/mace:latest",
    gpu="a100:8",
    cloud_config=anycloud.CloudConfig(
        credentials="my-aws",
        input_bucket="my-training-data",
        output_bucket="my-results",
    ),
)
def evaluate():
    import subprocess

    subprocess.run(
        [
            "mace_eval_configs",
            "--configs=/mnt/input/structures.xyz",
            "--model=/mnt/input/my_model.model",
            "--output=/mnt/output/predictions.xyz",
            "--device=cuda",
        ],
        check=True,
    )

evaluate.submit().wait()

anycloud submit ghcr.io/anycloud-sh/mace:latest \
  --credentials my-aws \
  --gpu-type a100 \
  --gpus all \
  --disk-size 200 \
  --disk-tier high \
  --input-bucket my-training-data \
  --output-bucket my-results \
  -- mace_eval_configs \
    --configs=/mnt/input/structures.xyz \
    --model=/mnt/input/my_model.model \
    --output=/mnt/output/predictions.xyz \
    --device=cuda

Results appear in your output bucket as each job completes.
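
Once predictions.xyz has been downloaded, the predicted energies can be pulled from each frame's comment line. This is a sketch under an assumption: that energies are stored as MACE_energy=<value> in the comment line. Check your own file and adjust the key if it differs. The function name read_energies is made up for this example.

```python
import re
from typing import Iterable, List

# Assumed key name; adjust to whatever appears in your predictions.xyz.
ENERGY_KEY = "MACE_energy"


def read_energies(lines: Iterable[str], key: str = ENERGY_KEY) -> List[float]:
    """Collect key=<float> from the comment line of every frame."""
    pattern = re.compile(r"(?:^|\s)" + re.escape(key) + r"=(\S+)")
    rows = [line.rstrip("\n") for line in lines]
    energies = []
    i = 0
    while i < len(rows):
        if not rows[i].strip():
            i += 1  # skip stray blank lines between frames
            continue
        n_atoms = int(rows[i].strip())
        if i + 1 >= len(rows):
            raise ValueError(f"line {i + 1}: frame has no comment line")
        match = pattern.search(rows[i + 1])
        if match is None:
            raise ValueError(f"line {i + 2}: no {key}= entry in the comment line")
        energies.append(float(match.group(1)))
        i += 2 + n_atoms  # jump past the atom lines to the next frame
    return energies
```

The result is one energy per structure, in file order. Per-atom forces live in extra columns of the atom lines and are easier to read with a dedicated extended XYZ reader than with a regular expression.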

Next steps

Deploying Jobs — the decorator vs prebuilt-image workflows in depth
Bucket Sync — input, output, and checkpoint buckets
Spot Instances — preemption recovery and checkpointing best practices
CLI Reference — full list of commands and flags
