<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://anycloud.sh/blog/</id>
    <title>anycloud blog</title>
    <updated>2026-04-27T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://anycloud.sh/blog/"/>
    <subtitle>Notes, guides, and benchmarks on running ML workloads across cloud GPUs.</subtitle>
    <icon>https://anycloud.sh/img/favicon.svg</icon>
    <rights>Copyright © 2026 anycloud</rights>
    <entry>
        <title type="html"><![CDATA[Training Machine Learning Interatomic Potentials on Cloud GPUs]]></title>
        <id>https://anycloud.sh/blog/training-machine-learning-interatomic-potentials/</id>
        <link href="https://anycloud.sh/blog/training-machine-learning-interatomic-potentials/"/>
        <updated>2026-04-27T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A practical guide to training MLIPs (MACE, NequIP, CHGNet, MACE-MP) on cloud GPUs — architecture choices, honest cost ranges, spot-recovery patterns, and a working example.]]></summary>
        <content type="html"><![CDATA[<p>Machine learning interatomic potentials (MLIPs) are replacing classical force fields for many use cases in computational chemistry, and they have made simulations that used to take weeks of DFT compute possible in minutes on a single GPU. This post is a practical guide to training your own MLIP on cloud GPU spot instances: which architecture to pick, what it actually costs, and how to keep training going through preemption.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-now">Why this matters now<a href="https://anycloud.sh/blog/training-machine-learning-interatomic-potentials/#why-this-matters-now" class="hash-link" aria-label="Direct link to Why this matters now" title="Direct link to Why this matters now" translate="no">​</a></h2>
<p>Three things changed in the last two years that turned MLIPs from a research tool into a production one:</p>
<p><strong>Foundation models exist now.</strong> MACE-MP-0 (Jan 2024) was trained on the Materials Project's MPtrj dataset of ~1.6M structures across 89 elements. CHGNet (Sep 2023, Nature Machine Intelligence) and GNoME (Nov 2023, Nature, DeepMind) arrived a few months before it. You can fine-tune these on a few hundred DFT calculations of your own instead of generating millions from scratch.</p>
<p><strong>NVIDIA shipped CUDA kernels specifically for them.</strong> <a href="https://developer.nvidia.com/blog/accelerate-drug-and-material-discovery-with-new-math-library-nvidia-cuequivariance/" target="_blank" rel="noopener noreferrer" class=""><code>cuEquivariance</code></a> (Nov 2024) accelerates the equivariant operations in MACE foundation models; NVIDIA reports a 5.9x speedup for both training and inference for MACE-MP Large on A100. NVIDIA only ships kernels for workloads where it expects serious commercial adoption.</p>
<p><strong>Adoption is real.</strong> The Materials Project hosts MLIP-derived datasets, the CederGroup at Berkeley ships CHGNet as a pre-trained universal potential, and a growing wave of biotech, battery, and materials startups now run MLIP-driven simulations as part of their core pipeline.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-a-machine-learning-interatomic-potential">What is a machine learning interatomic potential<a href="https://anycloud.sh/blog/training-machine-learning-interatomic-potentials/#what-is-a-machine-learning-interatomic-potential" class="hash-link" aria-label="Direct link to What is a machine learning interatomic potential" title="Direct link to What is a machine learning interatomic potential" translate="no">​</a></h2>
<p>A machine learning interatomic potential — sometimes called a neural network potential (NNP) — is a neural network that predicts the potential energy of a system of atoms, and by differentiation the forces on each atom, given only their positions and chemical identities. Train one accurately, and you can run molecular dynamics simulations or geometry optimizations at a tiny fraction of the cost of solving the underlying quantum mechanics directly.</p>
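<p>To make "forces by differentiation" concrete, here is a minimal sketch. It uses a Lennard-Jones pair energy as a stand-in for the neural network, which is an assumption made purely for illustration: a real MLIP replaces <code>energy</code> with a learned model and typically gets forces through automatic differentiation rather than finite differences. The function names are hypothetical.</p>

```python
def energy(positions, epsilon=1.0, sigma=1.0):
    """Total Lennard-Jones energy of a list of (x, y, z) positions."""
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
            s6 = (sigma * sigma / r2) ** 3
            total += 4.0 * epsilon * (s6 * s6 - s6)
    return total


def forces(positions, h=1e-5):
    """Force on each atom, F = -dE/dx, by central finite differences."""
    result = []
    for i in range(len(positions)):
        f = []
        for k in range(3):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][k] += h
            minus[i][k] -= h
            f.append(-(energy(plus) - energy(minus)) / (2.0 * h))
        result.append(tuple(f))
    return result


# Two atoms farther apart than the energy minimum attract each other.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
f = forces(atoms)
print(f[0][0], f[1][0])  # equal magnitude, opposite sign
```

<p>The same contract (positions and species in, energy and forces out) is what a molecular dynamics engine consumes, which is why a trained MLIP can stand in for DFT in the simulation loop.</p>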
<p>The reference physics is density functional theory (DFT), the workhorse of computational chemistry since the 1990s. DFT scales as O(N³) with the number of electrons; a 200-atom simulation that takes minutes per step in DFT can take milliseconds per step with a trained MLIP. Run a million-step molecular dynamics trajectory and the difference compounds: weeks of supercomputer time become hours on a single GPU.</p>
<p>The history matters because architecture choice still does:</p>
<p><strong>2007 — Behler-Parrinello neural networks.</strong> The first MLIPs to scale to real systems, using hand-crafted symmetry functions to encode atomic environments. "BPNN" is still a term you'll see in benchmarks today.</p>
<p><strong>2017–2022 — equivariance era.</strong> SchNet (2017), PaiNN, NequIP (2021), MACE (2022). Instead of hand-crafting rotational invariance, these architectures bake it into the network through equivariant message passing. Accuracy jumped sharply, especially on small training sets.</p>
<p><strong>2023–2024 — foundation models.</strong> Universal MLIPs trained on huge crystal-structure datasets. MACE-MP-0, CHGNet, GNoME. You don't always need to train from scratch; you can fine-tune a foundation model on a few hundred examples of your specific chemistry.</p>
<p>The industry pattern that emerged: train (or fine-tune) once on cloud GPUs, ship the resulting model file (typically 10–500MB), run inference cheaply on local hardware or smaller cloud instances. The training-time cost is the lever — get it right and the rest follows.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architecture-landscape-mace-nequip-chgnet-gnome-mace-mp">The architecture landscape: MACE, NequIP, CHGNet, GNoME, MACE-MP<a href="https://anycloud.sh/blog/training-machine-learning-interatomic-potentials/#the-architecture-landscape-mace-nequip-chgnet-gnome-mace-mp" class="hash-link" aria-label="Direct link to The architecture landscape: MACE, NequIP, CHGNet, GNoME, MACE-MP" title="Direct link to The architecture landscape: MACE, NequIP, CHGNet, GNoME, MACE-MP" translate="no">​</a></h2>
<p>Five architectures dominate practical use today.</p>
<table><thead><tr><th>Model</th><th>Year</th><th>Architecture type</th><th>Pre-training data</th><th>When to choose</th></tr></thead><tbody><tr><td><strong>MACE</strong></td><td>2022</td><td>Higher-order equivariant message passing</td><td>None (train on your own data)</td><td>Best accuracy/cost tradeoff for a custom potential. Default pick.</td></tr><tr><td><strong>NequIP</strong></td><td>2021</td><td>E(3) equivariant message passing</td><td>None (train on your own data)</td><td>Slightly less accurate than MACE in published benchmarks; well-supported reference</td></tr><tr><td><strong>CHGNet</strong></td><td>Sep 2023</td><td>Graph network with magnetic-moment regularization</td><td>~1.6M MPtrj structures</td><td>Charge-informed; good when magnetic information matters</td></tr><tr><td><strong>GNoME</strong></td><td>Nov 2023 (DeepMind)</td><td>Graph network</td><td>Materials Project + active learning</td><td>Materials discovery (used to find 381K new stable crystals from 2.2M candidates); weights publicly released</td></tr><tr><td><strong>MACE-MP-0</strong></td><td>Jan 2024</td><td>Higher-order equivariant message passing</td><td>~1.6M MPtrj structures, 89 elements</td><td>Universal foundation: fine-tune for any chemistry, fastest path to a working potential</td></tr></tbody></table>
<p>MACE and NequIP are typically trained from scratch on a user's own DFT calculations (hundreds to tens of thousands of configurations). The foundation-model rows show what was used for pre-training — your fine-tuning data sits on top.</p>
<p><strong>Practical rules of thumb:</strong></p>
<p>If you have ≥10K of your own DFT calculations, <strong>train a custom MACE</strong> from scratch. You'll get the best accuracy on your specific chemistry. Expect 12–48 hours on a single A100 for typical configurations.</p>
<p>If you have a few hundred to a few thousand DFT calculations, <strong>fine-tune MACE-MP-0</strong>. This is now a common pattern — leveraging the 1.6M-structure pre-training of the foundation model means your fine-tuning run converges in 1–6 hours, often on a single GPU.</p>
<p>If you're doing materials discovery (screening crystal candidates rather than detailed dynamics), <strong>try GNoME or CHGNet first</strong> — they're already trained on broad materials datasets and may not need fine-tuning at all.</p>
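<p>The rules of thumb above can be written down as a small decision helper. This is only a sketch: the function name is hypothetical, and the text leaves the range between "a few thousand" and 10K calculations open, so the helper assumes fine-tuning for everything below 10K.</p>

```python
def pick_strategy(n_dft_calcs, materials_screening=False):
    """Map the rules of thumb to a starting recommendation.

    Assumption: anything below 10K DFT calculations is treated as a
    fine-tuning case, which fills a gap the rules leave unspecified.
    """
    if materials_screening:
        return "try GNoME or CHGNet first"
    if n_dft_calcs >= 10_000:
        return "train custom MACE from scratch"
    return "fine-tune MACE-MP-0"


print(pick_strategy(800))
print(pick_strategy(50_000))
print(pick_strategy(800, materials_screening=True))
```

<p>Treat the output as a starting point; validation error on held-out DFT data is what should settle the choice.</p>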
<p>The <strong>MPtrj dataset</strong> that MACE-MP-0 and CHGNet were trained on deserves separate mention: ~1.6M structures with energies, forces, stresses, and (for CHGNet) magnetic moments, derived from Materials Project relaxation trajectories. If you're building your own pre-training corpus, MPtrj is the reference benchmark you'll want to compare against.</p>
<p>A note on speed: NVIDIA's <code>cuEquivariance</code> library (Nov 2024) accelerates the costly tensor-product operations in MACE. Per NVIDIA's published benchmarks (A100, batch size 32, FP64): 5.9x training and 5.9x inference on MACE-MP Large; 6.1x training and 7.2x inference on MACE-OFF Large. If you're running on H100s or B200s, install cuEquivariance — the speedup is essentially free.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-it-costs-to-train">What it costs to train<a href="https://anycloud.sh/blog/training-machine-learning-interatomic-potentials/#what-it-costs-to-train" class="hash-link" aria-label="Direct link to What it costs to train" title="Direct link to What it costs to train" translate="no">​</a></h2>
<p>A single MLIP training run on cloud GPUs costs somewhere between <strong>a few dollars and a few hundred</strong>, depending almost entirely on three choices.</p>
<p><strong>The three cost levers:</strong></p>
<ol>
<li class="">
<p><strong>Train from scratch vs. fine-tune a foundation model.</strong> Fine-tuning MACE-MP-0 on a few hundred molecules: typically 1–6 hours on a single A100. From-scratch training on a custom dataset: 12–48 hours on a single A100, often longer for production-grade potentials.</p>
</li>
<li class="">
<p><strong>Spot vs. on-demand instance pricing.</strong> A100 80GB spot prices currently start around $0.78/hour at the cheapest specialty providers (Thunder Compute) and run to $1.50–$2.00/hour at most spot marketplaces (Vast.ai, RunPod, similar); on-demand at major hyperscalers runs $3–$5/hour per A100. Spot is the right default for MLIP training — the workload checkpoints cleanly and most preemption recoveries cost minutes, not hours.</p>
</li>
<li class="">
<p><strong>GPU class.</strong> A100 (40GB or 80GB) is the workhorse and the recommended default. H100 trains noticeably faster on equivariant architectures with cuEquivariance; whether the speed-up justifies the ~3x price-per-hour depends on how time-sensitive the run is. B200s are now available at multiple providers but are usually overkill for MLIP training; the bottleneck shifts to data-loading rather than compute.</p>
</li>
</ol>
<p><strong>Practical cost ranges by workload (estimates derived from spot rates above and typical wall-clock times; not measurements):</strong></p>
<table><thead><tr><th>Workload</th><th>Hardware</th><th>Wall-clock</th><th>Spot cost</th><th>On-demand cost</th></tr></thead><tbody><tr><td>Fine-tune MACE-MP, 100 molecules</td><td>1× A100</td><td>1–2 hours</td><td><strong>$1–$3</strong></td><td>$4–$8</td></tr><tr><td>Fine-tune MACE-MP, 1K molecules</td><td>1× A100</td><td>4–8 hours</td><td><strong>$3–$12</strong></td><td>$16–$32</td></tr><tr><td>Train custom MACE, 10K configs</td><td>1× A100</td><td>24–48 hours</td><td><strong>$20–$70</strong></td><td>$80–$200</td></tr><tr><td>Train custom MACE, 100K configs</td><td>4× A100</td><td>48–72 hours</td><td><strong>$120–$430</strong></td><td>$640–$1700</td></tr></tbody></table>
<p>These are training-only numbers. Add 15–30% if you're including hyperparameter sweeps. Add ~5% for storage (training data + checkpoint buckets typically run a few cents per GB-month).</p>
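<p>The arithmetic behind these ranges is GPU count times wall-clock hours times hourly rate, plus the overheads just mentioned. The sketch below makes that explicit. The function names are hypothetical, and adding the sweep and storage percentages to the base cost (rather than compounding them) is an assumption.</p>

```python
def run_cost(gpus, hours, rate_per_gpu_hour, sweep_overhead=0.0, storage_overhead=0.0):
    """Estimated dollars for one run: GPU-hours times rate, plus overheads.

    Overheads are fractions of the base cost (0.30 means 30%) and are
    summed, not compounded; that choice is an assumption of this sketch.
    """
    base = gpus * hours * rate_per_gpu_hour
    return base * (1.0 + sweep_overhead + storage_overhead)


def cost_range(gpus, hours_low, hours_high, rate_low, rate_high):
    """Cheapest and priciest corner of a wall-clock and rate range."""
    return (
        run_cost(gpus, hours_low, rate_low),
        run_cost(gpus, hours_high, rate_high),
    )


# Custom MACE on 10K configs: 1x A100, 24-48 hours, spot at $0.78-$2.00/hour.
low, high = cost_range(1, 24, 48, 0.78, 2.00)
print(f"training only: ${low:.2f} to ${high:.2f}")

# Same low-end run with a 30% sweep budget and 5% storage on top.
print(f"with overheads: ${run_cost(1, 24, 0.78, 0.30, 0.05):.2f}")
```

<p>The raw corners come out wider than the rounded figures in the table, which is expected: the table takes typical rates rather than the extremes of both ranges at once.</p>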
<p><strong>Where multi-cloud matters:</strong> A100 spot capacity is often unavailable in any single region for hours at a time. Spreading submissions across AWS, GCP, and Azure when capacity is short reduces effective queue time. For a training run that takes 24 hours of compute but spends another 12 hours queued, the cost-effective answer is whichever cloud has capacity right now, not whichever has the cheapest sticker price. anycloud currently optimizes region selection within whichever cloud you submit to (via <code>--credentials</code>); choosing between clouds is still a manual decision.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-spot-recovery-problem">The spot-recovery problem<a href="https://anycloud.sh/blog/training-machine-learning-interatomic-potentials/#the-spot-recovery-problem" class="hash-link" aria-label="Direct link to The spot-recovery problem" title="Direct link to The spot-recovery problem" translate="no">​</a></h2>
<p>Spot instances are typically 3–5x cheaper than on-demand. For a 24-hour MLIP training run, that's the difference between $20 and $80. The catch: spot VMs can be reclaimed with short notice — AWS gives 2 minutes, GCP gives 30 seconds, Azure varies. You need a recovery story that doesn't lose your training progress.</p>
<p><strong>What MLIP training needs from a recovery system:</strong></p>
<ol>
<li class="">
<p><strong>Frequent, durable checkpointing.</strong> MACE and most modern MLIP trainers checkpoint at the end of each epoch. For a 1000-epoch training run with 10-second epochs, that's a checkpoint every 10 seconds. Each checkpoint is typically tens to hundreds of MB. Writing locally is fast; getting the checkpoint <em>off</em> the VM before it's reclaimed is the hard part.</p>
</li>
<li class="">
<p><strong>Automatic resume from the latest checkpoint.</strong> When the new VM provisions, the trainer needs to find the most recent checkpoint without manual intervention. MACE's <code>mace_run_train</code> supports <code>--restart_latest</code> to read from a checkpoint directory automatically.</p>
</li>
<li class="">
<p><strong>Cross-region fallback.</strong> If your spot VM gets preempted in <code>us-east-1</code>, the next instance might come up in <code>us-west-2</code>. The checkpoint storage needs to be region-agnostic (object storage works; local SSD doesn't), and the training environment needs to come up identically wherever the new VM lands. Cross-cloud fallback (preempted on AWS, resume on GCP) is a stronger version of the same idea, useful when one cloud's spot pool is fully starved.</p>
</li>
<li class="">
<p><strong>Image caching.</strong> Pulling a large MLIP container (typically 5–15GB with CUDA + PyTorch + mace-torch + dependencies) takes several minutes from scratch. Each preemption that triggers a re-pull eats into the cost savings. Per-region image caching gets the second launch down to seconds.</p>
</li>
</ol>
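<p>Points 1 and 2 can be sketched generically. The code below is not how MACE or anycloud implement checkpointing; it is a minimal illustration of two ideas: write each checkpoint atomically so a preemption mid-write never leaves a truncated file, and pick the newest checkpoint on restart. The <code>epoch-NNNNNN.pt</code> naming scheme and both function names are assumptions.</p>

```python
import os
import re
import tempfile

_EPOCH = re.compile(r"epoch-(\d+)\.pt$")  # hypothetical naming scheme


def save_checkpoint_atomically(data, directory, name):
    """Write bytes to a temp file, then rename it into place.

    os.replace is atomic on a local POSIX filesystem; a mounted object
    store may not give the same guarantee.
    """
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())
    final_path = os.path.join(directory, name)
    os.replace(tmp_path, final_path)
    return final_path


def latest_checkpoint(directory):
    """Return the path with the highest epoch number, or None if empty."""
    if not os.path.isdir(directory):
        return None
    best = None
    for entry in os.listdir(directory):
        match = _EPOCH.search(entry)
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), os.path.join(directory, entry))
    return best[1] if best else None


workdir = tempfile.mkdtemp()
save_checkpoint_atomically(b"weights-2", workdir, "epoch-000002.pt")
save_checkpoint_atomically(b"weights-10", workdir, "epoch-000010.pt")
print(latest_checkpoint(workdir))
```

<p>Selecting by the epoch number in the filename rather than by modification time keeps the choice deterministic even when files are copied between machines and timestamps are rewritten.</p>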
<p>Realistic options today: SkyPilot handles managed spot lifecycle and cross-cloud fallback well; Slurm with cluster autoscaling works for HPC-shop teams; PyTorch Lightning can do checkpointing-to-S3 with custom configuration; or use a system that bundles points 1, 3, and 4 by default — anycloud's <code>--spot</code> flag does this with container-as-unit-of-work rather than YAML configuration. (Point 2, automatic resume, is the trainer's job — MACE provides <code>--restart_latest</code> for exactly this.)</p>
<p>The cost of getting this wrong: a 24-hour run that loses its checkpoints at a preemption has to redo, on average, about 12 hours of A100 time. That is roughly $10–$40 of compute (half of the $20–$80 a full run costs), plus most of a day of calendar time. Multiply by the 5–10 training runs a typical research group submits per week and the engineering investment in good spot recovery pays for itself within a month.</p>
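<p>The 12-hour figure follows from assuming a preemption lands, on average, midway through the work done since the last durable checkpoint. A small sketch of that estimate, with a hypothetical function name and the further assumption that restart time (provisioning and image pull) is billed at the same GPU rate:</p>

```python
def expected_loss_per_preemption(checkpoint_interval_hours, restart_hours,
                                 rate_per_gpu_hour, gpus=1):
    """Expected dollars lost per preemption.

    Assumes the preemption time is uniform between two checkpoints, so
    half an interval of work is redone on average, and that restart time
    is billed like training time.
    """
    redone_hours = checkpoint_interval_hours / 2.0
    return (redone_hours + restart_hours) * rate_per_gpu_hour * gpus


# No durable checkpoints on a 24-hour run at $2.00/hour.
no_checkpoints = expected_loss_per_preemption(24, 0.0, 2.00)

# Checkpoint every 10 seconds, 5-minute restart, same rate.
per_epoch = expected_loss_per_preemption(10 / 3600, 5 / 60, 2.00)

print(f"${no_checkpoints:.2f} vs ${per_epoch:.2f}")
```

<p>With frequent checkpoints the restart overhead dominates the loss, which is why image caching matters as much as checkpoint frequency.</p>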
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="worked-example-training-mace-on-cloud-spot">Worked example: training MACE on cloud spot<a href="https://anycloud.sh/blog/training-machine-learning-interatomic-potentials/#worked-example-training-mace-on-cloud-spot" class="hash-link" aria-label="Direct link to Worked example: training MACE on cloud spot" title="Direct link to Worked example: training MACE on cloud spot" translate="no">​</a></h2>
<p>Here's a command to fine-tune MACE-MP on a small molecular dataset using AWS spot. Substitute the <code>--credentials</code> value for whichever cloud you've authenticated (<code>anycloud credentials generate aws|gcp|azure</code> provisions least-privilege IAM):</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">anycloud submit ghcr.io/anycloud-sh/mace:latest </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--credentials</span><span class="token plain"> my-aws </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  --gpu-type a100 </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--spot</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  --disk-size </span><span class="token number">200</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  --input-bucket my-training-data </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#F8F8F2"><span class="token plain">  --output-bucket my-results </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--gpus</span><span class="token plain"> all </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  -- mace_run_train </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--name</span><span class="token operator">=</span><span class="token plain">my_finetune </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--foundation_model</span><span class="token operator">=</span><span class="token plain">medium </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--train_file</span><span class="token operator">=</span><span class="token plain">/mnt/input/train.xyz </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--valid_fraction</span><span class="token operator">=</span><span class="token number">0.1</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--max_num_epochs</span><span class="token operator">=</span><span class="token number">200</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--batch_size</span><span class="token operator">=</span><span class="token number">16</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--lr</span><span class="token operator">=</span><span class="token number">0.001</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--device</span><span class="token operator">=</span><span class="token plain">cuda </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span 
class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--checkpoints_dir</span><span class="token operator">=</span><span class="token plain">/mnt/checkpoint </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--restart_latest</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--results_dir</span><span class="token operator">=</span><span class="token plain">/mnt/output</span><br></span></code></pre></div></div>
<p>What that does, in plain terms:</p>
<ul>
<li class="">Pulls the MACE container (CUDA 12.4, PyTorch, <code>mace-torch</code>)</li>
<li class="">Picks the best-priced A100 spot region within the cloud associated with <code>--credentials</code></li>
<li class="">Mounts your training data bucket at <code>/mnt/input</code> and a checkpoint bucket at <code>/mnt/checkpoint</code> (auto-created when <code>--spot</code> is set)</li>
<li class="">Runs MACE's <code>mace_run_train</code> with <code>--foundation_model=medium</code> to fine-tune from MACE-MP, and <code>--restart_latest</code> so the trainer resumes from the most recent checkpoint after any preemption</li>
<li class="">Writes the trained model and logs to your output bucket</li>
</ul>
<p>If the spot VM gets preempted partway through, anycloud provisions a new VM (potentially in a different region of the same cloud), restores the checkpoint directory, and <code>mace_run_train</code> picks up from where it left off. To cover the case where one cloud's spot pool is fully starved, submit the same job in parallel against credentials for a second cloud — whichever one acquires capacity first runs to completion.</p>
<p>The Python SDK equivalent is the same shape — see <a class="" href="https://anycloud.sh/tutorials/mace/">the full MACE tutorial</a> for the SDK version, batch inference, and exhaustive flag descriptions.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="fine-tuning-mace-mp-from-a-foundation-model">Fine-tuning MACE-MP from a foundation model<a href="https://anycloud.sh/blog/training-machine-learning-interatomic-potentials/#fine-tuning-mace-mp-from-a-foundation-model" class="hash-link" aria-label="Direct link to Fine-tuning MACE-MP from a foundation model" title="Direct link to Fine-tuning MACE-MP from a foundation model" translate="no">​</a></h2>
<p>Fine-tuning the MACE-MP-0 foundation model is now a common pattern. The model has already learned the general physics of 89 elements from ~1.6M structures in MPtrj. Your fine-tuning data only needs to teach it the specific chemistry of your system.</p>
<p><strong>When fine-tuning beats from-scratch:</strong></p>
<ul>
<li class=""><strong>You have under 5K of your own DFT calculations.</strong> From-scratch training on small datasets typically gives noisy potentials with poor extrapolation; fine-tuning gives you the foundation model's broad coverage with your domain-specific accuracy.</li>
<li class=""><strong>Your chemistry overlaps with the MPtrj distribution.</strong> Most main-group inorganic systems do; some organic chemistry doesn't (MPtrj is crystal-biased). Check your element coverage before assuming foundation-model fine-tuning will work.</li>
<li class=""><strong>You need a usable model fast.</strong> Fine-tuning runs converge in 1–6 hours; from-scratch runs take 12–48.</li>
</ul>
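<p>Checking element coverage is easy to automate. The sketch below collects the element symbols from XYZ-format text and reports any that fall outside a set you supply. The function names are hypothetical, and the covered set is deliberately left as an input: take it from the foundation model's own documentation rather than from this example.</p>

```python
def elements_in_xyz(text):
    """Collect element symbols from XYZ text with one or more frames.

    Each frame is: atom count, comment line, then one line per atom
    whose first token is the element symbol.
    """
    symbols = set()
    lines = iter(text.splitlines())
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between frames
        n_atoms = int(line)
        next(lines, None)  # skip the comment line
        for _ in range(n_atoms):
            symbols.add(next(lines).split()[0])
    return symbols


def uncovered_elements(xyz_text, covered):
    """Elements present in the data but missing from the covered set."""
    return sorted(elements_in_xyz(xyz_text) - set(covered))


sample = "3\nwater\nO 0 0 0\nH 0 0 1\nH 0 1 0\n2\nLiF\nLi 0 0 0\nF 0 0 2\n"
print(uncovered_elements(sample, {"H", "O", "Li"}))
```

<p>An empty result only says the elements are present in pre-training; it does not say the bonding environments in your system resemble the crystal-biased MPtrj data.</p>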
<p>The MACE foundation models are released at three sizes (<code>small</code>, <code>medium</code>, <code>large</code>). <code>medium</code> is the typical workhorse: it fits comfortably on a single A100 with reasonable batch sizes.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">anycloud submit ghcr.io/anycloud-sh/mace:latest </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--credentials</span><span class="token plain"> my-aws </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  --gpu-type a100 </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--spot</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  --input-bucket my-training-data </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  --output-bucket my-results </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--gpus</span><span class="token plain"> all </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">  -- mace_run_train </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--name</span><span class="token operator">=</span><span class="token plain">finetune_mp_medium </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--foundation_model</span><span class="token operator">=</span><span class="token plain">medium </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--train_file</span><span class="token operator">=</span><span class="token plain">/mnt/input/finetune.xyz </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--max_num_epochs</span><span class="token operator">=</span><span class="token number">100</span><span class="token plain"> </span><span 
class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--lr</span><span class="token operator">=</span><span class="token number">0.0001</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--device</span><span class="token operator">=</span><span class="token plain">cuda </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--checkpoints_dir</span><span class="token operator">=</span><span class="token plain">/mnt/checkpoint </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--restart_latest</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--results_dir</span><span class="token operator">=</span><span class="token plain">/mnt/output</span><br></span></code></pre></div></div>
<p>Two flags worth noting: <code>--lr=0.0001</code> (an order of magnitude lower than from-scratch training — fine-tuning needs gentler updates to preserve foundation-model knowledge) and <code>--max_num_epochs=100</code> (most fine-tuning runs converge well before this).</p>
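<p>If you launch many fine-tuning runs, it can help to generate the trainer command from a few parameters. The sketch below only assembles flags that already appear in the command above; the function name is hypothetical, and it is not a substitute for the flag reference in the MACE docs.</p>

```python
import shlex

FOUNDATION_SIZES = ("small", "medium", "large")


def finetune_command(name, train_file, size="medium", lr=0.0001, max_epochs=100):
    """Build the mace_run_train command line used for fine-tuning above."""
    if size not in FOUNDATION_SIZES:
        raise ValueError(f"unknown foundation model size: {size}")
    args = [
        "mace_run_train",
        f"--name={name}",
        f"--foundation_model={size}",
        f"--train_file={train_file}",
        f"--max_num_epochs={max_epochs}",
        f"--lr={lr}",
        "--device=cuda",
        "--checkpoints_dir=/mnt/checkpoint",
        "--restart_latest",
        "--results_dir=/mnt/output",
    ]
    return shlex.join(args)


print(finetune_command("finetune_mp_medium", "/mnt/input/finetune.xyz"))
```

<p>Keeping the learning rate as an explicit parameter makes it harder to reuse a from-scratch value for a fine-tuning run by accident.</p>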
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="further-reading">Further reading<a href="https://anycloud.sh/blog/training-machine-learning-interatomic-potentials/#further-reading" class="hash-link" aria-label="Direct link to Further reading" title="Direct link to Further reading" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://github.com/ACEsuit/mace" target="_blank" rel="noopener noreferrer" class="">MACE</a> — the reference implementation</li>
<li class=""><a href="https://github.com/ACEsuit/mace-foundations" target="_blank" rel="noopener noreferrer" class="">MACE-MP foundation models</a> on GitHub, weights on Hugging Face</li>
<li class=""><a href="https://mace-docs.readthedocs.io/en/latest/guide/finetuning.html" target="_blank" rel="noopener noreferrer" class="">MACE fine-tuning docs</a> — <code>--foundation_model</code> flag and multi-head fine-tuning</li>
<li class=""><a href="https://developer.nvidia.com/blog/accelerate-drug-and-material-discovery-with-new-math-library-nvidia-cuequivariance/" target="_blank" rel="noopener noreferrer" class="">NVIDIA cuEquivariance</a> — CUDA kernels for equivariant networks</li>
<li class=""><a class="" href="https://anycloud.sh/tutorials/mace/">MACE on anycloud — full tutorial</a> — runnable end-to-end with Python SDK and CLI variants</li>
</ul>]]></content>
        <category label="mlip" term="mlip"/>
        <category label="mace" term="mace"/>
        <category label="machine-learning" term="machine-learning"/>
        <category label="gpu" term="gpu"/>
        <category label="materials-science" term="materials-science"/>
    </entry>
</feed>