
How Ångstrom (YC S24) used Claude Code to train a model that beat Meta's UMA-OMC

· 7 min read
Luis Fernando de Pombo
Co-founder, anycloud
Laurence Midgley
Co-founder & CTO, Ångstrom AI

Ångstrom AI (YC S24), with the University of Cambridge (the Csanyi group) and AstraZeneca, released the paper "DFT Accuracy on Crystal Structure Prediction with Machine Learning Interatomic Potentials". The paper presents CSP-MACE-Å, a machine learning model designed to replace DFT, the expensive quantum mechanical calculation at the heart of crystal structure prediction, at the same accuracy but with a 10,000x speedup.

CSP-MACE-Å also significantly outperformed UMA-OMC on crystal-structure prediction benchmarks. UMA is Meta's general purpose model for atoms and molecules; UMA-OMC is the version adapted for organic molecular crystals.

Ångstrom built CSP-MACE-Å on anycloud, a CLI that runs GPU jobs across your own cloud accounts. Ångstrom pointed Claude Code at anycloud: the agent called the anycloud CLI to drive the experiment loop, roughly 100,000 GPU jobs, almost entirely on multi-cloud spot, on their own cloud accounts.

Why CSP matters to AstraZeneca

Crystal structure prediction (CSP) answers a deceptively simple question: given a molecule, what solid crystals can it form? It matters because one molecule can pack into different crystals, or polymorphs, with different solubility and stability, and a drug can suddenly convert to a new, less soluble form that breaks its manufacturing. In 1998, that nearly sank the HIV drug ritonavir: it was pulled and reformulated when an unexpected, more stable crystal form appeared, reportedly costing Abbott more than $250 million. Veritasium tells the story well in the video The Crystal That Could Destroy All Medicine. Mapping the possible forms early is how drugmakers guard against a repeat.

The workhorse of CSP is DFT (density functional theory). DFT is an extremely expensive quantum-mechanical calculation. It's accurate enough to have dominated every CSP blind test, but a single structure can take hours to days, which caps how much of the landscape anyone can afford to explore.

What CSP-MACE-Å changed

CSP-MACE-Å replaces DFT at the same accuracy while cutting cost and runtime by roughly four orders of magnitude. For AstraZeneca, calculations that took weeks with DFT now take hours with CSP-MACE-Å.

Each benchmark asks the model to rank the candidate crystal structures for a molecule, a list that includes the known crystal form, and records where the known form lands. On AstraZeneca's internal systems, CSP-MACE-Å placed the known form among its first few guesses on average: an average rank of 2.11, compared with 23.26 for UMA-OMC. On the CSP blind-test systems, the average ranks were 2.96 for CSP-MACE-Å and 3.50 for UMA-OMC. Lower is better.
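To make the metric concrete, here is a minimal sketch of an average-rank calculation. It assumes candidates are ranked by ascending predicted energy, which is the usual CSP convention but is not spelled out in this post; the function names and the toy energies are invented for illustration and are not taken from the paper.

```python
from statistics import mean


def rank_of_known_form(energies, known_id):
    """Return the 1-based position of the known structure.

    `energies` maps a structure id to its predicted energy; candidates
    are ordered from lowest to highest energy (assumed convention).
    Ties keep their insertion order because Python's sort is stable.
    """
    ordered = sorted(energies, key=energies.get)
    return ordered.index(known_id) + 1


def mean_rank(systems):
    """Average the known-form rank over (energies, known_id) pairs."""
    return mean(rank_of_known_form(e, k) for e, k in systems)


# Toy data only: made-up energies, not results from the paper.
toy_systems = [
    ({"s1": -101.2, "s2": -100.9, "s3": -99.5}, "s1"),
    ({"a": -55.0, "b": -57.3, "c": -56.1, "d": -54.2}, "c"),
    ({"x": -80.4, "y": -80.9, "z": -81.5}, "x"),
]

print(mean_rank(toy_systems))
```

In the toy data the known forms land at ranks 1, 2, and 3. A model that buries the known form deep in the list for even a few systems sees its average climb quickly, which is why the gap between 2.11 and 23.26 is so large.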

Why evaluation became 100,000 GPU jobs

Training and evaluation both ran on anycloud. Both broke into many independent jobs, and both could tolerate spot interruptions, so they could run on the cheapest spot capacity available.

Most of the GPU compute went into continually evaluating successive model versions rather than into training the model itself. Ångstrom tested CSP-MACE-Å on two evaluation suites:

  • The CSP blind tests: the field's gold-standard exam. Run by the Cambridge Crystallographic Data Centre, these are community competitions where teams are handed only a molecule's flat 2D diagram and must predict its full 3D crystal structure before the experimental answer is revealed. Doing well here is how a method earns trust.
  • AstraZeneca's own systems: a set curated from AstraZeneca's prior CSP studies, so the model is tested on exactly the kind of real, drug-relevant molecules a pharma company needs it to get right.

Across both evaluation suites, Ångstrom tested 47 systems, including cocrystals and salts. At that scale, evaluation alone became roughly a hundred thousand independent GPU jobs.

The agent-driven experiment loop

At that scale, the bottleneck was not writing one more submit command. It was deciding what to run next, launching the right batch, watching failures and spend, pulling results back, and turning the output into the next scientific decision. The same fan-out that made the loop fast also made it dangerous: a mistaken batch spec could become thousands of dollars of real GPU spend before anyone noticed.

Ångstrom researchers used Claude Code in that loop. They talked through what computational experiments to run, which batches to launch, what outputs to compare, and what plots would answer the next question. Claude then turned that plan into concrete work: launching batches of anycloud jobs, monitoring status and spend, downloading results, and generating plots and summaries for the next research decision.

Claude used the same local anycloud CLI and cloud configuration the team used by hand. The researchers stayed focused on the experiment plan and interpretation; Claude handled the fan-out and the bookkeeping between decisions.

How anycloud kept the experiment loop under control

When Claude was launching thousands of evaluation jobs, starting the jobs was only half the problem. The team also needed to know what was running, what had failed, which Claude Code session started each job, and what needed to stop.

anycloud made that visible by tagging each Claude-launched anycloud submit with the Claude Code session that started it. From a normal shell, the team could filter by agent or session, inspect failed jobs, and terminate anything still running. Each job remained a plain anycloud submit command: choose the image, GPU class, spot mode, and output bucket, and anycloud handled the VM work.

anycloud submit ghcr.io/angstrom/csp:latest \
  --gpu-type a100 --spot \
  --output-bucket angstrom-csp \
  --output-storage-credentials angstrom-results \
  --output-storage-region us-east-1 \
  -- python rank_structure.py "$compound" "$structure_id" "$model"

anycloud handles the cloud details: finding available GPU capacity, trying other regions, and restarting work when cloud providers take back spot machines.
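Because every job is the same submit command with three varying arguments, a batch is just the cross product of compounds, candidate structures, and model versions. The sketch below builds that batch as a dry run: it only assembles argument lists and prints them, and never invokes the CLI. The flags are copied from the command above; the compound names, structure ids, model names, and helper functions are hypothetical.

```python
import shlex


def submit_argv(compound, structure_id, model):
    """Argument list for one job, mirroring the submit command above."""
    return [
        "anycloud", "submit", "ghcr.io/angstrom/csp:latest",
        "--gpu-type", "a100", "--spot",
        "--output-bucket", "angstrom-csp",
        "--output-storage-credentials", "angstrom-results",
        "--output-storage-region", "us-east-1",
        "--", "python", "rank_structure.py", compound, structure_id, model,
    ]


def build_batch(structures_by_compound, models):
    """One job per (compound, structure, model) combination."""
    return [
        submit_argv(compound, structure_id, model)
        for compound, structure_ids in structures_by_compound.items()
        for structure_id in structure_ids
        for model in models
    ]


# Hypothetical inputs, for illustration only.
structures_by_compound = {
    "compound-a": ["0000", "0001", "0002"],
    "compound-b": ["0000", "0001", "0002"],
}
models = ["model-v1", "model-v2"]

batch = build_batch(structures_by_compound, models)
print(len(batch), "jobs")
print(shlex.join(batch[0]))  # shell-quoted preview of the first command
```

Printing the job count before anything launches is the cheap check that matters here: a batch spec that is off by one dimension shows up as a number that is wrong by a factor of ten or more.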

That mattered because a runaway agent-launched batch was not a harmless typo. One bad instruction could keep launching GPU work and turn into thousands of dollars of cloud spend. anycloud's spend controls are scoped per agent session: each Claude Code run gets its own spending limit. A rate cap limits live $/hr and a daily budget limits total spend, so new experiments wait in the queue instead of overspending.

Slack makes spend and blocked work visible without watching a terminal. Once enabled with anycloud notifications enable slack --webhook ..., anycloud posts a daily digest with total spend, job counts, interruption rate, median runtime, and active users. If a budget or rate cap starts blocking new jobs, anycloud posts a waiting-on-spend-cap alert. A daily budget block clears at the next daily reset; a rate-cap block clears when live spend falls back under the configured ceiling. Caps block only new jobs; already-running jobs keep running.
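The cap semantics described above can be modelled in a few lines. This is a toy model for one agent session, not anycloud's implementation: the `SpendGate` class, its fields, and the choice to record a job's cost only when it finishes are all assumptions made to keep the sketch short.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SpendGate:
    """Toy admission control for one session (hypothetical design)."""
    rate_cap_per_hr: float          # ceiling on live $/hr
    daily_budget: float             # ceiling on spend per day
    live_rate: float = 0.0          # $/hr of jobs running right now
    spent_today: float = 0.0        # simplification: updated on finish
    waiting: deque = field(default_factory=deque)

    def blocked_by(self, job_rate):
        """Name the cap that would block a new job, or None."""
        if self.spent_today >= self.daily_budget:
            return "daily-budget"
        if self.live_rate + job_rate > self.rate_cap_per_hr:
            return "rate-cap"
        return None

    def drain(self):
        """Start queued jobs in FIFO order until the head is blocked."""
        started = []
        while self.waiting and self.blocked_by(self.waiting[0][1]) is None:
            job_id, job_rate = self.waiting.popleft()
            self.live_rate += job_rate
            started.append(job_id)
        return started

    def submit(self, job_id, job_rate):
        """Queue a new job; it starts at once only if no cap blocks it."""
        self.waiting.append((job_id, job_rate))
        if job_id in self.drain():
            return "started"
        return "waiting:" + self.blocked_by(self.waiting[0][1])

    def finish(self, job_rate, cost):
        """A running job ended; freed rate may unblock queued jobs."""
        self.live_rate -= job_rate
        self.spent_today += cost
        return self.drain()

    def daily_reset(self):
        """The daily budget window rolled over."""
        self.spent_today = 0.0
        return self.drain()


gate = SpendGate(rate_cap_per_hr=10.0, daily_budget=100.0)
print(gate.submit("job-1", 6.0))      # fits under both caps
print(gate.submit("job-2", 6.0))      # would push live rate over the cap
print(gate.finish(6.0, cost=12.0))    # job-1 ends, job-2 can start
```

Nothing in the model ever stops a running job: the caps only decide whether queued work may start, which matches the behaviour described above where a rate-cap block clears as live spend falls and a budget block clears at the daily reset.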

What this unlocked for Ångstrom

"Our monthly compute spend is often more than 2x higher than our cash burn - so compute cost is a serious problem for us. anycloud has been critical for letting us use our credits across all major providers efficiently. We run our experiments almost exclusively on spot, which has significantly extended our compute runway. The bottleneck for an AI research company is the rate at which we iterate on the run experiments -> analyse results -> plan next experiments loop - anycloud lets us orchestrate hundreds of experiments each day."

- Laurence Midgley, Co-founder & CTO, Ångstrom AI

Two problems sit behind Laurence's quote: cost and iteration speed. Cloud credits are the cheapest GPUs a startup will ever touch, but they are stranded: spread across providers, each with its own quotas, regions, and spot pools. And the rate of research is capped by how fast you can run the next batch of experiments. anycloud is built for both problems: it schedules across every connected account, takes the cheapest capacity that is actually available, and runs on spot without the workload having to care which cloud it lands on. In Ångstrom's case, Claude Code drove the research loop by calling the anycloud CLI directly against the team's own cloud accounts.

Ångstrom AI is one of a new generation of AI research companies using Claude Code to increase the speed at which they can iterate on research. anycloud sits at the core of their infrastructure, powering the computational experiments behind that research loop.