Do you need a custom Docker image for GPU training?

Updated June 2026

Often no. If a public image such as pytorch/pytorch or nvidia/cuda already has the runtime your job needs, run it directly and bring your code at run time. Build and push a custom image only when the job needs dependencies no off-the-shelf image provides.

What the image is actually for

A training image has one job: provide the runtime. That means the operating system, the CUDA user-space libraries, and the ML framework, but not necessarily your code. Code can reach the container in several ways, and only the first puts it in the image:

  • Baked into the image with a Dockerfile COPY step
  • Cloned from git when the container starts
  • Installed as a package at startup
  • Read from storage mounted into the container

Separating the runtime from the code is what makes stock images viable: the image changes only when dependencies change, not on every commit.
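As a rough sketch of the clone-at-startup option, the helper below assembles a docker run command that starts a stock image, clones a repository, and runs a script from it. The function name, image tag, repository URL, and script name are placeholders for illustration. It assumes the image ships git and bash (not every stock image does) and that the host has the NVIDIA Container Toolkit so --gpus all works.

```python
import shlex


def build_run_command(image, repo_url, script, workdir="/workspace/job"):
    """Return the argv for running `script` from `repo_url` inside `image`.

    Hypothetical helper: the code is cloned when the container starts,
    so the image itself only has to provide the runtime.
    """
    startup = " && ".join([
        f"git clone --depth 1 {shlex.quote(repo_url)} {shlex.quote(workdir)}",
        f"cd {shlex.quote(workdir)}",
        f"python {shlex.quote(script)}",
    ])
    # --gpus all exposes the host GPUs to the container.
    return ["docker", "run", "--rm", "--gpus", "all", image, "bash", "-c", startup]


if __name__ == "__main__":
    cmd = build_run_command(
        "pytorch/pytorch:latest",                    # placeholder image tag
        "https://github.com/example/train-job.git",  # placeholder repository
        "train.py",                                  # placeholder entry script
    )
    print(shlex.join(cmd))
```

Because the image reference never changes between commits, a new commit only changes what git clone fetches.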

When a public image is enough

Most training and fine-tuning jobs import a framework, read data, and write checkpoints. A stock framework image already covers that, and skipping the build means no build-push loop on every change, no registry credentials to manage, and faster iteration. A public image is enough when:

  • Your dependencies import cleanly inside a stock framework image
  • The job is a script or module the image can run as a command
  • The code is reachable at run time, from git or from storage
  • You need no system packages beyond what the image ships
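One way to test the first condition is to run a small import check inside the candidate image before committing to it. The sketch below is a minimal version of that idea. The function name and the module list are illustrative, and a successful import does not prove the job will run, only that the modules load.

```python
import importlib
import sys


def failed_imports(names):
    """Try to import each module and return {name: error} for those that fail."""
    failures = {}
    for name in names:
        try:
            importlib.import_module(name)
        except Exception as exc:
            # Usually ImportError, but a broken native library can raise
            # other errors (for example OSError) at import time.
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


if __name__ == "__main__":
    # Module names come from the command line, e.g.: python check.py torch numpy
    for module, error in failed_imports(sys.argv[1:]).items():
        print(f"{module}: {error}")
```

If the check prints nothing for every module the job imports, the stock image likely covers the job's Python dependencies. Any line it does print names a dependency you would need to install at startup or bake into a custom image.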

When to build your own

  • System dependencies: the job needs OS packages, custom drivers for data loading, or tools no public image ships.
  • Compiled extensions: CUDA extensions or native code that must be compiled into the environment ahead of time.
  • Strict reproducibility: every dependency pinned and frozen, so the same image re-runs identically months later.
  • No network at startup: the container cannot fetch code or packages at run time, so everything must be baked in.
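For the strict-reproducibility case, a quick check that every requirement is pinned to an exact version can catch drift before the image is built. The sketch below is a deliberately simple illustration, not a full requirements parser. It treats only name==version lines as pinned, skips option lines such as -r, and reports everything else, including URL requirements and lines with trailing options. It also assumes the file already lists transitive dependencies, for example because it came from pip freeze.

```python
import re

# name, optional [extras], "==", an exact version without wildcards,
# and an optional environment marker after ";".
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*"
    r"[A-Za-z0-9][A-Za-z0-9.!+_-]*\s*(;.*)?$"
)


def unpinned_requirements(text):
    """Return the requirement lines in `text` that are not pinned with ==."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


if __name__ == "__main__":
    sample = "torch==2.3.0\nnumpy>=1.26\nrequests\n"  # illustrative contents
    print(unpinned_requirements(sample))
```

An empty result is necessary but not sufficient for a frozen environment: the base image tag and any OS packages need pinning as well.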

If you do build, target the platform the GPU VM runs: linux/amd64. On Apple Silicon, use docker buildx build --platform linux/amd64, since a plain build produces an arm64 image that pulls successfully but cannot start on an x86 VM. Start from a CUDA or framework base image, and prefer building in CI so each image is tied to the commit that built it.
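A minimal sketch of that build step, as a CI script might assemble it: the helper below produces a docker buildx build command that targets linux/amd64 and tags the image with the commit that built it. The function name and the repository are placeholders, and the use of a 12-character SHA prefix as the tag is an assumption for illustration.

```python
def build_command(repository, commit_sha, context=".", platform="linux/amd64"):
    """Return the argv that builds for `platform` and pushes a commit-tagged image.

    Hypothetical helper: `repository` is something like
    "ghcr.io/example/trainer" and `commit_sha` is the full git commit hash.
    """
    tag = f"{repository}:{commit_sha[:12]}"  # short SHA ties the image to its commit
    return [
        "docker", "buildx", "build",
        "--platform", platform,  # forces an x86-64 image even on an arm64 host
        "--tag", tag,
        "--push",                # push to the registry once the build succeeds
        context,
    ]


if __name__ == "__main__":
    # Placeholder values; in CI the SHA would come from `git rev-parse HEAD`.
    print(" ".join(build_command("ghcr.io/example/trainer", "0123456789abcdef0123")))
```

Passing --platform explicitly keeps the result the same whether the build runs on a laptop or a CI runner.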

Where anycloud fits

anycloud runs any pullable image on a cloud GPU VM: public images from any registry, and private images through GHCR. It does not build images, so build and push first, then submit the image reference with a command.

With the Python SDK, the @anycloud.function() decorator takes this further: point it at a stock framework image and anycloud clones your code from git onto the VM at run time. You rebuild only when dependencies change, not on every code change.
