Troubleshooting

Common errors and what to do about them. For the full state machine, see Jobs — Lifecycle.

Job states & failures

My job is `Errored` — what happened?

Errored (🪲) means your container ran but exited non-zero — your code or command failed, not the infrastructure. anycloud does not retry errored jobs. Read the output with anycloud status <id> --verbose (or job.logs()), fix the problem, and resubmit. See Jobs — Lifecycle.

What's the difference between `Failed` and `Errored`?

Failed (❌) is an infrastructure error — the VM never started. anycloud auto-retries these up to 100 times (through the Retrying state) before giving up and marking the job Failed. So Failed means every setup attempt failed, while Errored means your code ran and exited non-zero. See Jobs — Lifecycle.

My job went `Invalid` and never retried.

Invalid (🚫) means the config is wrong — for example an input bucket that doesn't exist, or a single credential the cloud rejected. Invalid jobs are never retried; fix the config and resubmit. Run anycloud status <id> for the reason. See Jobs — Lifecycle and Cloud Credentials.

What does `Terminated` mean?

Terminated (🪦) is user-initiated — someone ran anycloud terminate <id> or called job.terminate(). It's a terminal state, not a failure. Run it again with anycloud resubmit <id>.
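The four terminal states described so far differ mainly in who caused them and whether anycloud retried before giving up. As a rough summary, the sketch below encodes that in a lookup table. `TERMINAL_STATES` and `next_step` are hypothetical names used only for illustration and are not part of the anycloud SDK; the state strings are assumed to match the names used on this page.

```python
# Hypothetical summary table; the wording is taken from this page,
# the structure and names are illustrative only.
TERMINAL_STATES = {
    "Errored": {
        "cause": "your code or command exited non-zero",
        "auto_retried": False,
        "next_step": "read the logs, fix the code, resubmit",
    },
    "Failed": {
        "cause": "infrastructure setup failed on every attempt",
        "auto_retried": True,
        "next_step": "check the reason in status, then resubmit",
    },
    "Invalid": {
        "cause": "the config is wrong",
        "auto_retried": False,
        "next_step": "fix the config, resubmit",
    },
    "Terminated": {
        "cause": "a user terminated the job",
        "auto_retried": False,
        "next_step": "resubmit if it should run again",
    },
}


def next_step(state: str) -> str:
    """Return the suggested action for a terminal state name."""
    info = TERMINAL_STATES.get(state)
    if info is None:
        return f"{state} is not a terminal state covered here"
    return info["next_step"]
```

Calling `next_step("Invalid")` returns the config-fix advice, while a non-terminal name such as `"queued"` falls through to the generic message.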

My job is stuck in `queued`.

Two common causes:

- More than 50 deployments are setting up VMs at once (the per-API provisioning cap), so the rest wait for a slot.
- A spend control (throttle or budget) is at its cap.

Either way the blocking reason shows in anycloud status <id> and anycloud list, and the job auto-dispatches once a slot frees or the cap clears. See Jobs — Lifecycle and Spend Controls.

My spot job keeps restarting (`Recovering`).

That's expected on spot. When the cloud preempts the VM, anycloud detects it and re-provisions from scratch through the Recovering (🔁) state. Write checkpoints to /mnt/checkpoint frequently so a restart resumes with minimal lost work. See Jobs — Spot recovery.
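One way to make a job resumable is sketched below: save progress atomically and load it on startup. It assumes /mnt/checkpoint behaves like a regular filesystem where a rename is atomic. The file name `state.json`, the `every` interval, and all function names are hypothetical choices for illustration.

```python
import json
import os
import tempfile
from pathlib import Path

CHECKPOINT_DIR = Path("/mnt/checkpoint")  # path named on this page


def save_checkpoint(state: dict, directory: Path = CHECKPOINT_DIR) -> Path:
    """Write state to a temp file, then rename, so a preemption mid-write
    cannot leave a half-written checkpoint behind."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "state.json"  # hypothetical file name
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, target)  # atomic on the same filesystem
    return target


def load_checkpoint(directory: Path = CHECKPOINT_DIR) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists."""
    target = directory / "state.json"
    if not target.exists():
        return {"step": 0}
    return json.loads(target.read_text())


def train(total_steps: int, directory: Path = CHECKPOINT_DIR, every: int = 10) -> int:
    """Resume from the last saved step and checkpoint every `every` steps."""
    state = load_checkpoint(directory)
    for step in range(state["step"], total_steps):
        # ... one unit of real work would go here ...
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(state, directory)
    save_checkpoint(state, directory)
    return state["step"]
```

With this shape, a restart after preemption loses at most `every` steps of work; a smaller interval trades more write overhead for less repeated work.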

How do I get logs from a job that already failed?

anycloud status <id> --verbose works after the fact. In the SDK, job.wait() raises DeploymentFailedError on a terminal failure, and the exception carries .logs and .state. See Error Handling.

My job was killed (exit 137) or ran out of disk.

Exit code 137 means the container was killed — usually out of memory. Raise --memory (and --shm-size for PyTorch DataLoaders and other shared-memory use). No space left on device means the disk filled up — raise --disk-size. See Docker runtime options.
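The number 137 follows a general convention: when a process is killed by signal N, shells and Docker report exit code 128 + N, and 137 - 128 = 9 is SIGKILL, the signal the kernel's out-of-memory killer sends. The small decoder below illustrates the arithmetic; `describe_exit_code` is a hypothetical helper, not an anycloud function.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a container exit code using the 128 + signal convention."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"
```

A code of 137 therefore says the process was killed from outside and did not exit on its own, which is why raising memory is the usual fix.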

Setup & credentials

`anycloud api start` fails, or the SDK can't reach the API.

The local API runs as a Docker container, so Docker must be running first. Check anycloud api status; if it's down, run anycloud api start. The CLI and SDK connect to API_URL when set, otherwise ~/.anycloud/api-url when present, otherwise http://localhost:8080. Run anycloud api info to see the active target. The local API binds to 127.0.0.1 and is not intended for LAN exposure. See anycloud api.
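The lookup order described above can be sketched as a small function. This is an illustration of the precedence, not the SDK's actual code: it assumes the api-url file holds the URL as plain text and that an empty API_URL counts as unset. `resolve_api_url` is a hypothetical name.

```python
import os
from pathlib import Path

DEFAULT_API_URL = "http://localhost:8080"


def resolve_api_url(env=None, home=None) -> str:
    """Resolve the API target: API_URL, then ~/.anycloud/api-url, then default."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # 1. The environment variable wins when set (assumed: empty means unset).
    from_env = env.get("API_URL", "").strip()
    if from_env:
        return from_env

    # 2. Otherwise the saved file, if present (assumed: plain-text URL).
    url_file = home / ".anycloud" / "api-url"
    if url_file.is_file():
        from_file = url_file.read_text().strip()
        if from_file:
            return from_file

    # 3. Otherwise the local default.
    return DEFAULT_API_URL
```

If the CLI and SDK seem to talk to different servers, a stale API_URL in one shell is a likely culprit, since it overrides the saved file.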

`anycloud api start` says the server is already running.

A container is already up. Stop it with anycloud api stop, then run anycloud api start again. (anycloud update restarts a running API server for you, so you normally don't need to do this after an update.)

`anycloud submit --local` says Local Docker control is not enabled.

Restart the local API with Docker control enabled:

anycloud api stop
anycloud api start --enable-local-docker

This mounts /var/run/docker.sock into the API container. That socket grants host-root-equivalent Docker control, so keep the API local to your machine.

`No authentication token found`.

The SDK reads your token from anycloud login (or, in CI, the GITHUB_TOKEN env var). Run anycloud login, or set GITHUB_TOKEN. See Environment Variables and api and utilities.

The SDK can't decide which credential to use.

The SDK auto-selects a saved credential only when exactly one exists. With none or several, pass Client(credentials="my-aws") or an explicit cloud_config; unlike the CLI, the SDK does not submit an unpinned multi-credential pool. Add or list credentials with anycloud credentials new / anycloud credentials list. See Configuration — Credential precedence.
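The selection rule above amounts to "an explicit choice wins, and auto-selection needs exactly one candidate". The sketch below illustrates that logic only; `select_credential` is a hypothetical function, not the SDK's implementation.

```python
from typing import Optional, Sequence


def select_credential(saved: Sequence[str], explicit: Optional[str] = None) -> str:
    """Pick a credential name the way this page describes the SDK rule."""
    if explicit is not None:
        return explicit  # an explicit choice always wins
    if len(saved) == 1:
        return saved[0]  # the only unambiguous case
    if not saved:
        raise ValueError("no saved credentials; add one or pass one explicitly")
    raise ValueError(
        f"{len(saved)} saved credentials; pass one explicitly, "
        f"for example one of {list(saved)}"
    )
```

Failing loudly on the ambiguous cases mirrors the behavior described above, where the SDK does not silently submit an unpinned multi-credential pool.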

A credential error sent my job to `Invalid` mid-run.

A pinned credential the cloud rejects (expired token, revoked key) makes the deployment Invalid immediately — retrying the same one won't help. Omit --credentials to leave compute unpinned so anycloud can try another saved named credential after an auth failure. See Cloud Credentials — Select credentials for compute.

Capacity & quota

Capacity errors keep my job in `Retrying`.

When the cloud denies capacity for the VM family, anycloud blocks that region and keeps retrying other regions — unless you pinned --region, in which case it can only retry the one region. Drop the region pin to let anycloud fail over. See GPUs & VM Types — Credentials and regions.

How do I raise my cloud quota?

Request it from the CLI, then check status:

anycloud quota request <vmType> --credential <name>
anycloud quota status --credential <name>

Re-running against a region that already has an open case returns SKIPPED with the existing case's URL — no duplicates. See anycloud quota request.

Images & the function decorator

Image pull failed with `denied` or `401`.

For a private image, anycloud pulls it on the VM using your GitHub token. A denied / 401 pull error usually means that token is stale; re-run anycloud login to re-authenticate. For private images, GHCR is the only supported registry. See Container Images.

`git is not installed in this container image`.

@anycloud.function() clones your repo onto the VM, so the image must have git. Use a base image that includes it (e.g. python:3.11) or add RUN apt-get update && apt-get install -y git to your Dockerfile. See Function Decorator.

`anycloud could not set up your code: commit … is not on the remote. Did you push it?`

The decorator clones your repo at your current local commit, so that commit must be pushed to GitHub. Commit and push before submitting. See Jobs — Operate and debug.
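A quick pre-submit check is to ask git which remote-tracking branches already contain your current commit. The sketch below wraps `git branch -r --contains`; it assumes git is on PATH and that your remote-tracking refs are fresh (run `git fetch` first). `remote_branches_containing` is a hypothetical helper, and the `run` parameter exists only so the function can be exercised without a real repository.

```python
import subprocess


def remote_branches_containing(commit: str = "HEAD", run=subprocess.run) -> list:
    """List remote-tracking branches that contain `commit`.

    An empty list suggests the commit has not been pushed (or that the
    local remote-tracking refs are stale).
    """
    result = run(
        ["git", "branch", "-r", "--contains", commit],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

If the list comes back empty, push the commit and try again before submitting.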

`destination path '/app' already exists and is not an empty directory`.

Your image already populated /app, so the decorator's git clone can't land there. Set target_path on the decorator to an empty directory (e.g. target_path="/code"). See Decorator Parameters.

My function's arguments or return value didn't come through.

Decorated-function arguments must be JSON-serializable (str, int, float, bool, None, list, dict); pass large or complex data through an input bucket instead. Return values are discarded — write results to /mnt/output. See Function Decorator.
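To catch a non-serializable argument before submitting, you can try encoding each one with the standard `json` module. This is an approximation: whether anycloud serializes arguments with `json.dumps` is an assumption, and `json.dumps` is slightly more permissive than the type list above (for example, it accepts tuples). `find_non_json_args` is a hypothetical helper.

```python
import json


def find_non_json_args(*args, **kwargs) -> list:
    """Return the names of arguments that json.dumps rejects."""
    named = [(f"args[{i}]", value) for i, value in enumerate(args)]
    named += list(kwargs.items())

    bad = []
    for name, value in named:
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            bad.append(name)
    return bad
```

Anything this flags, such as a set, a file handle, or a custom object, is a candidate for passing through an input bucket instead.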

`ConflictError` — deployment ID already exists.

You reused a custom id. Deployment IDs are unique; choose a new one, or run anycloud resubmit <id> to re-run the existing deployment. See Error Handling.

GPU vs VM selection

Should I use `--gpu-type` or `--vm-type`?

--gpu-type (SDK gpu=) names a GPU like h100 or a100:8 and lets anycloud pick the cheapest matching instance across clouds and regions. --vm-type pins an exact instance (e.g. g6e.xlarge, or a CPU-only VM). They're mutually exclusive — set one. See GPUs & VM Types.
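The two rules above, a `name:count` GPU spec and "set one, not both", can be illustrated with a small validator. This is a sketch, not the CLI's parser: treating a missing count as 1 is an assumption, and both function names are hypothetical.

```python
from typing import Optional, Tuple


def parse_gpu_spec(spec: str) -> Tuple[str, int]:
    """Split a spec such as 'a100:8' into (name, count).

    Assumption: a spec without ':count', such as 'h100', means one GPU.
    """
    name, sep, count = spec.partition(":")
    if not name:
        raise ValueError(f"missing GPU name in {spec!r}")
    n = int(count) if sep else 1
    if n < 1:
        raise ValueError(f"GPU count must be at least 1, got {n}")
    return name, n


def check_hardware_choice(gpu: Optional[str] = None, vm_type: Optional[str] = None) -> None:
    """Enforce that exactly one of gpu / vm_type is set."""
    if gpu is not None and vm_type is not None:
        raise ValueError("gpu and vm_type are mutually exclusive; set one")
    if gpu is None and vm_type is None:
        raise ValueError("set either gpu or vm_type")
```

For example, `parse_gpu_spec("a100:8")` yields the name `a100` with a count of 8, and passing both `gpu="h100"` and `vm_type="g6e.xlarge"` to `check_hardware_choice` raises an error.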

My container can't see the GPU.

--gpu-type / --vm-type choose the hardware; anycloud sets up GPU access for you, so the container sees the GPUs with no extra flags. Pass --gpus to set a specific device count. See Container Images — GPU images.

How do I see what GPUs, VM types, or regions are available?

Use the catalog commands: anycloud gpus <provider>, anycloud vm-types <provider> <region>, anycloud regions <provider>, and anycloud pricing <provider> <vm-type>. See Catalog and pricing.
