
Troubleshooting

Common errors and what to do about them. For the full state machine, see Job Lifecycle.

Job states & failures

My job is Errored — what happened?

Errored (🪲) means your container ran but exited non-zero — your code or command failed, not the infrastructure. anycloud does not retry errored jobs. Read the output with anycloud status <id> --verbose (or job.logs()), fix the problem, and resubmit. See Error States.

What's the difference between Failed and Errored?

Failed (❌) is an infrastructure error — the VM never started. anycloud auto-retries these up to 100 times (through the Retrying state) before giving up and marking the job Failed. So Failed means every setup attempt failed, while Errored means your code ran and exited non-zero. See Auto-Retry.

My job went Invalid and never retried.

Invalid (🚫) means the config is wrong — for example an input bucket that doesn't exist, or a single credential the cloud rejected. Invalid jobs are never retried; fix the config and resubmit. Run anycloud status <id> for the reason. See Error States and Credential Recovery.

What does Terminated mean?

Terminated (🪦) is user-initiated: someone ran anycloud terminate <id> or called job.terminate(). It's a terminal state, not a failure. To run the job again, use anycloud resubmit <id>.
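
As a quick reference, the four terminal states above can be condensed into a lookup table. This is only a reading aid built from the descriptions on this page; the `StateInfo` and `triage` names are made up for illustration and are not part of the anycloud SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateInfo:
    cause: str          # what put the job in this state
    auto_retried: bool  # whether anycloud retried on its own before giving up
    next_step: str      # what this page suggests doing next


# Summary of the terminal states described above (illustrative, not SDK data).
TRIAGE = {
    "Errored": StateInfo("your code or command exited non-zero", False,
                         "read the logs, fix the problem, resubmit"),
    "Failed": StateInfo("infrastructure: the VM never started", True,
                        "retries are already exhausted; see Auto-Retry"),
    "Invalid": StateInfo("the config or a credential was rejected", False,
                         "fix the config, resubmit"),
    "Terminated": StateInfo("a user terminated the job", False,
                            "use anycloud resubmit <id> to run it again"),
}


def triage(state: str) -> str:
    """Return a one-line summary for a terminal state name."""
    info = TRIAGE.get(state)
    if info is None:
        return f"{state}: not one of the terminal states covered here"
    retried = "auto-retried first" if info.auto_retried else "never auto-retried"
    return f"{state}: {info.cause} ({retried}); {info.next_step}"


print(triage("Errored"))
```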

My job is stuck in queued.

Two common causes:

  1. More than 50 deployments are setting up VMs at once (the per-API provisioning cap), so the rest wait for a slot.
  2. A spend control — throttle or budget — is at its cap.

Either way the blocking reason shows in anycloud status <id> and anycloud list, and the job auto-dispatches once a slot frees or the cap clears. See Provisioning Concurrency and Spend Controls.

My spot job keeps restarting (Recovering).

That's expected on spot. When the cloud preempts the VM, anycloud detects it and re-provisions from scratch through the Recovering (🔁) state. Write checkpoints to /mnt/checkpoint frequently so a restart resumes with minimal lost work. See Spot Recovery and Spot Instances.
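
A minimal sketch of a checkpoint-and-resume loop, assuming your job keeps its progress in a small JSON-serializable dict. Only the /mnt/checkpoint path comes from this page; the file name `state.json`, the helper names, and the step-counting loop are illustrative assumptions. Writing to a temporary file and renaming it keeps a preemption mid-write from leaving a half-written checkpoint.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

CHECKPOINT_DIR = Path("/mnt/checkpoint")  # path named in the text above


def save_checkpoint(state: dict, directory: Path = CHECKPOINT_DIR,
                    name: str = "state.json") -> Path:
    """Write the state to a temp file, then rename it into place."""
    directory.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    target = directory / name
    os.replace(tmp_path, target)  # replaces any previous checkpoint in one step
    return target


def load_checkpoint(directory: Path = CHECKPOINT_DIR,
                    name: str = "state.json") -> Optional[dict]:
    """Return the last saved state, or None on a fresh start."""
    path = directory / name
    if not path.exists():
        return None
    return json.loads(path.read_text())


def run(total_steps: int, directory: Path = CHECKPOINT_DIR, every: int = 10) -> int:
    """Resume from the last checkpoint; return how many steps ran this time."""
    state = load_checkpoint(directory) or {"step": 0}
    steps_run = 0
    for step in range(state["step"], total_steps):
        # ... one unit of real work would go here ...
        state["step"] = step + 1
        steps_run += 1
        if state["step"] % every == 0:
            save_checkpoint(state, directory)
    save_checkpoint(state, directory)
    return steps_run
```

A smaller `every` loses less work per preemption at the cost of more writes.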

How do I get logs from a job that already failed?

anycloud status <id> --verbose works after the fact. In the SDK, job.wait() raises JobFailedError on a terminal failure, and the exception carries .logs and .state. See Error Handling.
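
The SDK pattern above might look roughly like the following. To keep the sketch runnable without the SDK, `JobFailedError` and `FakeJob` are local stand-ins: the real exception comes from the anycloud SDK (its import path is not shown on this page), and treating `.logs` as a single string is an assumption.

```python
class JobFailedError(Exception):
    """Local stand-in for the SDK exception; carries .state and .logs."""

    def __init__(self, state: str, logs: str):
        super().__init__(f"job ended in state {state}")
        self.state = state
        self.logs = logs  # assumed here to be one newline-separated string


class FakeJob:
    """Hypothetical stand-in for an SDK job handle whose wait() fails."""

    def __init__(self, state: str, logs: str):
        self._state = state
        self._logs = logs

    def wait(self):
        raise JobFailedError(self._state, self._logs)


def wait_and_report(job, tail: int = 20) -> dict:
    """Wait for a job; on terminal failure, return its state and last log lines."""
    try:
        job.wait()
    except JobFailedError as err:
        return {"ok": False, "state": err.state,
                "log_tail": err.logs.splitlines()[-tail:]}
    return {"ok": True, "state": None, "log_tail": []}


report = wait_and_report(FakeJob("Errored", "loading data\nstep 1\nKeyError: 'x'"), tail=2)
print(report["state"], report["log_tail"])
```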

My job was killed (exit 137) or ran out of disk.

Exit code 137 means the container was killed — usually out of memory. Raise --memory (and --shm-size for PyTorch DataLoaders and other shared-memory use). No space left on device means the disk filled up — raise --disk-size. See Docker runtime options.
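
For background, exit codes above 128 conventionally mean "killed by signal (code - 128)", so 137 is signal 9 (SIGKILL), which is what the kernel's out-of-memory killer sends. A small helper, using only the Python standard library, can decode any such code; the function name is illustrative.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short description."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number has no name on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # on Linux: killed by SIGKILL
```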

Setup & credentials

anycloud api start fails, or the SDK can't reach the API.

The API runs as a Docker container, so Docker must be running first. Check anycloud api status; if it's down, run anycloud api start. The SDK connects to http://localhost:8080 by default (override with the API_URL env var). See anycloud api.
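
If you want to check reachability from Python before creating a client, a plain TCP probe is enough to tell "nothing is listening" apart from other problems. This sketch uses only the standard library and does not call the SDK; the URL resolution (API_URL, then the default) mirrors the description above, and `api_reachable` is a made-up helper name.

```python
import os
import socket
from typing import Optional
from urllib.parse import urlparse

DEFAULT_API_URL = "http://localhost:8080"  # default stated in the text above


def api_reachable(url: Optional[str] = None, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections at the API address."""
    url = url or os.environ.get("API_URL", DEFAULT_API_URL)
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


if not api_reachable():
    print("API not reachable; check that Docker is running, then run: anycloud api start")
```

A successful probe only shows that the port is open, not that the API is healthy, so anycloud api status remains the authoritative check.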

anycloud api start says the server is already running.

A container is already up. Stop it first with anycloud api stop, then anycloud api start. (anycloud update restarts a running API server for you, so you normally don't need to do this after an update.)

No authentication token found.

The SDK reads the token saved by anycloud login (or, in CI, the GITHUB_TOKEN env var). Run anycloud login, or set GITHUB_TOKEN. See Environment Variables and CI Pipeline.

The SDK can't decide which credential to use.

anycloud.Client() auto-selects a credential only when exactly one is saved. With none or several saved, pass Client(credentials="my-aws") or a cloud_config. Add or list credentials with anycloud credentials new / anycloud credentials list. See CloudConfig.
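
The selection rule can be written out as a few lines of Python. This is only an illustration of the rule as described above, not the SDK's code; `pick_credential` and its error messages are hypothetical.

```python
from typing import List, Optional


def pick_credential(saved: List[str], explicit: Optional[str] = None) -> str:
    """Illustrate the rule: auto-select only when exactly one credential is saved."""
    if explicit is not None:
        if explicit not in saved:
            raise LookupError(f"no saved credential named {explicit!r}")
        return explicit
    if len(saved) == 1:
        return saved[0]
    if not saved:
        raise ValueError("no credentials saved; add one with: anycloud credentials new")
    raise ValueError(
        "several credentials saved (" + ", ".join(saved) + "); name one explicitly"
    )


print(pick_credential(["my-aws"]))                       # my-aws
print(pick_credential(["my-aws", "my-gcp"], "my-gcp"))  # my-gcp
```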

A credential error sent my job to Invalid mid-run.

A pinned credential the cloud rejects (expired token, revoked key) makes the deployment Invalid immediately — retrying the same one won't help. Omit --credentials to leave compute unpinned so anycloud can try another saved named credential after an auth failure. See Credential Recovery.

Capacity & quota

Capacity errors keep my job in Retrying.

When the cloud denies capacity for the VM family, anycloud blocks that region and keeps retrying other regions — unless you pinned --region, in which case it can only retry the one region. Drop the region pin to let anycloud fail over. See Quota Recovery.

How do I raise my cloud quota?

Request it from the CLI, then check status:

anycloud quota request <vmType> --credential <name>
anycloud quota status --credential <name>

Re-running against a region that already has an open case returns SKIPPED with the existing case's URL — no duplicates. See anycloud quota request.

Images & the function decorator

Image pull failed with denied or 401.

For a private image, anycloud pulls it on the VM using your GitHub token. A denied / 401 pull error usually means that token is stale; re-run anycloud login to re-authenticate. Private images are supported only on GHCR. See Docker.

git is not installed in this container image.

@anycloud.function() clones your repo onto the VM, so the image must have git. Use a base image that includes it (e.g. python:3.11) or add RUN apt-get update && apt-get install -y git to your Dockerfile. See Function Decorator.

anycloud could not set up your code: commit … is not on the remote. Did you push it?

The decorator clones your repo at your current local commit, so that commit must be pushed to GitHub. Commit and push before submitting. See Debugging a Job.

destination path '/app' already exists and is not an empty directory.

Your image already populated /app, so the decorator's git clone can't land there. Set target_path on the decorator to an empty directory (e.g. target_path="/code"). See Decorator Parameters.

My function's arguments or return value didn't come through.

Decorated-function arguments must be JSON-serializable (str, int, float, bool, None, list, dict); pass large or complex data through an input bucket instead. Return values are discarded — write results to /mnt/output. See Function Decorator.
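
A rough local pre-check, assuming you want to catch bad arguments before submitting: try `json.dumps` on each argument and report the ones that fail. The helper names and the `result.json` file name are illustrative; only the /mnt/output path comes from this page. Note that `json.dumps` accepts tuples and non-string dict keys but converts them (to lists and to string keys), so passing this check does not guarantee the value arrives unchanged.

```python
import json
from pathlib import Path


def non_serializable(*args, **kwargs) -> list:
    """Return labels of the arguments that json.dumps rejects."""
    labeled = [(f"args[{i}]", value) for i, value in enumerate(args)]
    labeled += [(f"kwargs[{key!r}]", value) for key, value in kwargs.items()]
    bad = []
    for label, value in labeled:
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            bad.append(label)
    return bad


def write_result(result, directory: str = "/mnt/output",
                 name: str = "result.json") -> Path:
    """Persist a result as JSON, since return values are discarded."""
    path = Path(directory) / name
    path.write_text(json.dumps(result))
    return path


print(non_serializable(3, "adam", {"lr": 0.1}))   # []
print(non_serializable({1, 2}, model=object()))   # ["args[0]", "kwargs['model']"]
```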

ConflictError — deployment ID already exists.

You reused a custom id. Deployment IDs are unique; use a new one, or anycloud resubmit <id> to re-run the existing deployment. See Error Handling.

GPU vs VM selection

Should I use --gpu-type or --vm-type?

--gpu-type (SDK gpu=) names a GPU like h100 or a100:8 and lets anycloud pick the cheapest matching instance across clouds and regions. --vm-type pins an exact instance (e.g. g6e.xlarge, or a CPU-only VM). They're mutually exclusive — set one. See GPU Type vs VM Type.
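
To make the two options concrete, here is a small validation sketch. It assumes that the number after the colon in a value like a100:8 is a GPU count and that a bare name means one GPU; this page does not spell that out, and both function names are hypothetical rather than part of the SDK.

```python
from typing import Optional, Tuple


def parse_gpu_spec(spec: str) -> Tuple[str, int]:
    """Split 'a100:8' into ('a100', 8); a bare 'h100' is assumed to mean 1."""
    name, sep, count = spec.strip().partition(":")
    if not name:
        raise ValueError(f"missing GPU name in {spec!r}")
    n = int(count) if sep else 1  # int() raises ValueError on a non-numeric count
    if n < 1:
        raise ValueError(f"GPU count must be at least 1 in {spec!r}")
    return name.lower(), n


def check_hardware_choice(gpu_type: Optional[str] = None,
                          vm_type: Optional[str] = None):
    """Reject the one combination the text rules out: both options set."""
    if gpu_type and vm_type:
        raise ValueError("--gpu-type and --vm-type are mutually exclusive; set one")
    if gpu_type:
        return ("gpu", parse_gpu_spec(gpu_type))
    if vm_type:
        return ("vm", vm_type)
    return None


print(check_hardware_choice(gpu_type="a100:8"))     # ('gpu', ('a100', 8))
print(check_hardware_choice(vm_type="g6e.xlarge"))  # ('vm', 'g6e.xlarge')
```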

My container can't see the GPU.

--gpu-type / --vm-type choose the hardware; --gpus (e.g. --gpus all) is the Docker runtime flag that exposes the GPUs to the container, and some images also need --runtime nvidia. See GPU Support.

How do I see what GPUs, VM types, or regions are available?

Use the catalog commands: anycloud gpus <provider>, anycloud vm-types <provider> <region>, anycloud regions <provider>, and anycloud pricing <provider> <vm-type>. See CLI Reference.