
Bucket Sync Internals

How buckets are synced to files mounted inside the container, under the hood.

Data Flow

Input Bucket (s3:/azureblob:/gcs:)
↓ rclone sync (download once on startup)
VM Directory (/mnt/{inputBucket})
↕ Docker bind mount :ro
Container (/mnt/{inputBucket}) - READ ONLY

Output Bucket (s3:/azureblob:/gcs:)
↑ rclone copy (upload only, continuous every ~60s)
VM Directory (/mnt/{outputBucket})
↕ Docker bind mount
Container (/mnt/{outputBucket}) - WRITE

Checkpoint Bucket (s3:/azureblob:/gcs:{deployment-id})
↓ rclone copy (download once on startup)
↑ rclone copy (upload only, continuous every ~60s)
VM Directory (/mnt/checkpoint)
↕ Docker bind mount
Container (/mnt/checkpoint) - READ + WRITE

Zero-Copy Access

Containers access synced files via Docker bind mounts — files exist once on the VM disk and containers see them directly. No duplication, no copying.

-v /mnt/input:/mnt/input:ro           # input (read-only)
-v /mnt/output:/mnt/output            # output
-v /mnt/checkpoint:/mnt/checkpoint    # checkpoint
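
As a rough illustration, the mount flags above could be assembled as follows. The `Mount` class and `bind_mount_args` helper are hypothetical names used only for this sketch, not part of anycloud; the only thing taken from Docker is the `-v host:container[:ro]` syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mount:
    path: str        # same path on the VM and inside the container
    read_only: bool  # True for input buckets


def bind_mount_args(mounts: list[Mount]) -> list[str]:
    """Build `docker run` -v flags for a list of mounts."""
    args: list[str] = []
    for m in mounts:
        spec = f"{m.path}:{m.path}"
        if m.read_only:
            spec += ":ro"  # container cannot modify input data
        args += ["-v", spec]
    return args


mounts = [
    Mount("/mnt/input", read_only=True),
    Mount("/mnt/output", read_only=False),
    Mount("/mnt/checkpoint", read_only=False),
]
print(" ".join(bind_mount_args(mounts)))
```

Because the host and container paths are identical, the same path works in logs and scripts on either side of the mount.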

Sync Strategies

Input: Download Once

rclone sync bucket → VM runs once on startup. No supervisor service — the data is static.

Output: Upload Only

A supervisord service runs rclone copy VM → bucket every ~60 seconds. Uses copy (not sync) so multiple jobs can safely write to the same bucket without deleting each other's files.
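
The difference between copy and sync semantics can be seen in a toy model. This sketch does not call rclone; it represents a directory or bucket as a dict of path to content, and the function names `copy_files` and `sync_files` are made up for illustration.

```python
def copy_files(src: dict[str, str], dst: dict[str, str]) -> dict[str, str]:
    """Toy model of copy: add or overwrite files, never delete."""
    merged = dict(dst)
    merged.update(src)
    return merged


def sync_files(src: dict[str, str], dst: dict[str, str]) -> dict[str, str]:
    """Toy model of sync: destination ends up identical to source."""
    return dict(src)


job_a = {"a/result.csv": "A"}
job_b = {"b/result.csv": "B"}

# Two jobs uploading with copy: both sets of files survive.
bucket = copy_files(job_a, {})
bucket = copy_files(job_b, bucket)
print(sorted(bucket))

# The same sequence with sync: job B's upload removes job A's file.
bucket = sync_files(job_a, {})
bucket = sync_files(job_b, bucket)
print(sorted(bucket))
```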

Checkpoint: Download Once, Then Upload

On startup, rclone copy bucket → VM downloads existing checkpoint data (e.g., from a previous preempted run). Then a supervisord service runs rclone copy VM → bucket every ~60 seconds to upload changes. This is a one-way upload, the same strategy as output buckets.
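
The three strategies can be summarized as a small lookup of what runs once at startup and what runs periodically. This is a sketch under stated assumptions: `sync_plan` and the example remote `s3:my-bucket` are hypothetical, and the real commands carry additional flags (see the supervisord configs).

```python
def sync_plan(kind: str, remote: str, local: str) -> dict[str, list[list[str]]]:
    """Return rclone commands to run at startup and then every ~60 seconds."""
    download_sync = ["rclone", "sync", remote, local]
    download_copy = ["rclone", "copy", remote, local]
    upload_copy = ["rclone", "copy", local, remote]
    plans = {
        # Static data: mirror once, no background service.
        "input": {"startup": [download_sync], "periodic": []},
        # Results: upload only, never delete remote files.
        "output": {"startup": [], "periodic": [upload_copy]},
        # Checkpoints: restore once, then upload like an output bucket.
        "checkpoint": {"startup": [download_copy], "periodic": [upload_copy]},
    }
    return plans[kind]


for kind in ("input", "output", "checkpoint"):
    print(kind, sync_plan(kind, "s3:my-bucket", f"/mnt/{kind}"))
```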

Supervisord Configs

Output bucket:

[program:rclone-output]
command=/bin/bash -c "while true; do \
  /usr/bin/rclone copy /mnt/output ${remotePath} --verbose \
  --fast-list --no-update-modtime --modify-window 1s \
  --log-file=/var/log/rclone-output.log && \
  sleep 60; done"
autostart=true
autorestart=true
startsecs=5
startretries=999

Checkpoint bucket:

[program:rclone-checkpoint]
command=/bin/bash -c "while true; do \
  /usr/bin/rclone copy /mnt/checkpoint ${remotePath} \
  --verbose --fast-list --no-update-modtime --modify-window 1s && \
  sleep 60; done"
autostart=true
autorestart=true
startsecs=5
startretries=999

VM Initialization Sequence

  1. API creates VM with SSH-only startup script
  2. After SSH is accessible, initializeVM() installs Docker and supervisord
  3. initRclone() installs rclone and configures each bucket type
  4. Connectivity check uses --retries 30 --retries-sleep 15s (covers Azure MSI propagation delays up to 7.5 min)
  5. Container starts with bind mounts for all configured buckets
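
The 7.5 minute figure in step 4 follows from the two retry flags. A minimal check of the arithmetic, ignoring the time each attempt itself takes (the function name is illustrative only):

```python
def retry_budget_seconds(retries: int, sleep_seconds: float) -> float:
    """Approximate upper bound on time spent sleeping between attempts."""
    return retries * sleep_seconds


budget = retry_budget_seconds(30, 15)
print(budget, "seconds =", budget / 60, "minutes")
```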

Cloud-Native Authentication

anycloud uses cloud-native auth — no keys or secrets in containers:

  • AWS — IAM instance roles with S3 bucket policy
  • Azure — Managed identities with Storage Blob Data Contributor role
  • GCP — Service accounts with Storage Object Admin role

Credentials are automatically available to rclone via cloud metadata services.
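
For illustration, an rclone configuration with no stored secrets might look like the one rendered below. This is a sketch, not anycloud's actual config: the remote names mirror the `s3:`, `azureblob:` and `gcs:` prefixes used above, `env_auth` and `use_msi` are rclone backend options, and whether anycloud sets exactly these options is an assumption. `render_rclone_conf` is a hypothetical helper.

```python
import configparser
import io


def render_rclone_conf(azure_account: str) -> str:
    """Render an rclone.conf that relies on instance credentials only."""
    conf = configparser.ConfigParser()
    # AWS: pick up credentials from the IAM instance role.
    conf["s3"] = {"type": "s3", "provider": "AWS", "env_auth": "true"}
    # Azure: authenticate with the VM's managed identity.
    conf["azureblob"] = {
        "type": "azureblob",
        "account": azure_account,
        "use_msi": "true",
    }
    # GCP: use the service account attached to the VM.
    conf["gcs"] = {"type": "google cloud storage", "env_auth": "true"}
    buf = io.StringIO()
    conf.write(buf)
    return buf.getvalue()


print(render_rclone_conf("anycloudus12345678"))
```

Note that no access key, secret, or token appears anywhere in the rendered file.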

Continent-Based Storage

Storage is organized by continent to balance data sharing with latency and egress costs:

Scenario            Latency      Egress Cost
Within continent    10-30 ms     ~$0.01-0.02/GB
Cross-continent     70-150 ms    ~$0.05-0.12/GB
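
To make the cost difference concrete, the approximate per-GB rates from the table can be applied to a transfer size. The 500 GB figure is an arbitrary example, and actual pricing varies by provider and region.

```python
# Approximate (low, high) egress rates in USD per GB, from the table above.
RATES_PER_GB = {
    "within_continent": (0.01, 0.02),
    "cross_continent": (0.05, 0.12),
}


def egress_cost_range(gigabytes: float, scenario: str) -> tuple[float, float]:
    """Return the (low, high) estimated egress cost in USD."""
    low, high = RATES_PER_GB[scenario]
    return (round(gigabytes * low, 2), round(gigabytes * high, 2))


print(egress_cost_range(500, "within_continent"))  # (5.0, 10.0)
print(egress_cost_range(500, "cross_continent"))   # (25.0, 60.0)
```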

Continent Codes

Code     Regions
us       US East, US West, US Central
eu       West Europe, North Europe, UK, France, Germany
apac     East Asia, Southeast Asia, Japan, Korea, India
aus      Australia East/Southeast
latam    Brazil, Mexico, Chile
mea      UAE, South Africa, Israel
ca       Canada Central/East
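
A region-to-continent lookup along these lines could be sketched as below. The keys are a partial, illustrative set of Azure region identifiers; the table anycloud actually uses, and how it maps AWS and GCP regions, is not shown in this document.

```python
# Illustrative subset only; not anycloud's actual lookup table.
REGION_TO_CONTINENT = {
    "eastus": "us",
    "westus": "us",
    "centralus": "us",
    "westeurope": "eu",
    "northeurope": "eu",
    "eastasia": "apac",
    "southeastasia": "apac",
    "australiaeast": "aus",
    "brazilsouth": "latam",
    "uaenorth": "mea",
    "canadacentral": "ca",
    "canadaeast": "ca",
}


def continent_for(region: str) -> str:
    """Return the continent code for a region, or raise on unknown input."""
    try:
        return REGION_TO_CONTINENT[region]
    except KeyError:
        raise ValueError(f"unknown region: {region}") from None


print(continent_for("eastus"), continent_for("canadaeast"))
```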

First-wins location: The storage location is set by whichever region deploys first within that continent. Subsequent deploys in other regions of the same continent reuse the existing storage.
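
The first-wins rule amounts to "set if absent". A minimal sketch, assuming the continent-to-location record is a plain dict (in practice it would be derived from the cloud provider's state); `storage_location` is a hypothetical name:

```python
def storage_location(existing: dict[str, str], continent: str, region: str) -> str:
    """Return the storage region for a continent, claiming it if unset."""
    return existing.setdefault(continent, region)


locations: dict[str, str] = {}
print(storage_location(locations, "us", "westus"))       # first deploy wins
print(storage_location(locations, "us", "eastus"))       # reuses westus
print(storage_location(locations, "eu", "northeurope"))  # separate continent
```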

AWS S3

  • Buckets use user-specified names, tagged with continent metadata
  • One IAM role per deployment: anycloud-{deploymentId}
  • Bucket cleanup uses GetBucketLocation to find the bucket's region — this handles cases where a spot recovery moved the deployment to a different region than where the bucket was originally created
  • Continent validation: deploying to a different continent than an existing bucket fails with an error:
Error: Bucket 'my-results' already exists in continent 'us' but you are deploying to continent 'apac'.
Please use a different bucket name in anycloud.yaml.

Resolution: use a different bucket name or deploy to a region in the same continent.
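
The continent check can be sketched as a simple comparison that raises the error shown above. The exception class and function name are hypothetical; only the message text is taken from this document.

```python
class ContinentMismatchError(Exception):
    """Raised when a bucket lives in a different continent than the deploy."""


def validate_bucket_continent(
    bucket: str, bucket_continent: str, deploy_continent: str
) -> None:
    """Raise if the existing bucket and the deployment are on different continents."""
    if bucket_continent != deploy_continent:
        raise ContinentMismatchError(
            f"Bucket '{bucket}' already exists in continent '{bucket_continent}' "
            f"but you are deploying to continent '{deploy_continent}'.\n"
            "Please use a different bucket name in anycloud.yaml."
        )


validate_bucket_continent("my-results", "us", "us")  # same continent: no error
try:
    validate_bucket_continent("my-results", "us", "apac")
except ContinentMismatchError as exc:
    print(f"Error: {exc}")
```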

Azure Blob Storage

Azure uses storage accounts with containers inside them:

Azure Subscription
├── anycloud-eastus (resource group) ← VMs
├── anycloud-westus (resource group) ← VMs
└── anycloud-storage-us (resource group) ← Shared storage for all US regions
    └── anycloudus12345678 (storage account)
        └── my-bucket (container)

  • Storage account: anycloud{continent}{subscriptionPrefix} (max 24 chars)
  • Resource group: anycloud-storage-{continent}
  • Storage accounts are accessible via URL from any region within the continent
  • Race condition on simultaneous creates is handled (Azure rejects duplicate names globally)
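
The naming scheme above can be sketched as follows. How `subscriptionPrefix` is derived is not stated here; this example assumes it is the first 8 alphanumeric characters of the subscription ID, which matches the `anycloudus12345678` example but is otherwise a guess. Azure storage account names must be 3 to 24 characters of lowercase letters and digits.

```python
import re

MAX_ACCOUNT_NAME_LEN = 24  # Azure storage account name limit


def storage_account_name(continent: str, subscription_id: str) -> str:
    """Build anycloud{continent}{subscriptionPrefix} (prefix rule is assumed)."""
    prefix = re.sub(r"[^a-z0-9]", "", subscription_id.lower())[:8]
    name = f"anycloud{continent}{prefix}"
    if len(name) > MAX_ACCOUNT_NAME_LEN:
        raise ValueError(f"storage account name too long: {name}")
    return name


def storage_resource_group(continent: str) -> str:
    """Build the shared storage resource group name for a continent."""
    return f"anycloud-storage-{continent}"


# The subscription ID below is a made-up placeholder.
sub_id = "12345678-aaaa-bbbb-cccc-000000000000"
print(storage_account_name("us", sub_id))
print(storage_resource_group("us"))
```

Even with the longest continent code (`latam`), the name stays at 21 characters, within the 24-character limit.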

GCP Cloud Storage

  • GCS bucket per continent: anycloud-{continent}-{project-prefix}
  • Service account with Storage Object Admin role