Compute autoscaling with YellowDog

Set up YellowDog as a compute provider and tune Helix's autoscaler. Covers the floor, on-demand scale-up, and idle scale-down policies.

Helix Compute is the autoscaler. It brings sandbox hosts (Runners) into existence when demand needs them and releases them when demand drains. Underneath it sits an external compute provider, which currently means YellowDog. Helix decides when to add or remove a host. The provider decides how to actually get one.

Until you set HELIX_COMPUTE_PROVIDER, the autoscaler is off and Helix just uses whatever Runners register themselves. This page covers turning it on, pointing it at YellowDog, and tuning the scaling policy.

What each layer owns

Layer	What it owns
Helix Compute	When to ask for a host, when to release one, and how many it owns. The reconcile loop, the `sandbox_instances` rows, the Provider interface.
YellowDog	How to get a host: region, AZ, instance type, fallback chain, pool scheduling. The worker pool, the Compute Requirement Template, the Tasks.
AWS (or another cloud)	Hardware. EC2 capacity.

Helix turns "we need N hosts right now" into "submit N Work Requirements to YellowDog". YellowDog turns "N Work Requirements need workers" into "spin up EC2". The layers are independent. A future provider (GCP, Azure, on-prem Kubernetes) can plug into the Provider interface and none of the Helix-side knobs below change.
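To illustrate that separation, here is a minimal sketch of what a provider seam could look like. It is not Helix's actual Provider interface: the method names `provision` and `deprovision`, the `InMemoryProvider` stand-in, and `request_hosts` are all hypothetical and exist only to show that the autoscaler side never needs to know how a host is obtained.

```python
from typing import Dict, List, Protocol


class Provider(Protocol):
    """Hypothetical provider seam; the method names are assumptions."""

    def provision(self) -> str:
        """Request one host and return a provider-side ID."""
        ...

    def deprovision(self, provider_id: str) -> None:
        """Release the host behind provider_id."""
        ...


class InMemoryProvider:
    """Stand-in provider for illustration; it allocates nothing real."""

    def __init__(self) -> None:
        self.hosts: Dict[str, str] = {}
        self._next = 0

    def provision(self) -> str:
        self._next += 1
        provider_id = f"fake-{self._next}"
        self.hosts[provider_id] = "provisioning"
        return provider_id

    def deprovision(self, provider_id: str) -> None:
        # Removing an unknown ID is a no-op in this sketch.
        self.hosts.pop(provider_id, None)


def request_hosts(provider: Provider, count: int) -> List[str]:
    """Autoscaler side: ask for `count` hosts without knowing how they arrive."""
    return [provider.provision() for _ in range(count)]
```

Swapping `InMemoryProvider` for another class with the same two methods leaves `request_hosts` untouched, which is the property the layering is meant to give.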

Set up YellowDog as the provider

1. One-time YellowDog setup

These steps happen in the YellowDog portal, outside Helix, and you do them once. The YellowDog documentation covers each in detail:

Credentials. Generate an API key ID and secret under Applications. Treat both as secrets.
Namespace. Use the namespace your YellowDog administrator allocated for this install. Work Requirements live here.
Compute Requirement Template and Image Family. This defines the instance type, the region, and the region-fallback chain. Helix does not care about regions. If eu-west-2 is full, YellowDog quietly falls back to the next region in the template (say us-east-1). The pool spec references these by ID.
Worker pool. Provision the pool once per deployment. You set the sizing (max_nodes, workers_per_node). Keep Helix's HELIX_COMPUTE_MAX at or below max_nodes * workers_per_node, otherwise the pool runs out before the autoscaler does and Tasks queue on YellowDog instead of new hosts being provisioned.
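The sizing rule for the worker pool is simple arithmetic, and a short sanity check makes it concrete. This is a sketch only: the function names are hypothetical, and the inputs are the pool's max_nodes and workers_per_node plus the HELIX_COMPUTE_MAX value you plan to set.

```python
def pool_capacity(max_nodes: int, workers_per_node: int) -> int:
    """Total workers the YellowDog pool can ever offer."""
    return max_nodes * workers_per_node


def max_fits_pool(helix_compute_max: int, max_nodes: int, workers_per_node: int) -> bool:
    """True when the autoscaler's ceiling stays within the pool's capacity."""
    return helix_compute_max <= pool_capacity(max_nodes, workers_per_node)


# A pool of 2 nodes with 2 workers each can serve at most 4 hosts.
print(max_fits_pool(3, max_nodes=2, workers_per_node=2))  # ceiling 3 fits
print(max_fits_pool(5, max_nodes=2, workers_per_node=2))  # ceiling 5 does not
```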

The worker pool advertises a worker tag. Helix stamps that same tag on every Task it submits, so the YellowDog scheduler only hands Tasks to matching workers. Get the tag wrong and you get a quiet failure: the scheduler finds no eligible workers and the Task just sits there pending. Helix logs the resolved tag at boot, so check it there if Tasks are starving.

2. Configure Helix

Set the provider block on the control plane, either in .env or via controlplane.extraEnv in the Helm chart:

HELIX_COMPUTE_PROVIDER=yellowdog
 
HELIX_YD_KEY=...                 # API key ID
HELIX_YD_SECRET=...              # API secret
HELIX_YD_NAMESPACE=production    # the namespace allocated to this install
# HELIX_YD_WORKER_TAG=...        # defaults to "worker-<namespace>" if unset

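Once HELIX_COMPUTE_PROVIDER=yellowdog, every YellowDog field that has no default is required; the worker tag is the exception and falls back to its default when unset. If any required field is empty, Helix fails at boot rather than starting half-configured. The configuration reference lists all of them.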
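The fail-fast behaviour can be pictured with a small validation sketch. This is an illustration of the rule, not Helix's own startup code: `load_yellowdog_config` is a hypothetical helper, and the only facts taken from this page are the variable names and the `worker-<namespace>` default for the worker tag.

```python
from typing import Dict, Mapping

# Settings that must be non-empty once the provider is yellowdog.
REQUIRED = ("HELIX_YD_KEY", "HELIX_YD_SECRET", "HELIX_YD_NAMESPACE")


def load_yellowdog_config(env: Mapping[str, str]) -> Dict[str, str]:
    """Validate the provider block; raise instead of starting half-configured."""
    if env.get("HELIX_COMPUTE_PROVIDER", "") != "yellowdog":
        return {}  # provider unset: the autoscaler stays off

    missing = [name for name in REQUIRED if not env.get(name, "").strip()]
    if missing:
        raise ValueError("missing YellowDog settings: " + ", ".join(missing))

    namespace = env["HELIX_YD_NAMESPACE"]
    # The worker tag is optional and falls back to worker-<namespace>.
    worker_tag = env.get("HELIX_YD_WORKER_TAG") or f"worker-{namespace}"
    return {
        "key": env["HELIX_YD_KEY"],
        "secret": env["HELIX_YD_SECRET"],
        "namespace": namespace,
        "worker_tag": worker_tag,
    }
```

Passing `os.environ` as `env` would run the same check against the real environment.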

Restart the control plane so it reads the new environment. On Docker Compose, docker compose restart api does not re-read .env, so use docker compose up -d api to recreate the container. You should see the subsystem come up:

compute subsystem enabled; Manager will start at boot
  floor=1 max=3 scaleup_headroom_min=4 idle_timeout=600000
  reconcile_interval=30000 max_concurrent_provisions=1
  provider=yellowdog-production

The scaling policy knobs

Three knobs set the shape of the fleet: the floor it never drops below, the ceiling it never climbs above, and the trigger for adding a host in between.

HELIX_COMPUTE_FLOOR (default 0) is the warm baseline: how many hosts stay Ready or Provisioning at all times, even with no users. 0 means pure on-demand, so the first user waits for a cold start. N means N hosts you pay for around the clock. This is the latency-versus-cost call for whoever arrives first.

HELIX_COMPUTE_MAX (default 0) is the hard ceiling. Helix never owns more hosts than this. 0 switches on-demand scale-up off, so you pay for Floor only and bursts queue on YellowDog. Set it above Floor to turn scale-up on. This is your cost cap, the number that comes out of the budget conversation.

HELIX_COMPUTE_SCALEUP_HEADROOM_MIN (default 0, acts as 1 when Max > Floor) is the number of free sandbox slots Helix tries to keep spare across all Ready hosts. When sum(max_sandboxes) - sum(active_sandboxes) falls below this value and the total owned is below Max, Helix provisions another host on the next cycle. Raise it to pre-warm capacity for bursty workloads and hide the roughly 90s cold start.
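A worked example of the headroom check may help. The sketch below follows the formula above; the function names and the tuple representation of a Ready host are assumptions made for illustration, and treating a configured 0 as 1 whenever Max > Floor is my reading of "acts as 1".

```python
from typing import List, Tuple

# Each Ready host is written as (max_sandboxes, active_sandboxes).
ReadyHost = Tuple[int, int]


def headroom(ready_hosts: List[ReadyHost]) -> int:
    """Free sandbox slots: sum(max_sandboxes) - sum(active_sandboxes)."""
    return sum(m for m, _ in ready_hosts) - sum(a for _, a in ready_hosts)


def effective_headroom_min(configured: int, floor: int, max_hosts: int) -> int:
    """A configured 0 acts as 1 once scale-up is on (Max > Floor)."""
    return max(configured, 1) if max_hosts > floor else configured


def headroom_wants_host(ready_hosts: List[ReadyHost], owned: int,
                        configured_min: int, floor: int, max_hosts: int) -> bool:
    """True when headroom is under the threshold and there is room below Max."""
    threshold = effective_headroom_min(configured_min, floor, max_hosts)
    return headroom(ready_hosts) < threshold and owned < max_hosts


# Two Runners with 4 slots each, 3 and 4 in use: one free slot in total.
fleet = [(4, 3), (4, 4)]
print(headroom(fleet))
print(headroom_wants_host(fleet, owned=2, configured_min=2, floor=1, max_hosts=3))
```

With one free slot, a threshold of 2 asks for another host, while the default threshold would wait until the last slot is taken.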

The rest control timing and safety:

HELIX_COMPUTE_IDLE_TIMEOUT (default 10m) is how long a Ready host sits with zero active sessions before Helix sheds it. 0 switches scale-down off, so the fleet never shrinks below Floor. Lower it to reclaim idle capacity faster, at the cost of more churn.

HELIX_COMPUTE_MAX_CONCURRENT_PROVISIONS (default 1) is how many hosts Helix asks for per reconcile cycle. 1 is a steady single step. Raise it to cold-fill a large Floor in fewer cycles, which leans harder on the provider API.

HELIX_COMPUTE_RECONCILE_INTERVAL (default 30s) is how often the autoscaler wakes up and re-evaluates. Every scaling decision happens on this beat; nothing is event-driven. Lower means faster reactions and more load on the database and provider API.
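MaxConcurrentProvisions and the reconcile interval together bound how quickly a large Floor can be requested. The estimate below is a rough sketch under two assumptions that this page does not state: the first cycle runs immediately at boot, and every cycle asks for the full per-cycle allowance. It counts only the time until the last host is requested, not the cold start that follows.

```python
import math


def cycles_to_fill(floor: int, max_concurrent_provisions: int) -> int:
    """Reconcile cycles needed to request `floor` hosts from an empty fleet."""
    if floor <= 0:
        return 0
    return math.ceil(floor / max_concurrent_provisions)


def seconds_until_last_request(floor: int, max_concurrent_provisions: int,
                               reconcile_interval_s: float = 30.0) -> float:
    """Elapsed time before the final host is requested (first cycle at t=0)."""
    cycles = cycles_to_fill(floor, max_concurrent_provisions)
    return max(cycles - 1, 0) * reconcile_interval_s


# Floor of 10 at the defaults versus five provisions per cycle.
print(cycles_to_fill(10, 1), seconds_until_last_request(10, 1))
print(cycles_to_fill(10, 5), seconds_until_last_request(10, 5))
```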

HELIX_COMPUTE_MAX_PROVISIONING_AGE (default 30m) is how long a host may sit Provisioning before Helix gives up and rolls it back. This bounds the "stuck Work Requirement" window when cloud capacity is tight.

HELIX_COMPUTE_HARD_IDLE_TIMEOUT (default 4h) is a safety override. When fleet-pressure inhibition would otherwise keep an idle host alive, this forces it to shed once it crosses the threshold, so one session pinned at-cap can't keep idle peers running forever. 0 disables the override.

HELIX_COMPUTE_DEPLOYMENT_TAG (default auto) is the tag Helix stamps on every Work Requirement it creates, so two installs can share one YellowDog account. Defaults to helix-<namespace>. Set it explicitly when two installs share a namespace, or each will treat the other's Work Requirements as its own.

How the knobs make one decision

Every reconcile interval, Helix runs one cycle. Scale-up takes priority; scale-down only happens when no host was provisioned and there's no demand pressure, so the two never fire together.

Every reconcile interval
        │
        ▼
Read rows, roll back any host stuck Provisioning > MaxProvisioningAge
        │
        ▼
Count available hosts and headroom (free slots on Ready hosts)
        │
        ▼
Decision, in priority order:
        │
        ├── available < Floor, OR (Max > Floor AND headroom < HeadroomMin)?
        │      └── SCALE UP: provision hosts up to Max,
        │                    capped per cycle by MaxConcurrentProvisions
        │
        ├── else, a Ready host idle past IdleTimeout?
        │      └── SCALE DOWN: shed the oldest idle host, never below Floor
        │
        └── else: do nothing (steady state)
        │
        ▼
Wait for the next cycle
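The decision step in the diagram can be sketched as a pure function. This is a simplified reading of the flow above, not Helix's reconcile loop: the `Policy` and `decide` names are hypothetical, the rollback of stuck hosts and the hard idle timeout are left out, and provisioning exactly one host on a headroom trigger is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Policy:
    floor: int = 0
    max_hosts: int = 0
    headroom_min: int = 0
    max_concurrent_provisions: int = 1


def decide(available: int, headroom: int, idle_past_timeout: int,
           policy: Policy) -> Tuple[str, int]:
    """Return (action, host_count) for one reconcile cycle.

    available: hosts that are Ready or Provisioning
    headroom: free sandbox slots across Ready hosts
    idle_past_timeout: Ready hosts idle for longer than IdleTimeout
    """
    scale_up_enabled = policy.max_hosts > policy.floor
    headroom_min = max(policy.headroom_min, 1) if scale_up_enabled else 0
    pressure = scale_up_enabled and headroom < headroom_min

    # 1. Refill the floor first, capped per cycle.
    if available < policy.floor:
        wanted = policy.floor - available
        return ("scale_up", min(wanted, policy.max_concurrent_provisions))

    # 2. Demand pressure: add a host if there is room below Max.
    if pressure:
        if available < policy.max_hosts:
            return ("scale_up", 1)  # assumption: one host per headroom trigger
        return ("none", 0)  # at Max; pressure also blocks scale-down

    # 3. No pressure: shed one idle host, never dropping below Floor.
    if idle_past_timeout > 0 and available > policy.floor:
        return ("scale_down", 1)

    return ("none", 0)
```

Because scale-down is only reachable after both scale-up branches have been ruled out, the two actions cannot fire in the same cycle.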

The max_sandboxes side of the headroom sum comes from each Runner's capacity, which you set with HELIX_SANDBOX_MAX_DEV_CONTAINERS (see the configuration reference).

Tuning the policy

Pick values for your workload, not for a demo:

ScaleUpHeadroomMin. In production, 1 or 2 works well ("scale when we're nearly out of slots"). Setting it to a Runner's full max_sandboxes makes any single session trigger a scale-up. That's handy for a visible demo and wasteful in steady state.
IdleTimeout. The 10-minute default absorbs the "user closed a tab, came back five minutes later" pattern without churning EC2. Drop it only if cloud spend matters more to you than re-warm latency.
MaxConcurrentProvisions. Leave it at 1 so a brief spike doesn't over-provision a pile of hosts at once. Raise it for workloads that burst on a schedule (every Monday at 9am, twenty people log in) so a cold fill to a large Floor finishes in fewer cycles.
Max. Your cost cap. One g5.xlarge runs about $1/hr, so Max=3 caps peak spend near $3/hr. Set it wherever the budget lands, but never above max_nodes * workers_per_node, or the pool becomes the limit instead of Helix.
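The cost side of Max and Floor is plain multiplication, sketched below. The $1/hr figure is the approximate g5.xlarge price quoted above; the helper names and the 730 hours-per-month average (8760 / 12) are assumptions for illustration, so substitute your own rates.

```python
def peak_hourly_cost(max_hosts: int, price_per_host_hour: float) -> float:
    """Upper bound on hourly spend when the fleet sits at Max."""
    return max_hosts * price_per_host_hour


def baseline_monthly_cost(floor: int, price_per_host_hour: float,
                          hours_per_month: float = 730.0) -> float:
    """Spend for the warm Floor, which is paid for around the clock."""
    return floor * price_per_host_hour * hours_per_month


# Max=3 and Floor=1 at roughly $1/hr per host.
print(peak_hourly_cost(3, 1.0))
print(baseline_monthly_cost(1, 1.0))
```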

Faster cold starts

On-demand scale-up makes a user wait while the new host pulls the sandbox image. Pulling from GHCR across clouds can take around 120s for a ~10GB image, and that dominates the wait. Point Runners at a same-region mirror with HELIX_SANDBOX_REGISTRY (usually ECR in the same AWS account as the worker pool) and the pull drops to roughly 15-30s. See the sandbox image registry reference.

What stays operator-driven

Helix doesn't manage these yet, so they're on you:

Worker pool creation, provisioned once per deployment.
The Compute Requirement Template and Image Family, set up once in the YellowDog account.
Pool sizing (max_nodes, workers_per_node).
Region selection and fallback, which the Compute Requirement Template owns.

Troubleshooting

Symptom	Likely cause
Scale-up never fires despite demand	`Max` isn't above `Floor`, or `ScaleUpHeadroomMin` is too low to trip. To confirm the path works at all, temporarily set `ScaleUpHeadroomMin` to a Runner's `max_sandboxes` value so that a single session triggers a scale-up.
A new Work Requirement is submitted but the Task sits READY forever	The worker pool is down, or the worker tag doesn't match. Check the pool status and confirm `HELIX_YD_WORKER_TAG` matches what the pool advertises.
A host stays `provisioning` past `MaxProvisioningAge`	The cloud couldn't allocate capacity (g5.xlarge tight in-region, say). The age rollback recycles the row and submits a fresh Work Requirement next cycle. This is expected when capacity runs out.
Scale-down fires immediately instead of waiting `IdleTimeout`	A stale in-memory idle timer survived a previous run. Restart the control plane to clear it.
Scale-down doesn't fire after `IdleTimeout` passed	One of: demand pressure still exists somewhere (check the headroom math), a provision happened in the same cycle (it retries next cycle), or the row never reached `ready` (still provisioning, so scale-down ignores it).
A host loops on "Deprovision failed; retrying next cycle"	A transient provider error. After a bounded retry budget Helix deregisters the row anyway and logs the provider ID so you can clean up by hand.

Related

Configuration reference: compute autoscaling lists every environment variable.
Architecture for operators shows where Compute sits in the deployment.
