Compute autoscaling with YellowDog
Set up YellowDog as a compute provider and tune Helix's autoscaler. Covers the floor, on-demand scale-up, and idle scale-down policies.
Helix Compute is the autoscaler. It brings sandbox hosts (Runners) into existence when demand needs them and releases them when demand drains. Underneath it sits an external compute provider, which means YellowDog. Helix decides when to add or remove a host. The provider decides how to actually get one.
Until you set HELIX_COMPUTE_PROVIDER, the autoscaler is off and Helix just uses whatever Runners register themselves. This page covers turning it on, pointing it at YellowDog, and tuning the scaling policy.
What each layer owns
| Layer | What it owns |
|---|---|
| Helix Compute | When to ask for a host, when to release one, and how many it owns. The reconcile loop, the sandbox_instances rows, the Provider interface. |
| YellowDog | How to get a host: region, AZ, instance type, fallback chain, pool scheduling. The worker pool, the Compute Requirement Template, the Tasks. |
| AWS (or another cloud) | Hardware. EC2 capacity. |
Helix turns "we need N hosts right now" into "submit N Work Requirements to YellowDog". YellowDog turns "N Work Requirements need workers" into "spin up EC2". The layers are independent. A future provider (GCP, Azure, on-prem Kubernetes) can plug into the Provider interface and none of the Helix-side knobs below change.
Set up YellowDog as the provider
1. One-time YellowDog setup
These steps happen in the YellowDog portal, outside Helix, and you do them once. The YellowDog documentation covers each in detail:
- Credentials. Generate an API key ID and secret under Applications. Treat both as secrets.
- Namespace. Use the namespace your YellowDog administrator allocated for this install. Work Requirements live here.
- Compute Requirement Template and Image Family. This defines the instance type, the region, and the region-fallback chain. Helix does not care about regions. If
eu-west-2is full, YellowDog quietly falls back to the next region in the template (sayus-east-1). The pool spec references these by ID. - Worker pool. Provision the pool once per deployment. You set the sizing (
max_nodes,workers_per_node). Keep Helix'sHELIX_COMPUTE_MAXat or belowmax_nodes * workers_per_node, otherwise the pool runs out before the autoscaler does and Tasks queue on YellowDog instead of new hosts being provisioned.
The worker pool advertises a worker tag. Helix stamps that same tag on every Task it submits, so the YellowDog scheduler only hands Tasks to matching workers. Get the tag wrong and you get a quiet failure: the scheduler finds no eligible workers and the Task just sits there pending. Helix logs the resolved tag at boot, so check it there if Tasks are starving.
2. Configure Helix
Set the provider block on the control plane, either in .env or via controlplane.extraEnv in the Helm chart:
HELIX_COMPUTE_PROVIDER=yellowdog
HELIX_YD_KEY=... # API key ID
HELIX_YD_SECRET=... # API secret
HELIX_YD_NAMESPACE=production # the namespace allocated to this install
# HELIX_YD_WORKER_TAG=... # defaults to "worker-<namespace>" if unsetEvery YellowDog field is required once HELIX_COMPUTE_PROVIDER=yellowdog. If any is empty, Helix fails at boot rather than starting half-configured. The configuration reference lists all of them.
Restart the control plane so it reads the new environment. On Docker Compose, docker compose restart api does not re-read .env, so use docker compose up -d api to recreate the container. You should see the subsystem come up:
compute subsystem enabled; Manager will start at boot
floor=1 max=3 scaleup_headroom_min=4 idle_timeout=600000
reconcile_interval=30000 max_concurrent_provisions=1
provider=yellowdog-production
The scaling policy knobs
Three knobs set the shape of the fleet, the floor it never drops below, the ceiling it never climbs above, and the trigger for adding a host in between.
HELIX_COMPUTE_FLOOR (default 0) is the warm baseline: how many hosts stay Ready or Provisioning at all times, even with no users. 0 means pure on-demand, so the first user waits for a cold start. N means N hosts you pay for around the clock. This is the latency-versus-cost call for whoever arrives first.
HELIX_COMPUTE_MAX (default 0) is the hard ceiling. Helix never owns more hosts than this. 0 switches on-demand scale-up off, so you pay for Floor only and bursts queue on YellowDog. Set it above Floor to turn scale-up on. This is your cost cap, the number that comes out of the budget conversation.
HELIX_COMPUTE_SCALEUP_HEADROOM_MIN (default 0, acts as 1 when Max > Floor) is the free sandbox slots Helix tries to keep spare across all Ready hosts. When sum(max_sandboxes) - sum(active_sandboxes) falls below this and total owned is below Max, Helix provisions another host next cycle. Raise it to pre-warm capacity for bursty workloads and hide the roughly 90s cold start.
The rest control timing and safety:
HELIX_COMPUTE_IDLE_TIMEOUT (default 10m) is how long a Ready host sits with zero active sessions before Helix sheds it. 0 switches scale-down off, so the fleet never shrinks below Floor. Lower it to reclaim idle capacity faster, at the cost of more churn.
HELIX_COMPUTE_MAX_CONCURRENT_PROVISIONS (default 1) is how many hosts Helix asks for per reconcile cycle. 1 is a steady single step. Raise it to cold-fill a large Floor in fewer cycles, which leans harder on the provider API.
HELIX_COMPUTE_RECONCILE_INTERVAL (default 30s) is how often the autoscaler wakes up and re-evaluates. Every scaling decision happens on this beat; nothing is event-driven. Lower means faster reactions and more load on the database and provider API.
HELIX_COMPUTE_MAX_PROVISIONING_AGE (default 30m) is how long a host may sit Provisioning before Helix gives up and rolls it back. This bounds the "stuck Work Requirement" window when cloud capacity is tight.
HELIX_COMPUTE_HARD_IDLE_TIMEOUT (default 4h) is a safety override. When fleet-pressure inhibition would otherwise keep an idle host alive, this forces it to shed once it crosses the threshold, so one session pinned at-cap can't keep idle peers running forever. 0 disables the override.
HELIX_COMPUTE_DEPLOYMENT_TAG (default auto) is the tag Helix stamps on every Work Requirement it creates, so two installs can share one YellowDog account. Defaults to helix-<namespace>. Set it explicitly when two installs share a namespace, or each will treat the other's Work Requirements as its own.
How the knobs make one decision
Every reconcile interval, Helix runs one cycle. Scale-up takes priority; scale-down only happens when no host was provisioned and there's no demand pressure, so the two never fire together.
Every reconcile interval
│
▼
Read rows, roll back any host stuck Provisioning > MaxProvisioningAge
│
▼
Count available hosts and headroom (free slots on Ready hosts)
│
▼
Decision, in priority order:
│
├── available < Floor, OR (Max > Floor AND headroom < HeadroomMin)?
│ └── SCALE UP: provision hosts up to Max,
│ capped per cycle by MaxConcurrentProvisions
│
├── else, a Ready host idle past IdleTimeout?
│ └── SCALE DOWN: shed the oldest idle host, never below Floor
│
└── else: do nothing (steady state)
│
▼
Wait for the next cycle
The max_sandboxes side of the headroom sum comes from each Runner's capacity, which you set with HELIX_SANDBOX_MAX_DEV_CONTAINERS (see the configuration reference).
Tuning the policy
Pick values for your workload, not for a demo:
ScaleUpHeadroomMin. In production,1or2works well ("scale when we're nearly out of slots"). Setting it to a Runner's fullmax_sandboxesmakes any single session trigger a scale-up. That's handy for a visible demo and wasteful in steady state.IdleTimeout. The 10-minute default absorbs the "user closed a tab, came back five minutes later" pattern without churning EC2. Drop it only if cloud spend matters more to you than re-warm latency.MaxConcurrentProvisions. Leave it at1so a brief spike doesn't over-provision a pile of hosts at once. Raise it for workloads that burst on a schedule (every Monday at 9am, twenty people log in) so a cold fill to a large Floor finishes in fewer cycles.Max. Your cost cap. One g5.xlarge runs about $1/hr, soMax=3caps peak spend near $3/hr. Set it wherever the budget lands, but never abovemax_nodes * workers_per_node, or the pool becomes the limit instead of Helix.
Faster cold starts
On-demand scale-up makes a user wait while the new host pulls the sandbox image. Pulling from GHCR across clouds can take around 120s for a ~10GB image, and that dominates the wait. Point Runners at a same-region mirror with HELIX_SANDBOX_REGISTRY (usually ECR in the same AWS account as the worker pool) and the pull drops to roughly 15-30s. See the sandbox image registry reference.
What stays operator-driven
Helix doesn't manage these yet, so they're on you:
- Worker pool creation, provisioned once per deployment.
- The Compute Requirement Template and Image Family, set up once in the YellowDog account.
- Pool sizing (
max_nodes,workers_per_node). - Region selection and fallback, which the Compute Requirement Template owns.
Troubleshooting
| Symptom | Likely cause |
|---|---|
| Scale-up never fires despite demand | Max isn't above Floor, or ScaleUpHeadroomMin is too low. Set it to a Runner's max_sandboxes value to confirm the path works at all. |
| A new Work Requirement is submitted but the Task sits READY forever | The worker pool is down, or the worker tag doesn't match. Check the pool status and confirm HELIX_YD_WORKER_TAG matches what the pool advertises. |
A host stays provisioning past MaxProvisioningAge | The cloud couldn't allocate capacity (g5.xlarge tight in-region, say). The age rollback recycles the row and submits a fresh Work Requirement next cycle. This is expected when capacity runs out. |
Scale-down fires immediately instead of waiting IdleTimeout | A stale in-memory idle timer survived a previous run. Restart the control plane to clear it. |
Scale-down doesn't fire after IdleTimeout passed | One of: demand pressure still exists somewhere (check the headroom math), a provision happened in the same cycle (it retries next cycle), or the row never reached ready (still provisioning, so scale-down ignores it). |
| A host loops on "Deprovision failed; retrying next cycle" | A transient provider error. After a bounded retry budget Helix deregisters the row anyway and logs the provider ID so you can clean up by hand. |
Related
- Configuration reference: compute autoscaling lists every environment variable.
- Architecture for operators shows where Compute sits in the deployment.