Vintner

Tendril Scaling

How Tendrils scale from zero to one when jobs arrive and back to zero when idle.

Tendrils run on AWS ECS Fargate with scale-to-zero enabled by default. When a job is queued, Trellis immediately invokes a Lambda scaler via its Function URL to spin up a Tendril container. An EventBridge rule also triggers the same Lambda every minute as a fallback. When no jobs remain, the Lambda scales the service back to zero after 5 consecutive idle checks (about 5 minutes).

[Diagram: Tendril scaling architecture]

Why Scale-to-Zero

Each Tendril container ships Terraform, kubectl, Helm, AWS CLI, Google Cloud SDK, Azure CLI, and Infracost — roughly 500 MB of tooling. Running these 24/7 across multiple regions is expensive, but infrastructure provisioning happens in bursts: a user clicks "Apply," the Tendril works for 5–15 minutes, then sits idle for hours.

Scale-to-zero eliminates idle compute costs. The platform pays for Fargate compute only while jobs are actually running.

Architecture

Four components work together:

| Component | Role |
| --- | --- |
| Trellis | Invokes the Lambda scaler instantly via Function URL when a job is queued |
| Lambda scaler | Queries Supabase for queued jobs, adjusts ECS service desired count |
| EventBridge rule | Triggers the Lambda every 1 minute as a fallback (also drives scale-down) |
| ECS Fargate | Runs Tendril containers, pulls image from GHCR on scale-up |

The Lambda function runs in eu-west-1 but manages ECS services across multiple regions (currently eu-west-1 and eu-central-1). Each Tendril deployment gets its own ECS cluster, service, VPC, and secrets.

Scale-Up Flow

Job Created

A user clicks "Plan," "Apply," or "Destroy" in Trellis (or runs grape plan / grape harvest / grape destroy). A provision_jobs row is inserted with status QUEUED.

Trellis Notifies the Scaler

Immediately after inserting the job, Trellis sends a fire-and-forget POST to the Lambda scaler's Function URL. The Lambda then queries Supabase's REST API for the number of QUEUED jobs, reading the total from the Content-Range response header rather than fetching the rows. If the direct call fails, the EventBridge rule triggers the Lambda within 60 seconds, so the job is still picked up.
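
The count lookup described above could look roughly like the following. This is a minimal sketch, not the actual scaler code: the function names, the `status` filter column, and the use of a HEAD request are assumptions. It relies on the PostgREST convention that a request sent with `Prefer: count=exact` returns the total after the slash in `Content-Range` (for example `0-0/3`, or `*/0` when nothing matches).

```python
import urllib.request


def build_count_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a HEAD request that asks Supabase for an exact count of QUEUED jobs.

    Assumption: jobs live in `provision_jobs` and are filtered on a `status` column.
    """
    url = f"{base_url}/rest/v1/provision_jobs?select=id&status=eq.QUEUED"
    headers = {
        "apikey": api_key,
        "Authorization": f"Bearer {api_key}",
        "Prefer": "count=exact",  # ask PostgREST to include the total
    }
    return urllib.request.Request(url, method="HEAD", headers=headers)


def queued_count_from_content_range(header: str) -> int:
    """Extract the total from a Content-Range value such as '0-0/3' or '*/0'."""
    _, _, total = header.rpartition("/")
    if not total.isdigit():
        # '0-9/*' means the server did not compute a total
        raise ValueError(f"no total in Content-Range: {header!r}")
    return int(total)
```

Sending the request with `urllib.request.urlopen` and passing `response.headers["Content-Range"]` to the parser would complete the lookup; that step is omitted here because it needs a live Supabase project.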

ECS Scales Up

If queued jobs exist and the service's desiredCount is 0, the Lambda calls ecs:UpdateService to set desiredCount to 1.
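
As an illustration of this step, the sketch below assumes the Lambda holds a boto3 ECS client for the target region. `describe_services` and `update_service` are real boto3 ECS calls; the function name and the cluster and service names in the usage note are hypothetical.

```python
def scale_up_if_needed(ecs, cluster: str, service: str, queued: int) -> bool:
    """Set desiredCount to 1 when jobs are queued and the service is at zero.

    `ecs` is expected to behave like boto3.client("ecs", region_name=...).
    Returns True when an UpdateService call was made.
    """
    described = ecs.describe_services(cluster=cluster, services=[service])
    desired = described["services"][0]["desiredCount"]
    if queued > 0 and desired == 0:
        ecs.update_service(cluster=cluster, service=service, desiredCount=1)
        return True
    # Already running, or nothing to do: leave the service untouched
    return False
```

A call might look like `scale_up_if_needed(boto3.client("ecs", region_name="eu-central-1"), "tendril-cluster", "tendril-service", queued)`. Checking `desiredCount` first keeps repeated invocations (Function URL plus EventBridge) from issuing redundant updates.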

Container Starts

ECS Fargate pulls the Tendril Docker image from GHCR (ghcr.io/bobikenobi12/tendril:latest), injects secrets from AWS Secrets Manager, and starts the container. This takes roughly 30–60 seconds depending on image cache state.

Tendril Claims Job

The Tendril authenticates with Trellis using its worker token, enters its poll loop, and claims the queued job atomically via FOR UPDATE SKIP LOCKED. See Job Queue Pattern for claiming details.
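
For readers unfamiliar with the pattern, a claim query built on `FOR UPDATE SKIP LOCKED` generally has the shape below. This is a generic sketch, not the query Trellis runs: the `worker_id` and `created_at` columns, the `%s` placeholder style (as used by psycopg), and the function name are all assumptions.

```python
# Hypothetical schema: provision_jobs(id, status, worker_id, created_at)
CLAIM_SQL = """
UPDATE provision_jobs
SET status = 'PROCESSING', worker_id = %s
WHERE id = (
    SELECT id FROM provision_jobs
    WHERE status = 'QUEUED'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""


def claim_next_job(cursor, worker_id: str):
    """Claim the oldest QUEUED job, or return None if none is available.

    `cursor` is any DB-API cursor connected to PostgreSQL.
    """
    cursor.execute(CLAIM_SQL, (worker_id,))
    row = cursor.fetchone()
    return row[0] if row else None
```

The inner `SELECT` locks one row and skips rows already locked by other workers, so two Tendrils polling at the same time cannot claim the same job.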

Cold start latency: ~30–60 seconds from job creation to job claimed, dominated by ECS task startup (image pull + container init). The Lambda is invoked instantly by Trellis, so there is no polling delay. Subsequent jobs while the Tendril is already running are claimed within seconds.

Scale-Down Flow

When the Lambda detects zero queued jobs but the ECS service is running (desiredCount > 0), it increments an idle counter for that service:

Check 1: 0 queued, 1 running → idle 1/5
Check 2: 0 queued, 1 running → idle 2/5
Check 3: 0 queued, 1 running → idle 3/5
Check 4: 0 queued, 1 running → idle 4/5
Check 5: 0 queued, 1 running → scale DOWN to 0

After 5 consecutive idle checks (5 minutes), the Lambda sets desiredCount to 0. ECS drains the running task and the Tendril shuts down gracefully.

If a new job arrives during the idle countdown, the counter resets to zero immediately.
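
The countdown and reset rules can be modeled as a small state transition. The sketch below is an assumption-laden illustration: the names are hypothetical, and where the counter is stored between Lambda invocations is not covered by this page, so persistence is left out.

```python
from dataclasses import dataclass

IDLE_CHECKS_BEFORE_SCALE_DOWN = 5  # from the flow above: 5 consecutive idle checks


@dataclass(frozen=True)
class ServiceState:
    desired_count: int
    idle_checks: int = 0


def next_state(state: ServiceState, queued: int) -> ServiceState:
    """Apply one scaler check and return the resulting service state."""
    if queued > 0:
        # Work is waiting: ensure one task is requested and reset the countdown
        return ServiceState(max(state.desired_count, 1), 0)
    if state.desired_count == 0:
        # Already scaled to zero, nothing to count
        return ServiceState(0, 0)
    idle = state.idle_checks + 1
    if idle >= IDLE_CHECKS_BEFORE_SCALE_DOWN:
        # Fifth consecutive idle check: scale down
        return ServiceState(0, 0)
    return ServiceState(state.desired_count, idle)
```

Starting from `ServiceState(desired_count=1)` and applying five checks with `queued=0` walks through idle 1/5 to 4/5 and then reaches `desired_count=0`, matching the sequence listed above.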

Heartbeat Monitoring

While running, a Tendril sends a heartbeat to Trellis every 30 seconds via POST /api/workers/heartbeat. Each heartbeat includes the Tendril's binary version and resets its status to ONLINE.

If heartbeats stop for 60 seconds, Trellis marks the Tendril as OFFLINE. Jobs stuck in PROCESSING with no log activity for 5 minutes are marked FAILED — see Failure Recovery for details.
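
The two timeouts can be expressed as simple time comparisons. This is a minimal sketch under assumptions: the function names are hypothetical, and how Trellis actually stores the last heartbeat and last log timestamps is not described on this page.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT = timedelta(seconds=60)  # no heartbeat for 60 s -> OFFLINE
STUCK_JOB_TIMEOUT = timedelta(minutes=5)   # no log activity for 5 min -> FAILED


def worker_status(last_heartbeat: datetime, now: datetime) -> str:
    """Return ONLINE or OFFLINE based on the age of the last heartbeat."""
    return "OFFLINE" if now - last_heartbeat >= HEARTBEAT_TIMEOUT else "ONLINE"


def job_is_stuck(status: str, last_log_at: datetime, now: datetime) -> bool:
    """True when a PROCESSING job has produced no logs within the timeout."""
    return status == "PROCESSING" and now - last_log_at >= STUCK_JOB_TIMEOUT
```

With a 30-second heartbeat interval, the 60-second timeout tolerates one missed heartbeat before a Tendril is marked OFFLINE.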
