Tendril Scaling
How Tendrils scale from zero to one when jobs arrive and back to zero when idle.
Tendrils run on AWS ECS Fargate with scale-to-zero enabled by default. When a job is queued, Trellis immediately invokes a Lambda scaler via its Function URL to spin up a Tendril container. An EventBridge rule also triggers the same Lambda every minute as a fallback. When no jobs remain, the Lambda scales the containers back down after 5 consecutive idle checks (about 5 minutes).
Why Scale-to-Zero
Each Tendril container ships Terraform, kubectl, Helm, AWS CLI, Google Cloud SDK, Azure CLI, and Infracost, roughly 500 MB of tooling. Running these containers 24/7 across multiple regions is expensive, and most of that time would be wasted because infrastructure provisioning happens in bursts: a user clicks "Apply," the Tendril works for 5–15 minutes, then sits idle for hours.
Scale-to-zero removes that idle cost. The platform pays for Fargate compute only while a Tendril is running jobs or waiting out the short idle window before scale-down.
Architecture
Four components work together:
| Component | Role |
|---|---|
| Trellis | Invokes the Lambda scaler instantly via Function URL when a job is queued |
| Lambda scaler | Queries Supabase for queued jobs, adjusts ECS service desired count |
| EventBridge rule | Triggers the Lambda every minute as a fallback (also drives scale-down) |
| ECS Fargate | Runs Tendril containers, pulls image from GHCR on scale-up |
The Lambda function runs in eu-west-1 but manages ECS services across multiple regions (currently eu-west-1 and eu-central-1). Each Tendril deployment gets its own ECS cluster, service, VPC, and secrets.
Scale-Up Flow
Job Created
A user clicks "Plan," "Apply," or "Destroy" in Trellis (or runs grape plan / grape harvest / grape destroy). A provision_jobs row is inserted with status QUEUED.
Trellis Notifies the Scaler
Immediately after inserting the job, Trellis sends a fire-and-forget POST to the Lambda scaler's Function URL. The Lambda queries Supabase's REST API and reads the number of QUEUED jobs from the content-range response header. If the direct call fails, the job is picked up by the next EventBridge-triggered run, at most 60 seconds later.
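Reading a count from a content-range header can be sketched as follows. This is a minimal illustration, not the scaler's actual code: it assumes the request asks for an exact count so that the header value has the PostgREST-style shape `0-24/57` (or `*/0` when no rows match), and the function name `queued_count_from_content_range` is hypothetical.

```python
import re

# Assumed header shapes: "0-24/57" (rows 0..24 of 57) or "*/0" (no rows).
# The number after the slash is the total; "*" there means "total unknown".
_CONTENT_RANGE = re.compile(r"^(?:\d+-\d+|\*)/(\d+|\*)$")


def queued_count_from_content_range(header: str) -> int:
    """Return the total row count carried by a Content-Range header value.

    Hypothetical helper: raises ValueError if the header is malformed or
    does not include a total, so the caller can fall back to the next check.
    """
    match = _CONTENT_RANGE.match(header.strip())
    if match is None or match.group(1) == "*":
        raise ValueError(f"no total in Content-Range: {header!r}")
    return int(match.group(1))
```

Reading the count from the header means the scaler does not need to download the job rows themselves; it only needs to know whether the total is zero.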
ECS Scales Up
If queued jobs exist and the service's desiredCount is 0, the Lambda calls ecs:UpdateService to set desiredCount to 1.
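The scale-up decision can be sketched as a small function. This is an illustration under assumptions: `ecs_client` stands for a boto3 ECS client (or anything exposing the same `update_service` keyword arguments), and the function name and the cluster and service names are hypothetical.

```python
def scale_up_if_needed(ecs_client, cluster: str, service: str,
                       queued_jobs: int, desired_count: int) -> bool:
    """Set desiredCount to 1 when jobs are queued and the service is at zero.

    Returns True if a scale-up was requested, False otherwise.
    `cluster` and `service` are placeholders for one Tendril deployment.
    """
    if queued_jobs > 0 and desired_count == 0:
        # Same shape as the ECS UpdateService API call (ecs:UpdateService).
        ecs_client.update_service(
            cluster=cluster,
            service=service,
            desiredCount=1,
        )
        return True
    # Either nothing is queued or a Tendril is already running.
    return False
```

Because the function only acts when desiredCount is 0, repeated invocations for the same burst of jobs (one from Trellis, one from EventBridge) do not issue duplicate scale-up calls.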
Container Starts
ECS Fargate pulls the Tendril Docker image from GHCR (ghcr.io/bobikenobi12/tendril:latest), injects secrets from AWS Secrets Manager, and starts the container. This takes roughly 30–60 seconds depending on image cache state.
Tendril Claims Job
The Tendril authenticates with Trellis using its worker token, enters its poll loop, and claims the queued job atomically via FOR UPDATE SKIP LOCKED. See Job Queue Pattern for claiming details.
Cold start latency: ~30–60 seconds from job creation to job claimed, dominated by ECS task startup (image pull + container init). The Lambda is invoked instantly by Trellis, so there is no polling delay. Subsequent jobs while the Tendril is already running are claimed within seconds.
Scale-Down Flow
When the Lambda detects zero queued jobs but the ECS service is running (desiredCount > 0), it increments an idle counter for that service:
Check 1: 0 queued, 1 running → idle 1/5
Check 2: 0 queued, 1 running → idle 2/5
Check 3: 0 queued, 1 running → idle 3/5
Check 4: 0 queued, 1 running → idle 4/5
Check 5: 0 queued, 1 running → scale DOWN to 0
After 5 consecutive idle checks (5 minutes), the Lambda sets desiredCount to 0. ECS drains the running task and the Tendril shuts down gracefully.
If a new job arrives during the idle countdown, the counter resets to zero immediately.
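The idle countdown behaves like a small state machine, sketched below. The threshold of 5 comes from the flow above; the function name is hypothetical, and where the counter is stored between Lambda invocations is not specified here, so the sketch simply passes it in and out.

```python
IDLE_CHECKS_BEFORE_SCALE_DOWN = 5  # five one-minute checks, per the flow above


def next_idle_state(queued_jobs: int, desired_count: int,
                    idle_checks: int) -> tuple[int, int]:
    """Return (new_idle_checks, new_desired_count) after one scaler check."""
    if queued_jobs > 0 or desired_count == 0:
        # Work is waiting, or nothing is running: the countdown resets.
        return 0, desired_count
    idle_checks += 1
    if idle_checks >= IDLE_CHECKS_BEFORE_SCALE_DOWN:
        # Fifth consecutive idle check: scale the service down to zero.
        return 0, 0
    return idle_checks, desired_count
```

Starting from `idle_checks = 0` with one running task and no queued jobs, four calls return counters 1 through 4 and the fifth returns a desired count of 0. A call with `queued_jobs > 0` at any point returns the counter to 0.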
Heartbeat Monitoring
While running, a Tendril sends a heartbeat to Trellis every 30 seconds via POST /api/workers/heartbeat. Each heartbeat includes the Tendril's binary version and resets its status to ONLINE.
If heartbeats stop for 60 seconds, Trellis marks the Tendril as OFFLINE. Jobs stuck in PROCESSING with no log activity for 5 minutes are marked FAILED — see Failure Recovery for details.