PandaStack docs

How the API picks an agent for each create — warm-slot dominance, leases, NATID partitioning, and zombie reaping.

PandaStack has a tiny global scheduler — a handful of SQL queries plus a deterministic scoring function — that picks the right agent host for every create. It's not Kubernetes; it doesn't need to be. The whole thing fits in one file.

The decision

score(agent, template) =
    100 × warm_slots(agent, template)
    +   1 × free_natid_slots(agent)
    -   0.1 × current_cpu_pct(agent)
    -   1000 × (stale_heartbeat ? 1 : 0)

Whichever agent scores highest wins the create. Ties are broken by the lowest current_cpu_pct.

The ×100 on warm slots is the key: an agent with even one warm slot for the right template beats an empty agent with idle capacity. This was a real bug we fixed — the original scorer weighted warm slots at 0.1× and the empty agent always won, sending every request through the cold path.
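As a rough illustration of the scoring and tie-break rules above, here is a minimal Go sketch. The Agent struct, its field names, and the Score and Pick functions are hypothetical stand-ins, not the contents of scheduler.go; only the weights come from the formula.

```go
package main

// Agent mirrors the heartbeat fields the scorer reads.
// The struct and field names are assumptions made for this sketch.
type Agent struct {
	ID             string
	WarmSlots      int     // warm slots for the requested template
	FreeNATIDSlots int     // unused NATID slots on the host
	CPUPct         float64 // current CPU utilisation, 0 to 100
	StaleHeartbeat bool    // true when the heartbeat is older than the cutoff
}

// Score applies the weights from the formula: 100 per warm slot,
// 1 per free NATID slot, minus 0.1 per CPU percent, minus 1000 if stale.
func Score(a Agent) float64 {
	s := 100*float64(a.WarmSlots) + float64(a.FreeNATIDSlots) - 0.1*a.CPUPct
	if a.StaleHeartbeat {
		s -= 1000
	}
	return s
}

// Pick returns the highest-scoring agent. Equal scores go to the agent
// with the lower CPUPct. ok is false when there are no candidates.
func Pick(agents []Agent) (best Agent, ok bool) {
	for i, a := range agents {
		if i == 0 {
			best, ok = a, true
			continue
		}
		sa, sb := Score(a), Score(best)
		if sa > sb || (sa == sb && a.CPUPct < best.CPUPct) {
			best = a
		}
	}
	return best, ok
}
```

With these weights, an agent holding one warm slot at 90 % CPU scores 100 + 10 − 9 = 101 (given 10 free NATID slots), while an idle agent with no warm slot and 50 free NATID slots scores 49.5, so the warm agent wins.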

Inputs (all from Postgres)

Field	Source	Freshness
`warm_slots`	`agents.heartbeat_meta -> 'warm'`	10 s heartbeat
`free_natid_slots`	`agents.heartbeat_meta -> 'natid_free'`	10 s
`current_cpu_pct`	`agents.heartbeat_meta -> 'cpu_pct'`	10 s
`stale_heartbeat`	`now() - heartbeat_at > 30 s`	computed

A 30 s in-memory cache (was 2 s — that was the second scheduler bug) sits in front of the agents query. At 1000 RPS we don't want to slam Postgres for the same answer.

Leases (10 s refresh, 60 s timeout)

Each agent holds a lease in the agent_leases table: a row containing its agent_id and expires_at = now() + 60s. The agent refreshes it every 10 s. The scheduler ignores agents whose lease has expired (treated like a stale heartbeat).

This is the mechanism for detecting that an agent went away (host died, network partition, OOM kill). Combined with the 30 s stale-heartbeat cutoff, the scheduler stops sending work to a dead agent in ≤30 s.

NATID partitioning

NATID slots are a per-agent resource (16,384 per agent, see networking). The scheduler doesn't move sandboxes between agents to balance NATID — instead, it factors free_natid_slots into the score so we naturally drift load away from saturated hosts.

If an agent hits NATID exhaustion (every slot in use), creates routed to it will 503 and the scheduler will pick another. Customers should never see this because the autoscaler reacts to the deficit metric well before saturation.

Zombie reaping

agent_leases rows whose expires_at is more than 1 hour old are removed by a sweep query — gradual cleanup so a flapping host doesn't churn the table.

For sandbox rows: a separate reconciliation loop in each agent compares its DB rows to its live Firecracker processes (see snapshot & restore: recovery). The scheduler trusts these — it never tries to second-guess what's actually running on a host.
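The comparison step of such a reconciliation loop amounts to a two-way set difference. The Go sketch below shows only that step, with a hypothetical Diff function; what the agent does with each mismatch is described in the recovery docs and is not assumed here.

```go
package main

import "sort"

// Diff compares sandbox IDs recorded in the database with IDs of
// processes actually running on the host. It returns the rows that have
// no process and the processes that have no row, each sorted for
// stable output.
func Diff(dbIDs, liveIDs []string) (rowsWithoutProc, procsWithoutRow []string) {
	live := make(map[string]bool, len(liveIDs))
	for _, id := range liveIDs {
		live[id] = true
	}
	inDB := make(map[string]bool, len(dbIDs))
	for _, id := range dbIDs {
		inDB[id] = true
		if !live[id] {
			rowsWithoutProc = append(rowsWithoutProc, id)
		}
	}
	for _, id := range liveIDs {
		if !inDB[id] {
			procsWithoutRow = append(procsWithoutRow, id)
		}
	}
	sort.Strings(rowsWithoutProc)
	sort.Strings(procsWithoutRow)
	return rowsWithoutProc, procsWithoutRow
}
```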

Affinity

Two affinities exist:

Volume affinity. If a create references a volume by name, the scheduler restricts candidates to agents that already have that volume file on disk. (Volumes are host-local in v1.)
Fork affinity. POST /fork defaults to placing the child on the parent's agent (memory + rootfs are already there). Pass ?cross_host=1 to opt into the cross-host path; the scheduler picks the best other agent.

No labels, no taints, no node selectors. Either the volume is here, or it isn't.
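Both affinities reduce to filtering the candidate list before scoring. A minimal Go sketch, assuming a hypothetical Candidate type that carries the set of volume names present on each agent:

```go
package main

// Candidate is a hypothetical view of an agent for affinity filtering.
type Candidate struct {
	ID      string
	Volumes map[string]bool // volume names present on the agent's disk
}

// FilterByVolume keeps only agents that already hold the named volume.
// An empty name means the create references no volume, so nothing is filtered.
func FilterByVolume(cands []Candidate, volume string) []Candidate {
	if volume == "" {
		return cands
	}
	var out []Candidate
	for _, c := range cands {
		if c.Volumes[volume] { // reading a nil map safely yields false
			out = append(out, c)
		}
	}
	return out
}

// ForkCandidates models fork affinity: by default only the parent's agent
// is eligible; with crossHost set, every agent except the parent's is.
func ForkCandidates(cands []Candidate, parentID string, crossHost bool) []Candidate {
	var out []Candidate
	for _, c := range cands {
		isParent := c.ID == parentID
		if isParent != crossHost {
			out = append(out, c)
		}
	}
	return out
}
```

The filtered slice would then go through the normal scoring step, so affinity narrows the field without changing how the winner is chosen.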

Walked example

A code-interpreter create lands on the API gateway. The scheduler runs:

SELECT agent_id, heartbeat_meta, heartbeat_at
FROM agents
WHERE (now() - heartbeat_at) < interval '30 seconds'
  AND template_supported(heartbeat_meta, 'code-interpreter');

Two rows come back:

agent_id	warm	natid_free	cpu_pct
`pz20`	18	21	31
`n1v2`	5	23	12

Scores:

pz20: 100×18 + 21 − 0.1×31 = 1800 + 21 − 3.1 = 1817.9
n1v2: 100×5 + 23 − 0.1×12 = 500 + 23 − 1.2 = 521.8

Pick pz20. POST /v1/sandboxes is forwarded to its admin endpoint. Done — 1 ms of scheduling, then the warm-slot claim is ~10 ms inside the agent.

What can go wrong

Failure	What scheduler does
Picked agent 5xxs	Gateway retries on the next-best agent (one retry max).
All agents stale	503 to caller; autoscaler should bring up a new host.
Heartbeat lies (warm=18 but pool is empty)	Falls through to cold-boot path inside the agent. Working as intended — warm pool is best-effort.
Network partition	Scheduler keeps using its cached view, agent keeps serving locally; lease expiry catches it within 30 s.

Why so simple

Two reasons:

The hard part isn't picking the right host; it's making "cold create" fast enough that picking wrong only costs you 80 ms. Once cold = warm + 80 ms, scheduling becomes a 1 ms preference, not a contract.
State is the bottleneck, not compute. A million-line scheduler with thousands of node features doesn't help when the actual constraints are "is the rootfs here?" and "is there a free slot?".

Files

api/internal/scheduler/scheduler.go — score function + candidate query.
api/cmd/api/multinode.go — 30 s cache, retry logic, agent dispatch.
agent/internal/api/heartbeat.go — what each agent publishes every 10 s.
infra/terraform/modules/gcp-agent-mig/main.tf — google_compute_region_autoscaler.agent (the layer above the scheduler — adds/removes hosts).

Known limits

Single global scheduler shard (it's a stateless function over a Postgres view; sharding is trivial when needed but not yet useful).
A 10 s heartbeat means the warm-slot view is up to 10 s stale. We mitigate with agent-side fallthrough: if the scheduler claims warm but the pool is empty, the agent cold-boots.
No bin-packing of large (cpu, mem) requests. Edge cases visible only at >70 % cluster fullness; autoscaler keeps us well under.
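The agent-side fallthrough mentioned in the second limit can be pictured as "claim a warm slot, else cold-boot". The Go sketch below uses hypothetical names (WarmPool, Claim, Create) and a caller-supplied coldBoot function; it is not the agent's actual code.

```go
package main

import "sync"

// WarmPool holds pre-booted sandbox IDs per template.
type WarmPool struct {
	mu    sync.Mutex
	slots map[string][]string
}

// Claim pops one warm sandbox for the template, or reports false when
// the pool has none, for example because the heartbeat view was stale.
func (p *WarmPool) Claim(template string) (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	q := p.slots[template]
	if len(q) == 0 {
		return "", false
	}
	id := q[len(q)-1]
	p.slots[template] = q[:len(q)-1]
	return id, true
}

// Create serves a request from the warm pool when possible and falls
// through to coldBoot otherwise. The bool result reports which path ran.
func Create(p *WarmPool, template string, coldBoot func(string) (string, error)) (string, bool, error) {
	if id, ok := p.Claim(template); ok {
		return id, true, nil
	}
	id, err := coldBoot(template)
	return id, false, err
}
```

Because the fallback lives in the agent, a scheduler decision based on an outdated warm count costs extra latency rather than a failed create.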
