Warm Pool
Why claiming a sandbox feels instant — per-template pools, O(1) LIFO pop, debounced refill, and an hour-of-week forecaster.
The warm pool is the reason Sandbox.create() returns in tens of milliseconds. We keep N fully-restored, idle VMs per (template, host), each parked behind a pre-allocated NATID slot, so that a customer create reduces to "look up a free slot, hand back the ID, done".
Shape
For each (template, cpu, mem_mb) triple, every agent runs one WarmPool with:
- Target: desired idle slot count (set per template, default 20).
- MaxBurst: how many slots can be spawning at once during refill (default 4).
- An idle slice of fully-restored sandbox handles whose DB rows have status='pooled'.
- A 200 ms debounced refill ticker.
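
The state listed above could be sketched as a Go struct. This is an illustration only: the Slot type, the spawning counter, the debounce field, and the newWarmPool constructor are assumptions made for the example, and the real definition in warmpool.go may differ.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Slot is a hypothetical handle to one fully-restored, idle sandbox.
type Slot struct {
	ID string
}

// WarmPool sketches the per-(template, cpu, mem_mb) state listed above.
type WarmPool struct {
	mu sync.Mutex // guards idle and spawning

	Template string
	Target   int // desired idle slot count
	MaxBurst int // max slots spawning at once during refill

	idle     []*Slot       // pooled slots; the last element is popped first
	spawning int           // assumed counter of in-flight spawns
	debounce time.Duration // refill debounce interval
}

// newWarmPool applies the defaults quoted in the text (20, 4, 200 ms).
func newWarmPool(template string) *WarmPool {
	return &WarmPool{
		Template: template,
		Target:   20,
		MaxBurst: 4,
		debounce: 200 * time.Millisecond,
	}
}

func main() {
	wp := newWarmPool("code-interpreter")
	fmt.Printf("%s target=%d maxBurst=%d debounce=%s idle=%d\n",
		wp.Template, wp.Target, wp.MaxBurst, wp.debounce, len(wp.idle))
}
```
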
The slot map for a code-interpreter template on a single agent looks like:
    WarmPool{template="code-interpreter", target=20, idle=[s7, s8, s9, …, s26]}
                                                                          ↑ LIFO: pop from the right
Claim
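    // Claim pops the most recently added idle slot, or reports false if the pool is empty.
    func (wp *WarmPool) Claim() (*Slot, bool) {
        wp.mu.Lock()
        defer wp.mu.Unlock()
        if n := len(wp.idle); n > 0 {
            s := wp.idle[n-1]       // LIFO: take the newest slot
            wp.idle = wp.idle[:n-1] // shrink the slice in place, O(1)
            wp.kickRefill()         // schedule an async, debounced refill
            return s, true
        }
        return nil, false
    }
That's it. One lock, one slice pop, one async refill kick. Claim latency is in the single-digit microseconds, dwarfed by the network hop from the API gateway to the agent.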
The LIFO order isn't an accident — the most recently spawned slot is most likely to still be hot in page cache.
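
One way the debounced refill kick could work is a timer that is re-armed on every kick, so a burst of claims triggers a single refill pass. The sketch below is an assumption about the mechanism, not the agent's actual kickRefill; the refillKicker type and the demoRuns helper are hypothetical names introduced for the example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// refillKicker coalesces bursts of kick() calls into one run of the refill function.
type refillKicker struct {
	mu    sync.Mutex
	delay time.Duration
	timer *time.Timer
	run   func()
}

func newRefillKicker(delay time.Duration, run func()) *refillKicker {
	return &refillKicker{delay: delay, run: run}
}

// kick arms the timer on first use and re-arms it afterwards,
// so run fires once, delay after the most recent kick.
func (k *refillKicker) kick() {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.timer == nil {
		k.timer = time.AfterFunc(k.delay, k.run)
		return
	}
	k.timer.Reset(k.delay)
}

// demoRuns kicks several times in a tight loop, waits, and reports how often run fired.
func demoRuns(kicks, delayMs, waitMs int) int32 {
	var runs atomic.Int32
	k := newRefillKicker(time.Duration(delayMs)*time.Millisecond, func() { runs.Add(1) })
	for i := 0; i < kicks; i++ {
		k.kick()
	}
	time.Sleep(time.Duration(waitMs) * time.Millisecond)
	return runs.Load()
}

func main() {
	// Five back-to-back kicks with a 20 ms debounce, observed after 150 ms.
	fmt.Println("refill runs:", demoRuns(5, 20, 150))
}
```

In the pool itself the delay would be the 200 ms quoted in the Shape section; the shorter value here only keeps the demo fast.
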
Spawn (refill)
When len(idle) < target and fewer than MaxBurst slots are currently spawning, the refill loop spawns one more:
    spawn:
      1. allocate NATID slot              ~1 ms
      2. configure tap inside netns       ~6 ms
      3. reflink-copy rootfs.ext4         ~4 ms
      4. fork+exec firecracker            ~25 ms
      5. POST /snapshot/load              ~80 ms  (or full cold boot if no snap yet; see auto-bake below)
      6. POST /snapshot/state Resume      ~6 ms
      7. probe TCP :22                    ~40 ms
      8. INSERT sandbox (status=pooled)   ~6 ms   (async)
      total ≈ 170 ms: the same path as a real create, run ahead of time
The Postgres write is fire-and-forget; we don't make customers wait on it. On warm-pool claim the row is UPDATEd to running synchronously (so reconciliation can find it correctly).
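
The refill rule stated at the top of this section (below target, and under the burst cap) can be sketched as a predicate plus a single refill pass. The poolState type and the canSpawn and refillPass names are hypothetical; looping within one pass is also an assumption, since the text only says the loop "spawns one more" while the rule holds.

```go
package main

import "fmt"

// poolState is a hypothetical snapshot of the counters the refill loop reads.
type poolState struct {
	Idle, Spawning, Target, MaxBurst int
}

// canSpawn mirrors the rule in the text: below target and under the burst cap.
func canSpawn(s poolState) bool {
	return s.Idle < s.Target && s.Spawning < s.MaxBurst
}

// refillPass starts spawns until the rule stops holding and returns how many it started.
func refillPass(s *poolState) int {
	started := 0
	for canSpawn(*s) {
		s.Spawning++ // the real agent would launch spawnSlot asynchronously here
		started++
	}
	return started
}

func main() {
	s := poolState{Idle: 17, Spawning: 1, Target: 20, MaxBurst: 4}
	fmt.Println("started:", refillPass(&s), "in flight:", s.Spawning)
}
```

As written, the rule does not count in-flight spawns toward the target, so a pass can briefly overshoot it. Whether the real loop subtracts them is not stated here.
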
Auto-bake on first spawn
Chicken-and-egg problem: a fresh agent has rootfs but no vm.mem snapshot for the template. The first spawnSlot() therefore does a full cold boot (one-time, ~3 s), captures the snapshot, then every subsequent spawn uses the snapshot path.
This fixed a class of "agent reboots → warm pool stuck at 0 forever" bugs in early deployments. Now: spin up a fresh agent, wait ~3 s, you have a freshly-baked template snapshot and the pool fills normally.
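
The auto-bake decision can be sketched as a branch at the top of the spawn path. The vmBackend interface, its method names, and the fakeBackend are hypothetical stand-ins for the Firecracker calls; the signature of spawnSlot here is likewise an assumption.

```go
package main

import "fmt"

// vmBackend is a hypothetical abstraction over the snapshot and boot operations.
type vmBackend interface {
	HasSnapshot(template string) bool
	ColdBootAndSnapshot(template string) error
	LoadSnapshot(template string) error
}

// spawnSlot reports which path it took: "cold-boot" on the first spawn
// for a template, "snapshot" on every spawn after that.
func spawnSlot(b vmBackend, template string) (string, error) {
	if !b.HasSnapshot(template) {
		if err := b.ColdBootAndSnapshot(template); err != nil {
			return "", fmt.Errorf("auto-bake %s: %w", template, err)
		}
		return "cold-boot", nil
	}
	if err := b.LoadSnapshot(template); err != nil {
		return "", fmt.Errorf("load snapshot %s: %w", template, err)
	}
	return "snapshot", nil
}

// fakeBackend records which templates have been baked, in memory only.
type fakeBackend struct {
	baked map[string]bool
}

func (f *fakeBackend) HasSnapshot(t string) bool { return f.baked[t] }

func (f *fakeBackend) ColdBootAndSnapshot(t string) error {
	f.baked[t] = true
	return nil
}

func (f *fakeBackend) LoadSnapshot(t string) error {
	if !f.baked[t] {
		return fmt.Errorf("no snapshot for %s", t)
	}
	return nil
}

func main() {
	b := &fakeBackend{baked: map[string]bool{}}
	for i := 0; i < 3; i++ {
		path, err := spawnSlot(b, "code-interpreter")
		fmt.Println(i, path, err)
	}
}
```
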
Forecaster (hour-of-week EWMA)
predictor.go keeps a 168-bucket exponentially-weighted ring per template (hour of week × template). Every claim increments the current bucket; every 5 min we publish pandastack_warmpool_forecast{template="…"} as a Prometheus gauge.
The autoscaler can use either:
- pandastack_warmpool_deficit{template}: current deficit (target minus idle).
- pandastack_warmpool_forecast{template}: predicted next-hour claim rate.
Forecast-driven refill gives "pre-warm before Monday 9 AM" behavior. The spec.Target mutator is off by default, so the forecast does not change the pool target on its own; instead, the autoscaler reads the forecast metric and scales hosts.
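
A minimal sketch of an hour-of-week EWMA, to make the 168-bucket idea concrete. It is not the code in predictor.go: the Forecaster type, the smoothing factor, the use of UTC with Sunday 00:00 as bucket 0, and folding a completed hour's claim count into its bucket are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const bucketsPerWeek = 7 * 24 // 168 hour-of-week buckets

// hourOfWeek maps a timestamp to 0..167 (assumes UTC, Sunday 00:00 is bucket 0).
func hourOfWeek(t time.Time) int {
	t = t.UTC()
	return int(t.Weekday())*24 + t.Hour()
}

// Forecaster keeps one exponentially-weighted average per hour of the week.
type Forecaster struct {
	alpha float64 // weight given to the newest observation
	ewma  [bucketsPerWeek]float64
	seen  [bucketsPerWeek]bool
}

// Observe folds one completed hour's claim count into its bucket.
func (f *Forecaster) Observe(bucket int, claims float64) {
	if !f.seen[bucket] {
		f.ewma[bucket] = claims // first sample seeds the average
		f.seen[bucket] = true
		return
	}
	f.ewma[bucket] = f.alpha*claims + (1-f.alpha)*f.ewma[bucket]
}

// Forecast returns the smoothed claim count for the hour following t.
func (f *Forecaster) Forecast(t time.Time) float64 {
	return f.ewma[(hourOfWeek(t)+1)%bucketsPerWeek]
}

// demoForecast feeds two Mondays of 09:00 traffic and asks at 08:30 what comes next.
func demoForecast() float64 {
	f := &Forecaster{alpha: 0.5}
	monday9 := time.Date(2024, time.January, 1, 9, 0, 0, 0, time.UTC) // a Monday
	b := hourOfWeek(monday9)
	f.Observe(b, 100)
	f.Observe(b, 200)
	return f.Forecast(monday9.Add(-30 * time.Minute))
}

func main() {
	fmt.Println("forecast for Monday 09:00:", demoForecast())
}
```

With alpha = 0.5, the two samples 100 and 200 blend to 0.5·200 + 0.5·100 = 150, which is what a gauge like pandastack_warmpool_forecast would report under these assumptions.
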
Capacity math
For a code-interpreter template (1 vCPU, 256 MiB RAM):
| Agent type | Slots / host | Memory budget | Notes |
|---|---|---|---|
| n2-standard-2 | 10 | 2.5 GiB | 1× burstable, 1× safety. |
| n2-standard-4 | 20 | 5 GiB | Current prod baseline. |
| n2-standard-8 | 40 | 10 GiB | Recommended for high QPS. |
The bottleneck is memory, not CPU — idle VMs cost ~0.05 % vCPU each (just the kvm vcpu thread parked in HLT) but each holds its full guest RAM in mmap private pages. The host can over-commit modestly thanks to MAP_PRIVATE (unused pages aren't faulted in).
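
The slot counts in the table follow from dividing the memory budget by the guest size. A small worked check, assuming the budget is spent entirely on guest RAM and ignoring per-VM overhead (the slotsForBudget helper is hypothetical):

```go
package main

import "fmt"

// slotsForBudget returns how many idle guests fit in a memory budget (both in MiB).
func slotsForBudget(budgetMiB, guestMiB int) int {
	if guestMiB <= 0 {
		return 0
	}
	return budgetMiB / guestMiB
}

func main() {
	const guestMiB = 256 // code-interpreter guest RAM
	rows := []struct {
		agent     string
		budgetMiB int
	}{
		{"n2-standard-2", 2560},  // 2.5 GiB
		{"n2-standard-4", 5120},  // 5 GiB
		{"n2-standard-8", 10240}, // 10 GiB
	}
	for _, r := range rows {
		fmt.Printf("%s: %d slots\n", r.agent, slotsForBudget(r.budgetMiB, guestMiB))
	}
}
```
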
Constraints
- NATID-mode only. Legacy vsock mode pre-dates the per-sandbox netns design and can't share the pool's IP-space allocator. Set PANDASTACK_NATID=1 (the default).
- Cooperative shutdown. SIGTERM does not drain the pool; it leaves pooled VMs in place, and the reconciliation loop reclaims them on next start (emitting recover.unmanaged). This is a deliberate tradeoff: graceful drain takes seconds × pool size, and we'd rather restart fast.
- One pool per (template, cpu, mem). If a customer requests code-interpreter with mem=1024, that's a different pool: empty by default, cold-boots on first request.
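
The one-pool-per-triple constraint can be pictured as a map keyed by a comparable struct. The PoolKey and registry types below are hypothetical, and the integer idle count stands in for a real pool.

```go
package main

import "fmt"

// PoolKey identifies a pool; two requests share a pool only if all three fields match.
type PoolKey struct {
	Template string
	CPU      int
	MemMB    int
}

// registry maps each key to an idle slot count (a stand-in for a real pool).
type registry struct {
	idle map[PoolKey]int
}

// idleFor returns the idle count for a key; unknown keys report 0, so they cold-boot.
func (r *registry) idleFor(k PoolKey) int {
	return r.idle[k]
}

func main() {
	r := &registry{idle: map[PoolKey]int{
		{Template: "code-interpreter", CPU: 1, MemMB: 256}: 20,
	}}
	fmt.Println(r.idleFor(PoolKey{"code-interpreter", 1, 256}))  // pooled size
	fmt.Println(r.idleFor(PoolKey{"code-interpreter", 1, 1024})) // different pool, empty
}
```
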
Code
- agent/internal/sandbox/warmpool.go: pool state, Claim, spawnSlot, refill loop.
- agent/internal/sandbox/predictor.go: hour-of-week EWMA + Prometheus gauge.
- agent/internal/api/metrics_prom.go: pandastack_warmpool_idle, ..._deficit, ..._forecast exports.
Why this design (and not "just keep VMs paused")
A paused VM still has a netns, a tap, a FC process, an SSH-ready guest. Pausing buys nothing — it's the same resource footprint as a running idle VM, with extra latency on resume.
Snapshotting and killing the FC process between claims, on the other hand, frees the process slot but loses the in-kernel state — every claim becomes a fresh /snapshot/load. We benchmarked: /snapshot/load of a 256-MiB snapshot is ~80 ms; keeping the VM live in the pool is ~0 ms.
Pool live + snapshot-and-kill on Delete is the optimum: claim latency is unbeatable, idle hosts shed VMs gracefully.