PandaStack

Warm Pool

Why claiming a sandbox feels instant — per-template pools, O(1) LIFO pop, debounced refill, and an hour-of-week forecaster.

The warm pool is the reason Sandbox.create() returns in tens of milliseconds. We keep N fully restored, idle VMs per (template, host), each parked behind a pre-allocated NATID slot, so that a customer create reduces to "look up a free slot, hand back the ID, done".

Shape

For each (template, cpu, mem_mb) triple, every agent runs one WarmPool with:

  • Target — desired idle slot count (set per template, default 20).
  • MaxBurst — how many slots can be spawning at once during refill (default 4).
  • An idle slice of fully-restored sandbox handles whose DB rows have status='pooled'.
  • A 200 ms debounced refill ticker.

The slot map for a code-interpreter template on a single agent looks like:

WarmPool{template="code-interpreter", target=20, idle=[s7, s8, s9, …, s26]}
                                                  ↑ LIFO: pop the right
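
The fields listed above could be laid out roughly as follows. This is a minimal sketch, not the contents of warmpool.go: PoolSpec, NewWarmPool, IdleCount, and the spawning counter are hypothetical names, and only the defaults (Target 20, MaxBurst 4) come from the text.

```go
package main

import "sync"

// Slot is a hypothetical handle to one fully restored, idle sandbox.
type Slot struct{ ID string }

// PoolSpec is an assumed name for the per-pool settings described above.
type PoolSpec struct {
	Template string
	CPU      int
	MemMB    int
	Target   int // desired idle slot count
	MaxBurst int // max slots spawning at once during refill
}

// WarmPool holds the idle slice; new slots are appended on the right.
type WarmPool struct {
	mu       sync.Mutex
	spec     PoolSpec
	idle     []*Slot
	spawning int // slots currently being spawned (assumed bookkeeping)
}

// NewWarmPool applies the documented defaults when a field is unset.
func NewWarmPool(spec PoolSpec) *WarmPool {
	if spec.Target <= 0 {
		spec.Target = 20
	}
	if spec.MaxBurst <= 0 {
		spec.MaxBurst = 4
	}
	return &WarmPool{spec: spec, idle: make([]*Slot, 0, spec.Target)}
}

// IdleCount returns the number of idle slots under the pool lock.
func (wp *WarmPool) IdleCount() int {
	wp.mu.Lock()
	defer wp.mu.Unlock()
	return len(wp.idle)
}
```

Pre-sizing the idle slice to Target keeps the steady-state append and pop free of reallocation.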

Claim

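// Claim pops the most recently added idle slot (LIFO) and schedules a refill.
// It returns (nil, false) when the pool is empty.
func (wp *WarmPool) Claim() (*Slot, bool) {
    wp.mu.Lock()
    defer wp.mu.Unlock()
    if n := len(wp.idle); n > 0 {
        s := wp.idle[n-1]       // newest slot sits at the right end
        wp.idle = wp.idle[:n-1] // shrink the slice; no allocation
        wp.kickRefill()         // async: only schedules the debounced refill
        return s, true
    }
    return nil, false // pool empty
}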

That's it. One lock, one slice pop, one async refill kick. The claim itself takes single-digit microseconds, which is dwarfed by the network hop from the API gateway to the agent.

The LIFO order isn't an accident — the most recently spawned slot is most likely to still be hot in page cache.

Spawn (refill)

When len(idle) < target and fewer than MaxBurst slots are currently spawning, the refill loop spawns one more:

spawn:
  1. allocate NATID slot           ~1 ms
  2. configure tap inside netns    ~6 ms
  3. reflink-copy rootfs.ext4      ~4 ms
  4. fork+exec firecracker         ~25 ms
  5. POST /snapshot/load           ~80 ms   (or full cold boot if no snap yet — see auto-bake below)
  6. POST /snapshot/state Resume   ~6 ms
  7. probe TCP :22                 ~40 ms
  8. INSERT sandbox (status=pooled) ~6 ms   (async)
total ≈ 170 ms — the same path as a real create, run ahead of time

The Postgres write is fire-and-forget; we don't make customers wait on it. On warm-pool claim the row is UPDATEd to running synchronously (so reconciliation can find it correctly).

Auto-bake on first spawn

Chicken-and-egg problem: a fresh agent has rootfs but no vm.mem snapshot for the template. The first spawnSlot() therefore does a full cold boot (one-time, ~3 s), captures the snapshot, then every subsequent spawn uses the snapshot path.

This fixed a class of "agent reboots → warm pool stuck at 0 forever" bugs in early deployments. Now: spin up a fresh agent, wait ~3 s, you have a freshly-baked template snapshot and the pool fills normally.

Forecaster (hour-of-week EWMA)

predictor.go keeps a 168-bucket exponentially-weighted ring per template (hour of week × template). Every claim increments the current bucket; every 5 min we publish pandastack_warmpool_forecast{template="…"} as a Prometheus gauge.

The autoscaler can use either:

  • pandastack_warmpool_deficit{template} — current deficit (target minus idle).
  • pandastack_warmpool_forecast{template} — predicted next-hour claim rate.

Forecast-driven refill gives "pre-warm before Monday 9 AM" behavior. The spec.Target mutator is off by default; instead, the autoscaler reads the forecast metric and scales hosts.

Capacity math

For a code-interpreter template (1 vCPU, 256 MiB RAM):

Agent type      Slots / host   Memory budget   Notes
n2-standard-2   10             2.5 GiB         1× burstable, 1× safety.
n2-standard-4   20             5 GiB           Current prod baseline.
n2-standard-8   40             10 GiB          Recommended for high QPS.

The bottleneck is memory, not CPU: idle VMs cost ~0.05% vCPU each (just the kvm vcpu thread parked in HLT), but each maps its full guest RAM as private mmap pages. The host can over-commit modestly thanks to MAP_PRIVATE, because pages the guest never touches aren't faulted in.
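
The memory budget column is slots times guest RAM. A small sketch of that arithmetic follows; memoryBudget and maxSlots are hypothetical helpers, and the calculation ignores over-commit and per-VM overhead.

```go
package main

const (
	MiB int64 = 1 << 20
	GiB int64 = 1 << 30
)

// memoryBudget returns the bytes of guest RAM mapped by a full pool.
func memoryBudget(slots int, guestMemMiB int64) int64 {
	return int64(slots) * guestMemMiB * MiB
}

// maxSlots returns how many guests of the given size fit in a budget.
func maxSlots(budget, guestMemMiB int64) int {
	return int(budget / (guestMemMiB * MiB))
}
```

For the 256 MiB template, 10 slots come to 2.5 GiB, 20 slots to 5 GiB, and 40 slots to 10 GiB, matching the table.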

Constraints

  • NATID-mode only. Legacy vsock mode pre-dates the per-sandbox netns design and can't share the pool's IP-space allocator. Set PANDASTACK_NATID=1 (the default).
  • Cooperative shutdown. SIGTERM does not drain the pool: pooled VMs are left in place, and the reconciliation loop reclaims them on next start (emitting recover.unmanaged). This is a deliberate tradeoff: a graceful drain takes seconds × pool size, and we'd rather restart fast.
  • One pool per (template, cpu, mem). If a customer requests code-interpreter with mem=1024, that's a different pool — empty by default, cold-boots on first request.
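
The last constraint falls out naturally if pools are keyed by the full triple. A minimal sketch, assuming a map keyed by a struct (PoolKey and Pools are hypothetical names, and idle slots are reduced to string IDs for brevity):

```go
package main

// PoolKey identifies one warm pool. Changing any field selects a different pool.
type PoolKey struct {
	Template string
	CPU      int
	MemMB    int
}

// Pools maps each key to its idle slot IDs, newest on the right.
type Pools map[PoolKey][]string

// Claim pops the newest idle slot for the key. It returns false when that
// exact shape has no pool or the pool is empty.
func (p Pools) Claim(k PoolKey) (string, bool) {
	idle := p[k]
	if len(idle) == 0 {
		return "", false
	}
	id := idle[len(idle)-1]
	p[k] = idle[:len(idle)-1]
	return id, true
}
```

A request for code-interpreter with mem=1024 misses the 256 MiB pool entirely, which is why it takes the cold-boot path on first request.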

Code

  • agent/internal/sandbox/warmpool.go — pool state, Claim, spawnSlot, refill loop.
  • agent/internal/sandbox/predictor.go — hour-of-week EWMA + Prometheus gauge.
  • agent/internal/api/metrics_prom.go — pandastack_warmpool_idle, ..._deficit, ..._forecast exports.

Why this design (and not "just keep VMs paused")

A paused VM still has a netns, a tap, a FC process, an SSH-ready guest. Pausing buys nothing — it's the same resource footprint as a running idle VM, with extra latency on resume.

Snapshotting and killing the FC process between claims, on the other hand, frees the process slot but loses the in-kernel state — every claim becomes a fresh /snapshot/load. We benchmarked: /snapshot/load of a 256-MiB snapshot is ~80 ms; keeping the VM live in the pool is ~0 ms.

Pool live + snapshot-and-kill on Delete is the optimum: claim latency is unbeatable, idle hosts shed VMs gracefully.
