PandaStack

Observability

Per-sandbox metrics, workspace-wide aggregates, and the ClickHouse schema behind the dashboard charts.

PandaStack ships first-class observability — every sandbox emits process-level metrics and lifecycle events that flow into ClickHouse, then back out through the dashboard as charts. No Grafana, no Prometheus to operate; it's built in.

What gets collected

Five tables in ClickHouse, all partitioned by workspace and retained per the SLA below.

| Table | Cardinality | Retention | What's in it |
|---|---|---|---|
| pandastack.sandbox_metrics | 1 row / 10 s / sandbox | 90 days | CPU %, memory bytes, net rx/tx, disk rd/wr. |
| pandastack.sandbox_events | 1 row / event | 90 days | Lifecycle (sandbox.running, paused, hibernated, forked, etc.). |
| pandastack.boot_events | 1 row / create | 90 days | boot_ms, boot_mode, template, from_snapshot. |
| pandastack.audit_log | 1 row / mutating call | 365 days | Who did what, from where, when. |
| pandastack.http_requests | 1 row / API call | 30 days | Route, status, duration, request ID. |

Every row is tagged with workspace_id; the API rewrites every query to add WHERE workspace_id = $jwt.workspace_id so workspaces never see each other's data.

Per-sandbox metrics

curl -H "Authorization: Bearer $PANDASTACK_TOKEN" \
  "https://api.pandastack.ai/v1/sandboxes/$SID/metrics"

{
  "pid": 12345,
  "uptime_seconds": 482,
  "host_cpu_percent": 4.7,
  "host_rss_bytes": 248123392,
  "host_vsz_bytes": 1029531648,
  "threads": 8
}

This is the live, in-the-moment view served by the agent reading /proc/<pid>/stat — not from ClickHouse. P95 latency ~3 ms. Useful for autoscaling decisions and dashboards.
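
For callers that poll this endpoint from code, the response above can be mapped onto a small typed record. The following is a minimal sketch: the `LiveMetrics` class, the `parse_live_metrics` helper, and the `rss_mib` convenience property are illustrative names, not part of any PandaStack SDK, and the field list simply mirrors the sample response shown above.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LiveMetrics:
    """Typed view of the sample /metrics response (hypothetical helper)."""
    pid: int
    uptime_seconds: int
    host_cpu_percent: float
    host_rss_bytes: int
    host_vsz_bytes: int
    threads: int

    @property
    def rss_mib(self) -> float:
        # Convert resident set size from bytes to MiB for display.
        return self.host_rss_bytes / (1024 * 1024)


def parse_live_metrics(body: str) -> LiveMetrics:
    """Parse a JSON body, keeping only the fields listed in the sample."""
    raw = json.loads(body)
    # Ignore unknown keys so additional fields in a response do not break parsing.
    return LiveMetrics(**{f.name: raw[f.name] for f in fields(LiveMetrics)})


# The sample response from this page, used as input.
sample = (
    '{"pid": 12345, "uptime_seconds": 482, "host_cpu_percent": 4.7, '
    '"host_rss_bytes": 248123392, "host_vsz_bytes": 1029531648, "threads": 8}'
)
m = parse_live_metrics(sample)
print(f"pid={m.pid} cpu={m.host_cpu_percent}% rss={m.rss_mib:.1f} MiB")
```

A missing field raises `KeyError`, which is usually what a poller wants: it fails loudly instead of reporting zeros.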

Workspace-wide aggregates

curl -H "Authorization: Bearer $PANDASTACK_TOKEN" \
  "https://api.pandastack.ai/v1/metrics/overview?range=24h"

Returns time-series buckets for sandboxes-running, vCPU-hours used, memory-GB-hours, p50/p99 boot_ms. Backed by ClickHouse — works across all your hosts, not just one.

curl -H "Authorization: Bearer $PANDASTACK_TOKEN" \
  "https://api.pandastack.ai/v1/metrics/sandbox/$SID?range=1h"

Single-sandbox time series — what the dashboard's per-sandbox detail page draws.

Events stream

Each lifecycle change is one row in sandbox_events. The full set:

| Event | When |
|---|---|
| sandbox.running | Reached running state, sshd accepting. |
| sandbox.ssh_ready | First successful SSH probe (includes ssh_ready_ms). |
| sandbox.paused | pause succeeded. |
| sandbox.resumed | resume succeeded. |
| sandbox.hibernated | Memory written to disk, VM stopped. |
| sandbox.woken | Hibernated sandbox restored. |
| sandbox.forked | A fork child reached running. |
| sandbox.warmforked | A warm-fork child reached running (faster path). |
| sandbox.deleted | Rootfs purged, slot released. |
| fork.staged / fork.completed / fork.child_failed | fork-tree progress. |
| fork_tree.completed / fork_tree.promoted | fork-tree finished, winner picked. |
| vm.died | Firecracker process disappeared unexpectedly. |
| recover.orphaned | Manager found a sandbox row whose VM was gone; marked failed. |
| recover.unmanaged | Manager found a Firecracker socket with no row; cleaned up. |

Subscribe live:

curl -N -H "Authorization: Bearer $PANDASTACK_TOKEN" \
  "https://api.pandastack.ai/v1/sandboxes/$SID/events?follow=1&tail=20"

Returns a server-sent events (SSE) stream: the 20 most recent past events (tail=20), then any new ones as they happen (follow=1). Perfect for "wait for fork to finish" or "did the warm pool refill yet" UX flows. See Events for the full schema.
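
A "wait for fork to finish" flow comes down to reading the stream until one of a few event names shows up. The sketch below parses the standard SSE wire format (`event:` and `data:` fields, blank line as terminator, `:` comments). It assumes the lifecycle name is carried in the SSE `event:` field, which this page does not confirm. The payload shown is a placeholder, not the real schema, and `iter_sse` and `wait_for` are illustrative names.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Set


class SSEEvent(NamedTuple):
    event: str
    data: str


def iter_sse(lines: Iterable[str]) -> Iterator[SSEEvent]:
    """Yield events from an iterable of raw SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event, but only if it carried data.
            if data:
                yield SSEEvent(event, "\n".join(data))
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one optional space after the colon is dropped
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


def wait_for(lines: Iterable[str], wanted: Set[str]) -> Optional[SSEEvent]:
    """Return the first event whose name is in `wanted`, or None at end of stream."""
    for ev in iter_sse(lines):
        if ev.event in wanted:
            return ev
    return None


# Hypothetical stream contents; real payloads follow the Events schema.
sample = [
    ": keep-alive\n",
    "event: fork.staged\n",
    'data: {"note": "placeholder payload"}\n',
    "\n",
    "event: fork.completed\n",
    "data: {}\n",
    "\n",
]
hit = wait_for(sample, {"fork.completed", "fork.child_failed"})
print(hit.event if hit else "stream ended without a match")
```

In practice `lines` would be the decoded lines of a streaming HTTP response rather than a list. Because the parser only needs an iterable of strings, it can be exercised offline.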

Dashboard tour

Five pages drive the entire UX, all backed by the above:

| Page | What it shows | Data source |
|---|---|---|
| /observability | Workspace charts: sandboxes over time, p50/p99 boot, vCPU-hours, success/fail rate. | /v1/metrics/overview |
| /sandboxes/[id] | Per-sandbox CPU / mem / net / disk time series + lifecycle event log. | /v1/metrics/sandbox/{id} + /events |
| /audit | Full audit trail with filter by actor / action / time. | audit_log table |
| /usage | Aggregated billing-grade usage: vCPU-hours, GB-hours, count of creates per template. | rollups on sandbox_metrics + boot_events |
| /stats | Boot-time histogram with template breakdown. | boot_events table |

The charts are pure SVG (no Chart.js / D3 monster) — one Go binary, one Next.js app, no third-party telemetry stack to babysit.

Bring-your-own observability

The agent also exposes a Prometheus /metrics endpoint on its admin port:

pandastack_sandboxes_total{template="code-interpreter",status="running"}  3
pandastack_boot_ms_bucket{template="code-interpreter",le="200"}         418
pandastack_boot_ms_sum{template="code-interpreter"}                   78213.4
pandastack_natid_slots_used                                              42
pandastack_warmpool_idle{template="code-interpreter"}                     4
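
If a full Prometheus server is more than you need, the text format above is simple enough to read directly. The sketch below is a deliberately small parser, not a complete implementation of the exposition format: it skips comment lines, assumes label values contain no escaped quotes, and does not handle the optional trailing timestamp. The `parse_exposition` name is illustrative.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# metric name, optional {label block}, then the value
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_exposition(text: str) -> List[Sample]:
    """Parse simple `name{labels} value` lines into (name, labels, value)."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = LINE.match(line)
        if match is None:
            continue  # unsupported shape, e.g. a line with a timestamp
        name, labels, value = match.groups()
        samples.append((name, dict(LABEL.findall(labels or "")), float(value)))
    return samples


# The sample scrape from this page.
scrape = '''
pandastack_sandboxes_total{template="code-interpreter",status="running"}  3
pandastack_boot_ms_bucket{template="code-interpreter",le="200"}         418
pandastack_boot_ms_sum{template="code-interpreter"}                   78213.4
pandastack_natid_slots_used                                              42
pandastack_warmpool_idle{template="code-interpreter"}                     4
'''
samples = parse_exposition(scrape)

# Idle warm-pool size per template.
warm_idle = {
    labels["template"]: value
    for name, labels, value in samples
    if name == "pandastack_warmpool_idle"
}
print(warm_idle)
```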

OpenTelemetry traces are emitted on every API request when OTEL_EXPORTER_OTLP_ENDPOINT is set — point them at Tempo / Honeycomb / Datadog as you see fit. No vendor lock.

Query the warehouse directly

If you self-host, every workspace can read pandastack.* tables directly via the ClickHouse HTTP interface:

echo "SELECT toStartOfHour(ts) h, quantile(0.5)(boot_ms) p50, quantile(0.99)(boot_ms) p99
      FROM pandastack.boot_events
      WHERE workspace_id = '$WS' AND ts > now() - INTERVAL 7 DAY
      GROUP BY h ORDER BY h" \
  | curl --data-binary @- "$CH_URL?database=pandastack"

This is the same query the dashboard runs. The ClickHouse cluster doubles as your analytics backplane — no separate warehouse export needed for product analytics on PandaStack usage.
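
From application code, the same query can be sent with bound parameters rather than shell interpolation, which avoids quoting problems in the workspace ID. This sketch uses the ClickHouse HTTP interface's `{name:Type}` placeholders with matching `param_name` URL arguments. The `build_request` helper, the example URL, and the `ws_123` ID are illustrative assumptions; authentication is omitted because it depends on your deployment.

```python
import urllib.parse
import urllib.request

# Same aggregation as the curl example, with bound parameters.
BOOT_PERCENTILES_SQL = """
SELECT toStartOfHour(ts) AS h,
       quantile(0.5)(boot_ms) AS p50,
       quantile(0.99)(boot_ms) AS p99
FROM pandastack.boot_events
WHERE workspace_id = {ws:String}
  AND ts > now() - toIntervalDay({days:UInt16})
GROUP BY h
ORDER BY h
FORMAT JSONEachRow
"""


def build_request(ch_url: str, workspace_id: str, days: int = 7) -> urllib.request.Request:
    """Build (but do not send) a POST request for the boot-time percentiles."""
    params = urllib.parse.urlencode({
        "database": "pandastack",
        "param_ws": workspace_id,  # bound to {ws:String}
        "param_days": days,        # bound to {days:UInt16}
    })
    return urllib.request.Request(
        f"{ch_url}?{params}",
        data=BOOT_PERCENTILES_SQL.encode("utf-8"),
        method="POST",
    )


# Hypothetical endpoint and workspace ID, for illustration only.
req = build_request("http://localhost:8123/", "ws_123")
print(req.get_method(), req.full_url)
# To execute: urllib.request.urlopen(req), then read one JSON object per line.
```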

Use cases

Autoscaling agent worker pools. Subscribe to per-sandbox metrics; when p95 CPU stays >70 % for 5 min, fork a worker. The same primitives Kubernetes HPA uses, without the YAML.
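
One way to read the rule above is "the 95th percentile of CPU samples over the last five minutes exceeds 70 %". The sketch below implements that reading with a nearest-rank percentile. The `should_fork` name, the `(timestamp, cpu_percent)` sample shape, and the `min_samples` guard are assumptions made for illustration. How samples are collected and how the fork is issued are left to the caller.

```python
import math
from typing import Sequence, Tuple


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def should_fork(
    samples: Sequence[Tuple[float, float]],  # (unix timestamp, cpu percent)
    now: float,
    window_s: float = 300,
    threshold: float = 70.0,
    min_samples: int = 10,
) -> bool:
    """True when p95 CPU over the trailing window exceeds the threshold."""
    recent = [cpu for ts, cpu in samples if now - window_s <= ts <= now]
    if len(recent) < min_samples:
        return False  # too little data to make a scaling decision
    return percentile(recent, 95) > threshold


# Thirty samples at 10 s spacing: sustained load versus a single spike.
hot = [(float(t), 85.0) for t in range(0, 300, 10)]
spiky = [(float(t), 20.0) for t in range(0, 290, 10)] + [(290.0, 95.0)]
print(should_fork(hot, now=300.0), should_fork(spiky, now=300.0))
```

With 30 samples per window, a single spike does not move the nearest-rank p95, but two or more samples above the threshold do. If that is too sensitive for your workload, lower the percentile or require several consecutive windows to breach.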

Cost attribution by team. Tag sandboxes with metadata.team=… at create; group ClickHouse rollups by metadata to get a per-team vCPU-hour invoice. Already wired in the /usage page.

SLO dashboards. Watch p99 boot_ms over a 28-day window. Alert when it crosses your budget.

Post-incident forensics. A customer's sandbox crashed at 03:14 UTC. Pull the events stream around that time, cross-reference with audit_log to see what API call preceded it, look at sandbox_metrics for the memory curve.

Limits

  • ClickHouse retention is the table TTL. After that, rows are deleted server-side; if you need 5-year SLA evidence, dump to S3 with INSERT INTO FUNCTION s3(...) SELECT * FROM pandastack.audit_log on a schedule.
  • Charts in the dashboard query buckets, not raw rows. For >1 M-row analytical queries, prefer the direct CH interface.
  • Metric write path is best-effort: if ClickHouse is down, the agent buffers ~5 min then drops the oldest. Critical events go to events.jsonl on the host as a backup.
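
The drop-oldest behaviour in the last bullet can be pictured as a fixed-size ring buffer. The sketch below is an illustration of that policy only, not the agent's actual code: the `DropOldestBuffer` class is hypothetical, and the capacity of 30 assumes one row per 10 s for roughly five minutes for a single sandbox.

```python
from collections import deque
from typing import Any, List


class DropOldestBuffer:
    """Bounded buffer that discards the oldest row when full."""

    def __init__(self, capacity: int) -> None:
        self._rows: deque = deque(maxlen=capacity)
        self.dropped = 0  # count of rows lost while the sink was unavailable

    def push(self, row: Any) -> None:
        if len(self._rows) == self._rows.maxlen:
            self.dropped += 1  # the append below evicts the oldest row
        self._rows.append(row)

    def drain(self) -> List[Any]:
        """Return buffered rows oldest-first and empty the buffer."""
        rows = list(self._rows)
        self._rows.clear()
        return rows


# About 5 minutes at 1 row / 10 s; 35 pushes overflow the buffer by 5.
buf = DropOldestBuffer(capacity=30)
for i in range(35):
    buf.push(i)
flushed = buf.drain()
print(len(flushed), flushed[0], buf.dropped)
```

For consumers, the practical consequence is that a gap in sandbox_metrics after a ClickHouse outage starts at the oldest samples, while the most recent window survives.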
