Data pipelines
Run untrusted data transformations in disposable sandboxes. This suits ETL, scraping, and one-off batch jobs.
Pattern: parallel batch transform
You have 10,000 documents to process. Each runs a different transformation. You don't want one bad input to take down the whole job.
import json
from concurrent.futures import ThreadPoolExecutor

from pandastack import Sandbox

# load() and doc_ids come from your own code: load(doc_id) returns the document
# as JSON-serializable data, and doc_ids is the list of IDs to process.

def process_one(doc_id: str) -> dict:
    # One disposable sandbox per document, so a bad input only fails its own run.
    with Sandbox(template="python-data", cpu=1, memory_mb=512) as sb:
        sb.write("/tmp/input.json", json.dumps(load(doc_id)))
        result = sb.run("python /pipeline.py /tmp/input.json /tmp/output.json", timeout=60)
        if result.exit_code != 0:
            return {"doc_id": doc_id, "error": result.stderr}
        return {"doc_id": doc_id, "output": json.loads(sb.read("/tmp/output.json"))}

# max_workers caps how many sandboxes this process runs at once.
with ThreadPoolExecutor(max_workers=200) as ex:
    results = list(ex.map(process_one, doc_ids))

A cluster of 5 c5n.metal agents handles 1,000 concurrent sandboxes comfortably.
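One caveat with the batch above: `ex.map` re-raises a worker's exception when its result is retrieved, so an exception raised inside `process_one` (as opposed to a nonzero exit code) would still abort the collection. Below is a minimal sketch of one way to isolate that, using only the standard library. `fake_process_one` and `run_batch` are hypothetical names, and `fake_process_one` stands in for `process_one` so the example runs without a sandbox.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_process_one(doc_id: str) -> dict:
    # Hypothetical stand-in for process_one: raises on one input.
    if doc_id == "doc-2":
        raise TimeoutError("sandbox timed out")
    return {"doc_id": doc_id, "output": doc_id.upper()}


def run_batch(worker, doc_ids, max_workers=8):
    """Run worker over doc_ids, turning exceptions into per-document error records."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        # Map each future back to its document ID so failures can be attributed.
        futures = {ex.submit(worker, d): d for d in doc_ids}
        for fut in as_completed(futures):
            doc_id = futures[fut]
            try:
                results.append(fut.result())
            except Exception as exc:
                results.append({"doc_id": doc_id, "error": repr(exc)})
    return results


results = run_batch(fake_process_one, ["doc-1", "doc-2", "doc-3"])
failed = [r for r in results if "error" in r]
succeeded = [r for r in results if "output" in r]
```

Results arrive in completion order rather than input order, which is why each record carries its own `doc_id`.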
Pattern: scraper with template
Pre-bake your scraper into a template so each sandbox boots ready-to-go:
cat > scraper/Dockerfile <<'EOF'
FROM ghcr.io/pandastack/templates:python-data
RUN pip install playwright requests beautifulsoup4 lxml
RUN playwright install chromium --with-deps
COPY scrape.py /scrape.py
EOF
pandastack templates build scraper ./scraper

Then:
with Sandbox(template="scraper", memory_mb=2048) as sb:
    result = sb.run(f"python /scrape.py {url}", timeout=30)

Each sandbox starts in under 500 ms because the browser is already baked into the template, processes one URL, and is then discarded.
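Because the URL is interpolated into a command string, a URL containing spaces or shell metacharacters could change the command that runs. Assuming `sb.run` hands the string to a shell (the text above does not say), here is a sketch of quoting the URL first with the standard library's `shlex.quote`. `build_scrape_command` is a hypothetical helper, not part of PandaStack.

```python
import shlex


def build_scrape_command(url: str) -> str:
    # Quote the URL so a shell treats it as exactly one argument.
    return f"python /scrape.py {shlex.quote(url)}"


# A hostile-looking URL stays a single argument after quoting.
cmd = build_scrape_command("https://example.com/a?x=1&y=2; rm -rf /")
```

The result can then be passed as the command string, for example `sb.run(build_scrape_command(url), timeout=30)`.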
Pattern: snapshot mid-pipeline for restart
For multi-hour jobs:
sb = Sandbox(template="python-data", memory_mb=4096)
sb.run("python /pipeline.py --stage extract --out /tmp/extracted.parquet")
# Snapshot before the expensive step
snap = sb.snapshot()
print(f"checkpoint: {snap.id}")
sb.run("python /pipeline.py --stage transform --in /tmp/extracted.parquet --out /tmp/transformed.parquet")

If the transform step crashes, you can restore from snap.id and retry the second half without redoing extract.
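The retry logic around a checkpoint might be sketched as follows. The restore call itself is not shown above, so this example takes it as a callable. `retry_from_checkpoint`, `stage`, and `restore` are hypothetical names, and the flaky stage simulates a crash so the example runs without a sandbox.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_from_checkpoint(
    stage: Callable[[], T],
    restore: Callable[[], None],
    max_attempts: int = 3,
) -> T:
    """Run stage; on failure call restore and try again, up to max_attempts."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return stage()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Roll back to the checkpoint before the next attempt.
                restore()
    raise RuntimeError(f"stage failed after {max_attempts} attempts") from last_error


# Simulated transform stage that crashes twice, then succeeds.
calls = {"stage": 0, "restore": 0}


def flaky_transform() -> str:
    calls["stage"] += 1
    if calls["stage"] < 3:
        raise RuntimeError("transform crashed")
    return "/tmp/transformed.parquet"


def fake_restore() -> None:
    # Placeholder for restoring the sandbox from snap.id.
    calls["restore"] += 1


output_path = retry_from_checkpoint(flaky_transform, fake_restore)
```

In a real pipeline, `stage` would wrap the transform `sb.run(...)` call and raise on a nonzero exit code, while `restore` would perform whatever restore operation the SDK provides for `snap.id`.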
Cost notes
Sandboxes are billed per second of wall-clock running time plus RAM-seconds. A typical scraper sandbox running for 5 seconds with 512 MiB costs about $0.00006 on managed PandaStack.
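As a rough worked example, the per-sandbox figure above can be scaled to a batch. This sketch assumes every run matches the typical 5-second, 512 MiB profile, which real jobs may not, since cost grows with runtime and memory. `estimate_batch_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Per-sandbox cost quoted above for a 5-second, 512 MiB scraper run.
COST_PER_SANDBOX = Decimal("0.00006")


def estimate_batch_cost(n_sandboxes: int, cost_per_sandbox: Decimal = COST_PER_SANDBOX) -> Decimal:
    # Decimal avoids float rounding noise on very small dollar amounts.
    return cost_per_sandbox * n_sandboxes


total = estimate_batch_cost(10_000)
```

Under that assumption, 10,000 such runs come to about $0.60.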