The Cloud Lab Guardian (and the seven reasons we needed one)

A week-long cloud-lab outage traced to one 5-line shell script bug (LAST=0 in the idle-shutdown check, reproduced live twice). Full arc: the false SSH-key diagnosis, the real bug, six stacked infrastructure issues (including fire-and-poll shipped as 5 serial hotfixes to live traffic), the rebuild in three phases, a version-controlled priority-task roadmap with a submit-time validator and a 60-second auto-advance ticker — then a red-team battery against the guardian that caught a critical capitalization bypass (target="Cloud-Lab" skipped the validator entirely). 22/22 tests pass after the bypass fix.

Yesterday we wanted to run the xG A/B solver-cache work on cloud-lab. Today it almost turned into a week-long outage for the second time. Here's the full arc — the false start, the real bug, the rebuild, the roadmap contract, and the red-team battery that found a critical bypass in the thing we'd just built.

Cloud Lab was supposed to be bulletproof

Cloud-lab is a Hetzner CPX 42 (8 vCPU / 16 GB RAM) that runs heavy compute — solver caches, factorials, long backtests — dispatched from the main server via lib/compute/queue.ts's fire-and-poll pattern. The idea: kick off work, go to sleep, come back in the morning to results.

The idea has never once worked.

Here's what actually happened Apr 7 → Apr 14:

  • Apr 7: We shipped autopilot-runner.ts + mega-queue.ts — a whole mini-scheduler so cloud-lab could drive itself. 299 queued jobs.
  • Apr 8–11: Cloud-lab booted every four hours, ran for exactly 30 minutes, then shut itself down. 24 complete cycles. Zero useful work landed. Nobody noticed until I went to check the results.
  • Apr 11: Emergency session. We rearchitected the dispatcher to fire-and-poll. Five serial commits (8ec880917adde16c) rushed into production between 14:48 and 00:11 UTC while jobs were in flight. Each commit fixed the previous one's bug. The test suite landed last.
  • Apr 11 15:30 → Apr 12 03:30: A 12-hour solver-cache run timed out exactly on the 12-hour wall in queue.ts. The work was stranded on cloud-lab's disk, never synced back.
  • Apr 12 12:30: A blog post went up titled "Cloud Lab: Fire-and-Poll Architecture" claiming "228 solver caches for 2025-26. Fire-and-poll verified working in production." The 12-hour timeout had already fired nine hours earlier.

On Apr 14 we woke up with no completed work, one confused success post, and a reminder that "bulletproof" means nothing without proof.

The false start

Starting the investigation, I grep'd the solver-cache log and saw a suspicious line 14 times:

[poll] SSH unreachable: Warning: Identity file /root/.ssh/id_ed25519 not accessible: No such file or directory.

I called it the smoking gun. I was wrong.

The user pushed back: _how do we know this was the root cause?_

So I actually counted:

  • grep -c 'Identity file' → 14
  • grep -c 'FAILED' (before the final timeout line) → 0

The solver kept progressing through dates — 2013-02, 2016-01, 2022-09, 2023-03 — for 12 hours after the first SSH warning. Fire-and-poll's retry-on-SSH-failure design worked exactly as intended. The SSH key was noisy, but it was not the killer. The job was killed by a 12-hour timeout wall, not by SSH drops.
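The counting step is easy to reproduce outside the shell. Here is a minimal sketch, assuming the log is available as one string; countLines is a hypothetical helper that mirrors grep -c, and the sample lines are made up for illustration rather than copied from the real solver-cache log.

```typescript
// Count the lines that contain a marker, which is what `grep -c` reports.
function countLines(log: string, marker: string): number {
  return log.split("\n").filter((line) => line.includes(marker)).length;
}

// Illustrative sample only; these are not lines from the real log.
const sampleLog = [
  "[solver] 2013-02 done",
  "[poll] SSH unreachable: Warning: Identity file not accessible",
  "[solver] 2016-01 done",
  "[poll] SSH unreachable: Warning: Identity file not accessible",
  "[solver] 2022-09 done",
].join("\n");

const warnings = countLines(sampleLog, "Identity file");
const failures = countLines(sampleLog, "FAILED");

// A warning that recurs while work keeps progressing and FAILED stays at
// zero is noise worth fixing, but it is not evidence of a root cause.
const warningIsLikelyNoise = warnings > 0 && failures === 0;
```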

"A visible warning in the log" ≠ "the thing that killed the job." I confused the two. Back to basics.

The real bug, reproduced live twice

Full forensic dump from cloud-lab's own /root/ surfaced it:

# /root/watchdog.sh  (runs every 30 min via systemd timer)
LAST=$(cat /tmp/last-job-activity 2>/dev/null || echo 0)
NOW=$(date +%s)
IDLE=$((NOW - LAST))
# PROCS (the count of running job processes) is set earlier in the script; that part is not shown.
if [ "$PROCS" -eq 0 ] && [ "$IDLE" -gt 7200 ]; then
  curl -X POST ... /actions/shutdown
fi

/tmp is wiped on every boot. LAST=$(cat file 2>/dev/null || echo 0) returns 0 when the file is missing. IDLE = NOW - 0 = the current unix timestamp ≈ 1.78 billion seconds ≈ 56 years. Always greater than 7200. Unconditional shutdown.

And with OnBootSec=5min in the systemd timer, the watchdog fires five minutes after every fresh boot. Every new boot, five minutes later, the script reads "no activity since 1970" and kills the VM via Hetzner API.
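The arithmetic is worth seeing with real numbers. This is a sketch in TypeScript rather than the shell the watchdog is written in; buggyIdleSeconds is a hypothetical name, and null stands in for "the activity file does not exist." The timestamp is the Idle value from the watchdog.log line quoted in this post.

```typescript
const IDLE_THRESHOLD_SECONDS = 7200; // the 2-hour limit from the watchdog check

// Mirrors `LAST=$(cat file 2>/dev/null || echo 0)`: a missing file becomes 0.
function buggyIdleSeconds(nowEpoch: number, lastActivity: number | null): number {
  const last = lastActivity ?? 0; // the fallback-to-zero is the bug
  return nowEpoch - last;
}

const nowEpoch = 1776091723; // seconds since 1970, taken from watchdog.log
const idle = buggyIdleSeconds(nowEpoch, null);

// With LAST = 0 the "idle time" is simply the current unix timestamp.
const idleYears = idle / (365.25 * 24 * 3600);
const wouldShutDown = idle > IDLE_THRESHOLD_SECONDS;
```

Any boot after 02:00 UTC on Jan 1, 1970 would trip the same comparison, which is why the shutdown was unconditional.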

I reproduced it live, twice, on Apr 13 during the investigation:

| Boot (Hetzner) | Shutdown (Hetzner) | Interval |
| --- | --- | --- |
| 14:43:26 UTC | 14:48:43 UTC | 5m 17s |
| 15:24:48 UTC | 15:30:01 UTC | 5m 13s |

The watchdog.log showed it in flagrante delicto:

Apr 13 02:48:43 PM UTC 2026 — Procs: 0, Idle: 1776091723s
Apr 13 02:48:43 PM UTC 2026 — No processes, idle 2h+. Shutting down.

Idle: 1776091723s. About 56 years. Cloud-lab had been up for 5 minutes.

This is why the pre-Apr-11 schedule was so regular: boot at HH:00, auto-shutdown cron fires at HH:30, shutdown. Every. Single. Time. The one 44-hour stretch it survived was Apr 11–13, because we had master-orchestrator.sh keeping tsx processes alive continuously — the watchdog's PROCS > 0 early-exit saved it. As soon as the orchestrator finished, the next cron tick hit the bug and killed the VM again.

Six more bugs on top

The LAST=0 bug was the hinge. The other six were stacked on top and made everything worse:

  1. Fire-and-poll shipped as five serial hotfixes to production main between 14:48 and 00:11 UTC Apr 11. Jobs submitted between commits hit whichever in-flight bug was current. No end-to-end smoke test before the first real workload.
  2. Job timeouts set without measuring workload. queue.ts hardcoded solver-cache = 12h for a workload (26 leagues × 13 seasons) that actually needs 20+ hours. The 12-hour wall killed the only real run of the session.
  3. `rsync` missing from the Dockerfile. syncCloudLabResults shells out to rsync, but the container has only openssh-client, so every cloud-lab job's result sync fails silently with rsync: not found. The factorial test on Apr 11 exited 0 with all-zero output, rsync silently failed, and the job was marked "completed." Three bugs stacked into a false success.
  4. `waitForSSH` too tight. queue.ts:522 gave SSH only 15 seconds to come up after Hetzner reported "running." Cloud-init on the image takes ~12s. Three seconds of margin.
  5. Two competing queue systems. lib/compute/queue.ts in the web container and scripts/compute-worker.ts in the compute-worker container, both reading the same jobs.json, neither checking target. The compute-worker happily ran cloud-lab-targeted jobs locally — which is what produced the Apr 11 "factorial all zeros" (no input data → empty result → exit 0).
  6. Job status conflates infra failures with legitimate script failures. Signal tests that were correctly rejected by the approval gates (for being unregistered) showed up as status: failed in jobs.json, indistinguishable from real infra crashes. The Apr 11 session _looked_ like a total compute collapse when half of it was signal science working as designed.
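Items 3 and 6 both come down to a status field that says too little. One way to separate the cases is sketched below. The outcome names and the JobResult shape are hypothetical, chosen for illustration; they are not the values jobs.json uses.

```typescript
// Hypothetical taxonomy; these names are illustrative, not the real jobs.json values.
type JobOutcome = "completed" | "rejected-by-gate" | "script-failed" | "infra-failed";

interface JobResult {
  rejectedByGate: boolean; // an approval gate refused the job before it ran
  exitCode: number | null; // null when the process never reported an exit code
  syncOk: boolean;         // the result sync back to the main server succeeded
}

function classify(r: JobResult): JobOutcome {
  if (r.rejectedByGate) return "rejected-by-gate"; // the gate working as designed
  if (r.exitCode === null) return "infra-failed";  // the job never ran to completion
  if (r.exitCode !== 0) return "script-failed";    // the script itself reported failure
  // Exit 0 alone is not success: the results also have to make it back.
  return r.syncOk ? "completed" : "infra-failed";
}
```

Under this sketch, the Apr 11 factorial run (exit 0, sync failed) would have been reported as an infra failure instead of "completed."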

Full detail is in docs/specs/cloud-lab-apr11-retro-and-rebuild.md.

Phases 0, 1, 2

Phase 0 — fix the LAST=0 bug on cloud-lab. Both /root/auto-shutdown.sh and /root/watchdog.sh now treat a missing activity file as "just booted, initialize and skip" instead of "epoch":

LAST=$(cat /tmp/last-job-activity 2>/dev/null)
if [ -z "$LAST" ]; then
  date +%s > /tmp/last-job-activity
  exit 0
fi

Plus an @reboot cron to pre-seed the file. Scripts committed to scripts/cloud-lab-bootstrap/ so the fix survives VM rebuilds. 5 lines of shell script unblocked 5 days of lost compute.

We verified live. /root/watchdog.log from the first firing after the fix:

Mon Apr 13 04:45:31 PM UTC 2026 — Procs: 0, Idle: 105s

105 seconds. Not 1.7 billion. Cloud-lab stayed up.
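The fixed decision logic is small enough to write as a pure function and test without a VM. This is a TypeScript sketch of the same rule, not the shell that runs on cloud-lab; decide and WatchdogAction are hypothetical names, and null again means "activity file missing."

```typescript
type WatchdogAction = "initialize-and-skip" | "shutdown" | "stay-up";

function decide(
  procs: number,
  lastActivity: number | null,
  nowEpoch: number,
  thresholdSeconds = 7200,
): WatchdogAction {
  // Missing file means "just booted": seed the file and do nothing else.
  if (lastActivity === null) return "initialize-and-skip";
  const idle = nowEpoch - lastActivity;
  return procs === 0 && idle > thresholdSeconds ? "shutdown" : "stay-up";
}

const now = 1776091723;
const freshBoot = decide(0, null, now);          // no longer reads as "idle since 1970"
const shortlyAfter = decide(0, now - 105, now);  // the 105 s case from the log above
const trulyIdle = decide(0, now - 7201, now);    // the shutdown path still works
```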

Phase 1 — fix the sharp corners in compute-worker:

  • queue.ts:32 solver-cache timeout 12h → 24h, with a comment documenting the 26 × 13 workload.
  • queue.ts:522 waitForSSH 15 s → 60 s.
  • queue.ts SSH key path now reads CLOUD_LAB_SSH_KEY_PATH env var (default unchanged for backward compat).
  • Dockerfile installs rsync alongside openssh-client.
  • New smoke test at scripts/cloud-lab-smoke-test.ts that exercises power-on → fire → poll → rsync-round-trip end-to-end. This is what should have shipped _before_ fire-and-poll, not after it.

Phase 2 — the queue split. The compute-worker.ts tick() never checked target, which is what let it pick up cloud-lab jobs and run them locally. Fix:

// compute-worker.ts
const queued = jobs.filter(
  j => j.status === "queued" && !processes.has(j.id) && j.target !== "cloud-lab",
);
// queue.ts (the other side)
const next = [...this.jobs.values()].find(
  (j) => j.status === "queued" && j.target === "cloud-lab",
);

The split is now enforced in code, not by convention. queue.ts handles cloud-lab only. compute-worker.ts handles local only. Neither can accidentally pick up the other's work. The dormant scripts still sitting in git (scripts/autopilot-runner.ts, scripts/mega-queue.ts, and scripts/cloud-autopilot.sh), lying in wait for the next confused session, were deleted.
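The property that matters is that the two predicates partition the queued jobs: every queued job is claimed by exactly one side. A minimal sketch to check that, assuming a simplified Job shape and leaving out the processes.has(j.id) bookkeeping from the real filter; the sample jobs are invented.

```typescript
type Target = "local" | "cloud-lab";

interface Job {
  id: string;
  status: "queued" | "running" | "done";
  target: Target;
}

// The same target conditions as the two filters above.
const forLocalWorker = (j: Job): boolean => j.status === "queued" && j.target !== "cloud-lab";
const forCloudLab = (j: Job): boolean => j.status === "queued" && j.target === "cloud-lab";

// Invented sample queue, for illustration only.
const jobs: Job[] = [
  { id: "a", status: "queued", target: "cloud-lab" },
  { id: "b", status: "queued", target: "local" },
  { id: "c", status: "running", target: "cloud-lab" },
];

// No job may be claimed by both sides, and no queued job may be orphaned.
const claimedByBoth = jobs.filter((j) => forLocalWorker(j) && forCloudLab(j));
const queuedButUnclaimed = jobs.filter(
  (j) => j.status === "queued" && !forLocalWorker(j) && !forCloudLab(j),
);
```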

The guardian — because the Phase 2 race was a symptom, not the disease

The Phase 2 fix stopped compute-worker from stealing cloud-lab jobs, but it didn't address the actual underlying problem: any session with `CRON_SECRET` could submit anything to cloud-lab, and there was no way for a future session to know whether the work was sanctioned, duplicative, or about to collide.

Cloud-lab had a brain (queue.ts) but no orders.

The guardian adds the orders. Three components:

(1) A version-controlled roadmap. data/cloud-lab/roadmap.json. Every task approved to run on cloud-lab is listed here. Git is the review gate — adding/editing requires commit + push + deploy, no runtime edits, no backdoor.

{
  "id": "xg-ab-treat-a",
  "type": "solver-cache",
  "target": "cloud-lab",
  "paramsMatch": { "xgWeight": "0.2" },
  "priority": 1,
  "status": "in-progress",
  "goal": "Generate remaining Treat A snapshots to match Control count",
  "completionCriteria": "Treat A count >= Control × 0.95",
  "blockedBy": []
}

Matching is by subset: an entry with paramsMatch: {"xgWeight": "0.2"} matches any solver-cache submission with that xgWeight regardless of league. paramsMatch: {} is a wildcard. Submissions must also land on a task whose status is approved or in-progress, and whose blockedBy chain is fully done.
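Subset matching is short to express. The sketch below is my reading of the rule, not a copy of the validator: paramsSubsetMatch and isRunnable are hypothetical names, parameter values are compared as strings as in the roadmap entry above, and the blockedBy check only looks at direct dependencies.

```typescript
type Params = Record<string, string>;

interface RoadmapTask {
  id: string;
  paramsMatch: Params;
  status: string; // "approved", "in-progress", "done", ...
  blockedBy: string[];
}

// True when every key in `pattern` appears in `params` with the same value.
// An empty pattern has no keys to violate, so it matches everything.
function paramsSubsetMatch(pattern: Params, params: Params): boolean {
  return Object.keys(pattern).every((key) => params[key] === pattern[key]);
}

// A task accepts submissions only while approved or in-progress, and only
// once each task it is blocked by has reached "done".
function isRunnable(task: RoadmapTask, roadmap: RoadmapTask[]): boolean {
  const statusOk = task.status === "approved" || task.status === "in-progress";
  const depsDone = task.blockedBy.every((depId) =>
    roadmap.some((t) => t.id === depId && t.status === "done"),
  );
  return statusOk && depsDone;
}
```

A dependency id that does not exist in the roadmap counts as not done here, which keeps the check closed by default.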

(2) A validator at the submit endpoint. /api/compute/submit loads the roadmap on every target: "cloud-lab" submission and looks for a match. No match → 403. Match but blocked deps → 409. Roadmap file corrupted → 503 (fail closed, never fall open). Every accepted job gets its roadmapTaskId stamped on it for forensic trace, plus whatever the caller passed in X-Compute-Source for provenance.
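The status-code mapping can be sketched as one pure function. Everything here is an assumption layered on the description above: validateSubmission, the Verdict shape, and the error strings are hypothetical; null stands in for a roadmap file that is missing or unparseable; and I treat the approved/in-progress requirement as part of matching, so a task in any other status yields 403.

```typescript
interface Task {
  id: string;
  type: string;
  paramsMatch: Record<string, string>;
  status: string;
  blockedBy: string[];
}

interface Submission {
  type: string;
  params: Record<string, string>;
}

type Verdict =
  | { status: 200; roadmapTaskId: string }
  | { status: 403 | 409 | 503; error: string };

function validateSubmission(roadmap: Task[] | null, sub: Submission): Verdict {
  // Fail closed: an unreadable roadmap rejects everything.
  if (roadmap === null) return { status: 503, error: "roadmap unavailable" };

  const candidates = roadmap.filter(
    (t) =>
      t.type === sub.type &&
      (t.status === "approved" || t.status === "in-progress") &&
      Object.keys(t.paramsMatch).every((k) => sub.params[k] === t.paramsMatch[k]),
  );
  if (candidates.length === 0) return { status: 403, error: "not in roadmap" };

  const isDone = (id: string): boolean =>
    roadmap.some((t) => t.id === id && t.status === "done");
  const runnable = candidates.find((t) => t.blockedBy.every(isDone));
  if (!runnable) return { status: 409, error: "dependencies not done" };

  // The matched task id is what gets stamped on the accepted job.
  return { status: 200, roadmapTaskId: runnable.id };
}
```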

(3) A 60-second auto-advance ticker. Even with no session present, queue.ts's constructor registers a setInterval that reads the roadmap every 60 s, finds the highest-priority eligible task with no active job, and self-submits it with source: "guardian/auto-advance". One task per tick to avoid racing manual submissions. Capacity-aware. Cloud-lab fills itself from the priority list.
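The selection step inside each tick might look like the sketch below. It is not the code in queue.ts: pickNext is a hypothetical name, capacity is modeled as a simple freeSlots count, and I am assuming a lower priority number means higher priority, which the roadmap entry above (priority: 1) suggests but does not state.

```typescript
interface Task {
  id: string;
  priority: number;
  status: string;
  blockedBy: string[];
}

// Returns at most one task per call, matching "one task per tick".
function pickNext(roadmap: Task[], activeTaskIds: Set<string>, freeSlots: number): Task | null {
  if (freeSlots <= 0) return null; // capacity-aware: nothing to do when full

  const isDone = (id: string): boolean =>
    roadmap.some((t) => t.id === id && t.status === "done");

  const eligible = roadmap.filter(
    (t) =>
      (t.status === "approved" || t.status === "in-progress") &&
      !activeTaskIds.has(t.id) && // skip tasks that already have an active job
      t.blockedBy.every(isDone),
  );

  // Assumption: lower number = higher priority.
  eligible.sort((a, b) => a.priority - b.priority);
  return eligible[0] ?? null;
}
```

In the real ticker a function like this would be called from the setInterval callback, with the result submitted under source "guardian/auto-advance".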

Every session now has one spec to read (docs/specs/cloud-lab-guardian-spec.md) and one JSON file to consult (data/cloud-lab/roadmap.json) before touching cloud-lab. The spec ends with:

Emergency bypass: there is none. If a real emergency requires a one-off job, add it to the roadmap and commit. Overhead is ~3 minutes end-to-end. The whole point of the guardian is that there is no backdoor.

Or at least, that's what the spec promised.

Red-teaming our own guardian

The guardian shipped. Four live happy-path tests passed. Ticker observed firing. The user asked: "can we run some red-team tests to ensure this works?"

I built an 18-test battery against the live /api/compute/submit endpoint — auth bypasses, matching edge cases, malformed inputs, case/whitespace attacks, bypass attempts, shell injection. All non-mutating (everything should hit a rejection path before a job is created).

Round one: 14 passed, 4 failed. The failures were real bugs — in the thing I'd just shipped.

Test #14 — `target: "Cloud-Lab"` (capitalized) → 200 OK, job created.

The route's "normalization" was:

const validTarget = target === "cloud-lab" ? "cloud-lab" : "local";

Anything not exactly lowercase "cloud-lab" fell through to "local". A hostile session could submit cloud-lab-intended work with target: "Cloud-Lab" — or a trailing space, or "CLOUD-LAB", or any other capitalization — and the validator would see target === "local", skip the roadmap check entirely, and route the job to the compute-worker, which has zero validation by design (it's local-only, out of guardian scope). A total bypass of the guardian via one character of case.

Severity: high. Exactly the kind of thing the spec was supposed to prevent.

The tests that found it actually got through — two rogue solver-cache --xg-weight=0.99 jobs were now live on compute-worker. We SIGTERM'd them before they wrote any snapshots to disk, then fixed the route to reject any target that isn't an exact "local" or "cloud-lab":

if (target !== "cloud-lab" && target !== "local") {
  return NextResponse.json(
    { error: `Invalid target. Must be exactly "local" or "cloud-lab".`, received: target },
    { status: 400 },
  );
}
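To make the difference concrete, here are the two behaviours side by side as standalone functions. This is a sketch: lenientTarget reproduces the one-line "normalization" quoted above, strictTarget is a hypothetical stand-in for the fixed check, and null represents the 400 response.

```typescript
type Target = "local" | "cloud-lab";

// Old behaviour: anything that is not exactly "cloud-lab" silently becomes "local".
function lenientTarget(target: unknown): Target {
  return target === "cloud-lab" ? "cloud-lab" : "local";
}

// Fixed behaviour: exact match or rejection. No trimming, no case folding.
function strictTarget(target: unknown): Target | null {
  if (target === "cloud-lab") return "cloud-lab";
  if (target === "local") return "local";
  return null;
}

const attempts: unknown[] = ["Cloud-Lab", "CLOUD-LAB", "cloud-lab ", undefined];

// Under the old rule every attempt is rerouted to "local" and skips the roadmap check.
const reroutedByOldRule = attempts.filter((t) => lenientTarget(t) === "local").length;
// Under the new rule every attempt is rejected outright.
const rejectedByNewRule = attempts.filter((t) => strictTarget(t) === null).length;
```

Rejecting rather than normalizing is deliberate: lowercasing and trimming would also close this hole, but an exact-match rule leaves no room for the next encoding trick.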

Tests #9 and #10 surfaced a second bug: malformed bodies (no body, invalid JSON) returned 500 instead of 400 because request.json() throws uncaught. Fixed with a try/catch.
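The shape of that fix, sketched without the framework: parseBody is a hypothetical helper that takes the raw body text and uses JSON.parse in place of request.json(), so that a parse failure becomes a 400 result instead of an uncaught exception. The error strings are illustrative.

```typescript
type ParseResult =
  | { ok: true; body: Record<string, unknown> }
  | { ok: false; status: 400; error: string };

function parseBody(raw: string): ParseResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw); // throws SyntaxError on empty or malformed input
  } catch {
    return { ok: false, status: 400, error: "Body is not valid JSON" };
  }
  // Valid JSON that is not an object (a number, an array, null) is still a bad request.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, status: 400, error: "Body must be a JSON object" };
  }
  return { ok: true, body: parsed as Record<string, unknown> };
}
```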

Round two, with fixes deployed: 22/22 passed.

| Bucket | Tests | Result |
| --- | --- | --- |
| Auth layer | 2 | 2/2 → 401 |
| Roadmap matching rejections | 4 | 4/4 → 403 |
| Blocked-dependency (409) | 2 | 2/2 → 409 |
| Malformed input | 6 | 6/6 → 400 |
| Case/whitespace attacks | 4 | 4/4 → 400 |
| Bypass attempts (force flags, shell injection, null bytes) | 4 | 4/4 → 403 |

Every bypass I could think of — ?force=true query parameter, bypass: true field in the body, shell metacharacters in param values, null bytes, uppercase targets, trailing spaces — caught and rejected. The 503 fail-closed path on corrupted roadmap is the one thing I can't demonstrate in a red-team battery without actually breaking the roadmap file in production, but it's covered in the validator's try/catch.

Lessons

"A visible warning in the log" ≠ "the thing that killed the job." I called the SSH key issue a smoking gun before counting anything. When the user pushed back with "how do we know this was the root cause?" and I actually counted occurrences of FAILED vs Identity file in the log, the numbers instantly disproved my first diagnosis. Debug from the foundation up, and demand evidence before calling anything a root cause.

Shipping compute infrastructure as hotfixes to live traffic is how you create zero-value work. Five serial commits to main, each fixing the last, with the test suite landing at the end, meant that every job submitted during that window hit a different in-flight version. The blog post declaring "fire-and-poll verified working" went up nine hours after the only long job had been killed by a timeout nobody measured. Rule: any change to compute dispatch needs a committed smoke test that runs BEFORE the feature commit, not after. Added to tasks/lessons.md.

Off-by-nothing. LAST=$(cat file || echo 0) followed by (NOW - LAST) > threshold is a bug pattern, not a script. The missing-file case has to be handled explicitly — either "just booted, initialize and skip" or "treat as recent." Never let a fallback-to-zero silently flow into a time comparison. The reason this hurt so much: the failure mode was "server turns itself off" — infrastructure doing exactly what it was told, by code whose logic everyone had glanced at.

Red-team your own fences. We built a guardian to prevent unauthorized cloud-lab work, shipped it, tested the happy path four times, and called it done. Then the user asked "can we red-team this?" and the first battery found a critical bypass we'd just introduced. Live tests of the specific thing you claim to enforce are cheap. Do them. Do them before the blog post.

Don't overclaim in writeups. The Apr 12 fire-and-poll post will get an incident note appended today. The rule going forward: any public claim ("shipped", "verified working", "deployed") must cite a specific artifact — commit, test name, dashboard URL, job ID with exit 0 AND result-sync success. No bare claims. No "should work."

What this unlocked

Cloud-lab has been running the three xG A/B solver-cache treatments (Treat A xgWeight=0.2, B=0.1, C=0.05) continuously since the Phase 0 fix landed. The guardian's auto-advance ticker is picking up work as slots free. When all three hit done, the four Phase-2 backtests (already in the roadmap as approved, blockedBy the solver caches) will fire automatically.

The full rebuild was seven commits — cloud-lab bootstrap scripts, queue.ts fixes, Dockerfile, the Phase 2 split, the guardian + roadmap + validator + ticker + CLI + spec, and the post-red-team guardian fix. Zero of them disturbed in-flight work. The original compute that sparked this retro is still running.

The cloud-lab contract now:

  1. One document to read: docs/specs/cloud-lab-guardian-spec.md
  2. One file to consult: data/cloud-lab/roadmap.json
  3. One endpoint to call: POST /api/compute/submit with X-Compute-Source: claude/<session-hint>
  4. One 403 response if you try anything that isn't in the roadmap
  5. One 409 response if you try to run work whose dependencies aren't done
  6. One 60-second ticker that fills idle capacity from the priority list without any session having to be present

Not bulletproof. But provably hard to sneak past. And, crucially, every rule is enforced by tested code — not by "just be careful."