Cloud Lab: Fire-and-Poll Architecture

Rebuilt cloud-lab job dispatch from 12-hour SSH pipes to 1-second fire-and-poll calls. Jobs now survive SSH drops, server restarts, and network hiccups. Also added a master orchestrator, watchdogs, and a fix for solver out-of-memory (OOM) crashes.

We rebuilt how the main server communicates with cloud-lab. The old system held an SSH connection open for the entire duration of a job — up to 14 hours for solver cache generation. Any network hiccup killed the connection, which killed the job, which lost all progress. This happened repeatedly.

The Problem

OLD: Main server ──── SSH pipe open 14 hours ────── Cloud-lab job
         ↓ network hiccup
     SSH dies → job dies → hours of work lost

We identified 15 categories of failure across the compute system. They traced back to five root causes:

  1. SSH as parent process — remote jobs died when the SSH connection dropped
  2. No process isolation — jobs ran in the SSH session, not their own process group
  3. Two duplicate queue systems — compute-worker and autopilot-runner with different state
  4. Cloud-lab tried to manage itself — but it gets powered off, so it can't restart its own jobs
  5. 2-hour timeout killing 12-hour jobs — autopilot-runner had a flat 2h timeout for all job types

The Fix: Fire-and-Poll

NEW: Main server ── fire (1s SSH) ──→ Cloud-lab starts job in setsid nohup
                 ── poll (1s SSH) ──→ "is PID 4303 alive?" every 60s
                 ── poll (1s SSH) ──→ "is PID 4303 alive?" + grab log line
                 ── poll (1s SSH) ──→ "PID 4303 exited, exit code 0"
                 ── rsync results ──→ sync solver caches back

Fire: One SSH call writes a job script to cloud-lab, launches it with setsid nohup (so it survives an SSH disconnect), captures the PID, and disconnects. Total: 1-2 seconds.
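The fire step can be sketched roughly as follows. This is a minimal illustration, not the production code: the function names, the /tmp/cloud-lab-jobs directory, and the .sh/.log/.exit file layout are all assumptions. It pipes the job script over stdin, launches it detached, and prints the PID of the wrapper shell.

```python
import shlex
import subprocess


def build_fire_command(job_tag: str, work_dir: str = "/tmp/cloud-lab-jobs") -> str:
    """Build the remote shell command for one fire call (hypothetical layout)."""
    q = shlex.quote
    base = f"{work_dir}/{job_tag}"
    script, log, exit_file = base + ".sh", base + ".log", base + ".exit"
    # The wrapper runs the job, then records its exit code for the poller.
    wrapper = f"bash {q(script)}; echo $? > {q(exit_file)}"
    return (
        # `cat` receives the job script on stdin and writes it to disk first.
        f"mkdir -p {q(work_dir)} && cat > {q(script)} && "
        # setsid gives the job its own session; nohup ignores SIGHUP.
        # All three standard streams are redirected so ssh can return at once.
        f"{{ setsid nohup bash -c {q(wrapper)} > {q(log)} 2>&1 < /dev/null & echo $!; }}"
    )


def fire(host: str, job_tag: str, script_body: str) -> int:
    """Run the fire command over SSH and return the remote PID."""
    result = subprocess.run(
        ["ssh", host, build_fire_command(job_tag)],
        input=script_body, capture_output=True, text=True, timeout=30, check=True,
    )
    return int(result.stdout.strip())
```

Redirecting stdin, stdout, and stderr matters here: if the detached job kept the SSH session's streams open, the ssh call would typically not return until the job finished.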

Poll: Every 60 seconds, one SSH call checks if the PID is alive. If alive, grabs the last log line for progress. If dead, reads the exit code file. Total: 1 second per poll.
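A single poll might look like the sketch below. The ALIVE/EXITED markers, the PollResult type, and both function names are assumptions made for illustration; only the idea (check the PID, then read either the last log line or the exit code file) comes from the description above.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class PollResult:
    alive: bool
    last_log_line: Optional[str] = None
    exit_code: Optional[int] = None


def build_poll_command(pid: int, log_path: str, exit_path: str) -> str:
    """One remote command: liveness check plus either progress or exit code."""
    q = shlex.quote
    return (
        # kill -0 sends no signal; it only tests whether the PID exists.
        f"if kill -0 {int(pid)} 2>/dev/null; then echo ALIVE; tail -n 1 {q(log_path)}; "
        f"else echo EXITED; cat {q(exit_path)} 2>/dev/null; fi"
    )


def parse_poll_output(output: str) -> PollResult:
    """Turn the command's stdout into a structured result."""
    lines = output.strip().splitlines()
    if not lines:
        raise ValueError("empty poll output")
    status, rest = lines[0].strip(), lines[1:]
    if status == "ALIVE":
        return PollResult(alive=True, last_log_line=rest[-1] if rest else None)
    if status == "EXITED":
        # A missing exit file (job killed before writing it) yields None.
        has_code = bool(rest) and rest[0].strip().lstrip("-").isdigit()
        return PollResult(alive=False, exit_code=int(rest[0]) if has_code else None)
    raise ValueError(f"unexpected poll status: {status!r}")
```

One caveat with any PID-based check: PIDs can be reused after a reboot, so a more defensive version might also compare the process start time or command line.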

Key property: The job runs in its own session. If the main server restarts, polling resumes from persisted state (the PID and job tag stored in jobs.json). If SSH is unreachable for 10 consecutive polls (about 10 minutes at the 60-second interval), the job is marked failed, but the remote process keeps running and can be recovered.
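The resume and failure rules could be modelled along these lines. The jobs.json schema shown here (a list of objects with tag, pid, status, and unreachable_polls fields) is an assumption, as are the function names; only the 10-poll threshold and the idea of persisting PID and job tag come from the text.

```python
import json
import os
import tempfile

MAX_UNREACHABLE_POLLS = 10  # threshold stated in the post


def record_poll(job: dict, reachable: bool) -> dict:
    """Return an updated copy of the job after one poll attempt."""
    updated = dict(job)
    if reachable:
        updated["unreachable_polls"] = 0  # any successful poll resets the streak
        return updated
    updated["unreachable_polls"] = job.get("unreachable_polls", 0) + 1
    if updated["unreachable_polls"] >= MAX_UNREACHABLE_POLLS:
        # Only the local record changes; the remote process is left running.
        updated["status"] = "failed"
    return updated


def save_jobs(path: str, jobs: list) -> None:
    """Write jobs.json atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(jobs, handle, indent=2)
    os.replace(tmp_path, path)


def load_running_jobs(path: str) -> list:
    """After a restart, return the jobs whose polling should resume."""
    if not os.path.exists(path):
        return []
    with open(path) as handle:
        return [job for job in json.load(handle) if job.get("status") == "running"]
```

Because the record keeps the PID and tag even after a job is marked failed, a later poll that finds the process alive (or its exit code file present) has enough information to recover it.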

What Changed

Component          | Before                    | After
SSH pattern        | 12h open pipe             | 1s fire + 1s polls
Job survival       | Dies on SSH drop          | Survives (setsid nohup)
Resume on restart  | Re-queue (lose progress)  | Resume polling
Queue systems      | 2 (competing)             | 1 (compute-worker)
Cloud-lab role     | Self-orchestrating        | Dumb worker
Timeout            | Flat 2h all jobs          | Per-type (12h solver, 3h factorial)
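As a rough illustration of the per-type timeout in the last row, a lookup might look like this. The job type keys and the 2-hour fallback for unlisted types are assumptions; only the 12h and 3h values come from the table.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Values from the table above; the key names are hypothetical.
JOB_TIMEOUTS = {
    "solver": timedelta(hours=12),
    "factorial": timedelta(hours=3),
}
# Assumption: unlisted job types keep the old flat 2-hour limit.
FALLBACK_TIMEOUT = timedelta(hours=2)


def timeout_for(job_type: str) -> timedelta:
    """Look up the timeout for a job type."""
    return JOB_TIMEOUTS.get(job_type, FALLBACK_TIMEOUT)


def has_timed_out(job_type: str, started_at: datetime,
                  now: Optional[datetime] = None) -> bool:
    """True once the job has run longer than its type allows."""
    now = now or datetime.now(timezone.utc)
    return now - started_at > timeout_for(job_type)
```

Storing the start time alongside the PID in the persisted job record would let this check keep working across main-server restarts.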

Infrastructure Added

  • Watchdog on cloud-lab (systemd timer, every 30 min): kills hung processes after 3h idle, shuts down VM after 2h idle with no pending work
  • Watchdog on main server (cron, every 2h): powers on cloud-lab if compute queue has pending cloud-lab jobs
  • SSH key persistence (cron, every 5 min): copies SSH keys into web container after deploys
  • openssh-client in Dockerfile: SSH available in production container without manual install
  • Keepalive touches: every long-running script touches /tmp/last-job-activity to prevent idle shutdown
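The keepalive and idle-shutdown interplay can be sketched as below. The /tmp/last-job-activity path and the 2-hour threshold come from the list above; the function names and the handling of a missing file are assumptions.

```python
import os
import time
from typing import Optional

ACTIVITY_FILE = "/tmp/last-job-activity"  # path named in the post
SHUTDOWN_IDLE_SECONDS = 2 * 3600          # "2h idle with no pending work"


def touch_keepalive(path: str = ACTIVITY_FILE) -> None:
    """Called by long-running scripts: create the file if needed, bump its mtime."""
    with open(path, "a"):
        os.utime(path, None)


def idle_seconds(path: str = ACTIVITY_FILE, now: Optional[float] = None) -> float:
    """Seconds since the last keepalive touch."""
    now = time.time() if now is None else now
    try:
        return max(0.0, now - os.path.getmtime(path))
    except FileNotFoundError:
        # Assumption: no file means no recorded activity. A real watchdog
        # might fall back to the VM's boot time instead.
        return float("inf")


def should_shut_down(idle: float, pending_jobs: int) -> bool:
    """Shut the VM down only when it is idle AND nothing is waiting."""
    return pending_jobs == 0 and idle >= SHUTDOWN_IDLE_SECONDS
```

With a 30-minute timer, the VM would in practice shut down somewhere between 2 and 2.5 hours after the last touch, depending on where the timer tick falls.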

Master Orchestrator

The master orchestrator is a 7-phase pipeline that runs all compute work. Phases run in sequence; within a phase, work runs in parallel up to the limit shown:

  1. Solver Cache — walk-forward snapshots for 2025-26 season (8 parallel per league)
  2. Factorial — 4,096 signal combinations across 64 chunks (8 parallel)
  3. Analysis — scorecard, shadow configs, backfill to production
  4. Signal Approvals — 5 priority signals through 10-gate approval (4 parallel)
  5. Graduations — promote validated signals to production
  6. Research Queue — param sweeps, specialized backtests, model sweep (6 parallel)
  7. Signal Cleanup — remaining 22 signal approvals (4 parallel)
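The overall shape (sequential phases, bounded parallelism inside each) can be illustrated with a small sketch. It uses a thread pool purely as a stand-in for whatever the orchestrator actually dispatches; all names here are hypothetical. The chunk helper shows the arithmetic for phase 2: 4,096 combinations over 64 chunks is 64 combinations per chunk, assuming an even split.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def chunk_ranges(total: int, chunks: int) -> List[Tuple[int, int]]:
    """Split range(total) into `chunks` contiguous [start, end) ranges."""
    size, extra = divmod(total, chunks)
    ranges, start = [], 0
    for index in range(chunks):
        end = start + size + (1 if index < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def run_phase(tasks: List[Callable[[], object]], max_parallel: int) -> list:
    """Run one phase's tasks with at most `max_parallel` in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda task: task(), tasks))


def run_pipeline(phases: List[Tuple[str, int, List[Callable[[], object]]]]) -> Dict[str, list]:
    """Run phases strictly in order; each phase finishes before the next starts."""
    results = {}
    for name, max_parallel, tasks in phases:
        results[name] = run_phase(tasks, max_parallel)
    return results
```

For example, the factorial phase would be described as `("factorial", 8, tasks)` with one task per entry of `chunk_ranges(4096, 64)`.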

Progress is visible on this page via live log tail and phase timeline.

Solver OOM Fix

The solver cache generator was loading all 16 seasons (6,000+ matches) into memory for big leagues. Serie A ran out of memory at 35 GB of virtual memory. The fix:

  • NODE_OPTIONS='--max-old-space-size=8192 --expose-gc' (the old-space limit was 4096 MB)
  • GC hint every 10 solve dates (was once per league)
  • Keepalive touch on every solve date (was once per league)
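The cadence of the last two items can be shown with a short sketch. The real generator runs on Node, where the hint would be a call to global.gc() (available because of --expose-gc); the Python gc.collect() below is only a stand-in, and the function and parameter names are hypothetical.

```python
import gc
from typing import Callable, Iterable


def process_solve_dates(
    solve_dates: Iterable[str],
    solve_one: Callable[[str], None],
    keepalive: Callable[[], None],
    gc_every: int = 10,
) -> int:
    """Solve each date, touch the keepalive every time, hint GC every `gc_every` dates.

    Returns the number of GC hints issued.
    """
    gc_hints = 0
    for index, solve_date in enumerate(solve_dates, start=1):
        solve_one(solve_date)
        keepalive()  # per solve date, so a long league never looks idle
        if index % gc_every == 0:
            gc.collect()  # stand-in for Node's global.gc()
            gc_hints += 1
    return gc_hints
```

Touching the keepalive inside the loop is what keeps the idle-shutdown watchdog from firing midway through a large league.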

Current Status

  • 228 solver caches for 2025-26 (Big 5 done, 14 leagues in progress)
  • Factorial running on Big 5 data (partial live comparison imminent)
  • Fire-and-poll verified working in production
  • Full pipeline ETA: ~12-16 hours from start