Sports Dashboard

MI Bivariate Poisson + Dixon-Coles + Elo


Solver Cache: Incremental Season Loading

The solver was loading all 16 seasons (6,000+ matches, ~200MB of raw data) into memory for every league, and peak memory reached 35GB. It now loads incrementally per solve date, and peak memory drops from 35GB to ~500MB.


The solver cache generator was loading all 16 seasons (6,000+ matches, ~200MB) into memory for every league, even when only the latest season had uncached solve dates. Serie A with 39 teams OOM'd at 35GB virtual memory — the Linux kernel killed the process, losing hours of computation.

Before: Load Everything

loadRawMatches("serie-a", ["2010-11", "2011-12", ..., "2025-26"])
  → Parse 16 JSON files
  → 6,009 matches in one array (~200MB)
  → For each solve date: filter to matches before that date
  → Solver runs on filtered subset
  → But the full 6,009-match array stays in memory the entire time

Peak memory: 4-35GB depending on league size. Serie A with 39 teams and complex solver iterations hit 35GB and got OOM-killed.

After: Load Incrementally

loadMatchDates("serie-a", [...])
  → Read only date fields from 16 files (~50KB total)
  → Compute solve date schedule
  → Check which dates are already cached (most of them)

For each UNCACHED solve date:
  loadRawMatches("serie-a", [...], cutoffDate="2025-11-15")
    → Only loads seasons containing matches up to cutoff
    → For 2025-26 solve dates: loads ~400 recent matches, not 6,000
    → Solver runs on this subset
    → Array is garbage collected before next solve date

Peak memory: ~500MB regardless of league size. The solver only holds matches up to the current solve date, and a periodic GC hint (every 10 solve dates) releases intermediate solver state.

What Changed

Two new functions replace the monolithic loader:

`loadMatchDates()` — Reads only the date field from each season file to build the solve schedule. Uses ~50KB instead of ~200MB. No match data loaded.
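A minimal sketch of what a date-only loader could look like. The file layout (`<dataDir>/<league>/<season>.json`, each file a JSON array of matches with an ISO `date` field) and the extra `dataDir` parameter are assumptions made for illustration, not the project's actual code. Each file is still parsed in full here, but only the date strings outlive the loop iteration.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Assumed layout: <dataDir>/<league>/<season>.json, each file a JSON array
// of match objects carrying an ISO "date" field such as "2025-11-15".
function loadMatchDates(dataDir, league, seasons) {
  const dates = new Set();
  for (const season of seasons) {
    const file = path.join(dataDir, league, `${season}.json`);
    if (!fs.existsSync(file)) continue; // season file not present, skip it
    // The parsed array is local to this iteration, so it can be collected
    // once the date strings have been copied into the set.
    const matches = JSON.parse(fs.readFileSync(file, "utf8"));
    for (const match of matches) dates.add(match.date);
  }
  // ISO dates sort chronologically when compared as plain strings.
  return [...dates].sort();
}
```

The returned list of unique dates is all the scheduler needs to decide which solve dates exist and which are still uncached.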

`loadRawMatches()` with cutoff — Now accepts an optional cutoffDate parameter. When set, only loads matches up to that date and stops reading season files that start after the cutoff. For a league where 2010-2024 is fully cached and only 2025-26 needs solving, this loads ~400 matches instead of 6,000.
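The cutoff behaviour might be sketched as follows. The file layout, the `dataDir` parameter, the season-label convention ("2025-26" begins in 2025), and the inclusive comparison are all assumptions for illustration; the real loader may decide which season files to skip differently.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Assumption: a label like "2025-26" means the season begins in 2025.
function seasonStartYear(season) {
  return Number(season.slice(0, 4));
}

// cutoffDate is an optional ISO date string; omit it to load everything.
function loadRawMatches(dataDir, league, seasons, cutoffDate) {
  const cutoffYear = cutoffDate ? Number(cutoffDate.slice(0, 4)) : Infinity;
  const matches = [];
  for (const season of seasons) {
    // A season that begins in a later year than the cutoff cannot contain
    // any match on or before it, so its file is never opened.
    if (seasonStartYear(season) > cutoffYear) continue;
    const file = path.join(dataDir, league, `${season}.json`);
    if (!fs.existsSync(file)) continue;
    const seasonMatches = JSON.parse(fs.readFileSync(file, "utf8"));
    for (const match of seasonMatches) {
      // Inclusive cutoff is an assumption; use < to exclude the solve date.
      if (cutoffDate === undefined || match.date <= cutoffDate) {
        matches.push(match);
      }
    }
  }
  return matches;
}
```

Keeping `cutoffDate` optional means existing callers that pass only the league and season list keep their old load-everything behaviour.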

The per-solve-date loop now calls loadRawMatches(league, seasons, solveDate) for each uncached date, loading and releasing data incrementally.
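The shape of that loop can be sketched with the cache check, loader, solver, and result writer passed in as functions. The names `solveUncachedDates`, `isCached`, `loadMatches`, `solve`, and `saveResult` are hypothetical and chosen only to keep the example self-contained.

```javascript
// All five collaborators are injected, so this sketch makes no claim about
// how the real cache, loader, or solver are implemented.
function solveUncachedDates({ solveDates, isCached, loadMatches, solve, saveResult }) {
  let solved = 0;
  for (const solveDate of solveDates) {
    if (isCached(solveDate)) continue; // most dates are already cached
    // `matches` is scoped to this iteration: once the result is saved,
    // nothing references the array and it becomes eligible for collection.
    const matches = loadMatches(solveDate);
    saveResult(solveDate, solve(matches, solveDate));
    solved += 1;
  }
  return solved;
}
```

The important property is that no match array is declared outside the loop body, so memory use is bounded by the largest single solve date rather than by the full history.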

Combined with Earlier Fixes

  • 8GB heap (--max-old-space-size=8192) — safety margin, but incremental loading means we rarely need more than 2GB
  • GC every 10 solve dates (global.gc()) — releases intermediate solver state between iterations
  • Per-solve-date keepalive — touches /tmp/last-job-activity to prevent cloud-lab idle shutdown
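A rough sketch of how the GC hint and the keepalive could be wired up. The helper names `maybeCollect` and `touchKeepalive` are hypothetical, and writing a timestamp is only one assumed way of "touching" the file. Note that `global.gc` is only defined when Node is started with `--expose-gc`, so the sketch checks for it before calling.

```javascript
const fs = require("node:fs");

const GC_INTERVAL = 10; // the post's "every 10 solve dates"

// Returns true only when a collection was actually requested.
function maybeCollect(solveIndex) {
  if ((solveIndex + 1) % GC_INTERVAL !== 0) return false;
  // global.gc is undefined unless Node runs with --expose-gc.
  if (typeof global.gc !== "function") return false;
  global.gc();
  return true;
}

// Rewriting the file updates its modification time, which is what an
// idle-shutdown watcher would typically look at (an assumption here).
function touchKeepalive(file = "/tmp/last-job-activity") {
  fs.writeFileSync(file, new Date().toISOString());
}
```

With these helpers, a run would be started with both flags, for example `node --expose-gc --max-old-space-size=8192 solver.js`, where `solver.js` stands in for the real entry point.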

Impact

| Metric | Before | After |
|---|---|---|
| Peak memory (Serie A) | 35GB (OOM killed) | ~500MB |
| Peak memory (EPL) | ~8GB | ~400MB |
| Can run parallel solvers | No (OOM risk) | Yes (2GB each) |
| Seasons loaded per solve | All 16 | Only those needed |
| Match data per solve | 6,000 (full history) | ~400 (up to solve date) |
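Peak-memory figures like these can be sampled from inside the process. The following is a small sketch, not the measurement method used for the table; `createPeakTracker` is a hypothetical helper built on Node's `process.memoryUsage()`, and it reports resident set size, which may differ from the virtual-memory figure the kernel acts on.

```javascript
// Tracks the highest resident set size seen across sample() calls.
function createPeakTracker() {
  let peakBytes = 0;
  return {
    // Call once per solve date (or more often) to record current usage.
    sample() {
      peakBytes = Math.max(peakBytes, process.memoryUsage().rss);
      return peakBytes;
    },
    // Peak so far, rounded to whole megabytes.
    peakMB() {
      return Math.round(peakBytes / (1024 * 1024));
    },
  };
}
```

Calling `sample()` after each solve date and logging `peakMB()` at the end gives a per-league number that can be compared before and after a change.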

The solver can now process any league on a 4GB machine. More importantly, multiple solvers can run in parallel during the research queue phase without competing for memory.