The Solver Was Already Right: Why Tuning Form Weights Made Things Worse
We tested 5 configurations of recentFormBoost (1.5-3.0) and decayRate (0.005-0.015) to see whether the model could track within-season collapses faster. Every config was worse than or flat against baseline. Raising recentFormBoost (RFB) cost 1.10pp of AH ROI, and faster decay alone was noise. We also found a bug: --decay-rate was never passed to data-prep. The solver already weights form data correctly.
We tested 5 parameter configurations to see if the MI Bivariate Poisson model could track within-season team collapses better. It couldn't. The current parameters are near-optimal, and along the way we found a bug that had been silently breaking every prior decay rate test.
The Question
The model has +6.2% closing line value (CLV), so it sees edge correctly, but -3.3% Asian handicap (AH) ROI, so it can't convert that edge. One theory: recentFormBoost=1.5 and decayRate=0.005 are too conservative. Teams like Rangers, Barcelona, and Sevilla collapse mid-season, but the solver keeps weighting their early-season performances almost as heavily as recent ones. If we boost recent form weighting and/or decay old data faster, the solver should catch collapses sooner.
This theory came from a genuine finding: the calibration gap (CalGap) correlates with ROI at r=-0.922 across leagues, and within-season collapses are the primary driver.
What We Found
Every configuration was worse or flat. Within the tested range, the current parameters sit at the local optimum.
| Config | AH ROI | vs Baseline |
|---|---|---|
| **Baseline (rfb=1.5, d=0.005)** | **-3.29%** | **--** |
| rfb=2.5 | -4.39% | -1.10pp |
| rfb=2.5, d=0.010 | -4.02% | -0.73pp |
| rfb=3.0, d=0.010 | -3.96% | -0.67pp |
| rfb=1.5, d=0.010 | -3.23% | +0.06pp |
| rfb=1.5, d=0.015 | -4.39% | -1.10pp |
The gradient is clear:
- RFB increase: -1.10pp -- boosting recent form weight is strongly negative
- Decay alone: +0.06pp -- faster forgetting is noise, not signal
- Aggressive decay (0.015): -1.10pp -- forgetting too much history destroys the model
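
The "vs Baseline" column can be reproduced directly from the AH ROI figures in the table. The snippet below is a minimal sketch of that arithmetic; the type and function names are illustrative, and the 0.25pp noise band used to label a delta as "flat" is a hypothetical threshold, not one taken from the sweep.

```typescript
// AH ROI figures (in percent) copied from the results table.
interface SweepResult {
  label: string;
  ahRoiPct: number;
}

const baselineResult: SweepResult = { label: "rfb=1.5, d=0.005", ahRoiPct: -3.29 };

const sweepResults: SweepResult[] = [
  { label: "rfb=2.5", ahRoiPct: -4.39 },
  { label: "rfb=2.5, d=0.010", ahRoiPct: -4.02 },
  { label: "rfb=3.0, d=0.010", ahRoiPct: -3.96 },
  { label: "rfb=1.5, d=0.010", ahRoiPct: -3.23 },
  { label: "rfb=1.5, d=0.015", ahRoiPct: -4.39 },
];

// Difference in percentage points, rounded to two decimals to hide float noise.
function deltaPp(result: SweepResult, base: SweepResult): number {
  return Math.round((result.ahRoiPct - base.ahRoiPct) * 100) / 100;
}

// Hypothetical noise band: deltas smaller than this are treated as flat.
const NOISE_BAND_PP = 0.25;

function classifyDelta(delta: number): "better" | "flat" | "worse" {
  if (Math.abs(delta) < NOISE_BAND_PP) return "flat";
  return delta > 0 ? "better" : "worse";
}

for (const r of sweepResults) {
  const d = deltaPp(r, baselineResult);
  console.log(`${r.label}: ${d.toFixed(2)}pp (${classifyDelta(d)})`);
}
```

Under that assumed band, four configs come out as "worse" and the decay-only 0.010 config as "flat"; none is labelled "better".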
The Nuance
Why Does More Form Weighting Hurt?
The intuition "weight recent form more" assumes the market isn't already doing this. But Pinnacle odds -- which are our training target -- already incorporate recent form. When we boost recentFormBoost, we're double-counting a signal the market already prices. The solver over-adjusts to recent results, creates false confidence in form-based ratings, and generates more bets that the market has already correctly priced.
Per-League Results Tell the Story
For the best-performing config (decay=0.010 alone):
| League | Baseline | d=0.010 | Delta |
|---|---|---|---|
| serie-b | -7.95% | -3.06% | +4.89pp |
| serie-a | -4.90% | -3.48% | +1.42pp |
| epl | +1.26% | +2.84% | +1.57pp |
| belgian-pro | -3.09% | -6.78% | -3.70pp |
| turkish-super | -2.82% | -5.50% | -2.68pp |
| eredivisie | +1.89% | +1.28% | -0.61pp |
The table shows six of the 10 dev leagues. Across all 10, 5 improve and 5 worsen. There is no systematic pattern -- faster decay redistributes edge between leagues rather than creating it.
The Sign Test
We checked how many league x year cells (30 total) each config beat baseline in:
- rfb=2.5: 11/30 (37%) -- worse in nearly 2/3 of cells
- rfb=2.5, d=0.010: 12/30 (40%)
- rfb=3.0, d=0.010: 13/30 (43%)
None cleared the 50% threshold, let alone the 60% elimination bar.
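
For reference, the sign-test bookkeeping is small enough to sketch. The code below computes the win rate over cells and an exact one-sided binomial tail under the null hypothesis that a config beats baseline in any given cell with probability 0.5. It reads the 60% bar as "a config needs a win rate of at least 60% to survive", which is an interpretation of the text above; the function names are illustrative and not taken from analyze-sweep.ts.

```typescript
// Exact one-sided tail P(X >= wins) for X ~ Binomial(n, 0.5).
function binomialUpperTail(wins: number, n: number): number {
  let coeff = 1; // C(n, 0)
  let tail = 0;
  for (let k = 0; k <= n; k++) {
    if (k >= wins) tail += coeff;
    coeff = (coeff * (n - k)) / (k + 1); // C(n, k + 1) derived from C(n, k)
  }
  return tail / Math.pow(2, n);
}

interface SignTestResult {
  winRate: number;
  pValue: number; // chance of at least this many wins if the config were a coin flip
  survives: boolean;
}

// Assumed reading of the elimination rule: survive only at >= 60% of cells won.
const ELIMINATION_BAR = 0.6;

function signTest(wins: number, cells: number): SignTestResult {
  const winRate = wins / cells;
  return {
    winRate,
    pValue: binomialUpperTail(wins, cells),
    survives: winRate >= ELIMINATION_BAR,
  };
}

// Win counts reported above, out of 30 league x year cells.
const reportedWins: Array<[string, number]> = [
  ["rfb=2.5", 11],
  ["rfb=2.5, d=0.010", 12],
  ["rfb=3.0, d=0.010", 13],
];

for (const [label, wins] of reportedWins) {
  const r = signTest(wins, 30);
  console.log(`${label}: ${(r.winRate * 100).toFixed(0)}% of cells, survives=${r.survives}`);
}
```

All three reported configs fall below the bar, so `survives` is false for each of them.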
What Didn't Work (and What We Actually Fixed)
The Bug
While running the sweep, we discovered that --decay-rate had been a no-op since it was first implemented. The CLI flag was parsed into the config object and used for cache key generation, but was never actually passed to prepareMarketMatches() -- the function that applies time-decay weighting to training matches.
The smoking gun: rfb-2.5-d10 (before the fix) produced byte-identical results to rfb-2.5 (same RFB, default decay). The config hash was different, so the solver cache was correctly invalidated, but the underlying data had identical decay weights.
Fix: Two lines of code in backtest-v2.ts and backtest-worker.ts, adding decayRate: baseConfig.decayRate to the prepareMarketMatches() options.
This means every prior experiment that used --decay-rate was actually running with the default 0.005. Any conclusions those experiments drew about decay rate were in fact measuring RFB effects only, because the decay weights never changed.
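
A cheap guard against this class of bug is to check that the prepared weights actually respond to the parameter. The sketch below is not the project's code: it assumes an exponential decay of the form exp(-decayRate * ageInDays), and every name in it (computeDecayWeights, PrepOptions, respondsToDecay) is hypothetical. It only illustrates how an optional setting with a default can be dropped silently, and how comparing two runs catches it.

```typescript
interface TrainingMatch {
  id: string;
  playedAt: Date;
}

interface PrepOptions {
  referenceDate: Date;
  decayRate?: number; // optional, so omitting it compiles and silently uses the default
}

const DEFAULT_DECAY_RATE = 0.005;
const MS_PER_DAY = 86_400_000;

// Assumed weighting scheme: exponential decay in match age measured in days.
function computeDecayWeights(matches: TrainingMatch[], opts: PrepOptions): number[] {
  const rate = opts.decayRate ?? DEFAULT_DECAY_RATE;
  return matches.map((m) => {
    const ageDays = Math.max(0, (opts.referenceDate.getTime() - m.playedAt.getTime()) / MS_PER_DAY);
    return Math.exp(-rate * ageDays);
  });
}

const sampleMatches: TrainingMatch[] = [
  { id: "m1", playedAt: new Date("2024-01-01T00:00:00Z") },
  { id: "m2", playedAt: new Date("2024-06-01T00:00:00Z") },
];
const referenceDate = new Date("2024-07-01T00:00:00Z");

// Buggy call shape: the configured rate exists but is never forwarded.
const buggyPrep = (_configuredRate: number): number[] =>
  computeDecayWeights(sampleMatches, { referenceDate });

// Fixed call shape: the configured rate is passed through in the options.
const fixedPrep = (configuredRate: number): number[] =>
  computeDecayWeights(sampleMatches, { referenceDate, decayRate: configuredRate });

// Guard: two different decay rates must produce different weights.
function respondsToDecay(prep: (rate: number) => number[]): boolean {
  const a = prep(0.005);
  const b = prep(0.01);
  return a.some((w, i) => w !== b[i]);
}

console.log(respondsToDecay(buggyPrep)); // false
console.log(respondsToDecay(fixedPrep)); // true
```

Running a check like this before a sweep would have flagged the no-op without relying on someone noticing byte-identical outputs.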
The Methodology
We used a syndicate-style approach:
- Stratified dev/holdout split: 10 dev leagues (8,568 AH bets) / 9 holdout leagues (6,597 AH bets)
- Sequential elimination: 3 strategic configs in Phase 1, gradient-directed refinement in Phase 2
- Bootstrap paired difference: block bootstrap on matchday-level ROI differences (10K resamples)
- Sign test: per league x year cell comparison vs baseline
The holdout set was never needed -- no config survived elimination to reach validation.
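
The bootstrap step can be sketched as a moving-block bootstrap over per-matchday ROI differences (candidate minus baseline). This is a simplified illustration under stated assumptions rather than the code in analyze-sweep.ts: it averages matchday differences with equal weight, uses percentile intervals, and the block length is a free parameter chosen by the caller. The small seeded generator (mulberry32) is there only to make runs reproducible.

```typescript
// Small seeded PRNG (mulberry32) so that resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapSummary {
  observedMean: number;
  ciLow: number; // 2.5th percentile of resampled means
  ciHigh: number; // 97.5th percentile of resampled means
  fracAboveZero: number; // share of resamples where the candidate beats baseline
}

function blockBootstrapMean(
  diffs: number[],
  blockLen: number,
  resamples: number,
  rng: () => number,
): BootstrapSummary {
  const n = diffs.length;
  if (n === 0 || resamples <= 0) throw new Error("need data and at least one resample");
  const len = Math.max(1, Math.min(Math.floor(blockLen), n));
  const maxStart = n - len;
  const means: number[] = [];

  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    let count = 0;
    // Glue randomly placed contiguous blocks together until n values are drawn.
    while (count < n) {
      const start = Math.floor(rng() * (maxStart + 1));
      for (let j = 0; j < len && count < n; j++) {
        sum += diffs[start + j] as number;
        count++;
      }
    }
    means.push(sum / n);
  }

  means.sort((x, y) => x - y);
  const quantile = (q: number): number => means[Math.floor(q * (means.length - 1))] as number;
  const observedMean = diffs.reduce((s, x) => s + x, 0) / n;

  return {
    observedMean,
    ciLow: quantile(0.025),
    ciHigh: quantile(0.975),
    fracAboveZero: means.filter((m) => m > 0).length / means.length,
  };
}
```

A call such as `blockBootstrapMean(matchdayDiffs, 4, 10_000, mulberry32(42))` would mirror the 10K resamples mentioned above; the block length of 4 matchdays is purely illustrative. Resampling contiguous blocks rather than single matchdays preserves short-range dependence between adjacent matchdays, which is the usual reason to prefer it for time-ordered results.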
What This Means
- The model's form tracking is already correct. The solver, trained on Pinnacle closing odds, already captures form changes at the right rate. Attempting to amplify or accelerate this process double-counts what the market already knows.
- The CLV-to-ROI gap is not a form-tracking problem. The -3.3% ROI doesn't come from stale ratings on collapsed teams. It comes from somewhere else -- likely structural calibration issues, line-specific biases, or market microstructure.
- Infrastructure for future sweeps exists. sweep-rfb-decay.sh and analyze-sweep.ts are reusable for any parameter optimization with proper dev/holdout split, elimination rules, and statistical validation.
What's Next
With RFB/decay ruled out, the remaining paths to closing the CLV-to-ROI gap are:
- Quarter-line regime: the exhaustive regime search found quarter-lines at +4.3% vs whole-lines at -7.8% (12pp spread). Line-type routing may be more productive than parameter tuning.
- Edge-weighted sizing: weighting bets by edge magnitude showed +1.5pp ROI improvement OOS in shadow mode.
- Calibration shrinkage: shrinking model probabilities toward market consensus (x0.4) reduced CalGap from 6.1pp to 1.5pp with ROI improvement.
- Team-specific volatility: instead of a global form boost, model per-team rating variance to flag unstable teams.
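
Of these, calibration shrinkage is the easiest to illustrate. The sketch below assumes that "x0.4" means the model keeps 40% of its deviation from the market-implied probability; that reading, and the function names, are assumptions rather than details confirmed by the experiment.

```typescript
// Shrink a model probability toward the market-implied probability.
// factor = 1 keeps the model as-is; factor = 0 collapses to the market.
function shrinkTowardMarket(pModel: number, pMarket: number, factor: number = 0.4): number {
  if (factor < 0 || factor > 1) throw new RangeError("factor must be in [0, 1]");
  return pMarket + factor * (pModel - pMarket);
}

// Expected profit per unit staked at the given decimal odds.
function edgeAtOdds(p: number, decimalOdds: number): number {
  return p * decimalOdds - 1;
}

const pModel = 0.6; // model's win probability for one side
const pMarket = 0.5; // market-implied probability after removing the margin
const odds = 1.95; // illustrative decimal odds on that side

const pShrunk = shrinkTowardMarket(pModel, pMarket);
console.log(pShrunk.toFixed(3)); // 0.540
console.log(edgeAtOdds(pModel, odds).toFixed(3), edgeAtOdds(pShrunk, odds).toFixed(3));
```

Under this reading, the claimed edge on a bet shrinks along with the probability, so fewer marginal bets clear a fixed edge threshold.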
The engine is correctly calibrated for form. Time to look at the chassis.