Sports Dashboard

MI Bivariate Poisson + Dixon-Coles + Elo


From +29.7% ROI to -2.2%: The Journey from Illusion to Understanding

Paper trading showed +29.7% ROI at 58 bets. Rigorous backtesting showed -2.98% at 15,165 bets. Today we moved it to -2.19% by removing match results and xG from the solver's loss function. The CLV was always real (+7%). The ROI was a lucky streak. Three sweeps, one winner, three bugs fixed, and a model that's 27% less negative.

| Metric | Value | Note |
|---|---|---|
| Paper ROI | +29.7% | 58 bets (illusion) |
| Backtest ROI | -2.98% | 15,165 AH bets (truth) |
| After Fix | -2.19% | +0.79pp, +118u saved |
| CLV | +7.03% | genuine edge (real) |

Paper trading showed +29.7% ROI with a 68% hit rate. Rigorous backtesting showed -2.98%. Today we moved it to -2.19%. This is the story of how we discovered our model was lying to us, and what we did about it.

The Illusion

Two weeks ago, the dashboard was electric. Paper trading results:

| Metric | Value |
|---|---|
| Total Bets | 58 |
| Settled | 25 (17W / 8L) |
| ROI | +29.7% |
| P&L | +160.3u |
| CLV | +22.1% |
| Hit Rate | 68% |
| Avg Edge | +10.2% |

Green everywhere. The model was crushing it. We built conviction badges, deployed motivation signals, wired in xG decline alerts. The system felt alive.

There was one footnote: "24 bets with CLV data -- need ~300 for reliable significance."

We ignored it.

The Reckoning

Then we built rigorous backtesting infrastructure. Walk-forward validation across 26 leagues, 3 seasons, 15,165 AH bets. No peeking at future data. 7-day re-solve intervals with 3-day embargo. Pinnacle closing odds.
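The re-solve schedule described above can be sketched as follows. This is a minimal illustration, not the project's actual backtester: the function name `walk_forward_windows`, the example dates, and the reading of the embargo (training data must end 3 days before the first match being bet on) are all assumptions.

```python
from datetime import date, timedelta

def walk_forward_windows(start, end, resolve_days=7, embargo_days=3):
    """Yield (train_cutoff, bet_start, bet_end) tuples.

    The model is re-solved once per window. Each solve may only use matches
    dated strictly before train_cutoff, which sits embargo_days before the
    first match the solve is allowed to bet on (assumed interpretation).
    """
    bet_start = start
    while bet_start < end:
        bet_end = min(bet_start + timedelta(days=resolve_days), end)
        train_cutoff = bet_start - timedelta(days=embargo_days)
        yield train_cutoff, bet_start, bet_end
        bet_start = bet_end

# Hypothetical three-week stretch of a season.
windows = list(walk_forward_windows(date(2024, 8, 1), date(2024, 8, 22)))
for cutoff, lo, hi in windows:
    print(f"train on matches before {cutoff}, bet on [{lo}, {hi})")
```

The point of the gap between `train_cutoff` and `bet_start` is that no window ever trains on data from the period it is evaluated on.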

The truth:

| Metric | Paper Trading (58 bets) | Backtest (15,165 bets) |
|---|---|---|
| AH ROI | +29.7% | -2.98% |
| CLV | +22.1% | +6.33% |
| Hit Rate | 68% | ~52% |

The +29.7% ROI was a lucky streak. At n=58, the 95% confidence interval on observed ROI for a model whose true ROI is -3% is roughly [-29%, +23%]. Our result sat in the extreme tail -- a 1-in-230 event -- but it is statistically plausible from pure variance, and only 25 of those 58 bets had even settled.
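The interval quoted above can be approximately reproduced with a normal approximation. The sketch below assumes flat 1u stakes and a per-bet P&L standard deviation of about 1 unit (typical for prices near even money); that assumption and the function name `roi_confidence_interval` are ours, not something stated in the post.

```python
import math

def roi_confidence_interval(true_roi, n_bets, per_bet_sd=1.0, z=1.96):
    """Normal-approximation CI for observed ROI over n_bets flat-stake bets."""
    se = per_bet_sd / math.sqrt(n_bets)  # standard error of the mean P&L per bet
    return true_roi - z * se, true_roi + z * se

lo, hi = roi_confidence_interval(-0.03, 58)
lo_big, hi_big = roi_confidence_interval(-0.03, 15165)
print(f"n=58:     [{lo:+.0%}, {hi:+.0%}]")
print(f"n=15,165: [{lo_big:+.1%}, {hi_big:+.1%}]")
```

Under these assumptions the 58-bet interval spans more than 50 percentage points, while the 15,165-bet interval is a few points wide and sits entirely below zero.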

The CLV was real. +22.1% in paper trading, +6.33% in backtest (difference explained by edge filters selecting higher-CLV bets). The model genuinely finds edge. It just can't convert that edge into profit.
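To make the CLV-versus-ROI distinction concrete, here is a small sketch using the common definition of closing line value (price taken divided by closing price, minus one). Whether the project de-vigs the Pinnacle close first is not stated, so raw closing odds are an assumption here, and the three bets are made-up values for illustration only.

```python
def clv(odds_taken, closing_odds):
    """Closing line value: how much better our price was than the close."""
    return odds_taken / closing_odds - 1.0

def settle(odds_taken, won, stake=1.0):
    """Profit in units for a plain win/lose bet (no AH half-win or push handling)."""
    return stake * (odds_taken - 1.0) if won else -stake

# (odds taken, closing odds, won?) -- hypothetical bets, not project data
bets = [
    (2.10, 1.96, False),
    (1.95, 1.85, True),
    (2.05, 1.92, False),
]
avg_clv = sum(clv(o, c) for o, c, _ in bets) / len(bets)
roi = sum(settle(o, w) for o, _, w in bets) / len(bets)
print(f"avg CLV {avg_clv:+.1%}, ROI {roi:+.1%}")
```

Every bet in this toy sample beat the close, yet the sample loses money: CLV measures price quality, ROI measures realized outcomes, and the two can diverge.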

The Signal Graveyard

We built the most rigorous testing process we could imagine. 10 automated gates. Per-league matchday interleave OOS. Bootstrap on marginal ROI. Walk-forward validation. Parallel testing terminals.

Then 9 signals went through it. All 9 rejected.

| Signal | Marginal ROI | Walk-Forward | Result |
|---|---|---|---|
| tc2-league-filter | +0.91pp | 1/4 folds | REJECTED |
| tc2-home-ah-rescue | +0.43pp | 0/4 folds | REJECTED |
| promoted-team-penalty | +0.14pp | FAIL | REJECTED |
| 6 others | <= 0pp | FAIL | REJECTED |

The best signal adds +0.91pp against a base of -3%. To reach profitability, a single signal would need to add +3-5pp. The gates were doing their job: telling us the engine was broken, not the paint.
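One of the gates mentioned above is a bootstrap on marginal ROI. A minimal sketch of such a gate is shown below, assuming a paired setup: each match has a per-match P&L with and without the signal (0 when the signal skips the bet), and the gate passes only if the lower end of the 95% percentile interval is above zero. The function name, the pass rule, and the synthetic data are all assumptions; the real gate may differ.

```python
import random

def bootstrap_marginal_roi(base_pnl, signal_pnl, n_boot=2000, seed=0):
    """Paired percentile bootstrap on the mean per-match P&L difference."""
    rng = random.Random(seed)
    diffs = [s - b for s, b in zip(signal_pnl, base_pnl)]
    n = len(diffs)
    point = sum(diffs) / n
    # Resample matches with replacement and recompute the mean difference.
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(0.025 * n_boot)]
    hi = means[int(0.975 * n_boot) - 1]
    return point, lo, hi

# Synthetic data: 500 flat-stake bets; the "signal" randomly skips about 10%.
data_rng = random.Random(1)
base = [data_rng.choice([-1.0, 0.95]) for _ in range(500)]
signal = [0.0 if data_rng.random() < 0.1 else pnl for pnl in base]

point, lo, hi = bootstrap_marginal_roi(base, signal)
gate_passed = lo > 0
print(f"marginal {point:+.4f}u/match, 95% CI [{lo:+.4f}, {hi:+.4f}], pass={gate_passed}")
```

Note the simplification: the difference is averaged per match rather than per unit staked, so it is not an ROI in the strict sense when the signal changes how many bets are placed.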

The Diagnosis

CalGap (calibration gap) correlates with ROI at r=-0.922 across 19 leagues: the larger a league's gap, the worse its ROI. Within-season team collapses -- Rangers, Barcelona, Sevilla -- drive the gap. The hypothesis: the solver reacts too slowly.
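The post does not define CalGap precisely, so the sketch below uses one common stand-in: a binned calibration gap (the count-weighted average of |mean predicted probability - observed frequency| per bin), alongside a plain Pearson correlation that could relate per-league gaps to per-league ROI. Both function names and the binning scheme are assumptions.

```python
import math

def calibration_gap(probs, outcomes, n_bins=10):
    """Count-weighted mean |predicted - observed| over equal-width probability bins."""
    bins = {}
    for p, y in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        bins.setdefault(b, []).append((p, y))
    n = len(probs)
    gap = 0.0
    for members in bins.values():
        mean_p = sum(p for p, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        gap += len(members) / n * abs(mean_p - mean_y)
    return gap

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# A model that says 90% on coin flips is badly calibrated; one that says 50% is not.
print(calibration_gap([0.9] * 4, [1, 0, 1, 0]))
print(calibration_gap([0.5] * 4, [1, 0, 1, 0]))
```

With one `(gap, roi)` pair per league, `pearson_r(gaps, rois)` would give the kind of figure quoted above.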

Today: Three Sweeps, One Winner

We ran three rigorous parameter sweeps using stratified dev/holdout validation (10 dev leagues, 9 holdout leagues, 8,568 AH bets).

Sweep 1: RFB x Decay -- REJECTED

| Config | AH ROI | Delta |
|---|---|---|
| Baseline (rfb=1.5, d=0.005) | -3.29% | -- |
| rfb=2.5 | -4.39% | -1.10pp |
| d=0.010 alone | -3.23% | +0.06pp |

Increasing recentFormBoost hurts, and doubling the decay rate is a wash (+0.06pp). Within the range we tested, the solver's form tracking is already near its optimum. The "reacts too slowly" hypothesis was wrong.

Along the way, we found a bug: --decay-rate was never being passed to the data prep function. Every prior decay rate test had been running with the default.
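This class of bug (a CLI flag that is parsed but never forwarded) is easy to reproduce. The sketch below is a hypothetical reconstruction, not the project's code: `prepare_data`, `run_sweep`, and the returned dictionary are invented for illustration, and only the `--decay-rate` flag name and the 0.005 default come from the post.

```python
import argparse

DEFAULT_DECAY = 0.005

def prepare_data(matches, decay_rate=DEFAULT_DECAY):
    """Hypothetical data prep: reports the decay rate it actually used."""
    return {"n_matches": len(matches), "decay_rate": decay_rate}

def run_sweep(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--decay-rate", type=float, default=DEFAULT_DECAY)
    args = parser.parse_args(argv)
    matches = []  # placeholder for loaded match data

    # Buggy call: the flag was parsed, but the default silently wins.
    buggy = prepare_data(matches)
    # Fixed call: forward the parsed value explicitly.
    fixed = prepare_data(matches, decay_rate=args.decay_rate)

    # Cheap guard: fail loudly if the requested value did not reach data prep.
    if fixed["decay_rate"] != args.decay_rate:
        raise RuntimeError("decay rate was not passed through to prepare_data")
    return buggy, fixed

buggy, fixed = run_sweep(["--decay-rate", "0.010"])
print("buggy used:", buggy["decay_rate"], "| fixed used:", fixed["decay_rate"])
```

Having each stage report the parameters it actually used, and checking them against what was requested, is one way to catch this kind of silent default.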

Sweep 2: Edge Shrinkage -- WRONG DIRECTION

| Min Edge | AH ROI | Delta |
|---|---|---|
| 0% (baseline) | -3.29% | -- |
| 5% | -4.18% | -0.89pp |
| 10% | -6.49% | -3.20pp |

Raising the minimum edge threshold makes ROI worse. Higher confidence = worse bets. The model's largest edges are its most overconfident predictions. This is the calibration problem expressed directly.
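The shape of this sweep can be sketched in a few lines: keep only bets whose model edge meets a threshold, then recompute flat-stake ROI. The function name and the seven toy bets below are invented for illustration (they are deliberately built so that large edges lose more often) and are not project data.

```python
def roi_by_min_edge(bets, thresholds):
    """bets: list of (model_edge, pnl_at_1u_stake). Returns {threshold: ROI or None}."""
    result = {}
    for t in thresholds:
        kept = [pnl for edge, pnl in bets if edge >= t]
        result[t] = sum(kept) / len(kept) if kept else None
    return result

# Toy bets: the bigger the claimed edge, the more often the bet loses.
toy_bets = [
    (0.02, 0.95), (0.03, -1.0), (0.04, 0.95),
    (0.06, -1.0), (0.08, 0.95),
    (0.12, -1.0), (0.15, -1.0),
]
sweep = roi_by_min_edge(toy_bets, [0.0, 0.05, 0.10])
for threshold, roi in sweep.items():
    print(f"min edge {threshold:.0%}: ROI {roi:+.1%}")
```

If a model's edges were well calibrated, ROI would rise with the threshold; a monotone decline like the one in the table points at overconfidence in the largest edges.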

Sweep 3: Loss Weights -- ACCEPTED

| Config | AH ROI | Delta |
|---|---|---|
| Baseline (outcome=0.3, xg=0.2) | -3.29% | -- |
| market-only (outcome=0, xg=0) | -2.24% | +1.05pp |

Removing match results and xG from the loss function improved AH ROI by +1.05pp. The solver, freed from fitting to noisy data, produces better-calibrated probabilities purely from Pinnacle odds.
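To make "market-only" concrete: assuming the solver minimizes a weighted sum of error terms, the change amounts to zeroing two weights. The post gives only the outcome and xG weights, so the weighted-sum form, the market weight of 1.0, and the names below are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LossWeights:
    market: float = 1.0   # assumed; the post does not state the market weight
    outcome: float = 0.3  # baseline from the sweep table
    xg: float = 0.2       # baseline from the sweep table

def total_loss(market_err, outcome_err, xg_err, w):
    """Weighted sum of the three error terms (assumed functional form)."""
    return w.market * market_err + w.outcome * outcome_err + w.xg * xg_err

BASELINE = LossWeights()
MARKET_ONLY = LossWeights(outcome=0.0, xg=0.0)

# Hypothetical error values for one candidate set of team ratings.
errs = dict(market_err=0.10, outcome_err=0.50, xg_err=0.40)
print("baseline   :", total_loss(w=BASELINE, **errs))
print("market-only:", total_loss(w=MARKET_ONLY, **errs))
```

Under the market-only weights, the solver's objective depends on nothing but how well its probabilities match the market's.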

Validated on holdout (9 unseen leagues): +0.45pp, same direction. Production (26 leagues): +0.79pp, +118u saved, 12/19 leagues improve. Deployed.

Phase 3: Re-Test Signals

Both top signals re-tested on the improved base. Both still fail.

The market-only config raised the floor (-3.3% to -2.6% on filtered data) but signal marginals didn't change. Walk-forward still requires positive marginal in 2/3 of folds, and neither signal achieves that.
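The walk-forward rule stated above (positive marginal in at least 2/3 of folds) reduces to a short check. The function name and the per-fold values are hypothetical; integer arithmetic is used so the 2/3 comparison is exact.

```python
def passes_walk_forward(fold_marginals, num=2, den=3):
    """True if at least num/den of the folds show a strictly positive marginal."""
    positive = sum(1 for m in fold_marginals if m > 0)
    return positive * den >= num * len(fold_marginals)

# Hypothetical per-fold marginal ROI values, in percentage points.
one_of_four = [0.9, -0.4, -0.2, -0.6]
three_of_four = [0.9, 0.4, 0.2, -0.6]

print(passes_walk_forward(one_of_four))    # 1/4 positive folds
print(passes_walk_forward(three_of_four))  # 3/4 positive folds
```

With four folds, the rule requires three positive folds, so a signal that helps in one fold (or none) is rejected regardless of its average.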

Where We Are Now

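| Metric | Before | After | Delta |
|---|---|---|---|
| AH ROI (26 leagues) | -2.98% | -2.19% | +0.79pp |
| AH CLV | +6.33% | +7.03% | +0.70pp |
| AH P&L | -451u | -333u | +118u |
| Leagues improving | -- | 12/19 | -- |
| Signals passing 10-gate | 0/9 | 0/9 | -- |
| Bugs found/fixed | 0 | 3 | -- |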

Still unprofitable. Still negative. But:

  1. 27% less negative. The first structural model improvement in the project's history.
  2. CLV went UP. We didn't trade model quality for lucky results -- the model actually got better at predicting markets.
  3. Three bugs fixed. --decay-rate passthrough, snapshot loading config hash, edge threshold direction.
  4. Infrastructure built. Dev/holdout splits, sweep scripts, analysis tools, statistical validation. Every future experiment runs through this framework.
  5. Wrong hypotheses eliminated. "Solver reacts too slowly" -- wrong. "Higher edge = better bets" -- wrong. "Outcomes/xG help calibration" -- wrong.
  6. Right hypothesis found. The solver fits better to efficient market prices when not distracted by noisy match data.

The Honest Assessment

The paper trading results were exciting and statistically meaningless. 58 bets is not a meaningful sample size. The model has genuine edge (+7% CLV) but can't fully convert it to profit (-2.2% ROI). The gap is structural -- the Bivariate Poisson grid overestimates extreme outcomes, inflating edges that the market correctly prices.
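For readers unfamiliar with the score grid, the sketch below builds one from the standard bivariate Poisson probability mass function (independent components with rates lam1 and lam2 plus a shared component lam3) and sums the cells where the home side wins by a given margin. It omits the Dixon-Coles adjustment and Elo inputs entirely, and the rates are made-up numbers, so treat it as an illustration of the mechanism rather than the project's model.

```python
import math

def bivariate_poisson_pmf(x, y, lam1, lam2, lam3):
    """P(home=x, away=y) where home = A + C, away = B + C, with A, B, C Poisson."""
    base = (math.exp(-(lam1 + lam2 + lam3))
            * lam1 ** x / math.factorial(x)
            * lam2 ** y / math.factorial(y))
    shared = sum(
        math.comb(x, k) * math.comb(y, k) * math.factorial(k)
        * (lam3 / (lam1 * lam2)) ** k
        for k in range(min(x, y) + 1)
    )
    return base * shared

def score_grid(lam1, lam2, lam3, max_goals=10):
    """Grid of scoreline probabilities, truncated at max_goals per side."""
    return [[bivariate_poisson_pmf(x, y, lam1, lam2, lam3)
             for y in range(max_goals + 1)]
            for x in range(max_goals + 1)]

def home_margin_prob(grid, min_margin):
    """Probability that the home side wins by at least min_margin goals."""
    return sum(p for x, row in enumerate(grid)
               for y, p in enumerate(row) if x - y >= min_margin)

grid = score_grid(lam1=1.2, lam2=0.9, lam3=0.1)  # hypothetical rates
total_mass = sum(sum(row) for row in grid)
print(f"grid mass {total_mass:.6f}")
print(f"P(home wins by 1+) = {home_margin_prob(grid, 1):.3f}")
print(f"P(home wins by 3+) = {home_margin_prob(grid, 3):.3f}")
```

Asian handicap prices depend directly on these margin probabilities, so any excess mass the grid puts on lopsided scorelines shows up as an apparent edge on the bigger handicap lines.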

We're not there yet. But we now know exactly where "there" is, exactly what's blocking us, and we have the tools to test every hypothesis rigorously. The -2.2% gap is smaller than it was, and we have a clear roadmap: quarter-line routing (12pp spread, walk-forward confirmed), max-edge cap (wrong-direction discovery), and 127 unimplemented hypotheses to triage.

The model isn't trash. The paper trading wasn't a lie. It was a small sample from a genuinely edge-positive system that happens to be overconfident on its biggest bets. Fix the overconfidence, and the +7% CLV should start turning into profit.

What's Next

Tonight: an automated overnight test suite running 18 implemented signals through the 10-gate process, 192 parameter combinations, a max-edge-cap sweep, and a full decomposition. Tomorrow: triage 127 registered but unimplemented hypotheses. Then build the most promising ones.

The engine runs better. Time to find the tuning that makes it profitable.