From +29.7% ROI to -2.2%: The Journey from Illusion to Understanding
Paper trading showed +29.7% ROI at 58 bets. Rigorous backtesting showed -2.98% at 15,165 bets. Today we moved it to -2.19% by removing match results from the solver. The CLV was always real (+7%). The ROI was a lucky streak. Three sweeps, one winner, three bugs fixed, and a model that's 27% less negative.
Paper trading showed +29.7% ROI with a 68% hit rate. Rigorous backtesting showed -2.98%. Today we moved it to -2.19%. This is the story of how we discovered our model was lying to us, and what we did about it.
The Illusion
Two weeks ago, the dashboard was electric. Paper trading results:
| Metric | Value |
|---|---|
| Total Bets | 58 |
| Settled | 25 (17W / 8L) |
| ROI | +29.7% |
| P&L | +160.3u |
| CLV | +22.1% |
| Hit Rate | 68% |
| Avg Edge | +10.2% |
Green everywhere. The model was crushing it. We built conviction badges, deployed motivation signals, wired in xG decline alerts. The system felt alive.
There was one footnote: "24 bets with CLV data -- need ~300 for reliable significance."
We ignored it.
The Reckoning
Then we built rigorous backtesting infrastructure. Walk-forward validation across 26 leagues, 3 seasons, 15,165 AH bets. No peeking at future data. 7-day re-solve intervals with 3-day embargo. Pinnacle closing odds.
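The schedule described above can be sketched as a generator of train/test windows. This is a minimal illustration, not the project's actual code: the function name `walk_forward_windows` is hypothetical, and it assumes "embargo" means a 3-day gap between the last training day and the first test day.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def walk_forward_windows(
    start: date,
    end: date,
    resolve_days: int = 7,   # re-solve interval from the text
    embargo_days: int = 3,   # embargo from the text (interpretation assumed)
) -> Iterator[Tuple[date, date, date]]:
    """Yield (train_cutoff, test_start, test_end) triples.

    The model is fit only on matches dated strictly before train_cutoff,
    then scored on matches in [test_start, test_end). The days in
    [train_cutoff, test_start) are the embargo and are used for neither.
    """
    test_start = start
    while test_start < end:
        test_end = min(test_start + timedelta(days=resolve_days), end)
        train_cutoff = test_start - timedelta(days=embargo_days)
        yield train_cutoff, test_start, test_end
        test_start = test_end


for cutoff, lo, hi in walk_forward_windows(date(2024, 8, 1), date(2024, 8, 22)):
    print(f"train < {cutoff} | test {lo} .. {hi}")
```

Because every training cutoff sits before its test window, no bet in a window is scored by a model that has seen that window's results.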
The truth:
| Metric | Paper Trading (58 bets) | Backtest (15,165 bets) |
|---|---|---|
| AH ROI | +29.7% | -2.98% |
| CLV | +22.1% | +6.33% |
| Hit Rate | 68% | ~52% |
The +29.7% ROI was a lucky streak. At n=58, the 95% confidence interval for the realised ROI of a model whose true ROI is -3% is roughly [-29%, +23%]. Our result fell outside that interval, in the extreme tail: about a 1-in-230 event, unlikely but still attainable through variance alone.
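The width of that interval can be approximated with a few lines of arithmetic. The sketch below is not the calculation used for the figures above. It assumes flat 1u stakes, every bet priced at decimal odds of 1.95 (a typical AH price, chosen here for illustration), no pushes or half-wins, and a normal approximation.

```python
import math


def roi_confidence_interval(true_roi, n_bets, decimal_odds=1.95, z=1.96):
    """Normal-approximation interval for realised flat-stake ROI over n_bets."""
    # Win probability implied by the true ROI at these odds: ROI = p * odds - 1.
    p_win = (1.0 + true_roi) / decimal_odds
    # A bet returns odds - 1 on a win and -1 on a loss, so its standard
    # deviation is odds * sqrt(p * (1 - p)).
    per_bet_sd = decimal_odds * math.sqrt(p_win * (1.0 - p_win))
    std_err = per_bet_sd / math.sqrt(n_bets)
    return true_roi - z * std_err, true_roi + z * std_err


for n in (58, 15165):
    low, high = roi_confidence_interval(true_roi=-0.03, n_bets=n)
    print(f"n={n}: [{low:+.1%}, {high:+.1%}]")
```

Under those assumptions the n=58 interval spans about 25 percentage points either side of the true ROI, in line with the range quoted above, while at 15,165 bets it shrinks to under 2 points.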
The CLV was real. +22.1% in paper trading, +6.33% in backtest (difference explained by edge filters selecting higher-CLV bets). The model genuinely finds edge. It just can't convert that edge into profit.
The Signal Graveyard
We built the most rigorous testing process we could imagine. 10 automated gates. Per-league matchday interleave OOS. Bootstrap on marginal ROI. Walk-forward validation. Parallel testing terminals.
Then 9 signals went through it. All 9 rejected.
| Signal | Marginal ROI | Walk-Forward | Result |
|---|---|---|---|
| tc2-league-filter | +0.91pp | 1/4 folds | REJECTED |
| tc2-home-ah-rescue | +0.43pp | 0/4 folds | REJECTED |
| promoted-team-penalty | +0.14pp | FAIL | REJECTED |
| 6 others | <= 0pp | FAIL | REJECTED |
The best signal adds +0.91pp. The base is -3%. You'd need +3-5pp from a single signal. The gates were doing their job: telling us the engine was broken, not the paint.
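For readers unfamiliar with a bootstrap on marginal ROI, here is one way such a check could look. Everything in it is an assumption made for illustration: a signal is modelled as a keep/skip filter over the baseline bets, marginal ROI is the ROI of the kept bets minus the ROI of all bets, and the names `roi` and `bootstrap_marginal_roi` are hypothetical.

```python
import random
from typing import Sequence, Tuple


def roi(pnl: Sequence[float]) -> float:
    """Flat-stake ROI: total profit per 1u bet (0.0 if nothing was staked)."""
    return sum(pnl) / len(pnl) if pnl else 0.0


def bootstrap_marginal_roi(
    pnl: Sequence[float],
    keep: Sequence[bool],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, 2.5th percentile, 97.5th percentile)."""
    rng = random.Random(seed)
    n = len(pnl)

    def marginal(indices):
        base = [pnl[i] for i in indices]
        filtered = [pnl[i] for i in indices if keep[i]]
        return roi(filtered) - roi(base)

    point = marginal(range(n))
    # Resample bets with replacement and recompute the marginal each time.
    draws = sorted(
        marginal([rng.randrange(n) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = draws[int(0.025 * n_resamples)]
    upper = draws[int(0.975 * n_resamples) - 1]
    return point, lower, upper


# Synthetic bets, for demonstration only.
demo_rng = random.Random(1)
pnl = [0.95 if demo_rng.random() < 0.5 else -1.0 for _ in range(500)]
keep = [demo_rng.random() < 0.6 for _ in range(500)]
point, lower, upper = bootstrap_marginal_roi(pnl, keep)
print(f"marginal ROI {point:+.2%}, 95% interval [{lower:+.2%}, {upper:+.2%}]")
```

A gate built on this would typically require the lower bound to clear zero, which a small marginal such as +0.91pp struggles to do.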
The Diagnosis
CalGap (calibration gap) correlates with ROI at r=-0.922 across 19 leagues: the larger a league's gap, the worse its ROI. Within-season team collapses -- Rangers, Barcelona, Sevilla -- drive the gap. The hypothesis: the solver reacts too slowly.
Today: Three Sweeps, One Winner
We ran three rigorous parameter sweeps using stratified dev/holdout validation (10 dev leagues, 9 holdout leagues, 8,568 AH bets).
Sweep 1: RFB x Decay -- REJECTED
| Config | AH ROI | Delta |
|---|---|---|
| Baseline (rfb=1.5, d=0.005) | -3.29% | -- |
| rfb=2.5 | -4.39% | -1.10pp |
| d=0.010 alone | -3.23% | +0.06pp |
Raising recentFormBoost from 1.5 to 2.5 costs 1.10pp, and doubling the decay rate barely moves ROI (+0.06pp). Making the solver react faster does not help, so the "reacts too slowly" hypothesis was wrong.
Along the way, we found a bug: --decay-rate was never being passed to the data prep function. Every prior decay rate test had been running with the default.
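This class of bug is easy to reproduce. The sketch below is a generic illustration, not the project's code: `prepare_data`, `run_buggy` and `run_fixed` are hypothetical names, and it assumes the default decay rate equals the baseline d=0.005.

```python
import argparse

DEFAULT_DECAY_RATE = 0.005  # assumed to match the baseline d in the sweep table


def prepare_data(matches, decay_rate=DEFAULT_DECAY_RATE):
    """Hypothetical data prep: tag each match with the decay rate in use."""
    return [{"match": m, "decay_rate": decay_rate} for m in matches]


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--decay-rate", type=float, default=DEFAULT_DECAY_RATE)
    return parser.parse_args(argv)


def run_buggy(argv, matches):
    args = parse_args(argv)        # the flag is parsed...
    return prepare_data(matches)   # ...but never forwarded, so the default wins


def run_fixed(argv, matches):
    args = parse_args(argv)
    return prepare_data(matches, decay_rate=args.decay_rate)


flags = ["--decay-rate", "0.01"]
print(run_buggy(flags, ["m1"])[0]["decay_rate"])  # silently ignores the flag
print(run_fixed(flags, ["m1"])[0]["decay_rate"])  # honours the flag
```

Nothing fails loudly in the buggy version, which is why a sweep over the flag can look like a flat result rather than an error.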
Sweep 2: Edge Shrinkage -- WRONG DIRECTION
| Min Edge | AH ROI | Delta |
|---|---|---|
| 0% (baseline) | -3.29% | -- |
| 5% | -4.18% | -0.89pp |
| 10% | -6.49% | -3.20pp |
Higher confidence = worse bets. The model's largest edges are its most overconfident predictions. This is the calibration problem expressed directly.
Sweep 3: Loss Weights -- ACCEPTED
| Config | AH ROI | Delta |
|---|---|---|
| Baseline (outcome=0.3, xg=0.2) | -3.29% | -- |
| market-only (outcome=0, xg=0) | -2.24% | +1.05pp |
Removing match results and xG from the loss function improved AH ROI by 1.05pp in this sweep. The solver, freed from fitting to noisy data, produces better-calibrated probabilities purely from Pinnacle odds.
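To make the two configurations concrete, here is a minimal sketch of a weighted loss with the outcome and xG terms switched off. The structure is an assumption: the post only gives the outcome and xG weights, so the market weight of 1.0 and the names `LossWeights` and `total_loss` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    market: float = 1.0   # assumed; the market weight is not stated in the post
    outcome: float = 0.3  # baseline value from the sweep table
    xg: float = 0.2       # baseline value from the sweep table


def total_loss(market_loss: float, outcome_loss: float, xg_loss: float,
               w: LossWeights) -> float:
    """Weighted sum of the three loss components the solver minimises."""
    return w.market * market_loss + w.outcome * outcome_loss + w.xg * xg_loss


BASELINE = LossWeights()
MARKET_ONLY = LossWeights(outcome=0.0, xg=0.0)

# Component losses here are made-up numbers, used only to show the mechanics.
print(total_loss(0.10, 0.50, 0.40, BASELINE))
print(total_loss(0.10, 0.50, 0.40, MARKET_ONLY))
```

With the market-only weights, the optimiser has no incentive to move away from the odds-implied probabilities in order to explain individual results.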
Validated on holdout (9 unseen leagues): +0.45pp, same direction. Production (26 leagues): +0.79pp, +118u saved, 12/19 leagues improve. Deployed.
Phase 3: Re-Test Signals
Both top signals re-tested on the improved base. Both still fail.
The market-only config raised the floor (-3.3% to -2.6% on filtered data) but signal marginals didn't change. Walk-forward still requires positive marginal in 2/3 of folds, and neither signal achieves that.
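The fold rule is simple enough to state in code. This is a sketch of the rule as described here (positive marginal in at least 2/3 of folds); the function name `passes_walk_forward` and the per-fold values in the example are hypothetical, since only fold counts are reported.

```python
from typing import Sequence


def passes_walk_forward(fold_marginals_pp: Sequence[float],
                        num: int = 2, den: int = 3) -> bool:
    """True if at least num/den of the folds show a positive marginal ROI."""
    if not fold_marginals_pp:
        return False
    positive = sum(1 for m in fold_marginals_pp if m > 0)
    # Integer comparison avoids floating-point rounding on the 2/3 threshold.
    return positive * den >= num * len(fold_marginals_pp)


# One positive fold out of four, as reported for tc2-league-filter.
print(passes_walk_forward([0.9, -0.4, -0.2, -0.6]))
# Three positive folds out of four would clear the bar.
print(passes_walk_forward([0.9, 0.4, 0.2, -0.6]))
```

With four folds the rule needs three positive folds, so a signal that wins in one fold (or none) is rejected no matter how large that one win is.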
Where We Are Now
| Metric | Before | After | Delta |
|---|---|---|---|
| AH ROI (26 leagues) | -2.98% | -2.19% | +0.79pp |
| AH CLV | +6.33% | +7.03% | +0.70pp |
| AH P&L | -451u | -333u | +118u |
| Leagues improving | -- | 12/19 | -- |
| Signals passing 10-gate | 0/9 | 0/9 | -- |
| Bugs found/fixed | 0 | 3 | -- |
Still unprofitable. Still negative. But:
- 27% less negative. The first structural model improvement in the project's history.
- CLV went UP. We didn't trade model quality for lucky results -- the model actually got better at predicting markets.
- Three bugs fixed. --decay-rate passthrough, snapshot loading config hash, edge threshold direction.
- Infrastructure built. Dev/holdout splits, sweep scripts, analysis tools, statistical validation. Every future experiment runs through this framework.
- Wrong hypotheses eliminated. "Solver reacts too slowly" -- wrong. "Higher edge = better bets" -- wrong. "Outcomes/xG help calibration" -- wrong.
- Right hypothesis found. The solver fits better to efficient market prices when not distracted by noisy match data.
The Honest Assessment
The paper trading results were exciting and statistically meaningless. 58 bets is not a sample size. The model has genuine edge (+7% CLV) but can't fully convert it to profit (-2.2% ROI). The gap is structural -- the Bivariate Poisson grid overestimates extreme outcomes, inflating edges that the market correctly prices.
We're not there yet. But we now know exactly where "there" is, exactly what's blocking us, and we have the tools to test every hypothesis rigorously. The -2.2% gap is smaller than it was, and we have a clear roadmap: quarter-line routing (12pp spread, walk-forward confirmed), max-edge cap (wrong-direction discovery), and 127 unimplemented hypotheses to triage.
The model isn't trash. The paper trading wasn't a lie. It was a small sample from a genuinely edge-positive system that happens to be overconfident on its biggest bets. Fix the overconfidence, and the +7% CLV turns into profit.
What's Next
Tonight: automated overnight test suite running 18 implemented signals through 10-gate, 192 parameter combinations, max-edge-cap sweep, and full decomposition. Tomorrow: triage 127 registered but unimplemented hypotheses. Then build the most promising ones.
The engine runs better. Time to find the tuning that makes it profitable.