The Shadow Model: Proving Improvements Before Deploying Them
Shadow model v1 launched alongside production. It contains the market-only solver and the Dixon-Coles (DC) rho correction (validated at +1.25pp on backtest). The portfolio stack shows +2.92% ROI (p=0.064, all years positive). The shadow must prove itself on 100+ live bets before graduating to production.
We built three validated improvements but haven't deployed any to production. Instead, we created a shadow model that runs alongside production on the same matches. When it proves itself on live data, it graduates.
Why a Shadow Model
This session exposed a pattern: we kept finding improvements in the backtest that didn't match what production actually does. The variance filter used fake data (constant 1.35). We missed 269 FootyStats files. We assumed regime skip couldn't work when the data was right there. We changed backtest defaults without touching the live solver.
The lesson: don't trust the backtest to represent production. Run both and compare.
What the Shadow Contains
Two rigorously validated solver improvements that production doesn't have:
| Change | Backtest Evidence | Live Status |
|---|---|---|
| market-only (outcomeWeight=0, xgWeight=0) | Dev +1.05pp (p=0.097), holdout +0.45pp | NOT in production solver |
| Dixon-Coles rho=+0.05 | Dev +1.60pp (p=0.025), holdout +0.80pp | NOT in production solver |
Production still uses outcomeWeight=0.3, xgWeight=0.2, and no DC rho. The paper-trading ROI of +29.7% (58 bets) was earned by this OLD model, not by the shadow.
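For readers unfamiliar with the Dixon-Coles correction, here is a minimal sketch of how a rho term adjusts an independent-Poisson score matrix. It follows the standard Dixon-Coles low-score adjustment; the function names, the expected-goals inputs, and the `max_goals` cutoff are illustrative assumptions, not the project's actual solver code.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def dc_tau(home_goals: int, away_goals: int,
           home_mean: float, away_mean: float, rho: float) -> float:
    """Dixon-Coles multiplier; only the four lowest scorelines are adjusted."""
    if home_goals == 0 and away_goals == 0:
        return 1.0 - home_mean * away_mean * rho
    if home_goals == 0 and away_goals == 1:
        return 1.0 + home_mean * rho
    if home_goals == 1 and away_goals == 0:
        return 1.0 + away_mean * rho
    if home_goals == 1 and away_goals == 1:
        return 1.0 - rho
    return 1.0


def score_matrix(home_mean: float, away_mean: float,
                 rho: float = 0.05, max_goals: int = 10) -> list:
    """Return matrix[h][a] = P(home scores h, away scores a).

    rho=0.05 mirrors the shadow setting; rho=0.0 reduces to plain
    independent Poisson (the "no DC rho" case). max_goals is a
    hypothetical truncation point chosen for illustration.
    """
    return [
        [
            poisson_pmf(h, home_mean)
            * poisson_pmf(a, away_mean)
            * dc_tau(h, a, home_mean, away_mean, rho)
            for a in range(max_goals + 1)
        ]
        for h in range(max_goals + 1)
    ]
```

With a positive rho, the multiplier lowers the 0-0 and 1-1 probabilities and raises 1-0 and 0-1, while the total probability mass is unchanged because the four adjustments cancel.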
The Portfolio Stack Discovery
When we stacked tc2-league-filter + gk-psxg-opponent-filter on the improved base:
- +2.92% ROI (1,954 AH bets, +57.1u P&L)
- 3/3 years positive (marginals +4.24 to +5.21pp)
- 23/23 leagues survive leave-one-out
- Bootstrap p=0.064 (significant at 10%, not 5%)
This was hidden by the 10-gate process, which tested signals individually. The Fundamental Law of Active Management (information ratio ≈ information coefficient × √breadth) implies that weak signals become valuable when stacked. We had weak signals all along; the test was wrong.
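To make the bootstrap p-value concrete, here is a rough sketch of one common variant: resample the settled bets with replacement and count how often the resampled P&L is not positive. This is an assumption about the general technique, not a record of how the p=0.064 figure was computed; the function name and the flat-stake per-bet P&L input are hypothetical.

```python
import random


def bootstrap_roi_pvalue(pnl_units, n_resamples=10_000, seed=0):
    """One-sided bootstrap p-value for "ROI is not positive".

    pnl_units: per-bet profit/loss in units, assuming flat 1u stakes,
    so the sign of total P&L equals the sign of ROI.
    Returns the fraction of resamples whose total P&L is <= 0.
    """
    n = len(pnl_units)
    if n == 0:
        raise ValueError("need at least one settled bet")
    rng = random.Random(seed)  # seeded for reproducible results
    non_positive = 0
    for _ in range(n_resamples):
        # Draw n bets with replacement and total their P&L.
        total = sum(pnl_units[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            non_positive += 1
    return non_positive / n_resamples
```

Under this reading, a value of 0.064 would mean that roughly 6.4% of resampled bet histories fail to show a profit.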
What Happens Next
The shadow model generates picks on the same matches as production. Both log bets. Both settle against real outcomes. After 100+ settled bets:
- Shadow CLV > Production CLV → directionally better
- Shadow ROI ≥ Production ROI → makes at least as much money
- No league-level regression → doesn't break anything
When all three pass, shadow replaces production.
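The graduation rule above can be sketched as a small gate function. Everything here is illustrative: the `LiveStats` container, the field names, and especially the reading of "no league-level regression" as "shadow ROI is not below production ROI in any league both models bet on" are assumptions, since the exact definitions are not spelled out.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LiveStats:
    """Hypothetical summary of one model's settled live bets."""
    settled_bets: int
    clv: float  # average closing line value
    roi: float  # overall return on investment
    roi_by_league: Dict[str, float] = field(default_factory=dict)


def graduation_checks(shadow: LiveStats, production: LiveStats,
                      min_bets: int = 100,
                      league_tolerance: float = 0.0) -> Dict[str, bool]:
    """Evaluate the sample-size requirement and the three criteria."""
    shared = set(shadow.roi_by_league) & set(production.roi_by_league)
    return {
        "enough_bets": shadow.settled_bets >= min_bets,
        "clv_better": shadow.clv > production.clv,
        "roi_not_worse": shadow.roi >= production.roi,
        # Assumed definition: no shared league may be worse than
        # production by more than league_tolerance.
        "no_league_regression": all(
            shadow.roi_by_league[lg]
            >= production.roi_by_league[lg] - league_tolerance
            for lg in shared
        ),
    }


def should_graduate(shadow: LiveStats, production: LiveStats) -> bool:
    """Shadow replaces production only when every check passes."""
    return all(graduation_checks(shadow, production).values())
```

Returning the individual checks rather than a single boolean makes it easy to log which criterion is blocking graduation at any point in the comparison.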
The Honest State
| Metric | Production (live) | Shadow (backtest) |
|---|---|---|
| Solver | outcome=0.3, xg=0.2 | outcome=0, xg=0 |
| DC rho | None | +0.05 |
| Regime skip | Yes (live HFA) | Yes (Fotmob cached) |
| GK adjust | Feature-flagged | Enabled |
| AH ROI (backtest) | approx. -2.98% | -1.89% (base), +2.92% (with stack) |
The shadow is better on backtest. Whether it's better on live data is what the comparison will prove.