The Shadow Model: Proving Improvements Before Deploying Them

Shadow model v1 now runs alongside production. It contains the market-only solver plus the Dixon-Coles (DC) rho correction (validated at +1.25pp on backtest). The portfolio stack shows +2.92% ROI (p=0.064, all years positive). The shadow must prove itself on 100+ live bets before graduating to production.

We built three validated improvements but haven't deployed any to production. Instead, we created a shadow model that runs alongside production on the same matches. When it proves itself on live data, it graduates.

Why a Shadow Model

This session exposed a pattern: we kept finding improvements in the backtest that didn't match what production actually does. The variance filter used fake data (constant 1.35). We missed 269 FootyStats files. We assumed regime skip couldn't work when the data was right there. We changed backtest defaults without touching the live solver.

The lesson: don't trust the backtest to represent production. Run both and compare.

What the Shadow Contains

Two rigorously validated solver improvements that production doesn't have:

| Change | Backtest Evidence | Live Status |
|---|---|---|
| market-only (outcomeWeight=0, xgWeight=0) | Dev +1.05pp (p=0.097), holdout +0.45pp | NOT in production solver |
| Dixon-Coles rho=+0.05 | Dev +1.60pp (p=0.025), holdout +0.80pp | NOT in production solver |

Production still uses outcomeWeight=0.3, xgWeight=0.2, and no DC rho. The paper-trading result of +29.7% ROI (58 bets) came from that OLD production model, not from the shadow.
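For readers who have not met the DC rho correction, here is a minimal sketch of the low-score adjustment from the original Dixon-Coles model, applied to independent Poisson scorelines. It is an illustration only, not the production or shadow solver: the function names, the example goal rates, and the sign convention are assumptions, since this post only states rho=+0.05. In the parametrization used here, a positive rho moves probability away from 0-0 and 1-1 and toward 1-0 and 0-1.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals under a Poisson(rate) model."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def dc_tau(x: int, y: int, lam: float, mu: float, rho: float) -> float:
    """Dixon-Coles multiplier; only the four lowest scorelines are adjusted.

    lam = expected home goals, mu = expected away goals (assumed names).
    """
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def score_matrix(lam: float, mu: float, rho: float, max_goals: int = 10):
    """Grid of P(home=x, away=y), renormalized after truncating at max_goals."""
    grid = [
        [
            dc_tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            for y in range(max_goals + 1)
        ]
        for x in range(max_goals + 1)
    ]
    total = sum(sum(row) for row in grid)
    return [[p / total for p in row] for row in grid]


def outcome_probs(grid):
    """Collapse the score grid into (home win, draw, away win) probabilities."""
    size = len(grid)
    home = sum(grid[x][y] for x in range(size) for y in range(size) if x > y)
    draw = sum(grid[x][x] for x in range(size))
    away = sum(grid[x][y] for x in range(size) for y in range(size) if x < y)
    return home, draw, away


# Hypothetical goal rates, compared with and without the correction.
independent = outcome_probs(score_matrix(1.5, 1.1, rho=0.0))
corrected = outcome_probs(score_matrix(1.5, 1.1, rho=0.05))
```

Setting rho=0 recovers the plain independent-Poisson grid, which makes it easy to diff the corrected and uncorrected 1X2 probabilities for the same match.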

The Portfolio Stack Discovery

When we stacked tc2-league-filter + gk-psxg-opponent-filter on the improved base:

  • +2.92% ROI (1,954 AH bets, +57.1u P&L)
  • 3/3 years positive (marginals +4.24 to +5.21pp)
  • 23/23 leagues survive leave-one-out
  • Bootstrap p=0.064 (significant at 10%, not 5%)

This was hidden by the 10-gate process, which tested signals one at a time. The Fundamental Law of Active Management (information ratio ≈ information coefficient × √breadth) favors stacking weak, independent signals. We had weak signals all along; the test was wrong.
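This post does not say how the bootstrap p-value or the leave-one-out check were computed. The sketch below shows one common way to do both, under the following assumptions: each bet is resampled independently with replacement, the p-value is the share of resampled ROIs at or below zero, and a league "survives" when ROI stays positive with that league removed. All function names and the tuple layout are hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def bootstrap_roi_pvalue(
    profits: Sequence[float],
    stakes: Optional[Sequence[float]] = None,
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """One-sided bootstrap p-value: fraction of resampled ROIs that are <= 0."""
    n = len(profits)
    if stakes is None:
        stakes = [1.0] * n  # assume flat one-unit stakes
    rng = random.Random(seed)
    indices = range(n)
    non_positive = 0
    for _ in range(n_resamples):
        sample = rng.choices(indices, k=n)  # draw n bets with replacement
        pnl = sum(profits[i] for i in sample)
        staked = sum(stakes[i] for i in sample)
        if pnl / staked <= 0:
            non_positive += 1
    return non_positive / n_resamples


def leave_one_league_out(bets: List[Tuple[str, float, float]]) -> Dict[str, float]:
    """ROI of the remaining bets after dropping each league in turn.

    Each bet is a (league, profit, stake) tuple.
    """
    leagues = {league for league, _, _ in bets}
    result = {}
    for dropped in leagues:
        rest = [(p, s) for league, p, s in bets if league != dropped]
        result[dropped] = sum(p for p, _ in rest) / sum(s for _, s in rest)
    return result
```

A per-bet bootstrap ignores any correlation between bets on the same match or matchday, so a block bootstrap (resampling whole matchdays) would be the more conservative choice if that correlation matters.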

What Happens Next

The shadow model generates picks on the same matches as production. Both log bets. Both settle against real outcomes. After 100+ settled bets:

  • Shadow CLV > Production CLV → directionally better
  • Shadow ROI ≥ Production ROI → makes at least as much money
  • No league-level regression → doesn't break anything

When all three pass, shadow replaces production.
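The three gates can be written down as a single check. The sketch below is one possible reading, not the actual promotion script: the `SettledBet` fields, the CLV definition (odds taken divided by closing odds, minus one), and the interpretation of "no league-level regression" as "shadow ROI is not below production ROI in any league both models bet on" are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SettledBet:
    league: str
    stake: float         # units staked, assumed > 0
    profit: float        # net units won (positive) or lost (negative)
    odds_taken: float    # decimal odds at bet time
    closing_odds: float  # decimal odds at market close


def roi(bets: List[SettledBet]) -> float:
    """Total profit divided by total stake."""
    return sum(b.profit for b in bets) / sum(b.stake for b in bets)


def mean_clv(bets: List[SettledBet]) -> float:
    """Average closing line value; positive means the bet beat the close."""
    return sum(b.odds_taken / b.closing_odds - 1.0 for b in bets) / len(bets)


def roi_by_league(bets: List[SettledBet]) -> Dict[str, float]:
    grouped = defaultdict(list)
    for b in bets:
        grouped[b.league].append(b)
    return {league: roi(group) for league, group in grouped.items()}


def shadow_graduates(
    shadow: List[SettledBet],
    production: List[SettledBet],
    min_bets: int = 100,
    league_tolerance: float = 0.0,
) -> bool:
    """True only when the sample is large enough and all three gates pass."""
    if len(shadow) < min_bets or not production:
        return False
    clv_ok = mean_clv(shadow) > mean_clv(production)
    roi_ok = roi(shadow) >= roi(production)
    prod_leagues = roi_by_league(production)
    league_ok = all(
        league_roi >= prod_leagues[league] - league_tolerance
        for league, league_roi in roi_by_league(shadow).items()
        if league in prod_leagues  # compare only leagues both models bet on
    )
    return clv_ok and roi_ok and league_ok
```

With roughly 100 bets spread over many leagues, per-league ROI is very noisy, so a non-zero `league_tolerance` or a minimum per-league bet count may be needed to keep the third gate from failing on chance alone.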

The Honest State

| Metric | Production (live) | Shadow (backtest) |
|---|---|---|
| Solver | outcome=0.3, xg=0.2 | outcome=0, xg=0 |
| DC rho | None | +0.05 |
| Regime skip | Yes (live HFA) | Yes (Fotmob cached) |
| GK adjust | Feature-flagged | Enabled |
| AH ROI (backtest) | ~-2.98% | -1.89% (base), +2.92% (with stack) |

The shadow is better on backtest. Whether it's better on live data is what the comparison will prove.