
Testing is_home as an XGBoost Feature: Why the Model Already Knew

Added is_home to the XGBoost feature set to learn venue × shot-type interactions. Walk-forward on 312K shots: Model A Brier 0.0787 (baseline 0.0785, delta +0.0002); Model B unchanged at 0.0712. The post-hoc venue calibration already captures the venue signal, and XGBoost finds no additional home/away interaction effects at the individual shot level.

Model A Brier: 0.0787 (baseline 0.0785)
Model B Brier: 0.0712 (unchanged)
Shots: 312K (walk-forward, 3 folds)


Date: 2026-04-01 | Signal: xg-is-home-feature | Category: Model Architecture | Status: Rejected

The Question

Our xG model applies venue correction *after* prediction — a flat multiplier (home ×0.934, away ×0.940) learned from 133K shots. It works, beating Understat's uncalibrated xG by 0.0005 Brier out-of-sample.
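To make the post-hoc step concrete, here is a minimal sketch of a flat venue multiplier applied after prediction, together with the Brier score used throughout this post. The multipliers are the ones quoted above; the function names and the toy shots are assumptions for illustration, not the production pipeline.

```python
# Flat venue multipliers quoted in the post; everything else here is illustrative.
VENUE_MULTIPLIER = {"home": 0.934, "away": 0.940}


def calibrate_xg(raw_xg: float, venue: str) -> float:
    """Scale a raw xG prediction by the flat multiplier for its venue."""
    return raw_xg * VENUE_MULTIPLIER[venue]


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Toy shots, made up for illustration: (raw xG, venue, goal scored?)
toy_shots = [(0.30, "home", 1), (0.10, "away", 0), (0.05, "home", 0)]

calibrated = [calibrate_xg(xg, venue) for xg, venue, _ in toy_shots]
outcomes = [goal for _, _, goal in toy_shots]
toy_brier = brier_score(calibrated, outcomes)
```

Every home shot gets the same scaling here whatever its type or distance, which is exactly the bluntness questioned next.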

But it's a blunt instrument. Every home shot gets the same discount regardless of type, distance, or game state. What if XGBoost could learn *interaction effects* — headers from crosses converting differently at home vs away, set pieces behaving differently under crowd pressure, long-range shots varying by venue while close-range don't?

Adding is_home as a training feature would let the model discover these interactions automatically.

What We Found

Nothing. The feature adds near-zero predictive value across both models.

Model A (Universal — 312K shots, Tier 1+2 features)

| Walk-Forward Fold | Brier (baseline) | Brier (with is_home) | Delta |
|---|---|---|---|
| train≤2021, test 2022 | 0.0793 | 0.0792 | -0.0001 |
| train≤2022, test 2023 | 0.0781 | 0.0788 | +0.0007 |
| train≤2023, test 2024 | 0.0782 | 0.0782 | 0.0000 |
| **Average** | **0.0785** | **0.0787** | **+0.0002** |

Model B (Enhanced — 87K StatsBomb shots, Tier 1+2+3 with freeze-frame)

| Walk-Forward Fold | Brier (baseline) | Brier (with is_home) | Delta |
|---|---|---|---|
| train≤2021, test 2022 | 0.0771 | 0.0772 | +0.0001 |
| train≤2022, test 2023 | 0.0705 | 0.0705 | 0.0000 |
| train≤2023, test 2024 | 0.0660 | 0.0658 | -0.0002 |
| **Average** | **0.0712** | **0.0712** | **0.0000** |

Target was Brier < 0.077 for Model A. Actual: 0.0787. Not close.
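The fold layout in these tables is an expanding-window split: train on every season up to a cutoff, test on the next one. A rough sketch follows, which also recomputes the Model A averages from the per-fold numbers. The earliest season (2019) is an assumption, since the post only states the test years, and the helper name is hypothetical.

```python
def walk_forward_folds(seasons, test_seasons):
    """Yield (train_seasons, test_season): train on every season before the test season."""
    for test in test_seasons:
        train = [s for s in seasons if s < test]
        if train:
            yield train, test


# Assumption: the starting season is illustrative; only the test years come from the post.
seasons = list(range(2019, 2025))
folds = list(walk_forward_folds(seasons, test_seasons=[2022, 2023, 2024]))

# Per-fold Brier for Model A, copied from the table: test season -> (baseline, with is_home).
model_a = {
    2022: (0.0793, 0.0792),
    2023: (0.0781, 0.0788),
    2024: (0.0782, 0.0782),
}

baseline_avg = sum(b for b, _ in model_a.values()) / len(model_a)
with_is_home_avg = sum(w for _, w in model_a.values()) / len(model_a)
delta = with_is_home_avg - baseline_avg

print(f"baseline={baseline_avg:.4f} with_is_home={with_is_home_avg:.4f} delta={delta:+.4f}")
```

Rounded to four decimals, this reproduces the 0.0785 and 0.0787 averages and the +0.0002 delta. The per-fold deltas range from -0.0001 to +0.0007, so the average shift is smaller than the fold-to-fold spread.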

Cross-Source Validation

| Test | Brier | AUC | N |
|---|---|---|---|
| StatsBomb → Understat | 0.08056 | 0.7569 | 221,769 |
| Understat → StatsBomb | 0.08185 | 0.7626 | 85,501 |
| Understat's own xG | 0.07291 | 0.8065 | 221,769 |

The Nuance

This isn't surprising in hindsight. The venue effect on shot conversion is *uniform* — home advantage manifests as more shots, better positions, and referee bias, not as a per-shot-type interaction. The flat calibration multiplier captures this correctly.

XGBoost *could* learn venue interactions if they existed. With 312K training shots, it had plenty of data. The fact that it didn't split on is_home in any meaningful way (the feature doesn't appear in SHAP top-10) tells us the interactions aren't there at the individual shot level.

The venue signal lives at the *match level* (home teams create more chances) and the *market level* (home advantage is priced into odds). At the shot level, a 15-yard shot from the center is about as likely to go in whether you're home or away.
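One simple way to sanity-check the "uniform venue effect" reading, separate from the SHAP inspection mentioned above, is to compare goals per unit of predicted xG by shot type and venue. If the home/away gap is roughly the same for every shot type, a flat multiplier is enough. This is a hypothetical diagnostic with made-up toy shots, not the analysis run for this post.

```python
from collections import defaultdict


def conversion_ratio_by_group(shots):
    """Return {(shot_type, venue): goals / summed xG} for each group of shots."""
    goals = defaultdict(float)
    xg = defaultdict(float)
    for shot_type, venue, pred, goal in shots:
        key = (shot_type, venue)
        goals[key] += goal
        xg[key] += pred
    return {key: goals[key] / xg[key] for key in xg if xg[key] > 0}


def home_away_gap(ratios):
    """Home ratio minus away ratio per shot type, for types seen at both venues."""
    shot_types = {shot_type for shot_type, _ in ratios}
    return {
        t: ratios[(t, "home")] - ratios[(t, "away")]
        for t in shot_types
        if (t, "home") in ratios and (t, "away") in ratios
    }


# Toy data, made up for illustration: (shot_type, venue, predicted xG, goal scored?)
toy_shots = [
    ("header", "home", 0.10, 1), ("header", "home", 0.10, 0),
    ("header", "away", 0.10, 1), ("header", "away", 0.10, 0),
    ("long_range", "home", 0.05, 0), ("long_range", "home", 0.05, 1),
    ("long_range", "away", 0.05, 1), ("long_range", "away", 0.05, 0),
]

gaps = home_away_gap(conversion_ratio_by_group(toy_shots))
```

Gaps that differ widely between shot types would point to the interactions we were looking for; gaps that are similar everywhere are what a flat multiplier already handles.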

What This Means

  • `is_home` stays in the feature config: it's harmless (the +0.0002 Model A delta is noise, Model B is unchanged) and remains available for future model versions
  • Post-hoc venue calibration stays active — it's not redundant, it's the correct mechanism
  • No deployment needed — the models were retrained but the delta is noise
  • The venue calibration table (Layer 2) remains the right place for this signal

What's Next

The xG model improvement path is elsewhere:

  • Better freeze-frame features (GK positioning, defender blocking geometry)
  • More training data from non-Big-5 leagues
  • Situation-specific sub-models (set pieces, fast breaks)
  • The venue calibration itself could be refined per-league instead of global
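As a rough sketch of that last idea, per-league multipliers could be fitted as actual goals divided by summed raw xG for each (league, venue) pair, falling back to the global values when a league has too few shots. The fitting rule, the `min_shots` threshold, and all names below are assumptions; the post does not describe how the current multipliers were learned.

```python
from collections import defaultdict

# Global multipliers quoted earlier in the post, used as the fallback.
GLOBAL_MULTIPLIER = {"home": 0.934, "away": 0.940}


def fit_league_multipliers(shots, min_shots=1000):
    """Fit goals / summed xG per (league, venue), skipping small groups.

    shots: iterable of (league, venue, raw_xg, goal) tuples.
    """
    count = defaultdict(int)
    goals = defaultdict(float)
    xg = defaultdict(float)
    for league, venue, raw_xg, goal in shots:
        key = (league, venue)
        count[key] += 1
        goals[key] += goal
        xg[key] += raw_xg
    return {
        key: goals[key] / xg[key]
        for key in count
        if count[key] >= min_shots and xg[key] > 0
    }


def venue_multiplier(league, venue, table):
    """Look up the league-specific multiplier, else fall back to the global one."""
    return table.get((league, venue), GLOBAL_MULTIPLIER[venue])


# Toy usage with a tiny threshold; the league code and shots are made up.
toy_table = fit_league_multipliers(
    [("EPL", "home", 0.5, 1), ("EPL", "home", 0.5, 0)], min_shots=2
)
```

The fallback matters because smaller leagues may not have enough shots for a stable ratio, and any per-league table would need the same walk-forward check as the global one before replacing it.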