Testing is_home as an XGBoost Feature: Why the Model Already Knew
Added is_home to XGBoost feature set to learn venue x shot-type interactions. Walk-forward on 312K shots: Model A Brier 0.0787 (baseline 0.0785, delta +0.0002). Model B unchanged. The post-hoc venue calibration already captures the venue signal. XGBoost finds no additional home/away interaction effects at the individual shot level.
Date: 2026-04-01 · Signal: xg-is-home-feature · Category: Model Architecture · Status: Rejected
The Question
Our xG model applies venue correction *after* prediction — a flat multiplier (home ×0.934, away ×0.940) learned from 133K shots. It works, beating Understat's uncalibrated xG by 0.0005 Brier out-of-sample.
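As a minimal sketch of what this post-hoc correction does, the snippet below scales a raw xG value by the venue multiplier quoted above. The helper name `calibrate_xg` and the dictionary layout are illustrative assumptions, not the production code; only the two multipliers come from this note.

```python
# Venue multipliers quoted in this note (learned from 133K shots).
VENUE_MULTIPLIER = {"home": 0.934, "away": 0.940}


def calibrate_xg(raw_xg: float, is_home: bool) -> float:
    """Scale a raw xG prediction by the flat venue multiplier.

    Hypothetical helper: the real pipeline may store and apply the
    multipliers differently.
    """
    if not 0.0 <= raw_xg <= 1.0:
        raise ValueError("raw_xg must be a probability in [0, 1]")
    key = "home" if is_home else "away"
    return raw_xg * VENUE_MULTIPLIER[key]


if __name__ == "__main__":
    # Same raw prediction, two venues: only the venue changes the result.
    print(round(calibrate_xg(0.30, is_home=True), 4))
    print(round(calibrate_xg(0.30, is_home=False), 4))
```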
But it's a blunt instrument. Every home shot gets the same discount regardless of type, distance, or game state. What if XGBoost could learn *interaction effects* — headers from crosses converting differently at home vs away, set pieces behaving differently under crowd pressure, long-range shots varying by venue while close-range don't?
Adding is_home as a training feature would let the model discover these interactions automatically.
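To make "interaction effect" concrete: a flat multiplier assumes the ratio of goals to summed xG is the same for every shot type at a given venue. The sketch below computes that ratio per (shot type, venue) cell, so any spread across shot types would be visible. The function name, the row keys (`shot_type`, `is_home`, `xg`, `goal`), and the toy rows are all assumptions made for illustration; they are not the project's schema or data.

```python
from collections import defaultdict


def conversion_ratio_by_cell(shots):
    """Return goals / summed xG for each (shot_type, is_home) cell."""
    goals = defaultdict(float)
    xg = defaultdict(float)
    for shot in shots:
        cell = (shot["shot_type"], shot["is_home"])
        goals[cell] += shot["goal"]
        xg[cell] += shot["xg"]
    # Skip cells with zero summed xG to avoid dividing by zero.
    return {cell: goals[cell] / xg[cell] for cell in xg if xg[cell] > 0}


if __name__ == "__main__":
    # Toy rows only, invented to show the shape of the output.
    toy_shots = [
        {"shot_type": "header", "is_home": True, "xg": 0.5, "goal": 1},
        {"shot_type": "header", "is_home": False, "xg": 0.5, "goal": 0},
        {"shot_type": "long_range", "is_home": True, "xg": 0.25, "goal": 0},
        {"shot_type": "long_range", "is_home": False, "xg": 0.25, "goal": 1},
    ]
    for cell, ratio in sorted(conversion_ratio_by_cell(toy_shots).items()):
        print(cell, ratio)
```

If the home and away ratios moved together across shot types, a single multiplier per venue would be enough; if they diverged by shot type, a learned interaction could help.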
What We Found
Nothing. The feature adds no measurable predictive value to either model: average Brier moves by +0.0002 (marginally worse) on Model A and by 0.0000 on Model B.
Model A (Universal — 312K shots, Tier 1+2 features)
| Walk-Forward Fold | Brier (baseline) | Brier (with is_home) | Delta |
|---|---|---|---|
| train≤2021, test 2022 | 0.0793 | 0.0792 | -0.0001 |
| train≤2022, test 2023 | 0.0781 | 0.0788 | +0.0007 |
| train≤2023, test 2024 | 0.0782 | 0.0782 | 0.0000 |
| **Average** | **0.0785** | **0.0787** | **+0.0002** |
Model B (Enhanced — 87K StatsBomb shots, Tier 1+2+3 with freeze-frame)
| Walk-Forward Fold | Brier (baseline) | Brier (with is_home) | Delta |
|---|---|---|---|
| train≤2021, test 2022 | 0.0771 | 0.0772 | +0.0001 |
| train≤2022, test 2023 | 0.0705 | 0.0705 | 0.0000 |
| train≤2023, test 2024 | 0.0660 | 0.0658 | -0.0002 |
| **Average** | **0.0712** | **0.0712** | **0.0000** |
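The target for Model A was Brier < 0.077. The actual walk-forward average with `is_home` was 0.0787, slightly worse than the 0.0785 baseline and well short of the target.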
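For readers unfamiliar with the evaluation setup, here is a minimal sketch of the two pieces behind the tables above: the Brier score (mean squared error between predicted probability and the 0/1 goal outcome) and an expanding-window walk-forward split that trains on every season before the test season. The function names are illustrative, and the starting season 2019 in the usage example is a placeholder, since this note does not state the earliest season in the data.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def walk_forward_folds(seasons, first_test, last_test):
    """Yield (train_seasons, test_season) pairs with an expanding training window."""
    available = sorted(set(seasons))
    for test_season in range(first_test, last_test + 1):
        train = [s for s in available if s < test_season]
        if train and test_season in available:
            yield train, test_season


if __name__ == "__main__":
    # Placeholder season range; mirrors the train<=2021/test 2022 pattern.
    for train, test in walk_forward_folds(range(2019, 2025), 2022, 2024):
        print(f"train<={train[-1]}, test {test}")
```

Lower Brier is better, so a positive delta in the tables means the model with `is_home` scored worse on that fold.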
Cross-Source Validation
| Test | Brier | AUC | N |
|---|---|---|---|
| StatsBomb → Understat | 0.08056 | 0.7569 | 221,769 |
| Understat → StatsBomb | 0.08185 | 0.7626 | 85,501 |
| Understat's own xG | 0.07291 | 0.8065 | 221,769 |
The Nuance
This isn't surprising in hindsight. The venue effect on shot conversion is *uniform* — home advantage manifests as more shots, better positions, and referee bias, not as a per-shot-type interaction. The flat calibration multiplier captures this correctly.
XGBoost *could* learn venue interactions if they existed, and with 312K shots in the walk-forward set it had plenty of data. It didn't split on `is_home` in any meaningful way (the feature doesn't appear in the SHAP top-10), which suggests the interactions aren't there at the individual shot level.
The venue signal lives at the *match level* (home teams create more chances) and the *market level* (home advantage is priced into odds). At the shot level, a 15-yard shot from the center is about as likely to go in whether you're home or away.
What This Means
- `is_home` stays in the feature config. It's harmless (average Brier moves by +0.0002 on Model A and 0.0000 on Model B, within fold-to-fold variation) and remains available for future model versions
- Post-hoc venue calibration stays active — it's not redundant, it's the correct mechanism
- No deployment needed — the models were retrained but the delta is noise
- The venue calibration table (Layer 2) remains the right place for this signal
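A rough sanity check on the "delta is noise" reading, using only the per-fold numbers from the tables above: compare the mean delta to the spread of deltas across folds. With three folds this is a back-of-the-envelope comparison, not a significance test, and the helper name `delta_summary` is an illustrative assumption.

```python
from statistics import mean, stdev

# (baseline Brier, Brier with is_home) per walk-forward fold, from the tables.
MODEL_A_FOLDS = [(0.0793, 0.0792), (0.0781, 0.0788), (0.0782, 0.0782)]
MODEL_B_FOLDS = [(0.0771, 0.0772), (0.0705, 0.0705), (0.0660, 0.0658)]


def delta_summary(folds):
    """Return (mean delta, sample standard deviation of deltas) across folds."""
    deltas = [with_feature - baseline for baseline, with_feature in folds]
    return mean(deltas), stdev(deltas)


if __name__ == "__main__":
    for name, folds in (("Model A", MODEL_A_FOLDS), ("Model B", MODEL_B_FOLDS)):
        avg, spread = delta_summary(folds)
        # A mean delta smaller than the fold-to-fold spread is consistent
        # with noise rather than a real effect.
        print(f"{name}: mean delta {avg:+.5f}, fold spread {spread:.5f}")
```

For both models the mean delta is smaller in magnitude than the spread between folds, which is consistent with treating the change as noise.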
What's Next
The xG model improvement path is elsewhere:
- Better freeze-frame features (GK positioning, defender blocking geometry)
- More training data from non-Big-5 leagues
- Situation-specific sub-models (set pieces, fast breaks)
- The venue calibration itself could be refined per-league instead of global
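The last item could be prototyped along these lines. This is a sketch under stated assumptions: the note does not say how the global multipliers were fit, so the ratio of observed goals to summed raw xG is used here as one plausible estimator, and the row keys (`league`, `is_home`, `xg`, `goal`), the function name, and the `min_shots` threshold are all hypothetical. Cells with too few shots fall back to the global multipliers quoted earlier.

```python
from collections import defaultdict

# Global multipliers from this note, keyed by is_home.
GLOBAL_MULTIPLIER = {True: 0.934, False: 0.940}


def fit_league_multipliers(shots, min_shots=1000):
    """Fit a goals / summed-xG multiplier per (league, is_home) cell.

    Assumed estimator, not the documented one. Sparse cells keep the
    global multiplier so small leagues do not get noisy corrections.
    """
    counts = defaultdict(int)
    goals = defaultdict(float)
    xg = defaultdict(float)
    for shot in shots:
        key = (shot["league"], shot["is_home"])
        counts[key] += 1
        goals[key] += shot["goal"]
        xg[key] += shot["xg"]

    table = {}
    for key, n in counts.items():
        if n >= min_shots and xg[key] > 0:
            table[key] = goals[key] / xg[key]
        else:
            table[key] = GLOBAL_MULTIPLIER[key[1]]
    return table
```

Any per-league table fit this way would need the same walk-forward evaluation as the global one before replacing it.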