April 1, 2026|Model Architecture|REJECTED

Testing is_home as an XGBoost Feature: Why the Model Already Knew

Added is_home to XGBoost feature set to learn venue × shot-type interactions. Walk-forward on 312K shots: Model A Brier 0.0787 (baseline 0.0785, delta +0.0002). Model B unchanged. The post-hoc venue calibration already captures the venue signal. XGBoost finds no additional home/away interaction effects at the individual shot level.

Model A Brier: 0.0787 (baseline 0.0785)
Model B Brier: 0.0712 (unchanged)
Shots: 312K (walk-forward, 3 folds)


Date: 2026-04-01 | Signal: xg-is-home-feature | Category: Model Architecture | Status: Rejected

The Question

Our xG model applies venue correction *after* prediction — a flat multiplier (home ×0.934, away ×0.940) learned from 133K shots. It works, beating Understat's uncalibrated xG by 0.0005 Brier out-of-sample.
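As a rough illustration of what "flat" means here, the sketch below applies the two multipliers quoted above to a raw model probability. The function and constant names are hypothetical, and the clamping step is an assumption rather than something the pipeline is known to do.

```python
# Multipliers quoted in the text; the names below are illustrative only.
VENUE_MULTIPLIER = {"home": 0.934, "away": 0.940}


def calibrate_xg(raw_xg: float, is_home: bool) -> float:
    """Scale a raw xG prediction by the flat venue multiplier."""
    factor = VENUE_MULTIPLIER["home" if is_home else "away"]
    # Assumption: clamp so the result remains a valid probability.
    return min(max(raw_xg * factor, 0.0), 1.0)


# Every home shot gets the same relative discount, whatever its type or distance.
header_from_cross = calibrate_xg(0.12, is_home=True)
long_range_shot = calibrate_xg(0.03, is_home=True)
```

Both example shots are scaled by the same factor, which is exactly the limitation the next paragraph raises.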

But it's a blunt instrument. Every home shot gets the same discount regardless of type, distance, or game state. What if XGBoost could learn *interaction effects* — headers from crosses converting differently at home vs away, set pieces behaving differently under crowd pressure, long-range shots varying by venue while close-range don't?

Adding is_home as a training feature would let the model discover these interactions automatically.

What We Found

Nothing. The feature adds near-zero predictive value across both models.

Model A (Universal — 312K shots, Tier 1+2 features)

Walk-Forward Fold	Brier (baseline)	Brier (with is_home)	Delta
train≤2021, test 2022	0.0793	0.0792	-0.0001
train≤2022, test 2023	0.0781	0.0788	+0.0007
train≤2023, test 2024	0.0782	0.0782	0.0000
Average	0.0785	0.0787	+0.0002

Model B (Enhanced — 87K StatsBomb shots, Tier 1+2+3 with freeze-frame)

Walk-Forward Fold	Brier (baseline)	Brier (with is_home)	Delta
train≤2021, test 2022	0.0771	0.0772	+0.0001
train≤2022, test 2023	0.0705	0.0705	0.0000
train≤2023, test 2024	0.0660	0.0658	-0.0002
Average	0.0712	0.0712	0.0000

The target for Model A was Brier < 0.077. The actual score with is_home was 0.0787, missing the target by 0.0017. The baseline (0.0785) misses it as well.
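For readers who want to reproduce the shape of this evaluation, here is a minimal sketch of a Brier score and an expanding-window walk-forward split, followed by the fold averaging for Model A. The helper names and the season list passed to them are hypothetical; only the per-fold scores are taken from the table above.

```python
from statistics import mean


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def walk_forward_folds(seasons, first_test, last_test):
    """Yield (train_seasons, test_season), training on all seasons before the test year."""
    for test_year in range(first_test, last_test + 1):
        train = [s for s in seasons if s < test_year]
        yield train, test_year


# Per-fold Brier scores for Model A, copied from the table above.
baseline = [0.0793, 0.0781, 0.0782]
with_is_home = [0.0792, 0.0788, 0.0782]
avg_delta = mean(with_is_home) - mean(baseline)
```

With these inputs `avg_delta` rounds to +0.0002, matching the Average row. The single-fold swing of +0.0007 in the 2023 test year is larger than the average effect, which is consistent with reading the delta as noise.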

Cross-Source Validation

Test	Brier	AUC	N
StatsBomb → Understat	0.08056	0.7569	221,769
Understat → StatsBomb	0.08185	0.7626	85,501
Understat's own xG	0.07291	0.8065	221,769

The Nuance

This isn't surprising in hindsight. The venue effect on shot conversion is *uniform* — home advantage manifests as more shots, better positions, and referee bias, not as a per-shot-type interaction. The flat calibration multiplier captures this correctly.

XGBoost *could* learn venue interactions if they existed, and with 312K shots across the walk-forward folds it had plenty of data to do so. It did not split on is_home in any meaningful way (the feature does not appear in the SHAP top-10), which suggests the interactions aren't there at the individual shot level.

The venue signal lives at the *match level* (home teams create more chances) and the *market level* (home advantage is priced into odds). At the shot level, a 15-yard shot from the center is equally likely to go in whether you're home or away.
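One way to probe the uniformity claim is to compare the calibration ratio (goals divided by summed xG) for home and away shots within each shot type. The sketch below is a hypothetical diagnostic, not the analysis used in this test; the tuple layout and function names are assumptions.

```python
from collections import defaultdict


def calibration_ratios(shots):
    """Return {(shot_type, venue): goals / summed xG} for each populated group.

    `shots` is an iterable of (shot_type, is_home, xg, is_goal) tuples.
    """
    goals = defaultdict(float)
    expected = defaultdict(float)
    for shot_type, is_home, xg, is_goal in shots:
        key = (shot_type, "home" if is_home else "away")
        goals[key] += is_goal
        expected[key] += xg
    return {k: goals[k] / expected[k] for k in expected if expected[k] > 0}


def venue_gap_by_type(ratios):
    """Home ratio minus away ratio for each shot type present at both venues."""
    types = {shot_type for shot_type, _ in ratios}
    return {
        t: ratios[(t, "home")] - ratios[(t, "away")]
        for t in types
        if (t, "home") in ratios and (t, "away") in ratios
    }
```

If the home-minus-away gap is roughly the same for every shot type, a flat multiplier is sufficient. Gaps that vary widely by type would point to an interaction that the flat correction misses.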

What This Means

`is_home` stays in the feature config: it's harmless (the average Brier change is within fold-to-fold noise) and available for future model versions
Post-hoc venue calibration stays active — it's not redundant, it's the correct mechanism
No deployment needed — the models were retrained but the delta is noise
The venue calibration table (Layer 2) remains the right place for this signal

What's Next

The xG model improvement path is elsewhere:

Better freeze-frame features (GK positioning, defender blocking geometry)
More training data from non-Big-5 leagues
Situation-specific sub-models (set pieces, fast breaks)
Venue calibration refined per league instead of globally
