Testing Multi-Source xG Disagreement: 5 Signals, 0 Survivors
We batch-tested 5 multi-source xG disagreement signals through the 10-gate approval pipeline. inter-model-disagreement (9/10 gates, +1.1pp marginal) is the most promising but fails bootstrap (p=0.24). layered-threshold-variance (+0.3pp) is too small to matter. overperformance-decomposition is broken by v3 data coverage. footystats-all-regression and finishingLuck show zero marginal ROI. Using v3 as a replacement xG source in the variance filter gives results identical to MI lambdas.
We built a v3 XGBoost xG model from 632K shots across 3 data sources and 22 leagues. Then we used it to create 5 signals based on disagreement between xG providers. The thesis: when multiple xG models disagree about a team's expected performance, that disagreement carries information about regression confidence, prediction uncertainty, or data quality. None of the 5 signals passed the 10-gate approval pipeline.
The Question
The variance filter is the single most important signal in our stack. It detects teams whose actual goals diverge from expected goals (xG), flagging regression candidates. Currently it uses MI Bivariate Poisson lambdas as "expected." We asked: does using multiple independent xG sources — our v3 model, FotMob's match-level xG, and FotMob's shot-level xG — improve regression detection?
Five signals tested this thesis from different angles:
- inter-model-disagreement — skip bets where v3 and FotMob disagree strongly (high uncertainty)
- layered-threshold-variance — require 2+ xG sources to independently flag regression
- overperformance-decomposition — decompose overperformance into "shot quality" vs "finishing luck"
- footystats-all-regression — use FootyStats xG for regression across all leagues
- finishingLuck — flag teams with high finishing luck component
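As a rough illustration of the first two signals, here is a minimal sketch. The `TeamXg` fields, the 0.5 disagreement gap, and the 3.0 goal threshold are assumptions made for the example, not the values used in the pipeline.

```python
from dataclasses import dataclass


@dataclass
class TeamXg:
    """Goals and xG totals over a window, one field per source (hypothetical layout)."""
    v3: float
    fotmob_match: float
    fotmob_shot: float
    actual_goals: float


def skip_for_disagreement(x: TeamXg, max_gap: float = 0.5) -> bool:
    """inter-model-disagreement: skip the bet when v3 and FotMob differ by more than max_gap."""
    return abs(x.v3 - x.fotmob_match) > max_gap


def layered_flag(x: TeamXg, threshold: float = 3.0, min_sources: int = 2) -> bool:
    """layered-threshold-variance: flag only if min_sources sources each show the gap."""
    gaps = [x.actual_goals - xg for xg in (x.v3, x.fotmob_match, x.fotmob_shot)]
    return sum(gap >= threshold for gap in gaps) >= min_sources


team = TeamXg(v3=12.0, fotmob_match=12.4, fotmob_shot=14.0, actual_goals=16.0)
print(skip_for_disagreement(team), layered_flag(team))
```

In this sketch the two ideas pull in different directions: the first drops bets when sources diverge, the second only acts when sources agree.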
What We Found
| Signal | Gates | Marginal ROI | p-value | Verdict |
|---|---|---|---|---|
| inter-model-disagreement | 9/10 | +1.1pp | 0.24 | Best of batch, not yet significant |
| layered-threshold-variance | 8/10 | +0.3pp | 0.39 | Real but too small |
| overperformance-decomposition | N/A | -0.02pp | N/A | No viable threshold |
| footystats-all-regression | 7/10 | +0.0pp | 0.50 | Zero marginal |
| finishingLuck | 7/10 | +0.0pp | 0.50 | Zero marginal |
Before signal testing, we also ran the v3 model as a direct replacement for MI lambdas in the variance filter. Result: effectively identical. CLV was +9.7% both ways, bet counts were 47,453 vs 47,393, and the ROI difference was +0.1pp (noise). The variance filter doesn't care whether its xG source is precise or noisy; it just needs gaps.
The Nuance
inter-model-disagreement is the one signal worth watching. It passed 9 of 10 gates, failing only bootstrap significance (p=0.24; the gate requires p<0.10). Its marginal entry-adjusted ROI of +1.1pp exceeds the practical significance threshold. OOS entry-adj ROI (+9.2%) actually beats IS (+6.8%), which is the opposite of overfitting. Walk-forward shows 10/13 folds positive, including all recent years (2019-2025). The signal just doesn't have enough v3 data (only 2023-2026) to reach statistical significance.
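For readers unfamiliar with the bootstrap gate, the idea can be sketched as follows. This is a simplified illustration, not the pipeline's actual test: it assumes per-bet returns are available as plain lists and resamples the baseline and filtered sets independently, and the function name and defaults are hypothetical.

```python
import random


def bootstrap_p_value(baseline, filtered, n_resamples=2000, seed=0):
    """Share of resamples in which the filtered set's mean return fails to beat the baseline's.

    baseline, filtered: per-bet returns in units (assumed input format).
    """
    rng = random.Random(seed)
    not_better = 0
    for _ in range(n_resamples):
        # Resample each set with replacement at its original size.
        b = rng.choices(baseline, k=len(baseline))
        f = rng.choices(filtered, k=len(filtered))
        if sum(f) / len(f) - sum(b) / len(b) <= 0:
            not_better += 1
    return not_better / n_resamples
```

Under this reading, p=0.24 would mean the filtered set failed to beat the baseline in roughly a quarter of resamples, which is too often to clear a 0.10 gate.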
overperformance-decomposition has a data coverage problem. The decomposition requires v3 xG per match, which only exists for 2023+. The backtest spans 2020-2025, so 60%+ of matches have no decomposition data and pass the filter unmodified. Standalone N was 133,784 at every threshold — the filter literally couldn't differentiate because the data wasn't there. The original +2.68pp marginal claim likely came from a test window that coincidentally aligned with v3 coverage.
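The failure mode is easy to reproduce in miniature. The sketch below assumes the filter lets a match through whenever its decomposition value is missing (represented here as `None`); the function names and thresholds are illustrative only.

```python
from typing import Optional, Sequence


def passes_filter(finishing_luck: Optional[float], threshold: float) -> bool:
    """Matches with no decomposition data pass unmodified."""
    if finishing_luck is None:
        return True
    return finishing_luck < threshold


def standalone_n(values: Sequence[Optional[float]], threshold: float) -> int:
    """Number of matches that survive the filter at a given threshold."""
    return sum(passes_filter(v, threshold) for v in values)


def coverage(values: Sequence[Optional[float]]) -> float:
    """Fraction of matches that actually have decomposition data."""
    return sum(v is not None for v in values) / len(values)


no_data = [None] * 10
# With no coverage, N is the same at every threshold.
print([standalone_n(no_data, t) for t in (0.5, 1.0, 2.0)], coverage(no_data))
```

A coverage check like this, run before the threshold sweep, would flag the problem before any ROI numbers are computed.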
footystats-all-regression and finishingLuck are effectively the same signal. They produce identical N (224,764) and identical marginal ROI (+0.0pp). Both are functional equivalents of the existing variance filter with a different xG source bolted on. The existing filter already captures the full regression effect.
What Didn't Work
The fundamental issue: the variance filter is already near-optimal for what it does. It detects teams overperforming their xG by 3+ goals over 10 matches. Adding more xG sources doesn't find substantially different regression candidates because all xG models agree on the big picture — a team scoring 2.0 goals/match on 1.2 xG/match is flagged by every source.
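The flagging rule described above can be sketched as a rolling sum. This is a minimal illustration that assumes per-match goals and xG arrive as two aligned lists; the function names are hypothetical, and the real filter may handle short histories differently.

```python
def overperformance(goals, xg, window=10):
    """Sum of (goals - xG) over the most recent `window` matches."""
    recent = list(zip(goals, xg))[-window:]
    return sum(g - x for g, x in recent)


def is_regression_candidate(goals, xg, window=10, min_gap=3.0):
    """Flag teams whose goals exceed xG by at least min_gap over the window."""
    return overperformance(goals, xg, window) >= min_gap


# The example from the text: 2.0 goals/match on 1.2 xG/match over 10 matches.
print(is_regression_candidate([2] * 10, [1.2] * 10))  # True: the gap is about 8 goals
```

A gap that large clears the 3-goal bar no matter which xG source supplies the 1.2, which is why extra sources rarely change who gets flagged.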
The information gain from multi-source disagreement is real but tiny (+0.3pp to +1.1pp marginal). At our current bet volume (~47K backtest bets), this translates to ~15-50 units — not enough to reach statistical significance against the noise.
What This Means
No deployment. The variance filter stays as-is with MI lambdas.
The v3 xG model is not wasted: it powers the xG inference engine, provides per-shot predictions for the xGOT model, and has an 89.4% regression rate on its own. But using it to sharpen the variance filter doesn't move the needle.
What's Next
inter-model-disagreement gets a second look after the 2026-27 season, when v3 data will cover 4+ full seasons instead of 3. If bootstrap reaches p<0.10 with more data, it graduates.
overperformance-decomposition gets revisited once v3 covers 5+ seasons. The concept (decomposing into shot quality vs finishing luck) is sound; the data just doesn't exist yet for a fair test.
Everything else is closed. The multi-source xG thesis has been thoroughly tested and the returns are diminishing.