The xG Calibration That Didn't Matter (And the Switch That Did)
Multi-feature calibration of FootyStats xG (corr 0.35→0.51) had zero impact on variance regression betting outcomes. Switching the variance filter from model lambdas to match-level xG, however, gave a +0.3-0.4% ROI lift. The real win: FS non-Big-5 shows +37.7u marginal. FootyStats xG creates value through finishing luck in soft markets, not through better regression detection.
The Question
Our non-Big-5 leagues use FootyStats xG, which correlates 0.35 with actual goals, versus 0.62 for the Understat xG we use for Big-5. We built a multi-feature calibration model (xG + shots on target + shots + possession) that closes about half of that gap (48% on average across four walk-forward folds; corr 0.35 to 0.51).
Two hypotheses:
- Better xG quality improves variance regression detection for non-Big-5 leagues
- Using match-level xG instead of model lambdas for regression detection improves all leagues
What We Found
Calibration: no benefit to betting outcomes. Despite a clear improvement in xG quality (RMSE 1.158 to 1.073), CLV was unchanged at +12.2% and ROI did not improve in any A/B comparison; it was 0.4-0.5 points lower with calibrated xG.
| Scenario | Model-proxy ROI | Real-xG ROI | CLV |
|---|---|---|---|
| Old (raw FootyStats) | -1.7% (n=8,806) | -1.3% (n=8,194) | +12.2% |
| New (calibrated) | -2.1% (n=8,860) | -1.8% (n=8,045) | +12.2% |
Real-xG variance switch: consistent +0.3-0.4% ROI lift. Using actual match-level xG instead of Dixon-Coles model lambdas for the regression filter improved outcomes in every test configuration.
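The lift can be read directly off the A/B table. The short sketch below only restates that arithmetic; the dictionary layout and the `roi_lift` helper are illustrative names, not part of the actual pipeline. Note that the sample sizes differ between columns, so this is a difference of aggregate ROIs rather than a paired comparison.

```python
# ROI figures (in percent) copied from the A/B table above.
roi = {
    "raw": {"model_proxy": -1.7, "real_xg": -1.3},
    "calibrated": {"model_proxy": -2.1, "real_xg": -1.8},
}


def roi_lift(row: dict) -> float:
    """Real-xG ROI minus model-proxy ROI, in percentage points."""
    return round(row["real_xg"] - row["model_proxy"], 1)


lifts = {name: roi_lift(row) for name, row in roi.items()}
print(lifts)
```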
The Nuance
The variance regression filter checks: "Has this team's actual goals diverged from expected goals by 3.0+ over their last 10 matches?" It's a coarse binary filter. Whether "expected goals" comes from raw FootyStats (corr 0.35) or calibrated (corr 0.51), the 3.0 threshold is so wide that both versions identify the same regression candidates.
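A minimal sketch of such a filter, to make the coarseness concrete. It assumes the divergence is the cumulative goals-minus-expected difference over the window and that the threshold applies in both directions; the function name and the sample numbers are hypothetical.

```python
from typing import Sequence


def is_regression_candidate(
    goals: Sequence[float],
    expected: Sequence[float],
    window: int = 10,
    threshold: float = 3.0,
) -> bool:
    """Flag a team whose goals diverge from expected goals by at least
    `threshold` (in either direction) over its last `window` matches."""
    if len(goals) != len(expected):
        raise ValueError("goals and expected must have the same length")
    divergence = sum(goals[-window:]) - sum(expected[-window:])
    return abs(divergence) >= threshold


# Made-up team: 20 goals in 10 matches.
goals = [2, 3, 1, 2, 2, 3, 1, 2, 2, 2]
raw_expected = [1.5] * 10         # sums to about 15, divergence about 5
calibrated_expected = [1.6] * 10  # sums to about 16, divergence about 4

print(is_regression_candidate(goals, raw_expected))
print(is_regression_candidate(goals, calibrated_expected))
```

In this made-up case the two expected-goals series differ by a full goal over the window, yet both land on the same side of the 3.0 threshold, which is the behaviour described above.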
The model-proxy vs real-xG switch matters more because it changes the *source type*:
- Model lambdas = "this team's season-average attack × opponent defense" (static rating)
- Match-level xG = "in this specific match, what chances did they create?" (granular reality)
Match-level xG captures game-specific context that season ratings miss. A team might have a 1.4 average expected goals but create 3.2 xG in a single match against a leaky defense. The model lambda wouldn't see that spike.
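The difference between the two sources can be sketched with the numbers from the example above. Everything else here is an assumption for illustration: the multiplicative form of `model_lambda`, the opponent defense factors, and the goal tallies are made up.

```python
from typing import Sequence


def model_lambda(attack: float, opponent_defense: float) -> float:
    """Static rating: season-average attack scaled by opponent defense."""
    return attack * opponent_defense


def rolling_divergence(
    goals: Sequence[float], expected: Sequence[float], window: int = 10
) -> float:
    """Cumulative goals minus expected goals over the last `window` matches."""
    return sum(goals[-window:]) - sum(expected[-window:])


# Nine average opponents (factor 1.0), then one leaky defense (factor 1.2).
defense_factors = [1.0] * 9 + [1.2]
lambdas = [model_lambda(1.4, d) for d in defense_factors]

# Match-level xG: typical output for nine matches, then a 3.2 xG spike.
match_xg = [1.4] * 9 + [3.2]

goals = [2, 2, 1, 2, 1, 2, 1, 2, 1, 4]  # 18 goals, 4 of them in the last match

div_lambda = rolling_divergence(goals, lambdas)
div_match = rolling_divergence(goals, match_xg)
print(round(div_lambda, 2), round(div_match, 2))
```

With a 3.0 threshold, the lambda-based divergence flags this team as overperforming while the match-level version does not, because the four-goal match was backed by 3.2 xG worth of chances.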
What Didn't Work
The multi-feature calibration model itself is solid engineering. The walk-forward shows consistent improvement:
| Fold | Raw RMSE | Calibrated RMSE | Gap Closed |
|---|---|---|---|
| 2021 | 1.176 | 1.072 | 50% |
| 2022 | 1.173 | 1.077 | 50% |
| 2023 | 1.146 | 1.088 | 40% |
| 2024 | 1.139 | 1.054 | 52% |
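For readers unfamiliar with the setup, a walk-forward split trains on earlier seasons and scores only the next unseen one. The sketch below shows the fold structure and the RMSE metric; it assumes a 2020 season exists as the initial training set, which is not stated above, and the helper names are hypothetical.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between predictions and actual goals."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def walk_forward_folds(
    seasons: Sequence[int], min_train: int = 1
) -> Iterator[Tuple[List[int], int]]:
    """Yield (training seasons, test season) pairs in chronological order."""
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


# Assumed season range: 2020 is a guess for the first training season.
folds = list(walk_forward_folds([2020, 2021, 2022, 2023, 2024]))
for train, test in folds:
    print(f"train on {train}, evaluate on {test}")
```

Each fold's raw and calibrated predictions would then be scored with `rmse` against actual goals for the test season only, so no fold sees its own future.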
Interesting: the trained model puts nearly zero weight on raw FootyStats xG (coefficient 0.002) and reconstructs expected goals primarily from shots on target (0.166) and total shots (0.057). FootyStats "xG" is essentially discarded in favor of shot counts.
But none of this precision transfers to the variance filter. The filter doesn't need a precise expected goals number — it needs to know "is this team over or underperforming?" Raw and calibrated xG agree on that question.
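To see how little the raw xG feature contributes, the reported coefficients can be plugged into a plain linear combination. This is a sketch under loud assumptions: the possession coefficient and intercept are not reported here, so they are placeholders set to 0.0, and the features are assumed to be in raw (unscaled) units, which may not match the trained model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationCoefficients:
    raw_xg: float = 0.002           # reported above
    shots_on_target: float = 0.166  # reported above
    total_shots: float = 0.057      # reported above
    possession: float = 0.0         # NOT reported; placeholder assumption
    intercept: float = 0.0          # NOT reported; placeholder assumption


def calibrated_xg(
    raw_xg: float,
    shots_on_target: float,
    total_shots: float,
    possession_pct: float,
    coef: CalibrationCoefficients = CalibrationCoefficients(),
) -> float:
    """Linear combination of match features (assumed raw units)."""
    return (
        coef.intercept
        + coef.raw_xg * raw_xg
        + coef.shots_on_target * shots_on_target
        + coef.total_shots * total_shots
        + coef.possession * possession_pct
    )


# Same shot profile, raw FootyStats xG doubled from 1.0 to 2.0.
low = calibrated_xg(1.0, shots_on_target=5, total_shots=12, possession_pct=50)
high = calibrated_xg(2.0, shots_on_target=5, total_shots=12, possession_pct=50)
print(round(high - low, 3))
```

Under these assumptions, doubling raw xG moves the output by 0.002 goals, while one extra shot on target moves it by 0.166.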
The Real Win: FS non-Big-5
The biggest finding came from a parallel experiment. The Finishing Luck signal, which measures team-level xG overperformance, shows dramatically different results by data source:
| Signal | Marginal Impact |
|---|---|
| FS non-Big-5 + skip Big-5 | **+37.7u** |
| FS All (FootyStats everywhere) | +15.9u |
| FS non-Big-5 (Understat Big-5) | +9.0u |
| Finishing Luck (current, Understat only) | +3.1u |
FootyStats xG creates massive value in non-Big-5 leagues through the finishing luck mechanism. The edge isn't xG precision — it's information asymmetry. Soft markets don't price in xG overperformance the way Big-5 sharp markets do.
Skipping Big-5 entirely for finishing luck (+37.7u, versus +9.0u when Big-5 bets are kept on Understat xG) suggests the signal's value is eroded by sharp-market efficiency. The information is correct (teams do regress to xG) but appears to be already priced in for Big-5.
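A rough sketch of the signal with the Big-5 skip applied. It assumes finishing luck is simply goals minus xG summed over a team's recent matches; the league labels, function names, and sample values are hypothetical, and the production signal may be defined differently.

```python
from typing import Sequence

# Hypothetical league labels; the identifiers used in the real pipeline
# are not given in this write-up.
BIG_5 = frozenset(
    {"Premier League", "La Liga", "Serie A", "Bundesliga", "Ligue 1"}
)


def finishing_luck(goals: Sequence[float], xg: Sequence[float]) -> float:
    """Team-level xG overperformance: total goals minus total xG."""
    if len(goals) != len(xg):
        raise ValueError("goals and xg must have the same length")
    return sum(goals) - sum(xg)


def signal_applies(league: str, skip_big5: bool = True) -> bool:
    """Return False for Big-5 leagues when the skip is enabled."""
    return not (skip_big5 and league in BIG_5)


luck = finishing_luck([2, 3, 1, 2], [1.1, 1.4, 0.9, 1.2])
print(round(luck, 1), signal_applies("Eredivisie"), signal_applies("Serie A"))
```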
What This Means
- Calibrated xG files: not deployed. Quality improvement is real but irrelevant at the variance filter layer.
- Match-xG variance switch: shadow deployed on `/gauntlet` as `matchXgVar`. Early results (67/72 data, +9.25u net impact) are directionally positive. Needs more volume before graduation.
- FS non-Big-5: the strongest signal from this investigation. +37.7u marginal with Big-5 skip. On the graduation track via gauntlet.
- Precision vs coverage: for soft markets, having xG data at all matters more than having perfect xG data. The training script now saves multi-feature coefficients for future use if a downstream application needs precision.
What's Next
- Monitor `matchXgVar` on gauntlet (791 bets to graduation at current pace)
- FS non-Big-5 is the priority deployment candidate
- Investigate whether calibrated xG helps the finishing luck signal specifically (hypothesis: better xG baseline = more accurate overperformance measurement)