Training an xG Model on 632K Shots: Does More Data Mean Better Regression Detection?
We retrained our XGBoost xG model on 632K combined shots (3 sources, 22 leagues). V3 hits an 89.4% regression rate (target: at least 82.1%), beating v1 (Big 5 only) by +1.3pp. FotMob raw xG still wins at 90.8%: noisier xG is better for regression detection. Advanced metrics (npxG, xGA, xG Chain, and for the Big 5, PPDA) are also computed.
We retrained our XGBoost xG model from scratch on 632K combined shots from three sources (StatsBomb, Understat, FotMob), up from 311K in the v1 model. The v3 model now covers 22 leagues instead of just the Big 5. The question: does a model that has seen Championship, Eredivisie, and Turkish Super League shots produce better regression predictions than one trained only on Big 5 leagues such as the EPL and La Liga?
The Question
Our variance filter uses team xGD (expected goal difference) to detect regression candidates — teams whose actual results have diverged from their expected performance. When the gap is large enough (|gap| >= 3.0 goals over 10 matches), we bet on regression.
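The flagging rule can be sketched as follows. This is a minimal illustration, not the production filter: the function name `flag_regression_candidate`, the per-match input lists, and the sign convention (gap = actual GD minus xGD) are assumptions. Only the 10-match window and the 3.0-goal threshold come from the text.

```python
from typing import Optional, Sequence

WINDOW = 10        # matches in the lookback window (from the text)
THRESHOLD = 3.0    # minimum |gap| in goals to raise a flag (from the text)


def flag_regression_candidate(
    actual_gd: Sequence[float],
    expected_gd: Sequence[float],
    window: int = WINDOW,
    threshold: float = THRESHOLD,
) -> Optional[str]:
    """Return 'over', 'under', or None for a team's most recent window.

    ASSUMED convention: gap = sum(actual GD) - sum(xGD) over the last
    `window` matches, so a positive gap means results are running ahead
    of expected performance.
    """
    if len(actual_gd) != len(expected_gd):
        raise ValueError("actual_gd and expected_gd must have the same length")
    if len(actual_gd) < window:
        return None  # not enough matches to evaluate the rule

    gap = sum(actual_gd[-window:]) - sum(expected_gd[-window:])
    if abs(gap) < threshold:
        return None
    return "over" if gap > 0 else "under"


# Made-up team: +10 actual GD against +5.0 xGD over ten matches.
print(flag_regression_candidate([2, 1, 1, 0, 2, 1, 1, 0, 1, 1], [0.5] * 10))
```

A team with fewer than 10 matches returns `None` here rather than being evaluated on a shorter window; that too is a choice made for the sketch.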
The regression rate — what percentage of flagged teams actually regress in the next 5 matches — is the core quality metric. Our validated benchmark is 82.1% from FotMob shot-level data. We needed the v3 model to meet or beat this.
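To make the metric concrete, here is a sketch of how a regression rate could be computed from a list of flags. The post does not spell out the exact test for "regress", so the definition below (the gap is smaller in magnitude after the next 5 matches than it was at flag time) is an assumption, as are the `Flag` record and the team names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Flag:
    team: str
    gap_before: float  # actual GD - xGD when the flag was raised
    gap_after: float   # the same gap re-measured after the next 5 matches


def regressed(flag: Flag) -> bool:
    # ASSUMPTION: "regress" means the gap shrank in magnitude.
    return abs(flag.gap_after) < abs(flag.gap_before)


def regression_rate(flags: Iterable[Flag]) -> float:
    """Share of flags that regressed, as a fraction in [0, 1]."""
    flags = list(flags)
    if not flags:
        raise ValueError("no flags to evaluate")
    return sum(regressed(f) for f in flags) / len(flags)


# Hypothetical flags, not project data.
flags = [
    Flag("Team A", 4.0, 1.5),
    Flag("Team B", -3.5, -4.0),  # gap widened, so this one did not regress
    Flag("Team C", -5.0, -2.0),
    Flag("Team D", 3.2, 0.4),
]
print(f"{regression_rate(flags):.1%}")
```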
We also wanted to settle a related question: does training on more diverse leagues actually help, or does the v1 model (Big 5 only) already capture the shot-to-goal relationship adequately?
What We Found
All three xG sources clear the 82.1% benchmark comfortably:
| Source | Regression Rate | Flags | Mean Gap Before | Mean Gap After |
|---|---|---|---|---|
| V3 Model (632K shots) | **89.4%** | 1,468 | 7.96 | 4.16 |
| V1 Model (Big 5 only) | 88.1% | 1,567 | 8.45 | 4.60 |
| FotMob Raw xG | **90.8%** | 1,440 | 7.92 | 3.95 |
V3 beats v1 by +1.3pp overall. The improvement is consistent across splits: +1.1pp on non-Big5 leagues and +1.6pp on Big 5.
The Nuance
V3 wins on tight calibration
The v3 model produces smaller initial gaps (7.96 vs 8.45) and smaller residual gaps after regression (4.16 vs 4.60). This means v3's xGD is closer to what actually happens — it flags fewer false regression candidates.
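The v1 model, trained only on Big 5 data, appears to miscalibrate slightly on non-Big5 shots. It generates 1,567 flags vs v3's 1,468: more flags, but a lower hit rate. The extra flags are most likely noise from out-of-distribution predictions, although v1 also trails v3 on Big 5 leagues (by 1.6pp), so that cannot be the whole explanation.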
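One way to read the two gap columns together is as a relative shrinkage, 1 - after/before. The snippet below only re-derives that ratio from the table values; the ratio is an illustration of ours, not a metric reported elsewhere in this post.

```python
# Values copied from the results table: (mean gap before, mean gap after).
table = {
    "V3 Model": (7.96, 4.16),
    "V1 Model": (8.45, 4.60),
    "FotMob Raw xG": (7.92, 3.95),
}


def shrinkage(before: float, after: float) -> float:
    """Fraction of the initial mean gap that has closed."""
    return 1.0 - after / before


for source, (before, after) in table.items():
    print(f"{source}: {shrinkage(before, after):.1%} of the mean gap closed")
```

On these numbers FotMob closes about 50.1% of its mean gap, v3 about 47.7%, and v1 about 45.6%, which is the same ordering as the regression rates.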
FotMob raw xG is still king for regression
The result that matters most: FotMob's own per-shot xG values (not our model's predictions) produce the highest regression rate at 90.8%.
This confirms what we established earlier in our xG regression investigation. For regression detection, you want an xG source that captures genuine divergence between shot quality and finishing outcomes. FotMob's model is "noisier" in the sense that it disagrees with actual goals more often — but that noise is signal for regression detection.
Our v3 model is optimized for per-shot accuracy (Brier score of 0.076). In principle, a perfectly calibrated model produces xGD that tracks actual GD closely and so generates fewer regression flags. That's great for xG accuracy but counterproductive for catching regression candidates.
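For reference, the Brier score is the mean squared difference between each shot's predicted goal probability and its 0/1 outcome, so lower is better. A minimal sketch with made-up shots (the probabilities and outcomes below are illustrative, not project data):

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean of (p - y)^2 over all shots; y is 1 for a goal, 0 otherwise."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Illustrative shots only: xG per shot and whether it was scored.
xg = [0.05, 0.10, 0.30, 0.76]
goal = [0, 0, 1, 1]
print(round(brier_score(xg, goal), 4))
```

A constant forecast equal to the overall goal rate p scores p(1 - p), so a Brier score is best judged against that baseline for the shot population in question.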
Per-league breakdown reveals where v3 shines
V3's biggest wins over FotMob raw:
- Eredivisie: +14.3pp (100% vs 85.7%, small N=15)
- Turkish Super: +4.3pp (100% vs 95.7%)
- Bundesliga: +3.0pp (90.0% vs 87.0%)
- Ligue 1: +2.5pp (93.5% vs 91.0%)
V3's biggest losses:
- Scottish Prem: -7.9pp (78.1% vs 86.0%) — weakest league for the model
- Segunda: -6.5pp (79.4% vs 85.9%)
The Scottish Prem result is interesting: v3 underperforms both v1 and FotMob there. This might indicate that Celtic/Rangers dominance creates a distribution the model handles poorly, or that the league's shot profile differs enough to cause miscalibration.
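The "small N" caveat on the Eredivisie figure can be made concrete with a Wilson score interval. The sketch below assumes that N=15 is the number of v3 flags in that league and that all 15 regressed (which is what 100% implies). The 95% level is chosen purely for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# ASSUMPTION: 15 flags, all of which regressed.
low, high = wilson_interval(15, 15)
print(f"15/15 -> [{low:.3f}, {high:.3f}]")
```

Under that assumption the interval runs from roughly 0.80 to 1.00, so FotMob's 85.7% sits inside it and the +14.3pp edge should be read with caution.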
Over vs under-performers
Under-performing teams (results worse than xG) regress more reliably (90.2%) than over-performers (88.6%). This makes mechanical sense: it's harder to sustain bad finishing than good finishing. FotMob is the most symmetric (91.0% over, 90.7% under).
What Didn't Work
We initially tried scoring all 320K FotMob shots through the saved v3 model using a pure Python tree traversal (reading the JSON tree structure node by node). This was prohibitively slow — each shot needed 500 tree traversals of depth 6.
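To see where the time goes, here is a sketch of what scoring a shot by hand-walking dumped trees looks like. The node layout mirrors the nested JSON that XGBoost's `get_dump(dump_format="json")` emits (`split`, `split_condition`, `yes`, `no`, `missing`, `children`, `leaf`); the saved model file we actually read may be laid out differently. The two toy trees, the feature names, and the zero base margin are made up for illustration.

```python
import math
from typing import Any, Dict, List

Node = Dict[str, Any]


def walk_tree(node: Node, shot: Dict[str, float]) -> float:
    """Follow one tree from root to leaf and return the leaf value."""
    while "leaf" not in node:
        value = shot.get(node["split"])
        if value is None:
            next_id = node["missing"]          # feature absent for this shot
        elif value < node["split_condition"]:
            next_id = node["yes"]              # "yes" branch: value < threshold
        else:
            next_id = node["no"]
        node = next(c for c in node["children"] if c["nodeid"] == next_id)
    return node["leaf"]


def predict_xg(trees: List[Node], shot: Dict[str, float], base_margin: float = 0.0) -> float:
    """Sum leaf values over all trees, then apply the logistic function."""
    margin = base_margin + sum(walk_tree(t, shot) for t in trees)
    return 1.0 / (1.0 + math.exp(-margin))


# Two toy trees with hypothetical features (distance in metres, angle in radians).
trees = [
    {"nodeid": 0, "split": "distance", "split_condition": 12.0,
     "yes": 1, "no": 2, "missing": 1,
     "children": [{"nodeid": 1, "leaf": 0.4}, {"nodeid": 2, "leaf": -0.6}]},
    {"nodeid": 0, "split": "angle", "split_condition": 0.5,
     "yes": 1, "no": 2, "missing": 2,
     "children": [{"nodeid": 1, "leaf": -0.3}, {"nodeid": 2, "leaf": 0.2}]},
]
xg = predict_xg(trees, {"distance": 8.0, "angle": 0.9})

# Work implied by the numbers in the text: shots x trees x depth.
node_visits = 320_000 * 500 * 6
print(f"toy xG = {xg:.3f}, node visits for the full job = {node_visits:,}")
```

With 500 trees of depth 6, scoring 320K shots means up to 960 million node visits executed in interpreted Python.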
The fix: retrain the model from scratch with the XGBoost library itself, through its scikit-learn interface (random_state=42 for reproducibility), and score with predict_proba(). Total scoring time: 0.2 seconds, versus an estimated 30+ minutes for the Python tree traversal.
What This Means
- V3 model is validated. The 89.4% regression rate across 22 leagues confirms the model generalizes well. It's ready for production use in the per-shot xG pipeline.
- Don't swap the variance filter's xG source. FotMob raw xG remains the best source for regression detection (90.8%). Our model is better at per-shot accuracy, which is valuable for other applications (team-level npxG, shot quality analysis, player finishing multipliers).
- Advanced metrics are computed. The retrain also generated comprehensive team-level metrics: npxG, xGA, xG/shot across all 22 leagues, plus PPDA and deep completions for Big 5 (2020-2023 from Understat cache).
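As a rough illustration of how the team-level numbers roll up from shot data: npxG sums xG over a team's non-penalty shots, xGA sums the xG of the shots it faced, and xG/shot divides total xG by shot count. The `Shot` record, its field names, and the team labels are hypothetical, since the pipeline's schema is not shown here. Including penalties in xG/shot is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Shot:
    team: str          # team taking the shot
    opponent: str      # team conceding the shot
    xg: float          # per-shot expected goals
    is_penalty: bool


def team_metrics(shots: Iterable[Shot], team: str) -> Dict[str, float]:
    """Aggregate shot-level xG into npxG, xGA and xG/shot for one team."""
    shots = list(shots)
    taken = [s for s in shots if s.team == team]
    faced = [s for s in shots if s.opponent == team]
    total_xg = sum(s.xg for s in taken)
    return {
        "npxG": sum(s.xg for s in taken if not s.is_penalty),
        "xGA": sum(s.xg for s in faced),
        # ASSUMPTION: penalties count toward xG/shot in this sketch.
        "xG_per_shot": total_xg / len(taken) if taken else 0.0,
    }


# Made-up shots for a single match.
shots = [
    Shot("Home", "Away", 0.10, False),
    Shot("Home", "Away", 0.76, True),
    Shot("Home", "Away", 0.30, False),
    Shot("Away", "Home", 0.25, False),
]
print(team_metrics(shots, "Home"))
```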
What's Next
- A/B test v3 model xG in the multi-source pipeline — replace the aggregate xG for non-Big5 leagues with v3 shot-level xG where FotMob shots are available
- Investigate Scottish Prem underperformance — the model's weakest league (78.1%) deserves diagnosis
- PPDA data gap — the Understat cache for 2024+ uses a simpler format without PPDA. Re-fetching with the detailed format would extend pressing data coverage