Shot-Level xG for Variance Regression: More Data, Same Nothing
Re-tested shot-level xG variance signal with 3.5x more data (3,489 matches, 3-4 seasons). Marginal ROI unchanged at +0.1% (bootstrap p=0.47). Shot-level and match-level xG produce equivalent regression candidates at 10-match window granularity. Hypothesis falsified: data volume was not the bottleneck.
We hypothesized that FotMob shot-level xG would produce better regression candidates than match-level xG for non-Big-5 leagues. The first test showed a marginal ROI of +0.1% with only one season of shot data. We backfilled to 3-4 seasons (3,489 matches, 59 league files) and re-ran the full pipeline.
Result: still +0.1%. The hypothesis is dead.
The Question
Our variance regression filter flags teams whose actual goals deviate from expected goals over a 10-match window. The "expected goals" input comes from either:
- Match-level xG (FootyStats aggregate per match)
- Shot-level xG (FotMob per-shot coordinates, summed to match level)
The theory: shot-level xG captures shot *quality* independent of finishing luck. Match-level xG bakes in finishing outcomes (shots on target correlate with goals), making the gap between actual and expected less predictive of regression. In offline validation, shot-level xG identified regression candidates with 82.1% accuracy vs 75.6% for match-level.
First test (1 season, ~1K matches): marginal ROI +0.1%, bootstrap p=0.43. We attributed this to shallow team histories — only 10-15 matches per team. The fix: backfill historical data so teams have 30-40 match histories.
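For concreteness, here is a minimal sketch of the filter described above. The 10-match window and the 3.0 gap threshold come from this post. The names `MatchRecord`, `shot_xg_to_match`, and `regression_flag`, the threshold's unit (goals), and the symmetric handling of over- and underperformance are assumptions for illustration, not the production code.

```python
from dataclasses import dataclass
from typing import List, Optional

WINDOW = 10          # sliding window length in matches (from the post)
GAP_THRESHOLD = 3.0  # gap threshold from the post; unit (goals) is assumed


@dataclass
class MatchRecord:
    goals_for: int
    xg_for: float  # match-level xG, or shot-level xG summed to match level


def shot_xg_to_match(shot_xgs: List[float]) -> float:
    """Collapse per-shot xG values into one match-level expectation."""
    return sum(shot_xgs)


def regression_flag(history: List[MatchRecord]) -> Optional[str]:
    """Flag a team whose goals deviate from xG over the last WINDOW matches.

    Returns "overperforming", "underperforming", or None (no flag or
    not enough history).
    """
    if len(history) < WINDOW:
        return None
    recent = history[-WINDOW:]
    gap = sum(m.goals_for for m in recent) - sum(m.xg_for for m in recent)
    if gap >= GAP_THRESHOLD:
        return "overperforming"   # candidate to regress downward
    if gap <= -GAP_THRESHOLD:
        return "underperforming"  # candidate to regress upward
    return None
```

Under these assumptions, a team scoring 2 goals per match on 1.5 xG per match has a 10-match gap of 5.0 and is flagged as overperforming. Swapping the xG source only changes the `xg_for` values fed into the same window sum.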
What We Found
| Metric | Previous (1 season) | New (3-4 seasons) |
|---|---|---|
| Shot data matches | ~1,000 | 3,489 |
| Base entry-adj ROI | +7.8% | +7.8% |
| Shot entry-adj ROI | +7.9% | +7.9% |
| Marginal entry-adj ROI | +0.1% | +0.1% |
| Bootstrap p | 0.43 | 0.47 |
| Bootstrap CI | — | [-1.5%, +1.6%] |
| Walk-forward folds | 12/12 | 12/12 |
| Gates passed | 8/10 | 7/10 |
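The marginal is identical. The bootstrap p actually *worsened* slightly (0.43 to 0.47), and the signal passed one fewer gate (7/10, down from 8/10). The confidence interval of [-1.5%, +1.6%] sits almost symmetrically around zero.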
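To make the bootstrap rows in the table easier to read, here is a sketch of one way such numbers can be produced. It is an assumption, not the pipeline's actual procedure: the post does not say how bets are resampled or how p is defined. This version resamples each portfolio's per-bet returns independently, uses a one-sided p (the share of resamples where the shot portfolio is not better), and takes a 95% percentile interval. The function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def roi(returns: Sequence[float]) -> float:
    """Mean profit per unit staked."""
    return sum(returns) / len(returns)


def bootstrap_marginal_roi(
    base: List[float],
    shot: List[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, Tuple[float, float]]:
    """Return (observed marginal ROI, one-sided p, 95% percentile CI)."""
    rng = random.Random(seed)
    observed = roi(shot) - roi(base)
    deltas = []
    for _ in range(n_resamples):
        # Resample each portfolio with replacement at its original size.
        b = rng.choices(base, k=len(base))
        s = rng.choices(shot, k=len(shot))
        deltas.append(roi(s) - roi(b))
    deltas.sort()
    # Share of resamples in which the shot portfolio fails to beat base.
    p_value = sum(d <= 0 for d in deltas) / n_resamples
    lo = deltas[n_resamples * 25 // 1000]
    hi = deltas[n_resamples * 975 // 1000 - 1]
    return observed, p_value, (lo, hi)
```

With this definition, a p near 0.5 and an interval centered on zero are what two portfolios of equal quality would produce, which matches the pattern in the table.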
The Nuance
Shot-level xG does change which bets get made — 613 added, 1,675 removed. But neither group is systematically better or worse:
| Bet Group | N | CLV | Entry-adj ROI |
|---|---|---|---|
| Added (shot-only finds) | 613 | +11.7% | +3.8% |
| Removed (base-only finds) | 1,675 | +11.9% | +5.6% |
| Shared (both agree) | 14,492 | — | — |
The removed bets actually have better entry-adj ROI than the added ones (+5.6% vs +3.8%), with near-identical CLV (+11.9% vs +11.7%). Shot-level xG isn't finding *better* regression candidates; it's finding *different* ones of roughly equal quality.
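The three bet groups in the table are a plain set partition of the two portfolios. A small sketch, assuming each bet has a hashable identifier (the `partition_bets` helper and the integer ids are hypothetical):

```python
from typing import Dict, Hashable, Set


def partition_bets(
    base_bets: Set[Hashable], shot_bets: Set[Hashable]
) -> Dict[str, Set[Hashable]]:
    """Split two bet portfolios into added, removed, and shared groups."""
    return {
        "added": shot_bets - base_bets,    # shot-only finds
        "removed": base_bets - shot_bets,  # base-only finds
        "shared": base_bets & shot_bets,   # both filters agree
    }


# Toy ids standing in for real bet keys.
groups = partition_bets({1, 2, 3}, {2, 3, 4})
print({name: sorted(ids) for name, ids in groups.items()})
```

Applied to the counts in the table, and assuming the three groups partition the two portfolios, the base filter placed 14,492 + 1,675 = 16,167 bets and the shot-level filter 14,492 + 613 = 15,105. Only 2,288 of the 16,780 distinct bets (about 14%) differ between the two, which limits how far the portfolio ROI can move.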
Per-league breakdown
A few leagues show real movement, but the direction is mixed:
| League | Delta ROI | Direction |
|---|---|---|
| Brazil A | +2.4% | Better |
| Scottish Prem | +1.5% | Better |
| Greek Super | +1.0% | Better |
| National League | +0.9% | Better |
| UCL | -1.4% | Worse |
| EPL | -1.1% | Worse |
| Ligue 1 | -1.1% | Worse |
| League Two | -0.8% | Worse |
No systematic pattern. The winners and losers roughly cancel out.
By season
| Season | Delta ROI |
|---|---|
| 2022 (calendar) | +4.2% |
| 2015-16 | +1.8% |
| 2025 (calendar) | +1.8% |
| 2024 (calendar) | -2.3% |
| 2017-18 | -0.8% |
Also mixed, with no trend over time.
Why It Didn't Work
The mechanism sounded right but the implementation couldn't capture it. Three reasons:
- 10-match window is too coarse. The variance filter uses a sliding window of 10 matches. At this granularity, the difference between shot-level and match-level xG averages out. Both produce similar 10-match expectedGF sums. The theoretical advantage of shot-level xG (removing finishing luck) only matters if finishing luck is persistent within a 10-match window — and at the team level, it mostly isn't.
- Match-level xG already captures enough. FootyStats match-level xG has a correlation of 0.35 with goals, which is "noisy" compared to shot-level. But for regression detection, what matters is the *gap* between expected and actual. A noisier xG estimate means bigger gaps, which means more aggressive regression flags. The noise isn't random: it's correlated with the factors (shot quality, finishing variance) that drive regression.
- The 82.1% vs 75.6% offline validation didn't translate. The offline test measured regression *accuracy* (did the flagged team regress?). The backtest measures *profitability* (did betting on that regression make money?). Accuracy and profitability are different — a more accurate filter that removes profitable bets while adding unprofitable ones is worse for the portfolio even if its regression predictions are more correct.
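The third point, accuracy versus profitability, is easy to see with a toy example. The numbers below are invented for illustration and are not from the backtest. The sketch also assumes a bet wins exactly when the flagged team regresses, which is a simplification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bet:
    regressed: bool  # did the flagged team regress as predicted?
    odds: float      # decimal odds taken (hypothetical values)


def accuracy(bets: List[Bet]) -> float:
    """Share of flagged bets where the predicted regression happened."""
    return sum(b.regressed for b in bets) / len(bets)


def roi(bets: List[Bet]) -> float:
    """Profit per unit staked, assuming one unit per bet."""
    profit = sum((b.odds - 1.0) if b.regressed else -1.0 for b in bets)
    return profit / len(bets)


# Filter A: right 8 times out of 10, but only at short odds.
filter_a = [Bet(True, 1.20)] * 8 + [Bet(False, 1.20)] * 2
# Filter B: right 6 times out of 10, at longer odds.
filter_b = [Bet(True, 1.80)] * 6 + [Bet(False, 1.80)] * 4

for name, bets in (("A", filter_a), ("B", filter_b)):
    print(f"filter {name}: accuracy={accuracy(bets):.0%} roi={roi(bets):+.1%}")
```

Filter A is more accurate (80% vs 60%) yet loses money (about -4% ROI), while filter B is profitable (about +8%). The price at which a correct prediction is bought matters as much as how often the prediction is correct.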
What This Means
The fotmob-shot-xg-variance signal is parked permanently. We tested it twice:
- Once with 1 season of shot data → +0.1% marginal
- Once with 3-4 seasons of shot data → +0.1% marginal
The data volume hypothesis is falsified. There's no iteration path that changes the fundamental problem: at the 10-match window granularity, shot-level and match-level xG are functionally equivalent for regression detection.
The shot data itself remains valuable for other purposes (finishing persistence research, set piece analysis, xG model training). But as a variance regression input, it doesn't move the needle.
What's Next
Nothing for this signal. It joins the graveyard.
The broader lesson: higher-precision inputs don't automatically produce better outputs when the downstream filter is coarse. The variance regression filter's 10-match window and 3.0 gap threshold are too blunt to benefit from shot-level granularity. If we ever revisit variance regression methodology (continuous scoring instead of binary threshold, match-weighted windows, team-specific baselines), shot-level xG could be re-evaluated — but that's a different signal entirely.