When xG Models Disagree, We Should Pay Attention

We pointed four independent xG models at 3,282 matches. Multi-source consensus produced a 91.2% regression rate. The strongest signal (91.5%) came from decomposing overperformance into shot quality and finishing luck: when both are high, regression is near-certain.

We have four xG sources that see football differently. FotMob's match-level model guesses from shot locations. Their shot-level model adds body part and angle. Our v3 XGBoost model uses 69 features including assist type, game state, and defensive positioning. Understat does something proprietary for the Big 5.

Most betting systems pick one xG source and build everything on it. We decided to point all four at the same matches and see what happens when they disagree.

The Setup

We built a Python pipeline that runs all four models on the same 3,282 matches across 26 leagues. For each match, we get four independent xG estimates per team. We can then ask: when a team's actual goals exceed ALL four estimates, is that a stronger regression signal than when only one model flags it?
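The "exceeds all four estimates" question can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the `TeamMatch` class, the source keys, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TeamMatch:
    """One team's result in one match, with an xG estimate from each source."""
    goals: int
    xg_by_source: dict[str, float]  # hypothetical keys, one per xG source


def sources_flagging(row: TeamMatch) -> list[str]:
    """Return the sources whose xG estimate the team's actual goals exceeded."""
    return [name for name, xg in row.xg_by_source.items() if row.goals > xg]


def exceeds_all(row: TeamMatch) -> bool:
    """True only when actual goals beat every available estimate."""
    return len(sources_flagging(row)) == len(row.xg_by_source)


# Made-up numbers, purely for illustration.
row = TeamMatch(goals=3, xg_by_source={
    "fotmob_match": 1.4, "fotmob_shot": 1.7, "v3": 2.2, "understat": 1.9,
})
print(sources_flagging(row), exceeds_all(row))
```

Counting flags per source, rather than averaging the estimates first, keeps the "one model flags" and "all models flag" cases distinguishable.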

The answer turned out to be yes — but the most interesting finding came from a direction we didn't expect.

What We Tested

Signal 1: Multi-Source Regression Confidence. Weight the models by noisiness (FotMob match gets the highest weight, because a noisier model is a better regression detector) and combine them into a single confidence score. Split teams into quintiles. Do high-confidence regression candidates actually regress more?

Signal 3: v3 Model Residual. Our 69-feature model is the most precise. When even it can't explain a team's scoring, that should be the purest luck signal. Use |v3 residual| >= 2.0 as a stricter filter.

Signal 5: Overperformance Decomposition. Break a team's overperformance into two parts: shot quality (what the precise model sees that the noisy model doesn't) and finishing luck (what even the precise model can't explain). Does the ratio predict how much a team regresses?
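The decomposition in Signal 5 might look like the sketch below. It assumes shot quality is the v3 xG total minus the FotMob match-level xG total over a window, and finishing luck is actual goals minus the v3 xG total. The function name and the sample numbers are hypothetical.

```python
def decompose(goals, v3_xg, fotmob_xg):
    """Split a window's overperformance into shot quality and finishing luck.

    Each argument is a list of per-match values for one team over the same window.
    """
    total_goals, total_v3, total_fotmob = sum(goals), sum(v3_xg), sum(fotmob_xg)
    shot_quality = total_v3 - total_fotmob    # what the precise model sees that the noisy one doesn't
    finishing_luck = total_goals - total_v3   # what even the precise model can't explain
    overperformance = total_goals - total_fotmob
    # Share of overperformance attributed to luck; undefined when there is none.
    luck_share = finishing_luck / overperformance if overperformance else None
    return {
        "shot_quality": shot_quality,
        "finishing_luck": finishing_luck,
        "overperformance": overperformance,
        "luck_share": luck_share,
    }


# Made-up three-match window, purely for illustration.
result = decompose(goals=[2, 1, 3], v3_xg=[1.5, 1.0, 1.5], fotmob_xg=[1.0, 1.0, 1.0])
print(result)
```

By construction the two components add up to the total overperformance against the noisy model, so the ratio question is just asking how that total is split.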

The Results

Multi-source consensus works — but the effect is small

Among teams overperforming their xG, split by how many models agree:

| Agreement Level | Regression Rate |
| --- | --- |
| Low (1 model flags) | 72.0% |
| Medium | 75.7% |
| High (all models agree) | **91.2%** |

That's a 19.2pp spread. When all four models agree a team is overperforming, regression is near-certain.
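The post does not spell out how the Signal 1 confidence score is built, so the following is only a sketch of one plausible shape: a weighted average of each source's goals-minus-xG gap, followed by a quintile split. The weights are hypothetical (the only stated fact is that FotMob match gets the highest weight), as are the function names.

```python
import statistics

# Hypothetical weights: the post only says FotMob match gets the highest weight.
WEIGHTS = {"fotmob_match": 0.40, "fotmob_shot": 0.25, "understat": 0.20, "v3": 0.15}


def regression_confidence(goals, xg_by_source, weights=WEIGHTS):
    """Weighted average of (goals - xG) over whichever sources cover the match."""
    total_weight = sum(weights[name] for name in xg_by_source)
    weighted_gap = sum(weights[name] * (goals - xg) for name, xg in xg_by_source.items())
    return weighted_gap / total_weight


def quintile_labels(scores):
    """Label each score 1 (lowest) to 5 (highest) using the sample's quintile cut points."""
    cuts = statistics.quantiles(scores, n=5)  # four cut points
    return [1 + sum(score > cut for cut in cuts) for score in scores]


print(regression_confidence(3, {"fotmob_match": 1.0, "v3": 2.0}))
print(quintile_labels(list(range(10))))
```

Normalizing by the weights of the sources actually present lets the score work for leagues outside the Big 5, where there is no Understat estimate.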

Through the formal 10-gate approval process, this signal passed 8 of 10 gates. The marginal ROI improvement was real: it was positive in every walk-forward fold from 2014 to 2026, 13 out of 13. But at +0.3pp against the +0.5pp deployment threshold, it is too small to justify as a binary filter. It may have more value as a sizing signal: bet bigger when all models agree.
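To make the "consistent but too small" outcome concrete, here is a hedged sketch of what an ROI gate of this kind could check. The actual gate definitions are not given in this post. The function name, the choice to compare the mean fold delta against the threshold, and the thirteen fold values are all assumptions.

```python
def roi_gate(fold_deltas_pp, threshold_pp=0.5):
    """Summarize walk-forward ROI deltas (in percentage points) against a threshold."""
    mean_delta = sum(fold_deltas_pp) / len(fold_deltas_pp)
    return {
        "all_folds_positive": all(delta > 0 for delta in fold_deltas_pp),
        "mean_delta_pp": mean_delta,
        "clears_threshold": mean_delta >= threshold_pp,
    }


# Thirteen made-up fold values that happen to average 0.3pp.
folds = [0.1, 0.5, 0.3, 0.2, 0.4, 0.3, 0.3, 0.2, 0.4, 0.3, 0.1, 0.5, 0.3]
summary = roi_gate(folds)
print(summary)
```

A signal can be positive in every fold and still fail on magnitude, which is the situation described above.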

The precise model is worse for regression detection

This confirmed something we'd already suspected but hadn't tested with our own model:

| Filter | Regression Rate |
| --- | --- |
| FotMob match \|gap\| >= 3.0 | **79.9%** |
| v3 model \|gap\| >= 2.0 | 74.9% |

Our 69-feature model is 5pp worse at catching regression than FotMob's simple match-level estimate. Why? Because a better model explains more of the variance, leaving a smaller, less predictive residual. The "noise" in a simpler model IS the regression signal. This is counterintuitive but now confirmed with our own data.
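For readers who want to reproduce a regression rate for a given filter, a sketch follows. The post does not define "regression" precisely, so this assumes a flagged team-window counts as regressed when the absolute goals-minus-xG gap in the following window is smaller than in the flagged window. The `Window` class and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Window:
    gap: float       # goals minus xG over the flagged window
    next_gap: float  # goals minus xG over the following window


def regression_rate(windows, threshold):
    """Share of windows with |gap| >= threshold whose gap shrinks in the next window."""
    flagged = [w for w in windows if abs(w.gap) >= threshold]
    if not flagged:
        return None  # no window passes the filter
    regressed = sum(1 for w in flagged if abs(w.next_gap) < abs(w.gap))
    return regressed / len(flagged)


# Made-up windows, purely for illustration.
sample = [Window(3.5, 1.0), Window(3.2, 3.6), Window(1.0, 0.2), Window(-3.1, -0.5)]
print(regression_rate(sample, threshold=3.0))
```

Running the same function with each model's gap and threshold gives directly comparable rates, provided the definition of regression is held fixed.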

The breakthrough: both components high

The decomposition hypothesis partially failed. The fraction of overperformance explained by finishing luck (vs shot quality) had essentially no correlation with regression magnitude (r = 0.012). The ratio doesn't matter.
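The r value here is presumably a Pearson correlation between the luck fraction and the size of the subsequent regression, although the post does not name the statistic. A minimal, dependency-free sketch of that calculation, with made-up inputs:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical luck fractions and regression magnitudes (goals per game).
luck_fraction = [0.2, 0.4, 0.6, 0.8]
regression_gpg = [0.7, 0.5, 0.5, 0.7]
print(pearson_r(luck_fraction, regression_gpg))
```

An r near zero only rules out a linear relationship with the ratio. It says nothing about thresholds on the absolute components.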

But when we tested absolute thresholds — requiring BOTH components to be high — we found the strongest regression signal in this entire research:

| Group | Regression Rate | Avg Regression |
| --- | --- | --- |
| Shot quality > 1.0 AND finishing luck > 1.0 | **91.5%** | 0.89 gpg |
| Finishing luck only (> 1.5, quality < 0.5) | 68.8% | 0.55 gpg |
| Quality only (> 1.5, luck < 0.5) | 62.0% | 0.52 gpg |

91.5% regression rate with 0.89 goals per match of regression. That's from 270 team-windows across all leagues.

Why "Both High" Works

Think about what it means when both components are high simultaneously:

Shot quality > 1.0 means our precise v3 model sees more than 1 goal of extra expected output over 10 matches compared with the noisy FotMob model. The team is creating genuinely better chances than the simple location-based estimate would suggest, through buildup play, assist patterns, and shot selection.

Finishing luck > 1.0 means the team is scoring more than 1 goal above what even our precise model predicts, after accounting for all 69 features. This is pure finishing luck: deflections, goalkeeper errors, impossible angles going in.

When both are true, the team is riding a double wave: real tactical improvement that shows up in shot quality metrics, PLUS unsustainable finishing luck on top of that real improvement. The total overperformance is extreme (2+ goals over 10 matches minimum), and neither the noise nor the signal in isolation explains it. That's why it regresses so reliably.
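The "2+ goals" figure follows from the way the two components add up. Writing $G$ for goals scored over the window, and $xG_{\text{v3}}$ and $xG_{\text{FotMob}}$ for the two models' totals (symbols introduced here for illustration, not taken from the spec):

```latex
\underbrace{G - xG_{\text{FotMob}}}_{\text{total overperformance}}
  = \underbrace{\left(xG_{\text{v3}} - xG_{\text{FotMob}}\right)}_{\text{shot quality}}
  + \underbrace{\left(G - xG_{\text{v3}}\right)}_{\text{finishing luck}}
\qquad\Longrightarrow\qquad
\text{shot quality} > 1 \ \text{and}\ \text{finishing luck} > 1
  \;\Rightarrow\; G - xG_{\text{FotMob}} > 2
```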

The teams currently in this state include PSV Eindhoven (SQ 1.48, FL 2.30), Barcelona (SQ 1.28, FL 2.66), and Montpellier (SQ 2.13, FL 1.35). These are teams on hot streaks where both the process and the outcomes have been abnormally good.
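The grouping can be sketched as a small classifier using the thresholds from the results table. The function name and group labels are hypothetical. The team values are the SQ and FL figures quoted above.

```python
def classify(shot_quality, finishing_luck):
    """Assign a team-window to one of the decomposition groups from the results table."""
    if shot_quality > 1.0 and finishing_luck > 1.0:
        return "both_high"
    if finishing_luck > 1.5 and shot_quality < 0.5:
        return "luck_only"
    if shot_quality > 1.5 and finishing_luck < 0.5:
        return "quality_only"
    return "other"  # windows that fit none of the three reported groups


teams = {
    "PSV Eindhoven": (1.48, 2.30),
    "Barcelona": (1.28, 2.66),
    "Montpellier": (2.13, 1.35),
}
groups = {name: classify(sq, fl) for name, (sq, fl) in teams.items()}
print(groups)
```

All three teams land in the both-high group, since each clears 1.0 on both components.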

What's Next

The both-high decomposition signal is fully wired into the evaluation pipeline and waiting for the formal approval gate to run. The data infrastructure is built — 61,000+ matches with multi-source xG data flowing through the TypeScript stack.

The deeper implication: we now have a framework for turning model disagreement into signal. The gap between any two models is information about what one model sees that the other doesn't. We've only scratched the surface of what four independent xG architectures can tell us when we listen to all of them at once.

Full spec: [multi-source-xg-results-spec.md](/docs/specs/multi-source-xg-results-spec.md)