320K Shots Across 20 Leagues: The FotMob Page Scraping Breakthrough
FotMob pages embed shot x,y data in `__NEXT_DATA__` for almost every league we checked. We scraped 320K shots across 20 leagues and 3-4 seasons. Shot-level xG achieves an 82.1% regression rate (matching Understat on the Big 5). A server cron runs daily. The data quality gap for non-Big-5 leagues is closed.
The Question
We needed shot-level x,y coordinates for non-Big-5 leagues. Without them, those 20 leagues fall back to FootyStats aggregate xG (correlation 0.35 with goals) for the variance filter, roughly half the signal quality of the Understat shot-level xG (correlation 0.63) available for the Big 5. Every public source we checked was either blocked or lacked the coverage we needed:
- Understat: Big 5 only
- FBref: Cloudflare blocked
- Sofascore: 403 from datacenter IPs
- FootyStats API: aggregate only, no shot coordinates
- WhoScored: not integrated, Opta-gated
What We Found
FotMob pages embed full shot data in the `__NEXT_DATA__` JSON. Every match page contains per-shot x,y coordinates, body part, situation, shot type, per-shot xG, and per-shot xGOT. No API is needed: plain HTTP requests for the page are enough.
It works from the Hetzner server. Unlike Sofascore (403 from datacenter IPs), FotMob page routes return 200 from our production server. This means the shot scraper can run as a daily cron job with no dependency on a local machine.
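As a minimal sketch of the extraction step: Next.js pages serialize their page props into a `<script id="__NEXT_DATA__">` tag, so the JSON can be pulled out of the raw HTML and parsed. The helper names `extractNextData` and `collectShots` are hypothetical, and so is the assumption that a shot object can be recognised by numeric `x` and `y` fields plus an `expectedGoals` key. The real key names and nesting inside the payload are not stated here and should be checked against an actual page.

```typescript
// Pull the JSON payload out of a Next.js page's __NEXT_DATA__ script tag.
function extractNextData(html: string): unknown {
  const match = html.match(
    /<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/
  );
  if (!match) {
    throw new Error("__NEXT_DATA__ script tag not found");
  }
  return JSON.parse(match[1]);
}

// ASSUMPTION: a shot is any object with numeric x/y and an expectedGoals key.
interface RawShot {
  x: number;
  y: number;
  [key: string]: unknown;
}

function isRawShot(value: unknown): value is RawShot {
  if (typeof value !== "object" || value === null) return false;
  const obj = value as Record<string, unknown>;
  return (
    typeof obj.x === "number" &&
    typeof obj.y === "number" &&
    "expectedGoals" in obj
  );
}

// Walk the parsed payload and collect every shot-like object, so the sketch
// does not depend on knowing the exact path to the shot array.
function collectShots(node: unknown, out: RawShot[] = []): RawShot[] {
  if (Array.isArray(node)) {
    for (const item of node) collectShots(item, out);
  } else if (isRawShot(node)) {
    out.push(node);
  } else if (typeof node === "object" && node !== null) {
    for (const value of Object.values(node)) collectShots(value, out);
  }
  return out;
}
```

In use, the HTML would come from an ordinary GET of the match page, and `collectShots(extractNextData(html))` would return the shot records. Walking the tree instead of hard-coding a path trades some speed for resilience if the payload layout changes.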
Coverage: 20 non-Big-5 leagues confirmed with shot data. 84,450 shots scraped across 3,372 matches (current season). Historical backfill expanded to 12,685 matches / 320,580 shots across 54 league-season files (3-4 seasons per league).
| League | Matches | Shots |
|---|---|---|
| Championship | 1,736 | 43,000+ |
| League One | 1,762 | 42,000+ |
| League Two | 1,306 | 31,000+ |
| Eredivisie | 832 | 23,000+ |
| Turkish Super | 854 | 21,000+ |
| + 15 more leagues | ... | ... |
The only gap is the National League (0 shots; too low-tier for FotMob shot coverage).
The Nuance
Shot-level xG achieves an 82.1% regression rate on non-Big-5 leagues, matching Understat's quality on the Big 5 (82.2%). This was validated with a strict "gap must halve" methodology.
| Source | Non-Big-5 Regression | Big 5 Regression |
|---|---|---|
| FotMob shot-level | **82.1%** | N/A |
| FotMob match-level | 75.6% | 74.5% |
| Understat | N/A | **82.2%** |
Shot-level xG beats match-level xG in 10 of 13 non-Big-5 leagues, often by 10-16pp (Serie B +15.6pp, Turkish Super +16.4pp, Championship +12.9pp).
However, the 82.1% regression rate does not translate into a marginal ROI improvement. The signal was tested through the 10-gate pipeline twice (with 1 season and with 3-4 seasons of data). Both times the result was +0.1% marginal, p>0.40. The existing match-level xG variance filter already catches the same teams. Better regression detection at the individual level doesn't change which bets are selected at scale.
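The exact "gap must halve" rule is not spelled out above, so the following is only one plausible reading, offered as an illustration. It assumes each observation pairs a team's goals-minus-xG gap in a measurement window with the same gap in a follow-up window, and counts the team as having regressed when the absolute gap shrinks to half or less. The `GapObservation` type and `regressionRate` function are hypothetical names.

```typescript
// ASSUMPTION: one observation = one team over two consecutive windows.
interface GapObservation {
  gapBefore: number; // goals minus xG in the measurement window
  gapAfter: number; // goals minus xG in the follow-up window
}

// Share of observations whose absolute gap at least halved.
// This is an assumed interpretation of "gap must halve", not the
// confirmed methodology.
function regressionRate(observations: GapObservation[]): number {
  if (observations.length === 0) return NaN;
  const regressed = observations.filter(
    (o) => Math.abs(o.gapAfter) <= Math.abs(o.gapBefore) / 2
  ).length;
  return regressed / observations.length;
}
```

Under this reading, a rate of 82.1% would mean that about four in five flagged over- or under-performers saw their gap shrink by at least half in the following window.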
What Was Built
| Component | Details |
|---|---|
| `scripts/fetch-fotmob-shots.ts` | Scraper with `--incremental` flag |
| `lib/fotmob-shots.ts` | Loader with coordinate conversion |
| Server cron (09:30 UTC daily) | Running on Hetzner |
| `--incremental` flag | Last 3 days, skips cached |
| Pipeline alerting | #red-alert Discord on failure |
| Historical backfill | 54 league-season files complete |
| Liga MX, Austria BL, UCL | Backfilled and in cron |
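The `--incremental` behaviour in the table (last 3 days, skipping cached matches) can be sketched as a simple filter. This is not the actual code in `scripts/fetch-fotmob-shots.ts`. The `MatchRef` shape, the `selectMatchesToFetch` name, and the idea of tracking the cache as a set of match IDs are all assumptions made for illustration.

```typescript
// Hypothetical minimal description of a fixture.
interface MatchRef {
  id: string;
  kickoff: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Keep matches that kicked off within the lookback window and that are
// not already in the cache. Future fixtures are skipped as well.
function selectMatchesToFetch(
  matches: MatchRef[],
  cachedIds: Set<string>,
  now: Date,
  lookbackDays: number = 3
): MatchRef[] {
  const cutoff = now.getTime() - lookbackDays * MS_PER_DAY;
  return matches.filter((m) => {
    const t = m.kickoff.getTime();
    return t >= cutoff && t <= now.getTime() && !cachedIds.has(m.id);
  });
}
```

A window wider than the daily cron interval gives the job a chance to pick up matches it missed on a failed run, which may be why three days was chosen.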
What This Means
We now have Understat-quality shot data for 20+ leagues, automated daily on the server. The infrastructure is production-ready even though the direct marginal ROI signal is parked.
The shot data enables:
- Multi-source xG disagreement signals (the inter-model gap IS information)
- xGOT computation (FotMob includes pre-computed xGOT per shot)
- Advanced team metrics (xG/shot, npxG, set-piece breakdowns)
- Future model retraining (632K shots across 25+ leagues for the v3 XGBoost model)
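To illustrate the team metrics in the list above, here is a rough sketch that aggregates per-shot xG into xG per shot, non-penalty xG (npxG), and set-piece xG per team. The `TeamShot` record and its field names (`teamId`, `xg`, `isPenalty`, `fromSetPiece`) are hypothetical. The loader in `lib/fotmob-shots.ts` may expose the data differently.

```typescript
// Hypothetical flattened shot record; field names are illustrative only.
interface TeamShot {
  teamId: string;
  xg: number;
  isPenalty: boolean;
  fromSetPiece: boolean;
}

interface TeamShotMetrics {
  shots: number;
  xg: number;
  npxg: number; // xG excluding penalties
  setPieceXg: number; // xG from shots flagged as set-piece situations
  xgPerShot: number;
}

function computeTeamMetrics(
  shots: TeamShot[]
): Record<string, TeamShotMetrics> {
  const byTeam: Record<string, TeamShotMetrics> = {};
  for (const shot of shots) {
    let m = byTeam[shot.teamId];
    if (m === undefined) {
      m = { shots: 0, xg: 0, npxg: 0, setPieceXg: 0, xgPerShot: 0 };
      byTeam[shot.teamId] = m;
    }
    m.shots += 1;
    m.xg += shot.xg;
    if (!shot.isPenalty) m.npxg += shot.xg;
    if (shot.fromSetPiece) m.setPieceXg += shot.xg;
  }
  // Derive the per-shot average once all shots are accumulated.
  for (const teamId of Object.keys(byTeam)) {
    const m = byTeam[teamId];
    m.xgPerShot = m.shots > 0 ? m.xg / m.shots : 0;
  }
  return byTeam;
}
```

None of these metrics can be derived from aggregate match-level xG alone, which is the practical difference the shot-level data makes.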
What's Next
Shot data flows daily. The value shows up in derivative signals (multi-source disagreement, set-piece xGA regression) rather than as a direct variance filter replacement. The data pipeline is the foundation — the edge comes from what we build on top of it.