When Three Data Sources Die in One Session: Building Pipeline Resilience
FotMob API died (404), CDN blocked (403), Sofascore blocked from server. Fixed GK PSxG via CDN underscore format, built Sofascore warm standby (270K shots in Supabase), added Discord alerting. Every critical data need now has 2+ sources except GK PSxG.
The Question
In one session, three data sources died: FotMob /api/ endpoints (404), data.fotmob.com CDN (403 from server), Sofascore API (403 from datacenter IPs). How resilient is our data pipeline, and what happens when sources break?
What We Found
Every critical data need now has at least 2 sources, except GK PSxG.
| Data Need | Primary Source | Backup | Runs On |
|---|---|---|---|
| Shot x,y (Big 5) | Understat scraper | StatsBomb open data | Local |
| Shot x,y (non-Big-5) | FotMob page scraping | Sofascore (local) | **Server cron** ✅ |
| Match-level xG | Sofascore → Supabase | FotMob match-level cache | Local → server |
| GK PSxG | data.fotmob.com CDN | **NONE** ⚠️ | Server |
| Pinnacle odds | The Odds API | football-data.co.uk | Server |
| Match results | FootyStats API | football-data.co.uk | Server |
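
The primary/backup pairing in the table can be treated as an ordered fallback chain. The following is a minimal sketch, assuming each source is wrapped in a callable that returns a list of records or raises. `fetch_with_fallback` and the two stub fetchers are hypothetical names for illustration, not the pipeline's actual code.

```python
from typing import Callable, Sequence

Fetcher = Callable[[], list]


def fetch_with_fallback(sources: Sequence[tuple[str, Fetcher]]) -> tuple[str, list]:
    """Try each (name, fetcher) in order and return the first non-empty result."""
    errors = []
    for name, fetch in sources:
        try:
            rows = fetch()
        except Exception as exc:  # a dead source must not stop the chain
            errors.append(f"{name}: {exc!r}")
            continue
        if rows:
            return name, rows
        # An empty result is treated as a failure so the backup still gets tried.
        errors.append(f"{name}: returned 0 rows")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Hypothetical stubs standing in for real scrapers.
def dead_primary() -> list:
    raise ConnectionError("404")


def working_backup() -> list:
    return [{"match_id": 1, "xg": 1.4}]


used, rows = fetch_with_fallback(
    [("primary", dead_primary), ("backup", working_backup)]
)
print(used, len(rows))
```

Returning the name of the source that answered makes it easy to log or alert when the pipeline is quietly running on its backup.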
What Was Fixed
FotMob API is dead. All /api/ endpoints now return 404, and the change is permanent. FotMob page routes (/matches/{slug}) still work from everywhere, including Hetzner, so this is how we get shot data now.
data.fotmob.com CDN confusion: The CDN blocks some URL formats but not others. expected_goals.json → 200. goals_prevented.json → 403. _goals_prevented.json (with underscore) → 200. The GK PSxG fix uses the underscore format and works.
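
Because the CDN accepts some filename formats and rejects others, one option is to probe the known variants in order and use the first that returns 200. This is a sketch under assumptions: the base URL is a placeholder (the real CDN path is not reproduced here), `status_of` is an injected function that returns an HTTP status code, and the helper names are hypothetical.

```python
from typing import Callable, Optional


def candidate_filenames(stat: str) -> list[str]:
    """Filename variants to try: underscore-prefixed first, then plain."""
    return [f"_{stat}.json", f"{stat}.json"]


def first_working_url(
    base: str, stat: str, status_of: Callable[[str], int]
) -> Optional[str]:
    """Return the first candidate URL that answers 200, or None if all fail."""
    for filename in candidate_filenames(stat):
        url = f"{base.rstrip('/')}/{filename}"
        if status_of(url) == 200:
            return url
    return None


# Simulated status codes mirroring what this session observed.
OBSERVED = {
    "expected_goals.json": 200,
    "goals_prevented.json": 403,
    "_goals_prevented.json": 200,
}


def fake_status(url: str) -> int:
    """Stand-in for a real HTTP request; unknown filenames answer 404."""
    return OBSERVED.get(url.rsplit("/", 1)[-1], 404)


BASE = "https://cdn.example.invalid/stats"  # placeholder, not the real CDN path

print(first_working_url(BASE, "goals_prevented", fake_status))
print(first_working_url(BASE, "expected_goals", fake_status))
```

In a real run, `status_of` could wrap an HTTP client call made from the server, since the 403s were specific to server-side requests.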
Sofascore blocks datacenter IPs but works locally. The Sofascore → Supabase pipeline runs from the local Mac and pushes to Supabase, and the server cron reads from Supabase. The warm standby currently holds 270K shots and 10K matches.
FotMob page scraping from Hetzner: Confirmed working (200, full __NEXT_DATA__). Unlike Sofascore, FotMob doesn't block datacenter IPs on their page routes. This is why the shot scraper runs as a server cron.
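
Next.js pages embed their data as JSON inside a `<script id="__NEXT_DATA__">` tag, so page scraping reduces to pulling that tag out and parsing it. Below is a minimal sketch using only the standard library. The `shots` key in the sample payload is a made-up shape for illustration; the actual layout of FotMob's payload is not documented here and should be inspected before relying on any path.

```python
import json
import re
from typing import Optional

# Matches the script tag Next.js uses to embed page data.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Hypothetical page body; real pages are far larger and shaped differently.
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"shots": [{"x": 88.5, "y": 34.0}]}}}'
    "</script></body></html>"
)

data = extract_next_data(sample_html)
shots = data["props"]["pageProps"]["shots"]
print(len(shots))
```

Returning `None` on a missing or malformed tag lets the caller treat a layout change as "0 matches fetched" and route it into alerting instead of crashing mid-run.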
What Was Built
| Component | Purpose |
|---|---|
| Server cron (09:30 UTC) | FotMob shot scraping, daily, 20 leagues |
| Alerting wrapper | Discord #red-alert when 0 matches fetched |
| Sofascore warm standby | 270K shots in Supabase, activatable in 5 min |
| GK PSxG fix | CDN with numeric tournament IDs + underscore prefix |
| `--incremental` flag | Only scrapes last 3 days, skips cached |
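
The alerting wrapper can be sketched as a function that runs a scrape job and sends a message when it fetches nothing. This is an illustration under assumptions, not the deployed wrapper: `run_with_alert` and `post_to_discord` are hypothetical names, the job is assumed to return the number of matches fetched, and alerting on a crash is an addition beyond the "0 matches" rule in the table. The Discord call uses the standard webhook format, a JSON body with a `content` field.

```python
import json
import urllib.request
from typing import Callable


def post_to_discord(webhook_url: str, message: str) -> None:
    """Send a plain-text message to a Discord webhook URL."""
    body = json.dumps({"content": message}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=body,
        # An explicit User-Agent is set because some services reject urllib's default.
        headers={"Content-Type": "application/json", "User-Agent": "pipeline-alert/0.1"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()


def run_with_alert(
    job_name: str, job: Callable[[], int], alert: Callable[[str], None]
) -> int:
    """Run job; alert if it crashes (assumption) or fetches 0 matches."""
    try:
        count = job()
    except Exception as exc:
        alert(f"{job_name} crashed: {exc!r}")
        raise
    if count == 0:
        alert(f"{job_name} fetched 0 matches")
    return count


# Demo with an in-memory alert sink instead of a real webhook.
sent: list[str] = []
run_with_alert("fotmob-shots", lambda: 0, sent.append)
run_with_alert("fotmob-shots", lambda: 12, sent.append)
print(sent)
```

In the cron job, `alert` would be something like `lambda msg: post_to_discord(WEBHOOK_URL, msg)`, with `WEBHOOK_URL` read from the environment. Injecting the alert function keeps the wrapper testable without network access.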
What This Means
The biggest remaining risk is GK PSxG, which has no backup. The CDN format could change at any time. FotMob league stats pages have GK data in __NEXT_DATA__ (verified). This is the identified backup path, but it is not yet built.
Key principle established: Every critical data need requires 2 providers, with at least 1 backup automated. The cost of maintaining a warm standby (Sofascore alongside FotMob) is negligible compared with discovering that your pipeline has been failing silently for 2 weeks.
What's Next
- Build GK PSxG backup (FotMob page scraping for league stats)
- Monthly Sofascore health check
- Automated data quality alerts: "did we get xG for >90% of yesterday's matches?"
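
The data quality check in the last item could look like the following sketch. It assumes yesterday's matches are available as a list of dicts with an optional `xg` field; that record shape and both function names are hypothetical.

```python
def xg_coverage(matches: list[dict]) -> float:
    """Fraction of matches with a non-null xG value; 0.0 for an empty list."""
    if not matches:
        # No matches at all is treated as zero coverage. On days with no
        # scheduled fixtures this would need a separate check to avoid
        # a false alarm.
        return 0.0
    with_xg = sum(1 for m in matches if m.get("xg") is not None)
    return with_xg / len(matches)


def coverage_ok(matches: list[dict], threshold: float = 0.90) -> bool:
    """True when xG coverage is strictly above the threshold (>90% by default)."""
    return xg_coverage(matches) > threshold


# 19 of 20 matches have xG, so coverage is 0.95.
yesterday = [{"id": i, "xg": 1.2} for i in range(19)] + [{"id": 19, "xg": None}]
print(xg_coverage(yesterday), coverage_ok(yesterday))
```

A failing check would feed the same Discord alert path as the "0 matches fetched" rule, catching partial failures that a zero-count alert misses.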