Walk-forward backtest across 26 leagues confirms xgWeight=0.2 beats control on every market. OVERS +3.99pp, UNDERS +3.31pp, SIDES +1.18pp. Effect is monotonic — no plateau. Deploying to production.
Walk-forward extension test: bundesliga-2 and league-one pass (direction-correct IS and OOS, magnitudes comparable to validated Big-5 leagues). Seven other leagues fail — serie-b, ligue-2, league-two, eredivisie, belgian-pro, portuguese-liga, scottish-prem. The signal extends along English + German pyramid tiers but not cross-country.
Lowering VARIANCE_LOOKBACK from 10 to 5/6/7/8 produced identical CLV (+10.0-10.1%) and fewer bets, not more. The 3.0-goal threshold is coupled to window length — shrinking the window tightens the filter, doesn't relax it. Second rejection in the variance-tuning space after attack-defense-asymmetry. The 22 early-season bets this was meant to unlock need a two-parameter fix (window + scaled threshold), not this one.
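The coupling described above is easiest to see in a toy version of the filter. This is a minimal sketch, assuming the filter compares a cumulative goals-minus-xG deviation over the lookback window to a fixed threshold. That form, the function names, and the linear scaling rule are all illustrative assumptions, not the production code.

```python
def passes_variance_filter(deviations, lookback, threshold):
    """Hypothetical filter: flag a team when the absolute cumulative
    goals-minus-xG deviation over the last `lookback` matches reaches `threshold`."""
    window = deviations[-lookback:]
    return abs(sum(window)) >= threshold


def scaled_threshold(lookback, base_threshold=3.0, base_lookback=10):
    """Assumed linear scaling: keep the required per-match deviation constant."""
    return base_threshold * lookback / base_lookback


# Illustrative team overperforming by a steady 0.35 goals per match.
history = [0.35] * 10

flag_10 = passes_variance_filter(history, 10, 3.0)        # full window, fixed threshold
flag_5_fixed = passes_variance_filter(history, 5, 3.0)    # short window, fixed threshold
flag_5_scaled = passes_variance_filter(history, 5, scaled_threshold(5))
```

With a fixed threshold the shorter window accumulates less deviation, so the same team stops qualifying. Scaling the threshold with the window is one possible shape for the two-parameter fix.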
Higher minEdge monotonically improves entryROI AND CLV across 13 walk-forward folds on the factorial base combo. minEdge=0.12 gives +4.91pp entryROI and +4.24pp CLV over production. Previous tests (March 26) found no effect — the difference is the factorial combo changes which bets appear at each edge level.
Sofascore v3 shot-level xG scored 368K shots across 24 leagues, but the A/B test against baseline showed zero marginal impact (+0.0% CLV, +0.1% entry-adj ROI). Root cause: only 1 of 12 backtest seasons affected. Infrastructure stays; retest after 2+ seasons.
A week-long cloud-lab outage traced to one 5-line shell script bug (LAST=0 in the idle-shutdown check, reproduced live twice). Full arc: the false SSH-key diagnosis, the real bug, six stacked infrastructure issues (including fire-and-poll shipped as 5 serial hotfixes to live traffic), the rebuild in three phases, a version-controlled priority-task roadmap with a submit-time validator and a 60-second auto-advance ticker — then a red-team battery against the guardian that caught a critical capitalization bypass (target="Cloud-Lab" skipped the validator entirely). 22/22 tests pass after the bypass fix.
Multi-source xG enrichment data failed as binary filters (4 approaches, all negative). Reframing as continuous stake sizing produced +0.5-0.6% sizing lift, improved Sharpe, and passed 3/3 walk-forward folds. The data was always informative — we were just using it wrong.
Rewrote the 10-gate signal-approval pipeline into 6 data-driven gates. Dropped hardcoded N≥1000, +0.5pp practical-significance floor, p<0.10 bootstrap, 3pp interleave tolerance, and 1% suspicious-N dedup. Dry-run confirmed 0 live signals regress and 9 previously-rejected signals would be unblocked. Ran end-to-end on two of them: inter-model-disagreement failed Gate 2 (CI width too wide — a sharper rejection than the old pipeline managed), contextXg passed 6/6 (previously killed on the old +0.5pp floor). The reformed pipeline is live; pod-shop math fix and bake-off come next.
Tracked 249 players who migrated from Big 5 leagues at age 30+ to lower divisions. 85% declined, average xG/90 dropped 38%. Finishers survive the drop; creators don't. Registering squad-age-creative-concentration as a pod-shop signal.
Tested three Bayesian priors for early-season predictions. Elo warm-start made predictions WORSE (+0.004 Brier). Marcel early-prior confirmed at 0.0pp marginal. Player finishing xG calibration shows -0.00375 Brier but needs walk-forward. Key insight: the solver already fits to Pinnacle market odds, which IS the best prior. Path A (solver priors) is closed. Path C (shot-level xG) remains open.
Eight settlement failures in 5 weeks, all from the same root cause: a hardcoded 100-entry team map sitting next to a verified 588-entry map that was never loaded. Fixed by loading the verified map at runtime and auto-generating it to 1,257 entries across 40+ leagues.
Ran every possible on/off combination of 11 signals (2^11 = 2,048 configs). Best portfolio: regime + crossBtts + leagueExcl + finishingLuck at +10.4% entry ROI. Four signals confirmed dead or harmful. Biggest surprise: leagueExcl and contextXg destructively interfere — pick one, not both.
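The enumeration above can be sketched in a few lines. The signal names here are placeholders and `evaluate` is a hypothetical caller-supplied scoring function (for example, a backtest returning entry ROI); neither is taken from the real pipeline.

```python
from itertools import product

# Placeholder names; the real run used 11 production signals.
SIGNALS = [f"signal_{i}" for i in range(11)]


def all_configs(signals):
    """Yield every on/off combination as a dict of signal name -> bool."""
    for flags in product([False, True], repeat=len(signals)):
        yield dict(zip(signals, flags))


def best_config(signals, evaluate):
    """Return the config with the highest score from a caller-supplied evaluate()."""
    return max(all_configs(signals), key=evaluate)


def toy_score(config):
    """Toy scorer: two signals each help alone but interfere when combined."""
    a, b = config["signal_0"], config["signal_1"]
    return (1.0 if a else 0.0) + (0.8 if b else 0.0) - (1.5 if a and b else 0.0)


configs = list(all_configs(SIGNALS))
winner = best_config(SIGNALS, toy_score)
```

A full factorial is what exposes interactions: a leave-one-out sweep would score both toy signals as helpful and never show that enabling both is worse than enabling either alone.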
Every GH Actions deploy for sports-dashboard had been marked failed for weeks — 15+ consecutive red runs, false Discord ROLLBACK alerts, all while production ran fine. Root cause: the verify loop filtered by coolify.applicationId=UUID but that label is an integer ID, so every poll returned empty. The rollback was a silent no-op on top of the broken polling.
Scaled from 789 to 9,050 player-seasons (12 years, Wikidata birth dates). The aging signal is real (an 11% decline from peak to age 35) but still doesn't improve predictions. Marcel's recency weighting already handles decline implicitly. Hardcoded Caley curves are definitively the worst option.
Pointed four independent xG models at 3,282 matches. Multi-source consensus produces 91.2% regression rate. The strongest signal (91.5%) came from decomposing overperformance into shot quality vs finishing luck — when both are high, regression is near-certain.
Two weeks building a complete BTTS pipeline — shot-level xG model, dual-grid calibration, five per-match adjustment layers, 16K-bet backtest. The model finds +4.47% entry ROI against soft books but agrees with Pinnacle. The blocker isn't the model — it's execution venue.
Rebuilt cloud-lab job dispatch from 12-hour SSH pipes to 1-second fire-and-poll. Jobs survive SSH drops, server restarts, and network hiccups. Added master orchestrator, watchdogs, and solver OOM fix.
Decomposing overperformance into shot_quality (shotXG - matchXG) vs finishing_luck (actual - shotXG) predicts regression magnitude. Teams with a high finishing_luck component regress faster than teams whose overperformance is explained by shot quality. Status: no-signal.
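The two components defined above can be written out directly. This is a small sketch using the formulas as stated; the class and function names and the example numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Overperformance:
    shot_quality: float    # shotXG - matchXG
    finishing_luck: float  # actual goals - shotXG

    @property
    def total(self) -> float:
        """Total overperformance, which equals actual goals - matchXG."""
        return self.shot_quality + self.finishing_luck


def decompose(actual_goals: float, shot_xg: float, match_xg: float) -> Overperformance:
    """Split overperformance against match-level xG into the two components."""
    return Overperformance(
        shot_quality=shot_xg - match_xg,
        finishing_luck=actual_goals - shot_xg,
    )


# Illustrative numbers for one team over a rolling window.
example = decompose(actual_goals=18.0, shot_xg=14.5, match_xg=12.0)
```

The components always sum to actual goals minus match-level xG, so the decomposition adds no new information about the size of the overperformance, only about where it comes from.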
When the v3 69-feature model and FotMob match-level xG disagree significantly about a team's rolling performance, it indicates unusual match characteristics that create prediction uncertainty. High-|disagreement| matches are harder to predict: skip or size down. Status: not-significant.
Instead of one binary threshold, use per-source thresholds (fotmob 3.0, matchXg 2.5, v3 2.0). When 2+ sources independently flag regression, confidence is higher. Level 3 (all sources flag) should have the highest per-bet ROI. Status: not-significant.
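A sketch of the layered scheme, using the three thresholds quoted above. It assumes each source supplies a goals-minus-xG deviation over the lookback window and that a source flags when the absolute deviation reaches its threshold; the function name and input shape are hypothetical.

```python
# Per-source thresholds as quoted in the hypothesis.
THRESHOLDS = {"fotmob": 3.0, "matchXg": 2.5, "v3": 2.0}


def regression_level(deviations, thresholds=THRESHOLDS):
    """Count how many sources flag regression (0 to 3).

    `deviations` maps source name -> goals-minus-xG deviation (assumed input).
    A source with no data does not flag.
    """
    return sum(
        1
        for source, limit in thresholds.items()
        if source in deviations and abs(deviations[source]) >= limit
    )


level_two = regression_level({"fotmob": 3.2, "matchXg": 2.6, "v3": 1.0})
level_three = regression_level({"fotmob": -3.0, "matchXg": 2.5, "v3": 2.0})
level_missing = regression_level({"v3": 2.4})
```

The level can then be used either as a filter (bet only at level 2+) or as a sizing input.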
Context-adjusted xG (venue, GK, squad, regime corrections) improves lambda estimation, producing more accurate edges and better bet selection. Did not pass approval gates.
Batch-tested 5 multi-source xG disagreement signals through 10-gate approval. inter-model-disagreement (9/10 gates, +1.1pp marginal) is the most promising but fails bootstrap (p=0.24). layered-threshold (+0.3pp) too small. overperformance-decomposition broken by v3 data coverage. footystats-all and finishingLuck show zero marginal. v3 as variance filter replacement: identical results to MI lambdas.
Tested whether selecting matches with high backed-team SP xG AND high opponent SP xGA would sharpen AH edge. Passed 8/10 gates with 12/12 walk-forward folds positive, but threshold sweep (0.30→0.60) revealed the filter is a data-availability proxy — selectivity comes from whether shot data exists, not from mismatch magnitude. Marginal +0.8pp below +1pp spec, p=0.19.
Re-tested shot-level xG variance signal with 3.5x more data (3,489 matches, 3-4 seasons). Marginal ROI unchanged at +0.1% (bootstrap p=0.47). Shot-level and match-level xG produce equivalent regression candidates at 10-match window granularity. Hypothesis falsified: data volume was not the bottleneck.
Re-tested the finishing persistence signal after backfilling FotMob shots to 19 leagues (5,144 matches, 127K shots). Player count cleaned to 2,040 (dedup), split-half r improved to 0.411. Affected bets doubled (899 vs 427) but marginal ROI halved (+0.2pp, below +0.5pp gate). The finishing effect is real but a binary filter can't capture it — needs continuous lambda adjustment.
Tested variance filtering, GK changes, and HFA regime on 14,387 BTTS bets (+9.93% CLV, -0.67% ROI). None improve ROI. The edge is real but BTTS market vig eats it. Fix is execution (league selection, Pinnacle/exchange), not better signals.
FotMob API died (404), CDN blocked (403), Sofascore blocked from server. Fixed GK PSxG via CDN underscore format, built Sofascore warm standby (270K shots in Supabase), added Discord alerting. Every critical data need now has 2+ sources except GK PSxG.
Multi-feature calibrated FootyStats xG (corr 0.35→0.51) has zero impact on variance regression betting outcomes. But switching variance filter from model lambdas to match-level xG gives +0.3-0.4% ROI lift. The real win: FS non-Big-5 shows +37.7u marginal — FootyStats xG creates value through finishing luck in soft markets, not through better regression detection.
BTTS and O/U specialist markets diverge from our Poisson model by +8.4pp and +12.3pp (both p<0.0001, 41K OOS). But when wired as binary filters on AH/O/U bets, marginal ROI is +0.1% (BTTS) and +0.2% (O/U) — both fail bootstrap significance. The signal predicts match texture, not who wins. Pivoting to direct BTTS value betting where the hit-rate edge translates directly to profit.
Added is_home to XGBoost feature set to learn venue x shot-type interactions. Walk-forward on 312K shots: Model A Brier 0.0787 (baseline 0.0785, delta +0.0002). Model B unchanged. The post-hoc venue calibration already captures the venue signal. XGBoost finds no additional home/away interaction effects at the individual shot level.
Penalty frequency is heavily mean-reverting (corr=0.145). High-penalty teams drop 51% in the second half of the season. No table-position effect — penalties are luck-driven for all teams. Always use npxG.
Attack and defense overperformance regress at identical rates (68.3% vs 68.2%). The thesis that defense is 'luckier' is wrong. But set-piece defense regresses at 71.9% — the strongest single regression signal found. Now on gauntlet shadow with 12/12 walk-forward.
Tested manager bounce, promoted team arc, international breaks, and congestion×depth. Only promoted team arc confirmed (p=0.0001) but league-specific. Manager bounce rejected — Pinnacle reprices in 1-2 matches. International breaks: zero effect.
FotMob pages embed shot x,y data in __NEXT_DATA__ for all leagues. Scraped 320K shots, 20 leagues, 3-4 seasons. Shot-level xG achieves 82.1% regression rate (matching Understat Big 5). Server cron running daily. The data quality gap for non-Big-5 is closed.
Understat doesn't adjust shot-level xG for home/away. A multiplicative correction (home ×0.934, away ×0.940) beats them on all 4 walk-forward folds. Edge is growing — home overprediction getting worse as HFA declines post-COVID. Set pieces worst: home set-piece xG overpredicted by 12.8%.
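The correction above is a single multiplier per venue. This is a minimal sketch using the two factors quoted; the function names and the (xG, venue) input shape are assumptions, and "venue" here means the venue of the shooting team.

```python
# Multiplicative factors quoted above.
VENUE_FACTOR = {"home": 0.934, "away": 0.940}


def venue_adjusted_xg(shot_xg: float, venue: str) -> float:
    """Scale one shot's raw xG by the factor for the shooting team's venue."""
    return shot_xg * VENUE_FACTOR[venue]


def team_xg(shots):
    """Sum venue-adjusted xG over an iterable of (raw_xg, venue) pairs."""
    return sum(venue_adjusted_xg(xg, venue) for xg, venue in shots)


# Illustrative shots for a home side.
home_total = team_xg([(0.50, "home"), (0.10, "home"), (0.76, "home")])
```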
FootyStats xG has corr=0.35 with goals — beaten by SoT×0.32 (corr=0.56) in ALL 20 non-Big-5 leagues. But for regression detection, noise is a feature: FootyStats xG (62.7% regression) beats our precise aggregate model (58%). The variance filter needs process, not outcome.
48-hour infrastructure overhaul after discovering entry-adjusted ROI was +5.1%. Built 125-test suite from scratch (settlement, Poisson math, devigging, sizing, bootstrap). Wired entry-adjusted ROI through all 10 approval gates. Graduated shadow solver (market-only + Dixon-Coles rho). Re-evaluated 7 signals, approved tc2-league-filter (+1.2% marginal). Kelly sizing research: quarter Kelly optimal, full Kelly catastrophic. Zero tests to full CI coverage in two days.
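For reference, fractional Kelly sizing can be sketched as below. This uses the standard Kelly formula for a binary bet at decimal odds; the function name and the zero floor on negative-edge bets are choices made for this example, not a description of the production sizer.

```python
def kelly_fraction(prob: float, decimal_odds: float, multiplier: float = 0.25) -> float:
    """Fraction of bankroll to stake under fractional Kelly.

    Full Kelly is f = (b*p - q) / b, with b = decimal_odds - 1 and q = 1 - p.
    multiplier=0.25 gives quarter Kelly; negative-edge bets are sized at zero.
    """
    b = decimal_odds - 1.0
    if b <= 0:
        raise ValueError("decimal_odds must be greater than 1")
    full = (b * prob - (1.0 - prob)) / b
    return max(0.0, full) * multiplier


quarter = kelly_fraction(prob=0.55, decimal_odds=2.0)
full = kelly_fraction(prob=0.55, decimal_odds=2.0, multiplier=1.0)
no_edge = kelly_fraction(prob=0.40, decimal_odds=2.0)
```

Full Kelly assumes the win probability is known exactly. When the model's largest edges are also its most overconfident, a fractional multiplier limits the damage from that miscalibration.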
The backtest used closing odds, but we bet before close. CLV +5.3% means entry is ~8pp better than closing. AH entry-adjusted ROI: +5.1% (p<0.001, 7,190 OOS bets, 3/3 seasons stable). Two edge sources: entry timing (+8pp) and soft-book premium (+3pp). Calibration tax (-4.1pp) is the biggest lever, which validates the solver research roadmap.
Bucketed 138K bets into 5 edge brackets and swept 25 floor thresholds. Higher edge does NOT predict higher ROI — the 3-5% bracket (-4.5%) outperforms 12%+ (-8.7%). DSR=0.000. The CLV→ROI gap is structural (calibration + market structure), not driven by edge size. AH near breakeven at all edges; 1X2/OU25 lose everywhere.
Analyzed within-match outcome correlation across 30K matches. Same-match bets are ANTI-correlated (-0.153) — they hedge, not concentrate. 56% of pairs split (one wins, one loses). All 3 capping policies worsen ROI. The 40-60u match swings in live trading are a stake-sizing artifact, not structural correlation risk.
Swept 7 edge thresholds across 14 expansion leagues using BetExplorer AH odds. A 7% threshold plus excluding the -0.25 line maximizes OOS profit: +166u vs +136u for the current config (+22%). Deployed for all Tier 2/3 leagues.
Systematic leave-one-out analysis of every production feature. maxEdge=15% cap is +0.99pp (biggest improvement available). minEdge=7% is optimal. Variance filter slightly hurts. Defiance filter genuinely helps. CLV bug found and fixed (1X2 proxy inflated AH CLV by ~11pp). ROI (+29.69%) is real.
Shadow model v1 launched alongside production. Contains market-only solver + DC rho correction (validated +1.25pp on backtest). Portfolio stack shows +2.92% ROI (p=0.064, all years positive). Shadow must prove itself on 100+ live bets before graduating to production.
Stacking tc2-league-filter + gk-psxg on base filters produces +2.69% ROI (all 3 years positive, p=0.085). Leave-one-league-out: robust across all 16 leagues. The 10-gate rejected these individually but the portfolio is profitable. 6 flaws identified in testing framework. Needs production parity verification before deployment.
The variance filter compared goals to a constant 1.35 instead of real xG. Real xG improves dev (+0.99pp) but not holdout (+0.03pp). Disabling the filter is a coin flip. Ted's thesis needs a richer implementation — not a binary filter, but a multi-factor xG regression score. The data exists (269 match-xG files) but isn't being used properly.
Applied Dixon-Coles tau correction (rho=+0.05) to the Bivariate Poisson score grid. Dev +1.60pp (p=0.025), holdout +0.80pp, production -2.98% to -1.72%. The grid overestimated 0-0/1-1 for AH markets. Combined with market-only: +1.25pp total, 189u saved.
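The tau correction can be sketched on a plain score grid. This example uses the standard Dixon-Coles tau function on an independent Poisson grid as a stand-in for the Bivariate Poisson grid; the lambdas, the grid size, and the sign convention (positive rho shrinks 0-0 and 1-1) are assumptions for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam ** k / math.factorial(k)


def dc_tau(home_goals, away_goals, lam_home, lam_away, rho):
    """Dixon-Coles low-score adjustment; 1.0 outside the four low-score cells."""
    if home_goals == 0 and away_goals == 0:
        return 1.0 - lam_home * lam_away * rho
    if home_goals == 0 and away_goals == 1:
        return 1.0 + lam_home * rho
    if home_goals == 1 and away_goals == 0:
        return 1.0 + lam_away * rho
    if home_goals == 1 and away_goals == 1:
        return 1.0 - rho
    return 1.0


def score_grid(lam_home, lam_away, rho=0.0, max_goals=10):
    """Return grid[h][a] = P(home scores h, away scores a), renormalized to sum to 1."""
    grid = [
        [
            poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            * dc_tau(h, a, lam_home, lam_away, rho)
            for a in range(max_goals + 1)
        ]
        for h in range(max_goals + 1)
    ]
    total = sum(sum(row) for row in grid)
    return [[p / total for p in row] for row in grid]


base = score_grid(1.4, 1.1)                 # illustrative lambdas, no correction
corrected = score_grid(1.4, 1.1, rho=0.05)  # rho as quoted above
```

With this convention, rho=+0.05 moves probability from 0-0 and 1-1 into 1-0 and 0-1, which matches the direction of the overestimate described above.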
A plain-English walkthrough of the entire system: the MI Bivariate Poisson model (the engine), production filters (the transmission), and experimental signals (the steering). What's working (+7% CLV), what isn't (-2.2% ROI), and why the gap exists.
Paper trading showed +29.7% ROI at 58 bets. Rigorous backtesting showed -2.98% at 15,165 bets. Today we moved it to -2.19% by removing match results from the solver. The CLV was always real (+7%). The ROI was a lucky streak. Three sweeps, one winner, three bugs fixed, and a model that's 27% less negative.
Re-ran top 2 signals through 10-gate approval on the new market-only baseline. Both failed again (6/10 gates). The base got better (-3.3% to -2.6%) but signal marginals unchanged (+0.9pp, +0.4pp). Walk-forward 1/4 and 0/4 folds. Individual Layer 3 signals can't close the remaining -2.2% gap.
We removed outcome prediction and xG fitting from the solver loss function. Validated on holdout (+0.45pp, 3/3 years) and production (26 leagues, +0.79pp, +118u saved, 12/19 improve). The solver produces better calibrated probabilities when fitting ONLY to Pinnacle odds. First structural model change to improve AH ROI.
We tested edge shrinkage and minimum edge thresholds to filter overconfident bets. Every threshold made ROI worse, monotonically. minEdge=10% costs 3.20pp vs baseline. The model's largest edges are its most overconfident. Wrong-direction discovery: max-edge-cap registered as reversed hypothesis.
We tested 5 configurations of recentFormBoost (1.5-3.0) and decayRate (0.005-0.015) to track within-season collapses faster. Every config was worse or flat vs baseline. Increasing RFB costs 1.10pp; decay alone is noise. Also found a bug: --decay-rate was never passed to data-prep. The solver already correctly weights form data.
9/9 signals rejected through the 10-gate process. The gates aren't broken — they're correctly telling us the -3% base ROI can't be fixed by Layer 3 filters. CalGap (r=-0.922) points to the solver reacting too slowly to mid-season team collapses. We're sweeping recentFormBoost × decayRate to fix the engine before painting the car.
Skip bets within 10 matches of a mid-season manager change? Marginal ROI = -0.1%, so the filter hurts. 877 changes loaded via FotMob (529 mid-season). The solver reads Pinnacle odds, which already price manager changes. Combined with the xG window test, manager changes are definitively priced. Case closed.
Explored whether opponent goalkeeper quality (PSxG+/-) predicts AH outcomes. Expanded GK data from 4 to 22 leagues. Exploration found a +15.8pp ROI spread, but formal 10-gate approval rejected: marginal ROI +0.4pp (p=0.36), walk-forward decayed from +1.4% in 2023 to -9.0% in 2025. Markets appear to have adapted.
We tested whether truncating xG histories at mid-season manager changes improves variance regression. Built the infrastructure, loaded 877 changes across 101 teams, ran the 10-gate approval. Result: zero marginal ROI. The 10-match rolling lookback already ages out old-manager data naturally. The system was self-correcting all along.
Filtering bets where Pinnacle moved ≥3pp toward our selection removes only 127/6,606 bets (1.9%) with zero marginal ROI. The model uses closing odds — line movement is already in the CLV. Shelved without full gate. This plus 7 other capture signal failures confirm: the CLV→ROI gap is calibration, not execution quality.
Our original strongest finding — Under 2.5 vs sufferball teams at +10.83% CLV, +7.63% ROI — was retested through the corrected 10-gate pipeline. 5/10 gates passed. Marginal ROI = -0.2pp (signal hurts the stack). Original N=262 was a pre-filtered artifact. Walk-forward: 2024 -4.2%, 2025 -8.6%. Shelved.
Four signals failed under the corrected protocol: defensive-overperf, gameweek timing, fixture congestion AH shrinkage, and DGF motivation filter. All select 87,210/87,210 matches at minEdge=0. Zero marginal ROI. The existing filter stack already captures all four phenomena.
Two signals from 36 Ted Knutson transcripts looked like portfolio-savers in ad-hoc testing (+2.09pp and +2.24pp marginal ROI). Both failed the canonical 10-gate approval. League filter: p=0.22, IS/OOS sign flip, walk-forward fails 2024-2025. Home AH rescue: p=0.36, marginal ROI only +0.4pp, walk-forward collapses to -9.2% in 2025. The formal process caught what custom analysis missed: both signals overfit to historical data.
We tested whether filtering bets backing promoted teams improves ROI. 339 promoted teams, 26 leagues, 5 seasons. The effect exists (+0.1pp) but is statistically insignificant (p=0.46) and temporally unstable. The MI solver reads Pinnacle odds, and Pinnacle already prices promotion correctly. Signal #42 tested, signal #42 rejected.
Every bet now gets a post-settlement execution verdict: did we get a better price than closing? The bridge between 'model found edge' and 'we captured that edge.' Three new fields, zero new data sources — just connecting dots that were already there.
What happens after the model makes picks and real money is on the line. Daily settlement, 3-layer health checks, loss classification (5 categories), regime change detection, kill switches, and the feedback loop that turns every loss into a new alpha hypothesis.
Post-deployment monitoring that classifies every loss (variance, model error, stale input, regime shift, execution leak), detects when edges erode vs when you're just unlucky, and generates new alpha hypotheses from failure patterns. Kill switches, re-enable gates, and the virtuous cycle that makes the system smarter every time it loses.
The complete signal testing workflow — from hypothesis to deployment in 4 minutes. Register, explore, analyze, gate. Designed for parallel terminals. 10 automated approval gates including per-league matchday interleave OOS, walk-forward validation, and practical significance checks. Nothing reaches production without passing.
We tested 4 capture signals (vig-aware, line movement confirmation, Pinnacle vs market gap, AH line shift) against 6,986 matches with complete open/close/average odds. All 4 rejected. The CLV→ROI gap isn't about execution quality — it's about calibration. High-vig bets actually have HIGHER CLV. 94% of our bets are already on the sharp side. The path to profitability is Track 2 (model improvements), not Track 1 (better execution).
We read 36 Ted Knutson transcripts, extracted every betting edge, and ran 8 new signals through the full testing protocol. Two survived: a league portfolio filter (+2.09pp marginal ROI, p=0.033) and a home AH conditional rescue (+2.24pp, needs live validation). Combined, they turn a -128.7u portfolio into +19.9u. Six signals failed. One wrong-direction discovery (new managers outperform) opens a new investigation.
Our testing infrastructure had a bug that made every signal look better than it was. runStandaloneSignal() never removed the 7% edge threshold — so all 40 'accepted' signals were validated on a pre-filtered pool. The fix was 2 lines of code. The damage was 10 days of false confidence. Here's the corrected protocol with 7 new hard rules.
We added a single loss term to the solver — asking it to match the O/U 2.5 market — and AH profits jumped +90u (+24%). The O/U bets themselves are still unprofitable. The improvement came from better lambda estimates. The paradox: we failed to fix totals, but the attempt made sides dramatically better.
We ran the full evaluation pipeline across 26 leagues — 29,977 matches, 12 signal configurations, proper IS/OOS split. CLV is +11% everywhere (model is genuinely good). ROI is negative everywhere OOS (execution eats the edge). The odds cap is the only filter that matters (+4.2pp marginal). And our first attempt at this analysis was completely wrong — here's what we learned from that too.
We found two signals worth ~700u, validated them with Monte Carlo and walk-forward, deployed them, then ran the 10-gate process. Both failed — then we discovered the gate had a bug (wasn't toggling signals). Fixed it, re-ran: congestion +0.3pp (p=0.36), AH lines -0.1pp (p=0.54). Still rejected. Right answer, wrong path to get there.
We mined 39 rejected experiments and found two signals worth ~700u. One was a filter actively removing our best bets. The other was dismissed because the hypothesis was backwards. Both survived Monte Carlo bootstrap and walk-forward validation — but later failed the 10-gate approval process. See follow-up: 'The Gate Killed Our Darlings.'
We deployed the biggest league expansion since launch — seven new leagues across three continents, validated through walk-forward backtesting on 14,000+ matches.