Fix the Engine, Then Paint the Car: Why We're Pausing Signal Testing
9/9 signals rejected through the 10-gate process. The gates aren't broken — they're correctly telling us the -3% base ROI can't be fixed by Layer 3 filters. CalGap (r=-0.922) points to the solver reacting too slowly to mid-season team collapses. We're sweeping recentFormBoost × decayRate to fix the engine before painting the car.
We tested 9 signals through our 10-gate approval process. All 9 rejected. Zero passed. After building parallel testing infrastructure, a claim system, a /tests dashboard, and a hypothesis classification framework, we realized we were optimizing the wrong layer.
The gates aren't broken. They're telling us: your base model is -3% ROI. No signal adding +0.5pp will make it profitable. Stop decorating a losing system.
The Evidence
We built the most rigorous signal testing process we could imagine. 10 automated gates. Per-league matchday interleave OOS. Bootstrap on marginal ROI. Walk-forward validation across seasons. Hypothesis classification. Parallel terminals with claim coordination.
Then we ran 9 signals through it:
| Signal | Marginal ROI | Gate 10 (Walk-Forward) | Result |
|---|---|---|---|
| tc2-league-filter | +0.91pp | 2022: OK, 2023: FAIL, 2024: FAIL, 2025: FAIL | REJECTED |
| tc2-home-ah-rescue | +0.43pp | FAIL (every fold negative) | REJECTED |
| promoted-team-penalty | +0.14pp | FAIL | REJECTED |
| style-matchup-bet-routing | ~0pp | FAIL | REJECTED |
| 5 others | ≤0pp | FAIL | REJECTED |
The best signal we found adds +0.91pp. The base model is at -3% ROI. You'd need +3-5pp from a single signal to flip the system positive, roughly 3-5x more than anything we tested produces.
Why Signal Testing Is Futile Right Now
The system is a layered stack:
- Layer 3: Signals ← we spent 2 days optimizing this
- Layer 2: Core filters ← minEdge ≥ 7%, odds ≤ 2.0 (already optimized)
- Layer 1: MI BP Model ← this is what's broken
Walk-forward (Gate 10) requires positive marginal in 2/3 of season folds. But the base ROI is negative in EVERY fold:
| Year | Base ROI | Best signal marginal | Base + Signal |
|---|---|---|---|
| 2022 | -2.0% | +0.91pp | -1.1% (still negative) |
| 2023 | -3.8% | +0.91pp | -2.9% (still negative) |
| 2024 | -2.8% | +0.91pp | -1.9% (still negative) |
| 2025 | -2.6% | +0.91pp | -1.7% (still negative) |
No Layer 3 signal can pass walk-forward because the Layer 1 base is underwater everywhere. The gates will reject everything until the engine works.
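The arithmetic behind that table, plus the Gate 10 rule, fits in a few lines. This is a sketch, not our gate code: the function names are made up for illustration, and rounding "2/3 of folds" up to a whole number of folds is an assumption.

```python
import math

# Per-season base ROI (%) and the best marginal found (pp), from the tables above.
BASE_ROI = {2022: -2.0, 2023: -3.8, 2024: -2.8, 2025: -2.6}
BEST_MARGINAL_PP = 0.91


def combined_roi(base_roi, marginal_pp):
    """Add the same flat marginal to every season's base ROI."""
    return {year: round(roi + marginal_pp, 2) for year, roi in base_roi.items()}


def required_marginal(base_roi):
    """Flat marginal (pp) needed for the worst season to reach break-even."""
    return -min(base_roi.values())


def passes_walk_forward(fold_positive, min_fraction=2 / 3):
    """Gate 10 as described above: positive marginal in >= 2/3 of folds.

    Assumption: the required count is rounded up (3 of 4 folds).
    """
    needed = math.ceil(min_fraction * len(fold_positive))
    return sum(fold_positive) >= needed


combined = combined_roi(BASE_ROI, BEST_MARGINAL_PP)
for year, roi in combined.items():
    print(year, f"{roi:+.2f}%")

print("marginal needed in the worst season:", required_marginal(BASE_ROI), "pp")

# tc2-league-filter: 2022 OK, then three failing folds.
print("tc2-league-filter passes:", passes_walk_forward([True, False, False, False]))
```

Every season stays below zero with the best signal added, and the worst season alone needs more than four times the best marginal we measured.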
The Root Cause: The Solver Reacts Too Slowly
From the March 18 session (discovered but never acted on):
CalGap predicts ROI with r = -0.922. Calibration gap (the difference between what the model predicts and what actually happens) explains 85% of the variance in per-league ROI (r² ≈ 0.85).
The mechanism: within-season team collapses. Rangers (CalGap +45pp), Barcelona (+38pp), Sevilla (+39pp). These teams implode mid-season but the solver keeps pricing them based on their early-season ratings.
Why? The solver uses recentFormBoost=1.5, which gives recent matches 50% more weight. But when a team has 20+ good matches followed by 10 bad ones, 10 × 1.5 = 15 "effective matches" of bad data vs 20+ of good data. The good data wins. The solver keeps thinking the team is strong while they're actually in freefall.
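The same back-of-envelope math, as a sketch. It assumes the boost applies to exactly the bad run and that every older match has weight 1.0; the real solver's weighting is more involved, and `bad_run_share` is a name invented for this example.

```python
def bad_run_share(good_matches, bad_matches, recent_form_boost):
    """Fraction of total effective weight carried by the recent bad run.

    Assumption: only the bad run is boosted; older matches weigh 1.0 each.
    """
    boosted = bad_matches * recent_form_boost
    return boosted / (boosted + good_matches)


# 20 good matches followed by 10 bad ones, at each boost value in the sweep.
for boost in (1.5, 2.0, 2.5, 3.0):
    share = bad_run_share(20, 10, boost)
    print(f"recentFormBoost={boost}: bad run carries {share:.0%} of the weight")
```

Under these assumptions the bad run is outweighed at 1.5, reaches parity at 2.0, and only dominates at 2.5 and above.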
The fix: make the solver react faster by increasing recentFormBoost and/or decayRate. These are the two levers that control how quickly old data loses influence.
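To get a feel for what the decayRate values mean, here is a sketch that assumes exponential decay, weight = exp(-decayRate × age). That functional form is an assumption for illustration (this post doesn't spell out the solver's exact formula), and "age" is in whatever unit the solver uses.

```python
import math


def decay_weight(age, decay_rate):
    """Weight of a match of the given age. ASSUMES exponential decay."""
    return math.exp(-decay_rate * age)


def half_life(decay_rate):
    """Age at which a match's weight falls to half, under the same assumption."""
    return math.log(2) / decay_rate


for rate in (0.005, 0.010, 0.015):
    print(f"decayRate={rate}: half-life ~ {half_life(rate):.1f} age units")
```

If the assumption holds, doubling decayRate halves the half-life, which is why the sweep labels 0.010 as "2x faster decay".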
The Plan: Fix the Engine First
Phase 1: Dual-Parameter Sweep
We're sweeping recentFormBoost (how much recent matches count) AND decayRate (how fast old matches fade) simultaneously. 7 configurations:
| Config | recentFormBoost | decayRate | What it tests |
|---|---|---|---|
| baseline | 1.5 | 0.005 | Current production |
| rfb-2.0 | 2.0 | 0.005 | Moderate increase |
| rfb-2.5 | 2.5 | 0.005 | The value identified March 18 |
| rfb-3.0 | 3.0 | 0.005 | Aggressive |
| rfb-2.5-d10 | 2.5 | 0.010 | March 18 boost + 2x faster decay |
| rfb-2.5-d15 | 2.5 | 0.015 | March 18 boost + 3x faster decay |
| rfb-3.0-d10 | 3.0 | 0.010 | Aggressive boost + 2x faster decay |
All 7 run in parallel using the infrastructure we built earlier (~2 hours total).
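The sweep grid as data, with a minimal parallel runner. `run_backtest` is a hypothetical placeholder, not our actual entry point, and the thread pool is only there to keep the sketch self-contained; a CPU-bound backtest would more likely go through separate processes or terminals.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepConfig:
    name: str
    recent_form_boost: float
    decay_rate: float


# The 7 configurations from the table above.
SWEEP = [
    SweepConfig("baseline", 1.5, 0.005),
    SweepConfig("rfb-2.0", 2.0, 0.005),
    SweepConfig("rfb-2.5", 2.5, 0.005),
    SweepConfig("rfb-3.0", 3.0, 0.005),
    SweepConfig("rfb-2.5-d10", 2.5, 0.010),
    SweepConfig("rfb-2.5-d15", 2.5, 0.015),
    SweepConfig("rfb-3.0-d10", 3.0, 0.010),
]


def run_backtest(config):
    """Hypothetical placeholder: swap in the real backtest call here."""
    return {"config": config.name, "ah_roi": None, "clv": None}


def run_sweep(configs):
    """Run every config concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=len(configs)) as pool:
        return list(pool.map(run_backtest, configs))


results = run_sweep(SWEEP)
print([r["config"] for r in results])
```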
Acceptance Criteria
Every criterion must pass — no cherry-picking:
- AH ROI improves by ≥ 1pp vs baseline
- CLV stays ≥ +9% (current: +11%). If CLV drops, the solver is fitting to noise instead of markets.
- Improves in ≥ 3 of 4 years. A config that helps 2022 but hurts 2024 is overfit to one year.
- OOS holds. Matchday-interleave split gap ≤ 3pp.
- One change set per commit. Only recentFormBoost and decayRate change; nothing else is touched.
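The four numeric criteria can be written as a single all-or-nothing check. This is a sketch: `SweepResult` and `accept` are illustrative names, and the candidate numbers below are made up to exercise the check, not results from the sweep. The last criterion is a process rule and isn't checked in code.

```python
from dataclasses import dataclass


@dataclass
class SweepResult:
    ah_roi: float       # overall AH ROI, %
    clv: float          # closing line value, %
    roi_by_year: dict   # season -> ROI, %
    oos_gap_pp: float   # matchday-interleave split gap, pp


def accept(candidate, baseline):
    """Return (passed, per-criterion results). Every criterion must pass."""
    years_improved = sum(
        candidate.roi_by_year[y] > baseline.roi_by_year[y]
        for y in baseline.roi_by_year
    )
    checks = {
        "ah_roi_gain_1pp": candidate.ah_roi - baseline.ah_roi >= 1.0,
        "clv_at_least_9": candidate.clv >= 9.0,
        "improves_3_of_4_years": years_improved >= 3,
        "oos_gap_within_3pp": candidate.oos_gap_pp <= 3.0,
    }
    return all(checks.values()), checks


# Baseline uses the per-season figures above; its oos_gap_pp is a placeholder.
baseline = SweepResult(-3.0, 11.0, {2022: -2.0, 2023: -3.8, 2024: -2.8, 2025: -2.6}, 0.0)
# Made-up candidate: ROI improves everywhere but CLV slips under the floor.
candidate = SweepResult(-1.5, 8.5, {2022: -0.5, 2023: -2.0, 2024: -1.8, 2025: -1.7}, 2.0)

passed, checks = accept(candidate, baseline)
print(passed, checks)
```

The made-up candidate clears three criteria and still gets rejected on CLV, which is the point of requiring all of them.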
What We're Watching For
CLV drops but ROI improves: The solver is now matching results (noisy, already happened) instead of markets (efficient, forward-looking). This is worse, not better — we'd be trading model quality for short-term lucky results. Reject.
Only early years improve: 2022-2023 have more training data behind them. If only those years get better, the config might be overfitting to data density, not genuine faster reaction. Suspicious.
Small leagues collapse: Faster decay means fewer effective training matches. A league with 300 total matches might lose too much signal. Check per-league breakdown.
Phase 2: Loss Weights (if needed)
Only if Phase 1 doesn't reach AH ROI > -1%. Reduce (not eliminate) outcome and xG weights to let the solver trust markets more. Eliminating them entirely would make CLV → 0.
Phase 3: Re-Baseline + Re-Test Signals
Deploy the winning config. Run fresh 26-league baseline. Then re-run the top signals through the 10-gate process. With a better base, signals that previously added +0.91pp might now be enough to pass walk-forward.
The Lesson
We built an incredible testing infrastructure in one day. Parallel backtesting (20h → 33min). 10-gate approval. Signal claims. /tests dashboard. Hypothesis classification. Post-deployment monitoring. Blog posts documenting every step.
Then we used it to discover that the infrastructure wasn't the problem.
The testing process worked exactly as designed — it rejected everything and forced us to look at the real bottleneck. The 0% pass rate wasn't a failure of the process. It was the process doing its job: telling us where the actual problem lives.
The model produces +11% CLV. That's real edge. The gap between "the model is right" and "we make money" is a solver reactivity problem that was identified on March 18 and never fixed. The fix is two parameters in one file. The testing infrastructure will still be there when we come back to Layer 3.
Fix the engine first. Then paint the car.