Fix the Engine, Then Paint the Car: Why We're Pausing Signal Testing

March 19, 2026 | Strategy | Investigation

9/9 signals rejected through the 10-gate process. The gates aren't broken — they're correctly telling us the -3% base ROI can't be fixed by Layer 3 filters. CalGap (r=-0.922) points to the solver reacting too slowly to mid-season team collapses. We're sweeping recentFormBoost × decayRate to fix the engine before painting the car.

Signals tested: 9/9 (all rejected)
Base ROI: -3% (negative every year)
CalGap → ROI: r = -0.92 (85% of variance explained)
Fix: 2 params (recentFormBoost + decayRate)

We tested 9 signals through our 10-gate approval process. All 9 rejected. Zero passed. After building parallel testing infrastructure, a claim system, a /tests dashboard, and a hypothesis classification framework — we realized we were optimizing the wrong layer.

The gates aren't broken. They're telling us: your base model is -3% ROI. No signal adding +0.5pp will make it profitable. Stop decorating a losing system.

The Evidence

We built the most rigorous signal testing process we could imagine. 10 automated gates. Per-league matchday interleave OOS. Bootstrap on marginal ROI. Walk-forward validation across seasons. Hypothesis classification. Parallel terminals with claim coordination.

Then we ran 9 signals through it:

Signal	Marginal ROI	Gate 10 (Walk-Forward)	Result
tc2-league-filter	+0.91pp	2022: OK, 2023: FAIL, 2024: FAIL, 2025: FAIL	REJECTED
tc2-home-ah-rescue	+0.43pp	FAIL (every fold negative)	REJECTED
promoted-team-penalty	+0.14pp	FAIL	REJECTED
style-matchup-bet-routing	~0pp	FAIL	REJECTED
5 others	≤0pp	FAIL	REJECTED

The best signal we found adds +0.91pp. The base model is at -3% ROI. You'd need +3-5pp from a single signal to flip the system positive, roughly 3x to 5x more than anything we tested produces.

Why Signal Testing Is Futile Right Now

The system is a layered stack:

Layer 3: Signals         ← we spent 2 days optimizing this
Layer 2: Core filters    ← minEdge ≥ 7%, odds ≤ 2.0 (already optimized)
Layer 1: MI BP Model     ← this is what's broken

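Walk-forward (Gate 10) requires a positive marginal in at least 2/3 of season folds. But the base ROI is negative in every fold, and adding the best signal's +0.91pp aggregate marginal to each year still leaves it negative: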

Year	Base ROI	Best signal marginal	Base + Signal
2022	-2.0%	+0.91pp	-1.1% (still negative)
2023	-3.8%	+0.91pp	-2.9% (still negative)
2024	-2.8%	+0.91pp	-1.9% (still negative)
2025	-2.6%	+0.91pp	-1.7% (still negative)

No Layer 3 signal can pass walk-forward because the Layer 1 base is underwater everywhere. The gates will reject everything until the engine works.
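The per-year arithmetic and the Gate 10 rule can be checked in a few lines. This is a minimal sketch, not the production gate: it assumes a signal's marginal adds linearly to base ROI in percentage points, and the function names are illustrative.

```python
from fractions import Fraction

# Per-season base ROI in percent, copied from the table above.
BASE_ROI = {2022: -2.0, 2023: -3.8, 2024: -2.8, 2025: -2.6}
BEST_MARGINAL_PP = 0.91  # aggregate marginal of tc2-league-filter


def combined_roi(base: float, marginal_pp: float) -> float:
    """Base ROI plus signal marginal (assumes the two add linearly)."""
    return round(base + marginal_pp, 2)


def passes_walk_forward(fold_ok: list[bool],
                        min_share: Fraction = Fraction(2, 3)) -> bool:
    """Gate 10 as described here: positive marginal in at least 2/3 of folds."""
    return Fraction(sum(fold_ok), len(fold_ok)) >= min_share


for year, base in BASE_ROI.items():
    print(year, combined_roi(base, BEST_MARGINAL_PP))

# tc2-league-filter: 2022 OK, then 2023-2025 FAIL -> 1 of 4 folds.
print(passes_walk_forward([True, False, False, False]))
```

Every combined value stays below zero, and one passing fold out of four is well short of the 2/3 threshold.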

The Root Cause: The Solver Reacts Too Slowly

From the March 18 session (discovered but never acted on):

CalGap predicts ROI with r = -0.922. Calibration gap (the difference between what the model predicts and what actually happens) explains 85% of the variance in per-league ROI, since r² = 0.922² ≈ 0.85.

The mechanism: within-season team collapses. Rangers (CalGap +45pp), Barcelona (+38pp), Sevilla (+39pp). These teams implode mid-season but the solver keeps pricing them based on their early-season ratings.

Why? The solver uses recentFormBoost=1.5, which gives recent matches 50% more weight. But when a team has 20+ good matches followed by 10 bad ones, 10 × 1.5 = 15 "effective matches" of bad data vs 20+ of good data. The good data wins. The solver keeps thinking the team is strong while they're actually in freefall.
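The "effective matches" arithmetic extends to the other boost values in the sweep. A rough sketch, assuming the boost simply multiplies the weight of the recent matches while older matches keep weight 1.0 and decay is ignored; the real solver's weighting may differ, and the function name is made up for illustration.

```python
def bad_data_share(good_matches: int, bad_matches: int,
                   recent_form_boost: float) -> float:
    """Share of total effective weight carried by the recent (bad) matches.

    Assumption: recent matches get weight `recent_form_boost`, older
    matches get weight 1.0, and time decay is not modeled.
    """
    bad_weight = bad_matches * recent_form_boost
    return bad_weight / (bad_weight + good_matches)


for boost in (1.5, 2.0, 2.5, 3.0):
    share = bad_data_share(good_matches=20, bad_matches=10,
                           recent_form_boost=boost)
    print(f"recentFormBoost={boost}: recent matches carry {share:.0%} of the weight")
```

Under these assumptions the 10 bad matches only outweigh the 20 good ones once the boost exceeds 2.0, which is why the sweep focuses on values above the current 1.5.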

The fix: make the solver react faster by increasing recentFormBoost and/or decayRate. These are the two levers that control how quickly old data loses influence.

The Plan: Fix the Engine First

Phase 1: Dual-Parameter Sweep

We're sweeping recentFormBoost (how much recent matches count) AND decayRate (how fast old matches fade) simultaneously. 7 configurations:

Config	recentFormBoost	decayRate	What it tests
baseline	1.5	0.005	Current production
rfb-2.0	2.0	0.005	Moderate increase
rfb-2.5	2.5	0.005	The value identified March 18
rfb-3.0	3.0	0.005	Aggressive
rfb-2.5-d10	2.5	0.010	Moderate boost + 2x faster decay
rfb-2.5-d15	2.5	0.015	Moderate boost + 3x faster decay
rfb-3.0-d10	3.0	0.010	Aggressive boost + 2x faster decay

All 7 run in parallel using the infrastructure we built earlier (~2 hours total).
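The grid above can be written down as plain data and fanned out to workers. This is only a sketch of the shape of the sweep: `SweepConfig`, `run_backtest`, and `run_sweep` are hypothetical names, and the placeholder runner stands in for the real backtest entry point, which is not shown in this post.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepConfig:
    name: str
    recent_form_boost: float
    decay_rate: float


# The 7 configurations from the table above.
CONFIGS = [
    SweepConfig("baseline", 1.5, 0.005),
    SweepConfig("rfb-2.0", 2.0, 0.005),
    SweepConfig("rfb-2.5", 2.5, 0.005),
    SweepConfig("rfb-3.0", 3.0, 0.005),
    SweepConfig("rfb-2.5-d10", 2.5, 0.010),
    SweepConfig("rfb-2.5-d15", 2.5, 0.015),
    SweepConfig("rfb-3.0-d10", 3.0, 0.010),
]


def run_backtest(config: SweepConfig) -> dict:
    """Hypothetical placeholder for the real backtest; returns no metrics."""
    return {"config": config.name, "ah_roi": None}


def run_sweep(configs, runner=run_backtest, max_workers=7):
    """Run every config concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, configs))


results = run_sweep(CONFIGS)
print([r["config"] for r in results])
```

Keeping the baseline in the same sweep means every candidate is compared against a run produced by the same code path.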

Acceptance Criteria

Every criterion must pass — no cherry-picking:

AH ROI improves by ≥ 1pp vs baseline
CLV stays ≥ +9% (current: +11%). If CLV drops, the solver is fitting to noise instead of markets.
Improves in ≥ 3 of 4 years. A config that helps 2022 but hurts 2024 is overfit to one year.
OOS holds. Matchday-interleave split gap ≤ 3pp.
One change set per commit. Only recentFormBoost and decayRate change; nothing else moves.
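The four numeric criteria above can be encoded as a single all-or-nothing check. A minimal sketch, assuming each backtest yields an overall AH ROI, per-year AH ROI, CLV, and an OOS split gap; the `SweepResult` fields and the `accept` function are hypothetical, and the commit-scope rule is a process check that code like this does not cover.

```python
from dataclasses import dataclass


@dataclass
class SweepResult:
    ah_roi: float                      # overall AH ROI, percent
    ah_roi_by_year: dict[int, float]   # AH ROI per season, percent
    clv: float                         # closing line value, percent
    oos_gap_pp: float                  # matchday-interleave split gap, pp


def accept(candidate: SweepResult, baseline: SweepResult) -> tuple[bool, list[str]]:
    """Return (passed, failures). Every criterion must pass."""
    failures = []
    if candidate.ah_roi - baseline.ah_roi < 1.0:
        failures.append("AH ROI improvement < 1pp")
    if candidate.clv < 9.0:
        failures.append("CLV below +9%")
    improved_years = sum(
        candidate.ah_roi_by_year[year] > base_roi
        for year, base_roi in baseline.ah_roi_by_year.items()
    )
    if improved_years < 3:
        failures.append("improves in fewer than 3 of 4 years")
    if candidate.oos_gap_pp > 3.0:
        failures.append("OOS split gap > 3pp")
    return (not failures, failures)
```

Returning the list of failures rather than a bare boolean makes it obvious which criterion sank a config, which helps when several configs fail for different reasons.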

What We're Watching For

CLV drops but ROI improves: The solver is now matching results (noisy, already happened) instead of markets (efficient, forward-looking). This is worse, not better — we'd be trading model quality for short-term lucky results. Reject.

Only early years improve: 2022-2023 have more training data behind them. If only those years get better, the config might be overfitting to data density, not genuine faster reaction. Suspicious.

Small leagues collapse: Faster decay means fewer effective training matches. A league with 300 total matches might lose too much signal. Check per-league breakdown.
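The small-league concern can be made concrete with a back-of-the-envelope estimate. This sketch assumes an exponential decay of the form exp(-decayRate × age in days); the post does not state the solver's actual decay formula or the units of decayRate, and the match schedule below is invented purely for illustration.

```python
import math


def effective_matches(ages_in_days: list[float], decay_rate: float) -> float:
    """Sum of per-match weights under an ASSUMED exponential decay."""
    return sum(math.exp(-decay_rate * age) for age in ages_in_days)


# Hypothetical small league: 300 matches, one every 3 days, newest first.
ages = [3.0 * i for i in range(300)]

for rate in (0.005, 0.010, 0.015):
    print(f"decayRate={rate}: ~{effective_matches(ages, rate):.0f} effective matches")
```

Whatever the exact formula, the direction holds: a higher decay rate shrinks the effective sample, so the per-league breakdown should be read alongside each league's match count.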

Phase 2: Loss Weights (if needed)

Only if Phase 1 doesn't reach AH ROI > -1%. Reduce (not eliminate) outcome and xG weights to let the solver trust markets more. Eliminating them entirely would drive CLV to 0.

Phase 3: Re-Baseline + Re-Test Signals

Deploy the winning config. Run a fresh 26-league baseline. Then re-run the top signals through the 10-gate process. With a better base, a signal that adds +0.91pp might now be enough to pass walk-forward.

The Lesson

We built an incredible testing infrastructure in one day. Parallel backtesting (20h → 33min). 10-gate approval. Signal claims. /tests dashboard. Hypothesis classification. Post-deployment monitoring. Blog posts documenting every step.

Then we used it to discover that the infrastructure wasn't the problem.

The testing process worked exactly as designed — it rejected everything and forced us to look at the real bottleneck. The 0% pass rate wasn't a failure of the process. It was the process doing its job: telling us where the actual problem lives.

The model produces +11% CLV. That's real edge. The gap between "the model is right" and "we make money" is a solver reactivity problem that was identified on March 18 and never fixed. The fix is two parameters in one file. The testing infrastructure will still be there when we come back to Layer 3.

Fix the engine first. Then paint the car.