The Gate Killed Our Ted Knutson Signals: Both Failed the 10-Gate Process
Two signals from 36 Ted Knutson transcripts looked like portfolio-savers in ad-hoc testing (+2.09pp and +2.24pp marginal ROI). Both failed the canonical 10-gate approval. League filter: p=0.22, IS/OOS sign flip, walk-forward fails 2024-2025. Home AH rescue: p=0.36, marginal ROI only +0.4pp, walk-forward collapses to -9.2% in 2025. The formal process caught what custom analysis missed: both signals overfit to historical data.
Two signals looked like portfolio-savers in our ad-hoc analysis. Both failed the formal 10-gate approval process. This is the system working as intended.
What Happened
In the Ted Canon 2 session, we mined 36 Ted Knutson transcripts and tested 8 signals against the deployed AH baseline. Two passed our custom test harness:
- League filter (remove Segunda, La Liga, Ligue 2): +2.09pp marginal ROI, p=0.033
- Home AH rescue (keep home bets only vs bottom-quarter opponents): +2.24pp marginal ROI
Combined, they appeared to flip the portfolio from -3.28% ROI to +0.97% ROI. We wrote the blog post. We updated the signal registry. We pushed the code.
Then we ran them through approve-signal.ts — the canonical 10-gate pipeline that every signal must pass before deployment.
Both failed.
League Filter: 6/10 Gates Passed
npx tsx scripts/approve-signal.ts --signal=tc2-league-filter
| Gate | Result | Detail |
|---|---|---|
| 1. Pre-registered | ✓ | |
| 2. True standalone (minEdge=0) | ✓ | N=75,890, ROI=-5.7% |
| 3. Minimum N ≥ 1,000 | ✓ | |
| 4. Marginal ROI > 0 | ✓ | +0.9pp (base -2.1% vs without -3.0%) |
| 5. Bootstrap significance | **✗** | p=0.22 (threshold: <0.10) |
| 6. Matchday interleave OOS | **✗** | IS: -4.2%, OOS: +0.2% — **sign flip**, 4.4pp gap |
| 7. Regime stratification | ✓ | No opposite-sign regimes |
| 8. Suspicious N | **✗** | N=75,890 similar to 6 other signals |
| 9. Practical significance | ✓ | +0.9pp > 0.5pp |
| 10. Walk-forward | **✗** | 2022: +2.1% OK, 2023: +0.6% OK, **2024: -3.9% FAIL, 2025: -6.6% FAIL** |
What killed it
Gate 5 (bootstrap): The marginal ROI of +0.9pp isn't statistically distinguishable from zero. The bootstrap CI spans [-1.4pp, +3.2pp], which includes zero, and p=0.22 is well above the 0.10 threshold. We can't be confident the improvement is real.
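To make the bootstrap idea concrete, here is a minimal sketch of how a check like Gate 5 could be computed. It is not the code in approve-signal.ts: the `Bet` shape, the `bootstrapMarginal` name, the seeded generator, and the percentile CI are all assumptions made for illustration.

```typescript
// Hypothetical per-bet record; the real pipeline's types are not shown in this post.
interface Bet {
  stake: number;  // amount staked
  profit: number; // net profit (negative for a loss)
  kept: boolean;  // true if the bet survives the signal's filter
}

// Small seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function roi(bets: Bet[]): number {
  const staked = bets.reduce((s, b) => s + b.stake, 0);
  const profit = bets.reduce((s, b) => s + b.profit, 0);
  return staked === 0 ? 0 : profit / staked;
}

// Marginal ROI = ROI with the signal applied minus ROI of the unfiltered pool.
function marginalRoi(bets: Bet[]): number {
  return roi(bets.filter((b) => b.kept)) - roi(bets);
}

function bootstrapMarginal(
  bets: Bet[],
  iterations = 2000,
  seed = 1
): { pValue: number; ci: [number, number] } {
  const rand = mulberry32(seed);
  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) {
    // Resample the bet pool with replacement, keeping the pool size fixed.
    const resample: Bet[] = [];
    for (let j = 0; j < bets.length; j++) {
      resample.push(bets[Math.floor(rand() * bets.length)]);
    }
    samples.push(marginalRoi(resample));
  }
  samples.sort((x, y) => x - y);
  // One-sided p-value: share of resamples in which the signal did not help.
  const pValue = samples.filter((d) => d <= 0).length / iterations;
  const lo = samples[Math.floor(0.025 * iterations)];
  const hi = samples[Math.min(iterations - 1, Math.floor(0.975 * iterations))];
  return { pValue, ci: [lo, hi] };
}
```

Under these assumptions, a signal would fail the gate when `pValue` is at or above the 0.10 threshold, or equivalently when the interval comfortably straddles zero.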
Gate 6 (OOS): This is the worst failure. When we split by odd/even matchdays within each league-season (the fairest possible temporal split), the signal shows -4.2% ROI on odd matchdays and +0.2% on even. That's a sign flip — the signal's effect depends on *which games you test it on*, not on a stable underlying mechanism.
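The odd/even split is simple to express in code. The sketch below assumes each bet already carries a matchday number that restarts within its league-season; the `MatchBet` shape and the `interleaveSplit` name are hypothetical, not the pipeline's own.

```typescript
// Hypothetical bet record. `matchday` is assumed to be the 1-based round
// number within the bet's own league-season.
interface MatchBet {
  matchday: number;
  stake: number;
  profit: number;
}

function roiOf(bets: MatchBet[]): number {
  const staked = bets.reduce((s, b) => s + b.stake, 0);
  const profit = bets.reduce((s, b) => s + b.profit, 0);
  return staked === 0 ? 0 : profit / staked;
}

// Odd matchdays form the in-sample half, even matchdays the out-of-sample half.
function interleaveSplit(bets: MatchBet[]) {
  const isRoi = roiOf(bets.filter((b) => b.matchday % 2 === 1));
  const oosRoi = roiOf(bets.filter((b) => b.matchday % 2 === 0));
  return {
    isRoi,
    oosRoi,
    gap: Math.abs(isRoi - oosRoi),
    // A sign flip means the two halves disagree on the direction of the effect.
    signFlip: isRoi * oosRoi < 0,
  };
}
```

With an in-sample ROI of -4.2% and an out-of-sample ROI of +0.2%, this returns `signFlip: true` and a gap of about 4.4pp, matching the table above.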
Gate 10 (walk-forward): The signal works on 2022 and 2023 data but fails on 2024 and 2025. The three "bad" leagues (Segunda, La Liga, Ligue 2) were particularly bad in early seasons but improved recently. Filtering them out helped historically but hurts now. This is the definition of overfitting to past data.
Home AH Rescue: 6/10 Gates Passed
npx tsx scripts/approve-signal.ts --signal=tc2-home-ah-rescue
| Gate | Result | Detail |
|---|---|---|
| 1. Pre-registered | ✓ | |
| 2. True standalone (minEdge=0) | ✓ | N=81,293, ROI=-5.9% |
| 3. Minimum N ≥ 1,000 | ✓ | |
| 4. Marginal ROI > 0 | ✓ | +0.4pp (base -2.5% vs without -3.0%) |
| 5. Bootstrap significance | **✗** | p=0.36 (not even close) |
| 6. Matchday interleave OOS | ✓ | IS: -1.5%, OOS: -3.6%, gap=2.1pp, no sign flip |
| 7. Regime stratification | ✓ | |
| 8. Suspicious N | **✗** | N=81,293 similar to 14 other signals |
| 9. Practical significance | **✗** | +0.4pp < 0.5pp threshold |
| 10. Walk-forward | **✗** | 2022: +2.7% OK, 2023: +0.2% OK, **2024: -3.9% FAIL, 2025: -9.2% FAIL** |
What killed it
Gate 5: p=0.36. Not remotely significant.
Gate 9: Marginal ROI of +0.4pp is below the 0.5pp practical significance threshold. Even if it were real, the effect is too small to matter.
Gate 10: Same walk-forward failure pattern. Works on older data, fails on recent seasons. 2025 is particularly ugly at -9.2% ROI — the home AH filter made things *worse* in the most recent season.
Why the Ad-Hoc Test Was Wrong
Our custom analysis (test-ted-canon-2.ts) made three mistakes that the formal pipeline caught:
1. Pre-filtered bet pool
We loaded backtest-v2-results.json directly and filtered to AH bets with edge ≥ 7% and odds ≤ 2.0. This is a 3,927-bet subset. The approval pipeline evaluates through the full loadAllData() → evaluateBets() chain with 29,977 matches and produces ~5,500 base bets. The pre-filtered pool was a different population.
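For reference, the ad-hoc pre-filter amounts to something like the sketch below. The `BacktestBet` fields and the `"AH"` market label are assumed for illustration; the actual schema of backtest-v2-results.json isn't shown here.

```typescript
// Hypothetical shape of one row in backtest-v2-results.json; field names are assumed.
interface BacktestBet {
  market: string; // e.g. "AH" for Asian handicap
  edge: number;   // model edge as a fraction (0.07 = 7%)
  odds: number;   // decimal odds
}

// The ad-hoc pool: AH bets with edge >= 7% and odds <= 2.0.
function adHocPool(bets: BacktestBet[]): BacktestBet[] {
  return bets.filter((b) => b.market === "AH" && b.edge >= 0.07 && b.odds <= 2.0);
}

// Share of the input bets that the ad-hoc filter keeps.
function poolCoverage(bets: BacktestBet[]): number {
  return bets.length === 0 ? 0 : adHocPool(bets).length / bets.length;
}
```

Any signal measured only on `adHocPool(...)` is measured on a population the approval pipeline never evaluates, so the two sets of numbers aren't directly comparable.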
2. Simple league-based IS/OOS split
Our custom test split 6 IS leagues vs 13 OOS leagues. The approval pipeline uses per-league matchday interleaving (odd vs even matchdays within each league-season). The matchday interleave is a much harder test because it eliminates calendar artifacts — December congestion, January transfers, and the Bundesliga winter break don't cluster in one split.
3. No walk-forward validation
Our custom test never asked: "Does this signal work on recent data?" Walk-forward validation showed both signals degrading sharply in 2024-2025. Whatever edge showed up in the 2022-2023 data is gone in recent seasons, which is exactly the kind of regime shift that walk-forward catches.
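A per-season check in the spirit of Gate 10 can be sketched as follows. The pass rule used here (every season's ROI must be positive) is an assumption inferred from the OK/FAIL labels in the tables above, and `SeasonBet` and `walkForward` are hypothetical names rather than the pipeline's API.

```typescript
// Hypothetical bet record tagged with the season it belongs to.
interface SeasonBet {
  season: number;
  stake: number;
  profit: number;
}

interface SeasonResult {
  season: number;
  roi: number;
  ok: boolean;
}

function walkForward(bets: SeasonBet[]): { seasons: SeasonResult[]; passed: boolean } {
  // Accumulate stake and profit per season.
  const totals = new Map<number, { stake: number; profit: number }>();
  for (const b of bets) {
    const t = totals.get(b.season) ?? { stake: 0, profit: 0 };
    t.stake += b.stake;
    t.profit += b.profit;
    totals.set(b.season, t);
  }
  // Report seasons in chronological order.
  const seasons = Array.from(totals.entries())
    .sort(([a], [b]) => a - b)
    .map(([season, t]) => {
      const roi = t.stake === 0 ? 0 : t.profit / t.stake;
      // Assumed rule: a season passes only if its ROI is positive.
      return { season, roi, ok: roi > 0 };
    });
  return { seasons, passed: seasons.every((s) => s.ok) };
}
```

Fed the league filter's per-season figures (+2.1%, +0.6%, -3.9%, -6.6%), this marks 2022 and 2023 as OK, 2024 and 2025 as failed, and the gate as not passed.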
What the Suspicious N Gate Means
Both signals triggered Gate 8 (suspicious N). The league filter had standalone N=75,890, which is within 10% of variance-regression (79,047), congestion-filter (73,224), and several others. Home AH rescue was even worse — N=81,293 was within 10% of 14 other signals.
This gate exists because the meta-analysis found that all "accepted" signals were being tested on the same pre-filtered pool. When standalone N values cluster, it means the signals aren't actually filtering independently — they're all riding on the same bet universe. In this case, the league filter (which removes ~17% of matches) and the home AH rescue (which changes bet selection within matches) both produce standalone N values similar to signals that don't filter at all. The Gate 8 flag is correct: these signals don't meaningfully reduce the bet universe in standalone mode (minEdge=0), which means their standalone test isn't measuring the signal's independent effect.
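The "within 10%" comparison can be sketched like this. Measuring the difference relative to the candidate's own N is an assumption, as are the `similarN` name and the `tiny-filter` entry in the usage example; the other two N values are the ones quoted above.

```typescript
// Return the ids of signals whose standalone N is within `tolerance`
// (as a fraction of the candidate's N) of the candidate's standalone N.
function similarN(
  candidateN: number,
  others: Record<string, number>,
  tolerance = 0.1
): string[] {
  return Object.entries(others)
    .filter(([, n]) => Math.abs(n - candidateN) / candidateN <= tolerance)
    .map(([id]) => id);
}

// Usage: the league filter's N against two registry entries named in this post
// and one made-up entry ("tiny-filter") that actually shrinks the bet universe.
const matches = similarN(75890, {
  "variance-regression": 79047,
  "congestion-filter": 73224,
  "tiny-filter": 12000, // hypothetical
});
```

Here `matches` contains `variance-regression` and `congestion-filter` but not `tiny-filter`, which is the pattern the gate is looking for: several signals sharing nearly the same standalone bet count.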
The Odds Quality Tier Analysis (Informational)
Both signals showed an interesting pattern in the informational Gate 11:
League filter:
- Sharp leagues: ROI=-1.9%, CLV=+11.4%
- Medium leagues: ROI=-4.5%, CLV=+11.1%
- Soft leagues: ROI=-0.3%, CLV=+11.1%
Home AH rescue:
- Sharp leagues: ROI=-0.6%, CLV=+11.3%
- Medium leagues: ROI=-5.3%, CLV=+11.0%
- Soft leagues: ROI=-2.2%, CLV=+11.0%
The CLV is stable across tiers (~11%). But the "soft" leagues (lower tier, less liquid) actually have the *best* ROI for the league filter and the second-best for home AH rescue. This contradicts our hypothesis that soft leagues are structurally unprofitable. Soft leagues aren't the problem: the medium-tier leagues perform worst under both signals.
What We Learned
The formal process works
This is the best outcome. We built a 10-gate approval process specifically to catch signals that look good in ad-hoc analysis but don't hold up to rigorous testing. It caught both signals. If we'd deployed them based on the custom test alone, we'd be running two signals that are actively degrading on recent data.
Ad-hoc analysis is for discovery, not deployment
The custom test harness (test-ted-canon-2.ts) is valuable for *discovering* potential signals quickly. It correctly identified that league selection and home/away asymmetry are important dimensions. But discovery ≠ deployment. The gap between "this looks promising" and "this will make money going forward" is exactly what the 10-gate process measures.
Walk-forward is the hardest gate
Both signals passed marginal ROI and regime stratification, and the league filter also cleared practical significance. Both failed walk-forward. The signals worked in 2022-2023 but not in 2024-2025. This suggests either that the market adapted (the inefficiency we detected in historical data has been arbitraged away) or that the underlying dynamics changed.
The CLV→ROI gap remains unsolved
After mining 36 transcripts, testing 12 signals, and building capture infrastructure, the core problem is unchanged: the model sees +11% CLV and returns negative ROI. The league filter came closest (+0.9pp marginal) but isn't robust. The path forward is model architecture improvements (Track 2), not bet selection (Track 1 or signal layer).
Updated Signal Registry
Both signals now show status: "rejected" with full gate results:
{
  "id": "tc2-league-filter",
  "status": "rejected",
  "backtestStats": {
    "standaloneROI": -0.057,
    "standaloneCLV": 0.056,
    "standaloneN": 75890,
    "marginalROI": 0.009,
    "approvalResult": "REJECTED",
    "failedGates": ["Bootstrap marginal significance", "Matchday interleave OOS", "Suspicious N", "Walk-forward"]
  }
}

The signal registry now has 148 entries. The approval gate has a 100% rejection rate on all signals tested through it. This is either a sign that our bar is appropriately high, or that no Layer 3 signal survives the full process. Both interpretations point the same direction: the model's edge comes from Layer 1 (solver) and Layer 2 (core filters), not from Layer 3 (signals).