We Were Testing the Tests Wrong: The Corrected Signal Protocol
Our testing infrastructure had a bug that made every signal look better than it was. runStandaloneSignal() never removed the 7% edge threshold — so all 40 'accepted' signals were validated on a pre-filtered pool. The fix was 2 lines of code. The damage was 10 days of false confidence. Here's the corrected protocol with 7 new hard rules.
We're rewriting our testing protocol, but not because the old one was bad in principle. Pre-registration, walk-forward validation, regime stratification, and bootstrap CIs were all correct. The problem was a bug in the testing infrastructure itself that made every result look better than it was.
This post covers what went wrong, the corrected mental model, and the specific guardrails we've added so it can't happen again.
The Layered Stack: Why Isolation Testing Is the Wrong Frame
Before we get to the bug, we need to reframe how we think about signal testing. We were asking "does this signal work alone?" That's the wrong question. No signal works alone. The system is a layered stack:
Layer 3: Signal filters (variance, congestion, regime, style)
Layer 2: Core filters (minEdge ≥ 7%, odds ≤ 2.0)
Layer 1: MI Bivariate Poisson (produces CLV from devigged odds)
Layer 1 is the foundation. The solver produces +11% CLV across 26 leagues. That's real and universal. But +11% CLV with -6% ROI means the model sees edge it can't capture.
Layer 2 is where ROI improves. The edge threshold (only bet when CLV ≥ 7%) and odds cap (≤ 2.0) together turn -6% ROI into ~-1.4%. These aren't "signals" — they're the core filtration that makes the model's edge actionable.
Layer 3 is where individual signals live. Variance filters, congestion checks, regime multipliers. These layer on top of the core. Testing them without the core is like testing whether a roof keeps rain out without the walls.
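One way to picture the layering is as composed predicates. This is a minimal sketch, not our pipeline code: the `Bet` shape, `coreFilter`, and `applyStack` are hypothetical names, and only the two thresholds (edge ≥ 7%, odds ≤ 2.0) come from the stack described above.

```typescript
// Hypothetical bet shape; field names are assumptions for illustration.
interface Bet {
  edge: number; // model edge as a fraction, e.g. 0.08 = 8%
  odds: number; // decimal odds
}

type Filter = (bet: Bet) => boolean;

// Layer 2: the core filters (minEdge >= 7%, odds <= 2.0).
const coreFilter: Filter = (bet) => bet.edge >= 0.07 && bet.odds <= 2.0;

// Layer 3: signals are extra predicates applied on top of the core.
function applyStack(bets: Bet[], signals: Filter[]): Bet[] {
  return bets.filter((bet) => coreFilter(bet) && signals.every((s) => s(bet)));
}

const sample: Bet[] = [
  { edge: 0.09, odds: 1.8 }, // passes the core
  { edge: 0.03, odds: 1.8 }, // fails minEdge
  { edge: 0.12, odds: 3.5 }, // fails the odds cap
];

// With no Layer 3 signals, only the first bet survives.
const kept = applyStack(sample, []);
```

A signal never sees the bets the core has already rejected, which is why testing it outside the stack answers a different question.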
The right test for each layer
| Layer | Test | Question |
|---|---|---|
| 1 (Model) | CLV vs closing lines | "Does the model see edge?" |
| 2 (Core filters) | ROI with vs without each filter | "Does this filter capture more of the edge?" |
| 3 (Signals) | **Marginal ROI** (leave-one-out) | "Given the core stack, does adding this signal help?" |
The standalone test (all filters off, minEdge=0) is diagnostic — it shows what a signal does to the raw bet universe. But the deployment decision is always marginal: does adding this signal to the existing stack improve ROI?
Here's what that looks like for our current stack:
| Component | Marginal ROI | Role |
|---|---|---|
| MI Bivariate Poisson | — | Foundation (+11% CLV) |
| minEdge ≥ 7% | ~+4.6pp | Core (turns -6% into -1.4%) |
| odds-cap ≤ 2.0 | **+4.2pp** | Core (strongest individual filter) |
| variance-regression | -0.4pp | Signal (slightly harmful) |
| congestion-filter | -0.3pp | Signal (harmful — confirmed in post-mortem) |
| defiance-filter | -0.2pp | Signal (neutral) |
| no-draws | +0.0pp | Signal (neutral) |
The model + two core filters do all the work. The signal layer is approximately neutral. This doesn't mean signals are useless — it means we haven't found one that demonstrably helps on top of the core. Yet.
What Went Wrong
Our signal testing function runStandaloneSignal() was supposed to test a signal in isolation — disable everything else, see what the signal alone contributes. It correctly disabled the boolean filters (variance, congestion, defiance) and the odds cap. But it never removed the minEdge threshold.
Overrides applied by runStandaloneSignal():
  varianceFilter: false ✓ disabled
  congestionFilter: false ✓ disabled
  defianceFilter: false ✓ disabled
  maxOdds: 99 ✓ disabled
  noDraws: false ✓ disabled
  minEdge: ??? ← NEVER TOUCHED (inherited 0.07 from defaults)
  skipEarlyMatchdays: ??? ← NEVER TOUCHED (inherited 5 from defaults)
The minEdge: 0.07 default means: only consider bets where our model has at least 7% edge over closing lines. That threshold does most of the work. It turns -6% ROI (all bets) into -1.4% ROI (filtered). Every signal we tested was riding on top of it.
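This class of bug is easy to reproduce. The sketch below assumes the overrides were merged onto defaults with an object spread. The `BacktestConfig` shape and every default other than `minEdge: 0.07` and `skipEarlyMatchdays: 5` are assumptions for illustration, not the real config.

```typescript
// Hypothetical config shape mirroring the override list above.
interface BacktestConfig {
  varianceFilter: boolean;
  congestionFilter: boolean;
  defianceFilter: boolean;
  noDraws: boolean;
  maxOdds: number;
  minEdge: number;
  skipEarlyMatchdays: number;
}

// minEdge and skipEarlyMatchdays defaults are from the post; the rest are assumed.
const DEFAULTS: BacktestConfig = {
  varianceFilter: true,
  congestionFilter: true,
  defianceFilter: true,
  noDraws: true,
  maxOdds: 2.0,
  minEdge: 0.07,
  skipEarlyMatchdays: 5,
};

// Buggy pattern: a Partial override silently inherits whatever it omits.
const buggyOverrides: Partial<BacktestConfig> = {
  varianceFilter: false,
  congestionFilter: false,
  defianceFilter: false,
  noDraws: false,
  maxOdds: 99,
};
const buggyConfig: BacktestConfig = { ...DEFAULTS, ...buggyOverrides };
// buggyConfig.minEdge is still 0.07 here, so the pool is pre-filtered.

// Safer pattern: spell out the full config so the compiler rejects a missing field.
const standaloneConfig: BacktestConfig = {
  varianceFilter: false,
  congestionFilter: false,
  defianceFilter: false,
  noDraws: false,
  maxOdds: 99,
  minEdge: 0,
  skipEarlyMatchdays: 0,
};
```

Typing the standalone config as the full interface rather than a `Partial` turns a forgotten field into a compile error instead of a silent default.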
The Evidence
Four "independent" signals shared identical standalone results:
| Signal | standaloneN | standaloneROI |
|---|---|---|
| variance-regression | 1,092 | +2.9% |
| pass-rate-filter | 1,092 | +2.9% |
| injury-lambda | 1,092 | +2.9% |
| congestion-filter | 1,092 | +2.9% |
Same N. Same ROI. They weren't being tested independently — they were all evaluating within the same pre-filtered pool of ~1,092 bets that passed the 7% edge threshold.
After the Fix
We added minEdge: 0 and skipEarlyMatchdays: 0 to the standalone overrides. The difference is dramatic:
| Signal | OLD N (with minEdge=0.07) | NEW N (minEdge=0) | OLD ROI | NEW ROI |
|---|---|---|---|---|
| variance-regression | 5,588 | **79,047** | -4.1% | **-6.2%** |
| congestion-filter | 5,254 | **73,224** | -4.8% | **-6.0%** |
| odds-cap-2.0 | 2,614 | **29,020** | +0.2% | **-2.7%** |
| ted-base (all) | 1,968 | **21,479** | -1.4% | **-3.3%** |
The N explosion (5K → 79K) confirms the old tests were pre-filtered. No signal produces positive standalone ROI when truly isolated. The minEdge threshold was the real alpha — the signals were approximately neutral on top of it.
The Corrected Protocol
Phase 1: Pre-Registration
Unchanged. This was always the strongest part of our process.
- Write down the hypothesis, metric, and threshold before testing
- Register in data/signal-registry.json
- Never delete failed entries: they're the denominator
Phase 2: Walk-Forward Backtesting
Unchanged. Expanding training window, 7-day re-solve, 3-day embargo. The model at any point only knows what it would have known at that time.
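As a rough illustration of that schedule, the sketch below generates expanding-window folds over integer day indices. The function name, the `Fold` shape, and the choice to place the embargo between the end of training and the start of the test window are assumptions. Only the 7-day re-solve and 3-day embargo come from the protocol.

```typescript
interface Fold {
  trainStart: number; // inclusive day index
  trainEnd: number; // exclusive day index
  testStart: number; // inclusive day index
  testEnd: number; // exclusive day index
}

function walkForwardFolds(
  totalDays: number,
  minTrainDays: number,
  resolveEveryDays = 7,
  embargoDays = 3,
): Fold[] {
  const folds: Fold[] = [];
  for (
    let testStart = minTrainDays + embargoDays;
    testStart < totalDays;
    testStart += resolveEveryDays
  ) {
    folds.push({
      trainStart: 0, // expanding window: training always starts at day 0
      trainEnd: testStart - embargoDays, // embargo gap before the test window
      testStart,
      testEnd: Math.min(testStart + resolveEveryDays, totalDays),
    });
  }
  return folds;
}

// 50 days of data with a 30-day minimum training window gives three folds:
// train [0,30) test [33,40), train [0,37) test [40,47), train [0,44) test [47,50).
const folds = walkForwardFolds(50, 30);
```

Each fold trains only on days strictly before its test window minus the embargo, so nothing from the test period can leak into the solve.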
Phase 3: True Standalone Test (FIXED)
What changed: runStandaloneSignal() now sets minEdge: 0, maxOdds: 99, skipEarlyMatchdays: 0, all boolean filters off. The signal is tested against the full unfiltered bet universe — not a pre-filtered subset.
What to check:
- Config dump must show minEdge=0 (now printed automatically by test-signal.ts)
- Standalone N should be large (tens of thousands, not ~1,000)
- If standalone N ≈ 1,000, something is still pre-filtering — investigate before trusting the result
What standalone tells you: Does this signal, applied to all possible bets, select a subset with better ROI than the full pool? If standalone ROI is worse than the unfiltered baseline, the signal is destroying value.
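The comparison can be sketched as follows, assuming ROI is total net profit divided by total stake. The `SettledBet` shape, the `standaloneLift` helper, and the four toy bets are hypothetical and exist only to show the arithmetic.

```typescript
interface SettledBet {
  stake: number;
  profit: number; // net profit: positive on a win, -stake on a loss
}

// ROI as total net profit over total stake (assumed definition).
function roi(bets: SettledBet[]): number {
  const staked = bets.reduce((sum, bet) => sum + bet.stake, 0);
  const profit = bets.reduce((sum, bet) => sum + bet.profit, 0);
  return staked === 0 ? 0 : profit / staked;
}

// Compare the signal's selected subset against the full unfiltered pool.
function standaloneLift<T extends SettledBet>(pool: T[], signal: (bet: T) => boolean) {
  const selected = pool.filter(signal);
  const baselineRoi = roi(pool);
  const standaloneRoi = roi(selected);
  return { n: selected.length, baselineRoi, standaloneRoi, lift: standaloneRoi - baselineRoi };
}

// Toy data, not real results.
interface TaggedBet extends SettledBet {
  flagged: boolean;
}
const pool: TaggedBet[] = [
  { stake: 1, profit: 0.5, flagged: true },
  { stake: 1, profit: -1, flagged: true },
  { stake: 1, profit: -1, flagged: false },
  { stake: 1, profit: 1, flagged: false },
];

// Baseline ROI is -12.5%, the flagged subset is -25%, so lift is -12.5pp.
const result = standaloneLift(pool, (bet) => bet.flagged);
```

A negative lift means the signal selects a worse subset than the pool it started from.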
Phase 4: Marginal Contribution Test (NEW)
Standalone ROI alone is insufficient. A signal might have negative standalone ROI but still contribute positively when combined with other filters.
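How: runWithoutSignal() runs the full filter stack minus the target signal. Marginal ROI = base ROI - without ROI, where base ROI is the ROI of the full stack and without ROI is the ROI of the stack with the target signal removed.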
What to check:
- Marginal ROI > 0 means the signal helps when combined with the stack
- Marginal ROI < 0 means removing the signal improves the stack — disable it
- Marginal ROI ≈ 0 means the signal is neutral — keep or remove based on risk management value
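A leave-one-out computation along these lines might look like the sketch below. It is not the implementation of runWithoutSignal(): `stackRoi`, `marginalRoi`, the `Bet` fields, and the four toy bets are all hypothetical, chosen so one filter helps and one hurts.

```typescript
interface Bet {
  stake: number;
  profit: number; // net profit
  odds: number;
  congested: boolean;
}

interface NamedFilter {
  name: string;
  keep: (bet: Bet) => boolean;
}

function roi(bets: Bet[]): number {
  const staked = bets.reduce((sum, bet) => sum + bet.stake, 0);
  const profit = bets.reduce((sum, bet) => sum + bet.profit, 0);
  return staked === 0 ? 0 : profit / staked;
}

// ROI of the stack, optionally leaving one named filter out.
function stackRoi(bets: Bet[], stack: NamedFilter[], exclude?: string): number {
  const active = stack.filter((f) => f.name !== exclude);
  return roi(bets.filter((bet) => active.every((f) => f.keep(bet))));
}

// Marginal ROI = base ROI - without ROI.
function marginalRoi(bets: Bet[], stack: NamedFilter[], signal: string): number {
  return stackRoi(bets, stack) - stackRoi(bets, stack, signal);
}

const stack: NamedFilter[] = [
  { name: "odds-cap", keep: (bet) => bet.odds <= 2.0 },
  { name: "congestion", keep: (bet) => !bet.congested },
];

// Toy data, not real results.
const bets: Bet[] = [
  { stake: 1, profit: -1, odds: 1.8, congested: false },
  { stake: 1, profit: 1, odds: 1.9, congested: true },
  { stake: 1, profit: 0.5, odds: 1.5, congested: false },
  { stake: 1, profit: -1, odds: 3.0, congested: false },
];

const oddsCapMarginal = marginalRoi(bets, stack, "odds-cap"); // positive: keep
const congestionMarginal = marginalRoi(bets, stack, "congestion"); // negative: disable
```

In the toy data the congestion filter throws away a winning bet, so removing it improves the stack and its marginal ROI comes out negative.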
Current marginal contributions (26 leagues):
| Filter | Marginal | Action |
|---|---|---|
| odds-cap-2.0 | **+4.2pp** | Keep — only filter with meaningful positive contribution |
| variance-regression | -0.4pp | Neutral/slightly harmful |
| congestion-filter | -0.3pp | Harmful — confirms post-mortem |
| defiance-filter | -0.2pp | Neutral |
| no-draws | +0.0pp | Neutral |
Phase 5: Regime Stratification
Unchanged. Every signal must be tested across HFA regime, season phase, and home/away side. Edges that only work in one regime are deployed conditionally.
Phase 6: Statistical Testing
Unchanged. Bootstrap CIs (10K iterations), permutation tests between strata, Holm-Bonferroni for multiple comparisons.
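For reference, the Holm-Bonferroni step-down procedure is short enough to sketch in full. The function name and the example p-values are illustrative only and are not taken from our results.

```typescript
// Holm-Bonferroni: sort p-values ascending, compare the k-th smallest
// (0-based rank k) against alpha / (m - k), and stop at the first failure.
function holmBonferroni(pValues: number[], alpha = 0.05): boolean[] {
  const m = pValues.length;
  const order = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => a.p - b.p);
  const rejected: boolean[] = pValues.map(() => false);
  for (let rank = 0; rank < m; rank++) {
    if (order[rank].p > alpha / (m - rank)) break; // all larger p-values also fail
    rejected[order[rank].index] = true;
  }
  return rejected;
}

// Made-up p-values for four signals tested together.
// Thresholds are 0.0125, 0.0167, 0.025, 0.05; the third smallest (0.03) fails.
const decisions = holmBonferroni([0.01, 0.04, 0.03, 0.005]);
// decisions: [true, false, false, true]
```

Note that 0.04 would pass an uncorrected 0.05 threshold but is not rejected here, because the procedure stops at the first p-value that misses its adjusted threshold.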
Phase 7: IS/OOS Validation (NEW)
What changed: Signals must hold on leagues not used during development.
- In-sample (IS): Original 6 leagues (EPL, La Liga, Bundesliga, Serie A, Ligue 1, Championship)
- Out-of-sample (OOS): 20 additional leagues
- Rule: OOS ROI must be within 3pp of IS ROI. If OOS collapses, the signal is overfit.
The data-loader now supports all 26 leagues, making this test possible for the first time.
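The 3pp rule reduces to a one-line check. The sketch below reads "within 3pp" as an absolute gap in percentage points, which is an interpretation on our part, and the function name is hypothetical.

```typescript
// ROI values are in percent; the gap is measured in percentage points.
function passesOosCheck(isRoiPct: number, oosRoiPct: number, maxGapPp = 3): boolean {
  return Math.abs(isRoiPct - oosRoiPct) <= maxGapPp;
}

// Stack-level figures quoted in this post: -1.4% IS vs -3.6% OOS is a 2.2pp gap.
const stackHolds = passesOosCheck(-1.4, -3.6);

// Hypothetical signal: +2.0% IS collapsing to -2.5% OOS is a 4.5pp gap.
const signalHolds = passesOosCheck(2.0, -2.5);
```

A stricter one-sided variant, which only penalizes OOS being worse than IS, would be a reasonable alternative if OOS outperformance should not count against a signal.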
Phase 8: Live Validation + Kill Switch
Unchanged. Paper trading crons, CLV vs closing line, automatic disable below 50 bets.
New Rules (Post-Meta-Analysis)
These are hard rules, not guidelines. They exist because we got burned.
1. "Standalone" means ALL config neutralized
minEdge: 0, maxOdds: 99, skipEarlyMatchdays: 0, all boolean filters off. If any default leaks through, the test is invalid. The config dump in test-signal.ts now prints the actual values — check them.
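Reading the dump by eye is fragile, so a runtime guard is a natural complement. The sketch below is an assumption about how such a guard could look, not what test-signal.ts does. The `StandaloneConfig` shape and `findLeaks` are hypothetical, and the field names follow the override list earlier in this post.

```typescript
// Hypothetical shape of the resolved config a standalone run actually uses.
interface StandaloneConfig {
  minEdge: number;
  maxOdds: number;
  skipEarlyMatchdays: number;
  varianceFilter: boolean;
  congestionFilter: boolean;
  defianceFilter: boolean;
  noDraws: boolean;
}

// Returns a description of every setting that still filters the bet universe.
function findLeaks(config: StandaloneConfig): string[] {
  const leaks: string[] = [];
  if (config.minEdge !== 0) leaks.push(`minEdge=${config.minEdge}`);
  if (config.maxOdds < 99) leaks.push(`maxOdds=${config.maxOdds}`);
  if (config.skipEarlyMatchdays !== 0) {
    leaks.push(`skipEarlyMatchdays=${config.skipEarlyMatchdays}`);
  }
  if (config.varianceFilter) leaks.push("varianceFilter=true");
  if (config.congestionFilter) leaks.push("congestionFilter=true");
  if (config.defianceFilter) leaks.push("defianceFilter=true");
  if (config.noDraws) leaks.push("noDraws=true");
  return leaks;
}

// The pre-fix situation: booleans and odds cap disabled, two defaults inherited.
const leaks = findLeaks({
  minEdge: 0.07,
  maxOdds: 99,
  skipEarlyMatchdays: 5,
  varianceFilter: false,
  congestionFilter: false,
  defianceFilter: false,
  noDraws: false,
});
// leaks: ["minEdge=0.07", "skipEarlyMatchdays=5"]
```

A standalone run could refuse to start when the returned list is non-empty, which makes the rule self-enforcing rather than dependent on someone checking the output.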
2. Marginal ROI required for deployment
Standalone ROI is a diagnostic, not a deployment gate. To be deployed, the signal must show positive marginal ROI (leave-one-out test). If removing the signal from the stack improves ROI, the signal is harmful regardless of its standalone number.
3. OOS replication required for deployment
IS-only validation is not enough. The signal must hold on held-out leagues, and the 20 OOS leagues provide sufficient data. If OOS ROI lands more than 3pp away from IS ROI (the Phase 7 rule) or flips sign, treat the signal as overfit to the development leagues.
4. Canonical pipeline only — no ad hoc scripts
Any analysis that produces a signal-level verdict must use loadAllData() → evaluateBets() → runStandaloneSignal() / runWithoutSignal(). No loading raw bet JSON and writing custom filters. The meta-analysis v1 error happened because we bypassed the pipeline.
5. CLV for model evaluation, ROI for deployment
CLV answers "is the model good?" ROI answers "do we make money?" Our meta-analysis confirmed CLV is +11% universally (IS, OOS, all leagues). ROI is -1.4% IS and -3.6% OOS. The model is good. The execution doesn't capture the edge. These are different problems requiring different solutions.
- Model research: Optimize for CLV. Use it to evaluate changes to the solver, loss function, or data sources.
- Signal deployment: Gate on ROI. A signal with +5% CLV and -2% ROI should not be deployed. CLV says the model sees something; ROI says we can't capture it.
6. Minimum N = 1,000
No deployment decision on fewer than 1,000 bets. Below this threshold, ROI is dominated by variance. The meta-analysis confirmed this — signals validated on N=232 to N=924 showed no predictive relationship between original ROI and scaled ROI.
7. Suspicious N = investigation
If a signal's standalone N is suspiciously similar to another signal's N (within 10%), the tests may be sharing a pre-filtered pool. Investigate the config before trusting the results.
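This check is simple to automate. The sketch below assumes "within 10%" means the gap relative to the larger of the two N values. `suspiciousPairs` and the third entry in the example are hypothetical, while the two 1,092 figures are the identical pre-fix results shown earlier.

```typescript
interface StandaloneResult {
  signal: string;
  n: number;
}

// Flags every pair of signals whose standalone N differs by <= tolerance
// (relative to the larger N of the pair).
function suspiciousPairs(results: StandaloneResult[], tolerance = 0.1): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i < results.length; i++) {
    for (let j = i + 1; j < results.length; j++) {
      const larger = Math.max(results[i].n, results[j].n);
      if (larger === 0) continue; // two empty results carry no information
      const gap = Math.abs(results[i].n - results[j].n) / larger;
      if (gap <= tolerance) pairs.push([results[i].signal, results[j].signal]);
    }
  }
  return pairs;
}

const flagged = suspiciousPairs([
  { signal: "variance-regression", n: 1092 },
  { signal: "pass-rate-filter", n: 1092 },
  { signal: "hypothetical-signal", n: 40000 }, // made-up entry for contrast
]);
// flagged: [["variance-regression", "pass-rate-filter"]]
```

A flagged pair is a prompt to inspect the config, not proof of a shared pool: two broad filters can legitimately select similar numbers of bets.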
The Infrastructure Audit Checklist
Before trusting ANY signal test result, verify:
- [ ] Config dump shows minEdge=0 for standalone tests
- [ ] Standalone N is in the tens of thousands (not ~1,000)
- [ ] Marginal ROI is computed via leave-one-out (not just standalone)
- [ ] OOS leagues were tested (not just the development 6)
- [ ] The script used the canonical pipeline (loadAllData → evaluateBets → runner)
- [ ] N is not suspiciously identical to other signals
- [ ] Bootstrap CI computed with sufficient resamples (≥5,000)
- [ ] If testing multiple signals, Holm-Bonferroni was applied
What This Taught Us
The scariest kind of bug is the one that makes your results look *better* than they are. We had rigorous statistics applied on top of a pre-filtered dataset. The bootstrap CIs were correct. The p-values were real. The regime stratification was sound. But the question being answered — "does this signal help among bets that already pass a 7% edge threshold?" — was not the question we thought we were answering.
The fix was two lines of code. The damage was 10 days of false confidence in 40 signals.
Process isn't a checklist you run once. It's the infrastructure you build to catch the things you can't see. The testing protocol is only as good as the code that implements it. We're now testing the tests.