We Were Testing the Tests Wrong: The Corrected Signal Protocol

March 19, 2026 | Process | DEPLOYED

Our testing infrastructure had a bug that made every signal look better than it was. runStandaloneSignal() never removed the 7% edge threshold — so all 40 'accepted' signals were validated on a pre-filtered pool. The fix was 2 lines of code. The damage was 10 days of false confidence. Here's the corrected protocol with 7 new hard rules.

Bug Impact: 40 signals (tested on pre-filtered pool)
Fix: 2 lines (minEdge: 0, skipEarlyMatchdays: 0)
Real Signal: odds-cap (+4.2pp marginal, the only one)
New Min N: 1,000 (hard deployment threshold)

We're rewriting our testing protocol. Not because the old one was bad in principle: pre-registration, walk-forward validation, regime stratification, and bootstrap CIs were all correct. The problem was a bug in the testing infrastructure itself that made every result look better than it was.

This post covers what went wrong, the corrected mental model, and the specific guardrails we've added so it can't happen again.

The Layered Stack: Why Isolation Testing Is the Wrong Frame

Before we get to the bug, we need to reframe how we think about signal testing. We were asking "does this signal work alone?" That's the wrong question. No signal works alone. The system is a layered stack:

Layer 3: Signal filters          (variance, congestion, regime, style)
Layer 2: Core filters            (minEdge ≥ 7%, odds ≤ 2.0)
Layer 1: MI Bivariate Poisson    (produces CLV from devigged odds)

Layer 1 is the foundation. The solver produces +11% CLV across 26 leagues. That's real and universal. But +11% CLV with -6% ROI means the model sees edge it can't capture.

Layer 2 is where ROI improves. The edge threshold (only bet when CLV ≥ 7%) and odds cap (≤ 2.0) together turn -6% ROI into ~-1.4%. These aren't "signals" — they're the core filtration that makes the model's edge actionable.

Layer 3 is where individual signals live. Variance filters, congestion checks, regime multipliers. These layer on top of the core. Testing them without the core is like testing whether a roof keeps rain out without the walls.

The right test for each layer

Layer	Test	Question
1 (Model)	CLV vs closing lines	"Does the model see edge?"
2 (Core filters)	ROI with vs without each filter	"Does this filter capture more of the edge?"
3 (Signals)	Marginal ROI (leave-one-out)	"Given the core stack, does adding this signal help?"

The standalone test (all filters off, minEdge=0) is diagnostic — it shows what a signal does to the raw bet universe. But the deployment decision is always marginal: does adding this signal to the existing stack improve ROI?

Here's what that looks like for our current stack:

Component	Marginal ROI	Role
MI Bivariate Poisson	—	Foundation (+11% CLV)
minEdge ≥ 7%	~+4.6pp	Core (turns -6% into -1.4%)
odds-cap ≤ 2.0	+4.2pp	Core (strongest individual filter)
variance-regression	-0.4pp	Signal (slightly harmful)
congestion-filter	-0.3pp	Signal (harmful — confirmed in post-mortem)
defiance-filter	-0.2pp	Signal (neutral)
no-draws	+0.0pp	Signal (neutral)

The model + two core filters do all the work. The signal layer is approximately neutral. This doesn't mean signals are useless — it means we haven't found one that demonstrably helps on top of the core. Yet.

What Went Wrong

Our signal testing function runStandaloneSignal() was supposed to test a signal in isolation — disable everything else, see what the signal alone contributes. It correctly disabled the boolean filters (variance, congestion, defiance) and the odds cap. But it never removed the minEdge threshold.

Overrides applied by runStandaloneSignal():
  varianceFilter: false       ✓ disabled
  congestionFilter: false     ✓ disabled
  defianceFilter: false       ✓ disabled
  maxOdds: 99                 ✓ disabled
  noDraws: false              ✓ disabled
  minEdge: ???                ← NEVER TOUCHED (inherited 0.07 from defaults)
  skipEarlyMatchdays: ???     ← NEVER TOUCHED (inherited 5 from defaults)

The minEdge: 0.07 default means: only consider bets where our model has at least 7% edge over closing lines. That threshold does most of the work. It turns -6% ROI (all bets) into -1.4% ROI (filtered). Every signal we tested was riding on top of it.
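
A minimal sketch of how this kind of leak can happen, and one way to make it impossible at compile time. The BacktestConfig shape, the DEFAULTS object, and the helper names are assumptions for illustration; only the field names and the 0.07 / 5 defaults come from the override list above. The post does not say the real code merges overrides over defaults this way, so treat the "buggy" function as one plausible mechanism rather than the actual implementation.

```typescript
// Hypothetical config shape; field names follow the override list above.
interface BacktestConfig {
  varianceFilter: boolean;
  congestionFilter: boolean;
  defianceFilter: boolean;
  noDraws: boolean;
  maxOdds: number;
  minEdge: number;
  skipEarlyMatchdays: number;
}

// minEdge and skipEarlyMatchdays defaults are from the post.
// The other default values are assumed for illustration.
const DEFAULTS: BacktestConfig = {
  varianceFilter: true,
  congestionFilter: true,
  defianceFilter: true,
  noDraws: true,
  maxOdds: 2.0,
  minEdge: 0.07,
  skipEarlyMatchdays: 5,
};

// Leaky pattern: a partial override merged over defaults. Any key that is
// left out silently inherits its default value.
function leakyStandaloneConfig(): BacktestConfig {
  return {
    ...DEFAULTS,
    varianceFilter: false,
    congestionFilter: false,
    defianceFilter: false,
    noDraws: false,
    maxOdds: 99,
    // minEdge and skipEarlyMatchdays are never mentioned, so 0.07 and 5 survive.
  };
}

// Safer pattern: build the neutral config without spreading defaults, so the
// compiler rejects the object literal if any field is missing.
function neutralConfig(): BacktestConfig {
  return {
    varianceFilter: false,
    congestionFilter: false,
    defianceFilter: false,
    noDraws: false,
    maxOdds: 99,
    minEdge: 0,
    skipEarlyMatchdays: 0,
  };
}

type BooleanSignal =
  | "varianceFilter"
  | "congestionFilter"
  | "defianceFilter"
  | "noDraws";

// Enable exactly one boolean signal on top of the neutral config.
function standaloneConfig(signal: BooleanSignal): BacktestConfig {
  const cfg = neutralConfig();
  cfg[signal] = true;
  return cfg;
}
```

The design point is that a standalone config should be constructed from nothing, not derived from production defaults: a new default added later then causes a compile error instead of a silent pre-filter.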

The Evidence

Four "independent" signals shared identical standalone results:

Signal	standaloneN	standaloneROI
variance-regression	1,092	+2.9%
pass-rate-filter	1,092	+2.9%
injury-lambda	1,092	+2.9%
congestion-filter	1,092	+2.9%

Same N. Same ROI. They weren't being tested independently: all four were evaluated within the same pre-filtered pool of 1,092 bets that passed the 7% edge threshold.

After the Fix

We added minEdge: 0 and skipEarlyMatchdays: 0 to the standalone overrides. The difference is dramatic:

Signal	OLD N (with minEdge=0.07)	NEW N (minEdge=0)	OLD ROI	NEW ROI
variance-regression	5,588	79,047	-4.1%	-6.2%
congestion-filter	5,254	73,224	-4.8%	-6.0%
odds-cap-2.0	2,614	29,020	+0.2%	-2.7%
ted-base (all)	1,968	21,479	-1.4%	-3.3%

The N explosion (5K → 79K) confirms the old tests were pre-filtered. No signal produces positive standalone ROI when truly isolated. The minEdge threshold was the real alpha — the signals were approximately neutral on top of it.

The Corrected Protocol

Phase 1: Pre-Registration

Unchanged. This was always the strongest part of our process.

Write down the hypothesis, metric, and threshold before testing
Register in data/signal-registry.json
Never delete failed entries — they're the denominator

Phase 2: Walk-Forward Backtesting

Unchanged. Expanding training window, 7-day re-solve, 3-day embargo. The model at any point only knows what it would have known at that time.
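
As a rough sketch of the schedule described above, the function below generates expanding-window folds on a day index. The Fold shape, the function name, and the exact placement of the embargo (the last 3 days before each re-solve are dropped from training) are assumptions; the post only states the three parameters.

```typescript
// Hypothetical fold description on a 0-based day index.
interface Fold {
  trainEndDay: number;  // training uses days [0, trainEndDay)
  testStartDay: number; // bets are evaluated on days [testStartDay, testEndDay)
  testEndDay: number;
}

// Expanding window: every fold trains from day 0. The model is re-solved every
// `resolveEvery` days, and `embargo` days before the test window are excluded
// from training (assumed interpretation of "3-day embargo").
function walkForwardFolds(
  totalDays: number,
  firstSolveDay: number,
  resolveEvery: number = 7,
  embargo: number = 3
): Fold[] {
  const folds: Fold[] = [];
  for (let t = firstSolveDay; t < totalDays; t += resolveEvery) {
    folds.push({
      trainEndDay: Math.max(0, t - embargo),
      testStartDay: t,
      testEndDay: Math.min(t + resolveEvery, totalDays),
    });
  }
  return folds;
}

// Example: 50 days of data, first solve on day 30.
const exampleFolds = walkForwardFolds(50, 30);
```

Each test window starts strictly after its training window ends, which is the property that keeps the model from seeing information it would not have had at the time.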

Phase 3: True Standalone Test (FIXED)

What changed: runStandaloneSignal() now sets minEdge: 0, maxOdds: 99, and skipEarlyMatchdays: 0, and turns all boolean filters off. The signal is tested against the full unfiltered bet universe, not a pre-filtered subset.

What to check:

Config dump must show minEdge=0 (now printed automatically by test-signal.ts)
Standalone N should be large (tens of thousands, not ~1,000)
If standalone N ≈ 1,000, something is still pre-filtering — investigate before trusting the result
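
The checks above could be automated along these lines. This is a sketch, not the actual test-signal.ts logic: the ConfigDump shape, the function name, and the 10,000-bet floor (a stand-in for "tens of thousands") are assumptions. It only covers boolean signals; testing the odds cap itself would need maxOdds exempted.

```typescript
// Hypothetical shape of the config dump for a standalone run.
interface ConfigDump {
  minEdge: number;
  maxOdds: number;
  skipEarlyMatchdays: number;
  booleanFilters: { [name: string]: boolean };
}

// Returns human-readable problems; an empty array means the run looks clean.
function auditStandaloneRun(
  cfg: ConfigDump,
  targetSignal: string,
  standaloneN: number,
  minExpectedN: number = 10000 // assumed floor
): string[] {
  const problems: string[] = [];
  if (cfg.minEdge !== 0) {
    problems.push("minEdge=" + cfg.minEdge + ", expected 0");
  }
  if (cfg.skipEarlyMatchdays !== 0) {
    problems.push("skipEarlyMatchdays=" + cfg.skipEarlyMatchdays + ", expected 0");
  }
  if (cfg.maxOdds < 99) {
    problems.push("maxOdds=" + cfg.maxOdds + ", expected 99");
  }
  for (const name of Object.keys(cfg.booleanFilters)) {
    // Only the signal under test may be enabled.
    if (name !== targetSignal && cfg.booleanFilters[name]) {
      problems.push(name + " is still enabled");
    }
  }
  if (standaloneN < minExpectedN) {
    problems.push("standalone N=" + standaloneN + " is small; something may still be pre-filtering");
  }
  return problems;
}
```

Failing the run when the returned array is non-empty turns the manual "check the config dump" step into a hard gate.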

What standalone tells you: Does this signal, applied to all possible bets, select a subset with better ROI than the full pool? If standalone ROI is worse than the unfiltered baseline, the signal is destroying value.

Phase 4: Marginal Contribution Test (NEW)

Standalone ROI alone is insufficient. A signal might have negative standalone ROI but still contribute positively when combined with other filters.

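How: runWithoutSignal() runs the full filter stack minus the target signal. Marginal ROI = base ROI - without ROI, where base ROI is the ROI of the full stack and without ROI is the ROI of the stack with the target signal removed.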

What to check:

Marginal ROI > 0 means the signal helps when combined with the stack
Marginal ROI < 0 means removing the signal improves the stack — disable it
Marginal ROI ≈ 0 means the signal is neutral — keep or remove based on risk management value
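
To make the leave-one-out arithmetic concrete, here is a small self-contained sketch. It does not reproduce runWithoutSignal(); the Bet shape, the filter-as-predicate representation, and the toy bets are all assumptions made for illustration.

```typescript
// Hypothetical bet record and filter representation.
interface Bet {
  odds: number;
  edge: number;
  stake: number;
  profit: number; // net profit in stake units (negative for a loss)
}
type Filter = (bet: Bet) => boolean;
type FilterStack = { [name: string]: Filter };

// ROI = total profit / total stake.
function roi(bets: Bet[]): number {
  const staked = bets.reduce((sum, b) => sum + b.stake, 0);
  const profit = bets.reduce((sum, b) => sum + b.profit, 0);
  return staked === 0 ? NaN : profit / staked;
}

// Keep bets that pass every filter in the stack, optionally skipping one.
function applyStack(bets: Bet[], stack: FilterStack, skip?: string): Bet[] {
  const active = Object.keys(stack).filter((name) => name !== skip);
  return bets.filter((bet) => active.every((name) => stack[name](bet)));
}

// Marginal ROI = ROI(full stack) - ROI(stack without the signal).
function marginalRoi(bets: Bet[], stack: FilterStack, signal: string) {
  const base = roi(applyStack(bets, stack));
  const without = roi(applyStack(bets, stack, signal));
  return { base, without, marginal: base - without };
}

// Toy data, not real results.
const toyBets: Bet[] = [
  { odds: 1.8, edge: 0.08, stake: 1, profit: 0.8 },
  { odds: 1.9, edge: 0.09, stake: 1, profit: -1 },
  { odds: 3.0, edge: 0.1, stake: 1, profit: -1 },
  { odds: 2.5, edge: 0.02, stake: 1, profit: 1.5 },
];
const toyStack: FilterStack = {
  minEdge: (b) => b.edge >= 0.07,
  oddsCap: (b) => b.odds <= 2.0,
};
const oddsCapMarginal = marginalRoi(toyBets, toyStack, "oddsCap");
```

On the toy data the full stack keeps the first two bets (ROI about -10%), the stack without the odds cap keeps the first three (ROI about -40%), so the odds cap's marginal contribution is about +30pp. The numbers are contrived; the point is that the comparison is always between two stacks, never between a signal and nothing.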

Current marginal contributions (26 leagues):

Filter	Marginal	Action
odds-cap-2.0	+4.2pp	Keep — only filter with meaningful positive contribution
variance-regression	-0.4pp	Neutral/slightly harmful
congestion-filter	-0.3pp	Harmful — confirms post-mortem
defiance-filter	-0.2pp	Neutral
no-draws	+0.0pp	Neutral

Phase 5: Regime Stratification

Unchanged. Every signal must be tested across HFA regime, season phase, and home/away side. Edges that only work in one regime are deployed conditionally.

Phase 6: Statistical Testing

Unchanged. Bootstrap CIs (10K iterations), permutation tests between strata, Holm-Bonferroni for multiple comparisons.
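
For readers unfamiliar with the multiple-comparison step, the sketch below shows the standard Holm-Bonferroni step-down procedure. It covers only that step (not the bootstrap or permutation tests), and the function name and the 0.05 default are choices made here, not taken from the pipeline.

```typescript
// Holm-Bonferroni step-down: sort p-values ascending, compare the k-th
// smallest (0-based) against alpha / (m - k), and stop at the first failure.
// Returns one reject/keep decision per input p-value, in the original order.
function holmBonferroni(pValues: number[], alpha: number = 0.05): boolean[] {
  const m = pValues.length;
  const order = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => a.p - b.p);
  const reject: boolean[] = pValues.map(() => false);
  for (let k = 0; k < m; k++) {
    if (order[k].p > alpha / (m - k)) {
      break; // every larger p-value is also not rejected
    }
    reject[order[k].index] = true;
  }
  return reject;
}

// Four hypothetical signal p-values.
const decisions = holmBonferroni([0.01, 0.04, 0.03, 0.005]);
// decisions is [true, false, false, true]
```

In the example, 0.005 and 0.01 clear their thresholds (0.0125 and about 0.0167), 0.03 fails against 0.025, and the procedure stops there, so 0.04 is not rejected even though it is below the unadjusted 0.05.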

Phase 7: IS/OOS Validation (NEW)

What changed: Signals must hold on leagues not used during development.

In-sample (IS): Original 6 leagues (EPL, La Liga, Bundesliga, Serie A, Ligue 1, Championship)
Out-of-sample (OOS): 20 additional leagues
Rule: OOS ROI must be within 3pp of IS ROI. If OOS collapses, the signal is overfit.
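
A sketch of the 3pp rule as a check. Assumptions: results arrive as per-league totals (the LeagueResult shape is hypothetical), ROI is pooled across leagues rather than averaged per league, and "within 3pp" is read symmetrically. A one-sided reading (OOS no more than 3pp below IS) would also be defensible.

```typescript
// The six development leagues named above.
const IS_LEAGUES = ["EPL", "La Liga", "Bundesliga", "Serie A", "Ligue 1", "Championship"];

// Hypothetical per-league summary.
interface LeagueResult {
  league: string;
  staked: number;
  profit: number;
}

// Pooled ROI: total profit over total stake across the given leagues.
function pooledRoi(results: LeagueResult[]): number {
  const staked = results.reduce((sum, r) => sum + r.staked, 0);
  const profit = results.reduce((sum, r) => sum + r.profit, 0);
  return staked === 0 ? NaN : profit / staked;
}

// Any league not in IS_LEAGUES is treated as out-of-sample.
function oosCheck(results: LeagueResult[], maxGapPp: number = 3) {
  const inSample = results.filter((r) => IS_LEAGUES.indexOf(r.league) !== -1);
  const outOfSample = results.filter((r) => IS_LEAGUES.indexOf(r.league) === -1);
  const isRoi = pooledRoi(inSample);
  const oosRoi = pooledRoi(outOfSample);
  const gapPp = Math.abs(isRoi - oosRoi) * 100;
  // A NaN gap (no IS or no OOS bets) fails the check.
  return { isRoi, oosRoi, gapPp, pass: gapPp <= maxGapPp };
}
```

For example, an IS ROI of -1.5% against an OOS ROI of -4.0% is a 2.5pp gap and passes; an OOS ROI of -6.0% is a 4.5pp gap and fails.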

The data-loader now supports all 26 leagues, making this test possible for the first time.

Phase 8: Live Validation + Kill Switch

Unchanged. Paper trading crons, CLV vs closing line, automatic disable below 50 bets.

New Rules (Post-Meta-Analysis)

These are hard rules, not guidelines. They exist because we got burned.

1. "Standalone" means ALL config neutralized

minEdge: 0, maxOdds: 99, skipEarlyMatchdays: 0, all boolean filters off. If any default leaks through, the test is invalid. The config dump in test-signal.ts now prints the actual values — check them.

2. Marginal ROI required for deployment

Standalone ROI alone is not sufficient. The signal must also show positive marginal ROI (leave-one-out test). If removing the signal from the stack improves ROI, the signal is harmful regardless of its standalone number.

3. OOS replication required for deployment

IS-only validation is not enough. The signal must hold on held-out leagues. The 20 OOS leagues provide sufficient data. If OOS ROI flips sign, the signal is overfit to the development leagues.

4. Canonical pipeline only — no ad hoc scripts

Any analysis that produces a signal-level verdict must use loadAllData() → evaluateBets() → runStandaloneSignal() / runWithoutSignal(). No loading raw bet JSON and writing custom filters. The meta-analysis v1 error happened because we bypassed the pipeline.

5. CLV for model evaluation, ROI for deployment

CLV answers "is the model good?" ROI answers "do we make money?" Our meta-analysis confirmed CLV is +11% universally (IS, OOS, all leagues). ROI is -1.4% IS and -3.6% OOS. The model is good. The execution doesn't capture the edge. These are different problems requiring different solutions.

Model research: Optimize for CLV. Use it to evaluate changes to the solver, loss function, or data sources.
Signal deployment: Gate on ROI. A signal with +5% CLV and -2% ROI should not be deployed. CLV says the model sees something; ROI says we can't capture it.

6. Minimum N = 1,000

No deployment decision on fewer than 1,000 bets. Below this threshold, ROI is dominated by variance. The meta-analysis confirmed this — signals validated on N=232 to N=924 showed no predictive relationship between original ROI and scaled ROI.
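
A back-of-envelope illustration of why small N is dominated by variance. The sketch assumes independent flat-stake bets that all share the same decimal odds and win probability, which is a simplification of any real bet pool; the function names are made up here.

```typescript
// Per-unit profit of one bet is (odds - 1) with probability p and -1 otherwise,
// i.e. odds * Bernoulli(p) - 1, so its standard deviation is odds * sqrt(p(1-p)).
function betStdDev(odds: number, p: number): number {
  return odds * Math.sqrt(p * (1 - p));
}

// Standard error of the average ROI over n independent, identical bets.
function roiStandardError(n: number, odds: number, p: number): number {
  return betStdDev(odds, p) / Math.sqrt(n);
}

// Even-money bets (odds 2.0, p = 0.5) have a per-bet standard deviation of 1.
const seAt232 = roiStandardError(232, 2.0, 0.5);   // about 0.0657, i.e. 6.6pp
const seAt1000 = roiStandardError(1000, 2.0, 0.5); // about 0.0316, i.e. 3.2pp
```

Under these assumptions, one standard error on ROI is roughly 6.6pp at N=232 and 3.2pp at N=1,000. Because the error shrinks with the square root of N, halving it requires four times as many bets.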

7. Suspicious N = investigation

If a signal's standalone N is suspiciously similar to another signal's N (within 10%), the tests may be sharing a pre-filtered pool. Investigate the config before trusting the results.
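
A sketch of how this rule could be checked mechanically. The SignalResult shape and the function name are hypothetical, and "within 10%" is interpreted here as the absolute difference divided by the larger of the two N values.

```typescript
// Hypothetical summary of one standalone run.
interface SignalResult {
  signal: string;
  standaloneN: number;
}

// Returns every pair of signals whose standalone N differs by at most
// `tolerance` (relative to the larger N of the pair).
function suspiciousPairs(
  results: SignalResult[],
  tolerance: number = 0.1
): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < results.length; i++) {
    for (let j = i + 1; j < results.length; j++) {
      const a = results[i];
      const b = results[j];
      const larger = Math.max(a.standaloneN, b.standaloneN);
      if (larger === 0) continue; // two empty runs carry no information
      if (Math.abs(a.standaloneN - b.standaloneN) / larger <= tolerance) {
        pairs.push([a.signal, b.signal]);
      }
    }
  }
  return pairs;
}

// The four identical results from "The Evidence", plus one distinct N.
const flagged = suspiciousPairs([
  { signal: "variance-regression", standaloneN: 1092 },
  { signal: "pass-rate-filter", standaloneN: 1092 },
  { signal: "injury-lambda", standaloneN: 1092 },
  { signal: "congestion-filter", standaloneN: 1092 },
  { signal: "odds-cap-2.0", standaloneN: 29020 },
]);
```

With the four identical N=1,092 results this flags all six pairs among them and none involving the fifth signal. A flag is a prompt to inspect the config, not proof of a shared pool: two permissive signals can legitimately select similar numbers of bets.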

The Infrastructure Audit Checklist

Before trusting ANY signal test result, verify:

[ ] Config dump shows minEdge=0 for standalone tests
[ ] Standalone N is in the tens of thousands (not ~1,000)
[ ] Marginal ROI is computed via leave-one-out (not just standalone)
[ ] OOS leagues were tested (not just the development 6)
[ ] The script used the canonical pipeline (loadAllData → evaluateBets → runner)
[ ] N is not suspiciously close to other signals' N (within 10%)
[ ] Bootstrap CI computed with sufficient resamples (≥5,000)
[ ] If testing multiple signals, Holm-Bonferroni was applied

What This Taught Us

The scariest kind of bug is the one that makes your results look *better* than they are. We had rigorous statistics applied on top of a pre-filtered dataset. The bootstrap CIs were correct. The p-values were real. The regime stratification was sound. But the question being answered — "does this signal help among bets that already pass a 7% edge threshold?" — was not the question we thought we were answering.

The fix was two lines of code. The damage was 10 days of false confidence in 40 signals.

Process isn't a checklist you run once. It's the infrastructure you build to catch the things you can't see. The testing protocol is only as good as the code that implements it. We're now testing the tests.