The 4-Minute Signal Test: How We Explore Fast and Deploy Slow
The complete signal testing workflow, from hypothesis to approval decision in about 4 minutes. Register, explore, analyze, gate. Designed for parallel terminals. 10 automated approval gates, including per-league matchday interleave OOS, walk-forward validation, and practical significance checks. Nothing reaches production without passing.
This is how we test betting signals now. Not the theory — the actual workflow, terminal by terminal.
We redesigned the process after a meta-analysis revealed that our previous testing infrastructure had a bug causing 40 "accepted" signals to be validated on a pre-filtered pool. The corrected process is built around one principle: explore fast, deploy slow.
The 4-Minute Loop
Every signal goes through the same loop. The whole thing takes about 4 minutes if you have a hypothesis ready.
Minute 0-1: Register
/signal teams with 7+ days rest outperform AH expectations
The /signal command is the single entry point for the entire pipeline. It checks the registry, determines where the signal is in the pipeline, and runs the appropriate next step. On first invocation it asks for hypothesis, mechanism, metric, and threshold, then registers before any testing happens.
Why register first? Because if you test 20 things and pick the winner, your p-value is lying. The registry tracks the denominator — how many things you tried. Our acceptance rate is 37% (40 out of 107). Without the registry, you'd only see the 40 wins.
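To make the denominator concrete, here is a minimal sketch of a registry that records every hypothesis before any test runs. Only the id and status fields (and the "pending" value) come from the registry query shown later in this post; the other fields, the "accepted" and "rejected" values, and the function names are assumptions for illustration, not the real schema.

```typescript
// Hypothetical registry shape; only `id` and `status` are confirmed by the post.
type Status = "pending" | "accepted" | "rejected";

interface RegistryEntry {
  id: string;
  hypothesis: string;
  mechanism: string;
  metric: string;
  threshold: number;
  status: Status;
}

interface Registry {
  signals: RegistryEntry[];
}

// Register before any test runs, so every attempt counts toward the denominator.
function register(
  registry: Registry,
  entry: Omit<RegistryEntry, "status">
): RegistryEntry {
  if (registry.signals.some((s) => s.id === entry.id)) {
    throw new Error(`Signal ${entry.id} is already registered`);
  }
  const registered: RegistryEntry = { ...entry, status: "pending" };
  registry.signals.push(registered);
  return registered;
}

// Acceptance rate over every decided signal (accepted + rejected), not just the wins.
// Excluding pending entries from the denominator is an assumption of this sketch.
function acceptanceRate(registry: Registry): number {
  const decided = registry.signals.filter((s) => s.status !== "pending");
  if (decided.length === 0) return 0;
  const accepted = decided.filter((s) => s.status === "accepted").length;
  return accepted / decided.length;
}
```

With 40 accepted entries out of 107 decided, acceptanceRate returns 40/107, which rounds to the 37% quoted above.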
Minute 1-3: Explore
The evaluation runs through the canonical pipeline:
npx tsx scripts/test-signal.ts --signal=rest-days-7plus --by-league --by-season
This loads 29,977 precomputed matches across 26 leagues, applies the signal, and shows:
- Standalone: Signal alone, all other filters off, minEdge=0 (true isolation)
- Marginal: Does adding this signal to the existing stack improve ROI? (leave-one-out test)
- By league: Does it work in EPL and Serie B, or just EPL?
- By season: Stable across 2022-2025, or one-season fluke?
The /signal command summarizes all of this into a pass/fail assessment before you decide whether to proceed.
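The by-league and by-season views are group-by summaries of the same bets. A minimal sketch follows, assuming a hypothetical Bet record with league, season, stake, and profit fields; the actual output format of scripts/test-signal.ts is not shown in this post.

```typescript
// Hypothetical bet record; field names are assumptions for illustration.
interface Bet {
  league: string;
  season: string;
  stake: number;
  profit: number; // net profit in the same units as stake
}

interface Breakdown {
  n: number;   // number of bets in the group
  roi: number; // profit / stake as a fraction (0.05 = 5%)
}

function roi(bets: Bet[]): number {
  const staked = bets.reduce((sum, b) => sum + b.stake, 0);
  const profit = bets.reduce((sum, b) => sum + b.profit, 0);
  return staked === 0 ? 0 : profit / staked;
}

// Group bets by league or season and report N and ROI per group.
function breakdownBy(
  bets: Bet[],
  key: "league" | "season"
): Map<string, Breakdown> {
  const groups = new Map<string, Bet[]>();
  for (const bet of bets) {
    const group = groups.get(bet[key]) ?? [];
    group.push(bet);
    groups.set(bet[key], group);
  }
  const out = new Map<string, Breakdown>();
  groups.forEach((group, name) => {
    out.set(name, { n: group.length, roi: roi(group) });
  });
  return out;
}
```

A signal that is positive in one league and negative everywhere else shows up immediately in this view, which is the point of running it before the gate.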
Minute 3-4: Gate
If it looks promising:
npx tsx scripts/approve-signal.ts --signal=rest-days-7plus
Ten automated gates. All must pass:
| Gate | What it checks | Why |
|---|---|---|
| 1. Pre-registered | Hypothesis exists in registry | Prevents post-hoc rationalization |
| 2. True standalone | minEdge=0, all filters off | Honest isolation (the bug we fixed) |
| 3. N ≥ 1,000 | Enough bets to trust | Below this, ROI is noise |
| 4. Marginal ROI > 0 | Helps the stack | The deployment decision |
| 5. Bootstrap p < 0.10 | Statistically significant | Could this be luck? |
| 6. Matchday interleave OOS | Odd/even matchdays per league, within 3pp | No late-season bias |
| 7. No regime flip | Consistent across conditions | Doesn't blow up in certain regimes |
| 8. No suspicious N | Not riding a shared pre-filter | The exact bug we caught |
| 9. Practical significance | Marginal > +0.5pp | Worth the complexity |
| 10. Walk-forward | Positive in 2/3 season folds | Stable over time |
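Gates 6 and 10 can be sketched as below. The 3pp tolerance and the 2-of-3 fold rule come from the table; the Bet shape, the function names, pooling ROI across leagues, and treating each season as one fold are assumptions of this sketch rather than what scripts/approve-signal.ts necessarily does.

```typescript
// Hypothetical bet record; field names are assumptions for illustration.
interface Bet {
  league: string;
  season: string;
  matchday: number; // matchday number within the league's own season
  stake: number;
  profit: number;   // net profit in the same units as stake
}

function roi(bets: Bet[]): number {
  const staked = bets.reduce((sum, b) => sum + b.stake, 0);
  const profit = bets.reduce((sum, b) => sum + b.profit, 0);
  return staked === 0 ? 0 : profit / staked;
}

// Gate 6: ROI on odd and even matchdays must agree within maxGap (0.03 = 3pp).
// Interleaving by parity keeps both halves spread across the whole season.
function passesInterleave(bets: Bet[], maxGap = 0.03): boolean {
  const odd = bets.filter((b) => b.matchday % 2 === 1);
  const even = bets.filter((b) => b.matchday % 2 === 0);
  if (odd.length === 0 || even.length === 0) return false; // nothing to compare
  return Math.abs(roi(odd) - roi(even)) <= maxGap;
}

// Gate 10: ROI must be positive in at least minPositive season folds.
function passesWalkForward(bets: Bet[], minPositive = 2): boolean {
  const seasons = Array.from(new Set(bets.map((b) => b.season)));
  const positive = seasons.filter(
    (s) => roi(bets.filter((b) => b.season === s)) > 0
  ).length;
  return positive >= minPositive;
}
```

Splitting by matchday parity rather than by calendar date is what keeps the held-out half from being all late-season fixtures.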
Gate passes → signal accepted, registry updated. Next:
- Add per-signal edge delta computation to scripts/backfill-shadow-signals.ts
- Add a SIGNAL_DEFS entry in app/gauntlet/page.tsx; it automatically appears in all column dropdowns
- Run npx tsx scripts/backfill-shadow-signals.ts to backfill historical impact
- Monitor on /gauntlet for 2 weeks before flipping to live
Gate fails → you see exactly which gate and why. Fix it, try a variant, or shelve it.
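One way to get "which gate and why" out of a gate script is to have every gate return a structured result instead of a bare boolean. The sketch below covers five of the ten gates with thresholds taken from the table; the SignalStats shape and all names are hypothetical and not the real approve-signal.ts internals.

```typescript
// Hypothetical summary statistics for one signal.
interface SignalStats {
  preRegistered: boolean;
  standaloneN: number;
  marginalRoiPp: number; // marginal ROI in percentage points
  bootstrapP: number;
}

interface GateResult {
  gate: string;
  passed: boolean;
  detail: string; // human-readable reason, shown on failure
}

type Gate = (s: SignalStats) => GateResult;

// A subset of the ten gates, using the thresholds from the gate table.
const gates: Gate[] = [
  (s) => ({
    gate: "1. Pre-registered",
    passed: s.preRegistered,
    detail: s.preRegistered ? "found in registry" : "no registry entry",
  }),
  (s) => ({
    gate: "3. N ≥ 1,000",
    passed: s.standaloneN >= 1000,
    detail: `standalone N = ${s.standaloneN}`,
  }),
  (s) => ({
    gate: "4. Marginal ROI > 0",
    passed: s.marginalRoiPp > 0,
    detail: `marginal ROI = ${s.marginalRoiPp}pp`,
  }),
  (s) => ({
    gate: "5. Bootstrap p < 0.10",
    passed: s.bootstrapP < 0.1,
    detail: `p = ${s.bootstrapP}`,
  }),
  (s) => ({
    gate: "9. Practical significance",
    passed: s.marginalRoiPp > 0.5,
    detail: `marginal ROI = ${s.marginalRoiPp}pp, need > 0.5pp`,
  }),
];

// Run every gate (no short-circuit) so all failures are reported at once.
function runGates(s: SignalStats): { accepted: boolean; failures: GateResult[] } {
  const failures = gates.map((g) => g(s)).filter((r) => !r.passed);
  return { accepted: failures.length === 0, failures };
}
```

Running all gates rather than stopping at the first failure means one run tells you everything that needs fixing.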
The Layered Stack
This is the mental model behind the testing. The system has three layers:
Layer 3: Signals ← what you're testing
Layer 2: Core filters ← minEdge ≥ 7%, odds ≤ 2.0
Layer 1: MI BP Model ← produces +11% CLV
Layer 1 is the foundation. The MI Bivariate Poisson model prices match outcomes better than Pinnacle closing lines by 11% on average. This is real and universal across 26 leagues.
Layer 2 is where ROI improves. The edge threshold (only bet when edge ≥ 7%) and odds cap (≤ 2.0) turn -6% unfiltered ROI into ~-1.4%. These aren't signals; they're the core filtration.
Layer 3 is where signals live. Each one layers on top. The test isn't "does this work alone?" — it's "does adding this to the existing stack improve marginal ROI?" That's Gate 4.
The previous testing infrastructure tested signals as "Signal X + 7% edge threshold" and attributed the improvement to the signal. We fixed this. Now standalone tests use minEdge=0 (true isolation) and the deployment decision is marginal contribution (leave-one-out).
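The difference between the two tests is easiest to see in code. This is a minimal sketch assuming unit stakes and a hypothetical Candidate shape; the two core filters mirror Layer 2 above, and everything else (names, fields, the example signal) is illustrative rather than the pipeline's actual implementation.

```typescript
// Hypothetical candidate bet; field names are assumptions for illustration.
interface Candidate {
  edge: number;     // model edge as a fraction (0.07 = 7%)
  odds: number;     // decimal odds
  restDays: number; // input for the example signal
  profit: number;   // net profit per unit stake
}

type Filter = (c: Candidate) => boolean;

interface Result {
  n: number;
  roi: number;
}

// Apply every filter, then compute N and ROI assuming unit stakes.
function evaluate(pool: Candidate[], filters: Filter[]): Result {
  const bets = pool.filter((c) => filters.every((f) => f(c)));
  const profit = bets.reduce((sum, c) => sum + c.profit, 0);
  return { n: bets.length, roi: bets.length === 0 ? 0 : profit / bets.length };
}

// Layer 2 core filters: minEdge ≥ 7%, odds ≤ 2.0.
const coreFilters: Filter[] = [(c) => c.edge >= 0.07, (c) => c.odds <= 2.0];

// Standalone: the signal alone on the full pool (minEdge=0, all filters off).
function standalone(pool: Candidate[], signal: Filter): Result {
  return evaluate(pool, [signal]);
}

// Marginal (leave-one-out): ROI of the stack with the signal
// minus ROI of the same stack with the signal left out.
function marginalRoi(pool: Candidate[], stack: Filter[], signal: Filter): number {
  return evaluate(pool, [...stack, signal]).roi - evaluate(pool, stack).roi;
}
```

The old bug amounts to calling evaluate(pool, [...coreFilters, signal]) and reporting the result as the standalone number, which credits the signal with whatever the core filters had already done.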
Parallel Exploration
The process is designed for multiple terminal windows running simultaneously:
Terminal 1: /signal bookmaker consensus predicts AH direction
Terminal 2: /signal vig asymmetry toward our side hurts ROI
Terminal 3: /signal midweek matches have different AH margins
Each terminal runs independently — loadAllData() is read-only, signal tests don't share state, and the registry uses unique IDs so there are no write conflicts.
The quality guarantee is the approval gate at the end, not a bureaucratic process at the beginning. Explore aggressively. The gate catches the problems.
What Changed from v1
| Before (broken) | After (corrected) |
|---|---|
| Standalone test inherited minEdge=0.07 | minEdge=0, all filters off |
| 40 signals validated on same pre-filtered pool | Each signal tested in true isolation |
| standaloneN ≈ 1,092 for 4 different signals | standaloneN = 29K-86K (full universe) |
| No OOS requirement | Must hold on 20 held-out leagues |
| Deployment based on standalone ROI | Deployment based on marginal ROI |
| No automated gate | 10-gate approval script |
| CLV and ROI conflated | CLV for model eval, ROI for deployment |
The Wrong-Direction Protocol
When a signal shows the opposite of your hypothesis, that's not a failure — it might be the most valuable finding.
Three of our best deployed discoveries came from wrong-direction results:
- Congestion filter was REMOVING our best bets (+6.2% ROI)
- AH +0.25 line was rejected despite +9.2% ROI
- New managers were expected to hurt teams but actually helped
The /signal-test command automatically detects wrong-direction results and offers to register the reversed hypothesis. The reversed hypothesis goes through the same register → explore → gate loop.
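Wrong-direction detection reduces to comparing the sign of the observed effect with the sign the hypothesis predicted. A minimal sketch, assuming a hypothetical Hypothesis shape, a "-reversed" id suffix, and a 0.5pp minimum effect borrowed from Gate 9; how /signal-test actually decides is not documented in this post.

```typescript
type Direction = "positive" | "negative";

// Hypothetical hypothesis record; fields are assumptions for illustration.
interface Hypothesis {
  id: string;
  statement: string;
  expected: Direction; // predicted sign of the ROI effect
}

interface Finding {
  kind: "confirmed" | "wrong-direction" | "inconclusive";
  reversed?: Hypothesis; // proposed only for wrong-direction results
}

// Classify an observed effect (in percentage points) against the hypothesis.
// Effects smaller than minAbsPp are treated as noise in either direction.
function classify(h: Hypothesis, observedPp: number, minAbsPp = 0.5): Finding {
  if (Math.abs(observedPp) < minAbsPp) return { kind: "inconclusive" };
  const observed: Direction = observedPp > 0 ? "positive" : "negative";
  if (observed === h.expected) return { kind: "confirmed" };
  return {
    kind: "wrong-direction",
    reversed: {
      id: `${h.id}-reversed`,
      statement: `Reverse of: ${h.statement}`,
      expected: observed,
    },
  };
}
```

The reversed hypothesis gets its own id so it is registered and counted as a separate attempt, not as a free retry of the original.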
Current Stack Performance
After the meta-analysis and corrections (29,977 matches, 26 leagues):
| Component | Marginal ROI | Status |
|---|---|---|
| MI Bivariate Poisson | — | Foundation (+11% CLV) |
| minEdge ≥ 7% | ~+4.6pp | Core |
| odds-cap ≤ 2.0 | +4.2pp | Core (strongest filter) |
| variance-regression | -0.4pp | Deployed but neutral |
| congestion-filter | -0.3pp | Harmful (removal confirmed) |
| defiance-filter | -0.2pp | Neutral |
In-sample (IS) ROI: -1.4%. Out-of-sample (OOS) ROI: -3.6%. The model works. The edge exists. The gap is execution cost: vig, odds quality, entry timing. That's a separate workstream.
How to Get Started
# Test a new hypothesis
/signal-test [your hypothesis here]

# Retest an existing signal with corrected infrastructure
/signal-test [signal-id]

# Run the approval gate on a promising signal
npx tsx scripts/approve-signal.ts --signal=[signal-id]

# See all pending signals
jq '.signals[] | select(.status=="pending") | .id' data/signal-registry.json
The signal registry has 21 pending hypotheses waiting to be tested. The approval gate is ready. The parallel backtest runs 26 leagues in under an hour. Go find alpha.