The Gate Killed Our Darlings: How Two 'Validated' Signals Failed the Formal Process
We found two signals worth ~700 units, validated them with Monte Carlo and walk-forward, deployed them, then ran the 10-gate process. Both failed. Then we discovered the gate had a bug: it wasn't toggling the signals. We fixed it and re-ran: congestion +0.3pp (p=0.36), AH lines -0.1pp (p=0.54). Still rejected. Right answer, wrong path to get there.
Two weeks ago we mined 39 rejected experiments and found what looked like ~700 units of hidden profit. We ran Monte Carlo bootstrap (p=0.009), walk-forward validation (3/3 replicate), and even audited the settlement code. Everything checked out. We deployed two changes: we disabled the congestion filter and added line-specific confidence adjustments for Asian handicap (AH) bets.
Then we ran them through our new 10-gate approval process.
Both failed. We reverted everything.
Then we discovered the gate itself had a bug: it wasn't actually toggling the signals on and off, so it was comparing the base portfolio to itself. We fixed it and re-ran. Real results: congestion removal adds +0.3pp marginal ROI (p=0.36, not significant). AH line exclusion changes marginal ROI by -0.1pp (p=0.54), so it slightly hurts. Both still rejected, but now with honest numbers.
This is the story of how a rigorous process killed two signals that looked bulletproof, then how we found the process itself was broken, fixed it, and still got the same answer.
The Hypothesis
Signal 1: Congestion Filter Removal
Our congestion filter removes matches where either team plays 3+ times in 8 days. A deep stratification study found that congested bets actually had the *best* ROI (+6.2%) and best calibration (+2.0pp CalGap) of any rest-day bucket. The filter appeared to be removing our most profitable bets.
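To make the rule concrete, here is a minimal sketch of a "3+ matches in 8 days" check. The `Fixture` shape, the function names, and the choice of a trailing window that ends at (and includes) the match being evaluated are all assumptions for illustration; the real filter may define the window differently.

```typescript
// Hypothetical fixture shape; field names are illustrative only.
interface Fixture {
  teamId: string;
  kickoff: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True if the team plays `minMatches` or more times in the `windowDays`
// days ending at `target` (the target match itself is counted).
function isCongested(
  teamFixtures: Fixture[],
  target: Date,
  minMatches = 3,
  windowDays = 8,
): boolean {
  const windowEnd = target.getTime();
  const windowStart = windowEnd - windowDays * MS_PER_DAY;
  const inWindow = teamFixtures.filter((f) => {
    const t = f.kickoff.getTime();
    return t > windowStart && t <= windowEnd;
  });
  return inWindow.length >= minMatches;
}

// A match is filtered when either side is congested.
function shouldFilterMatch(
  homeFixtures: Fixture[],
  awayFixtures: Fixture[],
  kickoff: Date,
): boolean {
  return isCongested(homeFixtures, kickoff) || isCongested(awayFixtures, kickoff);
}
```

Removing the filter, in this sketch, amounts to never calling `shouldFilterMatch`, which is why the stratification result (congested bets showing the best ROI) made removal look attractive.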
Signal 2: AH +0.25 Line Confidence Boost
A per-line breakdown of all 15,200 AH bets revealed that the +0.25 line had +9.2% ROI on 6,360 bets — 42% of all AH bets generating 135% of total profit. Meanwhile, the +0.75 line was -16.7% ROI. We proposed boosting confidence on +0.25 bets and penalizing +0.75 bets.
Both findings were statistically significant, consistent across seasons and leagues, and passed independent validation.
The Tests That Passed
We didn't deploy blindly. Both signals went through serious validation before the first commit:
Monte Carlo Bootstrap (10,000 iterations)
| Signal | Delta | p-value (block) | 95% CI |
|---|---|---|---|
| Congestion removal | +0.6pp ROI | **0.009** | [+0.01pp, +1.2pp] |
| +0.25 line advantage | +10.9pp ROI | **<0.0001** | [+7.9pp, +14.0pp] |
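For readers unfamiliar with the method, the following is a rough sketch of how a block bootstrap on an ROI difference can be computed. It is not our actual analysis code. The grouping of bets into blocks (for example, one block per match day), the one-sided p-value defined as the share of resamples with a non-positive delta, and all function names are assumptions made for illustration.

```typescript
// One block = per-bet profits (units won per unit staked) for a group of
// correlated bets, e.g. one match day. The grouping rule is an assumption.
type Block = number[];

// Small seedable PRNG (mulberry32) so that runs are reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Flat-stake ROI across all bets in the given blocks.
function roi(blocks: Block[]): number {
  let profit = 0;
  let bets = 0;
  for (const block of blocks) {
    for (const p of block) {
      profit += p;
      bets += 1;
    }
  }
  return bets === 0 ? 0 : profit / bets;
}

// Resample whole blocks with replacement, keeping within-block correlation.
function resampleBlocks(blocks: Block[], rng: () => number): Block[] {
  return blocks.map(() => blocks[Math.floor(rng() * blocks.length)]!);
}

interface BootstrapResult {
  observedDelta: number;
  pValue: number; // share of resamples where treatment ROI <= control ROI
  ci95: [number, number];
}

function blockBootstrapDelta(
  treatment: Block[],
  control: Block[],
  iterations = 10000,
  rng: () => number = mulberry32(42),
): BootstrapResult {
  const deltas: number[] = [];
  for (let i = 0; i < iterations; i++) {
    deltas.push(roi(resampleBlocks(treatment, rng)) - roi(resampleBlocks(control, rng)));
  }
  deltas.sort((x, y) => x - y);
  const pValue = deltas.filter((d) => d <= 0).length / iterations;
  const lo = deltas[Math.floor(0.025 * iterations)]!;
  const hi = deltas[Math.min(iterations - 1, Math.ceil(0.975 * iterations))]!;
  return { observedDelta: roi(treatment) - roi(control), pValue, ci95: [lo, hi] };
}
```

Note what this kind of test compares: one bucket of bets against another. It says nothing about what happens when the signal is switched on inside the full filter stack, which is the distinction that matters later.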
Walk-Forward Hold-Out (train on early seasons, validate on later)
| Signal | Config A | Config B | Config C |
|---|---|---|---|
| Congestion | Replicates | Replicates | Replicates |
| +0.25 line | Replicates | Replicates | Replicates |
Probability of Ruin
The proposed portfolio nearly doubled the Sharpe ratio (2.48 to 4.69) and dropped ruin probability from 1.05% to 0.05%.
These results looked airtight. We deployed.
The Test That Failed
Then we ran approve-signal.ts — the 10-gate formal approval process that was built *after* our initial deployment. This is the new canonical pipeline that evaluates signals as marginal contributions to the full stack, not in isolation.
Congestion filter removal: 5/10 gates passed
| Gate | Result | Verdict |
|---|---|---|
| 1. Pre-registered | PASS | |
| 2. True standalone | PASS | N=87,210, ROI=-5.9%, CLV=+5.6% |
| 3. Minimum N | PASS | 87,210 >> 1,000 |
| **4. Marginal ROI** | **FAIL** | **+0.0pp (base -3.0% with or without)** |
| **5. Bootstrap marginal** | **FAIL** | **p=0.50** |
| 6. OOS interleave | PASS | gap 2.8pp |
| 7. Regime stratification | PASS | |
| **8. Suspicious N** | **FAIL** | Similar to 2 other signals |
| **9. Practical significance** | **FAIL** | +0.0pp < +0.5pp threshold |
| **10. Walk-forward** | **FAIL** | 2/4 folds positive (2024, 2025 negative) |
AH +0.25 line: identical results. 5/10 gates passed, same failures.
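The critical gate is Gate 4: Marginal ROI. It asks: "Does adding this signal to the existing filter stack improve ROI?" The answer for both signals: no. Zero. The base portfolio produces -3.0% ROI with the full stack. Adding or removing either signal doesn't move that number. These figures come from the first gate run, before the toggle fix; the corrected marginals (+0.3pp and -0.1pp) still fall short of the +0.5pp practical-significance threshold.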
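As an illustration of how threshold-style gates like 9 and 10 can be expressed, here is a small sketch. It is not the contents of approve-signal.ts. The +0.5pp threshold comes from the table above; the `minPositiveShare` default for walk-forward and the rule that every gate must pass for approval are assumptions, since this post does not spell them out.

```typescript
interface GateResult {
  gate: string;
  pass: boolean;
  detail: string;
}

// Gate 9: the marginal ROI must clear a practical threshold (in pp).
function practicalSignificance(marginalPp: number, thresholdPp = 0.5): GateResult {
  return {
    gate: "9. Practical significance",
    pass: marginalPp >= thresholdPp,
    detail: `${marginalPp.toFixed(1)}pp vs ${thresholdPp.toFixed(1)}pp threshold`,
  };
}

// Gate 10: enough walk-forward folds must show a positive marginal ROI.
// The required share of positive folds is a hypothetical default.
function walkForward(foldDeltasPp: number[], minPositiveShare = 0.75): GateResult {
  const positive = foldDeltasPp.filter((d) => d > 0).length;
  const total = foldDeltasPp.length;
  return {
    gate: "10. Walk-forward",
    pass: total > 0 && positive / total >= minPositiveShare,
    detail: `${positive}/${total} folds positive`,
  };
}

// Assumed approval rule: a signal is approved only if every gate passes.
function approve(results: GateResult[]): boolean {
  return results.every((r) => r.pass);
}
```

Under this sketch, a marginal of +0.3pp fails Gate 9 no matter how small its p-value is, which is the point of a practical-significance gate: statistical and economic relevance are checked separately.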
Why Did the Earlier Tests Pass But the Gate Fail?
Three differences between our earlier analysis and the canonical pipeline:
1. Standalone vs. Marginal
Our earlier Monte Carlo and walk-forward tests measured the signal in isolation — comparing congested-only bets to non-congested bets, or +0.25 bets to other lines. These showed real differences.
But the approval gate measures marginal contribution to the stack. When you add the congestion signal on top of the existing minEdge filter, variance filter, pass-rate filter, and all the other deployed signals, the incremental value is zero. The stack already captures whatever the congestion signal was finding.
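A marginal test can be sketched as running the full stack twice, once with the signal off and once with it on, and comparing portfolio ROI. The sketch below also includes a guard against the toggle bug described earlier, where the gate ended up comparing the base portfolio to itself. The `Bet` shape, the `runStack` callback, and the fingerprint check are hypothetical and only meant to show the idea.

```typescript
interface Bet {
  id: string;
  stake: number;
  profit: number; // net profit in units (negative for a loss)
}

function portfolioRoi(bets: Bet[]): number {
  const staked = bets.reduce((sum, b) => sum + b.stake, 0);
  const profit = bets.reduce((sum, b) => sum + b.profit, 0);
  return staked === 0 ? 0 : profit / staked;
}

// Order-independent summary of which bets were placed and at what stake.
function fingerprint(bets: Bet[]): string {
  return bets
    .map((b) => `${b.id}:${b.stake}`)
    .sort()
    .join("|");
}

interface MarginalResult {
  baseRoi: number;
  withSignalRoi: number;
  deltaPp: number;
}

// `runStack` is a hypothetical hook that runs the full filter stack with the
// candidate signal switched off (false) or on (true).
function marginalRoi(runStack: (signalEnabled: boolean) => Bet[]): MarginalResult {
  const base = runStack(false);
  const withSignal = runStack(true);

  // If toggling changes neither the selected bets nor their stakes, the
  // comparison is base-vs-base and any "+0.0pp" result is meaningless.
  if (fingerprint(base) === fingerprint(withSignal)) {
    throw new Error("Toggling the signal did not change the portfolio; check the toggle wiring.");
  }

  const baseRoi = portfolioRoi(base);
  const withSignalRoi = portfolioRoi(withSignal);
  return { baseRoi, withSignalRoi, deltaPp: (withSignalRoi - baseRoi) * 100 };
}
```

Failing loudly is a deliberate choice in this sketch: a signal that truly changes nothing and a toggle that is not wired up look identical in the output, so an exact match between the two portfolios is worth investigating rather than reporting as a zero marginal.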
This is the "last mile" problem. A signal can be statistically real in isolation and still add nothing to a system whose existing filters already capture the same information.
2. 15K bets vs. 87K bets
Our earlier analysis used 15,200 AH-only bets from the original backtest. The approval gate uses 87,210 bets across all markets from the 26-league canonical dataset. The larger, more diverse dataset diluted the effects that looked strong in the narrower sample.
3. ROI vs. CLV
Our earlier work focused heavily on ROI differences. But the approval gate revealed something more fundamental: CLV is +11% across the board regardless of which signals are active. The model genuinely beats the closing line. The problem is converting that CLV to ROI — and neither of these signals helps with that conversion.
What We Actually Learned
The Model Works. The Conversion Doesn't.
This is the biggest takeaway. Across 87,210 bets:
| Metric | Standalone | With Filters |
|---|---|---|
| CLV | +5.6% | +11.0% |
| ROI | -5.9% | -3.0% |
| Gap | 11.6pp | 14.0pp |
The model finds genuine edge. The filters concentrate it. But there's a 14 percentage point gap between "the model is right" (CLV) and "we make money" (ROI). Every signal we've tested — 107 hypotheses across two weeks — adds zero to the ROI side. The bottleneck is not signal selection. It's CLV-to-ROI conversion.
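To pin down what the gap measures, here is a small sketch of the two metrics side by side. The CLV definition used here (odds taken divided by closing odds, minus one, weighted by stake) is a common convention and an assumption about our pipeline, as are the `SettledBet` field names; whether closing odds are de-vigged first is left open.

```typescript
interface SettledBet {
  stake: number;
  oddsTaken: number; // decimal odds at bet time
  closingOdds: number; // decimal odds at market close
  payout: number; // total returned, 0 for a losing bet
}

// Stake-weighted closing line value: how much better our price was than the close.
function clv(bets: SettledBet[]): number {
  const staked = bets.reduce((sum, b) => sum + b.stake, 0);
  const weighted = bets.reduce(
    (sum, b) => sum + b.stake * (b.oddsTaken / b.closingOdds - 1),
    0,
  );
  return staked === 0 ? 0 : weighted / staked;
}

// Realized return on investment: net profit per unit staked.
function roi(bets: SettledBet[]): number {
  const staked = bets.reduce((sum, b) => sum + b.stake, 0);
  const returned = bets.reduce((sum, b) => sum + b.payout, 0);
  return staked === 0 ? 0 : (returned - staked) / staked;
}

// The conversion gap in percentage points: price edge minus realized return.
function clvRoiGapPp(bets: SettledBet[]): number {
  return (clv(bets) - roi(bets)) * 100;
}
```

With the filtered figures from the table, +11.0% CLV and -3.0% ROI, the gap is 11.0 - (-3.0) = 14.0pp. CLV depends only on prices, while ROI depends on outcomes, so the two can diverge for a long time on variance alone or permanently for structural reasons.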
Sharp Markets Convert Better
The approval gate's informational tier analysis showed:
| Odds Quality | N | ROI | CLV |
|---|---|---|---|
| Sharp | 2,103 | **-1.2%** | +11.3% |
| Medium | 1,942 | -5.6% | +11.1% |
| Soft | 2,561 | -2.4% | +11.1% |
CLV is nearly identical across tiers. But sharp-market leagues convert 1.2pp better than soft-market leagues (-1.2% vs. -2.4% ROI) and 4.4pp better than medium-quality leagues (-5.6%). The edge is the same; the extraction rate differs. This is a capital allocation question, not a modeling question.
The Process Worked Exactly As Designed
We deployed two signals based on compelling standalone evidence. Then the approval gate caught them. This is the system working. Better to catch false positives at the gate than in the live P&L.
The old process would have left these deployed permanently. The new process caught them in hours.
What We Reverted
Both changes, fully rolled back:
- Congestion filter re-enabled in ted-filters.ts. Matches with 3+ games in 8 days are filtered again.
- AH line confidence adjustments removed from decision-table.ts. No line-specific confidence modifiers. AH line parsing removed from picks-engine.ts.
Both signals updated to "rejected" in the signal registry with the gate failure details.
New Opportunities Registered
The analysis wasn't wasted. Two new hypotheses emerged from the gate output:
1. Odds Quality Routing (odds-quality-routing)
Sharp leagues post a 1.2pp better ROI than soft leagues (-1.2% vs. -2.4%), despite near-identical CLV. If we route full stake to sharp leagues and reduce stake on soft leagues, we might close part of the 14pp gap. This is a capital allocation signal, not an edge signal.
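A back-of-envelope way to size this hypothesis is to treat blended ROI as a stake-weighted average of the per-tier ROIs from the odds-quality table. The sketch below does that. It assumes each tier's historical ROI would stay the same under a different staking plan, and the `ROUTED` multipliers are placeholders rather than proposed values.

```typescript
type OddsQuality = "sharp" | "medium" | "soft";

interface TierStats {
  n: number; // number of bets in the tier
  roiPct: number; // historical ROI in percent
}

// N and ROI per tier, copied from the odds-quality table.
const TIERS: Record<OddsQuality, TierStats> = {
  sharp: { n: 2103, roiPct: -1.2 },
  medium: { n: 1942, roiPct: -5.6 },
  soft: { n: 2561, roiPct: -2.4 },
};

// Stake multiplier applied to every bet in a tier.
type StakePlan = Record<OddsQuality, number>;

const FLAT: StakePlan = { sharp: 1, medium: 1, soft: 1 };
// Placeholder multipliers for illustration only.
const ROUTED: StakePlan = { sharp: 1, medium: 0.5, soft: 0.5 };

// Stake-weighted average ROI (in percent) under a given plan.
function blendedRoiPct(
  plan: StakePlan,
  tiers: Record<OddsQuality, TierStats> = TIERS,
): number {
  let staked = 0;
  let weightedRoi = 0;
  for (const q of Object.keys(tiers) as OddsQuality[]) {
    const units = tiers[q].n * plan[q];
    staked += units;
    weightedRoi += units * tiers[q].roiPct;
  }
  return staked === 0 ? 0 : weightedRoi / staked;
}
```

With `FLAT`, the blend works out to roughly -3.0%, which matches the base portfolio figure. Because the result is a weighted average, no staking plan in this model can do better than the best tier's -1.2%, so routing on its own could only narrow the gap, which is why the structural investigation below is registered alongside it.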
2. CLV-ROI Gap Structural Investigation (clv-roi-gap-structural-investigation)
The 14pp gap between +11% CLV and -3% ROI is the central question. Every signal we've tested adds zero marginal ROI. The bottleneck is conversion, not selection. Candidate root causes:
- Systematic odds staleness between model solve and market close
- Vig structure absorbing edge asymmetrically by line/market
- Model overconfidence at specific probability ranges
- Correlated same-day losses amplifying drawdowns
If we can close even 5pp of that 14pp gap, the portfolio becomes profitable. This is now the highest-priority investigation.
The Meta-Lesson
We went through four phases in 48 hours:
- Discovery — mined rejections, found compelling patterns
- Validation — Monte Carlo, walk-forward, settlement audit — all passed
- Deployment — shipped two changes with evidence
- Formal gate — 10-gate process killed both signals
The temptation is to feel like we wasted two days. We didn't. We learned that:
- Standalone significance is not deployment significance. A signal can be real and still add nothing to the stack.
- The approval gate is the only test that matters for deployment. Everything else is exploratory.
- The real problem isn't finding edges. The model finds +11% CLV. The real problem is converting that edge to profit. That's where the next breakthrough will come from.
The process is painful. It killed two signals we were excited about. But it also prevented us from running a production system with changes that provably add zero value. That's exactly what it's for.