Your Rejected Experiments Are a Gold Mine: How We Found ~700u Hiding in Plain Sight
We mined 39 rejected experiments and found two signals worth ~700u. One was a filter actively removing our best bets. The other was dismissed because the hypothesis was backwards. Both survived Monte Carlo bootstrap and walk-forward validation — but later failed the 10-gate approval process. See follow-up: 'The Gate Killed Our Darlings.'
Last week we mined our 39 rejected model experiments and found two signals worth an estimated ~700 units of profit across three seasons of backtest data. One was an active filter that was *removing* our best bets. The other was a rejected finding that generated 135% of all Asian Handicap (AH) profit, dismissed because the hypothesis was written backwards.
Both passed Monte Carlo bootstrap validation (p=0.009 and p<0.0001) and walk-forward hold-out testing (3/3 out-of-sample replications). They're now deployed.
This post covers what we found, how we missed it, and what we changed so it doesn't happen again.
The Discovery
Signal 1: The Congestion Filter Was Removing Our Best Bets
Back on March 8th, we deployed a congestion filter as part of our Ted Knutson-inspired filter stack. The logic was intuitive: teams playing 3+ matches in 8 days are fatigued, fatigue makes outcomes unpredictable, so skip those matches. The filter was accepted with the note "Part of Ted's core filter set" and a copy-pasted baseline ROI number that had nothing to do with congestion.
Nobody ever ran a congestion-specific backtest.
When we finally did, the data said the exact opposite:
| Rest Bucket | N | ROI | CLV | CalGap |
|---|---|---|---|---|
| **Congested (<=3 days)** | **1,917** | **+6.2%** | **+6.5%** | **2.0pp** |
| Normal (4-6 days) | 4,288 | +0.7% | +6.6% | 5.1pp |
| Rested (7+ days) | 8,995 | +3.1% | +6.4% | 3.7pp |
Congested teams had the *highest* ROI and the *lowest* calibration gap (meaning our model is most accurate on these matches). The filter had been live for 10 days, quietly discarding our best-performing bets.
But why? The first question everyone asks: isn't congestion just a proxy for team quality? Elite teams in the Champions League play more fixtures, so "congested" really means "good team."
We tested this directly:
| Group | N | ROI |
|---|---|---|
| Congested + Top-6 team | 429 | +10.0% |
| Congested + Non-Top-6 team | 1,488 | **+5.2%** |
| Rested + Top-6 team | 3,343 | +13.4% |
| Rested + Non-Top-6 team | 9,940 | -1.4% |
Non-Top-6 congested teams still show +5.2% ROI. The effect isn't purely quality — the market genuinely overadjusts for fatigue. When a mid-table team plays three games in a week, the bookmakers shade the odds assuming they'll be gassed. Our model, which doesn't know about rest days, just sees the underlying team strength — and that disagreement creates value.
Consistency: 3/3 seasons positive. 13/17 leagues positive. Championship (+14.0%) and EPL (+16.4%) particularly strong.
Signal 2: The +0.25 AH Line Generates 135% of All Profit
On March 18th, we tested whether specific Asian Handicap line values are systematically mispriced. The hypothesis predicted quarter lines would be *worse* — harder for the Poisson grid to handle.
The test found the exact opposite: quarter lines were the *best* performing subset. The verdict was written as "REJECTED (opposite of hypothesis). No action needed."
Case closed. Except it shouldn't have been.
When we finally ran the per-line breakdown instead of grouping all quarter lines together, the picture was dramatic:
| AH Line | N | ROI | CalGap | P&L |
|---|---|---|---|---|
| **+0.25** | **6,360** | **+9.2%** | **2.2pp** | **+584u** |
| **-0.75** | **1,877** | **+12.8%** | **-2.0pp** | **+240u** |
| +0.5 | 2,713 | -2.7% | 5.5pp | -73u |
| +0.75 | 933 | **-16.7%** | **11.2pp** | -156u |
| -0.25 | 660 | **-20.7%** | **12.0pp** | -137u |
The +0.25 line alone, 42% of all AH bets, generates 135% of total AH profit. A share above 100% is possible because the remaining lines lose money in aggregate. Meanwhile, the +0.75 line is a money pit at -16.7% ROI with an 11.2 percentage point calibration gap (meaning the hit rate is about 11 points lower than the model predicted).
Both are "quarter lines." Grouping them masked a +9.2% vs -16.7% split.
The mechanism: Our vig analysis revealed that the model has a 7.1 percentage point average edge on quarter lines vs 5.5pp on half lines and 3.2pp on integer lines. The model's probability estimates are structurally better calibrated at the +0.25 boundary. The CalGap data confirms this: 2.2pp at +0.25 (model barely overconfident) vs 11.2pp at +0.75 (model catastrophically overconfident).
The +0.25 advantage is consistent across all 3 seasons (+6.5%, +10.8%, +10.3%), 18 of 19 leagues, and both home and away sides.
The Validation
Finding a pattern in historical data is easy. Finding one that's real is hard. Before deploying anything, we ran two independent validation methods.
Monte Carlo Bootstrap (10,000 iterations)
We resampled the full bet pool with replacement 10,000 times and recomputed the ROI delta each time. This tells us: if the effect were just random noise, how often would we see a delta this large by chance?
Congestion filter removal:
- Observed delta: +0.6pp ROI improvement
- i.i.d. bootstrap p-value: 0.023
- Block bootstrap p-value (preserving temporal correlation): 0.009
- 95% CI: [+0.01pp, +1.2pp]
The block bootstrap resamples in 30-day blocks, which preserves the fact that bets on the same matchday aren't independent, so it is the more appropriate of the two tests for this data. Under it, p=0.009, well below the 0.05 threshold.
AH +0.25 line advantage:
- Observed delta: +10.9pp over non-+0.25 bets
- p-value: <0.0001 (both methods)
- 95% CI: [+7.9pp, +14.0pp]
Not close. The CI doesn't come near zero.
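The block bootstrap described above can be sketched roughly as follows. This is a minimal illustration, not the code we ran: the `Bet` shape, the definition of the delta (ROI with congested bets included minus ROI with them filtered out), the one-sided p-value (share of resamples where the delta is zero or negative), and the percentile interval are all assumptions made for the example. The 30-day block length and the idea of resampling whole blocks come from the description above.

```typescript
// Hypothetical bet record; field names are assumptions, not the real schema.
interface Bet {
  day: number;        // days since the start of the sample
  profit: number;     // profit in units on a flat 1u stake
  congested: boolean; // true if the congestion filter would have removed the bet
}

// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// ROI on flat 1u stakes: total profit divided by number of bets.
function roi(bets: Bet[]): number {
  if (bets.length === 0) return 0;
  return bets.reduce((sum, b) => sum + b.profit, 0) / bets.length;
}

// Assumed delta: ROI with congested bets kept minus ROI with them filtered out.
function roiDelta(bets: Bet[]): number {
  return roi(bets) - roi(bets.filter((b) => !b.congested));
}

function blockBootstrap(
  bets: Bet[],
  iterations: number,
  blockDays: number,
  rng: () => number,
): { pValue: number; ci95: [number, number] } {
  // Group bets into contiguous calendar blocks so same-period bets stay together.
  const byBlock = new Map<number, Bet[]>();
  for (const b of bets) {
    const key = Math.floor(b.day / blockDays);
    const list = byBlock.get(key) ?? [];
    list.push(b);
    byBlock.set(key, list);
  }
  const blocks = Array.from(byBlock.values());

  const deltas: number[] = [];
  for (let i = 0; i < iterations; i++) {
    // Draw as many blocks as the original sample has, with replacement.
    const sample: Bet[] = [];
    for (let j = 0; j < blocks.length; j++) {
      const pick = blocks[Math.floor(rng() * blocks.length)] ?? [];
      for (const b of pick) sample.push(b);
    }
    deltas.push(roiDelta(sample));
  }

  deltas.sort((x, y) => x - y);
  const at = (q: number): number =>
    deltas[Math.min(deltas.length - 1, Math.floor(q * deltas.length))] ?? NaN;
  // One-sided p-value: share of resamples where the improvement disappears.
  const pValue = deltas.filter((d) => d <= 0).length / deltas.length;
  return { pValue, ci95: [at(0.025), at(0.975)] };
}
```

Calling `blockBootstrap(bets, 10000, 30, mulberry32(1))` would mirror the settings described above. Setting `blockDays` to a value smaller than the gap between any two bets makes every bet its own block, which approximates the i.i.d. variant.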
Risk analysis — probability of ruin simulation:
| Metric | Current Portfolio | Proposed Portfolio |
|---|---|---|
| Ruin probability | 1.05% | **0.05%** |
| Mean max drawdown | 36.9% | **26.6%** |
| Sharpe ratio | 2.48 | **4.69** |
| Mean final bankroll (100u start) | 393u | **701u** |
The proposed portfolio (congestion included, line-weighted staking) nearly doubles the Sharpe ratio and drops ruin probability by 95%.
Walk-Forward Hold-Out
The toughest test: train on early seasons, validate on a season the model has never seen. If the effect is overfit, it dies here.
We ran three configurations:
- A: Train on 2022-23, validate on 2023-24
- B: Train on 2022-23 + 2023-24, validate on 2024-25
- C: Train on 2023-24, validate on 2024-25
Congestion — replicates 3/3:
| Config | Train Delta | Validation Delta |
|---|---|---|
| A | +1.7pp | **+1.2pp** |
| B | +1.5pp | **+11.2pp** |
| C | +1.2pp | **+11.2pp** |
The 2024-25 validation shows the effect *strengthening* out of sample. Not what overfitting looks like.
+0.25 line — replicates 3/3:
| Config | Train +0.25 ROI | Validation +0.25 ROI |
|---|---|---|
| A | +6.5% | **+10.8%** |
| B | +8.6% | **+10.3%** |
| C | +10.8% | **+10.3%** |
Danger lines — all replicate negative 3/3:
- +0.75: train -13% to -21%, validation -13% to -17%
- -0.25: train -13% to -25%, validation -13% to -23%
Every line's direction held out of sample. The good lines stayed good. The bad lines stayed bad.
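The replication check in this section can be sketched as below. This is an illustration under stated assumptions: the `SeasonBet` shape and the `inSignal` flag are hypothetical, and "replicates" is taken to mean that the signal's ROI has the same sign in the training and validation seasons. Nothing is refit here; the sketch only compares the signal's direction across the three season splits listed above.

```typescript
// Hypothetical record: season label, unit-stake profit, and a signal flag.
interface SeasonBet {
  season: string;
  profit: number;    // profit in units on a flat 1u stake
  inSignal: boolean; // e.g. the bet is on the +0.25 line
}

interface Split {
  name: string;
  train: string[];
  validate: string[];
}

// The three configurations described in this section.
const splits: Split[] = [
  { name: "A", train: ["2022-23"], validate: ["2023-24"] },
  { name: "B", train: ["2022-23", "2023-24"], validate: ["2024-25"] },
  { name: "C", train: ["2023-24"], validate: ["2024-25"] },
];

// ROI of the signal subset within the given seasons (0 if there are no bets).
function signalRoi(bets: SeasonBet[], seasons: string[]): number {
  const subset = bets.filter(
    (b) => b.inSignal && seasons.indexOf(b.season) !== -1,
  );
  if (subset.length === 0) return 0;
  return subset.reduce((sum, b) => sum + b.profit, 0) / subset.length;
}

interface WalkForwardRow {
  name: string;
  trainRoi: number;
  validationRoi: number;
  replicates: boolean;
}

function walkForward(bets: SeasonBet[], configs: Split[]): WalkForwardRow[] {
  return configs.map((c) => {
    const trainRoi = signalRoi(bets, c.train);
    const validationRoi = signalRoi(bets, c.validate);
    // Assumed criterion: same non-zero sign in and out of sample.
    return {
      name: c.name,
      trainRoi,
      validationRoi,
      replicates: trainRoi * validationRoi > 0,
    };
  });
}
```

A sign-only criterion is deliberately loose. A stricter version could also require the validation ROI to clear a minimum size or sample count before counting it as a replication.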
The Settlement Audit
A concern was raised that the backtest might be settling quarter-line bets incorrectly — treating half-wins as full wins and half-losses as full losses. If true, this would inflate the +0.25 finding.
We read the actual code. The settleAH() function in backtest-v2.ts already handles all five cases correctly: full win, half win (profit = half of the full-win profit), push, half lose (profit = -0.5), and full lose. The bug didn't exist. The concern was generated from reasoning about what *might* be wrong without checking the implementation.
This led to its own process rule (more on that below).
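For readers unfamiliar with quarter-line settlement, the five outcomes can be sketched like this. It is not the settleAH() function from backtest-v2.ts; `settleAhSketch` and its helpers are hypothetical names, and the sketch assumes decimal odds, a 1u stake, and a goal difference measured from the backed side's perspective. It uses the standard convention that a quarter line is two half-stake bets on the adjacent lines.

```typescript
type Leg = "win" | "push" | "lose";

// Settle a single whole or half line. goalDiff is backed side minus opponent.
function settleLeg(goalDiff: number, line: number): Leg {
  const margin = goalDiff + line;
  if (margin > 0) return "win";
  if (margin < 0) return "lose";
  return "push";
}

// Profit in units for one leg on a 1u stake at decimal odds.
function legProfit(leg: Leg, odds: number): number {
  if (leg === "win") return odds - 1;
  if (leg === "lose") return -1;
  return 0;
}

// Profit in units on a 1u stake. Lines are assumed to be multiples of 0.25.
function settleAhSketch(line: number, goalDiff: number, odds: number): number {
  const isQuarter = Math.abs(line * 4) % 2 === 1;
  if (!isQuarter) return legProfit(settleLeg(goalDiff, line), odds);
  // A quarter line splits the stake evenly across the two adjacent lines.
  const lower = legProfit(settleLeg(goalDiff, line - 0.25), odds);
  const upper = legProfit(settleLeg(goalDiff, line + 0.25), odds);
  return 0.5 * lower + 0.5 * upper;
}
```

With odds of 2.0, a draw on the +0.25 line returns +0.5 (half win) and a draw on the -0.25 line returns -0.5 (half lose). Treating those as a full win and a full loss is the error the bug report described.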
How Did We Miss This?
Three systematic failures. All are now fixed in our Signal Testing Protocol.
Failure 1: Hypothesis-Confirmation Bias
Our signal testing process requires pre-registering a directional hypothesis before running any test. This is good practice — it prevents p-hacking and data dredging. But it created a blind spot.
When the congestion test found "congestion helps" instead of "congestion hurts," it was marked REJECTED and the case was closed. When the AH line test found "quarter lines are best" instead of "quarter lines are worst," same thing — REJECTED, no action.
The problem: a wrong-direction result is more interesting than no signal. It means the market has the opposite bias from what we expected. That's literally an exploitable mispricing. But our process treated "hypothesis wrong" as equivalent to "no signal found."
We had six wrong-direction rejections in total. Two contained gold. The other four were correctly dismissed after investigation (flat CLV, tiny effects). But we would never have known without looking.
New rule (Protocol Rule 6): When a test finds the opposite of the hypothesis, this is NOT a rejection — it's a discovery. The opposite hypothesis must be registered and tested with full stratification before closing.
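One way to encode this rule is to give wrong-direction results their own verdict instead of folding them into "rejected". The sketch below is hypothetical: the verdict names and the choice to decide direction from a confidence interval on the effect are assumptions for illustration, not the protocol's actual implementation.

```typescript
type Direction = "positive" | "negative";
type Verdict = "CONFIRMED" | "OPPOSITE_DIRECTION" | "NO_SIGNAL";

// Classify a test result from the pre-registered direction and the
// confidence interval of the observed effect (e.g. an ROI delta in pp).
function classifyResult(
  hypothesis: Direction,
  ciLow: number,
  ciHigh: number,
): Verdict {
  // An interval that spans zero gives no evidence of direction either way.
  if (ciLow <= 0 && ciHigh >= 0) return "NO_SIGNAL";
  const observed: Direction = ciLow > 0 ? "positive" : "negative";
  return observed === hypothesis ? "CONFIRMED" : "OPPOSITE_DIRECTION";
}

// Only a genuine null result closes the case without follow-up.
function requiresFollowUp(verdict: Verdict): boolean {
  return verdict !== "NO_SIGNAL";
}
```

Under this scheme a test that predicted a negative effect but measured a clearly positive one returns `OPPOSITE_DIRECTION`, which triggers the follow-up registration rather than closing the experiment.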
Failure 2: Aggregation Masked the Signal
The AH line test grouped the quarter lines (+0.25, -0.25, +0.75, and -0.75) together and reported the average. That average was +5.4% ROI, positive and interesting, but the verdict said "no action needed" because the hypothesis direction was wrong.
Hidden inside that average: +0.25 at +9.2% and +0.75 at -16.7%. A 26 percentage point spread, completely invisible in the grouped result.
New rule (Protocol Rule 7): When testing signals that categorize bets (line value, rest days, team type), always show results for EACH category individually, not just grouped aggregates.
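The masking effect is easy to reproduce from the per-line table earlier in this post. The sketch below assumes flat 1u stakes, so ROI is simply P&L divided by N; under that assumption the published N and P&L figures reproduce the published ROIs to within rounding. The type and function names are hypothetical.

```typescript
interface LineSummary {
  line: string;
  n: number;   // number of bets
  pnl: number; // profit and loss in units
}

// N and P&L for the four quarter lines, copied from the per-line table.
const quarterLines: LineSummary[] = [
  { line: "+0.25", n: 6360, pnl: 584 },
  { line: "-0.75", n: 1877, pnl: 240 },
  { line: "+0.75", n: 933, pnl: -156 },
  { line: "-0.25", n: 660, pnl: -137 },
];

// ROI in percent, assuming every bet is a flat 1u stake.
function roiPct(pnl: number, n: number): number {
  return (100 * pnl) / n;
}

// The grouped view: one number for the whole category.
function pooledRoiPct(rows: LineSummary[]): number {
  const n = rows.reduce((sum, r) => sum + r.n, 0);
  const pnl = rows.reduce((sum, r) => sum + r.pnl, 0);
  return roiPct(pnl, n);
}

// The per-category view that Rule 7 asks for.
function perLineRoiPct(rows: LineSummary[]): { line: string; roi: number }[] {
  return rows.map((r) => ({ line: r.line, roi: roiPct(r.pnl, r.n) }));
}

const pooled = pooledRoiPct(quarterLines);   // about +5.4
const perLine = perLineRoiPct(quarterLines); // about +9.2, +12.8, -16.7, -20.8
```

The pooled figure lands at roughly +5.4%, the same number the original verdict reported, while the individual lines range from strongly positive to strongly negative.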
Failure 3: No Retroactive Audit
The congestion filter was accepted on March 8th. Our regime testing framework (which enables proper stratification by league, season, HFA regime, and season phase) was built on March 17th. Nobody went back to re-test old signals with the new tools.
The filter ran for 10 days, removing profitable bets, because there was no trigger to revisit previously accepted signals when our testing capabilities improved.
New rule (Protocol Rule 8): When new testing infrastructure is added, all previously accepted signals must be re-validated within 48 hours.
Bonus: The Phantom Bug
A detailed bug report was written claiming the quarter-line settlement code was broken. It described exactly what the bug *would look like* if the code used simple win/lose logic. The report was thorough, specific, and wrong. Nobody had read the actual function.
New rule (Protocol Rule 9): Verify bugs against actual code before planning fixes. The code is the source of truth, not reasoning from memory.
What We Deployed
Two changes, each committed separately (one variable per commit, per protocol):
1. Congestion filter disabled (lib/mi-picks/ted-filters.ts)
The isCongested() check is commented out. Matches where a team plays 3+ times in 8 days are no longer filtered. These bets now flow through to the picks engine like any other match.
2. AH line confidence adjustments (lib/regime/decision-table.ts)
The regime classifier now receives the AH line value and adjusts confidence accordingly:
| Line | Confidence | Evidence |
|---|---|---|
| +0.25 | **+15** | +9.2% ROI, 2.2pp CalGap, N=6,360 |
| -0.75 | **+10** | +12.8% ROI, -2.0pp CalGap, N=1,877 |
| +0.75 | **-20** | -16.7% ROI, 11.2pp CalGap, N=933 |
| -0.25 | **-20** | -20.7% ROI, 12.0pp CalGap, N=660 |
These adjustments flow into the stake multiplier: confidence of 20 or higher gets full stake, -10 or higher gets half stake, and anything below -10 gets quarter stake. A +0.25 bet that would have been borderline now gets the full stake. A +0.75 bet that would have been full stake now gets reduced.
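As a rough sketch of how the adjustments and thresholds combine (not the code in lib/regime/decision-table.ts): the adjustment values and thresholds below are taken from this section, while the function names, the string keys for lines, and the idea of a numeric base confidence that the adjustment is added to are assumptions for illustration.

```typescript
// Confidence adjustments per AH line, as listed in the table above.
const LINE_ADJUSTMENT: Record<string, number> = {
  "+0.25": 15,
  "-0.75": 10,
  "+0.75": -20,
  "-0.25": -20,
};

// Lines without an entry are left unadjusted (an assumption).
function adjustedConfidence(baseConfidence: number, line: string): number {
  return baseConfidence + (LINE_ADJUSTMENT[line] ?? 0);
}

// Thresholds from the text: >= 20 full, >= -10 half, otherwise quarter.
function stakeMultiplier(confidence: number): number {
  if (confidence >= 20) return 1;
  if (confidence >= -10) return 0.5;
  return 0.25;
}

function stakeFor(baseConfidence: number, line: string): number {
  return stakeMultiplier(adjustedConfidence(baseConfidence, line));
}
```

For example, a bet with a base confidence of 10 would be half stake on an unadjusted line but full stake at +0.25 (10 + 15 = 25), while a base confidence of 20 at +0.75 drops to 0 and is cut to half stake.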
The Takeaway
Mine your rejections. Every quantitative team has a graveyard of failed experiments. Most of them are correctly dead. But the ones tagged "wrong direction" or "opposite of hypothesis" deserve a second look — the data told you something real, you just asked the wrong question.
Disaggregate before you dismiss. An average can hide anything. If your signal categorizes bets into buckets, look at every bucket individually before writing the verdict. The difference between +9.2% and -16.7% was invisible at the group level.
Audit old decisions with new tools. Your testing capabilities improve over time. Your old accepted signals don't automatically get better with them. Build a trigger: when you ship new testing infrastructure, re-run everything.
Read the code. Not the description of the code. Not what you think the code does. The actual code.