The Model Works, The Money Doesn't: Our 26-League Meta-Analysis
We ran the full evaluation pipeline across 26 leagues — 29,977 matches, 12 signal configurations, proper IS/OOS split. CLV is +11% everywhere (model is genuinely good). ROI is negative everywhere OOS (execution eats the edge). The odds cap is the only filter that matters (+4.2pp marginal). And our first attempt at this analysis was completely wrong — here's what we learned from that too.
We just finished the biggest audit our system has ever run — and then had to throw out the first draft because our methodology was wrong. This post covers the corrected results, what we got wrong the first time, and what it all means.
The headline: The model genuinely sees edge (CLV +11% everywhere, IS and OOS). But no configuration converts that edge to profit OOS. The gap between "our model prices better than closing lines" and "we make money" is structural, confirmed across 26 leagues, and bigger than we thought.
What We Got Wrong the First Time
We published initial results showing "10/11 signals failed." Those results were invalid. Here's what happened:
- Wrong data source. The v1 script loaded raw backtest bet records (87K bets, all markets, no filters) instead of using the actual evaluation pipeline. The baseline ROI for unfiltered bets is -6.0%. Every "signal test" was selecting subsets of an already-negative pool — of course they all looked bad.
- Fake filter functions. Instead of using our real signal infrastructure (`loadAllData` → `evaluateBets` → `runStandaloneSignal`), we wrote crude proxy functions like "AH bets where selection starts with 'Away'" as a stand-in for the defensive overperformance signal. These proxies didn't replicate what the actual signals do.
- Apples-to-oranges comparison. The "original ROI" numbers came from signals tested through the full filtered pipeline (Ted filters, odds cap, variance filter, early-season skip). We compared those against subsets of an unfiltered bet pool.
- Tautological filter. The "process-scorecard-tracking" proxy (`won AND clv > 0.03`) selected bets that already won. The +133% ROI was survivorship bias, not a signal.
The lesson: Running a meta-analysis on your own pipeline requires using your own pipeline, not a shortcut. We built the shortcut because it seemed simpler. It produced confident-looking but meaningless results.
The Corrected Study
The v2 analysis uses the proper evaluation infrastructure:
- `loadAllData()` loads 29,977 precomputed matches across all 26 leagues with team histories, xG data, and congestion tracking
- `evaluateBets()` applies the full Ted filter stack (variance, congestion, defiance, early-season skip)
- `runStandaloneSignal()` / `runWithoutSignal()` for proper marginal analysis
- 10,000-iteration bootstrap CIs per signal
| Split | Precomputed Matches | Purpose |
|---|---|---|
| **In-sample (IS)** | 6,955 | Original 6 leagues |
| **Out-of-sample (OOS)** | 23,022 | 20 new leagues |
| **Total** | 29,977 | Full 26-league pipeline |
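For readers unfamiliar with the bootstrap step, here is a minimal sketch of a percentile bootstrap CI on ROI. It is not our pipeline code: the `Bet` shape, the function names, and the seeded generator are all assumptions made for illustration.

```typescript
// Hypothetical bet record; field names are illustrative, not the pipeline's real types.
interface Bet {
  stake: number;  // amount risked
  profit: number; // net result: positive for a win, negative for a loss
}

// ROI = total profit / total stake.
function roi(bets: Bet[]): number {
  const stake = bets.reduce((s, b) => s + b.stake, 0);
  const profit = bets.reduce((s, b) => s + b.profit, 0);
  return stake === 0 ? 0 : profit / stake;
}

// Small seeded linear congruential generator so resampling is reproducible.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // uniform in [0, 1)
  };
}

// Percentile bootstrap: resample bets with replacement, recompute ROI each time,
// then read the interval off the sorted resampled ROIs.
function bootstrapRoiCI(
  bets: Bet[],
  iterations = 10000,
  alpha = 0.05,
  seed = 1
): [number, number] {
  const rand = seededRandom(seed);
  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const resample: Bet[] = [];
    for (let j = 0; j < bets.length; j++) {
      resample.push(bets[Math.floor(rand() * bets.length)]);
    }
    samples.push(roi(resample));
  }
  samples.sort((x, y) => x - y);
  const loIdx = Math.floor((alpha / 2) * iterations);
  const hiIdx = Math.min(iterations - 1, Math.floor((1 - alpha / 2) * iterations));
  return [samples[loIdx], samples[hiIdx]];
}
```

If the resulting interval straddles zero, the ROI estimate for that configuration cannot be distinguished from breakeven at the chosen confidence level.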
Results: The IS/OOS Split
Base Performance (Ted filters, all markets)
| | N | ROI | CLV |
|---|---|---|---|
| **IS (6 leagues)** | 1,968 | -1.4% | +11.1% |
| **OOS (20 leagues)** | 4,638 | -3.6% | +11.2% |
| **Full** | 6,606 | -3.0% | +11.2% |
CLV is essentially identical IS and OOS (+11.1% vs +11.2%). The model's pricing advantage is real and universal. But ROI degrades by 2.2pp going OOS (-1.4% to -3.6%).
Signal-Level Results
| Signal | Full N | Full ROI | IS ROI | OOS ROI | Marginal | Verdict |
|---|---|---|---|---|---|---|
| ted-base (all filters) | 6,606 | -3.0% | -1.4% | -3.6% | baseline | CLV-ONLY |
| odds-cap-2.0 | 9,102 | -2.0% | +0.2% | -2.9% | **+4.2pp** | Strongest signal |
| sides-only (1X2+AH) | 3,196 | -1.4% | **+0.6%** | -2.3% | — | IS positive |
| ah-only | 2,775 | -1.9% | **-0.0%** | -2.7% | — | IS breakeven |
| ah-low-edge (5-7%) | 5,836 | -1.9% | -0.4% | -2.7% | — | CLV-ONLY |
| ah-high-edge (10%+) | 1,510 | -3.0% | -1.3% | -3.7% | — | CLV-ONLY |
| unders-only | 2,331 | -6.3% | -3.7% | -7.0% | — | Worst performer |
Filter Marginal Contributions
| Filter | Marginal ROI | Effect |
|---|---|---|
| **odds-cap-2.0** | **+4.2pp** | Without cap: -7.2%. With cap: -3.0%. Best individual signal by far. |
| variance-regression | -0.4pp | Slightly hurts — restricting to regression candidates doesn't help |
| congestion-filter | -0.3pp | Slightly hurts — confirms the post-mortem (congested teams perform better) |
| defiance-filter | -0.2pp | Slightly hurts |
| no-draws | +0.0pp | Neutral |
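The marginal figures in this table are differences in ROI with and without a single filter, everything else held fixed. For odds-cap-2.0 that is -3.0% minus -7.2%, which gives +4.2pp. The sketch below shows the arithmetic on a toy bet pool. The `Bet` and `NamedFilter` types and the helper names are hypothetical and only stand in for what `runWithoutSignal()` does inside the real pipeline.

```typescript
// Hypothetical shapes for illustration only.
interface Bet {
  odds: number;   // decimal odds taken
  stake: number;
  profit: number; // net result of the bet
}

interface NamedFilter {
  name: string;
  keep: (b: Bet) => boolean; // true if the bet passes this filter
}

// ROI in percent: 100 * total profit / total stake.
function roiPct(bets: Bet[]): number {
  const stake = bets.reduce((s, b) => s + b.stake, 0);
  const profit = bets.reduce((s, b) => s + b.profit, 0);
  return stake === 0 ? 0 : (100 * profit) / stake;
}

// Keep only bets that pass every filter in the stack.
function applyStack(pool: Bet[], stack: NamedFilter[]): Bet[] {
  return pool.filter((b) => stack.every((f) => f.keep(b)));
}

// Marginal contribution in percentage points:
// ROI with the full stack minus ROI with the same stack minus one filter.
function marginalPp(pool: Bet[], stack: NamedFilter[], name: string): number {
  const withFilter = applyStack(pool, stack);
  const withoutFilter = applyStack(pool, stack.filter((f) => f.name !== name));
  return roiPct(withFilter) - roiPct(withoutFilter);
}

// Example stack containing only an odds cap at 2.0.
const exampleStack: NamedFilter[] = [
  { name: "odds-cap-2.0", keep: (b) => b.odds <= 2.0 },
];
```

A positive result means the filter removes bets that lose more than the ones it keeps. A negative result, as with the congestion filter, means it is removing better-than-average bets.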
The Three Real Findings
1. CLV Is Real and Universal
CLV sits at +11.1% IS and +11.2% OOS across 26 leagues, in every configuration. The MI Bivariate Poisson model genuinely prices match outcomes better than Pinnacle closing lines. This is not an artifact of data selection or market softness: it holds in sharp markets (EPL, Serie A) and soft markets (National League, Liga MX) equally.
2. ROI Doesn't Convert — And OOS Is Worse
IS: -1.4% ROI (nearly breakeven with full filters, slightly positive for sides-only and AH). OOS: -3.6% ROI (2.2pp worse across the board).
The degradation is consistent, not catastrophic. It's not that OOS leagues are fundamentally different — the CLV is identical. The ROI gap comes from execution: proxy odds in soft markets, wider vig, less liquid lines.
3. Odds Cap Is the Only Signal That Matters
The odds-cap-2.0 filter contributes +4.2pp of marginal ROI — the difference between -7.2% (no cap) and -3.0% (with cap). Every other filter is between -0.4pp and +0.0pp marginal. The entire Ted filter stack minus odds-cap is approximately neutral.
Why? Odds > 2.0 correspond to implied probabilities < 50% — longshots. Our model overestimates edge on longshots because the CLV calculation inflates small probability differences into large percentage edges. A 2% absolute edge on a 30% probability event looks like 6.7% CLV. The same 2% edge on a 55% event is 3.6% CLV. The odds cap removes the inflated-looking edges.
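The inflation is just division by a smaller base. A short sketch of the arithmetic, assuming the percentage edge is the absolute probability edge divided by the event's base probability (the function names are ours, for illustration):

```typescript
// Market-implied probability from decimal odds (ignores vig for simplicity).
function impliedProbability(decimalOdds: number): number {
  return 1 / decimalOdds;
}

// Relative edge: absolute probability edge as a fraction of the base probability.
function relativeEdge(absoluteEdge: number, baseProbability: number): number {
  return absoluteEdge / baseProbability;
}

const longshot = relativeEdge(0.02, 0.30);  // same 2-point edge, small base
const favourite = relativeEdge(0.02, 0.55); // same 2-point edge, large base

console.log((longshot * 100).toFixed(1) + "%");  // 6.7%
console.log((favourite * 100).toFixed(1) + "%"); // 3.6%
console.log(impliedProbability(2.0));            // 0.5, the boundary the cap enforces
```

The same two points of probability nearly double in apparent size on the longshot, which is why a fixed percentage threshold lets more longshots through.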
The Deeper Problem: How 40 Signals Got "Accepted"
This is the finding that matters most.
Before this meta-analysis, we had 40 accepted signals in the registry, all showing positive ROI. We felt good about our process — pre-registration, bootstrap CIs, regime stratification, walk-forward validation. Rigorous. Scientific. And apparently, wrong.
How did 40 signals pass validation when the system's actual ROI is negative?
The Bug in runStandaloneSignal()
When you test a signal "standalone," the runner is supposed to isolate it — disable everything else so you can see what that signal alone contributes. Here's what runStandaloneSignal() actually does (from lib/signals/runner.ts, line 239):
    overrides = {
      varianceFilter: false,    // ✓ disabled
      congestionFilter: false,  // ✓ disabled
      defianceFilter: false,    // ✓ disabled
      maxOdds: 99,              // ✓ disabled
      noDraws: false,           // ✓ disabled
      // minEdge: ???           ← NEVER TOUCHED
    }

The `minEdge` parameter, which requires every bet to have at least 7% CLV before it's even considered, is never overridden. It stays at `DEFAULT_EVAL_CONFIG.minEdge = 0.07`.
So every "standalone" signal test was actually:
Signal X + minEdge ≥ 7%
Not:
Signal X alone
The 7% edge threshold is the real alpha source. It's the thing that turns -6% unfiltered ROI into -1.4% filtered ROI. Every signal we tested was riding on top of it, and we attributed the improvement to the signal instead of the threshold.
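To make the mechanism concrete, here is a simplified sketch of how a default can leak through a set of overrides. It assumes the standalone config is built by merging overrides onto the defaults. The `EvalConfig` shape and both function names are hypothetical, and apart from `minEdge: 0.07` the default values are placeholders. Only the override values come from the snippet above.

```typescript
// Hypothetical config shape, reduced to the fields discussed in this post.
interface EvalConfig {
  varianceFilter: boolean;
  congestionFilter: boolean;
  defianceFilter: boolean;
  maxOdds: number;
  noDraws: boolean;
  minEdge: number;
}

// Placeholder defaults; only minEdge = 0.07 is taken from the post.
const DEFAULT_EVAL_CONFIG: EvalConfig = {
  varianceFilter: true,
  congestionFilter: true,
  defianceFilter: true,
  maxOdds: 2.0,
  noDraws: true,
  minEdge: 0.07,
};

// Overrides as described: every filter is switched off, but minEdge is not
// mentioned, so the merged config silently keeps the 7% default.
function standaloneConfigAsDescribed(): EvalConfig {
  return {
    ...DEFAULT_EVAL_CONFIG,
    varianceFilter: false,
    congestionFilter: false,
    defianceFilter: false,
    maxOdds: 99,
    noDraws: false,
  };
}

// True isolation: the edge threshold is overridden as well.
function standaloneConfigIsolated(): EvalConfig {
  return { ...standaloneConfigAsDescribed(), minEdge: 0 };
}

console.log(standaloneConfigAsDescribed().minEdge); // 0.07
console.log(standaloneConfigIsolated().minEdge);    // 0
```

Nothing in the first function looks wrong when read on its own, which is part of why the inherited threshold went unnoticed.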
The Evidence: standaloneN ≈ 1,092
Look at the signal registry. Multiple "independent" signals share the exact same standalone N:
| Signal | standaloneN | standaloneROI |
|---|---|---|
| variance-regression-filter | 1,092 | +2.9% |
| pass-rate-filter-50pct | 1,092 | +2.9% |
| injury-lambda-multiplier | 1,092 | +2.9% |
| congestion-filter-3in8 | 1,092 | +2.9% |
Four different signals. Same N. Same ROI. That's because they're all testing within the same pre-filtered universe — the ~1,092 bets that pass the 7% edge threshold. The "signals" are adding approximately nothing on top of it.
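This pattern is cheap to detect automatically. A minimal sketch of a registry sanity check that flags signals sharing an identical standalone N and ROI, assuming a hypothetical `RegistryEntry` shape (our real registry format may differ):

```typescript
// Hypothetical registry row; field names mirror the table above.
interface RegistryEntry {
  name: string;
  standaloneN: number;
  standaloneROI: number; // percent
}

// Group signals by (standaloneN, standaloneROI) and return every group with
// more than one member. Identical pairs suggest the signals were evaluated on
// the same pre-filtered bets and changed nothing within that pool.
function findSharedUniverses(entries: RegistryEntry[]): string[][] {
  const groups: Record<string, string[]> = {};
  for (const e of entries) {
    const key = e.standaloneN + "|" + e.standaloneROI.toFixed(2);
    if (!groups[key]) groups[key] = [];
    groups[key].push(e.name);
  }
  return Object.keys(groups)
    .map((k) => groups[k])
    .filter((names) => names.length > 1);
}

const registrySample: RegistryEntry[] = [
  { name: "variance-regression-filter", standaloneN: 1092, standaloneROI: 2.9 },
  { name: "pass-rate-filter-50pct", standaloneN: 1092, standaloneROI: 2.9 },
  { name: "injury-lambda-multiplier", standaloneN: 1092, standaloneROI: 2.9 },
  { name: "congestion-filter-3in8", standaloneN: 1092, standaloneROI: 2.9 },
];

console.log(findSharedUniverses(registrySample).length); // 1 group containing all four
```

A check like this could run whenever a signal is added to the registry, so that a duplicate N is questioned before the signal is accepted.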
Why This Felt Rigorous
The testing infrastructure had every hallmark of rigor:
- Pre-registration. Every hypothesis was registered before testing. ✓
- Bootstrap CIs. 10,000-iteration confidence intervals on every result. ✓
- Regime stratification. Tested across HFA regime, season phase, home/away side. ✓
- Walk-forward validation. Out-of-sample splits across seasons. ✓
- Holm-Bonferroni correction. Multiple comparison adjustment when testing multiple signals. ✓
All of that rigor was applied on top of a pre-filtered bet pool. The statistical tests were valid *within that pool*. The confidence intervals were correct. The p-values were real. But the question being answered was "does Signal X improve ROI among bets that already pass a 7% edge threshold?" — not "does Signal X generate independent alpha?"
When the answer to the first question is "yes, by +0.1pp" and the answer to the second question is "no," you get 40 accepted signals that look great in the registry but contribute nothing in production.
The Analogy
Imagine testing sunglasses by measuring how well you can see in a dark room with them on vs. off. You find: "no difference, both terrible." Then someone hands you a flashlight and says "test again." Now: "sunglasses group sees 8/10 objects, no-sunglasses sees 8.2/10. The sunglasses hurt by 0.2." But your report says "sunglasses + flashlight: sees 8/10 objects. Deployed."
The flashlight is the 7% edge threshold. The sunglasses are the signals. We were testing sunglasses-with-flashlight and crediting the sunglasses.
What This Means
- Most of our 40 accepted signals are not independently generating alpha. They were validated within a pre-filtered universe where the edge threshold was doing the work.
- The testing infrastructure produced false confidence. The bug wasn't in any individual test — it was in the test harness itself. Every test inherited it silently.
- The 7% edge threshold IS the signal. It's the one filter that consistently separates positive-CLV bets from the rest. The odds cap (+4.2pp marginal) is the only other filter that demonstrably matters.
- We need to retest all 40 signals with `minEdge: 0`. Strip the edge threshold, test each signal truly standalone, and see which ones actually contribute marginal ROI. Most won't. A few might. Those few are the real signals.
Other Process Learnings
What the v1 Error Taught Us
The most valuable output of this meta-analysis wasn't the signal results — it was discovering that our first attempt was completely wrong despite looking plausible. The v1 results had tables, confidence intervals, Holm-Bonferroni corrections, effect size calculations. They looked rigorous. They were garbage.
The root cause: We took a shortcut. Instead of using the actual evaluation pipeline (which loads data, builds team histories, applies filters in the correct order), we loaded raw bet records and wrote ad hoc filter functions. The shortcut saved 30 minutes of development time and produced results that would have led us to deactivate signals on the basis of invalid evidence.
New rule: Any analysis that produces a signal-level verdict must run through the canonical pipeline (loadAllData → evaluateBets → runStandaloneSignal). No proxies, no approximations, no "this filter is roughly equivalent."
CLV vs ROI: Different Questions
CLV answers "is our model good?" ROI answers "do we make money?" These are different questions with different answers.
- CLV: +11.1% (IS), +11.2% (OOS). Answer: yes, the model is good.
- ROI: -1.4% (IS), -3.6% (OOS). Answer: no, we don't make money.
The gap is execution cost — vig, odds quality, entry timing. This means the next dollar of improvement comes from execution infrastructure (better odds sources, tighter entry, vig reduction), not from more signal research.
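The two metrics are computed from different inputs, which is why they can disagree. The sketch below uses one common definition of CLV (odds taken divided by closing odds, minus one). This post does not spell out our exact formula, so treat the definition, the `SettledBet` shape, and the function names as assumptions.

```typescript
// Hypothetical settled-bet record for illustration.
interface SettledBet {
  takenOdds: number;   // decimal odds at which the bet was placed
  closingOdds: number; // decimal closing odds for the same selection
  stake: number;
  won: boolean;
}

// CLV under the assumed definition: how much better the taken price was than the close.
function clv(b: SettledBet): number {
  return b.takenOdds / b.closingOdds - 1;
}

// Net profit of a settled bet at decimal odds.
function profit(b: SettledBet): number {
  return b.won ? b.stake * (b.takenOdds - 1) : -b.stake;
}

// Mean CLV depends only on prices; ROI depends only on outcomes and stakes.
function summarize(bets: SettledBet[]): { meanClv: number; roi: number } {
  const totalStake = bets.reduce((s, b) => s + b.stake, 0);
  const totalProfit = bets.reduce((s, b) => s + profit(b), 0);
  const meanClv = bets.length === 0 ? 0 : bets.reduce((s, b) => s + clv(b), 0) / bets.length;
  return { meanClv, roi: totalStake === 0 ? 0 : totalProfit / totalStake };
}

// Three bets that all beat the close, of which only one wins.
const sampleBets: SettledBet[] = [
  { takenOdds: 2.2, closingOdds: 2.0, stake: 1, won: true },
  { takenOdds: 2.2, closingOdds: 2.0, stake: 1, won: false },
  { takenOdds: 2.2, closingOdds: 2.0, stake: 1, won: false },
];

const summary = summarize(sampleBets);
console.log(summary.meanClv > 0, summary.roi < 0); // true true
```

On a sample this small the gap is just variance. The point of the 26-league study is that the gap persists at thousands of bets, which points at execution rather than luck.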
The Congestion Filter Confirmation
The congestion filter's marginal contribution is -0.3pp — it makes things slightly worse. This independently confirms the post-mortem finding from March 18th that the filter was removing our best bets. The meta-analysis just verified it through a different methodology.
What Changes
- Fix `runStandaloneSignal()`. Add `minEdge: 0` to the standalone overrides so signals are tested in true isolation. This is a one-line fix that changes the meaning of every future signal test.
- Retest all 40 accepted signals with `minEdge: 0`. Most will lose their positive ROI. The ones that survive are the real signals. The ones that don't were riding the edge threshold.
- odds-cap-2.0 stays deployed. +4.2pp marginal is real and consistent IS/OOS.
- Other Ted filters remain but are not load-bearing. Variance, congestion, and defiance filters are approximately neutral. Keeping them for risk management but not claiming they generate alpha.
- Focus shifts to execution. The model works. The edge is real. The problem is capturing it. Next investment goes to odds quality, entry timing, and vig modeling.
- No more ad hoc analysis scripts. All signal verdicts must use the canonical pipeline.
The Infrastructure Made This Possible
The corrected study — 29,977 matches, 26 leagues, 12 signal configurations, 120,000 bootstrap iterations — ran in under 5 minutes on a single machine. Last week the underlying backtest alone would have taken 20 hours. The worker thread parallelization and solver memoization turned "too expensive to ask" into "ask it three times to make sure."
The speed upgrade also exposed the v1 error faster. When results come back in minutes instead of hours, you can afford to be suspicious and rerun. We caught the methodology bug on the same day we published — because we could check our work before the next meeting.