The Model Works, The Money Doesn't: Our 26-League Meta-Analysis

March 19, 2026|Research|INVESTIGATION

We ran the full evaluation pipeline across 26 leagues — 29,977 matches, 12 signal configurations, proper IS/OOS split. CLV is +11% everywhere (model is genuinely good). ROI is negative everywhere OOS (execution eats the edge). The odds cap is the only filter that matters (+4.2pp marginal). And our first attempt at this analysis was completely wrong — here's what we learned from that too.

CLV (universal): +11.1% (IS and OOS identical)
IS ROI: -1.4% (6 leagues, near breakeven)
OOS ROI: -3.6% (20 leagues, execution gap)
Strongest Signal: +4.2pp (odds-cap-2.0 marginal)

We just finished the biggest audit our system has ever run — and then had to throw out the first draft because our methodology was wrong. This post covers the corrected results, what we got wrong the first time, and what it all means.

The headline: The model genuinely sees edge (CLV +11% everywhere, IS and OOS). But no configuration converts that edge to profit OOS. The gap between "our model prices better than closing lines" and "we make money" is structural, confirmed across 26 leagues, and bigger than we thought.

What We Got Wrong the First Time

We published initial results showing "10/11 signals failed." Those results were invalid. Here's what happened:

Wrong data source. The v1 script loaded raw backtest bet records (87K bets, all markets, no filters) instead of using the actual evaluation pipeline. The baseline ROI for unfiltered bets is -6.0%. Every "signal test" was selecting subsets of an already-negative pool — of course they all looked bad.

Fake filter functions. Instead of using our real signal infrastructure (loadAllData → evaluateBets → runStandaloneSignal), we wrote crude proxy functions like "AH bets where selection starts with 'Away'" as a stand-in for the defensive overperformance signal. These proxies didn't replicate what the actual signals do.

Apples-to-oranges comparison. The "original ROI" numbers came from signals tested through the full filtered pipeline (Ted filters, odds cap, variance filter, early-season skip). We compared those against subsets of an unfiltered bet pool.

Tautological filter. The "process-scorecard-tracking" proxy (won AND clv > 0.03) selected bets that already won. The +133% ROI was survivorship bias, not a signal.
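A minimal sketch of why the tautological proxy could never lose, assuming flat one-unit stakes. The `BetRecord` shape, the helper name, and the four sample bets are hypothetical and are not taken from the real bet schema.

```typescript
// Hypothetical bet record; field names are illustrative, not the real schema.
interface BetRecord {
  odds: number; // decimal odds taken
  won: boolean; // settled outcome, unknown before kickoff
  clv: number; // closing-line value as a fraction
}

// Flat-stake ROI: one unit per bet, profit is odds - 1 on a win and -1 on a loss.
function flatStakeRoi(bets: BetRecord[]): number {
  if (bets.length === 0) return 0;
  const profit = bets.reduce((sum, b) => sum + (b.won ? b.odds - 1 : -1), 0);
  return profit / bets.length;
}

// Made-up sample, for illustration only.
const sampleBets: BetRecord[] = [
  { odds: 2.5, won: true, clv: 0.05 },
  { odds: 2.1, won: false, clv: 0.06 },
  { odds: 1.9, won: false, clv: 0.04 },
  { odds: 2.4, won: true, clv: 0.01 },
];

// The v1 proxy conditioned on the outcome itself, so every selected bet is a winner.
const tautological = sampleBets.filter((b) => b.won && b.clv > 0.03);
// A usable filter may only read fields that are known before the match starts.
const preMatchOnly = sampleBets.filter((b) => b.clv > 0.03);

const tautologicalRoi = flatStakeRoi(tautological);
const preMatchRoi = flatStakeRoi(preMatchOnly);
console.log({ tautologicalRoi, preMatchRoi });
```

Under the flat-stake assumption, ROI on a winners-only subset is just mean odds minus one, so a +133% figure would only say that the selected winners averaged odds of roughly 2.33.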

The lesson: Running a meta-analysis on your own pipeline requires using your own pipeline, not a shortcut. We built the shortcut because it seemed simpler. It produced confident-looking but meaningless results.

The Corrected Study

The v2 analysis uses the proper evaluation infrastructure:

loadAllData() loads 29,977 precomputed matches across all 26 leagues with team histories, xG data, congestion tracking
evaluateBets() applies the full Ted filter stack (variance, congestion, defiance, early-season skip)
runStandaloneSignal() / runWithoutSignal() for proper marginal analysis
10,000-iteration bootstrap CIs per signal

Split	Precomputed Matches	Purpose
In-sample (IS)	6,955	Original 6 leagues
Out-of-sample (OOS)	23,022	20 new leagues
Total	29,977	Full 26-league pipeline
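For readers unfamiliar with the bootstrap step, here is a rough sketch of a percentile bootstrap CI on flat-stake ROI. It is not the pipeline's implementation: the function names, the seeded generator, and the sample profits are assumptions made for illustration.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap CI for ROI, where ROI is mean profit per unit staked.
function bootstrapRoiCi(
  profits: number[],
  iterations: number,
  alpha: number,
  rand: () => number,
): { lo: number; hi: number } {
  const n = profits.length;
  if (n === 0) throw new Error("need at least one bet");
  const means: number[] = [];
  for (let i = 0; i < iterations; i++) {
    let sum = 0;
    // Resample n bets with replacement and record the resampled ROI.
    for (let j = 0; j < n; j++) sum += profits[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const loIdx = Math.floor((alpha / 2) * iterations);
  const hiIdx = Math.min(iterations - 1, Math.floor((1 - alpha / 2) * iterations));
  return { lo: means[loIdx], hi: means[hiIdx] };
}

// Made-up sample: 200 flat-stake bets, 40% winners at decimal odds 2.4.
const profits: number[] = [];
for (let i = 0; i < 200; i++) profits.push(i % 5 < 2 ? 1.4 : -1);

const ci = bootstrapRoiCi(profits, 10000, 0.05, mulberry32(42));
console.log(ci);
```

An interval that straddles zero is the usual reason a configuration gets a CLV-only verdict rather than a profitable one.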

Results: The IS/OOS Split

Base Performance (Ted filters, all markets)

Split	N	ROI	CLV
IS (6 leagues)	1,968	-1.4%	+11.1%
OOS (20 leagues)	4,638	-3.6%	+11.2%
Full	6,606	-3.0%	+11.2%

CLV is effectively identical IS and OOS (+11.1% vs +11.2%). The model's pricing advantage is real and universal. But ROI degrades by 2.2pp going OOS (from -1.4% to -3.6%).

Signal-Level Results

Signal	Full N	Full ROI	IS ROI	OOS ROI	Marginal	Verdict
ted-base (all filters)	6,606	-3.0%	-1.4%	-3.6%	baseline	CLV-ONLY
odds-cap-2.0	9,102	-2.0%	+0.2%	-2.9%	+4.2pp	Strongest signal
sides-only (1X2+AH)	3,196	-1.4%	+0.6%	-2.3%	—	IS positive
ah-only	2,775	-1.9%	-0.0%	-2.7%	—	IS breakeven
ah-low-edge (5-7%)	5,836	-1.9%	-0.4%	-2.7%	—	CLV-ONLY
ah-high-edge (10%+)	1,510	-3.0%	-1.3%	-3.7%	—	CLV-ONLY
unders-only	2,331	-6.3%	-3.7%	-7.0%	—	Worst performer

Filter Marginal Contributions

Filter	Marginal ROI	Effect
odds-cap-2.0	+4.2pp	Without cap: -7.2%. With cap: -3.0%. Best individual signal by far.
variance-regression	-0.4pp	Slightly hurts — restricting to regression candidates doesn't help
congestion-filter	-0.3pp	Slightly hurts — confirms the post-mortem (congested teams perform better)
defiance-filter	-0.2pp	Slightly hurts
no-draws	+0.0pp	Neutral
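The marginal figures in the table above are "ROI with the full stack" minus "ROI with that one filter removed." A rough sketch of that calculation follows. The `SettledBet` type, the filter map, and the sample bets are hypothetical stand-ins, not the actual runWithoutSignal() code.

```typescript
// Hypothetical settled bet: decimal odds plus flat-stake profit in units.
interface SettledBet {
  odds: number;
  profit: number;
}
type BetFilter = (bet: SettledBet) => boolean;

function roiOf(bets: SettledBet[]): number {
  if (bets.length === 0) return 0;
  return bets.reduce((sum, b) => sum + b.profit, 0) / bets.length;
}

// Marginal contribution: ROI with every filter on, minus ROI with one filter off.
function marginalRoi(
  bets: SettledBet[],
  filters: Record<string, BetFilter>,
  name: string,
): number {
  const names = Object.keys(filters);
  const apply = (active: string[]) =>
    bets.filter((b) => active.every((n) => filters[n](b)));
  const withAll = roiOf(apply(names));
  const withoutOne = roiOf(apply(names.filter((n) => n !== name)));
  return withAll - withoutOne;
}

// Made-up sample and filters, for illustration only.
const sample: SettledBet[] = [
  { odds: 1.8, profit: 0.8 },
  { odds: 1.9, profit: -1 },
  { odds: 3.0, profit: -1 },
  { odds: 4.0, profit: -1 },
];
const filters: Record<string, BetFilter> = {
  oddsCap: (b) => b.odds <= 2.0,
  validOdds: (b) => b.odds > 1,
};
const sampleMarginal = marginalRoi(sample, filters, "oddsCap");

// The reported odds-cap figure, in percentage points: -3.0 - (-7.2).
const reportedMarginalPp = -3.0 - -7.2;
console.log({ sampleMarginal, reportedMarginalPp });
```

A negative marginal, as with the variance, congestion, and defiance filters, means the stack does slightly better with that filter switched off.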

The Three Real Findings

1. CLV Is Real and Universal

+11.1% CLV across 26 leagues, IS and OOS, every configuration. The MI Bivariate Poisson model genuinely prices match outcomes better than Pinnacle closing lines. This is not an artifact of data selection or market softness — it holds in sharp markets (EPL, Serie A) and soft markets (National League, Liga MX) equally.

2. ROI Doesn't Convert — And OOS Is Worse

IS: -1.4% ROI (nearly breakeven with full filters, slightly positive for sides-only and AH). OOS: -3.6% ROI (2.2pp worse across the board).

The degradation is consistent, not catastrophic. It's not that OOS leagues are fundamentally different — the CLV is identical. The ROI gap comes from execution: proxy odds in soft markets, wider vig, less liquid lines.

3. Odds Cap Is the Only Signal That Matters

The odds-cap-2.0 filter contributes +4.2pp of marginal ROI — the difference between -7.2% (no cap) and -3.0% (with cap). Every other filter is between -0.4pp and +0.0pp marginal. The entire Ted filter stack minus odds-cap is approximately neutral.

Why? Odds > 2.0 correspond to implied probabilities < 50% — longshots. Our model overestimates edge on longshots because the CLV calculation inflates small probability differences into large percentage edges. A 2% absolute edge on a 30% probability event looks like 6.7% CLV. The same 2% edge on a 55% event is 3.6% CLV. The odds cap removes the inflated-looking edges.
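The arithmetic behind those two examples, as a small sketch. It assumes the percentage edge is the absolute probability edge divided by the market's implied probability, which is what the 6.7% and 3.6% figures imply. The function names are illustrative.

```typescript
// Implied probability of decimal odds, ignoring vig.
function impliedProbability(decimalOdds: number): number {
  return 1 / decimalOdds;
}

// Relative (percentage) edge: absolute probability edge over the market probability.
function relativeEdge(absoluteEdge: number, marketProbability: number): number {
  return absoluteEdge / marketProbability;
}

const longshotEdge = relativeEdge(0.02, 0.3); // 2 points on a 30% event
const favoriteEdge = relativeEdge(0.02, 0.55); // 2 points on a 55% event
const capBoundary = impliedProbability(2.0); // the odds-cap-2.0 threshold

console.log({
  longshotEdge: (longshotEdge * 100).toFixed(1) + "%",
  favoriteEdge: (favoriteEdge * 100).toFixed(1) + "%",
  capBoundary,
});
```

The same two-point misestimate counts almost twice as heavily on the 30% event, so any fixed percentage threshold lets through more noise at long odds.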

The Deeper Problem: How 40 Signals Got "Accepted"

This is the finding that matters most.

Before this meta-analysis, we had 40 accepted signals in the registry, all showing positive ROI. We felt good about our process — pre-registration, bootstrap CIs, regime stratification, walk-forward validation. Rigorous. Scientific. And apparently, wrong.

How did 40 signals pass validation when the system's actual ROI is negative?

The Bug in `runStandaloneSignal()`

When you test a signal "standalone," the runner is supposed to isolate it — disable everything else so you can see what that signal alone contributes. Here's what runStandaloneSignal() actually does (from lib/signals/runner.ts, line 239):

const overrides = {
  varianceFilter: false,    // disabled
  congestionFilter: false,  // disabled
  defianceFilter: false,    // disabled
  maxOdds: 99,              // cap effectively disabled
  noDraws: false,           // disabled
  // minEdge is never set here, so the default (0.07) still applies
};

The minEdge parameter, which requires every bet to show at least a 7% edge before it's even considered, is never overridden. It stays at DEFAULT_EVAL_CONFIG.minEdge = 0.07.

So every "standalone" signal test was actually:

Signal X + minEdge ≥ 7%

Not:

Signal X alone

The 7% edge threshold is the real alpha source. It's the thing that turns -6% unfiltered ROI into -1.4% filtered ROI. Every signal we tested was riding on top of it, and we attributed the improvement to the signal instead of the threshold.
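A simplified sketch of how a default can survive an override object. The merge-by-spread mechanism and every default other than minEdge = 0.07 are assumptions for illustration; the real runner may combine configs differently.

```typescript
interface EvalConfig {
  varianceFilter: boolean;
  congestionFilter: boolean;
  defianceFilter: boolean;
  maxOdds: number;
  noDraws: boolean;
  minEdge: number;
}

// Only minEdge = 0.07 comes from the post; the other defaults are assumed.
const DEFAULT_EVAL_CONFIG: EvalConfig = {
  varianceFilter: true,
  congestionFilter: true,
  defianceFilter: true,
  maxOdds: 2.0,
  noDraws: true,
  minEdge: 0.07,
};

// The standalone overrides as listed: five filters switched off, minEdge absent.
const standaloneOverrides = {
  varianceFilter: false,
  congestionFilter: false,
  defianceFilter: false,
  maxOdds: 99,
  noDraws: false,
};

// Any key missing from the overrides silently keeps its default value.
const buggyConfig: EvalConfig = { ...DEFAULT_EVAL_CONFIG, ...standaloneOverrides };

// The proposed fix: neutralize the edge threshold explicitly.
const fixedConfig: EvalConfig = { ...buggyConfig, minEdge: 0 };

console.log(buggyConfig.minEdge, fixedConfig.minEdge);
```

Nothing in a merge like this fails or warns when a key is left out, which is consistent with the threshold going unnoticed across every standalone test.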

The Evidence: `standaloneN ≈ 1,092`

Look at the signal registry. Multiple "independent" signals share the exact same standalone N:

Signal	standaloneN	standaloneROI
variance-regression-filter	1,092	+2.9%
pass-rate-filter-50pct	1,092	+2.9%
injury-lambda-multiplier	1,092	+2.9%
congestion-filter-3in8	1,092	+2.9%

Four different signals. Same N. Same ROI. That's because they're all testing within the same pre-filtered universe — the ~1,092 bets that pass the 7% edge threshold. The "signals" are adding approximately nothing on top of it.
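This kind of collision is cheap to detect automatically. Below is a hedged sketch of a registry check that groups signals by their (standaloneN, standaloneROI) fingerprint. The `RegistryEntry` shape is an assumption, and the fifth entry is a made-up control row rather than a real signal.

```typescript
// Assumed registry shape; the real registry may store more fields.
interface RegistryEntry {
  signal: string;
  standaloneN: number;
  standaloneRoi: number; // fraction, e.g. 0.029 for +2.9%
}

// Group signals whose standalone N and ROI match exactly.
function findSharedPools(entries: RegistryEntry[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const e of entries) {
    const key = `${e.standaloneN}|${e.standaloneRoi.toFixed(3)}`;
    const names = groups.get(key) ?? [];
    names.push(e.signal);
    groups.set(key, names);
  }
  // Keep only fingerprints shared by more than one signal.
  const shared = new Map<string, string[]>();
  groups.forEach((names, key) => {
    if (names.length > 1) shared.set(key, names);
  });
  return shared;
}

const registry: RegistryEntry[] = [
  { signal: "variance-regression-filter", standaloneN: 1092, standaloneRoi: 0.029 },
  { signal: "pass-rate-filter-50pct", standaloneN: 1092, standaloneRoi: 0.029 },
  { signal: "injury-lambda-multiplier", standaloneN: 1092, standaloneRoi: 0.029 },
  { signal: "congestion-filter-3in8", standaloneN: 1092, standaloneRoi: 0.029 },
  // Hypothetical control row, not a real registry entry.
  { signal: "hypothetical-control", standaloneN: 640, standaloneRoi: 0.011 },
];

const sharedPools = findSharedPools(registry);
sharedPools.forEach((names, key) => console.log(key, names));
```

Any fingerprint shared by several supposedly independent signals is a hint that the tests ran on the same underlying bet pool.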

Why This Felt Rigorous

The testing infrastructure had every hallmark of rigor:

Pre-registration. Every hypothesis was registered before testing. ✓
Bootstrap CIs. 10,000-iteration confidence intervals on every result. ✓
Regime stratification. Tested across HFA regime, season phase, home/away side. ✓
Walk-forward validation. Out-of-sample splits across seasons. ✓
Holm-Bonferroni correction. Multiple comparison adjustment when testing multiple signals. ✓

All of that rigor was applied on top of a pre-filtered bet pool. The statistical tests were valid *within that pool*. The confidence intervals were correct. The p-values were real. But the question being answered was "does Signal X improve ROI among bets that already pass a 7% edge threshold?" — not "does Signal X generate independent alpha?"

When the answer to the first question is "yes, by +0.1pp" and the answer to the second question is "no," you get 40 accepted signals that look great in the registry but contribute nothing in production.

The Analogy

Imagine testing sunglasses by measuring how well you can see in a dark room with them on versus off. You find no difference: both terrible. Then someone hands you a flashlight and says "test again." Now the sunglasses group sees 8 of 10 objects and the no-sunglasses group sees 8.2 of 10, so the sunglasses actually cost you 0.2. But your report reads "sunglasses + flashlight: sees 8/10 objects. Deployed."

The flashlight is the 7% edge threshold. The sunglasses are the signals. We were testing sunglasses-with-flashlight and crediting the sunglasses.

What This Means

Most of our 40 accepted signals are not independently generating alpha. They were validated within a pre-filtered universe where the edge threshold was doing the work.

The testing infrastructure produced false confidence. The bug wasn't in any individual test — it was in the test harness itself. Every test inherited it silently.

The 7% edge threshold IS the signal. It's the one filter that consistently separates positive-CLV bets from the rest. The odds cap (+4.2pp marginal) is the only other filter that demonstrably matters.

We need to retest all 40 signals with `minEdge: 0`. Strip the edge threshold, test each signal truly standalone, and see which ones actually contribute marginal ROI. Most won't. A few might. Those few are the real signals.

Other Process Learnings

What the v1 Error Taught Us

The most valuable output of this meta-analysis wasn't the signal results — it was discovering that our first attempt was completely wrong despite looking plausible. The v1 results had tables, confidence intervals, Holm-Bonferroni corrections, effect size calculations. They looked rigorous. They were garbage.

The root cause: We took a shortcut. Instead of using the actual evaluation pipeline (which loads data, builds team histories, applies filters in the correct order), we loaded raw bet records and wrote ad hoc filter functions. The shortcut saved 30 minutes of development time and produced results that would have led us to deactivate signals based on tests that never ran the real signal code.

New rule: Any analysis that produces a signal-level verdict must run through the canonical pipeline (loadAllData → evaluateBets → runStandaloneSignal). No proxies, no approximations, no "this filter is roughly equivalent."

CLV vs ROI: Different Questions

CLV answers "is our model good?" ROI answers "do we make money?" These are different questions with different answers.

CLV: +11.1% (IS), +11.2% (OOS). Answer: yes, the model is good.
ROI: -1.4% (IS), -3.6% (OOS). Answer: no, we don't make money.

The gap is execution cost — vig, odds quality, entry timing. This means the next dollar of improvement comes from execution infrastructure (better odds sources, tighter entry, vig reduction), not from more signal research.

The Congestion Filter Confirmation

The congestion filter's marginal contribution is -0.3pp — it makes things slightly worse. This independently confirms the post-mortem finding from March 18th that the filter was removing our best bets. The meta-analysis just verified it through a different methodology.

What Changes

Fix `runStandaloneSignal()`. Add minEdge: 0 to the standalone overrides so signals are tested in true isolation. This is a one-line fix that changes the meaning of every future signal test.
Retest all 40 accepted signals with `minEdge: 0`. Most will lose their positive ROI. The ones that survive are the real signals. The ones that don't were riding the edge threshold.
odds-cap-2.0 stays deployed. +4.2pp marginal is real and consistent IS/OOS.
Other Ted filters remain but are not load-bearing. Variance, congestion, and defiance filters are approximately neutral. Keeping them for risk management but not claiming they generate alpha.
Focus shifts to execution. The model works. The edge is real. The problem is capturing it. Next investment goes to odds quality, entry timing, and vig modeling.
No more ad hoc analysis scripts. All signal verdicts must use the canonical pipeline.
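One way to keep the harness from drifting again is a guard that fails whenever the default config contains a key the standalone overrides do not mention. The sketch below is an assumption about how such a check could look; the helper name and the default values other than minEdge = 0.07 are hypothetical.

```typescript
// Returns every config key that the overrides leave at its default.
function missingOverrides(
  defaults: Record<string, unknown>,
  overrides: Record<string, unknown>,
): string[] {
  return Object.keys(defaults).filter((key) => !(key in overrides));
}

// Assumed defaults; only minEdge = 0.07 is stated in the post.
const evalDefaults = {
  varianceFilter: true,
  congestionFilter: true,
  defianceFilter: true,
  maxOdds: 2.0,
  noDraws: true,
  minEdge: 0.07,
};

// Overrides as they stand today, and with the one-line fix applied.
const currentOverrides = {
  varianceFilter: false,
  congestionFilter: false,
  defianceFilter: false,
  maxOdds: 99,
  noDraws: false,
};
const fixedOverrides = { ...currentOverrides, minEdge: 0 };

const gapsBefore = missingOverrides(evalDefaults, currentOverrides);
const gapsAfter = missingOverrides(evalDefaults, fixedOverrides);
console.log({ gapsBefore, gapsAfter });
```

Run as part of the test suite, a check like this would also flag any filter added to the config later without a matching standalone override.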

The Infrastructure Made This Possible

The corrected study — 29,977 matches, 26 leagues, 12 signal configurations, 120,000 bootstrap iterations — ran in under 5 minutes on a single machine. Last week the underlying backtest alone would have taken 20 hours. The worker thread parallelization and solver memoization turned "too expensive to ask" into "ask it three times to make sure."

The speed upgrade also exposed the v1 error faster. When results come back in minutes instead of hours, you can afford to be suspicious and rerun. We caught the methodology bug on the same day we published — because we could check our work before the next meeting.

INVESTIGATION | Signal: meta-analysis-26-league-revalidation | 2026-03-19