April 14, 2026|Research

The Signal Pipeline Was Killing Real Alpha

Rewrote the 10-gate signal-approval pipeline into 6 data-driven gates. Dropped hardcoded N≥1000, +0.5pp practical-significance floor, p<0.10 bootstrap, 3pp interleave tolerance, and 1% suspicious-N dedup. Dry-run confirmed 0 live signals regress and 9 previously-rejected signals would be unblocked. Ran end-to-end on two of them: inter-model-disagreement failed Gate 2 (CI width too wide — a sharper rejection than the old pipeline managed), contextXg passed 6/6 (previously killed on the old +0.5pp floor). The reformed pipeline is live; pod-shop math fix and bake-off come next.

The 10-gate signal-test pipeline has shipped three live signals with real P&L: loss-weight-sweep, dixon-coles-rho-correction, and tc2-league-filter. It has also rejected nine signals that, under a more honest reading of the data, should never have been thrown out. One of them, contextXg, was rejected on a marginal of +0.49%, one basis point below a hardcoded +0.5pp floor. It passed the reformed pipeline today at 6/6. Here's what we did, why, and what's next.

The trigger

The 10-gate pipeline was designed after a bad incident: in March, 40 signals were "validated" on a pre-filtered bet pool, and meta-analysis discovered the leakage too late. The gates were built as scar tissue — defensive infrastructure to prevent it happening again. They worked, but they also ossified into a set of arbitrary thresholds that the pipeline couldn't justify with its own data:

N ≥ 1000 — hardcoded floor, no scaling with signal variance
Marginal ROI > +0.5pp — arbitrary practical-significance bar that ignored how unique the signal was to the stack
Bootstrap p < 0.10 — fixed significance threshold that punished signals with tight CIs that happened to straddle zero
Matchday interleave OOS gap ≤ 3pp — opaque tolerance
Suspicious N within 1% of another signal's N — false-positive-prone dedup heuristic
Walk-forward ≥ 2/3 folds positive, "ROI > −1pp counts as positive" — soft threshold that directly contradicted the strict marginal > 0 gate earlier in the same pipeline

The gates were also doing double duty with a parallel system — the pod-shop factorial framework (2^11 = 2,048 combinations, alpha/beta/IR per signal) that was meant to measure portfolio contribution but currently has known-broken math: all alpha/beta/IR values come out zero, and the best combo scorecard shows −1.65% ROI over 6,546 bets. Pod-shop has never shipped a live signal.

So there were three overlapping systems — gate pipeline, pod-shop factorial, gauntlet shadow UI — that didn't share a data plane, disagreed on what "alpha" meant, and between them were gating real alpha out of production.

The original plan was to collapse all three into one pipeline centered on pod-shop's alpha/beta/IR framework. That plan died during exploration because the empirical check was brutal: pod-shop's math is broken, the gates are the only working approval mechanism, and collapsing them into an aspirational framework would throw out working infrastructure. Reform, not replacement.

What changed

Phase A — reform the 10 gates into 6 data-driven ones. Ship this now. Nothing else.

Old → new

Old	New
G1 Pre-registered in registry + G2 true standalone config	G1 "Pre-registered + isolated" — merged. Single check. Pre-registration is required only for promotion; unregistered peeks use a new `scripts/explore-signal.ts`.
G3 `N ≥ 1000` hardcoded	G2 "Sufficient precision" — bootstrap CI95 width on marginal entry-adj ROI must be ≤ 4pp. Low N = wide CI = fails on uncertainty, not an arbitrary count. An N=800 signal with a tight CI passes; an N=1200 with huge variance fails.
G4 marginal > 0	G3 "Positive marginal alpha" — unchanged in spirit; kept strict. This is the deployment question.
G5 bootstrap p < 0.10	G4 "Bootstrap lower bound" — CI95[0] must be > `−0.5 × width`. Scales with the data's own noise instead of a fixed p-value.
G6 matchday interleave ≤ 3pp AND G10 ≥2/3 walk-forward folds positive, "> −1pp counts"	G5 "Walk-forward generalization" — continuous `monotonicityScore > 0.5`: the fraction of seasonal folds whose marginal is above the median across all folds. One metric, no contradiction.
G7 regime stratification (±5pp opposite-sign)	G6 "Regime consistency" — max-regime-gap must be inside `2 × CI_width`. Gaps inside the sample's own noise aren't penalized.
G8 Suspicious N within 1% dedup	Deleted. Heuristic that false-positived on independent signals with naturally similar coverage.
G9 marginal > +0.5pp hardcoded	Deleted. G3 already requires positive marginal; this was a double-dip with an arbitrary floor. A +0.3pp uncorrelated signal is better than +2pp redundant, and the hardcoded gate couldn't tell them apart.

10 gates → 6 gates. The only fixed thresholds left are the 4pp CI width and the 0.5 monotonicity score, both of which can be justified directly from the data; the G4 and G6 bounds scale with each signal's own CI width.
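As a rough illustration of how the six reformed checks fit together, here is a minimal TypeScript sketch. It is not the code in approve-signal.ts: the `GateInputs` shape, the field names, and the choice of `<=` at the G6 boundary are all assumptions, and G1 and the monotonicity score are taken as already computed upstream. The example values are the contextXg figures reported in the live retest further down.

```typescript
// Hypothetical input shape; field names are illustrative, not the pipeline's schema.
interface GateInputs {
  preRegisteredAndIsolated: boolean; // G1, assumed to be decided upstream
  marginalRoiPp: number;             // marginal entry-adj ROI, percentage points
  ci95LowerPp: number;               // bootstrap CI95 lower bound (CI95[0])
  ci95WidthPp: number;               // bootstrap CI95 width
  monotonicityScore: number;         // G5 score, assumed to be computed upstream
  maxRegimeGapPp: number;            // largest gap between regimes
}

interface GateResult {
  gate: string;
  pass: boolean;
}

const MAX_CI_WIDTH_PP = 4;    // G2 threshold from the table above
const MIN_MONOTONICITY = 0.5; // G5 threshold from the table above

function evaluateGates(s: GateInputs): GateResult[] {
  return [
    { gate: "G1 pre-registered + isolated", pass: s.preRegisteredAndIsolated },
    { gate: "G2 sufficient precision", pass: s.ci95WidthPp <= MAX_CI_WIDTH_PP },
    { gate: "G3 positive marginal alpha", pass: s.marginalRoiPp > 0 },
    // The floor scales with the signal's own CI width.
    { gate: "G4 bootstrap lower bound", pass: s.ci95LowerPp > -0.5 * s.ci95WidthPp },
    { gate: "G5 walk-forward generalization", pass: s.monotonicityScore > MIN_MONOTONICITY },
    // "Inside 2 x CI width" is read as <= here; the boundary handling is an assumption.
    { gate: "G6 regime consistency", pass: s.maxRegimeGapPp <= 2 * s.ci95WidthPp },
  ];
}

function passedCount(results: GateResult[]): number {
  return results.filter((r) => r.pass).length;
}

// contextXg figures as reported in the live retest.
const contextXgGates = evaluateGates({
  preRegisteredAndIsolated: true,
  marginalRoiPp: 0.5,
  ci95LowerPp: -0.8,
  ci95WidthPp: 2.64,
  monotonicityScore: 0.54,
  maxRegimeGapPp: 0.88,
});
console.log(`${passedCount(contextXgGates)}/${contextXgGates.length}`); // prints "6/6"
```

One property worth noticing under this reading: `lower > -0.5 × width` is the same inequality as `lower + 0.5 × width > 0`, so G4 amounts to requiring the CI midpoint to be positive.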

Exploration vs promotion

The second big change is the two-path model. The old pipeline treated *every* look at a signal as a commitment: you had to pre-register a hypothesis before running test-signal.ts, and skipping that step was blocked by the skill. That's fine for promotion but it's ceremony for ideation.

The new scripts/explore-signal.ts runs standalone + marginal + by-season on *any* signal id, registered or not. Output is labeled tier: EXPLORATORY and written to data/alpha-exploration/<id>.json. It never touches signal-registry.json. The rule: exploratory peeks can't be used to justify promotion — if a peek is interesting, you register the hypothesis and re-run through approve-signal.ts where the 6 reformed gates are strict and the pre-registration is honest.
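One way to make the "peeks can't justify promotion" rule hard to violate by accident is to carry the tier in the type. The sketch below is illustrative only: the `EXPLORATORY` label and the `data/alpha-exploration/<id>.json` path come from the description above, while the `PROMOTION` label, the type names, and the helper functions are hypothetical.

```typescript
// Hypothetical evidence types; only the EXPLORATORY label and the output path
// are taken from the text. Everything else is assumed for illustration.
interface ExploratoryEvidence {
  tier: "EXPLORATORY";
  signalId: string;
  outputPath: string;
}

interface PromotionEvidence {
  tier: "PROMOTION"; // assumed label for runs that go through approve-signal.ts
  signalId: string;
  preRegistered: true;
}

type Evidence = ExploratoryEvidence | PromotionEvidence;

function explorationOutputPath(signalId: string): string {
  return `data/alpha-exploration/${signalId}.json`;
}

function makeExploratoryEvidence(signalId: string): ExploratoryEvidence {
  return {
    tier: "EXPLORATORY",
    signalId,
    outputPath: explorationOutputPath(signalId),
  };
}

// Only pre-registered promotion runs may be cited when promoting a signal.
function canJustifyPromotion(e: Evidence): boolean {
  return e.tier === "PROMOTION";
}

const peek = makeExploratoryEvidence("contextXg");
console.log(peek.outputPath);           // prints "data/alpha-exploration/contextXg.json"
console.log(canJustifyPromotion(peek)); // prints "false"
```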

Safety net: dry-run against the existing registry

Before shipping the rewrite, I built scripts/dry-run-reformed-gates.ts, which replays the reformed 6-gate logic against every entry in data/signal-registry.json using only the aggregate stats already stored there. It reports two things: signals currently *accepted* that the new gates would *reject* (the critical safety check), and signals currently *rejected* that the new gates would *not* reject (the expected gains).

Result: 0 accepted→rejected flips. 9 rejected→accepted flips.
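The flip report itself is simple bookkeeping, and a sketch of it might look like the following. This is not the contents of dry-run-reformed-gates.ts: the `RegistryEntry` shape is hypothetical, and the reformed-gate check is passed in as a predicate so the sketch stays independent of how the gates are actually computed from the stored aggregates. The entries in the example are synthetic.

```typescript
type RegistryStatus = "accepted" | "rejected";

// Hypothetical registry shape; the real signal-registry.json schema may differ.
interface RegistryEntry {
  id: string;
  status: RegistryStatus;
}

interface FlipReport {
  acceptedToRejected: string[]; // the critical safety check: should stay empty
  rejectedToAccepted: string[]; // the expected gains
}

function classifyFlips<T extends RegistryEntry>(
  entries: T[],
  passesReformedGates: (entry: T) => boolean,
): FlipReport {
  const report: FlipReport = { acceptedToRejected: [], rejectedToAccepted: [] };
  for (const entry of entries) {
    const passes = passesReformedGates(entry);
    if (entry.status === "accepted" && !passes) {
      report.acceptedToRejected.push(entry.id);
    } else if (entry.status === "rejected" && passes) {
      report.rejectedToAccepted.push(entry.id);
    }
  }
  return report;
}

// Synthetic entries, for illustration only.
const demoEntries = [
  { id: "live-a", status: "accepted" as const, marginalPp: 1.2 },
  { id: "old-reject-b", status: "rejected" as const, marginalPp: 0.4 },
  { id: "old-reject-c", status: "rejected" as const, marginalPp: -0.3 },
];
const demoReport = classifyFlips(demoEntries, (e) => e.marginalPp > 0);
console.log(demoReport.acceptedToRejected.length); // prints "0"
console.log(demoReport.rejectedToAccepted);        // prints [ "old-reject-b" ]
```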

All three currently-live signals (tc2-league-filter, loss-weight-sweep, dixon-coles-rho-correction) pass the reformed gates cleanly. None flip.

The nine unblocked signals are the interesting part:

Signal	Marginal entry-adj ROI	Old failure
`inter-model-disagreement`	+1.08%	Bootstrap p=0.23
`odds-quality-routing`	+0.82%	Bootstrap p
`setpiece-xga-regression`	+0.73%	Hardcoded +0.5pp floor (wait — see note)
`set-piece-mismatch`	+0.73%	Hardcoded +0.5pp floor
`variance-regression`	+0.53%	Hardcoded +0.5pp floor
`contextXg`	+0.49%	Hardcoded +0.5pp floor AND bootstrap p=0.23
`motivation-discount-totals`	+0.41%	Hardcoded +0.5pp floor
`xg-finishing-persistence`	+0.20%	Noise-level
`marcel-concentration`	+0.01%	Noise-level

The bottom three are at noise level and would likely fail the actual reformed gates on CI width or walk-forward monotonicity — the dry-run only uses aggregate stats and can't compute those. But the top six are all signals the old pipeline killed on *how the data was summarized*, not on *whether the data supported the signal*.

Live retest: inter-model-disagreement vs contextXg

To prove the reformed pipeline isn't just rubber-stamping everything the old one killed, I ran two of the nine unblocked signals through the actual 6-gate check end-to-end. The results are a good stress test.

inter-model-disagreement — REJECTED (5/6)

The old system had promoted this signal to shadow after it passed 9/10 old gates (failing only bootstrap at p=0.23, with a registered result claiming "13/13 walk-forward folds positive" and "+1.1% marginal entry-adj ROI"). Under the reformed 6 gates:

G1 Pre-registered + isolated — PASS (N=63,444 standalone)
G2 Sufficient precision — FAIL (CI95 width = 6.04pp, threshold ≤ 4pp, marginal N = 4,161)
G3 Positive marginal alpha — PASS (+1.1%)
G4 Bootstrap lower bound — PASS (CI95 = [−1.9%, +4.1%], floor = −3.02%)
G5 Walk-forward generalization — PASS (monotonicity 0.54)
G6 Regime consistency — PASS (2pp gap vs 12pp 2×CI)

The reformed pipeline caught something the old one obscured. The marginal is +1.1%, but the per-fold breakdown, which the new continuous gate prints, shows 7 positive folds and 6 negative folds across 13 seasons, not "13/13 positive" as the old registry entry claimed. The negative folds include 2014: −11.4%, 2018: −8.9%, 2020: −2.8%, 2022: −2.6%, and 2026: −11.6%. The signal averages +1.1% because a few good seasons are pulling it up, but the CI is 6pp wide and the fold variance is enormous.

The old system smoothed this into "9/10 gates passed, shadow-promote on manual override." The reformed system says: your CI is too wide to trust, come back with more data. That's the correct answer. The old framing was flattering; the new framing is honest.
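For readers who want to see where a CI width like this comes from, here is a minimal percentile-bootstrap sketch. It is an assumption-laden stand-in, not the pipeline's implementation: it resamples individual per-bet returns, uses a small seeded PRNG (mulberry32) for reproducibility, and runs on synthetic data. How the real pipeline defines the resampling unit for marginal entry-adj ROI is not specified here.

```typescript
// Small seeded PRNG (mulberry32) so repeated runs give the same interval.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface Ci95 {
  lower: number;
  upper: number;
  width: number;
}

// Percentile bootstrap of the mean per-bet return (in percentage points).
function bootstrapCi95(returnsPp: number[], resamples = 2000, seed = 1): Ci95 {
  if (returnsPp.length === 0 || resamples < 1) {
    throw new Error("need at least one bet and one resample");
  }
  const rand = mulberry32(seed);
  const n = returnsPp.length;
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      // Draw a bet with replacement.
      sum += returnsPp[Math.floor(rand() * n)] ?? 0;
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lower = means[Math.floor(0.025 * (resamples - 1))] ?? NaN;
  const upper = means[Math.ceil(0.975 * (resamples - 1))] ?? NaN;
  return { lower, upper, width: upper - lower };
}

// Synthetic even-odds bets: +100pp on a win, -100pp on a loss, alternating.
function syntheticReturns(n: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < n; i++) out.push(i % 2 === 0 ? 100 : -100);
  return out;
}

const smallSample = bootstrapCi95(syntheticReturns(200));
const largeSample = bootstrapCi95(syntheticReturns(3200));
// The larger sample should come out with the narrower interval.
console.log(smallSample.width, largeSample.width);
```

The point of the sketch is only the mechanism: with the same per-bet variance, fewer bets means a wider interval, which is what G2 penalizes.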

contextXg — ACCEPTED (6/6)

Same setup, opposite result.

G1 Pre-registered + isolated — PASS (N=230,558 standalone)
G2 Sufficient precision — PASS (CI95 width = 2.64pp, well under 4pp, marginal N = 21,699)
G3 Positive marginal alpha — PASS (+0.5%)
G4 Bootstrap lower bound — PASS (CI95 = [−0.8%, +1.8%], floor = −1.32%)
G5 Walk-forward generalization — PASS (monotonicityScore 0.54)
G6 Regime consistency — PASS (gap 0.88pp vs 2×CI = 5.28pp)

Same +0.5% marginal that the old system rejected as "boundary case at exactly +0.5pp" — but contextXg has ~5× the marginal bet count of inter-model-disagreement, so its CI is less than half the width. A signal with a well-measured +0.5pp marginal is not the same thing as a signal with a poorly-measured +1.1pp marginal. The old gates couldn't tell those apart; the new ones can.
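The width gap between the two signals is roughly what sample size alone would predict. As a back-of-the-envelope check, assuming similar per-bet variance for both signals (an assumption, and one the fold variance of inter-model-disagreement gives reason to doubt), CI width scales with 1/√N. The helper names below are hypothetical.

```typescript
// Rule of thumb: CI width shrinks with the square root of the sample size,
// assuming the per-bet variance stays the same.
function projectedCiWidth(widthPp: number, nObserved: number, nTarget: number): number {
  return widthPp * Math.sqrt(nObserved / nTarget);
}

// Inverse of the same rule: how many bets until the width reaches a target.
function betsNeededForWidth(widthPp: number, nObserved: number, targetWidthPp: number): number {
  return Math.ceil(nObserved * Math.pow(widthPp / targetWidthPp, 2));
}

// inter-model-disagreement: 6.04pp wide at N = 4,161 marginal bets.
// Projected to contextXg's N = 21,699:
const projected = projectedCiWidth(6.04, 4161, 21699);
console.log(projected.toFixed(2)); // about 2.64, in line with contextXg's reported 2.64pp

// Bets needed for inter-model-disagreement to reach the 4pp G2 threshold:
const needed = betsNeededForWidth(6.04, 4161, 4);
console.log(needed); // prints 9488
```

Under that assumption, "come back with more data" for inter-model-disagreement means roughly 9,500 marginal bets, a little more than double what it has now.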

The old pipeline killed contextXg on two gates that no longer exist:

G5 bootstrap p=0.23 — replaced with G4 "lower bound vs own CI width"
G9 marginal > +0.5pp hardcoded — deleted entirely

contextXg is now status: accepted in the registry. The entry retains a full audit trail linking the old rejection and the new acceptance so the history isn't lost.

The contextXg deployment question

Charles has already asked for contextXg to be flipped on in production. The mechanism is simple — lib/backtest/bet-evaluator.ts line 346, contextXgEnabled: false → true in DEFAULT_EVAL_CONFIG. Single-variable diff, reviewable, revertible.

Per the rules, I won't touch it without explicit confirmation and a verification step. But the walk-forward gate that the project's global rules require for any production parameter change is *exactly* what Gate 5 does in the reformed pipeline. The gate ran. The gate passed. The ball is in the user's court.

What's next

The rearchitecture plan is four phases deep. Phase A is the one that shipped today. Phases B, C, D are still ahead and each exists to buy the pod-shop framework a fair shot at replacing the gate pipeline — but only after it earns the right.

Phase B — fix pod-shop math

The pod-shop execution plan already identifies the bugs:

Filter-signal alpha/beta/IR is wrong. Filter signals change *which* bets are taken, so pairing configs by "one bit flipped" requires re-aligning bet sets before diffing profits. Currently the analysis subtracts profit arrays of different lengths.
Lambda-signal config pairing is wrong. Lambdas produce continuous probability deltas, so the paired-config diff needs to hold filter bits constant while varying the lambda. Currently the pairing collapses across filter dimensions.
Correlation matrix compute path shows all zeros, likely mis-keyed arrays across configs with different bet orderings.
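The first and third bugs share a root cause: profit arrays from different configs are being compared by position rather than by bet. A minimal sketch of the re-alignment step, keyed on a bet identifier, is below. The `BetResult` shape and the `betId` field are hypothetical, and treating a bet that a config skipped as zero profit is an assumption about what the fix should do, not a description of the pod-shop execution plan.

```typescript
// Hypothetical per-bet record; pod-shop's real shape may differ.
interface BetResult {
  betId: string;
  profit: number;
}

interface ProfitDiff {
  betId: string;
  diff: number; // variant profit minus base profit for the same bet
}

// Align two configs by bet id before diffing. A bet that one config did not
// take contributes zero profit for that config (assumption).
function alignedProfitDiffs(base: BetResult[], variant: BetResult[]): ProfitDiff[] {
  const baseById = new Map<string, number>();
  base.forEach((b) => baseById.set(b.betId, b.profit));
  const variantById = new Map<string, number>();
  variant.forEach((b) => variantById.set(b.betId, b.profit));

  const ids = new Set<string>();
  baseById.forEach((_profit, id) => ids.add(id));
  variantById.forEach((_profit, id) => ids.add(id));

  const out: ProfitDiff[] = [];
  ids.forEach((id) => {
    out.push({ betId: id, diff: (variantById.get(id) ?? 0) - (baseById.get(id) ?? 0) });
  });
  // Sort so the output order does not depend on either config's bet ordering.
  return out.sort((a, b) => a.betId.localeCompare(b.betId));
}

// Synthetic example: the variant's filter drops bet "b" and emits bets in a different order.
const baseBets: BetResult[] = [
  { betId: "a", profit: 1 },
  { betId: "b", profit: -1 },
  { betId: "c", profit: 2 },
];
const variantBets: BetResult[] = [
  { betId: "c", profit: 2 },
  { betId: "a", profit: 1 },
];
const diffs = alignedProfitDiffs(baseBets, variantBets);
console.log(diffs.map((d) => `${d.betId}:${d.diff}`).join(" ")); // prints "a:0 b:1 c:0"
```

Subtracting the two arrays by index would instead pair "c" with "a" and "a" with "b", which is the kind of mismatch described above.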

The acceptance criterion for Phase B is concrete and falsifiable: the fixed scorecard must produce IR/alpha values for the three currently-live signals (`loss-weight-sweep`, `dixon-coles-rho-correction`, `tc2-league-filter`) that are directionally consistent with their realized P&L. If it can't retroactively score signals we already know worked, it can't prospectively identify new ones.

Time-box: 2 weeks. If the math still isn't producing sensible numbers for the known-good signals by the end of the box, the debugging effort was aimed at the wrong layer and pod-shop needs a more fundamental redesign.

Phase C — the bake-off

Once pod-shop's numbers are trustworthy, both systems run on every new candidate signal. The 6-gate pipeline stays as the shipping path. Pod-shop is informational. A new /pod?tab=bakeoff view shows, for each signal: the 6-gate verdict, the pod-shop scorecard (alphaVsProd, IR, uniqueAlpha, correlation-to-stack), and — once shadow data exists — the realized live performance.

The specific case that would validate pod-shop as a better framework: a signal that pod-shop ranks highly (high IR, low correlation to the stack) that the 6-gate pipeline either rejected or never prioritized, where shadow data subsequently confirms positive alpha. Until that happens at least once, pod-shop has not earned replacement rights.

Phase D — conditional consolidation

If Phase C produces ≥1 pod-shop-originated signal that the gate pipeline would have missed AND that signal shows positive shadow alpha over a meaningful window, then consolidation can be reconsidered with a concrete track record. Until then:

6-gate pipeline stays canonical.
/gauntlet stays as the live shadow UI. Charles likes the gauntlet UI and pod-shop hasn't earned it.
Pod-shop stays as an analytical layer on /pod.

The bigger picture

Two rules from the global playbook apply here. The first: *never throw away validated signal because implementation is weak. Iterate: entry odds, regime modifier, relaxed criteria.* The old pipeline was doing the opposite — it was throwing away validated signal because the *gates* were weak. The gates had arbitrary thresholds that couldn't justify themselves to the data, and the signals were paying the price.

The second: *stop after the second analytical reversal.* The original rearchitecture plan was going to collapse the gate pipeline into pod-shop. One empirical check reversed it. That was the only reversal — the reformed-gate plan has been consistent from the moment that check came back — but it's worth naming. The temptation to keep the ambitious plan was real. The evidence said "no." The evidence wins.

What shipped today is the less ambitious version of the plan, and it's the right version. The 10-gate pipeline got rewritten into 6 data-driven gates, an unregistered exploration path, and a full audit trail for the one signal that cleared end-to-end. The pod-shop rescue is still ahead. The bake-off is still ahead. The automated promotion and rollback bot is still ahead. But the single biggest unlock today — contextXg at 6/6 — came from a pipeline reform that was purely about *stripping arbitrary rules*, not about adding a new framework.

That's the order of operations. Fix what's broken. Prove the replacement earns its place. Don't do it the other way around.
