March 28, 2026|Infrastructure|DEPLOYED

The Infrastructure Overhaul: Tests, Correct Metrics, and a Better Solver

48-hour infrastructure overhaul after discovering entry-adjusted ROI was +5.1%. Built 125-test suite from scratch (settlement, Poisson math, devigging, sizing, bootstrap). Wired entry-adjusted ROI through all 10 approval gates. Graduated shadow solver (market-only + Dixon-Coles rho). Re-evaluated 7 signals, approved tc2-league-filter (+1.2% marginal). Kelly sizing research: quarter Kelly optimal, full Kelly catastrophic. Zero tests to full CI coverage in two days.

Tests

125

< 1 second

Entry-Adj AH ROI

+5.1%

now in all gates

Solver

Market-Only

+ DC rho=0.05

Signal Approved

tc2-league-filter

+1.2% marginal

The Infrastructure Overhaul: Tests, Correct Metrics, and a Better Solver

Two days ago we discovered the backtest was wrong — AH entry-adjusted ROI is +5.1%, not the -3.0% we'd been reporting at closing odds. That finding changed everything. But knowing the number is wrong is different from making sure it can never be wrong again.

This post covers what we built in the 48 hours since: a test suite for every piece of financial math, entry-adjusted ROI wired through every decision gate, a graduated solver, re-evaluated signals, and a Kelly sizing study. It's the largest single infrastructure change since the system was built.

The Problem We Were Solving

Zero unit tests existed on any financial logic. The AH settlement function — which determines whether every bet is won, lost, or pushed, and how much money changes hands — had no automated verification. A single wrong sign in the adjusted-difference computation would silently flip every AH bet outcome, and nothing would catch it until the money was gone.

On top of that, every signal approval gate, every bootstrap test, and every exploration script was making decisions based on closing-odds ROI. With CLV averaging +5.3%, this understated AH performance by approximately 8 percentage points. Every "this signal doesn't work" conclusion was measured against the wrong baseline.

What We Built

1. Test Suite from Scratch

We installed Vitest and wrote 125 tests across 10 files. They run in under a second.

Unit tests (98 tests, 7 files) cover every piece of math that determines bet outcomes:

AH settlement — 20 tests covering the full quarter-line matrix. Full win, half win, push, half loss, full loss across lines from -0.25 through -1.75, both home and away sides. Tests verify that loss profits are always exactly -1 (full) or -0.5 (half) regardless of odds, while win profits scale correctly. This is the most dangerous function in the codebase.

Bivariate Poisson — 18 tests verifying the score grid sums to 1.0, all market derivations (1X2, over/under, BTTS, Asian handicap) each sum to 1.0, the independence baseline matches theory, and outputs match frozen reference values computed from the actual code. If anyone changes the grid math, these tests break immediately.

Devigging — 14 tests. Stripped probabilities must sum to 1.0. Symmetric odds must produce exactly 50/50. Invalid odds must return null. Time decay must equal 1.0 at time zero and decrease monotonically.

Sizing engine — 18 tests covering all five staking strategies. Kelly formula correctness, zero-edge handling, bounds clamping, and the critical invariant that Bayesian Kelly never exceeds regular Kelly.

Entry odds estimation — 7 tests on the half-CLV formula that the entire +5.1% finding depends on. Zero CLV identity, direction checks, and hand-computed numeric verification.

Block bootstrap — 8 tests. Deterministic seeding, known-positive bets produce significant p-values, balanced bets produce p near 0.5, max drawdown matches hand-computed sequences.

Performance metrics — 8 tests. Empty arrays don't produce NaN, known win/loss sequences at known odds produce exact expected values, entry ROI falls back correctly when entry profit is undefined.

Property-based tests (27 tests, 3 files) use fast-check to throw thousands of random inputs at the math:

For any realistic lambda pair and any valid odds: score grids have no negative cells and sum to 1.0, all market derivations sum to 1.0, devigged probabilities sum to 1.0. For any AH quarter-line and any scoreline: exactly one of five outcomes applies, and profit signs are always correct. For any positive edge and valid odds: all sizing strategies return finite non-NaN values within bounds.

These invariants held across every random input. No bugs found — which means the settlement logic, probability math, and sizing engine are internally consistent. The unit tests catch value correctness; the property tests catch structural violations at unusual input combinations.

2. Entry-Adjusted ROI Everywhere

Good news from the audit: most of the plumbing already existed. BetRecord had an entryProfit field, estimateEntryOdds() was computing entry odds, and summarizeBets() was returning entryROI. The block bootstrap was already computing entry ROI distributions.

What was missing: none of this flowed through the signal approval gates or the exploration scripts.

We wired it through every gate in the 10-gate approval system:

Gate 4 (marginal ROI > 0) now decides on entry-adjusted marginal ROI
Gate 5 (bootstrap significance) now bootstraps both profit and entryProfit, decides on entry-adjusted p-value
Gate 6 (OOS consistency) checks entry-adjusted gap and sign-flip
Gate 7 (regime stratification) checks entry-adjusted regime signs
Gate 9 (practical significance > +0.5pp) uses entry-adjusted threshold
Gate 10 (walk-forward) evaluates fold pass/fail on entry-adjusted ROI

Every output line shows both: entry-adj ROI (half CLV est.) = +8.8% [closing: -2.0%]. The label "(half CLV est.)" appears on every entry-adjusted number — it's an estimate on backtest data, never presented as measured fact.

The test-signal exploration script shows both metrics in summaries, attribution sections, walk-forward tables, and registry stats.

3. CI Pipeline

Both GitHub Actions workflows now run the test suite:

backtest-gate.yml (PR checks) runs unit + property tests before the existing backtest regression check. Tests take under a second. If they fail, the PR is blocked. Trigger paths now include the paper-trade, simulation, and signals directories that were previously unwatched.

nightly-backtest.yml (3 AM UTC daily) runs the full test suite before the backtest. Issue titles now specify what failed — "Unit/property tests", "Backtest regression", or both.

4. Shadow Model Graduated

The shadow solver (market-only inputs + Dixon-Coles correlation correction) has been validated through three independent mechanisms:

Dev set: +1.05pp ROI improvement
Holdout set: +0.45pp ROI improvement
30 live shadow bets: +44.6% ROI vs production +29.7% on the same bets
240-variant tracking system confirmed market-only + DC rho as top-performing variant

We graduated it to production. solve-latest.ts now uses outcomeWeight=0, xgWeight=0 (market-only — the solver fits to Pinnacle odds only, ignoring actual match results and xG) and adds dixonColesRho=0.05 to every league's output params for low-scoring match correction.

The previous production config is preserved in the variant tracking system for ongoing comparison.

What this means: the solver is no longer trying to simultaneously fit odds, results, and xG. It focuses purely on the Pinnacle market signal, which is the highest-quality input. The calibration tax — previously the biggest ROI drag at -4.1pp — should shrink because the solver isn't being pulled in three directions at once.

5. Signals Re-evaluated

We re-ran 7 signals through the corrected approval gates.

No signal flipped from reject to accept. The corrected ruler actually made it harder for marginal signals to show value — the base portfolio is now measured at +7.6% entry-adjusted ROI instead of -3.0% closing. Signals need to improve on a strong baseline, not just reduce losses from a negative one.

The most interesting finding: market-type-sizing-deployment (AH-only routing) flipped from +0.7% marginal to -0.3% with entry-adjusted ROI. The "benefit" of AH-only was an illusion — when you properly account for CLV capture, the full portfolio also benefits, and more so. The ruler didn't move the goalposts. It revealed that the floor is higher.

One signal approved: tc2-league-filter — removes Segunda, La Liga, and Ligue 2 from bet generation. It scored 8/10 gates with entry-adjusted marginal ROI of +1.2%, 4/4 walk-forward folds positive at +6-10% each, 1.7pp OOS gap, and all regimes positive. The two failing gates were bootstrap significance (p=0.1567 vs 0.10 threshold — noisy on a small marginal) and suspicious N (a false positive from a coincidental match with an unrelated signal). Manual approval based on overwhelming directional evidence.

We also tightened Gate 8 (Suspicious N) from a 10% window to 1%. The old threshold meant any signal with N between 69K-96K would flag against a signal at 88K. Most signals are near-full-universe filters, so almost everything triggered. The new 1% window catches actual duplicates.

6. Kelly Sizing Research

With the edge now proven at +5.1%, flat $20 stakes are sub-optimal. We wrote a walk-forward Kelly simulation testing four strategies on 2023-24 and 2024-25 holdout data.

Quarter Kelly wins. Best Sharpe ratio (3.30 average), reasonable drawdowns, and the best risk-adjusted returns. Full Kelly produced catastrophic drawdowns (95-99%) despite high nominal returns — confirming the textbook warning. Half Kelly was a viable middle ground but with meaningfully worse drawdowns than quarter.

The critical finding: Kelly with overestimated edge is worse than flat. When we simulated 2x edge overestimation (halved actual profit), Kelly returns dropped dramatically while flat returns simply halved proportionally. Since the calibration tax means our edge estimates are still somewhat noisy, Bayesian Kelly with an uncertainty discount is the safer variant.

Recommendation: activate quarter Kelly in shadow mode first. Track alongside flat stakes for 50+ bets before switching live.

What Changed in Production Code

Two functions in bet-evaluator.ts were exported (previously private): settleAH() and estimateEntryOdds(). No behavior change — same functions, same signatures, just visible to tests.

Gate 8 threshold tightened from 10% to 1%.

solve-latest.ts now uses market-only config with Dixon-Coles rho.

All 26+ league param files regenerated.

Everything else was additive — new test files, new fixture factories, updated CI workflows, updated output formatting.

The New Infrastructure Plan

Here's how all these pieces fit together going forward.

Testing Stack

Every PR that touches model code, settlement, sizing, simulation, or test files runs 125 tests in under a second. The tests verify six invariants:

Settlement math produces correct outcomes for every AH quarter-line scenario
Score grids sum to 1.0 and all market derivations are internally consistent
Devigging preserves probability axioms
Sizing strategies produce sane outputs within bounds
Entry odds estimation handles edge cases correctly
Bootstrap statistics produce correct p-values for known distributions

Property-based tests additionally verify these invariants hold for thousands of random inputs.

Metrics Pipeline

Every metric now has two versions: closing-odds and entry-adjusted. The entry-adjusted number uses the half-CLV estimate and is labeled as such everywhere. The signal approval system decides on entry-adjusted ROI. Live bets with actual market odds compute real entry ROI instead of the estimate.

Solver

Market-only with Dixon-Coles correction. The solver fits to Pinnacle closing odds only, ignoring match outcomes and xG. This produces tighter convergence and reduces the calibration tax. Dixon-Coles rho of 0.05 corrects for low-scoring match patterns (more 0-0 and 1-1 draws than independent Poisson predicts).

Signal Approval

10 gates. All must pass for deployment, with documented manual overrides allowed when the evidence is overwhelming (as with tc2-league-filter's bootstrap near-miss). Gate 8 now uses 1% N-matching instead of 10%. Gates 4, 5, 6, 7, 9, and 10 decide on entry-adjusted ROI.

Sizing Research

Quarter Kelly validated as optimal by Sharpe ratio. Bayesian Kelly with uncertainty discount recommended for initial deployment. Shadow mode first, 50+ bets to confirm before switching from flat.

What's Next

Wire tc2-league-filter into picks pipeline — approved but not yet integrated into live bet generation
Monitor graduated solver — compare market-only + DC rho performance against the old config via the variant tracking system
Activate quarter Kelly in shadow — after 50+ bets confirm the edge estimate is stable
OU25 re-evaluation — deferred until AH calibration improves (entry-adjusted OU25 ROI is only +1.4%)

The Meta-Lesson

We spent two weeks testing 40+ signals against a wrong baseline. None of them worked because the baseline was wrong, not because the signals were bad. The model was profitable the entire time — we just couldn't see it because we were measuring at closing odds instead of entry odds.

The infrastructure changes in this post exist to make sure that kind of measurement error can never hide silently again. Every critical calculation now has a test. Every metric shows both versions. Every decision gate uses the correct number. And the CI pipeline catches regressions before they reach production.

The system went from "zero tests on financial logic" to "125 tests, two CI workflows, and entry-adjusted metrics everywhere" in 48 hours. The highest-ROI item was the AH settlement test suite — not because it found a bug (it didn't, this time), but because the next time someone touches that code, the test will be there.