# M3 Phase 3 Milestone 1 Ship Summary

Status: shipped  
Date: 2026-05-05  
Scope: HR Valinor hardening plus reusable calibration/replay infrastructure

## What Shipped

Phase 3 Milestone 1 turned HR Valinor from a rebadged research candidate into a production calibration lane. The final production lane is a non-stacked LightGBM model with beta calibration, trained without proxy Palantir/Fangorn features and served through the existing HR daily pipeline.

The milestone also delivered the reusable calibration harness, production replay pattern, and forward prediction log that later Valinor milestones should use before promotion decisions.

## Commit Ladder

| Sub-step | Commit | Summary |
| --- | --- | --- |
| 1 | `5553ad6` | Built reusable calibration harness and documentation. |
| 2 | `75e48b6` | Trained first hardened stacked LightGBM + Platt HR Valinor artifact. |
| 3 | `6eed3a8` | Ran proxy-baseline promotion evaluation; result was too clean and later rejected. |
| 3.5 | `9b15063` | Added production replay and real-baseline evaluation; mechanical result became do-not-promote. |
| 3.6 | `09fb0f9` | Diagnosed underperformance; found proxy stacking and calibration selection problems. |
| 3.7 | `ea95621` | Ran candidate bake-off; selected non-stacked LightGBM + beta as best calibration/distinctness tradeoff. |
| 4 | `e8ef64b` | Wired HR Valinor production lane and preserved the rebadged baseline as backward compatibility. |
| 5 | This commit | Verification, replay/forward-log documentation, and this ship summary. |

## Methodology Journey

Sub-step 2 trained a stacked LightGBM model with Platt calibration using synthetic/proxy Palantir and Fangorn features. It was a defensible first pass, but it created a train/serve mismatch because the upstream probabilities were not the real served production probabilities.

Sub-step 3 compared that model to a proxy baseline and returned `PROMOTE-RECOMMENDED`; the result was too clean to trust. Sub-step 3.5 replayed the actual production HR pipeline over the held-out window and showed that the proxy comparison was misleading. Against the real served baseline, the stacked model did not pass the mechanical gate.

Sub-step 3.6 isolated the likely causes: synthesized stacking features hurt robustness, and Platt calibration, although it was the best choice on the calibration fold, did not generalize best to the real replay window. Sub-step 3.7 then ran a candidate bake-off across the available HR probability bundle and the non-stacked diagnostic model. The non-stacked LightGBM with beta calibration offered the best tradeoff between calibration and distinctness.
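For readers unfamiliar with the two calibrators: Platt scaling fits a logistic curve to the raw score, while beta calibration maps a raw probability `p` to `sigmoid(a*ln(p) - b*ln(1-p) + c)`, which contains the identity map at `a=1, b=1, c=0`. The sketch below fits that map with plain gradient descent using only the standard library. It is an illustration of the technique, not the project's calibrator: the function names, learning rate, and iteration count are assumptions, and the usual `a, b >= 0` constraint is omitted for brevity.

```python
import math


def _features(p, eps=1e-6):
    """Beta-calibration features (ln p, -ln(1-p)) for one raw probability."""
    p = min(max(p, eps), 1.0 - eps)  # clip so both logs stay finite
    return math.log(p), -math.log(1.0 - p)


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_beta_calibration(raw_probs, labels, lr=0.05, n_iter=2000):
    """Fit (a, b, c) by batch gradient descent on mean log loss.

    Starts from the identity map (a=1, b=1, c=0). Hypothetical helper;
    hyperparameters are illustrative, not tuned.
    """
    feats = [_features(p) for p in raw_probs]
    a, b, c = 1.0, 1.0, 0.0
    n = len(labels)
    for _ in range(n_iter):
        ga = gb = gc = 0.0
        for (u, v), y in zip(feats, labels):
            err = _sigmoid(a * u + b * v + c) - y  # d(log loss)/d(logit)
            ga += err * u
            gb += err * v
            gc += err
        a -= lr * ga / n
        b -= lr * gb / n
        c -= lr * gc / n
    return a, b, c


def apply_beta_calibration(p, params):
    """Map one raw probability through the fitted beta-calibration curve."""
    a, b, c = params
    u, v = _features(p)
    return _sigmoid(a * u + b * v + c)
```

Because the map has three parameters instead of Platt's two, it can correct asymmetric miscalibration near 0 and near 1 separately, which is one plausible reason it held up better on the replay window.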

Sub-step 4 promoted that model by human judgment. The strict "beat baseline on Brier and ECE" rule was not met because Brier was worse by `0.000039`, but the paired bootstrap 95% CI on the Brier difference straddled zero and the ECE improvement (about 60%) was material. For Valinor's stated role, calibration reliability is the defining metric.

## Final Production State

| Item | Value |
| --- | --- |
| Lane | HR Valinor |
| Status | Production |
| Production probability column | `hr_prob_valinor_nonstacked_beta` |
| Backward-compatible baseline column | `hr_prob_rf_safe_logistic_anchor` |
| Algorithm | LightGBM, non-stacked |
| Calibration | Beta calibration |
| Training cadence | Weekly, Monday 03:45 via `train_hr_valinor` |
| Latest verified artifact | `outputs/models/hr_valinor/hr_valinor_nonstacked_lgbm_beta_20260505T164141Z.joblib` |
| Latest verified metadata | `outputs/models/hr_valinor/hr_valinor_nonstacked_lgbm_beta_20260505T164141Z.metadata.json` |
| Production daily task | `run_home_run_daily_production` |
| Forward log | `data/derived/predictions_log/hr_predictions_<date>.parquet` |

## Calibration Metrics

Replay window: 2025-08-15 through 2025-09-15, 9,403 served rows.

| Model | Brier | Log Loss | ECE | AUC |
| --- | ---: | ---: | ---: | ---: |
| Palantir context | 0.122772 | 0.409969 | 0.011731 | 0.636959 |
| Fangorn context | 0.123136 | 0.411339 | 0.006471 | 0.632948 |
| Rebadged Valinor baseline | 0.124683 | 0.414520 | 0.036910 | 0.636127 |
| Production HR Valinor | 0.124722 | 0.413736 | 0.014869 | 0.593233 |

Promotion gate note:

| Check | Result |
| --- | --- |
| ECE delta vs rebadged baseline | `0.036910 -> 0.014869`, about 60% improvement |
| Brier delta vs rebadged baseline | `+0.000039` |
| Brier paired bootstrap 95% CI | `[-0.000523, +0.000632]` |
| Brier bootstrap p-value | `0.868` |
| Promotion decision | Human-judgment promote: ECE win is material, Brier loss is noise-level |
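The Brier check in this table can be sketched as a paired bootstrap over per-row squared-error differences. This is a minimal standard-library illustration, not the harness implementation: the function names, the resample count, and the percentile method are assumptions.

```python
import random


def paired_bootstrap_brier_delta(y, p_candidate, p_baseline, n_boot=2000, seed=0):
    """Return (delta, ci_low, ci_high) for Brier(candidate) - Brier(baseline).

    Rows are resampled with replacement while each row's two predictions
    stay paired; the CI is the 2.5th to 97.5th percentile of the
    resampled deltas.
    """
    # Per-row squared-error difference; its mean is the Brier delta.
    diffs = [
        (pc - yi) ** 2 - (pb - yi) ** 2
        for yi, pc, pb in zip(y, p_candidate, p_baseline)
    ]
    n = len(diffs)
    delta = sum(diffs) / n
    rng = random.Random(seed)
    boots = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    ci_low = boots[int(0.025 * n_boot)]
    ci_high = boots[int(0.975 * n_boot) - 1]
    return delta, ci_low, ci_high


def ci_straddles_zero(ci_low, ci_high):
    """True when the interval cannot rule out a zero difference."""
    return ci_low <= 0.0 <= ci_high
```

Applied to the reported interval, `ci_straddles_zero(-0.000523, 0.000632)` returns `True`, which is the sense in which the `+0.000039` Brier loss is called noise-level.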

## Verification

Forced scheduler slice:

| Task | Status | Metric |
| --- | --- | --- |
| `run_home_run_daily_production` | Success | 237 prediction-log rows for 2026-05-04 |
| `train_hr_valinor` | Success | Wrote latest metadata artifact `20260505T164141Z` |

The health check still returned nonzero, but because of broader ops state carried over from before this milestone, not because of Phase 3 failures. The carried failures included stale CLV snapshots, stale shared fetches, existing hits/outs board failures, and stale projection/scorecard tasks.

Current HR production verification for 2026-05-04:

| Check | Result |
| --- | --- |
| HR edge rows | 237 |
| Valinor nulls | 0 |
| `valinor_coverage_status` | `production` on all rows |
| `model_lane_count` | 3 on all rows |
| Agreement distribution | 30 aligned, 150 mixed, 57 disagreement |
| Consensus route counts | 30 aligned rows, 57 disagreement rows |
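The row-level checks in this table could be reproduced with something like the sketch below. It assumes the edge rows are already loaded as a list of dicts; `verify_hr_rows` and the `agreement` key are hypothetical names, while the other column names are taken from this document.

```python
from collections import Counter

VALINOR_COL = "hr_prob_valinor_nonstacked_beta"


def verify_hr_rows(rows, agreement_key="agreement"):
    """Summarise the production checks for one date's HR edge rows.

    `agreement_key` is a hypothetical column name holding the
    aligned / mixed / disagreement label for each row.
    """
    return {
        "rows": len(rows),
        "valinor_nulls": sum(1 for r in rows if r.get(VALINOR_COL) is None),
        "all_production": all(
            r.get("valinor_coverage_status") == "production" for r in rows
        ),
        "all_three_lanes": all(r.get("model_lane_count") == 3 for r in rows),
        "agreement": dict(Counter(r.get(agreement_key) for r in rows)),
    }
```

A passing day has zero nulls, both `all_*` flags true, and agreement counts that sum to the row count (here 30 + 150 + 57 = 237).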

Route smoke tests:

| Surface | Result |
| --- | --- |
| `/betting/home-run-edges?lane=valinor` | 200 |
| `/betting/consensus` | 200 |
| `/models/valinor` | 200, HR displays Production |
| Other betting lane routes | 200 across Palantir/Fangorn/Valinor checks |
| Date and alias routes | 200 for strikeout, total base, hits, pitcher outs, and Fangorn date route checks |

Test suite:

| Suite | Count | Result |
| --- | ---: | --- |
| Lane-contract tests | 25 | Pass |
| Calibration harness tests | 9 | Pass |
| Total verified tests | 34 | Pass |

The prompt expected roughly `27 + 9` tests (36). The current repository contains 25 lane-contract tests and 9 calibration harness tests, so the verified total is 34.

Forward logging:

| File | Rows | Valinor populated |
| --- | ---: | ---: |
| `data/derived/predictions_log/hr_predictions_2026-05-04.parquet` | 237 | 237 |

## Reusable Infrastructure

Calibration harness:

- `src/pitcher_card_engine/domains/betting/_shared/calibration/`
- `brier_score`, `log_loss`, `expected_calibration_error`, `reliability_diagram`, `mondrian_calibration_report`, and `calibration_comparison`
- Paired bootstrap comparison for Brier/log-loss uncertainty
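As a rough illustration of what two of these metrics compute, the sketch below implements a Brier score and an equal-width-bin ECE in plain Python. It is not the harness code: the `_sketch` names, the equal-width binning, and the 10-bin default are assumptions, since this summary does not state the harness's binning scheme.

```python
def brier_score_sketch(y_true, y_prob):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error_sketch(y_true, y_prob, n_bins=10):
    """Equal-width ECE: sum over bins of (bin share) * |mean prob - hit rate|."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 lands in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        hit_rate = sum(y for y, _ in members) / len(members)
        mean_prob = sum(p for _, p in members) / len(members)
        ece += len(members) / n * abs(mean_prob - hit_rate)
    return ece
```

ECE values depend on the binning choice, so comparisons such as the `0.036910 -> 0.014869` delta are only meaningful when both models are scored with the same bins.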

Replay and evaluation infrastructure:

- `scripts/shared/replay_hr_production_window.py`
- `scripts/shared/evaluate_hr_valinor_promotion.py`
- `scripts/shared/diagnose_hr_valinor_underperformance.py`
- `scripts/shared/bakeoff_hr_valinor_candidates.py`
- `scripts/shared/promote_hr_valinor_nonstacked_beta.py`

The calibration harness doc now states the Phase 3 rule explicitly: use forward logs whenever possible; if logs are missing, replay the real production pipeline with artifact/input provenance and skipped-date disclosure before trusting a promotion gate.
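One way to apply that rule is to split an evaluation window into dates that already have a forward log and dates that would need replay or skipped-date disclosure. The sketch below assumes the log path pattern shown earlier in this document with an ISO `YYYY-MM-DD` date; the function names are hypothetical and it only checks file existence, not contents.

```python
from datetime import date, timedelta
from pathlib import Path

LOG_DIR = Path("data/derived/predictions_log")


def forward_log_path(day, log_dir=LOG_DIR):
    """Path of the HR forward log for one date."""
    return Path(log_dir) / f"hr_predictions_{day.isoformat()}.parquet"


def split_window_by_log_coverage(start, end, log_dir=LOG_DIR):
    """Split an inclusive date window into (logged, missing) date lists."""
    logged, missing = [], []
    day = start
    while day <= end:
        if forward_log_path(day, log_dir).exists():
            logged.append(day)
        else:
            missing.append(day)
        day += timedelta(days=1)
    return logged, missing
```

The `missing` list is what would be handed to the replay script, and any date that replay also cannot cover belongs in the skipped-date disclosure.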

## Playbook For P3 M2+

- Do not stack on synthesized proxy upstream features. Use real upstream predictions only, which means forward logging must exist before training/evaluation.
- Calibration method selection must be validated on real held-out data, not just calibration-fold metrics.
- Distinctness checks are mandatory for rebadge or challenger selection. A candidate highly correlated with Palantir or Fangorn is not a useful third ideology even if its metrics look acceptable.
- Promotion gate precedent: ECE is primary for Valinor, Brier is secondary. A Brier loss whose paired bootstrap CI straddles zero is treated as noise and should not block a material ECE improvement.
- Replay infrastructure should be built before evaluation, not bolted on after a too-good-to-trust result.
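The distinctness rule in this playbook could be checked with a simple correlation screen against the incumbent lanes. This summary does not state which distinctness metric the bake-off used, so the sketch below is an assumption throughout: it uses Pearson correlation on served probabilities, a hypothetical `max_corr` threshold of 0.95, and hypothetical function names. It also assumes each probability series has nonzero variance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def is_distinct(candidate, incumbents, max_corr=0.95):
    """Return (passes, correlations) for a candidate probability series.

    `incumbents` maps a lane name (for example "palantir") to that lane's
    probabilities on the same rows. The candidate passes only if its
    correlation with every incumbent is below `max_corr`.
    """
    corrs = {name: pearson(candidate, probs) for name, probs in incumbents.items()}
    return all(c < max_corr for c in corrs.values()), corrs
```

Running the screen on replayed or forward-logged rows, rather than on proxy features, keeps it consistent with the first playbook item.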

## Carried Debts

- Real upstream feature backfill and forward-log accumulation are needed before the next HR Valinor retrain. Target at least 4-6 weeks of forward logs.
- Park ID and handedness coverage were incomplete in the training/evaluation frame; future HR methodology work should close those buckets before deeper Mondrian calibration.
- 13 days in the original held-out window could not be replayed because props/market caches were missing. Those caches should be persisted going forward.
- HR Valinor still underperforms Palantir/Fangorn on AUC and Brier. Its production rationale is calibration/distinctness, not discrimination. Later HR methodology work may revisit whether Valinor should use a different architecture.
- Broad ops health still has carried stale/failing tasks unrelated to Phase 3 Milestone 1.

## Explicit Non-Goals

- Did not retrain Palantir or Fangorn for HR.
- Did not retroactively repair the 13-day replay gap.
- Did not add new HR features.
- Did not expand Valinor to K, TB, Hits, Outs, or Futures.
