# M3 Phase 1.5 Methodology Audit - 2026-04

Status: doc-only methodology specification  
Scope: Palantir / Fangorn / Valinor lanes across Home Runs, Strikeouts, Total Bases, Hits, Pitcher Outs, and Futures  
Primary inputs: `docs/multi_modal_model_inventory_2026_04.md`, `docs/projections_m3_model_lane_contract.md`, `docs/projections_m3_phase_1_ship_summary.md`, `docs/projections_m3_betting_revamp_spec.md`, `docs/research/betting_revamp_research_2026_04.md`

This audit defines what each model-lane cell should be methodologically before Phase 3 training work begins. Phase 1 proved the schema bridge, but it also proved that most non-HR markets are honestly Palantir-only today. The next build should not rebadge confidence filters as model identities. Palantir, Fangorn, and Valinor must be independent epistemologies that can disagree on the same row.

## 1. Risk Archetype Methodology Contracts

### Palantir

Method family:

- Regularized, projection-stack-anchored, low-variance models.
- Preferred forms: elastic-net / ridge / generalized linear models, calibrated distribution transforms from projection means, projection ensemble averaging, and transparent parametric or semi-parametric count distributions when the market demands a full probability curve.

Why this fits the archetype:

Palantir is the stable mainline. Its job is to start from the best broad prior, absorb the core projection stack, and avoid overreacting to noisy short-window signals. Regularized linear and projection-anchored methods are appropriate because they make the same trade Palantir is supposed to make: lower variance, cleaner interpretability, and less sensitivity to one odd feature interaction. In production terms, Palantir should be the probability users can understand and trust even when Fangorn or Valinor disagrees.

Not acceptable:

- A Random Forest, CatBoost, LightGBM, or neural model with conservative hyperparameters.
- A selective filter wrapped around another lane.
- A "best model from the tournament" pick when the winner is nonlinear and interaction-driven.
- A market-derived line-implied probability with no baseball projection underneath.

Methodology distinctness test:

If removing interaction-heavy features or tree splits does not materially change the method, it may be Palantir. If the model's edge comes primarily from nonlinear interaction discovery, it is not Palantir. The proof question is: "Can the core probability be explained as a regularized transformation of projection priors, matchup aggregates, and market line context?"
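A minimal sketch of what "a regularized transformation of projection priors" could look like, assuming a two-feature toy design (a projection-prior logit and a matchup aggregate). The feature names, toy rows, and penalty values are illustrative assumptions, not the production Palantir pipeline. It fits an L2-penalized logistic model by plain gradient descent and shows that a stronger penalty pulls the coefficients toward zero.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_ridge_logistic(rows, labels, l2=1.0, lr=0.1, steps=2000):
    """Fit logistic regression with an L2 penalty by batch gradient descent.

    The objective is mean log loss + (l2 / (2 * n)) * ||w||^2.
    The intercept is left unpenalized.
    """
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * d
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        b -= lr * grad_b / n
        for j in range(d):
            w[j] -= lr * (grad_w[j] + l2 * w[j]) / n
    return b, w


def l2_norm(weights):
    return math.sqrt(sum(wj * wj for wj in weights))


# Hypothetical toy rows: (projection_prior_logit, matchup_aggregate).
ROWS = [(-2.0, 0.5), (-1.5, -0.3), (-1.0, 0.8), (-0.5, -0.6),
        (0.0, 0.1), (0.5, -0.9), (1.0, 0.4), (1.5, -0.2)]
LABELS = [0, 0, 1, 0, 0, 1, 1, 1]

loose_b, loose_w = fit_ridge_logistic(ROWS, LABELS, l2=0.01)
tight_b, tight_w = fit_ridge_logistic(ROWS, LABELS, l2=5.0)
```

The design point is that every coefficient stays readable: the probability is a shrunken, additive adjustment to the prior, with no learned interactions.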

### Fangorn

Method family:

- Nonlinear, interaction-seeking, signal-hunting models.
- Preferred forms: Random Forest, ExtraTrees, gradient-boosted trees when used for interaction discovery, segmented role-stack models, and selective specialist models that intentionally abstain outside their learned pockets.

Why this fits the archetype:

Fangorn should find cases where the mainline prior is too smooth. Baseball props are full of conditional interactions: pitch mix x batter whiff zone, park shape x pull-air profile, lineup exposure x pitcher leash, bullpen availability x manager behavior. Tree ensembles and selective segmentation are suited to this because they can discover discontinuities and pockets that a stable linear anchor should suppress. Fangorn is allowed to be narrower than Palantir; it should not need to produce a useful probability for every row.

Not acceptable:

- A stricter threshold on Palantir output.
- A confidence label or trust filter over the same Palantir probability.
- A calibration layer whose only purpose is smoothing or probability repair.
- A copy of Palantir with more recent rolling averages.

Methodology distinctness test:

Fangorn must answer "What nonlinear or segmented interaction does Palantir miss?" A proposed Fangorn lane should produce meaningfully different rankings on at least some subset of rows because of tree splits, role pockets, or feature interactions, not because a threshold was tightened.
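One way to make this test concrete is to compare row rankings between lanes. The sketch below computes a Spearman rank correlation in plain Python; the probability lists are made-up illustrations, and the choice of Spearman as the disagreement metric is an assumption rather than something this audit mandates. A threshold or monotone rescaling of Palantir leaves the ranking untouched (correlation of 1.0), while a genuinely different method reorders rows.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation (Pearson correlation of the ranks).

    Assumes both inputs have at least two distinct values.
    """
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical probabilities for the same five rows.
palantir = [0.10, 0.14, 0.08, 0.21, 0.17]
rebadged = [p ** 2 for p in palantir]          # monotone rescale, same ranking
candidate = [0.09, 0.22, 0.07, 0.15, 0.18]     # independent method, reordered rows

same_ranking = spearman(palantir, rebadged)
different_ranking = spearman(palantir, candidate)
```

A candidate lane whose correlation with Palantir stays at 1.0 on every subset is a filter, not a second epistemology.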

### Valinor

Method family:

- Calibration-focused challenger and probability-refinement models.
- Preferred forms: CatBoost / LightGBM with explicit calibration, stacked ensembles over Palantir and Fangorn, beta calibration, isotonic / Platt / Mondrian calibration, conformalized quantile heads, or Bayesian/hierarchical approaches where uncertainty and calibration are the reason to use them.

Why this fits the archetype:

Valinor is the model that asks whether the probability is shaped correctly. It should be willing to revise the center and tails of Palantir/Fangorn outputs when historical calibration says the raw model is overconfident, underconfident, or locally distorted. This lane is not necessarily the most aggressive or highest-AUC lane; it is the lane that should win on Brier score, log loss, and reliability when the market is evaluated on held-out outcomes.

Not acceptable:

- A raw uncalibrated tree model.
- A pure research artifact that has not been calibrated against held-out outcomes.
- A model that only mirrors Fangorn because it uses the same tree output without a distinct probability-refinement objective.
- A hand-adjusted probability column with no validation.

Methodology distinctness test:

Valinor must answer "What calibration or uncertainty problem is this lane solving?" If the method does not explicitly improve probability shape, interval coverage, tail behavior, or reliability by bucket, it is probably Fangorn or Palantir, not Valinor.
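The scoring side of this test can be sketched with two standard proper scoring rules. The example below is a plain-Python illustration with invented probabilities; the `raw` and `refined` lists are assumptions used only to show the comparison, not outputs of any existing lane.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(probs)


# Hypothetical held-out rows: an overconfident raw model vs. a refined one.
outcomes = [1, 0, 0, 0]
raw = [0.90, 0.70, 0.60, 0.20]
refined = [0.60, 0.40, 0.35, 0.15]

raw_scores = (brier_score(raw, outcomes), log_loss(raw, outcomes))
refined_scores = (brier_score(refined, outcomes), log_loss(refined, outcomes))
```

Lower is better for both metrics. A Valinor candidate that cannot beat the lane it refines on held-out rows under these scores has not solved a calibration problem.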

## 2. The Methodology Matrix

| Lane / Market | Home Runs | Strikeouts | Total Bases | Hits | Pitcher Outs | Futures |
|---|---|---|---|---|---|---|
| Palantir | Built: elastic-net / projection-stack anchor in `home_run_edges` and `home_run_props`; stable HR probability via `hr_prob_palantir`. Differs from Fangorn by avoiding RF interaction pockets and from Valinor by not being the calibration challenger. | Built: matchup-adjusted K projection and distribution board from `pitcher_strikeouts`, `strikeout_props`, and `strikeout_value_board`; should remain projection/count-distribution anchored. Differs from Fangorn role-stack pockets and Valinor calibrated residual/tail layer. | Built: expected TB from `total_base_props` and `total_base_edges/site_board`; should remain contact-volume and projection anchored. Differs from Fangorn geometry/interactions and Valinor calibrated outcome distribution. | Built: expected hits / logistic board from `hits` and `hits_edges`; should be a contact-opportunity anchor. Differs from Fangorn contact-shape interaction model and Valinor calibration layer. | Built: expected outs from `pitcher_outs` and `pitcher_outs_edges`; should be workload/leash projection anchor. Differs from Fangorn volatility segmentation and Valinor conformal/calibrated distribution. | Built: M2 projection-implied probabilities from `futures_edges`; Palantir is the main season simulator / projection ensemble. Differs from Fangorn scenario stress testing and Valinor probability calibration of simulator odds. |
| Fangorn | Built: Random Forest / RF-safe HR stack and `/betting/fangorn` strategy artifacts in `home_run_strategy`; nonlinear interaction lane. Differs from Palantir linear anchor and Valinor probability repair. | Partially built: `outputs/research/strikeout_architecture/strikeout_role_stack_proto_fangorn.csv`; should become RF/role-stack model over pitch-shape, lineup, and count-exposure interactions. Differs from Palantir by segmented nonlinear pockets and from Valinor by not primarily calibrating. | Partially built: `total_base_strategy` + `tb_under_specialty`; should become RF/ExtraTrees contact-shape/park/suppression model. Differs by batted-ball/park interactions rather than Palantir expected TB smoothing or Valinor calibration. | Missing: should be nonlinear contact-shape and defensive-context model, not `hits_edges/confidence.py`. Differs by interaction discovery across contact, speed, defense, and role. | Missing: should be workload/leash volatility RF/segmentation model. Differs by managerial/removal interaction pockets rather than mean expected outs or final calibration. | Missing: should be scenario/selective simulator lane; e.g. stress tests for roster/park/lineup volatility. Differs by conditional scenario logic, not base simulator or calibration. |
| Valinor | Partially built / research-rebadged: `hr_prob_rf_safe_logistic_anchor` selected in Phase 1; future version should be CatBoost/logistic anchor or Bayesian HR with Mondrian calibration. Differs by probability refinement and tail reliability. | Missing: should be calibrated boosted residual/tail layer over K distribution, possibly LightGBM/CatBoost + CQR/beta calibration. Differs from Fangorn by calibration target and from Palantir by correcting distribution shape. | Missing: should be calibrated boosted or conformal distribution over total bases. Differs by calibrated ordinal/count distribution rather than RF pocket-finding. | Missing: should be calibrated binary/ordinal hit probability model, likely boosted + beta/isotonic calibration. Differs by reliability of 0.5+/1.5+ hit probabilities. | Missing: should be CQR / calibrated distribution over outs with Mondrian buckets for role, pitch count, and manager tendencies. Differs by interval/probability coverage rather than volatility discovery alone. | Missing / exploratory: should be simulator calibration or Bayesian model averaging of projection systems, not a second copy of M2. Differs by correcting odds reliability and uncertainty bands for futures. |

## 3. Per-Market Methodology Design

### Home Runs

Palantir method:

- Algorithm choice: elastic-net / regularized logistic or regularized count-to-probability model anchored in the HR projection stack.
- Primary feature categories: player power baseline, projected PA, handedness, park, pitcher HR allowance, matchup aggregates, and market line context.
- Label definition: batter hits at least one HR in the game, or the settled prop-row outcome where available.
- Training cadence: weekly in-season with daily inference; retrain after major feature-source changes.
- Current files/artifacts: `src/pitcher_card_engine/domains/betting/home_run_edges/pipeline.py`, `src/pitcher_card_engine/domains/projections/home_run_props/pipeline.py`.

Fangorn method:

- Algorithm choice: Random Forest / RF-safe stack.
- Primary feature categories: batter/pitcher interaction terms, park/weather, spray/launch profile, pitcher pitch-mix HR vulnerability, support/exposure/suppression features.
- Label definition: same HR binary label as Palantir, with model allowed to abstain or segment.
- Training cadence: weekly with daily inference; monitor feature drift by month.
- Current files/artifacts: `src/pitcher_card_engine/domains/betting/home_run_strategy/pipeline.py`, `src/pitcher_card_engine/domains/betting/home_run_edges/confidence.py`.

Valinor method:

- Algorithm choice: RF-safe logistic-anchor challenger (`hr_prob_rf_safe_logistic_anchor`) today; future candidate is CatBoost + Mondrian calibration or Bayesian/hierarchical HR probability prototype.
- Primary feature categories: Palantir/Fangorn probabilities, raw HR features, calibration buckets, low-sample/player-state indicators, park/weather buckets.
- Label definition: same HR binary label, optimized for calibrated probability rather than raw ranking.
- Training cadence: weekly or twice monthly; calibration can update more frequently if enough outcomes accumulate.
- Current files/artifacts: `hr_prob_rf_safe_logistic_anchor` in the HR edge payload; archived/training scripts noted in `docs/multi_modal_model_inventory_2026_04.md`.

Why these three differ:

HR is the template because its three lanes already satisfy the independence test in method, even though Valinor remains a rebadged research lane pending a formal calibration report. Palantir is the stable projection-first anchor. Fangorn is a nonlinear RF stack that can elevate interaction-heavy pockets. Valinor is the probability-refinement challenger that should be judged on reliability and tail calibration, not whether it simply agrees with the RF model.

Special considerations:

- HR is a low-base-rate binary market, so calibration matters more than raw accuracy.
- The research report identifies spray-angle-conditional HR probability and park geometry as central features.
- Pre-May-2024 HR model artifacts must remain guarded because feature relationships changed after public Stuff/Driveline updates.

### Strikeouts

Palantir method:

- Algorithm choice: projection-stack anchored K mean plus transparent count distribution or regularized distribution model.
- Primary feature categories: pitcher K skill, projected batters faced, opponent K tendency, handedness/platoon mix, pitch count/runway, park/weather where relevant, market line.
- Label definition: pitcher strikeouts over/under listed line and alternate K thresholds.
- Training cadence: daily inference; weekly recalibration/retrain during season.
- Current files/artifacts: `src/pitcher_card_engine/domains/projections/pitcher_strikeouts/`, `src/pitcher_card_engine/domains/projections/strikeout_props/`, `src/pitcher_card_engine/domains/betting/strikeout_value_board/pipeline.py`.
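As an illustration of a "transparent count distribution", the sketch below turns a projected K mean into threshold probabilities using a negative binomial, with a Poisson alongside for comparison. The negative binomial is an assumed choice for the example (this audit does not name a specific family), and the mean of 6.0 and dispersion `size` of 8.0 are made-up values.

```python
import math


def neg_binomial_pmf(k, mean, size):
    """Negative binomial pmf parameterized by mean and dispersion `size`.

    Variance is mean + mean**2 / size, so smaller `size` means more overdispersion.
    """
    p = size / (size + mean)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def poisson_pmf(k, mean):
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def prob_at_least(pmf, threshold, **params):
    """P(K >= threshold) for an integer threshold."""
    return 1.0 - sum(pmf(k, **params) for k in range(threshold))


def prob_over(pmf, line, **params):
    """P(K > line); for a half-point line such as 5.5 this is P(K >= 6)."""
    return prob_at_least(pmf, math.floor(line) + 1, **params)


# Hypothetical projection: 6.0 strikeouts, same mean under both families.
nb_over = prob_over(neg_binomial_pmf, 5.5, mean=6.0, size=8.0)
nb_tail = prob_at_least(neg_binomial_pmf, 10, mean=6.0, size=8.0)
pois_tail = prob_at_least(poisson_pmf, 10, mean=6.0)
```

With the same mean, the overdispersed family puts more mass on the 10+ tail than the Poisson does, which is exactly where alt-ladder prices are most sensitive.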

Fangorn method:

- Algorithm choice: Random Forest or role-stack tree model over interaction-heavy pitcher-start contexts.
- Primary feature categories: pitch arsenal and whiff profile, opponent chase/contact weaknesses, umpire zone if available, ABS challenge tendencies, catcher/lineup context, pitch-count/leash interaction, recent role changes.
- Label definition: same over/under and alt-threshold outcomes, but model may specialize by role-stack segment.
- Training cadence: weekly training; daily inference; role-stack thresholds refreshed monthly.
- Existing candidate: `outputs/research/strikeout_architecture/strikeout_role_stack_proto_fangorn.csv`.

Valinor method:

- Algorithm choice: boosted residual/tail model plus calibration; likely LightGBM/CatBoost for residual distribution with CQR for continuous K count and beta/isotonic calibration for over/under probabilities.
- Primary feature categories: Palantir distribution moments, Fangorn probability, recent calibration residuals, role bucket, line bucket, pitcher workload bucket, opponent bucket.
- Label definition: realized K count for distribution/CQR; over/under binary labels derived per market row.
- Training cadence: weekly model retrain; rolling calibration on recent 60 days when sample permits.
- Existing candidates: `src/pitcher_card_engine/domains/betting/strikeout_value_board_comparison/`, `src/pitcher_card_engine/domains/projections/strikeout_model_v2/`, `outputs/research/strikeout_ml_models/`.

Why these three differ:

Palantir should answer "What should this pitcher project for in a stable distribution?" Fangorn should answer "Which pitcher/opponent/role pockets behave differently than the smooth projection?" Valinor should answer "Are the over/under and alt-ladder probabilities calibrated, especially in the tails?"

Special considerations:

- The research report explicitly says K distributions are overdispersed and batter-mix dependent; do not use a simple Poisson tail as final truth.
- Strikeouts should eventually live inside the M3 pitcher-family joint distribution with Outs and Earned Runs.
- Alt-ladder monotonicity must be enforced: `P(K>=5) >= P(K>=6) >= P(K>=7)`, and likewise for every adjacent pair of thresholds.
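The monotonicity constraint can be checked and repaired mechanically. The sketch below uses a running minimum, which is the simplest possible repair and an assumption of this example; a pooled (isotonic-style) repair would be a reasonable alternative. The ladder values are invented.

```python
def is_monotone_ladder(ladder):
    """True when P(K >= t) never increases as the threshold t increases."""
    probs = [ladder[t] for t in sorted(ladder)]
    return all(a >= b for a, b in zip(probs, probs[1:]))


def enforce_ladder_monotonicity(ladder):
    """Return a repaired copy of {threshold: P(K >= threshold)}.

    Each rung is capped at the lowest probability seen at any smaller
    threshold, and clamped to [0, 1].
    """
    repaired = {}
    ceiling = 1.0
    for threshold in sorted(ladder):
        ceiling = max(0.0, min(ceiling, ladder[threshold]))
        repaired[threshold] = ceiling
    return repaired


# Hypothetical per-threshold model outputs with a violation at 7+.
raw_ladder = {5: 0.62, 6: 0.41, 7: 0.44}
fixed_ladder = enforce_ladder_monotonicity(raw_ladder)
```

Probabilities derived from a single fitted count distribution are monotone by construction; the repair matters when each threshold is scored by its own binary model or calibrator.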

### Total Bases

Palantir method:

- Algorithm choice: regularized expected-total-bases model anchored to projected PA, contact quality, and baseline hitter/pitcher projections.
- Primary feature categories: expected PA, hitter quality, pitcher contact allowed, handedness/platoon, lineup slot, park factor, market line.
- Label definition: total bases over/under listed line and realized total bases count.
- Training cadence: daily inference; weekly retrain/recalibration.
- Current files/artifacts: `src/pitcher_card_engine/domains/projections/total_base_props/`, `src/pitcher_card_engine/domains/betting/total_base_edges/site_board.py`.

Fangorn method:

- Algorithm choice: RF/ExtraTrees or segmented specialist models over contact-shape and park-interaction features.
- Primary feature categories: batted-ball type, spray direction, pull-air profile, park geometry, pitcher pitch mix, launch/EV profile, suppression/under-specialty indicators.
- Label definition: realized total bases and over/under market outcome.
- Training cadence: weekly; selective under-specialty lane can refresh separately.
- Existing candidates: `src/pitcher_card_engine/domains/betting/total_base_strategy/pipeline.py`, `src/pitcher_card_engine/domains/betting/tb_under_specialty/pipeline.py`.

Valinor method:

- Algorithm choice: calibrated ordinal/count model or boosted residual distribution with CQR.
- Primary feature categories: Palantir expected TB, Fangorn contact-shape probability, market line, calibration buckets by line/handedness/park, uncertainty features.
- Label definition: realized total bases count, with market probabilities for common thresholds.
- Training cadence: weekly; calibration updated with rolling in-season windows.
- Current state: missing. Archived RF prototype can inform feature design but should not be treated as a lane.

Why these three differ:

Palantir models a stable expected TB baseline. Fangorn should exploit nonlinear contact-shape and park geometry pockets, especially where a pull-air profile plays differently by venue. Valinor should repair the probability distribution for thresholds, because a mean TB estimate is not enough to price over 1.5/2.5.
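To see why the mean alone cannot price thresholds, consider two invented total-bases distributions with the same expected value. The numbers below are illustrative assumptions only.

```python
def expected_value(dist):
    """Mean of a discrete distribution given as {total_bases: probability}."""
    return sum(k * p for k, p in dist.items())


def prob_over(dist, line):
    """P(TB > line), e.g. line=1.5 means two or more total bases."""
    return sum(p for k, p in dist.items() if k > line)


# Hypothetical hitters: a singles-heavy profile vs. a boom-or-bust profile.
contact_hitter = {0: 0.30, 1: 0.50, 2: 0.10, 3: 0.10}
power_hitter = {0: 0.55, 1: 0.20, 2: 0.10, 4: 0.15}

means = (expected_value(contact_hitter), expected_value(power_hitter))
over_1_5 = (prob_over(contact_hitter, 1.5), prob_over(power_hitter, 1.5))
over_2_5 = (prob_over(contact_hitter, 2.5), prob_over(power_hitter, 2.5))
```

Both profiles project to 1.0 expected total bases, yet their over 1.5 and over 2.5 probabilities differ, so the lane that prices thresholds needs the full distribution.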

Special considerations:

- The research report names spray-adjusted expected-bases surfaces as the highest-leverage free-data move for TB/Hits.
- Under-side specialty must remain side-aware; an under-only Fangorn probability should compare against Palantir's under probability, not over probability.
- TB should ultimately share a hitter-family contact-quality engine with HR and Hits.
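The side-awareness rule above can be sketched as a small comparison helper. The function names are hypothetical, and the example assumes a half-point line so that over and under probabilities sum to one with no push.

```python
def palantir_under_probability(palantir_over):
    """Convert Palantir's over probability to the under side (half-point line, no push)."""
    return 1.0 - palantir_over


def under_lane_gap(fangorn_under, palantir_over):
    """Gap between an under-only Fangorn probability and Palantir on the same side.

    Positive means Fangorn likes the under more than Palantir does.
    """
    return fangorn_under - palantir_under_probability(palantir_over)


# Hypothetical row: Fangorn under specialist says 0.58, Palantir over is 0.47.
gap = under_lane_gap(fangorn_under=0.58, palantir_over=0.47)
wrong_side_gap = 0.58 - 0.47  # the comparison the rule forbids
```

Here the correct same-side gap is about 0.05, while the wrong-side subtraction would report 0.11 and overstate the disagreement.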

### Hits

Palantir method:

- Algorithm choice: regularized contact/opportunity model or logistic model anchored to expected PA and baseline hit probability.
- Primary feature categories: projected PA, lineup slot, contact rate, strikeout rate, pitcher contact allowed, handedness, availability/role, market line.
- Label definition: player records at least one hit or exceeds listed hit line.
- Training cadence: daily inference; weekly recalibration.
- Current files/artifacts: `src/pitcher_card_engine/domains/projections/hits/`, `src/pitcher_card_engine/domains/betting/hits_edges/pipeline.py`.

Fangorn method:

- Algorithm choice: nonlinear contact-shape model, preferably RF/GBM over batted-ball, speed, defense, and role interactions.
- Primary feature categories: contact shape, infield-hit profile, sprint speed, opposing defense/OAA if available, pitcher batted-ball allowed, handedness/spray, role/start probability.
- Label definition: one-plus hit binary and, later, multi-hit thresholds.
- Training cadence: weekly; abstain where projected role/start probability is too uncertain.
- Current state: missing. `src/pitcher_card_engine/domains/betting/hits_edges/confidence.py` is not enough because it is a confidence filter on the Palantir-shaped board.

Valinor method:

- Algorithm choice: calibrated boosted binary/ordinal model with beta or isotonic calibration by line/role buckets.
- Primary feature categories: Palantir hit probability, Fangorn interaction probability, market line, role uncertainty, calibration residuals, platoon buckets.
- Label definition: one-plus hit and multi-hit outcomes.
- Training cadence: weekly model retrain, rolling calibration if enough outcomes exist.
- Current state: missing.

Why these three differ:

Palantir is a stable contact-opportunity anchor. Fangorn should be the model that notices when contact shape, speed, defense, and pitcher profile create a special pocket. Valinor should be judged on whether 0.5-hit and 1.5-hit probabilities are reliable across role and lineup buckets.

Special considerations:

- Hits are heavily role/opportunity dependent; bad lineup/start probability can swamp contact skill.
- Hits and TB should share feature infrastructure but not necessarily the exact same labels.
- Defensive context matters more for Hits than HR and may require a new public-data join.

### Pitcher Outs

Palantir method:

- Algorithm choice: regularized workload/leash projection model transformed into over/under probabilities.
- Primary feature categories: projected innings/outs, pitch count history, recent workload, starter role, opposing lineup strength, bullpen context, market line.
- Label definition: recorded outs over/under listed line and realized outs count.
- Training cadence: daily inference; weekly recalibration.
- Current files/artifacts: `src/pitcher_card_engine/domains/projections/pitcher_outs/`, `src/pitcher_card_engine/domains/betting/pitcher_outs_edges/pipeline.py`.

Fangorn method:

- Algorithm choice: RF/segmented volatility model focused on leash/removal interactions.
- Primary feature categories: manager behavior, pitch-count progression, bullpen fatigue, leverage-exit likelihood, handedness/order-turnover risk, opponent approach, injury/role volatility, weather/park run environment.
- Label definition: outs over/under and realized outs count, with emphasis on volatility pockets.
- Training cadence: weekly; manager/team behavior refreshed monthly.
- Current state: missing.

Valinor method:

- Algorithm choice: CQR / calibrated distribution model over recorded outs, with Mondrian buckets for starter type, line bucket, team/manager, and workload.
- Primary feature categories: Palantir projected outs, Fangorn volatility score, historical residuals, market line, role confidence, pitch-count bucket.
- Label definition: realized outs count; derived over/under probabilities.
- Training cadence: weekly distribution model; rolling conformal calibration when enough starts exist.
- Current state: missing.

Why these three differ:

Palantir says how long the pitcher is expected to go. Fangorn says when leash dynamics make that expectation fragile. Valinor says whether the full distribution around that expectation is calibrated enough to price an over/under.
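A minimal sketch of the Valinor idea for outs, assuming split conformal prediction on absolute residuals with Mondrian (per-bucket) grouping. This is a simpler stand-in for CQR, which would conformalize fitted quantile models instead of a point forecast. The bucket names, predictions, and realized outs are invented.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile of nonconformity scores."""
    n = len(scores)
    # round() guards against floating-point noise before taking the ceiling
    rank = math.ceil(round((n + 1) * (1 - alpha), 9))
    if rank > n:
        return float("inf")  # too few calibration rows for this coverage level
    return sorted(scores)[rank - 1]


def fit_bucket_widths(calibration_rows, alpha=0.2):
    """calibration_rows: iterable of (bucket, predicted_outs, realized_outs)."""
    residuals = defaultdict(list)
    for bucket, predicted, realized in calibration_rows:
        residuals[bucket].append(abs(realized - predicted))
    return {b: conformal_quantile(r, alpha) for b, r in residuals.items()}


def outs_interval(predicted, bucket, widths):
    width = widths[bucket]
    return (predicted - width, predicted + width)


# Hypothetical held-out starts, grouped by an assumed leash bucket.
long_leash = [("long_leash", 18, y) for y in [19, 20, 17, 21, 16, 19, 20, 17, 16]]
short_leash = [("short_leash", 15, y) for y in [13, 19, 12, 20, 9, 18, 11, 10, 19]]
widths = fit_bucket_widths(long_leash + short_leash, alpha=0.2)
interval = outs_interval(17, "short_leash", widths)
```

The per-bucket widths are what make the approach Mondrian: the volatile bucket earns a wider interval instead of borrowing coverage from the stable one.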

Special considerations:

- Pitcher Outs is likely harder than K because the label depends on manager/team behavior, game state, bullpen, and run environment.
- The research report identifies Pitcher Outs/ER as one of the sharpest free-data surfaces, but only if modeled as workload x lineup x bullpen x park/umpire.
- This should eventually join the pitcher-family joint model with K and ER, not remain isolated forever.

### Futures

Palantir method:

- Algorithm choice: M2 projection ensemble and Monte Carlo season simulator.
- Primary feature categories: team talent, depth charts, player projections, playoff odds, awards simulations, win distributions, market devig.
- Label definition: season-long event outcomes: win total over/under, division, pennant, World Series, MVP/Cy/ROY.
- Training cadence: preseason heavy run; weekly or major-roster-change refresh; daily market ingestion/gap recompute.
- Current files/artifacts: `src/pitcher_card_engine/domains/betting/futures_edges/pipeline.py`, `src/pitcher_card_engine/domains/projections/teams/season_sim.py`, `data/derived/betting/futures_edges_<YYYYMMDD>.parquet`.

Fangorn method:

- Algorithm choice: scenario-driven selective simulator or stress-test model.
- Primary feature categories: roster fragility, pitcher-depth volatility, division-specific schedule pressure, injury concentration, park/lineup sensitivity, bullpen volatility, playoff-path shape.
- Label definition: same futures event outcomes, but output may only apply where scenario sensitivity is high.
- Training cadence: preseason and weekly refresh; scenario rules updated around major injuries/trades.
- Current state: missing.

Valinor method:

- Algorithm choice: simulator calibration / Bayesian model averaging layer over Palantir plus external projection priors where available; alternatively isotonic/beta calibration over historical simulator odds by market family.
- Primary feature categories: Palantir implied probability, simulator uncertainty width, historical calibration by market type, projection-system disagreement, market price bucket.
- Label definition: realized futures outcome; for win totals, final wins over/under line.
- Training cadence: preseason model fit with historical backtests; weekly calibration check.
- Current state: missing and more uncertain than other markets.

Why these three differ:

Palantir is the base simulator. Fangorn should ask "which alternate season shape breaks the base assumption?" Valinor should ask "when this simulator says 34%, does that class of event actually happen 34% of the time?" Futures cannot simply run the same simulator three times with different thresholds.
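The Valinor question for futures amounts to a reliability table. The sketch below buckets predicted probabilities and compares each bucket's mean prediction with the observed frequency; the probabilities and outcomes are fabricated toy data, and equal-width bins are an assumed design choice.

```python
def reliability_table(probs, outcomes, n_bins=5):
    """Per-bin mean prediction, observed frequency, and gap (observed - predicted)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[index].append((p, y))
    table = []
    for index, rows in enumerate(bins):
        if not rows:
            continue
        mean_pred = sum(p for p, _ in rows) / len(rows)
        observed = sum(y for _, y in rows) / len(rows)
        table.append({"bin": index, "n": len(rows), "mean_pred": mean_pred,
                      "observed": observed, "gap": observed - mean_pred})
    return table


# Toy history: 50 events priced at 34% (17 hit), 20 events priced at 70% (10 hit).
probs = [0.34] * 50 + [0.70] * 20
outcomes = [1] * 17 + [0] * 33 + [1] * 10 + [0] * 10
table = reliability_table(probs, outcomes)
```

In this toy history the 34% bucket is well calibrated, while the 70% bucket only hits half the time, which is the kind of gap a futures calibration layer would be asked to correct.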

Special considerations:

- Futures are one-sided or season-long markets, not simple daily over/unders.
- Sample size is smaller than daily props; Valinor may need Bayesian pooling across market families.
- The right Valinor method is not fully known yet. A Phase 3 exploratory pass should test simulator calibration by market type before committing to one challenger design.
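One possible form of the pooling idea is beta-binomial shrinkage toward a rate pooled across market families. This is a sketch under assumptions: the family names, counts, and `prior_strength` value are invented, and this audit does not commit to this particular pooling scheme.

```python
def pooled_rates(family_counts, prior_strength=20.0):
    """Shrink each family's observed rate toward the pooled rate.

    family_counts maps family -> (events_that_hit, events_total).
    The prior is Beta(s * pooled, s * (1 - pooled)) with s = prior_strength,
    so the returned value is the posterior mean for each family.
    """
    total_hits = sum(hits for hits, _ in family_counts.values())
    total_n = sum(n for _, n in family_counts.values())
    pooled = total_hits / total_n
    return {
        family: (hits + prior_strength * pooled) / (n + prior_strength)
        for family, (hits, n) in family_counts.items()
    }


# Hypothetical history for events the simulator priced near 50%.
counts = {"win_totals": (52, 100), "awards": (1, 3)}
pooled_rate = 53 / 103
shrunk = pooled_rates(counts, prior_strength=20.0)
```

The family with three observations moves most of the way to the pooled rate, while the family with a hundred barely moves, which is the behavior small-sample futures calibration needs.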

## 4. Data And Feature Dependencies

| Cell | Required features | Required labels | Sample size estimate | External dependencies / notes |
|---|---|---|---|---|
| HR Palantir | Already exists: HR projections, player/pitcher/park features | HR binary game outcomes | Multiple seasons of daily HR prop/game outcomes | Built; keep guarded against stale pre-May-2024 artifacts |
| HR Fangorn | Already exists: RF-safe HR features, support/exposure/suppression | HR binary game outcomes | Multiple seasons | Built; continue model tournament hygiene |
| HR Valinor | Mostly exists: challenger probabilities, calibration buckets | HR outcomes + held-out calibration | Multiple seasons, but low base rate | Needs formal calibration report before promotion beyond rebadged research |
| K Fangorn | Pitch arsenal, opponent chase/contact, role stack, umpire/ABS optional | K count and over/under outcomes | Thousands of pitcher-start props across seasons | Role-stack proto exists; ABS/umpire enrich later |
| K Valinor | Palantir distribution, Fangorn score, residuals, line buckets | K count, alt-threshold outcomes | Thousands of starts; enough for CQR buckets if grouped carefully | Strong candidate to reuse M2 CQR machinery |
| TB Fangorn | Spray/contact shape, park geometry, pitcher contact allowed, under-specialty features | Total bases count and prop outcomes | Many hitter-game rows; labels abundant but noisy | Needs spray/park geometry extension from HR to 1B/2B/3B |
| TB Valinor | Palantir expected TB, Fangorn probabilities, line/park/hand buckets | TB count and threshold outcomes | Many rows, but multi-base tails are sparse | CQR/ordinal calibration recommended |
| Hits Fangorn | Contact shape, sprint speed, defense/OAA, start probability, pitcher contact profile | Hit count / one-plus hit outcomes | Many rows; role uncertainty reduces clean labels | Needs defensive public-data join for best version |
| Hits Valinor | Palantir hit probability, role buckets, Fangorn score, calibration residuals | One-plus hit and multi-hit outcomes | Many rows; enough for binary calibration | Easier than Fangorn if Palantir artifact date issue is fixed |
| Outs Fangorn | Pitch count, manager/team leash, bullpen fatigue, opponent lineup, game context | Recorded outs and prop outcomes | Lower than hitter rows; one row per starter prop | Needs manager/leash feature engineering before training |
| Outs Valinor | Palantir outs projection, volatility score, line bucket, residuals | Recorded outs count | One row per starter; moderate sample | Best implemented after Palantir/Fangorn outs features are stable |
| Futures Fangorn | Scenario sensitivity, roster fragility, schedule/strength paths, injury concentration | Season outcomes and awards finishes | Small historical N; mostly 30 teams x seasons plus player awards | Requires careful pooling; not a quick daily-prop build |
| Futures Valinor | Simulator probabilities, uncertainty widths, historical calibration by market | Futures outcomes and win-total results | Very small compared with props | Exploratory first; Bayesian pooling likely needed |

Data readiness read:

- Easiest missing cell: Hits Valinor, because the label is abundant and the current Palantir board already writes a probability.
- Best ROI/time cell: K Valinor or K Fangorn, because pitcher props have volume and the research/spec already identify K/Outs/ER as a sharp free-data surface.
- Hardest cells: Futures Fangorn/Valinor and Pitcher Outs Fangorn, because they require new feature substrates rather than only new learners.

## 5. Phase 3 Build Sequencing Recommendation

Recommended order:

1. K Valinor: build the reusable calibrated distribution / CQR pattern for pitcher-count props first.
2. K Fangorn: promote the role-stack idea into a true RF/segmented daily lane.
3. Pitcher Outs Valinor: reuse the K distribution/calibration machinery on recorded outs.
4. Pitcher Outs Fangorn: build the leash-volatility model after the base residuals show where the mean model breaks.
5. TB Fangorn: extend HR geometry/contact-shape logic into TB interaction pockets.
6. TB Valinor: calibrate the TB threshold distribution after Fangorn exists.
7. Hits Valinor: comparatively easy calibration layer once same-day artifacts are reliable.
8. Hits Fangorn: build the richer contact/speed/defense interaction model.
9. HR Valinor hardening: turn the current rebadged challenger into a formally calibrated production/research lane.
10. Futures Valinor exploratory: test simulator calibration by market family before choosing method.
11. Futures Fangorn exploratory: scenario/stress lane only after futures calibration is understood.

| Cell | Complexity | Effort | Rationale |
|---|---:|---:|---|
| K Valinor | M | 1.5-2.5 eng-weeks | Data and labels exist; CQR/calibration pattern is well understood. |
| K Fangorn | M | 2-3 eng-weeks | Role-stack proto exists but needs real daily RF/segmentation and validation. |
| Outs Valinor | M | 2-3 eng-weeks | Reuses K calibration infrastructure, but label dynamics are more volatile. |
| Outs Fangorn | L | 3-4 eng-weeks | Needs leash/manager/bullpen feature engineering before model training. |
| TB Fangorn | M/L | 2.5-4 eng-weeks | Contact/geometry substrate exists conceptually but must expand beyond HR. |
| TB Valinor | M | 1.5-2.5 eng-weeks | Easier after TB Fangorn and distribution labels are stable. |
| Hits Valinor | S/M | 1-2 eng-weeks | Abundant labels; primarily calibration over Palantir. |
| Hits Fangorn | M/L | 2.5-4 eng-weeks | Needs contact-shape, speed, and defense interactions. |
| HR Valinor hardening | S/M | 1-2 eng-weeks | Candidate probability exists; needs formal calibration gate. |
| Futures Valinor | L/XL | 3-5 eng-weeks | Small sample; requires historical simulator calibration and pooling. |
| Futures Fangorn | XL | 4-6 eng-weeks | Scenario lane is conceptually valuable but methodologically least settled. |

Sequencing logic:

- Start with pitcher props because the research report calls K/Outs/ER one of the sharpest free-data prop surfaces.
- Build calibration infrastructure early because Valinor patterns can transfer across K, Outs, TB, Hits, and eventually futures.
- Delay Futures Fangorn because scenario modeling is product-interesting but not data-ready.
- Delay Hits Fangorn until the hitter-family contact-quality engine can share features with TB/HR rather than creating a fourth isolated model.

## 6. Cross-Market Methodology Consistency

The lanes should share methodology families, not identical algorithms.

Palantir should always be the stable anchor, but it does not always have to be elastic-net. For daily props, elastic-net/ridge/GLM-style models are the clean default. For Futures, Palantir can be a projection ensemble and Monte Carlo simulator because futures do not map naturally to a simple over/under row-level GLM.

Fangorn should always be nonlinear or segmented, but it does not always have to be Random Forest. RF is the clean template because HR already uses it, but GBM, ExtraTrees, or explicit role-stack segmentation are acceptable when they are used for interaction discovery and selective pockets. A Fangorn method can abstain.

Valinor should always be calibration/probability-refinement focused, but the method should adapt to market type. Binary markets can use CatBoost/LightGBM plus beta/isotonic/Mondrian calibration. Continuous/count markets should prefer CQR or calibrated count distributions. Futures may need Bayesian pooling or simulator-level calibration because sample sizes are much smaller.
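For the binary-market case, isotonic calibration can be sketched with the pool-adjacent-violators algorithm in plain Python. The scores and outcomes are toy values, the function names are hypothetical, and tied scores are not specially handled in this sketch.

```python
def fit_isotonic(scores, outcomes):
    """Pool-adjacent-violators fit of a non-decreasing step function.

    Returns a list of (upper_score_bound, calibrated_probability) steps.
    """
    blocks = []  # each block is [sum_of_outcomes, count, max_score_in_block]
    for score, outcome in sorted(zip(scores, outcomes)):
        blocks.append([float(outcome), 1, score])
        # Merge backwards while the previous block's mean exceeds the new one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, max_score = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = max_score
    return [(max_score, total / count) for total, count, max_score in blocks]


def apply_isotonic(steps, score):
    """Look up the calibrated probability for a raw score."""
    for upper, prob in steps:
        if score <= upper:
            return prob
    return steps[-1][1]  # scores above the training range use the top step


# Toy held-out rows: raw model scores and realized 0/1 outcomes.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
realized = [0, 1, 0, 0, 1, 1]
steps = fit_isotonic(raw_scores, realized)
calibrated = apply_isotonic(steps, 0.35)
```

The fitted steps only ever reorder nothing and flatten where outcomes contradict the score order, which is why isotonic calibration changes probability shape without changing ranking.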

Phase 3 is therefore not "build 11 copies of the same model." It is "build reusable templates for three epistemologies, then adapt them to each market's label shape." The reusable pieces should be calibration evaluation, model-lane output writing, reliability diagrams, and agreement-label reporting.

## 7. What This Audit Explicitly Does Not Decide

- Specific hyperparameters.
- Exact feature lists beyond categories.
- Final calibration thresholds or production gates.
- UI/display behavior for model switchers or consensus surfaces.
- Which model wins production promotion after backtesting.
- Whether a research lane graduates from `rebadged_research` to `production`.
- The final pitcher-family and hitter-family directory refactor.

Those are training-time or Phase 2/3 implementation decisions. This audit decides the methodological contract: every lane must be independently meaningful, or it should stay missing.

## Bottom Line

The working template is HR: Palantir is stable and projection-anchored, Fangorn is nonlinear RF interaction discovery, and Valinor is calibration/probability refinement. Phase 3 should build the same epistemological triangle market by market, not merely fill null columns. The first serious build should be pitcher-side, starting with K Valinor and K Fangorn, because the data is ready, the market has volume, and the calibration pattern can be reused for Outs, TB, and Hits.