METHODOLOGY · v1.0 · LAST UPDATED 2026-05-13

The math is the product.

Every assumption documented, every weight editable, every model limitation named on the same page as the model. 25,000 trials per sim. Sample sizes shown on every claim.

The philosophy.

BallBet is a simulation platform, not a picks service. The model outputs probability distributions, edge percentages, and sample sizes — and lets you decide what to bet. The transparency is the trust mechanism.

Pick services have 30–50% monthly churn and tank brand integrity the first losing week. We chose the harder path: publish the math, publish the misses, let sharps verify our edge themselves. Every claim on this site is reproducible from the documented inputs. If you can't reproduce it, that's a bug — email us.

The sim engine.

For every batter-pitcher matchup, we run a Monte Carlo simulation that draws plate-appearance outcomes from a multinomial distribution conditioned on: (a) the pitcher's career pitch-type usage, velocity, and per-pitch effectiveness, (b) the batter's per-pitch-type splits (BA, ISO, K%, BB%, EV, Barrel%), (c) the park's 3-year HR / 2B+3B / 1B / Runs factors, (d) live weather (wind speed/direction, temperature, humidity, pressure) via Open-Meteo, and (e) any explicit user overrides on /sim.

Each trial simulates the batter's remaining PAs in tonight's game (typically 3–5). The aggregated distribution over trials becomes the modeled probability for each prop line (HR over 0.5, Hits over 1.5, etc.).
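As a minimal sketch of the trial loop described above (not the production engine), the following draws each plate appearance from a fixed multinomial and counts how often each prop line clears. The outcome probabilities, the name simulate_props, and the fixed four-PA game are illustrative assumptions; in the engine the probabilities would be conditioned on inputs (a) through (e).

```python
import random

# Hypothetical per-PA outcome probabilities for one matchup (sum to 1.0).
# These are made-up illustration values, not model output.
PA_OUTCOME_PROBS = {
    "out": 0.68,
    "walk": 0.08,
    "single": 0.15,
    "double": 0.045,
    "triple": 0.005,
    "home_run": 0.04,
}
HIT_OUTCOMES = {"single", "double", "triple", "home_run"}


def simulate_props(probs, n_pas=4, n_trials=25_000, seed=0):
    """Estimate P(HR over 0.5) and P(Hits over 1.5) across simulated games."""
    rng = random.Random(seed)
    outcomes = list(probs)
    weights = [probs[o] for o in outcomes]
    hr_over = 0
    hits_over = 0
    for _ in range(n_trials):
        # One trial = one game's worth of plate appearances.
        game = rng.choices(outcomes, weights=weights, k=n_pas)
        hrs = sum(o == "home_run" for o in game)
        hits = sum(o in HIT_OUTCOMES for o in game)
        hr_over += hrs >= 1    # "over 0.5" means at least one
        hits_over += hits >= 2  # "over 1.5" means at least two
    return {
        "hr_over_0.5": hr_over / n_trials,
        "hits_over_1.5": hits_over / n_trials,
    }


if __name__ == "__main__":
    print(simulate_props(PA_OUTCOME_PROBS))
```

With independent PAs and these illustrative inputs, the HR estimate should land near the closed-form value 1 - 0.96^4, which is a quick sanity check on the loop.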

Why 25,000 trials.

The 95% margin of error on a Monte Carlo probability estimate is at most about 1 / √N (the worst case, at p = 0.5). At 25,000 trials, that's roughly ±0.6 percentage points, tight enough that prop pricing is no longer limited by sim noise. For the bulk slate-builder cron we run 1,500 trials (about ±2.5 points) to fit memory constraints; for on-demand individual matchups on /sim, we run the full 25,000.
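The trial-count arithmetic can be checked with the standard binomial margin-of-error formula. This is a small sketch; the function name is ours, and z = 1.96 is the usual normal-approximation value for 95% confidence.

```python
import math


def mc_margin_of_error(n_trials, p=0.5, z=1.96):
    """95% margin of error for a simulated probability p after n_trials."""
    return z * math.sqrt(p * (1.0 - p) / n_trials)


if __name__ == "__main__":
    print(round(mc_margin_of_error(25_000), 4))  # 0.0062
    print(round(mc_margin_of_error(1_500), 4))   # 0.0253
```

The default p = 0.5 is the worst case; for a low-probability prop such as a home run, the same trial count gives a tighter band.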

Features & inputs.

Pitcher inputs.

Per-pitch career rates (usage %, velocity, whiff%, putaway%), handedness-split arsenals (vs LHB / vs RHB), season pitching strength factors (HR allowed, 1B/2B/3B allowed, ER allowed, WHIP), bullpen freshness signals from /bullpens (days rest, last 3-day workload).

Batter inputs.

Per-pitch-type splits from Statcast (BA, OBP, SLG, ISO, wOBA, Barrel%, HH%, EV, K%, BB%), platoon splits (vs LHP / vs RHP), last 30-day form, batted-ball profile (LD% / GB% / FB%, Pull% / Straight% / Oppo%). The matchups V2 page aggregates these per-pitch stats weighted by PA-count across the selected pitch set.

Environmental inputs.

Park: 3-year HR / 2B+3B / 1B / Runs factors (Coors HR is +30%, Petco HR is -10%, etc.) with handedness interaction (Yankee Stadium LH HR is +25%, Petco LH HR is suppressed). Weather: wind speed + direction (projected onto stadium CF azimuth), temperature (cold suppresses carry), humidity, surface pressure. Pulled fresh from Open-Meteo per game.
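Projecting wind onto the center-field azimuth can be sketched as a single cosine. This assumes the wind direction is reported as the compass bearing the wind blows from (the usual meteorological convention) and that the CF azimuth is the compass bearing from home plate to center field; the function name is hypothetical.

```python
import math


def wind_out_to_cf(speed, wind_from_deg, cf_azimuth_deg):
    """Signed wind component along the home-plate-to-CF line.

    Positive means blowing out toward center field, negative means blowing in.
    """
    # Convert "blowing from" to "blowing toward".
    toward_deg = (wind_from_deg + 180.0) % 360.0
    return speed * math.cos(math.radians(toward_deg - cf_azimuth_deg))


if __name__ == "__main__":
    # Park whose CF points due north, wind out of the south at 10 mph.
    print(wind_out_to_cf(10.0, 180.0, 0.0))
```

A pure crosswind projects to roughly zero, so it contributes nothing to this component.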

Feature freshness.

Statcast pitch-level cache refreshes nightly. Batter and pitcher splits regenerate from that cache on the same schedule. Live odds poll The Odds API every 5 minutes on the slate. Closing-line snapshots happen ~20 minutes before each game's first pitch to seed CLV calculations.

The Comp Lab.

When a batter has fewer than 20 prior PAs against a pitcher, raw BvP is statistical noise. The Comp Lab finds statistically similar batters and aggregates their PAs against that pitcher, expanding the sample 5–10× under a transparent similarity score.

The similarity formula.

Weighted Euclidean distance over z-scored features, converted to a score with similarity = 1 / (1 + distance). Each feature carries a weight (set by backtest performance) so contact quality and plate discipline matter more than batted-ball spray. Two batters at distance 0 have similarity 1.0; the far end of the distribution sits around 0.4. When even the top-ranked comp scores below 0.7, we flag the target as “unique”, and comp aggregates should be interpreted carefully.
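A minimal sketch of the score, assuming the weights multiply the squared differences inside the square root (one common form of weighted Euclidean distance; the page does not pin down the exact form). Function names are ours.

```python
import math


def similarity(z_a, z_b, weights):
    """similarity = 1 / (1 + weighted Euclidean distance) over z-scored features."""
    distance = math.sqrt(
        sum(w * (a - b) ** 2 for a, b, w in zip(z_a, z_b, weights))
    )
    return 1.0 / (1.0 + distance)


def is_unique(top_comp_similarity, threshold=0.70):
    """Flag a target whose best comp still scores below the threshold."""
    return top_comp_similarity < threshold


if __name__ == "__main__":
    print(similarity([0.0, 0.0], [3.0, 4.0], [1.0, 1.0]))  # distance 5, so 1/6
```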

The feature vector.

38 z-scored features per batter spanning: contact quality (EV, barrel%, hard-hit%, sweet-spot LA%), plate discipline (whiff%, chase%, in-zone swing%, BB%), batted-ball profile (FB%, LD%, GB%, Pull% / Oppo% / Straight%), pitch-type performance (per-pitch BA, SLG, whiff%), and physical metrics where available (bat speed, fast-swing%, squared-up%). Each feature is z-scored against the active MLB hitter pool for the season.
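Z-scoring one feature against a hitter pool can be sketched as follows. Using the population standard deviation and returning zeros for a constant feature are our assumptions; the sample values are made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize one feature across the hitter pool."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A feature with no spread carries no information; avoid dividing by zero.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    # Hypothetical average exit velocities (mph) for three hitters.
    print(z_scores([88.0, 90.0, 92.0]))
```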

Comp selection rules.

D2: hard batter-handedness filter (we never comp an LHB with an RHB). D3: pitcher-hand match (when looking for “comps facing this pitcher,” we require the comp to have seen the same pitcher hand). D6: minimum n=2 PAs per comp against the target pitcher to count toward the aggregate. Comps are ranked by similarity, capped at top-10, and the user can see every contributing batter on /comp-lab.
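The selection rules can be sketched as a filter followed by a sort. The Comp record and select_comps are hypothetical names; D3 is not modeled separately here because every counted PA is already against the target pitcher, and switch hitters are ignored for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Comp:
    name: str
    bats: str            # "L" or "R"
    pa_vs_pitcher: int   # PAs against the target pitcher
    similarity: float    # similarity to the target batter


def select_comps(target_bats, candidates, min_pa=2, top_n=10):
    """Apply the handedness (D2) and minimum-PA (D6) rules, then rank and cap."""
    eligible = [
        c for c in candidates
        if c.bats == target_bats and c.pa_vs_pitcher >= min_pa
    ]
    ranked = sorted(eligible, key=lambda c: c.similarity, reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    pool = [
        Comp("A", "L", 5, 0.90),
        Comp("B", "R", 9, 0.95),  # dropped: wrong batter hand
        Comp("C", "L", 1, 0.99),  # dropped: fewer than 2 PAs
        Comp("D", "L", 3, 0.80),
    ]
    print([c.name for c in select_comps("L", pool)])  # ['A', 'D']
```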

Time decay on historical BvP.

Linear with a floor: current season 1.0×, last season 0.7×, two seasons ago 0.4×, anything older 0.2×. Captures arsenal evolution without throwing volume away.
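The decay schedule reduces to a lookup with a floor. A short sketch; the function names and the example PA counts are illustrative.

```python
def season_weight(seasons_ago):
    """Weight for a PA by age: 1.0, 0.7, 0.4, then 0.2 for anything older."""
    weights = {0: 1.0, 1: 0.7, 2: 0.4}
    return weights.get(seasons_ago, 0.2)


def decayed_pa_total(pa_by_seasons_ago):
    """Effective sample size after time decay.

    pa_by_seasons_ago maps seasons-ago (0 = current season) to a PA count.
    """
    return sum(pa * season_weight(age) for age, pa in pa_by_seasons_ago.items())


if __name__ == "__main__":
    # 10 PAs in each of four seasons, the oldest five years back.
    print(decayed_pa_total({0: 10, 1: 10, 2: 10, 5: 10}))
```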

Uniqueness flag.

When the rank-1 comp scores below 0.70, we flag the target as unique and recommend additional caution on the aggregate sample.

xStats & luck adjustment.

Every batter on the site carries two paired numbers in the feature store: woba_actual (linear-weights wOBA over the last 365 days) and xwoba (expected wOBA from Statcast's launch-speed + launch-angle model over the same window). The difference is the luck delta — positive means the batter has been over-performing the quality of their contact, negative means under-performing.

The model uses the luck delta as a regression-toward-mean tilt at inference time. A batter sitting at .380 actual wOBA but .335 xwOBA isn't projected as a .380 hitter — the forty-five-point gap is variance that will close, and the model knows it before the prop market does. This appears on the Lucky / Unlucky board so you can see the regression candidates for yourself; it quietly modulates every prop projection too.
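For illustration only, the luck delta and a simple tilt might look like the sketch below. The page does not state how strongly the model regresses, so the regress parameter (0.5 here) and the function tilted_woba are hypothetical, not the production formula.

```python
def luck_delta(woba_actual, xwoba):
    """Positive = results ahead of contact quality; negative = behind."""
    return woba_actual - xwoba


def tilted_woba(woba_actual, xwoba, regress=0.5):
    """Pull actual wOBA toward xwOBA by a fraction of the luck delta.

    regress=0 keeps the actual number; regress=1 replaces it with xwOBA.
    The 0.5 default is a placeholder, not a documented model weight.
    """
    return woba_actual - regress * luck_delta(woba_actual, xwoba)


if __name__ == "__main__":
    print(round(luck_delta(0.380, 0.335), 3))   # 0.045
    print(round(tilted_woba(0.380, 0.335), 4))  # 0.3575
```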

Per-pitch-type xStats.

We also store xBA, xSLG, xwOBA, and xISO segmented by pitch type on every batter (and the inverse on every pitcher arsenal). When a sinker-heavy starter faces a batter whose actual results against sinkers are luck-bloated, the per-pitch xStats catch it where a season-average wOBA would miss it entirely. Shipped 2026-05-16 as M3 of the model upgrade plan: under the hood, never paywalled, no upsell.

Why we surface this.

Other prop tools either hide xStats behind the model entirely or sell a vague “luck-adjusted rating” number without showing the underlying delta. We publish the two raw inputs (actual + expected wOBA), the delta, and the PA count behind each. If you disagree with our regression assumption you can see it and override it — that's the transparency tax we pay for being a tool, not a tout.

Priors, samples & confidence.

Sample size thresholds.

Every probability shown on the site is annotated with its underlying sample size (PA count). Below 20 raw PAs against a pitcher, the system automatically falls back to Comp Lab aggregation. Below 50 PAs even after comp aggregation, the row carries a low-confidence flag. We refuse to show isolated “n=3 BvP” numbers without context.
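The threshold logic can be sketched as a small routing function. The name resolve_sample is ours, comp_pa is taken to be the aggregated Comp Lab total, and applying the 50-PA low-confidence check to raw samples as well is our assumption.

```python
def resolve_sample(raw_pa, comp_pa=0, raw_min=20, low_conf_below=50):
    """Pick the sample a row is based on and whether it gets flagged."""
    if raw_pa >= raw_min:
        source, n = "raw_bvp", raw_pa
    else:
        # Too few head-to-head PAs: fall back to the Comp Lab aggregate.
        source, n = "comp_lab", comp_pa
    return {"source": source, "n": n, "low_confidence": n < low_conf_below}


if __name__ == "__main__":
    print(resolve_sample(raw_pa=8, comp_pa=64))
    print(resolve_sample(raw_pa=8, comp_pa=31))
```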

Confidence bands.

Confidence tiers (S / A / B / C / PASS) on each prediction encode: (a) sample-size adequacy, (b) Monte Carlo variance tightness, and (c) feature freshness. S-tier requires >50 comp-weighted PAs, <0.5% MC variance, and current-season data. PASS means we have a number but won't recommend it.

Low-confidence flag.

Rows below the n threshold get a warning pill on /edges and /today. They are visible in the API response but visually de-emphasized. You can still bet them — we just won't pretend the model is confident.

Edge calculation & CLV.

Vig handling.

Edge percentages are computed against the no-vig implied probability, not the raw line. Example: a market priced at -110 / -110 implies 50.0% true probability after de-vigging (raw -110 implies 52.4%). If the model says 55%, the edge is +5pp — not +2.6pp. This is the only honest way to compare modeled vs market.
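The -110 / -110 example works out as below. This sketch removes the vig by proportional normalization of the two implied probabilities, which is one common de-vig method and an assumption here; the function names are ours.

```python
def american_to_implied(odds):
    """Raw implied probability from American odds (vig still included)."""
    if odds < 0:
        return -odds / (-odds + 100.0)
    return 100.0 / (odds + 100.0)


def no_vig_probs(odds_side, odds_other):
    """Normalize both sides so the implied probabilities sum to 1."""
    a = american_to_implied(odds_side)
    b = american_to_implied(odds_other)
    total = a + b
    return a / total, b / total


def edge_pp(model_prob, odds_side, odds_other):
    """Edge in percentage points versus the no-vig market probability."""
    fair, _ = no_vig_probs(odds_side, odds_other)
    return (model_prob - fair) * 100.0


if __name__ == "__main__":
    print(round(american_to_implied(-110), 3))  # 0.524
    print(no_vig_probs(-110, -110))             # (0.5, 0.5)
    print(round(edge_pp(0.55, -110, -110), 1))  # 5.0
```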

CLV measurement.

Closing Line Value (CLV) measures how much the market line moved toward the side our model favored between open and close. Positive CLV (line moved toward us) is the strongest single signal of model skill; it's what professional bettors and sportsbooks use to grade their own staff. Our Closing-Line Snapshot cron locks the closing line ~20 minutes before each game's first pitch; the calculation appears on /calibration and individual /tracker rows.

Why “edges” not “picks.”

We surface modeled probability distributions, not betting recommendations. You choose what to bet based on the model output plus your own bankroll and risk tolerance. This isn't semantic — it's a different product. Picks services churn because the next losing week breaks the trust. A methodology that publishes its math + its misses doesn't.

Calibration.

Public-facing calibration.

The Calibration Dashboard at /calibration shows our hit rate by edge band over every tracked play. Anyone can audit it. Tools that hide losing streaks are tout services. We're not that.

What miscalibration looks like.

A perfectly calibrated model's “70% confident” predictions hit 70% of the time over a large enough sample. On the calibration chart, that's a 45-degree diagonal from bottom-left to top-right. Points above the line = model is UNDER-confident in its winners. Below = model is OVER-confident. We retune feature weights when consistent miscalibration appears in any edge band (typically after 100+ settled plays in that band) and only when accompanied by negative CLV — a losing record at positive CLV is variance, not a model problem.
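A calibration table of this kind can be sketched by bucketing settled plays on predicted probability and comparing each bucket's hit rate to its mean prediction. The ten equal-width buckets and the (probability, won) input shape are assumptions for illustration.

```python
from collections import defaultdict


def calibration_table(plays, n_buckets=10):
    """plays: iterable of (predicted_prob, won) pairs for settled plays."""
    cells = defaultdict(lambda: [0, 0.0, 0])  # count, sum of probs, wins
    for prob, won in plays:
        idx = min(int(prob * n_buckets), n_buckets - 1)
        cell = cells[idx]
        cell[0] += 1
        cell[1] += prob
        cell[2] += int(won)
    rows = []
    for idx in sorted(cells):
        n, prob_sum, wins = cells[idx]
        mean_pred = prob_sum / n
        hit_rate = wins / n
        rows.append({
            "bucket_low": idx / n_buckets,
            "n": n,
            "mean_pred": mean_pred,
            "hit_rate": hit_rate,
            # gap > 0: point above the diagonal (under-confident);
            # gap < 0: point below it (over-confident).
            "gap": hit_rate - mean_pred,
        })
    return rows


if __name__ == "__main__":
    settled = [(0.72, True)] * 7 + [(0.72, False)] * 3
    for row in calibration_table(settled):
        print(row)
```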

What we publish that other tools don't.

Most MLB prop tools fall into two camps. The rating-only camp ships a single 0–100 score per row with no factor breakdown — you take the number on faith. The signal-only camp ships per-pitch xStats and Statcast tables but leaves you to combine them into a decision yourself. Neither publishes calibration; neither lets you adjust assumptions and see what changes. We do both at once:

  • Public calibration. Every settled play feeds the calibration dashboard — hit rate by edge band, CLV, sample size per bucket. We publish the misses, not just the hits. Tout services don't do this; we'd argue they can't.
  • Editable assumptions. Every input the simulator uses is exposed on /sim as a slider. Disagree with our park-factor weighting? Crank it down and re-run the 25,000-trial sim live. Sharps don't need a rating — they need a calculator that won't lie about what changed when they moved a knob.
  • BB Rating with the factor breakdown. Our 0–100 score on every row breaks down into seven named buckets (matchup, recent form, season base, pitcher form, park + weather, lineup spot, last-5/10). Click any row and see which factor moved the rating where. No black box.
  • Bayesian shrinkage on rate stats. A batter with 30 PAs against a starter doesn't get projected at his raw 30-PA BvP; we shrink toward a stat-specific stability anchor (e.g. 150 PA for HR rate) so a 4-for-30 doesn't inflate a HR projection. See models/shrinkage_pb.py, referenced from the changelog.
  • Comp Lab expansion when raw samples thin. Below 20 raw PAs we expand the sample 5–10× with statistically similar batters, scored on 38 z-scored features, and you can see every contributing comp on /comp-lab before you trust the aggregate.
  • Live in-game re-sim. On the Edge tier, the simulator re-runs after each at-bat as the game progresses. The pre-game number isn't the only number you get.
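The shrinkage bullet above can be illustrated with the standard pseudo-count form, where the stability anchor acts as that many PAs at a prior rate. This is a sketch, not the contents of models/shrinkage_pb.py; using a league-average prior and the 3% HR rate below are assumptions.

```python
def shrink_rate(successes, pa, prior_rate, anchor_pa):
    """Shrink an observed rate toward prior_rate using anchor_pa pseudo-PAs."""
    return (successes + prior_rate * anchor_pa) / (pa + anchor_pa)


if __name__ == "__main__":
    # Hypothetical: 3 HR in 30 PAs (raw rate 10%), assumed 3% prior HR rate,
    # 150-PA anchor as in the example above.
    print(round(shrink_rate(3, 30, 0.03, 150), 4))  # 0.0417
```

With no observed PAs the estimate equals the prior, and as PAs grow well past the anchor it converges to the raw rate.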

If a feature is on the site, the math behind it is documented here or in a linked code path. We'd rather lose a sale to a sharp who studied the methodology and walked than win one who didn't look.

This page evolves with the model. Every methodology change is logged at /changelog. If something here is wrong or unclear, email methodology@ballbet.ai and we'll fix it.