Methodology

Physics + calibration. No ML black box. The how + the why behind the model. Usage docs in /help; library / API / CLI in /developers.

git clone https://github.com/sam-dumont/bike-power-model
cd bike-power-model
uv sync --extra api --extra analysis
uv run bpm --help

MIT, single committer (Sam Dumont). Repo flips public with the v1.0 / PyPI publish commit; until then the link 404s and the install path is correct-but-deferred.

Physics: the standard formula

The core is the standard cycling power balance you'll find in Martin 1998, Kreuzotter, Gribble, and every textbook: a Newton solver that balances pedal power against three resistance forces and one inertial term, iterated per-split until power matches resistance.

P_pedal · (1 − drivetrain_loss) =
   ½ · ρ(altitude, temp) · CdA · v_air² · v_ground   (aero drag)
 + Crr(surface) · m · g · cos(θ)        · v_ground   (rolling)
 + m · g · sin(θ)                       · v_ground   (gradient)
 + m · a                                · v_ground   (inertia)

Same formula. ρ, CdA, Crr, draft, and effort each get filled in per split, not per ride.
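
A minimal sketch of that balance for a single split, to make the terms concrete. The function names, the 3 % drivetrain loss, and the example rider are assumptions for illustration, not the repo's API. It also swaps in plain bisection where the text describes a Newton solver, because bisection needs no derivative and cannot overshoot.

```python
import math

G = 9.80665  # standard gravity, m/s^2


def resistive_power(v_ground, *, mass, cda, crr, rho=1.225,
                    grade=0.0, headwind=0.0, accel=0.0):
    """Wheel power (W) needed to hold v_ground (m/s) on one split."""
    theta = math.atan(grade)        # grade as rise/run, e.g. 0.05 for 5 %
    v_air = v_ground + headwind     # headwind > 0 blows against the rider
    # v_air * |v_air| equals v_air**2 for a headwind and keeps the sign
    # right if a tailwind ever exceeds ground speed.
    aero = 0.5 * rho * cda * v_air * abs(v_air) * v_ground
    rolling = crr * mass * G * math.cos(theta) * v_ground
    gradient = mass * G * math.sin(theta) * v_ground
    inertia = mass * accel * v_ground
    return aero + rolling + gradient + inertia


def solve_speed(pedal_power, *, drivetrain_loss=0.03, **split):
    """Ground speed (m/s) at which wheel power equals resistance."""
    wheel_power = pedal_power * (1.0 - drivetrain_loss)
    lo, hi = 0.0, 40.0              # bracket: standstill to 144 km/h
    for _ in range(60):             # bisection, not the Newton step
        mid = 0.5 * (lo + hi)
        if resistive_power(mid, **split) < wheel_power:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical rider: 80 kg system mass, CdA 0.32, Crr 0.004, 300 W.
flat = dict(mass=80.0, cda=0.32, crr=0.004)
v_flat = solve_speed(300.0, **flat)
v_climb = solve_speed(300.0, grade=0.05, **flat)
print(f"flat {v_flat * 3.6:.1f} km/h, 5 % climb {v_climb * 3.6:.1f} km/h")
```

Per-split prediction is then a loop: solve the speed for each split with that split's ρ, CdA, Crr, grade, and headwind, and sum distance / speed.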

Modifiers above the standard formula

The standard physics handles a smooth tarmac TT in still air at sea level. Real races have cobbles, headwinds, peloton drafts, mountain altitude, summer heat, and tactical attacks. Each modifier below is a documented departure from "use one number for the whole ride":

  • Per-surface Crr. Tarmac, cobble, gravel, hardpack, mud: different surfaces, different rolling resistance. The default comes from OSM surface=* and smoothness=* tags per split, mapped to a Crr value by surface quality (good / moderate / rough / severe). On top of that, a curated database of named sectors (Arenberg + Carrefour de l'Arbre cobbles, Strade Bianche white roads, Unbound chunky-flint gravel, Muur van Geraardsbergen pavé, …) overrides OSM where its tagging is sparse or misclassifies the surface. Wet sectors add a rain-decay factor scaled by the past 72 h of precipitation history at the sector coordinates, decayed with a 12 h dry-time half-life on tarmac (longer on cobbles or shaded gravel). 10 mm of rain that fell 24 h ago still leaves the sector measurably damp. When the rough-surface fraction (pavé or gravel) reaches 12 %, the planner also swaps its defaults to a cobbled or gravel profile (paceline draft, surface-appropriate effort targets) instead of a peloton road-race one. Crr is resolved per-meter, not averaged across the route.
  • Asymmetric wind. Per-split headwind is the dot product of forecast wind (Open-Meteo archived hourly history at the sector lat/lon) and the split's heading. Tailwinds go through the same equation, not "wind speed = average across ride".
  • Tier-aware draft schedules. Draft factor isn't one number. Baseline 0.58 (peloton/paceline) refit 2026-04-25 from 64 cohort rides with explicit draft annotation. For WT pros on race or cobbled intents the baseline is replaced by a tier-aware schedule that captures bunch fragmentation through the race: bunch phase (heavy domestique shelter, low effective draft factor) then selection, chase, solo finale. Trip-averaged draft lands around 0.49 on a long monument (LBL), 0.67 on the 250 km Olympic road race, 0.77 on Roubaix. Amateur schedules use higher phase factors throughout because amateur fields fragment earlier and don't get the team-coordinated late selection.
  • Power-duration curve as effort ceiling. Effort is PDC-aware: 0.75 on a 4 h ride sustains a different absolute IF than 0.75 on a 20-min TT. The PDC is fitted from the rider's last 90 days of FITs (intervals.icu integration) or back-solved from tier + FTP when no archive exists.
  • Heat / altitude / fatigue. Heat > 25 °C derates sustainable power by ~0.9 %/°C up to 35 °C, then ~1.5 %/°C above, anchored to Tatterson et al. (2000): 6.5 % power loss at 32 °C vs 23 °C in trained cyclists. Altitude > 1500 m derates by 1.6 %/100 m (acclimatized) or 2.0 %/100 m (unacclimatized), slopes calibrated above the Wehrlin & Hallén (2006) lab corridor of 1.0-1.2 %/100 m to match the cohort's observed 8 % drop at 2000 m on alpine rides (Marmotte, Galibier, TdF mountain stages). Long-ride fatigue picks one of three profiles by intent: standard (onset 1.0 h), endurance (onset 1.5 h), ultra (onset 2.5 h). After onset, sustainable power decays at 3-5 %/h to a 60 % floor.
  • Two-pass duration refinement. Pass 1 plans at a bootstrap 28 km/h. Pass 2 refines weather, sector wetness, temperature, and the PDC duration bucket using Pass 1's actual time estimate. The refinement matters most on stage TTs: bootstrap 28 km/h would put a 32 km Olympic TT at ~69 min, dropping the PDC into the 1-hour bucket (1.0 × FTP). Actual finish ~36 min lives in the 30-min bucket (~1.04 × FTP). Skipping Pass 2 leaves TT watt targets several percent low and the predicted time correspondingly slow.
  • CdA calibration ranges. Pro/amateur gap is huge. The model's TT bound is 0.17-0.24 (mass_inference.py:75): WT TT specialists at the bottom (Ganna ~0.175, Evenepoel ~0.17: estimates from Castelli wind-tunnel commentary in cycling press, not peer-reviewed), road-bike-with-clip-ons amateur club TTs at the top. WT road 0.25-0.30, amateur drops 0.38-0.45, hoods 0.45-0.50. The 0.40 slider default is amateur-drops; run bpm calibrate on a flat ride of yours for a ~0.02 refinement.
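
Several of the modifiers above reduce to one-line arithmetic. A minimal sketch, assuming linear derates that multiply sustainable power, the meteorological wind convention (direction the wind blows from), and a plain exponential for rain decay. The function names are hypothetical and the repo's own shapes may differ.

```python
import math


def headwind_component(wind_speed, wind_from_deg, heading_deg):
    """Wind along the direction of travel; positive means headwind.

    Assumes wind_from_deg is the bearing the wind blows FROM.
    """
    return wind_speed * math.cos(math.radians(wind_from_deg - heading_deg))


def heat_factor(temp_c):
    """Fraction of sustainable power left after the heat derate."""
    mild = max(0.0, min(temp_c, 35.0) - 25.0)   # degrees inside 25-35 C
    severe = max(0.0, temp_c - 35.0)            # degrees above 35 C
    return 1.0 - 0.009 * mild - 0.015 * severe


def altitude_factor(altitude_m, acclimatized=True):
    """Fraction of sustainable power left above 1500 m."""
    slope = 0.016 if acclimatized else 0.020    # loss per 100 m
    excess_m = max(0.0, altitude_m - 1500.0)
    return 1.0 - slope * excess_m / 100.0


def residual_wetness(rain_mm, hours_since, half_life_h=12.0):
    """Rain still 'felt' by the surface after exponential drying."""
    return rain_mm * 0.5 ** (hours_since / half_life_h)


print(headwind_component(5.0, 0.0, 0.0))    # riding north into a northerly
print(heat_factor(32.0))                    # near the Tatterson anchor
print(altitude_factor(2000.0))              # the 8 % drop quoted above
print(residual_wetness(10.0, 24.0))         # 10 mm, two half-lives later
```

The same wind feeds every split, but the heading changes, so one forecast produces a headwind on the way out and a tailwind on the way back instead of cancelling into an average.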

How this was built: Power Guide reverse engineering

The first piece was getting the prediction onto the Edge, before any modeling work began.

Garmin Power Guide is a per-segment wattage target overlaid on a course. The two FIT messages behind it (352 for the Power Guide header, 353 for per-split targets) are undocumented. The Garmin FIT SDK doesn't expose them; reverse engineers have noted them in passing on GitHub but no public tool writes them. Sam decoded the messages from binary dumps of Edge-saved Power Guide files: opened a few in a hex inspector, cross-referenced field IDs against the SDK's known message families, mapped each field to the corresponding UI knob (target watts as % of FTP, distance, grade, heading), and confirmed the layout by writing test files and re-reading them on a real Edge. The full RE write-up sits in the repo under src/bike_power_model/writer.py + the round-trip tests in tests/test_writer.py.

How this was built: the model

With the FIT writer working, the model itself was the next problem. Garmin's native Power Guide is gradient-only: it can't see Arenberg or a Zeeland headwind. So the model accreted, one stage at a time:

  1. Martin-1998 power balance → predict any stage's time from FTP + mass + CdA.
  2. Per-surface Crr (OSM surface=* tags) → tarmac, cobble, gravel, hardpack stop sharing one Crr.
  3. Asymmetric wind from Open-Meteo archived history → real weather, not "average".
  4. Per-sector overrides DB → curated values for named sectors (cobble, gravel, hardpack) where OSM tagging is sparse or wrong.
  5. Tier-aware draft schedules → pros, cat-2s, sportive groups shelter differently.
  6. Two-pass duration refinement → TT predictions tighten from +3.4 % to +0.2 %.
  7. Per-rider PDC from FIT archive → effort=1.0 means rider's actual ceiling, not category-default.
  8. Heat / altitude / fatigue / W'-balance / corner-radius / sector-aware rain decay → diminishing-returns refinements.
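
Step 6 is the least obvious item in the list, so here is a minimal sketch of the bucket flip using the 32 km TT figures quoted earlier. The 45-minute bucket boundary, the two-bucket table, and the `predict_minutes` callback (a stand-in for the physics solver) are all assumptions for illustration.

```python
def pdc_multiplier(duration_min):
    """FTP multiplier for the PDC bucket a duration falls into.

    Only the two buckets quoted in the text; the 45-minute cut
    between them is a hypothetical boundary.
    """
    return 1.04 if duration_min <= 45.0 else 1.00


def two_pass_target(distance_km, ftp, predict_minutes, bootstrap_kmh=28.0):
    """Return (bootstrap minutes, pass-1 watts, pass-2 watts)."""
    # Pass 1: duration guessed from the bootstrap speed picks the bucket.
    bootstrap_min = distance_km / bootstrap_kmh * 60.0
    pass1_watts = ftp * pdc_multiplier(bootstrap_min)
    # The physics run at pass-1 watts yields a real time estimate.
    pass1_min = predict_minutes(pass1_watts)
    # Pass 2: re-pick the bucket from that estimate.
    pass2_watts = ftp * pdc_multiplier(pass1_min)
    return bootstrap_min, pass1_watts, pass2_watts


# Hypothetical 400 W FTP rider; the lambda fakes a ~36 min physics result.
boot_min, p1, p2 = two_pass_target(32.0, 400.0, lambda watts: 36.0)
print(f"bootstrap {boot_min:.0f} min, pass 1 {p1:.0f} W, pass 2 {p2:.0f} W")
```

Without the second pass the rider is handed the 1-hour target for a 36-minute effort, which is the several-percent shortfall described above.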

Every step was driven by a stage where the prediction was off. The validator runs ~200 pro + amateur rides every commit; the headline number on this page is the result of 18 months of "the model says X, the rider did Y, why is the gap there".

OTS-stamped predictions: pre-race proofs + live-model reruns

Some pre-race predictions in validation_predictions/ are OpenTimestamps-stamped to Bitcoin before the gun. Each stamp proves the prediction existed at that moment with no hindsight tuning. That's the integrity claim. The cohort headlines on this page (built from running the live model on a held-out validation set) are the anchor for "what can this thing actually do". The per-race OTS numbers below are reference.

Re-running the four pre-Romandie OTS-stamped classics through the live model today: LBL 1.96 % → 1.82 % (slightly better), FW 2.47 % → 3.35 % (close), Amstel 2.80 % → 4.06 % (moderate drift), Brabantse Pijl 3.80 % → 7.89 % (largest drift). The shift correlates with stamp age: LBL was stamped two days before the spring physics refresh and lands closest; BP was stamped twelve days before and drifts most. Some races got marginally better, some worse. I'd rather ship more accurate physics and lose a flattering MAE on a single race than hold onto numbers that don't survive a closer look.

Most of the recent physics improvements (draft, fatigue, altitude, heat, cornering) only fire when the model knows what kind of effort the rider is doing. A road race uses a different draft model than a time trial. A TT uses different aero defaults. A gravel race uses different rolling resistance. The same numerical inputs can mean different things in different contexts.

The Romandie prologue is the cleanest demonstration. The input carried CdA 0.25 for Pogi: a sensible value for his road races. On a TT bike that CdA means the rider is sitting up rather than tucked into an aero position, which is rare for a pro doing a TT. The TT context tells the model to expect ~0.19 for a specialist his size. Knowing it's a TT (not just an aero number) is what tells the model which calibration to apply. (I updated the prologue input from 0.25 to 0.19.)

Romandie stages 1-5 re-stamped 2026-04-29. The earlier stamps were generated by a script that wasn't passing the ride type to the model, so the model fell back to flat physics with no context-aware calibration. The re-stamps land before each stage starts. The prologue (already raced) was not re-stamped: the original stamp is the locked pre-race proof.

OTS-stamped races
Race | Date | Predicted → actual | MAE | Stamped
(rows load dynamically on the live page)

Predictions are timestamp-signed via OpenTimestamps before the race. The winner's predicted and actual times are shown next to the average MAE across every scored rider in the stamped prediction file. The small line under each rider discloses the intent parameters used for that prediction — effort (PDC-aware target, 1.00 = at the ceiling), pacing, and draft. Effort below 1.00 means the prediction was made for a sub-ceiling effort (marker day, gruppetto, peloton control), which is information the reader deserves to evaluate the number on.

Validation evidence

Where the numbers come from. Cohort headlines, held-out cohorts the model hasn't seen during tuning, and OTS-stamped pre-race forecasts. Refreshed whenever the validator re-runs.

Predictor vs analyzer

The model can be evaluated through two lenses, which give different numbers because they answer different questions:

| Lens | What it does | Where the SPA uses it |
| --- | --- | --- |
| Predictor (forward-only) | Reads the per-ride intent (race / TT / sportive). Never reads measured power. Predicts finish time from intent + FTP + mass + course. Cobble and gravel physics are auto-derived from the GPX surface, not a rider choice. | The planner page. User picks "race" from a dropdown, the model predicts. SPA headlines cite this lens. |
| Analyzer (post-ride) | Back-solves the rider's effort from the ride's measured avg_power, then predicts finish from that effort. | The /analyzer page (post-ride retrospective). Not on the headline because drafting-heavy races confound the back-solve. |

Headline cohort numbers

Predictor lens, cohort default mode (per-ride overrides allowed — what the SPA serves to a user who picked the right intent). 167 rides from 68 riders. Cluster-by-rider bootstrap CIs.

| Cohort | MAE | Rides | Riders | Bias | 95 % CI |
| --- | --- | --- | --- | --- | --- |
| Pros | 3.76 % | 115 | 56 | −1.70 % | [3.30 %, 4.26 %] |
| Amateurs | 6.70 % | 52 | 12 | −2.15 % | [5.68 %, 7.76 %] |
| Overall | 4.68 % | 167 | 68 | −1.84 % | [4.03 %, 5.31 %] |

User-defaults mode (no per-ride overrides; first-time-user simulation):
Pros 6.41 % (bias +1.91 %, 115 rides) · Amateurs 10.85 % (bias +2.61 %, 52 rides) · Overall 7.79 % · BLIND-mode overall 8.76 %

Each input the user actually supplies (current FTP, measured CdA, real mass) tightens the prediction from this baseline toward the cohort-default numbers above.
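
The confidence intervals are described as cluster-by-rider bootstrap CIs. A minimal sketch of that idea, assuming a percentile interval and resampling whole riders with replacement so that several rides by one rider are never treated as independent. The function name, the resample count, and the data are hypothetical; the validator's implementation may differ.

```python
import random
from statistics import mean


def cluster_bootstrap_mae(errors_by_rider, n_boot=2000, alpha=0.05, seed=0):
    """Return (MAE, CI low, CI high) with riders as the resampling unit.

    errors_by_rider maps a rider id to that rider's signed % errors.
    """
    rng = random.Random(seed)
    riders = list(errors_by_rider)
    point = mean(abs(e) for errs in errors_by_rider.values() for e in errs)

    samples = []
    for _ in range(n_boot):
        # Draw riders with replacement, then keep ALL rides of each draw.
        drawn = rng.choices(riders, k=len(riders))
        pooled = [abs(e) for rider in drawn for e in errors_by_rider[rider]]
        samples.append(mean(pooled))

    samples.sort()
    low = samples[int((alpha / 2) * n_boot)]
    high = samples[int((1 - alpha / 2) * n_boot) - 1]
    return point, low, high


# Hypothetical signed errors in percent, three riders, five rides.
demo = {"rider_a": [1.0, -3.0], "rider_b": [5.0], "rider_c": [-2.0, 2.0]}
mae, ci_low, ci_high = cluster_bootstrap_mae(demo)
print(f"MAE {mae:.2f} %, 95 % CI [{ci_low:.2f} %, {ci_high:.2f} %]")
```

Resampling rides instead of riders would narrow the interval artificially whenever one rider contributes many correlated rides, which is why the amateur interval (12 riders) is wider than the pro one (56 riders).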

Leakage audit: re-running the headline with 67 suspect intent-override entries removed (notes referencing observed outcomes) shifts overall MAE from 4.71 % to 4.18 %, Δ −0.53 %. Direction: the dropped overrides averaged +0.22 % bias (slightly slow-of-actual). Removing them moves headline bias from −1.08 % to −1.95 %, so the clean cohort runs ~2 pp fast-of-actual on time. The suspect overrides had been masking that, not inflating accuracy. Verdict: modest (0.5-1.0 pp) — disclose alongside headline.

Held-out evidence (post-calibration)

Cohorts and rides scored without re-tuning the model after seeing the result.

| Race / cohort | Date | N | MAE | Bias | Notes |
| --- | --- | --- | --- | --- | --- |
| Giro 2024 Stage 7 ITT | 2024-05-10 | 3 | 4.52 % | −4.52 % | Foligno → Perugia ITT, frozen run |
| Giro 2025 Stage 10 ITT | 2025-05-20 | 3 | 1.30 % | +0.89 % | Lucca → Pisa, climb + descent + flat run-in |
| Olympic TT 2024 | 2024-07-27 | 3 | 2.08 % | +0.92 % | Paris |
| Private amateur cohort | | 39 · 14 riders | 6.94 % | +1.21 % | Amateur-league, anonymised |
| Milan-Sanremo 2025 | 2025-03-22 | 3 | 3.74 % | +3.74 % | MvdP / Ganna / Pogačar at sprint finish. |
| Milan-Sanremo 2026 | 2026-03-21 | 3 | 3.86 % | −3.86 % | Pogi / Pidcock / van Aert; race had a 32 km neutralisation after a Pogi crash that the model can't see. |
| Worlds ITT 2023 Stirling | 2023-08-11 | 3 | 0.85 % | −0.85 % | 47.8 km undulating + 800 m / 5.5 % cobble climb finish. Top-3 of Remco / Ganna / Tarling: youngest podium in Worlds ITT history (avg 23 y 131 d). Held-out, the planner's first time seeing this course. |
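
The MAE and Bias columns can be read with the following minimal sketch. It assumes errors are percentages of actual finish time, with negative meaning the prediction was faster than the ride, which matches the "fast-of-actual" wording in the leakage audit. The function name and the finish times are hypothetical.

```python
def mae_and_bias(predicted_s, actual_s):
    """Return (MAE %, bias %) for paired finish times in seconds."""
    errors = [100.0 * (p - a) / a for p, a in zip(predicted_s, actual_s)]
    mae = sum(abs(e) for e in errors) / len(errors)   # size of the miss
    bias = sum(errors) / len(errors)                  # direction of the miss
    return mae, bias


# Hypothetical three-rider TT, every prediction faster than the ride.
actual = [2400.0, 2430.0, 2460.0]
all_fast = [2300.0, 2320.0, 2350.0]
mae, bias = mae_and_bias(all_fast, actual)
print(f"MAE {mae:.2f} %, bias {bias:+.2f} %")

# Same riders with one prediction slow: bias shrinks, MAE does not cancel.
mixed = [2300.0, 2320.0, 2560.0]
mae_mixed, bias_mixed = mae_and_bias(mixed, actual)
print(f"MAE {mae_mixed:.2f} %, bias {bias_mixed:+.2f} %")
```

When |bias| equals MAE, as in the Giro 2024 and both Milan-Sanremo rows, every scored rider missed in the same direction. When bias is much smaller than MAE, the misses are scattered around the actual times.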

OTS-stamped pre-race forecasts

Pre-race predictions committed to git AND timestamp-signed via OpenTimestamps to the Bitcoin blockchain before the start. Once the OTS proof lands, no benefit-of-hindsight tuning is possible.
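
Why a stamp rules out hindsight tuning: the timestamp commits to a digest of the prediction file, not to its contents, so any later edit produces a different digest that the proof no longer covers. A minimal sketch of that property, assuming a SHA-256 file digest (the usual OpenTimestamps default). The payloads are hypothetical and this is not the project's stamping script.

```python
import hashlib


def prediction_digest(payload: bytes) -> str:
    """Hex SHA-256 digest of a prediction file's raw bytes."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical prediction payloads, differing by one second.
stamped = b'{"race": "example", "rider": "A", "predicted_s": 21600}'
tuned = b'{"race": "example", "rider": "A", "predicted_s": 21601}'

print(prediction_digest(stamped))
print(prediction_digest(tuned))      # unrelated digest, proof would not verify
```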

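| Race | Date | Stamped | N | MAE | Bias | Model SHA |
| --- | --- | --- | --- | --- | --- | --- |
| Brabantse Pijl 2026 | 2026-04-17 | 2026-04-15 | 3 | 3.80 % | −1.33 % | 36d9df26 |
| Amstel Gold Race 2026 | 2026-04-19 | 2026-04-16 | 3 | 2.80 % | −1.81 % | 63b2a899 |
| La Flèche Wallonne 2026 | 2026-04-22 | 2026-04-22 | 4 ext | 2.47 % | +2.47 % | f1810c61 |
| Liège-Bastogne-Liège 2026 | 2026-04-26 | 2026-04-22 | 6 ext | 1.96 % | +0.70 % | 2ba062c1 |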

Why the amateur sportive cell overstates the user-facing risk

The per-intent breakdown shows sportive at MAE ~11 %, the worst cell. Read in isolation that's alarming for an amateur SPA user picking "sportive". It's almost entirely a data-quality artefact, not a physics gap:

  • Three of the six sportive rides come from one rider (Amateur A03), all 15-21 % too fast.
  • Their mass disagrees across three sources by ±20 % (cohort 78 kg, FIT-archive back-solve 87 kg). FTP also has no FIT cross-check on these rides — Strava-exported FITs strip zones_target.
  • Detected CdA on those rides comes back at 0.32-0.39, implausibly aero for an amateur in sportive position. The back-solve is absorbing draft as low CdA, then propagating bogus aero into the prediction.
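
The last bullet can be made concrete. A minimal sketch, assuming the draft factor multiplies the aero term and the ride is flat, steady, and in still air. If the back-solve ignores draft, the CdA it returns is the true CdA times the draft factor. All names and numbers here are hypothetical.

```python
G = 9.80665  # standard gravity, m/s^2


def ride_power(cda, v, mass, draft=1.0, rho=1.2, crr=0.005):
    """Steady flat-road wheel power (W); draft scales only the aero term."""
    aero = 0.5 * rho * cda * draft * v ** 3
    rolling = crr * mass * G * v
    return aero + rolling


def backsolve_cda(power, v, mass, rho=1.2, crr=0.005):
    """CdA implied by power and speed when draft is assumed to be 1.0."""
    rolling = crr * mass * G * v
    return (power - rolling) / (0.5 * rho * v ** 3)


# Hypothetical amateur on the hoods (CdA 0.45) sitting in a group.
true_cda, draft, speed, mass = 0.45, 0.80, 9.0, 80.0
measured = ride_power(true_cda, speed, mass, draft=draft)
apparent = backsolve_cda(measured, speed, mass)
print(f"true CdA {true_cda:.2f}, back-solved {apparent:.2f}")
```

A sheltered ride therefore looks like an aero rider, and feeding that CdA into a solo prediction makes it too fast, the same direction as the 15-21 % misses above.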

For an SPA user entering their own current mass (a scale this morning), their own current FTP (from a power meter), and a CdA either entered manually or detected on a known-solo ride, the inputs are internally consistent in a way the cohort sportive cell isn't. Predictor behaviour on consistent-input pro rides (3.76 %) is the better proxy for what a careful amateur user will see than the cohort sportive number.

Payload generated 2026-04-30T22:08:18Z. Source-of-truth doc on GitHub: docs/VALIDATION_STATE.md · How to re-derive the suite.

Good, bad, and honest gaps

Works well
Pro pacing on 3-6 h races with predictable surfaces and steady weather. The Flèche Wallonne 2026 OTS-stamped forecast called Paul Seixas's winning time at +0.13 %, and the LBL 2026 6-rider extended pool landed at 1.96 % MAE across the contender field.
Weaker
Crits, discrete attacks, heat >30 °C, very wet cobbles. The physics is fine; the rider model doesn't know how to pace a surge yet.
Won't model
Bunch tow. If a weaker rider finishes with a stronger group (Powless +14 % bias on LBL 2025), the prediction stays at their solo-equivalent pace. Binding to the leader would bake in tactics I can't verify before the race: riders drop out of bunches too. Predictions land best for the rider setting the pace.
bikepowermodel.fit · open-source race-time prediction · per-split surface Crr · Garmin / Wahoo / Hammerhead / Bryton