Methodology
Physics + calibration. No ML black box. The how + the why behind the model. Usage docs in /help; library / API / CLI in /developers.
    git clone https://github.com/sam-dumont/bike-power-model
    cd bike-power-model
    uv sync --extra api --extra analysis
    uv run bpm --help
MIT, single committer (Sam Dumont). Repo flips public with the v1.0 / PyPI publish commit; until then the link 404s and the install path is correct-but-deferred.
Physics: the standard formula
The core is the standard cycling power balance you'll find in Martin 1998, Kreuzotter, Gribble, and every textbook: a Newton solver that balances pedal power against three resistance forces and one inertial term, iterated per-split until power matches resistance.
    P_pedal · (1 − drivetrain_loss) =
        ½ · ρ(altitude, temp) · CdA · v_air² · v_ground   (aero drag)
      + Crr(surface) · m · g · cos(θ) · v_ground          (rolling)
      + m · g · sin(θ) · v_ground                         (gradient)
      + m · a · v_ground                                  (inertia)
Same formula. ρ, CdA, Crr, draft, and effort each get filled in per split, not per ride.
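As a rough illustration of the balance above, the sketch below solves one split for steady-state ground speed with Newton's method. It is not the repo's solver: the function names, the isothermal air-density approximation, the 3 % drivetrain loss default, and the sample rider values are all assumptions made for this example, and acceleration defaults to zero.

```python
import math

G = 9.80665  # standard gravity, m/s^2


def air_density(altitude_m: float, temp_c: float) -> float:
    """Approximate dry-air density in kg/m^3 (assumed isothermal pressure model)."""
    pressure_pa = 101325.0 * math.exp(-altitude_m / 8434.0)
    return pressure_pa / (287.05 * (temp_c + 273.15))


def resistance_power(v, *, mass, cda, crr, grade, rho, headwind=0.0, accel=0.0):
    """Right-hand side of the balance: watts needed to hold ground speed v (m/s)."""
    theta = math.atan(grade)  # grade is rise/run, e.g. 0.05 for 5 %
    v_air = v + headwind      # positive headwind opposes the rider
    # v_air * |v_air| keeps the sign, so a tailwind faster than the rider pushes.
    aero = 0.5 * rho * cda * v_air * abs(v_air) * v
    rolling = crr * mass * G * math.cos(theta) * v
    gradient = mass * G * math.sin(theta) * v
    inertia = mass * accel * v
    return aero + rolling + gradient + inertia


def solve_speed(pedal_watts, *, drivetrain_loss=0.03, v0=8.0, **split):
    """Newton iteration on f(v) = resistance_power(v) - power at the wheel."""
    wheel_watts = pedal_watts * (1.0 - drivetrain_loss)
    headwind = split.get("headwind", 0.0)
    theta = math.atan(split["grade"])
    # Terms that are linear in v share one constant derivative.
    linear = split["mass"] * (
        split["crr"] * G * math.cos(theta)
        + G * math.sin(theta)
        + split.get("accel", 0.0)
    )
    v = v0
    for _ in range(100):
        f = resistance_power(v, **split) - wheel_watts
        v_air = v + headwind
        d_aero = 0.5 * split["rho"] * split["cda"] * (
            2.0 * abs(v_air) * v + v_air * abs(v_air)
        )
        df = d_aero + linear
        if df <= 0.0:           # no usable slope (steep descent): nudge and retry
            v += 1.0
            continue
        step = f / df
        v = max(v - step, 0.1)  # keep the iterate physical
        if abs(step) < 1e-10:
            break
    return v


split = dict(mass=80.0, cda=0.32, crr=0.004, grade=0.0, rho=air_density(0.0, 15.0))
speed_ms = solve_speed(250.0, **split)
```

With these made-up inputs (250 W, 80 kg system mass, flat sea-level tarmac at 15 °C) the solver settles a little above 10 m/s, roughly 37 km/h. Filling in `rho`, `cda`, `crr`, and `headwind` per split rather than per ride is what the modifiers in the next section are about.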
Modifiers above the standard formula
The standard physics handles a smooth tarmac TT in still air at sea level. Real races have cobbles, headwinds, peloton drafts, mountain altitude, summer heat, and tactical attacks. Each modifier below is a documented departure from "use one number for the whole ride":
- Per-surface Crr. Tarmac, cobble, gravel, hardpack, mud: different surfaces, different rolling resistance. The default comes from OSM `surface=*` + `smoothness=*` tags per split, mapped to a Crr value by surface quality (good / moderate / rough / severe). On top of that, a curated database of named sectors (Arenberg + Carrefour de l'Arbre cobbles, Strade Bianche white roads, Unbound chunky-flint gravel, Muur van Geraardsbergen pavé, …) overrides OSM where its tagging is sparse or misclassifies the surface. Wet sectors add a rain-decay factor scaled by the past 72 h of precipitation history at the sector coordinates, decayed with a 12 h dry-time half-life on tarmac (longer on cobbles or shaded gravel): 10 mm of rain that fell 24 h ago still leaves the sector measurably damp. When the rough-surface fraction (pavé or gravel) reaches 12 %, the planner also swaps its defaults to a cobbled or gravel profile (paceline draft, surface-appropriate effort targets) instead of a peloton road-race one. Crr is resolved per-meter, not averaged across the route.
- Asymmetric wind. Per-split headwind is the dot product of forecast wind (Open-Meteo archived hourly history at the sector lat/lon) and the split's heading. Tailwinds go through the same equation with the opposite sign; the ride is never assigned a single average wind speed.
- Tier-aware draft schedules. Draft factor isn't one number. Baseline 0.58 (peloton/paceline) refit 2026-04-25 from 64 cohort rides with explicit draft annotation. For WT pros on race or cobbled intents the baseline is replaced by a tier-aware schedule that captures bunch fragmentation through the race: bunch phase (heavy domestic shelter, low effective draft) then selection, chase, solo finale. Trip-averaged draft lands around 0.49 on a long monument (LBL), 0.67 on the 250 km Olympic road race, 0.77 on Roubaix. Amateur schedules use higher phase factors throughout because amateur fields fragment earlier and don't get the team-coordinated late selection.
- Power-duration curve as effort ceiling. Effort is PDC-aware: 0.75 on a 4 h ride sustains a different absolute IF than 0.75 on a 20-min TT. The PDC is fitted from the rider's last 90 days of FITs (intervals.icu integration) or back-solved from tier + FTP when no archive exists.
- Heat / altitude / fatigue. Heat > 25 °C derates sustainable power by ~0.9 %/°C up to 35 °C, then ~1.5 %/°C above, anchored to Tatterson et al. (2000): 6.5 % power loss at 32 °C vs 23 °C in trained cyclists. Altitude > 1500 m derates by 1.6 %/100 m (acclimatized) or 2.0 %/100 m (unacclimatized), slopes calibrated above the Wehrlin & Hallén (2006) lab corridor of 1.0-1.2 %/100 m to match the cohort's observed 8 % drop at 2000 m on alpine rides (Marmotte, Galibier, TdF mountain stages). Long-ride fatigue picks one of three profiles by intent: standard (onset 1.0 h), endurance (onset 1.5 h), ultra (onset 2.5 h). After onset, sustainable power decays at 3-5 %/h to a 60 % floor.
- Two-pass duration refinement. Pass 1 plans at a bootstrap 28 km/h. Pass 2 refines weather, sector wetness, temperature, and the PDC duration bucket using Pass 1's actual time estimate. The refinement matters most on stage TTs: bootstrap 28 km/h would put a 32 km Olympic TT at ~69 min, dropping the PDC into the 1-hour bucket (1.0 × FTP). Actual finish ~36 min lives in the 30-min bucket (~1.04 × FTP). Skipping Pass 2 leaves TT watt targets several percent low and the predicted time correspondingly slow.
- CdA calibration ranges. Pro/amateur gap is huge. The model's TT bound is 0.17-0.24 (mass_inference.py:75): WT TT specialists at the bottom (Ganna ~0.175, Evenepoel ~0.17: estimates from Castelli wind-tunnel commentary in cycling press, not peer-reviewed), road-bike-with-clip-ons amateur club TTs at the top. WT road 0.25-0.30, amateur drops 0.38-0.45, hoods 0.45-0.50. The 0.40 slider default is amateur-drops; run `bpm calibrate` on a flat ride of yours for a ~0.02 refinement.
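Three of the modifiers in the list above reduce to short formulas, sketched here using only the numbers quoted in the text. The function names are hypothetical, the wind direction is assumed to follow the meteorological "blowing from" convention, the rain decay is assumed to be a plain exponential per deposit, and how the model combines the heat and altitude factors is not stated, so they are returned separately.

```python
import math


def headwind_component(wind_speed, wind_from_deg, heading_deg):
    """Signed wind along the direction of travel (same units as wind_speed).

    Positive = headwind, negative = tailwind. This is the dot product of the
    wind vector with the split's unit heading vector, assuming wind_from_deg
    is the direction the wind blows FROM.
    """
    return wind_speed * math.cos(math.radians(wind_from_deg - heading_deg))


def residual_rain_mm(rain_mm, hours_ago, half_life_h=12.0):
    """Rain still 'felt' by the surface after drying (12 h half-life on tarmac)."""
    return rain_mm * 0.5 ** (hours_ago / half_life_h)


def heat_factor(temp_c):
    """Fraction of sustainable power left: ~0.9 %/degC above 25, ~1.5 %/degC above 35."""
    if temp_c <= 25.0:
        return 1.0
    loss = 0.009 * (min(temp_c, 35.0) - 25.0)
    if temp_c > 35.0:
        loss += 0.015 * (temp_c - 35.0)
    return 1.0 - loss


def altitude_factor(altitude_m, acclimatized=True):
    """Fraction of sustainable power left above 1500 m (1.6 or 2.0 % per 100 m)."""
    if altitude_m <= 1500.0:
        return 1.0
    slope = 0.016 if acclimatized else 0.020
    return 1.0 - slope * (altitude_m - 1500.0) / 100.0


print(headwind_component(5.0, 0.0, 0.0))   # riding north into a northerly
print(residual_rain_mm(10.0, 24.0))        # 10 mm that fell 24 h ago
print(heat_factor(32.0), altitude_factor(2000.0))
```

Under these assumptions, 10 mm of rain from 24 h ago decays over two half-lives to the equivalent of 2.5 mm, 32 °C costs 6.3 % of sustainable power (close to the 6.5 % Tatterson anchor), and 2000 m acclimatized costs the 8 % quoted for the alpine cohort.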
How this was built: Power Guide reverse engineering
The first piece was getting the prediction onto the Edge, before any modeling work began.
Garmin Power Guide is a per-segment wattage target overlaid on a course. The two FIT messages behind it (352 for the Power Guide header, 353 for per-split targets) are undocumented. The Garmin FIT SDK doesn't expose them; reverse engineers have noted them in passing on GitHub but no public tool writes them. Sam decoded the messages from binary dumps of Edge-saved Power Guide files: opened a few in a hex inspector, cross-referenced field IDs against the SDK's known message families, mapped each field to the corresponding UI knob (target watts as % of FTP, distance, grade, heading), and confirmed the layout by writing test files and re-reading them on a real Edge. The full RE write-up sits in the repo under src/bike_power_model/writer.py + the round-trip tests in tests/test_writer.py.
How this was built: the model
With the FIT writer working, the model itself was the next problem. Garmin's native Power Guide is gradient-only: it can't see Arenberg or a Zeeland headwind. So the model accreted, one stage at a time:
- Martin-1998 power balance → predict any stage's time from FTP + mass + CdA.
- Per-surface Crr (OSM `surface=*` tags) → tarmac, cobble, gravel, hardpack stop sharing one Crr.
- Asymmetric wind from Open-Meteo archived history → real weather, not "average".
- Per-sector overrides DB → curated values for named sectors (cobble, gravel, hardpack) where OSM tagging is sparse or wrong.
- Tier-aware draft schedules → pros, cat-2s, sportive groups shelter differently.
- Two-pass duration refinement → TT predictions tighten from +3.4 % to +0.2 %.
- Per-rider PDC from FIT archive → effort=1.0 means rider's actual ceiling, not category-default.
- Heat / altitude / fatigue / W'-balance / corner-radius / sector-aware rain decay → diminishing-returns refinements.
Every step was driven by a stage where the prediction was off. The validator runs ~200 pro + amateur rides every commit; the headline number on this page is the result of 18 months of "the model says X, the rider did Y, why is the gap there".
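The two-pass duration refinement is the easiest of these stages to show in isolation. The sketch below reuses the 28 km/h bootstrap and the two PDC buckets quoted earlier (1-hour at 1.0 × FTP, 30-min at ~1.04 × FTP); the nearest-bucket lookup, the function names, and the toy predictor standing in for the physics solver are assumptions for illustration only.

```python
BOOTSTRAP_KMH = 28.0
# Only the two buckets quoted in the text; a fitted PDC has many more points.
PDC_BUCKETS = {30: 1.04, 60: 1.00}  # duration in minutes -> fraction of FTP


def pdc_bucket(duration_min):
    """Pick the nearest duration bucket (nearest-neighbour lookup is an assumption)."""
    minutes = min(PDC_BUCKETS, key=lambda m: abs(m - duration_min))
    return minutes, PDC_BUCKETS[minutes]


def two_pass(distance_km, predict_minutes):
    """Run the predictor twice, re-picking the PDC bucket from Pass 1's time.

    predict_minutes(ftp_fraction) must return a finish time in minutes.
    """
    bootstrap_min = distance_km / BOOTSTRAP_KMH * 60.0
    bucket1, frac1 = pdc_bucket(bootstrap_min)   # Pass 1: bucket from bootstrap speed
    pass1_min = predict_minutes(frac1)
    bucket2, frac2 = pdc_bucket(pass1_min)       # Pass 2: bucket from Pass 1's estimate
    pass2_min = predict_minutes(frac2)
    return {
        "bootstrap_min": bootstrap_min,
        "pass1_bucket": bucket1,
        "pass1_min": pass1_min,
        "pass2_bucket": bucket2,
        "pass2_min": pass2_min,
    }


def toy_predictor(ftp_fraction):
    """Stand-in for the physics solver: 36 min at 1.04 x FTP, cube-root scaling."""
    return 36.0 * (1.04 / ftp_fraction) ** (1.0 / 3.0)


result = two_pass(32.0, toy_predictor)
print(result)
```

For the 32 km example, the bootstrap duration of about 69 min selects the 1-hour bucket in Pass 1, while Pass 1's own estimate of roughly 36 min moves Pass 2 into the 30-min bucket and a higher watt target.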
OTS-stamped predictions: pre-race proofs + live-model reruns
Some pre-race predictions in validation_predictions/ are OpenTimestamps-stamped to Bitcoin before the gun. Each stamp proves the prediction existed at that moment with no hindsight tuning. That's the integrity claim. The cohort headlines above (built from running the live model on a held-out validation set) are the anchor for "what can this thing actually do". The per-race OTS numbers below are reference.
Re-running the four pre-Romandie OTS-stamped classics through the live model today: LBL 1.96 % → 1.82 % (slightly better), FW 2.47 % → 3.35 % (close), Amstel 2.80 % → 4.06 % (moderate drift), Brabantse Pijl 3.80 % → 7.89 % (largest drift). The shift correlates with stamp age: LBL was stamped two days before the spring physics refresh and lands closest; BP was stamped twelve days before and drifts most. Some races got marginally better, some worse. I'd rather ship more accurate physics and lose a flattering MAE on a single race than hold onto numbers that don't survive a closer look.
Most of the recent physics improvements (draft, fatigue, altitude, heat, cornering) only fire when the model knows what kind of effort the rider is doing. A road race uses a different draft model than a time trial. A TT uses different aero defaults. A gravel race uses different rolling resistance. The same numerical inputs can mean different things in different contexts.
The Romandie prologue is the cleanest demonstration. The input carried CdA 0.25 for Pogi: a sensible value for his road races. On a TT bike that CdA means the rider is sitting up rather than tucked into an aero position, which is rare for a pro doing a TT. The TT context tells the model to expect ~0.19 for a specialist his size. Knowing it's a TT (not just an aero number) is what tells the model which calibration to apply. (I updated the prologue input from 0.25 to 0.19.)
Romandie stages 1-5 re-stamped 2026-04-29. The earlier stamps were generated by a script that wasn't passing the ride type to the model, so the model fell back to flat physics with no context-aware calibration. The re-stamps land before each stage starts. The prologue (already raced) was not re-stamped: the original stamp is the locked pre-race proof.
| Race | Date | Predicted → actual | MAE | Stamped |
|---|---|---|---|---|
Predictions are timestamp-signed via OpenTimestamps before the race. The winner's predicted and actual times are shown next to the average MAE across every scored rider in the stamped prediction file. The small line under each rider discloses the intent parameters used for that prediction — effort (PDC-aware target, 1.00 = at the ceiling), pacing, and draft. Effort below 1.00 means the prediction was made for a sub-ceiling effort (marker day, gruppetto, peloton control), which is information the reader deserves to evaluate the number on.
Validation evidence
Where the numbers come from. Cohort headlines, held-out cohorts the model hasn't seen during tuning, and OTS-stamped pre-race forecasts. Refreshed whenever the validator re-runs.
Predictor vs analyzer
The model can be evaluated through two lenses, which give different numbers because they answer different questions:
| Lens | What it does | Where the SPA uses it |
|---|---|---|
| Predictor forward-only | Reads the per-ride intent (race / TT / sportive). Never reads measured power. Predicts finish time from intent + FTP + mass + course. Cobble and gravel physics are auto-derived from the GPX surface — not a rider choice. | The planner page. User picks "race" from a dropdown, the model predicts. SPA headlines cite this lens. |
| Analyzer post-ride | Back-solves the rider's effort from the ride's measured avg_power, then predicts finish from that effort. | The /analyzer page (post-ride retrospective). Not on the headline because drafting-heavy races confound the back-solve. |
Headline cohort numbers
Predictor lens, cohort default mode (per-ride overrides allowed — what the SPA serves to a user who picked the right intent). 167 rides from 68 riders. Cluster-by-rider bootstrap CIs.
95 % CI [3.30 %, 4.26 %] · 56 riders
95 % CI [5.68 %, 7.76 %] · 12 riders
95 % CI [4.03 %, 5.31 %]
Leakage audit: re-running the headline with 67 suspect intent-override entries removed (notes referencing observed outcomes) shifts overall MAE from 4.71 % to 4.18 %, Δ −0.53 %. Direction: the dropped overrides averaged +0.22 % bias (slightly slow-of-actual). Removing them moves headline bias from −1.08 % to −1.95 %, so the clean cohort runs ~2 pp fast-of-actual on time. The suspect overrides had been masking that, not inflating accuracy. Verdict: modest (0.5-1.0 pp) — disclose alongside headline.
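For readers who want to reproduce numbers of this kind, here is a minimal sketch of the three statistics involved: signed percent error on finish time, MAE and bias over a set of rides, and a cluster-by-rider bootstrap interval. The sign convention (negative = predicted fast-of-actual) is inferred from the leakage audit wording; the function names, the percentile method, and the sample data are assumptions, not the validator's code.

```python
import random
from collections import defaultdict


def pct_error(predicted_s, actual_s):
    """Signed error on finish time in percent; negative means predicted too fast."""
    return 100.0 * (predicted_s - actual_s) / actual_s


def mae_and_bias(errors):
    """Mean absolute error and mean signed error of a list of percent errors."""
    n = len(errors)
    return sum(abs(e) for e in errors) / n, sum(errors) / n


def cluster_bootstrap_ci(rides, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for MAE, resampling riders (not rides) with replacement.

    rides is a list of (rider_id, pct_error). Resampling whole riders keeps
    each rider's correlated rides together.
    """
    by_rider = defaultdict(list)
    for rider, err in rides:
        by_rider[rider].append(err)
    riders = sorted(by_rider)
    rng = random.Random(seed)
    maes = []
    for _ in range(n_boot):
        sample = []
        for rider in rng.choices(riders, k=len(riders)):
            sample.extend(by_rider[rider])
        maes.append(mae_and_bias(sample)[0])
    maes.sort()
    lo = maes[int((alpha / 2.0) * n_boot)]
    hi = maes[int((1.0 - alpha / 2.0) * n_boot) - 1]
    return lo, hi


# Made-up errors for four hypothetical riders, purely to exercise the functions.
rides = [("a", -4.0), ("a", -3.0), ("b", 2.0), ("c", -1.0), ("c", 5.0), ("d", 0.5)]
print(mae_and_bias([err for _, err in rides]))
print(cluster_bootstrap_ci(rides, n_boot=500, seed=1))
```

Clustering by rider matters here because 167 rides come from only 68 riders: treating each ride as independent would make the intervals look narrower than the data supports.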
Held-out evidence (post-calibration)
Cohorts and rides scored without re-tuning the model after seeing the result.
| Race / cohort | Date | N | MAE | Bias | Notes |
|---|---|---|---|---|---|
| Giro 2024 Stage 7 ITT | 2024-05-10 | 3 | 4.52 % | −4.52 % | Foligno → Perugia ITT, frozen run |
| Giro 2025 Stage 10 ITT | 2025-05-20 | 3 | 1.30 % | +0.89 % | Lucca → Pisa, climb + descent + flat run-in |
| Olympic TT 2024 | 2024-07-27 | 3 | 2.08 % | +0.92 % | Paris |
| Private amateur cohort | — | 39 · 14 riders | 6.94 % | +1.21 % | Amateur-league, anonymised |
| Milan-Sanremo 2025 | 2025-03-22 | 3 | 3.74 % | +3.74 % | MvdP / Ganna / Pogačar at sprint finish. |
| Milan-Sanremo 2026 | 2026-03-21 | 3 | 3.86 % | −3.86 % | Pogi / Pidcock / van Aert; race had a 32 km neutralisation after a Pogi crash that the model can't see. |
| Worlds ITT 2023 Stirling | 2023-08-11 | 3 | 0.85 % | −0.85 % | 47.8 km undulating + 800 m / 5.5 % cobble climb finish. Top-3 of Remco / Ganna / Tarling — youngest podium in Worlds ITT history (avg 23 y 131 d). Held-out, the planner's first time seeing this course. |
OTS-stamped pre-race forecasts
Pre-race predictions committed to git AND timestamp-signed via OpenTimestamps to the Bitcoin blockchain before the start. Once the OTS proof lands, no benefit-of-hindsight tuning is possible.
| Race | Date | Stamped | N | MAE | Bias | Model SHA |
|---|---|---|---|---|---|---|
| Brabantse Pijl 2026 | 2026-04-17 | 2026-04-15 | 3 | 3.80 % | −1.33 % | 36d9df26 |
| Amstel Gold Race 2026 | 2026-04-19 | 2026-04-16 | 3 | 2.80 % | −1.81 % | 63b2a899 |
| La Flèche Wallonne 2026 | 2026-04-22 | 2026-04-22 | 4ext | 2.47 % | +2.47 % | f1810c61 |
| Liège-Bastogne-Liège 2026 | 2026-04-26 | 2026-04-22 | 6ext | 1.96 % | +0.70 % | 2ba062c1 |
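The "no hindsight tuning" property rests on the fact that an OpenTimestamps proof commits to a cryptographic hash of the stamped file (the reference `ots` client uses SHA-256 by default). The sketch below only shows that hashing step, as a way to check that a prediction file is byte-identical to what was stamped; it does not create or verify a proof, and how the repo lays out or batches its stamped files is not assumed here.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 hex digest of a file, read in chunks so large files are fine."""
    h = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Changing a single byte of a prediction after the fact produces a different digest, so the edited file no longer matches the proof anchored before the race.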
Why the amateur sportive cell overstates the user-facing risk
The per-intent breakdown shows sportive at MAE ~11 %, the worst cell. Read in isolation that's alarming for an amateur SPA user picking "sportive". It's almost entirely a data-quality artefact, not a physics gap:
- Three of the six sportive rides come from one rider (Amateur A03), all 15-21 % too fast.
- Their mass disagrees across three sources by ±20 % (cohort 78 kg, FIT-archive back-solve 87 kg). FTP also has no FIT cross-check on these rides: Strava-exported FITs strip `zones_target`.
- Detected CdA on those rides comes back at 0.32-0.39, implausibly aero for an amateur in sportive position. The back-solve is absorbing draft as low CdA, then propagating bogus aero into the prediction.
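The "absorbing draft as low CdA" mechanism can be shown with a few lines of arithmetic. This sketch assumes the draft factor multiplies the aero term directly (an assumption about the model's form) and uses illustrative values: a true CdA of 0.45 and a draft factor of 0.77. The function names are hypothetical.

```python
RHO = 1.225  # sea-level air density, kg/m^3 (illustrative)


def aero_power(cda, v, draft=1.0, rho=RHO):
    """Aero watts at speed v (m/s), with shelter modelled as a drag multiplier."""
    return 0.5 * rho * cda * draft * v ** 3


def backsolve_cda(aero_watts, v, assumed_draft=1.0, rho=RHO):
    """CdA implied by measured aero watts under an assumed draft factor."""
    return aero_watts / (0.5 * rho * assumed_draft * v ** 3)


sheltered_watts = aero_power(0.45, 9.0, draft=0.77)    # what the rider actually paid
print(backsolve_cda(sheltered_watts, 9.0))             # treated as a solo ride
print(backsolve_cda(sheltered_watts, 9.0, assumed_draft=0.77))
```

Treated as a solo ride, the sheltered effort back-solves to an apparent CdA of about 0.35, inside the 0.32-0.39 band reported for these rides; with the correct draft factor the back-solve recovers 0.45. This is why CdA detection is better run on a known-solo ride.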
For an SPA user entering their own current mass (a scale this morning), their own current FTP (from a power meter), and a CdA either entered manually or detected on a known-solo ride, the inputs are internally consistent in a way the cohort sportive cell isn't. Predictor behaviour on consistent-input pro rides (3.76 %) is the better proxy for what a careful amateur user will see than the cohort sportive number.
2026-04-30T22:08:18Z. Source-of-truth doc on GitHub: docs/VALIDATION_STATE.md · How to re-derive the suite.