Methodology

Physics + calibration. No ML black box. The how + the why behind the model. Usage docs in /help; library / API / CLI in /developers.

git clone https://github.com/sam-dumont/bike-power-model-lib
cd bike-power-model-lib
uv sync --extra api --extra analysis
uv run bpm --help

MIT, single committer (Sam Dumont). Repo flips public with the v1.0 / PyPI publish commit; until then the link 404s and the install path is correct-but-deferred.

Mean absolute time error by tier and goal:
Pro · win: 3.2 % (11 races)
Pro · finish: 6.0 % (213 races)
Amateur · win: 4.9 % (5 races)
Amateur · finish: 11.0 % (47 races)

How it works, in order

The model plans a realistic, effort-based amateur ride. Your effort setting fixes how much sustainable power you commit and how it is allocated across the course. Then one rule runs the whole thing: logic → power → physics → speed. The model picks a power target for each part of the course, then physics turns that power into a speed. It never runs backwards. There's no "assume 41 km/h and back-solve the watts" anywhere. Speed is an output, not an input.

What goes in

You

FTP
Body weight (bike separate)
Height
Sex
Effort, 0.10-1.00

Your setup

Position (auto by tier)
Bike class
Tyres → Crr
Wheels → yaw curve
Measured CdA if you have one

The event

Route (GPX / FIT / demo)
Race date + start hour
Intent: race / TT / sportive
Shelter: one course-wide choice
Coast: pedal or freewheel descents

The world, fetched

Hourly wind + temperature
72 h of rain → wetness
LiDAR / satellite elevation
Surface: OSM + curated sectors

The pipeline

1. Split the route

Cut the course into segments where the road actually changes: grade breaks, corners, cobble sectors

2. True elevation

LiDAR elevation, with satellite DEM as the fallback, replaces the noisy barometric track, which was the #1 early error source

3. Weather

Per-segment headwind and crosswind, air density, road wetness from the rain history
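A minimal sketch of the per-segment weather terms, assuming the meteorological convention (wind direction is where the wind blows from) and the Buck (1996) saturation vapour-pressure fit for humid-air density. The function names and signatures are illustrative, not the library's API.

```python
import math

R_DRY = 287.058    # specific gas constant of dry air, J/(kg*K)
R_VAPOUR = 461.495  # specific gas constant of water vapour, J/(kg*K)


def wind_components(wind_speed, wind_from_deg, heading_deg):
    """Split wind into (headwind, crosswind) relative to the rider's heading.

    Positive headwind blows against the rider; negative is a tailwind.
    """
    rel = math.radians(wind_from_deg - heading_deg)
    return wind_speed * math.cos(rel), wind_speed * math.sin(rel)


def air_density(temp_c, pressure_hpa, rel_humidity=0.0):
    """Humid-air density in kg/m^3; rel_humidity is a 0-1 fraction."""
    # Buck (1996) saturation vapour pressure over water, in hPa
    e_sat = 6.1121 * math.exp(
        (18.678 - temp_c / 234.5) * (temp_c / (257.14 + temp_c))
    )
    p_vapour = rel_humidity * e_sat * 100.0    # Pa
    p_dry = pressure_hpa * 100.0 - p_vapour    # Pa
    temp_k = temp_c + 273.15
    return p_dry / (R_DRY * temp_k) + p_vapour / (R_VAPOUR * temp_k)
```

A rider heading north (0°) into a 5 m/s northerly gets the full 5 m/s as headwind; the same wind from the east (90°) is pure crosswind. Humid air comes out slightly less dense than dry air at the same temperature and pressure.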

4. Surfaces

Tarmac, cobbles, gravel → a rolling-resistance per segment

5. The rider model

Your power-duration curve, plus a CdA derived from size, position and bike, sets the useful bounds. Estimated W′ is treated as uncertainty and risk, not a hard tank
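For illustration, the classic two-parameter critical-power model (Monod-Scherrer) can be fitted from two maximal efforts. This is a sketch of the textbook form, not the library's fitting code; the function names and the two example efforts are made up.

```python
def fit_cp_wprime(t1_s, p1_w, t2_s, p2_w):
    """Fit CP (W) and W' (J) from two maximal efforts of different duration.

    Uses the linear work-time form: work = CP * t + W'.
    """
    work1 = p1_w * t1_s
    work2 = p2_w * t2_s
    cp = (work2 - work1) / (t2_s - t1_s)
    w_prime = work1 - cp * t1_s
    return cp, w_prime


def sustainable_power(cp, w_prime, duration_s):
    """Highest constant power the two-parameter model allows for a duration."""
    return cp + w_prime / duration_s


# Hypothetical rider: 400 W for 3 min, 320 W for 12 min
cp, w_prime = fit_cp_wprime(180, 400, 720, 320)
# cp is about 293.3 W and w_prime about 19.2 kJ
```

The two-parameter form overestimates very short and very long efforts, which is one reason to read an estimated W′ as a risk indicator rather than an exact reserve.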

6. Power targets (the logic)

Effort sets your sustainable personal commitment and allocation. Coast independently changes descent pedalling and may reallocate saved work elsewhere so the full ride still closes at your effort

7. Physics turns power into speed

Martin 1998 power balance turns personal watts into speed. Your one course-wide shelter changes aero, speed and duration only; it never changes your target watts and does not simulate group fragmentation or transitions
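A simplified sketch of the power-to-speed step. It keeps only the aerodynamic, rolling and gravity terms of a Martin-style power balance at steady speed; the full Martin 1998 model also carries bearing friction, wheel rotation drag and kinetic-energy changes, which are omitted here. The function names, the default drivetrain efficiency and the example rider are assumptions for illustration, not the library's values.

```python
import math

G = 9.80665  # standard gravity, m/s^2


def power_required(v, grade, mass_kg, cda, crr,
                   rho=1.225, headwind=0.0, drivetrain_eff=0.975):
    """Rider watts needed to hold ground speed v (m/s) on a constant grade."""
    theta = math.atan(grade)
    v_air = v + headwind
    aero = 0.5 * rho * cda * v_air * abs(v_air) * v
    rolling = crr * mass_kg * G * math.cos(theta) * v
    gravity = mass_kg * G * math.sin(theta) * v
    return (aero + rolling + gravity) / drivetrain_eff


def speed_from_power(power_w, grade, mass_kg, cda, crr, **conditions):
    """Find the speed at which the power balance closes, by bisection.

    Speed is the unknown: the target watts are fixed before this is called.
    Assumes a non-negative headwind and a solution below 40 m/s.
    """
    lo, hi = 0.0, 40.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if power_required(mid, grade, mass_kg, cda, crr, **conditions) < power_w:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical rider: 80 kg system mass, CdA 0.32 m^2, Crr 0.004, 250 W target
v_flat = speed_from_power(250, 0.0, 80, 0.32, 0.004)
v_climb = speed_from_power(250, 0.06, 80, 0.32, 0.004)
```

Shelter would enter this sketch only through `cda` (or an equivalent aero factor), never through `power_w`, which mirrors the rule that shelter changes speed and duration but not the target watts.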

8. Twice, because duration is circular

Pass 1 guesses the duration and plans; pass 2 re-fetches the weather for the real finish time and re-plans
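The circularity can be sketched with a toy planner: the weather window depends on the duration, and the duration depends on the weather. Everything below is a hypothetical stand-in (a made-up speed rule and a 24-entry hourly headwind list), not the real planner or weather client.

```python
import math


def plan_duration_h(distance_km, headwind_ms):
    """Toy planner: 36 km/h in still air, minus 1.5 km/h per m/s of headwind."""
    speed_kmh = 36.0 - 1.5 * headwind_ms
    return distance_km / speed_kmh


def mean_headwind(hourly_headwind, start_hour, duration_h):
    """Average the hourly values that cover the ride window."""
    last_hour = math.ceil(start_hour + duration_h)
    window = hourly_headwind[start_hour:last_hour]
    return sum(window) / len(window)


def two_pass_duration(distance_km, hourly_headwind, start_hour, guess_h):
    # Pass 1: plan against the weather for the guessed duration
    first = plan_duration_h(
        distance_km, mean_headwind(hourly_headwind, start_hour, guess_h)
    )
    # Pass 2: re-read the weather for the planned finish time, then re-plan
    second = plan_duration_h(
        distance_km, mean_headwind(hourly_headwind, start_hour, first)
    )
    return first, second


# Calm day, except a 5 m/s headwind arriving in the 13:00 hour
hourly = [0.0] * 24
hourly[13] = 5.0
first, second = two_pass_duration(150.0, hourly, start_hour=9, guess_h=4.0)
```

With a 4-hour guess the first pass never sees the 13:00 headwind; the second pass does, so its duration is longer.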

9. Out: splits, watts, FIT

Every segment carries its receipt, so you can ask why any target is what it is. The model then writes a head-unit file (Garmin, Wahoo, Hammerhead, Bryton)

Where the numbers come from

Every model like this lives or dies on where its numbers come from. There are three tiers, and pretending there were two would be tidier than the truth.

Cited from the literature, and left alone. The power balance and bearing drag (Martin 1998), frontal area (Heil 2001), the 121-rider peloton drag map (Blocken 2018), critical-power and W′ estimates (Skiba, Monod-Scherrer), air density (Buck 1996), the altitude VO2 loss (Bassett 1999), the durability and heat anchors. All checked against the full papers, not the abstracts. These don't move.

Anchored, then nudged. Constants that have a published starting point the cohort adjusted or extended: the road-race drag figure, the altitude slope set above the lab corridor to match real alpine rides, the gravel rolling-resistance classes bridging road-tyre and mountain-bike measurements, the descent cornering caps derived conservatively from tyre-grip tables.

Cohort-fitted, and treated with suspicion. The adjustment layer: descent allocation, durability, heat response, and course-wide wind effects. Fitted on the validation set, each constant annotated with where it came from in the source, and re-tuning one has to pass per-rider, per-terrain checks on BOTH power and speed, leave-one-race-out. A pretty overall average that hides a fast-climbs-slow-descents cancellation is a failing grade here, on purpose.
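A leave-one-race-out check can be sketched as follows: hold each race out in turn, fit the constant on the rest, and score it on the race it never saw. The single least-squares scale factor, the data layout and the numbers are placeholders for illustration; the real checks run per rider and per terrain on both power and speed.

```python
def fit_scale(pairs):
    """Least-squares scale k so that k * raw best matches observed."""
    num = sum(raw * obs for raw, obs in pairs)
    den = sum(raw * raw for raw, _ in pairs)
    return num / den


def leave_one_race_out(races):
    """races maps race id -> list of (raw_prediction, observed) pairs.

    Returns the mean absolute percentage error on each held-out race.
    """
    scores = {}
    for held_out, test_pairs in races.items():
        train = [
            pair
            for race_id, pairs in races.items()
            if race_id != held_out
            for pair in pairs
        ]
        k = fit_scale(train)
        errors = [abs(k * raw - obs) / obs for raw, obs in test_pairs]
        scores[held_out] = 100.0 * sum(errors) / len(errors)
    return scores


# Made-up data: race "c" behaves differently from "a" and "b"
races = {
    "a": [(100.0, 110.0), (200.0, 220.0)],
    "b": [(100.0, 110.0)],
    "c": [(100.0, 90.0)],
}
scores = leave_one_race_out(races)
```

Reporting the per-race scores, rather than their average, is what keeps one well-fitted race from hiding a badly fitted one.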

Known soft spots, plainly: the heat model is a placeholder (barely a dozen rides in the cohort clear 30 °C, so it's directional, not calibrated); altitude under-predicts above ~2000 m; drivetrain loss is one scalar; yaw-aware CdA only kicks in when you name a wheelset; drafting is solo physics with a shelter factor, not a multi-body simulation; and 1-3 star cobble sectors surge and coast in ways a steady segment can't capture. The limitations section carries the full list.

How this was built: Power Guide reverse engineering

The first piece was getting the prediction onto the Edge, before any modeling work began.

Garmin Power Guide is a per-segment wattage target overlaid on a course. The two FIT messages behind it (352 for the Power Guide header, 353 for per-split targets) are undocumented. The Garmin FIT SDK doesn't expose them; reverse engineers have noted them in passing on GitHub but no public tool writes them. Sam decoded the messages from binary dumps of Edge-saved Power Guide files: opened a few in a hex inspector, cross-referenced field IDs against the SDK's known message families, mapped each field to the corresponding UI knob (target watts as % of FTP, distance, grade, heading), and confirmed the layout by writing test files and re-reading them on a real Edge. The full RE write-up sits in the repo under src/bike_power_model/writer.py + the round-trip tests in tests/test_writer.py.

How this was built: the model

With the FIT writer working, the model itself was the next problem. Garmin's native Power Guide is gradient-only: it can't see Arenberg or a Zeeland headwind. So the model accreted, one stage at a time:

Martin-1998 power balance → predict any stage's time from FTP + mass + CdA.
Per-surface Crr (OSM surface=* tags) → tarmac, cobble, gravel, hardpack stop sharing one Crr.
Asymmetric wind from Open-Meteo archived history → real weather, not "average".
Per-sector overrides DB → curated values for named sectors (cobble, gravel, hardpack) where OSM tagging is sparse or wrong.
Tier-aware draft schedules → pros, cat-2s, sportive groups shelter differently.
Two-pass duration refinement → TT predictions tighten from +3.4 % to +0.2 %.
Per-rider PDC from FIT archive → effort=1.0 means rider's actual ceiling, not category-default.
Heat / altitude / fatigue / W'-balance / corner-radius / sector-aware rain decay → diminishing-returns refinements.

Every step was driven by a stage where the prediction was off. The smoke demos run on every commit; the full pro and amateur cohort, close to 300 races, runs before each sign-off. The numbers on this page are the result of 18 months of "the model says X, the rider did Y, why is the gap there".

Validation evidence

The evidence behind the headline numbers: time error by tier and goal, and per-terrain power alignment, all measured through the deployed plan_race path over close to 300 races and refreshed whenever the validator re-runs. Forward-looking evidence lives on the predictions page: each race is timestamp-signed before the start, so the number can't have been edited once the result is in.

Where it lands: time by tier and goal

Two axes: pro or amateur, and whether the rider went for the win or rode for a finish. Time error is whole-ride predicted vs actual through the deployed plan_race path. Effort is back-solved per ride, CdA is derived from height at the road-race position, descents use each rider's own coast preference. Races only: sportives, cyclos and paused rides are out. There is no blended cohort average here on purpose: that one number hid which cell was carrying the error.

Pro · win

Mean absolute error: 3.2 %
N races: 11
Median absolute error: 4.2 %
Within 5 %: 10/11
Bias: −1.1 %

Pro · finish

Mean absolute error: 6.0 %
N races: 213
Median absolute error: 4.8 %
Within 5 %: 51 % (n=213)
Bias: −3.8 %

Amateur · win

Mean absolute error: 4.9 %
N races: 5
Median absolute error: 6.4 %
Within 5 %: 2/5
Bias: −2.2 %

Amateur · finish

Mean absolute error: 11.0 %
N races: 47
Median absolute error: 9.3 %
Within 5 %: 19 % (n=47)
Bias: −9.7 %

Tier · goal	N	MAE	Median	Within 5 %	Bias
Pro · win	11	3.2 %	4.2 %	10/11	−1.1 %
Pro · finish	213	6.0 %	4.8 %	51 % (n=213)	−3.8 %
Amateur · win	5	4.9 %	6.4 %	2/5	−2.2 %
Amateur · finish	47	11.0 %	9.3 %	19 % (n=47)	−9.7 %
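The columns above can be reproduced from per-race predicted and actual times with a few lines. This sketch assumes the signed error is (predicted − actual) / actual, so a negative bias means the prediction was faster than the ride; the function name and the three example races are made up and are not cohort data.

```python
from statistics import mean, median


def time_error_summary(predicted_s, actual_s, tol_pct=5.0):
    """Summarise whole-ride time error for one tier-and-goal cell."""
    errors = [100.0 * (p - a) / a for p, a in zip(predicted_s, actual_s)]
    abs_errors = [abs(e) for e in errors]
    return {
        "n": len(errors),
        "mae_pct": mean(abs_errors),
        "median_abs_pct": median(abs_errors),
        "within_tol": sum(e <= tol_pct for e in abs_errors),
        "bias_pct": mean(errors),
    }


# Three illustrative races: errors of -2 %, +5 % and -10 %
summary = time_error_summary(
    predicted_s=[9800, 10500, 9000],
    actual_s=[10000, 10000, 10000],
)
```

Mean absolute error and bias answer different questions: errors of −10 % and +10 % give a bias of zero and an MAE of 10 %, which is why the table reports both.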

The win cells are the model's target effort, so they land tightest. Finish rides are looser because a rider sitting in the bunch has a draft and an effort the model needs told to it: a cohort scan has to guess them, a real user supplies them. The amateur-finish cell is also held back by the road-race CdA fit running too slick for a club racer sitting up in the group; supply a measured CdA and it drops. That aero lever is a known, deferred fix, not a data hole.

Power alignment by terrain

ΔP is actual minus predicted watts, median per terrain bucket. Positive means the rider pushed harder than the steady-state target.

Terrain	Pros ΔP	Amateurs ΔP
Flat	+3 W	+8 W
Climb (mild)	+7 W	+25 W
Climb (steep)	+19 W	+53 W
Cobble	−6 W	+0 W
Gravel	−0 W	+5 W
Descent (mild)	−18 W	−41 W
Descent (steep)	−36 W	−76 W

Flat, cobble and gravel sit near zero: that's the model landing on steady ground. The climb-over / descent-under pattern is the grade-elastic pacing preference (push the climb, coast the descent), which the rider supplies through the effort and coast settings. It's a choice, not an error, so the model doesn't bake it in by default.
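The ΔP figures are a median of actual minus predicted watts within each terrain bucket. A minimal sketch, assuming segments arrive as (terrain, actual watts, predicted watts) tuples; that layout and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def delta_p_by_terrain(segments):
    """Median of (actual - predicted) watts per terrain bucket."""
    buckets = defaultdict(list)
    for terrain, actual_w, predicted_w in segments:
        buckets[terrain].append(actual_w - predicted_w)
    return {terrain: median(deltas) for terrain, deltas in buckets.items()}


# Made-up segments: the rider pushes the climbs and eases off on descents
segments = [
    ("climb", 310, 290),
    ("climb", 305, 295),
    ("climb", 330, 300),
    ("descent", 120, 160),
    ("descent", 90, 150),
]
delta_p = delta_p_by_terrain(segments)
```

Using the median rather than the mean keeps a single sprint or a long freewheel from dominating a bucket.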

Leakage audit, kept for transparency: dropping 67 suspect intent-override entries (notes that reference the observed outcome) shifts the intent-override pool's average time error from 4.71 % to 4.18 %, Δ −0.53 pp. The dropped overrides averaged +0.22 % bias (slightly slow-of-actual), so removing them moves the pool's bias from −1.08 % to −1.95 %: the suspect overrides were masking a small fast-of-actual lean, not inflating accuracy. Verdict: modest (0.5-1.0 pp); disclose alongside the headline.

Why the amateur-finish cell overstates the careful-user risk

The tier × goal table above puts amateur-finish at 11 %, the loosest cell. Read in isolation that looks alarming for an amateur planning a long ride. It's mostly a data-quality and aero artefact, not a physics gap:

A handful of rides dominate it, and on those the mass disagrees across sources by ±20 % and the FTP has no FIT cross-check (Strava-exported FITs strip zones_target).
With no supplied CdA, the back-solve absorbs the rider's draft as an implausibly low CdA (0.32 to 0.39, far too aero for anyone sitting up in a group) and propagates that bogus aero into the time.

Give the model the real inputs instead of making it guess and the artefact goes away: the two amateur demos on the home page reproduce well. The 40 km bunch race lands at +3.6 % on an honest mass-event draft, tighter still if you tell it you sat in a full peloton; the 10-hour Alpine day at −2.6 %. For a rider entering their own current mass, their own FTP, and a measured CdA, the supplied-input win cells (pro 3.2 %, amateur 4.9 %) are the better proxy than the guessed-input amateur-finish number.

Payload generated 2026-07-10T21:34:46Z. Source-of-truth doc on GitHub: docs/VALIDATION_STATE.md · How to re-derive the suite.

Good, bad, and honest gaps

Works well

Pro pacing on 3-6 h races with predictable surfaces and steady weather, and time trials, where the two-pass duration refinement earns its keep. Per-race headline numbers are deliberately absent here right now: the model is mid-revision and every number in this section gets re-validated against the Tour de France 2026 before it comes back.

Weaker

Crits, discrete attacks, heat >30 °C, very wet cobbles. The physics is fine; the rider model doesn't know how to pace a surge yet.

Won't model

Bunch tow. If a weaker rider finishes with a stronger group, the prediction stays at their solo-equivalent pace. Binding to the leader would bake in tactics I can't verify before the race: riders drop out of bunches too. Predictions land best for the rider setting the pace.