When a confident forecast is lying to you
AI weather models are brilliant on average and quietly overconfident at the tails. A cheap, distribution-free correction makes their stated uncertainty match reality.
Over the last few years, artificial intelligence has quietly revolutionised weather forecasting. Models like GenCast, NeuralGCM and AIFS now match or beat the traditional physics-based supercomputer forecasts, at a tiny fraction of the cost. There has been a comfortable assumption that these newer models are well-calibrated. This paper takes that assumption and, politely but firmly, shows it is wrong — then hands us a tool to fix it.
Why forecasts are probabilistic at all
The atmosphere is chaotic: tiny uncertainties in today's measurements get amplified over days into wildly different possible futures. So a good forecast does not give you a single number; it gives you a distribution over what might happen. The practical way to do that is an ensemble — run the forecast many times, each from slightly perturbed starting conditions, and read off a spread of outcomes. From that spread you get a range: "there is a ninety percent chance tomorrow's temperature lands between these two values."
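As a rough illustration of how an ensemble becomes a range, the sketch below takes the members' values for one place and time and reads off an equal-tailed interval from their empirical quantiles. The function name `ensemble_interval`, the equal-tailed choice and the toy temperatures are assumptions made for illustration; the models discussed here may derive their intervals differently.

```python
import numpy as np

def ensemble_interval(members, level=0.9):
    """Equal-tailed interval from ensemble members (hypothetical helper).

    members: array with the ensemble members along axis 0.
    level:   nominal coverage, e.g. 0.9 for a ninety percent range.
    """
    members = np.asarray(members, dtype=float)
    tail = (1.0 - level) / 2.0  # probability left outside on each side
    lower = np.quantile(members, tail, axis=0)
    upper = np.quantile(members, 1.0 - tail, axis=0)
    return lower, upper

# Toy example: 50 made-up ensemble temperatures (degrees C) for one place and time.
rng = np.random.default_rng(0)
toy_members = 21.0 + 1.5 * rng.standard_normal(50)
low, high = ensemble_interval(toy_members, level=0.9)
print(f"90% range: {low:.1f} to {high:.1f} C")
```

The interval is only as wide as the ensemble is spread out, so an ensemble whose members agree too closely produces a range that is too tight.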
Which brings us to the central idea of the whole paper: calibration, or what statisticians call coverage. It is beautifully simple. If your model says "this is a ninety percent range," then over many days the real weather should actually fall inside that range about ninety percent of the time. If it does, the forecast is calibrated: its stated confidence matches reality. If the truth lands inside only seventy percent of the time, the model is overconfident: it is claiming more certainty than it has earned. Coverage is, as the authors put it, the ultimate measure of calibration.
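Coverage as described here is just a hit rate, which a few lines can make concrete. This is a minimal sketch with made-up numbers; `empirical_coverage` and the ten toy days are illustrative and not taken from the paper.

```python
import numpy as np

def empirical_coverage(lower, upper, truth):
    """Fraction of cases where the truth fell inside the stated interval."""
    lower, upper, truth = (np.asarray(a, dtype=float) for a in (lower, upper, truth))
    inside = (lower <= truth) & (truth <= upper)
    return float(inside.mean())

# Ten made-up days with the same claimed range of 18 to 24 degrees C each day.
lower = np.full(10, 18.0)
upper = np.full(10, 24.0)
truth = np.array([19.0, 20.5, 25.1, 22.0, 17.2, 23.9, 21.0, 24.6, 18.4, 20.0])

# Three days (25.1, 17.2, 24.6) fall outside, so coverage is 7 / 10.
print(empirical_coverage(lower, upper, truth))  # 0.7
```

If those ranges had been advertised as ninety percent ranges, a hit rate of 0.7 would be the overconfident case from the paragraph above.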
The intervals are too narrow
Here is what they find. These leading AI weather models are systematically overconfident. Their prediction ranges are too narrow — they undercover. When they claim ninety percent coverage for near-surface temperature, the truth actually falls inside only around seventy-eight or seventy-nine percent of the time for two of the models, and as low as sixty-six percent for the third. The intervals look reassuringly tight, but that tightness is a lie.
You might ask: don't we already measure whether these models are good? We do, but there is a subtle point. The popular scores used to judge probabilistic forecasts, such as the continuous ranked probability score (CRPS), can look perfectly healthy even when coverage is broken. A model can earn a great average score and still be miscalibrated. Good scores do not guarantee honest uncertainty, and that gap is exactly what has been hiding the problem.
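To see why an average score can miss this, it helps to look at what CRPS computes. The sketch below uses the standard ensemble estimator: the mean distance from the members to the observation, minus half the mean distance between members. The name `ensemble_crps` is chosen here for illustration, and this is not the paper's scoring code.

```python
import numpy as np

def ensemble_crps(members, obs):
    """Standard ensemble CRPS estimate for one scalar observation (lower is better)."""
    members = np.asarray(members, dtype=float)
    # Mean absolute distance between each member and what actually happened.
    accuracy = np.mean(np.abs(members - obs))
    # Mean absolute distance between all pairs of members (the ensemble's spread).
    spread = np.mean(np.abs(members[:, None] - members[None, :]))
    return float(accuracy - 0.5 * spread)

print(ensemble_crps([0.0, 2.0], 1.0))  # 1.0 - 0.5 * 1.0 = 0.5
print(ensemble_crps([2.0], 3.0))       # one member: plain absolute error, 1.0
```

Nothing in this formula asks whether a stated ninety percent range contained the truth. It blends accuracy and spread into one averaged number, so coverage is worth checking separately.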
It gets worse precisely where it matters most: extreme events. For the rare, dangerous tails — the heat waves, the deluges — the undercoverage is dramatic. In one case, a model's ninety percent range for extreme temperature contained the truth only thirty-nine percent of the time. Near Chicago, during a 2023 heat wave, the raw forecast's ninety-percent intervals achieved just sixty-two percent coverage.
When you most need a forecast to be honest about its uncertainty, it is most overconfident.
A self-adjusting thermostat for confidence
So what is the fix? A statistical method called conformal prediction, and this is the part worth really understanding. Conformal prediction adjusts your prediction intervals so they hit the coverage they claim — with a mathematical guarantee — and, crucially, without assuming anything about the shape of the errors. Older correction methods typically assume the errors follow some nice distribution, like a bell curve. Conformal prediction makes no such assumption and still guarantees the right coverage rate over time. It is distribution-free.
The flavour they use is online, adaptive conformal prediction, and the mechanism is genuinely intuitive. Think of it as a self-adjusting thermostat for your forecast's confidence. The method keeps a small correction term that widens or narrows the raw interval. Every day it checks: did the truth land inside or outside? If the truth fell outside (a miss), it nudges the interval a little wider, making future forecasts more cautious. If the truth fell inside, it nudges it slightly narrower. The nudges are deliberately unequal: for a ninety percent target, a miss widens the interval nine times as much as a hit narrows it, so one miss is balanced by exactly nine hits. The correction therefore stops drifting only when misses happen about one time in ten, precisely the miss rate a ninety percent interval is supposed to have. Over time, this feedback loop drags the actual coverage to where it should be.
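The thermostat can be sketched in a few lines. What follows is one common form of online conformal update, a quantile-tracking rule with a single additive correction `q`. It is an assumption that this matches the spirit of the paper's method, not a copy of it. The step size `eta`, the function name and the synthetic data are all hypothetical.

```python
import numpy as np

def online_conformal(lower, upper, truth, alpha=0.1, eta=0.05):
    """Widen or narrow raw intervals with one running correction term q.

    Assumed update rule (not taken from the paper's code):
        miss -> q grows by   eta * (1 - alpha)
        hit  -> q shrinks by eta * alpha
    With alpha = 0.1 one miss offsets exactly nine hits.
    """
    lower, upper, truth = (np.asarray(a, dtype=float) for a in (lower, upper, truth))
    q = 0.0  # q may go negative, which narrows the raw interval
    covered = np.empty(truth.shape[0], dtype=bool)
    for t in range(truth.shape[0]):
        # Apply the current correction to both ends; the centre is untouched.
        covered[t] = (lower[t] - q) <= truth[t] <= (upper[t] + q)
        miss = 0.0 if covered[t] else 1.0
        q += eta * (miss - alpha)
    return covered, q

# Synthetic check: truth is standard normal, raw range fixed at [-1, 1] (too narrow).
rng = np.random.default_rng(0)
truth = rng.standard_normal(5000)
lower, upper = np.full(5000, -1.0), np.full(5000, 1.0)
raw_coverage = float(np.mean((lower <= truth) & (truth <= upper)))
covered, final_q = online_conformal(lower, upper, truth)
print(f"raw coverage: {raw_coverage:.3f}, corrected coverage: {covered.mean():.3f}")
```

On this toy data the raw range covers roughly two thirds of outcomes, while the corrected one should settle close to the ninety percent target. Note that the rule never looks at the shape of the errors, only at hits and misses.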
A few nuances, because precision matters here:
- It corrects the width of the interval, not its centre — in the authors' words, it corrects the variance, not the bias. It makes the range the right size; it does not move it to the right place.
- It gracefully handles delayed feedback: for a five-day forecast you do not learn whether you were right until five days later, and the maths is built to update on that delay.
- It is almost free — works out of the box, needs essentially no tuning, and can be wrapped around any forecasting system at all, AI or traditional physics. It just sees the intervals and the outcomes.
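The delayed-feedback point can be sketched the same way. In this hypothetical version, each forecast is issued with whatever correction is known at the time, and the correction is only updated once a forecast's outcome arrives `delay` steps later. The update rule, the step size `eta` and the names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def delayed_online_conformal(lower, upper, truth, delay, alpha=0.1, eta=0.05):
    """Online interval correction when outcomes arrive `delay` steps late."""
    lower, upper, truth = (np.asarray(a, dtype=float) for a in (lower, upper, truth))
    n = truth.shape[0]
    q = 0.0
    q_used = np.empty(n)  # correction each forecast was actually issued with
    for t in range(n):
        q_used[t] = q      # forecast issued at step t uses the correction known now
        s = t - delay      # forecast whose outcome has just become available
        if s >= 0:
            hit = (lower[s] - q_used[s]) <= truth[s] <= (upper[s] + q_used[s])
            q += eta * ((0.0 if hit else 1.0) - alpha)
    # Only intervals and outcomes are needed; the forecasting system is a black box.
    return (lower - q_used <= truth) & (truth <= upper + q_used)

# Synthetic check: same too-narrow raw range, outcomes learned five steps late.
rng = np.random.default_rng(1)
truth = rng.standard_normal(5000)
lower, upper = np.full(5000, -1.0), np.full(5000, 1.0)
covered_delayed = delayed_online_conformal(lower, upper, truth, delay=5)
print(f"coverage with 5-step delay: {covered_delayed.mean():.3f}")
```

Because the function only changes the interval's ends symmetrically, it illustrates the first bullet too: the width is corrected, while a range centred in the wrong place stays centred in the wrong place.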
The numbers
They test three state-of-the-art AI models — GenCast, NeuralGCM and AIFS-ENS — using the high-quality ERA5 record of the actual weather as ground truth. They look at near-surface temperature and twelve-hour precipitation, at lead times from one to fifteen days, targeting ninety percent coverage, and compare against a classic statistical correction method as a baseline.
The results are clean. For temperature on ordinary days, the raw models undercovered — roughly seventy-eight, seventy-nine and sixty-six percent — and after conformal correction all three landed right around the target ninety. The model that was worst calibrated to begin with got the largest correction, exactly as you would hope. Convergence is fast: coverage gets within one percentage point of target within days, and within a tenth of a percent after about a month. And the lovely part — fixing coverage did not damage the other forecast-quality scores. The CRPS and spread metrics were improved or barely touched. You get calibration essentially for free.
The honest limits
The authors are admirably candid. The method only corrects the spread, not the underlying bias. To get the distribution-free guarantee, they fit a separate correction for each location, each lead time and each variable, which pulls apart the spatial and cross-variable structure that real weather has; stitching that back together is future work.
The mathematical guarantee is for overall, average coverage; it does not formally hold for the narrow slice of extreme events. And that shows up: for extreme temperature, the worst case went from thirty-nine percent coverage up to seventy-six, a huge improvement but still short of ninety. For extreme precipitation the improvement was marginal; those intervals stayed badly undercovered, in the low-to-mid sixties even after correction. The authors say plainly that developing conformal methods tailored for extremes remains an important challenge.
There are other caveats: the ground truth is itself a model-based reconstruction rather than raw station data, one of the three models was not strictly tested out of sample, and they looked at only two weather variables. This is a preprint, with code promised on acceptance.
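The gap between average coverage and extreme-event coverage described in this section can be measured by scoring the extreme cases on their own. The sketch below is illustrative only: defining "extreme" as the top five percent of observed values is an assumption, as are the function name and the toy numbers, and the paper may define extremes differently.

```python
import numpy as np

def coverage_by_regime(lower, upper, truth, extreme_quantile=0.95):
    """Coverage overall, on ordinary cases, and on the highest observed values."""
    lower, upper, truth = (np.asarray(a, dtype=float) for a in (lower, upper, truth))
    inside = (lower <= truth) & (truth <= upper)
    # Assumed definition: a case is extreme if the truth is in its own top tail.
    threshold = np.quantile(truth, extreme_quantile)
    extreme = truth >= threshold
    return {
        "overall": float(inside.mean()),
        "ordinary": float(inside[~extreme].mean()),
        "extreme": float(inside[extreme].mean()),
    }

# Toy data: outcomes 0..99 with a fixed range of [-1, 94.5].
truth = np.arange(100.0)
lower = np.full(100, -1.0)
upper = np.full(100, 94.5)
result = coverage_by_regime(lower, upper, truth)
print(result)  # overall 0.95, ordinary 1.0, extreme 0.0
```

In this contrived case the overall figure looks healthy while every one of the five extreme cases is missed, which is the kind of failure an average-coverage guarantee cannot rule out.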
Why it matters
Enormous numbers of real decisions are made not on a single forecast number but on the probabilities around it. The authors' headline example is agriculture: a falsely confident forecast about the likelihood of rainfall can cause real, measurable harm to farmers deciding when to plant or irrigate. Any field that turns weather probabilities into decisions inherits this problem — and inherits the fix. The fix is appealing precisely because it is cheap, tuning-free and model-agnostic: you can bolt it onto a brilliant but overconfident AI model and make its stated uncertainty trustworthy, without retraining anything.
We have spent years chasing predictive accuracy, and these AI weather models are a triumph of accuracy. But accuracy is not the same as honesty about uncertainty. A model that is brilliant on average but overconfident at the tails will quietly lead people into bad decisions, exactly when the stakes are highest. The frontier is not only better predictions anymore. It is predictions that know, and correctly state, how much they should be trusted.