Forecast the easy bits, deal the rest downward

A retailer may need a billion product-by-store forecasts a week. This method forecasts only the smooth aggregates, about 0.3% of the series in the hierarchy, then deals those forecasts down to the bottom level with loaded dice so everything adds up.

On the surface this is the least glamorous problem in machine learning: predicting how many units of a product will sell at a store tomorrow. But it is the kind of forecasting companies actually pay for, and the constraints it respects are the real ones. A method from researchers in Lugano makes it fast, cheap, and coherent enough to run on a laptop in minutes.

A hierarchy of demand

Picture a big retailer. At the bottom is the finest level of detail: how many units of this product will sell at that store tomorrow. Above that, you aggregate by store, by region, by category, up to total company sales. That nested structure — bottom-level series rolling up into ever-broader totals — is a hierarchy, and a large retailer's is staggering: hundreds of thousands, even millions, of product-store combinations at the bottom. The paper cites an estimate that one large retailer may need to refresh on the order of a billion store-by-product forecasts on a daily or weekly basis.

Two things make this hard. The first is the bottom-level data. A single product at a single store often sells zero units on most days, with occasional spikes — what forecasters call intermittent demand: lots of zeros, low signal, very hard to predict one series at a time. The aggregate series, by contrast, are smooth, because all that individual noise averages out.

Why the forecasts have to add up

The second difficulty is a property called coherence. Your forecasts are coherent if they add up consistently: the forecast for total sales must equal the sum of the regional forecasts, which must equal the sum of the store forecasts, and so on. If they do not, then different teams — inventory, finance, operations — are planning against contradictory pictures of demand.

Incoherent forecasts mean contradictory plans.
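
To make the adding-up requirement concrete, here is a minimal Python sketch on a toy hierarchy. The node names, the numbers, and both helper functions are invented for illustration and do not come from the paper. The point is only that forecasts made independently at each level can disagree, while totals built by summing the bottom level cannot.

```python
# Toy hierarchy (hypothetical names): total -> two regions -> four stores.
HIERARCHY = {
    "total": ["north", "south"],
    "north": ["store_a", "store_b"],
    "south": ["store_c", "store_d"],
}

def is_coherent(forecast, hierarchy, tol=1e-9):
    """True if every parent forecast equals the sum of its children's forecasts."""
    return all(
        abs(forecast[parent] - sum(forecast[child] for child in children)) <= tol
        for parent, children in hierarchy.items()
    )

def aggregate_bottom_up(bottom, hierarchy):
    """Build forecasts for every node by summing bottom-level values upward."""
    full = dict(bottom)

    def fill(node):
        # Leaves are already present; parents are computed from their children.
        if node not in full:
            full[node] = sum(fill(child) for child in hierarchy[node])
        return full[node]

    for parent in hierarchy:
        fill(parent)
    return full

# Each level forecast separately: the regions sum to 105, but total says 100.
independent = {
    "total": 100, "north": 55, "south": 50,
    "store_a": 30, "store_b": 25, "store_c": 20, "store_d": 30,
}

# Only the stores forecast; every other level is derived by summation.
stores = {"store_a": 30, "store_b": 25, "store_c": 20, "store_d": 30}
summed = aggregate_bottom_up(stores, HIERARCHY)

print(is_coherent(independent, HIERARCHY))  # False
print(is_coherent(summed, HIERARCHY))       # True
```

In the first case finance would plan against 100 units while the regional teams plan against 105; in the second there is only one picture of demand.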

There is a third demand: these forecasts should be probabilistic — not just "we expect to sell forty units," but a full distribution of possibilities, because the decisions that hang off them are about risk. When you decide how much stock to hold, the cost-optimal order quantity is not the average forecast — it is a specific quantile of the demand distribution, usually a high one to ensure good service levels. For a low-selling item, the average might be a fraction of a unit, operationally useless; you cannot stock half a widget. You need the shape of the distribution, especially its upper tail.
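
A small worked example shows how far the mean can sit from the number a planner needs. The hundred demand scenarios below are made up, and the 90% service target is an assumption chosen for illustration, not a figure from the paper.

```python
import math

def empirical_quantile(samples, q):
    """Smallest sample value with at least a fraction q of samples at or below it."""
    ordered = sorted(samples)
    k = max(1, math.ceil(q * len(ordered)))
    return ordered[k - 1]

# Hypothetical demand scenarios for one slow-moving item: mostly zeros, rare spikes.
scenarios = [0] * 70 + [1] * 15 + [2] * 8 + [3] * 4 + [5] * 2 + [8] * 1

mean_demand = sum(scenarios) / len(scenarios)      # 0.61 units: not a stockable quantity
stock_level = empirical_quantile(scenarios, 0.90)  # 2 units covers at least 90% of scenarios

print(mean_demand, stock_level)
```

The mean says to stock a little over half a unit. The 90% quantile says to stock two, which is the figure an inventory decision can actually use.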

So the challenge is: coherent, probabilistic forecasts across hundreds of thousands of intermittent series, fast and cheap. The naive approaches do not scale. Forecasting every bottom series with its own model is slow and inaccurate, because those series are so noisy, and the elegant statistical methods for reconciling forecasts across a hierarchy tend to choke on this many series. A fast, effective, and scalable method, the authors note, was simply missing.

Predict the smooth part, deal the rest downward

Their solution, which they call e2eTD, is elegant in its laziness. Instead of forecasting every noisy bottom series, they forecast only a tiny subset of the smooth aggregate series — about three-tenths of one percent of the entire hierarchy — which are easy to predict well. They then push those forecasts back down to the bottom level using a new technique, probabilistic top-down sampling, and sum the bottom-level results back up, which guarantees coherence at every level automatically.

The clever part is that top-down step. The old-fashioned approach splits a total using fixed historical proportions — "store A always gets thirty percent of the regional total" — which is rigid and ignores uncertainty. e2eTD instead models the splitting probabilistically. For each simulated total — say a scenario where a category sells ten units — it deals those ten units out among the constituent products using, in effect, a set of loaded dice. The dice are loaded according to how those products historically move together, captured with a statistical tool called a copula, which models not just each product's typical share but how their shares correlate. To handle many products under one aggregate, it splits them in a cascade: divide the group in two, split the total between the halves, then split each half again, recursively, until every product has its share.
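
The cascade can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the authors' implementation: it loads each binary split with a fixed probability taken from hypothetical historical shares and draws every unit independently, whereas the paper's method uses a copula to capture how shares move together. The function name, product names, and share values are all invented for illustration.

```python
import random

def split_total(total, items, shares, rng):
    """Recursively deal an integer total among items; returns {item: units}.

    Assumes every share is positive. Each level halves the group, then
    assigns each unit to the left half with probability proportional to
    the left half's combined share.
    """
    if len(items) == 1:
        return {items[0]: total}

    mid = len(items) // 2
    left, right = items[:mid], items[mid:]
    left_weight = sum(shares[item] for item in left)
    right_weight = sum(shares[item] for item in right)
    p_left = left_weight / (left_weight + right_weight)

    # Loaded dice: one biased draw per unit decides which half receives it.
    left_units = sum(rng.random() < p_left for _ in range(total))

    allocation = split_total(left_units, left, shares, rng)
    allocation.update(split_total(total - left_units, right, shares, rng))
    return allocation

# Hypothetical historical shares of four products within one category.
shares = {"apples": 0.50, "pears": 0.30, "plums": 0.15, "figs": 0.05}

rng = random.Random(0)  # seeded so the sketch is repeatable
allocation = split_total(10, list(shares), shares, rng)

print(allocation, sum(allocation.values()))
```

Whatever the draws turn out to be, the right half always receives exactly what the left half did not, so the four product counts sum to the ten units that were dealt.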

Because every split divides an exact total, the parts always re-sum to the whole: coherence is baked in for every simulated scenario, not just on average. To make this fast enough for hundreds of thousands of series, they use quick statistical estimators (moment matching and the like) that are roughly two orders of magnitude faster than the rigorous maximum-likelihood alternative, and they batch and parallelise the work, with the heavy lifting written in C++ under an R interface.
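
The paper's exact estimators are not reproduced here. As a generic illustration of why moment matching is cheap, the sketch below fits a Beta distribution to a handful of hypothetical share values using closed-form formulas, with none of the iterative optimisation that maximum likelihood would need. The choice of a Beta distribution, the function name, and the sample values are assumptions made for this example.

```python
from statistics import fmean, pvariance

def fit_beta_by_moments(values):
    """Method-of-moments Beta fit: choose (alpha, beta) to match mean and variance."""
    m = fmean(values)
    v = pvariance(values, mu=m)
    if not 0 < v < m * (1 - m):
        raise ValueError("sample moments are not attainable by a Beta distribution")
    concentration = m * (1 - m) / v - 1  # equals alpha + beta
    return m * concentration, (1 - m) * concentration

# Hypothetical weekly shares of one product within its category.
observed_shares = [0.20, 0.30, 0.40, 0.30, 0.25, 0.35]

alpha, beta = fit_beta_by_moments(observed_shares)

# The fitted distribution reproduces the sample mean and variance.
fitted_mean = alpha / (alpha + beta)
fitted_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

print(alpha, beta, fitted_mean, fitted_var)
```

Two passes over the data and a few arithmetic operations give the parameters, which is the kind of saving that adds up when the fit is repeated for every split in a large hierarchy.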

What the numbers say

They test on two big public retail datasets: M5, Walmart sales with about thirty thousand bottom series, and Favorita, an Ecuadorian grocery chain with about a hundred and sixty thousand bottom series. To judge probabilistic accuracy, they use the weighted scaled pinball loss — a score for how well the predicted quantiles match reality across the whole distribution, lower being better. e2eTD posts the lowest score on both: on M5, about 0.164 against the next-best competitor's 0.215, a substantial margin. The authors note it would have ranked eleventh out of eight hundred and ninety-two teams in the M5 forecasting competition.
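
For readers who have not met it, the plain pinball loss is short enough to write out. The sketch below omits the scaling and weighting that the full score adds on top, and the function names and example numbers are illustrative assumptions.

```python
def pinball_loss(actual, forecast, tau):
    """Pinball loss of one quantile forecast at level tau (0 < tau < 1)."""
    if actual >= forecast:
        # Forecast too low: penalised in proportion to tau.
        return tau * (actual - forecast)
    # Forecast too high: penalised in proportion to (1 - tau).
    return (1 - tau) * (forecast - actual)

def mean_pinball_loss(actual, quantile_forecasts):
    """Average pinball loss over a {tau: forecast} mapping of quantile forecasts."""
    losses = [pinball_loss(actual, q, tau) for tau, q in quantile_forecasts.items()]
    return sum(losses) / len(losses)

# A 90% quantile forecast of 5 units, scored against two possible outcomes.
too_high = pinball_loss(3, 5, 0.9)  # missed by 2 on the high side: roughly 0.2
too_low = pinball_loss(8, 5, 0.9)   # missed by 3 on the low side: roughly 2.7

# Several quantile levels scored together against one outcome of 4 units.
combined = mean_pinball_loss(4, {0.5: 3, 0.9: 6})

print(too_high, too_low, combined)
```

The asymmetry is the point: a high quantile is supposed to be exceeded rarely, so falling short of the actual value costs far more than overshooting it.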

But the number that makes this paper sing is the runtime. On a standard laptop — no GPU, just a normal multi-core CPU — e2eTD produces the full set of coherent probabilistic forecasts for thirty thousand Walmart series in about four to five minutes, and for the hundred-and-sixty-thousand-series Favorita hierarchy in roughly twenty minutes. Near-top-of-competition accuracy, at this scale, on hardware anyone has.

The honest caveats

A few things deserve precision. The probabilistic score is the pinball loss, not the CRPS metric you might expect from other probabilistic-forecasting work: same spirit, measuring distributional quality, but a different specific measure. The subset of aggregate series to forecast is chosen manually, not by an automatic rule. There is also a slight tension in that the aggregate forecasts use a Gaussian reconciliation step, even though the paper rightly argues normality is a poor fit for the intermittent bottom series; that step is applied only to the smooth aggregates, where it is far more reasonable. Finally, the method is demonstrated on only two datasets, the released code is R-only, and this is a preprint.

Why it matters

The paper cites striking stakes: at the scale of a giant retailer running daily forecast cycles, switching from a heavy method to a fast, frugal one can save on the order of tens of millions of dollars a year — and even reduce the associated carbon emissions by something like a hundred thousand tonnes of CO2-equivalent annually. This is not a leaderboard exercise; it is an industrial process where speed, cost, and coherence are first-order concerns. A beautiful method that takes a server farm and a day to run is, for many businesses, no method at all.

This is a quiet but important kind of progress — not a new architecture or a benchmark record, but making a genuinely useful thing fast, cheap, and coherent enough to deploy. The core trick is a lovely piece of pragmatism: do not fight to predict the noisy bottom of the hierarchy directly; predict the smooth, easy aggregates, then deal them downward with probabilistically loaded dice so everything adds up by construction. It will not make headlines the way a giant model does. But if you run forecasting for a retailer, it is built for the problem you actually have, at the scale you actually have it, on the hardware you actually own.