The one-line model that beats the Transformers

A line of algebra can match forecasting architectures thousands of times its size — if you stop scaling the model and start tuning the preprocessing nobody bothers to touch.

Machine learning has spent years marching in one direction: bigger models, more parameters, more compute. Time-series forecasting followed the crowd, from specialised neural networks to giant foundation models trained on enormous corpora. This paper takes the opposite position, and states it plainly — most of the gap can be closed at far lower cost by tuning the preprocessing rather than scaling the model.

A rigged evaluation

Forecasting is the task of predicting the future values of a signal from its past: electricity demand next week, traffic on a road network, the temperature in a transformer station. To compare models fairly, the field follows a standard protocol — and the authors argue that protocol quietly favours the heavyweights.

The usual approach fixes the preprocessing to a single setting — how much history the model sees, how the data is normalised — and then pits architectures against each other. The problem is that this systematically disadvantages simple models. A Transformer with millions of parameters can partly compensate for a badly chosen history window or an uninformative normalisation scheme; it has the capacity to absorb a poor input representation and fix it internally. A linear model has nowhere to hide.

So linear models look weaker than they really are — not because they lack expressive power, but because they are more sensitive to choices nobody bothers to tune. Level that playing field, and the picture changes.

The simplest respectable model

The contender they champion is Ridge regression, about the simplest respectable model in all of statistics. It predicts the future as a weighted sum of past values, and it has a closed-form solution: no iterative training, just one equation solved in a few milliseconds on a single GPU.
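To make "one equation" concrete, here is a minimal sketch of closed-form ridge forecasting in NumPy. The helper names (make_windows, fit_ridge), the sine-wave data, and the specific context length, horizon and penalty are illustrative assumptions, not the authors' code, and the sketch omits an intercept for brevity.

```python
import numpy as np

def make_windows(series, context_len, horizon):
    """Slice a 1-D series into (past window, future window) training pairs."""
    n = len(series) - context_len - horizon + 1
    X = np.stack([series[i:i + context_len] for i in range(n)])
    Y = np.stack([series[i + context_len:i + context_len + horizon] for i in range(n)])
    return X, Y

def fit_ridge(X, Y, lam):
    """Closed-form ridge solution: W = (X^T X + lam * I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Synthetic sine wave with period 24, purely for illustration.
t = np.arange(400)
series = np.sin(2 * np.pi * t / 24)

X, Y = make_windows(series, context_len=48, horizon=12)
W = fit_ridge(X, Y, lam=1e-3)      # weight matrix of shape (48, 12)

# Forecast the 12 steps that follow the end of the series.
forecast = series[-48:] @ W
```

The whole "training" step is the single call to np.linalg.solve; every forecast step is a weighted sum of the 48 most recent values.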

Crucially, the authors lean on a recent result showing that the whole zoo of popular linear forecasters, with names like DLinear, NLinear, and FITS, is mathematically equivalent to plain linear regression over transformed features. They collapse into one model class. Which raises a sharp question: if the model is effectively fixed, where should you spend your remaining effort?
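One instance of that collapse is easy to check by hand. The sketch below assumes the usual description of NLinear (subtract the window's last value, apply a linear layer, add the value back) and shows that the result is a single linear map on the raw input. The weights are random and the sizes arbitrary; this illustrates the idea and is not the proof the authors cite.

```python
import numpy as np

rng = np.random.default_rng(0)
L, H = 8, 3                      # context length and horizon (arbitrary sizes)
W = rng.normal(size=(H, L))      # weights of the inner linear layer
b = rng.normal(size=H)           # bias of the inner linear layer
x = rng.normal(size=L)           # one input window

# NLinear-style forward pass: subtract last value, linear layer, add it back.
last = x[-1]
y_nlinear = W @ (x - last) + b + last

# The same map folded into one linear layer on the raw input:
# W x + last * (1 - W 1) = (W + (1 - W 1) e_last^T) x
e_last = np.zeros(L)
e_last[-1] = 1.0
W_equiv = W + np.outer(1.0 - W.sum(axis=1), e_last)
y_plain = W_equiv @ x + b
```

The two outputs agree to floating-point precision, so the extra normalisation step changes how the weights are parameterised, not what the model can express.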

Their answer is the preprocessing. They identify four levers and search over them carefully:

Context length — how much history the model gets to see.
Local normalisation — instead of using statistics from the whole training set, you use statistics from just the most recent trailing fraction of the window.
Regularisation — how strongly you rein in the model to prevent overfitting.
Data augmentation — adding a little noise during training to toughen the model up.
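
Two of these levers are small enough to sketch directly. The code below assumes that the "statistics" are a mean and a standard deviation and that the augmentation noise is Gaussian; the article does not specify either, and the function names are hypothetical.

```python
import numpy as np

def local_normalise(window, frac, eps=1e-8):
    """Standardise a context window using only its trailing `frac` of points."""
    # Use at least two points so the standard deviation is meaningful.
    k = max(2, int(round(frac * len(window))))
    tail = window[-k:]
    mu, sigma = tail.mean(), tail.std()
    return (window - mu) / (sigma + eps), mu, sigma

def denormalise(values, mu, sigma, eps=1e-8):
    """Undo local_normalise, e.g. to map a forecast back to original units."""
    return values * (sigma + eps) + mu

def augment(X, noise_std, rng):
    """Add zero-mean Gaussian noise to training inputs (assumed noise model)."""
    return X + rng.normal(scale=noise_std, size=X.shape)
```

With frac=1.0 this reduces to ordinary per-window standardisation; smaller values let the most recent level and volatility set the scale.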

They search these four knobs not just per dataset, but per forecast horizon and even per individual series, using an automated search. They call the resulting pipeline SearchCast.
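
The shape of such a search can be sketched in a few lines. This is not SearchCast: a plain grid search stands in for the automated search, which the article does not specify, and the data, grid values, and train/validation split are all made up for illustration. Errors are measured in original units so that different normalisation settings stay comparable.

```python
import itertools
import numpy as np

def windows(series, ctx, hor):
    n = len(series) - ctx - hor + 1
    X = np.stack([series[i:i + ctx] for i in range(n)])
    Y = np.stack([series[i + ctx:i + ctx + hor] for i in range(n)])
    return X, Y

def normalise(X, frac):
    # Per-window mean/std taken from the trailing `frac` of each context.
    k = max(2, int(round(frac * X.shape[1])))
    mu = X[:, -k:].mean(axis=1, keepdims=True)
    sd = X[:, -k:].std(axis=1, keepdims=True) + 1e-8
    return (X - mu) / sd, mu, sd

def evaluate(train, val, hor, ctx, frac, lam, noise, rng):
    """Fit ridge on train with the given knobs; return validation MSE."""
    Xtr, Ytr = windows(train, ctx, hor)
    Xn, mu, sd = normalise(Xtr, frac)
    Yn = (Ytr - mu) / sd
    Xn = Xn + rng.normal(scale=noise, size=Xn.shape)        # augmentation
    W = np.linalg.solve(Xn.T @ Xn + lam * np.eye(ctx), Xn.T @ Yn)
    # Prepend the end of train so every validation target has full context.
    Xva, Yva = windows(np.concatenate([train[-ctx:], val]), ctx, hor)
    Xvn, mu_v, sd_v = normalise(Xva, frac)
    pred = (Xvn @ W) * sd_v + mu_v                           # back to raw units
    return float(np.mean((pred - Yva) ** 2))

# Synthetic series (seasonality + trend + noise), purely for illustration.
rng = np.random.default_rng(0)
t = np.arange(600)
series = np.sin(2 * np.pi * t / 24) + 0.01 * t + 0.1 * rng.normal(size=600)
train, val = series[:450], series[450:]

grid = itertools.product((24, 48, 96),      # context length
                         (0.1, 0.5, 1.0),   # trailing normalisation fraction
                         (1e-2, 1.0),       # ridge penalty
                         (0.0, 0.05))       # augmentation noise std
results = [(evaluate(train, val, 12, ctx, frac, lam, noise, rng), ctx, frac, lam, noise)
           for ctx, frac, lam, noise in grid]
best = min(results)   # (validation MSE, ctx, frac, lam, noise)
```

Because each trial is one linear solve, all 36 configurations finish almost instantly, which is what makes searching per horizon and per series affordable.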

The leaderboard

The results are genuinely striking. Within the family of linear models, the tuned approach achieves the best average error on seven of eight standard benchmarks.

But the headline is the comparison against the heavy nonlinear models: Transformers, MLPs, convolutional networks. On six of eight benchmarks, the humble tuned linear model posts the lower average error, despite having no nonlinearity and no learned representation whatsoever. On those six, the margin in its favour ranges from about four percent to sixteen percent, and it gets there orders of magnitude more cheaply: each trial takes milliseconds, against the heavy compute of training a deep network.

Sometimes the model was never the bottleneck.

Both of its losses are to one strong Transformer, PatchTST, and they fall on the two largest datasets, the ones with hundreds of parallel series, where shared neural representations plausibly help.

The knobs as a diagnostic

The most interesting part is not the leaderboard. It is that the tuned knobs become a diagnostic instrument, a lens on the data itself.

Take context length. The conventional wisdom is "longer horizons need more history." The authors fit how the optimal lookback scales with horizon across datasets and find that the exponents span both signs and an order of magnitude. They run from strongly positive on one dataset to negative on others, where longer horizons actually prefer shorter history: a signature of non-stationarity, in which old data actively misleads. There is no universal rule; the data tells you, and the search recovers it.
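
A scaling exponent of this kind is typically read off a log-log fit. The sketch below assumes a power law of the form lookback = c * horizon^alpha, which is one reading of "exponents" here rather than a confirmed detail of the paper. The horizons and lookbacks are made-up numbers generated from an exact power law, so the fit simply recovers the exponent that was put in.

```python
import numpy as np

# Hypothetical horizons and "optimal" lookbacks; NOT results from the paper.
horizons = np.array([96, 192, 336, 720])
best_lookback = 50.0 * horizons ** 0.5      # exact power law with alpha = 0.5

# log L = alpha * log H + log c, so a straight-line fit in log space gives alpha.
alpha, log_c = np.polyfit(np.log(horizons), np.log(best_lookback), deg=1)
c = np.exp(log_c)
```

A positive alpha means longer horizons prefer more history; a negative alpha means they prefer less, the non-stationary pattern described above.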

Or take normalisation. Full-window normalisation is essentially never selected. The optimal choice is almost always a short trailing fraction — somewhere between roughly a third of a percent and thirty percent of the window. Recent local statistics carry more signal than global ones. These are not just tuning tricks; they are findings about the structure of the data, the kind of thing a deep model would absorb silently and never tell you.

And there is a clean supporting result: on several datasets, just extending the history window — doing nothing else — drops the forecast error by thirty-six to fifty-three percent. The field's habit of fixing a short default lookback was leaving enormous accuracy on the table, for free.

The honest caveats

The authors are careful here. They report point-estimate errors without confidence intervals, so genuinely small margins, such as a tie on one dataset or the four-percent gap on the Weather dataset, should be treated as ties, not victories.
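
For readers who want to attach an interval to such a margin on their own data, one common option is a paired bootstrap over test windows. This is a generic sketch and not something the paper reports; the function name and defaults are hypothetical, and it assumes you have per-window errors for both models on the same windows.

```python
import numpy as np

def paired_bootstrap_ci(err_a, err_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for mean(err_a - err_b) over paired test windows."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(err_a, dtype=float) - np.asarray(err_b, dtype=float)
    # Resample window indices with replacement, n_boot times.
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic per-window squared errors, purely for illustration.
rng = np.random.default_rng(1)
err_linear = rng.gamma(shape=2.0, scale=0.5, size=200)
err_neural = err_linear + rng.normal(scale=0.2, size=200)
lo, hi = paired_bootstrap_ci(err_linear, err_neural)
```

If the interval contains zero, the margin is best read as a tie. Overlapping forecast windows produce correlated errors, so this simple resampling tends to be optimistic; a block bootstrap is the more careful choice.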

There is an asymmetry worth naming: SearchCast searches its own preprocessing, while the neural baseline numbers are quoted from their original publications, in their published configurations. That cuts in the linear model's favour, and the authors acknowledge it. The two real losses, on the largest datasets, suggest that with hundreds of similar series, a Transformer's shared representations do buy you something a per-series linear model cannot. And all of this is specific to standard numeric long-horizon benchmarks — findings for that regime, not universal laws. This is a preprint.

Why it matters

Two reasons, one practical and one cultural. Practically, most companies do not need a giant forecasting model. They need something cheap, interpretable, stable, and fast to retune — and this paper says a well-tuned linear model is a far stronger baseline than the field assumes, runnable on modest hardware, with weights you can actually inspect. That gives anyone evaluating a flashy foundation-model product a serious challenger to measure it against.

Culturally, it is a clean hype-check. When you control for preprocessing, much of the apparent advantage of scale shrinks. The gains the field has been attributing to model capacity may, in significant part, have been gains from preprocessing all along — we just were not tuning it for the small models. Before you reach for the foundation model, it is worth asking whether you have actually tuned the humble baseline. Often, you have not — and often, that is where the real gains were hiding.