The leaderboard lie: foundation models top the rankings without finding alpha

Point a pretrained time-series model at the stock market and it wins almost every contest. Look closer and the gaps are a fraction of a single bit — a ranking victory that carries no edge you could trade.

One of the most seductive promises in quantitative finance right now is that the new wave of pretrained time-series foundation models can be pointed at the stock market and made to find alpha — genuine, exploitable predictive edge. This paper asks that question honestly, and the answer is nuanced. The nuance is the whole point.

What a foundation model promises

Over the last couple of years, the core idea behind large language models — pretrain one enormous model on a vast corpus, then apply it to new tasks with little or no extra training — has jumped to time series. Companies have trained models on staggering amounts of temporal data: TimeGPT on over a hundred billion data points; Moirai on roughly thirty-six million distinct time series. The pitch is that such a model arrives already understanding the grammar of sequences — trends, cycles, shocks — and can forecast a brand-new series it has never seen, with no per-series training at all. That is called zero-shot forecasting.

The tantalising question follows immediately: if these models are such good general forecasters, can they forecast financial returns and make money?

Why returns are a brutal test

To see why that is so hard, you need to understand how little signal there is in financial returns. Daily stock returns are close to random. They combine very low signal-to-noise ratios, structural breaks, heavy tails, and weak persistence. The paper puts a hard number on the ceiling: in the academic literature on predicting returns, the out-of-sample R-squared — the fraction of variance you can actually explain — is typically below one percent.

The authors translate that into information terms: about five thousandths of a nat per forecast, a tiny fraction of a single bit. Sit with that. The entire predictable signal in tomorrow's return is a sliver of one bit. Any sufficiently capable model that manages to extract that sliver will perform about the same as any other. Such models are not really competing on architecture; they are competing for a minuscule information budget.
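To make the scale concrete, here is a minimal sketch of how an R-squared can be turned into nats and bits. It assumes the forecast and the realised return are jointly Gaussian, in which case the mutual information is -0.5 * ln(1 - R^2); the paper's own derivation may differ, and the function name is hypothetical.

```python
import math

def r2_to_information(r2: float) -> tuple[float, float]:
    """Mutual information implied by an R-squared, returned as (nats, bits).

    Assumption: forecast and target are jointly Gaussian, so that
    I = -0.5 * ln(1 - R^2). This is an illustration, not the paper's code.
    """
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    nats = -0.5 * math.log1p(-r2)   # log1p keeps precision for tiny r2
    bits = nats / math.log(2.0)     # 1 bit = ln(2) nats
    return nats, bits

# An out-of-sample R-squared of one percent, the ceiling quoted in the text.
nats, bits = r2_to_information(0.01)
print(f"{nats:.4f} nats, {bits:.4f} bits")  # 0.0050 nats, 0.0072 bits
```

Under that assumption a one percent R-squared corresponds to roughly five thousandths of a nat, which is consistent with the figure quoted above.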

There is one more idea you need, and it is the conceptual heart of the paper: the difference between statistical significance and economic significance. A model can be statistically distinguishable from another — you can show with confidence that its forecasts are a hair better — and yet that edge can be far too small to actually trade on profitably once you account for noise and costs. Beating a benchmark on a leaderboard is not the same as making money.

The honest benchmark

The benchmark the authors choose is itself worth dwelling on: the random walk, which simply forecasts zero return — no change. Why is that the right thing to beat? Because under the efficient-market view, a zero-return forecast is essentially the optimal one; and because the typical return is about zero anyway, it is the natural baseline. To beat it, a model has to extract genuine, conditional information about the future beyond "nothing much will happen." That is a high, honest bar.

Here is the setup. They take six pretrained foundation models, including TimeGPT, TimesFM, Moirai, and two versions of Chronos, and run them zero-shot. They pit these against five conventional neural networks, each trained from scratch on a single stock: models like NBEATS, PatchTST, iTransformer, and a newer one called KAN. The playing field is five highly liquid U.S. equities (Apple, Amazon, Google, JPMorgan, and Meta) over about eleven and a half years of daily data. They forecast twenty business days ahead, roughly a trading month.

Crucially, every model gets the same amount of history, a context window of five hundred and twelve days, so context length is not an unfair advantage. They evaluate over ten rolling windows. With five stocks and two ways of representing returns, that gives ten task-level contests in total. The primary metric is mean absolute error, and the statistical test is the standard Diebold-Mariano test, which asks whether one forecaster is genuinely better than another or just lucky.
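The evaluation loop can be sketched as follows. The context length, horizon, and window count come from the description above; everything else is an assumption made for illustration, in particular that the windows are non-overlapping and packed at the end of the sample, and that synthetic noise stands in for real returns. The helper names are hypothetical.

```python
import numpy as np

CONTEXT = 512    # days of history given to every model (from the text)
HORIZON = 20     # business days forecast ahead (from the text)
N_WINDOWS = 10   # rolling evaluation windows (from the text)

def rolling_windows(n_obs, context=CONTEXT, horizon=HORIZON, n_windows=N_WINDOWS):
    """Yield (context_slice, target_slice) pairs, oldest window first.

    Assumption: windows do not overlap and end at the last observation;
    the study's actual spacing is not specified here.
    """
    first_target = n_obs - n_windows * horizon
    if first_target < context:
        raise ValueError("series too short for the requested windows")
    for k in range(n_windows):
        start = first_target + k * horizon
        yield slice(start - context, start), slice(start, start + horizon)

def mae(actual, forecast):
    """Mean absolute error between two equal-length arrays."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(forecast))))

# Synthetic stand-in for about 11.5 years of daily returns (not real data).
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.02, size=2900)

baseline_mae = []
for ctx, tgt in rolling_windows(len(returns)):
    history = returns[ctx]              # all that any model is allowed to see
    zero_forecast = np.zeros(HORIZON)   # random-walk benchmark: no change
    baseline_mae.append(mae(returns[tgt], zero_forecast))

print(len(baseline_mae), round(float(np.mean(baseline_mae)), 4))
```

A real comparison would replace `zero_forecast` with each model's twenty-step forecast computed from `history` alone, then compare the resulting errors window by window.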

The flattering headline, the sobering reality

The results split cleanly. The flattering headline: the pretrained foundation models dominate the rankings. They win eight of the ten task-level contests, with Moirai and TimesFM posting the strongest average ranks. The conventional from-scratch models mostly trail — with one striking exception: the iTransformer wins both Meta tasks, a reminder that local, supervised learning on a single asset can still beat generic pretraining for specific stocks and regimes. Stop at the leaderboard and you would conclude the foundation models had won handily.

But the reality is the part that matters. The gains over that random-walk benchmark are small and sparse. The skill scores are on the order of a thousandth, exactly the scale you would expect when the ceiling is a fraction of a bit. And when the proper statistical test asks whether any model is genuinely, significantly better than the random walk, it comes back positive in only two of the ten cases: Chronos on Amazon, and Moirai on Google. Two out of ten. The same models that dominated the rankings are, for the most part, statistically indistinguishable from forecasting nothing at all.
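For readers who want to see the two quantities side by side, here is a minimal sketch of a skill score and a Diebold-Mariano test against the zero forecast. Both definitions are assumptions: the skill score is taken as one minus the ratio of MAEs, and the test is a textbook version with absolute-error loss, a rectangular-kernel variance, and a normal approximation. The paper may use a different variant, such as a small-sample correction. The data are synthetic.

```python
import math
import numpy as np

def skill_score(actual, model_fc, bench_fc):
    """1 - MAE(model) / MAE(benchmark); positive means the model is better."""
    actual = np.asarray(actual, dtype=float)
    mae_model = np.mean(np.abs(actual - np.asarray(model_fc, dtype=float)))
    mae_bench = np.mean(np.abs(actual - np.asarray(bench_fc, dtype=float)))
    return float(1.0 - mae_model / mae_bench)

def diebold_mariano(actual, model_fc, bench_fc, horizon=1):
    """Return (statistic, two-sided p-value) under absolute-error loss.

    A negative statistic means the model's loss is lower than the
    benchmark's. The long-run variance sums autocovariances up to
    horizon - 1 lags, and the p-value uses a standard normal.
    """
    actual = np.asarray(actual, dtype=float)
    loss_model = np.abs(actual - np.asarray(model_fc, dtype=float))
    loss_bench = np.abs(actual - np.asarray(bench_fc, dtype=float))
    d = loss_model - loss_bench            # loss differential series
    n = d.size
    dc = d - d.mean()
    lrv = np.dot(dc, dc) / n               # lag-0 autocovariance
    for lag in range(1, horizon):
        lrv += 2.0 * np.dot(dc[lag:], dc[:-lag]) / n
    if lrv <= 0:
        raise ValueError("non-positive long-run variance estimate")
    stat = d.mean() / math.sqrt(lrv / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return float(stat), float(p_value)

# Synthetic returns with a tiny predictable component (R-squared near 1%).
rng = np.random.default_rng(1)
signal = rng.normal(0.0, 0.002, size=2000)
returns = signal + rng.normal(0.0, 0.02, size=2000)
zero_fc = np.zeros_like(returns)

print("skill:", skill_score(returns, signal, zero_fc))
print("DM (stat, p):", diebold_mariano(returns, signal, zero_fc))
```

Even a forecaster that knows the entire predictable component has an expected MAE skill of only a few thousandths here, and whether the test flags it depends heavily on sample size. That is the pattern described above: a model can sit ahead of the benchmark on average without the gap being statistically distinguishable from zero.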

They rank first in a photo finish where the gaps are measured in milliseconds.

This is where the statistical-versus-economic distinction lands with full force. Winning the ranking tells you which model placed first. It does not tell you that first place was far enough ahead to matter. The authors are explicit: strong rankings need not imply economically meaningful predictability in noisy markets. A model can top the leaderboard and still carry no edge you could actually trade.

The honest caveats

The authors stress several limits. This is an "as-used" comparison, not a pure test of architecture: the foundation models arrive with the benefit of massive external pretraining, while the from-scratch models learn only from one stock's history. So rank dominance reflects the pretraining prior as much as the model design — you cannot read it as "this architecture is better." It also cannot establish that the models learned anything finance-specific; a generic prior can win simply because it is a tighter fit when local data is scarce.

The study is deliberately small: five large U.S. stocks, ten rolling windows, a roughly seventeen-month test period that happens to span a turbulent, AI-driven market. The metric, mean absolute error, is deliberately not aligned with how the models were trained. And this is a preprint. The economic conclusion is framed largely through the information-ceiling argument rather than a published dollars-and-cents backtest — so the verdict is best read as a reasoned argument from the structure of the problem, not a reported Sharpe ratio.

Why it matters

This reframes what foundation models are for in quantitative finance. The verdict is not "they are useless" — it is far more useful than that. They are genuinely strong, low-cost priors. If you are a practitioner with one return series per asset and you do not want to build and tune a bespoke model for every ticker, an off-the-shelf foundation model run zero-shot will land you near the top of the rankings for almost no engineering effort. That is real value: they reduce model-development cost in low-data settings. What they are not is a magic money machine — not universal engines of statistically reliable alpha or trading performance in realistic conditions. They accelerate the research; they do not manufacture the edge.

So when someone tells you a powerful new model "beats the benchmark" on financial returns, ask two questions. First: by how much, and is that gap statistically real? Here, the dominant models were significantly better than doing nothing in only two of ten tries. Second: even if it is real, is it big enough to survive contact with a noisy market and trading costs? In finance the honest answer is usually no, because the predictable signal is a fraction of a single bit. Telling those two things apart is most of what good quantitative research actually is.