The forecast is not the point: machine learning on the yield curve

Neural networks can beat the decades-old workhorses of bond forecasting — but this study insists on the harder test, asking whether a better forecast actually makes a better trade.

Most machine-learning finance papers report a lower error metric and stop. This one refuses to. It takes neural networks into one of the largest and most consequential corners of finance — the government bond market — and asks not only whether they forecast the yield curve more accurately than the decades-old incumbents, but whether that accuracy actually translates into money.

Why the bond market deserves the scrutiny

The scale frames why this matters. As of early 2025, U.S. government debt stands at around thirty-seven trillion dollars; European Union government debt is roughly thirteen trillion euros. Pension funds and insurers hold enormous quantities of these bonds, and interest-rate movements ripple through the valuation of essentially every financial asset. Forecasting where bond yields are headed is not an academic game — it is central to how trillions of dollars are managed.

The object being forecast is the yield curve, also called the term structure. A government issues bonds of many maturities — three months, two years, ten years — and each carries a yield. Plot yield against maturity and you get a curve. This paper works specifically with zero-coupon bonds, which pay no interim coupons, just a single lump at maturity, so each point is a clean statement: the rate the government promises for locking your money away for exactly that long. The whole curve moves and reshapes as expectations about rates, inflation, and growth shift.
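To make that "clean statement" concrete, here is a minimal sketch of how a zero-coupon yield and a price map onto each other. It assumes continuous compounding and a face value of 100. The function names and the sample yields are illustrative and do not come from the paper.

```python
import math

def zero_coupon_price(yield_rate, maturity_years, face=100.0):
    """Price today of a bond that pays `face` once, at maturity.

    Assumes continuous compounding (an illustrative convention).
    """
    return face * math.exp(-yield_rate * maturity_years)

def zero_coupon_yield(price, maturity_years, face=100.0):
    """Invert the pricing formula to recover the yield from a price."""
    return -math.log(price / face) / maturity_years

# Hypothetical yields for the three maturities named in the text.
sample_curve = {0.25: 0.043, 2.0: 0.040, 10.0: 0.045}
for maturity, y in sample_curve.items():
    price = zero_coupon_price(y, maturity)
    print(f"{maturity:>5} years: yield {y:.2%}, price {price:.2f}")
```

Because there are no interim coupons, one yield and one maturity pin down the price completely, which is what makes each point on the curve easy to interpret.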

Duration: the lever the strategy pulls

Duration measures how sensitive a bond portfolio's price is to a change in interest rates. A long-duration portfolio swings hard when rates move; a short-duration one barely budges. The core principle of active bond management is simple to state:

Increase duration when you expect yields to fall — because falling yields mean rising bond prices, and you want maximum exposure to that gain.
Decrease duration when you expect yields to rise, to protect yourself.

A good forecast of the yield curve feeds directly into one decision: how much duration to hold. That is duration management.
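The arithmetic behind that rule can be sketched with the standard first-order approximation: the percentage price change is roughly minus duration times the change in yield. This ignores convexity and is not taken from the paper. The function name and the sample durations and yield moves are assumptions for illustration.

```python
def approx_return(duration_years, yield_change):
    """First-order estimate of the price return for a given yield move.

    yield_change is in decimal terms (0.005 means half a percentage point).
    Convexity is ignored, so this is only a rough guide for small moves.
    """
    return -duration_years * yield_change

# Hypothetical portfolios and yield moves, for illustration only.
for duration in (2.0, 5.0, 8.0):
    for move in (-0.005, 0.005):
        estimate = approx_return(duration, move)
        print(f"duration {duration:.0f}y, yield move {move:+.3%}: "
              f"approx return {estimate:+.2%}")
```

The longer-duration portfolio gains the most when yields fall and loses the most when they rise, which is why the forecast feeds straight into the duration choice.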

The incumbents this paper challenges are the established tools of curve forecasting. The most famous is the Nelson-Siegel model, together with its dynamic version, which compresses the entire curve into three interpretable numbers: a level factor, a slope factor, and a curvature factor. Level shifts the curve up or down; slope tilts the short end against the long end; curvature bends the middle. The paper also tests an arbitrage-free version and a purely statistical approach using principal component analysis.
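For readers who want to see the three factors at work, the following is a minimal sketch of the Nelson-Siegel curve in its common Diebold-Li parameterisation. The decay value, the factor values, and the function name are assumptions chosen for illustration; the article does not state the paper's exact specification or estimation procedure.

```python
import math

def nelson_siegel_yield(maturity_years, level, slope, curvature, decay=0.5):
    """Yield at one maturity implied by three factors.

    `decay` controls where the curvature loading peaks; 0.5 is an
    arbitrary illustrative value, not one taken from the paper.
    """
    x = decay * maturity_years
    slope_loading = -math.expm1(-x) / x          # (1 - e^-x) / x
    curvature_loading = slope_loading - math.exp(-x)
    return level + slope * slope_loading + curvature * curvature_loading

# Hypothetical factors: 4% long-run level, short end 1 point lower, mild hump.
factors = dict(level=0.04, slope=-0.01, curvature=0.005)
for maturity in (0.25, 2.0, 10.0):
    y = nelson_siegel_yield(maturity, **factors)
    print(f"{maturity:>5} years: {y:.3%}")
```

In this form the yield approaches the level factor at very long maturities and level plus slope at very short ones, while the curvature factor has its largest effect in between.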

The harder question: accuracy or value?

Here is the methodological heart. How do you judge whether a forecast is any good? The obvious answer is statistical accuracy — how close the predicted yields are to the actual ones, measured by error metrics like RMSE. But the authors insist that is not enough, and they have a striking piece of prior evidence. In an earlier study, one model — a classical ARMA model — achieved the lowest forecast error of all, and yet a neural-network-based strategy generated the highest risk-adjusted returns. The most accurate forecaster did not make the most money.

Statistical superiority does not always translate to economic value.

So this paper evaluates every model two ways at once: statistical accuracy, and economic relevance — by actually running a bond trading strategy on the forecasts and measuring the portfolio's performance.
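A toy illustration shows how the two yardsticks can disagree. All numbers below are invented, and the directional hit rate is a simple stand-in for economic usefulness, not the paper's own evaluation procedure.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def direction_hit_rate(predicted, actual):
    """Share of periods in which the forecast got the sign of the move right."""
    hits = sum((p > 0) == (a > 0) for p, a in zip(predicted, actual))
    return hits / len(actual)

# Hypothetical weekly yield changes, in percentage points.
actual   = [0.10, -0.10, 0.10, -0.10]
cautious = [-0.02, 0.02, -0.02, 0.02]   # small errors, wrong direction
bold     = [0.30, -0.30, 0.30, -0.30]   # large errors, right direction

for name, forecast in (("cautious", cautious), ("bold", bold)):
    print(name, round(rmse(forecast, actual), 3),
          direction_hit_rate(forecast, actual))
```

The cautious forecaster wins on RMSE yet points the wrong way every time, while the bold one has the larger error but would have steered a duration decision correctly in every period.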

The setup is substantial. The authors forecast seven maturities, from three months to ten years, using weekly data. For the United States, the data runs from 1987 to early 2025; for Europe, they use European Central Bank triple-A government yields, extending the series back to 1992 by carefully using German government bonds as a proxy for the years before the official series began. They test both traditional models and neural networks, and two forecasting approaches: forecasting the three compressed factors and rebuilding the curve from them, or forecasting the individual rates directly. Some models are also fed macroeconomic data (measures of inflation and real economic activity), sometimes compressed through a neural network called an autoencoder. In total they build and test forty-three different models, then filter them down through a procedure that demands both statistical and economic quality.

What survived the filter

After all that filtering, the models that survive — three for the U.S. market, three for Europe — are, every single one, neural networks. Not one traditional model makes the final cut. The paper's blunt summary is that neural networks consistently outperform traditional models in both forecasting accuracy and portfolio performance.

The winning recipes differ by market, which is itself revealing. For the U.S., the best model forecasts rates directly, using Nelson-Siegel factors plus macroeconomic features compressed through an autoencoder, which suggests the U.S. curve genuinely responds to macro data. For Europe, the best model is simpler: it forecasts factors from principal components, with no macroeconomic inputs at all, suggesting European yields were less sensitive to macro fluctuations over this period.

Then the economic test. Every selected model beat a passive benchmark — a portfolio that just holds a constant five-year duration and does nothing clever. The strategy works by tilting duration up or down according to each weekly forecast, within sensible bounds. The models made better duration decisions than the benchmark in a majority of periods — about fifty-three percent of the time for the top U.S. model, and about fifty-eight percent for the top European one. That may not sound dramatic, but in markets, a consistent edge a few points above a coin flip, applied steadily, is exactly what active management is chasing.
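A stripped-down version of such a tilt rule might look like the sketch below. The five-year benchmark comes from the description above. The tilt size, the bounds, the function names, and the sample data are all hypothetical. Counting a "better decision" as a period with positive return relative to the benchmark is also an assumption, since the article does not spell out the paper's exact rule.

```python
BENCHMARK_DURATION = 5.0  # passive benchmark: constant five-year duration

def target_duration(forecast_yield_change, tilt=1.0, lower=3.0, upper=7.0):
    """Lengthen duration if yields are forecast to fall, shorten if to rise.

    tilt, lower and upper are hypothetical values, not the paper's.
    """
    if forecast_yield_change < 0:
        raw = BENCHMARK_DURATION + tilt
    elif forecast_yield_change > 0:
        raw = BENCHMARK_DURATION - tilt
    else:
        raw = BENCHMARK_DURATION
    return min(max(raw, lower), upper)

def active_returns(forecasts, realized_changes):
    """Per-period return relative to the benchmark (first-order approximation)."""
    return [-(target_duration(f) - BENCHMARK_DURATION) * dy
            for f, dy in zip(forecasts, realized_changes)]

def better_decision_share(active):
    """Share of tilted periods in which the tilt beat the benchmark."""
    decided = [a for a in active if a != 0]
    return sum(a > 0 for a in decided) / len(decided) if decided else 0.0

# Hypothetical weekly forecasts and realized yield changes (decimal terms).
forecasts = [-0.001, 0.002, -0.001, 0.001]
realized  = [-0.002, 0.001, 0.001, 0.003]
share = better_decision_share(active_returns(forecasts, realized))
print(f"better than benchmark in {share:.0%} of periods")
```

The bounds keep any single forecast from pushing the portfolio to an extreme position, so the edge has to come from being right slightly more often than wrong over many weeks.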

The honest caveats

Precision matters here. The authors measure economic performance using the Information Ratio, the Omega Ratio, and maximum drawdown, and all selected models posted a positive Information Ratio, meaning they added value over the benchmark per unit of risk taken. But the familiar Sharpe ratio is invoked only when describing the earlier study, not reported for this paper's own strategies, and the paper gives no single clean headline return percentage. The honest framing: the neural strategies beat the passive benchmark on risk-adjusted terms and made better duration calls more often than not, but no specific return or Sharpe figure can be quoted, because the paper does not report one.
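For orientation, the three measures can be sketched using their textbook definitions. The paper's exact conventions (annualisation, Omega threshold, return frequency) are not given in this article, so the choices below are assumptions, as are the function names and the sample returns.

```python
import statistics

def information_ratio(active_returns):
    """Mean return over the benchmark divided by its volatility (not annualised)."""
    tracking_error = statistics.stdev(active_returns)
    if tracking_error == 0:
        raise ValueError("tracking error is zero; ratio is undefined")
    return statistics.mean(active_returns) / tracking_error

def omega_ratio(returns, threshold=0.0):
    """Total gains above the threshold divided by total shortfalls below it."""
    gains = sum(max(r - threshold, 0.0) for r in returns)
    losses = sum(max(threshold - r, 0.0) for r in returns)
    return float("inf") if losses == 0 else gains / losses

def max_drawdown(returns):
    """Largest peak-to-trough fall in compounded portfolio value, as a fraction."""
    value = peak = 1.0
    worst = 0.0
    for r in returns:
        value *= 1.0 + r
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

# Hypothetical per-period returns relative to the benchmark.
sample = [0.002, 0.001, -0.001, 0.003]
print(information_ratio(sample), omega_ratio(sample), max_drawdown(sample))
```

A positive Information Ratio says only that the average return over the benchmark was above zero relative to its variability. It does not by itself say how large the return was, which is why the missing headline figure matters.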

There are further caveats, in fairness. The European data required backcasting through a German proxy, which introduces its own modelling assumptions, especially for the very short maturities in the early 1990s. The evaluation is a single out-of-sample window of roughly ten years for each market. Transaction costs, which can quietly destroy a trading strategy, are not part of the reported picture. The strategy also lives within preset duration bounds, so the results are conditional on those industry-standard limits. Finally, the paper is a preprint.

Why it matters

Fixed income is a serious, enormous, under-glamorised commercial domain, and this is exactly the kind of evaluation it deserves. The paper could have stopped at "neural networks have lower forecast error than Nelson-Siegel" and called it a win. Instead it pushed all the way through to the portfolio — because a model that nails the yield curve but cannot improve a duration decision is, commercially, worthless, while a slightly less accurate model that consistently tilts you the right way is worth real money. The value of a forecast is not the accuracy number on a slide; it is whether it changes a decision for the better. In bond markets, as everywhere, the forecast is only the means. The decision is the point.