Teaching a language model to actually hear a number

Language models shatter numbers into meaningless fragments before they ever reason about them. Fix that one interface — not the model — and forecasting accuracy jumps.

Large language models are built for words; time-series forecasting is built on numbers. The dream is to fuse them — a forecaster that reads the news headline and the sales figure and reasons over both. But there's a catch buried in how a model reads, and it quietly corrupts every number before any reasoning begins. TempoWave's bet is that fixing this one interface matters more than making the model bigger.

Why put a language model near a forecast at all

The appeal is context. Classical forecasting models see only the numbers — past sales, past prices, nothing else. But the real world is full of text that moves those numbers: a policy change, a news headline, a holiday, a clinical note, an operational log. Language models can read all of that.

A forecaster that fuses the numeric history with the surrounding text ("sales were flat, but here's the press release about the promotion") could reason over both. That is why LLM-based forecasting is such an active research area.

How a number gets shattered at the door

Language models don't read text character by character, or even word by word. They break text into tokens, sub-word chunks produced by a scheme optimised for language. When that machinery hits a number, it fragments it in ways that have nothing to do with the number's value. The paper's own example makes it vivid: the number 2026 gets split into the tokens "20" and "26", with split points tied to common text patterns, not to magnitude.

Sit with why that's so damaging. The entire meaning of a number is its magnitude and ordering — that 2026 is just after 2025, and far from 19. But once it's shattered into "20" and "26," that structure is gone. Two numbers very close in value can be chopped into completely different token sequences, so they look unrelated; two numbers far apart can happen to share a chunk, creating a spurious similarity the model reads as kinship.
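To make the failure concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary TOY_VOCAB and the function toy_tokenize are invented for illustration and do not correspond to any real model's tokenizer; the only detail taken from the paper is that 2026 splits into "20" and "26".

```python
# Toy greedy longest-match tokenizer.
# TOY_VOCAB is a made-up vocabulary, not the vocabulary of any real model.
TOY_VOCAB = {"19", "20", "26", "99", "200"} | set("0123456789")

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary entries, scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

for number in ("1999", "2000", "2026", "26"):
    print(number, toy_tokenize(number))
```

In this toy setup 1999 and 2000 differ by one yet share no tokens, while 2026 and 26 are two thousand apart yet share the token "26". That is the mismatch between token similarity and numeric closeness described above.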

The continuity and ordinal relationships that are the whole point of a number are destroyed before the model begins to reason. The authors argue that this translation layer — between continuous real values and discrete tokens — is the principal bottleneck for LLM forecasting. Not the size of the model. The way numbers get in the door.

A language model can have all the contextual brilliance in the world, but if a number arrives already shattered into meaningless fragments, that brilliance is wasted on a corrupted input.

What TempoWave actually does

The fix has to happen right at the interface, and two concepts make the method legible. An embedding is the vector of numbers a token becomes inside the model — a point in a high-dimensional space where the token's "meaning" lives and the reasoning happens. A wavelet represents a signal at multiple resolutions at once, capturing both fine local detail and broad coarse structure — a wide-angle lens and a macro lens on the same scene. Forecasting fundamentally needs that: local wiggles for the short term, broad trends for the long term.
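The multi-resolution idea can be sketched with the Haar wavelet, the simplest wavelet there is. This is an illustration of wavelets in general, not of TempoWave: the source does not say which wavelet families the paper uses, and the function names haar_step and haar_decompose are made up here.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns pairwise scaled sums (the coarse view) and pairwise scaled
    differences (the fine detail lost in that coarse view).
    """
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal):
    """Full Haar decomposition of a signal whose length is a power of two.

    Returns the single coarsest coefficient and a list of detail levels,
    ordered from finest to coarsest.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx, details = list(signal), []
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx[0], details

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
coarse, details = haar_decompose(signal)
print("coarse trend coefficient:", coarse)
for level, detail in enumerate(details, start=1):
    print(f"detail level {level} ({len(detail)} coefficients):", detail)
```

The first detail level holds the local wiggles between neighbouring points, the later levels hold progressively broader contrasts, and the final coefficient summarises the overall level of the series. Nothing is lost: the transform is orthonormal, so the original signal can be rebuilt from these coefficients.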

TempoWave puts the two together as a plug-and-play interface that intervenes only at the embedding layer, leaving the model's backbone untouched. The mechanism:

Each number is written out as a fixed-precision string of individual digits, and each digit gets its own dedicated token.
Instead of the standard text embedding, TempoWave represents each digit through a set of multi-wavelet, multi-scale coefficients — a structured pattern across resolutions rather than a raw value — and turns that into the embedding vector.
Because there are only ten digits, zero through nine, you compute these ten embeddings once and cache them in a tiny lookup table.
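The three steps above can be sketched as follows. Treat this as a rough illustration under stated assumptions rather than the paper's implementation: the two wavelet functions, the SCALES and CENTERS grids, the way a digit is mapped to a position, and every function name are choices made here for the example.

```python
import math

# Everything below is illustrative. The wavelet families, scales, and centers
# are assumptions, not the configuration used in the paper.

def ricker(t):
    """Ricker ("Mexican hat") wavelet, unnormalised."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def morlet(t):
    """Real-valued Morlet-style wavelet, unnormalised."""
    return math.cos(5.0 * t) * math.exp(-t * t / 2.0)

WAVELETS = (ricker, morlet)
SCALES = (0.125, 0.25, 0.5, 1.0)
CENTERS = (0.0, 0.5, 1.0)

def digit_embedding(digit):
    """Describe one digit by its response to every wavelet, scale, and center."""
    x = digit / 9.0  # place the digit on [0, 1]
    return [psi((x - c) / s) for psi in WAVELETS for s in SCALES for c in CENTERS]

# Step 3: only ten digits exist, so the table is computed once and reused.
DIGIT_TABLE = [digit_embedding(d) for d in range(10)]

def number_to_digits(value, int_digits=4, frac_digits=2):
    """Step 1: write a non-negative number as a fixed-precision digit sequence."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    width = int_digits + frac_digits + (1 if frac_digits else 0)
    text = f"{value:0{width}.{frac_digits}f}"
    digits = [int(ch) for ch in text if ch != "."]
    if len(digits) != int_digits + frac_digits:
        raise ValueError("value does not fit the fixed width")
    return digits

def embed_number(value, int_digits=4, frac_digits=2):
    """Step 2: look up one wavelet-based vector per digit."""
    return [DIGIT_TABLE[d] for d in number_to_digits(value, int_digits, frac_digits)]

vectors = embed_number(20.26)
print(number_to_digits(20.26), len(vectors), len(vectors[0]))
```

With these defaults, 20.26 becomes the digits 0, 0, 2, 0, 2, 6, and each digit is a plain lookup into the ten-row table. A real system would also have to handle signs, the decimal point, and a projection to the model's hidden size; the source does not describe those details, so they are left out here.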

It is a remarkably lightweight intervention: a custom adapter that rewires only the numeric pins of the connector.

A subtle bonus deep inside the network

Because each digit is encoded by its pattern across scales rather than a raw magnitude, the normalisation layers inside the transformer — which can otherwise wash out or distort raw numeric values — can't easily collapse the distinctions between digits. The numeric identity survives the journey through the network, so the model can still tell its digits apart deep inside, where naive encoding might lose them.
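A toy experiment shows why normalisation can erase a raw magnitude. It assumes a plain layer normalisation with no learned gain or bias and two invented encodings: magnitude_embedding, which scales one fixed direction by the digit's value, and pattern_embedding, which uses a few wavelet responses. Neither is taken from the paper, and real transformer layers are more involved than this.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Layer normalisation without learned gain or bias."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

DIRECTION = [0.5, -1.0, 2.0, 0.25]  # arbitrary fixed direction (assumption)

def magnitude_embedding(digit):
    """Naive encoding: the digit's value only scales a fixed vector."""
    return [digit * u for u in DIRECTION]

def ricker(t):
    """Ricker ("Mexican hat") wavelet, unnormalised."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def pattern_embedding(digit):
    """Pattern encoding: wavelet responses at two scales and two centers."""
    x = digit / 9.0
    return [ricker((x - c) / s) for s in (0.25, 1.0) for c in (0.0, 1.0)]

for a, b in [(2, 7), (4, 5), (1, 9)]:
    naive = max_abs_diff(layer_norm(magnitude_embedding(a)),
                         layer_norm(magnitude_embedding(b)))
    pattern = max_abs_diff(layer_norm(pattern_embedding(a)),
                           layer_norm(pattern_embedding(b)))
    print(f"digits {a} and {b}: naive gap {naive:.6f}, pattern gap {pattern:.3f}")
```

Normalisation divides out the overall scale, so under the naive encoding every non-zero digit lands on almost exactly the same vector. The pattern encoding differs between digits in shape rather than in scale, and the digit pairs printed here stay clearly separated.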

The numbers on the numbers

The authors test on five forecasting datasets that pair numbers with context (Bitcoin with news, an Australian news-paired set, solar power, Paris traffic, and London electricity usage), using a modestly sized open language model as the backbone. Across the ten reported metrics (two error measures on each dataset), TempoWave sets a new state of the art on seven and lands in the top two on all ten. On average it improves mean absolute error by about seven percent, with the largest gains around fourteen percent on the electricity data and eleven percent on the Australian set. The most dramatic comparison is against an earlier numeric-interface method based on Fourier features: on the Bitcoin task, TempoWave roughly halves the error, cutting one measure from about 1.71 down to 0.80. For a change that touches only the embedding layer, those are substantial gains.
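For readers who want to check the arithmetic, a short sketch of mean absolute error and of the relative reduction behind "roughly halves". The two Bitcoin figures are the ones quoted above; the function names and the tiny forecast example are made up for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the forecast miss, ignoring direction."""
    if len(actual) != len(predicted):
        raise ValueError("series must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def relative_reduction(baseline_error, new_error):
    """Fraction by which the error shrinks relative to the baseline."""
    return (baseline_error - new_error) / baseline_error

# Made-up series, only to show how the metric is computed.
print(mean_absolute_error([10.0, 12.0, 11.0], [9.0, 12.5, 13.0]))

# Bitcoin figures quoted in the text: about 1.71 down to 0.80.
print(f"{relative_reduction(1.71, 0.80):.1%}")
```

Going from about 1.71 to 0.80 is a reduction of a little over half, which matches the description in the text.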

The honest caveats

The authors are reasonably candid. The improvements are more consistent on mean absolute error than on the error measure that punishes large outliers, where TempoWave wins on only three of five datasets. A meaningful share of the headline performance comes from the added contextual features — statistical descriptors and situational context — rather than the embedding alone, and the paper itself shows performance dropping when those are stripped away.
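The gap between the two kinds of metric is easy to see on a toy example. The source does not name the outlier-sensitive measure; mean squared error is used here as an assumption because it is a common choice, and the forecasts are invented.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    # Squaring makes one large miss count far more than several small ones.
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [10.0, 10.0, 10.0, 10.0]
steady_misses = [11.0, 11.0, 11.0, 11.0]  # off by 1 everywhere
one_big_miss = [10.0, 10.0, 10.0, 14.0]   # perfect except one miss of 4

for name, forecast in [("steady", steady_misses), ("one big miss", one_big_miss)]:
    print(name,
          "MAE:", mean_absolute_error(actual, forecast),
          "MSE:", mean_squared_error(actual, forecast))
```

Both forecasts have the same mean absolute error, but the one with a single large miss has four times the mean squared error. A method can therefore look stronger on one metric than on the other, depending on how its errors are distributed.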

The paper reports results for a single backbone model, so how the method scales to larger LLMs is unknown, and the five datasets are relatively small. The authors flag as open whether the method helps pipelines that don't use rich context. And as with several recent preprints, the document carries some forward-dated and likely OCR-mangled artefacts, so fine details deserve a loose grip.

Why it matters

This reframes where the effort should go in LLM-based forecasting. The instinct is always to reach for a bigger model or a longer prompt. This paper says the bottleneck might instead be the humble numeric interface, and that principled multi-resolution digit embeddings may do more than scaling up.

Because it's plug-and-play, with the backbone unchanged and a ten-entry cached table, it's a low-friction upgrade for systems that are now becoming common: forecasting agents and analyst copilots that must fuse text context with precise numbers, in finance, operations, and healthcare.

The frontier of usefulness often isn't the headline model — it's the unglamorous interface around it. Here it lives in the plumbing between words and numbers. Teaching a model to actually hear a number — its magnitude, its order, its structure across scales — may be worth more than making the model bigger, and may be one of the most important small problems in the field.