Engineering Explainer

In trading, a slow model is a useless model

In fast markets a mid-price signal decays in microseconds, so latency is part of the strategy — not an implementation detail. "Best model" is meaningless without a time budget.

There's a point anyone building models for fast markets feels in their bones but the academic literature often ignores: a more accurate model that's slower can be completely worthless. This paper makes that point rigorously — and there are really two findings, both sharp.

Why latency is the strategy

A limit order book is the live ledger of every resting buy and sell order at each price — the microscopic machinery of a market. Predicting it usually means predicting the next move of the mid-price: up, down, or flat over a very short horizon.
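To make the prediction target concrete, here is a minimal sketch of turning top-of-book quotes into a mid-price and then into up/down/flat labels. The function names, the `horizon` and `threshold` parameters, and the quote values are all illustrative assumptions, not the paper's labelling scheme; real benchmarks often smooth the future mid-price before comparing.

```python
from typing import List


def mid_price(best_bid: float, best_ask: float) -> float:
    """Mid-price: the midpoint between the best bid and the best ask."""
    return (best_bid + best_ask) / 2.0


def label_moves(mids: List[float], horizon: int, threshold: float) -> List[str]:
    """Label each step by the relative mid-price change `horizon` steps ahead.

    Simplifying assumption: compare against a single future mid-price
    rather than a smoothed average of future prices.
    """
    labels = []
    for t in range(len(mids) - horizon):
        change = (mids[t + horizon] - mids[t]) / mids[t]
        if change > threshold:
            labels.append("up")
        elif change < -threshold:
            labels.append("down")
        else:
            labels.append("flat")
    return labels


# Made-up (best_bid, best_ask) snapshots, for illustration only.
quotes = [(100.00, 100.02), (100.01, 100.03), (100.01, 100.03), (99.98, 100.00)]
mids = [mid_price(bid, ask) for bid, ask in quotes]
labels = label_moves(mids, horizon=1, threshold=5e-5)
print(labels)
```

The threshold decides how large a move has to be before it stops counting as "flat"; the value here is arbitrary.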

The signal decays incredibly fast. A prediction about the next move can lose its value in microseconds as the book updates and other traders act. So latency — how long your model takes to produce a prediction — isn't something you optimise later. A model that's slightly more accurate but takes twice as long can produce signals that are already stale by the time you can act. "Best model" is a meaningless phrase here without saying "best under what time budget."
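One way to read "best under what time budget" is as a constrained selection: discard every model that misses the deadline, then take the most accurate of what is left. The sketch below assumes hypothetical model names and made-up accuracy and latency figures purely to show the mechanics.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    accuracy: float    # higher is better
    latency_us: float  # measured wall-clock time per prediction, in microseconds


def best_under_budget(candidates: List[Candidate], budget_us: float) -> Optional[Candidate]:
    """Return the most accurate candidate whose latency fits the budget, if any."""
    feasible = [c for c in candidates if c.latency_us <= budget_us]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.accuracy)


# Hypothetical numbers, not taken from the paper.
candidates = [
    Candidate("small_tree", accuracy=0.61, latency_us=40.0),
    Candidate("boosted", accuracy=0.66, latency_us=900.0),
    Candidate("deep_net", accuracy=0.70, latency_us=2500.0),
]

for budget in (10.0, 100.0, 1000.0, 5000.0):
    winner = best_under_budget(candidates, budget)
    print(budget, winner.name if winner else "no model fits")
```

The "best" model changes with the budget, and at a tight enough budget no model qualifies at all.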

Does a compute frontier appear?

The paper borrows an idea from large-model research: scaling laws, where loss falls predictably as you add compute. The first question is whether an analogous frontier appears in order-book prediction.

The test is clever. The author measures each model's "structural forward work" (roughly, the computation per prediction) across a zoo of cheap, non-neural models: trees, gradient boosting, and random-convolution methods. A frontier is fit to those cheap models only, and the question is whether it predicts the performance of expensive neural models deliberately held out of the fit. It does, with an R² around 0.94, even though the held-out models sit dozens to thousands of times beyond the compute range the frontier was fit on. So there's a stable, predictable inference-compute frontier: spend this much computation, expect roughly this much accuracy, across very different model families.
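The fit-on-cheap, test-on-expensive procedure can be sketched in a few lines. Everything specific here is an assumption: the frontier is taken to be a straight line in log10(work), which may not be the paper's functional form, and the (work, loss) pairs are synthetic values chosen only to make the example run.

```python
import math
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys: List[float], preds: List[float]) -> float:
    """Coefficient of determination of predictions against observed values."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic (work, loss) pairs, NOT the paper's data.
cheap_models = [(1e2, 0.95), (1e3, 0.91), (1e4, 0.86), (1e5, 0.82)]
held_out_models = [(1e7, 0.74), (1e8, 0.68), (1e9, 0.65)]

# Fit the frontier on the cheap models only.
slope, intercept = fit_line(
    [math.log10(work) for work, _ in cheap_models],
    [loss for _, loss in cheap_models],
)

# Score the extrapolation on models that were never part of the fit.
held_out_preds = [slope * math.log10(work) + intercept for work, _ in held_out_models]
r2 = r_squared([loss for _, loss in held_out_models], held_out_preds)
print(f"slope={slope:.3f} intercept={intercept:.3f} held-out R^2={r2:.3f}")
```

The key design point is that R² is computed only on the held-out models, so a high value reflects extrapolation rather than goodness of fit on the training points.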

But latency is not compute

Then the author redoes the analysis using measured wall-clock latency instead of that abstract compute count, and the neat frontier falls apart: R² drops to about 0.47. Latency is not just noisy compute. Models reorder: the ranking by work count is not the ranking by measured latency.

The examples are vivid. A gradient-boosting model had a low work count but a high real latency of over four milliseconds. A CatBoost model had far more structural work on paper but ran in under three hundred microseconds. A neural network had tens of thousands of times more structural work than the gradient-boosting model, yet lower actual latency on the same CPU. Where the computation sits — whether it's arranged into operations the hardware runs efficiently — matters as much as how much there is. You cannot read the clock off the recipe card. (There was even a "dominated" model: more compute, no lower loss. Bigger was strictly wasteful.)
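The gap between counted work and measured time is easy to reproduce at toy scale. The sketch below is an illustrative timing harness, not the paper's measurement protocol: `measure_latency_ns`, the warm-up and repeat counts, and the two workloads are assumptions. Both workloads perform the same number of additions, but one runs them in an interpreted loop and the other in a single built-in call.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ns(predict: Callable[[], object], warmup: int = 20, repeats: int = 200) -> float:
    """Median wall-clock time of one call, in nanoseconds, after a warm-up phase."""
    for _ in range(warmup):
        predict()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        predict()
        samples.append(time.perf_counter_ns() - start)
    return statistics.median(samples)


def loop_sum(values: List[int]) -> int:
    """Adds the values one by one in an interpreted loop."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def builtin_sum(values: List[int]) -> int:
    """Same additions, performed inside one built-in call."""
    return sum(values)


data = list(range(10_000))
loop_ns = measure_latency_ns(lambda: loop_sum(data))
builtin_ns = measure_latency_ns(lambda: builtin_sum(data))
print(f"interpreted loop: {loop_ns:.0f} ns, built-in sum: {builtin_ns:.0f} ns")
```

The exact timings depend on the machine and interpreter, which is the point: an operation count alone does not determine them.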

FastBiNLOB

The constructive half is FastBiNLOB, an architecture designed around the insight that capacity and latency are separate things to optimise. It deliberately avoids the attention mechanism that powers transformers — flexible but expensive to serve — and instead uses dense, hardware-friendly mixing operations that achieve a global view through plain, fast matrix operations. It keeps the useful computation and strips out the parts that are slow to run.
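To illustrate how plain matrix operations can give every position a global view, here is a sketch in the spirit of MLP-Mixer-style mixing. It is not FastBiNLOB's actual layer design, which the text above does not spell out: the shapes, the weight values, and the `mix` function are assumptions, and nonlinearities and normalisation are left out for brevity.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mix(x: Matrix, w_time: Matrix, w_feat: Matrix) -> Matrix:
    """Mix across time steps, then across features, with fixed dense weights.

    x is (T x F), w_time is (T x T), w_feat is (F x F). Unlike attention,
    the mixing weights do not depend on the input, so the whole layer is
    two dense matrix products.
    """
    mixed_time = matmul(w_time, x)     # each time step sees every time step
    return matmul(mixed_time, w_feat)  # each feature sees every feature


# Illustrative weights: T = 3 time steps, F = 2 features.
w_time = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]
w_feat = [[1.0, 0.5], [-0.5, 1.0]]

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x_perturbed = [[2.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # only the first time step changes

out = mix(x, w_time, w_feat)
out_perturbed = mix(x_perturbed, w_time, w_feat)

# Every output row moves, so each position depends on the whole input window.
rows_changed = [out[i] != out_perturbed[i] for i in range(len(out))]
print(rows_changed)
```

Attention would instead compute the time-mixing weights from the input on every call, which is where much of its flexibility and serving cost comes from.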

On the standard benchmark it matches or beats the published state-of-the-art accuracy on two of the key prediction horizons while running substantially faster: it matches the best accuracy at around twenty-three percent lower latency on the shortest horizon, and beats it at around sixty percent lower latency on a longer one.

The honest caveats

The author explicitly does not claim universal state-of-the-art: FastBiNLOB underperforms on two of the middle horizons. The whole study uses one benchmark on a single machine, a laptop, where latencies run from microseconds to milliseconds, not the sub-microsecond regime of real ultra-low-latency trading. The structural-work measure is a reproducible convention, not a true hardware instruction count. Treat the specific numbers as evidence from one careful study, not settled fact.

Why it matters

A weather forecast that's slightly more accurate but arrives an hour after the rain is worthless, because the moment to act has passed. A slow trading model is in that position every microsecond. The broader lesson is that deployment constraints are first-class model properties, not footnotes: model capacity and serving latency should be optimised separately, and the right design question isn't only "how much computation," but "where does the computation sit."