Engineering Explainer

Elo was right by accident — here's the theory it stumbled onto

A sixty-year-old chess-rating trick turns out to be a special case of a clean statistical principle — one that extends naturally to draws, scorelines, and whole-field rankings, with fairness baked in.

Elo is one of the oldest and best-known algorithms in sport and games. It ranks chess players, footballers, tennis players, video gamers and, increasingly, competing AI models. This paper takes that everyday tool and asks a deceptively simple question: what is it, really, underneath? The answer reveals that Elo has been doing the statistically optimal thing all along, by a happy accident of approximation.

A thermostat for skill

Arpad Elo, a physicist, invented the system more than sixty years ago to rank chess players, and the mechanism is elegant enough to hold in your head. Every player has a rating, a single number. After a game, your rating moves by a simple rule: take what actually happened, subtract what was expected to happen, and nudge your rating in that direction by some step size. The expectation comes from the gap between the two players' ratings, run through an S-shaped curve. Beat someone you were heavily expected to beat, and the system shrugs: you gain almost nothing. Pull off an upset against a much stronger player, and your rating jumps while theirs drops hard. It's a thermostat constantly nudging each number toward what the results imply.
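
To make the rule concrete, here is a minimal sketch of the classic update in Python. The step size of 32 and the base-10, 400-point scale are the usual chess conventions, assumed here for illustration rather than taken from the paper, and the function names are made up for this example.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected result for A, from the rating gap run through a logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, result_a: float,
               k: float = 32.0, scale: float = 400.0) -> tuple[float, float]:
    """Return the new (rating_a, rating_b); result_a is 1.0 if A won, 0.0 if A lost."""
    surprise = result_a - expected_score(rating_a, rating_b, scale)  # actual minus expected
    return rating_a + k * surprise, rating_b - k * surprise


favourite, underdog = 1800.0, 1400.0
print(elo_update(favourite, underdog, 1.0))  # expected win: the favourite gains about 3 points
print(elo_update(favourite, underdog, 0.0))  # upset: about 29 points change hands
```

With a 400-point gap the favourite is expected to score about 0.91, which is why the routine win barely registers and the upset moves both numbers sharply.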

It's beautiful, and it works. But in its classic form, Elo only knows how to handle a win or a loss — a binary outcome. Sport is full of richer results. A draw. A margin of victory — winning one-nil is not the same as winning five-nil. A race where ten runners finish in a full ranked order. Over the years, researchers bolted extensions onto Elo to handle each of these: one paper for draws, another for margins, another for multiple players. Each was its own bespoke patch. There was no single, principled framework underneath, and the deeper question — why does the update rule take the exact form it does? — went largely unexamined.

The other kind of score

The key idea comes from a corner of statistics called score-driven models — and there's a confusing collision of words here, which the authors themselves flag. The word score does not mean the sports score, the goals or the points. It's a technical term: the score is the gradient of the log-likelihood. Unpack that without the jargon. Suppose you have a probability model of how a game turns out, given the players' ratings. After you observe an actual result, you can ask: in which direction, and how hard, should I adjust these ratings so my model would have considered this result more likely? That direction-and-amount is the statistical score — the steepest-uphill nudge toward explaining what you just saw. The pun, the authors note, is entirely unintentional.

So here's the central move. Elo updates a rating by "actual minus expected." The score-driven approach replaces that with the statistical score: the optimal nudge under a full probability model of the outcome. Two reasons make it the right thing to use. First, it is, in a precise mathematical sense, the most informative single adjustment you can make, because it points in the direction that most increases the likelihood of the observed result. Second, it uses the entire shape of the outcome's probability distribution, not just its average. That matters because for some outcomes, like a win-draw-loss category or a full ranking, the average that Elo leans on might not even be well-defined. The score still is.
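
The idea of the score as a "steepest-uphill nudge" can be checked numerically. The sketch below assumes a logistic win model on the rating gap (an illustrative choice, as are the function names and the step size of 0.1), differentiates the log-likelihood with respect to one player's rating, and confirms that a small step along the score makes the observed result more likely. For simplicity only player A's rating is moved.

```python
import math


def win_probability(gap: float) -> float:
    """Assumed model: logistic curve on the rating gap (natural-log units)."""
    return 1.0 / (1.0 + math.exp(-gap))


def log_likelihood(rating_a: float, rating_b: float, a_won: bool) -> float:
    """Log-probability the model assigned to the result that actually happened."""
    p = win_probability(rating_a - rating_b)
    return math.log(p if a_won else 1.0 - p)


def score_for_a(rating_a: float, rating_b: float, a_won: bool, eps: float = 1e-6) -> float:
    """Central-difference gradient of the log-likelihood with respect to A's rating."""
    up = log_likelihood(rating_a + eps, rating_b, a_won)
    down = log_likelihood(rating_a - eps, rating_b, a_won)
    return (up - down) / (2.0 * eps)


rating_a, rating_b, a_won = -0.5, 0.5, True   # the underdog A wins
score = score_for_a(rating_a, rating_b, a_won)
step = 0.1                                     # illustrative step size
before = log_likelihood(rating_a, rating_b, a_won)
after = log_likelihood(rating_a + step * score, rating_b, a_won)
print(f"score={score:.4f}  log-likelihood before={before:.4f}  after={after:.4f}")
```

The score is positive because A won, and the log-likelihood after the step is higher than before: the adjusted rating explains the upset better than the old one did.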

A sixty-year-old accident

Now the elegant part. The authors prove that classic Elo is exactly a special case of this framework — and not loosely. The two coincide if and only if you use the logistic function, that particular S-shaped curve, as your link.
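
One way to see where the "if and only if" comes from is a short calculation. This is a sketch in assumed notation, not the authors' own proof: write d for the rating gap, F(d) for the win probability given by the link, f = F' for its derivative, and y for the result (1 for a win, 0 for a loss).

```latex
\begin{align*}
\ell(d) &= y \log F(d) + (1 - y) \log\bigl(1 - F(d)\bigr) \\[4pt]
\frac{\partial \ell}{\partial d}
  &= y\,\frac{f(d)}{F(d)} - (1 - y)\,\frac{f(d)}{1 - F(d)}
   = \frac{f(d)}{F(d)\bigl(1 - F(d)\bigr)}\,\bigl(y - F(d)\bigr) \\[4pt]
\frac{\partial \ell}{\partial d} = c\,\bigl(y - F(d)\bigr) \ \text{for all } d
  &\iff f(d) = c\,F(d)\bigl(1 - F(d)\bigr)
   \iff F(d) = \frac{1}{1 + e^{-c\,(d - d_0)}}
\end{align*}
```

The score is always "actual minus expected" times a weight. That weight is a constant c exactly when F solves the logistic differential equation, and a symmetric link with F(0) = 1/2 fixes d_0 = 0. With a bell-curve link the weight changes with the gap, so the score and the plain Elo update no longer line up.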

Here's the lovely historical twist. Elo originally built his system on the normal distribution, the bell curve. The logistic curve was adopted later, by others, purely because it was easier to compute — a convenient approximation, and for decades that's all it was thought to be. This paper reveals that the logistic choice isn't merely convenient: it is the unique link function that makes Elo's update equal to the optimal statistical score.

Convention stumbled onto something deeply principled without quite knowing it.

Once you see Elo as one instance of a general principle, the generalisation writes itself. Choose a probability distribution describing whatever outcome you care about, take its score, and you have a principled rating update that inherits good properties automatically. The authors walk through four examples:

  • Win or loss. Use the logistic, and you recover Elo exactly.
  • Margin of victory. Model the goal difference with the Skellam distribution — the difference of two random counts — and you get a system that understands scorelines. It has a wonderfully counterintuitive consequence: a heavy favourite can actually lose rating after winning, if they only won by a narrow margin, because the model expected them to win convincingly and a one-goal squeaker disappoints. Plain win-loss Elo cannot express that idea at all.
  • Win-draw-loss. Use an ordered model with an adjustable threshold for how likely draws are.
  • A full ranking of many competitors, such as a marathon or a Formula One race. Use the Plackett-Luce distribution. Here there's a neat bonus: the "expected outcome" ordinary Elo would need isn't available as a clean formula, because it's a sum over every possible ordering of the field, but the score is perfectly computable. The score-driven approach works where the old recipe can't even be written down.
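
Two of these examples are easy to sketch in code. Both functions below rest on assumed parameterisations chosen for illustration, not on the paper's exact definitions. For the margin model, the two sides are assumed to score Poisson goals with rates exp(c + gap/2) and exp(c - gap/2), in which case the Skellam score works out to half of observed margin minus expected margin. For the ranking model, the standard Plackett-Luce form is assumed, with each place filled in turn with probability proportional to exp(rating).

```python
import math


def skellam_score(margin: int, gap: float, log_base_rate: float = 0.0) -> float:
    """Score of the observed margin with respect to the rating gap.

    Assumed rates: exp(log_base_rate + gap / 2) and exp(log_base_rate - gap / 2).
    Under that assumption the gradient of the Skellam log-likelihood simplifies
    to half of (observed margin - expected margin).
    """
    rate_a = math.exp(log_base_rate + gap / 2.0)
    rate_b = math.exp(log_base_rate - gap / 2.0)
    return 0.5 * (margin - (rate_a - rate_b))


def plackett_luce_score(ratings: list[float], finish_order: list[int]) -> list[float]:
    """Gradient of the Plackett-Luce log-likelihood with respect to each rating.

    finish_order lists competitor indices from first place to last.
    """
    score = [0.0] * len(ratings)
    remaining = list(finish_order)
    while len(remaining) > 1:  # the last place is forced and carries no information
        weights = [math.exp(ratings[i]) for i in remaining]
        total = sum(weights)
        for i, w in zip(remaining, weights):
            score[i] -= w / total      # chance this competitor took the current place
        score[remaining[0]] += 1.0     # the competitor who actually took it
        remaining = remaining[1:]
    return score


# A heavy favourite (gap 1.5) who wins by a single goal still gets a negative score.
print(skellam_score(margin=1, gap=1.5))
# Four equally rated runners: the score depends only on where each one finished.
print(plackett_luce_score([0.0, 0.0, 0.0, 0.0], finish_order=[2, 0, 3, 1]))
```

The ranking score is built one finishing place at a time, so it never needs the sum over every possible ordering, and each place contributes changes that cancel across the field.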

Fairness, by construction

The part the authors care about most isn't predictive power — it's fairness. They prove three properties that hold for this whole family by construction.

First, zero expected change before a game. On average, no player can expect their rating to move by playing — whoever they're matched against. The likely small loss against a strong opponent is exactly balanced by the rare large gain. You can't farm rating by careful matchmaking.

Second, the points are conserved. Across all players, the updates sum to zero, so there's no inflation or deflation and the average rating stays put. It behaves like a closed economy with a fixed money supply: every point one player gains, another loses.

Third, the update shrinks as your own rating climbs. A strong player gains less for beating a mediocre opponent than a weak player would for the same win — and the mighty have more to lose. The system is self-correcting, gently pulling toward the truth.
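
All three properties can be checked numerically for the classic logistic case. The sketch below is a spot check on a few example ratings, not a proof for the whole family, and it again assumes the conventional chess step size of 32 and 400-point scale. The function names are illustrative.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Logistic expectation for A given the rating gap."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def rating_change(own: float, opponent: float, result: float, k: float = 32.0) -> float:
    """Classic Elo change for one player: step size times (actual - expected)."""
    return k * (result - expected_score(own, opponent))


def expected_change(own: float, opponent: float) -> float:
    """Average change before the game, weighting each result by its model probability."""
    p_win = expected_score(own, opponent)
    return (p_win * rating_change(own, opponent, 1.0)
            + (1.0 - p_win) * rating_change(own, opponent, 0.0))


# 1. Zero expected change, whoever the opponent is.
for own, opponent in [(1500.0, 1500.0), (1400.0, 1900.0), (2200.0, 1600.0)]:
    print(own, opponent, round(expected_change(own, opponent), 9))

# 2. Conservation: the winner's gain and the loser's loss cancel.
net = rating_change(1400.0, 1900.0, 1.0) + rating_change(1900.0, 1400.0, 0.0)
print("net change:", round(net, 9))

# 3. The gain for beating the same 1500-rated opponent shrinks as your own rating climbs.
gains = [rating_change(own, 1500.0, 1.0) for own in (1300.0, 1500.0, 1700.0, 1900.0)]
print([round(g, 2) for g in gains])
```

Each player's change is computed independently here, so the net of zero is a consequence of the logistic curve's symmetry and is not hard-coded into the update.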

A fourth, subtler property is a reversion tendency: ratings drift to follow players' true, hidden skill over time. The authors are careful here, and so should we be. This is explicitly not mean reversion — they're not claiming ratings snap back to a fixed average. And they establish it only partially; a full proof that ratings converge to true skill remains for future research.

The honest caveats

This paper demands plain caveats. It is purely theoretical — there is no empirical evaluation, no dataset, no head-to-head forecasting contest against Elo. If you're hoping to hear "this system predicts football results five percent better," that claim simply isn't here. The convergence-to-true-skill result is only partial. The framework also requires that your outcome distribution satisfy certain conditions — it has to be nicely shaped, what mathematicians call log-concave, and depend only on the differences between ratings — so not every model will qualify. And the authors deliberately separate two goals often confused: rating, which is about fairness, and forecasting, which is about predictive accuracy. This is a rating paper, fair above all, and it says so.

Why it matters

Three reasons. First, unification: a scattered, ad hoc literature of one-off Elo extensions — this patch for draws, that patch for margins — is shown to be so many instances of a single underlying principle. Second, it's a construction kit: for any novel kind of competitive outcome, you now have a systematic recipe for building a principled, fair rating system rather than improvising. Third, and most distinctively, it bakes fairness in by design — guaranteed conservation, no free rating from matchmaking, self-correction — exactly what you want when ratings are used to rank people and match them against each other.

Sometimes the most valuable research isn't a flashy new benchmark number — it's understanding what we already had. Elo wasn't wrong. It was a glimpse, caught early and by convention, of something more general than its inventor could have proven at the time. This paper transforms it from a clever heuristic into a special case of a clean, extensible theory — one that reaches naturally to draws, to scorelines, to whole-field rankings, and carries provable fairness wherever it goes.