Football's foundation-model moment

One reusable embedding of every on-ball event, learned BERT-style with the player names stripped out — and reused for expected goals, action value, and scouting.

Every professional football match is now logged as a stream of around seventeen hundred on-ball events — a pass, a shot, a tackle — each carrying attributes like pitch location, body part, and outcome. This paper asks whether football analytics is about to have its foundation-model moment: one reusable representation that powers many tasks instead of a bespoke model for each.

Two problems with the status quo

Today almost every analytical question gets its own model with its own hand-crafted features. Expected goals is one model; valuing every action is another; scouting is a third. The result is slow and expensive, and nothing is shared between models.

There's a subtler problem too. How do you feed "action type equals cross" into a model? The traditional answer assigns each action an arbitrary code, such as an integer ID or a one-hot column, and those codes carry no meaning. Under one-hot coding a cross ends up exactly as similar to a through-ball as it is to a tackle, which is absurd; under integer IDs the similarities are simply accidents of numbering. The encoding throws away the meaning before the model even starts.

The fix is an embedding: instead of an arbitrary code, the model learns a vector for each action, arranged so that actions used in similar ways sit near each other. A learned map instead of random locker numbers — similarity becomes geometry.
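To make the contrast concrete, here is a minimal sketch in Python. The three action names and the two-dimensional embedding values are hand-picked by me for illustration and do not come from the paper; a real embedding would be learned and far wider.

```python
import numpy as np

ACTIONS = ["cross", "through_ball", "tackle"]  # illustrative subset of action types


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# One-hot coding: each action gets its own axis, so every pair of
# distinct actions is orthogonal and therefore equally dissimilar.
one_hot = np.eye(len(ACTIONS))

# Hypothetical "learned" embedding (invented values, NOT from the paper):
# the two passing actions point in a similar direction, the tackle does not.
embedding = np.array([
    [0.9, 0.1],    # cross
    [0.8, 0.3],    # through_ball
    [-0.7, 0.6],   # tackle
])


def similarity(table, a, b):
    """Look up two actions in a code table and compare their vectors."""
    i, j = ACTIONS.index(a), ACTIONS.index(b)
    return cosine(table[i], table[j])


for name, table in [("one-hot", one_hot), ("embedding", embedding)]:
    print(name,
          "cross~through_ball:", round(similarity(table, "cross", "through_ball"), 3),
          "cross~tackle:", round(similarity(table, "cross", "tackle"), 3))
```

With the one-hot table both comparisons come out identical; with the embedding table the cross sits closer to the through-ball than to the tackle, which is the "similarity becomes geometry" idea in miniature.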

Pretrain once, reuse everywhere

The ambition is one rich, reusable embedding of any event, and the recipe is borrowed straight from language models. Just as BERT is pretrained by hiding words and predicting them from context, the authors hide one feature of an event and train the model to predict it from the others. No labels are needed: the data predicts itself.
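The masking step can be sketched in a few lines. This is my own simplified illustration, not the authors' code: the feature names, the `[MASK]` placeholder, and the choice to mask uniformly at random are all assumptions, and the sketch only covers categorical features.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token for a hidden feature


def mask_one_feature(event, rng):
    """Hide one randomly chosen feature of an event.

    Returns (masked_event, feature_name, target_value). The pretraining
    task would then be: given masked_event, predict target_value.
    """
    name = rng.choice(sorted(event))   # sorted() keeps the choice reproducible
    masked = dict(event)               # copy so the original event is untouched
    target = masked[name]
    masked[name] = MASK
    return masked, name, target


# A single illustrative event; the field names are assumptions.
event = {
    "action_type": "cross",
    "body_part": "right_foot",
    "pitch_zone": "wide_right",
    "outcome": "success",
}

masked, name, target = mask_one_feature(event, random.Random(0))
print("input :", masked)
print("target:", name, "=", target)
```

Every event yields its own training example this way, which is why no human labels are required.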

The architecture is a TabTransformer, and its key move is where attention operates: across the features within a single event. The body part, the pitch zone, the action type, and the angle to goal all inform each other, so the model recognises that "right foot, plus wide-right position, plus cross" is one coherent pattern. (An earlier approach treated each whole event as a single sealed token and never let its attributes interact.)
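Here is a rough sketch of what "attention across the features within a single event" means mechanically. It is a single attention head with random weights, written in NumPy; a real TabTransformer stacks several heads and layers with learned weights, and the dimensions below are arbitrary choices of mine.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def feature_self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over the feature tokens of ONE event.

    tokens: (n_features, d_model), one embedded vector per feature
            (e.g. body part, pitch zone, action type, angle to goal).
    Returns the contextualised tokens (n_features, d_head) and the
    attention weights (n_features, n_features).
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # every feature scores every other feature
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
n_features, d_model, d_head = 4, 8, 8         # arbitrary illustrative sizes
tokens = rng.normal(size=(n_features, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = feature_self_attention(tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Row i of `weights` says how much feature i draws on each of the other features of the same event, which is the mechanism that lets "right foot", "wide right", and "cross" be read together rather than separately.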

The move I love most is simple: before pretraining, they strip out player and team identities. Leave Messi in the data and the model can cheat — learning "Messi's actions are valuable" instead of what makes an action good. Remove the names and it must learn the qualities of the action itself, which is exactly what lets the representation travel to players and tournaments it never trained on. Judge the shot, not the shooter.
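In code, the step is almost embarrassingly small. The field names below are hypothetical, since I am not quoting the authors' schema; the point is only that the filter runs before the model ever sees the data.

```python
# Hypothetical identity fields; the real column names depend on the data provider.
IDENTITY_FIELDS = {"player_id", "player_name", "team_id", "team_name"}


def strip_identities(event):
    """Return a copy of the event without player or team identifiers."""
    return {k: v for k, v in event.items() if k not in IDENTITY_FIELDS}


raw_event = {
    "player_name": "Some Player",   # placeholder value
    "team_name": "Some Club",       # placeholder value
    "action_type": "shot",
    "body_part": "left_foot",
    "pitch_zone": "central_box",
}

clean_event = strip_identities(raw_event)
print(clean_event)
```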

They pretrain on about 6.4 million events from five major European leagues, turning each event into a 911-dimensional fingerprint, then test transfer to international tournaments the model never saw.

What transfers — honestly

The authors try three downstream jobs and are refreshingly candid about the results.

Expected goals. The embedding beats a like-for-like neural baseline on calibration (Brier 0.092 vs 0.101; lower is better), though StatsBomb's own data-rich model is still better: it trains on far more data and uses extra positional information.
Action valuation (VAEP). Fed into a gradient-boosted classifier, the embeddings beat the task-specific models on calibration — and the authors argue the gain comes from a genuinely better representation, not just more data. But on a ranking metric, AUC, the dedicated models still edge ahead. The superiority is specifically about calibration, not across the board.
Scouting. Build a per-player vector and find nearest neighbours by cosine similarity, and it works beautifully as a sanity check: ask for players most similar to Lewandowski and you get Higuaín, Suárez, Ibrahimović, Benzema — elite forwards, every one, with no position information ever provided. (A caveat: many similarity scores sit suspiciously near a perfect 1.0, an artifact of season-level averaging.)
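The split between calibration and ranking in the action-valuation result is worth making concrete, because the two metrics really do measure different things. The sketch below uses six invented labels and predictions (toy numbers of mine, not results from the paper): a monotone transform keeps the ordering of the predictions, so AUC is unchanged, while the Brier score moves a lot.

```python
import numpy as np


def brier_score(y_true, p):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    y_true, p = np.asarray(y_true, float), np.asarray(p, float)
    return float(np.mean((p - y_true) ** 2))


def auc(y_true, p):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Depends only on the ORDER of the scores.
    """
    y_true, p = np.asarray(y_true), np.asarray(p, float)
    pos, neg = p[y_true == 1], p[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


# Toy data, invented for illustration only.
y = np.array([0, 0, 0, 1, 0, 1])
calibrated = np.array([0.1, 0.1, 0.2, 0.6, 0.3, 0.7])
inflated = calibrated ** 0.25   # same ordering, probabilities pushed towards 1

for name, p in [("calibrated", calibrated), ("inflated", inflated)]:
    print(name, "AUC:", auc(y, p), "Brier:", round(brier_score(y, p), 3))
```

Both prediction sets rank the examples identically, so a model can win on Brier score without winning on AUC, or the reverse. That is why "better calibrated" and "better across the board" are separate claims.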

The honest framing

This is a preprint, and the embedding approach doesn't beat dedicated gradient-boosted models on every metric, notably AUC. The whole pitch rests on transfer learning, yet the authors never test the data-scarce scenario where it should shine most. The pretraining data covers a single season from five leagues. The authors frame the contribution exactly right: representational versatility rather than peak single-task performance, meaning one representation that does many jobs well rather than one model that wins one job.

Why it matters

It's the shift that reshaped natural-language processing, now arriving in sport: pretrain one model on the raw grammar of the game, then reuse its understanding everywhere — for goal probability, for action value, for scouting — adapting cheaply instead of starting over. And by deliberately forgetting who did what, the representation learns something that generalises across players and leagues rather than memorising reputations. Football's BERT moment may not have fully arrived, but you can see its shape from here.