Engineering Explainer

When something is always breaking

Three teams — Cloudflare, Netflix and Airbnb — on three faces of staying reliable at scale: how to undo a half-finished job, schedule a flood of work fairly, and push changes to thousands of servers without breaking any.

Keeping enormous distributed systems from quietly falling over is one of the least glamorous and most important jobs in all of software. When thousands of machines are doing thousands of things at once, failure isn't an edge case — it's a constant. Three recent posts, from Cloudflare, Netflix and Airbnb, tackle three different facets of staying reliable at that scale: how to undo a half-finished job, how to schedule a flood of work fairly, and how to push a change to thousands of servers without breaking any of them. Read together, they share an instinct worth naming.

Cloudflare: the art of the undo

Cloudflare added a feature called saga rollbacks to Cloudflare Workflows, its system for building durable, multi-step processes — programmes where each step might call an external system, retry on failure, and remember where it got to. The classic example is moving money between two banks. Step one: debit Bank A. Step two: credit Bank B. Step three: email both account holders. So what happens if step two fails after step one has already succeeded? The money has left Bank A but never arrived at Bank B. You can't pretend step one didn't happen — the debit is real, in another company's system. You need a brand-new action that semantically reverses it: a refund.

This is an old, well-known idea in distributed systems called the saga pattern, and the heart of it is pairing every action with a compensating action — its specific undo. Cloudflare baked that into the programming model: when you define a step, you can attach a rollback function, and if the overall workflow fails, the system automatically runs the rollback handlers in reverse order, walking back up the staircase.

A couple of details are subtle enough to be worth knowing. First, these rollbacks have to be idempotent — safe to run more than once — because in a distributed system the undo itself might get retried. Second, a step can fail after it has already caused a side effect but before it managed to record its result, so a rollback handler has to cope with not even knowing what the original step returned. And here is what makes this more powerful than an ordinary error-handling block: a normal catch block only knows what's still in memory right now, whereas Cloudflare's system keeps a durable, persisted history of every step — what ran, what finished, what it returned — so even if the whole engine crashes and restarts mid-failure, it can reconstruct exactly what needs undoing and finish the job.

A normal catch block only knows what's still in memory right now. A durable, persisted history of every step survives the engine crashing mid-failure — and that is the difference between clever and reliable.

Netflix: stop building, start adopting

Back in 2018, Netflix built its own system for managing batch compute — all the big offline number-crunching jobs that aren't serving you a video in real time. They called it CMB. It worked. But over the years the open-source world caught up and then overtook it: the Kubernetes ecosystem grew the very features CMB had been built to provide. Because CMB was so far removed from the underlying cluster, adding new capabilities had become painful. The one they really wanted, and couldn't easily build, was preemption — the ability to pause a lower-priority job to make room for an urgent one. In old CMB, once a job started, it ran to completion, full stop.

So Netflix adopted an open-source system called Kueue — a job-queueing system for batch work on Kubernetes — and folded it into their container platform. A neat reason they picked Kueue over the alternatives: it doesn't rip out and replace Kubernetes' own scheduler; it sits on top and cooperates with it, so Netflix kept all their existing placement logic. They mapped their old concepts onto Kueue's, and for the teams using it the switch was as simple as clicking a button, with easy rollback if anything went wrong. The payoff is preemption-based fair sharing: teams can lend out idle reserved capacity but reclaim it when they need it, and high-priority jobs can now bump lower-priority ones out of the way.

The scale and the discipline are the memorable part. Kueue now manages millions of batch workloads at Netflix, fully rolled out. And here's the migration wisdom worth underlining for anyone doing a big platform swap: they migrated their largest, most complex customer first — the opposite of the comfortable instinct to save the hard one for last. The reasoning is that if the scariest case works, everything else is downhill, and you find the nasty surprises early instead of at the finish line. The whole production migration took just four weeks.

Airbnb: changing thousands of services without redeploying any

Airbnb's post answers a question you might not have known was hard: how do you change the behaviour of thousands of running services without redeploying any of them? Airbnb has a dynamic configuration system — a giant control panel of settings that engineers tweak constantly, several times a minute — and those changes have to reach thousands of running service instances, fast and reliably, with the services never going down. They built a helper they call sitar-agent.

The constraints they set are a fine lesson in real-world priorities. Configuration must always be readable, even if the central config service is completely down — because a slightly stale setting is fine, but a setting you can't read at all is a disaster. Changes have to arrive in tens of seconds, not minutes. The system has to absorb tens of thousands of servers all checking for updates at once. And it has to work across five different programming languages — Java, Python, Go, TypeScript and Ruby — without maintaining five separate versions.

Their solution is a sidecar — a small, separate helper that runs right alongside each service, in its own little container. When a service starts up, the sidecar first downloads a recent full snapshot of all the config from cloud storage — a known-good starting point that means it works even if the live service is offline — then it catches up with the central service, and from then on it simply polls for changes every ten seconds or so, writing the latest config to local disk where the main service reads it instantly. What stands out is the parade of deliberately boring choices. They kept it as a separate sidecar rather than a library baked into each app, because a library would mean reimplementing it in all five languages and losing isolation. They kept a simple pull model — everyone politely asking "anything new?" — rather than a fancier push system, and just made each poll cheap with a short-lived cache. And when they needed a local database, they benchmarked a faster option but chose SQLite, because it has solid support in all five languages and handles concurrent reads cleanly. Over and over, they picked the simpler, more reliable, more portable option over the cleverer one.

The threads

Each team is solving a different piece of the reliability puzzle — undoing partial failures, scheduling work fairly, distributing changes safely — but the shared instinct underneath is the same. Cloudflare chose explicit, durable, replayable undo logic over clever in-memory tricks. Netflix chose to adopt a proven open-source system over maintaining their own. And Airbnb, at decision after decision, chose the simpler and more recoverable design over the theoretically faster one. At real scale, operational clarity beats cleverness almost every time — because the system that's easy to reason about when something breaks is the system that survives. And at this scale, something is always breaking.