The Cloudflare outage should not have happened, and they seem to be missing the point on how to avoid it in the future
by Eduardo Bellani

Yet again, another global IT outage has happened (déjà vu strikes again in our industry). This time at Cloudflare (Prince 2025), taking down large swaths of the internet with it (Booth 2025).

And yes, like my previous analyses of the GCP and CrowdStrike outages, this post critiques Cloudflare’s root cause analysis (RCA), which, despite providing a great overview of what happened, misses the real lesson.

Here’s the key section of their RCA:

Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:

SELECT name, type FROM system.columns WHERE table = 'http_requests_features' order by name;

Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.

This, unfortunately, was the type of query that was performed by the Bot Management feature file generation logic to construct each input “feature” for the file mentioned at the beginning of this section.

The query above would return a table of columns like the one displayed (simplified example):

However, as part of the additional permissions that were granted to the user, the response now contained all the metadata of the r0 schema effectively more than doubling the rows in the response ultimately affecting the number of rows (i.e. features) in the final file output.

A central database query didn’t have the right constraints to express the business rules. Not only did it miss a filter on the database name, it also needs a DISTINCT and, arguably, a LIMIT, since these seem to be crucial business rules.
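For illustration, here is a version of the query that states those rules explicitly. This is a sketch, not Cloudflare’s actual fix: the database name and the row cap are my assumptions.

SELECT DISTINCT name, type
FROM system.columns
WHERE database = 'default'               -- only the intended schema
  AND table = 'http_requests_features'
ORDER BY name
LIMIT 200;                               -- assumed upper bound on features

The exact bound matters less than the principle: every cardinality assumption the downstream code relies on is written down where the data is produced.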

So, new underlying security work manifested the (unintended) potential already present in the query. Since this result was by definition unintended, the application code didn’t expect the value it received and reacted poorly. This caused a crash loop across seemingly all of Cloudflare’s core systems. The bug wasn’t caught during rollout because the faulty code path required data that was assumed to be impossible to generate.
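To make that “impossible” assumption explicit, the generation step could assert its expectations against the database before publishing anything. The following is a hypothetical sketch; the bound and the publish step are my assumptions, not something from Cloudflare’s RCA:

SELECT
    count(*)             AS feature_rows,
    count(DISTINCT name) AS distinct_features
FROM system.columns
WHERE database = 'default'
  AND table = 'http_requests_features';
-- publish the feature file only if feature_rows = distinct_features
-- and both stay within the expected bound; otherwise halt and alert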

Sound familiar? It should. Any senior engineer has seen this pattern before. This is a classic database/application mismatch. With this in mind, let’s review how Cloudflare plans to prevent it from happening again:

These are all solid, reasonable steps. But here’s the problem: they already do most of this—and the outage happened anyway.

Why? Because they seem to mistake physical replication for the absence of a single point of failure. That conflates the physical layer with the logical layer: one can have a logical single point of failure without any physical one, which was the case here.

I base this claim on their choice to abandon PostgreSQL and adopt ClickHouse (Bocharov 2018). That whole post is a great overview of processing data fast, without a single line on how to guarantee its logical correctness and consistency in the face of change.

They are treating a logical problem as if it were a physical problem.

I’ll repeat the same advice I offered in my previous article on GCP’s outage:

The real cause

These kinds of outages stem from the uncontrolled interaction between application logic and database schema. You can’t reliably catch that with more tests or rollouts or flags. You prevent it by construction—through analytical design.

  1. No nullable fields.
  2. (as a corollary of 1) full normalization of the database (The principles of database design, or, the Truth is out there); a minimal sketch of points 1 and 2 follows this list
  3. formally verified application code (Chapman et al. 2024)
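Here is that sketch, in PostgreSQL-style DDL with purely illustrative table and column names (this is not Cloudflare’s schema):

-- Every column is NOT NULL; the composite keys encode the business rules.
CREATE TABLE feature_source (
    database_name text NOT NULL,
    table_name    text NOT NULL,
    PRIMARY KEY (database_name, table_name)
);

CREATE TABLE feature (
    database_name text NOT NULL,
    table_name    text NOT NULL,
    column_name   text NOT NULL,
    column_type   text NOT NULL,
    PRIMARY KEY (database_name, table_name, column_name),
    FOREIGN KEY (database_name, table_name)
        REFERENCES feature_source (database_name, table_name)
);

Nothing here is nullable, and the keys make the rule “one row per column of a known source table” impossible to violate by construction, rather than something the application merely hopes is true.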

Conclusion

FAANG-style companies are unlikely to adopt formal methods or relational rigor wholesale. But for their most critical systems, they should. It’s the only way to make failures like this impossible by design, rather than just less likely.

The internet would thank them. (Cloud users too—caveat emptor.)

References

Bocharov, Alex. 2018. “HTTP Analytics for 6M Requests per Second Using ClickHouse.” https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/.
Booth, Robert. 2025. “What Is Cloudflare — and Why Did Its Outage Take Down So Many Websites?” https://www.theguardian.com/technology/2025/nov/18/what-is-cloudflare-and-why-did-its-outage-take-down-so-many-websites.
Chapman, Roderick, Claire Dross, Stuart Matthews, and Yannick Moy. 2024. “Co-Developing Programs and Their Proof of Correctness.” Commun. ACM 67 (3): 84–94. https://doi.org/10.1145/3624728.
Prince, Matthew. 2025. “Cloudflare Outage on November 18, 2025.” https://blog.cloudflare.com/18-november-2025-outage/.
Figure 1: The Cluny library was one of the richest and most important in France and Europe. In 1790, during the French Revolution, the abbey was sacked and mostly destroyed, with only a small part surviving.