The Cloudflare outage should not have happened, and they seem to be missing the point on how to avoid it in the future
by Eduardo Bellani
Yet again, another global IT outage happened (déjà vu strikes again in our industry). This time at Cloudflare (Prince 2025). Again, taking down large swaths of the internet with it (Booth 2025).
And yes, like my previous analyses of the GCP and CrowdStrike outages, this post critiques Cloudflare’s root cause analysis (RCA), which, despite providing a great overview of what happened, misses the real lesson.
Here’s the key section of their RCA:
Unfortunately, there were assumptions made in the past, that the list of
columns returned by a query like this would only include the “default”
database:
SELECT
name,
type
FROM system.columns
WHERE
table = 'http_requests_features'
order by name;
Note how the query does not filter for the database name. With us
gradually rolling out the explicit grants to users of a given ClickHouse
cluster, after the change at 11:05 the query above started returning
“duplicates” of columns because those were for underlying tables stored
in the r0 database.
This, unfortunately, was the type of query that was performed by the Bot
Management feature file generation logic to construct each input
“feature” for the file mentioned at the beginning of this section.
The query above would return a table of columns like the one displayed
(simplified example):
However, as part of the additional permissions that were granted to the
user, the response now contained all the metadata of the r0 schema
effectively more than doubling the rows in the response ultimately
affecting the number of rows (i.e. features) in the final file output.
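To make the mechanism concrete, here is a hypothetical, simplified sketch of what the unfiltered query returns before and after the r0 schema becomes visible to the querying user; the column names and types below are invented for illustration.

-- Hypothetical, simplified illustration; only the shape of the problem matters.
SELECT database, name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;

-- Before the permission change (only "default" is visible):
--   default | feature_a | Float64
--   default | feature_b | String
-- After the change (the underlying r0 tables are visible too):
--   default | feature_a | Float64
--   r0      | feature_a | Float64
--   default | feature_b | String
--   r0      | feature_b | String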
A central database query didn’t have the right constraints to express the business rules. Not only did it miss the database name, it also clearly needed a DISTINCT and a LIMIT, since these seem to be crucial business rules.
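A minimal sketch of the kind of constrained query that critique implies follows below; the 'default' database filter comes from the RCA's own stated assumption, while the 200-row cap is an invented value standing in for whatever the actual business rule is.

SELECT DISTINCT
    name,
    type
FROM system.columns
WHERE
    database = 'default'
    AND table = 'http_requests_features'
ORDER BY name
LIMIT 200;  -- hypothetical cap on the number of features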
So, new underlying security work manifested the (unintended) potential that was already there in the query. Since this was by definition unintended, the application code didn’t expect the value it received and reacted poorly. This caused a crash loop across seemingly all of Cloudflare’s core systems. The bug wasn’t caught during rollout because the faulty code path required data that was assumed to be impossible to generate.
Sound familiar? It should. Any senior engineer has seen this pattern
before. This is classic database/application mismatch. With this in
mind, let’s review how Cloudflare is planning to prevent this from happening
again:
Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
Enabling more global kill switches for features
Eliminating the ability for core dumps or other error reports to overwhelm system resources
Reviewing failure modes for error conditions across all core proxy modules
These are all solid, reasonable steps. But here’s the problem: they
already do most of this—and the outage happened anyway.
Why? Because they seem to mistake physical replication for not having a single point of failure. This confuses the physical layer with the logical layer. One can have a logical single point of failure without having any physical one, which was the case in this situation.
I base that claim on their choice to abandon PostgreSQL and adopt ClickHouse (Bocharov 2018). That whole post is a great overview of trying to process data fast, without a single line on how to guarantee its logical correctness/consistency in the face of changes.
They are treating a logical problem as if it were a physical problem
I’ll repeat the same advice I offered in my previous article on GCP’s outage:
The real cause
These kinds of outages stem from the uncontrolled interaction between
application logic and database schema. You can’t reliably catch that
with more tests or rollouts or flags. You prevent it by
construction—through analytical design.
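To give a flavour of what prevention by construction can look like at the schema level, here is a deliberately simplified, hypothetical PostgreSQL sketch in which the feature catalogue is an explicit relation with declared invariants, rather than the accidental result of an unconstrained metadata query; the table name, columns, and 200-row cap are invented for illustration.

-- Hypothetical schema: duplicate feature names are impossible by construction,
-- and the size of the feature set is bounded at write time.
CREATE TABLE bot_features (
    name text PRIMARY KEY,  -- a feature name can appear only once
    type text NOT NULL
);

CREATE FUNCTION enforce_feature_limit() RETURNS trigger AS $$
BEGIN
    -- 200 is an assumed cap, chosen only for illustration.
    IF (SELECT count(*) FROM bot_features) >= 200 THEN
        RAISE EXCEPTION 'feature limit exceeded';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER bot_features_limit
    BEFORE INSERT ON bot_features
    FOR EACH ROW EXECUTE FUNCTION enforce_feature_limit();

With invariants like these declared in the schema, a downstream consumer can never be handed a duplicated or oversized feature list; that failure mode is ruled out by design rather than merely made less likely.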
FAANG-style companies are unlikely to adopt formal methods or relational
rigor wholesale. But for their most critical systems, they should. It’s
the only way to make failures like this impossible by design, rather
than just less likely.
The internet would thank them. (Cloud users too—caveat emptor.)
Chapman, Roderick, Claire Dross, Stuart Matthews, and Yannick Moy. 2024. “Co-Developing Programs and Their Proof of Correctness.” Commun. ACM 67 (3): 84–94. https://doi.org/10.1145/3624728.
Figure 1: The Cluny library was one of the richest and most important in France and Europe. In 1790, during the French Revolution, the abbey was sacked and mostly destroyed, with only a small part surviving.