On November 18, 2025, the world experienced a massive disruption as Cloudflare, a foundational infrastructure provider serving nearly 20% of the web’s traffic, suffered a global control plane failure for approximately 6 hours. Casmer Labs has compiled an analysis of the incident, including the root cause, its effects within the Cloudflare architecture, and how it could have been prevented.
How it Happened
It is first critical to understand the general architecture of the Cloudflare platform. Cloudflare acts as a reverse proxy, an intermediary between the end-user and the origin server, inspecting traffic for denial-of-service attacks and caching static content before forwarding valid requests to the origin. Digging deeper, a few key factors within the Cloudflare architecture were at play:
- The extensive usage of the Rust programming language within their core infrastructure
- The extensive usage of ClickHouse, a column-oriented database management system
- The Bot Management system, used to “score” every incoming request to the Cloudflare intermediary
Let’s walk through the incident in chronological order:
- At approximately 11:05 UTC, Cloudflare’s database reliability engineering team initiated a scheduled maintenance task on the ClickHouse cluster supporting the Bot Management system. The objective was to modify access privileges of the service account querying the database to generate the feature files used by the Bot Management system.
- As a result of this update, the service account became exposed to underlying system tables in a way that the database query logic was not designed to handle. Because the query did not specify a database qualifier (e.g., it used `SELECT * FROM features` rather than `SELECT * FROM production.features`), it matched the `features` table in multiple schemas.
- In the minutes following, ClickHouse began returning duplicate machine learning feature data, pushing the feature count over the Cloudflare-designated limit of 200. This data was serialized into a new configuration file and pushed through the global distribution system across the Cloudflare network.
- At 11:20 UTC, the first edge servers received the update. The core proxy software (FL2), written in Rust, attempted to load the new Bot Management configuration. When the file was parsed and more than 200 features were found, the Rust `panic!` path was invoked (a simplified sketch of this failure mode follows the timeline).
- By 11:28, a massive spike in HTTP 500 (Internal Server Error) and HTTP 502 (Bad Gateway) errors was detected by external monitoring services such as DownDetector.
- While engineers initially suspected a sophisticated DDoS attack, at 14:24 UTC the root cause was isolated to the Bot Management configuration file. At 14:30, the team manually forced a “known good” file out to the global fleet, resolving the panic condition and allowing traffic to flow.
- At 17:06 UTC, after a period of monitoring, the incident was declared fully resolved.
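To make the failure mode concrete, here is a minimal, hypothetical Rust sketch (not Cloudflare’s actual FL2 code; the file format and names are invented) of how a hard feature-count limit, enforced via a panic at config-load time, can take down a proxy worker the moment a duplicated configuration file arrives:

```rust
// Hypothetical sketch of the fragile pattern: a preallocation limit
// enforced with a panic at configuration-load time.

const MAX_FEATURES: usize = 200;

/// Parses a propagated Bot Management feature file into (name, rule) pairs.
/// The fatal assumption, "the file can never exceed the limit", is expressed
/// as a panic instead of a recoverable error.
fn load_bot_config(raw: &str) -> Vec<(String, String)> {
    let features: Vec<(String, String)> = raw
        .lines()
        .filter(|l| !l.trim().is_empty())
        .filter_map(|l| l.split_once('\t'))
        .map(|(name, rule)| (name.to_string(), rule.to_string()))
        .collect();

    // A duplicated query result doubles the row count, trips this check,
    // and aborts the whole proxy worker instead of degrading gracefully.
    assert!(features.len() <= MAX_FEATURES, "feature count exceeds limit");
    features
}

fn main() {
    // Simulate the bad file: 150 unique features, each emitted twice.
    let raw: String = (0..150)
        .flat_map(|i| std::iter::repeat(format!("feature_{i}\tscore>{i}\n")).take(2))
        .collect();
    let features = load_bot_config(&raw); // panics: 300 > MAX_FEATURES
    println!("loaded {} features", features.len());
}
```

Running this aborts the process with a panic rather than returning an error, which is exactly the behavior that cascaded across the edge fleet.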
Technical Analysis
To fully appreciate the subtlety of this outage, we must examine the specific behavior of ClickHouse that led to the data duplication.
ClickHouse uses MergeTree table engines, which are log-structured. When data is inserted, it is written to a new “part” on disk. These parts are immutable. A background process merges these parts and applies rules—such as deduplication or aggregation.
- Eventual Consistency: Deduplication is eventual. A `SELECT` query run at any given moment may see duplicates if the merge process hasn’t completed, or if the query spans multiple unmerged parts.
- The Role of Views: To handle this, production systems often query a specific `View` or use the `FINAL` modifier (e.g., `SELECT * FROM table FINAL`), which forces the deduplication logic to run at query time (at a performance cost).
The update at 11:05 UTC changed the GRANT permissions for the user executing the Bot Management query.
- Implicit Scoping: Prior to the change, the user likely had access only to a specific `Materialized View` that presented a clean, deduplicated dataset.
- Explicit Exposure: The new permissions likely granted `SELECT` access to the entire database or a broader set of tables, including the underlying `Distributed` table shards or replication logs.
Crucially, the SQL query code used to generate the config file was:
```sql
SELECT feature_name, rule_logic FROM bot_features…
```
It did not specify `FROM production_db.bot_features`.
In many SQL environments, this ambiguity is resolved by the user’s default schema. However, in ClickHouse’s complex distributed environment, combined with the permission change, the query resolver matched rows from multiple sources.
This illustrates a critical lesson in database reliability engineering: Explicit is better than implicit. Queries in production automation must always use fully qualified names (database.table) and explicit deduplication logic (DISTINCT, LIMIT, or GROUP BY) to defend against environmental changes.
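As a sketch of that principle, the following hedged Rust example shows what a defensive control-plane generator could look like: the query string is fully qualified and explicitly deduplicating, rows are collapsed by feature name as a second line of defense, and the 200-feature cap is enforced before a file is ever produced. The identifiers, row format, and omitted database client are assumptions for illustration, not Cloudflare’s actual pipeline:

```rust
use std::collections::BTreeMap;

/// Hypothetical cap mirroring the 200-feature limit described above.
const FEATURE_LIMIT: usize = 200;

/// Fully qualified and explicitly deduplicating query: qualifying the
/// database name means a permission or default-schema change cannot
/// silently pull in rows from another schema.
const FEATURE_QUERY: &str =
    "SELECT DISTINCT feature_name, rule_logic FROM production_db.bot_features FINAL";

#[derive(Debug, Clone)]
struct FeatureRow {
    feature_name: String,
    rule_logic: String,
}

#[derive(Debug)]
enum GenerateError {
    TooManyFeatures(usize),
}

/// Builds the config from rows returned by the (omitted) database client.
/// Rows are deduplicated defensively even though the query already asks for
/// distinct results, and the limit is enforced before anything ships.
fn generate_config(rows: Vec<FeatureRow>) -> Result<Vec<FeatureRow>, GenerateError> {
    // Last-write-wins dedup keyed on feature_name.
    let deduped: BTreeMap<String, FeatureRow> = rows
        .into_iter()
        .map(|r| (r.feature_name.clone(), r))
        .collect();

    if deduped.len() > FEATURE_LIMIT {
        // Fail the build at the control plane instead of letting an
        // oversized file propagate to every edge node.
        return Err(GenerateError::TooManyFeatures(deduped.len()));
    }
    Ok(deduped.into_values().collect())
}

fn main() {
    println!("query used by the generator: {}", FEATURE_QUERY);

    // Simulated result set containing duplicates, as a broadened grant might expose.
    let rows = vec![
        FeatureRow { feature_name: "ua_entropy".into(), rule_logic: "score > 0.7".into() },
        FeatureRow { feature_name: "ua_entropy".into(), rule_logic: "score > 0.7".into() },
        FeatureRow { feature_name: "tls_fingerprint".into(), rule_logic: "ja3 in blocklist".into() },
    ];

    match generate_config(rows) {
        Ok(features) => println!("publishing config with {} features", features.len()),
        Err(GenerateError::TooManyFeatures(n)) => {
            eprintln!("refusing to publish: {n} features exceeds the limit of {}", FEATURE_LIMIT)
        }
    }
}
```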
Blast Radius and Prevention
OpenAI’s ChatGPT, with over 700 million users, was one of the most high-profile casualties. Spotify and a number of other high-profile SaaS tools such as Canva, Figma, Trello, and Discord also experienced disruptions, both internal and external.
While the Cloudflare outage is more of a case study on reliability engineering, there are lessons here to be learned by all types of IT teams (this is a security blog, after all!):
- Robust Input Validation & Graceful Degradation:
  - Mission-critical software should effectively never use `panic!` or `unwrap()` on dynamic input. It should use `match` or `if let` to handle errors (see the first sketch after this list).
  - If a configuration file fails validation (e.g., more than 200 features), the system must reject the update and continue running with the previous valid configuration. It should log an alert, not crash the process.
  - Constraints (like the 200-feature limit) should be enforced at the Control Plane (generation time), not just the Data Plane (load time). The generator should have failed to build the file, preventing it from ever leaving the database layer.
- Database Hygiene in Distributed Systems
  - All production queries must use fully qualified table names.
  - Use `FINAL` or explicit aggregation (`argMax`, `GROUP BY`) when querying data that relies on eventual consistency, especially for configuration generation.
  - Permission changes should be tested in a staging environment that mirrors the production data topology (sharding/replication) to catch side effects like visibility changes.
- The “Staggered Rollout” (Canary) Imperative
  - Configuration updates should be pushed to a small subset of the network (e.g., 1% of edge nodes) first.
  - Observability systems should monitor the error rate of the canary fleet. If 5xx errors spike, the rollout is automatically halted and reversed (see the second sketch after this list). This would have contained the November 18 outage to a few nodes rather than the entire planet.
- Client-Side Resilience (The “Break Glass” Strategy)
  - For critical services, use a multi-CDN strategy with a DNS load balancer (like NS1 or Route53) that health-checks the CDNs.
  - Serve stale content if the upstream is unreachable (e.g., the `stale-if-error` and `stale-while-revalidate` Cache-Control directives).
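To make the first lesson concrete, here is a hedged Rust sketch of the consuming side: the new configuration is parsed with `Result`-based error handling, and any failure is logged while the process keeps serving with the last known-good configuration instead of panicking. The types and file format are illustrative assumptions, not the real FL2 code.

```rust
const MAX_FEATURES: usize = 200;

#[derive(Debug, Clone)]
struct BotConfig {
    features: Vec<(String, String)>, // (feature_name, rule_logic)
}

#[derive(Debug)]
enum ConfigError {
    Malformed(String),
    TooManyFeatures(usize),
}

/// Parse and validate a newly propagated config file without panicking.
fn parse_config(raw: &str) -> Result<BotConfig, ConfigError> {
    let mut features = Vec::new();
    for line in raw.lines().filter(|l| !l.trim().is_empty()) {
        match line.split_once('\t') {
            Some((name, rule)) => features.push((name.to_string(), rule.to_string())),
            None => return Err(ConfigError::Malformed(line.to_string())),
        }
    }
    if features.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures(features.len()));
    }
    Ok(BotConfig { features })
}

/// Swap in the new config only if it validates; otherwise keep the old one.
fn apply_config(current: &mut BotConfig, raw_update: &str) {
    match parse_config(raw_update) {
        Ok(new_cfg) => {
            *current = new_cfg;
            println!("config updated: {} features", current.features.len());
        }
        Err(e) => {
            // Alert and degrade gracefully: the proxy keeps running on the
            // previous valid configuration instead of crashing the process.
            eprintln!("rejected config update ({e:?}); keeping last known-good");
        }
    }
}

fn main() {
    let mut current = BotConfig { features: vec![("baseline".into(), "score > 0.5".into())] };

    // An oversized (e.g., duplicated) update is rejected, not fatal.
    let bad_update: String = (0..400).map(|i| format!("f{i}\tscore>{i}\n")).collect();
    apply_config(&mut current, &bad_update);

    // The proxy is still alive and still using the old config.
    println!("still serving with {} feature(s)", current.features.len());
}
```

The key design choice is that a validation failure is an expected, recoverable event: the worker stays up and an alert fires, which is the opposite of the behavior observed on November 18.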
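And for the canary imperative, here is a minimal sketch of the control loop, assuming a hypothetical `Fleet` interface over the deployment and observability systems: expand the rollout in stages, sample the canary cohort’s 5xx rate, and halt and roll back automatically if it spikes.

```rust
/// Hypothetical fleet interface; a real rollout system would wrap the
/// deployment and observability APIs. Everything here is illustrative.
trait Fleet {
    fn push_to(&mut self, fraction: f64, config_version: &str);
    fn rollback(&mut self, config_version: &str);
    fn error_rate_5xx(&self) -> f64;
}

/// Staged rollout: 1% -> 5% -> 25% -> 100%, halting and reverting the
/// moment the canary cohort's 5xx rate crosses a threshold.
fn staged_rollout<F: Fleet>(fleet: &mut F, new_version: &str, old_version: &str) -> bool {
    const STAGES: [f64; 4] = [0.01, 0.05, 0.25, 1.0];
    const MAX_5XX_RATE: f64 = 0.02; // abort if more than 2% of requests fail

    for &fraction in STAGES.iter() {
        fleet.push_to(fraction, new_version);
        // A real pipeline would soak here before sampling metrics.
        let rate = fleet.error_rate_5xx();
        if rate > MAX_5XX_RATE {
            eprintln!(
                "5xx rate {:.1}% at {:.0}% of fleet; rolling back to {old_version}",
                rate * 100.0,
                fraction * 100.0
            );
            fleet.rollback(old_version);
            return false;
        }
        println!("{:.0}% of fleet healthy on {new_version}", fraction * 100.0);
    }
    true
}

/// Toy in-memory fleet that "breaks" on a known-bad version, for the demo.
struct MockFleet {
    active: String,
}

impl Fleet for MockFleet {
    fn push_to(&mut self, _fraction: f64, config_version: &str) {
        self.active = config_version.to_string();
    }
    fn rollback(&mut self, config_version: &str) {
        self.active = config_version.to_string();
    }
    fn error_rate_5xx(&self) -> f64 {
        if self.active == "bot-config-v2" { 0.9 } else { 0.001 }
    }
}

fn main() {
    let mut fleet = MockFleet { active: "bot-config-v1".to_string() };
    let ok = staged_rollout(&mut fleet, "bot-config-v2", "bot-config-v1");
    println!("rollout succeeded: {ok}; fleet now on {}", fleet.active);
}
```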
It is also worth noting that this outage served as an “impromptu penetration test.” Organizations should analyze their WAF logs from the outage window: if they routed around Cloudflare, did they see a spike in attacks?