Crowdstrike's outage should not have happened, and the company is missing the point on how to avoid it in the future

by Eduardo Bellani

A global IT outage occurred on [2024-07-18 Thu], causing significant economic problems across several industries (see Appendix 1: The impact for some quotes on what happened). The outage was caused by a bug in the remote update system of the software of Crowdstrike, a popular Threat Intelligence/Response company.

The company published its Post Incident Review (Crowdstrike 2024a) right after the incident and has just released its root cause analysis, or RCA (Crowdstrike 2024b). Reading them, especially the proposed mitigations, led me to write this article.

According to the RCA, the essence of what happened was an out-of-bounds array index, which is a special case of a buffer overflow and is undefined behavior in C++, the language that seems to be used to develop Crowdstrike’s system (Stack 2024).
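To make the failure class concrete, here is a minimal C++17 sketch contrasting an unchecked array read with a checked one. It is not Crowdstrike's code: the names `Inputs`, `read_unchecked` and `read_checked`, and the input count of 20, are assumptions made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <optional>

// Hypothetical stand-in for the inputs handed to a content interpreter.
// The count is illustrative only.
constexpr std::size_t kSuppliedInputs = 20;
using Inputs = std::array<int, kSuppliedInputs>;

// Unchecked access: undefined behavior whenever index >= inputs.size().
// Shown only for contrast; never call it with an out-of-range index.
inline int read_unchecked(const Inputs& inputs, std::size_t index) {
    return inputs[index];
}

// Checked access: the out-of-range case becomes an explicit result that
// the caller has to handle, instead of a read past the end of the array.
inline std::optional<int> read_checked(const Inputs& inputs, std::size_t index) {
    if (index >= inputs.size()) {
        return std::nullopt;
    }
    return inputs[index];
}
```

With 20 inputs, `read_checked(inputs, 20)` yields an empty optional, whereas `read_unchecked(inputs, 20)` would read past the end of the array.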

Here we get to the core of my argument: for software of this criticality, such a problem should not be possible. The technology to ensure this has existed for decades, as can be seen in this quote:

… we can continue to add contracts to the code until every subprogram has a fully functional specification. By this we mean that every subprogram has a postcondition that specifies the value of each of its outputs and a precondition as required to constrain the input space. Further type invariants may also be added over and above those already present from Gold level. Once the implementation has been completed against this full specification and all VCs generated by the analyzer have been proved, we have reached Platinum level of SPARK assurance.

Due to the additional effort involved in developing the specification and proof to this level, Platinum will only be appropriate for the most critical applications. However, it is worth considering a reduction in unit testing for functional verification if Platinum-level proof has been achieved, since we *know that the program will return the correct result for all inputs, not just for those we have been able to test*. (Chapman et al. 2024)
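C++ has no built-in equivalent of SPARK's statically proved contracts, but the idea of a precondition plus a type invariant can be loosely approximated. The C++17 sketch below is only an analogy: `BoundedIndex` and `element_at` are hypothetical names, and the check happens at run time when the index is created, not by proof.

```cpp
#include <array>
#include <cstddef>
#include <optional>

// Hypothetical index type whose invariant is value() < N.
template <std::size_t N>
class BoundedIndex {
public:
    // The precondition check lives in exactly one place: the only way to
    // obtain a BoundedIndex is make(), which rejects out-of-range values.
    static constexpr std::optional<BoundedIndex> make(std::size_t raw) {
        if (raw >= N) {
            return std::nullopt;
        }
        return BoundedIndex(raw);
    }

    constexpr std::size_t value() const { return value_; }

private:
    constexpr explicit BoundedIndex(std::size_t raw) : value_(raw) {}
    std::size_t value_;
};

// Informal postcondition: returns the element at position index.value().
// No bounds check is repeated here, because the type invariant of
// BoundedIndex<N> already guarantees index.value() < N.
template <typename T, std::size_t N>
constexpr const T& element_at(const std::array<T, N>& values, BoundedIndex<N> index) {
    return values[index.value()];
}
```

A caller has to go through `BoundedIndex<3>::make(i)` and handle the empty case before it can index a three-element array at all. A prover such as the one described in the quote goes further and discharges this obligation before the program ever runs.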

Furthermore, all the technical mitigations proposed in the RCA (see the full list of problems found and their proposals in Appendix 2: What happened) amount to just plugging holes. But safety cannot be achieved that way: it needs to be built into the design, tools and languages used from the start of such an endeavor.

If I were a client of Crowdstrike, I would be worried about the future.

Appendix 1: The impact

A major IT fault has hit services and infrastructure around the world, with aviation, banking, healthcare and financial services among the sectors affected. (Banfield-Nwachi 2024)

The CrowdStrike outage didn’t just delay flights and make it harder to order coffee. It also affected doctor’s offices and hospitals, 911 emergency services, hotel check-in and key card systems, and work-issued computers that were online and grabbing updates when the flawed update was sent out. In addition to providing fixes for client PCs and virtual machines hosted in its Azure cloud, Microsoft says it has been working with Google Cloud Platform, Amazon Web Services, and “other cloud providers and stakeholders” to provide fixes to Windows VMs running in its competitors’ clouds. (Cunningham 2024)

While software updates may occasionally cause disturbances, significant incidents like the CrowdStrike event are infrequent. We currently estimate that CrowdStrike’s update affected 8.5 million Windows devices, or less than one percent of all Windows machines. While the percentage was small, the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services. (Weston 2024)

Appendix 2: What happened

Here is the list of problems found and the mitigations proposed in Crowdstrike’s RCA (Crowdstrike 2024b), slightly reworded for space efficiency:

| Finding | Mitigation |
|---------+------------|
| The number of input fields .. not validated at sensor compile time | Validate the number of input fields at compile time |
| Missing runtime array bounds check | Add runtime input array bounds checks |
| Lack of variety in testing | Increase test coverage |
| Inconsistency between validator and interpreter | Fix the instance of inconsistency and add checks |
| No validation in the interpreter | Add tests |
| No staged deployment | Add staged deployment |
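As an illustration of the first mitigation in the list, validating the number of input fields at compile time, here is a minimal C++17 sketch. `TemplateType`, `count_non_empty` and the field count of 3 are hypothetical. The RCA does not publish the real interfaces, so this shows only the general technique.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical description of a template type: it declares how many
// input fields its rules may reference.
template <std::size_t FieldCount>
struct TemplateType {
    static constexpr std::size_t field_count = FieldCount;
};

// Hypothetical interpreter entry point. The static_assert rejects, at
// compile time, any call site that supplies a different number of inputs
// than the template type declares.
template <typename Template, std::size_t Supplied>
constexpr std::size_t count_non_empty(
    const std::array<std::string_view, Supplied>& inputs) {
    static_assert(Supplied == Template::field_count,
                  "number of supplied inputs must match the template type");
    std::size_t matched = 0;
    for (std::size_t i = 0; i < Template::field_count; ++i) {
        if (!inputs[i].empty()) {
            ++matched;
        }
    }
    return matched;
}

using ExampleTemplate = TemplateType<3>;  // illustrative field count
```

Calling `count_non_empty<ExampleTemplate>` with an array of two elements fails to compile, so the mismatch is caught before anything ships. That is a small instance of a check built into the design, as opposed to one added after the fact.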
Figure 1: St Nedelya Church, partially destroyed in a terrorist attack by the Bulgarian Communist Party. 16 April 1925.

References

Banfield-Nwachi, Mabel. 2024. “Windows Global IT Outage: What We Know so Far.” The Guardian. https://www.theguardian.com/technology/article/2024/jul/19/windows-global-it-outage-what-we-know-so-far.
Chapman, Roderick, Claire Dross, Stuart Matthews, and Yannick Moy. 2024. “Co-Developing Programs and Their Proof of Correctness.” Commun. ACM 67 (3): 84–94. https://doi.org/10.1145/3624728.
Crowdstrike. 2024a. “Crowdstrike Preliminary Post Incident Review (PIR): Content Configuration Update Impacting the Falcon Sensor and the Windows Operating System (BSOD).” Crowdstrike blog. https://www.crowdstrike.com/wp-content/uploads/2024/07/CrowdStrike-PIR-Executive-Summary.pdf.
———. 2024b. “External Technical Root Cause Analysis — Channel File 291.” Crowdstrike blog. https://www.crowdstrike.com/wp-content/uploads/2024/08/Channel-File-291-Incident-Root-Cause-Analysis-08.06.2024.pdf.
Cunningham, Andrew. 2024. “Microsoft Says 8.5M Systems Hit by Crowdstrike BSOD, Releases USB Recovery Tool.” Ars Technica. https://arstechnica.com/information-technology/2024/07/microsoft-says-8-5m-systems-hit-by-crowdstrike-bsod-releases-usb-recovery-tool/.
Stack, The. 2024. “Crowdstrike Promises RCA as C++ Null Pointer Claim Contested.” The Stack. https://www.thestack.technology/crowstrike-null-pointer-blamed-rca/.
Weston, David. 2024. “Helping Our Customers through the Crowdstrike Outage.” Microsoft Official Blog. https://blogs.microsoft.com/blog/2024/07/20/helping-our-customers-through-the-crowdstrike-outage/.