CrowdStrike Attributes Buggy Update Crashing 8.5 Million Windows Machines to Faulty Testing Software
In a post-incident review (PIR), CrowdStrike has identified faulty testing software as the cause of a buggy update that led to the crash of 8.5 million Windows machines globally. “Due to a bug in the Content Validator, one of the two [updates] passed validation despite containing problematic data,” the company explained. In response, CrowdStrike has promised to implement a series of new measures to prevent similar issues in the future.
The widespread BSOD (blue screen of death) outage affected organizations worldwide, including airlines, broadcasters, and the London Stock Exchange. The faulty update sent Windows machines into a boot loop that technicians could only fix with local access to each machine. Mac and Linux machines were unaffected. Companies like Delta Air Lines are still dealing with the aftermath.
CrowdStrike’s Falcon Sensor, designed to prevent DDoS and other attacks, played a central role in the incident. The sensor uses kernel-level content, called Sensor Content, and employs “Template Types” to define threat defenses. When new threats emerge, CrowdStrike releases “Rapid Response Content” as “Template Instances.”
On March 5, 2024, a new Template Type for the sensor was released without issues. However, on July 19, 2024, two new Template Instances were released, and one of them (just 40 KB in size) passed validation despite containing problematic data. According to CrowdStrike, “When received by the sensor and loaded into the Content Interpreter, [this] resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).”
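To illustrate the class of bug described in the review, here is a minimal Python sketch of a bounds check that a validator or interpreter might apply before reading a field. It is not CrowdStrike's code: the names `validate_instance` and `read_field`, and the idea of an instance referencing input fields by index, are assumptions made purely for illustration.

```python
def validate_instance(referenced_indexes, supplied_field_count):
    """Return the referenced indexes that fall outside the supplied fields.

    Hypothetical model: a content instance refers to input fields by index,
    and the sensor supplies `supplied_field_count` fields at run time.
    An empty result means every reference is in bounds.
    """
    return [i for i in referenced_indexes if not 0 <= i < supplied_field_count]


def read_field(fields, index):
    """Read a field defensively: return None instead of reading out of bounds."""
    if 0 <= index < len(fields):
        return fields[index]
    return None


# An instance that references index 5 when only 5 fields (0..4) are supplied
# is rejected before it ever reaches the interpreter.
problems = validate_instance([0, 2, 5], supplied_field_count=5)
```

In this sketch, a non-empty `problems` list would block the release, and `read_field` acts as a second line of defense at run time. In kernel-mode native code there is no such safety net by default, which is why an unchecked read can bring down the whole operating system.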
To prevent future incidents, CrowdStrike has committed to more rigorous testing of Rapid Response Content. This includes local developer testing, content update and rollback testing, stress testing, and stability testing. The company is also adding validation checks and enhancing error handling.
Additionally, CrowdStrike will adopt a staggered deployment strategy for Rapid Response Content to avoid global outages. Customers will gain greater control over content delivery, and the company will provide release notes for updates.
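A staggered deployment can be sketched roughly as follows. This is an illustrative Python outline under stated assumptions, not CrowdStrike's rollout system: `staged_rollout`, the `deploy` and `is_healthy` callbacks, and the 1% / 10% / 100% stages are all hypothetical.

```python
def staged_rollout(hosts, stage_percents, deploy, is_healthy):
    """Deploy to progressively larger slices of `hosts`.

    `stage_percents` lists cumulative percentages, e.g. [1, 10, 100].
    After each stage, every host in that stage is health-checked; the
    rollout halts at the first stage with an unhealthy host.
    Returns the hosts that received the update.
    """
    updated = []
    start = 0
    for percent in stage_percents:
        # Cumulative cut-off for this stage, never moving backwards.
        end = max(start, len(hosts) * percent // 100)
        batch = hosts[start:end]
        for host in batch:
            deploy(host)
        updated.extend(batch)
        if not all(is_healthy(host) for host in batch):
            break  # stop before the bad update spreads further
        start = end
    return updated


# Hypothetical fleet of 100 hosts where the update crashes every machine.
fleet = [f"host-{n}" for n in range(100)]
hit = staged_rollout(fleet, [1, 10, 100], deploy=lambda h: None,
                     is_healthy=lambda h: False)
```

With a canary stage of 1%, a bad update in this model reaches a single host out of 100 before the rollout halts, rather than the whole fleet at once.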
However, some analysts and engineers believe these measures should have been in place from the beginning. “CrowdStrike must have been aware that these updates are interpreted by the drivers and could lead to problems,” engineer Florian Roth commented on X. “They should have implemented a staggered deployment strategy for Rapid Response Content from the start.”
Source: Engadget.com