On July 19, 2024, a significant fault in a software update issued by CrowdStrike, a leading cybersecurity firm, led to a global IT outage affecting a wide range of industries and their critical infrastructures, including airlines, banks, healthcare facilities, and government services. The incident was one of the largest IT outages in history and exposed the vulnerabilities of critical infrastructures that depend on digital systems. Today, four days after the incident, critical infrastructure operators should examine its causes and its far-reaching implications. Most importantly, they have to look at what needs to be done to prevent similar occurrences in the future.
Causes of the Incident
The root cause of the CrowdStrike incident was a defect in a content update for its cybersecurity software. The defect specifically affected machines running Microsoft’s Windows operating system. The update triggered a “blue screen of death” (BSOD) and caused systems to get stuck in a restarting state. CrowdStrike’s software, which requires deep access to the operating system to function effectively, interacted poorly with Windows, which led to widespread crashes.
Several factors contributed to the severity of the incident:
- Deep System Access: CrowdStrike’s software requires privileged access to the operating system to scan for threats. This made any faults in updates potentially catastrophic.
- Rapid Update Deployment: The update was pushed globally without adequate phased testing, which could have identified the defect before it caused widespread damage.
- Complex Recovery Process: The fix required manual intervention in data centres, including navigating to specific files and rebooting systems. This significantly complicated the recovery process.
Implications of the Incident
The CrowdStrike incident had severe and wide-ranging implications, including:
- Operational Disruptions: The outage affected around 8.5 million Windows devices, leading to significant disruptions in various sectors, including airlines, banks, healthcare, and government services. Thousands of flights were cancelled or delayed, and hospitals experienced delays in procedures.
- Economic Impact: The financial cost of the outage is estimated to exceed $1 billion, with uncertainties around compensation for affected customers.
- Reputational Damage: The incident has damaged CrowdStrike’s reputation, with many customers reconsidering their reliance on the company’s services.
- Highlighting Vulnerabilities: The incident underscored the fragility of critical infrastructure and the interconnected nature of modern digital systems.
Guidelines and Recommendations to Prevent Future Incidents
To avoid future episodes like the CrowdStrike incident, critical infrastructure operators, linked industrial organizations and cybersecurity firms should adopt several best practices and guidelines:
- Rigorous Testing and Phased Rollouts: It is imperative to conduct extensive testing of updates in controlled environments before deployment. This should include testing on various operating systems and configurations to identify potential issues. Moreover, it is advised to implement phased rollouts of updates, starting with a small subset of systems and gradually expanding. This approach allows for the identification and resolution of issues before they affect a larger number of systems.
- Continuous Monitoring and Incident Response: Critical infrastructure operators must implement continuous monitoring of their IT infrastructure to detect anomalies and potential security threats in real time. This enables quick identification and containment of issues before they escalate.
- Incident Response Training: Critical infrastructure providers and organizations must regularly train employees on incident response protocols and conduct tabletop exercises to ensure preparedness for potential cybersecurity incidents.
- Robust Communication and Coordination: Organizations across the critical infrastructure resilience value chain must establish clear communication channels between cybersecurity firms and their clients. During an incident, timely and accurate information sharing is crucial to mitigate the impact. Furthermore, they must coordinate with relevant stakeholders, including government agencies and industry partners, to ensure a unified response to cybersecurity incidents.
- Enhanced Security Measures: The CrowdStrike incident revealed a need to enhance security measures through a multi-layered approach that includes network segmentation, strong access controls, and regular software updates, to protect critical infrastructure from cyber threats. Furthermore, it is important to strengthen supply chain security so that all components of the supply chain adhere to stringent cybersecurity standards. This is key to preventing vulnerabilities introduced through third-party vendors.
- Investment in Cybersecurity Technologies and Workforce: It is also important to invest in advanced cybersecurity technologies, such as real-time threat intelligence, endpoint detection, and automation, to enhance the ability to detect and respond to cyber threats. Most importantly, the CrowdStrike incident underlines the importance of a skilled workforce. Hence, organizations and policy makers have to address the global shortage of skilled cybersecurity professionals by investing in training and upskilling employees.
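The phased rollout recommended above can be illustrated with a minimal Python sketch. It is not CrowdStrike's or any vendor's actual deployment process. The function name `phased_rollout`, the callbacks `apply_update` and `is_healthy`, the stage fractions and the 2% failure threshold are all hypothetical choices made for illustration. The idea is simply that an update reaches a small subset of systems first, and the rollout stops as soon as too many of them fail a health check.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    updated: List[str]  # hosts that received the update and stayed healthy
    failed: List[str]   # hosts that failed the health check after updating
    halted: bool        # True if the rollout stopped before reaching every host


def phased_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],   # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],     # hypothetical: post-update health probe
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # assumed stages
    max_failure_rate: float = 0.02,        # assumed threshold for halting
) -> RolloutResult:
    """Update hosts in growing stages and halt when a stage fails too often."""
    updated: List[str] = []
    failed: List[str] = []
    done = 0  # number of hosts processed so far
    for fraction in stage_fractions:
        # Cumulative target for this stage; always at least one new host.
        target = min(len(hosts), max(done + 1, round(len(hosts) * fraction)))
        stage = hosts[done:target]
        stage_failed = []
        for host in stage:
            apply_update(host)
            if is_healthy(host):
                updated.append(host)
            else:
                stage_failed.append(host)
        failed.extend(stage_failed)
        done = target
        # Stop expanding if this stage exceeded the allowed failure rate.
        if stage and len(stage_failed) / len(stage) > max_failure_rate:
            return RolloutResult(updated, failed, halted=True)
        if done >= len(hosts):
            break
    return RolloutResult(updated, failed, halted=done < len(hosts))


# Simulated fleet of 200 hosts receiving a defective update (every probe fails).
fleet = [f"host-{i}" for i in range(200)]
result = phased_rollout(fleet, lambda host: None, lambda host: False)
print(len(result.failed), result.halted)  # prints: 2 True
```

In this simulation the defective update reaches only the first 1% of the fleet (two hosts) before the rollout halts, instead of all 200. Real deployments would also need waiting periods between stages and a rollback path, which this sketch leaves out.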
The EU-CIP Resources and Approach
For over 18 months, the EU-CIP project has been coordinating and supporting the European Critical Infrastructure Resilience ecosystem, highlighting the need for strong cybersecurity capabilities across industrial supply chains in different sectors. At the same time, it has been providing resources in the form of analysis documents, whitepapers, innovation support and training resources, which are available through the project’s knowledge hub. The EU-CIP analysis and resources have already identified the capability gaps (e.g., gaps in training and cybersecurity response) that led to the CrowdStrike incident. For us in EU-CIP, the CrowdStrike incident serves as a stark reminder of the vulnerabilities inherent in increasingly interconnected and digitalized critical infrastructures. Our project will continue to underline the importance of, and to provide support for, adopting rigorous testing protocols, enhancing continuous monitoring, improving communication, and investing in advanced security measures and skilled personnel. This is key to helping critical infrastructure operators and other security organizations better safeguard their critical infrastructures against future cybersecurity incidents. Proactive measures and a commitment to cybersecurity resilience are essential to protect the vital systems that underpin our society and economy.