Lessons Learned After the Microsoft/CrowdStrike Meltdown: Insights from ACI Learning's Experts


In the wake of the recent CrowdStrike and Microsoft incident, the tech world has been quick to point fingers, assign blame, and play Monday-morning quarterback about everything the corporate world failed to do in the run-up to a flawed security update that caused widespread outages across multiple organizations. To put it graciously, the incident has exposed vulnerabilities and prompted crucial conversations about resilience and best practices. We asked ACI Learning's subject matter experts in audit, cybersecurity, and IT for their insights on the lessons learned and how to move forward.

Context: What Happened?

The incident began with a flawed content update deployed by CrowdStrike, a major cybersecurity provider, to its sensor software running on Windows machines. The update caused affected machines to crash, disrupting banks, airlines, hospitals, and government services. Because CrowdStrike and Microsoft products are so widely used, the impact was felt globally, affecting many organizations that people rely on daily.

Expert Reactions and Lessons Learned

Robin Abernathy: Patch Management Best Practices

Robin Abernathy emphasized the importance of proper patch management. While the flawed patch was a critical issue, Robin pointed out that organizations also bear responsibility for the fallout. "Patch Deployment 101 encourages organizations to first test any patches in a lab environment or on a few computers," Robin said. "If the organizations involved had good patch management practices in place, many could have avoided this outage."

Robin highlighted that deploying the patch to a single computer would have revealed the issue immediately, preventing broader deployment and subsequent disruptions. The takeaway here is clear: effective patch management is a shared responsibility between vendors and organizations.
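
As a concrete illustration of that advice, a staged rollout can be as simple as gating broad deployment behind a small canary group. The sketch below is illustrative only: the host names, the deploy_patch and health_check functions, and the group sizes are hypothetical stand-ins for whatever endpoint-management tooling an organization actually uses.

```python
import time

# Hypothetical inventory: a handful of lab/canary machines, then everything else.
CANARY_HOSTS = ["lab-win11-01", "lab-win11-02", "helpdesk-spare-01"]
PRODUCTION_HOSTS = ["fin-ws-001", "fin-ws-002", "ops-ws-101"]  # ...and so on

def deploy_patch(host: str, patch_id: str) -> None:
    """Placeholder for whatever tool actually pushes the patch to a host."""
    print(f"Deploying {patch_id} to {host}")

def health_check(host: str) -> bool:
    """Placeholder: does the host still boot, respond, and run its critical apps?"""
    print(f"Checking {host}")
    return True  # assume healthy in this sketch

def staged_rollout(patch_id: str, soak_minutes: int = 60) -> None:
    # Stage 1: canary group only.
    for host in CANARY_HOSTS:
        deploy_patch(host, patch_id)

    # Let the patch "soak" so a crash loop or BSOD has time to show up.
    time.sleep(soak_minutes * 60)

    # Stage 2: proceed only if every canary is still healthy.
    if not all(health_check(h) for h in CANARY_HOSTS):
        raise RuntimeError(f"Canary failure: halting rollout of {patch_id}")

    for host in PRODUCTION_HOSTS:
        deploy_patch(host, patch_id)

if __name__ == "__main__":
    staged_rollout("example-sensor-update", soak_minutes=0)  # zero soak time for the demo
```

Even a single canary machine that fails its health check is enough to stop the rollout, which is exactly the scenario Robin describes.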

She noted that in the days after her initial comments, she learned that this update was deployed automatically because of its nature: it was essentially a content update, similar to a virus signature update, which most customers would want pushed quickly. Abernathy said she is not familiar enough with CrowdStrike to know whether this type of automatic update can be disabled, but having that option available would seem to be in everyone's best interest.

NIST SP 800-40r4 encourages organizations to test updates before deploying them. Patches should be tested both by the vendor (in this case, CrowdStrike) and by the organization where they will be deployed, and preserving that ability is vital.

“So while I understand there may not have been a way within CrowdStrike to prevent automatic deployment of patches, why isn't there?” she asked. She explained that because affected machines crashed to a blue screen of death (BSOD), each one had to be manually rebooted into Safe Mode to remove the update. And it appears this major outage was the result of just one or two lines of code.
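
For context on why that manual fix was so painful: the widely published workaround involved booting each affected machine into Safe Mode or the Windows Recovery Environment and deleting the faulty channel file from the CrowdStrike driver directory. The sketch below only illustrates that cleanup step; the directory and file pattern are the ones reported publicly at the time, and organizations should follow the vendor's official remediation guidance rather than a script like this.

```python
from pathlib import Path

# Directory and file pattern from public remediation guidance at the time;
# treat these as illustrative, not authoritative.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"

def remove_faulty_channel_file() -> None:
    """Intended to be run from Safe Mode / WinRE on an affected host."""
    if not CROWDSTRIKE_DIR.exists():
        print("CrowdStrike directory not found; nothing to do.")
        return
    matches = list(CROWDSTRIKE_DIR.glob(FAULTY_PATTERN))
    if not matches:
        print("No matching channel files found.")
        return
    for f in matches:
        print(f"Deleting {f}")
        f.unlink()

if __name__ == "__main__":
    remove_faulty_channel_file()
```

Even with a script like this, someone still had to reach each machine that could not boot normally, which is exactly Abernathy's point about the cost of an update that cannot be held back and tested first.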

She said that if she were leading an IT department right now, she would be checking all enterprise applications, operating systems, and other software to see whether automatic updates are enabled (and whether they can be turned off).
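
Part of that audit can be automated. As one narrow example, the sketch below reads the Windows Update group-policy registry keys on a single machine to see whether automatic updates are configured. This covers only Windows Update itself; every third-party agent, including EDR sensors, has its own update mechanism that would need to be checked through its own console or documentation.

```python
import winreg  # Windows-only standard library module

# Group Policy location for Windows Update automatic-update settings.
AU_KEY = r"SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU"

def read_policy_value(name: str):
    """Return a value from the AU policy key, or None if it is not set."""
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, AU_KEY) as key:
            value, _ = winreg.QueryValueEx(key, name)
            return value
    except FileNotFoundError:
        return None

if __name__ == "__main__":
    no_auto_update = read_policy_value("NoAutoUpdate")  # 1 = automatic updates disabled by policy
    au_options = read_policy_value("AUOptions")         # 2=notify, 3=auto download, 4=auto install
    print(f"NoAutoUpdate policy: {no_auto_update}")
    print(f"AUOptions policy:    {au_options}")
```

Run across a fleet with whatever remote-execution tool is already in place, a check like this at least tells an IT leader where automatic updating is silently enabled.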

Dr. Hernan Murdock: Comprehensive Resilience Programs

Dr. Hernan Murdock focused on the broader implications of the meltdown, noting that this incident affected multiple industries and geographies simultaneously. He stressed the importance of up-to-date resilience programs, including Crisis Management, Risk Management, Cyber Resilience, and Emergency Response Plans (ERP).

"Your Employee Assistance Program (EAP), Compliance, and Ethics Programs should include mechanisms to support employees during and after a disaster," Dr. Murdock added. "Resilience is about people, processes, and technology, not only IT."

Dr. Murdock's insights remind us that resilience planning must be holistic, encompassing all aspects of an organization and its workforce.

The Technado Crew: Broader Implications and Vigilance

Sophie Goodwin painted a picture of the real-world impact of the CrowdStrike outage. "It wasn’t just a tech issue but real-world problems like people being stuck in airports," Sophie explained. She also raised concerns about the potential consequences if a similar incident occurred on a larger scale, highlighting the need for robust contingency planning.

Daniel Lowrie delved into the technical side, urging caution about over-reliance on a single security provider. He praised CrowdStrike's CEO for taking responsibility but warned about the increase in phishing attempts following such incidents. Daniel's advice to be vigilant against scammers trying to exploit the situation is a timely reminder for all organizations.

Moving Forward: Building Stronger Resilience

The CrowdStrike and Microsoft incident has underscored the need for robust and comprehensive resilience strategies. Here are key steps organizations should take:

  1. Implement rigorous patch management: Always test patches in a controlled environment before full deployment.
  2. Regularly update resilience programs: Ensure all crisis management, risk management, and resilience plans are current and comprehensive.
  3. Support employees: Develop mechanisms to support employees during crises, including counseling and financial assistance.
  4. Remain vigilant against threats: Be aware of increased phishing attempts and other scams following major incidents (a simple lookalike-domain check is sketched after this list).
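
On the last point, one small, concrete defense is to flag domains that look suspiciously similar to a trusted vendor's real domain, since lookalike registrations tend to spike right after a high-profile incident. The sketch below uses only the Python standard library; the trusted-domain list and the similarity threshold are illustrative choices, not a complete anti-phishing control.

```python
from difflib import SequenceMatcher

# Illustrative allow-list; a real deployment would pull this from policy or config.
TRUSTED_DOMAINS = ["crowdstrike.com", "microsoft.com"]
SIMILARITY_THRESHOLD = 0.75  # tune for your own false-positive tolerance

def looks_like_impersonation(domain: str) -> bool:
    """Flag domains that closely resemble, but do not equal, a trusted domain."""
    domain = domain.lower().strip()
    for trusted in TRUSTED_DOMAINS:
        if domain == trusted or domain.endswith("." + trusted):
            return False  # the real domain (or a legitimate subdomain)
        if SequenceMatcher(None, domain, trusted).ratio() >= SIMILARITY_THRESHOLD:
            return True
    return False

if __name__ == "__main__":
    for d in ["crowdstrike.com", "crowdstrike-helpdesk.com", "crovvdstrike.com", "example.org"]:
        print(f"{d}: {'suspicious' if looks_like_impersonation(d) else 'ok'}")
```

A check like this can sit in a mail-filtering or URL-review workflow as one extra signal; it does not replace user awareness, which is the behavior Daniel is urging.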

By learning from this meltdown, organizations can strengthen their defenses and be better prepared for future challenges.

ACI Learning
