Five lessons from the CrowdStrike Windows IT outage
Embarrassingly for the cybersecurity community, the largest IT outage in history was caused not by hacktivists or criminals but by one of their own.
Friday’s global IT outage, which grounded planes, shut down TV stations, disrupted payments, and cancelled surgeries, was blamed on a faulty update to CrowdStrike’s EDR tool Falcon running on Microsoft’s Windows operating system.
The defect caused Windows computers with Falcon installed to crash without fully loading. Microsoft has revealed that the outage affected 8.5 million devices — or 1% of Windows computers worldwide.
A fix and an apology from CrowdStrike CEO George Kurtz followed, as the error wiped 13% off the Texan firm’s share price.
And yet there was no ‘quick fix’ for many, with companies reporting that they had to work through every single device and manually reboot each one in ‘safe mode’.
As businesses focus on recovery, TechInformed shares the key takeaways from this outage (so far!) and what’s likely to happen in the months to come.
1. Endpoint security now under the microscope
In theory, endpoint detection and response (EDR) tools like CrowdStrike’s Falcon are a no-brainer in an enterprise’s cyber defence armoury thanks to their ability to immediately resolve or suspend services if malicious activity is detected. However, the outage highlighted that their deployment is not without risk.
Elliott Wilkes, CTO of Advanced Cyber Defence Systems, explains that the Falcon software running on end-user devices, known as an “agent”, operates in a similar fashion to classic antivirus software on a desktop computer.
“Because agent-based detection systems often require enhanced or even administrator-level privileges to conduct monitoring of computer activity to detect malicious code, they are integrated into critical components of the operating system of the end-user devices,” he says.
The consequences of a faulty update to such an agent on Windows are now plain to see, adds Wilkes: “End-user devices getting stuck in a reboot loop, on a screen that’s known as the ‘blue screen of death’ [BSOD]. Ultimately, the likelihood of these events is small, but the impact, as we can see today, is tremendous.”
Getting stuck without the ability to reboot has been a widely reported issue. “As of now, fixing this issue requires manual, hands-on keyboard work — in some cases, for hundreds of thousands of affected machines,” claims Omer Grossman, CIO at CyberArk.
Hands on deck
Danny Jenkins, CEO and co-founder of ThreatLocker, an Orlando-based cybersecurity firm providing zero-trust endpoint security, said the challenge is greater for organisations with a large number of remote and dispersed workers.
“Getting technical support from external sources might be the only option,” says Jenkins.
“We’re doing stuff in the community, in places like hospitals, where we can send our teams out to just get their hands on the keyboard because, unfortunately, the big risk is getting users to self-remediate, which means giving users administrative passwords, which could have security implications later on.”
Jenkins also reports that BitLocker, a Windows security feature that encrypts the hard drives of laptops, is slowing down recovery.
“It’s an important feature. But when you restart your computer in safe mode, BitLocker must be disabled. To disable it, you must enter a long character key. If you get one character wrong, you will need to restart. And the other issue is that every computer will have a different recovery key.”
The fear is that the hassle caused by the outage will lead some firms to disable their EDR tools altogether – an ill-advised move given that hackers are gleefully jumping over the debris this week, sizing up which outage-hit firms will make the easiest targets.
Jenkins adds: “We’re in a state of heightened security right now. The world has seen more cyber-attacks in the last 12 months than ever before. So, turning off security products, while sometimes necessary, is also a bad thing,” he warns.
According to Neatsun Ziv, CEO at Ox Security, one lesson learned is the importance of choosing a vendor that can protect an organisation’s server as a distinct and valuable portion of the network, “separate from endpoints,” especially in critical operations.
“Endpoint devices may need resetting in this kind of scenario, but if the server also needs resetting, it becomes a more complex fix,” he explains.
“Taking the example of an ATM connected to an affected server, this may require a manual reset by an engineer, which for the large financial organisations currently affected could mean hours or days of downtime for key services.”
Echoing this, Jenkins adds, “What we’ve learned with the servers is that sometimes less is more, especially with something that’s auto-updating because these servers that are operating airports and hospitals cannot afford to go down.”
2. Cybersecurity vendors must win back trust
“The antivirus was the virus”, crowed Elon Musk on his social media platform X, next to an image of a CrowdStrike ad promoting its 2024 Global Threat Report.
And, for all the damage the outage has done to the global economy (experts are predicting a billion-dollar bill), the most valuable thing to have been lost in the multibillion-dollar cybersecurity sector is trust, a word that is used repeatedly in many a cybersecurity firm’s marketing materials and products.
CrowdStrike’s CSO and former FBI agent Shawn Henry acknowledged this in a LinkedIn post that followed the incident.
“On Friday, we failed you, and for that, I’m deeply sorry…The confidence we built in drips over the years was lost in buckets within hours, and it was a gut punch.”
A lack of faith in cyber products is something that’s likely to impact the entire cyber community for months to come. CTOs and CIOs who are already trying to convince boards to invest more in security tooling now have a greater task.
Board funding
“I’m concerned about the impact on CSOs getting funding for future endpoint security tools,” says ThreatLocker’s Jenkins.
“Obviously, boards and finance directors never want to spend money, and security threats are higher than ever.”
Jenkins suggests, however, that executives put things into perspective: “While this week’s events have been catastrophic, we’re talking about single digits of a percentage of computers that have been taken down.
“If we compare that to ransomware attacks over the last five years or 10 years, there have been far more endpoints taken down by ransomware attacks.”
He continues: “An attack is much worse than an outage because you are talking about a scenario where you spend hours per device rather than 15 minutes for an outage. Plus, your data is all over the internet. I’ve seen businesses completely fail over that. So, while this is a pain, it’s not as bad as the alternative.”
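To put Jenkins’ comparison into rough numbers, the sketch below multiplies out recovery effort for a fleet of machines. The fleet size of 1,000 devices and the figure of three hours per device after an attack are illustrative assumptions only; Jenkins gives “15 minutes” for an outage and “hours per device” for an attack, without specifying how many.

```python
def person_hours(devices: int, minutes_per_device: float) -> float:
    """Total hands-on recovery effort, in person-hours, for a fleet of devices."""
    return devices * minutes_per_device / 60


# Hypothetical fleet size, chosen purely for illustration.
fleet_size = 1_000

# 15 minutes per device is the figure Jenkins quotes for an outage.
outage_hours = person_hours(fleet_size, 15)

# "Hours per device" is all the article says about an attack;
# three hours is an assumed value, not a reported one.
attack_hours = person_hours(fleet_size, 3 * 60)

print(f"Outage recovery: {outage_hours:.0f} person-hours")
print(f"Attack recovery: {attack_hours:.0f} person-hours")
```

Under these assumptions, the same fleet needs 250 person-hours of hands-on work after an outage and 3,000 after an attack, before counting the cost of any stolen data.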
He nonetheless understands customers’ concerns. “We’re already answering a lot more questions for our customers who are asking: ‘What do we do to make sure this won’t happen?’ ‘How do we know you’re not pushing 10 updates a day on us because we’re chasing threats so fast?’. And that’s important. And it’s going to have to be well articulated to the businesses.
“And then, honestly, we as vendors must step up. We must make sure we don’t destroy people’s machines.”
3. Are too many relying on too few?
Some commentators have remarked that the incident brings into sharp focus just how large a market share both Microsoft and CrowdStrike have, and question whether concentrating risk in the machines that run our everyday lives – from airlines and banks to telecoms and stock exchanges – is the right way to go. Are we sacrificing resiliency in favour of efficiency and cost? Does this need to change?
The 8.5 million affected devices represent just one per cent of Windows machines worldwide, while CrowdStrike is reported to hold a 24 per cent share of the ‘endpoint security’ market.
Mark Boost, CEO of cloud computing firm Civo, believes that the scale of this outage highlights the risks associated with over-reliance on a single system or provider.
“It’s a sobering reminder that size and reputation do not guarantee invulnerability to significant technical issues or security breaches. Even the largest and most established companies must be vigilant, continuously updating and securing their systems.”
Microsoft’s role
While Microsoft was quick to label the outage as ‘a third party supplier issue’, and has clearly been working around the clock to support affected users, the big tech firm is likely to be held to account in the coming weeks.
Microsoft is, after all, responsible for maintaining the operating system and should be able to make a computer usable when things go wrong.
When a purchase is made, a contract is established with the vendor, not its third-party service providers — surely, it’s Microsoft’s responsibility to ensure all third-party providers meet the mark and do not erode the brand value built over time, isn’t it?
For its part, Microsoft blames the regulators. Specifically, a European Commission antitrust investigation resulted in a 2009 agreement that allows multiple security providers to install software at the kernel level.
In contrast, Apple blocked access to the kernel on its Mac computers in 2020, which it said would improve security and reliability. While this makes it more challenging for third-party developers, there were no sad Macs last Friday.
4. Importance of supply chain monitoring
While CrowdStrike’s CEO assured the public on Friday morning that the outage was not “a security incident or a cyberattack,” its impact is still comparable to that of a major supply chain attack.
Understanding supply chains is vital for operational resilience, as shown by attacks like SolarWinds. IT teams need to comprehend business and tech supplier dependencies to effectively respond to outages from cyber-attacks, human error, or other issues.
Part of this framework, security experts urge, includes the pre-rollout and batch testing of updates and not blindly accepting automatic ones.
As Carlos Aguilar Melchor, chief scientist, cybersecurity at SandboxAQ, says: “It is essential to have visibility on the practices of your software supply chain, which includes how it is updated.
“We all learned from the global SolarWinds catastrophe that we cannot blindly accept updates from software that impacts key systems. This is especially true for software that is commonly used in all big businesses, such as ERPs, CRMs, and, above all, cybersecurity software.”
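As a rough illustration of the batch testing the experts describe, the sketch below pushes an update to small groups of machines first and halts as soon as one group shows failures. It is a minimal sketch, not a description of how CrowdStrike or any other vendor deploys updates: the ring names, host names, and the `fake_apply_update` health check are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def staged_rollout(
    rings: List[Tuple[str, List[str]]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> Tuple[List[str], Optional[str]]:
    """Apply an update ring by ring, stopping at the first unhealthy ring.

    Returns the names of the rings that passed, plus the name of the ring
    that halted the rollout (or None if every ring passed).
    """
    passed: List[str] = []
    for name, hosts in rings:
        # apply_update returns True when the host is healthy after the update.
        failures = sum(1 for host in hosts if not apply_update(host))
        failure_rate = failures / len(hosts) if hosts else 0.0
        if failure_rate > max_failure_rate:
            return passed, name  # later rings never receive the update
        passed.append(name)
    return passed, None


# Hypothetical fleet: least critical machines first, critical servers last.
rings = [
    ("canary", ["test-vm-1", "test-vm-2"]),
    ("pilot", ["office-pc-1", "office-pc-2", "office-pc-3"]),
    ("broad", ["atm-server-1", "checkin-kiosk-1"]),
]


def fake_apply_update(host: str) -> bool:
    # Stand-in for a real deployment plus health check:
    # pretend the update crashes one pilot machine.
    return host != "office-pc-2"


passed, halted_at = staged_rollout(rings, fake_apply_update)
print(passed, halted_at)
```

In this example the rollout stops at the pilot group, so the ATM server and check-in kiosk in the final group are never touched by the faulty update.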
Limiting updates
SandboxAQ colleague Graham Steel, head of cybersecurity product, adds that the outage should spur all companies to put in place systems that will analyse every update before it is allowed into their company, although, he notes, “recent consolidation in the cybersecurity market has increased the risk of this recurring: businesses rely on just a few vendors.”
Not only do companies need to make sure they understand the tools that they are using, but ThreatLocker’s Jenkins argues they would also do well to limit them.
“We advocate and help customers limit the number of tools that they run on their machine to avoid issues like this,” he says.
And rather than pushing out ten updates a day to keep on top of constant threats, Jenkins advises firms to change the way they think about security.
“Why not just lock down systems better in the first place? So, they don’t have to look for every threat. Have better authentication; have things like application analysis — because that’s important when it comes to hardening your environment — and then you don’t need tools that push out ten updates a day,” he reasons.
5. Incident response plan is crucial — and should include doughnuts
Friday’s outage highlights the importance of putting both technical and non-technical controls in place to protect business operations when issues arise.
“This incident demonstrates the need for every organisation to have a robust Incident Response Plan in place that is regularly reviewed and tested to minimise the impact and recover quickly,” says Simon Newman, co-founder of Cyber London and International Cyber Expo Advisory Council member.
Civo’s Boost adds that clear and timely communication is crucial in managing such crises. “Organisations that effectively communicate with their staff and customers during outages can significantly mitigate disruption and maintain trust.”
The importance of looking after the needs of IT teams working 24/7 on recovery also shouldn’t be underestimated, according to Allie Mellen, a principal analyst at Forrester.
“This disruption hit on Friday evening in some geographies, right as people were headed home for their weekend. Tech incidents like this require an all-hands-on-deck approach.
“Support your teams by ensuring that they have adequate support and rest breaks to avoid burnout and mistakes. Clearly communicate roles, responsibilities, and expectations.”
According to ThreatLocker’s Jenkins, if your firm is still experiencing difficulties, the best way to beat the blue screen of death is to keep IT teams fed and watered.
“I would say buy your IT guy doughnuts, and you might get fixed quicker. The IT crowds right now, they’re struggling. They’re underwater. I’ve worked a lot of ransomware recoveries, and when you’re in a single business, it’s a similar scenario. Every device is offline. They’re going to be tired. They’re going to be working. And there’s a lot of IT people who worked very long hours last weekend.”