On the 19th of July (last Friday) we had a global IT outage which impacted hospitals, trains, air travel, supermarkets and many other businesses. It appears to have been caused by a small, inconspicuous update to a piece of anti-virus software for Windows operating systems, which ‘broke’ Windows. Although the underlying problem has been fixed, some disruption lingers because of how widespread it was and because the fix seems to need human intervention, most likely at the device itself.
This is not my area of expertise and I’m not going to go into the details of how it happened. However, I do think that it has something to teach us all. I watched with a mix of interest and horror as it unfolded, before the cause was known. We weren’t impacted, thank goodness.
It was bad, but it could have been a lot worse.
First of all, Linux/Unix and Apple devices were not impacted at all, so although the impact was wide, many businesses and services were unaffected, either because they ran non-Windows devices or because they didn’t use CrowdStrike, the software in question (or both). It’s good not to have all of your eggs in one basket. Having choice, although it probably makes things more expensive, gives a bit of breathing space if things go wrong. I think it’s critical to weigh up the pros and cons of this, though, and find the best middle ground.
Secondly, it’s super important to have business continuity plans in place wherever possible. We are so reliant on technology these days, so what happens when it disappears? In the early hours of Friday, when it looked like Microsoft was taking action (with no explanation of what the possible impact was), we started to consider other ways of communicating as a team (Slack/WhatsApp), just in case Teams and email went down. At least we would be able to talk to each other and coordinate things should the worst happen.
In this day and age, cloud providers roll out changes so regularly that we barely notice. These changes are happening all of the time, often to keep us safe, and have likely kept us secure and saved us from many days like Friday in the past (but we haven’t noticed because nothing broke). So there’s a far bigger risk if we don’t get those regular updates and patches, even though in this case the update did cause a problem. This will happen sometimes, and it’s important that we understand that and try to put mitigations in place where possible, prioritising situations where lives are in danger or people are unsafe.
And as a fellow IT person, I feel terrible for the poor soul(s) who put that update out, completely unaware of the chaos it would cause.