At first, the recent massive enterprise Windows outage made me proud I was a Linux user. However, as I learnt more details, I realised this could happen to anyone. It wasn't Microsoft's mistake, but a mistake made by the security company CrowdStrike.
The Background
Corporations are understandably obsessed with security on their employees' work computers. They could lose user passwords, trade secrets, and many other kinds of sensitive information.
CrowdStrike is a company that provides this kind of security to lots of companies worldwide. It's a glorified antivirus, installed at a very core level of the system (called the kernel layer). Microsoft, obviously, doesn't take this lightly, and requires CrowdStrike to go through intensive testing before its software can be installed on Windows computers.
The problem here is that CrowdStrike wants to push updates out rapidly to respond to any new kinds of malware, because a rapid response is also critical in cybersecurity.
The compromise they have come up with is that the core functionality files will be kept at the kernel level, and will only be updated with great care. Other files, containing descriptions of the malware, will only be read by the kernel-level files, and can be refreshed as soon as CrowdStrike publishes them.
This division of responsibilities seems to have been incomplete, however, as a mistake in a file of the second type caused computers to fail to start up.
What Was The Issue?
The exact mistake was that one of the new files had a new function that expected 21 inputs, while the code that called this function from the kernel sent only 20. A lack of graceful error handling meant that the antivirus - running at kernel level - crashed, bringing the rest of the kernel down with it.
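The actual sensor is a native kernel driver, but the principle is easy to sketch in Python (all names and the field count check below are hypothetical, for illustration only): validate a definition record against the expected field count and fall back gracefully, instead of reading past the end of the data and crashing.

```python
# Hypothetical sketch: a loader that validates a malware-definition
# record before handing it to code that expects a fixed field count.
EXPECTED_FIELDS = 21

def load_definition(record: list) -> list:
    """Return the record if well-formed, else raise a recoverable error."""
    if len(record) != EXPECTED_FIELDS:
        # Refuse the bad file instead of reading past the end of the data,
        # which in a kernel driver would crash the whole machine.
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(record)}")
    return record

def apply_update(record: list) -> bool:
    """Apply an update, skipping malformed records instead of crashing."""
    try:
        load_definition(record)
        return True
    except ValueError:
        return False  # log it and keep using the last known-good file

print(apply_update(list(range(21))))  # well-formed record -> True
print(apply_update(list(range(20))))  # truncated record  -> False
```

The key design choice is that a malformed definition file is treated as a routine, recoverable condition rather than a fatal one.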
The fix was simple. A fixed version of the file was published promptly, and the users just had to boot into recovery mode, delete the offending file, and reboot, allowing the fixed version of the file to be downloaded.
The only problem was that most corporate employees don't have permission to access recovery mode on their work computers, so the IT department had to fix every single machine manually.
This issue also struck many virtual machines, but those were much easier to fix because they're managed automatically anyway.
A lot of games require kernel-level anti-cheat software to be installed, and this incident raises concerns about that as well. It is one thing to risk bricking corporate computers in the interest of security, and quite another to risk bricking personal computers just because the user wants to relax.
How Can We Avoid This?
The incident also highlighted some operational shortcomings in CrowdStrike's processes.
Obviously, a glaring error such as this should have been caught in QA, but it was not - likely because multiple files were modified during development and QA, while only a subset was published.
This also suggests that they don't have a suite of test machines dedicated to running release candidates: no in-development code, just updating from one public release to the next, approximating what a customer's computer would do.
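Such a gate could look something like this sketch (the fleet names and the health check are entirely made up): the exact candidate artifact is installed on every dedicated test machine, and publishing is blocked unless all of them survive.

```python
# Hypothetical sketch of a pre-release gate: before publishing, the exact
# candidate artifact is installed on a clean test fleet, which must pass
# a basic "does the machine still boot and load the driver?" smoke test.

def boots_and_loads_driver(machine: str, artifact: str) -> bool:
    """Stand-in for installing `artifact` on `machine` and rebooting.
    A real gate would drive VMs or physical hardware here."""
    return "bad" not in artifact  # placeholder health check

def release_gate(artifact: str, fleet: list) -> bool:
    """Publish only if every test machine survives the candidate."""
    return all(boots_and_loads_driver(m, artifact) for m in fleet)

fleet = ["win10-vm", "win11-vm", "server2022-vm"]
print(release_gate("update.ok", fleet))   # True: safe to publish
print(release_gate("update.bad", fleet))  # False: block the release
```

The point is that the fleet tests the published artifact itself, not some in-development approximation of it.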
And a final mitigating strategy would have been a phased release, which is actually quite common in the software world. A phased release means that not every user gets the update at the same time: a small group gets it first, and if no issues emerge, the update is made available to progressively larger and larger groups.
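A minimal sketch of how a phased rollout can be implemented (the stage percentages here are illustrative, not CrowdStrike's): hash each machine into a stable bucket from 0 to 99, and only offer the update to machines whose bucket falls below the current stage's percentage.

```python
import hashlib

# Hypothetical sketch of a phased rollout: each machine is hashed into a
# stable bucket in [0, 100), and an update is only offered while the
# machine's bucket is below the current rollout percentage.

ROLLOUT_STAGES = [1, 5, 25, 100]  # percent of the fleet per stage

def bucket(machine_id: str) -> int:
    """Deterministically map a machine to a bucket in [0, 100)."""
    digest = hashlib.sha256(machine_id.encode()).hexdigest()
    return int(digest, 16) % 100

def should_update(machine_id: str, stage: int) -> bool:
    """Offer the update only to machines inside the current stage."""
    return bucket(machine_id) < ROLLOUT_STAGES[stage]

# Stage 0 reaches roughly 1% of machines; the final stage reaches all.
print(should_update("laptop-0042", 3))  # True: the last stage covers 100%
```

Because the bucket is derived from a hash rather than stored state, every machine lands in the same bucket on every check, so the rollout widens monotonically as the stage advances.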