No, I do get this, and there is absolutely a huge risk in AI which won't easily go away. HBR recently coined the term "Botshit", which, while I did snigger at it, is probably apt.
Over-reliance on the data the tool gives, and the need to ensure knowledge doesn't erode as a result.
But I still think AI assistance could, in the future, help reduce the risk of such an event.
It will take about five years after the tech is here for the "replace your employees" hype to go away, and then we can all use it as the very powerful tool it has the potential to be.
One thing to note: this also happened on Linux, causing crashes and machines left unable to boot. It just affected the much, much smaller number of Linux customers, on two distros:
Scary how it's not just Windows that has been taken down by their bad updates.
I would say the biggest issue with the Blue Screen of 2024 is that the fix has to be applied manually. It reportedly affected around 8.5 million devices. That's gonna take a lot of work.
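For reference, the documented per-machine workaround boils down to deleting one file from Safe Mode or the recovery environment. A minimal sketch (the OS volume may be mounted under a letter other than C: inside WinRE):

```powershell
# CrowdStrike's published workaround, as a single command.
# Run from Safe Mode or the Windows Recovery Environment.
Remove-Item "C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys" -Force
```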
I might be wrong on this, but in this instance, if they're in a fully managed world (Intune etc.), wouldn't the better option be to mass-rebuild the machines remotely?
Probably quicker than trying to remote on, or ship and replace?
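Something like this is what I mean; a hedged sketch only, using the Microsoft Graph wipe action via the Microsoft.Graph PowerShell SDK (the device filter and wipe parameters are illustrative, not a recommendation):

```powershell
# Requires admin consent for the privileged-operations scope.
Connect-MgGraph -Scopes "DeviceManagementManagedDevices.PrivilegedOperations.All"

# Enumerate managed Windows devices (filter is illustrative).
$uri = "https://graph.microsoft.com/v1.0/deviceManagement/managedDevices?`$filter=operatingSystem eq 'Windows'"
$devices = Invoke-MgGraphRequest -Method GET -Uri $uri

foreach ($d in $devices.value) {
    # Wipe resets the device; Autopilot re-provisions it on next boot.
    # The catch: a machine stuck in a boot loop never checks in to receive this.
    Invoke-MgGraphRequest -Method POST `
        -Uri "https://graph.microsoft.com/v1.0/deviceManagement/managedDevices/$($d.id)/wipe" `
        -Body @{ keepEnrollmentData = $false; keepUserData = $false }
}
```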
Granted, that might not be possible, and I'd imagine in that case your second option is probably the least painful.
Fair enough. I'm technical, but not technical enough to know the answer to this one.
Will be interesting to see what happens next here. I see one of a few things happening:
1. Nothing. People just fix it and quietly update BCP plans. Likely.
2. Fast, large-scale move to alternative cloud platforms/endpoint software. Unlikely; the cost would be huge and possibly not worth the outlay.
3. Slow-scale move to alternative or in-house-only solutions. Probable.
4. Microsoft buys out CrowdStrike and slowly deprecates it, pushing people towards Sentinel and Defender. Y'know, that's not totally unthinkable, but I don't see it.
5. Microsoft works with CrowdStrike to "invest" in it.
Why? Any other vendor could have this issue. This isn't the first time something like this has happened; there was a similar bug with an iSCSI driver for VMware that put Windows servers into a boot loop. In fact, you can imagine the one company doing everything it can to make sure this won't happen again is CrowdStrike.
Yes, they absolutely could. But people are fickle, and I'll bet you there are plenty of CIOs/CTOs etc. getting a line from the rest of the board along the lines of:
"I don't care about the fix, we need better solutions."
That'll be going on. I do still think it's unlikely, because of cost and for the reasons you've given.
That'd be the same CrowdStrike whose CEO was McAfee's CTO when it had a very similar incident in 2010? I'd like to hope strong lessons, industry-wide, can be learnt from this.
Nah, it's more that it could've happened to any solution. It's only CrowdStrike because of the sheer dominance of Windows machines and this software.
It really showed just how fragile the infrastructure truly is. There will need to be lessons learned, and personally I'd be astounded if that isn't what happens next.
It happened to CrowdStrike's macOS update a few weeks ago, but it wasn't as severe because macOS doesn't allow the kernel-level access that Windows does, where things can really go wrong.
The error in the code was a relatively simple one that basic condition testing should've caught. Maybe the industry can look at moving away from kernel-level security systems, but that only works once the threats can't be in the kernel either, as on macOS. Until then, we just need people to be competent and test their damn code.
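To make that concrete, the kind of pre-release condition testing being argued for doesn't have to be sophisticated. A minimal sketch; the file name and checks here are hypothetical, not CrowdStrike's actual format or pipeline:

```powershell
# Sanity-check a definition/content file before it ships.
$file  = ".\channel-file-candidate.bin"
$bytes = [System.IO.File]::ReadAllBytes($file)

if ($bytes.Length -eq 0) {
    throw "Refusing to ship: file is empty"
}
if (-not ($bytes | Where-Object { $_ -ne 0 })) {
    throw "Refusing to ship: file is all zero bytes"
}
# ...plus actually loading the candidate with the same parser the sensor
# uses, on a real test machine, before pushing it to the world.
Write-Host "Candidate passed basic checks"
```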
The broken update was a definition update, not an agent or engine update, which is what the configurable update rings apply to. Microsoft Defender gets definition updates multiple times a day too, as do most AVs.
CrowdStrike's systems should probably have been able to flag computers going offline more quickly, and the Linux complaints point towards poor QA. But there definitely doesn't seem to have been a widespread issue with Debian: no one else on that HN thread had or saw the issue with Debian, and I didn't see anyone mention it elsewhere, so it was probably something in their environment conflicting with CS.
This time execs have someone to blame, but who are they going to blame when it's ransomware or some in-house software next time? Orgs having to touch each endpoint one by one is an IT failure. Why are digital signage, point-of-sale machines and the like not on PXE boot? Desktops should be disposable: IT should be able to send out a guide on how to reset the computer, and their provisioning will set it back up. It's about 10 clicks to do it from the recovery screen; end users can do that.
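As an aside, the same reset flow the recovery screen exposes can also be launched from a running OS; a minimal sketch (it won't help a machine that's already boot-looping, but it shows how disposable a well-provisioned desktop can be):

```powershell
# Launches the built-in "Reset this PC" flow; with Autopilot or PXE
# provisioning behind it, the machine rebuilds itself afterwards.
systemreset.exe -factoryreset
```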
If they didn't do the above, why are workers manually typing out BitLocker keys and commands? Put the keys in a CSV, create a Windows PE USB, and in the startup folder put the CSV and a script that: 1) identifies the locked drive, 2) unlocks it, 3) runs the fix commands, 4) reboots. Then you're at each machine for under a minute.
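A minimal sketch of that startup script, assuming a hypothetical keys.csv with columns SerialNumber,RecoveryKey, and a WinPE image built with the PowerShell, WMI and BitLocker optional components:

```powershell
# 1. Identify this machine and look up its recovery key in the CSV.
$serial = (Get-CimInstance Win32_BIOS).SerialNumber
$key    = (Import-Csv "$PSScriptRoot\keys.csv" |
           Where-Object SerialNumber -eq $serial).RecoveryKey

# 2. Unlock the BitLocker-protected OS volume
#    (the OS volume may not be C: inside WinPE; adjust as needed).
manage-bde -unlock C: -RecoveryPassword $key

# 3. Run the fix commands: delete the bad channel file.
Remove-Item "C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys" -Force

# 4. Reboot into the repaired OS.
wpeutil reboot
```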
Something similar can be done for servers on hypervisors, for example:
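A hedged sketch of that, using VMware PowerCLI as one example: attach a WinPE ISO carrying the same fix script to each affected VM and restart it. The server name, VM name pattern and datastore path are all illustrative:

```powershell
# Assumes the VMware.PowerCLI module and a boot order that tries CD first.
Connect-VIServer -Server vcenter.example.local

foreach ($vm in Get-VM -Name "win-*") {
    # Mount the remediation ISO on the VM's virtual CD drive.
    Get-CDDrive -VM $vm |
        Set-CDDrive -IsoPath "[datastore1] iso/winpe-fix.iso" -Connected:$true -Confirm:$false
    Restart-VM -VM $vm -Confirm:$false
}
```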
This time there’s someone to ‘blame’ but next time there probably won’t be.
And yes, it's not always possible for SCADA or medical devices, but if everything else was planned and tested properly, IT can focus more resources there.