The Great Blue Screen Of 2024

Yes, it could. But if you have only one cloud environment and only one MDR or SOC, you do leave yourself rather exposed.

Astounding that this could be one update which has, for reasons TBD, been pushed straight into live without proper testing.

That’s where proper process, automated pipelines and good governance should have meant this was never possible in the first place.

No, I do get this, and there is, absolutely, a huge risk in AI which won’t easily go away. HBR recently coined the term “botshit”, which, while I did snigger at it, is probably fair.

Over-reliance on the output the tool gives, and the need to make sure knowledge doesn’t erode as a result.

But I still think that AI assistance could, in future, help reduce the risk of such an event.


It will take about five years after the tech is here for the “replace your employees” hype to go away, and then we can all use it as the very powerful tool it has the potential to be.

Just let the HR sales people get it over with

Kind of like the .com bubble all over again.

Just at a train station now and all the ticket machines have the BSOD.

Strangely all the platform gates are still closed but as soon as you speak to the person on the gateline they just let you through…

One thing to note: this also happened on Linux, causing crashes and machines being left unable to boot. It just affected the much, much smaller number of Linux customers, on two distros:

Scary how it’s not just Windows that has been taken down by their bad updates.


Definitely my experience too. I would say Windows Server is still the standard for most businesses.

I would say the biggest issue with the Blue Screen of 2024 is that it has to be removed manually. It’s reportedly affected between 500M and 600M devices. That’s gonna take a lot of work.


I love that National Lottery is having issues today, like they felt left out yesterday and decided to update just to join in :joy:

A lot easier than it used to be, now that most will have phones with cameras and instant messaging.

I was having constant Visa card declines last night, which seems consistent with Visa being down? Current problems and outages | Downdetector

Mastercard and Amex worked fine

I might be wrong on this, but in this instance, if they’re in a fully managed world (Intune etc.), wouldn’t the better option be to mass-rebuild the machines remotely?

Probably quicker than trying to remote on or ship and replace?

Granted that might not be possible and I’d imagine in that case your second option is probably the least painful.
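
In theory it’d be something like the sketch below (untested; the Graph permissions and filter are assumptions, and of course the device has to be able to check in for the wipe to land, which a boot-looping machine can’t):

```python
# Hypothetical sketch: trigger a full Intune wipe/rebuild for managed Windows
# devices via Microsoft Graph. Token acquisition, paging and error handling
# are simplified; the filter and permissions are assumptions.
import requests

GRAPH = "https://graph.microsoft.com/v1.0"
TOKEN = "..."  # an app token with DeviceManagementManagedDevices.PrivilegedOperations.All

headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# List managed Windows devices (paging via @odata.nextLink omitted for brevity)
resp = requests.get(
    f"{GRAPH}/deviceManagement/managedDevices?$filter=operatingSystem eq 'Windows'",
    headers=headers,
)
resp.raise_for_status()

for device in resp.json().get("value", []):
    # The wipe action resets the device; with keepEnrollmentData=False it comes
    # back as a fresh build that Autopilot/provisioning can then reconfigure.
    wipe_url = f"{GRAPH}/deviceManagement/managedDevices/{device['id']}/wipe"
    body = {"keepEnrollmentData": False, "keepUserData": False}
    r = requests.post(wipe_url, headers=headers, json=body)
    print(device.get("deviceName"), r.status_code)
```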

Fair enough. I’m technical, but not that technical to know the answer to this one.

Will be interesting to see what happens next here. I see one of a few things happening:

  • Nothing. People just fix it and quietly update BCP plans. Likely.
  • Fast, large-scale move to alternative cloud platforms/endpoint software. Unlikely; the cost would be huge and possibly not worth the outlay.
  • Slow-scale move to alternative/in-house-only solutions. Probable.
  • Microsoft buy out CrowdStrike and slowly deprecate it, pushing people towards Sentinel and Defender. Y’know, that’s not totally unthinkable, but I don’t see it.
  • Microsoft work with CrowdStrike to “invest” in it.

Why? Any other vendor could have the same issue. This isn’t the first time something like this has happened; there was a similar bug with an iSCSI driver for VMware that put Windows servers into a boot loop. In fact, the one company you can imagine doing everything they can to make sure this won’t happen again is CrowdStrike.

It’s glaring incompetence that leaves a bad taste; this only happens because of disastrous QA. How can you not test something like this?

Yes, they absolutely could. But people are fickle, and I’ll bet you there are plenty of CIOs/CTOs etc. getting a line from the rest of the board along the lines of

“I don’t care about the fix, we need better solutions”

That’ll be going on. I do think it’s unlikely, though, because of cost and for the reasons you’ve given.

That’d be the same CrowdStrike whose CEO was also at the helm of McAfee when it had a very similar incident in 2010? I’d like to hope strong lessons, industry-wide, can be learnt from this.


Like what, though? It’s been a strong industry standard to test updates before you roll them out worldwide… for, like, two decades or more.

I think the main lesson to be learnt is don’t use CrowdStrike


Nah, it’s more than that. It could’ve happened to any solution. It’s only CrowdStrike because of the sheer dominance of Windows machines and this software.

It really showed just how fragile the infrastructure truly is. There will need to be lessons learned, and personally I’d be astounded if that isn’t what happens next.

It happened to CrowdStrike’s macOS update a few weeks ago, but it wasn’t as severe, because macOS doesn’t allow kernel-level access the way Windows does, where things can really go wrong.

The error in the code was a relatively simple one that basic condition testing should’ve caught. Maybe the industry can look at moving away from kernel-level security systems, but that should only happen once the threats can’t live in the kernel either, as on macOS. Until then we just need people to be competent and test their damn code.
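
To make that concrete, the failure mode being described is roughly this kind of thing. A toy Python analogue only, not the actual driver code (which is kernel-mode C/C++), and the field count is an arbitrary example:

```python
# Toy analogue of the missing check being described: illustrative Python, not
# the real driver code, and the expected field count is an arbitrary example.
from typing import List, Optional

EXPECTED_FIELDS = 12  # what the parser assumes every definition record contains

def read_rule_buggy(fields: List[str]) -> str:
    # Trusts the content file blindly; an out-of-bounds read here is an
    # IndexError, in kernel-mode C it's a crash at boot.
    return fields[EXPECTED_FIELDS - 1]

def read_rule_checked(fields: List[str]) -> Optional[str]:
    # The missing condition: validate the record before indexing into it.
    if len(fields) < EXPECTED_FIELDS:
        return None  # reject the malformed record instead of blowing up
    return fields[EXPECTED_FIELDS - 1]

# The basic condition test in question: feed it a deliberately short record.
malformed = ["x"] * 5
assert read_rule_checked(malformed) is None
try:
    read_rule_buggy(malformed)
except IndexError:
    print("unchecked version falls over on malformed input")
```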


The broken update was a definition update, not an agent or engine update, which is what the configurable update rings apply to. Microsoft Defender gets definition updates multiple times a day too, as do most AVs.
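
For anyone not familiar with the distinction: agent/engine updates typically move through staged rings with a health check between each stage, which is exactly the gating the definition/content channel skipped. A toy sketch of the idea (ring names, thresholds and the fake telemetry are all invented; this is nobody’s real pipeline):

```python
# Toy sketch of ring-gated rollout. Ring names, thresholds and the simulated
# telemetry are invented for illustration only.
import random

RINGS = [
    ("canary",   0.01),   # ~1% of the fleet: internal/test machines first
    ("early",    0.10),
    ("broad",    0.50),
    ("everyone", 1.00),
]

def crash_rate_after_soak(ring: str) -> float:
    """Stand-in for real telemetry: crash dumps, agents dropping offline, tickets."""
    return random.uniform(0.0, 0.002)

def roll_out(update_id: str, max_crash_rate: float = 0.001) -> None:
    for ring, fraction in RINGS:
        print(f"deploying {update_id} to '{ring}' ({fraction:.0%} of fleet)")
        rate = crash_rate_after_soak(ring)  # wait out the soak period, then check health
        if rate > max_crash_rate:
            print(f"halting rollout: crash rate {rate:.3%} in '{ring}'")
            return
    print("rollout complete")

roll_out("agent-update-001")
```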

CrowdStrike’s systems should probably have been able to flag computers going offline more quickly, and the Linux complaints point towards poor QA, but there definitely doesn’t seem to have been a widespread issue with Debian: no one else on that HN thread had or saw that issue with Debian, and I didn’t see anyone mention it elsewhere, so it was probably something in their environment conflicting with CS.

This time execs have someone to blame, but who are they going to blame when it’s ransomware or some in-house software next time? Orgs having to touch each endpoint one by one is an IT failure. Why are digital signage, point-of-sale systems and the like not on PXE boot? Desktops should be disposable: IT should be able to send out a guide on how to reset the computer, and their provisioning will set it back up. It’s 10 clicks to do it from the recovery screen? End users can do that.

If they didn’t do the above, why are workers manually typing out BitLocker keys and commands? Put the keys in a CSV, create a Windows PE USB, and in the startup folder put the CSV and a script that 1) identifies the locked drive, 2) unlocks it, 3) runs the commands, 4) reboots. Then you’re at each machine for under a minute.
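
Very roughly, the prep is something like the sketch below. I’m assuming the keys can be exported to a CSV and that manage-bde is included in the WinPE build; the column name, drive letter and the delete command for the bad channel file are assumptions rather than anything tested:

```python
# Rough, untested sketch: build the WinPE startup script described above from a
# CSV of BitLocker recovery keys. CSV column name, drive letter and the fix
# command are assumptions; manage-bde must be included in the WinPE image.
import csv

CSV_PATH = "recovery_keys.csv"   # e.g. exported from Entra ID / AD
OUTPUT = "startnet.cmd"          # drop into the WinPE image's startup location
# The widely shared manual fix: delete the bad channel file from the OS drive.
# (Drive letter may differ under WinPE.)
FIX_CMD = r"del /f C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys"

with open(CSV_PATH, newline="") as f:
    keys = [row["RecoveryKey"] for row in csv.DictReader(f)]

lines = ["@echo off"]
# Crude but hands-off: try every key until one unlocks the drive. Fine for a
# small fleet; for a big one you'd match on the key protector ID instead.
for key in keys:
    lines.append(f"manage-bde -unlock C: -RecoveryPassword {key} >nul 2>&1")
lines += [FIX_CMD, "wpeutil reboot"]

with open(OUTPUT, "w") as f:
    f.write("\r\n".join(lines) + "\r\n")

print(f"wrote {OUTPUT} with {len(keys)} candidate keys")
```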

Similar can be done for servers on hypervisors.

This time there’s someone to ‘blame’ but next time there probably won’t be.

And yes, it’s not always possible for SCADA or medical devices, but if everything else is planned and tested properly, they can focus more resources there.
