The Great Blue Screen Of 2024

Yes, it could. But if you have only one cloud environment and only one MDR or SOC, you do leave yourself rather exposed.

Astounding that this could be one update which has, for reasons TBD, been pushed straight into live without proper testing.

That’s where proper process, automated pipelines and good governance should have meant this was never possible in the first place.

No, I do get this, and there is, absolutely, a huge risk in AI which won’t easily go away. HBR recently coined the term “botshit”, which, while I did snigger at it, is probably fair.

Over-reliance on the output the tool gives, and the need to make sure knowledge doesn’t erode as a result.

But I still think that AI assistance could, in future, help reduce the risk of such an event.


It will take about five years after the tech is here for the “replace your employees” hype to go away, and then we can all use it as the very powerful tool it has the potential to be.

Just let the HR sales people get it over with

Kind of like the .com bubble all over again.

Just at a train station now and all the ticket machines have the BSOD.

Strangely all the platform gates are still closed but as soon as you speak to the person on the gateline they just let you through…

One thing to note: this also happened on Linux, causing crashes and machines being left unable to boot. It just affected the much, much smaller number of Linux customers, on two distros:

Scary how it’s not just Windows that has been taken down by their bad updates.


Definitely my experience too. I would say Windows Server is still the standard for most businesses.

I would say the biggest issue with the Blue Screen of 2024 is that it has to be removed manually. It’s reportedly affected between 500M and 600M devices. That’s gonna take a lot of work.


I love that National Lottery is having issues today, like they felt left out yesterday and decided to update just to join in :joy:

A lot easier than it used to be, now that most will have phones with cameras and instant messaging.

I was having constant Visa card declines last night, which seems consistent with Visa being down? Current problems and outages | Downdetector

Mastercard and Amex worked fine

I might be wrong on this, but in this instance, if they’re in a fully managed world (Intune etc.), wouldn’t the better option be to mass-rebuild the machines remotely?

Probably quicker than trying to remote on or ship and replace?

Granted that might not be possible and I’d imagine in that case your second option is probably the least painful.
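
In theory it’d be something like the sketch below (untested; the Graph permissions and filter are assumptions, and of course the device has to be able to check in for the wipe to land, which a boot-looping machine can’t):

```python
# Hypothetical sketch: trigger a full Intune wipe/rebuild for managed Windows
# devices via Microsoft Graph. Token acquisition, paging and error handling
# are simplified; the filter and permissions are assumptions.
import requests

GRAPH = "https://graph.microsoft.com/v1.0"
TOKEN = "..."  # an app token with DeviceManagementManagedDevices.PrivilegedOperations.All

headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# List managed Windows devices (paging via @odata.nextLink omitted for brevity)
resp = requests.get(
    f"{GRAPH}/deviceManagement/managedDevices?$filter=operatingSystem eq 'Windows'",
    headers=headers,
)
resp.raise_for_status()

for device in resp.json().get("value", []):
    # The wipe action resets the device; with keepEnrollmentData=False it comes
    # back as a fresh build that Autopilot/provisioning can then reconfigure.
    wipe_url = f"{GRAPH}/deviceManagement/managedDevices/{device['id']}/wipe"
    body = {"keepEnrollmentData": False, "keepUserData": False}
    r = requests.post(wipe_url, headers=headers, json=body)
    print(device.get("deviceName"), r.status_code)
```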

Fair enough. I’m technical, but not that technical to know the answer to this one.

Will be interesting to see what happens next here. I see one of a few things happening:

  • Nothing. People just fix it and quietly update BCP plans. Likely.
  • Fast, large-scale move to alternative cloud platforms/endpoint software. Unlikely; the cost would be huge and possibly not worth the outlay.
  • Slow-scale move to alternative/in-house-only solutions. Probable.
  • Microsoft buy out CrowdStrike and slowly deprecate it, pushing people towards Sentinel and Defender. Y’know, that’s not totally unthinkable, but I don’t see it.
  • Microsoft work with CrowdStrike to “invest” in it.

Why? Any other vendor could have the same issue. This isn’t the first time something like this has happened; there was a similar bug with an iSCSI driver for VMware that put Windows servers into a boot loop. In fact, the one company you can imagine doing everything they can to make sure this won’t happen again is CrowdStrike.

It’s glaring incompetence that leaves a bad taste; this only happens because of disastrous QA. How can you not test something like this?

Yes, they absolutely could. But people are fickle, and I’ll bet you there are plenty of CIOs/CTOs etc. getting a line from the rest of the board along the lines of

“I don’t care about the fix, we need better solutions”

That’ll be going on. I do think it’s unlikely, though, because of cost and for the reasons you’ve given.

That’d be the same CrowdStrike whose CEO was also at the helm of McAfee when it had a very similar incident in 2010? I’d like to hope strong lessons, industry-wide, can be learnt from this.


Like what, though? It’s been a strong industry standard to test updates before you roll them out worldwide… for, like, two decades or more.

I think the main lesson to be learnt is don’t use CrowdStrike


Nah, it’s more than that. It could’ve happened to any solution. It’s only CrowdStrike because of the sheer dominance of Windows machines and this software.

It really showed just how fragile the infrastructure truly is. There will need to be lessons learned, and personally I’d be astounded if that isn’t what happens next.

It happened to CrowdStrike’s macOS update a few weeks ago, but it wasn’t as severe, because macOS doesn’t allow kernel-level access the way Windows does, where things can really go wrong.

The error in the code was a relatively simple one that basic condition testing should’ve caught. Maybe the industry can look at moving away from kernel-level security systems, but that should only happen once the threats can’t live in the kernel either, as on macOS. Until then we just need people to be competent and test their damn code.
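
To make that concrete, the failure mode being described is roughly this kind of thing. A toy Python analogue only, not the actual driver code (which is kernel-mode C/C++), and the field count is an arbitrary example:

```python
# Toy analogue of the missing check being described: illustrative Python, not
# the real driver code, and the expected field count is an arbitrary example.
from typing import List, Optional

EXPECTED_FIELDS = 12  # what the parser assumes every definition record contains

def read_rule_buggy(fields: List[str]) -> str:
    # Trusts the content file blindly; an out-of-bounds read here is an
    # IndexError, in kernel-mode C it's a crash at boot.
    return fields[EXPECTED_FIELDS - 1]

def read_rule_checked(fields: List[str]) -> Optional[str]:
    # The missing condition: validate the record before indexing into it.
    if len(fields) < EXPECTED_FIELDS:
        return None  # reject the malformed record instead of blowing up
    return fields[EXPECTED_FIELDS - 1]

# The basic condition test in question: feed it a deliberately short record.
malformed = ["x"] * 5
assert read_rule_checked(malformed) is None
try:
    read_rule_buggy(malformed)
except IndexError:
    print("unchecked version falls over on malformed input")
```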


The broken update was a definition update, not an agent or engine update, which is what the configurable update rings apply to. Microsoft Defender gets definition updates multiple times a day too, as do most AVs.
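
For anyone not familiar with the distinction: agent/engine updates typically move through staged rings with a health check between each stage, which is exactly the gating the definition/content channel skipped. A toy sketch of the idea (ring names, thresholds and the fake telemetry are all invented; this is nobody’s real pipeline):

```python
# Toy sketch of ring-gated rollout. Ring names, thresholds and the simulated
# telemetry are invented for illustration only.
import random

RINGS = [
    ("canary",   0.01),   # ~1% of the fleet: internal/test machines first
    ("early",    0.10),
    ("broad",    0.50),
    ("everyone", 1.00),
]

def crash_rate_after_soak(ring: str) -> float:
    """Stand-in for real telemetry: crash dumps, agents dropping offline, tickets."""
    return random.uniform(0.0, 0.002)

def roll_out(update_id: str, max_crash_rate: float = 0.001) -> None:
    for ring, fraction in RINGS:
        print(f"deploying {update_id} to '{ring}' ({fraction:.0%} of fleet)")
        rate = crash_rate_after_soak(ring)  # wait out the soak period, then check health
        if rate > max_crash_rate:
            print(f"halting rollout: crash rate {rate:.3%} in '{ring}'")
            return
    print("rollout complete")

roll_out("agent-update-001")
```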

CrowdStrike’s systems should probably have been able to flag computers going offline more quickly, and the Linux complaints point towards poor QA, but there definitely doesn’t seem to have been a widespread issue with Debian: no one else on that HN thread had or saw that issue with Debian, and I didn’t see anyone mention it elsewhere, so it was probably something in their environment conflicting with CS.

This time execs have someone to blame, but who are they going to blame when it’s ransomware or some in-house software next time? Orgs having to touch each endpoint one by one is an IT failure. Why are digital signage, point-of-sale systems and the like not on PXE boot? Desktops should be disposable: IT should be able to send out a guide on how to reset the computer, and their provisioning will set it back up. It’s 10 clicks to do it from the recovery screen? End users can do that.

If they didn’t do the above, why are workers manually typing out BitLocker keys and commands? Put the keys in a CSV, create a Windows PE USB, and in the startup folder put the CSV and a script that 1) identifies the locked drive, 2) unlocks it, 3) runs the commands, 4) reboots. Then you’re at each machine for under a minute.
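
Very roughly, the prep is something like the sketch below. I’m assuming the keys can be exported to a CSV and that manage-bde is included in the WinPE build; the column name, drive letter and the delete command for the bad channel file are assumptions rather than anything tested:

```python
# Rough, untested sketch: build the WinPE startup script described above from a
# CSV of BitLocker recovery keys. CSV column name, drive letter and the fix
# command are assumptions; manage-bde must be included in the WinPE image.
import csv

CSV_PATH = "recovery_keys.csv"   # e.g. exported from Entra ID / AD
OUTPUT = "startnet.cmd"          # drop into the WinPE image's startup location
# The widely shared manual fix: delete the bad channel file from the OS drive.
# (Drive letter may differ under WinPE.)
FIX_CMD = r"del /f C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys"

with open(CSV_PATH, newline="") as f:
    keys = [row["RecoveryKey"] for row in csv.DictReader(f)]

lines = ["@echo off"]
# Crude but hands-off: try every key until one unlocks the drive. Fine for a
# small fleet; for a big one you'd match on the key protector ID instead.
for key in keys:
    lines.append(f"manage-bde -unlock C: -RecoveryPassword {key} >nul 2>&1")
lines += [FIX_CMD, "wpeutil reboot"]

with open(OUTPUT, "w") as f:
    f.write("\r\n".join(lines) + "\r\n")

print(f"wrote {OUTPUT} with {len(keys)} candidate keys")
```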

Similar can be done for servers on hypervisors.

This time there’s someone to ‘blame’ but next time there probably won’t be.

And yes, it’s not always possible for SCADA or medical devices, but if everything else is planned and tested properly, they can focus more resources there.
