We had issues with Monzo on 29th July. Here's what happened, and what we did to fix it

44 Likes

These posts are always awesome :crown:

5 Likes

Really good read. As always, these are fascinating from a technical perspective.
Cloudflare also do a great job with these, and I wish more companies would!

1 Like

My inner nerd is satisfied. Thank you :blush: :laughing:

6 Likes

I didn’t even notice Monzo was down on that day and now it makes sense. Rapid response and logical incident management! This was a great technical update with clear learning objectives for the future :raised_hands:

1 Like

Thanks for this. Clear explanation, and an impressively short timeline for the incident investigation and fix (IMO).

I didn’t understand all of the details but really appreciate the level of detail and the industry-leading transparency.

1 Like

This is what we came for

Plus you can blame the DBAs rather than the software engineers :wink:

1 Like

This is why I love Monzo: no BS, just honesty and integrity towards its customers. Keep up the fantastic work!

2 Likes

I struggled with some areas, but that was a genuinely interesting read. Thanks for the post.

These posts are the entire reason I, and my business, use Monzo for as many banking services as possible. It is not that you don’t make mistakes, it’s that when you do, you own up to them and lay it out for all to see.
It is also that you share some knowledge of how your systems work; as an engineer, that builds trust. You are making decisions that I would make (and mistakes that I could see myself making).

11 Likes

Thanks for the detailed analysis and breakdown, Monzo.

Interesting read.

How do you manage to deal with responsibility in these circumstances? Even if it’s shared across the team, the engineer who finds out they turned the flag off incorrectly might feel bad and take personal ownership of the outage.

Is there any debrief session where the person involved can share their feelings, and/or where the team can absolve them of responsibility for the accident? :love_letter:

3 Likes

They probably get out the :monzo: branding iron :rofl:

I’m kidding, I know what you’re saying :slight_smile:

5 Likes

I’m not technical at all and I was able to follow that :exploding_head:

Great job with the explanation :tada:

6 Likes

Excellent explanation. I’m impressed by the way you diagnosed the fault, particularly under considerable pressure. I think it does, however, highlight a deficiency in your test regime. You can never test every eventuality, but this one was quite simple: why not add the three servers to your test environment first?

The timing of the change also raises questions about the project management and risk assessment. Is Friday afternoon a great time to implement such a large change? I’m a great Monzo fan, but having spent many years working on enterprise solutions, I get the feeling that your IT professionals are perhaps more focused on development than on service delivery.

2 Likes

The outage was on Monday (but I think your question still stands). I’d imagine that for a bank there is no good time for a large change. Monday lunchtime may even be the quietest time of the week, when all the engineers are on hand in case something does go wrong.

You don’t want something breaking at 11 pm, when most engineers are asleep and people can be left stranded if their card doesn’t work.

1 Like

Wasn’t it Monday 29th July? :sweat_smile:

Also, the plan was just to add the new servers (not switch them on). So no operational changes were planned :slight_smile:

I’m sure you guys have figured this out already, but here’s a tip from working with Cassandra in the past: if you’re doing any manual bootstrapping, make sure you set cassandra.join_ring to false, and only change the state of the new node manually with nodetool after you’ve performed some sanity checks and are 100% sure all the data has been streamed in.
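For anyone doing this by hand, here’s a rough sketch of that gate (not Monzo’s tooling, just an illustration): it assumes the node was started with -Dcassandra.join_ring=false, that nodetool is on the PATH, and that the netstats wording matches your Cassandra version.

```python
# Sketch only: gate `nodetool join` behind basic sanity checks.
# Assumes the node was started with -Dcassandra.join_ring=false
# and that `nodetool` is on the PATH.
import subprocess
import sys

def nodetool(*args: str) -> str:
    """Run a nodetool subcommand and return its stdout."""
    return subprocess.run(
        ["nodetool", *args], capture_output=True, text=True, check=True
    ).stdout

def streaming_finished() -> bool:
    # An idle node prints "Not sending any streams" in `nodetool netstats`;
    # active streaming sessions are listed instead (wording may vary by version).
    return "Not sending any streams" in nodetool("netstats")

if __name__ == "__main__":
    if not streaming_finished():
        sys.exit("Streams still in flight - refusing to join the ring.")
    # Show the ring state (UN/UJ flags, ownership) for a human to eyeball
    # before the new node starts serving requests.
    print(nodetool("status"))
    if input("Happy with the checks? Join the ring now? [y/N] ").strip().lower() == "y":
        # `nodetool join` tells a node started outside the ring to join it.
        nodetool("join")
        print("Join requested.")
```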

6 Likes