We had issues with Monzo on 29th July. Here's what happened, and what we did to fix it

cookywook · 9 August 2019 12:42

anon95339334 · 9 August 2019 12:59

These posts are always awesome

anon95933191 · 9 August 2019 13:04

Really good read. As always from a technical perspective they are fascinating.
Cloudflare also do a great job at these and I wish more companies would!

Ordog · 9 August 2019 13:07

My inner nerd is satisfied. Thank you

itsBigH · 9 August 2019 13:07

I didn’t even notice Monzo was down on that day and now it makes sense. Rapid response and logical incident management! This was a great technical update with clear learning objectives for the future

projectfortytwo · 9 August 2019 13:17

Thanks for this. Clear explanation, and an impressively short timeline for the incident investigation and fix (IMO).

sitereportuk · 9 August 2019 13:27

I didn’t understand all of the details but really appreciate the level of detail and the industry-leading transparency.

SouthseaOne · 9 August 2019 13:30

This is what we came for

Plus can blame DBAs rather than software engineers

Sydogma · 9 August 2019 13:33

This is why I love Monzo, no BS, just honesty and integrity to its customers. Keep up the fantastic work!

anon99402360 · 9 August 2019 13:33

I struggled with some areas, but that was a genuinely interesting read. Thanks for the post.

maxcbc · 9 August 2019 13:38

These posts are the entire reason I, and my business, use Monzo for as many banking services as possible. It is not that you don’t make mistakes, it’s that when you do, you own up to them and lay it out for all to see.
It is also that you share some knowledge of how your systems work; as an engineer, that builds trust for me. You are making decisions that I would make (and mistakes that I could see myself making).

dwdqw12 · 9 August 2019 13:42

Thanks for the detailed analysis and breakdown Monzo.

Interesting read.

Antwan · 9 August 2019 13:48

Interesting read.
How do you manage to deal with responsibility in these circumstances? Even if it’s shared across the team, the engineer finding out they have turned off the flag wrongly might feel bad and take ownership of the outage personally.

Is there any debrief session where the person involved can share their feelings and/or the team can clear them of responsibility for the accident?

Ordog · 9 August 2019 13:56

They probably get out the branding iron

I’m kidding, I know what you’re saying

Rat_au_van · 9 August 2019 14:01

I’m not technical at all and I was able to follow that

Great job with the explanation

Fowlerat · 9 August 2019 14:21

Excellent explanation. I’m impressed by the way you diagnosed the fault, particularly under considerable pressure. I think it does, however, highlight a deficiency in your test regime: you can never test every eventuality, but this one was quite simple, i.e. why not add the 3 servers to your test environment first? The timing of the change also raises questions about the project management and risk assessment: is Friday afternoon a great time to implement such a large change? I’m a great Monzo fan, but having spent many years working in enterprise solutions, I get the feeling that your IT professionals are maybe more focused on development than on service delivery.

Kumnaa · 9 August 2019 14:30

The outage was on Monday (but I think your question still stands). I’d imagine that for a bank there is no good time for a large change. Monday lunchtimes may even be the quietest time of the week, when all engineers will be on hand in case something does go wrong.

Don’t want something breaking at 11 pm when most engineers are asleep and people can be left stranded places if their card doesn’t work.

itsBigH · 9 August 2019 14:31

Wasn’t it Monday 29th July?

Ordog · 9 August 2019 14:35

Also, the plan was just to add the new servers (not switch them on). So no operational changes were planned

danielkza · 9 August 2019 14:40

I’m sure you guys have figured this out already, but a tip from working with Cassandra in the past: if you’re doing any manual bootstrapping, ensure you set cassandra.join_ring to false and change the state of the new node manually with nodetool only after you’ve performed some sanity checks and are 100% sure all the data has been streamed in.
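
A minimal Python sketch of that tip, under stated assumptions: the node is started with the JVM system property `-Dcassandra.join_ring=false`, and an operator later runs `nodetool join` once their own checks pass. The helper names (`join_command`, `join_when_ready`) and the `checks_passed` flag are hypothetical and only for illustration; what counts as a sanity check is left to whoever runs it.

```python
import subprocess

# JVM system property passed at startup so the node does not join the ring.
JOIN_RING_FLAG = "-Dcassandra.join_ring=false"


def join_command(host: str) -> list[str]:
    """Build the nodetool invocation that asks a node to join the ring."""
    return ["nodetool", "-h", host, "join"]


def join_when_ready(host: str, checks_passed: bool, run=subprocess.run) -> bool:
    """Run `nodetool join` only if the operator's sanity checks passed.

    `checks_passed` is a hypothetical flag: this sketch does not decide
    how the checks themselves are performed.
    """
    if not checks_passed:
        return False
    run(join_command(host), check=True)
    return True
```

Passing `run` as a parameter keeps the guard testable without a live cluster; in real use the default `subprocess.run` executes the command.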
