These posts are always awesome
Really good read. As always from a technical perspective they are fascinating.
Cloudflare also do a great job at these and I wish more companies would!
My inner nerd is satisfied. Thank you
I didn’t even notice Monzo was down on that day and now it makes sense. Rapid response and logical incident management! This was a great technical update with clear learning objectives for the future
Thanks for this. Clear explanation, and an impressively short timeline for the incident investigation and fix (IMO).
I didn’t understand all of the details but really appreciate the level of detail and the industry-leading transparency.
This is what we came for
Plus can blame DBAs rather than software engineers
This is why I love Monzo, no BS, just honesty and integrity to its customers. Keep up the fantastic work!
I struggled with some areas, but that was a genuinely interesting read. Thanks for the post.
These posts are the entire reason I, and my business, use Monzo for as many Banking services as possible. It is not that you don’t make mistakes, it’s that when you do, you own up to them and lay it out for all to see.
It is also that you share some knowledge of how your systems work, as an Engineer, that builds trust. You are making decisions that I would make (and mistakes that I could see myself making).
Thanks for the detailed analysis and breakdown Monzo.
How do you manage to deal with responsibility in these circumstances? Even if it’s shared across the team, the engineer finding out they have turned off the flag wrongly might feel bad and take ownership for the outage personally.
Is there any debrief session where the person involved can share their feelings and/or the team can clear them of responsibility for the accident?
They probably get out the branding iron
I’m kidding, I know what you’re saying
I’m not technical at all and I was able to follow that
Great job with the explanation
Excellent explanation. I’m impressed by the way you diagnosed the fault, particularly under considerable pressure. I think it does, however, highlight a deficiency in your test regime. You can never test every eventuality, but this one was quite simple, i.e. why not add the 3 servers to your test environment first? The timing of the change also raises questions about the project management and risk assessment: is Friday afternoon a great time to implement such a large change? I’m a great Monzo fan, but having spent many years working in enterprise solutions, I get the feeling that your IT professionals are maybe more focused on development than on service delivery.
The outage was on Monday (but I think your question still stands). I’d imagine that for a bank there is no good time for a large change. Monday lunchtimes may even be the quietest time of the week, when all engineers will be on hand in case something does go wrong.
You don’t want something breaking at 11 pm, when most engineers are asleep and people can be left stranded if their card doesn’t work.
Wasn’t it Monday 29th July?
Also, the plan was just to add the new servers (not switch them on). So no operational changes were planned
I’m sure you guys have figured this out already, but a tip from working with Cassandra in the past: if you’re doing any manual bootstrapping, ensure you set
false and change the state of the new node manually with
nodetool only after you’ve performed some sanity checks and are 100% sure all the data has been streamed in.