All three of these components – like the vast majority of our backend – are free, open-source software. Often these projects were started by, or are built upon technology used by, large internet companies. Upgrades are totally within our control, but we generally try to run within a few versions of the latest.
I’m with @anon70107404 on this one. Thank you very much for the detailed explanation.
I have a question if I may?
Following the impact that this change had on the service, are there any plans to run potentially service-affecting changes “out of hours”? Not that there is really any such thing in the banking world, but between the hours of 10pm - 6am I would imagine the traffic would be lower?
As someone that works on backend systems using Docker etc, that was a fascinating read! I love how open, detailed and technical it was
Thaaank you so much for sharing. I’m gonna read it a few times later on, just to make sure I understood it all.
This is a superb insight and I really appreciate you writing it
As someone who has used Docker/Kubernetes (at a much smaller scale), I can see how, when scaled up, it can be very difficult to identify exactly where you have an issue - especially when the breaking change doesn’t manifest itself immediately.
Did nobody tell you guys about the Friday deployments rule?
Excellent post-mortem that increases my confidence in Monzo as a bank, and which was also fun to read as a developer.
Please keep being open like this about your tech stack and future outages as you grow - there are bound to be some
Thanks for the transparency over this. When companies hide facts it only causes users to guess and lose confidence in a service. This post shows Monzo is willing to engage with the community, and it increases confidence and trust in the bank. Please keep up the great work.
Thanks @oliver for this detailed insight - very informative. I recently attended a Google Cloud Platform workshop, so I could make sense of and understand the whole thing.
Not sure Santander would explain all this if things went down for a bit, so thank you Monzo for this. As @anon70107404 said:
An interesting read. I might need to research what exactly ‘Kubernetes’ is, as I feel I’ve been Kuberneted to death as someone not into programming/servers (I know there was a small explanation, but I need to know more!). However, it was a good read nevertheless.
It’s interesting that an hour-and-a-half outage is the worst in Monzo history. In my previous council job, relatively minor IT outages could last all day. Bravo for debugging and fixing it so quickly!
Here is a video as the first step in your research
Fantastic explanation @oliver. Sounds a bit like the payroll incident in The Phoenix Project! (Great IT book if you haven’t read it!) As a fellow infrastructure guy, this maybe got me a little “moist”. Love the transparency too. Sometimes when we have outages, it’s all glossed over - glad to see that’s not the case with Monzo.
I think we’d much prefer that deployment was a routine operation that can be performed frequently and with confidence. To me it makes a lot of sense to invest in ensuring our systems can handle that rather than reducing the frequency of change. I think maintenance windows are generally a bad idea for a few reasons:
- During the night, only a few engineers are likely to be around and alert. While of course we have mechanisms to page engineers and wake them up 24x7, it would be better for everyone to avoid the need to do so.
- Batching changes necessarily increases the “surface area” of each deployment. One of the major advantages of a microservices architecture like ours is that changes are small and isolated, so the risk is lower. When that isolation breaks, we take fixing it very seriously.
- One of our competitive advantages is that our systems can be changed quickly, so we can test new features with users and fix bugs quickly.
- We aim to operate an always-available service. While maintenance windows don’t necessarily mean downtime, they do feel like a slippery slope towards accepting that there is a “less bad” time to take downtime. Especially in the future when we offer accounts internationally, and even now for customers who travel, there is no acceptable time for our platform to be unavailable.
- A platform built on technologies like Kubernetes is inherently dynamic, and topology changes can happen with no human intervention. As such, this problem could have been exposed even if we hadn’t been releasing a new service version at the time.
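The “surface area” point above can be made concrete with a little probability. This is an illustrative sketch only - the 1% per-change fault rate and the batch size of 50 are made-up numbers, not Monzo figures:

```python
# Illustrative sketch: why batching changes into a maintenance window
# increases the "surface area" of each deployment.
# ASSUMPTION: each change independently has a 1% chance of carrying a fault.
P_FAULT = 0.01

def deploy_failure_probability(changes_per_deploy: int) -> float:
    """Probability that at least one change in the deploy is faulty."""
    return 1 - (1 - P_FAULT) ** changes_per_deploy

# Shipping one change at a time: each deploy is low-risk, and when a
# deploy does fail, the culprit is the single change just shipped.
single = deploy_failure_probability(1)    # 1.0%

# Batching 50 changes into one nightly window: the deploy is far more
# likely to contain a fault, and the fault then has to be bisected out
# of all 50 changes at once.
batched = deploy_failure_probability(50)  # ~39.5%

print(f"per-change deploy:  {single:.1%} chance of failure")
print(f"50-change batch:    {batched:.1%} chance of failure")
```

The risk per deploy grows roughly linearly at first, but the diagnostic cost grows too, since every change in the batch is a suspect - which is the isolation advantage of small, frequent deployments the reply describes.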
I absolutely love the explanation as in-depth post mortems are unheard of when it comes to banks and their customers!
I feel like I should be the devil’s advocate here and suggest a slightly different version of the explanation for the less technical types; to run alongside the main one.
For example, I’m currently trying to get my 70 year old Grandad to sign up to Monzo and he, like other Monzo customers, would want to know why he wasn’t able to make payments. He wouldn’t understand the technical explanation, though.
So maybe a less technical version could be available in the future?
As always, great work team!
The less technical version for your Grandad is: they had a problem with a computer and worked hard to fix it as quickly as they could.
Or the car had a broken fan belt and now it’s been fixed.
You use the live system to test things?
Have you never heard of live testing?
All software has bugs - I’d expect regression suites, test servers etc to be used to find them, not the system that holds my money.
This is not a video game. Agile has its place, but ‘release often and let the customers find the bugs’ is not appropriate for a bank.
It’s not hypothetical. Pretty much every large UK bank suffers several downtime issues across one or more services every single week and they very rarely make the news. I use HSBC and Lloyds and often go to their websites to find unscheduled downtime. You only hear about it in the news if the downtime is cross-bank or prolonged.