RESOLVED: Current account payments may fail - Major Outage (27/10/2017)

oliver · 30 October 2017 20:54

All three of these components – like the vast majority of our backend – are free, open-source software. Often these projects were started by, or are built upon technology used by, large internet companies. Upgrades are totally within our control, but we generally try to run within a few versions of the latest.

anon38274058 · 30 October 2017 20:54

I’m with @anon70107404 on this one. Thank you very much for the detailed explanation.

I have a question if I may?

Following the impact that this change had on the service, are there any plans to run potentially service-affecting changes “out of hours”? Not that there is any such thing in the banking world really, but between the hours of 10pm and 6am I would imagine the traffic would be lower?

Thanks!

danfury · 30 October 2017 20:58

As someone that works on backend systems using Docker etc, that was a fascinating read! I love how open, detailed and technical it was

Avishai · 30 October 2017 21:05

Thaaank you so much for sharing. I’m going to read it a few times later on, just so I’m sure I understood it all.

crablab · 30 October 2017 21:20

This is a superb insight and I really appreciate you writing it

As someone that has used Docker/Kubernetes (on a much smaller level) I can see how, when scaled up, it can be very difficult to identify exactly where you have an issue - especially when the breaking change doesn’t manifest itself immediately.

m8tt · 30 October 2017 21:20

Did nobody tell you guys about the Friday deployments rule?

kennygrant · 30 October 2017 21:32

Excellent post-mortem that increases my confidence in Monzo as a bank, which was also fun to read as a developer.

Please keep being open like this about your tech stack and future outages as you grow - there are bound to be some

tim7 · 30 October 2017 21:57

Thanks for the transparency over this. When companies hide facts it only causes users to guess and lose confidence in a service. This post shows Monzo is willing to engage with the community and increases confidence and trust in the bank. Please keep up the great work.

SC95 · 30 October 2017 22:11

Thanks @oliver for this detailed insight, very informative. I recently attended a Google Cloud Platform workshop, so I could make sense of and understand the whole thing.
Not sure Santander would explain all this if things went down for a bit, so thank you Monzo for this. As @anon70107404 said;

Chapuys · 30 October 2017 22:53

An interesting read. I might need to research what exactly ‘Kubernetes’ are as I feel I’ve been Kuberneted to death as someone not into programming/servers (I know there was a small explanation, but I need to know more!). However, it was a good read nevertheless.

It’s interesting that a 1 and a half hour outage is the worst in Monzo history. In my previous council job, relatively minor IT outages could last all day. Bravo for fixing and debugging it so quickly!

SC95 · 30 October 2017 22:57

Here is a video as the first step in your research

mark1 · 31 October 2017 00:15

Fantastic explanation @oliver! Sounds a bit like the payroll incident in The Phoenix Project! (Great IT book if you haven’t read it!) As a fellow infrastructure guy, this maybe got me a little “moist”. Love the transparency also. Sometimes when we have outages, it’s all glossed over, glad to see that’s not the case with Monzo.

oliver · 31 October 2017 08:25

I think we’d much prefer that deployment was a routine operation that can be performed frequently and with confidence. To me it makes a lot of sense to invest in ensuring our systems can handle that rather than reducing the frequency of change. I think maintenance windows are generally a bad idea for a few reasons:

During the night, only a few engineers are likely to be around and alert. While of course we have mechanisms to page engineers and wake them up 24x7, it would be better for everyone to avoid the need to do so.
Batching changes necessarily increases the “surface area” of each deployment. One of the major advantages of a microservices architecture like ours is that changes are small and isolated, so the risk is lower. When that isolation breaks we take fixing it very seriously.
One of our competitive advantages is that our systems can be changed quickly, so we can test new features with users and fix bugs quickly.
We aim to operate an always-available service. While maintenance windows don’t necessarily mean downtime, they do feel a bit like a slippery slope for there to be a “less bad” time to take downtime. Especially in the future when we offer accounts internationally, and even now for customers who travel, there is no acceptable time for our platform to be unavailable.
A platform built on technologies like Kubernetes is inherently dynamic, and topology changes can happen with no human intervention. As such, this problem could have been exposed even if we hadn’t been releasing a new service version at the time.

warriner · 31 October 2017 08:33

I absolutely love the explanation as in-depth post mortems are unheard of when it comes to banks and their customers!

I feel like I should be the devil’s advocate here and suggest a slightly different version of the explanation for the less technical types; to run alongside the main one.

For example, I’m currently trying to get my 70-year-old Grandad to sign up to Monzo and he, like other Monzo customers, would want to know why he wasn’t able to make payments. He wouldn’t understand the technical explanation, though.

So maybe a less technical version could be available in the future?

As always, great work team!

anon44204028 · 31 October 2017 08:46

The less technical version for your Grandad is they had a problem with a computer and worked hard to fix it as quickly as they could.

anon72173902 · 31 October 2017 09:05

Or the car had a broken fan belt and now it’s been fixed.

DaveTMG · 31 October 2017 09:07

You use the live system to test things?

m8tt · 31 October 2017 09:10

Have you never heard of live testing?

DaveTMG · 31 October 2017 09:13

All software has bugs - I’d expect regression suites, test servers etc to be used to find them, not the system that holds my money.

This is not a video game. Agile has its place, but ‘release often and let the customers find the bugs’ is not appropriate for a bank.

tommy5dollar · 31 October 2017 09:24

It’s not hypothetical. Pretty much every large UK bank suffers several downtime issues across one or more services every single week and they very rarely make the news. I use HSBC and Lloyds and often go to their websites to find unscheduled downtime. You only hear about it in the news if the downtime is cross-bank or prolonged.
