We had issues with Monzo on 29th July. Here's what happened, and what we did to fix it

Wow! Fascinating stuff, thanks for the detailed run down :nerd_face:

It's refreshing to see a company being so open and detailed about issues. Great work Monzo :monzo:

I was with Barclays up until moving to Monzo and had endless issues with their app on the Galaxy S10. I got the occasional text to say they were aware, pointing the finger at Samsung, but nothing beyond that, and to be honest it's still not fixed to this day.

1 Like

(context: I am an engineer on the Platform Team at Monzo and I was working on this)

The change was indeed on Monday 29th July. We elected to do it after lunchtime so there would be sufficient time to do the remainder of the steps during office hours. We want to do these changes in office hours so if help is needed at any point, there are plenty of people available.

The testing is definitely something we are evaluating. In this instance, adding 6 nodes in a controlled environment wouldn’t have been enough by itself to identify this issue. We would’ve also needed constant querying of a wide variety of data partitions to notice differences in the response patterns. It may have identified some of the issues beforehand though, and arguably made us even more confident in the procedure (simply by repeating it a number of times).

We’ve already started rolling out services on the back of this to do continuous coherence checking: constantly reading and writing a range of different pieces of simulated data and checking whether the consistency/durability guarantees we give to engineers are being broken. We’ve been busy over the past week running these coherence checks through a variety of failure scenarios in our sandbox (losing quorum, moving partitions etc.) to ensure alerts fire. We will continue to refine and add to this.
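The implementation details aren't in the post, so just to illustrate the shape of such a checker, here's a minimal sketch in Go. The `Store` interface, the in-memory stand-in, the key names and the 30-second interval are all invented for the example; in practice the reads and writes would go against the real cluster at the same consistency levels services rely on, and failures would feed a metric that alerts.

```go
package main

import (
	"fmt"
	"log"
	"sync"
	"time"
)

// Store is a stand-in for whatever the real checker would talk to
// (e.g. quorum reads/writes against a dedicated keyspace).
type Store interface {
	Write(key, value string) error
	Read(key string) (string, error)
}

// memStore is an in-memory Store so the sketch runs on its own.
type memStore struct {
	mu   sync.Mutex
	data map[string]string
}

func (m *memStore) Write(key, value string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
	return nil
}

func (m *memStore) Read(key string) (string, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.data[key]
	if !ok {
		return "", fmt.Errorf("key %q not found", key)
	}
	return v, nil
}

// checkOnce writes a fresh value for each simulated partition key and reads
// it straight back, returning the keys where read-your-writes appears broken.
func checkOnce(s Store, keys []string) []string {
	var broken []string
	for _, key := range keys {
		want := fmt.Sprintf("%s@%d", key, time.Now().UnixNano())
		if err := s.Write(key, want); err != nil {
			broken = append(broken, key)
			continue
		}
		if got, err := s.Read(key); err != nil || got != want {
			broken = append(broken, key)
		}
	}
	return broken
}

func main() {
	store := &memStore{data: map[string]string{}}
	// Hypothetical keys, chosen to spread across many partitions/token ranges.
	keys := []string{"sim-partition-001", "sim-partition-002", "sim-partition-003"}

	// Run continuously; in production the failures would feed a metric that alerts.
	for range time.Tick(30 * time.Second) {
		if broken := checkOnce(store, keys); len(broken) > 0 {
			log.Printf("coherence check failed for partitions: %v", broken)
		}
	}
}
```

The value embedded in each write includes a timestamp, so a stale read (rather than just a missing one) also shows up as a mismatch.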

29 Likes

(context: I am an engineer on the Platform Team at Monzo and I was working on this)

Is there any debrief session where the people involved can share their feelings and/or the team can be cleared of responsibility for an accident? :love_letter:

We do debrief sessions and yes, the engineers involved will be present (most of our debriefs are open to the entire company and are remote-friendly and recorded). The aim of the debrief is to get a clear understanding of the timeline / sequence of events, especially as there will be lots of people from across the company involved who might not have the full depth of technical knowledge of what happened. We want to bring them up to speed so they can collectively understand it better.

We also share technical post-mortems internally. In fact, the internal timeline and post-mortem for this incident is very similar to the one in the blog post.

We always strive to avoid blame or leading questions like “Why weren’t you more careful?” or “Why didn’t you think of this beforehand?”. The possibility of human error will be present in all actions. We aim to capture the incident and understand how we can build better systems, tools and processes, so we have safety whilst still being able to conduct changes like this and constantly iterate on our technology.

If you want to learn more about the methodology of running good investigations and debriefs and coming up with good actions, a great book is ‘The Field Guide to Understanding Human Error’ by Sidney Dekker (big credit to the amazing @anon61228674 who recommended it to me) :pray:

35 Likes

Very interesting article. Truly.

I think you have realised this yourself, but can I suggest the time you chose to do this was not the right one? Even though the developers might have been after lunch and had a lot of time ahead of them, 13:10 is still lunchtime for everyone else.

The time for the update should not be decided in favour of the developers' timeline, but the customers' timeline, maybe when customers are asleep.

The impact would have been very different if all this had happened at 01:10am.

Regards,

Yeah, that could have been absolutely dreadful instead of just bad

Direct Debits and BACS failing instead of some card payments!

Wider point taken though

1 Like

Still, Direct Debits can be processed at any time of the day; as long as they are processed within that working day, that will be fine.

Waiting in front of the cashier for the transaction to be processed, that's the easy case. Or even worse, paying at a restaurant, after you have eaten, with only Monzo in your pocket.

1 Like

With anything that may be used 24/7, there are no good times for failures, only least-worst times.

I don’t necessarily think Monzo got their timing wrong in this case. Say they did roll it out at 1am. That’s at least one member of engineering staff who has to work antisocial hours to push the change. Say it all looks fine, so they clock off and go to sleep before problems manifest. That’s potentially a big increase in the window before an alarm is raised. Then you only have on-call staff troubleshooting the issue, not the whole company. And they’re possibly not running at full cognitive capacity due to being sleepy and/or having just been woken up. Are they going to troubleshoot and find the issue any quicker? I’m not sure they will.

As has already been mentioned, overnight problems could also have affected DDs, which could result in bigger problems.

3 Likes

Yes, I think overnight when the ‘normal’ operating teams are all out of action could have led to a much larger problem. Working hours when everyone’s around to assist is definitely the way to go with this stuff.

1 Like

It absolutely isn’t. On occasion, when making big changes, you need staff in overnight.

3 Likes

Thanks Monzo! Posts like these are great. I really like the transparency :slight_smile:

Great post Chris! I really appreciate your candour, twice over:

Firstly, as the Cassandra SRE in my company, I’ve fallen foul of auto_bootstrap: false too.
And secondly, as a Monzo customer myself, this transparency is what truly sets Monzo apart.

4 Likes

Welcome (back) to the community!

Oh, wow. As a former system administrator who has served on teams trying to figure out what was causing customer-facing downtime, I found this a gripping and suspenseful read! Was very pleased to learn the solution to the puzzle and to enjoy the happy ending! Of such adventures are more robust testing environments built!

5 Likes

However, at night the ‘load’ on the infrastructure is lower. This is why a lot of banks do maintenance at night.
NatWest turn off the app for 40 minutes every night.
Card payment processors do maintenance at night.
It obviously impacts staff more, especially if they are daytime workers, but generally the customer has less impact.
As long as there is a clear plan of what to do if things don’t go correctly, then doing it at night is better from a customer’s point of view, on the whole.

As a night worker I can get frustrated that maintenance in financial institutions is carried out at night, but it’s a fact of life that most people in society do not, and probably never will, work a night shift.

1 Like

I bank with you because you wrote things like this

5 Likes

Hi,

When these kinds of changes are taking place, do the engineers raise a change control request that contains info e.g. what it is meant to do, how they will do it, any problems that may arise (and if so, what services may be affected), and finally what the rollback procedure is?

Also, why was the engineer who started the change not involved with the group of engineers to start with?

Thanks
Tom

1 Like

On your first question(s), I think what you are describing is the same thing they are calling runbooks.

The blog highlights that those runbooks were not as complete as they could be for the Cassandra actions.

I can’t recommend switching to Spanner highly enough, if possible. Scaling, upgrades and performance are all issues that will appear again and again unless you manage to create something like Spanner yourself, especially when becoming a global bank and trying to achieve global consistency for some data. The cost of creating and running a large-scale database yourself is way higher than getting a cloud product, mostly because of the maintenance, downtime and brand damage costs.

Given that Spanner only appears to be a couple of years old (and that’s just from Googling it for the first time), it is not likely to be out of the woods itself.

What do you believe it offers over Cassandra that would prompt such a shift of technologies on the part of Monzo? What makes you think that their existing Cassandra cluster in AWS is not a cloud product?