We had issues with Monzo on 29th July. Here's what happened, and what we did to fix it

Wow! Fascinating stuff, thanks for the detailed run down :nerd_face:

It's refreshing to see a company being so open and detailed about issues. Great work Monzo :monzo:

I was with Barclays up until moving to Monzo and had endless issues with their app on the Galaxy S10. I got the occasional text to say they were aware, pointing the finger at Samsung, but nothing beyond that, and to be honest it's still not fixed to this day.

1 Like

(context: I am an engineer on the Platform Team at Monzo and I was working on this)

The change was indeed on Monday 29th July. We elected to do it after lunchtime so there would be sufficient time to do the remainder of the steps during office hours. We want to do these changes in office hours so if help is needed at any point, there are plenty of people available.

The testing is definitely something we are evaluating. In this instance, adding 6 nodes in a controlled environment wouldn’t have been enough by itself to identify this issue. We would’ve also needed constant querying of a wide variety of data partitions to notice differences in the response patterns. It may have identified some of the issues beforehand though, and arguably made us even more confident in the procedure (simply by repeating it a number of times).

We’ve already started rolling out services on the back of this to do continuous coherence checking: constantly reading and writing a range of different pieces of simulated data and checking whether the consistency/durability guarantees we give to engineers are being broken. We’ve been busy over the past week running these coherence checks through a variety of failure scenarios in our sandbox (losing quorum, moving partitions etc.) to ensure alerts fire. We will continue to refine and add to this.
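The implementation details aren't in the post, so just to illustrate the shape of such a checker, here's a minimal sketch in Go. The `Store` interface, the in-memory stand-in, the key names and the 30-second interval are all invented for the example; in practice the reads and writes would go against the real cluster at the same consistency levels services rely on, and failures would feed a metric that alerts.

```go
package main

import (
	"fmt"
	"log"
	"sync"
	"time"
)

// Store is a stand-in for whatever the real checker would talk to
// (e.g. quorum reads/writes against a dedicated keyspace).
type Store interface {
	Write(key, value string) error
	Read(key string) (string, error)
}

// memStore is an in-memory Store so the sketch runs on its own.
type memStore struct {
	mu   sync.Mutex
	data map[string]string
}

func (m *memStore) Write(key, value string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
	return nil
}

func (m *memStore) Read(key string) (string, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.data[key]
	if !ok {
		return "", fmt.Errorf("key %q not found", key)
	}
	return v, nil
}

// checkOnce writes a fresh value for each simulated partition key and reads
// it straight back, returning the keys where read-your-writes appears broken.
func checkOnce(s Store, keys []string) []string {
	var broken []string
	for _, key := range keys {
		want := fmt.Sprintf("%s@%d", key, time.Now().UnixNano())
		if err := s.Write(key, want); err != nil {
			broken = append(broken, key)
			continue
		}
		if got, err := s.Read(key); err != nil || got != want {
			broken = append(broken, key)
		}
	}
	return broken
}

func main() {
	store := &memStore{data: map[string]string{}}
	// Hypothetical keys, chosen to spread across many partitions/token ranges.
	keys := []string{"sim-partition-001", "sim-partition-002", "sim-partition-003"}

	// Run continuously; in production the failures would feed a metric that alerts.
	for range time.Tick(30 * time.Second) {
		if broken := checkOnce(store, keys); len(broken) > 0 {
			log.Printf("coherence check failed for partitions: %v", broken)
		}
	}
}
```

The value embedded in each write includes a timestamp, so a stale read (rather than just a missing one) also shows up as a mismatch.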

29 Likes

(context: I am an engineer on the Platform Team at Monzo and I was working on this)

Is there any debrief session where the people involved can share their feelings and/or the team can be cleared of responsibility for an accident? :love_letter:

We do debrief sessions and yes, the engineers involved will be present (most of our debriefs are open to the entire company and are remote-friendly and recorded). The aim of the debrief is to get a clear understanding of the timeline / sequence of events, especially as there will be lots of people from across the company involved who might not have the full depth of technical knowledge of what happened. We want to bring them up to speed so they can collectively understand it better.

We also share technical post-mortems internally. In fact, the internal timeline and post-mortem for this incident is very similar to the one in the blog post.

We always strive to avoid blame or leading questions like “Why weren’t you more careful?” or “Why didn’t you think of this beforehand?”. The possibility of human error will be present in all actions. We aim to capture the incident and understand how we can build better systems, tools and processes, so we have safety whilst still being able to conduct changes like this and constantly iterate on our technology.

If you want to learn more about the methodology of running good investigations and debriefs and coming up with good actions, a great book is ‘The Field Guide to Understanding Human Error’ by Sidney Dekker (big credit to the amazing @anon61228674 who recommended it to me) :pray:

35 Likes

Very interesting article. Truly.

I think you have realised this yourself, but can I suggest the time you chose to do this was not the right one? Even though the developers might have been after lunch and had a lot of time ahead of them, 13:10 is still lunchtime for everyone else.

The time for the update should not be decided in favour of the developers' timeline, but the customers' timeline, maybe when customers are asleep.

The impact would have been very different if all this had happened at 01:10am.

Regards,

Yeah, that could have been absolutely dreadful instead of just bad

Direct Debits and BACS failing instead of some card payments!

Wider point taken though

1 Like

Still, Direct Debits can be processed at any time of the day; as long as they are processed within that working day, that will be fine.

Waiting in front of the cashier for the transaction to be processed, that's the easy case. Or even worse, paying at a restaurant, after you have eaten, with only Monzo in your pocket.

1 Like

With anything that may be used 24/7, there are no good times for failures, only least-worst times.

I don’t necessarily think Monzo got their timing wrong in this case. Say they did roll it out at 1am. That’s at least one member of engineering staff who has to work antisocial hours to push the change. Say it all looks fine, so they clock off and go to sleep before problems manifest. That’s potentially a big increase in the window before an alarm is raised. Then you only have on-call staff troubleshooting the issue, not the whole company. And they’re possibly not running at full cognitive capacity due to being sleepy and/or having just been woken up. Are they going to troubleshoot and find the issue any quicker? I’m not sure they will.

As has already been mentioned, overnight problems could also have affected DDs, which could result in bigger problems.

3 Likes

Yes, I think overnight when the ‘normal’ operating teams are all out of action could have led to a much larger problem. Working hours when everyone’s around to assist is definitely the way to go with this stuff.

1 Like

It absolutely isn’t. On occasion, when making big changes, you need staff in overnight.

3 Likes

Thanks Monzo! Posts like these are great. I really like the transparency :slight_smile:

Great post Chris! I really appreciate your candour, twice over:

Firstly, as the Cassandra SRE in my company, I’ve fallen foul of auto_bootstrap: false too.
And secondly, as a Monzo customer myself, this transparency is what truly sets Monzo apart.

4 Likes

Welcome (back) to the community!

Oh, wow. As a former system administrator who has served on teams trying to figure out what was causing customer-facing downtime, I found this a gripping and suspenseful read! Was very pleased to learn the solution to the puzzle and to enjoy the happy ending! Of such adventures are more robust testing environments built!

5 Likes

However, at night the ‘load’ on the infrastructure is lower. This is why a lot of banks do maintenance at night.
NatWest turn off the app for 40 minutes every night.
Card payment processors do maintenance at night.
It obviously impacts staff more, especially if they are daytime workers, but generally the customer has less impact.
As long as there is a clear plan of what to do if things don’t go correctly, then doing it at night is better from a customer’s point of view, on the whole.

As a night worker I can get frustrated that maintenance in financial institutions is carried out at night, but it’s a fact of life that most people in society do not, and probably never will, work a night shift.

1 Like

I bank with you because you wrote things like this

5 Likes

Hi,

When these kinds of changes are taking place, do the engineers raise a change control request that contains info e.g. what it is meant to do, how they will do it, any problems that may arise (and if so, what services may be affected), and finally what the rollback procedure is?

Also, why was the engineer who started the change not involved with the group of engineers to start with?

Thanks
Tom

1 Like

On your first question(s), I think what you are describing is the same thing they are calling runbooks.

The blog highlights that those runbooks were not as complete as they could be for the Cassandra actions.

I can’t recommend switching to Spanner highly enough, if possible. Scaling, upgrades and performance are all issues that will appear again and again unless you manage to create something like Spanner yourself, especially when becoming a global bank and trying to achieve global consistency for some data. The cost of creating and running a large-scale database yourself is way higher than getting a cloud product, mostly because of the maintenance, downtime and brand damage costs.

Given that Spanner only appears to be a couple of years old (and that’s just from Googling it for the first time), it is not likely to be out of the woods itself.

What do you believe it offers over Cassandra that would prompt such a shift of technologies on the part of Monzo? What makes you think that their existing Cassandra cluster in AWS is not a cloud product?